7. Conclusions

The study showed that structured files deliver markedly better results than unstructured ones when used as source material for a RAG model. Links, bibliographies and footnotes are crucial, especially in scientific and legal contexts, because they help the reader understand the information and place it in its correct context. Omitting these document parts during processing would therefore be counterproductive, as important semantic relationships and source references would be lost.

Tables are equally important because they often hold structured, essential data and frequently play a central role in understanding and extracting information. Here in particular, the table structure must be interpreted correctly in order to recognise the right relationships between cells and extract the values accurately.

Overall, the study shows that a structured data basis that takes into account all relevant parts, such as footnotes, links and tables, is essential for achieving high-quality results from a RAG model.

The test suite created as part of this study makes it possible to compare different RAG system providers and embeddings. Using the provided training files and set of prompts, each system can be tested, and its responses and scores can be compared with those of the others to evaluate the performance of the different approaches.
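The comparison described above can be sketched in a few lines. This is a minimal illustration, not the study's actual test suite: the function name `rank_systems`, the system names, the prompt identifiers and the score values are all hypothetical, and it assumes that each response has already been reduced to a single numeric score per prompt.

```python
from statistics import mean


def rank_systems(scores):
    """Rank systems by their mean score over a shared prompt set.

    scores: {system_name: {prompt_id: score}} (hypothetical layout).
    Returns a list of (system_name, mean_score), best system first.
    """
    # A comparison is only fair if every system answered the same prompts.
    prompt_sets = {frozenset(per_prompt) for per_prompt in scores.values()}
    if len(prompt_sets) > 1:
        raise ValueError("all systems must be scored on the same prompts")

    averages = {
        name: mean(per_prompt.values()) for name, per_prompt in scores.items()
    }
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


# Hypothetical scores for two systems on two prompts (illustrative only).
example_scores = {
    "system_a": {"p1": 0.5, "p2": 1.0},
    "system_b": {"p1": 0.25, "p2": 0.5},
}
ranking = rank_systems(example_scores)
```

Averaging is only one possible aggregation; per-prompt comparisons may be more informative when systems differ mainly on tables or footnotes.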

Nevertheless, the study is only a first step that lays important foundations for further development. In the next study, embedding models will be included in order to analyse how semantics can be stored and used in these models. A key objective will be to develop a structured intermediate format from which different models can be supplied efficiently. This should ensure better consistency and quality of the data.
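To make the idea of a structured intermediate format more concrete, the following sketch shows one way such a format could look. It is purely an assumption for illustration: the study does not define the format, and the class names (`Footnote`, `Table`, `Section`) and helper functions (`to_json`, `to_chunk_text`) are hypothetical. The point is that footnotes, links and tables are kept as explicit fields and can then be rendered differently for each target model.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Footnote:
    marker: str  # the reference mark used in the text, e.g. "1"
    text: str


@dataclass
class Table:
    header: List[str]
    rows: List[List[str]]


@dataclass
class Section:
    heading: str
    paragraphs: List[str] = field(default_factory=list)
    footnotes: List[Footnote] = field(default_factory=list)
    links: List[str] = field(default_factory=list)
    tables: List[Table] = field(default_factory=list)


def to_json(section):
    """Serialise the section so that any downstream system can read it."""
    return json.dumps(asdict(section), ensure_ascii=False)


def to_chunk_text(section):
    """Render one text chunk that keeps table cells and footnotes in context."""
    lines = [section.heading, *section.paragraphs]
    for table in section.tables:
        for row in table.rows:
            # Pair each cell with its column header so the meaning survives.
            lines.append(", ".join(f"{h}: {c}" for h, c in zip(table.header, row)))
    lines.extend(f"[{fn.marker}] {fn.text}" for fn in section.footnotes)
    lines.extend(section.links)
    return "\n".join(lines)
```

With this layout, the same `Section` object could be serialised once and rendered as plain text for one model or as JSON for another, without re-parsing the original document.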

Another focus is on using our Octopus service to extract as much data as possible from PDFs and other formats fully automatically, and to provide significantly more semantic information than current systems usually do. The aim is to enable more precise and context-rich processing that goes beyond pure text extraction.

Finally, a later study will look at data storage and the implementation of content delivery services to optimise how semantic data can be efficiently stored, retrieved and integrated into various applications.