3.1. Methodology
Investigation of information loss in structured documents
In this study, we analyse which structured information is lost when documents are converted to plain text and what effects this has on the generation of responses by RAG systems. In particular, we focus on reference systems, tables, typographical markup and different layout and semantic structures.
We used CustomGPT, a customisable GPT model that can be combined with a user's own content and data sources to generate domain-specific answers. It allows Retrieval-Augmented Generation (RAG) to be integrated easily into specific application scenarios.
We have chosen a simple but typical approach to the topic in order to make practical challenges in the transfer of information from structured documents understandable.
In further studies, we will take a more differentiated approach, analysing the architecture of RAG systems in detail and systematically investigating the causes of qualitative differences in response generation.
Referencing systems: Footnotes, endnotes, bibliographies and indexes
Referencing systems, which allow information to be linked together in a structured way, are an essential component of scientific and technical documents.
Footnotes are notes attached to a text passage that supply additional information or sources. In PDFs, they are marked by superscript numbers or symbols that are linked to the corresponding notes at the bottom of the page. In XML-based formats, on the other hand, footnotes are often placed directly at the relevant point in the text so that their context remains closely linked to the main text.
When converting from PDF to plain text, these explicit links are lost, which means that the reference between the footnote and the marked text passage is no longer clear.
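The loss of such links can be illustrated with a minimal sketch. It uses HTML as a stand-in for any format with explicit footnote links, and the sample sentences, the id fn1 and the class name LinkAwareParser are invented for illustration; they are not taken from the sample document. The parser collects the plain text and, separately, the marker-to-note links that exist only in the markup.

```python
from html.parser import HTMLParser

# Hypothetical sample: one footnote marker linked to its note by an id.
SAMPLE_HTML = (
    '<p>The valve must be replaced annually'
    '<sup><a href="#fn1">1</a></sup>.</p>'
    '<p>Check the seal at every inspection.</p>'
    '<p id="fn1">1 Applies only to model B.</p>'
)


class LinkAwareParser(HTMLParser):
    """Collects plain text and, separately, the marker-to-target links."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.links = {}  # marker text -> id of the linked note
        self._current_href = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current_href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag == "a":
            self._current_href = None
        elif tag == "p":
            # Keep paragraph boundaries as line breaks in the plain text.
            self.text_parts.append("\n")

    def handle_data(self, data):
        self.text_parts.append(data)
        if self._current_href:
            self.links[data] = self._current_href.lstrip("#")


parser = LinkAwareParser()
parser.feed(SAMPLE_HTML)
plain_text = "".join(parser.text_parts).strip()

print(plain_text)    # the marker survives only as a bare digit after "annually"
print(parser.links)  # the explicit link exists only in the markup
```

In the plain text, the marker is merely a digit attached to a word, and nothing states which later line it refers to. A retrieval step that splits the text into chunks may therefore separate the claim from its qualification.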
Endnotes work in a similar way to footnotes, but are grouped together at the end of a document or chapter. The links are clearly defined in structured documents, but are usually lost in plain text or only appear as an unstructured list.
Indexes serve as a keywording system and contain metadata about a document to enable targeted searches or cross-references. In structured formats such as XML or PDF, they are linked to the relevant page numbers or sections.
After conversion to plain text, usually only a simple list of terms remains, without the original links. This severely limits the semantic benefit.
Bibliographies contain bibliographical information on the works cited. In XML or structured text formats, they are often stored in a specific citation structure and linked to the citations in the text.
In plain text, the links between citation and source can be lost, making it more difficult to follow scientific arguments.
Tables and their structure
Tables not only contain data, but also create logical relationships through their structure.
In structured formats (e.g. XML, HTML, Markdown), tables are defined by explicit tags that delimit rows and columns.
In PDFs, tables may be present as raster graphics or as embedded structures.
In plain text, the relationships between columns and rows are often lost or replaced by unstructured lists.
RAG systems cannot extract clear table relationships from unstructured text representations, which means that important semantic information is lost.
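A small sketch makes the difference concrete. The table contents and the helper names split_row, parse_markdown_table and flatten are assumptions made for illustration, and flatten only mimics a naive plain-text conversion rather than reproducing any particular converter. With the structure intact, a value can be addressed by row and column; after flattening, only a sequence of tokens remains.

```python
# Hypothetical sample table in Markdown.
MARKDOWN_TABLE = """\
| Part   | Torque (Nm) | Interval (months) |
|--------|-------------|-------------------|
| Valve  | 12          | 6                 |
| Flange | 30          | 12                |
"""


def split_row(line):
    """Split one Markdown table row into its stripped cell values."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def parse_markdown_table(source):
    """Return one dict per data row, keyed by the column headers."""
    lines = source.strip().splitlines()
    headers = split_row(lines[0])
    # lines[1] is the |---|---| separator row and carries no data.
    return [dict(zip(headers, split_row(line))) for line in lines[2:]]


def flatten(source):
    """Mimic a naive plain-text conversion: keep cell values, drop the grid."""
    cells = []
    for line in source.strip().splitlines():
        for cell in split_row(line):
            if cell and set(cell) != {"-"}:
                cells.append(cell)
    return " ".join(cells)


rows = parse_markdown_table(MARKDOWN_TABLE)
flat = flatten(MARKDOWN_TABLE)

print(rows[1]["Torque (Nm)"])  # structured access by row and column header
print(flat)                    # the same values without row or column boundaries
```

In the flattened string, nothing indicates that 30 is a torque value belonging to the flange rather than an interval belonging to the valve; that assignment was carried entirely by the table grid.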
Typographic and inline markup
Text markings such as bold print, italics, underlining or coloured markings contribute significantly to the meaning of a text. They emphasise key terms, show hierarchies or mark important passages.
Plain text largely ignores such markings, which means that important contextual information and emphasis are lost. This is particularly problematic if terms or definitions have been emphasised by formatting.
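The effect can be sketched with Markdown emphasis. The sentence is invented for illustration, and the regular expressions cover only simple asterisk-based emphasis, not the full Markdown syntax. While the markup exists, the emphasised term can be recovered mechanically; after a naive conversion the words remain but the emphasis does not.

```python
import re

# Hypothetical sentence in which bold marks the defined term.
MARKDOWN = "A **torque wrench** is required; a *standard* wrench is not sufficient."

# Terms the author set in bold, recoverable only while the markup exists.
bold_terms = re.findall(r"\*\*(.+?)\*\*", MARKDOWN)

# Naive plain-text conversion: remove the asterisks, keep the words.
plain = re.sub(r"\*{1,2}(.+?)\*{1,2}", r"\1", MARKDOWN)

print(bold_terms)
print(plain)
```

Once the asterisks are gone, the plain string gives no indication of which term the author singled out as the key requirement.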
Layout: Text presentation and visual structure
The layout of a document plays a decisive role in the readability and structuring of information. In structured formats such as XML, HTML or PDF, texts can be emphasised using various design elements. Several such design elements were combined for this purpose.
Semantics: the example of mathematical formulae
Semantic structures that carry specific meanings, such as maths formulae, are a particularly critical area.
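Why formulae are critical can be shown with a minimal sketch. It assumes the formula is encoded in presentation MathML, which is one possible structured representation and not necessarily the one used in the sample document; the formula itself is invented. Reading only the text content of the elements mimics a naive plain-text conversion.

```python
import xml.etree.ElementTree as ET

# Hypothetical formula (a + b) / c in presentation MathML.
MATHML = (
    "<math><mfrac>"
    "<mrow><mi>a</mi><mo>+</mo><mi>b</mi></mrow>"
    "<mi>c</mi>"
    "</mfrac></math>"
)

root = ET.fromstring(MATHML)

# Structured reading: the two children of mfrac are numerator and denominator.
fraction = root.find("mfrac")
numerator = "".join(fraction[0].itertext())
denominator = "".join(fraction[1].itertext())

# Naive plain-text conversion: concatenate all text nodes in document order.
flat = "".join(root.itertext())

print(numerator, "/", denominator)
print(flat)
```

The flattened string no longer contains a fraction bar or any grouping, so it cannot be distinguished from a sum of a and the product of b and c. The meaning of the formula was carried by the element structure, not by the characters.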
Implementation
To carry out the analysis, a dedicated sample document containing the same knowledge base was created in three formats: PDF, HTML and Markdown. Document size was not a factor; the document was about 10 pages long. More important were a clear structure and unambiguous texts, so that the answers could be checked easily. This setup allows a direct comparison of information loss between the formats and of its impact on RAG systems.
To assess the practical impact of these losses, three RAG-based systems were set up on an identical technical basis, each supplied with the document in one of the three formats. The resulting RAG bots were then asked specific questions from the areas analysed, including referencing systems (footnotes, endnotes, bibliographies, indexes), table structures, layout and semantic features. The generated answers were systematically checked for correctness in order to determine the extent to which the structured information was retained and whether differences between the formats had a measurable impact on the quality of the answers.
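How such a correctness check could be tallied is sketched below. Everything in it is an assumption made for illustration: the records are invented placeholders and not results from the study, the bots are labelled neutrally as bot_1 to bot_3 rather than by format, and the substring test in is_correct is a crude automatic stand-in that does not describe how the answers were actually checked.

```python
from collections import defaultdict

# Invented placeholder records, NOT results from the study.
# Each record: (category, bot label, expected key fact, bot answer).
RECORDS = [
    ("footnotes", "bot_1", "model B", "The rule applies to all models."),
    ("footnotes", "bot_2", "model B", "It applies only to model B."),
    ("footnotes", "bot_3", "model B", "Only model B is affected."),
    ("tables", "bot_1", "30 Nm", "The flange torque is 30 Nm."),
    ("tables", "bot_2", "30 Nm", "The flange needs 30 Nm."),
    ("tables", "bot_3", "30 Nm", "The flange torque is 12 Nm."),
]


def is_correct(expected, answer):
    """Crude check: the expected key fact appears in the answer."""
    return expected.lower() in answer.lower()


def score_by_bot(records):
    """Return {bot label: (correct answers, total questions)}."""
    tally = defaultdict(lambda: [0, 0])
    for _category, bot, expected, answer in records:
        tally[bot][0] += int(is_correct(expected, answer))
        tally[bot][1] += 1
    return {bot: tuple(counts) for bot, counts in tally.items()}


scores = score_by_bot(RECORDS)
print(scores)
```

Keeping the category in each record would also allow the tally to be broken down by area, for example footnotes versus tables, instead of by bot.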
Each question asked and the respective answers of the three bots were saved and remain permanently accessible. This enables the results to be analysed in detail and also allows third parties to check the generated answers themselves and carry out their own tests. The saved answers therefore provide a transparent basis for assessing the impact of format differences on RAG systems.