3. Introduction

As retrieval-augmented generation (RAG) systems mature, one question is coming into focus: what structural information is lost when documents are converted to plain text, and how does this loss affect the utilisation and quality of the generated answers? The AI developers who operate RAG systems are usually at home with Python and data-driven processes, whereas the content often comes from subject matter experts who work with structured document formats such as XML, Word or PDF. These differing perspectives create challenges in data preparation and utilisation.

Language models are trained on large volumes of text that various libraries have extracted from the source documents as plain text. Valuable metadata and structural information are often lost in the process. This loss can affect the accuracy and relevance of the answers, because RAG systems typically use the proximity of text passages as an indicator of their contextual coherence, and once structural markers are removed, proximity in the flattened text need not reflect how the original document was organised. In this study, we investigate which structural information is no longer available after documents are converted to plain text and what impact this has on the generation of responses.
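
To make the kind of loss described above concrete, the following minimal sketch contrasts a naive plain-text conversion with a structure-preserving one. It is an illustration under stated assumptions, not the conversion pipeline used in this study: the sample document, its element and attribute names (`manual`, `chapter`, `section`, `title`) and the helper functions `to_plain_text` and `to_structured_passages` are all hypothetical. Only Python's standard `xml.etree.ElementTree` module is used.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; element and attribute names are
# illustrative assumptions, not taken from any real corpus.
XML_SOURCE = """<manual>
  <chapter id="c1" title="Installation">
    <section title="Requirements"><p>Python 3.10 or newer.</p></section>
    <section title="Steps"><p>Run the installer.</p></section>
  </chapter>
  <chapter id="c2" title="Maintenance">
    <section title="Requirements"><p>Admin rights.</p></section>
  </chapter>
</manual>"""


def to_plain_text(xml_source: str) -> str:
    """Naive conversion: keep only text nodes, drop tags and attributes."""
    root = ET.fromstring(xml_source)
    parts = (text.strip() for text in root.itertext())
    return "\n".join(part for part in parts if part)


def to_structured_passages(xml_source: str) -> list[dict[str, str]]:
    """Keep each paragraph together with the titles of its ancestors."""
    root = ET.fromstring(xml_source)
    passages = []
    for chapter in root.iter("chapter"):
        for section in chapter.iter("section"):
            for paragraph in section.iter("p"):
                passages.append(
                    {
                        "chapter": chapter.get("title", ""),
                        "section": section.get("title", ""),
                        "text": (paragraph.text or "").strip(),
                    }
                )
    return passages


if __name__ == "__main__":
    print(to_plain_text(XML_SOURCE))
    for passage in to_structured_passages(XML_SOURCE):
        print(passage)
```

In the plain-text output, the chapter and section titles (stored as attributes in this sample) disappear entirely, so the two "Requirements" paragraphs can no longer be told apart and "Run the installer." sits directly next to "Admin rights." even though the two belong to different chapters. The structured variant keeps that context alongside each passage.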