5.8. Index
Index entries pose a particular challenge for PDF-to-text conversion. They usually contain only a page number, and that number refers to the printed pagination, not necessarily to the page's position within the PDF file. In addition, the pagination can use different formats (e.g. Roman or Arabic numerals), which makes the assignment more difficult.
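The mismatch between printed page labels and PDF page positions can be illustrated with a short sketch. It assumes a simple layout that the example document may or may not follow: a block of front-matter pages numbered i, ii, iii, ... followed by body pages numbered 1, 2, 3, .... The function names and the `front_matter_pages` parameter are hypothetical and chosen only for illustration.

```python
import re

ROMAN_VALUES = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}


def roman_to_int(label: str) -> int:
    """Convert a Roman numeral such as 'xiv' to an integer."""
    total = 0
    prev = 0
    # Read right to left: a smaller value before a larger one is subtracted.
    for ch in reversed(label.lower()):
        value = ROMAN_VALUES[ch]
        if value < prev:
            total -= value
        else:
            total += value
            prev = value
    return total


def label_to_pdf_index(label: str, front_matter_pages: int) -> int:
    """Map a printed page label to a zero-based PDF page index.

    Assumption: the PDF starts with `front_matter_pages` pages labelled
    i, ii, iii, ... and continues with body pages labelled 1, 2, 3, ...
    """
    label = label.strip()
    if label.isdecimal():
        # Arabic labels start after the front matter.
        return front_matter_pages + int(label) - 1
    if re.fullmatch(r"[ivxlcdm]+", label.lower()):
        # Roman labels count from the first page of the file.
        return roman_to_int(label) - 1
    raise ValueError(f"Unrecognised page label: {label!r}")


# With 16 front-matter pages, printed page "42" is the 58th page of the file.
print(label_to_pdf_index("42", front_matter_pages=16))   # 57
print(label_to_pdf_index("xiv", front_matter_pages=16))  # 13
```

Real documents can deviate from this layout (unnumbered pages, restarted numbering, prefixes), so a mapping like this would need to be checked against the page labels actually stored in the file.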
Another problem is that the wording at the indexed location in the body text does not always match the index entry itself. In the example document, the keyword "Eckeltier" (disgusting animal) does not appear in the body text at all; it was assigned as an index entry to the passage on "Straßenhunde" (street dogs). As a result, valuable semantic information is lost in the PDF database.
In contrast, index entries and marginalia in HTML or revised Markdown are anchored directly in the body text, making them easier to identify and process. These structures are particularly important for deeper semantic indexing of texts, because they create connections and cross-references between passages.
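The idea of anchoring index terms in the body text can be sketched as follows. This is a minimal illustration, not the conversion used for the example document: it assumes the index is available as a mapping from keyword to the phrase it points to, and the `[Index: ...]` marker format, the function name, and the sample paragraphs are all invented for this sketch.

```python
def anchor_index_terms(paragraphs: list[str], index: dict[str, str]) -> list[str]:
    """Append index keywords to every paragraph that contains their target phrase.

    `index` maps an index keyword to the phrase in the body text it refers to.
    The "[Index: ...]" marker is a hypothetical convention for this sketch.
    """
    anchored = []
    for paragraph in paragraphs:
        lowered = paragraph.lower()
        # Collect all keywords whose target phrase occurs in this paragraph.
        terms = [term for term, target in index.items() if target.lower() in lowered]
        if terms:
            paragraph = f"{paragraph} [Index: {', '.join(terms)}]"
        anchored.append(paragraph)
    return anchored


# Invented sample paragraphs; only the keyword pair comes from the example document.
paragraphs = [
    "Straßenhunde suffer from constant deficiencies due to inadequate nutrition.",
    "House cats sleep for most of the day.",
]
index = {"Eckeltier": "Straßenhunde"}

for line in anchor_index_terms(paragraphs, index):
    print(line)
```

Once the keyword sits in the same chunk as the passage it refers to, a retrieval step that searches for the keyword has a chance of finding that passage, which is the effect described for the HTML and Markdown variants.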

The question here is:
Which dogs are labelled as disgusting animals?
To answer it, the assignment made in the index must be recognised.
Result
RAG | Assessment | Text | Link | Notes |
PDF | wrong | No idea. I have no information on this. | - | |
MD | right | In some contexts, street dogs are referred to as "disgusting animals", especially when they are exposed to constant deficiencies due to inadequate nutrition | - | |
HTML | right | In some contexts, street dogs are referred to as "disgusting animals", especially when they are exposed to constant deficiencies due to inadequate nutrition | - | |