2.3. PDF

PDF is one of the central input formats in Octopus. The platform provides functions for extracting, analysing and further processing the information contained in PDF documents. The following overview covers the supported content, the processing steps, special features, areas of application and advantages:

  1. Support of PDF as input format
    Octopus accepts PDF documents as input regardless of whether they contain text, images, tables or other content. It analyses the structure and content of each PDF and converts the result into standardised formats.
  2. Processing steps
    • Analysing the document structure:
      Octopus recognises the structure of a PDF, including lists, tables, images, links and other elements. This structure is usually made available in XML.
    • Extraction of content:
      Content such as text, images and tables is extracted from the PDF. Metadata and semantic information can also be captured in the process.
    • Transformation into other formats:
      The platform converts PDF documents into various formats such as HTML, JATS, DocBook and other XML formats. These conversions use transformation paths that have been developed specifically for this purpose.
    • OCR services:
      For PDFs that contain scanned images, Octopus can use OCR (Optical Character Recognition) to extract the text from the images and process it further.
    • Semantic enrichment:
      The extracted content can be enriched with additional information, e.g. by identifying key terms or linking to external data sources.
  3. Special features of PDF processing
    • Writing back to the original document:
      Octopus makes it possible to write the analysed and edited information back into the original PDF. This is particularly useful for updating or adding to documents.
    • Flexible handling of formats:
      The platform processes PDFs regardless of their original formatting or semantics, which offers a high degree of flexibility when handling documents.
    • Pattern recognition:
      Octopus recognises patterns in the document structure, such as tables or lists. It does not recognise the exact page position or the original formatting.
  4. Areas of application
    • Digitisation and archiving:
      PDFs can be analysed and converted into standardised formats in order to store them in digital archives.
    • Creation of structured documents:
      The platform enables the creation of structured documents for technical documentation or publications.
    • Integration into workflows:
      The extracted and transformed content can be integrated into various workflows, e.g. for the creation of websites, databases or publications.
  5. Advantages of PDF processing in Octopus
    • Support for around 200 input formats, including PDF.
    • Possibility of displaying content in different layouts without programming effort (e.g. with OFX).
    • Use of AI technologies for analysing and transforming documents.
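
To illustrate the processing steps in item 2 (structure analysis made available as XML, followed by transformation into another format such as HTML), here is a minimal sketch in Python. It does not use Octopus itself and does not reflect its API. The element names (document, para, list, item, table, row, cell), the sample data and the functions summarise and convert are assumptions made purely for illustration; the XML that Octopus actually produces may look quite different.

```python
import xml.etree.ElementTree as ET

# Hypothetical structural XML. The element names are assumptions for this
# sketch, not the schema that Octopus actually emits.
STRUCTURE_XML = """\
<document>
  <para>Quarterly report</para>
  <list>
    <item>Revenue</item>
    <item>Costs</item>
  </list>
  <table>
    <row><cell>Q1</cell><cell>10</cell></row>
    <row><cell>Q2</cell><cell>12</cell></row>
  </table>
</document>
"""

# Assumed mapping from structure elements to HTML elements.
TAG_MAP = {
    "document": "div",
    "para": "p",
    "list": "ul",
    "item": "li",
    "table": "table",
    "row": "tr",
    "cell": "td",
}


def summarise(xml_text):
    """Count how often each structure element occurs (root included)."""
    counts = {}
    for element in ET.fromstring(xml_text).iter():
        counts[element.tag] = counts.get(element.tag, 0) + 1
    return counts


def _to_html(node):
    """Recursively copy one structure element into an HTML element."""
    html = ET.Element(TAG_MAP.get(node.tag, "span"))
    # Keep the text content, drop whitespace used only for indentation.
    html.text = (node.text or "").strip() or None
    for child in node:
        html.append(_to_html(child))
    return html


def convert(xml_text):
    """Transform the structural XML into an HTML fragment string."""
    root = ET.fromstring(xml_text)
    return ET.tostring(_to_html(root), encoding="unicode", method="html")


if __name__ == "__main__":
    print(summarise(STRUCTURE_XML))
    print(convert(STRUCTURE_XML))
```

The sketch keeps analysis (summarise) and transformation (convert) as separate steps working on the same intermediate XML, which mirrors the idea that one recognised structure can feed several target formats. A real transformation path would also have to handle attributes, mixed content and images, which are omitted here.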

Conclusion

PDF processing in Octopus extracts and analyses content and transforms it into various formats. With functions such as OCR, semantic enrichment and writing back to the original document, Octopus offers a flexible and versatile solution for working with PDFs.