Docling process

Info

This service is currently in beta and may change frequently. The same applies to this documentation.

The RAG Service provides users with an efficient way to upload and process PDF documents using Docling. The system converts uploaded PDFs into Markdown format while also automatically annotating them. This enhanced Markdown output, referred to as Markdown Plus, includes metadata and structural annotations for improved document parsing and customization.

The Arcana manager is shown with a demo file uploaded, and the conversion status is set to Docling in progress.

This service uses a fork of the Docling API with many modifications, which will be published in the future.

Process Flow

  1. File Upload: Users upload their PDF documents through the RAG interface.
  2. Conversion Process: Once the PDF file is uploaded, it is automatically converted into the Markdown Plus format.
  3. Annotation & Metadata: The Markdown Plus output is automatically annotated with structural markers.
  4. Review & Adjustment: By clicking the Details button, users can download the annotated Markdown Plus file, adjust split markers, and re-upload it for further processing. The Details window shows all the file details and offers a download of either the JSON format or the Markdown Plus file. It also provides an option to upload an updated Markdown Plus file.
  5. Validation: Upon re-upload, a validation process ensures that the paging structure remains intact.

Markdown Plus Annotations

Docling generates a structured and annotated Markdown file using the following markers:

1. Page Marker

  • Format: [Page (number)]: #
  • Purpose: Indicates the beginning of a new page in the original PDF document.
  • Example:
    [Page 1]: #
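
As an illustration of how the page marker can be consumed, the following minimal sketch collects the page numbers from a Markdown Plus string. The function name `page_numbers` and the sample text are assumptions made for this example; they are not part of the service.

```python
import re

# Matches a line such as "[Page 12]: #" (the format described above).
# Leading and trailing spaces or tabs are tolerated as a precaution.
PAGE_MARKER = re.compile(r"^[ \t]*\[Page (\d+)\]: #[ \t]*$", re.MULTILINE)


def page_numbers(markdown_plus: str) -> list[int]:
    """Return the page numbers in the order their markers appear."""
    return [int(number) for number in PAGE_MARKER.findall(markdown_plus)]


# Hypothetical sample input, not taken from a real conversion.
sample = "[Page 1]: #\n\nIntro text\n\n[Page 2]: #\n\nMore text\n"
print(page_numbers(sample))
```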

2. Vertical Position Marker

  • Format: [Y: (number)]: #
  • Purpose: Represents the approximate vertical position of an item on the page. The height of every page is scaled to 1000 lines.
  • Details:
    • Each page is divided into five sections.
    • If no continuous item (such as a table or image) exists in a given section, a Y marker is assigned to that section.
  • Example:
    [Y: 300]: #
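
The scaling can be sketched as follows. This is only an illustration under two assumptions that the description above does not confirm: positions are measured from the top of the page, and the five sections have equal height. The helper names `scaled_y` and `section_index` are hypothetical.

```python
PAGE_SCALE = 1000  # page height is scaled to 1000 lines (from the text above)
SECTIONS = 5       # each page is divided into five sections


def scaled_y(offset_from_top: float, page_height: float) -> int:
    """Scale a vertical offset to the 0..1000 range.

    Assumption: the offset is measured from the top edge of the page.
    """
    return round(offset_from_top / page_height * PAGE_SCALE)


def section_index(y: int) -> int:
    """Return the section (0..4) a scaled Y value falls into.

    Assumption: the five sections have equal height (200 lines each).
    """
    return min(y * SECTIONS // PAGE_SCALE, SECTIONS - 1)


# An item 252.6 units below the top of an 842-unit-tall page.
y = scaled_y(252.6, 842.0)
print(f"[Y: {y}]: #", "section", section_index(y))
```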

3. Split Marker

  • Format: [SPLIT]: #
  • Purpose: Defines segmentation points in the document for later processing.
  • Usage:
    • The split markers guide document chunking for downstream applications.
    • Users can manually adjust split markers in the Markdown Plus file before re-uploading.
  • Example:
    [SPLIT]: #
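
To show what chunking on split markers might look like downstream, here is a minimal sketch that cuts a Markdown Plus string at every split marker. How the service itself builds chunks is not documented here, so the function `split_chunks` and its behavior (dropping empty chunks, trimming whitespace) are assumptions.

```python
import re

# Matches a line that contains only the split marker "[SPLIT]: #".
SPLIT_MARKER = re.compile(r"^[ \t]*\[SPLIT\]: #[ \t]*$", re.MULTILINE)


def split_chunks(markdown_plus: str) -> list[str]:
    """Cut the text at every split marker and drop empty chunks."""
    parts = SPLIT_MARKER.split(markdown_plus)
    return [part.strip() for part in parts if part.strip()]


# Hypothetical sample input with two split markers.
sample = "First part\n\n[SPLIT]: #\n\nSecond part\n\n[SPLIT]: #\n\nThird part\n"
for chunk in split_chunks(sample):
    print(repr(chunk))
```

Page and Y markers are left inside the chunks in this sketch; a real consumer may want to strip or interpret them separately.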

Metadata Header

Each annotated Markdown file contains a header section with metadata about the document, including:

  1. Author
  2. Title
  3. Description
  4. Filename
  5. Extension
  6. Number of Pages
  7. Version

For example:

---
Author: jkunkel1
Title: Title of the file
Description: ''
Filename: file name
Extension: pdf
Number of Pages: 20
Version: 1.0
---
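
A header of this shape can be read with a few lines of standard-library Python. The sketch below assumes one `Key: value` pair per line between the two `---` delimiters, as in the example above; the function name `parse_header` is hypothetical, and a full YAML parser may be more appropriate if the header ever contains nested values.

```python
def parse_header(markdown_plus: str) -> dict[str, str]:
    """Read the 'Key: value' lines between the leading '---' delimiters."""
    lines = markdown_plus.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}  # no header at the top of the file
    meta: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return meta  # closing delimiter reached
        key, sep, value = line.partition(":")
        if sep:
            # Strip surrounding whitespace and quotes, so '' becomes "".
            meta[key.strip()] = value.strip().strip("'\"")
    return {}  # header was never closed, so treat it as missing


sample = """---
Author: jkunkel1
Title: Title of the file
Description: ''
Filename: file name
Extension: pdf
Number of Pages: 20
Version: 1.0
---

[Page 1]: #
"""
meta = parse_header(sample)
print(meta["Title"], meta["Number of Pages"])
```

All values are returned as strings; converting `Number of Pages` to an integer is left to the caller.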

User Interaction with Annotated Markdown

  1. Download Markdown Plus: Users can export the annotated Markdown file for review.
  2. Modify Split Markers: If desired, users can manually edit [SPLIT]: # markers to customize segmentation.
  3. Re-Upload Modified File: On re-upload, the system verifies that the paging structure remains intact before processing the document further.
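
Before re-uploading, an edited file can be checked locally. The exact validation rules of the service are not documented here, so the following sketch assumes that "intact" means the same page markers in the same order as in the downloaded file. The function name `paging_intact` is hypothetical.

```python
import re

# Matches a page marker line such as "[Page 3]: #".
PAGE_MARKER = re.compile(r"^[ \t]*\[Page (\d+)\]: #[ \t]*$", re.MULTILINE)


def paging_intact(original: str, edited: str) -> bool:
    """Check that both files contain the same page markers in the same order.

    Assumption: this is only an approximation of the service-side validation.
    """
    return PAGE_MARKER.findall(original) == PAGE_MARKER.findall(edited)


# Hypothetical files: the edit moves a split marker but keeps the paging.
original = "[Page 1]: #\nA\n[SPLIT]: #\nB\n[Page 2]: #\nC\n"
edited = "[Page 1]: #\nA\nB\n[SPLIT]: #\n[Page 2]: #\nC\n"
broken = "[Page 1]: #\nA\nB\nC\n"

print(paging_intact(original, edited))
print(paging_intact(original, broken))
```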