Docling Application Tutorial

After trying different document parser projects, I am going to dive into IBM's open-source project Docling. I will introduce the most important modules and classes in this project.

Important Docling Modules

The docling.document_converter module is the core of the Docling project. It converts documents (.docx, .pdf, .html, and others) into Docling’s structured DoclingDocument representation, which can then be exported to formats such as Markdown or JSON. These are the main classes you can import from docling.document_converter:

  • DocumentConverter – A high-level Python class designed for converting documents into a structured DoclingDocument format.
  • ConversionResult – Object returned by DocumentConverter.convert() method.
  • FormatOption – The base class for the format-specific options passed to DocumentConverter. It bundles a pipeline class (pipeline_cls), optional pipeline options (pipeline_options), and a document backend (backend). In practice you use one of its subclasses, such as PdfFormatOption for PDF documents or WordFormatOption for Word documents.
  • InputFormat – An enumeration of the input file formats that DocumentConverter can process. Each member represents one supported document type: PDF, DOCX, PPTX, HTML, etc.
  • PdfFormatOption – A subclass of FormatOption that configures PDF conversion. Its backend attribute selects the document backend that handles low-level PDF parsing. Its model_config attribute is Pydantic configuration that allows arbitrary types in the model's fields.
  • ImageFormatOption – Configures conversion of image documents (e.g., PNG, JPEG, TIFF) within the DocumentConverter pipeline, particularly Optical Character Recognition (OCR) and other image-specific parsing.
  • WordFormatOption – Configures conversion of Microsoft Word documents (.docx files) within the DocumentConverter pipeline, such as how text, tables, and other content are extracted.
  • PowerpointFormatOption 
  • MarkdownFormatOption 
  • HTMLFormatOption 
  • SimplePipeline – A lightweight pipeline for formats such as DOCX or HTML, where the backend can build the document directly and no page-level layout analysis or OCR stage needs to be configured.
  • StandardPdfPipeline – The default pipeline Docling uses to parse PDF documents and convert them into structured DoclingDocument objects.

The docling.datamodel.pipeline_options module provides the classes that configure the conversion pipelines used by DocumentConverter. The classes from this module used in this tutorial are PdfPipelineOptions, TesseractCliOcrOptions, and PaginatedPipelineOptions.

Simple Conversion

This is a simple example from Docling's documentation.

# Imports the DocumentConverter class from the docling library.
from docling.document_converter import DocumentConverter

# Defines the URL of the PDF document to be converted.
source = "https://arxiv.org/pdf/2408.09869"  

# Create an instance of the DocumentConverter class.
converter = DocumentConverter()

# Call the convert method of the DocumentConverter object.
result = converter.convert(source)

# Call the export_to_markdown method of the document object to get the markdown representation of the document.
markdown = result.document.export_to_markdown()
print(markdown)

DocumentConverter.convert() is the main entry point in Docling. It converts a document (such as .docx, .pdf, or .html) into a structured internal format that can then be exported to Markdown, JSON, and other formats. The method returns a ConversionResult object.

This ConversionResult object contains several useful properties, including:

  • .document (DoclingDocument): The main parsed document as a DoclingDocument object, which you can export to Markdown, JSON, or other formats.
  • .status (ConversionStatus): The conversion status (e.g., success, partial success, failure).
  • .errors (List[ErrorItem]): Any errors encountered during conversion.
  • .input (InputDocument): Metadata about the input document.
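
As a minimal sketch of how these properties might be used together, the helper below builds a one-line summary from a result object. It relies only on the attribute names listed above. The names summarize_result and FakeStatus, and the two stand-in results, are hypothetical and exist only so the sketch runs without Docling installed.

```python
from enum import Enum
from types import SimpleNamespace


def summarize_result(result):
    """Return a one-line summary built from .status, .errors, and .document."""
    # Enum members expose .name; fall back to str() for any other status type.
    status = getattr(result.status, "name", str(result.status))
    error_count = len(result.errors)
    has_document = "yes" if result.document is not None else "no"
    return f"status={status} errors={error_count} document={has_document}"


# Hypothetical stand-ins (not Docling classes) that mimic the attribute names.
class FakeStatus(Enum):
    SUCCESS = "success"
    FAILURE = "failure"


ok = SimpleNamespace(status=FakeStatus.SUCCESS, errors=[], document=object())
failed = SimpleNamespace(
    status=FakeStatus.FAILURE,
    errors=["page 3 could not be parsed"],
    document=None,
)

print(summarize_result(ok))
print(summarize_result(failed))
```

With a real conversion, the same helper could be called as summarize_result(converter.convert(source)), assuming the result exposes the properties described above.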

The DoclingDocument class in the Docling Python library is a core data model representing a unified document structure. It is defined as a Pydantic model and serves as the central representation format for parsed and processed documents, regardless of the original input format (PDF, DOCX, HTML, images, etc.). DoclingDocument provides numerous methods for document manipulation and export:

  • export_to_markdown()
  • export_to_html()
  • export_to_document_tokens() (DocTags)
  • export_to_dict()
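
The sketch below shows one way the four export methods might be combined to write every representation of a document to disk. It assumes the three text exports return strings and that export_to_dict() returns JSON-serializable data. The helper export_all, the ".doctags" file extension, and StubDocument are hypothetical; the stub only stands in for a DoclingDocument so the example runs without Docling installed.

```python
import json
import tempfile
from pathlib import Path


def export_all(document, out_dir, stem):
    """Write one file per export method and return the written paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Text exports map to a file suffix (".doctags" is an arbitrary choice).
    text_exports = {
        ".md": document.export_to_markdown,
        ".html": document.export_to_html,
        ".doctags": document.export_to_document_tokens,
    }
    written = []
    for suffix, export in text_exports.items():
        path = out_dir / f"{stem}{suffix}"
        path.write_text(export(), encoding="utf-8")
        written.append(path)
    # Assumption: export_to_dict() returns plain data that json can serialize.
    json_path = out_dir / f"{stem}.json"
    json_path.write_text(
        json.dumps(document.export_to_dict(), ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    written.append(json_path)
    return written


class StubDocument:
    """Hypothetical stand-in exposing the same four method names."""

    def export_to_markdown(self):
        return "# Title\n"

    def export_to_html(self):
        return "<h1>Title</h1>\n"

    def export_to_document_tokens(self):
        return "<doctag>Title</doctag>\n"

    def export_to_dict(self):
        return {"name": "stub"}


with tempfile.TemporaryDirectory() as tmp:
    names = [p.name for p in export_all(StubDocument(), tmp, "paper")]
print(names)
```

With a real conversion, result.document would take the place of StubDocument().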

PDF to Markdown Conversion

Here is another example, adapted from the custom conversion example in Docling's documentation.

# PdfPipelineOptions is used to set general options for PDF processing, and TesseractCliOcrOptions is used to configure the Tesseract OCR engine.
from docling.datamodel.pipeline_options import PdfPipelineOptions, TesseractCliOcrOptions

# DocumentConverter is the main class that you use to perform the conversion. PdfFormatOption and InputFormat are used to specify the input format and its corresponding options.
from docling.document_converter import DocumentConverter, PdfFormatOption, InputFormat

# The PyPdfiumDocumentBackend backend is responsible for low-level PDF parsing and rendering pages into text and images for subsequent pipeline processing.
from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend

# Creates an instance of PdfPipelineOptions to configure the conversion process. 
pipeline_options = PdfPipelineOptions(
    do_ocr=True,
    do_table_structure=True,
    enable_remote_services=False,
    images_scale=2,
    ocr_options=TesseractCliOcrOptions(lang=["chi_sim"])
)

# Create and use DocumentConverter
converter = DocumentConverter(
    format_options={
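        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options,
            # Pass the imported backend explicitly; otherwise Docling's default PDF backend is used.
            backend=PyPdfiumDocumentBackend
        )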
    }
)

In summary, the code above sets up a DocumentConverter that converts PDF files with the pypdfium2 backend, runs OCR with Tesseract using the Simplified Chinese language pack (chi_sim), and tries to preserve table structure. To recognize English text as well, add "eng" to the lang list. We can then convert a PDF to Markdown:

# Defines the URL of the PDF document to be converted.
source = "https://arxiv.org/pdf/2408.09869"  

# Call the convert method of the DocumentConverter object.
result = converter.convert(source)

# Call the export_to_markdown method of the document object to get the markdown representation of the document.
markdown = result.document.export_to_markdown()
print(markdown)

DocumentConverter is the key object. You create an instance by passing two optional parameters:

  • allowed_formats restricts the input document types the converter accepts.
  • format_options configures the processing pipeline and backend for each input format.

DocumentConverter(
    allowed_formats: Optional[List[InputFormat]] = None,
    format_options: Optional[Dict[InputFormat, FormatOption]] = None
)
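
As a hedged illustration of the two parameters working together, the sketch below builds a converter that accepts only PDF and DOCX input and disables OCR for PDFs. The function name build_converter is hypothetical. The imports sit inside the function only so that defining it does not load Docling and its dependencies.

```python
def build_converter():
    """Build a converter that accepts only PDF and DOCX input."""
    # Imported lazily: Docling pulls in heavy dependencies, so defer that
    # cost until a converter is actually needed.
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import (
        DocumentConverter,
        InputFormat,
        PdfFormatOption,
    )

    return DocumentConverter(
        # Only these two input formats are accepted.
        allowed_formats=[InputFormat.PDF, InputFormat.DOCX],
        # PDF gets explicit options; DOCX has no entry here.
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_options=PdfPipelineOptions(do_ocr=False),
            ),
        },
    )
```

DOCX has no entry in format_options here, so it should fall back to Docling's default options for that format.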

Configure Image Output

from docling.datamodel.pipeline_options import PdfPipelineOptions, TesseractCliOcrOptions
from docling.document_converter import DocumentConverter, PdfFormatOption, InputFormat

# Import the ImageRefMode enum from docling_core.types.doc
from docling_core.types.doc import ImageRefMode

# Configure pipeline options
pipeline_options = PdfPipelineOptions(
    do_ocr=True,
    do_table_structure=True,
    enable_remote_services=False,
    images_scale=2,
    generate_page_images=True,
    generate_picture_images=True,
    ocr_options=TesseractCliOcrOptions(lang=["chi_sim"])
)

# Create and use DocumentConverter
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options
        )
    }
)

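ImageRefMode controls how images are referenced or embedded when exporting a document (e.g., to Markdown or HTML). With ImageRefMode.REFERENCED, used below, images are written as separate files and linked from the output. We can then convert the PDF and save it as Markdown: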

# Defines the URL of the PDF document to be converted.
source = "https://arxiv.org/pdf/2408.09869"  

# Call the convert method of the DocumentConverter object.
result = converter.convert(source)

# Call the save_as_markdown method of the document object to save the markdown representation of the document.
result.document.save_as_markdown("output.md", image_mode=ImageRefMode.REFERENCED)

See the DoclingDocument reference for more details on the save_as_markdown() method. Its signature is:

save_as_markdown(
    filename: Union[str, Path],
    artifacts_dir: Optional[Path] = None,
    delim: str = '\n\n',
    from_element: int = 0,
    to_element: int = maxsize,
    labels: Optional[set[DocItemLabel]] = None,
    strict_text: bool = False,
    escaping_underscores: bool = True,
    image_placeholder: str = '<!-- image -->',
    image_mode: ImageRefMode = PLACEHOLDER,
    indent: int = 4,
    text_width: int = -1,
    page_no: Optional[int] = None,
    included_content_layers: Optional[set[ContentLayer]] = None,
    page_break_placeholder: Optional[str] = None,
    include_annotations: bool = True
)

DOCX to Markdown Conversion

from docling.document_converter import DocumentConverter, WordFormatOption, InputFormat, SimplePipeline
from docling.datamodel.pipeline_options import PaginatedPipelineOptions
from docling_core.types.doc import ImageRefMode

# Create and use DocumentConverter
converter = DocumentConverter(
    format_options={
        InputFormat.DOCX: WordFormatOption(
            pipeline_cls=SimplePipeline,
            pipeline_options=PaginatedPipelineOptions(generate_picture_images=True)
        )
    }
)
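
The converter above still has to be applied to a file. The sketch below wraps the convert-then-save steps from the earlier sections in a small helper. The name convert_to_markdown, the stub classes, and the file name report.docx are hypothetical; the stubs stand in for Docling objects so the example runs without Docling installed.

```python
import tempfile
from pathlib import Path
from types import SimpleNamespace


def convert_to_markdown(converter, source, output_path, image_mode=None):
    """Convert `source` and save it as Markdown, returning the output path."""
    output_path = Path(output_path)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    result = converter.convert(source)
    # Forward image_mode only when given, so save_as_markdown keeps its default.
    kwargs = {} if image_mode is None else {"image_mode": image_mode}
    result.document.save_as_markdown(output_path, **kwargs)
    return output_path


class StubDocument:
    """Hypothetical stand-in for a DoclingDocument."""

    def save_as_markdown(self, filename, **kwargs):
        Path(filename).write_text("# stub\n", encoding="utf-8")


class StubConverter:
    """Hypothetical stand-in for a DocumentConverter."""

    def convert(self, source):
        return SimpleNamespace(document=StubDocument())


with tempfile.TemporaryDirectory() as tmp:
    saved = convert_to_markdown(
        StubConverter(), "report.docx", Path(tmp) / "out" / "report.md"
    )
    saved_text = saved.read_text(encoding="utf-8")
print(saved_text)
```

With the real converter, the call would look like convert_to_markdown(converter, "report.docx", "output/report.md", ImageRefMode.REFERENCED), where report.docx is a placeholder for your own file.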