Docling Application Tutorial
After trying different document parser projects, I am going to dive into IBM's open-source project Docling. I will introduce the most important modules and classes in this project.
Important Docling Modules
The docling.document_converter module is the core of the Docling project. It converts documents (such as .docx, .pdf, and .html) into Docling's structured representation, which can then be exported to formats like .md or .json. Here is the list of classes available in the docling.document_converter module:
DocumentConverter – A high-level Python class for converting documents into the structured DoclingDocument format.
ConversionResult – The object returned by the DocumentConverter.convert() method.
FormatOption – The base class for all format-specific options passed to DocumentConverter. In practice you use one of its format-specific subclasses, such as PdfFormatOption for PDF documents or WordFormatOption for Word documents.
InputFormat – An enum of the input file formats that DocumentConverter can process. Each member represents one supported document type: PDF, DOCX, PPTX, HTML, etc.
PdfFormatOption – A subclass of FormatOption that configures PDF conversion within Docling. Its backend attribute selects the document backend responsible for PDF parsing. Its model_config attribute is Pydantic configuration that allows arbitrary types in the model's fields.
ImageFormatOption – Configures settings specific to image documents (e.g., PNG, JPEG, TIFF) within the DocumentConverter pipeline, particularly for Optical Character Recognition (OCR) and other image-specific parsing.
WordFormatOption – Configures settings specific to Microsoft Word documents (.docx files) within the DocumentConverter pipeline, such as how text, tables, and other content are extracted during conversion.
PowerpointFormatOption
MarkdownFormatOption
HTMLFormatOption
SimplePipeline – A lightweight pipeline that gives a straightforward conversion workflow without manually configuring every aspect of DocumentConverter and its associated options. The DOCX example later in this tutorial uses it.
StandardPdfPipeline – The default internal pipeline that Docling uses to parse PDF documents and convert them into structured DoclingDocument objects.
The docling.datamodel.pipeline_options module provides classes for configuring the document conversion pipelines used by DocumentConverter. Here is the list of classes in the docling.datamodel.pipeline_options module:
BaseOptions – An abstract base class for the configuration options used in document conversion. It provides a common foundation for options classes such as PdfPipelineOptions and AsrPipelineOptions, ensuring a consistent interface for configuring processing pipelines within DocumentConverter.
AsrPipelineOptions
EasyOcrOptions – Options for the EasyOCR engine.
LayoutOptions – Options for layout processing.
OcrEngine – Enum of valid OCR engines.
OcrMacOptions – Options for the Mac OCR engine.
OcrOptions – OCR options.
PaginatedPipelineOptions
PdfBackend – Enum of valid PDF backends.
PdfPipelineOptions – Options for the PDF pipeline.
PictureDescriptionApiOptions
PictureDescriptionBaseOptions
PictureDescriptionVlmOptions
PipelineOptions – Base pipeline options.
ProcessingPipeline
RapidOcrOptions – Options for the RapidOCR engine.
TableFormerMode – Modes for the TableFormer model.
TableStructureOptions – Options for the table structure.
TesseractCliOcrOptions – Options for the TesseractCli engine.
TesseractOcrOptions – Options for the Tesseract engine.
ThreadedPdfPipelineOptions – Pipeline options for the threaded PDF pipeline with batching and backpressure control.
VlmPipelineOptions
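As a hedged illustration of how these option classes fit together, the sketch below builds a PdfPipelineOptions object that uses the EasyOCR engine and selects a TableFormer mode. The function name build_pdf_options and its parameters are hypothetical, and the field names follow Docling's documented options, which may differ between versions.

```python
def build_pdf_options(ocr_langs=("en",), accurate_tables=True):
    """Return PdfPipelineOptions with EasyOCR and a chosen TableFormer mode."""
    # Imported inside the function because loading docling is comparatively slow.
    from docling.datamodel.pipeline_options import (
        EasyOcrOptions,
        PdfPipelineOptions,
        TableFormerMode,
    )

    options = PdfPipelineOptions(
        do_ocr=True,
        do_table_structure=True,
        ocr_options=EasyOcrOptions(lang=list(ocr_langs)),
    )
    # ACCURATE favors table quality; FAST favors speed.
    options.table_structure_options.mode = (
        TableFormerMode.ACCURATE if accurate_tables else TableFormerMode.FAST
    )
    return options
```

The returned object can be passed as pipeline_options to PdfFormatOption, as the PDF examples later in this tutorial do.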
Simple Conversion
This is a simple example from Docling's documentation.
# Imports the DocumentConverter class from the docling library.
from docling.document_converter import DocumentConverter
# Defines the URL of the PDF document to be converted.
source = "https://arxiv.org/pdf/2408.09869"
# Create an instance of the DocumentConverter class.
converter = DocumentConverter()
# Call the convert method of the DocumentConverter object.
result = converter.convert(source)
# Call the export_to_markdown method of the document object to get the markdown representation of the document.
markdown = result.document.export_to_markdown()
print(markdown)
The DocumentConverter.convert() method is the main entry point in Docling. It converts documents (like .docx, .pdf, or .html) into a structured internal format that can then be exported (e.g., to Markdown or JSON). It returns a ConversionResult object.
This ConversionResult object contains several useful properties, including:
.document (DoclingDocument): The main parsed document as a DoclingDocument object, which you can export to Markdown, JSON, or other formats.
.status (ConversionStatus): The conversion status (e.g., success, partial success, failure).
.errors (List[ErrorItem]): Any errors encountered during conversion.
.input (InputDocument): Metadata about the input document.
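A minimal sketch of how the status and errors properties might be checked before using the document. The helper name summarize_result is hypothetical, and comparing against the string "success" assumes that ConversionStatus is a string-valued enum whose success member has that value.

```python
def summarize_result(result):
    """Summarize a ConversionResult-like object as a plain dict."""
    # ConversionStatus is an enum; fall back to the raw value for plain strings.
    status = getattr(result.status, "value", result.status)
    return {
        "status": status,
        # Assumption: the success status carries the string value "success".
        "ok": status == "success" and not result.errors,
        "error_count": len(result.errors),
    }
```

Calling summarize_result(converter.convert(source)) before exporting makes partial or failed conversions visible instead of silently producing an incomplete Markdown file.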
The DoclingDocument class in the Docling Python library is a core data model representing a unified document structure. It is defined as a Pydantic model and serves as the central representation format for parsed and processed documents, regardless of the original input format (PDF, DOCX, HTML, images, etc.). DoclingDocument provides numerous methods for document manipulation and export:
export_to_markdown()
export_to_html()
export_to_document_tokens() (DocTags)
export_to_dict()
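The export methods above can be combined into a small helper, sketched here under the assumption that export_to_dict() returns JSON-serializable data. The function name export_document and the out_dir and stem parameters are hypothetical names chosen for this example.

```python
import json
from pathlib import Path


def export_document(document, out_dir, stem="output"):
    """Write Markdown, HTML, and JSON exports of a DoclingDocument-like object."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    paths = {
        "md": out_dir / f"{stem}.md",
        "html": out_dir / f"{stem}.html",
        "json": out_dir / f"{stem}.json",
    }
    paths["md"].write_text(document.export_to_markdown(), encoding="utf-8")
    paths["html"].write_text(document.export_to_html(), encoding="utf-8")
    # Assumption: the dict export contains only JSON-serializable values.
    paths["json"].write_text(
        json.dumps(document.export_to_dict(), ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    return paths
```

With the converter from the simple example, export_document(result.document, "exports") would write all three files into an exports folder.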
PDF to Markdown Conversion
Here is another example, adapted from the custom conversion section of Docling's documentation.
# PdfPipelineOptions is used to set general options for PDF processing, and TesseractCliOcrOptions is used to configure the Tesseract OCR engine.
from docling.datamodel.pipeline_options import PdfPipelineOptions, TesseractCliOcrOptions
# DocumentConverter is the main class that you use to perform the conversion. PdfFormatOption and InputFormat are used to specify the input format and its corresponding options.
from docling.document_converter import DocumentConverter, PdfFormatOption, InputFormat
# The PyPdfiumDocumentBackend backend is responsible for low-level PDF parsing and rendering pages into text and images for subsequent pipeline processing.
from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
# Create an instance of PdfPipelineOptions to configure the conversion process.
pipeline_options = PdfPipelineOptions(
    do_ocr=True,
    do_table_structure=True,
    enable_remote_services=False,
    images_scale=2,
    ocr_options=TesseractCliOcrOptions(lang=["chi_sim"])
)
# Create the DocumentConverter, passing the PDF pipeline options and the pypdfium2 backend.
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options,
            backend=PyPdfiumDocumentBackend
        )
    }
)
In summary, the code above sets up a DocumentConverter that converts PDF files using the pypdfium2 backend. It performs OCR using Tesseract with the Simplified Chinese language pack (chi_sim), and it tries to preserve the structure of tables. To recognize English as well, add "eng" to the lang list. We can then convert a PDF to Markdown with:
# Defines the URL of the PDF document to be converted.
source = "https://arxiv.org/pdf/2408.09869"
# Call the convert method of the DocumentConverter object.
result = converter.convert(source)
# Call the export_to_markdown method of the document object to get the markdown representation of the document.
markdown = result.document.export_to_markdown()
print(markdown)
DocumentConverter is the key object. You can create an instance of DocumentConverter by passing two parameters:
allowed_formats restricts the input document types accepted by the converter.
format_options allows detailed configuration of document processing pipelines and backend choices per input format.
DocumentConverter(
    allowed_formats: Optional[List[InputFormat]] = None,
    format_options: Optional[Dict[InputFormat, FormatOption]] = None
)
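To illustrate allowed_formats, here is a hedged sketch that restricts a converter to the formats actually present in a list of files. The SUFFIX_TO_FORMAT mapping and both function names are assumptions made for this example; the mapping covers only a few of the formats Docling supports.

```python
from pathlib import Path

# Suffix -> InputFormat member name (hypothetical mapping for this sketch).
SUFFIX_TO_FORMAT = {
    ".pdf": "PDF",
    ".docx": "DOCX",
    ".pptx": "PPTX",
    ".html": "HTML",
}


def formats_for(paths):
    """Return the sorted InputFormat member names needed for `paths`."""
    suffixes = {Path(p).suffix.lower() for p in paths}
    return sorted(SUFFIX_TO_FORMAT[s] for s in suffixes if s in SUFFIX_TO_FORMAT)


def build_converter(paths):
    """Create a DocumentConverter restricted to the formats found in `paths`."""
    # Imported inside the function because loading docling is comparatively slow.
    from docling.document_converter import DocumentConverter, InputFormat

    allowed = [InputFormat[name] for name in formats_for(paths)]
    return DocumentConverter(allowed_formats=allowed)
```

For example, formats_for(["a.PDF", "b.docx", "notes.txt"]) yields ["DOCX", "PDF"], so the resulting converter would be limited to PDF and DOCX inputs.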
Configure Image Output
from docling.datamodel.pipeline_options import PdfPipelineOptions, TesseractCliOcrOptions
from docling.document_converter import DocumentConverter, PdfFormatOption, InputFormat
# Import the ImageRefMode enum from docling_core
from docling_core.types.doc import ImageRefMode
# Configure pipeline options
pipeline_options = PdfPipelineOptions(
    do_ocr=True,
    do_table_structure=True,
    enable_remote_services=False,
    images_scale=2,
    generate_page_images=True,
    generate_picture_images=True,
    ocr_options=TesseractCliOcrOptions(lang=["chi_sim"])
)
# Create and use DocumentConverter
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options
        )
    }
)
ImageRefMode controls how images are referenced or embedded when exporting documents (e.g., to Markdown or HTML). We can then convert a PDF to Markdown with:
# Defines the URL of the PDF document to be converted.
source = "https://arxiv.org/pdf/2408.09869"
# Call the convert method of the DocumentConverter object.
result = converter.convert(source)
# Call the save_as_markdown method of the document object to save the markdown representation of the document.
result.document.save_as_markdown("output.md", image_mode=ImageRefMode.REFERENCED)
You can refer to the DoclingDocument reference to learn more about the save_as_markdown() method.
save_as_markdown(
    filename: Union[str, Path],
    artifacts_dir: Optional[Path] = None,
    delim: str = '\n\n',
    from_element: int = 0,
    to_element: int = maxsize,
    labels: Optional[set[DocItemLabel]] = None,
    strict_text: bool = False,
    escaping_underscores: bool = True,
    image_placeholder: str = '<!-- image -->',
    image_mode: ImageRefMode = PLACEHOLDER,
    indent: int = 4,
    text_width: int = -1,
    page_no: Optional[int] = None,
    included_content_layers: Optional[set[ContentLayer]] = None,
    page_break_placeholder: Optional[str] = None,
    include_annotations: bool = True
)
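A small sketch of choosing the image_mode argument, assuming the ImageRefMode enum exposes EMBEDDED and REFERENCED members alongside the default PLACEHOLDER. The helper name save_markdown and its embed_images flag are hypothetical.

```python
from pathlib import Path


def save_markdown(document, md_path, embed_images=False):
    """Save a DoclingDocument as Markdown with embedded or referenced images."""
    # Imported inside the function so the helper loads without docling_core.
    from docling_core.types.doc import ImageRefMode

    # EMBEDDED inlines image data in the Markdown file;
    # REFERENCED writes images to separate files and links to them.
    mode = ImageRefMode.EMBEDDED if embed_images else ImageRefMode.REFERENCED
    md_path = Path(md_path)
    md_path.parent.mkdir(parents=True, exist_ok=True)
    document.save_as_markdown(md_path, image_mode=mode)
    return md_path
```

Embedding keeps the output in a single self-contained file, while referencing keeps the Markdown small and the images reusable.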
DOCX to Markdown Conversion
from docling.document_converter import DocumentConverter, WordFormatOption, InputFormat, SimplePipeline
from docling.datamodel.pipeline_options import PaginatedPipelineOptions
from docling_core.types.doc import ImageRefMode
# Create and use DocumentConverter
converter = DocumentConverter(
    format_options={
        InputFormat.DOCX: WordFormatOption(
            pipeline_cls=SimplePipeline,
            pipeline_options=PaginatedPipelineOptions(generate_picture_images=True)
        )
    }
)
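The converter above still has to be run on a file. The sketch below mirrors that configuration and follows the same convert-then-save pattern as the PDF example. The function names and the idea of deriving the output path from the input path are assumptions made for illustration.

```python
from pathlib import Path


def markdown_path_for(docx_path):
    """Return the Markdown path next to `docx_path` (same name, .md suffix)."""
    return Path(docx_path).with_suffix(".md")


def docx_to_markdown(docx_path):
    """Convert one DOCX file to Markdown and return the output path."""
    # Imported inside the function because loading docling is comparatively slow.
    from docling.datamodel.pipeline_options import PaginatedPipelineOptions
    from docling.document_converter import (
        DocumentConverter,
        InputFormat,
        SimplePipeline,
        WordFormatOption,
    )
    from docling_core.types.doc import ImageRefMode

    converter = DocumentConverter(
        format_options={
            InputFormat.DOCX: WordFormatOption(
                pipeline_cls=SimplePipeline,
                pipeline_options=PaginatedPipelineOptions(
                    generate_picture_images=True
                ),
            )
        }
    )
    result = converter.convert(docx_path)
    md_path = markdown_path_for(docx_path)
    # Images are saved as separate files and referenced from the Markdown.
    result.document.save_as_markdown(md_path, image_mode=ImageRefMode.REFERENCED)
    return md_path
```

For a hypothetical file reports/q1.docx, docx_to_markdown would write reports/q1.md.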