Document Parser

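Data processing is the core of an AI question-answering system, because large language models are sensitive to document format: structured plain-text formats such as .txt, .md, .csv, and .json are the friendliest. Documents in other formats, such as PDF, must first be converted to text.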
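As a rough illustration of the rule above, the sketch below decides by file extension whether a document can be passed on as-is or needs a conversion step first. The function name `needs_conversion` and the constant `PLAIN_TEXT_SUFFIXES` are hypothetical, and the suffix list only contains the four formats named above.

```python
from pathlib import Path

# Formats the text above calls LLM-friendly; extend as needed (assumption).
PLAIN_TEXT_SUFFIXES = {".txt", ".md", ".csv", ".json"}


def needs_conversion(path: str) -> bool:
    """Return True when the file should be converted to text first."""
    return Path(path).suffix.lower() not in PLAIN_TEXT_SUFFIXES


if __name__ == "__main__":
    for name in ["notes.md", "report.PDF", "table.csv"]:
        action = "convert first" if needs_conversion(name) else "use as-is"
        print(f"{name}: {action}")
```

The suffix is lowercased before the check, so `report.PDF` and `report.pdf` are treated the same way.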

A document parser is a program or a software system designed to analyze the content and structure of a document and extract specific, relevant information from it. The goal is to transform unstructured or semi-structured data within a document into a structured, machine-readable format that can be easily used, stored, and analyzed by other systems or applications.

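Document parsers play a very important role in AI, so many products have been built around them, including Google's Document AI. This post starts with three open-source tools for converting text-based PDFs to Markdown; try them on your own documents to see which works best.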

Datalab: PDF to Structured Text Conversion

Datalab builds state-of-the-art document intelligence models to convert complex PDFs and other unstructured formats into structured, machine-readable outputs — fast, accurately, and at scale. 

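Datalab offers both an open-source converter and a commercial one. Commercial users need to sign up for the official API; open-source users can follow the tutorial on GitHub. The simplest approach is the CLI (Command-Line Interface), that is, running the following commands:

# Install Datalab Marker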
pip install marker-pdf

# Convert PDF to markdown
marker_single /path/to/file.pdf --output_dir /path/to/output/folder
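
To run the same command over a whole folder of PDFs, one option is a thin Python wrapper around the CLI. This is a minimal sketch, not Datalab's own batch workflow: `marker_command` and `convert_folder` are hypothetical helper names, and the only flag used is the `--output_dir` option shown above.

```python
import shutil
import subprocess
from pathlib import Path


def marker_command(pdf: Path, output_dir: Path) -> list[str]:
    """Build the same marker_single call shown above for one PDF."""
    return ["marker_single", str(pdf), "--output_dir", str(output_dir)]


def convert_folder(input_dir: str, output_dir: str) -> list[Path]:
    """Run marker_single on every PDF in input_dir; return the files processed."""
    if shutil.which("marker_single") is None:
        raise RuntimeError("marker_single not found; run `pip install marker-pdf` first")
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    done = []
    for pdf in sorted(Path(input_dir).glob("*.pdf")):
        # check=True stops the batch on the first failed conversion.
        subprocess.run(marker_command(pdf, out), check=True)
        done.append(pdf)
    return done
```

Calling `convert_folder("pdfs", "markdown_out")` would process the files one at a time; both folder names are placeholders.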

Docling: PDF to Structured Text Conversion

Docling simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.

As before, follow the tutorial on GitHub.

# Install Docling
pip install docling

# Convert PDF to markdown
docling your_file.pdf
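
Docling can also be called from Python, which is convenient when the conversion is one step in a larger pipeline. The sketch below assumes the `DocumentConverter` class and the `export_to_markdown()` method from Docling's documented Python API; `markdown_path` and `pdf_to_markdown` are hypothetical helper names, and writing the output next to the source file is just one possible choice.

```python
from pathlib import Path


def markdown_path(pdf_path: str) -> Path:
    """Hypothetical naming helper: report.pdf -> report.md next to the source."""
    return Path(pdf_path).with_suffix(".md")


def pdf_to_markdown(pdf_path: str) -> Path:
    """Convert one PDF with Docling and write the Markdown beside it."""
    # Imported lazily so the module loads even when docling is not installed.
    from docling.document_converter import DocumentConverter

    result = DocumentConverter().convert(pdf_path)
    target = markdown_path(pdf_path)
    target.write_text(result.document.export_to_markdown(), encoding="utf-8")
    return target
```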

MarkItDown: PDF to Structured Text Conversion

MarkItDown is a lightweight Python utility for converting various files to Markdown for use with LLMs and related text analysis pipelines.

# Install MarkItDown
pip install 'markitdown[all]'

# Convert PDF to markdown
markitdown path-to-file.pdf -o document.md

At present, MarkItDown does not support image references in the generated .md file; it is mainly intended for converting text.
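
One way to check this on your own files is to convert a PDF and count the Markdown image references in the result. This is a sketch under assumptions: it uses the `MarkItDown().convert(...)` call and the `text_content` attribute from the project's README, while `count_image_refs` and `convert_with_markitdown` are hypothetical helper names. The regular expression only matches the inline `![alt](path)` form.

```python
import re

# Matches Markdown image syntax such as ![alt](path/to/image.png).
IMAGE_REF = re.compile(r"!\[[^\]]*\]\([^)]+\)")


def count_image_refs(markdown: str) -> int:
    """Count inline Markdown image references in converted text."""
    return len(IMAGE_REF.findall(markdown))


def convert_with_markitdown(pdf_path: str) -> str:
    """Return the Markdown text MarkItDown produces for one file."""
    # Imported lazily so the checker above works without markitdown installed.
    from markitdown import MarkItDown

    return MarkItDown().convert(pdf_path).text_content
```

Running `count_image_refs` on the output of each of the three tools for the same PDF gives a quick, rough comparison of how they handle figures.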