Productivity

Document Parser

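The core of an AI question-answering system is data processing, because large language models are sensitive to document format: structured plain-text formats such as .txt, .md, .csv, and .json are much friendlier to them. For documents in other formats, such as PDF, the first step is converting them to text.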
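As a rough illustration of that first step, the sketch below sorts incoming files into those that can be fed to a model as-is and those that need conversion first. The function names `needs_conversion` and `split_by_format` are hypothetical, and the set of "friendly" extensions is just the four formats listed above, not an exhaustive rule.

```python
from pathlib import Path

# Plain-text formats named in the paragraph above (an illustrative set only).
TEXT_FRIENDLY = {".txt", ".md", ".csv", ".json"}


def needs_conversion(path: str) -> bool:
    """Return True if the file should be converted to text before use."""
    return Path(path).suffix.lower() not in TEXT_FRIENDLY


def split_by_format(paths):
    """Split paths into (ready, to_convert) lists, preserving input order."""
    ready = [p for p in paths if not needs_conversion(p)]
    to_convert = [p for p in paths if needs_conversion(p)]
    return ready, to_convert
```

Files in the second list would then go through one of the PDF-to-Markdown tools covered in this post.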

谢现实

Jul 22, 2025 — 2 min read

A document parser is a program or a software system designed to analyze the content and structure of a document and extract specific, relevant information from it. The goal is to transform unstructured or semi-structured data within a document into a structured, machine-readable format that can be easily used, stored, and analyzed by other systems or applications.

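Document parsers play a very important role in AI, which has led to many products in this space, including Google's Document AI. This post introduces three open-source tools for converting text-based PDFs to Markdown; you can test which one works best for your documents.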

Datalab: converting PDFs to structured text

Datalab builds state-of-the-art document intelligence models to convert complex PDFs and other unstructured formats into structured, machine-readable outputs — fast, accurately, and at scale.

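Datalab offers both an open-source and a commercial converter. Commercial users need to sign up for the official API; open-source users can follow the tutorial on GitHub. The simplest approach is the CLI (Command-Line Interface), that is, running the following commands: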

# Install Datalab marker
pip install marker-pdf

# Convert PDF to markdown
marker_single /path/to/file.pdf --output_dir /path/to/output/folder
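To convert a whole folder rather than a single file, one option is to wrap the same command in a small Python script. This is a minimal sketch, not part of marker itself: `marker_command` and `convert_folder` are hypothetical helper names, and the script uses only the `marker_single` invocation and `--output_dir` flag shown above.

```python
import shutil
import subprocess
from pathlib import Path


def marker_command(pdf_path, output_dir) -> list[str]:
    """Build the marker_single command shown above as an argument list."""
    return ["marker_single", str(pdf_path), "--output_dir", str(output_dir)]


def convert_folder(input_dir: str, output_dir: str) -> list[str]:
    """Run marker_single on every PDF in input_dir; return the files processed."""
    if shutil.which("marker_single") is None:
        raise RuntimeError("marker_single not found; run `pip install marker-pdf` first")
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    processed = []
    for pdf in sorted(Path(input_dir).glob("*.pdf")):
        # check=True stops the batch on the first failed conversion.
        subprocess.run(marker_command(pdf, output_dir), check=True)
        processed.append(str(pdf))
    return processed
```

Calling `convert_folder("/path/to/pdfs", "/path/to/output/folder")` would then process each PDF in turn, one `marker_single` call per file.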

Docling: converting PDFs to structured text

Docling simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.

As with Datalab, see the tutorial on GitHub.

# Install Docling
pip install docling

# Convert PDF to markdown
docling your_file.pdf
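Docling can also be called from Python, which is handy when the conversion is one step in a larger pipeline. The sketch below assumes the `DocumentConverter` class and `export_to_markdown()` method from Docling's Python API; `markdown_path` and `pdf_to_markdown` are hypothetical helper names introduced here for illustration.

```python
from pathlib import Path


def markdown_path(pdf_path: str, output_dir: str) -> Path:
    """Derive the output path: same base name as the PDF, .md extension."""
    return Path(output_dir) / (Path(pdf_path).stem + ".md")


def pdf_to_markdown(pdf_path: str, output_dir: str = ".") -> Path:
    """Convert one PDF with Docling and write the Markdown next to output_dir."""
    # Imported lazily so markdown_path() works without docling installed.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    result = converter.convert(pdf_path)
    out = markdown_path(pdf_path, output_dir)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(result.document.export_to_markdown(), encoding="utf-8")
    return out
```

For example, `pdf_to_markdown("your_file.pdf", "out")` would write `out/your_file.md`, assuming Docling is installed and the PDF is readable.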

MarkItDown: converting PDFs to structured text

MarkItDown is a lightweight Python utility for converting various files to Markdown for use with LLMs and related text analysis pipelines.

# Install Markitdown
pip install 'markitdown[all]'

# Convert PDF to markdown
markitdown path-to-file.pdf -o document.md

MarkItDown does not currently support image references in the .md output; it is mainly intended for text conversion.
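When comparing the three tools, a quick way to check this on your own files is to convert a PDF and count the Markdown image references in the result. The sketch below assumes MarkItDown's Python API (`MarkItDown().convert(...)` returning an object with `text_content`); `count_image_refs` and `convert_with_markitdown` are hypothetical helpers, and the regular expression only matches the inline `![alt](url)` form.

```python
import re

# Matches inline Markdown images such as ![caption](figure.png).
IMAGE_REF = re.compile(r"!\[[^\]]*\]\([^)]+\)")


def count_image_refs(markdown_text: str) -> int:
    """Count inline image references in a Markdown string."""
    return len(IMAGE_REF.findall(markdown_text))


def convert_with_markitdown(path: str) -> str:
    """Convert a file to Markdown text using MarkItDown."""
    # Imported lazily so count_image_refs() works without markitdown installed.
    from markitdown import MarkItDown

    result = MarkItDown().convert(path)
    return result.text_content
```

Running `count_image_refs` on the output of each tool for the same PDF gives a simple, if crude, signal of which converters keep figures.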
