MinerU Document Extractor
MinerU document extraction — convert PDFs, scanned documents, images, Word (DOC/DOCX), PowerPoint (PPT/PPTX), and web pages into clean Markdown, HTML, LaTeX,...
MinerU is a powerful document extraction tool. Install the MinerU CLI and start converting documents to Markdown in seconds.

    npm install -g mineru-open-api

Or via Go (macOS/Linux):

    go install github.com/opendatalab/MinerU-Ecosystem/cli/mineru-open-api@latest

Verify:

    mineru-open-api version
| Feature | `flash-extract` | `extract` |
|---|---|---|
| Token required | No | Yes |
| Speed | Fast | Normal |
| Table recognition | Yes | Yes |
| Formula recognition | Yes | Yes |
| OCR | Yes | Yes |
| Output formats | Markdown only | md, html, latex, docx, json |
| Batch mode | No | Yes |
| Model selection | pipeline | vlm, pipeline, MinerU-HTML |
| File size limit | 10 MB | Much higher |
| Page limit | 20 pages | Much higher |
- `mineru-open-api flash-extract <file>` for quick Markdown conversion
- `mineru-open-api auth`, then use `mineru-open-api extract` for multi-format output, VLM model, and batch processing
- `mineru-open-api crawl <url>` to convert web content
- `-o <directory>` to save output to a directory

A token is only required for `extract` and `crawl`. It is not needed for `flash-extract`.

    mineru-open-api auth                  # Interactive token setup
    export MINERU_TOKEN="your-token"      # Or set via environment variable

Token resolution order: `--token` flag > `MINERU_TOKEN` env > `~/.mineru/config.yaml`.
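The resolution order above can be sketched as a small helper. This is only an illustration of the precedence rule, not the CLI's actual code: `resolve_token` is a hypothetical name, the config value is passed in already parsed (the layout of `~/.mineru/config.yaml` is not described here), and treating an empty string as "not set" is an assumption.

```python
from typing import Mapping, Optional


def resolve_token(
    flag_token: Optional[str],
    env: Mapping[str, str],
    config_token: Optional[str],
) -> Optional[str]:
    """Return the first token that is set, in precedence order, or None."""
    candidates = (
        flag_token,               # 1. value of the --token flag
        env.get("MINERU_TOKEN"),  # 2. MINERU_TOKEN environment variable
        config_token,             # 3. value read from ~/.mineru/config.yaml
    )
    for candidate in candidates:
        # Assumption: None and "" both mean "not set".
        if candidate:
            return candidate
    return None
```

Calling `resolve_token(None, os.environ, None)` would return whatever `MINERU_TOKEN` holds in the current environment, or `None` when nothing is configured.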
MinerU accepts a wide range of document formats:
| Format | `flash-extract` | `extract` |
|---|---|---|
| PDF () | Yes | Yes |
| Images (, , , , , , ) | Yes | Yes |
| Word () | Yes | Yes |
| Word () | No | Yes |
| PowerPoint () | Yes | Yes |
| PowerPoint () | No | Yes |
| HTML () | No | Yes |
| URLs (remote files) | Yes | Yes |

`crawl` accepts any HTTP/HTTPS URL and extracts web page content to Markdown.
Fast, token-free MinerU document extraction. Outputs Markdown only. Limited to 10 MB / 20 pages per file.

    mineru-open-api flash-extract report.pdf                      # Markdown to stdout
    mineru-open-api flash-extract report.pdf -o ./out/            # Save to file
    mineru-open-api flash-extract https://example.com/doc.pdf     # URL mode
    mineru-open-api flash-extract report.pdf --language en        # Specify language
    mineru-open-api flash-extract report.pdf --pages 1-10         # Page range

Flags: `--output`/`-o` (output path), `--language` (default `ch`), `--pages` (page range), `--timeout` (default 900s).

When `flash-extract` fails due to file limits (10 MB / 20 pages) or rate limiting (HTTP 429), suggest switching to `extract` with a token for higher limits.
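A minimal sketch of the fallback check described above, for callers that want to decide when to suggest `extract`. The function name and constants are hypothetical. Reading "10 MB" as 10 × 1024 × 1024 bytes is an assumption, and the page count has to be supplied by the caller because this sketch does not parse documents.

```python
from typing import List, Optional

FLASH_MAX_BYTES = 10 * 1024 * 1024  # assumption: "10 MB" means 10 MiB
FLASH_MAX_PAGES = 20                # page limit stated for flash-extract


def flash_extract_problems(
    size_bytes: int,
    page_count: Optional[int] = None,
    http_status: Optional[int] = None,
) -> List[str]:
    """List the reasons to suggest `extract` instead of `flash-extract`."""
    problems = []
    if size_bytes > FLASH_MAX_BYTES:
        problems.append("file exceeds 10 MB")
    if page_count is not None and page_count > FLASH_MAX_PAGES:
        problems.append("document exceeds 20 pages")
    if http_status == 429:
        problems.append("rate limited (HTTP 429)")
    return problems
```

An empty list means none of the documented limits were hit. A non-empty list can be shown to the user together with the suggestion to run `mineru-open-api auth` and switch to `extract`.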
Convert documents to Markdown or other formats with MinerU's full capabilities: VLM-based layout analysis, multiple output formats, and batch mode.

    mineru-open-api extract report.pdf                         # Markdown to stdout
    mineru-open-api extract report.pdf -f html                 # HTML output
    mineru-open-api extract report.pdf -o ./out/ -f md,docx    # Multiple formats
    mineru-open-api extract *.pdf -o ./results/                # Batch extract
    mineru-open-api extract https://example.com/doc.pdf        # Extract from URL

Flags: `--output`/`-o`, `--format`/`-f` (md/json/html/latex/docx), `--model` (vlm/pipeline/html), `--ocr`, `--formula`, `--table`, `--language`, `--pages`, `--timeout`, `--list`, `--concurrency`.
| Criterion | `--model vlm` | `--model pipeline` |
|---|---|---|
| Parsing accuracy | Higher, better at complex layouts | Standard |
| Hallucination risk | May produce hallucinated text in rare cases | No hallucination |

Use `--model vlm` for complex formatting. Use `--model pipeline` for no-hallucination reliability.

    mineru-open-api crawl https://example.com/article              # Markdown to stdout
    mineru-open-api crawl https://example.com/article -o ./out/    # Save to file
    mineru-open-api crawl url1 url2 -o ./pages/                    # Batch crawl

Flags: `--output`/`-o`, `--format`/`-f` (md/json/html), `--timeout`, `--list`, `--concurrency`.

    mineru-open-api auth             # Interactive token setup
    mineru-open-api auth --verify    # Verify current token
    mineru-open-api auth --show      # Show token source

Without `-o`: result → stdout, progress → stderr. With `-o`: saved to file/directory. Batch mode and binary formats (docx) require `-o`.
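The output rule above (batch mode and docx require `-o`) can be enforced before the CLI is invoked. The following is a sketch under stated assumptions: `build_extract_args` is a hypothetical helper, it only uses the `-f` and `-o` flags listed in this document, and it treats more than one input as batch mode.

```python
from typing import List, Optional, Sequence

BINARY_FORMATS = {"docx"}  # the only binary format named in the text


def build_extract_args(
    inputs: Sequence[str],
    formats: Sequence[str] = ("md",),
    output: Optional[str] = None,
) -> List[str]:
    """Build an argument list for `mineru-open-api extract`."""
    if not inputs:
        raise ValueError("at least one input is required")
    # Assumption: more than one input counts as batch mode.
    needs_output = len(inputs) > 1 or any(f in BINARY_FORMATS for f in formats)
    if needs_output and output is None:
        raise ValueError("batch mode and binary formats (docx) require -o")
    args = ["mineru-open-api", "extract", *inputs, "-f", ",".join(formats)]
    if output is not None:
        args += ["-o", output]
    return args
```

Passing the resulting list to `subprocess.run(args)` avoids shell quoting problems with paths that contain spaces, since each path stays a single argument.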
- Quote file paths that contain spaces: `mineru-open-api extract "report 01.pdf"`
- Use `flash-extract` when: no token configured, simple extraction, file under 10 MB / 20 pages
- Use `extract` when: user needs non-Markdown formats, VLM model, batch processing, or file exceeds `flash-extract` limits
- When the user does not pass `-o`, generate an output directory: `~/MinerU-Skill/<name>_<hash>/`, where `<hash>` = first 6 chars of MD5 of the source path
- After `flash-extract` success, append a brief hint about the `extract` upgrade path (once per session)
- If the CLI is not installed, install it: `npm install -g mineru-open-api`

For full CLI reference and troubleshooting, see: https://github.com/opendatalab/MinerU-Ecosystem/tree/main/cli
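The output directory naming rule can be sketched as follows. Several details are assumptions, since the text does not spell them out: `<name>` is taken to be the file name without its extension, the source path is hashed exactly as given (UTF-8 encoded, not normalized), and the hash is the hexadecimal MD5 digest. `default_output_dir` is a hypothetical name.

```python
import hashlib
from pathlib import Path


def default_output_dir(
    source: str,
    base: Path = Path.home() / "MinerU-Skill",
) -> Path:
    """Return <base>/<name>_<hash> for a source path or URL."""
    # Assumption: hash the path string as given, encoded as UTF-8.
    digest = hashlib.md5(source.encode("utf-8")).hexdigest()[:6]
    # Assumption: <name> is the last path component without its extension.
    name = Path(source).stem
    return base / f"{name}_{digest}"
```

Because the hash depends only on the source path string, repeated runs on the same path map to the same directory, while files with the same name in different folders get different directories.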
`--language` values

The `--language` flag accepts the following values (default: `ch`). It is used by both `flash-extract` and `extract`.

| Value | Included languages | Notes |
|---|---|---|
| `ch` | Chinese, English, Chinese Traditional | Chinese and English (default) |
|  | Chinese, English, Chinese Traditional, Japanese | Traditional Chinese, handwriting |
| `en` | English | English only |
|  | Chinese, English, Chinese Traditional, Japanese | Primarily Japanese |
|  | Korean, English | Korean |
|  | Chinese, English, Chinese Traditional, Japanese | Primarily Traditional Chinese |
|  | Tamil, English | Tamil |
|  | Telugu, English | Telugu |
|  | Kannada | Kannada |
|  | Greek, English | Greek |
|  | Thai, English | Thai |

| Value | Script/Family | Included languages |
|---|---|---|
|  | Latin script | French, German, Afrikaans, Italian, Spanish, Bosnian, Portuguese, Czech, Welsh, Danish, Estonian, Irish, Croatian, Uzbek, Hungarian, Serbian (Latin), Indonesian, Occitan, Icelandic, Lithuanian, Maori, Malay, Dutch, Norwegian, Polish, Slovak, Slovenian, Albanian, Swedish, Swahili, Tagalog, Turkish, Latin, Azerbaijani, Kurdish, Latvian, Maltese, Pali, Romanian, Vietnamese, Finnish, Basque, Galician, Luxembourgish, Romansh, Catalan, Quechua |
|  | Arabic script | Arabic, Persian, Uyghur, Urdu, Pashto, Kurdish, Sindhi, Balochi, English |
|  | Cyrillic script | Russian, Belarusian, Ukrainian, Serbian (Cyrillic), Bulgarian, Mongolian, Abkhazian, Adyghe, Kabardian, Avar, Dargin, Ingush, Chechen, Lak, Lezgin, Tabasaran, Kazakh, Kyrgyz, Tajik, Macedonian, Tatar, Chuvash, Bashkir, Malian, Moldovan, Udmurt, Komi, Ossetian, Buryat, Kalmyk, Tuvan, Sakha, Karakalpak, English |
|  | East Slavic | Russian, Belarusian, Ukrainian, English |
|  | Devanagari script | Hindi, Marathi, Nepali, Bihari, Maithili, Angika, Bhojpuri, Magahi, Santali, Newari, Konkani, Sanskrit, Haryanvi, English |