Mistral OCR, Markdown Chunking, and Hugging Face Dataset Creator
Upload a PDF or image file. The application will:
Extract text and images using Mistral OCR
Embed images as base64 data URIs in markdown
Chunk markdown by headers and optionally character count
Store embedded images in chunk metadata
Create/update a Hugging Face Dataset
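The image-embedding step above can be sketched as follows. This is a minimal illustration, not the application's actual code: the function names `to_data_uri` and `embed_images` are hypothetical, and it assumes the OCR markdown refers to images with ordinary `![alt](image-id)` links and that the raw image bytes are available keyed by that id.

```python
import base64
import re
from typing import Dict

# Matches markdown image links of the form ![alt](target)
IMAGE_LINK_RE = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)\)")


def to_data_uri(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a base64 data URI."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def embed_images(markdown: str, images: Dict[str, bytes],
                 mime_type: str = "image/png") -> str:
    """Replace image links whose target is a known image id (assumed
    convention) with an inline data URI; unknown targets are left alone."""
    def replace(match: "re.Match[str]") -> str:
        alt, target = match.group(1), match.group(2)
        if target not in images:
            return match.group(0)
        return f"![{alt}]({to_data_uri(images[target], mime_type)})"

    return IMAGE_LINK_RE.sub(replace, markdown)
```

Inlining images this way keeps each chunk self-contained, at the cost of larger text: base64 grows the payload by roughly a third.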
Upload PDF or Image File
Chunking Options
Max Chunk Size (Characters): 0 to 8000
Chunk Overlap (Characters): 0 to 1000
Strip Headers from Content
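How the three chunking options might interact is sketched below. This is an assumption-laden illustration rather than the application's implementation: the function names are hypothetical, a Max Chunk Size of 0 is assumed to disable character-based splitting, and headers are detected with a simple `#` prefix match that does not account for `#` lines inside fenced code.

```python
import re
from typing import Dict, List

HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headers(markdown: str, strip_headers: bool = True) -> List[Dict[str, str]]:
    """Split markdown into sections, starting a new section at every header line."""
    sections: List[Dict[str, str]] = []
    header = ""
    lines: List[str] = []
    for line in markdown.splitlines() + [None]:  # None marks end of input
        match = HEADER_RE.match(line) if line is not None else None
        if line is None or match:
            text = "\n".join(lines).strip()
            if text:
                sections.append({"header": header, "text": text})
            if match:
                header = match.group(2).strip()
                # Keep the header line in the content unless stripping is on.
                lines = [] if strip_headers else [line]
        else:
            lines.append(line)
    return sections


def split_by_size(text: str, max_size: int, overlap: int) -> List[str]:
    """Cut text into windows of max_size characters that share `overlap` characters."""
    if max_size <= 0 or len(text) <= max_size:  # 0 is assumed to mean "no limit"
        return [text]
    if overlap >= max_size:
        raise ValueError("overlap must be smaller than max_size")
    pieces = []
    start = 0
    while True:
        pieces.append(text[start:start + max_size])
        if start + max_size >= len(text):
            return pieces
        start += max_size - overlap


def chunk_markdown(markdown: str, max_size: int = 0, overlap: int = 0,
                   strip_headers: bool = True) -> List[Dict[str, object]]:
    """Header-based chunking followed by optional character-based splitting."""
    chunks = []
    for section in split_by_headers(markdown, strip_headers):
        for piece in split_by_size(section["text"], max_size, overlap):
            chunks.append({"text": piece, "metadata": {"header": section["header"]}})
    return chunks
```

For example, `split_by_size("abcdefghij", 4, 1)` yields three windows of four characters, each starting three characters after the previous one, so neighbouring chunks share one character.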
Hugging Face Output Options
HF Dataset Repository
Hugging Face Token
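A rough sketch of the dataset step, under stated assumptions: the row layout (`source`, `chunk_index`, `text`, `metadata` as a JSON string) and the function names are hypothetical, and the sketch only creates or overwrites the repository contents; appending to an existing dataset would need extra logic not shown here. `Dataset.from_list` and `push_to_hub` come from the third-party `datasets` library.

```python
import json
from typing import Dict, List


def chunks_to_rows(chunks: List[Dict[str, object]], source_name: str) -> List[Dict[str, object]]:
    """Flatten chunks into rows; metadata is serialized to JSON so every
    row has the same simple schema (an assumed layout)."""
    rows = []
    for index, chunk in enumerate(chunks):
        rows.append({
            "source": source_name,
            "chunk_index": index,
            "text": chunk["text"],
            "metadata": json.dumps(chunk.get("metadata", {}), sort_keys=True),
        })
    return rows


def push_rows(rows: List[Dict[str, object]], repo_id: str, token: str) -> None:
    """Upload rows as a dataset to the given repository (requires network access)."""
    from datasets import Dataset  # third-party; imported lazily

    Dataset.from_list(rows).push_to_hub(repo_id, token=token)
```

Here `repo_id` corresponds to the HF Dataset Repository field (in the usual `username/dataset-name` form) and `token` to the Hugging Face Token field; the token needs write access to that repository.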
Process and Save
Result Status
Requires MISTRAL_API_KEY or HF token