Mistral OCR, Markdown Chunking, and Hugging Face Dataset Creator

Upload a PDF or image file. The application will:

  1. Extract text and images using Mistral OCR
  2. Embed images as base64 data URIs in markdown
  3. Chunk markdown by headers and optionally character count
  4. Store embedded images in chunk metadata
  5. Create/update a Hugging Face Dataset
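
Step 2 can be sketched roughly as follows. This is a minimal illustration, not the application's actual code: the function name `embed_images` and the input shape (a mapping from image file name to raw bytes) are assumptions, and the real OCR response may deliver images in a different form.

```python
import base64
import mimetypes
import re

# Matches markdown image references such as ![alt](img-0.png)
IMAGE_REF = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)\)")


def embed_images(markdown: str, images: dict) -> str:
    """Replace image references with base64 data URIs.

    `images` is a hypothetical mapping of file name -> raw image bytes.
    References whose target is not in `images` are left unchanged.
    """

    def to_data_uri(name: str, data: bytes) -> str:
        # Guess the MIME type from the file extension, with a generic fallback.
        mime = mimetypes.guess_type(name)[0] or "application/octet-stream"
        encoded = base64.b64encode(data).decode("ascii")
        return f"data:{mime};base64,{encoded}"

    def replace(match: re.Match) -> str:
        alt, target = match.group(1), match.group(2)
        if target in images:
            return f"![{alt}]({to_data_uri(target, images[target])})"
        return match.group(0)

    return IMAGE_REF.sub(replace, markdown)
```

Inlining the images this way keeps each markdown chunk self-contained, at the cost of much longer text.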

Chunking Options

Max Chunk Size (Characters): 0 to 8000
Chunk Overlap (Characters): 0 to 1000
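
A rough sketch of how header-based chunking with an optional character limit might work is shown here. It is an assumption-laden illustration rather than the application's implementation: the function names are hypothetical, a Max Chunk Size of 0 is assumed to disable character splitting, and only ATX-style `#` headers are recognized.

```python
import re

HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headers(markdown, strip_headers=True):
    """Split markdown into sections, one per header (hypothetical helper)."""
    sections, header, lines = [], None, []

    def flush():
        content = "\n".join(lines).strip()
        if content:
            sections.append({"header": header, "content": content})

    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match:
            flush()
            header = match.group(2).strip()
            # Keep the header line in the content unless stripping is requested.
            lines = [] if strip_headers else [line]
        else:
            lines.append(line)
    flush()
    return sections


def split_by_size(text, max_size, overlap):
    """Cut text into windows of max_size characters sharing `overlap` characters."""
    if max_size <= 0 or len(text) <= max_size:
        return [text]  # assumption: 0 means "no character limit"
    step = max_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than max_size")
    chunks, start = [], 0
    while True:
        chunks.append(text[start:start + max_size])
        if start + max_size >= len(text):
            return chunks
        start += step


def chunk_markdown(markdown, max_size=0, overlap=0, strip_headers=True):
    """Chunk by headers first, then by character count within each section."""
    return [
        {"text": piece, "metadata": {"header": section["header"]}}
        for section in split_by_headers(markdown, strip_headers)
        for piece in split_by_size(section["content"], max_size, overlap)
    ]
```

For example, `chunk_markdown(text, max_size=4000, overlap=200)` would first split at headers and then cut any section longer than 4000 characters into windows that share 200 characters with their neighbor.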

Hugging Face Output Options
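
One way the output step could look, using the `datasets` library, is sketched below. The column names, the `to_records` and `push_chunks` helpers, and the decision to store image lists as JSON strings are all assumptions made for illustration. Note that this sketch overwrites the target split on each push; merging with an existing dataset would need extra logic that is not shown.

```python
import json


def to_records(chunks, source):
    """Flatten chunk dicts into rows with a fixed set of columns (hypothetical schema)."""
    return [
        {
            "chunk_id": index,
            "source": source,
            "text": chunk["text"],
            "header": chunk["metadata"].get("header") or "",
            # Serialized as JSON so the column is always a plain string.
            "images": json.dumps(list(chunk["metadata"].get("images", []))),
        }
        for index, chunk in enumerate(chunks)
    ]


def push_chunks(chunks, source, repo_id, token):
    """Build a Dataset from the rows and upload it to the Hub."""
    # Imported lazily so to_records can be used without the dependency.
    from datasets import Dataset

    dataset = Dataset.from_list(to_records(chunks, source))
    dataset.push_to_hub(repo_id, token=token)
    return dataset
```

Here `repo_id` corresponds to the HF Dataset Repository field (for instance a value of the form `username/dataset-name`) and `token` to the Hugging Face Token field.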

Examples
Example inputs: Max Chunk Size (Characters), Chunk Overlap (Characters), Strip Headers from Content, Hugging Face Token, HF Dataset Repository

Requires MISTRAL_API_KEY or HF token