Mistral OCR, Markdown Chunking, and Hugging Face Dataset Creator

Upload one or more PDF or image files. The application will:

  1. Extract text and images using Mistral OCR for each file
  2. Embed images as base64 data URIs in markdown
  3. Chunk markdown by headers and, optionally, by character count
  4. Store embedded images in chunk metadata
  5. Create/update a Hugging Face Dataset with all processed data
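
Step 2 can be illustrated with a minimal sketch. It assumes the OCR markdown refers to each image by an identifier (for example `![img-0.png](img-0.png)`) and that the raw image bytes are available in a dictionary keyed by that identifier. The function name `embed_images` and the dictionary layout are hypothetical, not taken from the application.

```python
import base64
import mimetypes
import re

# Matches markdown image syntax: ![alt](target)
IMAGE_RE = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)\)")


def embed_images(markdown: str, images: dict[str, bytes]) -> str:
    """Replace image references with base64 data URIs.

    `images` maps an image identifier used in the markdown
    (hypothetical, e.g. "img-0.png") to the raw image bytes.
    """

    def to_data_uri(match: re.Match) -> str:
        alt, target = match.group(1), match.group(2)
        data = images.get(target)
        if data is None:
            # Leave references we have no bytes for untouched.
            return match.group(0)
        # Guess the MIME type from the identifier's extension.
        mime = mimetypes.guess_type(target)[0] or "application/octet-stream"
        encoded = base64.b64encode(data).decode("ascii")
        return f"![{alt}](data:{mime};base64,{encoded})"

    return IMAGE_RE.sub(to_data_uri, markdown)
```

Embedding the image inline keeps each chunk self-contained, at the cost of larger text: base64 adds roughly a third to the image size.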

Chunking Options

Max Chunk Size (Characters): 0 to 8000
Chunk Overlap (Characters): 0 to 1000
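
The chunking behavior behind these options might look like the sketch below: split on markdown headers first, then optionally cut each section into fixed-size windows with overlap. This is an illustration under stated assumptions, not the application's code. The function names, the `h1`/`h2` metadata keys, and the choice to treat a maximum size of 0 as "no size limit" are all assumptions.

```python
import re

HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headers(markdown: str, strip_headers: bool = True) -> list[dict]:
    """Split markdown into sections, recording the header path of each one.

    Simplification: '#' lines inside fenced code blocks are treated as headers.
    """
    chunks: list[dict] = []
    path: dict[int, str] = {}  # header level -> current title
    lines: list[str] = []

    def flush() -> None:
        text = "\n".join(lines).strip()
        if text:
            meta = {f"h{level}": title for level, title in sorted(path.items())}
            chunks.append({"content": text, "metadata": meta})
        lines.clear()

    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match:
            flush()  # close the previous section under the old header path
            level = len(match.group(1))
            for deeper in [k for k in path if k >= level]:
                del path[deeper]
            path[level] = match.group(2).strip()
            if not strip_headers:
                lines.append(line)
        else:
            lines.append(line)
    flush()
    return chunks


def split_by_size(text: str, max_size: int, overlap: int = 0) -> list[str]:
    """Cut text into windows of at most max_size characters.

    Assumption: max_size <= 0 disables size-based splitting.
    """
    if max_size <= 0 or len(text) <= max_size:
        return [text]
    step = max_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than max_size")
    pieces = []
    start = 0
    while True:
        pieces.append(text[start:start + max_size])
        if start + max_size >= len(text):
            break
        start += step
    return pieces


def chunk_markdown(markdown: str, max_size: int = 0, overlap: int = 0,
                   strip_headers: bool = True) -> list[dict]:
    """Header-based split followed by an optional character-count split."""
    result = []
    for section in split_by_headers(markdown, strip_headers):
        for piece in split_by_size(section["content"], max_size, overlap):
            result.append({"content": piece, "metadata": section["metadata"]})
    return result
```

With "Strip Headers from Content" enabled, the header text survives only in the metadata, so a chunk's position in the document can still be recovered from its header path.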

Hugging Face Output Options

Examples
Max Chunk Size (Characters) | Chunk Overlap (Characters) | Strip Headers from Content | Hugging Face Token | HF Dataset Repository

Requires MISTRAL_API_KEY or HF token
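
For step 5, one plausible way to turn chunks into dataset rows and push them to the Hub is sketched below. The column names (`source_file`, `chunk_index`, `content`, `metadata`) and both function names are hypothetical. The application's actual schema is not stated here. The upload helper relies on the third-party `datasets` library and is imported lazily so that the record-building part runs without it.

```python
import json


def build_records(chunks: list[dict], source_file: str) -> list[dict]:
    """Flatten chunks into rows with a uniform set of columns.

    Metadata is serialized to a JSON string because header keys can differ
    between chunks, and a string column keeps the schema stable.
    """
    return [
        {
            "source_file": source_file,
            "chunk_index": index,
            "content": chunk["content"],
            "metadata": json.dumps(chunk.get("metadata", {}), sort_keys=True),
        }
        for index, chunk in enumerate(chunks)
    ]


def push_records(records: list[dict], repo_id: str, token: str) -> None:
    """Upload rows to a Hugging Face dataset repository.

    Requires `pip install datasets` and a token with write access.
    """
    from datasets import Dataset  # third-party, imported only when needed

    Dataset.from_list(records).push_to_hub(repo_id, token=token)
```

A call such as `push_records(build_records(chunks, "report.pdf"), "user/my-dataset", token)` would then upload the rows. The file name and repository id in that call are placeholders.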