PDF2Dataset

Convert PDFs and images into clean Hugging Face datasets with Mistral OCR, structure-aware chunking, validation, deduplication, and quality filtering.

Chunking

Max chunk size: 0 to 4096
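A minimal sketch of how a maximum chunk size might be enforced while still respecting document structure. This is not the tool's actual implementation: the function name `chunk_text`, the use of blank lines as paragraph boundaries, and the choice of characters as the size unit are all assumptions made for illustration.

```python
def chunk_text(text: str, max_chars: int = 4096) -> list[str]:
    """Pack paragraphs into chunks of at most max_chars characters.

    Hypothetical helper: assumes the size limit is measured in characters
    and that paragraphs are separated by blank lines.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")

    chunks: list[str] = []
    current = ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        # A single paragraph longer than the limit is split at fixed offsets.
        while len(para) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(para[:max_chars])
            para = para[max_chars:]
        # Try to append the paragraph to the chunk being built.
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks
```

Keeping whole paragraphs together where possible means a chunk boundary rarely falls mid-sentence; only paragraphs that exceed the limit on their own are cut at a fixed offset.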

Quality filters

Minimum characters: 0 to 500
Minimum words: 0 to 50
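As an illustration, a length-based quality filter with a minimum character count and a minimum word count could look like the sketch below. The names `passes_quality` and `filter_chunks` are hypothetical, and counting words by whitespace splitting is an assumption; the tool itself may measure these differently.

```python
def passes_quality(text: str, min_chars: int = 0, min_words: int = 0) -> bool:
    """Return True if the text meets both minimum-length thresholds.

    Hypothetical helper: words are counted by splitting on whitespace.
    """
    stripped = text.strip()
    return len(stripped) >= min_chars and len(stripped.split()) >= min_words


def filter_chunks(chunks: list[str], min_chars: int = 0, min_words: int = 0) -> list[str]:
    """Keep only the chunks that pass the quality check, preserving order."""
    return [c for c in chunks if passes_quality(c, min_chars, min_words)]
```

With both thresholds at 0 every chunk is kept, so the filter is effectively disabled at the lowest setting.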

Hugging Face output

Document preview

Requires MISTRAL_API_KEY for OCR, plus either a Hugging Face token or an active Hugging Face CLI login for writing the dataset.

Examples
Inputs: Max chunk size, Hugging Face token, Dataset repository, Append to existing dataset, Minimum characters, Minimum words