Convert PDFs and images into clean Hugging Face datasets with Mistral OCR, structure-aware chunking, validation, deduplication, and quality filtering.
Character budget per chunk. Use 0 to keep each document as one chunk.
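As a rough illustration of how a character budget might behave, here is a minimal sketch. The function name `chunk_text`, the parameter `max_chars`, and the choice to split on blank-line paragraph boundaries are all assumptions made for this example; the tool's actual structure-aware chunker may work differently.

```python
def chunk_text(text: str, max_chars: int) -> list[str]:
    """Greedily pack paragraphs into chunks of at most max_chars characters.

    Hypothetical helper: a budget of 0 returns the document as one chunk.
    """
    if max_chars < 0:
        raise ValueError("max_chars must be >= 0")
    if max_chars == 0:
        # Budget of 0: keep the whole document as a single chunk.
        return [text]

    chunks: list[str] = []
    current = ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        # A paragraph longer than the budget is hard-split into pieces.
        pieces = [para[i:i + max_chars] for i in range(0, len(para), max_chars)]
        for piece in pieces:
            candidate = piece if not current else current + "\n\n" + piece
            if len(candidate) <= max_chars:
                current = candidate
            else:
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks


doc = "aaaa\n\nbbbb\n\ncccccccccccc"
print(chunk_text(doc, 10))
print(chunk_text(doc, 0))
```

In this sketch the paragraph separator counts toward the budget, so every returned chunk stays at or below `max_chars` characters.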
Requires MISTRAL_API_KEY, a Hugging Face token, or an active Hugging Face CLI login.
MISTRAL_API_KEY
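A quick way to see which credentials are visible to a process is to inspect its environment. The sketch below is only an illustration: `available_credentials` is a hypothetical helper, `HF_TOKEN` is assumed to be the variable holding the Hugging Face token, and an active Hugging Face CLI login is not detected because it is stored outside the environment.

```python
import os
from collections.abc import Mapping


def available_credentials(env: Mapping[str, str] = os.environ) -> dict[str, bool]:
    """Report which credential environment variables are set and non-empty.

    Hypothetical helper: it does not detect a Hugging Face CLI login.
    """
    names = ("MISTRAL_API_KEY", "HF_TOKEN")
    return {name: bool(env.get(name, "").strip()) for name in names}


print(available_credentials({"MISTRAL_API_KEY": "example-key"}))
```

Passing a mapping explicitly, as in the usage line, makes the check easy to exercise without touching the real environment.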