
"Hugging Face has unveiled FinePDFs, the largest publicly available corpus built entirely from PDFs. The dataset spans 475 million documents in 1,733 languages, totaling roughly 3 trillion tokens. At 3.65 terabytes in size, FinePDFs introduces a new dimension to open training datasets by tapping into a resource long considered too complex and expensive to process. While most large-scale language model datasets rely on HTML sources such as Common Crawl, PDFs offer unique advantages."
"The FinePDFs pipeline addressed these challenges through a mix of text-based extraction (Docling) and GPU-powered OCR (RolmOCR), alongside deduplication, language identification, and PII anonymization. According to Hugging Face, this dual strategy allowed them to process documents at scale while maintaining quality across diverse formats. The dataset covers a wide array of languages, with English making up the largest share at over 1.1 trillion tokens."
"To evaluate FinePDFs, Hugging Face trained 1.67B parameter models on subsets of the dataset. Results showed that FinePDFs performs nearly on par with SmolLM-3 Web, a state-of-the-art HTML dataset. More importantly, combining the two provided a notable performance boost across benchmarks, reinforcing the idea that PDFs bring complementary knowledge. This emphasis on evaluation drew immediate questions from the community. On LinkedIn, data scientist Arthur Wuhrmann asked: What is the evaluation? What is the score?"
FinePDFs comprises 475 million PDF documents across 1,733 languages, totaling roughly 3 trillion tokens and occupying 3.65 terabytes. The corpus targets higher-quality, domain-specific material common in law, academia, and technical writing, and addresses PDF-specific extraction challenges like embedded text, OCR needs, and complex formatting. A pipeline combining text-based extraction (Docling) and GPU-powered OCR (RolmOCR) enabled scalable processing with deduplication, language identification, and PII anonymization. English provides over 1.1 trillion tokens, several languages exceed 100 billion tokens, and 978 languages have more than 1 million tokens. Training experiments show PDFs complement HTML sources and improve benchmark performance when combined.
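For readers who want to inspect the corpus, the released dataset can be streamed from the Hugging Face Hub with the `datasets` library. A short sketch follows; the repository id, subset name, and `text` field are assumptions based on the announcement, and the dataset card lists the exact per-language configurations.

```python
# Stream a small sample of FinePDFs without downloading the multi-terabyte corpus.
# Repo id, subset name, and field names are assumed; verify them on the dataset card.
from datasets import load_dataset

fine_pdfs = load_dataset(
    "HuggingFaceFW/finepdfs",  # assumed repository id
    "eng_Latn",                # assumed name of the English subset
    split="train",
    streaming=True,            # iterate lazily instead of materializing the dataset
)

for i, doc in enumerate(fine_pdfs):
    print(doc.get("text", "")[:200])  # 'text' field assumed; inspect the schema first
    if i == 2:
        break
```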