#multilingual-datasets

Artificial intelligence
from InfoQ
16 hours ago

Hugging Face Releases FinePDFs: A 3-Trillion-Token Dataset Built from PDFs

FinePDFs is a 3.65 TB corpus of 475 million PDF documents in 1,733 languages, offering about 3 trillion tokens of complementary, domain-rich data for LLM training.