Create a large, deduplicated dataset for LLM pre-training
Convert a model to Safetensors and open a PR
Build datasets using natural language
Manage and analyze labeled datasets
Build datasets and workflows using AI models
Upload files to a Hugging Face repository
Launch and explore labeled datasets
Explore recent datasets from Hugging Face Hub
A collection of parsers for LLM benchmark datasets
Create a report in BoAmps format
Display trending datasets and spaces
Search narrators and view network connections
List of French datasets not referenced on the Hub
TxT360 (Trillion eXtracted Text) is a large-scale pipeline for building a massive, deduplicated dataset for pre-training large language models (LLMs). It extracts and organizes text from a wide range of sources, yielding a diverse, comprehensive corpus for AI training.
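For readers who want to inspect the data directly, here is a minimal sketch of streaming a few records with the Hugging Face `datasets` library. The repo id `LLM360/TxT360` and the `text` field are assumptions; check the dataset card for the actual config names and schema.

```python
# Minimal sketch: stream a few records from TxT360 without downloading
# the full corpus. The repo id "LLM360/TxT360" and the "text" field are
# assumptions -- consult the dataset card for actual configs and schema.
from datasets import load_dataset

ds = load_dataset("LLM360/TxT360", split="train", streaming=True)
for i, record in enumerate(ds):
    print(record["text"][:200])  # preview the first 200 characters
    if i >= 2:                   # stop after three records
        break
```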
1. What makes TxT360: Trillion Extracted Text unique?
TxT360 stands out for its trillion-scale dataset and robust deduplication process, ensuring high-quality training data for LLMs.
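To make the idea concrete, the sketch below shows the simplest form of deduplication: dropping exact repeats by content hash. TxT360's actual pipeline is not spelled out here and is presumably more sophisticated (e.g., near-duplicate detection), so treat this purely as an illustration of the concept.

```python
# Illustrative exact-match deduplication by content hash. This is only
# the simplest form of the idea; TxT360's real pipeline is not described
# on this page and likely also handles near-duplicates.
import hashlib

def dedupe(docs):
    seen = set()
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc

docs = ["the same page", "the same page", "a different page"]
print(list(dedupe(docs)))  # ['the same page', 'a different page']
```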
2. Can I customize the dataset based on specific needs?
Yes, TxT360 offers customizable filters to tailor the dataset according to your requirements.
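As a rough sketch of what such filtering can look like client-side with the `datasets` API (the toy data and length threshold below are made up for illustration; TxT360's own filter options live in its interface):

```python
# Hedged sketch: keep only documents above a length threshold using
# datasets.Dataset.filter(). The sample data and the "text" field are
# illustrative stand-ins, not TxT360's actual filter configuration.
from datasets import Dataset

sample = Dataset.from_dict({
    "text": ["too short", "a much longer document that survives the filter"],
})
filtered = sample.filter(lambda r: len(r["text"]) > 20)
print(filtered["text"])  # ['a much longer document that survives the filter']
```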
3. Is TxT360 suitable for training multilingual LLMs?
Absolutely! TxT360 supports multiple languages, making it ideal for training models that handle diverse linguistic data.
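If records carry a language tag, selecting one language is a one-line filter. The `language` field below is hypothetical; verify the real schema against the dataset card before relying on it.

```python
# Sketch of per-language selection. The "language" metadata field is a
# hypothetical stand-in; TxT360's real schema may label languages
# differently (or not at all) -- verify against the dataset card.
from datasets import Dataset

sample = Dataset.from_dict({
    "text": ["Bonjour le monde", "Hello world"],
    "language": ["fr", "en"],
})
french_only = sample.filter(lambda r: r["language"] == "fr")
print(french_only["text"])  # ['Bonjour le monde']
```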