English🌎|简体中文🀄

Note

📚 In 2025, we open-sourced WanJuan 3.0 (WanJuan Silu), a high-quality multilingual dataset.

🧾 January 2025: Initial Release of Multilingual Pre-training Corpus: Primarily text-based data. Collected from publicly available web content, literature, patents, and more from 5 countries/regions. Total data size exceeds 1.2TB, with 300 billion tokens, a scale the team describes as internationally leading. The initial release includes Thai, Russian, Arabic, Korean, and Vietnamese sub-corpora, each exceeding 150GB. Using the "InternLM" Intelligent Tagging System, the research team categorized each sub-corpus into 7 major classes (e.g., history, politics, culture, real estate, shopping, weather, dining, encyclopedias, professional knowledge) and 32 sub-classes, ensuring localized linguistic and cultural relevance. The categorization is designed to let researchers easily retrieve data for diverse needs.
Download Links: Russian • Arabic • Korean • Vietnamese • Thai.

🌏 March 2025: Second Release of Multilingual Multimodal Corpus: The release comprises over 1.2TB of local text corpora from five countries. Each subset includes seven major categories and 34 subcategories, covering a wide range of local characteristics such as history, politics, culture, real estate, shopping, weather, dining, encyclopedic knowledge, and professional expertise. Download links for the five subsets are provided, and everyone is welcome to download and use them.

Comprises 4 data types:

Image-Text: Over 2 million images (raw size: 362.174GB).
Audio-Text: 200 hours of ultra-high-precision annotated audio per language.
Video-Text: Over 8 million video clips (raw duration: 28,000+ hours; refined to 16,000+ hours of high-quality content).
Localized SFT (Supervised Fine-Tuning): 184,000 SFT entries covering local culture, daily conversations, code, mathematics, and science. There are 23,000 entries per language, including 3,000 culturally unique Q&A pairs designed by local residents and 20,000 translated entries filtered through a quality-check pipeline combining rules and model scoring. Overall, the release covers 8 languages across 4 modalities, totaling 11.5 million entries, refined to industrial-grade quality for "ready-to-use" applications.
Download Links: 5 languages (Arabic, Russian, Korean, Vietnamese, Thai) • 3 languages (Serbian, Hungarian, Czech).
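
The "rules plus model scoring" quality check mentioned for the translated SFT entries can be illustrated with a minimal Python sketch. This is not the team's actual pipeline: the `SFTEntry` fields, the length limits, the 0.7 threshold, and the `toy_score` stand-in for a model scorer are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class SFTEntry:
    """Hypothetical record layout; the real dataset schema may differ."""
    prompt: str
    response: str


def passes_rules(entry: SFTEntry, min_chars: int = 5, max_chars: int = 4000) -> bool:
    """Cheap rule-based checks: non-empty fields and a plausible response length."""
    if not entry.prompt.strip() or not entry.response.strip():
        return False
    return min_chars <= len(entry.response) <= max_chars


def filter_entries(
    entries: Iterable[SFTEntry],
    score_fn: Callable[[SFTEntry], float],
    threshold: float = 0.7,  # assumed cutoff, not taken from the source
) -> List[SFTEntry]:
    """Keep entries that pass the rules and score at or above the threshold."""
    kept = []
    for entry in entries:
        if not passes_rules(entry):
            continue  # skip the scorer for entries the rules already reject
        if score_fn(entry) >= threshold:
            kept.append(entry)
    return kept


def toy_score(entry: SFTEntry) -> float:
    """Stand-in for a model-based quality score in [0, 1]; not a real model."""
    return 0.9 if entry.response.strip().endswith((".", "?", "!")) else 0.3


if __name__ == "__main__":
    sample = [
        SFTEntry("What is 2+2?", "The answer is 4."),
        SFTEntry("", "Orphan response."),
        SFTEntry("Hi", "no punctuation here"),
    ]
    for e in filter_entries(sample, toy_score):
        print(e.prompt)
```

Running the rules first keeps the more expensive scoring step off entries that are obviously malformed; in practice `toy_score` would be replaced by whatever scoring model is available.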

🔥🔥🔥 OpenDataLab provides an ecosystem of high-quality datasets for the community. It offers:

🌟 Extensive open data resources for AI models

● A fast, simple way to access open datasets
● 7700+ large-scale, high-quality open datasets for large models
● 1200+ open datasets for computer vision
● 200+ open datasets from CVPR
● Categorized datasets for hot topics

✨ Open-source data processing toolkits

● Data acquisition toolkits supporting large datasets
● Data acquisition toolkits supporting many kinds of tasks
● Open-source intelligent toolbox for labeling

💫 Dataset description language

● Format standardization
● DSDL: Dataset Description Language
● Define a CV dataset with DSDL
● 100+ CV datasets standardized by OpenDataLab

Check out our tutorial videos (in Chinese) to get started.

📣 We have upgraded the platform and launched a feature that lets authors upload datasets themselves. We invite you to use it to promote your open-source datasets, AI research results, and more, so that more people can find, obtain, and use your dataset.

For an introduction to the self-service dataset upload feature, see the help doc. You can create and share your dataset by following our guidelines. 💪

If you have any questions or run into obstacles, please feel free to contact us at OpenDataLab@pjlab.org.cn.
