
Roadmap

Albert Villanova del Moral edited this page Apr 22, 2021 · 1 revision

April 2021: short/mid term roadmap for Datasets

Topics

  • Datasets Hub
  • Datasets Viewer
  • AutoNLP
  • External integrations
  • Tasks + Evaluations
  • Datasets Streaming
  • Image/Audio support
  • Researcher usage
  • GitHub repository
  • Community/Contributors

Datasets Hub

  • Make the dataset script optional
  • Load processed datasets
  • Use cold storage (parquet)
  • More documentation + concrete tutorials
  • Integrate a validation tool in the CI for yaml tags + dataset card
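Making the dataset script optional implies inferring a packaged loader directly from the data files in a repository. A minimal sketch of that idea, assuming a simple extension-based dispatch — the function and mapping names here are illustrative, not the actual `datasets` API:

```python
from pathlib import Path

# Map of known data-file extensions to packaged loader names.
_PACKAGED_LOADERS = {
    ".csv": "csv",
    ".json": "json",
    ".jsonl": "json",
    ".txt": "text",
    ".parquet": "parquet",
}

def infer_loader(data_files):
    """Pick a single packaged loader from the extensions of the data files."""
    suffixes = {Path(f).suffix for f in data_files}
    loaders = {_PACKAGED_LOADERS[s] for s in suffixes if s in _PACKAGED_LOADERS}
    if len(loaders) != 1:
        raise ValueError(f"Cannot infer a single loader from: {sorted(suffixes)}")
    return loaders.pop()

loader = infer_loader(["train.csv", "test.csv"])
```

With such a dispatch, a repository containing only raw data files needs no hand-written dataset script.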

Datasets Viewer

  • Fix running out of disk space
  • Update the dependencies

AutoNLP

  • Fix methods that have memory issues: cast (WIP), filter, concatenate_datasets
  • Add audio type
  • How to download a processed dataset from the Hub
  • How to implement a universal dataset loader

External integrations

  • Improve error messages per file
  • Test using big JSON files
  • Allow getting dataset metadata without loading the data
  • Allow using the dataset builders as iterators
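The last item above can be sketched as follows — a builder whose example generator is exposed through `__iter__`, so examples stream lazily instead of being written to disk first. Class and method names mirror the builder convention but are illustrative:

```python
# Hypothetical sketch: a dataset builder usable as an iterator.
class ExampleBuilder:
    def __init__(self, rows):
        self.rows = rows

    def _generate_examples(self):
        # Yields (key, example) pairs, following the builder convention.
        for i, row in enumerate(self.rows):
            yield i, {"text": row}

    def __iter__(self):
        # Iterating the builder yields examples lazily, one at a time,
        # without materializing the full dataset.
        for _, example in self._generate_examples():
            yield example

builder = ExampleBuilder(["a", "b"])
examples = list(builder)
```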

Tasks + Evaluations

  • Add task-specific preparation
  • Define task-specific feature templates
  • Add task argument in load_dataset
  • Automatic post processing based on the supervised_keys passed in the info and the queried task
  • User-defined post processing to cover cases that automatic post processing can't handle (maybe using the builder's post_process method)
  • Sync with AutoNLP
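The automatic post-processing item above can be illustrated with a toy version: a per-task feature template maps the dataset's declared supervised_keys onto the task's canonical column names. The template table and function name are assumptions for illustration only:

```python
# Hypothetical task-specific feature templates: canonical column names per task.
TASK_TEMPLATES = {
    "text-classification": {"input": "text", "target": "label"},
}

def prepare_for_task(example, supervised_keys, task):
    """Rename columns from supervised_keys to the task's canonical names."""
    template = TASK_TEMPLATES[task]
    input_col, target_col = supervised_keys
    return {
        template["input"]: example[input_col],
        template["target"]: example[target_col],
    }

row = {"sentence": "great movie", "sentiment": 1}
prepared = prepare_for_task(row, ("sentence", "sentiment"), "text-classification")
```

A downstream consumer (e.g. an AutoML pipeline) can then rely on every text-classification dataset exposing the same `text`/`label` columns.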

Datasets Streaming

  • Use fsspec
  • Create a new class StreamingDataset
  • Enable the streaming of csv/text/json data
  • Set the format of a streaming dataset
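The streaming items above can be sketched together: a `StreamingDataset` (the class name comes from the roadmap; everything else is illustrative) that lazily parses examples from a line-based source — here an in-memory list standing in for a remote file opened through fsspec:

```python
import json

class StreamingDataset:
    """Hypothetical sketch: parse examples lazily from a line-based source."""

    def __init__(self, lines, fmt="json"):
        self.lines = lines
        self.fmt = fmt  # "csv"/"text" would get their own lazy parsers

    def __iter__(self):
        for line in self.lines:
            if self.fmt == "json":
                yield json.loads(line)
            else:
                yield {"text": line}

stream = StreamingDataset(['{"id": 1}', '{"id": 2}'])
first = next(iter(stream))
```

Nothing is downloaded or parsed until iteration starts, which is the point of streaming.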

Image/Audio support

  • Implement new feature types Image and Audio
    • Implement a decoding step
    • Either keep storing the path in the arrow data, or write the encoded bytes in the arrow data
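The two storage options above (path vs. encoded bytes in the arrow data) can be reconciled by a decoding step that accepts either form. A minimal sketch, with all names illustrative:

```python
import os
import tempfile

class MediaFeature:
    """Hypothetical Image/Audio feature: stores a path or encoded bytes."""

    def __init__(self, path=None, data=None):
        self.path = path  # "keep storing the path in the arrow data"
        self.data = data  # "or write the encoded bytes in the arrow data"

    def decode(self):
        # The decoding step resolves either storage form to raw bytes.
        if self.data is not None:
            return self.data
        with open(self.path, "rb") as f:
            return f.read()

# Path-backed and bytes-backed features decode to the same payload.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\x00\x01")
    tmp_path = f.name
from_path = MediaFeature(path=tmp_path).decode()
from_bytes = MediaFeature(data=b"\x00\x01").decode()
os.unlink(tmp_path)
```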

Researcher usage

  • Keep small datasets in memory and without caching
  • Load one split without downloading and processing the others
  • Update Wikipedia
    • Complete the dataset card with usage examples to show how to use a specific date
    • Preprocess recent Wikipedia dumps (en, fr, es, de...)
    • Optimize Beam pipelines
    • Process Wikipedia systematically
  • Add FAQs in the documentation or as a markdown file in the repo
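The "keep small datasets in memory and without caching" item suggests a size-threshold policy: below some cutoff, hold the dataset in RAM and write no cache files; above it, fall back to the memory-mapped disk cache. A sketch under assumed names — the 250 MB cutoff and the function are purely illustrative:

```python
MAX_IN_MEMORY_BYTES = 250 * 1024 * 1024  # assumed cutoff, for illustration

def storage_backend(dataset_size_bytes, keep_in_memory=None):
    """Choose in-memory storage or the disk cache for a dataset."""
    if keep_in_memory is not None:
        # An explicit user choice overrides the size heuristic.
        return "memory" if keep_in_memory else "disk-cache"
    return "memory" if dataset_size_bytes <= MAX_IN_MEMORY_BYTES else "disk-cache"
```

Small datasets then pay no serialization cost, while large ones keep the memory-mapped behaviour.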

GitHub repository

  • Try git lfs for dummy data
  • Fix conda build

Community

  • Share Roadmap
  • Add all the tasks on the Roadmap as GitHub Issues
  • Create GitHub Projects:
    • Core library
    • Addition of new datasets
  • Improve the docs on how to contribute to the core library
  • Refactor the code to make it simpler
