
Roadmap

Albert Villanova del Moral edited this page Apr 22, 2021 · 1 revision

April 2021: short/mid term roadmap for Datasets

Topics

  • Datasets Hub
  • Datasets Viewer
  • AutoNLP
  • External integrations
  • Tasks + Evaluations
  • Datasets Streaming
  • Image/Audio support
  • Researcher usage
  • GitHub repository
  • Community/Contributors

Datasets Hub

  • Make the dataset script optional
  • Load processed datasets
  • Use cold storage (parquet)
  • More documentation + concrete tutorials
  • Integrate a validation tool in the CI for yaml tags + dataset card
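Making the dataset script optional implies inferring a packaged loader directly from the data files in a repository. A minimal sketch of that idea, assuming a simple extension-based dispatch — the function and mapping names here are illustrative, not the actual `datasets` API:

```python
from pathlib import Path

# Map of known data-file extensions to packaged loader names.
_PACKAGED_LOADERS = {
    ".csv": "csv",
    ".json": "json",
    ".jsonl": "json",
    ".txt": "text",
    ".parquet": "parquet",
}

def infer_loader(data_files):
    """Pick a single packaged loader from the extensions of the data files."""
    suffixes = {Path(f).suffix for f in data_files}
    loaders = {_PACKAGED_LOADERS[s] for s in suffixes if s in _PACKAGED_LOADERS}
    if len(loaders) != 1:
        raise ValueError(f"Cannot infer a single loader from: {sorted(suffixes)}")
    return loaders.pop()

loader = infer_loader(["train.csv", "test.csv"])
```

With such a dispatch, a repository containing only raw data files needs no hand-written dataset script.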

Datasets Viewer

  • Fix running out of disk space
  • Update the dependencies

AutoNLP

  • Fix methods that have memory issues: cast (WIP), filter, concatenate_datasets
  • Add audio type
  • How to download a processed dataset from the Hub
  • How to implement a universal dataset loader

External integrations

  • Improve error messages per file
  • Test using big JSON files
  • Allow getting dataset metadata without loading the data
  • Allow using the dataset builders as iterators
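The last item above can be sketched as follows — a builder whose example generator is exposed through `__iter__`, so examples stream lazily instead of being written to disk first. Class and method names mirror the builder convention but are illustrative:

```python
# Hypothetical sketch: a dataset builder usable as an iterator.
class ExampleBuilder:
    def __init__(self, rows):
        self.rows = rows

    def _generate_examples(self):
        # Yields (key, example) pairs, following the builder convention.
        for i, row in enumerate(self.rows):
            yield i, {"text": row}

    def __iter__(self):
        # Iterating the builder yields examples lazily, one at a time,
        # without materializing the full dataset.
        for _, example in self._generate_examples():
            yield example

builder = ExampleBuilder(["a", "b"])
examples = list(builder)
```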

Tasks + Evaluations

  • Add task-specific preparation
  • Define task-specific feature templates
  • Add task argument in load_dataset
  • Automatic post processing based on the supervised_keys passed in the info and the queried task
  • User-defined post processing to cover cases that automatic post processing can't handle (maybe using the builder's post_process method)
  • Sync with AutoNLP
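The automatic post-processing item above can be illustrated with a toy version: a per-task feature template maps the dataset's declared supervised_keys onto the task's canonical column names. The template table and function name are assumptions for illustration only:

```python
# Hypothetical task-specific feature templates: canonical column names per task.
TASK_TEMPLATES = {
    "text-classification": {"input": "text", "target": "label"},
}

def prepare_for_task(example, supervised_keys, task):
    """Rename columns from supervised_keys to the task's canonical names."""
    template = TASK_TEMPLATES[task]
    input_col, target_col = supervised_keys
    return {
        template["input"]: example[input_col],
        template["target"]: example[target_col],
    }

row = {"sentence": "great movie", "sentiment": 1}
prepared = prepare_for_task(row, ("sentence", "sentiment"), "text-classification")
```

A downstream consumer (e.g. an AutoML pipeline) can then rely on every text-classification dataset exposing the same `text`/`label` columns.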

Datasets Streaming

  • Use fsspec
  • Create a new class StreamingDataset
  • Enable the streaming of csv/text/json data
  • Set the format of a streaming dataset
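The streaming items above can be sketched together: a `StreamingDataset` (the class name comes from the roadmap; everything else is illustrative) that lazily parses examples from a line-based source — here an in-memory list standing in for a remote file opened through fsspec:

```python
import json

class StreamingDataset:
    """Hypothetical sketch: parse examples lazily from a line-based source."""

    def __init__(self, lines, fmt="json"):
        self.lines = lines
        self.fmt = fmt  # "csv"/"text" would get their own lazy parsers

    def __iter__(self):
        for line in self.lines:
            if self.fmt == "json":
                yield json.loads(line)
            else:
                yield {"text": line}

stream = StreamingDataset(['{"id": 1}', '{"id": 2}'])
first = next(iter(stream))
```

Nothing is downloaded or parsed until iteration starts, which is the point of streaming.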

Image/Audio support

  • Implement new feature types Image and Audio
    • Implement a decoding step
    • Either keep storing the path in the arrow data, or write the encoded bytes in the arrow data
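The two storage options above (path vs. encoded bytes in the arrow data) can be reconciled by a decoding step that accepts either form. A minimal sketch, with all names illustrative:

```python
import os
import tempfile

class MediaFeature:
    """Hypothetical Image/Audio feature: stores a path or encoded bytes."""

    def __init__(self, path=None, data=None):
        self.path = path  # "keep storing the path in the arrow data"
        self.data = data  # "or write the encoded bytes in the arrow data"

    def decode(self):
        # The decoding step resolves either storage form to raw bytes.
        if self.data is not None:
            return self.data
        with open(self.path, "rb") as f:
            return f.read()

# Path-backed and bytes-backed features decode to the same payload.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\x00\x01")
    tmp_path = f.name
from_path = MediaFeature(path=tmp_path).decode()
from_bytes = MediaFeature(data=b"\x00\x01").decode()
os.unlink(tmp_path)
```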

Researcher usage

  • Keep small datasets in memory and without caching
  • Load one split without downloading and processing the others
  • Update Wikipedia
    • Complete the dataset card with usage examples to show how to use a specific date
    • Preprocess recent Wikipedia dumps (en, fr, es, de...)
    • Optimize Beam pipelines
    • Process Wikipedia systematically
  • Add FAQs in the documentation or as a markdown file in the repo
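The "keep small datasets in memory and without caching" item suggests a size-threshold policy: below some cutoff, hold the dataset in RAM and write no cache files; above it, fall back to the memory-mapped disk cache. A sketch under assumed names — the 250 MB cutoff and the function are purely illustrative:

```python
MAX_IN_MEMORY_BYTES = 250 * 1024 * 1024  # assumed cutoff, for illustration

def storage_backend(dataset_size_bytes, keep_in_memory=None):
    """Choose in-memory storage or the disk cache for a dataset."""
    if keep_in_memory is not None:
        # An explicit user choice overrides the size heuristic.
        return "memory" if keep_in_memory else "disk-cache"
    return "memory" if dataset_size_bytes <= MAX_IN_MEMORY_BYTES else "disk-cache"
```

Small datasets then pay no serialization cost, while large ones keep the memory-mapped behaviour.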

GitHub repository

  • Try git lfs for dummy data
  • Fix conda build

Community

  • Share Roadmap
  • Add all the tasks on the Roadmap as GitHub Issues
  • Create GitHub Projects:
    • Core library
    • Addition of new datasets
  • Improve the docs on how to contribute to the core library
  • Refactor the code to make it simpler
