

Sean-In-The-Library/LegalDatasets


LegalDatasets

This repository is a collection of scrapers that procure and structure various legal datasets.

We want to link to existing, already prepared legal datasets and to prepare new ones. These datasets can then be used for many downstream tasks, such as pretraining language models or judgment prediction.

Pretraining Datasets

Each pretraining dataset will be saved in JSON Lines (jsonl) format, one document per line, with the following fields:

  • id: unique identifier for the document (a uuid5 is generated if the source does not already provide one)
  • type: type of the document (e.g. legislation, caselaw, commentary)
  • language: language of the document
  • jurisdiction: jurisdiction of the document (e.g. germany)
  • title: title of the document
  • date: date of the document
  • url: URL of the document
  • metadata: additional metadata of the document (as a JSON object)
  • text: the text of the document
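
As an illustration of the schema above, here is a minimal sketch of how a scraper might build one record and serialize it as a single jsonl line. It is not taken from the repository's code: the helper names `make_record` and `to_jsonl_line`, the choice of `uuid.NAMESPACE_URL` as the uuid5 namespace, the ISO date string, the language code "de", and the example URL are all assumptions made for the example.

```python
import json
import uuid

# The nine fields listed in the README, in the same order.
FIELDS = ("id", "type", "language", "jurisdiction",
          "title", "date", "url", "metadata", "text")

# Assumption: the README does not say which uuid5 namespace is used,
# so this sketch derives the id from the document URL.
ID_NAMESPACE = uuid.NAMESPACE_URL


def make_record(text, url, doc_type, language, jurisdiction,
                title, date, metadata=None, doc_id=None):
    """Build one document record; generate a uuid5 id if none is given."""
    if doc_id is None:
        doc_id = str(uuid.uuid5(ID_NAMESPACE, url))
    return {
        "id": doc_id,
        "type": doc_type,
        "language": language,
        "jurisdiction": jurisdiction,
        "title": title,
        "date": date,
        "url": url,
        "metadata": metadata if metadata is not None else {},
        "text": text,
    }


def to_jsonl_line(record):
    """Serialize a record as one line of JSON, checking that all fields exist."""
    missing = [name for name in FIELDS if name not in record]
    if missing:
        raise ValueError(f"record is missing fields: {missing}")
    # ensure_ascii=False keeps non-ASCII legal text readable in the file;
    # json.dumps escapes embedded newlines, so the result stays on one line.
    return json.dumps(record, ensure_ascii=False)


if __name__ == "__main__":
    record = make_record(
        text="Die Würde des Menschen ist unantastbar.",
        url="https://example.org/legislation/1",  # placeholder URL
        doc_type="legislation",
        language="de",
        jurisdiction="germany",
        title="Beispiel",
        date="1949-05-23",
    )
    print(to_jsonl_line(record))
```

Deriving the id from a stable key such as the URL makes the identifier reproducible across scraper runs, which is the usual reason to prefer uuid5 over a random uuid4. Whether the repository's scrapers do the same is not stated in this README.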

These pretraining datasets will be used to train the language models.

Finetuning Datasets

We select a few (10 to 20) datasets to form a large-scale multilingual, multi-jurisdictional benchmark (LEXTREME) for finetuning.
