A list of datasets/corpora for NLP tasks, in reverse chronological order.

karthikncode/nlp-datasets

Datasets for Natural Language Processing

This is a list of datasets/corpora for NLP tasks, in reverse chronological order. Suggestions and pull requests are welcome. The goal is to make this a collaborative effort to maintain an updated list of quality datasets.

Areas

Question Answering

  • (NLVR) A Corpus of Natural Language for Visual Reasoning, 2017 [paper] [data]
  • (MS MARCO) MS MARCO: A Human Generated MAchine Reading COmprehension Dataset, 2016 [paper] [data]
  • (NewsQA) NewsQA: A Machine Comprehension Dataset, 2016 [paper] [data]
  • (SQuAD) SQuAD: 100,000+ Questions for Machine Comprehension of Text, 2016 [paper] [data]
  • (GraphQuestions) On Generating Characteristic-rich Question Sets for QA Evaluation, 2016 [paper] [data]
  • (Story Cloze) A Corpus and Cloze Evaluation for Deeper Understanding of Commonsense Stories, 2016 [paper] [data]
  • (Children's Book Test) The Goldilocks Principle: Reading Children's Books with Explicit Memory Representations, 2015 [paper] [data]
  • (SimpleQuestions) Large-scale Simple Question Answering with Memory Networks, 2015 [paper] [data]
  • (WikiQA) WikiQA: A Challenge Dataset for Open-Domain Question Answering, 2015 [paper] [data]
  • (CNN-DailyMail) Teaching Machines to Read and Comprehend, 2015 [paper] [code to generate] [data]
  • (QuizBowl) A Neural Network for Factoid Question Answering over Paragraphs, 2014 [paper] [data]
  • (MCTest) MCTest: A Challenge Dataset for the Open-Domain Machine Comprehension of Text, 2013 [paper] [data] [alternate data link]
  • (QASent) What is the Jeopardy Model? A Quasi-Synchronous Grammar for QA, 2007 [paper] [data]

Dialogue Systems

  • (Ubuntu Dialogue Corpus) The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems, 2015 [paper] [data]

Goal-Oriented Dialogue Systems

  • (Frames) Frames: A Corpus for Adding Memory to Goal-Oriented Dialogue Systems, 2016 [paper] [data]
  • (DSTC 2 & 3) Dialog State Tracking Challenge 2 & 3, 2013 [paper] [data]
