[WIP] torch-text

This repository consists of:

torchtext.data : Generic data loaders, abstractions, and iterators for text
torchtext.datasets : Pre-built loaders for common NLP datasets
(maybe) torchtext.models : Model definitions and pre-trained models for popular NLP examples. The situation differs from vision, where people can download a pretrained ImageNet model and immediately use it for other tasks, so it might make more sense to leave NLP models in the torch/examples repo.

Data

The data module provides the following:

Ability to describe declaratively how to load a custom NLP dataset that's in a "normal" format:

from torchtext import data, datasets

# TSV: fields are listed in column order as (name, Field) pairs
pos = data.TabularDataset(
    path='data/pos/pos_wsj_train.tsv', format='tsv',
    fields=[('text', data.Field()),
            ('labels', data.Field())])
# JSON: fields is a dict mapping each JSON key to a (name, Field) pair
sentiment = data.TabularDataset(
    path='data/sentiment/train.json', format='json',
    fields={'sentence_tokenized': ('text', data.Field(sequential=True)),
            'sentiment_gold': ('labels', data.Field(sequential=False))})
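
To make the declarative mapping above concrete, here is a minimal plain-Python sketch of what pairing TSV columns with named fields amounts to. It is not torchtext's implementation: the `load_tsv` helper, its whitespace tokenization, and the sample rows are all hypothetical.

```python
import csv
import io


def load_tsv(lines, field_names, tokenize=str.split):
    """Map each TSV column to a named field, tokenizing each column's text."""
    examples = []
    for row in csv.reader(lines, delimiter="\t"):
        examples.append({name: tokenize(col) for name, col in zip(field_names, row)})
    return examples


# Hypothetical two-line POS file: words in column 1, tags in column 2.
sample = io.StringIO("the cat sat\tDT NN VBD\ndogs bark\tNNS VBP\n")
examples = load_tsv(sample, ["text", "labels"])
print(examples[0]["text"])    # ['the', 'cat', 'sat']
print(examples[0]["labels"])  # ['DT', 'NN', 'VBD']
```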

Ability to define a preprocessing pipeline:

src = data.Field(tokenize=my_custom_tokenizer)
trg = data.Field(tokenize=my_custom_tokenizer)
mt_train = datasets.TranslationDataset(
    path='data/mt/wmt16-ende.train', exts=('.en', '.de'),
    fields=(src, trg))
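
The snippet above leaves `my_custom_tokenizer` undefined. Since it is passed as the `tokenize` argument, a function from a string to a list of tokens is the assumed shape. One possible sketch follows; the regular expression and the lowercasing are illustrative choices, not something the library prescribes.

```python
import re

# Matches runs of word characters, or any single non-space punctuation mark.
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def my_custom_tokenizer(text):
    """Lowercase the input and split it into word and punctuation tokens."""
    return _TOKEN_RE.findall(text.lower())


print(my_custom_tokenizer("Hello, world!"))  # ['hello', ',', 'world', '!']
```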

Batching, padding, and numericalizing (including building a vocabulary object):

# continuing from above
mt_dev = datasets.TranslationDataset(
    path='data/mt/newstest2014', exts=('.en', '.de'),
    fields=(src, trg))
src.build_vocab(mt_train, max_size=80000)
trg.build_vocab(mt_train, max_size=40000)
# mt_dev shares the fields, so it shares their vocab objects
train_iter = data.BucketIterator(
    dataset=mt_train, batch_size=32,
    sort_key=lambda x: data.interleave_keys(len(x.src), len(x.trg)))
# usage
>>> next(iter(train_iter))
<data.Batch(batch_size=32, src=[LongTensor (32, 25)], trg=[LongTensor (32, 28)])>
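
For readers unfamiliar with the terms, the following plain-Python sketch shows what building a size-limited vocabulary, numericalizing, and padding mean conceptually. It is an illustration under simplifying assumptions, not torchtext's code: the helper names, the `<pad>`/`<unk>` tokens, and their index assignments are chosen for this sketch only.

```python
from collections import Counter

PAD, UNK = "<pad>", "<unk>"


def build_vocab(sentences, max_size=None):
    """Return a token -> index map holding the most frequent tokens."""
    counts = Counter(tok for sent in sentences for tok in sent)
    # Sort by descending frequency, then alphabetically, for a stable order.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = [PAD, UNK] + [tok for tok, _ in ordered[:max_size]]
    return {tok: i for i, tok in enumerate(itos)}


def numericalize(batch, stoi):
    """Pad every sentence to the longest one in the batch and map tokens to ids."""
    width = max(len(sent) for sent in batch)
    return [
        [stoi.get(tok, stoi[UNK]) for tok in sent] + [stoi[PAD]] * (width - len(sent))
        for sent in batch
    ]


train = [["a", "b", "a"], ["a", "c"]]
stoi = build_vocab(train, max_size=2)  # keeps only "a" and "b"
print(numericalize(train, stoi))       # [[2, 3, 2], [2, 1, 0]]
```

Tokens that fall outside the size limit ("c" here) map to the unknown index, and the shorter sentence is right-padded so the batch is rectangular.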

Wrapper for dataset splits (train, validation, test):

TEXT = data.Field()
LABELS = data.Field()
train, val, test = data.TabularDataset.splits(
    path='/data/pos_wsj/pos_wsj', train='_train.tsv',
    validation='_dev.tsv', test='_test.tsv', format='tsv',
    fields=[('text', TEXT), ('labels', LABELS)])
train_iter, val_iter, test_iter = data.BucketIterator.splits(
    (train, val, test), batch_sizes=(16, 256, 256),
    sort_key=lambda x: len(x.text), device=0)
TEXT.build_vocab(train)
LABELS.build_vocab(train)
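
The `sort_key` arguments in the iterator examples exist so that examples of similar length land in the same batch, which keeps padding small. A rough plain-Python sketch of that idea is below. Both functions are simplified stand-ins written for illustration (this `interleave_keys` is one way to interleave the bits of two lengths, and `bucket_batches` is a hypothetical helper); the real iterator's batching details may differ.

```python
def interleave_keys(a, b, bits=16):
    """Interleave the bits of a and b so sorting groups examples close in both."""
    key = 0
    for i in reversed(range(bits)):
        key = (key << 1) | ((a >> i) & 1)
        key = (key << 1) | ((b >> i) & 1)
    return key


def bucket_batches(examples, batch_size, sort_key):
    """Sort examples by sort_key, then cut the sorted list into batches."""
    ordered = sorted(examples, key=sort_key)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]


# Hypothetical (source_length, target_length) pairs.
pairs = [(9, 10), (2, 3), (8, 9), (3, 2)]
batches = bucket_batches(pairs, 2, lambda p: interleave_keys(p[0], p[1]))
print(batches)  # [[(2, 3), (3, 2)], [(8, 9), (9, 10)]]
```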

Datasets

Some datasets it would be useful to have built in:

bAbI and successors from FAIR
SST (done) and IMDb sentiment
SNLI (done)
Penn Treebank (for language modeling (done) and parsing)
WMT and/or IWSLT machine translation
SQuAD

See the "test" directory for examples of dataset usage.
