Natural Language Corpus Data: Beautiful Data
This directory contains code and data to accompany the chapter "Natural Language Corpus Data" from the book Beautiful Data (Segaran and Hammerbacher, 2009).
If you like this, you may also like How to Write a Spelling Corrector.
The data files are derived from the Google Web Trillion Word Corpus, as described by Thorsten Brants and Alex Franz, and distributed by the Linguistic Data Consortium.
Code copyright (c) 2008-2009 by Peter Norvig. You are free to use this code under the MIT license.
To run the code, download the files listed below. Then, from a shell, execute python -i ngrams.py (or start a Python IDE and import ngrams). To check that everything works, call test(). Note that the hillclimbing function has a random component, so with bad luck some of the tests may fail even when everything is correctly installed. (It is unlikely that they will fail twice in a row.)
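
For example, a minimal shell session:

    $ python -i ngrams.py
    >>> test()   # runs the unit tests in ngrams-test.txt
    >>> test()   # if a random hillclimbing test fails, a second run should pass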
Files for Download
0.7 MB  ch14.pdf          The chapter from the book.
0.0 MB  ngrams.py         The Python code for everything in the chapter.
0.0 MB  ngrams-test.txt   Unit tests, run by the Python function test().
4.9 MB  count_1w.txt      The 1/3 million most frequent words, all lowercase, with counts. (Called vocab_common in the chapter, but I changed the file names here.)
5.6 MB  count_2w.txt      The 1/4 million most frequent two-word (lowercase) bigrams, with counts.
0.0 MB  count_2l.txt      Counts for all 2-letter (lowercase) bigrams.
0.2 MB  count_3l.txt      Counts for all 3-letter (lowercase) trigrams.
0.0 MB  count_1edit.txt   Counts for all single-edit spelling corrections, derived from the file spell-errors.txt.
0.5 MB  spell-errors.txt  A collection of "right: wrong1, wrong2" spelling mistakes, collected from Wikipedia and from Roger Mitton.
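
If you want to read these data files yourself rather than through ngrams.py, the sketch below shows one way to do it. It assumes the count files are tab-separated "token<TAB>count" lines and that spell-errors.txt follows the "right: wrong1, wrong2" layout described above; the function names are mine, not from the chapter, and lines carrying any extra annotations are not handled.

    from collections import Counter

    def load_counts(filename):
        "Read a count file (assumed 'token<TAB>count' per line) into a Counter."
        counts = Counter()
        for line in open(filename):
            token, count = line.split('\t')  # assumed: exactly one tab per line
            counts[token] = int(count)
        return counts

    def load_spell_errors(filename):
        "Read 'right: wrong1, wrong2' lines into a dict of {right: [wrongs]}."
        errors = {}
        for line in open(filename):
            right, wrongs = line.split(':', 1)
            errors[right.strip()] = [w.strip() for w in wrongs.split(',')]
        return errors

    # Example usage, with the file names listed above:
    # unigrams = load_counts('count_1w.txt')
    # errors   = load_spell_errors('spell-errors.txt')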
The following files are not referenced in the chapter, but may be useful to you.