ScriptSmith/topwords

Top English words

A comprehensive list of the top 3 million+ English words in Project Gutenberg, along with their frequencies. Data is sourced from Allison Parrish's awesome gutenberg-dammit project.

Usage

Use the word list:

$ head words.txt
the
of
and
to
a
in
that
i
he
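
Because words.txt is ordered from most to least frequent, a word's line number is its rank. A minimal sketch of a lookup, assuming a POSIX shell with grep and cut; the helper name word_rank is hypothetical and not part of this repo:

```shell
# word_rank is a hypothetical helper, not part of this repo.
# Usage: word_rank WORD [FILE]   (FILE defaults to words.txt)
# Prints the 1-based line number (rank) of WORD, or nothing if it is absent.
word_rank() {
    # -F: treat WORD as a fixed string, -x: match whole lines only,
    # -n: prefix the line number, -m 1: stop after the first hit.
    grep -n -x -m 1 -F -- "$1" "${2:-words.txt}" | cut -d ':' -f 1
}
```

With the list shown above, `word_rank that` would print 7.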

Use the word count list:

$ head counts.txt
169852828 the
92493412 of
83626800 and
69017783 to
54796935 a
47554786 in
30598554 that
30324861 i
27900933 he
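
Each line of counts.txt is a count, a space, then the word, so a single word's frequency can be pulled out with awk. This is a small sketch rather than anything shipped with the repo; the helper name word_count is an assumption:

```shell
# word_count is a hypothetical helper, not part of this repo.
# Usage: word_count WORD [FILE]   (FILE defaults to counts.txt)
# Prints the count recorded for WORD, or nothing if it is absent.
word_count() {
    # Field 1 is the count, field 2 is the word; stop at the first match.
    awk -v w="$1" '$2 == w { print $1; exit }' "${2:-counts.txt}"
}
```

With the counts shown above, `word_count that` would print 30598554.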

Download

Clone this repo:

git clone https://github.com/scriptsmith/topwords.git
cd topwords

Recreating

Tools used:

jq
parallel
grep
sed
GNU coreutils
- tr
- sort
- uniq
- cut

The following pattern was used to find words in the corpus:

[A-Za-z]+('[A-Za-z]+)?(?<!('s))
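
Read left to right, the pattern matches a run of letters, optionally followed by an apostrophe and more letters (so contractions like "don't" stay whole), and the trailing negative lookbehind rejects any match that ends in 's (so a possessive like "dog's" is cut back to "dog"). A quick way to try it on sample text, assuming a grep with -P (PCRE) support such as GNU grep; the PATTERN variable and extract_words helper are hypothetical names used only for this sketch:

```shell
# The pattern from this README, held in a shell variable for experimentation.
PATTERN="[A-Za-z]+('[A-Za-z]+)?(?<!('s))"

# extract_words is a hypothetical helper: reads text on stdin and
# prints each match on its own line (-o), using Perl-style regex (-P).
extract_words() {
    grep -oP -e "$PATTERN"
}

# Try it on a sentence containing a contraction and a possessive.
printf '%s\n' "Don't touch the dog's bone" | extract_words
```

The real pipeline reads the same pattern from pattern.txt with -f instead of passing it on the command line.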

Clone this repo

git clone https://github.com/scriptsmith/topwords.git
cd topwords

Get the data

Download and extract the gutenberg-dammit data. This is a free resource, so don't abuse it.

Extract the words

Find words in the 40,000+ books whose only listed language is English:

jq -r '.[] | select((.Language | length) == 1 and .Language[0] == "English") | "gutenberg-dammit-files/" + ."gd-path"' gutenberg-dammit-files/gutenberg-metadata.json | parallel "grep -ohPf pattern.txt {}" | tr '[:upper:]' '[:lower:]' > allwords.txt

Sort and count words

If your temporary directory can't store more than 60 GiB, change the value of TMP_DIR:

TMP_DIR=/tmp
sort -T "$TMP_DIR" allwords.txt | uniq -c | sed 's/^\s*//' | sort -nr > counts.txt
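
To see what each stage of that pipeline contributes, here is the same chain run on a tiny inline sample. It is only an illustration: count_words is a hypothetical helper name, and the -T option is left out because the sample needs no temporary space.

```shell
# count_words is a hypothetical helper wrapping the same stages as the
# command above. It reads one lowercased word per line on stdin.
count_words() {
    sort |                 # group identical words together
        uniq -c |          # collapse each group into "<padding><count> <word>"
        sed 's/^\s*//' |   # strip the padding uniq adds (\s is a GNU sed extension)
        sort -nr           # order by count, highest first
}

printf '%s\n' the of the and the of | count_words
# prints:
# 3 the
# 2 of
# 1 and
```

The first sort is required because uniq only merges adjacent duplicate lines.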

Remove word counts

cut -d ' ' -f2 counts.txt > words.txt
