A list of the top 3 million+ English words in Project Gutenberg, along with their frequency.

ScriptSmith/topwords


Top English words

A comprehensive list of the top 3 million+ English words in Project Gutenberg. Data is sourced from Allison Parrish's awesome gutenberg-dammit project.

Usage

Use the word list:

$ head words.txt
the
of
and
to
a
in
that
i
he

Use the word count list:

$ head counts.txt
169852828 the
92493412 of
83626800 and
69017783 to
54796935 a
47554786 in
30598554 that
30324861 i
27900933 he
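
Each line of counts.txt is a count, a single space, and a word, so standard tools can look up a word's frequency or rank. The following is a minimal sketch and not part of the original repository. It builds a small hypothetical sample file from the first rows shown above so that it runs without the full data. Point the commands at counts.txt for real use.

```shell
# Hypothetical sample in the same "<count> <word>" format as counts.txt.
sample_counts=$(mktemp)
printf '%s\n' '169852828 the' '92493412 of' '83626800 and' > "$sample_counts"

# Frequency of a word: match the word as the whole second field, keep the count.
of_count=$(grep -m1 ' of$' "$sample_counts" | cut -d ' ' -f1)

# Rank of a word: the file is sorted by descending count,
# so the line number of the match is the rank.
and_rank=$(grep -n -m1 ' and$' "$sample_counts" | cut -d ':' -f1)

echo "of: $of_count occurrences, and: rank $and_rank"
rm -f "$sample_counts"
```

Anchoring the match with a leading space and a trailing `$` avoids hitting words that merely contain the search term.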

Download

or

Clone this repo:

git clone https://github.com/scriptsmith/topwords.git
cd topwords

Recreating

Tools used:

  • jq
  • parallel
  • grep
  • sed
  • GNU coreutils
    • tr
    • sort
    • uniq
    • cut

The following pattern was used to find words in the corpus:

[A-Za-z]+('[A-Za-z]+)?(?<!('s))
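
As a quick check of what this pattern keeps, the sketch below runs it over one made-up sentence with GNU grep in -P (PCRE) mode, reading the pattern from a file the same way the extraction step does. The sample sentence and the temporary file are illustrative only. The trailing lookbehind rejects any match that ends in 's, so possessives are cut back to the base word while other contractions stay whole.

```shell
# Write the pattern to a file, mirroring pattern.txt in the extraction step.
# The quoted heredoc delimiter keeps the shell from touching the pattern.
pattern_file=$(mktemp)
cat > "$pattern_file" <<'EOF'
[A-Za-z]+('[A-Za-z]+)?(?<!('s))
EOF

# Made-up sample text with one possessive and one contraction.
# Requires a grep built with PCRE support (-P); -o prints one match per line.
matches=$(printf '%s\n' "The dog's bone isn't here" | grep -oPf "$pattern_file")

printf '%s\n' "$matches"
rm -f "$pattern_file"
```

With GNU grep this should print The, dog, bone, isn't and here on separate lines: "dog's" is reduced to "dog", and the leftover "s" is not emitted because it also ends a 's sequence.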

Clone this repo

git clone https://github.com/scriptsmith/topwords.git
cd topwords

Get the data

Download and extract the gutenberg-dammit data. This is a free resource, so don't abuse it.

Extract the words

Find words in the 40,000+ books whose only listed language is English:

jq -r '.[] | select((.Language | length) == 1 and .Language[0] == "English") | "gutenberg-dammit-files/" + ."gd-path"' gutenberg-dammit-files/gutenberg-metadata.json | parallel "grep -ohPf pattern.txt {}" | tr '[:upper:]' '[:lower:]' > allwords.txt

Sort and count words

If your temporary directory can't hold more than 60 GiB, set TMP_DIR to a directory that can:

TMP_DIR=/tmp
sort -T "$TMP_DIR" allwords.txt | uniq -c | sed 's/^\s*//' | sort -nr > counts.txt
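
To see what each stage of this pipeline contributes, here is a small sketch on a made-up six-word input (the temporary file and its contents are hypothetical). The first sort groups identical words, uniq -c prefixes each group with a padded count, sed strips the leading padding (the `\s` escape is a GNU sed extension), and the final sort -nr orders the lines by count, highest first.

```shell
# Hypothetical stand-in for allwords.txt: one lowercase word per line.
toy_words=$(mktemp)
printf '%s\n' the of the and the of > "$toy_words"

# Same stages as the real command, without -T because the input is tiny.
toy_counts=$(sort "$toy_words" | uniq -c | sed 's/^\s*//' | sort -nr)
printf '%s\n' "$toy_counts"

# Dropping the first field leaves the words in frequency order.
toy_top=$(printf '%s\n' "$toy_counts" | cut -d ' ' -f2)
printf '%s\n' "$toy_top"
rm -f "$toy_words"
```

The toy input has no ties, so the order is fully determined by the counts: "3 the", "2 of", "1 and".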

Remove word counts

cut -d ' ' -f2 counts.txt > words.txt
