
JamSpell


JamSpell is a spell checking library with the following features:

accurate - it considers the surrounding words (context) for better correction
fast - about 5K words per second
multi-language - it is written in C++ and available for many languages through SWIG bindings

Colab example

JamSpellPro

jamspell.com - check out the new JamSpell version with the following features:

Improved accuracy (catboost gradient boosted decision trees candidates ranking model)
Splits merged words
Pre-trained models for many languages (small, medium, large) for:
en, ru, de, fr, it, es, tr, uk, pl, nl, pt, hi, no
Ability to add words / sentences at runtime
Fine-tuning / additional training
Memory optimization for training large models
Static dictionary support
Built-in Java, C#, Ruby support
Windows support

Content

Benchmarks
Usage
- Python
- C++
- Other languages
- HTTP API
Train

Benchmarks

| | Errors | Top 7 Errors | Fix Rate | Top 7 Fix Rate | Broken | Speed (words/second) |
|---|---|---|---|---|---|---|
| JamSpell | 3.25% | 1.27% | 79.53% | 84.10% | 0.64% | 4854 |
| Norvig | 7.62% | 5.00% | 46.58% | 66.51% | 0.69% | 395 |
| Hunspell | 13.10% | 10.33% | 47.52% | 68.56% | 7.14% | 163 |
| Dummy | 13.14% | 13.14% | 0.00% | 0.00% | 0.00% | - |

The model was trained on 300K Wikipedia sentences + 300K news sentences (English). 95% was used for training and 5% for evaluation. An error model was used to generate misspelled text from the original. The JamSpell corrector was compared with Norvig's corrector, Hunspell, and a dummy corrector (no corrections).

We used the following metrics:

Errors - percent of words with errors after the spell checker has processed the text
Top 7 Errors - percent of words missing from the top 7 candidates
Fix Rate - percent of errored words fixed by the spell checker
Top 7 Fix Rate - percent of errored words fixed by one of the top 7 candidates
Broken - percent of non-errored words broken by the spell checker
Speed - number of words per second
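The definitions above can be made concrete with a small sketch. It assumes the evaluation data is available as aligned triples of (original word, word after error injection, word after correction); the function name `spelling_metrics` and this data layout are illustrative assumptions and are not taken from evaluate/evaluate.py.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: (original word, errored word, corrected word).
Triple = Tuple[str, str, str]


def spelling_metrics(triples: List[Triple]) -> Dict[str, float]:
    """Compute Errors, Fix Rate and Broken (in percent) from aligned words."""
    # Words that the error model changed, and words it left intact.
    errored = [(o, c) for o, e, c in triples if e != o]
    clean = [(o, c) for o, e, c in triples if e == o]

    def percent(part: int, whole: int) -> float:
        return 100.0 * part / whole if whole else 0.0

    return {
        # Words still wrong after correction, over all words.
        "errors": percent(sum(c != o for o, _, c in triples), len(triples)),
        # Errored words restored to the original, over all errored words.
        "fix_rate": percent(sum(c == o for o, c in errored), len(errored)),
        # Intact words changed by the corrector, over all intact words.
        "broken": percent(sum(c != o for o, c in clean), len(clean)),
    }


sample = [
    ("the", "the", "the"),              # intact word, left alone
    ("best", "begt", "best"),           # error fixed
    ("spell", "spell", "spill"),        # intact word broken
    ("checker", "cherken", "chicken"),  # error not fixed
]
metrics = spelling_metrics(sample)
```

With a dummy corrector (corrected word equals errored word), Fix Rate and Broken are both 0 and Errors equals the share of errored words, which matches the Dummy row in the tables.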

To ensure that our model is not overfitted to Wikipedia + news, we checked it on the text of "The Adventures of Sherlock Holmes":

| | Errors | Top 7 Errors | Fix Rate | Top 7 Fix Rate | Broken | Speed (words/second) |
|---|---|---|---|---|---|---|
| JamSpell | 3.56% | 1.27% | 72.03% | 79.73% | 0.50% | 5524 |
| Norvig | 7.60% | 5.30% | 35.43% | 56.06% | 0.45% | 647 |
| Hunspell | 9.36% | 6.44% | 39.61% | 65.77% | 2.95% | 284 |
| Dummy | 11.16% | 11.16% | 0.00% | 0.00% | 0.00% | - |

More details on reproducing these results are available in the "Train" section.

Usage

Python

Install swig3 (usually it is in your distro package manager)
Install jamspell:

pip install jamspell

Download or train a language model
Use it:

import jamspell
corrector = jamspell.TSpellCorrector()
corrector.LoadLangModel('en.bin')
# Correct a whole text fragment at once
corrector.FixFragment('I am the begt spell cherken!')
# u'I am the best spell checker!'
# The second argument is the index of the word to get candidates for (3 is 'begt')
corrector.GetCandidates(['i', 'am', 'the', 'begt', 'spell', 'cherken'], 3)
# (u'best', u'beat', u'belt', u'bet', u'bent', ... )
# Index 5 is 'cherken'
corrector.GetCandidates(['i', 'am', 'the', 'begt', 'spell', 'cherken'], 5)
# (u'checker', u'chicken', u'checked', u'wherein', u'coherent', ...)
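Since GetCandidates takes the word list plus the index of one word, collecting suggestions for a whole sentence means calling it once per position. A minimal sketch of that loop follows; the helper name `candidates_per_word` and the `limit` parameter are illustrative assumptions, and the corrector is passed in as an argument so that only the GetCandidates call shown above is relied on.

```python
from typing import List, Sequence, Tuple


def candidates_per_word(corrector, words: Sequence[str],
                        limit: int = 3) -> List[Tuple[str, List[str]]]:
    """Return (word, first `limit` candidates) for every position in `words`."""
    result = []
    for position, word in enumerate(words):
        # The second argument selects which word of the list to correct.
        candidates = corrector.GetCandidates(list(words), position)
        result.append((word, list(candidates)[:limit]))
    return result


# Usage with a loaded model (requires jamspell and a model file):
# corrector = jamspell.TSpellCorrector()
# corrector.LoadLangModel('en.bin')
# candidates_per_word(corrector, ['i', 'am', 'the', 'begt', 'spell', 'cherken'])
```

A list of pairs is returned rather than a dictionary so that repeated words in the sentence keep separate entries.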

C++

Add jamspell and contrib dirs to your project
Use it:

#include <jamspell/spell_corrector.hpp>
int main(int argc, const char** argv) {
 NJamSpell::TSpellCorrector corrector;
 corrector.LoadLangModel("model.bin");
 corrector.FixFragment(L"I am the begt spell cherken!");
 // "I am the best spell checker!"
 // The second argument is the index of the word to get candidates for (3 is "begt")
 corrector.GetCandidates({L"i", L"am", L"the", L"begt", L"spell", L"cherken"}, 3);
 // "best", "beat", "belt", "bet", "bent", ...
 // Index 5 is "cherken"
 corrector.GetCandidates({L"i", L"am", L"the", L"begt", L"spell", L"cherken"}, 5);
 // "checker", "chicken", "checked", "wherein", "coherent", ...
 return 0;
}

Other languages

You can generate extensions for other languages by following the SWIG tutorial. The SWIG interface file is jamspell.i. Pull requests with build scripts are welcome.

HTTP API

Install cmake
Clone and build jamspell (it includes an HTTP server):

git clone https://github.com/bakwc/JamSpell.git
cd JamSpell
mkdir build
cd build
cmake ..
make

Download or train a language model
Run the HTTP server:

./web_server/web_server en.bin localhost 8080

GET Request example:

$ curl "http://localhost:8080/fix?text=I am the begt spell cherken"
I am the best spell checker

POST Request example:

$ curl -d "I am the begt spell cherken" http://localhost:8080/fix
I am the best spell checker
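The GET request can also be issued from a script. The sketch below uses only the Python standard library; the helper names `build_fix_url` and `fix_text` are illustrative, and it assumes that the server accepts a percent-encoded `text` parameter and answers with UTF-8 text, neither of which is stated above.

```python
from urllib.parse import quote
from urllib.request import urlopen


def build_fix_url(text: str, host: str = "localhost", port: int = 8080) -> str:
    """Return the GET URL for the /fix endpoint with the text percent-encoded."""
    return f"http://{host}:{port}/fix?text={quote(text)}"


def fix_text(text: str, host: str = "localhost", port: int = 8080) -> str:
    """Send the text to a running web_server and return the response body."""
    # Assumption: the response body is the corrected text encoded as UTF-8.
    with urlopen(build_fix_url(text, host, port), timeout=5) as response:
        return response.read().decode("utf-8")


# Usage, with the server from the step above running on localhost:8080:
# print(fix_text("I am the begt spell cherken"))
```

Percent-encoding avoids sending raw spaces in the URL, which some HTTP clients reject.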

Candidates request example:

curl "http://localhost:8080/candidates?text=I am the begt spell cherken"
# or
curl -d "I am the begt spell cherken" http://localhost:8080/candidates

{
 "results": [
 {
 "candidates": [
 "best",
 "beat",
 "belt",
 "bet",
 "bent",
 "beet",
 "beit"
 ],
 "len": 4,
 "pos_from": 9
 },
 {
 "candidates": [
 "checker",
 "chicken",
 "checked",
 "wherein",
 "coherent",
 "cheered",
 "cherokee"
 ],
 "len": 7,
 "pos_from": 20
 }
 ]
}

Here pos_from is the zero-based position of the first letter of the misspelled word, and len is the length of the misspelled word.
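As an illustration of how pos_from and len can be used, the following sketch replaces each misspelled span with its first candidate. The function name `apply_top_candidates` is hypothetical, and the sketch assumes the offsets count characters of the submitted text, which holds for the ASCII example above.

```python
import json


def apply_top_candidates(text: str, response_json: str) -> str:
    """Replace every reported span in `text` with its first candidate."""
    results = json.loads(response_json)["results"]
    # Work from right to left so that earlier offsets stay valid
    # even when a replacement has a different length.
    for item in sorted(results, key=lambda r: r["pos_from"], reverse=True):
        if not item["candidates"]:
            continue  # nothing to substitute for this word
        start = item["pos_from"]
        end = start + item["len"]
        text = text[:start] + item["candidates"][0] + text[end:]
    return text


response = """{"results": [
  {"candidates": ["best", "beat"], "len": 4, "pos_from": 9},
  {"candidates": ["checker", "chicken"], "len": 7, "pos_from": 20}
]}"""
fixed = apply_top_candidates("I am the begt spell cherken", response)
# fixed == "I am the best spell checker"
```

Taking the first candidate is only one possible policy; an application could instead present the whole list to the user.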

Train

To train a custom model:

Install cmake
Clone and build jamspell:

git clone https://github.com/bakwc/JamSpell.git
cd JamSpell
mkdir build
cd build
cmake ..
make

Prepare a UTF-8 text file with sentences to train on (e.g. sherlockholmes.txt) and another file with the language alphabet (e.g. alphabet_en.txt)
Train the model:

./main/jamspell train ../test_data/alphabet_en.txt ../test_data/sherlockholmes.txt model_sherlock.bin
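When training is part of a larger pipeline, the same command can be launched from Python. The sketch below checks that the corpus is valid UTF-8 and then runs the command shown above; the helper names are illustrative, and the default binary path assumes the script is run from the build directory.

```python
import subprocess
from pathlib import Path
from typing import List


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def train_command(alphabet: str, corpus: str, model: str,
                  binary: str = "./main/jamspell") -> List[str]:
    """Build the argument list: jamspell train <alphabet> <corpus> <model>."""
    return [binary, "train", alphabet, corpus, model]


def train(alphabet: str, corpus: str, model: str) -> None:
    """Validate the corpus encoding, then run the training command."""
    if not is_valid_utf8(Path(corpus).read_bytes()):
        raise ValueError(f"{corpus} is not valid UTF-8")
    subprocess.run(train_command(alphabet, corpus, model), check=True)


# Usage, from the build directory:
# train("../test_data/alphabet_en.txt",
#       "../test_data/sherlockholmes.txt",
#       "model_sherlock.bin")
```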

To evaluate the spell checker, you can use the evaluate/evaluate.py script:

python evaluate/evaluate.py -a alphabet_file.txt -jsp your_model.bin -mx 50000 your_test_data.txt

You can use evaluate/generate_dataset.py to generate your train/test data. It supports txt files, the Leipzig Corpora Collection format, and fb2 books.

Download models

Here are a few simple models. They were trained on 300K news + 300K Wikipedia sentences. We strongly recommend training your own model on at least a few million sentences to achieve better quality. See the Train section above.

en.tar.gz (35Mb)
fr.tar.gz (31Mb)
ru.tar.gz (38Mb)
