bakwc/JamSpell

JamSpell

JamSpell is a spell checking library with the following features:

  • accurate - it considers the surrounding words (context) for better correction
  • fast - about 5K words per second
  • multi-language - it is written in C++ and available for many languages through SWIG bindings

Colab example

JamSpellPro

jamspell.com - check out the new JamSpell version with the following features:

  • Improved accuracy (catboost gradient boosted decision trees candidates ranking model)
  • Splits merged words
  • Pre-trained models for many languages (small, medium, large) for:
    en, ru, de, fr, it, es, tr, uk, pl, nl, pt, hi, no
  • Ability to add words / sentences at runtime
  • Fine-tuning / additional training
  • Memory optimization for training large models
  • Static dictionary support
  • Built-in Java, C#, Ruby support
  • Windows support

Benchmarks

           Errors   Top 7 Errors   Fix Rate   Top 7 Fix Rate   Broken   Speed (words/second)
JamSpell   3.25%    1.27%          79.53%     84.10%           0.64%    4854
Norvig     7.62%    5.00%          46.58%     66.51%           0.69%    395
Hunspell   13.10%   10.33%         47.52%     68.56%           7.14%    163
Dummy      13.14%   13.14%         0.00%      0.00%            0.00%    -

The model was trained on 300K Wikipedia sentences + 300K news sentences (English). 95% of the data was used for training and 5% for evaluation. An error model was used to generate misspelled text from the original. The JamSpell corrector was compared with Norvig's corrector, Hunspell, and a dummy corrector (no corrections).

We used the following metrics:

  • Errors - percentage of words with errors after the spell checker has processed the text
  • Top 7 Errors - percentage of words whose correct form is missing from the top 7 candidates
  • Fix Rate - percentage of misspelled words fixed by the spell checker
  • Top 7 Fix Rate - percentage of misspelled words fixed by one of the top 7 candidates
  • Broken - percentage of correct words broken by the spell checker
  • Speed - number of words processed per second
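
The Errors, Fix Rate and Broken definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code in evaluate/evaluate.py: the function name spelling_metrics and the three aligned word lists (original, errored, corrected) are hypothetical, and the Top 7 metrics are left out because they need the candidate lists.

```python
def spelling_metrics(original, errored, corrected):
    """Compute Errors, Fix Rate and Broken (in percent) for aligned word lists."""
    if not (len(original) == len(errored) == len(corrected)):
        raise ValueError("word lists must be aligned and of equal length")

    total = len(original)
    remaining = errored_total = fixed = clean_total = broken = 0
    for orig, err, corr in zip(original, errored, corrected):
        if corr != orig:
            remaining += 1      # still wrong after correction
        if err != orig:
            errored_total += 1
            if corr == orig:
                fixed += 1      # a misspelling that was repaired
        else:
            clean_total += 1
            if corr != orig:
                broken += 1     # a correct word that was damaged

    def percent(part, whole):
        return 100.0 * part / whole if whole else 0.0

    return {
        "errors": percent(remaining, total),
        "fix_rate": percent(fixed, errored_total),
        "broken": percent(broken, clean_total),
    }


# Hypothetical toy data: two misspellings, one of which is repaired.
original = "i am the best spell checker".split()
errored = "i am the begt spell cherken".split()
corrected = "i am the best spell chicken".split()
print(spelling_metrics(original, errored, corrected))
```

Read this way, the benchmark rows are consistent with each other: with 13.14% misspelled words (the Dummy row), a 79.53% fix rate and a 0.64% broken rate give 13.14 × (1 − 0.7953) + 86.86 × 0.0064 ≈ 2.69 + 0.56 ≈ 3.25%, which matches the Errors value reported for JamSpell.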

To make sure the model is not overfitted to the Wikipedia + news data, we also checked it on the text of "The Adventures of Sherlock Holmes":

           Errors   Top 7 Errors   Fix Rate   Top 7 Fix Rate   Broken   Speed (words/second)
JamSpell   3.56%    1.27%          72.03%     79.73%           0.50%    5524
Norvig     7.60%    5.30%          35.43%     56.06%           0.45%    647
Hunspell   9.36%    6.44%          39.61%     65.77%           2.95%    284
Dummy      11.16%   11.16%         0.00%      0.00%            0.00%    -

More details on reproducing these results are available in the "Train" section.

Usage

Python

  1. Install swig3 (usually available from your distribution's package manager)

  2. Install jamspell:

pip install jamspell
  3. Download or train a language model

  4. Use it:

import jamspell
corrector = jamspell.TSpellCorrector()
corrector.LoadLangModel('en.bin')
corrector.FixFragment('I am the begt spell cherken!')
# u'I am the best spell checker!'
# The second argument is the index of the word to get candidates for
corrector.GetCandidates(['i', 'am', 'the', 'begt', 'spell', 'cherken'], 3)
# (u'best', u'beat', u'belt', u'bet', u'bent', ... )
corrector.GetCandidates(['i', 'am', 'the', 'begt', 'spell', 'cherken'], 5)
# (u'checker', u'chicken', u'checked', u'wherein', u'coherent', ...)
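
In the example above, the second argument of GetCandidates appears to be the zero-based index of the word in the token list (3 for 'begt', 5 for 'cherken'). The sketch below loops over every position to collect candidates for a whole sentence. StubCorrector is a hypothetical stand-in, including its behaviour of returning the word itself for known-good words, so that the snippet runs without jamspell or a model file; with a real setup you would pass a jamspell.TSpellCorrector with a loaded model instead. The helper name top_candidates is also an assumption.

```python
class StubCorrector:
    """Hypothetical stand-in that mimics GetCandidates(tokens, position)."""

    _KNOWN = {
        "begt": ("best", "beat", "belt"),
        "cherken": ("checker", "chicken", "checked"),
    }

    def GetCandidates(self, tokens, position):
        word = tokens[position]
        return self._KNOWN.get(word, (word,))


def top_candidates(corrector, tokens, limit=3):
    """Return (token, candidates) pairs, querying one position at a time."""
    pairs = []
    for position, token in enumerate(tokens):
        candidates = list(corrector.GetCandidates(tokens, position))[:limit]
        pairs.append((token, candidates))
    return pairs


tokens = ["i", "am", "the", "begt", "spell", "cherken"]
for token, candidates in top_candidates(StubCorrector(), tokens, limit=2):
    # Only report words whose best candidate differs from the input.
    if candidates and candidates[0] != token:
        print(token, "->", candidates)
```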

C++

  1. Add jamspell and contrib dirs to your project

  2. Use it:

#include <jamspell/spell_corrector.hpp>

int main(int argc, const char** argv) {
    NJamSpell::TSpellCorrector corrector;
    corrector.LoadLangModel("model.bin");

    corrector.FixFragment(L"I am the begt spell cherken!");
    // "I am the best spell checker!"

    // The second argument is the index of the word to get candidates for
    corrector.GetCandidates({L"i", L"am", L"the", L"begt", L"spell", L"cherken"}, 3);
    // "best", "beat", "belt", "bet", "bent", ...

    corrector.GetCandidates({L"i", L"am", L"the", L"begt", L"spell", L"cherken"}, 5);
    // "checker", "chicken", "checked", "wherein", "coherent", ...
    return 0;
}

Other languages

You can generate extensions for other languages by following the SWIG tutorial. The SWIG interface file is jamspell.i. Pull requests with build scripts are welcome.

HTTP API

  • Install cmake

  • Clone and build jamspell (it includes an HTTP server):

git clone https://github.com/bakwc/JamSpell.git
cd JamSpell
mkdir build
cd build
cmake ..
make
./web_server/web_server en.bin localhost 8080
  • GET Request example:
$ curl "http://localhost:8080/fix?text=I am the begt spell cherken"
I am the best spell checker
  • POST Request example
$ curl -d "I am the begt spell cherken" http://localhost:8080/fix
I am the best spell checker
  • Candidate example
curl "http://localhost:8080/candidates?text=I am the begt spell cherken"
# or
curl -d "I am the begt spell cherken" http://localhost:8080/candidates
{
 "results": [
 {
 "candidates": [
 "best",
 "beat",
 "belt",
 "bet",
 "bent",
 "beet",
 "beit"
 ],
 "len": 4,
 "pos_from": 9
 },
 {
 "candidates": [
 "checker",
 "chicken",
 "checked",
 "wherein",
 "coherent",
 "cheered",
 "cherokee"
 ],
 "len": 7,
 "pos_from": 20
 }
 ]
}

Here pos_from is the zero-based position of the first letter of the misspelled word, and len is the length of the misspelled word.
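
As a minimal sketch of how a client might use these fields, the snippet below replaces each reported span with its first candidate. It assumes that pos_from and len are character offsets into the submitted text, which matches the sample above but is not confirmed for non-ASCII input. The function name apply_first_candidates is hypothetical, and the response is a shortened copy of the JSON shown above.

```python
import json


def apply_first_candidates(text, response):
    """Replace each reported span with its first candidate."""
    fixed = text
    # Work from the end of the string so earlier offsets stay valid
    # even when a replacement has a different length than the original.
    ordered = sorted(response["results"], key=lambda r: r["pos_from"], reverse=True)
    for item in ordered:
        if not item["candidates"]:
            continue
        start = item["pos_from"]
        end = start + item["len"]
        fixed = fixed[:start] + item["candidates"][0] + fixed[end:]
    return fixed


response = json.loads("""
{"results": [
  {"candidates": ["best", "beat", "belt"], "len": 4, "pos_from": 9},
  {"candidates": ["checker", "chicken", "checked"], "len": 7, "pos_from": 20}
]}
""")
print(apply_first_candidates("I am the begt spell cherken", response))
# prints: I am the best spell checker
```

Picking a different index than 0 from each candidates list would let a user interface offer alternatives instead of always taking the top suggestion.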

Train

To train a custom model:

  1. Install cmake

  2. Clone and build jamspell:

git clone https://github.com/bakwc/JamSpell.git
cd JamSpell
mkdir build
cd build
cmake ..
make
  3. Prepare a UTF-8 text file with sentences to train on (e.g. sherlockholmes.txt) and another file with the language alphabet (e.g. alphabet_en.txt)

  4. Train the model:

./main/jamspell train ../test_data/alphabet_en.txt ../test_data/sherlockholmes.txt model_sherlock.bin
  5. To evaluate the spell checker, you can use the evaluate/evaluate.py script:
python evaluate/evaluate.py -a alphabet_file.txt -jsp your_model.bin -mx 50000 your_test_data.txt
  6. You can use evaluate/generate_dataset.py to generate your train/test data. It supports txt files, the Leipzig Corpora Collection format, and fb2 books.
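
If training is part of a larger pipeline, the train command above can be driven from Python. The sketch below only wraps the command line shown in the Train model step, with the arguments in the same order (alphabet, text, output model). The helper names train_command and train are hypothetical, and it is an assumption that the binary reports failure through a non-zero exit status.

```python
import subprocess


def train_command(alphabet, corpus, model_out, binary="./main/jamspell"):
    """Build the argument list for `jamspell train` in the order shown above."""
    return [binary, "train", alphabet, corpus, model_out]


def train(alphabet, corpus, model_out, binary="./main/jamspell"):
    """Run training from the build directory.

    check=True raises CalledProcessError on a non-zero exit status
    (assumption: the binary signals failure that way).
    """
    subprocess.run(train_command(alphabet, corpus, model_out, binary), check=True)


command = train_command(
    "../test_data/alphabet_en.txt",
    "../test_data/sherlockholmes.txt",
    "model_sherlock.bin",
)
print(" ".join(command))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when file paths contain spaces.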

Download models

Here are a few simple models. They were trained on 300K news + 300K Wikipedia sentences. We strongly recommend training your own model on at least a few million sentences to achieve better quality. See the Train section above.
