# WordSplitAbs

An abstraction layer around word splitters for Python.
This is an abstraction layer around Python libraries for splitting words that have been joined together without delimiters.
This task is often called word tokenization, but it is slightly different: tokenization applies when words are naturally written without delimiters (as in East Asian languages, for example), whereas splitting applies when words are normally delimited but the delimiters have been lost.
## Tutorial
```python
from WordSplitAbs import ChosenWordSplitter

s = ChosenWordSplitter()  # A resource-consuming stage: most splitters load a corpus or a semi-preprocessed model here and build a usable model from it, so call it as rarely as possible.
print(s("wordsplittingisinferenceofconcatenatedwordsboundaries"))  # "word splitting is inference of concatenated words boundaries"
```
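To give an intuition for what the backends below actually do, here is a minimal self-contained sketch (not part of WordSplitAbs) of the unigram dynamic-programming idea most of them build on; a toy cost of one per word stands in for the unigram log-probabilities a real model would use:

```python
def split_words(text, vocab, max_word_len=20):
    # best[i] = (cost, segmentation) for the prefix text[:i]
    best = [(0, [])] + [(float("inf"), None)] * len(text)
    for i in range(1, len(text) + 1):
        for j in range(max(0, i - max_word_len), i):
            word = text[j:i]
            if word in vocab and best[j][0] + 1 < best[i][0]:
                best[i] = (best[j][0] + 1, best[j][1] + [word])
    return best[-1][1]  # None if no segmentation exists

vocab = {"word", "splitting", "is", "fun"}
print(split_words("wordsplittingisfun", vocab))  # ['word', 'splitting', 'is', 'fun']
```

Real backends replace the toy vocabulary and per-word cost with corpus-derived unigram (and sometimes bigram) statistics, which is why their setup stage is expensive.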
## Backends
| Backend | Has default corpus | Deps | Model | Quality | Notes |
|---|---|---|---|---|---|
| instant_segment | ❌ | | Unigram + bigram | Recommended | A rewrite of wordsegment in Rust with a high performance boost |
| wordsegment | ✔️ | | Unigram + bigram | Recommended | |
| WordSegmentationDP | ❌ | pythonnet + WordSegmentationDP.dll + corpus file | Unigram + Bayes | Recommended | |
| WordSegmentationTM | ❌ | pythonnet + WordSegmentationTM.dll + corpus file | Unigram + Bayes | Recommended | |
| SymSpell | ❌ | pythonnet + SymSpell.dll + corpus file | Unigram + bigram | Not recommended, fails to split elementary phrases | |
| wordninja | ✔️ | | Unigram order | Not the best quality | |
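An abstraction layer like this typically probes which backends are importable and picks the first available one. The following is a hypothetical sketch of that idea (the names and preference order here are assumptions for illustration, not the actual WordSplitAbs API):

```python
import importlib.util

# Assumed preference order, loosely following the table above.
PREFERRED_BACKENDS = ["instant_segment", "wordsegment", "wordninja"]

def choose_backend(candidates=PREFERRED_BACKENDS):
    """Return the name of the first importable backend, or None if none is installed."""
    for name in candidates:
        if importlib.util.find_spec(name) is not None:
            return name
    return None

print(choose_backend())  # the first installed backend's name, or None
```

Probing with `importlib.util.find_spec` avoids importing (and thus paying the model-loading cost of) backends that will never be used.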