An abstraction layer around word splitters for python

WordSplitAbs.py

Unlicensed work


This is an abstraction layer around Python libraries for splitting words that have been joined without delimiters.

This task is often called word tokenization, but it is a slightly different thing: tokenization deals with text where words are not naturally separated (in East Asian languages, for example), whereas word splitting deals with text where words are naturally separated but the delimiters have been lost.
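As a concrete illustration of what word splitting means (this is a toy sketch, not part of WordSplitAbs; the dictionary and function names are made up for the example), boundary recovery can be done with dynamic programming over a known word list:

```python
# Toy illustration of delimiter recovery ("word splitting"): given text with
# spaces removed, find a segmentation into known words via dynamic programming.
# Real backends score candidate segmentations with large corpus statistics
# instead of a hand-made set.
WORDS = {"word", "splitting", "is", "inference", "of",
         "concatenated", "words", "boundaries"}

def split_words(text):
    # best[i] holds a list of words covering text[:i], or None if unreachable.
    best = [None] * (len(text) + 1)
    best[0] = []
    for i in range(1, len(text) + 1):
        for j in range(i):
            if best[j] is not None and text[j:i] in WORDS:
                best[i] = best[j] + [text[j:i]]
                break
    return best[len(text)]

print(split_words("wordsplittingisinference"))  # ['word', 'splitting', 'is', 'inference']
```

Real splitters differ mainly in how they rank the many possible segmentations (unigram frequencies, bigrams, Bayesian scoring), which is what the backends table below summarizes.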

Tutorial


```python
from WordSplitAbs import ChosenWordSplitter

# A resource-consuming stage: most splitters load a corpus or a semi-preprocessed
# model here and build a usable model from it, so construct the splitter as rarely as possible.
s = ChosenWordSplitter()
print(s("wordsplittingisinferenceofconcatenatedwordsboundaries"))  # "word splitting is inference of concatenated words boundaries"
```
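Since construction is the expensive stage, a common pattern is to build the splitter once and reuse it everywhere. A minimal sketch of such a cache (the `cached` helper is hypothetical, not part of WordSplitAbs; with WordSplitAbs it would wrap the `ChosenWordSplitter` constructor from the tutorial above):

```python
from functools import lru_cache

def cached(factory):
    # Wrap an expensive zero-argument constructor so it runs only once;
    # later calls return the same instance.
    return lru_cache(maxsize=None)(factory)

# With WordSplitAbs this would look like:
#   get_splitter = cached(lambda: ChosenWordSplitter())
# Demonstrated here with a dummy factory that counts its invocations:
calls = []
get_thing = cached(lambda: calls.append(1) or object())
a, b = get_thing(), get_thing()
print(a is b, len(calls))  # True 1
```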

Backends

| Backend | Has default corpus | Deps | Model | Quality | Notes |
|---|---|---|---|---|---|
| instant_segment | | | Unigram + bigram | Recommended | A rewrite of wordsegment in Rust with a large performance boost |
| wordsegment | ✔️ | | Unigram + bigram | Recommended | |
| WordSegmentationDP | | pythonnet + WordSegmentationDP.dll + corpus file | Unigram + Bayes | Recommended | |
| WordSegmentationTM | | pythonnet + WordSegmentationTM.dll + corpus file | Unigram + Bayes | Recommended | |
| SymSpell | | pythonnet + SymSpell.dll + corpus file | Unigram + bigram | Not recommended | Fails to split elementary phrases |
| wordninja | ✔️ | | Unigram order | Not the best quality | |
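An abstraction layer like this one typically needs to pick the first installed backend. A hedged sketch of how such a fallback could work (the preference list and `first_available` helper are hypothetical illustrations, not WordSplitAbs's actual selection logic):

```python
import importlib

# Hypothetical preference order, loosely following the quality column above.
PREFERRED = ["instant_segment", "wordsegment", "wordninja"]

def first_available(candidates):
    # Return the first backend module that imports successfully, or None.
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    return None

backend = first_available(PREFERRED)
```

WordSplitAbs itself may select backends differently (for example, letting the caller choose explicitly, as the `ChosenWordSplitter` name in the tutorial suggests); this only illustrates the general fallback idea.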