# WordSplitAbs

An abstraction layer around word splitters for Python.
This is an abstraction layer around Python libraries for splitting words that have been joined together without delimiters.
This task is often called word tokenization, but it is slightly different: tokenization applies when words are naturally written without delimiters (as in East Asian languages, for example), whereas splitting applies when words are normally delimited but the delimiters have been lost.
## Tutorial
```python
from WordSplitAbs import ChosenWordSplitter

s = ChosenWordSplitter()  # A resource-consuming stage: most splitters load a corpus or a semi-preprocessed model here and build a usable model from it, so call it as rarely as possible.
print(s("wordsplittingisinferenceofconcatenatedwordsboundaries"))  # "word splitting is inference of concatenated words boundaries"
```
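To give an intuition for what the backends below actually do, here is a minimal self-contained sketch (not part of WordSplitAbs) of the unigram dynamic-programming idea most of them build on; a toy cost of one per word stands in for the unigram log-probabilities a real model would use:

```python
def split_words(text, vocab, max_word_len=20):
    # best[i] = (cost, segmentation) for the prefix text[:i]
    best = [(0, [])] + [(float("inf"), None)] * len(text)
    for i in range(1, len(text) + 1):
        for j in range(max(0, i - max_word_len), i):
            word = text[j:i]
            if word in vocab and best[j][0] + 1 < best[i][0]:
                best[i] = (best[j][0] + 1, best[j][1] + [word])
    return best[-1][1]  # None if no segmentation exists

vocab = {"word", "splitting", "is", "fun"}
print(split_words("wordsplittingisfun", vocab))  # ['word', 'splitting', 'is', 'fun']
```

Real backends replace the toy vocabulary and per-word cost with corpus-derived unigram (and sometimes bigram) statistics, which is why their setup stage is expensive.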
## Backends
| Backend | Has default corpus | Deps | Model | Quality | Notes |
|---|---|---|---|---|---|
| instant_segment | ❌ | | Unigram + bigram | Recommended | A rewrite of wordsegment in Rust with a high performance boost |
| wordsegment | ✔️ | | Unigram + bigram | Recommended | |
| WordSegmentationDP | ❌ | pythonnet + WordSegmentationDP.dll + corpus file | Unigram + Bayes | Recommended | |
| WordSegmentationTM | ❌ | pythonnet + WordSegmentationTM.dll + corpus file | Unigram + Bayes | Recommended | |
| SymSpell | ❌ | pythonnet + SymSpell.dll + corpus file | Unigram + bigram | Not recommended, fails to split elementary phrases | |
| wordninja | ✔️ | | Unigram order | Not the best quality | |
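An abstraction layer like this typically probes which backends are importable and picks the first available one. The following is a hypothetical sketch of that idea (the names and preference order here are assumptions for illustration, not the actual WordSplitAbs API):

```python
import importlib.util

# Assumed preference order, loosely following the table above.
PREFERRED_BACKENDS = ["instant_segment", "wordsegment", "wordninja"]

def choose_backend(candidates=PREFERRED_BACKENDS):
    """Return the name of the first importable backend, or None if none is installed."""
    for name in candidates:
        if importlib.util.find_spec(name) is not None:
            return name
    return None

print(choose_backend())  # the first installed backend's name, or None
```

Probing with `importlib.util.find_spec` avoids importing (and thus paying the model-loading cost of) backends that will never be used.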