rashomon/language-detection

21 commits 1 branch 0 tags 644 KiB

Jupyter Notebook 94.1%

Python 5.9%

Shawon Ashraf a9c4219b0e typo fix		2025-06-24 11:06:58 +06:00
notebook	moved to notebook dir	2025-06-24 10:41:08 +06:00
src	a small log message	2025-06-24 11:04:06 +06:00
.gitignore	init	2025-06-24 02:55:24 +06:00
.python-version	init	2025-06-24 02:55:24 +06:00
pyproject.toml	added logging and ruff checks	2025-06-24 11:03:24 +06:00
README.md	typo fix	2025-06-24 11:06:58 +06:00
uv.lock	added logging and ruff checks	2025-06-24 11:03:24 +06:00

README.md

Language Detection from documents using n-gram profiles

This notebook is an attempt at building a language detector based on n-gram profiles, inspired by "N-gram-based text categorization" (Cavnar and Trenkle, 1994).

BibTeX entry
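
For readers unfamiliar with the approach, here is a minimal sketch of the idea behind n-gram profiles: rank the most frequent character n-grams of a text, then compare a document's ranking with each language's ranking using an "out-of-place" distance. This is a simplified illustration, not the code in this repository; the function names (`ngram_profile`, `out_of_place`, `detect`), the `_` padding character, and the default values are assumptions made for the example.

```python
import re
from collections import Counter


def ngram_profile(text, profile_size=300, max_n=5):
    """Return the most frequent character n-grams (n = 1..max_n), best first."""
    counts = Counter()
    # Keep letters only; digits and punctuation are dropped.
    for token in re.findall(r"[^\W\d_]+", text.lower()):
        padded = f"_{token}_"  # "_" marks word boundaries (an assumption here)
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:profile_size]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from lang_profile get a fixed penalty."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(i - rank[gram]) if gram in rank else max_penalty
        for i, gram in enumerate(doc_profile)
    )


def detect(text, lang_profiles, profile_size=300):
    """Return the language whose profile is closest to the text's profile."""
    doc_profile = ngram_profile(text, profile_size)
    return min(lang_profiles, key=lambda lang: out_of_place(doc_profile, lang_profiles[lang]))


# Toy training texts, far too small for real use.
samples = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sat on the mat",
    "de": "der schnelle braune fuchs springt über den faulen hund und die katze",
}
profiles = {lang: ngram_profile(text) for lang, text in samples.items()}
print(detect("the dog and the cat", profiles))
```

In practice the language profiles would be built from much larger corpora, and the profile size becomes the main tuning knob: larger profiles capture more language-specific n-grams at the cost of more comparisons.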

@inproceedings{Cavnar1994NgrambasedTC,
 title={N-gram-based text categorization},
 author={William B. Cavnar and John M. Trenkle},
 year={1994},
 url={https://api.semanticscholar.org/CorpusID:170740}
}

Env Setup

Make sure uv is installed before you proceed.

uv sync
source .venv/bin/activate

To run the example notebook:

jupyter notebook

Alternatively, run the CLI script:

uv run src/main.py PROFILE_SIZE
# example
uv run src/main.py 200
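
How `src/main.py` reads its argument is not shown here. As a rough sketch only, assuming `PROFILE_SIZE` is the number of top-ranked n-grams kept in each profile, a script could accept and validate a positional argument like this. The names `positive_int` and `parse_args` are hypothetical and not taken from the repository.

```python
import argparse


def positive_int(value):
    """argparse type: accept only integers greater than zero."""
    number = int(value)
    if number <= 0:
        raise argparse.ArgumentTypeError("PROFILE_SIZE must be a positive integer")
    return number


def parse_args(argv=None):
    """Parse the command line; argv=None means use sys.argv[1:]."""
    parser = argparse.ArgumentParser(
        description="Detect languages using n-gram profiles (hypothetical CLI sketch)."
    )
    parser.add_argument(
        "profile_size",
        metavar="PROFILE_SIZE",
        type=positive_int,
        help="number of top-ranked n-grams to keep in each profile",
    )
    return parser.parse_args(argv)


args = parse_args(["200"])
print(args.profile_size)
```

Rejecting zero or negative sizes up front keeps an empty profile from reaching the comparison step.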