Language Detection from documents using n-gram profiles
This notebook is an attempt at building an n-gram-profile-based language detector, inspired by "N-gram-based text categorization" (Cavnar and Trenkle, 1994).
BibTeX entry
@inproceedings{Cavnar1994NgrambasedTC,
title={N-gram-based text categorization},
author={William B. Cavnar and John M. Trenkle},
year={1994},
url={https://api.semanticscholar.org/CorpusID:170740}
}
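For orientation, here is a minimal sketch of the general idea in the cited paper: rank the most frequent character n-grams of a text into a profile, then compare profiles with an "out-of-place" distance. It is not the code in this repository. The function names (`ngram_profile`, `out_of_place`, `detect`) and the defaults are hypothetical, and the tokenization is simplified (letters only, one `_` pad on each side of a word).

```python
import re
from collections import Counter


def ngram_profile(text, profile_size=200, max_n=5):
    """Return the `profile_size` most frequent character n-grams, most frequent first."""
    counts = Counter()
    # Simplification: keep runs of letters only and lowercase everything.
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        padded = f"_{word}_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                gram = padded[i:i + n]
                if gram.strip("_"):  # skip n-grams made only of padding
                    counts[gram] += 1
    # Sort by descending count; break ties alphabetically so the result is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:profile_size]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from lang_profile get a fixed maximum penalty."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(i - rank[gram]) if gram in rank else max_penalty
        for i, gram in enumerate(doc_profile)
    )


def detect(text, lang_profiles, profile_size=200):
    """Return the language whose profile is closest to the text's profile."""
    doc_profile = ngram_profile(text, profile_size)
    return min(lang_profiles, key=lambda lang: out_of_place(doc_profile, lang_profiles[lang]))
```

Under these assumptions, `lang_profiles` is a plain dict mapping a language code to a profile built with `ngram_profile` from a training text, and `profile_size` plays the same role as the PROFILE_SIZE argument of the CLI script below.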
Env Setup
Make sure to have uv installed before you proceed.
uv sync
source .venv/bin/activate
To run the example notebook:
jupyter notebook
Otherwise, you can run the CLI script:
uv run src/main.py PROFILE_SIZE
# example
uv run src/main.py 200