A utility library for quickly cleaning texts
Python version in the dev environment: 3.11.5
```shell
pip install -U texy
```
Pipelines with parallelization in Rust:
```python
>>> from texy.pipelines import extreme_clean, strict_clean, relaxed_clean
>>> data = ["hello ;/ from the other side π \t "]

>>> print(extreme_clean(data))
['hello from the other side']

>>> print(strict_clean(data))
['hello ;/ from the other side']

>>> print(relaxed_clean(data))
['hello ;/ from the other side π']
```
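The pipelines take a list of strings and return a cleaned list, so larger batches go through in a single call that the Rust backend parallelizes. A minimal sketch; the sample documents below are illustrative:

```python
from texy.pipelines import strict_clean

# Illustrative batch: each pipeline accepts a list of strings, so cleaning
# many documents is one call handled by the parallel Rust backend.
docs = ["Visit https://example.com today!\n", "reach me at foo@bar.org :)"] * 10_000
cleaned = strict_clean(docs)
print(cleaned[:2])
```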
Parallelize custom functions with Python Multiprocessing:
```python
from texy.pipelines import parallelize

# Toy function applied in parallel: returns the first character of each
# item in the list it receives.
def dummy(x):
    return [i[0] for i in x]

data = ["a ", "b ", "c ", "d ", "e ", "f ", "g ", "h ?."] * 100
# Run dummy over the data, splitting the work across 2 processes.
print(parallelize(dummy, data, 2))
```
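The same helper works for arbitrary cleaning steps. A minimal sketch, assuming (as `dummy`'s list comprehension suggests) that the function receives a chunk of the input list and that chunk results are combined in order; `normalize` and the sample data are illustrative:

```python
from texy.pipelines import parallelize

# Hypothetical custom step: strip and lowercase every string in a chunk.
def normalize(chunk):
    return [s.strip().lower() for s in chunk]

docs = ["  Hello World  ", "FOO bar \t"] * 1000
# Spread the work across 4 processes, mirroring parallelize(dummy, data, 2) above.
print(parallelize(normalize, docs, 4)[:2])
```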
| Pipeline | Actions |
|---|---|
| relaxed_clean | remove_newlines, remove_html, remove_xml, merge_spaces |
| strict_clean | remove_newlines, remove_urls, remove_emails, remove_html, remove_xml, remove_emoticons, remove_emojis, remove_infrequent_punctuations, merge_spaces |
| extreme_clean | remove_newlines, remove_urls, remove_emails, remove_html, remove_xml, remove_emoticons, remove_emojis, remove_all_punctuations, merge_spaces |
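For intuition only, here is a rough pure-Python approximation of the `relaxed_clean` actions listed above; the library performs these steps in Rust, and the regular expressions below are illustrative rather than the exact patterns used:

```python
import re

def relaxed_clean_sketch(texts):
    """Conceptual approximation: remove_newlines, remove_html/remove_xml, merge_spaces."""
    out = []
    for t in texts:
        t = t.replace("\r", " ").replace("\n", " ")  # remove_newlines
        t = re.sub(r"<[^>]+>", " ", t)               # naive HTML/XML tag stripping
        t = re.sub(r"\s+", " ", t).strip()           # merge_spaces
        out.append(t)
    return out

print(relaxed_clean_sketch(["<p>hello   world</p>\nbye"]))
```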