Kyrgyz language support #1344

Open

alexeyev wants to merge 2 commits into JaidedAI:master from alexeyev:kyrgyz

alexeyev commented Dec 5, 2024

Hello, and thank you for your fantastic work.

Please add support for the Kyrgyz language. How can I help?

In this pull request I provide a list of characters and a list of words built from the two corpora from here, using this hacky script:

import re

# Tab-separated word-frequency files: column 1 holds the token, column 2 its count.
paths = [
    # "data/kir_community_2017/kir_community_2017-words.txt",
    "data/kir_newscrawl_2016_1M/kir_newscrawl_2016_1M-words.txt",
    "data/kir_wikipedia_2021_300K/kir_wikipedia_2021_300K-words.txt",
]
tokens = []
# Discard tokens containing Latin/Greek letters, digits, punctuation or
# non-Kyrgyz Cyrillic letters, tokens starting with "Ё", and hyphenated
# tokens that begin with a single character.
removable = re.compile(r"(.*[′…ЇЈЎ&')¤/´˅(\"A-Za-z0-9Α-Ωα-ω.úƒƖ½ö+ЄІ,:;?!>< ]+.*|Ё.*|\w-\w+)", re.UNICODE)
for path in paths:
    with open(path, "r", encoding="utf-8") as rf:
        for line in rf:
            line = line.strip()
            if not line:
                continue
            split_line = line.split("\t")
            count = int(split_line[2])
            # Skip rare tokens (fewer than 6 occurrences).
            if count < 6:
                continue
            # Map look-alike characters to the proper Kyrgyz letters.
            token = (
                split_line[1].strip()
                .replace("ɵ", "ө")
                .replace("Θ", "Ө")
                .replace("ʏ", "ү")
            )
            # Trim stray symbols and punctuation from both ends.
            token = token.strip("•₣‰ʿ°—‘»2¬/μ«£:;„'()´`$%–No.,-")
            if len(token) > 2 and not removable.match(token):
                tokens.append(token)
tokens = sorted(set(tokens))
# Keep only the tokens that sort before "өөө" (if present); the tail is dropped.
tokens_clipped_tail = []
for token in tokens:
    if token == "өөө":
        break
    tokens_clipped_tail.append(token)
with open("ky.txt", "w", encoding="utf-8") as wf:
    wf.write("\n".join(tokens_clipped_tail))
print(f"A total of {len(tokens_clipped_tail)} tokens.")
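
The pull request also mentions a list of characters, but the script above only produces the word list. As an illustration only, here is a minimal sketch of how a character list could be derived from ky.txt. It is an assumption, not necessarily how the character list in this PR was built; the helper `collect_characters` and the output file name `ky_char.txt` are hypothetical.

```python
from pathlib import Path


def collect_characters(words):
    """Return the distinct characters used in `words`, sorted by code point."""
    return sorted({ch for word in words for ch in word})


if __name__ == "__main__":
    word_file = Path("ky.txt")  # word list written by the script above
    if word_file.exists():
        words = word_file.read_text(encoding="utf-8").splitlines()
        chars = collect_characters(words)
        # "ky_char.txt" is a hypothetical output name chosen for this sketch.
        Path("ky_char.txt").write_text("\n".join(chars), encoding="utf-8")
        print(f"{len(chars)} distinct characters.")
```

Sorting is by Unicode code point, so letters such as "ң", "ө" and "ү" end up after the basic Cyrillic range rather than in Kyrgyz alphabetical order. The result would still need a manual check against the Kyrgyz alphabet, since any stray symbol that survives the word filter would also appear in it.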