
Tehran Monolingual Corpus

From Wikipedia, the free encyclopedia

The Tehran Monolingual Corpus (TMC) is a large-scale monolingual corpus of Persian. It is suited to language modeling and related research areas in natural language processing.

The corpus is extracted from the Hamshahri Corpus and the website of the ISNA news agency. The Hamshahri text was improved for language modeling purposes through a series of tokenization and spell-checking steps.

TMC comprises more than 250 million words. It contains about 300 thousand unique words that occur two or more times, which is relatively good for a highly inflectional language like Persian.
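
The vocabulary figure above depends on a frequency cutoff: only word types seen at least twice are counted. As a minimal sketch of how such a count can be computed, the following Python snippet counts distinct tokens at or above a threshold. The function name `vocabulary_size`, the whitespace tokenization, and the toy sentence are illustrative assumptions; they are not taken from TMC or from the tools used to build it.

```python
from collections import Counter
from typing import Iterable


def vocabulary_size(tokens: Iterable[str], min_freq: int = 2) -> int:
    """Count distinct tokens that occur at least min_freq times."""
    counts = Counter(tokens)
    return sum(1 for freq in counts.values() if freq >= min_freq)


# Toy whitespace-tokenized sample (hypothetical; not drawn from TMC).
sample = "کتاب خوب است و کتاب بد است".split()

total_tokens = len(sample)                    # running words in the sample
vocab = vocabulary_size(sample, min_freq=2)   # types seen two or more times
all_types = vocabulary_size(sample, min_freq=1)
```

Dropping words that occur only once is a common way to keep misspellings and other one-off tokens out of a language model's vocabulary, which is one reason a cutoff of two is a natural choice for a figure like this.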

TMC was created by the Natural Language Processing Lab of the University of Tehran. The corpus is free for research use once permission has been obtained from the corpus aggregator.

