Arabic Word Frequency Counts

When tokenizing a text for purposes of generating a word frequency count, I define an Arabic word as:

one or more consecutive Arabic characters [\xC1-\xD6\xD8-\xDB\xDD-\xDF\xE1\xE3-\xE6\xEC\xED]
including Persian characters [\x81\x8D\x8E\x90]
short vowels and diacritics [\xF0-\xF3\xF5\xF6\xF8\xFA]
and the lengthening character [\xDC]

Note: All hex values are those of the Arabic Windows (1256) code page.
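The word definition above can be sketched as a single byte-level regular expression. This is a minimal illustration in Python, not the script actually used for these counts; the names `WORD_RE` and `tokenize` are assumptions introduced here, and the input is assumed to be raw Windows-1256 bytes.

```python
import re

# Byte classes copied from the definition above (Windows-1256 code page).
ARABIC = rb"\xC1-\xD6\xD8-\xDB\xDD-\xDF\xE1\xE3-\xE6\xEC\xED"
PERSIAN = rb"\x81\x8D\x8E\x90"
DIACRITICS = rb"\xF0-\xF3\xF5\xF6\xF8\xFA"
LENGTHENING = rb"\xDC"

# A word is one or more consecutive bytes from any of the four classes.
WORD_RE = re.compile(b"[" + ARABIC + PERSIAN + DIACRITICS + LENGTHENING + b"]+")


def tokenize(data: bytes) -> list:
    """Return the raw Arabic tokens found in Windows-1256 encoded bytes."""
    return WORD_RE.findall(data)
```

Anything outside the four classes (spaces, Latin letters, digits, punctuation, and bytes such as \xD7, the multiplication sign) acts as a token boundary.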

When tokenizing Arabic input it's a good idea to make a preliminary pass to detect and fix punctuation anomalies, such as the Arabic character ra' (\xD1) used as a numeric comma or "decimal separator" (U+066B), and the Arabic lengthening character (\xDC) used as an em dash or numeric hyphen. Numbers are sometimes encoded visually instead of logically, and the digit zero occasionally functions as a period (full stop).
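A preliminary pass of this kind might look like the sketch below. It covers only the two byte-level anomalies (ra' and the lengthening character between digits); visually ordered numbers and zero used as a full stop are not handled. The function name and the replacement characters (an ASCII comma and hyphen) are assumptions made for illustration, since the text above names the anomalies but not the fixes. Digits are assumed to be ASCII 0-9.

```python
import re

# \xD1 (ra') and \xDC (lengthening character) are suspicious only when they
# sit directly between two digits; inside ordinary words they are left alone.
RA_BETWEEN_DIGITS = re.compile(rb"(?<=[0-9])\xD1(?=[0-9])")
TATWEEL_BETWEEN_DIGITS = re.compile(rb"(?<=[0-9])\xDC(?=[0-9])")


def fix_numeric_punctuation(data: bytes) -> bytes:
    """Replace ra'/lengthening bytes used as numeric punctuation.

    The replacements b"," and b"-" are hypothetical choices.
    """
    data = RA_BETWEEN_DIGITS.sub(b",", data)
    data = TATWEEL_BETWEEN_DIGITS.sub(b"-", data)
    return data
```

Running this before tokenization keeps a stray \xD1 or \xDC inside a number from being counted as a one-letter word.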

After tokenizing according to the above criteria, I remove all short vowels, diacritics, and the lengthening character, and count the remainder as a word. Null strings are discarded.
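The normalization and counting step can be sketched as follows. This is an illustration under the same assumptions as the word definition (Windows-1256 bytes); `count_words` is a name introduced here, not the original script.

```python
import re
from collections import Counter

# Token pattern: letters, Persian letters, diacritics, lengthening character.
WORD_RE = re.compile(
    rb"[\xC1-\xD6\xD8-\xDB\xDD-\xDF\xE1\xE3-\xE6\xEC\xED"
    rb"\x81\x8D\x8E\x90\xF0-\xF3\xF5\xF6\xF8\xFA\xDC]+"
)
# Bytes removed after tokenizing: short vowels, diacritics, lengthening character.
STRIP_RE = re.compile(rb"[\xF0-\xF3\xF5\xF6\xF8\xFA\xDC]")


def count_words(data: bytes) -> Counter:
    """Count normalized words in Windows-1256 encoded bytes."""
    counts = Counter()
    for token in WORD_RE.findall(data):
        word = STRIP_RE.sub(b"", token)
        if word:  # a token made only of diacritics becomes a null string
            counts[word] += 1
    return counts
```

With this representation, the number of keys is the type count and the sum of the values is the token count.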

Here are the types/tokens figures from the last three word frequencies I have generated:

date types tokens
Feb. 1999 1,359,309 167,216,930
Feb. 2001 2,578,709 589,184,483
Aug. 2002 3,509,499 1,141,563,654

The table below shows the top 30 words and their frequencies from my three frequency counts. Note that starting with my Feb. 2001 count I added a "file count" figure and began to use it—instead of the frequency count—as the primary sort key. On Nov. 13, 2002 I wrote a script to get the Google frequency: it's an interesting statistic, especially if you look at the web as the Mother of All Corpora.
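Ranking by file count rather than by raw frequency can be sketched as below. The input format (a mapping from file name to a list of already-normalized words), the function name, and the tie-breaking order (frequency, then the word itself) are assumptions for illustration.

```python
from collections import Counter


def build_wordlist(files):
    """Rank words by file count, then frequency.

    files: mapping of file name -> list of normalized words (hypothetical format).
    Returns a list of (rank, word, frequency, file count) tuples.
    """
    freq = Counter()
    filecnt = Counter()
    for words in files.values():
        freq.update(words)
        filecnt.update(set(words))  # each file counts once per word
    ranked = sorted(freq, key=lambda w: (-filecnt[w], -freq[w], w))
    return [(rank, w, freq[w], filecnt[w]) for rank, w in enumerate(ranked, 1)]


# Toy input in transliteration, not corpus data.
toy = {"a.txt": ["fy", "fy", "fy", "fy", "mn"], "b.txt": ["mn"]}
wordlist = build_wordlist(toy)
```

In the toy input, fy is more frequent than mn but appears in fewer files, so mn ranks first. The same effect is visible in the Feb. 2001 and Aug. 2002 columns, where the word at rank 1 has a lower frequency than the word at rank 2.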

Feb. 1999 | Feb. 2001 | Aug. 2002 | Nov. 2002
rank word frequency | rank word frequency file count | rank word frequency file count | Google
1 5,645,218 1 13,624,732 1,144,319 1 26,533,543 2,511,236 6,600,000
2 3,871,153 2 18,817,693 1,128,546 2 36,615,810 2,422,564 7,540,000
3 旻? 2,310,879 3 旻? 7,508,546 915,406 3 旻? 14,173,880 1,996,755 4,320,000
4 2,219,600 4 3,669,877 776,762 4 6,717,942 1,684,706 2,660,000
5 煤? 1,516,247 5 6,518,011 748,416 5 12,214,896 1,637,296 1,450,000
6 煤簿 1,072,702 6 煤簿 3,530,111 733,062 6 煤簿 6,963,708 1,625,725 1,710,000
7 933,872 7 煤? 4,228,886 666,316 7 煤? 7,861,905 1,424,032 1,200,000
8 煤俯 727,170 8 煤俯 2,441,785 645,854 8 煤俯 4,712,703 1,406,014 1,370,000
9 673,928 9 2,307,276 618,865 9 4,597,178 1,401,634 1,500,000
10 縊? 664,751 10 縊? 2,262,268 562,428 10 縊? 4,234,060 1,216,084 1,730,000
11 縊? 621,972 11 縊? 2,056,768 533,077 11 縊? 3,936,525 1,167,126 1,440,000
12 614,348 12 2,134,871 516,655 12 蛮? 2,728,220 1,119,473 1,240,000
13 596,737 13 罷? 1,557,175 498,405 13 罷? 3,016,512 1,103,814 1,230,000
14 罷? 471,859 14 蛮? 1,355,002 498,352 14 3,981,028 1,102,480 2,100,000
15 444,508 15 2,053,785 448,718 15 令煤 2,285,299 983,953 624,000
16 倚? 390,446 16 令煤 1,128,034 433,457 16 焜? 2,236,668 952,957 947,000
17 瀁? 385,909 17 焜? 1,148,968 429,963 17 3,780,512 949,315 2,000,000
18 蛮? 383,454 18 1,264,232 415,445 18 4,960,500 944,445 2,320,000
19 372,917 19 倚? 1,336,611 414,018 19 2,373,084 876,683 1,100,000
20 347,762 20 2,222,026 395,224 20 倚? 2,432,523 874,247 1,190,000
21 刀? 336,817 21 瀁? 1,316,091 392,905 21 瀁? 2,417,022 828,361 1,170,000
22 330,130 22 1,165,657 378,694 22 2,211,351 817,968 1,060,000
23 焜? 316,837 23 絡? 886,973 349,694 23 聢煤 1,998,242 805,975 531,000
24 300,602 24 煤敘? 904,412 348,480 24 1,750,386 795,319 828,000
25 299,244 25 淅? 774,256 348,359 25 册輦 1,850,654 781,413 507,000
26 令煤 297,653 26 册輦 942,069 346,054 26 煤敘? 1,847,613 781,323 579,000
27 煤斫罷? 289,300 27 聟? 921,810 344,874 27 絡? 1,756,690 777,851 628,000
28 売? 269,280 28 聢煤 992,961 344,420 28 淅? 1,537,048 767,380 790,000
29 煤敘? 268,549 29 879,497 343,512 29 煤辭? 1,426,832 765,200 840,000
30 煤痳 267,092 30 聢? 863,727 343,123 30 聟? 1,675,304 713,037 652,000

The above is not a lemmatized list. Although some word forms are easily merged (e.g., 煤? ?=? 刀?), most word forms require contextual analysis to be disambiguated (e.g. ?=? 駐糲 or 駐鬻? or 当糲 or 当鬻? or ).

Before retrieving citations from my corpus I find it useful to go to the wordlist first and use it to test the regular expression that I will later use for searching the corpus. The wordlist also allows me to see the total number of hits that I will get when generating a concordance and thus anticipate the file size of the concordance. Some word forms are unambiguous and lend themselves to fairly simple regular expressions when searching for them in the wordlist. For example, the regular expression /[A><]stqlAb/ produced:

30 forms (Total Freq: 570 = 1 every 2,002,743 words) (Here is a page with my Transliteration).

word rank freq filecnt
煤排舗畴? 203,353 156 118
煤排舗畴罷? 236,180 115 88
排舗畴? 247,925 103 80
翡畴喨渝波 338,618 46 43
排舗畴罷? 402,948 33 30
煤排舗畴罷 527,621 24 17
翡畴喨渝波輊 600,354 18 13
翡喨渝波 670,143 11 11
畴喨渝波 738,984 9 9
版喨渝波 878,371 7 6
排舗畴罷 888,965 6 6
煤途舗畴? 950,488 7 5
排舗畴版 1,257,608 3 3
排舗畴斐 1,257,609 3 3
排舗畴斐? 1,257,610 3 3
煤池舗畴? 1,260,732 3 3
煤池舗畴罷? 1,260,733 3 3
途舗畴? 1,514,774 2 2
排舗畴版? 1,526,307 2 2
排舗畴罷? 1,526,308 2 2
煤排舗畴版? 1,537,658 2 2
版畴喨渝波 1,596,397 2 2
瀁喨渝波 1,724,178 2 2
瘁排舗畴? 1,763,413 2 2
煤途舗畴罷? 2,135,762 1 1
煤排舗畴披渾 2,147,334 1 1
翆穡盃煤排舗畴? 3,166,051 1 1
翡喨渝波? 3,185,260 1 1
翡喨渝波輊 3,185,261 1 1
翡畴喨渝波琶 3,200,045 1 1
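Testing an expression against the wordlist and producing a summary line like the one above can be sketched as follows. The wordlist format (transliterated word, frequency, file count) and the function names are assumptions. The "1 every N words" figure appears to be the Aug. 2002 token count divided by the total frequency and rounded down: 1,141,563,654 / 570 gives 2,002,743.

```python
import re

CORPUS_TOKENS = 1_141_563_654  # Aug. 2002 token count from the types/tokens table


def search_wordlist(pattern, wordlist):
    """Return the wordlist entries whose transliterated form matches pattern.

    wordlist: iterable of (word, freq, filecnt) tuples (hypothetical format).
    """
    rx = re.compile(pattern)
    return [entry for entry in wordlist if rx.search(entry[0])]


def summarize(hits, corpus_tokens=CORPUS_TOKENS):
    """Return (number of forms, total frequency, corpus tokens per hit)."""
    total = sum(freq for _, freq, _ in hits)
    every = corpus_tokens // total if total else None
    return len(hits), total, every


# Toy entries for illustration only, not corpus data.
toy = [("AstqlAb", 5, 4), (">stqlAb", 2, 2), ("wAstqlAb", 1, 1), ("AstqbAl", 9, 7)]
hits = search_wordlist(r"[A><]stqlAb", toy)
```

Because the total frequency is known before any concordance is generated, the same figures give an early estimate of how large the concordance file will be.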

Other lemmas require more complex regular expressions, such as the following, for the noun stem fltAn:

/^[wf]?([blk]?|[bk]?Al|[blk]?hAl|ll)fltAn/ (Note: this includes the colloquial prefix hAl-)

29 forms (Total Freq: 1,728 = 1 every 660,627 words) (Here is a page with my Transliteration).

word rank freq filecnt
煤氈頁? 78,320 779 661
氈頁? 101,135 469 426
翡痞疂売 192,043 144 132
聟疂売 290,316 60 59
瘁氈頁? 298,073 56 56
氈頁簀 351,632 41 40
氈頁簓 367,836 49 36
版痞疂売 482,666 21 21
痞疂売 495,940 20 20
煤氈頁簓 514,485 22 18
否疂売 662,372 11 11
氈頁粤 665,759 11 11
氈頁粤? 737,270 9 9
煤氈頁粱 892,673 6 6
聟疂売? 931,416 6 6
翡痞疂売? 1,080,908 5 4
歿痞疂売 1,335,814 3 3
氈頁粱? 1,710,064 2 2
翡痞疂売辟 1,838,284 2 2
翦氈頁? 1,853,312 2 2
聟疂売? 1,881,266 2 2
煤氈頁粱? 2,262,027 1 1
煤氈頁粱? 2,262,028 1 1
氈頁簀? 2,809,431 1 1
氈頁粱 2,809,432 1 1
綰痞疂売 3,138,700 1 1
聟疂売綰 3,358,429 1 1
聟疂売繝 3,358,430 1 1
聲痞疂売 3,395,987 1 1
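The prefix logic of the fltAn expression is easier to follow when it is run against a few candidate strings. The candidates below are constructed for illustration in transliteration and are not claimed to be attested in the corpus.

```python
import re

# The expression from the text: optional conjunction w/f, then one of
# (optional b/l/k), (optional b/k + Al), (optional b/l/k + hAl), or ll.
FLTAN_RE = re.compile(r"^[wf]?([blk]?|[bk]?Al|[blk]?hAl|ll)fltAn")

candidates = [
    "fltAn",      # bare stem
    "wfltAn",     # conjunction + stem
    "AlfltAn",    # definite article
    "wbAlfltAn",  # conjunction + preposition + article
    "llfltAn",    # l + Al, written ll
    "hAlfltAn",   # colloquial hAl-
    "fltAnA",     # suffix allowed: the pattern has no end anchor
    "lAlfltAn",   # rejected: l + Al must be written ll
    "AfltAn",     # rejected: A is not a permitted prefix
]
matches = [w for w in candidates if FLTAN_RE.search(w)]
```

The start anchor keeps unrelated prefixes out, while the missing end anchor lets suffixed forms through, which is why the expression returns inflected forms as well as the bare stem.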

The following regular expressions extract all the inflected forms of the preposition fy (both versions produce the same output):

/^[wf]?f[Yy](|h|hA|hmA|hm|hn|k|kmA|km|kn|nA)$/
/^[wf]?f[Yy]([hk]?|hA|[hk]mA?|[hk]n|nA)$/

48 forms (Total Freq: 42,810,237 = 1 every 26 words) (Here is a page with my Transliteration).

word rank freq filecnt
2 36,615,810 2,422,564
聟? 30 1,675,304 713,037
沚綰 36 1,222,801 632,440
沚? 50 1,030,089 545,319
132 1,894,110 321,671
毫? 920 122,836 95,251
沚繝 3,081 46,072 36,049
聟? 3,103 56,930 35,850
聟辣 5,674 46,993 20,247
沚繝? 5,837 24,842 19,709
聟辣? 7,115 19,925 16,027
沚簀 7,938 18,509 14,195
沚? 11,933 12,639 8,909
毫? 25,202 3,981 3,560
毫辣 29,518 3,705 2,874
沚焜 30,851 4,485 2,712
毫辣? 34,545 3,087 2,317
沚繖 46,295 2,632 1,518
聟辣? 52,220 1,792 1,266
聟辟? 72,453 955 750
聟辣稠 101,248 581 425
聟轜 120,370 357 314
沍綰 160,503 399 185
沍? 168,479 340 169
毫辣稠 170,194 204 166
毫辣? 186,930 158 139
聟轜? 190,969 213 133
沚焙 218,127 130 103
聟辣? 240,648 99 85
沚焜? 291,082 83 58
毫轜 360,477 39 38
毫辟? 456,896 29 23
毫辣? 562,437 18 15
沍簀 606,172 14 13
沍繝 687,798 11 10
聟跂? 689,782 11 10
聟轜? 747,527 9 9
聟跂 769,046 9 8
沍繝? 817,607 8 7
沍? 869,451 9 6
毫轜? 953,608 7 5
毫跂 1,340,653 3 3
毫跂? 1,340,654 3 3
沍焜 1,715,068 2 2
毫轜? 2,805,347 1 1
沍繖 2,826,754 1 1
聟跂稠 3,359,666 1 1
聟跂? 3,359,667 1 1
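That the two expressions for fy accept the same strings can be checked by brute force. The sketch below enumerates every combination of an optional w/f prefix, f, Y or y, and a suffix of up to three characters drawn from the letters that occur in the suffix alternatives, then compares what each expression accepts. The candidate generator is an assumption made for this check.

```python
import re
from itertools import product

RX_LONG = re.compile(r"^[wf]?f[Yy](|h|hA|hmA|hm|hn|k|kmA|km|kn|nA)$")
RX_SHORT = re.compile(r"^[wf]?f[Yy]([hk]?|hA|[hk]mA?|[hk]n|nA)$")


def candidates(max_suffix_len=3):
    """Yield prefix + f + Y/y + suffix for every suffix over the letters h k m n A."""
    suffixes = [""]
    for n in range(1, max_suffix_len + 1):
        suffixes += ["".join(p) for p in product("hkmnA", repeat=n)]
    for prefix, y, suffix in product(["", "w", "f"], "Yy", suffixes):
        yield prefix + "f" + y + suffix


accepted_long = {w for w in candidates() if RX_LONG.search(w)}
accepted_short = {w for w in candidates() if RX_SHORT.search(w)}
same = accepted_long == accepted_short
```

Over this candidate set both expressions accept the same 66 strings (3 prefixes, 2 spellings of the final letter, 11 suffixes, counting the empty prefix and suffix), of which the corpus attests the 48 forms listed above.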

Some contamination from other lemmas does occur. Some of the homographs, such as 聟?, are easy to spot, but others are quite unexpected. For example, 沚焙 is also the Armenian proper name Vigen (I thank my Arabic-L colleagues for pointing this out), and 沚簀 is also the Belgian firm FINA and a rare spelling of Vienna. Unexpected contamination usually shows up when a concordance is generated (see CONCORDANCING). I can anticipate some of the extraneous lemmas by running the wordlist through my morphological parser (see MORPHOLOGY ANALYSIS) before generating the concordance.



Copyright © 2002 QAMUS LLC
