Arabic Word Frequency Counts
When tokenizing a text to generate a word frequency count, I define an Arabic
word as one or more consecutive characters from the following sets:
Arabic characters [\xC1-\xD6\xD8-\xDB\xDD-\xDF\xE1\xE3-\xE6\xEC\xED]
Persian characters [\x81\x8D\x8E\x90]
short vowels and diacritics [\xF0-\xF3\xF5\xF6\xF8\xFA]
the lengthening character [\xDC]
Note: All hex values are those of
the Arabic Windows (1256) code page.
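The definition above could be sketched in Python as follows. This is an illustration only, not the script used for the counts: it assumes the input is raw Windows-1256 bytes, and the names WORD_RE and tokenize are hypothetical.

```python
import re

# Byte classes copied from the word definition (Windows-1256 values).
ARABIC = rb"\xC1-\xD6\xD8-\xDB\xDD-\xDF\xE1\xE3-\xE6\xEC\xED"
PERSIAN = rb"\x81\x8D\x8E\x90"
DIACRITICS = rb"\xF0-\xF3\xF5\xF6\xF8\xFA"
TATWEEL = rb"\xDC"  # the lengthening character

# One token = one or more consecutive bytes from any of the four classes.
WORD_RE = re.compile(b"[" + ARABIC + PERSIAN + DIACRITICS + TATWEEL + b"]+")


def tokenize(data: bytes) -> list[bytes]:
    """Return all maximal runs of word characters found in the raw bytes."""
    return WORD_RE.findall(data)


# "fy Albyt." encoded in Windows-1256; the space and the period are not
# word characters, so they separate or end tokens.
sample = b"\xDD\xED \xC7\xE1\xC8\xED\xCA."
print(tokenize(sample))
```

Working on bytes rather than decoded text keeps the character classes identical to the hex ranges listed above.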
When tokenizing Arabic input it's
a good idea to make a preliminary pass to detect and fix
punctuation anomalies, such as the Arabic character ra'
(\xD1) used as a numeric comma or "decimal
separator" (U+066B), and the Arabic lengthening character (\xDC) used as an em dash or numeric hyphen. Numbers are
sometimes encoded visually instead of logically, and the digit
zero occasionally functions as a period (full stop).
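Two of the anomalies mentioned above can be detected with simple context rules. The sketch below is a guess at what such a pass might look like, not the procedure actually used: it assumes digits are ASCII, treats ra' or the lengthening character as punctuation only when it sits directly between two digits, and the replacement characters ("." and "-") are an arbitrary choice. The name fix_punctuation is hypothetical. The em dash, visual digit order, and zero-as-period cases are not handled.

```python
import re

# ra' (0xD1) directly between two ASCII digits: likely a decimal separator.
RA_BETWEEN_DIGITS = re.compile(rb"(?<=[0-9])\xD1(?=[0-9])")
# Lengthening character(s) (0xDC) between two digits: likely a numeric hyphen.
TATWEEL_BETWEEN_DIGITS = re.compile(rb"(?<=[0-9])\xDC+(?=[0-9])")


def fix_punctuation(data: bytes) -> bytes:
    """Rewrite the two digit-context anomalies; leave everything else alone."""
    data = RA_BETWEEN_DIGITS.sub(b".", data)       # assumed replacement
    data = TATWEEL_BETWEEN_DIGITS.sub(b"-", data)  # assumed replacement
    return data


print(fix_punctuation(b"3\xD15"))         # ra' between digits
print(fix_punctuation(b"1990\xDC1995"))   # lengthening character between digits
print(fix_punctuation(b"\xC8\xD1"))       # ra' inside a word is untouched
```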
After tokenizing according to the
above criteria I remove all short vowels, diacritics and the
lengthening character, and count the remainder as a word. Null
strings are discarded.
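The normalization and counting step might look like the following sketch. It assumes tokens are Windows-1256 byte strings; normalize and count_words are hypothetical names. The number of types is then the number of distinct keys, and the number of tokens is the sum of the counts.

```python
import re
from collections import Counter

# Short vowels and diacritics plus the lengthening character (Windows-1256).
STRIP_RE = re.compile(rb"[\xF0-\xF3\xF5\xF6\xF8\xFA\xDC]")


def normalize(token: bytes) -> bytes:
    """Remove short vowels, diacritics and the lengthening character."""
    return STRIP_RE.sub(b"", token)


def count_words(tokens) -> Counter:
    """Count normalized tokens, discarding any that become empty."""
    counts = Counter()
    for token in tokens:
        word = normalize(token)
        if word:  # null strings are discarded
            counts[word] += 1
    return counts


# Three spellings of the same word plus a token made only of lengthening marks.
tokens = [b"\xDD\xF6\xED", b"\xDD\xED", b"\xDD\xDC\xED", b"\xDC\xDC"]
counts = count_words(tokens)
print("types:", len(counts), "tokens:", sum(counts.values()))
```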
Here are the types/tokens figures
from the last three word frequencies I have generated:
date         types        tokens
Feb. 1999    1,359,309    167,216,930
Feb. 2001    2,578,709    589,184,483
Aug. 2002    3,509,499    1,141,563,654
The table below shows the top 30
words and their frequencies from my three frequency counts. Note
that starting with my Feb. 2001 count I added a "file count"
figure and began to use it, instead of the frequency count, as
the primary sort key. On Nov. 13, 2002 I wrote a script to get
the Google frequency: it's an interesting statistic, especially
if you look at the web as the Mother of All Corpora.
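Ranking by file count rather than by raw frequency can be sketched as below. The input format (one list of words per file), the tie-breaking order (frequency, then the word itself) and the name rank_words are assumptions made for the example; the sample words and counts are invented.

```python
from collections import Counter


def rank_words(files):
    """Return (rank, word, frequency, file count) rows, sorted by file count."""
    freq = Counter()
    file_count = Counter()
    for words in files:
        freq.update(words)             # every occurrence counts
        file_count.update(set(words))  # each file counts once per word
    # Primary key: file count. Tie-breakers (frequency, then word) are assumed.
    ranked = sorted(freq, key=lambda w: (-file_count[w], -freq[w], w))
    return [(rank, w, freq[w], file_count[w]) for rank, w in enumerate(ranked, 1)]


# Toy corpus: "fy" is more frequent overall, but "mn" occurs in more files.
files = [["fy", "fy", "fy", "mn"], ["mn"], ["mn", "fy"]]
for row in rank_words(files):
    print(row)
```

With this ordering a word that is spread across many files can outrank a word that is more frequent but concentrated in fewer files.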
Feb. 1999                   Feb. 2001                               Aug. 2002                               Nov. 2002
rank  word      frequency   rank  word      frequency   file count  rank  word      frequency   file count  Google
1     沚        5,645,218   1     邃        13,624,732  1,144,319   1     邃        26,533,543  2,511,236   6,600,000
2     邃        3,871,153   2     沚        18,817,693  1,128,546   2     沚        36,615,810  2,422,564   7,540,000
3     旻?       2,310,879   3     旻?       7,508,546   915,406     3     旻?       14,173,880  1,996,755   4,320,000
4     売        2,219,600   4     昶        3,669,877   776,762     4     昶        6,717,942   1,684,706   2,660,000
5     煤?       1,516,247   5     売        6,518,011   748,416     5     売        12,214,896  1,637,296   1,450,000
6     煤簿      1,072,702   6     煤簿      3,530,111   733,062     6     煤簿      6,963,708   1,625,725   1,710,000
7     昶        933,872     7     煤?       4,228,886   666,316     7     煤?       7,861,905   1,424,032   1,200,000
8     煤俯      727,170     8     煤俯      2,441,785   645,854     8     煤俯      4,712,703   1,406,014   1,370,000
9     窕        673,928     9     窕        2,307,276   618,865     9     窕        4,597,178   1,401,634   1,500,000
10    縊?       664,751     10    縊?       2,262,268   562,428     10    縊?       4,234,060   1,216,084   1,730,000
11    縊?       621,972     11    縊?       2,056,768   533,077     11    縊?       3,936,525   1,167,126   1,440,000
12    稠        614,348     12    稠        2,134,871   516,655     12    蛮?       2,728,220   1,119,473   1,240,000
13    畴        596,737     13    罷?       1,557,175   498,405     13    罷?       3,016,512   1,103,814   1,230,000
14    罷?       471,859     14    蛮?       1,355,002   498,352     14    稠        3,981,028   1,102,480   2,100,000
15    嫡        444,508     15    畴        2,053,785   448,718     15    令煤      2,285,299   983,953     624,000
16    倚?       390,446     16    令煤      1,128,034   433,457     16    焜?       2,236,668   952,957     947,000
17    瀁?       385,909     17    焜?       1,148,968   429,963     17    畴        3,780,512   949,315     2,000,000
18    蛮?       383,454     18    焉        1,264,232   415,445     18    嫡        4,960,500   944,445     2,320,000
19    焉        372,917     19    倚?       1,336,611   414,018     19    焉        2,373,084   876,683     1,100,000
20    痺        347,762     20    嫡        2,222,026   395,224     20    倚?       2,432,523   874,247     1,190,000
21    刀?       336,817     21    瀁?       1,316,091   392,905     21    瀁?       2,417,022   828,361     1,170,000
22    披        330,130     22    痺        1,165,657   378,694     22    痺        2,211,351   817,968     1,060,000
23    焜?       316,837     23    絡?       886,973     349,694     23    聢煤      1,998,242   805,975     531,000
24    陪        300,602     24    煤敘?     904,412     348,480     24    渭        1,750,386   795,319     828,000
25    繙        299,244     25    淅?       774,256     348,359     25    册輦      1,850,654   781,413     507,000
26    令煤      297,653     26    册輦      942,069     346,054     26    煤敘?     1,847,613   781,323     579,000
27    煤斫罷?   289,300     27    聟?       921,810     344,874     27    絡?       1,756,690   777,851     628,000
28    売?       269,280     28    聢煤      992,961     344,420     28    淅?       1,537,048   767,380     790,000
29    煤敘?     268,549     29    渭        879,497     343,512     29    煤辭?     1,426,832   765,200     840,000
30    煤痳      267,092     30    聢?       863,727     343,123     30    聟?       1,675,304   713,037     652,000
The above is not a lemmatized list.
Although some word forms are easily merged (e.g., 煤? = 刀?), most word forms require contextual
analysis to be disambiguated (e.g., 売 = 駐糲 or 駐鬻? or 当糲 or 当鬻? or 鯛).
Before retrieving citations from
my corpus I find it useful to go to the wordlist first and use it
to test the regular expression that I will later use for
searching the corpus. The wordlist also allows me to see the
total number of hits that I will get when generating a
concordance and thus anticipate the file size of the concordance.
Some word forms are unambiguous and lend themselves to fairly
simple regular expressions when searching for them in the
wordlist. For example, the regular expression /[A><]stqlAb/ produced:
30 forms (Total Freq: 570 = 1
every 2,002,743 words) (Here is a page with my Transliteration).
word              rank        freq    filecnt
煤排舗畴?         203,353     156     118
煤排舗畴罷?       236,180     115     88
排舗畴?           247,925     103     80
翡畴喨渝波        338,618     46      43
排舗畴罷?         402,948     33      30
煤排舗畴罷        527,621     24      17
翡畴喨渝波輊      600,354     18      13
翡喨渝波          670,143     11      11
畴喨渝波          738,984     9       9
版喨渝波          878,371     7       6
排舗畴罷          888,965     6       6
煤途舗畴?         950,488     7       5
排舗畴版          1,257,608   3       3
排舗畴斐          1,257,609   3       3
排舗畴斐?         1,257,610   3       3
煤池舗畴?         1,260,732   3       3
煤池舗畴罷?       1,260,733   3       3
途舗畴?           1,514,774   2       2
排舗畴版?         1,526,307   2       2
排舗畴罷?         1,526,308   2       2
煤排舗畴版?       1,537,658   2       2
版畴喨渝波        1,596,397   2       2
瀁喨渝波          1,724,178   2       2
瘁排舗畴?         1,763,413   2       2
煤途舗畴罷?       2,135,762   1       1
煤排舗畴披渾      2,147,334   1       1
翆穡盃煤排舗畴?   3,166,051   1       1
翡喨渝波?         3,185,260   1       1
翡喨渝波輊        3,185,261   1       1
翡畴喨渝波琶      3,200,045   1       1
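A wordlist search of this kind, including the "Total Freq" and "1 every N words" figures, can be sketched as follows. The wordlist layout (transliterated form, frequency, file count), the function names and the three sample entries are invented for illustration. The floor division is an inference: it reproduces the figures quoted on this page when the Aug. 2002 token count is used.

```python
import re

CORPUS_TOKENS = 1_141_563_654  # Aug. 2002 token count reported earlier


def search_wordlist(pattern, wordlist):
    """Return the matching (form, freq, filecnt) rows and their total frequency."""
    rx = re.compile(pattern)
    hits = [row for row in wordlist if rx.search(row[0])]
    total_freq = sum(row[1] for row in hits)
    return hits, total_freq


def one_every(total_freq, corpus_tokens=CORPUS_TOKENS):
    """Average number of corpus words per occurrence, rounded down."""
    return corpus_tokens // total_freq


# Invented sample entries in transliteration (not real corpus counts).
wordlist = [("AlAstqlAb", 12, 9), ("wAstqlAb", 3, 3), ("AstqlAl", 40, 25)]
hits, total = search_wordlist(r"[A><]stqlAb", wordlist)
print(len(hits), "forms, total freq", total)
print(one_every(570))  # using the Total Freq reported for this search
```

Because the total is known before any concordance is built, the same number gives a rough idea of how many citation lines to expect.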
Other lemmas require more complex
regular expressions, such as the following, for the noun stem fltAn:
/^[wf]?([blk]?|[bk]?Al|[blk]?hAl|ll)fltAn/
(Note: this includes the
colloquial prefix hAl-)
29 forms (Total Freq: 1,728 = 1
every 660,627 words) (Here is a page with my Transliteration).
word              rank        freq    filecnt
煤氈頁?           78,320      779     661
氈頁?             101,135     469     426
翡痞疂売          192,043     144     132
聟疂売            290,316     60      59
瘁氈頁?           298,073     56      56
氈頁簀            351,632     41      40
氈頁簓            367,836     49      36
版痞疂売          482,666     21      21
痞疂売            495,940     20      20
煤氈頁簓          514,485     22      18
否疂売            662,372     11      11
氈頁粤            665,759     11      11
氈頁粤?           737,270     9       9
煤氈頁粱          892,673     6       6
聟疂売?           931,416     6       6
翡痞疂売?         1,080,908   5       4
歿痞疂売          1,335,814   3       3
氈頁粱?           1,710,064   2       2
翡痞疂売辟        1,838,284   2       2
翦氈頁?           1,853,312   2       2
聟疂売?           1,881,266   2       2
煤氈頁粱?         2,262,027   1       1
煤氈頁粱?         2,262,028   1       1
氈頁簀?           2,809,431   1       1
氈頁粱            2,809,432   1       1
聟疂売綰          3,358,429   1       1
聟疂売繝          3,358,430   1       1
聲痞疂売          3,395,987   1       1
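To see which prefix combinations the fltAn expression accepts, it can be run against a few transliterated test strings. The strings below are constructed for the example and are not taken from the wordlist; the helper name matches is hypothetical.

```python
import re

# The noun-stem pattern quoted above. There is no end anchor, so any
# suffix after the stem is accepted.
FLTAN_RE = re.compile(r"^[wf]?([blk]?|[bk]?Al|[blk]?hAl|ll)fltAn")


def matches(form: str) -> bool:
    """True if the transliterated form starts with an allowed prefix + fltAn."""
    return FLTAN_RE.search(form) is not None


for form in ["fltAn", "wAlfltAn", "bhAlfltAn", "llfltAn", "wfltAnp",
             "lAlfltAn", "mfltAn"]:
    print(form, matches(form))
```

The pattern rejects l + Al and admits ll instead, which reflects the spelling of the preposition l- before the definite article.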
The following regular expressions
extract all the inflected forms of the preposition fy
(both versions produce the same output):
/^[wf]?f[Yy](|h|hA|hmA|hm|hn|k|kmA|km|kn|nA)$/
/^[wf]?f[Yy]([hk]?|hA|[hk]mA?|[hk]n|nA)$/
48 forms (Total Freq: 42,810,237 =
1 every 26 words) (Here is a page with my Transliteration).
word              rank        freq        filecnt
沚                2           36,615,810  2,422,564
聟?               30          1,675,304   713,037
沚綰              36          1,222,801   632,440
沚?               50          1,030,089   545,319
沍                132         1,894,110   321,671
毫?               920         122,836     95,251
沚繝              3,081       46,072      36,049
聟?               3,103       56,930      35,850
聟辣              5,674       46,993      20,247
沚繝?             5,837       24,842      19,709
聟辣?             7,115       19,925      16,027
沚簀              7,938       18,509      14,195
沚?               11,933      12,639      8,909
毫?               25,202      3,981       3,560
毫辣              29,518      3,705       2,874
沚焜              30,851      4,485       2,712
毫辣?             34,545      3,087       2,317
沚繖              46,295      2,632       1,518
聟辣?             52,220      1,792       1,266
聟辟?             72,453      955         750
聟辣稠            101,248     581         425
聟轜              120,370     357         314
沍綰              160,503     399         185
沍?               168,479     340         169
毫辣稠            170,194     204         166
毫辣?             186,930     158         139
聟轜?             190,969     213         133
沚焙              218,127     130         103
聟辣?             240,648     99          85
沚焜?             291,082     83          58
毫轜              360,477     39          38
毫辟?             456,896     29          23
毫辣?             562,437     18          15
沍簀              606,172     14          13
沍繝              687,798     11          10
聟跂?             689,782     11          10
聟轜?             747,527     9           9
聟跂              769,046     9           8
沍繝?             817,607     8           7
沍?               869,451     9           6
毫轜?             953,608     7           5
毫跂              1,340,653   3           3
毫跂?             1,340,654   3           3
沍焜              1,715,068   2           2
毫轜?             2,805,347   1           1
沍繖              2,826,754   1           1
聟跂稠            3,359,666   1           1
聟跂?             3,359,667   1           1
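That the two fy expressions accept the same forms can be checked by brute force. Both patterns are anchored and use only the letters w, f, Y, y, h, k, A, m and n, and neither can match more than six characters, so enumerating every string up to that length over that alphabet covers all cases. The helper names below are hypothetical.

```python
import re
from itertools import product

RX_LONG = re.compile(r"^[wf]?f[Yy](|h|hA|hmA|hm|hn|k|kmA|km|kn|nA)$")
RX_SHORT = re.compile(r"^[wf]?f[Yy]([hk]?|hA|[hk]mA?|[hk]n|nA)$")

ALPHABET = "wfYyhkAmn"  # every literal letter used by either pattern
MAX_LEN = 6             # prefix (1) + f (1) + Y/y (1) + longest suffix (3)


def candidates(max_len=MAX_LEN):
    """Yield every string over ALPHABET with length 1..max_len."""
    for n in range(1, max_len + 1):
        for chars in product(ALPHABET, repeat=n):
            yield "".join(chars)


def accepted(rx):
    """Return the set of candidate strings that the pattern matches."""
    return {s for s in candidates() if rx.match(s)}


long_forms = accepted(RX_LONG)
short_forms = accepted(RX_SHORT)
print(long_forms == short_forms, len(long_forms))
```

The number of strings a pattern can match is an upper bound on the number of forms a search can return; the wordlist contains only those that actually occur in the corpus.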
Some contamination from other
lemmas does occur. Some of the homographs, such as 聟?, are easy to spot, but others are quite
unexpected. For example, 沚焙 is also the
Armenian proper name Vigen (I thank my Arabic-L colleagues for
pointing this out), and 沚簀 is also the Belgian firm FINA and a rare
spelling of Vienna. Unexpected contamination usually shows up
when a concordance is generated (see CONCORDANCING). I can anticipate some of the extraneous
lemmas by running the wordlist through my morphological parser (see
MORPHOLOGY ANALYSIS) before generating the concordance.
Copyright © 2002
QAMUS LLC