Arabic Word Frequency Counts

When tokenizing a text for purposes of generating a word frequency count, I define an Arabic word as:

one or more consecutive Arabic characters [\xC1-\xD6\xD8-\xDB\xDD-\xDF\xE1\xE3-\xE6\xEC\xED]
including Persian characters [\x81\x8D\x8E\x90]
short vowels and diacritics [\xF0-\xF3\xF5\xF6\xF8\xFA]
and the lengthening character [\xDC]

Note: All hex values are those of the Arabic Windows (1256) code page.

When tokenizing Arabic input it's a good idea to make a preliminary pass to detect and fix punctuation anomalies, such as the Arabic character ra' (\xD1) used as a numeric comma or "decimal separator" (U+066B), and the Arabic lengthening character (\xDC) used as an em dash or numeric hyphen. Numbers are sometimes encoded visually instead of logically, and the digit zero occasionally functions as a period (full stop).
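A preliminary pass of this kind might look like the following Python sketch. It assumes the text has already been decoded from Windows-1256 to Unicode, and it handles only two of the anomalies mentioned: ra' between digits and the lengthening character between digits. The detection rule (a character flanked by digits) and the name `fix_numeric_punctuation` are assumptions made for illustration; visually encoded numbers and zero used as a period are not handled.

```python
import re

RA = "\u0631"           # Arabic letter ra' (\xD1 in Windows-1256)
TATWEEL = "\u0640"      # lengthening character (\xDC in Windows-1256)
DECIMAL_SEP = "\u066B"  # ARABIC DECIMAL SEPARATOR

# Assumption: a ra', or a run of lengthening characters, standing directly
# between two digits is punctuation rather than part of a word.
RA_BETWEEN_DIGITS = re.compile(rf"(?<=\d){RA}(?=\d)")
TATWEEL_BETWEEN_DIGITS = re.compile(rf"(?<=\d){TATWEEL}+(?=\d)")


def fix_numeric_punctuation(text: str) -> str:
    """Replace ra'/tatweel used as numeric punctuation (hypothetical helper)."""
    text = RA_BETWEEN_DIGITS.sub(DECIMAL_SEP, text)
    return TATWEEL_BETWEEN_DIGITS.sub("-", text)
```

Ordinary words containing ra' or the lengthening character are left untouched, because neither pattern matches unless digits appear on both sides.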

After tokenizing according to the above criteria, I remove all short vowels, diacritics, and the lengthening character, and count the remainder as a word. Null strings are discarded.
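The tokenizing and stripping steps described above can be sketched in Python as follows. This is an illustration that operates on raw Windows-1256 bytes, not the script actually used for the counts; the byte classes are copied from the definition above, and the name `count_words` is hypothetical.

```python
import re
from collections import Counter

# Byte classes from the word definition (Windows-1256 values).
LETTERS = rb"\xC1-\xD6\xD8-\xDB\xDD-\xDF\xE1\xE3-\xE6\xEC\xED"
PERSIAN = rb"\x81\x8D\x8E\x90"
MARKS = rb"\xF0-\xF3\xF5\xF6\xF8\xFA"   # short vowels and diacritics
TATWEEL = rb"\xDC"                      # lengthening character

# A token is one or more consecutive bytes from any of the four classes.
WORD_RE = re.compile(b"[" + LETTERS + PERSIAN + MARKS + TATWEEL + b"]+")
# Bytes removed from each token before it is counted.
STRIP_RE = re.compile(b"[" + MARKS + TATWEEL + b"]")


def count_words(data: bytes) -> Counter:
    """Count word forms in Windows-1256 encoded text."""
    counts = Counter()
    for match in WORD_RE.finditer(data):
        word = STRIP_RE.sub(b"", match.group())
        if word:  # null strings are discarded
            counts[word.decode("cp1256")] += 1
    return counts
```

With a counter like this, the number of types is `len(counts)` and the number of tokens is `sum(counts.values())`.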

Here are the types/tokens figures from the last three word frequencies I have generated:

date         types       tokens
Feb. 1999    1,359,309   167,216,930
Feb. 2001    2,578,709   589,184,483
Aug. 2002    3,509,499   1,141,563,654

The table below shows the top 30 words and their frequencies from my three frequency counts. Note that starting with my Feb. 2001 count I added a "file count" figure (the number of files in which a word occurs) and began to use it, instead of the frequency count, as the primary sort key. On Nov. 13, 2002 I wrote a script to get the Google frequency for each word: it's an interesting statistic, especially if you look at the web as the Mother of All Corpora.

Feb. 1999 (rank, word, frequency) | Feb. 2001 (rank, word, frequency, file count) | Aug. 2002 (rank, word, frequency, file count) | Nov. 2002 (Google)

1 في 5,645,218 1 من 13,624,732 1,144,319 1 من 26,533,543 2,511,236 6,600,000

2 من 3,871,153 2 في 18,817,693 1,128,546 2 في 36,615,810 2,422,564 7,540,000

3 عل? 2,310,879 3 عل? 7,508,546 915,406 3 عل? 14,173,880 1,996,755 4,320,000

4 ان 2,219,600 4 عن 3,669,877 776,762 4 عن 6,717,942 1,684,706 2,660,000

5 ال? 1,516,247 5 ان 6,518,011 748,416 5 ان 12,214,896 1,637,296 1,450,000

6 التي 1,072,702 6 التي 3,530,111 733,062 6 التي 6,963,708 1,625,725 1,710,000

7 عن 933,872 7 ال? 4,228,886 666,316 7 ال? 7,861,905 1,424,032 1,200,000

8 الذي 727,170 8 الذي 2,441,785 645,854 8 الذي 4,712,703 1,406,014 1,370,000

9 مع 673,928 9 مع 2,307,276 618,865 9 مع 4,597,178 1,401,634 1,500,000

10 هذ? 664,751 10 هذ? 2,262,268 562,428 10 هذ? 4,234,060 1,216,084 1,730,000

11 هذ? 621,972 11 هذ? 2,056,768 533,077 11 هذ? 3,936,525 1,167,126 1,440,000

12 ما 614,348 12 ما 2,134,871 516,655 12 بع? 2,728,220 1,119,473 1,240,000

13 لا 596,737 13 بي? 1,557,175 498,405 13 بي? 3,016,512 1,103,814 1,230,000

14 بي? 471,859 14 بع? 1,355,002 498,352 14 ما 3,981,028 1,102,480 2,100,000

15 أن 444,508 15 لا 2,053,785 448,718 15 خلال 2,285,299 983,953 624,000

16 ذل? 390,446 16 خلال 1,128,034 433,457 16 كم? 2,236,668 952,957 947,000

17 كا? 385,909 17 كم? 1,148,968 429,963 17 لا 3,780,512 949,315 2,000,000

18 بع? 383,454 18 كل 1,264,232 415,445 18 أن 4,960,500 944,445 2,320,000

19 كل 372,917 19 ذل? 1,336,611 414,018 19 كل 2,373,084 876,683 1,100,000

20 لم 347,762 20 أن 2,222,026 395,224 20 ذل? 2,432,523 874,247 1,190,000

21 إل? 336,817 21 كا? 1,316,091 392,905 21 كا? 2,417,022 828,361 1,170,000

22 بن 330,130 22 لم 1,165,657 378,694 22 لم 2,211,351 817,968 1,060,000

23 كم? 316,837 23 حي? 886,973 349,694 23 وقال 1,998,242 805,975 531,000

24 او 300,602 24 العا? 904,412 348,480 24 قد 1,750,386 795,319 828,000

25 هو 299,244 25 قب? 774,256 348,359 25 رئيس 1,850,654 781,413 507,000

26 خلال 297,653 26 رئيس 942,069 346,054 26 العا? 1,847,613 781,323 579,000

27 العربي? 289,300 27 وف? 921,810 344,874 27 حي? 1,756,690 777,851 628,000

28 ان? 269,280 28 وقال 992,961 344,420 28 قب? 1,537,048 767,380 790,000

29 العا? 268,549 29 قد 879,497 343,512 29 اليو? 1,426,832 765,200 840,000

30 الله 267,092 30 وق? 863,727 343,123 30 وف? 1,675,304 713,037 652,000

The above is not a lemmatized list. Although some word forms are easily merged (e.g., الى = إلى), most word forms require contextual analysis to be disambiguated (e.g., ان = أَنْ or أَنَّ or إِنْ or إِنَّ or آن).
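The ranking used for the Feb. 2001 and Aug. 2002 columns, with file count as the primary sort key, could be computed along these lines. This is a sketch under stated assumptions: the input is one word counter per file, ties on file count are broken by frequency, and remaining ties by the word itself. The name `rank_words` is hypothetical.

```python
from collections import Counter


def rank_words(per_file_counts):
    """Return (rank, word, frequency, file_count) rows, best rank first.

    per_file_counts: iterable of dict-like objects mapping word -> count,
    one per corpus file (an assumed input format).
    """
    freq = Counter()      # total occurrences across all files
    filecnt = Counter()   # number of files containing the word
    for counts in per_file_counts:
        freq.update(counts)
        filecnt.update(counts.keys())

    # Primary key: file count (descending). The tie-breakers, frequency
    # (descending) and then the word itself, are assumptions.
    ordered = sorted(freq, key=lambda w: (-filecnt[w], -freq[w], w))
    return [(rank, w, freq[w], filecnt[w])
            for rank, w in enumerate(ordered, start=1)]
```

Sorting this way explains why a word can outrank another that is more frequent overall: in the Aug. 2002 column the rank 1 word has the higher file count even though the rank 2 word has the higher frequency.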

Before retrieving citations from my corpus I find it useful to go to the wordlist first and use it to test the regular expression that I will later use for searching the corpus. The wordlist also allows me to see the total number of hits that I will get when generating a concordance and thus anticipate the file size of the concordance. Some word forms are unambiguous and lend themselves to fairly simple regular expressions when searching for them in the wordlist. For example, the regular expression /[A><]stqlAb/ produced:

30 forms (Total Freq: 570 = 1 every 2,002,743 words) (Here is a page with my Transliteration).

word rank freq filecnt

الاستقلا? 203,353 156 118

الاستقلابي? 236,180 115 88

استقلا? 247,925 103 80

والاستقلاب 338,618 46 43

استقلابي? 402,948 33 30

الاستقلابي 527,621 24 17

والاستقلابية 600,354 18 13

واستقلاب 670,143 11 11

لاستقلاب 738,984 9 9

باستقلاب 878,371 7 6

استقلابي 888,965 6 6

الإستقلا? 950,488 7 5

استقلابا 1,257,608 3 3

استقلابه 1,257,609 3 3

استقلابه? 1,257,610 3 3

الأستقلا? 1,260,732 3 3

الأستقلابي? 1,260,733 3 3

إستقلا? 1,514,774 2 2

استقلابا? 1,526,307 2 2

استقلابي? 1,526,308 2 2

الاستقلابا? 1,537,658 2 2

بالاستقلاب 1,596,397 2 2

كاستقلاب 1,724,178 2 2

للاستقلا? 1,763,413 2 2

الإستقلابي? 2,135,762 1 1

الاستقلابنقص 2,147,334 1 1

وأمراضالاستقلا? 3,166,051 1 1

واستقلاب? 3,185,260 1 1

واستقلابية 3,185,261 1 1

والاستقلابات 3,200,045 1 1
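Testing a regular expression against the wordlist, as described above, might be sketched like this. The sketch assumes that the transliteration in the regular expressions follows the standard Buckwalter table for the letters concerned, and that the wordlist is available as (word, frequency, file count) tuples; `search_wordlist` and the tuple layout are hypothetical.

```python
import re

# Arabic letters U+0621..U+063A, tatweel U+0640, and U+0641..U+064A,
# paired position by position with their Buckwalter transliteration.
ARABIC = "".join(chr(c) for c in
                 list(range(0x0621, 0x063B)) + list(range(0x0640, 0x064B)))
BUCKWALTER = "'|>&<}AbptvjHxd*rzs$SDTZEg_fqklmnhwYy"
TO_BW = str.maketrans(ARABIC, BUCKWALTER)
FROM_BW = str.maketrans(BUCKWALTER, ARABIC)


def search_wordlist(pattern, wordlist, total_tokens):
    """Return matching rows, their total frequency, and "1 every N words".

    wordlist: iterable of (arabic_word, frequency, file_count) tuples.
    total_tokens: size of the corpus in tokens.
    """
    regex = re.compile(pattern)
    hits = [row for row in wordlist if regex.search(row[0].translate(TO_BW))]
    total = sum(freq for _, freq, _ in hits)
    # Integer division reproduces the truncated "1 every N words" figures.
    every = total_tokens // total if total else None
    return hits, total, every
```

For the figures quoted above, a total frequency of 570 in a corpus of 1,141,563,654 tokens gives 1,141,563,654 // 570 = 2,002,743.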

Other lemmas require more complex regular expressions, such as the following, for the noun stem fltAn:

/^[wf]?([blk]?|[bk]?Al|[blk]?hAl|ll)fltAn/ (Note: this includes the colloquial prefix hAl-)

29 forms (Total Freq: 1,728 = 1 every 660,627 words) (Here is a page with my Transliteration).

word rank freq filecnt

الفلتا? 78,320 779 661

فلتا? 101,135 469 426

والفلتان 192,043 144 132

وفلتان 290,316 60 59

للفلتا? 298,073 56 56

فلتانا 351,632 41 40

فلتانة 367,836 49 36

بالفلتان 482,666 21 21

لفلتان 495,940 20 20

الفلتانة 514,485 22 18

بفلتان 662,372 11 11

فلتانه 665,759 11 11

فلتانه? 737,270 9 9

الفلتاني 892,673 6 6

وفلتان? 931,416 6 6

والفلتان? 1,080,908 5 4

فالفلتان 1,335,814 3 3

فلتاني? 1,710,064 2 2

والفلتانين 1,838,284 2 2

وبفلتا? 1,853,312 2 2

وفلتان? 1,881,266 2 2

الفلتاني? 2,262,027 1 1

الفلتاني? 2,262,028 1 1

فلتانا? 2,809,431 1 1

فلتاني 2,809,432 1 1

هالفلتان 3,138,700 1 1

وفلتانها 3,358,429 1 1

وفلتانهم 3,358,430 1 1

وللفلتان 3,395,987 1 1

The following regular expressions extract all the inflected forms of the preposition fy (both versions produce the same output):

/^[wf]?f[Yy](|h|hA|hmA|hm|hn|k|kmA|km|kn|nA)$/
/^[wf]?f[Yy]([hk]?|hA|[hk]mA?|[hk]n|nA)$/

48 forms (Total Freq: 42,810,237 = 1 every 26 words) (Here is a page with my Transliteration).

word rank freq filecnt

في 2 36,615,810 2,422,564

وف? 30 1,675,304 713,037

فيها 36 1,222,801 632,440

في? 50 1,030,089 545,319

فى 132 1,894,110 321,671

فف? 920 122,836 95,251

فيهم 3,081 46,072 36,049

وف? 3,103 56,930 35,850

وفيه 5,674 46,993 20,247

فيهم? 5,837 24,842 19,709

وفيه? 7,115 19,925 16,027

فينا 7,938 18,509 14,195

في? 11,933 12,639 8,909

فف? 25,202 3,981 3,560

ففيه 29,518 3,705 2,874

فيكم 30,851 4,485 2,712

ففيه? 34,545 3,087 2,317

فيهن 46,295 2,632 1,518

وفيه? 52,220 1,792 1,266

وفين? 72,453 955 750

وفيهما 101,248 581 425

وفيك 120,370 357 314

فىها 160,503 399 185

فى? 168,479 340 169

ففيهما 170,194 204 166

ففيه? 186,930 158 139

وفيك? 190,969 213 133

فيكن 218,127 130 103

وفيه? 240,648 99 85

فيكم? 291,082 83 58

ففيك 360,477 39 38

ففين? 456,896 29 23

ففيه? 562,437 18 15

فىنا 606,172 14 13

فىهم 687,798 11 10

وفىه? 689,782 11 10

وفيك? 747,527 9 9

وفىه 769,046 9 8

فىهم? 817,607 8 7

فى? 869,451 9 6

ففيك? 953,608 7 5

ففىه 1,340,653 3 3

ففىه? 1,340,654 3 3

فىكم 1,715,068 2 2

ففيك? 2,805,347 1 1

فىهن 2,826,754 1 1

وفىهما 3,359,666 1 1

وفىه? 3,359,667 1 1
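The statement that the two regular expressions for fy produce the same output can be checked independently of the corpus. The sketch below enumerates every string of up to six characters over the transliteration letters that appear in the two patterns and compares the sets each pattern accepts. The brute-force approach and the helper names are illustrative assumptions, not part of the original workflow.

```python
import itertools
import re

RE_LONG = re.compile(r"^[wf]?f[Yy](|h|hA|hmA|hm|hn|k|kmA|km|kn|nA)$")
RE_SHORT = re.compile(r"^[wf]?f[Yy]([hk]?|hA|[hk]mA?|[hk]n|nA)$")

# Every character that occurs in either pattern.
ALPHABET = "wfYyhkmnA"
# Longest possible match: one prefix letter + "f" + "Y"/"y" + 3-letter suffix.
MAX_LEN = 6


def candidates(max_len=MAX_LEN):
    """Yield all strings over ALPHABET with length 1..max_len."""
    for n in range(1, max_len + 1):
        for chars in itertools.product(ALPHABET, repeat=n):
            yield "".join(chars)


def accepted(regex, max_len=MAX_LEN):
    """Return the set of candidate strings that the regex matches."""
    return {s for s in candidates(max_len) if regex.match(s)}
```

Comparing `accepted(RE_LONG)` with `accepted(RE_SHORT)` shows whether the two patterns agree on every possible form. Both accept 3 prefixes x 2 spellings of the final letter x 11 suffixes = 66 strings, of which 48 are attested in the wordlist above.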

Some contamination from other lemmas does occur. Some of the homographs, such as وف?, are easy to spot, but others are quite unexpected. For example, فيكن is also the Armenian proper name Vigen (I thank my Arabic-L colleagues for pointing this out), and فينا is also the Belgian firm FINA and a rare spelling of Vienna. Unexpected contamination usually shows up when a concordance is generated (see CONCORDANCING). I can anticipate some of the extraneous lemmas by running the wordlist through my morphological parser (see MORPHOLOGY ANALYSIS) before generating the concordance.