Commit d41c145

Create 7-1-NLTK-with-the-Greek-Script.ipynb

1 parent b2633a7 commit d41c145

`‎7-1-NLTK-with-the-Greek-Script.ipynb‎`

Lines changed: 259 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -0,0 +1,259 @@`
	`1`	`+{`
	`2`	`+ "cells": [`
	`3`	`+ {`
	`4`	`+ "cell_type": "markdown",`
	`5`	`+ "metadata": {},`
	`6`	`+ "source": [`
	`7`	`+ "# NLTK with non-Latin scripts (Greek)"`
	`8`	`+ ]`
	`9`	`+ },`
	`10`	`+ {`
	`11`	`+ "cell_type": "markdown",`
	`12`	`+ "metadata": {},`
	`13`	`+ "source": [`
	`14`	`+ "## 1. Cleaning text"`
	`15`	`+ ]`
	`16`	`+ },`
	`17`	`+ {`
	`18`	`+ "cell_type": "code",`
	`19`	`+ "execution_count": 13,`
	`20`	`+ "metadata": {},`
	`21`	`+ "outputs": [`
	`22`	`+ {`
	`23`	`+ "data": {`
	`24`	`+ "text/plain": [`
	`25`	`+ "'αυτος είναι ο χορός της βροχής της φυλής, ό,τι περίεργο.'"`
	`26`	`+ ]`
	`27`	`+ },`
	`28`	`+ "execution_count": 13,`
	`29`	`+ "metadata": {},`
	`30`	`+ "output_type": "execute_result"`
	`31`	`+ }`
	`32`	`+ ],`
	`33`	`+ "source": [`
	`34`	`+ "sentence = \"ΑΥΤΟΣ είναι ο χορός της βροχής της φυλής, ό,τι περίεργο.\"\n",`
	`35`	`+ "sentence = sentence.lower()\n",`
	`36`	`+ "sentence"`
	`37`	`+ ]`
	`38`	`+ },`
	`39`	`+ {`
	`40`	`+ "cell_type": "markdown",`
	`41`	`+ "metadata": {},`
	`42`	`+ "source": [`
	`43`	+ "A package called [`unidecode`](https://pypi.org/project/Unidecode) can be used to transliterate any Unicode string into the \"closest possible representation\" in ASCII text:"
	`44`	`+ ]`
	`45`	`+ },`
	`46`	`+ {`
	`47`	`+ "cell_type": "code",`
	`48`	`+ "execution_count": 14,`
	`49`	`+ "metadata": {},`
	`50`	`+ "outputs": [`
	`51`	`+ {`
	`52`	`+ "data": {`
	`53`	`+ "text/plain": [`
	`54`	`+ "'autos einai o khoros tes brokhes tes phules, o,ti periergo.'"`
	`55`	`+ ]`
	`56`	`+ },`
	`57`	`+ "execution_count": 14,`
	`58`	`+ "metadata": {},`
	`59`	`+ "output_type": "execute_result"`
	`60`	`+ }`
	`61`	`+ ],`
	`62`	`+ "source": [`
	`63`	`+ "from unidecode import unidecode\n",`
	`64`	`+ "\n",`
	`65`	`+ "sentence_latin = unidecode(sentence)\n",`
	`66`	`+ "sentence_latin"`
	`67`	`+ ]`
	`68`	`+ },`
	`69`	`+ {`
	`70`	`+ "cell_type": "code",`
	`71`	`+ "execution_count": 15,`
	`72`	`+ "metadata": {},`
	`73`	`+ "outputs": [`
	`74`	`+ {`
	`75`	`+ "data": {`
	`76`	`+ "text/plain": [`
	`77`	`+ "'αυτος ειναι ο χορος της βροχης της φυλης, ο,τι περιεργο.'"`
	`78`	`+ ]`
	`79`	`+ },`
	`80`	`+ "execution_count": 15,`
	`81`	`+ "metadata": {},`
	`82`	`+ "output_type": "execute_result"`
	`83`	`+ }`
	`84`	`+ ],`
	`85`	`+ "source": [`
	`86`	`+ "import unicodedata\n",`
	`87`	`+ "\n",`
	`88`	`+ "def strip_accents(s):\n",`
	`89`	`+ " return ''.join(c for c in unicodedata.normalize('NFD', s) # NFD = Normalization Form Canonical Decomposition, one of four Unicode normalization forms.\n",`
	`90`	`+ " if unicodedata.category(c) != 'Mn') # The character category \"Mn\" stands for Nonspacing_Mark\n",`
	`91`	`+ "sentence_no_accents = strip_accents(sentence)\n",`
	`92`	`+ "sentence_no_accents"`
	`93`	`+ ]`
	`94`	`+ },`
	`95`	`+ {`
	`96`	`+ "cell_type": "code",`
	`97`	`+ "execution_count": 16,`
	`98`	`+ "metadata": {},`
	`99`	`+ "outputs": [`
	`100`	`+ {`
	`101`	`+ "data": {`
	`102`	`+ "text/plain": [`
	`103`	`+ "['αυτος',\n",`
	`104`	`+ " 'ειναι',\n",`
	`105`	`+ " 'ο',\n",`
	`106`	`+ " 'χορος',\n",`
	`107`	`+ " 'της',\n",`
	`108`	`+ " 'βροχης',\n",`
	`109`	`+ " 'της',\n",`
	`110`	`+ " 'φυλης,',\n",`
	`111`	`+ " 'ο,τι',\n",`
	`112`	`+ " 'περιεργο.']"`
	`113`	`+ ]`
	`114`	`+ },`
	`115`	`+ "execution_count": 16,`
	`116`	`+ "metadata": {},`
	`117`	`+ "output_type": "execute_result"`
	`118`	`+ }`
	`119`	`+ ],`
	`120`	`+ "source": [`
	`121`	`+ "from nltk.tokenize import WhitespaceTokenizer\n",`
	`122`	`+ "\n",`
	`123`	`+ "tokens = WhitespaceTokenizer().tokenize(sentence_no_accents)\n",`
	`124`	`+ "tokens"`
	`125`	`+ ]`
	`126`	`+ },`
	`127`	`+ {`
	`128`	`+ "cell_type": "code",`
	`129`	`+ "execution_count": 21,`
	`130`	`+ "metadata": {},`
	`131`	`+ "outputs": [`
	`132`	`+ {`
	`133`	`+ "data": {`
	`134`	`+ "text/plain": [`
	`135`	`+ "['αυτος',\n",`
	`136`	`+ " 'ειναι',\n",`
	`137`	`+ " 'ο',\n",`
	`138`	`+ " 'χορος',\n",`
	`139`	`+ " 'της',\n",`
	`140`	`+ " 'βροχης',\n",`
	`141`	`+ " 'της',\n",`
	`142`	`+ " 'φυλης',\n",`
	`143`	`+ " 'ο,τι',\n",`
	`144`	`+ " 'περιεργο']"`
	`145`	`+ ]`
	`146`	`+ },`
	`147`	`+ "execution_count": 21,`
	`148`	`+ "metadata": {},`
	`149`	`+ "output_type": "execute_result"`
	`150`	`+ }`
	`151`	`+ ],`
	`152`	`+ "source": [`
	`153`	`+ "from string import punctuation\n",`
	`154`	`+ "\n",`
	`155`	`+ "new_tokens = []\n",`
	`156`	`+ "\n",`
	`157`	`+ "for token in tokens:\n",`
	`158`	`+ " if token == 'ο,τι':\n",`
	`159`	`+ " new_tokens.append('ο,τι')\n",`
	`160`	`+ " else:\n",`
	`161`	`+ " new_tokens.append(token.translate(str.maketrans({key: None for key in punctuation})))\n",`
	`162`	`+ "\n",`
	`163`	`+ "new_tokens_with_stopwords = new_tokens\n",`
	`164`	`+ "new_tokens"`
	`165`	`+ ]`
	`166`	`+ },`
	`167`	`+ {`
	`168`	`+ "cell_type": "markdown",`
	`169`	`+ "metadata": {},`
	`170`	`+ "source": [`
	`171`	`+ "## 2. Removing stopwords"`
	`172`	`+ ]`
	`173`	`+ },`
	`174`	`+ {`
	`175`	`+ "cell_type": "code",`
	`176`	`+ "execution_count": 18,`
	`177`	`+ "metadata": {},`
	`178`	`+ "outputs": [`
	`179`	`+ {`
	`180`	`+ "data": {`
	`181`	`+ "text/plain": [`
	`182`	`+ "83"`
	`183`	`+ ]`
	`184`	`+ },`
	`185`	`+ "execution_count": 18,`
	`186`	`+ "metadata": {},`
	`187`	`+ "output_type": "execute_result"`
	`188`	`+ }`
	`189`	`+ ],`
	`190`	`+ "source": [`
	`191`	`+ "# Greek stopwords adapted from https://github.com/6/stopwords-json; better lists with more stopwords are available, e.g. https://www.translatum.gr/forum/index.php?topic=3550.0?topic=3550.0\n",`
	`192`	+ "greek_stopwords = [\"αλλα\",\"αν\",\"αντι\",\"απο\",\"αυτα\",\"αυτες\",\"αυτη\",\"αυτο\",\"αυτοι\",\"αυτος\",\"αυτους\",\"αυτων\",\"για\",\"δε\",\"δεν\",\"εαν\",\"ειμαι\",\"ειμαστε\",\"ειναι\",\"εισαι\",\"ειστε\",\"εκεινα\",\"εκεινες\",\"εκεινη\",\"εκεινο\",\"εκεινοι\",\"εκεινος\",\"εκεινους\",\"εκεινων\",\"ενω\",\"επι\",\"η\",\"θα\",\"ισως\",\"κ\",\"και\",\"κατα\",\"κι\",\"μα\",\"με\",\"μετα\",\"μη\",\"μην\",\"να\",\"ο\",\"οι\",\"ομως\",\"οπως\",\"οσο\",\"οτι\",\"ο,τι\",\"παρα\",\"ποια\",\"ποιες\",\"ποιο\",\"ποιοι\",\"ποιος\",\"ποιους\",\"ποιων\",\"που\",\"προς\",\"πως\",\"σε\",\"στη\",\"στην\",\"στο\",\"στον\",\"στης\",\"στου\",\"στους\",\"στις\",\"στα\",\"τα\",\"την\",\"της\",\"το\",\"τον\",\"τοτε\",\"του\",\"των\",\"τις\",\"τους\",\"ως\"]\n",
	`193`	`+ "len(greek_stopwords)"`
	`194`	`+ ]`
	`195`	`+ },`
	`196`	`+ {`
	`197`	`+ "cell_type": "code",`
	`198`	`+ "execution_count": 23,`
	`199`	`+ "metadata": {},`
	`200`	`+ "outputs": [`
	`201`	`+ {`
	`202`	`+ "data": {`
	`203`	`+ "text/plain": [`
	`204`	`+ "['χορος', 'βροχης', 'φυλης', 'περιεργο']"`
	`205`	`+ ]`
	`206`	`+ },`
	`207`	`+ "execution_count": 23,`
	`208`	`+ "metadata": {},`
	`209`	`+ "output_type": "execute_result"`
	`210`	`+ }`
	`211`	`+ ],`
	`212`	`+ "source": [`
	`213`	`+ "new_tokens_set = set(new_tokens)\n",`
	`214`	`+ "greek_stopwords_set = set(greek_stopwords)\n",`
	`215`	`+ "intersection_set = new_tokens_set.intersection(greek_stopwords_set)\n",`
	`216`	`+ "intersection_set\n",`
	`217`	`+ "\n",`
	`218`	`+ "for element in intersection_set:\n",`
	`219`	`+ " new_tokens = list(filter((element).__ne__, new_tokens)) # __ne__ is the != operator.\n",`
	`220`	`+ "new_tokens"`
	`221`	`+ ]`
	`222`	`+ },`
	`223`	`+ {`
	`224`	`+ "cell_type": "markdown",`
	`225`	`+ "metadata": {},`
	`226`	`+ "source": [`
	`227`	`+ "## 3. Other packages"`
	`228`	`+ ]`
	`229`	`+ },`
	`230`	`+ {`
	`231`	`+ "cell_type": "markdown",`
	`232`	`+ "metadata": {},`
	`233`	`+ "source": [`
	`234`	+ "Other interesting packages include [`polyglot`](https://pypi.org/project/polyglot/) and [`greek-stemmer`](https://pypi.org/project/greek-stemmer/). However, these require [`PyICU`](https://pypi.org/project/PyICU/) in order to work, and installing it on Windows is difficult."
	`235`	`+ ]`
	`236`	`+ }`
	`237`	`+ ],`
	`238`	`+ "metadata": {`
	`239`	`+ "kernelspec": {`
	`240`	`+ "display_name": "Python 3",`
	`241`	`+ "language": "python",`
	`242`	`+ "name": "python3"`
	`243`	`+ },`
	`244`	`+ "language_info": {`
	`245`	`+ "codemirror_mode": {`
	`246`	`+ "name": "ipython",`
	`247`	`+ "version": 3`
	`248`	`+ },`
	`249`	`+ "file_extension": ".py",`
	`250`	`+ "mimetype": "text/x-python",`
	`251`	`+ "name": "python",`
	`252`	`+ "nbconvert_exporter": "python",`
	`253`	`+ "pygments_lexer": "ipython3",`
	`254`	`+ "version": "3.6.4"`
	`255`	`+ }`
	`256`	`+ },`
	`257`	`+ "nbformat": 4,`
	`258`	`+ "nbformat_minor": 2`
	`259`	`+}`
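
The cleaning and stopword-removal cells in this notebook can be condensed into a few small functions. What follows is a minimal sketch, not the notebook's own code: the helper names `clean_tokens` and `remove_stopwords` and the `KEEP_AS_IS` set are hypothetical, the stopword set is a small subset of the notebook's list chosen for illustration, and plain `str.split()` stands in for NLTK's `WhitespaceTokenizer` so that the block needs only the standard library. It also filters stopwords with one order-preserving list comprehension instead of the set intersection and repeated `filter` calls.

```python
import unicodedata
from string import punctuation

# Illustrative subset of the notebook's stopword list (assumption: the
# full 83-word list would be used in practice).
STOPWORDS = {"αυτος", "ειναι", "ο", "της", "ο,τι"}

# Tokens whose internal punctuation is meaningful and must be kept.
KEEP_AS_IS = {"ο,τι"}

# Translation table that deletes every ASCII punctuation character.
STRIP_PUNCT = str.maketrans("", "", punctuation)


def strip_accents(text):
    """Decompose to NFD and drop nonspacing marks (category 'Mn')."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")


def clean_tokens(sentence):
    """Lowercase, strip accents, split on whitespace, remove punctuation."""
    text = strip_accents(sentence.lower())
    tokens = []
    for token in text.split():
        token = token.strip(punctuation)  # drop leading/trailing punctuation
        if token not in KEEP_AS_IS:
            token = token.translate(STRIP_PUNCT)  # drop internal punctuation
        if token:
            tokens.append(token)
    return tokens


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep non-stopword tokens in their original order."""
    return [token for token in tokens if token not in stopwords]


sentence = "ΑΥΤΟΣ είναι ο χορός της βροχής της φυλής, ό,τι περίεργο."
result = remove_stopwords(clean_tokens(sentence))
print(result)
```

With the notebook's example sentence this should print the same four tokens as the final output cell. Using a set for the stopwords makes each membership test constant time, which matters more once the longer stopword lists mentioned in the notebook are used.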
