Commit d41c145

Create 7-1-NLTK-with-the-Greek-Script.ipynb
1 parent b2633a7 commit d41c145

1 file changed

+259
-0
lines changed

7-1-NLTK-with-the-Greek-Script.ipynb

Lines changed: 259 additions & 0 deletions
@@ -0,0 +1,259 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# NLTK with non-Latin scripts (Greek)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Cleaning text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'αυτος είναι ο χορός της βροχής της φυλής, ό,τι περίεργο.'"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sentence = \"ΑΥΤΟΣ είναι ο χορός της βροχής της φυλής, ό,τι περίεργο.\"\n",
    "sentence = sentence.lower()\n",
    "sentence"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A package called [`unidecode`](https://pypi.org/project/Unidecode) can be used to transliterate any Unicode string into the \"closest possible representation\" in ASCII text:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'autos einai o khoros tes brokhes tes phules, o,ti periergo.'"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from unidecode import unidecode\n",
    "\n",
    "sentence_latin = unidecode(sentence)\n",
    "sentence_latin"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'αυτος ειναι ο χορος της βροχης της φυλης, ο,τι περιεργο.'"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import unicodedata\n",
    "\n",
    "def strip_accents(s):\n",
    "    # NFD (Normalization Form Canonical Decomposition) is one of the four Unicode\n",
    "    # normalization forms; it splits an accented letter into base letter + marks.\n",
    "    decomposed = unicodedata.normalize('NFD', s)\n",
    "    # Category \"Mn\" (Nonspacing_Mark) covers the combining accents, so drop those.\n",
    "    return ''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')\n",
    "\n",
    "sentence_no_accents = strip_accents(sentence)\n",
    "sentence_no_accents"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['αυτος',\n",
       " 'ειναι',\n",
       " 'ο',\n",
       " 'χορος',\n",
       " 'της',\n",
       " 'βροχης',\n",
       " 'της',\n",
       " 'φυλης,',\n",
       " 'ο,τι',\n",
       " 'περιεργο.']"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from nltk.tokenize import WhitespaceTokenizer\n",
    "\n",
    "tokens = WhitespaceTokenizer().tokenize(sentence_no_accents)\n",
    "tokens"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['αυτος',\n",
       " 'ειναι',\n",
       " 'ο',\n",
       " 'χορος',\n",
       " 'της',\n",
       " 'βροχης',\n",
       " 'της',\n",
       " 'φυλης',\n",
       " 'ο,τι',\n",
       " 'περιεργο']"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from string import punctuation\n",
    "\n",
    "# Build the deletion table once. string.punctuation is ASCII-only, which is enough here.\n",
    "strip_punctuation = str.maketrans('', '', punctuation)\n",
    "\n",
    "# Keep 'ο,τι' intact: its comma is part of the word.\n",
    "new_tokens = [token if token == 'ο,τι' else token.translate(strip_punctuation) for token in tokens]\n",
    "\n",
    "new_tokens_with_stopwords = list(new_tokens)  # Copy of the tokens before stopword removal.\n",
    "new_tokens"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Removing stopwords"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "83"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Greek stopwords adapted from https://github.com/6/stopwords-json. More complete lists are available, e.g. https://www.translatum.gr/forum/index.php?topic=3550.0?topic=3550.0\n",
    "greek_stopwords = [\"αλλα\",\"αν\",\"αντι\",\"απο\",\"αυτα\",\"αυτες\",\"αυτη\",\"αυτο\",\"αυτοι\",\"αυτος\",\"αυτους\",\"αυτων\",\"για\",\"δε\",\"δεν\",\"εαν\",\"ειμαι\",\"ειμαστε\",\"ειναι\",\"εισαι\",\"ειστε\",\"εκεινα\",\"εκεινες\",\"εκεινη\",\"εκεινο\",\"εκεινοι\",\"εκεινος\",\"εκεινους\",\"εκεινων\",\"ενω\",\"επι\",\"η\",\"θα\",\"ισως\",\"κ\",\"και\",\"κατα\",\"κι\",\"μα\",\"με\",\"μετα\",\"μη\",\"μην\",\"να\",\"ο\",\"οι\",\"ομως\",\"οπως\",\"οσο\",\"οτι\",\"ο,τι\",\"παρα\",\"ποια\",\"ποιες\",\"ποιο\",\"ποιοι\",\"ποιος\",\"ποιους\",\"ποιων\",\"που\",\"προς\",\"πως\",\"σε\",\"στη\",\"στην\",\"στο\",\"στον\",\"στης\",\"στου\",\"στους\",\"στις\",\"στα\",\"τα\",\"την\",\"της\",\"το\",\"τον\",\"τοτε\",\"του\",\"των\",\"τις\",\"τους\",\"ως\"]\n",
    "len(greek_stopwords)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['χορος', 'βροχης', 'φυλης', 'περιεργο']"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# The stopword list is unaccented and lowercase, matching the cleaned tokens.\n",
    "greek_stopwords_set = set(greek_stopwords)\n",
    "\n",
    "# Keep only the tokens that are not stopwords, preserving their order.\n",
    "new_tokens = [token for token in new_tokens if token not in greek_stopwords_set]\n",
    "new_tokens"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Other packages"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Other packages worth exploring include [`polyglot`](https://pypi.org/project/polyglot/) and [`greek-stemmer`](https://pypi.org/project/greek-stemmer/). However, these require [`PyICU`](https://pypi.org/project/PyICU/) to work, and installing it on Windows is difficult."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
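
The cleaning and stopword steps in the notebook can be folded into one function. The following is a minimal sketch, not the notebook author's code: the names `clean_greek`, `STOPWORDS`, and `KEEP_AS_IS` are hypothetical, `str.split()` stands in for NLTK's `WhitespaceTokenizer` so that the sketch needs only the standard library, and the stopword set is a small subset of the notebook's 83 entries chosen for illustration.

```python
import unicodedata
from string import punctuation

# Illustrative subset of the notebook's stopword list (assumption: shortened
# here for brevity; the notebook defines 83 unaccented, lowercase entries).
STOPWORDS = {"αυτος", "ειναι", "ο", "της", "ο,τι"}

# Tokens whose internal punctuation is part of the word and must survive.
KEEP_AS_IS = {"ο,τι"}

# Translation table that deletes every ASCII punctuation character.
_STRIP_PUNCTUATION = str.maketrans("", "", punctuation)


def strip_accents(text):
    """Remove combining marks (category Mn) after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")


def clean_greek(text, stopwords=STOPWORDS):
    """Lowercase, strip accents, split on whitespace, drop punctuation and stopwords."""
    tokens = strip_accents(text.lower()).split()
    cleaned = []
    for token in tokens:
        if token not in KEEP_AS_IS:
            token = token.translate(_STRIP_PUNCTUATION)
        # Skip tokens that were pure punctuation as well as stopwords.
        if token and token not in stopwords:
            cleaned.append(token)
    return cleaned


result = clean_greek("ΑΥΤΟΣ είναι ο χορός της βροχής της φυλής, ό,τι περίεργο.")
print(result)
```

For the notebook's sample sentence this should leave the same four tokens as the final notebook cell: `χορος`, `βροχης`, `φυλης`, `περιεργο`. The order of operations matters: accents are stripped before the stopword lookup because the stopword list itself is unaccented.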
