I am working on a project which crunches plain text files (.lst). The file names (fileName) are important because I'll extract node (e.g. abessijn) and component (e.g. WR-P-E-A) from them into a dataframe.
Examples:
abessijn.WR-P-E-A.lst
A-bom.WR-P-E-A.lst
acroniem.WR-P-E-C.lst
acroniem.WR-P-E-G.lst
adapter.WR-P-E-A.lst
adapter.WR-P-E-C.lst
adapter.WR-P-E-G.lst
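For illustration, a minimal sketch of how node and component could be recovered from such a file name (the variable names here are just for this example):

import os

file_name = "abessijn.WR-P-E-A.lst"
base, _ext = os.path.splitext(file_name)  # "abessijn.WR-P-E-A", ".lst"
node, component = base.split(".", 1)      # "abessijn", "WR-P-E-A"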
Each file consists of one or more lines. Each line consists of a sentence (inside <sentence> tags). Example (abessijn.WR-P-E-A.lst):
<sentence>Vooral mijn abessijn ruikt heerlijk kruidig .. : ) )</sentence>
<sentence>Mijn abessijn denkt daar heel anders over .. : ) ) Maar mijn kinderen richt ik ook niet af , zit niet in mijn bloed .</sentence>
From each line I extract the sentence, do some small modifications to it, and call it sentence. Up next is an element called leftContext, which is the part of the sentence that comes before node (e.g. abessijn). Finally, from leftContext I get precedingWord, which is the word preceding node in sentence, or equivalently the rightmost word in leftContext (with some limitations, such as the option of a compound formed with a hyphen). Example:
ID | filename | node | component | precedingWord | leftContext | sentence
---|----------|------|-----------|---------------|-------------|---------
1 | adapter.WR-P-P-F.lst | adapter | WR-P-P-F | aanpassingseenheid | Een aanpassingseenheid ( | Een aanpassingseenheid ( adapter ) ,
2 | adapter.WR-P-P-F.lst | adapter | WR-P-P-F | toestel | Het toestel ( | Het toestel ( adapter ) draagt zorg voor de overbrenging van gegevens
3 | adapter.WR-P-P-F.lst | adapter | WR-P-P-F | de | de aansluiting tussen de sensor en de | de aansluiting tussen de sensor en de adapter ,
4 | airbag.WS-U-E-A.lst | airbag | WS-U-E-A | den | ja voor den | ja voor den airbag op te pompen eh :p
5 | airbag.WS-U-E-A.lst | airbag | WS-U-E-A | ne | Dobby , als ze valt heeft ze dan wel al ne | Dobby , als ze valt heeft ze dan wel al ne airbag hee
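To make the derivation concrete, a minimal sketch for a single sentence, using the same last-word regex as the script below (with match/group instead of the script's sub, purely for readability):

import re

p_last_word = re.compile(r"^.*\b(?<!-)(\w+(?:-\w+)*)[^\w]*$", re.U)

sentence = "Een aanpassingseenheid ( adapter ) ,"
node = "adapter"
left_context = sentence.split(node)[0]    # "Een aanpassingseenheid ( "
m = p_last_word.match(left_context)
preceding_word = m.group(1) if m else ""  # "aanpassingseenheid"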
That dataframe is exported as dataset.csv.
After that, the intention of my project comes into play: I create a frequency table that takes node and precedingWord into account. From a variable I define neuter and non_neuter, e.g. (in Python)

neuter = ["het", "Het"]
non_neuter = ["de", "De"]

and a rest category unspecified. When precedingWord is an item from one of these lists, the row is assigned to that category. Example of a frequency table output:
node | neuter | nonNeuter | unspecified
-----|--------|-----------|------------
A-bom | 0 | 4 | 2
acroniem | 3 | 0 | 2
act | 3 | 2 | 1
The frequency list is exported as frequencies.csv.
I came up with the following script (paste) to do that:
import os, pandas as pd, numpy as np, regex as re
from glob import glob
from datetime import datetime
from html import unescape

start_time = datetime.now()

# Create empty dataframe with correct column names
columnNames = ["fileName", "component", "precedingWord", "node", "leftContext", "sentence"]
df = pd.DataFrame(data=np.zeros((0, len(columnNames))), columns=columnNames)

# Create correct path where to fetch files
subdir = "rawdata"
path = os.path.abspath(os.path.join(os.getcwd(), os.pardir, subdir))

# "Cache" regex
# See http://stackoverflow.com/q/452104/1150683
p_filename = re.compile(r"[./\\]")
p_sentence = re.compile(r"<sentence>(.*?)</sentence>")
p_typography = re.compile(r" (?:(?=[.,:;?!) ])|(?<=\( ))")
p_non_graph = re.compile(r"[^\x21-\x7E\s]")
p_quote = re.compile(r"\"")
p_ellipsis = re.compile(r"\.{3}(?=[^ ])")
p_last_word = re.compile(r"^.*\b(?<!-)(\w+(?:-\w+)*)[^\w]*$", re.U)

# Loop files in folder
for file in glob(path + "\\*.lst"):
    with open(file, encoding="utf-8") as f:
        [n, c] = p_filename.split(file.lower())[-3:-1]
        fn = ".".join([n, c])
        for line in f:
            s = p_sentence.search(unescape(line)).group(1)
            s = s.lower()
            s = p_typography.sub("", s)
            s = p_non_graph.sub("", s)
            s = p_quote.sub("'", s)
            s = p_ellipsis.sub("... ", s)

            if n in re.split(r"[ :?.,]", s):
                lc = re.split(r"(^| )" + n + "( |[!\",.:;?})\]])", s)[0]
                pw = p_last_word.sub(r"\1", lc)

                df = df.append([dict(fileName=fn, component=c,
                                     precedingWord=pw, node=n,
                                     leftContext=lc, sentence=s)])
            continue

# Reset indices
df.reset_index(drop=True, inplace=True)

# Export dataset
df.to_csv("dataset/py-dataset.csv", sep="\t", encoding="utf-8")

# Let's make a frequency list
# Define neuter and non_neuter
neuter = ["het"]
non_neuter = ["de"]

# Create crosstab
df.loc[df.precedingWord.isin(neuter), "gender"] = "neuter"
df.loc[df.precedingWord.isin(non_neuter), "gender"] = "non_neuter"
df.loc[df.precedingWord.isin(neuter + non_neuter) == 0, "gender"] = "rest"
freqDf = pd.crosstab(df.node, df.gender)

freqDf.to_csv("dataset/py-frequencies.csv", sep="\t", encoding="utf-8")

# How long has the script been running?
time_difference = datetime.now() - start_time
print("Time difference of", time_difference)
I'm running Python in 32 bit, because I'd like to use the nltk module in the future, and they discourage users from using 64 bit.
Seeing that the goal of Python is to make everything go smoothly, I was confused. The goal is to beat a similar R script in execution speed. I read that Python was fast, so where did I go wrong? A colleague advised me to use Python because it might be faster. I know you can't simply compare two languages that way; each has its merits. But in this specific case it seemed to me that Python ought to be faster at data crunching. I figured that when both scripts are equally well optimised, Python should be faster in this case. However, my current Python code is badly optimised whereas my R script is decent enough.
Please tell me if I'm completely wrong. If someone could help me use some of the higher-order functions proposed on SO (imap, ifilter...) and explain them, that would be great.
What is the problem? Is Python slower at reading files and lines, or at doing regexes? Or is R simply better equipped for dealing with dataframes, so that it can't be beaten by pandas? Or is my code simply badly optimised, and should Python indeed be the victor?
You can download the test data I used here.
I'm a bit of a beginner, so please take the time to explain how they work and/or why they're better.
2 Answers
Speed comparisons are always a good thing, but it can be tricky to determine what is actually being compared. I think it's premature to decide that the comparison is "Python" vs. "R" without a lot of work to verify that all the libraries you use and functions you write are reasonably optimized for each language.
One of the strengths of Python is a pretty good set of line profiling tools. I like line_profiler, because it works in IPython notebooks, a format I find convenient for "real time" coding and tooling around.
I used line profiling to run a (Python 2 version of) your code, and I found that one single line in your script was responsible for 94.5% of the execution time. Can you guess what it is? It was this one:
df = df.append([dict(fileName=fn, component=c,
                     precedingWord=pw, node=n,
                     leftContext=lc, sentence=s)])
I'm not a veteran of pandas but I think it's safe to say that building a data frame row-by-row in pandas is not very efficient.
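For comparison, a common way to avoid per-row appends is to collect plain Python dicts in a list and construct the frame once at the end; a minimal sketch (parsed_records is a hypothetical iterable of already-parsed field values):

import pandas as pd

rows = []
for fn, c, pw, n, lc, s in parsed_records:  # hypothetical: yields parsed fields per line
    rows.append(dict(fileName=fn, component=c, precedingWord=pw,
                     node=n, leftContext=lc, sentence=s))
df = pd.DataFrame(rows)  # one construction instead of thousands of appends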
How did I refactor your code to run in Python 2 and run line_profiler on it? Essentially I just wrapped everything you wrote into a dummy function called run_code() and called the line profiler on that function. For Python 2:
- I had to modify how you were parsing your directory (I use os.listdir() instead of the glob stuff).
- I had to import codecs, because Python 2 isn't natively able to handle Unicode the same way that Python 3 is.
The bottom-line results: your code took about 25.6 seconds (on my machine) to run, but simply filling the Pandas dataframe by column after parsing was done, instead of row by row during parsing, took 1.2 seconds. This simple modification led to a speedup of more than 20×! You could probably get even faster by pre-allocating NumPy structured arrays for each column, and then using those to fill a dataframe by column.
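A rough sketch of that pre-allocation idea (the row count and fixed string widths here are arbitrary assumptions; with genuinely variable-length text, plain Python lists may serve just as well):

import numpy as np
import pandas as pd

n_rows = 5825                          # number of matching lines, known or estimated up front
nodes = np.empty(n_rows, dtype="U32")  # fixed-width unicode columns
preceding = np.empty(n_rows, dtype="U32")

# ... inside the parsing loop, fill by index:
# nodes[i] = n
# preceding[i] = pw

df = pd.DataFrame({"node": nodes, "precedingWord": preceding})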
In addition to the timing issue, there are a number of other aspects of your code that you may want to consider revising:
- Variable names are confusing (what is s, etc.) and don't always follow PEP8. (Avoid camelCase and use snake_case instead, etc.)
- For simple timing, consider the timeit module instead of datetime. (Of course, the main point of my answer is that line profiling is essential to figure out where the slow parts of your code are; be sure to use more than simple timing commands when you are optimizing. But there are certainly times where simple timing is useful, and timeit is a module engineered for that task; a minimal sketch follows this list.)
- You should factor your code into smaller functions that each accomplish a task. For example, one function could generate the list of filenames to parse. Another function could parse the data and return a dataframe, and a third could find the frequencies.
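A minimal timeit sketch (the statement and setup strings here are just illustrative placeholders):

import timeit

# time a small snippet 1000 times, repeated 3 times; report the best total
best = min(timeit.repeat(
    stmt="p.search(line)",
    setup=("import re; "
           "p = re.compile(r'<sentence>(.*?)</sentence>'); "
           "line = '<sentence>test</sentence>'"),
    repeat=3, number=1000))
print("best of 3:", best, "seconds per 1000 calls")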
In the end, it's tough to know what your original speed comparison means. If your original Python implementation of this script is based on a direct translation from R, then it probably means that Pandas sucks at filling dataframes by row. But even if that is true, it's unclear whether Pandas is "slower" than R, because once you are aware of the row-by-row limitation, you should be able to work around it easily in almost every foreseeable use case. (Can anyone think of an example where filling a dataframe row by row is essential and it can't be done any other way?)
Thanks for asking a fun question!
Here's all the code I used.
import os
import re
from datetime import datetime
import numpy as np
import pandas as pd
from glob import glob
# unicode file parsing support in Python 2.x
import codecs
# get unescape to work in Python 2.x
import HTMLParser
unescape = HTMLParser.HTMLParser().unescape
# line profiling
%load_ext line_profiler
import timeit
def run_code():
    start_time = datetime.now()

    # Create empty dataframe with correct column names
    column_names = ["fileName", "component", "precedingWord", "node", "leftContext", "sentence"]
    df = pd.DataFrame(data=np.zeros((0, len(column_names))), columns=column_names)

    # Create correct path where to fetch files
    subdir = "rawdata"
    path = os.path.abspath(os.path.join(os.getcwd(), subdir))

    # "Cache" regex
    # See http://stackoverflow.com/q/452104/1150683
    p_filename = re.compile(r"[./\\]")
    p_sentence = re.compile(r"<sentence>(.*?)</sentence>")
    p_typography = re.compile(r" (?:(?=[.,:;?!) ])|(?<=\( ))")
    p_non_graph = re.compile(r"[^\x21-\x7E\s]")
    p_quote = re.compile(r"\"")
    p_ellipsis = re.compile(r"\.{3}(?=[^ ])")
    p_last_word = re.compile(r"^.*\b(?<!-)(\w+(?:-\w+)*)[^\w]*$", re.U)

    # Loop files in folder
    filenames = [name for name in os.listdir(path) if re.match('.*[.]lst', name)]

    for filename in filenames:
        with codecs.open('rawdata/' + filename, 'r+', encoding='utf-8') as f:
            [n, c] = p_filename.split(filename.lower())[-3:-1]
            fn = ".".join([n, c])
            for line in f:
                uline = unicode(line)
                s = p_sentence.search(unescape(uline)).group(1)
                s = s.lower()
                s = p_typography.sub("", s)
                s = p_non_graph.sub("", s)
                s = p_quote.sub("'", s)
                s = p_ellipsis.sub("... ", s)

                if n in re.split(r"[ :?.,]", s):
                    lc = re.split(r"(^| )" + n + "( |[!\",.:;?})\]])", s)[0]
                    pw = p_last_word.sub(r"\1", lc)

                    df = df.append([dict(fileName=fn, component=c,
                                         precedingWord=pw, node=n,
                                         leftContext=lc, sentence=s)])
                continue

    # Reset indices
    df.reset_index(drop=True, inplace=True)

    # Export dataset
    df.to_csv("dataset/py-dataset.csv", sep="\t", encoding="utf-8")

    # Let's make a frequency list
    # Define neuter and non_neuter
    neuter = ["het"]
    non_neuter = ["de"]

    # Create crosstab
    df.loc[df.precedingWord.isin(neuter), "gender"] = "neuter"
    df.loc[df.precedingWord.isin(non_neuter), "gender"] = "non_neuter"
    df.loc[df.precedingWord.isin(neuter + non_neuter) == 0, "gender"] = "rest"
    freqDf = pd.crosstab(df.node, df.gender)

    freqDf.to_csv("dataset/py-frequencies.csv", sep="\t", encoding="utf-8")

    # How long has the script been running?
    time_difference = datetime.now() - start_time
    print("Time difference of", time_difference)

    return
%lprun -f run_code run_code()
In IPython the result of the line profiler is displayed in a pseudo-popup "help" window. Here it is:
Timer unit: 1e-06 s
Total time: 25.6168 s
File: <ipython-input-5-b8823da4f6a5>
Function: run_code at line 1
Line # Hits Time Per Hit % Time Line Contents
==============================================================
1 def run_code():
2
3 1 10 10.0 0.0 start_time = datetime.now()
4
5 # Create empty dataframe with correct column names
6 1 2 2.0 0.0 column_names = ["fileName", "component", "precedingWord", "node", "leftContext", "sentence" ]
7 1 384 384.0 0.0 df = pd.DataFrame(data=np.zeros((0, len(column_names))), columns=column_names)
8
9 # Create correct path where to fetch files
10 1 2 2.0 0.0 subdir = "rawdata"
11 1 119 119.0 0.0 path = os.path.abspath(os.path.join(os.getcwd(), subdir))
12
13 # "Cache" regex
14 # See http://stackoverflow.com/q/452104/1150683
15 1 265 265.0 0.0 p_filename = re.compile(r"[./\\]")
16
17 1 628 628.0 0.0 p_sentence = re.compile(r"<sentence>(.*?)</sentence>")
18 1 697 697.0 0.0 p_typography = re.compile(r" (?:(?=[.,:;?!) ])|(?<=\( ))")
19 1 411 411.0 0.0 p_non_graph = re.compile(r"[^\x21-\x7E\s]")
20 1 128 128.0 0.0 p_quote = re.compile(r"\"")
21 1 339 339.0 0.0 p_ellipsis = re.compile(r"\.{3}(?=[^ ])")
22
23 1 1048 1048.0 0.0 p_last_word = re.compile(r"^.*\b(?<!-)(\w+(?:-\w+)*)[^\w]*$", re.U)
24
25 # Loop files in folder
26 108 1122 10.4 0.0 filenames = [name for name in os.listdir(path) if re.match('.*[.]lst', name)]
27
28 108 250 2.3 0.0 for filename in filenames:
29 107 5341 49.9 0.0 with codecs.open('rawdata/' + filename, 'r+', encoding='utf-8') as f:
30 107 867 8.1 0.0 [n, c] = p_filename.split(filename.lower())[-3:-1]
31 107 277 2.6 0.0 fn = ".".join([n, c])
32 6607 395024 59.8 1.5 for line in f:
33 6500 17927 2.8 0.1 uline = unicode(line)
34 6500 119436 18.4 0.5 s = p_sentence.search(unescape(uline)).group(1)
35 6500 19466 3.0 0.1 s = s.lower()
36 6500 53653 8.3 0.2 s = p_typography.sub("", s)
37 6500 25654 3.9 0.1 s = p_non_graph.sub("", s)
38 6500 17735 2.7 0.1 s = p_quote.sub("'", s)
39 6500 31662 4.9 0.1 s = p_ellipsis.sub("... ", s)
40
41 6500 119657 18.4 0.5 if n in re.split(r"[ :?.,]", s):
42 5825 117687 20.2 0.5 lc = re.split(r"(^| )" + n + "( |[!\",.:;?})\]])", s)[0]
43
44 5825 133397 22.9 0.5 pw = p_last_word.sub("\1円", lc)
45
46 5825 12575 2.2 0.0 df = df.append([dict(fileName=fn, component=c,
47 5825 8539 1.5 0.0 precedingWord=pw, node=n,
48 5825 24222087 4158.3 94.6 leftContext=lc, sentence=s)])
49 continue
50
51 # Reset indices
52 1 104 104.0 0.0 df.reset_index(drop=True, inplace=True)
53
54 # Export dataset
55 1 293388 293388.0 1.1 df.to_csv("dataset/py-dataset.csv", sep="\t", encoding="utf-8")
56
57 # Let's make a frequency list
58 # Create new dataframe
59
60 # Define neuter and non_neuter
61 1 3 3.0 0.0 neuter = ["het"]
62 1 1 1.0 0.0 non_neuter = ["de"]
63
64 # Create crosstab
65 1 2585 2585.0 0.0 df.loc[df.precedingWord.isin(neuter), "gender"] = "neuter"
66 1 2125 2125.0 0.0 df.loc[df.precedingWord.isin(non_neuter), "gender"] = "non_neuter"
67 1 1417 1417.0 0.0 df.loc[df.precedingWord.isin(neuter + non_neuter) == 0, "gender"] = "rest"
68
69 1 9666 9666.0 0.0 freqDf = pd.crosstab(df.node, df.gender)
70
71 1 1042 1042.0 0.0 freqDf.to_csv("dataset/py-frequencies.csv", sep="\t", encoding="utf-8")
72
73 # How long has the script been running?
74 1 20 20.0 0.0 time_difference = datetime.now() - start_time
75 1 46 46.0 0.0 print("Time difference of", time_difference)
76 1 1 1.0 0.0 return
As you can see, building the pandas data frame takes almost all the time. That suggests a trivial optimization:
def run_code_faster():
    start_time = datetime.now()

    # Create empty dataframe with correct column names
    column_names = ["fileName", "component", "precedingWord", "node", "leftContext", "sentence"]
    df = pd.DataFrame(data=np.zeros((0, len(column_names))), columns=column_names)

    # Create correct path where to fetch files
    subdir = "rawdata"
    path = os.path.abspath(os.path.join(os.getcwd(), subdir))

    # "Cache" regex
    # See http://stackoverflow.com/q/452104/1150683
    p_filename = re.compile(r"[./\\]")
    p_sentence = re.compile(r"<sentence>(.*?)</sentence>")
    p_typography = re.compile(r" (?:(?=[.,:;?!) ])|(?<=\( ))")
    p_non_graph = re.compile(r"[^\x21-\x7E\s]")
    p_quote = re.compile(r"\"")
    p_ellipsis = re.compile(r"\.{3}(?=[^ ])")
    p_last_word = re.compile(r"^.*\b(?<!-)(\w+(?:-\w+)*)[^\w]*$", re.U)

    # Loop files in folder
    filenames = [name for name in os.listdir(path) if re.match('.*[.]lst', name)]

    fn_list = []
    c_list = []
    pw_list = []
    n_list = []
    lc_list = []
    s_list = []

    for filename in filenames:
        with codecs.open('rawdata/' + filename, 'r+', encoding='utf-8') as f:
            [n, c] = p_filename.split(filename.lower())[-3:-1]
            fn = ".".join([n, c])
            for line in f:
                uline = unicode(line)
                s = p_sentence.search(unescape(uline)).group(1)
                s = s.lower()
                s = p_typography.sub("", s)
                s = p_non_graph.sub("", s)
                s = p_quote.sub("'", s)
                s = p_ellipsis.sub("... ", s)

                if n in re.split(r"[ :?.,]", s):
                    lc = re.split(r"(^| )" + n + "( |[!\",.:;?})\]])", s)[0]
                    pw = p_last_word.sub(r"\1", lc)

                    # df = df.append([dict(fileName=fn, component=c,
                    #                      precedingWord=pw, node=n,
                    #                      leftContext=lc, sentence=s)])
                    fn_list.append(fn)
                    c_list.append(c)
                    pw_list.append(pw)
                    n_list.append(n)
                    lc_list.append(lc)
                    s_list.append(s)
                continue

    # Assign data frame columns in one shot
    df['fileName'] = fn_list
    df['component'] = c_list
    df['precedingWord'] = pw_list
    df['node'] = n_list
    df['leftContext'] = lc_list
    df['sentence'] = s_list

    # Reset indices
    df.reset_index(drop=True, inplace=True)

    # Export dataset
    df.to_csv("dataset/py-dataset.csv", sep="\t", encoding="utf-8")

    # Let's make a frequency list
    # Define neuter and non_neuter
    neuter = ["het"]
    non_neuter = ["de"]

    # Create crosstab
    df.loc[df.precedingWord.isin(neuter), "gender"] = "neuter"
    df.loc[df.precedingWord.isin(non_neuter), "gender"] = "non_neuter"
    df.loc[df.precedingWord.isin(neuter + non_neuter) == 0, "gender"] = "rest"
    freqDf = pd.crosstab(df.node, df.gender)

    freqDf.to_csv("dataset/py-frequencies.csv", sep="\t", encoding="utf-8")

    # How long has the script been running?
    time_difference = datetime.now() - start_time
    print("Time difference of", time_difference)

    return
%lprun -f run_code_faster run_code_faster()
Timer unit: 1e-06 s
Total time: 1.21669 s
File: <ipython-input-2-6ca852e32327>
Function: run_code_faster at line 1
Line # Hits Time Per Hit % Time Line Contents
==============================================================
1 def run_code_faster():
2
3 1 10 10.0 0.0 start_time = datetime.now()
4
5 # Create empty dataframe with correct column names
6 1 2 2.0 0.0 column_names = ["fileName", "component", "precedingWord", "node", "leftContext", "sentence" ]
7 1 412 412.0 0.0 df = pd.DataFrame(data=np.zeros((0, len(column_names))), columns=column_names)
8
9 # Create correct path where to fetch files
10 1 1 1.0 0.0 subdir = "rawdata"
11 1 120 120.0 0.0 path = os.path.abspath(os.path.join(os.getcwd(), subdir))
12
13 # "Cache" regex
14 # See http://stackoverflow.com/q/452104/1150683
15 1 11 11.0 0.0 p_filename = re.compile(r"[./\\]")
16
17 1 6 6.0 0.0 p_sentence = re.compile(r"<sentence>(.*?)</sentence>")
18 1 12 12.0 0.0 p_typography = re.compile(r" (?:(?=[.,:;?!) ])|(?<=\( ))")
19 1 6 6.0 0.0 p_non_graph = re.compile(r"[^\x21-\x7E\s]")
20 1 5 5.0 0.0 p_quote = re.compile(r"\"")
21 1 5 5.0 0.0 p_ellipsis = re.compile(r"\.{3}(?=[^ ])")
22
23 1 6 6.0 0.0 p_last_word = re.compile(r"^.*\b(?<!-)(\w+(?:-\w+)*)[^\w]*$", re.U)
24
25 # Loop files in folder
26 108 964 8.9 0.1 filenames = [name for name in os.listdir(path) if re.match('.*[.]lst', name)]
27
28 1 1 1.0 0.0 fn_list = []
29 1 1 1.0 0.0 c_list = []
30 1 1 1.0 0.0 pw_list = []
31 1 2 2.0 0.0 n_list = []
32 1 2 2.0 0.0 lc_list = []
33 1 2 2.0 0.0 s_list = []
34
35 108 286 2.6 0.0 for filename in filenames:
36 107 6811 63.7 0.6 with codecs.open('rawdata/' + filename, 'r+', encoding='utf-8') as f:
37 107 1026 9.6 0.1 [n, c] = p_filename.split(filename.lower())[-3:-1]
38 107 314 2.9 0.0 fn = ".".join([n, c])
39 6607 311585 47.2 25.6 for line in f:
40 6500 15037 2.3 1.2 uline = unicode(line)
41 6500 94829 14.6 7.8 s = p_sentence.search(unescape(uline)).group(1)
42 6500 17369 2.7 1.4 s = s.lower()
43 6500 42040 6.5 3.5 s = p_typography.sub("", s)
44 6500 23783 3.7 2.0 s = p_non_graph.sub("", s)
45 6500 16132 2.5 1.3 s = p_quote.sub("'", s)
46 6500 31856 4.9 2.6 s = p_ellipsis.sub("... ", s)
47
48 6500 101812 15.7 8.4 if n in re.split(r"[ :?.,]", s):
49 5825 71344 12.2 5.9 lc = re.split(r"(^| )" + n + "( |[!\",.:;?})\]])", s)[0]
50
51 5825 103504 17.8 8.5 pw = p_last_word.sub("\1円", lc)
52
53 # df = df.append([dict(fileName=fn, component=c,
54 # precedingWord=pw, node=n,
55 # leftContext=lc, sentence=s)])
56 5825 11036 1.9 0.9 fn_list.append(fn)
57 5825 9798 1.7 0.8 c_list.append(c)
58 5825 9587 1.6 0.8 pw_list.append(pw)
59 5825 9642 1.7 0.8 n_list.append(n)
60 5825 9516 1.6 0.8 lc_list.append(lc)
61 5825 9390 1.6 0.8 s_list.append(s)
62 continue
63 # Assign data frame
64 1 1448 1448.0 0.1 df['fileName'] = fn_list
65 1 517 517.0 0.0 df['component'] = c_list
66 1 532 532.0 0.0 df['precedingWord'] = pw_list
67 1 493 493.0 0.0 df['node'] = n_list
68 1 511 511.0 0.0 df['leftContext'] = lc_list
69 1 437 437.0 0.0 df['sentence'] = s_list
70
71 # Reset indices
72 1 88 88.0 0.0 df.reset_index(drop=True, inplace=True)
73
74 # Export dataset
75 1 296747 296747.0 24.4 df.to_csv("dataset/py-dataset.csv", sep="\t", encoding="utf-8")
76
77 # Let's make a frequency list
78 # Create new dataframe
79
80 # Define neuter and non_neuter
81 1 3 3.0 0.0 neuter = ["het"]
82 1 1 1.0 0.0 non_neuter = ["de"]
83
84 # Create crosstab
85 1 3878 3878.0 0.3 df.loc[df.precedingWord.isin(neuter), "gender"] = "neuter"
86 1 1871 1871.0 0.2 df.loc[df.precedingWord.isin(non_neuter), "gender"] = "non_neuter"
87 1 1405 1405.0 0.1 df.loc[df.precedingWord.isin(neuter + non_neuter) == 0, "gender"] = "rest"
88
89 1 9203 9203.0 0.8 freqDf = pd.crosstab(df.node, df.gender)
90
91 1 1234 1234.0 0.1 freqDf.to_csv("dataset/py-frequencies.csv", sep="\t", encoding="utf-8")
92
93 # How long has the script been running?
94 1 12 12.0 0.0 time_difference = datetime.now() - start_time
95 1 43 43.0 0.0 print("Time difference of", time_difference)
96 1 1 1.0 0.0 return
- What a thorough answer, thank you for that! +1 I ran your code and, indeed, it is unbelievably fast. And a very "easy" solution; I should've thought of that myself. On all my data it runs in 5 minutes, whereas the R script needs 40 minutes. Two questions, though: why do we need return at the end of the function? And how should I use timeit? In other words, how can I use it from within Python and get an output, when the script has finished running, that is neatly formatted in mm:ss? I googled it, but got nothing. – Bram Vanroy, Aug 24, 2015 at 21:56
- The return is good form for ending functions, even if you aren't returning any values. It makes it easier for readers of the code to see where your function definition ends. Also, if you want to adapt the run_code() function to actually return your pandas dataframe, you could end with return df or similar. – Curt F., Aug 24, 2015 at 22:04
- I think the timeit docs are pretty informative on how to use it from the command line and from a Python interpreter. The main purpose of timeit is to avoid the pitfall of variable system overhead distorting your timing measurements. If your code takes 5 minutes to run, datetime is fine, but you'll have a hard time interpreting a datetime result if you are timing a code block that takes 5 microseconds to run. Results will be highly variable and non-repeatable; timeit fixes that. – Curt F., Aug 24, 2015 at 22:10
- Would this be a good way of separating it into functions? pastebin.com/hgR55ajZ I can't put the initial part in a function because of the scope of the imported variables. – Bram Vanroy, Aug 24, 2015 at 22:25
- I posted it as a new question. I'll award you the bounty a day before it expires. Who knows, maybe someone else can write an even better performing script! – Bram Vanroy, Aug 24, 2015 at 23:15
Gorillas vs. Sharks:
You can't simply compare R and Python; they're like, well, gorillas and sharks.
It's not as simple as: which is better?
And it's often not the language that's better, but the quality of the code!
Possible things that could be slowing execution:
Things not specific to you:
- Beginner's language abuse: as a self-confessed beginner, you may be using the functions or functionality of Python incorrectly.
- Library abuse: some libraries aren't exactly a walk in the park to use, and you may not be using them properly.
- Library quality: assuming equal library quality is a bit naïve; the actual code behind a library may not be as optimised or as good as the other language's implementation.
Things specific to you:
- Regex: there are many regexes being executed, which is quite a time-consuming process. This regex: (?:(?=[.,:;?!) ])|(?<=\( )) managed to register 3000 steps for me!
- s = s.lower(): if you're using certain characters in your content, this can have issues. I'd recommend taking a look over at How do I do a case insensitive string comparison in Python? (a small sketch follows).
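For instance, in Python 3, str.casefold() handles caseless matching that lower() misses; a small sketch:

# lower() leaves the German eszett alone, while casefold() maps it to "ss"
print("Straße".lower() == "strasse".lower())        # False
print("Straße".casefold() == "strasse".casefold())  # True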
- if n in re.split(r"[ :?.,]", s): lc = re.split(r"(^| )" + n + "( |[!\",.:;?})\]])", s)[0] — what is n, re, lc or s? Your variable names are unclear and confusing to read.
- I think the variable names are quite clear. Take a look at the .append function; you'll see that the letters are abbreviations of the column names. I know you can't simply compare two languages that way; each has its merits. But in this specific case it seemed to me that Python ought to be faster at data crunching. My current Python code works, so there's no issue with lower() for me. As I stated, I'm indeed a beginner, and I'm here to look for help. If someone could help me use some of the higher-order functions proposed on SO (imap, ifilter...) and explain them, that would be great. – Bram Vanroy, Aug 22, 2015 at 19:03
- "The goal is to beat a similar R script in execution speed. I read Python was fast, so where did I go wrong?" If I had a nickel for every time somebody stated something along those lines... A language being fast doesn't mean every implementation in the 'faster' language is going to beat the implementations in the 'slower' languages. It's a fallacy.
- ...there is a dict call that kinda clarifies the names, but it takes walking through a lot of lines before one gets to that "clarification". The overall approach is too imperative in the worst sense possible. The code is a problem generator.
- ...imap, so I came to ask for an improvement here. I figured people proficient in Python would be able to do an overall better job than I could. And they did! See the answers above.