I am working on a project which crunches plain text files (.lst).

The file names (fileName) are important because I'll extract node (e.g. abessijn) and component (e.g. WR-P-E-A) from them into a dataframe.

Examples:

abessijn.WR-P-E-A.lst
A-bom.WR-P-E-A.lst
acroniem.WR-P-E-C.lst
acroniem.WR-P-E-G.lst
adapter.WR-P-E-A.lst
adapter.WR-P-E-C.lst
adapter.WR-P-E-G.lst

Each file consists of one or more lines. Each line consists of a sentence (inside <sentence> tags). Example (abessijn.WR-P-E-A.lst):

<sentence>Vooral mijn abessijn ruikt heerlijk kruidig .. : ) )</sentence>
<sentence>Mijn abessijn denkt daar heel anders over .. : ) ) Maar mijn kinderen richt ik ook niet af , zit niet in mijn bloed .</sentence>

From each line I extract the sentence, make some small modifications to it, and call it sentence. Up next is an element called leftContext, which is the part of sentence that precedes node (e.g. abessijn). Finally, from leftContext I get precedingWord, which is the word preceding node in sentence, i.e. the rightmost word in leftContext (with some limitations, such as allowing a compound formed with a hyphen). Example:

ID | filename | node | component | precedingWord | leftContext | sentence
---|----------|------|-----------|---------------|-------------|---------
1 | adapter.WR-P-P-F.lst | adapter | WR-P-P-F | aanpassingseenheid | Een aanpassingseenheid ( | Een aanpassingseenheid ( adapter ) ,
2 | adapter.WR-P-P-F.lst | adapter | WR-P-P-F | toestel | Het toestel ( | Het toestel ( adapter ) draagt zorg voor de overbrenging van gegevens
3 | adapter.WR-P-P-F.lst | adapter | WR-P-P-F | de | de aansluiting tussen de sensor en de | de aansluiting tussen de sensor en de adapter ,
4 | airbag.WS-U-E-A.lst | airbag | WS-U-E-A | den | ja voor den | ja voor den airbag op te pompen eh :p
5 | airbag.WS-U-E-A.lst | airbag | WS-U-E-A | ne | Dobby , als ze valt heeft ze dan wel al ne | Dobby , als ze valt heeft ze dan wel al ne airbag hee
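For illustration, here is a minimal sketch of that extraction for a single file name and line (the helper code and names are my own, not the project's; the last-word regex is simplified):

import re

filename = "abessijn.WR-P-E-A.lst"
node, component = filename.lower().split(".")[:2]   # "abessijn", "wr-p-e-a"

line = "<sentence>Mijn abessijn denkt daar heel anders over ..</sentence>"
sentence = re.search(r"<sentence>(.*?)</sentence>", line).group(1).lower()

# leftContext: everything before the first occurrence of node
left_context = re.split(r"(^| )" + node + r"( |[!\",.:;?})\]])", sentence)[0]

# precedingWord: the rightmost word of leftContext (hyphenated compounds kept whole)
match = re.search(r"(\w+(?:-\w+)*)[^\w]*$", left_context)
preceding_word = match.group(1) if match else ""

print(node, component, preceding_word)   # abessijn wr-p-e-a mijn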

That dataframe is exported as dataset.csv.

After that, the real aim of my project comes into play: I create a frequency table that takes node and precedingWord into account. From a variable I define neuter and non_neuter, e.g. (in Python)

neuter = ["het", "Het"]
non_neuter = ["de","De"]

and a rest category, unspecified. When precedingWord is an item from one of the lists, the row is assigned to that category. Example of a frequency table output:

node | neuter | nonNeuter | unspecified
-----|--------|-----------|-------------
A-bom | 0 | 4 | 2
acroniem | 3 | 0 | 2
act | 3 | 2 | 1

The frequency list is exported as frequencies.csv.

I came up with the following script to do that:

import os, pandas as pd, numpy as np, regex as re
from glob import glob
from datetime import datetime
from html import unescape

start_time = datetime.now()

# Create empty dataframe with correct column names
columnNames = ["fileName", "component", "precedingWord", "node", "leftContext", "sentence"]
df = pd.DataFrame(data=np.zeros((0, len(columnNames))), columns=columnNames)

# Create correct path where to fetch files
subdir = "rawdata"
path = os.path.abspath(os.path.join(os.getcwd(), os.pardir, subdir))

# "Cache" regex
# See http://stackoverflow.com/q/452104/1150683
p_filename = re.compile(r"[./\\]")
p_sentence = re.compile(r"<sentence>(.*?)</sentence>")
p_typography = re.compile(r" (?:(?=[.,:;?!) ])|(?<=\( ))")
p_non_graph = re.compile(r"[^\x21-\x7E\s]")
p_quote = re.compile(r"\"")
p_ellipsis = re.compile(r"\.{3}(?=[^ ])")
p_last_word = re.compile(r"^.*\b(?<!-)(\w+(?:-\w+)*)[^\w]*$", re.U)

# Loop files in folder
for file in glob(path + "\\*.lst"):
    with open(file, encoding="utf-8") as f:
        [n, c] = p_filename.split(file.lower())[-3:-1]
        fn = ".".join([n, c])
        for line in f:
            s = p_sentence.search(unescape(line)).group(1)
            s = s.lower()
            s = p_typography.sub("", s)
            s = p_non_graph.sub("", s)
            s = p_quote.sub("'", s)
            s = p_ellipsis.sub("... ", s)
            if n in re.split(r"[ :?.,]", s):
                lc = re.split(r"(^| )" + n + "( |[!\",.:;?})\]])", s)[0]
                pw = p_last_word.sub(r"\1", lc)
                df = df.append([dict(fileName=fn, component=c,
                                     precedingWord=pw, node=n,
                                     leftContext=lc, sentence=s)])
            continue

# Reset indices
df.reset_index(drop=True, inplace=True)

# Export dataset
df.to_csv("dataset/py-dataset.csv", sep="\t", encoding="utf-8")

# Let's make a frequency list
# Define neuter and non_neuter
neuter = ["het"]
non_neuter = ["de"]

# Create crosstab
df.loc[df.precedingWord.isin(neuter), "gender"] = "neuter"
df.loc[df.precedingWord.isin(non_neuter), "gender"] = "non_neuter"
df.loc[df.precedingWord.isin(neuter + non_neuter) == 0, "gender"] = "rest"
freqDf = pd.crosstab(df.node, df.gender)
freqDf.to_csv("dataset/py-frequencies.csv", sep="\t", encoding="utf-8")

# How long has the script been running?
time_difference = datetime.now() - start_time
print("Time difference of", time_difference)

I'm running Python in 32-bit, because I'd like to use the nltk module in the future, and they discourage users from using 64-bit.

Seeing that Python is supposed to make everything go smoothly, I was confused. My goal is to beat a similar R script in execution speed; a colleague advised me to use Python because it might be faster here. I read that Python was fast, so where did I go wrong? I know you can't simply compare two languages that way; each has its merits. But in this specific case it seemed to me that Python ought to be faster at data crunching: when both scripts are equally well optimised, Python should win. However, my current Python code is badly optimised, whereas my R script is decent enough.

Please tell me if I'm completely wrong. If someone could help me use some higher-order functions as proposed on SO (imap, ifilter, ...) and explain them, that would be great.

What is the problem? Is Python slower at reading files and lines, or at doing regexes? Or is R simply better equipped for dealing with dataframes, so that it can't be beaten by pandas? Or is my code simply badly optimised, and should Python indeed be the victor?

You can download the test data I used here.

I'm a bit of a beginner, so please take the time to explain how they work and/or why they're better.

asked Aug 22, 2015 at 12:50
  • The goal is to beat a similar R script in execution speed. I read Python was fast, so where did I go wrong? If I had a nickel for every time somebody stated something along those lines... A language being fast doesn't mean every implementation in the 'faster' language is going to beat the implementations in the 'slower' languages. It's a fallacy. Commented Aug 22, 2015 at 13:39
  • A week ago (on SO) I told you to remove the data frame append operation (the most time-consuming part of your code) and split the algorithm into pure functions. And I told you to get a free copy of PyCharm to help make your code cleaner and to profile it when needed. That advice still stands. Moreover, your variables are poorly named. I know you've got a dict call that somewhat clarifies the names, but one has to walk through a lot of lines before getting to that "clarification". The overall approach is too imperative in the worst sense possible. The code is a problem generator. Commented Aug 26, 2015 at 7:25
  • @EliKorvigo I'm aware of that. As I told you then, I couldn't get it to work with higher-order functions such as imap, so I came to ask for an improvement here. I figured people proficient in Python would be able to do an overall better job than I could. And they did! See the answers below. Commented Aug 26, 2015 at 12:02

2 Answers


Speed comparisons are always a good thing, but it can be tricky to determine what is actually being compared. I think it's premature to decide that the comparison is "Python" vs. "R" without a lot of work to verify that all the libraries you use and functions you write are reasonably optimized for each language.

One of the strengths of Python is its pretty good set of line-profiling tools. I like line_profiler, because it works in IPython notebooks, a format I find convenient for "real time" coding and tooling around.

I used line profiling to run a Python 2 version of your code, and I found that a single line in your script was responsible for 94.5% of the execution time. Can you guess which one? It was this:

df = df.append([dict(fileName=fn, component=c,
                     precedingWord=pw, node=n,
                     leftContext=lc, sentence=s)])

I'm not a veteran of pandas but I think it's safe to say that building a data frame row-by-row in pandas is not very efficient.
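As a rough illustration of the effect (a toy benchmark of my own, not measurements from this script), compare appending row by row against building the frame once from a list of dicts. Note that df.append existed in the pandas of the time but has since been removed (in pandas 2.0):

import timeit
import pandas as pd

rows = [dict(a=i, b=str(i)) for i in range(1000)]

def row_by_row():
    df = pd.DataFrame(columns=["a", "b"])
    for r in rows:
        df = df.append([r])   # copies the whole frame on every iteration
    return df

def all_at_once():
    return pd.DataFrame(rows)   # one construction at the end

print(timeit.timeit(row_by_row, number=3))    # expect this to be far slower
print(timeit.timeit(all_at_once, number=3))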

How did I refactor your code to run in Python 2 and run line_profiler on it? Essentially, I just wrapped everything you wrote into a dummy function called run_code() and called the line profiler on that function. For Python 2:

  • I had to modify how you were parsing your directory (I use os.listdir() instead of the glob approach).
  • I had to import codecs, because Python 2 doesn't handle Unicode natively the way Python 3 does.

The bottom-line results: your code took about 25.6 seconds (on my machine) to run, but simply filling the pandas dataframe column by column after parsing was done, instead of row by row during parsing, took 1.2 seconds. This simple modification led to a speedup of more than 20×! You could probably get even faster by pre-allocating NumPy structured arrays for each column, and then using those to fill a dataframe by column.
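Here is a minimal sketch of that pre-allocation idea (my own construction, untested against this dataset): allocate a structured array once the row count is known, fill it in the loop, and build the dataframe in one call:

import numpy as np
import pandas as pd

n_rows = 5825   # known after parsing, or from a cheap counting pass
records = np.zeros(n_rows, dtype=[("node", "U30"), ("precedingWord", "U30")])

for i in range(n_rows):
    records[i] = ("adapter", "de")   # in practice, filled inside the parsing loop

df = pd.DataFrame(records)   # a single column-wise construction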

In addition to the timing issue, there are a number of other aspects of your code that you may want to consider revising:

  • Variable names are confusing (what is s, etc.) and don't always follow PEP 8. (Avoid camelCase and use snake_case instead, etc.)

  • For simple timing, consider the timeit module instead of datetime. (Of course, the main point of my answer is that line profiling is essential for figuring out where the slow parts of your code are, so be sure to use more than simple timing commands when you are optimizing. But there are certainly times when simple timing is useful, and timeit is a module engineered for that task; see the sketch after this list.)

  • You should factor your code into smaller functions that each accomplish a task. For example, one function could be generating a list of filenames to parse. Another function could parse the data and return a dataframe, and a third could find the frequencies.
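For what it's worth, here is a generic timeit sketch (the snippet being timed is arbitrary, not from this script):

import timeit

# repeat() runs the statement several times; taking the minimum
# reduces noise from system overhead
best = min(timeit.repeat("'-'.join(str(i) for i in range(100))",
                         repeat=5, number=10000))
print("best of 5:", best, "seconds per 10000 calls")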

In the end, it's tough to know what your original speed comparison means. If your original Python implementation of this script is a direct translation from R, then it probably means that pandas is bad at filling dataframes by row. But even if that is true, it's unclear whether pandas is "slower" than R, because once you are aware of the row-by-row limitation, you should be able to work around it easily in almost every foreseeable use case. (Can anyone think of an example where filling a dataframe row by row is essential and can't be done any other way?)

Thanks for asking a fun question!

Here's all the code I used.


import os
import re
from datetime import datetime
import numpy as np
import pandas as pd
from glob import glob

# unicode file parsing support in Python 2.x
import codecs

# get unescape to work in Python 2.x
import HTMLParser
unescape = HTMLParser.HTMLParser().unescape

# line profiling
%load_ext line_profiler
import timeit

def run_code():
    start_time = datetime.now()

    # Create empty dataframe with correct column names
    column_names = ["fileName", "component", "precedingWord", "node", "leftContext", "sentence"]
    df = pd.DataFrame(data=np.zeros((0, len(column_names))), columns=column_names)

    # Create correct path where to fetch files
    subdir = "rawdata"
    path = os.path.abspath(os.path.join(os.getcwd(), subdir))

    # "Cache" regex
    # See http://stackoverflow.com/q/452104/1150683
    p_filename = re.compile(r"[./\\]")
    p_sentence = re.compile(r"<sentence>(.*?)</sentence>")
    p_typography = re.compile(r" (?:(?=[.,:;?!) ])|(?<=\( ))")
    p_non_graph = re.compile(r"[^\x21-\x7E\s]")
    p_quote = re.compile(r"\"")
    p_ellipsis = re.compile(r"\.{3}(?=[^ ])")
    p_last_word = re.compile(r"^.*\b(?<!-)(\w+(?:-\w+)*)[^\w]*$", re.U)

    # Loop files in folder
    filenames = [name for name in os.listdir(path) if re.match('.*[.]lst', name)]
    for filename in filenames:
        with codecs.open('rawdata/' + filename, 'r+', encoding='utf-8') as f:
            [n, c] = p_filename.split(filename.lower())[-3:-1]
            fn = ".".join([n, c])
            for line in f:
                uline = unicode(line)
                s = p_sentence.search(unescape(uline)).group(1)
                s = s.lower()
                s = p_typography.sub("", s)
                s = p_non_graph.sub("", s)
                s = p_quote.sub("'", s)
                s = p_ellipsis.sub("... ", s)
                if n in re.split(r"[ :?.,]", s):
                    lc = re.split(r"(^| )" + n + "( |[!\",.:;?})\]])", s)[0]
                    pw = p_last_word.sub(r"\1", lc)
                    df = df.append([dict(fileName=fn, component=c,
                                         precedingWord=pw, node=n,
                                         leftContext=lc, sentence=s)])
                continue

    # Reset indices
    df.reset_index(drop=True, inplace=True)

    # Export dataset
    df.to_csv("dataset/py-dataset.csv", sep="\t", encoding="utf-8")

    # Let's make a frequency list
    # Define neuter and non_neuter
    neuter = ["het"]
    non_neuter = ["de"]

    # Create crosstab
    df.loc[df.precedingWord.isin(neuter), "gender"] = "neuter"
    df.loc[df.precedingWord.isin(non_neuter), "gender"] = "non_neuter"
    df.loc[df.precedingWord.isin(neuter + non_neuter) == 0, "gender"] = "rest"
    freqDf = pd.crosstab(df.node, df.gender)
    freqDf.to_csv("dataset/py-frequencies.csv", sep="\t", encoding="utf-8")

    # How long has the script been running?
    time_difference = datetime.now() - start_time
    print("Time difference of", time_difference)
    return

%lprun -f run_code run_code()

In IPython the result of the line profiler is displayed in a pseudo-popup "help" window. Here it is:

Timer unit: 1e-06 s
Total time: 25.6168 s
File: <ipython-input-5-b8823da4f6a5>
Function: run_code at line 1
Line # Hits Time Per Hit % Time Line Contents
==============================================================
 1 def run_code():
 2 
 3 1 10 10.0 0.0 start_time = datetime.now()
 4 
 5 # Create empty dataframe with correct column names
 6 1 2 2.0 0.0 column_names = ["fileName", "component", "precedingWord", "node", "leftContext", "sentence" ]
 7 1 384 384.0 0.0 df = pd.DataFrame(data=np.zeros((0, len(column_names))), columns=column_names)
 8 
 9 # Create correct path where to fetch files
 10 1 2 2.0 0.0 subdir = "rawdata"
 11 1 119 119.0 0.0 path = os.path.abspath(os.path.join(os.getcwd(), subdir))
 12 
 13 # "Cache" regex
 14 # See http://stackoverflow.com/q/452104/1150683
 15 1 265 265.0 0.0 p_filename = re.compile(r"[./\\]")
 16 
 17 1 628 628.0 0.0 p_sentence = re.compile(r"<sentence>(.*?)</sentence>")
 18 1 697 697.0 0.0 p_typography = re.compile(r" (?:(?=[.,:;?!) ])|(?<=\( ))")
 19 1 411 411.0 0.0 p_non_graph = re.compile(r"[^\x21-\x7E\s]")
 20 1 128 128.0 0.0 p_quote = re.compile(r"\"")
 21 1 339 339.0 0.0 p_ellipsis = re.compile(r"\.{3}(?=[^ ])")
 22 
 23 1 1048 1048.0 0.0 p_last_word = re.compile(r"^.*\b(?<!-)(\w+(?:-\w+)*)[^\w]*$", re.U)
 24 
 25 # Loop files in folder
 26 108 1122 10.4 0.0 filenames = [name for name in os.listdir(path) if re.match('.*[.]lst', name)]
 27 
 28 108 250 2.3 0.0 for filename in filenames:
 29 107 5341 49.9 0.0 with codecs.open('rawdata/' + filename, 'r+', encoding='utf-8') as f:
 30 107 867 8.1 0.0 [n, c] = p_filename.split(filename.lower())[-3:-1]
 31 107 277 2.6 0.0 fn = ".".join([n, c])
 32 6607 395024 59.8 1.5 for line in f:
 33 6500 17927 2.8 0.1 uline = unicode(line)
 34 6500 119436 18.4 0.5 s = p_sentence.search(unescape(uline)).group(1)
 35 6500 19466 3.0 0.1 s = s.lower()
 36 6500 53653 8.3 0.2 s = p_typography.sub("", s)
 37 6500 25654 3.9 0.1 s = p_non_graph.sub("", s)
 38 6500 17735 2.7 0.1 s = p_quote.sub("'", s)
 39 6500 31662 4.9 0.1 s = p_ellipsis.sub("... ", s)
 40 
 41 6500 119657 18.4 0.5 if n in re.split(r"[ :?.,]", s):
 42 5825 117687 20.2 0.5 lc = re.split(r"(^| )" + n + "( |[!\",.:;?})\]])", s)[0]
 43 
 44 5825 133397 22.9 0.5 pw = p_last_word.sub(r"\1", lc)
 45 
 46 5825 12575 2.2 0.0 df = df.append([dict(fileName=fn, component=c, 
 47 5825 8539 1.5 0.0 precedingWord=pw, node=n, 
 48 5825 24222087 4158.3 94.6 leftContext=lc, sentence=s)])
 49 continue
 50 
 51 # Reset indices
 52 1 104 104.0 0.0 df.reset_index(drop=True, inplace=True)
 53 
 54 # Export dataset
 55 1 293388 293388.0 1.1 df.to_csv("dataset/py-dataset.csv", sep="\t", encoding="utf-8")
 56 
 57 # Let's make a frequency list
 58 # Create new dataframe
 59 
 60 # Define neuter and non_neuter
 61 1 3 3.0 0.0 neuter = ["het"]
 62 1 1 1.0 0.0 non_neuter = ["de"]
 63 
 64 # Create crosstab
 65 1 2585 2585.0 0.0 df.loc[df.precedingWord.isin(neuter), "gender"] = "neuter"
 66 1 2125 2125.0 0.0 df.loc[df.precedingWord.isin(non_neuter), "gender"] = "non_neuter"
 67 1 1417 1417.0 0.0 df.loc[df.precedingWord.isin(neuter + non_neuter) == 0, "gender"] = "rest"
 68 
 69 1 9666 9666.0 0.0 freqDf = pd.crosstab(df.node, df.gender)
 70 
 71 1 1042 1042.0 0.0 freqDf.to_csv("dataset/py-frequencies.csv", sep="\t", encoding="utf-8")
 72 
 73 # How long has the script been running?
 74 1 20 20.0 0.0 time_difference = datetime.now() - start_time
 75 1 46 46.0 0.0 print("Time difference of", time_difference)
 76 1 1 1.0 0.0 return

As you can see, building the pandas data frame takes almost all the time. That suggests a trivial optimization:


def run_code_faster():
    start_time = datetime.now()

    # Create empty dataframe with correct column names
    column_names = ["fileName", "component", "precedingWord", "node", "leftContext", "sentence"]
    df = pd.DataFrame(data=np.zeros((0, len(column_names))), columns=column_names)

    # Create correct path where to fetch files
    subdir = "rawdata"
    path = os.path.abspath(os.path.join(os.getcwd(), subdir))

    # "Cache" regex
    # See http://stackoverflow.com/q/452104/1150683
    p_filename = re.compile(r"[./\\]")
    p_sentence = re.compile(r"<sentence>(.*?)</sentence>")
    p_typography = re.compile(r" (?:(?=[.,:;?!) ])|(?<=\( ))")
    p_non_graph = re.compile(r"[^\x21-\x7E\s]")
    p_quote = re.compile(r"\"")
    p_ellipsis = re.compile(r"\.{3}(?=[^ ])")
    p_last_word = re.compile(r"^.*\b(?<!-)(\w+(?:-\w+)*)[^\w]*$", re.U)

    # Loop files in folder
    filenames = [name for name in os.listdir(path) if re.match('.*[.]lst', name)]

    # Collect the columns in plain lists instead of appending to the
    # dataframe row by row
    fn_list = []
    c_list = []
    pw_list = []
    n_list = []
    lc_list = []
    s_list = []

    for filename in filenames:
        with codecs.open('rawdata/' + filename, 'r+', encoding='utf-8') as f:
            [n, c] = p_filename.split(filename.lower())[-3:-1]
            fn = ".".join([n, c])
            for line in f:
                uline = unicode(line)
                s = p_sentence.search(unescape(uline)).group(1)
                s = s.lower()
                s = p_typography.sub("", s)
                s = p_non_graph.sub("", s)
                s = p_quote.sub("'", s)
                s = p_ellipsis.sub("... ", s)
                if n in re.split(r"[ :?.,]", s):
                    lc = re.split(r"(^| )" + n + "( |[!\",.:;?})\]])", s)[0]
                    pw = p_last_word.sub(r"\1", lc)
                    # df = df.append([dict(fileName=fn, component=c,
                    #                      precedingWord=pw, node=n,
                    #                      leftContext=lc, sentence=s)])
                    fn_list.append(fn)
                    c_list.append(c)
                    pw_list.append(pw)
                    n_list.append(n)
                    lc_list.append(lc)
                    s_list.append(s)
                continue

    # Assign data frame column by column
    df['fileName'] = fn_list
    df['component'] = c_list
    df['precedingWord'] = pw_list
    df['node'] = n_list
    df['leftContext'] = lc_list
    df['sentence'] = s_list

    # Reset indices
    df.reset_index(drop=True, inplace=True)

    # Export dataset
    df.to_csv("dataset/py-dataset.csv", sep="\t", encoding="utf-8")

    # Let's make a frequency list
    # Define neuter and non_neuter
    neuter = ["het"]
    non_neuter = ["de"]

    # Create crosstab
    df.loc[df.precedingWord.isin(neuter), "gender"] = "neuter"
    df.loc[df.precedingWord.isin(non_neuter), "gender"] = "non_neuter"
    df.loc[df.precedingWord.isin(neuter + non_neuter) == 0, "gender"] = "rest"
    freqDf = pd.crosstab(df.node, df.gender)
    freqDf.to_csv("dataset/py-frequencies.csv", sep="\t", encoding="utf-8")

    # How long has the script been running?
    time_difference = datetime.now() - start_time
    print("Time difference of", time_difference)
    return

%lprun -f run_code_faster run_code_faster()

Timer unit: 1e-06 s
Total time: 1.21669 s
File: <ipython-input-2-6ca852e32327>
Function: run_code_faster at line 1
Line # Hits Time Per Hit % Time Line Contents
==============================================================
 1 def run_code_faster():
 2 
 3 1 10 10.0 0.0 start_time = datetime.now()
 4 
 5 # Create empty dataframe with correct column names
 6 1 2 2.0 0.0 column_names = ["fileName", "component", "precedingWord", "node", "leftContext", "sentence" ]
 7 1 412 412.0 0.0 df = pd.DataFrame(data=np.zeros((0, len(column_names))), columns=column_names)
 8 
 9 # Create correct path where to fetch files
 10 1 1 1.0 0.0 subdir = "rawdata"
 11 1 120 120.0 0.0 path = os.path.abspath(os.path.join(os.getcwd(), subdir))
 12 
 13 # "Cache" regex
 14 # See http://stackoverflow.com/q/452104/1150683
 15 1 11 11.0 0.0 p_filename = re.compile(r"[./\\]")
 16 
 17 1 6 6.0 0.0 p_sentence = re.compile(r"<sentence>(.*?)</sentence>")
 18 1 12 12.0 0.0 p_typography = re.compile(r" (?:(?=[.,:;?!) ])|(?<=\( ))")
 19 1 6 6.0 0.0 p_non_graph = re.compile(r"[^\x21-\x7E\s]")
 20 1 5 5.0 0.0 p_quote = re.compile(r"\"")
 21 1 5 5.0 0.0 p_ellipsis = re.compile(r"\.{3}(?=[^ ])")
 22 
 23 1 6 6.0 0.0 p_last_word = re.compile(r"^.*\b(?<!-)(\w+(?:-\w+)*)[^\w]*$", re.U)
 24 
 25 # Loop files in folder
 26 108 964 8.9 0.1 filenames = [name for name in os.listdir(path) if re.match('.*[.]lst', name)]
 27 
 28 1 1 1.0 0.0 fn_list = []
 29 1 1 1.0 0.0 c_list = []
 30 1 1 1.0 0.0 pw_list = []
 31 1 2 2.0 0.0 n_list = []
 32 1 2 2.0 0.0 lc_list = []
 33 1 2 2.0 0.0 s_list = []
 34 
 35 108 286 2.6 0.0 for filename in filenames:
 36 107 6811 63.7 0.6 with codecs.open('rawdata/' + filename, 'r+', encoding='utf-8') as f:
 37 107 1026 9.6 0.1 [n, c] = p_filename.split(filename.lower())[-3:-1]
 38 107 314 2.9 0.0 fn = ".".join([n, c])
 39 6607 311585 47.2 25.6 for line in f:
 40 6500 15037 2.3 1.2 uline = unicode(line)
 41 6500 94829 14.6 7.8 s = p_sentence.search(unescape(uline)).group(1)
 42 6500 17369 2.7 1.4 s = s.lower()
 43 6500 42040 6.5 3.5 s = p_typography.sub("", s)
 44 6500 23783 3.7 2.0 s = p_non_graph.sub("", s)
 45 6500 16132 2.5 1.3 s = p_quote.sub("'", s)
 46 6500 31856 4.9 2.6 s = p_ellipsis.sub("... ", s)
 47 
 48 6500 101812 15.7 8.4 if n in re.split(r"[ :?.,]", s):
 49 5825 71344 12.2 5.9 lc = re.split(r"(^| )" + n + "( |[!\",.:;?})\]])", s)[0]
 50 
 51 5825 103504 17.8 8.5 pw = p_last_word.sub(r"\1", lc)
 52 
 53 # df = df.append([dict(fileName=fn, component=c, 
 54 # precedingWord=pw, node=n, 
 55 # leftContext=lc, sentence=s)])
 56 5825 11036 1.9 0.9 fn_list.append(fn)
 57 5825 9798 1.7 0.8 c_list.append(c)
 58 5825 9587 1.6 0.8 pw_list.append(pw)
 59 5825 9642 1.7 0.8 n_list.append(n)
 60 5825 9516 1.6 0.8 lc_list.append(lc)
 61 5825 9390 1.6 0.8 s_list.append(s)
 62 continue
 63 # Assign data frame
 64 1 1448 1448.0 0.1 df['fileName'] = fn_list
 65 1 517 517.0 0.0 df['component'] = c_list
 66 1 532 532.0 0.0 df['precedingWord'] = pw_list
 67 1 493 493.0 0.0 df['node'] = n_list
 68 1 511 511.0 0.0 df['leftContext'] = lc_list
 69 1 437 437.0 0.0 df['sentence'] = s_list
 70 
 71 # Reset indices
 72 1 88 88.0 0.0 df.reset_index(drop=True, inplace=True)
 73 
 74 # Export dataset
 75 1 296747 296747.0 24.4 df.to_csv("dataset/py-dataset.csv", sep="\t", encoding="utf-8")
 76 
 77 # Let's make a frequency list
 78 # Create new dataframe
 79 
 80 # Define neuter and non_neuter
 81 1 3 3.0 0.0 neuter = ["het"]
 82 1 1 1.0 0.0 non_neuter = ["de"]
 83 
 84 # Create crosstab
 85 1 3878 3878.0 0.3 df.loc[df.precedingWord.isin(neuter), "gender"] = "neuter"
 86 1 1871 1871.0 0.2 df.loc[df.precedingWord.isin(non_neuter), "gender"] = "non_neuter"
 87 1 1405 1405.0 0.1 df.loc[df.precedingWord.isin(neuter + non_neuter) == 0, "gender"] = "rest"
 88 
 89 1 9203 9203.0 0.8 freqDf = pd.crosstab(df.node, df.gender)
 90 
 91 1 1234 1234.0 0.1 freqDf.to_csv("dataset/py-frequencies.csv", sep="\t", encoding="utf-8")
 92 
 93 # How long has the script been running?
 94 1 12 12.0 0.0 time_difference = datetime.now() - start_time
 95 1 43 43.0 0.0 print("Time difference of", time_difference)
 96 1 1 1.0 0.0 return

answered Aug 24, 2015 at 16:39
  • What a thorough answer, thank you for that! +1 I ran your code and, indeed, it is unbelievably fast. And a very "easy" solution; I should've thought of that myself. On all my data it runs in 5 minutes, whereas the R script needs 40 minutes. Two questions though: why do we need return at the end of the function? And how should I use timeit? In other words, how can I use it from within Python and get output, when the script has finished running, that is neatly formatted in mm:ss? I googled it, but got nothing. Commented Aug 24, 2015 at 21:56
  • The return is good form for ending functions, even if you aren't returning any values. It makes it easier for readers of the code to see where your function definition ends. Also, if you want to adapt the run_code() function to actually return your pandas dataframe, you could end with return df or similar. Commented Aug 24, 2015 at 22:04
  • I think the timeit docs are pretty informative on how to use it from the command line and from a Python interpreter. The main purpose of timeit is to avoid the pitfall of variable system overhead distorting your timing measurements. If your code takes 5 minutes to run, datetime is fine, but you'll have a hard time interpreting a datetime result if you are timing a code block that takes 5 microseconds to run. Results will be highly variable and non-repeatable; timeit fixes that. Commented Aug 24, 2015 at 22:10
  • Would this be a good way of separating into functions? pastebin.com/hgR55ajZ I can't put the initial part in a function because of the scope of the imported variables. Commented Aug 24, 2015 at 22:25
  • I posted it as a new question. I'll award you the bounty a day before it expires. Who knows, maybe someone else can write an even better-performing script! Commented Aug 24, 2015 at 23:15

Gorillas vs. Sharks:

You can't simply compare Python and R; they're like, well, gorillas and sharks.

It's not as simple as: which is better?

And, it's often not the language that's better, but the quality of the code!


Possible things that could be slowing execution:

Things not specific to you:

  • Beginner's language abuse: as a self-confessed beginner, you may be using the functions or functionality of Python incorrectly.
  • Library abuse: some libraries aren't exactly a walk in the park to use, and you may not be using them properly.
  • Library quality: assuming library quality is a bit naïve; the actual code behind the library may not be as optimised or as good as the other language's implementation.

Things specific to you:

  • Regex: there are many regexes being executed, which is quite a time-consuming process. This regex: (?:(?=[.,:;?!) ])|(?<=\( )) managed to register 3000 steps for me!

s = s.lower(): if your content contains certain characters, this can cause issues.

I'd recommend taking a look over at How do I do a case insensitive string comparison in Python?.
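As a small illustration of why plain lower() can bite (standard Python 3 behaviour, not specific to this script):

s1 = "straße"
s2 = "STRASSE"
print(s1.lower() == s2.lower())        # False: 'ß'.lower() is still 'ß'
print(s1.casefold() == s2.casefold())  # True: casefold() maps 'ß' to 'ss'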


if n in re.split(r"[ :?.,]", s):
    lc = re.split(r"(^| )" + n + "( |[!\",.:;?})\]])", s)[0]

What is n, re, lc or s? Your variable names are unclear and confusing to read.

answered Aug 22, 2015 at 15:18
  • I think the variable names are quite clear. Take a look at the .append function; you'll see that the letters are abbreviations of the column names. I know you can't simply compare two languages that way; each has its merits. But in this specific case it seemed to me that Python ought to be faster at data crunching. My current Python code works, so there's no issue with lower() for me. As I stated, I'm indeed a beginner, and I'm here to look for help. If someone could help me use some higher-order functions as proposed on SO (imap, ifilter, ...) and explain them, that would be great. Commented Aug 22, 2015 at 19:03
