I have been told that it would be wise to split up my code into semantically useful blocks. I tried the following but I need feedback.
What I already did:
- As you can see, I create two .csv files, so it seemed logical to have two functions: one for the main dataset, the other for a frequency table.
- I only import modules where I need them, though I did need some globally.
I probably should give fn, c and s better variable names. However, I'm not sure how to name them because they are identical to the column names they will be assigned to (e.g. node = node, sentence = sentence).
So what else can I do to better format my code into useful functions? Do note that I want to put an emphasis on performance, so any edits that decrease performance are discouraged.
import os, pandas as pd, numpy as np
from datetime import datetime
start_time = datetime.now()
# Create empty dataframe with correct column names
column_names = ["fileName", "component", "precedingWord", "node", "leftContext", "sentence" ]
df = pd.DataFrame(data=np.zeros((0, len(column_names))), columns=column_names)
# Create correct path where to fetch files
subdir = "rawdata"
path = os.path.abspath(os.path.join(os.getcwd(), os.pardir, subdir))
def main_dataset():
    import regex as re
    from html import unescape
    # Loop files in folder
    filenames = [name for name in os.listdir(path) if re.match(".*?[.]lst", name)]
    # "Cache" regex
    # See http://stackoverflow.com/q/452104/1150683
    p_filename = re.compile(r"[./\\]")
    p_sentence = re.compile(r"<sentence>(.*?)</sentence>")
    p_typography = re.compile(r" (?:(?=[.,:;?!) ])|(?<=\( ))")
    p_non_graph = re.compile(r"[^\x21-\x7E\s]")
    p_quote = re.compile(r"\"")
    p_ellipsis = re.compile(r"\.{3}(?=[^ ])")
    p_last_word = re.compile(r"^.*\b(?<!-)(\w+(?:-\w+)*)[^\w]*$", re.U)
    fn_list = []
    c_list = []
    pw_list = []
    n_list = []
    lc_list = []
    s_list = []
    for filename in filenames:
        with open(path + '/' + filename, 'r+', encoding="utf-8") as f:
            [n, c] = p_filename.split(filename.lower())[-3:-1]
            fn = ".".join([n, c])
            for line in f:
                s = p_sentence.search(unescape(line)).group(1)
                s = s.lower()
                s = p_typography.sub("", s)
                s = p_non_graph.sub("", s)
                s = p_quote.sub("'", s)
                s = p_ellipsis.sub("... ", s)
                if n in re.split(r"[ :?.,]", s):
                    lc = re.split(r"(^| )" + n + "( |[!\",.:;?})\]])", s)[0]
                    pw = p_last_word.sub("\1円", lc)
                    fn_list.append(fn)
                    c_list.append(c)
                    pw_list.append(pw)
                    n_list.append(n)
                    lc_list.append(lc)
                    s_list.append(s)
                    continue
    # Assign data frame
    df['fileName'] = fn_list
    df['component'] = c_list
    df['precedingWord'] = pw_list
    df['node'] = n_list
    df['leftContext'] = lc_list
    df['sentence'] = s_list
    # Reset indices
    df.reset_index(drop=True, inplace=True)
    # Export dataset
    df.to_csv("dataset/py-dataset.csv", sep="\t", encoding="utf-8")
def frequency_table():
    # Define neuter and non_neuter
    neuter = ["het"]
    non_neuter = ["de"]
    # Create crosstab
    df.loc[df.precedingWord.isin(neuter), "gender"] = "neuter"
    df.loc[df.precedingWord.isin(non_neuter), "gender"] = "non_neuter"
    df.loc[df.precedingWord.isin(neuter + non_neuter) == 0, "gender"] = "rest"
    freqDf = pd.crosstab(df.node, df.gender)
    freqDf.to_csv("dataset/py-frequencies.csv", sep="\t", encoding="utf-8")
    # How long has the script been running?
    time_difference = datetime.now() - start_time
    print("Time difference of", time_difference)
main_dataset()
frequency_table()
- Standard recommendation: read and follow python.org/dev/peps/pep-0008 – jonrsharpe, Aug 25, 2015 at 12:22
2 Answers
Styling
Some minor edits, but you should import each module on a new line, especially if you're assigning them different names.
import os
import numpy as np
import pandas as pd
It seems redundant here to assign subdir for just a single use. If there are plans to use it later, sure, but as it stands this should be fine:
path = os.path.abspath(os.path.join(os.getcwd(), os.pardir, "rawdata"))
I know you said that you wanted to save on performance, but since you call this function immediately anyway, you're always going to run these imports, so they should be at the top with the rest.
import regex as re
from html import unescape
The escaping in the p_quote regex string is unnecessary when you can just use '"'. Python sees no real difference between strings bound in single or double quotes, meaning you can easily write it like this:
p_quote = re.compile('"')
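If you want to convince yourself of that, a quick check (just an illustrative one-liner, not part of your script) shows both spellings produce the same one-character string:

print('"' == "\"")  # True: escaped and unescaped double quote are identical strings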
Don't use path + '/' + filename to set a file path; use the os module instead. It performs certain operations based on the OS Python is being run on, so os.path.join(path, filename) will work cross-platform.
Also, there's no real point in opening the file in r+ mode, as that allows you to read and write, but all you ever do is read the file anyway.
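Putting those two points together, the opening of each file could look something like this (only a sketch; path and filename are the same names used in your loop):

with open(os.path.join(path, filename), "r", encoding="utf-8") as f:
    # plain "r" is enough, since the file is only ever read
    for line in f:
        ...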
As for your naming conundrum, I'd just use the actual column names. They're not going to actually clash, since you call the columns with strings, so there's a clear delineation between what is a list and what's a string. And what can be clearer than using the same name? Also, generally you can leave list out of a variable name except if you actually need to distinguish it from others. The exception to both in your case, of course, being that you have a separate list of filenames, so I've given that a different name, but you might find a different name more suitable.
filenameDataList.append(filenameData)
components.append(component)
precedingWords.append(precedingWord)
nodes.append(node)
leftContexts.append(leftContext)
sentences.append(sentence)
...
df['sentence'] = sentences
Minor Performance
These are two notes that won't significantly reduce your run time on any single file, because they don't affect the regex work, but they will save you a fraction of a second per file, which can add up if you're running the script over many files.
You don't need to set n and c as part of a list; you can just comma separate them and Python will understand. And it's faster to not create an unneeded list.

n, c = p_filename.split(filename.lower())[-3:-1]
There's also no reason to use join here. It's slower than just string concatenation, since it's just two items and you're creating the list purely for the sake of the join.
fn = n + "." + c
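If you want to check that claim on your own machine, a quick timeit comparison along these lines should show the difference (the values "wiki" and "nl" are only placeholders standing in for n and c, not taken from your data, and the exact numbers will vary):

import timeit

# Rough micro-benchmark sketch: joining two short strings vs. concatenating them.
setup = 'n, c = "wiki", "nl"'
print(timeit.timeit('".".join([n, c])', setup=setup))
print(timeit.timeit('n + "." + c', setup=setup))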
Refactoring
I would put all the imports together, as I said above. Also, I would put all the regex compilations near the top. They're effectively constants, so it's best to always keep constants near the top of the file rather than in a function where they'll be recreated exactly the same (not that this affects your performance here). This would mean that you have all of this before your data frame is created:
import numpy as np
import os
import pandas as pd
import regex as re
from html import unescape
from datetime import datetime
start_time = datetime.now()
p_filename = re.compile(r"[./\\]")
p_last_word = re.compile(r"^.*\b(?<!-)(\w+(?:-\w+)*)[^\w]*$", re.U)
p_sentence = re.compile(r"<sentence>(.*?)</sentence>")
p_typography = re.compile(r" (?:(?=[.,:;?!) ])|(?<=\( ))")
p_non_graph = re.compile(r"[^\x21-\x7E\s]")
p_quote = re.compile(r"\"")
p_ellipsis = re.compile(r"\.{3}(?=[^ ])")
As for splitting up functions, I don't personally think that's necessary as your functions aren't overly long and it's a relatively specific process without a lot of repeating code, so I would keep to the two separate ones you have here.
As jonrsharpe's comment says, read and follow PEP8. Several things to be improved:

1. You should not import inside a function. All import statements should be at the top of your *.py file.

2. Your variable names could be more descriptive. What is s? What is fn_list? (I know I named some of these variables in my answer to your other question, but I was in a hurry and they could definitely use further improvement.)

3. Your functions don't return anything and don't take in any parameters. That means your modularization of the code is incomplete. The functions are relying on variables in the global namespace. Your main_dataset() function should probably take in either path or filenames as a parameter, for example, so that you don't have to modify your code to use it on different directories or files. That is, instead of def main_dataset():, you should write def main_dataset(path):, for example. And it should probably end in return df. That way you don't have to load df from the *.csv file again if you are going to use it in downstream analyses. Likewise, frequency_table() should probably end in return freqDf (although note that freqDf is not a PEP8-recommended variable name). If you chose to, you could make the functions far more versatile by passing many more parameters: frequency_table() could take in a path (as a string) to be used for the output file, so it could save results anywhere, and it could also take in neuter and non_neuter as parameters. (Those are good variable names BTW!) See the sketch after this list.

4. Rather than having your two functions called explicitly at the bottom of the code, the Python idiom is to nest those calls under an if __name__ == "__main__": statement. That way the code is not run if you are just importing your file, but will be run if you call your file from the command line. A very excellent Stack Overflow answer has more info on the if __name__ == "__main__": idiom in Python.
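To make points 3 and 4 concrete, here is a rough skeleton of the shape this could take. The parameter names, the output_file argument, and the freq_df spelling are only suggestions, and the elided bodies are the ones you already have:

def main_dataset(path):
    # ... build df from the .lst files found in `path`, exactly as before ...
    return df

def frequency_table(df, output_file, neuter, non_neuter):
    # ... assign the gender column from neuter/non_neuter as before ...
    freq_df = pd.crosstab(df.node, df.gender)
    freq_df.to_csv(output_file, sep="\t", encoding="utf-8")
    return freq_df

if __name__ == "__main__":
    df = main_dataset(path)
    frequency_table(df, "dataset/py-frequencies.csv", ["het"], ["de"])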
- Could you please expand on 4? I'm not sure I understand. – Bram Vanroy, Aug 25, 2015 at 16:56
- I added a link to point #4, let me know if it helps. – Curt F., Aug 26, 2015 at 14:23