I have been told that it would be wise to split up my code into semantically useful blocks. I tried the following but I need feedback.
What I already did:
- As you can see, I create two .csv files, so it seemed logical to have two functions: one for the main dataset, the other for a frequency table.
- I only import modules where I need them, though I did need some globally.
I probably should give fn, c and s better variable names. However, I'm not sure how to name them because they are identical to the column names they will be assigned to (e.g. node = node, sentence = sentence).
So what else can I do to better format my code into useful functions? Do note that I want to put an emphasis on performance, so any edits that decrease performance are discouraged.
import os, pandas as pd, numpy as np
from datetime import datetime
start_time = datetime.now()
# Create empty dataframe with correct column names
column_names = ["fileName", "component", "precedingWord", "node", "leftContext", "sentence" ]
df = pd.DataFrame(data=np.zeros((0, len(column_names))), columns=column_names)
# Create correct path where to fetch files
subdir = "rawdata"
path = os.path.abspath(os.path.join(os.getcwd(), os.pardir, subdir))
def main_dataset():
    import regex as re
    from html import unescape
    # Loop files in folder
    filenames = [name for name in os.listdir(path) if re.match(".*?[.]lst", name)]
    # "Cache" regex
    # See http://stackoverflow.com/q/452104/1150683
    p_filename = re.compile(r"[./\\]")
    p_sentence = re.compile(r"<sentence>(.*?)</sentence>")
    p_typography = re.compile(r" (?:(?=[.,:;?!) ])|(?<=\( ))")
    p_non_graph = re.compile(r"[^\x21-\x7E\s]")
    p_quote = re.compile(r"\"")
    p_ellipsis = re.compile(r"\.{3}(?=[^ ])")
    p_last_word = re.compile(r"^.*\b(?<!-)(\w+(?:-\w+)*)[^\w]*$", re.U)
    fn_list = []
    c_list = []
    pw_list = []
    n_list = []
    lc_list = []
    s_list = []
    for filename in filenames:
        with open(path + '/' + filename, 'r+', encoding="utf-8") as f:
            [n, c] = p_filename.split(filename.lower())[-3:-1]
            fn = ".".join([n, c])
            for line in f:
                s = p_sentence.search(unescape(line)).group(1)
                s = s.lower()
                s = p_typography.sub("", s)
                s = p_non_graph.sub("", s)
                s = p_quote.sub("'", s)
                s = p_ellipsis.sub("... ", s)
                if n in re.split(r"[ :?.,]", s):
                    lc = re.split(r"(^| )" + n + "( |[!\",.:;?})\]])", s)[0]
                    pw = p_last_word.sub("\1円", lc)
                    fn_list.append(fn)
                    c_list.append(c)
                    pw_list.append(pw)
                    n_list.append(n)
                    lc_list.append(lc)
                    s_list.append(s)
                    continue
    # Assign data frame
    df['fileName'] = fn_list
    df['component'] = c_list
    df['precedingWord'] = pw_list
    df['node'] = n_list
    df['leftContext'] = lc_list
    df['sentence'] = s_list
    # Reset indices
    df.reset_index(drop=True, inplace=True)
    # Export dataset
    df.to_csv("dataset/py-dataset.csv", sep="\t", encoding="utf-8")
def frequency_table():
    # Define neuter and non_neuter
    neuter = ["het"]
    non_neuter = ["de"]
    # Create crosstab
    df.loc[df.precedingWord.isin(neuter), "gender"] = "neuter"
    df.loc[df.precedingWord.isin(non_neuter), "gender"] = "non_neuter"
    df.loc[df.precedingWord.isin(neuter + non_neuter) == 0, "gender"] = "rest"
    freqDf = pd.crosstab(df.node, df.gender)
    freqDf.to_csv("dataset/py-frequencies.csv", sep="\t", encoding="utf-8")
    # How long has the script been running?
    time_difference = datetime.now() - start_time
    print("Time difference of", time_difference)
main_dataset()
frequency_table()
- Standard recommendation: read and follow python.org/dev/peps/pep-0008 – jonrsharpe, Aug 25, 2015 at 12:22
2 Answers
Styling
Some minor edits, but you should import each module on a new line, especially if you're assigning them different names.
import os
import numpy as np
import pandas as pd
It seems redundant here to assign subdir for just a single use. If there are plans to use it later, sure, but as it stands this should be fine:
path = os.path.abspath(os.path.join(os.getcwd(), os.pardir, "rawdata"))
I know you said that you wanted to save on performance, but since you call this function immediately anyway, you're always going to run these imports, so they should be at the top with the rest.
import regex as re
from html import unescape
The escaping in the p_quote regex string is unnecessary when you can just use '"'. Python sees no real difference between strings bound in single or double quotes, meaning you can easily write it like this:
p_quote = re.compile('"')
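If you want to convince yourself of that, a quick check (just an illustrative one-liner, not part of your script) shows both spellings produce the same one-character string:

print('"' == "\"")  # True: escaped and unescaped double quote are identical strings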
Don't use path + '/' + filename to set a file path; use the os module instead. It performs certain operations based on the OS Python is being run on, so os.path.join(path, filename) will work cross-platform.
Also, there's no real point in opening the file in r+ mode, as that allows you to read and write, but all you ever do is read the file anyway.
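Putting those two points together, the opening of each file could look something like this (only a sketch; path and filename are the same names used in your loop):

with open(os.path.join(path, filename), "r", encoding="utf-8") as f:
    # plain "r" is enough, since the file is only ever read
    for line in f:
        ...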
As for your naming conundrum, I'd just use the actual column names. They're not going to actually clash, since you call the columns with strings, so there's a clear delineation between what is a list and what's a string. And what can be clearer than using the same name? Also, generally you can leave list out of a variable name except if you actually need to distinguish it from others. The exception to both in your case, of course, being that you have a separate list of filenames, so I've given that a different name, but you might find a different name more suitable.
filenameDataList.append(filenameData)
components.append(component)
precedingWords.append(precedingWord)
nodes.append(node)
leftContexts.append(leftContext)
sentences.append(sentence)
...
df['sentence'] = sentences
Minor Performance
These are two notes that won't significantly reduce your run time on any single file, because they don't affect the regex work, but they will save you a fraction of a second per file, which can add up if you're running the script over many files.
You don't need to set n and c as part of a list; you can just comma separate them and Python will understand. And it's faster to not create an unneeded list.

n, c = p_filename.split(filename.lower())[-3:-1]
There's also no reason to use join here. It's slower than just string concatenation, since it's just two items and you're creating the list purely for the sake of the join.
fn = n + "." + c
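If you want to check that claim on your own machine, a quick timeit comparison along these lines should show the difference (the values "wiki" and "nl" are only placeholders standing in for n and c, not taken from your data, and the exact numbers will vary):

import timeit

# Rough micro-benchmark sketch: joining two short strings vs. concatenating them.
setup = 'n, c = "wiki", "nl"'
print(timeit.timeit('".".join([n, c])', setup=setup))
print(timeit.timeit('n + "." + c', setup=setup))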
Refactoring
I would put all the imports together, as I said above. Also, I would put all the regex compilations near the top. They're effectively constants, so it's best to always keep constants near the top of the file rather than in a function where they'll be recreated exactly the same (not that this affects your performance here). This would mean that you have all of this before your data frame is created:
import numpy as np
import os
import pandas as pd
import regex as re
from html import unescape
from datetime import datetime
start_time = datetime.now()
p_filename = re.compile(r"[./\\]")
p_last_word = re.compile(r"^.*\b(?<!-)(\w+(?:-\w+)*)[^\w]*$", re.U)
p_sentence = re.compile(r"<sentence>(.*?)</sentence>")
p_typography = re.compile(r" (?:(?=[.,:;?!) ])|(?<=\( ))")
p_non_graph = re.compile(r"[^\x21-\x7E\s]")
p_quote = re.compile(r"\"")
p_ellipsis = re.compile(r"\.{3}(?=[^ ])")
As for splitting up functions, I don't personally think that's necessary as your functions aren't overly long and it's a relatively specific process without a lot of repeating code, so I would keep to the two separate ones you have here.
As jonrsharpe's comment says, read and follow PEP8. Several things to be improved:

1. You should not import inside a function. All import statements should be at the top of your *.py file.

2. Your variable names could be more descriptive. What is s? What is fn_list? (I know I named some of these variables in my answer to your other question, but I was in a hurry and they could definitely use further improvement.)

3. Your functions don't return anything and don't take in any parameters. That means your modularization of the code is incomplete. The functions are relying on variables in the global namespace. Your main_dataset() function should probably take in either path or filenames as a parameter, for example, so that you don't have to modify your code to use it on different directories or files. That is, instead of def main_dataset():, you should write def main_dataset(path):, for example. And it should probably end in return df. That way you don't have to load df from the *.csv file again if you are going to use it in downstream analyses. Likewise, frequency_table() should probably end in return freqDf (although note that freqDf is not a PEP8-recommended variable name). If you chose to, you could make the functions far more versatile by passing many more parameters: frequency_table() could take in a path (as a string) to be used for the output file, so it could save results anywhere, and it could also take in neuter and non_neuter as parameters. (Those are good variable names BTW!) See the sketch after this list.

4. Rather than having your two functions called explicitly at the bottom of the code, the Python idiom is to nest those calls under an if __name__ == "__main__": statement. That way the code is not run if you are just importing your file, but will be run if you call your file from the command line. A very excellent Stack Overflow answer has more info on the if __name__ == "__main__": idiom in Python.
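To make points 3 and 4 concrete, here is a rough skeleton of the shape this could take. The parameter names, the output_file argument, and the freq_df spelling are only suggestions, and the elided bodies are the ones you already have:

def main_dataset(path):
    # ... build df from the .lst files found in `path`, exactly as before ...
    return df

def frequency_table(df, output_file, neuter, non_neuter):
    # ... assign the gender column from neuter/non_neuter as before ...
    freq_df = pd.crosstab(df.node, df.gender)
    freq_df.to_csv(output_file, sep="\t", encoding="utf-8")
    return freq_df

if __name__ == "__main__":
    df = main_dataset(path)
    frequency_table(df, "dataset/py-frequencies.csv", ["het"], ["de"])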
- Could you please expand on 4? I'm not sure I understand. – Bram Vanroy, Aug 25, 2015 at 16:56
- I added a link to point #4, let me know if it helps. – Curt F., Aug 26, 2015 at 14:23