I'm extracting 4 columns from an imported CSV file (~500MB) to be used for fitting a scikit-learn
regression model.
It seems that the function used to do the extraction is extremely slow. I just learnt Python today; any suggestions on how the function can be sped up?
Can multithreading/multiple cores be used? My system has 4 cores.
def splitData(jobs):
    salaries = [jobs[i]['salaryNormalized'] for i, v in enumerate(jobs)]
    descriptions = [jobs[i]['description'] + jobs[i]['normalizedLocation'] + jobs[i]['category'] for i, v in enumerate(jobs)]
    titles = [jobs[i]['title'] for i, v in enumerate(jobs)]
    return salaries, descriptions, titles
Full Code
def loadData(filePath):
    reader = csv.reader( open(filePath) )
    rows = []
    for i, row in enumerate(reader):
        categories = ["id", "title", "description", "rawLocation", "normalizedLocation",
                      "contractType", "contractTime", "company", "category",
                      "salaryRaw", "salaryNormalized", "sourceName"]
        # Skip header row
        if i != 0:
            rows.append( dict(zip(categories, row)) )
    return rows
def splitData(jobs):
    salaries = []
    descriptions = []
    titles = []
    for i in xrange(len(jobs)):
        salaries.append( jobs[i]['salaryNormalized'] )
        descriptions.append( jobs[i]['description'] + jobs[i]['normalizedLocation'] + jobs[i]['category'] )
        titles.append( jobs[i]['title'] )
    return salaries, descriptions, titles
def fit(salaries, descriptions, titles):
    # Vectorize
    vect = TfidfVectorizer()
    vect2 = TfidfVectorizer()
    descriptions = vect.fit_transform(descriptions)
    titles = vect2.fit_transform(titles)
    # Fit
    X = hstack((descriptions, titles))
    y = [ np.log(float(salaries[i])) for i, v in enumerate(salaries) ]
    rr = Ridge(alpha=0.035)
    rr.fit(X, y)
    return vect, vect2, rr, X, y
jobs = loadData( paths['train_data_path'] )
salaries, descriptions, titles = splitData(jobs)
vect, vect2, rr, X_train, y_train = fit(salaries, descriptions, titles)
3 Answers
def loadData(filePath):
    reader = csv.reader( open(filePath) )
    rows = []
    for i, row in enumerate(reader):
        categories = ["id", "title", "description", "rawLocation", "normalizedLocation",
                      "contractType", "contractTime", "company", "category",
                      "salaryRaw", "salaryNormalized", "sourceName"]
A list like this should really be a global variable to avoid the expense of recreating it constantly. But you'd do better not to store your data in a dictionary at all. Instead do this:
for (id, title, description, raw_location, normalized_location, contract_type,
     contract_time, company, category, salary_raw, salary_normalized, source_name) in reader:
    yield salary_normalized, ''.join((description, normalized_location, category)), title
Everything is then stored in Python local variables (fairly efficient), and yield produces the three elements you actually want. Just use
salaries, descriptions, titles = zip(*loadData(...))
to get your three lists back.
# Skip header row
if i != 0:
    rows.append( dict(zip(categories, row)) )
Rather than this, call reader.next() once before the loop to take out the header.
return rows
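Putting those suggestions together, a minimal sketch of the rewritten loader might look like the following (this assumes the CSV really has exactly these twelve columns in this order, and that paths['train_data_path'] is defined as in the question):

import csv

def load_data(file_path):
    # Generator: unpack each row into local variables and yield only the
    # three fields that the rest of the pipeline actually uses.
    with open(file_path) as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row once, instead of testing i != 0 on every row
        for (id, title, description, raw_location, normalized_location,
             contract_type, contract_time, company, category,
             salary_raw, salary_normalized, source_name) in reader:
            yield salary_normalized, ''.join((description, normalized_location, category)), title

salaries, descriptions, titles = zip(*load_data(paths['train_data_path']))

zip(*...) transposes the yielded (salary, description, title) tuples back into three columns, so the rest of the code should work unchanged (salaries, descriptions and titles will be tuples rather than lists, which both TfidfVectorizer and the salary list comprehension accept).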
- loadtxt() gives me the error ValueError: cannot set an array element with a sequence on line X = np.array(X, dtype). Currently trying to upload the CSV somewhere. – Nyxynyx, May 2, 2013 at 1:56
- Here's the CSV: http://www.mediafire.com/?4d3j8g88d6j2h0x – Nyxynyx, May 2, 2013 at 2:06
- @Nyxynyx, my bad. numpy doesn't handle quoted fields. Replaced my answer with a better one. My code processes the file in 8 seconds (does everything before the fit function). – Winston Ewert, May 2, 2013 at 2:57
- It appears that I need to do rows = zip( load_data(...) ) instead of salaries, descriptions, titles = zip(loadData(...)). Is this correct? – Nyxynyx, May 2, 2013 at 3:09
- @Nyxynyx, typo. Should be salaries, descriptions, titles = zip(*loadData(...)). zip(*...) does the opposite of zip(...) (see the short demo below). – Winston Ewert, May 2, 2013 at 3:11
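For illustration, a tiny made-up example of the zip(*...) transposition being discussed here:

rows = [('10000', 'descA', 'titleA'),
        ('20000', 'descB', 'titleB')]
salaries, descriptions, titles = zip(*rows)  # transposes rows into columns
# salaries      -> ('10000', '20000')
# descriptions  -> ('descA', 'descB')
# titles        -> ('titleA', 'titleB')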
titles = [jobs[i]['title'] for i, v in enumerate(jobs)]
can (should?) be rewritten as:
titles = [j['title'] for j in jobs]
because we just want to access each value directly rather than indexing by position.
Thus, the whole function would be:
def splitData(jobs):
    salaries = [j['salaryNormalized'] for j in jobs]
    descriptions = [j['description'] + j['normalizedLocation'] + j['category'] for j in jobs]
    titles = [j['title'] for j in jobs]
    return salaries, descriptions, titles
Not quite sure how much it helps from a performance point of view.
Edit: Otherwise, another option might be to write a generator which yields j['salaryNormalized'], j['description'] + j['normalizedLocation'] + j['category'], j['title']
as you need it. It really depends on how you use your function.
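For instance, a rough sketch of such a generator, assuming jobs is the list of dicts returned by the question's loadData():

def split_data(jobs):
    # Yield one (salary, description, title) tuple per job on demand,
    # instead of building three full lists up front.
    for j in jobs:
        yield (j['salaryNormalized'],
               j['description'] + j['normalizedLocation'] + j['category'],
               j['title'])

# If the three separate sequences are still needed (e.g. for TfidfVectorizer),
# the generator can be unzipped:
salaries, descriptions, titles = zip(*split_data(jobs))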
- I have updated the question with more code, that may help! How can I write the generator? – Nyxynyx, May 1, 2013 at 22:30
It seems that you are looping through the entire job data set 3 times (once each for salaries, descriptions and titles). You will be able to speed this up three-fold if you extract all the info in one pass:
def split_data(jobs):
    # Appends to the salaries/descriptions/titles lists initialised outside
    # the function (see the example below).
    for job, info in jobs.items():
        salaries.append(info['salaryNormalized'])
        descriptions.append([info['description'],
                             info['normalizedLocation'],
                             info['category']])
        titles.append(info['title'])
EDIT: Added load_data(); slight tweak to return a dictionary of dictionaries instead of a list of dictionaries:
def load_data(filepath):
    reader = csv.reader(open(filepath))
    jobs = {}
    for i, row in enumerate(reader):
        categories = ["id", "title", "description", "rawLocation", "normalizedLocation",
                      "contractType", "contractTime", "company", "category",
                      "salaryRaw", "salaryNormalized", "sourceName"]
        if i != 0:
            jobs[i] = dict(zip(categories, row))
    return jobs
An example:
jobs = {0 : {'salaryNormalized' : 10000, 'description' : 'myKob',
             'normalizedLocation': 'Hawaii', 'category': 'categ1',
             'title' : 'tourist'},
        1 : {'salaryNormalized' : 15000, 'description' : 'myKob',
             'normalizedLocation': 'Hawaii', 'category': 'categ2',
             'title' : 'tourist'},
        2 : {'salaryNormalized' : 50000, 'description' : 'myKob',
             'normalizedLocation': 'Hawaii', 'category': 'categ10',
             'title' : 'resort_manager'}}

salaries, descriptions, titles = [], [], []
split_data(jobs)

print(salaries)
--> [10000, 15000, 50000]
print(descriptions)
--> [['myKob', 'Hawaii', 'categ1'], ['myKob', 'Hawaii', 'categ2'], ['myKob', 'Hawaii', 'categ10']]
print(titles)
--> ['tourist', 'tourist', 'resort_manager']
Hope this helps!
- What makes you think that will speed it up three-fold? – Winston Ewert, May 1, 2013 at 22:34
- I am not suggesting that it will be exact; there will be some margin in that. Deductively, the original runs a loop through the data (~500MB) 3 times; that's expensive and dominates the time taken to run. This one only loops once. We have reduced the dominant factor by 3. My algorithm theory isn't perfect though - have I missed the mark on this one? Appreciate your thoughts. – Nick Burns, May 1, 2013 at 22:40
- The problem is that your new loop does three times as much per iteration. It appends to a list three times, instead of once. That means each iteration will take 3x the cost and your savings cancel out completely. Margins make it difficult to say which version will end up being faster. – Winston Ewert, May 1, 2013 at 22:48
- That is a good point. However, the original uses list comprehensions to build the lists one item at a time (though more efficiently than append), for a total of 3 'additions to list' for each entry in 500 MB. This version appends 3 times per iteration (also therefore 3 calls to append for each entry in 500 MB). To me, that is the same number of append operations. I also kinda feel that the cost of appending is nominal compared to the cost of looping over 500 MB. What do you think? – Nick Burns, May 1, 2013 at 22:57
- Looping is actually fairly cheap. It's the appending which is expensive. That's why I'd expect both versions to have very similar speeds. Combining the loops reduces the looping overhead. Using a comprehension makes the appending faster. Who wins out? I don't know. But I'm pretty sure there isn't a 3x difference (a rough timing sketch follows below). – Winston Ewert, May 1, 2013 at 23:29
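For what it's worth, here is a rough, hypothetical micro-benchmark of the two approaches using the standard-library timeit module. The field names and data are made up, and real numbers will depend on the machine and the actual rows:

import timeit

setup = """
jobs = [{'salaryNormalized': '10000', 'description': 'd', 'normalizedLocation': 'l',
         'category': 'c', 'title': 't'} for _ in range(100000)]
"""

three_comprehensions = """
salaries = [j['salaryNormalized'] for j in jobs]
descriptions = [j['description'] + j['normalizedLocation'] + j['category'] for j in jobs]
titles = [j['title'] for j in jobs]
"""

single_loop = """
salaries, descriptions, titles = [], [], []
for j in jobs:
    salaries.append(j['salaryNormalized'])
    descriptions.append(j['description'] + j['normalizedLocation'] + j['category'])
    titles.append(j['title'])
"""

print(timeit.timeit(three_comprehensions, setup, number=10))  # three passes, comprehensions
print(timeit.timeit(single_loop, setup, number=10))           # one pass, explicit appends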