
I'm extracting 4 columns from an imported CSV file (~500MB) to be used for fitting a scikit-learn regression model.

The function used to do the extraction seems extremely slow. I just learnt Python today; any suggestions on how it can be sped up?

Can multithreading/multiple cores be used? My system has 4 cores.

def splitData(jobs):
    salaries = [jobs[i]['salaryNormalized'] for i, v in enumerate(jobs)]
    descriptions = [jobs[i]['description'] + jobs[i]['normalizedLocation'] + jobs[i]['category'] for i, v in enumerate(jobs)]
    titles = [jobs[i]['title'] for i, v in enumerate(jobs)]
    return salaries, descriptions, titles

Full Code

def loadData(filePath):
    reader = csv.reader(open(filePath))
    rows = []
    for i, row in enumerate(reader):
        categories = ["id", "title", "description", "rawLocation", "normalizedLocation",
                      "contractType", "contractTime", "company", "category",
                      "salaryRaw", "salaryNormalized", "sourceName"]
        # Skip header row
        if i != 0:
            rows.append(dict(zip(categories, row)))
    return rows

def splitData(jobs):
    salaries = []
    descriptions = []
    titles = []
    for i in xrange(len(jobs)):
        salaries.append(jobs[i]['salaryNormalized'])
        descriptions.append(jobs[i]['description'] + jobs[i]['normalizedLocation'] + jobs[i]['category'])
        titles.append(jobs[i]['title'])
    return salaries, descriptions, titles

def fit(salaries, descriptions, titles):
    # Vectorize
    vect = TfidfVectorizer()
    vect2 = TfidfVectorizer()
    descriptions = vect.fit_transform(descriptions)
    titles = vect2.fit_transform(titles)
    # Fit
    X = hstack((descriptions, titles))
    y = [np.log(float(salaries[i])) for i, v in enumerate(salaries)]
    rr = Ridge(alpha=0.035)
    rr.fit(X, y)
    return vect, vect2, rr, X, y

jobs = loadData(paths['train_data_path'])
salaries, descriptions, titles = splitData(jobs)
vect, vect2, rr, X_train, y_train = fit(salaries, descriptions, titles)
asked May 1, 2013 at 19:55
  • Can you provide a sample of the train_data_path? Commented May 1, 2013 at 22:55
  • @WinstonEwert Sure: http://www.mediafire.com/?4d3j8g88d6j2h0x Commented May 2, 2013 at 2:06
  • Could you post a few lines of the data file directly in the question? Commented May 2, 2013 at 6:49
  • Please only state the code purpose in the title. Commented May 16, 2015 at 14:05

3 Answers

def loadData(filePath):
    reader = csv.reader(open(filePath))
    rows = []
    for i, row in enumerate(reader):
        categories = ["id", "title", "description", "rawLocation", "normalizedLocation",
                      "contractType", "contractTime", "company", "category",
                      "salaryRaw", "salaryNormalized", "sourceName"]

A list like this should really be a global variable, to avoid the expense of recreating it on every iteration. But you'd do better not to store your rows in dictionaries at all. Instead do this:

for (id, title, description, raw_location, normalized_location, contract_type,
     contract_time, company, category, salary_raw, salary_normalized, source_name) in reader:
    yield salary_normalized, ''.join((description, normalized_location, category)), title

Each field is then stored in a Python local variable (fairly efficient), and yield produces the three elements you actually want. Just use

 salaries, descriptions, titles = zip(*loadData(...))

to get your three lists again.

        # Skip header row
        if i != 0:
            rows.append(dict(zip(categories, row)))

Rather than this, call reader.next() before the loop to consume the header.

    return rows
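Putting this answer's pieces together, a minimal sketch of the streaming loader (the field order is taken from the question's categories list; written for Python 3, where reader.next() is spelled next(reader), and assuming every row has all 12 fields):

```python
import csv

def load_data(file_path):
    """Stream (salary, combined text, title) tuples from the CSV, skipping the header."""
    with open(file_path) as f:
        reader = csv.reader(f)
        next(reader)  # consume the header row
        for (id_, title, description, raw_location, normalized_location,
             contract_type, contract_time, company, category,
             salary_raw, salary_normalized, source_name) in reader:
            # Keep only the three things the model needs; everything else is dropped
            yield (salary_normalized,
                   ''.join((description, normalized_location, category)),
                   title)

# Transpose the stream of 3-tuples into three parallel sequences:
# salaries, descriptions, titles = zip(*load_data('train.csv'))
```

Because it is a generator, nothing is materialized until zip(*...) consumes it, and no per-row dictionaries are built.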
answered May 2, 2013 at 1:04
  • loadtxt() gives me the error ValueError: cannot set an array element with a sequence on line X = np.array(X, dtype). Currently trying to upload the CSV somewhere Commented May 2, 2013 at 1:56
  • Here's the CSV: http://www.mediafire.com/?4d3j8g88d6j2h0x Commented May 2, 2013 at 2:06
  • @Nyxynyx, my bad. numpy doesn't handle quoted fields. Replaced my answer with a better one. My code processes the file in 8 seconds (does everything before the fit function). Commented May 2, 2013 at 2:57
  • It appears that I need to do rows = zip( load_data(...) ) instead of salaries, descriptions, titles = zip(loadData(...)). Is this correct? Commented May 2, 2013 at 3:09
  • @Nyxynyx, typo. Should be salaries, descriptions, titles = zip(*loadData(...)). zip(*...) does the opposite of zip(...) Commented May 2, 2013 at 3:11

titles = [jobs[i]['title'] for i, v in enumerate(jobs)] can (should?) be rewritten:

titles = [j['title'] for j in jobs] because we just want each element directly, not its index (More details)

Thus, the whole code would be :

def splitData(jobs):
    salaries = [j['salaryNormalized'] for j in jobs]
    descriptions = [j['description'] + j['normalizedLocation'] + j['category'] for j in jobs]
    titles = [j['title'] for j in jobs]
    return salaries, descriptions, titles

Not quite sure how much it helps from a performance point of view.

Edit: Otherwise, another option might be to write a generator which yields j['salaryNormalized'], j['description'] + j['normalizedLocation'] + j['category'], j['title'] as you need it. It depends how you use your function, really.
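A sketch of that generator (iter_jobs is a hypothetical name; it assumes jobs is the list of dicts that loadData returns):

```python
def iter_jobs(jobs):
    """Lazily yield (salary, combined text, title) for each job dict, one at a time."""
    for j in jobs:
        yield (j['salaryNormalized'],
               j['description'] + j['normalizedLocation'] + j['category'],
               j['title'])

# Either consume it lazily in a loop, or materialize all three lists at once:
# salaries, descriptions, titles = zip(*iter_jobs(jobs))
```

The laziness only pays off if the consumer can also work one item at a time; zip(*...) still builds everything in memory.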

answered May 1, 2013 at 21:23
  • I have updated the question with more code, that may help! How can I write the generator? Commented May 1, 2013 at 22:30

It seems that you are looping through the entire job data set 3 times (once each for salaries, descriptions and titles). You will be able to speed this up three-fold, if you extract all the info in one pass:

def split_data(jobs):
    # Note: appends to the salaries/descriptions/titles lists defined in the enclosing scope
    for job, info in jobs.items():
        salaries.append(info['salaryNormalized'])
        descriptions.append([info['description'],
                             info['normalizedLocation'],
                             info['category']])
        titles.append(info['title'])

EDIT: Added loadData(); slight tweak to return a dictionary of dictionaries, instead of a list of dictionaries:

def load_data(filepath):
    reader = csv.reader(open(filepath))
    jobs = {}
    categories = ["id", "title", "description", "rawLocation", "normalizedLocation",
                  "contractType", "contractTime", "company", "category",
                  "salaryRaw", "salaryNormalized", "sourceName"]
    for i, row in enumerate(reader):
        if i != 0:
            jobs[i] = dict(zip(categories, row))
    return jobs

An example:

jobs = {0: {'salaryNormalized': 10000, 'description': 'myKob',
            'normalizedLocation': 'Hawaii', 'category': 'categ1',
            'title': 'tourist'},
        1: {'salaryNormalized': 15000, 'description': 'myKob',
            'normalizedLocation': 'Hawaii', 'category': 'categ2',
            'title': 'tourist'},
        2: {'salaryNormalized': 50000, 'description': 'myKob',
            'normalizedLocation': 'Hawaii', 'category': 'categ10',
            'title': 'resort_manager'}}
salaries, descriptions, titles = [], [], []
split_data(jobs)
print(salaries)
--> [10000, 15000, 50000]
print(descriptions)
--> [['myKob', 'Hawaii', 'categ1'], ['myKob', 'Hawaii', 'categ2'], ['myKob', 'Hawaii', 'categ10']]
print(titles)
--> ['tourist', 'tourist', 'resort_manager']

Hope this helps!

answered May 1, 2013 at 22:22
  • What makes you think that will speed it up three-fold? Commented May 1, 2013 at 22:34
  • I am not suggesting that it will be exact, there will be some margin in that. Deductively, the original runs a loop through the data (~500MB) 3 times; that's expensive and dominates the time taken to run. This one only loops once. We have reduced the dominant factor by 3. My algorithm theory isn't perfect though - have I missed the mark on this one? Appreciate your thoughts Commented May 1, 2013 at 22:40
  • The problem is that your new loop does three times as much per iteration. It appends to a list three times, instead of once. That means each iteration will take 3x the cost, and your savings cancel out completely. Margins make it difficult to say which version will end up being faster. Commented May 1, 2013 at 22:48
  • That is a good point. However, the original uses list comprehensions to build each list one item at a time (though more efficiently than append) (a total of 3 'additions to list' for each entry in 500 MB). This version appends 3 times per iteration (also therefore 3 appends for each entry in 500 MB). To me, that is the same number of append operations. Also, I kinda feel that the cost of appending is nominal compared to the cost of looping over 500 MB. What do you think? Commented May 1, 2013 at 22:57
  • Looping is actually fairly cheap. It's the appending which is expensive. That's why I'd expect both versions to have very similar speeds. Combining the loops reduces the looping overhead. Using a comprehension will be faster to append. Who wins out? I don't know. But I'm pretty sure there isn't a 3X difference. Commented May 1, 2013 at 23:29
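The disagreement above is easy to settle empirically. A rough timeit sketch on synthetic dicts (the data size and field values are arbitrary, and absolute timings are machine-dependent):

```python
import timeit

# Synthetic stand-in for the parsed job rows
jobs = [{'salaryNormalized': '10000', 'description': 'd', 'normalizedLocation': 'l',
         'category': 'c', 'title': 't'} for _ in range(100000)]

def three_passes():
    # Original style: one comprehension (and one pass over jobs) per output list
    salaries = [j['salaryNormalized'] for j in jobs]
    descriptions = [j['description'] + j['normalizedLocation'] + j['category'] for j in jobs]
    titles = [j['title'] for j in jobs]
    return salaries, descriptions, titles

def one_pass():
    # Proposed style: a single pass with three explicit appends per row
    salaries, descriptions, titles = [], [], []
    for j in jobs:
        salaries.append(j['salaryNormalized'])
        descriptions.append(j['description'] + j['normalizedLocation'] + j['category'])
        titles.append(j['title'])
    return salaries, descriptions, titles

print('three passes:', timeit.timeit(three_passes, number=10))
print('one pass:    ', timeit.timeit(one_pass, number=10))
```

Both variants do the same total number of list additions, so the gap (in either direction) tends to be far smaller than 3x; run it on the real data to be sure.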
