I'm extracting 4 columns from an imported CSV file (~500MB) to be used for fitting a scikit-learn
regression model.
It seems that the function used to do the extraction is extremely slow. I just learnt Python today; any suggestions on how the function can be sped up?
Can multithreading/multiple cores be used? My system has 4 cores.
def splitData(jobs):
    salaries = [jobs[i]['salaryNormalized'] for i, v in enumerate(jobs)]
    descriptions = [jobs[i]['description'] + jobs[i]['normalizedLocation'] + jobs[i]['category'] for i, v in enumerate(jobs)]
    titles = [jobs[i]['title'] for i, v in enumerate(jobs)]
    return salaries, descriptions, titles
Full Code
def loadData(filePath):
    reader = csv.reader( open(filePath) )
    rows = []
    for i, row in enumerate(reader):
        categories = ["id", "title", "description", "rawLocation", "normalizedLocation",
                      "contractType", "contractTime", "company", "category",
                      "salaryRaw", "salaryNormalized", "sourceName"]
        # Skip header row
        if i != 0:
            rows.append( dict(zip(categories, row)) )
    return rows
def splitData(jobs):
    salaries = []
    descriptions = []
    titles = []
    for i in xrange(len(jobs)):
        salaries.append( jobs[i]['salaryNormalized'] )
        descriptions.append( jobs[i]['description'] + jobs[i]['normalizedLocation'] + jobs[i]['category'] )
        titles.append( jobs[i]['title'] )
    return salaries, descriptions, titles
def fit(salaries, descriptions, titles):
    # Vectorize
    vect = TfidfVectorizer()
    vect2 = TfidfVectorizer()
    descriptions = vect.fit_transform(descriptions)
    titles = vect2.fit_transform(titles)
    # Fit
    X = hstack((descriptions, titles))
    y = [ np.log(float(salaries[i])) for i, v in enumerate(salaries) ]
    rr = Ridge(alpha=0.035)
    rr.fit(X, y)
    return vect, vect2, rr, X, y
jobs = loadData( paths['train_data_path'] )
salaries, descriptions, titles = splitData(jobs)
vect, vect2, rr, X_train, y_train = fit(salaries, descriptions, titles)
3 Answers
def loadData(filePath):
    reader = csv.reader( open(filePath) )
    rows = []
    for i, row in enumerate(reader):
        categories = ["id", "title", "description", "rawLocation", "normalizedLocation",
                      "contractType", "contractTime", "company", "category",
                      "salaryRaw", "salaryNormalized", "sourceName"]
A list like this should really be a global variable to avoid the expense of recreating it constantly. But you'd do better not to store your data in a dictionary at all. Instead do this:
for (id, title, description, raw_location, normalized_location, contract_type,
     contract_time, company, category, salary_raw, salary_normalized, source_name) in reader:
    yield salary_normalized, ''.join((description, normalized_location, category)), title
Everything is then stored in Python local variables (fairly efficient), and yield produces the three elements you actually want. Just use
salaries, descriptions, titles = zip(*loadData(...))
to get your three lists back.
# Skip header row
if i != 0:
    rows.append( dict(zip(categories, row)) )
Rather than this, call reader.next() once before the loop to take out the header.
return rows
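Putting those suggestions together, a minimal sketch of the rewritten loader might look like the following (this assumes the CSV really has exactly these twelve columns in this order, and that paths['train_data_path'] is defined as in the question):

import csv

def load_data(file_path):
    # Generator: unpack each row into local variables and yield only the
    # three fields that the rest of the pipeline actually uses.
    with open(file_path) as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row once, instead of testing i != 0 on every row
        for (id, title, description, raw_location, normalized_location,
             contract_type, contract_time, company, category,
             salary_raw, salary_normalized, source_name) in reader:
            yield salary_normalized, ''.join((description, normalized_location, category)), title

salaries, descriptions, titles = zip(*load_data(paths['train_data_path']))

zip(*...) transposes the yielded (salary, description, title) tuples back into three columns, so the rest of the code should work unchanged (salaries, descriptions and titles will be tuples rather than lists, which both TfidfVectorizer and the salary list comprehension accept).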
- loadtxt() gives me the error ValueError: cannot set an array element with a sequence on line X = np.array(X, dtype). Currently trying to upload the CSV somewhere. – Nyxynyx, May 2, 2013 at 1:56
- Here's the CSV: http://www.mediafire.com/?4d3j8g88d6j2h0x – Nyxynyx, May 2, 2013 at 2:06
- @Nyxynyx, my bad. numpy doesn't handle quoted fields. Replaced my answer with a better one. My code processes the file in 8 seconds (does everything before the fit function). – Winston Ewert, May 2, 2013 at 2:57
- It appears that I need to do rows = zip( load_data(...) ) instead of salaries, descriptions, titles = zip(loadData(...)). Is this correct? – Nyxynyx, May 2, 2013 at 3:09
- @Nyxynyx, typo. Should be salaries, descriptions, titles = zip(*loadData(...)). zip(*...) does the opposite of zip(...) (see the short demo below). – Winston Ewert, May 2, 2013 at 3:11
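For illustration, a tiny made-up example of the zip(*...) transposition being discussed here:

rows = [('10000', 'descA', 'titleA'),
        ('20000', 'descB', 'titleB')]
salaries, descriptions, titles = zip(*rows)  # transposes rows into columns
# salaries      -> ('10000', '20000')
# descriptions  -> ('descA', 'descB')
# titles        -> ('titleA', 'titleB')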
titles = [jobs[i]['title'] for i, v in enumerate(jobs)]
can (should?) be rewritten as:
titles = [j['title'] for j in jobs]
because we just want to access each value directly rather than indexing by position.
Thus, the whole function would be:
def splitData(jobs):
    salaries = [j['salaryNormalized'] for j in jobs]
    descriptions = [j['description'] + j['normalizedLocation'] + j['category'] for j in jobs]
    titles = [j['title'] for j in jobs]
    return salaries, descriptions, titles
Not quite sure how much it helps from a performance point of view.
Edit: Otherwise, another option might be to write a generator which yields j['salaryNormalized'], j['description'] + j['normalizedLocation'] + j['category'], j['title']
as you need it. It really depends on how you use your function.
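For instance, a rough sketch of such a generator, assuming jobs is the list of dicts returned by the question's loadData():

def split_data(jobs):
    # Yield one (salary, description, title) tuple per job on demand,
    # instead of building three full lists up front.
    for j in jobs:
        yield (j['salaryNormalized'],
               j['description'] + j['normalizedLocation'] + j['category'],
               j['title'])

# If the three separate sequences are still needed (e.g. for TfidfVectorizer),
# the generator can be unzipped:
salaries, descriptions, titles = zip(*split_data(jobs))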
- I have updated the question with more code, that may help! How can I write the generator? – Nyxynyx, May 1, 2013 at 22:30
It seems that you are looping through the entire job data set 3 times (once each for salaries, descriptions and titles). You will be able to speed this up three-fold if you extract all the info in one pass:
def split_data(jobs):
    # Appends to the salaries/descriptions/titles lists initialised outside
    # the function (see the example below).
    for job, info in jobs.items():
        salaries.append(info['salaryNormalized'])
        descriptions.append([info['description'],
                             info['normalizedLocation'],
                             info['category']])
        titles.append(info['title'])
EDIT: Added load_data(); slight tweak to return a dictionary of dictionaries instead of a list of dictionaries:
def load_data(filepath):
    reader = csv.reader(open(filepath))
    jobs = {}
    for i, row in enumerate(reader):
        categories = ["id", "title", "description", "rawLocation", "normalizedLocation",
                      "contractType", "contractTime", "company", "category",
                      "salaryRaw", "salaryNormalized", "sourceName"]
        if i != 0:
            jobs[i] = dict(zip(categories, row))
    return jobs
An example:
jobs = {0 : {'salaryNormalized' : 10000, 'description' : 'myKob',
             'normalizedLocation': 'Hawaii', 'category': 'categ1',
             'title' : 'tourist'},
        1 : {'salaryNormalized' : 15000, 'description' : 'myKob',
             'normalizedLocation': 'Hawaii', 'category': 'categ2',
             'title' : 'tourist'},
        2 : {'salaryNormalized' : 50000, 'description' : 'myKob',
             'normalizedLocation': 'Hawaii', 'category': 'categ10',
             'title' : 'resort_manager'}}

salaries, descriptions, titles = [], [], []
split_data(jobs)

print(salaries)
--> [10000, 15000, 50000]
print(descriptions)
--> [['myKob', 'Hawaii', 'categ1'], ['myKob', 'Hawaii', 'categ2'], ['myKob', 'Hawaii', 'categ10']]
print(titles)
--> ['tourist', 'tourist', 'resort_manager']
Hope this helps!
- What makes you think that will speed it up three-fold? – Winston Ewert, May 1, 2013 at 22:34
- I am not suggesting that it will be exact; there will be some margin in that. Deductively, the original runs a loop through the data (~500MB) 3 times; that's expensive and dominates the time taken to run. This one only loops once. We have reduced the dominant factor by 3. My algorithm theory isn't perfect though - have I missed the mark on this one? Appreciate your thoughts. – Nick Burns, May 1, 2013 at 22:40
- The problem is that your new loop does three times as much per iteration. It appends to a list three times, instead of once. That means each iteration will take 3x the cost and your savings cancel out completely. Margins make it difficult to say which version will end up being faster. – Winston Ewert, May 1, 2013 at 22:48
- That is a good point. However, the original uses list comprehensions to build the lists one item at a time (though more efficiently than append), for a total of 3 'additions to list' for each entry in 500 MB. This version appends 3 times per iteration (also therefore 3 calls to append for each entry in 500 MB). To me, that is the same number of append operations. I also kinda feel that the cost of appending is nominal compared to the cost of looping over 500 MB. What do you think? – Nick Burns, May 1, 2013 at 22:57
- Looping is actually fairly cheap. It's the appending which is expensive. That's why I'd expect both versions to have very similar speeds. Combining the loops reduces the looping overhead. Using a comprehension makes the appending faster. Who wins out? I don't know. But I'm pretty sure there isn't a 3x difference (a rough timing sketch follows below). – Winston Ewert, May 1, 2013 at 23:29
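For what it's worth, here is a rough, hypothetical micro-benchmark of the two approaches using the standard-library timeit module. The field names and data are made up, and real numbers will depend on the machine and the actual rows:

import timeit

setup = """
jobs = [{'salaryNormalized': '10000', 'description': 'd', 'normalizedLocation': 'l',
         'category': 'c', 'title': 't'} for _ in range(100000)]
"""

three_comprehensions = """
salaries = [j['salaryNormalized'] for j in jobs]
descriptions = [j['description'] + j['normalizedLocation'] + j['category'] for j in jobs]
titles = [j['title'] for j in jobs]
"""

single_loop = """
salaries, descriptions, titles = [], [], []
for j in jobs:
    salaries.append(j['salaryNormalized'])
    descriptions.append(j['description'] + j['normalizedLocation'] + j['category'])
    titles.append(j['title'])
"""

print(timeit.timeit(three_comprehensions, setup, number=10))  # three passes, comprehensions
print(timeit.timeit(single_loop, setup, number=10))           # one pass, explicit appends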