4
\$\begingroup\$

I have a CSV file with 400 000 rows and the following headers:

header_names = ['LEAGUE', 'YEAR', 'DATE', 'HOME', '1', 'X', '2', 'AWAY', 'SCORE', 'SCORE_1', 'SCORE_2', 'FTR', 'FAVORITE', 'UNDER-OVER'] 

The aim of my function is for every row to take all the previous, filter them by items in the current row and return some statistic.

This is my script so far:

import pandas as pd
filepath = 'data.csv'
header_names = ['LEAGUE', 'YEAR', 'DATE', 'HOME', '1', 'X', '2', 'AWAY', 'SCORE', 'SCORE_1', 'SCORE_2', 'FTR', 'FAVORITE', 'UNDER-OVER'] # Add appropriate headers
df = pd.read_csv(filepath, sep=',', na_values=['', '-'], parse_dates=True, header=None, names=header_names, skiprows=1, nrows=1000)
def mid_func(x):
 global mid
 mid += 1
 return mid
mid = -1
df.insert(0, 'MID', df.apply(mid_func, axis=1))
new_df = df.copy()
def home_1_simple_filter(x):
 mid_stop = x[0] - 1
 home = x[4]
 odd_1 = x[5]
 start = time.time()
 filtered = df[(df['HOME'] == home) & (df['1'] == odd_1)].ix[:mid_stop]['FTR']
 stop = time.time() - start
 print round(stop*1000.,2), 'ms', home, odd_1, mid_stop
 return filtered
start = time.time()
new_df['HOME_1'] = df.apply(home_1_simple_filter, axis=1)
stop = time.time() - start
print stop

The mid_func is to help me take the previous row. The whole process takes 3 seconds for the first 1000, and 0.002 seconds on average.

Jamal
35.2k13 gold badges134 silver badges238 bronze badges
asked Aug 16, 2014 at 10:25
\$\endgroup\$
5
  • \$\begingroup\$ 0.002 seconds per row doesn't seem like much. Do you have some target time in mind? Have you profiled the code to see where the time goes (I would guess that reading in the CSV will be a big chunk of it, which you can't speed up by altering your filter)? \$\endgroup\$ Commented Aug 16, 2014 at 10:41
  • \$\begingroup\$ There will be 60 filter operations like the one I mentioned, for each one of the 400 000 rows, so the overall time needed would be in hours. \$\endgroup\$ Commented Aug 16, 2014 at 10:47
  • \$\begingroup\$ Verify your indentation in mid_func()? \$\endgroup\$ Commented Aug 17, 2014 at 5:50
  • \$\begingroup\$ What do you mean about the indentation ? \$\endgroup\$ Commented Aug 23, 2014 at 22:48
  • \$\begingroup\$ @evil_inside indentation is the space before lines \$\endgroup\$ Commented Jul 22, 2015 at 19:03

1 Answer 1

2
\$\begingroup\$

Well, the code doesn't run, and you haven't shown any example input/output. Lest this be lost to obscurity, I will review what I can and make some wild guesses.

You're fully ignoring the original CSV's header names and overwriting with your own. That's almost always a bad idea; use .rename() instead.

mid_func needs to go away. If you needed the effect of a new integral index, then use an Index from Pandas or an np.arange. But the way it's used betrays a deep misunderstanding about this operation and how it should actually be done with the Pandas API: a call to .expanding().

Don't x[4], etc. That's a row Series for which you should be using column name strings.

.ix() is deprecated and needs to be replaced one way or the other.

I think much of this can be replaced by

df.groupby(['HOME', '1'])['FTR'].expanding()

but again, without data samples, it's impossible to say for sure.

answered Jan 12 at 19:25
\$\endgroup\$

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.