Speeding up filtering function in Pandas

Question 1

I have a CSV file with 400 000 rows and the following headers:

header_names = ['LEAGUE', 'YEAR', 'DATE', 'HOME', '1', 'X', '2', 'AWAY', 'SCORE', 'SCORE_1', 'SCORE_2', 'FTR', 'FAVORITE', 'UNDER-OVER']

For every row, my function should take all the previous rows, filter them by values from the current row, and return some statistic.

This is my script so far:

import time

import pandas as pd

filepath = 'data.csv'
header_names = ['LEAGUE', 'YEAR', 'DATE', 'HOME', '1', 'X', '2', 'AWAY', 'SCORE', 'SCORE_1', 'SCORE_2', 'FTR', 'FAVORITE', 'UNDER-OVER'] # Add appropriate headers
df = pd.read_csv(filepath, sep=',', na_values=['', '-'], parse_dates=True, header=None, names=header_names, skiprows=1, nrows=1000)

# Number the rows 0..n-1 so that "all previous rows" can be sliced by MID
def mid_func(x):
    global mid
    mid += 1
    return mid

mid = -1
df.insert(0, 'MID', df.apply(mid_func, axis=1))
new_df = df.copy()

# For one row, collect FTR from all earlier rows with the same HOME team and the same '1' odds
def home_1_simple_filter(x):
    mid_stop = x[0] - 1
    home = x[4]
    odd_1 = x[5]
    start = time.time()
    filtered = df[(df['HOME'] == home) & (df['1'] == odd_1)].ix[:mid_stop]['FTR']
    stop = time.time() - start
    print(round(stop*1000., 2), 'ms', home, odd_1, mid_stop)
    return filtered

start = time.time()
new_df['HOME_1'] = df.apply(home_1_simple_filter, axis=1)
stop = time.time() - start
print(stop)

mid_func exists so that I can select the rows preceding the current one. The whole process takes 3 seconds for the first 1000 rows, and each filter takes 0.002 seconds on average.

Comment 1

0.002 seconds per row doesn't seem like much. Do you have some target time in mind? Have you profiled the code to see where the time goes (I would guess that reading in the CSV will be a big chunk of it, which you can't speed up by altering your filter)?

Comment 2

There will be 60 filter operations like the one I mentioned, for each one of the 400 000 rows, so the overall time needed would be in hours.

Comment 3

Verify your indentation in mid_func()?

Comment 4

What do you mean by the indentation?

Comment 5

@evil_inside Indentation is the whitespace at the start of each line.

Answer

Well, the code doesn't run, and you haven't shown any example input/output. Lest this be lost to obscurity, I will review what I can and make some wild guesses.

You're fully ignoring the original CSV's header names and overwriting with your own. That's almost always a bad idea; use .rename() instead.
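As a rough sketch of the .rename() suggestion: let read_csv consume the file's own header row, then rename only what needs renaming. The lowercase header names and the two sample rows here are invented for illustration, since the real file's header row was never shown.

```python
import io

import pandas as pd

# Hypothetical stand-in for data.csv; the real header row is unknown.
csv_text = "league,home,away,ftr\nE0,Arsenal,Chelsea,H\nE0,Leeds,Everton,D\n"

# Let read_csv pick up the header row that is already in the file ...
df = pd.read_csv(io.StringIO(csv_text))

# ... then rename only the columns that need different names.
df = df.rename(columns={"league": "LEAGUE", "home": "HOME", "away": "AWAY", "ftr": "FTR"})

print(list(df.columns))
```

A mapping like this fails loudly in later code if a column is missing, whereas positional names= silently mislabels columns when the file layout changes.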

mid_func needs to go away. If you needed the effect of a new integral index, then use an Index from Pandas or an np.arange. But the way it's used betrays a deep misunderstanding about this operation and how it should actually be done with the Pandas API: a call to .expanding().
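If a running row number really is needed, something along these lines should give the same MID column without a global counter or a per-row Python call. The three-row frame is made up for the example.

```python
import numpy as np
import pandas as pd

# Made-up sample rows; only the shape matters here.
df = pd.DataFrame({"HOME": ["Arsenal", "Leeds", "Arsenal"], "FTR": ["H", "D", "A"]})

# Same effect as mid_func: a 0..n-1 counter in a new first column.
df.insert(0, "MID", np.arange(len(df)))

# Alternatively, a fresh 0..n-1 index without adding a column at all.
df_reindexed = df.reset_index(drop=True)

print(df)
```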

Don't use x[4], etc. x is a row Series, which you should index with column name strings.
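A small illustration of label-based row access inside apply, using invented sample rows and a hypothetical helper named describe:

```python
import pandas as pd

# Invented sample rows with the question's column names.
df = pd.DataFrame({"HOME": ["Arsenal", "Leeds"], "1": [1.8, 2.4], "FTR": ["H", "D"]})

def describe(row: pd.Series) -> str:
    # Label-based access keeps working if columns are inserted or reordered,
    # unlike positional access such as row[4].
    return f"{row['HOME']} @ {row['1']}"

labels = df.apply(describe, axis=1)
print(labels.tolist())
```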

.ix is deprecated (it was removed in pandas 1.0) and needs to be replaced with .loc or .iloc.
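Which replacement is right depends on whether the slice bound is meant as an index label or a position. A sketch on invented data, assuming the default integer index:

```python
import pandas as pd

# Invented sample rows; the index is the default 0..3.
df = pd.DataFrame(
    {
        "HOME": ["Arsenal", "Arsenal", "Leeds", "Arsenal"],
        "1": [1.8, 1.8, 2.4, 1.8],
        "FTR": ["H", "A", "D", "H"],
    }
)

mask = (df["HOME"] == "Arsenal") & (df["1"] == 1.8)
matches = df.loc[mask, "FTR"]  # index labels 0, 1, 3

# Label-based: the end label is included, so this keeps labels 0 and 1.
by_label = matches.loc[:1]

# Position-based: the end position is excluded, so this keeps only the first match.
by_position = matches.iloc[:1]

print(by_label.tolist(), by_position.tolist())
```

The original .ix[:mid_stop] was slicing by the MID-like integer label, so .loc is probably the closer match there.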

I think much of this can be replaced by

df.groupby(['HOME', '1'])['FTR'].expanding()

but again, without data samples, it's impossible to say for sure.
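To make the groupby/expanding idea concrete, here is a sketch under loud assumptions: the data is invented, FTR is assumed to hold 'H' for a home win, and the statistic is assumed to be the home-win rate over strictly earlier rows with the same HOME and '1' values. The HOME_WIN column is a helper introduced only for this example.

```python
import pandas as pd

# Invented sample rows; the real data was never shown.
df = pd.DataFrame(
    {
        "HOME": ["Arsenal", "Leeds", "Arsenal", "Arsenal", "Leeds"],
        "1": [1.8, 2.4, 1.8, 1.8, 2.4],
        "FTR": ["H", "D", "A", "H", "H"],
    }
)

# Assumption: FTR == 'H' marks a home win; expanding() needs numeric input.
df["HOME_WIN"] = (df["FTR"] == "H").astype(float)

# Expanding mean within each (HOME, '1') group, for all rows in one pass.
expanding_mean = (
    df.groupby(["HOME", "1"])["HOME_WIN"]
    .expanding()
    .mean()
    .reset_index(level=[0, 1], drop=True)  # drop the group keys, keep the row index
    .sort_index()
)

# Shift by one row inside each group so a row only sees strictly earlier matches.
df["HOME_1"] = expanding_mean.groupby([df["HOME"], df["1"]]).shift(1)

print(df[["HOME", "1", "FTR", "HOME_1"]])
```

The first row of each group gets NaN because it has no earlier matches. This replaces one boolean filter per row with a single grouped pass, which is where the speed-up would be expected to come from.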

Reinderien · Answer 1 · 2025-01-12 19:25:36Z
