I have a CSV file with 400 000 rows and the following headers:
header_names = ['LEAGUE', 'YEAR', 'DATE', 'HOME', '1', 'X', '2', 'AWAY', 'SCORE', 'SCORE_1', 'SCORE_2', 'FTR', 'FAVORITE', 'UNDER-OVER']
The aim of my function is for every row to take all the previous, filter them by items in the current row and return some statistic.
This is my script so far:
import pandas as pd
filepath = 'data.csv'
header_names = ['LEAGUE', 'YEAR', 'DATE', 'HOME', '1', 'X', '2', 'AWAY', 'SCORE', 'SCORE_1', 'SCORE_2', 'FTR', 'FAVORITE', 'UNDER-OVER'] # Add appropriate headers
df = pd.read_csv(filepath, sep=',', na_values=['', '-'], parse_dates=True, header=None, names=header_names, skiprows=1, nrows=1000)
def mid_func(x):
global mid
mid += 1
return mid
mid = -1
df.insert(0, 'MID', df.apply(mid_func, axis=1))
new_df = df.copy()
def home_1_simple_filter(x):
mid_stop = x[0] - 1
home = x[4]
odd_1 = x[5]
start = time.time()
filtered = df[(df['HOME'] == home) & (df['1'] == odd_1)].ix[:mid_stop]['FTR']
stop = time.time() - start
print round(stop*1000.,2), 'ms', home, odd_1, mid_stop
return filtered
start = time.time()
new_df['HOME_1'] = df.apply(home_1_simple_filter, axis=1)
stop = time.time() - start
print stop
The mid_func
is to help me take the previous row. The whole process takes 3 seconds for the first 1000, and 0.002 seconds on average.
1 Answer 1
Well, the code doesn't run, and you haven't shown any example input/output. Lest this be lost to obscurity, I will review what I can and make some wild guesses.
You're fully ignoring the original CSV's header names and overwriting with your own. That's almost always a bad idea; use .rename()
instead.
mid_func
needs to go away. If you needed the effect of a new integral index, then use an Index
from Pandas or an np.arange
. But the way it's used betrays a deep misunderstanding about this operation and how it should actually be done with the Pandas API: a call to .expanding()
.
Don't x[4]
, etc. That's a row Series
for which you should be using column name strings.
.ix()
is deprecated and needs to be replaced one way or the other.
I think much of this can be replaced by
df.groupby(['HOME', '1'])['FTR'].expanding()
but again, without data samples, it's impossible to say for sure.
mid_func()
? \$\endgroup\$