I do not know how to do this without four nested for loops.
I'd like to apply a function to every possible combination of subsets for hour and day, return that value, and then pivot the data frame into a square matrix. However, these for loops seem excessive, so I'm looking for a more efficient way to do this. The data I have is fairly large, so any gain in speed would be beneficial. I took a stab at list comprehensions, but that seemed excessive too.
The following code runs fine with small groups (days=1:2 and hours=1:2), but with large groups, say years of data, performance suffers.
Note: I'm requesting help for any custom function, not just the cross-product solution suggested in Performant cartesian product (CROSS JOIN) with pandas.
Sample data
import pandas as pd
import numpy as np
dat = pd.DataFrame({'day': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1, 10: 2, 11: 2, 12: 2, 13: 2, 14: 2, 15: 2, 16: 2, 17: 2, 18: 2, 19: 2}, 'hour': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 2, 6: 2, 7: 2, 8: 2, 9: 2, 10: 1, 11: 1, 12: 1, 13: 1, 14: 1, 15: 2, 16: 2, 17: 2, 18: 2, 19: 2}, 'distance': {0: 1.2898851269657656, 1: 0.0, 2: 0.8371526423804061, 3: 0.8703856587273138, 4: 0.6257425922449789, 5: 0.0, 6: 0.0, 7: 0.0, 8: 1.2895328696587023, 9: 0.0, 10: 0.6875527848294374, 11: 0.0, 12: 0.0, 13: 0.9009031833559706, 14: 0.0, 15: 1.1040652963428623, 16: 0.0, 17: 0.0, 18: 0.0, 19: 0.0}})
Code
def custom_fn(x, y):
    x = pd.Series(x)
    y = pd.Series(y)
    x = x**2
    y = np.sqrt(y)
    return x.sum() - y.sum()
# Empty data.frame to append to
dmat = pd.DataFrame()
# For i, j = hour; k, l = day
for i in range(1, 3):
    for j in range(1, 3):
        for k in range(1, 3):
            for l in range(1, 3):
                x = dat[(dat['hour'] == i) & (dat['day'] == k)].distance
                y = dat[(dat['hour'] == j) & (dat['day'] == l)].distance
                # Calculate difference
                jds = custom_fn(x, y)
                # Build data frame and append
                outdat = pd.DataFrame({'day_hour_a': f"{k}_{i}", 'day_hour_b': f"{l}_{j}", 'jds': [round(jds, 4)]})
                dmat = dmat.append(outdat, ignore_index=True)
# Pivot data to get matrix
distMatrix = dmat.pivot(index='day_hour_a', columns='day_hour_b', values='jds')
Output
> print(distMatrix)
day_hour_b 1_1 1_2 2_1 2_2
day_hour_a
1_1 -0.2609 2.3782 1.7354 2.4630
1_2 -2.1118 0.5273 -0.1155 0.6121
2_1 -2.4903 0.1488 -0.4940 0.2336
2_2 -2.5557 0.0834 -0.5594 0.1682
2 Answers
I'm afraid that I'm not familiar with pandas, so there are definitely some things that I'm missing. However, I've done a bit of poking at this code with a profiler, and I have two suggestions that I expect would be helpful.
In my timings, over half of the runtime (3.5s out of 6.5s for 100 repetitions) in your example was spent on these two lines:
x = dat[(dat['hour'] == i) & (dat['day'] == k)].distance
y = dat[(dat['hour'] == j) & (dat['day'] == l)].distance
If I understand the code correctly, (dat['hour'] == i) passes over the whole dataset searching for indexes with the specified hour. Especially given that this is in the middle of a very hot loop, that seems like prime work to avoid! Consider changing the data structure to allow for quicker lookups: perhaps have a dictionary that maps your chosen hour to a list or set of indexes, as in the sketch below.
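For illustration, a minimal sketch of that idea, assuming the dat frame from the question (the dictionary names are my own, and I use boolean masks rather than sets of indexes):
# Illustrative sketch: build the masks once, before the loops,
# instead of re-scanning the whole frame on every iteration.
hour_masks = {h: dat['hour'] == h for h in range(1, 3)}
day_masks = {d: dat['day'] == d for d in range(1, 3)}
# Inside the loops, each subset is then a dictionary lookup plus one AND:
for i in range(1, 3):
    for k in range(1, 3):
        x = dat.loc[hour_masks[i] & day_masks[k], 'distance']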
Second, let's take a look at this custom_fn:
def custom_fn(x, y):
    x = pd.Series(x)
    y = pd.Series(y)
    x = x**2
    y = np.sqrt(y)
    return x.sum() - y.sum()
Now, I would normally not pay much attention to this function, because according to the profiler it used up only 8% of the total runtime. However, I did notice that it is almost completely separable. The last line uses data derived from both x and y, but until then you just use x for x things and y for y things. That suggests that there is considerable opportunity to cache the relevant x.sum() and y.sum() components, with the relevant calculations done outside of so deeply nested a loop.
For reference, the following code is my initial go at this caching approach. There is still plenty of opportunity to improve, including changing the data structure as mentioned, but it is already significantly faster: total runtime has come down from 6.5 seconds to 3.5 seconds, of which 2 seconds is packing the dmat table.
import pandas as pd
import numpy as np
dat = pd.DataFrame({'day': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1, 10: 2, 11: 2, 12: 2, 13: 2, 14: 2, 15: 2, 16: 2, 17: 2, 18: 2, 19: 2}, 'hour': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 2, 6: 2, 7: 2, 8: 2, 9: 2, 10: 1, 11: 1, 12: 1, 13: 1, 14: 1, 15: 2, 16: 2, 17: 2, 18: 2, 19: 2}, 'distance': {0: 1.2898851269657656, 1: 0.0, 2: 0.8371526423804061, 3: 0.8703856587273138, 4: 0.6257425922449789, 5: 0.0, 6: 0.0, 7: 0.0, 8: 1.2895328696587023, 9: 0.0, 10: 0.6875527848294374, 11: 0.0, 12: 0.0, 13: 0.9009031833559706, 14: 0.0, 15: 1.1040652963428623, 16: 0.0, 17: 0.0, 18: 0.0, 19: 0.0}})
def x_sum(x):
    x = pd.Series(x)
    x = x**2
    return x.sum()

def y_sum(y):
    y = pd.Series(y)
    y = np.sqrt(y)
    return y.sum()

def custom_fn(x_s, y_s):
    return x_s - y_s

def get_hour(i):
    return (dat['hour'] == i)

def get_day(k):
    return (dat['day'] == k)

def get_day_hour(hour, day):
    x = dat[hour & day].distance
    return x

# Empty data.frame to append to
dmat = pd.DataFrame()
day_indices = {k: get_day(k) for k in range(1, 3)}
hour_indices = {i: get_hour(i) for i in range(1, 3)}
x_sum_indices = {(i, j): x_sum(get_day_hour(hour_indices[i], day_indices[j])) for i in range(1, 3) for j in range(1, 3)}
y_sum_indices = {(i, j): y_sum(get_day_hour(hour_indices[i], day_indices[j])) for i in range(1, 3) for j in range(1, 3)}
# For i, j = hour; k, l = day
for i in range(1, 3):
    for j in range(1, 3):
        for k in range(1, 3):
            for l in range(1, 3):
                x_s = x_sum_indices[(i, k)]
                y_s = y_sum_indices[(j, l)]
                # Calculate difference
                jds = custom_fn(x_s, y_s)
                # Build data frame and append
                outdat = pd.DataFrame({'day_hour_a': f"{k}_{i}", 'day_hour_b': f"{l}_{j}", 'jds': [round(jds, 4)]})
                dmat = dmat.append(outdat, ignore_index=True)
# Pivot data to get matrix
distMatrix = dmat.pivot(index='day_hour_a', columns='day_hour_b', values='jds')
print(distMatrix)
If you had a different custom_fn that was not so easily separable, you could still benefit from caching the inputs to the function. E.g.
x_indices = { (i, j): get_day_hour(hour_indices[i], day_indices[j]) for i in range(1, 3) for j in range(1, 3)}
y_indices = x_indices
...
jds = custom_fn(x_indices[(i, k)], y_indices[(j, l)])
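Spelled out, a sketch of that non-separable variant might look like the following, assuming the dat frame and the helper definitions above, with custom_fn being the original two-argument version from the question. Collecting rows in a list and building dmat once at the end also avoids the repeated append calls:
# Sketch: cache each subset once; keys are (hour, day)
x_indices = {(i, j): get_day_hour(hour_indices[i], day_indices[j]) for i in range(1, 3) for j in range(1, 3)}
rows = []
for i in range(1, 3):
    for j in range(1, 3):
        for k in range(1, 3):
            for l in range(1, 3):
                # Reuse the cached subsets instead of re-filtering dat
                jds = custom_fn(x_indices[(i, k)], x_indices[(j, l)])
                rows.append({'day_hour_a': f"{k}_{i}", 'day_hour_b': f"{l}_{j}", 'jds': round(jds, 4)})
dmat = pd.DataFrame(rows)
distMatrix = dmat.pivot(index='day_hour_a', columns='day_hour_b', values='jds')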
With a little rework I was able to increase speed 4x.
Initial Code:
import time

start = time.time()
# Empty data.frame to append to
dmat = pd.DataFrame()
# For i, j = hour; k, l = day
for i in range(1, 3):
    for j in range(1, 3):
        for k in range(1, 3):
            for l in range(1, 3):
                x = dat[(dat['hour'] == i) & (dat['day'] == k)].distance
                y = dat[(dat['hour'] == j) & (dat['day'] == l)].distance
                # Calculate difference
                jds = custom_fn(x, y)
                # Build data frame and append
                outdat = pd.DataFrame({'day_hour_a': f"{k}_{i}", 'day_hour_b': f"{l}_{j}", 'jds': [round(jds, 4)]})
                dmat = dmat.append(outdat, ignore_index=True)
# Pivot data to get matrix
distMatrix = dmat.pivot(index='day_hour_a', columns='day_hour_b', values='jds')
end = time.time()
print(end - start)
Time 1:
> 0.07694768905639648
Reworked code:
start = time.time()
x = []
g = dat.groupby(['day', 'hour'])['distance']
for k1, g1 in g:
    for k2, g2 in g:
        x += [(k1, k2, custom_fn(g1, g2))]
x = pd.DataFrame(x).pivot(index=0, columns=1, values=2)
print(x)
end = time.time()
print(end - start)
Time 2:
> 0.022540807723999023
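Because the custom_fn in the question reduces each group to scalars before combining them, the pairwise loop can be dropped entirely for this particular function. A sketch of that fully vectorized variant, assuming the dat frame from the question (x_part and y_part are my own names; the result matches distMatrix up to rounding):
import numpy as np
import pandas as pd

g = dat.groupby(['day', 'hour'])['distance']
x_part = g.apply(lambda s: (s ** 2).sum())    # sum of squares per (day, hour) group
y_part = g.apply(lambda s: np.sqrt(s).sum())  # sum of square roots per group

labels = [f"{d}_{h}" for d, h in x_part.index]
# Broadcasting a column vector against a row vector builds the whole
# matrix at once: cell (a, b) holds x_part[a] - y_part[b].
mat = x_part.to_numpy()[:, None] - y_part.to_numpy()[None, :]
distMatrix = pd.DataFrame(mat, index=labels, columns=labels).round(4)
print(distMatrix)
This only applies when the function separates per group; for a genuinely pairwise custom_fn, the groupby loop above is still the way to go.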
Comments

Do you expect x and y to contain a single value each time through? That is, does [day, hour] represent a unique key for the distance column?

x and y will be subset pd.Series provided to a custom function. Note that I don't need cross products because I will be using a custom function on each day and hour. Thanks!