I do not know how to do this without four nested for loops.
I'd like to apply a function to every possible combination of subsets for hour and day, return that value, and then pivot the data frame into a square matrix. However, these for loops seem excessive, so I'm looking for a more efficient way to do this. The data I have is fairly large, so any gain in speed would be beneficial. I took a stab at list comprehensions, but that seemed excessive too.
The following code runs fine with small groups (days=1:2 and hours=1:2), but with large groups, say years of data, performance suffers.
Note: I'm requesting help for any custom function, not just the cross-product solution suggested in Performant cartesian product (CROSS JOIN) with pandas.
Sample data
import pandas as pd
import numpy as np
dat = pd.DataFrame({'day': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1, 10: 2, 11: 2, 12: 2, 13: 2, 14: 2, 15: 2, 16: 2, 17: 2, 18: 2, 19: 2}, 'hour': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 2, 6: 2, 7: 2, 8: 2, 9: 2, 10: 1, 11: 1, 12: 1, 13: 1, 14: 1, 15: 2, 16: 2, 17: 2, 18: 2, 19: 2}, 'distance': {0: 1.2898851269657656, 1: 0.0, 2: 0.8371526423804061, 3: 0.8703856587273138, 4: 0.6257425922449789, 5: 0.0, 6: 0.0, 7: 0.0, 8: 1.2895328696587023, 9: 0.0, 10: 0.6875527848294374, 11: 0.0, 12: 0.0, 13: 0.9009031833559706, 14: 0.0, 15: 1.1040652963428623, 16: 0.0, 17: 0.0, 18: 0.0, 19: 0.0}})
Code
def custom_fn(x, y):
    x = pd.Series(x)
    y = pd.Series(y)
    x = x**2
    y = np.sqrt(y)
    return x.sum() - y.sum()
# Empty data.frame to append to
dmat = pd.DataFrame()
# For i, j = hour; k, l = day
for i in range(1, 3):
    for j in range(1, 3):
        for k in range(1, 3):
            for l in range(1, 3):
                x = dat[(dat['hour'] == i) & (dat['day'] == k)].distance
                y = dat[(dat['hour'] == j) & (dat['day'] == l)].distance
                # Calculate difference
                jds = custom_fn(x, y)
                # Build data frame and append
                outdat = pd.DataFrame({'day_hour_a': f"{k}_{i}", 'day_hour_b': f"{l}_{j}", 'jds': [round(jds, 4)]})
                dmat = dmat.append(outdat, ignore_index=True)
# Pivot data to get matrix
distMatrix = dmat.pivot(index='day_hour_a', columns='day_hour_b', values='jds')
Output
> print(distMatrix)
day_hour_b 1_1 1_2 2_1 2_2
day_hour_a
1_1 -0.2609 2.3782 1.7354 2.4630
1_2 -2.1118 0.5273 -0.1155 0.6121
2_1 -2.4903 0.1488 -0.4940 0.2336
2_2 -2.5557 0.0834 -0.5594 0.1682
2 Answers
I'm afraid that I'm not familiar with pandas, so there are definitely some things that I'm missing. However, I've done a bit of poking at this code with a profiler, and I have two suggestions that I expect would be helpful.
In my timings, over half of the runtime (3.5s out of 6.5s for 100 repetitions) in your example was spent on these two lines:
x = dat[(dat['hour'] == i) & (dat['day'] == k)].distance
y = dat[(dat['hour'] == j) & (dat['day'] == l)].distance
If I understand the code correctly, (dat['hour'] == i) passes over the whole dataset searching for indexes with the specified hour. Especially given that this is in the middle of a very hot loop, that seems like prime work to avoid! Consider changing the data structure to allow for quicker lookups: perhaps have a dictionary that maps your chosen hour to a list or set of indexes, as in the sketch below.
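For illustration, a minimal sketch of that idea, assuming the dat frame from the question (the dictionary names are my own, and I use boolean masks rather than sets of indexes):
# Illustrative sketch: build the masks once, before the loops,
# instead of re-scanning the whole frame on every iteration.
hour_masks = {h: dat['hour'] == h for h in range(1, 3)}
day_masks = {d: dat['day'] == d for d in range(1, 3)}
# Inside the loops, each subset is then a dictionary lookup plus one AND:
for i in range(1, 3):
    for k in range(1, 3):
        x = dat.loc[hour_masks[i] & day_masks[k], 'distance']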
Second, let's take a look at this custom_fn:
def custom_fn(x, y):
    x = pd.Series(x)
    y = pd.Series(y)
    x = x**2
    y = np.sqrt(y)
    return x.sum() - y.sum()
Now, I would normally not pay much attention to this function, because according to the profiler it used up only 8% of the total runtime. However, I did notice that it is almost completely separable. The last line uses data derived from both x and y, but until then you just use x for x things and y for y things. That suggests that there is considerable opportunity to cache the relevant x.sum() and y.sum() components, with the relevant calculations done outside of so deeply nested a loop.
For reference, the following code is my initial go at this caching approach. There is still plenty of opportunity to improve, including changing the data structure as mentioned, but it is already significantly faster: total runtime has come down from 6.5 seconds to 3.5 seconds, of which 2 seconds is packing the dmat table.
import pandas as pd
import numpy as np
dat = pd.DataFrame({'day': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1, 10: 2, 11: 2, 12: 2, 13: 2, 14: 2, 15: 2, 16: 2, 17: 2, 18: 2, 19: 2}, 'hour': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 2, 6: 2, 7: 2, 8: 2, 9: 2, 10: 1, 11: 1, 12: 1, 13: 1, 14: 1, 15: 2, 16: 2, 17: 2, 18: 2, 19: 2}, 'distance': {0: 1.2898851269657656, 1: 0.0, 2: 0.8371526423804061, 3: 0.8703856587273138, 4: 0.6257425922449789, 5: 0.0, 6: 0.0, 7: 0.0, 8: 1.2895328696587023, 9: 0.0, 10: 0.6875527848294374, 11: 0.0, 12: 0.0, 13: 0.9009031833559706, 14: 0.0, 15: 1.1040652963428623, 16: 0.0, 17: 0.0, 18: 0.0, 19: 0.0}})
def x_sum(x):
    x = pd.Series(x)
    x = x**2
    return x.sum()

def y_sum(y):
    y = pd.Series(y)
    y = np.sqrt(y)
    return y.sum()

def custom_fn(x_s, y_s):
    return x_s - y_s

def get_hour(i):
    return (dat['hour'] == i)

def get_day(k):
    return (dat['day'] == k)

def get_day_hour(hour, day):
    x = dat[hour & day].distance
    return x

# Empty data.frame to append to
dmat = pd.DataFrame()
day_indices = {k: get_day(k) for k in range(1, 3)}
hour_indices = {i: get_hour(i) for i in range(1, 3)}
x_sum_indices = {(i, j): x_sum(get_day_hour(hour_indices[i], day_indices[j])) for i in range(1, 3) for j in range(1, 3)}
y_sum_indices = {(i, j): y_sum(get_day_hour(hour_indices[i], day_indices[j])) for i in range(1, 3) for j in range(1, 3)}
# For i, j = hour; k, l = day
for i in range(1, 3):
    for j in range(1, 3):
        for k in range(1, 3):
            for l in range(1, 3):
                x_s = x_sum_indices[(i, k)]
                y_s = y_sum_indices[(j, l)]
                # Calculate difference
                jds = custom_fn(x_s, y_s)
                # Build data frame and append
                outdat = pd.DataFrame({'day_hour_a': f"{k}_{i}", 'day_hour_b': f"{l}_{j}", 'jds': [round(jds, 4)]})
                dmat = dmat.append(outdat, ignore_index=True)
# Pivot data to get matrix
distMatrix = dmat.pivot(index='day_hour_a', columns='day_hour_b', values='jds')
print(distMatrix)
If you had a different custom_fn that was not so easily separable, you could still benefit from caching the inputs to the function. E.g.
x_indices = { (i, j): get_day_hour(hour_indices[i], day_indices[j]) for i in range(1, 3) for j in range(1, 3)}
y_indices = x_indices
...
jds = custom_fn(x_indices[(i, k)], y_indices[(j, l)])
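Spelled out, a sketch of that non-separable variant might look like the following, assuming the dat frame and the helper definitions above, with custom_fn being the original two-argument version from the question. Collecting rows in a list and building dmat once at the end also avoids the repeated append calls:
# Sketch: cache each subset once; keys are (hour, day)
x_indices = {(i, j): get_day_hour(hour_indices[i], day_indices[j]) for i in range(1, 3) for j in range(1, 3)}
rows = []
for i in range(1, 3):
    for j in range(1, 3):
        for k in range(1, 3):
            for l in range(1, 3):
                # Reuse the cached subsets instead of re-filtering dat
                jds = custom_fn(x_indices[(i, k)], x_indices[(j, l)])
                rows.append({'day_hour_a': f"{k}_{i}", 'day_hour_b': f"{l}_{j}", 'jds': round(jds, 4)})
dmat = pd.DataFrame(rows)
distMatrix = dmat.pivot(index='day_hour_a', columns='day_hour_b', values='jds')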
With a little rework I was able to increase speed 4x.
Initial Code:
import time

start = time.time()
# Empty data.frame to append to
dmat = pd.DataFrame()
# For i, j = hour; k, l = day
for i in range(1, 3):
    for j in range(1, 3):
        for k in range(1, 3):
            for l in range(1, 3):
                x = dat[(dat['hour'] == i) & (dat['day'] == k)].distance
                y = dat[(dat['hour'] == j) & (dat['day'] == l)].distance
                # Calculate difference
                jds = custom_fn(x, y)
                # Build data frame and append
                outdat = pd.DataFrame({'day_hour_a': f"{k}_{i}", 'day_hour_b': f"{l}_{j}", 'jds': [round(jds, 4)]})
                dmat = dmat.append(outdat, ignore_index=True)
# Pivot data to get matrix
distMatrix = dmat.pivot(index='day_hour_a', columns='day_hour_b', values='jds')
end = time.time()
print(end - start)
Time 1:
> 0.07694768905639648
Reworked code:
start = time.time()
x = []
g = dat.groupby(['day', 'hour'])['distance']
for k1, g1 in g:
    for k2, g2 in g:
        x += [(k1, k2, custom_fn(g1, g2))]
x = pd.DataFrame(x).pivot(index=0, columns=1, values=2)
print(x)
end = time.time()
print(end - start)
Time 2:
> 0.022540807723999023
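Because the custom_fn in the question reduces each group to scalars before combining them, the pairwise loop can be dropped entirely for this particular function. A sketch of that fully vectorized variant, assuming the dat frame from the question (x_part and y_part are my own names; the result matches distMatrix up to rounding):
import numpy as np
import pandas as pd

g = dat.groupby(['day', 'hour'])['distance']
x_part = g.apply(lambda s: (s ** 2).sum())    # sum of squares per (day, hour) group
y_part = g.apply(lambda s: np.sqrt(s).sum())  # sum of square roots per group

labels = [f"{d}_{h}" for d, h in x_part.index]
# Broadcasting a column vector against a row vector builds the whole
# matrix at once: cell (a, b) holds x_part[a] - y_part[b].
mat = x_part.to_numpy()[:, None] - y_part.to_numpy()[None, :]
distMatrix = pd.DataFrame(mat, index=labels, columns=labels).round(4)
print(distMatrix)
This only applies when the function separates per group; for a genuinely pairwise custom_fn, the groupby loop above is still the way to go.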
Comments

Do you expect x and y to contain a single value each time through? That is, does [day, hour] represent a unique key for the distance column?

x and y will be subset pd.Series provided to a custom function. Note that I don't need cross products because I will be using a custom function on each day and hour. Thanks!