
I do not know how to do this without four nested for loops.

I'd like to apply a function to every possible pair of (day, hour) subsets, collect the results, and then pivot the data frame into a square matrix. However, these four nested for loops seem excessive, so I'm looking for a more efficient way to do this. The data I have is fairly large, so any gain in speed would be beneficial.

I took a stab at list comprehensions, but that seemed excessive too.

The following code runs fine with small groups (days = 1:2 and hours = 1:2), but with large groups, say years of data, performance degrades badly: with D days and H hours the loop body runs (D·H)² times, and every iteration re-scans the entire frame.

Note: I'm asking for an approach that works with any custom function, not just the cross-product solution suggested here: Performant cartesian product (CROSS JOIN) with pandas.

Sample data

import pandas as pd
import numpy as np
dat = pd.DataFrame({'day': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1, 10: 2, 11: 2, 12: 2, 13: 2, 14: 2, 15: 2, 16: 2, 17: 2, 18: 2, 19: 2}, 'hour': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 2, 6: 2, 7: 2, 8: 2, 9: 2, 10: 1, 11: 1, 12: 1, 13: 1, 14: 1, 15: 2, 16: 2, 17: 2, 18: 2, 19: 2}, 'distance': {0: 1.2898851269657656, 1: 0.0, 2: 0.8371526423804061, 3: 0.8703856587273138, 4: 0.6257425922449789, 5: 0.0, 6: 0.0, 7: 0.0, 8: 1.2895328696587023, 9: 0.0, 10: 0.6875527848294374, 11: 0.0, 12: 0.0, 13: 0.9009031833559706, 14: 0.0, 15: 1.1040652963428623, 16: 0.0, 17: 0.0, 18: 0.0, 19: 0.0}})

Code

def custom_fn(x, y):
    x = pd.Series(x)
    y = pd.Series(y)
    x = x**2
    y = np.sqrt(y)
    return x.sum() - y.sum()

# Empty data frame to append to
dmat = pd.DataFrame()
# For i, j = hour; k, l = day
for i in range(1, 3):
    for j in range(1, 3):
        for k in range(1, 3):
            for l in range(1, 3):
                x = dat[(dat['hour'] == i) & (dat['day'] == k)].distance
                y = dat[(dat['hour'] == j) & (dat['day'] == l)].distance
                # Calculate difference
                jds = custom_fn(x, y)
                # Build data frame and append
                outdat = pd.DataFrame({'day_hour_a': f"{k}_{i}", 'day_hour_b': f"{l}_{j}", 'jds': [round(jds, 4)]})
                dmat = dmat.append(outdat, ignore_index=True)
# Pivot data to get matrix
distMatrix = dmat.pivot(index='day_hour_a', columns='day_hour_b', values='jds')

Output

> print(distMatrix)
day_hour_b     1_1     1_2     2_1     2_2
day_hour_a
1_1        -0.2609  2.3782  1.7354  2.4630
1_2        -2.1118  0.5273 -0.1155  0.6121
2_1        -2.4903  0.1488 -0.4940  0.2336
2_2        -2.5557  0.0834 -0.5594  0.1682
asked Mar 29, 2019 at 22:21
  • Inside your 4 loops, are you expecting x and y to contain a single value each time through? That is, does [day, hour] represent a unique key for the distance column? – Commented Mar 29, 2019 at 23:53
  • Also, your answer may be in this SO question: stackoverflow.com/questions/53699012/… – Commented Mar 30, 2019 at 0:01
  • @AustinHasting x and y will be subset pd.Series provided to a custom function. Note that I don't need cross products because I will be using a custom function on each day and hour. Thanks! – Commented Mar 30, 2019 at 1:24
  • I couldn't run this on account of a few extra characters at the end of the dat = line. – Commented Apr 3, 2019 at 22:16

2 Answers


I'm afraid I'm not familiar with pandas, so there are definitely some things I'm missing. However, I've done a bit of poking at this code with a profiler, and I have two suggestions that I expect would be helpful.

In my timings, over half of the runtime (3.5s out of 6.5s for 100 repetitions) in your example was spent on these two lines:

x = dat[(dat['hour'] == i) & (dat['day'] == k)].distance
y = dat[(dat['hour'] == j) & (dat['day'] == l)].distance

If I understand the code correctly, (dat['hour'] == i) passes over the whole dataset searching for indexes with the specified hour. Given that this sits in the middle of a very hot loop, that is prime work to avoid! Consider changing the data structure to allow for quicker lookups: perhaps a dictionary that maps your chosen hour to a list or set of indexes, as sketched below.
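
For example, here is a minimal sketch of that idea (my addition, assuming dat is defined as in the question): precompute every (day, hour) subset of distance once, so the loop does a dictionary lookup instead of re-filtering the whole frame.

# Build each (day, hour) -> distance Series once, up front
subsets = {key: grp for key, grp in dat.groupby(['day', 'hour'])['distance']}
# Inside the loops, replace the boolean filtering with:
#     x = subsets[(k, i)]   # day k, hour i
#     y = subsets[(l, j)]   # day l, hour j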


Second, let's take a look at this custom_fn:

def custom_fn(x, y):
    x = pd.Series(x)
    y = pd.Series(y)
    x = x**2
    y = np.sqrt(y)
    return x.sum() - y.sum()

Now, I would normally not pay much attention to this function, because according to the profiler it only used 8% of the total runtime. However, I did notice that it is almost completely separable: the last line combines data derived from both x and y, but until then x and y are processed entirely independently. That suggests considerable opportunity to cache the x.sum() and y.sum() components, with those calculations done outside of so deeply nested a loop.


For reference, the following code is my initial go at this caching approach. There is still plenty of opportunity to improve, including the data-structure change mentioned above, but it is already significantly faster: the total has come down from 6.5 seconds to 3.5 seconds, of which about 2 seconds is spent packing the dmat table.

import pandas as pd
import numpy as np

dat = pd.DataFrame({'day': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1, 10: 2, 11: 2, 12: 2, 13: 2, 14: 2, 15: 2, 16: 2, 17: 2, 18: 2, 19: 2}, 'hour': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 2, 6: 2, 7: 2, 8: 2, 9: 2, 10: 1, 11: 1, 12: 1, 13: 1, 14: 1, 15: 2, 16: 2, 17: 2, 18: 2, 19: 2}, 'distance': {0: 1.2898851269657656, 1: 0.0, 2: 0.8371526423804061, 3: 0.8703856587273138, 4: 0.6257425922449789, 5: 0.0, 6: 0.0, 7: 0.0, 8: 1.2895328696587023, 9: 0.0, 10: 0.6875527848294374, 11: 0.0, 12: 0.0, 13: 0.9009031833559706, 14: 0.0, 15: 1.1040652963428623, 16: 0.0, 17: 0.0, 18: 0.0, 19: 0.0}})

def x_sum(x):
    x = pd.Series(x)
    x = x**2
    return x.sum()

def y_sum(y):
    y = pd.Series(y)
    y = np.sqrt(y)
    return y.sum()

def custom_fn(x_s, y_s):
    return x_s - y_s

def get_hour(i):
    return dat['hour'] == i

def get_day(k):
    return dat['day'] == k

def get_day_hour(hour, day):
    return dat[hour & day].distance

# Empty data frame to append to
dmat = pd.DataFrame()
# Cache the boolean masks and per-subset sums so the hot loop only does lookups
day_indices = {k: get_day(k) for k in range(1, 3)}
hour_indices = {i: get_hour(i) for i in range(1, 3)}
x_sum_indices = {(i, j): x_sum(get_day_hour(hour_indices[i], day_indices[j])) for i in range(1, 3) for j in range(1, 3)}
y_sum_indices = {(i, j): y_sum(get_day_hour(hour_indices[i], day_indices[j])) for i in range(1, 3) for j in range(1, 3)}
# For i, j = hour; k, l = day
for i in range(1, 3):
    for j in range(1, 3):
        for k in range(1, 3):
            for l in range(1, 3):
                x_s = x_sum_indices[(i, k)]
                y_s = y_sum_indices[(j, l)]
                # Calculate difference
                jds = custom_fn(x_s, y_s)
                # Build data frame and append
                outdat = pd.DataFrame({'day_hour_a': f"{k}_{i}", 'day_hour_b': f"{l}_{j}", 'jds': [round(jds, 4)]})
                dmat = dmat.append(outdat, ignore_index=True)
# Pivot data to get matrix
distMatrix = dmat.pivot(index='day_hour_a', columns='day_hour_b', values='jds')
print(distMatrix)
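
Since about 2 of the remaining 3.5 seconds go into packing dmat, one further tweak (my own sketch, not something I profiled for the numbers above) is to accumulate plain records in a list and build the DataFrame once at the end, avoiding the repeated copying that dmat.append incurs:

rows = []
for i in range(1, 3):
    for j in range(1, 3):
        for k in range(1, 3):
            for l in range(1, 3):
                jds = custom_fn(x_sum_indices[(i, k)], y_sum_indices[(j, l)])
                rows.append({'day_hour_a': f"{k}_{i}", 'day_hour_b': f"{l}_{j}", 'jds': round(jds, 4)})
# One DataFrame construction instead of 16 appends
dmat = pd.DataFrame(rows)
distMatrix = dmat.pivot(index='day_hour_a', columns='day_hour_b', values='jds')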

If you had a different custom_fn that was not so easily separable, you could still benefit from caching the inputs to the function, e.g.:

x_indices = {(i, j): get_day_hour(hour_indices[i], day_indices[j]) for i in range(1, 3) for j in range(1, 3)}
y_indices = x_indices
...
jds = custom_fn(x_indices[(i, k)], y_indices[(j, l)])
answered Apr 3, 2019 at 23:33

With a little rework I was able to make this roughly 3.4x faster (timings below).

Initial Code:

import time

start = time.time()
# Empty data frame to append to
dmat = pd.DataFrame()
# For i, j = hour; k, l = day
for i in range(1, 3):
    for j in range(1, 3):
        for k in range(1, 3):
            for l in range(1, 3):
                x = dat[(dat['hour'] == i) & (dat['day'] == k)].distance
                y = dat[(dat['hour'] == j) & (dat['day'] == l)].distance
                # Calculate difference
                jds = custom_fn(x, y)
                # Build data frame and append
                outdat = pd.DataFrame({'day_hour_a': f"{k}_{i}", 'day_hour_b': f"{l}_{j}", 'jds': [round(jds, 4)]})
                dmat = dmat.append(outdat, ignore_index=True)
# Pivot data to get matrix
distMatrix = dmat.pivot(index='day_hour_a', columns='day_hour_b', values='jds')
end = time.time()
print(end - start)

Time 1:

> 0.07694768905639648

Reworked code:

start = time.time()
x = []
g = dat.groupby(['day', 'hour'])['distance']
for k1, g1 in g:
    for k2, g2 in g:
        x += [(k1, k2, custom_fn(g1, g2))]
x = pd.DataFrame(x).pivot(index=0, columns=1, values=2)
print(x)
end = time.time()
print(end - start)

Time 2:

> 0.022540807723999023
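
One cosmetic note (my addition, a sketch rather than tested code): the pivoted labels here are (day, hour) tuples instead of the "day_hour" strings from the question. If you want the matrix labeled the same way as the original output, you could format the keys before pivoting:

rows = [(f"{k1[0]}_{k1[1]}", f"{k2[0]}_{k2[1]}", round(custom_fn(g1, g2), 4))
        for k1, g1 in g for k2, g2 in g]
distMatrix = pd.DataFrame(rows, columns=['day_hour_a', 'day_hour_b', 'jds']).pivot(
    index='day_hour_a', columns='day_hour_b', values='jds')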
answered Apr 4, 2019 at 15:33
