
I have written a function in Python with three nested loops, and it is time-consuming. Is it possible to do the same operation in less time some other way? Here are my code and sample data so you can run them at your end.

#Expand files to minute-level
import numpy as np
import pandas as pd

def expand_b34(data):
    final_list = []
    df_columns = data.columns.tolist()
    for i in range(len(data)):
        # Repeat the row once per minute of DURATION.
        row_list = np.repeat(data.values[i][np.newaxis, :],
                             data['DURATION'].loc[i], axis=0).tolist()
        start_end_list = range(data['START'].loc[i], data['END_MINUTE'].loc[i])
        # Tag each copy with its minute within [START, END_MINUTE).
        for j in range(len(row_list)):
            row_list[j].append(start_end_list[j])
        final_list.extend(row_list)
    data = pd.DataFrame(final_list)
    data.columns = df_columns + ['START_MINUTE']
    data = data.drop(['START', 'END_MINUTE'], axis=1)
    return data

df = expand_b34(test)
Sample data
date   Id       LD  GOOD_AP_ORIG  ap_station  JULIAN_DAY  START  DURATION  END_MINUTE  PLDS  PL  PLT  PLAY
16080  4012007  1   G             5000        16081       0      60        60          0     0   0
16080  4012007  1   G             5000        16081       60     60        120         0     0   0
16080  4012007  1   G             5000        16081       120    60        180         0     0   0
16080  4012007  1   G             5000        16080       180    60        240         0     0   0
16080  4012007  1   G             5000        16080       240    120       360         0     0   0

The sample data above is provided for your reference; one way to load it is sketched below.
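
To make the snippet runnable end-to-end, the sample can be loaded into a DataFrame named test like this (a minimal sketch on my part; the whitespace-separated parsing is an assumption, and PLAY parses as NaN because the sample rows carry no value for it):

import io
import pandas as pd

raw = '''date Id LD GOOD_AP_ORIG ap_station JULIAN_DAY START DURATION END_MINUTE PLDS PL PLT PLAY
16080 4012007 1 G 5000 16081 0 60 60 0 0 0
16080 4012007 1 G 5000 16081 60 60 120 0 0 0
16080 4012007 1 G 5000 16081 120 60 180 0 0 0
16080 4012007 1 G 5000 16080 180 60 240 0 0 0
16080 4012007 1 G 5000 16080 240 120 360 0 0 0
'''
# Rows have no PLAY value, so pandas pads that column with NaN.
test = pd.read_csv(io.StringIO(raw), sep=r'\s+')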

Code workflow:

  1. calculate difference between START and END_MINUTE
  2. expand the observations for the difference

So if the difference is 10, ten lines of observations will be generated, each carrying the same data except for the START_MINUTE variable. START_MINUTE starts at 0 (because for that particular observation START == 0) and ends at 9.

In other words, for a single observation the loop creates 10 near-duplicate observations that differ only in the start-minute variable, which runs from 0 to 9.

Can someone help me do the same observation expansion faster by optimizing my code?
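
For comparison, here is one loop-free way to do the same expansion (a sketch on my part, not from the original post; it assumes integer START, DURATION, and END_MINUTE columns as in the sample, and the name expand_b34_fast is hypothetical):

import pandas as pd

def expand_b34_fast(data):
    # One copy of each row per minute of DURATION.
    repeated = data.loc[data.index.repeat(data['DURATION'])].copy()
    # 0, 1, ..., DURATION-1 within each original row.
    offset = repeated.groupby(level=0).cumcount()
    repeated['START_MINUTE'] = repeated['START'].to_numpy() + offset.to_numpy()
    return repeated.drop(['START', 'END_MINUTE'], axis=1).reset_index(drop=True)

On the sample above this should match expand_b34 row for row, but without any Python-level loops.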

asked Oct 14, 2016 at 13:15
  • Optimize for what? Speed or memory? Are either giving you issues? Commented Oct 14, 2016 at 15:02
  • @Bey The description of the tag performance answers your first two questions: "Performance is a subset of Optimization: performance is the goal when you want the execution time of your program or routine to be optimal." Commented Oct 14, 2016 at 15:13
  • Have you tried using PyPy instead of CPython? PyPy excels at repeated calls because of the JIT (the only thing that might give it trouble is Pandas). See my answer here: codereview.stackexchange.com/a/144196/82612 (roughly a 10x performance increase just from switching from CPython to PyPy). Commented Oct 14, 2016 at 19:20
  • How slow? I re-worked your script and got it down to 5 lines and one for loop with a couple of list comprehensions. While it may look more elegant than yours, your function is 5 times faster! However, your START_MINUTE is a running count of all records, not just for each corresponding row of the original df. Commented Oct 15, 2016 at 1:47

1 Answer


The outer loop isn't crazy, especially if you're memory-sensitive. The fully-vectorised way of performing a Cartesian product requires no loops and I expect it to be faster, but it will briefly take up more memory before the time-range filter is applied. The inner loops definitely need to go away one way or the other.

Calling that number a Julian day is a lie. Based on its value and the fact that you posted in 2016, it looks more like days after the Unix epoch.
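
A quick sanity check of that reading (my addition, not part of the original answer):

import pandas as pd

# 16080 days after the Unix epoch lands in January 2014,
# consistent with a dataset posted in 2016.
print(pd.to_datetime(16080, unit='D', origin='unix'))  # 2014-01-10 00:00:00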

Try:

import io
import pandas as pd

with io.StringIO(
''' date Id LD GOOD_AP_ORIG ap_station JULIAN_DAY START DURATION END_MINUTE PLDS PL PLT PLAY
 16080 4012007 1 G 5000 16081 0 60 60 0 0 0
 16080 4012007 1 G 5000 16081 60 60 120 0 0 0
 16080 4012007 1 G 5000 16081 120 60 180 0 0 0
 16080 4012007 1 G 5000 16080 180 60 240 0 0 0
 16080 4012007 1 G 5000 16080 240 120 360 0 0 0
''') as f:
    # delim_whitespace is deprecated, so split on runs of spaces instead
    df = pd.read_csv(f, sep=' +', engine='python')

# Not actually a Julian date; it's most likely 'days after epoch'
df['date'] = pd.to_datetime(df['date'], unit='D', origin='unix')
df['DURATION'] = pd.to_timedelta(df['DURATION'], unit='m')
df['START'] = pd.to_timedelta(df['START'], unit='m')
df['END_MINUTE'] = df['START'] + df['DURATION']

# One entry per minute across the whole observed window
minutes_idx = pd.timedelta_range(
    start=df['START'].min(), end=df['END_MINUTE'].max(),
    freq='min', inclusive='left', name='time',
)

# Cartesian product of minutes and rows, then keep the in-range pairs
merged = pd.merge(
    left=minutes_idx.to_series(), right=df, how='cross',
)
merged = merged[
    (merged['START'] <= merged['time']) & (merged['time'] < merged['END_MINUTE'])
]

print(df.head(3).T)
                                0                    1                    2
date          2014-01-10 00:00:00  2014-01-10 00:00:00  2014-01-10 00:00:00
Id                        4012007              4012007              4012007
LD                              1                    1                    1
GOOD_AP_ORIG                    G                    G                    G
ap_station                   5000                 5000                 5000
JULIAN_DAY                  16081                16081                16081
START             0 days 00:00:00      0 days 01:00:00      0 days 02:00:00
DURATION          0 days 01:00:00      0 days 01:00:00      0 days 01:00:00
END_MINUTE        0 days 01:00:00      0 days 02:00:00      0 days 03:00:00
PLDS                            0                    0                    0
PL                              0                    0                    0
PLT                             0                    0                    0
PLAY                          NaN                  NaN                  NaN
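
To make the memory caveat from the opening paragraph concrete (my addition, not part of the original answer): on this sample the cross join briefly materialises 360 * 5 = 1800 candidate rows, of which only 360 survive the range filter.

# The cross join creates len(minutes_idx) * len(df) rows before filtering.
print(len(minutes_idx) * len(df))  # 360 * 5 = 1800 candidate rows
print(len(merged))                 # 360 rows remain after the range filter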
answered Jan 20 at 1:16
