I have written a function in Python with three loops which is time-consuming. Is it possible to do the same operation in less time some other way? Here are my code and sample data that you can run at your end.
import numpy as np
import pandas as pd

# Expand records to minute level
def expand_b34(data):
    final_list = []
    df_columns = list(data.columns.tolist())
    for i in range(len(data)):
        # Repeat row i DURATION times
        row_list = np.repeat(data.values[i][np.newaxis, :], data['DURATION'].loc[i], axis=0).tolist()
        # Minute values covered by row i
        start_end_list = range(data['START'].loc[i], data['END_MINUTE'].loc[i])
        for j in range(len(row_list)):
            row_list[j].extend([start_end_list[j]])
        for k in row_list:
            final_list.append(k)
    data = pd.DataFrame(final_list)
    data.columns = df_columns + ['START_MINUTE']
    data = data.drop(['START', 'END_MINUTE'], axis=1)
    return data

df = expand_b34(test)
Sample data:

date   Id       LD  GOOD_AP_ORIG  ap_station  JULIAN_DAY  START  DURATION  END_MINUTE  PLDS  PL  PLT  PLAY
16080  4012007   1             G        5000       16081      0        60          60     0   0    0
16080  4012007   1             G        5000       16081     60        60         120     0   0    0
16080  4012007   1             G        5000       16081    120        60         180     0   0    0
16080  4012007   1             G        5000       16080    180        60         240     0   0    0
16080  4012007   1             G        5000       16080    240       120         360     0   0    0
Attached is the sample data for your reference
Code workflow:
- calculate the difference between START and END_MINUTE
- expand the observations by that difference

So if the difference is 10, ten more lines of observations will be generated for that observation, identical except for the START_MINUTE variable. START_MINUTE starts at 0 (because for that particular observation START == 0) and ends at 9.

For a single observation the loop creates 10 duplicate observations that differ only in the start-minute variable, which varies from 0 to 9.

Can someone help me do the same observation expansion by optimizing my code?
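To make the expected behaviour concrete, here is a minimal sketch on a made-up one-row input (the columns are trimmed for illustration):

import pandas as pd

# Hypothetical single-row input: START=0, END_MINUTE=3, so the difference is 3.
row = pd.DataFrame({'Id': [4012007], 'START': [0], 'DURATION': [3], 'END_MINUTE': [3]})
print(expand_b34(row))
#         Id  DURATION  START_MINUTE
# 0  4012007         3             0
# 1  4012007         3             1
# 2  4012007         3             2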
1 Answer
The outer loop isn't crazy, especially if you're memory-sensitive. The fully vectorised way of performing a Cartesian product requires no loops, and I expect it to be faster, but it will briefly take up more memory prior to the time-range filter. The inner loops definitely need to go away one way or the other.

Calling that number a Julian day is a lie. Based on its value, and the fact that you posted in 2016, it looks more like days after the Unix epoch.
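A quick check (a one-line sketch) supports that reading:

import pandas as pd

# 16080 days after 1970-01-01 lands in early 2014, which fits a 2016 post.
print(pd.to_datetime(16080, unit='D', origin='unix'))  # 2014-01-10 00:00:00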
Try:
import io
import pandas as pd

with io.StringIO(
    '''date Id LD GOOD_AP_ORIG ap_station JULIAN_DAY START DURATION END_MINUTE PLDS PL PLT PLAY
16080 4012007 1 G 5000 16081 0 60 60 0 0 0
16080 4012007 1 G 5000 16081 60 60 120 0 0 0
16080 4012007 1 G 5000 16081 120 60 180 0 0 0
16080 4012007 1 G 5000 16080 180 60 240 0 0 0
16080 4012007 1 G 5000 16080 240 120 360 0 0 0
''') as f:
    # delim_whitespace is deprecated, so use a regex separator instead
    df = pd.read_csv(f, sep=' +', engine='python')

# Not actually a Julian date; it's most likely 'days after epoch'
df['date'] = pd.to_datetime(df['date'], unit='D', origin='unix')
df['DURATION'] = pd.to_timedelta(df['DURATION'], unit='m')
df['START'] = pd.to_timedelta(df['START'], unit='m')
df['END_MINUTE'] = df['START'] + df['DURATION']

# One timestamp per minute covered by any record
minutes_idx = pd.timedelta_range(
    start=df['START'].min(), end=df['END_MINUTE'].max(),
    freq='min', closed='left', name='time',
)
# Cartesian product of minutes x records, then keep the in-range pairs
merged = pd.merge(
    left=minutes_idx.to_series(), right=df, how='cross',
)
merged = merged[
    (merged['START'] <= merged['time']) & (merged['time'] < merged['END_MINUTE'])
]
print(df.head(3).T)
                              0                    1                    2
date        2014-01-10 00:00:00  2014-01-10 00:00:00  2014-01-10 00:00:00
Id                      4012007              4012007              4012007
LD                            1                    1                    1
GOOD_AP_ORIG                  G                    G                    G
ap_station                 5000                 5000                 5000
JULIAN_DAY                16081                16081                16081
START           0 days 00:00:00      0 days 01:00:00      0 days 02:00:00
DURATION        0 days 01:00:00      0 days 01:00:00      0 days 01:00:00
END_MINUTE      0 days 01:00:00      0 days 02:00:00      0 days 03:00:00
PLDS                          0                    0                    0
PL                            0                    0                    0
PLT                           0                    0                    0
PLAY                        NaN                  NaN                  NaN
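If the cross join's temporary memory spike is a concern, a repeat-based expansion also works. This is a sketch of an alternative, not part of the answer above; it operates on the raw integer minute columns (before the timedelta conversions) and assumes a unique index:

import pandas as pd

def expand_repeat(df: pd.DataFrame) -> pd.DataFrame:
    # Repeat each row DURATION times in a single vectorised step.
    out = df.loc[df.index.repeat(df['DURATION'])].copy()
    # 0..DURATION-1 within each original row (relies on unique index labels).
    offsets = out.groupby(level=0).cumcount()
    out['START_MINUTE'] = out['START'].to_numpy() + offsets.to_numpy()
    return out.drop(columns=['START', 'END_MINUTE']).reset_index(drop=True)

This never materialises the full minutes-times-records product, so peak memory is proportional to the output size.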
Comments:

- Try pypy instead of CPython; pypy excels at repeated calls because of the JIT (the only thing that might give it trouble is Pandas). See my answer here: codereview.stackexchange.com/a/144196/82612, a ~10x performance increase just from switching from CPython to pypy.
- I replaced the for loop with a couple of list comprehensions. While it may look more elegant than yours, your function is 5 times faster! However, your START_MINUTE is a running count of all records, not just for each corresponding row of the original df.
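For anyone who wants to reproduce that timing comparison, a minimal harness could look like the sketch below; test is the asker's DataFrame, expand_repeat is the sketch above, and the numbers will vary by machine and data size:

import timeit

n = 10
t_loop = timeit.timeit(lambda: expand_b34(test), number=n) / n
t_vec = timeit.timeit(lambda: expand_repeat(test), number=n) / n
print(f'loop: {t_loop:.4f} s/call, vectorised: {t_vec:.4f} s/call')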