I have written a function in Python with three loops which is time-consuming. Is it possible to do the same operation in less time some other way? Here are my code and sample data that you can run at your end.
import numpy as np
import pandas as pd

# Expand records to minute level
def expand_b34(data):
    final_list = []
    df_columns = list(data.columns.tolist())
    for i in range(len(data)):
        # Repeat row i DURATION times
        row_list = np.repeat(data.values[i][np.newaxis, :], data['DURATION'].loc[i], axis=0).tolist()
        # Minute values covered by row i
        start_end_list = range(data['START'].loc[i], data['END_MINUTE'].loc[i])
        for j in range(len(row_list)):
            row_list[j].extend([start_end_list[j]])
        for k in row_list:
            final_list.append(k)
    data = pd.DataFrame(final_list)
    data.columns = df_columns + ['START_MINUTE']
    data = data.drop(['START', 'END_MINUTE'], axis=1)
    return data

df = expand_b34(test)
Sample data:

date   Id       LD  GOOD_AP_ORIG  ap_station  JULIAN_DAY  START  DURATION  END_MINUTE  PLDS  PL  PLT  PLAY
16080  4012007   1             G        5000       16081      0        60          60     0   0    0
16080  4012007   1             G        5000       16081     60        60         120     0   0    0
16080  4012007   1             G        5000       16081    120        60         180     0   0    0
16080  4012007   1             G        5000       16080    180        60         240     0   0    0
16080  4012007   1             G        5000       16080    240       120         360     0   0    0
Attached is the sample data for your reference
Code workflow:
- calculate the difference between START and END_MINUTE
- expand the observations by that difference

So if the difference is 10, ten more lines of observations will be generated for that observation, identical except for the START_MINUTE variable. START_MINUTE starts at 0 (because for that particular observation START == 0) and ends at 9.

For a single observation the loop creates 10 duplicate observations that differ only in the start-minute variable, which varies from 0 to 9.

Can someone help me do the same observation expansion by optimizing my code?
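To make the expected behaviour concrete, here is a minimal sketch on a made-up one-row input (the columns are trimmed for illustration):

import pandas as pd

# Hypothetical single-row input: START=0, END_MINUTE=3, so the difference is 3.
row = pd.DataFrame({'Id': [4012007], 'START': [0], 'DURATION': [3], 'END_MINUTE': [3]})
print(expand_b34(row))
#         Id  DURATION  START_MINUTE
# 0  4012007         3             0
# 1  4012007         3             1
# 2  4012007         3             2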
1 Answer
The outer loop isn't crazy, especially if you're memory-sensitive. The fully vectorised way of performing a Cartesian product requires no loops, and I expect it to be faster, but it will briefly take up more memory prior to the time-range filter. The inner loops definitely need to go away one way or the other.

Calling that number a Julian day is a lie. Based on its value, and the fact that you posted in 2016, it looks more like days after the Unix epoch.
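A quick check (a one-line sketch) supports that reading:

import pandas as pd

# 16080 days after 1970-01-01 lands in early 2014, which fits a 2016 post.
print(pd.to_datetime(16080, unit='D', origin='unix'))  # 2014-01-10 00:00:00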
Try:
import io
import pandas as pd

with io.StringIO(
    '''date Id LD GOOD_AP_ORIG ap_station JULIAN_DAY START DURATION END_MINUTE PLDS PL PLT PLAY
16080 4012007 1 G 5000 16081 0 60 60 0 0 0
16080 4012007 1 G 5000 16081 60 60 120 0 0 0
16080 4012007 1 G 5000 16081 120 60 180 0 0 0
16080 4012007 1 G 5000 16080 180 60 240 0 0 0
16080 4012007 1 G 5000 16080 240 120 360 0 0 0
''') as f:
    # delim_whitespace is deprecated, so use a regex separator instead
    df = pd.read_csv(f, sep=' +', engine='python')

# Not actually a Julian date; it's most likely 'days after epoch'
df['date'] = pd.to_datetime(df['date'], unit='D', origin='unix')
df['DURATION'] = pd.to_timedelta(df['DURATION'], unit='m')
df['START'] = pd.to_timedelta(df['START'], unit='m')
df['END_MINUTE'] = df['START'] + df['DURATION']

# One timestamp per minute covered by any record
minutes_idx = pd.timedelta_range(
    start=df['START'].min(), end=df['END_MINUTE'].max(),
    freq='min', closed='left', name='time',
)
# Cartesian product of minutes x records, then keep the in-range pairs
merged = pd.merge(
    left=minutes_idx.to_series(), right=df, how='cross',
)
merged = merged[
    (merged['START'] <= merged['time']) & (merged['time'] < merged['END_MINUTE'])
]
print(df.head(3).T)
                              0                    1                    2
date        2014-01-10 00:00:00  2014-01-10 00:00:00  2014-01-10 00:00:00
Id                      4012007              4012007              4012007
LD                            1                    1                    1
GOOD_AP_ORIG                  G                    G                    G
ap_station                 5000                 5000                 5000
JULIAN_DAY                16081                16081                16081
START           0 days 00:00:00      0 days 01:00:00      0 days 02:00:00
DURATION        0 days 01:00:00      0 days 01:00:00      0 days 01:00:00
END_MINUTE      0 days 01:00:00      0 days 02:00:00      0 days 03:00:00
PLDS                          0                    0                    0
PL                            0                    0                    0
PLT                           0                    0                    0
PLAY                        NaN                  NaN                  NaN
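If the cross join's temporary memory spike is a concern, a repeat-based expansion also works. This is a sketch of an alternative, not part of the answer above; it operates on the raw integer minute columns (before the timedelta conversions) and assumes a unique index:

import pandas as pd

def expand_repeat(df: pd.DataFrame) -> pd.DataFrame:
    # Repeat each row DURATION times in a single vectorised step.
    out = df.loc[df.index.repeat(df['DURATION'])].copy()
    # 0..DURATION-1 within each original row (relies on unique index labels).
    offsets = out.groupby(level=0).cumcount()
    out['START_MINUTE'] = out['START'].to_numpy() + offsets.to_numpy()
    return out.drop(columns=['START', 'END_MINUTE']).reset_index(drop=True)

This never materialises the full minutes-times-records product, so peak memory is proportional to the output size.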
Comments:

- Try pypy instead of CPython; pypy excels at repeated calls because of the JIT (the only thing that might give it trouble is Pandas). See my answer here: codereview.stackexchange.com/a/144196/82612, a ~10x performance increase just from switching from CPython to pypy.
- I replaced the for loop with a couple of list comprehensions. While it may look more elegant than yours, your function is 5 times faster! However, your START_MINUTE is a running count of all records, not just for each corresponding row of the original df.
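For anyone who wants to reproduce that timing comparison, a minimal harness could look like the sketch below; test is the asker's DataFrame, expand_repeat is the sketch above, and the numbers will vary by machine and data size:

import timeit

n = 10
t_loop = timeit.timeit(lambda: expand_b34(test), number=n) / n
t_vec = timeit.timeit(lambda: expand_repeat(test), number=n) / n
print(f'loop: {t_loop:.4f} s/call, vectorised: {t_vec:.4f} s/call')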