I have a pandas DataFrame and I want to calculate some features based on short_window, long_window and bins values. More specifically, for each row I want to calculate some features from a sliding window that moves one row forward at a time: df_long = df.loc[row:long_window + row]. In the first iteration (row = 0) the frame would be df_long = df.loc[0:50 + 0] and the features would be calculated from it; for row = 1 it would be df_long = df.loc[1:50 + 1], and so on.
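Note that .loc slicing is label-based and includes the endpoint, so each long window actually holds long_window + 1 rows:

```python
import pandas as pd

df = pd.DataFrame({'value': range(100)})

# .loc includes both endpoints, unlike positional slicing with .iloc
window = df.loc[0:50]
print(len(window))  # 51
```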
from numpy.random import seed
from numpy.random import randint
import pandas as pd
from joblib import Parallel, delayed

bins = 12
short_window = 10
long_window = 50

# seed random number generator
seed(1)
price = pd.DataFrame({
    'DATE_TIME': pd.date_range('2012-01-01', '2012-02-01', freq='30min'),
    'value': randint(2, 20, 1489),
    'amount': randint(50, 200, 1489)
})

def vap(row, df, short_window, long_window, bins):
    # Long window: .loc is endpoint-inclusive, so this holds long_window + 1 rows
    df_long = df.loc[row:long_window + row]
    # Short window: the last short_window rows of the long window
    df_short = df_long.tail(short_window)
    # Bin edges computed over the long window
    binning = pd.cut(df_long['value'], bins, retbins=True)[1]
    # Sum of 'amount' per bin over the short window
    group_months = pd.DataFrame(
        df_short['amount'].groupby(pd.cut(df_short['value'], binning)).sum())
    return group_months['amount'].tolist(), df.loc[long_window + row + 1, 'DATE_TIME']

def feature_extraction(data, short_window, long_window, bins):
    # VAP feature extraction
    ls = [f"feature{row + 1}" for row in range(bins)]
    amount, date = zip(*Parallel(n_jobs=4)(
        delayed(vap)(i, data, short_window, long_window, bins)
        for i in range(0, data.shape[0] - long_window - 1)))
    temp = pd.DataFrame(date, columns=['DATE_TIME'])
    temp[ls] = pd.DataFrame(amount, index=temp.index)
    data = data.merge(temp, on='DATE_TIME', how='outer')
    return data

df = feature_extraction(price, short_window, long_window, bins)
I tried to run it in parallel to save time, but due to the size of my data it still takes a long time to finish.
Is there any way to change this iterative process (df_long = df.loc[row:long_window + row]) to reduce the computational cost? I was wondering whether pandas.rolling could help, but I am not sure how to apply it in this case.
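For reference, a minimal sketch of a single window computed on the raw NumPy arrays instead of repeated .loc slicing (assuming the same window layout as vap; np.histogram with the long-window edges stands in for the pd.cut/groupby pair, so edge inclusion can differ slightly at bin boundaries):

```python
import numpy as np

def vap_window(values, amounts, row, short_window, long_window, bins):
    # Long window on plain arrays (endpoint-inclusive, matching .loc)
    long_vals = values[row:row + long_window + 1]
    lo, hi = long_vals.min(), long_vals.max()
    # Equal-width edges as pd.cut would build them, with the left edge
    # nudged down by 0.1% of the range so the minimum falls inside a bin
    edges = np.linspace(lo, hi, bins + 1)
    edges[0] = lo - (hi - lo) * 0.001
    # Short window = the tail of the long window
    start = row + long_window + 1 - short_window
    stop = row + long_window + 1
    # Weighted histogram = per-bin sum of 'amount' over the short window
    sums, _ = np.histogram(values[start:stop], bins=edges,
                           weights=amounts[start:stop])
    return sums
```

Looping this over the arrays avoids the per-window DataFrame construction and the groupby overhead entirely.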
Any help would be much appreciated! Thank you
- Comment (AMC, Dec 14, 2019): This might be of use: pandas.pydata.org/pandas-docs/stable/reference/api/… . Is this program runnable? Can you explain what it's meant to do?
- Comment (Nestoras Chalkidis, Dec 14, 2019): Yes, it is runnable. For each window of a specific length it calculates some features, and this is done for each row. I know about pandas rolling, but I don't see how to apply it to this example. Thanks!
1 Answer
Just some stylistic suggestions
Constants
Constants in your program should be UPPERCASE. (PEP 8)
bins -> BINS
short_window -> SHORT_WINDOW
long_window -> LONG_WINDOW
price -> PRICE
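Applied to the setup at the top of the script, that would look like:

```python
from numpy.random import seed, randint
import pandas as pd

# Module-level constants, UPPERCASE per PEP 8
BINS = 12
SHORT_WINDOW = 10
LONG_WINDOW = 50

seed(1)  # seed random number generator
PRICE = pd.DataFrame({
    'DATE_TIME': pd.date_range('2012-01-01', '2012-02-01', freq='30min'),
    'value': randint(2, 20, 1489),
    'amount': randint(50, 200, 1489),
})
```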
Docstrings
You can add docstrings to your functions to allow more description about the function and about the parameters it accepts and the value(s) it returns, if any. (PEP 8)
def vap(row, df, short_window, long_window, bins):
    """
    Description Here

    :param row: Description Here
    :param df: Description Here
    :param short_window: Description Here
    :param long_window: Description Here
    :param bins: Description Here
    :return: Description Here
    """
Type Hints
You can add type hints to your functions to show what types of parameters are accepted, and what types are returned.
You can also use typing's NewType to create custom types to return.
from typing import List, NewType, Tuple

PandasTimeStamp = NewType("PandasTimeStamp", pd.Timestamp)

def vap(row: int, df: pd.DataFrame, short_window: int,
        long_window: int, bins: int) -> Tuple[List[int], PandasTimeStamp]:
    ...
- Comment (Nestoras Chalkidis, Dec 17, 2019): Thanks, but this suggestion doesn't solve my problem.