I have some data, which I've faked using:
from datetime import datetime, timedelta

import numpy as np
import pandas as pd


def fake_disrete_data():
    in_li = []
    sample_points = 24 * 4
    for day, bias in zip((11, 12, 13), (.5, .7, 1.)):
        day_time = datetime(2016, 6, day, 0, 0, 0)
        for x in range(int(sample_points)):
            in_li.append((day_time + timedelta(minutes=15*x),
                          int(x / 4),
                          bias))
    return pd.DataFrame(in_li, columns=("time", "mag_sig", "bias")).set_index("time")


fake_disc = fake_disrete_data()
I can pivot each column individually and then concatenate the results using:
cols = list(fake_disc.columns.values)
dfs = []
for col in cols:
    dfs.append(pd.pivot_table(fake_disc,
                              index=fake_disc.index.date,
                              columns=fake_disc.index.hour,
                              values=col,
                              aggfunc=np.mean))
all_df = pd.concat(dfs, axis=1, keys=cols)
But is there a better way to do this?
I'm trying to follow the answers seen in "Pandas pivot table for multiple columns at once" and "How to pivot multilabel table in pandas", but I'm having a difficult time translating their methods to the DateTimeIndex case.
1 Answer
Review
All in all, this code is rather clean. I would use a generator comprehension and itertools.chain in fake_disrete_data instead of the nested for-loop, but that is a matter of taste.
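A minimal sketch of what that could look like, keeping the original function name and sample values (the exact shape is just one possibility):

from datetime import datetime, timedelta
from itertools import chain

import pandas as pd


def fake_disrete_data():
    sample_points = 24 * 4
    # one generator of rows per day, chained into a single iterable
    rows = chain.from_iterable(
        (
            (datetime(2016, 6, day) + timedelta(minutes=15 * x), x // 4, bias)
            for x in range(sample_points)
        )
        for day, bias in zip((11, 12, 13), (.5, .7, 1.))
    )
    # pd.DataFrame accepts the chained generator of row tuples directly
    return pd.DataFrame(rows, columns=("time", "mag_sig", "bias")).set_index("time")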
linewraps
I prefer to wrap lines after the ( instead of after the first argument. Here I follow the same rules as black. This leads to less indentation but slightly longer code, for example:
dfs.append(
    pd.pivot_table(
        fake_disc,
        index=fake_disc.index.date,
        columns=fake_disc.index.hour,
        values=col,
        aggfunc=np.mean,
    )
)
list comprehension
Instead of appending, you can use a list comprehension:
dfs = [
    pd.pivot_table(
        fake_disc,
        index=fake_disc.index.date,
        columns=fake_disc.index.hour,
        values=col,
        aggfunc=np.mean,
    )
    for col in cols
]
Or even better, feed a dict to pd.concat, so you don't have to specify the keys argument:
dfs = {
    col: pd.pivot_table(
        fake_disc,
        index=fake_disc.index.date,
        columns=fake_disc.index.hour,
        values=col,
        aggfunc='mean',
    )
    for col in fake_disc.columns
}
pd.concat(dfs, axis=1)
I also changed np.mean to 'mean', so you don't have to import numpy specifically for this, and iterated over fake_disc.columns directly to avoid having to create the cols list.
Alternative approach
pd.pivot_table is essentially a wrapper around groupby, stack and unstack. If you want to do something more complicated, you can do those operations by hand:
fake_disc = fake_disrete_data()
fake_disc.columns = fake_disc.columns.set_names('variable')
df = fake_disc.stack().to_frame().rename(columns={0: 'value'})
df['hour'] = df.index.get_level_values('time').hour
This creates an intermediary DataFrame:
time                 variable  value  hour
2016-06-11 00:00:00  mag_sig     0.0     0
2016-06-11 00:00:00  bias        0.5     0
2016-06-11 00:15:00  mag_sig     0.0     0
2016-06-11 00:15:00  bias        0.5     0
2016-06-11 00:30:00  mag_sig     0.0     0
...
This you can group. To group the time index per day (the hour column takes care of the within-day grouping), you can use pd.Grouper:
pivot = (
    df.groupby([pd.Grouper(level="time", freq="d"), "hour", "variable"])
    .mean()
    .unstack(["variable", "hour"])
    .sort_index(axis="columns", level=["variable", "hour"])
)
           value                                               ...
variable    bias                                               ...  mag_sig
hour           0    1    2    3    4    5    6    7    8    9  ...     14    15    16    17    18    19    20    21    22    23
time                                                           ...
2016-06-11   0.5  0.5  0.5  0.5  0.5  0.5  0.5  0.5  0.5  0.5  ...   14.0  15.0  16.0  17.0  18.0  19.0  20.0  21.0  22.0  23.0
2016-06-12   0.7  0.7  0.7  0.7  0.7  0.7  0.7  0.7  0.7  0.7  ...   14.0  15.0  16.0  17.0  18.0  19.0  20.0  21.0  22.0  23.0
2016-06-13   1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  ...   14.0  15.0  16.0  17.0  18.0  19.0  20.0  21.0  22.0  23.0
performance
According to the %%timeit cell magic in JupyterLab, the first approach (with the dict and concat) takes about 23 ms and the second approach about 10 ms. Depending on your use case, this difference might be important. If it is not, pick the method that is most readable to your future self.
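The numbers above came from %%timeit in the notebook; if you want to reproduce such a comparison in a plain script, the standard timeit module works too. The sketch below is my own rough harness, with the two approaches wrapped in hypothetical helper functions, not the measurement used in the answer:

import timeit

import pandas as pd


def pivot_with_concat(frame):
    # first approach: one pivot_table per column, fed as a dict to pd.concat
    return pd.concat(
        {
            col: pd.pivot_table(
                frame,
                index=frame.index.date,
                columns=frame.index.hour,
                values=col,
                aggfunc='mean',
            )
            for col in frame.columns
        },
        axis=1,
    )


def pivot_by_hand(frame):
    # second approach: stack, groupby with pd.Grouper, unstack
    frame = frame.copy()
    frame.columns = frame.columns.set_names('variable')
    stacked = frame.stack().to_frame().rename(columns={0: 'value'})
    stacked['hour'] = stacked.index.get_level_values('time').hour
    return (
        stacked.groupby([pd.Grouper(level="time", freq="d"), "hour", "variable"])
        .mean()
        .unstack(["variable", "hour"])
        .sort_index(axis="columns", level=["variable", "hour"])
    )


# fake_disc is the frame built by fake_disrete_data() above
for fn in (pivot_with_concat, pivot_by_hand):
    best = min(timeit.repeat(lambda: fn(fake_disc), number=10, repeat=5))
    print(f"{fn.__name__}: {best / 10 * 1000:.1f} ms per call")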