I have some data, which I've faked using:
from datetime import datetime, timedelta

import numpy as np
import pandas as pd


def fake_disrete_data():
    in_li = []
    sample_points = 24 * 4
    for day, bias in zip((11, 12, 13), (.5, .7, 1.)):
        day_time = datetime(2016, 6, day, 0, 0, 0)
        for x in range(int(sample_points)):
            in_li.append((day_time + timedelta(minutes=15*x),
                          int(x / 4),
                          bias))
    return pd.DataFrame(in_li, columns=("time", "mag_sig", "bias")).set_index("time")


fake_disc = fake_disrete_data()
I can pivot each column individually and then concatenate the results using:
cols = list(fake_disc.columns.values)
dfs = []
for col in cols:
    dfs.append(pd.pivot_table(fake_disc,
                              index=fake_disc.index.date,
                              columns=fake_disc.index.hour,
                              values=col,
                              aggfunc=np.mean))
all_df = pd.concat(dfs, axis=1, keys=cols)
But is there a better way to do this?
I'm trying to follow the answers seen in "Pandas pivot table for multiple columns at once" and "How to pivot multilabel table in pandas", but I'm having a difficult time translating their methods to the DateTimeIndex case.
1 Answer
Review
All in all, this code is rather clean. I would use a generator comprehension and itertools.chain in fake_disrete_data instead of the nested for-loop, but that is a matter of taste.
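A minimal sketch of what that could look like, keeping the original function name and sample values (the exact shape is just one possibility):

from datetime import datetime, timedelta
from itertools import chain

import pandas as pd


def fake_disrete_data():
    sample_points = 24 * 4
    # one generator of rows per day, chained into a single iterable
    rows = chain.from_iterable(
        (
            (datetime(2016, 6, day) + timedelta(minutes=15 * x), x // 4, bias)
            for x in range(sample_points)
        )
        for day, bias in zip((11, 12, 13), (.5, .7, 1.))
    )
    # pd.DataFrame accepts the chained generator of row tuples directly
    return pd.DataFrame(rows, columns=("time", "mag_sig", "bias")).set_index("time")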
linewraps
I prefer to wrap lines after the ( instead of after the first argument. Here I follow the same rules as black. This leads to less indentation but slightly longer code, for example:
dfs.append(
    pd.pivot_table(
        fake_disc,
        index=fake_disc.index.date,
        columns=fake_disc.index.hour,
        values=col,
        aggfunc=np.mean,
    )
)
list comprehension
Instead of appending, you can use a list comprehension:
dfs = [
    pd.pivot_table(
        fake_disc,
        index=fake_disc.index.date,
        columns=fake_disc.index.hour,
        values=col,
        aggfunc=np.mean,
    )
    for col in cols
]
Or even better, feed a dict to pd.concat, so you don't have to specify the keys argument:
dfs = {
    col: pd.pivot_table(
        fake_disc,
        index=fake_disc.index.date,
        columns=fake_disc.index.hour,
        values=col,
        aggfunc='mean',
    )
    for col in fake_disc.columns
}
pd.concat(dfs, axis=1)
I also changed np.mean to 'mean', so you don't have to import numpy specifically for this, and iterated over fake_disc.columns directly to avoid having to create the cols list.
Alternative approach
pd.pivot_table is essentially a wrapper around groupby, stack and unstack. If you want to do something more complicated, you can do those operations by hand:
fake_disc = fake_disrete_data()
fake_disc.columns = fake_disc.columns.set_names('variable')
df = fake_disc.stack().to_frame().rename(columns={0: 'value'})
df['hour'] = df.index.get_level_values('time').hour
This creates an intermediary DataFrame:
time                 variable  value  hour
2016-06-11 00:00:00  mag_sig     0.0     0
2016-06-11 00:00:00  bias        0.5     0
2016-06-11 00:15:00  mag_sig     0.0     0
2016-06-11 00:15:00  bias        0.5     0
2016-06-11 00:30:00  mag_sig     0.0     0
...
This you can group. To group the time index per day (the hour column takes care of the within-day grouping), you can use pd.Grouper:
pivot = (
    df.groupby([pd.Grouper(level="time", freq="d"), "hour", "variable"])
    .mean()
    .unstack(["variable", "hour"])
    .sort_index(axis="columns", level=["variable", "hour"])
)
           value                                               ...
variable    bias                                               ...  mag_sig
hour           0    1    2    3    4    5    6    7    8    9  ...     14    15    16    17    18    19    20    21    22    23
time                                                           ...
2016-06-11   0.5  0.5  0.5  0.5  0.5  0.5  0.5  0.5  0.5  0.5  ...   14.0  15.0  16.0  17.0  18.0  19.0  20.0  21.0  22.0  23.0
2016-06-12   0.7  0.7  0.7  0.7  0.7  0.7  0.7  0.7  0.7  0.7  ...   14.0  15.0  16.0  17.0  18.0  19.0  20.0  21.0  22.0  23.0
2016-06-13   1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  ...   14.0  15.0  16.0  17.0  18.0  19.0  20.0  21.0  22.0  23.0
performance
According to the %%timeit cell magic in JupyterLab, the first approach (with the dict and concat) takes about 23 ms and the second approach about 10 ms. Depending on your use case, this difference might be important. If it is not, pick the method that is most readable to your future self.
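The numbers above came from %%timeit in the notebook; if you want to reproduce such a comparison in a plain script, the standard timeit module works too. The sketch below is my own rough harness, with the two approaches wrapped in hypothetical helper functions, not the measurement used in the answer:

import timeit

import pandas as pd


def pivot_with_concat(frame):
    # first approach: one pivot_table per column, fed as a dict to pd.concat
    return pd.concat(
        {
            col: pd.pivot_table(
                frame,
                index=frame.index.date,
                columns=frame.index.hour,
                values=col,
                aggfunc='mean',
            )
            for col in frame.columns
        },
        axis=1,
    )


def pivot_by_hand(frame):
    # second approach: stack, groupby with pd.Grouper, unstack
    frame = frame.copy()
    frame.columns = frame.columns.set_names('variable')
    stacked = frame.stack().to_frame().rename(columns={0: 'value'})
    stacked['hour'] = stacked.index.get_level_values('time').hour
    return (
        stacked.groupby([pd.Grouper(level="time", freq="d"), "hour", "variable"])
        .mean()
        .unstack(["variable", "hour"])
        .sort_index(axis="columns", level=["variable", "hour"])
    )


# fake_disc is the frame built by fake_disrete_data() above
for fn in (pivot_with_concat, pivot_by_hand):
    best = min(timeit.repeat(lambda: fn(fake_disc), number=10, repeat=5))
    print(f"{fn.__name__}: {best / 10 * 1000:.1f} ms per call")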