
This question is about pivoting and padding columns, two very frequent operations in Pandas.

I have a raw dataframe. I need to reshape it from long to wide and then pad it with NaN columns based on a specific rule.

My code works, but I think it is neither efficient nor elegant, and it does not generalize, as I highlight at the end.

Before coming here with working code I read some very valuable questions on Stack Overflow, and fortunately I reached my goal, but in a very bad way, and that's why I am here.

Let's start by showing this ugly code.

import pandas as pd
import numpy as np

# Our df for this example
dict_df = {
    "Time": [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    "KPI": ["A", "B", "C", "D", "A", "B", "C", "D", "A", "B", "C", "D"],
    "SKU1": [10, 1, 0.1, 0.01, 40, 4, 0.4, 0.04, 90, 9, 0.9, 0.09],
    "SKU2": [20, 2, 0.2, 0.02, 50, 5, 0.5, 0.05, 100, 10, 1, 0.1],
    "SKU3": [30, 3, 0.3, 0.03, 60, 6, 0.6, 0.06, 110, 11, 1.1, 0.11],
    "SKU4": [70, 7, 0.7, 0.07, 80, 8, 0.8, 0.08, 120, 12, 1.2, 0.12],
}
df_test = pd.DataFrame(dict_df)
 Time KPI SKU1 SKU2 SKU3 SKU4
 0 1 A 10.00 20.00 30.00 70.00
 1 1 B 1.00 2.00 3.00 7.00
 2 1 C 0.10 0.20 0.30 0.70
 3 1 D 0.01 0.02 0.03 0.07
 4 2 A 40.00 50.00 60.00 80.00
 5 2 B 4.00 5.00 6.00 8.00
 6 2 C 0.40 0.50 0.60 0.80
 7 2 D 0.04 0.05 0.06 0.08
 8 3 A 90.00 100.00 110.00 120.00
 9 3 B 9.00 10.00 11.00 12.00
 10 3 C 0.90 1.00 1.10 1.20
 11 3 D 0.09 0.10 0.11 0.12

I want to filter specific KPIs:

filter_kpi = ["A", "B", "C"]
df_test_filtered = df_test[df_test["KPI"].isin(filter_kpi)]
# Setting Time column as index
df_test_filtered.index = df_test_filtered.Time
# Pivoting the dataframe
pivot_test = df_test_filtered.pivot(values=["SKU1", "SKU2", "SKU3", "SKU4"],
                                    columns="KPI",
                                    index="Time")
print(pivot_test)

After that I flatten the columns based on the question "Custom variable names when reshaping":

pivot_test.columns = [''.join(col) for col in pivot_test.columns]

And got this output:

 SKU1A SKU1B SKU1C SKU2A SKU2B SKU2C SKU3A SKU3B SKU3C SKU4A SKU4B SKU4C
Time 
1 10.0 1.0 0.1 20.0 2.0 0.2 30.0 3.0 0.3 70.0 7.0 0.7
2 40.0 4.0 0.4 50.0 5.0 0.5 60.0 6.0 0.6 80.0 8.0 0.8
3 90.0 9.0 0.9 100.0 10.0 1.0 110.0 11.0 1.1 120.0 12.0 1.2

And now the ugly part of my code. The goal is to pad the space between each SKU group (SKU1_, SKU2_, etc.) with 3 columns of NaN (though it could be any fixed number of columns).

The index currently has 3 rows, but it could have 4, 2, or any other length.

# Initialize an empty DataFrame
df_empty = pd.DataFrame()
# Define the range that determines where to split the original dataframe
data_range = np.arange(0, len(pivot_test.columns), 3)
# Create the index for the empty dataframe that will be used for padding
df_nan_len = np.arange(1, len(pivot_test) + 1, 1)
df_nan = pd.DataFrame(np.nan, index=df_nan_len, columns=['0', '1', '2'])
# Loop over each element in the range
for i in data_range:
    # Split the DataFrame
    filter = pivot_test.iloc[:, i:i+3]
    df_empty = pd.concat([df_empty, filter], axis=1)
    df_empty = pd.concat([df_empty, df_nan], axis=1)

And from that I get my final output:

 SKU1A SKU1B SKU1C 0 1 2 SKU2A SKU2B SKU2C 0 ... SKU3C 0 1 2 SKU4A SKU4B SKU4C 0 1 2
1 10.0 1.0 0.1 NaN NaN NaN 20.0 2.0 0.2 NaN ... 0.3 NaN NaN NaN 70.0 7.0 0.7 NaN NaN NaN
2 40.0 4.0 0.4 NaN NaN NaN 50.0 5.0 0.5 NaN ... 0.6 NaN NaN NaN 80.0 8.0 0.8 NaN NaN NaN
3 90.0 9.0 0.9 NaN NaN NaN 100.0 10.0 1.0 NaN ... 1.1 NaN NaN NaN 120.0 12.0 1.2 NaN NaN NaN

I am already aware that this code will not work properly if I had a different number of KPIs, for example 4 or 2.

The code works, but I think the loop and the way I pad the NaN columns are things I can improve.

Thank you for your time and patience!

More information and context

The data come from the IRI data source. KPI1, KPI2, KPI3 are just a way to describe some of the information that I get:

  • Price Euro/Quantity
  • Weighted distribution selling
  • Standardized sales per business Euro
  • Standardized sales per business quantity
  • Standardized sales per business piece

And so on; there are 8 of them, but I wanted to create a minimal working reproducible example, and that's why I chose only 3 variables.

Why do I need to pad? Because I inherited an Excel file where this information is padded in that way, and each month I need to attach the new data to the "master Excel data book" after saving the dataframe in *.xlsx format.

This is just to give you an example.

Why can't I just use a join operation, opening the Excel file and reading it? It's in the plan, my final goal. I had to fix some date issues first, because the data I receive are ordered with the time notation KW 14/2022, which I have to change to the final notation 21/03/2022 in order to use it as a key.
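That week-label conversion can be sketched like this (a minimal sketch: kw_to_date is a hypothetical helper and it assumes ISO week numbering; the KW 14/2022 → 21/03/2022 mapping above suggests IRI uses a different week convention, so the numbering would need to be verified and offset accordingly):

```python
from datetime import date

def kw_to_date(kw_label: str) -> date:
    # "KW 14/2022" -> Monday of that week, assuming ISO week numbering.
    week_part, year_part = kw_label.removeprefix("KW ").split("/")
    return date.fromisocalendar(int(year_part), int(week_part), 1)

print(kw_to_date("KW 14/2022"))  # 2022-04-04 under ISO numbering
```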

How did they handle the "updating process" until now? They used an Excel workbook where you paste the data in one tab, extend the rows in another tab, and it returns the final result already padded.

My goal is to create a Data Pipeline in Python to automatically handle this monthly data update.

This padding solution helps me to copy and paste this data quickly for the next deadline.

Is this process inefficient from a data-quality perspective? Yes, and I am working really hard to fix it, but it is not something I decided at the beginning.

Thank you again for the suggestions and your time.

asked Aug 9, 2022 at 21:48
  • What do the 0, 1 and 2 NaN columns actually represent? Similarly for the SKU values: are they prices? Something else? – Commented Aug 10, 2022 at 12:20
  • You show the what but not the why. Surely this is not the last step in processing. Why are you filling with NaN, and what will replace those NaNs? – Commented Aug 10, 2022 at 12:21
  • I just edited the question to give more context, thank you @Reinderien :) – Commented Aug 11, 2022 at 11:07

2 Answers


I'm sure someone will come up with a better solution than this, but hopefully some ideas here are helpful. Starting from this dataframe:

 SKU1A SKU1B SKU1C SKU2A SKU2B SKU2C SKU3A SKU3B SKU3C SKU4A SKU4B SKU4C
Time 
1 10.0 1.0 0.1 20.0 2.0 0.2 30.0 3.0 0.3 70.0 7.0 0.7
2 40.0 4.0 0.4 50.0 5.0 0.5 60.0 6.0 0.6 80.0 8.0 0.8
3 90.0 9.0 0.9 100.0 10.0 1.0 110.0 11.0 1.1 120.0 12.0 1.2

We want to work with the SKUs, so it's probably easiest to transpose this and reset the index so we can access the SKUs in a column. So we can use

df = df.T
df.index.name = "SKUs"
df = df.reset_index()

to get

 SKUs 1 2 3
0 SKU1A 10.0 40.0 90.0
1 SKU1B 1.0 4.0 9.0
2 SKU1C 0.1 0.4 0.9
...

Next, we want to group each of the SKUs by the initial SKU part, which we can do using

group = df.groupby(df["SKUs"].str.extract(r"(SKU\d+)", expand=False))

Here, the regular expression r"(SKU\d+)" says to capture the letters SKU followed by any number of consecutive digits. This means SKU11A will be handled as you'd expect. Now we can get the individual groups with:
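As a tiny illustration of what the capture produces (a sketch with made-up labels mirroring the flattened columns):

```python
import pandas as pd

# Hypothetical flattened labels, mirroring the SKU/KPI columns above
s = pd.Series(["SKU1A", "SKU2B", "SKU11A"])
keys = s.str.extract(r"(SKU\d+)", expand=False)
print(keys.tolist())  # ['SKU1', 'SKU2', 'SKU11']
```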

groups = [group.get_group(x) for x in group.groups]

which gives, as you'd expect, a separate dataframe for each SKU (i.e., one dataframe for SKU1, one for SKU2, etc.).

Now comes an ugly part from me.

Ultimately, we want to alternate between an element from groups, and an empty dataframe of your desired structure. We can create the empty dataframe like this:

desired_num_cols = 3
empty_df = pd.DataFrame(
    [[pd.NA] * groups[0].shape[1]] * desired_num_cols,
    columns=groups[0].columns,
)
empty_df['SKUs'] = list(range(desired_num_cols))

and now we just need to concat this in an alternating fashion with elements from groups, which we can do as follows:

import itertools

df = pd.concat(
    itertools.chain(
        *zip(
            *[
                groups,
                itertools.cycle([empty_df]),
            ]
        )
    )
)

Broken down:

  • itertools.cycle([empty_df]) yields empty_df indefinitely; the zip stops once groups is exhausted.
  • The inner zip packs together tuples of (group, empty_df).
  • The itertools.chain(*...) then unpacks these tuples so that we have [group1, empty_df, group2, empty_df, ...].
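The interleaving pattern is easier to see with plain lists (a sketch using stand-in strings instead of dataframes):

```python
import itertools

groups = ["g1", "g2", "g3"]  # stand-ins for the per-SKU dataframes
pad = "pad"                  # stand-in for empty_df
interleaved = list(itertools.chain(*zip(groups, itertools.cycle([pad]))))
print(interleaved)  # ['g1', 'pad', 'g2', 'pad', 'g3', 'pad']
```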

Now set the index to SKUs and transpose:

df = df.reset_index(drop=True).set_index("SKUs")
df.T

and we get your final answer:

SKUs SKU1A SKU1B SKU1C 0 1 2 SKU2A SKU2B SKU2C 0 1 2 SKU3A SKU3B SKU3C 0 1 2 SKU4A SKU4B SKU4C 0 1 2
1 10.0 1.0 0.1 NaN NaN NaN 20.0 2.0 0.2 NaN NaN NaN 30.0 3.0 0.3 NaN NaN NaN 70.0 7.0 0.7 NaN NaN NaN
2 40.0 4.0 0.4 NaN NaN NaN 50.0 5.0 0.5 NaN NaN NaN 60.0 6.0 0.6 NaN NaN NaN 80.0 8.0 0.8 NaN NaN NaN
3 90.0 9.0 0.9 NaN NaN NaN 100.0 10.0 1.0 NaN NaN NaN 110.0 11.0 1.1 NaN NaN NaN 120.0 12.0 1.2 NaN NaN NaN
answered Aug 10, 2022 at 2:37

The premise for your problem assumes a structure for your dataframe that you shouldn't use, at least for data processing. It seems you have no influence over the format of the Excel file to which you need to export.

Think about the "per"s in your frame description: per time, per KPI, per SKU, you start off with some unidentified floating-point value (maybe a price; who knows). Whenever you write "per", that should become a level in a multi-level index. This reduces your dataframe to one column, "Price".

Thinking about your "pad" operation, you're really just adding a second column (of even more unknown purpose):

import pandas as pd
# Our df for this example
dict_df = {
    "Time": [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    "KPI": ["A", "B", "C", "D", "A", "B", "C", "D", "A", "B", "C", "D"],
    "SKU1": [10, 1, 0.1, 0.01, 40, 4, 0.4, 0.04, 90, 9, 0.9, 0.09],
    "SKU2": [20, 2, 0.2, 0.02, 50, 5, 0.5, 0.05, 100, 10, 1, 0.1],
    "SKU3": [30, 3, 0.3, 0.03, 60, 6, 0.6, 0.06, 110, 11, 1.1, 0.11],
    "SKU4": [70, 7, 0.7, 0.07, 80, 8, 0.8, 0.08, 120, 12, 1.2, 0.12],
}
df_test = pd.DataFrame(dict_df)
# Reshape data to be in stacked, multi-index form
df_test.set_index(['Time', 'KPI'], inplace=True)
df_test.columns = df_test.columns.str.slice(3).astype(int)
df_test = df_test.stack()
df_test.index.set_names('SKU', level=-1, inplace=True)
df_test = pd.DataFrame({'Price': df_test})
filter_kpi = {"A", "B", "C"}
df_test = df_test[df_test.index.get_level_values('KPI').isin(filter_kpi)]
df_test['Mystery'] = pd.NA
print(df_test)
 Price Mystery
Time KPI SKU 
1 A 1 10.0 <NA>
 2 20.0 <NA>
 3 30.0 <NA>
 4 70.0 <NA>
 B 1 1.0 <NA>
 2 2.0 <NA>
 3 3.0 <NA>
 4 7.0 <NA>
 C 1 0.1 <NA>
 2 0.2 <NA>
 3 0.3 <NA>
 4 0.7 <NA>
2 A 1 40.0 <NA>
 2 50.0 <NA>
 3 60.0 <NA>
 4 80.0 <NA>
 B 1 4.0 <NA>
 2 5.0 <NA>
 3 6.0 <NA>
 4 8.0 <NA>
 C 1 0.4 <NA>
 2 0.5 <NA>
 3 0.6 <NA>
 4 0.8 <NA>
3 A 1 90.0 <NA>
 2 100.0 <NA>
 3 110.0 <NA>
 4 120.0 <NA>
 B 1 9.0 <NA>
 2 10.0 <NA>
 3 11.0 <NA>
 4 12.0 <NA>
 C 1 0.9 <NA>
 2 1.0 <NA>
 3 1.1 <NA>
 4 1.2 <NA>
answered Aug 10, 2022 at 13:04
