Transposing an Excel sheet with pandas adding additional information for a different readability - Using Fake Pizzeria Data

Question 1

The goal is to transpose the excel file that comes in the following form

into this

The code I created works properly, but I think it can be improved.

I would also like to avoid using reduce to avoid importing functools.

#importing the libraries
import pandas as pd 
from functools import reduce
 
#defining the mapping list
descriptions_pizzeria = [
 "Pizzeria Da Michele - Napoli",
 "Pizzeria Da Michele - Roma",
 "Pizzeria Da Michele - Bologna",
 "Pizzeria Da Michele - Londra",
 ]
pizzeria_code = [
 1,
 2,
 3
 ]
#path where the data are stored
url_path = "https://github.com/uomodellamansarda/AC_StackQuestions/blob/main/Pizzerie.xlsx? raw=true" 
#reading the data
df = pd.read_excel(url_path, index_col=0, header=3)
#printing to have an idea
print(df)
#from an inspection we now that 
#the first column is where the Pizzeria code is stored
#but we need to rename it
#we need to rename the first column
df = df.rename(columns={"Unnamed: 1":"Pizzeria code"})
print(df.info())
#Grouping based on the pizzeria code
G = df.groupby('Pizzeria code')
pizzeria_df_list = []
for i, code in enumerate(pizzeria_code):
 #selecting the related Pizzeria commercial name
 description = descriptions_pizzeria[i]
 df_temp = G.get_group(code)
 #Dropping and trasposing the Pizzeria Code column
 df_temp = df_temp.drop(columns="Pizzeria code").transpose()
 #Based on what we know about the columns we can create the suffix
 col_suffix = ['Average Values', '# of customers ', 'Servings']
 #and rename the columns 
 df_temp.columns = [description + "_" + x for x in col_suffix]
 pizzeria_df_list.append(df_temp)
pizzeria_df = reduce(lambda x, y: pd.merge(x, y,right_index=True, left_index=True), pizzeria_df_list)
pizzeria_df.to_excel("Pizzeria_trasposed.xlsx")

Question 2

TL;DR

Instead of looping through each pizzeria and merging temporary subframes, it's more idiomatic to manipulate/map the columns at once.

The bulk of the work is just changing the original columns (left) to be filled/dropped/sorted/renamed (right):

 FROM THESE TO THESE 
|----------------------|-------------------------------------------------|
| Category Pizzeria | Category Pizzeria |
| Sales NaN | |
| NaN 1.0 | Average Values Pizzeria Da Michele - Napoli |
| NaN 2.0 | # of Customers Pizzeria Da Michele - Napoli |
| NaN 3.0 | Servings Pizzeria Da Michele - Napoli |
| People NaN | |
| NaN 1.0 | Average Values Pizzeria Da Michele - Roma |
| NaN 2.0 | # of Customers Pizzeria Da Michele - Roma |
| NaN 3.0 | Servings Pizzeria Da Michele - Roma |
| Dishes NaN | |
| NaN 1.0 | Average Values Pizzeria Da Michele - Bologna |
| NaN 2.0 | # of Customers Pizzeria Da Michele - Bologna |
| NaN 3.0 | Servings Pizzeria Da Michele - Bologna |
|----------------------|-------------------------------------------------|

df = pd.read_excel(url_path, header=3)
df = df.rename(columns={'Unnamed: 0': 'Category', 'Unnamed: 1': 'Pizzeria'})
# fill the null categories
df['Category'] = df['Category'].ffill()
# group up the pizzerias
df = df.dropna().sort_values('Pizzeria')
# map to the preferred pizzeria names
df['Pizzeria'] = df['Pizzeria'].map({
 1: 'Pizzeria Da Michele - Napoli',
 2: 'Pizzeria Da Michele - Roma',
 3: 'Pizzeria Da Michele - Bologna',
 4: 'Pizzeria Da Michele - Londra',
})
# map to the preferred category names
df['Category'] = df['Category'].map({
 'Dishes': 'Servings',
 'People': '# of Customers',
 'Sales': 'Average Values',
})
# set the combined pizzeria_category as the index
df = df.set_index(['Pizzeria', 'Category'])
df.index = [f'{p}_{c}' for p, c in df.index] # or ['_'.join(i) for i in df.index]
# transpose and output
df.T.to_excel('Pizzeria_transposed.xlsx')

Output:

 Pizzeria Da Michele - Pizzeria Da Michele - Pizzeria Da Michele - Pizzeria Da Michele - Pizzeria Da Michele - Pizzeria Da Michele - Pizzeria Da Michele - Pizzeria Da Michele - Pizzeria Da Michele -
 Napoli_Average Values Napoli_# of Customers Napoli_Servings Roma_Average Values Roma_# of Customers Roma_Servings Bologna_Average Values Bologna_# of Customers Bologna_Servings
2021年11月01日 00:00:00 100.0 70.0 10.0 300.0 200.0 30.0 50.0 20.0 5.0 
2021年11月08日 00:00:00 250.0 160.0 25.0 350.0 150.0 35.0 100.0 70.0 10.0 
2021年11月15日 00:00:00 100.0 70.0 10.0 500.0 300.0 50.0 250.0 160.0 25.0 
2021年11月22日 00:00:00 300.0 200.0 30.0 2000.0 1000.0 200.0 10.0 20.0 1.0 
2021年11月29日 00:00:00 400.0 250.0 40.0 1000.0 500.0 100.0 300.0 200.0 30.0

Detailed breakdown

Load the excel sheet without setting index_col because it will be easier to manipulate the Category as a column rather than index:

df = pd.read_excel(url_path, header=3)
df = df.rename(columns={'Unnamed: 0': 'Category', 'Unnamed: 1': 'Pizzeria'})
# Category Pizzeria 2021年11月01日 00:00:00 2021年11月08日 00:00:00 2021年11月15日 00:00:00 2021年11月22日 00:00:00 2021年11月29日 00:00:00
# 0 Sales NaN NaN NaN NaN NaN NaN
# 1 NaN 1.0 100.0 250.0 100.0 300.0 400.0
# 2 NaN 2.0 300.0 350.0 500.0 2000.0 1000.0
# 3 NaN 3.0 50.0 100.0 250.0 10.0 300.0
# 4 People NaN NaN NaN NaN NaN NaN
# 5 NaN 1.0 70.0 160.0 70.0 200.0 250.0
# 6 NaN 2.0 200.0 150.0 300.0 1000.0 500.0
# 7 NaN 3.0 20.0 70.0 160.0 20.0 200.0
# 8 Dishes NaN NaN NaN NaN NaN NaN
# 9 NaN 1.0 10.0 25.0 10.0 30.0 40.0
# 10 NaN 2.0 30.0 35.0 50.0 200.0 100.0
# 11 NaN 3.0 5.0 10.0 25.0 1.0 30.0

ffill (forward-fill) the null Category values and then sort by Pizzeria so that the pizzerias are grouped up:

df['Category'] = df['Category'].ffill()
df = df.dropna().sort_values('Pizzeria')
# Category Pizzeria 2021年11月01日 00:00:00 2021年11月08日 00:00:00 2021年11月15日 00:00:00 2021年11月22日 00:00:00 2021年11月29日 00:00:00
# 1 Sales 1.0 100.0 250.0 100.0 300.0 400.0
# 5 People 1.0 70.0 160.0 70.0 200.0 250.0
# 9 Dishes 1.0 10.0 25.0 10.0 30.0 40.0
# 2 Sales 2.0 300.0 350.0 500.0 2000.0 1000.0
# 6 People 2.0 200.0 150.0 300.0 1000.0 500.0
# 10 Dishes 2.0 30.0 35.0 50.0 200.0 100.0
# 3 Sales 3.0 50.0 100.0 250.0 10.0 300.0
# 7 People 3.0 20.0 70.0 160.0 20.0 200.0
# 11 Dishes 3.0 5.0 10.0 25.0 1.0 30.0

map Pizzeria and Category to your preferred strings:

df['Pizzeria'] = df['Pizzeria'].map({
 1: 'Pizzeria Da Michele - Napoli',
 2: 'Pizzeria Da Michele - Roma',
 3: 'Pizzeria Da Michele - Bologna',
 4: 'Pizzeria Da Michele - Londra',
})
df['Category'] = df['Category'].map({
 'Dishes': 'Servings',
 'People': '# of Customers',
 'Sales': 'Average Values',
})
# Category Pizzeria 2021年11月01日 00:00:00 2021年11月08日 00:00:00 2021年11月15日 00:00:00 2021年11月22日 00:00:00 2021年11月29日 00:00:00
# 1 Average Values Pizzeria Da Michele - Napoli 100.0 250.0 100.0 300.0 400.0
# 5 # of Customers Pizzeria Da Michele - Napoli 70.0 160.0 70.0 200.0 250.0
# 9 Servings Pizzeria Da Michele - Napoli 10.0 25.0 10.0 30.0 40.0
# 2 Average Values Pizzeria Da Michele - Roma 300.0 350.0 500.0 2000.0 1000.0
# 6 # of Customers Pizzeria Da Michele - Roma 200.0 150.0 300.0 1000.0 500.0
# 10 Servings Pizzeria Da Michele - Roma 30.0 35.0 50.0 200.0 100.0
# 3 Average Values Pizzeria Da Michele - Bologna 50.0 100.0 250.0 10.0 300.0
# 7 # of Customers Pizzeria Da Michele - Bologna 20.0 70.0 160.0 20.0 200.0
# 11 Servings Pizzeria Da Michele - Bologna 5.0 10.0 25.0 1.0 30.0

Join Pizzeria and Category inside the index (either explicitly unpack+concat the tuples or join them):

df = df.set_index(['Pizzeria', 'Category'])
df.index = [f'{p}_{c}' for p, c in df.index] # or ['_'.join(i) for i in df.index]
# 2021年11月01日 00:00:00 2021年11月08日 00:00:00 2021年11月15日 00:00:00 2021年11月22日 00:00:00 2021年11月29日 00:00:00
# Pizzeria Da Michele - Napoli_Average Values 100.0 250.0 100.0 300.0 400.0
# Pizzeria Da Michele - Napoli_# of Customers 70.0 160.0 70.0 200.0 250.0
# Pizzeria Da Michele - Napoli_Servings 10.0 25.0 10.0 30.0 40.0
# Pizzeria Da Michele - Roma_Average Values 300.0 350.0 500.0 2000.0 1000.0
# Pizzeria Da Michele - Roma_# of Customers 200.0 150.0 300.0 1000.0 500.0
# Pizzeria Da Michele - Roma_Servings 30.0 35.0 50.0 200.0 100.0
# Pizzeria Da Michele - Bologna_Average Values 50.0 100.0 250.0 10.0 300.0
# Pizzeria Da Michele - Bologna_# of Customers 20.0 70.0 160.0 20.0 200.0
# Pizzeria Da Michele - Bologna_Servings 5.0 10.0 25.0 1.0 30.0

Finally transpose and save:

df.T.to_excel('Pizzeria_transposed.xlsx')
# Pizzeria Da Michele - Pizzeria Da Michele - Pizzeria Da Michele - Pizzeria Da Michele - Pizzeria Da Michele - Pizzeria Da Michele - Pizzeria Da Michele - Pizzeria Da Michele - Pizzeria Da Michele -
# Napoli_Average Values Napoli_# of Customers Napoli_Servings Roma_Average Values Roma_# of Customers Roma_Servings Bologna_Average Values Bologna_# of Customers Bologna_Servings
# 2021年11月01日 00:00:00 100.0 70.0 10.0 300.0 200.0 30.0 50.0 20.0 5.0 
# 2021年11月08日 00:00:00 250.0 160.0 25.0 350.0 150.0 35.0 100.0 70.0 10.0 
# 2021年11月15日 00:00:00 100.0 70.0 10.0 500.0 300.0 50.0 250.0 160.0 25.0 
# 2021年11月22日 00:00:00 300.0 200.0 30.0 2000.0 1000.0 200.0 10.0 20.0 1.0 
# 2021年11月29日 00:00:00 400.0 250.0 40.0 1000.0 500.0 100.0 300.0 200.0 30.0

Question 3

Nice explanation and definitely more pandas-idiomatic. I would suggest in step 3, dict(enumerate(descriptions_pizzeria, 1)) is maybe a bit too clever, and instead descriptions_pizzeria should be merged with pizzeria_code and become a simple dictionary or intenum, mapping location codes to strings. This way the correspondence is explicit so you don't end up accidentally adding a name to the wrong index in the list and making the auto mapping invalid, if that makes sense. Also more efficient at runtime to predefine this.

Question 4

Also in section 4 df.index = ['_'.join(i) for i in df.index] maybe do a tuple unpacking to make it more explicit, e.g. df.index = [f"{pizzeria}_{sales_category}" for pizzeria, sales_category in df.index] - it's less performant than your approach but a lot clearer I think and this is not a performance bottleneck in the code so I'd take that trade-off :). Subjective though...

Question 5

@Greedo Thanks for the feedback! Great points -- will update.

tdy tdy 2,2661 gold badge10 silver badges21 bronze badges · Accepted Answer · 2021-11-18 09:03:15Z

TL;DR

Instead of looping through each pizzeria and merging temporary subframes, it's more idiomatic to manipulate/map the columns at once.

The bulk of the work is just changing the original columns (left) to be filled/dropped/sorted/renamed (right):

 FROM THESE TO THESE 
|----------------------|-------------------------------------------------|
| Category Pizzeria | Category Pizzeria |
| Sales NaN | |
| NaN 1.0 | Average Values Pizzeria Da Michele - Napoli |
| NaN 2.0 | # of Customers Pizzeria Da Michele - Napoli |
| NaN 3.0 | Servings Pizzeria Da Michele - Napoli |
| People NaN | |
| NaN 1.0 | Average Values Pizzeria Da Michele - Roma |
| NaN 2.0 | # of Customers Pizzeria Da Michele - Roma |
| NaN 3.0 | Servings Pizzeria Da Michele - Roma |
| Dishes NaN | |
| NaN 1.0 | Average Values Pizzeria Da Michele - Bologna |
| NaN 2.0 | # of Customers Pizzeria Da Michele - Bologna |
| NaN 3.0 | Servings Pizzeria Da Michele - Bologna |
|----------------------|-------------------------------------------------|

df = pd.read_excel(url_path, header=3)
df = df.rename(columns={'Unnamed: 0': 'Category', 'Unnamed: 1': 'Pizzeria'})
# fill the null categories
df['Category'] = df['Category'].ffill()
# group up the pizzerias
df = df.dropna().sort_values('Pizzeria')
# map to the preferred pizzeria names
df['Pizzeria'] = df['Pizzeria'].map({
 1: 'Pizzeria Da Michele - Napoli',
 2: 'Pizzeria Da Michele - Roma',
 3: 'Pizzeria Da Michele - Bologna',
 4: 'Pizzeria Da Michele - Londra',
})
# map to the preferred category names
df['Category'] = df['Category'].map({
 'Dishes': 'Servings',
 'People': '# of Customers',
 'Sales': 'Average Values',
})
# set the combined pizzeria_category as the index
df = df.set_index(['Pizzeria', 'Category'])
df.index = [f'{p}_{c}' for p, c in df.index] # or ['_'.join(i) for i in df.index]
# transpose and output
df.T.to_excel('Pizzeria_transposed.xlsx')

Output:

 Pizzeria Da Michele - Pizzeria Da Michele - Pizzeria Da Michele - Pizzeria Da Michele - Pizzeria Da Michele - Pizzeria Da Michele - Pizzeria Da Michele - Pizzeria Da Michele - Pizzeria Da Michele -
 Napoli_Average Values Napoli_# of Customers Napoli_Servings Roma_Average Values Roma_# of Customers Roma_Servings Bologna_Average Values Bologna_# of Customers Bologna_Servings
2021年11月01日 00:00:00 100.0 70.0 10.0 300.0 200.0 30.0 50.0 20.0 5.0 
2021年11月08日 00:00:00 250.0 160.0 25.0 350.0 150.0 35.0 100.0 70.0 10.0 
2021年11月15日 00:00:00 100.0 70.0 10.0 500.0 300.0 50.0 250.0 160.0 25.0 
2021年11月22日 00:00:00 300.0 200.0 30.0 2000.0 1000.0 200.0 10.0 20.0 1.0 
2021年11月29日 00:00:00 400.0 250.0 40.0 1000.0 500.0 100.0 300.0 200.0 30.0

Detailed breakdown

Load the excel sheet without setting index_col because it will be easier to manipulate the Category as a column rather than index:

df = pd.read_excel(url_path, header=3)
df = df.rename(columns={'Unnamed: 0': 'Category', 'Unnamed: 1': 'Pizzeria'})
# Category Pizzeria 2021年11月01日 00:00:00 2021年11月08日 00:00:00 2021年11月15日 00:00:00 2021年11月22日 00:00:00 2021年11月29日 00:00:00
# 0 Sales NaN NaN NaN NaN NaN NaN
# 1 NaN 1.0 100.0 250.0 100.0 300.0 400.0
# 2 NaN 2.0 300.0 350.0 500.0 2000.0 1000.0
# 3 NaN 3.0 50.0 100.0 250.0 10.0 300.0
# 4 People NaN NaN NaN NaN NaN NaN
# 5 NaN 1.0 70.0 160.0 70.0 200.0 250.0
# 6 NaN 2.0 200.0 150.0 300.0 1000.0 500.0
# 7 NaN 3.0 20.0 70.0 160.0 20.0 200.0
# 8 Dishes NaN NaN NaN NaN NaN NaN
# 9 NaN 1.0 10.0 25.0 10.0 30.0 40.0
# 10 NaN 2.0 30.0 35.0 50.0 200.0 100.0
# 11 NaN 3.0 5.0 10.0 25.0 1.0 30.0

ffill (forward-fill) the null Category values and then sort by Pizzeria so that the pizzerias are grouped up:

df['Category'] = df['Category'].ffill()
df = df.dropna().sort_values('Pizzeria')
# Category Pizzeria 2021年11月01日 00:00:00 2021年11月08日 00:00:00 2021年11月15日 00:00:00 2021年11月22日 00:00:00 2021年11月29日 00:00:00
# 1 Sales 1.0 100.0 250.0 100.0 300.0 400.0
# 5 People 1.0 70.0 160.0 70.0 200.0 250.0
# 9 Dishes 1.0 10.0 25.0 10.0 30.0 40.0
# 2 Sales 2.0 300.0 350.0 500.0 2000.0 1000.0
# 6 People 2.0 200.0 150.0 300.0 1000.0 500.0
# 10 Dishes 2.0 30.0 35.0 50.0 200.0 100.0
# 3 Sales 3.0 50.0 100.0 250.0 10.0 300.0
# 7 People 3.0 20.0 70.0 160.0 20.0 200.0
# 11 Dishes 3.0 5.0 10.0 25.0 1.0 30.0

map Pizzeria and Category to your preferred strings:

df['Pizzeria'] = df['Pizzeria'].map({
 1: 'Pizzeria Da Michele - Napoli',
 2: 'Pizzeria Da Michele - Roma',
 3: 'Pizzeria Da Michele - Bologna',
 4: 'Pizzeria Da Michele - Londra',
})
df['Category'] = df['Category'].map({
 'Dishes': 'Servings',
 'People': '# of Customers',
 'Sales': 'Average Values',
})
# Category Pizzeria 2021年11月01日 00:00:00 2021年11月08日 00:00:00 2021年11月15日 00:00:00 2021年11月22日 00:00:00 2021年11月29日 00:00:00
# 1 Average Values Pizzeria Da Michele - Napoli 100.0 250.0 100.0 300.0 400.0
# 5 # of Customers Pizzeria Da Michele - Napoli 70.0 160.0 70.0 200.0 250.0
# 9 Servings Pizzeria Da Michele - Napoli 10.0 25.0 10.0 30.0 40.0
# 2 Average Values Pizzeria Da Michele - Roma 300.0 350.0 500.0 2000.0 1000.0
# 6 # of Customers Pizzeria Da Michele - Roma 200.0 150.0 300.0 1000.0 500.0
# 10 Servings Pizzeria Da Michele - Roma 30.0 35.0 50.0 200.0 100.0
# 3 Average Values Pizzeria Da Michele - Bologna 50.0 100.0 250.0 10.0 300.0
# 7 # of Customers Pizzeria Da Michele - Bologna 20.0 70.0 160.0 20.0 200.0
# 11 Servings Pizzeria Da Michele - Bologna 5.0 10.0 25.0 1.0 30.0

Join Pizzeria and Category inside the index (either explicitly unpack+concat the tuples or join them):

df = df.set_index(['Pizzeria', 'Category'])
df.index = [f'{p}_{c}' for p, c in df.index] # or ['_'.join(i) for i in df.index]
# 2021年11月01日 00:00:00 2021年11月08日 00:00:00 2021年11月15日 00:00:00 2021年11月22日 00:00:00 2021年11月29日 00:00:00
# Pizzeria Da Michele - Napoli_Average Values 100.0 250.0 100.0 300.0 400.0
# Pizzeria Da Michele - Napoli_# of Customers 70.0 160.0 70.0 200.0 250.0
# Pizzeria Da Michele - Napoli_Servings 10.0 25.0 10.0 30.0 40.0
# Pizzeria Da Michele - Roma_Average Values 300.0 350.0 500.0 2000.0 1000.0
# Pizzeria Da Michele - Roma_# of Customers 200.0 150.0 300.0 1000.0 500.0
# Pizzeria Da Michele - Roma_Servings 30.0 35.0 50.0 200.0 100.0
# Pizzeria Da Michele - Bologna_Average Values 50.0 100.0 250.0 10.0 300.0
# Pizzeria Da Michele - Bologna_# of Customers 20.0 70.0 160.0 20.0 200.0
# Pizzeria Da Michele - Bologna_Servings 5.0 10.0 25.0 1.0 30.0

Finally transpose and save:

df.T.to_excel('Pizzeria_transposed.xlsx')
# Pizzeria Da Michele - Pizzeria Da Michele - Pizzeria Da Michele - Pizzeria Da Michele - Pizzeria Da Michele - Pizzeria Da Michele - Pizzeria Da Michele - Pizzeria Da Michele - Pizzeria Da Michele -
# Napoli_Average Values Napoli_# of Customers Napoli_Servings Roma_Average Values Roma_# of Customers Roma_Servings Bologna_Average Values Bologna_# of Customers Bologna_Servings
# 2021年11月01日 00:00:00 100.0 70.0 10.0 300.0 200.0 30.0 50.0 20.0 5.0 
# 2021年11月08日 00:00:00 250.0 160.0 25.0 350.0 150.0 35.0 100.0 70.0 10.0 
# 2021年11月15日 00:00:00 100.0 70.0 10.0 500.0 300.0 50.0 250.0 160.0 25.0 
# 2021年11月22日 00:00:00 300.0 200.0 30.0 2000.0 1000.0 200.0 10.0 20.0 1.0 
# 2021年11月29日 00:00:00 400.0 250.0 40.0 1000.0 500.0 100.0 300.0 200.0 30.0

Nice explanation and definitely more pandas-idiomatic. I would suggest in step 3, dict(enumerate(descriptions_pizzeria, 1)) is maybe a bit too clever, and instead descriptions_pizzeria should be merged with pizzeria_code and become a simple dictionary or intenum, mapping location codes to strings. This way the correspondence is explicit so you don't end up accidentally adding a name to the wrong index in the list and making the auto mapping invalid, if that makes sense. Also more efficient at runtime to predefine this.
Also in section 4 df.index = ['_'.join(i) for i in df.index] maybe do a tuple unpacking to make it more explicit, e.g. df.index = [f"{pizzeria}_{sales_category}" for pizzeria, sales_category in df.index] - it's less performant than your approach but a lot clearer I think and this is not a performance bottleneck in the code so I'd take that trade-off :). Subjective though...
@Greedo Thanks for the feedback! Great points -- will update.

Stack Exchange Network

Transposing an Excel sheet with pandas adding additional information for a different readability - Using Fake Pizzeria Data

1 Answer 1

TL;DR

Detailed breakdown

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Transposing an Excel sheet with pandas adding additional information for a different readability - Using Fake Pizzeria Data

1 Answer 1

TL;DR

Detailed breakdown

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions