The goal is to transpose the excel file that comes in the following form
into this
The code I created works properly, but I think it can be improved.
I would also like to avoid using reduce
to avoid importing functools
.
#importing the libraries
import pandas as pd
from functools import reduce
#defining the mapping list
descriptions_pizzeria = [
"Pizzeria Da Michele - Napoli",
"Pizzeria Da Michele - Roma",
"Pizzeria Da Michele - Bologna",
"Pizzeria Da Michele - Londra",
]
pizzeria_code = [
1,
2,
3
]
#path where the data are stored
url_path = "https://github.com/uomodellamansarda/AC_StackQuestions/blob/main/Pizzerie.xlsx? raw=true"
#reading the data
df = pd.read_excel(url_path, index_col=0, header=3)
#printing to have an idea
print(df)
#from an inspection we now that
#the first column is where the Pizzeria code is stored
#but we need to rename it
#we need to rename the first column
df = df.rename(columns={"Unnamed: 1":"Pizzeria code"})
print(df.info())
#Grouping based on the pizzeria code
G = df.groupby('Pizzeria code')
pizzeria_df_list = []
for i, code in enumerate(pizzeria_code):
#selecting the related Pizzeria commercial name
description = descriptions_pizzeria[i]
df_temp = G.get_group(code)
#Dropping and trasposing the Pizzeria Code column
df_temp = df_temp.drop(columns="Pizzeria code").transpose()
#Based on what we know about the columns we can create the suffix
col_suffix = ['Average Values', '# of customers ', 'Servings']
#and rename the columns
df_temp.columns = [description + "_" + x for x in col_suffix]
pizzeria_df_list.append(df_temp)
pizzeria_df = reduce(lambda x, y: pd.merge(x, y,right_index=True, left_index=True), pizzeria_df_list)
pizzeria_df.to_excel("Pizzeria_trasposed.xlsx")
1 Answer 1
TL;DR
Instead of looping through each pizzeria and merging temporary subframes, it's more idiomatic to manipulate/map the columns at once.
The bulk of the work is just changing the original columns (left) to be filled/dropped/sorted/renamed (right):
FROM THESE TO THESE
|----------------------|-------------------------------------------------|
| Category Pizzeria | Category Pizzeria |
| Sales NaN | |
| NaN 1.0 | Average Values Pizzeria Da Michele - Napoli |
| NaN 2.0 | # of Customers Pizzeria Da Michele - Napoli |
| NaN 3.0 | Servings Pizzeria Da Michele - Napoli |
| People NaN | |
| NaN 1.0 | Average Values Pizzeria Da Michele - Roma |
| NaN 2.0 | # of Customers Pizzeria Da Michele - Roma |
| NaN 3.0 | Servings Pizzeria Da Michele - Roma |
| Dishes NaN | |
| NaN 1.0 | Average Values Pizzeria Da Michele - Bologna |
| NaN 2.0 | # of Customers Pizzeria Da Michele - Bologna |
| NaN 3.0 | Servings Pizzeria Da Michele - Bologna |
|----------------------|-------------------------------------------------|
df = pd.read_excel(url_path, header=3)
df = df.rename(columns={'Unnamed: 0': 'Category', 'Unnamed: 1': 'Pizzeria'})
# fill the null categories
df['Category'] = df['Category'].ffill()
# group up the pizzerias
df = df.dropna().sort_values('Pizzeria')
# map to the preferred pizzeria names
df['Pizzeria'] = df['Pizzeria'].map({
1: 'Pizzeria Da Michele - Napoli',
2: 'Pizzeria Da Michele - Roma',
3: 'Pizzeria Da Michele - Bologna',
4: 'Pizzeria Da Michele - Londra',
})
# map to the preferred category names
df['Category'] = df['Category'].map({
'Dishes': 'Servings',
'People': '# of Customers',
'Sales': 'Average Values',
})
# set the combined pizzeria_category as the index
df = df.set_index(['Pizzeria', 'Category'])
df.index = [f'{p}_{c}' for p, c in df.index] # or ['_'.join(i) for i in df.index]
# transpose and output
df.T.to_excel('Pizzeria_transposed.xlsx')
Output:
Pizzeria Da Michele - Pizzeria Da Michele - Pizzeria Da Michele - Pizzeria Da Michele - Pizzeria Da Michele - Pizzeria Da Michele - Pizzeria Da Michele - Pizzeria Da Michele - Pizzeria Da Michele -
Napoli_Average Values Napoli_# of Customers Napoli_Servings Roma_Average Values Roma_# of Customers Roma_Servings Bologna_Average Values Bologna_# of Customers Bologna_Servings
2021年11月01日 00:00:00 100.0 70.0 10.0 300.0 200.0 30.0 50.0 20.0 5.0
2021年11月08日 00:00:00 250.0 160.0 25.0 350.0 150.0 35.0 100.0 70.0 10.0
2021年11月15日 00:00:00 100.0 70.0 10.0 500.0 300.0 50.0 250.0 160.0 25.0
2021年11月22日 00:00:00 300.0 200.0 30.0 2000.0 1000.0 200.0 10.0 20.0 1.0
2021年11月29日 00:00:00 400.0 250.0 40.0 1000.0 500.0 100.0 300.0 200.0 30.0
Detailed breakdown
Load the excel sheet without setting
index_col
because it will be easier to manipulate theCategory
as a column rather than index:df = pd.read_excel(url_path, header=3) df = df.rename(columns={'Unnamed: 0': 'Category', 'Unnamed: 1': 'Pizzeria'}) # Category Pizzeria 2021年11月01日 00:00:00 2021年11月08日 00:00:00 2021年11月15日 00:00:00 2021年11月22日 00:00:00 2021年11月29日 00:00:00 # 0 Sales NaN NaN NaN NaN NaN NaN # 1 NaN 1.0 100.0 250.0 100.0 300.0 400.0 # 2 NaN 2.0 300.0 350.0 500.0 2000.0 1000.0 # 3 NaN 3.0 50.0 100.0 250.0 10.0 300.0 # 4 People NaN NaN NaN NaN NaN NaN # 5 NaN 1.0 70.0 160.0 70.0 200.0 250.0 # 6 NaN 2.0 200.0 150.0 300.0 1000.0 500.0 # 7 NaN 3.0 20.0 70.0 160.0 20.0 200.0 # 8 Dishes NaN NaN NaN NaN NaN NaN # 9 NaN 1.0 10.0 25.0 10.0 30.0 40.0 # 10 NaN 2.0 30.0 35.0 50.0 200.0 100.0 # 11 NaN 3.0 5.0 10.0 25.0 1.0 30.0
ffill
(forward-fill) the nullCategory
values and then sort byPizzeria
so that the pizzerias are grouped up:df['Category'] = df['Category'].ffill() df = df.dropna().sort_values('Pizzeria') # Category Pizzeria 2021年11月01日 00:00:00 2021年11月08日 00:00:00 2021年11月15日 00:00:00 2021年11月22日 00:00:00 2021年11月29日 00:00:00 # 1 Sales 1.0 100.0 250.0 100.0 300.0 400.0 # 5 People 1.0 70.0 160.0 70.0 200.0 250.0 # 9 Dishes 1.0 10.0 25.0 10.0 30.0 40.0 # 2 Sales 2.0 300.0 350.0 500.0 2000.0 1000.0 # 6 People 2.0 200.0 150.0 300.0 1000.0 500.0 # 10 Dishes 2.0 30.0 35.0 50.0 200.0 100.0 # 3 Sales 3.0 50.0 100.0 250.0 10.0 300.0 # 7 People 3.0 20.0 70.0 160.0 20.0 200.0 # 11 Dishes 3.0 5.0 10.0 25.0 1.0 30.0
map
Pizzeria
andCategory
to your preferred strings:df['Pizzeria'] = df['Pizzeria'].map({ 1: 'Pizzeria Da Michele - Napoli', 2: 'Pizzeria Da Michele - Roma', 3: 'Pizzeria Da Michele - Bologna', 4: 'Pizzeria Da Michele - Londra', }) df['Category'] = df['Category'].map({ 'Dishes': 'Servings', 'People': '# of Customers', 'Sales': 'Average Values', }) # Category Pizzeria 2021年11月01日 00:00:00 2021年11月08日 00:00:00 2021年11月15日 00:00:00 2021年11月22日 00:00:00 2021年11月29日 00:00:00 # 1 Average Values Pizzeria Da Michele - Napoli 100.0 250.0 100.0 300.0 400.0 # 5 # of Customers Pizzeria Da Michele - Napoli 70.0 160.0 70.0 200.0 250.0 # 9 Servings Pizzeria Da Michele - Napoli 10.0 25.0 10.0 30.0 40.0 # 2 Average Values Pizzeria Da Michele - Roma 300.0 350.0 500.0 2000.0 1000.0 # 6 # of Customers Pizzeria Da Michele - Roma 200.0 150.0 300.0 1000.0 500.0 # 10 Servings Pizzeria Da Michele - Roma 30.0 35.0 50.0 200.0 100.0 # 3 Average Values Pizzeria Da Michele - Bologna 50.0 100.0 250.0 10.0 300.0 # 7 # of Customers Pizzeria Da Michele - Bologna 20.0 70.0 160.0 20.0 200.0 # 11 Servings Pizzeria Da Michele - Bologna 5.0 10.0 25.0 1.0 30.0
Join
Pizzeria
andCategory
inside the index (either explicitly unpack+concat the tuples orjoin
them):df = df.set_index(['Pizzeria', 'Category']) df.index = [f'{p}_{c}' for p, c in df.index] # or ['_'.join(i) for i in df.index] # 2021年11月01日 00:00:00 2021年11月08日 00:00:00 2021年11月15日 00:00:00 2021年11月22日 00:00:00 2021年11月29日 00:00:00 # Pizzeria Da Michele - Napoli_Average Values 100.0 250.0 100.0 300.0 400.0 # Pizzeria Da Michele - Napoli_# of Customers 70.0 160.0 70.0 200.0 250.0 # Pizzeria Da Michele - Napoli_Servings 10.0 25.0 10.0 30.0 40.0 # Pizzeria Da Michele - Roma_Average Values 300.0 350.0 500.0 2000.0 1000.0 # Pizzeria Da Michele - Roma_# of Customers 200.0 150.0 300.0 1000.0 500.0 # Pizzeria Da Michele - Roma_Servings 30.0 35.0 50.0 200.0 100.0 # Pizzeria Da Michele - Bologna_Average Values 50.0 100.0 250.0 10.0 300.0 # Pizzeria Da Michele - Bologna_# of Customers 20.0 70.0 160.0 20.0 200.0 # Pizzeria Da Michele - Bologna_Servings 5.0 10.0 25.0 1.0 30.0
Finally transpose and save:
df.T.to_excel('Pizzeria_transposed.xlsx') # Pizzeria Da Michele - Pizzeria Da Michele - Pizzeria Da Michele - Pizzeria Da Michele - Pizzeria Da Michele - Pizzeria Da Michele - Pizzeria Da Michele - Pizzeria Da Michele - Pizzeria Da Michele - # Napoli_Average Values Napoli_# of Customers Napoli_Servings Roma_Average Values Roma_# of Customers Roma_Servings Bologna_Average Values Bologna_# of Customers Bologna_Servings # 2021年11月01日 00:00:00 100.0 70.0 10.0 300.0 200.0 30.0 50.0 20.0 5.0 # 2021年11月08日 00:00:00 250.0 160.0 25.0 350.0 150.0 35.0 100.0 70.0 10.0 # 2021年11月15日 00:00:00 100.0 70.0 10.0 500.0 300.0 50.0 250.0 160.0 25.0 # 2021年11月22日 00:00:00 300.0 200.0 30.0 2000.0 1000.0 200.0 10.0 20.0 1.0 # 2021年11月29日 00:00:00 400.0 250.0 40.0 1000.0 500.0 100.0 300.0 200.0 30.0
-
2\$\begingroup\$ Nice explanation and definitely more pandas-idiomatic. I would suggest in step 3,
dict(enumerate(descriptions_pizzeria, 1))
is maybe a bit too clever, and insteaddescriptions_pizzeria
should be merged withpizzeria_code
and become a simple dictionary or intenum, mapping location codes to strings. This way the correspondence is explicit so you don't end up accidentally adding a name to the wrong index in the list and making the auto mapping invalid, if that makes sense. Also more efficient at runtime to predefine this. \$\endgroup\$Greedo– Greedo2021年11月18日 17:32:15 +00:00Commented Nov 18, 2021 at 17:32 -
2\$\begingroup\$ Also in section 4
df.index = ['_'.join(i) for i in df.index]
maybe do a tuple unpacking to make it more explicit, e.g.df.index = [f"{pizzeria}_{sales_category}" for pizzeria, sales_category in df.index]
- it's less performant than your approach but a lot clearer I think and this is not a performance bottleneck in the code so I'd take that trade-off :). Subjective though... \$\endgroup\$Greedo– Greedo2021年11月18日 17:39:45 +00:00Commented Nov 18, 2021 at 17:39 -
\$\begingroup\$ @Greedo Thanks for the feedback! Great points -- will update. \$\endgroup\$tdy– tdy2021年11月18日 17:46:30 +00:00Commented Nov 18, 2021 at 17:46