This post is adapted from this Code Review question: https://codereview.stackexchange.com/posts/292885/edit (Alternatives to iterrow loops in python pandas dataframes).
I have a piece of code that calculates price sensitivity based on a product and its rating.
Below is the original data set with product type, reported year, customer rating, price per unit, and quantity. There are two products, Product 1 and Product 2. Each product has five reported years; the rating is the same for a given product every year, but price and quantity vary from year to year.
Input data:
import numpy as np
import pandas as pd
from itertools import cycle

data = pd.DataFrame(
    [['Product 1', 'Year 1', 'Good', 34, 7],
     ['Product 1', 'Year 2', 'Good', 22, 5],
     ['Product 1', 'Year 3', 'Good', 30, 2],
     ['Product 1', 'Year 4', 'Good', 50, 1],
     ['Product 1', 'Year 5', 'Good', 44, 103],
     ['Product 2', 'Year 1', 'Bad', 200, 12],
     ['Product 2', 'Year 2', 'Bad', 103, 50],
     ['Product 2', 'Year 3', 'Bad', 150, 192],
     ['Product 2', 'Year 4', 'Bad', 309, 20],
     ['Product 2', 'Year 5', 'Bad', 200, 12]],
    columns=['Product', 'Year', 'Rating', 'Price', 'Quantity'])
I then created 2 correlation matrices, namely product and rating matrices.
Product correlation matrix:
#Product correlation matrix
#1. Create a dataframe df_corr_product that initially has correlation of 0.5
row, col = 5 * len(data['Product'].unique().tolist()), 5 * len(data['Product'].unique().tolist()) + 1
df_corr_product = pd.DataFrame.from_records([[0.5]*col]*row)
df_corr_product = df_corr_product.loc[ : , df_corr_product.columns != 0]
#2. CREATE NEW COLUMNS in df_corr_product
#year
year_list = cycle(['Year 1', 'Year 2', 'Year 3', 'Year 4', 'Year 5'])
df_corr_product['Year'] = [next(year_list) for i in range(len(df_corr_product))]
df_corr_product.insert(0, 'Year', df_corr_product.pop('Year'))
#product
product_list = data['Product'].unique().tolist()
rep = 5
df_corr_product['Product'] = [j for j in product_list for i in range(rep)]
df_corr_product.insert(1, 'Product', df_corr_product.pop('Product'))
#rating
df_rating = data[['Product', 'Rating']].drop_duplicates()
rating_list = [list(i) for i in zip(df_rating['Product'], df_rating['Rating'])]
product_list_1 = df_corr_product['Product'].tolist()
L = []
for i in range(len(product_list_1)):
    for j in range(len(rating_list)):
        if product_list_1[i] == rating_list[j][0]:
            L.append(rating_list[j][1])
df_corr_product['Rating'] = L
df_corr_product.insert(2, 'Rating', df_corr_product.pop('Rating'))
#3. INSERT YEAR, PRODUCT'S NAME, RATING HEADERS TO df_corr_product
#Year
df_corr_product.loc[-1] = ['', '', ''] + [next(year_list) for i in range(len(df_corr_product))]
df_corr_product.iloc[-1] = df_corr_product.iloc[-1].astype(str)
df_corr_product.index = df_corr_product.index + 1
df_corr_product = df_corr_product.sort_index()
#Product
df_corr_product.loc[-1] = ['', '', ''] + [ele for ele in product_list for i in range(rep)]
df_corr_product.iloc[-1] = df_corr_product.iloc[-1].astype(str)
df_corr_product.index = df_corr_product.index + 1
df_corr_product = df_corr_product.sort_index()
#Rating
df_corr_product.loc[-1] = ['', '', ''] + L
df_corr_product.iloc[-1] = df_corr_product.iloc[-1].astype(str)
df_corr_product.index = df_corr_product.index + 1
df_corr_product = df_corr_product.sort_index()
new_labels = pd.MultiIndex.from_arrays([df_corr_product.columns, df_corr_product.iloc[0], df_corr_product.iloc[1]], names=['Year', 'Rating', 'Product'])
df_corr_product = df_corr_product.set_axis(new_labels, axis=1).iloc[3:].reset_index().drop('index', axis = 1)
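For reference, the same labelled 10×10 scaffold can be built directly, without inserting and re-sorting header rows. This is a sketch, not a drop-in replacement: it assumes the fixed 2-product × 5-year layout, and `rating_of` is a hypothetical mapping introduced here for illustration.

```python
import pandas as pd

products = ['Product 1', 'Product 2']
years = ['Year 1', 'Year 2', 'Year 3', 'Year 4', 'Year 5']
rating_of = {'Product 1': 'Good', 'Product 2': 'Bad'}  # assumed fixed per product

# One (Year, Rating, Product) tuple per matrix column, product-major
labels = pd.MultiIndex.from_tuples(
    [(y, rating_of[p], p) for p in products for y in years],
    names=['Year', 'Rating', 'Product'])
scaffold = pd.DataFrame(0.5, index=range(len(labels)), columns=labels)
```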
#4. Update correlation: if column & row has the same product name, 1, otherwise, keep 0.5
for i, j in df_corr_product.iterrows():
    i = df_corr_product.index.tolist()[0]
    while i <= len(df_corr_product.index):
        df_corr_product.iloc[i:i+5, i+3:i+8] = 1.0
        i += 5
for i, j in df_corr_product.iterrows():
    df_corr_product.iloc[i, i + 3] = 0.0  # Set 0 on the diagonal
The idea is that:
- Values on the diagonal are 0.
- If a cell's row and column have the same product name, its value is 1; otherwise it is 0.5, as in the screenshot of the expected product correlation matrix.
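That same-product rule can also be expressed without loops as a broadcasting comparison of the row and column product labels. A minimal sketch, assuming the 2-product × 5-year layout above:

```python
import numpy as np

# One product label per row/column of the 10x10 matrix, product-major
products = np.repeat(['Product 1', 'Product 2'], 5)
# True where the row and column refer to the same product
same_product = products[:, None] == products[None, :]
corr_product = np.where(same_product, 1.0, 0.5)
np.fill_diagonal(corr_product, 0.0)
```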
Rating correlation matrix:
#Rating correlation matrix
#1. Create a dataframe df_corr_rating that initially has correlation of 1
row, col = 5 * len(data['Product'].unique().tolist()), 5 * len(data['Product'].unique().tolist()) + 1
df_corr_rating = pd.DataFrame.from_records([[float(1)]*col]*row)
df_corr_rating = df_corr_rating.loc[ : , df_corr_rating.columns != 0]
#2. CREATE NEW COLUMNS in df_corr_rating
#year
df_corr_rating['Year'] = [next(year_list) for i in range(len(df_corr_rating))]
df_corr_rating.insert(0, 'Year', df_corr_rating.pop('Year'))
#product
df_corr_rating['Product'] = [j for j in product_list for i in range(rep)]
df_corr_rating.insert(1, 'Product', df_corr_rating.pop('Product'))
#rating
df_corr_rating['Rating'] = L
df_corr_rating.insert(2, 'Rating', df_corr_rating.pop('Rating'))
#3. INSERT YEAR, PRODUCT'S NAME, RATING HEADERS TO df_corr_rating
#Year
df_corr_rating.loc[-1] = ['', '', ''] + [next(year_list) for i in range(len(df_corr_rating))]
df_corr_rating.iloc[-1] = df_corr_rating.iloc[-1].astype(str)
df_corr_rating.index = df_corr_rating.index + 1
df_corr_rating = df_corr_rating.sort_index()
#Product
df_corr_rating.loc[-1] = ['', '', ''] + [j for j in product_list for i in range(rep)]
df_corr_rating.iloc[-1] = df_corr_rating.iloc[-1].astype(str)
df_corr_rating.index = df_corr_rating.index + 1
df_corr_rating = df_corr_rating.sort_index()
#Rating
df_corr_rating.loc[-1] = ['', '', ''] + L
df_corr_rating.iloc[-1] = df_corr_rating.iloc[-1].astype(str)
df_corr_rating.index = df_corr_rating.index + 1
df_corr_rating = df_corr_rating.sort_index()
new_labels = pd.MultiIndex.from_arrays([df_corr_rating.columns, df_corr_rating.iloc[0], df_corr_rating.iloc[1]], names=['Year', 'Rating', 'Product'])
df_corr_rating = df_corr_rating.set_axis(new_labels, axis=1).iloc[3:].reset_index().drop('index', axis = 1)
#4. UPDATE CORRELATION: set cell value to 0.8 where the row and column ratings differ
for i, j in df_corr_rating.iterrows():
    for k in range(3, len(df_corr_rating.columns)):
        if (df_corr_rating.columns[k][1] == 'Bad' and df_corr_rating.iloc[i, 2] == 'Good') or (df_corr_rating.columns[k][1] == 'Good' and df_corr_rating.iloc[i, 2] == 'Bad'):
            df_corr_rating.iloc[i, k] = 0.8
for i, j in df_corr_rating.iterrows():
    df_corr_rating.iloc[i, i + 3] = 0.0  # Set 0 on the diagonal
- Diagonal values should be set to 0.
- If a cell's row and column have the same rating (i.e., both are "Good"), it is 1; otherwise it is 0.8 (e.g., row "Good" and column "Bad" gives 0.8), as in the screenshot of the expected rating correlation matrix.
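The rating rule vectorizes the same way as the product rule: compare the row and column rating labels with broadcasting. A sketch, assuming Product 1 is "Good" and Product 2 is "Bad" as in the data above:

```python
import numpy as np

# One rating label per row/column, matching the product-major layout
ratings = np.repeat(['Good', 'Bad'], 5)
# True where the row and column ratings agree
same_rating = ratings[:, None] == ratings[None, :]
corr_rating = np.where(same_rating, 1.0, 0.8)
np.fill_diagonal(corr_rating, 0.0)
```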
Finally, I take the outer product of the "Price" column with itself and multiply it elementwise by the elementwise product of the two correlation matrices.
df_product_main = df_corr_product.iloc[:, 3:]
df_rating_main = df_corr_rating.iloc[:, 3:]
df_product_rating = df_product_main.to_numpy() * df_rating_main.to_numpy()
s = data[['Price']].to_numpy()     # prices as a column vector
v = df_product_rating              # combined correlation matrix
t = np.multiply(s, s.transpose())  # outer product of the prices
u = np.multiply(t, v)              # apply the combined correlations
z = pd.DataFrame(u)
The final output is:
print(z)
The point is, when my data is limited to fewer than 1,000 rows, the code runs reasonably well. However, with more than 10,000 rows it appears to hang: the run time exceeds 3 hours and it eventually crashes. I assume the root cause is the loops in the matrix-building steps. Are there better ways to optimize this?
1 Answer
Just to give you an idea.
I'd start by really making sure that I need to calculate that huge matrix. I'm assuming you do.
Generally, if you are doing number crunching in pandas that you could be doing in numpy, then stop and use numpy.
You have already used an outer product here: `np.multiply(s, s.transpose())`. The same idea lets you build masks, and use those to multiply the necessary elements of this matrix. For example:
df = data.groupby(['Product', 'Year']).first()  # keeps the product-major row order used above

rating = (df['Rating'] == 'Good').to_numpy()
rating_mask = rating.reshape((-1, 1)) == rating.reshape((1, -1))

product = df.index.get_level_values('Product').to_numpy()
product_mask = product.reshape((-1, 1)) == product.reshape((1, -1))

p = df['Price'].astype(float).to_numpy()
price_table = p * p.reshape((-1, 1))
# If you know that the prices are divisible by 5,
# then you can do `//= 5, *= 4` instead,
# and you don't need the `.astype(float)` above.
# Money is better left in the hands of integers.
price_table[~rating_mask] *= 0.8   # different ratings -> factor 0.8
price_table[~product_mask] *= 0.5  # different products -> factor 0.5
# zero the diagonal
np.fill_diagonal(price_table, 0)
`L` looks very scary: it's a cross join done in Python with quadratic complexity. Check out the merge capabilities of pandas at least; that should be marginally more efficient.
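As a sketch of that suggestion: the nested loop that builds `L` looks up each product's rating, which a left merge against a deduplicated lookup table does in one vectorized step. Illustrative data, mirroring the question's layout:

```python
import pandas as pd

data = pd.DataFrame({'Product': ['Product 1', 'Product 1', 'Product 2'],
                     'Rating':  ['Good', 'Good', 'Bad']})
# One row per matrix row/column, product-major
rows = pd.DataFrame({'Product': ['Product 1'] * 5 + ['Product 2'] * 5})

# Deduplicated Product -> Rating lookup, then a left merge
# (how='left' preserves the order of `rows`)
lookup = data[['Product', 'Rating']].drop_duplicates()
ratings = rows.merge(lookup, on='Product', how='left')['Rating'].tolist()
```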