This post is adapted from this Code Review question: https://codereview.stackexchange.com/posts/292885/edit (Alternatives to iterrow loops in python pandas dataframes).
I have a piece of code that calculates price sensitivity based on a product and its rating.
Below is the original data set with product type, reported year, customer rating, price per unit, and quantity. There are two products, Product 1 and Product 2. Each product has five reported years; the rating is the same for a given product every year, but price and quantity vary from year to year.
Input data:
import numpy as np
import pandas as pd
from itertools import cycle

data = pd.DataFrame(
    [['Product 1', 'Year 1', 'Good', 34, 7],
     ['Product 1', 'Year 2', 'Good', 22, 5],
     ['Product 1', 'Year 3', 'Good', 30, 2],
     ['Product 1', 'Year 4', 'Good', 50, 1],
     ['Product 1', 'Year 5', 'Good', 44, 103],
     ['Product 2', 'Year 1', 'Bad', 200, 12],
     ['Product 2', 'Year 2', 'Bad', 103, 50],
     ['Product 2', 'Year 3', 'Bad', 150, 192],
     ['Product 2', 'Year 4', 'Bad', 309, 20],
     ['Product 2', 'Year 5', 'Bad', 200, 12]],
    columns=['Product', 'Year', 'Rating', 'Price', 'Quantity'])
I then created 2 correlation matrices, namely product and rating matrices.
Product correlation matrix:
#Product correlation matrix
#1. Create a dataframe df_corr_product that initially has correlation of 0.5
row, col = 5 * len(data['Product'].unique().tolist()), 5 * len(data['Product'].unique().tolist()) + 1
df_corr_product = pd.DataFrame.from_records([[0.5]*col]*row)
df_corr_product = df_corr_product.loc[ : , df_corr_product.columns != 0]
#2. CREATE NEW COLUMNS in df_corr_product
#year
year_list = cycle(['Year 1', 'Year 2', 'Year 3', 'Year 4', 'Year 5'])
df_corr_product['Year'] = [next(year_list) for i in range(len(df_corr_product))]
df_corr_product.insert(0, 'Year', df_corr_product.pop('Year'))
#product
product_list = data['Product'].unique().tolist()
rep = 5
df_corr_product['Product'] = [j for j in product_list for i in range(rep)]
df_corr_product.insert(1, 'Product', df_corr_product.pop('Product'))
#rating
df_rating = data[['Product', 'Rating']].drop_duplicates()
rating_list = [list(i) for i in zip(df_rating['Product'], df_rating['Rating'])]
product_list_1 = df_corr_product['Product'].tolist()
L = []
for i in range(len(product_list_1)):
    for j in range(len(rating_list)):
        if product_list_1[i] == rating_list[j][0]:
            L.append(rating_list[j][1])
df_corr_product['Rating'] = L
df_corr_product.insert(2, 'Rating', df_corr_product.pop('Rating'))
#3. INSERT YEAR, PRODUCT'S NAME, RATING HEADERS TO df_corr_product
#Year
df_corr_product.loc[-1] = ['', '', ''] + [next(year_list) for i in range(len(df_corr_product))]
df_corr_product.iloc[-1] = df_corr_product.iloc[-1].astype(str)
df_corr_product.index = df_corr_product.index + 1
df_corr_product = df_corr_product.sort_index()
#Product
df_corr_product.loc[-1] = ['', '', ''] + [ele for ele in product_list for i in range(rep)]
df_corr_product.iloc[-1] = df_corr_product.iloc[-1].astype(str)
df_corr_product.index = df_corr_product.index + 1
df_corr_product = df_corr_product.sort_index()
#Rating
df_corr_product.loc[-1] = ['', '', ''] + L
df_corr_product.iloc[-1] = df_corr_product.iloc[-1].astype(str)
df_corr_product.index = df_corr_product.index + 1
df_corr_product = df_corr_product.sort_index()
new_labels = pd.MultiIndex.from_arrays([df_corr_product.columns, df_corr_product.iloc[0], df_corr_product.iloc[1]], names=['Year', 'Rating', 'Product'])
df_corr_product = df_corr_product.set_axis(new_labels, axis=1).iloc[3:].reset_index().drop('index', axis = 1)
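For reference, the same labelled 10×10 scaffold can be built directly, without inserting and re-sorting header rows. This is a sketch, not a drop-in replacement: it assumes the fixed 2-product × 5-year layout, and `rating_of` is a hypothetical mapping introduced here for illustration.

```python
import pandas as pd

products = ['Product 1', 'Product 2']
years = ['Year 1', 'Year 2', 'Year 3', 'Year 4', 'Year 5']
rating_of = {'Product 1': 'Good', 'Product 2': 'Bad'}  # assumed fixed per product

# One (Year, Rating, Product) tuple per matrix column, product-major
labels = pd.MultiIndex.from_tuples(
    [(y, rating_of[p], p) for p in products for y in years],
    names=['Year', 'Rating', 'Product'])
scaffold = pd.DataFrame(0.5, index=range(len(labels)), columns=labels)
```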
#4. Update correlation: if column & row has the same product name, 1, otherwise, keep 0.5
for i, j in df_corr_product.iterrows():
    i = df_corr_product.index.tolist()[0]
    while i <= len(df_corr_product.index):
        df_corr_product.iloc[i:i+5, i+3:i+8] = 1.0
        i += 5
for i, j in df_corr_product.iterrows():
    df_corr_product.iloc[i, i + 3] = 0.0  # Set 0 on the diagonal
The idea is that:
- Values on the diagonal are 0.
- If a cell's row and column have the same product name, its value is 1; otherwise it is 0.5, as in the screenshot of the expected product correlation matrix.
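That same-product rule can also be expressed without loops as a broadcasting comparison of the row and column product labels. A minimal sketch, assuming the 2-product × 5-year layout above:

```python
import numpy as np

# One product label per row/column of the 10x10 matrix, product-major
products = np.repeat(['Product 1', 'Product 2'], 5)
# True where the row and column refer to the same product
same_product = products[:, None] == products[None, :]
corr_product = np.where(same_product, 1.0, 0.5)
np.fill_diagonal(corr_product, 0.0)
```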
Rating correlation matrix:
#Rating correlation matrix
#1. Create a dataframe df_corr_rating that initially has correlation of 1
row, col = 5 * len(data['Product'].unique().tolist()), 5 * len(data['Product'].unique().tolist()) + 1
df_corr_rating = pd.DataFrame.from_records([[float(1)]*col]*row)
df_corr_rating = df_corr_rating.loc[ : , df_corr_rating.columns != 0]
#2. CREATE NEW COLUMNS in df_corr_rating
#year
df_corr_rating['Year'] = [next(year_list) for i in range(len(df_corr_rating))]
df_corr_rating.insert(0, 'Year', df_corr_rating.pop('Year'))
#product
df_corr_rating['Product'] = [j for j in product_list for i in range(rep)]
df_corr_rating.insert(1, 'Product', df_corr_rating.pop('Product'))
#rating
df_corr_rating['Rating'] = L
df_corr_rating.insert(2, 'Rating', df_corr_rating.pop('Rating'))
#3. INSERT YEAR, PRODUCT'S NAME, RATING HEADERS TO df_corr_rating
#Year
df_corr_rating.loc[-1] = ['', '', ''] + [next(year_list) for i in range(len(df_corr_rating))]
df_corr_rating.iloc[-1] = df_corr_rating.iloc[-1].astype(str)
df_corr_rating.index = df_corr_rating.index + 1
df_corr_rating = df_corr_rating.sort_index()
#Product
df_corr_rating.loc[-1] = ['', '', ''] + [j for j in product_list for i in range(rep)]
df_corr_rating.iloc[-1] = df_corr_rating.iloc[-1].astype(str)
df_corr_rating.index = df_corr_rating.index + 1
df_corr_rating = df_corr_rating.sort_index()
#Rating
df_corr_rating.loc[-1] = ['', '', ''] + L
df_corr_rating.iloc[-1] = df_corr_rating.iloc[-1].astype(str)
df_corr_rating.index = df_corr_rating.index + 1
df_corr_rating = df_corr_rating.sort_index()
new_labels = pd.MultiIndex.from_arrays([df_corr_rating.columns, df_corr_rating.iloc[0], df_corr_rating.iloc[1]], names=['Year', 'Rating', 'Product'])
df_corr_rating = df_corr_rating.set_axis(new_labels, axis=1).iloc[3:].reset_index().drop('index', axis = 1)
#4. UPDATE CORRELATION: set cell value to 0.8 where the row and column ratings differ
for i, j in df_corr_rating.iterrows():
    for k in range(3, len(df_corr_rating.columns)):
        if (df_corr_rating.columns[k][1] == 'Bad' and df_corr_rating.iloc[i, 2] == 'Good') or (df_corr_rating.columns[k][1] == 'Good' and df_corr_rating.iloc[i, 2] == 'Bad'):
            df_corr_rating.iloc[i, k] = 0.8
for i, j in df_corr_rating.iterrows():
    df_corr_rating.iloc[i, i + 3] = 0.0  # Set 0 on the diagonal
- Diagonal values should be set to 0.
- If a cell's row and column have the same rating (i.e., both are "Good"), it is 1; otherwise it is 0.8 (e.g., row "Good" and column "Bad" gives 0.8), as in the screenshot of the expected rating correlation matrix.
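The rating rule vectorizes the same way as the product rule: compare the row and column rating labels with broadcasting. A sketch, assuming Product 1 is "Good" and Product 2 is "Bad" as in the data above:

```python
import numpy as np

# One rating label per row/column, matching the product-major layout
ratings = np.repeat(['Good', 'Bad'], 5)
# True where the row and column ratings agree
same_rating = ratings[:, None] == ratings[None, :]
corr_rating = np.where(same_rating, 1.0, 0.8)
np.fill_diagonal(corr_rating, 0.0)
```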
Finally, I take the outer product of the "Price" column with itself and multiply it elementwise by the elementwise product of the two correlation matrices.
df_product_main = df_corr_product.iloc[:, 3:]
df_rating_main = df_corr_rating.iloc[:, 3:]
df_product_rating = df_product_main.to_numpy() * df_rating_main.to_numpy()
s = data[['Price']].to_numpy()     # prices as a column vector
v = df_product_rating              # combined correlation matrix
t = np.multiply(s, s.transpose())  # outer product of the prices
u = np.multiply(t, v)              # apply the combined correlations
z = pd.DataFrame(u)
The final output is:
print(z)
The point is, when my data is limited to fewer than 1,000 rows, the code runs reasonably well. However, with more than 10,000 rows it appears to hang: the run time exceeds 3 hours and it eventually crashes. I assume the root cause is the loops in the matrix-building steps. Are there better ways to optimize this?
1 Answer
Just to give you an idea.
I'd start by really making sure that I need to calculate that huge matrix. I'm assuming you do.
Generally, if you are doing number crunching in pandas that you could be doing in numpy, then stop and use numpy.
You have already used an outer product here: `np.multiply(s, s.transpose())`. The same idea lets you build masks, and use those to multiply the necessary elements of this matrix. For example:
df = data.groupby(['Product', 'Year']).first()  # keeps the product-major row order used above

rating = (df['Rating'] == 'Good').to_numpy()
rating_mask = rating.reshape((-1, 1)) == rating.reshape((1, -1))

product = df.index.get_level_values('Product').to_numpy()
product_mask = product.reshape((-1, 1)) == product.reshape((1, -1))

p = df['Price'].astype(float).to_numpy()
price_table = p * p.reshape((-1, 1))
# If you know that the prices are divisible by 5,
# then you can do `//= 5, *= 4` instead,
# and you don't need the `.astype(float)` above.
# Money is better left in the hands of integers.
price_table[~rating_mask] *= 0.8   # different ratings -> factor 0.8
price_table[~product_mask] *= 0.5  # different products -> factor 0.5
# zero the diagonal
np.fill_diagonal(price_table, 0)
`L` looks very scary: it's a cross join done in Python with quadratic complexity. Check out the merge capabilities of pandas at least; that should be marginally more efficient.
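As a sketch of that suggestion: the nested loop that builds `L` looks up each product's rating, which a left merge against a deduplicated lookup table does in one vectorized step. Illustrative data, mirroring the question's layout:

```python
import pandas as pd

data = pd.DataFrame({'Product': ['Product 1', 'Product 1', 'Product 2'],
                     'Rating':  ['Good', 'Good', 'Bad']})
# One row per matrix row/column, product-major
rows = pd.DataFrame({'Product': ['Product 1'] * 5 + ['Product 2'] * 5})

# Deduplicated Product -> Rating lookup, then a left merge
# (how='left' preserves the order of `rows`)
lookup = data[['Product', 'Rating']].drop_duplicates()
ratings = rows.merge(lookup, on='Product', how='left')['Rating'].tolist()
```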