\$\begingroup\$

I have a pandas dataframe, my_calculation, that is a Cartesian product of 4 different categories (Name, Freqset, Formula, and Location) plus two additional Address columns. There are 3 different names, 12 different frequency sets, 14 different formulas, and 13 different locations, making 3 x 12 x 14 x 13 = 6552 rows. There can be any number of possible addresses (their possible states are stored in a different frame), but in this frame each location has exactly one Address1 and one Address2.

0     Name   Freqset    Formula  Location    Address1  Address2
1     Jeff   freqset1   form1    New York    Box 12    Box 15
2     Jeff   freqset1   form1    Buffalo     Box 60    Box 11
3     Jeff   freqset1   form1    Miami       Box 10    Box 80
4...............................................................
................................................................
6551  Leroy  freqset12  form14   Charleston  Box 100   Box 28

I only made the dataframe in this manner because that's how I would logically set it up in a db table for maintainability. If there is a more efficient structure for this dataframe, or an easier structure to work with, please let me know.

I need to add a column for frequencies for each address (Freq1 for Address1 and Freq2 for Address2) and then a column for a probability calculation I will call Calculation Result that is based on the Formula column (I haven't addressed this yet). I want to use this frame as an index to do what will amount to millions of additional calculations. My thinking is that a dataframe of values to choose millions of results from SHOULD be more efficient than doing millions of calculations on the fly. If this is flawed thinking, please let me know.

The frequencies are stored in a dictionary of dataframes. An example of one dataframe:

   Address  New York  Buffalo  Miami
0  Box 1    0.24560   0.95000  0.25000
1  Box 2    0.00100   0.45190  0.65091
.............................................
n  Box n    0.45450   0.20341  0.11110

They are accessed like this:

freq_dict.freq['freqset1']
freq_dict.freq['freqset2']

etc...

The frequencies come out of Excel automatically in this structure, and I am unsure whether reshaping it from its current state would make things any easier or more efficient.
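One reshaping that tends to make lookups simpler is a single long table with one row per (Freqset, Address, Location). The following is only a sketch under assumptions: the dictionary `freq` stands in for `freq_dict.freq`, the `freqset2` values are invented for illustration, and the names `freq_long` and `Freq` are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for freq_dict.freq: two small frames in the wide
# layout shown above. The freqset2 numbers are made up for illustration.
freq = {
    'freqset1': pd.DataFrame({
        'Address': ['Box 1', 'Box 2'],
        'New York': [0.24560, 0.00100],
        'Buffalo': [0.95000, 0.45190],
        'Miami': [0.25000, 0.65091],
    }),
    'freqset2': pd.DataFrame({
        'Address': ['Box 1', 'Box 2'],
        'New York': [0.10, 0.20],
        'Buffalo': [0.30, 0.40],
        'Miami': [0.50, 0.60],
    }),
}

# Tag each frame with its dictionary key, then stack them into one frame.
stacked = pd.concat(
    [df.assign(Freqset=key) for key, df in freq.items()],
    ignore_index=True,
)

# Unpivot the location columns: one row per (Freqset, Address, Location).
freq_long = stacked.melt(
    id_vars=['Freqset', 'Address'],
    var_name='Location',
    value_name='Freq',
)

print(freq_long.head())
```

In this layout every frequency is addressed by three plain column values, so it can be joined against another frame instead of being queried one cell at a time.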

Here is the only way I can get the Freq1 and Freq2 columns in my_calculation to populate. I'm almost positive this isn't the way to go about things.

import numpy as np
import pandas as pd

# my_calculation and freq_dict are built elsewhere and must already exist

# arrays of Cartesian categories
names = np.array(['Jeff', 'Jenn', 'Leroy'])
locations = np.array(['New York', 'Buffalo', 'Miami', 'Tampa', 'Boston',
                      'Pittsburgh', 'Portland', 'Seattle', 'Toronto',
                      'Witchita', 'Austin', 'Bangor', 'Charleston'])
freqsets = np.array(['freqset1', 'freqset2', 'freqset3', 'freqset4',
                     'freqset5', 'freqset6', 'freqset7', 'freqset8',
                     'freqset9', 'freqset10', 'freqset11', 'freqset12'])
formulas = np.array(['form1', 'form2', 'form3', 'form4', 'form5',
                     'form6', 'form7', 'form8', 'form9', 'form10',
                     'form11', 'form12', 'form13', 'form14'])
temp_arr1 = np.zeros((6552, ))
temp_arr2 = np.zeros((6552, ))
i = 0
# i walks my_calculation row by row, so this loop order must match its row order
for name in names:
    for location in locations:
        for freqset in freqsets:
            for formula in formulas:
                freqs = freq_dict.freq[freqset]
                loc = my_calculation['Location'][i]
                # the address is a string, so it has to be quoted inside the query expression
                temp_arr1[i] = freqs.query(f"Address == {my_calculation['Address1'][i]!r}")[loc].item()
                # second lookup goes into temp_arr2 (it previously overwrote temp_arr1)
                temp_arr2[i] = freqs.query(f"Address == {my_calculation['Address2'][i]!r}")[loc].item()
                i += 1
my_calculation['Freq1'] = pd.Series(temp_arr1)
my_calculation['Freq2'] = pd.Series(temp_arr2)
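For comparison, the same two columns can be filled without a Python-level loop by joining against a long-format frequency table. This is a minimal sketch, not a drop-in replacement: the two-row `my_calculation`, the table `freq_long`, and the helper `add_freq` are all hypothetical, and the address assignments are invented so that the sample frequencies above can be reused.

```python
import pandas as pd

# Hypothetical miniature of my_calculation (two rows, invented addresses).
my_calculation = pd.DataFrame({
    'Name': ['Jeff', 'Jeff'],
    'Freqset': ['freqset1', 'freqset1'],
    'Formula': ['form1', 'form1'],
    'Location': ['New York', 'Buffalo'],
    'Address1': ['Box 1', 'Box 2'],
    'Address2': ['Box 2', 'Box 1'],
})

# Hypothetical long-format frequencies: one row per
# (Freqset, Address, Location) combination.
freq_long = pd.DataFrame({
    'Freqset': ['freqset1'] * 4,
    'Address': ['Box 1', 'Box 2', 'Box 1', 'Box 2'],
    'Location': ['New York', 'New York', 'Buffalo', 'Buffalo'],
    'Freq': [0.24560, 0.00100, 0.95000, 0.45190],
})


def add_freq(calc, freqs, address_col, freq_col):
    """Left-join the frequency that matches address_col onto calc."""
    lookup = freqs.rename(columns={'Address': address_col, 'Freq': freq_col})
    return calc.merge(lookup, on=['Freqset', 'Location', address_col], how='left')


result = add_freq(my_calculation, freq_long, 'Address1', 'Freq1')
result = add_freq(result, freq_long, 'Address2', 'Freq2')
print(result)
```

A left merge keeps every row of the calculation frame in its original order and leaves NaN where no frequency exists, which makes missing addresses visible instead of raising inside a loop.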

Now my_calculation dataframe looks like this (last column not part of code yet):

0     Name   Freqset    Formula  Location    Address1  Address2  Freq1    Freq2    Calculation Result
1     Jeff   freqset1   form1    New York    Box 12    Box 15    0.00020  0.23140  TBD
2     Jeff   freqset1   form1    Buffalo     Box 60    Box 11    0.05121  0.15432  TBD
3     Jeff   freqset1   form1    Miami       Box 10    Box 80    0.12120  0.54230  TBD
4......................................................................................
.......................................................................................
6551  Leroy  freqset12  form14   Charleston  Box 100   Box 28    0.32001  0.00023  TBD

In the long run I want to be able to create this table quickly because I want to do it thousands of times. Right now it takes about 28 seconds on my machine to do just one, and I have yet to populate a Calculation Result column.

Let me know if something doesn't quite run correctly. I am transcribing code across machines and can correct easily.

asked Sep 9, 2020 at 15:45
\$\endgroup\$
  • Before this gets away from me, I already see one place I can improve. I'm doing 6552 queries of the dictionary. I should be able to reduce this to 312 (i.e., 12 freqsets x 26 addresses). Commented Sep 9, 2020 at 18:40
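The count in that comment can be checked with a short sketch. Everything below is an assumption made for illustration: the location names are placeholders, and each location is given two distinct invented addresses, which is what the 26-address figure implies.

```python
import pandas as pd

names = ['Jeff', 'Jenn', 'Leroy']
freqsets = [f'freqset{i}' for i in range(1, 13)]
formulas = [f'form{i}' for i in range(1, 15)]
locations = [f'location{i}' for i in range(1, 14)]  # placeholder names

# Hypothetical assignment: two distinct addresses per location.
address_by_location = pd.DataFrame({
    'Location': locations,
    'Address1': [f'Box {2 * i + 1}' for i in range(13)],
    'Address2': [f'Box {2 * i + 2}' for i in range(13)],
})

# Full Cartesian product with the addresses attached by location.
index = pd.MultiIndex.from_product(
    [names, freqsets, formulas, locations],
    names=['Name', 'Freqset', 'Formula', 'Location'],
)
calc = index.to_frame(index=False).merge(address_by_location, on='Location')

# Distinct (Freqset, Location, Address) lookups actually required.
pairs = pd.concat([
    calc[['Freqset', 'Location', 'Address1']].rename(columns={'Address1': 'Address'}),
    calc[['Freqset', 'Location', 'Address2']].rename(columns={'Address2': 'Address'}),
]).drop_duplicates()

print(len(calc), len(pairs))
```

Name and Formula do not influence which frequency is read, so the distinct lookups collapse to freqsets times addresses.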

1 Answer

\$\begingroup\$

You should align to a multi-index with no nested loops:

import pandas as pd

freq_df = pd.DataFrame(
    index=pd.MultiIndex.from_arrays(
        arrays=(
            ('freqset1', 'freqset1', 'freqset1'),
            (0, 1, 2),
        ),
        names=('freqset', 'id'),
    ),
    data={
        'Address': ('Box 1', 'Box 2', 'Box n'),
        'New York': (0.24560, 0.00100, 0.45450),
        'Buffalo': (0.95000, 0.45190, 0.20341),
        'Miami': (0.25000, 0.65091, 0.11110),
    },
)
names = ['Jeff', 'Jenn', 'Leroy']
locations = [
    'New York', 'Buffalo', 'Miami', 'Tampa', 'Boston',
    'Pittsburgh', 'Portland', 'Seattle', 'Toronto',
    'Witchita', 'Austin', 'Bangor', 'Charleston',
]
freqsets = [
    'freqset1', 'freqset2', 'freqset3', 'freqset4',
    'freqset5', 'freqset6', 'freqset7', 'freqset8',
    'freqset9', 'freqset10', 'freqset11', 'freqset12',
]
formulas = [
    'form1', 'form2', 'form3', 'form4', 'form5',
    'form6', 'form7', 'form8', 'form9', 'form10',
    'form11', 'form12', 'form13', 'form14',
]
cartesian = pd.MultiIndex.from_product(
    iterables=(names, locations, freqsets, formulas),
    names=('name', 'location', 'freqset', 'formula'),
)

The multi-index looks like

MultiIndex([( 'Jeff', 'New York', 'freqset1', 'form1'),
 ( 'Jeff', 'New York', 'freqset1', 'form2'),
 ( 'Jeff', 'New York', 'freqset1', 'form3'),
 ( 'Jeff', 'New York', 'freqset1', 'form4'),
 ( 'Jeff', 'New York', 'freqset1', 'form5'),
 ( 'Jeff', 'New York', 'freqset1', 'form6'),
 ( 'Jeff', 'New York', 'freqset1', 'form7'),
 ( 'Jeff', 'New York', 'freqset1', 'form8'),
 ( 'Jeff', 'New York', 'freqset1', 'form9'),
 ( 'Jeff', 'New York', 'freqset1', 'form10'),
 ...
 ('Leroy', 'Charleston', 'freqset12', 'form5'),
 ('Leroy', 'Charleston', 'freqset12', 'form6'),
 ('Leroy', 'Charleston', 'freqset12', 'form7'),
 ('Leroy', 'Charleston', 'freqset12', 'form8'),
 ('Leroy', 'Charleston', 'freqset12', 'form9'),
 ('Leroy', 'Charleston', 'freqset12', 'form10'),
 ('Leroy', 'Charleston', 'freqset12', 'form11'),
 ('Leroy', 'Charleston', 'freqset12', 'form12'),
 ('Leroy', 'Charleston', 'freqset12', 'form13'),
 ('Leroy', 'Charleston', 'freqset12', 'form14')],
 length=6552)

I can't get any further into a review because your code is non-reproducible; Address1 and Address2 values are unclear.
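If the addresses were known, one possible continuation is to key the frequencies by (freqset, Address, location) and reindex against the product. This is only a sketch under assumptions: the `address1` mapping is invented because the real Address1 values are not given, and the product is shrunk to four rows so it stays readable.

```python
import pandas as pd

# Frequencies in the same layout as freq_df above.
freq_df = pd.DataFrame(
    index=pd.MultiIndex.from_arrays(
        arrays=(('freqset1',) * 3, (0, 1, 2)),
        names=('freqset', 'id'),
    ),
    data={
        'Address': ('Box 1', 'Box 2', 'Box n'),
        'New York': (0.24560, 0.00100, 0.45450),
        'Buffalo': (0.95000, 0.45190, 0.20341),
        'Miami': (0.25000, 0.65091, 0.11110),
    },
)

# Series of frequencies indexed by (freqset, Address, location).
freq = (
    freq_df.reset_index('freqset')
    .melt(id_vars=['freqset', 'Address'], var_name='location', value_name='freq')
    .set_index(['freqset', 'Address', 'location'])['freq']
)

# Reduced Cartesian product, same level order as in the answer.
cartesian = pd.MultiIndex.from_product(
    iterables=(['Jeff'], ['New York', 'Buffalo'], ['freqset1'], ['form1', 'form2']),
    names=('name', 'location', 'freqset', 'formula'),
)
calc = cartesian.to_frame(index=False)

# Hypothetical Address1 per location (the real mapping is not in the question).
address1 = {'New York': 'Box 1', 'Buffalo': 'Box 2'}
calc['Address1'] = calc['location'].map(address1)

# Align every row's (freqset, address, location) key to the frequency series.
keys = pd.MultiIndex.from_frame(calc[['freqset', 'Address1', 'location']])
calc['Freq1'] = freq.reindex(keys).to_numpy()
print(calc)
```

Reindexing returns NaN for any key that has no frequency, so gaps in the address data show up in the result rather than stopping the computation.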

answered Dec 20, 2024 at 13:18
\$\endgroup\$
