1

TLDR: How to iterate across all options of multiple columns in a pandas dataframe without specifying the columns or their values explicitly?

Long Version: I have a pandas dataframe that looks like this, only it has a lot more features or drug dose combinations than are listed here. Instead of just 3 types of features, it could have something like 70...:

> dosage_df
First Score Last Score A_dose B_dose C_dose
22 28 1 40 130
55 11 2 40 130
15 72 3 40 130
42 67 1 90 130
90 74 2 90 130
87 89 3 90 130
14 43 1 40 700
12 61 2 40 700
41 5 3 40 700

Along with my data frame, I also have a python dictionary with the relevant ranges for each feature. The keys are the feature names, and the different values which it can take are the keys:

> dict_of_dose_ranges = {'A_dose': [1, 2, 3], 'B_dose': [40, 90], 'C_dose': [130,700]}

For my purposes, I need to generate a particular combination (say A_dose = 1, B_dose = 90, and C_dose = 700), and based on those settings take the relevant slice out of my dataframe, and do relevant calculations from that smaller subset, and save the results somewhere.

I need to do this for ALL possible combinations of ALL of my features (far more than the 3 which are here, and which will be variable in the future).

In this case, I could easily pop this into SkLearn's Parameter grid, generate the options:

> from sklearn.grid_search import ParameterGrid
> all_options = list(ParameterGrid(dict_of_dose_ranges)) 
> all_options

and get:

[{'A_dose': 1, 'B_dose': 40, 'C_dose': 130},
 {'A_dose': 1, 'B_dose': 40, 'C_dose': 700},
 {'A_dose': 1, 'B_dose': 90, 'C_dose': 130},
 {'A_dose': 1, 'B_dose': 90, 'C_dose': 700},
 {'A_dose': 2, 'B_dose': 40, 'C_dose': 130},
 {'A_dose': 2, 'B_dose': 40, 'C_dose': 700},
 {'A_dose': 2, 'B_dose': 90, 'C_dose': 130},
 {'A_dose': 2, 'B_dose': 90, 'C_dose': 700},
 {'A_dose': 3, 'B_dose': 40, 'C_dose': 130},
 {'A_dose': 3, 'B_dose': 40, 'C_dose': 700},
 {'A_dose': 3, 'B_dose': 90, 'C_dose': 130},
 {'A_dose': 3, 'B_dose': 90, 'C_dose': 700}]

This is where I run into problems:

Problem #1) I can now iterate across all_options, but I'm not sure how to now SELECT out of my dosage_df from each of the dictionary options (i.e. {'A_dose': 1, 'B_dose': 40, 'C_dose': 130}) WITHOUT doing it explicitly.

In the past, I could do something like:

dosage_df[(dosage_df.A_dose == 1) & (dosage_df.B_dose == 40) & (dosage_df.C_dose == 130)]
First Score Last Score A_dose B_dose C_dose
0 22 28 140 130

But now I'm not sure what to put inside the brackets to slice it dynamically...

dosage_df[?????]

Problem #2) When I actually enter in my full dictionary of features with their respective ranges, I get an error because it deems it as having too many options...

from sklearn.grid_search import ParameterGrid
all_options = list(ParameterGrid(dictionary_of_features_and_ranges)) 
all_options
---------------------------------------------------------------------------
OverflowError Traceback (most recent call last)
<ipython-input-138-7b73d5e248f5> in <module>()
 1 from sklearn.grid_search import ParameterGrid
----> 2 all_options = list(ParameterGrid(dictionary_of_features_and_ranges))
 3 all_options
OverflowError: long int too large to convert to int

I tried a number of alternate approaches including using double while loops, a tree / recursion method from here, another recursion method from here, but it wasn't coming together.... Any help is much appreciated.

asked Sep 7, 2016 at 20:31

2 Answers 2

2

You can use itertools.product to generate all possible dosage combinations, and DataFrame.query to do the selection:

from itertools import product
for dosage_comb in product(*dict_of_dose_ranges.values()):
 dosage_items = zip(dict_of_dose_ranges.keys(), dosage_comb)
 query_str = ' & '.join('{} == {}'.format(*x) for x in dosage_items)
 sub_df = dosage_df.query(query_str)
 # Do Stuff...
answered Sep 7, 2016 at 23:47
0

What about using the underlying numpy array and some boolean logic to build an array containing only the lines you want ?

dosage_df = pd.DataFrame((np.random.rand(40000,10)*100).astype(np.int))
dict_of_dose_ranges={3:[10,11,12,13,15,20],4:[20,22,23,24]}
#combined_doses will be bool array that will select all the lines that match the wanted combinations of doses
combined_doses=np.ones(dosage_df.shape[0]).astype(np.bool)
for item in dict_of_dose_ranges.items():
 #item[0] is the kind of dose
 #item[1] are the values of that kind of dose
 next_dose=np.zeros(dosage_df.shape[0]).astype(np.bool)
 #we then iterate over the wanted values
 for value in item[1]:
 # we select and "logical or" all lines matching the values
 next_dose|=(dosage_df[item[0]] == value)
 # we "logical and" all the kinds of dose
 combined_doses&=next_dose
print(dosage_df[combined_doses])
answered Sep 7, 2016 at 21:21
4
  • Were you able to get this to work? I got an error: IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices Commented Sep 8, 2016 at 3:31
  • I've edited a bit the code to correct a mistake arr[:,item[0]] and swapped arthmetic for bitwise operators and added example values so that it is copy/pastable. As it is now, it runs fine for me. Commented Sep 8, 2016 at 12:29
  • I think it only works for you b/c the columns of dosage_df and dict_of_dose_ranges has keys which are integers. Commented Sep 8, 2016 at 21:05
  • @Afflatus I fixed it, it should work with all column names now. Commented Sep 9, 2016 at 7:15

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.