I have some questionnaire data in CSV files for different projects. I created a function that takes a specific subset of columns and calculates aggregated values. The problem is that across these different projects the column names are different, but they still need to be aggregated in the same way.
The way I'm handling this right now is as follows...
Each project uses a different Python script, where I use a dictionary to map keys to specific columns in my dataframe/CSV file:
import pandas as pd

df = pd.read_csv("data.csv")

q_map = {'q1': df['question1'],
         'q2': df['question2'],
         'q3_h': df['question3_hours'],
         'q3_m': df['question3_minutes']}
A different q_map is needed for each project because the column names will vary. For example, here q1 is mapped to df['question1'], but in another project it might be called df['q1_1']. I then pass q_map into my aggregation function:
def aggregate(q_map):
    if len(q_map) != 4:
        raise Exception("Incorrect number of items")
    total_a = q_map['q1'] + q_map['q2']
    total_minutes = q_map['q3_h']*60 + q_map['q3_m']
    return total_a, total_minutes
total, minutes = aggregate(q_map)
So, in essence, the dictionary is used as a way to ensure that the column names are always the same within the function. That way the function itself doesn't need to care if columns are named differently across projects; everything will still be aggregated in the same way.
This isn't very user-friendly for (at least) 2 reasons:
- The end user needs to pass in an exact number of columns for the aggregation to work. I'm handling this right now with the Exception, but there's no intuitive way for the user to know exactly how many columns need to be passed in without reading documentation.
- The keys need to be the same as what is used internally by the function (e.g. q1, q3_h). Again, it's difficult for the user to know exactly how to name their keys when creating the dictionary. An incorrectly named key will cause problems, as in the example below.
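For example, a user who guesses the key names (the names below are hypothetical) only finds out inside the function:

# Hypothetical user mistake: guessing the key names instead of using the ones
# aggregate() expects internally ('q1', 'q2', 'q3_h', 'q3_m')
bad_map = {'question1': df['question1'],
           'question2': df['question2'],
           'hours': df['question3_hours'],
           'minutes': df['question3_minutes']}

aggregate(bad_map)  # passes the length check, then raises KeyError: 'q1'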
I feel the natural solution is just to use named arguments in my signature like:
def aggregate(q1, q2, q3_h, q3_m):
    pass
That way the user doesn't need to care about naming or how many columns are passed in. However, in reality this function uses 42 different columns for aggregation, and I feel like a function signature of that length would get unwieldy and make it easy to pass columns in the wrong order.
Is there a more sensible way (other than named arguments) to handle this type of situation, where you need to enforce a specific number and specific name of arguments going into a function?
2 Answers
This is a data normalization problem. You can handle it by adding a specific step to your program that checks the user input is correct and can be aggregated, and raises an error otherwise.
To keep the documentation up to date, you can publish a data model in the form of a Python file and/or documentation generated from that data model.
Here is an example:
import pandas as pd

my_data_model = {
    'q1': {
        'mandatory': True,
        'description': 'Column 1',
        'type': int
    },
    'q2': {
        'mandatory': False,
        'description': 'Column 2',
        'type': str
    },
    # ...
}
def normalize(mapping, df):
    for key, field_info in my_data_model.items():
        if field_info['mandatory'] and key not in mapping:
            raise Exception("Mandatory field {} is missing".format(key))
    for key, value in list(mapping.items()):
        if key not in my_data_model:
            del mapping[key]  # Silently remove, or raise an error if needed
        elif value not in df:
            raise Exception("Mapped column {} doesn't exist in data frame".format(value))
        # Why not check types while we're at it
        elif df.dtypes[value] != my_data_model[key]['type']:
            raise Exception("Mapped column {} type mismatch, expected {} got {}".format(
                key, my_data_model[key]['type'], df.dtypes[value]))
    # Renaming for easier internal use
    for key, value in mapping.items():
        df.rename(columns={value: key}, inplace=True)
    return df
def aggregate(df):
    print(df['q1'])
def main(mapping, df):
    return aggregate(normalize(mapping, df))
for mapping in [{}, {'q1': 'asdf'}, {'q1': 'col2'}, {'q1': 'col1'}]:
    try:
        main(mapping, pd.DataFrame({'col1': [6, 7], 'col2': ['a', 'b']}))
    except Exception as e:
        print(e)

# Prints normalisation errors for all but the last mapping
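As a small, hypothetical extension (not part of the snippet above), the same data model can be turned into the user-facing documentation, so the docs can never drift out of date:

def describe_data_model(model=my_data_model):
    # Build a human-readable field reference straight from the data model
    lines = []
    for key, info in model.items():
        required = "required" if info['mandatory'] else "optional"
        lines.append("{} ({}, {}): {}".format(
            key, info['type'].__name__, required, info['description']))
    return "\n".join(lines)

print(describe_data_model())
# q1 (int, required): Column 1
# q2 (str, optional): Column 2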
Use keyword arguments. Keyword arguments do not have to be passed in order. There are many proponents of keyword-only arguments for this and other reasons.
def aggregate(**kwargs):
    # kwargs['q1'], etc. are available here.
    pass
You can pass this a dictionary like q_map (unpacked as aggregate(**q_map)), or pass something like q3_m=this, q1=that, etc., and the order won't matter.
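A rough sketch of the keyword-only idea, reusing the names from the question (this is my illustration, not a prescription): a bare * in the signature forces callers to use the internal names, and Python itself reports a missing or misspelled keyword.

def aggregate(*, q1, q2, q3_h, q3_m):
    # Callers must name every column; a wrong name or a wrong count raises
    # a TypeError automatically, so no manual length check is needed.
    total_a = q1 + q2
    total_minutes = q3_h * 60 + q3_m
    return total_a, total_minutes

total, minutes = aggregate(**q_map)   # dict keys become keyword arguments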
If you are talking about:
- 42 potential different columns
- of which four will be used as q1, q2, q3_h, q3_m
- and all of these mappings are known

then I would build a wrapper function that has mappings of all the potential columns to appropriate variables and then calls aggregate (a rough sketch of what I mean is below).
It's possible I'm not completely understanding the problem, but this is the best given what you've described.
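A minimal sketch of such a wrapper (the column table and helper name below are made up; in reality the table would cover all 42 columns across every project):

# Hypothetical lookup table: every known project column name -> canonical key
KNOWN_COLUMNS = {
    'question1': 'q1', 'q1_1': 'q1',
    'question2': 'q2', 'q2_1': 'q2',
    'question3_hours': 'q3_h',
    'question3_minutes': 'q3_m',
}

def aggregate_any(df):
    # Build the mapping automatically from whatever columns the frame has,
    # then call the existing aggregate() with the canonical keys.
    q_map = {KNOWN_COLUMNS[col]: df[col] for col in df.columns if col in KNOWN_COLUMNS}
    return aggregate(q_map)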
- I'm not sure how this addresses the problem. What would you pass into the wrapper function, and how do you get around using something like q_map? I'm trying to make it so that the user doesn't need to create a mapping explicitly, because otherwise they'd need to know all the key names, how many columns, etc. – Simon, Apr 8, 2018 at 2:27
- I was envisioning you pre-generating all mappings in a wrapper function that calls aggregate. You pass in all the columns and the wrapper determines the mapping from the names. That's the only way that I could see to avoid users creating mappings like q_map. – bikemule, Apr 12, 2018 at 15:38