I have some questionnaire data in CSV files for different projects. I created a function that takes a specific subset of columns and calculates aggregated values. The problem is that across these different projects the column names are different, but they still need to be aggregated in the same way.
The way I'm handling this right now is as follows...
Each project uses a different Python script, where I use a dictionary to map keys to specific columns in my dataframe/CSV file:
import pandas as pd

df = pd.read_csv("data.csv")

q_map = {'q1': df['question1'],
         'q2': df['question2'],
         'q3_h': df['question3_hours'],
         'q3_m': df['question3_minutes']}
A different q_map is needed for each project because the column names will vary. For example, here q1 is mapped to df['question1'], but in another project it might be called df['q1_1']. I then pass q_map into my aggregation function:
def aggregate(q_map):
    if len(q_map) != 4:
        raise Exception("Incorrect number of items")
    total_a = q_map['q1'] + q_map['q2']
    total_minutes = q_map['q3_h']*60 + q_map['q3_m']
    return total_a, total_minutes
total, minutes = aggregate(q_map)
So, in essence, the dictionary is used as a way to ensure that the column names are always the same within the function. That way the function itself doesn't need to care if columns are named differently across projects; everything will still be aggregated in the same way.
This isn't very user-friendly for (at least) 2 reasons:
- The end user needs to pass in an exact number of columns for the aggregation to work. I'm handling this right now with the Exception, but there's no intuitive way for the user to know exactly how many columns need to be passed in without reading documentation.
- The keys need to be the same as what is used internally by the function (e.g. q1, q3_h). Again, it's difficult for the user to know exactly how to name their keys when creating the dictionary. An incorrectly named key will cause problems, as in the example below.
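For example, a user who guesses the key names (the names below are hypothetical) only finds out inside the function:

# Hypothetical user mistake: guessing the key names instead of using the ones
# aggregate() expects internally ('q1', 'q2', 'q3_h', 'q3_m')
bad_map = {'question1': df['question1'],
           'question2': df['question2'],
           'hours': df['question3_hours'],
           'minutes': df['question3_minutes']}

aggregate(bad_map)  # passes the length check, then raises KeyError: 'q1'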
I feel the natural solution is just to use named arguments in my signature like:
def aggregate(q1, q2, q3_h, q3_m):
    pass
That way the user doesn't need to care about naming or how many columns are passed in. However, in reality this function uses 42 different columns for aggregation, and I feel like a function signature of that length would get unwieldy and make it easy to pass columns in the wrong order.
Is there a more sensible way (other than named arguments) to handle this type of situation, where you need to enforce a specific number and specific name of arguments going into a function?
2 Answers
This is a data normalization problem. You can handle it by adding a specific step to your program that checks the user input is correct and can be aggregated, and raises an error otherwise.
To keep the documentation up to date, you can publish a data model in the form of a Python file and/or documentation generated from that data model.
Here is an example:
import pandas as pd

my_data_model = {
    'q1': {
        'mandatory': True,
        'description': 'Column 1',
        'type': int
    },
    'q2': {
        'mandatory': False,
        'description': 'Column 2',
        'type': str
    },
    # ...
}
def normalize(mapping, df):
    for key, field_info in my_data_model.items():
        if field_info['mandatory'] and key not in mapping:
            raise Exception("Mandatory field {} is missing".format(key))
    for key, value in list(mapping.items()):
        if key not in my_data_model:
            del mapping[key]  # Silently remove, or raise an error if needed
        elif value not in df:
            raise Exception("Mapped column {} doesn't exist in data frame".format(value))
        # Why not check types while we're at it
        elif df.dtypes[value] != my_data_model[key]['type']:
            raise Exception("Mapped column {} type mismatch, expected {} got {}".format(
                key, my_data_model[key]['type'], df.dtypes[value]))
    # Renaming for easier internal use
    for key, value in mapping.items():
        df.rename(columns={value: key}, inplace=True)
    return df
def aggregate(df):
    print(df['q1'])
def main(mapping, df):
    return aggregate(normalize(mapping, df))
for mapping in [{}, {'q1': 'asdf'}, {'q1': 'col2'}, {'q1': 'col1'}]:
    try:
        main(mapping, pd.DataFrame({'col1': [6, 7], 'col2': ['a', 'b']}))
    except Exception as e:
        print(e)

# Prints normalisation errors for all but the last mapping
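As a small, hypothetical extension (not part of the snippet above), the same data model can be turned into the user-facing documentation, so the docs can never drift out of date:

def describe_data_model(model=my_data_model):
    # Build a human-readable field reference straight from the data model
    lines = []
    for key, info in model.items():
        required = "required" if info['mandatory'] else "optional"
        lines.append("{} ({}, {}): {}".format(
            key, info['type'].__name__, required, info['description']))
    return "\n".join(lines)

print(describe_data_model())
# q1 (int, required): Column 1
# q2 (str, optional): Column 2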
Use keyword arguments. Keyword arguments do not have to be passed in order. There are many proponents of keyword-only arguments for this and other reasons.
def aggregate(**kwargs):
    # kwargs['q1'], etc. are available here.
    pass
You can pass this a dictionary like q_map (unpacked as aggregate(**q_map)), or pass something like q3_m=this, q1=that, etc., and the order won't matter.
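A rough sketch of the keyword-only idea, reusing the names from the question (this is my illustration, not a prescription): a bare * in the signature forces callers to use the internal names, and Python itself reports a missing or misspelled keyword.

def aggregate(*, q1, q2, q3_h, q3_m):
    # Callers must name every column; a wrong name or a wrong count raises
    # a TypeError automatically, so no manual length check is needed.
    total_a = q1 + q2
    total_minutes = q3_h * 60 + q3_m
    return total_a, total_minutes

total, minutes = aggregate(**q_map)   # dict keys become keyword arguments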
If you are talking about:
- 42 potential different columns
- of which four will be used as q1, q2, q3_h, q3_m
- and all of these mappings are known

then I would build a wrapper function that has mappings of all the potential columns to appropriate variables and then calls aggregate (a rough sketch of what I mean is below).
It's possible I'm not completely understanding the problem, but this is the best given what you've described.
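A minimal sketch of such a wrapper (the column table and helper name below are made up; in reality the table would cover all 42 columns across every project):

# Hypothetical lookup table: every known project column name -> canonical key
KNOWN_COLUMNS = {
    'question1': 'q1', 'q1_1': 'q1',
    'question2': 'q2', 'q2_1': 'q2',
    'question3_hours': 'q3_h',
    'question3_minutes': 'q3_m',
}

def aggregate_any(df):
    # Build the mapping automatically from whatever columns the frame has,
    # then call the existing aggregate() with the canonical keys.
    q_map = {KNOWN_COLUMNS[col]: df[col] for col in df.columns if col in KNOWN_COLUMNS}
    return aggregate(q_map)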
- I'm not sure how this addresses the problem. What would you pass into the wrapper function, and how do you get around using something like q_map? I'm trying to make it so that the user doesn't need to create a mapping explicitly, because otherwise they'd need to know all the key names, how many columns, etc. – Simon, Apr 8, 2018 at 2:27
- I was envisioning you pre-generating all mappings in a wrapper function that calls aggregate. You pass in all the columns and the wrapper determines the mapping from the names. That's the only way that I could see to avoid users creating mappings like q_map. – bikemule, Apr 12, 2018 at 15:38