Configuration parsing and verification utility for bioinformatics tools

Question 1

I've been working on my first Python library which provides different utilities for a set of bioinformatics tools. I am looking for feedback on this module I made for parsing configurations needed for each tool run.

To give a little more context, users are prompted with a text box to edit the default parameters that a tool runs with given a list of different acceptable values that each option may take on.

config.py

# -*- coding: utf-8 -*-
"""
 toolhelper.config
 ~~~~~~~~~~~~~~~~~
 This module provides the needed functionality 
 for parsing and verifying user-inputted config files.
"""
import sys
import logging
PY2 = sys.version_info[0] == 2
if PY2:
 import ConfigParser
else:
 import configparser as ConfigParser
class Config(object):
 """This class implements config file parsing and validation.
 Methods: 
 ...
 Attributes:
 ...
 """
 def __init__(self, defaults,
 value_map,
 filename,
 sections,
 numericals=None,
 allow_no_val=False):
 self.defaults = defaults
 self.value_map = value_map
 self.filename = filename
 self.sections = sections
 self.numericals = numericals
 self.settings = self._parse()
 self.allow_no_val = allow_no_val
 def _parse(self):
 """Builds dict from user-inputted config settings.
 Returns:
 Dictionary of config settings to be used for a tool run.
 """
 config = ConfigParser.ConfigParser(allow_no_value=self.allow_no_val)
 cfg = config.read(self.filename)
 if not cfg:
 msg = 'Config file not readable: Using default parameters.'
 logging.error(msg)
 return self.defaults
 settings = {}
 # Iterate over defaults keys to get each option to lookup
 for option in self.defaults:
 try:
 input_val = _get_config_value(config, self.sections, option)
 numerical = option in self.numericals.keys()
 # Some of the tools allow for cases where the user can choose from
 # many possible unlisted values for a specific option. This is assumed
 # when option is in defauls by not value_map
 in_value_map = option in self.value_map.keys()
 in_defaults = option in self.defaults.keys()
 # Might want to make this behavior more explicit.
 if in_defaults and in_value_map and not numerical:
 value = input_val
 msg = 'Will check value for option: {} in specific tool.'.format(option)
 logging.debug(msg)
 else:
 value = self._value_map_lookup(option, input_val, num=numerical)
 except (InvalidValueError, InvalidOptionError):
 msg = ('Invalid or missing entry for {}. Using default: '
 ' {}.'.format(option, self.defaults[option]))
 logging.error(msg)
 settings[option] = self.defaults[option]
 else:
 logging.debug('parameter %s = %s is valid.', option, value)
 settings[option] = value
 return settings
 def _value_map_lookup(self, option, value, num=False):
 """Get value corresponding to the value argument in value_map.
 Args:
 value: A key to lookup in the value_map dict.
 numerical: dictionary of numerical entries and their ranges.
 Returns:
 A value obtained from value_map that is suited for the tool's
 logic. value_map is a map of user-inputted values to the values
 that are needed by the tool.
 value_map[option][value], or float(value) if num=True and the
 value is not in value_map.
 Raises:
 InvalidValueError: Value is not found in value_map dictionary. The given
 Value is invalid
 """
 if value in self.value_map[option].keys():
 return self.value_map[option][value]
 elif num:
 try:
 float_val = float(value)
 _check_range(float_val, self.numericals[option][0], self.numericals[option][1])
 except ValueError:
 raise InvalidValueError
 else:
 return float_val
 else:
 raise InvalidValueError
def _get_config_value(config, sections, option):
 """ Fetches value from ConfigParser
 A wrapper around ConfigParser.get(). This function checks
 that the config has the given section/option before calling
 config.get(). If the config does not have the option an
 InvalidOptionError is raised.
 Args:
 config: A RawConfigParser instance.
 sections: list of config sections to check for option.
 option: The parameter name to look up in the config.
 Returns:
 Parameter value for corresponding section and option.
 Raises:
 InvalidOptionError: The config is missing the given option argument.
 """
 for section in sections:
 if config.has_option(section, option):
 return ''.join(config.get(section, option).split()).lower()
 raise InvalidOptionError
def _check_range(value, lower, upper):
 """Check if lower < value < upper or lower < value if upper is 'inf'
 Args:
 value: A number whose range is to be checked against the given range
 lower: The lower limit on the range
 upper: The upper limit on the range
 Raises:
 InvalidValueError: value not in range
 """
 if upper == 'inf':
 if value >= lower:
 return
 elif lower <= value <= upper:
 return
 raise ValueError
class InvalidValueError(Exception):
 """Exception class for invalid values
 Exception for when the user enters a value that
 is not a valid option. Raise this exception when the
 user-inputted value in not in valid_options[key].
 """
class InvalidOptionError(Exception):
 """Exception class for invalid options
 Exception for when the user gives an unexpected or invlaid
 option. Raise this exception to handle user-inputted options
 that do not belong in the config or needed options that are
 missing.
 """

A config object is instantiated with:

a dictionary that represents a tools default config parameters

# Default configuration values for this tool. Equivalent to Default_Parameters.txt.
# When adding keys and values to this dictionary, make sure they are lower case and
# match the values to the values in the VALUE_MAP dict.
DEFAULTS = {
 'counting method' : 'templates',
 'productive only' : True,
 'resolved only' : True,
 'vj resolution' : 'gene',
 'correction' : 'BY',
 'alpha' : 0.05,
}

A value map dictionary which holds the acceptable values for each config option with each value mapped to what the tool needs to run internally.

# A nested dict where the outer dictionary keys are all options, and each nested
# dictionary maps acceptable user-inputted values to the corresponding values
# needed for the tool internally.
# When adding entries to be sure all strings are lowercase and devoid of whitespace.
# Note this dict does not account for numerical entries
VALUE_MAP = {
 'couting method' : {
 'templates' : 'templates',
 'rearrangement' : 'rearrangement'
 },
 'productive only' : {
 'true' : 'productive',
 'false' : 'nonproductive'
 },
 'resolved only' : {
 'true' : True,
 'false' : False
 },
 'vj resolution' : {
 'gene' :'gene',
 'family' : 'family',
 'allele' : 'allele'
 },
 'correction' : {
 'bh' : 'BH',
 'bonferroni' : 'bonferroni',
 'by' : 'BY',
 'fdr' : 'fdr',
 'none' : None,
 }
}

The filename of the file where the user configuration lives for a specific tool run.
The sections in the configuration.

SECTIONS = ['union', 'significance']

Dictionary of configuration options that can take on a numerical value in a specific range.

# configuaration settings that may be numerical
# keys: numerical options; values: tuples that embed accepted range.
NUMERICALS = {
 'alpha' : (0, 1)
}

A tool will then make a config instance and grab the parsed and validation settings from the config.settings attribute.

 configuration = config.Config(settings.OUTPUT_DIR,
 settings.DEFAULTS,
 settings.VALUE_MAP,
 settings.CONFIG_FILE,
 settings.SECTIONS,
 settings.NUMERICALS)
 parameters = configuration.settings

Each tool initially had it's own configuration checking method with nested try-except blocks checking each config option and it was a total mess. I spent a lot of time trying to think how to best represent and validate the config data to hopefully minimize code complexity in each tool.

Question 2

General suggestions:

black can automatically format your code to be more idiomatic.
isort can group and sort your imports automatically.
flake8 with a strict complexity limit will give you more hints to write idiomatic Python:
```
[flake8]
max-complexity = 4
ignore = W503,E203
```
That limit is not absolute by any means, but it's worth thinking hard whether you can keep it low whenever validation fails. For example, I'm working with a team on an application since a year now, and our complexity limit is up to 7 in only one place.

I would then recommend adding type hints everywhere (I'm not sure whether they work with Python 2 though) and validating them using a strict mypy configuration:

[mypy]
check_untyped_defs = true
disallow_untyped_defs = true
ignore_missing_imports = true
no_implicit_optional = true
warn_redundant_casts = true
warn_return_any = true
warn_unused_ignores = true

Specific suggestions:

Things like pulling out self.numericals.keys() should not happen within a loop. You never modify it, so the result will be the same every time. Better to get this value once.
allow_no_val should probably just be called allow_no_value, since that's what it's used as.
If you put the bit about checking for the existence of the configuration file in a main() method the script would be easier to test. You could then pass a stream to Config - presumably config.read handles that just as well as a filename.
__init__ running _parse is an antipattern. Typically __init__ only does trivial things with its parameters, and relies on other methods to do the heavy lifting. Renaming _parse to parse would help in that case.
Boolean parameters are a code smell. For example, in the case of _value_map_lookup it would be easier to read if there was a separate lookup method for numbers.

Returning early and structuring the rest of the code accordingly can simplify some parts of the code. For example,

if value in self.value_map[option].keys():
 return self.value_map[option][value]
elif num:
 try:
 float_val = float(value)
 _check_range(float_val, self.numericals[option][0], self.numericals[option][1])
 except ValueError:
 raise InvalidValueError
 else:
 return float_val
else:
 raise InvalidValueError

could be written as

if value in self.value_map[option].keys():
 return self.value_map[option][value]
if not num:
 raise InvalidValueError
try:
 float_val = float(value)
 _check_range(float_val, self.numericals[option][0], self.numericals[option][1])
 return float_val
except ValueError:
 raise InvalidValueError

Why does _get_config_value check for an option in every section and return the first one? That seems to make sections pointless, because they don't matter.

l0b0 l0b0 9,11722 silver badges36 bronze badges · Accepted Answer · 2019-07-14 02:25:51Z

General suggestions:

black can automatically format your code to be more idiomatic.
isort can group and sort your imports automatically.
flake8 with a strict complexity limit will give you more hints to write idiomatic Python:
```
[flake8]
max-complexity = 4
ignore = W503,E203
```
That limit is not absolute by any means, but it's worth thinking hard whether you can keep it low whenever validation fails. For example, I'm working with a team on an application since a year now, and our complexity limit is up to 7 in only one place.

I would then recommend adding type hints everywhere (I'm not sure whether they work with Python 2 though) and validating them using a strict mypy configuration:

[mypy]
check_untyped_defs = true
disallow_untyped_defs = true
ignore_missing_imports = true
no_implicit_optional = true
warn_redundant_casts = true
warn_return_any = true
warn_unused_ignores = true

Specific suggestions:

Things like pulling out self.numericals.keys() should not happen within a loop. You never modify it, so the result will be the same every time. Better to get this value once.
allow_no_val should probably just be called allow_no_value, since that's what it's used as.
If you put the bit about checking for the existence of the configuration file in a main() method the script would be easier to test. You could then pass a stream to Config - presumably config.read handles that just as well as a filename.
__init__ running _parse is an antipattern. Typically __init__ only does trivial things with its parameters, and relies on other methods to do the heavy lifting. Renaming _parse to parse would help in that case.
Boolean parameters are a code smell. For example, in the case of _value_map_lookup it would be easier to read if there was a separate lookup method for numbers.

Returning early and structuring the rest of the code accordingly can simplify some parts of the code. For example,

if value in self.value_map[option].keys():
 return self.value_map[option][value]
elif num:
 try:
 float_val = float(value)
 _check_range(float_val, self.numericals[option][0], self.numericals[option][1])
 except ValueError:
 raise InvalidValueError
 else:
 return float_val
else:
 raise InvalidValueError

could be written as

if value in self.value_map[option].keys():
 return self.value_map[option][value]
if not num:
 raise InvalidValueError
try:
 float_val = float(value)
 _check_range(float_val, self.numericals[option][0], self.numericals[option][1])
 return float_val
except ValueError:
 raise InvalidValueError

Why does _get_config_value check for an option in every section and return the first one? That seems to make sections pointless, because they don't matter.

Stack Exchange Network

Configuration parsing and verification utility for bioinformatics tools

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Configuration parsing and verification utility for bioinformatics tools

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions