Remove duplication in python code for dict schema validation

Question 1

I want you to look the following code for validating dictionary schemas. There are some libraries out there that do the same thing, but I just wanted to write my own for fun and not to overload my current project with a big third party library.

Feel free to comment regarding code optimization, duplication, API design and the testing part.

You can see the API usage in the test snippet.

class ValidationError(Exception):
 def __init__(self, *args, **kwargs):
 super(ValidationError, self).__init__()
class Schema(object):
 """
 Schema class to validate the structure of dic objects.
 Accepts a dict of Attr's and validates their type, a validation function, etc.
 usage: Schema({'name': Attr(str, validator=lambda x: len(x) > 0),
 'age': Attr(int, lambda x: x > 0)})
 To denote a nested dict you use it like this Attr(schema=Schema({...}))
 """
 def __init__(self, attrs):
 self._attrs = attrs
 def bind(self, data):
 if isinstance(data, dict):
 self._data = data
 return self
 raise TypeError('data parameter is not a dict')
 def validate(self, data=None):
 """
 returns True if the bound data is valid to the schema
 raises an exception otherwise
 """
 if data:
 self.bind(data)
 for key in self._attrs.keys():
 attr = self._attrs[key]
 if not attr.is_optional and not key in self._data:
 raise ValidationError('Required attribute {0} not present'.format(key))
 if key in self._data:
 attr.validate(self._data[key])
 return True
class Attr(object):
 """
 Class to be used inside a Schema object.
 It validates individual items in the target dict,
 usage: Attr(int, validatos=some_func)
 """
 def __init__(self, t=None, optional=False, validator=None, schema=None):
 self._t = t
 self._optional = optional
 self._validator = validator
 self._schema = schema
 @property
 def is_schema(self):
 return isinstance(self._schema, Schema)
 @property
 def is_optional(self):
 return self._optional
 @property
 def validator(self):
 return self._validator
 @validator.setter
 def validator(self, val):
 self._validator = val
 def validate(self, data):
 if self._schema:
 if not isinstance(self._schema, Schema):
 raise TypeError('{0} is not instance of Schema'.format(data))
 return self._schema.validate(data)
 elif not isinstance(data, self._t):
 raise TypeError('{0} is not instance of {1}'.format(data, self._t))
 if self.validator and not self.validator(data):
 raise ValidationError('attribute did not pass validation test')
 return True

If you think some test are missing just tell it.

import unittest
class TestAttr(unittest.TestCase):
 def test_validates_type(self):
 attr = Attr(int)
 self.assertTrue(attr.validate(1), 'The attribute should be valid')
 attr = Attr(int)
 self.assertRaises(TypeError, attr.validate, '1')
 def test_validates_validator(self):
 attr = Attr(int, validator=lambda x: x > 1)
 self.assertTrue(attr.validate(2), 'The attribute should be valid')
 self.assertRaises(ValidationError, attr.validate, 0)
class TestSchema(unittest.TestCase):
 def test_flat_schema(self):
 schema = Schema({'name': Attr(str),
 'age': Attr(int, validator=lambda x: x > 0),
 'awesome': Attr(bool, optional=True, validator=lambda x: x == True)})
 valid = {'name': 'python', 'age': 19, 'awesome': True}
 self.assertTrue(schema.validate(valid), 'schema should be valid')
 # omit optional
 valid = {'name': 'python', 'age': 19}
 self.assertTrue(schema.validate(valid), 'schema should be valid')
 invalid = {'name': 'python'}
 self.assertRaises(ValidationError, schema.validate, invalid)
 # validate optional if present
 invalid = {'name': 'python', 'age': 19, 'awesome': False}
 self.assertRaises(ValidationError, schema.validate, invalid)
 def test_nested_schema(self):
 address = Schema({'street': Attr(str, validator=lambda x: len(x) > 0),
 'number': Attr(int)})
 schema = Schema({'name': Attr(str),
 'age': Attr(int, validator=lambda x: x > 0),
 'address': Attr(schema=address)})
 valid_address = {'street': 'pip', 'number': 2}
 invalid_address = {'street': '', 'number': 5}
 valid_data = {'name': 'python', 'age': 4, 'address': valid_address}
 invalid_data = {'name': 'python', 'age': 4, 'address': invalid_address}
 self.assertTrue(schema.validate(valid_data), 'schema should be valid')
 self.assertRaises(ValidationError, schema.validate, invalid_data)

Question 2

1. Error messages

The biggest problem with this code is that the error messages are useless. If the purpose is to validate data structures, then you have to think about what happens when validation fails. How easily are users going to be able to track down the cause of the problem?

So imagine that you are validating some data using a schema that you got from some library:
```
>>> from somewhere import schema
>>> data = dict(m=0)
>>> schema.validate(data)
Traceback (most recent call last):
 ...
ValidationError: attribute did not pass validation test
```
Now what? Why didn't the attribute pass the validation test? What are you supposed to do?

The message really needs to explain what the problem is, for example this would be much better:
```
ValidationError: 0 is not a valid month (use 1=January to 12=December)
```
So the Attr class needs some way of generating good error messages. For example, maybe the constructor could take a function that generates the message:
```
Attr(int, validator=lambda m:1 <= m <= 12,
 message="{} is not a valid month (use 1=January to 12=December)".format)
```
(There are other ways of doing this which I'm sure you can think of.)

Let's suppose you've implemented the suggestion above, and we try it out:

>>> data = dict(a=0, b=0, c=0)
>>> schema.validate(data)
Traceback (most recent call last):
 ...
ValidationError: 0 is not a valid month (use 1=January to 12=December)

Oh. Which attribute was the invalid month? Was it a, b or c? In order to be any use to the developer, the error message needs to include the attribute name:

ValidationError: b=0 is not a valid month (use 1=January to 12=December)

But the attribute name on its own is not enough when you are validating a nested structure:

>>> data = dict(coordinates=dict(a=0, b=0, c=0), date=dict(a=0, b=0, c=0))
>>> schema.validate(data)
Traceback (most recent call last):
 ...
ValidationError: b=0 is not a valid month (use 1=January to 12=December)

The error message needs to include the complete path through the data structure to the erroneous attribute, perhaps like this:

ValidationError: date.b=0 is not a valid month (use 1=January to 12=December)

2. Other commentary

The validator doesn't fail if there are attributes that are present in the data but missing in the schema. For example, shouldn't this fail validation:
```
>>> schema = Schema({})
>>> schema.validate(dict(a=1))
True
```
You added some documentation, but it's very sketchy. It's really not good enough to say, "You can see the API usage in the test snippet." Did you learn to use Python by reading the unit tests? If not, how can you in all fairness expect users of your class to do that?

For example, what does a Schema object represent? What is its purpose? Can you give us a use case? Can you provide some examples? What does the bind method do? In the validate method, what's the role of the data argument? In the Attr constructor, what's the meaning of the t, optional, and schema arguments? And so on.
There are some mistakes in the examples (there's a missing validator= in one place, and a typo validatos in another). If you used the doctest module to make your examples runnable, then you would have found and fixed these mistakes.
In ValidationError.__init__, if you're just going to call the superclass's method, there's no point in defining the method at all. Just write:
```
class ValidationError(Exception):
 pass
```
What is the purpose of the bind method? It allows the caller to divide the validation process into two steps: first, bind data to a schema object; second, call .validate() with no argument to do the actual validation. But what does this achieve over calling .validate(data)?
By testing isinstance(data, dict) you limit the kinds of object that can validated. The only methods on data you actually call are __contains__, __iter__ and __getitem__, so it would be better to test against collections.abc.Mapping rather than dict.
A dictionary with no keys converts to Boolean as False, so instead of writing:
```
if data:
 self.bind(data)
```
(which would go wrong when data={}) you should write:
```
if data is not None:
 self.bind(data)
```
Instead of:
```
for key in self._attrs.keys():
 attr = self._attrs[key]
```
you should write:
```
for key, attr in self._attrs.items():
```
(or .iteritems() in Python 2). This avoids an unnecessary lookup, and in Python 2 also avoids the unnecessary allocation of a list of keys.

In these lines:

if not attr.is_optional and not key in self._data:
 raise ValidationError('Required attribute {0} not present'.format(key))
if key in self._data:
 attr.validate(self._data[key])

you look up key in self._data three times. This seems a waste. Turn the logic round and look up the key just once:

try:
 attr.validate(self._data[key])
except KeyError:
 if not attr.is_optional:
 raise ValidationError(...)

It seems inelegant to me for the is_optional logic to be in the Schema class. Shouldn't this be in the Attr class?
When you've removed the bind method, you'll see that the only method left on the class is the validate method. So I think it would be simpler to name the class Validator and use the __call__ method instead. (This has the advantage that you no longer need the special-case schema argument to the Attr constriuctor.)
Having a property Attr.validator with getter and setter seems pointless because these are only ever called from within the Attr class. The Attr.is_schema property is pointless because it is never used.
Having a property Attr.is_optional seems pointless: why not just use an attribute?
Having a special case for Attr(schema=Schema(...)) is unnecessary: one can just write Attr(validator=Schema(...).validate). Or if you take my suggestion above about using the __call__ method, you'd be able to just write Attr(validator=Schema(...)).
The t argument to the Attr constructor would be better named classinfo to match the name of the second argument to isinstance. Since it's an optional argument, it should default to object (rather than None), so that if you don't specify it, the test passes rather than fails.

Gareth Rees Gareth Rees 50.1k3 gold badges130 silver badges210 bronze badges · Accepted Answer · 2013-11-16 17:40:42Z

1. Error messages

The biggest problem with this code is that the error messages are useless. If the purpose is to validate data structures, then you have to think about what happens when validation fails. How easily are users going to be able to track down the cause of the problem?

So imagine that you are validating some data using a schema that you got from some library:
```
>>> from somewhere import schema
>>> data = dict(m=0)
>>> schema.validate(data)
Traceback (most recent call last):
 ...
ValidationError: attribute did not pass validation test
```
Now what? Why didn't the attribute pass the validation test? What are you supposed to do?

The message really needs to explain what the problem is, for example this would be much better:
```
ValidationError: 0 is not a valid month (use 1=January to 12=December)
```
So the Attr class needs some way of generating good error messages. For example, maybe the constructor could take a function that generates the message:
```
Attr(int, validator=lambda m:1 <= m <= 12,
 message="{} is not a valid month (use 1=January to 12=December)".format)
```
(There are other ways of doing this which I'm sure you can think of.)

Let's suppose you've implemented the suggestion above, and we try it out:

>>> data = dict(a=0, b=0, c=0)
>>> schema.validate(data)
Traceback (most recent call last):
 ...
ValidationError: 0 is not a valid month (use 1=January to 12=December)

Oh. Which attribute was the invalid month? Was it a, b or c? In order to be any use to the developer, the error message needs to include the attribute name:

ValidationError: b=0 is not a valid month (use 1=January to 12=December)

But the attribute name on its own is not enough when you are validating a nested structure:

>>> data = dict(coordinates=dict(a=0, b=0, c=0), date=dict(a=0, b=0, c=0))
>>> schema.validate(data)
Traceback (most recent call last):
 ...
ValidationError: b=0 is not a valid month (use 1=January to 12=December)

The error message needs to include the complete path through the data structure to the erroneous attribute, perhaps like this:

ValidationError: date.b=0 is not a valid month (use 1=January to 12=December)

2. Other commentary

The validator doesn't fail if there are attributes that are present in the data but missing in the schema. For example, shouldn't this fail validation:
```
>>> schema = Schema({})
>>> schema.validate(dict(a=1))
True
```
You added some documentation, but it's very sketchy. It's really not good enough to say, "You can see the API usage in the test snippet." Did you learn to use Python by reading the unit tests? If not, how can you in all fairness expect users of your class to do that?

For example, what does a Schema object represent? What is its purpose? Can you give us a use case? Can you provide some examples? What does the bind method do? In the validate method, what's the role of the data argument? In the Attr constructor, what's the meaning of the t, optional, and schema arguments? And so on.
There are some mistakes in the examples (there's a missing validator= in one place, and a typo validatos in another). If you used the doctest module to make your examples runnable, then you would have found and fixed these mistakes.
In ValidationError.__init__, if you're just going to call the superclass's method, there's no point in defining the method at all. Just write:
```
class ValidationError(Exception):
 pass
```
What is the purpose of the bind method? It allows the caller to divide the validation process into two steps: first, bind data to a schema object; second, call .validate() with no argument to do the actual validation. But what does this achieve over calling .validate(data)?
By testing isinstance(data, dict) you limit the kinds of object that can validated. The only methods on data you actually call are __contains__, __iter__ and __getitem__, so it would be better to test against collections.abc.Mapping rather than dict.
A dictionary with no keys converts to Boolean as False, so instead of writing:
```
if data:
 self.bind(data)
```
(which would go wrong when data={}) you should write:
```
if data is not None:
 self.bind(data)
```
Instead of:
```
for key in self._attrs.keys():
 attr = self._attrs[key]
```
you should write:
```
for key, attr in self._attrs.items():
```
(or .iteritems() in Python 2). This avoids an unnecessary lookup, and in Python 2 also avoids the unnecessary allocation of a list of keys.

In these lines:

if not attr.is_optional and not key in self._data:
 raise ValidationError('Required attribute {0} not present'.format(key))
if key in self._data:
 attr.validate(self._data[key])

you look up key in self._data three times. This seems a waste. Turn the logic round and look up the key just once:

try:
 attr.validate(self._data[key])
except KeyError:
 if not attr.is_optional:
 raise ValidationError(...)

It seems inelegant to me for the is_optional logic to be in the Schema class. Shouldn't this be in the Attr class?
When you've removed the bind method, you'll see that the only method left on the class is the validate method. So I think it would be simpler to name the class Validator and use the __call__ method instead. (This has the advantage that you no longer need the special-case schema argument to the Attr constriuctor.)
Having a property Attr.validator with getter and setter seems pointless because these are only ever called from within the Attr class. The Attr.is_schema property is pointless because it is never used.
Having a property Attr.is_optional seems pointless: why not just use an attribute?
Having a special case for Attr(schema=Schema(...)) is unnecessary: one can just write Attr(validator=Schema(...).validate). Or if you take my suggestion above about using the __call__ method, you'd be able to just write Attr(validator=Schema(...)).
The t argument to the Attr constructor would be better named classinfo to match the name of the second argument to isinstance. Since it's an optional argument, it should default to object (rather than None), so that if you don't specify it, the test passes rather than fails.

Stack Exchange Network

Remove duplication in python code for dict schema validation

1 Answer 1

1. Error messages

2. Other commentary

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Remove duplication in python code for dict schema validation

1 Answer 1

1. Error messages

2. Other commentary

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions