The following code is a module that evaluates boolean expressions, which are entered through an unauthorized web API. The exposed function of the library being used is evaluate()
. Its positional argument string
is a string of a pythonic boolean expression that is entered by the user and thusly untrustworthy (so it's potentially any string). I implemented some basic checks to ensure that arbitrary code execution is mitigated. Most of the security is outsourced to the callback function used to evaluate the boolness of user-defined tokens. I'd appreciate a review with focus on security, but also appreciate general feedback.
"""Boolean evaluation."""
from typing import Callable, Iterator
__all__ = ['SecurityError', 'evaluate']
PARENTHESES = frozenset({'(', ')'})
KEYWORDS = frozenset({'and', 'or', 'not'})
class SecurityError(Exception):
"""Indicates a possible security breach
in parsing boolean statements.
"""
def bool_val(token: str, callback: Callable[[str], bool]) -> str:
"""Evaluates the given statement into a boolean value."""
callback_result = callback(token)
if isinstance(callback_result, bool):
return str(callback_result)
raise SecurityError('Callback method did not return a boolean value.')
def tokenize(word: str) -> Iterator[str]:
"""Yields tokens of a string."""
for index, char in enumerate(word):
if char in PARENTHESES:
yield word[:index]
yield char
yield from tokenize(word[index+1:])
break
else:
yield word
def boolexpr(string: str, callback: Callable[[str], bool]) -> Iterator[str]:
"""Yields boolean expression elements for python."""
for word in string.strip().split():
for token in filter(None, tokenize(word)):
if token in KEYWORDS or token in PARENTHESES:
yield token
else:
yield bool_val(token, callback)
def evaluate(
string: str, *,
callback: Callable[[str], bool] = lambda s: s.casefold() == 'true'
) -> bool:
"""Safely evaluates a boolean string."""
return bool(eval(' '.join(boolexpr(string, callback))))
Update
This library is used in one place only, namely a real estates filtering API, where web applications can retrieve real estates from a database. By using boolean expressions parsed by this library code, the web applications can filter the retrieved real estates. The actual matching is done in the callback function passed to evaluate, which matches common filtering properties on the respective real estates and then uses evaluate()
to assert whether the requested combination of properties is met. E.g. id in [1, 2, 3] and (sales_type == "rent" or price < 250000)
which, depending on the real estate, evaluates the matching to True and ( False or True )
which is then passed to eval()
.
-
\$\begingroup\$ Boolean-expr-as-a-service does not seem needed. Can you give any description to the contrary? \$\endgroup\$Reinderien– Reinderien2022年01月03日 19:07:54 +00:00Commented Jan 3, 2022 at 19:07
-
\$\begingroup\$ In what part of the architecture is (for example) your ID membership check converted into a boolean? The client, the web server, this application server? \$\endgroup\$Reinderien– Reinderien2022年01月04日 14:10:48 +00:00Commented Jan 4, 2022 at 14:10
-
\$\begingroup\$ All evaluations are done on the server side. \$\endgroup\$Richard Neumann– Richard Neumann2022年01月04日 14:13:39 +00:00Commented Jan 4, 2022 at 14:13
-
\$\begingroup\$ So a client writes an arbitrary expression, sends it to the server, the server evaluates boolean tokens to truth values, fills them in in the expression, the expression itself is evaluated, and then what? Used for filtering in memory, in SQL? \$\endgroup\$Reinderien– Reinderien2022年01月04日 14:19:48 +00:00Commented Jan 4, 2022 at 14:19
-
\$\begingroup\$ In-memory filtering of XML DOM models. Filtering on database-level would be too complex and not significantly improve performance (yes, I tested that). \$\endgroup\$Richard Neumann– Richard Neumann2022年01月04日 15:19:37 +00:00Commented Jan 4, 2022 at 15:19
1 Answer 1
I mean to say this in the gentlest possible way, but from top to bottom this is a bad idea. It would take a more detailed explanation of why this creature exists to convince me otherwise.
Your default callback produces some pretty crazy results: ^
and 1
both evaluate to False
, for instance.
A critical step in security is to limit DOS potential by constraining the possible expression input length.
I would replace your PARENTHESES
and KEYWORDS
with an AST-walk that is very strict on node types.
Your eval
uses all-defaults, which means:
- A code object will be immediately compiled, without being able to examine the AST first
- Any future-features available to the calling scope are implicitly passed into the compiled scope
- Just... read this and tell me that it's a good idea:
If both [global and local] dictionaries are omitted, the expression is executed with the
globals
andlocals
in the environment whereeval()
is called.
I would also rethink your callback. If the idea is to have a small collection of names that evaluate to booleans, just pass these in as locals
and call it a day. Anything more complicated expands the potential for misuse.
If this needs to exist at all, I would suggest something like
"""Boolean evaluation."""
import ast
from types import CodeType
from typing import Union
EXPR_LIMIT = 100
class SecurityError(Exception):
"""Indicates a possible security breach
in parsing boolean statements.
"""
def compile_isolated(
source: Union[str, ast.Expression], flags: int,
) -> Union[ast.Expression, CodeType]:
return compile(
source=source,
filename='<untrusted_expr>',
mode='eval',
flags=flags,
dont_inherit=1, # no future features, isolated flags
optimize=2, # cut asserts and docstrings
)
def evaluate(string: str, **bool_locals: bool) -> bool:
if len(string) > EXPR_LIMIT:
raise SecurityError('Expression too long')
expr: ast.Expression = compile_isolated(
source=string,
flags=ast.PyCF_ONLY_AST, # no await, no pep484, defer compilation
)
if not isinstance(expr, ast.Expression):
raise SecurityError('Invalid parent node')
top_op, = ast.iter_child_nodes(expr)
for node in ast.walk(top_op):
if isinstance(node, (
ast.UnaryOp, ast.Not, ast.BoolOp, ast.boolop, ast.Name, ast.Load,
)):
continue
if isinstance(node, ast.Constant):
literal = ast.literal_eval(node)
if not isinstance(literal, bool):
raise SecurityError('Invalid literal')
else:
raise SecurityError('Invalid node')
code_object: CodeType = compile_isolated(source=expr, flags=0)
isolated_globals = {'__builtins__': {}}
for v in bool_locals.values():
if not isinstance(v, bool):
raise ValueError('Invalid value for bool_locals')
result = eval(code_object, isolated_globals, bool_locals)
if not isinstance(result, bool):
raise SecurityError('Unexpected non-boolean output')
return result
def test() -> None:
assert evaluate('not (a and False) or True', a=True) is True
assert evaluate('not False') is True
assert evaluate('a and (b or c)', a=True, b=True, c=False) is True
for bad in (
'__loader__ or True',
'exit',
'quit',
'eval and False',
):
try:
evaluate(bad)
raise AssertionError('Should have failed')
except NameError:
pass
for bad in (
'1 or False',
'int(False)',
'False * True',
'exit()',
'dos'*40,
):
try:
evaluate(bad)
raise AssertionError('Should have failed')
except SecurityError:
pass
if __name__ == '__main__':
test()