Check Python code for certain statements

Question 1

I'm working on a system that allows users to run their code on our system. While their code is sandboxed, I would still like to know whether it uses certain statements, especially imports. This is meant as a quick check for malicious code or code that violates the platform's guidelines. It will not be the only check, since the code is also reviewed by humans later, but filtering out the worst cases automatically would be preferable.

So what would be the best way, without executing the code, to check whether it, for example, imports sys (or a part of sys)? I would hope there's a nicer way than regex-searching the code.

Bonus question: What about more complex statements? For example calling foo from module bar?

EDIT: This is NOT a question about security. It's about finding certain statements inside code. See my comment. This user code will only run inside the user's own sandbox, so they can only ruin their own sandbox. But if their code gets 'certified', it can run in other users' sandboxes, so it needs to be checked before it is certified. If an automated check can spot the worst offenses, that would be helpful.

Question 2

The humans need to do a very comprehensive check. It will be possible to find a way around any simple system that you implement here, for instance, by using the exec command.

Question 3

I might be wrong, but I can't think of any other method than scanning the source files as text: if you let the Python interpreter execute them, you won't be able to introspectively interrogate the code before import statements and module-level functions have been executed... But maybe I am missing part of the problem?
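One point that may help here: parsing is not the same as executing. The sketch below (Python 3; the sample source and the `executed` list are made up for illustration) builds a syntax tree with the standard `ast` module and reads the import out of it, and nothing in the inspected source runs.

```python
import ast

# Hypothetical user submission: if it were executed, it would append
# a marker to the `executed` list defined below.
SOURCE = '''
import sys
executed.append("module body ran")
'''

executed = []

# ast.parse only builds a syntax tree; nothing in SOURCE is run.
tree = ast.parse(SOURCE)

imported = [
    alias.name
    for node in ast.walk(tree)
    if isinstance(node, ast.Import)
    for alias in node.names
]

print(imported)  # module names found by inspection
print(executed)  # stays empty, because the code never ran
```

So scanning the file is still the approach, but it does not have to be a plain-text scan.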

Question 4

@Oliver ...and still... even humans might have trouble if the code is obfuscated (for example pickled, zipped, rot13-encoded, etc...)

Question 5

The security is done by the sandboxing. But the system will allow people to run their code inside other people's sandboxes AFTER checks. So if their code is obfuscated, it will just immediately be rejected. If it's not obfuscated, it will be checked for a few limitations (such as sys, os.system, subprocess). Ideally it would just tell me which lines those offending statements are on so I can manually check them.

Question 6

I really wouldn't bother trying to do this kind of artificial sandboxing because

1024 ** 1024 ** 1024

Will still chew up your interpreter.

or even this

eval("__vzcbeg__('gvzr').nfpgvzr()".decode("rot13"))

If you want some security, look into PyPy's sandbox; it's about the most secure way to run untrusted Python code. There are a few Python-only modules like pysandbox, but I personally suggest the PyPy sandbox.
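For readers wondering what that eval string hides: it is rot13-encoded. A minimal Python 3 sketch (the helper names `reveal_rot13` and `has_import_node` are mine, not from the answer) decodes it with the standard `codecs` module and shows that even the decoded expression contains no import statement for a scanner to find, only a call to `__import__`.

```python
import ast
import codecs

# The obfuscated expression quoted in the answer above.
OBFUSCATED = "__vzcbeg__('gvzr').nfpgvzr()"


def reveal_rot13(text):
    """Return the rot13 decoding of text (only letters are rotated)."""
    return codecs.decode(text, "rot_13")


def has_import_node(source):
    """True if the parsed source contains an import statement node."""
    tree = ast.parse(source)
    return any(
        isinstance(node, (ast.Import, ast.ImportFrom))
        for node in ast.walk(tree)
    )


decoded = reveal_rot13(OBFUSCATED)
print(decoded)                   # __import__('time').asctime()
print(has_import_node(decoded))  # False: it is a call, not a statement
```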

Question 7

I would guess the OP doesn't want this for security, given that he already uses sandboxes. These kinds of checks might be intended to recognise attempts at breaking the sandbox, and to enforce a ban...

Question 8

I don't think you can detect that sort of thing reliably at all. Consider the following:

>>> f = None
>>> b = vars()[[f for f in vars() if 'ti' in f][0]]
>>> m = getattr(b, [f for f in dir(b) if 't_' in f][0])
>>> m('x\x9c+\xae,\x06\x00\x02\xc1\x01`'.decode('zip'))
<module 'sys' (built-in)>

Question 9

You can't do this just by static analysis of the code, since it can always do tricky things, e.g.:

>>> getattr(__builtins__, "__" + chr(105) + "mport__")("sys")
<module 'sys' (built-in)>

As you can see, looking at the disassembly, the code, or the AST won't help, as nowhere does it even contain the string "import":

>>> import dis
>>> dis.dis(lambda: getattr(__builtins__, "__" + chr(105) + "mport__")("sys"))
  1           0 LOAD_GLOBAL              0 (getattr)
              3 LOAD_GLOBAL              1 (__builtins__)
              6 LOAD_CONST               1 ('__')
              9 LOAD_GLOBAL              2 (chr)
             12 LOAD_CONST               2 (105)
             15 CALL_FUNCTION            1
             18 BINARY_ADD
             19 LOAD_CONST               3 ('mport__')
             22 BINARY_ADD
             23 CALL_FUNCTION            2
             26 LOAD_CONST               4 ('sys')
             29 CALL_FUNCTION            1
             32 RETURN_VALUE
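Static analysis cannot prove such code harmless, but it can still flag the building blocks of these tricks so that a human looks at them first. A rough Python 3 sketch follows; the `SUSPICIOUS_NAMES` set is an illustrative assumption, not a complete blacklist, and `find_suspicious_names` is a hypothetical helper.

```python
import ast

# Names that often show up in dynamic-import tricks. This set is an
# illustrative assumption, not a complete or authoritative blacklist.
SUSPICIOUS_NAMES = {
    "__builtins__", "__import__", "getattr", "eval", "exec", "vars", "globals",
}


def find_suspicious_names(source):
    """Return sorted (line, name) pairs for each suspicious bare name used."""
    hits = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name) and node.id in SUSPICIOUS_NAMES:
            hits.add((node.lineno, node.id))
    return sorted(hits)


SAMPLE = 'getattr(__builtins__, "__" + chr(105) + "mport__")("sys")\n'
print(find_suspicious_names(SAMPLE))
# [(1, '__builtins__'), (1, 'getattr')]
```

This only catches names written literally in the source, so it is a triage aid for the human review, nothing more.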

Question 10

I assume chr(73) was a mistake, because that's the ASCII code for I and "Import" throws errors for me. It works fine, though, with 105 for lower case i.

Question 11

While true sandboxing is indeed very difficult, if it is the import statement you are trying to catch, consider this:

>>> org_imp = __builtins__.__import__
>>> def imp_hook(*args, **kw):
...     if args[0] == 'sys':
...         print 'Gotcha!!'
...         return None
...     return org_imp(*args, **kw)
...
>>> __builtins__.__import__ = imp_hook
>>> import sys
Gotcha!!
>>> sys
>>> print sys
None

This works regardless of the complexity of the import statement itself.

Note: Don't just print and return None; raise meaningful exceptions. But you get the idea!
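A Python 3 take on the same idea, raising an exception as the note suggests. This is a sketch under assumptions: `block_imports`, `guarded_import` and `try_import` are names I made up, and the blocked set is just an example. Note that it runs the code, so it belongs inside the sandbox, and that anything calling `importlib.import_module` does not go through `builtins.__import__` and so bypasses the hook.

```python
import builtins
from contextlib import contextmanager


@contextmanager
def block_imports(blocked):
    """Temporarily make imports of the given top-level modules fail."""
    original_import = builtins.__import__

    def guarded_import(name, globals=None, locals=None, fromlist=(), level=0):
        top_level = name.split(".")[0]
        if level == 0 and top_level in blocked:
            raise ImportError("import of %r is not allowed here" % name)
        # Everything else is delegated to the real import machinery.
        return original_import(name, globals, locals, fromlist, level)

    builtins.__import__ = guarded_import
    try:
        yield
    finally:
        builtins.__import__ = original_import  # always restore the original


def try_import(statement):
    """Run one import statement and report whether the hook allowed it."""
    try:
        exec(statement, {})
        return "allowed"
    except ImportError:
        return "blocked"


with block_imports({"subprocess"}):
    inside = [
        try_import("import subprocess"),
        try_import("from subprocess import run"),
        try_import("import json"),
    ]
after = try_import("import subprocess")
print(inside, after)
```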

Question 12

You may use the ast python module to analyze Python code. See my answer to a very similar question here:

https://stackoverflow.com/a/8255293/589206

Here's a solution for your import statement problem:

import ast
import sys

class FunctionNameFinder(ast.NodeVisitor):
    def visit_Import(self, node):
        print "Importing on line", node.lineno, ":",
        for i in node.names: print i.name,
        print

with open(sys.argv[1], 'rU') as f:
    FunctionNameFinder().visit(ast.parse("".join(f.readlines())))

Of course, this won't help in cases where a malicious user puts a lot of effort into obfuscating their code, but then the only way to go is to use a real sandbox. But that wasn't your question in the first place.
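For anyone on Python 3, here is a hedged sketch of the same visitor idea that also covers the "or a part of sys" case (`from sys import ...`) and returns line numbers instead of printing. `ImportFinder`, `find_imports` and the watched set are my own illustrative names, not part of the answer.

```python
import ast


class ImportFinder(ast.NodeVisitor):
    """Collect (line, module) pairs for every import of a watched module."""

    def __init__(self, watched):
        self.watched = set(watched)
        self.found = []

    def _record(self, lineno, module_name):
        # Match "os" itself as well as submodules such as "os.path".
        if module_name.split(".")[0] in self.watched:
            self.found.append((lineno, module_name))

    def visit_Import(self, node):
        for alias in node.names:
            self._record(node.lineno, alias.name)

    def visit_ImportFrom(self, node):
        # node.module is None for "from . import x"; relative imports
        # are skipped in this sketch.
        if node.module and node.level == 0:
            self._record(node.lineno, node.module)


def find_imports(source, watched):
    finder = ImportFinder(watched)
    finder.visit(ast.parse(source))
    return finder.found


SAMPLE = (
    "import os, sys\n"
    "from sys import argv as a\n"
    "import json\n"
    "\n"
    "def f():\n"
    "    import subprocess\n"
)
print(find_imports(SAMPLE, {"sys", "subprocess"}))
# [(1, 'sys'), (2, 'sys'), (6, 'subprocess')]
```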

Question 13

Frankly I don't see how this should be better than using regexes... It would be at best an order of magnitude slower, and still with the same limitations.... Or am I missing anything?

Question 14

You can use this method to solve the bonus question as well, which will be hard using regular expressions.
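A possible sketch for the bonus question in Python 3, using the question's own placeholder names (module bar, function foo). The function `find_calls` is hypothetical. It does two passes: first it collects the local names bound to the module or the function by import statements, then it looks for calls through those names.

```python
import ast


def find_calls(source, module, function):
    """Return sorted line numbers of calls to module.function in source."""
    tree = ast.parse(source)
    module_aliases = set()    # local names bound to the module
    function_aliases = set()  # local names bound to the function

    # Pass 1: collect aliases introduced by import statements.
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name == module:
                    module_aliases.add(alias.asname or alias.name)
        elif (isinstance(node, ast.ImportFrom) and node.level == 0
              and node.module == module):
            for alias in node.names:
                if alias.name == function:
                    function_aliases.add(alias.asname or alias.name)

    # Pass 2: find calls made through any of those aliases.
    lines = []
    for node in ast.walk(tree):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        if isinstance(func, ast.Name) and func.id in function_aliases:
            lines.append(node.lineno)
        elif (isinstance(func, ast.Attribute) and func.attr == function
              and isinstance(func.value, ast.Name)
              and func.value.id in module_aliases):
            lines.append(node.lineno)
    return sorted(lines)


SAMPLE = (
    "import bar\n"
    "import bar as b\n"
    "from bar import foo as f\n"
    "\n"
    "bar.foo(1)\n"
    "b.foo(2)\n"
    "f(3)\n"
    "other.foo(4)\n"
    "bar.baz(5)\n"
)
print(find_calls(SAMPLE, "bar", "foo"))  # [5, 6, 7]
```

It will miss indirect uses such as `g = bar.foo; g()` or anything going through getattr, so treat the result as a list of lines to look at, not as proof of absence.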

Question 15

Also doesn't work with simple tricks such as Liquid Fire's code above.

Question 16

Well... I wouldn't call matching "from foo import bar" a very difficult task to achieve with regexes! :)
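To make the comparison concrete, here is a small Python 3 experiment. The pattern `NAIVE_IMPORT` is a deliberately simple regex that I am assuming for illustration; a more careful one would behave differently. The sample has an import that is only text inside a docstring and a real import after a semicolon.

```python
import ast
import re

# A deliberately naive pattern, assumed for illustration only.
NAIVE_IMPORT = re.compile(r"^\s*(?:import|from)\s+(\w+)", re.MULTILINE)

SAMPLE = '''"""
import sys   <- only text inside a docstring
"""
x = 1; import os
from json import (
    dumps,
)
'''


def regex_modules(source):
    """Module names the naive line-based regex believes are imported."""
    return NAIVE_IMPORT.findall(source)


def ast_modules(source):
    """Module names that are actually imported according to the parser."""
    names = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            names.append(node.module)
    return names


print(regex_modules(SAMPLE))  # ['sys', 'json']
print(ast_modules(SAMPLE))    # ['os', 'json']
```

With this particular pattern, the regex reports the docstring line and misses the import after the semicolon, while the parser gets both right. Neither approach sees through the obfuscation tricks shown elsewhere in this thread.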

Question 17

Hmmm ... still I think it's nicer to use the ast library, but a real solution involves a sandbox for sure :)

Question 18

What you are trying to do is a common scenario: you are already doing dynamic analysis of the code by running it in a sandbox. On top of that, you'd like static analysis as well, using another tool to read the program for you.

Both approaches have their own shortcomings and, due to the nature of computation, neither can guarantee to catch every scenario that could go wrong. Still, the combination of the two gives you a lot of useful information at a higher confidence level.

In other popular languages, for example C/C++, there are robust tools (e.g. Lint) which can analyze the code deeply and report a lot of potential problems including those related to security.

Unfortunately, Python doesn't have tools with that level of robustness. Having said that, you can still do a lot. I think your best choice would be to use PyLint.

PyLint comes with some standard rules for the code analysis but you can override those to customize your own code smells.

For instance, if you simply would like to see the kind of modules being used, you can use the imports checker. For handling more complex scenarios, you can customize and extend the functionality. Take a look at their documentation for enhancing PyLint.

Take a look at the tutorial to get started:

Accepted Answer · Jakob Bowyer · 2011-12-05 13:35:53Z (the answer quoted under Question 6)

