Python text processing and parsing

Question 1

I have a file at gran/config.py, and importing it is not an option.

Inside config.py there is the following code:

...<more code>
animal = dict(
    bear = r'^bear4x',
    tiger = r'^.*\tiger\b.*$'
)
...<more code>

I want to be able to extract r'^bear4x' or r'^.*\tiger\b.*$' by looking up bear or tiger.

I started out with:

try:
    text = open('gran/config.py', 'r')
    # 'not sure' is a placeholder: I don't know what to filter with yet
    tline = filter('not sure', text.readlines())
    text.close()
except IOError as err:
    pass

I was hoping to grab the whole animal dict with something like
grab = re.compile("^animal\s*=\s*('.*')")
and then change tline to tline = filter(grab.search, text.readlines()),

but that only grabs animal = dict( and not the following lines of the dict.

How can I grab multiple lines?
Look for animal, confirm the first '(', then keep reading until the matching ')'?

Note: the size of the animal dict may change, so a static approach (like grabbing 4 extra lines after animal is found) wouldn't work.

Question 2

What kind of error appears when you try to import the file?

Question 3

@badc0re Hmm, that's not really relevant, because importing is not an option: config.py tries to import something that's not available, so I have to treat it as a text file. Importing it would run the code, which would in turn try to import something that's not available.

Question 4

Maybe you should try some AST hacks? With Python it is easy:

import ast
# Parse the file into a syntax tree; nothing in it is executed
config = ast.parse(open('config.py').read())

So now you have your parsed module. You need to extract the assignment to animal and evaluate it. There is a safe ast.literal_eval function, but since the value is a call to dict it won't work here. The idea is to traverse the whole module tree, keep only the assignment you want, and run that locally:

class OnlyAssigns(ast.NodeTransformer):
    def generic_visit(self, node):
        return None  # throw everything else away

    def visit_Module(self, node):
        # The Module node itself must be kept, so visit its children normally
        return ast.NodeTransformer.generic_visit(self, node)

    def visit_Assign(self, node):
        # getattr guards against targets such as a.b or a[0], which have no .id
        if getattr(node.targets[0], 'id', None) == 'animal':  # change the name as needed
            return node  # keep it
        return None  # throw away

config = OnlyAssigns().visit(config)

Compile it and run:

exec(compile(config, 'config.py', 'exec'))
print(animal)

If animal should end up in a dictionary instead, pass that dictionary as the locals argument to exec:

data = {}
exec(compile(config, 'config.py', 'exec'), globals(), data)
print(data['animal'])

There is much more you can do with AST hacking, such as visiting every If and For statement. Check the ast documentation for details.
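As a variation on the approach above that avoids exec entirely, each keyword argument of the dict(...) call can be handed to ast.literal_eval on its own. This is only a sketch for Python 3: the helper name extract_dict_call and the inline sample source are made up for illustration, and it assumes every value in the dict is a plain literal such as a string.

```python
import ast

# Stand-in for the contents of config.py. The failing import is never
# executed because the source is only parsed.
SAMPLE_SOURCE = r"""
import farm.animals  # not importable here, but parsing does not care

animal = dict(
    bear = r'^bear4x',
    tiger = r'^.*\tiger\b.*$'
)
"""


def extract_dict_call(source, name):
    """Return the dict built by a top-level `name = dict(key=literal, ...)`."""
    for node in ast.parse(source).body:
        if not isinstance(node, ast.Assign):
            continue
        # Only simple `name = ...` targets are considered
        if not any(isinstance(t, ast.Name) and t.id == name for t in node.targets):
            continue
        call = node.value
        if (isinstance(call, ast.Call) and isinstance(call.func, ast.Name)
                and call.func.id == 'dict' and not call.args):
            # literal_eval accepts AST nodes and rejects anything but literals;
            # kw.arg is None for **kwargs, which this sketch skips
            return {kw.arg: ast.literal_eval(kw.value)
                    for kw in call.keywords if kw.arg is not None}
    raise KeyError(name)


patterns = extract_dict_call(SAMPLE_SOURCE, 'animal')
print(patterns['bear'])
print(patterns['tiger'])
```

Because nothing from the file is executed, this variant also works when the rest of config.py has side effects, at the cost of only handling literal values.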

Question 5

If the only reason you can't import that file as-is is that its imports would fail, you can potentially hack your way around that rather than trying to process a perfectly good Python file as plain text.

For example, if I have a file named busted_import.py with:

import doesnotexist
foo = 'imported!'

And I try to import it, I will get an ImportError. But if I define what the doesnotexist module refers to using sys.modules before trying to import it, the import will succeed:

>>> import sys
>>> sys.modules['doesnotexist'] = ""
>>> import busted_import
>>> busted_import.foo
'imported!'

So if you can isolate the imports that will fail in your Python file and redefine them before attempting the import, you can work around the ImportErrors.

Question 6

This has to be in config.py, but I do not have write access to config.py. The busted import is inside the Python file that I'm trying to import, so it's the import's import that's busted.

Question 7

Yeah that's what I was demonstrating above. Pretend my busted_import.py is your config.py. Try importing, and when you get the ImportError, simply redirect that module within sys.modules using the example above and then try the import again. If you get another ImportError, repeat the process until you get no more ImportErrors.

Question 8

gotcha. ill try this approach and see if i can import config.py

Question 9

grrr. still complaining. ImportError: No module named farm.animals despite having sys.modules['farm.animals'] = ""
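A likely reason the stub above is not enough: for a dotted name such as farm.animals, the import system imports the parent package farm first, so the parent needs an entry in sys.modules as well. A rough Python 3 sketch, assuming the module names from the comment above and that config.py only needs those names to exist:

```python
import sys
import types

# Hypothetical stand-ins for the packages that config.py fails to import
farm = types.ModuleType('farm')
farm.__path__ = []  # marks the stub as a package
farm_animals = types.ModuleType('farm.animals')

# Register both the parent package and the submodule
sys.modules['farm'] = farm
sys.modules['farm.animals'] = farm_animals
# `import farm.animals` binds the name `farm`, so the submodule
# must also be reachable as an attribute of the parent
farm.animals = farm_animals

import farm.animals  # now succeeds without the real package

print(farm.animals is farm_animals)
```

If config.py uses the form from farm.animals import something, the stub module additionally needs an attribute with that name.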

Question 10

I don't quite get what exactly you are trying to do.

If you want to process each line with a regular expression: you have ^ in re.compile("^animal\s*=\s*('.*')"), so it matches only when animal is at the start of a line, not after some spaces. It also only matches the name animal, so of course it does not match the bear or tiger lines. Use something like re.compile(r"^\s*([a-z]+)\s*=\s*(r?'.*')"), where the optional r allows for the raw-string prefix used in config.py.

If you want to process multiple lines with single regular expression, read about re.DOTALL and re.MULTILINE and how they affect matching newline characters:

http://docs.python.org/2/library/re.html#re.MULTILINE

Also note that text.readlines() returns a list of lines, so the function in filter('not sure', text.readlines()) is run on each line, not on the whole file. You cannot pass a regular expression to filter(<re here>, text.readlines()) and expect it to match across multiple lines.
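To make the points above concrete, here is a rough sketch that treats the source as one string and uses re.DOTALL together with re.MULTILINE to capture the body of the dict(...) call. It is written for Python 3, uses an inline sample in place of gran/config.py, and assumes that the closing parenthesis sits on its own line and that the values are single-quoted strings without embedded quotes. The pattern will break on anything fancier.

```python
import re

# Inline stand-in for open('gran/config.py').read()
text = r"""
import farm.animals

animal = dict(
    bear = r'^bear4x',
    tiger = r'^.*\tiger\b.*$'
)
other = 1
"""

# MULTILINE lets ^ match at the start of every line;
# DOTALL lets . match newlines, so the lazy group can span several lines.
block_re = re.compile(r"^animal\s*=\s*dict\((.*?)^\s*\)", re.DOTALL | re.MULTILINE)
# Matches key = 'value' as well as key = r'value'
entry_re = re.compile(r"(\w+)\s*=\s*r?'([^']*)'")

match = block_re.search(text)
animal = dict(entry_re.findall(match.group(1))) if match else {}

print(animal['bear'])
print(animal['tiger'])
```

The lazy (.*?) stops at the first line that consists of a closing parenthesis, which is why the number of entries in the dict does not matter.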

BTW, processing Python files (and HTML, XML, JSON... files) with regular expressions is not wise: for every regular expression you write, there are cases where it will not work. Use a parser designed for the given format; for Python source code that is ast. For your use case, though, ast may be too complex.

Maybe it would be better to use classic config files and configparser. More structured data like lists and dicts can be easily stored in JSON or YAML files.
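For instance, if the patterns could live in a data file rather than in Python source, reading them would need no parsing tricks at all. A small sketch using the standard json module follows; the file layout is an assumption and not something taken from the original config.py. Note that backslashes have to be doubled in JSON.

```python
import json

# Hypothetical contents of an animals.json file
json_text = r'''
{
    "animal": {
        "bear": "^bear4x",
        "tiger": "^.*\\tiger\\b.*$"
    }
}
'''

config = json.loads(json_text)
animal = config["animal"]

print(animal["bear"])
print(animal["tiger"])
```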

Question 11

Trying to parse an unstructured text file with regular expressions is extremely difficult, as in this case.

Question 12

Oh, now I see he is actually not trying to match newline characters... Will update the answer.

Question 13

i want to be able to grab the whole dict () named animal

Question 14

@ealeon Then you cannot use text.readlines(). Read the whole contents of the file into a single string with text.read().

Arpegius 5,91741 silver badges56 bronze badges · Accepted Answer · 2013-08-27 20:38:58Z

