Parsing indented text and inserting its data into a tree

Question 1

My goal is to parse indented text in the style of python and YAML.

This only find the parent of each line.

This bit of code seems to do the trick, but I'm not really satisfied and I wanted to know if you would do this another way.

raw = """animal
 carnivorous
 tiger
 lion
 vegetarian
 cow
 sheep
plant
 algea
 tree
 leaf
 pine
 fungus
 good
 bad
 evil
 mean
 cactus
 big
 small"""
lines = raw.split('\n')
indents = [(0,0,'root')]
for a in raw.split('\n'):
 indent = 0
 while a[indent] == ' ': indent+=1
 if indent % 4:
 print("not multiple of 4")
 break
 indents.append((len(indents), int(indent/4)+1,a.replace(' ','')))
for a in indents: print(a)
stack=[indents[0]]
entries =[indents[0]]
prev_indent = 0
for item in indents[1:]:
 print("#########################")
 id, indent, name = item
 diff = indent - prev_indent
 print(item)
 print("diff",diff, [a[2] for a in stack])
 if diff>0:
 entries.append(item+(stack[-1][2],))
 elif diff<0:
 # entries.append(item+(stack[-diff][2],))
 count = -diff
 while count>-1: stack.pop();count-=1
 entries.append(item+(stack[-1][2],))
 elif diff==0:
 stack.pop()
 entries.append(item+(stack[-1][2],))
 stack.append(item)
 prev_indent = entries[-1][1]
 print("result", entries[-1])
print("########################")
for a in entries:
 if len (a) == 3: continue
 ident, level, name, parent = a
 print(level*' '*4, name, '(', parent, ')')

This results in this (the name in parenthesis is the parent):

animal ( root )
 carnivorous ( animal )
 tiger ( carnivorous )
 lion ( carnivorous )
 vegetarian ( animal )
 cow ( vegetarian )
 sheep ( vegetarian )
plant ( root )
 algea ( plant )
 tree ( plant )
 leaf ( tree )
 pine ( tree )
 fungus ( plant )
 good ( fungus )
 bad ( fungus )
 evil ( bad )
 mean ( bad )
 cactus ( plant )
 big ( cactus )
 small ( cactus )

Question 2

Is there a way to change this pattern re.compile(r'^(?P<indent>(?: {4})*)(?P<name>\S.*)') you suggest to get rid of empty lines too? I am using it to read a text file and this file has empty lines causing error in the code you sugest.

Question 3

You should be able to accomplish this task by making one linear pass through the lines, instead of making one pass to build indents and a second pass to build entries.

It's good practice to package your code into functions, and to write docstrings for them. In particular, if you have code that follows the pattern

outputs = []
for item in inputs:
 outputs.append(...)

... then consider writing a generator instead.

When you detect indentation that is not a multiple of 4 spaces, you print a message and stop building indents, but other than that, you still allow the program to proceed normally. The program should probably abort at that point, and I suggest doing so by raising an exception. Furthermore, I consider indentation that is suddenly excessively deep (e.g. going from 1 level to 3 levels of indentation) to be another kind of error that should be detected.

I don't like the way you handle the special case of the root node. In particular, having a non-uniform tuple length is asking for trouble — it is basically data that is not of the same type. I would avoid making the root node part of the data structure altogether, so that you don't have to write an exclusion for this special case:

for a in entries:
 if len (a) == 3: continue

Analyzing the text one character at a time (using while a[indent] == ' ': indent+=1) feels tedious. I suggest using regular expressions to describe what kind of text you are expecting. For example,

re.compile(r'^(?P<indent>(?: {4})*)(?P<name>\S.*)')

... says that you are looking for indentation at the beginning of the line that is a multiple of four spaces, followed by a name that starts with a non-space character.

Suggested solution

import re
def parse_tree(lines):
 """
 Parse an indented outline into (level, name, parent) tuples. Each level
 of indentation is 4 spaces.
 """
 regex = re.compile(r'^(?P<indent>(?: {4})*)(?P<name>\S.*)')
 stack = []
 for line in lines:
 match = regex.match(line)
 if not match:
 raise ValueError(
 'Indentation not a multiple of 4 spaces: "{0}"'.format(line)
 )
 level = len(match.group('indent')) // 4
 if level > len(stack):
 raise ValueError('Indentation too deep: "{0}"'.format(line))
 stack[level:] = [match.group('name')]
 yield level, match.group('name'), (stack[level - 1] if level else None)
raw = """...""" 
for level, name, parent in parse_tree(raw.split('\n')):
 print('{0}{1} ( {2} )'.format(' ' * (4 * level), name, parent or 'root'))

Question 4

Thanks. Please, could you comment or expand on the 2 last line in parse_tree() ?

Question 5

The penultimate line pops items from the stack as necessary, then pushes one name onto the stack. The last line of code produces the (level, name, parent) tuple result for that line of text.

200_success 200_success 146k22 gold badges190 silver badges479 bronze badges · Accepted Answer · 2017-09-23 20:22:38Z

You should be able to accomplish this task by making one linear pass through the lines, instead of making one pass to build indents and a second pass to build entries.

It's good practice to package your code into functions, and to write docstrings for them. In particular, if you have code that follows the pattern

outputs = []
for item in inputs:
 outputs.append(...)

... then consider writing a generator instead.

When you detect indentation that is not a multiple of 4 spaces, you print a message and stop building indents, but other than that, you still allow the program to proceed normally. The program should probably abort at that point, and I suggest doing so by raising an exception. Furthermore, I consider indentation that is suddenly excessively deep (e.g. going from 1 level to 3 levels of indentation) to be another kind of error that should be detected.

I don't like the way you handle the special case of the root node. In particular, having a non-uniform tuple length is asking for trouble — it is basically data that is not of the same type. I would avoid making the root node part of the data structure altogether, so that you don't have to write an exclusion for this special case:

for a in entries:
 if len (a) == 3: continue

Analyzing the text one character at a time (using while a[indent] == ' ': indent+=1) feels tedious. I suggest using regular expressions to describe what kind of text you are expecting. For example,

re.compile(r'^(?P<indent>(?: {4})*)(?P<name>\S.*)')

... says that you are looking for indentation at the beginning of the line that is a multiple of four spaces, followed by a name that starts with a non-space character.

Stack Exchange Network

Parsing indented text and inserting its data into a tree

1 Answer 1

Suggested solution

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Parsing indented text and inserting its data into a tree

1 Answer 1

Suggested solution

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions