8
\$\begingroup\$

My goal is to parse indented text in the style of python and YAML.

This only find the parent of each line.

This bit of code seems to do the trick, but I'm not really satisfied and I wanted to know if you would do this another way.

raw = """animal
 carnivorous
 tiger
 lion
 vegetarian
 cow
 sheep
plant
 algea
 tree
 leaf
 pine
 fungus
 good
 bad
 evil
 mean
 cactus
 big
 small"""
lines = raw.split('\n')
indents = [(0,0,'root')]
for a in raw.split('\n'):
 indent = 0
 while a[indent] == ' ': indent+=1
 if indent % 4:
 print("not multiple of 4")
 break
 indents.append((len(indents), int(indent/4)+1,a.replace(' ','')))
for a in indents: print(a)
stack=[indents[0]]
entries =[indents[0]]
prev_indent = 0
for item in indents[1:]:
 print("#########################")
 id, indent, name = item
 diff = indent - prev_indent
 print(item)
 print("diff",diff, [a[2] for a in stack])
 if diff>0:
 entries.append(item+(stack[-1][2],))
 elif diff<0:
 # entries.append(item+(stack[-diff][2],))
 count = -diff
 while count>-1: stack.pop();count-=1
 entries.append(item+(stack[-1][2],))
 elif diff==0:
 stack.pop()
 entries.append(item+(stack[-1][2],))
 stack.append(item)
 prev_indent = entries[-1][1]
 print("result", entries[-1])
print("########################")
for a in entries:
 if len (a) == 3: continue
 ident, level, name, parent = a
 print(level*' '*4, name, '(', parent, ')')

This results in this (the name in parenthesis is the parent):

animal ( root )
 carnivorous ( animal )
 tiger ( carnivorous )
 lion ( carnivorous )
 vegetarian ( animal )
 cow ( vegetarian )
 sheep ( vegetarian )
plant ( root )
 algea ( plant )
 tree ( plant )
 leaf ( tree )
 pine ( tree )
 fungus ( plant )
 good ( fungus )
 bad ( fungus )
 evil ( bad )
 mean ( bad )
 cactus ( plant )
 big ( cactus )
 small ( cactus )
200_success
145k22 gold badges190 silver badges478 bronze badges
asked Sep 23, 2017 at 18:03
\$\endgroup\$
1
  • \$\begingroup\$ Is there a way to change this pattern re.compile(r'^(?P<indent>(?: {4})*)(?P<name>\S.*)') you suggest to get rid of empty lines too? I am using it to read a text file and this file has empty lines causing error in the code you sugest. \$\endgroup\$ Commented Apr 12, 2020 at 17:22

1 Answer 1

9
\$\begingroup\$

You should be able to accomplish this task by making one linear pass through the lines, instead of making one pass to build indents and a second pass to build entries.


It's good practice to package your code into functions, and to write docstrings for them. In particular, if you have code that follows the pattern

outputs = []
for item in inputs:
 outputs.append(...)

... then consider writing a generator instead.


When you detect indentation that is not a multiple of 4 spaces, you print a message and stop building indents, but other than that, you still allow the program to proceed normally. The program should probably abort at that point, and I suggest doing so by raising an exception. Furthermore, I consider indentation that is suddenly excessively deep (e.g. going from 1 level to 3 levels of indentation) to be another kind of error that should be detected.


I don't like the way you handle the special case of the root node. In particular, having a non-uniform tuple length is asking for trouble — it is basically data that is not of the same type. I would avoid making the root node part of the data structure altogether, so that you don't have to write an exclusion for this special case:

for a in entries:
 if len (a) == 3: continue

Analyzing the text one character at a time (using while a[indent] == ' ': indent+=1) feels tedious. I suggest using regular expressions to describe what kind of text you are expecting. For example,

re.compile(r'^(?P<indent>(?: {4})*)(?P<name>\S.*)')

... says that you are looking for indentation at the beginning of the line that is a multiple of four spaces, followed by a name that starts with a non-space character.


Suggested solution

import re
def parse_tree(lines):
 """
 Parse an indented outline into (level, name, parent) tuples. Each level
 of indentation is 4 spaces.
 """
 regex = re.compile(r'^(?P<indent>(?: {4})*)(?P<name>\S.*)')
 stack = []
 for line in lines:
 match = regex.match(line)
 if not match:
 raise ValueError(
 'Indentation not a multiple of 4 spaces: "{0}"'.format(line)
 )
 level = len(match.group('indent')) // 4
 if level > len(stack):
 raise ValueError('Indentation too deep: "{0}"'.format(line))
 stack[level:] = [match.group('name')]
 yield level, match.group('name'), (stack[level - 1] if level else None)
raw = """...""" 
for level, name, parent in parse_tree(raw.split('\n')):
 print('{0}{1} ( {2} )'.format(' ' * (4 * level), name, parent or 'root'))
answered Sep 23, 2017 at 20:22
\$\endgroup\$
2
  • \$\begingroup\$ Thanks. Please, could you comment or expand on the 2 last line in parse_tree() ? \$\endgroup\$ Commented Sep 23, 2017 at 21:20
  • 1
    \$\begingroup\$ The penultimate line pops items from the stack as necessary, then pushes one name onto the stack. The last line of code produces the (level, name, parent) tuple result for that line of text. \$\endgroup\$ Commented Sep 23, 2017 at 21:23

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.