My goal is to parse indented text in the style of Python and YAML.
So far, this only finds the parent of each line.
This bit of code seems to do the trick, but I'm not really satisfied with it, and I would like to know whether you would do this another way.
raw = """animal
    carnivorous
        tiger
        lion
    vegetarian
        cow
        sheep
plant
    algea
    tree
        leaf
        pine
    fungus
        good
        bad
            evil
            mean
    cactus
        big
        small"""
lines = raw.split('\n')
indents = [(0,0,'root')]
for a in raw.split('\n'):
    indent = 0
    while a[indent] == ' ': indent+=1
    if indent % 4:
        print("not multiple of 4")
        break
    indents.append((len(indents), int(indent/4)+1,a.replace(' ','')))
for a in indents: print(a)
stack=[indents[0]]
entries =[indents[0]]
prev_indent = 0
for item in indents[1:]:
    print("#########################")
    id, indent, name = item
    diff = indent - prev_indent
    print(item)
    print("diff",diff, [a[2] for a in stack])
    if diff>0:
        entries.append(item+(stack[-1][2],))
    elif diff<0:
        # entries.append(item+(stack[-diff][2],))
        count = -diff
        while count>-1: stack.pop();count-=1
        entries.append(item+(stack[-1][2],))
    elif diff==0:
        stack.pop()
        entries.append(item+(stack[-1][2],))
    stack.append(item)
    prev_indent = entries[-1][1]
    print("result", entries[-1])
print("########################")
for a in entries:
    if len (a) == 3: continue
    ident, level, name, parent = a
    print(level*' '*4, name, '(', parent, ')')
This produces the following output (the name in parentheses is the parent):
     animal ( root )
         carnivorous ( animal )
             tiger ( carnivorous )
             lion ( carnivorous )
         vegetarian ( animal )
             cow ( vegetarian )
             sheep ( vegetarian )
     plant ( root )
         algea ( plant )
         tree ( plant )
             leaf ( tree )
             pine ( tree )
         fungus ( plant )
             good ( fungus )
             bad ( fungus )
                 evil ( bad )
                 mean ( bad )
         cactus ( plant )
             big ( cactus )
             small ( cactus )
1 Answer
You should be able to accomplish this task by making one linear pass through the lines, instead of making one pass to build indents and a second pass to build entries.
It's good practice to package your code into functions, and to write docstrings for them. In particular, if you have code that follows the pattern
outputs = []
for item in inputs:
    outputs.append(...)
... then consider writing a generator instead.
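As a minimal sketch of that transformation, here is the same loop written both ways. The function names squares_list and squares_gen are hypothetical, chosen only for illustration; they are not part of the code under review.

```python
def squares_list(inputs):
    """Accumulate every result in a list, then return the whole list."""
    outputs = []
    for item in inputs:
        outputs.append(item * item)
    return outputs


def squares_gen(inputs):
    """Yield each result as soon as it is computed; no intermediate list."""
    for item in inputs:
        yield item * item


# Both produce the same values; the generator produces them lazily.
same = squares_list([1, 2, 3]) == list(squares_gen([1, 2, 3]))
print(same)
```

The caller can still obtain a list by wrapping the generator in list(), but it can also consume results one at a time in a for loop.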
When you detect indentation that is not a multiple of 4 spaces, you print a message and stop building indents, but other than that, you still allow the program to proceed normally. The program should probably abort at that point, and I suggest doing so by raising an exception. Furthermore, I consider indentation that is suddenly excessively deep (e.g. going from 1 level to 3 levels of indentation) to be another kind of error that should be detected.
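Both checks could be sketched in isolation roughly as follows. The helper indent_level and its parameters previous_level and width are assumptions made up for this illustration; the point is only that a bad line raises ValueError instead of printing and carrying on.

```python
def indent_level(line, previous_level, width=4):
    """Return the indentation level of line, or raise ValueError.

    previous_level is the level of the preceding line (use -1 before the
    first line, so that the first line must be at level 0).
    """
    spaces = len(line) - len(line.lstrip(' '))
    if spaces % width:
        raise ValueError(
            'Indentation not a multiple of {0} spaces: "{1}"'.format(width, line)
        )
    level = spaces // width
    # A line may be at most one level deeper than the line before it.
    if level > previous_level + 1:
        raise ValueError('Indentation too deep: "{0}"'.format(line))
    return level


try:
    # 12 spaces is level 3, which is too deep directly after a level-1 line.
    indent_level('            evil', previous_level=1)
except ValueError as error:
    print(error)
```

Because the error propagates as an exception, the caller decides whether to abort, report the offending line, or recover.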
I don't like the way you handle the special case of the root node. In particular, having a non-uniform tuple length is asking for trouble — it is basically data that is not of the same type. I would avoid making the root node part of the data structure altogether, so that you don't have to write an exclusion for this special case:
for a in entries:
    if len (a) == 3: continue
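One way to keep every record the same shape is to give top-level entries a parent of None rather than storing a fake root row. The sketch below uses a hypothetical namedtuple called Entry and a hand-written sample list; neither comes from the original code.

```python
from collections import namedtuple

# Hypothetical record type: every entry has exactly the same three fields.
Entry = namedtuple('Entry', ['level', 'name', 'parent'])

entries = [
    Entry(0, 'animal', None),           # top level: no parent, no root row
    Entry(1, 'carnivorous', 'animal'),
    Entry(2, 'tiger', 'carnivorous'),
]

for entry in entries:
    # No length check is needed; 'root' is only a display label for None.
    print('{0}{1} ( {2} )'.format(
        ' ' * (4 * entry.level), entry.name, entry.parent or 'root'))
```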
Analyzing the text one character at a time (using while a[indent] == ' ': indent+=1) feels tedious. I suggest using regular expressions to describe what kind of text you are expecting. For example,
re.compile(r'^(?P<indent>(?: {4})*)(?P<name>\S.*)')
... says that you are looking for indentation at the beginning of the line that is a multiple of four spaces, followed by a name that starts with a non-space character.
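To see how that pattern behaves, the short experiment below runs it against a few sample lines. The sample list is made up for illustration. A line whose indentation is not a multiple of four spaces, or that is empty, simply fails to match.

```python
import re

regex = re.compile(r'^(?P<indent>(?: {4})*)(?P<name>\S.*)')

# Illustrative inputs: three valid lines, one with 3 spaces, one empty.
samples = ['animal', '    carnivorous', '        tiger', '   lion', '']

results = []
for line in samples:
    match = regex.match(line)
    if match:
        # Level is the number of 4-space groups captured as indentation.
        results.append((len(match.group('indent')) // 4, match.group('name')))
    else:
        results.append(None)

print(results)
# [(0, 'animal'), (1, 'carnivorous'), (2, 'tiger'), None, None]
```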
Suggested solution
import re

def parse_tree(lines):
    """
    Parse an indented outline into (level, name, parent) tuples.  Each level
    of indentation is 4 spaces.
    """
    regex = re.compile(r'^(?P<indent>(?: {4})*)(?P<name>\S.*)')
    stack = []
    for line in lines:
        match = regex.match(line)
        if not match:
            raise ValueError(
                'Indentation not a multiple of 4 spaces: "{0}"'.format(line)
            )
        level = len(match.group('indent')) // 4
        if level > len(stack):
            raise ValueError('Indentation too deep: "{0}"'.format(line))
        stack[level:] = [match.group('name')]
        yield level, match.group('name'), (stack[level - 1] if level else None)

raw = """..."""
for level, name, parent in parse_tree(raw.split('\n')):
    print('{0}{1} ( {2} )'.format(' ' * (4 * level), name, parent or 'root'))
- Thanks. Could you please comment or expand on the last two lines of parse_tree()? – jokoon, Sep 23, 2017 at 21:20
- The penultimate line pops items from the stack as necessary, then pushes one name onto the stack. The last line of code produces the (level, name, parent) tuple result for that line of text. – 200_success, Sep 23, 2017 at 21:23
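The pop-then-push behaviour of the slice assignment can be traced with a small stand-alone sketch. The (level, name) pairs below are hand-picked for illustration, and the history list exists only to record each step.

```python
stack = []
history = []

steps = [(0, 'animal'), (1, 'carnivorous'), (2, 'tiger'),
         (2, 'lion'), (1, 'vegetarian'), (0, 'plant')]

for level, name in steps:
    # Replace everything from index `level` onward with the new name:
    # this drops any deeper (or same-level) names, then pushes this one.
    stack[level:] = [name]
    # The parent is the name one level up, if there is one.
    parent = stack[level - 1] if level else None
    history.append((list(stack), parent))

for snapshot, parent in history:
    print(snapshot, parent)
```

After 'lion' the stack is ['animal', 'carnivorous', 'lion'], and after 'vegetarian' it shrinks to ['animal', 'vegetarian'], so the stack always holds the chain of ancestors of the current line.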
re.compile(r'^(?P<indent>(?: {4})*)(?P<name>\S.*)')
you suggest to get rid of empty lines too? I am using it to read a text file, and this file has empty lines that cause an error in the code you suggest.
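On the empty-line question: the pattern requires a non-space character (\S), so an empty or whitespace-only line does not match and parse_tree() raises ValueError. One possible approach, which is not part of the answer above, is to filter such lines out before parsing rather than change the regular expression. The helper name skip_blank_lines and the sample text are hypothetical.

```python
def skip_blank_lines(lines):
    """Yield only the lines that contain a non-whitespace character."""
    for line in lines:
        if line.strip():
            yield line


# Sample text with an empty line, a whitespace-only line, and a trailing newline.
text = 'animal\n\n    cow\n   \nplant\n'
kept = list(skip_blank_lines(text.split('\n')))
print(kept)
# ['animal', '    cow', 'plant']
```

With this in place, the call would become parse_tree(skip_blank_lines(raw.split('\n'))), leaving the parser itself unchanged.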