More efficient conversion of tab-formatted nested data structure to json

Question 1

I have data that looks like the "Input" below and need to convert it into JSON. My solution works by parsing the text to find a level for each data point. Then I use a recursive structure to build a JSON tree (or maybe its not JSON, but its much more useful than the original format).

First, I transform the input in the following way.

Input:

person:
 address:
 street1: 123 Bar St
 street2: 
 city: Madison
 state: WI
 zip: 55555
 web:
 email: [email protected]

First-step output:

[{'name':'person','value':'','level':0},
 {'name':'address','value':'','level':1},
 {'name':'street1','value':'123 Bar St','level':2},
 {'name':'street2','value':'','level':2},
 {'name':'city','value':'Madison','level':2},
 {'name':'state','value':'WI','level':2},
 {'name':'zip','value':55555,'level':2},
 {'name':'web','value':'','level':1},
 {'name':'email','value':'[email protected]','level':2}]

This is easy to accomplish with split(':') and by counting the number of leading tabs:

def tab_level(astr):
 """Count number of leading tabs in a string
 """
 return len(astr)- len(astr.lstrip('\t'))

Then I feed the first-step output into the following function:

def ttree_to_json(ttree,level=0):
 result = {}
 for i in range(0,len(ttree)):
 cn = ttree[i]
 try:
 nn = ttree[i+1]
 except:
 nn = {'level':-1}
 # Edge cases
 if cn['level']>level:
 continue
 if cn['level']<level:
 return result
 # Recursion
 if nn['level']==level:
 dict_insert_or_append(result,cn['name'],cn['value'])
 elif nn['level']>level:
 rr = ttree_to_json(ttree[i+1:], level=nn['level'])
 dict_insert_or_append(result,cn['name'],rr)
 else:
 dict_insert_or_append(result,cn['name'],cn['value'])
 return result
 return result

where:

def dict_insert_or_append(adict,key,val):
 """Insert a value in dict at key if one does not exist
 Otherwise, convert value to list and append
 """
 if key in adict:
 if type(adict[key]) != list:
 adict[key] = [adict[key]]
 adict[key].append(val)
 else:
 adict[key] = val

The approach is redundant and therefore inefficient. I also wonder whether the solution is robust (for example, I had to modify the code to accommodate repeated tags). Think of the Input above as a formatting for SGML. Any suggestions for improvement would be greatly appreciated!

Question 2

Is the format of the file constant, meaning there will always be those attributes present. For example, person will always have address and web and address will always have those property? If so then you can just read person by person assuming file format doesn't change

Question 3

I can see how that would work for addresses. Unfortunately, the example is just an example. I need something that works for arbitrary tags.

Question 4

Is there some sort of structure to the input. Are the data field always tab separated such that the tabbed line belongs to the parent above with one less tab?

Question 5

Ah yes. Sorry if that was not clear. The data is structured such that tabs denote a parent-child relationship. Similar to the structure of python code.

Question 6

Just want to paint the problem clearer. Is 'person' the key or the value? In your example you parse it as name=person with value=''

Question 7

I have not programmed in python but you should be able to do this in one shot. In pseudo-code it should be something like so:

function parseJsonInput (file)
 var parsedJson= {};
 var parentStack = [parsedJson]
 for each line in file
 var data = parseLine(line) //return key,value,level. null if not avail for each field
 //leaf item process it by adding it to its current parent
 if data.value is not null
 var currentParent = parentStack.getLastElement()
 currentParent[data.key] = data.value
 var nextLineLevel = parseLine( peekNextLine() ).level; //peek next line level
 if nextLineLevel = data.level - 1
 parentStack.pop() //done processing child, about to go back to parent
 else 
 //group node, push it as the new parent and keep on processing. 
 //created more variable than needed for clarity
 var currentParent = parentStack.getLastElement()
 currentParent[data.key] = {}
 var newParent = currentParent[data.key]
 parentStack.push( newParent )
 endfor
 return parsedJson;
end function

I haven't tried that but that should work give or take few bugs. But the basic idea as I mentioned in the comment is to transverse the file iteratively as a tree structure using iterative post-tree-transversal method. I'm not sure if the 'peekNextLine' is available to you, if not then you would need another variable to keep track of the last level processed and strategically insert that logic there -- I started doing this, but figured it might be confusing at first. Let me know if there is some issues. I can help. If I have time I can even write a quick javascript version for you to learn from.

Question 8

Thank you user\d+. Using a stack is probably a better solution... and more transparent! Still, I am curious about an improved recursive solution. Is it possible to achieve the same level of efficiency with a recursive solution?

Question 9

Can you please convert it to Javascript code

Question 10

@LaxmikantDange sorry bud no incentive for me

Question 11

Practical tips

Instead of:

try:
 nn = ttree[i+1]
except:
 nn = {'level': -1}

This should have been:

if i + 1 < len(ttree):
 nn = ttree[i + 1]
else:
 nn = {'level': -1}

try/except is for handling anomalies: things that shouldn't happen under normal circumstances. Take for example an inventory service, with an API method get_item(item_id). Callers of the API are expected to use valid item ids. If a user asks for a non-existent item that's an anomaly. So when loading the item with item_id from the backend storage, you would assume the query will work, and handle the case of non-existent items with a try/except.

In your program, the case of i + 1 >= len(ttree) happens during normal operations, when processing the last line of input. This is not an anomaly, as it happens for perfectly legal inputs too. This is a case for checking with an if instead of try/except.

Abusing try/except can sometimes hurt performance too. One time in the past I misunderstood the "ask forgiveness not permission" rule and changed many of my ifs to try/except in a program where this would happen a few dozens of times. During my tests, the speed difference was noticeable even without a timer. try/except is the right tool for handling anomalies, things that are not supposed to happen during normal operations.

Don't check for the type of an object like this:

if type(adict[key]) != list:

Use the isinstance built-in function instead:

if isinstance(adict[key], list):

Instead of range(0, n) you can write range(n), it's the same thing.

Question 12

Thanks for your feedback @janos! I was not aware of the isinstance() method, thank you for pointing that out. Regarding your first tip, I was under the impression that asking for forgiveness is more pythonic. See, for example, this question link. Alex Martelli does it too, see link. Does try-except really make programs run slower? Do you have a source to support that assertion?

Question 13

Alex Martelli says try/except is for "anomalous situations", and that's exactly what I meant with "for handling exceptional events that shouldn't happen under normal circumstances". If I understand your program correctly, it will pass through the except during normal operations, for example when you parse the example input, which is a valid input, and there is no anomaly here.

Question 14

Thanks @janos! I found another example of exceptions being used in non-anomalous situations... Python's iter protocol. See, for example, this tutorial.

Question 15

I suppose you're talking about the implementation of the take function? That's still an anomalous situation: trying to take n items from a sequence that has less than n items doesn't really make sense, so that's kind of an anomaly.

dchhetri dchhetri 1564 bronze badges · Accepted Answer · 2014-07-26 05:36:26Z

I have not programmed in python but you should be able to do this in one shot. In pseudo-code it should be something like so:

function parseJsonInput (file)
 var parsedJson= {};
 var parentStack = [parsedJson]
 for each line in file
 var data = parseLine(line) //return key,value,level. null if not avail for each field
 //leaf item process it by adding it to its current parent
 if data.value is not null
 var currentParent = parentStack.getLastElement()
 currentParent[data.key] = data.value
 var nextLineLevel = parseLine( peekNextLine() ).level; //peek next line level
 if nextLineLevel = data.level - 1
 parentStack.pop() //done processing child, about to go back to parent
 else 
 //group node, push it as the new parent and keep on processing. 
 //created more variable than needed for clarity
 var currentParent = parentStack.getLastElement()
 currentParent[data.key] = {}
 var newParent = currentParent[data.key]
 parentStack.push( newParent )
 endfor
 return parsedJson;
end function

I haven't tried that but that should work give or take few bugs. But the basic idea as I mentioned in the comment is to transverse the file iteratively as a tree structure using iterative post-tree-transversal method. I'm not sure if the 'peekNextLine' is available to you, if not then you would need another variable to keep track of the last level processed and strategically insert that logic there -- I started doing this, but figured it might be confusing at first. Let me know if there is some issues. I can help. If I have time I can even write a quick javascript version for you to learn from.

Thank you user\d+. Using a stack is probably a better solution... and more transparent! Still, I am curious about an improved recursive solution. Is it possible to achieve the same level of efficiency with a recursive solution?

Stack Exchange Network

More efficient conversion of tab-formatted nested data structure to json

2 Answers 2

Practical tips

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

More efficient conversion of tab-formatted nested data structure to json

2 Answers 2

Practical tips

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions