3
\$\begingroup\$

I have data that looks like the "Input" below and need to convert it into JSON. My solution works by parsing the text to find a level for each data point. Then I use a recursive structure to build a JSON tree (or maybe its not JSON, but its much more useful than the original format).

First, I transform the input in the following way.

Input:

person:
 address:
 street1: 123 Bar St
 street2: 
 city: Madison
 state: WI
 zip: 55555
 web:
 email: [email protected]

First-step output:

[{'name':'person','value':'','level':0},
 {'name':'address','value':'','level':1},
 {'name':'street1','value':'123 Bar St','level':2},
 {'name':'street2','value':'','level':2},
 {'name':'city','value':'Madison','level':2},
 {'name':'state','value':'WI','level':2},
 {'name':'zip','value':55555,'level':2},
 {'name':'web','value':'','level':1},
 {'name':'email','value':'[email protected]','level':2}]

This is easy to accomplish with split(':') and by counting the number of leading tabs:

def tab_level(astr):
 """Count number of leading tabs in a string
 """
 return len(astr)- len(astr.lstrip('\t'))

Then I feed the first-step output into the following function:

def ttree_to_json(ttree,level=0):
 result = {}
 for i in range(0,len(ttree)):
 cn = ttree[i]
 try:
 nn = ttree[i+1]
 except:
 nn = {'level':-1}
 # Edge cases
 if cn['level']>level:
 continue
 if cn['level']<level:
 return result
 # Recursion
 if nn['level']==level:
 dict_insert_or_append(result,cn['name'],cn['value'])
 elif nn['level']>level:
 rr = ttree_to_json(ttree[i+1:], level=nn['level'])
 dict_insert_or_append(result,cn['name'],rr)
 else:
 dict_insert_or_append(result,cn['name'],cn['value'])
 return result
 return result

where:

def dict_insert_or_append(adict,key,val):
 """Insert a value in dict at key if one does not exist
 Otherwise, convert value to list and append
 """
 if key in adict:
 if type(adict[key]) != list:
 adict[key] = [adict[key]]
 adict[key].append(val)
 else:
 adict[key] = val

The approach is redundant and therefore inefficient. I also wonder whether the solution is robust (for example, I had to modify the code to accommodate repeated tags). Think of the Input above as a formatting for SGML. Any suggestions for improvement would be greatly appreciated!

asked Jul 26, 2014 at 4:03
\$\endgroup\$
6
  • \$\begingroup\$ Is the format of the file constant, meaning there will always be those attributes present. For example, person will always have address and web and address will always have those property? If so then you can just read person by person assuming file format doesn't change \$\endgroup\$ Commented Jul 26, 2014 at 4:34
  • \$\begingroup\$ I can see how that would work for addresses. Unfortunately, the example is just an example. I need something that works for arbitrary tags. \$\endgroup\$ Commented Jul 26, 2014 at 4:50
  • \$\begingroup\$ Is there some sort of structure to the input. Are the data field always tab separated such that the tabbed line belongs to the parent above with one less tab? \$\endgroup\$ Commented Jul 26, 2014 at 4:55
  • \$\begingroup\$ Ah yes. Sorry if that was not clear. The data is structured such that tabs denote a parent-child relationship. Similar to the structure of python code. \$\endgroup\$ Commented Jul 26, 2014 at 4:59
  • \$\begingroup\$ Just want to paint the problem clearer. Is 'person' the key or the value? In your example you parse it as name=person with value='' \$\endgroup\$ Commented Jul 26, 2014 at 5:05

2 Answers 2

1
\$\begingroup\$

I have not programmed in python but you should be able to do this in one shot. In pseudo-code it should be something like so:

function parseJsonInput (file)
 var parsedJson= {};
 var parentStack = [parsedJson]
 for each line in file
 var data = parseLine(line) //return key,value,level. null if not avail for each field
 //leaf item process it by adding it to its current parent
 if data.value is not null
 var currentParent = parentStack.getLastElement()
 currentParent[data.key] = data.value
 var nextLineLevel = parseLine( peekNextLine() ).level; //peek next line level
 if nextLineLevel = data.level - 1
 parentStack.pop() //done processing child, about to go back to parent
 else 
 //group node, push it as the new parent and keep on processing. 
 //created more variable than needed for clarity
 var currentParent = parentStack.getLastElement()
 currentParent[data.key] = {}
 var newParent = currentParent[data.key]
 parentStack.push( newParent )
 endfor
 return parsedJson;
end function

I haven't tried that but that should work give or take few bugs. But the basic idea as I mentioned in the comment is to transverse the file iteratively as a tree structure using iterative post-tree-transversal method. I'm not sure if the 'peekNextLine' is available to you, if not then you would need another variable to keep track of the last level processed and strategically insert that logic there -- I started doing this, but figured it might be confusing at first. Let me know if there is some issues. I can help. If I have time I can even write a quick javascript version for you to learn from.

answered Jul 26, 2014 at 5:36
\$\endgroup\$
3
  • \$\begingroup\$ Thank you user\d+. Using a stack is probably a better solution... and more transparent! Still, I am curious about an improved recursive solution. Is it possible to achieve the same level of efficiency with a recursive solution? \$\endgroup\$ Commented Jul 27, 2014 at 2:32
  • \$\begingroup\$ Can you please convert it to Javascript code \$\endgroup\$ Commented Apr 25, 2015 at 11:48
  • \$\begingroup\$ @LaxmikantDange sorry bud no incentive for me \$\endgroup\$ Commented May 25, 2015 at 2:33
1
\$\begingroup\$

Practical tips

Instead of:

try:
 nn = ttree[i+1]
except:
 nn = {'level': -1}

This should have been:

if i + 1 < len(ttree):
 nn = ttree[i + 1]
else:
 nn = {'level': -1}

try/except is for handling anomalies: things that shouldn't happen under normal circumstances. Take for example an inventory service, with an API method get_item(item_id). Callers of the API are expected to use valid item ids. If a user asks for a non-existent item that's an anomaly. So when loading the item with item_id from the backend storage, you would assume the query will work, and handle the case of non-existent items with a try/except.

In your program, the case of i + 1 >= len(ttree) happens during normal operations, when processing the last line of input. This is not an anomaly, as it happens for perfectly legal inputs too. This is a case for checking with an if instead of try/except.

Abusing try/except can sometimes hurt performance too. One time in the past I misunderstood the "ask forgiveness not permission" rule and changed many of my ifs to try/except in a program where this would happen a few dozens of times. During my tests, the speed difference was noticeable even without a timer. try/except is the right tool for handling anomalies, things that are not supposed to happen during normal operations.


Don't check for the type of an object like this:

if type(adict[key]) != list:

Use the isinstance built-in function instead:

if isinstance(adict[key], list):

Instead of range(0, n) you can write range(n), it's the same thing.

answered Jul 26, 2014 at 19:30
\$\endgroup\$
4
  • \$\begingroup\$ Thanks for your feedback @janos! I was not aware of the isinstance() method, thank you for pointing that out. Regarding your first tip, I was under the impression that asking for forgiveness is more pythonic. See, for example, this question link. Alex Martelli does it too, see link. Does try-except really make programs run slower? Do you have a source to support that assertion? \$\endgroup\$ Commented Jul 27, 2014 at 2:29
  • \$\begingroup\$ Alex Martelli says try/except is for "anomalous situations", and that's exactly what I meant with "for handling exceptional events that shouldn't happen under normal circumstances". If I understand your program correctly, it will pass through the except during normal operations, for example when you parse the example input, which is a valid input, and there is no anomaly here. \$\endgroup\$ Commented Jul 27, 2014 at 4:51
  • \$\begingroup\$ Thanks @janos! I found another example of exceptions being used in non-anomalous situations... Python's iter protocol. See, for example, this tutorial. \$\endgroup\$ Commented Aug 10, 2014 at 2:28
  • \$\begingroup\$ I suppose you're talking about the implementation of the take function? That's still an anomalous situation: trying to take n items from a sequence that has less than n items doesn't really make sense, so that's kind of an anomaly. \$\endgroup\$ Commented Aug 10, 2014 at 6:43

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.