I need to parse a simple JSON string (flat JSON, no hierarchy) for keys and values, and there is a system constraint that I cannot use any built-in JSON library and can only read a string once due to latency requirements. I need to use Python 2.7.x series and cannot use a higher version.
Currently, my major concern for below code is, I still do a bit more than one pass string parsing, since I need to go backward/forward to remove unnecessary characters around a word, which is what wordBeautify
does.
This is a follow-up to "Json String Parsing"
def wordBeautify(word1, beginIndex, endIndex):
noMeaningChars=[' ', '"', '{', ',', ':','}']
while word1[beginIndex] in noMeaningChars:
beginIndex+=1
while word1[endIndex] in noMeaningChars:
endIndex-=1
return (beginIndex,endIndex+1)
def parseElegant2(str1):
keyStr=''
valueStr=''
beginWord = False
isKey=True
beginIndex=0
endIndex=0
for i in range(len(str1)):
if str1[i]==':':
endIndex=i-1
(x,y) = wordBeautify(str1, beginIndex, endIndex)
keyStr=str1[x:y]
print "key string " + keyStr
beginIndex=i+1
elif str1[i]==',' or str1[i]=='}':
endIndex=i
(x,y) = wordBeautify(str1, beginIndex, endIndex)
valueStr=str1[x:y]
beginIndex=i+1
print "value string " + valueStr
print 'key and value, '+ keyStr, valueStr
def parseElegant(str1):
keyStr=''
valueStr=''
beginWord = False
isKey=True
beginIndex=0
endIndex=0
for i in range(len(str1)):
if str1[i] == '"' and beginWord==False:
beginWord=True
beginIndex = i+1
elif str1[i] == '"' and beginWord==True:
beginWord=False
endIndex=i
if beginIndex<endIndex:
print "get word, " + str1[beginIndex:endIndex]
elif str1[i]==':':
keyStr=str1[beginIndex:endIndex]
elif str1[i]==',':
valueStr=str1[beginIndex:endIndex]
print 'key and value, '+ keyStr, valueStr
if __name__ == "__main__":
#JSONString = '{ "id": 1, "name": "A green door", "price": 12.50, "tags": ["home", "green"]}'
JSONString1 = '{ "id": "1", "name": "A green door", "price": "12.50", "tags": "home green"}'
JSONString2 = '{ "id": 1, "name": "A green door", "price": 12.50, "tags": "home green"}'
parseElegant2(JSONString1)
parseElegant2(JSONString2)
2 Answers 2
First of all, let me say well done on the progress made from your first question to now!
Code Review is for reviewing code
As you seem to be seeking advice "about how to implement" an idea (i.e. streams), I think you may be better off asking on Programmers.SE, since that is a more appropriate place for questions about "software architecture and design" and "algorithm and data structure concepts". Code Review is "not for questions about broken code, hypothetical code, or non-existent code, as such questions will be closed as off-topic." (taken from this help page).
I can see that you have code already, and are seeking to improve it (in a way that doesn't add any functionality) - so I consider your question to be on-topic. Although your desire to learn about other ways of implementing it entirely may not be best placed on this site.
A Quick Review
Remove old code
You have included the (now unused) code for parseElegant
(version 1). While I can understand you not wanting to erase this, you have already linked to your previous question (which contains the old code). Including version 1 here just adds ~24 lines for reviewers to have to scroll through.
Can't you just use strip
?
The wordBeautify
function (which you are seeking to make redundant/remove) seems to be VERY similar to the Python built-in string.strip method?
Consider the following code:
test="{:foo }"
(x,y)=wordBeautify(test,0,len(test)-1)
a=test[x:y]
b=test.strip(' "{,:}')
Result: a
and b
both work out to be "foo"!
Use better variable names!
At present, your parseElegant2
function takes an argument called str1
. While it is obvious that this means something along the lines of the first string - it is not actually particularly helpful in knowing what that variable represents.
Something more like json
, text_candidate
, or candidate
may be better - as it expresses that this is a long text containing json, or something that is a candidate to be processed into JSON.
You have a similar issue in wordBeautify
, with word1
. This could just be named word
, or similar.
Return is better than print
I've noticed that in your function you are not return
'ing any values. Instead you are just print
'ing to the standard output (STDOUT).
Is this your goal? Or will your program eventually need to construct some kind of data structure (such as a dict
) of the JSON values?
I would recommend refactoring your code to return a data structure, and have your function write the JSON values into that data structure.
A word on string stream processing
As @ferada commented, I think you should "consider using a stream, looking only at the (single) next character at a time and then build a state machine around it.", however your current program is not doing this.
Instead, your loop is going through the indices of the string. And is (at present), also calling your wordBeautify
function to do the same (with indices).
Rather than using for i in range(len(str1)):
, as you currently are - I would recommend that you loop through the characters themselves (one at a time, as @ferada suggested). Thus, your loop may look more like: for c in str1:
This way, you are essentially forced into only looking at the current character. (i.e. you are unable to increment/decrement any number value to go forwards or backwards in the string.)
Constructing a state machine
From the Wikipedia article about Finite-state machines:
A finite-state machine (FSM) or simply state machine, is a mathematical model of computation used to design computer programs. It is conceived as an abstract machine that can be in one of any finite number of states. The machine is in only one state at a time, called the current state. It can change from one state to another when triggered, this is called a transition. A particular state machine is defined by a list of its states, and the triggering condition for each transition.
(Emphasis mine)
You can read the article for more background information, but essentially your are seeking to construct a state machine. As you can read in the last sentence (that I've bolded), this requires:
- a list of states
- the trigger conditions for each state transition
What are the states?
Luckily, the official JSON website (JSON.org) actually gives us these exact things! In the form of railroad diagrams. Specifically, they tell us that:
An object is an unordered set of name/value pairs. An object begins with { (left brace) and ends with } (right brace). Each name is followed by : (colon) and the name/value pairs are separated by , (comma).
An object is an unordered set of name/value pairs. An object begins with { (left brace) and ends with } (right brace)...
(source: json.org)
So we know that STRING is a state, for example. They also tell us how we get into (and out of) the STRING state:
A string is a sequence of zero or more Unicode characters, wrapped in double quotes...
A string is a sequence of zero or more Unicode characters, wrapped in double quotes...
(source: json.org)
So we know that a STRING state is triggered when we encounter a "
(quote) character. (And that it ends the same way.)
Note: For now, let's ignore the backslash (\
) escape style string components.
We also are told what a VALUE might be as well:
A value can be a string in double quotes, or a number, or true or false or null...
A value can be a string in double quotes, or a number, or true or false or null...
(source: json.org)
This gives us another state, namely NUMBER - since a value can be a number.
Some limitations (Let's not reinvent the whole wheel)
We already know that you only want to deal with "flat JSON (no hierarchy) for keys and values". So, we'll forget about OBJECT and ARRAY in the diagram above. ;-)
Let's also (just to make it easy), ignore the fact that a value can be true
, false
, or null
. (i.e. it is either a string or a number).
The states so far
So far, we have been able to use the official JSON.org website diagrams to determine we have the following two states:
- STRING
- NUMBER
We also know that a string is triggered by the condition: c=='"'
(where c
is our stream's current character).
But how is a NUMBER triggered?
Well the JSON.org website has a rather complicated way of detecting numbers (which takes into account negative numbers, scientific notation, etc). but for these purposes let's just say that a NUMBER is triggered by a sequence of 1 or more of the following characters: 0123456789.
(the decimal point is included).
Other transitions
If we look back at the first railroad diagram:
An object is an unordered set of name/value pairs. An object begins with { (left brace) and ends with } (right brace)...
We can see that the COLON (:
) and COMMA (,
) characters are also transitions between states..
Putting it all together
If we combine all of these states and transitions, plus the idea of a character (rather than index) loop, and the other improvements I've mentioned - we can produce a single function which is capable of processing a JSON string "all at once":
def parseElegant3(candidate):
READY=0
STRING=1
NUMBER=3
COLON=-1
COMMA=-2
search={'"':STRING, ':':COLON, ',':COMMA, STRING:'"', COMMA:','}
NOP=" {\t}\r\n"
stringIgnore='{,:}'
numbers='0123456789.'
result={}
if not len(candidate):
return result
accumulator=""
mostRecentKey=None
state=READY
for c in candidate:
newState=state
if state is READY:
if c in NOP:continue
if c in numbers:
newState=NUMBER
if c in search.keys():
newState=search[c]
if newState is COLON:
newState=READY
if len(accumulator):
accumulator=accumulator.strip()
result[accumulator]=None
mostRecentKey=accumulator
accumulator=""
else:
state=READY
continue
elif newState is COMMA:
newState=READY
elif newState is NUMBER and not state is NUMBER:
accumulator+=c
if state is STRING and newState is STRING:
if c in stringIgnore:continue
if c==search[STRING]:
newState=READY
if not mostRecentKey is None:
result[mostRecentKey]=accumulator
accumulator=""
mostRecentKey=None
else:
accumulator+=c
elif state is NUMBER:
if c in numbers:
accumulator+=c
else:
if c in NOP:continue
assert c == search[COMMA]
try:
result[mostRecentKey]=int(accumulator)
except ValueError:
result[mostRecentKey]=float(accumulator)
accumulator=""
mostRecentKey=None
newState=READY
state=newState
return result
JSONString = '{ "id": 1, "name": "A green door", "price": 12.50, "tags": "home green"}'
parseElegant3(JSONString)
-
1\$\begingroup\$ Thank you! (1) The accumulator is where stream data (characters) are stored after reading and before being put into the
result
. (2) Yes, a STRING is always "quoted", where-as a NUMBER is never like that. a NUMBER also can only have the characters0123456789.
- where-as a string can have anything (except for a quote). Yes,search
is basically a dictionary of the _trigger) characters. \$\endgroup\$Tersosauros– Tersosauros2016年03月07日 07:58:07 +00:00Commented Mar 7, 2016 at 7:58 -
1\$\begingroup\$ Thanks Tersosauros, in your code,
state is STRING
, does it mean at the begin of a reading a string, or in the middle of reading a string? Thanks. \$\endgroup\$Lin Ma– Lin Ma2016年03月07日 08:07:05 +00:00Commented Mar 7, 2016 at 8:07 -
1\$\begingroup\$ if the state is STRING it means we have already triggered the transition from the READY state into STRING (by finding a
"
) - so it means we are in the middle. Yes, you could useif mostRecentKey
- however that will mean anymostRecentKey
value that evaluates to false cannot be a key name. Usingif not mostRecentKey is None
gets around this by being more explicit. \$\endgroup\$Tersosauros– Tersosauros2016年03月07日 08:42:38 +00:00Commented Mar 7, 2016 at 8:42 -
1\$\begingroup\$ @Tersosauros maybe
if mostRecentKey is not None
is easier to read (and understand) \$\endgroup\$oliverpool– oliverpool2016年03月07日 13:35:57 +00:00Commented Mar 7, 2016 at 13:35 -
1\$\begingroup\$ I'm not sure what you mean about the
mostRecentKey
"problem"? Obviously you don't want a key ofNone
in theresult
. Right? I was talking about the idea you mentioned of usingif mostRecentKey
instead. Also as @oliverpoolsaid, his syntax is probably more readable! \$\endgroup\$Tersosauros– Tersosauros2016年03月08日 19:53:13 +00:00Commented Mar 8, 2016 at 19:53
If it's possible for you to convert the JSON string to a dictionary data structure like so:
d = eval(JSONString)
then you could use the
string.strip()
and
string.replace()
methods on the individual keys.
You could then either work with the dict/JSON as is, or rebuild it to resemble a string if necesssary.
-
\$\begingroup\$ Thanks user6026938, thanks for sharing the idea. Vote up. :) It is not possible for me to "convert the JSON string to a dictionary data structure" in my specific problem and I have to write the raw parser. If you have any good implementation how to parse the JSON string only once, appreciate for sharing. \$\endgroup\$Lin Ma– Lin Ma2016年03月07日 05:49:08 +00:00Commented Mar 7, 2016 at 5:49
Explore related questions
See similar questions with these tags.
wordBeautify
does. \$\endgroup\$