A simple recursive descent JSON parser. The entrypoint to the code is the parse
function. Since I'm pretty new to common lisp, I wanted to get feedback on how to make my code more concise, general, and performant. There are no real optimizations -- this is the first working version, and I would like to know what it would take to go from my implementation to something that would be production ready (of course, ignoring the fact that I could simply use a library for this). Some possible sources of improvement:
- Creating a separate lexer to improve performance and reduce calls like
scanner-peek
immediately beforescanner-pop
- Making it more robust to malformed inputs / improved error messages
- Shorter code / following lisp idioms better
Here is the code in its entirety:
(defun mk-scanner (s) (list :stream s :cursor 0))
(defun scanner-peek (scanner)
(if (scanner-finished? scanner)
nil
(char (getf scanner :stream) (getf scanner :cursor))))
(defun scanner-pop (scanner)
(let ((result (scanner-peek scanner)))
(incf (getf scanner :cursor))
result))
(defun scanner-finished? (scanner)
(>= (getf scanner :cursor) (length (getf scanner :stream))))
(defun scanner-advance (n scanner)
(let ((cur (getf scanner :cursor)))
(setf (getf scanner :cursor) (+ cur n))
(subseq (getf scanner :stream) cur (min (+ cur n) (length (getf scanner :stream))))))
(defun scanner-position (scanner)
(getf scanner :cursor))
; c is either a string or a char.
(defun match (c scanner)
(let ((s (string c)))
(let ((s-stream (scanner-advance (length s) scanner)))
(if (equal s s-stream) t
(error (format nil "No match at char ~a" (scanner-position scanner)))))))
(defun skip-ws (scanner)
(loop while (member (scanner-peek scanner) (mapcar #'code-char '(#x20 #x09 #x0A #x0D)))
do (scanner-pop scanner)))
(defun parse-string (scanner)
(match #\" scanner)
(prog1
(coerce (loop
while (not (equal (scanner-peek scanner) #\"))
collect
(let ((chr (scanner-pop scanner)))
(if (equal chr #\\)
(list chr (scanner-pop scanner))
chr))) 'string)
(match #\" scanner)))
(defun parse-int (scanner)
(parse-integer (coerce (loop while (member (scanner-peek scanner) (coerce "-1234567890" 'list)) collect (scanner-pop scanner)) 'string)))
(defun parse-array (scanner)
(prog2
(match "[" scanner)
(loop
collect (prog2
(skip-ws scanner)
(parse-value scanner)
(skip-ws scanner))
while (and (equal (scanner-peek scanner) #,円) (match #,円 scanner)))
(skip-ws scanner)
(match "]" scanner)))
(defun parse-object (scanner)
(prog2
(match "{" scanner)
(loop
while (progn (skip-ws scanner) (not (equal (scanner-peek scanner) #\})))
collect (parse-kvp scanner))
(match "}" scanner)))
(defun parse-value (scanner)
(skip-ws scanner)
(let ((c (scanner-peek scanner)))
(cond
((equal c #\n) (progn (match "null" scanner) 'jsnull))
((equal c #\t) (progn (match "true" scanner) 'jstrue))
((equal c #\f) (progn (match "false" scanner) 'jsfalse))
((equal c #\[) (parse-array scanner))
((equal c #\") (parse-string scanner))
((equal c #\{) (parse-object scanner))
((member c (coerce "-1234567890" 'list)) (parse-int scanner)))))
(defun parse-kvp (scanner)
(let (key val)
(skip-ws scanner)
(setq key (parse-string scanner))
(skip-ws scanner)
(match #\: scanner)
(skip-ws scanner)
(setq val (parse-value scanner))
(skip-ws scanner)
(when (equal (scanner-peek scanner) #,円) (match #,円 scanner))
(cons key val)))
(defun parse (str)
(parse-value (mk-scanner str)))
```
2 Answers 2
specification
Please mention the URL of a JSON spec. near the top of this codebase. Extra points for mentioning the URL of some corpus of known-good and known-bad inputs you have tested against.
EOF behavior
I don't understand this.
(defun scanner-pop (scanner)
(let ((result (scanner-peek scanner)))
(incf (getf scanner :cursor))
result))
Popping at end produces a nil
result, good.
But it always increments the cursor position?!?
That doesn't sound like an appropriate post-condition.
fast (length)
In scanner-finished?
we evaluate (length (getf scanner :stream))
.
If we're cdr'ing our way down a list that can take a while,
with time complexity linear in the size of the list.
What we want here is to exploit the following
promise:
If sequence is a vector with a fill pointer, the active length as specified by the fill pointer is returned.
You didn't mention which Lisp compiler you're using.
In mk-scanner
I see no type hints, pragmas, coercions, assertions,
or other advice that would let a compiler optimize this.
We'd like for the compiled code to cut to the chase,
certain of the stream's type and able to unconditionally
grab that fill pointer.
You may find it instructive to do micro-benchmarks on code like this, with and without a small tweak like a type hint. Disassembling, or requesting verbose compiler output, can also be instructive. Common Lisp can be a very fast language, if the source code doesn't require it to solve "too general" of a problem. Dynamic input types tend to interfere with a compiler's efforts to prove a form is of a certain type.
delta
(setf (getf scanner :cursor) (+ cur n))
Incf accepts a delta:
incf place [delta-form] => new-value
It defaults to 1
, but you can certainly supply n
if you wish.
Again we see the curious behavior that cursor may advance far beyond the length of the stream.
getter?
(defun scanner-position (scanner)
(getf scanner :cursor))
I'm not sure this getter is pulling its weight. Consider deleting it. Alternatively, consistently use it everywhere.
Rather than the (admirably!) simple mk-scanner
, you might prefer to use
CLOS,
which automatically produces such getters.
static typing
; c is either a string or a char.
(defun match (c scanner)
Well, you told me, and I appreciate the advice, thank you. But wouldn't you like to expose that knowledge to the compiler?
Consider only accepting a string, to avoid the (string c)
coercion.
It's not obvious to me if the compiled code always has to
create a (length s)
temporary copy of the input stream
in order to perform the comparison.
This predicate function seems likely to be more expensive than necessary.
indent
This is hard to read.
(if (equal s s-stream) t
(error (format nil "No match at char ~a" (scanner-position scanner)))))))
Ask your editor or pretty printer to use sensible indentation, please.
The identifier skip-ws
is perfect, but tacking on a comment
which spells out that we are skipping "whitespace" wouldn't hurt.
Similarly for "key value pair".
unbalanced quotes
If parse-string
is given a truncated prefix of valid JSON,
so there's an odd number of "
quotes,
it looks like it won't respond gracefully.
I urge you to write a pair of automated unit tests for this function, which exercise both the Sad and the Happy Path.
parse-int
(defun parse-int (scanner)
(parse-integer (coerce (loop while (member (scanner-peek scanner) (coerce "-1234567890" 'list)) collect (scanner-pop scanner)) 'string)))
I found that hard to read. Please use sensible indentation.
It's not obvious to me that the compiler will manage to avoid re-building a digits list on each invocation. Consider defining a constant, outside this function's scope. Consider defining an array constant, so indexing gives the answer in constant time, rather than doing eight or nine cdr's.
while (and (equal (scanner-peek scanner) #,円) (match #,円 scanner)))
Please use indentation to highlight we're computing
conjunction of the equal
and the match
results.
(defun mk-scanner (s) (list :stream s :cursor 0))
Using a property list for your scanner type is pretty unusual. Normally in Common Lisp you'd use a structure or an object, with generic functions that take it, so you can define scanners that work with strings, files, etc. and have the code that uses it Just Work. But Common Lisp already has a good way to do this: Use streams. scanner-pop
becomes read-char
, scanner-peek
becomes (peek-char scanner :eof-error-p nil)
, etc.
The top level entry point
(defun parse (str)
(parse-value (mk-scanner str)))
becomes
(defun parse (str)
(with-input-from-string (s str) (parse-value s)))
(defun skip-ws (scanner)
(loop while (member (scanner-peek scanner) (mapcar #'code-char '(#x20 #x09 #x0A #x0D)))
do (scanner-pop scanner)))
Instead of using magic numbers for whitespace characters, use named characters.
(member (scanner-peek scanner) '(#\Space #\Tab #\Linefeed #\Return))
makes the code clearer to the reader (And potentially more efficient if your CL implementation doesn't lift the mapcar
out of that loop).
On a related note, (member c (coerce "-1234567890" 'list))
in your parse-value
function can be written without the unnecessary conversion from string to list as (find c "-123456789")
(I'd rewrite the whole function to use case
instead of cond
, though)
(if ... nil)
-> usewhen
orunless
\$\endgroup\$