Please review the following Clojure tokenizer. The goal is to study Clojure, so sometimes I re-implement functions (it would be nice to see the standard functions for this). Beyond that, the code seems quite cumbersome.
(defn space? [^Character c]
  (Character/isWhitespace c))

(defn digit? [^Character c]
  (= (Character/getType c)
     Character/DECIMAL_DIGIT_NUMBER))

(defn to-int [ds]
  (reduce #(+ (* 10 %1) %2) (map #(- (int %) (int \0)) ds)))

(defn take-int [s]
  (when (digit? (first s))
    (let [[digits rs] (split-with digit? s)]
      [rs {:int (to-int digits)}])))

(defn take-sym [s]
  (case (first s)
    \+ [(rest s) {:sym :plus}]
    \- [(rest s) {:sym :minus}]
    \* [(rest s) {:sym :star}]
    \/ [(rest s) {:sym :slash}]
    nil))

(defn take-error [msg]
  [nil {:error msg}])

(defn tokens [expr]
  (loop [ex expr, ts []]
    (let [nsp (drop-while space? ex)]
      (if (empty? nsp)
        (conj ts {:eof nil})
        (if-let [[rs t] (or (take-int nsp)
                            (take-sym nsp)
                            (take-error (str "Invalid symbol " (first nsp))))]
          (if (:error t)
            t
            (recur rs (conj ts t))))))))

(defn -main [& args]
  ;; Work around dangerous default behavior in Clojure.
  (alter-var-root #'*read-eval* (constantly false))
  (println (tokens (seq " 20 + 32 * 4 "))))
1 Answer
I don't know if you're still at it (given it's been a few months); in any case, here's my answer (it could be useful to someone, who knows).
Speaking of standard functions, maybe you could leverage the Clojure reader, e.g.:

(def to-int (comp read-string #(apply str %))) ;; the str bit is ugly, but necessary (read-string consumes Strings)
Depending on what you're attempting to do, you could parse your chain of characters directly into a stream of Clojure symbols, data structures and basic types or use it more precisely, like above.
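For what it's worth, the reader-based to-int behaves like the hand-rolled reduce on plain digit sequences, with one caveat worth knowing about:

```clojure
;; to-int via the Clojure reader, as suggested above
(def to-int (comp read-string #(apply str %)))

(to-int (seq "123"))  ;; => 123
(to-int [\4 \2])      ;; => 42

;; caveat: the reader treats a leading 0 as octal notation
(to-int (seq "010"))  ;; => 8
```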
Otherwise, you could save yourself the trouble of reimplementing integer parsing in to-int by using Integer/parseInt on the substring directly.

Performance-wise, char sequences consume a lot of memory (even lazy ones), not to mention the risks of holding onto heads; you should consider using String manipulation techniques over sequences: subs over rest/next, regular expressions over split-with/drop-while, etc. For example, drop-while space? screams trim, triml or (subs (re-find #"\S" ...) ...); when (digit? (first ...)) / split-with digit? would become when-let [[_ digits rs] (re-find #"^(\d+)(.*)$" ...)].
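Combining the Integer/parseInt and regex suggestions, a String-based take-int could look like this (a sketch that assumes the rest of the lexer is also switched from char seqs to Strings):

```clojure
;; String-based take-int: a regex instead of split-with, Integer/parseInt
;; instead of the hand-rolled reduce. Note (.*) stops at a newline, which
;; is fine for single-line input; add (?s) to cross lines.
(defn take-int [s]
  (when-let [[_ digits rs] (re-find #"^(\d+)(.*)$" s)]
    [rs {:int (Integer/parseInt digits)}]))

(take-int "20 + 32")  ;; => [" + 32" {:int 20}]
(take-int "+ 32")     ;; => nil
```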
BTW, assuming input is big, feeding the whole input to the lexer feels like a waste of memory; you should instrument a Reader or something, and tokenize as you read.

-main: if you don't use read-string, you don't need your *read-eval* bit; besides, alter-var-root is not the way to do it: local rebinding is (e.g. with binding).
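A minimal sketch of the "tokenize as you read" idea, assuming a java.io.PushbackReader so the one-character lookahead can be pushed back (read-int-token is a hypothetical helper name; a full lexer would dispatch on the first character the same way):

```clojure
(import '(java.io PushbackReader StringReader))

;; Read one integer token, consuming only its digits; the first
;; non-digit character is pushed back for the next token function.
(defn read-int-token [^PushbackReader r]
  (loop [n 0, seen? false]
    (let [c (.read r)]                        ; -1 at end of stream
      (if (and (not= c -1) (Character/isDigit (char c)))
        (recur (+ (* 10 n) (- c (int \0))) true)
        (do (when (not= c -1) (.unread r c))  ; give back the lookahead
            (when seen? {:int n}))))))

(read-int-token (PushbackReader. (StringReader. "42+")))  ;; => {:int 42}
(read-int-token (PushbackReader. (StringReader. "+42")))  ;; => nil
```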
tokens: a sequence-returning function using loop/recur with an accumulator screams lazy-seq:

(defn tokens [expr]
  (lazy-seq
    ...
    (if (empty? nsp)
      [{:eof nil}]
      ...
      (cons t (if-not (:error t) (tokens rs))))))

The only difference in behaviour is with errors, which get appended last instead of returned directly; that feels more flexible (the lexer stops on an error token instead of returning an error token). A test for errors would become (:error (last ...)) instead of (:error ...).
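Filled in, that sketch might look like the following (helpers condensed from the question so the snippet runs standalone):

```clojure
;; Helpers condensed from the question.
(defn space? [^Character c] (Character/isWhitespace c))
(defn digit? [^Character c] (Character/isDigit c))
(defn take-int [s]
  (when (digit? (first s))
    (let [[ds rs] (split-with digit? s)]
      [rs {:int (reduce #(+ (* 10 %1) (- (int %2) (int \0))) 0 ds)}])))
(defn take-sym [s]
  (case (first s)
    \+ [(rest s) {:sym :plus}]  \- [(rest s) {:sym :minus}]
    \* [(rest s) {:sym :star}]  \/ [(rest s) {:sym :slash}]
    nil))
(defn take-error [msg] [nil {:error msg}])

;; The lazy-seq rewrite, completed: no accumulator, errors end the seq.
(defn tokens [expr]
  (lazy-seq
    (let [nsp (drop-while space? expr)]
      (if (empty? nsp)
        [{:eof nil}]
        (let [[rs t] (or (take-int nsp)
                         (take-sym nsp)
                         (take-error (str "Invalid symbol " (first nsp))))]
          (cons t (when-not (:error t) (tokens rs))))))))

(tokens (seq " 20 + 32 "))
;; => ({:int 20} {:sym :plus} {:int 32} {:eof nil})
(:error (last (tokens (seq "2 ? 3"))))
;; => "Invalid symbol ?"
```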
take-sym: this case screams hashmap to me, and all those first/rest call for a little destructuring; what about:

(def char->sym {\+ :plus \- :minus \* :star \/ :slash})

(defn take-sym [[c & r]]
  (if-let [sym (char->sym c)]
    [r {:sym sym}]))
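For what it's worth, the map-based version also handles empty input for free (definitions repeated so the snippet runs standalone):

```clojure
(def char->sym {\+ :plus \- :minus \* :star \/ :slash})

(defn take-sym [[c & r]]
  (if-let [sym (char->sym c)]
    [r {:sym sym}]))

(take-sym (seq "+32"))  ;; => [(\3 \2) {:sym :plus}]
(take-sym (seq "?32"))  ;; => nil
(take-sym nil)          ;; => nil (the case version also returns nil here)
```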
AFAIK case performs comparisons based on hash code, so it should have the same (poor) performance as a hashmap. If it becomes a performance bottleneck, you may choose to reimplement it on the Java side (as a switch statement) or use a char-keyed optimized map instance (like this one from fastutil or that one from Trove).

Another idea would be to keyword-ify the characters directly (though it would put the burden of filtering unrecognized symbols on the parser).

digit?: Character/isDigit does the trick.
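Sketching those last two points (char->keyword is a hypothetical name for the keyword-ifying variant):

```clojure
;; digit? via Character/isDigit
(defn digit? [^Character c] (Character/isDigit c))

(digit? \7)  ;; => true
(digit? \a)  ;; => false

;; keyword-ifying characters directly: every char becomes a token,
;; so filtering out junk is deferred to the parser
(defn char->keyword [c] (keyword (str c)))

(char->keyword \+)  ;; => :+
(char->keyword \?)  ;; => :?
```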