I originally wrote this as an answer to a question on Stack Overflow, but it turned out so nicely that I decided to post it here to see if I can make it even better.
(defn char-seq
"Returns a lazy sequence of characters from rdr. rdr must implement
java.io.Reader."
[rdr]
(let [c (.read rdr)]
(if-not (neg? c)
(cons (char c) (lazy-seq (char-seq rdr))))))
(defn line-offsets
"Returns a lazy sequence of offsets of all lines in s."
[s]
(if (seq s)
(->> (partition-all 3 1 s)
(map-indexed
(fn [i [a b c]]
(cond
(= b \newline) (if c (+ 2 i))
(= a \return) (if b (inc i)))))
(filter identity)
(cons 0))))
(defn ordered-line-seq
"Returns the lines of text from raf at each offset in offsets as a lazy
sequence of strings. raf must implement java.io.RandomAccessFile."
[raf offsets]
(map (fn [i]
(.seek raf i)
(.readLine raf))
offsets))
Example usage:
(let [filename "data.txt"
offsets (with-open [rdr (clojure.java.io/reader filename)]
(shuffle (line-offsets (char-seq rdr))))]
(with-open [raf (java.io.RandomAccessFile. filename "r")]
(run! println (ordered-line-seq raf offsets))))
I'm interested in any way to improve this code, but here are some specific things I'm looking for:
- Since
RandomAccessFile
obviously uses byte offsets, andline-offsets
returns character offsets, this code won't work for Unicode files. Is there a way to fix that problem without adding a lot of complexity? - Are
a
,b
, andc
good names for the parameters of the anonymous function inline-offsets
? - What exactly are the guidelines for parameter ordering in named functions like
ordered-line-seq
?
2 Answers 2
Regarding your questions:
You're reading bytes. Fortunately, the only three line ending conventions all involve Newline and Return, both a single byte. There's nothing more to it. So if you stay with an encoding using single bytes for (most of) their characters, such as UTF-8, there will be no problem. For other Unicode encodings you'd have to read more than one byte per character and also do some decoding.
Frankly even without encodings this is a hairy topic, c.f. Newline on Wikipedia. Though
line-offsets
uses the same convention asreadLine
, so I guess that's just FYI.I'd at least not use short/one-letter variable names for function parameters, because they also serve as documentation. E.g.
r
doesn't tell the reader anything,rdr
is barely english, butreader
is clear and doesn't cost that much more in terms of characters. Same withs
andsequence
. Fora
,b
,c
inline-offsets
I'd even say thatc1
toc3
would be more helpful.It'd be best to orient yourself at standard functions. It's probably better to have the most important "subject" first, then subsequently the rest unless there's a good reason not to. Parameters at the ends can also very easily be curried, maybe that's another hint to think about it.
Other than that, yeah, looks good. I don't quite see why the laziness
is necessary, since it's calling shuffle
anyway, but perhaps it's
useful in some situations?
-
\$\begingroup\$ Yeah, the only function that needs to be lazy here is
ordered-line-seq
, but I figured that there's no reason for the other two functions not to be lazy, so I made them lazy anyway.c1
toc3
do seem like better names; however, would you say that consistency withline-seq
is a good enough reason for the name of therdr
parameter inchar-seq
? \$\endgroup\$Sam Estep– Sam Estep2015年11月07日 16:52:05 +00:00Commented Nov 7, 2015 at 16:52 -
\$\begingroup\$ It's a matter of opinion really. I'm not overly fond of the standard library naming, so my post is coloured by that. \$\endgroup\$ferada– ferada2015年11月07日 17:16:27 +00:00Commented Nov 7, 2015 at 17:16
-
\$\begingroup\$ I can certainly understand that; I do agree that
rdr
is not very readable here. Anyway, thanks for the comments! :) \$\endgroup\$Sam Estep– Sam Estep2015年11月07日 17:23:11 +00:00Commented Nov 7, 2015 at 17:23
Actually assert the preconditions
You say in your documentation:
rdr must implement
java.io.Reader.
Still, if I pass in a rdr
that has not java.io.Reader
implemented, I am going to get a weird error. I would provide a use friendly error message. (Pseudocode):
(defn char-seq
"Returns a lazy sequence of characters from rdr. rdr must implement
java.io.Reader."
[rdr]
(when (not (implements java.io.Reader rdr))
raise "rdr must implement java.io.Reader")
; ... the same
rdr
?
This name is hard to understand, remember that your function parameters names are part of the documentation and should be as clear as possible, when in doubt err on the side of verbosity.
-
1\$\begingroup\$ All good advice, but my goal was for
char-seq
to be as similar as possible toline-seq
, which shares that exact parameter name. Also, "must implement" in theline-seq
docstring really just means "must support the same method signatures", so I'm using that meaning here as well; that's whyline-seq
doesn't assert its precondition either. \$\endgroup\$Sam Estep– Sam Estep2015年11月06日 14:36:17 +00:00Commented Nov 6, 2015 at 14:36