7
\$\begingroup\$

I originally wrote this as an answer to a question on Stack Overflow, but it turned out so nicely that I decided to post it here to see if I can make it even better.

(defn char-seq
 "Returns a lazy sequence of characters from rdr. rdr must implement
 java.io.Reader."
 [rdr]
 (let [c (.read rdr)]
 (if-not (neg? c)
 (cons (char c) (lazy-seq (char-seq rdr))))))
(defn line-offsets
 "Returns a lazy sequence of offsets of all lines in s."
 [s]
 (if (seq s)
 (->> (partition-all 3 1 s)
 (map-indexed
 (fn [i [a b c]]
 (cond
 (= b \newline) (if c (+ 2 i))
 (= a \return) (if b (inc i)))))
 (filter identity)
 (cons 0))))
(defn ordered-line-seq
 "Returns the lines of text from raf at each offset in offsets as a lazy
 sequence of strings. raf must implement java.io.RandomAccessFile."
 [raf offsets]
 (map (fn [i]
 (.seek raf i)
 (.readLine raf))
 offsets))

Example usage:

(let [filename "data.txt"
 offsets (with-open [rdr (clojure.java.io/reader filename)]
 (shuffle (line-offsets (char-seq rdr))))]
 (with-open [raf (java.io.RandomAccessFile. filename "r")]
 (run! println (ordered-line-seq raf offsets))))

I'm interested in any way to improve this code, but here are some specific things I'm looking for:

  • Since RandomAccessFile obviously uses byte offsets, and line-offsets returns character offsets, this code won't work for Unicode files. Is there a way to fix that problem without adding a lot of complexity?
  • Are a, b, and c good names for the parameters of the anonymous function in line-offsets?
  • What exactly are the guidelines for parameter ordering in named functions like ordered-line-seq?
asked Nov 6, 2015 at 1:34
\$\endgroup\$

2 Answers 2

2
\$\begingroup\$

Regarding your questions:

  • You're reading bytes. Fortunately, the only three line ending conventions all involve Newline and Return, both a single byte. There's nothing more to it. So if you stay with an encoding using single bytes for (most of) their characters, such as UTF-8, there will be no problem. For other Unicode encodings you'd have to read more than one byte per character and also do some decoding.

    Frankly even without encodings this is a hairy topic, c.f. Newline on Wikipedia. Though line-offsets uses the same convention as readLine, so I guess that's just FYI.

  • I'd at least not use short/one-letter variable names for function parameters, because they also serve as documentation. E.g. r doesn't tell the reader anything, rdr is barely english, but reader is clear and doesn't cost that much more in terms of characters. Same with s and sequence. For a, b, c in line-offsets I'd even say that c1 to c3 would be more helpful.

  • It'd be best to orient yourself at standard functions. It's probably better to have the most important "subject" first, then subsequently the rest unless there's a good reason not to. Parameters at the ends can also very easily be curried, maybe that's another hint to think about it.

Other than that, yeah, looks good. I don't quite see why the laziness is necessary, since it's calling shuffle anyway, but perhaps it's useful in some situations?

answered Nov 7, 2015 at 15:29
\$\endgroup\$
3
  • \$\begingroup\$ Yeah, the only function that needs to be lazy here is ordered-line-seq, but I figured that there's no reason for the other two functions not to be lazy, so I made them lazy anyway. c1 to c3 do seem like better names; however, would you say that consistency with line-seq is a good enough reason for the name of the rdr parameter in char-seq? \$\endgroup\$ Commented Nov 7, 2015 at 16:52
  • \$\begingroup\$ It's a matter of opinion really. I'm not overly fond of the standard library naming, so my post is coloured by that. \$\endgroup\$ Commented Nov 7, 2015 at 17:16
  • \$\begingroup\$ I can certainly understand that; I do agree that rdr is not very readable here. Anyway, thanks for the comments! :) \$\endgroup\$ Commented Nov 7, 2015 at 17:23
0
\$\begingroup\$

Actually assert the preconditions

You say in your documentation:

rdr must implement
 java.io.Reader.

Still, if I pass in a rdr that has not java.io.Reader implemented, I am going to get a weird error. I would provide a use friendly error message. (Pseudocode):

(defn char-seq
 "Returns a lazy sequence of characters from rdr. rdr must implement
 java.io.Reader."
 [rdr]
 (when (not (implements java.io.Reader rdr))
 raise "rdr must implement java.io.Reader")
 ; ... the same

rdr?

This name is hard to understand, remember that your function parameters names are part of the documentation and should be as clear as possible, when in doubt err on the side of verbosity.

ferada
11.4k25 silver badges65 bronze badges
answered Nov 6, 2015 at 13:49
\$\endgroup\$
1
  • 1
    \$\begingroup\$ All good advice, but my goal was for char-seq to be as similar as possible to line-seq, which shares that exact parameter name. Also, "must implement" in the line-seq docstring really just means "must support the same method signatures", so I'm using that meaning here as well; that's why line-seq doesn't assert its precondition either. \$\endgroup\$ Commented Nov 6, 2015 at 14:36

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.