This program forms the reducer of a Hadoop MapReduce job. It reads tab-delimited data from stdin, such as
foo 1
foo 1
bar 1
and outputs
foo 2
bar 1
Any suggestions for improvements?
(use '[clojure.string :only [split]])

(def reducer (atom {}))

(defn update-map [map key]
  (merge-with + map {key 1}))

(doseq [line (line-seq (java.io.BufferedReader. *in*))]
  (let [k (first (split line #"\t"))]
    (swap! reducer update-map k)))

(doseq [kv @reducer]
  (println (format "%s\t%s" (first kv) (second kv))))
3 Answers
Probably a bit too late to help the OP, but in case anyone else stumbles upon this question, here's a nice succinct way of doing it, using the frequencies function:
(doseq [[word freq] (frequencies
                      (map
                        #(re-find #"^[^\t]+" %) ;; just get the first non-tab characters
                        (line-seq (java.io.BufferedReader. *in*))))]
  (println (str word "\t" freq)))
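To see what frequencies does on its own, here is a small sketch that runs on an in-memory vector instead of stdin. The names sample-lines, first-field and counts are made up for illustration and are not part of the answer above.

```clojure
;; Hypothetical sample input; each line is "key<TAB>value".
(def sample-lines ["foo\t1" "foo\t1" "bar\t1"])

;; Hypothetical helper: returns everything before the first tab.
(defn first-field [line]
  (re-find #"^[^\t]+" line))

;; frequencies returns a map from each distinct item to the number
;; of times it appears in the collection.
(def counts (frequencies (map first-field sample-lines)))

;; Print one "key<TAB>count" line per distinct key.
(doseq [[word freq] counts]
  (println (str word "\t" freq)))
```

Here counts is a plain map, so the order of the printed lines is not guaranteed.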
Why don't you use reduce instead of the first doseq? Something along these lines (untested, entered directly here):
(def response
  (reduce (fn [map line]
            (let [k (first (split line #"\t"))]
              (update-map map k)))
          {}
          (line-seq (java.io.BufferedReader. *in*))))

(doseq [kv response]
  (println (format "%s\t%s" (first kv) (second kv))))
Then you won't need the atom either.
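As a self-contained variation on the same idea, the sketch below folds the lines with reduce and needs neither the atom nor update-map. It swaps in update with (fnil inc 0), which assumes Clojure 1.7 or later, and it reads from a StringReader in place of *in* so it can be tried at a REPL. The names count-keys and sample-reader are hypothetical.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper: fold a sequence of lines into a map of key -> count.
(defn count-keys [lines]
  (reduce (fn [counts line]
            (let [k (first (str/split line #"\t"))]
              ;; (fnil inc 0) treats a missing key as 0 before incrementing.
              (update counts k (fnil inc 0))))
          {}
          lines))

;; A StringReader stands in for *in* here; in the real job you would
;; pass (line-seq (java.io.BufferedReader. *in*)) instead.
(def sample-reader
  (java.io.BufferedReader.
    (java.io.StringReader. "foo\t1\nfoo\t1\nbar\t1\n")))

(def response (count-keys (line-seq sample-reader)))

(doseq [[k v] response]
  (println (str k "\t" v)))
```

Keeping the fold in a function that takes a sequence of lines makes it easy to test without touching stdin.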
Can the input (the mapper's output) contain numbers other than 1? Like:
foo 1
foo 3
bar 10
If so, then:
(use '[clojure.string :only [split]])

(def parsed-input
  (for [line (line-seq (java.io.BufferedReader. *in*))
        :let [[k v] (split line #"\t")]]
    {k (Double/parseDouble v)}))

(def table (apply merge-with + {} parsed-input))

(doseq [[k v] table]
  (println (str k "\t" v)))
Outputs:
bar 10.0
foo 4.0
If it's just 1s, frequencies will do, as suggested.
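If the values are always whole numbers (an assumption; the code above parses them as doubles), a sketch along these lines keeps the totals as integers, so the output reads 4 rather than 4.0. The name sum-by-key is hypothetical, and the input is an in-memory vector rather than stdin.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper: sum the integer values seen for each key.
;; Assumes every line has the form "key<TAB>integer".
(defn sum-by-key [lines]
  (reduce (fn [totals line]
            (let [[k v] (str/split line #"\t")]
              ;; merge-with + adds the new value to any existing total.
              (merge-with + totals {k (Long/parseLong v)})))
          {}
          lines))

(def totals (sum-by-key ["foo\t1" "foo\t3" "bar\t10"]))

(doseq [[k v] totals]
  (println (str k "\t" v)))
```

Long/parseLong throws a NumberFormatException on anything that is not an integer, so malformed lines would need separate handling.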