
I'm currently a beginner with Clojure, and I thought I'd try building a web crawler with core.async.

What I have works, but I am looking for feedback on the following points:

  • How can I avoid using massive buffers when I don't want to lose values?
  • Am I using go blocks efficiently? Are there places where a thread would be more appropriate?
  • How can I better determine when the crawl is finished? Currently I have a 3-second timeout on taking from urls-chan; if the timeout wins, I assume we're done. This doesn't seem very efficient.

Here is the main part of the code:

(def visited-urls (atom #{}))
(def site-map (atom {}))

;; I've given my two channels massive buffers here because I don't want to drop
;; values. I'm not quite sure why they need to be so big, but anything smaller gives me:
;; Exception in thread "async-dispatch-1626" java.lang.AssertionError:
;; Assert failed: No more than 1024 pending puts are allowed on a single channel. Consider using a windowed buffer.
;; (< (.size puts) impl/MAX-QUEUE-SIZE)
(def urls-chan (chan 102400))
(def log-chan (chan 102400))
(def exit-chan (chan 1))
(defn get-doc
  "Fetches a parsed html page from the given url and places it onto a channel"
  [url]
  (go (let [{:keys [error body opts headers]} (<! (async-get url))
            content-type (:content-type headers)]
        (if (or error (not (.startsWith content-type "text/html")))
          (do (log "error fetching" url)
              false)
          (Jsoup/parse body (base-url (:url opts)))))))
;; Main event loop
(defn start-consumers
  "Spins up n go blocks to take a url from urls-chan, store its assets and then
  put its links onto urls-chan, repeating until there are no more urls to take"
  [n domain]
  (dotimes [_ n]
    (go-loop [url (<! urls-chan)]
      (when-not (@visited-urls url)
        (log "crawling" url)
        (swap! visited-urls conj url)
        (when-let [doc (<! (get-doc url))]
          (swap! site-map assoc url (get-assets doc))
          (doseq [url (get-links doc domain)]
            (go (>! urls-chan url)))))
      ;; Take the next url off the queue; if 3 secs go by, assume no more are coming
      (let [[value channel] (alts! [urls-chan (timeout 3000)])]
        (if (= channel urls-chan)
          (recur value)
          (>! exit-chan true))))))
(defn -main
  "Crawls [domain] for links to assets"
  [domain]
  (let [start-time (System/currentTimeMillis)]
    (start-logger)
    (log "Beginning crawl of" domain)
    (start-consumers 40 domain)
    ;; Kick off with the first url
    (>!! urls-chan domain)
    (<!! exit-chan)
    (println (json/write-str @site-map))
    (<!! (log "Completed after" (seconds-since start-time) "seconds"))))
asked Sep 6, 2014 at 18:38

1 Answer


At (when-not (@visited-urls url) ...), more than one consumer may see the same unvisited url, so several of them can end up crawling the same url. That's unexpected, though it doesn't seem to break anything.

I don't see a better way to do this with a Clojure atom. Actually, the atom doesn't buy you anything here, because all it does is mutate global state. I think java.util.concurrent.ConcurrentHashMap works better: visited-urls can be a map from URL to a boolean indicating whether the URL has been visited, and the condition becomes .putIfAbsent(url, true) == null, so only the caller that actually inserted the key proceeds.
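A minimal sketch of that suggestion in Clojure (claim-url! is a hypothetical helper name; the rest of the crawler would be unchanged):

```clojure
(import 'java.util.concurrent.ConcurrentHashMap)

;; One shared map; putIfAbsent is atomic, so for a given url exactly one
;; caller gets nil back (meaning "the key was absent and I just inserted it").
(def visited-urls (ConcurrentHashMap.))

(defn claim-url!
  "Returns true only for the first caller to claim url."
  [url]
  (nil? (.putIfAbsent visited-urls url true)))
```

The consumer's check then becomes (when (claim-url! url) ...) instead of the separate lookup and swap!, closing the window in which two go blocks can both observe the url as unvisited.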

Heslacher
answered Dec 31, 2014 at 10:41
