
I'm currently a beginner with Clojure, and I thought I'd try building a web crawler with core.async.

What I have works, but I am looking for feedback on the following points:

  • How can I avoid using massive buffers when I don't want to lose values?
  • Am I using go blocks efficiently? Are there places where a thread would be more appropriate?
  • How can I better determine when the crawl is finished? Currently I have a 3-second timeout on taking from urls-chan; if the timeout wins, I assume we're done. This doesn't seem very efficient.

Here is the main part of the code:

(def visited-urls (atom #{}))
(def site-map (atom {}))

;; I've given my two channels massive buffers here because I don't want to drop
;; values. I'm not quite sure why they need to be so big, but anything smaller gives me:
;; Exception in thread "async-dispatch-1626" java.lang.AssertionError:
;; Assert failed: No more than 1024 pending puts are allowed on a single channel. Consider using a windowed buffer.
;; (< (.size puts) impl/MAX-QUEUE-SIZE)
(def urls-chan (chan 102400))
(def log-chan (chan 102400))
(def exit-chan (chan 1))
(defn get-doc
  "Fetches a parsed html page from the given url and places it onto a channel"
  [url]
  (go (let [{:keys [error body opts headers]} (<! (async-get url))
            content-type (:content-type headers)]
        (if (or error (not (.startsWith content-type "text/html")))
          (do (log "error fetching" url)
              false)
          (Jsoup/parse body (base-url (:url opts)))))))
;; Main event loop
(defn start-consumers
  "Spins up n go blocks to take a url from urls-chan, store its assets and then
  put its links onto urls-chan, repeating until there are no more urls to take"
  [n domain]
  (dotimes [_ n]
    (go-loop [url (<! urls-chan)]
      (when-not (@visited-urls url)
        (log "crawling" url)
        (swap! visited-urls conj url)
        (when-let [doc (<! (get-doc url))]
          (swap! site-map assoc url (get-assets doc))
          (doseq [url (get-links doc domain)]
            (go (>! urls-chan url)))))
      ;; Take the next url off the queue; if 3 secs go by, assume no more are coming
      (let [[value channel] (alts! [urls-chan (timeout 3000)])]
        (if (= channel urls-chan)
          (recur value)
          (>! exit-chan true))))))
(defn -main
  "Crawls [domain] for links to assets"
  [domain]
  (let [start-time (System/currentTimeMillis)]
    (start-logger)
    (log "Beginning crawl of" domain)
    (start-consumers 40 domain)
    ;; Kick off with the first url
    (>!! urls-chan domain)
    (<!! exit-chan)
    (println (json/write-str @site-map))
    (<!! (log "Completed after" (seconds-since start-time) "seconds"))))
asked Sep 6, 2014 at 18:38

1 Answer


At (when-not (@visited-urls url) ...), more than one consumer may see the same unvisited url, so several of them can end up crawling the same url. That's unexpected, though it doesn't seem to break anything.

I don't see a better way to do this with a Clojure atom. Actually, the atom doesn't buy you anything here, because all it does is mutate global state. I think java.util.concurrent.ConcurrentHashMap works better: visited-urls can be a map from URL to a boolean indicating whether the URL has been visited, and the condition becomes .putIfAbsent(url, true) == null, so only the caller that actually inserted the key proceeds.
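A minimal sketch of that suggestion in Clojure (claim-url! is a hypothetical helper name; the rest of the crawler would be unchanged):

```clojure
(import 'java.util.concurrent.ConcurrentHashMap)

;; One shared map; putIfAbsent is atomic, so for a given url exactly one
;; caller gets nil back (meaning "the key was absent and I just inserted it").
(def visited-urls (ConcurrentHashMap.))

(defn claim-url!
  "Returns true only for the first caller to claim url."
  [url]
  (nil? (.putIfAbsent visited-urls url true)))
```

The consumer's check then becomes (when (claim-url! url) ...) instead of the separate lookup and swap!, closing the window in which two go blocks can both observe the url as unvisited.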

Heslacher
answered Dec 31, 2014 at 10:41
