
I have been writing a web crawler in F# that downloads pages with stylesheets and scripts.
Can somebody give me suggestions on improving this code, please? I would appreciate any feedback that could improve or prettify it.

open System
open System.Net

let log fmt =
    Printf.kprintf (fun str -> // todo: lock | agent
        printfn "%-4O %-10O %O" Environment.CurrentManagedThreadId (DateTime.Now.ToLongTimeString()) str
        ) fmt

type File = {
    ContentType : string
    Path : string
    }

type DownloadResult =
    | Error of exn
    | Content of File

module files =
    let download trgFileName srcUrl = async {
        try
            let w = WebRequest.Create(Uri srcUrl)
            use! r = w.AsyncGetResponse()
            use f = new IO.FileStream(trgFileName, IO.FileMode.Create, IO.FileAccess.Write, IO.FileShare.None)
            r.GetResponseStream().CopyTo f
            return Content { ContentType = r.ContentType; Path = trgFileName }
        with e ->
            return Error e
        }

module filesTest =
    let download trgFileName srcUrl = async {
        return Error (exn "todo")
        }

let computeHash (s:IO.Stream) =
    Security.Cryptography.HashAlgorithm.Create().ComputeHash s

let fileHash (path:string) =
    use fs = new IO.FileStream(path, IO.FileMode.Open, IO.FileAccess.Read)
    let hc = computeHash fs
    BitConverter.ToString(hc).Replace("-", "")

let trimQuery url =
    Uri(url).GetLeftPart(UriPartial.Path)

open files
// open filesTest

let urls htmlPath : string list = [
    // todo: extract links from html. Load HtmlDocument and traverse DOM nodes
    ]

let crawl trgFolder maxDepth url =
    let cs = new Threading.CancellationTokenSource()
    let m = MailboxProcessor<int * string>.Start(fun inbox -> async { // depth * url
        use entries = new IO.StreamWriter(IO.Path.Combine(trgFolder, "entries.txt"), true)
        let visited = Collections.Generic.HashSet<string>()
        while true do
            let! depth, url = inbox.Receive()
            if depth < maxDepth then
                let url' = trimQuery url
                if not (visited.Add url') then
                    log "downloading %O" url
                    let! r = download (IO.Path.GetTempFileName()) url'
                    match r with
                    | Error e -> log "error %O" e.Message
                    | Content f -> // todo: async { } |> Async.Start + cs.Token + visited - Concurrent.ConcurrentDictionary
                        let hc = fileHash f.Path
                        log "processing %O as %O" url hc
                        let path = IO.Path.Combine(trgFolder, hc)
                        IO.File.Move(f.Path, path)
                        entries.WriteLine(sprintf "%O \t %O" hc url)
                        for u in urls path |> Seq.map trimQuery do
                            if visited.Add u then
                                inbox.Post (depth+1, u)
            if inbox.CurrentQueueLength = 0 then
                log "done"
                cs.Cancel()
        }, cs.Token)
    m.Post (0, url)
    cs

let c = crawl __SOURCE_DIRECTORY__ 2 "https://gist.github.com"
// c.Cancel()
asked May 17, 2016 at 13:03

2 Answers


The first, purely aesthetic thing that strikes me is the level of nesting in your main crawl function. Consider extracting the core of that function, even if only into a local function within crawl; keeping functions to fewer than seven lines (or some similarly small number) generally makes code easier to follow. A possible shape is sketched below.
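
As a minimal sketch of that extraction (my arrangement, not the poster's code; it assumes the question's opens and helpers are in scope, and the body of processPage is elided):

    // Illustrative shape only: the agent loop shrinks to receive/dispatch,
    // and the per-page work lives in a named local function.
    let crawl trgFolder maxDepth url =
        let cs = new Threading.CancellationTokenSource()
        let m = MailboxProcessor<int * string>.Start(fun inbox -> async {
            let visited = Collections.Generic.HashSet<string>()
            let processPage depth url = async {
                () // body elided: download, hash, move, write entry, post next urls
                }
            while true do
                let! depth, url = inbox.Receive()
                if depth < maxDepth then
                    do! processPage depth url
            }, cs.Token)
        m.Post (0, url)
        cs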

Consider wrapping primitives in single-case discriminated unions to make passing values around safer. E.g. you could create a type to wrap the url string that is passed around, such as:

type Url = Url of string
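
The wrapper can then be unwrapped by pattern matching directly in a parameter position. For instance, adapting the question's trimQuery (my adaptation, purely illustrative):

    // Hypothetical adaptation: accepts and returns Url, so a bare string
    // can no longer slip through by accident.
    let trimQuery (Url url) =
        Url (Uri(url).GetLeftPart(UriPartial.Path))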

Another option to reduce nesting is to pass a function into crawl to be executed once the request has completed; it would have the signature DownloadResult -> Url list (or maybe DownloadResult -> Async<Url list>), allowing it to return the set of next URLs to fetch. By doing this you decouple the ability to fetch a URL from what you ultimately want to do with the result. This pattern is the functional equivalent of the dependency inversion principle.
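
A rough sketch of what such a handler could look like, reusing log, urls, and DownloadResult from the question (the handler itself is invented here purely for illustration):

    type Handler = DownloadResult -> Async<Url list>

    // Illustrative handler: log failures, extract follow-up links on success.
    let extractLinks : Handler = fun result -> async {
        match result with
        | Error e ->
            log "error %O" e.Message
            return []
        | Content f ->
            return (urls f.Path |> List.map Url)
        }

crawl would then take a Handler as a parameter and stay ignorant of hashing, file moves, and link extraction.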

I'd question why you've chosen to write all content straight to disk and then pass the file path around. Perhaps you know you're planning on downloading large files, but for a first implementation it might be simpler to just pass an array of bytes or a stream around. My second thought would be to pass the response stream in the DownloadResult. This lets you delegate to the handler what to do with the content, rather than building that assumption into crawl.
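
For example, the result type could carry the content itself rather than a path; one possible shape (my sketch, not the only option):

    // Illustrative alternative: DownloadResult carries the body, so only
    // the handler decides whether the content ever touches the disk.
    type DownloadResult =
        | Error of exn
        | Content of contentType : string * body : byte[]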

answered May 18, 2016 at 8:43

Web crawling is actually quite a large topic. First, about your existing code.

  1. Mostly cosmetic: I would extract the payload functionality (moving content to files) into a separate function and pass it in as a parameter.

  2. MailboxProcessor is a good choice for a crawler, but unless I have missed something, it is not used to full effect here: the agent stays busy until it has finished processing the current page, so pages are effectively downloaded one at a time. To fix this, I would run the following piece of code via Async.StartChild (a sketch follows the quoted code).

    let! r = download (IO.Path.GetTempFileName()) url'
    match r with
    | Error e -> log "error %O" e.Message
    | Content f -> // todo: async { } |> Async.Start + cs.Token + visited - Concurrent.ConcurrentDictionary
        let hc = fileHash f.Path
        log "processing %O as %O" url hc
        let path = IO.Path.Combine(trgFolder, hc)
        IO.File.Move(f.Path, path)
        entries.WriteLine(sprintf "%O \t %O" hc url)
        for u in urls path |> Seq.map trimQuery do
            if visited.Add u then
                inbox.Post (depth+1, u)
    
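A sketch of how that might look (my arrangement of the suggestion; as the todo comment notes, visited and the entries writer would then need to be made safe for concurrent access, e.g. via Concurrent.ConcurrentDictionary):

    // Sketch: fork the per-page work so the agent can immediately receive
    // the next message instead of blocking on the download.
    let! _ = Async.StartChild (async {
        let! r = download (IO.Path.GetTempFileName()) url'
        match r with
        | Error e -> log "error %O" e.Message
        | Content f ->
            let hc = fileHash f.Path
            log "processing %O as %O" url hc
            let path = IO.Path.Combine(trgFolder, hc)
            IO.File.Move(f.Path, path)
            entries.WriteLine(sprintf "%O \t %O" hc url)
            for u in urls path |> Seq.map trimQuery do
                if visited.Add u then
                    inbox.Post (depth+1, u)
        })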

Second, about improvements specific to crawling.

  3. I do not see a delay between requests; many sites will ban you for that (a sketch covering this and point 5 follows the list).

  4. You did not post the code for link extraction and handling. It can be tricky to implement properly.

  5. I would set a user agent to avoid being blocked by some servers (though this is not as important as the delay).
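
As a combined sketch of points 3 and 5, the question's files.download could grow a delay and a User-Agent header roughly as below; the 1000 ms delay and the agent string are placeholder values of mine, not recommendations:

    // Hypothetical variant of the question's download: same shape, plus a
    // crude politeness delay and an explicit User-Agent header.
    let download trgFileName (srcUrl : string) = async {
        try
            do! Async.Sleep 1000 // point 3: delay between requests
            let w = WebRequest.Create(Uri srcUrl) :?> HttpWebRequest
            w.UserAgent <- "FsCrawler/0.1" // point 5: identify the client
            use! r = w.AsyncGetResponse()
            use f = new IO.FileStream(trgFileName, IO.FileMode.Create, IO.FileAccess.Write, IO.FileShare.None)
            r.GetResponseStream().CopyTo f
            return Content { ContentType = r.ContentType; Path = trgFileName }
        with e ->
            return Error e
        }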

Also, I recently wrote two blog posts about the topic (one about the theoretical aspects and a second about an implementation in F#); maybe you will find something useful there, especially the parts about implementing delays, extracting URL links, and proper asynchronous handling.

answered May 18, 2016 at 21:31
