In A Tour of Go, you are given the following problem:
In this exercise you'll use Go's concurrency features to parallelize a web crawler.
Modify the `Crawl` function to fetch URLs in parallel without fetching the same URL twice.
This is the code you're given:
```go
package main

import (
	"fmt"
)

type Fetcher interface {
	// Fetch returns the body of URL and
	// a slice of URLs found on that page.
	Fetch(url string) (body string, urls []string, err error)
}

// Crawl uses fetcher to recursively crawl
// pages starting with url, to a maximum of depth.
func Crawl(url string, depth int, fetcher Fetcher) {
	// TODO: Fetch URLs in parallel.
	// TODO: Don't fetch the same URL twice.
	// This implementation doesn't do either:
	if depth <= 0 {
		return
	}
	body, urls, err := fetcher.Fetch(url)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("found: %s %q\n", url, body)
	for _, u := range urls {
		Crawl(u, depth-1, fetcher)
	}
	return
}

func main() {
	Crawl("http://golang.org/", 4, fetcher)
}

// fakeFetcher is Fetcher that returns canned results.
type fakeFetcher map[string]*fakeResult

type fakeResult struct {
	body string
	urls []string
}

func (f fakeFetcher) Fetch(url string) (string, []string, error) {
	if res, ok := f[url]; ok {
		return res.body, res.urls, nil
	}
	return "", nil, fmt.Errorf("not found: %s", url)
}

// fetcher is a populated fakeFetcher.
var fetcher = fakeFetcher{
	"http://golang.org/": &fakeResult{
		"The Go Programming Language",
		[]string{
			"http://golang.org/pkg/",
			"http://golang.org/cmd/",
		},
	},
	"http://golang.org/pkg/": &fakeResult{
		"Packages",
		[]string{
			"http://golang.org/",
			"http://golang.org/cmd/",
			"http://golang.org/pkg/fmt/",
			"http://golang.org/pkg/os/",
		},
	},
	"http://golang.org/pkg/fmt/": &fakeResult{
		"Package fmt",
		[]string{
			"http://golang.org/",
			"http://golang.org/pkg/",
		},
	},
	"http://golang.org/pkg/os/": &fakeResult{
		"Package os",
		[]string{
			"http://golang.org/",
			"http://golang.org/pkg/",
		},
	},
}
```
And this is the solution I came up with:
```go
type Job struct {
	url   string
	depth int
	done  chan bool
}

func dedup(fetcher Fetcher, ch chan Job) {
	seen := make(map[string]bool)
	for job := range ch {
		if _, ok := seen[job.url]; ok || job.depth <= 0 {
			job.done <- true
			continue
		}
		seen[job.url] = true
		go Crawl(job, fetcher, ch)
	}
}

func Crawl(job Job, fetcher Fetcher, q chan Job) {
	body, urls, err := fetcher.Fetch(job.url)
	if err != nil {
		fmt.Println(err)
		job.done <- true
		return
	}
	fmt.Printf("found: %s %q\n", job.url, body)
	ch := make(chan bool, len(urls))
	for _, u := range urls {
		u := u
		go func() { q <- Job{u, job.depth - 1, ch} }()
	}
	for i := 0; i < len(urls); i++ {
		<-ch
	}
	job.done <- true
	return
}

func main() {
	done := make(chan bool, 1)
	q := make(chan Job)
	go dedup(fetcher, q)
	q <- Job{"http://golang.org/", 4, done}
	<-done
}
```
All jobs go through `dedup`'s channel, so that a given URL is fetched only once. `dedup` kicks off a goroutine for each unique URL it sees, and the goroutine then adds the URLs from that page back to the channel.
Go's way of thinking is new to me, so my main concerns are:
- Is it idiomatic?
- Is it clear what's going on to someone who's familiar with Go?
- Is buffering the channels in this way necessary (or even helpful)?
1 Answer
Your map is of booleans, so you can replace `if _, ok := seen[job.url]; ok || bar` with `if seen[job.url] || bar`. The zero value (which is what gets returned for non-existing map lookups) of `bool` is `false`.
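For instance, here's a minimal, self-contained sketch of that zero-value behavior (the URL key is just an example):

```go
package main

import "fmt"

func main() {
	seen := make(map[string]bool)

	// Looking up a missing key returns the zero value for bool,
	// which is false, so the two-value "comma ok" form is unnecessary.
	fmt.Println(seen["http://golang.org/"]) // false

	seen["http://golang.org/"] = true
	fmt.Println(seen["http://golang.org/"]) // true
}
```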
It would be better to have `Crawl` do something like `defer func() { job.done <- true }()` immediately, once, rather than sprinkle `job.done <- true` at all the right places and hope you (and any future editors) don't ever miss any.
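As a sketch, the top of `Crawl` would become something like this (assuming the rest of the body stays the same, minus the explicit sends):

```go
func Crawl(job Job, fetcher Fetcher, q chan Job) {
	// One deferred send covers every return path below,
	// so no path can forget to signal completion.
	defer func() { job.done <- true }()

	body, urls, err := fetcher.Fetch(job.url)
	if err != nil {
		fmt.Println(err)
		return // the deferred send still fires
	}
	fmt.Printf("found: %s %q\n", job.url, body)
	// ... fan out and wait as before, with all the
	// explicit job.done <- true sends removed.
}
```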
Your `for` loop in `Crawl` should either just require that the channel is sufficiently buffered (as it currently is) and not bother with goroutines, or the whole loop should be put into a single goroutine (and then you don't need any channel buffering). Also, `for i := 0; i < len(urls); i++` should be just `for i := range urls` (or, since you don't need `i`, just `for _ = range urls`).
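For example, the second option could look like this (a sketch against your current `Crawl`; note `ch` is now unbuffered):

```go
// A single goroutine feeds the child jobs into q, leaving
// Crawl free to receive on ch below; because Crawl is always
// ready to receive, ch no longer needs any buffering.
ch := make(chan bool)
go func() {
	for _, u := range urls {
		q <- Job{u, job.depth - 1, ch}
	}
}()
for _ = range urls {
	<-ch
}
```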
You seem to be using `Job.done` like a `sync.WaitGroup`. You could instead have a single wait group that keeps track of outstanding jobs. Your `main` would make sure to call `wg.Add(1)` before sending its first job and before it then does `wg.Wait()`. Almost everywhere you currently do `job.done <- true` would then just be `wg.Done()`. You'd only need to make sure that your crawler did `wg.Add(len(urls))` before it did `wg.Done()`, so that the wait group count can never accidentally fall to zero too early.

Although having `dedup` check the depth is probably still a good idea, it's easy to also check it in the crawl function and save some effort.
Your code modified with the above points:
```go
type Job struct {
	url   string
	depth int
}

func dedup(fetcher Fetcher, ch chan Job, wg *sync.WaitGroup) {
	seen := make(map[string]bool)
	for job := range ch {
		if seen[job.url] || job.depth <= 0 {
			wg.Done()
			continue
		}
		seen[job.url] = true
		go Crawl(job, fetcher, ch, wg)
	}
}

func Crawl(job Job, fetcher Fetcher, q chan Job, wg *sync.WaitGroup) {
	defer wg.Done()
	body, urls, err := fetcher.Fetch(job.url)
	if err != nil {
		log.Println("Crawl failure:", err) // TODO: something better with errors
		return
	}
	log.Printf("Crawl: found %s\t%q\n", job.url, body)
	if job.depth <= 1 {
		return
	}
	wg.Add(len(urls))
	for _, u := range urls {
		q <- Job{u, job.depth - 1}
	}
}

func main() {
	wg := new(sync.WaitGroup)
	wg.Add(1)
	q := make(chan Job)
	go dedup(fetcher, q, wg)
	q <- Job{"http://golang.org/", 4}
	wg.Wait()
	// We close q here so that the dedup goroutine
	// will finish. This isn't strictly needed here
	// since we're main and everything gets stopped
	// when we return; but that wouldn't be the case
	// if we were some other function in a long-lived
	// program.
	close(q)
}
```
And here it is, runnable on the playground (with the Go tour scaffolding included): http://play.golang.org/p/cuY3PZdQmI
(For comparison: your original code is also runnable on the playground.)