I'm a beginner in Go and just finished the Go tour. This crawler is not the same as the one in the tour's exercises but something I wrote myself. I am looking for suggestions for making it better in terms of idiomatic Go.
package main

import (
	"fmt"
	"io/ioutil"
	"log"
	"net/http"
	"regexp"
	"sync"
)

type Crawler struct {
	urls   map[string]bool
	mux    sync.Mutex
	umatch *regexp.Regexp
}

// Pointer receivers throughout: a value receiver would copy the Mutex
// on every call (go vet flags this).
func (c *Crawler) parse(body string) []string {
	return c.umatch.FindAllString(body, -1)
}

func (c *Crawler) fetch(url string) []string {
	res, err := http.Get(url)
	if err != nil {
		log.Printf("error fetching %s: %v", url, err)
		return nil // a single failed page shouldn't kill the whole crawl
	}
	defer res.Body.Close()
	body, err := ioutil.ReadAll(res.Body)
	if err != nil {
		log.Printf("error reading %s: %v", url, err)
		return nil
	}
	return c.parse(string(body))
}

func (c *Crawler) Crawl(url string, depth int, wg *sync.WaitGroup) {
	defer wg.Done()
	if depth <= 0 {
		return
	}
	c.mux.Lock()
	if c.urls[url] { // already seen
		c.mux.Unlock()
		return
	}
	c.urls[url] = true
	c.mux.Unlock()
	log.Printf("fetching %s", url)
	for _, u := range c.fetch(url) {
		wg.Add(1)
		go c.Crawl(u, depth-1, wg)
	}
}

func main() {
	c := Crawler{
		urls:   map[string]bool{},
		umatch: regexp.MustCompile(`(http|ftp|https):\/\/([\w\-_]+(?:(?:\.[\w\-_]+)+))([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?`),
	}
	var wg sync.WaitGroup
	wg.Add(1)
	c.Crawl("http://www.yahoo.com", 3, &wg)
	wg.Wait() // without this, main could exit (and read c.urls) while goroutines still run
	for u := range c.urls {
		fmt.Println(u)
	}
}
3 Answers
I like regexes, so I can review the regex. (The only experience I have in Go is with regexes.)
It's pretty large, and you have numerous unnecessary characters. Look at it:
(http|ftp|https):\/\/([\w\-_]+(?:(?:\.[\w\-_]+)+))([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?
- Go does not use delimiters, so `/` is not special and needs no escaping:

  (http|ftp|https)://

- `\w` already matches `_` (underscore), and `-` does not need to be escaped if it appears at the beginning or end of a character class:

  ([-\w]+(?:(?:\.[-\w]+)+))

- There is no reason to escape `.` or `+` in a character class:

  ([-\w.,@?^=%&:/~+#]*

  It can be further shrunk if you use character ranges, but this is not very legible when it comes to special characters:

  ([\w+-/@?^=%&:~#]*

- I have no clue why you escaped the `@`. Of course it doesn't need escaping, since you didn't escape the other one!

  [-\w@?^=%&/~+#])?

To put it all together, the new regex would be:

(http|ftp|https)://([\w-]+(?:(?:\.[\w-]+)+))([-\w.,@?^=%&:/~+#]*[-\w@?^=%&/~+#])?
I also wrote a crawler in Go, and it uses the following regexes to find links, and sublinks only within the same domain:

re := regexp.MustCompile(`href="(.*?)"`)
subre := regexp.MustCompile(`"/[\w]+`)
matchLink := re.FindAllStringSubmatch(string(*data), -1)
If you wanted to make it a bit more concurrent, you could use channels and goroutines.
The call c.fetch(url)
could be executed in a goroutine: have it fetch the URLs from the page and put them on a channel, and in the main calling process pick those URLs up and fetch them in turn.
go c.fetch(url)
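A minimal sketch of that idea, with the fetch step stubbed out (the function and variable names, and the little page graph, are all made up for illustration, not taken from either crawler): workers read URLs from a channel and feed any newly discovered links back into it.

```go
package main

import (
	"fmt"
	"sync"
)

type task struct {
	url   string
	depth int
}

// fakeFetch stands in for the real fetch-and-parse step.
func fakeFetch(url string) []string {
	pages := map[string][]string{
		"http://a": {"http://b", "http://c"},
		"http://b": {"http://c", "http://d"},
	}
	return pages[url]
}

// crawl runs a small pool of workers reading from a shared channel and
// returns the set of visited URLs.
func crawl(start string, depth int) map[string]bool {
	var (
		mu   sync.Mutex
		wg   sync.WaitGroup
		seen = map[string]bool{}
	)
	tasks := make(chan task, 100) // buffered so workers don't block on send

	submit := func(url string, depth int) {
		mu.Lock()
		defer mu.Unlock()
		if depth <= 0 || seen[url] {
			return
		}
		seen[url] = true
		wg.Add(1) // one outstanding task per accepted URL
		tasks <- task{url, depth}
	}

	for i := 0; i < 4; i++ {
		go func() {
			for t := range tasks {
				for _, u := range fakeFetch(t.url) {
					submit(u, t.depth-1)
				}
				wg.Done()
			}
		}()
	}

	submit(start, depth)
	wg.Wait()
	close(tasks)
	return seen
}

func main() {
	for u := range crawl("http://a", 3) {
		fmt.Println(u)
	}
}
```

Because submit calls wg.Add before sending, and a worker only calls wg.Done after submitting all children, the counter never drops to zero while work is still pending, so wg.Wait returns exactly when the crawl is finished.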
If you want to have a look at mine, go here.
If you want to have a look at mine: Multi Site Concurrent web crawler using gRPC.