I'm a beginner in Go and just finished the Go tour. This crawler is not the same as the one in the tour's exercises but something I wrote myself. I am looking for suggestions for making it better in terms of idiomatic Go.
package main

import (
	"fmt"
	"io/ioutil"
	"log"
	"net/http"
	"regexp"
	"sync"
)

type Crawler struct {
	urls   map[string]bool
	mux    sync.Mutex
	umatch *regexp.Regexp
}

// Pointer receivers throughout: a value receiver would copy the Mutex
// on every call (go vet flags this).
func (c *Crawler) parse(body string) []string {
	return c.umatch.FindAllString(body, -1)
}

func (c *Crawler) fetch(url string) []string {
	res, err := http.Get(url)
	if err != nil {
		log.Printf("error fetching %s: %v", url, err)
		return nil // a single failed page shouldn't kill the whole crawl
	}
	defer res.Body.Close()
	body, err := ioutil.ReadAll(res.Body)
	if err != nil {
		log.Printf("error reading %s: %v", url, err)
		return nil
	}
	return c.parse(string(body))
}

func (c *Crawler) Crawl(url string, depth int, wg *sync.WaitGroup) {
	defer wg.Done()
	if depth <= 0 {
		return
	}
	c.mux.Lock()
	if c.urls[url] { // already seen
		c.mux.Unlock()
		return
	}
	c.urls[url] = true
	c.mux.Unlock()
	log.Printf("fetching %s", url)
	for _, u := range c.fetch(url) {
		wg.Add(1)
		go c.Crawl(u, depth-1, wg)
	}
}

func main() {
	c := Crawler{
		urls:   map[string]bool{},
		umatch: regexp.MustCompile(`(http|ftp|https):\/\/([\w\-_]+(?:(?:\.[\w\-_]+)+))([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?`),
	}
	var wg sync.WaitGroup
	wg.Add(1)
	c.Crawl("http://www.yahoo.com", 3, &wg)
	wg.Wait() // without this, main could exit (and read c.urls) while goroutines still run
	for u := range c.urls {
		fmt.Println(u)
	}
}
3 Answers
I like regexes, so I can review the regex. (The only experience I have in Go is with regexes.)
It's pretty large, and you have numerous unnecessary characters. Look at it:
(http|ftp|https):\/\/([\w\-_]+(?:(?:\.[\w\-_]+)+))([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?
- Go does not use delimiters, so `/` is not special and needs no escaping:

  (http|ftp|https)://

- `\w` already matches `_` (underscore), and `-` does not need to be escaped if it appears at the beginning or end of a character class:

  ([-\w]+(?:(?:\.[-\w]+)+))

- There is no reason to escape `.` or `+` in a character class:

  ([-\w.,@?^=%&:/~+#]*

  It can be further shrunk if you use character ranges, but this is not very legible when it comes to special characters:

  ([\w+-/@?^=%&:~#]*

- I have no clue why you escaped the `@`. Of course it doesn't need escaping, since you didn't escape the other one!

  [-\w@?^=%&/~+#])?

To put it all together, the new regex would be:

(http|ftp|https)://([\w-]+(?:(?:\.[\w-]+)+))([-\w.,@?^=%&:/~+#]*[-\w@?^=%&/~+#])?
I also wrote a crawler in Go, and it uses the following regexes to find links, and sublinks only within the same domain:

re := regexp.MustCompile(`href="(.*?)"`)
subre := regexp.MustCompile(`"/[\w]+`)
matchLink := re.FindAllStringSubmatch(string(*data), -1)
If you wanted to make it a bit more concurrent, you could use channels and goroutines.
The call c.fetch(url)
could be executed in a goroutine: have it fetch the URLs from the page and put them on a channel, and in the main calling process pick those URLs up and fetch them in turn.
go c.fetch(url)
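A minimal sketch of that idea, with the fetch step stubbed out (the function and variable names, and the little page graph, are all made up for illustration, not taken from either crawler): workers read URLs from a channel and feed any newly discovered links back into it.

```go
package main

import (
	"fmt"
	"sync"
)

type task struct {
	url   string
	depth int
}

// fakeFetch stands in for the real fetch-and-parse step.
func fakeFetch(url string) []string {
	pages := map[string][]string{
		"http://a": {"http://b", "http://c"},
		"http://b": {"http://c", "http://d"},
	}
	return pages[url]
}

// crawl runs a small pool of workers reading from a shared channel and
// returns the set of visited URLs.
func crawl(start string, depth int) map[string]bool {
	var (
		mu   sync.Mutex
		wg   sync.WaitGroup
		seen = map[string]bool{}
	)
	tasks := make(chan task, 100) // buffered so workers don't block on send

	submit := func(url string, depth int) {
		mu.Lock()
		defer mu.Unlock()
		if depth <= 0 || seen[url] {
			return
		}
		seen[url] = true
		wg.Add(1) // one outstanding task per accepted URL
		tasks <- task{url, depth}
	}

	for i := 0; i < 4; i++ {
		go func() {
			for t := range tasks {
				for _, u := range fakeFetch(t.url) {
					submit(u, t.depth-1)
				}
				wg.Done()
			}
		}()
	}

	submit(start, depth)
	wg.Wait()
	close(tasks)
	return seen
}

func main() {
	for u := range crawl("http://a", 3) {
		fmt.Println(u)
	}
}
```

Because submit calls wg.Add before sending, and a worker only calls wg.Done after submitting all children, the counter never drops to zero while work is still pending, so wg.Wait returns exactly when the crawl is finished.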
If you want to have a look at mine, go here.
If you want to have a look at mine: Multi Site Concurrent web crawler using gRPC.