crawler 致力于实现中文友好的网络抓取系统,项目基于PuerkitoBio的一个初级的并行的轻量级抓取库gocrawl.本项目目标:
- 实现分布式
- 增加机器学习算法
- 优化中文编码处理
- 完善文档
- Full control over the URLs to visit, inspect and query (using a pre-initialized [goquery][] document)
- Crawl delays applied per host
- Obedience to robots.txt rules (using the [robotstxt.go][robots] library)
- Concurrent execution using goroutines
- Configurable logging
- Open, customizable design providing hooks into the execution logic
crawl depends on the following userland libraries:
- [goquery][]
- [purell][]
- [robotstxt.go][robots]
To install:
go get github.com/zchking/crawl
To install a previous version, you have to git clone https://github.com/zchking/crawl into your $GOPATH/src/github.com/zchking/gocrawl/ directory, and then run (for example) git checkout v0.3.2 to checkout a specific version, and go install to build and install the Go package.
2015年01月16日 start from PuerkitoBio.Thanks.
From example_test.go:
package gocrawl
import (
"github.com/PuerkitoBio/goquery"
"net/http"
"regexp"
"time"
)
// Only enqueue the root and paths beginning with an "a"
var rxOk = regexp.MustCompile(`http://duckduckgo\.com(/a.*)?$`)
// Create the Extender implementation, based on the gocrawl-provided DefaultExtender,
// because we don't want/need to override all methods.
type ExampleExtender struct {
DefaultExtender // Will use the default implementation of all but Visit() and Filter()
}
// Override Visit for our need.
func (this *ExampleExtender) Visit(ctx *URLContext, res *http.Response, doc *goquery.Document) (interface{}, bool) {
// Use the goquery document or res.Body to manipulate the data
// ...
// Return nil and true - let gocrawl find the links
return nil, true
}
// Override Filter for our need.
func (this *ExampleExtender) Filter(ctx *URLContext, isVisited bool) bool {
return !isVisited && rxOk.MatchString(ctx.NormalizedURL().String())
}
func ExampleCrawl() {
// Set custom options
opts := NewOptions(new(ExampleExtender))
opts.CrawlDelay = 1 * time.Second
opts.LogFlags = LogAll
// Play nice with ddgo when running the test!
opts.MaxVisits = 2
// Create crawler and start at root of duckduckgo
c := NewCrawlerWithOptions(opts)
c.Run("https://duckduckgo.com/")
// Remove "x" before Output: to activate the example (will run on go test)
// xOutput: voluntarily fail to see log output
}
```
@TODO crawler Document
Refer to PuerkitoBio/gocrawl
- PuerkitoBio
- Richard Penman
- Dmitry Bondarenko
- Markus Sonderegger
The [BSD 3-Clause license][bsd].
[bsd]: http://opensource.org/licenses/BSD-3-Clause
[goquery]: https://github.com/PuerkitoBio/goquery
[robots]: https://github.com/temoto/robotstxt.go
[purell]: https://github.com/PuerkitoBio/purell
[robprot]: http://www.robotstxt.org/robotstxt.html
[robspec]: https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt