Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

zchking/crawler

Repository files navigation

crawler

crawler 致力于实现中文友好的网络抓取系统,项目基于PuerkitoBio的一个初级的并行的轻量级抓取库gocrawl.本项目目标:

  1. 实现分布式
  2. 增加机器学习算法
  3. 优化中文编码处理
  4. 完善文档

Features

  • Full control over the URLs to visit, inspect and query (using a pre-initialized [goquery][] document)
  • Crawl delays applied per host
  • Obedience to robots.txt rules (using the [robotstxt.go][robots] library)
  • Concurrent execution using goroutines
  • Configurable logging
  • Open, customizable design providing hooks into the execution logic

Installation and dependencies

crawl depends on the following userland libraries:

  • [goquery][]
  • [purell][]
  • [robotstxt.go][robots]

To install:

go get github.com/zchking/crawl

To install a previous version, you have to git clone https://github.com/zchking/crawl into your $GOPATH/src/github.com/zchking/gocrawl/ directory, and then run (for example) git checkout v0.3.2 to checkout a specific version, and go install to build and install the Go package.

Changelog

2015年01月16日 start from PuerkitoBio.Thanks.

Example

From example_test.go:

package gocrawl
import (
 "github.com/PuerkitoBio/goquery"
 "net/http"
 "regexp"
 "time"
)
// Only enqueue the root and paths beginning with an "a"
var rxOk = regexp.MustCompile(`http://duckduckgo\.com(/a.*)?$`)
// Create the Extender implementation, based on the gocrawl-provided DefaultExtender,
// because we don't want/need to override all methods.
type ExampleExtender struct {
 DefaultExtender // Will use the default implementation of all but Visit() and Filter()
}
// Override Visit for our need.
func (this *ExampleExtender) Visit(ctx *URLContext, res *http.Response, doc *goquery.Document) (interface{}, bool) {
 // Use the goquery document or res.Body to manipulate the data
 // ...
 // Return nil and true - let gocrawl find the links
 return nil, true
}
// Override Filter for our need.
func (this *ExampleExtender) Filter(ctx *URLContext, isVisited bool) bool {
 return !isVisited && rxOk.MatchString(ctx.NormalizedURL().String())
}
func ExampleCrawl() {
 // Set custom options
 opts := NewOptions(new(ExampleExtender))
 opts.CrawlDelay = 1 * time.Second
 opts.LogFlags = LogAll
 // Play nice with ddgo when running the test!
 opts.MaxVisits = 2
 // Create crawler and start at root of duckduckgo
 c := NewCrawlerWithOptions(opts)
 c.Run("https://duckduckgo.com/")
 // Remove "x" before Output: to activate the example (will run on go test)
 // xOutput: voluntarily fail to see log output
}
```

Document

@TODO crawler Document

Refer to PuerkitoBio/gocrawl

Thanks

  • PuerkitoBio
  • Richard Penman
  • Dmitry Bondarenko
  • Markus Sonderegger

License

The [BSD 3-Clause license][bsd].

[bsd]: http://opensource.org/licenses/BSD-3-Clause

[goquery]: https://github.com/PuerkitoBio/goquery

[robots]: https://github.com/temoto/robotstxt.go

[purell]: https://github.com/PuerkitoBio/purell

[robprot]: http://www.robotstxt.org/robotstxt.html

[robspec]: https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt

About

go web crawler

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

AltStyle によって変換されたページ (->オリジナル) /