gocrawl Analysis

harrysun

1. gocrawl type structure

    // The crawler itself, the master of the whole process
    type Crawler struct {
        Options *Options

        // Internal fields
        logFunc         func(LogFlags, string, ...interface{})
        push            chan *workerResponse
        enqueue         chan interface{}
        stop            chan struct{}
        wg              *sync.WaitGroup
        pushPopRefCount int
        visits          int

        // keep lookups in maps, O(1) access time vs O(n) for slice. The empty struct value
        // is of no use, but this is the smallest type possible - it uses no memory at all.
        visited map[string]struct{}
        hosts   map[string]struct{}
        workers map[string]*worker
    }

    // The Options available to control and customize the crawling process.
    type Options struct {
        UserAgent             string
        RobotUserAgent        string
        MaxVisits             int
        EnqueueChanBuffer     int
        HostBufferFactor      int
        CrawlDelay            time.Duration // Applied per host
        WorkerIdleTTL         time.Duration
        SameHostOnly          bool
        HeadBeforeGet         bool
        URLNormalizationFlags purell.NormalizationFlags
        LogFlags              LogFlags
        Extender              Extender
    }

    // Extension methods required to provide an extender instance.
    type Extender interface {
        // Start, End, Error and Log are not related to a specific URL, so they don't
        // receive a URLContext struct.
        Start(interface{}) interface{}
        End(error)
        Error(*CrawlError)
        Log(LogFlags, LogFlags, string)

        // ComputeDelay is related to a Host only, not to a URLContext, although the FetchInfo
        // is related to a URLContext (holds a ctx field).
        ComputeDelay(string, *DelayInfo, *FetchInfo) time.Duration

        // All other extender methods are executed in the context of an URL, and thus
        // receive an URLContext struct as first argument.
        Fetch(*URLContext, string, bool) (*http.Response, error)
        RequestGet(*URLContext, *http.Response) bool
        RequestRobots(*URLContext, string) ([]byte, bool)
        FetchedRobots(*URLContext, *http.Response)
        Filter(*URLContext, bool) bool
        Enqueued(*URLContext)
        Visit(*URLContext, *http.Response, *goquery.Document) (interface{}, bool)
        Visited(*URLContext, interface{})
        Disallowed(*URLContext)
    }
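
An interface this large is usually satisfied by embedding a default implementation and overriding only the methods of interest. The sketch below shows that embedding pattern with standard-library code only. The `hooks` interface, `defaultHooks` and `httpsOnly` are hypothetical, cut-down stand-ins invented for illustration; they are not gocrawl types and their signatures are simplified.

```go
package main

import (
	"fmt"
	"strings"
)

// hooks is a hypothetical, cut-down stand-in for gocrawl's Extender.
type hooks interface {
	Filter(url string, isVisited bool) bool
	Visit(url string) (harvested []string, findLinks bool)
}

// defaultHooks plays the role of a default implementation: it accepts every
// unvisited URL and harvests nothing itself.
type defaultHooks struct{}

func (defaultHooks) Filter(url string, isVisited bool) bool { return !isVisited }
func (defaultHooks) Visit(url string) ([]string, bool) { return nil, true }

// httpsOnly embeds defaultHooks and overrides only Filter; Visit is promoted
// from the embedded value, so httpsOnly still satisfies hooks.
type httpsOnly struct {
	defaultHooks
}

// Filter adds an https-only rule on top of the default behaviour.
func (h httpsOnly) Filter(url string, isVisited bool) bool {
	return strings.HasPrefix(url, "https://") && h.defaultHooks.Filter(url, isVisited)
}

func main() {
	var h hooks = httpsOnly{}
	fmt.Println(h.Filter("https://example.com/", false)) // true
	fmt.Println(h.Filter("http://example.com/", false))  // false: wrong scheme
	fmt.Println(h.Filter("https://example.com/", true))  // false: already visited
	_, findLinks := h.Visit("https://example.com/")
	fmt.Println(findLinks) // true, from the promoted default Visit
}
```

The `Ext` value in the entry point has the same shape: a struct wrapping a `gocrawl.DefaultExtender`.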

Entry point:

    package main

    import (
        "time"
        "github.com/PuerkitoBio/gocrawl"
    )

    // Ext is not defined in the post; assumed here to embed the default extender.
    type Ext struct{ *gocrawl.DefaultExtender }

    func main() {
        ext := &Ext{&gocrawl.DefaultExtender{}}
        // Set custom options
        opts := gocrawl.NewOptions(ext)
        opts.CrawlDelay = 1 * time.Second
        opts.LogFlags = gocrawl.LogError
        opts.SameHostOnly = false
        opts.MaxVisits = 10

        c := gocrawl.NewCrawlerWithOptions(opts)
        c.Run("http://0value.com")
    }

main does three things:

1) get an Extender

2) create Options with the given Extender

3) create the gocrawl Crawler from those Options and run it

As the comment on the struct says, the Crawler controls the whole process; Options supplies the configuration and the Extender does the real work.
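
The division of labour in the three steps can be sketched with standard-library code only. Everything here (`visitor`, `options`, `newOptions`, `crawler`, `printer`) is a hypothetical miniature invented for illustration, and treating a `MaxVisits` of 0 as "no limit" is an assumption of the sketch, not a statement about gocrawl's code.

```go
package main

import "fmt"

// visitor, options and crawler are hypothetical miniatures of Extender,
// Options and Crawler; none of this is gocrawl's code.
type visitor interface {
	Visit(url string)
}

type options struct {
	MaxVisits int // 0 is treated as "no limit" in this sketch
	Extender  visitor
}

// newOptions fills in defaults so the caller only overrides what it needs.
func newOptions(ext visitor) *options {
	return &options{MaxVisits: 0, Extender: ext}
}

type crawler struct{ opts *options }

// run is the control loop: it decides which URLs are processed and when to
// stop, and hands each URL to the extender for the actual work.
func (c *crawler) run(seeds ...string) int {
	visits := 0
	for _, u := range seeds {
		if c.opts.MaxVisits > 0 && visits >= c.opts.MaxVisits {
			break
		}
		c.opts.Extender.Visit(u)
		visits++
	}
	return visits
}

// printer is a trivial extender that only prints the URL it is given.
type printer struct{}

func (printer) Visit(url string) { fmt.Println("visiting", url) }

func main() {
	opts := newOptions(printer{}) // steps 1 and 2
	opts.MaxVisits = 2
	c := &crawler{opts: opts} // step 3
	n := c.run("http://a.example/", "http://b.example/", "http://c.example/")
	fmt.Println(n) // 2: the third seed is skipped because of MaxVisits
}
```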

2. Other key structs

worker, workerResponse and sync.WaitGroup

    // Communication from worker to the master crawler, about the crawling of a URL
    type workerResponse struct {
        ctx           *URLContext
        visited       bool
        harvestedURLs interface{}
        host          string
        idleDeath     bool
    }

    // The worker is dedicated to fetching and visiting a given host, respecting
    // this host's robots.txt crawling policies.
    type worker struct {
        // Worker identification
        host  string
        index int

        // Communication channels and sync
        push    chan<- *workerResponse
        pop     popChannel
        stop    chan struct{}
        enqueue chan<- interface{}
        wg      *sync.WaitGroup

        // Robots validation
        robotsGroup *robotstxt.Group

        // Logging
        logFunc func(LogFlags, string, ...interface{})

        // Implementation fields
        wait           <-chan time.Time
        lastFetch      *FetchInfo
        lastCrawlDelay time.Duration
        opts           *Options
    }

For more about sync.WaitGroup, see http://mindfsck.net/example-golang-makes-concurrent-programming-easy-awesome/ and http://soniacodes.wordpress.com/2011/02/28/channels-vs-sync-package/
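
The fields of worker and workerResponse suggest a master/worker layout: one goroutine per host, a `push` channel back to the master, a `stop` channel, and a shared `sync.WaitGroup`. The sketch below shows that general pattern using only the standard library. `response`, `runWorker` and `crawl` are hypothetical names, and the logic is a simplified illustration rather than gocrawl's actual implementation.

```go
package main

import (
	"fmt"
	"sync"
)

// response is a hypothetical, reduced version of workerResponse.
type response struct {
	host    string
	url     string
	visited bool
}

// runWorker drains the URLs queued for one host and reports each result to
// the master on push. It exits when pop is drained and closed, or when stop
// is closed.
func runWorker(host string, pop <-chan string, push chan<- response, stop <-chan struct{}, wg *sync.WaitGroup) {
	defer wg.Done()
	for {
		select {
		case <-stop:
			return
		case u, ok := <-pop:
			if !ok {
				return
			}
			push <- response{host: host, url: u, visited: true}
		}
	}
}

// crawl starts one worker per host and collects every response.
func crawl(byHost map[string][]string) []response {
	var wg sync.WaitGroup
	push := make(chan response)
	stop := make(chan struct{}) // closing it early would end all workers
	defer close(stop)

	for host, urls := range byHost {
		pop := make(chan string, len(urls))
		for _, u := range urls {
			pop <- u
		}
		close(pop)
		wg.Add(1)
		go runWorker(host, pop, push, stop, &wg)
	}

	// Close push once every worker has called Done, ending the range below.
	go func() {
		wg.Wait()
		close(push)
	}()

	var results []response
	for r := range push {
		results = append(results, r)
	}
	return results
}

func main() {
	results := crawl(map[string][]string{
		"a.example": {"/1", "/2"},
		"b.example": {"/1"},
	})
	fmt.Println(len(results)) // 3
}
```

The WaitGroup is what lets the master know that no worker will send again, so it can close `push` safely.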

3. I will write up the whole gocrawl workflow in a few days. (6/20/2014)


Source: 博客园 (cnblogs)

Author: harrysun
