Writing a simple web crawler in Go (golang)

tt-0411


Last time I wrote a crawler in Scala. Recently, while learning Go in my spare time, I ported that Scala crawler to Go. The code is as follows:

package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"regexp"
)

var (
	// ptnIndexItem matches one article link on the index page:
	// group 1 is the relative path, group 2 is the link text (title).
	ptnIndexItem = regexp.MustCompile(`<a target="_blank" href="(.+\.html)" title=".+" >(.+)</a>`)
	// ptnContentRough captures everything between the two marker divs.
	ptnContentRough = regexp.MustCompile(`(?s).*<div class="artcontent">(.*)<div id="zhanwei">.*`)
	// ptnBrTag matches <br>, which is turned into a line break.
	ptnBrTag = regexp.MustCompile(`<br>`)
	// ptnHTMLTag matches any opening or closing tag.
	ptnHTMLTag = regexp.MustCompile(`(?s)</?.*?>`)
	// ptnSpace matches leading whitespace and space characters.
	ptnSpace = regexp.MustCompile(`(^\s+)|( )`)
)

// Get fetches url and returns the body and the HTTP status code.
// A status of -100 means the request failed, -200 means the body
// could not be read.
func Get(url string) (content string, statusCode int) {
	resp, err := http.Get(url)
	if err != nil {
		statusCode = -100
		return
	}
	defer resp.Body.Close()
	data, err := io.ReadAll(resp.Body)
	if err != nil {
		statusCode = -200
		return
	}
	statusCode = resp.StatusCode
	content = string(data)
	return
}

// IndexItem is one entry of the index page.
type IndexItem struct {
	url   string
	title string
}

// findIndex extracts every article link from the index page.
func findIndex(content string) []IndexItem {
	matches := ptnIndexItem.FindAllStringSubmatch(content, -1) // -1: no limit
	index := make([]IndexItem, len(matches))
	for i, item := range matches {
		index[i] = IndexItem{"http://www.yifan100.com" + item[1], item[2]}
	}
	return index
}

// readContent downloads one article and reduces it to plain text.
// It returns an empty string if the download or the match fails.
func readContent(url string) (content string) {
	raw, statusCode := Get(url)
	if statusCode != 200 {
		fmt.Printf("Fail to get the raw data from %s\n", url)
		return
	}
	match := ptnContentRough.FindStringSubmatch(raw)
	if match == nil {
		return
	}
	content = match[1]
	content = ptnBrTag.ReplaceAllString(content, "\r\n")
	content = ptnHTMLTag.ReplaceAllString(content, "")
	content = ptnSpace.ReplaceAllString(content, "")
	return
}

func main() {
	fmt.Println(`Get index ...`)
	s, statusCode := Get("http://www.yifan100.com/dir/15136/")
	if statusCode != 200 {
		fmt.Println("Fail to get the index page, status:", statusCode)
		return
	}
	index := findIndex(s)
	fmt.Println(`Get contents and write to file ...`)
	for _, item := range index {
		fmt.Printf("Get content %s from %s and write to file.\n", item.title, item.url)
		fileName := fmt.Sprintf("%s.txt", item.title)
		content := readContent(item.url)
		if err := os.WriteFile(fileName, []byte(content), 0644); err != nil {
			fmt.Printf("Fail to write %s: %v\n", fileName, err)
			continue
		}
		fmt.Printf("Finish writing to %s.\n", fileName)
	}
}
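
To make the regular expressions easier to follow, here is a small sketch that runs the index pattern and the tag-stripping patterns against inline sample markup instead of the live site. The sample HTML and the helper names extractLinks and stripTags are made up for illustration and are not part of the crawler. Unlike the crawler, the sketch replaces <br> with "\n" rather than "\r\n".

```go
package main

import (
	"fmt"
	"regexp"
)

// Same patterns as in the crawler.
var (
	ptnIndexItem = regexp.MustCompile(`<a target="_blank" href="(.+\.html)" title=".+" >(.+)</a>`)
	ptnBrTag     = regexp.MustCompile(`<br>`)
	ptnHTMLTag   = regexp.MustCompile(`(?s)</?.*?>`)
)

// extractLinks (hypothetical helper) returns {path, title} pairs
// for every link the index pattern finds in page.
func extractLinks(page string) [][2]string {
	var links [][2]string
	for _, m := range ptnIndexItem.FindAllStringSubmatch(page, -1) {
		links = append(links, [2]string{m[1], m[2]})
	}
	return links
}

// stripTags (hypothetical helper) turns <br> into a line break
// and then removes every remaining tag.
func stripTags(fragment string) string {
	fragment = ptnBrTag.ReplaceAllString(fragment, "\n")
	return ptnHTMLTag.ReplaceAllString(fragment, "")
}

func main() {
	// Made-up sample markup with one link per line. Without the (?s)
	// flag, "." does not match a newline, so each line matches on its own.
	page := `<a target="_blank" href="/a/1.html" title="First" >First</a>
<a target="_blank" href="/a/2.html" title="Second" >Second</a>`
	for _, l := range extractLinks(page) {
		fmt.Println(l[0], l[1])
	}
	fmt.Println(stripTags(`<p>line one<br>line <b>two</b></p>`))
}
```

Because ".+" is greedy, two links on the same line would be merged into a single match, so the index pattern relies on the site emitting one link per line.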

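The Go version is somewhat longer than the Scala version, mainly for the following reasons:
1 Go cares a great deal about code style, or rather code formatting, so many constructs have one fixed and sometimes tedious form. For example, even when the body of an if is a single statement, the braces are mandatory and gofmt lays it out over three lines, whereas Scala and many other languages allow a single line;
2 The methods in Go's strings and regexp packages are not especially convenient. Compared with them, Scala's regular expression and string handling feel much more comfortable to use;
3 The Scala crawler uses utility classes and methods from the Scala standard library. They are not part of the language syntax, but they feel like syntactic sugar. Many of them are tied to functional programming, and I have not yet looked closely at functional programming in Go.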
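
As a rough illustration of the functional style mentioned in the third point, the sketch below defines two small helpers, Map and Filter, using type parameters. It assumes Go 1.18 or later. Both helpers are hypothetical and written here for illustration: they do not come from the standard library or from the crawler.

```go
package main

import (
	"fmt"
	"strings"
)

// Map (hypothetical helper) applies f to every element of in.
func Map[T, U any](in []T, f func(T) U) []U {
	out := make([]U, 0, len(in))
	for _, v := range in {
		out = append(out, f(v))
	}
	return out
}

// Filter (hypothetical helper) keeps the elements for which keep is true.
func Filter[T any](in []T, keep func(T) bool) []T {
	var out []T
	for _, v := range in {
		if keep(v) {
			out = append(out, v)
		}
	}
	return out
}

func main() {
	// Made-up paths: keep the .html ones and turn them into full URLs.
	paths := []string{"/a/1.html", "/a/logo.png", "/a/2.html"}
	pages := Filter(paths, func(p string) bool { return strings.HasSuffix(p, ".html") })
	urls := Map(pages, func(p string) string { return "http://www.yifan100.com" + p })
	fmt.Println(urls)
}
```

The function literals are more verbose than Scala's lambdas, which is in line with the observation that the Go version needs more lines.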

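The Go crawler does have one advantage: it compiles very quickly. Its execution speed shows no advantage the way the code is currently written, and goroutines, one of Go's signature features, are not used here. I will keep improving this code.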
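
One possible direction for using goroutines is to download several articles at the same time. The sketch below is only an outline under stated assumptions: fetchAll and its limit parameter are hypothetical, and the fetch function is passed in so that a stand-in can replace readContent and the example runs without network access.

```go
package main

import (
	"fmt"
	"sync"
)

// fetchAll (hypothetical helper) calls fetch for every URL in its own
// goroutine, lets at most limit calls run at a time, and returns the
// results in input order. limit must be at least 1.
func fetchAll(urls []string, limit int, fetch func(string) string) []string {
	results := make([]string, len(urls))
	sem := make(chan struct{}, limit) // buffered channel used as a counting semaphore
	var wg sync.WaitGroup
	for i, u := range urls {
		wg.Add(1)
		go func(i int, u string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release it when done
			results[i] = fetch(u)    // each goroutine writes its own index
		}(i, u)
	}
	wg.Wait()
	return results
}

func main() {
	// Stand-in for readContent, so no request is actually sent.
	fake := func(url string) string { return "content of " + url }
	urls := []string{"http://example.com/1.html", "http://example.com/2.html"}
	for i, c := range fetchAll(urls, 2, fake) {
		fmt.Println(urls[i], "->", c)
	}
}
```

In the crawler, readContent could be passed as fetch, since it already has the signature func(string) string. Capping the number of concurrent requests keeps the load on the target site modest.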


Source: 博客园 (cnblogs)

Thanks to the author: tt-0411
