hellokaton/elves

Elves

The design and implementation of a lightweight crawler framework, with an accompanying analysis blog post.

@biezhi on zhihu

Features

  • Event-driven
  • Easy to customize
  • Multi-threaded execution
  • CSS selector and XPath support
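
To make "event-driven" and "multi-threaded execution" concrete, here is a minimal sketch in plain Java. Every name in it (`EventDrivenSketch`, `EventBus`, `CrawlEvent`, `run`) is hypothetical and chosen for illustration; it is not the Elves API, only one common way such a design can be laid out: listeners subscribe to event types, and worker threads fire events as pages complete.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

// Hypothetical names throughout; this is NOT the Elves API.
class EventDrivenSketch {

    enum CrawlEvent { STARTED, PAGE_DONE, FINISHED }

    /** Minimal event registry: listeners subscribe per event type. */
    static final class EventBus {
        // Listeners are registered before any worker starts; workers only read this map.
        private final Map<CrawlEvent, List<Consumer<String>>> listeners =
                new EnumMap<>(CrawlEvent.class);

        void on(CrawlEvent event, Consumer<String> listener) {
            listeners.computeIfAbsent(event, e -> new CopyOnWriteArrayList<>()).add(listener);
        }

        void fire(CrawlEvent event, String payload) {
            List<Consumer<String>> targets =
                    listeners.getOrDefault(event, Collections.emptyList());
            for (Consumer<String> listener : targets) {
                listener.accept(payload);
            }
        }
    }

    /** Handles each URL on a worker thread and fires events along the way. */
    static int run(List<String> urls, int threads, EventBus bus) {
        AtomicInteger done = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        bus.fire(CrawlEvent.STARTED, urls.size() + " urls");
        for (String url : urls) {
            pool.submit(() -> {
                // A real crawler would download and parse the page here.
                done.incrementAndGet();
                bus.fire(CrawlEvent.PAGE_DONE, url);
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        bus.fire(CrawlEvent.FINISHED, done.get() + " pages");
        return done.get();
    }

    public static void main(String[] args) {
        EventBus bus = new EventBus();
        bus.on(CrawlEvent.PAGE_DONE, url -> System.out.println("done: " + url));
        bus.on(CrawlEvent.FINISHED, msg -> System.out.println("finished: " + msg));
        run(Arrays.asList("u1", "u2", "u3"), 2, bus);
    }
}
```

The order of the `done:` lines depends on thread scheduling, so it can differ between runs.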

Maven coordinates

<dependency>
 <groupId>io.github.biezhi</groupId>
 <artifactId>elves</artifactId>
 <version>0.0.2</version>
</dependency>

If you want to run the project source locally, make sure you are on Java 8 and have the Lombok plugin installed.

Architecture diagram

Call flow diagram

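Quick start

Building a crawler takes the following steps:

  1. Write a crawler class that extends Spider
  2. Set the list of URLs to crawl
  3. Implement the Spider#parse method
  4. Add a Pipeline to process the data that parse extracts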
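
The four steps above can be sketched as a small loop in plain Java. This is an illustration under stated assumptions, not the Elves implementation: `MiniCrawler`, `ParseResult`, `crawl`, and the toy `"titles|nextUrl"` body format are all hypothetical, and the fetch step is injected as a function so that the sketch runs without network access.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical stand-ins for illustration only; these are NOT the Elves classes.
class MiniCrawler {

    /** What a parse step produces: extracted items plus follow-up URLs. */
    static final class ParseResult {
        final List<String> items = new ArrayList<>();
        final List<String> nextUrls = new ArrayList<>();
    }

    /**
     * Runs the flow: start URLs -> fetch -> parse -> pipeline.
     * Returns the number of pages processed.
     */
    static int crawl(List<String> startUrls,
                     Function<String, String> fetch,
                     Function<String, ParseResult> parse,
                     Consumer<List<String>> pipeline) {
        Queue<String> queue = new ArrayDeque<>(startUrls);
        int pages = 0;
        while (!queue.isEmpty()) {
            String body = fetch.apply(queue.poll());
            ParseResult result = parse.apply(body);
            pipeline.accept(result.items);   // step 4: hand the items to the pipeline
            queue.addAll(result.nextUrls);   // follow-up requests, e.g. the next page
            pages++;
        }
        return pages;
    }

    /** Toy parser: the body format is "title1,title2|nextUrl" (the next part is optional). */
    static ParseResult parse(String body) {
        ParseResult result = new ParseResult();
        String[] parts = body.split("\\|");
        result.items.addAll(Arrays.asList(parts[0].split(",")));
        if (parts.length > 1) {
            result.nextUrls.add(parts[1]);
        }
        return result;
    }

    public static void main(String[] args) {
        // A fake two-page "site" keyed by URL.
        Map<String, String> site = new HashMap<>();
        site.put("page1", "A,B|page2");
        site.put("page2", "C");

        List<String> saved = new ArrayList<>();
        int pages = crawl(Arrays.asList("page1"), site::get, MiniCrawler::parse, saved::addAll);
        System.out.println(pages + " pages, items = " + saved); // prints "2 pages, items = [A, B, C]"
    }
}
```

The sketch does not track visited URLs, so a site whose pages link back to each other would loop forever; a real crawler would need deduplication.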

For example:

import java.util.List;
import java.util.stream.Collectors;

// Assumption: Elements/Element are the jsoup types and `log` comes from Lombok's @Slf4j.
import lombok.extern.slf4j.Slf4j;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
// Also import Spider, Config, Pipeline, Request, Response, Result and Elves from the elves artifact.

@Slf4j
public class DoubanSpider extends Spider {

    public DoubanSpider(String name) {
        super(name);
        // Step 2: the list of URLs to crawl
        this.startUrls(
                "https://movie.douban.com/tag/爱情",
                "https://movie.douban.com/tag/喜剧",
                "https://movie.douban.com/tag/动画",
                "https://movie.douban.com/tag/动作",
                "https://movie.douban.com/tag/史诗",
                "https://movie.douban.com/tag/犯罪");
    }

    @Override
    public void onStart(Config config) {
        // Step 4: a Pipeline receives the items that parse() produced
        this.addPipeline((Pipeline<List<String>>) (item, request) -> log.info("Saving to file: {}", item));
    }

    // Step 3: extract the movie titles and queue the next page
    @Override
    public Result parse(Response response) {
        Result<List<String>> result = new Result<>();
        Elements elements = response.body().css("#content table .pl2 a");
        List<String> titles = elements.stream().map(Element::text).collect(Collectors.toList());
        result.setItem(titles);

        // Get the URL of the next page
        Elements nextEl = response.body().css("#content > div > div.article > div.paginator > span.next > a");
        if (null != nextEl && nextEl.size() > 0) {
            String nextPageUrl = nextEl.get(0).attr("href");
            Request nextReq = this.makeRequest(nextPageUrl, this::parse);
            result.addRequest(nextReq);
        }
        return result;
    }

    public static void main(String[] args) {
        DoubanSpider doubanSpider = new DoubanSpider("Douban Movies");
        Elves.me(doubanSpider, Config.me()).start();
    }
}

Crawler examples

License

MIT
