GitHub - l2weekly/webmagic: A scalable web crawler framework.

A scalable crawler framework. It covers the whole lifecycle of a crawler: downloading, URL management, content extraction, and persistence. It simplifies the development of a specific crawler.
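To make the four lifecycle stages concrete, here is a minimal plain-Java sketch of a crawl loop. It does not use webmagic at all: the class `LifecycleSketch`, the in-memory `FAKE_WEB` map standing in for the network, and the `example.com` URLs are all assumptions made for illustration, and webmagic's own components are more capable than this.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: none of these names are webmagic classes.
final class LifecycleSketch {

    // Stand-in for the network: URL -> HTML body (hypothetical pages).
    static final Map<String, String> FAKE_WEB = new LinkedHashMap<>();
    static {
        FAKE_WEB.put("http://example.com/",
                "<title>Home</title><a href=\"http://example.com/a\">a</a>"
                        + "<a href=\"http://example.com/b\">b</a>");
        FAKE_WEB.put("http://example.com/a",
                "<title>Page A</title><a href=\"http://example.com/\">home</a>");
        FAKE_WEB.put("http://example.com/b", "<title>Page B</title>");
    }

    private static final Pattern LINK = Pattern.compile("href=\"([^\"]+)\"");
    private static final Pattern TITLE = Pattern.compile("<title>(.*?)</title>");

    // Runs the four stages until the queue is empty; returns URL -> title.
    static Map<String, String> crawl(String startUrl) {
        Queue<String> queue = new ArrayDeque<>();   // URLs waiting to be fetched
        Set<String> seen = new HashSet<>();         // URLs already queued once
        Map<String, String> store = new LinkedHashMap<>();
        queue.add(startUrl);
        seen.add(startUrl);
        while (!queue.isEmpty()) {
            String url = queue.poll();
            // 1. Downloading: fetch the page body.
            String html = FAKE_WEB.get(url);
            if (html == null) {
                continue;
            }
            // 2. URL management: queue each newly discovered link exactly once.
            Matcher links = LINK.matcher(html);
            while (links.find()) {
                String link = links.group(1);
                if (seen.add(link)) {
                    queue.add(link);
                }
            }
            // 3. Content extraction: pull the title out of the page.
            Matcher title = TITLE.matcher(html);
            if (title.find()) {
                // 4. Persistence: store the result (here, just in a map).
                store.put(url, title.group(1));
            }
        }
        return store;
    }

    public static void main(String[] args) {
        System.out.println(crawl("http://example.com/"));
    }
}
```

The `seen` set is what keeps the loop from revisiting the home page when Page A links back to it.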

Features:

Simple core with high flexibility.
Simple API for HTML extraction.
Annotated POJOs to customize a crawler, with no configuration files.
Multi-threading and distributed crawling support.
Easy to integrate.

Install:

Add dependencies to your pom.xml:

    <dependency>
        <groupId>us.codecraft</groupId>
        <artifactId>webmagic-core</artifactId>
        <version>0.4.3</version>
    </dependency>
    <dependency>
        <groupId>us.codecraft</groupId>
        <artifactId>webmagic-extension</artifactId>
        <version>0.4.3</version>
    </dependency>

Get Started:

First crawler:

Write a class that implements PageProcessor:

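    import java.util.List;

    import us.codecraft.webmagic.Page;
    import us.codecraft.webmagic.Site;
    import us.codecraft.webmagic.Spider;
    import us.codecraft.webmagic.pipeline.ConsolePipeline;
    import us.codecraft.webmagic.processor.PageProcessor;

    public class OschinaBlogPageProcesser implements PageProcessor {

        private Site site = Site.me().setDomain("my.oschina.net");

        @Override
        public void process(Page page) {
            // Queue every link that looks like a blog post URL.
            List<String> links = page.getHtml().links().regex("http://my\\.oschina\\.net/flashsword/blog/\\d+").all();
            page.addTargetRequests(links);
            // Extract the fields that are passed on to the pipelines.
            page.putField("title", page.getHtml().xpath("//div[@class='BlogEntity']/div[@class='BlogTitle']/h1").toString());
            page.putField("content", page.getHtml().$("div.content").toString());
            page.putField("tags", page.getHtml().xpath("//div[@class='BlogTags']/a/text()").all());
        }

        @Override
        public Site getSite() {
            return site;
        }

        public static void main(String[] args) {
            // Start from the blog index and print extracted fields to the console.
            Spider.create(new OschinaBlogPageProcesser())
                    .addUrl("http://my.oschina.net/flashsword/blog")
                    .addPipeline(new ConsolePipeline())
                    .run();
        }
    }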

page.addTargetRequests(links)

Adds the matched URLs to the set of pages to be crawled.
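To see which links the regular expression in the example would select, the sketch below applies the same pattern with plain `java.util.regex`. The class `LinkFilterSketch`, the method `keepPostLinks`, and the sample URLs (including the post id) are hypothetical, and webmagic's own regex selector may differ in detail from the `find()` matching assumed here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: this is not webmagic code.
final class LinkFilterSketch {

    // Same pattern as in the PageProcessor example: blog post URLs end in a numeric id.
    static final Pattern POST =
            Pattern.compile("http://my\\.oschina\\.net/flashsword/blog/\\d+");

    // Returns the matched part of each link that contains a blog post URL.
    static List<String> keepPostLinks(List<String> links) {
        List<String> kept = new ArrayList<>();
        for (String link : links) {
            Matcher m = POST.matcher(link);
            if (m.find()) {
                kept.add(m.group());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> links = Arrays.asList(
                "http://my.oschina.net/flashsword/blog/12345", // hypothetical post id
                "http://my.oschina.net/flashsword/blog",       // index page, no id
                "http://my.oschina.net/other/blog/1");         // different user
        System.out.println(keepPostLinks(links));
    }
}
```

Only the first sample URL is kept: the index page has no numeric id, and the third belongs to a different user path.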

You can also define a crawler with annotations:

    @TargetUrl("http://my.oschina.net/flashsword/blog/\\d+")
    public class OschinaBlog {

        @ExtractBy("//title")
        private String title;

        @ExtractBy(value = "div.BlogContent", type = ExtractBy.Type.Css)
        private String content;

        @ExtractBy(value = "//div[@class='BlogTags']/a/text()", multi = true)
        private List<String> tags;

        public static void main(String[] args) {
            OOSpider.create(Site.me(), new ConsolePageModelPipeline(), OschinaBlog.class)
                    .addUrl("http://my.oschina.net/flashsword/blog")
                    .run();
        }
    }
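For readers curious how annotations on a POJO can drive extraction, the following is a rough plain-Java sketch of the general idea: read each field's annotation by reflection and fill the field from the page. It is not webmagic's implementation. The annotation `@ExtractRegex`, the classes `BlogModel` and `AnnotationSketch`, and the use of regular expressions in place of XPath or CSS selectors are all assumptions chosen to keep the example self-contained.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical annotation for this sketch; NOT webmagic's @ExtractBy.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface ExtractRegex {
    String value(); // regex whose first capture group is the field value
}

// A POJO whose fields declare how they are extracted.
final class BlogModel {
    @ExtractRegex("<title>(.*?)</title>")
    String title;

    @ExtractRegex("<div class=\"BlogContent\">(.*?)</div>")
    String content;
}

final class AnnotationSketch {

    // Creates an instance of the model and fills every annotated field from the HTML.
    static <T> T extract(String html, Class<T> type) {
        try {
            T model = type.getDeclaredConstructor().newInstance();
            for (Field field : type.getDeclaredFields()) {
                ExtractRegex rule = field.getAnnotation(ExtractRegex.class);
                if (rule == null) {
                    continue; // field is not annotated, leave it untouched
                }
                Matcher m = Pattern.compile(rule.value(), Pattern.DOTALL).matcher(html);
                if (m.find()) {
                    field.setAccessible(true);
                    field.set(model, m.group(1));
                }
            }
            return model;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot populate " + type.getName(), e);
        }
    }

    public static void main(String[] args) {
        String html = "<title>Hello</title><div class=\"BlogContent\">Body</div>";
        BlogModel model = extract(html, BlogModel.class);
        System.out.println(model.title + " / " + model.content);
    }
}
```

The appeal of this style is that the model class carries its own extraction rules, so adding a field needs no separate configuration.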

Docs and samples:

The architecture of webmagic is modeled on that of Scrapy.

Javadocs: http://code4craft.github.io/webmagic/docs/en/

There are some samples in the webmagic-samples package.

License:

Licensed under the Apache 2.0 license.

Contributors:

Thanks to these people for committing source code, reporting bugs, or suggesting new features:

Thanks:

In writing webmagic, I referred to the projects below:

Scrapy

A crawler framework in Python.

http://scrapy.org/

Spiderman

Another crawler framework in Java.

https://gitcafe.com/laiweiwei/Spiderman

Mailing list:

https://groups.google.com/forum/#!forum/webmagic-java
