
How To Create a Web Crawler and Data Miner


A web crawler is an internet bot that browses the World Wide Web; it is often called a web spider. The best-known web crawler is Googlebot. A web crawler starts by visiting a list of URLs (the seeds). It then identifies all the hyperlinks in each page and adds them to the list of URLs to visit. In this article, I will show you how to create a web crawler. There are many ways to create a web crawler; one of them is using Apache Nutch.

Apache Nutch is a scalable and very robust tool for web crawling. Apache Nutch can be integrated with the Python programming language for web crawling. You can use it to crawl your data for better indexing. If you understand Apache Nutch well, you can create your own search engine like Google.

Apache Nutch can run on a single machine as well as in a distributed environment like Apache Hadoop. It is written in Java. Apache Nutch can also be integrated easily with Apache Solr (a search platform that can be used for searching any type of data and web pages), so we can pass all the pages crawled and indexed by Apache Nutch to Apache Solr.

Set Up Your Web Crawler

To start using Apache Nutch, we first need to install it. The first thing to do is to install the dependencies of Apache Nutch.

The dependencies are:

  1. Apache Nutch
  2. HBase
  3. Ant
  4. JDK
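
JDK and Ant can usually be installed straight from your distribution's package manager, while Apache Nutch and HBase are downloaded as tarballs in the steps below. A minimal sketch, assuming an Ubuntu/Debian system (package names may differ on other distributions):

# Install Java and Ant from the package manager (Ubuntu/Debian assumed)
sudo apt-get update
sudo apt-get install default-jdk ant
# Verify the installations
java -version
ant -version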

In this tutorial, we will use Apache Nutch version 2.2.1. These are the steps for installing and configuring Apache Nutch 2.2.1:

1. Download Apache Nutch

2. Extract it with this command: # tar -zxvf apache-nutch-2.2.1-src.tar.gz

3. Download Apache HBase (the Hadoop database)

4. Extract it with this command: # tar -zxvf hbase-x.x.x.tar.gz

5. Configure HBase. Open hbase-site.xml, which you can find in <Your HBase home>/conf, and modify it as shown below.


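The original screenshot is not available here, so the following is only a minimal sketch of a standalone-mode hbase-site.xml; the two paths are placeholders that you should point at directories of your choice:

<configuration>
  <!-- Directory where HBase stores its data (placeholder path) -->
  <property>
    <name>hbase.rootdir</name>
    <value>file:///home/user/hbase-data</value>
  </property>
  <!-- Directory used by the built-in ZooKeeper (placeholder path) -->
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/home/user/zookeeper-data</value>
  </property>
</configuration>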

6. Specify the Gora backend in nutch-site.xml (you can find it at $NUTCH_HOME/conf):

<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.hbase.store.HBaseStore</value>
  <description>Default class for storing data</description>
</property>

7. Ensure that the gora-hbase dependency is available in ivy.xml by adding the following configuration:

<dependency org="org.apache.gora" name="gora-hbase" rev="0.2" conf="*->default" />

8. Make sure HBaseStore is set as the default datastore by putting the following line into gora.properties:

gora.datastore.default=org.apache.gora.hbase.store.HBaseStore

9. Go to the Apache Nutch home directory and type the following command:

ant runtime

10. At this point, Ant will build Apache Nutch and create the respective runtime directories.

11. Make sure HBase is working properly by going to the HBase home directory and typing the following command:

./bin/hbase shell

If everything goes well, you will see output like this:

HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version: 0.90.4, r1001068, Fri Sep 24 13:55:42 PDT 2010
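
To double-check that HBase is responding, you can run a couple of its built-in shell commands, such as status and list (the table list will be empty on a fresh install):

status
list
exit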

Start Crawling Your First Website Using Apache Nutch

After finishing the installation steps for Apache Nutch, you can start crawling by following these steps:

1. Add your agent name in the value field in nutch-site.xml by adding the following configuration:

<configuration>
  <property>
    <name>http.agent.name</name>
    <value>My Private Spider Bot</value>
  </property>
</configuration>

2. Go to the local directory of Apache Nutch, which is located at <your Apache Nutch home>/runtime, and create a directory called urls inside it.

3. Create seed.txt inside the urls directory and put whatever URLs you want to crawl first, for example:

http://technotif.com
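
If you prefer the command line, steps 2 and 3 can be done in one go; a minimal sketch, assuming you are already in the Apache Nutch runtime directory:

# Create the urls directory and a seed file with one URL per line
mkdir -p urls
echo "http://technotif.com" > urls/seed.txt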

4. Now you can start to crawl by starting Apache Nutch and HBase with the following commands:

cd <Respective directory of Apache Nutch>/runtime
bin/crawl urls/seed.txt TestCrawl

If you get errors when starting Apache Nutch, check for common errors.



3 thoughts on “How To Create a Web Crawler and Data Miner”

  1. Trover says:

    At the moment of compilation, it shows an error: “[FAILED] org.hsqldb#hsqldb;2.2.8!hsqldb.jar:…”, “Impossible to resolve dependencies: …”. My OS is Ubuntu 14.04. Any idea? Thanks.

    1. James Howard says:

      Make sure the dependencies are set correctly.

      Try deleting the entire .ivy directory and re-running ant.

  2. Leonardo says:

    And the data miner? 🙂
