I've written a working web crawler in Java that finds the frequencies of words on web pages. I have two issues with it.
The organization of my code in WebCrawler.java is terrible. Is there a way I could refactor the class to make it more readable? Perhaps splitting it into multiple classes?
I want to find a way to group similar words, so that I don't have entries for every possible word (for example, "page" and "pages").
This is WebCrawler.java:
public class WebCrawler {
    private static final int MAX_URL_QUEUE = 3000;
    private final Object lock = new Object();

    private List<Thread> threads;
    private int currentURLNum = 0;
    private final int MAXURL;
    private Queue<String> urlQqueue;
    private URLGroup crawledURLs;
    private WordExtractor words;

    public WebCrawler(String start, int max, int threads) {
        this.MAXURL = max;
        urlQqueue = new LinkedList<String>();
        urlQqueue.offer(start);
        this.threads = new ArrayList<Thread>();
        words = new WordExtractor();
        for (int i = 0; i < threads; i++) {
            this.threads.add(new Thread(new CrawlTask(this, words)));
        }
        crawledURLs = new URLGroup();
    }

    public void run() {
        for (Thread t : threads) {
            t.start();
        }
        boolean allThreadsDead = false;
        while (!allThreadsDead) {
            allThreadsDead = true;
            for (int i = 0; i < threads.size(); i++) {
                if (threads.get(i).isAlive()) {
                    Thread.yield();
                    i = 0;
                    allThreadsDead = false;
                }
            }
        }
    }

    static class CrawlTask implements Runnable {
        private final WebCrawler bot;
        private final WordExtractor words;

        public CrawlTask(WebCrawler bot, WordExtractor words) {
            this.bot = bot;
            this.words = words;
        }

        @Override
        public void run() {
            URLGroup group = new URLGroup();
            while (bot.currentURLNum < bot.MAXURL) {
                String url = bot.urlQqueue.poll();
                if (url == null) {
                    Thread.yield();
                    continue;
                }
                Document doc = null;
                try {
                    doc = bot.getDocument(url.toString(), 3000);
                } catch (Exception e) {
                    e.printStackTrace();
                    continue;
                }
                words.extractFromDocument(doc);
                List<String> urls = new LinkExtractor().extractFromDocument(doc).getCollectedURLs();
                synchronized (bot.lock) {
                    bot.currentURLNum++;
                    group.add(url);
                    for (String u : urls) {
                        if (bot.urlQqueue.size() > MAX_URL_QUEUE) {
                            break;
                        }
                        bot.urlQqueue.offer(u);
                    }
                    System.out.println(bot.currentURLNum);
                }
            }
            for (String u : group.urls) {
                bot.crawledURLs.add(u);
            }
            System.out.println(Thread.currentThread() + " is done!");
        }
    }

    private Document getDocument(String url, int timeout) throws IOException {
        return Jsoup.connect(url).timeout(timeout).get();
    }

    // The main entry point
    public static void main(String[] args) throws IOException {
        String startingUrl = "http://en.wikipedia.org/wiki/Special:Random";
        WebCrawler bot = new WebCrawler(startingUrl, 50, 10);
        bot.run();
        bot.crawledURLs.printCrawled(System.out);
        bot.words.printTopWords(System.out, 30);
    }
}
And this is my current method of storing the frequencies of words:
public class WordExtractor implements DocumentExtractor {
    private HashMap<String, Integer> wordFrequency = new HashMap<String, Integer>();

    public void foundWord(String word) {
        if (wordFrequency.get(word) == null) {
            wordFrequency.put(word, 1);
        } else {
            wordFrequency.put(word, wordFrequency.get(word) + 1);
        }
    }

    public void printTopWords(OutputStream output, int num) throws IOException {
        List<Map.Entry<String, Integer>> topWords = new ArrayList<Map.Entry<String, Integer>>();
        for (int i = 0; i < num; i++) {
            topWords.add(null);
        }
        Iterator<Map.Entry<String, Integer>> it = wordFrequency.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Integer> pairs = it.next();
            for (int i = 0; i < topWords.size(); i++) {
                int topVal = 0;
                if (topWords.get(i) != null) {
                    topVal = topWords.get(i).getValue();
                }
                if (pairs.getValue() > topVal) {
                    topWords.set(i, pairs);
                    break;
                }
            }
        }
        output.write(topWords.toString().getBytes());
    }

    @Override
    public synchronized DocumentExtractor extractFromDocument(Document doc) {
        String[] words = doc.body().text().split("\\s+");
        for (String word : words) {
            foundWord(word.toLowerCase());
        }
        return this;
    }
}
Comment (rolfl, May 22, 2014): Apart from the obvious code review, you could consider using Lucene with a RAMDirectory for indexing the documents, and that will 'stem' the words for you (pages -> page), as well as count the frequency. You can also have stop-words (ignored words, like 'the', 'and', etc.). Natural Language Processing (NLP) is the general computing 'genre' for this type of problem.
1 Answer
If I'm right, you could eliminate most of the threading logic with an `ExecutorService` and an `awaitTermination` call. The linked javadoc has some usage examples too. (Don't forget to check the return value of `awaitTermination`.)

(Effective Java, 2nd edition, Item 47: Know and use the libraries)
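A minimal, self-contained sketch of that pattern (the dummy `Runnable`s here stand in for the crawler's `CrawlTask`s; pool size and timeout are illustrative assumptions):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ExecutorSketch {
    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(10);
        for (int i = 0; i < 10; i++) {
            final int id = i;
            // stand-in for a CrawlTask
            pool.submit(() -> System.out.println("task " + id + " done"));
        }
        pool.shutdown(); // stop accepting new tasks, let queued ones finish
        // awaitTermination returns false if the timeout elapsed before shutdown completed,
        // so the return value must be checked
        boolean finished = pool.awaitTermination(1, TimeUnit.MINUTES);
        System.out.println("finished = " + finished);
    }
}
```

This replaces both the manual `List<Thread>` bookkeeping and the `isAlive()` polling loop in `run()`.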
`int max` is a little bit mysterious. What does it maximize? I'd put that into the name.

`urlQqueue` is used from multiple threads without proper synchronization (`urlQqueue.poll()`), and it's not thread-safe. `Queue` has other, thread-safe implementations; I'd use one of them instead of `LinkedList`.

You could use a `BlockingQueue` to avoid loops with `Thread.yield()` (which is rarely appropriate, according to the javadoc):

    while (bot.currentURLNum < bot.MAXURL) {
        String url = bot.urlQqueue.poll();
        if (url == null) {
            Thread.yield();
            continue;
        }
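A sketch of that idea (the queue name and capacity mirror the question's fields, but this is an assumption about how you'd wire it in):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class BlockingQueueSketch {
    public static void main(String[] args) throws InterruptedException {
        // a bounded queue replaces the manual MAX_URL_QUEUE size check:
        // offer() simply returns false when the queue is full
        BlockingQueue<String> urlQueue = new LinkedBlockingQueue<>(3000);
        urlQueue.offer("http://en.wikipedia.org/wiki/Special:Random");

        // poll with a timeout instead of the busy Thread.yield() loop;
        // null means the queue stayed empty for the whole wait
        String url = urlQueue.poll(500, TimeUnit.MILLISECONDS);
        System.out.println("polled: " + url);

        String none = urlQueue.poll(10, TimeUnit.MILLISECONDS);
        System.out.println("empty poll: " + none);
    }
}
```

The worker thread blocks efficiently inside `poll(timeout, unit)` instead of spinning.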
`currentURLNum` is also not properly synchronized; it is read by the `while` statement without synchronization.

> [...] synchronization has no effect unless both read and write operations are synchronized.

From Effective Java, 2nd Edition, Item 66: Synchronize access to shared mutable data.
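One simple fix (my suggestion, not the only option) is an `AtomicInteger`, which makes both the increment and the unsynchronized read in the `while` condition safe:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class AtomicCounterSketch {
    public static void main(String[] args) throws InterruptedException {
        AtomicInteger currentURLNum = new AtomicInteger(0);

        Runnable crawl = () -> {
            for (int i = 0; i < 1000; i++) {
                currentURLNum.incrementAndGet(); // atomic read-modify-write
            }
        };
        Thread a = new Thread(crawl);
        Thread b = new Thread(crawl);
        a.start(); b.start();
        a.join(); b.join();

        // the value is safely visible to the reading thread,
        // unlike a plain int field updated inside a synchronized block
        System.out.println(currentURLNum.get());
    }
}
```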
The reference type could be simply `Map<String, Integer>` here:

    private HashMap<String, Integer> wordFrequency = new HashMap<String, Integer>();

See also: Effective Java, 2nd edition, Item 52: Refer to objects by their interfaces.
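One payoff of declaring the field as `Map` is that the implementation can be swapped freely; for instance (my sketch, anticipating the thread-safety points below), a `ConcurrentHashMap` with `merge` gives atomic counting without touching the callers:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ConcurrentCountSketch {
    // declared by interface; the concrete type is an implementation detail
    private final Map<String, Integer> wordFrequency = new ConcurrentHashMap<>();

    public void foundWord(String word) {
        // merge is atomic on ConcurrentHashMap: inserts 1, or adds 1 to the old count
        wordFrequency.merge(word.toLowerCase(), 1, Integer::sum);
    }

    public static void main(String[] args) {
        ConcurrentCountSketch counts = new ConcurrentCountSketch();
        for (String w : new String[] {"page", "Page", "pages"}) {
            counts.foundWord(w);
        }
        System.out.println(counts.wordFrequency.get("page") + " "
                + counts.wordFrequency.get("pages"));
    }
}
```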
Actually, you could use a `Multiset` for counting the words. (Here is a similar example.)

You need some more synchronization for `WordExtractor`, since the underlying `HashMap` is not thread-safe: `printTopWords` might not see your latest data. `foundWord` is also `public`, so anyone who calls it could get weird results too.

Guava has a helper method (`copyHighestCountFirst`) for multisets which might be very similar to the algorithm in the `printTopWords` method.

(Effective Java, 2nd edition, Item 47: Know and use the libraries. The author mentions only the JDK's built-in libraries, but I think the reasoning holds for other libraries too.)
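Guava isn't part of the JDK, so as a rough plain-Java illustration of the same idea (entries sorted by count, highest first, truncated to `num`), something like this could replace the hand-rolled loop in `printTopWords` — a sketch with made-up sample data, not a drop-in for your class:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class TopWordsSketch {
    // return the `num` most frequent entries, highest count first
    static List<Map.Entry<String, Integer>> topWords(Map<String, Integer> freq, int num) {
        return freq.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue(Comparator.reverseOrder()))
                .limit(num)
                .toList();
    }

    public static void main(String[] args) {
        // hypothetical frequency data standing in for wordFrequency
        Map<String, Integer> freq = Map.of("page", 12, "the", 40, "crawler", 3);
        System.out.println(topWords(freq, 2));
    }
}
```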