
I've written a working web crawler in Java that finds the frequencies of words on web pages. I have two issues with it.

  • The organization of my code in WebCrawler.java is terrible. Is there a way I could refactor the class to make it more readable? Perhaps splitting it into multiple classes?

  • I want to find a way to group similar words, so that I don't have entries for every possible word (for example, page and pages).

This is WebCrawler.java:

public class WebCrawler {
    private static final int MAX_URL_QUEUE = 3000;
    private final Object lock = new Object();
    private List<Thread> threads;
    private int currentURLNum = 0;
    private final int MAXURL;
    private Queue<String> urlQqueue;
    private URLGroup crawledURLs;
    private WordExtractor words;

    public WebCrawler(String start, int max, int threads) {
        this.MAXURL = max;
        urlQqueue = new LinkedList<String>();
        urlQqueue.offer(start);
        this.threads = new ArrayList<Thread>();
        words = new WordExtractor();
        for (int i = 0; i < threads; i++) {
            this.threads.add(new Thread(new CrawlTask(this, words)));
        }
        crawledURLs = new URLGroup();
    }

    public void run() {
        for (Thread t : threads) {
            t.start();
        }
        boolean allThreadsDead = false;
        while (!allThreadsDead) {
            allThreadsDead = true;
            for (int i = 0; i < threads.size(); i++) {
                if (threads.get(i).isAlive()) {
                    Thread.yield();
                    i = 0;
                    allThreadsDead = false;
                }
            }
        }
    }

    static class CrawlTask implements Runnable {
        private final WebCrawler bot;
        private final WordExtractor words;

        public CrawlTask(WebCrawler bot, WordExtractor words) {
            this.bot = bot;
            this.words = words;
        }

        @Override
        public void run() {
            URLGroup group = new URLGroup();
            while (bot.currentURLNum < bot.MAXURL) {
                String url = bot.urlQqueue.poll();
                if (url == null) {
                    Thread.yield();
                    continue;
                }
                Document doc = null;
                try {
                    doc = bot.getDocument(url, 3000);
                } catch (Exception e) {
                    e.printStackTrace();
                    continue;
                }
                words.extractFromDocument(doc);
                List<String> urls = new LinkExtractor().extractFromDocument(doc).getCollectedURLs();
                synchronized (bot.lock) {
                    bot.currentURLNum++;
                    group.add(url);
                    for (String u : urls) {
                        if (bot.urlQqueue.size() > MAX_URL_QUEUE) {
                            break;
                        }
                        bot.urlQqueue.offer(u);
                    }
                    System.out.println(bot.currentURLNum);
                }
            }
            for (String u : group.urls) {
                bot.crawledURLs.add(u);
            }
            System.out.println(Thread.currentThread() + " is done!");
        }
    }

    private Document getDocument(String url, int timeout) throws IOException {
        return Jsoup.connect(url).timeout(timeout).get();
    }

    // The main entry point
    public static void main(String[] args) throws IOException {
        String startingUrl = "http://en.wikipedia.org/wiki/Special:Random";
        WebCrawler bot = new WebCrawler(startingUrl, 50, 10);
        bot.run();
        bot.crawledURLs.printCrawled(System.out);
        bot.words.printTopWords(System.out, 30);
    }
}

And this is my current method of storing the frequencies of words:

public class WordExtractor implements DocumentExtractor {
    private HashMap<String, Integer> wordFrequency = new HashMap<String, Integer>();

    public void foundWord(String word) {
        if (wordFrequency.get(word) == null) {
            wordFrequency.put(word, 1);
        } else {
            wordFrequency.put(word, wordFrequency.get(word) + 1);
        }
    }

    public void printTopWords(OutputStream output, int num) throws IOException {
        List<Map.Entry<String, Integer>> topWords = new ArrayList<Map.Entry<String, Integer>>();
        for (int i = 0; i < num; i++)
            topWords.add(null);
        Iterator<Map.Entry<String, Integer>> it = wordFrequency.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Integer> pairs = it.next();
            for (int i = 0; i < topWords.size(); i++) {
                int topVal = 0;
                if (topWords.get(i) != null)
                    topVal = topWords.get(i).getValue();
                if (pairs.getValue() > topVal) {
                    topWords.set(i, pairs);
                    break;
                }
            }
        }
        output.write(topWords.toString().getBytes());
    }

    @Override
    public synchronized DocumentExtractor extractFromDocument(Document doc) {
        String[] words = doc.body().text().split("\\s+");
        for (String word : words) {
            foundWord(word.toLowerCase());
        }
        return this;
    }
}
asked May 22, 2014 at 0:17
  • Apart from the obvious code review, you could consider using Lucene with a RAMDirectory for indexing the documents; that will 'stem' the words for you (pages -> page) as well as count the frequency. You can also have stop-words (ignored words, like 'the', 'and', etc.). Natural Language Processing (NLP) is the general computing 'genre' for this type of problem. Commented May 22, 2014 at 0:24

1 Answer
  1. If I'm right, you could eliminate most of the threading logic with an ExecutorService and an awaitTermination call. The linked javadoc has some usage examples too. (Don't forget to check the return value of awaitTermination.)

    (Effective Java, 2nd edition, Item 47: Know and use the libraries)
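    A minimal sketch of that idea (the pool size, task count, and timeout are placeholder values; the no-op tasks stand in for the real CrawlTask instances):

    ```java
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicInteger;

    public class ExecutorSketch {
        static int crawl(int tasks) throws InterruptedException {
            AtomicInteger processed = new AtomicInteger();
            ExecutorService pool = Executors.newFixedThreadPool(10);
            for (int i = 0; i < tasks; i++) {
                // each submitted task would be one CrawlTask in the real crawler
                pool.submit(processed::incrementAndGet);
            }
            pool.shutdown(); // stop accepting new tasks; already-submitted ones still run
            // awaitTermination's return value tells you whether the pool drained in time
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                System.err.println("crawler did not finish in time");
            }
            return processed.get();
        }

        public static void main(String[] args) throws InterruptedException {
            System.out.println(crawl(10));
        }
    }
    ```

    This replaces both the manual Thread list and the busy-wait loop in run().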

  2. int max is a little bit mysterious. What does it maximize? I'd put that into the name.

  3. urlQqueue is used from multiple threads without proper synchronization (urlQqueue.poll()), it's not thread-safe. Queue has other, thread-safe implementations, I'd use one of them instead of LinkedList.

    You could use a BlockingQueue to avoid loops with Thread.yield() (which is rarely appropriate, according to the javadoc):

    while (bot.currentURLNum < bot.MAXURL) {
        String url = bot.urlQqueue.poll();
        if (url == null) {
            Thread.yield();
            continue;
        }
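
    A sketch of the timed-poll alternative (the queue capacity and timeout are placeholder values):

    ```java
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.TimeUnit;

    public class QueueSketch {
        public static void main(String[] args) throws InterruptedException {
            // a bounded queue also replaces the manual MAX_URL_QUEUE check
            BlockingQueue<String> urlQueue = new LinkedBlockingQueue<>(3000);
            urlQueue.offer("http://en.wikipedia.org/wiki/Special:Random");

            // poll(timeout) blocks until an element arrives instead of spinning
            String url = urlQueue.poll(100, TimeUnit.MILLISECONDS);
            System.out.println(url);

            // returns null only after the timeout elapses on an empty queue
            System.out.println(urlQueue.poll(100, TimeUnit.MILLISECONDS));
        }
    }
    ```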
    
  4. currentURLNum is also not properly synchronized, it is read by the while statement without synchronization.

    [...] synchronization has no effect unless both read and write operations are synchronized.

    From Effective Java, 2nd Edition, Item 66: Synchronize access to shared mutable data.
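    One way to fix both the read and the write is an AtomicInteger; a sketch (the thread and iteration counts are arbitrary):

    ```java
    import java.util.concurrent.atomic.AtomicInteger;

    public class CounterSketch {
        static int count(int threads, int perThread) throws InterruptedException {
            AtomicInteger currentURLNum = new AtomicInteger();
            Thread[] workers = new Thread[threads];
            for (int i = 0; i < threads; i++) {
                workers[i] = new Thread(() -> {
                    for (int j = 0; j < perThread; j++) {
                        // atomic read-modify-write, visible to every thread
                        currentURLNum.incrementAndGet();
                    }
                });
                workers[i].start();
            }
            for (Thread t : workers) {
                t.join();
            }
            return currentURLNum.get();
        }

        public static void main(String[] args) throws InterruptedException {
            System.out.println(count(4, 1000)); // 4000: no lost updates
        }
    }
    ```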

  5. The reference type could be simply Map<String, Integer> here:

    private HashMap<String, Integer> wordFrequency = new HashMap<String, Integer>();
    

    See also: Effective Java, 2nd edition, Item 52: Refer to objects by their interfaces

  6. Actually, you could use a Multiset for counting the words. (Here is a similar example)

  7. You need some more synchronization for WordExtractor, since the underlying HashMap is not thread-safe. printTopWords might not see your latest data. foundWord is also public, so anyone who calls it from another thread could get weird results too.
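
    One way to make the counting safe without explicit locks is a ConcurrentHashMap with merge; this is a sketch of the idea, not the original class:

    ```java
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class WordCountSketch {
        // ConcurrentHashMap makes the individual map operations thread-safe
        private final Map<String, Integer> wordFrequency = new ConcurrentHashMap<>();

        public void foundWord(String word) {
            // atomic on ConcurrentHashMap: inserts 1, or adds 1 to the existing count
            wordFrequency.merge(word, 1, Integer::sum);
        }

        public int count(String word) {
            return wordFrequency.getOrDefault(word, 0);
        }

        public static void main(String[] args) {
            WordCountSketch extractor = new WordCountSketch();
            extractor.foundWord("page");
            extractor.foundWord("page");
            extractor.foundWord("pages");
            System.out.println(extractor.count("page")); // 2
        }
    }
    ```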

  8. Guava has a helper method (copyHighestCountFirst) for multisets which might be very similar to the algorithm in the printTopWords method.

    (Effective Java, 2nd edition, Item 47: Know and use the libraries. The author mentions only the JDK's built-in libraries, but I think the reasoning holds for other libraries too.)
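    A sketch of what that could look like, assuming Guava is on the classpath:

    ```java
    import com.google.common.collect.HashMultiset;
    import com.google.common.collect.Multiset;
    import com.google.common.collect.Multisets;

    public class TopWordsSketch {
        public static void main(String[] args) {
            Multiset<String> words = HashMultiset.create();
            words.add("page", 3);    // add with an explicit count
            words.add("crawler", 1);
            words.add("java", 2);
            // the copy's entrySet iterates in descending-frequency order,
            // which is what printTopWords computes by hand
            for (Multiset.Entry<String> e : Multisets.copyHighestCountFirst(words).entrySet()) {
                System.out.println(e.getElement() + "=" + e.getCount());
            }
        }
    }
    ```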

answered Jun 27, 2014 at 18:12
