I've written a working web crawler in Java that finds the frequencies of words on web pages. I have two issues with it.
The organization of my code in WebCrawler.java is terrible. Is there a way I could refactor the class to make it more readable? Perhaps splitting it into multiple classes?
I want to find a way to group similar words, so that I don't have entries for every possible word (for example, "page" and "pages").
This is WebCrawler.java:
public class WebCrawler {
    private static final int MAX_URL_QUEUE = 3000;
    private final Object lock = new Object();

    private List<Thread> threads;
    private int currentURLNum = 0;
    private final int MAXURL;
    private Queue<String> urlQqueue;
    private URLGroup crawledURLs;
    private WordExtractor words;

    public WebCrawler(String start, int max, int threads) {
        this.MAXURL = max;
        urlQqueue = new LinkedList<String>();
        urlQqueue.offer(start);
        this.threads = new ArrayList<Thread>();
        words = new WordExtractor();
        for (int i = 0; i < threads; i++) {
            this.threads.add(new Thread(new CrawlTask(this, words)));
        }
        crawledURLs = new URLGroup();
    }

    public void run() {
        for (Thread t : threads) {
            t.start();
        }
        boolean allThreadsDead = false;
        while (!allThreadsDead) {
            allThreadsDead = true;
            for (int i = 0; i < threads.size(); i++) {
                if (threads.get(i).isAlive()) {
                    Thread.yield();
                    i = 0;
                    allThreadsDead = false;
                }
            }
        }
    }

    static class CrawlTask implements Runnable {
        private final WebCrawler bot;
        private final WordExtractor words;

        public CrawlTask(WebCrawler bot, WordExtractor words) {
            this.bot = bot;
            this.words = words;
        }

        @Override
        public void run() {
            URLGroup group = new URLGroup();
            while (bot.currentURLNum < bot.MAXURL) {
                String url = bot.urlQqueue.poll();
                if (url == null) {
                    Thread.yield();
                    continue;
                }
                Document doc = null;
                try {
                    doc = bot.getDocument(url.toString(), 3000);
                } catch (Exception e) {
                    e.printStackTrace();
                    continue;
                }
                words.extractFromDocument(doc);
                List<String> urls = new LinkExtractor().extractFromDocument(doc).getCollectedURLs();
                synchronized (bot.lock) {
                    bot.currentURLNum++;
                    group.add(url);
                    for (String u : urls) {
                        if (bot.urlQqueue.size() > MAX_URL_QUEUE) {
                            break;
                        }
                        bot.urlQqueue.offer(u);
                    }
                    System.out.println(bot.currentURLNum);
                }
            }
            for (String u : group.urls) {
                bot.crawledURLs.add(u);
            }
            System.out.println(Thread.currentThread() + " is done!");
        }
    }

    private Document getDocument(String url, int timeout) throws IOException {
        return Jsoup.connect(url).timeout(timeout).get();
    }

    // The main entry point
    public static void main(String[] args) throws IOException {
        String startingUrl = "http://en.wikipedia.org/wiki/Special:Random";
        WebCrawler bot = new WebCrawler(startingUrl, 50, 10);
        bot.run();
        bot.crawledURLs.printCrawled(System.out);
        bot.words.printTopWords(System.out, 30);
    }
}
And this is my current method of storing the frequencies of words:
public class WordExtractor implements DocumentExtractor {
    private HashMap<String, Integer> wordFrequency = new HashMap<String, Integer>();

    public void foundWord(String word) {
        if (wordFrequency.get(word) == null) {
            wordFrequency.put(word, 1);
        } else {
            wordFrequency.put(word, wordFrequency.get(word) + 1);
        }
    }

    public void printTopWords(OutputStream output, int num) throws IOException {
        List<Map.Entry<String, Integer>> topWords = new ArrayList<Map.Entry<String, Integer>>();
        for (int i = 0; i < num; i++) {
            topWords.add(null);
        }
        Iterator<Map.Entry<String, Integer>> it = wordFrequency.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Integer> pairs = it.next();
            for (int i = 0; i < topWords.size(); i++) {
                int topVal = 0;
                if (topWords.get(i) != null) {
                    topVal = topWords.get(i).getValue();
                }
                if (pairs.getValue() > topVal) {
                    topWords.set(i, pairs);
                    break;
                }
            }
        }
        output.write(topWords.toString().getBytes());
    }

    @Override
    public synchronized DocumentExtractor extractFromDocument(Document doc) {
        String[] words = doc.body().text().split("\\s+");
        for (String word : words) {
            foundWord(word.toLowerCase());
        }
        return this;
    }
}
Comment (rolfl, May 22, 2014): Apart from the obvious code review, you could consider using Lucene with a RAMDirectory for indexing the documents, and that will 'stem' the words for you (pages -> page), as well as count the frequency. You can also have stop-words (ignored words, like 'the', 'and', etc.). Natural Language Processing (NLP) is the general computing 'genre' for this type of problem.
1 Answer
If I'm right, you could eliminate most of the threading logic with an `ExecutorService` and an `awaitTermination` call. The linked javadoc has some usage examples too. (Don't forget to check the return value of `awaitTermination`.)

(Effective Java, 2nd edition, Item 47: Know and use the libraries)
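A minimal, self-contained sketch of that pattern (the dummy `Runnable`s here stand in for the crawler's `CrawlTask`s; pool size and timeout are illustrative assumptions):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ExecutorSketch {
    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(10);
        for (int i = 0; i < 10; i++) {
            final int id = i;
            // stand-in for a CrawlTask
            pool.submit(() -> System.out.println("task " + id + " done"));
        }
        pool.shutdown(); // stop accepting new tasks, let queued ones finish
        // awaitTermination returns false if the timeout elapsed before shutdown completed,
        // so the return value must be checked
        boolean finished = pool.awaitTermination(1, TimeUnit.MINUTES);
        System.out.println("finished = " + finished);
    }
}
```

This replaces both the manual `List<Thread>` bookkeeping and the `isAlive()` polling loop in `run()`.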
`int max` is a little bit mysterious. What does it maximize? I'd put that into the name.

`urlQqueue` is used from multiple threads without proper synchronization (`urlQqueue.poll()`), and it's not thread-safe. `Queue` has other, thread-safe implementations; I'd use one of them instead of `LinkedList`.

You could use a `BlockingQueue` to avoid loops with `Thread.yield()` (which is rarely appropriate, according to the javadoc):

    while (bot.currentURLNum < bot.MAXURL) {
        String url = bot.urlQqueue.poll();
        if (url == null) {
            Thread.yield();
            continue;
        }
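A sketch of that idea (the queue name and capacity mirror the question's fields, but this is an assumption about how you'd wire it in):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class BlockingQueueSketch {
    public static void main(String[] args) throws InterruptedException {
        // a bounded queue replaces the manual MAX_URL_QUEUE size check:
        // offer() simply returns false when the queue is full
        BlockingQueue<String> urlQueue = new LinkedBlockingQueue<>(3000);
        urlQueue.offer("http://en.wikipedia.org/wiki/Special:Random");

        // poll with a timeout instead of the busy Thread.yield() loop;
        // null means the queue stayed empty for the whole wait
        String url = urlQueue.poll(500, TimeUnit.MILLISECONDS);
        System.out.println("polled: " + url);

        String none = urlQueue.poll(10, TimeUnit.MILLISECONDS);
        System.out.println("empty poll: " + none);
    }
}
```

The worker thread blocks efficiently inside `poll(timeout, unit)` instead of spinning.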
`currentURLNum` is also not properly synchronized; it is read by the `while` statement without synchronization.

> [...] synchronization has no effect unless both read and write operations are synchronized.

From Effective Java, 2nd Edition, Item 66: Synchronize access to shared mutable data.
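One simple fix (my suggestion, not the only option) is an `AtomicInteger`, which makes both the increment and the unsynchronized read in the `while` condition safe:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class AtomicCounterSketch {
    public static void main(String[] args) throws InterruptedException {
        AtomicInteger currentURLNum = new AtomicInteger(0);

        Runnable crawl = () -> {
            for (int i = 0; i < 1000; i++) {
                currentURLNum.incrementAndGet(); // atomic read-modify-write
            }
        };
        Thread a = new Thread(crawl);
        Thread b = new Thread(crawl);
        a.start(); b.start();
        a.join(); b.join();

        // the value is safely visible to the reading thread,
        // unlike a plain int field updated inside a synchronized block
        System.out.println(currentURLNum.get());
    }
}
```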
The reference type could be simply `Map<String, Integer>` here:

    private HashMap<String, Integer> wordFrequency = new HashMap<String, Integer>();

See also: Effective Java, 2nd edition, Item 52: Refer to objects by their interfaces.
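One payoff of declaring the field as `Map` is that the implementation can be swapped freely; for instance (my sketch, anticipating the thread-safety points below), a `ConcurrentHashMap` with `merge` gives atomic counting without touching the callers:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ConcurrentCountSketch {
    // declared by interface; the concrete type is an implementation detail
    private final Map<String, Integer> wordFrequency = new ConcurrentHashMap<>();

    public void foundWord(String word) {
        // merge is atomic on ConcurrentHashMap: inserts 1, or adds 1 to the old count
        wordFrequency.merge(word.toLowerCase(), 1, Integer::sum);
    }

    public static void main(String[] args) {
        ConcurrentCountSketch counts = new ConcurrentCountSketch();
        for (String w : new String[] {"page", "Page", "pages"}) {
            counts.foundWord(w);
        }
        System.out.println(counts.wordFrequency.get("page") + " "
                + counts.wordFrequency.get("pages"));
    }
}
```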
Actually, you could use a `Multiset` for counting the words. (Here is a similar example.)

You need some more synchronization for `WordExtractor`, since the underlying `HashMap` is not thread-safe: `printTopWords` might not see your latest data. `foundWord` is also `public`, so anyone who calls it could get weird results too.

Guava has a helper method (`copyHighestCountFirst`) for multisets which might be very similar to the algorithm in the `printTopWords` method.

(Effective Java, 2nd edition, Item 47: Know and use the libraries. The author mentions only the JDK's built-in libraries, but I think the reasoning holds for other libraries too.)
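Guava isn't part of the JDK, so as a rough plain-Java illustration of the same idea (entries sorted by count, highest first, truncated to `num`), something like this could replace the hand-rolled loop in `printTopWords` — a sketch with made-up sample data, not a drop-in for your class:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class TopWordsSketch {
    // return the `num` most frequent entries, highest count first
    static List<Map.Entry<String, Integer>> topWords(Map<String, Integer> freq, int num) {
        return freq.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue(Comparator.reverseOrder()))
                .limit(num)
                .toList();
    }

    public static void main(String[] args) {
        // hypothetical frequency data standing in for wordFrequency
        Map<String, Integer> freq = Map.of("page", 12, "the", 40, "crawler", 3);
        System.out.println(topWords(freq, 2));
    }
}
```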