HTML downloader and parser for CR

Question 1

This program downloads a Code Review HTML file and parses it.

Could you review my program?

Main.java

import java.net.URL;
public class Main {
 public static void main(String[] args) throws Exception {
 final String site1 = "http://codereview.stackexchange.com/questions/";
 String site2;
 String site3;
 URL url;
 HtmlGetter htmlGetter = new HtmlGetter();
 while(true) {
 site2 = "69";
 site3 = "is-this-implementation-of-shamos-hoey-algorithm-ok";
 url = new URL(site1 + site2 + "/" + site3);
 htmlGetter.setFileName("[" + site2 + "]" + site3 + ".html");
 htmlGetter.download(url);
 htmlGetter.parse();
 break;
 }
 }
}

HtmlGetter.java

import java.io.File;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.net.URL;
import java.util.Scanner;
import org.apache.commons.lang3.StringEscapeUtils; 
import org.apache.commons.io.IOUtils;
public class HtmlGetter {
 private String fileName;
 public void setFileName(String fileName) {
 this.fileName = fileName;
 }
 public String getFileName() {
 return fileName;
 }
 public void download(URL url) throws Exception {
 final InputStream in = url.openStream();
 final OutputStream out = new FileOutputStream(fileName);
 IOUtils.copy(in, out);
 in.close();
 out.close();
 }
 public void parse() throws Exception {
 Scanner scanner = new Scanner(new File(fileName));
 String oneLine;
 String htmlTagRegex = "<(/)?([a-zA-Z]*)(\\s[a-zA-Z]*=[^>]*)?(\\s)*(/)?>";
 while(scanner.hasNextLine()) {
 oneLine = scanner.nextLine();
 if(oneLine.matches(".*<code>.*")) {
 while(scanner.hasNextLine()&&(!oneLine.matches(".*</code>.*"))) {
 oneLine = StringEscapeUtils.unescapeHtml4(oneLine.replaceAll(htmlTagRegex, ""));
 System.out.println(oneLine);
 oneLine = scanner.nextLine();
 // This cannot handle such kind of a line which has <code> and </code> together.
 }
 System.out.println("-------------------------------------------");
 }
 } 
 }
}

Answer 1

public static void main(String[] args) throws Exception {
 final String site1 = "http://codereview.stackexchange.com/questions/";
 String site2;
 String site3;

Why do you define site1 as a final constant but delay the other assignments until later? They should really be static final fields of the class. Some clearer names would not go amiss, either.

 URL url;

Declare variables where they are used rather than ahead of time.

 HtmlGetter htmlGetter = new HtmlGetter();
 while(true) {
 site2 = "69";
 site3 = "is-this-implementation-of-shamos-hoey-algorithm-ok";
 url = new URL(site1 + site2 + "/" + site3);
 htmlGetter.setFileName("[" + site2 + "]" + site3 + ".html");
 htmlGetter.download(url);
 htmlGetter.parse();
 break;

Why do you have a loop if you are just going to break out of it?

 }
}
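Taken together, the remarks above on constants, declaration placement, and the loop could look like the sketch below. The names QuestionAddress, BASE_URL, QUESTION_ID, QUESTION_SLUG, questionUri, and fileName are assumptions chosen for illustration, not names from the original program. The download and parse steps are left out so the example stands alone, and a URI is built instead of a URL to keep it free of checked exceptions (calling toURL() on it would give the URL).

```java
import java.net.URI;

class QuestionAddress {
    // Fixed parts of the address live in class-level constants (hypothetical names).
    static final String BASE_URL = "http://codereview.stackexchange.com/questions/";
    static final String QUESTION_ID = "69";
    static final String QUESTION_SLUG = "is-this-implementation-of-shamos-hoey-algorithm-ok";

    // Builds the question address from the constants.
    static URI questionUri() {
        return URI.create(BASE_URL + QUESTION_ID + "/" + QUESTION_SLUG);
    }

    // Builds the local file name in the same "[id]slug.html" shape as the original.
    static String fileName() {
        return "[" + QUESTION_ID + "]" + QUESTION_SLUG + ".html";
    }

    public static void main(String[] args) {
        // No loop: each value is declared at the point where it is first needed.
        URI uri = questionUri();
        String target = fileName();
        System.out.println(uri + " -> " + target);
    }
}
```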

public void parse() throws Exception {

You shouldn't declare throws Exception; the throws clause should list only the exceptions that can actually be thrown.
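As a sketch of what a narrower throws clause could look like: the only checked exception that new Scanner(new File(...)) can raise is FileNotFoundException, so a method that does nothing else checked can declare just that. The class and method names below are hypothetical, and the body merely counts lines as a stand-in for the real parsing.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

class NarrowThrows {
    // Declares only the checked exception the body can actually raise.
    static int countLines(File file) throws FileNotFoundException {
        int lines = 0;
        // try-with-resources closes the Scanner when the block exits.
        try (Scanner scanner = new Scanner(file)) {
            while (scanner.hasNextLine()) {
                scanner.nextLine();
                lines++;
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: NarrowThrows <file>");
            return;
        }
        try {
            System.out.println(countLines(new File(args[0])));
        } catch (FileNotFoundException e) {
            // The caller can now react to the one failure that is possible.
            System.err.println("Cannot open " + args[0] + ": " + e.getMessage());
        }
    }
}
```

With a specific exception in the signature, the caller can handle a missing file on its own terms instead of catching everything.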

 Scanner scanner = new Scanner(new File(fileName));
 String oneLine;
 String htmlTagRegex = "<(/)?([a-zA-Z]*)(\\s[a-zA-Z]*=[^>]*)?(\\s)*(/)?>";

Parsing HTML with regex is asking for trouble. You'll find the task much easier if you grab an HTML parsing library and use that.

 while(scanner.hasNextLine()) {
 oneLine = scanner.nextLine();
 if(oneLine.matches(".*<code>.*")) {

You are just searching for a substring here, so there isn't much point in using regex to do that; oneLine.contains("<code>") does the same job.

 while(scanner.hasNextLine()&&(!oneLine.matches(".*</code>.*"))) {

If you insist on regex, use a single regex to capture the entire code tag.

 oneLine = StringEscapeUtils.unescapeHtml4(oneLine.replaceAll(htmlTagRegex, ""));
 System.out.println(oneLine);
 oneLine = scanner.nextLine();
 // This cannot handle such kind of a line which has <code> and </code> together.
 }
 System.out.println("-------------------------------------------");
 }
 } 
}
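If regex is kept despite the warning above, the single-pattern idea might look like the following sketch. It works on the whole document as one string and uses a reluctant match with Pattern.DOTALL, so a code element is found whether it spans many lines or opens and closes on the same line (the case the original comment says it cannot handle). The class and method names are hypothetical. The sketch assumes a bare <code> tag with no attributes and no nesting, and it skips the entity unescaping that the original does with StringEscapeUtils.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CodeBlocks {
    // DOTALL lets '.' cross line breaks; '.*?' stops at the first closing tag.
    private static final Pattern CODE =
            Pattern.compile("<code>(.*?)</code>", Pattern.DOTALL);

    // Returns the raw text between each <code> and </code> pair, in document order.
    static List<String> extract(String html) {
        List<String> blocks = new ArrayList<>();
        Matcher matcher = CODE.matcher(html);
        while (matcher.find()) {
            blocks.add(matcher.group(1));
        }
        return blocks;
    }

    public static void main(String[] args) {
        String html = "<p>Use <code>a</code> or</p>\n<pre><code>int x;\nint y;</code></pre>";
        for (String block : extract(html)) {
            System.out.println(block);
            System.out.println("-------------------------------------------");
        }
    }
}
```

This still inherits the usual weaknesses of matching HTML with regex, which is why a parsing library remains the safer choice.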

Using something like jsoup: http://jsoup.org/

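import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Fetch the question page and print the text of every <code> element.
Document question = Jsoup.connect("http://codereview.stackexchange.com/questions/69").get();
Elements codes = question.select("code");
for (Element code : codes) {
 System.out.println(code.text());
}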

Winston Ewert · Accepted Answer · 2012-05-06 00:39:00Z

