This program downloads a Code Review HTML file and parses it.
Could you review my program?
Main.java
import java.net.URL;
public class Main {
public static void main(String[] args) throws Exception {
final String site1 = "http://codereview.stackexchange.com/questions/";
String site2;
String site3;
URL url;
HtmlGetter htmlGetter = new HtmlGetter();
while(true) {
site2 = "69";
site3 = "is-this-implementation-of-shamos-hoey-algorithm-ok";
url = new URL(site1 + site2 + "/" + site3);
htmlGetter.setFileName("[" + site2 + "]" + site3 + ".html");
htmlGetter.download(url);
htmlGetter.parse();
break;
}
}
}
HtmlGetter.java
import java.io.File;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.net.URL;
import java.util.Scanner;
import org.apache.commons.lang3.StringEscapeUtils;
import org.apache.commons.io.IOUtils;
public class HtmlGetter {
private String fileName;
public void setFileName(String fileName) {
this.fileName = fileName;
}
public String getFileName() {
return fileName;
}
public void download(URL url) throws Exception {
final InputStream in = url.openStream();
final OutputStream out = new FileOutputStream(fileName);
IOUtils.copy(in, out);
in.close();
out.close();
}
public void parse() throws Exception {
Scanner scanner = new Scanner(new File(fileName));
String oneLine;
String htmlTagRegex = "<(/)?([a-zA-Z]*)(\\s[a-zA-Z]*=[^>]*)?(\\s)*(/)?>";
while(scanner.hasNextLine()) {
oneLine = scanner.nextLine();
if(oneLine.matches(".*<code>.*")) {
while(scanner.hasNextLine()&&(!oneLine.matches(".*</code>.*"))) {
oneLine = StringEscapeUtils.unescapeHtml4(oneLine.replaceAll(htmlTagRegex, ""));
System.out.println(oneLine);
oneLine = scanner.nextLine();
// This cannot handle such kind of a line which has <code> and </code> together.
}
System.out.println("-------------------------------------------");
}
}
}
}
1 Answer 1
public static void main(String[] args) throws Exception {
final String site1 = "http://codereview.stackexchange.com/questions/";
String site2;
String site3;
Why do define site1 as a final constant, but delay the other assignments until later? They should really be static finals for the class. Also some clearer names would not go amiss.
URL url;
Declare them where they are used, don't declare them ahead of time.
HtmlGetter htmlGetter = new HtmlGetter();
while(true) {
site2 = "69";
site3 = "is-this-implementation-of-shamos-hoey-algorithm-ok";
url = new URL(site1 + site2 + "/" + site3);
htmlGetter.setFileName("[" + site2 + "]" + site3 + ".html");
htmlGetter.download(url);
htmlGetter.parse();
break;
Why do you have a loop if you are just going to break out of it? } }
public void parse() throws Exception {
You shouldn't throw Exception, your specification should really only include the exceptions that'll actually be thrown
Scanner scanner = new Scanner(new File(fileName));
String oneLine;
String htmlTagRegex = "<(/)?([a-zA-Z]*)(\\s[a-zA-Z]*=[^>]*)?(\\s)*(/)?>";
Parsing HTML by regex is asking for trouble. Instead, you'll find the task much easier if you grab a html parsing library and use that.
while(scanner.hasNextLine()) {
oneLine = scanner.nextLine();
if(oneLine.matches(".*<code>.*")) {
You are just searching for a substring here, there isn't much point in using regex to do that.
while(scanner.hasNextLine()&&(!oneLine.matches(".*</code>.*"))) {
If you insist on regex, use a single regex to capture the entire code tag.
oneLine = StringEscapeUtils.unescapeHtml4(oneLine.replaceAll(htmlTagRegex, ""));
System.out.println(oneLine);
oneLine = scanner.nextLine();
// This cannot handle such kind of a line which has <code> and </code> together.
}
System.out.println("-------------------------------------------");
}
}
}
Using something like jsoup: http://jsoup.org/
Document question = Jsoup.connect("http://codereview.stackexchange.com/questions/69").get();
Elements codes = question.select("code")
for(Element code: codes)
{
System.out.println(code.text())
}
Explore related questions
See similar questions with these tags.