I am trying to parse HTML using "jsoup". This is my first time working with "jsoup" and I read some tutorials on it as well.
If you see my table, it has three <tr>
as of now (I have shortened it down to have three table rows just for understanding purpose, but in general it will be more). Now I would like to extract Cluster Name
from my table and its corresponding host name
; so for example, I would extract Titan
as cluster name and all its hostname whose status are down.
For Titan
cluster name, I have two hostnames machineA.abc.com
and machineB.abc.com
in which machineA
status is up
but machineB
status is down
.
So, I will print out Titan
as cluster name and print out machineB.abc.com
as the hostname, since it is down.
I have two cluster names: one is Titan
and the other is Goldy
; so I want to find all the machines which are down for Titan
cluster name only.
<table border=1>
<tr>
<td> </td>
<td> </td>
<td>Alert</td>
<td>Cluster Name</td>
<td>IP addr</td>
<td>Host Name</td>
<td>Type</td>
<td>Status</td>
<td>Free</td>
<td>Version</td>
<td>Restart Time</td>
<td>UpTime(Days)</td>
<td>Last probed</td>
<td>Last up</td>
</tr>
<tr bgcolor="ffffff">
<td><a href=showlog?ip_addr=127.0.0.1>Hist</a></td>
<td><a href=http://127.0.0.1:8080/test?full=y>VI</a></td>
<td bgcolor="ffffff"> </td>
<td>Titan</td>
<td>10.100.111.77</td>
<td>machineA.abc.com</td>
<td></td>
<td bgcolor="ffffff">up</td>
<td bgcolor="ffffff" align=right>88%</td>
<td bgcolor="ffffff">2.0.5-SNAPSHOT</td>
<td bgcolor="ffffff">2014年07月04日 01:49:08,220</td>
<td bgcolor="ffffff" align=right>381</td>
<td>07-14 20:01:59</td>
<td>07-14 20:01:59</td>
</tr>
<tr bgcolor="ffffff">
<td><a href=showlog?ip_addr=127.0.0.1>Hist</a></td>
<td><a href=http://127.0.0.1:8080/test?full=y>VI</a></td>
<td bgcolor="ffffff"> </td>
<td></td>
<td>10.200.192.99</td>
<td>machineB.abc.com</td>
<td></td>
<td bgcolor="ffffff">down</td>
<td bgcolor="ffffff" align=right>85%</td>
<td bgcolor="ffffff">2.0.5-SNAPSHOT</td>
<td bgcolor="ffffff">2014年07月04日 01:52:20,613</td>
<td bgcolor="ffffff" align=right>103</td>
<td>07-14 20:01:59</td>
<td>07-14 20:01:59</td>
</tr>
<tr bgcolor="ffffff">
<td><a href=showlog?ip_addr=127.0.0.1>Hist</a></td>
<td><a href=http://127.0.0.1:8080/test?full=y>VI</a></td>
<td bgcolor="ffffff"> </td>
<td>Goldy</td>
<td>10.100.111.77</td>
<td>machineH.pqr.com</td>
<td></td>
<td bgcolor="ffffff">up</td>
<td bgcolor="ffffff" align=right>88%</td>
<td bgcolor="ffffff">2.0.5-SNAPSHOT</td>
<td bgcolor="ffffff">2014年07月04日 01:49:08,220</td>
<td bgcolor="ffffff" align=right>381</td>
<td>07-14 20:01:59</td>
<td>07-14 20:01:59</td>
</tr>
</table>
I'd like to know if I can improve this slightly:
public static void main(String[] args) throws JSONException, IOException {
URL url = new URL("some_url");
Document doc = Jsoup.parse(url, 3000);
ArrayList<String> downServers = new ArrayList<>();
Element table = doc.select("table").get(0); //select the first table.
Elements rows = table.select("tr");
for (int i = 1; i < rows.size(); i++) { //first row is the col names so skip it.
Element row = rows.get(i);
Elements cols = row.select("td");
if (cols.get(3).text().equals("Titan")) {
if (cols.get(7).text().equals("down"))
downServers.add(cols.get(5).text());
do {
if(i < rows.size() - 1)
i++;
row = rows.get(i);
cols = row.select("td");
if (cols.get(7).text().equals("down") && cols.get(3).text().equals("")) {
downServers.add(cols.get(5).text());
}
if(i == rows.size() - 1)
break;
}
while (cols.get(3).text().equals(""));
i--;
}
}
System.out.println(downServers);
}
1 Answer 1
If you don't need implementation specific features, declare variables using interface types. That is, instead of:
ArrayList<String> downServers = new ArrayList<>();
Use the List
interface:
List<String> downServers = new ArrayList<>();
You can remove the JSONException
throws declaration from the main
, as it will never happen.
Your main loop can be written a bit simpler, for example:
Iterator<Element> rowIterator = rows.iterator();
rowIterator.next();
boolean wasMatch = false;
while (rowIterator.hasNext()) {
Element row = rowIterator.next();
Elements cols = row.select("td");
String clusterName = cols.get(3).text();
if (wasMatch && clusterName.isEmpty() || clusterName.equals("Titan")) {
wasMatch = true;
if (cols.get(7).text().equals("down")) {
downServers.add(cols.get(5).text());
}
} else {
wasMatch = false;
}
}
I would also wrap a row in a class with convenient accessors like this:
class ServerInfo {
final Elements cols;
ServerRowWrapper(Elements cols) {
this.cols = cols;
}
String getClusterName() {
return cols.get(3).text();
}
String getServerName() {
return cols.get(5).text();
}
boolean isDown() {
return cols.get(7).text().equals("down");
}
}
which will make the main loop much more intuitive and free from hardcoded numbers like 3, 5, 7:
while (rowIterator.hasNext()) {
ServerInfo serverInfo = new ServerInfo(rowIterator.next().select("td"));
String clusterName = serverInfo.getClusterName();
if (wasMatch && clusterName.isEmpty() || clusterName.equals("Titan")) {
wasMatch = true;
if (serverInfo.isDown()) {
downServers.add(serverInfo.getServerName());
}
} else {
wasMatch = false;
}
}
Finally, to make future changes easier to test, add some unit tests.
As the first step, extract the middle part from your main
method, to take a Document
as parameter and return the list of servers that are down:
static List<String> findServersDown(Document document) {
List<String> downServers = new ArrayList<>();
// ... rest of the code from main
return downServers;
}
Change the main
method to use the new findServersDown
method:
public static void mainx(String[] args) throws IOException {
URL url = new URL("some_url");
Document doc = Jsoup.parse(url, 3000);
System.out.println(ServerInfoParser.findServersDown(doc));
}
Add a unit test (or more!), for example:
public class ServerInfoParserTest {
@Test
public void testSecondServerDown() throws IOException {
Document doc = Jsoup.parse(new File("src/test/resources/test-data/table1.html"), "utf-8");
List<String> downServers = ServerInfoParser.findServersDown(doc);
assertEquals(Arrays.asList("machineB.abc.com"), downServers);
}
}
UPDATE (in response to your comment)
Instead of hardcoding Titan
, it's better to generalize. It's easy enough to do.
Pass in a set of names to findServersDown
and adjust the if
condition in the loop:
static List<String> findServersDown(Document document, Set<String> names) {
// ...
while (rowIterator.hasNext()) {
// ...
if (wasMatch && clusterName.isEmpty() || names.contains(clusterName)) {
// ...
Then in the unit test:
Set<String> names = new HashSet<>(Arrays.asList("Titan"));
List<String> downServers = ServerInfoParser.findServersDown(doc, names);
This way the method will return the list of server names in any of the clusters you specified.
If instead of a flat list of server names, if you want them grouped by the name of the cluster, that should be easy to do as well, just rework the return type to a Map<String, Set<String>>
instead.