I'm tinkering with the idea of crafting a search engine in my spare time. More of a learning experience than anything at this point, but still a project. A key aspect of this system is checking whether a domain is live or not. That's what this code is trying to do (and is succeeding).
It's written in Java, and using outside classes means that the code is pre-obfuscated! Yay!
import java.util.ArrayList;
import java.util.Iterator;
public class BaseCheckDriver extends Thread{
public static void main(String[] args) {
long start = System.currentTimeMillis(), end;
String query = (args.length > 0 && args[0].equals(false) ?
"select * from ... where is_live is null" :
"select * from ...");
int numThreads = 16; // can be changed to however many
ArrayList<Object[]> results = Database.query(query, null);
Database.update("update ... set is_live = null where is_live is not null limit " + (results.size() + 1), null); // gets around "safe updates" and allows for easy monitoring
// distribute results to lists
ArrayList<ArrayList<Object[]>> listContainer = new ArrayList<ArrayList<Object[]>>();
for(int i = 0;i<numThreads;i++) listContainer.add(new ArrayList<Object[]>());
for(Object[] row : results){
int addTo = 0;
for(int i=1;i<listContainer.size();i++)
if(listContainer.get(i).size() < listContainer.get(i - 1).size()) addTo = i;
listContainer.get(addTo).add(row);
}
// distribute lists to threads
ArrayList<Thread> threadContainer = new ArrayList<Thread>();
for(int i = 0;i<numThreads;i++) threadContainer.add(new BaseCheckDriver(listContainer.get(i)));
for(Thread thread : threadContainer) thread.start();
// let threads execute
try{
for(Thread thread : threadContainer) thread.join();
}catch(InterruptedException e){
e.printStackTrace();
}
end = System.currentTimeMillis();
System.out.println("All done!");
System.out.println("\tTotal execution time: " + (end - start) + "ms");
System.out.println("\tAverage execution time: " + ((end - start) / results.size()) + "ms");
}
// now the fun begins
private ArrayList<Object[]> results;
public BaseCheckDriver(ArrayList<Object[]> results){
this.results = results;
}
public void run(){
System.out.println(Thread.currentThread().getName() + " Started!");
long start = System.currentTimeMillis(), end;
Iterator<Object[]> resultIterator = results.iterator();
while(resultIterator.hasNext()) Indexer.indexBase(resultIterator.next());
end = System.currentTimeMillis();
System.out.println(Thread.currentThread().getName() + " Done!");
System.out.println("\tTotal execution time: " + (end - start) + "ms");
System.out.println("\tAverage execution time: " + ((end - start) / results.size()) + "ms");
}
}
Any improvements would be appreciated. Speed is at a good spot right now (stats below), but one thing that I notice is that although the distribution is technically even, the execution times are far from it. A page that isn't live will take longer to look for than one that is (more often than not) and a page behind a very slow connection may take longer than that. An example: this last run had Thread 12 being the last to complete, running 34 rows behind Thread 7, which was another 34 rows behind Thread 9.
I guess the best solution would be some form of opportunistic distribution, but I have no idea how to go about that. Essentially what I'm thinking of is passing a row off to a thread as soon as it's not busy, thus having all threads finish at about the same time.
Output stats (status data removed for clarity):
... Thread-15 Done! Total execution time: 21167099ms Average execution time: 25813ms ... Thread-2 Done! Total execution time: 21201090ms Average execution time: 25823ms ... Thread-10 Done! Total execution time: 21457947ms Average execution time: 26168ms ... Thread-13 Done! Total execution time: 21608962ms Average execution time: 26352ms ... Thread-4 Done! Total execution time: 21627681ms Average execution time: 26343ms ... Thread-11 Done! Total execution time: 21638154ms Average execution time: 26387ms ... Thread-5 Done! Total execution time: 21824853ms Average execution time: 26583ms ... Thread-3 Done! Total execution time: 21890344ms Average execution time: 26663ms ... Thread-6 Done! Total execution time: 21900767ms Average execution time: 26675ms ... Thread-0 Done! Total execution time: 21909558ms Average execution time: 26686ms ... Thread-8 Done! Total execution time: 21930624ms Average execution time: 26712ms ... Thread-14 Done! Total execution time: 22053145ms Average execution time: 26894ms ... Thread-1 Done! Total execution time: 22091676ms Average execution time: 26908ms ... Thread-9 Done! Total execution time: 22167669ms Average execution time: 27033ms ... Thread-7 Done! Total execution time: 22626100ms Average execution time: 27559ms ... Thread-12 Done! Total execution time: 23560248ms Average execution time: 28732ms All done! Total execution time: 23562053ms Average execution time: 1794ms
-
\$\begingroup\$ Ran in just over 6.5 hrs, if you don't want to convert from ms. \$\endgroup\$ndm13– ndm132015年06月03日 00:10:23 +00:00Commented Jun 3, 2015 at 0:10
2 Answers 2
Keeping this short because I think it's the most significant improvement...
Since Java 1.5, the preferred way of doing multi-threading is to rely on an ExecutorService
, which can be started via the many static
methods of Executors
. Instead of extending Thread
, your class can implement Callable<List<Object[]>>
with a call()
in place of your run()
, which lets you return your results, if any. I realized I was 'tricked' into thinking your BaseCheckDriver
is returning something since your private
field is named results
. In any case, the ExecutorService
will work with Runnable
implementations too, so you should alternatively implement that interface instead of Thread
. ExecutorService.shutdown()
should then be used in place of (manually) calling join()
on each Thread
.
Essentially what I'm thinking of is passing a row off to a thread as soon as it's not busy, thus having all threads finish at about the same time.
Why yes, you can also consider this approach: Instead of pre-allocating a number of URLs to be checked per thread, set up your thread pool (using the ExecutorService
) and submit a URL per thread. The slower threads (either due to slow connections or non-existing ones) will be left alone until they complete their task, and you will continuously (ideal scenario) have fresh threads to continue serving the fast, working URLs.
The other point I'll like to nitpick about is your single/double-line for
and if
statements without braces. Don't fear the braces, as they clearly define the scope of the for
/if
keywords without worry of missing (or adding) an extra statement.
-
1\$\begingroup\$ ExecutorService sounds like exactly what I need! I'll have to look into it. All the information I was finding was telling me that I had to either extend Thread or Runable to get what I wanted. Having another option certainly opens this for redesign. \$\endgroup\$ndm13– ndm132015年06月03日 04:15:26 +00:00Commented Jun 3, 2015 at 4:15
-
1\$\begingroup\$ Took your advice. Bit of work to get it set up, but it paid off! See updated question. \$\endgroup\$ndm13– ndm132015年06月03日 23:22:24 +00:00Commented Jun 3, 2015 at 23:22
-
1\$\begingroup\$ I looked at your earlier revision before you moved your code to PasteBin, if
BaseCheckDriver
is not going to return anything then you can let it implementRunnable
. CallingExecutorService.submit(Runnable)
will returnnull
upon successful completion, which should be fine for you either way anyways. \$\endgroup\$h.j.k.– h.j.k.2015年06月03日 23:47:08 +00:00Commented Jun 3, 2015 at 23:47 -
\$\begingroup\$ Thanks! It seemed a little ridiculous to have to return null! \$\endgroup\$ndm13– ndm132015年06月04日 03:31:41 +00:00Commented Jun 4, 2015 at 3:31
-
\$\begingroup\$ NOTE: When submitting Runnables, apparently you need to use
awaitTermination(long, TimeUnit);
to ensure that the threads complete in addition toshutdown();
! Found this out the hard way. Relevant snippet: pastebin.com/gwsk0aRA \$\endgroup\$ndm13– ndm132015年06月04日 03:56:50 +00:00Commented Jun 4, 2015 at 3:56
Use constants to store constant values
int numThreads = 16; // can be changed to however many
Instead of that, try
private static final int NUMBER_OF_THREADS = 16; // can be changed to however many
Now you can easily see that it is a constant value. Also, moving it outside the function makes it available to other methods if you want to use it. Or you could leave it inside the method without the private
modifier.
Favor interfaces over implementations
ArrayList<Object[]> results = Database.query(query, null);
As a general rule in Java, when defining the type of a variable, you want to use the interface rather than the implementation. That way if you wanted to change Database.query
to return a LinkedList
rather than an ArrayList
, you could.
List<Object[]> results = Database.query(query, null);
You also might want to consider storing something other than a generic Object
array. But that's set in the Database.query
method.
Don't forget what you know
for(Object[] row : results){ int addTo = 0; for(int i=1;i<listContainer.size();i++) if(listContainer.get(i).size() < listContainer.get(i - 1).size()) addTo = i; listContainer.get(addTo).add(row); }
You don't have to calculate the correct place every time. You can simply take turns:
int addTo = 0;
for (Object[] row : results) {
if (addTo >= listContainer.size()) {
addTo = 0;
}
listContainer.get(addTo).add(row);
addTo++;
}
At first glance, this may seem like more code, but notice that it eliminates an entire for
loop. Also, the increased number of lines is accompanied by a decrease in the code density. I could get the code length down to one fewer lines following the same pattern as the original. However, one statement per line is generally easier to read and follow.
Note that you could also do this with an iterator.
Iterator<List<Object[]>> current = listContainer.iterator();
for (Object[] row : results) {
if (!current.hasNext()) {
current = listContainer.iterator();
}
current.next().add(row);
}
That's a little more straightforward about what it is doing.
Note: if you switch to having the threads load new URLs whenever they finish, this will be unnecessary. I think that the point is valid regardless though. There are other circumstances when you will have to do things like this.
Naming
ArrayList<Thread> threadContainer = new ArrayList<Thread>();
Previously you used listContainer
to indicate a container of lists that hold something else. Your threadContainer
is just some threads.
List<Thread> threads = new ArrayList<Thread>();
So just name it threads
. That's at least as clear about what the variable holds. And shorter.
-
\$\begingroup\$ Thanks for the tips! I'm not sure about numThreads being a constant since it's only used locally, and once optimized will stay at a fixed value unless set as an argument. Good tip about using the interface; I hadn't thought about that. I'll definitely be improving the distributor! \$\endgroup\$ndm13– ndm132015年06月03日 04:12:33 +00:00Commented Jun 3, 2015 at 4:12
-
\$\begingroup\$ Thanks again for the advice. The restructuring meant I didn't get to take some of your advice, but I switched to the List interface and used a simpler naming scheme, and made numThreads a private variable for what it's worth. See updated question. \$\endgroup\$ndm13– ndm132015年06月03日 23:24:04 +00:00Commented Jun 3, 2015 at 23:24