5
\$\begingroup\$

I'm tinkering with the idea of crafting a search engine in my spare time. More of a learning experience than anything at this point, but still a project. A key aspect of this system is checking whether a domain is live or not. That's what this code is trying to do (and is succeeding).

It's written in Java, and using outside classes means that the code is pre-obfuscated! Yay!

import java.util.ArrayList;
import java.util.Iterator;
public class BaseCheckDriver extends Thread{
 public static void main(String[] args) {
 long start = System.currentTimeMillis(), end;
 String query = (args.length > 0 && args[0].equals(false) ? 
 "select * from ... where is_live is null" : 
 "select * from ...");
 int numThreads = 16; // can be changed to however many
 ArrayList<Object[]> results = Database.query(query, null);
 Database.update("update ... set is_live = null where is_live is not null limit " + (results.size() + 1), null); // gets around "safe updates" and allows for easy monitoring
 // distribute results to lists
 ArrayList<ArrayList<Object[]>> listContainer = new ArrayList<ArrayList<Object[]>>();
 for(int i = 0;i<numThreads;i++) listContainer.add(new ArrayList<Object[]>());
 for(Object[] row : results){
 int addTo = 0;
 for(int i=1;i<listContainer.size();i++)
 if(listContainer.get(i).size() < listContainer.get(i - 1).size()) addTo = i;
 listContainer.get(addTo).add(row);
 }
 // distribute lists to threads
 ArrayList<Thread> threadContainer = new ArrayList<Thread>();
 for(int i = 0;i<numThreads;i++) threadContainer.add(new BaseCheckDriver(listContainer.get(i)));
 for(Thread thread : threadContainer) thread.start();
 // let threads execute
 try{
 for(Thread thread : threadContainer) thread.join();
 }catch(InterruptedException e){
 e.printStackTrace();
 }
 end = System.currentTimeMillis();
 System.out.println("All done!");
 System.out.println("\tTotal execution time: " + (end - start) + "ms");
 System.out.println("\tAverage execution time: " + ((end - start) / results.size()) + "ms");
 }
 // now the fun begins
 private ArrayList<Object[]> results;
 public BaseCheckDriver(ArrayList<Object[]> results){
 this.results = results;
 }
 public void run(){
 System.out.println(Thread.currentThread().getName() + " Started!");
 long start = System.currentTimeMillis(), end;
 Iterator<Object[]> resultIterator = results.iterator();
 while(resultIterator.hasNext()) Indexer.indexBase(resultIterator.next());
 end = System.currentTimeMillis();
 System.out.println(Thread.currentThread().getName() + " Done!");
 System.out.println("\tTotal execution time: " + (end - start) + "ms");
 System.out.println("\tAverage execution time: " + ((end - start) / results.size()) + "ms");
 }
}

Any improvements would be appreciated. Speed is at a good spot right now (stats below), but one thing that I notice is that although the distribution is technically even, the execution times are far from it. A page that isn't live will take longer to look for than one that is (more often than not) and a page behind a very slow connection may take longer than that. An example: this last run had Thread 12 being the last to complete, running 34 rows behind Thread 7, which was another 34 rows behind Thread 9.

I guess the best solution would be some form of opportunistic distribution, but I have no idea how to go about that. Essentially what I'm thinking of is passing a row off to a thread as soon as it's not busy, thus having all threads finish at about the same time.

Output stats (status data removed for clarity):

...
Thread-15 Done!
 Total execution time: 21167099ms
 Average execution time: 25813ms
...
Thread-2 Done!
 Total execution time: 21201090ms
 Average execution time: 25823ms
...
Thread-10 Done!
 Total execution time: 21457947ms
 Average execution time: 26168ms
...
Thread-13 Done!
 Total execution time: 21608962ms
 Average execution time: 26352ms
...
Thread-4 Done!
 Total execution time: 21627681ms
 Average execution time: 26343ms
...
Thread-11 Done!
 Total execution time: 21638154ms
 Average execution time: 26387ms
...
Thread-5 Done!
 Total execution time: 21824853ms
 Average execution time: 26583ms
...
Thread-3 Done!
 Total execution time: 21890344ms
 Average execution time: 26663ms
...
Thread-6 Done!
 Total execution time: 21900767ms
 Average execution time: 26675ms
...
Thread-0 Done!
 Total execution time: 21909558ms
 Average execution time: 26686ms
...
Thread-8 Done!
 Total execution time: 21930624ms
 Average execution time: 26712ms
...
Thread-14 Done!
 Total execution time: 22053145ms
 Average execution time: 26894ms
...
Thread-1 Done!
 Total execution time: 22091676ms
 Average execution time: 26908ms
...
Thread-9 Done!
 Total execution time: 22167669ms
 Average execution time: 27033ms
...
Thread-7 Done!
 Total execution time: 22626100ms
 Average execution time: 27559ms
...
Thread-12 Done!
 Total execution time: 23560248ms
 Average execution time: 28732ms
All done!
 Total execution time: 23562053ms
 Average execution time: 1794ms
Jamal
35.2k13 gold badges134 silver badges238 bronze badges
asked Jun 3, 2015 at 0:07
\$\endgroup\$
1
  • \$\begingroup\$ Ran in just over 6.5 hrs, if you don't want to convert from ms. \$\endgroup\$ Commented Jun 3, 2015 at 0:10

2 Answers 2

4
\$\begingroup\$

Keeping this short because I think it's the most significant improvement...

Since Java 1.5, the preferred way of doing multi-threading is to rely on an ExecutorService, which can be started via the many static methods of Executors. Instead of extending Thread, your class can implement Callable<List<Object[]>> with a call() in place of your run(), which lets you return your results, if any. I realized I was 'tricked' into thinking your BaseCheckDriver is returning something since your private field is named results. In any case, the ExecutorService will work with Runnable implementations too, so you should alternatively implement that interface instead of Thread. ExecutorService.shutdown() should then be used in place of (manually) calling join() on each Thread.

Essentially what I'm thinking of is passing a row off to a thread as soon as it's not busy, thus having all threads finish at about the same time.

Why yes, you can also consider this approach: Instead of pre-allocating a number of URLs to be checked per thread, set up your thread pool (using the ExecutorService) and submit a URL per thread. The slower threads (either due to slow connections or non-existing ones) will be left alone until they complete their task, and you will continuously (ideal scenario) have fresh threads to continue serving the fast, working URLs.

The other point I'll like to nitpick about is your single/double-line for and if statements without braces. Don't fear the braces, as they clearly define the scope of the for/if keywords without worry of missing (or adding) an extra statement.

answered Jun 3, 2015 at 1:23
\$\endgroup\$
5
  • 1
    \$\begingroup\$ ExecutorService sounds like exactly what I need! I'll have to look into it. All the information I was finding was telling me that I had to either extend Thread or Runable to get what I wanted. Having another option certainly opens this for redesign. \$\endgroup\$ Commented Jun 3, 2015 at 4:15
  • 1
    \$\begingroup\$ Took your advice. Bit of work to get it set up, but it paid off! See updated question. \$\endgroup\$ Commented Jun 3, 2015 at 23:22
  • 1
    \$\begingroup\$ I looked at your earlier revision before you moved your code to PasteBin, if BaseCheckDriver is not going to return anything then you can let it implement Runnable. Calling ExecutorService.submit(Runnable) will return null upon successful completion, which should be fine for you either way anyways. \$\endgroup\$ Commented Jun 3, 2015 at 23:47
  • \$\begingroup\$ Thanks! It seemed a little ridiculous to have to return null! \$\endgroup\$ Commented Jun 4, 2015 at 3:31
  • \$\begingroup\$ NOTE: When submitting Runnables, apparently you need to use awaitTermination(long, TimeUnit); to ensure that the threads complete in addition to shutdown();! Found this out the hard way. Relevant snippet: pastebin.com/gwsk0aRA \$\endgroup\$ Commented Jun 4, 2015 at 3:56
5
\$\begingroup\$

Use constants to store constant values

 int numThreads = 16; // can be changed to however many

Instead of that, try

 private static final int NUMBER_OF_THREADS = 16; // can be changed to however many

Now you can easily see that it is a constant value. Also, moving it outside the function makes it available to other methods if you want to use it. Or you could leave it inside the method without the private modifier.

Favor interfaces over implementations

 ArrayList<Object[]> results = Database.query(query, null);

As a general rule in Java, when defining the type of a variable, you want to use the interface rather than the implementation. That way if you wanted to change Database.query to return a LinkedList rather than an ArrayList, you could.

 List<Object[]> results = Database.query(query, null);

You also might want to consider storing something other than a generic Object array. But that's set in the Database.query method.

Don't forget what you know

 for(Object[] row : results){
 int addTo = 0;
 for(int i=1;i<listContainer.size();i++)
 if(listContainer.get(i).size() < listContainer.get(i - 1).size()) addTo = i;
 listContainer.get(addTo).add(row);
 }

You don't have to calculate the correct place every time. You can simply take turns:

 int addTo = 0;
 for (Object[] row : results) {
 if (addTo >= listContainer.size()) {
 addTo = 0;
 }
 listContainer.get(addTo).add(row);
 addTo++;
 }

At first glance, this may seem like more code, but notice that it eliminates an entire for loop. Also, the increased number of lines is accompanied by a decrease in the code density. I could get the code length down to one fewer lines following the same pattern as the original. However, one statement per line is generally easier to read and follow.

Note that you could also do this with an iterator.

 Iterator<List<Object[]>> current = listContainer.iterator();
 for (Object[] row : results) {
 if (!current.hasNext()) {
 current = listContainer.iterator();
 }
 current.next().add(row);
 }

That's a little more straightforward about what it is doing.

Note: if you switch to having the threads load new URLs whenever they finish, this will be unnecessary. I think that the point is valid regardless though. There are other circumstances when you will have to do things like this.

Naming

 ArrayList<Thread> threadContainer = new ArrayList<Thread>();

Previously you used listContainer to indicate a container of lists that hold something else. Your threadContainer is just some threads.

 List<Thread> threads = new ArrayList<Thread>();

So just name it threads. That's at least as clear about what the variable holds. And shorter.

answered Jun 3, 2015 at 2:24
\$\endgroup\$
2
  • \$\begingroup\$ Thanks for the tips! I'm not sure about numThreads being a constant since it's only used locally, and once optimized will stay at a fixed value unless set as an argument. Good tip about using the interface; I hadn't thought about that. I'll definitely be improving the distributor! \$\endgroup\$ Commented Jun 3, 2015 at 4:12
  • \$\begingroup\$ Thanks again for the advice. The restructuring meant I didn't get to take some of your advice, but I switched to the List interface and used a simpler naming scheme, and made numThreads a private variable for what it's worth. See updated question. \$\endgroup\$ Commented Jun 3, 2015 at 23:24

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.