This method in my code uses a producer-consumer design to read many files and process them for use in an NLP algorithm. The completion service collects files as they are processed, which means that docId != index. Given that I need to retrieve probabilities elsewhere where docId == index, I've implemented a priority queue to sort. However, I do this by sequentially retrieving documents from a list of futures and then putting them into the queue.
I know this implementation isn't great. I keep staring at ConcurrentSkipListSet, thinking I should be able to use it for better performance and avoid these issues. I'm brand new to concurrency and Java, so I'm sure there's a lot to critique.
How can I make this multithreading more efficient?
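For reference, this is roughly the shape I keep picturing, though I haven't managed to make it work. It's only a sketch: the readAllSorted name, the comparator on docId, and the way the completion service is drained are placeholders rather than code from my project.

    public ConcurrentSkipListSet<Document> readAllSorted(File fileDir)
            throws InterruptedException, ExecutionException, IOException {
        int numThreads = Runtime.getRuntime().availableProcessors();
        BlockingQueue<LinkedList<String>> queue = new ArrayBlockingQueue<>(50);
        ExecutorService service = Executors.newFixedThreadPool(numThreads);
        CompletionService<Document> completionService =
                new ExecutorCompletionService<>(service);

        // Same producer-consumer split as below: one ReadFile producer,
        // numThreads - 1 DocumentConsumer consumers.
        int numConsumers = numThreads - 1;
        for (int i = 0; i < numConsumers; i++) {
            completionService.submit(new DocumentConsumer(queue));
        }
        service.submit(new ReadFile(queue, fileDir)).get(); // wait for the producer
        service.shutdownNow();                              // interrupt the consumers

        // Completed documents go straight into a set that keeps itself sorted
        // by docId, so there is no separate pass over a List<Future<Document>>.
        ConcurrentSkipListSet<Document> documents =
                new ConcurrentSkipListSet<>(Comparator.comparingInt((Document d) -> d.docId));
        for (int i = 0; i < numConsumers; i++) {
            documents.add(completionService.take().get());
        }
        return documents;
    }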
Vectorizer:
public class Vectorizer {

    public Vectorizer(){
    }

    public PriorityQueue<Document> readAll(File fileDir)
            throws InterruptedException, ExecutionException, IOException{
        //read each file. When each file is vectorized, put it in a minibatch.
        //producer-consumer threading structure.
        int NUM_THREADS = Runtime.getRuntime().availableProcessors();
        BlockingQueue<LinkedList<String>> queue = new ArrayBlockingQueue<>(50);
        ExecutorService service = Executors.newFixedThreadPool(NUM_THREADS);
        CompletionService<Document> completionService =
                new ExecutorCompletionService<Document>(service);
        //TODO: is list the most efficient structure to handle future?
        List<Future<Document>> docs = new ArrayList<Future<Document>>();
        for (int i = 0; i < (NUM_THREADS - 1); i++) {
            docs.add(completionService.submit(new DocumentConsumer(queue)));
        }
        // Wait for ReadFile to complete
        service.submit(new ReadFile(queue, fileDir)).get();
        service.shutdownNow(); // interrupt CPUTasks
        // Wait for DocumentConsumer to complete
        service.awaitTermination(365, TimeUnit.DAYS);
        //do things with processed docs.
        //should I be doing this, though?
        PriorityQueue<Document> Documents = new PriorityQueue<Document>();
        for(Future<Document> d : docs){
            try{
                Document doc = d.get();
                Documents.add(doc);
                System.out.println(Integer.toString(doc.Cj));
            } catch(ExecutionException e) {
                e.getCause();e.printStackTrace();
            }
        }
        return Documents;
    }
Document object (the compare method and overloads are omitted):
public class Document implements Comparator<Document>{

    int docId, Cj;
    HashMap<String,Integer> termTable;

    public Document(int id,int Cj, HashMap<String,Integer> termMap){
        this.docId = id;
        this.Cj = Cj;
        this.termTable = termMap;
    }
docId is set during file reading: I read in a file and append the count to the first line, and that value is then set during processing.
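On the consumer side that amounts to something like this (a simplified illustration only, not the actual DocumentConsumer, which is in the linked code):

    LinkedList<String> batch = queue.take();
    // The first line of the batch holds the count appended during reading;
    // it becomes the Document's docId.
    int docId = Integer.parseInt(batch.removeFirst());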
The rest of the code is available here.
1 Answer
I didn't look at the specific performance issue you described, but here is a general review:
Readability
for (int i = 0; i < (NUM_THREADS - 1); i++) {
    docs.add(completionService.submit(new DocumentConsumer(queue)));
}
Starting at 0 and ending at somevalue - 1: why don't you just start at 1? You don't even do anything with the index, so it's not like it matters.
e.getCause();e.printStackTrace();
Don't put multiple statements on the same line. It turns your code into a wall of code. You think understanding a wall of text is hard? Try a wall of code...
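For example (just a sketch of the cleanup, not compiled against your project):

    // Conventional loop bound; the index itself is unused, so only the count matters.
    for (int i = 1; i < NUM_THREADS; i++) {
        docs.add(completionService.submit(new DocumentConsumer(queue)));
    }

    try {
        Documents.add(d.get());
    } catch (ExecutionException e) {
        // One statement per line; the value returned by getCause() is still
        // unused here, exactly as in the original.
        e.getCause();
        e.printStackTrace();
    }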
Naming
Use camelCase for methods and variables. Use PascalCase for types. Use ALL_CAPS for constants. This means that

int NUM_THREADS = Runtime.getRuntime().availableProcessors();

should be declared final (ALL_CAPS signals a constant), and

PriorityQueue<Document> Documents = new PriorityQueue<Document>();

should be renamed to documents.
public Document(int id,int Cj, HashMap<String,Integer> termMap){
You should rename Cj to cj. And even better would be to rename it to something that explains what it is; right now I have no idea what a cj is.
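Applied to your snippets, that would look roughly like this (fragments only, and cj is kept as-is because I don't know what it stands for):

    // ALL_CAPS signals a constant, so make it final (or rename it to numThreads).
    final int NUM_THREADS = Runtime.getRuntime().availableProcessors();

    // Local variables start with a lowercase letter.
    PriorityQueue<Document> documents = new PriorityQueue<Document>();

    // Parameter and field renamed to camelCase; pick something more descriptive
    // than cj once it's clear what the value represents.
    public Document(int id, int cj, HashMap<String, Integer> termMap) {
        this.docId = id;
        this.cj = cj;
        this.termTable = termMap;
    }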