This method in my code uses a producer-consumer design to read many files and process them for use in an NLP algorithm. The completion service collects files as they are processed, which means that docId != index. Given that I need to retrieve probabilities elsewhere where docId == index, I've implemented a priority queue to sort. However, I do this by sequentially retrieving documents from a list of futures and then putting them into the queue.
I know this implementation isn't great. I keep staring at ConcurrentSkipListSet, thinking I should be able to use it for better performance and avoid these issues. I'm brand new to concurrency and Java, so I'm sure there's a lot to critique.
How can I make this multithreading more efficient?
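For reference, this is roughly the shape I keep picturing, though I haven't managed to make it work. It's only a sketch: the readAllSorted name, the comparator on docId, and the way the completion service is drained are placeholders rather than code from my project.

    public ConcurrentSkipListSet<Document> readAllSorted(File fileDir)
            throws InterruptedException, ExecutionException, IOException {
        int numThreads = Runtime.getRuntime().availableProcessors();
        BlockingQueue<LinkedList<String>> queue = new ArrayBlockingQueue<>(50);
        ExecutorService service = Executors.newFixedThreadPool(numThreads);
        CompletionService<Document> completionService =
                new ExecutorCompletionService<>(service);

        // Same producer-consumer split as below: one ReadFile producer,
        // numThreads - 1 DocumentConsumer consumers.
        int numConsumers = numThreads - 1;
        for (int i = 0; i < numConsumers; i++) {
            completionService.submit(new DocumentConsumer(queue));
        }
        service.submit(new ReadFile(queue, fileDir)).get(); // wait for the producer
        service.shutdownNow();                              // interrupt the consumers

        // Completed documents go straight into a set that keeps itself sorted
        // by docId, so there is no separate pass over a List<Future<Document>>.
        ConcurrentSkipListSet<Document> documents =
                new ConcurrentSkipListSet<>(Comparator.comparingInt((Document d) -> d.docId));
        for (int i = 0; i < numConsumers; i++) {
            documents.add(completionService.take().get());
        }
        return documents;
    }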
Vectorizer:
public class Vectorizer {

    public Vectorizer(){
    }

    public PriorityQueue<Document> readAll(File fileDir)
            throws InterruptedException, ExecutionException, IOException{
        //read each file. When each file is vectorized, put it in a minibatch.
        //producer-consumer threading structure.
        int NUM_THREADS = Runtime.getRuntime().availableProcessors();
        BlockingQueue<LinkedList<String>> queue = new ArrayBlockingQueue<>(50);
        ExecutorService service = Executors.newFixedThreadPool(NUM_THREADS);
        CompletionService<Document> completionService =
                new ExecutorCompletionService<Document>(service);
        //TODO: is list the most efficient structure to handle future?
        List<Future<Document>> docs = new ArrayList<Future<Document>>();
        for (int i = 0; i < (NUM_THREADS - 1); i++) {
            docs.add(completionService.submit(new DocumentConsumer(queue)));
        }
        // Wait for ReadFile to complete
        service.submit(new ReadFile(queue, fileDir)).get();
        service.shutdownNow(); // interrupt CPUTasks
        // Wait for DocumentConsumer to complete
        service.awaitTermination(365, TimeUnit.DAYS);
        //do things with processed docs.
        //should I be doing this, though?
        PriorityQueue<Document> Documents = new PriorityQueue<Document>();
        for(Future<Document> d : docs){
            try{
                Document doc = d.get();
                Documents.add(doc);
                System.out.println(Integer.toString(doc.Cj));
            } catch(ExecutionException e) {
                e.getCause();e.printStackTrace();
            }
        }
        return Documents;
    }
Document object (the compare method and overloads are omitted):
public class Document implements Comparator<Document>{

    int docId, Cj;
    HashMap<String,Integer> termTable;

    public Document(int id,int Cj, HashMap<String,Integer> termMap){
        this.docId = id;
        this.Cj = Cj;
        this.termTable = termMap;
    }
docId is set during file reading: I read in a file and append the count to the first line, and that value is then set during processing.
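On the consumer side that amounts to something like this (a simplified illustration only, not the actual DocumentConsumer, which is in the linked code):

    LinkedList<String> batch = queue.take();
    // The first line of the batch holds the count appended during reading;
    // it becomes the Document's docId.
    int docId = Integer.parseInt(batch.removeFirst());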
The rest of the code is available here.
1 Answer
I didn't look at the specific performance issue you described, but here is a general review:
Readability
for (int i = 0; i < (NUM_THREADS - 1); i++) {
    docs.add(completionService.submit(new DocumentConsumer(queue)));
}
Starting at 0 and ending at somevalue - 1: why don't you just start at 1? You don't even do anything with the index, so it's not like it matters.
e.getCause();e.printStackTrace();
Don't put multiple statements on the same line. It turns your code into a wall of code. You think understanding a wall of text is hard? Try a wall of code...
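For example (just a sketch of the cleanup, not compiled against your project):

    // Conventional loop bound; the index itself is unused, so only the count matters.
    for (int i = 1; i < NUM_THREADS; i++) {
        docs.add(completionService.submit(new DocumentConsumer(queue)));
    }

    try {
        Documents.add(d.get());
    } catch (ExecutionException e) {
        // One statement per line; the value returned by getCause() is still
        // unused here, exactly as in the original.
        e.getCause();
        e.printStackTrace();
    }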
Naming
Use camelCase for methods and variables. Use PascalCase for types. Use ALL_CAPS for constants. This means that

int NUM_THREADS = Runtime.getRuntime().availableProcessors();

should be declared final (ALL_CAPS signals a constant), and

PriorityQueue<Document> Documents = new PriorityQueue<Document>();

should be renamed to documents.
public Document(int id,int Cj, HashMap<String,Integer> termMap){
You should rename Cj to cj. And even better would be to rename it to something that explains what it is; right now I have no idea what a cj is.
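Applied to your snippets, that would look roughly like this (fragments only, and cj is kept as-is because I don't know what it stands for):

    // ALL_CAPS signals a constant, so make it final (or rename it to numThreads).
    final int NUM_THREADS = Runtime.getRuntime().availableProcessors();

    // Local variables start with a lowercase letter.
    PriorityQueue<Document> documents = new PriorityQueue<Document>();

    // Parameter and field renamed to camelCase; pick something more descriptive
    // than cj once it's clear what the value represents.
    public Document(int id, int cj, HashMap<String, Integer> termMap) {
        this.docId = id;
        this.cj = cj;
        this.termTable = termMap;
    }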