This code has loops under loops which affects the performance badly. Please help me to optimize the code to improve its performance.
private void processPlainOutputFormat() {
// we reached the end of corpus processing.Process all Documents from corpus
Iterator<Document> documentIterator = getCorpus().iterator();
// method which create a header string
String header = getHeaderString(getFeaturesList());
List outputList = new ArrayList();
outputList.add(header);
boolean deleteDoc = (corpus.getDataStore() != null);
while (documentIterator.hasNext()) {
Document currentDocument = documentIterator.next();
AnnotationSet inputAS = inputAnnotationSet == null || inputAnnotationSet.trim().length() == 0 ? currentDocument.getAnnotations() : currentDocument.getAnnotations(inputAnnotationSet);
if (inputAS != null && inputAS.size() > 0) {
AnnotationSet annotationsToPrintSet = inputAS.get(getAnnotationName());
if (getFeaturesList() != null && getFeaturesList().size() > 0) {
List<Annotation> sortedAnnotations = new ArrayList(annotationsToPrintSet);
Collections.sort(sortedAnnotations, new OffsetComparator());
Iterator<Annotation> iterator = sortedAnnotations.iterator();
// just print out values directly to the GATE message window
while (iterator.hasNext()) {
Annotation annotation = iterator.next();
Iterator featuresIterator = getFeaturesList().iterator();
// local quick output
StringBuffer outputBuffer = new StringBuffer();
if (getPrintContent()) {
String content = currentDocument.getContent().toString().substring(annotation.getStartNode().getOffset().intValue(), annotation.getEndNode().getOffset().intValue());
outputBuffer.append(content).append(getFeaturesSeparatorSymbol());
}
while (featuresIterator.hasNext()) {
Object feature = featuresIterator.next();
outputBuffer.append(annotation.getFeatures().get(feature));
if (featuresIterator.hasNext()) {
outputBuffer.append(getFeaturesSeparatorSymbol());
}
}
outputList.add(outputBuffer.toString());
}
} else if (getPrintContent()) {
// just print content of the Annotation
List<Annotation> sortedAnnotations = new ArrayList(annotationsToPrintSet);
Collections.sort(sortedAnnotations, new OffsetComparator());
Iterator<Annotation> iterator = sortedAnnotations.iterator();
// just print out values directly to the GATE message window
while (iterator.hasNext()) {
Annotation annotation = iterator.next();
StringBuffer outputBuffer = new StringBuffer();
if (getPrintContent()) {
String content = currentDocument.getContent().toString().substring(annotation.getStartNode().getOffset().intValue(), annotation.getEndNode().getOffset().intValue());
outputBuffer.append(content);
}
outputList.add(outputBuffer.toString());
}
}
}
// cleanup document
corpus.unloadDocument(currentDocument);
if (deleteDoc) {
Factory.deleteResource(currentDocument);
}
}
// check output destination
if (getOutputDestination() == null) {
// just print out results
Iterator iterator = outputList.iterator();
while (iterator.hasNext()) {
Out.println(iterator.next());
}
} else {
write(outputList, getOutputDestination().getPath());
}
}
1 Answer 1
For now, I can't help you with performance, but maybe with readability.
Raw Types
You shouldn't just declare raw lists, because it makes the code hard to read, and also error-prone.
// this:
List outputList = new ArrayList();
// should be this:
List<String> outputList = new ArrayList<>();
Note also the empty <>
on the right side. You should do this for Iterator
s as well.
Unnecessary Variables
You have outputList
which stores the output strings. But then you also have outputBuffer
, which also stores the output strings, and then adds them to the outputList
. I would just make the outPutList
a buffer, and then only add to it.
One-time variables are sometimes nice to have, because you can name them and thus make your code more readable. But here for example, it doesn't add anything but an extra line:
// this:
String header = getHeaderString(getFeaturesList());
outputList.add(header);
// could be this:
outputList.add(getHeaderString(getFeaturesList());
Long Method
Your method is too long, too nested, and does too much.
First, I would extract methods such as: processDocument
(which can include all the code inside the outside while loop), cleanDocument
, processFeatureDocument
/ processPrintDocument
(I'm just guessing about the names; they could accept a StringBuffer
, so you wouldn't loose any performance by creating additional ones), and outputBuffer
.
Splitting up the code to different methods doesn't only help with readability, but will also help you identify bottlenecks by profiling the code, and thus tell you where you actually need to improve performance.
Lots of Null checks
You check if null is returned a lot, and your logic seems to depend on null a bit too much; see for example boolean deleteDoc = (corpus.getDataStore() != null)
or if (getOutputDestination() == null)
. What is the expected result here? Is the data store set to null if the document should not be deleted? This would not seem like a very good system (better to have a delete boolean in there for that). Or is this just error checking in case the data store is null, which it never actually should be? In that case, I would handle it like an error (throw an exception).
Naming
content
doesn't store the result of thegetContent
call, but a substring of it. What's the relevance of this substring? What part of the content did it extract? If possible, the variable name should tell be that. If not, then I would just remove it and append the substring directly.- don't shorten words.
inputAS
isn't all that clear, andinputAnnotationSet
isn't too long.
Comments
- some of your comments are not all that helpful, eg
// method which create a header string
(this should really be a JavaDoc comment at thegetHeaderString
method, not an in-code comment here),// check output destination
, etc - what would be really nice would be a method level JavaDoc comment, about what the method actually does.
Misc
- use
StringBuilder
instead of `StringBuffer if you don't need synchronization.