Formatting output

Question 1

This code has loops under loops which affects the performance badly. Please help me to optimize the code to improve its performance.

private void processPlainOutputFormat() {
 // we reached the end of corpus processing.Process all Documents from corpus
 Iterator<Document> documentIterator = getCorpus().iterator();
 // method which create a header string
 String header = getHeaderString(getFeaturesList());
 List outputList = new ArrayList();
 outputList.add(header);
 boolean deleteDoc = (corpus.getDataStore() != null);
 while (documentIterator.hasNext()) {
 Document currentDocument = documentIterator.next();
 AnnotationSet inputAS = inputAnnotationSet == null || inputAnnotationSet.trim().length() == 0 ? currentDocument.getAnnotations() : currentDocument.getAnnotations(inputAnnotationSet);
 if (inputAS != null && inputAS.size() > 0) {
 AnnotationSet annotationsToPrintSet = inputAS.get(getAnnotationName());
 if (getFeaturesList() != null && getFeaturesList().size() > 0) {
 List<Annotation> sortedAnnotations = new ArrayList(annotationsToPrintSet);
 Collections.sort(sortedAnnotations, new OffsetComparator());
 Iterator<Annotation> iterator = sortedAnnotations.iterator();
 // just print out values directly to the GATE message window
 while (iterator.hasNext()) {
 Annotation annotation = iterator.next();
 Iterator featuresIterator = getFeaturesList().iterator();
 // local quick output
 StringBuffer outputBuffer = new StringBuffer();
 if (getPrintContent()) {
 String content = currentDocument.getContent().toString().substring(annotation.getStartNode().getOffset().intValue(), annotation.getEndNode().getOffset().intValue());
 outputBuffer.append(content).append(getFeaturesSeparatorSymbol());
 }
 while (featuresIterator.hasNext()) {
 Object feature = featuresIterator.next();
 outputBuffer.append(annotation.getFeatures().get(feature));
 if (featuresIterator.hasNext()) {
 outputBuffer.append(getFeaturesSeparatorSymbol());
 }
 }
 outputList.add(outputBuffer.toString());
 }
 } else if (getPrintContent()) {
 // just print content of the Annotation
 List<Annotation> sortedAnnotations = new ArrayList(annotationsToPrintSet);
 Collections.sort(sortedAnnotations, new OffsetComparator());
 Iterator<Annotation> iterator = sortedAnnotations.iterator();
 // just print out values directly to the GATE message window
 while (iterator.hasNext()) {
 Annotation annotation = iterator.next();
 StringBuffer outputBuffer = new StringBuffer();
 if (getPrintContent()) {
 String content = currentDocument.getContent().toString().substring(annotation.getStartNode().getOffset().intValue(), annotation.getEndNode().getOffset().intValue());
 outputBuffer.append(content);
 }
 outputList.add(outputBuffer.toString());
 }
 }
 }
 // cleanup document
 corpus.unloadDocument(currentDocument);
 if (deleteDoc) {
 Factory.deleteResource(currentDocument);
 }
 }
 // check output destination
 if (getOutputDestination() == null) {
 // just print out results
 Iterator iterator = outputList.iterator();
 while (iterator.hasNext()) {
 Out.println(iterator.next());
 }
 } else {
 write(outputList, getOutputDestination().getPath());
 }
}

Question 2

For now, I can't help you with performance, but maybe with readability.

Raw Types

You shouldn't just declare raw lists, because it makes the code hard to read, and also error-prone.

// this:
List outputList = new ArrayList();
// should be this:
List<String> outputList = new ArrayList<>();

Note also the empty <> on the right side. You should do this for Iterators as well.

Unnecessary Variables

You have outputList which stores the output strings. But then you also have outputBuffer, which also stores the output strings, and then adds them to the outputList. I would just make the outPutList a buffer, and then only add to it.

One-time variables are sometimes nice to have, because you can name them and thus make your code more readable. But here for example, it doesn't add anything but an extra line:

// this:
String header = getHeaderString(getFeaturesList());
outputList.add(header);
// could be this:
outputList.add(getHeaderString(getFeaturesList());

Long Method

Your method is too long, too nested, and does too much.

First, I would extract methods such as: processDocument (which can include all the code inside the outside while loop), cleanDocument, processFeatureDocument/ processPrintDocument (I'm just guessing about the names; they could accept a StringBuffer, so you wouldn't loose any performance by creating additional ones), and outputBuffer.

Splitting up the code to different methods doesn't only help with readability, but will also help you identify bottlenecks by profiling the code, and thus tell you where you actually need to improve performance.

Lots of Null checks

You check if null is returned a lot, and your logic seems to depend on null a bit too much; see for example boolean deleteDoc = (corpus.getDataStore() != null) or if (getOutputDestination() == null). What is the expected result here? Is the data store set to null if the document should not be deleted? This would not seem like a very good system (better to have a delete boolean in there for that). Or is this just error checking in case the data store is null, which it never actually should be? In that case, I would handle it like an error (throw an exception).

Naming

content doesn't store the result of the getContent call, but a substring of it. What's the relevance of this substring? What part of the content did it extract? If possible, the variable name should tell be that. If not, then I would just remove it and append the substring directly.
don't shorten words. inputAS isn't all that clear, and inputAnnotationSet isn't too long.

Comments

some of your comments are not all that helpful, eg // method which create a header string (this should really be a JavaDoc comment at the getHeaderString method, not an in-code comment here), // check output destination, etc
what would be really nice would be a method level JavaDoc comment, about what the method actually does.

Misc

use StringBuilder instead of `StringBuffer if you don't need synchronization.

tim tim 25.3k3 gold badges31 silver badges76 bronze badges · Answer 1 · 2015-02-19 08:33:48Z

For now, I can't help you with performance, but maybe with readability.

Raw Types

You shouldn't just declare raw lists, because it makes the code hard to read, and also error-prone.

// this:
List outputList = new ArrayList();
// should be this:
List<String> outputList = new ArrayList<>();

Note also the empty <> on the right side. You should do this for Iterators as well.

Unnecessary Variables

You have outputList which stores the output strings. But then you also have outputBuffer, which also stores the output strings, and then adds them to the outputList. I would just make the outPutList a buffer, and then only add to it.

One-time variables are sometimes nice to have, because you can name them and thus make your code more readable. But here for example, it doesn't add anything but an extra line:

// this:
String header = getHeaderString(getFeaturesList());
outputList.add(header);
// could be this:
outputList.add(getHeaderString(getFeaturesList());

Long Method

Your method is too long, too nested, and does too much.

First, I would extract methods such as: processDocument (which can include all the code inside the outside while loop), cleanDocument, processFeatureDocument/ processPrintDocument (I'm just guessing about the names; they could accept a StringBuffer, so you wouldn't loose any performance by creating additional ones), and outputBuffer.

Splitting up the code to different methods doesn't only help with readability, but will also help you identify bottlenecks by profiling the code, and thus tell you where you actually need to improve performance.

Lots of Null checks

You check if null is returned a lot, and your logic seems to depend on null a bit too much; see for example boolean deleteDoc = (corpus.getDataStore() != null) or if (getOutputDestination() == null). What is the expected result here? Is the data store set to null if the document should not be deleted? This would not seem like a very good system (better to have a delete boolean in there for that). Or is this just error checking in case the data store is null, which it never actually should be? In that case, I would handle it like an error (throw an exception).

Naming

content doesn't store the result of the getContent call, but a substring of it. What's the relevance of this substring? What part of the content did it extract? If possible, the variable name should tell be that. If not, then I would just remove it and append the substring directly.
don't shorten words. inputAS isn't all that clear, and inputAnnotationSet isn't too long.

Comments

some of your comments are not all that helpful, eg // method which create a header string (this should really be a JavaDoc comment at the getHeaderString method, not an in-code comment here), // check output destination, etc
what would be really nice would be a method level JavaDoc comment, about what the method actually does.

Misc

use StringBuilder instead of `StringBuffer if you don't need synchronization.

Stack Exchange Network

Formatting output

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Formatting output

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions