Custom Iterator for Processing Large Files

Question 1

I frequently work with large files and implemented a custom iterator to process these files efficiently while minimizing memory usage. The iterator reads each line from a file, parses it into a Message object, and skips invalid entries.

Are there any optimizations or best practices I can apply to this custom iterable implementation to further reduce memory usage or improve its performance? Below is the implementation for reference:

public class MessageIterable implements Iterable<Message> {
private final Logger logger = MessageFileProcessorSingleton.getInstance().getLog();
private final BufferedReader bufferedReader;
private final MessageParserService parserService;
private final PostExceptionTasksHandler postExceptionTasksHandler;
public MessageIterable(
 BufferedReader bufferedReader,
 PostExceptionTasksHandler postExceptionTasksHandler,
 MessageParserService service) {
 this.bufferedReader = bufferedReader;
 this.postExceptionTasksHandler = postExceptionTasksHandler;
 this.parserService = service;
}
@Override
public Iterator<Message> iterator() {
 return new Iterator<>() {
 Message next;
 @Override
 public boolean hasNext() {
 if (next != null) {
 return true;
 }
 try {
 String messageRowJson;
 while ((messageRowJson = bufferedReader.readLine()) != null) {
 next = parserService.getMessageFromRowJson(messageRowJson);
 if (next == null) {
 logger.log(Level.SEVERE, "Unable to parse messaeg row");
System.exit(-1);
 } else {
 return true;
 }
 }
 } catch (IOException e) {
 handleInvalidMessage();
 }
 return false;
 }
 @Override
 public Message next() {
 Message current = next;
 next = null;
 return current;
 }
 };
}
private void handleInvalidMessage() {
 logger.warn("Invalid message encountered. Executing post-exception tasks.");
 postExceptionTasksHandler.handle();
}

Question 2

It just occurred to me that are you perhaps processing JSON-formatted log files? If you can describe some reasons for malformed messages we may be able to figure out a better way to handle them (see en.wikipedia.org/wiki/XY_problem ).

Question 3

PostExceptionTasksHandler seems like an unnecessary type. It is basically a Runnable that is executed when an exception occurs. It would likely be more useful if you passed the exception that caused handler invocation to the handler itself. If you do that, the handler type could be replaced with Consumer<IOException> and that already contains enough information in it's name to suggest to the caller what it is used for.

If you have access to the MessageParserService.getMessageFromRowJson(String) method source code, I would change it to throw an exception on an invalid message instead of returning a null. That way you provide the caller with more information for deciding on how to handle the error. Of if you want to use it in a context where exceptions are not desired, maybe change the Message class so that it contains information about the validity so that you can return an error status from the conversion method.

MessageParserService seems like it's more of a mapper than an actual service. The method signature suggests that it converts a JSON string to a data object. These are of course just naming conventions and thus highly opinion based, but I prefer that classes that just convert data types to follow naming pattern like MessageMapper.from(String source). This has benefits because the name restricts the class' scope to creating Message objects and the method name clearly tells that it creates a Message from a String. Only having methods named "from" in your mapper helps you think twice before you start loading responsibilities to it.

Do not call System.exit() to handle an error unless you know exactly why you want to do that. Based on the fact that you describe the code as "skipping invalid entries" I would guess that you don't know. Because the System.exit doesn't skip invalid entries. Instead it does the equivalent of yanking the power cord from the computer when it encounters a spelling error... There are a lot of related questions on Stack Overflow about this. You can start from here: https://stackoverflow.com/questions/3715967/when-should-we-call-system-exit-in-java

For what it's worth, maybe you could use Files.lines(Path, Charset) method instead of writing a special iterator for this? Something like this:

Files.lines(path, charset)
 .map(parserService::getMessageFromRowJson)
 .filter(Objects::nonNull)
 .forEach([ message processing method ]);

Regarding the question about improving memory usage and performance. You didn't provide any information about how the memory usage is a problem and there is very little to be done with the code we see. If memory usage really is a problem, it is likely in the code that processes the Message objects after they have been read.

Regarding runtime performance you're better of with using Files.lines(...) as the generated stream is backed by a memory mapped file and it is very hard to improve the performance from that. You can even use a parallel stream with it to see if taking advantage of a multi-processor computer pays off in your use case.

But the improved runtime performance is paid off with increased memory consumption. There really is no end-all answer to this. Here is a good article about performance optimization: https://ericlippert.com/2012/12/17/performance-rant/

Question 4

I haven't looked too deeply at this, but I notice that the Iterator only generates the next value in hasNext, and next just assumes this has already been done. This means that if there are ever two next() calls without a hasNext() in between, the second one will always return null as opposed to the value it ought return. Instead of assuming every call to next will be preceded by a hasNext call, make it so that next may itself advance the generation if needed (one way to do this could be having it call hasNext. Breaking the advancing logic into a helper method which both next and hasNext individually call is another).

Additionally, Iterator::next() should throw a NoSuchElementException when the underlying collection is exhausted. This implementation doesn't seem to do this.

Finally, it seems we expect message parsing to indicate failure in two different ways - either by throwing an IOException or by returning null. The former is handled well. The latter is handled by forcibly terminating the whole entire program with System::exit, leaving the caller no chance to recover or even clean up. Prefer handling this case in a manner more like the way you handle the other exception path.

Question 5

I disagree, it's the programmer's job to ensure that hasNext has been called. It shouldn't be called in next again. If we're going for performance here, that would be an unnecessary function call if had been called beforehand.

Question 6

@infinitezero I disagree that the use of the Iterator contract should be that rigid; it's not always the implementation's job to ensure the user can't shoot themselves in the foot. What I also disagree with, however, is the purpose of the next and hasNext methods that this implementation assumes. next should simply be the method that fetches the next element in the iterable, and hasNext should simply be a test method to see if calling next will succeed. As such, both OP's approach and this suggested alternative are incorrect implementations of the iterator pattern.

Question 7

@infinitezero That's not a rule exclusive to C++ and it has nothing to do with performance. It's called the KISS rule (i.e. Keep It Simple, Stupid). And in this case, it's related to the Single-Responsibility Principle (a.k.a. the S in SOLID). next and hasNext have clearly defined roles in the iterator pattern, and having them perform additional duties outside of those roles makes them bloated, tightly-coupled, and difficult to understand and maintain. Ironically, your approach to take away the user's foot-gun is an excellent way to end up with error-prone spaghetti code.

Question 8

@Abion47 @infinitezero Sure, hasNext should do as little work as possible to determine whether there exists a next value. But here "does a next value exist?" is equivalent to "does the rest of a file I have yet to fully read contain a line which will cause an arbitrary function to return a value?". Given that, I'm not sure how either of you envision a hasNext which provides a useful answer without causing the file to be read or the parser service to be called on at least some invocations.

Question 9

This whole comment thread is useless because it centers on multiple incorrect assumptions on what the Iterator API requires. Please just go read the JavaDoc. Sara J's answer is correct and does not need these addendums.

Question 10

Disclaimer: I'm not a Java programmer (last time was ~14 years ago)

Buffer lines
It's a good choice to use a Buffered Reader here as FileIO is generally a bottleneck.

Don't terminate on non-critical errors
I would never expect that a function that's called hasNext has the ability to nuke the running application (System.exit(-1)). That's an absolute no-go.
Definitely prefer exceptions over program termination.

Don't loop if you don't want a loop It looks like this is just copy/pasted because why is there a while loop here? Either you terminate the program or you return from the function.

while ((messageRowJson = bufferedReader.readLine()) != null) {
 next = parserService.getMessageFromRowJson(messageRowJson);
 if (next == null) {
 logger.log(Level.SEVERE, "Unable to parse messaeg row");
 System.exit(-1);
 } else {
 return true;
 }
}

If you decide to implement the buffer system, use a for loop instead (for reading lines) or a while loop (for reading byte sizes).

User expectation
Typically the hasNext implementation should be very light. In your case it contains FileIO. Initially I found this weird but decided not to raise a comment about it, since you can't know if there is a nextValue unless when you read the file. Except that you can. The only way that there is no further data is when the file reader has encountered EOF. Otherwise there is another line to be read. If the line read is invalid, you can still throw an exception. That is, move the file reading code into the next function and just check if another read will succeed. Potentially couple this with the buffering (check if there are still lines on the buffer first).

Typos
"Unable to parse messaeg row" contains a typo.

Question 11

The field next of the Iterator anonymous class should be private...

private Message next;

The narative of next method could get more expressive refactored from...

@Override
public Message next() {
 Message current = next;
 next = null;
 return current;
}

...to...

@Override
public Message next() {
 return message(next, null);
}
private Message message(Message returnee, Message next) {
 this.next = next;
 return returnee;
}

Question 12

I don't understand what's better in the suggested alternative.

TorbenPutkonen TorbenPutkonen 9,27921 silver badges29 bronze badges · Answer 1 · 2025-01-07 11:31:43Z

PostExceptionTasksHandler seems like an unnecessary type. It is basically a Runnable that is executed when an exception occurs. It would likely be more useful if you passed the exception that caused handler invocation to the handler itself. If you do that, the handler type could be replaced with Consumer<IOException> and that already contains enough information in it's name to suggest to the caller what it is used for.

If you have access to the MessageParserService.getMessageFromRowJson(String) method source code, I would change it to throw an exception on an invalid message instead of returning a null. That way you provide the caller with more information for deciding on how to handle the error. Of if you want to use it in a context where exceptions are not desired, maybe change the Message class so that it contains information about the validity so that you can return an error status from the conversion method.

MessageParserService seems like it's more of a mapper than an actual service. The method signature suggests that it converts a JSON string to a data object. These are of course just naming conventions and thus highly opinion based, but I prefer that classes that just convert data types to follow naming pattern like MessageMapper.from(String source). This has benefits because the name restricts the class' scope to creating Message objects and the method name clearly tells that it creates a Message from a String. Only having methods named "from" in your mapper helps you think twice before you start loading responsibilities to it.

Do not call System.exit() to handle an error unless you know exactly why you want to do that. Based on the fact that you describe the code as "skipping invalid entries" I would guess that you don't know. Because the System.exit doesn't skip invalid entries. Instead it does the equivalent of yanking the power cord from the computer when it encounters a spelling error... There are a lot of related questions on Stack Overflow about this. You can start from here: https://stackoverflow.com/questions/3715967/when-should-we-call-system-exit-in-java

For what it's worth, maybe you could use Files.lines(Path, Charset) method instead of writing a special iterator for this? Something like this:

Files.lines(path, charset)
 .map(parserService::getMessageFromRowJson)
 .filter(Objects::nonNull)
 .forEach([ message processing method ]);

Regarding the question about improving memory usage and performance. You didn't provide any information about how the memory usage is a problem and there is very little to be done with the code we see. If memory usage really is a problem, it is likely in the code that processes the Message objects after they have been read.

Regarding runtime performance you're better of with using Files.lines(...) as the generated stream is backed by a memory mapped file and it is very hard to improve the performance from that. You can even use a parallel stream with it to see if taking advantage of a multi-processor computer pays off in your use case.

But the improved runtime performance is paid off with increased memory consumption. There really is no end-all answer to this. Here is a good article about performance optimization: https://ericlippert.com/2012/12/17/performance-rant/

Sara J Sara J 4,21112 silver badges37 bronze badges · Answer 2 · 2025-01-07 00:08:26Z

7

\$\begingroup\$

I haven't looked too deeply at this, but I notice that the Iterator only generates the next value in hasNext, and next just assumes this has already been done. This means that if there are ever two next() calls without a hasNext() in between, the second one will always return null as opposed to the value it ought return. Instead of assuming every call to next will be preceded by a hasNext call, make it so that next may itself advance the generation if needed (one way to do this could be having it call hasNext. Breaking the advancing logic into a helper method which both next and hasNext individually call is another).

Additionally, Iterator::next() should throw a NoSuchElementException when the underlying collection is exhausted. This implementation doesn't seem to do this.

Finally, it seems we expect message parsing to indicate failure in two different ways - either by throwing an IOException or by returning null. The former is handled well. The latter is handled by forcibly terminating the whole entire program with System::exit, leaving the caller no chance to recover or even clean up. Prefer handling this case in a manner more like the way you handle the other exception path.

Share

edited Jan 7 at 0:15

answered Jan 7 at 0:08

Sara J's user avatar

Sara J Sara J

4,21112 silver badges37 bronze badges

\$\endgroup\$

16

\$\begingroup\$ I disagree, it's the programmer's job to ensure that hasNext has been called. It shouldn't be called in next again. If we're going for performance here, that would be an unnecessary function call if had been called beforehand. \$\endgroup\$

infinitezero
– infinitezero

2025年01月07日 19:15:21 +00:00
Commented Jan 7 at 19:15
1

\$\begingroup\$ @infinitezero I disagree that the use of the Iterator contract should be that rigid; it's not always the implementation's job to ensure the user can't shoot themselves in the foot. What I also disagree with, however, is the purpose of the next and hasNext methods that this implementation assumes. next should simply be the method that fetches the next element in the iterable, and hasNext should simply be a test method to see if calling next will succeed. As such, both OP's approach and this suggested alternative are incorrect implementations of the iterator pattern. \$\endgroup\$

Abion47
– Abion47

2025年01月07日 19:31:29 +00:00
Commented Jan 7 at 19:31
2

\$\begingroup\$ @infinitezero That's not a rule exclusive to C++ and it has nothing to do with performance. It's called the KISS rule (i.e. Keep It Simple, Stupid). And in this case, it's related to the Single-Responsibility Principle (a.k.a. the S in SOLID). next and hasNext have clearly defined roles in the iterator pattern, and having them perform additional duties outside of those roles makes them bloated, tightly-coupled, and difficult to understand and maintain. Ironically, your approach to take away the user's foot-gun is an excellent way to end up with error-prone spaghetti code. \$\endgroup\$

Abion47
– Abion47

2025年01月07日 19:55:46 +00:00
Commented Jan 7 at 19:55
2

\$\begingroup\$ @Abion47 @infinitezero Sure, hasNext should do as little work as possible to determine whether there exists a next value. But here "does a next value exist?" is equivalent to "does the rest of a file I have yet to fully read contain a line which will cause an arbitrary function to return a value?". Given that, I'm not sure how either of you envision a hasNext which provides a useful answer without causing the file to be read or the parser service to be called on at least some invocations. \$\endgroup\$

Sara J
– Sara J

2025年01月07日 20:27:28 +00:00
Commented Jan 7 at 20:27
2

\$\begingroup\$ This whole comment thread is useless because it centers on multiple incorrect assumptions on what the Iterator API requires. Please just go read the JavaDoc. Sara J's answer is correct and does not need these addendums. \$\endgroup\$

TorbenPutkonen
– TorbenPutkonen

2025年01月08日 06:00:41 +00:00
Commented Jan 8 at 6:00

| Show 11 more comments

infinitezero infinitezero 1,1105 silver badges14 bronze badges · Answer 3 · 2025-01-07 19:18:35Z

Disclaimer: I'm not a Java programmer (last time was ~14 years ago)

Buffer lines
It's a good choice to use a Buffered Reader here as FileIO is generally a bottleneck.

Don't terminate on non-critical errors
I would never expect that a function that's called hasNext has the ability to nuke the running application (System.exit(-1)). That's an absolute no-go.
Definitely prefer exceptions over program termination.

Don't loop if you don't want a loop It looks like this is just copy/pasted because why is there a while loop here? Either you terminate the program or you return from the function.

while ((messageRowJson = bufferedReader.readLine()) != null) {
 next = parserService.getMessageFromRowJson(messageRowJson);
 if (next == null) {
 logger.log(Level.SEVERE, "Unable to parse messaeg row");
 System.exit(-1);
 } else {
 return true;
 }
}

If you decide to implement the buffer system, use a for loop instead (for reading lines) or a while loop (for reading byte sizes).

User expectation
Typically the hasNext implementation should be very light. In your case it contains FileIO. Initially I found this weird but decided not to raise a comment about it, since you can't know if there is a nextValue unless when you read the file. Except that you can. The only way that there is no further data is when the file reader has encountered EOF. Otherwise there is another line to be read. If the line read is invalid, you can still throw an exception. That is, move the file reading code into the next function and just check if another read will succeed. Potentially couple this with the buffering (check if there are still lines on the buffer first).

Typos
"Unable to parse messaeg row" contains a typo.

user288812user288812 · Answer 4 · 2025-01-08 12:38:37Z

The field next of the Iterator anonymous class should be private...

private Message next;

The narative of next method could get more expressive refactored from...

@Override
public Message next() {
 Message current = next;
 next = null;
 return current;
}

...to...

@Override
public Message next() {
 return message(next, null);
}
private Message message(Message returnee, Message next) {
 this.next = next;
 return returnee;
}

I don't understand what's better in the suggested alternative.

Stack Exchange Network

Custom Iterator for Processing Large Files

4 Answers 4

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Custom Iterator for Processing Large Files

4 Answers 4

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions