I frequently work with large files and implemented a custom iterator to process these files efficiently while minimizing memory usage. The iterator reads each line from a file, parses it into a Message object, and skips invalid entries.
Are there any optimizations or best practices I can apply to this custom iterable implementation to further reduce memory usage or improve its performance? Below is the implementation for reference:
public class MessageIterable implements Iterable<Message> {
private final Logger logger = MessageFileProcessorSingleton.getInstance().getLog();
private final BufferedReader bufferedReader;
private final MessageParserService parserService;
private final PostExceptionTasksHandler postExceptionTasksHandler;
public MessageIterable(
BufferedReader bufferedReader,
PostExceptionTasksHandler postExceptionTasksHandler,
MessageParserService service) {
this.bufferedReader = bufferedReader;
this.postExceptionTasksHandler = postExceptionTasksHandler;
this.parserService = service;
}
@Override
public Iterator<Message> iterator() {
return new Iterator<>() {
Message next;
@Override
public boolean hasNext() {
if (next != null) {
return true;
}
try {
String messageRowJson;
while ((messageRowJson = bufferedReader.readLine()) != null) {
next = parserService.getMessageFromRowJson(messageRowJson);
if (next == null) {
logger.log(Level.SEVERE, "Unable to parse messaeg row");
System.exit(-1);
} else {
return true;
}
}
} catch (IOException e) {
handleInvalidMessage();
}
return false;
}
@Override
public Message next() {
Message current = next;
next = null;
return current;
}
};
}
private void handleInvalidMessage() {
logger.warn("Invalid message encountered. Executing post-exception tasks.");
postExceptionTasksHandler.handle();
}
-
1\$\begingroup\$ It just occurred to me that are you perhaps processing JSON-formatted log files? If you can describe some reasons for malformed messages we may be able to figure out a better way to handle them (see en.wikipedia.org/wiki/XY_problem ). \$\endgroup\$TorbenPutkonen– TorbenPutkonen2025年01月08日 10:30:30 +00:00Commented Jan 8 at 10:30
4 Answers 4
PostExceptionTasksHandler
seems like an unnecessary type. It is basically a Runnable
that is executed when an exception occurs. It would likely be more useful if you passed the exception that caused handler invocation to the handler itself. If you do that, the handler type could be replaced with Consumer<IOException>
and that already contains enough information in it's name to suggest to the caller what it is used for.
If you have access to the MessageParserService.getMessageFromRowJson(String)
method source code, I would change it to throw an exception on an invalid message instead of returning a null. That way you provide the caller with more information for deciding on how to handle the error. Of if you want to use it in a context where exceptions are not desired, maybe change the Message
class so that it contains information about the validity so that you can return an error status from the conversion method.
MessageParserService
seems like it's more of a mapper than an actual service. The method signature suggests that it converts a JSON string to a data object. These are of course just naming conventions and thus highly opinion based, but I prefer that classes that just convert data types to follow naming pattern like MessageMapper.from(String source)
. This has benefits because the name restricts the class' scope to creating Message
objects and the method name clearly tells that it creates a Message
from a String
. Only having methods named "from" in your mapper helps you think twice before you start loading responsibilities to it.
Do not call System.exit()
to handle an error unless you know exactly why you want to do that. Based on the fact that you describe the code as "skipping invalid entries" I would guess that you don't know. Because the System.exit doesn't skip invalid entries. Instead it does the equivalent of yanking the power cord from the computer when it encounters a spelling error... There are a lot of related questions on Stack Overflow about this. You can start from here: https://stackoverflow.com/questions/3715967/when-should-we-call-system-exit-in-java
For what it's worth, maybe you could use Files.lines(Path, Charset) method instead of writing a special iterator for this? Something like this:
Files.lines(path, charset)
.map(parserService::getMessageFromRowJson)
.filter(Objects::nonNull)
.forEach([ message processing method ]);
Regarding the question about improving memory usage and performance. You didn't provide any information about how the memory usage is a problem and there is very little to be done with the code we see. If memory usage really is a problem, it is likely in the code that processes the Message
objects after they have been read.
Regarding runtime performance you're better of with using Files.lines(...)
as the generated stream is backed by a memory mapped file and it is very hard to improve the performance from that. You can even use a parallel stream with it to see if taking advantage of a multi-processor computer pays off in your use case.
But the improved runtime performance is paid off with increased memory consumption. There really is no end-all answer to this. Here is a good article about performance optimization: https://ericlippert.com/2012/12/17/performance-rant/
I haven't looked too deeply at this, but I notice that the Iterator
only generates the next value in hasNext
, and next
just assumes this has already been done. This means that if there are ever two next()
calls without a hasNext()
in between, the second one will always return null
as opposed to the value it ought return. Instead of assuming every call to next
will be preceded by a hasNext
call, make it so that next
may itself advance the generation if needed (one way to do this could be having it call hasNext
. Breaking the advancing logic into a helper method which both next
and hasNext
individually call is another).
Additionally, Iterator::next()
should throw a NoSuchElementException
when the underlying collection is exhausted. This implementation doesn't seem to do this.
Finally, it seems we expect message parsing to indicate failure in two different ways - either by throwing an IOException
or by returning null
. The former is handled well. The latter is handled by forcibly terminating the whole entire program with System::exit
, leaving the caller no chance to recover or even clean up. Prefer handling this case in a manner more like the way you handle the other exception path.
-
\$\begingroup\$ I disagree, it's the programmer's job to ensure that
hasNext
has been called. It shouldn't be called innext
again. If we're going for performance here, that would be an unnecessary function call if had been called beforehand. \$\endgroup\$infinitezero– infinitezero2025年01月07日 19:15:21 +00:00Commented Jan 7 at 19:15 -
1\$\begingroup\$ @infinitezero I disagree that the use of the
Iterator
contract should be that rigid; it's not always the implementation's job to ensure the user can't shoot themselves in the foot. What I also disagree with, however, is the purpose of thenext
andhasNext
methods that this implementation assumes.next
should simply be the method that fetches the next element in the iterable, andhasNext
should simply be a test method to see if callingnext
will succeed. As such, both OP's approach and this suggested alternative are incorrect implementations of the iterator pattern. \$\endgroup\$Abion47– Abion472025年01月07日 19:31:29 +00:00Commented Jan 7 at 19:31 -
2\$\begingroup\$ @infinitezero That's not a rule exclusive to C++ and it has nothing to do with performance. It's called the KISS rule (i.e. Keep It Simple, Stupid). And in this case, it's related to the Single-Responsibility Principle (a.k.a. the S in SOLID).
next
andhasNext
have clearly defined roles in the iterator pattern, and having them perform additional duties outside of those roles makes them bloated, tightly-coupled, and difficult to understand and maintain. Ironically, your approach to take away the user's foot-gun is an excellent way to end up with error-prone spaghetti code. \$\endgroup\$Abion47– Abion472025年01月07日 19:55:46 +00:00Commented Jan 7 at 19:55 -
2\$\begingroup\$ @Abion47 @infinitezero Sure,
hasNext
should do as little work as possible to determine whether there exists a next value. But here "does a next value exist?" is equivalent to "does the rest of a file I have yet to fully read contain a line which will cause an arbitrary function to return a value?". Given that, I'm not sure how either of you envision ahasNext
which provides a useful answer without causing the file to be read or the parser service to be called on at least some invocations. \$\endgroup\$Sara J– Sara J2025年01月07日 20:27:28 +00:00Commented Jan 7 at 20:27 -
2\$\begingroup\$ This whole comment thread is useless because it centers on multiple incorrect assumptions on what the Iterator API requires. Please just go read the JavaDoc. Sara J's answer is correct and does not need these addendums. \$\endgroup\$TorbenPutkonen– TorbenPutkonen2025年01月08日 06:00:41 +00:00Commented Jan 8 at 6:00
Disclaimer: I'm not a Java programmer (last time was ~14 years ago)
Buffer lines
It's a good choice to use a Buffered Reader here as FileIO is generally a bottleneck.
Don't terminate on non-critical errors
I would never expect that a function that's called hasNext
has the ability to nuke the running application (System.exit(-1)
). That's an absolute no-go.
Definitely prefer exceptions over program termination.
Don't loop if you don't want a loop It looks like this is just copy/pasted because why is there a while loop here? Either you terminate the program or you return from the function.
while ((messageRowJson = bufferedReader.readLine()) != null) {
next = parserService.getMessageFromRowJson(messageRowJson);
if (next == null) {
logger.log(Level.SEVERE, "Unable to parse messaeg row");
System.exit(-1);
} else {
return true;
}
}
If you decide to implement the buffer system, use a for loop instead (for reading lines) or a while loop (for reading byte sizes).
User expectation
Typically the hasNext
implementation should be very light. In your case it contains FileIO. Initially I found this weird but decided not to raise a comment about it, since you can't know if there is a nextValue unless when you read the file. Except that you can. The only way that there is no further data is when the file reader has encountered EOF. Otherwise there is another line to be read. If the line read is invalid, you can still throw an exception. That is, move the file reading code into the next function and just check if another read will succeed. Potentially couple this with the buffering (check if there are still lines on the buffer first).
Typos
"Unable to parse messaeg row"
contains a typo.
The field next
of the Iterator
anonymous class should be private...
private Message next;
The narative of next
method could get more expressive refactored from...
@Override
public Message next() {
Message current = next;
next = null;
return current;
}
...to...
@Override
public Message next() {
return message(next, null);
}
private Message message(Message returnee, Message next) {
this.next = next;
return returnee;
}
-
\$\begingroup\$ I don't understand what's better in the suggested alternative. \$\endgroup\$Ron Klein– Ron Klein2025年01月18日 14:06:06 +00:00Commented Jan 18 at 14:06