1. Bug
If the file does not end with the line terminator, then the last line is lost. That's because the code says:
if not data:  # EOF
    break
but at this point there might still be an unterminated line remaining in chunk, so this needs to be:
if not data:  # EOF
    if chunk:
        yield chunk
    break
Perhaps in your case the file is required to always end with a line terminator, but if so it would be a good idea to assert this, rather than silently discarding data if the file does not meet the requirement:
if not data:  # EOF
    if chunk:
        raise ValueError("file does not end with line terminator")
    break
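To see the failure mode concretely, here is a minimal sketch of the pattern in question (a simplified stand-in with made-up names, not the original class), which silently drops the final unterminated line:

import io

def read_lines_buggy(file, terminator, chunk_size=4):
    chunk = ''
    while True:
        data = file.read(chunk_size)
        if not data:  # EOF: any unterminated tail left in chunk is discarded
            break
        chunk += data
        if terminator in chunk:
            *lines, chunk = chunk.split(terminator)
            yield from lines

print(list(read_lines_buggy(io.StringIO("a|b|c"), "|")))  # ['a', 'b'] -- 'c' is lost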
2. Separation of concerns
The code in the post has four concerns:
- Opening and closing a file.
- Generating chunks of 100 kilobytes from a file.
- Generating lines from a file with an arbitrary line terminator.
- Filtering out blank lines.
The principle of separation of concerns suggests that these should not be combined into one function. Taking the four concerns in turn:
1. As a general rule, if you have a function that operates on the contents of a file, it's better if the function takes a file-like object (not a filename) and leaves the opening and closing of the file up to the caller. That's because not all file-like objects correspond to named files. The caller might have a pipe from a call to subprocess.Popen, or they might have some data in memory wrapped as an io.StringIO. If an interface takes a filename and opens the file itself, then it can't be used with these other kinds of file-like object.

2. Reading fixed-size chunks from a file is already implemented by the read method.

3. This is fine, but see below for some improvements.

4. Filtering out blank lines might not be needed or wanted, so it should be left to the caller to decide whether to do it. It is in any case easy to implement using an if statement, or a comprehension, or filter, as the sketch after this list shows.
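For example, any of these would drop blank lines from an iterable of lines (a sketch with made-up data, assuming lines is whatever iterable the caller has):

lines = ["first", "", "second", ""]

# Using an if statement:
for line in lines:
    if line:
        print(line)

# Using a comprehension:
non_blank = [line for line in lines if line]

# Using filter (a None predicate keeps only the truthy items):
non_blank = list(filter(None, lines))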
3. Other points
There are no docstrings.
When you have a class with __init__ and one other method, then there is usually no need for a class: what you want is a function. (See Jack Diederich's PyCon 2012 talk "Stop Writing Classes" for a detailed look at this issue.)
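As an illustration (with invented names, not the code from the post), a class like this:

class LineReader:
    def __init__(self, filename):
        self.filename = filename

    def read_all(self):
        with open(self.filename) as f:
            return f.read()

is usually better expressed as a plain function:

def read_all(filename):
    with open(filename) as f:
        return f.read()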
100 ×ばつ 1024 bytes is one hundred kilobytes (not megabytes).

Instead of having a keyword argument with a default value of None:

def read(..., chunk_size=None):
and then later setting it to a particular value if it tests false:
one_hundred_megabytes = 100 * 1024
chunk_size = chunk_size or one_hundred_megabytes
just specify the keyword argument with the default value you want:
def read(..., chunk_size=100 * 1024):
The code appends to chunk before testing for the line terminator:

chunk += data
# ...
elif self._lineterminator in chunk:
but we know that if the line terminator appears, it must appear in the new part of the chunk (the part that we just read): it can't appear in the old part, because then it would have been split on the previous iteration. This means that if multiple chunks need to be read before the line terminator appears, the initial part of the line gets searched again each time, and the prefix being re-searched grows by one chunk on every iteration. This leads to quadratic runtime complexity when lines are much longer than the chunk size.
For efficiency, we should test self._lineterminator in data instead.
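One rough way to see the difference (a standalone sketch with invented names, not the code from the post; the sizes are arbitrary) is to time a search over the whole accumulated chunk against a search over only the newly read data:

import io
import timeit

def scan_whole_chunk(file, terminator, chunk_size=4096):
    # mimics the pattern in the post: the accumulated chunk is searched each time
    chunk = ''
    while True:
        data = file.read(chunk_size)
        if not data:
            return
        chunk += data
        if terminator in chunk:  # re-scans the old prefix on every iteration
            return

def scan_new_data(file, terminator, chunk_size=4096):
    chunk = ''
    while True:
        data = file.read(chunk_size)
        if not data:
            return
        chunk += data
        if terminator in data:  # searches only the part just read
            return

# one long "line" of two million characters, terminator only at the very end
text = 'x' * 2_000_000 + '|'
for search in (scan_whole_chunk, scan_new_data):
    print(search.__name__, timeit.timeit(lambda: search(io.StringIO(text), '|'), number=1))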
The code tests for the line terminator appearing in the chunk and then splits the chunk if it is found:

elif self._lineterminator in chunk:
    lines = chunk.split(self._lineterminator)
but this has to search the chunk twice, once for the in operator and again for the split. It would be faster to split first and then see how many pieces there are:
lines = chunk.split(self._lineterminator)
if len(lines) > 1:
Combining the previous two points, we should split data first and append to chunk afterwards:

first, *rest = data.split(self._lineterminator)
chunk += first
if rest:
    yield chunk
    yield from rest[:-1]
    chunk = rest[-1]
I think the names would be clearer if chunk were renamed part (because it is part of a line), and data were renamed chunk (because it is a chunk of text read from the file).
4. Revised code
def splitlines(file, newline, chunk_size=4096):
    """Generate lines from file (a file-like object), where a line is
    followed by the string newline (or the end of the file). Optional
    argument chunk_size is the size of the chunks read from the file.
    """
    part = ''
    while True:
        chunk = file.read(chunk_size)
        if not chunk:
            if part:
                yield part
            break
        first, *rest = chunk.split(newline)
        part += first
        if rest:
            yield part
            yield from rest[:-1]
            part = rest[-1]
The caller should write something like:
with open(longabstract_file) as f:
    for line in splitlines(f, "\tl\n"):
        if line:  # ignore blank lines
            print(line)
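A quick self-contained check of the revised generator (using io.StringIO in place of a real file; the text and chunk size here are made up for the demonstration):

import io

text = "first line\tl\nsecond line\tl\n\tl\nlast line without terminator"
print(list(splitlines(io.StringIO(text), "\tl\n", chunk_size=8)))
# ['first line', 'second line', '', 'last line without terminator']

Note that the blank line comes out as an empty string, which is why the caller filters on if line.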