I have a script - generator that accepts a file name or file object and a list of search words as input. The generator searches through the lines of the file and returns only those lines (the whole line) where at least one of the search words is found.
from io import TextIOBase
from typing import TextIO
import mimetypes
def find_lines(object_or_filename: TextIO | str, search_words: list) -> str:
if isinstance(object_or_filename, str) and mimetypes.guess_type(object_or_filename)[0] == 'text/plain':
with open(object_or_filename, 'r', encoding="utf-8") as file:
for line in file:
lower_line = line.lower()
if any(word.lower() in lower_line.split() for word in search_words):
yield line.strip()
elif isinstance(object_or_filename, TextIOBase):
for line in object_or_filename:
lower_line = line.lower()
if any(word.lower() in lower_line.split() for word in search_words):
yield line.strip()
Is there a way to optimize this script and avoid code duplication of loop:
for line in file:
lower_line = line.lower()
if any(word.lower() in lower_line.split() for word in search_words):
yield line.strip()
I want that this loop was only used once but don't know how.
-
\$\begingroup\$ On the shell, this is just grep. Do these strings get passed to another section of Python? \$\endgroup\$Reinderien– Reinderien2023年10月02日 12:57:44 +00:00Commented Oct 2, 2023 at 12:57
-
\$\begingroup\$ @Reinderien No, it is just generator itself. \$\endgroup\$zaelcovsky– zaelcovsky2023年10月03日 06:21:44 +00:00Commented Oct 3, 2023 at 6:21
6 Answers 6
You can extract the duplicated code in a function. Since your function is a generator, you can delegate the operations to another generator using yield from
.
Some other remarks:
- the name
object_or_filename
is imprecise, as you don't handle any generic object but text IO streams and filenames only.text_stream_or_filename
would be a better choice. - Since you specifically check the argument type, I would expect unsupported types to throw a
TypeError
instead of silently returningNone
. - You should split the file-name check over 2 lines, avoiding a very long line.
- You can specify that
search_word
expects a list of strings usingTyping.List
. Passing other types in the list doesn't cause issues, but this helps conveying your function's intent.
Suggested code:
from io import TextIOBase
from typing import List, TextIO
import mimetypes
def find_lines(text_stream_or_filename: TextIO | str, search_words: List[str]) -> str:
if (isinstance(text_stream_or_filename, str)
and mimetypes.guess_type(text_stream_or_filename)[0] == 'text/plain'):
with open(text_stream_or_filename, 'r', encoding="utf-8") as file:
yield from _find_lines_textIO(file, search_words)
elif isinstance(text_stream_or_filename, TextIOBase):
yield from _find_lines_textIO(text_stream_or_filename, search_words)
else:
raise TypeError('argument must be a text stream or a text file name')
def _find_lines_textIO(text_stream: TextIO, search_words: list):
for line in text_stream:
lower_line = line.lower()
if any(word.lower() in lower_line.split() for word in search_words):
yield line.strip()
-
\$\begingroup\$ thank yo for answering! Shoud I also specify
List[str]
in adef _find_lines_textIO
function? \$\endgroup\$zaelcovsky– zaelcovsky2023年10月03日 06:47:35 +00:00Commented Oct 3, 2023 at 6:47 -
1\$\begingroup\$ @zaelcovsky The leading underscore in the function name is a convention indicating that it is intended for internal use only ("private"). As such, type hinting isn't nearly as important, as it shouldn't be used directly, and IDEs should hide it. It wouldn't hurt to update the type hint if you want to, but in this context it would be acceptable to remove them as well, or leave it as it is. It's a matter of preference. \$\endgroup\$gazoh– gazoh2023年10月03日 07:53:34 +00:00Commented Oct 3, 2023 at 7:53
-
1\$\begingroup\$
Iterable[str]
fromtyping
(or fromcollections.abc
for newer versions of python) would be a better type hint forsearch_words
. Technicallystr
is itself anIterable[str]
, but invokingfind_lines(foo, 'ai')
would produce valid results. You could narrow it toSequence[str]
to showsearch_words
is expected to be a collection ofstr
rather than a singlestr
, but typically the more general the better. If the function did aisinstance(search_words, list)
or called specificlist
methods, then yesList[str]
would be more appropriate. \$\endgroup\$Jonny Henly– Jonny Henly2023年10月03日 22:26:07 +00:00Commented Oct 3, 2023 at 22:26
You can also use contextlib.nullcontext()
. The documentation gives an example similar to your situation.
If the file_or_path argument is a str or PathLike (e.g. pathlib.Path) object, it gets opened. An open file is a context manager that can be used in a with
statement to automatically close the file when the with
block ends.
If the file_or_path argument is an already opened file, we presume the caller will close the file, so use a nullcontext
. In the with
statement, a nullcontext
returns its argument, but otherwise does nothing.
Just convert search_words
to lowercase once, not for every line in the file.
import mimetypes
from contextlib import nullcontext
from io import TextIOBase
from os import PathLike
from typing import TextIO
def find_lines(file_or_path: TextIO | str | PathLike, search_words: list) -> str:
"""A generator that yields all lines in file_or_path that
include at least one word from search_words.
"""
# set up the file or path as a context manager
if isinstance(file_or_path, (str, PathLike)):
mimetype, _ = mimetypes.guess_type(file_or_path)
if mimetype != 'text/plain':
raise ValueError(f"Expected a plain text file, got a '{mimetype}' file.")
readable = open(file_or_path, 'r', encoding="utf-8")
elif isinstance(file_or_path, TextIOBase):
readable = nullcontext(file_or_path)
else:
raise TypeError("Expected a text file or a path to one.")
# convert words to lowercase just once
words = [word.lower() for word in search_words]
# use the context manager from above
with readable as file:
for line in file:
lower_line = line.lower()
if any(word in lower_line for word in words):
yield line.strip()
Depending on the length of search_words
and the size of the file, this may be an inefficient way to find matching lines.
-
\$\begingroup\$ thank you! I found solution with
contextlib.nullcontext
too. Why we need second variable_
in your code herewith mimetype, _ = mimetypes.guess_type(file_or_path)
? Also there is redundant:
on line 19. \$\endgroup\$zaelcovsky– zaelcovsky2023年10月03日 08:09:27 +00:00Commented Oct 3, 2023 at 8:09 -
\$\begingroup\$ and what context manager do with TextIO? Can this object be closed like file object? \$\endgroup\$zaelcovsky– zaelcovsky2023年10月03日 08:14:16 +00:00Commented Oct 3, 2023 at 8:14
-
1\$\begingroup\$ @zaelcovsky By convention, the variable
_
is used to indicate we don't care what the value is.mimetype, _ = ...
tells someone reading the code thatguess_type()
returns two items. The first is the mimetype and we don't care about the second. \$\endgroup\$RootTwo– RootTwo2023年10月03日 16:50:48 +00:00Commented Oct 3, 2023 at 16:50 -
1\$\begingroup\$ @zaelcovsky
nullcontext(file_or_path)
returns a context manager. It saves the argument passed to it, if any. Here it isfile_or_path
. Then in the linewith reader as file
, the argument,file_or_path
, then gets returned as the value offile
. Other than that,nullcontext()
doesn't do anything. \$\endgroup\$RootTwo– RootTwo2023年10月03日 16:57:55 +00:00Commented Oct 3, 2023 at 16:57
"Conditional with statement in Python" explains how you can make
with open(object_or_filename, 'r', encoding="utf-8") as file:
optional with contextlib.nullcontext
.
Python 3.7 introduced contextlib.nullcontext
:
from contextlib import nullcontext with get_stuff() if needs_with() else nullcontext() as gs: # do nearly the same large block of stuff, # involving gs or not, depending on needs_with()
If needs_with()
is False
then gs
will be None
. To make gs
be something_else
, you use nullcontext(something_else)
instead of nullcontext()
.
As such you can DRY the repeated code.
from io import TextIOBase
from typing import List, TextIO
import mimetypes
from contextlib import nullcontext
def find_lines(text_stream_or_filename: TextIO | str, search_words: List[str]) -> str:
if isinstance(text_stream_or_filename, str) and mimetypes.guess_type(text_stream_or_filename)[0] == 'text/plain' or \
isinstance(text_stream_or_filename, TextIOBase):
with open(text_stream_or_filename, 'r', encoding="utf-8") if (isinstance(text_stream_or_filename, str)) else \
nullcontext(text_stream_or_filename) as text:
for line in text:
lower_line = line.lower()
if any(word.lower() in lower_line.split() for word in search_words):
yield line.strip()
-
1\$\begingroup\$ This does not qualify as an answer on CodeReview, since it needs to speak on your original code and not only offer an alternative solution. \$\endgroup\$Reinderien– Reinderien2023年10月03日 15:11:18 +00:00Commented Oct 3, 2023 at 15:11
-
\$\begingroup\$ Unfortunately some people see words like "alternative solution" and jump into tone police mode. I have edited your answer to reduce the likelihood of visceral reactions. Your answer has an IO and one I didn't know existed, thanks for teaching my something new in Python. \$\endgroup\$2023年10月03日 18:06:44 +00:00Commented Oct 3, 2023 at 18:06
Your return typehint is incorrect because you aren't returning a string; you're returning an iterator of strings.
The difficulty you're encountering is a symptom of trying to be everything for everyone. Python often devolves into "poor man's function overloading" like this; the most criminal extremes are seen in APIs like Pandas and matplotlib where functions can accept dozens of different signatures that have to be figured out in runtime and are difficult or impossible to exhaustively document (and test!)
So do yourself a favour. Don't do what you're doing. Accept one and only one thing, a TextIO
file-like, and do whatever I/O operations you need to do outside of this function.
Use sets for word list checks.
Suggested
from typing import TextIO, Iterator
def find_lines(filelike: TextIO, search_words: set[str]) -> Iterator[str]:
for line in filelike:
if search_words & set(line.lower().split()):
yield line.strip()
-
\$\begingroup\$ I agree with the "trying to be everything for everyone" statement. But you could make life a little easier with 1 extra line of code before the for loop,
search_words = set(search_words)
and type hintsearch_words
asIterable[str]
. Then a user wouldn't run into an error related to the&
operator when they didn't realize they needed to invokefind_lines(foo, set(bar))
. \$\endgroup\$Jonny Henly– Jonny Henly2023年10月03日 22:39:43 +00:00Commented Oct 3, 2023 at 22:39
Reinderien's answer reminded me of functools.singledispatch
, which provides a mechanism for overloading a function based on the type of the first argument.
import mimetypes
from functools import singledispatch
from os import PathLike
from typing import Iterable, TextIO
@singledispatch
def find_lines(file: TextIO, search_words: list) -> Iterable[str]:
"""A generator that yields all lines in a file that
include at least one word from search_words.
This is the base function that takes an open file.
"""
for line in file:
lower_line = line.lower()
if any(word in lower_line for word in search_words):
yield line.strip()
@find_lines.register
def find_lines_path(path: str | PathLike, search_words: list) -> Iterable[str]:
"""This is the version of find_files that takes a path as a string
or a PathLike object (e.g. pathlib.Path). It opens the file, calls the
version of the function that accepts a file, then closes the file.
"""
mimetype, _ = mimetypes.guess_type(path)
if mimetype != 'text/plain':
raise ValueError(f"Expected a plain text file, got a '{mimetype}' file.")
with open(path) as file:
yield from find_lines(file, search_words)
Aho-Corasick
In terms of actually searching a list of words in a line, you want to learn about the Aho-Corasick which is... precisely made for that.
The key idea is to build a state-machine which matches all needles at once with a single pass over the haystack, rather than repeatedly searching each needle.
The simplest way to use it... is to use a regex and hope it uses Aho-Corasick under the hood1. As a bonus, you can also match in case-insensitive mode, which should be faster than first converting to lower-case then matching:
regex = re.compile('|'.join(search_words), re.IGNORECASE)
for line in file:
if regex.match(line):
yield line.strip()
Otherwise, you should have a look at pypy. There's many implementations of the algorithm to pick from, or use as inspiration to make your own.
1 I could not verify whether Python's default regex engine did, sorry.