Generate all lines from a file matching a given word list

Question 1

I have a script - generator that accepts a file name or file object and a list of search words as input. The generator searches through the lines of the file and returns only those lines (the whole line) where at least one of the search words is found.

from io import TextIOBase
from typing import TextIO
import mimetypes
def find_lines(object_or_filename: TextIO | str, search_words: list) -> str:
 if isinstance(object_or_filename, str) and mimetypes.guess_type(object_or_filename)[0] == 'text/plain':
 with open(object_or_filename, 'r', encoding="utf-8") as file:
 for line in file:
 lower_line = line.lower()
 if any(word.lower() in lower_line.split() for word in search_words):
 yield line.strip()
 elif isinstance(object_or_filename, TextIOBase):
 for line in object_or_filename:
 lower_line = line.lower()
 if any(word.lower() in lower_line.split() for word in search_words):
 yield line.strip()

Is there a way to optimize this script and avoid code duplication of loop:

 for line in file:
 lower_line = line.lower()
 if any(word.lower() in lower_line.split() for word in search_words):
 yield line.strip()

I want that this loop was only used once but don't know how.

Question 2

On the shell, this is just grep. Do these strings get passed to another section of Python?

Question 3

@Reinderien No, it is just generator itself.

Question 4

You can extract the duplicated code in a function. Since your function is a generator, you can delegate the operations to another generator using yield from.

Some other remarks:

the name object_or_filename is imprecise, as you don't handle any generic object but text IO streams and filenames only. text_stream_or_filename would be a better choice.
Since you specifically check the argument type, I would expect unsupported types to throw a TypeError instead of silently returning None.
You should split the file-name check over 2 lines, avoiding a very long line.
You can specify that search_word expects a list of strings using Typing.List. Passing other types in the list doesn't cause issues, but this helps conveying your function's intent.

Suggested code:

from io import TextIOBase
from typing import List, TextIO
import mimetypes
def find_lines(text_stream_or_filename: TextIO | str, search_words: List[str]) -> str:
 if (isinstance(text_stream_or_filename, str)
 and mimetypes.guess_type(text_stream_or_filename)[0] == 'text/plain'):
 with open(text_stream_or_filename, 'r', encoding="utf-8") as file:
 yield from _find_lines_textIO(file, search_words)
 elif isinstance(text_stream_or_filename, TextIOBase):
 yield from _find_lines_textIO(text_stream_or_filename, search_words)
 else:
 raise TypeError('argument must be a text stream or a text file name')
def _find_lines_textIO(text_stream: TextIO, search_words: list):
 for line in text_stream:
 lower_line = line.lower()
 if any(word.lower() in lower_line.split() for word in search_words):
 yield line.strip()

Question 5

thank yo for answering! Shoud I also specify List[str] in a def _find_lines_textIO function?

Question 6

@zaelcovsky The leading underscore in the function name is a convention indicating that it is intended for internal use only ("private"). As such, type hinting isn't nearly as important, as it shouldn't be used directly, and IDEs should hide it. It wouldn't hurt to update the type hint if you want to, but in this context it would be acceptable to remove them as well, or leave it as it is. It's a matter of preference.

Question 7

Iterable[str] from typing (or from collections.abc for newer versions of python) would be a better type hint for search_words. Technically str is itself an Iterable[str], but invoking find_lines(foo, 'ai') would produce valid results. You could narrow it to Sequence[str] to show search_words is expected to be a collection of str rather than a single str, but typically the more general the better. If the function did a isinstance(search_words, list) or called specific list methods, then yes List[str] would be more appropriate.

Question 8

You can also use contextlib.nullcontext(). The documentation gives an example similar to your situation.

If the file_or_path argument is a str or PathLike (e.g. pathlib.Path) object, it gets opened. An open file is a context manager that can be used in a with statement to automatically close the file when the with block ends.

If the file_or_path argument is an already opened file, we presume the caller will close the file, so use a nullcontext. In the with statement, a nullcontext returns its argument, but otherwise does nothing.

Just convert search_words to lowercase once, not for every line in the file.

import mimetypes
from contextlib import nullcontext
from io import TextIOBase
from os import PathLike
from typing import TextIO
def find_lines(file_or_path: TextIO | str | PathLike, search_words: list) -> str:
 """A generator that yields all lines in file_or_path that
 include at least one word from search_words.
 """
 
 # set up the file or path as a context manager
 if isinstance(file_or_path, (str, PathLike)):
 mimetype, _ = mimetypes.guess_type(file_or_path)
 if mimetype != 'text/plain':
 raise ValueError(f"Expected a plain text file, got a '{mimetype}' file.")
 
 readable = open(file_or_path, 'r', encoding="utf-8")
 
 elif isinstance(file_or_path, TextIOBase):
 
 readable = nullcontext(file_or_path)
 
 else:
 raise TypeError("Expected a text file or a path to one.")
 
 # convert words to lowercase just once
 words = [word.lower() for word in search_words]
 
 # use the context manager from above
 with readable as file:
 for line in file:
 lower_line = line.lower()
 if any(word in lower_line for word in words):
 yield line.strip()

Depending on the length of search_words and the size of the file, this may be an inefficient way to find matching lines.

Question 9

thank you! I found solution with contextlib.nullcontext too. Why we need second variable _ in your code here with mimetype, _ = mimetypes.guess_type(file_or_path) ? Also there is redundant : on line 19.

Question 10

and what context manager do with TextIO? Can this object be closed like file object?

Question 11

@zaelcovsky By convention, the variable _ is used to indicate we don't care what the value is. mimetype, _ = ... tells someone reading the code that guess_type() returns two items. The first is the mimetype and we don't care about the second.

Question 12

@zaelcovsky nullcontext(file_or_path) returns a context manager. It saves the argument passed to it, if any. Here it is file_or_path. Then in the line with reader as file, the argument, file_or_path, then gets returned as the value of file. Other than that, nullcontext() doesn't do anything.

Question 13

"Conditional with statement in Python" explains how you can make

with open(object_or_filename, 'r', encoding="utf-8") as file:

optional with contextlib.nullcontext.

Python 3.7 introduced contextlib.nullcontext:

from contextlib import nullcontext
with get_stuff() if needs_with() else nullcontext() as gs:
 # do nearly the same large block of stuff,
 # involving gs or not, depending on needs_with()

If needs_with() is False then gs will be None. To make gs be something_else, you use nullcontext(something_else) instead of nullcontext().

As such you can DRY the repeated code.

from io import TextIOBase
from typing import List, TextIO
import mimetypes
from contextlib import nullcontext
def find_lines(text_stream_or_filename: TextIO | str, search_words: List[str]) -> str:
 if isinstance(text_stream_or_filename, str) and mimetypes.guess_type(text_stream_or_filename)[0] == 'text/plain' or \
 isinstance(text_stream_or_filename, TextIOBase):
 with open(text_stream_or_filename, 'r', encoding="utf-8") if (isinstance(text_stream_or_filename, str)) else \
 nullcontext(text_stream_or_filename) as text:
 for line in text:
 lower_line = line.lower()
 if any(word.lower() in lower_line.split() for word in search_words):
 yield line.strip()

Question 14

This does not qualify as an answer on CodeReview, since it needs to speak on your original code and not only offer an alternative solution.

Question 15

Unfortunately some people see words like "alternative solution" and jump into tone police mode. I have edited your answer to reduce the likelihood of visceral reactions. Your answer has an IO and one I didn't know existed, thanks for teaching my something new in Python.

Question 16

Your return typehint is incorrect because you aren't returning a string; you're returning an iterator of strings.

The difficulty you're encountering is a symptom of trying to be everything for everyone. Python often devolves into "poor man's function overloading" like this; the most criminal extremes are seen in APIs like Pandas and matplotlib where functions can accept dozens of different signatures that have to be figured out in runtime and are difficult or impossible to exhaustively document (and test!)

So do yourself a favour. Don't do what you're doing. Accept one and only one thing, a TextIO file-like, and do whatever I/O operations you need to do outside of this function.

Use sets for word list checks.

Suggested

from typing import TextIO, Iterator
def find_lines(filelike: TextIO, search_words: set[str]) -> Iterator[str]:
 for line in filelike:
 if search_words & set(line.lower().split()):
 yield line.strip()

Question 17

I agree with the "trying to be everything for everyone" statement. But you could make life a little easier with 1 extra line of code before the for loop, search_words = set(search_words) and type hint search_words as Iterable[str]. Then a user wouldn't run into an error related to the & operator when they didn't realize they needed to invoke find_lines(foo, set(bar)).

Question 18

Reinderien's answer reminded me of functools.singledispatch, which provides a mechanism for overloading a function based on the type of the first argument.

import mimetypes
from functools import singledispatch
from os import PathLike
from typing import Iterable, TextIO
@singledispatch
def find_lines(file: TextIO, search_words: list) -> Iterable[str]:
 """A generator that yields all lines in a file that
 include at least one word from search_words.
 This is the base function that takes an open file.
 """
 
 for line in file:
 lower_line = line.lower()
 if any(word in lower_line for word in search_words):
 yield line.strip()
 
@find_lines.register
def find_lines_path(path: str | PathLike, search_words: list) -> Iterable[str]:
 """This is the version of find_files that takes a path as a string
 or a PathLike object (e.g. pathlib.Path). It opens the file, calls the 
 version of the function that accepts a file, then closes the file.
 """
 
 mimetype, _ = mimetypes.guess_type(path)
 if mimetype != 'text/plain':
 raise ValueError(f"Expected a plain text file, got a '{mimetype}' file.")
 
 with open(path) as file:
 yield from find_lines(file, search_words)

Question 19

Aho-Corasick

In terms of actually searching a list of words in a line, you want to learn about the Aho-Corasick which is... precisely made for that.

The key idea is to build a state-machine which matches all needles at once with a single pass over the haystack, rather than repeatedly searching each needle.

The simplest way to use it... is to use a regex and hope it uses Aho-Corasick under the hood¹. As a bonus, you can also match in case-insensitive mode, which should be faster than first converting to lower-case then matching:

 regex = re.compile('|'.join(search_words), re.IGNORECASE)
 for line in file:
 if regex.match(line):
 yield line.strip()

Otherwise, you should have a look at pypy. There's many implementations of the algorithm to pick from, or use as inspiration to make your own.

¹ I could not verify whether Python's default regex engine did, sorry.

gazoh gazoh 3,39910 silver badges26 bronze badges · Answer 1 · 2023-10-02 09:49:23Z

You can extract the duplicated code in a function. Since your function is a generator, you can delegate the operations to another generator using yield from.

Some other remarks:

the name object_or_filename is imprecise, as you don't handle any generic object but text IO streams and filenames only. text_stream_or_filename would be a better choice.
Since you specifically check the argument type, I would expect unsupported types to throw a TypeError instead of silently returning None.
You should split the file-name check over 2 lines, avoiding a very long line.
You can specify that search_word expects a list of strings using Typing.List. Passing other types in the list doesn't cause issues, but this helps conveying your function's intent.

Suggested code:

from io import TextIOBase
from typing import List, TextIO
import mimetypes
def find_lines(text_stream_or_filename: TextIO | str, search_words: List[str]) -> str:
 if (isinstance(text_stream_or_filename, str)
 and mimetypes.guess_type(text_stream_or_filename)[0] == 'text/plain'):
 with open(text_stream_or_filename, 'r', encoding="utf-8") as file:
 yield from _find_lines_textIO(file, search_words)
 elif isinstance(text_stream_or_filename, TextIOBase):
 yield from _find_lines_textIO(text_stream_or_filename, search_words)
 else:
 raise TypeError('argument must be a text stream or a text file name')
def _find_lines_textIO(text_stream: TextIO, search_words: list):
 for line in text_stream:
 lower_line = line.lower()
 if any(word.lower() in lower_line.split() for word in search_words):
 yield line.strip()

thank yo for answering! Shoud I also specify List[str] in a def _find_lines_textIO function?
@zaelcovsky The leading underscore in the function name is a convention indicating that it is intended for internal use only ("private"). As such, type hinting isn't nearly as important, as it shouldn't be used directly, and IDEs should hide it. It wouldn't hurt to update the type hint if you want to, but in this context it would be acceptable to remove them as well, or leave it as it is. It's a matter of preference.
Iterable[str] from typing (or from collections.abc for newer versions of python) would be a better type hint for search_words. Technically str is itself an Iterable[str], but invoking find_lines(foo, 'ai') would produce valid results. You could narrow it to Sequence[str] to show search_words is expected to be a collection of str rather than a single str, but typically the more general the better. If the function did a isinstance(search_words, list) or called specific list methods, then yes List[str] would be more appropriate.

RootTwo RootTwo 10.6k1 gold badge14 silver badges30 bronze badges · Answer 2 · 2023-10-03 07:39:22Z

You can also use contextlib.nullcontext(). The documentation gives an example similar to your situation.

If the file_or_path argument is a str or PathLike (e.g. pathlib.Path) object, it gets opened. An open file is a context manager that can be used in a with statement to automatically close the file when the with block ends.

If the file_or_path argument is an already opened file, we presume the caller will close the file, so use a nullcontext. In the with statement, a nullcontext returns its argument, but otherwise does nothing.

Just convert search_words to lowercase once, not for every line in the file.

import mimetypes
from contextlib import nullcontext
from io import TextIOBase
from os import PathLike
from typing import TextIO
def find_lines(file_or_path: TextIO | str | PathLike, search_words: list) -> str:
 """A generator that yields all lines in file_or_path that
 include at least one word from search_words.
 """
 
 # set up the file or path as a context manager
 if isinstance(file_or_path, (str, PathLike)):
 mimetype, _ = mimetypes.guess_type(file_or_path)
 if mimetype != 'text/plain':
 raise ValueError(f"Expected a plain text file, got a '{mimetype}' file.")
 
 readable = open(file_or_path, 'r', encoding="utf-8")
 
 elif isinstance(file_or_path, TextIOBase):
 
 readable = nullcontext(file_or_path)
 
 else:
 raise TypeError("Expected a text file or a path to one.")
 
 # convert words to lowercase just once
 words = [word.lower() for word in search_words]
 
 # use the context manager from above
 with readable as file:
 for line in file:
 lower_line = line.lower()
 if any(word in lower_line for word in words):
 yield line.strip()

Depending on the length of search_words and the size of the file, this may be an inefficient way to find matching lines.

thank you! I found solution with contextlib.nullcontext too. Why we need second variable _ in your code here with mimetype, _ = mimetypes.guess_type(file_or_path) ? Also there is redundant : on line 19.
and what context manager do with TextIO? Can this object be closed like file object?
@zaelcovsky By convention, the variable _ is used to indicate we don't care what the value is. mimetype, _ = ... tells someone reading the code that guess_type() returns two items. The first is the mimetype and we don't care about the second.
@zaelcovsky nullcontext(file_or_path) returns a context manager. It saves the argument passed to it, if any. Here it is file_or_path. Then in the line with reader as file, the argument, file_or_path, then gets returned as the value of file. Other than that, nullcontext() doesn't do anything.

zaelcovsky zaelcovsky 814 bronze badges · Answer 3 · 2023-10-03 06:57:13Z

"Conditional with statement in Python" explains how you can make

with open(object_or_filename, 'r', encoding="utf-8") as file:

optional with contextlib.nullcontext.

Python 3.7 introduced contextlib.nullcontext:

from contextlib import nullcontext
with get_stuff() if needs_with() else nullcontext() as gs:
 # do nearly the same large block of stuff,
 # involving gs or not, depending on needs_with()

If needs_with() is False then gs will be None. To make gs be something_else, you use nullcontext(something_else) instead of nullcontext().

As such you can DRY the repeated code.

from io import TextIOBase
from typing import List, TextIO
import mimetypes
from contextlib import nullcontext
def find_lines(text_stream_or_filename: TextIO | str, search_words: List[str]) -> str:
 if isinstance(text_stream_or_filename, str) and mimetypes.guess_type(text_stream_or_filename)[0] == 'text/plain' or \
 isinstance(text_stream_or_filename, TextIOBase):
 with open(text_stream_or_filename, 'r', encoding="utf-8") if (isinstance(text_stream_or_filename, str)) else \
 nullcontext(text_stream_or_filename) as text:
 for line in text:
 lower_line = line.lower()
 if any(word.lower() in lower_line.split() for word in search_words):
 yield line.strip()

This does not qualify as an answer on CodeReview, since it needs to speak on your original code and not only offer an alternative solution.
Unfortunately some people see words like "alternative solution" and jump into tone police mode. I have edited your answer to reduce the likelihood of visceral reactions. Your answer has an IO and one I didn't know existed, thanks for teaching my something new in Python.

Reinderien Reinderien 70.9k5 gold badges76 silver badges256 bronze badges · Answer 4 · 2023-10-03 15:20:12Z

Your return typehint is incorrect because you aren't returning a string; you're returning an iterator of strings.

The difficulty you're encountering is a symptom of trying to be everything for everyone. Python often devolves into "poor man's function overloading" like this; the most criminal extremes are seen in APIs like Pandas and matplotlib where functions can accept dozens of different signatures that have to be figured out in runtime and are difficult or impossible to exhaustively document (and test!)

So do yourself a favour. Don't do what you're doing. Accept one and only one thing, a TextIO file-like, and do whatever I/O operations you need to do outside of this function.

Use sets for word list checks.

Suggested

from typing import TextIO, Iterator
def find_lines(filelike: TextIO, search_words: set[str]) -> Iterator[str]:
 for line in filelike:
 if search_words & set(line.lower().split()):
 yield line.strip()

I agree with the "trying to be everything for everyone" statement. But you could make life a little easier with 1 extra line of code before the for loop, search_words = set(search_words) and type hint search_words as Iterable[str]. Then a user wouldn't run into an error related to the & operator when they didn't realize they needed to invoke find_lines(foo, set(bar)).

RootTwo RootTwo 10.6k1 gold badge14 silver badges30 bronze badges · Answer 5 · 2023-10-03 21:22:54Z

Reinderien's answer reminded me of functools.singledispatch, which provides a mechanism for overloading a function based on the type of the first argument.

import mimetypes
from functools import singledispatch
from os import PathLike
from typing import Iterable, TextIO
@singledispatch
def find_lines(file: TextIO, search_words: list) -> Iterable[str]:
 """A generator that yields all lines in a file that
 include at least one word from search_words.
 This is the base function that takes an open file.
 """
 
 for line in file:
 lower_line = line.lower()
 if any(word in lower_line for word in search_words):
 yield line.strip()
 
@find_lines.register
def find_lines_path(path: str | PathLike, search_words: list) -> Iterable[str]:
 """This is the version of find_files that takes a path as a string
 or a PathLike object (e.g. pathlib.Path). It opens the file, calls the 
 version of the function that accepts a file, then closes the file.
 """
 
 mimetype, _ = mimetypes.guess_type(path)
 if mimetype != 'text/plain':
 raise ValueError(f"Expected a plain text file, got a '{mimetype}' file.")
 
 with open(path) as file:
 yield from find_lines(file, search_words)

Matthieu M. Matthieu M. 6,4091 gold badge20 silver badges34 bronze badges · Answer 6 · 2023-10-04 14:49:17Z

Aho-Corasick

In terms of actually searching a list of words in a line, you want to learn about the Aho-Corasick which is... precisely made for that.

The key idea is to build a state-machine which matches all needles at once with a single pass over the haystack, rather than repeatedly searching each needle.

The simplest way to use it... is to use a regex and hope it uses Aho-Corasick under the hood¹. As a bonus, you can also match in case-insensitive mode, which should be faster than first converting to lower-case then matching:

 regex = re.compile('|'.join(search_words), re.IGNORECASE)
 for line in file:
 if regex.match(line):
 yield line.strip()

Otherwise, you should have a look at pypy. There's many implementations of the algorithm to pick from, or use as inspiration to make your own.

¹ I could not verify whether Python's default regex engine did, sorry.

Stack Exchange Network

Generate all lines from a file matching a given word list

6 Answers 6

Suggested

Aho-Corasick

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Generate all lines from a file matching a given word list

6 Answers 6

Suggested

Aho-Corasick

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions