Given a file or directory, create an iterator that returns the non-empty words from the file or from all files recursively in the directory. Only process ".txt" files. Words are sequences of characters separated by whitespace.
from pathlib import Path
from typing import Iterator, Optional


class WordIterable:
    def __init__(self, path: str):
        root = Path(path)
        self._walker: Optional[Iterator[Path]] = None
        if root.is_dir():
            # Recursively find every .txt file under the directory.
            self._walker = root.rglob("*.txt")
        elif root.suffix == ".txt":
            self._walker = (p for p in [root])
        self._open_next_file()
        self._read_next_line()

    def __iter__(self) -> Iterator[str]:
        return self

    def __next__(self) -> str:
        next_word = self._next_word()
        if not next_word:
            # Current line is exhausted: try the next line.
            self._read_next_line()
            next_word = self._next_word()
            if not next_word:
                # Still nothing: move on to the next file.
                self._close_file()
                self._open_next_file()
                self._read_next_line()
                next_word = self._next_word()
        return next_word if WordIterable._is_not_blank(next_word) else next(self)

    def _next_word(self) -> Optional[str]:
        return self._line.pop() if self._line else None

    def _read_next_line(self) -> None:
        # Store the words reversed so that pop() returns them in order.
        self._line = self._fp.readline().split()[::-1]

    def _open_next_file(self) -> None:
        if self._walker:
            self._file: Path = next(self._walker, None)
            if self._file:
                self._fp = self._file.open(encoding="utf8")
                return
        raise StopIteration

    def _close_file(self) -> None:
        self._fp.close()

    @staticmethod
    def _is_not_blank(s: str) -> bool:
        return s and s != "\n"
This works but seems like a lot of code. Can we do better?
Edit:
What is a "word" and a "non-empty word"?
Words are sequences of characters separated by whitespace.
The question doesn't say to recursively process a directory and its sub-directories, but that's what the code appears to do.
It does now.
The code only does ".txt" files.
Yes.
- The question is vague and the code makes some assumptions that aren't in the question. For example, what is a "word" and a "non-empty word"? Just letters? What about numbers or punctuation? Also, the question doesn't say to recursively process a directory and its sub-directories, but that's what the code appears to do. Lastly, the question says to process all files in a directory, but the code only does ".txt" files. (RootTwo, Sep 13, 2021 at 5:38)
- @RootTwo This question was asked in an interview, and interview questions are sadly but deliberately vague. I've added an edit for your follow-up questions. (Abhijit Sarkar, Sep 13, 2021 at 6:02)
1 Answer
The code seems overly complicated for a relatively simple task.
from pathlib import Path

def words(file_or_path):
    path = Path(file_or_path)

    if path.is_dir():
        paths = path.rglob('*.txt')
    else:
        paths = (path, )

    for filepath in paths:
        yield from filepath.read_text().split()
The function can take a directory name or a file name. For both cases, create an iterable, paths, of the files to be processed. This way, both cases can be handled by the same code.
For each filepath in paths, use Path.read_text() to open the file, read it in, close the file, and return the text that was read. str.split() drops leading and trailing whitespace and then splits the string on whitespace. yield from ... yields each word in turn.
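As a quick illustration of the str.split() behavior described above, here is a minimal sketch. The sample strings are made up for the example; the contrast with an explicit separator is included only to show where empty strings could otherwise come from.

```python
# With no arguments, str.split() splits on runs of whitespace and
# ignores leading/trailing whitespace, so it never returns empty strings.
text = "  alpha\tbeta \n\n gamma  \n"
words_no_sep = text.split()
print(words_no_sep)  # ['alpha', 'beta', 'gamma']

# With an explicit separator, consecutive separators do produce empty strings.
words_with_sep = "a  b".split(" ")
print(words_with_sep)  # ['a', '', 'b']

# A string containing only whitespace splits to an empty list.
only_whitespace = " \n\t ".split()
print(only_whitespace)  # []
```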
If you don't want to read an entire file in at once, replace the yield from ... with something like:
with filepath.open() as f:
    for line in f:
        yield from line.split()
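Put together, the line-by-line variant might look like the sketch below. The name words_by_line and the type hints are illustrative additions, and the UTF-8 encoding is an assumption carried over from the question's code rather than something this answer specifies.

```python
from pathlib import Path
from typing import Iterator, Union


def words_by_line(file_or_path: Union[str, Path]) -> Iterator[str]:
    """Yield whitespace-separated words, reading each file one line at a time."""
    path = Path(file_or_path)

    # Same path handling as words(): a directory is searched recursively
    # for .txt files; anything else is treated as a single file.
    if path.is_dir():
        paths = path.rglob("*.txt")
    else:
        paths = (path,)

    for filepath in paths:
        # Assumption: files are UTF-8 encoded, as in the question's code.
        with filepath.open(encoding="utf-8") as f:
            for line in f:
                # Blank lines split to an empty list, so they yield nothing.
                yield from line.split()
```

Only one line is held in memory at a time, and the with block closes each file before the next one is opened. The order in which rglob returns files is not guaranteed, so the order of words across files may vary.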
- This is great, but I'd like to take it one step further by reading one word at a time instead of the whole line. However, I'm wondering if that can somehow backfire, since reading one character at a time until a whitespace is encountered isn't exactly efficient. I wish there was a readword function. (Abhijit Sarkar, Sep 13, 2021 at 9:58)
- The most commonly used Python implementations will be based on C, its stdlib, and UNIX philosophy. The IO there is built around lines and line-by-line processing. A readword function in Python can be implemented very easily, but under the hood the implementation would read the line and discard everything except the word. You would not gain much. (mcocdawc, Sep 13, 2021 at 15:04)
- It would be totally possible, although definitely harder and less clean, to read the file in fixed-sized chunks instead of by line. And you would gain the ability to not crash if you come across a file with very long lines, e.g. a 32-gigabyte file without the \n character. Although I suppose such a file would likely exceed {LINE_MAX} (implementation-defined and only required to be >= 2048) and thus wouldn't count as a "text file" according to the POSIX standard... (Gavin S. Yancey, Sep 13, 2021 at 17:27)
- There's one issue with the yield from used here: it doesn't check for empty words, which is a requirement in the question. I changed it to yield from filter(_is_not_blank, line.split()). (Abhijit Sarkar, Sep 13, 2021 at 19:40)
- @AbhijitSarkar, .split() should already filter out the blanks. Do you have an example input that causes it to return an empty word? (RootTwo, Sep 13, 2021 at 19:59)
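The fixed-size chunk idea raised in the comments could be sketched roughly as follows. This is not code from the answer or the commenters: the name words_from_chunks and the default chunk size are hypothetical, and the function assumes it is given a file object opened in text mode.

```python
import io
from typing import Iterator, TextIO


def words_from_chunks(f: TextIO, chunk_size: int = 64 * 1024) -> Iterator[str]:
    """Yield whitespace-separated words, reading at most chunk_size characters per read."""
    tail = ""  # possibly incomplete word carried over from the previous chunk
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break  # end of file
        buffer = tail + chunk
        parts = buffer.split()
        if buffer[-1].isspace():
            # The chunk ended on whitespace, so every word in it is complete.
            tail = ""
        else:
            # The last word may continue in the next chunk, so hold it back.
            tail = parts.pop()
        yield from parts
    if tail:
        # Whatever was held back at end of file is a complete word.
        yield tail


# Small demonstration with an in-memory "file" and a tiny chunk size.
sample = io.StringIO("one two\tthree\n\nfour")
print(list(words_from_chunks(sample, chunk_size=3)))  # ['one', 'two', 'three', 'four']
```

Memory use is bounded by the chunk size plus the length of the longest single word, so a file with no newlines is handled without loading it whole. A single word longer than the chunk size still has to be accumulated in full.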