Iterate through non-empty words from text files

Question 1

Given a file or directory, create an iterator that returns the non-empty words from the file or from all files recursively in the directory. Only process ".txt" files. Words are sequences of characters separated by whitespace.

from pathlib import Path
from typing import Iterator, Optional


class WordIterable:
    def __init__(self, path: str):
        root = Path(path)
        self._walker: Optional[Iterator[Path]] = None
        if root.is_dir():
            # Recurse into sub-directories, matching only .txt files.
            self._walker = root.rglob("*.txt")
        elif root.suffix == ".txt":
            self._walker = (p for p in [root])
        self._open_next_file()
        self._read_next_line()

    def __iter__(self) -> Iterator[str]:
        return self

    def __next__(self) -> str:
        next_word = self._next_word()
        if not next_word:
            self._read_next_line()
            next_word = self._next_word()
            if not next_word:
                self._close_file()
                self._open_next_file()
                self._read_next_line()
                next_word = self._next_word()
        return next_word if WordIterable._is_not_blank(next_word) else next(self)

    def _next_word(self) -> Optional[str]:
        return self._line.pop() if self._line else None

    def _read_next_line(self) -> None:
        # Words are stored reversed so that pop() returns them in reading order.
        self._line = self._fp.readline().split()[::-1]

    def _open_next_file(self) -> None:
        if self._walker:
            self._file: Path = next(self._walker, None)
            if self._file:
                self._fp = self._file.open(encoding="utf8")
                return
        raise StopIteration

    def _close_file(self) -> None:
        self._fp.close()

    @staticmethod
    def _is_not_blank(s: Optional[str]) -> bool:
        return bool(s) and s != "\n"

This works but seems like a lot of code. Can we do better?

Edit:

What is a "word" and a "non-empty word"?

Words are sequences of characters separated by whitespace.

The question doesn't say to recursively process a directory and its sub-directories, but that's what the code appears to do.

It does now.

The code only does ".txt" files.

Yes.

Question 2

The question is vague and the code makes some assumptions that aren't in the question. For example, what is a "word" and a "non-empty word"? Just letters? What about numbers or punctuation? Also, the question doesn't say to recursively process a directory and its sub-directories, but that's what the code appears to do. Lastly, the question says to process all files in a directory, but the code only does ".txt" files.

Question 3

@RootTwo This question was asked in an interview, and interview questions are, sadly, deliberately vague. I've added an edit for your follow-up questions.

Question 4

The code seems overly complicated for a relatively simple task.

from pathlib import Path


def words(file_or_path):
    path = Path(file_or_path)

    if path.is_dir():
        paths = path.rglob('*.txt')
    else:
        paths = (path, )

    for filepath in paths:
        yield from filepath.read_text().split()

The function can take a directory name or a file name. For both cases, create an iterable, paths, of the files to be processed. This way, both cases can be handled by the same code.

For each filepath in paths use Path.read_text() to open the file, read it in, close the file, and return the text that was read. str.split() drops leading and trailing whitespace and then splits the string on whitespace. yield from ... yields each word in turn.
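As a quick check of the str.split() behaviour described above, here is a minimal sketch. The sample strings are made up for illustration; the point is that split() with no arguments never returns an empty string, whereas split(" ") with an explicit separator can.

```python
# Sample text invented for illustration: leading and trailing whitespace,
# a blank line, a tab, and runs of spaces.
text = "  alpha\tbeta \n\n  gamma   delta\n"

# With no separator argument, split() treats any run of whitespace as a
# single delimiter and discards leading and trailing whitespace.
words = text.split()
print(words)  # ['alpha', 'beta', 'gamma', 'delta']

# An explicit separator behaves differently and can produce empty strings.
explicit = "a  b".split(" ")
print(explicit)  # ['a', '', 'b']

# Whitespace-only input produces no words at all.
blank = "   \n\t ".split()
print(blank)  # []
```

So a separate "is this word blank?" filter should only be necessary if the code switches to an explicit separator.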

If you don't want to read an entire file in at once, replace the yield from ... with something like:

    with filepath.open() as f:
        for line in f:
            yield from line.split()
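Putting the line-by-line variant together with the rest of the function might look like the sketch below. The name words_streaming, the type hints, and the explicit encoding="utf8" (borrowed from the question's code) are assumptions for illustration, not part of the answer. Like the answer, it does not check the suffix when a single file is passed in.

```python
from pathlib import Path
from typing import Iterator, Union


def words_streaming(file_or_path: Union[str, Path]) -> Iterator[str]:
    """Yield whitespace-separated words, reading one line at a time."""
    path = Path(file_or_path)

    # A directory is searched recursively for .txt files; anything else is
    # treated as a single file to read.
    paths = path.rglob("*.txt") if path.is_dir() else (path,)

    for filepath in paths:
        # encoding="utf8" is an assumption carried over from the question.
        with filepath.open(encoding="utf8") as f:
            for line in f:
                # Blank lines split into an empty list, so they yield nothing.
                yield from line.split()
```

Because it is a generator, only one line of one file is held in memory at a time, and each file is closed as soon as its last line has been consumed. The order in which rglob() visits files is not something this sketch relies on.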


RootTwo · Accepted Answer · 2021-09-13 06:30:43Z


This is great, but I’d like to take it one step further by reading one word at a time instead of the whole line. However, I’m wondering if that can somehow backfire, since reading one character at a time until a white space is encountered isn’t exactly efficient. I wish there was a readword function. – Abhijit Sarkar, Sep 13, 2021 at 9:58

The most commonly used Python implementations will be based on C, its stdlib, and UNIX philosophy. The IO there is built around lines and line-by-line processing. A readword function in Python can be implemented very easily, but under the hood the implementation would read the line and discard everything except the word. You would not gain much. – mcocdawc, Sep 13, 2021 at 15:04

It would be totally possible, although definitely harder and less clean, to read the file in fixed-size chunks instead of by line. And you would gain the ability to not crash if you come across a file with very long lines, e.g. a 32-gigabyte file without the \n character. Although I suppose such a file would likely exceed {LINE_MAX} (implementation-defined and only required to be >= 2048) and thus wouldn't count as a "text file" according to the POSIX standard... – Gavin S. Yancey, Sep 13, 2021 at 17:27

There's one issue with the yield from used here: it doesn't check for empty words, which is a requirement in the question. I changed it to yield from filter(_is_not_blank, line.split()). – Abhijit Sarkar, Sep 13, 2021 at 19:40

@AbhijitSarkar, .split() should already filter out the blanks. Do you have an example input that causes it to return an empty word? – RootTwo, Sep 13, 2021 at 19:59
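The fixed-size-chunk idea raised in the comments could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name words_from_chunks and the default chunk_size are hypothetical, and the file is assumed to be opened in text mode so that read(n) returns at most n characters. The one subtlety is that a word may straddle two chunks, so the last partial word is held back until the next read.

```python
from typing import Iterator, TextIO


def words_from_chunks(f: TextIO, chunk_size: int = 64 * 1024) -> Iterator[str]:
    """Yield whitespace-separated words, reading at most chunk_size characters at a time."""
    tail = ""  # partial word carried over from the previous chunk
    while True:
        chunk = f.read(chunk_size)
        if not chunk:  # an empty string means end of file
            break
        buffer = tail + chunk
        parts = buffer.split()
        # If the buffer does not end in whitespace, its last word may
        # continue in the next chunk, so hold it back instead of yielding it.
        if parts and not buffer[-1].isspace():
            tail = parts.pop()
        else:
            tail = ""
        yield from parts
    if tail:
        # Whatever is left at end of file is a complete word.
        yield tail
```

With this approach memory use depends on the chunk size and the longest single word rather than on the longest line, which is the gain described in the comment. A file consisting of one enormous word would still be buffered in full.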
