Given a list of words, e.g. dictionary.txt
aardvark
aardwolves
abacus
babble
...
I wanted to know how frequent all the different ngrams of letters were in it, e.g. aa appears twice in the above dataset but va only appears once.
I wrote a script that counts how many of each ngram there are, and presents the results in order from most-frequent ngram to least-frequent ngram as a CSV. I want to be able to pipe the output to other command-line programs.
import sys
from tqdm import tqdm

def error(*args, **kwargs):
    print(*args, file=sys.stderr, **kwargs)
    exit(1)

def count_ngrams(words: list[str], lengths: list[int], tqdm=tqdm) -> dict[str, int]:
    ngrams = {}
    for word in tqdm(words, desc='Counting n-grams', unit='word'):
        for length in lengths:
            for i in range(len(word) - length + 1):
                ngram = word[i:i + length]
                ngrams[ngram] = ngrams.get(ngram, 0) + 1
    return ngrams

if __name__ == '__main__':
    try:
        lexicon_file = sys.argv[1]
    except IndexError:
        error('Pass a lexicon file as first argument. Words should be newline-delimited.')
    with open(lexicon_file, 'r') as f:
        words = f.read().splitlines()
    if not words:
        error('Lexicon file is malformed. It should be a newline-delimited list of words.')
    ngrams_frequencies = count_ngrams(words, [2, 3])
    try:
        for ngram, frequency in sorted(ngrams_frequencies.items(), key=lambda x: x[1], reverse=True):
            print(f'{ngram},{frequency}')
    except BrokenPipeError:
        pass
1 Answer
interactive progress bar
This is an interesting signature.
def count_ngrams( ..., tqdm=tqdm):
It supports "batch mode", where we silently count ngrams without a progress bar.
But I see no `def silent_progress():` function in the OP which a caller could specify.
I assume this is copy-n-paste from some other larger code,
where the pattern has served you well.
Consider removing the parameter from the OP code, or alternatively
you might supply a "silent" utility function to support batch runs.
Consider tacking on a `Callable` type annotation.
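A silent stand-in along those lines might look like the sketch below. The names `silent_progress` and `ProgressWrapper`, and the parameter name `progress` (the OP calls it `tqdm`), are hypothetical. The default is the silent wrapper here only so that the sketch runs without third-party packages.

```python
from collections.abc import Callable, Iterable

# Hypothetical alias: anything that wraps an iterable of words and yields them back.
ProgressWrapper = Callable[..., Iterable[str]]


def silent_progress(iterable: Iterable[str], **kwargs: object) -> Iterable[str]:
    """Accept and ignore tqdm-style keyword arguments such as desc and unit."""
    return iterable


def count_ngrams(
    words: list[str],
    lengths: list[int],
    progress: ProgressWrapper = silent_progress,
) -> dict[str, int]:
    ngrams: dict[str, int] = {}
    for word in progress(words, desc='Counting n-grams', unit='word'):
        for length in lengths:
            for i in range(len(word) - length + 1):
                ngram = word[i:i + length]
                ngrams[ngram] = ngrams.get(ngram, 0) + 1
    return ngrams


counts = count_ngrams(['aardvark', 'aardwolves', 'abacus', 'babble'], [2])
print(counts['aa'], counts['va'])  # 2 1
```

For an interactive run, a caller would pass the real thing instead: `count_ngrams(words, [2, 3], progress=tqdm)`.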
defaultdict
This code isn't hard to read:
ngrams = {}
...
ngrams[ngram] = ngrams.get(ngram, 0) + 1
But we could more naturally express Author's Intent in this way:
from collections import defaultdict
ngrams = defaultdict(int)
...
ngrams[ngram] += 1
Also you seem not to be linting with `mypy --strict`, given that the empty container was untyped.
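Putting the two suggestions together, a typed `defaultdict` version could look like this. It is a minimal sketch, assuming Python 3.9+ for the subscripted `defaultdict[str, int]`; the progress bar is left out for brevity and the word list is the sample from the question.

```python
from collections import defaultdict


def count_ngrams(words: list[str], lengths: list[int]) -> dict[str, int]:
    # The annotation states up front what the empty container will hold.
    ngrams: defaultdict[str, int] = defaultdict(int)
    for word in words:
        for length in lengths:
            for i in range(len(word) - length + 1):
                # Missing keys start at int() == 0, so no .get() is needed.
                ngrams[word[i:i + length]] += 1
    return ngrams


counts = count_ngrams(['aardvark', 'aardwolves', 'abacus', 'babble'], [2, 3])
print(counts['aa'], counts['aar'])  # 2 2
```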
cracking argv
This works fine.
try:
    lexicon_file = sys.argv[1]
except IndexError:
    error('Pass a lexicon file as first argument. ...
It would be easier to let typer worry about CLI `--help`.
from pathlib import Path
import typer

def main(lexicon_file: Path) -> None:
    with open(lexicon_file ...
        ...

if __name__ == '__main__':
    typer.run(main)
itemgetter
for ... in sorted(ngrams_frequencies.items(), key=lambda x: x[1], ... ):
This would have been a good place to rely upon itemgetter().
from operator import itemgetter
...
for ... in sorted(ngrams_frequencies.items(), key=itemgetter(1), ... ):
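To show that the two spellings are interchangeable here, a small self-contained comparison follows. The three counts are made-up sample data, not output from the OP's script.

```python
from operator import itemgetter

# Made-up sample data standing in for the real n-gram counts.
ngrams_frequencies = {'aa': 2, 'va': 1, 'ar': 3}

# itemgetter(1) builds a callable that returns element 1 of its argument,
# which is exactly what the lambda does for each (ngram, frequency) pair.
by_lambda = sorted(ngrams_frequencies.items(), key=lambda x: x[1], reverse=True)
by_itemgetter = sorted(ngrams_frequencies.items(), key=itemgetter(1), reverse=True)

print(by_itemgetter)  # [('ar', 3), ('aa', 2), ('va', 1)]
```

Both calls produce the same list; the difference is that `itemgetter(1)` names the operation rather than spelling it out.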
- About the interactive progress bar, I added it on purpose thinking it might make sense in case I want to reuse this code or make it a module in the future. Would you add parameters like that to hobby scripts? – user98809, Sep 29, 2024 at 21:14
- Thanks for the great answer. Didn't know about `mypy` although I've always wanted it. I think it didn't exist yet the last time I used Python! `typer`'s really useful too, I often write CLIs. – user98809, Sep 29, 2024 at 21:20
- Why do you think it's better to use `operator.itemgetter` rather than the short lambda? – user98809, Sep 29, 2024 at 21:30
- I tend to write `for x in tqdm( ... ):` when getting an early feel for performance, and then if need be will simply delete it if I find the output has become troublesome for a production batch job. I don't tend to go back and forth on it, but I really liked the flexibility of that signature. // Based on what I saw in the code, I was concerned the OP was unfamiliar with a few "batteries included" libraries that should be part of every pythonista's toolkit. "Speaking in idiom" is an effective way to communicate technical ideas. Favoring specific {attr,item}getter over generic lambda can do that. – J_H, Sep 29, 2024 at 22:48
- If you're importing from collections, you might as well use `Counter` (docs.python.org/3/library/collections.html#collections.Counter). It has the same functionality as far as this code is concerned, but it expresses even more clearly the intent of the code. – AccidentalTaylorExpansion, Sep 30, 2024 at 7:58
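For reference, the `Counter` suggestion could be sketched roughly as follows. This assumes Python 3.9+ for the subscripted `Counter[str]`, drops the progress bar for brevity, and uses the sample words from the question.

```python
from collections import Counter


def count_ngrams(words: list[str], lengths: list[int]) -> Counter[str]:
    ngrams: Counter[str] = Counter()
    for word in words:
        for length in lengths:
            # update() with an iterable adds one to the count of each item.
            ngrams.update(word[i:i + length] for i in range(len(word) - length + 1))
    return ngrams


counts = count_ngrams(['aardvark', 'aardwolves', 'abacus', 'babble'], [2])

# most_common() yields (ngram, count) pairs from most to least frequent,
# so it can stand in for the explicit sorted(...) call.
for ngram, frequency in counts.most_common():
    print(f'{ngram},{frequency}')
```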