\$\begingroup\$

Given a list of words, e.g. dictionary.txt

aardvark
aardwolves
abacus
babble
...

I wanted to know how often each letter ngram occurs in it. For example, aa appears twice in the dataset above, but va appears only once.

I wrote a script that counts the occurrences of each ngram and prints the results as CSV, ordered from most frequent to least frequent. I want to be able to pipe the output to other command-line programs.

import sys
from tqdm import tqdm

def error(*args, **kwargs):
    print(*args, file=sys.stderr, **kwargs)
    exit(1)

def count_ngrams(words: list[str], lengths: list[int], tqdm=tqdm) -> dict[str, int]:
    ngrams = {}
    for word in tqdm(words, desc='Counting n-grams', unit='word'):
        for length in lengths:
            for i in range(len(word) - length + 1):
                ngram = word[i:i + length]
                ngrams[ngram] = ngrams.get(ngram, 0) + 1
    return ngrams

if __name__ == '__main__':
    try:
        lexicon_file = sys.argv[1]
    except IndexError:
        error('Pass a lexicon file as first argument. Words should be newline-delimited.')
    with open(lexicon_file, 'r') as f:
        words = f.read().splitlines()
    if not words:
        error('Lexicon file is malformed. It should be a newline-delimited list of words.')
    ngrams_frequencies = count_ngrams(words, [2, 3])
    try:
        for ngram, frequency in sorted(ngrams_frequencies.items(), key=lambda x: x[1], reverse=True):
            print(f'{ngram},{frequency}')
    except BrokenPipeError:
        pass

asked Sep 29, 2024 at 20:33
\$\endgroup\$

1 Answer
\$\begingroup\$

interactive progress bar

This is an interesting signature.

def count_ngrams( ..., tqdm=tqdm):

It supports a "batch mode", in which we silently count ngrams without a progress bar. But I see no def silent_progress(): function in the OP that a caller could pass in. I assume this was copied and pasted from some larger code base, where the pattern has served you well. Consider removing the parameter from the OP code, or alternatively supply a "silent" utility function to support batch runs.

Consider tacking on a Callable type annotation.
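
As a rough sketch of both suggestions, the parameter could carry a Callable annotation and default to a no-op wrapper. The names silent_progress and ProgressWrapper are hypothetical, and the default is swapped from tqdm to the silent helper only so that the sketch runs without third-party packages; a caller who wants a bar would pass progress=tqdm.

```python
from collections.abc import Callable, Iterable
from typing import Any

# Hypothetical alias: anything that wraps an iterable of words and accepts
# tqdm-style keyword arguments such as desc= and unit=.
ProgressWrapper = Callable[..., Iterable[str]]


def silent_progress(iterable: Iterable[str], **kwargs: Any) -> Iterable[str]:
    """Hypothetical no-op stand-in for tqdm: ignore the display options."""
    return iterable


def count_ngrams(
    words: list[str],
    lengths: list[int],
    progress: ProgressWrapper = silent_progress,  # assumption: silent by default
) -> dict[str, int]:
    ngrams: dict[str, int] = {}
    for word in progress(words, desc='Counting n-grams', unit='word'):
        for length in lengths:
            for i in range(len(word) - length + 1):
                ngram = word[i:i + length]
                ngrams[ngram] = ngrams.get(ngram, 0) + 1
    return ngrams


# Batch run: no progress bar is drawn.
counts = count_ngrams(['aardvark', 'aardwolves', 'abacus', 'babble'], [2])
```

For an interactive run, count_ngrams(words, [2, 3], progress=tqdm) should behave like the OP code, since tqdm accepts the same desc and unit keywords.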

defaultdict

This code isn't hard to read:

    ngrams = {}
    ...
                ngrams[ngram] = ngrams.get(ngram, 0) + 1

But we could more naturally express Author's Intent in this way:

from collections import defaultdict
    ngrams = defaultdict(int)
    ...
                ngrams[ngram] += 1

Also, you seem not to be linting with mypy --strict, given that the empty container was untyped.
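
A minimal sketch of how the defaultdict version might look with the container annotated explicitly. The progress bar is left out to keep the example self-contained, and converting back to a plain dict on return is an assumption made here so that callers do not create keys by accident when probing for a missing ngram.

```python
from collections import defaultdict


def count_ngrams(words: list[str], lengths: list[int]) -> dict[str, int]:
    # An explicit annotation on the empty container leaves nothing for a
    # type checker to guess.
    ngrams: defaultdict[str, int] = defaultdict(int)
    for word in words:
        for length in lengths:
            for i in range(len(word) - length + 1):
                ngrams[word[i:i + length]] += 1
    # Assumption: hand back a plain dict so lookups of unseen ngrams raise
    # KeyError instead of silently inserting a zero.
    return dict(ngrams)


counts = count_ngrams(['aardvark', 'aardwolves'], [2])
```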

cracking argv

This works fine.

    try:
        lexicon_file = sys.argv[1]
    except IndexError:
        error('Pass a lexicon file as first argument. ...

It would be easier to let typer worry about CLI --help.

from pathlib import Path
import typer

def main(lexicon_file: Path) -> None:
    with open(lexicon_file ...
    ...

if __name__ == '__main__':
    typer.run(main)

itemgetter

 for ... in sorted(ngrams_frequencies.items(), key=lambda x: x[1], ... ):

This would have been a good place to rely upon itemgetter().

from operator import itemgetter
...
 for ... in sorted(ngrams_frequencies.items(), key=itemgetter(1), ... ):
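
To illustrate that the swap changes only how the key is spelled, here is a small sketch on made-up counts (the frequencies dict is hypothetical sample data, not the OP's output). Both keys select the second element of each (ngram, frequency) pair, so the two sorts agree.

```python
from operator import itemgetter

# Hypothetical sample data standing in for ngrams_frequencies.
frequencies = {'va': 1, 'aa': 2, 'ar': 3}

by_lambda = sorted(frequencies.items(), key=lambda x: x[1], reverse=True)
by_itemgetter = sorted(frequencies.items(), key=itemgetter(1), reverse=True)

for ngram, frequency in by_itemgetter:
    print(f'{ngram},{frequency}')
```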
answered Sep 29, 2024 at 20:57
\$\endgroup\$
  • About the interactive progress bar, I added it on purpose thinking it might make sense in case I want to reuse this code or make it a module in the future. Would you add parameters like that to hobby scripts? Commented Sep 29, 2024 at 21:14
  • Thanks for the great answer. Didn't know about mypy although I've always wanted it. I think it didn't exist yet the last time I used Python! typer's really useful too, I often write CLIs. Commented Sep 29, 2024 at 21:20
  • Why do you think it's better to use operator.itemgetter rather than the short lambda? Commented Sep 29, 2024 at 21:30
  • I tend to write for x in tqdm( ... ): when getting an early feel for performance, and then if need be will simply delete it if I find the output has become troublesome for a production batch job. I don't tend to go back and forth on it, but I really liked the flexibility of that signature. // Based on what I saw in the code, I was concerned the OP was unfamiliar with a few "batteries included" libraries that should be part of every pythonista's toolkit. "Speaking in idiom" is an effective way to communicate technical ideas. Favoring specific {attr,item}getter over generic lambda can do that. Commented Sep 29, 2024 at 22:48
  • If you're importing from collections, you might as well use Counter (docs.python.org/3/library/collections.html#collections.Counter). It has the same functionality as far as this code is concerned, but it expresses even more clearly the intent of the code. Commented Sep 30, 2024 at 7:58
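
The Counter suggestion in the last comment might look roughly like the sketch below. The progress bar is omitted for brevity, and using most_common() in place of the explicit sorted() call is an assumption about how the output loop could be adapted, not something the comment spells out.

```python
from collections import Counter


def count_ngrams(words: list[str], lengths: list[int]) -> Counter[str]:
    ngrams: Counter[str] = Counter()
    for word in words:
        for length in lengths:
            # update() adds one to the count of every ngram the generator yields.
            ngrams.update(word[i:i + length] for i in range(len(word) - length + 1))
    return ngrams


counts = count_ngrams(['aardvark', 'aardwolves', 'abacus', 'babble'], [2, 3])

# most_common() yields (ngram, frequency) pairs from most to least frequent.
for ngram, frequency in counts.most_common(3):
    print(f'{ngram},{frequency}')
```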
