This is a simple inverted index I made.
The goal is to:
- Read a set of text files from a directory, `docs`
- Tokenize them
- Normalize the tokens by removing punctuation and lowercasing them
- Create an index mapping each word to the names of the docs containing it, plus a count of those docs
Everything works as expected. I'm just curious what improvements there are to make.
```python
import os
import re
import fileinput

punctuation_regex = r"[^\w\s]"


def normalize_tokens(tokens):
    """Remove all non-alphanumeric characters and convert to lower case"""
    tokens = [re.sub(punctuation_regex, "", str.lower()) for str in tokens]
    return tokens


def tokenize_input(input):
    """Split string on whitespace and return unique tokens"""
    tokens = normalize_tokens(input.split())
    return set(tokens)


def read_docs():
    """Read files in docs directory and return map of filename -> tokens in file"""
    docs_to_tokens = {}
    for filename in os.listdir("docs"):
        with open(os.path.join("docs", filename), 'r') as f:
            docs_to_tokens[filename] = tokenize_input(f.read())
    return docs_to_tokens


def make_index():
    """Make inverted index of token -> docs containing token, number of docs containing token"""
    index = {}
    docs = read_docs()
    for doc in docs:
        for str in docs[doc]:
            if not str in index:
                index[str] = {}
                index[str]['count'] = 0
                index[str]['docs'] = []
            index[str]['count'] += 1
            index[str]['docs'].append(doc)
    return index


def query_index(key, index):
    """Return docs that contain key"""
    if key not in index:
        return None
    return index[key]['docs']


def main():
    index = make_index()
    for line in fileinput.input():
        print(query_index(line.rstrip(), index))


if __name__ == "__main__":
    main()
```
1 Answer
Nice, readable code, and well laid out.
I don't like the hardcoded pathname in `read_docs()`; surely that should be an argument? Similarly, I would pass the result of `read_docs()` as an argument to `make_index()` so it can be tested separately.
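A rough sketch of how that refactor might look, reusing `tokenize_input()` from the question (the parameter names here are my own invention):

```python
import os


def read_docs(docs_dir):
    """Read every file in docs_dir and return a map of filename -> tokens."""
    docs_to_tokens = {}
    for filename in os.listdir(docs_dir):
        with open(os.path.join(docs_dir, filename), 'r') as f:
            docs_to_tokens[filename] = tokenize_input(f.read())
    return docs_to_tokens


def make_index(docs_to_tokens):
    """Build the inverted index from an already-loaded filename -> tokens map."""
    index = {}
    for doc, tokens in docs_to_tokens.items():
        for token in tokens:
            if token not in index:
                index[token] = {'count': 0, 'docs': []}
            index[token]['count'] += 1
            index[token]['docs'].append(doc)
    return index


def main():
    # The directory is supplied at the call site instead of hardcoded, and
    # make_index() can now be unit-tested with a hand-built dict of tokens.
    index = make_index(read_docs("docs"))
```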
Here, we have dicts that we access by exactly two fixed keys:

```python
if not str in index:
    index[str] = {}
    index[str]['count'] = 0
    index[str]['docs'] = []
index[str]['count'] += 1
index[str]['docs'].append(doc)
```
Perhaps we should be using a (named) tuple instead of a dict? And perhaps `index` should be a `defaultdict`, so we don't need the `str in index` test?
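One possible shape for that, combining both ideas (`Entry` is a name I've made up, not anything from the question):

```python
from collections import defaultdict, namedtuple

# Hypothetical record type for one index entry.
Entry = namedtuple('Entry', ['count', 'docs'])


def make_index(docs_to_tokens):
    """Inverted index of token -> Entry(count of docs, list of doc names)."""
    index = defaultdict(lambda: Entry(0, []))
    for doc, tokens in docs_to_tokens.items():
        for token in tokens:
            entry = index[token]
            entry.docs.append(doc)
            # namedtuples are immutable, so count is updated via _replace()
            index[token] = entry._replace(count=entry.count + 1)
    return index
```

Note that `count` is always `len(entry.docs)` (each doc's tokens are a set, so no duplicates), so you could also drop the count field entirely and use a plain `defaultdict(list)`.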
Perhaps `normalize_tokens()` should return a set, rather than creating a list which we then deduplicate in `tokenize_input()`? Alternatively, make it a generator, which also avoids constructing the list.
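A sketch of the generator option, reusing the question's regex:

```python
import re

punctuation_regex = r"[^\w\s]"  # same pattern as in the question


def normalize_tokens(tokens):
    """Yield each token lowercased, with punctuation stripped."""
    for token in tokens:
        yield re.sub(punctuation_regex, "", token.lower())


def tokenize_input(text):
    """Split on whitespace and return the set of normalized tokens."""
    return set(normalize_tokens(text.split()))
```

(Renaming the loop variable here also avoids shadowing the built-in `str`.)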
- Great suggestions! Thank you so much! Will make these changes. (iluvfugu, Dec 13, 2022)