
This is a simple inverted index I made.

The goal is to:

  • Read a set of text files from a directory, docs
  • Tokenize them
  • Normalize the tokens by removing punctuation and lowercasing them
  • Create an index mapping each word to the names of the docs it appears in, along with the number of docs containing it

Everything works as expected. I'm just curious what improvements there are to make.

import os
import re
import fileinput

punctuation_regex = r"[^\w\s]"


def normalize_tokens(tokens):
    """Remove all non-alphanumeric characters and convert to lower case"""
    tokens = [re.sub(punctuation_regex, "", str.lower()) for str in tokens]
    return tokens


def tokenize_input(input):
    """Split string on whitespace and return unique tokens"""
    tokens = normalize_tokens(input.split())
    return set(tokens)


def read_docs():
    """Read files in docs directory and return map of filename -> tokens in file"""
    docs_to_tokens = {}
    for filename in os.listdir("docs"):
        with open(os.path.join("docs", filename), 'r') as f:
            docs_to_tokens[filename] = tokenize_input(f.read())
    return docs_to_tokens


def make_index():
    """Make inverted index of token -> docs containing token, number of docs containing token"""
    index = {}
    docs = read_docs()
    for doc in docs:
        for str in docs[doc]:
            if not str in index:
                index[str] = {}
                index[str]['count'] = 0
                index[str]['docs'] = []
            index[str]['count'] += 1
            index[str]['docs'].append(doc)
    return index


def query_index(key, index):
    """Return docs that contain key"""
    if key not in index:
        return None
    return index[key]['docs']


def main():
    index = make_index()
    for line in fileinput.input():
        print(query_index(line.rstrip(), index))


if __name__ == "__main__":
    main()
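
For example, with a docs directory containing two hypothetical files a.txt (holding "Hello, world!") and b.txt (holding "hello there"), a session looks like this (the script name is whatever you saved it as, and the doc order depends on os.listdir):

$ python inverted_index.py
hello
['a.txt', 'b.txt']
goodbye
None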
asked Dec 11, 2022 at 23:22

1 Answer


Nice, readable code, and well laid out.

I don't like the hardcoded pathname in read_docs() - surely that should be an argument?

Similarly, I would pass the result of read_docs() as argument to make_index() so it can be tested separately.
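
A minimal sketch of that refactoring (the parameter names are my own suggestion):

def read_docs(directory):
    """Read files in directory and return map of filename -> tokens in file"""
    docs_to_tokens = {}
    for filename in os.listdir(directory):
        with open(os.path.join(directory, filename), 'r') as f:
            docs_to_tokens[filename] = tokenize_input(f.read())
    return docs_to_tokens

def make_index(docs):
    index = {}
    for doc in docs:
        for token in docs[doc]:
            ...  # build the entry exactly as before
    return index

def main():
    index = make_index(read_docs("docs"))
    ...

Now make_index() can be exercised in tests with a hand-built dict, no filesystem needed.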

Here, we have dicts that we access by exactly two fixed keys:

    if not str in index:
        index[str] = {}
        index[str]['count'] = 0
        index[str]['docs'] = []
    index[str]['count'] += 1
    index[str]['docs'].append(doc)

Perhaps we should be using a (named) tuple instead of a dict? And perhaps index should be a defaultdict, so we don't need the str in index test?
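
For instance, with a defaultdict (a sketch, keeping the original two keys):

from collections import defaultdict

def make_index(docs):
    # every new token automatically gets a fresh {'count': 0, 'docs': []}
    index = defaultdict(lambda: {'count': 0, 'docs': []})
    for doc in docs:
        for token in docs[doc]:
            index[token]['count'] += 1
            index[token]['docs'].append(doc)
    return index

(If you go the namedtuple route instead, remember that tuples are immutable, so each update would mean building a replacement tuple.)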

Perhaps normalize_tokens() should return a set, rather than creating a list which we then deduplicate in tokenize_input()? Alternatively, make it a generator, which also avoids constructing the list.
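
A sketch of the generator version (I've also renamed the loop variable, which stops shadowing the built-in str):

def normalize_tokens(tokens):
    """Remove all non-alphanumeric characters and convert to lower case"""
    for token in tokens:
        yield re.sub(punctuation_regex, "", token.lower())

def tokenize_input(input):
    """Split string on whitespace and return unique tokens"""
    return set(normalize_tokens(input.split()))

No intermediate list is built; set() consumes the tokens as they are produced.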

answered Dec 12, 2022 at 9:27
  • Great suggestions! Thank you so much! Will make these changes. (Commented Dec 13, 2022 at 4:46)
