This is a simple inverted index I made.
The goal is to:
- Read a set of text files from a directory, `docs`
- Tokenize them
- Normalize the tokens by removing punctuation and lowercasing them
- Create an index mapping each word to the names of the docs containing it, plus a count of those docs
Everything works as expected. I'm just curious what improvements there are to make.
```python
import os
import re
import fileinput

punctuation_regex = r"[^\w\s]"


def normalize_tokens(tokens):
    """Remove all non-alphanumeric characters and convert to lower case"""
    tokens = [re.sub(punctuation_regex, "", str.lower()) for str in tokens]
    return tokens


def tokenize_input(input):
    """Split string on whitespace and return unique tokens"""
    tokens = normalize_tokens(input.split())
    return set(tokens)


def read_docs():
    """Read files in docs directory and return map of filename -> tokens in file"""
    docs_to_tokens = {}
    for filename in os.listdir("docs"):
        with open(os.path.join("docs", filename), 'r') as f:
            docs_to_tokens[filename] = tokenize_input(f.read())
    return docs_to_tokens


def make_index():
    """Make inverted index of token -> docs containing token, number of docs containing token"""
    index = {}
    docs = read_docs()
    for doc in docs:
        for str in docs[doc]:
            if not str in index:
                index[str] = {}
                index[str]['count'] = 0
                index[str]['docs'] = []
            index[str]['count'] += 1
            index[str]['docs'].append(doc)
    return index


def query_index(key, index):
    """Return docs that contain key"""
    if key not in index:
        return None
    return index[key]['docs']


def main():
    index = make_index()
    for line in fileinput.input():
        print(query_index(line.rstrip(), index))


if __name__ == "__main__":
    main()
```
1 Answer
Nice, readable code, and well laid out.
I don't like the hardcoded pathname in `read_docs()`; surely that should be an argument? Similarly, I would pass the result of `read_docs()` as an argument to `make_index()` so it can be tested separately.
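A rough sketch of how that refactor might look, reusing `tokenize_input()` from the question (the parameter names here are my own invention):

```python
import os


def read_docs(docs_dir):
    """Read every file in docs_dir and return a map of filename -> tokens."""
    docs_to_tokens = {}
    for filename in os.listdir(docs_dir):
        with open(os.path.join(docs_dir, filename), 'r') as f:
            docs_to_tokens[filename] = tokenize_input(f.read())
    return docs_to_tokens


def make_index(docs_to_tokens):
    """Build the inverted index from an already-loaded filename -> tokens map."""
    index = {}
    for doc, tokens in docs_to_tokens.items():
        for token in tokens:
            if token not in index:
                index[token] = {'count': 0, 'docs': []}
            index[token]['count'] += 1
            index[token]['docs'].append(doc)
    return index


def main():
    # The directory is supplied at the call site instead of hardcoded, and
    # make_index() can now be unit-tested with a hand-built dict of tokens.
    index = make_index(read_docs("docs"))
```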
Here, we have dicts that we access by exactly two fixed keys:

```python
if not str in index:
    index[str] = {}
    index[str]['count'] = 0
    index[str]['docs'] = []
index[str]['count'] += 1
index[str]['docs'].append(doc)
```
Perhaps we should be using a (named) tuple instead of a dict? And perhaps `index` should be a `defaultdict`, so we don't need the `str in index` test?
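One possible shape for that, combining both ideas (`Entry` is a name I've made up, not anything from the question):

```python
from collections import defaultdict, namedtuple

# Hypothetical record type for one index entry.
Entry = namedtuple('Entry', ['count', 'docs'])


def make_index(docs_to_tokens):
    """Inverted index of token -> Entry(count of docs, list of doc names)."""
    index = defaultdict(lambda: Entry(0, []))
    for doc, tokens in docs_to_tokens.items():
        for token in tokens:
            entry = index[token]
            entry.docs.append(doc)
            # namedtuples are immutable, so count is updated via _replace()
            index[token] = entry._replace(count=entry.count + 1)
    return index
```

Note that `count` is always `len(entry.docs)` (each doc's tokens are a set, so no duplicates), so you could also drop the count field entirely and use a plain `defaultdict(list)`.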
Perhaps `normalize_tokens()` should return a set, rather than creating a list which we then deduplicate in `tokenize_input()`? Alternatively, make it a generator, which also avoids constructing the list.
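A sketch of the generator option, reusing the question's regex:

```python
import re

punctuation_regex = r"[^\w\s]"  # same pattern as in the question


def normalize_tokens(tokens):
    """Yield each token lowercased, with punctuation stripped."""
    for token in tokens:
        yield re.sub(punctuation_regex, "", token.lower())


def tokenize_input(text):
    """Split on whitespace and return the set of normalized tokens."""
    return set(normalize_tokens(text.split()))
```

(Renaming the loop variable here also avoids shadowing the built-in `str`.)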
- Great suggestions! Thank you so much! Will make these changes. (iluvfugu, Dec 13, 2022)