Simple Search Engine Program in Python

Question 1

This program reads the txt files in a folder called Data and returns a filename with a score, but I don't think it's very good.

Can I get any feedback on this program? Is there any way it could be written better, or anything that should be added?

import os, nltk, re
from nltk import word_tokenize
from nltk.corpus import stopwords
nltk.download('stopwords') # helps remove stop words such as "the", "and" etc
nltk.download('punkt') # Punkt sentence tokenizer, helps break down text
files = os.listdir("Data/")
f = open("words.txt","r")
contents = f.readlines()
vocab = []
for word in contents:
    vocab.append(word[0:-1].lower())
word_set = set(vocab)
def process_files(dir, filenames):
    file_to_terms = {}
    for file in filenames:
        pattern = re.compile('[\W_]+')
        name = dir + file
        file_to_terms[file] = open(name, 'r').read().lower();
        file_to_terms[file] = pattern.sub(' ',file_to_terms[file])
        re.sub(r'[\W_]+','', file_to_terms[file])
        file_to_terms[file] = file_to_terms[file].split()
    return file_to_terms
listdata = []
listdata = process_files("Data/",files)
print("storing keywords in dictionary done")
def index_one_file(termlist):
    fileIndex = {}
    for index, word in enumerate(termlist):
        if word in fileIndex.keys():
            fileIndex[word].append(index)
        else:
            fileIndex[word] = [index]
    return fileIndex
def make_indices(termlists):
    total = {}
    for filename in termlists.keys():
        total[filename] = index_one_file(termlists[filename])
    return total
indexwordallfiles = make_indices(listdata)
print("constructing inverted index.")
def fullIndex(regdex):
    total_index = {}
    for filename in regdex.keys():
        for word in regdex[filename].keys():
            if word in total_index.keys():
                if filename in total_index[word].keys():
                    total_index[word][filename].extend(regdex[filename][word][:])
                else:
                    total_index[word][filename] = regdex[filename][word]
            else:
                total_index[word] = {filename: regdex[filename][word]}
    return total_index
wordindex = fullIndex(indexwordallfiles)
print("now proceeding with the query part")
def one_word_query(word, invertedIndex):
    pattern = re.compile('[\W_]+')
    word = pattern.sub(' ',word)
    if word in invertedIndex.keys():
        return [filename for filename in invertedIndex[word].keys()]
    else:
        return []
def free_text_query(string):
    pattern = re.compile('[\W_]+')
    string = pattern.sub(' ',string.lower())
    result = []
    print(" returning intersection of files")
    for word in string.split():
        result.append(set(one_word_query(word,wordindex)))
    A={}
    A = result[0].intersection(result[1])
    for i in range(1,len(result)-1):
        A = A.intersection(result[i+1])
    return list(A)
k = len(files)
dic = {}
for item in wordindex:
    k = 0
    for fil in wordindex[item]:
        k += (len(wordindex[item][fil]))
    dic[item] = k
print(len(dic))
def keywithmaxval(d):
    """ a) create a list of the dict's keys and values;
    b) return the key with the max value"""
    try:
        v = list(d.values())
        k = list(d.keys())
        value = k[v.index(max(v))]
        return value
    except:
        return
def startQuery(query):
    pattern = re.compile('[\W_]+')
    query = pattern.sub(' ',query.lower())
    txtlist = word_tokenize(str(query))
    txtlist = [word for word in txtlist if not word in stopwords.words('english')]
    toreturn = {}
    for f in files:
        toreturn[f]=0
    for item in txtlist:
        listfilename = one_word_query(item,wordindex)
        for t in listfilename:
            toreturn[t] +=1
    num_of_files = len([iq for iq in os.scandir('Data/')])
    for i in range(0,num_of_files):
        tx = keywithmaxval(toreturn)
        print("filename ", tx ," score ", toreturn[tx])
        return tx
        if toreturn[tx] != 0:
            del toreturn[tx]

Answer 1

For nltk, default_download_dir is PYTHONHOME\lib\nltk on Windows or one of various /usr directories, falling back to the home directory, for the rest of the world. In a production application you would want to separate the download() step into the setup code rather than the application code.
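As a rough sketch of that separation, the downloads could live in a small setup helper that is run once at install or deploy time, so the search code itself never touches the network. The names download_nltk_data, REQUIRED_PACKAGES and the downloader parameter are hypothetical, introduced only for illustration; nltk.download with its download_dir and quiet arguments is the real call being wrapped.

```python
# Hypothetical setup helper, kept apart from the search application itself.
REQUIRED_PACKAGES = ("stopwords", "punkt")


def download_nltk_data(packages=REQUIRED_PACKAGES, download_dir=None, downloader=None):
    """Fetch each NLTK data package once, as an explicit setup step.

    `downloader` defaults to nltk.download; it is a parameter only so the
    helper can be exercised without network access (an assumption of this sketch).
    """
    if downloader is None:
        import nltk  # imported lazily so only the setup step needs it here
        downloader = nltk.download
    for package in packages:
        # download_dir=None lets nltk fall back to its default download directory.
        downloader(package, download_dir=download_dir, quiet=True)
```

A setup script would call download_nltk_data() once; the application would then only import word_tokenize and stopwords and assume the data is already present.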

Otherwise,

Consider replacing os.listdir and related functions with equivalent but more sugar-y calls to pathlib
From your open() call onwards, that code belongs in one or more methods rather than the global namespace
Don't call re.compile from the inside of a loop; pre-compile your expression at the beginning of the method. Also, after doing this (repeated) compilation, you throw away the results and call re.sub. Use pattern.sub instead.
No need to semicolon-terminate your statements
Use with on your file opens
Consider adding PEP 484 type hints
listdata = [] is redundant and can be deleted because you trample right over it on the next line; same with A={}
fileIndex would be file_index by PEP 8
[filename for filename in invertedIndex[word].keys()] can be list(inverted_index[word])
dic is a poor name for a variable; it says the type but not what's inside
[word for word in txtlist if not word in stopwords.words('english')] would be better-represented by set subtraction
range(0,num_of_files) can drop the 0 since that's default
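
Several of the points above (pathlib, with, a pattern compiled once, type hints, set subtraction) can be sketched together. This is one possible shape under stated assumptions, not the only way to do it: NON_WORD, tokenize and query_terms are hypothetical names, STOP_WORDS is a tiny stand-in for stopwords.words('english'), and the built-in generic hints assume Python 3.9 or later.

```python
import re
from pathlib import Path

# Compiled once at module level instead of on every loop iteration.
NON_WORD = re.compile(r"[\W_]+")

# Stand-in for set(stopwords.words('english')); an assumption for illustration.
STOP_WORDS = frozenset({"the", "and", "a", "of", "in"})


def tokenize(text: str) -> list[str]:
    """Lower-case the text and split it on runs of non-word characters."""
    return NON_WORD.sub(" ", text.lower()).split()


def process_files(directory: Path) -> dict[str, list[str]]:
    """Map each .txt file name in `directory` to its list of terms."""
    file_to_terms: dict[str, list[str]] = {}
    for path in sorted(directory.glob("*.txt")):
        # The with block closes the file even if reading raises.
        with path.open(encoding="utf-8") as f:
            file_to_terms[path.name] = tokenize(f.read())
    return file_to_terms


def query_terms(query: str) -> set[str]:
    """Return the distinct query terms with stop words removed."""
    return set(tokenize(query)) - STOP_WORDS
```

Note that set subtraction discards word order and duplicates. That is fine for the one-point-per-matching-term scoring in startQuery, but it would not suit a query that depends on word positions.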

This block:

    return tx
    if toreturn[tx] != 0:
        del toreturn[tx]

will never see the last two statements executed.
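
If the intent of that loop was to report every file in descending score order, one way to get there without the unreachable code is to count matches and sort once. This is a minimal sketch that guesses at the intent: rank_files is a hypothetical name, and the index layout (word to filename to list of positions) is assumed to match what fullIndex builds in the question.

```python
from collections import Counter


def rank_files(query_terms, inverted_index):
    """Score each file by how many query terms it contains, best first.

    `inverted_index` is assumed to map word -> {filename: [positions]}.
    """
    scores = Counter()
    for term in query_terms:
        # Terms missing from the index simply contribute nothing.
        for filename in inverted_index.get(term, {}):
            scores[filename] += 1
    # most_common() returns (filename, score) pairs sorted by descending score.
    return scores.most_common()
```

Unlike the original, this leaves out files that match no term at all; they could be added back with a score of 0 if the caller needs them. The caller can print or return the pairs as it sees fit.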

Reinderien · Answer 1 · 2021-05-26 00:29:31Z
