
This program reads the txt files in a folder called Data and returns each filename with a score, but I don't think it's very good.

Can I get feedback on this program? Is there any way it could be written better, or anything that should be added?

import os, nltk, re
from nltk import word_tokenize
from nltk.corpus import stopwords
nltk.download('stopwords') # helps remove stop words such as "the", "and" etc
nltk.download('punkt') # Punkt sentence tokenizer, helps break down text
files = os.listdir("Data/")
f = open("words.txt","r")
contents = f.readlines()
vocab = []
for word in contents:
    vocab.append(word[0:-1].lower())
word_set = set(vocab)
def process_files(dir, filenames):
    file_to_terms = {}
    for file in filenames:
        pattern = re.compile('[\W_]+')
        name = dir + file
        file_to_terms[file] = open(name, 'r').read().lower();
        file_to_terms[file] = pattern.sub(' ',file_to_terms[file])
        re.sub(r'[\W_]+','', file_to_terms[file])
        file_to_terms[file] = file_to_terms[file].split()
    return file_to_terms
listdata = []
listdata = process_files("Data/",files)
print("storing keywords in dictionary done")
def index_one_file(termlist):
    fileIndex = {}
    for index, word in enumerate(termlist):
        if word in fileIndex.keys():
            fileIndex[word].append(index)
        else:
            fileIndex[word] = [index]
    return fileIndex
def make_indices(termlists):
    total = {}
    for filename in termlists.keys():
        total[filename] = index_one_file(termlists[filename])
    return total
indexwordallfiles = make_indices(listdata)
print("constructing inverted index.")
def fullIndex(regdex):
    total_index = {}
    for filename in regdex.keys():
        for word in regdex[filename].keys():
            if word in total_index.keys():
                if filename in total_index[word].keys():
                    total_index[word][filename].extend(regdex[filename][word][:])
                else:
                    total_index[word][filename] = regdex[filename][word]
            else:
                total_index[word] = {filename: regdex[filename][word]}
    return total_index
wordindex = fullIndex(indexwordallfiles)
print("now proceeding with the query part")
def one_word_query(word, invertedIndex):
    pattern = re.compile('[\W_]+')
    word = pattern.sub(' ',word)
    if word in invertedIndex.keys():
        return [filename for filename in invertedIndex[word].keys()]
    else:
        return []
def free_text_query(string):
    pattern = re.compile('[\W_]+')
    string = pattern.sub(' ',string.lower())
    result = []
    print(" returning intersection of files")
    for word in string.split():
        result.append(set(one_word_query(word,wordindex)))
    A={}
    A = result[0].intersection(result[1])
    for i in range(1,len(result)-1):
        A = A.intersection(result[i+1])
    return list(A)
k = len(files)
dic = {}
for item in wordindex:
    k = 0
    for fil in wordindex[item]:
        k += (len(wordindex[item][fil]))
    dic[item] = k
print(len(dic))
def keywithmaxval(d):
    """ a) create a list of the dict's keys and values;
    b) return the key with the max value"""
    try:
        v = list(d.values())
        k = list(d.keys())
        value = k[v.index(max(v))]
        return value
    except:
        return
def startQuery(query):
    pattern = re.compile('[\W_]+')
    query = pattern.sub(' ',query.lower())
    txtlist = word_tokenize(str(query))
    txtlist = [word for word in txtlist if not word in stopwords.words('english')]
    toreturn = {}
    for f in files:
        toreturn[f]=0
    for item in txtlist:
        listfilename = one_word_query(item,wordindex)
        for t in listfilename:
            toreturn[t] +=1
    num_of_files = len([iq for iq in os.scandir('Data/')])
    for i in range(0,num_of_files):
        tx = keywithmaxval(toreturn)
        print("filename ", tx ," score ", toreturn[tx])
        return tx
        if toreturn[tx] != 0:
            del toreturn[tx]
Reinderien
asked May 25, 2021 at 23:20

1 Answer

For nltk, default_download_dir is PYTHONHOME\lib\nltk on Windows; everywhere else it is one of several /usr directories, falling back to the home directory. In a production application you would want to move the download() step out of the application code and into the setup code.
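As a rough sketch of that separation, the downloads could live in a small setup module that is run once at install or deployment time. The module name `setup_nltk.py`, the `REQUIRED_RESOURCES` table and the `ensure_nltk_resources` helper are all hypothetical names chosen for illustration; the resource paths assume the same `stopwords` and `punkt` packages the question downloads.

```python
"""setup_nltk.py: hypothetical one-off setup module, not part of the application."""

# Resource path (as looked up by nltk.data.find) -> package name (as passed to
# nltk.download). Assumption: these are the only two resources the program needs.
REQUIRED_RESOURCES = {
    "corpora/stopwords": "stopwords",
    "tokenizers/punkt": "punkt",
}


def ensure_nltk_resources(resources=REQUIRED_RESOURCES):
    """Download any NLTK resource that is missing; return the names fetched."""
    import nltk  # imported lazily so this module loads even before NLTK is installed

    fetched = []
    for path, package in resources.items():
        try:
            nltk.data.find(path)      # raises LookupError if not installed
        except LookupError:
            nltk.download(package)    # needs network access
            fetched.append(package)
    return fetched
```

Calling `ensure_nltk_resources()` once from an install or deployment step would then let the application itself drop its `nltk.download` calls.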

Otherwise,

  • Consider replacing os.listdir and related functions with equivalent but more sugar-y calls to pathlib
  • From your open() call onwards, that code belongs in one or more methods rather than the global namespace
  • Don't call re.compile from the inside of a loop; pre-compile your expression once at the beginning of the method. Also, in process_files the standalone re.sub call spells the pattern out again instead of using the compiled one, and its return value is thrown away, so it does nothing. Use pattern.sub (and keep the result) instead.
  • No need to semicolon-terminate your statements
  • Use with on your file opens
  • Consider adding PEP 484 type hints
  • listdata = [] is redundant and can be deleted because you trample right over it on the next line; same with A={}
  • fileIndex would be file_index by PEP 8
  • [filename for filename in invertedIndex[word].keys()] can be list(inverted_index[word])
  • dic is a poor name for a variable; it says the type but not what's inside
  • [word for word in txtlist if not word in stopwords.words('english')] would be better-represented by set subtraction
  • range(0,num_of_files) can drop the 0 since that's default
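
Several of the points above (pathlib, `with`, a pattern compiled once, type hints) could be combined roughly as follows. This is only a sketch, not a drop-in replacement: the names `NON_WORD` and `tokenize` are made up for illustration, the `*.txt` filter is an assumption based on the question's description of the Data folder, and the built-in generic type hints assume Python 3.9 or later.

```python
import re
from pathlib import Path

# Compiled once at import time instead of once per file.
NON_WORD = re.compile(r"[\W_]+")


def tokenize(text: str) -> list[str]:
    """Lower-case the text and split it on runs of non-word characters."""
    return NON_WORD.sub(" ", text.lower()).split()


def process_files(data_dir: Path) -> dict[str, list[str]]:
    """Map each .txt file name in data_dir to its list of terms."""
    file_to_terms: dict[str, list[str]] = {}
    for path in sorted(data_dir.glob("*.txt")):
        # The with block closes the file even if reading raises.
        with path.open("r") as f:
            file_to_terms[path.name] = tokenize(f.read())
    return file_to_terms
```

With this shape, the module-level call becomes something like `listdata = process_files(Path("Data"))`, and the same `tokenize` helper can be reused for the query strings.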

This block:

    return tx
    if toreturn[tx] != 0:
        del toreturn[tx]

will never see the last two statements executed.
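
If the intent of that loop was to report every matching file from best to worst, one possible shape is sketched below. `rank_files` is a hypothetical name, and the behaviour differs from the original in ways that are assumptions on my part: repeated query words are counted once, and files with a score of zero are left out rather than printed. It also uses the set subtraction mentioned earlier for stop words, taking them as a parameter so the function does not depend on nltk.

```python
from collections import Counter


def rank_files(query_terms, inverted_index, stop_words=frozenset()):
    """Return (filename, score) pairs, highest score first.

    The score is the number of distinct non-stop-word query terms
    that occur in the file.
    """
    # Set subtraction drops stop words (and duplicate terms) in one step.
    terms = set(query_terms) - set(stop_words)
    scores = Counter()
    for term in terms:
        # inverted_index maps term -> {filename: [positions]}
        scores.update(inverted_index.get(term, {}).keys())
    return scores.most_common()
```

The caller can then print or return the whole ranked list, for example `rank_files(txtlist, wordindex, stopwords.words('english'))`, instead of returning from inside the loop on the first iteration.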

answered May 26, 2021 at 0:29
