1
\$\begingroup\$

I wanted to build a Inverse Document Frequency function, because in my opinion was not easy to do with scikit and I wanted also to show how it works for educational reasons.

Also reading this question on stack.

So I build this function. It works, not sure about the efficency of my solution.

Let me know what do you think.

def idf(corpora,term="None"): 
 """A function for calculating IDF, it takes a list of list as corpora, a term "t"
 and returns the relative IDF"""
 occurences=sum([ 1 for sublist in test for x in sublist if term in x])
 tot_doc=len(corpora)
 if occurences==0: 
 print("Error")
 else: 
 idf=round(1+np.log(tot_doc/occurences),3)
 return idf
test=[['I like beer'],['I like craft beers'],['I like ALE craft beers'],
 ['If I have to choose a craft beer, I will take the IPA craft beer']]
beer=idf(corpora=test,term='beer')
craft_beer=idf(corpora=test,term='craft')
ipa=idf(corpora=test,term='IPA')
print(beer)
print(craft_beer)
print(ipa)

Output:

 1.0
 1.288
 2.386
asked Feb 19, 2021 at 15:05
\$\endgroup\$

1 Answer 1

2
\$\begingroup\$
  • Use of global variable
    In your function, you use test from the global scope in the comprehension which can lead to nasty side effects.
  • Type hints and formatting
    Type hints and proper formatting help future readers understand the code easier. Modern code editors and static code analysis tools can also use them.
  • Docstring
    A function docstring also helps future users to understand the code.
  • Raise ValueError if no match
    Printing Error is not a good way to notify users of this function that something went wrong. What is the problem? Also if you dont return anything in this case, the function returns None with no errors, which can lead to other, harder to find errors later.
  • Use generator in counting sum
    You dont need to create a list to count the occurences, you can just use a generator expression inside the sum.
  • No need for list of lists
    Maybe you have a reason for this, but you dont need the list of lists data structure for your example.
  • Use python builtins
    You dont need numpy for this task, using python builtins is enough. This way, you dont have to install numpy if you only want to use this function.

Putting it all together, my reviewed version looks like this.

import math
from typing import List
def inverse_document_frequency(corpora: List[str], term: str) -> float:
 """
 Calculates Inverse Document Frequency.
 Arguments:
 corpora (List[str]): List of strings to check for term 'term'
 term (str): Term to count in 'corpora'
 Returns:
 float: Inverse document frequency, defined as
 1 + log(length of corpora / occurrences of term)
 Rounded to 3 decimal places.
 """
 occurences = sum(1 for string in corpora if term in string)
 if occurences == 0:
 raise ValueError(f"No occurrences of term '{term}' in 'corpora'.")
 else:
 return round(1 + math.log(len(corpora) / occurences), 3)
test = [
 "I like beer",
 "I like craft beers",
 "I like ALE craft beers",
 "If I have to choose a craft beer, I will take the IPA craft beer",
]
beer = inverse_document_frequency(corpora=test, term="beer")
craft_beer = inverse_document_frequency(corpora=test, term="craft")
ipa = inverse_document_frequency(corpora=test, term="IPA")
print(beer)
print(craft_beer)
print(ipa)
```
answered Feb 19, 2021 at 17:50
\$\endgroup\$
1
  • \$\begingroup\$ Thank you for the feedback they are very useful. The list of list is because in the code for other reasons I use gensim and it takes as input for TfidfModel a list of list, so I tried to be consistent with this way to pass input documents from a corpora. \$\endgroup\$ Commented Feb 20, 2021 at 13:45

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.