I wanted to build a Inverse Document Frequency function, because in my opinion was not easy to do with scikit and I wanted also to show how it works for educational reasons.
Also reading this question on stack.
So I build this function. It works, not sure about the efficency of my solution.
Let me know what do you think.
def idf(corpora,term="None"):
"""A function for calculating IDF, it takes a list of list as corpora, a term "t"
and returns the relative IDF"""
occurences=sum([ 1 for sublist in test for x in sublist if term in x])
tot_doc=len(corpora)
if occurences==0:
print("Error")
else:
idf=round(1+np.log(tot_doc/occurences),3)
return idf
test=[['I like beer'],['I like craft beers'],['I like ALE craft beers'],
['If I have to choose a craft beer, I will take the IPA craft beer']]
beer=idf(corpora=test,term='beer')
craft_beer=idf(corpora=test,term='craft')
ipa=idf(corpora=test,term='IPA')
print(beer)
print(craft_beer)
print(ipa)
Output:
1.0
1.288
2.386
1 Answer 1
- Use of global variable
In your function, you usetest
from the global scope in the comprehension which can lead to nasty side effects. - Type hints and formatting
Type hints and proper formatting help future readers understand the code easier. Modern code editors and static code analysis tools can also use them. - Docstring
A function docstring also helps future users to understand the code. - Raise
ValueError
if no match
PrintingError
is not a good way to notify users of this function that something went wrong. What is the problem? Also if you dont return anything in this case, the function returnsNone
with no errors, which can lead to other, harder to find errors later. - Use generator in counting sum
You dont need to create a list to count the occurences, you can just use a generator expression inside thesum
. - No need for list of lists
Maybe you have a reason for this, but you dont need the list of lists data structure for your example. - Use python builtins
You dont need numpy for this task, using python builtins is enough. This way, you dont have to install numpy if you only want to use this function.
Putting it all together, my reviewed version looks like this.
import math
from typing import List
def inverse_document_frequency(corpora: List[str], term: str) -> float:
"""
Calculates Inverse Document Frequency.
Arguments:
corpora (List[str]): List of strings to check for term 'term'
term (str): Term to count in 'corpora'
Returns:
float: Inverse document frequency, defined as
1 + log(length of corpora / occurrences of term)
Rounded to 3 decimal places.
"""
occurences = sum(1 for string in corpora if term in string)
if occurences == 0:
raise ValueError(f"No occurrences of term '{term}' in 'corpora'.")
else:
return round(1 + math.log(len(corpora) / occurences), 3)
test = [
"I like beer",
"I like craft beers",
"I like ALE craft beers",
"If I have to choose a craft beer, I will take the IPA craft beer",
]
beer = inverse_document_frequency(corpora=test, term="beer")
craft_beer = inverse_document_frequency(corpora=test, term="craft")
ipa = inverse_document_frequency(corpora=test, term="IPA")
print(beer)
print(craft_beer)
print(ipa)
```
-
\$\begingroup\$ Thank you for the feedback they are very useful. The list of list is because in the code for other reasons I use
gensim
and it takes as input forTfidfModel
a list of list, so I tried to be consistent with this way to pass input documents from acorpora
. \$\endgroup\$Andrea Ciufo– Andrea Ciufo2021年02月20日 13:45:13 +00:00Commented Feb 20, 2021 at 13:45
Explore related questions
See similar questions with these tags.