IDF Function with a list of list

Question 1

I wanted to build a Inverse Document Frequency function, because in my opinion was not easy to do with scikit and I wanted also to show how it works for educational reasons.

Also reading this question on stack.

So I build this function. It works, not sure about the efficency of my solution.

Let me know what do you think.

def idf(corpora,term="None"): 
 """A function for calculating IDF, it takes a list of list as corpora, a term "t"
 and returns the relative IDF"""
 occurences=sum([ 1 for sublist in test for x in sublist if term in x])
 tot_doc=len(corpora)
 if occurences==0: 
 print("Error")
 else: 
 idf=round(1+np.log(tot_doc/occurences),3)
 return idf
test=[['I like beer'],['I like craft beers'],['I like ALE craft beers'],
 ['If I have to choose a craft beer, I will take the IPA craft beer']]
beer=idf(corpora=test,term='beer')
craft_beer=idf(corpora=test,term='craft')
ipa=idf(corpora=test,term='IPA')
print(beer)
print(craft_beer)
print(ipa)

Output:

 1.0
 1.288
 2.386

Question 2

Use of global variable
In your function, you use test from the global scope in the comprehension which can lead to nasty side effects.
Type hints and formatting
Type hints and proper formatting help future readers understand the code easier. Modern code editors and static code analysis tools can also use them.
Docstring
A function docstring also helps future users to understand the code.
Raise ValueError if no match
Printing Error is not a good way to notify users of this function that something went wrong. What is the problem? Also if you dont return anything in this case, the function returns None with no errors, which can lead to other, harder to find errors later.
Use generator in counting sum
You dont need to create a list to count the occurences, you can just use a generator expression inside the sum.
No need for list of lists
Maybe you have a reason for this, but you dont need the list of lists data structure for your example.
Use python builtins
You dont need numpy for this task, using python builtins is enough. This way, you dont have to install numpy if you only want to use this function.

Putting it all together, my reviewed version looks like this.

import math
from typing import List
def inverse_document_frequency(corpora: List[str], term: str) -> float:
 """
 Calculates Inverse Document Frequency.
 Arguments:
 corpora (List[str]): List of strings to check for term 'term'
 term (str): Term to count in 'corpora'
 Returns:
 float: Inverse document frequency, defined as
 1 + log(length of corpora / occurrences of term)
 Rounded to 3 decimal places.
 """
 occurences = sum(1 for string in corpora if term in string)
 if occurences == 0:
 raise ValueError(f"No occurrences of term '{term}' in 'corpora'.")
 else:
 return round(1 + math.log(len(corpora) / occurences), 3)
test = [
 "I like beer",
 "I like craft beers",
 "I like ALE craft beers",
 "If I have to choose a craft beer, I will take the IPA craft beer",
]
beer = inverse_document_frequency(corpora=test, term="beer")
craft_beer = inverse_document_frequency(corpora=test, term="craft")
ipa = inverse_document_frequency(corpora=test, term="IPA")
print(beer)
print(craft_beer)
print(ipa)
```

Question 3

Thank you for the feedback they are very useful. The list of list is because in the code for other reasons I use gensim and it takes as input for TfidfModel a list of list, so I tried to be consistent with this way to pass input documents from a corpora.

Tweakimp Tweakimp 7791 gold badge6 silver badges16 bronze badges · Accepted Answer · 2021-02-19 17:50:25Z

Use of global variable
In your function, you use test from the global scope in the comprehension which can lead to nasty side effects.
Type hints and formatting
Type hints and proper formatting help future readers understand the code easier. Modern code editors and static code analysis tools can also use them.
Docstring
A function docstring also helps future users to understand the code.
Raise ValueError if no match
Printing Error is not a good way to notify users of this function that something went wrong. What is the problem? Also if you dont return anything in this case, the function returns None with no errors, which can lead to other, harder to find errors later.
Use generator in counting sum
You dont need to create a list to count the occurences, you can just use a generator expression inside the sum.
No need for list of lists
Maybe you have a reason for this, but you dont need the list of lists data structure for your example.
Use python builtins
You dont need numpy for this task, using python builtins is enough. This way, you dont have to install numpy if you only want to use this function.

Putting it all together, my reviewed version looks like this.

import math
from typing import List
def inverse_document_frequency(corpora: List[str], term: str) -> float:
 """
 Calculates Inverse Document Frequency.
 Arguments:
 corpora (List[str]): List of strings to check for term 'term'
 term (str): Term to count in 'corpora'
 Returns:
 float: Inverse document frequency, defined as
 1 + log(length of corpora / occurrences of term)
 Rounded to 3 decimal places.
 """
 occurences = sum(1 for string in corpora if term in string)
 if occurences == 0:
 raise ValueError(f"No occurrences of term '{term}' in 'corpora'.")
 else:
 return round(1 + math.log(len(corpora) / occurences), 3)
test = [
 "I like beer",
 "I like craft beers",
 "I like ALE craft beers",
 "If I have to choose a craft beer, I will take the IPA craft beer",
]
beer = inverse_document_frequency(corpora=test, term="beer")
craft_beer = inverse_document_frequency(corpora=test, term="craft")
ipa = inverse_document_frequency(corpora=test, term="IPA")
print(beer)
print(craft_beer)
print(ipa)
```

Thank you for the feedback they are very useful. The list of list is because in the code for other reasons I use gensim and it takes as input for TfidfModel a list of list, so I tried to be consistent with this way to pass input documents from a corpora.

Stack Exchange Network

IDF Function with a list of list

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

IDF Function with a list of list

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions