I am very new to Python. I am trying to write a function that I can reuse in later parts of my code. The function should:
- find the cosine similarity between the elements of two lists
- add the values to a list and calculate the mean
- append the mean values to a list
- return the list of means
I would then like to make calculations based on the list returned by this function. However, the function (i.e. knearest_similarity(tfidf_datamatrix)) does not return anything, and the print commands in the second function (i.e. threshold_function()) do not show anything. Can someone please have a look at the code and tell me what I am doing wrong?
def knearest_similarity(tfidf_datamatrix):
    k_nearest_cosineMean = []
    for datavector in tfidf_datamatrix:
        cosineValueSet = []
        for trainingvector in tfidf_vectorizer_trainingset:
            cosineValue = cx(datavector, trainingvector)
            cosineValueSet.append(cosineValue)
        similarityMean_of_k_nearest_neighbours = np.mean(heapq.nlargest(k_nearest_neighbours, cosineValueSet)) #the cosine similarity score of top k nearest neighbours
        k_nearest_cosineMean.append(similarityMean_of_k_nearest_neighbours)
    print k_nearest_cosineMean
    return k_nearest_cosineMean
def threshold_function():
    mean_cosineScore_mean = np.mean(knearest_similarity(tfidf_matrix_testset))
    std_cosineScore_mean = np.std(knearest_similarity(tfidf_matrix_testset))
    threshold = mean_cosineScore_mean - (3*std_cosineScore_mean)
    print "The Mean of the mean of cosine similarity score for a normal Behaviour:", mean_cosineScore_mean #The mean will be used for finding the threshold
    print "The standard deviation of the mean of cosine similarity score:", std_cosineScore_mean #The standard deviation is also used to find the threshold
    print "The threshold for normal behaviour should be (Mean - 3*standard deviation):", threshold
    return threshold
EDIT
I tried defining two global variables for the functions to use (i.e. tfidf_vectorizer_trainingset and tfidf_matrix_testset).
#fitting tfidf transform for training data
tfidf_vectorizer_trainingset = tfidf_vectorizer.fit_transform(readfile(trainingdataDir)).toarray()
#tfidf transform the test set based on the training set
tfidf_matrix_testset = tfidf_vectorizer.transform(readfile(testingdataDir)).toarray()
However, the print commands in threshold_function() now produce the following output:
The Mean of the mean of cosine similarity score for a normal Behaviour: nan
The standard deviation of the mean of cosine similarity score: nan
The threshold for normal behaviour should be (Mean - 3*standard deviation): nan
EDIT2
I found that the first value in k_nearest_cosineMean was nan. After deleting that value I managed to get valid calculations.
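One plausible source of that nan, assuming cx divides the dot product by the product of the two vector norms (its definition is not shown here): a row with no nonzero TF-IDF weights has norm 0, and NumPy floating-point division of 0 by 0 yields nan. A single nan then propagates through every mean computed from the list. The sketch below illustrates both points using only the standard library. The helper names cosine_similarity and mean_ignoring_nan are hypothetical, and returning 0.0 for an all-zero vector is an assumption about the desired behaviour, not the only reasonable choice.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: treat an all-zero vector as "not similar" instead of undefined.
        return 0.0
    return dot / (norm_a * norm_b)


def mean_ignoring_nan(values):
    """Mean of the values that are not NaN; NaN if no valid value remains."""
    valid = [v for v in values if not math.isnan(v)]
    if not valid:
        return float("nan")
    return sum(valid) / len(valid)


# Made-up scores: one NaN is enough to turn the plain mean into NaN.
scores = [float("nan"), 0.8, 0.6]
plain_mean = sum(scores) / len(scores)
clean_mean = mean_ignoring_nan(scores)

# A zero vector no longer produces an undefined score.
zero_doc_score = cosine_similarity([0.0, 0.0], [0.3, 0.7])
```

With NumPy arrays, np.nanmean and np.nanstd skip nan entries in the same spirit, although guarding the division itself avoids producing the nan in the first place.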
When you say that "the print commands ... do not show anything", do you literally mean that nothing is printed at all, or just that what is printed doesn't contain the numbers you want? It would be easier for others to help you if you could provide a minimal reproducible example and specific information about what you expect to see versus what you actually see.
– Daniel Pryden, Oct 27, 2015 at 4:53
1 Answer
In the first line of threshold_function() you call knearest_similarity(tfidf_matrix_testset), but you never define what tfidf_matrix_testset is. You do the same in the second line. The third line then uses the results of the first two lines. Give tfidf_matrix_testset a value before calling the function.
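A minimal sketch of one way to do that: pass both matrices (and k) in as arguments instead of relying on globals, and call knearest_similarity once so that the mean and the standard deviation are computed from the same list. The cosine helper is a stand-in for cx, whose definition is not shown in the question. The parameter names training_matrix and k and the toy data are made up for illustration. The code uses only the standard library and is written for Python 3; statistics.pstdev is the population standard deviation, which is also what np.std computes by default.

```python
import heapq
import math
from statistics import mean, pstdev


def cosine(a, b):
    """Stand-in for cx: cosine similarity, 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


def knearest_similarity(data_matrix, training_matrix, k):
    """For each data row, mean cosine similarity to its k most similar training rows."""
    means = []
    for data_vector in data_matrix:
        scores = [cosine(data_vector, t) for t in training_matrix]
        means.append(mean(heapq.nlargest(k, scores)))
    return means


def threshold_function(data_matrix, training_matrix, k):
    """Return mean - 3 * standard deviation of the per-row k-nearest means."""
    knn_means = knearest_similarity(data_matrix, training_matrix, k)  # computed once
    return mean(knn_means) - 3 * pstdev(knn_means)


# Toy data, invented for illustration only.
training = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
testing = [[1.0, 0.0], [3.0, 4.0]]

knn_means = knearest_similarity(testing, training, k=2)
threshold = threshold_function(testing, training, k=2)
```

Because every input is an argument, the functions fail with a clear NameError or TypeError at the call site if a matrix is missing, rather than silently depending on module-level state.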