
I am using Scikit-learn to train a classification model. I have both discrete and continuous features in my training data.

I want to do feature selection using mutual information.

Features 1, 2 and 3 are discrete. To this end, I try the code below:

mutual_info_classif(x, y, discrete_features=[1, 2, 3])

but it does not work; it gives me the error:

 ValueError: could not convert string to float: 'INT'
asked Nov 25, 2018 at 17:34
  • I have applied the code that W.P. McNeill proposed in stackoverflow.com/q/43643278, but it did not work. Commented Nov 25, 2018 at 17:42
  • We need more information in order to be able to help you. It might be useful if you copy a simplified example of your code. Commented Nov 25, 2018 at 18:14
  • This is my code: from sklearn.feature_selection import mutual_info_classif; res_M_train = mutual_info_classif(data_train, Y_train, discrete_features=[1, 2, 3]). Thank you. Commented Nov 25, 2018 at 18:23
  • My data is like this: [0.983874,tcp,http,FIN,10,8,816,1172,17.278635,62,252,5976.375,8342.53125,2,2,109.319333,124.932859,5929.211713,192.590406,255,794167371,1624757001,255,0.206572,0.108393,0.098179,82,147,1,184,2,1,1,1,1,2,0,0,1,1,3,0,]. As you can see, my first three features are categorical, and I want to calculate the mutual information of each feature: from sklearn.feature_selection import mutual_info_classif; res_M_train = mutual_info_classif(data_train, Y_train, discrete_features=[1, 2, 3]) Commented Nov 25, 2018 at 18:28

3 Answers


A simple example with the mutual information classifier:

import numpy as np
from sklearn.feature_selection import mutual_info_classif
X = np.array([[0, 0, 0],
              [1, 1, 0],
              [2, 0, 1],
              [2, 0, 1],
              [2, 0, 1]])
y = np.array([0, 1, 2, 2, 1])
mutual_info_classif(X, y, discrete_features=True)
# result: array([0.67301167, 0.22314355, 0.39575279])
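
For mixed data like the asker's, note that discrete_features also accepts column indices or a boolean mask instead of True, so only some columns are treated as discrete. A minimal sketch with invented data, assuming column 0 is continuous and columns 1 and 2 are discrete:

import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.RandomState(0)
# column 0: continuous; columns 1 and 2: discrete integer codes
X = np.column_stack([rng.rand(100),
                     rng.randint(0, 3, 100),
                     rng.randint(0, 2, 100)])
y = rng.randint(0, 2, 100)

# flag only the discrete columns; continuous ones use the k-NN estimator
mutual_info_classif(X, y, discrete_features=[1, 2], random_state=0)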
answered Nov 25, 2018 at 18:28

3 Comments

But I have mixed features, like this: X = np.array([[0, 'a', 0], [1, 'b', 0], [2, 'c', 1], [2, 'd', 1], [2, 'a', 1]])
This is a row from my data: [8e-06,"udp","-","INT",2,0,1762,0,125000.0003,254,0,881000000.0,0.0,0,0,0.008,0.0,0.0,0.0,0,0,0,0,0.0,0.0,0.0,881,0,0,0,2,2,1,1,1,2,0,0,0,1,2,0]. It seems that the three string features ("udp", "-", "INT") are causing the problem.
If you're using categories and you have string information, take a look at get_dummies; see the sketch below.
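
A minimal sketch of that suggestion, assuming the string columns are nominal: pd.get_dummies one-hot encodes them into 0/1 columns, which mutual_info_classif can then treat as discrete. The frame and column names below are invented for illustration:

import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# toy frame mimicking the asker's data: one numeric and three string columns
df = pd.DataFrame({'dur': [0.9, 0.1, 0.5, 0.2, 0.7, 0.3],
                   'proto': ['tcp', 'udp', 'tcp', 'udp', 'tcp', 'udp'],
                   'service': ['http', '-', 'http', 'dns', '-', 'dns'],
                   'state': ['FIN', 'INT', 'FIN', 'INT', 'FIN', 'INT']})
y = [0, 1, 0, 1, 0, 1]

# one-hot encode the string columns; numeric columns pass through untouched
X = pd.get_dummies(df, columns=['proto', 'service', 'state'])

# every dummy column is discrete (0/1); 'dur' stays continuous
mask = [col != 'dur' for col in X.columns]
mutual_info_classif(X, y, discrete_features=mask, random_state=0)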

mutual_info_classif can only take numeric data, so you need to label-encode the categorical features and then run the same code:

from sklearn.preprocessing import LabelEncoder

x1 = x.apply(LabelEncoder().fit_transform)

Then run the exact same call you were running:

mutual_info_classif(x1, y, discrete_features=[1, 2, 3])
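
A runnable version of that recipe as a sketch, on a toy frame invented for illustration. Note that applying LabelEncoder to every column, as in the one-liner above, would also turn continuous columns into integer ranks, so here only the string columns are encoded:

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_selection import mutual_info_classif

x = pd.DataFrame({'dur': [0.9, 0.1, 0.5, 0.2, 0.7, 0.3],
                  'proto': ['tcp', 'udp', 'tcp', 'udp', 'tcp', 'udp'],
                  'service': ['http', '-', 'http', 'dns', '-', 'dns'],
                  'state': ['FIN', 'INT', 'FIN', 'INT', 'FIN', 'INT']})
y = [0, 1, 0, 1, 0, 1]

x1 = x.copy()
for col in ['proto', 'service', 'state']:
    # each string column becomes arbitrary integer codes
    x1[col] = LabelEncoder().fit_transform(x1[col])

mutual_info_classif(x1, y, discrete_features=[1, 2, 3], random_state=0)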
answered Mar 30, 2020 at 1:26

2 Comments

Careful with that, @Jatin; referring to sklearn's docs: "This transformer should be used to encode target values, i.e. y, and not the input X." So maybe for this case it is a better option to use OrdinalEncoder (see the sketch below).
@rmoret Does it matter for calculating mutual information? "Not limited to real-valued random variables and linear dependence like the correlation coefficient, MI is more general and determines how different the joint distribution of the pair (X, Y) is from the product of the marginal distributions of X and Y. MI is the expected value of the pointwise mutual information (PMI)." (Mutual Information) Since we only care about shared information, ordering should not matter?
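
A compact sketch of the OrdinalEncoder variant suggested above, on the same kind of invented toy frame as earlier; unlike LabelEncoder, OrdinalEncoder is designed for feature matrices and encodes a 2-D block of columns in one call:

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.feature_selection import mutual_info_classif

x = pd.DataFrame({'dur': [0.9, 0.1, 0.5, 0.2, 0.7, 0.3],
                  'proto': ['tcp', 'udp', 'tcp', 'udp', 'tcp', 'udp'],
                  'service': ['http', '-', 'http', 'dns', '-', 'dns'],
                  'state': ['FIN', 'INT', 'FIN', 'INT', 'FIN', 'INT']})
y = [0, 1, 0, 1, 0, 1]

cat_cols = ['proto', 'service', 'state']
x1 = x.copy()
# encode all categorical columns at once; continuous 'dur' is untouched
x1[cat_cols] = OrdinalEncoder().fit_transform(x[cat_cols])

mutual_info_classif(x1, y, discrete_features=[1, 2, 3], random_state=0)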

There is a difference between 'discrete' and 'categorical'. In this case, the function demands that the data be numerical. You may be able to use a label encoder if you have ordinal features; otherwise you would have to one-hot encode the nominal features. You can use pd.get_dummies for this purpose.

answered Feb 9, 2020 at 5:41

1 Comment

Same here: does it matter whether you have ordinal features for calculating mutual information? "Not limited to real-valued random variables and linear dependence like the correlation coefficient, MI is more general and determines how different the joint distribution of the pair (X, Y) is from the product of the marginal distributions of X and Y. MI is the expected value of the pointwise mutual information (PMI)." (Mutual Information) Since we only care about shared information, ordering should not matter?
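
On that question, a quick empirical check is possible. When a column is flagged in discrete_features, the mutual information is computed from category counts, so permuting the integer codes should leave the score unchanged; the data and permutation below are arbitrary, chosen only for illustration:

import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.RandomState(0)
x = rng.randint(0, 4, size=(200, 1))   # one categorical feature with codes 0..3
y = (x[:, 0] % 2 == 0).astype(int)     # target depends on the category

perm = np.array([2, 0, 3, 1])          # arbitrary relabeling of the codes
x_perm = perm[x]

print(mutual_info_classif(x, y, discrete_features=True))
print(mutual_info_classif(x_perm, y, discrete_features=True))
# identical scores: MI is invariant to relabeling of discrete codes

If the same column were instead treated as continuous (discrete_features=False), the k-NN estimator would use distances between the codes, and then the relabeling could change the estimate.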
