Another popular feature selection method is $\chi ^2$.
In statistics, the $\chi ^2$ test is
applied to test the independence of two events,
where two events A and B are defined to be
independent if
$P(AB) = P(A)P(B)$ or, equivalently,
$P(A\vert B)=P(A)$ and
$P(B\vert A)=P(B)$. In
feature selection, the two events are occurrence of the
term and occurrence of the class.
We then rank terms with respect to the following
quantity:

$X^2(\docsetlabeled, \tcword, c) = \sum_{e_\tcword \in \{0,1\}} \sum_{e_c \in \{0,1\}} \frac{(\observationo_{e_\tcword e_c} - E_{e_\tcword e_c})^2}{E_{e_\tcword e_c}}$     (133)
where $e_\tcword$ and $e_c$ are defined as in Equation
130. $\observationo$
is the
observed frequency in
$\docsetlabeled$ and $E$ the
expected frequency. For example, $E_{11}$ is the
expected frequency of $\tcword$ and $c$ occurring together
in a document assuming that term and class are independent.
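To make the definition concrete, the following short Python sketch computes the $X^2$ statistic of Equation 133 from the four observed counts of a term-class contingency table. The function and argument names (chi2_term_class, n11, n10, n01, n00) are illustrative choices, not notation from the book; n11 corresponds to $\observationo_{11}$ and so on.

# Minimal sketch of Equation 133 (illustrative; names are not from the book).
# n11, n10, n01, n00 are the observed document counts for the four
# combinations of term occurrence (e_t = 1/0) and class membership (e_c = 1/0).
def chi2_term_class(n11, n10, n01, n00):
    n = n11 + n10 + n01 + n00                 # total number of documents
    x2 = 0.0
    for observed, row_total, col_total in [
        (n11, n11 + n10, n11 + n01),          # e_t = 1, e_c = 1
        (n10, n11 + n10, n10 + n00),          # e_t = 1, e_c = 0
        (n01, n01 + n00, n11 + n01),          # e_t = 0, e_c = 1
        (n00, n01 + n00, n10 + n00),          # e_t = 0, e_c = 0
    ]:
        expected = row_total * col_total / n  # expected count under independence
        x2 += (observed - expected) ** 2 / expected
    return x2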
Worked example. We first
compute $E_{11}$ for the
data in Example 13.5.1:

$E_{11} = N \times P(\tcword) \times P(c) = N \times \frac{\observationo_{11}+\observationo_{10}}{N} \times \frac{\observationo_{11}+\observationo_{01}}{N} = N \times \frac{49+27{,}652}{N} \times \frac{49+141}{N} \approx 6.6$

where $N$ is the total number of documents as before.
We compute the other
$E_{e_\tcword e_c}$ in the same way:
                           $e_{\class{poultry}}=1$                                $e_{\class{poultry}}=0$
$e_{\term{export}} = 1$    $\observationo_{11}=49$       $E_{11}\approx 6.6$       $\observationo_{10} = 27{,}652$    $E_{10}\approx 27{,}694.4$
$e_{\term{export}} = 0$    $\observationo_{01} = 141$    $E_{01}\approx 183.4$     $\observationo_{00}=774{,}106$     $E_{00}\approx 774{,}063.6$
Plugging these values into
Equation 133, we get an $X^2$ value of approximately 284:

$X^2(\docsetlabeled, \tcword, c) = \frac{(49-6.6)^2}{6.6} + \frac{(27{,}652-27{,}694.4)^2}{27{,}694.4} + \frac{(141-183.4)^2}{183.4} + \frac{(774{,}106-774{,}063.6)^2}{774{,}063.6} \approx 284$
End worked example.
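As a quick numerical check, reusing the chi2_term_class sketch given above on the counts in the table reproduces the result of the worked example:

print(round(chi2_term_class(n11=49, n10=27_652, n01=141, n00=774_106)))
# prints 284, the same value as obtained from Equation 133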
$X^2$ is a measure of how much expected counts $E$ and observed
counts $\observationo$ deviate from each other. A high value of $X^2$
indicates that the hypothesis of independence, which implies
that expected and observed counts are similar, is
incorrect. In our example,
$X^2 \approx 284 > 10.83$. Based
on Table 13.6, we can reject the hypothesis that
poultry and export are independent with only
a 0.001 chance of being wrong. (Equivalently, we say that the outcome
$X^{\kern.5pt2} \approx 284 > 10.83$ is statistically
significant at the 0.001 level.) If the two events are
dependent, then the occurrence of the term makes the occurrence
of the class more likely (or less likely), so it should be
helpful as a feature. This is the rationale of $\chi ^2$
feature selection.
Table 13.6:
Critical values of the $\chi ^2$
distribution with one degree of freedom. For example, if
the two events are
independent, then
$P(X^{\kern .5pt2}>6.63)<0.01$. So for
$X^{\kern .5pt2}>6.63$
the assumption of independence can be rejected with 99% confidence.
$p$        $\chi ^2$ critical value
0.1        2.71
0.05       3.84
0.01       6.63
0.005      7.88
0.001      10.83
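The table lends itself to a simple lookup. The helper below is only an illustrative sketch (significance_level is our name, not the book's); it returns the smallest level in Table 13.6 at which the independence hypothesis can be rejected for a given $X^{\kern.5pt2}$ value.

# Critical values from Table 13.6 (one degree of freedom), largest first.
CRITICAL_VALUES = [(0.001, 10.83), (0.005, 7.88), (0.01, 6.63), (0.05, 3.84), (0.1, 2.71)]

def significance_level(x2):
    # Return the smallest tabulated p at which independence can be rejected.
    for p, critical in CRITICAL_VALUES:
        if x2 > critical:
            return p
    return None  # not significant at any level in the table

print(significance_level(284))   # 0.001, as in the poultry/export example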
An arithmetically simpler way of computing
$X^{\kern.5pt2}$ is the
following:

$X^2(\docsetlabeled, \tcword, c) = \frac{(\observationo_{11}+\observationo_{10}+\observationo_{01}+\observationo_{00}) \times (\observationo_{11}\observationo_{00} - \observationo_{10}\observationo_{01})^2}{(\observationo_{11}+\observationo_{01}) \times (\observationo_{11}+\observationo_{10}) \times (\observationo_{10}+\observationo_{00}) \times (\observationo_{01}+\observationo_{00})}$

This is equivalent to Equation 133 (Exercise 13.6).
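A sketch of this simpler form, with the same illustrative naming as before; on the counts of the worked example it again yields a value of about 284, as the equivalence with Equation 133 requires.

def chi2_term_class_simple(n11, n10, n01, n00):
    # Arithmetically simpler form of the X^2 statistic for a 2x2 table.
    n = n11 + n10 + n01 + n00
    numerator = n * (n11 * n00 - n10 * n01) ** 2
    denominator = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    return numerator / denominator

print(round(chi2_term_class_simple(49, 27_652, 141, 774_106)))   # 284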