Next: Frequency-based feature selection
Up: $\chi ^2$ Feature selection
Previous: $\chi ^2$ Feature selection
From a statistical point of view, $\chi ^2$ feature selection
is problematic. For a test with one degree of freedom, the
so-called Yates correction should be used (see Section 13.7),
which makes it harder to reach
statistical significance. Also, whenever a statistical test
is used multiple times, the probability of getting at
least one error increases. If 1,000 hypotheses are rejected,
each with 0.05 error probability, then
$0.05 \times 1000=50$ calls of the test will be
wrong on average. However, in text classification it rarely
matters whether a few additional terms are added to the
feature set or removed from it. Rather, what matters is the
relative importance of features.
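The two statistical caveats above can be made concrete with a small sketch. It assumes a 2x2 term/class contingency table with hypothetical counts, applies the Yates correction by shrinking each |observed - expected| difference by 0.5, and reproduces the expected number of wrong calls from the text. The function names are illustrative, and the "at least one error" formula additionally assumes that the tests are independent, which the text does not state.

```python
def chi2_2x2(n11, n10, n01, n00, yates=False):
    """Chi-square statistic for a 2x2 table of counts (one degree of freedom)."""
    n = n11 + n10 + n01 + n00
    observed = [[n11, n10], [n01, n00]]
    row_totals = [n11 + n10, n01 + n00]
    col_totals = [n11 + n01, n10 + n00]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            diff = abs(observed[i][j] - expected)
            if yates:
                # Yates correction: shrink each deviation by 0.5 (not below 0).
                diff = max(diff - 0.5, 0.0)
            stat += diff * diff / expected
    return stat


def expected_false_rejections(num_tests, alpha):
    """Average number of wrong calls when each test errs with probability alpha."""
    return num_tests * alpha


def prob_at_least_one_error(num_tests, alpha):
    """Chance of at least one wrong call, ASSUMING the tests are independent."""
    return 1.0 - (1.0 - alpha) ** num_tests


# Hypothetical counts, chosen only for illustration.
plain = chi2_2x2(12, 8, 5, 15)                  # about 5.01
corrected = chi2_2x2(12, 8, 5, 15, yates=True)  # about 3.68
print(plain, corrected)
print(expected_false_rejections(1000, 0.05))
print(prob_at_least_one_error(1000, 0.05))
```

With these counts the uncorrected statistic lies above 3.84, the critical value for significance at the 0.05 level with one degree of freedom, while the corrected statistic lies below it. This is the sense in which the correction makes significance harder to reach.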
As long as $\chi ^2$ feature selection
only ranks features with respect to their usefulness and is
not used to make statements about statistical dependence or
independence of variables, we need not be overly concerned
that it does not adhere strictly to statistical theory.
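Using $\chi ^2$ purely as a ranking criterion might look like the following minimal sketch. The terms and scores are made-up values, and `select_top_k` is a hypothetical helper rather than part of any particular library. No significance threshold appears anywhere: only the order of the scores determines which terms are kept.

```python
def select_top_k(scores, k):
    """Return the k terms with the highest scores, best first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [term for term, _ in ranked[:k]]


# Hypothetical chi-square scores for a handful of terms and one class.
chi2_scores = {
    "export": 284.0,
    "poultry": 150.2,
    "the": 0.4,
    "tariff": 96.7,
    "and": 1.1,
}

print(select_top_k(chi2_scores, 3))
```

Because only the ordering is used, applying the Yates correction or adjusting for multiple tests would matter for the selected feature set only if it changed the relative order of the terms.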
© 2008 Cambridge University Press
2009-04-07