Bayes classifier
In statistical classification, the Bayes classifier is the classifier that achieves the smallest probability of misclassification among all classifiers using the same set of features.[1]
Definition
Suppose a pair $(X, Y)$ takes values in $\mathbb{R}^d \times \{1, 2, \dots, K\}$, where $Y$ is the class label of an element whose features are given by $X$. Assume that the conditional distribution of $X$, given that the label $Y$ takes the value $r$, is given by
$$(X \mid Y = r) \sim P_r \quad \text{for} \quad r = 1, 2, \dots, K,$$
where "$\sim$" means "is distributed as", and where $P_r$ denotes a probability distribution.
A classifier is a rule that assigns to an observation $X = x$ a guess or estimate of what the unobserved label $Y = r$ actually was. In theoretical terms, a classifier is a measurable function $C : \mathbb{R}^d \to \{1, 2, \dots, K\}$, with the interpretation that $C$ classifies the point $x$ to the class $C(x)$. The probability of misclassification, or risk, of a classifier $C$ is defined as
$$\mathcal{R}(C) = \operatorname{P}\{C(X) \neq Y\}.$$
The Bayes classifier is
$$C^{\text{Bayes}}(x) = \underset{r \in \{1, 2, \dots, K\}}{\operatorname{argmax}} \operatorname{P}(Y = r \mid X = x).$$
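To make the definition concrete, here is a minimal sketch in Python, assuming a hypothetical two-class problem on the real line in which the priors and the class-conditional densities $P_r$ are known exactly (in practice they must be modeled); the parameter values are made up for illustration.

```python
from scipy.stats import norm

# Hypothetical two-class problem on the real line: the priors P(Y = r) and
# class-conditional densities P_r are assumed known (in practice they are not).
priors = {1: 0.6, 2: 0.4}
densities = {1: norm(loc=0.0, scale=1.0), 2: norm(loc=2.0, scale=1.0)}

def bayes_classifier(x):
    """Return argmax_r P(Y = r | X = x).

    By Bayes' rule, P(Y = r | X = x) is proportional to P(Y = r) * p_r(x),
    so the evidence term can be dropped when taking the argmax.
    """
    return max(priors, key=lambda r: priors[r] * densities[r].pdf(x))

print(bayes_classifier(0.3))  # 1: x is much more likely under class 1
print(bayes_classifier(1.8))  # 2
```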
In practice, as in most of statistics, the difficulties and subtleties are associated with modeling the probability distributions effectively, in this case $\operatorname{P}(Y = r \mid X = x)$. The Bayes classifier is a useful benchmark in statistical classification.
The excess risk of a general classifier $C$ (possibly depending on some training data) is defined as
$$\mathcal{R}(C) - \mathcal{R}(C^{\text{Bayes}}).$$
This non-negative quantity is important for assessing the performance of different classification techniques. A classifier is said to be consistent if the excess risk converges to zero as the size of the training data set tends to infinity.[2]
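When the data-generating model is known, the excess risk can be estimated by Monte Carlo. The sketch below is illustrative only: it assumes the same made-up Gaussian mixture as the previous sketch and compares an arbitrary threshold rule against the Bayes rule; `sample_xy` and `risk` are hypothetical helper names, not standard functions.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
priors, means = {1: 0.6, 2: 0.4}, {1: 0.0, 2: 2.0}

def sample_xy(n):
    """Draw n i.i.d. pairs (X, Y) from the assumed Gaussian-mixture model."""
    y = rng.choice([1, 2], size=n, p=[priors[1], priors[2]])
    x = rng.normal(loc=np.where(y == 1, means[1], means[2]), scale=1.0)
    return x, y

def risk(classify, n=50_000):
    """Monte Carlo estimate of R(C) = P(C(X) != Y)."""
    x, y = sample_xy(n)
    return np.mean(np.array([classify(xi) for xi in x]) != y)

# Bayes rule for this model, and an arbitrary suboptimal threshold rule.
bayes = lambda x: max(priors, key=lambda r: priors[r] * norm.pdf(x, means[r], 1.0))
crude = lambda x: 1 if x < 0.5 else 2

print(risk(crude) - risk(bayes))  # excess risk: non-negative up to MC noise
```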
If the components $x_1, \dots, x_d$ of $x$ are assumed to be conditionally independent given the class, we get the naive Bayes classifier, where
$$C^{\text{Bayes}}(x) = \underset{r \in \{1, 2, \dots, K\}}{\operatorname{argmax}} \operatorname{P}(Y = r) \prod_{i=1}^{d} P_r(x_i).$$
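A minimal sketch of this product rule, assuming Gaussian class-conditional marginals fitted per feature (the Gaussian choice is an assumption made for illustration, not part of the definition), with the product computed in log space for numerical stability:

```python
import numpy as np
from scipy.stats import norm

def fit_gaussian_nb(X, y):
    """Estimate priors P(Y=r) and per-feature Gaussian marginals P_r(x_i)."""
    return {
        r: (np.mean(y == r),               # prior
            X[y == r].mean(axis=0),        # per-feature means
            X[y == r].std(axis=0) + 1e-9)  # per-feature stds (regularized)
        for r in np.unique(y)
    }

def predict(model, x):
    """argmax_r P(Y=r) * prod_i P_r(x_i), evaluated as a sum of logs."""
    scores = {r: np.log(p) + norm.logpdf(x, mu, sd).sum()
              for r, (p, mu, sd) in model.items()}
    return max(scores, key=scores.get)

# Toy usage with fabricated data.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([1] * 50 + [2] * 50)
model = fit_gaussian_nb(X, y)
print(predict(model, np.array([2.5, 2.5])))  # likely class 2
```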
Properties
A proof that the Bayes classifier is optimal and that the Bayes error rate is minimal proceeds as follows.
Define the following: the risk $R(h)$, the Bayes risk $R^*$, and the set of possible classes $Y = \{0, 1\}$ (the binary case). Let the posterior probability of a point belonging to class 1 be $\eta(x) = \operatorname{P}(Y = 1 \mid X = x)$. Define the classifier $h^*$ as
$$h^*(x) = \begin{cases} 1 & \text{if } \eta(x) \geq 0.5, \\ 0 & \text{otherwise.} \end{cases}$$
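In the binary case this thresholding rule agrees with the general $\operatorname{argmax}$ definition above, since
$$\eta(x) \geq 0.5 \iff \eta(x) \geq 1 - \eta(x) \iff \operatorname{P}(Y = 1 \mid X = x) \geq \operatorname{P}(Y = 0 \mid X = x).$$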
Then we have the following results:
- (a) $R(h^*) = R^*$, i.e. $h^*$ is a Bayes classifier;
- (b) for any classifier $h$, the excess risk satisfies $R(h) - R^* = 2\,\mathbb{E}_X\left[\,|\eta(X) - 0.5| \cdot \mathbb{I}_{\{h(X) \neq h^*(X)\}}\right]$;
- (c) $R^* = \mathbb{E}_X\left[\min(\eta(X), 1 - \eta(X))\right]$;
- (d) $R^* = \tfrac{1}{2} - \tfrac{1}{2}\,\mathbb{E}\left[\,|2\eta(X) - 1|\,\right]$.
Proof of (a): For any classifier $h$, we have
$$\begin{aligned} R(h) &= \mathbb{E}_{XY}\left[\mathbb{I}_{\{h(X) \neq Y\}}\right] \\ &= \mathbb{E}_X \mathbb{E}_{Y|X}\left[\mathbb{I}_{\{h(X) \neq Y\}}\right] \\ &= \mathbb{E}_X\left[\eta(X)\,\mathbb{I}_{\{h(X)=0\}} + (1 - \eta(X))\,\mathbb{I}_{\{h(X)=1\}}\right], \end{aligned}$$
where the second line follows from Fubini's theorem (iterated expectation).
Notice that $R(h)$ is minimised by taking, for every $x$,
$$h(x) = \begin{cases} 1 & \text{if } \eta(x) \geq 1 - \eta(x), \\ 0 & \text{otherwise,} \end{cases}$$
which is exactly $h^*$.
Therefore the minimum possible risk is the Bayes risk, $R^* = R(h^*)$.
Proof of (b):
$$\begin{aligned} R(h) - R^* &= R(h) - R(h^*) \\ &= \mathbb{E}_X\left[\eta(X)\,\mathbb{I}_{\{h(X)=0\}} + (1 - \eta(X))\,\mathbb{I}_{\{h(X)=1\}} - \eta(X)\,\mathbb{I}_{\{h^*(X)=0\}} - (1 - \eta(X))\,\mathbb{I}_{\{h^*(X)=1\}}\right] \\ &= \mathbb{E}_X\left[\,|2\eta(X) - 1|\,\mathbb{I}_{\{h(X) \neq h^*(X)\}}\right] \\ &= 2\,\mathbb{E}_X\left[\,|\eta(X) - 0.5|\,\mathbb{I}_{\{h(X) \neq h^*(X)\}}\right]. \end{aligned}$$
The third equality holds because the integrand vanishes on the event $\{h(X) = h^*(X)\}$, while on $\{h(X) \neq h^*(X)\}$ it equals $|2\eta(X) - 1|$, since $h^*$ always selects the label with the larger posterior.
Proof of (c):
$$\begin{aligned} R(h^*) &= \mathbb{E}_X\left[\eta(X)\,\mathbb{I}_{\{h^*(X)=0\}} + (1 - \eta(X))\,\mathbb{I}_{\{h^*(X)=1\}}\right] \\ &= \mathbb{E}_X\left[\min(\eta(X), 1 - \eta(X))\right], \end{aligned}$$
since $h^*(X) = 1$ exactly when $\eta(X) \geq 1 - \eta(X)$, so the surviving term is always the smaller of the two posteriors.
Proof of (d):
$$\begin{aligned} R(h^*) &= \mathbb{E}_X\left[\min(\eta(X), 1 - \eta(X))\right] \\ &= \frac{1}{2} - \mathbb{E}_X\left[\max\left(\eta(X) - \tfrac{1}{2},\, \tfrac{1}{2} - \eta(X)\right)\right] \\ &= \frac{1}{2} - \frac{1}{2}\,\mathbb{E}\left[\,|2\eta(X) - 1|\,\right], \end{aligned}$$
using the identities $\min(a, 1-a) = \tfrac{1}{2} - \max\left(a - \tfrac{1}{2}, \tfrac{1}{2} - a\right)$ and $\max\left(a - \tfrac{1}{2}, \tfrac{1}{2} - a\right) = \left|a - \tfrac{1}{2}\right| = \tfrac{1}{2}\,|2a - 1|$.
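As a sanity check, results (a) through (d) can be verified numerically. The sketch below assumes a made-up posterior $\eta(x)$ (a logistic curve) and uniformly distributed $X$, so every quantity in (a)-(d) can be estimated from one simulated sample; all names and parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy model: X ~ Uniform(-3, 3), with a made-up logistic posterior.
eta = lambda x: 1.0 / (1.0 + np.exp(-2.0 * x))   # eta(x) = P(Y = 1 | X = x)

n = 500_000
x = rng.uniform(-3, 3, size=n)
y = (rng.uniform(size=n) < eta(x)).astype(int)   # draw Y given X = x

h_star = (eta(x) >= 0.5).astype(int)             # Bayes rule h*
h = (x >= 1.0).astype(int)                       # an arbitrary suboptimal rule

risk = lambda pred: np.mean(pred != y)           # empirical risk R(.)

print(risk(h_star), np.mean(np.minimum(eta(x), 1 - eta(x))))   # (a) and (c)
print(0.5 - 0.5 * np.mean(np.abs(2 * eta(x) - 1)))             # (d)
print(risk(h) - risk(h_star),
      2 * np.mean(np.abs(eta(x) - 0.5) * (h != h_star)))       # (b)
```

Up to Monte Carlo noise, the first line prints two matching estimates of the Bayes risk, the second prints the same value again, and the third prints two matching estimates of the excess risk.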
General case
The proof that the Bayes classifier minimises the classification error when each element can belong to any of $n$ categories proceeds by iterated expectations (the tower property), as follows:
$$\begin{aligned} \mathbb{E}\left[\mathbb{I}_{\{Y \neq \hat{y}\}}\right] &= \mathbb{E}_X\, \mathbb{E}_{Y|X}\left[\mathbb{I}_{\{Y \neq \hat{y}\}} \mid X = x\right] \\ &= \mathbb{E}\left[\Pr(Y = 1 \mid X = x)\,\mathbb{I}_{\{\hat{y} \in \{2, 3, \dots, n\}\}} + \Pr(Y = 2 \mid X = x)\,\mathbb{I}_{\{\hat{y} \in \{1, 3, \dots, n\}\}} + \dots + \Pr(Y = n \mid X = x)\,\mathbb{I}_{\{\hat{y} \in \{1, 2, \dots, n-1\}\}}\right]. \end{aligned}$$
This is minimised by simultaneously minimising each term of the expectation, which is achieved by the classifier
$$h(x) = \arg\max_{k} \Pr(Y = k \mid X = x)$$
for each observation $x$.
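A small numerical illustration of this pointwise argument, assuming an arbitrary made-up posterior vector at a fixed $x$: the conditional error $1 - \Pr(Y = \hat{y} \mid X = x)$ is minimised by choosing the argmax class.

```python
import numpy as np

# Assumed posterior P(Y = k | X = x) at some fixed x (made-up values).
posterior = np.array([0.1, 0.5, 0.15, 0.25])   # classes k = 1, ..., 4

# The conditional error of predicting y_hat is the sum of the other
# posteriors, i.e. 1 - P(Y = y_hat | X = x); the argmax class minimises it.
cond_error = 1.0 - posterior
best = np.argmin(cond_error) + 1
print(best, best == np.argmax(posterior) + 1)   # -> 2 True
```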
References
[edit ]- ^ Devroye, L.; Gyorfi, L. & Lugosi, G. (1996). A probabilistic theory of pattern recognition. Springer. ISBN 0-3879-4618-7.
2. Farago, A.; Lugosi, G. (1993). "Strong universal consistency of neural network classifiers". IEEE Transactions on Information Theory. 39 (4): 1146–1151. doi:10.1109/18.243433.