|
182 | 182 | "metadata": {},
|
183 | 183 | "source": [
|
184 | 184 | "## Interesting property of entropy\n",
|
185 | | - "### Concavity of entropy and convexity of KL-divergence\n", |
| 185 | + "### Concavity of entropy\n", |
| 186 | + "This section is a reproduction of \"book of statistical proof\", in particular for [entropy](https://statproofbook.github.io/P/ent-conc.html) and [kl-divergence](https://statproofbook.github.io/P/kl-conv.html)\n", |
| 187 | + "#### Link between entropy and KL-divergence\n", |
186 | 188 | "The entropy is concave in the space of probability mass function, more formally, this reads:\n",
|
187 | 189 | "\\begin{align*}\n",
|
188 | | - " H[\\lambda p_1 + (1-\\lambda p_2)] \\geq \\lambda H[p_1] + (1-\\lambda p_2) H[p_2]\n", |
| 190 | + " H[\\lambda p_1 + (1-\\lambda) p_2] \\geq \\lambda H[p_1] + (1-\\lambda) p_2 H[p_2] \\tag{1.1}\n", |
189 | 191 | "\\end{align*}\n",
|
190 | 192 | "where $p_1$ and $p_2$ are probability mass functions and $\\lambda \\in [0,1]$\n",
|
191 | 193 | "\n",
|
192 | 194 | "Proof: Let $X$ be a discrete random variable with possible outcomes $\\mathcal{X} := {x_i, i \\in 0,1,\\dots N-1}$ and let $u(x)$ be the probability mass function of a discrete uniform distribution on $X \\in \\mathcal{X}$. Then, the entropy of an arbitrary probability mass function $p(x)$ can be rewritten as\n",
|
193 | 195 | "\n",
|
194 | | - "\\begin{align*}\n", |
| 196 | + "\\begin{align*} \\tag{1.2}\n", |
195 | 197 | " H(X) &= - \\sum_{i=0}^{N-1} p(x_i)log(p(x_i)) \\\\\n",
|
196 | 198 | " &= - \\sum_{i=0}^{N-1} p(x_i)log\\left(\\frac{p(x_i)}{u(x_i)} u(x_i)\\right) \\\\\n",
|
197 | 199 | " &= - \\sum_{i=0}^{N-1} p(x_i)log\\left(\\frac{p(x_i)}{u(x_i)}\\right) - \\sum_{i=0}^{N-1} p(x_i)log(u(x_i)) \\\\\n",
|
198 | 200 | " &= -KL[p\\|u] - \\sum_{i=0}^{N-1} p(x_i)log(u(x_i)) \\\\\n",
|
199 | 201 | " &= -KL[p\\|u] - log \\left(\\frac{1}{N} \\right) \\sum_{i=0}^{N-1} p(x_i) \\\\\n",
|
200 | | - " &= log(N) - KL[p\\|u]\n", |
| 202 | + " &= log(N) - KL[p\\|u] \\\\\n", |
201 | 203 | " log(N) - H(X) &= KL[p\\|u]\n",
|
202 | 204 | "\\end{align*}\n",
|
203 | 205 | "\n",
|
204 | | - "Where $KL[p\\|u]$ is the Kullback-Leibler divergence between $p$ and the discrete uniform distriution $u$ over $\\mathcal{X},ドル a concept we will explain more in detail later on this page. \n", |
205 | | - "Note that the KL divergence is convex in the space of the pair of probability distributions $(p,q)$:\n", |
| 206 | + "Where $KL[p\\|u]$ is the Kullback-Leibler divergence between $p$ and the discrete uniform distriution $u$ over $\\mathcal{X}$. We will explain KL-divergence more in detail later on this page." |
| 207 | + ] |
| 208 | + }, |
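| | + { |
| | + "cell_type": "markdown", |
| | + "metadata": {}, |
| | + "source": [ |
| | + "*Numerical sanity check (not part of the original proof):* the code cell below is a minimal sketch that verifies the identity $KL[p\\|u]=log(N)-H[p]$ derived above on a randomly drawn probability mass function. It assumes `numpy` is available and uses the natural logarithm." |
| | + ] |
| | + }, |
| | + { |
| | + "cell_type": "code", |
| | + "execution_count": null, |
| | + "metadata": {}, |
| | + "outputs": [], |
| | + "source": [ |
| | + "import numpy as np\n", |
| | + "\n", |
| | + "rng = np.random.default_rng(0)\n", |
| | + "\n", |
| | + "N = 10\n", |
| | + "p = rng.random(N)\n", |
| | + "p /= p.sum()              # arbitrary probability mass function on N outcomes\n", |
| | + "u = np.full(N, 1.0 / N)   # discrete uniform distribution on the same outcomes\n", |
| | + "\n", |
| | + "H_p = -np.sum(p * np.log(p))       # entropy H[p]\n", |
| | + "kl_pu = np.sum(p * np.log(p / u))  # KL[p||u]\n", |
| | + "\n", |
| | + "print(H_p, np.log(N) - kl_pu)      # the two numbers should coincide\n", |
| | + "assert np.isclose(H_p, np.log(N) - kl_pu)" |
| | + ] |
| | + }, |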
| 209 | + { |
| 210 | + "cell_type": "markdown", |
| 211 | + "metadata": {}, |
| 212 | + "source": [ |
| 213 | + "#### Using convexity of KL-divergence to prove entropy concavity\n", |
| 214 | + "Note that the KL divergence is convex in the space of pairs of probability distributions $(p,q),ドル ie:\n", |
| 215 | + "\\begin{align*}\n", |
| 216 | + " KL[\\lambda p_1 + (1-\\lambda) p_2 \\| \\lambda q_1 + (1-\\lambda) q_2] \\leq \\lambda KL[p_1\\|q_1] + (1-\\lambda) KL[p_2\\|q_2 \\tag{(2.1)}\n", |
| 217 | + "\\end{align*}\n", |
| 218 | + "\n", |
| 219 | + "We recall that KL-divergence between two distribution $(p,q)$ reads\n", |
206 | 220 | "\\begin{align*}\n",
|
207 | | - " KL[\\lambda p_1 + (1-\\lambda p_2) \\| \\lambda q_1 + (1-\\lambda q_2)] \\geq \\lambda KL[p_1\\|q_1] + (1-\\lambda p_2) KL[p_2\\|q_2]\n", |
208 | | - "\\end{align*}\n" |
| 221 | + " KL[p\\|q] &= \\mathbb{E}_p[I[p]-[I[q]] \\\\\n", |
| 222 | + " &= - \\sum_{i=0}^{N-1} p(x_i)log\\left(\\frac{p(x_i)}{q(x_i)}\\right)\n", |
| 223 | + "\\end{align*}\n", |
| 224 | + "Which, reads, when using expression (2.1):\n", |
| 225 | + "\\begin{align*}\n", |
| 226 | + " &KL[\\lambda p_1 + (1-\\lambda) p_2 \\| \\lambda q_1 + (1-\\lambda) q_2] \\\\\n", |
| 227 | + " \\equiv &- \\sum_{i=0}^{N-1} (\\lambda p_1(x_i) + (1-\\lambda) p_2(x_i))log\\left(\\frac{\\lambda p_1(x_i) + (1-\\lambda) p_2(x_i)}{\\lambda q_1(x_i) + (1-\\lambda) q_2(x_i)}\\right) \\\\\n", |
| 228 | + " \\equiv &- \\sum_{i=0}^{N-1} (\\lambda p_1(x_i) + (1-\\lambda) p_2(x_i))log\\left(\\frac{\\lambda p_1(x_i) + (1-\\lambda) p_2(x_i)}{\\lambda q_1(x_i) + (1-\\lambda) q_2(x_i)}\\right)\n", |
| 229 | + "\\end{align*}\n", |
| 230 | + "\n", |
| 231 | + "#### Using convexity of KL-divergence to prove entropy concavity\n", |
| 232 | + "\n", |
| 233 | + "Lets take a special case of equation (2.1) where $(q_1,q_2)=(u,u)$ a pair of uniform discrete distributions:\n", |
| 234 | + "\\begin{align*} \\tag{(2.2)}\n", |
| 235 | + " KL[\\lambda p_1 + (1-\\lambda) p_2 \\| \\lambda u + (1-\\lambda) u] &\\leq \\lambda KL[p_1\\|u] + (1-\\lambda) KL[p_2\\|u] \\\\\n", |
| 236 | + " KL[\\lambda p_1 + (1-\\lambda) p_2 \\| u] &\\leq \\lambda KL[p_1\\|u] + (1-\\lambda) KL[p_2\\|u]\n", |
| 237 | + "\\end{align*}\n", |
| 238 | + "\n", |
| 239 | + "Lets now replace $KL[p\\|u]$ from equation (2.2), ie $KL[p\\|u]=log(N)-H[p]$ with the expression obtained in (1.2):\n", |
| 240 | + "\\begin{align*}\n", |
| 241 | + " KL[\\lambda p_1 + (1-\\lambda) p_2 \\| u] &\\leq \\lambda KL[p_1\\|u] + (1-\\lambda) KL[p_2\\|u] \\\\\n", |
| 242 | + " log(N)-H[\\lambda p_1 + (1-\\lambda) p_2] &\\leq \\lambda (log(N)-H[p_1]) + (1-\\lambda)(log(N)-H[p_2]) \\\\\n", |
| 243 | + " log(N)-H[\\lambda p_1 + (1-\\lambda) p_2] &\\leq log(N)(\\lambda+(1-\\lambda)) -\\lambda H[p_1] + (\\lambda-1)H[p_2]) \\\\\n", |
| 244 | + " log(N)-H[\\lambda p_1 + (1-\\lambda) p_2] &\\leq log(N) - (\\lambda H[p_1] + (1-\\lambda)H[p_2]) \\\\\n", |
| 245 | + " H[\\lambda p_1 + (1-\\lambda) p_2] &\\geq \\lambda H[p_1] + (1-\\lambda)H[p_2]\n", |
| 246 | + "\\end{align*}\n", |
| 247 | + "Which is equivalent to equation (1.1), ie the concavity of entropy" |
209 | 248 | ]
|
210 | 249 | },
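| | + { |
| | + "cell_type": "markdown", |
| | + "metadata": {}, |
| | + "source": [ |
| | + "*Numerical illustration (not part of the original proof):* a small sketch that draws two random probability mass functions and checks the concavity inequality (1.1) on a grid of $\\lambda$ values, assuming `numpy` is available." |
| | + ] |
| | + }, |
| | + { |
| | + "cell_type": "code", |
| | + "execution_count": null, |
| | + "metadata": {}, |
| | + "outputs": [], |
| | + "source": [ |
| | + "import numpy as np\n", |
| | + "\n", |
| | + "rng = np.random.default_rng(2)\n", |
| | + "\n", |
| | + "def entropy(p):\n", |
| | + "    return -np.sum(p * np.log(p))\n", |
| | + "\n", |
| | + "N = 12\n", |
| | + "p1 = rng.random(N)\n", |
| | + "p1 /= p1.sum()\n", |
| | + "p2 = rng.random(N)\n", |
| | + "p2 /= p2.sum()\n", |
| | + "\n", |
| | + "for lam in np.linspace(0.0, 1.0, 11):\n", |
| | + "    mix = lam * p1 + (1 - lam) * p2\n", |
| | + "    # H[lam*p1 + (1-lam)*p2] >= lam*H[p1] + (1-lam)*H[p2], equation (1.1)\n", |
| | + "    assert entropy(mix) >= lam * entropy(p1) + (1 - lam) * entropy(p2) - 1e-12\n", |
| | + "print('concavity inequality (1.1) holds for all tested lambda values')" |
| | + ] |
| | + }, |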
|
211 | 250 | {
|
|