|
372 | 372 | "\n",
|
373 | 373 | "This part is largely inspired by the Wikipedia articles on [Sinkhorn's theorem](https://en.wikipedia.org/wiki/Sinkhorn%27s_theorem) and [doubly stochastic matrices](https://en.wikipedia.org/wiki/Doubly_stochastic_matrix), as well as this [blog post](https://zipjiang.github.io/2020/11/23/sinkhorn's-theorem-,-sinkhorn-algorithm-and-applications.html) by Zipjiang and this [overview](https://djalil.chafai.net/blog/2021/08/28/sinkhorn-and-circular-law/) by Djalil Chafai.\n",
|
374 | 374 | "\n",
|
375 | | - "In Djalil Chafai, it is mentionned, that, when factorizing a non-negative square matrix $A$ as $d_1 S d_2,ドル it is well known that S is the projection of $A,ドル with respect to the Kullback-Leibler divergence, on the set of doubly stochastic matrices (Birkhoff polytope). The Sinkhorn factorization and its algorithm became fashionable in the domain of computational optimal transportation due to a relation with entropic regularization. It is also considered in the domain of quantum information theory.\n", |
376 | | - "\n", |
377 | 375 | "### (Doubly) Stochastic matrices\n",
|
378 | 376 | "A doubly stochastic matrix (also called a bistochastic matrix) is a square matrix $X = (x_{ij})$ of nonnegative real numbers, each of whose rows and columns sums to 1, i.e. $\\sum_{i}x_{ij}=\\sum_{j}x_{ij}=1$.\n",
|
379 | 377 | "Thus, a doubly stochastic matrix is both left stochastic and right stochastic.\n",
|
|
383 | 381 | "If $A$ is an $n \\times n$ matrix with strictly positive elements, then there exist diagonal matrices $D_1$ and $D_2$ with strictly positive diagonal elements such that $D_1AD_2$ is doubly stochastic. The matrices $D_1$ and $D_2$ are unique modulo multiplying the first matrix by a positive number and dividing the second one by the same number.\n",
|
384 | 382 | "\n",
|
385 | 383 | "### Sinkhorn's algorithm\n",
|
386 | | - "What is great about Sinkhorn’s theorem is that there is a efficient algorithm to calculate an approximation of the doubly stochastic matrix given a matrix $A,ドル that features linear convergence. And the algorithm is very simple: one just iteratively normalize the matrix along rows and columns." |
| 384 | + "What is great about Sinkhorn’s theorem is that there is an efficient algorithm, with linear convergence, to approximate the doubly stochastic matrix associated with a given matrix $A$. And the algorithm is very simple: one just iteratively normalizes the matrix along its rows and columns. In Djalil Chafai's post, it is mentioned that, when factorizing a non-negative square matrix $M$ as $D_1 A D_2,ドル it is well known that $A$ is the projection of $M,ドル with respect to the Kullback-Leibler divergence, onto the set of doubly stochastic matrices (the Birkhoff polytope). The Sinkhorn factorization and its algorithm became fashionable in computational optimal transport due to a relation with entropic regularization, and they are also studied in quantum information theory."
387 | 385 | ]
|
388 | 386 | },
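| | + {
| | + "cell_type": "markdown",
| | + "metadata": {},
| | + "source": [
| | + "Below is a minimal NumPy sketch of these iterations (the helper name and the fixed number of sweeps are illustrative choices, not from a library): we alternately rescale the rows and the columns until the matrix is approximately doubly stochastic."
| | + ]
| | + },
| | + {
| | + "cell_type": "code",
| | + "execution_count": null,
| | + "metadata": {},
| | + "outputs": [],
| | + "source": [
| | + "import numpy as np\n",
| | + "\n",
| | + "def sinkhorn_normalize(A, n_iter=100):\n",
| | + "    \"\"\"Alternately normalize rows and columns of a strictly positive matrix.\"\"\"\n",
| | + "    A = np.asarray(A, dtype=float).copy()\n",
| | + "    for _ in range(n_iter):\n",
| | + "        A /= A.sum(axis=1, keepdims=True)  # each row sums to 1\n",
| | + "        A /= A.sum(axis=0, keepdims=True)  # each column sums to 1\n",
| | + "    return A\n",
| | + "\n",
| | + "S = sinkhorn_normalize(np.random.rand(4, 4) + 0.1)  # strictly positive input\n",
| | + "print(S.sum(axis=0), S.sum(axis=1))  # both close to vectors of ones"
| | + ]
| | + },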
|
389 | 387 | {
|
|
517 | 515 | "metadata": {},
|
518 | 516 | "source": [
|
519 | 517 | "### How is the Sinkhorn algorithm related to OT?\n",
|
520 | | - "So how is calculating a doubly stochastic matrix related to optimal transport?\n", |
521 | | - "\n", |
| 518 | + "So how is calculating a doubly stochastic matrix related to optimal transport? The link between the two problems was highlighted in a famous 2013 paper by Marco Cuturi (Cuturi, M. Sinkhorn Distances: Lightspeed Computation of Optimal Transport. Advances in Neural Information Processing Systems, 2013, vol. 26)."
| 519 | + ] |
| 520 | + }, |
| 521 | + { |
| 522 | + "cell_type": "code", |
| 523 | + "execution_count": 34, |
| 524 | + "metadata": {}, |
| 525 | + "outputs": [], |
| 526 | + "source": [ |
| 527 | + "#IFrame(\"doc/OptimalTransportWasserteinDistance/NIPS-2013-sinkhorn-distances-lightspeed-computation-of-optimal-transport-Paper.pdf\", width=1200, height=800)" |
| 528 | + ] |
| 529 | + }, |
| 530 | + { |
| 531 | + "cell_type": "markdown", |
| 532 | + "metadata": {}, |
| 533 | + "source": [ |
| 534 | + "### Formalizing the Kantorovich relaxation of OT\n",
522 | 535 | "We will reformulate our optimal transport problem in a slightly different matrix form: this time, we take N warehouses and also N customers (K=N from our original problem). The cost $c(x,y)$ is still the distance between the storage area $x$ and the address of customer $y,ドル but we store those costs in a matrix $M \\in \\mathbb{R}_{+}^{N\\times N}$ where entry $m_{i,j}=c(x_i,y_j)$.\n",
|
523 | 536 | "\n",
|
524 | 537 | "Now suppose our e-readers are randomly distributed according to a distribution $r$ among the N warehouses, and we would like them to be distributed according to another distribution $c$ on the customers.\n",
|
525 | | - "Now we can write up a non-negative matrix $A \\in \\mathbb{R}^{N\\times K+},ドル where $a_{i,j}$ corresponds to how much portion of the total e-readers we would like to transport from warehouse i to customer j. Another way to view this matrix A is that A defines a joint distribution in the transportation space and r and c could be seen as its marginals of starting points and destinations respectively. Now it’s not hard to see that the final transportation cost would be the sum of the element-wise product between A and M, or in other words the Frobenius inner product $\\langle A, M \\rangle$ In fact, this is called the Kantorovich relaxation of OT.\n", |
| 538 | + "We can then write a non-negative matrix $A \\in \\mathbb{R}_{+}^{N\\times N},ドル where $a_{i,j}$ is the fraction of the total e-readers we would like to transport from warehouse $i$ to customer $j$. Another way to view this matrix is that $A$ defines a joint distribution over the transportation space, and $r$ and $c$ can be seen as its marginals of starting points and destinations respectively. It is then not hard to see that the final transportation cost is the sum of the element-wise products between $A$ and $M,ドル in other words the Frobenius inner product $\\langle A, M \\rangle$. This is called the Kantorovich relaxation of OT.\n",
526 | 539 | "\n",
|
527 | | - "There is the conclusion that if M is a distance matrix, that is, every element in the matrix comform to the three property of distance, the inner product is a distance itself. Confirming this should not be hard. To make a transfort optimal is to find the minimum cost of the total transport, that is, to minimize the Frobenius inner product.\n", |
| 540 | + "More formally, for two probability vectors $r$ and $c$ in the simplex $\\Sigma_d := \\{ x \\in \\mathbb{R}_{+}^{d} : x^T \\mathbb{1}^d = 1\\},ドル where $\\mathbb{1}^d$ is the $d$-dimensional vector of ones, we write $U(r, c)$ for the transport polytope of $r$ and $c,ドル namely the polyhedral set of $d \\times d$ matrices:\n",
| 541 | + "\\begin{align*}\n", |
| 542 | + "    U(r,c) := \\{A \\in \\mathbb{R}_{+}^{d\\times d} \\mid A\\mathbb{1}^d = r,\\, A^T\\mathbb{1}^d = c \\}\n",
| 543 | + "\\end{align*}\n", |
528 | 544 | "\n",
|
| 545 | + "The conclusion is that if $M$ is a distance matrix, that is, if its entries satisfy the three properties of a distance (non-negativity, symmetry and the triangle inequality), then the optimal cost defined below is itself a distance between probability vectors. Making a transport plan optimal means finding the minimum total transportation cost, that is, minimizing the Frobenius inner product.\n",
| 546 | + "To rephrase the previous assertion more formally: given a $d \\times d$ cost matrix $M,ドル the cost of mapping $r$ to $c$ using a transport matrix (or joint probability) $A$ can be quantified as $\\langle A, M \\rangle,ドル and the optimal transport problem reads\n",
| 547 | + "\\begin{align*}\n", |
| 548 | + "    d_{M}(r,c) := \\min_{A \\in U(r,c)} \\langle A, M \\rangle\n",
| 549 | + "\\end{align*}\n", |
| 550 | + "\n", |
| 551 | + "\n", |
| 552 | + "### OT and statistical concepts\n", |
529 | 553 | "I would like to stop and mention that, as we now interpret $A$ as a joint probability matrix, we can define its entropy, the entropies of its marginal probabilities, and the KL-divergence between two different transportation matrices. These take the form of\n",
|
530 | 554 | "\n",
|
531 | 555 | "\\begin{align*}\n",
|
532 | | - " \\text{Entropy} &= H(A) &= \\sum_{i,j} a_{i,j} log(a_{i,j}) \\\\\n", |
533 | | - " \\text{Marginal source distribution entropy r} &= H(r) &= \\sum_{i} \\left( \\sum_{j} a_{i,j} \\right) log\\left( \\sum_{j} a_{i,j} \\right)\\\\\n", |
534 | | - " \\text{Marginal destination distribution entropy c} &= H(c) &= \\sum_{j} \\left( \\sum_{i} a_{i,j} \\right) log\\left( \\sum_{i} a_{i,j} \\right)\\\\\n", |
535 | | - " \\text{KL-divergence between A and B transportation} &= &=\n", |
536 | | - "\\end{align*}\n" |
| 556 | + "    \\text{Entropy} &= H(A) &= -\\sum_{i,j} a_{i,j} \\log(a_{i,j}) & \\\\\n",
| 557 | + "    \\text{Marginal source distribution entropy r} &= H(r) &= -\\sum_{i} \\left( \\sum_{j} a_{i,j} \\right) \\log\\left( \\sum_{j} a_{i,j} \\right) &= -\\sum_{i} r_i \\log(r_i)\\\\\n",
| 558 | + "    \\text{Marginal destination distribution entropy c} &= H(c) &= -\\sum_{j} \\left( \\sum_{i} a_{i,j} \\right) \\log\\left( \\sum_{i} a_{i,j} \\right) &= -\\sum_{i} c_i \\log(c_i)\\\\\n",
| 559 | + "    \\text{KL-divergence between transport plans A and B} &= KL(A\\|B) &= \\sum_{i,j} a_{i,j} \\log\\left(\\frac{a_{i,j}}{b_{i,j}}\\right) &\n",
| 560 | + "\\end{align*}" |
537 | 561 | ]
|
538 | 562 | },
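| | + {
| | + "cell_type": "markdown",
| | + "metadata": {},
| | + "source": [
| | + "To make the link concrete, here is a sketch of Cuturi-style entropic-regularized Sinkhorn iterations (the 1-D cost matrix, regularization strength and iteration count below are illustrative choices): diagonal scalings of the Gibbs kernel $e^{-M/\\lambda}$ yield a transport plan in $U(r,c),ドル from which $\\langle A, M \\rangle$ and $H(A)$ can be read off directly."
| | + ]
| | + },
| | + {
| | + "cell_type": "code",
| | + "execution_count": null,
| | + "metadata": {},
| | + "outputs": [],
| | + "source": [
| | + "import numpy as np\n",
| | + "\n",
| | + "def sinkhorn_ot(r, c, M, reg=0.1, n_iter=500):\n",
| | + "    \"\"\"Entropic-regularized OT plan between histograms r and c.\"\"\"\n",
| | + "    K = np.exp(-M / reg)                # Gibbs kernel\n",
| | + "    u = np.ones_like(r)\n",
| | + "    for _ in range(n_iter):\n",
| | + "        v = c / (K.T @ u)               # enforce the column marginal c\n",
| | + "        u = r / (K @ v)                 # enforce the row marginal r\n",
| | + "    return u[:, None] * K * v[None, :]  # A = diag(u) K diag(v)\n",
| | + "\n",
| | + "x = np.arange(5, dtype=float)\n",
| | + "M = np.abs(x[:, None] - x[None, :])     # |i - j|: a genuine distance matrix\n",
| | + "r = np.full(5, 0.2)                     # uniform source histogram\n",
| | + "c = np.array([0.1, 0.1, 0.2, 0.3, 0.3]) # target histogram\n",
| | + "A = sinkhorn_ot(r, c, M)\n",
| | + "print(\"cost    <A, M> =\", np.sum(A * M))\n",
| | + "print(\"entropy H(A)   =\", -np.sum(A * np.log(A + 1e-300)))"
| | + ]
| | + },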
|
539 | 563 | {
|
|
857 | 881 | "# See also: the Hungarian algorithm"
|
858 | 882 | ]
|
859 | 883 | },
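| | + {
| | + "cell_type": "markdown",
| | + "metadata": {},
| | + "source": [
| | + "Following up on that remark: for exact OT between two uniform discrete measures of the same size, the problem reduces to an assignment problem, which SciPy's `linear_sum_assignment` (a Hungarian-style solver) handles directly. The cost matrix below is a made-up example."
| | + ]
| | + },
| | + {
| | + "cell_type": "code",
| | + "execution_count": null,
| | + "metadata": {},
| | + "outputs": [],
| | + "source": [
| | + "import numpy as np\n",
| | + "from scipy.optimize import linear_sum_assignment\n",
| | + "\n",
| | + "cost = np.array([[4., 1., 3.],\n",
| | + "                 [2., 0., 5.],\n",
| | + "                 [3., 2., 2.]])\n",
| | + "rows, cols = linear_sum_assignment(cost)  # optimal one-to-one matching\n",
| | + "print(list(zip(rows, cols)), \"total cost:\", cost[rows, cols].sum())"
| | + ]
| | + },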
|
| 884 | + { |
| 885 | + "cell_type": "markdown", |
| 886 | + "metadata": {}, |
| 887 | + "source": [ |
| 888 | + "## Computation of Wasserstein barycenters\n",
| 889 | + "\n" |
| 890 | + ] |
| 891 | + }, |
| 892 | + { |
| 893 | + "cell_type": "code", |
| 894 | + "execution_count": null, |
| 895 | + "metadata": {}, |
| 896 | + "outputs": [], |
| 897 | + "source": [ |
| 898 | + "#IFrame(\"doc/OptimalTransportWasserteinDistance/Fast_computation_of_Wasserstein_barycenters.pdf\", width=1200, height=800)" |
| 899 | + ] |
| 900 | + }, |
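| | + {
| | + "cell_type": "markdown",
| | + "metadata": {},
| | + "source": [
| | + "One common way to approximate Wasserstein barycenters on a fixed support is with Sinkhorn-like iterations (iterative Bregman projections). The sketch below makes illustrative choices for the support, the regularization strength and the iteration count."
| | + ]
| | + },
| | + {
| | + "cell_type": "code",
| | + "execution_count": null,
| | + "metadata": {},
| | + "outputs": [],
| | + "source": [
| | + "import numpy as np\n",
| | + "\n",
| | + "def entropic_barycenter(B, M, reg=1e-2, weights=None, n_iter=1000):\n",
| | + "    \"\"\"Fixed-support entropic barycenter of the histograms in the columns of B.\"\"\"\n",
| | + "    d, S = B.shape\n",
| | + "    if weights is None:\n",
| | + "        weights = np.full(S, 1.0 / S)\n",
| | + "    K = np.exp(-M / reg)                         # Gibbs kernel\n",
| | + "    u = np.ones((d, S))\n",
| | + "    for _ in range(n_iter):\n",
| | + "        v = B / (K.T @ u)                        # scalings matching each input histogram\n",
| | + "        a = np.prod((K @ v) ** weights, axis=1)  # geometric mean = current barycenter\n",
| | + "        u = a[:, None] / (K @ v)                 # scalings matching the barycenter\n",
| | + "    return a\n",
| | + "\n",
| | + "x = np.linspace(0, 1, 50)\n",
| | + "M = (x[:, None] - x[None, :]) ** 2               # squared-distance costs on the grid\n",
| | + "b1 = np.exp(-(x - 0.25) ** 2 / 0.005); b1 /= b1.sum()  # two Gaussian bumps\n",
| | + "b2 = np.exp(-(x - 0.75) ** 2 / 0.005); b2 /= b2.sum()\n",
| | + "a = entropic_barycenter(np.stack([b1, b2], axis=1), M)\n",
| | + "print(a.sum())  # ~1: the barycenter is itself a histogram"
| | + ]
| | + },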
860 | 901 | {
|
861 | 902 | "cell_type": "markdown",
|
862 | 903 | "metadata": {},
|
|