Inception score
The Inception Score (IS) is an algorithm used to assess the quality of images created by a generative image model such as a generative adversarial network (GAN).[1] The score is calculated based on the output of a separate, pretrained Inception v3 image classification model applied to a sample of (typically around 30,000) images generated by the generative model. The Inception Score is maximized when the following conditions are true:
- The entropy of the distribution of labels predicted by the Inception v3 model for the generated images is minimized. In other words, the classification model confidently predicts a single label for each image. Intuitively, this corresponds to the desideratum of generated images being "sharp" or "distinct".
- The predictions of the classification model are evenly distributed across all possible labels. This corresponds to the desideratum that the output of the generative model is "diverse".[2]
It has been somewhat superseded by the related Fréchet inception distance.[3] While the Inception Score only evaluates the distribution of generated images, the FID compares the distribution of generated images with the distribution of a set of real images ("ground truth").
Definition
Let there be two spaces, the space of images {\displaystyle \Omega _{X}} and the space of labels {\displaystyle \Omega _{Y}}. The space of labels is finite.
Let {\displaystyle p_{gen}} be a probability distribution over {\displaystyle \Omega _{X}} that we wish to judge.
Let a discriminator be a function of type {\displaystyle p_{dis}:\Omega _{X}\to M(\Omega _{Y})}, where {\displaystyle M(\Omega _{Y})} is the set of all probability distributions on {\displaystyle \Omega _{Y}}. For any image {\displaystyle x}, and any label {\displaystyle y}, let {\displaystyle p_{dis}(y|x)} be the probability that image {\displaystyle x} has label {\displaystyle y}, according to the discriminator. It is usually implemented as an Inception-v3 network trained on ImageNet.
The Inception Score of {\displaystyle p_{gen}} relative to {\displaystyle p_{dis}} is
{\displaystyle IS(p_{gen},p_{dis}):=\exp \left(\mathbb {E} _{x\sim p_{gen}}\left[D_{KL}\left(p_{dis}(\cdot |x)\|\int p_{dis}(\cdot |x)p_{gen}(x)dx\right)\right]\right)}
Equivalent rewrites include
{\displaystyle \ln IS(p_{gen},p_{dis})=\mathbb {E} _{x\sim p_{gen}}\left[D_{KL}\left(p_{dis}(\cdot |x)\|\mathbb {E} _{x\sim p_{gen}}[p_{dis}(\cdot |x)]\right)\right]}
{\displaystyle \ln IS(p_{gen},p_{dis})=H[\mathbb {E} _{x\sim p_{gen}}[p_{dis}(\cdot |x)]]-\mathbb {E} _{x\sim p_{gen}}[H[p_{dis}(\cdot |x)]]}
{\displaystyle \ln IS} is nonnegative by Jensen's inequality.
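As a quick sanity check of the entropy form above (a worked example, not from the source): take two labels and a generator whose images are classified one-hot, half as each label.

```latex
% Two labels (N = 2); half the generated images are classified (1, 0),
% the other half (0, 1). The marginal label distribution is uniform:
\mathbb{E}_{x \sim p_{gen}}[p_{dis}(\cdot \mid x)]
  = \left(\tfrac{1}{2}, \tfrac{1}{2}\right)
\quad\Rightarrow\quad
H\!\left[\tfrac{1}{2}, \tfrac{1}{2}\right] = \ln 2
% Each conditional distribution is one-hot, so its entropy vanishes:
\mathbb{E}_{x \sim p_{gen}}\!\left[H[p_{dis}(\cdot \mid x)]\right] = 0
% Hence \ln IS = \ln 2 - 0, i.e. IS = 2 = N, the maximum.
```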
Pseudocode:
INPUT discriminator {\displaystyle p_{dis}}.
INPUT generator {\displaystyle g}.
Sample images {\displaystyle x_{i}} from generator.
Compute {\displaystyle p_{dis}(\cdot |x_{i})}, the probability distribution over labels conditional on image {\displaystyle x_{i}}.
Average the results to obtain {\displaystyle {\hat {p}}}, an empirical estimate of {\displaystyle \int p_{dis}(\cdot |x)p_{gen}(x)dx}.
Sample more images {\displaystyle x_{i}} from generator, and for each, compute {\displaystyle D_{KL}\left(p_{dis}(\cdot |x_{i})\|{\hat {p}}\right)}.
Average the results, and take the exponential of the average.
RETURN the result.
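The pseudocode above can be sketched in NumPy, starting from the matrix of discriminator outputs. This is an illustrative sketch, not a reference implementation: the function name, the `eps` smoothing term, and the use of a single sample for both the marginal estimate and the KL averages (a common simplification of the two-sample pseudocode) are all assumptions.

```python
import numpy as np

def inception_score(probs: np.ndarray, eps: float = 1e-12) -> float:
    """Compute the Inception Score from an (n_images, n_labels) array,
    where row i holds the discriminator's label probabilities
    p_dis(. | x_i) for generated image x_i."""
    # Empirical marginal label distribution: estimate of
    # the integral of p_dis(. | x) p_gen(x) dx over the sample.
    p_hat = probs.mean(axis=0)
    # KL divergence of each conditional distribution from the marginal;
    # eps guards against log(0) for one-hot predictions.
    kl = np.sum(probs * (np.log(probs + eps) - np.log(p_hat + eps)), axis=1)
    # IS = exp of the mean KL divergence.
    return float(np.exp(kl.mean()))
```

For example, one-hot predictions spread evenly over four labels give a score of about 4 (the number of labels), while identical uniform predictions give a score of 1.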
Interpretation
A higher inception score is interpreted as "better", as it means that {\displaystyle p_{gen}} is a "sharp and distinct" collection of pictures.
{\displaystyle \ln IS(p_{gen},p_{dis})\in [0,\ln N]}, where {\displaystyle N} is the total number of possible labels.
{\displaystyle \ln IS(p_{gen},p_{dis})=0} iff for almost all {\displaystyle x\sim p_{gen}},
{\displaystyle p_{dis}(\cdot |x)=\int p_{dis}(\cdot |x)p_{gen}(x)dx}
That means {\displaystyle p_{gen}} is completely "indistinct": for any image {\displaystyle x} sampled from {\displaystyle p_{gen}}, the discriminator returns exactly the same label distribution {\displaystyle p_{dis}(\cdot |x)}.
The highest inception score {\displaystyle N} is achieved if and only if the two conditions are both true:
- For almost all {\displaystyle x\sim p_{gen}}, the distribution {\displaystyle p_{dis}(y|x)} is concentrated on one label. That is, {\displaystyle H_{y}[p_{dis}(y|x)]=0}. That is, every image sampled from {\displaystyle p_{gen}} is classified with complete certainty by the discriminator.
- For every label {\displaystyle y}, the proportion of generated images labelled as {\displaystyle y} is exactly {\displaystyle \mathbb {E} _{x\sim p_{gen}}[p_{dis}(y|x)]={\frac {1}{N}}}. That is, the generated images are equally distributed over all labels.
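The two conditions above can be checked numerically; both are needed for the maximum. The helper below is an assumed sketch mirroring the definition, used to contrast a mode-collapsed generator (condition 1 only) with an ideal one (both conditions).

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    # Assumed helper mirroring the definition: exp of the mean KL
    # divergence from each conditional to the empirical marginal.
    p_hat = probs.mean(axis=0)
    kl = np.sum(probs * (np.log(probs + eps) - np.log(p_hat + eps)), axis=1)
    return float(np.exp(kl.mean()))

N = 4  # number of labels

# Condition 1 only ("sharp" but collapsed): every image is confidently
# assigned the SAME label, so the marginal equals each conditional.
collapsed = np.tile(np.eye(N)[0], (100, 1))

# Both conditions ("sharp" and "diverse"): one-hot predictions spread
# evenly over all N labels.
ideal = np.tile(np.eye(N), (25, 1))
```

The collapsed generator scores 1 (the minimum) despite perfectly confident predictions, while the ideal one attains the upper bound {\displaystyle N}.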
References
- ^ Salimans, Tim; Goodfellow, Ian; Zaremba, Wojciech; Cheung, Vicki; Radford, Alec; Chen, Xi; Chen, Xi (2016). "Improved Techniques for Training GANs". Advances in Neural Information Processing Systems. 29. Curran Associates, Inc. arXiv:1606.03498.
- ^ Frolov, Stanislav; Hinz, Tobias; Raue, Federico; Hees, Jörn; Dengel, Andreas (December 2021). "Adversarial text-to-image synthesis: A review". Neural Networks. 144: 187–209. arXiv:2101.09983. doi:10.1016/j.neunet.2021.07.019. PMID 34500257. S2CID 231698782.
- ^ Borji, Ali (2022). "Pros and cons of GAN evaluation measures: New developments". Computer Vision and Image Understanding. 215: 103329. arXiv:2103.09396. doi:10.1016/j.cviu.2021.103329. S2CID 232257836.