|  | 1 | +{ | 
|  | 2 | + "cells": [ | 
|  | 3 | + { | 
|  | 4 | + "attachments": {}, | 
|  | 5 | + "cell_type": "markdown", | 
|  | 6 | + "metadata": {}, | 
|  | 7 | + "source": [ | 
|  | 8 | + "# Sequence Pediction Metrics\n", | 
|  | 9 | + "\n", | 
|  | 10 | + "Sequence prediction metrics all seek to summarize and quantify the extent to which a model has managed to reproduce, or accurately match, some gold standard sequences. Such problem arise throughout NLP.\n", | 
|  | 11 | + "\n", | 
|  | 12 | + "Examples:\n", | 
|  | 13 | + "\n", | 
|  | 14 | + "1. Mapping speech signals to their desired transcriptions.\n", | 
|  | 15 | + "1. Mapping texts in a language $L_{1}$ to their translations in a distinct language or dialect $L_{2}$.\n", | 
|  | 16 | + "1. Mapping input dialogue acts to their desired responses.\n", | 
|  | 17 | + "1. Mapping a sentence to one of its paraphrases.\n", | 
|  | 18 | + "1. Mapping real-world scenes or contexts (non-linguistic) to descriptions of them (linguistic)." | 
|  | 19 | + ] | 
|  | 20 | + }, | 
|  | 21 | + { | 
|  | 22 | + "attachments": {}, | 
|  | 23 | + "cell_type": "markdown", | 
|  | 24 | + "metadata": {}, | 
|  | 25 | + "source": [ | 
|  | 26 | + "Evaluations is very challenging because the relationships tend to be __many-to-one__: a given sentence might have multiple suitable translations; a given dialogue act will always have numerous felicitous responses; any scene can be described in multiple ways; and so forth. The most constrained of these problems is the speech-to-text case in 1, but even that one has indeterminacy in real-world contexts (humans often disagree about how to transcribe spoken language)." | 
|  | 27 | + ] | 
|  | 28 | + }, | 
|  | 29 | + { | 
|  | 30 | + "attachments": {}, | 
|  | 31 | + "cell_type": "markdown", | 
|  | 32 | + "metadata": {}, | 
|  | 33 | + "source": [ | 
|  | 34 | + "## Contents\n", | 
|  | 35 | + "\n", | 
|  | 36 | + "- **Word error rate**\n", | 
|  | 37 | + "- **BLUE score**\n", | 
|  | 38 | + "- **Perplexity**\n" | 
|  | 39 | + ] | 
|  | 40 | + }, | 
|  | 41 | + { | 
|  | 42 | + "attachments": {}, | 
|  | 43 | + "cell_type": "markdown", | 
|  | 44 | + "metadata": {}, | 
|  | 45 | + "source": [ | 
|  | 46 | + "## Imports" | 
|  | 47 | + ] | 
|  | 48 | + }, | 
|  | 49 | + { | 
|  | 50 | + "cell_type": "code", | 
|  | 51 | + "execution_count": 1, | 
|  | 52 | + "metadata": {}, | 
|  | 53 | + "outputs": [], | 
|  | 54 | + "source": [ | 
|  | 55 | + "%matplotlib inline\n", | 
|  | 56 | + "from nltk.metrics.distance import edit_distance\n", | 
|  | 57 | + "from nltk.translate import bleu_score\n", | 
|  | 58 | + "import numpy as np\n", | 
|  | 59 | + "import pandas as pd\n", | 
|  | 60 | + "import scipy.stats\n", | 
|  | 61 | + "from sklearn import metrics" | 
|  | 62 | + ] | 
|  | 63 | + }, | 
|  | 64 | + { | 
|  | 65 | + "attachments": {}, | 
|  | 66 | + "cell_type": "markdown", | 
|  | 67 | + "metadata": {}, | 
|  | 68 | + "source": [ | 
|  | 69 | + "## Word Error Rate\n", | 
|  | 70 | + "\n", | 
|  | 71 | + "The [word error rate](https://en.wikipedia.org/wiki/Word_error_rate) (WER) metric is a word-level, length-normalized measure of [Levenshtein string-edit distance](https://en.wikipedia.org/wiki/Levenshtein_distance):\n" | 
|  | 72 | + ] | 
|  | 73 | + }, | 
|  | 74 | + { | 
|  | 75 | + "cell_type": "code", | 
|  | 76 | + "execution_count": 2, | 
|  | 77 | + "metadata": {}, | 
|  | 78 | + "outputs": [], | 
|  | 79 | + "source": [ | 
|  | 80 | + "def wer(seq_true, seq_pred):\n", | 
|  | 81 | + " d = edit_distance(seq_true, seq_pred)\n", | 
|  | 82 | + " return d / len(seq_true)" | 
|  | 83 | + ] | 
|  | 84 | + }, | 
|  | 85 | + { | 
|  | 86 | + "cell_type": "code", | 
|  | 87 | + "execution_count": 3, | 
|  | 88 | + "metadata": {}, | 
|  | 89 | + "outputs": [ | 
|  | 90 | + { | 
|  | 91 | + "data": { | 
|  | 92 | + "text/plain": [ | 
|  | 93 | + "0.3333333333333333" | 
|  | 94 | + ] | 
|  | 95 | + }, | 
|  | 96 | + "execution_count": 3, | 
|  | 97 | + "metadata": {}, | 
|  | 98 | + "output_type": "execute_result" | 
|  | 99 | + } | 
|  | 100 | + ], | 
|  | 101 | + "source": [ | 
|  | 102 | + "wer(['A', 'B', 'C'], ['A', 'A', 'C'])" | 
|  | 103 | + ] | 
|  | 104 | + }, | 
|  | 105 | + { | 
|  | 106 | + "cell_type": "code", | 
|  | 107 | + "execution_count": 4, | 
|  | 108 | + "metadata": {}, | 
|  | 109 | + "outputs": [ | 
|  | 110 | + { | 
|  | 111 | + "data": { | 
|  | 112 | + "text/plain": [ | 
|  | 113 | + "0.25" | 
|  | 114 | + ] | 
|  | 115 | + }, | 
|  | 116 | + "execution_count": 4, | 
|  | 117 | + "metadata": {}, | 
|  | 118 | + "output_type": "execute_result" | 
|  | 119 | + } | 
|  | 120 | + ], | 
|  | 121 | + "source": [ | 
|  | 122 | + "wer(['A', 'B', 'C', 'D'], ['A', 'A', 'C', 'D'])" | 
|  | 123 | + ] | 
|  | 124 | + }, | 
|  | 125 | + { | 
|  | 126 | + "attachments": {}, | 
|  | 127 | + "cell_type": "markdown", | 
|  | 128 | + "metadata": {}, | 
|  | 129 | + "source": [ | 
|  | 130 | + "To calculate this over the entire test-set, one gets the edit-distances for each gold–predicted pair and normalizes these by the length of all the gold examples, rather than normalizing each case:" | 
|  | 131 | + ] | 
|  | 132 | + }, | 
|  | 133 | + { | 
|  | 134 | + "cell_type": "code", | 
|  | 135 | + "execution_count": 5, | 
|  | 136 | + "metadata": {}, | 
|  | 137 | + "outputs": [], | 
|  | 138 | + "source": [ | 
|  | 139 | + "def corpus_wer(y_true, y_pred):\n", | 
|  | 140 | + " dists = [edit_distance(seq_true, seq_pred)\n", | 
|  | 141 | + " for seq_true, seq_pred in zip(y_true, y_pred)]\n", | 
|  | 142 | + " lengths = [len(seq) for seq in y_true]\n", | 
|  | 143 | + " return sum(dists) / sum(lengths)" | 
|  | 144 | + ] | 
|  | 145 | + }, | 
|  | 146 | + { | 
|  | 147 | + "attachments": {}, | 
|  | 148 | + "cell_type": "markdown", | 
|  | 149 | + "metadata": {}, | 
|  | 150 | + "source": [ | 
|  | 151 | + "This gives a single summary value for the entire set of errors." | 
|  | 152 | + ] | 
|  | 153 | + }, | 
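|  |  | + { | 
|  |  | + "attachments": {}, | 
|  |  | + "cell_type": "markdown", | 
|  |  | + "metadata": {}, | 
|  |  | + "source": [ | 
|  |  | + "As a quick illustration, here is `corpus_wer` applied to the two gold–predicted pairs from the single-pair examples above, treated as a tiny corpus: the summed edit distance is divided by the summed gold length." | 
|  |  | + ] | 
|  |  | + }, | 
|  |  | + { | 
|  |  | + "cell_type": "code", | 
|  |  | + "execution_count": null, | 
|  |  | + "metadata": {}, | 
|  |  | + "outputs": [], | 
|  |  | + "source": [ | 
|  |  | + "# Toy corpus reusing the two pairs from above.\n", | 
|  |  | + "y_true = [['A', 'B', 'C'], ['A', 'B', 'C', 'D']]\n", | 
|  |  | + "y_pred = [['A', 'A', 'C'], ['A', 'A', 'C', 'D']]\n", | 
|  |  | + "\n", | 
|  |  | + "# (1 + 1) edits / (3 + 4) gold tokens = 2/7:\n", | 
|  |  | + "corpus_wer(y_true, y_pred)" | 
|  |  | + ] | 
|  |  | + }, | 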
|  | 154 | + { | 
|  | 155 | + "attachments": {}, | 
|  | 156 | + "cell_type": "markdown", | 
|  | 157 | + "metadata": {}, | 
|  | 158 | + "source": [ | 
|  | 159 | + "### Bounds of word error rate\n", | 
|  | 160 | + "\n", | 
|  | 161 | + "$[0, \\infty),ドル where 0 is best. (The lack of a finite upper bound derives from the fact that the normalizing constant is given by the true sequences, and the predicted sequences can differ from them in any conceivable way in principle.)" | 
|  | 162 | + ] | 
|  | 163 | + }, | 
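|  |  | + { | 
|  |  | + "attachments": {}, | 
|  |  | + "cell_type": "markdown", | 
|  |  | + "metadata": {}, | 
|  |  | + "source": [ | 
|  |  | + "A small made-up illustration of the unbounded upper end: a prediction much longer than the gold sequence pushes WER above 1." | 
|  |  | + ] | 
|  |  | + }, | 
|  |  | + { | 
|  |  | + "cell_type": "code", | 
|  |  | + "execution_count": null, | 
|  |  | + "metadata": {}, | 
|  |  | + "outputs": [], | 
|  |  | + "source": [ | 
|  |  | + "# 3 edits (one substitution, two insertions) against a gold length of 1:\n", | 
|  |  | + "wer(['A'], ['B', 'C', 'D'])" | 
|  |  | + ] | 
|  |  | + }, | 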
|  | 164 | + { | 
|  | 165 | + "attachments": {}, | 
|  | 166 | + "cell_type": "markdown", | 
|  | 167 | + "metadata": {}, | 
|  | 168 | + "source": [ | 
|  | 169 | + "### Value encoded by word error rate\n", | 
|  | 170 | + "\n", | 
|  | 171 | + "This method says that our desired notion of closeness or accuracy can be operationalized in terms of the low-level operations of insertion, deletion, and substitution. The guiding intuition is very much like that of F scores." | 
|  | 172 | + ] | 
|  | 173 | + }, | 
|  | 174 | + { | 
|  | 175 | + "attachments": {}, | 
|  | 176 | + "cell_type": "markdown", | 
|  | 177 | + "metadata": {}, | 
|  | 178 | + "source": [ | 
|  | 179 | + "### Weaknesses of word error rate\n", | 
|  | 180 | + "\n", | 
|  | 181 | + "The value encoded reveals a potential weakness in certain domains. Roughly, the more __semantic__ the task, the less appropriate WER is likely to be. \n", | 
|  | 182 | + "\n", | 
|  | 183 | + "For example, adding a negation to a sentence will radically change its meaning but incur only a small WER penalty, whereas passivizing a sentence (_Kim won the race_ → _The race was won by Kim_) will hardly change its meaning at all but incur a large WER penalty. \n", | 
|  | 184 | + "\n", | 
|  | 185 | + "See also [Liu et al. 2016](https://www.aclweb.org/anthology/D16-1230) for similar arguments in the context of dialogue generation." | 
|  | 186 | + ] | 
|  | 187 | + }, | 
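|  |  | + { | 
|  |  | + "attachments": {}, | 
|  |  | + "cell_type": "markdown", | 
|  |  | + "metadata": {}, | 
|  |  | + "source": [ | 
|  |  | + "This contrast can be made concrete with the `wer` function from above, using whitespace tokenization on the passivization example just given plus a made-up negation variant:" | 
|  |  | + ] | 
|  |  | + }, | 
|  |  | + { | 
|  |  | + "cell_type": "code", | 
|  |  | + "execution_count": null, | 
|  |  | + "metadata": {}, | 
|  |  | + "outputs": [], | 
|  |  | + "source": [ | 
|  |  | + "# Adding a negation flips the meaning but costs only one insertion (WER = 0.25):\n", | 
|  |  | + "print(wer('Kim won the race'.split(), 'Kim never won the race'.split()))\n", | 
|  |  | + "\n", | 
|  |  | + "# Passivizing roughly preserves the meaning but changes most token positions\n", | 
|  |  | + "# (5 edits against 4 gold tokens, so WER = 1.25):\n", | 
|  |  | + "print(wer('Kim won the race'.split(), 'The race was won by Kim'.split()))" | 
|  |  | + ] | 
|  |  | + }, | 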
|  | 188 | + { | 
|  | 189 | + "attachments": {}, | 
|  | 190 | + "cell_type": "markdown", | 
|  | 191 | + "metadata": {}, | 
|  | 192 | + "source": [ | 
|  | 193 | + "### Related to word error rate\n", | 
|  | 194 | + "\n", | 
|  | 195 | + "* WER can be thought of as a family of different metrics varying in the notion of edit distance that they employ.\n", | 
|  | 196 | + "\n", | 
|  | 197 | + "* The Word Accuracy Rate is 1.0 minus the WER, which, despits its name, is intuitively more like [recall](#Recall) than [accuracy](#Accuracy)." | 
|  | 198 | + ] | 
|  | 199 | + }, | 
|  | 200 | + { | 
|  | 201 | + "cell_type": "markdown", | 
|  | 202 | + "metadata": {}, | 
|  | 203 | + "source": [] | 
|  | 204 | + }, | 
|  | 205 | + { | 
|  | 206 | + "attachments": {}, | 
|  | 207 | + "cell_type": "markdown", | 
|  | 208 | + "metadata": {}, | 
|  | 209 | + "source": [ | 
|  | 210 | + "## BLEU Scores\n", | 
|  | 211 | + "\n", | 
|  | 212 | + "BLEU(Bilingual Evaluation Understudy) scores were originally developed in the context of machine translation, but they are applied in other generation tasks as well.\n", | 
|  | 213 | + "\n", | 
|  | 214 | + "For BLEU scoring, we require a dataset $Y$ consisting of instances $(a, B)$ where $a$ is a candidate (a model prediction) and $B$ is a set of gold texts. The metric has two main components:\n", | 
|  | 215 | + "\n", | 
|  | 216 | + "* __Modified n-gram precision__: A direct application of precision would divide the number of correct n-grams in the candidate (n-grams that appear in any translation) by the total number of n-grams in the candidate This has a degenerate solution in which the predicted output contains only one n-gram. BLEU's modified version substitutes the actual count for each n-gram $s$ in the candidate by the maximum number of times $s$ appears in any gold text.\n", | 
|  | 217 | + "\n", | 
|  | 218 | + "* __Brevity penalty (BP)__: to avoid favoring outputs that are too short, a penalty is applied. Let $r$ be the sum of all minimal absolute length differences between candidates and referents in the dataset $Y,ドル and let $c$ be the sum of the lengths of all the candidates. Then:\n", | 
|  | 219 | + "\n", | 
|  | 220 | + "$$\\textbf{BP}(Y) =\n", | 
|  | 221 | + "\\begin{cases}\n", | 
|  | 222 | + "1 & \\textrm{ if } c > r \\\\\n", | 
|  | 223 | + "\\exp(1 - \\frac{r}{c}) & \\textrm{otherwise}\n", | 
|  | 224 | + "\\end{cases}$$\n", | 
|  | 225 | + "\n" | 
|  | 226 | + ] | 
|  | 227 | + }, | 
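|  |  | + { | 
|  |  | + "attachments": {}, | 
|  |  | + "cell_type": "markdown", | 
|  |  | + "metadata": {}, | 
|  |  | + "source": [ | 
|  |  | + "To make the clipping step concrete, here is a from-scratch sketch of modified n-gram precision for a single candidate. It is for illustration only; NLTK's implementation, shown below, is the one to rely on in practice." | 
|  |  | + ] | 
|  |  | + }, | 
|  |  | + { | 
|  |  | + "cell_type": "code", | 
|  |  | + "execution_count": null, | 
|  |  | + "metadata": {}, | 
|  |  | + "outputs": [], | 
|  |  | + "source": [ | 
|  |  | + "from collections import Counter\n", | 
|  |  | + "\n", | 
|  |  | + "def modified_precision(candidate, references, n=1):\n", | 
|  |  | + "    # Count n-grams, then clip each candidate n-gram count by the maximum\n", | 
|  |  | + "    # number of times that n-gram occurs in any single reference.\n", | 
|  |  | + "    def ngrams(seq):\n", | 
|  |  | + "        return Counter(tuple(seq[i: i + n]) for i in range(len(seq) - n + 1))\n", | 
|  |  | + "    cand_counts = ngrams(candidate)\n", | 
|  |  | + "    max_ref_counts = Counter()\n", | 
|  |  | + "    for ref in references:\n", | 
|  |  | + "        for gram, count in ngrams(ref).items():\n", | 
|  |  | + "            max_ref_counts[gram] = max(max_ref_counts[gram], count)\n", | 
|  |  | + "    clipped = sum(min(count, max_ref_counts[gram])\n", | 
|  |  | + "                  for gram, count in cand_counts.items())\n", | 
|  |  | + "    return clipped / sum(cand_counts.values())\n", | 
|  |  | + "\n", | 
|  |  | + "# The degenerate candidate gets unigram precision 2/7 rather than 7/7:\n", | 
|  |  | + "modified_precision(['the'] * 7, [['the', 'cat', 'is', 'on', 'the', 'mat']], n=1)" | 
|  |  | + ] | 
|  |  | + }, | 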
|  | 228 | + { | 
|  | 229 | + "attachments": {}, | 
|  | 230 | + "cell_type": "markdown", | 
|  | 231 | + "metadata": {}, | 
|  | 232 | + "source": [ | 
|  | 233 | + "The BLEU score itself is typically a combination of modified n-gram precision for various $n$ (usually up to 4):\n", | 
|  | 234 | + "\n", | 
|  | 235 | + "$$\\textbf{BLEU}(Y) = \\textbf{BP}(Y) \\cdot \n", | 
|  | 236 | + " \\exp\\left(\\sum_{n=1}^{N} w_{n} \\cdot \\log\\left(\\textbf{modified-precision}(Y, n\\right)\\right)$$\n", | 
|  | 237 | + "\n", | 
|  | 238 | + "where $Y$ is the dataset, and $w_{n}$ is a weight for each $n$-gram level (usually set to 1ドル/n$).\n", | 
|  | 239 | + "\n", | 
|  | 240 | + "NLTK has [a flexible implementation of Bleu scoring](http://www.nltk.org/_modules/nltk/translate/bleu_score.html)." | 
|  | 241 | + ] | 
|  | 242 | + }, | 
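|  |  | + { | 
|  |  | + "attachments": {}, | 
|  |  | + "cell_type": "markdown", | 
|  |  | + "metadata": {}, | 
|  |  | + "source": [ | 
|  |  | + "A quick look at the NLTK interface, with made-up token lists; `corpus_bleu` expects, for each candidate, a list of gold reference token lists:" | 
|  |  | + ] | 
|  |  | + }, | 
|  |  | + { | 
|  |  | + "cell_type": "code", | 
|  |  | + "execution_count": null, | 
|  |  | + "metadata": {}, | 
|  |  | + "outputs": [], | 
|  |  | + "source": [ | 
|  |  | + "# Two candidates, each paired with its own set of gold references (made up).\n", | 
|  |  | + "refs = [\n", | 
|  |  | + "    [['the', 'cat', 'is', 'on', 'the', 'mat'],\n", | 
|  |  | + "     ['there', 'is', 'a', 'cat', 'on', 'the', 'mat']],\n", | 
|  |  | + "    [['kim', 'won', 'the', 'race']]\n", | 
|  |  | + "]\n", | 
|  |  | + "cands = [\n", | 
|  |  | + "    ['the', 'cat', 'sat', 'on', 'the', 'mat'],\n", | 
|  |  | + "    ['the', 'race', 'was', 'won', 'by', 'kim']\n", | 
|  |  | + "]\n", | 
|  |  | + "\n", | 
|  |  | + "# Smoothing guards against zero higher-order n-gram matches,\n", | 
|  |  | + "# which tiny examples like these are prone to.\n", | 
|  |  | + "smoother = bleu_score.SmoothingFunction().method1\n", | 
|  |  | + "bleu_score.corpus_bleu(refs, cands, smoothing_function=smoother)" | 
|  |  | + ] | 
|  |  | + }, | 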
|  | 243 | + { | 
|  | 244 | + "attachments": {}, | 
|  | 245 | + "cell_type": "markdown", | 
|  | 246 | + "metadata": {}, | 
|  | 247 | + "source": [ | 
|  | 248 | + "### Bounds of BLEU scores\n", | 
|  | 249 | + "\n", | 
|  | 250 | + "[0, 1], with 1 being the best, though with no expectation that any system will achieve 1, since even sets of human-created translations do not reach this level." | 
|  | 251 | + ] | 
|  | 252 | + }, | 
|  | 253 | + { | 
|  | 254 | + "attachments": {}, | 
|  | 255 | + "cell_type": "markdown", | 
|  | 256 | + "metadata": {}, | 
|  | 257 | + "source": [ | 
|  | 258 | + "### Value encoded by BLEU scores\n", | 
|  | 259 | + "\n", | 
|  | 260 | + "BLEU scores attempt to achieve the same balance between precision and recall that runs through the majority of the metrics discussed here. It has many affinities with [word error rate](#Word-error-rate), but seeks to accommodate the fact that there are typically multiple suitable outputs for a given input." | 
|  | 261 | + ] | 
|  | 262 | + }, | 
|  | 263 | + { | 
|  | 264 | + "attachments": {}, | 
|  | 265 | + "cell_type": "markdown", | 
|  | 266 | + "metadata": {}, | 
|  | 267 | + "source": [ | 
|  | 268 | + "### Weaknesses of BLEU scores\n", | 
|  | 269 | + "\n", | 
|  | 270 | + "* [Callison-Burch et al. (2006)](http://www.aclweb.org/anthology/E06-1032) criticize BLEU as a machine translation metric on the grounds that it fails to correlate with human scoring of translations. They highlight its insensitivity to n-gram order and its insensitivity to n-gram types (e.g., function vs. content words) as causes of this lack of correlation.\n", | 
|  | 271 | + "\n", | 
|  | 272 | + "* [Liu et al. (2016)](https://www.aclweb.org/anthology/D16-1230) specifically argue against BLEU as a metric for assessing dialogue systems, based on a lack of correlation with human judgments about dialogue coherence." | 
|  | 273 | + ] | 
|  | 274 | + }, | 
|  | 275 | + { | 
|  | 276 | + "attachments": {}, | 
|  | 277 | + "cell_type": "markdown", | 
|  | 278 | + "metadata": {}, | 
|  | 279 | + "source": [ | 
|  | 280 | + "### Related to BLEU scores\n", | 
|  | 281 | + "\n", | 
|  | 282 | + "There are many competitors/alternatives to BLEU, most proposed in the context of machine translation. Examples: [ROUGE](https://en.wikipedia.org/wiki/ROUGE_(metric), [METEOR](https://en.wikipedia.org/wiki/METEOR), [HyTER](http://www.aclweb.org/anthology/N12-1017), [Orange (smoothed Bleu)](http://www.aclweb.org/anthology/C04-1072)." | 
|  | 283 | + ] | 
|  | 284 | + }, | 
|  | 285 | + { | 
|  | 286 | + "cell_type": "code", | 
|  | 287 | + "execution_count": null, | 
|  | 288 | + "metadata": {}, | 
|  | 289 | + "outputs": [], | 
|  | 290 | + "source": [] | 
|  | 291 | + }, | 
|  | 292 | + { | 
|  | 293 | + "attachments": {}, | 
|  | 294 | + "cell_type": "markdown", | 
|  | 295 | + "metadata": {}, | 
|  | 296 | + "source": [ | 
|  | 297 | + "## Perplexity\n", | 
|  | 298 | + "\n", | 
|  | 299 | + "[Perplexity](https://en.wikipedia.org/wiki/Perplexity) is a common metric for directly assessing generation models by calculating the probability that they assign to sequences in the test data:\n", | 
|  | 300 | + "\n", | 
|  | 301 | + "$$\\textbf{PP}(p, \\textbf{x}) = \\prod_{i=1}^{n}\\left(\\frac{1}{p(x_{i})}\\right)^{\\frac{1}{n}}$$\n", | 
|  | 302 | + "\n", | 
|  | 303 | + "where $p$ is a model assigning probabilities to elements and $\\textbf{x}$ is a sequence of length $n$.\n", | 
|  | 304 | + "\n", | 
|  | 305 | + "When averaging perplexity values obtained from all the sequences in a text corpus, one should again use the geometric mean:\n", | 
|  | 306 | + "\n", | 
|  | 307 | + "$$\\textbf{mean-PP}(p, X) = \n", | 
|  | 308 | + "\\exp\\left(\\frac{1}{m}\\sum_{x\\in X}\\log \\textbf{PP}(p, \\textbf{x})\\right)$$\n", | 
|  | 309 | + "\n", | 
|  | 310 | + "for a set of $m$ examples $X$." | 
|  | 311 | + ] | 
|  | 312 | + }, | 
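|  |  | + { | 
|  |  | + "attachments": {}, | 
|  |  | + "cell_type": "markdown", | 
|  |  | + "metadata": {}, | 
|  |  | + "source": [ | 
|  |  | + "Here is a minimal sketch of these two formulas, assuming `probs` holds the probabilities a model assigned to the elements of a test sequence (computed in log space for numerical stability):" | 
|  |  | + ] | 
|  |  | + }, | 
|  |  | + { | 
|  |  | + "cell_type": "code", | 
|  |  | + "execution_count": null, | 
|  |  | + "metadata": {}, | 
|  |  | + "outputs": [], | 
|  |  | + "source": [ | 
|  |  | + "def perplexity(probs):\n", | 
|  |  | + "    # PP(p, x): the geometric mean of the inverse probabilities,\n", | 
|  |  | + "    # computed as exp of the average negative log-probability.\n", | 
|  |  | + "    return np.exp(-np.mean(np.log(probs)))\n", | 
|  |  | + "\n", | 
|  |  | + "def mean_perplexity(prob_seqs):\n", | 
|  |  | + "    # mean-PP(p, X): the geometric mean of the per-sequence perplexities.\n", | 
|  |  | + "    return np.exp(np.mean([np.log(perplexity(probs)) for probs in prob_seqs]))\n", | 
|  |  | + "\n", | 
|  |  | + "# Made-up probability values, purely for illustration:\n", | 
|  |  | + "perplexity([0.5, 0.25, 0.25]), mean_perplexity([[0.5, 0.25], [0.1, 0.2, 0.4]])" | 
|  |  | + ] | 
|  |  | + }, | 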
|  | 313 | + { | 
|  | 314 | + "attachments": {}, | 
|  | 315 | + "cell_type": "markdown", | 
|  | 316 | + "metadata": {}, | 
|  | 317 | + "source": [ | 
|  | 318 | + "### Bounds of perplexity\n", | 
|  | 319 | + "\n", | 
|  | 320 | + "[1, $\\infty$], where 1 is best." | 
|  | 321 | + ] | 
|  | 322 | + }, | 
|  | 323 | + { | 
|  | 324 | + "attachments": {}, | 
|  | 325 | + "cell_type": "markdown", | 
|  | 326 | + "metadata": {}, | 
|  | 327 | + "source": [ | 
|  | 328 | + "### Values encoded by perplexity\n", | 
|  | 329 | + "\n", | 
|  | 330 | + "The guiding idea behind perplexity is that a good model will assign high probability to the sequences in the test data. This is an intuitive, expedient intrinsic evaluation, and it matches well with the objective for models trained with a cross-entropy or logistic objective." | 
|  | 331 | + ] | 
|  | 332 | + }, | 
|  | 333 | + { | 
|  | 334 | + "attachments": {}, | 
|  | 335 | + "cell_type": "markdown", | 
|  | 336 | + "metadata": {}, | 
|  | 337 | + "source": [ | 
|  | 338 | + "### Weaknesses of perplexity\n", | 
|  | 339 | + "\n", | 
|  | 340 | + "* Perplexity is heavily dependent on the nature of the underlying vocabulary in the following sense: one can artificially lower one's perplexity by having a lot of `UNK` tokens in the training and test sets. Consider the extreme case in which _everything_ is mapped to `UNK` and perplexity is thus perfect on any test set. The more worrisome thing is that any amount of `UNK` usage side-steps the pervasive challenge of dealing with infrequent words.\n", | 
|  | 341 | + "\n", | 
|  | 342 | + "* [As Hal Daumé discusses in this post](https://nlpers.blogspot.com/2014/05/perplexity-versus-error-rate-for.html), the perplexity metric imposes an artificial constraint that one's model outputs are probabilistic." | 
|  | 343 | + ] | 
|  | 344 | + }, | 
|  | 345 | + { | 
|  | 346 | + "attachments": {}, | 
|  | 347 | + "cell_type": "markdown", | 
|  | 348 | + "metadata": {}, | 
|  | 349 | + "source": [ | 
|  | 350 | + "### Related to perplexity\n", | 
|  | 351 | + "\n", | 
|  | 352 | + "Perplexity is the inverse of probability and, [with some assumptions](http://www.cs.cmu.edu/~roni/11761/PreviousYearsHandouts/gauntlet.pdf), can be seen as an approximation of the cross-entropy between the model's predictions and the true underlying sequence probabilities." | 
|  | 353 | + ] | 
|  | 354 | + }, | 
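|  |  | + { | 
|  |  | + "attachments": {}, | 
|  |  | + "cell_type": "markdown", | 
|  |  | + "metadata": {}, | 
|  |  | + "source": [ | 
|  |  | + "Concretely, with the sketch above, the log of a sequence's perplexity is exactly its average negative log-probability, i.e., an empirical cross-entropy in nats:" | 
|  |  | + ] | 
|  |  | + }, | 
|  |  | + { | 
|  |  | + "cell_type": "code", | 
|  |  | + "execution_count": null, | 
|  |  | + "metadata": {}, | 
|  |  | + "outputs": [], | 
|  |  | + "source": [ | 
|  |  | + "# The two quantities coincide (up to floating point), using made-up values:\n", | 
|  |  | + "probs = [0.5, 0.25, 0.125, 0.125]\n", | 
|  |  | + "np.log(perplexity(probs)), -np.mean(np.log(probs))" | 
|  |  | + ] | 
|  |  | + }, | 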
|  | 355 | + { | 
|  | 356 | + "cell_type": "markdown", | 
|  | 357 | + "metadata": {}, | 
|  | 358 | + "source": [] | 
|  | 359 | + } | 
|  | 360 | + ], | 
|  | 361 | + "metadata": { | 
|  | 362 | + "kernelspec": { | 
|  | 363 | + "display_name": "base", | 
|  | 364 | + "language": "python", | 
|  | 365 | + "name": "python3" | 
|  | 366 | + }, | 
|  | 367 | + "language_info": { | 
|  | 368 | + "codemirror_mode": { | 
|  | 369 | + "name": "ipython", | 
|  | 370 | + "version": 3 | 
|  | 371 | + }, | 
|  | 372 | + "file_extension": ".py", | 
|  | 373 | + "mimetype": "text/x-python", | 
|  | 374 | + "name": "python", | 
|  | 375 | + "nbconvert_exporter": "python", | 
|  | 376 | + "pygments_lexer": "ipython3", | 
|  | 377 | + "version": "3.9.12" | 
|  | 378 | + }, | 
|  | 379 | + "orig_nbformat": 4 | 
|  | 380 | + }, | 
|  | 381 | + "nbformat": 4, | 
|  | 382 | + "nbformat_minor": 2 | 
|  | 383 | +} | 