
“If you deprive the robot of your intuition about cause and effect, you’re never going to communicate meaningfully.” – Pearl ’18

In a paper at NIPS '15, Judea Pearl and colleagues put together a toy example where all of the standard multi-armed bandit (MAB) algorithms fail. The paper shows that a MAB algorithm can do worse than random guessing, demonstrates how to overcome the problem in one such case, and raises questions that need to be addressed when standard MAB algorithms are used in practice. And as long as we're describing human learning as Bayesian updating, the paper is also interesting as a comment on optimizing human decision making.

But looking at the Google Scholar citations, that paper has been cited only 17 times since then (well, really 16, since one paper is in there twice). So what gives?

The paper is a fairly straightforward extension of the MAB problem. In a typical MAB setting, the goal is to specify a decision-making algorithm for selecting among discrete choices (e.g. the arms of casino slot machines) that minimizes regret over the trials played (i.e. the opportunity cost of selecting suboptimal options). The paper calls out two issues in this setting: (i) when the observations used to set the parameters of the algorithm are censored to exclude the high-performing contexts, the algorithm can perform worse than random guessing, and (ii) when context isn't accounted for, the algorithm won't end up selecting the optimal choice even with completely random exploration.

Their simple solution augments standard Thompson sampling, which usually makes a selection by taking the arm with the highest reward sampled from a distribution modeling belief over arm rewards. Here, the algorithm instead first samples arm values, then uses the difference between the expected rewards of the two arms to scale down the sampled value of the higher arm, and finally plays the arm with the higher post-adjustment value. Using the difference in expected rewards allows the algorithm to avoid conditioning on context directly, and instead just compares the two choices given the would-be selection. In cases where the higher sampled value comes from the lower-performing arm, the sampled value will eventually get scaled down to 0 and that arm will be selected less frequently.

Among the unaddressed points are: (i) how to compute this difference when there are more than 2 arms and the relationship between them is unknown, (ii) why this is the right way to include belief about counterfactual performance in the algorithm (are there more direct ways of including it?), and (iii) whether this can be done without Thompson sampling, which can get computationally expensive as the number of arms and contexts grows.
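To make the adjustment concrete, here is a minimal two-arm sketch in R, assuming Bernoulli rewards with Beta posteriors; the weighting below just illustrates scaling the favored arm's sample by the gap in expected rewards, and is not the paper's exact scheme.

# two-arm Thompson sampling with the adjustment described above (illustrative)
alpha <- c(1, 1)   # Beta posterior parameters per arm
beta  <- c(1, 1)

play_arm <- function() {
  theta <- rbeta(2, alpha, beta)         # sample arm values from the posteriors
  mu    <- alpha / (alpha + beta)        # current expected rewards
  top   <- which.max(theta)
  w     <- 1 - abs(mu[top] - mu[-top])   # shrink by the gap in expected rewards
  theta[top] <- theta[top] * w           # scale down the favored arm's sample
  which.max(theta)                       # play the arm with the higher value
}

update_posterior <- function(arm, reward) {
  alpha[arm] <<- alpha[arm] + reward     # conjugate Beta-Bernoulli update
  beta[arm]  <<- beta[arm] + (1 - reward)
}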

The authors also argue that unobserved confounding may be more the rule than the exception in practice, citing the familiar example of doctors prescribing more expensive drugs to more affluent patients, which they use to draw a distinction between random sampling of prescription events and random assignment of drugs to patients. In that case, the effect of the drug can't be decoupled from the effect of differences in lifestyle without also prescribing the same drug to the less privileged patients.

As an even more familiar example for those doing data science at tech companies, consider online ad auctions, where censoring of observations can happen through impression win rates. The table below reproduces the values in Table 1 from the paper, with win probabilities taking the place of intuition. Say we want to set bid values for two ads in a campaign by adjusting the budget each needs to spend in some defined amount of time. If this is done without considering the combinations of segments and sites when selecting a budget level, both ads could end up with performance around (0.15*0.8 + 0.45*0.1) conversions per auction, which would be significantly worse than bidding correctly on the low win rate / high conversion rate combinations of user segments and sites (see the quick check after the table).

bid  segment  site         P(imp)  P(conversion | imp)
1    1        reddit.com    0.1     0.5
1    2        nytimes.com   0.1     0.4
1    1        nytimes.com   0.8     0.1
1    2        reddit.com    0.8     0.2
2    2        reddit.com    0.1     0.4
2    1        nytimes.com   0.1     0.5
2    2        nytimes.com   0.8     0.2
2    1        reddit.com    0.8     0.1
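As a quick check on the numbers above (the table's values are hypothetical), compare the context-free performance with what bidding up the high-conversion cells could approach:

# context-free bidding: each ad mostly wins the low-conversion impressions
0.8 * 0.15 + 0.1 * 0.45   # ~0.165 expected conversions per auction

# if higher bids won the low-win-rate / high-conversion cells at the same
# 0.8 rate (an assumption for illustration), we would approach:
0.8 * 0.45 + 0.1 * 0.15   # ~0.375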

So given all this (a relatively simple paper, ample room for improvement, broad relevance, and the costly implications of ignoring ideas that have been in the same literature for over 20 years), why has so little work picked up on this thread? Are tech companies successfully avoiding these issues? And how are we at the point where Pearl appears in a recent Atlantic article as one of machine learning's "sharpest critics"?

Do AIs dream of pwning FF leagues?

In a previous post, I looked at how the established value based drafting (VBD) algorithm for picking fantasy football rosters would perform in a league of typical human players. It turned out that we get different performance depending on whether we rank VBD drafters based (i) on expected preseason player forecasts or (ii) on actual points scored by a player that season. Based on preseason forecasts, we could expect a VBD roster to place 2nd in a 12-team league, while using actual player points, VBD's expected rank is only 4.68. That's still better than chance (which would put VBD in 6th place on average), but it's really only a slight advantage. This made me wonder whether it would be all that difficult to improve on VBD using methods similar to those used to train AIs to play video/board games.

The simplest starting point might be taking a value iteration approach similar to the one described in the Mnih et al. Atari games paper from 2013. A pretty accessible introduction to the topic can be found in chapter 12 of David Poole and Alan Mackworth's free online AI text, but in a nutshell, value iteration learns a game-playing policy by iteratively learning a function, Q(s,a), which captures the value of taking an action, a, when in a particular state, s. The function represents both the immediate reward, r_t, and the reward we can expect from the state, s_{t+1}, that taking the action, a_t, lands us in:

\text{Q}(s_t, a_t) = r_t + \gamma \, \text{max}_{a_{t+1}}\text{Q}(s_{t+1}, a_{t+1})

where 0\le\gamma\le1 is a discount, essentially just a tuneable hyperparameter encoding the tradeoff between immediate and long-term rewards. Doing this makes sense because on the last turn of the game we will act in the way that maximizes immediate reward, and on earlier turns we will act in a way that gives an acceptable reward while positioning us to collect high rewards in later turns. Overall, it's not that different from standard string alignment using dynamic programming. So the solution approaches the optimal policy provided that we have enough data to get reliable value estimates in every cell (read: state).

It's also similar to the VBD algorithm, where the value function has been tuned through trial and error by humans since its initial description in 2001. The immediate reward would be r_t = (\text{Pts}(a_t) - \text{max}_{a':\, \text{Pos}(a') = \text{Pos}(a_t),\, t' \ge 100} \text{Pts}(a')) \cdot dsct_{VBD}, which captures the difference between a drafted player's projected points and the projected points for the best player in the same position after 100 turns, where dsct_{VBD} encodes the decreased value of drafting substitute players in a given position after the starter slots are filled. VBD doesn't directly account for the state this would put us in with regard to players we would draft in subsequent turns, so \gamma=0.

For the Atari games, Q(s,a) was modeled using a neural network trained on gameplay data. Data was collected using an \epsilon-greedy policy, and weight updates were done in mini-batches of randomly sampled (state, action, reward) transitions. The algorithm is captured in the following steps:

Initialize (gamma, eps) to numbers in [0, 1]
Initialize Q_0(s, a) with some model of expected rewards for actions in different states
while patience:
    # simulate games
    for t in turns:
        u = sample from uniform distribution on [0, 1]
        if u < eps:
            act randomly
        else:
            act according to argmax_a Q_i(s_t, a)
    # update value function
    sample (s_t, a_t, r_t, s_{t+1}) experience tuples from games
    update NN weights in Q_i to Q_{i+1} using:
        x = [s_t, a_t]
        y = r_t + gamma * max_a Q_i(s_{t+1}, a)
    decay eps, increment i

So getting started…

The previous post evaluating VBD already left me with a more or less reasonable simulator of leagues with typical human drafters and players strictly following VBD. Here, I would just need to extend it with (i) a drafting policy that either exploits the value function or explores new roster combinations with some randomly selected actions, (ii) logging of experience tuples (s_t, a_t, r_t, s_{t+1}), and (iii) updates to the value function by replaying the collected experience.

Since the draft simulator is already written in R, I decided to start modeling value with two of the bread-and-butter models readily available: linear regression with lasso regularization through glmnet, and gradient boosted trees through xgboost. Of course, the strength of the Atari paper was the use of the latest and greatest convolutional networks, and maybe it's worth eventually building up to that, but for a quick first cut, I wanted to try something simple that works out of the box.
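As a sketch of what a single replay update might look like with xgboost (the actual scripts are linked below; the replay data.frame, its feat_ column prefix, and the precomputed next_best column are hypothetical placeholders):

library(xgboost)

# one experience-replay update: fit sampled transitions to the bootstrap
# target r_t + gamma * max_a Q_i(s_{t+1}, a); next_best is assumed to hold
# that max already (0 at the end of the draft)
fit_q_update <- function(replay, gamma = 0.9) {
  X <- as.matrix(replay[, grep("^feat_", names(replay))])
  y <- replay$reward + gamma * replay$next_best
  xgboost(data = X, label = y, nrounds = 50,
          objective = "reg:squarederror", verbose = 0)
}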

States, actions, and rewards in FF drafts. The state of a roster at some point in the draft, s_t, is the current set of players in each position. Actions, a_t, are the possible player selections available on that turn. The problem would be terribly sparse if there were one value of a_t for every player in the league on every possible turn. Instead, we can represent the possible actions, a_t, with features in Q(s,a) that generalize across players in a given position. Again, just to start, I used some basic player descriptors, including position, preseason point forecasts, bootstrapped confidence intervals of the preseason forecasts, VBD reward with various baselines, and counts of players already on the roster.

My goal for this model is to obtain the greatest total points from all starters on the roster, so the form of the reward is slightly different from VBD's. Whereas VBD looks at a player's individual points, I am interested in whether a player would improve the overall roster score, so here the immediate reward is r_t=\text{Pts}(s_t)-\text{Pts}(s_{t-1}), where \text{Pts}(s_t) returns the total points (realized by end of season, or projected preseason) that we would get by selecting the top starters already on the roster. So, for example, if the WR drafted fourth ended up with more season points than the WRs drafted earlier, the reward would be the difference in total starter value at the end of the season.
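A minimal sketch of this reward, assuming (for illustration) that a roster is a data.frame with pos and pts columns and the starter slots from the league table in the previous post:

# total points from the best legal set of starters on a roster
starter_points <- function(roster, slots = c(QB = 1, RB = 2, WR = 2, TE = 1)) {
  roster <- roster[order(-roster$pts), ]                   # best players first
  starters <- integer(0)
  for (p in names(slots)) {                                # fill fixed slots
    starters <- c(starters, which(roster$pos == p)[seq_len(slots[p])])
  }
  starters <- starters[!is.na(starters)]
  flex <- setdiff(which(roster$pos %in% c("RB", "WR", "TE")), starters)[1]
  sum(roster$pts[c(starters, flex)], na.rm = TRUE)         # plus best flex
}

# r_t: the change in best-starter points from adding the drafted player
reward <- function(s_t, s_prev) starter_points(s_t) - starter_points(s_prev)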

For simulating drafts, I used an \epsilon-greedy policy. Greedy selections were made with probability 1-\epsilon by ranking players in each position according to Q(s,a) and selecting the player with the overall highest value. Exploration selections were made with probability \epsilon by ranking players in each position according to Q(s,a) and selecting the top player from a random position. To log experience for replays, I logged just the top players in each position on each draft turn. I simulated drafts in leagues where all opponents were either typical humans or precisely followed VBD recommendations.
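The selection step itself is tiny; a sketch, assuming q_values is a named vector holding Q(s,a) for the top available player in each position:

# epsilon-greedy pick over the top player in each position
pick_position <- function(q_values, eps = 0.1) {
  if (runif(1) < eps) {
    sample(names(q_values), 1)   # explore: a random position's top player
  } else {
    names(which.max(q_values))   # exploit: the highest-valued position
  }
}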

The R scripts to learn and evaluate the Q-value functions are here. I used the same three seasons as before (2014-2016) and evaluated performance relative to human and VBD drafters.

Results so far.

With a slight amount of tuning, training converges pretty quickly, and it is straightforward to find a Q-value function that outperforms VBD in a league of humans on data from the same year. At the same time, it is easy to fail miserably at generalizing across years, and in some cases I would probably need to invest a bit more TLC just to beat plain old VBD.

Learning Curves

I first learned a Q function for each season, alternating drafts against all-human or all-VBD leagues and randomly sampling draft positions. To do this I used a fixed \epsilon-greedy exploration policy with xgboost as the value model. I evaluated performance as the rank of the best starting roster selected using the actual points collected by each player that season. The learning curves below show the rolling median and the 0.25-0.75 interval. In general, performance seems to converge after the first few hundred iterations.

[Figure: learning curves by season (rolling median, 0.25-0.75 interval)]

Comparison with VBD

To see if this actually managed to improve over the VBD baseline from the previous post, I looked at how rosters selected using the learned values would rank in a human league compared with how VBD would rank. Using the logged experience from the xgboost exploration above, I retrained the value function using either glmnet or xgboost as the model. At first glance this seems fine to do, since we're taking an off-policy learning approach anyway, but I can revisit this. To look at the ability to generalize, I either trained only on the same year that the model would be evaluated on, or trained using the other two years.

xgboost always wins within the same year, but across years it only wins in 2016, while VBD is better in the other two years. Since the action representations include the same contrasts used by VBD, this suggests to me that the weights learned from just two years don't always generalize. It would be interesting to rinse and repeat over a 20 to 30 year stretch if I could find the data somewhere.

[Figure: within- and across-year ranks vs. VBD]

Anyway, that's what I've got for a first take on this.

Here’s a random set of some other things that might eventually be worth trying.

  • better preseason predictions / more years of historical data
  • other league combinations
    • more adversary combinations
    • human drafters with constraints
      • qb first
      • 2 rb first
  • other notions of reward
  • bells and whistles
    • online updates
    • other exploration schemes
    • additional features, encodings of draft position, or use a NN

How easy is it to moneyball a fantasy football league draft?

The debate

I have this ongoing debate with some friends, which also seems to be unresolved for people who have looked at it for much longer than we have. The question is: can model-based decision-making reliably outperform humans in a fantasy football draft?

On the one hand, there's a cottage industry of people making forecasts and developing strategies for ordering draft picks. For player values, the usual suspects like ESPN, CBS, NFL, Yahoo Sports, etc., produce predictions of expected player stats for the season; and there are also groups like fantasyfootballanalytics.net, for example, who produce their own ensembled predictions, all with convenient R libraries and years of trial-and-error tuning to get things working smoothly. For drafting the players, there's a widely referenced value based drafting (VBD) strategy, first described in Fantasy Forecast Magazine in 2001 and then updated in this write-up by the footballguys in 2005. There's also this surprisingly frequently referenced comparison of bidding policies from 2012, but I won't get there in this post because the scope of our question is limited to snake drafts, where pick order matters but all players cost the same.

On the other hand, player value predictions aren't perfect. For example, this fivethirtyeight article points out that ESPN's preseason fantasy projections for top running backs overpredict performance, such that the top 12 players generally end up ranked 2-3x lower over the actual season. There are also criticisms of VBD, as for example in this post on ESPN, saying the resulting teams "smell funny." And then there are posts like this one on reddit, which managed to identify a combination of 9 reasonable players that underperformed by 80 points on some random week, underlining how noisy player performance forecasts can be, albeit without saying how many combinations were considered to find this one bad team, or how many teams would have instead overperformed by a similar margin.

So to try to settle the debate, I tested whether a vanilla VBD drafting policy could reliably produce rosters that come out on top in a league of typical fantasy football players. To do this, I pulled average player draft positions across thousands of fantasy leagues collected on myfantasyleague.com and used them as input parameters to simulate leagues of 11 human drafters competing with 1 VBD drafter. I measured performance by evaluating the actual fantasy points collected by the best set of starters on each team and then finding the rank of the VBD drafter.

TL;DR

In general, VBD gives a slight edge, but it could be an occasional slam-dunk when players drafted early by humans turn out to be lemons.

Results

I simulated drafts from three seasons (2014-2016), with the VBD drafter placed in each of the 12 positions. Across all leagues, VBD rosters had a mean rank of 4.68 when evaluated on fantasy points observed at the end of the season. This is significantly better than the rank of 6 we would expect if VBD drafters were indistinguishable from typical human drafters (chi-squared p-value = 2.2e-16). Unfortunately, it is also substantially worse than how well we would have expected VBD rosters to rank based on preseason projections, for which the mean rank is 2.05. So VBD does tip the scales slightly, but it isn't exactly a silver bullet.

Interestingly, there was a fair amount of variability in how VBD performed across the three years. Both expected and observed ranks were noticeably worse (numerically higher) in 2015 than in 2014 and 2016.

[Figure: CDF of VBD roster ranks by year]

I'm not sure how to explain this yearly trend. The projections were downloaded from fantasyfootballanalytics.net separately for the three years, so maybe there is an issue with the data. Looking at the quality of the preseason forecasts, though, the Pearson correlation and the mean absolute error (MAE) in 2015 were not dramatically different from the other years. So maybe forecast quality should be evaluated differently, or there might be a better explanation.

Position  Season  Pearson correlation  Mean Absolute Error
QB        2014    0.7124               60.4863
QB        2015    0.5562               68.4075
QB        2016    0.6117               62.9405
RB        2014    0.4459               56.5074
RB        2015    0.5165               52.4999
RB        2016    0.5724               53.1948
TE        2014    0.4884               43.3151
TE        2015    0.6593               32.0161
TE        2016    0.6001               35.3129
WR        2014    0.6390               45.4527
WR        2015    0.6362               48.7875
WR        2016    0.5236               48.1142

There was also a relationship with VBD draft position, with better performance for turns in the middle of the rotation. What stood out in 2015 is that the first two draft positions resulted in some of the worst outcomes, while late picks didn't hurt performance like they did in other years. But that doesn't exactly pin down what differentiated that year, since early picks also performed poorly in 2014.

[Figure: observed VBD rank by draft position and year]

Another explanation might be that in 2015 there were some obvious top picks in the first two rounds that ended up bombing. To get a sense of whether this is the case, we can look at the correlation between (expected - observed) points and draft position for each year. A correlation of 0 would indicate no trend in underperformance across draft positions. A negative correlation would indicate that players drafted early underperformed to a greater extent than players drafted later.

year  correlation
2014  -0.236
2015  -0.318
2016  -0.258

Turns out that this was indeed the case. Players drafted early in real fantasy football leagues underperformed more in 2015 than in the other years. The biggest offender that year was Le'Veon Bell, who injured his knee in November. In all, 8 of the first 24 players drafted (the first 2 rounds) underperformed by more than 100 points, compared to 2 and 4 players in the other two years.

Ok, so I wouldn’t exactly say the case is closed here, but at this point, I’d like to submit this as a first argument in favor of model-based drafting being superior.

And here are the deets, in case anyone wants to know more

League overview. I simulated a standard snake draft in a league with 12 teams. One team drafts players with a VBD-like policy, the remaining 11 have a simulated baseline policy. In all, I simulated 100 drafts for each year and each possible VBD draft position.

Evaluation. After the drafts, I identified the best set of starters on each team using the actually observed fantasy points for each player, and then compared how the VBD rosters rank in each league. In other words, assuming the teams are fixed after the draft and the single best roster is selected by some oracle, how well does a VBD roster perform relative to baseline rosters? Another caveat is that I'm not considering defense, special teams, or kickers, because people don't generally draft them meaningfully early, and their performance doesn't correlate well year to year.

Simulated human drafter. On each baseline draft turn, for every undrafted player, I sample from a Poisson distribution with lambda set to that player's mean draft position, and then take the player with the lowest sampled value, resolving ties by choosing the player with the lower mean draft position. Backup players are not picked until all starters are selected (a sketch of a single pick follows the table below).

position         num starters in position  max players in position
QB               1                         3
WR               2                         5
RB               2                         5
TE               1                         3
FLEX (WR/TE/RB)  1
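As referenced above, here's a sketch of a single baseline pick, assuming adp is a numeric vector of mean draft positions for the players still draftable under the roster constraints:

# simulated human pick: noisy draft positions, lowest draw wins
human_pick <- function(adp) {
  draws <- rpois(length(adp), lambda = adp)   # sample a draft position per player
  cand  <- which(draws == min(draws))         # players with the lowest draw
  cand[which.min(adp[cand])]                  # tie-break on lower mean ADP
}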

VBD policy. Here I'm just going to follow a vanilla version of what's in the footballguys' VBD description. On each turn, a player is drafted by taking the top player in the position with the greatest discounted incremental value, given by:

\text{argmax}_{pos}\ ({value}_{pos,\, rank=1} - {value}_{pos,\, rank=baseline}) \cdot \text{discount}

where \text{discount} = 1 - (n_{pos\ already\ on\ team} + 1 - n_{starters\ in\ pos}) \cdot 0.2 and {value}_{pos,\, rank=baseline} is the points value of the top player in the position after the first 100 picks.

There is, however, VBD rule 7: "Know When to Deviate from VBD Principles"; in practice, strictly following the formula may leave a team without a quarterback, for example. So to be fair, we'll also apply the same policy of not drafting backup players until there is a starter in every position.
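Putting the pieces together, a sketch of a single VBD pick; avail is a hypothetical data.frame of draftable players with columns pos, proj (projected points), and baseline (projected points of the top same-position player after the first 100 picks). Capping the discount at 1 is my assumption, since the formula above can otherwise exceed 1 before the starter slots are filled.

# one VBD pick: discounted incremental value over the position baseline
vbd_pick <- function(avail, n_on_team, n_starters) {
  discount <- 1 - (n_on_team[avail$pos] + 1 - n_starters[avail$pos]) * 0.2
  discount <- pmin(discount, 1)   # assumption: the discount only penalizes
  value <- (avail$proj - avail$baseline) * discount
  avail[which.max(value), ]       # take the top player at the best position
}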

The data and le code. I downloaded the yearly fantasy projections from the projections tool here, the player rankings from myfantasyleague.com, and actual player performance from the NFL using nflscrapR. After a bit of munging, I joined the three years from each source into the final dataset here. The draft simulation and results were obtained using this script.

He grins like a Cheshire cat; said of anyone who shows his teeth and gums in laughing

So you know the Cheshire cat, yeah? Ever wonder what its point is? Like, it turns up in a number of places, but does it foreshadow or emphasize plot turns? Might it signify coming improvements or declines?

In my last post I confirmed that plot arcs from smoothed word sentiments seem to provide sensible representations of plot progressions. Here, I'd like to try answering the questions about the Cheshire cat using its appearances within those narrative arcs.

First, in which texts does a Cheshire cat turn up?

*cracks knuckles*

After a bit of tinkering and a bit of a wait, I was able to pull the Proj. Gutenberg index, parse out the book ids, construct urls for the text files, and obtain complete versions of the 136 texts containing the phrase "Cheshire cat" or "Cheshire puss". Code is here in case you feel like tinkering on something similar.

Now to find the plot twists. This is familiar territory for anyone used to looking at fluctuations in stock prices or detecting chemical signals in measurements from NMR/MS/etc., and indeed, there's a variety of R packages that already do this off the shelf. For no reason in particular, I went with quantmod this time around.
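For reference, quantmod's peak and trough finders are about all that's needed here; a sketch, assuming smoothed is the lowess-smoothed sentiment series from the earlier post:

library(quantmod)

# findPeaks/findValleys flag the index just after a turn, hence the -1
peaks   <- findPeaks(smoothed$y) - 1
troughs <- findValleys(smoothed$y) - 1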

So can we actually detect the plot turns and confirm that the cat appears somewhere meaningful in "Alice's Adventures in Wonderland", which first popularized the character? I loaded the sentences and assigned sentiments as before. Then I identified Cheshire cat appearances by looking for sentences containing any of "Cheshire cat", " Cat ", or "Cheshire puss", which should cover most true appearances and omit the broader discussion of cats at the beginning. Again, the code is here.

alicesAdventures

In this case, the Cheshire cat first turns up just before the third trough in the plot progression, as Alice enters the Duchess' house and finds the cat grinning on the hearth. Things deteriorate a bit as the Duchess asks the cook to butcher Alice, but the sentiment recovers when the focus shifts to abusive parenting. The passage is below; the turning point is in bold.

======================================================

"I didn't know that Cheshire cats always grinned; in fact, I didn't know
that cats COULD grin."

"They all can," said the Duchess; "and most of 'em do."

"I don't know of any that do," Alice said very politely, feeling quite
pleased to have got into a conversation.

"You don't know much," said the Duchess; "and that's a fact."

Alice did not at all like the tone of this remark, and thought it would
be as well to introduce some other subject of conversation. While she
was trying to fix on one, the cook took the cauldron of soup off the
fire, and at once set to work throwing everything within her reach at
the Duchess and the baby–the fire-irons came first; then followed a
shower of saucepans, plates, and dishes. The Duchess took no notice of
them even when they hit her; and the baby was howling so much already,
that it was quite impossible to say whether the blows hurt it or not.

"Oh, PLEASE mind what you're doing!" cried Alice, jumping up and down in
an agony of terror. "Oh, there goes his PRECIOUS nose"; as an unusually
large saucepan flew close by it, and very nearly carried it off.

"If everybody minded their own business," the Duchess said in a hoarse
growl, "the world would go round a deal faster than it does."

"Which would NOT be an advantage," said Alice, who felt very glad to get
an opportunity of showing off a little of her knowledge. "Just think of
what work it would make with the day and night! You see the earth takes
twenty-four hours to turn round on its axis–"

"Talking of axes," said the Duchess, "chop off her head!"

Alice glanced rather anxiously at the cook, to see if she meant to take
the hint; but the cook was busily stirring the soup, and seemed not to
be listening, so she went on again: "Twenty-four hours, I THINK; or is
it twelve? I–"

"Oh, don't bother ME," said the Duchess; "I never could abide figures!"
And with that she began nursing her child again, singing a sort of
lullaby to it as she did so, and giving it a violent shake at the end of
every line:

"Speak roughly to your little boy,
And beat him when he sneezes:
He only does it to annoy,
Because he knows it teases."

CHORUS.

======================================================

Maybe the whole situation could've been avoided if Alice didn't feel the need to recover from not knowing that Cheshire cats grin? Who knows? But as long as we can agree that, in terms of sentiment, being raised by an abusive noble is just a first-world problem relative to having a murder request to your name, then things again mostly seem to check out.

Or maybe I'm just cherry-picking? After all, not every Cheshire cat appearance is near a plot turn. But neither is it reasonable to expect a random recurring character to turn up only at meaningful plot points. So for now, I'll just focus on the one cat appearance closest to a critical point in each text and distinguish among the types of critical points that are closest. As a baseline, we can use the distribution of nearest critical-point types for all sentences that didn't mention the Cheshire cat in the texts that contained it. A sketch of the classification and the resulting counts are below.
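For concreteness, here's a sketch of the nearest-critical-point classification, with hypothetical inputs: sent_idx is a sentence index, and peaks/troughs are the index vectors from the quantmod step above.

# label a sentence by its nearest critical point: peak/trough, before/after
nearest_critical <- function(sent_idx, peaks, troughs) {
  crit <- c(peaks, troughs)
  type <- rep(c("peak", "trough"), c(length(peaks), length(troughs)))
  k <- which.min(abs(crit - sent_idx))
  pos <- if (crit[k] >= sent_idx) "after" else "before"   # point relative to sentence
  paste(pos, type[k])
}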

Critical point position  Critical point type  No cat (freq.)  Cat (freq.)
after                    peak                 171316 (0.221)  28 (0.201)
after                    trough               224358 (0.299)  50 (0.368)
before                   peak                 148931 (0.192)  24 (0.176)
before                   trough               232068 (0.289)  34 (0.250)

There seems to be some enrichment of cat appearances just before plot troughs (the "after trough" row, where the trough appears after the cat). We can ask whether these counts are greater than what we would expect by chance in a variety of ways. I should probably put together a blooper reel of things I tried that didn't work in a future post, but for now I'll stick with one simple thing that did.

To avoid the assumptions of a chi-squared test, I drew 10,000 samples from a multinomial distribution with parameters given by the frequencies in the "no cat" baseline column, with each sample containing the same total count as the cat column. The probabilities of observing at least as many cat placements as the number we did observe are below.
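In code, the test is a few lines with rmultinom; a sketch using the four basic cells from the table above (the combined "Both" rows are tallied the same way):

set.seed(42)
base_freq <- c(0.221, 0.299, 0.192, 0.289)   # "no cat" baseline frequencies
n_cat <- 28 + 50 + 24 + 34                   # total cat placements observed
draws <- rmultinom(10000, size = n_cat, prob = base_freq)

# P(at least as many "after trough" placements by chance as the 50 observed)
mean(draws[2, ] >= 50)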

Critical point position  Critical point type  Prob
After                    Both                 0.0523
Both                     Trough               0.2155
After                    Trough               0.0167
Before                   Trough               0.8753
After                    Peak                 0.6127
Before                   Peak                 0.6240

Sure enough, it's pretty unlikely that we would have observed this many Cheshire cat appearances just before plot troughs purely by chance.

And so long story short, Cheshire cats do seem to have a preference for turning up just as things start to get better.

Cinderella science

For my first act, two topics related to sequential data that should come up if you search Google for "Cinderella science", which you might do if you're ever looking for stories of uncredited hard work that ultimately worked out (if only with the help of a fairy godmother).

One is just a reference to something I feel more people should know about, and it's the source of the above photo. The photo is of Charles Keeling, the scientist credited with setting up CO2 monitoring at Mauna Loa, Hawaii, which he did starting in the late 1950's, before it was cool. I found his story in this commentary in Nature about the ups and downs of the thankless work involved in getting everything going. Understandably, it was a hard sell initially, as it doesn't test a sexy falsifiable hypothesis. But in the end, it was the years of data from Keeling's initial work that provided the evidence for the significant gradual changes in our atmospheric composition. Kind of makes me wonder whether we really need to have the same conversation again in the context of motivating exploratory space research.

For the uninitiated, the resulting time series is gorgeous.

[Figure: the Keeling curve of atmospheric CO2 at Mauna Loa]

The second topic I wanted to mention probably doesn't come up in the top Google search results, but I think it ought to. There's this paper on the analysis of plot progressions in Project Gutenberg's fiction collection claiming that there are just 6 fundamental plot types. Seemingly everywhere the paper is mentioned (The Atlantic, MIT Technology Review, New York Magazine, etc.), it comes with this snippet from a Kurt Vonnegut lecture, where Vonnegut anticipated similar results decades earlier. But what's a bit unsatisfying for me is that no one really validates the plot arcs that Vonnegut didn't even need a model to recognize.

I set out to reproduce the last of the arcs that Vonnegut describes – the Cinderella narrative (top right in the four plots below).

[Figure: Vonnegut's story shapes; the Cinderella arc is at top right]

So what does it take to distill this expected profile from the text? I started with Matthew Jockers' "syuzhet" R package (incidentally, his blog also links the Vonnegut lecture). The package provides pretty much everything we need: functions to parse a text into sentences, parse those sentences into word tokens, map tokens to sentiment values from various sources, then pool (read: average/max) the sentiments at the sentence level, and finally obtain a smoothed narrative profile. After downloading "CINDERELLA; Or, THE LITTLE GLASS SLIPPER." from Proj. Gutenberg and copy-pasting a few lines from the syuzhet vignette, I can confirm that, without breaking a sweat, things for the most part check out.

library(syuzhet)

# load the full text and split it into sentences
text <- get_text_as_string("./cinderella.txt")
s_v <- get_sentences(text)

# assign a sentiment score to each sentence
s_v_sentiment <- get_sentiment(s_v)

# rank-transform the scores to their quantiles to tame outliers
s_v_sentiment_quantiles <- sapply(s_v_sentiment,
                                  function(i) sum(s_v_sentiment < i) / length(s_v_sentiment))

# smooth the sentence-level series into a narrative arc
smoothed <- lowess(s_v_sentiment_quantiles, f = .25)

# center on the median and scale by the range of the smoothed values
smoothed$y_scaled <- (smoothed$y - median(smoothed$y)) / diff(range(smoothed$y))

# plot valence against normalized narrative time
# (I() keeps the division from being read as formula syntax)
plot(smoothed$y_scaled ~ I(smoothed$x / max(smoothed$x)),
     type = 'l', lwd = 2, col = 'purple',
     xlab = 'Narrative Time',
     ylab = 'Emotional Valence')
abline(h = 0, lty = 2)

Tada, stamp collected!!

[Figure: smoothed Cinderella sentiment arc]

Kind of surprising that assigning sentiments to individual words from a manually curated dataset can produce these results, but I guess sometimes smoothing sequential data can work magic.

What might be the global warming equivalent in the world of these narrative arcs?
