Sabermetric Research
Phil Birnbaum
Tuesday, July 29, 2014
Are CEOs overpaid or underpaid?
Corporate executives make a lot of money. Are they worth it? Are higher-paid CEOs actually better than their lower-paid counterparts?
Business Week magazine says no, they're not, and they have evidence to prove it. They took 200 highly-paid CEOs, and ran a regression to predict each company's stock performance from its chief executive's pay. The plot looks highly random, with an r-squared of 0.01. Here's a stolen copy:
The magazine says,
"The comparison makes it look as if there is zero relationship between pay and performance ... The trend line shows that a CEO’s income ranking is only 1 percent based on the company’s stock return. That means that 99 percent of the ranking has nothing to do with performance at all. ...
"If 'pay for performance' was really a factor in compensating this group of CEOs, we’d see compensation and stock performance moving in tandem. The points on the chart would be arranged in a straight, diagonal line."
I think there are several reasons why that might not be right.
First, you can't go by the apparent size of the r-squared. There are a lot of factors involved in stock performance, and it's actually not unreasonable to think that the CEO would only be 1 percent of the total picture.
Second, an r-squared of 0.01 implies a correlation of 0.1. That's actually quite large. I bet if you ran a correlation of baseball salaries to one-week team performance, the r-squared would probably be just as small -- but that wouldn't mean players aren't paid by performance. As I've written before, you have to look at the regression equation, because even the smallest correlation could imply a large effect.
Third, the study appears to be based on a dataset created by Equilar, a consulting firm that advises on executive pay. But Equilar's study was limited to the 200 best-paid CEOs, and that artificially reduces the correlation.
If you take only the 30 best-paid baseball players, and look at this year's performance on the field, the correlation will be only moderate. But if you add in the rest of the players, and minor-leaguers too, the correlation will be much higher.
(If you don't believe me: find any scatterplot that shows a strong correlation. Take a piece of paper and cover up the leftmost 90% of the datapoints. The 10% that remain will look much more random.)
Fourth, the observed correlation is reasonably close to statistical significance, at p=0.08 (one-tailed -- calculate it here). That could be just random chance, but, on its face, 0.08 does suggest there's a decent chance something real is going on. On the other hand, the result probably comes out "too" significant, because the 200 datapoints aren't really independent. It could be the case, for instance, that CEOs tend to get paid more in the oil industry, and, coincidentally, oil stocks happen to have done well recently.
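If you want to check the significance calculation yourself, here's a quick Python sketch. It assumes the 200 datapoints are independent, which, as I said, they probably aren't:

```python
# Quick check: what one-tailed p-value does r = 0.1 with n = 200 imply?
from math import sqrt
from scipy import stats

r, n = 0.10, 200
t = r * sqrt(n - 2) / sqrt(1 - r**2)    # t-statistic for a correlation coefficient
p_one_tailed = stats.t.sf(t, df=n - 2)  # one-tailed p-value

print(round(t, 2), round(p_one_tailed, 2))  # roughly 1.41 and 0.08
```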
-----
BTW, I don't think there's a full article accompanying the Business Week chart; I think what's in that link is all we get. Which is annoying, because it doesn't tell us how the 200 CEOs were chosen, or what years' stock performance was looked at. I'm not even sure that the salaries were negotiated in advance. If they weren't, of course, the result is meaningless, because it could just be that successful companies rewarded their executives after the fact.
Furthermore, the chart doesn't match the text. The reporters say they got an r-squared of 0.01. I measured the slope of the regression line in the chart, by counting pixels, and it appears to be around 0.06. But an r of 0.06 implies an r-squared of 0.0036, which is far short of the 0.01 figure. Maybe the authors rounded up, for effect?
It could be that my pixel count was off. If you raise the slope from 0.06 to 0.071, you now get an r-squared of 0.0051, which does round to 0.01. So, for purposes of this post, I'm going to assume the r is actually 0.07.
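Here's the quick Python check of that pixel arithmetic. Since both axes are rankings from 1 to 200, the two variables have the same SD, and the slope of the regression line is just r:

```python
# Which slope gives an r-squared that rounds to 0.01?
for slope in (0.06, 0.071):
    print(slope, round(slope ** 2, 4))
# 0.06  -> r-squared of 0.0036, which rounds to 0.00
# 0.071 -> r-squared of about 0.005, which just barely rounds up to 0.01
```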
-----
A correlation of 0.07 means that, to predict a company's performance ranking, you have to regress its CEO pay ranking 93% towards the mean. (This works out because the X and Y variables have the same SD, both consisting of numbers from 1 to 200.)
In other words, 7 percent of the differences are real. That doesn't sound like much, but it's actually pretty big.
Suppose you're the 20th ranked CEO in salary. What does that say about your company's likely performance? It means you have to regress it 93% of the way back to 100.5. That takes you to 95th.
So, CEOs that get paid 20th out of 200 improve their company's stock price by 5 rankings more than CEOs who get paid 100.5th out of 200.
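If you want the regression-to-the-mean step spelled out, here's a minimal Python sketch, using the assumed r of 0.07 and rankings from 1 to 200:

```python
# Regression to the mean with rankings from 1 to 200 (mean ranking 100.5)
# and an assumed correlation of r = 0.07.
R, MEAN_RANK = 0.07, 100.5

def predicted_performance_rank(pay_rank):
    """Regress the pay ranking 93% of the way back toward the mean."""
    return MEAN_RANK + R * (pay_rank - MEAN_RANK)

print(predicted_performance_rank(20))     # ~94.9, i.e. about 95th
print(predicted_performance_rank(100.5))  # 100.5 -- the average-paid CEO stays average
# Difference: about 5 ranking spots.
```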
How big is five rankings?
I found a website that allowed me to rank all the stocks in the S&P 500 by one-year return. (They use one year back from today, so, your numbers may be different by the time you try it. Click on the heading "1-Year Percent.")
The top stock, Micron Technology, gained 151.47%. The bottom stock, Avon, lost 42.80%.
The difference between #1 and #500 is 194.27 percentage points. Divide that by 499, and the average one-spot-in-the-rankings difference is 0.39 percentage points.
Micron is actually a big outlier -- it's about 33 points higher than #2 (Facebook), and 52 points higher than #5 (Under Armour). So, I'm going to arbitrarily reduce the difference from 0.39 to 0.3, just to be conservative.
On that basis, five rankings is the equivalent of 1.5 percentage points in performance.
How much money is that, in real-life terms, for a stock to overperform by 1.5 points?
On the S&P 500, the average company has a market capitalization (that is, the total value of all outstanding stock) of 28ドル billion (.pdf). For the average company, then, 1.5 points works out to 420ドル million in added value.
If you want to use the median rather than the mean, it's 13ドル.4 billion and 200ドル million, respectively.
Either way, it's a lot more than the difference in CEO compensation.
From the Business Week chart, the top CEO made about 142ドル million. The 200th CEO made around 12ドル.5 million. The difference is 130ドル million over 199 rankings, or 650ドルK per ranking. (The top four CEOs are outliers. If you remove them, the spread drops by half. But I'll leave them in anyway.)
Moving up those 80 pay rankings in our hypothetical example corresponds to only a 52ドル million raise -- much less than the apparent value added:
Pay difference: 52ドル million
--------------------------------
Median value added: 200ドル million
Mean value added: 420ドル million
Moreover ... the value of a good CEO is much higher, obviously, for a bigger company. The ten biggest companies on the S&P 500 have a market cap of at least 200ドル billion each. For a company of that size, the equivalently "good" CEO -- the one paid 20th out of 200 -- is worth three billion dollars. That's *60 times* the average executive salary.
Assuming my arithmetic is OK, and I didn't drop a zero somewhere.
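For anyone who wants to check, here's the arithmetic as a quick Python sketch, using the round numbers quoted above (not the underlying Equilar data, which I don't have):

```python
# Re-running the post's arithmetic with its own round numbers.
ranking_gain = 5                 # performance spots gained by the 20th-best-paid CEO
pts_per_rank = 0.3               # percentage points of return per ranking spot (conservative)
performance_gain = ranking_gain * pts_per_rank / 100   # 1.5 points, as a fraction

mean_cap, median_cap = 28e9, 13.4e9   # S&P 500 market caps, mean and median
print(performance_gain * mean_cap)    # ~420ドル million of added value (mean company)
print(performance_gain * median_cap)  # ~200ドル million (median company)

pay_per_rank = (142e6 - 12.5e6) / 199   # ~650ドルK of pay per ranking spot
print(80 * pay_per_rank)                # ~52ドル million raise for moving up 80 pay rankings

print(performance_gain * 200e9)         # ~3ドル billion of value for a 200ドル-billion company
```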
-----
So, I think the Business Week regression shows the opposite of what they believe it shows. Taking the data at face value, you'd have to conclude that executives are underpaid according to their talent, not overpaid.
I'm not willing to go that far. There's a lot of randomness involved, and, as I suggested before, other possible explanations for the positive correlation. But, if you DO want to take the chart as evidence of anything, it's evidence that there is, indeed, a substantial connection between pay and performance. The r-squared of less than 0.01 only *looks* tiny.
-----
Although I think this is weak evidence that CEOs *do* make a difference that's bigger than their salary, the numbers certainly suggest that they *can* make that big an impact.
Suppose you own shares of Apple, and they're looking for a new CEO. A "superstar" candidate comes along. He wants twice as much money as normal. As a shareholder, do you want the company to pay it?
It depends what you expect his (or her) production to be. What kind of difference do you think a good CEO will make in the company's performance?
Suppose that, next year, you think Apple will earn 6ドル.50 a share with a "replacement level" CEO. How much more do you expect with the superstar CEO?
If you think he or she can make a 1% difference, that's an extra 6.5 cents per share. That might be too high. How about one cent a share, from 6ドル.51 instead of 6ドル.50? Does that seem reasonable?
Apple trades at around 15 times annual earnings. So, one cent in earnings means about 15 cents on the stock price. With six billion Apple shares outstanding, 15 cents a share gives the superstar CEO a "value above replacement" of 900ドル million.
So, for a company as big as Apple, if you *do* think a CEO can make a 1-part-in-650 difference in earnings, even the top CEO salary of 142ドル million looks cheap.
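Here's that back-of-the-envelope calculation in Python, using the post's round numbers (the real figures will differ a bit):

```python
# The Apple thought experiment, using round numbers.
extra_eps = 0.01      # one extra cent of earnings per share (1 part in 650)
pe_ratio = 15         # Apple's approximate price/earnings multiple
shares = 6e9          # approximate shares outstanding

value_above_replacement = extra_eps * pe_ratio * shares
print(value_above_replacement)   # 900ドル million -- versus a top CEO salary of 142ドル million
```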
Apple has the largest market cap of all 500 companies in the index, at about 15 times the average, so it's perhaps a special case. But it shows that CEOs can certainly create, or destroy, a lot more value than their salaries.
-----
So can you conclude that corporate executives are underpaid? Not unless you can provide good evidence that a particular CEO really is that much better than the alternatives.
There's a lot of luck involved in how a company's business goes -- it depends on the CEO's decisions, sure, but also on the overall market, and the actions of competitors, and advances in technology in general, and world events, and Fed policy, and random fads, and a million other things. It's probably very hard to figure out the best CEOs, even based on a whole career. I bet it's as hard as, say, figuring out baseball's best hitters based on only a week's worth of AB.
Or maybe not. Steve Jobs was forced out of Apple, then, famously, returned to the struggling company years later to mastermind the iPod, iPhone, and iPad. Apple is now worth around 100 times as much as it was before Jobs came back. That's an increase in value of somewhere around 500ドル billion. It was maybe closer to 300ドル billion at the time of Jobs' death in 2011.
How much of that is due to Jobs' actual "talent" as CEO? Was he just lucky that his ethos of "insanely great" wound up leading to the iPhone? Maybe Jobs just happened to win the lottery, in that he had the right engineers and creative people to create exactly the right product for the right time?
It's obvious that Apple created hundreds of billions of dollars worth of value during Jobs' tenure, but I have no idea how much of that is actually due to Jobs himself. Well, I shouldn't say *no* idea. From what I've read and seen, I'd be willing to bet that he's at least, say, 1 percent responsible.
One percent of 300ドル billion is 3ドル billion. Divide that by 14 years, and it's more than 200ドル million per year.
If you give Steve Jobs even just one percent of the credit for Apple's renaissance, he was still worth 50 percent more than today's highest-paid CEO, 300 percent more than today's eighth-highest paid CEO, and 1500 percent more than today's 200th-highest-paid CEO.
Labels: Business Week, CEO, merit, r-squared, regression
posted by Phil Birnbaum @ 7/29/2014 12:07:00 PM 11 comments
Tuesday, July 22, 2014
Did McDonald's get shafted by the Consumer Reports survey?
McDonald's was the biggest loser in Consumer Reports' latest fast food survey, ranking them dead last out of 21 burger chains. CR readers rated McDonald's only 5.8 out of 10 for their burgers, and 71 out of 100 for overall satisfaction. (Ungated results here.)
CR wrote,
"McDonald's own customers ranked its burgers significantly worse than those of [its] competitors."
Yes, that's true. But I think the ratings are a biased measure of what people actually think. I suspect that McDonald's is actually much better loved than the survey says. In fact, the results could even be backwards. It's theoretically possible, and fully consistent with the results, that people actually like McDonald's *best*.
I don't mean because of statistical error -- I mean because of selective sampling.
-----
According to CR's report, 32,405 subscribers reported on 96,208 dining experiences. That's 2.97 restaurants per respondent, which leads me to suspect that they asked readers to report on the three chains they visit most frequently. (I haven't actually seen the questionnaire -- they used to send me one in the mail to fill out, but not any more.)
Limiting respondents to their three most frequented restaurants would, obviously, tend to skew the results upward. If you don't like a certain chain, you probably wouldn't have gone lately, so your rating of "meh, 3 out of 10" wouldn't be included. It's going to be mostly people who like the food who answer the questions.
But McDonald's might be an exception. Because even if you don't like their food that much, you probably still wind up going occasionally:
-- You might be travelling, and McDonald's is all that's open (I once had to eat Mickey D's three nights in a row, because everything else nearby closed at 10 pm).
-- You might be short of time, and there's a McDonald's right in Wal-Mart, so you grab a burger on your way out and eat it in the car.
-- You might be with your kids, and kids tend to love McDonald's.
-- There might be only McDonald's around when you get hungry.
Those "I'm going for reasons other than the food" respondents would depress McDonald's ratings, relative to other chains.
Suppose there are two types of people in America. Half of them rate McDonald's a 9, and Fuddruckers a 5. The other half rate Fuddruckers an 8, but McDonald's a 6.
So, consumers think McDonald's is a 7.5, and Fuddrucker's is a 6.5.
But the people who prefer McDonald's seldom set foot anywhere else -- where there's a Fuddrucker's, the Golden Arches are always not too far away. On the other hand, fans of Fuddrucker's can't find one when they travel. So, they wind up eating at McDonald's a few times a year.
So what happens when you do the survey? McDonald's gets a rating of 7.5 -- the average of 9s from the loyal customers, and 6s from the reluctant ones. Fuddruckers, on the other hand, gets an average of 8 -- since only their fans vote.
That's how, even if people actually like McDonald's more than Fuddrucker's, selective sampling might make McDonald's look worse.
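If it helps, here's a tiny Python sketch of that made-up example, just to show the mechanics of the selective sampling:

```python
# Made-up preferences: half the population rates McDonald's 9 and Fuddruckers 5;
# the other half rates Fuddruckers 8 and McDonald's 6.  Everybody eats at
# McDonald's at least occasionally, but only Fuddruckers fans ever rate Fuddruckers.

true_mcdonalds = (9 + 6) / 2      # what the whole population really thinks: 7.5
true_fuddruckers = (5 + 8) / 2    # 6.5

surveyed_mcdonalds = (9 + 6) / 2  # both groups end up rating McDonald's: 7.5
surveyed_fuddruckers = 8.0        # only the fans rate Fuddruckers: 8.0

print(true_mcdonalds, true_fuddruckers)          # 7.5 vs 6.5: people prefer McDonald's
print(surveyed_mcdonalds, surveyed_fuddruckers)  # 7.5 vs 8.0: the survey says the opposite
```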
------
It seems likely this is actually happening. If you look at the burger chain rankings, it sure does seem like the biggest chains are clustered near the bottom. Of the five chains with the most locations (by my Googling and estimates), all of them rank within the bottom eight of the rankings: Wendy's (burger score 6.8), Sonic (6.7), Burger King (6.6), Jack In The Box (6.6), and McDonald's (5.8).
As far as I can tell, Hardee's is next biggest, with about 2,000 US restaurants. It ranks in the middle of the pack, at 7.5.
Of the ten chains ranked higher than Hardee's, every one has fewer than 1,000 locations. The top two, Habit Burger Grill (8.1) and In-N-Out (8.0), have only 400 restaurants between them. Burgerville, which ranked 7.7, has only 39 stores. (Five Guys (7.9) now has more than 1,000, but the survey covered April, 2012, to June, 2013, when there were fewer.)
The pattern was the same in other categories, where the largest chains were also at or near the bottom. KFC ranked worst for chicken; Subway rated second-worst for sandwiches; and Taco Bell scored worst for Mexican.
And, the clincher, for me at least: the chain with the worst "dining experience," according to the survey, was Sbarro, at 65/100.
What is Sbarro, if not the "I'm stuck at the mall" place to get pizza? Actually, I think there's even a Sbarro at the Ottawa airport -- one of only two fast food places in the departure area. If you get hungry waiting for your flight, it's either them or Tim Hortons.
The Sbarro ratings are probably dominated by customers who didn't have much of a choice.
(Not that I'm saying Sbarro is actually awesome food -- I don't ever expect to hear someone say, unironically, "hey, I feel like Sbarro tonight." I'm just saying they're probably not as bad as their rating suggests.)
------
Another factor: CR asked readers to rate the burgers, specifically. In-N-Out sells only burgers. But McDonald's has many other popular products. You can be a happy McDonald's customer who doesn't like the burgers, but you can't be a happy In-N-Out customer who doesn't like the burgers. Again, that's selective sampling that would skew the results in favor of the burger-only joints.
And don't forget: a lot of people *love* McDonald's french fries. So, their customers might prefer a "C+ burger with A+ fries" to a competitor who's a B- in both categories.
That thinking actually *supports* CR's conclusion that people like McDonald's burgers less ... but, at the same time, it makes the arbitrary ranking-by-burger-only seem a little unfair. It's as if CR rated baseball players by batting average, and ignored power and walks.
For evidence, you can compare CR's two sets of rankings.
In burgers, the bottom eight are clustered from 6.6 to 6.8 -- except McDonald's, a huge outlier at 5.8, as far from second-worst as second-worst is from average.
In overall experience, though, McDonald's makes up the difference completely, perhaps by hitting McNuggets over the fences. It's still last, but now tied with Burger King at 71. And the rest aren't that far away. The next six range from 74 to 76 -- and, for what it's worth, CR says a difference of five points is "not meaningful".
-----
A little while ago, I read an interesting story about people's preferences for pies. I don't remember where I read it so I may not have the details perfect. (If you recognize it, let me know.)
For years, Apple Pie was the biggest selling pie in supermarkets. But that was when only full-size pies were sold, big enough to feed a family. Eventually, one company decided to market individual-size pies. To their surprise, Apple was no longer the most popular -- instead, Blueberry was. In fact, Apple dropped all the way to *fifth*.
What was going on? It turns out that Apple wasn't anyone's most liked pie, but neither was it anyone's least liked pie. In other words, it ranked high as a compromise choice, when you had to make five people happy at once.
I suspect that's what happens with McDonald's. A bus full of tourists isn't going to stop at a specialty place which may be a little weird, or have limited variety. They're going to stop at McDonald's, where everyone knows the food and can find something they like.
McDonald's is kind of the default fast food, everybody's second or third choice.
------
But having said all that ... it *does* look to me as if the ratings are roughly in line with what I consider "quality" in a burger. So I suspect there is some real signal in the results, despite the selective sampling issue.
Except for McDonald's.
Because, first, I don't think there's any way their burgers are *that* much "worse" than, say, Burger King's.
And, second, every argument I've made here applies significantly more to McDonald's than to any of the other chains. They have almost twice as many locations as Burger King, almost three times as many as Wendy's, and almost four times as many as Sonic. Unless you truly can't stand them, you'll probably find yourself at McDonald's at some point, even if you'd much rather be dining somewhere else.
All the big chains probably wind up shortchanged in CR's survey. But McDonald's, I suspect, gets spectacularly screwed.
Labels: Consumer Reports, fast food, McDonald's, selective sampling
posted by Phil Birnbaum @ 7/22/2014 02:43:00 PM 6 comments
Saturday, July 12, 2014
Nate Silver and the 7-1 blowout
Brazil entered last Tuesday's World Cup semifinal match missing two of their best players -- Neymar, who was out with an injury, and Silva, who was sitting out a red-card suspension. Would they still be good enough to beat Germany?
After crunching the numbers, Nate Silver, at FiveThirtyEight, forecasted that Brazil still had a 65 percent chance of winning the match -- that the depleted Brazilians were still better than the Germans. In that prediction, he was taking a stand against the betting markets, which actually had Brazil as underdog -- barely -- at 49 percent.
Then, of course, Germany beat the living sh!t out of Brazil, by a score of 7-1.
"Huh? [Nate Silver] says his prediction 'stunk,' but it was probabilistic. No way to know if it was even wrong."
Exactly correct.
"I love Nate Silver and 538, but this result might be breaking his model. Haven't been super impressed with the predictions."
"To be fair to Nate Silver + 538, their model on the whole was excellent. It's how they dealt with Brazil where I (and others) had problems."
"It's hard to imagine how Silver could have been more wrong."
Update/clarification: I am not trying to defend Nate's methodology against others, and especially not against the Vegas line (which I trust more than Nate's, until there's evidence I shouldn't).
I'm just saying: the 7-1 outcome is NOT, in and of itself, sufficient evidence (or even "good" evidence) that Nate's prediction was wrong.
Labels: 7-1, brazil, forecasting, predictions, soccer
posted by Phil Birnbaum @ 7/12/2014 04:26:00 PM 47 comments
Wednesday, July 09, 2014
"The Cult of Statistical Significance"
"The Cult of Statistical Significance" is a critique of social science's overemphasis on confidence levels and its convention that only statistically-significant results are worthy of acceptance. It's by two academic economists, Stephen Ziliak and Deirdre McCloskey, and my impression is that it made a little bit of a splash when it was released in 2008.
I've had the book for a while now, and I've been meaning to write a review. But, I haven't finished reading it, yet; I started a couple of times, and only got about halfway through. It's a difficult read for me ... it's got a flowery style, and it jumps around a bit too much for my brain, which isn't good at multi-tasking. But a couple of weeks ago, someone on Twitter pointed me to this .pdf -- a short paper by the same authors, summarizing their arguments.
------
Ziliak and McCloskey's thesis is that scientists are too fixated on significance levels, and not enough on the actual size of the effect. To illustrate that, they use an example of two weight-loss pills:
"The first pill, called "Oomph," will shed from Mom an average of 20 pounds. Fantastic! But Oomph is very uncertain in its effects—at [a standard error of] plus or minus 10 pounds. ... Could be ten pounds Mom loses; could be thrice that.
"The other pill you found, pill "Precision," will take 5 pounds off Mom on average but it is very precise—at plus or minus 0.5 pounds. Precision is the same as Oomph in price and side effects but Precision is much more certain in its effects. Great! ...
"Fine. Now which pill do you choose—Oomph or Precision? Which pill is best for Mom, whose goal is to lose weight?"
Ziliak and McCloskey -- I'll call them "ZM" for short -- argue that "Oomph" is the more effectual pill, and therefore the best choice. But, because its effect is not statistically significantly different from zero*, scientists would recommend "Precision". Therefore, the blind insistence on statistical significance costs Mom, and society, a high price in lost health and happiness.
(*In their example, the effect actually *is* statistically significant, at 2 SDs, but the authors modify the example later so it isn't.)
But: that isn't what happens in real life. In actual research, scientists would *observe* 20 pounds plus or minus 10, and try to infer the true effect as best they can. But here, the authors proceed as if we *already know* the true effect on Mom is 20 +/- 10. But if we did already know that, then, *of course* we wouldn't need significance testing!
Why do the authors wind up with their inference going the wrong way? I think it's because they fail to notice the elephant in the room -- the single biggest reason significance testing is necessary in the first place. That elephant is: most pills don't work.
What I suspect is that when the authors see an estimate of 20, plus or minus 10, they think that must be a reasonable, unbiased estimate of the actual effect. They don't consider that most true values are zero, therefore, most observed effects are just random noise, and that the "20 pounds" estimate is likely spurious.
That's the key to the entire issue of why we have to look at statistical significance -- to set a high-enough bar that we don't wind up inundated with false positives.
At best, the authors are setting up an example in which they already assume the answer, then castigating statistical significance for getting it wrong. And, sure, insisting on p < .05 will indeed cause false negatives like this one. But ZM fail to weigh those false negatives against the inevitable false positives that would result if we stopped looking at significance -- if we stopped requiring evidence that an effect exists at all.
-----
In fairness, Ziliak and McCloskey don't say explicitly that they're rejecting the idea that most pills are useless. They might not actually even believe it. They might just be making statistical assumptions that necessarily assume it's true. Specifically:
-- In their example, they assume that, because the "Oomph" study found a mean of 20 pounds and SD of 10 pounds, that's what Mom should expect in real life. But that only follows if every effect has the same probability of occurrence -- which isn't the case, since most true effects are actually zero.
-- They also seem to assume the implication of that, that when you come up with a 95% confidence interval for the size of the effect, there is actually a 95% probability that the effect lies in that range. Again, that only follows if every effect has the same probability of occurrence -- which isn't the case, since most true effects are actually zero.
-- And, I think they assume that if a result comes out with a p-value of .75, it implies a 75% chance that the true effect is greater than zero. Same thing: that only follows if every effect has the same probability of occurrence -- which isn't the case, since most true effects are actually zero.
I can't read minds, and I probably shouldn't assume that's what ZM were actually thinking. But that one single assumption would easily justify their entire line of argument -- if only it were true.
And it certainly *seems* justifiable, to assume that every effect size is equally likely. You can almost hear the argument being made: "Why assume that the drug is most likely useless? Isn't that an assumption without a basis, an unscientific prejudice? We should keep a completely open mind, and just let the data speak."
It sounds right, but it's not. "All effects are equally likely" is just as strong a prejudice as "Zero is most likely." It just *seems* more open-minded because (a) it doesn't have to be said explicitly, (b) it keeps everything equal, which seems less arbitrary, and (c) "don't be prejudiced" seems like a strong precedent, being such an important ethical rule for human relationships.
If you still think "most pills don't work" is an unacceptable assumption ... imagine that instead of "Oomph" being a pill, it was a magic incantation. Are you equally unwilling to accept the prejudice "most incantations don't work"?
If it is indeed true that most pills (and incantations) are useless, ignoring the fact might make you less prejudiced, but it will also make you more wrong.
----
And "more wrong" is something that ZM want to avoid, not tolerate. That's why they're so critical of the .05 rule -- it causes "a loss of jobs, justice, profit, and even life." Reasonably, they say we should evaluate the results not just on significance, but on the expected economic or social gain or loss. When a drug appears to have an effect on cancer that would save 1,000 lives a year ... why throw it away because there's too much noise? Noise doesn't cost lives, while the pill saves them!
Except that ... if you're looking to properly evaluate economic gain -- costs and benefits -- you have to consider the prior.
Suppose that 99 out of 100 experimental pills don't work. Then, when you get a p-value of .05, there's only about a 17 percent chance that the pill has a real effect. Do you want to approve cancer pills when you know five-sixths of them don't do anything?
(Why 5/6? Of the 99 worthless drugs, about 5 of them will show significance just randomly. So you accept 5 spurious effects for each real effect.)
And that 17 percent is when you *do* have p=.05 significance. If you lower your significance threshold, it gets worse. When you have p=.20, say, you get 20 false positives for every real one.
Doing the cost-benefit analysis for Mom's diet pill ... if there's only a 1 in 6 chance that the effect is real, her expectation is a loss of 3.3 pounds, not 20. In that case, she is indeed better off taking "Precision" than "Oomph".
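Here's that arithmetic as a quick Python sketch, under the assumed prior that 99 of 100 experimental pills are worthless, and (to keep it simple) assuming a pill with a real effect always reaches p < .05:

```python
worthless, real = 99, 1

false_positives = worthless * 0.05            # ~5 worthless pills pass at p < .05
p_real_given_significant = real / (real + false_positives)
print(p_real_given_significant)               # ~0.17, about 1 chance in 6

# Mom's expected benefit from "Oomph" if there's only a 1-in-6 chance it works:
print(p_real_given_significant * 20)          # ~3.3 pounds, not 20

# Loosen the threshold to p = .20 and the flood gets worse:
print(worthless * 0.20 / real)                # ~20 false positives per real effect
```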
-----
If you don't read the article or book, here's the one sentence summary: Scientists are too concerned with significance, and not enough with real-life effects. Or, as Ziliak and McCloskey put it,
The "oomph" -- the size of the coefficient -- is the scientific discovery that tells you something about the real world. The "precision" -- the significance level -- tells you only about your evidence and your experiment.
I agree with the authors on this point, except for one thing. Precision is not merely "nice". It's *necessary*.
If you have a family of eight and shop at Costco and need a new vehicle, "Tires are Nice but Cargo Space is the Bomb." That's true -- but the "Bomb" is useless without the "Nice".
Even if you're only concerned with real-world effects, you still need to consider p-values in a world where most hypotheses are false. As critical as I have been about the way significance is used in practice, it's still something that's essential to consider, in some way, in order to filter out false positives, where you mistakenly approve treatments that are no better than sugar pills.
None of that ever figures into the authors' arguments. Failing to note the false positives -- the word "false" doesn't appear anywhere in their essay, never mind "false positive" -- the authors can't figure out why everyone cares about significance so much. The only conclusion they can think of is that scientists must worship precision for its own sake. They write,
"[The] signal to noise ratio of pill Oomph is 2-to-1, and of pill Precision 10-to-1. Precision, we find, gives a much clearer signal—five times clearer.
"All right, then, once more: which pill for Mother? Recall: the pills are identical in every other way. "Well," say our significance testing colleagues, "the pill with the highest signal to noise ratio is Precision. Precision is what scientists want and what the people, such as your mother, need. So, of course, choose Precision."
"But Precision—precision commonly defined as a large t-statistic or small p-value on a coefficient—is obviously the wrong choice. Wrong for Mother's weight-loss plan and wrong for the many other victims of the sizeless scientist. The sizeless scientist decides whether something is important or not—he decides "whether there exists an effect," as he puts it—by looking not at the something's oomph but at how precisely it is estimated. Mom wants to lose weight, not gain precision."
Really? I have much, much less experience with academic studies than the authors, but ... I don't recall ever having seen papers boast about how precise their estimates are, except as evidence that effects are significant and real. I've never seen anything like, "My estimates are 7 SDs from zero, while yours are only 4.5 SDs, so my study wins! Even though yours shows cigarettes cause millions of cancer deaths, and mine shows that eating breakfast makes you marginally happier."
Does that really happen?
-------
Having said that, I agree emphatically with the part of ZM's argument that says scientists need to pay more attention to oomph. I've seen many papers that spend many, many words arguing that an effect exists, but then hardly any examining how big it is or what it means. Ziliak and McCloskey refer to these significance-obsessed authors as "sizeless scientists."
(I love the ZM terminology: "cult," "oomph," "sizeless".)
Indeed, sometimes studies find an effect size that's so totally out of whack that it's almost impossible -- but they don't even notice, so focused are they on significance levels.
I wish I could recall an example ... well, I can make one up, just to give you the flavor of how I vaguely remember the outrageousness. It's like, someone finds a statistically-significant relationship between baseball career length and lifespan, and trumpets how he has statistical significance at the 3 percent level ... but doesn't realize that his coefficient estimates a Hall-of-Famer's lifespan at 180 years.
If it were up to me, every paper would have to show the actual "oomph" of its findings in real-world terms. If you find a link between early-childhood education and future salary, how many days of preschool does it take to add, say, a dollar an hour? If you find a link between exercising and living longer, how many marathons does it take to add a month to your life? If fast food is linked with childhood obesity, how many pounds does a kid gain from each Happy Meal?
And we certainly do also need less talk of precision. My view is that you should spend maybe one paragraph confirming that you have statistical significance. Then, shut up about it and talk about the real world.
If you're publishing in the Journal of Costcological Science, you want to be talking about cargo space, and what the findings mean for those who benefit from Costcology. How many fewer trips to Costco will you make per year? Is it now more efficient to get your friends to buy you gift cards instead of purchasing a membership? Are there safety advantages to little Joey no longer having to make the trip home with an eleven-pound jar of Nutella between his legs?
You don't want to be going on and on about, how, yes, the new vehicle does indeed have four working tires! And, look, I used four different chemical tests to make sure they're actually made out of rubber! And did I mention that when I redo the regression but express the cargo space in metric, the car still tests positive for tires? It did! See, tires are robust with respect to the system of mensuration!
For me, one sentence is enough: "The tire treads are significant, more than 2 mm from zero."
-----
So I agree that you don't need to talk much about the tires. The authors, though, seem to be arguing that the tires themselves don't really matter. They think drivers must just have some kind of weird rubber fetish. Because, if the vehicle has enough cargo space, who cares if the tires are slashed?
You need both. Significance to make sure you're not just looking at randomness, and oomph to tell you what the science actually means.
Labels: cult, oomph, significance, statistics
posted by Phil Birnbaum @ 7/09/2014 01:21:00 PM 4 comments
Wednesday, July 02, 2014
When a null hypothesis makes no sense
In criminal court, you're "innocent until proven guilty." In statistical studies, it's "null hypothesis until proven significant."
The null hypothesis, generally, is the position that what you're looking for isn't actually there. If you're trying to prove that early-childhood education leads to success in adulthood, the default position is "we're going to assume it doesn't until evidence proves otherwise."
Why do we make "no" the null? It's because, most times, there really IS nothing there. Pick a random thing and a random life outcome: shirts, marriage. Is there a relationship between shirt color and how happy a marriage you'll have? Probably not. So "not" becomes the null hypothesis.
Carl Sagan famously said, "extraordinary claims require extraordinary evidence." And, in a world where most things are unrelated, "my drug shrinks tumors" is indeed an extraordinary claim.
The null hypothesis is the one that's the LEAST extraordinary -- the one that's most likely, in some common-sense way. "Randomness caused it until proven otherwise," not "Fairies caused it until proven otherwise."
In studies, authors usually gloss over that, and just use the convention that the null is always "zero". They'll say, "the difference between the treatment and control groups is not statistically-significantly different from zero, so we do not reject the hypothesis that the drug is of no benefit."
-------
But, "zero" isn't always the least extraordinary claim.
I believe that teams up 1-0 in hockey games get overconfident and wind up doing worse than expected. So, I convince Gary Bettman to randomly pick a treatment group of teams, and give them a free goal to start the first game of their season. At the end of the year, I compare goal differential between the treatment and control groups.
Which should my null hypothesis be?
-- The treatment has an effect of 0
-- The treatment has an effect of +1
Obviously, it's the second one. The first one, even though it includes the typical "zero," is, nonetheless, an extraordinary claim: that you give one group a one-goal advantage, but by the end of the year, that advantage has disappeared. Instead of saying "innocent until proven guilty," you're saying, "one goal guilty unless proven otherwise." But that's hidden, because you use the word "zero" instead of "one goal guilty."
If you use 0 instead of +1, you're effectively making your hypothesis the default, by stealth.
(In this case, the null should be +1 ... in real life, the researcher would probably keep the same null, but also transform the model to put the conventional "0" back in. Instead of Y = b(treatment dummy), they'll use the model Y = (b+1)(treatment dummy), so that b=0 now means "no effect other than the obvious extra goal".)
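Here's a minimal Python sketch of what that looks like in practice, with made-up goal-differential data. Subtracting the free goal from the treatment group is the same trick as the (b+1) reparameterization in the parenthetical:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Made-up season goal differentials: control teams centered on 0, treatment
# teams centered on +1 (the free goal), with a lot of noise in both groups.
control = rng.normal(loc=0, scale=25, size=500)
treatment = rng.normal(loc=1, scale=25, size=500)

# Conventional (and here inappropriate) null: the treatment effect is zero.
print(stats.ttest_ind(treatment, control))

# Sensible null: the effect is exactly +1 -- the free goal and nothing more.
print(stats.ttest_ind(treatment - 1, control))
```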
What that shows is: it's not enough that you use "0". You have to make an argument about whether your zero is an appropriate null hypothesis for your model. If you choose the right model, and the peer reviewers don't notice, you can "privilege your hypothesis" by making "zero" represent anything you like.
But that's actually not my main point here.
------
A while ago, I saw a blog post where an economist ran a regression to predict wins from salaries, for a certain sport. The coefficient was not statistically-significantly different from zero, so the author declared that we can't reject the null hypothesis that team payroll is unrelated to team performance.
But, in this case, "salary has an effect of zero" is not a reasonable null hypothesis. Why? Because we have strong, common-sense knowledge that salary DOES sometimes have an effect.
That knowledge is: we all know that better free-agent players get paid higher salaries. If you don't believe that's the case -- if you don't believe that LeBron James will earn more than a bench player next season -- you are excused from this argument. But, the economist who did that regression certainly believes it.
In light of that, "zero" is no longer the likeliest, least extraordinary, possibility, so it doesn't qualify as a null.
That doesn't mean it can't still be the right answer. It could indeed turn out that the relationship between salary and wins is truly 0.00000. For that to happen, it would have to be that other factors exactly cancel out the LeBron factor.
Suppose every million dollars more you spend on LeBron James gives you 0.33446 extra wins, on average (from a God's-eye view). In that case, if you use "zero" in your null hypothesis, it's exactly equivalent to this alternative:
"For every million dollars less you spend on LeBron, you just happen to get exactly 0.33446 extra wins from other players."
Well, that's completely arbitrary! Why would 0.33446 be more likely than 0.33445, or any other number? There's no reason to believe that 0.33446 is "least extraordinary." And so there's no reason to believe that the original "zero" is least extraordinary.
Moreover, if you use a null hypothesis of zero, you're contradicting yourself, because you're insisting on two contradictory things:
(1) Players who sign for a lot more money, like LeBron, are generally much better players.
(2) We do not reject the assumption that the amount of money a team pays is irrelevant to how good it is.
You can believe either one of these, but not both.
-----
It used to be conventional wisdom that women over 40 should get a mammogram every year. The annual scan, it was assumed, would help discover cancer earlier, and lead to better outcomes.
Recent studies, though, dispute that conclusion. They say that there is no evidence that there's any difference in cancer survival or diagnosis rates for women who get the procedure and women who don't.
Well, same problem: the null of "no difference" is an arbitrary one. It's the same argument as in the salary case:
Imagine two identical women, with the same cancer. One of them gets a mammogram, the cancer is discovered, and she starts treatment. Another one doesn't get the mammogram, and the cancer isn't discovered until later.
Obviously, the diagnosis MUST make a difference in the expected outcomes for these two patients. Nobody believes that whether you get treatment early or late makes NO difference, right? Otherwise, doctors would just shrug and ignore the mammogram.
But, the null hypothesis of "zero difference" suggests that, when you add in all the other women, the expected overall survival rates should be *exactly the same*.
That's an extraordinary claim. Sure it's *possible* that the years of life lost by the undiagnosed cancer are exactly offset by the years lost from the unnecessary treatment from false positives after a mammogram. Like, for instance, the 34 cancer patients who didn't get the mammogram each lose 8.443 years off their lives, and the 45 false-positives each lose 6.379 years, and if you work it out, it comes to exactly zero.
"We can't reject that there is no difference" is exactly as arbitrary as "We can't reject that the difference is the cosine of 1.2345".
Unless, of course, you have an argument about how zero is a special case. If you DID want to argue that cancer treatment is completely useless, then, certainly, your zero null would be appropriate.
------
"Zero" works well as a null hypothesis when it's most plausible that there's nothing there at all, when it's quite possible that there isn't any trace of a relationship. It's inappropriate otherwise: when there's SOME evidence of SOME real relationship, SOME of the time.
In other words, zero works when it's a synonym for "there's no relationship at all." It doesn't work when it's a synonym for, "the relationship is so small that it might as well be zero."
The null hypothesis works as a defense against the placebo effect. It does not work as a defense against actual effects that happen to be close to zero.
But, isn't it almost the same thing? Isn't it just splitting hairs?
No, not at all. It's an important distinction.
There are two different questions you may want a study to answer. First: is there actually a relationship there? And, second, if there is a relationship there, how big is it?
The traditional approach is: if you don't get statistical significance, you're considered not to have proved the effect is really there -- and, therefore, you're not allowed to talk about how big it might be. You have to stop dead.
But, in the case of the mammogram studies, you shouldn't have to prove it's really there. Under any reasonable assumptions a researcher might have about mammograms and cancer, there MUST be an effect. Whether the observed size is bigger or smaller than twice the SD -- which is the criterion for "existence" -- is completely irrelevant. You already know that an effect must exist.
If you demand statistical proof of existence when you already know it's there, you're choosing to ignore perfectly good information, and you're misleading yourself.
That's what happened in the Oregon Medicaid study. It found that Medicaid coverage was associated with "clinically significant" improvements in reducing hypertension. But they ignored those improvements, because there wasn't enough data to constitute sufficient evidence -- evidence that there actually is a relationship between having free doctor visits and having your hypertension treated.
But that's silly. We KNOW that people behave differently when they have Medicaid than when they don't. That's why they want it, so they can see doctors more and pay less. There MUST be actual differences in the two groups. We just don't know how large.
But, because the authors of the study chose to pretend that existence was in doubt, they threw away perfectly good evidence. Imprecise evidence, certainly -- the confidence interval was very wide. But imprecision was not the problem. If the point estimate had been just an SD higher than it was, they would have accepted it at face value, imprecision be damned.
-------
One last analogy:
The FDA has 100 untested pills that drugmakers say treat cancer. The FDA doesn't know anything about them. However, God knows, and He tells you.
It turns out 96 of the 100 compounds don't work at all -- they have no effect on cancer whatsoever, no more than sugar pills. The other four do something. They may help cancer, or they may hurt it, and all to different degrees. (Some of the four may even have an effect size of zero -- but, in that case, they actually do something to the cancer; the good things they do just happen to be balanced out by the bad things.)
You test one of the pills. The result is clinically significant, but only 0.6 SD from zero, not nearly strong enough to be statistically significant. It's reasonable for you to say, "well, I'm not even going to look at the magnitude of the effect, because, it's likely that it's just random noise from one of the 96 sugar pills."
You test another one of the pills, and get the same result. But this time, God pops His head into the lab and says, "By the way, that drug is one of the four that actually do something!"
This time, the size of the effect matters, doesn't it? You'd look like a fool to refuse to consider the evidence, given that you now know the pill is doing *something* to the cancer.
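Here's a rough Bayesian version of that, in Python, under made-up assumptions: 96 of the 100 pills truly do nothing, the other four have a true effect of about one standard error, and the measured effect is the true effect plus standard normal noise:

```python
from scipy.stats import norm

prior_real = 0.04      # 4 of the 100 pills actually do something
z_observed = 0.6       # the observed effect, in standard errors
mu_real = 1.0          # assumed typical size of a real effect, in standard errors

likelihood_real = norm.pdf(z_observed - mu_real)
likelihood_null = norm.pdf(z_observed)

posterior_real = (prior_real * likelihood_real) / (
    prior_real * likelihood_real + (1 - prior_real) * likelihood_null)
print(posterior_real)  # ~0.04: a 0.6-SD result is almost certainly one of the sugar pills
# But once God tells you the pill is real, the prior no longer argues against it,
# and the 0.6-SD estimate becomes your best information about the effect's size.
```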
Well, that's what happens in real life. God has been telling us -- through common sense and observation -- that expensive players cost more money, that patients on Medicaid get more attention from doctors, and that patients with a positive mammogram get treated earlier.
Clinging to a null hypothesis of "no real effect," when we know that null hypothesis is false, makes no sense at all.
Labels: medicine, null hypothesis, significance, statistics
posted by Phil Birnbaum @ 7/02/2014 12:48:00 PM 3 comments