Realizations in Biostatistics: Biostatistics, clinical trial design, critical thinking about drugs and healthcare, skepticism, the scientific process. (http://realizationsinbiostatistics.blogspot.com/)

<h3>I set up a new data analysis blog (December 12, 2016)</h3>
Well, I tried to write a blog post here using the <a href="http://www.rstudio.com/" target="_blank">RStudio</a> R Markdown system and utterly failed, so I set up a system where I could publish from RStudio: a GitHub Pages blog at <a href="http://randomjohn.github.io/">randomjohn.github.io</a>. There I can easily write and publish posts involving data analysis.

<h3>Windows 10 anniversary update includes a whole Linux layer - this is good news for data scientists (September 24, 2016)</h3>
If you are on Windows 10, no doubt you have heard that Microsoft included the bash shell in its 2016 Windows 10 anniversary update. What you may not know is that this is much, much more than just the bash shell. It is a whole Linux layer that enables you to use Linux tools, doing away with a compatibility layer like Cygwin (which requires a special dll). However, you only get the bash shell out of the box.
To enable the whole Linux layer, follow the instructions <a href="http://www.howtogeek.com/249966/how-to-install-and-use-the-linux-bash-shell-on-windows-10/" target="_blank">here</a>. Basically, this involves enabling developer mode and then enabling the Linux layer feature. In the process, you will download some further software from the Windows Store.<br />
<br />
Why is this big news? To me, this installs a lot of the Linux tools that have proven useful over the years, such as <code>wc</code>, <code>sed</code>, <code>awk</code>, <code>grep</code>, and so forth. In some cases, these tools work much better than software packages such as R or SAS, and their power comes from combining them through pipes. You also get <code>apt-get</code>, which enables you to install and manage packages such as SQLite, Octave, and gnuplot. You can even install R this way, though I don't know whether RStudio works with an R installed like that.<br />
<br />
If you're a Linux buff who uses Windows, you can probably think of many more things you can do with this. The only drawback is that I haven't tried using X Windows or any other graphical interface.

<h3>Which countries have Regrexit? (June 26, 2016)</h3>
This doesn't have a lot to do with the bio part of biostatistics, but it is an interesting data analysis that I have just started. In the wake of the Brexit vote, there is a <a href="https://petition.parliament.uk/petitions/131215" target="_blank">petition for a redo</a>. The data for the petition are <a href="https://petition.parliament.uk/petitions/131215.json" target="_blank">here</a>, in JSON format.<br />
<br />
Fortunately, working with JSON data in R is pretty easy. You can download the data from the link above and put it into a data frame. I start on that <a href="https://github.com/randomjohn/regrexit" target="_blank">here</a>, using the RJSONIO package, ggplot2, and a version of the petition I downloaded on 6/26/16.<br />
<br />
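As a rough sketch of the workflow (the JSON field names below, such as <code>signatures_by_country</code>, reflect my reading of the petition API's layout at the time and may have changed):

<pre>
library(RJSONIO)
library(ggplot2)

# pull the petition JSON and parse it into nested R lists
raw <- paste(readLines("https://petition.parliament.uk/petitions/131215.json"),
             collapse = "")
pet <- fromJSON(raw)

# flatten the per-country signature counts into a data frame
bycountry <- do.call(rbind, lapply(
  pet$data$attributes$signatures_by_country,
  function(x) data.frame(country = x$name, signatures = x$signature_count,
                         stringsAsFactors = FALSE)))

# top 9 non-UK countries, plotted as a horizontal bar chart
nonuk <- bycountry[bycountry$country != "United Kingdom", ]
top9  <- head(nonuk[order(-nonuk$signatures), ], 9)
ggplot(top9, aes(reorder(country, signatures), signatures)) +
  geom_bar(stat = "identity") + coord_flip() +
  labs(x = NULL, y = "Signatures")
</pre>
<br />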
One question I had was whether all the signers are British. Fortunately, the petition records each signer's place of residence (assuming no fraud). I came up with the following top 9 non-UK countries of residence of signers.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEingXOfWSxmdCEzuYiI_zKZ52RKe4qzyqYOSWga0TxlyJFin5coWlEND8jGpvSjCulG8KyESY9n4Oaexes26IGu4LAp-Gc-19gUn2ctLLtV4XxOH7kVWv6UL30oAqSRH4GTwgvyDcCG38I/s1600/regrexit.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="195" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEingXOfWSxmdCEzuYiI_zKZ52RKe4qzyqYOSWga0TxlyJFin5coWlEND8jGpvSjCulG8KyESY9n4Oaexes26IGu4LAp-Gc-19gUn2ctLLtV4XxOH7kVWv6UL30oAqSRH4GTwgvyDcCG38I/s320/regrexit.png" width="320" /></a></div>
<div class="separator" style="clear: both; text-align: left;">
There are a couple of things to remember when interpreting this graph:</div>
<div class="separator" style="clear: both; text-align: left;">
</div>
<ol>
<li>I left off the UK, which accounts for by far the largest share of signatories, with over 3 million signatures.</li>
<li>5 of the top 9 countries are neighbors of the UK, including the top 2. The other 4 are Australia, the US, Canada, and New Zealand, all countries with strong ties to the UK.</li>
<li>This assumes no petition fraud, which I can't guarantee. I saw at least one Twitter posting telling people to use her (if the profile pic is to be believed) residence code. The petition data also include a breakdown by constituency, so I wonder whether it would be possible to analyze the petition for fraud. I'm not as familiar with British census data as I am with US data, but I imagine a mashup of the two would be useful.</li>
</ol>
(Update: Greg Jefferis posted a nice analysis <a href="http://rpubs.com/jefferis/192311" target="_blank">here</a>. See comments below.)<br />
<h3>Little Debate: defining baseline (June 5, 2016)</h3>
In an April 30, 2015 <a href="http://www.dartmouth.edu/~bio125/Leek.&.Peng.Nature.2015.p-values.pdf" target="_blank">note</a> in <i>Nature</i> (vol. 520, p. 612), Jeffrey Leek and Roger Peng observe that <i>p</i>-values get intense scrutiny, while all the decisions that lead up to the <i>p</i>-values get little debate. I wholeheartedly agree, so I'm creating a <i>Little Debate</i> series to shine some light on these tiny decisions that may not get a lot of press, yet can have a big influence on a statistical analysis. Because my focus here is mainly biostatistics, most of these ideas will be placed in the setting of clinical trials.<br />
<br />
Defining baseline seems like an easy thing to do, and conceptually it is: baseline is where you start before some intervention (e.g. treatment, or randomization to treatment or placebo). In practice, however, the details of defining baseline in a biostatistics setting can get tricky very quickly.<br />
<h4>
The missing baseline</h4>
<div>
Baseline is often defined as the value at a randomization or baseline visit, i.e. the last measurement before the beginning of some treatment or intervention. However, a lot of times things happen - a needle breaks, a machine stops working, or study staff just forget to do procedures or record times. (These are not just hypothetical cases ... these have all happened!) In these cases, we end up with a missing baseline. A missing baseline will make it impossible to determine the effect of an intervention for a given subject.</div>
<div>
<br /></div>
<div>
In this case, we have accepted that we can use previous values, such as those taken during the screening of a subject, as baseline values. This is probably the best we can do under the circumstances. However, I'm unaware of any research on what effect this has on statistical analysis.</div>
<div>
<br /></div>
<div>
To make matters worse, a lot of times people without statistical training or expertise will make these decisions, such as putting down a post-dose value as baseline. Even with good documentation, these sorts of mistakes are not easy to find, and, when they are, they are often found near the end of the study, right when data management and statisticians are trying to produce results, and sometimes after interim analyses.</div>
<h4>
The average baseline</h4>
<div>
Some protocols specify that baseline consists of the average of three repeated measurements. Again, this decision is often made before any statisticians are consulted. The issue is that averages are not directly comparable to raw values. Let's say that a baseline QTc (a heart-rate-corrected measure of how long the heart's electrical system takes to recover after each beat) is defined as the average of 3 electrocardiogram (ECG) measurements. Say the standard deviation of a raw QTc measurement (i.e. one based on a single ECG) is <i>s</i>. The standard deviation of the average of three (assuming independence) is <i>s</i>/√3, or just above half the standard deviation of a raw measurement. Thus, a change of 1 unit in the average of 3 ECGs is a lot more noteworthy than a change of 1 unit in a single ECG measurement. And yet we compare that baseline to single measurements for the rest of the study.</div>
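<div>
(A quick simulation makes the point; the numbers here are made up, with a true QTc of 400 ms and a single-measurement SD of 10 ms:)
<pre>
set.seed(2016)
s    <- 10                                       # SD of one QTc measurement (made up)
raw  <- rnorm(1e5, 400, s)                       # single-ECG baselines
avg3 <- replicate(1e5, mean(rnorm(3, 400, s)))   # average-of-3 baselines
sd(raw)    # about 10
sd(avg3)   # about 10/sqrt(3) = 5.8: the two baselines live on different scales
</pre>
</div>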
<div>
<br /></div>
<div>
To make matters worse, if the ECG machine screws up one measurement, then the baseline becomes the average of two. A lot of times we lose that kind of information, and yet analyze the data as if the mystery average is a raw measurement.</div>
<h4>
The extreme baseline</h4>
<div>
In one observational study, the sponsor wanted to use the maximum value over the previous 12 months as baseline. This was problematic for several reasons. Like the average, the extreme baseline (here the maximum) is on a different scale, and even has a different distribution, than a raw measurement. The <a href="https://en.wikipedia.org/wiki/Fisher%E2%80%93Tippett%E2%80%93Gnedenko_theorem" target="_blank">Fisher-Tippett</a> (extreme value) theorem states that the maximum of <i>n</i> values, suitably normalized, converges to one of three extreme value distributions (Gumbel, Fréchet, or Weibull). This extreme baseline is then compared to, again, single measurements taken after baseline. What's worse, any number of measurements could have been taken within those 12 months, so the distribution of baseline shifts from subject to subject.</div>
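<div>
(A small simulation, again with made-up numbers, shows the scale and spread problem: a stable measurement with mean 100 and SD 10, observed either once or as the maximum of 12 monthly values:)
<pre>
set.seed(2016)
single <- rnorm(1e5, 100, 10)                                   # one measurement
max12  <- apply(matrix(rnorm(1e5 * 12, 100, 10), ncol = 12), 1, max)
c(mean(single), mean(max12))   # about 100 vs about 116
c(sd(single), sd(max12))       # about 10 vs about 6
</pre>
</div>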
<div>
<br /></div>
<div>
Comparing an extreme value with a later single measurement will lead to an unavoidable case of <a href="https://en.wikipedia.org/wiki/Regression_toward_the_mean" target="_blank">regression to the mean</a>, creating an apparent trend in the data where none may exist. Without proper context, this may lead to overly optimistic interpretations of the effect of an intervention, and overly small <i>p</i>-values. (Note that a Bayesian analysis is not immune to the misleading conclusions that can arise from this terrible definition of baseline.)</div>
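<div>
(A sketch of the effect, under assumed numbers: no subject changes at all, baseline is the maximum of 12 monthly values, and follow-up is a single measurement. The result is an apparent "improvement" of roughly 8 units with a tiny <i>p</i>-value:)
<pre>
set.seed(2016)
n    <- 1000
true <- rnorm(n, 100, 10)           # each subject's stable underlying level
base <- apply(matrix(rnorm(n * 12, true, 5), nrow = n), 1, max)  # max-of-12 baseline
post <- rnorm(n, true, 5)           # one later measurement, no true change
mean(post - base)                   # clearly negative: regression to the mean
t.test(post - base)                 # "significant" despite zero true effect
</pre>
</div>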
<h4>
Conclusion</h4>
<div>
The definition of baseline is a "tiny decision" that can have major consequences for a statistical analysis. Yet the impact of this decision has not been well studied, especially in the context of a clinical trial, where a wide range of definitions may be written into a protocol without the expert advice of a statistician. Even a well-accepted definition -- that baseline is the last single pre-dose value before intervention -- has not been well studied in the scenario of a missing baseline-day measurement. Other definitions are often adopted without considering the impact on analysis, including some that may lead to wrong interpretations.</div>
<h3>Simulating a Weibull conditional on the time-to-event being greater than a given time (May 20, 2016)</h3>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEglSD3oLfZNQQW_Wit-ELDsvOv4TCMitbjv8vUBieBqjofqRTIMovN0FmG1wdJ3CznTuIM_6cIqB4IIwKHKNsud-Ev36Zap5ZVcJ7ZEDxbGyUfyryazascZNP-hJdofCI2hshndSio5z9Q/s1600/Rplot.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="121" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEglSD3oLfZNQQW_Wit-ELDsvOv4TCMitbjv8vUBieBqjofqRTIMovN0FmG1wdJ3CznTuIM_6cIqB4IIwKHKNsud-Ev36Zap5ZVcJ7ZEDxbGyUfyryazascZNP-hJdofCI2hshndSio5z9Q/s200/Rplot.png" width="200" /></a></div>
Recently, I had to simulate event times for subjects who have been on a study, are still ongoing at the time of a data cut, and remain at risk of an event (e.g. progressive disease, cardiac event, death). This requires simulating from a Weibull distribution conditional on the event time exceeding the time already observed. To do this, I created the following function:<br />
<span style="font-family: "courier new" , "courier" , monospace;"># simulate conditional Weibull conditional on survival > T ---------------</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;"># reliability function is exp{-(T+t/b)^a} / exp{-(T/b)^a} = 1-F(t)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"># n = number of points to return</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"># shape = shape parm of weibull</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"># scale = scale parm of weibull (default 1)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"># t is minimum (default is 0, which makes the function act like rweibull)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">my_rweibull <- function(n,shape,scale=1,t=0) {</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> if (length(t)!=1 && length(t)!=n) {</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> stop("length(t) is not 1 or n")</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> }</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> return(scale*(-log(runif(n))+(t/scale)^shape)^(1/shape))</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">}</span><br />
You use this function just like <code>rweibull</code>, except that you also pass in <code>t</code>: a vector of minimum times, or a scalar minimum time applied to all simulated values. The idea is that the probability that the random variable exceeds x, given that it exceeds the minimum t, is exp{-(x/b)^a} / exp{-(t/b)^a}, so you can simulate it with a uniform random variate. I use the inversion method on this conditional reliability function (just like the inversion method on the distribution function, with the insight that if U is uniform(0,1), so is 1-U).<br />
<span style="font-family: "times" , "times new roman" , serif;"><br /></span>
<span style="font-family: "times" , "times new roman" , serif;">Truth be told, I ought to buckle down and learn how to do packages in R, but for now I'll just pass on some code on my blog if I think it will be helpful (or if I need to find it while doing a Google search later).</span><br />
<span style="font-family: "times" , "times new roman" , serif;"><br /></span>
<span style="font-family: "times" , "times new roman" , serif;">(Edit on 7/16: <b>OOPS!</b> A previous version of this had the scale and shape parameters switched. I've corrected it now. If you copied this before 7/16/2016, please check again.)</span>http://realizationsinbiostatistics.blogspot.com/2016/05/simulating-weibull-conditional-on-time.htmlnoreply@blogger.com (Anonymous)tag:blogger.com,1999:blog-6439806875881311576.post-86049477500744811362016年1月14日 20:01:00 +00002016年01月14日T16:57:40.821-05:00analyticsdata scienceRstatisticsTalk to Upstate Data Science Group on CaretLast night I gave an <a href="https://github.com/randomjohn/caretdemo2" target="_blank">introduction and demo</a> of the <span style="font-family: Courier New, Courier, monospace;">caret</span><span style="font-family: Times, Times New Roman, serif;"> R package to the Upstate Data Science group, meeting at Furman University. It was fairly well attended (around 20 people), and well received.</span><br />
<span style="font-family: Times, Times New Roman, serif;"><br /></span>
<span style="font-family: Times, Times New Roman, serif;">It was great to get out of my own comfort zone a bit (since graduate school, I've only really given talks on some topic in biostatistics) and meeting statisticians, computer scientists, and other sorts of data scientists from many different fields. This is a relatively new group, and given the interest over the last couple of months or so I think this has been sorely needed in the Upstate South Carolina region.</span><br />
<span style="font-family: Times, Times New Roman, serif;"><br /></span>
<span style="font-family: Times, Times New Roman, serif;">We'll be participating in Open Data day in March of this year, so if you are in the Upstate SC region, or don't mind making the trek from Columbia or Western NC, find us on <a href="http://www.meetup.com/Greenville-Data-Science-Analytics-Meetup/" target="_blank">Meetup</a>. Our next meeting is a data hack night which promises to be interesting.</span>http://realizationsinbiostatistics.blogspot.com/2016/01/talk-to-upstate-data-science-group-on.htmlnoreply@blogger.com (Anonymous)tag:blogger.com,1999:blog-6439806875881311576.post-75665814260439665632015年11月25日 13:00:00 +00002016年06月05日T15:43:45.065-04:00Little Debatepractice of statisticsRstatisticsEven the tiniest error messages can indicate an invalid statistical analysisThe other day, I was reading in a data set in R, and the function indicated that there was a warning about a parsing error on one line. I went ahead with the analysis anyway, but that small parsing error kept bothering me. I thought it was just one line of goofed up data, or perhaps a quote in the wrong place. I finally opened up the CSV file in a text editor, and found that the reason for the parsing error was that the data set was duplicated within the CSV file. The parsing error resulted from the reading of the header twice. As a result, anything I did afterward was suspect.<br />
<br />
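(Here is a small reconstruction of the trap, assuming the <code>readr</code> package for the import; with base R's <code>read.csv</code> the symptom differs, but the moral is the same:)
<pre>
library(readr)

# simulate a CSV that was accidentally written out twice, header and all
block <- c("x,y", "1,2", "3,4")
writeLines(c(block, block), "dup.csv")

df <- read_csv("dup.csv", col_types = "dd")  # warns of parsing problems
problems(df)   # points at the repeated header row, which parses as NA
nrow(df)       # 5 rows, not 2: a junk row plus the data appearing twice
</pre>
<br />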
Word to the wise: track down the reasons for even the most innocuous-seeming warnings. Every stage of a statistical analysis is important, and small errors anywhere along the way can have huge consequences downstream. Perhaps this is obvious, but you still have to slow down and take care of the details.<br />
<br />
(Note that I'm editing this to be a part of my <a href="http://realizationsinbiostatistics.blogspot.com/search/label/Little%20Debate" target="_blank">Little Debate</a> series, which discusses the tiny decisions about data that are rarely discussed or scrutinized but can have a major impact on conclusions.)

<h3>Statisticians ruin the day again, this time with a retraction (October 22, 2015)</h3>
<a href="http://retractionwatch.com/2015/10/21/authors-retract-second-study-about-medical-uses-of-honey/">Authors retract second study about medical uses of honey - Retraction Watch</a>: <br /><br />
<br /><br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcSl9epYMW9q_OIFWVRmawkZ8YkyqbZPlHSFC7FU6vb_swzk4bc1" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="170" src="https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcSl9epYMW9q_OIFWVRmawkZ8YkyqbZPlHSFC7FU6vb_swzk4bc1" width="200" /></a></div>For the second time, authors of manuscripts have had to retract their papers because of serious data analysis errors. While the details of the actual errors are scant, we do know that the article was published, a company tried to replicate the results but failed, the journal editor employed a third-party statistician who found serious errors in the data analysis, and the errors were serious enough that the paper, to stay accepted, would have had to go through a major revision and further peer review.<br /><br />
<br /><br />
Better to get a statistician to ruin your day before publication (and require another revision) than to have to eat crow because a third party did it. Other thoughts:<br /><br />
<br /><br />
<ul><li>Did the authors have a statistician review this before submitting?</li>
<li>Did the journal have this go through a statistical peer review before accepting?</li>
<li>Come to think of it, was a statistician involved in the planning of this study?</li>
</ul>

<h3>The thirty-day trial (October 15, 2015)</h3>
Steve Pavlina wrote about a self-help technique called the thirty-day trial. To perform the technique, you commit to 30 days of some new habit, such as quitting smoking or writing in a journal. The idea is that it's psychologically easier to commit to something for 30 days than to make a permanent change, but after the 30 days you have broken the old habit's hold and can decide whether to continue with the new habit, go back, or go in a completely different direction.<br />
<br />
For activities like journaling or quitting smoking, this technique might work. After all, psychologist Jeremy Dean announced that it takes 21 days to form a new habit. (This claim has been treated with due skepticism, as it should be.) However, if you try a 30-day quit-smoking trial and it doesn't work, try again. And if that doesn't work, try again, until you succeed.<br />
<br />
However, for activities such as a new diet, a treatment, or any sort of healthcare advice, the 30-day trial should not be used.<br />
<br />
For cases where one might try a 30-day trial just to see if one feels better, such as an elimination diet, confirmation bias is likely to be at play. This is especially true for an elimination diet, where one eliminates some component of the diet, like dairy or gluten, and sees whether symptoms such as bloating or fatigue go away. In such trials, the results may be due to the eliminated item in question, a placebo/nocebo effect, or some third, confounded eliminated item. For instance, bloating from gluten-containing foods probably comes from short-chain carbohydrates called FODMAPs. Just because you feel better for 30 days after discontinuing gluten-containing foods doesn't mean that gluten is the culprit. In fact, it probably isn't, as determined by a well-designed clinical trial. Likewise, eliminating milk from a diet isn't likely to do much unless lactose intolerance is the culprit, and there are tests for that.<br />
<br />
Returning to Pavlina's advice: he recommends a 30-day trial over the results of a clinical trial. This is sloppy and irresponsible advice. First, it is very unlikely that an individual will have access to the screening assays, animal models, or millions of dollars needed to discover and develop a drug. That is, unless said individual is up for trying millions of compounds, many of them potentially toxic, which would probably end the trial tragically. Instead, a pharmaceutical should be used under the supervision of a doctor, who should be aware of the totality of the literature (ideally randomized controlled trials and a systematic review, where possible) and can navigate the benefits and possible adverse effects. A 30-day trial of a pharmaceutical or medical device may be completely inappropriate for realizing benefits or assessing risks. Here is the primary case where the plural of anecdote is not data.<br />
<br />
The bottom line is that the 30-day trial is one of several ways (and perhaps not the best one) to change habits that you know, from independent confirmation, need changing, like quitting smoking. It's a terrible way to adjust a diet or determine whether a course of treatment is the right one. Treatments should be based on science and supervised by a physician who can objectively determine whether a treatment is working.

<h3>Statistics: P values are just the tip of the iceberg (May 8, 2015)</h3>
<a href="http://www.nature.com/news/statistics-p-values-are-just-the-tip-of-the-iceberg-1.17412">Statistics: P values are just the tip of the iceberg : Nature News & Comment</a>: <br /><br />
<br /><br />
This article is very important. Yes, p-values reported in the literature (or in your own research) need scrutiny, but so does every step in the analysis process, starting with the design of the study.<br /><br />
<h3>The many faces of the placebo response (April 17, 2015)</h3>
<p>This week, a study was <a href="http://www.sciencedaily.com/releases/2015/04/150413140906.htm" target="_blank">published</a> claiming that the placebo response is mediated by genetics. Though I need to dig a little deeper and figure out exactly what this article is saying, I do think we need to take a step back and remember what can constitute a placebo response before we start talking about what this means for medical treatment and clinical trials.</p>
<p>In clinical trials, the placebo response can refer to a number of apparent responses to sham treatment:</p>
<ul>
<li>The actual placebo response, i.e. the body's physiological response to something perceived to be a treatment</li>
<li>The natural course of a disease, including fluctuations and adaptations</li>
<li>Investigator and/or subject bias on subjective instruments (hopefully mitigated by blinding/masking treatment arms in a study)</li>
<li>Regression to the mean (an apparent time-based relationship caused by one measurement that varies markedly from the average measurement)</li>
<li>... and many, many other sources</li>
</ul>
<p>This week's discovery does suggest that there is something physiological to the actual placebo response, and certainly genetics can influence this response. This may be useful in enriching clinical trials where a large placebo response is undesirable, e.g. by removing those subjects who are likely to respond well to anything, active or inert. After all, you don't want an estimate of a treatment effect contaminated by a placebo response, nor do you want an impossibly high bar for showing an effect.</p>
<p>But we still need to remember the mundane sources of "placebo response" and lower those before we get too excited about genetic tests for clinical trials.</p>

<h3>Helicopter parenting because your mind is "terrible at statistics" (or, rather, you are unaware of the denominator) (April 14, 2015)</h3>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://www.todaysparent.com/wp-content/uploads/2014/04/1iStock_000000434344Small-1.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="http://www.todaysparent.com/wp-content/uploads/2014/04/1iStock_000000434344Small-1.jpg" height="200" width="200" /></a></div>
In Megan McArdle's <a href="http://www.bloombergview.com/articles/2015-04-13/seven-reasons-we-hate-free-range-parenting">Seven Reasons We Hate Free-Range Parenting - Bloomberg View</a>, she states that because of the news cycle, and because our minds are terrible at statistics, we think the world is a much less safe place than it was 30 years ago. It's true. We have more opportunity (because of the internet and cable news) to hear about tragedies and crime from far-away places, and it's worse when such tragedy strikes closer to home. Thus we tend to think the world is very unsafe (and are therefore encouraged to become helicopter parents). We are acutely aware of the numerator.<br />
<br />
What we do not hear, because it does not sell on cable news, is the denominator. For the (very few) children abducted by strangers, for instance, we do not hear of the many who successfully played at the park, or walked to the library, or went down to the field to play stickball (or I guess nerf softball, because we shouldn't be throwing hard objects anymore) without getting abducted. Those stories do not sell.<br />
<br />
I guess the second-best story is reporting on trends in parenting, and how they are driven by how bad we all are at statistics (even statisticians).<br />
<h3>Lying with statistics, CAM version (March 19, 2015)</h3>
<p>Full disclosure here: at one time I wanted to be a complementary and alternative medicine (CAM) researcher. Or integrative, or whatever the cool kids call it these days. I thought that CAM research would yield positive fruit if researchers could just tighten up their methodology and leave nothing to question. While this is not intended to be a discussion of my career path, I'm glad I did not go down that road.</p>
<p><a href="http://www.ejinme.com/article/S0953-6205(14)00150-2/pdf" target="_blank">This article</a> is a discussion of why. Its basic premise is that a positive clinical trial does not really provide strong evidence for an implausible therapy, for much the same reason that doctors will give a stronger test to an individual who tests positive for HIV. A positive test provides some, but not conclusive, evidence for HIV simply because HIV is so rare in the population: the predictive value of even a pretty good test is poor. And the predictive value of a pretty good clinical trial is pretty poor if the plausibility of the treatment has not been established.</p>
<p>Put it this way: if we have a treatment that has zero probability of working (the "null hypothesis" in statistical parlance), there is a 5% probability that it will show a significant result in a conventional clinical trial. But let's turn that on its head using <a href="http://en.wikipedia.org/wiki/Bayes%27_rule" target="_blank">Bayes' rule</a>:</p>
<p>Prob(treat is useless | positive trial) * Prob(positive trial) = Prob(positive trial | treat is useless) * Prob(treat is useless)</p>
<p>(This is just the definition of Prob(treat is useless AND positive trial).) Expanding, and using the law of total probability:</p>
<p>Prob(treat is useless | positive trial) = Prob(positive trial | treat is useless) * Prob(treat is useless) / [Prob(positive trial | treat is useless) * Prob(treat is useless) + Prob(positive trial | treat is not useless) * Prob(treat is not useless)]</p>
<p>Now we can substitute, assuming that our treatment is in fact truly useless, so that Prob(treat is useless) = 1:</p>
<p>Prob(treat is useless | positive trial) = p-value * 1 / (p-value * 1 + who cares * 0) = 1</p>
<p>That is to say, if we know the treatment is useless, a positive clinical trial offers no new knowledge, even if it was well conducted.</p>
<p>Drugs that enter human trials are required to have some evidence for efficacy and safety, such as that gained from <em>in vitro</em> and animal testing. The drug development paradigm isn't perfect in this regard, but the principle of requiring scientific and empirical evidence for safety and efficacy is sound. When we get better models for predicting safety and efficacy we will all be a lot happier. The point is to reduce the prior probability of futility to something low and maximize the probability of a positive trial given the treatment is not useless, which would result in something like:</p>
<p>Prob(treat is useless | positive trial) = p-value * something tiny / (p-value * something tiny + something large * something close to 1) = something tiny</p>
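<p>(A two-line R function makes the prior's dominance concrete; the 5% type I error rate and 80% power below are conventional assumed values, not from the article:)</p>
<pre>
# posterior probability a treatment is useless after one positive trial
posterior_useless <- function(prior, alpha = 0.05, power = 0.80) {
  alpha * prior / (alpha * prior + power * (1 - prior))
}
posterior_useless(0.99)  # implausible therapy: still about 0.86 after a "win"
posterior_useless(0.50)  # 50:50 prior: drops to about 0.06
</pre>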
<p>Of course, there are healthy debates regarding the utility of the p-value. I question it as well, given that it requires a reference to trials that can never be run. These debates need to be had among regulators, academia, and industry to determine the best indicators of evidence of efficacy and safety.</p>
<p>But CAM studies have a long way to go before they can even think about such issues.</p>

<h3>Lying with statistics, anti-vax edition 2015 (March 16, 2015)</h3>
<p><img src="http://origin.ih.constantcontact.com/fs061/1101053106216/img/17.jpg?a=1101671877326" width="147" align="left" height="143"></p>
<p>Sometimes Facebook's suggestions of things to read lead to some seriously funny material. After clicking on a link about vaccines, Facebook recommended I read an <a href="http://healthimpactnews.com/2011/new-study-vaccinated-children-have-2-to-5-times-more-diseases-and-disorders-than-unvaccinated-children/" target="_blank">article about health outcomes</a> in unvaccinated children. Reading this rubbish made me as annoyed as a <a href="http://scienceblogs.com/insolence/" target="_blank">certain box of blinking lights</a>, but it again affords me the opportunity to describe how people can confuse, bamboozle, and twist logic using bad statistics.</p>
<p>First of all, Health Impact News has all the markings of a crank site. For instance, its banner claims it is a site for "News that impacts your health that other media sources may censor." This in itself ought to be a red flag, just like Kevin Trudeau's <em>Natural Cures They Don't Want You to Know About</em>.</p>
<p>But enough about that. Let's see how this article and the study it refers to abuse statistics.</p>
<p>First of all, this is a bit of a greased pig. Their link leads to a malformed PDF file on a site called vaccineinjury.info. The site's apparent reason for existence is to host a <a href="http://www.vaccineinjury.info/vaccinations-in-general/health-unvaccinated-children.html" target="_blank">questionnaire</a> for parents who did not vaccinate their children. So I'll have to go on what the article says. There appears to be another <a href="http://www.vaccineinjury.info/vaccinations-in-general/health-unvaccinated-children/survey-results-illnesses.html" target="_blank">discussion</a> on the vaccineinjury.info site, which I'll get to in a moment.</p>
<p>The authors claim:</p>
<blockquote><p><em>No study</em> of health outcomes of vaccinated people versus unvaccinated has ever been conducted in the U.S. by CDC or any other agency in the 50 years or more of an accelerating schedule of vaccinations (now over 50 doses of 14 vaccines given before kindergarten, 26 doses in the first year).</p></blockquote>
<p><a href="http://www.webmd.com/children/vaccines/news/20140519/delaying-measles-related-vaccines-may-raise-seizure-risk-study" target="_blank">Here's one</a>. A simple PubMed search will bring up others fairly quickly. These don't take long to find.
What happens after this statement is a long chain of unsupported assertions about what data the CDC has and has not collected, which I really don't have an interest in debunking right now (and so leave as an exercise).</p> <p>So on to the good stuff. They have a pretty blue and red bar graph that's just itching to be shredded, so let's do it. This blue and red bar graph is designed to demonstrate that vaccinated children are more likely to develop certain medical conditions, such as asthma and seizures, than unvaccinated children. Pretty scary stuff, if their evidence were actually true.</p> <p>One of the most important principles in statistics is defining your population. If you fail at that, you might as well quit, get your money back from <a href="http://www.sas.com" target="_blank">SAS</a>, and call it a day, because nothing that comes after that is meaningful. You might as well make up a bunch of random numbers if that's the case, because they will be just as meaningful.</p> <p>This study fails miserably at defining its population. Best I can tell, the comparison is between a population in an observational study called <a href="http://www.kiggs-studie.de/" target="_blank">KIGGS</a> and respondents to an open-invitation survey conducted at vaccineinjury.info.</p> <p>What could go wrong? Rhetorical question.</p> <p>We don't know who responded to the vaccineinjury.info questionnaire, but it is aimed at parents who did not vaccinate their children. This pretty much tanks the rest of their argument. From what I can tell, these respondents seem motivated to give answers favorable to the antivaccine movement. That the data they present are supplemented with testimonials gives this away. They are comparing apples to rotten oranges.</p> <p>The right way to answer a question like this is a matched case-control study of vaccinated and unvaccinated children. An immunologist is probably the best one to determine which factors need to be included in the matching. That way, an analysis conditioned on the matching can clearly point to the effect of the vaccinations rather than leave open the question of whether the differences in cases were due to differences in inherent risk factors.</p> <p>I'm wondering if there isn't some ascertainment bias going on as well. Though I really couldn't tell what the KIGGS population was, it was represented as the vaccinated population. So in addition to imbalances in risk factors, I'm wondering whether the "diagnosis" in the unvaccinated population was derived from asking parents which medical conditions their children have. In that case, we have no clue what the real rate is, because we are comparing parents' judgments (and parents probably more likely to ignore mainstream medicine at that) with, presumably, a GP's more rigorous diagnosis. That's not to say that no children in the survey were diagnosed by an MD, but without that documentation (which this web-based survey isn't going to be able to provide), the red bars in the pretty graph are essentially meaningless. (Which they were even before this discussion.)</p> <p>But let's move on.</p> <p>The vaccineinjury.info site cites some other studies that seem to agree with their little survey.
For instance, McKeever, <em>et al.</em> published a <a href="http://ajph.aphapublications.org/doi/abs/10.2105/AJPH.94.6.985" target="_blank">study</a> in the <em>American Journal of Public Health</em> in 2004 from which the vaccineinjury.info site claims an association between vaccines and the development of allergies. However, that apparent association, as stated in the study, is possibly the result of ascertainment bias (the association was only strong in a stratum with the least frequent GP visits). Even <a href="http://ajph.aphapublications.org/doi/pdf/10.2105/AJPH.2004.050583" target="_blank">objections</a> to the discussion of ascertainment bias leave the evidence of an association between vaccines and allergic diseases unclear.</p> <p>The vaccineinjury.info site also cites the <a href="http://www.ncbi.nlm.nih.gov/pubmed/11110734" target="_blank">Guinea-Bissau study</a> reported by Kristensen <em>et al.</em> in <em>BMJ</em> in 2000. They claim, falsely, that the study showed a higher mortality in vaccinated children.</p> <p>They also cite a New Zealand study.</p> <p>What they don't do is describe how they chose the studies displayed on the web site. What were the search terms? Were these studies cherry-picked to demonstrate their point? (Probably, but they didn't do a good job.)</p> <p>What follows the discussion of other studies is an utter waste of internet space. They report the results of their "survey," I think. Or somebody else's survey. I really couldn't figure out what was meant by "Questionnaire for my unvaccinated child ("Salzburger Elternstudie")". The age breakdown for the "children" is interesting, for 2 out of the 1004 "children" were over 60! At any rate, if you are going to be talking about diseases in children, you need to present them by age, because, well, age is a risk factor in disease development. But they did not do this.</p> <p>What is interesting about the survey, though, is the reasons the parents did not vaccinate their children, if only to give a preliminary notion of the range of responses.</p> <p>In short, vaccineinjury.info, and the reporting site Health Impact News, present statistics that are designed to scare rather than inform. Proper epidemiological studies, contrary to the sites' claims, have been conducted and provide no clear evidence for the notion that vaccinations cause allergies except in rare cases. In trying to compile evidence for their claims, they failed to provide evidence that they did a proper systematic review, and even misquoted the conclusions of the studies they presented.</p> <p>All in all, a day in the life of a crank website.</p>

<h3>Algorithmic cruelty (December 31, 2014)</h3>
<p>By now, most of us know about Facebook's algorithmic retrospectives, and, of course, how some people thought they could <a href="http://flowingdata.com/2014/12/29/inadvertent-algorithmic-cruelty/" target="_blank">be cruel</a>. Indeed, posts about some sorts of events, such as divorces or deaths, can get a lot of "likes" (where the "like" button means something other than "like") and comments, and therefore get flagged by whatever algorithm the Facebook data scientists came up with as posts worthy of a retrospective.</p> <p>There are a lot of issues here.
When someone "likes" a post, they do not necessarily mean they "like" the event the post is about. It could mean a number of different things, such as "I hear you," or "I’m empathizing with you," or even "Hang in there." However, the algorithms treat all likes equal.</p> <p>Comments, of course, carry much more sophisticated meaning, but are much harder to analyze especially in the presence of sarcasm. And algorithms that do analyze comments (or any free text) for sentiment will require a large training set of hand-coded comments. (Which I suppose Facebook does have the resources to generate.)</p> <p>Which leaves a few ways of handling this problem:</p> <ul> <li>Do nothing different. Which is probably my favorite solution, because I’d like to look back on the good, the bad, and the ugly. It’s my life, and I want to remember it. Besides, the event that really sucked at the time (say, a torn ACL leading to surgery) may lead to good things.</li> <li>Add a "I don’t want to see this" button. Which was already accomplished by including an "X" button, but maybe not so obvious.</li> <li>Eliminate the retrospective, which I don’t think anybody agrees is a good solution.</li></ul> <p>I suppose one day Facebook’s algorithm will be smart enough to withhold posts it knows people don’t want to review, but then that will open up another can of worms.</p> http://realizationsinbiostatistics.blogspot.com/2014/12/algorithmic-cruelty.htmlnoreply@blogger.com (Anonymous)tag:blogger.com,1999:blog-6439806875881311576.post-80088245772331454102014年12月09日 23:00:00 +00002014年12月10日T10:23:11.375-05:00epidemiologylying with statisticsskepticismNo, a study did not link genetically engineered crops to 22 diseases<p>In my Facebook feed, a friend posted a very <a href="http://www.organic-systems.org/journal/92/JOS_Volume-9_Number-2_Nov_2014-Swanson-et-al.pdf" target="_blank">scary-looking study</a> that links genetically engineered (GE) crops to the rise in 22 diseases. These are pretty fearsome diseases, too, like bile duct cancer and pelvis cancer. For instance<sup>1</sup>:</p> <p><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjNknX3GcBA9NlMWTPABN-6sjf1UYhEI7fR1dCfcr8zzkLD5IEqPrho_sfSQjhKQOs3blUoqz0qa2l03JvpKqEhhrbhm6sCVEgNr9TAxTELsCq9wpcDHCmsCBEVaV8hTMJuvRQq4dadDTE/s1600-h/image2.png"><img title="image" style="border-left-width: 0px; border-right-width: 0px; background-image: none; border-bottom-width: 0px; padding-top: 0px; padding-left: 0px; margin: 0px; display: inline; padding-right: 0px; border-top-width: 0px" border="0" alt="image" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhzpqTmyG5CzyiHVMKUIk7S84wH2pN59YYIGDXgn-2IidvJS08ZtrVB9EsI5y8aNoNELfAUgFsCZW64Nf4H4U-JSVDfDGJgZoQL3K6Fat7EkU12cwPJRwCNqwklSEa7VF0fyrRSDraA5pc/?imgmax=800" width="244" height="183"></a><br>There are a few ways to respond to this article:</p> <p>First, it has not passed my attention that the second author has published a book Myths of Safe Pesticides, which has been <a href="http://www.sciencebasedmedicine.org/pesticides-just-how-bad-are-they/" target="_blank">analyzed and debunked by Harriet Hall</a>.</p> <p>Second, I could just say "correlation is not causation." QED. Article debunked, and can be swept to the dustbin.</p> <p>Third, I can point out the <a href="http://kfolta.blogspot.com/2013/02/organic-food-causes-autism.html" target="_blank">correlation between sales of organic produce and autism</a>. (Yikes!) 
In fact, using the methods of this article, I can probably prove a significant correlation between sales of organic produce and bile cancer, kidney cancer, autism, or lipoprotein disorder deaths. We can all grab our glyphosate-coated pitchforks and demand reform!</p> <p>However, I think there are some statistical lessons here, and it's sometimes good to deconstruct some misused and abused statistics. And trust me, the statistics in this article are seriously misused. In fact, it might be an interesting project for an introductory graduate statistics class to collect articles like this and critique them. I'll do it for fun here. There are others who can speak to the scientific aspects of the article, like how it disagrees with the results of a review of over a trillion meals fed that incorporated GE products. There are also other quibbles with the article, like how it sometimes conflates pesticide discussions with glyphosate (an herbicide), that others can deconstruct.</p> <p>When deciding how to summarize and analyze data statistically, it is essential to work with the nature of the data. This article fails on several counts. First, it smashes together data from two completely different sources without considering how the data are related. Now, I'm generally excited to see data from disparate sources linked and analyzed together, but it has to be done carefully. This is how they obtained their data on GE use:</p> <blockquote> <p>From 1990-2002, glyphosate data were available for all three crops, but beginning in 2003 data were not collected for all three crops in any given year. Data on the application rates were interpolated for the missing years by plotting and calculating a best fit curve. Results for the application rates for soy and corn are shown in Figures 2 and 3. Because the PAT was relatively small prior to about 1995, the sampling errors are much larger for pre-1995 data, more so for corn than for soy. Also, data were not missing until 2003 for soy and 2004 for corn. For these reasons, the interpolated curves begin in 1996 for soy and 1997 for corn in Figures 2 and 3.</p></blockquote> <p>This is how they obtained epidemiological data:</p> <blockquote> <p>Databases were searched for epidemiological data on diseases that might have a correlation to glyphosate use and/or GE crop growth based on information given in the introduction. The primary source for these data was the Centers for Disease Control and Prevention (CDC). These data were plotted against the amount of glyphosate applied to corn and soy from Figure 6 and the total %GE corn and soy crops planted from Figure 1. The percentage of GE corn and soy planted is given by: (total estimated number of acres of GE soy + total estimated number of acres of GE corn)/(total Estimated acres of soy + total estimated acres of corn)x100, where the estimated numbers were obtained from the USDA as outlined above.</p></blockquote> <p>This seems innocent enough, but there's already a lot of wrong happening here. It's good that they explained some of their data cleaning, though we can always stand for more transparency behind this step. It's not scientifically glorious to describe how you handle missing or sparse data, but mishandling it can certainly sink your Nobel prize work. It's also good to explain derived variables, though I haven't gone back and checked their math.</p> <p>The first fatal error is how they link the data. They simply merge it by year. It's the obvious-seeming step that already tanks their analysis.
This is the same kind of merging that links, say, sales of organic crops to autism. Mashing up data needs to be done in a scientifically valid way, and simply merging disparate data by year isn't going to cut it here. All the data they gathered are crude summaries, and they just strung them together by year without giving any thought to whether the subjects in the epidemiological database have any connection to the subjects in the GE database. Sloppy, and that right there can be enough to tank any analysis, even if the analysis were well done. Which this one wasn't.</p> <p>The second fatal error is how they present the data. Take Figure 16 above. This graph breaks so many rules of data presentation that Edward Tufte's head would probably explode just from looking at it. But let's dig a little deeper. The authors say they plotted incidence of disease (in Figure 16 it's age-adjusted deaths due to lipoprotein disorder) against GE and glyphosate use. However, if you want to get technical about it, they plot all three of these versus time. This is a very important distinction. If they plotted incidence versus GE use, they would put GE use on the x-axis. Instead, they show incidence in a bar graph by time, GE use in a line graph by time, and glyphosate use by time. I'll explain why this is important in the discussion of the third fatal flaw. But let's move ahead with the graph. From what I've been able to figure out, the left y-axis goes with the bar graph and is in deaths per hundred thousand. The axis on the right does double duty and covers both % of GE crops planted and 1000s of tons of glyphosate used. It took me a while to figure that out, and it's very sloppy design anyway (the two scales have nothing to do with each other). If you ever see a line plot with both a left and a right y-axis, get skeptical. Here, the left axis starts at 0 and ends at 2.75 or so, and the right axis starts at -20 (!) and ends at 85 or so. I can see why they chose the left axis, but the right axis is very curious. The -20 is a terrible choice for the start of the right axis: it's an invalid value both for % of GE crops planted and for 1000s of tons of glyphosate used. "Yes, Monsanto, I used -20,000 tons of glyphosate. You owe me 50,000ドル." It seems that the origin and scale of the right y-axis were chosen specifically to make GE and glyphosate use appear to track closely with deaths. I usually choose incompetence over malice to explain motivations, but it's very challenging to support incompetence in this case. It takes talent and/or effort to choose axes like this. I'll leave a deconstruction of the other graphs as an exercise, perhaps for your graduate-level stats class.</p> <p>The third and final fatal error is how they analyze the data. Their analysis is the statistical equivalent of bringing a knife to a gunfight. They basically take all the GE and epidemiological data, ignore the time component, and send it through your Stat 101 Pearson correlation estimator formula. They construct some p-values, unsurprisingly find a massively small p-value, declare victory, and hit the publish button. Problem is, they compute the wrong statistical summary using the wrong formula and use it to make the wrong inference. The Pearson correlation estimator they use is designed for independent data, not time series data (and they know it's time series data, because they say so on p. 11). Time series data have a complex correlation structure, and thus estimating second-order parameters like correlations is a bit of a challenge.</p>
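<p>(A quick simulation, using made-up random walks rather than the paper's data, shows how often two completely unrelated trending series pass the naive test:)</p>
<pre>
set.seed(2014)
spurious <- replicate(5000, {
  x <- cumsum(rnorm(20))   # one autocorrelated series (a random walk)
  y <- cumsum(rnorm(20))   # an independent one
  cor(x, y)                # naive Pearson correlation
})
# |r| > 0.444 is "significant" at the 5% level for n = 20 independent pairs
mean(abs(spurious) > 0.444)   # far above 0.05: the naive test is badly miscalibrated
</pre>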
<p>The third and final fatal error is how they analyze the data. Their analysis is the statistical equivalent of bringing a knife to a gunfight. They basically take all the GE and epidemiological data, ignore the time component, and send it through your Stat 101 Pearson correlation estimator formula. They construct some p-values, unsurprisingly find a vanishingly small p-value, declare victory, and hit the publish button. The problem is, they compute the wrong statistical summary using the wrong formula and use it to make the wrong inference. The Pearson correlation estimator they use is designed for independent data, not time series data (and they know it's time series data, because they say so on p. 11). Time series data have a complex correlation structure, and thus estimating second-order parameters like correlations is a bit of a challenge. For instance, GE use this year is going to be heavily correlated with GE use last year, as are deaths from lipoprotein disorders. Does the correlation reflect a relationship between death and GE use, or between death this year and death last year? The naïve estimate assumes the correlation is between death and GE use, and takes no account of the relationship between deaths this year and last year (in the stat world we call this autocorrelation). Though I haven't done the math, my guess is that the correlation between death and GE use would be greatly reduced, if not disappear altogether, if time were taken into account. And even if there is a nonzero, significant correlation, the fact of the matter is that there needs to be a stronger link than time between the GE data and the epidemiological data.</p>
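<p>This, too, is easy to demonstrate. The simulation sketch below, which uses made-up series rather than the paper's numbers, runs the Stat 101 correlation test on pairs of independent random walks, which share nothing but their autocorrelation:</p>
<pre>
# How often does a naive Pearson test "find" a link between two series that
# are independent by construction? (20 years, roughly the paper's span)
set.seed(2014)
naive <- replicate(1000, {
  x <- cumsum(rnorm(20))   # random walk standing in for glyphosate/GE use
  y <- cumsum(rnorm(20))   # an independent walk standing in for deaths
  cor.test(x, y)$p.value < 0.05
})
mean(naive)   # far above the nominal 5%: the classic spurious-correlation trap

# The same test on year-over-year changes, which strips out the shared trend
honest <- replicate(1000, {
  x <- cumsum(rnorm(20)); y <- cumsum(rnorm(20))
  cor.test(diff(x), diff(y))$p.value < 0.05
})
mean(honest)  # back near 5%, as a valid test should be
</pre>
<p>Differencing is the crudest possible fix, and a real analysis would model the time structure directly, but even the crude fix shows why a whopping Pearson p-value on two trending series means very little.</p>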
<p dir="ltr">Each conference seems to have a life of its own, so I tend to reflect on each one. Here's my reflection on this year's:</p>
<p dir="ltr">First, being in Montreal, most of us couldn't use smartphones. Thankfully, <a href="http://www.revolutionanalytics.com">Revolution Analytics</a> sponsored free WiFi. They also do great work with <a href="http://www.r-project.org">R</a>. So we were all for the most part able to tweet.</p>
<p dir="ltr">The quality of talks was pretty good this year, and I've learned a lot. We even had one person describe simulations with a flowchart rather than indecipherable equations, and I strongly encourage that practice.</p>
<p dir="ltr">As a member of the biopharmaceutical section, I was struck by how few people take advantage of our awards. Of course, everybody giving a contributed or topic contributed talks is automatically entered into the best contributed paper competition. But we have a poster competition and student paper competition that have to be explicitly entered, and participation is low. This is a great opportunity.</p>
<p dir="ltr">The highlight of the conference, of course, was Nate Silver's talk, and he delivered admirably. The perhaps thousand statisticians in attendance needed the message: learn to communicate with journalists and teach them numbers need context. I also like his response to the question "statistician or data scientist?" Which was, of course, "I don't care what you call yourself, just do good work."<br><br></p>
http://realizationsinbiostatistics.blogspot.com/2013/08/joint-statistical-meetings-2013.htmlnoreply@blogger.com (Anonymous)tag:blogger.com,1999:blog-6439806875881311576.post-36718790019002616712013年7月15日 12:45:00 +00002013年07月15日T08:45:00.606-04:00Bayesian statisticsWasserman on noninformative priors<p>Larry Wasserman calls the use of <a href="http://normaldeviate.wordpress.com/2013/07/13/lost-causes-in-statistics-ii-noninformative-priors/">noninformative priors</a> a "lost cause." I agree, both for the reasons he states and because there are always better alternatives anyway. At the very least, there are heavy-tailed "weakly informative priors" that put nearly all their weight on something reasonable, such as small to moderate values of a variance, and little weight on stupid values, such as means on the order of 10<sup>100</sup>.</p>
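<p>To put a picture to that, here is a quick R sketch of one such weakly informative prior, a half-Cauchy on a standard deviation. (The scale of 2.5 is just one popular choice, not a recommendation for any particular problem.)</p>
<pre>
# Density of a half-Cauchy(0, 2.5) prior on a standard deviation: most of
# the mass sits on modest values, but the heavy tail never rules anything out
sigma <- seq(0.01, 50, by = 0.01)
plot(sigma, 2 * dcauchy(sigma, location = 0, scale = 2.5), type = "l",
     xlab = expression(sigma), ylab = "prior density")

2 * (pcauchy(10, 0, 2.5) - 0.5)  # about 0.84 of the prior mass lies below 10
</pre>
<p>A flat prior, by contrast, solemnly assigns a standard deviation of 10<sup>100</sup> the same plausibility as a standard deviation of 1.</p>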
<p>However, they'll be around for years to come. Noninformative priors are nice security blankets, and we get to think that we are approaching a problem with an open mind. I guess open minds can have stupid properties as well.</p> <p>I hope, though, that we will start thinking more deeply about the consequences of our assumptions, especially about noninformative priors, rather than just feeling nice about them.</p> http://realizationsinbiostatistics.blogspot.com/2013/07/wasserman-on-noninformative-priors.htmlnoreply@blogger.com (Anonymous)tag:blogger.com,1999:blog-6439806875881311576.post-66323823367479064412013年4月29日 03:21:00 +00002013年04月28日T23:21:10.452-04:00moocMOOCs–a low-risk way to explore outside your field<p>One of the things I'm realizing from Massively Open Online Courses (MOOCs) -- those free online classes from universities that seem to have sprung up from almost nowhere in the last year and a half -- is that they offer a perfect opportunity to explore outside my field. At first (and this was even before the term MOOC was coined), I took classes that were just outside my field. For instance, I've been in clinical and postmarketing pharmaceutical statistics for over 10 years, and my first two classes were in databases and machine learning. I did this because I was aching to learn something new, but I figured that with a class in databases I could make our database guys in IT sweat a bit just by dropping some terms and showing some understanding of the basics. It worked. In addition, I wanted to understand what this machine learning field was all about, and how it was different from statistics. I accomplished that goal, too.</p> <p>Since then, I have taken courses in artificial intelligence/machine learning, sociology and networks, scientific computing (separately from statistical computing), and even entrepreneurship. I have also encouraged others to take part in MOOCs, though I don't know the result of that. Finally, I have come back to some classes I've already taken as a community TA, a former student who actively takes part in discussions to help new students through the class.</p> <p>This is all valuable experience, and I could write several blog entries on the benefits. The main one right now is the feeling that I'm coming up for air and taking in other points of view in a low-risk way. For example, though I don't actively use Fourier analysis in my own work, one recent class and one current class both use it to do different things (solve differential equations and process signals). Because these classes involve programming assignments, I've now deepened my understanding of the spectral theorem, which I had only studied from a theoretical point of view in graduate school. I'm also thinking about this work from the point of view of time series analysis, which is helping me think about some problems involving longitudinal data at work.</p> <p>From a completely different standpoint, another class helped me think about salary negotiations in terms of expected payoff, i.e., weighing the probability that an offer is accepted against the salary offered. That invited a further look at the value of the job versus what I would be paid, and at the insecurity of moving to a different job. In the end, I turned down what would have been a pretty good offer, because I decided it did not compensate for the risks I was incurring. The cool thing is that this was all a matter of applying concepts I already understood (expected value, expected payoff) in a different way from what I was already doing.</p> <p>The best thing about MOOCs is that the risk is low. All that is required is an internet connection and a decent computer. Some math courses may require a better computer to do high-powered math, but I've seen few that require expensive textbooks or expensive software. Even Mathworks is now offering Matlab at student pricing to people taking some classes, and Octave remains a free option for those unable to take advantage of that. And if you are unable to keep up with the work, there is no downside: you can simply unenroll.</p> http://realizationsinbiostatistics.blogspot.com/2013/04/moocsa-low-risk-way-to-explore-outside.htmlnoreply@blogger.com (Anonymous)tag:blogger.com,1999:blog-6439806875881311576.post-29751345765021856682013年4月16日 02:33:00 +00002013年04月15日T22:33:14.288-04:00RRStudioRStudio is reminding me of the older Macs<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg6nSS_aWY0lp2UrJAJ4aHGxe4jjD4REzWXi80IK5pXh4AzsOG1JQfN8TZZU9xicUhlRgfsxVD5B4pxuHefOEtZJzHwU-mAVWSkWP3HJkiKjxYJTxb-_zi3wsmwbCcyghI0WKu1H2PPjc8/s1600/rbomb.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="212" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg6nSS_aWY0lp2UrJAJ4aHGxe4jjD4REzWXi80IK5pXh4AzsOG1JQfN8TZZU9xicUhlRgfsxVD5B4pxuHefOEtZJzHwU-mAVWSkWP3HJkiKjxYJTxb-_zi3wsmwbCcyghI0WKu1H2PPjc8/s320/rbomb.png" width="320" /></a></div>
The only thing missing is the cryptic ID number.<br />
<br />
Well, the only bad thing is that I am trying to run a probabilistic graphical model on some real data, and having a crash like this will definitely slow things down.http://realizationsinbiostatistics.blogspot.com/2013/04/rstudio-is-reminding-me-of-older-macs.htmlnoreply@blogger.com (Anonymous)tag:blogger.com,1999:blog-6439806875881311576.post-82857686383623112592013年3月30日 18:21:00 +00002013年03月30日T14:21:11.312-04:00PresentationsPresenting without slides<p>Tired of slides, I’ve been experimenting with different ways of presenting. At the recent Conference on Statistical Practice, I decided only to use slides for an outline and references. As it turns out, the most critical feedback I got had to do with the fact that the audience couldn’t follow the organization because I had no slides.</p> <p>I tried presenting without slides because, well, I started to use them as a crutch. I also saw a lot of people presenting essentially by putting together slides and reading from them. So I figured I would expand my horizons.</p> <p>Next time I present, I’ll do slides, I guess, but I may try something a bit different.</p> http://realizationsinbiostatistics.blogspot.com/2013/03/presenting-without-slides.htmlnoreply@blogger.com (Anonymous)tag:blogger.com,1999:blog-6439806875881311576.post-37089678607007896652013年3月27日 20:49:00 +00002013年03月27日T16:49:52.131-04:00learning from datamachine learningmoocLast session of Caltech's Learning from Data course starts April 2<div class="tr_bq">
I just received this email:</div>
<br />
<blockquote>
<span style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">Caltech's Machine Learning MOOC is coming to an end this spring, with the final session starting on April 2. There will be no future sessions. The course has attracted more than 200,000 participants since its launch last year, and has gained wide acclaim. This is the last chance for anyone who wishes to take the course (</span><a href="http://work.caltech.edu/telecourse" style="background-color: white; color: #1155cc; font-family: arial, sans-serif; font-size: 13px;" target="_blank">http://work.caltech.edu/<wbr></wbr>telecourse</a><span style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">).</span><br style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;" /><span style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">Best.</span><br style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;" /><span style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">The Caltech Team</span></blockquote>
I strongly <a href="http://realizationsinbiostatistics.blogspot.com/2013/03/review-of-caltech-learning-from-data-e.html" target="_blank">recommend this course</a> if you can take it, even if you have taken other machine learning classes. It lays a great theoretical foundation for machine learning, sets it off nicely from classical statistics, and gives you some experience working with data as well.<br />
<br />
If you were for some reason waiting for the right time, it looks to be now or never.http://realizationsinbiostatistics.blogspot.com/2013/03/last-session-of-caltechs-learning-from.htmlnoreply@blogger.com (Anonymous)tag:blogger.com,1999:blog-6439806875881311576.post-67948861130248872622013年3月21日 02:47:00 +00002013年03月20日T22:47:32.637-04:00machine learningmoocReview of Caltech's Learning from Data e-course<p dir="ltr">Caltech has an online course <a href="http://work.caltech.edu/telecourse.html">Learning from Data</a>, taught by Professor Yaser Abu-Mostafa, that seeks to make the course material accessible to everybody. Unlike most of the online courses I've taken, this one is independently offered through a platform created just for the class. I took the course for its second offering in Jan-March 2013.</p>
<p dir="ltr">The platform on which the course is offered isn't as slick as Coursera. The lectures are offered through a Youtube playlist, and the homeworks are graded through multiple choice. That's perhaps a weakness of the class, but somehow the course faculty made it work.</p>
<p dir="ltr">The class's content was its strong point. Abu-Mostafa weaved theory and pragmatic concerns throughout the class, and invited students to write code in just about any platform (I, of course, chose <a href="http://www.r-project.org">R</a>) to explore the theoretical ideas in a practical setting. Between this class and Andrew Ng's Machine Learning class on the Coursera platform, a student will have a very strong foundation to apply these techniques to a real-world setting.</p>
<p dir="ltr">I have only one objection to the content, which came in the last lecture. In his description of Bayesian techniques, he claimed that in most circumstances you could only model a parameter with a delta function. This, of course, falls in line with the frequentist notion that you have a constant, but unknowable "state of nature." I felt this way for a long time, but don't really believe it any more in a variety of contexts. I think he played up the Bayesian v. frequentist squabble a bit much, which may have been appropriate 20 years ago but is not so much an issue now.</p>
<p dir="ltr">Otherwise, I found the perspective from the course extremely valuable, especially in the context of supervised learning.</p>
<p dir="ltr">If you plan on taking the course, I recommend leaving a lot of time for it or having a very strong statistical background.</p>
http://realizationsinbiostatistics.blogspot.com/2013/03/review-of-caltech-learning-from-data-e.htmlnoreply@blogger.com (Anonymous)tag:blogger.com,1999:blog-6439806875881311576.post-46012305995877324722013年3月12日 23:12:00 +00002013年03月12日T19:12:06.753-04:00practice of statisticsRDistrust of R<p dir="ltr">I guess I've been living in a bubble for a bit, but apparently there are a lot of people who still mistrust R. I got asked this week why I used R (and, specifically, the package rpart) to generate classification and regression trees instead of SAS Enterprise Miner. Never mind the fact that rpart code has been around a very long time, and probably has been subject to more scrutiny than any other decision tree code. (And never mind the fact that I really don't like classification and regression trees in general because of their limitations.)</p>
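<p dir="ltr">For the curious, here is a minimal sketch of the kind of rpart call in question, run on a built-in dataset rather than the project's data:</p>
<pre>
# Fit and inspect a classification tree with rpart, a recommended package
# that ships with standard R distributions
library(rpart)
fit <- rpart(Species ~ ., data = iris, method = "class")
printcp(fit)                 # cross-validated error at each tree complexity
plot(fit, margin = 0.1)      # dendrogram of the fitted tree
text(fit, use.n = TRUE)      # label the splits and leaf counts
</pre>
<p dir="ltr">Hardly exotic: the same basic ideas sit behind any point-and-click decision tree tool.</p>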
<p dir="ltr">At any rate, if someone wants to pay the big bucks for me to use SAS Enterprise Miner just on their project, they can go right ahead. Otherwise, I have got a bit of convincing to do.</p>
http://realizationsinbiostatistics.blogspot.com/2013/03/distrust-of-r.htmlnoreply@blogger.com (Anonymous)tag:blogger.com,1999:blog-6439806875881311576.post-61114594030425732512013年3月01日 01:49:00 +00002013年02月28日T20:49:55.430-05:00publishingsciencestatistical leadershipBad statistics in high impact journals<p dir="ltr"><a href="http://blogs.discovermagazine.com/neuroskeptic/2013/02/19/better-journals-worse-statistics/#.UTAHJ1Mo6aw">Better Journals… Worse Statistics? : </a><a href="http://blogs.discovermagazine.com/neuroskeptic/2013/02/19/better-journals-worse-statistics/#.UTAHJ1Mo6aw">Neuroskeptic</a></p>
<p dir="ltr">In the linked blog entry, Neuroskeptic notes that high impact journals often have fewer statistical details than other journals. The research reported in these journals is often heavily amended, if not outright contradicted, by later research. I don't think this is nefarious, though, nor is it worthless. The kind of work reported in Science and Nature, for instance, generates interest and, therefore, more scrutiny (funding, studies, theses, etc.).</p>
<p dir="ltr">But as with all other research, if statistical details are included it might direct subsequent research in these topics a bit better.</p>
http://realizationsinbiostatistics.blogspot.com/2013/02/bad-statistics-in-high-impact-journals.htmlnoreply@blogger.com (Anonymous)