Thursday, January 14, 2016
Talk to the Upstate Data Science Group on caret
It was great to get out of my own comfort zone a bit (since graduate school, I've really only given talks on topics in biostatistics) and to meet statisticians, computer scientists, and other sorts of data scientists from many different fields. This is a relatively new group, and given the interest over the last couple of months, I think it has been sorely needed in the Upstate South Carolina region.
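For anyone who hasn't used it, the caret package provides one consistent interface for training and tuning a wide range of models in R. Here is a minimal sketch, not the talk material itself; the random forest and the built-in iris data are just illustrations:

```r
# A minimal caret example: train() gives one interface to many model
# types, with resampling-based tuning handled for you. The model and
# data choices here are illustrative only.
library(caret)
set.seed(123)
fit <- train(Species ~ ., data = iris,
             method = "rf",  # random forest (requires the randomForest package)
             trControl = trainControl(method = "cv", number = 5))
fit  # prints cross-validated accuracy across the candidate tuning values
```

Swapping method = "rf" for another model type reuses the same call, which is the package's main selling point.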
We'll be participating in Open Data Day in March of this year, so if you are in the Upstate SC region, or don't mind making the trek from Columbia or western NC, find us on Meetup. Our next meeting is a data hack night, which promises to be interesting.
Wednesday, November 25, 2015
Even the tiniest error messages can indicate an invalid statistical analysis
Word to the wise: track down the reasons for even the most innocuous-seeming warnings. Every stage of a statistical analysis is important, and small errors anywhere along the way can have huge consequences downstream. Perhaps this is obvious, but you still have to slow down and take care of the details.
(Note that I'm editing this to be a part of my Little Debate series, which discusses the tiny decisions dealing with data that are rarely discussed or scrutinized, but can have a major impact on conclusions.)
Wednesday, August 7, 2013
Joint statistical meetings 2013
Every year, the first week of August, we statisticians meet to get our statistics, networking, dancing, and beer on. With thousands in attendance, it's exhausting. I wonder about the quality of statistical work the second week of August.
Each conference seems to have a life of its own, so I tend to reflect on each one. Here's my reflection on this year's:
First, being in Montreal, most of us couldn't use our smartphones. Thankfully, Revolution Analytics (who also do great work with R) sponsored free WiFi, so for the most part we were all able to tweet.
The quality of talks was pretty good this year, and I learned a lot. One speaker even described simulations with a flowchart rather than indecipherable equations, and I strongly encourage that practice.
As a member of the biopharmaceutical section, I was struck by how few people take advantage of our awards. Of course, everybody giving a contributed or topic-contributed talk is automatically entered into the best contributed paper competition. But we also have a poster competition and a student paper competition that have to be entered explicitly, and participation is low. These are great opportunities.
The highlight of the conference, of course, was Nate Silver's talk, and he delivered admirably. The thousand or so statisticians in attendance needed the message: learn to communicate with journalists and teach them that numbers need context. I also liked his response to the question "statistician or data scientist?", which was, of course, "I don't care what you call yourself, just do good work."
Monday, November 5, 2012
Snapshot of the statistics blogosphere
This was generated during my social network analysis project. I haven’t finished yet, but I did want to show the cute picture. The statistics blogosphere is like a school of jellyfish.
Friday, November 2, 2012
Politics vs. science and the Nate Silver controversy
I’ll take a small departure from the narrow world of biostatistics and comment on a wider matter.
Nate Silver of FiveThirtyEight has really kicked the hornet’s nest. This is a nest that really needed stirring, but I do not envy him for being the focus of attention.
This all started, I think, when he released his book and basically called political pundits out for a business model of generating drama rather than making good predictions. This wouldn’t be a huge deal, except that he has developed a statistical model that combines data from national and state polls with demographic data to project the outcomes of presidential and senatorial elections. This model, as of this writing, gives President Obama about an 81% probability of re-election, given the current state of things. As it turns out, there are a lot of people who don’t like this, and they generally fall into two camps:
1. People who would rather see President Obama defeated in the election, and
2. Pundits who have a vested interest in a dramatic “horse-race” election
I’ll add a third:
3. Pundits who want to remain relevant (whether to keep their jobs or reputations).
Frankly, I don’t think that pundits will have to worry about #3. There’s an allergy to fact in this country, a large group of people who would rather ignore established fact and cling to a fantasy. (You can find a sampling of these people over at the intelligent design blogosphere, for instance.) I think the demand for compelling stories over dry facts will remain.
I’ve run into people of the first type when I’ve published some armchair-statistician analyses based on Twitter sentiment, for instance. The responses weren’t critiques of the method, but rather, "who cares, Republicans rule!" Even more dangerous, I’ve run into similar responses to negative clinical study results in cases where sponsors had a vested interest in positive outcomes. (There was at least one case I remember where a sponsor moved forward with an expensive follow-on study, and some where I was asked to reanalyze the data a zillion times.)
Nate wrote The Signal and the Noise, where he, among a lot of explanation, points out that there is a whole cottage industry of people getting paid to BS about politics. So I think that some in the second category are starting to face an existential crisis, and that makes them dangerous.
Ultimately, we have to understand where Nate is coming from to understand his prediction. His money is literally on Obama’s victory in the election (he made a bet[1] on Twitter with "Morning Joe" Scarborough of NBC), not necessarily because he wants Obama to win, but because he has confidence in his prediction. By making the bet, he turned the controversy into more than just a war of words: he called Joe’s bluff (Joe had said that anyone not calling the race a tossup is an ideologue). We can now call him The Statistician Who Kicked the Hornet’s Nest; the punditry, including the public editor of the New York Times, which hosts his blog, is collectively attacking him.
Unfortunately, the punditry has the upper hand, because people are more interested in the narrative than the science.
[1] The bet originally consisted of the loser donating 1000ドル to charity. Nate subsequently donated 2538ドル to the Red Cross before the election.
Wednesday, October 31, 2012
Willful statistical illiteracy
The fine folks over at Simply Statistics have a very good educational article about the difference between the probability of winning an election and vote share. This article stems from a controversial column over at Politico criticizing Nate Silver and his election forecasts.
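The distinction is easy to demonstrate with a toy simulation (the numbers below are purely illustrative, not Silver’s model): a candidate whose true vote share is a modest 51% can still win well over half of simulated elections once polling uncertainty is accounted for.

```r
# Toy simulation of vote share vs. win probability. The mean and sd
# here are made-up illustrative values, not anyone's actual forecast.
set.seed(2012)
share <- rnorm(10000, mean = 0.51, sd = 0.02)  # simulated true vote shares
mean(share > 0.5)                              # win probability, roughly 0.69
```

A 51% vote share is a nail-biter; a 69% (let alone 81%) win probability is not a claim about a landslide, just a statement about uncertainty.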
Twitter responses are even worse. Conservative filmmaker John Ziegler calls Nate Silver a “hyper-partisan fraud” who is “not an expert on polls.”
Glenn Thrush mentions a "conservative 538."
And it’s not hard to find other examples.
I’ve run into this reaction a bit, especially when it comes to politics. There is a large group of people who will dismiss any evidence that goes against their beliefs. I guess the punditry wasn’t so dismissive of Silver in 2010.
At any rate, I’ll give a recommendation I rarely give: read this Politico article and the comments (ignore the "conservatives aren’t bright" nonsense, which is the same sort of thing that comes from the left).
And let’s thank Nate Silver, RealClearPolitics, and all the honest pollsters who try to shine some data on this election.
Monday, October 29, 2012
The most valuable thing about my little stat blog network project
So, I decided to construct the linking graph through blogrolls, and finally settled on using a manual process. The best part of this project is really finding out for myself all the great content out there!
Monday, October 22, 2012
SNA class proposal
I’ve been taking several classes through Coursera (nothing against the other platforms; I took two of the original three classes via Stanford and just stuck with the platform). The latest one is Social Network Analysis, which has a programming project. Here is what I have posted as a proposal:
Ok, I've been thinking about the programming project idea some, and at first I was thinking of analyzing the statistics blogging community, mostly because I belong to it and I wanted to see what comes out. The analysis below can be done for any sort of community. I've developed this idea a little further and wanted to record it here for two reasons. First, I simply need to write it down to get it out of my head and in such a way that the public can understand it. Second, I'd like feedback.
As it turns out, I took the NLP class in the spring, and I think there's some overlap that can be exploited. (This comes up nicely in the Mining the Social Web and Programming Collective Intelligence books.) There are measures of content similarity, such as cosine similarity, that are simple to compute and work reasonably well for judging how similar two pieces of content are; a minimal sketch follows the questions below. Content can then be clustered based on similarity. So, then, I have the following questions:
- What are the communities, and do they relate to clusters of content similarity?
- If so, who are the "brokers" between different communities, and what do they blog about? There are a couple of aggregators, such as StatBlogs and R-Bloggers, that I imagine would glue together several communities (that's their purpose and value), but I imagine there are a few others that mix aggregation with commentary as well. Original content generators, like my blog, will probably be on the edges.
- Is it better to threshold edges based on a minimum number of mentions, or to use an edge weight based on the number of mentions?
- If I have time, I may try to do some sort of topic or named entity extraction, and get an automated way of seeing what these different communities are talking about.
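For the content-similarity step, here is a minimal base-R sketch of cosine similarity on raw term frequencies. The post texts are hypothetical, and a real run would want stop-word removal, stemming, and tf-idf weighting:

```r
# Cosine similarity between two documents using raw term frequencies.
# Base R only; the example texts are hypothetical placeholders.
tf <- function(text) {
  words  <- unlist(strsplit(tolower(text), "[^a-z']+"))
  words  <- words[nchar(words) > 0]
  counts <- table(words)
  setNames(as.numeric(counts), names(counts))  # named term-frequency vector
}

cosine_sim <- function(a, b) {
  vocab <- union(names(a), names(b))   # combined vocabulary
  va <- a[vocab]; va[is.na(va)] <- 0   # align vectors, zero-fill missing terms
  vb <- b[vocab]; vb[is.na(vb)] <- 0
  sum(va * vb) / sqrt(sum(va^2) * sum(vb^2))
}

post1 <- "Bayesian adaptive designs for clinical trials"
post2 <- "Adaptive clinical trial designs from a Bayesian view"
cosine_sim(tf(post1), tf(post2))  # higher for more similar content
```

Pairwise similarities like this could then feed the clustering step, and the clusters compared against the communities found in the link graph.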
Wednesday, August 15, 2012
Statisticians need marketing
We should be more "big tent" about statistics. ASA President Robert Rodriguez nailed this in his speech at JSM. Whenever someone does something with data, we should claim them as a statistician. Sometimes this will lead to claiming people we don't necessarily agree with, but the big tent approach is exactly what is allowing CS and other disciplines to overtake us in the data era.
Apparently, the idea of data mining was rejected by most statisticians about 30 years ago, and it has since found a home in computer science. Now data science is growing out of computer science, and analytics seems to be growing out of some hybrid of computer science and business. The embrace of the terms "data science" and "analytics" puzzled me for a long time, because these fields seemed to be just statistics, data curation, and understanding of the application. (I recognize now that there is more to it, especially the strong computer science component in big data applications.) I now see the tension between statisticians and practitioners of these related fields, and the puzzlement remains. Statisticians have a lot to contribute to these blooming fields, and we damn well better get to it.
We should try to forge relationships with start-up companies and encourage our students to pursue industry and start-up opportunities if they have the interest. The less insular we are within the academic community, the more high-profile we will be. Stanford has this down to a business plan, as do some other universities. This trail is being blazed, and we can just hop on it.
It would be awesome if we started a statistical literacy outreach program in communities around the U.S. We could offer free courses in community centers to teach people how to understand polling data, the census, weather reports, or anything else touching data. Statistics Without Borders is a great place to do this. Perhaps SWB needs better marketing as well?
Thursday, August 2, 2012
JSM 2012 in the rearview: reflections on the world's largest gathering of statisticians
The Joint Statistical Meetings are an annual gathering of several large professional organizations of statisticians, and each year we descend on some city to share ideas. I'm a perennial attendee, and I always find the conference valuable in several ways. I have a few thoughts about the conference in retrospect:
* For me, networking is much more important than talks. Of course, attending talks in your area of interest is a great way of finding people with whom to network.
* I'm very happy I volunteered with the biopharmaceutical section a couple of years ago. It's a lot of work, but rewarding.
* This year, I specifically went to a few sections out of my area, and found the experience valuable.
* I definitely recommend chairing a session or speaking.
* I also recommend roundtable lunches. I did one for the first time this year, and found the back and forth discussion valuable.
In short, I find connecting with like-minded professionals to be an important part of my career and my development as a person.
Friday, December 16, 2011
It’s not every day a new statistical method is published in Science
I’ll have to check this out: Maximal Information-based Nonparametric Exploration (MINE, har har). Here is the link to the paper in Science.
I haven’t looked at this very much yet. It appears to be a way of weeding out potential variable relationships for further exploration. Because it’s nonparametric, relationships don’t have to be linear, and spurious relationships are controlled with a false discovery rate method. A jar file and an R file are both provided.
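To see why a nonparametric dependence measure is appealing, here is a minimal base-R illustration (not the MINE code itself) of a strong but non-monotone relationship that the usual correlations miss entirely:

```r
# A strong, nearly deterministic relationship that both Pearson and
# Spearman correlation miss because it is not monotone.
set.seed(1)
x <- seq(-1, 1, length.out = 200)
y <- x^2 + rnorm(200, sd = 0.05)
cor(x, y)                        # Pearson: near 0
cor(x, y, method = "spearman")   # Spearman: also near 0
# A MINE-style statistic is designed to flag exactly this kind of
# pair for further exploration.
```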
Wednesday, September 14, 2011
Help! We need statistical leadership now! Part I: know your study
It’s time for statisticians to stand up and speak. This is a time when most scientific papers are "probably wrong," and many of the reasons listed are statistical in nature. A recent paper in Nature Neuroscience noted a major statistical error in a disturbingly large number of papers. And a recent interview in Science with Deborah Zarin, director of ClinicalTrials.gov, revealed the very disturbing fact that many primary investigators and study statisticians did not understand their trial designs or the conclusions that could be drawn from them.
Recent efforts to address these problems have primarily been concerned with financial conflicts of interest. Indeed, disclosure of financial conflicts of interest has improved the reporting of results. However, there are other sources of error that we have to consider.
A statistician responsible for a study has to be able to explain the study design and state what conclusions can be drawn from it. I would prefer that we dig into this problem a little deeper and determine why the failures are occurring (and fix them!). I have a few hypotheses:
- We are putting junior statisticians in positions of responsibility before they are experienced enough
- Our education is filled with classical statistics, which is insufficient for current clinical trial needs involving adaptive trials, modern dose-finding, or comparisons of interactions
- The demand for statistical services is so high, and the supply so low, that statisticians are spread too thin and simply don’t have the time to put in the sophisticated thought these studies require
- Statisticians feel hamstrung by the need to explain everything to their non-statistical colleagues and lack the common language, time, or focus to do so effectively
I’ve certainly encountered all of these different situations.
Tuesday, September 13, 2011
The statistical significance of interactions
Nature Neuroscience recently pointed out a statistical error that has occurred over and over in science journals: declaring that two effects differ because one is statistically significant and the other is not, rather than testing the difference itself. Ben Goldacre explains the error in some detail and gives his cynical interpretation. Of course, I’ll apply Hanlon’s razor to the situation (unlike Ben), but I do want to explain the impact of these errors.
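Here is a minimal simulated sketch of the error (all data below are fabricated for illustration). Both groups have the same true treatment effect, yet separate per-group tests can disagree on significance; the correct approach is to test the interaction in a single model:

```r
# Two groups with the SAME true treatment effect. Separate tests can
# disagree on significance by chance; the interaction test is the
# right way to compare the effects.
set.seed(42)
n <- 30
a_ctrl <- rnorm(n); a_trt <- rnorm(n, mean = 0.5)  # group A
b_ctrl <- rnorm(n); b_trt <- rnorm(n, mean = 0.5)  # group B, same true effect

# Wrong: run two tests, then compare their significance.
t.test(a_trt, a_ctrl)$p.value
t.test(b_trt, b_ctrl)$p.value
# One can land below 0.05 and the other above it purely by chance.

# Right: one model, testing the group-by-treatment interaction.
dat <- data.frame(
  y     = c(a_ctrl, a_trt, b_ctrl, b_trt),
  treat = rep(rep(0:1, each = n), times = 2),
  group = rep(c("A", "B"), each = 2 * n)
)
summary(lm(y ~ treat * group, data = dat))  # look at the treat:groupB row
```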
It’s easy to forget when you’re working with numbers that you are trying to explain what’s going on in the world. In biostatistics, you try to explain what’s going on in the human body. If you’re studying drugs, statistical errors affect people’s lives.
Where these studies are published is also important. Nature Neuroscience is a widely read journal, and a large number of articles in this publication commit the error. I wonder how many articles in therapeutic-area journals make this mistake. These are the journals that affect day-to-day medical practice, and if the Nature Neuroscience error rate holds, that is disturbing indeed.
Honestly, when I read the linked articles, I was dismayed but not surprised. We statisticians often give confusing advice on how to test for differences, and we probably overplay the meaning of "statistically significant."
We statisticians have to assume the leadership position here in the design, analysis, and interpretation of studies.
Saturday, August 6, 2011
Has statistics become a dirty word?
Joint Statistical Meetings 2011: a reflection
And I managed to avoid a sunburn from the Miami sun this time.
Thursday, October 21, 2010
Belated world statistics day ruminations
First, I'm very happy that statistical reasoning is getting more airtime in the news. It's about time. While not everyone needs to be a statistician, I think it is within everyone's capability to learn enough about statistics to understand the increasing number of statistical arguments (and lack thereof) in the world around us. For example, the chart in this image was made by my 4-year-old son. Certainly, his father is a statistician, but there is no reason why first and second graders can't make similar charts and start to draw conclusions from them. Later grades can build on this exercise so that a basic understanding of statistics is achievable by the end of high school. The alternative is that too many people (even in scientific disciplines) fall vulnerable to anecdotal or even superstitious arguments. (Case in point: Jenny McCarthy's anti-vaccine campaign.)
I am pleased that SAS is pushing its educational agenda for mathematics and technology at the secondary school level, and that Revolution Analytics has made its premium Revolution R product free for all academics. Like these companies, I have been displeased with the state of statistical and technological education in grade schools and even undergraduate programs. Let's all work together to make this important tool accessible to everybody, as statistical reasoning is set to become an essential part of civic participation.
Thursday, September 9, 2010
What does likability have to do with statistics?
Wednesday, September 23, 2009
Beginning with the end in mind when collecting data: having data vs. using data
I'll give a recent example. I was asked to calculate the number of days a person was supposed to take a drug. We had the start date and end date, so it should have been as easy as end - start + 1. However, to complicate matters, we were asked to account for days when the investigator told the subject to lay off the drug. That data was collected as free text, so, for example, it could show up as any of the following:
- 3 days beginning 9/22/2009
- 9/22/2009-9/24/2009
- 9/22-24/2009
- Sept 22, 2009 - Sept 24, 2009
We ended up writing code to parse all of these variations, but we should not have had to. One person reviewing the data collection, with the knowledge that this data would have to be analyzed, would have immediately and strongly recommended that the data be collected in a structured format for the statistician to analyze at the end of the trial.
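For contrast, here is a minimal sketch of what the calculation looks like when the interruptions are captured as structured dates (the data frames and values below are hypothetical):

```r
# With structured start/stop dates, exposure is simple date arithmetic.
# All values here are hypothetical.
dosing   <- data.frame(start = as.Date("2009-09-01"),
                       end   = as.Date("2009-09-30"))
off_drug <- data.frame(off_start = as.Date("2009-09-22"),
                       off_end   = as.Date("2009-09-24"))  # one row per interruption

days_total <- as.numeric(dosing$end - dosing$start) + 1
days_off   <- sum(as.numeric(off_drug$off_end - off_drug$off_start) + 1)
days_total - days_off  # 30 - 3 = 27 days on drug
```

No regular expressions, no guessing at date formats, and no risk of misreading "9/22-24/2009".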
It is with great interest that I note that this problem is much more widespread. This blog post suggests a possible reason: problems of the past had to do with hidden information or data, while modern problems have to do with problems hidden within data that is in plain sight (a hypothesis of Malcolm Gladwell and probably many others). That is, in the past, having the data was good enough. We did not have the space to store huge amounts of data, and certainly not the processing power to sift through all of it. Now we have the storage and the processing power, but our paradigm of thinking about data has not kept up. We are still thinking that all we need is to have the data, when what we really need is to analyze it, discard what's irrelevant, and correctly interpret what is there.
And that's why Hal Varian regards statistics as the "sexy job" of the next decade.
Friday, March 6, 2009
Simplicity is no substitute for correctness, but simplicity has an important role
The test of a good procedure is how well it works, not how well it is understood. -- John Tukey
Perhaps I'm abusing Tukey's quote here, because I'm speaking of situations where the theory of the less understood methodology is fairly well understood, or at least fairly obvious to the statistician from previous theory. I'm also, in some cases, substituting "how correct it is" in place of "how well it works."
John Cook wrote a little the other day on this quote, and I wanted to follow up a bit more. I've run into many situations where a better-understood method was preferred over one that would have, for example, cut the sample size of a clinical trial or made better use of the data that was collected. The sponsor simply wanted to go with the method taught in a first-year statistics course because it was easier to understand. The results were often analysis plans that were less powerful, covered up important issues, or were simply wrong (i.e., an exact answer to the wrong question). It's a delicate balance, especially for someone trained in theoretical statistics who is corresponding with a scientist or clinician in a very applied setting.
Here's how I resolve the issue: I think that the simpler methods are great for storytelling. I appreciate Andrew Gelman's tweaks to the simpler methods (and his useful discussion of Tukey as well!), and I think basic graphing and estimation methods serve a useful purpose for presentation and for first-order approximations in data analysis. But, in most practical cases, they should not be the last effort.
On a related note, I'm sure most statisticians know by now that they will have the "sexiest job" of the 2010s. The key will be how well we communicate our results, and here is where judicious use of the simpler methods (and creative data visualization) will make the greatest contributions.