Sunday, December 11, 2016
I set up a new data analysis blog
Saturday, September 24, 2016
Windows 10 anniversary update includes a whole Linux layer - this is good news for data scientists
Why is this big news? To me, it brings a lot of the Linux tools that have proven useful over the years - wc, sed, awk, grep, and so forth - to Windows. For some jobs these tools work much better than packages such as R or SAS, and much of their power comes from combining them through pipes. You also get apt-get, which lets you install and manage packages such as SQLite, Octave, and gnuplot. You can even install R this way, though I don't know whether RStudio works with R installed like that.
If you're a Linux buff who uses Windows, you can probably think of many more things to do with this. The one thing I haven't tried is any sort of X Windows or other graphical interface.
Sunday, June 26, 2016
Which countries have Regrexit?
Fortunately, in R, working with JSON data is pretty easy. You can easily download the data from the link and put it into a data frame. I start on that here, with the RJSONIO package, ggplot2, and a version of the petition I downloaded on 6/26/16.
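Here is a rough sketch of the kind of code involved (a sketch only: the file name is a placeholder for my local copy of the petition JSON, and the nested field names are from memory, so check them against the actual download):

library(RJSONIO)
library(ggplot2)

pet <- fromJSON("petition.json")  # placeholder name for the downloaded petition JSON

# signatures by country come back as a list of name/count records
# (field names assumed; verify against the actual JSON)
by.country <- do.call(rbind, lapply(pet$data$attributes$signatures_by_country,
  function(x) data.frame(country = x[["name"]],
                         count   = as.numeric(x[["signature_count"]]),
                         stringsAsFactors = FALSE)))

# top 9 non-UK countries by signature count
non.uk <- subset(by.country, country != "United Kingdom")
top9 <- head(non.uk[order(-non.uk$count), ], 9)

ggplot(top9, aes(x = reorder(country, count), y = count)) +
  geom_bar(stat = "identity") + coord_flip() +
  xlab("Country") + ylab("Signatures")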
One question I had was whether all the signers are British. Fortunately, the petition records each signer's place of residence (assuming no fraud). I came up with the following top 9 non-UK countries of origin of signers.
- I left off the UK, which has over 3 million signatures and by far the largest share of signatories.
- 5 of the top 9 countries are neighbors of the UK, including the top 2. The other 4 are Australia, the US, Canada, and New Zealand, all countries with strong ties to the UK.
- This assumes no petition fraud, which I can't guarantee. I saw at least one Twitter posting telling people to use her (if the profile pic is to be believed) residence code. There is a section of the petition data showing constituency, so I'm wondering whether it would be possible to analyze the petition for fraud. I'm not as familiar with British census data as I am with US data, but I imagine a mashup of the two would be useful.
Friday, May 20, 2016
Simulating a Weibull conditional on time-to-event is greater than a given time
# simulate from a Weibull conditional on survival beyond a minimum time t ---------------
# conditional reliability: P(X > x | X > t) = exp{-(x/b)^a} / exp{-(t/b)^a} for x >= t
# n     = number of points to return
# shape = shape parameter (a) of the Weibull
# scale = scale parameter (b) of the Weibull (default 1)
# t     = minimum survival time (default 0, which makes the function act like rweibull)
my_rweibull <- function(n, shape, scale = 1, t = 0) {
  if (length(t) != 1 && length(t) != n) {
    stop("length(t) is not 1 or n")
  }
  # inversion of the conditional reliability function:
  # x = b * (-log(U) + (t/b)^a)^(1/a), with U ~ uniform(0,1)
  return(scale * (-log(runif(n)) + (t/scale)^shape)^(1/shape))
}
You use this function just like rweibull, except that you also pass in a vector t of minimum times (or a scalar minimum applied to all simulated values). The idea is that the probability that the random variable exceeds x, given that it exceeds the minimum t, is exp{-(x/b)^a} / exp{-(t/b)^a}, so you can simulate from this with a uniform random variate. I use the inversion method on the reliability function (just like using the inversion method on the distribution function, with the insight that if U is uniform(0,1), so is 1-U).
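A quick sanity check (not part of the original post, just a sketch): with t = 0 the draws should look like ordinary rweibull draws, and with t > 0 every draw should exceed the minimum time.

set.seed(42)
x0 <- my_rweibull(10000, shape = 2, scale = 3)          # t = 0: behaves like rweibull
x1 <- my_rweibull(10000, shape = 2, scale = 3, t = 5)   # conditional on survival > 5

summary(x0)
summary(rweibull(10000, shape = 2, scale = 3))  # should look comparable
min(x1)                                         # should be greater than 5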
Truth be told, I ought to buckle down and learn how to do packages in R, but for now I'll just pass on some code on my blog if I think it will be helpful (or if I need to find it while doing a Google search later).
(Edit on 7/16: OOPS! A previous version of this had the scale and shape parameters switched. I've corrected it now. If you copied this before 7/16/2016, please check again.)
Thursday, January 14, 2016
Talk to Upstate Data Science Group on Caret
It was great to get out of my own comfort zone a bit (since graduate school, I've only really given talks on topics in biostatistics) and to meet statisticians, computer scientists, and other sorts of data scientists from many different fields. This is a relatively new group, and given the interest over the last couple of months or so, I think it has been sorely needed in the Upstate South Carolina region.
We'll be participating in Open Data day in March of this year, so if you are in the Upstate SC region, or don't mind making the trek from Columbia or Western NC, find us on Meetup. Our next meeting is a data hack night which promises to be interesting.
Wednesday, November 25, 2015
Even the tiniest error messages can indicate an invalid statistical analysis
Word to the wise: track down the reasons for even the most innocuous-seeming warnings. Every stage of a statistical analysis is important, and small errors anywhere along the way can have huge consequences downstream. Perhaps this is obvious, but you still have to slow down and take care of the details.
(Note that I'm editing this to be a part of my Little Debate series, which discusses the tiny decisions dealing with data that are rarely discussed or scrutinized, but can have a major impact on conclusions.)
Monday, April 15, 2013
RStudio is reminding me of the older Macs
Well, the only bad thing is that I am trying to run a probabilistic graphical model on some real data, and having a crash like this will definitely slow things down.
Tuesday, March 12, 2013
Distrust of R
I guess I've been living in a bubble for a bit, but apparently there are a lot of people who still mistrust R. I got asked this week why I used R (and, specifically, the package rpart) to generate classification and regression trees instead of SAS Enterprise Miner. Never mind the fact that rpart code has been around a very long time, and probably has been subject to more scrutiny than any other decision tree code. (And never mind the fact that I really don't like classification and regression trees in general because of their limitations.)
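For what it's worth, here is how little code a basic classification tree takes in R (a sketch using the kyphosis data that ships with rpart, not anything from the project in question):

library(rpart)

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")
printcp(fit)                        # cross-validated error by complexity parameter
plot(fit); text(fit, use.n = TRUE)  # draw the tree with node counts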
At any rate, if someone wants to pay the big bucks for me to use SAS Enterprise Miner just on their project, they can go right ahead. Otherwise, I have got a bit of convincing to do.
Wednesday, August 29, 2012
Integrating R into a SAS shop
I work in an environment dominated by SAS, and I am looking to integrate R into our environment.
Why would I want to do such a thing? First, I do not want to get rid of SAS. That would not only throw away most of our investment in SAS training and in hiring good-quality SAS programmers, but it would also remove the advantages of SAS from our environment. These advantages include the following:
- Many years of collective experience in pharmaceutical data management, analysis, and reporting
- Workflow that is second to none (with the exception of reproducible research, where R excels)
- Reporting tools based on ODS that are second to none
- SAS has much better validation tools than R, unless you get a commercial version of R (which makes IT folks happy)
- SAS automatically does parallel processing for several common functions
So, if SAS is so great, why do I want R?
- SAS’s pricing model means that to get a package that does everything I want, I pay thousands of dollars per year more than for the basic package and end up with a system that does far more than I need. For example, if I want to do a CART analysis, I have to buy Enterprise Miner.
- R is more agile and flexible than SAS
- R more easily integrates with Fortran and C++ than SAS (I’ve tried the SAS integration with DLLs, and it’s doable, but hard)
- R is better at custom algorithms than SAS, unless you delve into the world of IML (which is sometimes a good solution).
I’m still looking at ways to do it, although the integration with IML/IML studio is promising.
Wednesday, March 21, 2012
Using R for a salary negotiation–an extension of decision tree models
Let’s say you are in the middle of a salary negotiation, and you want to know whether you should be aggressive in your offering or conservative. One way to help with the decision is to make a decision tree. We’ll work with the following assumptions:
- You are at a job currently making 50ドルk
- You have the choices between asking 60ドルk (which will be accepted with probability 0.8) or 70ドルk (which will be accepted with probability 0.2).
- You get one shot. If your asking price is rejected, you stay at your current job and continue to make 50ドルk. (This is one of those simplifying assumptions that we might dispense with later.)
This simplification of reality can be represented with a decision tree:
I went ahead and put in the expected payoff for each of these decisions. Because the more conservative approach has a higher expected payoff, this model suggests that you should take the conservative approach.
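For concreteness, here is the arithmetic behind those expected payoffs, computed directly from the assumptions above:

0.8 * 60 + 0.2 * 50   # ask 60ドルk: expected payoff 58ドルk
0.2 * 70 + 0.8 * 50   # ask 70ドルk: expected payoff 54ドルk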
One shortcoming clearly is that this decision tree only shows two decisions, but really you have a range of decisions; you are not stuck with 60ドルk or 70ドルk for asking price. You might go with 65ドルk, or 62ドル.5k, or something else. So what would be the optimal asking price?
Again, we look at expected payoff, which is asking price * probability(offer accepted) + 50ドルk * probability(offer rejected). In this case, we need to model the probability that the offer is accepted over the whole range of possible offers, not just at two points. The logistic model works very well for modeling probabilities, and a two-parameter logistic model can be fit exactly to two points, so that is what I use here to extend the two-point model.
Here is my commented R code to implement this model:
my.offer <- function(x1=60, py.x1=.2, x2=70, py.x2=.8, ev.no=50, high=100, p.payoff=1) {
  # return the offer that maximizes expected payoff
  # this assumes a game with one decision and one consequence:
  # you give an offer, and it is taken or refused. If taken, you receive a salary of
  # (a function of) the offer. If refused, you stay at the old job and receive a
  # salary of ev.no (presumably a current salary, but set to 0 if you are
  # unemployed).
  # the probability of rejection is modeled with a logistic function defined by
  # two points (x1,py.x1) and (x2,py.x2)
  # for example, if you expect a 20% rejection probability with an offer of 140k,
  # then x1,py.x1 = 140,.2. Similarly with x2,py.x2.
  # the expected payoff is modeled as offer*P(Yes|offer) + ev.no*P(No|offer),
  # perhaps with modifications to account for benefits, negotiation, etc. This
  # is defined in the payoff function below.
  # finally, high is anything above what you would expect to offer, and is used
  # to create the plot limits and set the bounds in the optimization routine.

  # model the probability of "no" given the salary offer:
  # a logistic function defined by (x1,py.x1) and (x2,py.x2)
  # note that qlogis is the logit (log-odds) function, i.e. the inverse of plogis
  # also, matrices in R are filled in column-major order, not row-major like
  # FORTRAN, so we use 1,1,x1,x2 rather than 1,x1,1,x2
  theta <- solve(matrix(c(1,1,x1,x2), nc=2), matrix(qlogis(c(py.x1,py.x2)), nc=1))

  # for the plot of the probability function
  xseq <- seq(ev.no, high, length=100)
  yseq1 <- 1/(1+exp(-theta[1]-theta[2]*xseq))

  # model the expected payoff of an offer
  # model negotiations, benefits, and other things here
  # (a simple way to model benefits, though, is just to change ev.no)
  payoff <- function(x) {
    tmp <- exp(-theta[1]-theta[2]*x)
    return((ev.no + ifelse(ev.no > x*p.payoff, ev.no, x*p.payoff)*tmp)/(1+tmp))
  }
  yseq <- payoff(xseq)

  # plots
  par(mfrow=c(1,2))
  plot(xseq, yseq1, type='l', xlab='Offer', ylab='P(No|X)')
  plot(xseq, yseq, type='l', xlab='Offer', ylab='Expected salary')

  # no sense in even discussing the matter if offer < ev.no
  return(optimize(payoff, interval=c(ev.no, high), maximum=TRUE))
}
And here are the graphs and result:
> my.offer()
$maximum
[1] 61.96761
$objective
[1] 58.36087
So this model suggests that the optimal offer is close to 62ドルk, with an expected payoff of around 58ドルk. As a side effect, a couple of graphs are produced: the probability of rejection as a function of the asking price, and the expected salary (payoff) as a function of the asking price.
So a few comments are in order:
- The value in this model is in varying the inputs and seeing how that affects the optimum asking price.
- The function I provided is slightly more complicated than what I presented in this post. You can model things like negotiation (i.e. you may end up at a little less than your asking price if you are not turned down right away), differences in benefits, and so forth. Once you have a simple and reliable baseline model with which to work, you can easily modify it to account for other factors.
- Like all models, this is an oversimplification of the salary negotiation process, but a useful oversimplification. There are cases where you want to be more aggressive in your asking, and this model can point those out.
- I commented the code profusely, but the side effects are probably not the best programming practice. However, this really is a toy model, so feel free to rip off the code.
- This model of course extends to other areas where you have a continuous range of choices with payoffs and/or penalties.
- If you are able to gather data on the probability of rejection based on offer, so much the better. You can then, instead of fitting an exact probability model, perform a logistic regression and use that as the basis of the expected payoff calculation.
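As a sketch of that data-driven variant (the data frame and column names below are made up purely for illustration):

# hypothetical record of past offers and whether they were rejected
offers <- data.frame(offer    = c(55, 58, 60, 63, 65, 68, 70, 72),
                     rejected = c( 0,  0,  0,  1,  0,  1,  1,  1))

fit <- glm(rejected ~ offer, family = binomial, data = offers)

# expected payoff over a grid of asking prices, using the fitted rejection curve
ev.no  <- 50
grid   <- seq(50, 80, by = 0.5)
p.rej  <- predict(fit, newdata = data.frame(offer = grid), type = "response")
payoff <- grid * (1 - p.rej) + ev.no * p.rej

grid[which.max(payoff)]   # asking price with the highest expected payoff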
Monday, December 5, 2011
From datasets to algorithms in R
Many statistical algorithms are taught and implemented in terms of linear algebra. Statistical packages often borrow heavily from optimized linear algebra libraries such as LINPACK, LAPACK, or BLAS. When implementing these algorithms in systems such as Octave or MATLAB, it is up to you to translate the data from the use case terms (factors, categories, numerical variables) into matrices.
In R, much of the heavy lifting is done for you through the formula interface. Formulas resemble y ~ x1 + x2 + …, and are defined in relation to a data.frame. There are a few features that make this very powerful:
- You can specify transformations automatically. For example, you can do y ~ log(x1) + x2 + … just as easily.
- You can specify interactions and nesting.
- You can automatically create a numerical matrix for a formula using model.matrix(formula) (see the short sketch after this list).
- Formulas can be updated through the update() function.
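A small illustration of those features (a sketch with a made-up data frame):

df <- data.frame(y  = rnorm(6),
                 x1 = c(1, 2, 4, 8, 16, 32),
                 x2 = factor(c("a", "a", "b", "b", "c", "c")))

f <- y ~ log(x1) + x2
model.matrix(f, df)        # numeric design matrix: intercept, log(x1), dummy columns for x2
update(f, . ~ . + x1:x2)   # add an interaction without retyping the whole formula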
Recently, I wanted to create predictions from a Bayesian model averaging fit (the BMA package on CRAN), but did not see where the authors had implemented a predict method. However, it was very easy to create a function that does this:
predict.bic.glm <- function(bma.fit, new.data, inv.link=plogis) {
  # predict.bic.glm
  # Purpose: predict new values from a bma fit, using values in a new data frame
  #
  # Arguments:
  #  bma.fit  - an object fit by bic.glm using the formula method
  #  new.data - a data frame, which must have variables with the same names as the
  #             independent variables specified in the formula of bma.fit
  #             (it does not need the dependent variable, and ignores it if present)
  #  inv.link - a vectorized function representing the inverse of the link function
  #
  # Returns:
  #  a vector of length nrow(new.data) with the conditional probabilities of the
  #  dependent variable being 1 or TRUE
  # TODO: make inv.link not be specified, but rather extracted from glm.family of bma.fit$call

  form <- formula(bma.fit$call$f)[-2]  # extract formula from the call of the bma fit, drop dep var
  des.matrix <- model.matrix(form, new.data)
  pred <- des.matrix %*% matrix(bma.fit$postmean, nc=1)
  pred <- inv.link(pred)
  return(pred)
}
The first task of the function is to find the formula that was used in the original bic.glm() call; the [-2] subscripting removes the dependent variable. Then the model.matrix() function creates a matrix of predictors from the original formula (minus the dependent variable) and the new data. The power here is that if I had interactions or transformations in the original call to bic.glm(), they are replicated automatically on the new data, without my having to parse anything by hand. With a new design matrix and a vector of coefficients (in this case, the expectation of the coefficients over the posterior distribution of the models), it is easy to calculate the conditional probabilities.
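A hypothetical use of the function might look like this (a sketch only: it assumes the BMA package is installed, the data are simulated just to show the calling convention, and the glm.family argument name is as I recall it from the BMA documentation):

library(BMA)

set.seed(1)
train <- data.frame(y  = rbinom(100, 1, 0.5),
                    x1 = rnorm(100),
                    x2 = rnorm(100))

# name the f= argument so that bma.fit$call$f is available to the function above;
# transformations or interactions in the formula would be carried over automatically
bma.fit <- bic.glm(f = y ~ x1 + x2, data = train, glm.family = binomial())

new.data <- data.frame(x1 = rnorm(5), x2 = rnorm(5))
predict.bic.glm(bma.fit, new.data)   # predicted probabilities that y == 1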
In short, the formula interface makes it easy to translate from the use case terms (factors, variables, dependent variables, etc.) into linear algebra terms where algorithms are implemented. I’ve only scratched the surface here, but it is worth investing some time into formulas and their manipulation if you intend to implement any algorithms in R.
R-bloggers
For a long time, I have relied on R-bloggers for new, interesting, arcane, and all around useful information related to R and statistics. Now my R-related material is appearing there.
If you use the R package at all, R-bloggers should be in your feed aggregator.
Monday, September 12, 2011
R to Word, revisited
There are several other ways I can think of:
- Writing to CSV, then opening in MS Excel, then copying/pasting into Word (or even saving in another format)
- Going through HTML to get to Word (for example, the R2HTML package)
- Using a commercial solution such as Inference for R (has this been updated since 2009?!)
- Using Statconn, which seems broken in the later versions of Windows and is not cross platform in any case
- Going through LaTeX to RTF, which works roughly as follows (a sketch follows this list):
  - Use cat statements and the latex command from the Hmisc package, or the xtable package (both available from CRAN), to create a LaTeX document with the table.
  - Call the l2r command included with the LaTeX2RTF installation. (This requires a bit of setup, see below.)
  - Open up the result in Word and do any further manipulation.
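A rough sketch of that LaTeX-to-RTF route (assumes Hmisc is installed from CRAN and that a LaTeX2RTF converter is on the system path; the file names, and calling latex2rtf directly rather than the l2r wrapper, are just for illustration):

library(Hmisc)

tab <- data.frame(group = c("A", "B"), mean = c(1.2, 3.4), sd = c(0.5, 0.7))

# build a minimal LaTeX document around the table
cat("\\documentclass{article}\n\\begin{document}\n", file = "table.tex")
w <- latex(tab, file = "table.tex", append = TRUE, rowname = NULL)
cat("\\end{document}\n", file = "table.tex", append = TRUE)

# convert to RTF, then open the resulting .rtf in Word
system("latex2rtf table.tex")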
I must admit, though, that this is one area where SAS blows R away. With the exception of integration with LaTeX, and some semi-successful integration with Excel through the Statconn project, R's ability to produce nicely formatted output is seriously outpaced by SAS ODS.
Sunday, March 27, 2011
More on Ultraedit to R
Friday, January 21, 2011
Ultraedit to R
Saturday, January 15, 2011
Dona eis Python, whap!
Friday, April 16, 2010
R - not the epic fail we thought
"I know that R is free and I am actually a Unix fan and think Open Source software is a great idea. However, for me personally and for most users, both individual and organizational, the much greater cost of software is the time it takes to install it, maintain it, learn it and document it. On that, R is an epic fail."
With the exception of the last sentence, I am in full agreement. Especially in the industry I work in, qualification and documentation are hugely important, and a strength of SAS is a gigantic support organization that has worked through these issues. Having some maverick install and use R, as I do, simply does not work for the more formal work that we do. (I use R for consulting and other things that do not necessarily fall under a predicate rule.)
However, another company, REvolution Computing, has recognized this need as well. With R catching on at the FDA, larger companies in our industry have taken a huge interest in R, partly because of its flexibility in statistical calculations (and, frankly, the S language beats SAS/IML hands down for the fancier methods), and mostly to save millions of dollars in license fees. Compare REvolution's few hundred dollars a year for installation and operational qualification on an unlimited-seat license to SAS's license model, and it's not hard to see why.
And SAS, to their credit, has made it easier to interact with R.
Epic win for everybody.
Friday, November 20, 2009
My implementation of Berry and Berry's hierarchical Bayes algorithm for adverse events
Working on a drug safety project
I've been working on this project for some time now, off and on. Well, I've been making progress, and I'll share the results when I'm done. I'd also like to implement some of the other similar algorithms in this area, including the Poisson model that accounts for multiple occurrences of an adverse event, and a recent methodology that looks for "syndromes" (i.e. occurrences of groups of specific events, all of which arise within a short time) and "constellations" (where the time restrictions are relaxed).
Saturday, July 11, 2009
Causal inference and biostatistics
This is a theory that I think deserves more attention in biostatistics. After all, it goes back to the root of why we are studying drugs. Ultimately, we really don't give a damn about whether outcomes are better in the treated group than in the placebo group. Rather, we are interested in whether we are able to benefit individuals by giving them a treatment. In other words, we are interested in the unknowable quantity of what each person's outcome would be if they were treated and what it would be if they were not. If there's an improvement and it outweighs the (unknowable) risks, the drug is worthwhile. The reason we are interested in the outcomes of a treated group and of a placebo group is that the comparison is a surrogate for this unknowable quantity, especially if you use the intention-to-treat principle. However, as mentioned in the linked article and in Rubin's research, the intention-to-treat principle fails to deliver on its promise despite its simplicity and popularity.
Some clinical trials are now being run with causal inference as a central part of the design. Tools such as R and WinBUGS and Bayesian concepts now make this logistically feasible. Further advances in statistical handling of partial compliance to treatment, biological pathways of drugs, and the intention to treat principle itself make causal inference look much more desirable by the day. It's really only inertia caused by the popularity and (apparent) simplicity of intention to treat that makes this concept slower to catch on.