Sunday, December 11, 2016
I set up a new data analysis blog
Saturday, September 24, 2016
Windows 10 anniversary update includes a whole Linux layer - this is good news for data scientists
Why is this big news? To me, it brings a lot of the Linux tools that have proven useful over the years - wc, sed, awk, grep, and so forth - to Windows. For some jobs these tools work much better than packages such as R or SAS, and much of their power comes from combining them through pipes. You also get apt-get, which lets you install and manage packages such as SQLite, Octave, and gnuplot. You can even install R this way, though I don't know whether RStudio works with R installed like that.
If you're a Linux buff who uses Windows, you can probably think of many more things to do with this. The one thing I haven't tried is any sort of X Windows or other graphical interface.
Sunday, June 26, 2016
Which countries have Regrexit?
Fortunately, in R, working with JSON data is pretty easy. You can easily download the data from the link and put it into a data frame. I start on that here, with the RJSONIO package, ggplot2, and a version of the petition I downloaded on 6/26/16.
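Here is a rough sketch of the kind of code involved (a sketch only: the file name is a placeholder for my local copy of the petition JSON, and the nested field names are from memory, so check them against the actual download):

library(RJSONIO)
library(ggplot2)

pet <- fromJSON("petition.json")  # placeholder name for the downloaded petition JSON

# signatures by country come back as a list of name/count records
# (field names assumed; verify against the actual JSON)
by.country <- do.call(rbind, lapply(pet$data$attributes$signatures_by_country,
  function(x) data.frame(country = x[["name"]],
                         count   = as.numeric(x[["signature_count"]]),
                         stringsAsFactors = FALSE)))

# top 9 non-UK countries by signature count
non.uk <- subset(by.country, country != "United Kingdom")
top9 <- head(non.uk[order(-non.uk$count), ], 9)

ggplot(top9, aes(x = reorder(country, count), y = count)) +
  geom_bar(stat = "identity") + coord_flip() +
  xlab("Country") + ylab("Signatures")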
One question I had was whether all the signers are British. Fortunately, the petition records each signer's place of residence (assuming no fraud). I came up with the following top 9 non-UK countries of origin of signers.
- I left off the UK, which has over 3 million signatures and by far the largest share of signatories.
- 5 of the top 9 countries are neighbors of the UK, including the top 2. The other 4 are Australia, the US, Canada, and New Zealand, all countries with strong ties to the UK.
- This assumes no petition fraud, which I can't guarantee. I saw at least one Twitter posting telling people to use her (if the profile pic is to be believed) residence code. There is a section of the petition data showing constituency, so I'm wondering whether it would be possible to analyze the petition for fraud. I'm not as familiar with British census data as I am with US data, but I imagine a mashup of the two would be useful.
Friday, May 20, 2016
Simulating a Weibull conditional on time-to-event is greater than a given time
# simulate from a Weibull conditional on survival beyond a minimum time t ---------------
# conditional reliability: P(X > x | X > t) = exp{-(x/b)^a} / exp{-(t/b)^a} for x >= t
# n     = number of points to return
# shape = shape parameter (a) of the Weibull
# scale = scale parameter (b) of the Weibull (default 1)
# t     = minimum survival time (default 0, which makes the function act like rweibull)
my_rweibull <- function(n, shape, scale = 1, t = 0) {
  if (length(t) != 1 && length(t) != n) {
    stop("length(t) is not 1 or n")
  }
  # inversion of the conditional reliability function:
  # x = b * (-log(U) + (t/b)^a)^(1/a), with U ~ uniform(0,1)
  return(scale * (-log(runif(n)) + (t/scale)^shape)^(1/shape))
}
You use this function just like rweibull, except that you also pass in a vector t of minimum times (or a scalar minimum applied to all simulated values). The idea is that the probability that the random variable exceeds x, given that it exceeds the minimum t, is exp{-(x/b)^a} / exp{-(t/b)^a}, so you can simulate from this with a uniform random variate. I use the inversion method on the reliability function (just like using the inversion method on the distribution function, with the insight that if U is uniform(0,1), so is 1-U).
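A quick sanity check (not part of the original post, just a sketch): with t = 0 the draws should look like ordinary rweibull draws, and with t > 0 every draw should exceed the minimum time.

set.seed(42)
x0 <- my_rweibull(10000, shape = 2, scale = 3)          # t = 0: behaves like rweibull
x1 <- my_rweibull(10000, shape = 2, scale = 3, t = 5)   # conditional on survival > 5

summary(x0)
summary(rweibull(10000, shape = 2, scale = 3))  # should look comparable
min(x1)                                         # should be greater than 5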
Truth be told, I ought to buckle down and learn how to do packages in R, but for now I'll just pass on some code on my blog if I think it will be helpful (or if I need to find it while doing a Google search later).
(Edit on 7/16: OOPS! A previous version of this had the scale and shape parameters switched. I've corrected it now. If you copied this before 7/16/2016, please check again.)
Thursday, January 14, 2016
Talk to Upstate Data Science Group on Caret
It was great to get out of my own comfort zone a bit (since graduate school, I've only really given talks on topics in biostatistics) and to meet statisticians, computer scientists, and other sorts of data scientists from many different fields. This is a relatively new group, and given the interest over the last couple of months or so, I think it has been sorely needed in the Upstate South Carolina region.
We'll be participating in Open Data day in March of this year, so if you are in the Upstate SC region, or don't mind making the trek from Columbia or Western NC, find us on Meetup. Our next meeting is a data hack night which promises to be interesting.
Wednesday, November 25, 2015
Even the tiniest error messages can indicate an invalid statistical analysis
Word to the wise: track down the reasons for even the most innocuous-seeming warnings. Every stage of a statistical analysis is important, and small errors anywhere along the way can have huge consequences downstream. Perhaps this is obvious, but you still have to slow down and take care of the details.
(Note that I'm editing this to be a part of my Little Debate series, which discusses the tiny decisions dealing with data that are rarely discussed or scrutinized, but can have a major impact on conclusions.)
Monday, April 15, 2013
RStudio is reminding me of the older Macs
Well, the only bad thing is that I am trying to run a probabilistic graphical model on some real data, and having a crash like this will definitely slow things down.
Tuesday, March 12, 2013
Distrust of R
I guess I've been living in a bubble for a bit, but apparently there are a lot of people who still mistrust R. I got asked this week why I used R (and, specifically, the package rpart) to generate classification and regression trees instead of SAS Enterprise Miner. Never mind the fact that rpart code has been around a very long time, and probably has been subject to more scrutiny than any other decision tree code. (And never mind the fact that I really don't like classification and regression trees in general because of their limitations.)
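For what it's worth, here is how little code a basic classification tree takes in R (a sketch using the kyphosis data that ships with rpart, not anything from the project in question):

library(rpart)

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")
printcp(fit)                        # cross-validated error by complexity parameter
plot(fit); text(fit, use.n = TRUE)  # draw the tree with node counts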
At any rate, if someone wants to pay the big bucks for me to use SAS Enterprise Miner just on their project, they can go right ahead. Otherwise, I have got a bit of convincing to do.
Wednesday, August 29, 2012
Integrating R into a SAS shop
I work in an environment dominated by SAS, and I am looking to integrate R into our environment.
Why would I want to do such a thing? First, I do not want to get rid of SAS. That would not only throw away most of our investment in SAS training and in hiring good-quality SAS programmers, but it would also remove the advantages of SAS from our environment. These advantages include the following:
- Many years of collective experience in pharmaceutical data management, analysis, and reporting
- Workflow that is second to none (with the exception of reproducible research, where R excels)
- Reporting tools based on ODS that are second to none
- SAS has much better validation tools than R, unless you get a commercial version of R (which makes IT folks happy)
- SAS automatically does parallel processing for several common functions
So, if SAS is so great, why do I want R?
- SAS’s pricing model means that to get a package that does everything I want, I pay thousands of dollars per year more than for the basic package and end up with a system that does far more than I need. For example, if I want to do a CART analysis, I have to buy Enterprise Miner.
- R is more agile and flexible than SAS
- R more easily integrates with Fortran and C++ than SAS (I’ve tried the SAS integration with DLLs, and it’s doable, but hard)
- R is better at custom algorithms than SAS, unless you delve into the world of IML (which is sometimes a good solution).
I’m still looking at ways to do it, although the integration with IML/IML studio is promising.
Wednesday, March 21, 2012
Using R for a salary negotiation–an extension of decision tree models
Let’s say you are in the middle of a salary negotiation, and you want to know whether you should be aggressive in your offering or conservative. One way to help with the decision is to make a decision tree. We’ll work with the following assumptions:
- You are at a job currently making 50ドルk
- You have the choices between asking 60ドルk (which will be accepted with probability 0.8) or 70ドルk (which will be accepted with probability 0.2).
- You get one shot. If your asking price is rejected, you stay at your current job and continue to make 50ドルk. (This is one of those simplifying assumptions that we might dispense with later.)
This simplification of reality can be represented with a decision tree:
I went ahead and put in the expected payoff for each of these decisions. Because the more conservative approach has a higher expected payoff, this model suggests that you should take the conservative approach.
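For concreteness, here is the arithmetic behind those expected payoffs, computed directly from the assumptions above:

0.8 * 60 + 0.2 * 50   # ask 60ドルk: expected payoff 58ドルk
0.2 * 70 + 0.8 * 50   # ask 70ドルk: expected payoff 54ドルk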
One shortcoming clearly is that this decision tree only shows two decisions, but really you have a range of decisions; you are not stuck with 60ドルk or 70ドルk for asking price. You might go with 65ドルk, or 62ドル.5k, or something else. So what would be the optimal asking price?
Again, we look at expected payoff, which is asking price * probability(offer accepted) + 50ドルk * probability(offer rejected). In this case, we need to model the probability that the offer is accepted over the whole range of possible offers, not just at two points. The logistic model works very well for modeling probabilities, and a two-parameter logistic model can be fit exactly to two points, so that is what I use here to extend the two-point model.
Here is my commented R code to implement this model:
my.offer <- function(x1=60, py.x1=.2, x2=70, py.x2=.8, ev.no=50, high=100, p.payoff=1) {
  # return the offer that maximizes expected payoff
  # this assumes a game with one decision and one consequence:
  # you give an offer, and it is taken or refused. If taken, you receive a salary of
  # (a function of) the offer. If refused, you stay at the old job and receive a
  # salary of ev.no (presumably a current salary, but set to 0 if you are
  # unemployed).
  # the probability of rejection is modeled with a logistic function defined by
  # two points (x1,py.x1) and (x2,py.x2)
  # for example, if you expect a 20% rejection probability with an offer of 140k,
  # then x1,py.x1 = 140,.2. Similarly with x2,py.x2.
  # the expected payoff is modeled as offer*P(Yes|offer) + ev.no*P(No|offer),
  # perhaps with modifications to account for benefits, negotiation, etc. This
  # is defined in the payoff function below.
  # finally, high is anything above what you would expect to offer, and is used
  # to create the plot limits and set the bounds in the optimization routine.

  # model the probability of "no" given the salary offer:
  # a logistic function defined by (x1,py.x1) and (x2,py.x2)
  # note that qlogis is the logit (log-odds) function, i.e. the inverse of plogis
  # also, matrices in R are filled in column-major order, not row-major like
  # FORTRAN, so we use 1,1,x1,x2 rather than 1,x1,1,x2
  theta <- solve(matrix(c(1,1,x1,x2), nc=2), matrix(qlogis(c(py.x1,py.x2)), nc=1))

  # for the plot of the probability function
  xseq <- seq(ev.no, high, length=100)
  yseq1 <- 1/(1+exp(-theta[1]-theta[2]*xseq))

  # model the expected payoff of an offer
  # model negotiations, benefits, and other things here
  # (a simple way to model benefits, though, is just to change ev.no)
  payoff <- function(x) {
    tmp <- exp(-theta[1]-theta[2]*x)
    return((ev.no + ifelse(ev.no > x*p.payoff, ev.no, x*p.payoff)*tmp)/(1+tmp))
  }
  yseq <- payoff(xseq)

  # plots
  par(mfrow=c(1,2))
  plot(xseq, yseq1, type='l', xlab='Offer', ylab='P(No|X)')
  plot(xseq, yseq, type='l', xlab='Offer', ylab='Expected salary')

  # no sense in even discussing the matter if offer < ev.no
  return(optimize(payoff, interval=c(ev.no, high), maximum=TRUE))
}
And here are the graphs and result:
> my.offer()
$maximum
[1] 61.96761
$objective
[1] 58.36087
So this model suggests that the optimal offer is close to 62ドルk, with an expected payoff of around 58ドルk. As a side effect, a couple of graphs are produced: the probability of rejection as a function of the asking price, and the expected salary (payoff) as a function of the asking price.
So a few comments are in order:
- The value in this model is in varying the inputs and seeing how that affects the optimum asking price.
- The function I provided is slightly more complicated than what I presented in this post. You can model things like negotiation (i.e. you may end up at a little less than your asking price if you are not turned down right away), differences in benefits, and so forth. Once you have a simple and reliable baseline model with which to work, you can easily modify it to account for other factors.
- Like all models, this is an oversimplification of the salary negotiation process, but a useful oversimplification. There are cases where you want to be more aggressive in your asking, and this model can point those out.
- I commented the code profusely, but the side effects are probably not the best programming practice. However, this really is a toy model, so feel free to rip off the code.
- This model of course extends to other areas where you have a continuous range of choices with payoffs and/or penalties.
- If you are able to gather data on the probability of rejection based on offer, so much the better. You can then, instead of fitting an exact probability model, perform a logistic regression and use that as the basis of the expected payoff calculation.
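As a sketch of that data-driven variant (the data frame and column names below are made up purely for illustration):

# hypothetical record of past offers and whether they were rejected
offers <- data.frame(offer    = c(55, 58, 60, 63, 65, 68, 70, 72),
                     rejected = c( 0,  0,  0,  1,  0,  1,  1,  1))

fit <- glm(rejected ~ offer, family = binomial, data = offers)

# expected payoff over a grid of asking prices, using the fitted rejection curve
ev.no  <- 50
grid   <- seq(50, 80, by = 0.5)
p.rej  <- predict(fit, newdata = data.frame(offer = grid), type = "response")
payoff <- grid * (1 - p.rej) + ev.no * p.rej

grid[which.max(payoff)]   # asking price with the highest expected payoff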
Monday, December 5, 2011
From datasets to algorithms in R
Many statistical algorithms are taught and implemented in terms of linear algebra. Statistical packages often borrow heavily from optimized linear algebra libraries such as LINPACK, LAPACK, or BLAS. When implementing these algorithms in systems such as Octave or MATLAB, it is up to you to translate the data from the use case terms (factors, categories, numerical variables) into matrices.
In R, much of the heavy lifting is done for you through the formula interface. Formulas resemble y ~ x1 + x2 + …, and are defined in relation to a data.frame. There are a few features that make this very powerful:
- You can specify transformations automatically. For example, you can do y ~ log(x1) + x2 + … just as easily.
- You can specify interactions and nesting.
- You can automatically create a numerical matrix for a formula using model.matrix(formula) (see the short sketch after this list).
- Formulas can be updated through the update() function.
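A small illustration of those features (a sketch with a made-up data frame):

df <- data.frame(y  = rnorm(6),
                 x1 = c(1, 2, 4, 8, 16, 32),
                 x2 = factor(c("a", "a", "b", "b", "c", "c")))

f <- y ~ log(x1) + x2
model.matrix(f, df)        # numeric design matrix: intercept, log(x1), dummy columns for x2
update(f, . ~ . + x1:x2)   # add an interaction without retyping the whole formula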
Recently, I wanted to create predictions from a Bayesian model averaging fit (the BMA package on CRAN), but did not see where the authors had implemented a predict method. However, it was very easy to create a function that does this:
predict.bic.glm <- function(bma.fit, new.data, inv.link=plogis) {
  # predict.bic.glm
  # Purpose: predict new values from a bma fit, using values in a new data frame
  #
  # Arguments:
  #  bma.fit  - an object fit by bic.glm using the formula method
  #  new.data - a data frame, which must have variables with the same names as the
  #             independent variables specified in the formula of bma.fit
  #             (it does not need the dependent variable, and ignores it if present)
  #  inv.link - a vectorized function representing the inverse of the link function
  #
  # Returns:
  #  a vector of length nrow(new.data) with the conditional probabilities of the
  #  dependent variable being 1 or TRUE
  # TODO: make inv.link not be specified, but rather extracted from glm.family of bma.fit$call

  form <- formula(bma.fit$call$f)[-2]  # extract formula from the call of the bma fit, drop dep var
  des.matrix <- model.matrix(form, new.data)
  pred <- des.matrix %*% matrix(bma.fit$postmean, nc=1)
  pred <- inv.link(pred)
  return(pred)
}
The first task of the function is to find the formula that was used in the original bic.glm() call; the [-2] subscripting removes the dependent variable. Then the model.matrix() function creates a matrix of predictors from the original formula (minus the dependent variable) and the new data. The power here is that if I had interactions or transformations in the original call to bic.glm(), they are replicated automatically on the new data, without my having to parse anything by hand. With a new design matrix and a vector of coefficients (in this case, the expectation of the coefficients over the posterior distribution of the models), it is easy to calculate the conditional probabilities.
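A hypothetical use of the function might look like this (a sketch only: it assumes the BMA package is installed, the data are simulated just to show the calling convention, and the glm.family argument name is as I recall it from the BMA documentation):

library(BMA)

set.seed(1)
train <- data.frame(y  = rbinom(100, 1, 0.5),
                    x1 = rnorm(100),
                    x2 = rnorm(100))

# name the f= argument so that bma.fit$call$f is available to the function above;
# transformations or interactions in the formula would be carried over automatically
bma.fit <- bic.glm(f = y ~ x1 + x2, data = train, glm.family = binomial())

new.data <- data.frame(x1 = rnorm(5), x2 = rnorm(5))
predict.bic.glm(bma.fit, new.data)   # predicted probabilities that y == 1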
In short, the formula interface makes it easy to translate from the use case terms (factors, variables, dependent variables, etc.) into linear algebra terms where algorithms are implemented. I’ve only scratched the surface here, but it is worth investing some time into formulas and their manipulation if you intend to implement any algorithms in R.
R-bloggers
For a long time, I have relied on R-bloggers for new, interesting, arcane, and all around useful information related to R and statistics. Now my R-related material is appearing there.
If you use the R package at all, R-bloggers should be in your feed aggregator.
Monday, September 12, 2011
R to Word, revisited
There are several other ways I can think of:
- Writing to CSV, then opening in MS Excel, then copying/pasting into Word (or even saving in another format)
- Going through HTML to get to Word (for example, the R2HTML package)
- Using a commercial solution such as Inference for R (has this been updated since 2009?!)
- Using Statconn, which seems broken in the later versions of Windows and is not cross platform in any case
- Going through LaTeX to RTF, which works roughly as follows (a sketch follows this list):
  - Use cat statements and the latex command from the Hmisc package, or the xtable package (both available from CRAN), to create a LaTeX document with the table.
  - Call the l2r command included with the LaTeX2RTF installation. (This requires a bit of setup, see below.)
  - Open up the result in Word and do any further manipulation.
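A rough sketch of that LaTeX-to-RTF route (assumes Hmisc is installed from CRAN and that a LaTeX2RTF converter is on the system path; the file names, and calling latex2rtf directly rather than the l2r wrapper, are just for illustration):

library(Hmisc)

tab <- data.frame(group = c("A", "B"), mean = c(1.2, 3.4), sd = c(0.5, 0.7))

# build a minimal LaTeX document around the table
cat("\\documentclass{article}\n\\begin{document}\n", file = "table.tex")
w <- latex(tab, file = "table.tex", append = TRUE, rowname = NULL)
cat("\\end{document}\n", file = "table.tex", append = TRUE)

# convert to RTF, then open the resulting .rtf in Word
system("latex2rtf table.tex")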
I must admit, though, that this is one area where SAS blows R away. With the exception of integration with LaTeX, and some semi-successful integration with Excel through the Statconn project, R's ability to produce nicely formatted output is seriously outpaced by SAS ODS.
Sunday, March 27, 2011
More on Ultraedit to R
Friday, January 21, 2011
Ultraedit to R
Saturday, January 15, 2011
Dona eis Python, whap!
Friday, April 16, 2010
R - not the epic fail we thought
"I know that R is free and I am actually a Unix fan and think Open Source software is a great idea. However, for me personally and for most users, both individual and organizational, the much greater cost of software is the time it takes to install it, maintain it, learn it and document it. On that, R is an epic fail."
With the exception of the last sentence, I am in full agreement. Especially in the industry I work in, qualification and documentation are hugely important, and a strength of SAS is a gigantic support organization that has worked through these issues. Having some maverick install and use R, as I do, simply does not work for the more formal work that we do. (I use R for consulting and other things that do not necessarily fall under a predicate rule.)
However, another company, REvolution Computing, has recognized this need as well. With R catching on at the FDA, larger companies in our industry have taken a huge interest in R, partly because of its flexibility in statistical calculations (and, frankly, the S language beats SAS/IML hands down for the fancier methods), and mostly to save millions of dollars in license fees. Compare REvolution's few hundred dollars a year for installation and operational qualification on an unlimited-seat license to SAS's license model, and it's not hard to see why.
And SAS, to their credit, has made it easier to interact with R.
Epic win for everybody.
Friday, November 20, 2009
My implementation of Berry and Berry's hierarchical Bayes algorithm for adverse events
Working on a drug safety project
I've been working on this project for some time now, off and on. Well, I've been making progress, and I'll share the results when I'm done. I'd also like to implement some of the other similar algorithms in this area, including the Poisson model that accounts for multiple occurrences of an adverse event, and a recent methodology that looks for "syndromes" (i.e. occurrences of groups of specific events, all of which arise within a short time) and "constellations" (where the time restrictions are relaxed).
Saturday, July 11, 2009
Causal inference and biostatistics
This is a theory that I think deserves more attention in biostatistics. After all, it goes back to the root of why we are studying drugs. Ultimately, we really don't give a damn about whether outcomes are better in the treated group than in the placebo group. Rather, we are interested in whether we are able to benefit individuals by giving them a treatment. In other words, we are interested in the unknowable quantity of what each person's outcome would be if they were treated and what it would be if they were not. If there's an improvement and it outweighs the (unknowable) risks, the drug is worthwhile. The reason we are interested in the outcomes of a treated group and of a placebo group is that the comparison is a surrogate for this unknowable quantity, especially if you use the intention-to-treat principle. However, as mentioned in the linked article and in Rubin's research, the intention-to-treat principle fails to deliver on its promise despite its simplicity and popularity.
Some clinical trials are now being run with causal inference as a central part of the design. Tools such as R and WinBUGS and Bayesian concepts now make this logistically feasible. Further advances in statistical handling of partial compliance to treatment, biological pathways of drugs, and the intention to treat principle itself make causal inference look much more desirable by the day. It's really only inertia caused by the popularity and (apparent) simplicity of intention to treat that makes this concept slower to catch on.