
Sunday, June 5, 2016

Little Debate: defining baseline

In an April 30, 2015 note in Nature (vol. 520, p. 612), Jeffrey Leek and Roger Peng point out that p-values get intense scrutiny, while all the decisions that lead up to the p-values get little debate. I wholeheartedly agree, and so I'm creating a Little Debate series to shine some light on these tiny decisions that may not get a lot of press. Yet these tiny decisions can have a big influence on statistical analysis. Because my focus here is mainly biostatistics, most of these ideas will be placed in the setting of clinical trials.

Defining baseline seems like an easy thing to do, and conceptually it is. Baseline is where you start before some intervention (e.g. treatment, or randomization to treatment or placebo). However, the details of the definition of baseline in a biostatistics setting can get tricky very quickly.

The missing baseline

Baseline is often defined as the value at a randomization or baseline visit, i.e. the last measurement before the beginning of some treatment or intervention. However, things happen: a needle breaks, a machine stops working, or study staff simply forget to do procedures or record times. (These are not just hypothetical cases ... these have all happened!) In these cases, we end up with a missing baseline, which makes it impossible to determine the effect of the intervention for that subject.

In this case, we have accepted that we can use previous values, such as those taken during the screening of a subject, as baseline values. This is probably the best we can do under the circumstances. However, I'm unaware of any research on what effect this has on statistical analysis.

To make matters worse, a lot of times people without statistical training or expertise will make these decisions, such as putting down a post-dose value as baseline. Even with good documentation, these sorts of mistakes are not easy to find, and, when they are, they are often found near the end of the study, right when data management and statisticians are trying to produce results, and sometimes after interim analyses.

The average baseline

Some protocols specify that baseline consists of the average of three repeated measurements. Again, this decision is often made before any statisticians are consulted. The issue is that averages are not directly comparable to single raw values. Let's say that a baseline QTc (the QT interval corrected for heart rate, a measure of how long the heart's electrical system takes to recover after each beat) is defined as the average of 3 electrocardiogram (ECG) measurements. The standard deviation of a raw QTc measurement (i.e. based on one ECG), let's say, is s. The standard deviation of the average of those three (assuming independence) is s/√3, or just above half the standard deviation of the raw measurement. Thus, a change of 1 unit in the average of 3 ECGs is a lot more noteworthy than a change of 1 unit in a single ECG measurement. And yet we compare that baseline to single measurements for the rest of the study.
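
Here is a quick numerical check of that scaling argument, written as a small R sketch (the QTc mean and standard deviation are made up for illustration, not taken from any real study):

    ## SD of a single reading vs. SD of the average of three independent readings
    s <- 15                                   # hypothetical SD of a single QTc reading (ms)
    s / sqrt(3)                               # SD of the mean of three readings: about 8.66 ms

    ## Simulation check
    set.seed(1)
    single  <- rnorm(1e5, mean = 410, sd = s)                              # single-ECG baselines
    avg_of3 <- rowMeans(matrix(rnorm(3e5, mean = 410, sd = s), ncol = 3))  # triplicate-ECG averages
    c(sd(single), sd(avg_of3))                # roughly 15 vs. 8.7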

To make matters worse, if the ECG machine screws up one measurement, then the baseline becomes the average of two. A lot of times we lose that kind of information, and yet analyze the data as if the mystery average is a raw measurement.

The extreme baseline

In one observational study, the sponsor wanted to use the maximum value over the last 12 months as the baseline. This was problematic for several reasons. Like the average, the extreme baseline (here the maximum) is on a different scale, and even has a different distribution, than a single raw measurement. The Fisher-Tippett (extreme value) theorem states that the maximum of n values, suitably normalized, converges in distribution to one of three extreme value families (Gumbel, Fréchet, or Weibull). That extreme is then compared to, again, single measurements taken after baseline. What's worse, any number of measurements could have been taken for a given subject within those 12 months, leading to a major case of shifting sands regarding the distribution of baseline.

Comparing an extreme value with a later single measurement will lead to an unavoidable case of regression to the mean, creating an apparent trend in the data where none may exist. Without proper context, this may lead to overly optimistic interpretations of the effect of an intervention, and overly small p-values. (Note that a Bayesian analysis is not immune to the misleading conclusions that might arise from this terrible definition of baseline.)
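
To make the regression-to-the-mean point concrete, here is a minimal simulation sketch in R. There is no intervention effect at all in these data, and every number (sample size, number of prior measurements, means, SDs) is hypothetical; the point is only that a maximum-of-prior-values baseline manufactures an apparent improvement:

    set.seed(2)
    n_subj  <- 500
    n_prior <- 6    # measurements available in the prior 12 months (this varies in practice)

    true_level <- rnorm(n_subj, mean = 100, sd = 10)    # each subject's stable underlying level
    prior <- matrix(rnorm(n_subj * n_prior, mean = true_level, sd = 15), nrow = n_subj)
    post  <- rnorm(n_subj, mean = true_level, sd = 15)  # a single post-"intervention" value

    baseline_max  <- apply(prior, 1, max)   # the "extreme" baseline
    baseline_last <- prior[, n_prior]       # a single pre-intervention value

    mean(post - baseline_max)    # strongly negative: an apparent improvement with no true effect
    mean(post - baseline_last)   # near zero, as it should be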

Conclusion

The definition of baseline is a "tiny decision" that can have major consequences for a statistical analysis. Yet the impact of this decision has not been well studied, especially in the context of a clinical trial, where a wide range of definitions may be written into a protocol without the expert advice of a statistician. Even a definition that has been well accepted -- that baseline is the last single pre-dose value before intervention -- has not been well studied in the scenario of a missing baseline-day measurement. Other decisions are often made without considering the impact on analysis, including some that may lead to wrong interpretations.

Monday, August 20, 2012

Clinical trials: enrollment targets vs. valid hypothesis testing

The questions raised in this Scientific American article ought to concern all of us, and I want to take some of these questions further. But let me first explain the problem.

Clinical trials and observational studies of drugs, biologics, and medical devices are a huge logistical challenge, not the least of which is finding physicians and patients to participate. The thesis of the article is that the classical methods of finding participants – mostly compensation – lead to perverse incentives to lie about one’s medical condition.

I think there is a more subtle issue, and it struck me when one of our clinical people expressed a desire not to put enrollment caps on large hospitals for the sake of fast enrollment. In our race to finish the trial and collect data, we are biasing our studies toward larger centers where there may be better care. This effect is exactly the opposite of the one posited in the article, where treatment effect is biased downward. Here, treatment effect is biased upward, with doctors more familiar with best delivery practices (many of the drugs I study are IV or hospital-based), best treatment practices, and more efficient care.

We statisticians can start to characterize the problem by looking at the treatment effect across sites, or by using hierarchical models to separate the center effect from the drug effect. But this isn't always a great solution, because low-enrolling sites, by definition, have a lot fewer people, and pooling is problematic because low-enrolling centers tend to have much more variation in level and quality of care than high-enrolling centers.
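
As a sketch of the hierarchical-model idea, something like the following R code (using the lme4 package; the data frame and variable names here are hypothetical) fits a random intercept for each site so that the between-center variation is modeled separately from the treatment effect:

    library(lme4)

    ## trial_data is assumed to contain: outcome (0/1), treatment (factor), site (factor)
    fit <- glmer(outcome ~ treatment + (1 | site),
                 data = trial_data, family = binomial)
    summary(fit)       # fixed effect: treatment; random effect: between-site variation

    ## One crude diagnostic: compare the estimated site effects against site enrollment
    ## to see whether high-enrolling centers look systematically different.
    ranef(fit)$site

It doesn't make the low-enrolling-site problem go away (sites with a handful of patients contribute almost nothing to their own random-effect estimates), but it at least makes the center-to-center variation explicit.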

We can get creative on the statistical analysis end of studies, but I think the best solution is going to involve stepping back at the clinical trial logistics planning stage and recasting the recruitment problem in terms of a generalizability/speed tradeoff.

Tuesday, May 19, 2009

Deep thought of the day

Biostatisticians should be involved much more than we currently are in the forming of a data collection strategy in clinical trials. Too often we are left in the margins of case report form design, and so we have to deal with the consequences of decisions made by those who don't live downstream of the data collection process.

I have a suspicion that clinical trials isn't the only place where this principle applies.

Wednesday, March 25, 2009

Challenges in statistical review of clinical trials

The new 2009 ASA Biopharm newsletter is out, and the cover article is important not for the advice it gives to statistical reviewers (I'm assuming the point of view of an industry statistician), but for a glimpse into the mindset of a statistical reviewer at the FDA. Especially interesting is the use of similar results in a trial as a warning sign for potential fraud or misrepresentation of the actual data.

Thursday, May 8, 2008

Can the blind really see?

That's Sen. Grassley's concern, stated here. (A thorough and well-done blog with some eye candy, though I don't agree with a lot of opinions expressed there.)

I've wondered about this question even before the ENHANCE trial came to light, but, since I'm procrastinating on getting out a deliverable (at 11:30pm!) I'm going to just say that I plan to write about this soon.

Friday, May 2, 2008

Well, why not?

Since I'm posting, I might as well point toward Derek Lowe's post about the failure of the Singulair/Claritin idea. Too bad for Merck, though one has to wonder how long this drug combination strategy among big pharma is going to play out. After all, wouldn't it be about as cheap to take two pills (since one is over-the-counter) as it would be to ask your insurance to fork it over for a prescription version of a combination? Heck, a lot of people take the combination separately now, anyway.

So at any rate, Derek deduces that the problem lies in efficacy. Is it possible to support a marketing claim that the combination is more than the sum of its parts? Merck apparently thinks so, but the FDA does not. Unless there's an advisory committee meeting on this, or the drug eventually gets approved, or efforts to get the results of all clinical trials posted publicly succeed, we won't know for sure. But what I do know is that for one of these combinations to gain marketing approval, at the very least there has to be a statistically significant synergistic effect. That means the treatment effect of the combination has to be greater than the sum of the treatment effects of the drugs alone. Studies that demonstrate this effect tend to need a lot of patients, especially if there are multiple dose levels involved. It isn't easy, and I've known more than one combination development program to fizzle out.
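
For what it's worth, here is a rough R sketch of what "more than the sum of its parts" amounts to statistically in a two-by-two factorial trial (drug A alone, drug B alone, both, neither). The data frame and variable names are hypothetical, and this assumes synergy is being judged on a simple additive scale:

    ## factorial_trial is assumed to contain: symptom_score (numeric), drugA (0/1), drugB (0/1)
    fit <- lm(symptom_score ~ drugA * drugB, data = factorial_trial)
    summary(fit)   # the drugA:drugB interaction coefficient tests whether the combined
                   # effect differs from the sum of the individual effects on this scale

Showing that the interaction term is not just nonzero but clinically meaningful, with adequate power, is where the large sample sizes come from.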

Update: but see this serious safety concern for Singulair reported by Pharmalot.

Monday, February 25, 2008

The buzz

Every once in a while I look through the keyword referrals and do an informal assessment of how people find this blog -- one that's geared for a rather narrow audience. Here are the most popular keywords:

  • O'Brien-Fleming (especially doing this kind of design in SAS)
  • Bayesian statistics in R
  • noninferiority
  • NNT (number needed to treat)
  • confidence intervals in SAS
On occasion I will get hits from more clinical or scientific searches, such as TGN1412 (the monoclonal antibody that caused a cytokine storm in healthy volunteers, leading to gangrene, multiple organ failure, and for those lucky/unlucky enough to survive, cancer), Avandia, or CETP inhibitors.

At 2000 hits in a year, this is clearly a narrowly-targeted blog. :D

Saturday, September 1, 2007

Bias in group sequential designs - site effect and the Cochran-Mantel-Haenszel odds ratio


It is well known that estimating treatment effects from a group sequential design results in bias. When you use the Cochran-Mantel-Haenszel (CMH) statistic to estimate an odds ratio, the number of patients within each site affects the bias in the estimate of the odds ratio. I've presented the results of a simulation study, where I created a hypothetical trial and then resampled from this trial 1000 times. I calculated the approximate bias in the log odds ratio (i.e. the log of the CMH odds ratio estimate) and plotted it against the estimated log odds ratio. The line is a cubic smoothing spline, produced by the SAS statement symbol i=sm75ps. The actual values are plotted underneath in light gray circles to give some idea of the variability.
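
As a companion to the description above, here is a minimal R simulation sketch of the same phenomenon (not the code behind the original figure): a two-stage design that stops early when an interim CMH test crosses a boundary, with the reported log odds ratio taken from whichever analysis ended the trial. The site sizes, response rates, true odds ratio, and boundary are all hypothetical.

    set.seed(1)

    true_or   <- 1.5                   # true common odds ratio (treatment vs. control)
    p_ctrl    <- c(0.30, 0.40, 0.50)   # control response rate at each of three sites
    n_site    <- c(20, 40, 80)         # per-arm enrollment per stage, by site
    z_interim <- 2.8                   # O'Brien-Fleming-like interim boundary

    p_trt <- plogis(qlogis(p_ctrl) + log(true_or))

    cmh <- function(t_resp, t_non, c_resp, c_non) {
      ## per-site counts: treatment responders/non-responders, control responders/non-responders
      tab <- array(rbind(t_resp, t_non, c_resp, c_non), dim = c(2, 2, length(t_resp)))
      mantelhaen.test(tab, correct = FALSE)   # CMH test plus the MH common odds ratio
    }

    sim_stage <- function() {
      t_resp <- rbinom(3, n_site, p_trt)
      c_resp <- rbinom(3, n_site, p_ctrl)
      list(t_resp = t_resp, t_non = n_site - t_resp,
           c_resp = c_resp, c_non = n_site - c_resp)
    }

    sim_trial <- function() {
      s1 <- sim_stage()
      interim <- cmh(s1$t_resp, s1$t_non, s1$c_resp, s1$c_non)
      if (sqrt(interim$statistic) > z_interim)
        return(log(interim$estimate))      # stopped early: report the interim estimate
      s2 <- sim_stage()                    # otherwise pool both stages and re-estimate
      s  <- Map(`+`, s1, s2)
      log(cmh(s$t_resp, s$t_non, s$c_resp, s$c_non)$estimate)
    }

    est <- replicate(1000, sim_trial())
    mean(est) - log(true_or)               # approximate bias in the reported log CMH odds ratio

Playing with n_site is a quick way to see how the mix of small and large sites changes the bias.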

Wednesday, August 1, 2007

A good joint statistical meetings week

While I was not particularly enthralled with the location, I found this year's Joint Statistical Meetings to be very good. Up until about 5 pm yesterday, I thought it was going to be so-so. There were good presentations on adaptive trials and Bayesian clinical trials, and even a few possible answers to some serious concerns I have about noninferiority trials. Last night I went to the biopharmaceutical section business meeting and struck up conversations with a few people from industry and the FDA (including the speaker who had some ideas on how to improve noninferiority trials). And shy, bashful me, who a couple of years ago had to drink 3 glasses of wine to get up the courage to approach a (granted, rather famous) colleague, was one of the last to leave the mixer.

This morning, I was still feeling a little burned out, but decided to drag myself to a session on Bayesian trials in medical devices. I found the speakers (who came from both industry and the FDA) top-notch, and at the end the session turned into a very nice dialog on the CDRH draft guidance.

I then went to a session on interacting with the FDA in a medical device setting, and again the speakers from both the FDA and industry were top-notch. Again, the talks turned into very good discussions about how to communicate most effectively with the FDA, especially from a statistician's or statistical consultant's point of view. I asked how to handle the situation where a sponsor wants to kick the statistical consultants out of the FDA interactions, even though it's not in the sponsor's best interest. The answer: speak the sponsor's language, which is dollars. Quite frankly, statistics is a major part of any clinical development plan, and unless the focus is specifically on chemistry, manufacturing, and controls (CMC), a statistician needs to be present for any contact with the FDA. (In a few years, it might be true for CMC as well.) If this is not the case, especially if it's consistently not the case throughout the development cycle of the product, the review can be delayed, and time is money. Other great questions were asked about the use of software and the submission of data. We all got an idea of what is required statistically in a medical device submission.

After lunch was a session given by the graphics section and the International Biometric Society (Western North American Region). Why it wasn't cosponsored by the Biopharmaceutical Section, I'll never know. The talks were all about using graphs to understand the effects of drugs, and how to use graphs to effectively support a marketing application or medical publication. The underlying message: get out of the 1960s line-printer era of illegible statistical tables and take advantage of the new tools available. Legibility is key in producing a graph, followed by the ability to present a large amount of data in a small area. In some cases, many dimensions can be included on a graph, so that the human eye can spot potentially complex relationships among variables. Some companies, notably big pharma, are far ahead in this arena. (I guess they have well-paid talent to work on this kind of stuff.)

These were three excellent sessions, and worth demanding more of my aching feet. Now I'm physically tired and ready to chill with my family for the rest of the week/weekend before doing "normal" work on Monday. But professionally, I'm refreshed.

Friday, July 20, 2007

Whistleblower on "statistical reporting system"

Whether you love or hate Peter Rost (and there seems to be very little in between), you can't work in the drug or CRO industry and ignore him. Yesterday, he and Ed Silverman (Pharmalot) broke a story on a director of statistics who blew the whistle on Novartis. Of course, this caught my eye.

While I can't really determine whether Novartis is "at fault" from these two stories (and the related echoes throughout the pharma blogs), I can tell you about statistical reporting systems, and why I think these allegations could impact Novartis's bottom line in a major way.

Gone are the days of doing statistics with pencil, paper, and a desk calculator. These days, and especially in commercial work, statistics are all done with a computer. Furthermore, no statistical calculation is done in a vacuum. Especially in a clinical trial, there are thousands of these calculations which must be integrated and presented so that they can be interpreted by a team of scientists and doctors who then decide whether a drug is safe and effective (or, more accurately, whether a drug's benefits outweigh its risks).

A statistical reporting system, briefly, is a collection of standards, procedures, practices, and computer programs (usually SAS macros, but it may involve programs in any language) that standardize the computation and reporting of statistics. Assuming they are well written, these processes and programs are general enough to process the data from any kind of study and produce reports that are consistent across all studies and, hopefully, across all product lines in a company. For example, there may be one program to turn raw data into summary statistics (n, mean, median, standard deviation) and present them in a standardized way in a text table. Since this is a procedure we do many times, we'd like to just be able to "do it" without having to fuss over the details: feed in the variable name (and perhaps some other details like the number of decimal places) and, voilà, the table. Not all statistics is that routine (good for me, because that means job security), but perhaps 70-80% is and can be made more efficient. Other programs and standards take care of titles, footnotes, column headers, formatting, tracking, and validation in a standardized and efficient way. This saves a lot of time in both programming and in the review and validation of tables.
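
As a toy illustration of the idea (written in R rather than as a SAS macro; the function and argument names are hypothetical, not part of any real reporting system), a single reusable function can turn any variable into a standardized summary row, so every table in every study presents n, mean, median, and SD the same way:

    summary_row <- function(x, label, digits = 1) {
      fmt <- function(v) formatC(v, format = "f", digits = digits)
      data.frame(
        Variable = label,
        n        = sum(!is.na(x)),
        Mean     = fmt(mean(x, na.rm = TRUE)),
        Median   = fmt(median(x, na.rm = TRUE)),
        SD       = fmt(sd(x, na.rm = TRUE)),
        stringsAsFactors = FALSE
      )
    }

    ## The same call works for any numeric variable in any study:
    set.seed(42)
    qtc <- rnorm(30, mean = 410, sd = 20)   # hypothetical QTc values (ms)
    summary_row(qtc, "QTc (ms)")

The point is not the few lines of code but the standardization: one validated piece of code, used everywhere, instead of the same logic rewritten (and re-reviewed) for every table.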

So far, so good. But what happens when these systems break? As you might expect, you have to pay careful attention to these statistical reporting systems, even going so far as to apply a software development life cycle methodology. If they break, you affect not just one calculation but perhaps thousands. And there is no easy way of knowing the scope: an obscure bug in the code might affect just 10 out of a whole series of studies, while a more serious bug might affect everything. If this system is applied to every product in house (and it should probably be general enough to apply to at least one category of products, such as all cancer products), the integrity of the data analysis for a whole series of products is compromised.

Allegations were also made that a contract programmer was told to change dates on adverse events. That could be a benign but bizarre request if the reasons for the change were well documented (it's better to change dates in the database than at the program level, because changes to a database are easier to audit, and hard-coding specific changes to specific dates keeps a program from being generalizable to other similar circumstances), or an ethical nightmare if the changes were made to make the safety profile of the drug look better. From Pharmalot's report, the latter was alleged.

You might guess the consequences of systematic errors in data submitted to the FDA. The FDA does have the authority to kick out an application if it has good reason to believe that its data is incorrect. This application has to go through the resubmission process, after it is completely redone. (The FDA will only do this if there are systematic problems.) This erodes the confidence the reviewers have in the application, and probably even all applications submitted by a sponsor who made the errors. This kind of distrust is very costly, resulting in longer review periods, more work to assure the validity of the data, analysis, and interpretation, and, ultimately, lower profits. Much lower.

It doesn't look like the FDA has invoked its Application Integrity Policy on Novartis's Tasigna or any other product. But it has invoked its right to three more months of review time, saying it needs to "review additional data."

So, yes, this is big trouble as of now. Depending on the investigation, it could get bigger. A lot bigger.

Update: Pharmalot has posted a response from Novartis. In it, Novartis reiterates their confidence in the integrity of their data and claims to have proactively shared all data with the FDA (as they should). They also claim that the extension to the review time for the NDA was for the FDA to consider amendments to the submission.

This is a story to watch (and without judgment, for now, since this is currently a matter of "he said, she said"). And, BTW, I think Novartis responded very quickly. (Ed seems to think that 24 hours was too long.)

Wednesday, March 28, 2007

A final word on Number Needed to Treat

In my previous post in this series I discussed how to create confidence intervals for the Number Needed to Treat (NNT). I just left it as taking the reciprocal of the confidence limits of the absolute risk reduction. I tried to find a better way, but I suppose there's a reason that we have a rather unsatisfactory method as a standard practice. The delta method doesn't work very well, and I suppose methods based on higher-order Taylor series will not work much better.

So what happens if the treatment has no statistically significant effect (the sample size is too small, or the treatment simply doesn't work)? The confidence interval for the absolute risk reduction will cover 0, say -2.5% to 5%. Taking reciprocals, you get an apparent NNT confidence interval of -40 to 20. A negative NNT is easy enough to interpret: an NNT of -40 means that for every 40 people you "treat" with the failed treatment, you get one fewer favorable outcome. A 0 absolute risk reduction results in NNT = ∞. So if the confidence interval for the absolute risk reduction covers 0, the NNT confidence interval must cover ∞. In fact, in the example above, we get the bizarre confidence set of -∞ to -40 together with 20 to ∞, NOT -40 to 20. The interpretation of this confidence set (it's no longer an interval) is that either you have to treat at least 20 people, and probably a lot more, to help one, or, if you treat 40 or more people, you might harm one. For this reason, for a treatment that doesn't reach statistical significance (i.e. whose absolute risk reduction interval includes 0), the NNT is often reported as a point estimate. I would argue that such a point estimate is meaningless. In fact, if it were left up to me, I would not report an NNT for a treatment that doesn't reach statistical significance, because the interpretation of statistical non-significance is that you can't prove, with the data you have, that the treatment helps anybody.
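
Here is a small R sketch of that reciprocal-of-the-limits problem, using a hypothetical absolute risk reduction (ARR) estimate and confidence interval chosen to match the numbers above:

    arr_hat <- 0.0125           # point estimate of the absolute risk reduction
    ci_arr  <- c(-0.025, 0.05)  # 95% CI for the ARR (hypothetical)

    ## Naively taking reciprocals of the limits gives the apparent "interval" -40 to 20 ...
    1 / ci_arr                  # -40  20

    ## ... but because the ARR interval covers 0, the correct confidence set for
    ## NNT = 1/ARR is the union of two half-lines: (-Inf, -40] and [20, Inf).
    ## Note that the interval (-40, 20) even excludes the point estimate:
    1 / arr_hat                 # 80, which lies outside -40 to 20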

Douglas Altman, a heavy hitter in medical statistics, has the gory details.


Wednesday, December 13, 2006

Realizations

This blog grew out of a need to separate my professional blogging from my personal blogging. With public trust in clinical trials waning in the wake of Vioxx and Ketek, there is a need for public information and transparency about how clinical trials are conducted. While more information won't solve all of the problems -- that requires a commitment on the part of drug and research companies as well -- I do find that a lot of the mistrust stems from a simple misunderstanding of the scientific process in general and of how drugs are studied in particular.

Most other blogging on this subject seems to come from doctors. I've found very few statistical bloggers, and I'm the only person I know of who is blogging with a purely biostatistical focus. Yet it is the biostatistician who has to determine whether a clinical trial can give a statistically valid answer, and the doctor or pharmacologist who decides whether that statistically valid answer has any meaning or relevance. This blog is about clinical trials and other research, and how to extract conclusions from them.

The material in this blog will draw from the news, my own personal experience, and even a bit of research, and is intended for a general and professional audience.

As for the title: in statistics a "realization" is one instance of data. We dream up these statistical models, use mathematics to say what we can about them without looking at real data, and then examine realizations to see how well our models and methods work in practice. The data that comes out of a clinical trial can be said to be a realization.
