Ann Thorac Surg 2002;74:1901-1908
© 2002 The Society of Thoracic Surgeons
a Providence Health System, Portland, Oregon USA
* Address reprint requests to Dr Grunkemeier, East Pavilion MOB, Suite 33, 9155 SW Barnes, Portland, OR 97225, USA.
e-mail: ggrunkemeier{at}providence.org
Abstract
Full Bayesian analysis is an alternative statistical paradigm, as opposed to traditionally used methods, usually called frequentist statistics. Bayesian analysis is controversial because it requires assuming a prior distribution, which can be arbitrarily chosen; thus there is a subjective element, which is considered to be a major weakness. However, this could also be considered a strength since it provides a formal way of incorporating prior knowledge. Since it is flexible and permits repeated looks at evolving data, Bayesian analysis is particularly well suited to the evaluation of new medical technology. Bayesian analysis can refer to a range of things: from a simple, noncontroversial formula for inverting probabilities to an alternative approach to the philosophy of science. Its advantages include: (1) providing direct probability statements, which are what most people wrongly assume they are getting from conventional statistics; (2) formally incorporating previous information in statistical inference of a data set, a natural approach which we follow in everyday reasoning; and (3) flexible, adaptive research designs allowing multiple looks at accumulating study data. Its primary disadvantage is the element of subjectivity, which some think is not scientific. We discuss and compare frequentist and Bayesian approaches and provide three examples of Bayesian analysis: (1) EKG interpretation, (2) a coin-tossing experiment, and (3) assessing the thromboembolic risk of a new mechanical heart valve.
"The Bayesian paradigm is conceptually simple, intuitively plausible, and probabilistically elegant" [1].
Readers of this journal are aware of the importance of statistics in the description and interpretation of surgical results. Standard statistical methods, and the philosophy of statistics that supports them, are variously referred to as classic, conventional, traditional, frequentist, or sampling theory statistics when it is necessary to distinguish them from the less well-known alternative called Bayesian statistics. This competing philosophy and set of analysis tools is named after a man, Bayes, and his theorem; it treats probability as subjective and allows preexisting external information to contribute formally to the interpretation of data. The first use of the word "new" in the title of this paper is a bit ironic, since Bayes' paper was published (posthumously) in 1763 [2], while the "classic" statistics arose from the works of Fisher [3] and Neyman and Pearson [4] in the 1920s.
In both the traditional (frequentist) and the Bayesian statistical paradigms, we are concerned with estimating values called parameters. They are unknown, and our research studies are designed to collect data to help estimate them. Simple examples, considered in more detail below, are the probability that a patient has coronary artery disease (CAD); the probability a tossed coin will fall heads; and the thromboembolism rate with a new mechanical heart valve. In traditional statistics, these parameters are fixed, constant values, but in Bayesian analysis they are random variables. Unlike a fixed parameter, which can take only a single value, a random variable is characterized by a probability distribution over the range of values the parameter can take. This seemingly simple distinction leads to quite divergent theories of analysis.
EKG interpretation
Bayes theorem is simply a formula that inverts conditional probability statements (Appendix 1). For example, if we know the probability that a person with CAD will have a positive electrocardiogram (EKG), then what is the probability that a person who tests positive on an EKG has CAD? They are not the same, and Bayes theorem is used to derive the latter from the former. Cardiology diagnosticians have been using this approach for many years [5, 6].
Electrocardiogram tests are not perfect, so we acknowledge that when a patient tests positive, there is a probability that the patient does (A) or does not (B) have CAD. Similarly, when a patient tests negative, there is another set of probabilities that the patient does (C) or does not (D) have CAD. These four situations exhaust the possibilities. Sensitivity of the test is the percentage of CAD patients who test positive: (A/(A+C)); specificity of the test is the percentage of non-CAD patients who test negative: (D/(B+D)). Given these values, when a certain patient tests positive, what is the probability that the patient has CAD? This post-test probability can be computed using Bayes theorem (Appendix 1), and depends on the patient's pretest probability of CAD, that is, the prevalence of CAD in such patients. Figure 1 shows post-test probabilities for the entire range of pretest probabilities, for three combinations of specificity and sensitivity.
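As a minimal sketch of this calculation, the post-test probability of CAD after a positive EKG can be computed directly from the sensitivity, the specificity, and the pretest probability. The function name and argument names are illustrative assumptions, not part of the original article; the example values (sensitivity and specificity of 0.9, pretest probabilities of 0.8 and 0.1) are those worked through in Appendix 1.

```python
def posttest_probability(pretest, sensitivity, specificity):
    """P(CAD | positive EKG) from Bayes theorem (illustrative helper)."""
    # P(+EKG | CAD) * P(CAD)
    true_positive = sensitivity * pretest
    # P(+EKG | not CAD) * P(not CAD)
    false_positive = (1.0 - specificity) * (1.0 - pretest)
    return true_positive / (true_positive + false_positive)


if __name__ == "__main__":
    for pretest in (0.8, 0.1):
        post = posttest_probability(pretest, sensitivity=0.9, specificity=0.9)
        print(f"pretest {pretest:.0%} -> post-test {post:.0%}")
```

The same positive test moves a patient with a high pretest probability to near certainty, while a patient with a low pretest probability ends up at only even odds.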
In this example, we used Bayes theorem to combine external information (the pretest or prior probability of CAD) with the observed data (positive EKG), to estimate posttest or posterior probabilities of CAD. This did not illustrate any subjective or controversial aspects. Next we consider an example where the prior probabilities, the analog of the fixed, pretest probabilities of CAD, are unknown and subject to interpretation. This is the essence of Bayesian analysis.
Coin-tossing experiment
Probability theory originated to solve questions related to gambling, and games of chance still provide instructive examples. Suppose we just acquired a newly-minted Oregon quarter and want to know, before using it for gambling purposes, if it is "fair" (50% of its tosses will fall heads) or unfair (biased). So we toss this coin 10 times, and get 8 heads.
Frequentist approach
Typically we are interested in showing a change or difference, so we pick a null hypothesis (in this case, that the coin is fair), which we hope to reject on the basis of our study data. The binomial distribution gives the probabilities of getting various numbers of heads. If the true probability is 50%, then the probability of getting 8 heads is only 4.4%. You might think that we reject the null hypothesis (at the 5% level) on this basis, but we must include the probability of results that are more extreme. The probability of 9 heads is 1.0% and of 10 heads is 0.1%, so we get a cumulative p-value of 5.5%. Moreover, we usually add in the other, equally extreme, end of the distribution (that is, the probabilities for getting 0, 1, or 2 heads), giving 11.0% for the final "p-value". So our coin is NOT (shown to be) biased, based on this exercise, which used scenarios that could have occurred, but did not.
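The tail probabilities quoted above can be checked with a few lines of code. This is a sketch using only the standard library; the helper name `binom_pmf` and the variable names are our own.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k heads in n tosses when P(heads) = p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, heads = 10, 8

# Probability of exactly 8 heads under the null hypothesis of a fair coin
p_exact = binom_pmf(heads, n, 0.5)

# Add the more extreme results (9 and 10 heads) to get the upper tail
upper_tail = sum(binom_pmf(k, n, 0.5) for k in range(heads, n + 1))

# For a fair coin the distribution is symmetric, so the lower tail
# (0, 1, or 2 heads) has the same probability as the upper tail
two_sided = 2 * upper_tail

if __name__ == "__main__":
    print(f"P(8 heads)        = {p_exact:.3f}")
    print(f"P(8 or more)      = {upper_tail:.3f}")
    print(f"two-sided p-value = {two_sided:.3f}")
```

The unrounded two-sided value is 112/1024, or about 10.9%; the 11.0% in the text comes from doubling the rounded one-sided figure of 5.5%.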
However, if we had made 100 tosses and gotten 80 heads (same coin, still 80% heads), then the coin IS biased: the difference from 50% is "highly significant" (p<0.000000001). What is the chance that with only 10 tosses we would reach significance if the coin were biased? That is called the power of the test, and for 10 tosses it is only 38%, even if the true probability of heads were 80%. So, even though we cannot reject the null hypothesis, we should not accept it, either. This poses a dilemma, and because of this issue, it is currently recommended to emphasize confidence intervals that intrinsically incorporate the sample sizes, rather than hypothesis tests.
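The power figure for 10 tosses can be reproduced with a short sketch. We assume here that the test is the two-sided exact binomial test at the 5% level, with the p-value taken as twice the smaller tail; the function names are illustrative.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k heads in n tosses when P(heads) = p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def two_sided_p(k, n):
    """Two-sided p-value against a fair coin: twice the smaller tail, capped at 1."""
    upper = sum(binom_pmf(j, n, 0.5) for j in range(k, n + 1))
    lower = sum(binom_pmf(j, n, 0.5) for j in range(0, k + 1))
    return min(1.0, 2 * min(upper, lower))


def rejection_region(n, alpha=0.05):
    """Head counts that would lead to rejecting the fair-coin hypothesis."""
    return [k for k in range(n + 1) if two_sided_p(k, n) <= alpha]


def power(n, true_p, alpha=0.05):
    """Probability of landing in the rejection region when P(heads) = true_p."""
    return sum(binom_pmf(k, n, true_p) for k in rejection_region(n, alpha))


if __name__ == "__main__":
    print("reject for head counts:", rejection_region(10))
    print(f"power at 80% heads: {power(10, 0.8):.2f}")
```

With 10 tosses, only 0, 1, 9, or 10 heads lead to rejection, which is why a coin that truly falls heads 80% of the time is detected less than half the time.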
In this experiment, the single estimate of 80% is a point estimate. Based on only 10 tosses, this point estimate has low precision. We quantify this imprecision by using an interval estimate or confidence interval (CI). Many methods have been proposed to compute CIs for binomial percentages (Appendix 2). Four choices are shown in Table 1. (One reason for including several CI methods that yield different answers is to show another element of subjectivity in classic analysis.) When the confidence interval does not include 50.0%, a hypothesis test would reject the null hypothesis (the true probability of heads is 50%) and declare the coin biased. This is the case with two of the CIs, but not so with the other two. The "exact" interval uses the binomial distribution, which corresponds to the p-value approach used above. A confidence interval contains the point estimate and gets narrower as the number of tosses grows. In a way, it contains the range of values that are consistent with the data we have observed.
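To illustrate how the choice of method can change the conclusion, the sketch below computes two common 95% intervals for 8 heads in 10 tosses: the simple normal-approximation (Wald) interval and the "exact" (Clopper-Pearson) interval. These two are chosen by us for illustration and are not claimed to be the specific methods tabulated in Table 1; the function names are assumptions.

```python
from math import comb, sqrt


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(0, k + 1))


def wald_ci(s, n, z=1.96):
    """Normal-approximation interval: estimate +/- z standard errors."""
    p_hat = s / n
    half_width = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


def bisect(f, lo, hi, tol=1e-10):
    """Root of f on [lo, hi] by bisection, assuming f changes sign once."""
    f_lo_positive = f(lo) > 0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if (f(mid) > 0) == f_lo_positive:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def exact_ci(s, n, alpha=0.05):
    """Clopper-Pearson interval obtained by inverting the binomial tails."""
    # Lower limit: the p at which P(X >= s) equals alpha/2
    lower = 0.0 if s == 0 else bisect(
        lambda p: (1 - binom_cdf(s - 1, n, p)) - alpha / 2, 0.0, 1.0)
    # Upper limit: the p at which P(X <= s) equals alpha/2
    upper = 1.0 if s == n else bisect(
        lambda p: binom_cdf(s, n, p) - alpha / 2, 0.0, 1.0)
    return lower, upper


if __name__ == "__main__":
    print("Wald :", wald_ci(8, 10))
    print("Exact:", exact_ci(8, 10))
```

For these data the Wald interval lies entirely above 50% (and its upper limit exceeds 100%, a known defect of the approximation in small samples), while the exact interval includes 50%.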
Bayes methodology provides a formal way of incorporating prior experience or opinions into the analysis. That is a definite advantage, but the price paid is that we must quantify this prior knowledge, and that can be a very subjective exercise. That is the criticism of the theory. I might pick one distribution, you another, both with good and justifiable (to ourselves) reasons. Two steps are involved: (1) select a prior distribution for the quantity of interest, then (2) use the observed data to update this distribution, converting it to a posterior distribution. This is equivalent to using the finding of a positive EKG to convert a patient's pretest probability of CAD to a post-test probability.
Determine the prior distribution
We recognize that all coins have a certain similarity and feel compelled to formally incorporate this knowledge into our estimating. How? By quantifying this knowledge, in the form of a prior distribution. Three possibilities are shown in Figure 2, as probability density functions from the beta distribution (Appendix 3).
Diffuse. We think the true probability is 50% and are "pretty sure" that it is between 25% and 75%. This opinion is shown by the thicker curve in Figure 2, which has 90% of the probability between 25% and 75%.
Concentrated. We think the probability is 50%, and would be surprised if it were outside 40% to 60%. This opinion is shown by the thickest curve in Figure 2, which has 90% of the prior probability between 40% and 60%.
Combine the observed data with the prior distribution
Bayes theorem is invoked to combine the observed data (8 heads in 10 tosses) and the prior probability distributions to produce posterior probability distributions (Appendix 3). Figure 3 shows the posterior distributions for each of the priors in Figure 2. The mean of each of these distributions (vertical lines) is between the mean of the priors (all 50%) and the observed data (80%). All three of these estimates are more intuitively appealing than the frequentist estimate of 80%. The less concentrated the prior distribution, the more influence the observed data have on the posterior distribution. The most concentrated prior distribution (arguably the most reasonable) with a posterior mean of 54% is influenced least by the data (Fig 3; Table 1).
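The updating step is simple enough to sketch in code, using the conjugate beta-binomial rule given in Appendix 3. The specific beta parameters below are our own assumptions, chosen so that roughly 90% of the prior probability falls in the ranges described for the diffuse and concentrated priors; they are not taken from the article.

```python
def beta_update(a, b, heads, tails):
    """Posterior beta parameters after observing the given heads and tails."""
    return a + heads, b + tails


def beta_mean(a, b):
    """Mean of a beta(a, b) distribution."""
    return a / (a + b)


# Assumed prior parameters (hypothetical, symmetric about 50%):
#   beta(5, 5)   has about 90% of its probability between 25% and 75%
#   beta(33, 33) has about 90% of its probability between 40% and 60%
priors = {"diffuse": (5, 5), "concentrated": (33, 33)}

heads, tails = 8, 2  # the observed data: 8 heads in 10 tosses

if __name__ == "__main__":
    for name, (a, b) in priors.items():
        post_a, post_b = beta_update(a, b, heads, tails)
        print(f"{name:>12}: prior mean {beta_mean(a, b):.2f}, "
              f"posterior beta({post_a}, {post_b}), "
              f"posterior mean {beta_mean(post_a, post_b):.2f}")
```

Under these assumed parameters, the concentrated prior yields a posterior mean of about 54%, in line with the value quoted above, while the diffuse prior is pulled further toward the observed 80%.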
Confidence intervals are not credible
We discussed the point estimates from the Bayes approach, but what about the interval estimates? Table 1 also contains these Bayes estimates, called "credible intervals" or "probability intervals". Most physicians think that a 95% confidence interval has a meaning that can be described by this (direct) statement (A): "The probability the true mean value is in the interval is 95%." But, the true meaning is more convoluted. Frequentists assume that the true mean value is fixed, not random, so a probability statement about it has no meaning. Instead, they can only claim (indirectly) that (B): "If this experiment were repeated many times, 95% of the confidence intervals would contain the mean value."
Bayes methods, on the other hand, consider that such parameters are random variables, so probability statement A is exactly what Bayesians can claim about their intervals. It could be argued that since most physicians use statement A to describe "confidence" intervals, what they really want are "probability" intervals. Since to get them they must use Bayesian methods, then they are really Bayesians at heart!
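A probability interval can be read directly off the posterior distribution. The sketch below computes an equal-tailed 95% interval for a beta posterior with integer parameters; as an assumption for illustration it uses a flat beta(1, 1) prior, which with 8 heads in 10 tosses gives a beta(9, 3) posterior. The function names are our own, and the cumulative probability relies on the standard identity linking the beta distribution with integer parameters to a binomial tail.

```python
from math import comb


def beta_cdf_int(x, a, b):
    """CDF of beta(a, b) at x for positive integers a and b.

    Uses the identity: P(beta(a, b) <= x) = P(Binomial(a + b - 1, x) >= a).
    """
    n = a + b - 1
    return sum(comb(n, k) * x**k * (1 - x) ** (n - k) for k in range(a, n + 1))


def beta_quantile_int(q, a, b, tol=1e-10):
    """Value x with CDF(x) = q, found by bisection on [0, 1]."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if beta_cdf_int(mid, a, b) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def credible_interval(a, b, level=0.95):
    """Equal-tailed probability interval for a beta(a, b) posterior."""
    tail = (1 - level) / 2
    return beta_quantile_int(tail, a, b), beta_quantile_int(1 - tail, a, b)


if __name__ == "__main__":
    # Assumed flat prior beta(1, 1) updated with 8 heads and 2 tails
    lower, upper = credible_interval(1 + 8, 1 + 2)
    print(f"95% probability interval: {lower:.3f} to {upper:.3f}")
```

Unlike a confidence interval, this interval supports statement A: given the prior and the data, the probability that the parameter lies inside it is 95%.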
Evaluating new technology
Analysis of new medical technology presents an ideal opportunity for Bayesian analysis, as summarized in a recent review article on its use in health technology assessment [7], and it is being actively promoted by the FDA. For the past several years, FDA has hosted meetings and workshops to discuss Bayesian approaches to new marketing approvals, given presentations at external meetings [8, 9], and supported the Bayesian research work of others [10].
The Bayesian approach to statistical analysis is more adaptive and less rigid than the traditional frequentist approach. The difference is quite apparent in the approach to analysis of clinical studies in which information is accumulated in an ongoing fashion. The traditional approach is to pick a sample size using estimates of anticipated performance; conduct the study on that number of patients; analyze the data at the end of the study; and either accept or reject the null hypothesis based on the p-value of the pivotal test statistic. Some variations of this allow for interim analyses of the data and the possibility of prematurely rejecting the null hypothesis. But these interim looks must be built into the original design, and require additional patients to be entered to allow for the extra hypothesis tests. It is important not to violate this rigid study protocol in order to protect the value of the p-value, to keep it from exceeding its nominal size. This is said to be an objective design, in that only the data collected in the study are allowed to influence the conclusion of the study. But in reality, information outside of the study is utilized, since we need to supply guesses about the mean and variance of the treatment[s] being studied in order to estimate the required sample size.
The FDA often requires a clinical study to have an independent data monitoring board, composed of persons not directly connected with the study. Their job is to review the data periodically, and determine whether the study should be stopped because of safety issues. It is ethically necessary to stop a study if it is revealed to be using a harmful therapy. Another reason for stopping a study prematurely could be that the new therapy has already declared itself to be so superior to the control that the remainder of the study patients could not possibly overturn that conclusion. A final reason for premature stopping could be that there is no difference between the treatments and that continuing the study until the prefixed number of patients have been studied could not possibly change this conclusion. The remainder of the study would thus be using resources and patients to provide no useful information. For the monitoring board to fulfill its obligation, its members must have access to the data and be able to assess it statistically. But these periodic peeks at the data invalidate the formal study design, since they are not protected by provisions for interim analyses. Thus there is a fundamental incongruence between protecting the p-value and protecting the patients.
How to reconcile this issue? Bayesian analysis is perfectly suited to repeated, interim analyses. It does not require fixed prior sample sizes and one-time assessment, but can utilize accumulating information that becomes available as the study progresses, as the following example demonstrates.
New mechanical heart valve
Assume we want to clinically study a new heart valve, and take one endpoint, eg, thromboembolism (TE). The Poisson distribution is used to describe and summarize event rates (Appendix 3). We are interested in estimating the event rate with the new valve and comparing it to an established standard. Just as with coin-tossing, we have previous experience with prosthetic heart valves. Thus, for example, although the TE rate with a new heart valve could theoretically be anything from 1% per year to 100% per year (or even higher), we know that it is more likely to be the former than the latter. We do not think a coin would have 25% or 75% heads, nor that a heart valve would have a TE rate greater than, say, 10% per year.
Prior distribution for the thromboembolism rate
The current FDA guidance document for heart valves defines objective performance criteria (OPC) for valve-related complications [11]. The OPC represents the performance of approved, currently available valves, and a new device must demonstrate that its performance is comparable. Comparability is defined as "significantly better than twice as bad" as the OPC; thus the null hypothesis (which we hope to reject) is that the valve's true mean rate is twice as high as the OPC. Using a one-sided test with a power of 0.80 and size of 0.05, this resulted in a required sample size of 324 patient-years for TE with mechanical valves, whose OPC is 3.0%/year [12]. (For OPC of 1.2% per year, the rates for leak and endocarditis, 800 patient-years are needed; this is the minimum sample size recommended for new approval studies.)
To be somewhat synchronous with the standard hypothesis testing design, we take our prior distribution to be the one used to represent the null hypothesis in the sample size estimation. This was a gamma distribution with a mean of twice the OPC (6% per year in this case), and the 5% quantile (the size of the hypothesis test) at a critical value (about 3.7% per year in this case) [12]. This prior distribution is indicated in Figure 5 by a dashed line. This might be called a conservative or "skeptical" [13, 14] prior, and would be suitable for one who was pessimistic about a new product and wanted to require a high burden of proof from the data. This is consistent with the null hypothesis, a dire situation which we hope the observed data is good enough to reject.
Bayesian analysis requires an estimate of the distribution of the parameter(s) of interest to be made before the study starts. The analysis uses this estimate directly, and the result will vary depending on what it was. Thus two different analysts may get two different answers. However, one can disclose what the prior was and why, and let the consumer of the information see if they agree; or give a range of prior distributions, including one for the skeptic and one for the enthusiast [13]. There are also elements of subjectivity in classical analyses. Sample size estimation requires guesses, based on prior information. A Bayesian simply incorporates this information formally into the analysis rather than using it to fix the sample size beforehand. A predetermined sample size can be disadvantageous. If it is too small, the study will not have a high probability of reaching its appropriate conclusion. If the sample size is too large, the study will go on longer than it would have to, using more resources and patients than necessary. Bayesian analysis readily permits sequential assessments of the data. A bonus of using this method is that the resulting analysis has a more direct and satisfying interpretation, permitting, for example, true probability intervals to be given.
Our examples were quite simple compared to the wide range of problems for which Bayesian analysis can be used. We used two simple probability distributions, the binomial and the Poisson, and their conjugate priors (Appendix 3). Complex problems can also be solved by Bayes methods, using software such as the BUGS system [15], for which an excellent website exists (http://www.mrc-bsu.cam.ac.uk/bugs/). Another interesting website, devoted to Bayesian analysis, has a wide range of information (http://www.bayesian.org/). And an easy-to-read review of medical statistics [16] is also available on the web (http://hesweb1.med.virginia.edu/biostat/teaching/bayes.short.course.pdf). Bayesian analysis is applicable in many other typical statistical situations, and is naturally suited for decision analysis [17] and for meta-analysis [18, 19], among other applications. It is also ideally suited for the evaluation of clinical studies of new medical technology, and FDA is now actively promoting this usage.
Appendix 1
Bayes theorem
Conditional probability means the "probability of an event or condition (say, A) given that another condition (B) exists." This statement is written in mathematical shorthand as P(A|B), where the "|" symbol stands for "given", and is defined by the formula:
$$P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)}$$
In its simplest form, Bayes Theorem can be thought of as a tool to invert the conditional probability: if we know P(A|B), we can use it to find P(B|A). By the definition above,
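$$P(B \mid A) = \frac{P(A \text{ and } B)}{P(A)} = \frac{P(A \mid B)\,P(B)}{P(A)}$$
$$P(A) = P(A \mid B)\,P(B) + P(A \mid \text{not } B)\,P(\text{not } B)$$
$$P(B \mid A) = \frac{P(A \mid B)\,P(B)}{P(A \mid B)\,P(B) + P(A \mid \text{not } B)\,P(\text{not } B)}$$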
Cardiac surgeons may be familiar with Bayesian risk models for operative mortality [20]. They are based on the above equation, with B being operative death. But, instead of just one prior condition, A, there are several, one for each risk factor in the model.
Rain, go away
Here is a nonmedical example to give the inclined reader an opportunity to apply the formula. We often hear it said "it always rains in Oregon on the weekend." Suppose condition A = RAIN and condition B = WEEKEND, and that this allegation is partly true: the probability of rain on a weekend day, P(A|B), is 50% and the probability of rain on a weekday, P(A|notB), is only 25%. If it is now raining outside, what is P(B|A), the probability that it is now the weekend?
The prior probability of B (weekend), without knowledge of the meteorological conditions, is 2/7, the fraction of weekend days in a week. Given, as above, that P(A|B)= 0.50 and P(A|not B) = 0.25, we can use Bayes theorem to find the solution:
$$P(B \mid A) = \frac{0.50 \times 2/7}{0.50 \times 2/7 + 0.25 \times 5/7} = \frac{4}{9} \approx 44\%$$
EKG interpretation
This same concept can be used in more practical situations, eg, the EKG interpretation example in the text. We know the sensitivity and specificity of the EKG test and the pretest probability of CAD for the patient. We want to know the post-test probability of CAD, that is, the probability of CAD given a positive EKG (+EKG). We use Bayes formula, above, to solve this by substituting: P(CAD) is the pretest (unconditional) probability of CAD; P(+EKG|CAD) = sensitivity; P(+EKG|not CAD) = 1- Specificity.
For the first patient in the example, P(CAD|+EKG) = 0.9×0.8/(0.9×0.8+(1-0.9)×0.2) = 97%, shown by the filled circle in Figure 1. For the second patient in the example, P(CAD|+EKG) = 0.9×0.1/(0.9×0.1+(1-0.9)×0.9) = 50%, shown by the open circle in Figure 1.
Appendix 2
Confidence intervals
Two primary aspects of conventional statistical analysis are P values ("Is the difference significant?") and confidence intervals ("What is the probable range of the estimate?"). Because of idiosyncrasies of hypothesis testing, it is often recommended to emphasize confidence intervals instead. This is especially important in the case of negative results (nonsignificant p values).
There are many options for computing a confidence interval (CI), and they give somewhat different results. For the binomial distribution used in the coin toss example, 12 methods have been reviewed [21]. For the Poisson distribution used in the heart valve example, 13 methods have been reviewed [22].
Table 1 contains four CIs:
Appendix 3
Probability distributions
Many different probability distributions are used to describe random variables. Best known is the normal distribution, the common bell-shaped curve, which describes many natural phenomena. The binomial distribution is used to describe binary (dichotomous) data and the Poisson distribution is used to describe count data.
Bayes formula was given in Appendix 1 for converting simple pretest, or prior probabilities, into post-test, or posterior probabilities. An analogous formula can be used for entire probability distributions, f and g, and unknown parameter(s) B:
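$$g(B \mid \text{data}) = \frac{f(\text{data} \mid B)\, g(B)}{\int f(\text{data} \mid B)\, g(B)\, dB}$$
Since the denominator does not contain the parameters, it is often omitted and the above is written as:
$$g(B \mid \text{data}) \propto f(\text{data} \mid B)\, g(B)$$
where ∝ means proportional to. The first term on the right-hand side is the likelihood of the data, so the above can be written more generally as,
posterior distribution ∝ likelihood × prior distribution.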
When the prior and posterior distributions are members of the same family, this distribution is called conjugate to the likelihood. The beta distribution has two parameters, a and b, and is a conjugate prior to the binomial distribution. If the prior is β(a, b) and the observed data are binomial (s successes in n tries), then the posterior is β(a+s, b+n-s). This was used for the coin toss example (Figs 2 and 3).
The gamma distribution has two parameters, the shape k and the (inverse) scale r, and is a conjugate prior to the Poisson distribution. Thus if the prior is Γ(k, r) and the observed data are Poisson (e events in t patient-years), then the posterior is Γ(k+e, r+t). This was used for the heart valve example (Figs 5 and 6). Both of these families of distributions, beta and gamma, are very flexible and are widely used for prior distributions.
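Because the gamma update simply adds events to the shape and patient-years to the (inverse) scale, the posterior can be refreshed at every interim look, and the result after several looks is the same as if all the data had been analyzed at once. The sketch below illustrates this; the prior parameters and the interim event counts are entirely hypothetical and are not the values used for the heart valve example.

```python
def gamma_update(k, r, events, patient_years):
    """Posterior gamma parameters after observing Poisson event data."""
    return k + events, r + patient_years


def gamma_mean(k, r):
    """Mean of a gamma distribution with shape k and inverse scale r."""
    return k / r


# Hypothetical skeptical prior with mean 6 events per 100 patient-years
prior = (6.0, 100.0)

# Hypothetical interim looks: (events, patient-years) accrued since last look
looks = [(2, 100.0), (3, 120.0), (1, 104.0)]

# Update the posterior sequentially, one look at a time
sequential = prior
history = []
for events, patient_years in looks:
    sequential = gamma_update(*sequential, events, patient_years)
    history.append(gamma_mean(*sequential))

# Update once with the pooled data
pooled = gamma_update(*prior,
                      sum(e for e, _ in looks),
                      sum(t for _, t in looks))

if __name__ == "__main__":
    for i, mean in enumerate(history, start=1):
        print(f"after look {i}: posterior mean rate {mean:.1%} per year")
    print("sequential equals pooled:", sequential == pooled)
```

This is the property that makes repeated looks at accumulating data natural in the Bayesian framework: the posterior from one look serves as the prior for the next.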
References