Ann Thorac Surg 2005;80:2106-2113
© 2005 The Society of Thoracic Surgeons
a Department of Surgery, Caritas St. Elizabeth's Medical Center, Boston, Massachusetts
b Department of Cardiac Surgery, Massachusetts General Hospital, Boston, Massachusetts
c Department of Cardiac Surgery, Boston University Medical Center, Boston, Massachusetts
d Division of Cardiac Surgery, Brigham and Women's Hospital, Boston, Massachusetts
e Department of Health Care Policy, Harvard Medical School, and Department of Biostatistics, Harvard School of Public Health, Boston, Massachusetts
Accepted for publication June 28, 2005.
* Address correspondence to Dr Shahian, Department of Surgery, Caritas St. Elizabeth's Medical Center, 736 Cambridge St, Boston, MA 02135 (Email: david.shahian{at}caritaschristi.org).
| Abstract |
|---|
METHODS: Extensively validated and audited data were available for all 4,603 isolated coronary artery bypass grafting procedures performed at 13 Massachusetts hospitals during 2002. To produce the official Massachusetts cardiac surgery report card, a 19-variable predictor set and a hierarchical generalized linear model were employed. For the current study, this same analysis was repeated with the 14 predictors used in the New York Cardiac Surgery Reporting System. Two additional analyses were conducted using each set of predictor variables and applying standard logistic regression. For each of the four combinations of predictors and models, the point estimates of risk-adjusted 30-day mortality, 95% confidence or probability intervals, and outlier status were determined for each hospital.
RESULTS: Overall unadjusted mortality for coronary bypass operations was 2.19%. For most hospitals, there was wide variability in the point estimates and confidence or probability intervals of risk-adjusted mortality depending on statistical model, but little variability relative to the choice of predictors. There were no hospital outliers using hierarchical models, but there was one outlier using logistic regression with either predictor set.
CONCLUSIONS: When used to compare provider performance, logistic regression increases the possibility of false outlier classification. The use of hierarchical models is recommended.
| Introduction |
|---|
To investigate how the choice of a particular set of predictors and statistical methodology may affect the results of cardiac surgery report cards, we have utilized the highly audited data set that formed the basis of the Massachusetts 2002 cardiac surgery report card [13].
| Material and Methods |
|---|
There were 13 hospitals performing CABG in Massachusetts in 2002, of which two were new programs with a limited number of cases. There were 53 active cardiac surgeons in the Commonwealth with a median of 95 isolated CABG admissions. All of the programs in the state were major academic or tertiary centers, or were sponsored by such a center. Of 7,661 hospital admissions in which cardiac surgery was performed in Massachusetts in 2002, 4,604 involved isolated CABG procedures. One patient was lost to follow-up in Europe, leaving 4,603 patients available for analysis.
All cardiac surgery data were collected, cleansed, audited, validated, analyzed, and warehoused at the Massachusetts Data Analysis Center (Mass-DAC), which was based in the Department of Health Care Policy at Harvard Medical School. The Massachusetts Inpatient Acute Hospital Case Mix and Charge database and the Massachusetts Mortality Index were used to validate the data submissions. At regular intervals, each hospital received reports on invalid, missing, inconsistent, or out-of-usual range data, and they were given 30 days to justify or correct them. One hundred fourteen data quality reports were issued, with a mean of 8.8 per hospital (range, 5 to 12).
Fourteen informational meetings were held between August 15, 2002, and October 13, 2004, regarding the 2002 data. Separate meetings were held for data managers, chiefs of cardiac surgery, and members of the Mass-DAC Cardiac Advisory Board (composed of experts from other major database and quality monitoring organizations).
Two separate audits were conducted, the first in the spring of 2003 by MassPro, a Massachusetts quality improvement organization. This audit comprised approximately 500 cases, including all deaths (the endpoint in our analysis was 30-day all-cause mortality) plus a random sample of surviving patients from each hospital. After final data submission closeout and initial cleansing, a second audit was performed of selected data fields including cases that were initially coded as "other cardiac surgery," urgent and emergent procedures, and certain patient codes such as unstable angina, chronic obstructive pulmonary disease, or advanced New York Heart Association class. When necessary, hospitals were required to submit additional documentation including histories and physicals, progress notes, operative notes, intensive care unit flow charts, and discharge summaries. Of 1,820 charts initially identified for audit, 724 were selected for review by an adjudication committee consisting of three cardiac surgery members from three different institutions. Unanimous consent was required for each problematic entry. A total of 835 changes were made by this committee, more than half of which were in one institution that had systematically miscoded angina.
For the public report card, a 19-variable Mass-DAC predictor set was derived based on expert literature review. A random intercept, two-level hierarchical generalized linear model (HGLM) was chosen as the analytical technique [14] (see Statistical Appendix), and was estimated using BUGS software (Bayesian Inference Using Gibbs Sampling, version 0.60; MRC Biostatistics Unit, Cambridge, United Kingdom). The specific BUGS code for this study is available from the authors upon request. Exchangeability was assumed across all providers [15, 16]. Because hospital characteristics were excluded from the model, hospitals were compared globally, not just to similar hospitals [17, 18]. A standardized mortality incidence rate was determined for each provider, which is conceptually similar to the risk-adjusted mortality (RAM) used in report cards based upon logistic models. It is computationally quite different, however, and is determined by dividing the "smoothed," risk-adjusted, provider-specific estimate of mortality by the estimate of expected mortality obtained using the average intercept for all Massachusetts providers (see Statistical Appendix). This quantity, similar to the observed/expected ratio, is then multiplied by the state unadjusted mortality rate to obtain the standardized RAM. The credible 95% probability interval (PI) for each provider was then estimated and was compared with the state average mortality of 2.19%. As there is no simple closed-form solution for the hierarchical estimator of risk-adjusted mortality (and 95% PI), models were estimated using Gibbs sampling. After an initial burn-in of 2,000 iterations, parameter estimates were based on the subsequent 5,000 draws. Ninety-five percent PIs were obtained through identification of the 2.5th and 97.5th percentiles of the 5,000 RAMs. This was the methodology used to construct the first Massachusetts public report card, published in October 2004.
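The percentile step described above can be illustrated with a minimal Python sketch. It is not the BUGS code used for the report card: the function names, the linear-interpolation percentile rule, and the simulated draws are assumptions made only for illustration. A hospital is flagged here only when its entire interval lies on one side of the state rate of 2.19%.

```python
import random

STATE_MORTALITY = 2.19  # percent; state unadjusted rate reported in this paper


def percentile(sorted_values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a pre-sorted list."""
    if not sorted_values:
        raise ValueError("need at least one draw")
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1.0 - frac) + sorted_values[hi] * frac


def probability_interval(ram_draws, level=95.0):
    """Equal-tailed interval from posterior draws of one hospital's RAM."""
    ordered = sorted(ram_draws)
    tail = (100.0 - level) / 2.0
    return percentile(ordered, tail), percentile(ordered, 100.0 - tail)


def outlier_status(interval, reference=STATE_MORTALITY):
    """Flag a hospital only if its whole interval excludes the reference rate."""
    lower, upper = interval
    if lower > reference:
        return "high outlier"
    if upper < reference:
        return "low outlier"
    return "not an outlier"


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical chain: 2,000 burn-in iterations discarded, 5,000 draws kept.
    chain = [rng.lognormvariate(0.9, 0.3) for _ in range(7000)]
    kept = chain[2000:]
    interval = probability_interval(kept)
    print(interval, outlier_status(interval))
```

Because the interval is read directly from the retained draws, no large-sample approximation is needed, which is one practical difference from the confidence intervals used with the logistic models.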
For the purposes of this paper, three additional analyses were then conducted using the exact same data set. We first recomputed the HGLM estimates using a "pseudo-New York" predictor set based on the 14 variables included in the New York Cardiac Surgery Reporting System (CSRS). Slight modifications were necessary because the definitions varied somewhat between the STS NCD and CSRS databases. Finally, two additional analyses were performed using both the Mass-DAC and CSRS predictor variables and applying standard logistic regression, the method used in the New York CSRS, the NNE, the STS NCD, and the Department of Veterans Affairs [1, 2, 5, 8, 19]. Ninety-five percent CIs were calculated for the logistic models [20]. Therefore, in total, we calculated hospital-specific, risk-adjusted mortality rates and determined outlier status using four combinations of predictor variables and statistical methodologies.
The prior distribution for the between-hospital variance parameter ($\tau^2$) is the key determinant of the degree of shrinkage in our HGLM. We selected a gamma distribution, gamma (0.001, 0.001), with a mean of 1 and variance of 1,000 for the between-hospital precision parameter $\tau^{-2}$, the inverse of the between-hospital variance parameter. This "just proper" distribution has been used in previous CABG profiling studies and is often a reasonable choice, although it does favor low variance [16, 21]. We assessed the robustness of our conclusions by varying the values of the mean and variance of this gamma distribution.
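The stated mean and variance of the prior follow from the shape/rate form of the gamma distribution, in which the mean is shape/rate and the variance is shape/rate squared. The short Python check below assumes that parameterization; the function name is hypothetical.

```python
def gamma_mean_variance(shape, rate):
    """Mean and variance of a gamma distribution in shape/rate form."""
    if shape <= 0 or rate <= 0:
        raise ValueError("shape and rate must be positive")
    return shape / rate, shape / rate ** 2


if __name__ == "__main__":
    # The prior used in the paper, then the two alternatives from the sensitivity analysis.
    for shape, rate in [(0.001, 0.001), (0.01, 0.01), (0.1, 0.1)]:
        mean, variance = gamma_mean_variance(shape, rate)
        print(f"gamma({shape}, {rate}): mean={mean:.3g}, variance={variance:.3g}")
```

All three priors share a mean of 1 for the precision; only the variance changes (1,000, 100, and 10), so the alternatives are progressively more informative.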
It is also important that the hierarchical model employ estimates of between-hospital standard deviation that are uninfluenced by potential aberrant providers. Otherwise, the model might accommodate excessive variability and thus have reduced sensitivity to detect true outliers. Using an odds ratio median and range methodology [21], we assessed the reasonableness of our estimate of between-hospital standard deviation $\tau$ when using the gamma distribution.
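As an illustration of the range part of this check, the sketch below assumes that hospital intercepts are normally distributed on the log-odds scale with standard deviation tau. The ratio of the 97.5% point to the 2.5% point of the odds ratio distribution is then exp(2 × 1.96 × tau). The function name is hypothetical, and the median odds ratio is not computed here because its exact definition is given in reference [21].

```python
import math
from statistics import NormalDist


def odds_ratio_range(tau, coverage=0.95):
    """Ratio of the upper to lower percentile of hospital odds ratios.

    Assumes log-odds intercepts ~ Normal(mean, tau**2); the mean cancels
    in the ratio, leaving exp(2 * z * tau).
    """
    if tau < 0:
        raise ValueError("tau must be non-negative")
    z = NormalDist().inv_cdf(0.5 + coverage / 2.0)
    return math.exp(2.0 * z * tau)


if __name__ == "__main__":
    # A standard deviation of 0.2 is the value reported in the Results.
    print(round(odds_ratio_range(0.2), 2))
```

With a standard deviation of 0.2 this reproduces the odds ratio range of 2.19 given in the Results.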
| Results |
|---|
The posterior estimate of the between-hospital standard deviation $\tau$ was 0.2. This corresponds to an odds ratio range (97.5% to 2.5% points of the odds ratio distribution) of 2.19, and a median odds ratio (median ratio of the maximum to minimum odds ratios in a random pair) of 1.24, both of which appear reasonable [21]. Sensitivity analysis demonstrated little difference in our results as we varied the values of the mean and variance of the gamma prior distribution chosen for the between-hospital precision parameter $\tau^{-2}$. With alternative priors (gamma [1.0E-2, 1.0E-2], gamma [1.0E-1, 1.0E-1]) for the between-hospital precision parameter $\tau^{-2}$, the risk-adjusted mortality of hospital 13 was most affected, but the lower limit of its 95% PI remained below the state mean (ie, not an outlier). Furthermore, the posterior estimates of between-hospital standard deviation using these different gamma priors were all less than 0.4.
| Comment |
|---|
Likewise, in standard logistic regression, the estimates of observed mortality obtained from low-volume providers are a much less accurate reflection of their true unobserved mortality. The problem of small sample size as it relates to provider profiling is well known and has received extensive attention in the literature [1, 14, 33, 34]. Statistical theory dating back to the work of Stein and James more than 50 years ago [35] suggests that better estimates from small samples can be obtained by "shrinking" the observed values, the degree of shrinkage being inversely proportional to the number of observations. The shrunken estimate represents a weighted average between the observed value for the sample and the grand mean for the entire population of similar samples. Such estimates also account for the phenomenon of "regression to the mean" and thus are better predictors of future performance [36]. This is important, as there is typically a 1- to 2-year lag between data collection and publication of report cards ("the future is now").
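The weighted-average idea can be sketched in a few lines of Python. This is a simplified normal approximation chosen for illustration, not the logistic HGLM used in this study; the function name, the variance inputs, and the example numbers are all hypothetical.

```python
def shrunken_estimate(observed_rate, n_cases, grand_mean, between_var, within_var):
    """Weighted average of a provider's observed rate and the grand mean.

    The weight on the observed rate is between_var / (between_var + within_var / n_cases),
    so providers with fewer cases are pulled more strongly toward the grand mean.
    """
    if n_cases <= 0:
        raise ValueError("n_cases must be positive")
    weight = between_var / (between_var + within_var / n_cases)
    return weight * observed_rate + (1.0 - weight) * grand_mean


if __name__ == "__main__":
    # Two hypothetical providers with the same observed mortality (6%)
    # but different volumes, compared with a grand mean of 2%.
    for n in (50, 500):
        est = shrunken_estimate(0.06, n, 0.02, between_var=0.0001, within_var=0.02)
        print(n, round(est, 4))
```

In this example the provider with 50 cases is shrunk most of the way toward the grand mean, whereas the provider with 500 cases keeps an estimate closer to its observed rate.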
If the effects of clustering and small sample size are ignored, the net result will be overestimation of systematic interprovider variability, underestimation of random interprovider variability, and an increased potential for the provider to be falsely classified as an outlier. This is further magnified by the problem of multiple comparisons, which can lead to the spurious identification of significant differences between providers [33, 36].
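The scale of the multiple comparisons problem can be illustrated with a short calculation. It assumes, purely for illustration, that every provider truly performs at the state average and that the tests are independent, each with a 5% false-positive rate; the function name is hypothetical.

```python
def prob_any_false_outlier(n_providers, alpha=0.05):
    """Chance that at least one of n truly average providers is flagged.

    Assumes independent tests, each with false-positive probability alpha.
    """
    if n_providers < 0 or not 0.0 <= alpha <= 1.0:
        raise ValueError("invalid inputs")
    return 1.0 - (1.0 - alpha) ** n_providers


if __name__ == "__main__":
    # 13 hospitals performed CABG in Massachusetts in 2002.
    print(round(prob_any_false_outlier(13), 3))
```

Under these assumptions, with 13 hospitals there is nearly an even chance that at least one would be labeled an outlier by chance alone.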
Increasingly, many statisticians believe that hierarchical generalized linear models of the type employed in Massachusetts, and recently introduced for use in the STS NCD, are a preferable approach. They specifically account for both clustering and small sample sizes, thus ensuring better overall precision in provider estimates, and through the shrinkage process they also mitigate the problem of multiple comparisons [33, 36]. Such models assume exchangeability among providers. That is, they explicitly acknowledge the heterogeneity among providers but assert that the providers' performances cannot be ordered a priori [16, 22]. Some might argue that this assumption is not justified because, on average, lower-volume programs have worse results [37]. However, this relationship is relatively weak for CABG surgery, and many lower-volume programs including those in Massachusetts have better than average results [38-41]. It has also been argued that hierarchical models, which view the observed provider results as samples from a larger unobserved population, are not appropriate for report cards, where all state or regional data are available. However, although the goal is not to generalize to other unobserved providers, hierarchical models do provide better estimates of future results, which is really the most useful application of report cards for patients [15, 36, 42].
Several studies have reanalyzed available New York cardiac surgery outcome data and have demonstrated how the use of HGLM mitigates apparent (and misleading) interprovider variability and the likelihood of false outlier identification [15, 16, 43]. Localio and associates [33] analyzed Pennsylvania 1991-1992 CABG report card results, which were based on logistic regression, and compared these with results obtained with a hierarchical model. Use of a hierarchical model dramatically reduced false outlier identification while still providing adequate statistical power to detect true outliers, particularly when 2 years of data were used. DeLong and colleagues [42, 44] investigated the impact of different statistical methods for CABG provider profiling, including random effects models, and they also concluded that a mixed-effects model provided the most realistic assessment of provider performance, especially when there were small sample sizes. Austin and colleagues [45] compared the performance of logistic regression and hierarchical models to assess the care of patients with acute myocardial infarction. In general, random-effects (hierarchical) models had greater specificity and positive predictive value than fixed-effect logistic models, whereas the latter had greater sensitivity to detect outliers. However, when the volume of cases for each hospital in this simulation study was arbitrarily raised to 250, a reasonable number for hospital CABG volume, the sensitivity differences between the two models for detecting true outliers diminished. The negative predictive value of being classified a non-outlier was roughly similar for the two models.
Our results demonstrate the potential impact that choice of statistical methodology can make in the results of provider profiling. The "smoothing" effect of the hierarchical model results in shrinkage of individual hospital results toward the overall state mean, and a narrowing of the PI compared with the corresponding logistic regression CI [22]. Results of our prior distribution sensitivity analyses and the value of the random effects standard deviation were reasonable, and they did not suggest that our models produced excessive shrinkage.
We believe the hierarchical approach results in a more accurate estimate of a provider's unobserved true performance, both absolute and relative to their comparison institutions. In our study, logistic regression would have labeled one provider as an outlier, but this result was not confirmed with HGLM. There is always a tradeoff in modeling between sensitivity and specificity. Ultimately, it is a public policy decision as to whether increased sensitivity to detect true outliers or avoidance of false outlier identification is the more important consideration in state report cards [45]. We acknowledge that despite what we consider to be their theoretical superiority, the profiling results obtained with hierarchical models may not always differ significantly from those based upon logistic regression. However, by consistently applying the correctly specified model, the likelihood of accurate results in a variety of scenarios is increased.
The relative lack of variability we observed between results obtained using the Mass-DAC or NY predictor sets is not surprising. Most major cardiac risk models have similar predictors and there is a core set of critical variables that provide most of the predictive power [46, 47].
One incidental observation of interest emerged from this study. The Massachusetts report card demonstrated very low unadjusted CABG mortality, one of the lowest ever publicly reported. Massachusetts has a strict determination-of-need process and a highly academic environment, but before the 2002 report card, there was no mandatory data collection or public reporting of cardiac surgery outcomes. These results confirm the previous report of Ghali and associates [48] that was based on administrative data. They suggest that public report cards may have a valuable role in providing public accountability, but that they are not essential to the provision of the highest quality cardiac surgery care.
We conclude that extreme caution must be employed in calculating and interpreting risk-adjusted outcome results, and that the most appropriate statistical methodology must be employed. For provider profiling, we believe this is HGLM.
| Statistical Appendix |
|---|
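The logit (log-odds) of the probability of 30-day mortality for patient $j$ with co-morbidity $x_{ij}$ ($y_{ij} = 1$) is given by:

$$\operatorname{logit}\left[P(y_{ij} = 1)\right] = \beta_0 + \beta_1 x_{ij}$$

Then, the probability that the $j$th patient treated at the $i$th hospital will die (30-day mortality) is given by:

$$P(y_{ij} = 1) = \frac{\exp(\beta_0 + \beta_1 x_{ij})}{1 + \exp(\beta_0 + \beta_1 x_{ij})}$$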
Risk-adjusted mortality (RAM) = observed mortality rate divided by expected mortality rate, multiplied by the state unadjusted mortality rate. Confidence intervals for the RAM are calculated using a large-sample approximation.
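A minimal Python sketch of this calculation for a single hospital follows. It assumes one covariate per patient and already-fitted coefficients; the function names and the example values are hypothetical, and the large-sample confidence interval is not shown.

```python
import math


def expit(z):
    """Inverse logit: converts log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def predicted_risk(x, beta0, beta1):
    """Predicted 30-day mortality for a patient with covariate x."""
    return expit(beta0 + beta1 * x)


def risk_adjusted_mortality(deaths, risks, state_rate):
    """Observed rate divided by expected rate, times the state unadjusted rate.

    deaths: 0/1 outcomes for the hospital's patients.
    risks:  model-predicted probabilities for the same patients.
    """
    if not deaths or len(deaths) != len(risks):
        raise ValueError("deaths and risks must be non-empty and the same length")
    observed = sum(deaths) / len(deaths)
    expected = sum(risks) / len(risks)
    return observed / expected * state_rate


if __name__ == "__main__":
    # Hypothetical hospital with four patients and illustrative coefficients.
    xs = [0.0, 1.0, 2.0, 3.0]
    risks = [predicted_risk(x, beta0=-4.0, beta1=0.5) for x in xs]
    print(risk_adjusted_mortality([0, 0, 0, 1], risks, state_rate=2.19))
```

When observed and expected mortality are equal, the RAM equals the state rate; a hospital with twice the expected mortality has a RAM of twice the state rate.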
Random Intercept Hierarchical Generalized Linear Model (HGLM)
For patient j at hospital i:
Given patient covariate $x_{ij}$, fixed coefficient $\beta_1$, and random intercept $\beta_{0i}$, which is normally distributed with mean $\mu$ and between-hospital variance $\tau^2$, the logit (log-odds) of the probability of 30-day mortality for patient $j$ ($y_{ij} = 1$) is given by the following equation:

$$\operatorname{logit}\left[P(y_{ij} = 1)\right] = \beta_{0i} + \beta_1 x_{ij}, \qquad \beta_{0i} \sim N(\mu, \tau^2)$$
The standardized mortality incidence rate (RAM) is the smoothed, adjusted, hospital-specific mortality rate divided by the expected mortality rate, multiplied by the state unadjusted mortality rate. Probability intervals are calculated via Monte Carlo simulation.
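For a single posterior draw, the standardized rate for one hospital can be sketched in Python as below. The smoothed rate uses the hospital's own intercept and the expected rate uses the average intercept for all providers, as described in the Methods. The function and argument names are hypothetical, a single covariate is assumed, and this is not the BUGS code used for the report card.

```python
import math


def expit(z):
    """Inverse logit: converts log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def standardized_ram(x_values, hospital_intercept, mean_intercept, beta1, state_rate):
    """Standardized mortality incidence rate for one hospital and one posterior draw.

    smoothed: mean predicted risk using the hospital's own random intercept.
    expected: mean predicted risk for the same patients using the average intercept.
    """
    if not x_values:
        raise ValueError("need at least one patient")
    smoothed = sum(expit(hospital_intercept + beta1 * x) for x in x_values) / len(x_values)
    expected = sum(expit(mean_intercept + beta1 * x) for x in x_values) / len(x_values)
    return smoothed / expected * state_rate


if __name__ == "__main__":
    # Hypothetical case mix and parameter values for one draw.
    xs = [0.0, 1.0, 2.0, 3.0]
    print(standardized_ram(xs, hospital_intercept=-3.8, mean_intercept=-4.0,
                           beta1=0.5, state_rate=2.19))
```

Repeating this calculation for each retained draw yields the distribution of RAMs from which the 2.5th and 97.5th percentiles are taken.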