Ann Thorac Surg 2001;72:2155-2168
© 2001 The Society of Thoracic Surgeons
a Department of Thoracic and Cardiovascular Surgery, Lahey Clinic, Burlington, Massachusetts, USA
b Department of Health Care Policy, Harvard Medical School, Boston, Massachusetts, USA
c Department of Biostatistics, Harvard School of Public Health, Boston, Massachusetts, USA
d Division of Cardiac Surgery, Massachusetts General Hospital, Boston, Massachusetts, USA
e Division of Cardiology, Beth Israel Deaconess Medical Center, Boston, Massachusetts, USA
f Department of Cardiology, St. Elizabeth's Medical Center, Boston, Massachusetts, USA
g Department of Cardiology, Brigham and Women's Hospital and Cardiovascular Data Analysis Center, Boston, Massachusetts, USA
h Division of Health Care Quality, Massachusetts Department of Public Health, Boston, Massachusetts, USA
* Address reprint requests to Dr Shahian, Department of Thoracic and Cardiovascular Surgery, Lahey Clinic, 41 Mall Rd, Burlington, MA 01805, USA
e-mail: david.m.shahian{at}lahey.org
| Abstract |
|---|
Report cards have been the more commonly used approach, typically as a result of state legislation. They are based on the presumption that publication of outcomes effectively motivates providers, and that market forces will reward higher quality. Numerous studies have challenged the validity of these hypotheses. Furthermore, although states with report cards have reported significant decreases in risk-adjusted mortality, it is unclear whether this improvement resulted from public disclosure or, rather, from the development of internal quality programs by hospitals. An additional confounding factor is the nationwide decline in heart surgery mortality, including states without quality monitoring. Finally, report cards may engender negative behaviors such as high-risk case avoidance and "gaming" of the reporting system, especially if individual surgeon results are published.
The alternative approach, continuous quality improvement, may provide an opportunity to enhance performance and reduce interprovider variability while avoiding the unintended negative consequences of report cards. This collaborative method, which uses exchange visits between programs and determination of best practice, has been highly effective in northern New England and in the Veterans Affairs Administration. However, despite their potential advantages, quality programs based solely on confidential continuous quality improvement do not address the issue of public accountability. For this reason, some states may continue to mandate report cards. In such instances, it is imperative that appropriate statistical techniques and report formats are used, and that professional organizations simultaneously implement continuous quality improvement programs.
The statistical methodology underlying current report cards is flawed, and does not justify the degree of accuracy presented to the public. All existing risk-adjustment methods have substantial inherent imprecision, and this is compounded when the results of such patient-level models are aggregated and used inappropriately to assess provider performance. Specific problems include sample size differences, clustering of observations, multiple comparisons, and failure to account for the random component of interprovider variability. We advocate the use of hierarchical or multilevel statistical models to address these concerns, as well as report formats that emphasize the statistical uncertainty of the results.
| Introduction |
|---|
| Background |
|---|
Many quality studies have focused on open heart surgery, particularly coronary artery bypass graft (CABG) surgery, because it is a frequently performed and costly procedure with well-defined end points. In addition to postoperative death, other important outcome indicators include stroke, myocardial injury, infection, hemorrhage, 30-day readmission rate after hospital discharge, and long-term survival.
The development of scientifically valid methods for assessing the results of CABG surgery was stimulated by the 1987 Health Care Financing Administration (HCFA) publication of Medicare mortality rates that were not appropriately adjusted for patient severity. Many observers argued that such results did not take into account differences in preoperative risk factors that could substantially affect mortality. Hospitals and physicians (generically referred to as "providers" throughout this report) attract divergent patient populations based on geographic location, hospital size and local competition, degree of academic involvement, special interests, and reputation. Any comparison of providers must therefore take into account differences in the severity of illness of their patients. This observation led to the development of methods to risk-adjust medical outcomes, particularly for CABG [11–18, 20, 21, 26, 27, 52–54, 61–93].
Many risk-adjusted outcome studies have subsequently been published that report single-institution and regional cardiac surgery mortality rates. Early reports demonstrated that these results varied substantially. Hannan and associates [26] studied the records of 7,596 patients in New York State who underwent open-heart procedures during the first 6 months of 1989. Unadjusted institutional mortality rates ranged from 2.2% to 14.3%, substantially greater than the expected mortality range (2.8% to 9.3%) derived from the logistic risk prediction model. Two consecutive articles in the August 14, 1991, issue of the Journal of the American Medical Association also reported significant and unexplained regional variations in CABG mortality. Among five medical centers and 18 cardiothoracic surgeons who performed CABG in Maine, New Hampshire, and Vermont between 1987 and 1989, unadjusted in-hospital mortality for isolated CABG varied from 3.1% to 6.3% among hospitals, and from 1.9% to 9.2% among individual surgeons [20]. After risk adjustment, statistically significant variability persisted both among medical centers (p = 0.021) and surgeons (p = 0.025). In an accompanying article, Williams and colleagues [50] reported CABG results from 1985 through 1987 at five Philadelphia teaching hospitals. Although the mortality rates did not vary substantially among hospitals for patients in diagnosis related group (DRG) 107, a greater than twofold (p = 0.0004) variation in mortality for DRG 106 (cardiac catheterization and CABG on the same admission) occurred. Observed mortality rates for individual surgeons varied from 3.8% to 10.2% for both DRG 106 and DRG 107. Although only limited risk adjustment was possible, differences in patient severity did not appear to account for this variability in interhospital and interprovider mortality. Similar observations continue to be reported even in more recent studies. In the Alabama Coronary Artery Bypass Grafting Cooperative Project [58], statewide unadjusted and risk-adjusted mortality for 1995 and 1996 varied from 2% to 12%.
Risk-adjusted cardiac surgery databases continue to be used by institutions [74], cities [94], states [7, 26–60], regions [20–25], the Veterans Affairs Administration [61–67], and the Society of Thoracic Surgeons [68–73] for both academic and quality improvement purposes. Certain uses of such databases have been highly controversial. In some states, including New York [26–44, 95–97], Pennsylvania [45–51], and New Jersey [52–56], government agencies have released detailed report cards that rank hospitals and surgeons according to their risk-adjusted mortality rates. The value of such provider profiling is actively debated, especially with regard to the methodologies used and the potential impact of such reports.
Proponents believe that report cards encourage quality improvement through market forces. Low-mortality institutions and surgeons should attract consumers, including patients, referring cardiologists, and managed care plans. With this incentive, hospitals institute internal quality improvement to preserve and enhance their market share.
The other broad approach to cardiac surgery quality assessment uses the principles of continuous quality improvement (CQI). This confidential peer process is not accessible to the public; it uses benchmarking, determination of "best practice," and collaborative education among surgeons [20–25, 57, 59]. The emphasis of this approach is process and systems improvement rather than detection of outliers. It seeks to achieve regional or system-wide quality improvement while avoiding the unintended negative consequences of public report cards.
| Market-driven quality improvement ("report cards") |
|---|
Reduction in CABG mortality
Report card advocates have credited this approach with reducing both average CABG mortality and interhospital and interprovider variability [27–29]. Subsequent to the release of report cards in New York, statewide risk-adjusted mortality declined 41% from 4.17% to 2.45% between 1989 and 1992 [27]. The risk-adjusted mortality declined in all hospital groups, with the greatest reductions for hospitals located in the tercile with the highest initial mortality (7.12% to 2.77%, a reduction of 61%) [29]. In the same study, mortality also declined substantially within each of three terciles of surgeons, although the largest decrease (45%) again occurred in the highest mortality tercile. Overall, when the Cardiac Surgery Reporting System (CSRS) began in New York State in 1989, five hospitals had greater than expected and five hospitals had lower than expected risk-adjusted mortality. By 1992, only one hospital remained in each category, and interhospital variability had decreased appreciably.
Some of these improvements may have resulted from the 25% decline in the number of New York CABG patients operated on by very low-volume surgeons (defined as < 50 cases per year), as well as the disproportionate decrease in mortality (60%) associated with these surgeons [28]. Between 1989 and 1992, 27 low-volume surgeons with risk-adjusted mortality 2.5 to 5 times the state average stopped performing CABG, either voluntarily or because of restriction of hospital privileges [28, 41].
Despite the substantial reduction in New York CABG mortality after public report cards became available, the relationship of these report cards to the observed improvements is unclear. Data collection appears to be associated with mortality reduction, but public dissemination is not necessarily the mechanism. The mere process of collecting data may stimulate hospitals to focus on outcomes and quality long before such data are released to the public. The long lag time between data collection and publication, especially for individual surgeons, makes it difficult to ascribe a direct causal relationship to any subsequent improvements in mortality for a given hospital. Surgical staff change; internal process and structural changes are instituted; new managed care populations are enrolled; and techniques for cardiac care evolve.
Much of the decline in New York CABG mortality has clearly resulted from the widespread development of internal quality control programs by hospitals. Statewide improvements occurred across a broad spectrum of institutions that had quite disparate mortality rates when the reporting system began [29]. Specific process problems at some "high-mortality" hospitals were also identified and corrected. At one such institution, excess deaths were found to have occurred primarily in the patients who were the most unstable preoperatively. Process improvements in the stabilization of such patients and standardization of operative technique at that institution led to a dramatic reduction in overall CABG mortality from 4.5% to 1.5% [34, 35].
Another significant observation that confounds any analysis of improvement in CABG outcome is the nationwide decrease in operative mortality observed over the past decade [76, 77]. Thus, some of the reduction in New York CABG mortality may be part of a national trend resulting from improvements in preoperative stabilization, anesthesia, myocardial protection, and surgical technique. Ghali and associates [75] studied states adjacent to New York that did not have report cards and compared the trends in their mortality rates with those of New York. Although the study is limited by differences in risk adjustment between regions and by the lack of a dedicated cardiac surgery database in Massachusetts, the observed decrease in CABG mortality in Massachusetts was identical to that of New York (standardized mortality ratios 0.58 and 0.59, respectively). Without any formal quality improvement initiative or report card, the actual mortality rates in Massachusetts decreased from 4.7% in 1990 to 3.3% in 1994 (42%), within the same period that predicted mortality increased from 4.7% to 5.7%.
Proponents of report cards argue that the improvement in CABG mortality in New York has been disproportionate. For example, even in the study by Ghali and associates [75], New York had the lowest absolute value of CABG mortality in 1994 of all the regions studied. Peterson and associates [30] have reported that New York State's unadjusted Medicare CABG mortality decreased 33% between 1987 and 1992 compared with a 19% decline in CABG mortality outside of New York. From 1989 to 1992 (when report cards were first available in New York), the reduction in observed CABG mortality was 22% in New York versus 9% in the rest of the nation (p < 0.001). In 1992, New York also had the lowest risk-adjusted Medicare CABG mortality of any state in the nation [30].
These observations suggest that the magnitude of improvement in New York State CABG mortality may have exceeded the national trend. Is this due to the public dissemination of report cards, as some have argued [41], or is it a result of quality improvement initiatives undertaken by individual hospitals in New York and elsewhere? One answer to this crucial question may be found in the Medicare CABG study of Peterson and associates [30]. The only area with 1987 to 1992 mortality reduction comparable to that of New York was northern New England, which had an aggressive but completely confidential CQI approach [20–25]. This observation suggests that it is the institution of a formal quality improvement process that is the key, irrespective of whether this effort is public or confidential.
Avoidance of high-risk cases
Despite the use of risk adjustment, surgeons remain concerned that these methodologies provide inadequate protection from the inevitably higher mortality rates that are experienced when operating on high-risk patients [8, 37]. Some observers believe that the decrease in CABG mortality in New York after the introduction of public reporting resulted from the reluctance of surgeons to operate on higher-risk patients. Discussions of this topic have been widely publicized, including an emotional personal account in The New York Times [96]. The ability to improve the apparent performance of a provider through avoidance of high-risk patients has also been discussed in the context of a chronic medical disease [98].
One of the most frequently cited references in support of the "high-risk case avoidance" hypothesis is that of Omoigui and associates [33]. They reported that 482 patients from New York State were operated on at the Cleveland Clinic between 1989 and 1993. These patients had a higher incidence of high-risk characteristics, including reoperation (44%) and New York Heart Association Class III or IV (47.6%), than did patients from other referral areas. This had not been apparent during the pre-report card (1980 to 1988) era in New York. Based on both the Cleveland Clinic and New York risk-adjustment models, New York State patients at the Cleveland Clinic had the highest expected mortality of any referral group, and the observed morbidity and mortality were also correspondingly higher. Average yearly referrals from New York to the Cleveland Clinic increased in the post-report card era, whereas they decreased from other states (p < 0.001).
Chassin and associates [41] dispute these findings, noting that only 482 patients went from New York to Cleveland from 1989 to 1993 compared with a total of 73,877 patients who stayed in New York for CABG surgery. In many cases, patients traveled to Cleveland because of geographic considerations and long-established referral patterns. The authors also argue that the 1989 to 1993 time frame of the Omoigui study is inappropriate because the first year in which public release could have had an impact was 1991. Arguably, however, the mere knowledge that outcome data were being collected for subsequent publication may have been enough to influence the behavior of New York hospitals and surgeons.
Peterson and associates [30] have also challenged the hypothesis of high-risk case avoidance. They note that between 1987 and 1992, the percentage of New York Medicare patients receiving CABG out of state actually declined from its peak values of 12.5% to 14.3% during the period 1987 to 1989 to 11.3% in 1992 (p < 0.001 for trend). Furthermore, since the public reporting program began, New York elderly patients who had experienced a myocardial infarction or who had multiple comorbidities were more likely to receive coronary bypass surgery (p < 0.001), as was the national trend, despite being obviously higher-risk patients.
Hannan and associates [32] specifically investigated the question of high-risk case avoidance in New York after the introduction of report cards. They identified a group of 3,281 patients (7.3% of the total CABG population) operated on from 1990 to 1992 for whom expected mortality rates were greater than 7.5%. Although the actual mortality rate for the high-risk group was 15.9% versus 2.0% for the remaining low-risk patients, the risk-adjusted mortality was actually lower for the high-risk group (2.9% vs 3.0%). Furthermore, each hospital and quartile of individual surgeons had similar risk-adjusted mortality for all cases (including high risk) versus only their low-risk cases. This suggests that the risk-adjustment mechanism gives adequate protection to the surgeon who chooses to operate on a high-risk patient. Finally, the authors noted a 73% increase in high-risk CABG patients (expected mortality > 7.5%) from 1990 to 1992, whereas lower risk patients increased only 11.4%.
Irrespective of the conflicting objective evidence regarding avoidance of high-risk cases, many surgeons perceive that accepting such cases may jeopardize their reputations and careers. Schneider and Epstein [46] reported that 63% of Pennsylvania surgeons in their study were less willing to operate on severely ill patients because of report card profiling, and 59% of cardiologists reported increased difficulty finding a surgeon to operate on such patients. Burack and associates [37] surveyed 104 New York State cardiac surgeons and found that high-risk CABG patients were more likely to be denied treatment than were similar high-risk patients with aortic dissection, arguing that this occurred because the latter group was not subject to public reporting. A total of 62% of surgeons reported that they refused to operate on at least 1 high-risk patient during the preceding year because of the performance reports, implying that they do not believe that the current risk-adjustment mechanism protects them sufficiently.
Several potential solutions have been recommended to address the controversy surrounding high-risk patients [42]. One possibility would be to provide separate reports for such patients, although, given the small numbers, this would result in limited statistical significance. Another option would be to exclude those patients classified by a state peer-review committee as excessively high-risk, a process currently being implemented in New Jersey. However, such patients are often the most instructive for quality improvement purposes. Califf and associates [93] suggest collecting and analyzing data on cardiac patients from the point of initial referral to determine whether high-risk cases are being avoided, although such a system would be extremely difficult to implement.
An even more fundamental question is whether low mortality rate is appropriate as the sole end point for quality assessment in CABG. The Society of Thoracic Surgeons Ad Hoc Committee on Physician-Specific Mortality Rates for Cardiac Surgery [8] has recommended that other quality indicators should also be studied, including risk-adjusted perioperative morbidity, cost/benefit analysis, functional improvement, quality of life, and patient satisfaction. If low operative mortality is the only important performance metric, then hospitals and surgeons may alter their practice to achieve this goal, perhaps to the detriment of the overall cardiovascular health of the population. For example, one way to reduce operative mortality would be to avoid the highest-risk patients. However, from a public health perspective, it is precisely these high-risk patients for whom successful surgery would produce the greatest decrement in overall cardiovascular mortality and morbidity for the population, albeit at the price of higher mortality for hospitals and surgeons willing to operate on them [10]. Arguably, it might be more desirable to accept the possibility of somewhat higher average mortality rates for CABG to ensure access to cardiac care for these patients who will benefit most.
"Gaming"
It has been widely suggested that the existence of public report cards may result in "gaming" of the reporting system, which may be responsible for apparent improvements in risk-adjusted mortality [7, 37]. Gaming may take several forms: 1) upcoding of preoperative comorbidities, 2) change of operative class, and 3) transfer of critically ill postoperative patients to extended care facilities.
Upcoding of comorbidities
One of the most commonly described forms of "gaming" is upcoding of preoperative comorbidities, or "coding creep" [7, 37, 38, 93, 97]. This may involve inappropriate or excessive coding for certain risk factors, particularly if they are imprecisely defined, or increasing the severity of a categorical risk factor. This practice increases the expected mortality rate, thus making it more likely that the actual mortality rate will fall within or below the expected range. Such practices could account for some of the 73% increase in high-risk cases from 1990 to 1992 reported by Hannan and associates [32] in New York.
Green and Wintfeld [38] described two components of the decline in risk-adjusted CABG mortality in New York. The first was a decrease in observed mortality from 3.5% to 3.1%, and the second an increase in the expected mortality from 2.7% to 3.7%. An increase in risk factor coding, some of which the authors believed to be spurious, accounted for 66% of the increase in predicted mortality and 41% of the total reduction in risk-adjusted mortality. The authors note that between 1989 and 1991, coding of preoperative renal failure increased from 0.4% to 2.8%, congestive heart failure from 1.7% to 7.6%, chronic obstructive pulmonary disease from 6.9% to 17.4% (1.8% to 52.9% at one hospital), unstable angina from 14.9% to 21.8% (1.9% to 20.8% at one hospital), and low ejection fraction from 18.9% to 22.2%. Furthermore, the frequency of comorbidities among surgeons varied more than would be expected from case-mix differences, including a range of 1.4% to 60.6% for chronic obstructive pulmonary disease and 0.7% to 61.4% for unstable angina [38, 97]. Parsonnet [53], a leader in the development of the New Jersey risk-adjusted data base, has also expressed concern about the reliability of coding in his state. For example, the frequency of cases coded as "elective" ranged from 27% to 95% across nine New Jersey hospitals [53].
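The effect of a rising expected mortality on the observed-to-expected (O/E) ratio can be illustrated with the two pairs of figures quoted above. The sketch below is only an arithmetic illustration: the function name and the counterfactual scenario (observed mortality falls while expected mortality stays at its initial value) are assumptions introduced here, not part of the analysis by Green and Wintfeld.

```python
def oe_ratio(observed_pct, expected_pct):
    """Observed-to-expected mortality ratio; both arguments are percentages."""
    if expected_pct <= 0:
        raise ValueError("expected mortality must be positive")
    return observed_pct / expected_pct


# Observed and expected mortality figures quoted in the paragraph above.
before = oe_ratio(3.5, 2.7)
after = oe_ratio(3.1, 3.7)

# Hypothetical counterfactual: observed mortality falls, but risk-factor
# coding (and therefore expected mortality) stays at its initial level.
after_without_coding_change = oe_ratio(3.1, 2.7)

print(f"O/E before:                      {before:.2f}")
print(f"O/E after:                       {after:.2f}")
print(f"O/E after, expected held at 2.7: {after_without_coding_change:.2f}")
```

Under these assumptions, most of the apparent drop in the O/E ratio comes from the denominator, which is the mechanism by which upcoding can improve reported results without any change in the number of deaths.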
Chassin and colleagues [41] maintain that refinement of definitions rather than intentional upcoding may have been responsible for the difference in risk-factor coding between 1989 and 1991. In support of this argument, they note that after an initial increase in these factors, their prevalence has remained relatively stable since 1991. They also believe that underreporting of some risk factors occurred in 1989, which was discovered through the audit process and largely resolved by 1990.
It seems likely that some degree of upcoding has occurred as a result of public report cards. The only solutions to this problem, other than relying on the integrity of hospitals and surgeons, are to make data definitions as unambiguous as possible, to train abstractors to collect data appropriately, and to conduct random external audits to ensure accuracy.
Change of operative class
A second type of gaming is change of operative class [7, 10, 37]. Because CABG is the only major cardiac surgical category for which results have been publicly reported, it is possible that some surgeons have identified high-risk patients who are likely to experience mortality and morbidity and have converted them to nonreported procedures. A high-risk CABG procedure can be changed to a CABG-mitral valve repair with a few commissural stitches, to a CABG-atrial septal defect repair with the closure of a patent foramen ovale, and to a CABG-left ventricular aneurysmectomy with a few apical plication stitches. By doing so, the apparent mortality in that surgeon's isolated CABG population decreases, the mortality of the unreported combined procedure increases, but the total number of deaths remains unchanged [10]. Mortalities are simply shifted to categories that are not publicly reported.
The frequency of such behaviors is difficult to quantitate and almost impossible to detect retrospectively. However, the problem does exist and has been described by respected cardiac surgical leaders [7, 10]. Although all medical professionals would agree that this is unethical conduct, it is an example of the kind of behavior that may be encouraged by public reporting, especially individual surgeon profiling.
Transfer of critically ill postoperative patients
A final method of gaming occurs when seriously ill postoperative patients are transferred to an extended care facility before their anticipated death. Because many databases include only those deaths occurring in the hospital where the surgery was performed, mortalities that occur in extended care facilities may not be captured [37]. To avoid this problem, outcome data should be collected for each patient at a fixed time, such as 30-day mortality, regardless of where the patient may have been transferred.
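The difference between the two end point definitions can be made concrete with a small sketch. The record layout, field names, and the three patients below are hypothetical and serve only to show that a death after transfer is missed by an in-hospital count but captured by a fixed 30-day window.

```python
from datetime import date


def died_within_30_days(surgery_date, death_date):
    """True if the patient died 0 to 30 days after surgery, wherever it occurred."""
    if death_date is None:
        return False
    return 0 <= (death_date - surgery_date).days <= 30


def mortality_counts(patients):
    """Return (in-hospital deaths, 30-day deaths) for a list of patient records."""
    in_hospital = sum(
        1 for p in patients
        if p["death_date"] is not None
        and p["death_location"] == "operating_hospital"
    )
    thirty_day = sum(
        1 for p in patients
        if died_within_30_days(p["surgery_date"], p["death_date"])
    )
    return in_hospital, thirty_day


# Hypothetical records: one survivor, one in-hospital death, and one death
# in an extended care facility 12 days after surgery.
patients = [
    {"surgery_date": date(2001, 3, 1), "death_date": None, "death_location": None},
    {"surgery_date": date(2001, 3, 2), "death_date": date(2001, 3, 7),
     "death_location": "operating_hospital"},
    {"surgery_date": date(2001, 3, 3), "death_date": date(2001, 3, 15),
     "death_location": "extended_care"},
]

in_hospital, thirty_day = mortality_counts(patients)
print(f"in-hospital deaths: {in_hospital}, 30-day deaths: {thirty_day}")
```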
Statistical methodology
Stimulated by the release of inadequately adjusted Medicare mortality results, numerous statistical models have been developed to adjust cardiac surgery outcomes for patient acuity [11–18, 20, 21, 26, 27, 52–54, 61–93]. The output of such models is typically an estimate of the likelihood of postprocedural mortality. Such patient-level data are subsequently aggregated to calculate the expected mortality of a hospital provider or individual surgeon, to compare observed and expected mortalities, and to determine outlier status. These results are then published as cardiac surgery "report cards."
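The aggregation step described above can be sketched as follows. This is a minimal illustration, assuming each case record carries a provider label, a model-predicted probability of death, and the observed outcome; the function name, record layout, and example cases are hypothetical and do not reproduce any particular state's reporting system.

```python
from collections import defaultdict


def provider_summary(cases):
    """Aggregate (provider, predicted_risk, died) tuples by provider.

    Expected deaths are the sum of patient-level predicted risks, which
    are assumed to be strictly positive.
    """
    totals = defaultdict(lambda: [0, 0.0, 0])  # cases, expected deaths, observed deaths
    for provider, risk, died in cases:
        entry = totals[provider]
        entry[0] += 1
        entry[1] += risk
        entry[2] += int(died)
    return {
        provider: {
            "cases": n,
            "expected_rate": expected / n,
            "observed_rate": observed / n,
            "oe_ratio": observed / expected,
        }
        for provider, (n, expected, observed) in totals.items()
    }


# Hypothetical cases for two providers.
cases = [
    ("A", 0.02, False), ("A", 0.10, True), ("A", 0.03, False), ("A", 0.05, False),
    ("B", 0.01, False), ("B", 0.04, False),
]

for provider, stats in provider_summary(cases).items():
    print(provider, stats)
```

With so few cases per provider, a single death moves the O/E ratio dramatically, which anticipates the sample size concerns discussed later in this section.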
The fact that such determinations arise from sophisticated mathematical models, together with the degree of precision with which they have been publicly reported, has invested them with an exaggerated aura of scientific accuracy. In reality, the available data and methodologies are imperfect, are subject to varying interpretations, and are often too technical for the average layman to fully comprehend [99]. This realization should temper the enthusiasm of those who advocate report cards in the name of the public's "right to know." Although trained statisticians understand the nuances and limitations of such models, nonstatisticians (including most consumers, media writers, and surgeons) tend to look primarily at bottom-line mortality rates and rank orders with little regard to their accuracy or range of statistical dispersion [7, 41, 53].
We will examine three important areas of statistical concern: (1) the inherent limitations of patient-level risk prediction models; (2) the inappropriate application of such models to multilevel or hierarchical scenarios, such as the determination of outlying providers [100–107]; and (3) the failure of report cards to account for and to emphasize adequately the statistical uncertainty of their results.
Patient-level models
Numerous risk-adjustment models have been developed to compensate for differences in patient acuity among cardiac surgery providers. Despite their increasing sophistication, such risk models remain an inexact approximation to the complexity of biological processes. This is true regardless of whether one uses Bayesian [15, 16, 65] or neural networks [16] or logistic regression [16, 20–22, 26, 27, 52, 68]. Because such models provide critical information that is used to estimate differences in quality of care among providers, their limitations must be recognized.
Unexplained variation
In an analysis of the New York Cardiac Surgery Reporting System, Green and Wintfeld [38] found that the New York risk adjustment model explained only 7.3% of variation in surgeon-specific mortality, 0.4% of the variation in hospital mortality, and 8% of the outcome variability among individual patients. Similar observations were noted in the CABG study of Landon and associates [84] and in a study of internal medicine provider profiling by Hofer and colleagues [98].
Chassin and associates [41] have correctly responded that in contrast to multivariate regression for continuous dependent variables, the R² values of logistic models with dichotomous outcomes are always substantially less than 1. Hartz and colleagues [40] demonstrated this with an analysis of a theoretically perfectly fit logistic model. Although statistically appropriate measures of explained variation for binary dependent variables do exist [108, 109], they are frequently neither calculated nor reported. Nonetheless, even though R² is not an ideal measure of goodness of fit for logistic models, the very low values of R² observed in the New York study suggest considerable mortality variation that is not explained by the model.
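Why a perfectly specified model still yields a low R² for a binary outcome can be seen from a standard variance decomposition: if each patient's true risk is p and the model predicts exactly p, the explained share of outcome variance is Var(p) divided by m(1 − m), where m is the mean risk. The sketch below applies this to a hypothetical two-group population; it is not the analysis of Hartz and colleagues, and the risk mix is an assumption chosen for illustration.

```python
def max_r_squared(risk_groups):
    """Population R-squared attained by a model that predicts every patient's
    true risk exactly.

    risk_groups: list of (share_of_patients, true_risk) pairs whose shares
    sum to 1.
    """
    mean_risk = sum(share * risk for share, risk in risk_groups)
    variance_of_risk = sum(
        share * (risk - mean_risk) ** 2 for share, risk in risk_groups
    )
    # A binary outcome with mean m has total variance m * (1 - m).
    return variance_of_risk / (mean_risk * (1 - mean_risk))


# Hypothetical population: 90% of patients at 2% risk, 10% at 15% risk.
hypothetical_mix = [(0.9, 0.02), (0.1, 0.15)]
print(f"best attainable R-squared: {max_r_squared(hypothetical_mix):.3f}")
```

Even though this imaginary model is exactly right about every patient's risk, the attainable value stays far below 1 because death remains a chance event for each individual.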
Some unexplained variability might result from risk factors that are not included in the risk algorithm. Furthermore, some disease states with a broad range of clinical expression are reduced to dichotomous or categorical variables, thus losing much value as predictors [53, 93].
Model performance
In terms of patient-level assessments, most risk adjustment models provide relatively good calibration as measured by the Hosmer-Lemeshow test, but this describes only the expected performance of groups of patients with similar comorbidities. Discrimination measures the ability to predict the outcome for a given patient, and is a tradeoff between the sensitivity and specificity of the model as measured by the ROC curve area or C-index [16]. Regardless of the specific statistical methodology or variables used, most cardiac surgery models have ROC curve areas (C-indices) between 0.76 and 0.82, typically on the lower end of this range [16, 20–22, 41, 74]. This is comparable to the accuracy of weather prediction and substantially inferior to the discrimination of many other sophisticated prediction models [16, 93]. As demonstrated in the work of Lippmann and Shahian [16], this may lead to incorrect outcome prediction for many patients. This degree of inaccuracy is one of the reasons why many are dismayed at the amount of media attention given to relatively small differences in risk-adjusted mortality. Furthermore, because most risk-adjustment models do not separate sampling variability from provider variability, many of the statistical summaries of model fit are invalid [100].
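For readers unfamiliar with the C-index, it can be read as the probability that a randomly chosen patient who died was assigned a higher predicted risk than a randomly chosen survivor. The sketch below computes it by direct pairwise comparison; the function name and the six example patients are hypothetical, and ties are counted as one half, a common convention that is assumed here.

```python
def c_index(predicted_risks, died):
    """Share of (death, survivor) pairs in which the patient who died had the
    higher predicted risk. Ties count as one half."""
    deaths = [r for r, d in zip(predicted_risks, died) if d]
    survivors = [r for r, d in zip(predicted_risks, died) if not d]
    if not deaths or not survivors:
        raise ValueError("need at least one death and one survivor")
    score = 0.0
    for risk_death in deaths:
        for risk_survivor in survivors:
            if risk_death > risk_survivor:
                score += 1.0
            elif risk_death == risk_survivor:
                score += 0.5
    return score / (len(deaths) * len(survivors))


# Hypothetical predicted risks and outcomes for six patients.
risks = [0.01, 0.02, 0.05, 0.10, 0.20, 0.03]
outcomes = [False, False, True, False, True, False]
print(f"C-index: {c_index(risks, outcomes):.3f}")
```

A value of 0.5 corresponds to predictions no better than chance and 1.0 to perfect separation of deaths from survivors.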
Unknown predictors are a likely cause of the limited discrimination of most CABG models [16]. Because neural and Bayesian networks do not appear to improve substantially the performance of logistic regression, it is unlikely that complex nonlinear relationships are responsible for the observed deficiencies in model performance.
Interprovider variability and determination of outliers
The methodology by which differences between hospitals or surgeons are determined is an equally relevant question, but it has received scant attention until recently. Variability among providers may result from systematic patient differences due to measured or unmeasured comorbidities, from random "noise," or from true differences in quality of care [26, 48]. The methodologies used to make these determinations are controversial and highly dependent on the statistical techniques used [48, 78–81, 90, 100–102]. In most states where report cards have been issued, optimal statistical practice has not been used [48, 90, 93, 100–107]. As a result, random variation has been underestimated and apparent systematic differences exaggerated [100]. We will consider two major areas of statistical concern: (1) the impact of sample size variability on the precision of estimates, and (2) the use of aggregate results from patient-level prediction models to assess provider performance.
Sample size considerations
The ability of cardiac surgery risk-adjustment models to distinguish differences in performance among hospitals is usually overestimated. Deaths are an uncommon occurrence, and the number of cases per hospital or surgeon is relatively small [23, 48, 90]. Confidence intervals are highly dependent on the number of cases and methodology, and these have a significant impact on outlier status [23, 48]. Small sample size makes statistically significant comparisons difficult, especially at the individual provider level, and the confidence intervals are broad and of questionable value [23, 90].
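The dependence of interval width on case volume can be illustrated with a short Python sketch. It uses the Wilson score interval, which is one common choice and not necessarily the method used in any published report card; the case volumes and the 3% observed mortality are hypothetical.

```python
import math


def wilson_interval(deaths, cases, z=1.96):
    """Wilson score interval for a proportion (about 95% coverage when z = 1.96)."""
    if cases <= 0:
        raise ValueError("cases must be positive")
    p = deaths / cases
    denom = 1 + z * z / cases
    centre = (p + z * z / (2 * cases)) / denom
    half = z * math.sqrt(p * (1 - p) / cases + z * z / (4 * cases * cases)) / denom
    return centre - half, centre + half


# Hypothetical providers, all with the same 3% observed mortality.
for deaths, cases in [(3, 100), (15, 500), (60, 2000)]:
    lo, hi = wilson_interval(deaths, cases)
    print(f"{cases:5d} cases: interval {100 * lo:.1f}% to {100 * hi:.1f}%")
```

With identical observed mortality, the interval for the 100-case provider is several times wider than that for the 2000-case provider, which is why outlier status at low volume is so sensitive to a handful of deaths.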
Because sample sizes vary by provider, the precision of the estimates also varies considerably. By the James-Stein theorem [110–113], comparisons of average mortality rates from 3 or more providers do not yield the best estimates of their "true" underlying mortality rates, especially when assessing low-volume programs, whose results have large standard deviations. This phenomenon is known as regression-to-the-mean bias. "Shrunken" estimates, derived from both the observed mortality rate of a given provider and the mean mortality of the entire population of providers, are more accurate [80, 81, 100–102, 104–107, 110–114]. Such estimates, which are incorporated into hierarchical models but not standard logistic regression models, provide more accurate values for both the numerator and the denominator of the observed-to-expected (O/E) mortality ratio.
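The idea of a shrunken estimate can be sketched in a few lines of Python. This is a deliberately simplified illustration, not a James-Stein or hierarchical estimator: each observed rate is averaged with the pooled rate using the weight n / (n + prior_strength), where `prior_strength` is a hypothetical constant that a real hierarchical model would estimate from the data. The provider counts are invented.

```python
def shrunken_rates(deaths, cases, prior_strength=200.0):
    """Pull each provider's observed mortality rate toward the pooled rate.

    The weight on the provider's own data is n / (n + prior_strength), so
    low-volume providers are shrunk the most. prior_strength is an assumed
    tuning value used here only for illustration.
    """
    pooled = sum(deaths) / sum(cases)
    estimates = []
    for d, n in zip(deaths, cases):
        weight = n / (n + prior_strength)
        estimates.append(weight * (d / n) + (1 - weight) * pooled)
    return pooled, estimates


# Hypothetical providers: one low-volume program with a high observed rate.
deaths = [6, 12, 45]
cases = [50, 400, 1500]
pooled, est = shrunken_rates(deaths, cases)
for d, n, e in zip(deaths, cases, est):
    print(f"n={n:5d}  observed={d / n:.3f}  shrunken={e:.3f}  pooled={pooled:.3f}")
```

The 50-case program's observed 12% rate moves most of the way toward the pooled rate, whereas the 1500-case program's estimate barely changes.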
The tendency for individual observed means, especially if they are extreme, to move toward the grand mean is commonly observed in everyday life. This year's "hot stock" or sensational rookie hitter is often rewarded for that performance, only to experience subsequent results that are more average. Interestingly, "regression to the mean" rather than true quality improvement may be responsible for some of the apparent mortality reduction at institutions initially identified as outliers [101].
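A small simulation shows how regression to the mean can mimic quality improvement. In this hypothetical Python sketch every provider has exactly the same true mortality rate, so any "outliers" in the first year are noise; the number of providers, case volume, rate, and random seed are all assumptions made for illustration.

```python
import random


def simulate_regression_to_mean(n_providers=50, cases=200, true_rate=0.03,
                                top_k=5, seed=0):
    """Simulate two years of mortality when all providers share one true rate."""
    rng = random.Random(seed)

    def one_year():
        # Observed mortality rate for each provider in a single year.
        return [sum(rng.random() < true_rate for _ in range(cases)) / cases
                for _ in range(n_providers)]

    year1, year2 = one_year(), one_year()
    # Providers with the highest observed mortality in year 1.
    worst = sorted(range(n_providers), key=lambda i: year1[i], reverse=True)[:top_k]
    before = sum(year1[i] for i in worst) / top_k
    after = sum(year2[i] for i in worst) / top_k
    return before, after


before, after = simulate_regression_to_mean()
print(f"worst 5 providers in year 1: {before:.3f}; same providers in year 2: {after:.3f}")
```

The providers flagged in year 1 appear to improve in year 2 even though nothing about their true performance has changed.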
Methodology for determining interprovider variability
From the perspective of government oversight agencies and the media, provider profiling and identification of hospital or physician outliers are the most important uses for any cardiac surgery database. Unfortunately, the appropriate determination of this status can be extremely complex and may vary substantially depending on the methodology [48, 78–81, 90, 93, 100–102, 104].
One fundamental problem is that the unit of analysis is the patient, whereas the unit of inference is the provider. For example, differences between observed and expected hospital-specific mortality are the key measures of interest; yet these are obtained by aggregating patient-level data and fitting a patient-level model. Unfortunately, most researchers have focused on assessing the fit of the patient-level regression models and have completely neglected assessing the appropriateness of the provider-level summaries [100, 115].
Many approaches to explaining variation in patient mortality fail to separate within-provider sampling variability from interprovider variability, the latter including both random and systematic components. This may subsequently result in falsely declaring some providers as outliers [78–81, 100–102, 104–107]. Although hierarchical statistical models (discussed in a subsequent section) have the ability to partition variation in the appropriate manner, these models have not become part of standard software and hence are rarely used.
An additional problem is that patient outcomes are correlated within providers, even after accounting for patient differences [100–107, 116]. This clustering violates the assumption of independence inherent in most of the commonly used statistical techniques. A practical consequence of ignoring nonindependence is that some providers may be inappropriately classified as outliers because the accuracy associated with the estimate is artificially high.
The method by which observed and expected mortalities are used to determine outlier status is of critical importance. Typically, this involves the calculation of the difference between a provider's expected and observed mortality (or the ratio of the observed to the expected mortality, O/E), then assessing whether this statistic is extreme, usually defined as being in the tail of a normal distribution [80, 100–102, 104]. The assumptions implicit in such an algorithm are that the pool of providers used to determine the expected mortality rates is representative of some meaningful standard population, and that being in the tail implies a systematic departure from the norm. Unfortunately, neither assumption is always reasonable. For example, the use of the mean together with a 95% confidence interval to define outlier status is problematic in areas where all institutions perform at a high level. In such situations, it may be more appropriate to define an acceptable range of statistical dispersion around a nationally accepted threshold; defining the range based solely on local data virtually ensures that some program will be assigned outlier status [80, 100]. Although policy makers may be reluctant or unable to specify such an absolute threshold, a specific standard will avoid falsely labeling some providers as aberrant. Whether an absolute or relative standard is used, it is essential that the statistical distribution around each provider's "true" mortality rate be appropriately estimated. Substantial inaccuracy may also occur if the normal approximation is used to estimate p values for outlier determination rather than calculating exact values from the appropriate distribution [80].
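The difference between an exact and a normal-approximation p value can be illustrated as follows. This Python sketch assumes, for simplicity, that patients die independently with their model-predicted probabilities, so the number of deaths follows a Poisson-binomial distribution; that independence assumption, the function names, and the 100-patient example are all hypothetical and ignore the within-provider clustering discussed above.

```python
import math


def exact_upper_tail(probs, observed_deaths):
    """P(deaths >= observed) if patient i dies independently with probability probs[i]."""
    dist = [1.0]  # dist[k] = probability of exactly k deaths among patients so far
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            nxt[k] += mass * (1 - p)   # this patient survives
            nxt[k + 1] += mass * p     # this patient dies
        dist = nxt
    return sum(dist[observed_deaths:])


def normal_upper_tail(probs, observed_deaths):
    """The same tail probability from a normal approximation (no continuity correction)."""
    mean = sum(probs)
    sd = math.sqrt(sum(p * (1 - p) for p in probs))
    z = (observed_deaths - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2))


# Hypothetical provider: 100 patients, each with 3% predicted risk, 7 observed deaths.
probs = [0.03] * 100
print(f"expected deaths:    {sum(probs):.1f}")
print(f"exact one-sided p:  {exact_upper_tail(probs, 7):.4f}")
print(f"normal one-sided p: {normal_upper_tail(probs, 7):.4f}")
```

In this made-up case the normal approximation yields a noticeably smaller p value than the exact calculation, so a provider could cross a significance threshold under one method and not the other.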
Hierarchical models
The use of hierarchical models appropriately accounts for many of these concerns and typically results in a significant reduction of outliers [48, 78–81, 90, 100–107, 114–117]. Hierarchical models have also been referred to as multilevel models, random or mixed-effect models, random coefficient regression models, and covariance component models [107]. They have been used to analyze many processes that are inherently hierarchical, including students within classrooms within schools within districts, and patients cared for by physicians working within hospitals. Researchers have used hierarchical models to analyze geographic or small-area differences in procedure use, often producing results that display less variation in care than previously described [114–116]. Such models have also been used to reanalyze some of the New York [80, 101] and Pennsylvania [48] cardiac surgery data and to compare multiple methods for outlier analysis in the Medicare Cooperative Cardiovascular Project [78, 79]. In such studies, hierarchical models have typically resulted in less systematic variation, broader confidence intervals for regression parameters, and fewer outliers.
Hierarchical models regard the observed data as having been drawn in two stages: (1) patient observations within an institution, and then (2) the institutional effects from a larger unobserved population [106]. In a multilevel model for provider profiling, the first level might be the familiar patient-level regression model with predictor variables [90]. What distinguishes hierarchical models is the second level (in this case consisting of cardiac care providers), with its own systematic (eg, hospital size, location, academic status, referral patterns) and random provider effects [90, 100–107]. Coefficients of the patient-level models vary across providers based on these fixed and random effects. Failure to account for this second source of random variability at the provider level will result in an incorrect, exaggerated degree of statistical certainty [106].
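One simple way to write such a two-level structure is as a random-intercept logistic model. The notation below is an assumption introduced for illustration and is not taken from the cited analyses: y_ij is the death indicator for patient j of provider i, x_ij the patient's risk factors, z_i the provider's characteristics, and τ² the between-provider variance. More general versions also let the patient-level coefficients vary by provider.

```latex
\begin{align*}
\text{Level 1 (patient } j \text{ of provider } i\text{):}\quad
  & \operatorname{logit}\,\Pr(y_{ij} = 1) = \alpha_i + \mathbf{x}_{ij}^{\top}\boldsymbol{\beta} \\
\text{Level 2 (provider } i\text{):}\quad
  & \alpha_i = \gamma_0 + \mathbf{z}_i^{\top}\boldsymbol{\gamma} + u_i,
  \qquad u_i \sim N(0, \tau^2)
\end{align*}
```

Fixing every α_i at a single common value removes the second level and recovers ordinary logistic regression, which is the sense in which standard models ignore provider-level random variation.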
Based on the assumption of exchangeability, or the notion that all providers are roughly similar, hierarchical models use the entire pool or ensemble of data to improve the accuracy of the hospital-specific mortality regression models [101, 102, 104, 106, 116]. They correct for differences in sample size using shrinkage estimators, similar to the James-Stein estimators discussed previously. These models also account for clustering of observations [90, 103, 104]; they appropriately separate within-provider variability from between-provider variability; and they address the problem of multiple comparisons [48, 78, 102].
Although recognized by most statisticians as the preferred method for this type of analysis, hierarchical models have not previously been used for cardiac surgery report cards. This omission probably results from the fact that hierarchical models are complex, labor-intensive, and not included in most "off-the-shelf" statistical software programs. They require substantial statistical sophistication to implement properly.
Report format
Another concern is the manner in which report cards have been presented and interpreted. Consumers, the media, and other interested parties inevitably use these data to "rank" providers numerically, yet this practice is of questionable scientific validity. Green and Wintfeld [38] observed substantial changes in New York surgeon rank from year to year, and a hospital's previous rank was not necessarily predictive of future performance. In the Northern New England Cardiovascular Study Group, programs have experienced significant reshuffling of rank without any discernible changes in their processes or surgeon complement. The most likely explanation for these observations is that hospital ranks were not statistically significantly different to begin with; therefore, any observed changes are also not significant and are the result of noise. Rank statistics have considerable inherent variability. Sophisticated analytic methods must be used to give an accurate portrayal of the spread and spacing of ranks, and such methods have rarely been used in practice [100, 101].
Consumers are also strongly inclined to make pairwise comparisons between providers rather than between each provider and some accepted standard. Although it is natural to want to embark on such comparisons, the limited information that is typically reported makes such comparisons untrustworthy. To do so correctly, stepped-up (Bonferroni) confidence bounds or some other appropriate adjustment, such as that obtained within a hierarchical modeling framework, must be included to control the overall type I error rate. Few, if any, reports have done so.
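The cost of controlling the overall type I error rate grows quickly with the number of providers. The following Python sketch applies a plain Bonferroni correction to all pairwise comparisons; the function name `bonferroni_z` and the provider counts are hypothetical, and a hierarchical model would handle the multiplicity differently.

```python
from math import comb
from statistics import NormalDist


def bonferroni_z(n_providers, overall_alpha=0.05):
    """Critical z value per pairwise comparison under a Bonferroni correction.

    Dividing overall_alpha by the number of pairs keeps the family-wise
    type I error rate at or below overall_alpha.
    """
    n_pairs = comb(n_providers, 2)
    per_comparison_alpha = overall_alpha / n_pairs
    z = NormalDist().inv_cdf(1 - per_comparison_alpha / 2)
    return n_pairs, per_comparison_alpha, z


for k in (2, 10, 30):
    pairs, alpha, z = bonferroni_z(k)
    print(f"{k:2d} providers: {pairs:3d} pairs, per-pair alpha {alpha:.5f}, z = {z:.2f}")
```

With only 10 providers there are already 45 possible pairwise comparisons, so each interval must be considerably wider than the familiar 1.96 standard errors to keep the overall error rate at 5%.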
Impact of report cards on consumer choice
In addition to stimulating hospital quality improvement and encouraging high-mortality surgeons to improve or retire, another intuitive rationale for report cards is that they might redirect consumers from low-performing to high-performing institutions [118–120]. Despite the logical appeal of this hypothesis, it lacks objective confirmation [6, 119].
Considerable evidence exists that consumers are uninfluenced by report cards. After the release of HCFA Medicare mortality rates for a number of DRGs, Vladeck and associates [120] studied the occupancy rates for hospitals that had average, higher than expected, and lower than expected mortality rates. Interestingly, the occupancy for the high-mortality hospitals did not change substantially after the public dissemination of this information, leading the authors to conclude that "consumer choice of hospitals probably has far more to do with preferences for and by physicians, tradition, convenience, and word of mouth, not to mention sheer randomness, than with objective information about hospitals. Market-based theories of health care may, once again, have simply run afoul of the realities."
Blendon and associates [121] reported similar results from a Kaiser/Agency for Health Care Policy and Research poll conducted in 1996. Approximately three quarters of the patients indicated they would choose a hospital or surgeon that had been used without problems by their family, rather than one that was rated much higher in quality by experts. The authors note that this is a reflection of general consumer behavior, as only about 20% of consumers frequently use objective ratings in making major purchases. In health care, consumers place more importance on convenience, cost, coverage, and access than on quality data [6, 119].
In Cleveland, large corporations developed a system for monitoring quality of care at area hospitals and released a twice-yearly report showing patient satisfaction and mortality rates for various illnesses and procedures [94]. These reports clearly stimulated some institutions to improve their processes, but had surprisingly little effect on consumer choice. Less well-known hospitals with high performance ratings did not gain market share, and low-performing hospitals suffered little penalty.
Several consumer choice studies have focused specifically on heart surgery in areas where report cards have been published. Schneider and Epstein [45] studied cardiac surgery patients in Pennsylvania in 1995 and 1996, several years after the initial release of public report cards. They found that only 12% of the patients knew about the report card before surgery, and that fewer than 1% knew the correct rating of their hospital or surgeon and regarded it as having had an important impact on their selection of a provider.
Similar disinterest was exhibited by referring cardiologists. Schneider and Epstein [46] also surveyed Pennsylvania cardiologists in 1995. Only 10% of these cardiologists indicated that the mortality rates in the guide were very important in assessing surgeon performance, citing concerns about data reliability, inadequate risk adjustment, and the use of mortality as the sole quality measure. Fewer than 10% of these cardiologists used the guide with more than 10% of their patients, and 87% of the cardiologists indicated minimal or no impact on their referral recommendations. In a related study by Hannan and associates [42] in New York, 94% of cardiologists found the New York report readable, and 67% found it reasonably accurate in capturing performance differences among surgeons. However, only 38% of cardiologists reported that it had substantially changed their referral patterns. The accuracy of this survey is substantiated by objective data from New York hospitals. In 1989, 8.7% of patients underwent CABG at New York hospitals with higher than average risk-adjusted mortality, and 15.7% underwent CABG at hospitals with lower than average risk-adjusted mortality. In 1993, after publication of report cards, the corresponding values were 9.5% and 17.0% [41]. Thus, a shift from high-mortality to low-mortality programs was not noted.
Finally, the impact of quality data on referral patterns by different payers was investigated by Shahian and associates [122] in Massachusetts and by Erickson and associates [123] in New York. These studies differed substantially in data set, risk adjustment, and statistical methodology. Despite the differences in study design and the fact that two distinct geographic areas were analyzed, the results of the studies were quite similar. In both market areas, managed care patients were significantly less likely to have heart surgery at lower mortality hospitals. This is strikingly counterintuitive, as it would be expected that these managed care payers would be the most likely to use objective outcome data to influence the referral patterns of their patients.
Thus, a variety of studies all suggest that the improvements noted after a public release of report cards have resulted not from consumer redirection to low-mortality providers, but rather from internal hospital quality improvement initiatives and restriction of surgeon privileges. Even the primary proponents of public reporting support this conclusion [29, 41, 42, 124]. Such observations have led Nugent, a leader in the CQI approach to cardiac surgery quality, to conclude that our efforts should be focused on improving results and reducing variability among all hospitals [125]. In this way, consumers will be safe in using whatever other criteria they wish in selecting a provider.
Individual surgeon profiling
One of the most controversial aspects of report cards is the publication of individual surgeon profiles. Even in New York, where the Department of Health has enthusiastically supported hospital report cards, the original plan did not include the release of physician-specific data. However, Newsday discovered that this information had been collected and successfully sued to access it [41, 43].
Statistical power is even more problematic at the individual provider level than at the hospital level because of the smaller sample sizes [23, 90]. In New York, this problem was addressed by using 3-year sampling for physicians as opposed to 1-year sampling for hospitals. A trade-off exists, however, between aggregating data over time to increase surgeon sample size and using a model that explicitly accounts for trends in outcomes. If mortality is decreasing over time, then omitting such an important factor in creating reports will systematically bias the results.
Because of these sample size issues, individual provider profiling is a particular concern for surgeons at the beginning or end of their careers, or for surgeons who perform CABG expertly but at lower volume because their primary focus is in other areas of heart surgery. It is also particularly disadvantageous and discouraging to surgeons willing to operate on the highest-risk patients, as the higher mortality for such patients is not diluted within the larger institutional volume. Surgeon profiling may thus further jeopardize access to care for the highest-risk patients.
Individual provider report cards focus on the surgeon as the primary determinant of outcome. They do not ascribe sufficient importance to important process issues, such as case selection by referring physicians, preoperative stabilization, postoperative care, and anesthesia. Surgeon profiling ignores the role of other providers, including cardiologists, anesthesiologists, nurses, and allied health professionals.
Individual physician profiling is most likely to foster negative behaviors because it is viewed as a personal threat to a surgeon's reputation and career. It is unlikely to provide statistically meaningful information unless years of data are examined. If 3 years of aggregate data are necessary to achieve adequate statistical power, and if the subsequent "cleaning" and analysis of data takes an additional 6 to 12 months, then 4 years will have transpired from the beginning of data collection on an individual surgeon to reporting of their results. So many other factors will have likely changed during this period that the information is of limited value by the time it is published [101].
| The CQI approach to cardiac surgery quality improvement |
|---|
An alternative approach is based on the principles of CQI [23, 128–131]. Rather than attempting to "weed out" low-performing institutions or surgeons, CQI employs collaboration, exchange visits within a nonthreatening environment, and identification of best practices to raise quality and reduce variability among all cooperating institutions [7, 23, 24, 125, 126].
Berwick has led a national and international effort to apply CQI principles to health care, asserting that most performance problems are due to failures of process and systems, not individuals. The Northern New England Cardiovascular Disease Study Group [20–25], the Minnesota Society of Thoracic Surgeons [57, 59], and the Veterans Affairs Administration [61–67] have all implemented highly successful CQI programs in cardiac surgery based upon these principles, without public report cards or media sensationalism. Their focus has been process and systems improvement, not punitive action against individual surgeons or institutions. Peterson and colleagues [30] have demonstrated that the reduction in Medicare CABG mortality for the northern New England group using this approach has been comparable to that in New York.
Because of the problems associated with public report cards, and because consumers do not appear to choose providers based on outcome data, it could reasonably be argued that confidential CQI is the preferable approach to achieving true improvements in cardiac surgery quality. By raising quality at all hospitals and reducing interhospital variation, consumers would be assured of safety in choosing any cardiac surgery provider, using whatever criteria they desire (eg, convenience, family recommendation) to make that decision [125]. Not surprisingly, report card proponents argue that an approach based solely on CQI is inadequate because it fails to provide public accountability.
| The balance between public accountability and peer-based quality improvement |
|---|
Neither approach is without potential adverse sequelae. Full disclosure may promote negative behavior among providers with little evidence that it directly improves overall quality of care. As stated by Berwick [133], the more public and potentially punitive the process, the less likely it is to be used enthusiastically to promote the public welfare. On the other hand, it is predictable that anything less than full-disclosure report cards will lead some to conclude that valuable information is being withheld. Hybrid approaches combining CQI and a basic report card may be the best compromise [9, 93].
| Recommendations |
|---|
In states where legislation mandates public report cards, it is critical that surgeons, statisticians familiar with provider profiling, and representatives from public health agencies work collaboratively to ensure a product that has the greatest possible credibility. Whether used for public report cards or confidential CQI, risk factors and outcome data should be collected using a validated instrument such as The Society of Thoracic Surgeons (STS) Database. This is the largest and most universally accepted database in cardiac surgery, and it will hopefully become the national benchmark in the future. No level of model sophistication can eliminate the biases resulting from poorly collected or inaccurate data. Precise database definitions, uniform training of data managers, and periodic external audit are all essential.
Provider-specific outcome measures (such as the 30-day observed-to-expected mortality ratio with 95% confidence intervals) should be calculated using hierarchical generalized linear models to avoid exaggerated estimates of precision and false labeling of providers as outliers. In the absence of an absolute standard (such as an arbitrary upper limit on 30-day surgical mortality for any provider), a reference population is needed for making comparisons among providers. Advantages of using state or regional heart surgery programs to form the reference population include data completeness, validity, and relevance. This is the method used in New York and northern New England. The disadvantage is lack of national representation. Alternatively, the national STS database would provide broader geographic representation in construction of the mortality ratios, but the estimates might be misleading because of the voluntary nature of the STS sample. Weighing these considerations, we recommend that state or region-specific reference populations be used when the purpose is to satisfy a legislative mandate for outcome information.
Finally, an effective public quality report should be designed to prevent consumers from drawing incorrect conclusions because they lack full appreciation of the precision of the data. Table 1 depicts a proposed report format that more explicitly conveys this uncertainty. In the table, the quantity subscripted i denotes the estimated standardized mortality ratio for the ith provider, (Lower, Upper) is a 95% confidence interval for the estimate, and the columns are ordered from the largest standardized mortality ratio to the smallest. By examining the shading, the reader can infer that the mortality ratio for provider 1 is not statistically significantly different from that of provider 2 (light shading). However, the estimate for provider 1 is significantly higher than that for provider 4 (first row of chart). Alternatively, the reader draws the same conclusion when looking down the first column in the chart: provider 4 has significantly lower standardized mortality than provider 1.
Finally, it should be a long-term goal to study other quality markers in addition to mortality, including morbidity, readmission rates, late survival, and quality of life. There should also be expanded use of process and structure quality measures in cardiac surgery, although these are more difficult to quantify.
| Footnotes |
|---|
| References |
|---|