Ann Thorac Surg 2007;83:S13-S26
© 2007 The Society of Thoracic Surgeons
a Duke Clinical Research Institute, Durham, North Carolina
b Tufts University School of Medicine, Boston, Massachusetts
c Department of Health Care Policy, Harvard Medical School, and Department of Biostatistics, Harvard School of Public Health, Boston, Massachusetts
d Division of Cardiothoracic Surgery, University of Florida, Jacksonville, Florida
e Division of Cardiovascular & Thoracic Surgery, University of Kentucky Chandler Medical Center, Lexington, Kentucky
f Sentara Cardiovascular Research Institute, Norfolk, Virginia
g The Society of Thoracic Surgeons, Chicago, Illinois
h The Society of Thoracic Surgeons, Seattle, Washington
Accepted for publication January 12, 2007.
* Address correspondence to Dr Shahian, The Society of Thoracic Surgeons, 633 N Saint Clair St, Suite 2320, Chicago, IL 60611 (Email: shahian{at}comcast.net).
| Executive Summary |
|---|
The QMTF evaluated various options for combining 11 National Quality Forum (NQF)-endorsed process and outcome measures, both within and across the four domains of care chosen by the Task Force (Perioperative Medical Care, Operative Care, Risk-Adjusted Operative Mortality, and Postoperative Risk-Adjusted Major Morbidity). These methods included simple or weighted averaging, a composite opportunity model similar to that used by the Centers for Medicare & Medicaid Services (CMS), "all or none" scoring, scaled combinations, and latent variable models. Each method was illustrated using actual 2004 STS data from 133,149 coronary artery bypass procedures. Provider performance was estimated using Bayesian random-effects approaches to account for small sample size and to incorporate risk adjustment for outcomes.
Latent variable modeling failed to provide accurate estimates of provider performance when tested with actual STS data. Most other methods of combining individual measures within a given domain produced similar and consistent estimates of performance (Spearman rank correlations 0.95 to 0.98), and an all or none approach was selected.
Combining scores across domains was accomplished by rescaling and then adding the domain-specific estimates. When this methodology is applied to actual STS data, a one percentage point improvement in mortality has the same impact on the overall composite score as does an 8% improvement in the morbidity rate, an 11% improvement in the frequency of internal mammary artery usage, or a 28% change in the frequency of using all four NQF-recommended medications.
The QMTF considered various approaches to determining performance tiers based on composite scores. As a demonstration of one such system, the QMTF conducted a pilot study with 2004 STS data, using a 99% Bayesian certainty criterion to assign performance tiers. This stringent criterion was used to maximize the statistical certainty of tier assignments. Applying this methodology, approximately 77% of providers fell into a middle-performance tier, 10% were determined to be in a high-performing tier, and another 13% in a low-performing tier.
In summary, the STS QMTF has developed and tested a composite measure of cardiac surgery quality that encompasses multiple domains of care, uses Bayesian random-effects analyses, uses all or none scoring where appropriate, and avoids subjective weighting of individual measures. One possible methodology for assigning performance tiers derived from these scores was demonstrated in a pilot study. This overall methodology was applied to actual STS data and appeared to satisfy multiple criteria for validity. These quality measures for cardiac surgery should prove useful to STS participants, payers, and governmental agencies.
| Introduction |
|---|
The STS now faces a similar leadership opportunity as the American health care system embarks on an unprecedented effort to measure and improve quality. A major focus of this collective effort will be performance measurement, as emphasized by the latest Institute of Medicine quality report, Performance Measurement: Accelerating Improvement [1]. To meet such new challenges and opportunities, it will be necessary to develop quality measures for cardiac surgery that are far more comprehensive than simple risk-adjusted mortality. This set of individual and composite quality measures must be evidence-based, derived from state of the art analytic methods, and subjected to rigorous empirical evaluation.
In 2005, the STS commissioned a Quality Measurement Task Force (QMTF) to develop methods for combining multiple dimensions of performance into a single comprehensive summary quality measure. Part 1 of this QMTF report describes the conceptual framework within which the QMTF conducted its deliberations and the guidelines used to select a set of individual quality measures for coronary artery bypass grafting (CABG) [2]. In Part 2, the QMTF focuses on the following statistical and methodologic issues: (1) the distribution and correlation of selected performance measures in an actual STS NCD data sample; (2) alternative approaches to combining measures within and across the four selected domains of quality (including sensitivity analyses); and (3) various approaches for assigning providers to performance tiers based on their composite scores, including a practical example of one such method.
| General Methodology |
|---|
"True" Process Compliance and Risk-Adjusted Outcomes Rates
The STS data for each NQF process measure consist of the number of patients cared for by a provider who were eligible to receive the specified care process (denominator) and the number of patients for whom the care process was actually delivered (numerator). Although the numerator and denominator are directly observable, the true quantity of interest relevant to performance evaluation is unobservable and may be regarded as the underlying true probability of delivering the care process. For each provider, the QMTF used analytic methods described subsequently that focus on estimating these unobservable parameters, which correspond to the five NQF process measures. In this report, the term usage rate will usually denote the actual observed percentage of eligible patients in which a provider used the care process, and true usage rate will denote the estimated corresponding "true" value.
In the case of outcomes measures, the STS data consist of the number of patients meeting the measure-specific inclusion criteria (denominator) and the number of these patients who avoided a particular adverse outcome (numerator). The "true" underlying probability of adverse outcomes may be defined and estimated in a fashion similar to that described for process measures. For outcomes, however, this estimate must also take into account the provider's case mix, resulting in a risk-standardized adverse event rate. This may be regarded as the percentage incidence of adverse outcomes that would be anticipated if the provider treated patients having an overall risk profile similar to the STS national average. Because there are six NQF outcome measures, we define and estimate six corresponding theoretical risk-standardized rates for each provider. Estimating these parameters requires a statistical model, as described subsequently and in the Technical Appendix.
Analytical Methods
Multivariate random-effects models were applied to STS data to estimate true provider-specific usage rates for process measures and true risk-standardized event rates for outcome measures. The term multivariate refers to the fact that several quality measures are analyzed together in a single model, not estimated one at a time in separate models. Unlike conventional methods, multivariate random-effects modeling incorporates information from all peer providers, thereby "borrowing strength" to obtain a more reliable estimate of a single provider's performance [3-7]. Provider-specific estimates are shrunken towards the mean for all providers, with the amount of shrinkage being inversely related to the number of CABG cases and also dependent on the relative amounts of between-provider and within-provider variation.
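The shrinkage behavior described above can be illustrated with a much simpler model than the one used in this report. The following is a minimal sketch, assuming a univariate Beta-Binomial model for a single process measure; it is not the multivariate random-effects model fitted by the QMTF. The function name `shrunken_rate` and the parameters `pooled_rate` and `prior_strength` are hypothetical and chosen only for illustration.

```python
def shrunken_rate(delivered: int, eligible: int,
                  pooled_rate: float, prior_strength: float) -> float:
    """Posterior mean usage rate under a Beta prior centred on pooled_rate.

    prior_strength (hypothetical) acts like a count of pseudo-patients: the
    larger it is relative to `eligible`, the stronger the pull to the mean.
    """
    if eligible < 0 or not 0 <= delivered <= eligible:
        raise ValueError("need 0 <= delivered <= eligible")
    if not 0.0 < pooled_rate < 1.0 or prior_strength <= 0:
        raise ValueError("pooled_rate must be in (0, 1) and prior_strength > 0")
    alpha = pooled_rate * prior_strength          # prior pseudo-successes
    beta = (1.0 - pooled_rate) * prior_strength   # prior pseudo-failures
    return (delivered + alpha) / (eligible + alpha + beta)


# Two hypothetical providers with the same observed 90% usage rate.
small_provider = shrunken_rate(9, 10, pooled_rate=0.75, prior_strength=50.0)
large_provider = shrunken_rate(900, 1000, pooled_rate=0.75, prior_strength=50.0)
```

In this sketch the low-volume provider's estimate is pulled much closer to the pooled rate than the high-volume provider's estimate, which mirrors the inverse relationship between shrinkage and case volume noted in the text.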
Provider-specific risk-adjusted (or risk-standardized) mortality and morbidity rates were estimated using risk scores from previously published risk-adjustment models [8] as described in the Technical Appendix.
Bayesian methodology [9-11] was used to fit each random-effects model and to study the characteristics of alternative methods for combining and weighting quality measures. One major advantage of Bayesian approaches is that inferences about a provider's performance are explicitly stated in terms of probabilities. For example, based on a provider's data, we might be 99% sure that their true performance is better than average. Conventional p values and confidence intervals do not have a similar probability interpretation.
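A probability statement of this kind can be sketched by simulation. The example below assumes a single process measure, a uniform Beta(1, 1) prior, and a fixed benchmark; these are simplifying assumptions for illustration, and the function name `prob_above_benchmark` is hypothetical. The report's actual models are multivariate and risk-adjusted.

```python
import random


def prob_above_benchmark(delivered: int, eligible: int, benchmark: float,
                         draws: int = 20000, seed: int = 0) -> float:
    """Monte Carlo estimate of P(true usage rate > benchmark | data).

    Assumes a uniform Beta(1, 1) prior, so the posterior is
    Beta(1 + delivered, 1 + eligible - delivered).
    """
    rng = random.Random(seed)
    a = 1 + delivered
    b = 1 + eligible - delivered
    hits = sum(rng.betavariate(a, b) > benchmark for _ in range(draws))
    return hits / draws


# Hypothetical providers compared against an 80% benchmark.
confident = prob_above_benchmark(190, 200, benchmark=0.80)  # large sample
uncertain = prob_above_benchmark(8, 10, benchmark=0.80)     # small sample
```

The first provider's posterior lies almost entirely above the benchmark, whereas the second provider's data are too sparse to support a confident statement in either direction.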
| Distribution of Individual Performance Measures |
|---|
The estimated distribution of true usage rates for NQF process measures is depicted in Figure 1. Between-provider variability as measured by the estimated interquartile range (IQR = 75th percentile minus 25th percentile) was greatest for the discharge antilipids measure (IQR, 67.5% to 88.5% = 21.0%), followed by preoperative β-blockers (IQR, 64.7% to 79.2% = 14.5%) and discharge β-blockers (IQR, 77.3% to 90.3% = 13.0%). The least variable measures were IMA usage (IQR, 89.9% to 96.0% = 6.1%) and discharge antiplatelets (IQR, 91.1% to 97.1% = 6.0%). Although most individual process measures had high overall estimated compliance rates, less than half of all patients received all four medications (estimated provider-specific median = 47.5%; Figure 3).
| Individual Performance Measure Correlation |
|---|
| Composite Scoring Methodologies |
|---|
In investigating how best to determine composite scores within and across domains, the QMTF considered a variety of existing approaches from health care, educational and psychological testing, psychometrics, and public sector performance assessment [12-22], the latter predominantly from the United Kingdom and Europe.
Finally, composite score methodologies were also pilot tested using actual 2004 STS data to assess the sensitivity of rankings to the choice of methodology and to illustrate one potential methodology for classifying providers into performance tiers.
Methods for Within-Domain Composite Scoring
For two of the four quality-of-care domains (Perioperative Medical Care and Postoperative Risk-Adjusted Major Morbidity), it was necessary to combine multiple measures into a single composite domain score. Options considered for combining individual measures within a domain included (a) the CMS opportunity model [23], (b) averaging of item-specific estimates, (c) all or none scoring [24], and (d) latent variable analysis [16-18, 25].
(a) CMS Opportunity Model. An opportunity-based approach, such as the method used by CMS in recent pay-for-performance pilot studies [23], is one way of accounting for the fact that some patients may be ineligible for some measures. An opportunity-based measure is obtained by summing the numerators for each indicator (ie, number of patients who received the particular care), summing their denominators (number of eligible patients), and dividing the former by the latter. Implicitly, each item is weighted in proportion to the percentage of eligible patients. In the case of NQF cardiac surgery measures, for which there are few recognized exclusions, almost all patients are eligible for all process measures. In this case, the opportunity-based approach should be virtually identical to simple averaging, and this was confirmed in pilot testing with actual STS data.
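The opportunity calculation described above can be sketched in a few lines. This is an illustrative example using raw counts, not the model-based estimates used in the report; the function names and the sample counts are hypothetical.

```python
def opportunity_score(measures):
    """Sum of numerators divided by sum of denominators.

    measures: list of (numerator, denominator) pairs, one per process measure.
    """
    total_num = sum(num for num, _ in measures)
    total_den = sum(den for _, den in measures)
    if total_den == 0:
        raise ValueError("no eligible patients")
    return total_num / total_den


def simple_average(measures):
    """Unweighted mean of the measure-specific rates."""
    return sum(num / den for num, den in measures) / len(measures)


# Hypothetical provider: every patient is eligible for every measure.
equal_eligibility = [(90, 100), (80, 100), (70, 100), (95, 100)]
# Hypothetical provider: eligibility differs between measures.
unequal_eligibility = [(90, 100), (10, 50)]

same = (opportunity_score(equal_eligibility), simple_average(equal_eligibility))
differ = (opportunity_score(unequal_eligibility), simple_average(unequal_eligibility))
```

When the denominators are equal, the two scores coincide, consistent with the observation that the opportunity model and simple averaging are virtually identical when almost all patients are eligible for all measures. They diverge only when eligibility differs across measures.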
(b) Averaging of Item-Specific Estimates. Both simple averaging of item-specific estimates and the opportunity-based approach may permit high performance on some measures to mask poor performance on other measures that may be critical to quality (compensability). Weighting the item-specific estimates by their importance may mitigate this problem; however, the rational assignment of such weights is highly problematic. As there are no clear data available in the literature from which to derive such weights empirically, the QMTF considered alternative methods. These included an expert opinion survey of STS members conducted on behalf of the QMTF. The results of this survey demonstrated consensus among STS members (experts) on use of an IMA graft as the most important marker of CABG process quality. In contrast, there was insufficient agreement to differentiate among the other NQF process measures or the various postoperative complications. Finally, purely subjective assignment of weights to individual items by the QMTF was considered but rejected as scientifically indefensible. In the absence of a clear rationale for assigning weights, experts generally consider equal weighting as the most appropriate default approach [22].
As a satisfactory weighting methodology was unavailable, averaging with equal weights was further investigated with sensitivity analyses, using the 2004 STS NCD data. A provider's overall score for medications was defined as the average of the provider's four medication-specific usage probabilities, as estimated with multivariate random-effects modeling. Similarly, a provider's overall score for the risk-adjusted morbidity domain was defined as the average of the provider's five risk-standardized event probabilities, as estimated with multivariate random-effects modeling. As summarized in columns A and B of Table 4, the results of these analyses were generally consistent with those derived from the CMS and all or none methodologies. Together with the theoretical objections as noted, this led the QMTF to reject the use of both simple and weighted averaging.
(c) All or None Scoring. Application of the all or none approach to actual STS data revealed inter-provider variability in the all or none compliance percentages for the perioperative process domain and in the any or none occurrence percentages for the morbidity domain (Fig 3). This variability suggests that such a scoring approach may be useful in helping to distinguish performance differences among providers. The all or none approach yielded similar composite scores to those obtained from averaging or from an opportunity model (Table 4; Columns A and B; Spearman correlation, 0.95 to 0.98).
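The difference between all or none scoring and item averaging can be seen in a small patient-level example. This is a sketch on raw observed flags for hypothetical patients, assuming every patient is eligible for every care process; the report itself works with model-based estimates of the corresponding true probabilities.

```python
def all_or_none_rate(patients):
    """Fraction of patients who received every care process.

    patients: list of tuples of booleans, one flag per care process.
    """
    if not patients:
        raise ValueError("no patients")
    return sum(all(flags) for flags in patients) / len(patients)


def mean_item_rate(patients):
    """Average of the item-specific compliance rates."""
    n_items = len(patients[0])
    item_rates = [sum(p[i] for p in patients) / len(patients)
                  for i in range(n_items)]
    return sum(item_rates) / n_items


# Four hypothetical patients, four medication flags each.
patients = [
    (True, True, True, True),
    (True, True, True, False),
    (True, False, True, True),
    (True, True, True, True),
]
strict = all_or_none_rate(patients)   # only patients 1 and 4 count
lenient = mean_item_rate(patients)    # each item is 75% or 100%
```

Here the average item rate is high while only half of the patients received all four medications. For the morbidity domain, the same function could be applied to flags that record the absence of each complication.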
(d) Latent Trait Analysis, including Item Response Theory. Finally, the QMTF also considered more complex modeling techniques originally developed in the fields of psychometric and educational testing, including latent trait analysis and item response theory [16-18, 25]. Latent trait analysis is theoretically well suited for the study of an abstract construct such as aptitude or quality. In this approach, multiple observable indicators such as process compliance or morbidity rates are assumed to be related to an underlying (unobserved) latent variable such as surgical quality, the latter being the primary focus of interest. This type of model potentially allows quality to be estimated with high statistical efficiency by combining information from multiple observable measures into a single parameter. The relative weights for each observable indicator are determined iteratively from the model, obviating the need to make a priori weight assignments.
Although latent trait modeling has potential statistical and practical advantages for discriminating among providers, the underlying assumptions (eg, unidimensionality, local independence) may not be appropriate for all data sets and must be tested on a case-by-case basis. Using 2004 STS data, we fitted the latent-trait logistic model described by Landrum and associates [17] and others. Informal model assessment included graphing the observed versus predicted values, whereas formal model evaluation consisted of computing the difference between observed and expected rates and an approximate Bayesian posterior p value.
Separate analyses were conducted for the four NQF medication measures and five NQF risk-adjusted morbidity end points. In both cases, there were large discrepancies between the model-based estimates and each provider's actual observed data, and the adequacy of each model was rejected with high confidence (Bayesian p < 0.00001). In contrast, the multivariate random-effects model appeared to fit adequately (Bayesian p = 0.43 for analysis of medications and p = 0.38 for morbidities). These findings suggest that one or more of the major assumptions (eg, unidimensionality, local independence) underlying the latent trait logistic model may not be tenable for STS data. This led the QMTF to reject the latent variable modeling approach for STS composite quality measures.
Final Within-Domain Composite Scoring Method
After testing all four potential methods for combining measures within domains, the QMTF selected an all/any or none approach. This method is straightforward and intuitive, avoids subjective weighting, sets an appropriately high benchmark for the ideal CABG hospitalization, and performs as well as or better than other methods when applied to actual STS data. A provider's score for the perioperative medication domain is its estimated true probability of delivering all four NQF medications. The provider's score for the morbidity domain is its estimated true risk-standardized probability of avoiding all five major morbidities.
Determination of Final Composite (Across-Domain) Scores
The next step of this project was to combine the two process measures (IMA usage rate and all or none medication compliance rate) and two risk-standardized outcomes measures (operative mortality rate and any or none morbidity rate) into a single comprehensive quality score. To assure consistent directionality, so that increasingly positive values reflect better performance, mortality rates were converted to survival rates (risk-standardized survival rate = 100 − risk-standardized mortality rate), and morbidity rates were converted to "absence of morbidity" rates (risk-standardized absence of morbidity rate = 100 − risk-standardized morbidity rate). A provider's score for the mortality domain is the provider's risk-standardized survival rate. Similarly, the provider's score for the morbidity domain is the risk-standardized absence of morbidity rate.
Another major statistical consideration was how to account for the differing scales of measurement of the domain-specific scores. In theory, each measure has the same scale, which ranges from 0% to 100%; however, in reality, measurement scales differ dramatically. Medication adherence rates are widely dispersed and range from close to 0% to almost 100% (a range of 100%). In contrast, risk-standardized survival rates are tightly clustered in a narrow interval ranging from about 95% to 99% (a range of 4%). To account for these differences, the scales of measurement need to be standardized before the domain scores are combined into an overall composite score.
The QMTF considered multiple options to standardize measurement scales among domains and to create a single overall composite score. In the CMS approach (a), separate process and outcomes composite scores are averaged together, with each domain composite weighted according to the number of items it encompasses:
(a) The rescaled score for the jth domain is calculated as:
The next two approaches involved rescaling each domain score by the reciprocal of its standard deviation (b) or its range (c), then weighting the rescaled estimates equally:
(b) Divide by the domain-specific standard deviation:
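$$\text{rescaled score}_j = \frac{\text{score}_j}{\sigma_j}$$

where $\text{score}_j$ is the domain-specific score for the jth domain and $\sigma_j$ is the corresponding standard deviation. The rescaled domain-specific scores all have the same standard deviation. Thus:

$$\text{composite score} = \sum_{j} \frac{\text{score}_j}{\sigma_j}$$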
(c) Divide by the domain-specific range.
Results were similar when rescaling was accomplished using the reciprocal of the item's range instead of its standard deviation (Spearman correlation = 0.99; column C). Ranks agree to within 100 places for 100% of providers.
(d) Results diverged to a much greater extent when the items were combined without rescaling (Spearman correlation = 0.84; column D).
The implications of rescaling were further explored. If items are not rescaled, then items that vary widely between providers will disproportionately influence the overall composite. Furthermore, without rescaling, a one percentage point difference in the risk-standardized mortality rate is considered to have the same importance as a one percentage point difference in the frequency of using IMA or compliance with the all or none medication measure. Rescaling by either the reciprocal of the standard deviation or the range changes the amount that improvement on a single item contributes to the overall composite. The approximate standard deviations corresponding to mortality, morbidity, IMA, and medications are 0.5, 4.2, 5.8, and 14.3, respectively. When items are weighted by the reciprocal of the standard deviation, a one percentage point improvement in mortality has the same impact on the composite score as does an 8% improvement in the morbidity rate, an 11% improvement in the frequency of IMA usage, or a 28% change in the frequency of using all four medications. These findings are largely consistent with the QMTF's clinical assessment regarding the importance of individual items, as well as results of the STS member survey.
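The equivalences quoted above follow from the ratios of the standard deviations. The short calculation below is a sketch using the approximate standard deviations given in the text; the function name `equivalent_change` is hypothetical.

```python
# Approximate between-provider standard deviations (percentage points),
# as quoted in the text.
standard_deviations = {
    "mortality": 0.5,
    "morbidity": 4.2,
    "ima": 5.8,
    "medications": 14.3,
}


def equivalent_change(sds, reference="mortality", reference_change=1.0):
    """Change in each item with the same composite impact as
    `reference_change` percentage points in the reference item,
    when every item is weighted by 1 / SD."""
    return {name: reference_change * sd / sds[reference]
            for name, sd in sds.items()}


equivalents = equivalent_change(standard_deviations)
```

With these inputs the ratios are about 8.4, 11.6, and 28.6 percentage points for morbidity, IMA usage, and medications, respectively, which is in line with the approximate figures of 8, 11, and 28 reported in the text.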
Final Overall Composite Scoring Method
To compute an overall composite score, the QMTF chose to rescale the domain-specific estimates by the reciprocals of their standard deviations, then add these rescaled estimates. To verify that each item contributes statistical information but does not dominate the composite, we calculated the item-total correlation between each domain-specific estimate and the overall comprehensive score. The item-total Pearson correlations were 0.48 (IMA score versus overall score), 0.56 (medication domain score versus overall score), 0.65 (morbidity domain score versus overall score), and 0.78 (mortality domain score versus overall score). Thus, although risk-adjusted mortality and morbidity explain much of the variation in the overall comprehensive score, no single item dominates, and all four items contribute statistical information.
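The rescale-and-add step and the item-total check can be sketched as follows. The provider values are invented for illustration, the population standard deviation is an assumed choice, and the code operates on point estimates only, whereas the report propagates uncertainty within a Bayesian model.

```python
import math
import statistics


def rescale_by_sd(scores):
    """Divide each provider's domain score by the between-provider SD."""
    sd = statistics.pstdev(scores)
    if sd == 0:
        raise ValueError("domain has no between-provider variation")
    return [s / sd for s in scores]


def composite_scores(domains):
    """Sum of SD-rescaled domain scores, one value per provider."""
    rescaled = [rescale_by_sd(scores) for scores in domains.values()]
    return [sum(values) for values in zip(*rescaled)]


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Five hypothetical providers (all values in percent).
mortality = [3.0, 2.0, 3.5, 1.5, 2.5]
morbidity = [15.0, 12.0, 20.0, 10.0, 14.0]
domains = {
    "survival": [100.0 - m for m in mortality],       # higher is better
    "no_morbidity": [100.0 - m for m in morbidity],   # higher is better
    "ima": [92.0, 95.0, 88.0, 96.0, 90.0],
    "medications": [40.0, 60.0, 30.0, 70.0, 50.0],
}

overall = composite_scores(domains)
item_total = {name: pearson(scores, overall)
              for name, scores in domains.items()}
```

After rescaling, every domain has unit standard deviation, so no domain dominates the sum simply because its raw values are more widely dispersed.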
| Performance Tier Determination |
|---|
Using actual STS data, the QMTF pilot tested the discriminating power of one hypothetical three-tiered rating system. A high level of statistical certainty was deemed essential for a system designed specifically to rate providers. Accordingly, in the pilot study, providers were assigned to the middle tier if their score was statistically indistinguishable from the STS national average based on a 99% Bayesian certainty criterion. Otherwise, providers were assigned to the top tier (above average performance) or bottom tier (below average performance).
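One way to read the 99% certainty rule is as a threshold on posterior probabilities. The sketch below assumes that posterior draws of a provider's composite score are already available and that the national average is treated as a fixed number; both are simplifications, and the function name `assign_tier` is hypothetical.

```python
def assign_tier(score_draws, benchmark, certainty=0.99):
    """Classify a provider from posterior draws of its composite score.

    Returns "top" if at least `certainty` of the draws exceed the benchmark,
    "bottom" if at least `certainty` fall below it, and "middle" otherwise.
    """
    if not score_draws:
        raise ValueError("no posterior draws supplied")
    n = len(score_draws)
    p_above = sum(draw > benchmark for draw in score_draws) / n
    p_below = sum(draw < benchmark for draw in score_draws) / n
    if p_above >= certainty:
        return "top"
    if p_below >= certainty:
        return "bottom"
    return "middle"


# Hypothetical draws: 98% above the benchmark is not enough for the top tier.
tier = assign_tier([1.0] * 98 + [-1.0] * 2, benchmark=0.0)
```

A provider lands in the middle tier either because its score is close to the benchmark or because its posterior is too wide to clear the threshold, which matches the interpretation given for the pilot results.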
The number of providers assigned to the bottom, middle, and top tiers using this particular rating system with the 2004 STS data was 70 (13%), 407 (77%), and 53 (10%), respectively. Providers in the middle tier may be interpreted as having average performance. Their estimated performance was either very close to the STS national average value, or else the number of patients was too small to make a reliable determination. For the remaining 123 providers, the classification of above average or below average performance could be made with high confidence (more than 99% certainty). Compared with the bottom tier, providers in the top tier had lower estimated risk-standardized mortality rates (median, 1.7% versus 3.0%); lower estimated any or none morbidity rates (median, 9.8% versus 18.1%); higher IMA usage rates (median, 95.7% versus 88.1%); and higher all or none medication rates (median, 66.4% versus 35.7%).
We speculated that the comprehensive quality score would have greater statistical power for discriminating between providers than a report card based solely on risk-adjusted operative mortality. The results of this pilot study confirm the statistical advantages of combining process and outcome measures when making such inter-provider comparisons. If tier assignments were to be based solely on risk-standardized operative mortality, then only 6 providers (1%) could be assigned to the top or bottom tier with at least 99% certainty. In contrast, the composite score was able to distinguish above average or below average performance for 123 providers (23%). The composite score achieves high statistical power because it combines information from 11 different quality indicators into a single estimate.
| Comment |
|---|
Although composite measures have many practical advantages, combining multiple measures into a single indicator is inherently problematic [15, 19, 20, 22]. Similar to the development of any statistical model, considerable judgment must be exercised in the construction of a composite indicator. The choice of the individual component measures, weighting of measures, the method used to aggregate measures, and the assignment of ranks are just a few of the potential sources of controversy. Decisions regarding these and other issues may substantially impact the evaluation of providers and may have important policy implications [15, 19, 20, 22]. Furthermore, each individual indicator reflects a different aspect of quality. Therefore, some information is invariably lost when only a single summary score is reported, particularly when measures are poorly correlated and nonredundant, as they were in our pilot studies.
Depending on how the composite is constructed, some important areas of performance may not be addressed or may be relatively undervalued, and aggregation may also obscure individual areas of strength or weakness. The ability to decompose the composite into its individual components is critical. This allows providers to analyze their performance in specific areas and to formulate improvement strategies.
Finally, recent studies have demonstrated that composite scores are associated with substantial random variation and may also be sensitive to factors such as the methods for weighting and aggregation [19, 20]. This emphasizes the need to correctly partition variability when calculating such scores, and to present them with appropriate measures of uncertainty.
To address the legitimate concern of whether a provider's composite performance score would be influenced substantially by the choice of statistical methodology, the QMTF implemented and compared a large number of approaches. In addition to the methods presented here, we also considered several variations of the most common approaches. In general, inferences about a provider's quality were robust and largely insensitive to the choice of methodology.
In constructing a composite scoring system, the goal of the QMTF was to use methodology that is scientifically rigorous and useful for third parties, as well as transparent, actionable, and acceptable to the cardiac surgery community. Although our goal was complete objectivity, we were often confronted by choices in which a correct decision could not be objectively determined. In such cases, detailed analyses were conducted to assess the empirical implications of our decision. In reporting these results, we have attempted to make the characteristics of the scoring system transparent to its users and to the surgeons who will be evaluated by it.
The QMTF approach to determining composite quality scores has several distinguishing features. First, performance on the individual component items is estimated using random-effects regression models. This approach, also known as shrinkage estimation, is particularly advantageous when some providers treat a small number of patients or the end points of interest are rare outcomes.
Second, to make inferences about each provider's quality, we have estimated performance using a Bayesian framework. Unlike conventional approaches, such as the CMS methodology, the Bayesian framework makes it possible to compute true probability intervals and other measures of uncertainty for any quantity of interest. Although we exploited the Bayesian framework for this investigation, our approach is flexible, and other statistical methods, such as empirical Bayes, might also be adopted.
Third, we used all or none composite scoring for the two domains of quality that contained multiple measures. This sets a high benchmark standard, the "ideal" CABG hospitalization.
Finally, for both theoretical and empirical reasons, we rejected all methods that used "importance" weighting of various measures when calculating domain scores. Neither literature review nor expert opinion survey provided scientifically valid, consistent weight determinations. Rather than subjectively assigning weights, we elected to use the all/any or none approach with implicit equal weighting. For the overall composite measure, weighting was accomplished through rescaling.
Limitations
NQF process measures have been extensively vetted to establish their suitability for performance measurement. However, it is important to acknowledge some limitations inherent to these measures and, in fact, to all measures of process compliance. Determining a set of valid exclusion criteria for process measures has proven challenging in many areas of medicine, and information on contraindications to various process measures is relatively limited in the STS NCD. As a result, some patients who are appropriately denied a medication owing to a contraindication will be misclassified as representing a process failure. The true overall percentage compliance with process measures will therefore be underestimated because some ineligible patients are included. Because the NQF measure set does not account for most exclusions, the QMTF acknowledges that all programs will have less than perfect process compliance scores. However, unless the proportion of patients who might satisfy legitimate exclusion criteria varies substantially among programs, comparisons that focus on relative rather than absolute process compliance will be unbiased. As exclusion criteria become more standardized and are added to the STS NCD, future versions of the QMTF scoring methodology may be modified to more fully account for them.
Limitations also exist with respect to various outcome measures. Because certain outcomes are difficult to define precisely, it is possible that variation in coding practices could account for some of the observed differences between providers.
Future Directions
Although the QMTF composite quality score is described as "comprehensive," we do not presume that it captures every important aspect of quality. The goal was to combine the NQF measures in a statistically reasonable, practical, and intuitive fashion. The decision to create an overall composite score using four domains of care (two process domains and two outcomes domains) resulted from our commitment to use all the NQF-endorsed CABG measures that were captured in the STS NCD. The 11 relevant NQF measures are divided about equally into process and outcomes, and they appear to group naturally and appropriately into the domains we selected. The QMTF believes that these four domains assess different and not necessarily congruent aspects of the care process, all of which are important in a multidimensional quality construct.
There are multiple alternative approaches that could have been used to develop the composite score, or that may be the focus of subsequent research:
Having developed the initial methodology described in this report, all subsequent modifications can be implemented in response to STS interests and external requests.
As the STS NCD evolves from a registry used primarily for research and for internal quality improvement to one that is also used for reporting to the public and to specific third parties such as payers, the need for aggressive audit and validation is correspondingly increased. This is currently a major national initiative of the STS NCD.
The management of missing data merits careful further consideration. The implications of such missing data increase when multiple outcomes are measured, and the missing-at-random assumption is less tenable. The consequences of restricting our pilot study to sites with less than 5% missing data were not specifically investigated, but this will be an important consideration when actual composite scores are publicly released. STS NCD policies may be modified to maximize provider data completeness for elements that comprise the composite score, and the most appropriate statistical methodologies for managing any remaining missing data will be implemented.
It will be instructive to monitor provider-specific scores longitudinally to assess their consistency, and to study which domains and individual measures appear most predictive of subsequent provider performance.
If composite scores are to be used for a rating system, choices must be made regarding the use of an internal or external benchmark, the number of rating tiers, the cut points used to define each tier, and the statistical criteria used to assign providers to tiers. The hypothetical three-tiered rating system described in this report is only one of many such potential systems, and the operating characteristics of alternative approaches should be investigated and compared.
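To make these design choices concrete, the following minimal sketch assigns a provider to one of three tiers by comparing posterior draws of its composite score with a benchmark. The function name `assign_tier`, the 0.95 certainty threshold, and the tier labels are illustrative assumptions; they are not the rating rule used in this report.

```python
def assign_tier(posterior_draws, benchmark, certainty=0.95):
    """Classify a provider from posterior draws of its composite score.

    Hypothetical rule: a provider is rated above (below) average only when
    at least `certainty` of the draws lie above (below) the benchmark.
    """
    if not posterior_draws:
        raise ValueError("posterior_draws must not be empty")
    n = len(posterior_draws)
    p_above = sum(d > benchmark for d in posterior_draws) / n
    p_below = sum(d < benchmark for d in posterior_draws) / n
    if p_above >= certainty:
        return "above average"
    if p_below >= certainty:
        return "below average"
    return "average"
```

Changing `benchmark` (internal versus external) or `certainty` changes how many providers fall into the outer tiers, which is the kind of operating characteristic that would need to be compared across candidate systems.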
Finally, research is needed to identify the optimum time window for estimating and reporting performance. In our analyses, performance was estimated based on a single year of data. When estimating performance, methods that incorporate data from multiple time points deserve consideration.
| Conclusion |
|---|
|
|
|---|
| Technical appendix |
|---|
|
|
|---|
| 1. Final Model for Estimating Composite Scores |
|---|
|
|
|---|
To ensure consistent directionality between process measures and risk-adjusted outcomes, the numerators of the risk-adjusted outcomes measures are defined as the number of patients who avoid the adverse end point. Thus, for both process and outcomes, larger values of the numerator are favorable.
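As a small illustration of this convention, the sketch below converts a count of adverse events into the corresponding "success" numerator. The function name `success_numerator` is a hypothetical helper, not part of any STS software.

```python
def success_numerator(n_eligible, n_adverse_events):
    """Number of eligible patients who avoided the adverse end point."""
    if not 0 <= n_adverse_events <= n_eligible:
        raise ValueError("n_adverse_events must lie between 0 and n_eligible")
    return n_eligible - n_adverse_events
```

For example, a site with 200 eligible patients and 5 operative deaths would contribute a numerator of 195 for the mortality measure.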
| Summary Measures of Case Mix |
|---|
|
|
|---|
The development and validation of STS risk models is described in Shroyer and colleagues [8]. The STS CABG mortality model was updated in 2004 and has not been published. (It is available as an online STS risk calculator at www.sts.org.) Although the end point predicted by the STS mortality/major morbidity model is not identical to the any or none morbidity end point chosen by the QMTF, the STS model still provides a useful summary measure of case mix. Because mortality is a relatively rare end point, the risk factors that predict the combined end point of "mortality or major morbidity" are essentially identical to the risk factors that predict our any or none morbidity end point. In light of this similarity, we used the existing STS mortality/major morbidity risk model to calculate risk scores for risk-adjusting the any or none morbidity end point.
For each end point, the formula for calculating a patient's predicted risk of the end point has the form:
|
|
| 1.1 Statistical Model |
|---|
|
|
|---|
Let $\theta_{mj}$ denote the true site-specific success probability at site $j$ ($j = 1, 2, \ldots, J$) for measure $m$, where $m = 1$ denotes avoidance of operative mortality; $m = 2$ denotes avoidance of all five morbidities; $m = 3$ denotes IMA usage; and $m = 4$ denotes use of all eligible medications. Let $n_{mj}$ denote the number of patients who were eligible to be included in the denominator for measure $m$ at site $j$, and let $Y_{mj}$ denote the number of successful outcomes (numerator). Conditional on $\theta_{mj}$, the observed numerator is assumed to arise from a binomial distribution with probability parameter $\theta_{mj}$. That is:

$$Y_{mj} \mid \theta_{mj} \sim \text{Binomial}(n_{mj}, \theta_{mj}).$$
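A probability model for all four outcomes $Y_{1j}, \ldots, Y_{4j}$ is obtained by assuming that each binomial outcome is conditionally independent given $(\theta_{1j}, \ldots, \theta_{4j})$. Let $Y_j = (Y_{1j}, \ldots, Y_{4j})$ denote the collection of outcomes for the $j$th participant. The likelihood for site $j$ is given by

$$p(Y_j \mid \theta_{1j}, \ldots, \theta_{4j}) = \prod_{m=1}^{4} \binom{n_{mj}}{Y_{mj}} \theta_{mj}^{Y_{mj}} (1 - \theta_{mj})^{n_{mj} - Y_{mj}} \tag{1}$$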
Conditional on $(\theta_{1j}, \ldots, \theta_{4j})$, the observed data are assumed to consist of four independent binomially distributed random variables. The assumption of conditional independence is likely to be violated in practice but is made to simplify the computation. Although the model assumes conditional independence between $(Y_{1j}, \ldots, Y_{4j})$, the model does not assume marginal independence between these variables, because the underlying binomial parameters $(\theta_{1j}, \ldots, \theta_{4j})$ are assumed to arise from a random distribution with parameters that allow for intra-item correlation.
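The site-level likelihood described above is a product of independent binomial terms, so its logarithm is a simple sum. The following minimal sketch evaluates it for one site; the function names are hypothetical, and the inputs are assumed to be sequences of equal length with every success probability strictly between 0 and 1.

```python
import math


def log_binomial_pmf(y, n, theta):
    """Log of the binomial probability of y successes in n trials."""
    log_choose = math.lgamma(n + 1) - math.lgamma(y + 1) - math.lgamma(n - y + 1)
    return log_choose + y * math.log(theta) + (n - y) * math.log(1.0 - theta)


def site_log_likelihood(y, n, theta):
    """Sum of binomial log-likelihoods over the measures at one site.

    Summing the terms corresponds to multiplying the probabilities, which is
    what conditional independence given the success probabilities implies.
    """
    return sum(log_binomial_pmf(ym, nm, tm) for ym, nm, tm in zip(y, n, theta))
```

Working on the log scale avoids numerical underflow when the denominators are large.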
To express probabilities on a linear scale that is not constrained by a maximum of 100% or a minimum of 0%, the probability parameters are converted to odds parameters, and we model the logarithm of the odds. Let $\omega_{mj} = \theta_{mj}/(1 - \theta_{mj})$ denote the odds of success for measure $m$ at site $j$. The odds is interpreted as the probability of a successful outcome divided by the probability of an unsuccessful outcome. If the odds can be estimated, then it can be converted to a probability, because $\theta_{mj} = \omega_{mj}/(1 + \omega_{mj})$. Similar to $\theta_{mj}$, larger values of $\omega_{mj}$ imply a higher probability of a successful outcome. We focus on the logarithm of the odds parameters, $\log \omega_{mj}$, because this quantity ranges from negative infinity to infinity (ie, no boundary constraints).
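The conversions between probability, odds, and log odds described above can be written in a few lines. This is a minimal sketch with hypothetical function names; probabilities are assumed to lie strictly between 0 and 1.

```python
import math


def odds_from_probability(p):
    """Odds of success: P(success) / P(failure)."""
    return p / (1.0 - p)


def probability_from_odds(odds):
    """Invert the odds back to a probability."""
    return odds / (1.0 + odds)


def log_odds(p):
    """Log odds, which is unbounded in both directions."""
    return math.log(odds_from_probability(p))


def probability_from_log_odds(x):
    """Inverse of log_odds."""
    return 1.0 / (1.0 + math.exp(-x))
```

A success probability of 0.8, for instance, corresponds to odds of 4 to 1, and converting the odds back returns 0.8.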
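A fundamental assumption of the multivariate hierarchic logistic model is that the parameters $(\log \omega_{1j}, \log \omega_{2j}, \log \omega_{3j}, \log \omega_{4j})$ are distributed according to a multivariate normal distribution. Correlation among performance on different end points is reflected in the covariance parameters of the multivariate normal distribution. We further assume that a provider's performance on any single end point is described by a logistic regression model. The latter assumption is embodied by the set of equations:

$$\begin{aligned}
\log \omega_{1j} &= \mu_1 + \beta_1 z_{1j} + \delta_{1j} \\
\log \omega_{2j} &= \mu_2 + \beta_2 z_{2j} + \delta_{2j} \\
\log \omega_{3j} &= \mu_3 + \delta_{3j} \\
\log \omega_{4j} &= \mu_4 + \delta_{4j}
\end{aligned}$$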
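where $(\mu_1, \mu_2, \mu_3, \mu_4)$ denote intercept parameters that determine the overall frequency of success for the four measures; $(\delta_{1j}, \delta_{2j}, \delta_{3j}, \delta_{4j})$ are normally distributed error terms that determine the extent to which the $j$th site deviates from the average; $z_{1j}$ denotes the logit of the average predicted risk of mortality at site $j$, as determined by the STS mortality model (described above); $z_{2j}$ denotes the logit of the average predicted risk of "mortality or major morbidity," as determined by the STS composite end point model (described above); and $(\beta_1, \beta_2)$ denote regression coefficients to be estimated from the data. The terms $\beta_1 z_{1j}$ and $\beta_2 z_{2j}$ are included to incorporate risk adjustment into the analysis of the mortality and morbidity end points. No assumptions are made about the covariance parameters of the multivariate normal distribution. An equivalent specification of the model is:

$$\begin{pmatrix} \log \omega_{1j} \\ \log \omega_{2j} \\ \log \omega_{3j} \\ \log \omega_{4j} \end{pmatrix} \sim N_4\left( \begin{pmatrix} \mu_1 + \beta_1 z_{1j} \\ \mu_2 + \beta_2 z_{2j} \\ \mu_3 \\ \mu_4 \end{pmatrix}, \begin{pmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{14} \\ \sigma_{12} & \sigma_{22} & \cdots & \sigma_{24} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{14} & \sigma_{24} & \cdots & \sigma_{44} \end{pmatrix} \right)$$

where $(\sigma_{11}, \sigma_{12}, \ldots, \sigma_{44})$ denote unknown parameters of the multivariate normal covariance matrix. These unknown covariance parameters are estimated from the data along with the unknown $\mu$s and $\beta$s.
| 1.2 Definition of Risk-Standardized Rates |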
|---|
|
|
|---|
The probability $\theta_{mj}$ is not a meaningful reflection of a site's quality because it partly reflects the site's case mix. Interest instead focuses on estimating each provider's "risk-standardized success rate," denoted by $\theta'_{mj}$. Because there is no single widely accepted definition of the risk-standardized success rate, the QMTF considered several options, and chose the following:

$$\theta'_{mj} = \frac{\exp(\mu_m + \beta_m \bar{z}_m + \delta_{mj})}{1 + \exp(\mu_m + \beta_m \bar{z}_m + \delta_{mj})} \tag{2}$$

where $\bar{z}_m$ denotes the value of $z_{mj}$ for an "average" provider. The risk-standardized success rate $\theta'_{mj}$ is loosely interpreted as the success rate for measure $m$ that would be projected to occur hypothetically if provider $j$ had a "typical" case mix.
| 1.3 Definition of True Composite Score |
|---|
|
|
|---|
|
|
$\theta'_{mj}$ or $\theta_{mj}$. Larger values of the composite score imply better performance.
| 1.4 Method of Estimation |
|---|
|
|
|---|
The quantities $\theta'_{1j}$, $\theta'_{2j}$, $\theta_{3j}$, and $\theta_{4j}$ were estimated in a Bayesian framework by specifying a diffuse normal prior for $\mu_1$ and $\mu_2$; an informative normal prior distribution for $\beta_1$, $\beta_2$, $\beta_3$, and $\beta_4$; and a diffuse Wishart prior for the distribution of $T = \Sigma^{-1}$, where $\Sigma = (\sigma_{11}, \sigma_{12}, \ldots, \sigma_{44})$ denotes the covariance matrix of the random effects distribution. Specifically:
|
|
|
|
|
where $\text{Wishart}_\nu(R)$ denotes a Wishart distribution with $\nu$ degrees of freedom and scale matrix $R$. The Wishart distribution is parameterized such that $\sum_{i=1}^{\nu} z_i z_i' \sim \text{Wishart}_\nu(R)$ if $z_i \overset{\text{iid}}{\sim} N(0, R)$, with $R$ denoting the covariance matrix of the multivariate normal distribution of the $z_i$.
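The parameterization above can be checked directly by simulation: summing the outer products of $\nu$ independent $N(0, R)$ vectors yields one Wishart draw, and the average of many draws should be close to $\nu R$. The following standard-library sketch illustrates this; the function names and the example values of $\nu$ and $R$ are assumptions chosen for illustration and are not the values used by the QMTF.

```python
import math
import random


def cholesky(R):
    """Lower-triangular L with L L' = R, for symmetric positive definite R."""
    d = len(R)
    L = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for k in range(i + 1):
            s = sum(L[i][t] * L[k][t] for t in range(k))
            if i == k:
                L[i][k] = math.sqrt(R[i][i] - s)
            else:
                L[i][k] = (R[i][k] - s) / L[k][k]
    return L


def wishart_draw(nu, R, rng):
    """One draw of sum_{i=1}^{nu} z_i z_i' with z_i iid N(0, R)."""
    d = len(R)
    L = cholesky(R)
    W = [[0.0] * d for _ in range(d)]
    for _ in range(nu):
        e = [rng.gauss(0.0, 1.0) for _ in range(d)]
        # z = L e has covariance L L' = R
        z = [sum(L[a][b] * e[b] for b in range(a + 1)) for a in range(d)]
        for a in range(d):
            for b in range(d):
                W[a][b] += z[a] * z[b]
    return W


def mean_wishart(nu, R, n_draws, seed=0):
    """Monte Carlo average of n_draws Wishart draws (approximately nu * R)."""
    rng = random.Random(seed)
    d = len(R)
    total = [[0.0] * d for _ in range(d)]
    for _ in range(n_draws):
        W = wishart_draw(nu, R, rng)
        for a in range(d):
            for b in range(d):
                total[a][b] += W[a][b] / n_draws
    return total
```

Under this parameterization the scale matrix is on the covariance scale, which differs from conventions in which the Wishart is indexed by the inverse of that matrix.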
The chosen scale matrix implies that the prior mean of the correlation between two random effects from the same site is equal to 0.05, that is, $E[\text{corr}(\delta_{mj}, \delta_{m'j})] = 0.05$. According to the prior distribution, there is also 50% prior probability that $\text{corr}(\delta_{mj}, \delta_{m'j})$ lies in the interval $(-0.70, 0.70)$. The parameters $\{\mu_m\}$, $\{\beta_m\}$, and $T$ were assumed to be mutually independent in the prior distribution.
The $N(1, 1)$ prior for $\beta_m$ is motivated by the fact that $\beta_m = 1$ by definition under the assumption that each provider's true event rate is exactly equal to the rate predicted by the STS risk model; hence, we chose our prior mean to be 1.0. In reality, we do not believe $\beta_m = 1$. (Owing to site-level variation in performance, we do not believe the STS risk model will exactly predict each site's true event rate.) The prior variance of 1.0 was chosen to allow for uncertainty regarding the true value of $\beta_m$. Although larger values of the variance might be considered desirable (because larger variance implies greater uncertainty about $\beta_m$), we encountered computational difficulties (slow mixing) with larger variance.
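To illustrate how the intercept, the case-mix term, and the site effect of Part 1 fit together, the sketch below computes a site's success probability and its risk-standardized success rate. It assumes that the risk-standardized rate is obtained by evaluating the logistic model at the case-mix value of an "average" provider, which is our reading of Section 1.2. All function names and numerical values (including the use of 1.0 for the regression coefficient and 0.0 for the average case mix) are hypothetical and chosen only for illustration.

```python
import math


def expit(x):
    """Inverse logit: converts log odds to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def success_probability(mu, beta, z_site, delta):
    """Success probability at a site with its own case mix z_site."""
    return expit(mu + beta * z_site + delta)


def risk_standardized_rate(mu, beta, z_average, delta):
    """Success probability the site would have with an average case mix."""
    return expit(mu + beta * z_average + delta)


# Two hypothetical sites with the same site effect but different case mix.
mu, beta, z_average, delta = 3.0, 1.0, 0.0, 0.2
raw_a = success_probability(mu, beta, -0.3, delta)
raw_b = success_probability(mu, beta, 0.4, delta)
standardized_a = risk_standardized_rate(mu, beta, z_average, delta)
standardized_b = risk_standardized_rate(mu, beta, z_average, delta)
```

In this sketch the two sites have different unadjusted success probabilities but identical risk-standardized rates, because the standardized rate depends on the site only through its random effect.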
| 2. Models for Combining Items Within a Domain (Reported but not Selected) |
|---|
|
|
|---|
To describe these models, let $M$ denote the total number of measures considered in a single model ($M = 4$ for medication models; $M = 5$ for morbidity models); and let $J$ (= 530) denote the number of providers. Using the notation of Part 1, let $n_{mj}$, $y_{mj}$, $\theta_{mj}$, and $\omega_{mj}$ denote the number of eligible patients (denominator), the number of successful results (numerator), the true success probability, and the odds of success, respectively, for measure $m$ ($m = 1, 2, \ldots, M$) and site $j$ ($j = 1, 2, \ldots, J$). The probability model for site $j$'s data is given by equation (1) above. Each of the four models described below was estimated in a Bayesian framework using vague proper priors for the distribution of model parameters.
Model A1. Multivariate random effects model for analyzing the four NQF medication measures
(M = 4.) Model:

$$\log \omega_{mj} = \mu_m + \delta_{mj}, \quad m = 1, \ldots, 4,$$

where $(\delta_{1j}, \ldots, \delta_{4j})$ are distributed according to a multivariate normal distribution having mean vector zero and an unstructured covariance matrix.
Model A2. Multivariate random effects model for analyzing the five NQF morbidity measures
(M = 5.) Model:

$$\log \omega_{mj} = \mu_m + \beta_m z_{mj} + \delta_{mj}, \quad m = 1, \ldots, 5,$$

where $(\delta_{1j}, \ldots, \delta_{5j})$ are distributed according to a multivariate normal distribution with mean vector zero and an unstructured covariance matrix. The summary measures of case mix $(z_{1j}, \ldots, z_{5j})$ were calculated from previously validated STS risk models, using the approach described in Section 1 of the Appendix. Separate STS risk models exist for each of the five NQF-endorsed morbidity end points (Shroyer and colleagues [8]). The quantity $z_{mj}$ is defined as the logit of the average predicted risk of end point $m$ at site $j$.
Model A3. Latent trait logistic model for analyzing the four NQF medication measures
(M = 4.) Model:

$$\log \omega_{mj} = \alpha_m + \lambda_m Q_j, \quad m = 1, \ldots, 4,$$

where $Q_j \overset{\text{iid}}{\sim} N(0, 1)$; and for identifiability we assume that $\lambda_1 > 1$. This model is an application of the latent trait logistic model described by Landrum and colleagues [17]. In this model, $Q_j$ represents the "latent quality" of the $j$th site.
Model A4. Latent trait logistic model for analyzing the five NQF morbidity measures
(M = 5.) Model:

$$\log \omega_{mj} = \alpha_m + \beta_m z_{mj} + \lambda_m Q_j, \quad m = 1, \ldots, 5,$$

where $Q_j \overset{\text{iid}}{\sim} N(0, 1)$; and for identifiability, we assume that $\lambda_1 > 1$. This model is a slight generalization of the latent trait logistic model described by Landrum and colleagues [17]. The model they described did not include the terms $\beta_m z_{mj}$ that allow for risk adjustment.
Use of Bayesian p-Values to Test Fit of Latent Trait Logistic Models
The Bayesian p-value is the probability that a hypothetical replicated data set, $y^{\text{rep}}$, would diverge from the true model as much as the observed data set, $y$, diverges from the true model. Divergence between the model and the data was defined by the quantity
|
|
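where $\theta$ denotes the collection of all of the parameters $\theta_{mj}$. The Bayesian p-value was defined as

$$p = \Pr[D(y^{\text{rep}}, \theta) \ge D(y, \theta) \mid y] = \iint I[D(y^{\text{rep}}, \theta) \ge D(y, \theta)] \, p(y^{\text{rep}} \mid \theta) \, p(\theta \mid y) \, dy^{\text{rep}} \, d\theta,$$

where $I[\cdot]$ is the indicator function; $p(y^{\text{rep}} \mid \theta)$ is the probability density function for a hypothetical replicated data set conditional on the model parameters (defined by equation 1, above); and $p(\theta \mid y)$ is the posterior distribution of the model parameters given the observed data. The probability of interest is taken over the joint distribution $p(y^{\text{rep}}, \theta \mid y)$. To calculate the probability of interest, the integral

$$\int I[D(y^{\text{rep}}, \theta) \ge D(y, \theta)] \, p(y^{\text{rep}} \mid \theta) \, dy^{\text{rep}} \tag{3}$$

was approximated by $\Pr[\chi^2_{M \times J} \ge D(y, \theta) \mid \theta, y]$, where $\chi^2_d$ denotes a $\chi^2$ random variable with $d$ degrees of freedom. The final approximate p-value was calculated as

$$p \approx \frac{1}{N} \sum_{k=1}^{N} \Pr[\chi^2_{M \times J} \ge D(y, \theta^{(k)}) \mid \theta^{(k)}, y],$$

where $\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(N)}$ denote $N$ draws from a Markov chain Monte Carlo simulation with target distribution $p(\theta \mid y)$.
| 3. Models for Other Results Reported |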
|---|
|
|
|---|
The final multivariate model for estimating composite performance (described in Part 1 of the Appendix) involves four end points: operative mortality, any or none morbidity, IMA usage, and all or none medications. In addition to analyzing these end points simultaneously in a single multivariate model, we also analyzed these end points one at a time by fitting four separate univariate random effects models. The method of incorporating risk adjustment was identical to the method described for the multivariate model. These univariate analyses were used to produce the top panel of Figure 3 (all or none medication usage); the bottom panel of Figure 3 (any or none morbidity end point); and to count the number of sites that would be identified as outliers if performance was estimated based on operative mortality alone (last paragraph of Performance Tier Determination section).
| Footnotes |
|---|
|
|
|---|
Dr Shahian is the Quality Measurement Task Force Chair and Writing Group Leader.
| References |
|---|
|
|
|---|