
Ann Thorac Surg 2007;83:S13-S26
© 2007 The Society of Thoracic Surgeons


Report of the STS Quality Measurement Task Force

Quality Measurement in Adult Cardiac Surgery: Part 2—Statistical Considerations in Composite Measure Scoring and Provider Rating

Sean M. O’Brien, PhD (a); David M. Shahian, MD (b, †, *); Elizabeth R. DeLong, PhD (a); Sharon-Lise T. Normand, PhD (c); Fred H. Edwards, MD (d); Victor A. Ferraris, MD (e); Constance K. Haan, MD (d); Jeffrey B. Rich, MD (f); Cynthia M. Shewan, PhD (g); Rachel S. Dokholyan, MPH (a); Richard P. Anderson, MD (h); Eric D. Peterson, MD, MPH (a)

a Duke Clinical Research Institute, Durham, North Carolina
b Tufts University School of Medicine, Boston, Massachusetts
c Department of Health Care Policy, Harvard Medical School, and Department of Biostatistics, Harvard School of Public Health, Boston, Massachusetts
d Division of Cardiothoracic Surgery, University of Florida, Jacksonville, Florida
e Division of Cardiovascular & Thoracic Surgery, University of Kentucky Chandler Medical Center, Lexington, Kentucky
f Sentara Cardiovascular Research Institute, Norfolk, Virginia
g The Society of Thoracic Surgeons, Chicago, Illinois
h The Society of Thoracic Surgeons, Seattle, Washington

Accepted for publication January 12, 2007.

* Address correspondence to Dr Shahian, The Society of Thoracic Surgeons, 633 N Saint Clair St, Suite 2320, Chicago, IL 60611 (Email: shahian@comcast.net).


    Executive Summary
 Top
 Executive Summary
 Introduction
 General Methodology
 Distribution of Individual...
 Individual Performance Measure...
 Composite Scoring Methodologies
 Performance Tier Determination
 Comment
 Conclusion
 Technical appendix
 1. Final Model for...
 Summary Measures of Case...
 1.1 Statistical Model
 1.2 Definition of Risk...
 1.3 Definition of True...
 1.4 Method of Estimation
 2. Models for Combining...
 3. Models for Other...
 Footnotes
 References
 
There is increasing interest among payers, patients, regulators, and providers in measuring and comparing cardiac surgery quality. The Society of Thoracic Surgeons (STS) Quality Measurement Task Force (QMTF) was established to develop comprehensive, summary performance measures encompassing multiple domains of quality. This report describes statistical considerations relevant to combining multiple measures into an overall composite score and then using such scores to rate providers.

The QMTF evaluated various options for combining 11 National Quality Forum (NQF)–endorsed process and outcome measures, both within and across the four domains of care chosen by the Task Force (Perioperative Medical Care, Operative Care, Risk-Adjusted Operative Mortality, and Postoperative Risk-Adjusted Major Morbidity). These methods included simple or weighted averaging, a composite opportunity model similar to that used by the Centers for Medicare & Medicaid Services (CMS), "all or none" scoring, scaled combinations, and latent variable models. Each method was illustrated using actual 2004 STS data from 133,149 coronary artery bypass procedures. Provider performance was estimated using Bayesian random-effects approaches to account for small sample size and to incorporate risk adjustment for outcomes.

Latent variable modeling failed to provide accurate estimates of provider performance when tested with actual STS data. Most other methods of combining individual measures within a given domain produced similar and consistent estimates of performance (Spearman rank correlations 0.95 to 0.98), and an all or none approach was selected.

Combining scores across domains was accomplished by rescaling and then adding the domain-specific estimates. When this methodology is applied to actual STS data, a one percentage point improvement in mortality has the same impact on the overall composite score as does an 8% improvement in the morbidity rate, an 11% improvement in the frequency of internal mammary artery usage, or a 28% change in the frequency of using all four NQF-recommended medications.

The QMTF considered various approaches to determining performance tiers based on composite scores. As a demonstration of one such system, the QMTF conducted a pilot study with 2004 STS data, using a 99% Bayesian certainty criterion to assign performance tiers. This stringent criterion was used to maximize the statistical certainty of tier assignments. Applying this methodology, approximately 77% of providers fell into a middle-performance tier, 10% were determined to be in a high-performing tier, and another 13% in a low-performing tier.

In summary, the STS QMTF has developed and tested a composite measure of cardiac surgery quality that encompasses multiple domains of care, uses Bayesian random-effects analyses, uses all or none scoring where appropriate, and avoids subjective weighting of individual measures. One possible methodology for assigning performance tiers derived from these scores was demonstrated in a pilot study. This overall methodology was applied to actual STS data and appeared to satisfy multiple criteria for validity. These quality measures for cardiac surgery should prove useful to STS participants, payers, and governmental agencies.


    Introduction
 
More than 15 years ago, the STS was one of the first specialty organizations to recognize the importance of developing a prospectively maintained clinical data registry. The resulting STS National Adult Cardiac Surgery Database (STS NCD) has achieved widespread acceptance by the provider community as well as interested third parties, including health policy researchers, government regulators, accrediting agencies, and payers.

The STS now faces a similar leadership opportunity as the American health care system embarks on an unprecedented effort to measure and improve quality. A major focus of this collective effort will be performance measurement, as emphasized by the latest Institute of Medicine quality report, Performance Measurement: Accelerating Improvement [1]. To meet such new challenges and opportunities, it will be necessary to develop quality measures for cardiac surgery that are far more comprehensive than simple risk-adjusted mortality. This set of individual and composite quality measures must be evidence-based, derived from state-of-the-art analytic methods, and subjected to rigorous empirical evaluation.

In 2005, the STS commissioned a Quality Measurement Task Force (QMTF) to develop methods for combining multiple dimensions of performance into a single comprehensive summary quality measure. Part 1 of this QMTF report describes the conceptual framework within which the QMTF conducted its deliberations and the guidelines used to select a set of individual quality measures for coronary artery bypass grafting (CABG) [2]. In Part 2, the QMTF focuses on the following statistical and methodologic issues: (1) the distribution and correlation of selected performance measures in an actual STS NCD data sample; (2) alternative approaches to combining measures within and across the four selected domains of quality (including sensitivity analyses); and (3) various approaches for assigning providers to performance tiers based on their composite scores, including a practical example of one such method.


    General Methodology
 
Performance Measure Definitions
The QMTF selected 11 individual CABG performance measures (five process and six outcome), all of which were endorsed by the National Quality Forum (NQF) and are available in the STS NCD (Table 1). Specific NQF inclusion/exclusion criteria were applied to these measures to the extent possible. The internal mammary artery (IMA) measure excluded patients undergoing repeat CABG surgery, the permanent stroke measure excluded patients with a previous cerebrovascular accident, and the three discharge medication measures were only calculated among patients who survived until discharge.


 
Table 1 Individual Measures and Domains in the STS Composite Quality Score
 
Study Population
Using an actual sample from the STS NCD, the QMTF investigated the distribution of individual performance measures, methodologies for combining measures within and across domains, and the results derived from one potential methodology for performance tier assignment. The 530 providers that performed at least 10 isolated CABG surgeries during 2004 (median, 195; mean, 251; range 11 to 1513) and had less than 5% missing data for each of the five NQF process measures constituted the data source. The term provider is used generically to refer to an STS database participant (the unit of analysis), which may be a hospital or a cardiac surgery group, or both. The final study population consisted of all 133,149 patients who underwent isolated CABG surgery by one of these providers during 2004.

"True" Process Compliance and Risk-Adjusted Outcomes Rates
The STS data for each NQF process measure consist of the number of patients cared for by a provider who were eligible to receive the specified care process (denominator) and the number of patients for whom the care process was actually delivered (numerator). Although the numerator and denominator are directly observable, the true quantity of interest relevant to performance evaluation is unobservable and may be regarded as the underlying true probability of delivering the care process. For each provider, the QMTF used analytic methods described subsequently that focus on estimating these unobservable parameters, which correspond to the five NQF process measures. In this report, the term usage rate will usually denote the actual observed percentage of eligible patients in which a provider used the care process, and true usage rate will denote the estimated corresponding "true" value.

In the case of outcomes measures, the STS data consist of the number of patients meeting the measure-specific inclusion criteria (denominator) and the number of these patients who avoided a particular adverse outcome (numerator). The "true" underlying probability of adverse outcomes may be defined and estimated in a fashion similar to that described for process measures. For outcomes, however, this estimate must also take into account the provider’s case mix, resulting in a risk-standardized adverse event rate. This may be regarded as the percentage incidence of adverse outcomes that would be anticipated if the provider treated patients having an overall risk profile similar to the STS national average. Because there are six NQF outcome measures, we define and estimate six corresponding theoretical risk-standardized rates for each provider. Estimating these parameters requires a statistical model, as described subsequently and in the Technical Appendix.

Analytical Methods
Multivariate random-effects models were applied to STS data to estimate true provider-specific usage rates for process measures and true risk-standardized event rates for outcome measures. The term multivariate refers to the fact that several quality measures are analyzed together in a single model, not estimated one at a time in separate models. Unlike conventional methods, multivariate random-effects modeling incorporates information from all peer providers, thereby "borrowing strength" to obtain a more reliable estimate of a single provider’s performance [3–7]. Provider-specific estimates are shrunken toward the mean for all providers, with the amount of shrinkage being inversely related to the number of CABG cases and also dependent on the relative amounts of between-provider and within-provider variation.
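The shrinkage behavior described above can be illustrated with a much simpler stand-in than the multivariate random-effects model used in this report: a single-measure beta-binomial model with a fixed prior. The function name, the prior mean, and the prior strength below are all hypothetical choices made for illustration; they are not taken from the STS analysis.

```python
def shrunken_rate(successes, eligible, prior_mean, prior_strength):
    """Posterior mean of a provider's true usage rate under a Beta prior.

    prior_mean     -- assumed average rate across all providers (0..1)
    prior_strength -- assumed prior "sample size"; a larger value stands for
                      less between-provider variation, hence more shrinkage
    """
    alpha = prior_mean * prior_strength
    beta = (1.0 - prior_mean) * prior_strength
    return (successes + alpha) / (eligible + alpha + beta)


# Two hypothetical providers with the same observed rate (60%) but very
# different volumes; the assumed all-provider mean is 80%.
small = shrunken_rate(successes=12, eligible=20, prior_mean=0.80, prior_strength=50)
large = shrunken_rate(successes=600, eligible=1000, prior_mean=0.80, prior_strength=50)
print(f"small provider: {small:.3f}")
print(f"large provider: {large:.3f}")
```

In this sketch the low-volume provider's estimate is pulled most of the way toward the overall mean, while the high-volume provider's estimate stays close to its observed rate, which is the qualitative pattern the text describes.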

Provider-specific risk-adjusted (or risk-standardized) mortality and morbidity rates were estimated using risk scores from previously published risk-adjustment models [8] as described in the Technical Appendix.

Bayesian methodology [9–11] was used to fit each random-effects model and to study the characteristics of alternative methods for combining and weighting quality measures. One major advantage of Bayesian approaches is that inferences about a provider’s performance are explicitly stated in terms of probabilities. For example, based on a provider’s data, we might be 99% sure that their true performance is better than average. Conventional p values and confidence intervals do not have a similar probability interpretation.
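As a rough illustration of the kind of probability statement mentioned above, the sketch below estimates the posterior probability that a provider's true usage rate exceeds a benchmark. It assumes a single process measure, a flat Beta(1, 1) prior, and no risk adjustment, so it is far simpler than the hierarchical models fitted in this report; the function name and all numbers are hypothetical.

```python
import random


def prob_better_than(successes, eligible, benchmark, draws=20000, seed=0):
    """Monte Carlo estimate of P(true rate > benchmark | data).

    Uses a Beta(1, 1) prior (an assumption), so the posterior is
    Beta(successes + 1, failures + 1).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    a = successes + 1
    b = eligible - successes + 1
    hits = sum(rng.betavariate(a, b) > benchmark for _ in range(draws))
    return hits / draws


# Hypothetical provider: 190 of 200 eligible patients received the care
# process; the benchmark (for example, the all-provider average) is 90%.
p = prob_better_than(successes=190, eligible=200, benchmark=0.90)
print(f"P(true rate > 90%) is approximately {p:.3f}")
```

A statement such as "we are 99% sure this provider is better than average" corresponds to this probability being at least 0.99.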


    Distribution of Individual Performance Measures
 
The distribution of provider-specific performance on NQF measures was investigated by using random-effects modeling to identify overall performance levels and to quantify between-provider variation. In general, measures that vary widely yield high statistical power for discriminating among providers. Care is needed to ensure that the less variable measures contribute additional statistical information when they are combined with more variable measures in a composite.

The estimated distribution of true usage rates for NQF process measures is depicted in Figure 1. Between-provider variability as measured by the estimated interquartile range (IQR = 75th percentile minus 25th percentile) was greatest for the discharge antilipids measure (IQR, 67.5% to 88.5% = 21.0%), followed by preoperative β-blockers (IQR, 64.7% to 79.2% = 14.5%) and discharge β-blockers (IQR, 77.3% to 90.3% = 13.0%). The least variable measures were IMA usage (IQR, 89.9% to 96.0% = 6.1%) and discharge antiplatelets (IQR, 91.1% to 97.1% = 6.0%). Although most individual process measures had high overall estimated compliance rates, less than half of all patients received all four medications (estimated provider-specific median = 47.5%; Figure 3).


 
Fig 1. Estimated distribution of true provider-specific compliance rates for National Quality Forum process measures. (DC = discharge; IMA = internal mammary artery; IQR = interquartile range.)

 

 
Fig 3. Estimated distribution of true provider-specific rates for all or none or any or none measures. (IQR = interquartile range.)

 
The estimated distribution of true risk-standardized event rates for NQF outcome measures is depicted in Figure 2. As previously discussed, these estimates were derived from multivariate random-effects models by using STS risk factors. For operative mortality, there is an estimated sevenfold difference in the true risk-standardized rate for the worst performing provider compared with the best (5.3% versus 0.8%; IQR 1.9% to 2.9%, median = 2.3%). The least variable outcomes measures include stroke (median, 1.2%; IQR, 1.1% to 1.4% = 0.3%) and infection (median, 0.5%; IQR, 0.3% to 0.7% = 0.4%). The most variable outcomes measure was prolonged ventilation (median, 8.6%; IQR, 6.0% to 11.9% = 5.9%).


 
Fig 2. Estimated distribution of true provider-specific risk-standardized adverse event rates for National Quality Forum outcomes measures. (IQR = interquartile range.)

 

    Individual Performance Measure Correlation
 
The estimated correlation between pairs of NQF measures (true process usage rates and true risk-standardized outcome rates) is summarized in Tables 2 and 3. For process measures, the estimated Pearson correlation between pairs ranged from 0.10 (IMA versus discharge antiplatelets) to 0.50 (preoperative β-blockers versus discharge β-blockers). For risk-adjusted outcome measures, the estimated Pearson correlation between pairs ranged from 0.15 (prolonged ventilation versus permanent stroke) to 0.65 (sternal infection versus operative mortality).


 
Table 2 Estimated Correlation Between True Success Rates for National Quality Forum Process Measures
 

 
Table 3 Estimated Correlation Between True Success Rates for National Quality Forum Risk-Adjusted Outcome Measures
 
These results suggest that performance on individual process and outcome measures was generally not strongly related to performance on other measures. Even for the most strongly correlated measures, a provider’s performance on one measure did not accurately predict performance on another measure. For example, among providers who ranked in the top quartile of performance for preoperative β-blocker usage, 23.5% ranked in the bottom half of performance for discharge β-blockers. These findings suggest that the 11 selected measures provide complementary rather than redundant information about performance.


    Composite Scoring Methodologies
 
The QMTF selected four quality-of-care domains for CABG, represented by 11 NQF-endorsed process and outcomes measures: (1) Perioperative Medical Care (preoperative and discharge β-blockers, discharge antiplatelets, and discharge antilipids), (2) Operative Care (IMA usage), (3) Risk-Adjusted Operative Mortality, and (4) Postoperative Risk-Adjusted Major Morbidity (stroke, prolonged ventilation, renal insufficiency, reexploration for any cause, and deep sternal wound infection). The QMTF considered both (a) alternative methods to combine measures within multiple-measure domains (including sensitivity analyses) and (b) alternative methods to combine estimates across all four domains into a single composite quality score, including rescaling.

In investigating how best to determine composite scores within and across domains, the QMTF considered a variety of existing approaches from health care, educational and psychological testing, psychometrics, and public sector performance assessment [12–22], the latter predominantly from the United Kingdom and Europe.

Finally, composite score methodologies were also pilot tested using actual 2004 STS data to assess the sensitivity of rankings to the choice of methodology and to illustrate one potential methodology for classifying providers into performance tiers.

Methods for Within-Domain Composite Scoring
For two of the four quality-of-care domains (Perioperative Medical Care and Postoperative Risk-Adjusted Major Morbidity), it was necessary to combine multiple measures into a single composite domain score. Options considered for combining individual measures within a domain included (a) the CMS opportunity model [23], (b) averaging of item-specific estimates, (c) all or none scoring [24], and (d) latent variable analysis [16–18, 25].

(a) CMS Opportunity Model. An opportunity-based approach, such as the method used by CMS in recent pay-for-performance pilot studies [23], is one way of accounting for the fact that some patients may be ineligible for some measures. An opportunity-based measure is obtained by summing the numerators for each indicator (ie, number of patients who received the particular care), summing their denominators (number of eligible patients), and dividing the former by the latter. Implicitly, each item is weighted in proportion to the percentage of eligible patients. In the case of NQF cardiac surgery measures, for which there are few recognized exclusions, almost all patients are eligible for all process measures. In this case, the opportunity-based approach should be virtually identical to simple averaging, and this was confirmed in pilot testing with actual STS data.
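The opportunity-based calculation described above (sum of numerators divided by sum of denominators) can be sketched as follows, alongside simple averaging for comparison. The function names and the provider counts are hypothetical and serve only to illustrate why the two approaches nearly coincide when almost all patients are eligible for every measure.

```python
def opportunity_score(measures):
    """measures: list of (numerator, denominator) pairs, one per process measure."""
    total_num = sum(n for n, _ in measures)
    total_den = sum(d for _, d in measures)
    return total_num / total_den


def simple_average(measures):
    """Unweighted mean of the measure-specific rates."""
    return sum(n / d for n, d in measures) / len(measures)


# Hypothetical provider with four medication measures and nearly equal
# numbers of eligible patients per measure.
provider = [(150, 200), (180, 198), (190, 198), (160, 198)]
opp = opportunity_score(provider)
avg = simple_average(provider)
print(f"opportunity score: {opp:.4f}")
print(f"simple average:    {avg:.4f}")
```

When every measure has exactly the same denominator, the two scores are identical; they diverge only as eligibility differs across measures.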

(b) Averaging of Item-Specific Estimates. Both simple averaging of item-specific estimates and the opportunity-based approach may permit high performance on some measures to mask poor performance on other measures that may be critical to quality (compensability). Weighting the item-specific estimates by their importance may mitigate this problem; however, the rational assignment of such weights is highly problematic. As there are no clear data available in the literature from which to derive such weights empirically, the QMTF considered alternative methods. These included an expert opinion survey of STS members conducted on behalf of the QMTF. The results of this survey demonstrated consensus among STS members (experts) on use of an IMA graft as the most important marker of CABG process quality. In contrast, there was insufficient agreement to differentiate among the other NQF process measures or the various postoperative complications. Finally, purely subjective assignment of weights to individual items by the QMTF was considered but rejected as scientifically indefensible. In the absence of a clear rationale for assigning weights, experts generally consider equal weighting as the most appropriate default approach [22].

As a satisfactory weighting methodology was unavailable, averaging with equal weights was further investigated with sensitivity analyses, using the 2004 STS NCD data. A provider’s overall score for medications was defined as the average of the provider’s four medication-specific usage probabilities, as estimated with multivariate random-effects modeling. Similarly, a provider’s overall score for the risk-adjusted morbidity domain was defined as the average of the provider’s five risk-standardized event probabilities, as estimated with multivariate random-effects modeling. As summarized in columns A and B of Table 4, the results of these analyses were generally consistent with those derived from the CMS and all or none methodologies. Together with the theoretical objections as noted, this led the QMTF to reject the use of both simple and weighted averaging.


 
Table 4 Sensitivity Analyses
 
(c) All or None Scoring. The QMTF also considered the use of all or none scoring as advocated by the Institute for Healthcare Improvement [24] and the Institute of Medicine [1]. With an all or none score, performance on process measures is defined by the percentage of patients who received all of the care items for which they were eligible. An analogous measure for outcomes is any or none, defined by the percent of patients who were discharged without having sustained any of the five major complications. No partial credit is given if a patient experiences some but not all of the desired results.
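A minimal sketch of all or none scoring from patient-level records is shown below. The record layout (one dictionary per patient, with None marking ineligibility), the measure names, and the handling of patients who are eligible for no measure are assumptions made for illustration; this computes an observed percentage, not the model-based "true" rate estimated in this report.

```python
def all_or_none_rate(patients):
    """Fraction of patients who received every care item they were eligible for.

    patients: list of dicts mapping measure name -> True (delivered),
    False (not delivered), or None (patient not eligible for that measure).
    """
    scored = 0
    passed = 0
    for record in patients:
        eligible = [v for v in record.values() if v is not None]
        if not eligible:
            continue  # eligible for no measure: excluded (an assumption)
        scored += 1
        if all(eligible):
            passed += 1  # no partial credit: every eligible item must be met
    return passed / scored if scored else float("nan")


# Four hypothetical patients and the four medication measures.
patients = [
    {"preop_bb": True, "dc_bb": True, "dc_antiplatelet": True, "dc_antilipid": True},
    {"preop_bb": True, "dc_bb": True, "dc_antiplatelet": True, "dc_antilipid": False},
    {"preop_bb": None, "dc_bb": True, "dc_antiplatelet": True, "dc_antilipid": True},
    {"preop_bb": True, "dc_bb": False, "dc_antiplatelet": True, "dc_antilipid": True},
]
rate = all_or_none_rate(patients)
print(f"all or none rate: {rate:.2f}")
```

The analogous any or none outcome measure would count the patients for whom none of the five complications occurred.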

Application of this approach to actual STS data revealed that there was inter-provider variability of the all or none compliance percentages for the perioperative process domain and of the any or none occurrence percentages for the morbidity domain (Fig 3). This variability suggests that such a scoring approach may be useful in helping to distinguish performance differences among providers. The all or none approach yielded similar composite scores to those obtained from averaging or from an opportunity model (Table 4; Columns A and B; Spearman correlation, 0.95 to 0.98).
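Agreement between two scoring methods, as summarized by the Spearman correlations quoted above, can be computed by ranking each set of provider scores and taking the Pearson correlation of the ranks. The sketch below uses only the standard library; the five provider scores are invented for illustration and are not STS results.

```python
def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical scores for five providers under two scoring methods.
all_or_none = [0.47, 0.52, 0.31, 0.66, 0.58]
averaged = [0.85, 0.84, 0.78, 0.92, 0.90]
rho = spearman(all_or_none, averaged)
print(f"Spearman correlation: {rho:.2f}")
```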

(d) Latent Trait Analysis, including Item Response Theory. Finally, the QMTF also considered more complex modeling techniques originally developed in the fields of psychometric and educational testing, including latent trait analysis and item response theory [16–18, 25]. Latent trait analysis is theoretically well suited for the study of an abstract construct such as aptitude or quality. In this approach, multiple observable indicators such as process compliance or morbidity rates are assumed to be related to an underlying (unobserved) latent variable such as surgical quality, the latter being the primary focus of interest. This type of model potentially allows quality to be estimated with high statistical efficiency by combining information from multiple observable measures into a single parameter. The relative weights for each observable indicator are determined iteratively from the model, obviating the need to make a priori weight assignments.

Although latent trait modeling has potential statistical and practical advantages for discriminating among providers, the underlying assumptions (eg, unidimensionality, local independence) may not be appropriate for all data sets and must be tested on a case-by-case basis. Using 2004 STS data, we fitted the latent-trait logistic model described by Landrum and associates [17] and others. Informal model assessment included graphing the observed versus predicted values, whereas formal model evaluation consisted of computing the difference between observed and expected rates and an approximate Bayesian posterior p value.

Separate analyses were conducted for the four NQF medication measures and five NQF risk-adjusted morbidity end points. In both cases, there were large discrepancies between the model-based estimates and each provider’s actual observed data, and the adequacy of each model was rejected with high confidence (Bayesian p < 0.00001). In contrast, the multivariate random-effects model appeared to fit adequately (Bayesian p = 0.43 for analysis of medications and p = 0.38 for morbidities). These findings suggest that one or more of the major assumptions (eg, unidimensionality, local independence) underlying the latent trait logistic model may not be tenable for STS data. This led the QMTF to reject the latent variable modeling approach for STS composite quality measures.

Final Within-Domain Composite Scoring Method
After testing all four potential methods for combining measures within domains, the QMTF selected an all/any or none approach. This method is straightforward and intuitive, avoids subjective weighting, sets an appropriately high benchmark for the ideal CABG hospitalization, and performs as well as or better than other methods when applied to actual STS data. A provider’s score for the perioperative medication domain is its estimated true probability of delivering all four NQF medications. The provider’s score for the morbidity domain is its estimated true risk-standardized probability of avoiding all five major morbidities.

Determination of Final Composite (Across-Domain) Scores
The next step of this project was to combine the two process measures (IMA usage rate and all or none medication compliance rate) and two risk-standardized outcomes measures (operative mortality rate and any or none morbidity rate) into a single comprehensive quality score. To assure consistent directionality, so that increasingly positive values reflect better performance, mortality rates were converted to survival rates (risk-standardized survival rate = 100 – risk-standardized mortality rate), and morbidity rates were converted to "absence of morbidity" rates (risk-standardized absence of morbidity rate = 100 – risk-standardized morbidity rate). A provider’s score for the mortality domain is the provider’s risk-standardized survival rate. Similarly, the provider’s score for the morbidity domain is the risk-standardized absence of morbidity rate.

Another major statistical consideration was how to account for the differing scales of measurement of the domain-specific scores. In theory, each measure has the same scale, which ranges from 0% to 100%; however, in reality, measurement scales differ dramatically. Medication adherence rates are widely dispersed and range from close to 0% to almost 100% (a range of 100%). In contrast, risk-standardized survival rates are tightly clustered in a narrow interval ranging from about 95% to 99% (a range of 4%). To account for these differences, the scales of measurement need to be standardized before the domain scores are combined into an overall composite score.

The QMTF considered multiple options to standardize measurement scales among domains and to create a single overall composite score. In the CMS approach (a), separate process and outcomes composite scores are averaged together, with each domain composite weighted according to the number of items it encompasses:

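(a) The rescaled score for the j th domain is calculated as:

$$X_j^{*} = \frac{n_j}{\sum_{k} n_k}\, X_j$$

where $X_j$ is the original score in the j th domain and $n_j$ is the number of items comprising the j th domain. Thus,

$$\text{Composite} = \sum_{j} X_j^{*} = \frac{\sum_{j} n_j X_j}{\sum_{j} n_j}$$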

In the case of CABG surgery, CMS uses five process measures and three outcome measures. A single summary composite is obtained by weighting the process composite by 5/8 and the outcome composite by 3/8. The QMTF regards this weighting mechanism as a significant limitation of the CMS approach. It does not account for the fact that process measure adherence and risk-standardized survival rates are measured on unequal scales. Using this approach, the outcome component contribution to the overall composite score may be substantially underweighted compared with that of the process component, which is not the desired effect. The QMTF rejected this approach.
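The item-count weighting described above can be sketched as follows. The function name and the two domain scores are hypothetical; only the counts of five process and three outcome measures come from the text.

```python
def item_count_weighted(domain_scores, item_counts):
    """Weight each domain composite by the number of items it contains."""
    total_items = sum(item_counts)
    return sum(s * n for s, n in zip(domain_scores, item_counts)) / total_items


# Hypothetical provider: process composite 90%, outcome composite 97%,
# built from 5 process measures and 3 outcome measures.
score = item_count_weighted([90.0, 97.0], [5, 3])
print(f"summary composite: {score:.3f}")

# Effect of a one-point gain in each component on the summary composite.
gain_process = item_count_weighted([91.0, 97.0], [5, 3]) - score
gain_outcome = item_count_weighted([90.0, 98.0], [5, 3]) - score
print(f"gain from +1 process point: {gain_process:.3f}")
print(f"gain from +1 outcome point: {gain_outcome:.3f}")
```

Under this scheme a one-point gain in the outcome composite moves the summary by 3/8 of a point versus 5/8 for the process composite, even though one point is a very large change on the narrow outcome scale.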

The next two approaches involved rescaling each domain score by the reciprocal of its standard deviation (b) or its range (c), then weighting the rescaled estimates equally:

(b) Divide by the domain-specific standard deviation:

$$X_j^{*} = \frac{X_j - \bar{X}_j}{\sigma_j}$$

where $\bar{X}_j$ is the average value of score j among STS participants and $\sigma_j$ is the corresponding standard deviation. The rescaled domain-specific scores all have the same standard deviation. Thus:

$$\text{Composite} = \sum_j \frac{X_j - \bar{X}_j}{\sigma_j}$$

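(c) Divide by the domain-specific range:

$$X_j^{*} = \frac{X_j - \mathrm{MIN}_j}{\mathrm{MAX}_j - \mathrm{MIN}_j}$$

where $\mathrm{MAX}_j$ and $\mathrm{MIN}_j$ are the maximum and minimum observed values of $X_j$ across all of the providers. The rescaled domain-specific scores all lie in the interval 0 to 1. Thus:

$$\text{Composite} = \sum_j \frac{X_j - \mathrm{MIN}_j}{\mathrm{MAX}_j - \mathrm{MIN}_j}$$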

Results were similar when rescaling was accomplished using the reciprocal of the item’s range instead of its standard deviation (Spearman correlation = 0.99; column C). Ranks agree to within 100 places for 100% of providers.

(d) Results diverged to a much greater extent when the items were combined without rescaling (Spearman correlation = 0.84; column D):


$$\text{Composite} = \sum_j X_j$$

Approximately 3.2% of providers changed ranks more than 200 places, depending on whether scaling was used. Furthermore, among the 177 providers that were ranked in the bottom third when rescaling was based on the standard deviation, four (2.3%) were ranked in the top third when the items were combined without rescaling.
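The comparison of rescaling options (b), (c), and (d) can be sketched as follows. This is a minimal illustration on synthetic data, not the QMTF analysis: the domain means, the independence between domains, and all function names are assumptions, and the standard deviations are borrowed from the approximate values reported in this section. Rescaled scores are summed, which gives the same ranks as averaging them.

```python
import numpy as np

def ranks(values):
    """0-based ranks; ties are not handled, which is fine for continuous scores."""
    order = np.argsort(values)
    out = np.empty(len(values))
    out[order] = np.arange(len(values))
    return out

def spearman(a, b):
    """Spearman correlation computed as the Pearson correlation of the ranks."""
    return float(np.corrcoef(ranks(a), ranks(b))[0, 1])

def composite_sd(X):
    """Option (b): center each domain, divide by its SD, then sum."""
    return ((X - X.mean(axis=0)) / X.std(axis=0, ddof=1)).sum(axis=1)

def composite_range(X):
    """Option (c): map each domain onto [0, 1] by its observed range, then sum."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return ((X - lo) / (hi - lo)).sum(axis=1)

def composite_raw(X):
    """Option (d): sum the domain scores without rescaling."""
    return X.sum(axis=1)

# Synthetic scores (percent) for 530 providers and 4 domains. The means are
# hypothetical, and values are not clipped to 100%; this only illustrates scaling.
rng = np.random.default_rng(0)
means = np.array([97.5, 86.0, 92.0, 50.0])
sds = np.array([0.5, 4.2, 5.8, 14.3])
X = rng.normal(means, sds, size=(530, 4))

comp_sd, comp_range, comp_raw = composite_sd(X), composite_range(X), composite_raw(X)
print("SD vs range:", round(spearman(comp_sd, comp_range), 3))
print("SD vs raw:  ", round(spearman(comp_sd, comp_raw), 3))
```

Because the unscaled sum is dominated by the most widely dispersed domain, its ranking tends to drift away from the two rescaled rankings, which stay close to each other.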

The implications of rescaling were further explored. If items are not rescaled, then items that vary widely between providers will disproportionately influence the overall composite. Furthermore, without rescaling, a one percentage point difference in the risk-standardized mortality rate is considered to have the same importance as a one percentage point difference in the frequency of using IMA or in compliance with the all or none medication measure. Rescaling by the reciprocal of either the standard deviation or the range changes the amount that improvement on a single item contributes to the overall composite. The approximate standard deviations corresponding to mortality, morbidity, IMA, and medications are 0.5, 4.2, 5.8, and 14.3 percentage points, respectively. When items are weighted by the reciprocal of the standard deviation, a one percentage point improvement in mortality has the same impact on the composite score as an improvement of roughly 8 percentage points in the morbidity rate, 11 percentage points in the frequency of IMA usage, or 28 percentage points in the frequency of using all four medications. These findings are largely consistent with the QMTF’s clinical assessment regarding the importance of individual items, as well as results of the STS member survey.
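The equivalences quoted above follow directly from the ratios of the standard deviations. A short worked check, using only the four approximate standard deviations given in the text (the function name is illustrative):

```python
# Approximate between-provider standard deviations, in percentage points.
sd = {"mortality": 0.5, "morbidity": 4.2, "IMA": 5.8, "medications": 14.3}

def equivalent_change(item, reference="mortality", reference_change=1.0):
    """Change in `item` (percentage points) that moves an SD-weighted composite
    as much as `reference_change` points on `reference`."""
    return reference_change * sd[item] / sd[reference]

for item in ("morbidity", "IMA", "medications"):
    print(item, round(equivalent_change(item), 1))
```

The ratios are 8.4, 11.6, and 28.6, which the text reports as 8, 11, and 28.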

Final Overall Composite Scoring Method
To compute an overall composite score, the QMTF chose to rescale the domain-specific estimates by the reciprocals of their standard deviations, then add these rescaled estimates. To verify that each item contributes statistical information but does not dominate the composite, we calculated the item–total correlation between each domain-specific estimate and the overall comprehensive score. The item–total Pearson correlations were 0.48 (IMA score versus overall score), 0.56 (medication domain score versus overall score), 0.65 (morbidity domain score versus overall score), and 0.78 (mortality domain score versus overall score). Thus, although risk-adjusted mortality and morbidity explain much of the variation in the overall comprehensive score, no single item dominates, and all four items contribute statistical information.


    Performance Tier Determination
Having selected a methodology for composite scoring, the QMTF considered various options for assigning providers to performance tiers. For example, a number of different approaches have been used in the United Kingdom, including confidence intervals to determine high and low outliers, ranks based on percentiles, and absolute thresholds, which, unlike the first two options, do not reflect performance relative to other providers [19, 20].

Using actual STS data, the QMTF pilot tested the discriminating power of one hypothetical three-tiered rating system. A high level of statistical certainty was deemed essential for a system designed specifically to rate providers. Accordingly, in the pilot study, providers were assigned to the middle tier if their score was statistically indistinguishable from the STS national average based on a 99% Bayesian certainty criterion. Otherwise, providers were assigned to the top tier (above average performance) or bottom tier (below average performance).
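The tier rule described above can be sketched in a few lines. The text does not spell out the exact computation behind the 99% criterion, so this sketch assumes it is applied to the posterior probability that a provider's composite score lies above (or below) the national average, estimated from posterior draws. The function name and inputs are hypothetical.

```python
import numpy as np

def assign_tier(draws, national_average, certainty=0.99):
    """Classify one provider from posterior draws of its composite score.

    ASSUMPTION: 'statistically indistinguishable' is read as neither tail
    probability reaching the certainty threshold.
    """
    p_above = float(np.mean(np.asarray(draws) > national_average))
    if p_above >= certainty:
        return "top"
    if 1.0 - p_above >= certainty:
        return "bottom"
    return "middle"

# Usage with made-up draws for a single provider.
rng = np.random.default_rng(0)
example_draws = rng.normal(loc=0.4, scale=0.3, size=4000)
print(assign_tier(example_draws, national_average=0.0))
```

Under this reading, a provider with few patients has a wide posterior and lands in the middle tier even when its point estimate is far from average, which matches the interpretation of the middle tier given in the text.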

The number of providers assigned to the bottom, middle, and top tiers using this particular rating system with the 2004 STS data was 70 (13%), 407 (77%), and 53 (10%), respectively. Providers in the middle tier may be interpreted as having average performance. Their estimated performance was either very close to the STS national average value, or else the number of patients was too small to make a reliable determination. For the remaining 123 providers, the classification of above average or below average performance could be made with high confidence (more than 99% certainty). Compared with the bottom tier, providers in the top tier had lower estimated risk-standardized mortality rates (median, 1.7% versus 3.0%); lower estimated any or none morbidity rates (median, 9.8% versus 18.1%); higher IMA usage rates (median, 95.7% versus 88.1%); and higher all or none medication rates (median, 66.4% versus 35.7%).

We speculated that the comprehensive quality score would have greater statistical power for discriminating between providers than a report card based solely on risk-adjusted operative mortality. The results of this pilot study confirm the statistical advantages of combining process and outcome measures when making such inter-provider comparisons. If tier assignments were to be based solely on risk-standardized operative mortality, then only 6 providers (1%) could be assigned to the top or bottom tier with at least 99% certainty. In contrast, the composite score was able to distinguish above average or below average performance for 123 providers (23%). The composite score achieves high statistical power because it combines information from 11 different quality indicators into a single estimate.


    Comment
Composite indicators are useful for summarizing and comparing the quality of care delivered by healthcare providers. In many areas of medicine, the number of acceptable quality indicators is large, and there is a need for summary measures that combine performance on multiple end points. Although quality improvement requires attention to each individual aspect of quality, there are many settings in which the users of quality measures are most interested in the bottom line. The comprehensive composite quality score developed by QMTF satisfies the need for a composite quality measure for CABG providers, and it can easily be extended to valvular surgery.

Although composite measures have many practical advantages, combining multiple measures into a single indicator is inherently problematic [15, 19, 20, 22]. Similar to the development of any statistical model, considerable judgment must be exercised in the construction of a composite indicator. The choice of the individual component measures, weighting of measures, the method used to aggregate measures, and the assignment of ranks are just a few of the potential sources of controversy. Decisions regarding these and other issues may substantially impact the evaluation of providers and may have important policy implications [15, 19, 20, 22]. Furthermore, each individual indicator reflects a different aspect of quality. Therefore, some information is invariably lost when only a single summary score is reported, particularly when measures are poorly correlated and nonredundant, as they were in our pilot studies.

Depending on how the composite is constructed, some important areas of performance may not be addressed or may be relatively undervalued, and aggregation may also obscure individual areas of strength or weakness. The ability to decompose the composite into its individual components is critical. This allows providers to analyze their performance in specific areas and to formulate improvement strategies.

Finally, recent studies have demonstrated that composite scores are associated with substantial random variation and may also be sensitive to factors such as the methods for weighting and aggregation [19, 20]. This emphasizes the need to correctly partition variability when calculating such scores, and to present them with appropriate measures of uncertainty.

To address the legitimate concern of whether a provider’s composite performance score would be influenced substantially by the choice of statistical methodology, the QMTF implemented and compared a large number of approaches. In addition to the methods presented here, we also considered several variations of the most common approaches. In general, inferences about a provider’s quality were robust and largely insensitive to the choice of methodology.

In constructing a composite scoring system, the goal of the QMTF was to use methodology that is scientifically rigorous and useful for third parties, as well as transparent, actionable, and acceptable to the cardiac surgery community. Although our goal was complete objectivity, we were often confronted by choices in which a correct decision could not be objectively determined. In such cases, detailed analyses were conducted to assess the empirical implications of our decision. In reporting these results, we have attempted to make the characteristics of the scoring system transparent to its users and to the surgeons who will be evaluated by it.

The QMTF approach to determining composite quality scores has several distinguishing features. First, performance on the individual component items is estimated using random-effects regression models. This approach, also known as shrinkage estimation, is particularly advantageous when some providers treat a small number of patients or the end points of interest are rare outcomes.

Second, to make inferences about each provider’s quality, we have estimated performance using a Bayesian framework. Unlike conventional approaches, such as the CMS methodology, the Bayesian framework makes it possible to compute true probability intervals and other measures of uncertainty for any quantity of interest. Although we exploited the Bayesian framework for this investigation, our approach is flexible, and other statistical methods, such as empirical Bayes, might also be adopted.

Third, we used all or none composite scoring for the two domains of quality that contained multiple measures. This sets a high benchmark standard, the "ideal" CABG hospitalization.

Finally, for both theoretical and empirical reasons, we rejected all methods that used "importance" weighting of various measures when calculating domain scores. Neither literature review nor expert opinion survey provided scientifically valid, consistent weight determinations. Rather than subjectively assigning weights, we elected to use the all/any or none approach with implicit equal weighting. For the overall composite measure, weighting was accomplished through rescaling.

Limitations
NQF process measures have been extensively vetted to establish their suitability for performance measurement. However, it is important to acknowledge some limitations inherent to these measures and, in fact, to all measures of process compliance. Determining a set of valid exclusion criteria for process measures has proven challenging in many areas of medicine, and information on contraindications to various process measures is relatively limited in the STS NCD. As a result, some patients who are appropriately denied a medication owing to a contraindication will be misclassified as representing a process failure. The true overall percentage compliance with process measures will therefore be underestimated because some ineligible patients are included. Because the NQF measure set does not account for most exclusions, the QMTF acknowledges that all programs will have less than perfect process compliance scores. However, unless the proportion of patients who might satisfy legitimate exclusion criteria varies substantially among programs, comparisons that focus on relative rather than absolute process compliance will be unbiased. As exclusion criteria become more standardized and are added to the STS NCD, future versions of the QMTF scoring methodology may be modified to more fully account for them.

Limitations also exist with respect to various outcome measures. Because certain outcomes are difficult to define precisely, it is possible that variation in coding practices could account for some of the observed differences between providers.

Future Directions
Although the QMTF composite quality score is described as "comprehensive," we do not presume that it captures every important aspect of quality. The goal was to combine the NQF measures in a statistically reasonable, practical, and intuitive fashion. The decision to create an overall composite score using four domains of care (two process domains and two outcomes domains) resulted from our commitment to use all the NQF-endorsed CABG measures that were captured in the STS NCD. The 11 relevant NQF measures are divided about equally into process and outcomes, and they appear to group naturally and appropriately into the domains we selected. The QMTF believes that these four domains assess different and not necessarily congruent aspects of the care process, all of which are important in a multidimensional quality construct.

There are multiple alternative approaches that could have been used to develop the composite score, or that may be the focus of subsequent research:

1 use of statistical methodologies (eg, factor analysis) to reduce the number of variables comprising the composite score;
2 construction of separate composite scores for processes and outcomes, assessing their degree of congruence, and studying each separately to determine their ability to discriminate overall performance;
3 consideration of new or revised NQF measures, or non-NQF measures, that provide important incremental information about overall care; and
4 ongoing assessment of the relationship between process measures and short- or long-term outcomes, with potential elimination of process measures that demonstrate limited clinical effectiveness.

Now that the initial methodology described in this report has been developed, subsequent modifications can be implemented in response to STS interests and external requests.

As the STS NCD evolves from a registry used primarily for research and for internal quality improvement to one that is also used for reporting to the public and to specific third parties such as payers, the need for aggressive audit and validation is correspondingly increased. This is currently a major national initiative of the STS NCD.

The management of missing data merits careful further consideration. The implications of such missing data increase when multiple outcomes are measured, and the missing-at-random assumption is less tenable. The consequences of restricting our pilot study to sites with less than 5% missing data were not specifically investigated, but this will be an important consideration when actual composite scores are publicly released. STS NCD policies may be modified to maximize provider data completeness for elements that comprise the composite score, and the most appropriate statistical methodologies for managing any remaining missing data will be implemented.

It will be instructive to monitor provider-specific scores longitudinally to assess their consistency, and to study which domains and individual measures appear most predictive of subsequent provider performance.

If composite scores are to be used for a rating system, choices must be made regarding the use of an internal or external benchmark, the number of rating tiers, the cut points used to define each tier, and the statistical criteria used to assign providers to tiers. The hypothetical three-tiered rating system described in this report is only one of many such potential systems, and the operating characteristics of alternative approaches should be investigated and compared.

Finally, research is needed to identify the optimum time window for estimating and reporting performance. In our analyses, performance was estimated based on a single year of data. When estimating performance, methods that incorporate data from multiple time points deserve consideration.


    Conclusion
This two-part report by the STS QMTF describes the development of a multidimensional CABG composite quality score that is scientifically rigorous, uses NQF-endorsed measures from the STS NCD, and is consistent with relevant national guidelines for performance measurement. The QMTF regards this as the first step in a process that will constantly evolve as new quality measures, statistical methodologies, and health care policy objectives are developed.


    Technical appendix
The modeling technique adopted by the Quality Measurement Task Force is multivariate random-effects logistic regression. The term multivariate means that all of the quality measures are analyzed together in a single model rather than estimated one at a time in separate models. Random-effects refers to the assumption that the provider-specific parameters of interest arise from a specified distribution whose parameters are also estimated in the modeling process. Throughout this appendix, the terms provider and site are used interchangeably to refer to The Society of Thoracic Surgeons (STS) database participants (ie, hospitals and cardiac surgery groups).


    1. Final Model for Estimating Composite Scores
The following data elements were used to estimate each provider’s final composite score:

1 Risk-adjusted operative mortality. The number of isolated coronary artery bypass grafting (CABG) patients (denominator) and the number of these patients who did not experience operative mortality (numerator). Note: Larger values of the numerator imply lower incidence of mortality.
2 Risk-adjusted any or none morbidity. The number of isolated CABG patients (denominator) and the number of these patients who did not experience any of the selected morbidity end points (numerator). Note: Larger values of the numerator imply lower incidence of morbidity.
3 Internal mammary artery usage. The number of isolated CABG patients who were eligible to receive an internal mammary artery (IMA) (denominator) and the number of these patients who actually received an IMA (numerator). Note: Larger values of the numerator imply more frequent IMA usage.
4 All or none medications. The number of isolated CABG patients who were eligible to receive at least one medication (denominator) and the number of these patients who received all of the medications for which they were eligible (numerator). Note: Larger values of the numerator imply more frequent use of all recommended medications.

To ensure consistent directionality between process measures and risk-adjusted outcomes, the numerators of the risk-adjusted outcomes measures are defined as the number of patients who avoid the adverse end point. Thus, for both process and outcomes, larger values of the numerator are favorable.


    Summary Measures of Case Mix
In addition to the numerators and denominators listed, two summary measures of each site’s case mix were also incorporated into the final multivariate random effects logistic model to risk-adjust the mortality and morbidity end points. The summary measures are:

i the average predicted risk of mortality assigned to patients at each site by the existing STS CABG mortality model; and
ii the average predicted risk of "mortality or major morbidity" assigned to patients at each site by the existing STS CABG mortality/major morbidity model.

The development and validation of STS risk models is described in Shroyer and colleagues [8]. The STS CABG mortality model was updated in 2004 and has not been published. (It is available as an online STS risk calculator at www.sts.org.) Although the end point predicted by the STS mortality/major morbidity model is not identical to the any or none morbidity end point chosen by the QMTF, the STS model still provides a useful summary measure of case mix. Because mortality is a relatively rare end point, the risk factors that predict the combined end point of "mortality or major morbidity" are essentially identical to the risk factors that predict our any or none morbidity end point. In light of this similarity, we used the existing STS mortality/major morbidity risk model to calculate risk scores for risk-adjusting the any or none morbidity end point.

For each end point, the formula for calculating a patient’s predicted risk of the end point has the form:


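$$\text{predicted risk} = \frac{\exp(b_0 + b_1 x_1 + \cdots + b_q x_q)}{1 + \exp(b_0 + b_1 x_1 + \cdots + b_q x_q)}$$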

where $x_1, x_2, \ldots, x_q$ denote patient risk factors (eg, quantitative variables such as age, and comorbidities coded as 1 = present, 0 = absent); and $b_0, b_1, \ldots, b_q$ denote regression coefficients (constants) that were determined previously (see Shroyer and colleagues [8]). Risk estimates were calculated for each patient and then averaged within providers to obtain the provider-specific average risk. A logit transformation was applied to the site-specific average risk estimates before including them in the model to express them on a scale that is not constrained by a maximum of 100% or a minimum of 0%. Thus, the final summary measures have the form $z = \log[p/(1 - p)]$, where p denotes the site-specific average risk of the end point.
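The case-mix summary described above can be sketched as follows, assuming the risk model has the usual logistic-regression form. The coefficients and patient records are made up for illustration and are not STS model values; the function names are likewise hypothetical.

```python
import math

def predicted_risk(x, b0, b):
    """Risk for one patient under an assumed logistic model:
    1 / (1 + exp(-(b0 + sum_k b_k * x_k)))."""
    eta = b0 + sum(bk * xk for bk, xk in zip(b, x))
    return 1.0 / (1.0 + math.exp(-eta))

def site_case_mix_summary(patients, b0, b):
    """Average the patient-level risks at one site, then take the logit."""
    p = sum(predicted_risk(x, b0, b) for x in patients) / len(patients)
    return math.log(p / (1.0 - p))

# Hypothetical site with three patients; x = (age in decades, diabetes 0/1).
b0, b = -6.0, (0.4, 0.5)
patients = [(6.5, 0), (7.2, 1), (5.8, 0)]
print(site_case_mix_summary(patients, b0, b))
```

Note that the logit is taken after averaging, so z is the logit of the mean risk and not the mean of the patient-level logits.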


    1.1 Statistical Model
Let $\pi_{mj}$ denote the true site-specific success probability at site j (j = 1, 2, ..., J) for measure m, where m = 1 denotes avoidance of operative mortality; m = 2 denotes avoidance of all five morbidities; m = 3 denotes IMA usage; and m = 4 denotes use of all eligible medications. Let $n_{mj}$ denote the number of patients who were eligible to be included in the denominator for measure m at site j, and let $Y_{mj}$ denote the number of successful outcomes (numerator). Conditional on $\pi_{mj}$, the observed numerator is assumed to arise from a binomial distribution with probability parameter $\pi_{mj}$. That is: $Y_{mj} \mid \pi_{mj} \sim \text{Binomial}(n_{mj}, \pi_{mj})$.

A probability model for all four outcomes $Y_{1j}, \ldots, Y_{4j}$ is obtained by assuming that each binomial outcome is conditionally independent given $(\pi_{1j}, \ldots, \pi_{4j})$. Let $Y_j = (Y_{1j}, \ldots, Y_{4j})$ denote the collection of outcomes for the jth participant. The likelihood for site j is given by

$$\Pr(Y_j = y_j \mid \pi_{1j}, \ldots, \pi_{4j}) = \prod_{m=1}^{4} \binom{n_{mj}}{y_{mj}} \pi_{mj}^{y_{mj}} (1 - \pi_{mj})^{n_{mj} - y_{mj}} \qquad (1)$$

where $y_j = (y_{1j}, \ldots, y_{4j})$. Thus, conditional on the provider’s true probability parameters $(\pi_{1j}, \ldots, \pi_{4j})$, the observed data are assumed to consist of four independent binomially distributed random variables. The assumption of conditional independence is likely to be violated in practice but is made to simplify the computation. Although the model assumes conditional independence between $(Y_{1j}, \ldots, Y_{4j})$, it does not assume marginal independence between these variables, because the underlying binomial parameters $(\pi_{1j}, \ldots, \pi_{4j})$ are assumed to arise from a random distribution with parameters that allow for inter-item correlation.

To express probabilities on a linear scale that is not constrained by a maximum of 100% or a minimum of 0%, the probability parameters are converted to odds parameters, and we model the logarithm of the odds. Let $\theta_{mj} = \pi_{mj}/(1 - \pi_{mj})$ denote the odds of success for measure m at site j. The odds is interpreted as the probability of a successful outcome divided by the probability of an unsuccessful outcome. If the odds can be estimated, then it can be converted to a probability, because $\pi_{mj} = \theta_{mj}/(1 + \theta_{mj})$. Similar to $\pi_{mj}$, larger values of $\theta_{mj}$ imply a higher probability of a successful outcome. We focus on the logarithm of the odds parameters, $\log \theta_{mj}$, because this quantity ranges from negative infinity to infinity (ie, no boundary constraints).

A fundamental assumption of the multivariate hierarchical logistic model is that the parameters $(\log\theta_{1j}, \log\theta_{2j}, \log\theta_{3j}, \log\theta_{4j})$ are distributed according to a multivariate normal distribution. Correlation among performance on different end points is reflected in the covariance parameters of the multivariate normal distribution. We further assume that a provider’s performance on any single end point is described by a logistic regression model. The latter assumption is embodied by the set of equations:


Formula

where ($\alpha_1$, $\alpha_2$, $\alpha_3$, and $\alpha_4$) denote intercept parameters that determine the overall frequency of success for the four measures; ($\varepsilon_{1j}$, $\varepsilon_{2j}$, $\varepsilon_{3j}$, and $\varepsilon_{4j}$) are normally distributed error terms that determine the extent to which the jth site deviates from the average; $z_{1j}$ denotes the logit of the average predicted risk of mortality at site j, as determined by the STS mortality model (described above); $z_{2j}$ denotes the logit of the average predicted risk of "mortality or major morbidity," as determined by the STS composite end point model (described above); and ($\beta_1$ and $\beta_2$) denote regression coefficients to be estimated from the data. The terms $\beta_1 z_{1j}$ and $\beta_2 z_{2j}$ are included to incorporate risk-adjustment into the analysis of the mortality and morbidity end points. No assumptions are made about the covariance parameters of the multivariate normal distribution. An equivalent specification of the model is:


Formula

where ($\sigma_{11}$, $\sigma_{12}$, ..., $\sigma_{44}$) denote unknown parameters of the multivariate normal covariance matrix. These unknown covariance parameters are estimated from the data along with the unknown $\alpha$s and $\beta$s.
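To make the structure of this model concrete, the following sketch simulates data from one reading of it. Several points are assumptions rather than the QMTF specification: the risk term is subtracted, so that with α = 0, β = 1, and ε = 0 the success probability equals one minus the predicted risk (in line with the motivation given later for centering the prior for β at 1); and every numeric value, including the covariance matrix, is made up.

```python
import numpy as np

def success_probabilities(alpha, beta, z, eps):
    """Success probabilities pi[j, m] for J sites and 4 measures.

    ASSUMED form: log-odds = alpha_m - beta_m * z_mj + eps_mj for m = 1, 2
    (risk-adjusted outcomes) and alpha_m + eps_mj for m = 3, 4 (processes).
    """
    log_odds = alpha + eps                      # broadcasts to shape (J, 4)
    log_odds[:, :2] = log_odds[:, :2] - beta * z
    return 1.0 / (1.0 + np.exp(-log_odds))

rng = np.random.default_rng(1)
J = 50
alpha = np.array([0.0, 0.0, 2.5, 0.0])          # hypothetical intercepts
beta = np.array([1.0, 1.0])                     # hypothetical slopes

# Hypothetical site-average predicted risks and their logits (z1j, z2j).
p_hat = np.column_stack([rng.uniform(0.01, 0.04, J), rng.uniform(0.10, 0.20, J)])
z = np.log(p_hat / (1.0 - p_hat))

# Hypothetical covariance: variance 0.06 per item, covariance 0.01 between items.
Sigma = 0.05 * np.eye(4) + 0.01
eps = rng.multivariate_normal(np.zeros(4), Sigma, size=J)

n = np.full((J, 4), 300)                        # denominators
pi = success_probabilities(alpha, beta, z, eps)
y = rng.binomial(n, pi)                         # numerators, independent given pi
print(y[:3])
```

The numerators are drawn independently given `pi`, yet they are correlated across measures within a site because the random effects share a covariance matrix.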


    1.2 Definition of Risk-Standardized Rates
In the case of risk-adjusted outcomes measures, $\pi_{mj}$ is not a meaningful reflection of a site’s quality because it partly reflects the site’s case mix. Interest instead focuses on estimating each provider’s "risk-standardized success rate," denoted by $\pi'_{mj}$. Because there is no single widely accepted definition of the risk-standardized success rate, the QMTF considered several options, and chose the following:


Formula 2

(2)
where $z_m$ denotes the value of $z_{mj}$ for an "average" provider. The risk-standardized success rate $\pi'_{mj}$ is loosely interpreted as the success rate for measure m that would be projected to occur hypothetically if provider j had a "typical" case mix.
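One way to read this definition is that the provider keeps its own random effect while its case-mix summary is replaced by the value for an average provider. The sketch below follows that reading; the sign of the risk term and all numeric inputs are assumptions, chosen so that β = 1 with α = 0 and ε = 0 reproduces one minus the predicted risk.

```python
import math

def risk_standardized_rate(alpha_m, beta_m, z_avg_m, eps_mj):
    """Projected success rate for provider j at the 'average' case mix.

    ASSUMED form: inverse logit of (alpha_m - beta_m * z_avg_m + eps_mj).
    """
    log_odds = alpha_m - beta_m * z_avg_m + eps_mj
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical inputs: average predicted mortality risk of 2%.
z_avg = math.log(0.02 / 0.98)
for eps in (-0.3, 0.0, 0.3):
    print(eps, round(risk_standardized_rate(0.0, 1.0, z_avg, eps), 4))
```

Two providers with the same random effect receive the same risk-standardized rate regardless of how sick their own patients were, which is the point of standardizing.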


    1.3 Definition of True Composite Score
The final composite score was defined as the quantity


Formula

where $c_1$ = 0.5, $c_2$ = 4.2, $c_3$ = 5.8, and $c_4$ = 14.3. These constants were chosen such that $c_m$ is approximately equal to the standard deviation (across providers) of the corresponding parameter, $\pi'_{mj}$ or $\pi_{mj}$. Larger values of the composite score imply better performance.
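A minimal sketch of this calculation follows, assuming the composite is the sum of the four success rates, each expressed in percent and divided by its constant. The constants are those given in the text; the function name and the example rates are hypothetical.

```python
C = (0.5, 4.2, 5.8, 14.3)  # scaling constants c_1..c_4 from the text

def composite_score(rates_percent, c=C):
    """Sum of success rates (in percent), each divided by its scaling constant.

    ASSUMPTION: rates are on the percent scale, matching the units of c_m.
    """
    return sum(rate / cm for rate, cm in zip(rates_percent, c))

# Hypothetical provider: survival, morbidity-free, IMA use, all medications.
baseline = composite_score((97.5, 86.0, 92.0, 50.0))
improved = composite_score((98.5, 86.0, 92.0, 50.0))
print(improved - baseline)
```

Because $c_1$ is the smallest constant, one extra percentage point of survival raises the score by 1/0.5 = 2 units, more than the same change on any other item.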


    1.4 Method of Estimation
The quantities $\pi'_{1j}$, $\pi'_{2j}$, $\pi_{3j}$, $\pi_{4j}$ were estimated in a Bayesian framework by specifying a diffuse normal prior for $\alpha_1$ and $\alpha_2$; an informative normal prior distribution for $\beta_1$, $\beta_2$, $\beta_3$, and $\beta_4$; and a diffuse Wishart prior for the distribution of $T = \Sigma^{-1}$, where $\Sigma = (\sigma_{11}, \sigma_{12}, \ldots, \sigma_{44})$ denotes the covariance matrix of the random effects distribution. Specifically:


Formula



Formula



Formula

where N(a, b) denotes a normal distribution with mean a and variance b; and $\text{Wishart}_{\nu}(R)$ denotes a Wishart distribution with $\nu$ degrees of freedom and scale matrix R. The Wishart distribution is parameterized such that $\sum_{i=1}^{\nu} z_i z_i' \sim \text{Wishart}_{\nu}(R)$ if $z_i \overset{\text{iid}}{\sim} N(0, R)$, with R denoting the covariance matrix of the multivariate normal distribution of the $z_i$.
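The parameterization stated above can be checked numerically by building a Wishart draw as a sum of outer products. The degrees of freedom and scale matrix used here are placeholders, not the values chosen by the QMTF, and the function name is illustrative.

```python
import numpy as np

def wishart_draw(nu, R, rng):
    """One draw built as sum_{i=1}^{nu} z_i z_i', with z_i ~ iid N(0, R)."""
    z = rng.multivariate_normal(np.zeros(R.shape[0]), R, size=nu)  # shape (nu, p)
    return z.T @ z                                                 # sum of outer products

rng = np.random.default_rng(0)
nu, R = 6, 0.5 * np.eye(4)                     # placeholder values
draws = np.stack([wishart_draw(nu, R, rng) for _ in range(2000)])
mean_T = draws.mean(axis=0)

# Under this parameterization the expected value is nu * R.
print(np.round(mean_T, 2))
```

Under this convention the prior mean of T is $\nu R$, which is worth keeping in mind because some software parameterizes the Wishart distribution by the inverse of the scale matrix.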

The chosen scale matrix implies that the prior mean of the correlation between two random effects from the same site is equal to 0.05, that is, $E[\mathrm{corr}(\varepsilon_{mj}, \varepsilon_{m'j})] = 0.05$. According to the prior distribution, there is also 50% prior probability that $\mathrm{corr}(\varepsilon_{mj}, \varepsilon_{m'j})$ lies in the interval $(-0.70, 0.70)$. The parameters $\{\alpha_m\}$, $\{\beta_m\}$, and $T$ were assumed to be mutually independent in the prior distribution.

The $N(1, 1)$ prior for $\beta_m$ is motivated by the fact that $\beta_m = 1$ by definition under the assumption that each provider’s true event rate is exactly equal to the rate predicted by the STS risk model; hence, we chose our prior mean to be 1.0. In reality, we do not believe $\beta_m = 1$. (Owing to site-level variation in performance, we do not believe the STS risk model will exactly predict each site’s true event rate.) The prior variance of 1.0 was chosen to allow for uncertainty regarding the true value of $\beta_m$. Although larger values of the variance might be considered desirable (because larger variance implies greater uncertainty about $\beta_m$), we encountered computational difficulties (slow mixing) with larger variance.


    2. Models for Combining Items Within a Domain (Reported but not Selected)
Although the QMTF ultimately chose to combine items within a domain by using the all or none method, a variety of other model-based approaches were considered. These included: (i) fitting a multivariate random effects model and then averaging the item-specific estimates; and (ii) fitting a latent trait logistic model similar to the one described by Landrum and colleagues [17]. Each of these two modeling strategies was applied separately to the perioperative medication domain (four items analyzed simultaneously: preoperative β-blockers, discharge β-blockers, discharge antiplatelets, and discharge antilipids) and the major morbidity domain (five items analyzed simultaneously: prolonged ventilation, sternal infection, stroke, renal insufficiency, and reoperation). Altogether we fit four models (2 types of models × 2 domains).

To describe these models, let $M$ denote the total number of measures considered in a single model ($M = 4$ for medication models; $M = 5$ for morbidity models); and let $J$ (= 530) denote the number of providers. Using the notation of Part 1, let $n_{mj}$, $y_{mj}$, $\pi_{mj}$, and $\theta_{mj}$ denote the number of eligible patients (denominator), the number of successful results (numerator), the true success probability, and the odds of success, respectively, for measure $m$ ($m = 1, 2, \ldots, M$) and site $j$ ($j = 1, 2, \ldots, J$). The probability model for site $j$’s data is given by equation (1) above. Each of the four models described below was estimated in a Bayesian framework using vague proper priors for the distribution of model parameters.

Model A1. Multivariate random effects model for analyzing the four NQF medication measures
(M = 4.) Model:


Formula

where $(\varepsilon_{1j}, \ldots, \varepsilon_{4j})$ are distributed according to a multivariate normal distribution having mean vector zero and an unstructured covariance matrix.

Model A2. Multivariate random effects model for analyzing the five NQF morbidity measures
(M = 5.) Model:


Formula

where $z_{mj}$ denotes a summary measure of site $j$’s case mix used for risk-adjusting measure $m$ (defined below); and $(\varepsilon_{1j}, \ldots, \varepsilon_{5j})$ are distributed according to a multivariate normal distribution with mean vector zero and an unstructured covariance matrix.
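The displayed equations for Models A1 and A2 did not survive in this copy. As a hedged sketch only, one form consistent with the verbal descriptions (site-level random effects for every measure, plus a case-mix term in the morbidity model) is shown below. The logit link, the intercepts written as alpha_m, and the slope written as beta_m are assumptions borrowed from the notation used elsewhere in this appendix, not a transcription of the original formulas.

```latex
% Model A1 (assumed form): medication measures, no risk adjustment
\operatorname{logit}(\pi_{mj}) = \alpha_m + \varepsilon_{mj},
  \qquad m = 1, \ldots, 4

% Model A2 (assumed form): morbidity measures, risk-adjusted through z_{mj}
\operatorname{logit}(\pi_{mj}) = \alpha_m + \beta_m z_{mj} + \varepsilon_{mj},
  \qquad m = 1, \ldots, 5

% In both cases the site-level random effects are jointly normal
(\varepsilon_{1j}, \ldots, \varepsilon_{Mj})' \sim N_M(0, \Sigma),
  \qquad \Sigma \text{ unstructured}
```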

The summary measures of case mix $(z_{1j}, \ldots, z_{5j})$ were calculated from previously validated STS risk models, using the approach described in Section 1 of the Appendix. Separate STS risk models exist for each of the five NQF-endorsed morbidity end points (Shroyer and colleagues [8]). The quantity $z_{mj}$ is defined as the logit of the average predicted risk of end point $m$ at site $j$.
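The case-mix summary defined above is straightforward to compute once patient-level predicted risks are available. The sketch below is illustrative only: the function name `case_mix_summary` and the example risks are hypothetical, and the predicted risks are assumed to come from the relevant STS risk model for a single end point at a single site.

```python
import math


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def case_mix_summary(predicted_risks):
    """Logit of the average predicted risk for one end point at one site.

    predicted_risks: patient-level predicted probabilities (each in (0, 1)).
    """
    if not predicted_risks:
        raise ValueError("site has no eligible patients")
    average_risk = sum(predicted_risks) / len(predicted_risks)
    return logit(average_risk)


if __name__ == "__main__":
    # Hypothetical predicted stroke risks for three patients at one site
    print(case_mix_summary([0.01, 0.02, 0.06]))
```

The risks are averaged first and the logit is taken afterward; this is generally not the same as averaging the patient-level logits.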

Model A3. Latent trait logistic model for analyzing the four NQF medication measures
(M = 4.) Model:


Formula

where each $Q_j$ is independently distributed according to a normal distribution with mean zero and unit variance, ie, $Q_j \overset{\text{iid}}{\sim} N(0, 1)$; and for identifiability we assume that $\gamma_1 > 1$. This model is an application of the latent trait logistic model described by Landrum and colleagues [17]. In this model, $Q_j$ represents the "latent quality" of the $j$th site.

Model A4. Latent trait logistic model for analyzing the five NQF morbidity measures
(M = 5.) Model:


Formula

where $z_{mj}$ denotes a summary measure of site $j$’s case mix used for risk-adjusting measure $m$ (defined above); each $Q_j$ is independently distributed according to a normal distribution with mean zero and unit variance, ie, $Q_j \overset{\text{iid}}{\sim} N(0, 1)$; and for identifiability, we assume that $\gamma_1 > 1$. This model is a slight generalization of the latent trait logistic model described by Landrum and colleagues [17]. The model they described did not include the terms $\beta_m z_{mj}$ that allow for risk adjustment.
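The displayed equations for Models A3 and A4 are also missing from this copy. A hedged sketch of a form consistent with the text follows: a single site-level latent quality with a measure-specific loading, and, for the morbidity model, the added risk-adjustment term. The logit link and the intercept symbol alpha_m are assumptions; the loading gamma_m and the slope beta_m are the symbols named in the surrounding text. The identifiability constraint on gamma_1 is the one stated in the text and is not repeated here.

```latex
% Model A3 (assumed form): medication measures
\operatorname{logit}(\pi_{mj}) = \alpha_m + \gamma_m Q_j,
  \qquad m = 1, \ldots, 4

% Model A4 (assumed form): morbidity measures with risk adjustment
\operatorname{logit}(\pi_{mj}) = \alpha_m + \beta_m z_{mj} + \gamma_m Q_j,
  \qquad m = 1, \ldots, 5

% Latent quality of site j
Q_j \overset{\text{iid}}{\sim} N(0, 1), \qquad j = 1, \ldots, J
```

Under this reading, all between-site variation in the M measures is driven by the single quantity Q_j, which is a stronger assumption than the unstructured covariance matrix of Models A1 and A2.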

Use of Bayesian p-Values to Test Fit of Latent Trait Logistic Models
The Bayesian p-value is the probability that a hypothetical replicated data set, $y^{\mathrm{rep}}$, would diverge from the true model at least as much as the observed data set, $y$, diverges from the true model. Divergence between the model and the data was defined by the quantity


Formula

where $\pi$ denotes the collection of all of the parameters $\pi_{mj}$. The Bayesian p-value was defined as


Formula

where $I$ is the indicator function; $p(y^{\mathrm{rep}} \mid \pi)$ is the probability density function for a hypothetical replicated data set conditional on the model parameters (defined by equation 1, above); and $p(\pi \mid y)$ is the posterior distribution of the model parameters given the observed data. The probability of interest is taken over the joint distribution $p(y^{\mathrm{rep}}, \pi \mid y)$. To calculate the probability of interest, the integral


Formula (3)

was approximated as $\Pr[\chi^2_{M \times J} \ge D(y, \pi) \mid \pi, y]$, where $\chi^2_d$ denotes a $\chi^2$ random variable with $d$ degrees of freedom. The final approximate p-value was calculated as


Formula

where $\pi^{(1)}, \pi^{(2)}, \ldots, \pi^{(N)}$ denote $N$ draws from a Markov chain Monte Carlo simulation with target distribution $p(\pi \mid y)$.
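The averaging step described above can be sketched in a few lines. Two things in this sketch are assumptions and not taken from the article: the divergence is taken to be a Pearson chi-square discrepancy (the original formula for the divergence is not recoverable here), and the chi-square tail probability is computed with the Wilson-Hilferty normal approximation so that the example needs only the standard library. All function names are hypothetical.

```python
import math


def chi2_sf_wilson_hilferty(x, d):
    """Approximate Pr[chi2_d >= x] using the Wilson-Hilferty cube-root
    normal approximation (reasonable when d is large, as with M x J)."""
    if x <= 0:
        return 1.0
    t = ((x / d) ** (1.0 / 3.0) - (1.0 - 2.0 / (9.0 * d))) / math.sqrt(2.0 / (9.0 * d))
    return 0.5 * math.erfc(t / math.sqrt(2.0))


def pearson_discrepancy(y, n, pi):
    """ASSUMED form of the divergence: Pearson chi-square summed over
    measures m (outer lists) and sites j (inner lists)."""
    total = 0.0
    for y_m, n_m, pi_m in zip(y, n, pi):
        for y_mj, n_mj, p in zip(y_m, n_m, pi_m):
            total += (y_mj - n_mj * p) ** 2 / (n_mj * p * (1.0 - p))
    return total


def approx_bayesian_p_value(y, n, pi_draws):
    """Average the chi-square tail probability over posterior draws of pi.

    y, n:      M x J nested lists of numerators and denominators.
    pi_draws:  list of N posterior draws, each an M x J nested list.
    """
    M, J = len(y), len(y[0])
    tails = [
        chi2_sf_wilson_hilferty(pearson_discrepancy(y, n, pi), M * J)
        for pi in pi_draws
    ]
    return sum(tails) / len(tails)


if __name__ == "__main__":
    # Toy data: M = 1 measure, J = 2 sites, N = 2 hypothetical posterior draws
    y = [[9, 2]]
    n = [[10, 20]]
    pi_draws = [[[0.5, 0.5]], [[0.6, 0.4]]]
    print(approx_bayesian_p_value(y, n, pi_draws))
```

A p-value near 0 indicates that the observed data diverge from the fitted model more than replicated data typically would; values near 0.5 indicate no evidence of misfit under this discrepancy.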


    3. Models for Other Results Reported
Other univariate and multivariate random effects models were used to derive results reported in this article. The histograms in Figure 1 and the correlations in Table 2 were derived from a multivariate random effects model that was identical to Model A1 (described above), except that it included the IMA end point in addition to the medication measures. Figure 2 and Table 3 were derived from a multivariate random effects model that was identical to Model A2 (described above), except that it included the mortality end point in addition to the morbidity measures.

The final multivariate model for estimating composite performance (described in Part 1 of the Appendix) involves four end points: operative mortality, any or none morbidity, IMA usage, and all or none medications. In addition to analyzing these end points simultaneously in a single multivariate model, we also analyzed these end points one at a time by fitting four separate univariate random effects models. The method of incorporating risk adjustment was identical to the method described for the multivariate model. These univariate analyses were used to produce the top panel of Figure 3 (all or none medication usage); the bottom panel of Figure 3 (any or none morbidity end point); and to count the number of sites that would be identified as outliers if performance was estimated based on operative mortality alone (last paragraph of Performance Tier Determination section).


    Footnotes
† Dr Shahian is the Quality Measurement Task Force Chair and Writing Group Leader.


    References

  1. Institute of Medicine. Performance measurement: accelerating improvement. Washington, DC: The National Academies Press; 2006.
  2. Shahian DM, Edwards FH, Ferraris VA, et al. Quality measurement in adult cardiac surgery: Part 1—Conceptual framework and measure selection. Ann Thorac Surg 2007;83:S3-S12.
  3. Normand S-LT, Glickman ME, Gatsonis CA. Statistical methods for profiling providers of medical care: issues and applications. J Am Stat Assoc 1997;92:803-814.
  4. Shahian DM, Normand SL, Torchiana DF, et al. Cardiac surgery report cards: comprehensive review and statistical critique. Ann Thorac Surg 2001;72:2155-2168.
  5. Christiansen CL, Morris CN. Improving the statistical approach to health care provider profiling. Ann Intern Med 1997;127:764-768.
  6. Goldstein H, Spiegelhalter DJ. League tables and their limitations: statistical issues in comparisons of institutional performance. J R Stat Soc (series A) 1996;159:385-443.
  7. Leyland AH, Goldstein H. Multilevel modelling of health statistics. Chichester, UK: John Wiley and Sons, Ltd; 2001.
  8. Shroyer AL, Coombs LP, Peterson ED, et al. The Society of Thoracic Surgeons: 30-day operative mortality and morbidity risk models. Ann Thorac Surg 2003;75:1856-1864.
  9. Spiegelhalter DJ, Abrams KR, Myles JP. Bayesian approaches to clinical trials and health-care evaluation. West Sussex, UK: John Wiley and Sons, Ltd; 2004.
  10. Carlin BP, Louis TA. Bayes and empirical Bayes methods for data analysis. Boca Raton, FL: Chapman & Hall/CRC Press; 2000.
  11. Gelman A, Carlin JB, Stern HS, Rubin DB. Bayesian data analysis. Boca Raton, FL: Chapman & Hall/CRC Press; 2004.
  12. Lied T, Malsbary R, Eisenberg C, Ranck J. Combining HEDIS indicators: a new approach to measuring plan performance. Health Care Financ Rev 2002;23:117-129.
  13. Zaslavsky A, Shaul J, Zaborski L, Cioffi M, Cleary P. Combining health plan performance indicators into simpler composite measures. Health Care Financ Rev 2002;23:101-115.
  14. Bethell C, Reuland C, Halfon N, Schor E. Measuring the quality of preventive and developmental services for young children: national estimates and patterns of clinicians’ performance. Pediatrics 2004;113:1973-1983.
  15. Nardo M, Saisana M, Saltelli A, Tarantola S, Hoffman A, Giovannini E. Handbook on constructing composite indicators: methodology and user guide. Organization for Economic Co-operation and Development (OECD) Statistics Working Paper JT00188147, STD/DOC; 2005:3.
  16. Embretson SE, Reise SP. Item response theory for psychologists. Mahwah, NJ: Lawrence Erlbaum Associates; 2000.
  17. Landrum MB, Bronskill SE, Normand S-LT. Analytical methods for constructing cross-sectional profiles of health care providers. Health Serv Outcomes Res Methodol 2000;1:23-47.
  18. Landrum MB, Normand S-LT, Rosenheck RA. Selection of related multivariate means: monitoring psychiatric care in the Department of Veterans Affairs. J Am Stat Assoc 2003;98:7-16.
  19. Jacobs R, Goddard M, Smith PC. Are composite measures a robust reflection of performance in the public sector? York, UK: Centre for Health Economics Working Paper 016, University of York; 2006.
  20. Jacobs R, Goddard M, Smith P. How robust are hospital ranks based on composite performance measures? Med Care 2005;43:1177-1184.
  21. Snelling I. Do star ratings really reflect hospital performance? J Health Organ Manag 2003;17:210-223.
  22. Booysen F. An overview and evaluation of composite indices of development. Social Indicators Research 2002;59:115-151.
  23. Available at: www.premierinc.com/all/quality/hqi/resources/september-scoring-overview-september.pdf. Accessed Jul 30, 2006.
  24. Nolan T, Berwick DM. All-or-none measurement raises the bar on performance. JAMA 2006;295:1168-1170.
  25. Skrondal A, Rabe-Hesketh S. Generalized latent variable modeling. Boca Raton, FL: Chapman & Hall/CRC; 2004.




