Ann Thorac Surg 2004;77:2232-2237
© 2004 The Society of Thoracic Surgeons
a Department of Statistical Science, University College London, London, United Kingdom
b Medical Statistics Unit, Research and Development Directorate, University College London Hospitals NHS Trust, London, United Kingdom
c MRC Clinical Trials Unit London, London, United Kingdom
d Department of Cardiac Surgery, British Heart Foundation, Hammersmith Hospital, Imperial College London Faculty of Medicine, London, United Kingdom
* Address reprint requests to Dr Taylor, Department of Cardiac Surgery, Faculty of Medicine, Imperial College, Hammersmith Hospital, Du Cane Rd, London W12 0NN, UK.
e-mail: k.m.taylor{at}imperial.ac.uk
Abstract
Introduction
See page 1960
Fitting a statistical model using computer software is relatively straightforward. However, it is just one part of the modeling process and may not produce a model that is reliable and valid for clinical use. There are several interlinked steps in a modeling process and each of these needs to be performed in a systematic manner. For example, lack of clarity in the specification of the clinical aims of the model may result in the development of a model that is not useful in clinical practice. There are many factors associated with the risk of mortality and deciding which factors to include in a model is difficult. Several strategies are available to select risk factors [22] and these may produce risk models with different sets of risk factors and different estimates of the effect of a given factor for the same patient sample. Some estimates will be more reliable than others. The presence of missing data may pose additional problems. The performance of a model should be evaluated in the light of the specified aims [23].
The objective of this study is to review the methodology used in risk modeling in cardiac surgery for mortality, highlight the weaknesses and provide some suggestions for improvement.
Material and methods
We excluded articles where the main purpose was to use a statistical model for adjustment for case mix and not the presentation of a risk model for prediction of short-term mortality. Articles where the primary aim was to evaluate the performance of risk models proposed by other researchers were also excluded. We developed a checklist for the review to enable us to make an objective assessment of the methodology. This was based on what we consider to be the minimum criteria for a systematic approach to risk modeling as described in Figures 1 and 2. The modeling aspects for each of the reports were independently reviewed by two of the authors (RO and GA). Where there was a discrepancy between the two reviewers the reports were reexamined by a third reviewer (JE) and the issues were resolved through discussion.
Results and comment
Model development
Clinical aim of the model
Stating clear clinical aims for the model is a vital part of the modeling process as it may influence the number and choice of risk factors included in the model (see below) and determines how the model will be used in practice. Examples include whether the model will be used to compare surgical performances across institutions in the same country, between different countries, for patient advice or treatment management, or for all of these purposes [24]. Four articles do not specifically state any clinical aims for the risk model. The most frequently stated aim was the facilitation of fairer comparisons between institutions and surgeons to assess quality of care. Improving information available to clinicians and patients for patient advice was also mentioned.
Choice of potential risk factors for mortality
It is important that developers prepare a list of risk factors according to the specified clinical aim of the model. There needs to be a balance between the degree of complexity required for the model in terms of selection of risk factors and what can be measured in practice. For example, if the aim is institutional comparisons, a model developed to predict the overall number of deaths in an institution would probably be adequate and it may not need to be as complex as a model where the aim is patient counseling. For the latter, one may wish to predict deaths within mortality-based clinical risk groups and more preoperative risk factors may need to be considered. If the aim is to compare the performance of surgeons within an institution, and it may be that experienced surgeons operate on more seriously ill patients, a more detailed representation of the clinical risk factors in the model is necessary. If the model's intended use is to compare performances between countries/states of a specific region that share similar demographics, socioeconomic characteristics, and health care systems (such as Western Europe/USA), it will be applied to a very large group of patients. Because the differences in patient characteristics may average out, a model with a few key risk factors that could be measured reliably and easily in a standardized manner may be adequate for this purpose. A complex model may be impractical and unnecessary.
Of course a single complicated model may satisfy all of these roles but it could prove to be difficult to develop or use in practice, particularly if information on all risk factors in the model is not collected in many of the institutions for which the model is intended. Risk factors that are rarely measured in clinical practice should only be included if the model is intended for use in specific institutions where information is collected on these factors. It is not clear whether any of the model developers selected potential risk factors with reference to their stated aims. Furthermore, the developers did not always provide the number or a list of risk factors initially chosen for examination in the model. This information should be presented as it influences the required size of the model development data set (see next section).
Size of model development data
Model developers should clearly state the number of patients (before and after exclusions) and the percentage overall mortality or the number of deaths and ensure that an adequate number of patients are available to carry out the modeling process. As a rough rule for calculating the sample size required for the developmental data, the number of deaths observed in the data should be at least ten times the number of risk factors examined for inclusion in the model [25, 26]. A risk factor with k categories contributes k − 1 terms to the model. Therefore, to contemplate a 15 risk factor model, assuming an in-hospital or 30-day mortality rate of 3%, one would require at least 5000 patients (150 deaths) to achieve an adequate sample size. Otherwise spurious associations between the risk factors and outcome may be obtained, or the estimates of association may lack precision [25, 26]. Using this criterion, the developmental data were found to be of inadequate size for seven articles and five reports did not provide sufficient details for assessment.
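The rule of thumb above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function names and the default of ten deaths per model term are assumptions taken from the rough rule in the text, not a prescribed procedure.

```python
import math
from fractions import Fraction


def model_terms(categories_per_factor):
    """A risk factor with k categories contributes k - 1 terms to the model."""
    return sum(k - 1 for k in categories_per_factor)


def required_patients(n_terms, mortality_rate, deaths_per_term=10):
    """Smallest sample expected to yield deaths_per_term deaths per model term.

    The rate is converted through its decimal string so that 0.03 is treated
    as exactly 3/100 and the ceiling is not disturbed by floating point error.
    """
    rate = Fraction(str(mortality_rate))
    if n_terms <= 0 or not 0 < rate <= 1:
        raise ValueError("need at least one term and a rate in (0, 1]")
    required_deaths = deaths_per_term * n_terms
    return math.ceil(required_deaths / rate)


# Fifteen binary risk factors give 15 terms; at 3% mortality this needs
# 150 deaths, that is 5000 patients.
binary_factors = [2] * 15
print(required_patients(model_terms(binary_factors), 0.03))
```

Note that a factor with several categories raises the requirement: replacing one binary factor with a four-category factor adds two terms and therefore 20 further deaths under this rule.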
Handling of missing values
Model developers should clearly state the extent of missing observations in their data and how these were dealt with in the modeling process. Before fitting the model it is important to determine if there are systematic differences in the characteristics of patients with missing risk factors as this could introduce bias. If the extent of missing data is large, one should consider appropriate substitution of missing values and examine whether the results remain consistent. A large number of methods are available for substitution and a method should be selected with care in advance (before fitting the model) depending on the nature of the problem [27]. In ten of the papers, it is unclear how the model developers dealt with risk factors with missing values. Seven developers used only patient records that had complete information on all selected risk factors. However, it was not always clear whether the extent of bias introduced through the presence of missing data had been investigated.
Strategy used to select a final set of risk factors in a model
Sixteen developers used logistic regression, four used a Bayesian modeling technique, and one used artificial neural networks (ANN) to fit their final risk model.
Generally the variable selection strategies used by the model developers to produce their final model fall into three categories:
Construction of risk scores
Typically, risk models produce coefficients for each risk factor in the final model that represent their weights in predicting mortality. Nine model developers converted the coefficients to integer risk scores and derived an additive risk scoring system for simplicity in clinical use. Although this step can be performed with an appropriate mathematical transformation, the four developers who described how this step was performed used a rather arbitrary and inappropriate conversion process. Scaling factors for converting coefficients to integer scores should be chosen carefully; otherwise the predictive power of a model could be severely weakened. There are methods available to translate coefficients into integer scores with minimal loss of precision [29]. The process of construction of risk scores should be clearly described by the developers, or referenced.
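To illustrate why the scaling factor matters, the following Python sketch converts coefficients to integer points by dividing by a chosen unit and rounding, then reports the largest discrepancy between the original coefficient and the value implied by the score. This is a generic scale-and-round approach assumed for illustration; it is not the method of reference [29], and the coefficients and factor names are hypothetical.

```python
def to_integer_scores(coefficients, unit):
    """Express each coefficient as a whole number of points of size `unit`."""
    return {name: round(beta / unit) for name, beta in coefficients.items()}


def max_rounding_error(coefficients, scores, unit):
    """Largest gap between a coefficient and the value its score implies."""
    return max(abs(scores[name] * unit - beta)
               for name, beta in coefficients.items())


# Hypothetical logistic regression coefficients (log odds ratios).
coefficients = {"age_over_70": 0.62, "diabetes": 0.35, "emergency": 1.27}

for unit in (0.35, 0.05):
    scores = to_integer_scores(coefficients, unit)
    error = max_rounding_error(coefficients, scores, unit)
    print(unit, scores, round(error, 3))
```

A coarse unit gives small, convenient scores but distorts the relative weights more than a fine unit does, which is the trade-off a developer would need to describe when reporting the conversion.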
Model validation
It is essential that a model's performance be validated in the light of the specified clinical aims. For example, if a model aims to predict the overall number of surgical deaths for institutions, the extent to which it achieves this should be tested. If it has been developed to predict deaths within clinical risk groups the model should be evaluated in that context.
Type of validation
Typically, a model's performance is first evaluated on the developmental data. This is termed "goodness of fit" and is the weakest form of validation. We do not consider it further in this paper. Evaluation of a model's performance on data not used for its development is more important. It can be done using internal or external validation approaches (illustrated in Fig 2).
Ten developers carried out an internal validation using the data splitting method. In using this approach it is essential that the data are split into the model development and evaluation parts before building the model. The problems are that the size of the data available for model development is reduced and the split is by definition fortuitous. With a different, equally valid split, different assessments of predictive accuracy may be obtained. This method is best suited to situations in which data on tens of thousands of patients are available. A resampling technique may be used to avoid the above-mentioned problems [22].
Internal validation does not address the wider issue of the generalizability of the model. External validation on a large, completely independent test dataset is the most stringent type of validation and is an important part of the entire modeling process [23]. Model development cannot be considered complete until such evaluation has been performed. Of course, this may raise practical difficulties in obtaining suitable data. Only one article used completely independent data for validation. A less stringent approach of external validation, superior to internal validation, is to use a subsequent sample of patients from the same center or centers (temporal validation). This approach was used by four model developers. A validation process was not carried out by four of the developers, and for two it was unclear whether this step was performed.
Aspects of model validation
Calibration refers to a model's ability to predict mortality accurately. The Hosmer-Lemeshow (H-L) test was most commonly used to assess model calibration [30]. To evaluate the model, the finally selected model is applied to the test data. The H-L test compares the observed number of deaths to those predicted by the model typically in 10 equally sized risk groups (deciles). If the model is well calibrated, the observed proportion of deaths in each of the 10 groups will tend to agree with the average predicted probability of death in that risk group. If the p value from an H-L test is greater than 0.05, a current practice of the developers is to claim that the model predicts mortality accurately.
The H-L test is useful when comparing observed and predicted deaths within risk groups. However, we would like to point out some of the problems that are associated with the current use of the H-L test. The traditional decile-based grouping [31] only assesses the statistical validity of the model and it may be more appropriate to use clinical risk groups. Clinical risk groups may be constructed by grouping patients into low, intermediate, and high risks of mortality or any other grouping that is considered clinically useful. However, the H-L test may be sensitive to the choice of risk groups used [30]. A particular choice may indicate good calibration, whereas another may suggest the opposite. Therefore, the clinical risk groups should be specified in advance, before starting the validation process, to avoid reaching any data-driven conclusions. To quote only a p value from the H-L test is not sufficient. One may conclude from a statistically significant result that the model makes inaccurate predictions but it could still perform well for specific groups such as the low to intermediate risk groups of patients, which would be the majority. A statistically nonsignificant result suggests that the model predicts accurately on average, but it may nevertheless perform badly for a particular group of patients such as a high risk group. When the validation data are large, small differences between observed and predicted numbers of deaths may come out as statistically significant although the differences are not clinically important. More insight into model "hot spots" may be gained by combining information from the H-L test with a table or a graph of observed and predicted deaths.
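A table of observed and predicted deaths by prespecified risk group, together with the H-L statistic computed from it, could be produced along the following lines. This is a minimal Python sketch under stated assumptions: the cut point, the data, and the function names are hypothetical, the statistic uses the usual form of the sum over groups of (observed − expected)² / (expected × (1 − expected / n)), and no p value is computed because that requires a chi-square reference distribution whose degrees of freedom depend on the validation setting.

```python
from bisect import bisect_right


def calibration_table(predicted, died, cutpoints):
    """Observed and expected deaths per risk group.

    `cutpoints` are ascending probability thresholds fixed in advance;
    k thresholds define k + 1 groups.
    """
    rows = [{"n": 0, "observed": 0, "expected": 0.0}
            for _ in range(len(cutpoints) + 1)]
    for p, d in zip(predicted, died):
        row = rows[bisect_right(cutpoints, p)]
        row["n"] += 1
        row["observed"] += d
        row["expected"] += p  # expected deaths = sum of predicted risks
    return rows


def hl_statistic(rows):
    """Hosmer-Lemeshow statistic summed over the non-empty groups."""
    total = 0.0
    for row in rows:
        n, obs, exp = row["n"], row["observed"], row["expected"]
        if n == 0:
            continue
        variance = exp * (1 - exp / n)
        if variance <= 0:
            raise ValueError("expected deaths must lie strictly between 0 and n")
        total += (obs - exp) ** 2 / variance
    return total


# Hypothetical validation data: 10 low risk and 10 high risk patients.
predicted = [0.1] * 10 + [0.5] * 10
died_good = [1] + [0] * 9 + [1] * 5 + [0] * 5   # matches the predictions
died_poor = [1] + [0] * 9 + [1] * 8 + [0] * 2   # excess deaths at high risk
cutpoints = [0.2]                               # assumed clinical threshold

for outcome in (died_good, died_poor):
    table = calibration_table(predicted, outcome, cutpoints)
    print(table, round(hl_statistic(table), 2))
```

Printing the table alongside the statistic shows which group drives any lack of fit; in the second data set the whole contribution comes from the high risk group.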
Discrimination is the ability of a model to distinguish patients who die from those who survive. The area under the receiver operating characteristic (ROC) curve, also known as the C-index, is commonly used to assess this aspect of model performance [32]. To calculate the ROC area, all possible pairs of patients in the validation data with different outcomes (died or survived) are considered. For a given pair, the predicted probability of death should ideally be higher for the patient who died than for the one who survived. The ROC area is the proportion of pairs for which this is true. For example, a ROC area of 0.75 means the model correctly ranks 75% of the patient pairs according to their predicted probability. A current practice in cardiac surgery is to conclude that a model discriminates well if the ROC area is greater than 0.7.
The closer the ROC area is to 1 the greater the discriminatory ability of a model. However, the ROC measures the average discriminatory ability and is not very sensitive as it depends only on the ranks of the prediction and not the actual value [32]. Therefore, ROC results presented on their own do not provide the whole picture of a model's performance. It is possible to achieve a ROC area more than 0.7 even though the probabilities of death for all patients predicted by the model lie within a narrow range (for example: 0% to 3%) which may not be clinically useful [33]. It has been suggested that risk models with ROC areas above 0.8 have some clinical utility [22]. However, this cut-off is arbitrary and cannot be directly related to the intended use of a model.
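The pairwise definition of the ROC area, and its dependence on ranks alone, can be illustrated with a short Python sketch. The data are hypothetical, and counting tied pairs as one half is a common convention assumed here rather than something stated in the text.

```python
def roc_area(predicted, died):
    """Proportion of (died, survived) pairs ranked correctly by the model."""
    dead = [p for p, d in zip(predicted, died) if d == 1]
    alive = [p for p, d in zip(predicted, died) if d == 0]
    if not dead or not alive:
        raise ValueError("need at least one death and one survivor")
    correct = 0.0
    for p_dead in dead:
        for p_alive in alive:
            if p_dead > p_alive:
                correct += 1.0
            elif p_dead == p_alive:
                correct += 0.5  # assumed convention for tied predictions
    return correct / (len(dead) * len(alive))


# Hypothetical predictions spread over a wide range of risk.
wide = [0.05, 0.10, 0.20, 0.40, 0.60, 0.80]
died = [0, 0, 1, 0, 1, 1]

# The same predictions compressed into a narrow band below 3%.
narrow = [p * 0.03 for p in wide]

print(roc_area(wide, died), roc_area(narrow, died))
```

Both sets of predictions give the same ROC area because the ordering of patients is unchanged, even though the narrow set would be of little use for separating patients into clinically distinct risk levels.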
At present the calibration and discrimination validation exercise appears to be performed routinely with little regard for the original clinical aims of the model. A typical conclusion is "the model is validated since p more than 0.05 and the ROC area more than 0.7." In two of the papers reviewed the model was considered valid on the basis of the ROC area, although the H-L test indicated poor calibration. Because the clinical aim of the model was not stated it is not possible to ascertain whether the model was clinically valid. One study used results from a ROC curve to conclude that a simple model with fewer risk factors was adequate for clinical use compared with more complex models, without investigating the calibration of the models.
Depending on the aims, the validation process should take in the total picture: the ability of a model to predict mortality accurately, the range of predictions (whether these are clinically useful or not), and the ability to discriminate between high, intermediate, and low risk patients. If predictions are used to identify surgical centers or surgeons with unexpectedly high or low rates, achieving a high ROC area alone is not adequate. It is important for the model to achieve good calibration. A poorly calibrated model may cause large numbers of institutions or surgeons to appear to have excessively high or low rates of mortality when in fact the fault lies with the model, not the clinical performance. If the calibration tests demonstrate poor results it may be possible to recalibrate the model. However, recalibration is not straightforward and may require more than one set of data to obtain a reliable model, particularly if the aim is to apply the model to several different institutions. If a model is recalibrated, the recalibrated model should be presented for use. If predictions are used to stratify patients by disease severity in order to compare treatments or to decide on patient management, both calibration and discrimination aspects are important. If the purpose of the model is only to rank patients with respect to the risk of dying, then calibration is not an issue and assessing discrimination only is adequate. This situation is rather rare. Researchers should clearly specify the clinical aims of the model and draw conclusions from their validation process in relation to their stated aims.
Size of data for validation
The validation dataset should be large enough to enable precise comparison between observed and predicted outcomes and to endow statistical methods such as the H-L test with sufficient power. For the H-L test to be valid, the predicted number of events in each risk group used in the test should always be greater than 1, and for most risk groups it should be at least 5 [30]. It has been suggested as a general principle that adequate model evaluation requires at least 100 deaths in the validation sample [22]. Assuming a short-term mortality rate of 3% in cardiac surgery, approximately 3300 patients would typically be needed to observe 100 deaths. If the size of the validation data is small the power of the statistical methods used to assess calibration may be reduced. The size of the data was small in 4 of the 16 papers in which a validation was performed.
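These two size checks can be expressed as a short Python sketch. The function names are hypothetical, and reading "most risk groups" as at least 80% of groups is an assumption made only so that the check can be written down; the text does not fix a proportion.

```python
import math
from fractions import Fraction


def patients_for_deaths(target_deaths, mortality_rate):
    """Patients needed to expect `target_deaths` at the given mortality rate."""
    rate = Fraction(str(mortality_rate))  # treat 0.03 as exactly 3/100
    if target_deaths <= 0 or not 0 < rate <= 1:
        raise ValueError("need a positive target and a rate in (0, 1]")
    return math.ceil(target_deaths / rate)


def hl_group_counts_adequate(expected_deaths, most=0.8):
    """Check the expected-count conditions for the H-L test.

    Every group must expect more than 1 death, and at least the share
    `most` of groups (assumed 80%) must expect 5 or more.
    """
    if not expected_deaths:
        raise ValueError("need at least one risk group")
    all_above_one = all(e > 1 for e in expected_deaths)
    share_five_plus = sum(e >= 5 for e in expected_deaths) / len(expected_deaths)
    return all_above_one and share_five_plus >= most


# 100 deaths at 3% mortality: 3334 patients, in line with the figure of
# roughly 3300 quoted in the text.
print(patients_for_deaths(100, 0.03))

# Hypothetical expected deaths in ten risk groups of a validation sample.
print(hl_group_counts_adequate([1.5, 6, 7, 8, 9, 10, 12, 15, 20, 30]))
```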
Other aspects
Shahian and coworkers [34] have suggested the use of hierarchical models in cardiac surgery risk modeling. Statistical models, such as hierarchical models, have been developed for data that are clustered in nature. Patients within a cardiac center are likely to be more similar than patients across centers, due to similarity in treatments and perhaps socioeconomic characteristics, and may be considered as clustered within a center. Incorrect statistical inferences may be obtained if clustering is ignored in the analysis [35, 36]. These models are appropriate when data from a large number of centers are used for model development and not necessary when a model is developed using data from a single center. They may provide imprecise estimates when the data are from a small number of centers. Although some model developers have used data from a large number of centers for model development, they have not taken into account the clustering aspect.
None of the models reviewed in this study include interaction terms. These terms are important when there is prior belief on the basis of clinical experience that the effect of a risk factor on mortality may vary according to the levels of another risk factor, for example gender. If interaction terms are considered for inclusion in a risk model, they should be chosen on the basis of clinical plausibility as well as statistical significance. The size of the developmental data would need to be increased to accommodate these terms and interpretations made cautiously.
Conclusions
Acknowledgments
References