Ann Thorac Surg 1997;63:1635-1643
© 1997 The Society of Thoracic Surgeons
Original Article: Cardiovascular
Coronary Artery Bypass Risk Prediction Using Neural Networks
Richard P. Lippmann, PhD,
David M. Shahian, MD
MIT Lincoln Laboratory, Lexington, and Department of Thoracic and Cardiovascular Surgery, Lahey Hitchcock Medical Center, Burlington, Massachusetts
Accepted for publication November 29, 1996.
Abstract
Background. Neural networks are nonparametric, robust, pattern recognition techniques that can be used to model complex relationships.
Methods. The applicability of multilayer perceptron neural networks (MLP) to coronary artery bypass grafting risk prediction was assessed using The Society of Thoracic Surgeons database of 80,606 patients who underwent coronary artery bypass grafting in 1993. The results of traditional logistic regression and Bayesian analysis were compared with single-layer (no hidden layer), two-layer (one hidden layer), and three-layer (two hidden layer) MLP neural networks. These networks were trained using stochastic gradient descent with early stopping. All prediction models used the same variables and were evaluated by training on 40,480 patients and cross-validation testing on a separate group of 40,126 patients. Techniques were also developed to calculate effective odds ratios for MLP networks and to generate confidence intervals for MLP risk predictions using an auxiliary "confidence MLP."
Results. Receiver operating characteristic curve areas for predicting mortality were approximately 76% for all classifiers, including neural networks. Calibration (accuracy of posterior probability prediction) was slightly better with a two-member committee classifier that averaged the outputs of an MLP network and a logistic regression model. Unlike the individual methods, the committee classifier did not overestimate or underestimate risk for high-risk patients.
Conclusions. A committee classifier combining the best neural network and logistic regression provided the best model calibration, but the receiver operating characteristic curve area was only 76% irrespective of which predictive model was used.
Introduction
For editorial comment, see 1529 and 1531.
Risk-adjusted mortality prediction is frequently used to assess the outcome of coronary artery bypass grafting (CABG). Increasingly sophisticated statistical prediction models (classifiers) have been applied to this task, ranging from simple univariate analysis to multivariate logistic regression and Bayesian statistics. However, most regression models require statistical assumptions (eg, linearity, additivity, distributional), which may not be justified [1], and the management of missing data is problematic. Bayesian models assume that prediction variables are independent and also require categorical data that typically can assume only two values. However, they do not require iterative training and easily accommodate missing features. Each of these methods has inherent limitations when applied to a complex biological process, and a high degree of predictive accuracy has yet to be achieved. Neural networks are a form of artificial intelligence that may obviate some of the problems associated with traditional statistical techniques, and it has been asserted by some [2] that they will represent the next major advance in predictive modeling.
Previously, we described the results of pilot studies of CABG risk prediction from databases of approximately 1,000 and 40,000 patients and a limited set of variables [3, 4]. An extensive set of new experiments was performed using 80,606 CABG patients from the 1993 database of The Society of Thoracic Surgeons [5] to evaluate the effectiveness of neural networks, committee classifiers, and bootstrap sampling. These experiments compared the discrimination and calibration accuracy of a logistic regression classifier [6-9], a Bayesian model [5, 10-12], multilayer sigmoid neural network (MLP) classifiers [3, 4], and a committee classifier derived from the logistic regression and MLP classifiers. Discrimination was evaluated by plotting and computing the area under receiver operating characteristic (ROC) curves. Calibration was investigated using χ² tests to determine how accurately classifiers could stratify subjects into six mortality risk categories (0% to 2.5%, 2.5% to 5%, 5% to 10%, 10% to 20%, 20% to 30%, and 30% to 100%).
We also developed methods to provide odds ratios and confidence intervals, which overcome previous deficiencies of neural network models. These techniques and their derivation are described in Appendix 1. A glossary of advanced statistical and neural network terminology is presented in Appendix 2.
Material and Methods
Predictor Variables
The 1993 data set from The Society of Thoracic Surgeons database contains 59 predictor variables per patient. It was first randomly split into training, evaluation, and test sets to simplify design of the prediction model (classifier) and to leave out data for some patients for a final classifier evaluation on unseen data. A small set of conventional classifiers and MLP classifiers with different numbers of hidden nodes, cost functions, and stepsize parameters was evaluated using these data. Parameters for all classifiers were trained using training data (20,178 patients). After a classifier had been trained, it was compared with other classifiers by measuring performance using a separate set of evaluation data (20,302 patients). Performance of classifiers on evaluation data was used to select training and structural parameters for alternate classifiers. Test data (40,126 patients) were used only once after all classifiers had been trained and designed. Before this final comparison, all classifiers were trained using the combined training and evaluation data. No changes in training parameters or classifier topologies were made after the final evaluation with training data.
The total number of patients who did not survive in this database population was low (1,386 or 3.4%) on the combined training and evaluation splits. The original data included four continuous predictor variables (age, weight, height, ejection fraction) and 55 binary predictors. A subset of 36 predictors was selected from the 59 original variables using the following rules:
- Leave out a binary predictor if it is "present" fewer than 200 times on the training data.
- Leave out a binary predictor if a χ² test on the training data (0.05 significance level) indicates that the predictor distribution is not different for patients who died or survived.
- Leave out predictors that are related to or highly correlated with other features on the training data.
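The first two screening rules above can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function names are hypothetical, the test is assumed to be an uncorrected Pearson χ² on a 2x2 table (the text does not say which variant was used), and the third rule (dropping correlated predictors) is not implemented.

```python
CHI2_CRITICAL_1DF_05 = 3.841  # chi-square critical value, 1 df, 0.05 level


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table
    [[present & died, present & survived],
     [absent & died,  absent & survived]] = [[a, b], [c, d]]."""
    n = a + b + c + d
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    if margins == 0:
        return 0.0  # a degenerate table carries no evidence of association
    return n * (a * d - b * c) ** 2 / margins


def keep_binary_predictor(present, died, min_present=200):
    """Apply rule 1 (at least `min_present` occurrences) and
    rule 2 (chi-square test significant at the 0.05 level)."""
    a = sum(1 for p, y in zip(present, died) if p and y)
    b = sum(1 for p, y in zip(present, died) if p and not y)
    c = sum(1 for p, y in zip(present, died) if not p and y)
    d = sum(1 for p, y in zip(present, died) if not p and not y)
    if a + b < min_present:
        return False
    return chi_square_2x2(a, b, c, d) > CHI2_CRITICAL_1DF_05
```

A predictor passes only if it is both frequent enough and distributed differently between patients who died and patients who survived.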
The 33 binary predictors (eg, female, diabetes) and three continuous predictors (age, weight, ejection fraction) selected are listed in Table 1. These predictors are similar to those used in previous CABG risk prediction studies. Continuous predictors were grouped into strata (see Table 1) to create additional predictors for testing with Bayesian models and were used directly for all other classifiers. The three continuous predictors were normalized to zero mean and unit variance, and binary-valued predictors were set to ±0.5. Classifiers were provided identical predictor variables for all experiments.
Nine of the features selected were missing for a few patients. None of these features except ejection fraction and weight were missing for more than 5% of the patients. All missing features were replaced with their most likely values (the statistical mode for categorical variables and the median value for continuous variables) before being used as inputs for classifiers.
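The preprocessing described above (mode or median imputation, zero-mean and unit-variance scaling, and ±0.5 coding) might look like the sketch below. The helper names are hypothetical, missing values are assumed to be represented as `None`, and the use of the population standard deviation is an assumption because the text does not specify it.

```python
from statistics import mean, median, mode, pstdev


def impute(values, continuous):
    """Replace missing entries (None) with the median for continuous
    variables or the mode for categorical variables."""
    observed = [v for v in values if v is not None]
    fill = median(observed) if continuous else mode(observed)
    return [fill if v is None else v for v in values]


def standardize(values):
    """Scale a continuous predictor to zero mean and unit variance."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def encode_binary(values):
    """Map a binary predictor to +0.5 (present) or -0.5 (absent)."""
    return [0.5 if v else -0.5 for v in values]


# Example with made-up ages, one of them missing.
ages = standardize(impute([54, None, 71, 63], continuous=True))
```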
Prediction Models
Logistic regression classifiers were determined using standard methodologies [6-9] as were the Bayesian statistics [5, 10-12]. The conditional probabilities (percent of patients with each predictor variable who died or survived) necessary for Bayesian modeling are shown in Table 1. The second column contains the number of patients and percent of total (conditional probability) where this feature was "present" and the patient died. The third column contains the corresponding data for patients who survived. The last column provides the percent missing data for each variable. Large differences between the conditional probabilities combined with a feature that is "present" for many patients, as for Status Salvage, indicate that a particular feature is a good predictor for mortality. However, in this database, even good predictors are "not present" for many patients and thus do not discriminate well between survivors and nonsurvivors.
Figure 1 depicts a neural network of the type evaluated in this study, the multilayer perceptron. These networks have been applied successfully to many pattern classification problems. Input nodes (predictors) are connected in a massively parallel fashion to nodes within one or more hidden layers and ultimately to one or more output nodes (dependent variables). The output status of each node is determined by the cumulative input weights to that node as well as some mathematical operator, typically a nonlinear sigmoid function that constrains the output to between 0 and 1. The absolute value of the output node or nodes can be used to classify it into one or more categories (eg, "alive" or "dead") based on a chosen threshold level. The network begins its first training epoch with a set of arbitrary weights assigned to the various connections, and these are modified in successive iterations by a process of "back propagation." The network output is compared with the desired output, which depends on known outcomes, and differences are fed backward through the system and the weights adjusted so as to minimize the mean square output error. This training continues through successive epochs until further increments in accuracy are no longer achieved. Care is taken not to overtrain the network, which would make it "specialized" to the training data and less capable of generalization to new data sets. Once the network is trained, cross-validation testing is performed with data to which the network had not been exposed previously.
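The forward pass and back-propagation update described above can be sketched in a few lines. This is an illustrative toy, not the LNKnet implementation used in the study: the class name, the weight initialization range, and the step size are assumptions, and it minimizes the square error mentioned in this paragraph for a network with one hidden layer and one output node.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class TwoLayerMLP:
    """One hidden layer of sigmoid nodes feeding one sigmoid output node."""

    def __init__(self, n_inputs, n_hidden, seed=0):
        rng = random.Random(seed)
        # Each weight row ends with a bias term; initial values are arbitrary.
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
                         for _ in range(n_hidden)]
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(self, x):
        hidden = [sigmoid(sum(w * v for w, v in zip(row, x)) + row[-1])
                  for row in self.w_hidden]
        out = sigmoid(sum(w * h for w, h in zip(self.w_out, hidden))
                      + self.w_out[-1])
        return hidden, out

    def train_step(self, x, target, step=0.5):
        """One stochastic gradient step on E = 0.5 * (out - target)**2."""
        hidden, out = self.forward(x)
        delta_out = (out - target) * out * (1.0 - out)
        # Propagate the error backward before any weight is changed.
        delta_hidden = [delta_out * self.w_out[j] * h * (1.0 - h)
                        for j, h in enumerate(hidden)]
        for j, h in enumerate(hidden):
            self.w_out[j] -= step * delta_out * h
        self.w_out[-1] -= step * delta_out
        for j, row in enumerate(self.w_hidden):
            for i, v in enumerate(x):
                row[i] -= step * delta_hidden[j] * v
            row[-1] -= step * delta_hidden[j]
        return 0.5 * (out - target) ** 2


def train(net, patterns, epochs, step=0.5, seed=0):
    """Present every pattern once per epoch in random order;
    return the mean square error recorded for each epoch."""
    rng = random.Random(seed)
    patterns = list(patterns)
    history = []
    for _ in range(epochs):
        rng.shuffle(patterns)
        total = sum(net.train_step(x, t, step) for x, t in patterns)
        history.append(total / len(patterns))
    return history
```

In practice the per-epoch error on held-out evaluation data, rather than on the training data, would be monitored to decide when to stop and so avoid overtraining.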

Fig 1. Two-layer, multilayer perceptron neural network using random weight initialization and back propagation.
Multilayer neural networks with no hidden nodes (denoted single-layer MLPs), with one hidden layer (denoted two-layer MLPs), and with two hidden layers (denoted three-layer MLPs) were evaluated in this study. All classifiers were implemented using LNKnet pattern classification software [3, 4]. Figure 2 is a block diagram of the medical risk prediction system that has been developed based on neural network methodology. Important predictor variables from a patient's medical record are fed to a classifier and to a "confidence network." The classifier provides outputs that estimate the probability or risk of mortality. The confidence network provides upper and lower bounds on these risk estimates. Automated experiments were performed by training classifiers on training data and testing on evaluation data to explore the effect of varying training parameters and the topology of MLP classifiers with one and two hidden layers. Training parameters were explored for one three-layer MLP network with eight hidden nodes in the first hidden layer and four hidden nodes in the second hidden layer because this network had provided good performance on experiments with a smaller but similar database. The number of hidden nodes in the two-layer network varied from one to eight. The step size during training for all MLP networks varied from 0.001 to 0.1, the number of training passes varied from 5 to 80, and both square-error and cross-entropy cost functions were evaluated.
Results
Neural Network Performance
Receiver operating characteristic areas changed little as the parameters were varied. Model calibration, which measures how well classifier outputs approximate posterior probabilities, improved substantially with a cross-entropy cost function (instead of square error), with a smaller step size (0.005 or 0.001 instead of 0.05 or 0.1) and with fewer epochs (10 or 20 versus 40 or 80). Two-layer MLP classifiers with four hidden nodes and the three-layer MLP classifier with eight hidden nodes in the first hidden layer and four hidden nodes in the second hidden layer provided good overall performance on evaluation data with a cross-entropy cost function, momentum of 0.6, and stochastic gradient descent stopping after 20 epochs. The step size was set to 0.001 for the single-layer MLP and 0.005 for the other MLP networks. Cross-validation experiments were performed to validate these settings when training used the combined training and evaluation data, which is twice as large as the training data. These experiments compared results with 5, 10, or 20 epochs of data. As a result of these experiments, the number of epochs was reduced to 10 for final testing. This reduction in number of epochs was expected given the larger number of patterns available for final training.
Comparison Between Classifiers (Prediction Models)
DISCRIMINATION.
The ROC areas (see Comment) for all classifiers after training (using training and evaluation data) and final testing (using test data only) are provided in Table 2. Receiver operating characteristic areas are all about 76% and vary only 1.6 percentage points (range, 74.8% to 76.4%) across classifiers. Differences between classifiers are not statistically significant given the standard deviation of approximately 0.8 percentage points for these areas, calculated as described in Hanley and McNeil [13]. These average areas are similar to values obtained in other studies where the risk of mortality was predicted for coronary bypass operations.
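The standard deviation quoted above can be reproduced approximately with the Hanley and McNeil formula for the standard error of an ROC area. The sketch below is illustrative: the function name is hypothetical, and the class counts in the example (about 1,340 deaths and 38,786 survivors in the test set) are assumptions inferred from figures quoted elsewhere in the text rather than values reported directly.

```python
import math


def hanley_mcneil_se(area, n_pos, n_neg):
    """Approximate standard error of an ROC area (Hanley and McNeil)."""
    q1 = area / (2.0 - area)
    q2 = 2.0 * area ** 2 / (1.0 + area)
    variance = (area * (1.0 - area)
                + (n_pos - 1) * (q1 - area ** 2)
                + (n_neg - 1) * (q2 - area ** 2)) / (n_pos * n_neg)
    return math.sqrt(variance)


# Assumed test-set composition; yields roughly 0.008 (0.8 percentage points).
se = hanley_mcneil_se(0.76, n_pos=1340, n_neg=38786)
```

With a standard error of this size, a spread of 1.6 percentage points across classifiers is about two standard errors, which is consistent with the statement that the differences are not significant.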
Past research suggests that the two-layer MLP classifier might exhibit excessive variability because of the limited available training data and differences in weight initialization during training. This variability was evaluated using bootstrap sampling [14]. Fifty ROC curves are shown superimposed in Figure 3B. These were produced from 50 two-layer MLP networks trained using bootstrap sampling. The average ROC area for these bootstrap curves is 75.4%, which is only slightly less than the ROC area measured for the two-layer MLP classifier trained on all training and evaluation patients. The standard deviation in ROC areas is only 0.3%, which is small, even when compared with the variability across different types of classifiers. The variability in shapes of different bootstrap ROC curves is also small. These results demonstrate that the variability caused by stochastic gradient descent training and random weight initialization for the two-layer MLP classifier is small and not an important practical concern.

Fig 3. (A) Receiver operating characteristic curve for committee classifier. Area (C-index) = 76.4%. (B) Fifty superimposed receiver operating characteristic curves generated using bootstrap sampling. Average receiver operating characteristic curve area (AVE AREA) = 75.4% ± 0.3%, suggesting little variability in neural network output secondary to random weight initialization and stochastic gradient descent training.
CALIBRATION.
Classifier outputs were also used to bin or stratify each patient into one of six risk levels (0% to 2.5%, 2.5% to 5%, 5% to 10%, 10% to 20%, 20% to 30%, 30% to 100%) by treating classifier outputs as posterior probability estimates. Calibration accuracy was evaluated by assigning patients to mortality bins based on network outputs and then comparing the average network output in each bin with the actual percentage of patients in that bin who did not survive. The resulting χ² tests were significant at the 0.05 level (indicating poor calibration accuracy) for all but the committee classifier (described in the next section) and the single-layer MLP classifier. The two-layer MLP and logistic regression classifier provided the next best calibration performance as measured by χ² values.
Average network outputs and actual percentage mortality for patients in each bin are shown in Figure 4 for the various classifiers evaluated. Good calibration accuracy in this figure is represented by symbols and lines near the diagonal. All classifiers provided good model calibration for patients in the lowest three bins (0% to 2.5%, 2.5% to 5%, 5% to 10%). The Bayes classifier severely overestimates risk for high-risk patients, probably because many of the high-risk characteristics are not truly independent variables. Patients with a true risk of roughly 14% are assigned a risk level above 40%. The two-layer MLP classifier underestimates risk for high-risk patients, whereas logistic regression overestimates risk, but not as severely as the Bayes classifier.
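The binning procedure described in this section can be sketched as follows. The bin edges come from the text; everything else is an assumption made for illustration: the function names are hypothetical, boundary values are assigned to the higher bin, and the χ² statistic uses a Hosmer-Lemeshow-style form because the exact statistic is not specified here.

```python
import bisect

# Risk-level boundaries from the text: 0-2.5%, 2.5-5%, 5-10%, 10-20%, 20-30%, 30-100%.
BIN_EDGES = [0.0, 0.025, 0.05, 0.10, 0.20, 0.30, 1.0]


def calibration_table(predicted, died):
    """Group patients by predicted risk and compare mean prediction
    with observed mortality in each bin."""
    rows = [{"n": 0, "sum_pred": 0.0, "deaths": 0} for _ in range(6)]
    for p, y in zip(predicted, died):
        idx = min(bisect.bisect_right(BIN_EDGES, p) - 1, 5)
        rows[idx]["n"] += 1
        rows[idx]["sum_pred"] += p
        rows[idx]["deaths"] += 1 if y else 0
    for row in rows:
        n = row["n"]
        row["mean_predicted"] = row["sum_pred"] / n if n else None
        row["observed"] = row["deaths"] / n if n else None
    return rows


def calibration_chi_square(rows):
    """Sum of (observed - expected deaths)^2 / binomial variance over bins."""
    stat = 0.0
    for row in rows:
        if row["n"] == 0:
            continue
        p = row["mean_predicted"]
        variance = row["n"] * p * (1.0 - p)
        if variance > 0:
            stat += (row["deaths"] - row["n"] * p) ** 2 / variance
    return stat
```

A well-calibrated classifier gives mean predictions close to observed mortality in every bin and therefore a small statistic.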

Fig 4. Calibration, by mortality bins, of four classifiers. (MLP = multilayer sigmoid neural network.)
COMMITTEE CLASSIFIER PERFORMANCE.
Results with the evaluation data suggested that no one classifier alone could produce both high ROC areas (discrimination) and good model calibration as indicated by χ² scores. Therefore, a committee classifier was developed, derived from the two-layer MLP classifier and a logistic classifier. These two classifiers had provided good ROC areas and model calibration on the evaluation data, although they overestimated (logistic regression) or underestimated (MLP) risk in the highest risk groups. The committee classifier output was derived from a simple average of the outputs of the logistic and MLP classifiers. This type of averaging is reasonable because the outputs of both classifiers are estimates of posterior probabilities.
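The averaging step is simple enough to state directly. In this sketch the function name and the example probabilities are hypothetical; it only illustrates how an overestimate and an underestimate can partly cancel.

```python
def committee_output(member_outputs):
    """Average the posterior probability estimates of the committee members."""
    return sum(member_outputs) / len(member_outputs)


# Made-up example: logistic regression says 0.40, the MLP says 0.20.
risk = committee_output([0.40, 0.20])
```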
The simple two-classifier committee does not overestimate or underestimate risk for high-risk patients and provides the best calibration and ROC areas. In Figure 4, the two standard deviation bounds drawn around the data for this classifier overlap the diagonal line, and the χ² difference between actual and predicted risk levels is not significant. The ROC curve (area = 76.4%) for the committee classifier, which is similar to the other classifiers, is shown in Figure 3A. When the classifier output threshold is set at a level that permits correct preoperative identification of 50% (approximately 670) of the patients who will die (true positive), then 14% (approximately 5,600) of patients who will actually survive are incorrectly placed in the mortality category (false positive).
Comment
Medical outcomes are a function of many variables, including random fluctuation, real differences in quality of delivered care, and differences in patient severity risk [15, 16]. Stimulated by the dissemination of "raw" CABG mortality data by the Health Care Financing Administration, risk-adjusted mortality prediction techniques have been developed to assess the CABG operation [17]. These may be used to provide doctors and their patients with a preoperative estimate of surgical risk, to render a more intensive level of care to patients with higher predicted risk of morbidity and mortality, to identify and adjust for important risk factors when studying the CABG procedure, and for internal quality control within a hospital or health care system. In its most controversial application, it has been used by states, including Pennsylvania and New York [8, 9], the Federal Government (Health Care Financing Administration and the Department of Veterans Affairs), and by insurance companies, to compare the "quality" of various heart surgery programs.
Numerous aspects of risk-stratified mortality prediction remain controversial [2, 7, 15, 16, 18-24]. These include (1) the lack of universally accepted definitions for risk factors; (2) our incomplete knowledge of all potential risk factors that might influence outcomes; (3) the use of clinical versus administrative data; (4) the size of the database upon which the predictive systems are trained; (5) the infrequent occurrence of some risk factors that may be extremely important but whose impact is inadequately measured by the model; (6) incomplete, inaccurate, or missing data, the statistical management of which may substantially alter the results; (7) the lack of a well-defined relationship between risk-adjusted outcome and quality of care [24]; (8) variability of risk-factor reporting both between hospitals and during different time periods; and (9) the potential impact of "report cards" on an institution's future reporting practices (eg, sudden increases in the rate of reported comorbidities to inflate expected risk) and willingness to accept high-risk patients ("outmigration") [25].
The most appropriate mathematical model for this complex biological process is also problematic. Numerous prediction models have been used, including univariate analysis, multivariate logistic regression [6-9, 26-29], and Bayesian statistics [5, 10-12]. Most have been tested using split-sample cross-validation techniques [2, 15] as used in our series.
In most studies, model performance has been evaluated using two techniques, calibration and discrimination. In the calibration method [2], patients are grouped into expected mortality bins or groups, then the observed and expected mortality proportions are compared using χ² analysis. Most CABG models produce relatively accurate model calibration except for the highest risk groups. Receiver operating characteristic curve analysis, originally developed for signal processing, may be the best overall available technique for evaluating the discrimination accuracy of a diagnostic system [7, 13, 30, 31]. This test graphically depicts the trade-off between test sensitivity and specificity as the threshold for categorizing patients from the model output is varied. As the threshold for classifying a result as "positive" is lowered to detect as many "true positives" as possible, more "false-positive" classifications will also occur. The area under the ROC curve, also known as the C-index, increases proportionately with predictive accuracy of the test, with an area of 0.5 corresponding to pure chance and an area of 1.0 indicating a test with 100% sensitivity and specificity. The ROC curve areas for other types of diagnostic systems range from 0.71 to 0.89 for weather prediction and as high as 0.98 for certain types of computed tomographic imaging analysis [32]. In the majority of studies using ROC analysis for CABG mortality prediction, the area under the ROC curve has varied from 0.695 to 0.814 [7, 18, 20, 29, 30], with most results clustered between 0.73 and 0.76. Higher ROC areas of 0.82 to 0.84 were reported by Turner and associates [28] using Parsonnet and APACHE II algorithms, but the sample size was small (1,008 patients) and no internal cross-validation studies were performed. In an extensive review of the subject, Grover and associates [20] expressed the opinion that a C-index (ROC curve area) higher than 0.80 to 0.85 for CABG mortality prediction may never be achieved.
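The ROC area and the sensitivity versus false-positive trade-off described above can be computed directly from classifier outputs. The sketch below uses the pairwise-comparison definition of the C-index (the probability that a randomly chosen nonsurvivor receives a higher score than a randomly chosen survivor, with ties counted as one half). The function names are hypothetical, and the quadratic-time loop is chosen for clarity rather than for use on a database of this size.

```python
def roc_area(scores, died):
    """C-index: fraction of (nonsurvivor, survivor) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, died) if y]
    neg = [s for s, y in zip(scores, died) if not y]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def roc_point(scores, died, threshold):
    """Sensitivity and false-positive rate when scores at or above
    `threshold` are classified as predicted deaths."""
    n_pos = sum(1 for y in died if y)
    n_neg = len(died) - n_pos
    true_pos = sum(1 for s, y in zip(scores, died) if y and s >= threshold)
    false_pos = sum(1 for s, y in zip(scores, died) if not y and s >= threshold)
    return true_pos / n_pos, false_pos / n_neg
```

Sweeping `threshold` from high to low and plotting the resulting points traces out the ROC curve.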
Because of the demonstrated weaknesses of current models and the recent application of artificial intelligence techniques to other areas of clinical prediction in medicine, some have suggested that this might be the next logical step in outcome prediction. Neural networks are a pattern recognition methodology that permits massive parallel processing of information, much as does the human neuronal network. Baxt [32] recently reviewed the applications of neural networks to medicine, including clinical diagnosis, radiographic imaging, waveform analysis, pathologic diagnosis, pharmacology, and outcome prediction.
There are numerous theoretical advantages of neural networks over logistic regression and Bayesian statistics. Neural networks require no a priori assumptions or knowledge about the underlying frequency distribution (nonparametric); they have the capacity to model complex, nonlinear relationships; they do not require assumptions about the independence of variables as does the Bayesian model; and they are relatively robust and tolerant of missing data and input errors. Disadvantages include the need for a large database upon which to train the network, high computation rates for the training (once trained, the network can be run on most personal computers), the possibility of overtraining, and the general unavailability of convenient features, such as odds ratios and confidence intervals, that have been useful in regression analysis and Bayesian models.
Our data, like those of most other series, show that all classifiers provided relatively good calibration except for the highest risk patients. This inaccuracy at the extreme occurs because the number of patients in the highest risk group is small, some important but infrequently occurring risk factors may be difficult to model, and some high-risk characteristics may not be truly independent. In our study, the best calibration accuracy was obtained using a committee classifier, which averaged the outputs of a two-layer multilayer perceptron and logistic regression.
Despite optimism that artificial intelligence techniques might be the next major advance in risk prediction for coronary bypass, our results in this analysis of more than 80,000 patients confirm the suspicion of many investigators that all prediction systems have inherent limitations [19, 20, 24]. Neural networks alone failed to improve upon the ROC curve area of logistic regression or Bayesian analysis, suggesting an absence of complex nonlinear relationships, at least among the variables presented to the network. A simple committee classifier derived from the two-layer MLP and logistic regression classifiers provided the best calibration accuracy even in high-risk patients, although discrimination as measured by the area under the ROC curve was unchanged.
Appendix 1
Effective Odds Ratio for MLP Classifiers
A convenient feature of logistic regression is the simple interpretation that can be applied to internal parameters or weights. These weights are related to the odds of mortality, defined as the probability of mortality (as estimated by a classifier) divided by 1 minus this probability, $P/(1 - P)$. The changes in odds can be measured by the odds ratio $[P_1/(1 - P_1)]/[P_0/(1 - P_0)]$, which is the odds when the predictor is "present" (value = 1) divided by the odds when the predictor is "absent" (value = 0). Logistic and single-layer MLP classifiers automatically provide odds ratios for predictor variables that are independent of the values of other predictors. The odds ratio for a particular input attached to a connection with weight $w$ is equal to $\exp(w[x_{\mathrm{present}} - x_{\mathrm{absent}}])$, where $x_{\mathrm{present}}$ is the value of that predictor when the feature is "present" and $x_{\mathrm{absent}}$ is the value of the predictor when the feature is "absent." This makes it easy to compare the importance of various predictor variables and to analyze classifier performance.
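The exponential form follows in one step from the sigmoid output function. The short derivation below is a sketch in which the bias $b$, the index $i$ over inputs, and the index $j$ of the predictor of interest are notation introduced here for illustration; $P$, $w$, $x_{\mathrm{present}}$, and $x_{\mathrm{absent}}$ are as defined above.

```latex
% Logistic (or single-layer MLP) output for inputs x_i with weights w_i and bias b:
P = \frac{1}{1 + \exp\!\left(-\left(b + \sum_i w_i x_i\right)\right)}
\quad\Longrightarrow\quad
\frac{P}{1 - P} = \exp\!\left(b + \sum_i w_i x_i\right).

% Changing only predictor j from x_absent to x_present, all other inputs fixed:
\frac{P_1/(1 - P_1)}{P_0/(1 - P_0)}
  = \frac{\exp\!\left(b + \sum_{i \ne j} w_i x_i + w_j x_{\mathrm{present}}\right)}
         {\exp\!\left(b + \sum_{i \ne j} w_i x_i + w_j x_{\mathrm{absent}}\right)}
  = \exp\!\left(w_j \left[x_{\mathrm{present}} - x_{\mathrm{absent}}\right]\right).
```

The terms involving the other inputs cancel, which is why the odds ratio does not depend on them. With binary predictors coded as ±0.5, the difference $x_{\mathrm{present}} - x_{\mathrm{absent}}$ equals 1 and the odds ratio reduces to $\exp(w_j)$.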
Odds ratios for two-layer and three-layer MLP classifiers are dependent on the values of other inputs because nonlinear interactions between features are allowed and because the input-output function from one input to the output is no longer a simple sigmoid. It is possible, however, to define an effective odds ratio averaged over all patients when the predictor of interest changes from "absent" to "present." This is computed by presenting the pattern of predictor variables from each patient to a classifier, varying the specific predictor of interest from "absent" to "present," calculating the odds for the network output under these two conditions and computing the odds ratio for each patient. The effective odds ratio is the odds ratio averaged over all patients or the average increase in risk observed by a patient.
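The procedure above applies to any classifier that maps a predictor pattern to a probability. In the sketch below, `predict` is a hypothetical callable standing in for a trained classifier, the ±0.5 coding follows the Predictor Variables section, and the population standard deviation is an assumed choice for the spread across patients.

```python
from statistics import mean, pstdev


def odds(p):
    return p / (1.0 - p)


def effective_odds_ratio(predict, patients, index, absent=-0.5, present=0.5):
    """Toggle one predictor for every patient, compute the per-patient
    odds ratio, and return its mean and standard deviation."""
    ratios = []
    for x in patients:
        x_absent = list(x)
        x_absent[index] = absent
        x_present = list(x)
        x_present[index] = present
        ratios.append(odds(predict(x_present)) / odds(predict(x_absent)))
    return mean(ratios), pstdev(ratios)
```

For a logistic classifier every patient yields the same ratio, so the standard deviation is zero; for an MLP with hidden layers the ratios differ from patient to patient.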
The effective odds ratios for the ten most important binary variables for logistic regression, Bayes, single-layer MLP, two-layer MLP, and three-layer MLP classifiers are shown in Table 3. Features are ordered using odds ratios from the logistic regression classifier. This table also contains the standard deviation of the odds ratios across patients. This standard deviation is zero for the Bayes, logistic, and single-layer MLP classifiers because the odds ratio for one input feature is independent of the values of other features for these classifiers. The standard deviations are provided in parentheses for the two-layer MLP and three-layer MLP classifiers. These odds ratios indicate that the same sets of predictor variables are generally the most important across all classifiers. They tend to be highest for the Bayes model and the two-layer or three-layer MLP.
The standard deviations in Table 3 suggest that odds ratios are sometimes an effective approach to characterizing MLP classifiers. For example, the odds ratio standard deviations for the MLP classifiers are low for the predictors "female" and "triple-vessel disease." Odds ratios are thus useful to summarize the effect of the MLP classifier for these features. Standard deviations are much higher, however, for other predictors, including "salvage operations" and "cardiogenic shock." The odds ratio for "salvage operations" is 19.4 for the two-layer MLP classifier, but the standard deviation of the odds ratio across patients is 12.1, and the odds ratios across patients for this feature range from 1.0 to 52. This large range occurs because the average risk of death for "nonsalvage operations" is 3.1%, but many of these patients have risk levels below 1%. At the other extreme, the risk of death for "salvage operations" typically ranges from 10% to 35%.
Effective odds ratios can be computed for MLP classifiers, but these values must be interpreted carefully because the effect of a particular predictor variable is patient-specific for complex MLP classifiers. The best approach to evaluating the effect of a feature for a particular patient is to vary only that feature while the other predictors are set using that patient's data. The importance of variables can be assessed over a large population by exploring the average risk and distribution of risk across patients when the predictor variable is "absent" or "present."
Confidence MLP Networks
Estimating the confidence in the classification decision produced by a neural network is a critical issue that has received relatively little study. Lack of a confidence measure makes it difficult for physicians and other professionals to accept the use of complex networks. Bootstrap sampling was used to generate confidence intervals for risk probabilities generated by the two-layer MLP classifiers. As shown in the top half of Figure 5, 50 bootstrap sets of training data were created from the original training data by resampling with replacement. These bootstrap training sets were then used to train 50 bootstrap MLP classifiers using the same architecture and training procedures that were selected for risk prediction. When a pattern is fed into these classifiers, their outputs provide an estimate of the distribution of the output of the risk prediction MLP. Lower and upper confidence bounds for any input are obtained by sorting these outputs and selecting the 10% and 90% cumulative levels.
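The bootstrap procedure can be sketched generically. Here `train` is a hypothetical callable that takes a training set and returns a prediction function; it stands in for MLP training and is not part of the original text. The way the 10% and 90% levels are mapped to positions in the sorted list is also an assumption, because the exact indexing convention is not stated.

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]


def bootstrap_predictions(train, data, x, n_boot=50, seed=0):
    """Train one classifier per bootstrap set and collect its output for x."""
    rng = random.Random(seed)
    return [train(bootstrap_sample(data, rng))(x) for _ in range(n_boot)]


def confidence_bounds(outputs, lower=0.10, upper=0.90):
    """Sort the bootstrap outputs and read off the lower and upper levels."""
    ranked = sorted(outputs)
    last = len(ranked) - 1
    return ranked[round(lower * last)], ranked[round(upper * last)]
```

A wide gap between the two bounds for a given patient indicates that the risk estimate is sensitive to which patients happened to be in the training set.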

Fig 5. Block diagram of bootstrap method for determination of confidence intervals for multilayer perceptron. (MLP = multilayer sigmoid neural network.)
It is computationally expensive to maintain and query 50 bootstrap MLPs whenever confidence bounds are desired for a particular patient. A simpler approach is to train a single confidence MLP to replicate the confidence bounds predicted by the 50 bootstrap MLPs, as shown in the bottom half of Figure 5. The confidence MLP is fed the input pattern and the output of the risk prediction MLP and produces at its output the confidence intervals that would have been produced by the 50 bootstrap MLPs. The confidence MLP is a mapping or regression network that replaces the 50 bootstrap networks. It was found that confidence networks with one hidden layer, two hidden nodes, and a linear output could accurately reproduce the upper and lower confidence bounds created by 50 bootstrap two-layer MLP networks. The confidence network outputs were almost always within ±10% of the actual bootstrap bounds. It was also found that only the output of the risk prediction MLP was required as input by the confidence networks to produce this level of accuracy. Upper and lower bounds produced by these confidence networks for a two-layer MLP network risk predictor are shown in Figure 6. The bounds are widest (±9 percentage points) when mortality risk is near 35% and narrow at lower risk levels. This relatively simple approach makes it possible to create and replicate confidence intervals for many types of classifiers.
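A confidence network of the size described above (one input, two sigmoid hidden nodes, linear output) is small enough to write out in full. The following Python sketch is illustrative only. It assumes a single network with two linear outputs (lower and upper bound), which the text does not specify, and trains it by back-propagation with a squared-error cost. The training targets are invented, with a half-width that grows in proportion to the predicted risk, and are not the bootstrap bounds obtained in this study. The class and function names are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class ConfidenceNet:
    """Input: risk from the risk prediction model. Hidden layer: two sigmoid
    nodes. Output: two linear nodes (lower bound, upper bound)."""

    def __init__(self, seed=0):
        rng = random.Random(seed)

        def small():
            return rng.uniform(-0.5, 0.5)  # small random initial weights

        self.w1 = [small(), small()]  # input -> hidden weights
        self.b1 = [small(), small()]  # hidden biases
        self.w2 = [[small(), small()], [small(), small()]]  # w2[output][hidden]
        self.b2 = [small(), small()]  # output biases

    def forward(self, p):
        h = [sigmoid(self.w1[j] * p + self.b1[j]) for j in range(2)]
        out = [self.b2[k] + sum(self.w2[k][j] * h[j] for j in range(2))
               for k in range(2)]
        return h, out

    def predict(self, p):
        return tuple(self.forward(p)[1])

    def train(self, pairs, epochs=500, step=0.1, seed=0):
        rng = random.Random(seed)
        pairs = list(pairs)
        for _ in range(epochs):
            rng.shuffle(pairs)
            for p, target in pairs:
                h, out = self.forward(p)
                d_out = [out[k] - target[k] for k in range(2)]
                # Error at each hidden node, computed before weights change.
                d_hid = [h[j] * (1.0 - h[j]) *
                         sum(d_out[k] * self.w2[k][j] for k in range(2))
                         for j in range(2)]
                for k in range(2):
                    for j in range(2):
                        self.w2[k][j] -= step * d_out[k] * h[j]
                    self.b2[k] -= step * d_out[k]
                for j in range(2):
                    self.w1[j] -= step * d_hid[j] * p
                    self.b1[j] -= step * d_hid[j]


def mean_squared_error(net, pairs):
    total = 0.0
    for p, target in pairs:
        out = net.predict(p)
        total += sum((out[k] - target[k]) ** 2 for k in range(2))
    return total / len(pairs)


def synthetic_bounds(n=41, max_risk=0.40, rel_half_width=0.25):
    """Invented targets: bounds at risk p are p -/+ rel_half_width * p."""
    pairs = []
    for i in range(n):
        p = max_risk * i / (n - 1)
        pairs.append((p, (p - rel_half_width * p, p + rel_half_width * p)))
    return pairs
```

After `net.train(synthetic_bounds())`, a call such as `net.predict(0.30)` returns an approximate (lower, upper) pair from one forward pass. How closely the fit tracks the targets depends on the step size and the number of epochs, neither of which is taken from the study.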
 |
Appendix 2
|
|---|
Glossary
- Bootstrap sampling: A procedure used to generate multiple sets of training data from an original set containing data for N patients. A new bootstrap training set is created by randomly selecting N patients from the original data set. Patients are selected one at a time and any patient can be chosen during each random selection. This results in a bootstrap training set that may include multiple samples for some patients and that may not include other patients.
- Classifier: A prediction model or algorithm to compute the probability of an outcome (eg, mortality) from data for a single patient.
- Committee classifier: A prediction model or algorithm that makes use of outputs from two or more different classifiers.
- Cross-entropy cost function: A method of computing the difference between the desired binary-valued output of a neural net and the actual output that measures the cross-entropy between these values.
- Epoch: Back-propagation training is an iterative procedure where MLP weights are modified after presenting data for each patient. An epoch of training is complete after all training patients have been presented once.
- Hidden node: A computing element in an MLP network whose output is not directly used as a network output. In MLP networks with one hidden layer, hidden nodes form a weighted sum of network inputs, apply a sigmoid function to the sum, and feed the result to the output nodes.
- Momentum: During each iteration of back-propagation training, weights are adjusted in a direction indicated by back-propagation calculations and in an amount specified by the step size. Momentum is a factor that smooths weight changes across multiple iterations and often shortens training time.
- Posterior probability estimate: An estimate of the mortality or survival probability for a given patient.
- Random weight initialization: Before an MLP neural network is trained, the weights associated with links between layers are initialized to small random values.
- Sigmoid neural network: MLP neural networks with computing elements or nodes that form weighted sums of inputs from the previous layer and pass these sums through a sigmoid nonlinearity that constrains the output of each node to a range from zero to one.
- Squared-error cost function: A method of computing the difference between the desired binary-valued output of a neural net and the actual output that computes the sum of the squared differences between these values across the output nodes.
- Step size: During each iteration of back-propagation training, weights are adjusted in a direction indicated by back-propagation calculations. The step size is a scale factor that determines how far to move weights in the specified direction.
- Stochastic gradient descent with early stopping: A training procedure for MLP networks where weights are adapted after presenting data for each patient. At the end of each epoch, the performance of the trained MLP is measured using data not used for training. Training is terminated if these measurements indicate that performance, as measured by the ROC area and χ² calibration, is no longer improving.
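
Several of the training terms defined in this glossary can be made concrete with a short Python sketch. It is illustrative only and makes the following assumptions: the momentum update uses one common textbook form, the early-stopping rule tracks a single score for which higher is better (this study monitored both ROC area and χ² calibration), and the `patience` parameter and all function names are hypothetical.

```python
import math


def squared_error(desired, actual):
    """Squared-error cost: sum of squared differences across output nodes."""
    return sum((d - a) ** 2 for d, a in zip(desired, actual))


def cross_entropy(desired, actual, eps=1e-12):
    """Cross-entropy cost between binary desired outputs and sigmoid outputs."""
    total = 0.0
    for d, a in zip(desired, actual):
        a = min(max(a, eps), 1.0 - eps)  # guard against log(0)
        total -= d * math.log(a) + (1.0 - d) * math.log(1.0 - a)
    return total


def momentum_step(weights, gradients, velocity, step_size=0.1, momentum=0.9):
    """One weight update: the step size scales the move along the negative
    gradient, and momentum carries over part of the previous weight change."""
    new_velocity = [momentum * v - step_size * g
                    for v, g in zip(velocity, gradients)]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


def should_stop(validation_scores, patience=3):
    """Early stopping: stop once the best held-out score (one per epoch)
    is at least `patience` epochs old."""
    if len(validation_scores) <= patience:
        return False
    best_epoch = validation_scores.index(max(validation_scores))
    return len(validation_scores) - 1 - best_epoch >= patience
```

With a desired output of 1 and an actual output of 0.5, the cross-entropy cost equals ln 2. Starting from zero velocity, two successive calls to `momentum_step` with the same gradient produce a larger second weight change than the first, which is the smoothing and acceleration effect described under Momentum.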
 |
Acknowledgments
|
|---|
We express our gratitude for the cooperation provided by Dr Richard E. Clark, The Society of Thoracic Surgeons Database Committee, and Summit Medical Systems. We also thank Linda Kukolich for data analysis performed using LNKnet software.
 |
Footnotes
|
|---|
Address reprint requests to Dr Shahian, Department of Thoracic and Cardiovascular Surgery, Lahey Hitchcock Medical Center, 41 Mall Rd, Burlington, MA 01805.