
Ann Thorac Surg 2000;70:1782-1785
© 2000 The Society of Thoracic Surgeons


Supplement: outcomes 2000

Is that outcome different or not? The effect of experimental design and statistics on neurobehavioral outcome studies

David A. Stump, PhDa, Robert L. James, MSa, John M. Murkin, MDb

a Department of Anesthesiology, Wake Forest University School of Medicine, Winston-Salem, North Carolina, USA
b Department of Anesthesiology, The University of Western Ontario, London Health Science Center, London, Ontario, Canada

Address reprint requests to Dr Stump, Department of Anesthesiology, Wake Forest University School of Medicine, Medical Center Blvd, Winston-Salem, NC 27157
e-mail: dstump{at}wfubmc.edu

Presented at Outcomes 2000, "The Key West Meeting," Key West, FL, May 24–28, 2000.

Experimental design and statistics are two different but related arts. The art of experimental design is how you ask the question, and how you ask the question dictates which statistics you use to answer the question.

How you phrase the question depends upon your assumptions. For example, if your question is: "Are age and bypass time associated with negative outcomes after cardiac surgery?" then a common statistical approach is to make a tactical decision, treat both sets of numbers as continuous variables, and fit a linear regression model. The problem lies in the assumption that each variable has a continuous, linear effect. All of us know empirically that the 5 years between 40 and 45 do not impart the same risk or rate of deterioration as the 5 years between 70 and 75 (nor do we all age at the same rate; as Indiana Jones said, "it's not the years, it's the mileage"). Furthermore, during cardiopulmonary bypass (CPB), every minute past 100 carries a much greater risk than each of the preceding 100 minutes. Why? Because an extended bypass time generally indicates that some complication, usually bleeding, has occurred; it is not the time on bypass per se that matters, but what the time signifies, eg, some anomaly. Most surgeons have an average time to perform a three-vessel coronary artery bypass graft (CABG) with a relatively small standard deviation. The outliers are usually problem cases and skew the curve with an extended tail to the right. Therefore, age and CPB time are better treated not as continuous variables but as categorical ones (young-old, long-short), and can be analyzed using a strategy like a 2 x 2 or 3 x 3 χ² test, as in Table 1.


Table 1. The Use of Age and Bypass Time as Categorical (vs Continuous) Variables in Reporting Neurological Injury

 
In the past, we have seen many instances where data analyzed as a continuous variable have shown no relationship with clinical reality, usually because the question was "phrased" improperly. For example, the question: "Is age a predictor of negative outcome after CPB?" is different from: "Do older people do worse than younger people after CPB?" Why? Because the first question uses the person's age, ie, a continuous number, whereas the second groups patients by category, using a cut-off (older, eg, over 65) or risk factors (postmenopausal, diabetic, hypertensive, retired, etc) (Fig 1).



Fig 1. Comparison of linear regression analysis (A) to χ² analysis (B and C). The linear regression tests the hypothesis that the effect of age on test scores is linear. The fitted straight line of the linear regression is shown against the data. The slope of this line does not differ significantly from 0 (p = 0.44); thus, the effect of age on test score is not found to be statistically significant. If, however, age is partitioned into patients < 65 years versus patients ≥ 65 years of age, and simultaneously the test scores are partitioned into test scores < 3 versus test scores ≥ 3 (B), then a 2 x 2 χ² analysis can be performed (C). The χ² analysis tests whether the proportion of patients scoring < 3 versus ≥ 3 differs between the two age groups (p = 0.04). The data were generated to illustrate that χ² analysis can at times outperform linear regression when the relationship is not linear.
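The partitioning step described in Fig 1 can be sketched in a few lines of Python. This is a minimal illustration, not the analysis used to produce the figure: the function name cross_tabulate and the ten ages and test scores are hypothetical, and the default cut-offs (65 years, score of 3) are simply taken from the caption. The sketch only shows how two continuous measurements become a 2 x 2 table of counts.

```python
def cross_tabulate(ages, scores, age_cut=65, score_cut=3):
    """Build a 2 x 2 table of counts from continuous age and test score.

    Rows:    age < age_cut, age >= age_cut
    Columns: score < score_cut, score >= score_cut
    """
    table = [[0, 0], [0, 0]]
    for age, score in zip(ages, scores):
        row = 0 if age < age_cut else 1
        col = 0 if score < score_cut else 1
        table[row][col] += 1
    return table


# Hypothetical patients, invented purely for illustration.
ages = [48, 52, 57, 61, 64, 66, 69, 73, 77, 81]
scores = [4, 5, 3, 2, 4, 2, 1, 3, 2, 2]

table = cross_tabulate(ages, scores)
for label, (low, high) in zip(("< 65 years", ">= 65 years"), table):
    print(f"{label}: {low} scored < 3, {high} scored >= 3")
```

With these made-up values, 1 of 5 younger patients and 4 of 5 older patients score below 3. A table this small is for illustration only; a real analysis would need far more patients per cell before a χ² test is appropriate.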

 
Unfortunately, or fortunately, depending on one's viewpoint, there is no consensus among either researchers or statisticians on how to deal with the data in the field of neurobehavioral outcomes associated with cardiac surgery. One should remember that there is no single sensitive variable that adequately describes "normal" brain function [1]. The result is that any reviewer can orchestrate a field of articles to support whatever conclusion he or she wants (eg, warm bypass is good for you, pH-stat and dysautoregulation do not cause harm, etc) [2]. How do we sort out the truth? Look at the data.

In a typical drug trial, we expect everyone to have more or less the same response to the drug. In other words, after treatment, every score shifts in the same direction. There is a mean and a standard deviation (SD) because some people react more strongly than others, but all should respond. The response becomes statistically significant if the overall curve shifts far enough (Fig 2).



Fig 2. The effect of the treatment in shifting the mean test scores (and entire distribution) by 20 test score points in comparison with the control group.

 
This is not necessarily the case with cardiac surgery. A case in point is brain injury after cardiac surgery. If CPB caused significant brain injury in and of itself, then we would expect to see a consistent pattern of dysfunction, a "syndrome" similar, for example, to what we see after cardiac arrest. However, what we find after CPB is inconsistent. Some patients may have a right-sided motor dysfunction, others a verbal memory or visual abnormality, and yet others a speech disorder. Also, the "severity" of the lesion, ie, the social significance of the deficit, is not proportional to the volume of the injury. A large right frontal lobe infarction may go undetected, whereas a very small capsular lesion resulting in a right arm paralysis would be considered a catastrophic stroke. What this suggests is that something has happened to a subset of patients that has not happened to the others. For example, in Figure 3, a simulation, we see that after cardiac surgery most people exhibit one of three behaviors on a given test: improve, stay the same, or get worse. Because the brain is a heterogeneous organ, on any given test of a specific behavior, most patients (eg, 85%) will exhibit a practice effect due to learning and show a slight improvement in test score. This increase in the group mean performance score is offset by those patients (eg, 15%) who experience a brain injury, as evidenced by a decline in cognitive function. What results is a small change in the mean and a large increase in the SD.



Fig 3. Simulated distributions of preoperative and 1-month postoperative test scores in both the drug treatment and placebo groups. The preoperative distribution is assumed to be normal (100 ± 20). Both the postoperative treatment and placebo groups, however, are assumed to be a mixture of two normal distributions, one subgroup experiencing a postoperative decline (-40 ± 10%) in test scores and the other subgroup experiencing an improvement (+10 ± 10%) in test scores. The overall distributions resulting from a mix of these two normal distributions are shown, with a higher proportion of the placebo group than of the treatment group experiencing the decline in test scores.
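The arithmetic behind the distributions in Fig 3 can be checked directly. The sketch below computes the mean and SD of a mixture of normal subgroups from their weights, means, and SDs, using the 85%/15% split mentioned in the text and the subgroup changes given for Fig 3 (+10 ± 10% and -40 ± 10%). The function name mixture_mean_sd is hypothetical, and treating the percentage change scores as a simple two-component mixture is an assumption made for illustration.

```python
import math


def mixture_mean_sd(components):
    """Mean and SD of a mixture of subgroups.

    components: list of (weight, mean, sd) tuples; weights need not sum to 1.
    """
    total_weight = sum(w for w, _, _ in components)
    mean = sum(w * m for w, m, _ in components) / total_weight
    # Total variance = within-subgroup variance + between-subgroup variance.
    variance = sum(
        w * (sd ** 2 + (m - mean) ** 2) for w, m, sd in components
    ) / total_weight
    return mean, math.sqrt(variance)


# 85% of patients improve by 10 +/- 10%, 15% decline by 40 +/- 10%.
mean, sd = mixture_mean_sd([(0.85, 10.0, 10.0), (0.15, -40.0, 10.0)])
print(f"group mean change: {mean:.1f}%, group SD: {sd:.1f}%")
```

Under these assumptions the group mean change is +2.5%, little different from zero, while the SD roughly doubles from 10% to about 20.5%.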

 
It has been suggested that "regression to the mean" explains many of the neurobehavioral deficits exhibited after surgery [3, 4]. Regression to the mean is an important factor in physiology and anatomy; for example, very tall parents can expect to have relatively shorter children. For cognitive testing, regression to the mean would suggest that high-scoring individuals will do worse on tests the second time, and that lower-scoring people will do better. What we see in reality is a greater improvement in the brighter group.

A psychological test is designed for maximum test-retest reliability, or repeatability. The idea is to minimize or control for practice effects. Most individuals will get better after repeating a task; surgeons, for example, fortunately do not regress to an average level of performance with practice. So if a typical neurobehavioral test were given to normal subjects, most would show a modest improvement in performance. In fact, to reliably determine the influence of the practice effect in a given study protocol, it is important to incorporate into the study design an age-, gender-, and education-matched control group, and to administer the same test battery to that group at an interval similar to that used for the study population. This was discussed at length in the first Consensus Statement [5].

After cardiac surgery, we see a pattern where the variability in the group scores increases because some patients are uninjured and show improved scores due to practice effects, other patients' scores stay the same, and a further subset shows marked deterioration in test scores. The result is that the overall group mean performance changes little. The "up goers" and the "down goers" offset each other so that the mean stays the same but the SD increases due to the greater dispersion of scores (variance).

This increase in the SD decreases our ability to detect group differences using parametric statistics (eg, t tests). For example, if we were to compare two groups, "treated" and "untreated," we might find no significant difference in mean performance when comparing pre- to postoperative change scores, either between or within groups. However, there may actually be a difference between the groups. In this example of performance on a single test, 25% of the untreated group showed a greater than 20% decline in performance, something never seen in normal controls, whereas only 15% of the treated group showed a decline of that size (Fig 3).

What that means is that 50 of 200 people were impaired in one group versus 30 of 200 in the other. In order for a t test to show a difference in the overall group score, the performance of the additional 20 impaired subjects would have to be bad enough to outweigh the scores of the other 150 subjects in the group mean. Furthermore, all one could say is that the group mean performance was better in the treated group (ie, the use of drug X resulted in an 8% lesser decline in trail-making A scores than in the control group), raising the important issue of clinical relevance.

However, there is another way to look at these numbers, eg, using χ² analysis. The results could be described as follows: the treatment resulted in a 40% reduction in the number of patients with evidence of brain injury, as exhibited by performance on this test (p = 0.02).
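The incidence comparison above can be reproduced with a short calculation. The sketch below runs a 2 x 2 χ² test on the counts given in the text (50 of 200 impaired versus 30 of 200). It applies Yates' continuity correction, which is an assumption rather than something the text states, and it uses the fact that for 1 degree of freedom the χ² tail probability equals erfc(sqrt(χ²/2)). The function name chi_square_2x2 is hypothetical.

```python
import math


def chi_square_2x2(a, b, c, d, yates=True):
    """Chi-square test (1 df) for the 2 x 2 table [[a, b], [c, d]].

    Returns (chi2, p). Yates' continuity correction is an assumption here.
    """
    n = a + b + c + d
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    if margins == 0:
        raise ValueError("every row and column must contain at least one patient")
    diff = abs(a * d - b * c)
    if yates:
        diff = max(0.0, diff - n / 2)
    chi2 = n * diff ** 2 / margins
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p


# Rows: untreated, treated. Columns: impaired, not impaired.
chi2, p = chi_square_2x2(50, 150, 30, 170)
relative_reduction = (50 - 30) / 50
print(f"chi2 = {chi2:.2f}, p = {p:.3f}, reduction = {relative_reduction:.0%}")
```

With the continuity correction the statistic is about 5.64 and p is roughly 0.018, in line with the reported p = 0.02; the relative reduction in impaired patients is (50 - 30)/50 = 40%.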

One way to phrase a hypothesis would be: "Does the treatment have an effect on the overall cognitive test performance of a group of patients undergoing cardiac surgery?" (ie, group mean analysis). Another might be: "Does the treatment result in significantly fewer patients suffering brain injury due to cardiac surgery?" (ie, incidence analysis). Same data, different question, different statistic, different answer. Incidence analysis enables us to examine which specific patients are at risk, and what factors are risk associated. For further reference, the issue of group mean versus incidence analysis is discussed at length in the second Consensus Statement [6].

Which approach best serves our patients and the surgical team [7]? As we have tried to demonstrate, the answer to that is clearly determined by the nature of the question to be answered. Our purpose is to improve outcome after cardiac surgery by making a safe operation safer. We do this by providing feedback to the team about the association between the patients’ risk factors and the modifiable risks that are associated with the circuits, anesthetics, and surgical techniques. We can serve our patients best by asking the right questions.

Addendum

Given the heterogeneous nature of brain anatomy and function, no single test adequately describes normal function, which is why a typical assessment battery includes 10 or more tests [1, 7]. Every effort should be made to maximize individual test independence and to minimize overlap between cognitive domains [5]. Statistically, a categorical (deficit/no-deficit) approach using a nonparametric analysis can be more powerful, as the simulation described below illustrates.

There are several ways neurobehavioral deficit studies can be analyzed. The choice of statistical test determines what question is being answered. Analysis of variance designs, including t tests, repeated-measures analysis, etc, test for shifts in the mean effect between groups. In contrast, 2 x 2 χ² tables can test for a difference in the proportion of patients experiencing substantial deficits (ie, >20% decline from preoperative scores).

A power analysis was performed using simulated data to compare the ability of a t test (the simplest of the analysis of variance designs) versus a 2 x 2 χ² test to detect pre- to postoperative changes in total test scores. Total test scores were the sum of 10 different independent tests. For the 2 x 2 χ² tests, a patient was defined as having an overall deficit if he or she showed a 20% decline in test score on two or more tests [7]. To simplify the power calculations, these tests were assumed independent in that none of the brain regions evaluated by the several tests overlapped. Thus, the effects of any specific neurologic lesion would be detectable on only one of the tests. Although this requirement of test independence is perhaps experimentally unrealistic, it was necessary to avoid having to make elaborate assumptions about correlations between each pair of tests.

The power analysis was based upon the following simplifying assumptions. (1) Each of the 10 tests was independent (as described above). (2) The preop test scores were normally distributed (mean ± SD: 100 ± 20). (3) All postop test scores improved 10 ± 10% due to test learning. (4) In the placebo group, for each test taken, there was an additional 9.64% probability of a test score deficit of -40 ± 10%, giving a net individual test score decrement of -30% (10% learning minus 40% deficit). The 9.64% probability of test score deficit for each of the 10 different tests results in a 25% probability that a patient will be classified as having an overall deficit (ie, two or more of the test scores will simultaneously show a 20% decline, based upon the binomial distribution).

(5) Likewise, in the treatment group, there was a 6.95% probability of an individual test score deficit (-40 ± 10% test score decrease), giving a 15% probability of being classified as having an overall neurobehavioral deficit. (6) Sample size: 200 patients receiving the placebo and 200 patients receiving treatment.
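The link between the per-test probabilities in assumptions (4) and (5) and the overall 25% and 15% deficit rates follows from the binomial distribution, and can be verified with the sketch below. The function name prob_overall_deficit is hypothetical; the calculation treats the per-test probability as the chance that a single test registers a deficit and relies on the 10 tests being independent, as stated in assumption (1).

```python
from math import comb


def prob_overall_deficit(p_test, n_tests=10, min_deficits=2):
    """Probability that at least min_deficits of n_tests independent tests
    register a deficit, when each does so with probability p_test."""
    p_fewer = sum(
        comb(n_tests, k) * p_test ** k * (1 - p_test) ** (n_tests - k)
        for k in range(min_deficits)
    )
    return 1 - p_fewer


for group, p_test in (("placebo", 0.0964), ("treatment", 0.0695)):
    print(f"{group}: overall deficit probability = "
          f"{prob_overall_deficit(p_test):.3f}")
```

Both values round to the 25% and 15% quoted above.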

Based on 2,000 simulated data sets, the above χ² analysis found a significant treatment effect compared with placebo 66% of the time, whereas the t test found significance only 6.7% of the time.
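The χ² arm of such a power analysis can be approximated with a small Monte Carlo sketch. This is not the simulation reported above: it skips the test-score model and draws each patient's deficit status directly from the overall 25% and 15% probabilities, assumes Yates' continuity correction and a two-sided alpha of 0.05, and uses hypothetical function names and an arbitrary random seed. The t test arm is not sketched because it depends on details of the score model that are not fully specified here.

```python
import math
import random


def yates_chi_square_p(a, b, c, d):
    """p value of a continuity-corrected chi-square test (1 df) on [[a, b], [c, d]]."""
    n = a + b + c + d
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    if margins == 0:
        return 1.0  # degenerate table: no evidence of a difference
    diff = max(0.0, abs(a * d - b * c) - n / 2)
    chi2 = n * diff ** 2 / margins
    return math.erfc(math.sqrt(chi2 / 2))


def estimate_power(p_placebo=0.25, p_treated=0.15, n_per_group=200,
                   n_sims=2000, alpha=0.05, seed=2000):
    """Fraction of simulated trials in which the chi-square test is significant."""
    rng = random.Random(seed)  # arbitrary seed, for repeatability only
    significant = 0
    for _ in range(n_sims):
        placebo = sum(rng.random() < p_placebo for _ in range(n_per_group))
        treated = sum(rng.random() < p_treated for _ in range(n_per_group))
        p = yates_chi_square_p(placebo, n_per_group - placebo,
                               treated, n_per_group - treated)
        if p < alpha:
            significant += 1
    return significant / n_sims


print(f"estimated chi-square power: {estimate_power():.2f}")
```

Under these assumptions the estimated power should land in the neighborhood of the 66% reported above, varying slightly with the seed and the number of simulations.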

References

  1. Stump D.A. Selection and clinical significance of neuropsychologic tests. Ann Thorac Surg 1995;59:1331-1335.
  2. Gill R., Murkin J.M. Neuropsychologic dysfunction after cardiac surgery. J Cardiothorac Vasc Anesth 1996;10:91-98.
  3. Browne S.M., Halligan P.W., Wade D.T., Taggart D.P. Cognitive performance after cardiac operation. J Thorac Cardiovasc Surg 1999;117:481-485.
  4. Mee R.W., Chua T.C. Regression towards the mean and the paired sample t test. The American Statistician 1991;45:39-41.
  5. Murkin J.M., Newman S., Stump D.A., Blumenthal J. Statement of consensus on assessment of neurobehavioral outcomes after cardiac surgery. Ann Thorac Surg 1995;59:1289-1295.
  6. Murkin J.M., Stump D.A., Blumenthal J.A., McKhann G. Defining dysfunction. Ann Thorac Surg 1997;64:904-905.
  7. Stump D.A., Rogers A.T., Hammon J.W. Neurobehavioral tests are monitoring tools used to improve cardiac surgery outcome. Ann Thorac Surg 1996;61:1295-1296.



