Ann Thorac Surg 2000;70:1782-1785
© 2000 The Society of Thoracic Surgeons
a Department of Anesthesiology, Wake Forest University School of Medicine, Winston-Salem, North Carolina, USA
b Department of Anesthesiology, The University of Western Ontario, London Health Science Center, London, Ontario, Canada
Address reprint requests to Dr Stump, Department of Anesthesiology, Wake Forest University School of Medicine, Medical Center Blvd, Winston-Salem, NC 27157
e-mail: dstump{at}wfubmc.edu
Presented at Outcomes 2000, "The Key West Meeting," Key West, FL, May 24-28, 2000.
Experimental design and statistics are two different but related arts. The art of experimental design is how you ask the question, and how you ask the question dictates which statistics you use to answer the question.
How you phrase the question depends upon your assumptions. For example, if your question is: "Are age and bypass time associated with negative outcomes after cardiac surgery?" then a common statistical approach is to make a tactical decision, treat both sets of numbers as continuous variables, and use a linear regression model. The problem is in the assumption that the variables are continuous. All of us know empirically that the 5 years between 40 and 45 do not impart the same risk or rate of deterioration as the 5 years between 70 and 75 (nor do we all age at the same rate; as Indiana Jones said, "it's not the years, it's the mileage"). Furthermore, during cardiopulmonary bypass (CPB), every minute past 100 carries a much greater risk than the preceding 100 minutes. Why? Because an extended bypass time generally indicates that some complication, usually bleeding, has occurred; it is not the time on bypass per se that matters, but what the time signifies, eg, some anomaly. Most surgeons have an average time to perform a three-vessel coronary artery bypass graft (CABG) with a relatively small standard deviation. The outliers are usually problem cases and skew the curve with an extended tail to the right. Therefore, age and CPB time are not continuous variables but categorical (young-old, long-short) and can be analyzed using a strategy like a 2 x 2 or 3 x 3 χ² as in Table 1.
In a typical drug trial, we expect everyone to have more or less the same response to the drug. In other words, after treatment, every score shifts in the same direction. There is a mean and a standard deviation (SD) because some people have a greater reaction than others, but they should all respond. The response becomes significant if the overall curve shifts far enough (Fig 2).
A psychological test is designed for maximum test-retest reliability, or repeatability. The idea is to minimize or control for practice effects. Most individuals will get better after repeating a task; surgeons, for example, fortunately do not regress to an average level of performance with practice. So if a typical neurobehavioral test were given to normal subjects, most would show a modest improvement in performance. In fact, to reliably determine the influence of the practice effect in a given study protocol, it is important to incorporate into the design of the study an age-, gender-, and education-matched control group, and to administer the same test battery at a similar interval to the study population. This was discussed at length in the first Consensus Statement [5].
After cardiac surgery, we see a pattern where the variability in the group scores increases because some patients are uninjured and show improved scores due to practice effects, other patients' scores stay the same, and a further subset shows marked deterioration in test scores. The result is that the overall group mean performance changes little. The "up goers" and the "down goers" offset each other so that the mean stays the same but the SD increases due to the greater dispersion of scores (variance).
This increase in the SD decreases our ability to detect group differences using parametric statistics (t tests). For example, if we were to compare two groups, "treated" and "untreated," we may find no significant difference between the mean performance of the two groups when comparing their pre- to postperformance (change score) either between or within groups. However, there may actually be a difference between the groups. In this example of performance on a single test, 25% of the untreated group actually showed a greater than 20% decline in performance, something never seen in normal controls. Furthermore, only 15% of the treated group showed a decline of 20% in performance (Fig 3).
What that means is that 50 of 200 people were impaired in one group versus 30 of 200 in the other. For a t test to show a difference in the overall group score, the performance of the additional 20 impaired subjects would have to be bad enough to pull down the scores of the other 150 subjects. Furthermore, all one could say is that the group mean performance was better in the treated group (ie, the use of drug X resulted in an 8% lesser decline in trail-making A scores than in the control group), raising the important issue of clinical relevance.
However, there is another way to look at these numbers, eg, using χ² analysis. The results could be described as follows: the treatment resulted in a 40% reduction in the number of patients with evidence of brain injury as exhibited by the performance on this test (p = 0.02).
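The incidence comparison above (50 of 200 impaired versus 30 of 200) can be checked with a short calculation. The sketch below is illustrative only: the function name `chi2_2x2` is hypothetical, and the text does not say whether a continuity correction was used, so both the uncorrected and the Yates-corrected statistics are computed. For 1 degree of freedom, the χ² tail probability equals erfc(√(χ²/2)), which is available in the Python standard library.

```python
import math

def chi2_2x2(a, b, c, d, yates=False):
    """Pearson chi-square (1 df) for the 2 x 2 table [[a, b], [c, d]].

    Returns (chi2, p). Hypothetical helper written for this illustration.
    """
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        # Yates continuity correction (an assumption; the article does not
        # state whether a correction was applied)
        diff = max(diff - n / 2, 0.0)
    chi2 = n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of the chi-square distribution with 1 df
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

# Rows: untreated, treated; columns: impaired, not impaired
untreated = (50, 150)
treated = (30, 170)

chi2_plain, p_plain = chi2_2x2(*untreated, *treated)
chi2_yates, p_yates = chi2_2x2(*untreated, *treated, yates=True)

# Relative reduction in the proportion of impaired patients
relative_reduction = 1 - (treated[0] / 200) / (untreated[0] / 200)

print(f"uncorrected: chi2 = {chi2_plain:.2f}, p = {p_plain:.4f}")
print(f"Yates:       chi2 = {chi2_yates:.2f}, p = {p_yates:.4f}")
print(f"relative reduction in incidence = {relative_reduction:.0%}")
```

Under these assumptions the relative reduction is 40%, and the Yates-corrected p value rounds to the 0.02 quoted in the text.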
One way to phrase a hypothesis would be: "Does the treatment have an effect on the overall cognitive test performance of a group of patients undergoing cardiac surgery?" (ie, group mean analysis). Another might be: "Does the treatment result in significantly fewer patients suffering brain injury due to cardiac surgery?" (ie, incidence analysis). Same data, different question, different statistic, different answer. Incidence analysis enables us to examine which specific patients are at risk, and which factors are associated with that risk. For further reference, the issue of group mean versus incidence analysis is discussed at length in the second Consensus Statement [6].
Which approach best serves our patients and the surgical team [7]? As we have tried to demonstrate, the answer is determined by the nature of the question to be answered. Our purpose is to improve outcome after cardiac surgery by making a safe operation safer. We do this by providing feedback to the team about the association between the patient's risk factors and the modifiable risks that are associated with the circuits, anesthetics, and surgical techniques. We can serve our patients best by asking the right questions.
Addendum
Given the heterogeneous nature of brain anatomy and function, no single test adequately describes normal function, which is why a typical assessment battery includes 10 or more tests [1, 7]. Every effort should be made to maximize individual test independence and to minimize overlap between cognitive domains [5]. Statistically, a categorical (deficit/no-deficit) approach using a nonparametric analysis is more powerful.
There are several ways neurobehavioral deficit studies can be analyzed. The choice of statistical test determines what question is being answered. Analysis of variance designs, including t tests, repeated-measures analysis, etc, test for shifts in the mean effect between groups. In contrast, χ² 2 x 2 tables can test for a difference in the proportion of patients experiencing substantial deficits (ie, >20% decline from preoperative scores).
A power analysis was performed using simulated data to compare the ability of a t test (the simplest of the analysis of variance designs) versus a 2 x 2 χ² test to detect preoperative changes in total test scores. Total test scores were the sum of 10 different independent tests. For the 2 x 2 χ² tests, a patient was defined as having an overall deficit if he scored a 20% decline in test score in two or more tests [7]. To simplify the power calculations, these tests were assumed independent in that none of the brain regions evaluated by the several tests overlapped. Thus, the effects of any specific neurologic lesion would be detectable on only one of the tests. Although this requirement of test independence is perhaps experimentally unrealistic, it was necessary to avoid having to make elaborate assumptions about correlations between each pair of tests.
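The deficit definition above (a 20% decline on two or more tests) can be written as a small classification rule. This is a minimal sketch, not the authors' code: the function name, argument names, and the choice to treat a decline of exactly 20% as a deficit are all assumptions made for illustration.

```python
def has_overall_deficit(pre_scores, post_scores, decline=0.20, min_tests=2):
    """Return True if at least `min_tests` tests show a relative decline
    of `decline` or more from the preoperative score.

    Hypothetical helper; whether the threshold is inclusive is an assumption.
    """
    if len(pre_scores) != len(post_scores):
        raise ValueError("pre and post score lists must have the same length")
    declined = 0
    for pre, post in zip(pre_scores, post_scores):
        # Relative change from baseline; positive means the score dropped
        relative_drop = (pre - post) / pre
        if relative_drop >= decline:
            declined += 1
    return declined >= min_tests

# Ten tests with a baseline of 100 on each (illustrative numbers only)
baseline = [100] * 10
two_declines = [110] * 8 + [75, 70]   # two tests fall by 25% and 30%
one_decline = [110] * 9 + [70]        # only one test falls by 30%

print(has_overall_deficit(baseline, two_declines))  # True
print(has_overall_deficit(baseline, one_decline))   # False
```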
The power analysis was based upon the following simplifying assumptions. (1) Each of the 10 tests was independent (described above). (2) The preop test scores were normally distributed (mean ± SD: 100 ± 20). (3) All postop test scores improved 10 ± 10% due to test learning. (4) In the placebo group, for each test taken, there was an additional 9.64% probability of a test score deficit of -40 ± 10%, giving a net individual test score decrement of -30% (10% learning minus 40% deficit). The 9.64% probability of test score deficit for each of the 10 different tests results in a 25% probability that a patient will be classified as having an overall deficit (ie, two or more of the test scores will simultaneously have a 20% decline, based upon the binomial distribution).
(5) Likewise in the treatment group, there was a 6.95% probability of an individual test score deficit (-40 ± 10% test score decrease), giving a 15% probability of being classified as having an overall neurobehavioral deficit. (6) Sample size: 200 patients receiving the placebo and 200 patients receiving treatment.
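The step from per-test deficit probabilities (9.64% and 6.95%) to overall deficit probabilities (25% and 15%) follows from the binomial distribution with 10 independent tests. The sketch below reproduces that arithmetic; the function name `prob_at_least` is hypothetical. It simplifies by treating each test's outcome as a yes/no deficit, as the text itself does.

```python
from math import comb

def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), via the complement of P(X < k)."""
    below = sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k))
    return 1 - below

N_TESTS = 10       # tests in the battery, assumed independent
MIN_DEFICITS = 2   # a patient is classified as impaired with 2 or more

placebo_overall = prob_at_least(MIN_DEFICITS, N_TESTS, 0.0964)
treated_overall = prob_at_least(MIN_DEFICITS, N_TESTS, 0.0695)

print(f"placebo:   P(overall deficit) = {placebo_overall:.3f}")
print(f"treatment: P(overall deficit) = {treated_overall:.3f}")
```

With these inputs the overall probabilities come out at approximately 0.25 and 0.15, matching the 25% and 15% stated in assumptions (4) and (5).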
Based on performance of 2,000 data simulations, the above χ² analysis found significant treatment effects compared with placebo 66% of the time, but the t test found significance only 6.7% of the time.