A few weeks ago, BMC Pediatrics published an article that purports to show that Hyperbaric Oxygen Therapy (HBOT) can produce “…significant improvements in overall functioning, receptive language, social interaction, eye contact, and sensory/cognitive awareness…” in autistic children. This study (Rossignol et al., 2009) is billed as a “…multicenter, randomized, double-blind, controlled trial.”
It’s all that and much, much less.
Let’s start by looking at the six “centers” where this research was carried out.
The International Child Development Research Center (ICDRC):
This imposing name is attached to a rather less imposing edifice. The ICDRC, brainchild of Dr. Jeffrey Bradstreet, is located in a strip mall in Melbourne, Florida, where it not only carries out “cutting-edge research” but also sells a complete line of “supplements” and treats autistic children with a dizzying array of “alternative”, “biomedical” and “integrative” therapies, including HBOT.
Daniel Rossignol MD (Family Practice), Lanier Rossignol (Nurse Practitioner) and Scott Smith (Physician’s Assistant) were the authors from the ICDRC.
The Center for Autism Research and Education (CARE):
This “center” is located in Phoenix, Arizona and has – according to its website – a single practitioner, Cynthia Schneider, MD (OB/Gyn), who is also an author on this paper. One of the “integrative” therapies this “center” offers is HBOT.
One of the other authors, Sally Logerquist, is a PhD psychologist who – according to the paper – is also associated with CARE, but also appears to run social skills therapy groups for autistic children using the “Logerquist Excellent Attitude Program” (LEAP).
True Health Medical Center:
It’s rather difficult to find anything about this “center”, apart from the fact that it is located in Naperville, Illinois – in what appears to be an office complex. Anju Usman, MD (Family Practice) is the author associated with this location.
Although not officially called a “center”, the office of James Neubrander, MD (Pathology) is apparently one of the “centers” of this study. His office is located in the Menlo Park Mall (near Macy’s) and offers – you guessed it! – HBOT as a treatment for autism.
Princess Anne Medical Associates:
A Family Practice medical group in Virginia Beach, Virginia, this “center” is the home of Eric Madren, MD (Family Practice). It’s not clear if this four-physician practice offers HBOT.
The Rimland Center for Integrative Medicine:
A small, one-physician “center” in Lynchburg, Virginia, this is the practice location of author Elizabeth Mumper, MD (Pediatrics). Not surprisingly, this “center” sells HBOT services for autistic children.
So, of the six “centers” involved in this study, five are single-physician operations. The remaining “center” has two physicians (three, if you count the naturopath).
Well, what about the research itself? Maybe that’s better than the “facilities” might suggest. Let’s take a look.
This study initially enrolled 62 children (33 treatment; 29 control), but only 29 of the treatment group and 26 of the control group finished all 40 sessions. For reasons that pass my understanding, one treatment subject who only finished 9 sessions was included in the analysis. The authors stated that including this subject did not alter results, which raises the question: “Why did they include this subject if it made no difference?”
The authors used the Aberrant Behavior Checklist (ABC), the Clinical Global Impression (CGI) scale and the Autism Treatment Evaluation Checklist (ATEC) as their outcome measures. All except the ATEC are widely accepted for use in autism treatment trials.
The ABC is a 58-question checklist of – surprise! – aberrant behaviors which are each given a score from “0” (“not at all a problem”) to “3” (“severe problem”). This test has been used – and validated – in a number of disorders, including autism. It gives a global score as well as five subscales: a total of six measures.
The CGI is a generic rating scale used in a variety of clinical trials. For each parameter (e.g. “overall functioning”, “sleep pattern”), the rater gives a score of between “1” (“very much improved”) and “7” (“very much worse”). The authors had both the treating physician and the parents rate the subjects on overall improvement and eighteen discrete parameters: a total of 38 measures in all (19 by the physician and 19 by the parents).
The ATEC was developed by Bernie Rimland and Stephen Edelson and has not been validated. In fact, it has only been used in two published studies – one by Rossignol et al. The ATEC has 25 questions on which the evaluator rates the subject on either a three-point (“not true”, “somewhat true”, “very true”) or four-point (“not a problem”, “minor problem”, “moderate problem”, “serious problem”) scale. It provides a total score and four subscales: a total of five measures.
In all, each subject had a total of 49 evaluation measures (CGI scores and the change in ABC and ATEC scores), of which 47 are independent. The importance of this will become apparent in the section on statistical analysis.
As I mentioned above, the decision to include one treatment subject who only completed nine sessions was curious. Why they included this subject and not any of the other three treatment subjects and three control subjects who also failed to complete the entire course of the study is concerning. The smart thing – and the proper response – would have been to drop this subject from analysis.
The authors’ method of analyzing the CGI scales was also curious. Rather than simply using the scores as they were provided, they took the scores and subtracted them from four (the “no change” score). There are a few problems with this.
For starters, the scores are not linear – the difference between “much improved” and “very much improved” is not necessarily the same as between “no change” and “minimally improved”. Nor is the difference between “no change” and “much improved” twice the difference between “much improved” and “very much improved”. For that reason, these types of numerical scores are often referred to as “pseudo-numbers”.
This may seem like nit-picking, but it is a serious concern. Imagine, if you will, that the numbers were replaced by colors. Is the difference between green and orange twice the difference between orange and red? If half of a population of birds are blue and the other half are yellow, is the “average” bird green? The simple fact is that it is not appropriate to treat these “scores” as though they were real numbers, to be added, subtracted and averaged.
Secondly, it appears that the authors used parametric statistics for their analysis of the CGI scores. This is a problem since – as I indicated above – it is nonsensical to do math on pseudo-numbers. I don’t have the raw numbers, so it isn’t possible for me to calculate the absolute impact of this mistake for all of the CGI subclasses, but I can figure out the raw numbers for one group, so let’s look at that one.
It took a little work, but the authors gave enough clues to tease out the raw numbers in the physician “overall functioning” CGI score. The treatment group had an “average” of 2.87 and the control group’s “average” was 3.62; using the unaltered data, a t-test [Note: not an appropriate use of the t-test] gives a p-value of 0.0006, not far from what the authors report. When a more appropriate statistical test [Mann-Whitney U-test] is used, the p-value is 0.002, very different from the reported 0.0008. While this is still less than the threshold p-value of 0.05, see below for a discussion of multiple comparisons.
All of these statistical analyses of the CGI scores ignore the fact that these are pseudo-numbers and need to be treated as discrete groups rather than as actual numbers. In truth, even the ABC and ATEC scores should have been treated this way, as well, although it is fairly common practice to treat such multi-factor scores as real numbers. A Chi-square test or Fisher Exact test would be the ideal test, but the problem with that is that the treatment group has one score of “1” (very much improved) and the control group doesn’t. Likewise, the control group has two subjects with a score of “5” (minimally worse) and the treatment group has none. This prevents a Chi-square or Fisher test from comparing each score independently.
One solution is presented by the authors themselves, although they apparently didn’t use it. In their discussion of the CGI, the authors said:
“Children who received a score of ‘very much improved’ or ‘much improved’ on the physician CGI overall functioning score were considered to be ‘good responders’ to treatment.”
If we “bin” the scores into “good responders” and “others”, we find that there were 9 (out of 30 – 30%) “good responders” in the treatment group compared to 2 (out of 26 – 8%) in the control group. Unfortunately, this is not a statistically significant difference (p = 0.08) in the (Yates) Chi-square test and barely reached significance (p = 0.05, but see below) in the Fisher Exact test.
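For readers who want to check the binned comparison themselves, here is a minimal sketch in plain Python. It assumes the 2x2 table implied by the counts above (9 of 30 "good responders" in the treatment group, 2 of 26 in the control group); the function names are my own, and the two-sided Fisher test uses the common "sum every table no more likely than the observed one" convention, which is an assumption about how the p-value was originally computed.

```python
from math import comb, erfc, sqrt

def yates_chi_square(a, b, c, d):
    """Yates-corrected chi-square test for the 2x2 table [[a, b], [c, d]].

    Returns (statistic, p_value) with 1 degree of freedom.
    """
    n = a + b + c + d
    # Continuity correction: shrink |ad - bc| by n/2, but never below zero.
    diff = max(abs(a * d - b * c) - n / 2, 0)
    stat = n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = erfc(sqrt(stat / 2))
    return stat, p_value

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher Exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(k):
        # Hypergeometric probability of k "good responders" in row 1.
        return comb(row1, k) * comb(row2, col1 - k) / total_tables

    p_observed = prob(a)
    k_min = max(0, col1 - row2)
    k_max = min(row1, col1)
    # Add up every table that is no more likely than the observed one.
    return sum(prob(k) for k in range(k_min, k_max + 1)
               if prob(k) <= p_observed * (1 + 1e-9))

# Rows: treatment, control. Columns: good responders, others.
chi2_stat, chi2_p = yates_chi_square(9, 21, 2, 24)
fisher_p = fisher_exact_two_sided(9, 21, 2, 24)
print(f"Yates chi-square = {chi2_stat:.2f}, p = {chi2_p:.2f}")
print(f"Fisher Exact p = {fisher_p:.2f}")
```

Both p-values should land close to the figures quoted above; the small gap between them is why one test "fails" and the other "barely passes" the 0.05 threshold.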
An even bigger problem in the statistical analysis was the failure to correct for multiple comparisons. This problem was brought up by one of the reviewers, and the authors responded by eliminating a table. They did not make the appropriate corrections.
The reason that multiple comparisons are a problem is that the analysis for statistical significance is based on probability. If the probability (the p-value) that the difference between the two groups (treatment and control) is due to random chance is equal to or less than 5%, that difference is considered to be “statistically significant” and accepted as real. That means that there is still a 5% (or less – look to the p-value) chance that the difference is due to chance and not real.
If multiple comparisons are made on the same group of subjects, the probability that one (or more) of them will be “statistically significant” by chance starts to climb. If 14 comparisons are made, the chance of an erroneous “statistical significance” is over 50%. If 47 independent comparisons are made – as in this study – the chance of an erroneous “statistical significance” is over 90%.
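The arithmetic behind those percentages is short enough to sketch. Assuming the comparisons are independent and each uses a 5% threshold, the chance of at least one false positive among n comparisons is 1 - (1 - 0.05)^n. The function name below is mine, chosen for illustration.

```python
def familywise_error_rate(n_comparisons, alpha=0.05):
    """Chance of at least one false positive across independent comparisons."""
    return 1 - (1 - alpha) ** n_comparisons

# One comparison, the 14-comparison tipping point, and this study's 47.
for n in (1, 14, 47):
    print(f"{n:2d} comparisons: {familywise_error_rate(n):.1%}")
```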
For this reason, it is standard procedure to apply a correction for multiple comparisons. The most well-known (and simplest) of these is the Bonferroni Correction, which changes the threshold for statistical significance by dividing it by the number of comparisons. In the case of this study, the threshold (normally p less than or equal to 0.05 or 5%) is reduced to 0.001.
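As a rough sketch of what the correction does here, the snippet below divides the usual 0.05 threshold by the 47 independent comparisons and checks two p-values quoted earlier in this post for the physician "overall functioning" CGI score: the 0.0008 the authors reported and the 0.002 from the Mann-Whitney U-test. The function and variable names are illustrative only.

```python
def bonferroni_threshold(alpha, n_comparisons):
    """Per-comparison significance threshold after Bonferroni correction."""
    return alpha / n_comparisons

threshold = bonferroni_threshold(0.05, 47)
print(f"Corrected threshold: {threshold:.4f}")

# p-values quoted earlier in this post for the physician CGI
# "overall functioning" score.
p_values = {
    "reported by the authors": 0.0008,
    "Mann-Whitney U-test": 0.002,
}
for label, p in p_values.items():
    verdict = "significant" if p <= threshold else "not significant"
    print(f"{label}: p = {p} -> {verdict}")
```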
Applying the appropriate correction for multiple comparisons changes the results of this study significantly. Only the physician CGI scores for overall functioning and receptive language reach significance – and these numbers are already suspicious because they were improperly handled to begin with. In fact, as I have shown above, the CGI “overall functioning” p-value wouldn’t reach significance. It is possible that – if the proper statistical tests were used – the CGI score for “receptive language” would also not reach significance.
Another curious thing. The authors asked the parents after the study whether they thought their child was in the treatment or the control group. Rather than say that the parents’ guesses were no better than random chance (i.e. 50%), the authors stated:
“…there was no significant difference between the two groups in the ability of the parents to correctly guess the group assignment of their child.”
As I said, this was a curious way to put it. As I read this, all it says is that each group of parents was equally able to guess which group their child was assigned to. That could be a 50% accuracy (which would be equal to chance), but a 90% or 99% accuracy – if both groups were that accurate – would also fit that description.
Now, this could simply be clumsy phrasing by the authors, or it could be a way to make it sound like their blinding was successful when it actually was not.
This study may have collected some useful data, but its analysis of that data rendered it useless. The CGI scores – where the only statistically significant result was (possibly) seen – were improperly manipulated and the wrong statistical analysis was used.
The other issue is that there is no discussion of why HBOT is thought to be superior to providing the same partial pressure of oxygen at room pressure. This study used 24% oxygen at 1.3 atm, which gives the same partial pressure of oxygen as 31% at sea level. This concentration of oxygen can be easily attained with an oxygen mask or simple oxygen tent – both of which are vastly less expensive than HBOT.
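The equivalence above is just Dalton's law: the partial pressure of oxygen is its fraction of the gas mixture times the total pressure. A minimal sketch, with names of my own choosing and sea-level pressure taken as 1.0 atm:

```python
def oxygen_partial_pressure(o2_fraction, total_pressure_atm):
    """Partial pressure of oxygen (atm) by Dalton's law."""
    return o2_fraction * total_pressure_atm

# The study's conditions: 24% oxygen at 1.3 atm.
chamber_po2 = oxygen_partial_pressure(0.24, 1.3)

# Oxygen fraction that gives the same partial pressure at 1.0 atm.
sea_level_fraction = chamber_po2 / 1.0

print(f"Chamber pO2: {chamber_po2:.3f} atm")
print(f"Equivalent oxygen fraction at sea level: {sea_level_fraction:.0%}")
```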
If the authors are arguing that the mild pressure of their inflatable HBOT chambers contributes to the treatment effect, they need to look at the literature on cell membrane compressibility. For those who want to do the calculations at home, the bulk modulus of water (the major component of cells) is 21,700 atm. This means that a 0.3 atm increase in pressure will reduce the cell volume by 0.0014%. The bulk modulus of the lipid bilayer in cell membranes is around 30,000 atm. This means that an increase of 0.3 atm pressure causes a 0.0010% reduction in membrane volume. These are well below the threshold for any clinical effects.
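For anyone doing the calculations at home, the fractional volume change is approximately the pressure increase divided by the bulk modulus. The sketch below uses the two bulk modulus figures quoted above; the function name is mine, and the linear approximation is only reasonable because the pressure change is tiny compared with the modulus.

```python
def percent_volume_reduction(pressure_increase_atm, bulk_modulus_atm):
    """Approximate volume reduction (%) for a small pressure increase."""
    return pressure_increase_atm / bulk_modulus_atm * 100

# 0.3 atm above ambient, as in the inflatable chambers.
water_pct = percent_volume_reduction(0.3, 21_700)
membrane_pct = percent_volume_reduction(0.3, 30_000)

print(f"Water: {water_pct:.4f}% reduction")
print(f"Lipid bilayer: {membrane_pct:.4f}% reduction")
```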
Real pressure effects on the central nervous system are seen at pressures over 19 atm. These effects are:
postural and intention tremors
fatigue and somnolence
decreased intellectual and psychomotor performance
poor sleep with nightmares
increased slow wave and decreased fast wave activity in EEG
None of these effects could be construed as “improvements”, even in autism.
So, this study fails to answer the following questions about HBOT and autism:
 Does HBOT improve any feature of autism?
 If so, is HBOT any better than supplemental oxygen (which is much cheaper)?
The only real effect of this study was to give a cover of legitimacy to practitioners who are already using HBOT to “treat” autism.