Discussion of the statistics of the NEDU study on the redistribution of decompression stop time from shallow to deep stops


Gozo Diver
Hi folks,

I've recently put together a piece which might be of interest to some of you here. It's an explainer and discussion of the maths behind the NEDU study on shallow and deep stops (which I think most of you will be familiar with): Discussion of the statistics of the NEDU study on the Redistribution of Decompression Stop Time from Shallow to Deep Stops – Joseph's Diving Log

In my experience with divers over the years, I've encountered some confusion about certain aspects of this study, most especially when it comes to interpretation of the numbers, so I bit the bullet and tried to provide an explanation for those who are interested but might not necessarily be equipped with the maths skills to understand the stats. Hope some of you find this helpful.

Happy & safe diving!

Joseph
 
I don’t necessarily agree with the one-sided test being inappropriate for this study. If their null is that there is no difference in DCS between the two profiles, and they use a one-sided Fisher's exact test and it comes up statistically significant, that should be sufficient to conclude that the deep profile did in fact have a higher incidence.

For example, if you flipped a coin and wanted to know "is this coin favoring tails?", a statistically significant result would mean it favors tails, but a non-significant result could mean either that it favors heads or that it doesn't favor anything; you can't tell which, because it was a one-sided test. A two-sided test could tell us whether it favors heads or favors tails.
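For the curious, here's a minimal sketch of that coin analogy using SciPy; the flip counts are made up purely for illustration:

```python
# Coin-flip analogy: one-sided vs two-sided binomial test.
# The counts below are invented purely for illustration.
from scipy.stats import binomtest

n_flips, n_tails = 100, 59  # hypothetical: 59 tails in 100 flips

# One-sided question: "does this coin favor tails?"
one_sided = binomtest(n_tails, n_flips, p=0.5, alternative='greater')
# Two-sided question: "is this coin unfair in either direction?"
two_sided = binomtest(n_tails, n_flips, p=0.5, alternative='two-sided')

print(f"one-sided p = {one_sided.pvalue:.3f}")  # ~0.044: significant at 0.05
print(f"two-sided p = {two_sided.pvalue:.3f}")  # ~0.089: not significant
```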

If the NEDU study's p-value on the one-sided test had not been significant, it wouldn't tell us anything: it could mean the profiles are equal, or that the other one was actually better. But since the one-sided test ended up statistically significant, showing that deep stops result in more DCS, then if you agree with their methods, sample size, and p-value, this should be a valid conclusion that deep stops truly did cause more DCS. In this study, with the statistically significant result, you should not need a two-tailed test.
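And here's what that looks like on a 2x2 table of the NEDU kind. I'm quoting the DCS counts (10/198 deep-stop dives vs 3/192 shallow-stop dives) from memory, so check them against the published report before relying on the exact p-values:

```python
# Fisher's exact test on the 2x2 DCS table, one-sided vs two-sided.
# Counts are recalled from the published report and should be verified
# before reuse.
from scipy.stats import fisher_exact

table = [[10, 188],   # deep stops:    DCS cases, no-DCS dives
         [3, 189]]    # shallow stops: DCS cases, no-DCS dives

_, p_one_sided = fisher_exact(table, alternative='greater')
_, p_two_sided = fisher_exact(table, alternative='two-sided')

print(f"one-sided p = {p_one_sided:.3f}")  # just under 0.05 with these counts
print(f"two-sided p = {p_two_sided:.3f}")  # above 0.05 with these counts
```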

@Dr Simon Mitchell may be able to shed some more light
 
I don’t necessarily agree with the one-sided test being inappropriate for this study.

To be clear, that's absolutely not what I said. As I actually make clear in the article itself, in the context of the question that is being asked, the test is appropriate. Here is one direct quote (amongst others): "Given the context within which the study was carried out, namely the fact that the U.S. Navy would depart from continuing to use shallow-stop profiles ONLY in case of the “finding of significantly lower PDCS [probability of DCS] for the bubble model schedule [deep-stops profiles] than the gas content model schedule [shallow-stops profile]” the choice of a one-sided test is appropriate, as the authors point out." I repeat that in other places as well.

The finer point is that the ideal scenario would be to state the null hypothesis as in the "Two-sided Framework" box; here, one starts off from a stance of not knowing if there is a difference. The statistical significance from a two-sided test is stronger, albeit harder to achieve. (You might want to revisit the green box.) However, that would also mean having to run the experiment on larger numbers, which carries with it ethical implications given that the end-point is DCS. It would be great if we could have larger numbers on which a two-sided test still returns statistical significance. That, however, is different from saying the test being used in the study is not appropriate (which, I repeat, I do not say).
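To give a rough sense of what "larger numbers" means in practice, here is a hedged sample-size sketch using statsmodels; the assumed DCS incidences are illustrative placeholders, not figures from the study:

```python
# Rough sample-size sketch: how many subjects per group a two-sided test
# needs compared with a one-sided one, at alpha = 0.05 and 80% power.
# The assumed DCS incidences below are illustrative only.
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

p_shallow, p_deep = 0.015, 0.05                    # assumed true incidences
effect = proportion_effectsize(p_deep, p_shallow)  # Cohen's h

solver = NormalIndPower()
n_one = solver.solve_power(effect, alpha=0.05, power=0.8, alternative='larger')
n_two = solver.solve_power(effect, alpha=0.05, power=0.8, alternative='two-sided')

print(f"per-group n, one-sided test: {n_one:.0f}")
print(f"per-group n, two-sided test: {n_two:.0f}")  # noticeably larger
```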

The reason for highlighting everything in boxes is precisely to ensure clarity about what the null hypothesis states in each instance and to distinguish between them. I'd only ask that we be careful not to claim the article is making a point which it isn't.
 
Hello Joseph,

That is a very clearly written piece. Thank you.

I would just add a couple of perspectives.

First, in your "against" appraisal at the end you highlight the fact that the study was unblinded and that this could have introduced an element of bias in diagnosing DCS between the two groups. I completely agree that one must be alert to the effect this sort of bias might have had on the conclusions of a study where the outcomes have a subjective element. This would be particularly so where the study's conclusions appear to support the authors' preconceptions and/or study hypotheses. Turning to the NEDU study specifically, it would be easy to form the view (given all the arguments around defence of this study against ill-informed criticism) that the authors started out with an anti-deep stop agenda. In fact, at conferences where the study was first presented, the authors clearly stated that they undertook the study believing deep stops would work. The study took place in an era where there was widespread belief that deep stops were the optimal approach and that the extant navy "shallow stop" practices were archaic. It follows that if any bias was operant, it would probably have influenced the result in the opposite direction to that actually observed. Therefore although one cannot deny its possible existence, I do not see bias as a major problem in this study given the direction in which the results fell.

Second, again in your "against" appraisal, you make the point (entirely correctly) that the result is not significant in a two-tailed test. I think it is probably fair to also make the point that the most likely explanation for this is that the study was underpowered. Of course we will never know for sure, but had they continued on with more subjects it seems likely that they would have achieved a result that was significant on a two-tailed test. Why didn't they? Well, you answered that, but I think the point deserves re-emphasizing: in a biomedical study that potentially results in harm to human subjects it will be very difficult to convince an IRB that you should be allowed to enrol subjects beyond the barest minimum to answer the research question. Hence the sequential analysis rules, the mid-point analysis, and early termination of the study with a barely statistically significant result.
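(To make the "underpowered" point concrete, here is a toy sketch, not from the study itself, which holds the observed DCS proportions fixed and scales up the sample sizes; if the proportions held, the two-sided p-value would shrink accordingly.)

```python
# Toy illustration of the power point: hold the observed DCS proportions
# fixed (counts recalled as 10/198 deep vs 3/192 shallow -- verify before
# reuse) and watch the two-sided p-value shrink as the sample size grows.
# Real data would not, of course, scale this neatly.
from scipy.stats import fisher_exact

for scale in (1, 1.5, 2):
    deep_dcs, deep_n = round(10 * scale), round(198 * scale)
    shal_dcs, shal_n = round(3 * scale), round(192 * scale)
    table = [[deep_dcs, deep_n - deep_dcs],
             [shal_dcs, shal_n - shal_dcs]]
    _, p = fisher_exact(table, alternative='two-sided')
    print(f"n scaled x{scale}: two-sided p = {p:.3f}")
```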

Your point about the VGE grades being confluent with the study's findings is well made, and this increases the biological plausibility of the study's result. You also mention the potential for other studies to emerge. This has already happened (though no others use clinical DCS as the outcome measure). You can find a summary here.

None of these studies are perfect by any means, but their results are all pointing in the same general direction, which is to say that deep stop approaches to decompression adopted during the period of strong belief in the strategy appear to over-emphasize deep stops.

Simon M
 
Hi Simon,

Many thanks for your thoughts; much appreciated. Have only got a couple of minor points to add, since we seem to be mostly in agreement.

It follows that if any bias was operant, it would probably have influenced the result in the opposite direction to that actually observed. Therefore although one cannot deny its possible existence, I do not see bias as a major problem in this study given the direction in which the results fell.

I see your point here. I can’t really comment much on the direction any potential bias would have taken, because bias (and particularly unconscious bias) can be quite difficult to characterise, and trying to predict its direction is generally not a fruitful exercise. Ideally, one mitigates it as much as possible (if not eliminates it outright) by building blinding into the experiment, but (as I think we both agree) this can sometimes be difficult to achieve adequately in practice.

Of course we will never know for sure, but had they continued on with more subjects it seems likely that they would have achieved a result that was significant on a two-tailed test.

As you correctly say, we can’t know for sure. Maybe yes, maybe no. Mathematically, the only way to find out is to add more data points, and again, we both agree that it is very difficult to convince an ethics board that more (potentially harmful) trials are justified. (Also agreed about the rest, i.e. the sequential analysis rules, etc.) In my analysis, I ran an exercise where I added a (simulated) point on either side to give the reader an idea (given the small numbers) of what result a two-tailed test would give in these hypothetical examples, but beyond that, I can’t say or do much.
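For anyone who wants to reproduce the flavour of that exercise, here's a hedged sketch; the baseline counts are recalled from the study (verify against the report), and the added cases are purely simulated:

```python
# Sensitivity sketch: add one hypothetical DCS case to one arm at a time
# and recompute the two-sided Fisher test. Baseline counts are recalled
# from the study; the extra cases are simulated.
from scipy.stats import fisher_exact

def two_sided_p(deep_dcs, deep_n, shallow_dcs, shallow_n):
    table = [[deep_dcs, deep_n - deep_dcs],
             [shallow_dcs, shallow_n - shallow_dcs]]
    return fisher_exact(table, alternative='two-sided')[1]

print(two_sided_p(10, 198, 3, 192))  # baseline, as observed
print(two_sided_p(11, 199, 3, 192))  # one simulated DCS case, deep arm
print(two_sided_p(10, 198, 4, 193))  # one simulated DCS case, shallow arm
```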

This has already happened (though no others use clinical DCS as the outcome measure). You can find a summary here.

Many thanks for the pointer, Simon. I was aware of some of these (albeit haven’t gone into them in as much detail as the NEDU one). As you said, the endpoint in these studies is not DCS, but they provide valuable insight nonetheless. (Whilst I have heard about the Swedish Navy study, and indeed am very keen to have a look at it, I wasn’t sure whether it had been published already and I somehow missed it, or whether it hadn’t yet been published. I’m guessing it’s not out yet.)

Thanks very much once again for the feedback.

Best wishes,

Joseph
 
I've very much enjoyed this discussion and only have a couple of things to add. First, let me say thank you to Joseph for pointing out the issue regarding the one-tailed test. I thought a lot about this as well when I first read the study, and I don't think I've seen anyone else raise this point. At first I wondered if it was a reach on the researchers' part to find a statistical difference. I agree with Simon, though, that given the issues surrounding actual health concerns (and IRB hoops), designing the study to reach two-tailed significance at alpha = 0.05 could have put many more people at serious risk. @Dr. Simon Mitchell, I would be a little hesitant to call the study "underpowered." From a statistical standpoint it technically was, but when looking for a small biological effect, sometimes that's just the nature of it. Heck, their sample sizes were far bigger than most of my studies, and I don't work on humans! @Beau640 gave a nice lay description of the issue surrounding the one-tailed test and I agree with his assessment.

On the issue of "blindness" and bias, it's not one that I worry a lot about. Indeed, the ideal experiment is always the double-blind one. If the researchers clearly define their criteria for scoring the data a priori, then there are usually few problems with bias creeping in. My own studies are not blind. We get around this by clearly defining how we score the data. Each trial is then video recorded, and we have a record for review in the event that any questions arise.
 
Hi @RyanT,

Glad you found the discussion enjoyable.

From a statistical standpoint it technically was, but when looking for a small biological effect, sometimes that's just the nature of it.

I certainly appreciate your point that things are sometimes just this way. Here we're dealing mostly with the statistics, and as far as that is concerned (as we all agree), the remedy is the power of numbers. Unfortunately, however, those numbers are not always easy to obtain (to put it mildly).

@Beau640 gave a nice lay description of the issue surrounding the one-tailed test and I agree with his assessment.

That's the reason for writing at length about this in the article; it's relatively easy (and unfortunately all too common) for people to trip over the finer details of a statistical argument. (I just hope I've distilled it in simple enough terms.) Just to reiterate for clarity: it's not a matter of the test being inappropriate in this instance (given the context, i.e. the question being asked, as I commented previously), but rather of increasing robustness via a different choice of null hypothesis. Of course, we are all aware of how difficult this can be, given the potential harm to test subjects. (I've tried to make this point as clear as possible in the article itself.)

On the issue of "blindness" and bias, it's not one that I worry a lot about.

I see your point about the measures you describe, and I acknowledge, of course, that sometimes it is difficult to employ blinding. However, speaking generally, whenever possible I do insist that such a protocol be observed in scientific testing; as humans we are all too subject to bias, and needless to say, not all of it is of the conscious variety. (Moreover, some studies are by nature more subjective than others, which underscores the importance of adopting blinding whenever possible.)

Joseph
 
