David Doolette
Diving Physiologist
Commentary on Correlation papers, part 2
WARNING, VERY BORING DESCRIPTION OF STATISTICAL METHODS BELOW HERE---
The log-likelihood (LL) is an index used to compare the fit of models to the same data set - a higher (less negative) log-likelihood indicates a better fit to the data. Just to explain a bit more, suppose we had an ordered data set of three dive profiles with outcomes 1, 0, 1 (1 = DCS, 0 = no-DCS), and we had a model that predicted perfectly, in other words the model predicted the probability of DCS (p) = 1 for the first dive, the probability of no-DCS (q = 1-p) = 1 for the second dive, and p = 1 for the third dive. Then the likelihood (not the log-likelihood) for this model is 1x1x1 = 1, perfect. For numerical reasons it is better to calculate the log-likelihood, which is ln(1)+ln(1)+ln(1) = 0, the perfect log-likelihood. No real model does this well, so suppose instead we had a model that predicted p = 0.8 for the first dive, q = 0.6 for the second dive, and p = 0.7 for the third dive; then the log-likelihood is ln(0.8)+ln(0.6)+ln(0.7) = -1.1. The further the model predictions are from the observations, the more negative (and further from the perfect zero) the LL becomes. The LL also depends on the size of the data set, so it can only be used to compare models fit to the same data. So if model ‘a’ and model ‘b’ are fit to the same data set and LL_a = -100 and LL_b = -110, then model ‘a’ provides better predictions of the data (“fits the data better”) than model ‘b’.
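(If you want to see the arithmetic, here is a minimal sketch in Python of the log-likelihood calculation for the three-dive example above; the outcomes and probabilities are the ones from the text.)

```python
import math

# Outcomes for the three example dives: 1 = DCS, 0 = no-DCS
outcomes = [1, 0, 1]

# Predicted P(DCS) for each dive. Note: the text quotes q = 1 - p for
# dive 2, so q = 0.6 corresponds to p = 0.4 here.
p_perfect = [1.0, 0.0, 1.0]
p_real = [0.8, 0.4, 0.7]

def log_likelihood(p_dcs, outcomes):
    # Sum ln(p) over the DCS dives and ln(1 - p) over the no-DCS dives
    return sum(math.log(p if y == 1 else 1.0 - p)
               for p, y in zip(p_dcs, outcomes))

print(log_likelihood(p_perfect, outcomes))  # 0.0, the perfect log-likelihood
print(log_likelihood(p_real, outcomes))     # ln(0.8)+ln(0.6)+ln(0.7) ≈ -1.1
```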
The evaluation example above is informal, based on inspection of the log-likelihoods, and is perfectly valid. However, in certain circumstances it is possible to formally evaluate whether two log-likelihoods are statistically significantly different using log-likelihood ratio tests. These tests have two appropriate uses in the evaluation of fitted models. First, and most fundamentally, is the comparison of a fitted model to a ‘saturated’ model. A saturated model has as many parameters as there are distinct patterns in the data of the explanatory variables used in the fitted model, and it has the highest log-likelihood that can be achieved by a model using those particular explanatory variables. In probabilistic decompression models, the explanatory variables are the dive profiles, so for these data the saturated model would have as many parameters as there are distinct dive profiles. (Technically, in the Wienke models discussed above, the explanatory variables are the integrals of the hazards, since none of the parameters of the underlying biophysical models are fitted, but these integrated hazards are calculated from the dive profiles.) The deviance, 2(LLsaturated-LLfitted), indicates how much worse the actual fitted model fits the data than the best possible fit. Under some circumstances the deviance is chi-square distributed, and it is possible to formally test whether the deviance (lack of fit) is significant - if it is not significant, you have a “good fit”. However, the likelihood of the saturated model cannot be meaningfully calculated for data where most (or all) of the dive profiles are different, as is the case in field-collected diving data of this sort, so the deviance is not a meaningful measure of goodness of fit for this type of data. Other methods are usually employed instead, based not on log-likelihoods but on the differences between observed and predicted incidences (residuals) in more or less arbitrary groupings of the data; the value of such test statistics, and their validity, varies considerably with the data grouping. Usually a battery of tests is needed to assess goodness of fit.
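(To make the deviance concrete, here is a sketch assuming hypothetical grouped data in which distinct profiles repeat, which is the situation where the saturated model is meaningful; all counts and fitted probabilities below are invented for illustration, not taken from any paper.)

```python
import math
from scipy.stats import chi2

# Hypothetical grouped data: profile i was dived n[i] times with y[i] DCS
# cases. All numbers here are made up for illustration.
n = [50, 40, 30]            # dives per distinct profile
y = [2, 1, 3]               # DCS cases per distinct profile
p_fit = [0.03, 0.04, 0.08]  # some fitted model's predicted P(DCS) per profile

def grouped_ll(p, y, n):
    # Bernoulli log-likelihood summed within groups (the binomial
    # coefficient is omitted because it cancels in the deviance)
    return sum(yi * math.log(pi) + (ni - yi) * math.log(1.0 - pi)
               for pi, yi, ni in zip(p, y, n))

# Saturated model: one parameter per distinct profile, p_i = y_i / n_i
p_sat = [yi / ni for yi, ni in zip(y, n)]
deviance = 2 * (grouped_ll(p_sat, y, n) - grouped_ll(p_fit, y, n))

# Degrees of freedom = distinct profiles minus fitted parameters (say 1 here)
df = len(n) - 1
print(deviance, chi2.sf(deviance, df))  # large p-value -> no significant lack of fit

# With field data where every profile is unique (all n_i = 1), p_sat is 0 or
# 1 for each dive, LL_saturated = 0 exactly, and this test breaks down.
```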
The second appropriate use of likelihood ratio tests is to compare two nested models. Two models are nested if the ‘full model’ has all the same parameters and explanatory variables as the ‘reduced model’, plus some more. 2(LLfull-LLreduced) is a measure of the improved fit (reduced deviance) of the full model over the reduced model. This statistic is chi-square distributed, and a significant chi-square test indicates that the improved fit of the full model justifies the extra parameters. One common use of this test is the comparison of a fitted model (full) to a null model (reduced). In a null model, the explanatory variables of the fitted model are nulled, that is, they do not contribute to the model. A natural choice of null model for the data in the Comp Biol Med 2010 paper, for instance, is a model with one parameter equal to the observed incidence of DCS (20/2879 = 0.0069); this model assigns every dive in the data a probability of DCS of 0.0069, and the specifics of the dive profile have no influence on risk. This is what the author refers to as the “1 step set”, and it appears on the third line of table 3 of the Comp Biol Med 2010 paper. The bare minimum requirement for a useful model is to have a higher log-likelihood than the null model, which indeed the author’s models just barely (but significantly) do. The author’s choice of test statistics based on the 6 step sets, although superficially similar, is none of the above.
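(For the null-model comparison specifically, the calculation looks like the sketch below. The null log-likelihood follows directly from the 20/2879 incidence given above, but the fitted log-likelihood and parameter count are placeholders for illustration, not values from the paper.)

```python
import math
from scipy.stats import chi2

# Null model for these data: every dive is assigned P(DCS) = observed incidence
n_dives, n_dcs = 2879, 20
p0 = n_dcs / n_dives  # ≈ 0.0069

# Null log-likelihood: n_dcs hits at p0, (n_dives - n_dcs) misses at (1 - p0)
ll_null = n_dcs * math.log(p0) + (n_dives - n_dcs) * math.log(1.0 - p0)
print(ll_null)  # ≈ -119.3

# Placeholder values for some fitted model (NOT taken from the paper)
ll_fitted = -113.0   # hypothetical fitted log-likelihood
extra_params = 3     # hypothetical extra parameters over the 1-parameter null

# Likelihood ratio statistic, tested against chi-square with df = extra_params
G2 = 2 * (ll_fitted - ll_null)
p_value = chi2.sf(G2, extra_params)
print(G2, p_value)  # significant -> the improved fit justifies the extra parameters
```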