This post is based on the following paper:
A summary of this paper could read (my paraphrasing):
There has been an unprecedented rise in infant mortality since 2013/14; could deprivation be a driving force behind this? We conclude that deprivation has led to more infant deaths, specifically 24 infant deaths per 100,000 live births in the most deprived areas. Also, there have been 572 excess infant deaths compared with how many deaths we would have expected, given previous trends in the infant mortality rate.
I have a few issues with this, hence the blog post. First, I didn’t like that they used “excess mortality” as their outcome (which looks at trends, rather than what’s actually happened), because it is not equivalent to “more mortality”. A trend could get more positive without becoming positive: if infant deaths went down by 1 per 1,000 live births in 2010-2013 (a good thing), but then only went down by 0.5 per 1,000 births in 2014-2017 (a less good thing, but still good), then that would be an “excess mortality” of 0.5 per 1,000 births per year. Infant mortality hasn’t actually gotten any worse though, it’s just decreasing at a slower pace than before.
Now, that might still be good to know. But it’s also important to remember that there is a lower limit – mortality, sadly even for infants, will probably never reach zero, and even if it could, the downward trend would eventually have to flatten out as it approached zero. I would have liked an assessment of what the cause of death was for the “excess mortality”.
My other main issue was that they said one of their analyses was causal. It wasn’t, and could never have been. This irritated me.
Because the paper used openly accessible data, I could reanalyse it at my leisure, which is always nice. I wanted to see if I could estimate the mortality rates, and see how they compared with the published data. I also wanted to reassess the claim that deprivation (as measured by the Index of Multiple Deprivation, IMD) was associated with the trend in mortality rates. If possible, I also wanted to see where the extra deaths were coming from – what were the ages of the babies that were dying?
This isn’t my field, and my analyses were conducted quickly, so if I did anything wrong feel free to comment – all my code is available, along with all the files I downloaded from the Office for National Statistics (ONS), the sources of which are linked in this post.
Anna asked me to take a look at this and looked over my initial results – give her a shout if you’re interested in this kind of thing, as this is something she does for a living.
Finally, I want to say that I don’t think the paper has been poorly conducted. I don’t know how they did a few of their analyses (wish I could see their code…), so mine may be different in unknown ways – they may be better, they may be worse. But I think my conclusion casts doubt on the central theme, namely that there has been a rise in infant mortality in the past few years.
I’m thinking of including a tldr in these posts, as I can see this one is now over 5,000 words long (sorry).
So here it is, tldr: the increase in infant mortality looks like it’s a shift from stillbirths to deaths in babies under 1 day old, i.e. there is no definite increase in infant mortality, it just looks like it because of the way it is recorded.
Also, deprivation isn’t associated with the increase anyway.
EDIT: I have found a BBC Radio show that looked at this last year. Dr Peter Davis looked at the data and concluded that the uptick in infant deaths could be due to how deaths are recorded, specifically how deaths of very premature babies are recorded. It’s not fun to be this late to the party, but in fairness, the published paper came out less than 3 weeks ago, and they don’t seem to acknowledge this as an issue at all, so I feel like my time doing this might have been justified.
I took the ONS data for births and deaths (split by local authority area, for the years 2010-2017 inclusive) from the following URLs:
I tied each of the local authority areas to deprivation (as measured by the IMD) of that area in 2015, with data taken from a data portal:
From the birth data, I took the total number of live births (column F in table 1 or 1a).
From the death data, I took the number of infant (under 1 year), neonatal (under 4 weeks) and perinatal (stillbirths and deaths under 1 week) deaths, as well as the rates for those deaths (columns H-J and O-Q from table 1a).
I tied everything together, after formatting the local authority areas (ONS coding of place names changes over time and between datasets, unhelpfully).
I then removed areas with only 1 year of data (useless for estimating trends), estimated rates from births and deaths (if missing), estimated number of births from rates and deaths (if missing), then estimated the standard error of the rates.
I ended up with 324 areas in England with at least two mortality rates, same as the original publication, which is a good start.
Note: reading things in from Excel when the tables are inconsistent is irritating, as is having to deal with headers that are split across multiple cells. That is all.
I looked at:
Stillbirths were not recorded explicitly in the death or birth data (which includes only live births). I tried to estimate stillbirths based on the number of perinatal deaths and the perinatal rate (births + stillbirths = deaths * 1000 / perinatal rate), but because the number of perinatal deaths is relatively low, the imprecision of the rate estimates (to 1 decimal place, eurgh) meant I kept getting negative estimates of the number of stillbirths. I know the lower bound is 0 and the upper bound is perinatal deaths minus neonatal deaths, but that doesn’t really help.
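For what it’s worth, the back-calculation looks something like this – a minimal sketch in Stata, where the variable names are my own rather than the actual ONS column headers:

```stata
* Perinatal rate is per 1,000 total births (live births + stillbirths), so:
*   live births + stillbirths = 1000 * perinatal deaths / perinatal rate
gen stillbirths_est = 1000 * perinatal_deaths / perinatal_rate - live_births

* The published rates only have 1 decimal place, so with small numbers of
* deaths the estimate is imprecise and sometimes comes out negative
count if stillbirths_est < 0
```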
The reason it matters is that the standard error (SE) of an incidence rate is sqrt(cases/N²), with N being the number of births, which includes stillbirths for the perinatal mortality rate. I therefore might be overestimating the SE of the perinatal mortality rate by not accounting for stillbirths, but this is preferable to underestimating the SE. As far as I can tell, this can’t be helped.
I only estimated the SE if there were deaths in an area – there’s no SE of an incidence if there are no incidences.
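As a rough sketch of the rate and SE calculations (again in Stata, with hypothetical variable names – this isn’t my actual code, just the idea):

```stata
* Infant mortality rate per 1,000 live births
gen infant_rate = 1000 * infant_deaths / live_births

* SE of an incidence rate is sqrt(cases/N^2) = sqrt(cases)/N,
* scaled here to match the per-1,000 rate; only defined if there were deaths
gen se_infant = 1000 * sqrt(infant_deaths) / live_births ///
    if infant_deaths > 0 & !missing(infant_deaths)
```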
The other reason the SE matters is that I don’t think the published paper took the variance in the mortality rates into account in their analyses. Without their code, I can’t tell for sure though. Actually, without their code, there’s a lot that I’m uncertain of; I wish people would always publish their code (if they can, for instance when it’s publicly available data). I say that knowing full well I haven’t posted code in the past, but I’m trying to now.
I’m mixing the methods and results for the analyses, as it’s probably a little clearer than presenting methods then results.
I meta-analysed all rate estimates in all areas using random-effects separately for each year between 2010 and 2017, giving a weighted estimate of the mortality rates for each year. In plain English, I took a weighted average of the mortality rates across all areas of England, giving more weight to areas where there were more births and/or more deaths (as these have more statistical power).
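In code, the year-by-year pooling is something like this (a sketch using the user-written metan command, installed with ssc install metan; infant_rate and se_infant are my assumed variable names):

```stata
* Pool the area-level infant mortality rates with a random-effects
* meta-analysis, one year at a time
forvalues y = 2010/2017 {
    display as text _newline "Year `y':"
    metan infant_rate se_infant if year == `y', random nograph
}
```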
Figure 1 shows the mortality rates averaged over all areas.
Figure 1: Mortality rates over time, Y-axis = mortality rate (per 1,000 live births/live births & stillbirths)
It is clear 2014 had the lowest infant and neonatal mortality rates (not 2013, strangely, as this was the lowest point in the published study). There’s a clear trend of decreasing mortality up to 2014, then increasing mortality up to 2017. However, perinatal and postnatal death rates have been consistently decreasing since 2010. This implies the rise in infant mortality rate has been in neonates, i.e. those aged less than 4 weeks. There could still be a trade-off in stillbirths, because I don’t know the stillbirth mortality rate, i.e. if stillbirths went down a lot and deaths within 1 week went up a little, we could still see a reduction in perinatal mortality rate and an increase in neonatal mortality rate, even though the increase in death rate is exclusively in the 0-1 week range.
Basically, I don’t know at this stage at what age the increase happens – just that it’s in those aged less than 4 weeks, and it could include those aged less than 1 week.
I used variance weighted least squares (VWLS – this is equivalent to linear regression, but accounts for the variance on the estimates of mortality rate, which I took so much time to estimate earlier) to estimate the trend of mortality rates overall, and separately for 2010-2014 and 2014-2017 (my data shows the lowest infant mortality rate was in 2014, so I’ll use this as the “inflection point” – the point at which the trend goes from downwards to upwards).
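The trend estimation itself is simple once the SEs exist – something like this sketch (vwls is built into Stata; again the variable names are mine):

```stata
* Variance-weighted least squares: regress the rate on year, weighting each
* area-year observation by the inverse variance of its rate estimate
vwls infant_rate year, sd(se_infant)                  // overall, 2010-2017
vwls infant_rate year if year <= 2014, sd(se_infant)  // 2010-2014
vwls infant_rate year if year >= 2014, sd(se_infant)  // 2014-2017
```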
The overall trend was a reduction over time in all mortality rates, with P values below 0.001 (except neonatal mortality, where P = 0.17). This means that between 2010 and 2017, the evidence says the overall trends in infant, perinatal and postnatal mortality rates were downwards, which is good.
However, for the infant and neonatal mortality rates there were clear differences between the 2010-2014 and 2014-2017 trends, with very small P values (<0.001), so we can be reasonably confident the change from a downward to an upward trend is real for those two rates; there was no such evidence for the perinatal and postnatal mortality rates. Table 1 shows the mortality rates in different years, as well as my estimates of the trends in mortality between different years.
Table 1: Average mortality rate estimates for each year, mortality rates expressed per 1,000 live births, 95% confidence intervals in brackets
Year | Infant mortality rate | Neonatal mortality rate | Perinatal mortality rate | Postnatal mortality rate |
2010 | 4.09 (3.91 to 4.28) | 2.88 (2.73 to 3.02) | 7.13 (6.89 to 7.37) | 1.14 (1.05 to 1.22) |
2011 | 4.05 (3.87 to 4.24) | 2.91 (2.75 to 3.07) | 7.14 (6.89 to 7.40) | 1.04 (0.96 to 1.12) |
2012 | 3.85 (3.66 to 4.04) | 2.77 (2.62 to 2.93) | 6.35 (6.09 to 6.61) | 1.05 (0.97 to 1.14) |
2013 | 3.71 (3.55 to 3.87) | 2.65 (2.51 to 2.80) | 6.17 (5.93 to 6.40) | 1.05 (0.97 to 1.13) |
2014 | 3.57 (3.40 to 3.75) | 2.61 (2.45 to 2.76) | 6.18 (5.93 to 6.43) | 1.01 (0.93 to 1.10) |
2015 | 3.72 (3.53 to 3.91) | 2.68 (2.51 to 2.85) | 5.98 (5.74 to 6.22) | 0.96 (0.88 to 1.04) |
2016 | 3.71 (3.50 to 3.92) | 2.77 (2.59 to 2.94) | 5.96 (5.71 to 6.21) | 0.91 (0.83 to 0.99) |
2017 | 3.81 (3.63 to 4.00) | 2.84 (2.68 to 3.01) | 5.93 (5.70 to 6.17) | 0.90 (0.82 to 0.98) |
Trend: 2010-2017 | -0.050 (-0.079 to -0.021) | -0.017 (-0.041 to 0.007) | -0.181 (-0.218 to -0.144) | -0.030 (-0.043 to -0.018) |
Trend: 2010-2014 | -0.138 (-0.194 to -0.081) | -0.078 (-0.125 to -0.031) | -0.287 (-0.364 to -0.210) | -0.023 (-0.049 to 0.003) |
Trend: 2014-2017 | 0.071 (-0.009 to 0.153) | 0.078 (0.007 to 0.150) | -0.073 (-0.180 to 0.032) | -0.037 (-0.073 to -0.001) |
OK, I’m fairly certain there is an increase in neonatal mortality, which shows up as an increase in infant mortality. Does deprivation make a difference?
I estimated the trend for mortality rate between 2010 and 2017 for all areas separately, then used VWLS to estimate the association between deprivation and the overall trend in mortalities over time.
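Sketched out, that two-step approach looks roughly like this (assuming a long dataset with one row per area-year, plus a separate file – here called imd2015.dta – holding each area’s IMD score; both names are mine):

```stata
* Step 1: estimate each area's 2010-2017 trend (and its SE) with VWLS
statsby trend=_b[year] se_trend=_se[year], by(area) clear: ///
    vwls infant_rate year, sd(se_infant)

* Step 2: attach deprivation and regress the area-level trends on IMD,
* weighting by how precisely each trend was estimated
merge 1:1 area using imd2015, nogenerate
vwls trend imd, sd(se_trend)
```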
For infant, neonatal and postnatal mortality rates, increasing deprivation made the trends more negative, i.e. the more deprived the area, the greater the reduction in mortality rates (as presented in the published paper – the decreasing inequality bit). However, nothing had an incredibly low P value, so there isn’t much to say here.
I also estimated the same trends between 2010-2014 and 2014-2017 for all areas separately, then used VWLS again to estimate the association between deprivation and the year-specific trends in mortality rates over time.
Again, most estimates were negative, so higher deprivation was associated with mortality rate trends that decreased more over time compared with lower deprivation. Again, P values were all high.
Table 2: Estimated association of deprivation (per 10 unit increase in IMD) and mortality rate trends, for all years, 2010-2014 and 2014-2017, mortality rates expressed per 1,000 live births
Period | Infant mortality rate | Neonatal mortality rate | Perinatal mortality rate | Postnatal mortality rate |
Trend: 2010-2017 | -0.025 (-0.055 to 0.004) | -0.024 (-0.051 to 0.002) | 0.019 (-0.016 to 0.056) | -0.006 (-0.022 to 0.009) |
Trend: 2010-2014 | -0.054 (-0.115 to 0.005) | -0.040 (-0.095 to 0.014) | 0.010 (-0.066 to 0.087) | -0.024 (-0.058 to 0.010) |
Trend: 2014-2017 | -0.031 (-0.117 to 0.055) | -0.023 (-0.103 to 0.055) | -0.027 (-0.132 to 0.077) | -0.000 (-0.048 to 0.047) |
From this, I concluded that deprivation was not materially associated with the trend of mortality rates, i.e. increasing deprivation did not materially increase or decrease mortality rate trends over time.
However, the published paper looked at quintiles of deprivation, rather than deprivation as a continuous variable. Non-linear effects might show up in categorical analyses, so I looked at those.
I grouped the areas into fifths (quintiles) of deprivation (as measured using IMD), based on the 324 areas I had in my analysis. These quintiles may be different to the published paper’s, which could have used other cut points, but my data should be similar, given it’s representative of vast areas of the country.
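The grouping itself is a one-liner (a sketch; imd is my assumed variable name for each area’s IMD score, with one row per area at this point):

```stata
* Split areas into fifths of deprivation using my 324 areas' IMD scores
* (the published paper may have used different cut points)
xtile imd_fifth = imd, nq(5)
```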
For this analysis, I used both random-effects meta-analysis to get the average mortality rates in each year (which means I could reproduce the graph from the published paper), as well as using VWLS to estimate trends for each area separately, then seeing whether deprivation affects the rate trends.
The figures I’ve made have way more variance than the published figure, I think because the published analysis isn’t accounting for the variance on the incidence rates (or I’m wrong, which is also a good possibility). Note that for all graphs, being in the most deprived fifth is bad for mortality rates, but this analysis is looking at whether being in the most deprived fifth is bad for the trend in mortality rates over time.
I used Stata to make the graphs. Apparently, it took until Stata 15 to allow transparency in graphs, which seems long overdue. These aren’t as nice as R graphs (the published paper used R to make the graphs), but I didn’t want to spend hours fiddling with ggplot2, so here we are.
Figure 2 Deprivation-specific infant mortality rates, Y-axis = mortality rate (per 1,000 live births)
Figure 2 shows the deprivation-specific infant mortality rates – shaded areas are the 95% confidence interval of each deprivation group. The notable feature is that there is a big increase in infant mortality rates for the least deprived fifth of areas, taking it ahead of all other fifths except the most deprived. This isn’t in the published paper, and really matters for all these analyses. However, it’s clear from the confidence intervals that there isn’t going to be any evidence from statistical analyses for, well, anything – the precision is just too low.
Figure 3 Deprivation-specific neonatal mortality rates, Y-axis = mortality rate (per 1,000 live births)
Figure 3 shows the deprivation-specific neonatal mortality rates. There’s even less going on here, but the least deprived fifth is also behaving weirdly – I put this down to the imprecision of the data and natural variation.
Figure 4 Deprivation-specific perinatal mortality rates, Y-axis = mortality rate (per 1,000 live births and stillbirths)
Figure 4 shows the deprivation-specific perinatal mortality rates. This shows a similar general reduction in perinatal mortality rates over time, reasonably equally for all IMD fifths.
Figure 5 Deprivation-specific postnatal mortality rates, Y-axis = mortality rate (per 1,000 live births)
Figure 5 shows the deprivation-specific postnatal mortality rates. This shows a similar general reduction in postnatal mortality rates over time, reasonably equally for all deprivation fifths.
From the graphs, I don’t conclude that deprivation has an association with mortality rate trends over time, but the statistical tests might show otherwise.
For the overall trend, deprivation fifths do not have trends in mortality rates that are statistically distinct, i.e. there is no evidence to say the trend in each deprivation fifth is different to the least deprived fifth. On average, though, the trends are more negative in more deprived areas. This is also true in 2010-2014.
Table 3 Average mortality rate estimates for each year by IMD quintile, including trend analysis from 2010-2017, 2010-2014 and 2014-2017, mortality rates expressed per 1,000 live births.
Year | Infant mortality rate | Neonatal mortality rate | Perinatal mortality rate | Postnatal mortality rate |
Deprivation quintile = 1 |
||||
2010 | 3.23 (2.81 to 3.66) | 2.46 (2.08 to 2.85) | 5.89 (5.37 to 6.41) | 0.90 (0.66 to 1.15) |
2011 | 3.16 (2.76 to 3.56) | 2.53 (2.14 to 2.92) | 6.00 (5.48 to 6.53) | 0.83 (0.61 to 1.05) |
2012 | 3.01 (2.62 to 3.40) | 2.60 (2.17 to 3.03) | 5.05 (4.53 to 5.57) | 1.00 (0.76 to 1.24) |
2013 | 3.28 (2.86 to 3.70) | 2.79 (2.35 to 3.23) | 5.24 (4.76 to 5.73) | 0.95 (0.72 to 1.18) |
2014 | 2.98 (2.58 to 3.37) | 2.47 (2.08 to 2.86) | 5.57 (5.03 to 6.10) | 0.92 (0.68 to 1.15) |
2015 | 3.00 (2.59 to 3.41) | 2.35 (1.95 to 2.74) | 4.83 (4.37 to 5.29) | 0.86 (0.62 to 1.09) |
2016 | 3.13 (2.71 to 3.56) | 2.72 (2.27 to 3.16) | 4.68 (4.16 to 5.19) | 0.76 (0.53 to 0.98) |
2017 | 3.73 (3.27 to 4.20) | 2.92 (2.49 to 3.35) | 5.17 (4.67 to 5.66) | 0.81 (0.59 to 1.04) |
Trend: 2010-2017 | 0.029 (-0.036 to 0.095) | 0.034 (-0.028 to 0.097) | -0.142 (-0.221 to -0.064) | -0.017 (-0.052 to 0.017) |
Trend: 2010-2014 | -0.042 (-0.170 to 0.086) | 0.023 (-0.101 to 0.147) | -0.146 (-0.312 to 0.018) | 0.016 (-0.058 to 0.091) |
Trend: 2014-2017 | 0.227 (0.037 to 0.417) | 0.168 (-0.015 to 0.351) | -0.111 (-0.340 to 0.117) | -0.040 (-0.143 to 0.062) |
Deprivation quintile = 2 |
||||
2010 | 3.70 (3.29 to 4.11) | 2.70 (2.32 to 3.09) | 6.95 (6.38 to 7.52) | 1.04 (0.81 to 1.27) |
2011 | 3.38 (2.99 to 3.77) | 2.46 (2.10 to 2.82) | 6.20 (5.67 to 6.73) | 0.90 (0.68 to 1.11) |
2012 | 3.36 (2.92 to 3.80) | 2.39 (2.02 to 2.76) | 5.53 (5.04 to 6.03) | 0.91 (0.71 to 1.11) |
2013 | 3.47 (3.07 to 3.88) | 2.42 (2.05 to 2.80) | 5.68 (5.15 to 6.21) | 1.01 (0.79 to 1.24) |
2014 | 2.99 (2.62 to 3.36) | 2.31 (1.95 to 2.67) | 5.25 (4.76 to 5.74) | 1.11 (0.86 to 1.37) |
2015 | 3.40 (2.99 to 3.80) | 2.75 (2.35 to 3.15) | 5.33 (4.84 to 5.82) | 0.95 (0.72 to 1.17) |
2016 | 3.28 (2.88 to 3.67) | 2.54 (2.16 to 2.92) | 5.29 (4.81 to 5.76) | 1.00 (0.76 to 1.24) |
2017 | 3.22 (2.80 to 3.63) | 2.62 (2.21 to 3.04) | 5.19 (4.66 to 5.72) | 0.72 (0.51 to 0.93) |
Trend: 2010-2017 | -0.050 (-0.113 to 0.012) | 0.008 (-0.051 to 0.068) | -0.203 (-0.284 to -0.122) | -0.020 (-0.054 to 0.013) |
Trend: 2010-2014 | -0.135 (-0.259 to -0.011) | -0.081 (-0.198 to 0.036) | -0.380 (-0.547 to -0.214) | 0.024 (-0.050 to 0.099) |
Trend: 2014-2017 | 0.064 (-0.111 to 0.240) | 0.082 (-0.090 to 0.255) | -0.020 (-0.246 to 0.205) | -0.117 (-0.219 to -0.014) |
Deprivation quintile = 3 |
||||
2010 | 3.89 (3.51 to 4.27) | 2.64 (2.31 to 2.97) | 7.03 (6.51 to 7.55) | 1.09 (0.89 to 1.30) |
2011 | 3.85 (3.48 to 4.22) | 2.77 (2.43 to 3.12) | 6.95 (6.46 to 7.45) | 0.94 (0.75 to 1.14) |
2012 | 3.54 (3.18 to 3.90) | 2.54 (2.18 to 2.90) | 6.17 (5.64 to 6.70) | 0.94 (0.75 to 1.13) |
2013 | 3.56 (3.18 to 3.94) | 2.60 (2.26 to 2.94) | 6.07 (5.58 to 6.57) | 0.98 (0.78 to 1.17) |
2014 | 3.15 (2.78 to 3.52) | 2.44 (2.04 to 2.84) | 5.28 (4.76 to 5.80) | 0.94 (0.74 to 1.13) |
2015 | 3.25 (2.89 to 3.61) | 2.38 (2.05 to 2.71) | 5.83 (5.33 to 6.33) | 0.80 (0.62 to 0.99) |
2016 | 3.50 (3.02 to 3.98) | 2.47 (2.10 to 2.83) | 5.66 (5.19 to 6.12) | 0.70 (0.52 to 0.87) |
2017 | 3.29 (2.93 to 3.65) | 2.53 (2.14 to 2.91) | 5.65 (5.19 to 6.12) | 0.89 (0.69 to 1.08) |
Trend: 2010-2017 | -0.091 (-0.149 to -0.032) | -0.036 (-0.091 to 0.017) | -0.208 (-0.283 to -0.132) | -0.039 (-0.068 to -0.009) |
Trend: 2010-2014 | -0.178 (-0.296 to -0.060) | -0.052 (-0.166 to 0.060) | -0.437 (-0.600 to -0.274) | -0.027 (-0.090 to 0.035) |
Trend: 2014-2017 | 0.057 (-0.106 to 0.221) | 0.038 (-0.135 to 0.211) | 0.086 (-0.133 to 0.306) | -0.028 (-0.115 to 0.058) |
Deprivation quintile = 4 |
||||
2010 | 3.82 (3.51 to 4.12) | 2.61 (2.35 to 2.86) | 6.82 (6.37 to 7.28) | 1.08 (0.92 to 1.25) |
2011 | 4.16 (3.81 to 4.51) | 2.92 (2.64 to 3.20) | 7.63 (7.17 to 8.08) | 1.08 (0.91 to 1.26) |
2012 | 3.83 (3.53 to 4.13) | 2.61 (2.35 to 2.86) | 6.97 (6.46 to 7.48) | 1.07 (0.91 to 1.23) |
2013 | 3.60 (3.29 to 3.92) | 2.48 (2.21 to 2.76) | 6.23 (5.75 to 6.71) | 1.02 (0.86 to 1.18) |
2014 | 3.57 (3.28 to 3.87) | 2.39 (2.14 to 2.64) | 6.38 (5.97 to 6.78) | 0.92 (0.77 to 1.08) |
2015 | 3.57 (3.23 to 3.91) | 2.60 (2.25 to 2.95) | 5.98 (5.50 to 6.46) | 0.87 (0.72 to 1.02) |
2016 | 3.66 (3.29 to 4.02) | 2.61 (2.35 to 2.87) | 6.40 (5.98 to 6.82) | 0.96 (0.80 to 1.13) |
2017 | 3.66 (3.34 to 3.98) | 2.58 (2.30 to 2.87) | 5.87 (5.41 to 6.33) | 0.99 (0.82 to 1.15) |
Trend: 2010-2017 | -0.049 (-0.099 to -0.000) | -0.021 (-0.063 to 0.020) | -0.181 (-0.251 to -0.110) | -0.023 (-0.049 to 0.001) |
Trend: 2010-2014 | -0.097 (-0.193 to -0.000) | -0.083 (-0.164 to -0.002) | -0.224 (-0.362 to -0.086) | -0.038 (-0.089 to 0.011) |
Trend: 2014-2017 | 0.032 (-0.106 to 0.170) | 0.067 (-0.052 to 0.186) | -0.108 (-0.303 to 0.085) | 0.027 (-0.042 to 0.097) |
Deprivation quintile = 5 |
||||
2010 | 5.02 (4.60 to 5.45) | 3.52 (3.20 to 3.83) | 8.15 (7.64 to 8.67) | 1.41 (1.23 to 1.59) |
2011 | 4.81 (4.35 to 5.27) | 3.27 (2.88 to 3.65) | 7.88 (7.28 to 8.48) | 1.35 (1.17 to 1.53) |
2012 | 4.60 (4.17 to 5.04) | 3.18 (2.83 to 3.52) | 7.33 (6.77 to 7.90) | 1.33 (1.13 to 1.52) |
2013 | 4.21 (3.82 to 4.59) | 2.90 (2.58 to 3.21) | 6.91 (6.41 to 7.42) | 1.23 (1.06 to 1.39) |
2014 | 4.31 (3.87 to 4.74) | 3.09 (2.74 to 3.43) | 7.41 (6.91 to 7.91) | 1.21 (1.04 to 1.38) |
2015 | 4.47 (3.94 to 4.99) | 2.94 (2.53 to 3.35) | 7.08 (6.58 to 7.58) | 1.35 (1.16 to 1.54) |
2016 | 4.22 (3.70 to 4.73) | 2.95 (2.54 to 3.37) | 6.96 (6.38 to 7.53) | 1.17 (0.99 to 1.35) |
2017 | 4.48 (4.03 to 4.92) | 3.25 (2.89 to 3.61) | 7.07 (6.58 to 7.56) | 1.13 (0.93 to 1.32) |
Trend: 2010-2017 | -0.084 (-0.154 to -0.014) | -0.049 (-0.104 to 0.005) | -0.144 (-0.225 to -0.062) | -0.033 (-0.062 to -0.005) |
Trend: 2010-2014 | -0.207 (-0.343 to -0.072) | -0.130 (-0.235 to -0.025) | -0.239 (-0.402 to -0.076) | -0.052 (-0.107 to 0.002) |
Trend: 2014-2017 | 0.032 (-0.167 to 0.233) | 0.051 (-0.109 to 0.211) | -0.110 (-0.333 to 0.112) | -0.037 (-0.118 to 0.043) |
For 2014-2017, there is again no evidence that deprivation is associated with a change in trend. The trend estimates are either very close to 0 (perinatal and postnatal mortality rates) or negative (i.e. more deprivation = more reduction in mortality rates, infant and neonatal mortality rates).
Table 4 Estimated association of deprivation (for each fifth of deprivation compared to the least deprived fifth) and mortality rate trends, for 2010-2017, 2010-2014 and 2014-2017, mortality rates expressed per 1,000 live births
IMD quintile | Infant mortality rate | Neonatal mortality rate | Perinatal mortality rate | Postnatal mortality rate |
Trend: 2010-2017 |
||||
1 | Reference | Reference | Reference | Reference |
2 | -0.080 (-0.171 to 0.010) | -0.026 (-0.113 to 0.060) | -0.060 (-0.173 to 0.051) | -0.002 (-0.051 to 0.046) |
3 | -0.120 (-0.208 to -0.032) | -0.071 (-0.154 to 0.012) | -0.065 (-0.174 to 0.043) | -0.021 (-0.067 to 0.024) |
4 | -0.079 (-0.162 to 0.003) | -0.055 (-0.131 to 0.020) | -0.038 (-0.143 to 0.066) | -0.006 (-0.050 to 0.036) |
5 | -0.114 (-0.210 to -0.017) | -0.083 (-0.167 to -0.000) | -0.001 (-0.114 to 0.111) | -0.016 (-0.061 to 0.029) |
Trend: 2010-2014 |
||||
1 | Reference | Reference | Reference | Reference |
2 | -0.093 (-0.272 to 0.085) | -0.104 (-0.275 to 0.066) | -0.234 (-0.468 to 0.000) | 0.008 (-0.097 to 0.114) |
3 | -0.136 (-0.311 to 0.038) | -0.075 (-0.244 to 0.092) | -0.290 (-0.522 to -0.058) | -0.043 (-0.141 to 0.054) |
4 | -0.055 (-0.216 to 0.105) | -0.106 (-0.254 to 0.042) | -0.077 (-0.293 to 0.137) | -0.055 (-0.145 to 0.035) |
5 | -0.165 (-0.352 to 0.021) | -0.153 (-0.316 to 0.009) | -0.093 (-0.325 to 0.139) | -0.068 (-0.161 to 0.024) |
Trend: 2014-2017 |
||||
1 | Reference | Reference | Reference | Reference |
2 | -0.162 (-0.422 to 0.096) | -0.085 (-0.336 to 0.166) | 0.090 (-0.230 to 0.412) | -0.076 (-0.221 to 0.068) |
3 | -0.170 (-0.421 to 0.080) | -0.129 (-0.382 to 0.122) | 0.197 (-0.119 to 0.514) | 0.012 (-0.122 to 0.146) |
4 | -0.195 (-0.430 to 0.039) | -0.100 (-0.319 to 0.117) | 0.002 (-0.297 to 0.302) | 0.068 (-0.056 to 0.192) |
5 | -0.194 (-0.471 to 0.081) | -0.116 (-0.360 to 0.126) | 0.000 (-0.318 to 0.320) | 0.003 (-0.127 to 0.134) |
Conclusion from this: deprivation is not associated with mortality rate trends over time, either overall or between 2010-2014 or 2014-2017.
Anna asked me to look at when the deaths occurred, specifically. To do this, I used different data that Anna sent, which is a summary of deaths at different ages for England and Wales (Table 16). I had the number of live births, the number of stillbirths, and the number of deaths for babies in a set of age bands running from under 1 day up to 1 year.
Although I couldn’t do anything about the association with deprivation with this data, I could produce some graphs and analyses detailing how mortality rates have changed over time for all those categories of mortality (and combinations of those categories).
Firstly, I made graphs of mortality rate for live births up to 1 year, which showed the same as in the previous analyses – a decrease in mortality rate until 2014 (again, not 2013), then an increase, see Figure 6.
Figure 6 Average mortality rate for live births up to 1 year, Y-axis = mortality rate per 1,000 live births
However, when I include stillbirths in the mortality rate, I get Figure 7.
Figure 7 Average mortality rate for all births up to 1 year (includes stillbirths), Y-axis = mortality rate per 1,000 births
Here, there is no change in trend in 2014 – the decline in the mortality rate slows, but the rate doesn’t turn back upwards.
So where is the extra mortality in figure 6 coming from?
It turns out, mostly from mortality at aged less than 1 day, Figure 8.
Figure 8 Average mortality rate for all live births up to 1 day, Y-axis = mortality rate per 1,000 live births
When I remove deaths aged less than 1 day from the rest of the live births, I get Figure 9.
Figure 9 Average mortality rate for all live births from 1 day to 1 year, Y-axis = mortality rate per 1,000 live births
Again, there is no sign of the 2014 change in trend. There is a slight increase in 2017, which may or may not be relevant, but it’s nothing like Figure 6.
I made graphs for the other age groups, but they don’t show anything particularly interesting (to me, anyway). Essentially, mortality rates decrease in each age group, and tend to slow or reverse a tiny amount around 2014, but it’s not consistent in either time or magnitude.
The SEs for each year are quite large in all analyses – I really don’t think the published paper took this into account, potentially leading to their conclusions (but then, why does everything I have show a 2014 inflection point, not 2013?).
Extra bit of information: the rates of stillbirth and rates of mortality in <1 day olds are congruous from 2010-2014, i.e. as stillbirth rates go down, so do mortality rates for <1 day olds: VWLS estimate of 0.32 change in mortality rate in <1 day olds for a unit increase in stillbirth rate (95% CI: 0.14 to 0.49).
In comparison, in 2014-2017 the two rates are divergent, i.e. as stillbirth rates go down, mortality rates for <1 day olds go up: VWLS estimate of -0.61 change in mortality rate in <1 day olds for a unit increase in stillbirth rate (95% CI: -0.90 to -0.33). It’s not a one-to-one association, so this is only suggestive of something going on, it’s not definitive.
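For anyone curious, that comparison is just another VWLS, run separately on the two periods – a sketch with my hypothetical variable names for the national yearly rates (rate_under1day, rate_stillbirth) and the SE of the <1 day rate (se_under1day):

```stata
* How does <1 day mortality move with the stillbirth rate in each period?
vwls rate_under1day rate_stillbirth if inrange(year, 2010, 2014), sd(se_under1day)
vwls rate_under1day rate_stillbirth if inrange(year, 2014, 2017), sd(se_under1day)
```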
Also, the correlation coefficients (r) between the stillbirth and mortality rates for <1 day olds are 0.89 in 2010-2014, and -0.85 in 2014-2017. Figures 10 and 11 show the dependence between the two rates: Figure 10 is the stillbirth rate minus the rate of mortality for <1 day olds over time, Figure 11 is the stillbirth rate divided by the rate of mortality for <1 day olds over time. They both show a reasonably stable ratio until 2014, when it suddenly drops.
Note: the Y-axis values are similar (as are the estimates), but this is just coincidence – it just happens that the difference between the rates is roughly equal to the ratio of the rates.
Figure 10 The stillbirth rate minus the rate of mortality for <1 day olds, Y-axis = difference in mortality rate per 1,000 live births
Figure 11 The stillbirth rate divided by the rate of mortality for <1 day olds, Y-axis = ratio of stillbirth rate and rate of mortality for <1 day olds
My conclusion from all of this is that the perceived increase in infant mortality comes from the increase in mortality for babies under 1 day old, which to my mind are very similar to the stillbirths – given stillbirths have continued to go down, it is plausible the perceived increase in infant mortality is down to keeping babies, who would have otherwise been stillborn, alive for a few hours. Or there could have been a change in recording procedure. Or the demographics of mothers could have changed. Or anything, really.
The infant mortality rate has increased between 2014 and 2017, which is bad, since it had been decreasing between 2010 and 2014. This seems to be driven almost entirely by the neonatal mortality rate, specifically in babies who die within a day – when looking at perinatal and postnatal mortality rates, everything appears to be decreasing over time (which is good).
Although deprivation is associated with mortality rates in all instances, I can’t find any association between deprivation and mortality rate trends, i.e. there is no evidence for an association between increasing deprivation and increasing or decreasing mortality rate trends over time, looking at all years, 2010-2014 and 2014-2017.
I’m not sure why my results differ from theirs – it would help if people published their code. I’d guess that they haven’t accounted for the variance in the estimates of the mortality rates, which means their estimates are too precise. It also means that estimates from small areas with few people got more weight than they should have had. Given the model described in the paper, I don’t think they accounted for the variance in the outcome (mixed effects regression I *think* should be equivalent to a random effects meta-analysis model though [maybe without the variance in the outcome?], so who knows…). In any case, that’s my best theory for why our graphs of infant mortality rate over time split by deprivation quintiles are different.
I didn’t repeat their analysis looking at child poverty. They’re wrong to say that accounting for time-invariant confounders means their estimate is “likely to reflect a causal association” though. It’s only causal if there are literally no time-variant confounders of the child poverty and infant mortality rate association, which seems far-fetched. Government policy seems like a confounder, as do the regional, national and global economies (although it depends on whether they are using child poverty as a proxy for deprivation?). Also, it assumes the case-mix of mothers is stable or time-invariant, which is nonsense.
Anyway, it would take some time to redo this analysis, and given I’m reasonably certain that the increase in infant mortality is actually a shift from stillbirths to mortality under 1 day old, I don’t necessarily think there would be much to be gained from doing another analysis. Without knowing why the shift seems to have occurred, any new analyses run the risk of being biased in any case.
So: Is there really a rise in infant mortality in England, and is it driven by deprivation?
Not so far as I can see, no.
Deprivation, however, is very much associated with higher infant mortality generally – I just didn’t find that it is driving an increase in infant mortality in the past few years.
The paper is now out on MedRxiv, which is a preprint server, where we send academic papers before they’ve been peer-reviewed. If you want to read it all in detail, the full preprint is freely available there.
This study was funded by the Health Foundation to look at the “social and economic consequences of health: causal inference methods and longitudinal, intergenerational data”.
That’s the fancy way of saying “how does health affect social and economic outcomes?”
The “causal inference” part of the project is key. We already know poor health is associated with adverse social (e.g. wellbeing, social contact) and socioeconomic (e.g. educational attainment, income, employment) outcomes. What we don’t know is how much poor health is caused by those outcomes (i.e. reverse causation), how much other factors affect both health and those outcomes (i.e. confounding), and how much poor health directly causes those outcomes (i.e. what we want to know).
I’ve made a (hopefully) helpful picture to show the difference between those 3 things.
That’s the general theme of what we’re doing – how does (poor) health affect social and economic outcomes. To tackle that, in this paper we’ve looked at a variety of health conditions and risk factors for poor health, and attempted to estimate by how much they affect lots of different social and economic outcomes.
The method we’ve used is called Mendelian randomization, and it exploits the fact that at conception, we are all randomly assigned genetic variants, some of which predispose us to certain health conditions or risk factors. A genetic variant is a change in someone’s genetic code – we commonly use single nucleotide polymorphisms (commonly referred to as SNPs), which are single base pair changes in the genetic code.
Most genetic variants have a completely irrelevant effect on our day to day lives, and the environment, personal choice and luck matter much, much more. However, across an entire population, we can use these small effects to look at the effects of the health conditions and risk factors they predispose us to.
For example, there are plenty of genetic variants that affect body mass index (BMI), a measure of obesity (your weight in kilograms divided by height in metres squared). We can put all those variants together in a “genetic risk score” to get a single number that represents a person’s genetic predispositions towards having a higher or lower BMI. People with a higher genetic risk score for BMI will, on average, have a higher BMI than people with a lower genetic risk score for BMI. Having a higher genetic risk score doesn’t mean a person will have a higher BMI than someone with a lower score though, everything else matters too.
The major benefit of Mendelian randomization is that, if a few assumptions hold, we end up estimating the causal effect of an exposure (poor health) on an outcome (social/economic outcomes). Confounding isn’t a problem, because your genetic variants are fixed at conception, i.e. your genetic code doesn’t change from the first day you existed. There’s no chance of reverse causation either for the same reason. In the above picture, we’re estimating just the blue line with Mendelian randomization.
The assumptions are pretty big though, and should be remembered (and tested, if possible):
1. The genetic variants are associated with the exposure.
2. There is no other factor that affects both the exposure and the outcome (confounding).
3. The genetic variants only affect the outcome through the exposure.
We can and do test and account for these assumptions to minimise the risk of getting the wrong answer. But, as ever, we can never be certain whether we have the “truth”. There are other reasons why any specific Mendelian randomization analysis might not give us the “truth” too, the same as any other study.
Causal inference is hard.
There are good (somewhat technical) introductions to Mendelian randomization out there, in case you’re interested.
I find it difficult to give good reasons why I look at what I do.
I (personally) look at things because I’m curious, I like doing it, and/or I was told to do it by people that pay me. I work in a field that aims to improve health in general, so often the things I’m curious about, like doing, and am told to do can lead to better health too, which is nice.
However, for this specific study, we have a nice few paragraphs in the paper motivating the study that are properly justifiable:
Poor health has the potential to affect an individual’s ability to engage with society. For example, illnesses or adverse health behaviours could influence the ability to attend and concentrate at school or work and hence affect educational attainment, employment, and income. Illness and health behaviours may also affect an individual’s ability to maintain wellbeing and an active social life. From an individual perspective, maintaining good health can therefore have considerable social and socioeconomic benefits. Similarly, from a population perspective, improving population health could lead to a happier and more productive population.
Understanding the causal impacts of health on social and socioeconomic outcomes can help demonstrate the potential broader benefits of investing in effective health policy, thereby strengthening the case for cross-governmental action to improve health and its wider determinants at the population-level. Furthermore, patients require accurate information about how their lives might be affected by their health, for example on returning to work after cancer. However, studying the social and socioeconomic consequences of ill health (‘social drift’) is challenging because of social causation, i.e. the strong role of social and socioeconomic circumstances in disease causation. Social causation means that associations between health and social and socioeconomic outcomes are likely to be severely biased by confounding and reverse causality. Methodological approaches strengthening causal inference in this field are therefore essential.
Long story short, we looked at the causal effect of health conditions and risk factors on social and economic outcomes because it could be useful to public health policy.
And a little bit because it’s interesting.
Mendelian randomization is pretty good at getting causal answers for difficult questions, but the price of that is the results are imprecise. Much more imprecise than if we looked directly at the associations between health conditions/risk factors and social/economic outcomes without bothering with the genetics.
To compensate for this, we need a lot of people to study, all of whom need to have their genetic variants measured, as well as their health conditions, risk factors, social and economic outcomes. Fortunately for us, UK Biobank exists.
UK Biobank recruited just over 500,000 people between 2006 and 2010 from 22 centres across the UK. Participants could have been anyone between 40 and 70 years old who had registered with a GP, and they provided their medical history and socioeconomic information via questionnaires, interviews and some medical-type measurements (height, weight, blood pressure etc.). Medical data from hospital episode statistics (data from all hospital admissions) and the cancer registry have been linked to participants. As such, we have a good handle on who developed which health conditions at any point in time (certainly after the year 2000 and for many people before then – hospital data doesn’t go back in time forever), and know about risk factors and social and economic outcomes at one time point.
Because we are looking at genetics, we restricted people in UK Biobank to those that were unrelated (siblings and other related people can be used in genetic analyses, but not without accounting for their relatedness, and we didn’t do that), and to those of white British ancestry. This was to reduce the risk of bias from the second assumption about Mendelian randomization – that there should be no other factor that affects both the exposure and outcome. Because different ethnic groups have different propensities towards different genetic variants, as well as large differences in social and economic outcomes, restricting to just one ethnically similar group reduces the risk of this bias. We made a few other restrictions that were based on technical things, like excluding those with excessive amounts of missing genetic data.
Because of that, we went from just over 500,000 participants to 337,009 participants.
Still plenty of participants, so some of our analyses gave really quite precise answers. Others gave results that spanned everything from an “inconceivably large negative effect to inconceivably large positive effect” though, so hey ho.
We wanted to study as many health conditions and risk factors (I’ll call these exposures from here on out, as typing health conditions and risk factors is tiring) as we could, but obviously couldn’t study everything. We pragmatically selected exposures that:
The exposures that fit these criteria are listed in the following flowchart.
For health conditions, we looked through their self-reported conditions and their hospital episode data, and for risk factors we used what UK Biobank recorded at recruitment.
There is more information in the paper if you want to know the exact details.
Pragmatism also featured in our choice of social and economic outcomes (just called outcomes from here on in) – we selected outcomes based on what was available in UK Biobank. One consequence of this is that we didn’t have a measurement of individual income, only household income (and even that was split into broad categories). Ah well.
We have a nice box in the paper describing all the outcomes, so I may as well add that here:
All the exposures needed genetic variants for Mendelian randomization to work. As such, we searched for genome-wide association studies (GWAS) that told us which genetic variants to use for each exposure. GWAS are big studies that look at particular exposures (BMI, asthma, cholesterol etc.) and see which of the hundreds of thousands or millions of genetic variants they measured are associated with the exposure. Some genetic variants have stronger evidence of an association than others, and we chose variants that had a lot of evidence for an association, so we could be sure of the first Mendelian randomization assumption (that the genetic variants are associated with the exposure).
We then created genetic risk scores for each exposure, based on the genetic variants we found in the GWAS. For all exposures, the higher the risk score, the more likely it was the person had the health condition or the higher their risk factor. Some risk scores were much better at predicting the exposure than others – BMI was pretty good, osteoarthritis was awful.
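To give a flavour of the idea (this is a toy sketch, not our actual pipeline – real scores use tens to hundreds of variants, and the paper’s estimation may differ in detail): the score is just a weighted sum of allele counts, and a common one-sample approach is then to use that score as an instrument for the exposure.

```stata
* Toy genetic risk score for BMI from three hypothetical SNPs (rs_a, rs_b,
* rs_c), each coded 0/1/2 copies of the BMI-raising allele and weighted by
* its per-allele effect size from the GWAS (weights here are made up)
gen grs_bmi = 0.08*rs_a + 0.05*rs_b + 0.03*rs_c

* One-sample Mendelian randomization sketch: instrument BMI with the score
* (income here stands in for any of the social/economic outcomes)
ivregress 2sls income (bmi = grs_bmi)
```

Treat this as the general shape of the method rather than the actual recipe – the real analyses are more involved.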
Because there’s lots of checking that needs to be done, we did plenty of extra analyses and wrote about them in reasonable detail in the paper. I won’t talk about them here, because it gets a bit too technical and far too dull, but they’re in the paper if you’re interested. Specifically, we were checking that the Mendelian randomization assumptions were alright, as well as a couple of other things.
We estimated a lot of results. For every exposure, we estimated its effect on every outcome. Then, because we did lots of extra analyses, we ended up with 13 more estimates of each of those effects, plus a few additional analyses. We had 8 health conditions and 6 risk factors, so 14 exposures with 19 outcomes, multiplied by 14 estimates each – around 3,724 results in the paper.
While I would be very willing to list them all here, to save my and your sanity I’ll just present the main things that might be interesting.
I made a figure that represents the evidence we have for each exposure causing a change in each outcome. The stronger the evidence (i.e. the more precise the estimate and/or the larger the change), the darker each cell (the square representing the association between the corresponding exposure and outcome) is shaded. Each cell also has a plus or minus, depending on whether the association is positive (the exposure increases the outcome) or negative (the exposure decreases the outcome). Not all positives are good – increasing loneliness or deprivation are both bad. The stars represent associations for which we are pretty sure something is going on.
A quick glance at the picture is enough to tell me that smoking initiation (the best interpretation of which is probably having ever started smoking, rather than trying a few cigarettes) is bad for all the economic outcomes, except for being retired. BMI is also pretty bad for economic outcomes, and alcohol isn’t exactly great. For social outcomes, depression is bad.
Quick technical note about the legend
The legend shows the -log10 P value, which I can imagine means nothing to most people. The P value is the probability that, if we did an infinite number of future studies doing the exact same things as in this study and if there were no true effect of the exposure on the outcome, we would see an estimate for the association that was as large or larger (I hate defining the P value, because apparently many textbooks give the wrong definition, and I’m now completely uncertain how to define it without using all the words). The smaller the P value, the more certain we are that there is a non-zero association between the exposure and the outcome (although by itself, it says nothing about how large that association could be).
The -log10 part is a conversion (i.e. the logarithm of base 10 of the P value): a value of 0 would be a P value of 1 (i.e. no evidence of an association between the exposure and outcome), a value of 1 would be a P value of 0.1, 2 would be 0.01, 3 would be 0.001, and 20 would be 0.00000000000000000001 (i.e. we are very sure that there is an association between the exposure and the outcome). To go from the -log10 P value to the P value, divide 1 by 10 to the power of the number (1/10^2 = 0.01).
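If you want to check the conversion yourself, it’s one line of arithmetic:

```stata
* -log10 P of 2 corresponds to P = 0.01, and vice versa
display -log10(0.01)    // 2
display 1 / 10^2        // .01
```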
In the paper, we list off important results for each exposure, including effect estimates and our confidence around them, but that feels like overkill for a blog post. Suffice to say, our results suggested that higher BMI, smoking and alcohol use adversely affect multiple socioeconomic outcomes. For example, smoking initiation was estimated to reduce household income, the chance of owning accommodation, satisfaction with health and the chance of receiving a university degree, and to increase deprivation.
Potential reasons for adverse effects of high BMI, alcohol use and smoking on social and economic outcomes include increased disease burden, social stigma (e.g. bias against obese people, smokers etc.), or behaviours which make employment, retention of employment, or social interaction challenging. Previous analyses of UK Biobank have shown evidence of effects of BMI on social and socioeconomic outcomes, which is nicely shown in this study too.
We also had evidence that asthma increased deprivation and decreased household income and the chance of having a university degree, and migraine reduced the chance of having a weekly leisure or social activity, especially in men.
We estimated too that depression reduced satisfaction with health, financial situation and family relationships, and reduced the chance of being happy and increased the chance of being lonely. These were all expected – depression reducing happiness isn’t exactly news (but it’s definitely what I tell people when they ask what I’ve been working on for so long) – but it’s still good to see things that are expected, since it is at least some evidence things are working as they should.
We didn’t see any evidence for an association between other exposures and outcomes. This doesn’t mean there aren’t any associations there, it just means we couldn’t detect them. In fact, for almost all health conditions, the estimates had really terrible precision – we just didn’t have enough people to study, especially those that had the health conditions we were interested in, and the genetic variants weren’t as good as they could have been for health conditions. As they say, it was an absence of evidence, not evidence of absence.
This study was not without a great many limitations. We managed 682 words on them. I think I’ll list them with bullet points, since I’m not allowed to do that in academic papers:
The paper goes into each limitation in more detail, but I am reasonably confident that Mendelian randomization is one of the best, if not the best, method to tackle the question about how much poor health affects social and economic outcomes.
The only better method would be to randomise people to have health conditions or have different levels of risk factors, which would either be impossible or just incredibly unethical.
We concluded that the results of this study imply that higher BMI, smoking and alcohol consumption are likely detrimental to socioeconomic outcomes. While people are smoking less in the UK (good for public health), the average BMI has risen and is continuing to rise worldwide (less good for public health). From this study, we think that reducing average BMI levels, and further reducing smoking and alcohol intake, in addition to health benefits, may also improve socioeconomic outcomes for individuals and populations.
We could also conclude that Mendelian randomization is a good method, but we don’t have enough precision even with over 300,000 participants to be able to find any associations that aren’t massive. This will get better over time as we find more genetic variants associated with health conditions, but for now it remains a problem.
One could argue we already knew those things – who doesn’t know that smoking is bad? However, the benefit of this study isn’t necessarily knowing that some things are bad. The benefit is knowing that things are precisely, and causally, this bad. Which means public health policy can be better informed about the potential benefits of different policies affecting, and the likely consequences of changing levels of, different health conditions and risk factors.
Also, migraines evidently cause men to be less likely to have a weekly leisure or social activity. That’s something I didn’t know before.
I agree with the overall message and conclusion of the letter. I mostly agree with the bullet points of action at the end of the letter. I agree the CRUK adverts are extremely unlikely to do anything good, and may be harmful, so they should probably stop.
But I don’t agree with some of the arguments the academics made in the letter, and I want to talk about this, even though I’m pretty sure I could get slammed for doing so.
I want to state upfront that I completely agree that stigmatising people with obesity is incredibly harmful. I state this because I think some people may profoundly disagree with me on two of my arguments, namely that obesity is a mix of personal choice, the environment and genetics, and that smoking is also a mix of personal choice, the environment and genetics.
I am not arguing that smoking and obesity are the same, or that personal choice is more important than the environment for either smoking or obesity, although both are probably more important than genetics (for most people). I am also not saying I know what to do, how obesity should be tackled, or even if it should be tackled.
Rather, I’m arguing that treating smoking like it’s solely a personal choice is wrong, that personal choice exists for both smoking and obesity, and to deny that is to take away people’s autonomy.
First off, the things I agree with.
I completely agree. Most (possibly all) of the studies associating BMI with health outcomes are observational, not causal. From these studies, you can’t say anything about whether a high BMI causes cancer, or whether something else entirely is going on.
BMI can’t be randomised in a study in the same way a drug or other treatment can. You could look at randomised studies of treatments for obesity and see if they affect cancer in the long run, but that’s expensive, difficult, and probably not ethical – you’d have to not treat one group of people, who initially wanted treatment, to see if they were diagnosed with cancer more frequently than the group who were treated. Oh, and this only tells you if the treatment affects cancer risk, not BMI itself.
You could also look at genetic studies, which can be thought of as a natural experiment, since bits of genes are distributed randomly at conception. You would look at the bits of genetic code causing changes in BMI, and see if they also affect cancer risk, and any other outcomes.
Genetic studies probably give more of a causal estimate than non-genetic studies, but they aren’t free from bias. With complex, multifactorial outcomes like BMI, it’s difficult to be completely sure whether the effects you’re seeing are from BMI, or from some other process. It’s still a lot better than non-genetic observational studies though, hence the “more causal” estimates.
The research cited in the letter isn’t actually what I was hoping for, which was a study showing that telling people obesity is bad doesn’t affect people’s BMI in the long term. The research was more about people’s perceptions of obesity and “self-efficacy for health behaviour change”. However, I agree that telling people that obesity causes cancer is extremely unlikely to change people’s BMI. I don’t believe that many people who are obese believe obesity has no health consequences, or that telling them there are consequences will change either behaviour or outcomes.
However, my opinion on this isn’t relevant or important. CRUK should not launch a nationwide intervention without evidence that it would work – advertising is an intervention, same as intervening with drugs or surgery. There are potentially both positive (reduction in BMI?) and negative effects (increased stigma towards those perceived as obese), and the money spent on advertising could have gone to studying the causes of cancer. Therefore, the campaign should have had evidence to support its use, beyond “we tested it in focus groups”. If they don’t have the evidence, it shouldn’t have happened.
It is indefensible if healthcare providers are causing a barrier to access for their healthcare through stigmatising those they perceive as obese. As, of course, is shaming anyone exercising. It’s utterly absurd that anyone should try to make other people feel bad about obesity when they are actively trying to do something about it. The same is true of a lot of shaming, but it feels particularly unjust when people are shamed into not exercising because they are considered too obese to exercise.
Stigmatising obese people does not, and never will, reduce the amount of obesity in the world.
I’ll start with something easy.
The NHS has a page of weight-loss advice, but it’s general advice, along with the information that GPs can recommend both exercise on prescription and local weight loss groups. Given CRUK has stated obesity is bad, it needs to provide some mechanism for reducing obesity, or it would be completely pointless. So I can see why they would partner with a group that would, presumably (hopefully?), be recommended by GPs. Whether that was necessary is debatable, but I don’t see it as being an immediate problem in and of itself.
More concerningly, the letter states that the programmes are:
not effective ways of achieving and maintaining weight loss or preventing cancer
The research they cite doesn’t seem to me to support any part of that assertion.
The research is a systematic review and meta-analysis of weight loss among overweight but otherwise healthy adults who used commercial weight-loss programs. The outcome was whether people in the included studies lost more than 5% of their initial body weight. I’ve included a forest plot below, but the conclusion was that, on average, 57% of people on a commercial program lost less than 5% of their initial body weight.
Therefore, 43% of people lost more than 5% of their initial body weight, which would seem to me to be evidence that these programs can help people “achieve weight loss”. They don’t help everyone, but that doesn’t mean they aren’t effective – they worked for 43 out of every 100 people! So I disagree that the “evidence demonstrates these programs are not effective”, although I grant that others may disagree.
Some of the biggest studies lasted only 12 weeks, and all but one study lasted less than 1 year. Therefore, this isn’t any kind of evidence for long term effects, or of “maintaining weight loss”. In general, it seems like the longer the study, the more people lost 5% of their body weight. I’d have liked to see a meta-regression on this, to see whether length of study was important, but there wasn’t one. In any case, I also disagree the evidence demonstrates the programs aren’t effective at maintaining weight loss, since that wasn’t tested here.
Finally, the research doesn’t say anything at all about preventing cancer. I would doubt there is much evidence either way for that outcome – if there were, however, it should be cited.
Ok, so this is where things get trickier.
I fundamentally disagree with the approach to this argument. Specifically, this part:
Through making a direct comparison between smoking and weight, your campaign contributes to these assumptions, suggesting that it is a lifestyle choice.
I’m fairly certain this means the authors of this letter, who are criticising CRUK for stigmatising obesity by “implying that individuals are largely in control of and responsible for their body size”, are themselves stating unequivocally that smoking is a lifestyle choice.
I cannot overstate how much I object to this.
I can see why people would think that obesity is different from smoking. Obesity is a consequence of many things; it’s an outcome, not a behaviour or single action that can be stopped at will. Smoking is a deliberate action, where you need to buy cigarettes, light them, and inhale.
But that is completely misrepresenting smoking. Smoking is also a consequence of many things, it’s an outcome as much as a behaviour, same as obesity.
Suffice to say, I strongly reject the idea that smoking is a “lifestyle choice” while obesity is not.
Rather, smoking, like obesity, is complex and multifaceted. If people object to the CRUK adverts solely because smoking is a lifestyle choice, and obesity is not, then I think they are wrong.
There are, of course, plenty of other reasons to object to the adverts.
I expect this will be an unpopular opinion.
Are people largely in control of their body size?
I don’t know, nor do I know how we could ever test this.
But I know that individual choice affects obesity. To imply otherwise is to say that people have absolutely no autonomy over their own body, over what and how much they eat, or how much they exercise.
Choice is clearly not the only factor in play (I mean, see above), and for some people there is very little choice, but for many people there is plenty.
I include myself in this. I have been both overweight and obese. I still am overweight. I have experienced weight-related stigma from both family members and strangers, although not at all to the same degree as others will have done. But it’s my choice to eat and exercise the way I do. My environment affects those choices, and I recognise that I am very privileged in that I live close enough to work that I can walk in, I’m healthy enough to go for a run, and I have the time and resources to choose to eat “healthily” or “unhealthily” (quotes to show how little I care for those definitions), where others don’t have these options.
The choices we make about food and exercise become harder or easier based on the environment and genetics. It’s easier to cook food or exercise if you have the time and resources to do so. It’s easier to eat salads if you like them. It’s easier to eat less if you aren’t depressed, with comfort eating as one of the ways you cope. It’s easier to eat less if you aren’t taking a steroid.
Sometimes, it’s impossible to eat “healthily”, or to exercise, or to make any “good” choices. But this doesn’t mean that for other people, choice had no part in either gaining or losing weight. It also doesn’t mean that anyone should be judged or stigmatised for making those choices.
I don’t know whether structural change through policy or weight-loss programs that target individuals are better for losing weight, either individually or at a population level.
I don’t know whether obesity is even the core issue – what if the main issue is exercise, and both obesity and poor health are a result of not exercising? Even genetic studies couldn’t tell you that, since obesity may cause poor health by making it more difficult to exercise, both because it is simply more physically and mentally difficult, and because weight-related stigma makes exercise a worse experience.
I don’t know whether the CRUK advert could be beneficial or detrimental. I’m almost certain it’s not going to do any good, but I’m not psychic. I’m equally certain CRUK doesn’t know what effect the adverts will have, since they are apparently not based on firm evidence.
I don’t know whether it’s weight-related stigma to compare obesity with smoking – I see smoking and obesity as both consequences of personal choice, the environment and genetics. Saying that, the advert may increase stigma all the same, so it could well be destructive.
I don’t know by how much obesity causes cancer. I haven’t assessed the evidence CRUK has for their claims, though I’m fairly certain the evidence they have is not causal, so I don’t think they know by how much obesity causes cancer either.
I know weight-related stigma is horrible. It seems to me that much of it comes from people with more privilege, who have had easier choices to make, stigmatising those who have had to make harder choices.
I know the current Government is extremely unlikely to change policy to promote an environment that reduces obesity. Therefore, I believe the only option people who want to lose weight have is changing their own choices. Some people may benefit from weight loss programs. Others may benefit from lifestyle changes. Others may find it impossible to lose weight. And that’s ok.
While I agree with the letter’s overall message and conclusion, I think the argument could have been limited to the lack of causal evidence of obesity on cancer and the lack of evidence the campaign would have any effect on obesity in this country.
I completely disagree with the idea that smoking is a lifestyle choice while obesity is not. Both are a mix of personal choice, the environment and genetics.
UK Biobank approved my scope extension to do the CCR5 re-analysis, so this post is back up as it was in June. The accompanying analysis is also back online.
Given the recent developments with this story, I’m including a quick edit describing how everything progressed – skip to “ORIGINAL POST” below for the, you know, original post.
At the end of September, Rasmus Nielsen issued a retraction of the CCR5-delta32 gene paper that prompted this blog post:
The journal subsequently retracted the paper in early October. Throughout, I thought Rasmus was transparent about what was going on, and should be applauded for that. Indeed, most of the comments I saw on his tweet were very supportive.
Robert Maier and David Reich (in collaboration with April Wei and Rasmus Nielsen, and others) have done a re-analysis of CCR5 and mortality in UK Biobank using more complete data than either Rasmus or I had access to in June. They had access to the actual deletion, with no need to use nearby SNPs as proxies, and they found, as the title of their paper says:
“No statistical evidence for an effect of CCR5-Δ32 on lifespan in the UK Biobank cohort“
The problem, so far as I can tell, was missing data in the genetic variant April and Rasmus used to predict the CCR5-delta32 deletion. Because the data was not missing completely at random (i.e. certain people were more likely to be missing data on the genetic variant than others), it biased the analysis, making it look like the deletion increased mortality, when in fact it didn’t.
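To make that concrete, here is a tiny simulation sketch of my own (not the authors’ analysis; every number is invented): if genotype calls fail more often for some people than others, in a way that is also related to health, a complete-case comparison can show an “effect” on mortality even when there is none.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000

# True state of the world: the genotype has NO effect on mortality
genotype = rng.binomial(2, 0.12, size=n)     # copies of the deletion (0/1/2)
frailty = rng.normal(size=n)                 # unmeasured health factor
death = rng.random(n) < 1 / (1 + np.exp(-(-3 + 0.8 * frailty)))

# Calls fail more often for homozygotes and (in this toy set-up) more often
# for the healthier ones -- i.e. not missing completely at random
p_miss = 1 / (1 + np.exp(-(-3 + 1.5 * (genotype == 2) - 1.5 * frailty)))
observed = rng.random(n) > p_miss

hom = observed & (genotype == 2)
rest = observed & (genotype != 2)
print(death[hom].mean() / death[rest].mean())  # drifts above 1 despite a null effect
```

The direction and size of the bias depend entirely on how the missingness works, which is exactly why it matters whether data are missing completely at random or not.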
This is why we typically use imputed genetic data, i.e. we predict what the missing data should have been using other (related) data. We also have quality control on the imputed data, so that if the imputation looks poor for any genetic variant, we don’t use that variant. We should be suspicious of any SNPs that a) fail quality control, and b) aren’t imputed. Indeed, I was suspicious of rs62625034 for these reasons (and saw a weird pattern of missingness, see below).
As I said at the time, I think this story is a good example of post-publication peer review: I spotted that a paper wasn’t quite right, quickly re-analysed it, and posted about it on social media. The authors then engaged with the post, and we tried to get to the bottom of it. Then someone else saw the post, did another re-analysis, and collaborated with the original authors to work out exactly what went wrong, using more data. The original paper then gets retracted, and a new one takes its place, including the original authors as collaborators. All of this happened remarkably fast (for science): 4 months from publishing the paper to retraction and submission of a new paper.
Although it would have been better for either the original authors to have been more aware of the potential statistical issues using that particular genetic variant (and the reasons why we have certain procedures in place for using genetic data in general), or for the peer reviewers to pick up on these and other problems before publication, all things considered I think everything turned out pretty well.
Last thing: I talked about this problem with Adam Rutherford, doing a brief radio interview for a BBC Radio programme (the first segment is about this story). It was mildly terrifying, right up until I heard it on the radio and didn’t think I said anything too stupid. I also liked how the story was covered.
— ORIGINAL POST —
I debated for quite a long time on whether to write this post. I had said pretty much everything I’d wanted to say on Twitter, but I’ve done some more analysis and writing a post might be clearer than another Twitter thread.
To recap, a couple of weeks ago a paper by Xinzhu (April) Wei & Rasmus Nielsen of the University of California was published, claiming that a deletion in the CCR5 gene increased mortality (in white people of British ancestry in UK Biobank). I had some issues with the paper, which I posted about on Twitter. My tweets got more attention than anything I’d posted before. I’m pretty sure they got more attention than my published papers and conference presentations combined. ¯\_(ツ)_/¯
The CCR5 gene is topical because, as the paper states in the introduction:
In late 2018, a scientist from the Southern University of Science and Technology in Shenzhen, Jiankui He, announced the birth of two babies whose genomes were edited using CRISPR
To be clear, gene-editing human babies is awful. Selecting zygotes that don’t have a known, life-limiting genetic abnormality may be reasonable in some cases, but directly manipulating the genetic code is something else entirely. My arguments against the paper did not stem from any desire to protect the actions of Jiankui He, but to a) highlight a peer review process that was actually pretty awful, b) encourage better use of UK Biobank genetic data, and c) refute an analysis that seemed likely biased.
This paper has received an incredible amount of attention. If it is flawed, then poor science is being heavily promoted. Apart from the obvious problems with promoting something that is potentially biased, others may try to do their own studies using this as a guideline, which I think would be a mistake.
I’ll quickly recap the initial problems I had with the paper (excluding the things that were easily solved by reading the online supplement), then go into what I did to try to replicate the paper’s results. I ran some additional analyses that I didn’t post on Twitter, so I’ll include those results too.
Full disclosure: in addition to Rasmus replying to me on Twitter, we exchanged several emails, and they ran some additional analyses. I’ll try not to talk about any of these analyses as it wasn’t my work, but, if necessary, I may mention pertinent bits of information.
I should also mention that I’m not a geneticist. I’m an epidemiologist/statistician/evidence synthesis researcher who for the past year has been working with UK Biobank genetic data in a unit that is very, very keen on genetic epidemiology. So while I’m confident I can critique the methods for the main analyses with some level of expertise, and have spent an inordinate amount of time looking at this paper in particular, there are some things where I’ll say I just don’t know what the answer is.
I don’t think I’ll write a formal response to the authors in a journal – if anyone is going to, I’ll happily share whatever information you want from my analyses, but it’s not something I’m keen to do myself.
All my code for this is available online.
Not accounting for relatedness (i.e. related people in a sample) is a known problem. It can bias genetic analyses through population stratification or familial structure, and can be easily dealt with by removing related individuals from a sample (or by fancier analysis techniques, e.g. BOLT-LMM). The paper ignored this and used everyone.
Quality control (QC) is also an issue. When the IEU at the University of Bristol was preparing its UK Biobank genetic data, they looked for sex mismatches, sex chromosome aneuploidy (having sex chromosomes different to XX or XY), and participants with outliers in heterozygosity and missing rates (yeah, ok, I don’t have a good grasp on what this means, but I see it as poor data quality for particular individuals). The paper ignored these too.
The paper states it looks at people of “British ancestry”. Judging by the number of participants in the paper and the reference they used, the authors meant “white British ancestry”. I feel this should have been picked up on in peer review, since the terms are different. The referenced paper uses “white British ancestry”, so it would have certainly been clearer sticking to that.
The main analysis should have also been adjusted for all principal components (PCs) and centre (where participants went to register with UK Biobank). This helps to control for population stratification, and we know there is geographic structure in UK Biobank. I thought choosing variables to include as covariables based on statistical significance was discouraged, but apparently views differ. Still, I see no plausible reason to do so in this case – principal components represent population stratification, population stratification is a confounder of the association between SNPs and any outcome, so adjust for them. There are enough people in this analysis to take the hit.
I don’t know why the main analysis was a ratio of the crude mortality rates at 76 years of age (rather than a Cox regression), and I don’t know why there are no confidence intervals (CIs) on the estimate. The CI exists, it’s in the online supplement. Peer review should have had problems with this. It is unconscionable that any journal, let alone a top-tier journal, would publish a paper when the main result doesn’t have any measure of the variability of the estimate. A P value isn’t good enough here, since the error distribution is non-symmetrical and you can’t back-calculate a standard error from it.
So why is the CI buried in an additional file when it would have been so easy to put it into the main text? The CI is from bootstrapping, whereas the P value is from a log-rank test, and the CI of the main result crosses the null. The main result is therefore non-significant (by the bootstrap CI) and significant (by the log-rank test) at the same time. This could be a reason why the CI wasn’t in the main text.
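As an aside, bootstrapping a CI for a ratio of crude mortality rates is cheap to do; a minimal sketch (mine, not the paper’s method, with hypothetical variable names) would be something like:

```python
import numpy as np

def bootstrap_rate_ratio_ci(died_by_76, homozygote, n_boot=2000, seed=1):
    """Percentile bootstrap CI for the ratio of crude mortality rates.
    died_by_76 and homozygote are 0/1 numpy arrays, one entry per participant."""
    rng = np.random.default_rng(seed)
    n = len(died_by_76)
    ratios = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, n)                  # resample people with replacement
        d, h = died_by_76[idx], homozygote[idx]
        ratios[b] = d[h == 1].mean() / d[h == 0].mean()
    return np.percentile(ratios, [2.5, 97.5])
```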
It’s also noteworthy that although the deletion appears strongly to be recessive (it only has an effect if both chromosomes have the deletion), the main analysis reports delta-32/delta-32 against +/+, which surely has less power than delta-32/delta-32 against +/+ or delta-32/+ combined. The CI might have excluded the null otherwise.
I think it’s wrong to present one-sided P values (in general, but definitely here). The hypothesis should not have been that the CCR5 deletion would increase mortality; it should have been ambivalent, like almost all hypotheses in this field. The whole point of the CRISPR editing was that the babies would be more protected from HIV, so unless the authors had an unimaginably strong prior that the CCR5 deletion was deleterious, why would they use one-sided P values? Cynically, and without a strong reason to think otherwise, I can only imagine it’s because one-sided P values are half the size of two-sided P values.
The best analysis, I think, would have been a Cox regression. Happily, the authors did this after the main analysis. But the full analysis that included all PCs (but not centre) was relegated to the supplement, for reasons that are baffling since it gives the same result as using just 5 PCs.
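For anyone curious what that kind of fully adjusted model looks like in practice, here is a rough sketch (my illustration, not the authors’ code; all file and column names are hypothetical) using the lifelines package:

```python
import pandas as pd
from lifelines import CoxPHFitter

# One row per participant: follow-up time, death indicator, genotype dosage,
# sex, assessment centre and principal components pc1..pc40 (names are made up)
df = pd.read_csv("analysis_dataset.csv")
df["homozygote"] = (df["genotype"] == 2).astype(int)          # recessive coding

cols = ["homozygote", "sex"] + [f"pc{i}" for i in range(1, 41)]
model_df = pd.concat(
    [df[cols + ["follow_up_years", "died"]],
     pd.get_dummies(df["centre"], prefix="centre", drop_first=True, dtype=int)],
    axis=1,
)

cph = CoxPHFitter()
cph.fit(model_df, duration_col="follow_up_years", event_col="died")
cph.print_summary()   # the hazard ratio for 'homozygote' is the estimate of interest
```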
Also, the survival curve should have CIs. We know nothing about whether those curves are separate without CIs. I reproduced survival curves with a different SNP (see below) – the CIs are large.
I’m not going to talk about the Hardy-Weinberg Equilibrium (HWE, inbreeding) analysis – it’s still not an area I’m familiar with, and I don’t really think it adds much to the analysis. There are loads of reasons why a SNP might be out of HWE – dying early is certainly one of them, but it feels like this would just be a confirmation of something you’d know from a Cox regression.
I have access to UK Biobank data for my own work, so I didn’t think it would be too complex to replicate the analyses to see if I came up with the same answer. I don’t have access to rs62625034, the SNP the paper says is a great proxy of the delta-32 deletion, for reasons that I’ll go into later. However, I did have access to rs113010081, which the paper said gave the same results. I also used rs113341849, which is another SNP in the same region that has extremely high correlation with the deletion (both SNPs have R2 values above 0.93 with rs333, which is the rs ID for the delta-32 deletion). Ideally, all three SNPs would give the same answer.
First, I created the analysis dataset – the exact code for this is in the repository linked above.
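The actual analysis was done in Stata; purely as an illustration of the kind of steps involved, a Python sketch (with hypothetical file and column names, not the ones I actually used) might look like this:

```python
import pandas as pd

# Genotype dosages for the two SNPs, one row per participant (hypothetical extract)
dosages = pd.read_csv("snp_dosages.csv")      # columns: eid, rs113010081, rs113341849

# Phenotype data needed to build follow-up time, plus covariates
pheno = pd.read_csv("phenotypes.csv")         # eid, date_of_assessment, date_of_death,
                                              # censoring_date, sex, centre, pc1..pc40
exclusions = pd.read_csv("exclusions.csv")    # eids of related / non-white-British /
                                              # failed-QC participants

df = dosages.merge(pheno, on="eid", how="inner")
df = df[~df["eid"].isin(exclusions["eid"])]   # drop excluded participants

start = pd.to_datetime(df["date_of_assessment"])
end = pd.to_datetime(df["date_of_death"].fillna(df["censoring_date"]))
df["died"] = df["date_of_death"].notna().astype(int)
df["follow_up_years"] = (end - start).dt.days / 365.25   # one choice of time axis;
                                                          # age is another (see below)

df.to_csv("analysis_dataset.csv", index=False)
```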
I conducted 12 analyses in total (6 for each SNP), but they were all pretty similar:
With this suite of analyses, I was hoping to find out whether:
I found… Nothing. There was very little evidence the SNPs were associated with mortality (the hazard ratios, HRs, were barely different from 1, and the confidence intervals were very wide). There was little evidence including relateds or more covariables, or changing the time variable, changed the results.
Here’s just one example of the many survival curves I made, looking at delta-32/delta-32 (1) versus both other genotypes in unrelated people only (not adjusted, as Stata doesn’t want to give me a survival curve with CIs that is also adjusted) – this corresponds to the analysis in row 6.
You’ll notice that the CIs overlap. A lot. You can also see that both events and participants are rare in the late 70s (the long horizontal and vertical stretches) – I think that’s because there are relatively few people who were that old at the end of their follow-up. Average follow-up time was 7 years, so to estimate mortality up to 76 years, I imagine you’d want quite a few people to be 69 years or older, so they’d be 76 at the end of follow-up (if they didn’t die). Only 3.8% of UK Biobank participants were 69 years or older.
In my original tweet thread, I only did the analysis in row 2, but I think all the results are fairly conclusive for not showing much.
In a reply to me, Rasmus stated:
This is the claim that turned out to be incorrect:
Never trust data that isn’t shown – apart from anything else, when repeating analyses and changing things each time, it’s easy to forget to redo an extra analysis if the manuscript doesn’t contain the results anywhere.
This also means I couldn’t directly replicate the paper’s analysis, as I don’t have access to rs62625034. Why not? I’m not sure, but the likely explanation is that it didn’t pass the quality control process (either ours or UK Biobank’s, I’m not sure).
I’ve concluded that the only possible reason for a difference between my analysis and the paper’s analysis is that the SNPs are different. Much more different than would be expected, given the high amount of correlation between my two SNPs and the deletion, which the paper claims rs62625034 is measuring directly.
One possible reason for this is the imputation of SNP data. As far as I can tell, neither of my SNPs were measured directly, they were imputed. This isn’t uncommon for any particular SNP, as imputation of SNP data is generally very good. As I understand it, genetic code is transmitted in blocks, and the blocks are fairly steady between people of the same population, so if you measure one or two SNPs in a block, you can deduce the remaining SNPs in the same block.
In any case there is a lot of genetic data to start with – each genotyping chip measures hundreds of thousands of SNPs. Also, we can measure the likely success rate of the imputation, and SNPs that are poorly imputed (for a given value of “poorly”) are removed before anyone sees them.
The two SNPs I used had good “info scores” (around 0.95 I think – for reference, we dropped all SNPs with an info score of less than 0.3 among SNPs with similar minor allele frequencies), so we can be pretty confident in their imputation. On the other hand, rs62625034 was not imputed in the paper, it was measured directly. That doesn’t mean everyone had a measurement – I understand the missing rate of the SNP was around 3.4% in UK Biobank (this is from direct communication with the authors, not from the paper).
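For context, the kind of quality-control filter I mean is nothing sophisticated; a sketch with a hypothetical per-SNP summary table:

```python
import pandas as pd

# Hypothetical per-SNP summary table: rsid, info (imputation quality), maf
snp_stats = pd.read_csv("snp_stats.csv")

# Keep only well-imputed, not-too-rare variants (thresholds are illustrative)
keep = snp_stats[(snp_stats["info"] >= 0.3) & (snp_stats["maf"] >= 0.01)]
print(f"{len(keep)} of {len(snp_stats)} SNPs pass the quality filter")
```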
But. And this is a weird but that I don’t have the expertise to explain, the imputation of the SNPs I used looks… well… weird. When you impute SNP data, you impute values between 0 and 2. They don’t have to be integer values, so dosages of 0.07 or 1.5 are valid. Ideally, the imputation would only give integer values, so you’d be confident this person had 2 mutant alleles, and this person 1, and that person none. In many cases, that’s mostly what happens.
Non-integer dosages don’t seem like a big problem to me. If I’m using polygenic risk scores, I don’t even bother making them integers, I just leave them as decimals. Across a population, it shouldn’t matter, the variance of my final estimate will just be a bit smaller than it should be. But for this work, I had to make the non-integer dosages integers, so anything less than 0.5 I made 0, anything 0.5 to 1.5 was 1, and anything above 1.5 was 2. I’m pretty sure this is fine.
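In code, that rounding rule is just the following (a sketch, not my actual Stata code):

```python
import numpy as np

def dosage_to_genotype(dosage):
    """Round an imputed dosage (anywhere from 0 to 2) to 0, 1 or 2 copies:
    < 0.5 -> 0, 0.5 to 1.5 -> 1, > 1.5 -> 2."""
    dosage = np.asarray(dosage, dtype=float)
    return np.select([dosage < 0.5, dosage <= 1.5], [0, 1], default=2)

print(dosage_to_genotype([0.07, 0.5, 1.2, 1.93]))   # [0 1 1 2]
```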
Unless there are more non-integer doses at one dosage than at the others.
rs113010081 has non-integer dosages for almost 14% of white British participants in UK Biobank (excluding relateds). But the non-integer dosages are not distributed evenly across dosages. No. The twos had way more non-integer dosages than the ones, which had way more non-integer dosages than the zeros.
In the below tables, the non-integers are represented by being missing (a full stop) in the rs113010081_x_tri variable, whereas the rs113010081_tri variable is the one I used in the analysis. You can see that of the 4,736 participants I thought had twos, 3,490 (73.69%) actually had non-integer dosages somewhere between 1.5 and 2.
What does this mean?
I’ve no idea.
I think it might mean the imputation for this region of the genome might be a bit weird. rs113341849 has the same pattern, so it isn’t just this one SNP.
But I don’t know why it’s happened, or even whether it’s particularly relevant. I admit ignorance – this is something I’ve never looked for, let alone seen, and I don’t know enough to say what’s typical.
I looked at a few hundred other SNPs to see if this is just a function of the minor allele frequency, and so the imputation was naturally just less certain because there was less information. But while there is an association between the minor allele frequency and non-integer dosages across dosages, it doesn’t explain all the variance in the estimate. There were very few SNPs with patterns as pronounced as in rs113010081 and rs113341849, even for SNPs with far smaller minor allele frequencies.
Does this undermine my analysis, and make the paper’s more believable?
I don’t know.
I tried to look at this with a couple more analyses. In the “x” analyses, I only included participants with integer values of dose, and in the “y” analyses, I only included participants with dosages < 0.05 from an integer. You can see in the results table that only using integers removed any effect of either SNP. This could be evidence of the imputation having an effect, or it could be chance. Who knows.
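Spelled out, the “x” and “y” restrictions are just filters on how far each imputed dosage sits from a whole number; a sketch (hypothetical column name again):

```python
import pandas as pd

def restrict_by_dosage(df, col="rs113010081_dosage", tol=0.0):
    """Keep participants whose imputed dosage is within `tol` of a whole number.
    tol=0 reproduces the 'x' analyses (integers only); tol=0.05 the 'y' analyses."""
    dist = (df[col] - df[col].round()).abs()
    return df[dist <= tol]

# e.g. x_subset = restrict_by_dosage(df, tol=0.0)
#      y_subset = restrict_by_dosage(df, tol=0.05)
```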
rs62625034 was directly measured, but not imputed, in the paper. Why?
It’s possibly because the SNP isn’t measuring what the probe was meant to measure. It clearly has a very different minor allele frequency in UK Biobank (0.1159) than in reference populations (~0.03). The paper states this means it’s likely measuring the delta-32 deletion, since the frequencies are similar and rs62625034 sits in the deletion region. This mismatch may have made it fail quality control.
But this raises a couple of issues. First is whether the missingness in rs62625034 is a problem – is the data missing completely at random, or not missing at random? If the former, great. If the latter, not great.
The second issue is that rs62625034 should be measuring a SNP, not a deletion. In people without the deletion, the probe could well be picking up people with the SNP. The rs62625034 measurement in UK Biobank should be a mixture between the deletion and a SNP. The R2 between rs62625034 and the deletion is not 1 (although it is higher than for my SNPs – again, this was mentioned in an email to me from the authors, not in the paper), which could happen if the SNP is picking up more than the deletion.
The third issue, one I’ve realised only just now, is that rs62625034 is not associated with lifespan in UK Biobank (and other datasets). This means that maybe it doesn’t matter that rs62625034 is likely picking up more than just the deletion.
Peter Joshi, author of the article, helpfully posted these plots:
If I read this right, Peter used UK Biobank (and other data) to produce the above plot showing lots of SNPs and their association with mortality (the higher up the plot a SNP is, the stronger its association with mortality).
Not only does rs62625034 not show any association with mortality, but how did Peter find a minor allele frequency of 0.035 for rs62625034 and the paper find 0.1159? This is crazy. A minor allele frequency of 0.035 is about the same as the GO-ESP population, so it seems perfectly fine, whereas 0.1159 does not.
I didn’t clock this when I first saw it (sorry Peter), but using the same datasets and getting different minor allele frequencies is weird. Properly weird. Like counting the number of men and women in a dataset and getting wildly different answers. Maybe I’m misunderstanding, it wouldn’t be the first time – maybe the minor allele frequencies are different because of something else. But they both used UK Biobank, so I have no idea how.
I have no answer for this. I also feel like I’ve buried the lead in this post now. But let’s pretend it was all building up to this.
This paper has been enormously successful, at least in terms of publicity. I also like to think that my “post-publication peer review” and Rasmus’s reply represents a nice collaborative exchange that wouldn’t have been possible without Twitter. I suppose I could have sent an email, but that doesn’t feel as useful somehow.
However, there are many flaws with the paper that should have been addressed in peer review. I’d love to ask the reviewers why they didn’t insist on the following: removing related participants and applying standard quality control, adjusting for all the principal components and assessment centre, two-sided P values, confidence intervals presented alongside the main result, and a Cox regression as the primary analysis.
So, do I believe “CCR5-∆32 is deleterious in the homozygous state in humans”?
No, I don’t believe there is enough evidence to say that the delta-32 deletion in CCR-5 affects mortality in people of white British ancestry, let alone people of other ancestries.
I know that this post has likely come out far too late to dam the flood of news articles that have already come out. But I kind of hope that what I’ve done will be useful to someone.
Also, Snowdon said I should get a blog.
The article cherry-picks data, conflates observational epidemiology with causal inference, and misunderstands basic statistics.
I don’t care whether people drink or not. I’d prefer it if people drank in moderation, but I’m certainly not an advocate for teetotalism.
I do, however, think people should be informed of the risks of anything they do, if they want to be.
I think the article is poor, but think people should feel happy to drink if they want to. Based on the available evidence though, I wouldn’t say it helps your heart, and there may be some risk of drinking moderately.
But that’s the same for cake.
Let’s delve into the article.
The piece starts out by saying that there is a drive to treat drinkers like smokers. That seems to conflate saying that alcohol can be harmful with saying people shouldn’t drink alcohol.
They aren’t the same.
I also don’t know which organisation runs this campaign, but calling people who say alcohol is harmful “anti-alcohol activists” is a trick to make those same people seem like “others” or “them”. It also makes them sound like fanatics, trying to stop “you” drinking “your” alcohol.
But that’s not why I’m writing this.
It’s the “health benefits of moderate drinking”, stated as if it were indisputable fact. As if it’s known that alcohol causes health benefits.
Causal statements like this need rigorous proof. They need hard evidence. If moderate alcohol intake is associated with health benefits, that’s one thing. But saying it causes those health benefits is quite another.
Even if alcohol caused some benefits though, something can have both positive and negative effects – it’s not absurd to tell people about the dangers of something even if it could have benefits, that’s why medications have lists of side-effects.
And calling something “statistical chicanery” is another tactic to make it seem like people saying alcohol is harmful are doing so by cheating, or through deception.
The link to “decades of evidence” is to a 2004 meta-analysis, showing
Strong trends in risk were observed for cancers of the oral cavity, esophagus and larynx, hypertension, liver cirrhosis, chronic pancreatitis, and injuries and violence.
Which sounds pretty bad to me.
I’m guessing that if this is the right link, then it was meant for you to observe that there is a J-shaped relationship between alcohol intake and coronary heart disease.
That is, low and high levels of drinking are bad for your heart, but some is good. This sounds good – alcohol protects your heart – and it is common advice to hear from loads of people, doctors included.
The problem is that the evidence for this assertion comes from observational studies – the association is NOT causal.
This is all about causality.
We cannot say that drinking alcohol protects your heart, only that if you drink moderately, you are less likely to have heart problems. They sound the same, but they aren’t. The first is causation, the second is correlation, and if there’s one thing statisticians love to say, it’s “correlation is not causation”.
Studies measuring alcohol intake and heart problems are mostly either cross-sectional or longitudinal – they either look at people at one point in time, or follow them up for some time.
These are observational studies, they (probably) don’t change people’s drinking behaviour. Of course, people might change their behaviour a little if they know they have to fill in a questionnaire about their drinking habits, but we kind of have to ignore that for now.
Anyway, observational studies do not allow you to make causal statements like “drinking is good for your heart”.
Why not?
It comes down to bias/confounding – the same things I went through on Twitter when those researchers made their claims.
There are ways to account for this when comparing drinkers with non-drinkers, but they rely on knowing every possible way people are different.
Imagine the reasons why someone doesn’t drink very much. Off the top of my head, they:
Now imagine the reasons why someone doesn’t drink at all. The above holds true, but you can add in:
A confounder is something that affects both the exposure (alcohol intake) and the outcome (health). If you want to compare drinkers and non-drinkers, you need to account for everything that might affect someone’s drinking behaviour and their health. This includes many of the things I listed above.
But this is nigh-on impossible, as behaviours are governed by so many things. You can adjust out *some* of the confounding, but you can’t prove you’ve gotten ALL the confounding. You can measure people’s health, but you won’t capture everything that contributes to how healthy a person is. You can ask people about their behaviour, but there’s no way you’ll capture everything from a person’s life.
If you see, observationally, that moderate drinking is associated with fewer heart problems, what does that imply?
My last post was about how you really should have mechanisms to posit causality, i.e. if you say X causes Y, you need to have an idea of how X causes Y (or evidence from trials). This holds true here too.
Suppose alcohol protects your heart. How?
Fortunately, people have postulated mechanisms, and we can assess them: one possible mechanism is that alcohol increases HDL cholesterol (the good one), which improves heart health.
We can’t assign a direction to that mechanism using observational studies, since people who live healthily might have good HDL levels anyway, meaning they drink moderate amounts because they can.
To work this out (and to assign causality more generally), you can use trials. Ideally randomised controlled trials, since they’re so good. The ideal trial, the one where we wouldn’t need mechanisms at all, is one where we randomise people to drink certain amounts (none, a little, some, a lot) over the course of their life, make sure they stick to that, then see what happens to them.
Since that would never work, the next best thing is to test the proposed mechanisms, because if alcohol increases HDL cholesterol in the short-term (i.e. after a few weeks), then we’re probably on safer territory. We’d then have to prove that higher HDL cholesterol causes better heart health, but one thing at a time.
Well, a systematic review of trials was done to look at exactly that, fairly recently too (2011):
Effect of alcohol consumption on biological markers associated with risk of coronary heart disease: systematic review and meta-analysis of interventional studies
In total, there were 63 trials included, looking at a few markers of heart health, including HDL cholesterol. They found that alcohol increased HDL a little bit.
But there were problems.
The trials were a mix of things, but having looked at a few, it looks like many studies randomised small numbers of people to drink either an alcoholic drink or a non-alcoholic drink (the good ones had alcohol-free wine compared with normal wine), and they measured their HDL before and after the trial.
The problem with small trials is that they can have quite variable results, because there is a lot of imprecision when you don’t have enough people. You do a trial with 60 people and get a result. You repeat it with new people, and get an entirely different result.
That’s one reason why we do meta-analyses in the first place – one study rarely can tell you the whole story, but when you combine loads of studies, you get closer to the truth.
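For anyone who hasn’t seen one before, the combining step in its simplest (fixed-effect) form is just an inverse-variance weighted average; a quick sketch with made-up numbers:

```python
import numpy as np

# Made-up study results: mean difference in HDL (mmol/L) and its standard error
effects = np.array([0.10, 0.05, 0.12, 0.02, 0.08])
ses = np.array([0.04, 0.03, 0.06, 0.02, 0.05])

weights = 1 / ses**2                                   # precise studies count for more
pooled = np.sum(weights * effects) / np.sum(weights)
pooled_se = np.sqrt(1 / np.sum(weights))
print(f"pooled effect {pooled:.3f} "
      f"(95% CI {pooled - 1.96*pooled_se:.3f} to {pooled + 1.96*pooled_se:.3f})")
```

Real meta-analyses usually use a random-effects version of this, but the idea is the same: more precise studies get more weight.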
But academic journals exist, and they tend to publish studies that are interesting, i.e. ones that show a “statistically significant” effect of something, in this case alcohol on HDL. This has three effects.
Repeat a study enough times and you’ll eventually get the result you want. Since lots of people want alcohol to be beneficial to the heart, and because these trials are pretty inexpensive, there is a good chance that there are missing studies that were never published.
I’m aware this sounds like I’m reaching, and I could never prove that these things happened. But I can show, with relative certainty, that there are missing studies, ones that showed either that alcohol didn’t affect HDL or reduced it.
In meta-analyses, we tend to produce funnel plots, which show whether studies fall symmetrically around the average effect, i.e. the average effect of alcohol on HDL. Since studies should give results that fall basically randomly around the true effect of alcohol on HDL, they should be symmetrical on a funnel plot.
If some studies have NOT been published, i.e. ones falling in the “no effect” area, or those without statistical significance, then you see asymmetry.
We don’t know WHY these studies are missing, just that something isn’t right, and we should treat the average effect with caution. The link I gave above shows a nice symmetrical funnel plot, and an asymmetrical one.
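Drawing one is straightforward; here is a minimal sketch with made-up study results (effect = mean difference in HDL, se = its standard error), not the real data:

```python
import numpy as np
import matplotlib.pyplot as plt

effect = np.array([0.12, 0.10, 0.09, 0.15, 0.06, 0.05, 0.04])   # made-up mean differences
se = np.array([0.08, 0.07, 0.06, 0.05, 0.03, 0.02, 0.015])      # made-up standard errors

plt.scatter(effect, se)
plt.axvline(np.average(effect, weights=1 / se**2), linestyle="--")  # pooled estimate
plt.gca().invert_yaxis()            # most precise studies at the top, as is conventional
plt.xlabel("mean difference in HDL")
plt.ylabel("standard error of the mean difference")
plt.show()
```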
And here is the funnel plot I made from the meta-analysis data.
Note: I had to make this plot myself, the authors did not publish it – they stated in the paper:
No asymmetry was found on visual inspection of the funnel plot for each biomarker, suggesting that significant publication bias was unlikely.
See how the effect gets smaller (more left) as the “s.e. of md” goes down? That’s the standard error of the mean difference – the smaller it is, the more precise the result is, the more confident we are in the result. More people = smaller standard error.
With smaller numbers of people, the standard error goes up and the results become more variable. One study may find a huge effect, the next a tiny effect. The fact ALL the small studies found a comparatively large effect is extremely suspicious.
So yeah, there was asymmetry in the funnel plot for the effect of alcohol on HDL cholesterol. The asymmetry says to me that there are missing studies that showed no effect of alcohol on HDL cholesterol, and so the true effect of alcohol on HDL cholesterol is probably smaller than reported.
To be honest, there’s probably no effect, or if there is, it’s tiny.
To be fair though, I should say most of the studies had a small follow-up time. It’s entirely possible longer studies would have found a larger effect. The point is, we don’t know.
There are likely other proposed mechanisms, but I think the HDL mechanism is the one commonly thought of as the big one:
The best-known effect of alcohol is a small increase in HDL cholesterol
So, I don’t really see the evidence as being particularly in support of alcohol protecting the heart. The observational evidence is confounded and possibly has reverse causation. The trial evidence looks to be biased. What about the genetic evidence?
We use genetics to look at things that are difficult to test observationally or through trials. We do this because it can (and should) be unconfounded and is not affected by reverse causation. This is true when we can show how and why the genetics works.
For proteins, we’re on pretty solid ground. A change in gene X causes a change in protein Y. But for behaviours in general, we’re on much shakier ground.
There is one gene, however, that if slightly faulty, produces a protein that doesn’t break down alcohol properly. This is a good genetic marker, since people without that protein get hangovers very quickly after drinking alcohol, so tend not to drink.
One study found:
Individuals with a genetic variant associated with non-drinking and lower alcohol consumption had a more favourable cardiovascular profile and a reduced risk of coronary heart disease than those without the genetic variant.
Another study (in an Asian population) found:
robust evidence that alcohol consumption adversely affects several cardiovascular disease risk factors, including blood pressure, waist to hip ratio, fasting blood glucose and triglyceride levels. Alcohol also increases HDL cholesterol and lowers LDL cholesterol.
So alcohol may well cause higher HDL cholesterol levels.
Note that in genetic studies, you’re looking at lifetime exposure to something, in this case alcohol. So as above, a trial looking at the long-term intake of alcohol may find it raises HDL cholesterol.
It’s just, currently, the trial data doesn’t support this.
Halfway now, and I hope I have shown that the evidence alcohol protects the heart is shaky at best. This is kind of important for later. I don’t claim to have done a systematic or thorough search though, so let me know if there is anything big I’ve missed!
Let’s return to the article.
I got side-tracked by the article’s reference to the paper that said alcohol increases risk to loads of bad stuff, and has a J-shaped association with heart disease.
The media coverage of the Global Burden of Disease study is an example of why I mostly dislike research articles being converted into media articles. It is *exceedingly* difficult to convey the nuances of epidemiological research in 850 words to a lay audience. It just isn’t possible to relay all the necessary information that was used to inform the overall conclusion of the Global Burden of Disease study.
David Spiegelhalter’s flippant remarks at the end probably don’t help:
Yet Prof David Spiegelhalter, Winton Professor for the Public Understanding of Risk at the University of Cambridge, sounded a note of caution about the findings.
“Given the pleasure presumably associated with moderate drinking, claiming there is no ‘safe’ level does not seem an argument for abstention,” he said.
“There is no safe level of driving, but the government does not recommend that people avoid driving.
“Come to think of it, there is no safe level of living, but nobody would recommend abstention.”
The study itself states in the discussion that their
results point to a need to revisit alcohol control policies and health programmes, and to consider recommendations for abstention
Spiegelhalter seizes on the use of the word abstention to make the study authors sound more unreasonable than they actually are. I don’t think this is particularly helpful when talking about, well, anything. If you can make people who disagree with you look unreasonable, then it’ll make for an easier argument, but it doesn’t make you right and them wrong.
The other study the article takes aim at attempted to express the additional risk of cancers from drinking in terms of the risk from smoking, because the public in general understands that smoking is bad. I don’t have an opinion one way or the other for this method of communicating risk.
I’m quite happy to state I don’t know enough about communication of risk.
What I do know is that communicating risk is difficult, as few people are trained in statistics. Even those who are trained aren’t necessarily able to convert an abstract risk into their daily reality. So maybe the paper is useful, maybe not. I do not think their research question is brilliant, but my opinion is pretty uninformed:
In essence, we aim to answer the question: ‘Purely in terms of cancer risk – how many cigarettes are there in a bottle of wine’?
I don’t think it’s “shameless” (why should the authors feel shame?), and it isn’t a “deliberate conflation” of smoking and drinking. It’s expressing the risk of one behaviour as the similar risk you get from doing a different behaviour.
The article’s theory is that the authors wrote the paper for headlines (it’s worth stating here that saying “yeah, right” in an article makes you sound like a child):
Maybe they were targeting the media with their paper. In general, researchers pretty much all want their work to be noticed, to have people possibly even act on their work. That’s the whole point of research. It’s not a bad thing to want your work to be useful.
I dislike overstated claims, making work seem more important than it is, and gunning for the media at the expense of good work. But equally, researchers need their work to be seen. We’re rated on it now. If our work is shown to have “impact”, then it’s classified better, so we’re classified better, so our universities are classified as better. I dislike this (not least because it means methods work can be ignored, since it may take years for it to be appreciated and used widely), but there we go.
Questioning the paper’s academic merit is fine though, so what are the criticisms of the paper? There’s just one: that the authors chose a level of smoking that has not been extensively measured as the comparator.
The article says they used 35 cigarettes per week and “extrapolated” to 10 cigarettes per week, and called this “having a guess”.
It’s not extrapolation, and it’s not a guess.
The authors looked at previous studies, usually meta-analyses, to see what the extra risk of smoking 35 cigarettes a week was on several cancers, adjusted for alcohol intake. They made some assumptions with how they calculated the risk of 10 cigarettes a week: they assumed each cigarette was as bad as the next one, so assumed that each of the 35 cigarettes contributed to the extra risk of cancer equally.
This assumes a linear association between your exposure (smoking) and outcome (cancer), an incredibly common thing done by almost all researchers. It is actually interpolation though, not extrapolation (since the data point they wanted was between two they had). And it isn’t a guess, it’s based on evidence (with appropriate assumptions).
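The arithmetic is as simple as it sounds; with an invented relative risk (not the number from the paper):

```python
def interpolate_rr(rr_at_dose, dose, target_dose):
    """Linear interpolation of excess relative risk between zero and a known dose.
    Assumes each unit of exposure adds the same amount of excess risk."""
    excess_per_unit = (rr_at_dose - 1) / dose
    return 1 + excess_per_unit * target_dose

print(interpolate_rr(rr_at_dose=1.70, dose=35, target_dose=10))   # ~1.20 (invented numbers)
```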
The article says there is a single study estimating risks at low levels of cigarette smoking that should have been used. However, that study didn’t adjust for drinking, so it was useless for this study. For the study to be meaningful, they had to work out the extra risk from smoking on cancer independent from any effect of alcohol, since alcohol and smoking are correlated.
Finally, the study didn’t just report 10 cigarettes a week. They reported 35 cigarettes a week, so made no guesses or assumptions (beyond those made in the meta-analyses). So I think the criticism of the study was unfounded. The article felt otherwise:
OK, but all it was doing was communicating risk. If people can’t relate to smoking 10 cigarettes a week then it didn’t do that well, but how would anyone know? Has a study been done asking people?
This isn’t a war on alcohol, or a conspiracy to link alcohol and smoking so people stop drinking. It’s not a crusade by people that hate alcohol. It was trying to communicate the risk of alcohol to people who might not know how to deal with the statistics presented in dense academic papers.
The “decades of epidemiological studies” referenced is actually a paper from 2018, concluding:
The study supports a J-shaped association between alcohol and mortality in older adults, which remains after adjustment for cancer risk. The results indicate that intakes below 1 drink per day were associated with the lowest risk of death.
The J-shaped association could easily be confounding – teetotalers are different to drinkers in many ways (see above). But that’s not really “decades of studies” anyway, and the conclusion was that drinking very little or nothing was best.
The second reference is to a systematic review of observational studies. This is relevant to the point about decades of research, but not conclusive given they are observational studies.
The claim that the positive association between alcohol intake and heart health has “been tested and retested dozens, if not hundreds, of times by researchers all over the world and has always come up smiling” is facetious.
It betrays a lack of understanding of causality, of publication bias, of confounding and reverse causation.
Basically, a lack of understanding about the very studies the article is leveraging to support its argument. It shows ignorance of how to make causal claims, because the entire premise of the argument has been built on observational data.
This next part is inflammatory and wrong:
It certainly wouldn’t put you in the “Flat Earth” territory to believe that alcohol might not be good for you, unless you took as gospel that observational evidence was causal.
This reference is to observational studies, not “biological experiments”. I don’t know which biological experiments are meant here, maybe the ones I talked about earlier and dismissed? Also, the best observational evidence we have is probably genetic for many things, because the chance of confounding is slightly less. And the genetic studies say any level of alcohol has a risk.
There are certainly people who have agendas. People who want everyone to stop drinking. I do not doubt this. But who in the “‘public health’ lobby” is the article referencing? What claims have they made? Without references, it’s a pointless argument.
Also, public health people would like it if everyone stopped smoking and drinking, because public health would improve. That is, on average, people would be healthier – even if alcohol helps the heart, more people die of alcohol-related causes than would be saved by any protective effect of alcohol.
But this doesn’t mean public health people call for teetotalism.
To my knowledge, they generally advocate reducing drinking and in general, moderation. Portraying them as fanatics who “deny, doubt and dismiss” is ludicrous.
Prospective studies are good, because they can rule out reverse causation, i.e. heart problems can’t cause you to reduce alcohol intake if everyone starts with a good heart. But they do not address confounding. They are just as vulnerable to confounding as cross-sectional studies.
So prospective studies might be the best “observational” evidence (not “epidemiological” evidence, given we deal with trials too), but only if you want to discount genetics. And “best” doesn’t mean “correct”.
Statistical significance in individual studies is not something I have ever talked about in meta-analysis. Because it isn’t relevant. At all. In fact, if your small studies are all significant and your big studies aren’t, it’s probably because you have publication bias, i.e. small studies are published because they had “interesting” results, big ones because they were good.
The article is now comparing meta-analyses with 31 and 25 studies with one with 2 studies. Given the large variation in the effects seen in the studies from the previous meta-analyses, I wouldn’t trust the result of just 2 studies. I actually tried to find those 2 studies to see if they were big/good, but in the original meta-analysis paper, they don’t make it easy to isolate which studies those two actually are. So I gave up.
This part is a fundamental misunderstanding of statistics. Saying something is “not statistically significantly” associated with an outcome is not the same as saying something is “not associated” with an outcome.
There are plenty of reasons why even large associations may not be statistically significant. In general, it will be because you didn’t study enough people, or for long enough. How the analysis was conducted matters too, as does chance. And it takes as much or more evidence to show two things aren’t associated as it does to show they are.
If you start from the assumption that alcohol is good, then yeah, you would need evidence that there are risks from very light drinking. But why start from that premise?
We know that drinking lots is bad, so why assume drinking a bit is good? I can see why, when presented with evidence that moderate alcohol drinking and good heart health are correlated, people might think drinking is good for your heart. But what about every other disease?
In the absence of complete evidence, it would make sense to assume that if lots of alcohol is bad, some alcohol may also be bad. I think it is a bit much to start from the premise that because moderate drinking is correlated with good heart health, small quantities of alcohol are fine or good.
The burden of proof should be on whether alcohol is fine in any quantity. And then finding out how much is “reasonably ok”, and at which point it becomes “too much”.
And no, again, we don’t know that “very light drinking confers significant health benefits to the heart”, because this is a causal statement and you only have observational evidence. If you drink very lightly, your heart may well be in better shape than people who drink a lot or don’t drink, but that doesn’t mean the drinking caused your heart to be healthy.
I certainly dismiss this article as quackery with mathematics…
Actually, this is a good point, but is against the article’s argument. If you have low-quality, biased studies in a meta-analysis, that meta-analysis will be more low-quality and biased. Meta-analysis is not a cure for poor underlying research.
Stated somewhat more prosaically:
shit in, shit out
“Ultra-low relative risks” is relative. Most people won’t be concerned about small risks. But it makes a big difference to a population.
Research is often not targeted at individuals, it’s targeted at people who make policies that affect vast numbers of people. A small decrease in risk probably won’t affect any single person in any noticeable way. But it might save hundreds or thousands of people.
The article is guilty of the same thing. It “clings” to research that shows a beneficial effect of alcohol because it suits the argument. The observational evidence is confounded. It’s biased. The trial evidence is likely biased and wrong.
If all your evidence suffers from the same flaw (confounding, affecting each study roughly the same), then the size of your pile of evidence is completely irrelevant. A lot of wrong answers won’t help you find the right one.
A good example in a different field is survivorship bias when looking at the damage done to planes returning from missions in WW2. Researchers looked at the damage on returning planes, and recommended that damaged areas get reinforced.
Except this would be pointless.
Abraham Wald noted that planes that returned survived – they never saw the damage done to the planes that were shot down. If a plane returned with holes, those holes didn’t matter. Whereas the areas that were NOT hit did matter. It wouldn’t matter how many planes you looked at. You could gather all the evidence that existed, and it would still be wrong, because of bias.
The same is true of observational studies.
You can do a million studies, but if they are all biased the same way, your answer will be wrong.
The article makes the same ignorant point once again, conflating observational research with causal inference, while also cherry-picking studies. The facile point Snowdon makes about spending time on PubMed to reinforce his own views betrays his own flawed approach to the medical literature.
And that’s it for the article!
In summary, the article uses observational data to make causal claims, cherry picks evidence (while accusing others of doing the same), and seems to misunderstand basic statistical points about statistical significance.
But it also got me thinking about the previous conferences and training courses I’ve been on, and how tricky I find it to do something that seems to be pretty essential to an academic’s career: networking.
In this post, I’ll talk about my past conferences, and how I muddled through without any idea what I was doing. I still have no idea what I’m doing, so don’t expect any helpful tips or hints (I mean, “talk to people” seems to be the sole advice necessary, perhaps with the additional hint to look up who’s going to a conference and maybe hit them up with an email beforehand saying you’d love to meet for a chat). But if you feel like you don’t make the most of conferences, or have trouble starting conversations with people you don’t know, then at least you’ll know you’re not alone. It’s probably worth mentioning that I spoke reasonably frequently in my own department and took many courses there, but I don’t class that as at all similar to speaking at a conference or going away on a week(s)-long course.
The first conference I went to was in London (UK), a one-day conference organised by The Lancet in 2013 (1st year of my PhD). I wasn’t presenting, and, to be honest, I can’t remember much about this conference, other than I sat in a lecture theatre for a day, didn’t really move and didn’t speak to anyone. There are two impressions I took away: the first was that Ben Goldacre’s hair is magnificent; the second was that there was an early-year PhD student who spoke to this massive room full of professionals, and I thought “I can’t imagine doing that”.
I also wasn’t presenting at the second conference I went to, the causal inference meeting in Cambridge (UK) in 2014 (now the European causal inference meeting, how times change). I was massively out of place here – the conference was held in the mathematics department, which should have been my first clue. It turned out that my tiny epidemiology/medical statistics brain was unprepared for very technical lectures about things I didn’t understand. I guess the lesson here is to check the conference thoroughly before you go, to avoid wasting hours sat staring at a series of intimidating Greek symbols you can’t even guess the meaning of.
I went to a lot of local training courses (Bristol does loads), but this was my first training course that lasted more than a week, and it also happened to be on a different continent. The course at the University of Michigan (USA) was 3 weeks long: I did basic epidemiology for 3 weeks in the mornings, and three 1-week courses in the afternoons. The basic epi course was helpful, as were two of the other courses. The third course, however, was taught entirely in R and SAS, statistical software packages I had no understanding of. I thus did not attend after the first day; I couldn’t understand what was going on, and I figured my time could be better spent.
The University of Michigan is in Ann Arbor, which is great, and I spent a lot of time walking/running around the nice woods. I went with a colleague from my University, which can certainly be difficult – it’s one thing to see someone in the office day-to-day, but a three week trip to a different continent is very different. I think it went ok, but I can only speak for myself…
Overall, we managed to get to know a couple of other course delegates, but as many of the attendees were there as parts of groups, it was quite difficult to socialise. Still, we learnt things, which was the major focus of our time there. I also realised that home-cooking is great – I was pretty sick of takeaway and eating out by the end of the trip. On the way home, we stopped off at Washington and New York – because the air fare wasn’t any different, we didn’t have to pay for the flights (universities are great), but we obviously paid for our hotel rooms.
The third conference I went to was in Alaska, and it had a suitably long name (short names are for boring conferences). It was a 5-day event, and I was presenting a poster on the first day, which is especially fun when getting there is a 20-hour trip with a 9-hour time difference.
I’d never been to a conference of this size before, so had no real idea of what to expect. Still, I arrived, registered, and slept in preparation for all the questions I would undoubtedly be asked in the poster session. I also went through the conference schedule to find all the sessions I wanted to attend. I’d been given the advice not to bother with sessions that were clearly irrelevant or that I wouldn’t find interesting: it would be a better use of time to have an hour off, read something, or do some work instead. There were probably 5 or more parallel sessions at any one time, mixed in with plenaries and social events. As it turned out, because the conference had a very broad remit, I don’t think any two sessions I wanted to go to were ever on at the same time, which was nice.
I arrived promptly for my poster session. During the session, I spoke to all of two people. They were both lovely (I even went with one, and an Alaskan native, up a mountain on the final day, which was awesome). But it seemed something of an anticlimax to travel several thousand miles and only speak to two people about my work. It wasn’t even that my poster was incredibly dull (it was only slightly dull, I’m sure), it was that very few people came to a session where there were hundreds of posters on display. I was a little relieved I didn’t have to talk to too many people, but mostly felt deflated.
For the rest of the conference, I went to a Wellcome Trust organised event (as they sponsored my PhD), walked a little around Anchorage, and, as mentioned, went up a mountain with some conference delegates. The necessity of bear mace was a little daunting, but there were people literally running up the mountain (presumably for pleasure), so I figured it was fine. Although since there were quite a few people on the mountain, maybe the runners just felt they didn’t need to outrun the bears…
To round off 2014, I went to a conference in Glasgow (UK) – the 7th of its series. I was presenting a poster, and after presenting in Alaska I wasn’t really looking forward to it. I was vaguely aware that there would be a poster walk though, which I guessed meant someone official would lead a group around, and whoever was in the group would read the poster and might ask some questions.
I was therefore quite surprised when I found that I would, in fact, be part of the poster walk: everyone scheduled to stand by their posters during the session would in fact be required to give a talk about their poster to all the other people who were scheduled to stand by their posters. Looking at the layout of the posters, I figured out I would have to speak about halfway through the session. Nowadays, this would give me plenty of time to think of something to say (it was probably only a 5-minute speech at best), but as I had never given a public speech like this before, I felt some pressure.
Still, red-faced and stuttering, I gave a short talk about the work I had done the previous year and answered a few questions. I have literally no idea what anyone else’s posters were about – I was too busy racking my brain thinking of what I needed to say, or too busy feeling relieved to pay much attention.
Moral: even if you are just presenting a poster, be prepared to give a talk about it to 20 other poster people.
When I was looking around for conferences at which I could speak, preferably to an audience sympathetic to a new PhD student who hadn’t spoken at a conference before, I was advised that the Young Statisticians’ Meeting (YSM; I couldn’t find a good link) in Cardiff (UK) would be a good fit. In fairness, the YSM conference was indeed a good place to give my first presentation – there were only two parallel sessions, the crowd were nice statisticians (many of whom had come over from the Office of National Statistics (ONS) in Newport), and the other presenters were a mixture of ONS staff, early postdocs and PhD students like myself.
I had practiced my talk a fair few times, both to myself and with others, so felt prepared. I don’t remember the specifics, but I gave a talk about albatross plots in a lecture theatre (one of the old ones with wooden benches rising up in tiers I think), answered some questions, and only really felt nervous before speaking (I’ve rarely felt nervous once starting, too busy concentrating on speaking I guess). Afterwards, a nice man came up to me and chatted about my talk, although as far as I remember, this has to date been the only time people have chatted with me after a talk.
I’d love to say that after this talk, the ice had been broken and I never felt nervous before speaking to a crowd about my work again, but it doesn’t really work like that. Over time, I’ve become pretty inured to giving talks (and will usually quite happily talk in front of anyone about anything now), but I still get a little nervous at conferences.
In any case, any UK-based statistical-type people who are looking for a first-time conference, YSM is a good place to start. They’re friendly!
In the third year of my PhD (of four), in 2016, I decided I should probably go to more conferences and give talks. As such, I sent abstracts to the ISCB conference, the RSS conference and the Cochrane Colloquium. All the abstracts were about the albatross plots I developed, and I figured I would go to whichever ones accepted my abstract. As it turned out, they all did.
The ISCB conference was in Birmingham, and although some of the talks were relevant to my work, there wasn’t much there on what I was most interested in at the time (evidence synthesis). Still, I enjoyed the conference, and was looking forward to giving my talk. I was immediately daunted by the size of the lecture theatre though – it was a full-on 300/500/some large number seat auditorium, with a projection screen the size of a cinema screen. It was much bigger than I expected: given the number of parallel sessions, the scarcity of evidence synthesis talks at the conference, and how few people were in the auditorium for my session, I had assumed I’d be in a smallish room.
I was presenting last, so I read and re-read my presentation and paid no attention to the people on stage (yeah, it was a stage) before me. When it was my turn, I headed up, probably quite a bit more nervous than I’ve been since. The size of the screen behind me was a distraction – I was used to being able to gesture at the plots so people would know what I was talking about, but that wouldn’t work here. I was also distracted by the size of the stage – whenever I’d talked before, I’d had to stay pretty much in place to avoid being blinded by the projector or blocking people’s view.
I managed to say what I needed to though, and it probably went fine. I was asked some questions by people in the audience, and I answered as well as I could. I was given a recommendation to add something to the plot, which I instantly forgot because I was too busy trying to remember how to reply. It’s like exchanging names – I’m too busy trying to remember how to say my part (“hello, my name is…”) to remember the name of the person to whom I’m speaking. It would probably be easier if I went first… Still, like trading names, I didn’t feel I could whip out my phone and note down the recommendation before I forgot it.
In any case, I finished up and sat down (the chair said it was nice to hear about something “refreshingly vague”, which I still think was a compliment, but can’t quite be sure). There was a bit of time at the end before a plenary, so I spoke to one or two people who wanted to know a bit more – I even made an albatross plot for someone who was asking about, I think, including one particular trial in a meta-analysis with previous studies looking at magnesium sulphate to treat early heart attacks (hint: it looks weird in a meta-analysis).
After the little break was over, the plenary speaker got up to deliver his lecture. I realised immediately that it was the same person who had given me the recommendation I had now forgotten – and that he is probably one of the most famous living statisticians. Damn. I really wish I’d whipped out my phone and noted down what he said.
The RSS conference up in Manchester was much the same: lots of parallel sessions, not speaking to anyone, limited applicability. The difference was that this time I gave a talk in a smaller room, although to be honest I don’t really remember much about it. I guess it was unexceptional, apart from the amusing coincidence that the person speaking immediately before me also spoke immediately before me at the ISCB conference. That’s niche academia for you, I guess. I also went to an early career meet-up on the first night, but for whatever reason I wasn’t really in the right frame of mind to be exceptionally sociable, and I don’t think I saw anyone from that night again during the conference.
The Cochrane Colloquium in Seoul was quite different. Mostly because it was in Seoul. There were still many parallel sessions, although now my problem wasn’t finding something I wanted to go to, it was whittling down the things I wanted to go to, since they all happened at the same time. Overall, I’m not really a fan of more than a couple of parallel sessions – a few people from my university went to the conference (colloquium…), and we were all speaking at the same time. This was a bit disappointing, as I was looking forward to listening to a friend’s talk.
It also meant that each lecture room was pretty small. I guess that’s good for intimacy, but at the same time, I’d travelled across the globe to give a speech to a room that at best had 30-40 people in it. Probably more than were in the auditorium at ISCB, but that was 2 hours away by train at rush hour. I gave my speech, went to interesting talks, failed to win “best talk by a newbie” (understandably, I was just hopeful), went to some training-type sessions, the usual stuff.
Seoul had, however, quite a few differences to previous conferences. There were social things happening, and I got to know a few people. This really made a difference – like in Alaska, I could now go do things with people, including karaoke before and after soju (soju is great), going on a tour of Seoul, cooking our own bulgogi (Korean BBQ), making kimchi (dear lord, do not keep kimchi un-refrigerated in your hotel room), and going on a tour of the demilitarized zone (DMZ) between North and South Korea. For those that haven’t been, the DMZ is the no-man’s land between warring North and South Korea. The North side is all barren and military-esque. The South side has an amusement park, with rides. I… I’m not sure the South are taking this seriously.
In short, Korea was much better than the previous conferences, due both to the more relevant talks (good ol’ evidence synthesis) and to meeting people and being able to do things with them. So although I have no tips on HOW to achieve this (karaoke and soju work great, but there’s probably limited opportunity to get them both together), it was certainly a good thing at this particular conference. I imagine it also helped that I already knew some people at the conference – with the exception of the YSM, where a few people I knew were also attending, I had never been to a conference where I knew anyone. So maybe go with a friend, if possible.
If you can, go to this course in Wengen (Switzerland). You can ski. It’s brilliant.
People from my university also teach on it, and as they’ve taught me on short courses before, I can say that they’re pretty good. The courses I went on were also pretty good (James Carpenter was particularly good I thought). Wengen probably isn’t the most exciting place to go if you don’t like to hike up snowy mountains or ski/snowboard/toboggan down them, but it’s pretty good if you do.
There’s always a conclusion, right? Well, what I’ve learnt from my experiences is that conferences can be pretty hit and miss in terms of content – sometimes everything is interesting and sometimes very little is – and it can be difficult to get to know people, especially if they are already there with people they know. However, sometimes (and this is probably much more true of the longer conferences), you can make some good friends and have a great time. So far, I’ve only met people randomly – the dedicated “get to know each other” or networking sessions have never really worked for me.
I’m still nervous before giving a speech. Much less than I used to be, but still a bit. Practice has helped – the more conferences I do, the better I will be – but I also teach on some of my university’s short courses, and this practice has helped a lot too. As has the knowledge that in several years of giving talks, no one has ever slammed my ideas or been rude, literally everyone who has spoken to me has been nothing but nice and friendly.
So yes. If you’ve never been to a conference before, then start with something relatively small, preferably go with at least one person you know, possibly go to a few without a poster/talk, then go with a poster, then a talk. That’s the progression I followed, and it felt fine. Of course, you could always dive in with an international conference talk on the first go, it’ll probably be fine. I hope that people in other fields are as nice as they are in mine. And try to make friends, but if you don’t, that’s fine too.
I did the 3-minute thesis competition in 2017, and I sucked. Properly sucked. At one point I didn’t speak for 10 seconds, because I forgot what I was supposed to say. You get one slide in the 3-minute thesis, and 3 minutes to describe your thesis, and this format did not work well either with my content (my thesis is pretty long and complicated, and the underlying statistical problem that makes up the interesting part usually takes a little longer than 3 minutes to properly explain to a lay person) or with me. So all my speaking practice meant squat when it came to talking in a really limited time-frame in a competition. I’m fine with a casual chat where I talk about my work, but something about that competition made me into a babbling wreck of a speaker.
My supervisor who was there said it was fine. I don’t think I believe them…
If you’re wondering how I managed to go to so many conferences as a PhD student, it’s because I did my PhD with the Wellcome Trust, who gave me a reasonable sum that I could use on anything – recruiting patients, travel, training, lab reagents/equipment, data etc. If you ever think about doing a PhD in epidemiology, I would strongly recommend the Wellcome Trust’s PhD programmes. There are others, but it’s late, I’m tired, and I like Bristol.
As I said last week, my PhD was in epidemiology at the University of Bristol, and lasted for 4 years. My thesis was the written report describing what I did for the latter 3 years, and was assessed in my Viva.
Three years ago, when I was just starting my PhD proper, we discussed what I would do and in what order. My aim was to make PSA testing for prostate cancer better, since PSA is a bit awful at detecting prostate cancer (although there is nothing better at the moment). To do this, I intended to find an individual characteristic (things like age, ethnicity, weight etc.) that was associated with both prostate cancer and PSA.
Quick note: PSA is a protein made in the prostate that can be found in the blood – if the prostate is damaged, then more PSA is in the blood (be it from cancer, infection or inflammation). A high PSA level indicates damage to the prostate, but it is very hard to tell from what – only a biopsy is definitive.
If something is associated with prostate cancer, then the overall risk of prostate cancer changes with that something, which is then called a “risk factor” for prostate cancer. For example, Black men have a higher risk of prostate cancer than White or Asian men. If something is associated with PSA, then the overall level of PSA in the blood changes with that something. For example, taking the drug finasteride reduces PSA levels. Finasteride can be used to reduce the size of the prostate (for conditions like benign prostatic enlargement), and is also used for hair loss.
When PSA is used as a test for prostate cancer, a PSA level of less than 4.0 ng/ml is usually considered “normal” (although other thresholds of “normal” are used, such as 2.5 ng/ml or 3.0 ng/ml, and can depend on the age of the man). Since finasteride reduces PSA levels by half, this is sometimes taken into account by doctors – if a man has a PSA test and is taking his finasteride, his PSA might be doubled to get a more accurate reading. This is a simple example of adjusting test results to better fit each individual; if the PSA were not doubled, then it would be much lower than it should be, and might mask prostate cancer. So if something affects PSA, then it should be taken into account when measuring PSA.
Things become a bit more complicated when something affects both prostate cancer risk as well as PSA. If something increases the risk of prostate cancer (such as being Black), then on average, it also increases PSA levels, since prostate cancer also increases PSA levels. This effect can be removed if you just look at men without prostate cancer, but this is tricky, since prostate cancer is common and lots of men can have prostate cancer without realising or being diagnosed (the statistic at medical school was 80% of men at 80 years old have prostate cancer).
As an additional problem, because men have PSA tests before being offered a prostate biopsy to see whether they have prostate cancer, anything that affects PSA levels may look like it affects prostate cancer risk too. If something lowers PSA levels (like finasteride), then some men will go from having a PSA above the threshold for a biopsy, to having a PSA below the threshold. So although the risk of prostate cancer may be the same, because not all men with prostate cancer are DIAGNOSED with the disease, it can look like things that reduce PSA are protective for prostate cancer, and things that increase PSA are a risk for prostate cancer.
Below is a diagram showing the effects of increasing age on prostate cancer status (i.e. whether a man actually has prostate cancer) and PSA (PSA levels increase with age), and how this could affect prostate cancer diagnosis. We are reasonably sure that age increases both prostate cancer risk (same as many cancers) and PSA levels (as the prostate becomes more leaky over time, letting more PSA into the blood to increase the PSA levels in the blood), but it is not so clear for other things.
My PhD was to look for a variable (individual characteristic) that was associated with both prostate cancer and PSA, and to work out how much it affected each, and therefore how much PSA would need to be adjusted to account for the effect on PSA, without touching the effect on prostate cancer. So for age above, I would work out exactly how much an increase of 1 year in age increased PSA – the top right line in the diagram. Once PSA was adjusted for age, it would hopefully be better at finding prostate cancer, since it would no longer be affected by changes in age.
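To make that concrete, here’s a minimal sketch in Stata of the kind of adjustment I mean, using entirely made-up numbers and hypothetical variable names (psa and age), so don’t read anything into the specifics:

    * Hypothetical example: suppose each extra year of age multiplies PSA by 1.02
    * (a made-up number), and adjust every man's PSA back to a reference age of 50.
    * The variables psa and age are assumed to already exist in the dataset.
    gen psa_adjusted = psa / (1.02^(age - 50))

The real work, of course, is in estimating that age effect precisely enough (and separately from its effect on prostate cancer itself) for the adjustment to be worth doing.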
Before I even started my PhD proper, I created a Gantt chart that gave the deadlines for all the work I knew I would need to do. First, I would need to find a variable, then perform a couple of systematic reviews to find all the studies that looked at the associations between the variable, prostate cancer and PSA. Then, I would need to conduct a couple of meta-analyses, which would combine the associations to get my final results. I also wanted to use individual participant data (published papers in epidemiology usually give summary results, telling you the association between two things, rather than listing any individual participant results, which would generally contravene patient confidentiality). The individual data would be used to enhance the meta-analysis results, but individual data take a long time to source (about a year for all the data I requested). The Gantt chart I created is shown below.
This Gantt chart shows the 9 chapters I intended to write, the bare basics of what I would need to do for each, as well as when I would write up each chapter (“thesis production”). As far as plans go, this one was pretty limited, but it gave us a timeframe to work from.
In actuality, I didn’t stick to this very well at all. Chapters got removed or changed, new chapters were added, but the point I wanted to make remains: as I did each stage of research, I wrote up a chapter detailing what I did and how, as if each chapter was its own research paper. This meant when it came to the final 3 months, I was still adding new content, but the bulk of the work had already been done and I didn’t really need to remember what I did 2 years ago, it was right in front of me. Also, since I was writing up as I went along, I was forced to fully comprehend everything I was doing and why – doing it is one thing, but doing it and writing it down so other people could follow the rationale needs much greater understanding.
PhDs should all start with a plan (Gantt chart optional but useful), but writing up as you go along is definitely a huge time-saver in the end. I admit that what I wrote in the earlier years was… well… poor quality, but that’s to be expected. The only way to get better is to practice, and writing up as I went along gave me plenty of useful practice.
My thesis was split into 8 chapters (eventually), each with its own objectives (which I listed in the thesis itself in the first chapter, and below too). I had 2 introductory/background chapters, 5 analysis chapters, and a discussion chapter.
Chapter 1 | Provide background information on prostate cancer and PSA, as well as using PSA as a screening test for prostate cancer.
Chapter 2 | Describe evidence synthesis methodologies relevant to this thesis.
Chapter 3 | Identify individual characteristics that have a plausible association with both prostate cancer and PSA, and have a large body of evidence examining this association, then select a characteristic to examine further.
Chapter 4 | Perform a systematic review and aggregate data meta-analysis of studies examining the associations between the chosen characteristic, prostate cancer, advanced prostate cancer and PSA.
Chapter 5 | Identify and collect data from large, well conducted prostate cancer studies, then perform an individual participant data meta-analysis of the associations between the characteristic, prostate cancer, advanced prostate cancer and PSA.
Chapter 6 | Combine the results of the aggregate and individual participant data to estimate the associations between the characteristic, prostate cancer, advanced prostate cancer and PSA as precisely as possible.
Chapter 7 | Perform a Mendelian randomisation analysis to assess evidence for causality between the characteristic, prostate cancer, advanced prostate cancer and PSA.
Chapter 8 | Summarise the results, strengths and limitations from the thesis, and indicate what direction future work may take.
In addition to the 8 chapters, I had a title page (with a word count), an abstract, a dedication and acknowledgements page, a declaration (that I didn’t cheat), and then a section with the contents, list of figures and list of tables. My appendix contained information I thought was too specialist for the main thesis, or just surplus to requirements (but still interesting). Given this was a thesis, the specialist stuff was really niche… I also put 2 papers I published during the PhD as images at the end of the appendix (turning PDFs into images for inclusion in Word is incredibly irritating, but I thought it best to do it this way rather than combine the PDFs later) – these papers were relevant to the thesis; I published other papers but left them out. My appendix also had a list of acronyms; I included over 100 previously conducted studies in my thesis, most of which had acronyms, so a list of them (and of all the acronymised medical and statistical terms) was likely pretty useful.
Side note: writing papers for outside projects was also very beneficial during my PhD, and I would recommend PhD students do it if there’s time. Firstly, outside work can pay. Secondly, working with other people on other work increases your research skills and contacts, and counts as networking, something I still struggle with. Thirdly, it increases the number of papers you’re on, something I am told is very important in academia. Finally, concentrating on one piece of work for 3 years can be crushing – taking a break to do other work can paradoxically be relaxing. Teaching is also good to do, not least because I found teaching (for me, 30-40 people for only 1-2 hours on a short course) a great way to practice public speaking, which comes in handy at conferences. So yeah – PhD students, do extra work if you have time, it’s great.
My introductory chapters gave the background for prostate cancer and PSA testing I needed people to know before reading the rest of the thesis, and described fundamental evidence synthesis methods, which I would use extensively in the thesis. However, because my analyses were pretty disparate, I kept most of the chapter-specific methodology in the analysis chapters themselves. I imagine this makes it clearer when reading through – if you go 1 chapter at a time, all the information you would need is there in the chapter, you don’t have to flick back through to the introductory chapters.
My analysis chapters were written at the time of the analysis, with substantial editing later as I became a better writer (note: not a good writer, just better than I was). After I finished each chapter’s analysis, I wrote up what I did and sent it to my 3 supervisors (I hear this is unusual – most departments have 1 or 2 supervisors, but I know of one person in a different university with over 10). My supervisors read through and made comments – most chapters were read through and changed 3-4 times before I started compiling my thesis.
My analyses were iterative. Every piece of research is likely at least a little iterative – you start out with an idea, and gradually it becomes refined over time. Writing up each chapter after the analysis helped with this, since I could spot any errors. It did make my code a mess though, so much so that for the main analyses I rewrote the entire thing so it would be clearer. Although, since it’s code, I would happily rewrite it today to make it even more clear. Code seems to never be finished. In any case, I was still editing my analyses up to a month before submission, fixing little errors that crept in.
It wasn’t until 3 months before the deadline that I assembled a complete thesis out of the individual chapters I made. I read up on all the rules for the thesis from the university, and I created the list of figures/tables/contents that autogenerates in Word. I captioned all my figures and tables properly in Word so they would appear in those autogenerated lists. But at this stage, my supervisors still wanted to see individual chapters, so I was maintaining both an up-to-date thesis, as well as individual chapters, which got confusing. I’m not sure what the best way is here; creating the thesis as a whole was important and took a good day or two to do correctly, and was a good psychological boost, but maintaining different copies of files is asking for trouble.
Side note: I use Mendeley for referencing, rather than EndNote (the two options available to me). Mendeley has three advantages: 1) it’s free, 2) the library of references synchronises between my work and home computers, and 3) it makes Word crash less often than EndNote. In total, I had 306 references, which was a strain on Word, so preventing crashes was very important. There are few things as irritating as losing 5 minutes of work on your thesis, when that work was fixing typos on 20 different pages that you now can’t remember.
I eventually produced a preliminary-final-draft of my thesis, and had to FTP (file transfer protocol, used instead of email for large files or for secure sending) it to my supervisors because it was too large for an email (although we also used a shared drive, but this isn’t accessible at home). My supervisors read through it and made any more comments – in total, my supervisors probably read through my entire thesis 5 times at different stages of its development. This is likely an enormous positive of starting writing early on: it gives supervisors a much longer time window to make helpful suggestions.
My final-final-draft was submitted a day or two before the deadline. At this stage, I was beyond caring whether I had made any mistakes. Everyone had read through it multiple times, and I just wanted it to be done. I think I made a half-hearted attempt to read through one last time but gave up, and just sent it. I could always make corrections later.
Writing a thesis takes up a substantial amount of life. This can be drawn out over years, or it can be condensed into the smallest possible time. I favour drawing out the process – starting early has so many advantages, whereas starting with 3 months to go is likely to cause an overwhelming amount of stress. My PhD was 4 years, and the limit for PhDs in my university is also 4 years, so there was no chance of taking more time at the end of the PhD to write up – the deadline was final and immovable (in reality, I’m sure they would give more time if necessary, but they say you need a very good reason to do so).
So yes, start early. This advice could go for literally all work (including the homework I always left to the last minute), but with a thesis, it’s completely worth it.
My PhD was in epidemiology at the University of Bristol, and lasted for 4 years. The first year consisted of 3 mini-projects, which each took about 3 months, followed by 3 months of preparation for the following 3 years, mostly refining my plan for the research I would conduct.
My research mainly looked at whether body-mass index (BMI) was associated with prostate cancer and/or prostate-specific antigen (PSA). The aim was to see if, by precisely estimating the associations using previous data, I could make PSA testing more accurate. PSA testing needs to become more accurate to be clinically useful – presently, about two-thirds of men with a high PSA level don’t have prostate cancer, which means many men have prostate biopsies who don’t need them. I’ll write more about my PhD later, I’m sure.
PhD Vivas are probably one of the more stressful experiences of any PhD student’s life. But before the Viva can take place at all, the student needs to write their PhD thesis, a write-up of all they did during their PhD (or, at least, the bits that they want to talk about). When a PhD student is writing their thesis, it is usually best not to ask when it might be done (by the deadline, hopefully), how they are getting on (not well), whether it’s fun (it’s not) or whether they have any free time (they don’t). Lots of students get very stressed during the thesis write-up, but in general I imagine the better the supervisor, the more stress-free the experience (my supervisors were great).
My thesis had a word limit of 80,000 words – I ended up using about 65,000, with an extra 20,000 words in my appendix. In it, I gave an overview of the prostate, prostate cancer and PSA testing, a description of the methods I would be using in the thesis, and then 5 chapters detailing individual pieces of research, which led to a final discussion chapter. It took a while to write – I wrote as I went along, but it was still 3 months or so of editing and adding new content at the end to get it to the final copy. I’ll talk about the thesis later – lots of PhDs I know have had trouble knowing what to do for it and when, so I will write about my experience.
Once my thesis was complete, I posted two copies to my two Viva examiners. Because I had worked in my department before and during my PhD, I had two external examiners – two academics in related fields who were chosen by my supervisors to assess the scientific merit of the work I had done. Usually, PhDs have an internal and an external examiner, i.e. an examiner from the same university (although hopefully unconnected to the student, otherwise there could be bias), as well as someone from a different university. As it happened, one of my examiners didn’t receive the thesis, and my supervisor had to email them a copy (note to all PhD students: have your supervisor check your externals received your thesis).
My examiners had over a month to read my thesis and make comments on it. This task cannot have been fun – 65,000 words is the size of a small-to-medium novel, but written in the dense, complicated style of a scientific journal article (which tend to be 2,000-3,000 words in my experience). It’s also a fairly unrewarded task – I think the examiners might receive a small fee and any expenses, but nothing like what would cover their time reading the damn thing. So I offer huge thanks to my examiners, and to all examiners.
After submission, I had a week off. I started a job after that, but it was in the same department working with the same people, and it was a job specifically created for PhDs to write up parts of their thesis for publication in journals, revise for the Viva, and conduct some new research. It is a great job, and gave me time to revise for my Viva. In total, I had 2 months between submission of my thesis and the Viva.
In those two months, I can’t say that I did as much preparation as I could have. I think I read through my thesis once. This is always a painful task, since you inevitably notice typos, errors, and just generally unclear bits. But it needs to be done so you can talk about everything you did in the thesis. So I made notes of what I needed to change after the Viva, as well as any bits I needed to revise.
I looked up typical questions that are asked in most Vivas, and jotted down some answers. Mostly, it was along the lines of “which methods did you use and why, and which others could you have used?”, and “what potential impact will your research have?”. My supervisors had me write a detailed 2-3 minute talk about what I did and rehearse it, just in response to the most common opening question – what did you do? One piece of advice I received was to nail 5 key points, and have a couple of optional extras in case you need them. This was good advice – the first question was indeed to describe what I did.
The Viva itself was to test 3 things (or so I’ve been told): 1) did I write the thesis; 2) do I know the science behind what I did; 3) is the science any good? The first and third points I had little concern about – I knew I wrote the thing, it took up a large chunk of my life and I went through it before the Viva – and I knew both that the methods I used were valid (I published a paper using similar methods during the PhD), and that my supervisors are good at what they do. If there were any problems they would have spotted them, and anyway, there was nothing I could do about the science once the thesis had been printed.
That left the second point, and the one I focused on the most. While I have a good grasp of systematic reviews and meta-analyses, two techniques I made extensive use of in my thesis, I also used Mendelian Randomization (which I have used less and am thus less confident with), and several statistical methods that I could explain in simple terms but would be stumped to go into detail with. I therefore spent time reading through anything I was unsure about – this turned out to be almost completely unnecessary, but I would say that it was likely worth my time revising those concepts anyway.
At the end of the Viva, the examiners make recommendations to the exam board, which determines whether the student passes, and if so whether they need to make any corrections. Passing without corrections is rare in my department (1-2%). Minor and major corrections account for almost all Viva results. These are both still passing grades; the difference is the amount of time required to get the thesis up to a standard the examiners would like. Minor corrections should take less than a month to make, whereas major corrections could take up to 6 months. If someone is working, this could be taken into account and major corrections given so the student has more time to work on them. Rarely, but sometimes, the examiners recommend the student take a year to redo their work and resubmit – this is not a passing grade. There may be worse outcomes, but these would be vanishingly rare in my department. All examiner recommendations go to the exam board, which has the final say, but I don’t know in what circumstances they would ever not go with the examiner recommendations.
My department are great – I had two mock Vivas before the real thing. These were two hour-long sessions where professors and researchers read through specific chapters of my thesis and grilled me on them. This was both good practice for being grilled (although as an academic, I’m fairly used to that from supervisor meetings, conferences and other presentations), and finding any weak parts of the thesis that need my attention.
They were set up as close to the real thing as possible. The mock-examiners met in a room and discussed what they would ask and how, then I was called in. They asked about what I did in general terms, then went through the chapters in question systematically, asking why I did things, could I have done them other ways, pointed out issues and made recommendations if appropriate. At the end, they gave some feedback.
After the mocks, I felt more prepared, because I could answer most of their questions well enough to satisfy both myself and them. However, I was talking to a friend who did their Viva shortly before me, and they felt that if they had a mock that went poorly, their confidence would have been shaken for the real thing and they would have done worse.
As it turned out, my mocks were in fact much harder than the real thing.
My Viva started about midday. That morning, I read through key parts of my thesis and thought through some questions that might come up, but generally took it easy. I arrived about 10, bought some lunch, and found a quiet place to revise.
Because I had two external examiners, I also had an internal chair, someone to do the introductions and make sure everything is above board. I managed to catch them as they went in with the examiners for lunch and let them know where I’d be, so they could come fetch me when they were ready.
About half an hour after they went in, I was collected. After a quick introduction, the examiners took turns to ask me questions. It started with the general “what did you do?”, and progressed from there. The difficult questions that came up in the mocks made no appearance – what I remember talking about most were the new methods I developed in order to do my research slightly better, which happened to be the bits I enjoyed the most and liked talking about. There were a few things I had to change to satisfy them, but mostly this was putting work back into the thesis I had taken out, thinking it was too much detail. Overall, it was a pretty enjoyable experience, and was over (apparently) quite quickly – all told, I was in there about 1 hour 40 minutes.
Once over, the examiners asked me to leave while they had a discussion, and I was shortly called back in to be told I passed with minor corrections. We then made some awkward chitchat – my supervisors were coming to talk to (and thank) the examiners, but had planned on me being in there longer. However, everyone eventually arrived, more chitchat was exchanged, and we all left feeling quite happy.
My supervisors and I went for hot chocolate afterwards.
I’ve spoken to a few people about their Viva. Some enjoyed it, some hated it. Examiners can be great (mine were lovely), or they can be awful (someone said their examiners were known for asking extremely awkward and difficult questions, which isn’t really the point). I think the fear of the Viva is a bit disproportionate to the risk of failure – it’s true that it’s a very important exam, but so long as the supervision has been good and the supervisors are happy with the thesis itself, then there should be little risk of failure. Examiners (in general) want you to pass. And problems with the science (I admit complete ignorance of non-scientific PhDs) can be fixed in corrections. Some people will of course fear talking about their work in front of strangers for two hours, in which case practice may help (as does having a supportive department – I’ve heard of professors slamming students’ work in seminars; this is not constructive).
I think the Viva can be also something of an anticlimax for many students. A PhD in this country usually takes 3 years, mine took 4, and a thesis can take months to write. There can be months of waiting between submission of the thesis and the Viva, and then hours to wait on the day. Then the Viva happens, and it goes pretty quickly. Then it’s done. Years of work, judged in a couple of hours. And then you’re a doctor, hopefully (although I’ve never actually found out when you legitimately become a doctor, probably after graduation). Still, the relief is good.
So overall, I liked my Viva. The mock Vivas were helpful, if not for the specific questions then for the experience. Preparation was not massively important for my Viva, but could easily be for others – reading through my thesis once would likely have been sufficient, though I may have felt woefully unprepared with only that. And examiners should definitely be thanked, often and well.
In the second year of my PhD, I conducted a systematic review of the association between milk and IGF (another future blog post to write). We found 31 studies that examined this association, but the data presented in their papers was not suitable for meta-analysis, i.e. I couldn’t extract relevant, combinable information from most of the studies. There was lots of missing information, and a lot of different ways of presenting the data. The upshot was, we couldn’t do a meta-analysis.
This left few options. Meta-analysis is the gold standard for combining aggregate data (i.e. results presented in scientific articles, rather than the raw data), for good reason. The main benefits of meta-analysis are an overall estimate of the effect (the association between two variables, the effect of an intervention, or any other statistic), and a graphical display showing the results of all the studies that went into the analysis (a forest plot). The alternatives to a meta-analysis don’t have those benefits.
The least statistical thing we could have done is a narrative synthesis, where each study found in a systematic review is described, with some conclusions drawn at the end. A narrative review is different to a narrative synthesis, and does not involve a systematic search and inclusion of studies; it’s more of a cherry-picking method. In either case, the potential for bias is large (conscious or sub-conscious), it takes forever to write the damn things, and it also takes forever to read them. I really wanted to avoid writing a narrative synthesis, and so looked around for other options.
There were a couple of statistical options that would mean less writing. The first was vote counting (and yes, I’m linking Wikipedia articles as well as academic papers – I think it’s often more helpful than linking a whole paper or a textbook, which I know most people don’t have access to). Vote counting is where you add up the number of studies that have a positive, negative or null effect (as determined by a P value of, usually, less than 0.05). It’s the most basic statistical evidence synthesis method. It’s also an awful use of statistics (dichotomising P values at a threshold has been criticised since the turn of the millennium), and even if you decide there is evidence of an effect, you can’t tell how large it might be. Combining P values is a bit more helpful; this takes all the P values from all the studies you want, and spits out a combined P value, indicating the amount of evidence in favour of rejecting the null hypothesis (i.e. that two variables are associated, or that a treatment works). This is slightly better than a vote count, as the P values are not dichotomised. Again, though, there is no way to tell how large the effect might be: a really tiny P value might just mean there were lots of people in the studies.
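To give a flavour of how combining P values works, here’s a minimal sketch of Fisher’s method (one common approach, not necessarily the exact method used in any paper I mention) in Stata, with three made-up P values: minus twice the sum of their natural logs is compared against a chi-squared distribution with 2k degrees of freedom for k P values.

    * Fisher's method on three made-up P values (illustrative numbers only)
    scalar X = -2*(ln(0.04) + ln(0.20) + ln(0.008))
    scalar p_combined = chi2tail(2*3, X)    // 2k degrees of freedom for k = 3 P values
    display "Combined P value: " p_combined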
We considered creating harvest plots, which are bar charts showing vote count results with added information to show how confident you are in the results, so that well-conducted, large studies are more heavily weighted than poor-quality, small studies. The graphs let you see all the data at once, which we thought was quite good. In the end, though, we decided not to use these plots for two reasons. The first was that we thought we could make better use of the data we had (we knew the number of people in each study, what the P value was, and which direction the effect was in). The second reason was that we couldn’t make the plots, at least not easily. There was no software we could find that made it simple, which makes an important point: if you design a new method of doing something, make sure that you write code so other people can do it. No one will use a new method if it takes days of tinkering to make it work, unless it really is much better than what came before. Michael Crowther has made this point before, and as a biostatistician, he knows what he’s talking about.
The data we extracted from each study were: the total number of participants, the P value for the association between milk and IGF, and the direction of the association (positive or negative). Because I am a simple person, I plotted P values on the horizontal (x) axis, against the number of participants on the vertical (y) axis. Really low P values for negative associations were on the far left of the plot, and really low P values for positive associations were on the far right, and null P values (P = 1) were in the middle.
I put the axes on logarithmic scales to make things look better, so each unit increase along the x-axis on the right-hand side was a 10-fold decrease in P value (from 0.1 to 0.01 to 0.001 etc.). The first plots looked like the one below, with each study represented by a point, P value along the x-axis and number of participants along the y-axis.
This plot showed the P values from all the studies examining the association between milk (and dairy protein and dairy products) and IGF-I. The studies were split into Caucasian (C) and non-Caucasian (NC) groups. The left of the plot shows studies with negative associations (i.e. as milk increases, IGF-I decreases), and the right of the plot shows studies with positive associations. The P values decrease from 1 in the centre towards 0 at the edges.
This looked like it might be a good way of displaying the results of the systematic review, as we could see that there was likely an overall positive association between milk and IGF-I. But we still couldn’t tell what the overall effect estimate might be, or how large it could be. On the bright side, we could see that the largest studies all had positive associations, and we could easily identify outlier studies, for example the two studies with negative associations and reasonably small P values.
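For anyone wondering how a plot like this can be put together, here’s a rough sketch in Stata with entirely made-up data – this is just my illustration of the layout, not the code we actually used. Each study contributes its sample size, P value and effect direction, and the x-axis is a signed log scale so that P = 1 sits in the middle and small P values sit at the edges.

    * Made-up example data: sample size, P value and effect direction (+1 or -1)
    clear
    input n p direction
       50 0.30  1
      200 0.04  1
     1000 0.002 1
       80 0.20 -1
    end
    * Signed log scale: 0 at P = 1, moving outwards as P shrinks in either direction
    gen x = direction*(-log10(p))
    twoway scatter n x, yscale(log) xtitle("P value (signed log scale)") ytitle("Number of participants")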
It was who suggested putting effect contours onto the plots to make them more interpretable. Effect contours are lines added to the graph to show where a particular effect size should lie. For all studies that calculated their P value with a , where the effect estimate divided by the standard error of the estimate is compared with a normal distribution to calculate a P value, there is a defined relationship between the effect size, number of participants, and the P value. For a particular effect size, as the number of participants increases, the P value must decrease to make everything balance.
This makes sense – for a small study (say 20 people) to have a tiny P value (say 0.001), it must have a huge effect, whereas a large study (say 20,000 people) with the same P value (0.001) must have a much smaller effect size. I’ll detail the maths in a future post, but for now I’ll just say that I derived how to calculate effect contours for several different statistical methods (for example, linear regression and standardised mean differences), and the plots looked much better for it. For the article, we also removed the distinction between Caucasian and non-Caucasian studies, and removed the two studies with very negative associations. This wasn’t to make our results seem better – those two studies were qualitatively different and were discussed separately, immediately after we described the albatross plot. The journal also apparently bleached the background of the plot, which I wasn’t overly happy with.
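As a rough illustration of that relationship (and I’m not claiming this is exactly the derivation used in the paper): for a Z test of a standardised beta b, the standard error is roughly 1/√n, so a study sits on the contour for effect size b when n ≈ (z/b)², where z is the critical value corresponding to the observed two-sided P value. A quick sketch in Stata for b = 0.10:

    * Approximate contour for a standardised beta of 0.10, assuming SE ~ 1/sqrt(n)
    foreach p in 0.05 0.01 0.001 {
        display "P = `p'  ->  n of roughly " round((invnormal(1 - `p'/2)/0.10)^2)
    }
    * The full contour: the n needed to reach each two-sided P value at b = 0.10
    twoway function y = (invnormal(1 - x/2)/0.10)^2, range(0.0001 0.9) n(400) xscale(log) yscale(log)

Repeating this for a handful of effect sizes gives a family of curves, which is essentially what the contours on the published plots are (each statistical method needs its own version of the formula).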
Now the plot has contours, it’s much easier to see what the overall effect size might be. The effect contours were for standardised beta coefficients, where the beta is the number of standard deviations (SDs) change in the outcome for a 1 SD increase in the exposure. It’s not the most intuitive effect estimate, but it has good statistical properties (unlike normal beta coefficients). For our purposes, a standardised beta coefficient of 0.05 was a small effect, 0.10 was small to medium, and 0.25 was medium to large. Our exact wording in the paper is here:
Of the 31 data points (from 28 studies) included in the main IGF-I analysis, 29 data points showed positive associations of milk and dairy intake with IGF-I levels compared to two data points that showed negative or null associations. The estimated standardized effect size was 0.10 SD increase in IGF-I per 1 SD increase in milk (estimated range 0.05–0.25 SDs), from observation of the albatross plot (Fig. 3a). The combined p value for a positive association was 2.2×10⁻²⁷. All studies with non-Caucasian subjects displayed a positive association between milk intake and PCa risk; in particular, two studies [30, 52] had p values of 0.0001 and 0.001, respectively, and both studies had a sample size of less than 100. When considering only Caucasians, the overall impression from the albatross plot did not change; the effect estimate was still considered to be around 0.10 SD. Of the 31 data points, 18 had an estimated standardized effect size between 0.05 and 0.25 SDs and four had an effect size of more than 0.25 SD. Eleven of these data points (61%) used milk as an exposure, including two that had an estimated standardized effect size of more than 0.25 SD [30, 53].
For the plots to be drawn, all that is (generally) required is the total number of participants, the P value and the effect direction. However, some statistical methods either require more information, or require an assumption, to actually draw the contours. For example, standardised mean differences require the ratio of the group sizes to be specified (or assumed), so the contours are drawn for a specific effect size given a particular group size ratio (e.g. 1 case for 1 control). If you can assume all the studies have a reasonably similar group size ratio, then it’s generally fine to set the contours to be created at this ratio. If the studies all have very different ratios, then this is more of a problem, but we’ll get to that in a bit.
In general, if the studies line up along a contour, then the overall effect size will be the same as the contour. If the studies fall basically around the contour (but with some deviation), then the overall effect size will be the same, but you’ll be less certain about the result. The corresponding situation in a meta-analysis would be studies falling widely around the effect estimate in a forest plot, with a larger confidence interval as a result. If the studies don’t fall around a contour, and are scattered across the width of the plot, then there is likely no association. If the larger and smaller studies disagree, then there may be some form of bias in the studies (e.g. small-study bias).
When interpreting the plots, I tend to focus on how closely the studies fit around a particular contour, and state what that contour might be. I also tend to give a range of possible values for the contour, so if 90% of the studies fall between two contours I will mention them. It is incredibly important not to overstate the results – this is a helpful visual aid in interpreting the results of a systematic review, but it is not a meta-analysis, and you certainly can’t give an exact effect size and leave it at that. If an exact effect size is needed, then a meta-analysis is the only option.
The plots can be created in any statistical program (I imagine Excel could do it, I just wouldn’t want to try), and I wrote a page of instructions on how to create the plots that I think was cut from the journal paper at the last minute. I will dig it out and post it for anyone who might be interested. However, if you have Stata, you don’t need to create your own plots, because I spent several weeks writing code so you don’t have to.
Writing a statistical program was a great learning experience, which I’m fairly certain is what people say after doing something that was unexpectedly complicated and time-consuming. The actual code for producing an individual plot is trivial, it can be done in a few lines. But the code for making sure that every plot that could be made works and looks good is much more difficult. Trying to think of all the ways people using the program could make it crash took most of the time, and required a lot of fail-safes. I’ll do a future post specifically about the code, and say for now that in general, writing a program is not too different from writing ordinary code, but if it involves graphics it might take much longer than expected.
If you want to try out albatross plots (and I hope that you do), and you have Stata, you can install the program by typing: ssc install albatross. The help file is very detailed, and there’s an extra help file containing lots of examples, so hopefully everything will be clear. If not, let me know. I’ll do a video at some point showing how to create plots in Stata, for those that prefer a demonstration to reading through help files.
The program has a feature that isn’t discussed in the paper, as I still need to prove that it works fine all the time. I’m pretty sure it does, and have used it in all the papers albatross plots are a feature of. The feature is called adjust, and it does just what it says: it adjusts the number of participants in a study to the effective sample size. This is the term for the number of participants that would have been required if something had been different, so if the ratio of group sizes was 1 (when in fact it was 3.2, or anything else). The point is to make the results from studies with different ratios comparable: statistical power is highest when the group ratio is 1, so studies with high group ratios will look like they have smaller effect sizes than they do, just because they have less power. I will write another post about adjust when I can, as it is a very useful option that definitely raises the interpretability of the plots.
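For what it’s worth, a common definition of the effective sample size for a two-group comparison is the size of a balanced study with the same power, which for groups of size n1 and n2 is 4·n1·n2/(n1 + n2). I can’t promise this is exactly what the adjust option computes, but it gives the idea:

    * Effective sample size for a study with unequal groups (hypothetical numbers)
    scalar n1 = 900                             // e.g. controls
    scalar n2 = 100                             // e.g. cases
    display "Effective n: " 4*n1*n2/(n1 + n2)   // 360, versus 1,000 actual participants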
So far, I have identified two main uses of the plots. The first use was shown in the paper on milk and IGF, and is when a meta-analysis just isn’t possible given the data that are available. The World Health Organisation also used the plots this way in one of their systematic reviews, where again, meta-analysis was not possible.
The second use of the plots is to compare the studies that were included in a meta-analysis with those that couldn’t be included because they lacked data. This lets you determine whether the studies that couldn’t be included would have changed the results if they could have been. I have yet to publish my results, but I used the plots in my thesis to compare the results of studies that were included in a meta-analysis looking at the association between body-mass index and prostate-specific antigen to those that couldn’t be included. We found three out of the four studies were consistent, but one wasn’t (I mean, it isn’t as if it’s completely on the other side of the plot, but it certainly isn’t as close to the other studies as I’d like). The study that wasn’t consistent was conducted in a different population to all the other studies, so we made sure to note there were limits on the generalisability of our results. The effect contours are for standardised beta coefficients again. The meta-analysis of the included studies gave a result of a 5.16% decrease in PSA for every 5 kg/m2 increase in BMI. This is roughly equivalent to a standardised beta of -0.05, which is good, because that’s what I’d say is the magnitude of effect in the albatross plot below.
I think albatross plots are pretty useful, and made my life easier when otherwise I would have had to write a long narrative synthesis. By putting the code online, we made sure that people who wanted to use the plots could. By providing a tonne of examples, hopefully people will know how to use the plots too.
Incidentally, the albatross in the name refers to the fact the contours looked like wings. Rejected names include swan plot, pelican plot and pigeon plot. Oh, and if any of you are thinking albatross are bad luck, they were until a captain decided to . A lesson for us all.
The series started in 2009 and is written with a medical audience in mind, particularly physicians who need to read and evaluate medical literature. It runs through (unsurprisingly) how to evaluate publications, but in doing so gives quite a lot of useful information about medical research in general. It is published by , a weekly German-language medical magazine, although the series itself has been translated into English. You certainly don’t need to read all of the articles, and they are fairly easy to read, so feel free to dive into any of them. The articles were written by experts with reference to textbooks, academic articles and their own experiences.
I’ll write about each of the articles briefly, say who I think the articles will be most useful for, and link the PDFs (open-access, so anyone can view them). In a later post, I’ll write about medical research from a more practical perspective, rather than the perspective of a physician needing to understand an academic paper. Fair warning, I haven’t read through the entirety of all the papers, and will update this post when I have if anything needs to be changed.
This article describes the structure of scientific publications (introduction, methods, results, discussion, conclusion, acknowledgements, references), and points out some key things to look for to check the paper is trustworthy. It contains a checklist for evaluating the quality of publications, although many of the criteria are study-specific and don’t apply to all studies – there are better tools for assessing risk of bias (which is essentially what the checklist is getting at), which I’ll write about later.
Definitely worth reading if you are just starting to read medical research.
This article describes six aspects of study design, which are important to consider both when reading articles and before conducting any studies. The aspects include:
Worth reading if you are just starting to read or conduct medical research.
This article classifies primary medical research into three distinct categories: basic research; clinical research; and epidemiological research. For each category, the article “describes the conception, implementation, advantages, disadvantages and possibilities of using the different study types”, with examples. The three categories include many subcategories, presented in a nice diagram.
Worth reading for background information on study types; I will cover something similar in a future post.
This article describes P values and confidence intervals, pretty essential concepts that all medical researchers should understand. Helpfully, the article describes the pitfalls of P values, such as dichotomising the result of a statistical test into “significant” and “non-significant”, and the difference between statistically significant (a phrase banned in my research department) and clinically significant (much more useful).
Very useful reading for anyone who wants to understand more about P values and confidence intervals. I will also do a post about P values in the future, because it is such an important topic.
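If you want to see the two side by side, Stata’s immediate commands are handy. A made-up example: two groups of 50 people, with mean blood pressures of 132 and 138 mmHg (SD 15 in both):

* immediate two-sample t test: n, mean and SD for each group
ttesti 50 132 15 50 138 15

The output gives both the P value and the 95% confidence interval for the difference – the confidence interval tells you how big the difference might plausibly be, which the P value alone doesn’t.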
This article is focused on laboratory tests, but gives a description of sensitivity, specificity and positive predictive value, all concepts that are used when designing and interpreting tests (e.g. blood tests predicting disease).
This is useful reading for health practitioners (especially those ordering and interpreting tests), but is pretty specialist information. That said, knowing about sensitivity and specificity, or at least knowing that medical tests aren’t generally foolproof, would benefit many people.
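The arithmetic is simple enough to do by hand. Suppose (made-up numbers) a test correctly flags 90 of 100 people with a disease, misses 10, wrongly flags 50 healthy people, and correctly clears 850:

display "sensitivity = " 90/(90 + 10)
display "specificity = " 850/(850 + 50)
display "positive predictive value = " 90/(90 + 50)

Even with a decent test, only about 64% of the people who test positive actually have the disease – which is the sort of thing worth knowing before panicking about a result.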
This article details methods of evidence synthesis (i.e. taking previous study results and combining them for a more complete understanding of a research question), which is one of my areas of expertise.
I will write more in-depth posts about systematic reviews and meta-analyses, but this is useful as an overview.
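As a taster of the sort of thing a meta-analysis does, here is a two-study inverse-variance pooling with made-up numbers (real meta-analyses use dedicated commands, but the principle is just a weighted average, with more precise studies getting more weight):

* study 1: log risk ratio -0.22 (SE 0.10); study 2: log risk ratio -0.11 (SE 0.15)
local w1 = 1/0.10^2
local w2 = 1/0.15^2
display "pooled risk ratio = " exp((`w1'*(-0.22) + `w2'*(-0.11))/(`w1' + `w2'))
display "SE of the pooled log risk ratio = " sqrt(1/(`w1' + `w2'))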
— The articles become more statistical and specialist from this point on —
Descriptive statistics are used to describe the participants of a study (e.g. their mean age or weight) or any correlations between variables (e.g. as age rises, weight tends to increase [up to a point]). The article describes the difference between continuous variables (called metric in the article: variables measured on a scale, such as height or weight) and categorical variables (variables that can only take one of a set number of values, such as gender or nationality), and provides examples of different types of graphs that can be used to display the statistics.
Useful for anyone who has limited experience with statistics, or who wants to know more about common graphs (e.g. scatter graphs, histograms).
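In Stata, the basics look something like this (simulated data, hypothetical variable names):

clear
set obs 500
set seed 42
gen age = rnormal(50, 12)                  // continuous (metric) variable
gen weight = 40 + 0.6*age + rnormal(0, 8)  // another continuous variable
gen sex = runiform() < 0.5                 // categorical (binary) variable
summarize age weight                       // means, SDs, ranges
tabulate sex                               // counts and percentages
histogram age                              // distribution of a continuous variable
scatter weight age                         // correlation between two continuous variables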
The articles start to get more niche from here on out. Observational studies are those where participants are not experimented upon in any way, just observed. This has implications for the results of such studies: if you are comparing two groups (e.g. smokers and non-smokers), then there is generally no way to tell exactly why there are differences between the groups (e.g. it could be smoking, but maybe the differences are because smokers are older, or more likely to drink alcohol, or eat less). This article discusses the problems with observational studies, and some ways to combat these problems.
Useful for those interested in observational studies, but definitely getting more niche now.
A 2 x 2 (read: 2 by 2) table shows the results of a study where the exposure and outcome were both binary. Generally, the rows of the table are exposed or not (e.g. to a new drug that prevents migraines), and the columns are diseased or not (e.g. got a migraine). The number of participants is shown in each cell, for example, the number of participants who received the new drug AND got a migraine might be in the upper left cell, the number who received the drug AND did NOT get a migraine in the upper right cell, the number who did not receive the drug AND got a migraine in the bottom left cell, and finally the number who did not receive the drug AND did NOT get a migraine in the bottom right cell. The reason 2 x 2 tables are useful is because writing that all out is a nightmare. This article describes the tables and the statistics that are often performed with the tables to judge whether being exposed raises, lowers, or doesn’t change the chance of the outcome happening. It also discusses some problems with the tables and their interpretation.
Useful reading for interpreting 2 x 2 tables, but otherwise unlikely to be useful.
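Stata’s immediate epitab commands will do the 2 x 2 sums for you. With made-up counts – 30 of 200 people on the new drug got a migraine, versus 50 of 200 on placebo:

csi 30 50 170 150    // risk ratio: exposed cases, unexposed cases, exposed non-cases, unexposed non-cases
cci 30 50 170 150    // odds ratio for the same table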
If a study conducts multiple statistical tests, then the paper will likely contain many P values. This is good, in that many tests means many pieces of information, but bad because conducting many tests raises the risk that there will be a “significant” result by chance alone. This article discusses this problem and the ways to combat it. However, it is worth remembering that there is no way to tell whether a study conducted 100 statistical tests and just chose to present the “significant” ones.
Useful for those with an interest in the problem of multiple testing, but this is a problem that is unlikely to affect too many people (the larger problem being that it is impossible to know how many tests a study actually ran, since the authors are the ones reporting them).
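The simplest fix is the Bonferroni correction, which just divides the significance threshold by the number of tests (or, equivalently, multiplies each P value by it). With 20 tests, for example:

display 0.05/20            // the threshold each individual test now has to beat
display min(1, 0.03*20)    // a "significant" P = 0.03 looks much less impressive after correction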
This article describes epidemiological studies and how they are analysed (if the title didn’t give it away). Epidemiological studies are those that seek to quantify the risks for diseases, how often a disease is diagnosed (incidence) or how many people at any one time have a disease (prevalence). The article helpfully lists how measures such as incidence and odds ratios are calculated, and gives examples of studies.
Very useful background into epidemiological studies, which make up a large proportion of medical research. The description of epidemiology measures is particularly useful.
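The two headline measures are easy to confuse, so a made-up example: in a town of 10,000 people, 250 currently have a disease and 40 new cases are diagnosed over a year of follow-up:

display "prevalence = " 250/10000         // proportion with the disease right now
display "incidence = " 40/(10000 - 250)   // new cases per person at risk over the year (roughly)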
P values are calculated differently depending on the data; this article describes the different methods and gives three tables, one detailing statistical tests (e.g. Fisher’s exact test, Student’s t-test) and two detailing which test is appropriate in different situations. This article does not deal with regression analyses (which make up the vast bulk of my work), but given the ubiquity of P values in medical literature, it is definitely worth being familiar with the different tests that can be run.
Useful if you want to know more about P value calculations – the tables are particularly useful.
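To make the “which test when” point concrete, here is the same kind of 2 x 2 comparison run two ways in Stata, with made-up counts – the appropriate test depends on how big the cell counts are:

tabi 3 9 \ 10 4, exact        // small cell counts: Fisher's exact test
tabi 30 50 \ 170 150, chi2    // larger counts: the usual chi-squared test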
The sample size calculation tells you that if you want to be able to find an effect size this large (say the group on a new drug gets 10% fewer infections compared to the old drug), you need this many participants. It’s pretty important if you conduct primary research – most funders won’t like it if you say “I’ll recruit as many people as I can”, but will like it if you say “We need to recruit 628 participants to have a 90% chance of detecting a risk ratio of 0.9 if it really exists, and here are the calculations to prove it.”
It probably isn’t necessary for those not planning to conduct primary research to know how to calculate the sample size, but an appreciation of why power calculations are important is good.
Linear regression is an analysis of the association between two continuous variables (e.g. height and weight), with the option to account for multiple confounders (variables that might be associated with both the exposure and the outcome). This article describes linear regression and other regression models, discusses some of the factors to consider when using the models, and gives several examples.
A useful article for anyone needing to conduct or interpret a regression analysis.
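A minimal sketch on simulated data, regressing weight on height while adjusting for age as a confounder (all variable names and numbers are made up):

clear
set obs 300
set seed 7
gen age = rnormal(45, 10)
gen height = rnormal(170, 8)
gen weight = -40 + 0.6*height + 0.2*age + rnormal(0, 6)
regress weight height age    // the coefficient on height is the association adjusted for age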
Survival analysis is what it sounds like: an analysis of how long participants survive (although note that “survival” could be how long a person goes without contracting a disease or receiving a test, not just how long until they die), and of any factors that are associated with survival time. This article describes survival analysis and points out some things to consider when conducting or interpreting survival analyses.
Useful if you want to know about survival analysis.
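A quick simulated example of the standard Stata workflow – declare the data as survival data, draw Kaplan-Meier curves, then fit a Cox model (everything here is made up):

clear
set obs 200
set seed 11
gen treat = runiform() < 0.5
gen time = -ln(runiform())*(10 + 5*treat)   // survival times, longer on treatment
gen died = runiform() < 0.7                 // 1 = died, 0 = censored
stset time, failure(died)
sts graph, by(treat)                        // Kaplan-Meier curves by treatment group
stcox treat                                 // hazard ratio for treatment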
A concordance analysis assesses the degree to which two measuring or rating techniques agree, for instance the agreement between two tests for the volume of a tumour. Often, the gold-standard test (the best test at the time) will be compared with a cheaper, less invasive or newer alternative test. This article describes concordance analyses and points out some things to consider when conducting or interpreting them.
Useful if you want to know about concordance analysis.
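For two continuous measurements of the same thing, the usual approach is Bland-Altman limits of agreement; here is a simulated sketch (for categorical ratings, Cohen’s kappa via the kap command is the usual choice):

clear
set obs 100
set seed 3
gen method_a = rnormal(20, 5)
gen method_b = method_a + rnormal(0.5, 1)   // method B reads slightly high
gen diff = method_b - method_a
summarize diff
display "limits of agreement: " r(mean) - 1.96*r(sd) " to " r(mean) + 1.96*r(sd)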
Randomised controlled trials (often abbreviated to RCTs, and yes, I’m using the British spelling) are an incredibly useful method of determining which of two or more treatments is better. The idea is that patients are randomised to a treatment, with the hope that there will be no baseline differences overall between the participants in any treatment group (so ages, weights, ethnicities etc. will be balanced between groups). Therefore, any difference in the outcome (e.g. developing the disease, death, recovery time) should be due to the difference in treatments rather than to differences between the groups. However, RCTs need to be well conducted, as even a small amount of bias can render the study meaningless. For instance, if the outcome is the amount of pain, then if a study participant believes they are on a less effective drug, they might experience more pain than if they were on a drug they believed to be more effective (the placebo effect, and the reason branded anti-pain meds [analgesics] are in nicer boxes and cost 10 times as much, even though they are the same as the unbranded meds). This article describes RCTs, and provides a helpful table of how an RCT should be reported.
RCTs are pretty important for medical research, and it is definitely worth reading this article to be certain you can identify why some medical research is considered brilliant and unbiased, and some is considered useless.
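The randomisation itself can be very simple – a sketch of 1:1 simple randomisation for 100 hypothetical patients (real trials use more careful methods, such as blocked randomisation and concealed allocation):

clear
set obs 100
set seed 20171101
gen patient_id = _n
gen treat = runiform() < 0.5    // 1 = new treatment, 0 = control
tabulate treat                  // group sizes won't be exactly 50:50, which is why blocks are often used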
— The articles become much more specialist from this point on —
A crossover trial is one where patients are randomised to different, consecutive treatments, so each patient takes all the different treatments at different times. This means each patient serves as their own control – the response to the treatment of interest can be compared with the response to the other treatments (such as placebo pills or the gold-standard treatment) – making problems such as confounding less of an issue. Crossover trials are only really useful for chronic conditions (e.g. pain), as an acute condition may improve before the next treatment can be started. This article describes crossover trials and details the statistical methods needed to appropriately analyse the results.
This article is only worth reading if you need to know about crossover trials.
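Because each patient provides an outcome on every treatment, the simplest analysis is a paired one. A toy example with five patients’ pain scores on two hypothetical drugs (ignoring period and carry-over effects, which the article covers):

clear
input pain_a pain_b
4 6
3 5
5 5
2 4
6 7
end
ttest pain_a == pain_b    // paired t test: each patient is their own control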
Screening is testing patients without any symptoms of a disease to see whether they are likely to have said disease. and screening are regularly performed in the UK, with the aim of finding those cancers while they are still curable. Screening in general has many established problems, for example finding cancers that would never have caused harm (overdiagnosis), leading to unnecessary treatment (overtreatment). This article discusses screening and its potential problems, using breast cancer screening as an example.
Knowing about screening is important as most people in developed countries will be offered a screening test in later life, so this article may well be worth a read.
The purpose of RCTs is often to find out whether a new treatment is better than an old one. The purpose of equivalence or non-inferiority studies is to determine whether a new treatment (usually one that is cheaper or has fewer side-effects) is at least as good (equivalence) or not much worse (non-inferiority) than the old one. It’s almost impossible to prove that two treatments are exactly the same in terms of their outcome, since effect estimates are never absolutely precise (there is always allowance for random error), so the statistics involved in equivalence/non-inferiority trials are different from those in standard superiority trials. This article discusses these trials.
Pretty specific paper only useful for those involved with equivalence or non-inferiority trials.
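The logic boils down to comparing a confidence interval with a pre-specified margin rather than with zero. A made-up example, where a higher risk is worse and the margin is 5 percentage points:

local rd = 0.012          // risk difference, new minus old treatment
local se = 0.010          // its standard error
local margin = 0.05       // pre-specified non-inferiority margin
display "95% CI: " `rd' - 1.96*`se' " to " `rd' + 1.96*`se'
display "upper limit below the margin (non-inferior)? " (`rd' + 1.96*`se' < `margin')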
Big data is defined in this article as a dataset “on the order of magnitude of a terabyte”, which is indeed big. This article discusses big data and the procedures used to analyse it (such as machine learning).
Probably a useful article for many to read, as big data is fast becoming used in everything from genetics to business.
An indirect comparison is one where instead of comparing two treatments against each other (A versus B), you use two comparisons with a common comparator treatment (A versus C, and B versus C) to infer the difference between two treatments. This can be fairly intuitive: if A is better than C, and C is better than B, then A must be better than B (and how much better can be calculated statistically). A network meta-analysis is a meta-analysis that includes indirect comparisons, so not only are studies comparing A and B included, but studies that compare A with C, and B with C too. The aim is to use as much information as possible to arrive at the most informed answer. This article discusses indirect comparison and network meta-analyses, clarifies the assumptions that must be made, and provides a helpful checklist for evaluation of network meta-analyses.
In my experience, indirect comparisons are not commonly used outside of network meta-analyses, and this article is therefore only really useful for those interested in network meta-analyses.
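The basic calculation (often called the Bucher method) is straightforward, shown here with made-up numbers – one A versus C trial and one B versus C trial:

* trial 1: A vs C, risk ratio 0.80 (SE of log RR 0.10); trial 2: B vs C, risk ratio 0.90 (SE 0.12)
display "indirect risk ratio, A versus B = " exp(ln(0.80) - ln(0.90))
display "SE of the indirect log risk ratio = " sqrt(0.10^2 + 0.12^2)

Note that the standard error of the indirect comparison is bigger than either of the direct ones – indirect evidence is weaker evidence.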
When analysing the results of a non-randomised study (e.g. an observational study), confounding is always an issue. One way to deal with this is to include measured variables as covariates in a regression model, accounting for differences between groups (e.g. in age, height, gender). Propensity scores are an alternative method to using covariates, where the probability of an individual receiving a treatment is calculated using the observed variables, and this is used to account for any differences between groups instead of covariates in a regression model. This article introduces propensity scores, describes four methods of using them and compares propensity scores and regression models.
In my experience, regression models with covariates are an overwhelmingly more common method of accounting for confounding, so this article may be useful when coming across studies using propensity scores or if you are considering using them, but otherwise may be a bit specialist.
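A bare-bones sketch of the idea on simulated data: model the probability of treatment from the observed variables, then use that probability (here as an inverse-probability weight) instead of putting the variables in the outcome model directly. All names and numbers here are made up:

clear
set obs 1000
set seed 9
gen age = rnormal(60, 10)
gen smoker = runiform() < 0.3
gen treated = runiform() < invlogit(-3 + 0.04*age + 0.5*smoker)   // treatment depends on age and smoking
logit treated age smoker
predict pscore, pr                                 // each person's propensity score
gen ipw = cond(treated, 1/pscore, 1/(1 - pscore))  // inverse-probability-of-treatment weights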
RCTs are brilliant, but sometimes the methods have to be adapted to face a particular challenge. This article details some of the variations of RCTs (and provides a helpful table), useful in specific circumstances.
This final article in the series (as of October 2017) is very specific to RCTs, and won’t be useful for anyone not interested in specialist RCT methods.