I agree with the overall message and conclusion of the letter. I mostly agree with the bullet points of action at the end of the letter. I agree the CRUK adverts are extremely unlikely to do anything good, and may be harmful, so they should probably stop.
But I don’t agree with some of the arguments the academics made in the letter, and I want to talk about this, even though I’m pretty sure I could get slammed for doing so.
I want to state upfront that I completely agree that stigmatising people with obesity is incredibly harmful. I state this because I think some people may profoundly disagree with me on two of my arguments, namely that obesity is a mix of personal choice, the environment and genetics, and that smoking is also a mix of personal choice, the environment and genetics.
I am not arguing that smoking and obesity are the same, or that personal choice is more important than the environment for either smoking or obesity, although both are probably more important than genetics (for most people). I am also not saying I know what to do, how obesity should be tackled, or even if it should be tackled.
Rather, I’m arguing that treating smoking like it’s solely a personal choice is wrong, that personal choice exists for both smoking and obesity, and to deny that is to take away people’s autonomy.
First off, the things I agree with.
I completely agree. Most (possibly all) of the studies associating BMI with health outcomes are observational, not causal. From these studies, you can’t say anything about whether a high BMI causes cancer, or whether something else entirely is going on.
BMI can’t be randomised in a study in the same way a drug or other treatment can. You could look at randomised studies of treatments for obesity and see if they affect cancer in the long run, but that’s expensive, difficult, and probably not ethical – you’d have to not treat one group of people, who initially wanted treatment, to see if they were diagnosed with cancer more frequently than the group who were treated. Oh, and this only tells you if the treatment affects cancer risk, not BMI itself.
You could also look at genetic studies, which can be thought of as a natural experiment, since bits of genes are distributed randomly at conception. You would look at the bits of genetic code causing changes in BMI, and see if they also affect cancer risk, and any other outcomes.
Genetic studies probably give more of a causal estimate than non-genetic studies, but they aren’t free from bias. With complex, multifactorial outcomes like BMI, it’s difficult to be completely sure whether the effects you’re seeing are from BMI, or from some other process. It’s still a lot better than non-genetic observational studies though, hence the “more causal” estimates.
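To make the logic concrete, the simplest estimator of this kind (the Wald ratio, the basic Mendelian randomisation calculation) just divides the gene–cancer association by the gene–BMI association. Here's a toy sketch – all numbers are made up for illustration:

```python
import numpy as np

# Toy Wald ratio: a "more causal" estimate from genetic associations.
# All numbers below are invented for illustration.
beta_snp_bmi = 0.35      # SNP effect on BMI (kg/m^2 per allele)
beta_snp_cancer = 0.021  # SNP effect on log-odds of cancer
se_snp_cancer = 0.008

# Wald ratio: implied causal effect of BMI on cancer log-odds
wald = beta_snp_cancer / beta_snp_bmi

# First-order standard error, ignoring uncertainty in the denominator
se_wald = se_snp_cancer / abs(beta_snp_bmi)

lo, hi = wald - 1.96 * se_wald, wald + 1.96 * se_wald
print(f"Log-odds of cancer per unit BMI: {wald:.3f} (95% CI {lo:.3f} to {hi:.3f})")
```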
The cited research isn’t actually what I was hoping for, which was a study showing that telling people obesity is bad doesn’t affect people’s BMI in the long term. The research was more about people’s perceptions of obesity and “self-efficacy for health behaviour change”. However, I agree that telling people that obesity causes cancer is extremely unlikely to change people’s BMI. I don’t believe that many people who are obese believe obesity has no health consequences, or that telling them there are consequences will change either behaviour or outcomes.
However, my opinion on this isn’t relevant or important. CRUK should not launch a nationwide intervention without evidence that it would work – advertising is an intervention, same as intervening with drugs or surgery. There are potentially both positive (reduction in BMI?) and negative effects (increased stigma towards those perceived as obese), and the money spent on advertising could have gone to studying the causes of cancer. Therefore, the campaign should have had evidence to support its use, beyond “we tested it in focus groups”. If they don’t have the evidence, it shouldn’t have happened.
It is indefensible if healthcare providers are creating a barrier to accessing healthcare by stigmatising those they perceive as obese. As, of course, is shaming anyone exercising. It’s utterly absurd that anyone should try to make other people feel bad about obesity when they are actively trying to do something about it. The same is true of a lot of shaming, but it feels particularly unjust when people are shamed into not exercising because they are considered too obese to exercise.
Stigmatising obese people does not, and never will, reduce the amount of obesity in the world.
I’ll start with something easy.
The NHS has a page of weight-loss advice, but it’s general advice, along with the information that GPs can recommend both exercise on prescription and local weight loss groups. Given CRUK has stated obesity is bad, it needs to provide some mechanism for reducing obesity, or the campaign would be completely pointless. So I can see why they would partner with a group that would, presumably (hopefully?), be recommended by GPs. Whether that was necessary is debatable, but I don’t see it as being an immediate problem in and of itself.
More concerningly, the letter states that the programmes are:
not effective ways of achieving and maintaining weight loss or preventing cancer
The cited research doesn’t seem to me to support any part of that assertion.
The research is a systematic review and meta-analysis of weight loss among overweight but otherwise healthy adults who used commercial weight-loss programs. The outcome was whether people in the included studies lost more than 5% of their initial body weight. I’ve included a forest plot below, but the conclusion was that, on average, 57% of people on a commercial program lost less than 5% of their initial body weight.
Therefore, 43% of people lost more than 5% of their initial body weight, which would seem to me to be evidence that these programs can help people “achieve weight loss”. They don’t help everyone, but that doesn’t mean they aren’t effective – they worked for 43 out of every 100 people! So I disagree that the “evidence demonstrates these programs are not effective”, although I grant that others may disagree.
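For a sense of how figures like these get combined across studies, here's a minimal inverse-variance pooling sketch on the log-odds scale – made-up study data, not the review's actual numbers or exact method:

```python
import numpy as np

# Toy pooling of "lost >=5% of body weight" proportions across four studies.
# Events and sample sizes are invented for illustration.
events = np.array([40, 55, 130, 40])  # people achieving >=5% loss
n = np.array([100, 120, 300, 90])     # study sizes

p = events / n
logit = np.log(p / (1 - p))
var = 1 / events + 1 / (n - events)   # variance of each study's log-odds

w = 1 / var                           # fixed-effect inverse-variance weights
pooled_logit = np.sum(w * logit) / np.sum(w)
pooled_p = 1 / (1 + np.exp(-pooled_logit))
print(f"Pooled proportion achieving >=5% loss: {pooled_p:.1%}")
```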
Some of the biggest studies lasted only 12 weeks, and all but one study lasted less than 1 year. Therefore, this isn’t any kind of evidence for long term effects, or of “maintaining weight loss”. In general, it seems like the longer the study, the more people lost 5% of their body weight. I’d have liked to see a meta-regression on this, to see whether length of study was important, but there wasn’t one. In any case, I also disagree the evidence demonstrates the programs aren’t effective at maintaining weight loss, since that wasn’t tested here.
Finally, the research doesn’t say anything at all about preventing cancer. I would doubt there is much evidence either way for that outcome – if there were, however, it should be cited.
Ok, so this is where things get trickier.
I fundamentally disagree with the approach to this argument. Specifically, this part:
Through making a direct comparison between smoking and weight, your campaign contributes to these assumptions, suggesting that it is a lifestyle choice.
I’m fairly certain this means the authors of this letter, who are criticising CRUK for stigmatising obesity by “implying that individuals are largely in control of and responsible for their body size”, are themselves stating unequivocally that smoking is a lifestyle choice.
I cannot overstate how much I object to this.
I can see why people would think that obesity is different from smoking. Obesity is a consequence of many things; it’s an outcome, not a behaviour or single action that can be stopped at will. Smoking is a deliberate action, where you need to buy cigarettes, light them, and inhale.
But that is completely misrepresenting smoking. Smoking is also a consequence of many things, it’s an outcome as much as a behaviour, same as obesity.
Suffice to say, I strongly reject the idea that smoking is a “lifestyle choice” while obesity is not.
Rather, smoking, like obesity, is complex and multifaceted. If people object to the CRUK adverts solely because smoking is a lifestyle choice, and obesity is not, then I think they are wrong.
There are, of course, plenty of other reasons to object to the adverts.
I expect this will be an unpopular opinion.
Are people largely in control of their body size?
I don’t know, nor do I know how we could ever test this.
But I know that individual choice affects obesity. To imply otherwise is to say that people have absolutely no autonomy over their own body, over what and how much they eat, or how much they exercise.
Choice is clearly not the only factor in play (I mean, see above), and for some people there is very little choice, but for many people there is plenty.
I include myself in this. I have been both overweight and obese. I still am overweight. I have experienced weight-related stigma from both family members and strangers, although not at all to the same degree as others will have done. But it’s my choice to eat and exercise the way I do. My environment affects those choices, and I recognise that I am very privileged in that I live close enough to work that I can walk in, I’m healthy enough to go for a run, and I have the time and resources to choose to eat “healthily” or “unhealthily” (quotes to show how little I care for those definitions), where others don’t have these options.
The choices we make about food and exercise become harder or easier based on the environment and genetics. It’s easier to cook food or exercise if you have the time and resources to do so. It’s easier to eat salads if you like them. It’s easier to eat less if you aren’t depressed and comfort eating is one of the ways you can cope. It’s easier to eat less if you aren’t taking a steroid.
Sometimes, it’s impossible to eat “healthily”, or to exercise, or to make any “good” choices. But this doesn’t mean that for other people, choice had no part in either gaining or losing weight. It also doesn’t mean that anyone should be judged or stigmatised for making those choices.
I don’t know whether structural change through policy or weight-loss programs that target individuals are better for losing weight, either individually or at a population level.
I don’t know whether obesity is even the core issue – what if the main issue is exercise, and both obesity and poor health are a result of not exercising? Even genetic studies couldn’t tell you that, since obesity may cause poor health by making it more difficult to exercise, both through it simply being more physically and mentally difficult, and through weight-related stigma making exercise worse.
I don’t know whether the CRUK advert could be beneficial or detrimental. I’m almost certain it’s not going to do any good, but I’m not psychic. I’m equally certain CRUK doesn’t know what effect the adverts will have, since they are apparently not based on firm evidence.
I don’t know whether it’s weight-related stigma to compare obesity with smoking – I see smoking and obesity as both consequences of personal choice, the environment and genetics. That said, the advert may increase stigma all the same, so it could definitely be destructive.
I don’t know by how much obesity causes cancer. I haven’t assessed the evidence CRUK has for their claims, though I’m fairly certain the evidence they have is not causal, so I don’t think they know by how much obesity causes cancer either.
I know weight-related stigma is horrible. It seems to me that much of it could come from people with more privilege, who face easier choices, stigmatising those who face harder ones.
I know the current Government is extremely unlikely to change policy to promote an environment that reduces obesity. Therefore, I believe the only option people who want to lose weight have is changing their own choices. Some people may benefit from weight loss programs. Others may benefit from lifestyle changes. Others may find it impossible to lose weight. And that’s ok.
While I agree with the letter’s overall message and conclusion, I think the argument could have been limited to the lack of causal evidence of obesity on cancer and the lack of evidence the campaign would have any effect on obesity in this country.
I completely disagree with the claim that smoking is a lifestyle choice while obesity is not. Both are a mix of personal choice, the environment and genetics.
To recap, a couple of weeks ago a paper by Xinzhu (April) Wei & Rasmus Nielsen of the University of California was published, claiming that a deletion in the CCR5 gene increased mortality (in white people of British ancestry in UK Biobank). I had some issues with the paper, which I posted about on Twitter. My tweets got more attention than anything I’d posted before. I’m pretty sure they got more attention than my published papers and conference presentations combined. ¯\_(ツ)_/¯
The CCR5 gene is topical because, as the paper states in the introduction:
In late 2018, a scientist from the Southern University of Science and Technology in Shenzhen, Jiankui He, announced the birth of two babies whose genomes were edited using CRISPR
To be clear, gene-editing human babies is awful. Selecting zygotes that don’t have a known, life-limiting genetic abnormality may be reasonable in some cases, but directly manipulating the genetic code is something else entirely. My arguments against the paper did not stem from any desire to protect the actions of Jiankui He, but to a) highlight a peer review process that was actually pretty awful, b) encourage better use of UK Biobank genetic data, and c) refute an analysis that seemed likely biased.
This paper has received an incredible amount of attention. If it is flawed, then poor science is being heavily promoted. Apart from the obvious problems with promoting something that is potentially biased, others may try to do their own studies using this as a guideline, which I think would be a mistake.
I’ll quickly recap the initial problems I had with the paper (excluding the things that were easily solved by reading the online supplement), then go into what I did to try to replicate the paper’s results. I ran some additional analyses that I didn’t post on Twitter, so I’ll include those results too.
Full disclosure: in addition to replying to me, Rasmus and I exchanged several emails, and they ran some additional analyses. I’ll try not to talk about any of these analyses as it wasn’t my work, but, if necessary, I may mention pertinent bits of information.
I should also mention that I’m not a geneticist. I’m an epidemiologist/statistician/evidence synthesis researcher who for the past year has been working with UK Biobank genetic data in a unit that is very, very keen on genetic epidemiology. So while I’m confident I can critique the methods for the main analyses with some level of expertise, and have spent an inordinate amount of time looking at this paper in particular, there are some things where I’ll say I just don’t know what the answer is.
I don’t think I’ll write a formal response to the authors in a journal – if anyone is going to, I’ll happily share whatever information you want from my analyses, but it’s not something I’m keen to do myself.
All my code for this is available online.
Not accounting for relatedness (i.e. related people in a sample) is a well-known problem. It can bias genetic analyses through population stratification or familial structure, and can be easily dealt with by removing related individuals in a sample (or fancy analysis techniques, e.g. Bolt-LMM). The paper ignored this and used everyone.
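For illustration, here's a minimal sketch of the greedy idea behind building an unrelated subsample – given a list of related pairs (e.g. from kinship estimates), repeatedly drop whoever has the most remaining relatives. Real pipelines use dedicated tools; this is just the concept, with made-up IDs:

```python
from collections import defaultdict

# Made-up pairs of related participants (kinship above some threshold)
related_pairs = [("A", "B"), ("B", "C"), ("D", "E")]

neighbours = defaultdict(set)
for a, b in related_pairs:
    neighbours[a].add(b)
    neighbours[b].add(a)

excluded = set()
while True:
    # Drop the person involved in the most remaining related pairs
    person, rels = max(neighbours.items(),
                       key=lambda kv: len(kv[1]), default=(None, set()))
    if not rels:
        break  # no related pairs left
    excluded.add(person)
    for other in rels:
        neighbours[other].discard(person)
    del neighbours[person]

print("Exclude:", excluded)  # e.g. {'B', 'D'} - keeps the most unrelated people
```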
Quality control (QC) is also an issue. When the IEU at the University of Bristol was preparing the UK Biobank genetic data, they looked for sex mismatches, sex chromosome aneuploidy (having sex chromosomes different to XX or XY), and participants with outliers in heterozygosity and missing rates (yeah, ok, I don’t have a good grasp on what this means, but I see it as poor data quality for particular individuals). The paper ignored these too.
The paper states it looks at people of “British ancestry”. Judging by the number of participants in the paper and the reference they used, the authors meant “white British ancestry”. I feel this should have been picked up on in peer review, since the terms are different. The referenced study uses “white British ancestry”, so it would have certainly been clearer sticking to that.
The main analysis should have also been adjusted for all principal components (PCs) and centre (where participants went to register with UK Biobank). This helps to control for population stratification, and we know population stratification is present in UK Biobank. I thought choosing variables to include as covariables based on statistical significance was discouraged, but evidently it still happens. Still, I see no plausible reason to do so in this case – principal components represent population stratification, population stratification is a confounder of the association between SNPs and any outcome, so adjust for them. There are enough people in this analysis to take the hit.
I don’t know why the main analysis was a ratio of the crude mortality rates at 76 years of age (rather than a Cox regression), and I don’t know why there are no confidence intervals (CIs) on the estimate. The CI exists, it’s in the online supplement. Peer review should have had problems with this. It is unconscionable that any journal, let alone a top-tier journal, would publish a paper when the main result doesn’t have any measure of the variability of the estimate. A P value isn’t good enough when it’s a non-symmetrical error term, since you can’t estimate the standard error.
So why is the CI buried in an additional file when it would have been so easy to put it into the main text? The CI is from bootstrapping, whereas the P value is from a log-rank test, and the CI of the main result crosses the null. The main result is non-significant and significant at the same time. This could be a reason why the CI wasn’t in the main text.
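For what it's worth, bootstrapping a CI for a crude mortality ratio is straightforward – here's a toy sketch with simulated data, not the paper's actual procedure:

```python
import numpy as np

rng = np.random.default_rng(42)

# Invented cohort: 1 = died before age 76, 0 = survived/censored
deaths_del32 = rng.binomial(1, 0.12, size=2000)   # delta-32/delta-32
deaths_plus = rng.binomial(1, 0.10, size=50000)   # +/+

def rate_ratio(a, b):
    return a.mean() / b.mean()

obs = rate_ratio(deaths_del32, deaths_plus)

boot = []
for _ in range(2000):
    # Resample each genotype group with replacement and recompute the ratio
    a = rng.choice(deaths_del32, size=len(deaths_del32), replace=True)
    b = rng.choice(deaths_plus, size=len(deaths_plus), replace=True)
    boot.append(rate_ratio(a, b))

lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"Mortality ratio {obs:.2f} (95% bootstrap CI {lo:.2f} to {hi:.2f})")
```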
It’s also noteworthy that although the deletion appears strongly to be recessive (it only has an effect if both chromosomes have the deletion), the main analysis reports delta-32/delta-32 against +/+, which surely has less power than delta-32/delta-32 against +/+ or delta-32/+. The CI might have been significant otherwise.
I think it’s wrong to present one-sided P values (in general, but definitely here). The hypothesis should not have been that the CCR5 deletion would increase mortality; it should have been ambivalent, like almost all hypotheses in this field. The whole point of the CRISPR was that the babies would be more protected from HIV, so unless the authors had an unimaginably strong prior that CCR5 was deleterious, why would they use one-sided P values? Cynically, but without a strong reason to think otherwise, I can only imagine because one-sided P values are half as large as two-sided P values.
The best analysis, I think, would have been a Cox regression. Happily, the authors did this after the main analysis. But the full analysis that included all PCs (but not centre) was relegated to the supplement, for reasons that are baffling since it gives the same result as using just 5 PCs.
Also, the survival curve should have CIs. We know nothing about whether those curves are separate without CIs. I reproduced survival curves with a different SNP (see below) – the CIs are large.
I’m not going to talk about the Hardy-Weinberg Equilibrium (HWE, inbreeding) analysis – it’s still not an area I’m familiar with, and I don’t really think it adds much to the analysis. There are loads of reasons why a SNP might be out of HWE – dying early is certainly one of them, but it feels like this would just be a confirmation of something you’d know from a Cox regression.
I have access to UK Biobank data for my own work, so I didn’t think it would be too complex to replicate the analyses to see if I came up with the same answer. I don’t have access to rs62625034, the SNP the paper says is a great proxy of the delta-32 deletion, for reasons that I’ll go into later. However, I did have access to rs113010081, which the paper said gave the same results. I also used rs113341849, which is another SNP in the same region that has extremely high correlation with the deletion (both SNPs have R2 values above 0.93 with rs333, which is the rs ID for the delta-32 deletion). Ideally, all three SNPs would give the same answer.
First, I created the analysis dataset.
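My real code was written in Stata (linked above); below is a rough Python sketch of the same steps. The file name and column names are hypothetical, not real UK Biobank fields:

```python
import pandas as pd
from lifelines import CoxPHFitter

# Hypothetical extract - real UK Biobank field names differ
df = pd.read_csv("ukb_extract.csv", parse_dates=["date_baseline", "date_exit"])

# Restrict to QC-passing, unrelated participants of white British ancestry
df = df[(df.white_british == 1) & (df.unrelated == 1) & (df.qc_pass == 1)]

# Survival outcome: years of follow-up and a death indicator
df["years"] = (df["date_exit"] - df["date_baseline"]).dt.days / 365.25
df["died"] = df["death"].astype(int)

# Recessive coding: 1 = delta-32/delta-32 (dosage rounds to 2), 0 = otherwise
df["del32_homo"] = (df["rs113010081_dose"].round() == 2).astype(int)

# Cox regression adjusted for sex, baseline age and 40 principal components
covars = ["del32_homo", "sex", "age_baseline"] + [f"pc{i}" for i in range(1, 41)]
cph = CoxPHFitter()
cph.fit(df[covars + ["years", "died"]], duration_col="years", event_col="died")
cph.print_summary()
```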
I conducted 12 analyses in total (6 for each SNP), but they were all pretty similar: Cox regressions varying whether related people were included, which covariables were adjusted for, and which time variable was used.
With this suite of analyses, I was hoping to find out whether including relateds, adjusting for more covariables, or changing the time variable made any difference to the results.
I found… Nothing. There was very little evidence the SNPs were associated with mortality (the hazard ratios, HRs, were barely different from 1, and the confidence intervals were very wide). There was little evidence including relateds or more covariables, or changing the time variable, changed the results.
Here’s just one example of the many survival curves I made, looking at delta-32/delta-32 (1) versus both other genotypes in unrelated people only (not adjusted, as Stata doesn’t want to give me a survival curve with CIs that is also adjusted) – this corresponds to the analysis in row 6.
You’ll notice that the CIs overlap. A lot. You can also see that both events and participants are rare in the late 70s (the long horizontal and vertical stretches) – I think that’s because there are relatively few people who were that old at the end of their follow-up. Average follow-up time was 7 years, so to estimate mortality up to 76 years, I imagine you’d want quite a few people to be 69 years or older, so they’d be 76 at the end of follow-up (if they didn’t die). Only 3.8% of UK Biobank participants were 69 years or older.
In my original tweet thread, I only did the analysis in row 2, but I think all the results are fairly conclusive for not showing much.
In a reply to me, Rasmus stated that the other SNPs gave the same results as the SNP used in the paper (“data not shown”). This is the claim that turned out to be incorrect.
Never trust data that isn’t shown – apart from anything else, when repeating analyses and changing things each time, it’s easy to forget to redo an extra analysis if the manuscript doesn’t contain the results anywhere.
This also means I couldn’t directly replicate the paper’s analysis, as I don’t have access to rs62625034. Why not? I’m not sure, but the likely explanation is that it didn’t pass the quality control process (either ours or UK Biobank’s, I’m not sure).
I’ve concluded that the only possible reason for a difference between my analysis and the paper’s analysis is that the SNPs are different. Much more different than would be expected, given the high amount of correlation between my two SNPs and the deletion, which the paper claims rs62625034 is measuring directly.
One possible reason for this is the imputation of SNP data. As far as I can tell, neither of my SNPs were measured directly, they were imputed. This isn’t uncommon for any particular SNP, as imputation of SNP data is generally very good. As I understand it, genetic code is transmitted in blocks, and the blocks are fairly steady between people of the same population, so if you measure one or two SNPs in a block, you can deduce the remaining SNPs in the same block.
In any case there is a lot of genetic data to start with – each genotyping chip measures hundreds of thousands of SNPs. Also, we can measure the likely success rate of the imputation, and SNPs that are poorly imputed (for a given value of “poorly”) are removed before anyone sees them.
The two SNPs I used had good “info scores” (around 0.95 I think – for reference, we dropped all SNPs with an info score of less than 0.3 for SNPs with similar minor allele frequencies), so we can be pretty confident in their imputation. On the other hand, rs62625034 was not imputed in the paper, it was measured directly. That doesn’t mean everyone had a measurement – I understand the missing rate of the SNP was around 3.4% in UK Biobank (this is from direct communication with the authors, not from the paper).
But – and this is a weird but that I don’t have the expertise to explain – the imputation of the SNPs I used looks… well… weird. When you impute SNP data, you impute values between 0 and 2. They don’t have to be integer values, so dosages of 0.07 or 1.5 are valid. Ideally, the imputation would only give integer values, so you’d be confident this person had 2 mutant alleles, and this person 1, and that person none. In many cases, that’s mostly what happens.
Non-integer dosages don’t seem like a big problem to me. If I’m using polygenic risk scores, I don’t even bother making them integers, I just leave them as decimals. Across a population, it shouldn’t matter, the variance of my final estimate will just be a bit smaller than it should be. But for this work, I had to make the non-integer dosages integers, so anything less than 0.5 I made 0, anything 0.5 to 1.5 was 1, and anything above 1.5 was 2. I’m pretty sure this is fine.
Unless there’s more non-integer doses in one allele than the other.
rs113010081 has non-integer dosages for almost 14% of white British participants in UK Biobank (excluding relateds). But the non-integer dosages are not distributed evenly across dosages. No. The twos had way more non-integer dosages than the ones, which had way more non-integer dosages than the zeros.
In the below tables, the non-integers are represented by being missing (a full stop) in the rs113010081_x_tri variable, whereas the rs113010081_tri variable is the one I used in the analysis. You can see that of the 4,736 participants I thought had twos, 3,490 (73.69%) of those actually had non-integer dosages somewhere between 1.5 and 2.
What does this mean?
I’ve no idea.
I think it might mean the imputation for this region of the genome might be a bit weird. rs113341849 has the same pattern, so it isn’t just this one SNP.
But I don’t know why it’s happened, or even whether it’s particularly relevant. I admit ignorance – this is something I’ve never looked for, let alone seen, and I don’t know enough to say what’s typical.
I looked at a few hundred other SNPs to see if this is just a function of the minor allele frequency, and so the imputation was naturally just less certain because there was less information. But while there is an association between the minor allele frequency and non-integer dosages across dosages, it doesn’t explain all the variance in the estimate. There were very few SNPs with patterns as pronounced as in rs113010081 and rs113341849, even for SNPs with far smaller minor allele frequencies.
Does this undermine my analysis, and make the paper’s more believable?
I don’t know.
I tried to look at this with a couple more analyses. In the “x” analyses, I only included participants with integer values of dose, and in the “y” analyses, I only included participants with dosages < 0.05 from an integer. You can see in the results table that only using integers removed any effect of either SNP. This could be evidence that the imputation having an effect, or it could be chance. Who knows.
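For concreteness, here are the hard-call rule from earlier and both dosage filters in code, with made-up dosages:

```python
import numpy as np

dosage = np.array([0.07, 0.9, 1.0, 1.5, 1.97, 2.0])  # made-up imputed dosages

# Hard calls, as described earlier: <0.5 -> 0, 0.5 to 1.5 -> 1, >1.5 -> 2
hard_call = np.where(dosage < 0.5, 0, np.where(dosage <= 1.5, 1, 2))

# "x" analyses: keep only exact integer dosages
is_integer = dosage == np.round(dosage)

# "y" analyses: keep dosages within 0.05 of an integer
near_integer = np.abs(dosage - np.round(dosage)) < 0.05

print(hard_call)     # [0 1 1 1 2 2]
print(is_integer)    # [False False  True False False  True]
print(near_integer)  # [False False  True False  True  True]
```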
rs62625034 was directly measured, but not imputed, in the paper. Why?
It’s possibly because the SNP isn’t measuring what the probe was meant to measure. It clearly has a very different minor allele frequency in UK Biobank (0.1159) than in the GO-ESP population (~0.03). The paper states this means it’s likely measuring the delta-32 deletion, since the frequencies are similar and rs62625034 sits in the deletion region. This mismatch may have made it fail quality control.
But this raises a couple of issues. First is whether the missingness in rs62625034 is a problem – is the data missing completely at random or not missing at random? If the former, great. If the latter, not great.
The second issue is that rs62625034 should be measuring a SNP, not a deletion. In people without the deletion, the probe could well be picking up people with the SNP. The rs62625034 measurement in UK Biobank should be a mixture between the deletion and a SNP. The R2 between rs62625034 and the deletion is not 1 (although it is higher than for my SNPs – again, this was mentioned in an email to me from the authors, not in the paper), which could happen if the SNP is picking up more than the deletion.
The third issue, one I’ve realised only just now, is that rs62625034 is not associated with lifespan in UK Biobank (and other datasets). This means that maybe it doesn’t matter that rs62625034 is likely picking up more than just the deletion.
Peter Joshi, author of a lifespan genetics article, helpfully posted these plots:
If I read this right, Peter used UK Biobank (and other data) to produce the above plot showing lots of SNPs and their association with mortality (the higher the SNP, the more it affects mortality).
Not only does rs62625034 not show any association with mortality, but how did Peter find a minor allele frequency of 0.035 for rs62625034 and the paper find 0.1159? This is crazy. A minor allele frequency of 0.035 is about the same as the GO-ESP population, so it seems perfectly fine, whereas 0.1159 does not.
I didn’t clock this when I first saw it (sorry Peter), but using the same datasets and getting different minor allele frequencies is weird. Properly weird. Like counting the number of men and women in a dataset and getting wildly different answers. Maybe I’m misunderstanding, it wouldn’t be the first time – maybe the minor allele frequencies are different because of something else. But they both used UK Biobank, so I have no idea how.
I have no answer for this. I also feel like I’ve buried the lead in this post now. But let’s pretend it was all building up to this.
This paper has been enormously successful, at least in terms of publicity. I also like to think that my “post-publication peer review” and Rasmus’s reply represents a nice collaborative exchange that wouldn’t have been possible without Twitter. I suppose I could have sent an email, but that doesn’t feel as useful somehow.
However, there are many flaws with the paper that should have been addressed in peer review. I’d love to ask the reviewers why they didn’t insist on, at minimum: accounting for relatedness and standard quality control, adjusting for all principal components and centre, confidence intervals on the main estimate and the survival curves, two-sided P values, and a Cox regression as the primary analysis.
So, do I believe “CCR5-∆32 is deleterious in the homozygous state in humans”?
No, I don’t believe there is enough evidence to say that the delta-32 deletion in CCR-5 affects mortality in people of white British ancestry, let alone people of other ancestries.
I know that this post has likely come out far too late to dam the flood of news articles that have already come out. But I kind of hope that what I’ve done will be useful to someone.
Also, Snowdon said I should get a blog.
The article cherry-picks data, conflates observational epidemiology with causal inference, and misunderstands basic statistics.
I don’t care whether people drink or not. I’d prefer it if people drank in moderation, but I’m certainly not an advocate for teetotalism.
I do, however, think people should be informed of the risks of anything they do, if they want to be.
I think the article is poor, but think people should feel happy to drink if they want to. Based on the available evidence though, I wouldn’t say it helps your heart, and there may be some risk of drinking moderately.
But that’s the same for cake.
Let’s delve into the article.
The piece starts out by saying that there is a drive to treat drinkers like smokers. That seems to conflate saying that alcohol can be harmful with saying people shouldn’t drink alcohol.
They aren’t the same.
I also don’t know which organisation runs this campaign, but calling people who say alcohol is harmful “anti-alcohol activists” is a trick to make those same people seem like “others” or “them”. It also makes them sound like fanatics, trying to stop “you” drinking “your” alcohol.
But that’s not why I’m writing this.
It’s the “health benefits of moderate drinking”, stated as if it were indisputable fact. As if it’s known that alcohol causes health benefits.
Causal statements like this need rigorous proof. They need hard evidence. If moderate alcohol intake is associated with health benefits, that’s one thing. But saying it causes those health benefits is quite another.
Even if alcohol caused some benefits though, something can have both positive and negative effects – it’s not absurd to tell people about the dangers of something even if it could have benefits, that’s why medications have lists of side-effects.
And calling something “statistical chicanery” is another tactic to make it seem like people saying alcohol is harmful are doing so by cheating, or through deception.
The link to “decades of evidence” is to a 2004 meta-analysis, showing:
Strong trends in risk were observed for cancers of the oral cavity, esophagus and larynx, hypertension, liver cirrhosis, chronic pancreatitis, and injuries and violence.
Which sounds pretty bad to me.
I’m guessing that if this is the right link, then it was meant for you to observe that there is a J-shaped relationship between alcohol intake and coronary heart disease.
That is, low and high levels of drinking are bad for your heart, but some is good. This sounds good – alcohol protects your heart – and it is common advice to hear from loads of people, doctors included.
The problem is that the evidence for this assertion comes from observational studies – the association is NOT causal.
This is all about causality.
We cannot say that drinking alcohol protects your heart, only that if you drink moderately, you are less likely to have heart problems. They sound the same, but they aren’t. The first is causation, the second is correlation, and if there’s one thing statisticians love to say, it’s “correlation is not causation”.
Studies measuring alcohol intake and heart problems are mostly either cross-sectional or longitudinal – they either look at people at one point in time, or follow them up for some time.
These are observational studies, they (probably) don’t change people’s drinking behaviour. Of course, people might change their behaviour a little if they know they have to fill in a questionnaire about their drinking habits, but we kind of have to ignore that for now.
Anyway, observational studies do not allow you to make causal statements like “drinking is good for your heart”.
Why not?
It comes down to bias/confounding, the same things I talked about on Twitter when those researchers made similar causal claims.
There are ways to account for this when comparing drinkers with non-drinkers, but they rely on knowing every possible way people are different.
Imagine the reasons why someone doesn’t drink very much. Off the top of my head: they might not like the taste or the feeling of being drunk, might be short of money, might be health-conscious, or might already have health problems.
Now imagine the reasons why someone doesn’t drink at all. The above holds true, but you can add in things like pregnancy, religion, medication that doesn’t mix with alcohol, or being a former heavy drinker who had to quit for health reasons.
A confounder is something that affects both the exposure (alcohol intake) and the outcome (health). If you want to compare drinkers and non-drinkers, you need to account for everything that might affect someone’s drinking behaviour and their health. This includes many of the things I listed above.
But this is nigh-on impossible, as behaviours are governed by so many things. You can adjust out *some* of the confounding, but you can’t prove you’ve gotten ALL the confounding. You can measure people’s health, but you won’t capture everything that contributes to how healthy a person is. You can ask people about their behaviour, but there’s no way you’ll capture everything from a person’s life.
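To see why this matters, here's a toy simulation (all numbers invented) in which alcohol has no effect on the heart whatsoever, yet moderate drinkers still look healthier, because an unmeasured "health" variable drives both behaviours:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Unmeasured confounder: underlying health
health = rng.normal(size=n)

# Healthier people are more likely to drink moderately...
moderate_drinker = rng.random(n) < 1 / (1 + np.exp(-health))

# ...and less likely to develop heart disease (alcohol plays no role here)
heart_disease = rng.random(n) < 1 / (1 + np.exp(-(-2.0 - health)))

print(f"Risk in moderate drinkers: {heart_disease[moderate_drinker].mean():.3f}")
print(f"Risk in everyone else:     {heart_disease[~moderate_drinker].mean():.3f}")
# Moderate drinkers look "protected" even though alcohol does nothing
```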
If you see, observationally, that moderate drinking is associated with fewer heart problems, what does that imply?
My last post was about how you really should have mechanisms to posit causality, i.e. if you say X causes Y, you need to have an idea of how X causes Y (or evidence from trials). This holds true here too.
Suppose alcohol protects your heart. How?
Fortunately, people have postulated mechanisms, and we can assess them: one possible mechanism is that alcohol increases HDL cholesterol (the good one), which improves heart health.
We can’t assign a direction to that mechanism using observational studies, since people who live healthily might have good HDL levels anyway, meaning they drink moderate amounts because they can.
To work this out (and to assign causality more generally), you can use trials. Ideally randomised controlled trials, since they’re so good. The ideal trial, the one where we wouldn’t need mechanisms at all, is one where we randomise people to drink certain amounts (none, a little, some, a lot) over the course of their life, make sure they stick to that, then see what happens to them.
Since that would never work, the next best thing is to test the proposed mechanisms, because if alcohol increases HDL cholesterol in the short-term (i.e. after a few weeks), then we’re probably on safer territory. We’d then have to prove that higher HDL cholesterol causes better heart health, but one thing at a time.
Well, a systematic review and meta-analysis of trials was done to look at exactly that, fairly recently too (2011):
Effect of alcohol consumption on biological markers associated with risk of coronary heart disease: systematic review and meta-analysis of interventional studies
In total, there were 63 trials included, looking at a few markers of heart health, including HDL cholesterol. They found that alcohol increased HDL a little bit.
But there were problems.
The trials were a mix of things, but having looked at a few, it looks like many studies randomised small numbers of people to drink either an alcoholic drink or a non-alcoholic drink (the good ones had alcohol-free wine compared with normal wine), and they measured their HDL before and after the trial.
The problem with small trials is that they can have quite variable results, because there is a lot of imprecision when you don’t have enough people. You do a trial with 60 people and get a result. You repeat it with new people, and get an entirely different result.
That’s one reason why we do meta-analyses in the first place – one study rarely can tell you the whole story, but when you combine loads of studies, you get closer to the truth.
But academic journals exist, and they tend to publish studies that are interesting, i.e. ones that show a “statistically significant” effect of something, in this case alcohol on HDL. This skews the literature in several ways.
Repeat a study enough, you’ll eventually get the result you want. Since lots of people want alcohol to be beneficial to the heart, and because these trials are pretty inexpensive, there is a good chance that there are missing studies that were never published.
I’m aware this sounds like I’m reaching, and I could never prove that these things happened. But I can show, with relative certainty, that there are missing studies, ones that showed either that alcohol didn’t affect HDL or reduced it.
In meta-analyses, we tend to produce funnel plots, which show whether studies fall symmetrically around the average effect, i.e. the average effect of alcohol on HDL. Since studies should give results that fall basically randomly around the true effect of alcohol on HDL, they should be symmetrical on a funnel plot.
If some studies have NOT been published, i.e. ones falling in the “no effect” area, or those without statistical significance, then you see asymmetry.
We don’t know WHY these studies are missing, just that something isn’t right, and we should treat the average effect with caution. The link I gave above shows a nice symmetrical funnel plot, and an asymmetrical one.
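To see how missing studies produce asymmetry, here's a toy simulation: the true effect is zero, but only "significant" or large studies get published:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)

# 40 simulated studies of a true effect of ZERO
se = rng.uniform(0.05, 0.5, size=40)   # small se = big study
effect = rng.normal(0, se)

# "Publish" only significant results or large studies - publication bias
published = (effect / se > 1.96) | (se < 0.15)

plt.scatter(effect[published], se[published])
plt.axvline(effect[published].mean(), linestyle="--")
plt.gca().invert_yaxis()  # big studies at the top, as in a funnel plot
plt.xlabel("Mean difference")
plt.ylabel("Standard error")
plt.title("Asymmetric funnel: small null studies are missing")
plt.show()
```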
And here is the funnel plot I made from the meta-analysis data.
Note: I had to make this plot myself, the authors did not publish it – they stated in the paper:
No asymmetry was found on visual inspection of the funnel plot for each biomarker, suggesting that significant publication bias was unlikely.
See how the effect gets smaller (more left) as the “s.e. of md” goes down? That’s the standard error of the mean difference – the smaller it is, the more precise the result is, the more confident we are in the result. More people = smaller standard error.
With smaller numbers of people, the standard error goes up, and the more variable the results become. One study may find a huge effect, the next a tiny effect. The fact ALL the small studies found a comparatively large effect is extremely suspicious.
So yeah, there was asymmetry in the funnel plot for the effect of alcohol on HDL cholesterol. The asymmetry says to me that there are missing studies that showed no effect of alcohol on HDL cholesterol, and so the true effect of alcohol on HDL cholesterol is likely smaller than reported.
To be honest, there’s probably no effect, or if there is, it’s tiny.
To be fair though, I should say most of the studies had a small follow-up time. It’s entirely possible longer studies would have found a larger effect. The point is, we don’t know.
There are likely other proposed mechanisms, but I think the HDL mechanism is the one commonly thought of as the big one:
The best-known effect of alcohol is a small increase in HDL cholesterol
So, I don’t really see the evidence as being particularly in support of alcohol protecting the heart. The observational evidence is confounded and possibly has reverse causation. The trial evidence looks to be biased. What about the genetic evidence?
We use genetics to look at things that are difficult to test observationally or through trials. We do this because it can (and should) be unconfounded and is not affected by reverse causation. This is true when we can show how and why the genetics works.
For proteins, we’re on pretty solid ground. A change in gene X causes a change in protein Y. But for behaviours in general, we’re on much shakier ground.
There is one gene, however, that if slightly faulty, produces a protein that doesn’t break down alcohol properly. This is a good genetic marker, since people without that protein get hangovers very quickly after drinking alcohol, so tend not to drink.
One such study found:
Individuals with a genetic variant associated with non-drinking and lower alcohol consumption had a more favourable cardiovascular profile and a reduced risk of coronary heart disease than those without the genetic variant.
Another (in an Asian population) found:
robust evidence that alcohol consumption adversely affects several cardiovascular disease risk factors, including blood pressure, waist to hip ratio, fasting blood glucose and triglyceride levels. Alcohol also increases HDL cholesterol and lowers LDL cholesterol.
So alcohol may well cause higher HDL cholesterol levels.
Note that in genetic studies, you’re looking at lifetime exposure to something, in this case alcohol. So as above, a trial looking at the long-term intake of alcohol may find it raises HDL cholesterol.
It’s just, currently, the trial data doesn’t support this.
Halfway now, and I hope I have shown that the evidence alcohol protects the heart is shaky at best. This is kind of important for later. I don’t claim to have done a systematic or thorough search though, so let me know if there is anything big I’ve missed!
Let’s return to the article.
I got side-tracked by the article’s reference to the paper that said alcohol increases the risk of loads of bad stuff, and has a J-shaped association with heart disease.
The linked article is an example of why I mostly dislike research articles being converted into media articles. It is *exceedingly* difficult to convey the nuances of epidemiological research in 850 words to a lay audience. It just isn’t possible to relay all the necessary information that was used to inform the overall conclusion of the Global Burden of Disease study.
David Spiegelhalter’s flippant remarks at the end probably don’t help:
Yet Prof David Spiegelhalter, Winton Professor for the Public Understanding of Risk at the University of Cambridge, sounded a note of caution about the findings.
“Given the pleasure presumably associated with moderate drinking, claiming there is no ‘safe’ level does not seem an argument for abstention,” he said.
“There is no safe level of driving, but the government does not recommend that people avoid driving.
“Come to think of it, there is no safe level of living, but nobody would recommend abstention.”
The study itself states in the discussion that their
results point to a need to revisit alcohol control policies and health programmes, and to consider recommendations for abstention
Spiegelhalter seizes on the use of the word abstention to make the study authors sound more unreasonable than they actually are. I don’t think this is particularly helpful when talking about, well, anything. If you can make people who disagree with you look unreasonable, then it’ll make for an easier argument, but it doesn’t make you right and them wrong.
The study in question attempted to express the additional risk of cancer from drinking in terms of the risk from smoking, because the public in general understand that smoking is bad. I don’t have an opinion one way or the other on this method of communicating risk.
I’m quite happy to state I don’t know enough about communication of risk.
What I do know is that communicating risk is difficult, as few people are trained in statistics. Even those who are aren’t necessarily able to convert an abstract risk into their daily reality. So maybe the paper is useful, maybe not. I do not think their research question is brilliant, but my opinion is pretty uninformed:
In essence, we aim to answer the question: ‘Purely in terms of cancer risk – how many cigarettes are there in a bottle of wine?’
I don’t think it’s “shameless” (why should the authors feel shame?), and it isn’t a “deliberate conflation” of smoking and drinking. It’s expressing the risk of one behaviour as the similar risk you get from doing a different behaviour.
The article’s theory is that the authors wrote the paper for headlines (It’s worth stating here that saying “yeah, right” in an article makes you sound like a child.):
Maybe they were targeting the media with their paper. In general, researchers pretty much all want their work to be noticed, to have people possibly even act on their work. That’s the whole point of research. It’s not a bad thing to want your work to be useful.
I dislike overstated claims, making work seem more important than it is, and gunning for the media at the expense of good work. But equally, researchers need their work to be seen. We’re rated on it now. If our work is shown to have “impact”, then it’s classified better, so we’re classified better, so our universities are classified as better. I dislike this (not least because it means method work can be ignored, since it may take years for it to be appreciated and used widely), but there we go.
Questioning the paper’s academic merit is fine though, so what are the criticisms of the paper? There’s just one: that the authors chose a level of smoking that has not been extensively measured as the comparator.
The article says they used 35 cigarettes per week and “extrapolated” to 10 cigarettes per week, and called this “having a guess”.
It’s not extrapolation, and it’s not a guess.
The authors looked at previous studies, usually meta-analyses, to see what the extra risk of smoking 35 cigarettes a week was on several cancers, adjusted for alcohol intake. They made some assumptions with how they calculated the risk of 10 cigarettes a week: they assumed each cigarette was as bad as the next one, so assumed that each of the 35 cigarettes contributed to the extra risk of cancer equally.
This assumes a linear association between your exposure (smoking) and outcome (cancer), an incredibly common thing done by almost all researchers. It is actually interpolation though, not extrapolation (since the data point they wanted was between two they had). And it isn’t a guess, it’s based on evidence (with appropriate assumptions).
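The arithmetic is easy to show directly. With made-up numbers (not the paper's actual estimates):

```python
# Toy interpolation under a linear dose-response assumption.
rr_35_per_week = 1.70  # invented relative risk of some cancer at 35 cigs/week

# Assume each cigarette contributes equally to the excess risk
excess_per_cig = (rr_35_per_week - 1) / 35

rr_10_per_week = 1 + 10 * excess_per_cig
print(f"Implied RR at 10/week: {rr_10_per_week:.2f}")  # 1.20 - interpolation, not a guess
```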
The article says there is a single study estimating risks at low levels of cigarette smoking that should have been used. However, that study didn’t adjust for drinking, so it was useless for this study. For the study to be meaningful, they had to work out the extra risk from smoking on cancer independent from any effect of alcohol, since alcohol and smoking are correlated.
Finally, the study didn’t just report 10 cigarettes a week. They reported 35 cigarettes a week, so made no guesses or assumptions (beyond those made in the meta-analyses). So I think the criticism of the study was unfounded. The article felt otherwise:
OK, but all it was doing was communicating risk. If people haven’t thought about smoking 10 cigarettes then it didn’t do it well, but how would anyone know? Has a study been done asking people?
This isn’t a war on alcohol, or a conspiracy to link alcohol and smoking so people stop drinking. It’s not a crusade by people that hate alcohol. It was trying to communicate the risk of alcohol to people who might not know how to deal with the statistics presented in dense academic papers.
The “decades of epidemiological studies” referenced is actually a paper from 2018, concluding:
The study supports a J-shaped association between alcohol and mortality in older adults, which remains after adjustment for cancer risk. The results indicate that intakes below 1 drink per day were associated with the lowest risk of death.
The J-shaped association could easily be confounding – teetotalers are different to drinkers in many ways (see above). But that’s not really “decades of studies” anyway, and the conclusion was that drinking very little or nothing was best.
The second reference is to a systematic review of observational studies. This is relevant to the point about decades of research, but not conclusive given they are observational studies.
The claim that the positive association between alcohol intake and heart health has “been tested and retested dozens, if not hundreds, of times by researchers all over the world and has always come up smiling” is facetious.
It betrays a lack of understanding of causality, of publication bias, of confounding and reverse causation.
Basically, a lack of understanding about the very studies the article is leveraging to support its argument. It shows ignorance of how to make causal claims, because the entire premise of the argument has been built on observational data.
This next part is inflammatory and wrong.
It certainly wouldn’t put you in the “Flat Earth” territory to believe that alcohol might not be good for you, unless you took as gospel that observational evidence was causal.
This reference is to observational studies, not “biological experiments”. I don’t know which biological experiments are meant here, maybe the ones I talked about earlier and dismissed? Also, the best observational evidence we have is probably genetic for many things, because the chance of confounding is slightly less. And the genetic studies say any level of alcohol has a risk.
There are certainly people who have agendas. People who want everyone to stop drinking. I do not doubt this. But who in the “‘public health’ lobby” is the article referencing? What claims have they made? Without references, it’s a pointless argument.
Also, public health people would like it if everyone stopped smoking and drinking, because public health would improve. That is, on average, people would be healthier – even if alcohol helps the heart, more people die of alcohol-related causes than would be saved by any protective effect of alcohol.
But this doesn’t mean public health people call for teetotalism.
To my knowledge, they generally advocate reducing drinking and in general, moderation. Portraying them as fanatics who “deny, doubt and dismiss” is ludicrous.
Prospective studies are good, because they can rule out reverse causation, i.e. heart problems can’t cause you to reduce alcohol intake if everyone starts with a good heart. But they do not address confounding. They are just as vulnerable to confounding as cross-sectional studies.
So prospective studies might be the best “observational” evidence (not “epidemiological” evidence given we deal with trials too), but only if you want to discount genetics. And “best” doesn’t mean “correct”.
Statistical significance in individual studies is not something I have ever talked about in meta-analysis. Because it isn’t relevant. At all. In fact, if your small studies are all significant and your big studies aren’t, it’s probably because you have publication bias, i.e. small studies are published because they had “interesting” results, big ones because they were good.
The article is now comparing meta-analyses with 31 and 25 studies with one with 2 studies. Given the large variation in the effects seen in the studies from the previous meta-analyses, I wouldn’t trust the result of just 2 studies. I actually tried to find those 2 studies to see if they were big/good, but in the original meta-analysis paper, they don’t make it easy to isolate which studies those two actually are. So I gave up.
This part is a fundamental misunderstanding of statistics. Saying something is “not statistically significantly” associated with an outcome is not the same as saying something is “not associated” with an outcome.
There are plenty of reasons why even large associations may not be statistically significant. In general, it will be because you didn’t study enough people, or for long enough. But how the analysis was conducted matters, as does chance. But it takes as much or more evidence to prove two things aren’t associated as proving they are.
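A quick illustration with simulated data: the same true effect can be "significant" or "not significant" purely depending on sample size:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# The same true difference in means (0.2), tested at two sample sizes
for n in (20, 2000):
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(0.2, 1.0, n)
    t, p = stats.ttest_ind(a, b)
    print(f"n={n:5d}: p={p:.3f}")
# The small study will usually be "not significant" despite an identical true effect
```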
If you start from the assumption that alcohol is good, then yeah, you would need evidence that there are risks from very light drinking. But why start from that premise?
We know that drinking lots is bad, so why assume drinking a bit is good? I can see why, when presented with evidence that moderate alcohol drinking and good heart health are correlated, people might think drinking is good for your heart. But what about every other disease?
In the absence of complete evidence, it would make sense to assume that if lots of alcohol is bad, some alcohol may also be bad. I think it is a bit much to start from the premise that because moderate drinking is correlated with good heart health, small quantities of alcohol are fine or good.
The burden of proof should be on whether alcohol is fine in any quantity. And then finding out how much is “reasonably ok”, and at which point it becomes “too much”.
And no, again, we don’t know that “very light drinking confers significant health benefits to the heart”, because this is a causal statement and you only have observational evidence. If you drink very lightly, your heart may well be in better shape than people who drink a lot or don’t drink, but that doesn’t mean the drinking caused your heart to be healthy.
I certainly dismiss this article as quackery with mathematics…
Actually, this is a good point, but is against the article’s argument. If you have low-quality, biased studies in a meta-analysis, that meta-analysis will be more low-quality and biased. Meta-analysis is not a cure for poor underlying research.
Stated somewhat more prosaically:
shit in, shit out
“Ultra-low relative risks” is itself a relative judgement. Most people won’t be concerned about small risks. But they can make a big difference to a population.
Research is often not targeted at individuals, it’s targeted at people who make policies that affect vast numbers of people. A small decrease in risk probably won’t affect any single person in any noticeable way. But it might save hundreds or thousands of people.
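A toy calculation shows the scale (all numbers invented):

```python
# A "tiny" relative risk applied to a whole population
baseline_risk = 0.01   # 1% lifetime risk of some cancer
relative_risk = 1.05   # a 5% relative increase
population = 50_000_000

extra_cases = population * baseline_risk * (relative_risk - 1)
print(f"Extra cases: {extra_cases:,.0f}")  # 25,000 people - invisible individually
```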
The article is guilty of the same thing. It “clings” to research that shows a beneficial effect of alcohol because it suits the argument. The observational evidence is confounded. It’s biased. The trial evidence is likely biased and wrong.
If all your evidence suffers from the same flaw (confounding, affecting each study roughly the same), then the size of your pile of evidence is completely irrelevant. A lot of wrong answers won’t help you find the right one.
A good example in a different field is survivorship bias when looking at the damage done to planes returning from missions in WW2. Researchers looked at the damage on returning planes, and recommended that damaged areas get reinforced.
Except this would be pointless.
Abraham Wald noted that planes that returned survived – they never saw the damage done to the planes that were shot down. If a plane returned with holes, those holes didn’t matter. Whereas the areas that were NOT hit did matter. It wouldn’t matter how many planes you looked at. You could gather all the evidence that existed, and it would still be wrong, because of bias.
The same is true of observational studies.
You can do a million studies, but if they are all biased the same way, your answer will be wrong.
The article makes the same ignorant point once again, conflating observational research with causal inference, while also cherry-picking studies. The facile point Snowdon makes about spending time on PubMed to reinforce his own views belies his own flawed approach to medical literature.
And that’s it for the article!
In summary, the article uses observational data to make causal claims, cherry picks evidence (while accusing others of doing the same), and seems to misunderstand basic statistical points about statistical significance.
But it also got me thinking about the previous conferences and training courses I’ve been on, and how tricky I find it to do something that seems to be pretty essential to an academic’s career: networking.
In this post, I’ll talk about my past conferences, and how I muddled through without any idea what I was doing. I still have no idea what I’m doing, so don’t expect any helpful tips or hints (I mean, “talk to people” seems to be the sole advice necessary, perhaps with the additional hint to look up who’s going to a conference and maybe hit them up with an email beforehand saying you’d love to meet for a chat). But if you feel like you don’t make the most of conferences, or have trouble starting conversations with people you don’t know, then at least you’ll know you’re not alone. It’s probably worth mentioning that I spoke reasonably frequently in my own department and took many courses there, but I don’t class that as at all similar to speaking at a conference or going away on a weeks-long course.
The first conference I went to was a one-day event in London (UK) organised by The Lancet in 2013 (the first year of my PhD). I wasn’t presenting, and, to be honest, I can’t remember much about this conference, other than that I sat in a lecture theatre for a day, didn’t really move and didn’t speak to anyone. I took away two impressions: the first was that Ben Goldacre’s hair is magnificent; the second was that an early-year PhD student spoke to this massive room full of professionals, and I thought “I can’t imagine doing that”.
I also wasn’t presenting at the second conference I went to, the UK causal inference meeting in Cambridge (UK) in 2014 (now the European causal inference meeting, how times change). I was massively out of place here – the conference was held in the mathematics department, which should have been my first clue. It turned out that my tiny epidemiology/medical statistics brain was unprepared for very technical lectures about things I didn’t understand. I guess the lesson here is to check the conference thoroughly before you go, to avoid wasting hours sat staring at a series of intimidating Greek symbols you can’t even guess the meaning of.
I went to a lot of local training courses (Bristol does loads), but this was my first training course that lasted more than a week, and it also happened to be on a different continent. The summer session at the University of Michigan (USA) was a 3-week course: basic epidemiology every morning for the 3 weeks, plus three 1-week courses in the afternoons. The basic epi course was helpful, as were two of the afternoon courses. The third course, however, was taught entirely in R and SAS, statistical software packages I had no understanding of. I did not attend after the first day; I couldn’t understand what was going on, and I figured my time could be better spent.
The University of Michigan is in Ann Arbor, which is great, and I spent a lot of time walking/running around the nice woods. I went with a colleague from my university, which can certainly be difficult – it’s one thing to see someone in the office day-to-day, but a three-week trip to a different continent is very different. I think it went ok, but I can only speak for myself…
Overall, we managed to get to know a couple of other course delegates, but as many of the attendees were there as part of groups, it was quite difficult to socialise. Still, we learnt things, which was the major focus of our time there. I also realised that home-cooking is great – I was pretty sick of takeaway and eating out by the end of the trip. On the way home, we stopped off in Washington and New York – because the air fare wasn’t any different, we didn’t have to pay for the flights (universities are great), but we obviously paid for our hotel rooms.
The third conference I went to was in Alaska, and it had a suitably long name (short names are for boring conferences). It was a 5-day event, and I was presenting a poster on the first day, which is especially fun when it’s a 20-hour trip with a 9-hour time difference.
I’d never been to a conference of this size before, so had no real idea of what to expect. Still, I arrived, registered, and slept in preparation for all the questions I would undoubtedly be asked in the poster session. I also went through the conference schedule to find all the sessions I wanted to attend. I’d been given the advice not to bother with sessions I clearly wouldn’t find interesting: it would be a better use of time to take an hour off, read something, or do some work instead. There were probably 5 or more parallel sessions at any one time, mixed in with plenaries and social events. As it turned out, because the conference had a very broad remit, I don’t think any two sessions I wanted to go to were ever on at the same time, which was nice.
I arrived promptly for my poster session. During the session, I spoke to all of two people. They were both lovely (I even went with one, and an Alaskan native, up a mountain on the final day, which was awesome). But it seemed something of an anticlimax to travel several thousand miles and only speak to two people about my work. It wasn’t even that my poster was incredibly dull (it was only slightly dull, I’m sure), it was that very few people came to a session where there were hundreds of posters on display. I was a little relieved I didn’t have to talk to too many people, but mostly felt deflated.
For the rest of the conference, I went to a Wellcome Trust organised event (as they sponsored my PhD), walked a little around Anchorage, and, as mentioned, went up a mountain with some conference delegates. The necessity of bear mace was a little daunting, but there were people literally running up the mountain (presumably for pleasure), so I figured it was fine. Although since there were quite a few people on the mountain, maybe the runners just felt they didn’t need to outrun the bears…
To round off 2014, I went to a conference in Glasgow (UK) – the 7th in its series. I was presenting a poster, and after presenting in Alaska I wasn’t really looking forward to it. I was vaguely aware that there would be a poster walk though, which I guessed meant someone official would lead a group around, and whoever was in the group would read each poster and maybe ask some questions.
I was therefore quite surprised to find that I would, in fact, be part of the poster walk: everyone scheduled to stand by their posters during the session would be required to give a talk about their poster to all the other people scheduled to stand by theirs. Looking at the layout of the posters, I figured out I would have to speak about halfway through the session. Nowadays, this would give me plenty of time to think of something to say (it was probably only a 5-minute speech at best), but as I had never given a public speech like this before, I felt some pressure.
Still, red-faced and stuttering, I gave a short talk about the work I had done the previous year and answered a few questions. I have literally no idea what anyone else’s posters were about – I was too busy racking my brain thinking of what I needed to say, or too busy feeling relieved to pay much attention.
Moral: even if you are just presenting a poster, be prepared to give a talk about it to 20 other poster people.
When I was looking around for conferences at which I could speak, preferably to an audience sympathetic to a new PhD student who hadn’t spoken at a conference before, I was advised that the Young Statisticians’ Meeting (YSM – I couldn’t find a good link) in Cardiff (UK) would be a good fit. In fairness, the YSM conference was indeed a good place to give my first presentation – there were only two parallel sessions, the crowd were nice statisticians (many of whom had come over from the Office for National Statistics (ONS) in Newport), and the other presenters were a mixture of ONS staff, early postdocs and PhD students like myself.
I had practised my talk a fair few times, both to myself and with others, so felt prepared. I don’t remember the specifics, but I gave a talk about albatross plots in a lecture theatre (one of the old ones with wooden benches rising up in tiers, I think), answered some questions, and only really felt nervous before speaking (I’ve rarely felt nervous once I’ve started – too busy concentrating on speaking, I guess). Afterwards, a nice man came up and chatted about my talk, although as far as I remember, this has to date been the only time anyone has chatted with me after a talk.
I’d love to say that after this talk, the ice had been broken and I never felt nervous before speaking to a crowd about my work again, but it doesn’t really work like that. Over time, I’ve become pretty inured to giving talks (and will usually quite happily talk in front of anyone about anything now), but I still get a little nervous at conferences.
In any case, any UK-based statistical-type people who are looking for a first-time conference, YSM is a good place to start. They’re friendly!
In the third year of my PhD (of four years), in 2016, I decided I should probably go to more conferences and give talks. As such, I sent abstracts to the International Society for Clinical Biostatistics (ISCB) conference, the Royal Statistical Society (RSS) conference, and the Cochrane Colloquium. All the abstracts were about the albatross plots I developed, and I figured I would go to the ones that accepted my abstract. As it turned out, they all did.
The ISCB conference was in Birmingham, and although some of the talks were relevant to my work, there wasn’t much on what I was most interested in at the time (evidence synthesis). Still, I enjoyed the conference, and was looking forward to giving my talk. I was immediately daunted by the size of the lecture theatre though – it was a full-on 300/500/some large number seat auditorium, with a projection screen the size of a cinema screen. It was much bigger than I expected – with so many parallel sessions, so few talks about evidence synthesis at the conference, and so few people in the auditorium for my session, I had assumed I’d be in a smallish room.
I was presenting last, so I read and re-read my presentation and paid no attention to the people on stage (yeah, it was a stage) before me. When it was my turn, I headed up, probably quite a bit more nervous than I’ve been since. The size of the screen behind me was a distraction – I was used to being able to gesture at the plots for people to know what I was talking about, but that wouldn’t work here. I was also distracted by the size of the stage – whenever I’d talked before, I’d had to stay pretty much in place to avoid being blinded by the projector or blocking people’s view.
I managed to say what I needed to though, and it probably went fine. I was asked some questions by people in the audience, and I answered as well as I could. I was given a recommendation to add something to the plot, which I instantly forgot because I was too busy trying to remember how to reply. It’s like exchanging names – I’m too busy trying to remember how I say my part (“hello, my name is…”) to remember the name of the person to whom I’m speaking. It would probably be easier if I went first… Still, like trading names, I didn’t feel I could whip out my phone and note down the recommendation before I forgot it.
In any case, I finished up and sat down (the chair said it was nice to hear about something “refreshingly vague”, which I still think is a compliment, but can’t quite be sure). There was a bit of time at the end before a plenary, so I spoke to one or two people who wanted to know a bit more – I even made an albatross plot for someone who was asking about, I think, the ISIS-4 trial in a meta-analysis with previous studies looking at magnesium sulphate to treat early heart attacks (hint: it looks weird in a meta-analysis).
After the little break was over, the plenary speaker got up to deliver his lecture. I realised immediately that it was the same person who gave me the recommendation I had now forgotten – probably one of the most famous living statisticians. Damn. I really wish I’d whipped out my phone and noted down what he said.
The RSS conference up in Manchester was much the same: lots of parallel sessions, not speaking to anyone, limited applicability. The difference was that this time I gave a speech in a smaller room, although to be honest I don’t remember much about it. I guess it was unexceptional, apart from the amusing coincidence that the person speaking immediately before me had also spoken immediately before me at the ISCB conference. That’s niche academia for you, I guess. I also went to an early career meet-up on the first night, but for whatever reason I wasn’t really in the right frame of mind to be exceptionally sociable, and I don’t think I saw anyone from that night again during the conference.
The Cochrane Colloquium in Seoul was quite different. Mostly because it was in Seoul. There were still many parallel sessions, although now my problem wasn’t finding something I wanted to go to, it was whittling down the things I wanted to go to, since they all happened at the same time. Overall, I’m not really a fan of more than a couple of parallel sessions – a few people from my university went to the conference (colloquium…), and we were all speaking at the same time. This was a bit disappointing; I was looking forward to listening to a friend’s talk.
It also meant that each lecture room was pretty small. I guess that’s good for intimacy, but at the same time, I’d travelled across the globe to give a speech to a room that at best had 30-40 people in it. Probably more than were in the auditorium at ISCB, but that was 2 hours away by train at rush hour. I gave my speech, went to interesting talks, failed to win “best talk by a newbie” (understandably, I was just hopeful), went to some training-type sessions, the usual stuff.
Seoul had, however, quite a few differences from previous conferences. There were social things happening, and I got to know a few people. This really made a difference – like in Alaska, I could now go and do things with people, including karaoke before and after soju (soju is great), going on a tour of Seoul, cooking our own bulgogi (Korean BBQ), making kimchi (dear lord, do not keep kimchi un-refrigerated in your hotel room), and going on a tour of the demilitarized zone (DMZ) between North and South Korea. For those that haven’t been, the DMZ is the no-man’s land between the still-warring North and South. The North side is all barren and military-esque. The South side has an amusement park, with rides. I… I’m not sure the South are taking this seriously.
In short, Korea was much better than previous conferences, and it was due both to the more relevant talks (good ol’ evidence synthesis), and meeting people and being able to do things with them. So although I have no tips on HOW to achieve this (karaoke and soju work great, but probably limited opportunity to get them both together), it was certainly a good thing at this particular conference. I imagine it also helped that I knew some people at the conference already – with the exception of the YSM, where I met a few people that I knew at the conference, I had never been to a conference where I knew anyone. So maybe go with a friend, if possible.
If you can, go to the winter course in Wengen (Switzerland). You can ski. It’s brilliant.
People from my university also teach on it, and as they’ve taught me on short courses before, I can say they’re pretty good. The courses I went on were also pretty good (James Carpenter was particularly good, I thought). Wengen probably isn’t the most exciting place to go if you don’t like hiking up snowy mountains or skiing/snowboarding/tobogganing down them, but it’s great if you do.
There’s always a conclusion, right? Well, what I’ve learnt from my experiences is that conferences can be pretty hit and miss in terms of content – sometimes everything is interesting, sometimes very little is – and it can be difficult to get to know people, especially if they’re already there with people they know. However, sometimes (and this is probably much more true of the longer conferences), you can make some good friends and have a great time. So far, I’ve only met people randomly – the social events with dedicated “get to know each other” or networking sessions have never really worked for me.
I’m still nervous before giving a speech. Much less than I used to be, but still a bit. Practice has helped – the more conferences I do, the better I get – but I also teach on some of my university’s short courses, and that practice has helped a lot too. As has the knowledge that in several years of giving talks, no one has ever slammed my ideas or been rude; literally everyone who has spoken to me has been nothing but nice and friendly.
So yes. If you’ve never been to a conference before, start with something relatively small, preferably go with at least one person you know, possibly go to some without a poster or talk, then go with a poster, then a talk. That’s the progression I followed, and it felt fine. Of course, you could always dive in with an international conference talk on the first go – it’ll probably be fine. I hope that people in other fields are as nice as they appear to be in mine. And try to make friends, but if you don’t, that’s fine too.
I did the 3-minute thesis competition in 2017, and I sucked. Properly sucked. At one point I didn’t speak for 10 seconds, because I forgot what I was supposed to say. You get one slide in the 3-minute thesis, and 3 minutes to describe your thesis, and this format did not work well either with my content (my thesis is pretty long and complicated, and the underlying statistical problem that makes up the interesting part usually takes rather longer than 3 minutes to explain properly to a lay-person), or with me. So all my speaking practice meant squat when it came to talking in a really limited time-frame in a competition. I’m fine with a casual chat about my work, but something about that competition made me into a babbling wreck of a speaker.
My supervisor who was there said it was fine. I don’t think I believe them…
If you’re wondering how I managed to go to so many conferences as a PhD student: I did my PhD with the Wellcome Trust, who gave me a reasonable sum that I could use on anything – recruiting patients, travel, training, lab reagents/equipment, data, etc. If you ever think about doing a PhD in epidemiology, I would strongly recommend the Wellcome Trust’s programmes. There are other options, but it’s late, I’m tired, and I like Bristol.
As I said last week, my PhD was in epidemiology at the University of Bristol, and lasted for 4 years. My thesis was the written report that described what I did for the latter 3 years, and it was assessed in my Viva.
Three years ago, when I was just starting my PhD proper, we discussed what I would do and in what order. My aim was to make PSA testing for prostate cancer better, since PSA is a bit awful at detecting prostate cancer (although there is nothing better at the moment). To do this, I intended to find an individual characteristic (things like age, ethnicity, weight etc.) that was associated with both prostate cancer and PSA.
Quick note: PSA is a protein made in the prostate that can be found in the blood – if the prostate is damaged, then more PSA is in the blood (be it from cancer, infection or inflammation). A high PSA level indicates damage to the prostate, but it is very hard to tell from what – only a biopsy is definitive.
If something is associated with prostate cancer, then the overall risk of prostate cancer changes with that something, which is then called a “risk factor” for prostate cancer. For example, Black men have a higher risk of prostate cancer than White or Asian men. If something is associated with PSA, then the overall level of PSA in the blood changes with that something. For example, taking the drug finasteride roughly halves PSA levels. Finasteride can be used to reduce the size of the prostate (for conditions like benign prostatic enlargement), and also for hair loss.
When PSA is used as a test for prostate cancer, a PSA level of less than 4.0 ng/ml is usually considered “normal” (although other thresholds are used, such as 2.5 ng/ml or 3.0 ng/ml, and they can depend on the age of the man). Since finasteride reduces PSA levels by half, this is sometimes taken into account by doctors – if a man taking finasteride has a PSA test, his measured PSA might be doubled to get a more accurate reading. This is a simple example of adjusting test results to better fit each individual; if the PSA were not doubled, it would be much lower than it should be, and might mask prostate cancer. So if something affects PSA, it should be taken into account when measuring PSA.
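As a toy illustration of that kind of adjustment (the doubling rule and the 4.0 ng/ml threshold come from above; the example values are invented):

```python
# Toy version of the adjustment described above. The doubling rule and
# the 4.0 ng/ml threshold come from the text; the example values don't.
def adjusted_psa(measured_psa, on_finasteride):
    """Double measured PSA for men on finasteride, else leave as-is."""
    return measured_psa * 2 if on_finasteride else measured_psa

# Measured 2.6 ng/ml on finasteride: "normal" unadjusted, but the
# adjusted value of 5.2 ng/ml crosses the 4.0 ng/ml threshold.
print(adjusted_psa(2.6, on_finasteride=True))  # 5.2
```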
Things become a bit more complicated when something affects both prostate cancer risk as well as PSA. If something increases the risk of prostate cancer (such as being Black), then on average, it also increases PSA levels, since prostate cancer also increases PSA levels. This effect can be removed if you just look at men without prostate cancer, but this is tricky, since prostate cancer is common and lots of men can have prostate cancer without realising or being diagnosed (the statistic at medical school was 80% of men at 80 years old have prostate cancer).
As an additional problem, because men have PSA tests before being offered a prostate biopsy to see whether they have prostate cancer, anything that affects PSA levels may look like it affects prostate cancer risk too. If something lowers PSA levels (like finasteride), then some men will go from having a PSA above the threshold for a biopsy, to having a PSA below the threshold. So although the risk of prostate cancer may be the same, because not all men with prostate cancer are DIAGNOSED with the disease, it can look like things that reduce PSA are protective for prostate cancer, and things that increase PSA are a risk for prostate cancer.
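A quick simulation shows this detection-bias effect (my own sketch; every number here is invented and deliberately crude):

```python
# Finasteride halves PSA but leaves the true cancer risk untouched -
# yet fewer cancers are DIAGNOSED, because fewer men cross the biopsy
# threshold.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
cancer = rng.random(n) < 0.10                     # same 10% risk for everyone
psa = rng.lognormal(mean=0.5, sigma=0.8, size=n)  # arbitrary PSA distribution
psa[cancer] *= 2                                  # cancer raises PSA

for label, factor in [("no finasteride", 1.0), ("finasteride", 0.5)]:
    biopsied = psa * factor >= 4.0                # biopsy if PSA >= 4.0 ng/ml
    diagnosed = (biopsied & cancer).mean()        # assume biopsy finds cancer
    print(f"{label}: {diagnosed:.1%} diagnosed")
# Same underlying risk, fewer diagnoses on finasteride - so finasteride
# looks "protective" in diagnosis data when it isn't.
```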
Below is a diagram showing the effects of increasing age on prostate cancer status (i.e. whether a man actually has prostate cancer) and PSA (PSA levels increase with age), and how this could affect prostate cancer diagnosis. We are reasonably sure that age increases both prostate cancer risk (as with many cancers) and PSA levels (the prostate becomes more leaky over time, letting more PSA into the blood), but it is not so clear for other things.
My PhD was to look for a variable (individual characteristic) that was associated with both prostate cancer and PSA, and to work out how much it affected each – and therefore how much PSA would need to be adjusted to account for the effect on PSA, without touching the effect on prostate cancer. So for age, I would work out exactly how much an increase of 1 year in age increased PSA – the top right line in the diagram. Once PSA was adjusted for age, it would hopefully be better at finding prostate cancer, since it would no longer be affected by changes in age.
Before I even started my PhD proper, I created a Gantt chart giving the deadlines for all the work I knew I would need to do. First, I would need to find a variable, then perform a couple of systematic reviews to find all the studies that looked at the associations between the variable, prostate cancer and PSA. Then, I would need to conduct a couple of meta-analyses, which would combine the associations to get my final results. I also wanted to use individual participant data (published papers in epidemiology usually give summary results, telling you the association between two things, rather than listing individual participant results, which would generally contravene patient confidentiality). The individual data would be used to enhance the meta-analysis results, but individual data take a long time to source (about a year for all the data I requested). The Gantt chart I created is shown below.
This Gantt chart shows the 9 chapters I intended to write, the bare basics of what I would need to do for each, as well as when I would write up each chapter (“thesis production”). As far as plans go, this one was pretty limited, but it gave us a timeframe to work from.
In actuality, I didn’t stick to this very well at all. Chapters got removed or changed, new chapters were added, but the point I wanted to make remains: as I did each stage of research, I wrote up a chapter detailing what I did and how, as if each chapter was its own research paper. This meant when it came to the final 3 months, I was still adding new content, but the bulk of the work had already been done and I didn’t really need to remember what I did 2 years ago, it was right in front of me. Also, since I was writing up as I went along, I was forced to fully comprehend everything I was doing and why – doing it is one thing, but doing it and writing it down so other people could follow the rationale needs much greater understanding.
PhDs should all start with a plan (Gantt chart optional but useful), but writing up as you go along is definitely a huge time-saver in the end. I admit that what I wrote in the earlier years was… well… poor quality, but that’s to be expected. The only way to get better is to practice, and writing up as I went along gave me plenty of useful practice.
My thesis was split into 8 chapters (eventually), each with its own objectives (which I listed in the thesis itself in the first chapter, and below too). I had 2 introductory/background chapters, 5 analysis chapters, and a discussion chapter.
Chapter 1: Provide background information on prostate cancer and PSA, as well as using PSA as a screening test for prostate cancer.
Chapter 2: Describe evidence synthesis methodologies relevant to this thesis.
Chapter 3: Identify individual characteristics that have a plausible association with both prostate cancer and PSA, and have a large body of evidence examining this association, then select a characteristic to examine further.
Chapter 4: Perform a systematic review and aggregate data meta-analysis of studies examining the associations between the chosen characteristic, prostate cancer, advanced prostate cancer and PSA.
Chapter 5: Identify and collect data from large, well conducted prostate cancer studies, then perform an individual participant data meta-analysis of the associations between the characteristic, prostate cancer, advanced prostate cancer and PSA.
Chapter 6: Combine the results of the aggregate and individual participant data to estimate the associations between the characteristic, prostate cancer, advanced prostate cancer and PSA as precisely as possible.
Chapter 7: Perform a Mendelian randomisation analysis to assess evidence for causality between the characteristic, prostate cancer, advanced prostate cancer and PSA.
Chapter 8: Summarise the results, strengths and limitations from the thesis, and indicate what direction future work may take.
In addition to the 8 chapters, I had a title page (with a word count), an abstract, a dedication and acknowledgements page, a declaration (that I didn’t cheat), then a section with the contents, list of figures and list of tables. My appendix contained information I thought was too specialist for the thesis, or just surplus to requirements (but still interesting) – and given this was a thesis, the specialist stuff was really niche… I also put 2 papers I published during the PhD as images at the end of the appendix (turning PDFs into images for inclusion in Word is incredibly irritating, but I thought it best to do it this way rather than combine the PDFs later). These papers were relevant to the thesis; I published other papers but left them out. My appendix also had a list of acronyms; I included over 100 previously conducted studies in my thesis, most of which had acronyms, so a list of them (and of all the medical and statistical terms that get acronymised) was likely pretty useful.
Side note: writing papers for outside projects was also very beneficial during my PhD, and I would recommend PhD students do it if there’s time. Firstly, outside work can pay. Secondly, working with other people on other projects increases your research skills and contacts, and counts as networking, something I still struggle with. Thirdly, it increases the number of papers you’re on, something I am told is very important in academia. Finally, concentrating on one piece of work for 3 years can be crushing – taking a break to do other work can, paradoxically, be relaxing. Teaching is also good to do, not least because I found teaching (for me, 30-40 people for only 1-2 hours on a short course) a great way to practise public speaking, which comes in handy at conferences. So yeah – PhD students, do extra work if you have time, it’s great.
My introductory chapters gave the background on prostate cancer and PSA testing I needed people to know before reading the rest of the thesis, and described the fundamental evidence synthesis methods I would use extensively. However, because my analyses were pretty disparate, I kept most of the chapter-specific methodology in the analysis chapters themselves. I imagine this makes things clearer when reading through – if you go one chapter at a time, all the information you need is right there in the chapter, and you don’t have to flick back to the introductory chapters.
My analysis chapters were written at the time of the analysis, with substantial editing later as I became a better writer (note: not a good writer, just better than I was). After I finished each chapter’s analysis, I wrote up what I did and sent it to my 3 supervisors (I hear this is unusual – most departments have 1 or 2 supervisors, but I know of one person in a different university with over 10). My supervisors read through and made comments – most chapters were read and changed 3-4 times before I started compiling my thesis.
My analyses were iterative. Every piece of research is likely at least a little iterative – you start out with an idea, and it gradually becomes refined over time. Writing up each chapter after the analysis helped with this, since I could spot any errors. It did make my code a mess though, so much so that for the main analyses I rewrote the entire thing so it would be clearer. Although, since it’s code, I would happily rewrite it today to make it clearer still. Code never seems to be finished. In any case, I was still editing my analyses up to a month before submission, fixing little errors that had crept in.
It wasn’t until 3 months before the deadline that I assembled a complete thesis out of the individual chapters. I read up on all the university’s rules for the thesis, created the contents, figure and table lists that autogenerate in Word, and captioned all my figures and tables properly so they would appear in those lists. But at this stage, my supervisors still wanted to see individual chapters, so I was maintaining both an up-to-date thesis and individual chapters, which got confusing. I’m not sure what the best approach is here; creating the thesis as a whole was important, took a good day or two to do correctly, and was a good psychological boost, but maintaining different copies of files is asking for trouble.
Side note: I use Mendeley for referencing, rather than Endnote (the two options available to me). Mendeley has three advantages: 1) it’s free, 2) the library of references synchronises between my work and home computers, and 3) it makes Word crash less often than Endnote does. In total, I had 306 references, which was a strain on Word, so preventing crashes was very important. There are few things as irritating as losing 5 minutes of work on your thesis, when that work was fixing typos on 20 different pages that you now can’t remember.
I eventually produced a preliminary-final-draft of my thesis, and had to FTP it (file transfer protocol, used instead of email for large files or for secure sending) to my supervisors because it was too large for an email (we also used a shared drive, but that isn’t accessible from home). My supervisors read through it and made more comments – in total, they probably read my entire thesis 5 times at different stages of its development. This is likely an enormous positive of starting to write early on: it gives supervisors a much longer window in which to make helpful suggestions.
My final-final-draft was submitted a day or two before the deadline. At this stage, I was beyond caring whether I had made any mistakes. Everyone had read through it multiple times, and I just wanted it to be done. I think I made a half-hearted attempt to read through one last time but gave up, and just sent it. I could always make corrections later.
Writing a thesis takes up a substantial amount of life. This can be drawn out over years, or condensed into the smallest possible time. I favour drawing out the process – starting early has so many advantages, whereas starting with 3 months to go is likely to cause an overwhelming amount of stress. My PhD was 4 years, and the limit for PhDs at my university is also 4 years, so there was no chance of taking more time at the end to write up – the deadline was final and immovable (in reality, I’m sure they would give more time if necessary, but they say you need a very good reason to get it).
So yes, start early. This advice could go for literally all work (including the homework I always left to the last minute), but with a thesis, it’s completely worth it.
My PhD was in epidemiology at the University of Bristol, and lasted for 4 years. The first year consisted of 3 mini-projects, which each took about 3 months, followed by 3 months of preparation for the following 3 years, mostly refining my plan for the research I would conduct.
My research mainly looked at whether body-mass index (BMI) was associated with prostate cancer and/or prostate-specific antigen (PSA). The aim was to see if, by precisely estimating the associations using previous data, I could make PSA testing more accurate. PSA testing needs to become more accurate to be clinically useful – presently, about two-thirds of men with a high PSA level don’t have prostate cancer, which means many men have prostate biopsies who don’t need them. I’ll write more about my PhD later, I’m sure.
PhD Vivas are probably one of the more stressful experiences of any PhD student’s life. But before the Viva can take place at all, the student needs to write their PhD thesis, a write-up of all they did during their PhD (or, at least, the bits they want to talk about). When a PhD student is writing their thesis, it is usually best not to ask when it might be done (by the deadline, hopefully), how they are getting on (not well), whether it’s fun (it’s not) or whether they have any free time (they don’t). Lots of students get very stressed during the thesis write-up, but in general I imagine the better the supervision, the more stress-free the experience (my supervisors were great).
My thesis had a word limit of 80,000 words – I ended up using about 65,000, with an extra 20,000 words in my appendix. In it, I gave an overview of the prostate, prostate cancer and PSA testing, a description of the methods I would be using, and then 5 chapters detailing individual pieces of research, leading to a final discussion chapter. It took a while to write – I wrote as I went along, but there was still 3 months or so of editing and adding new content at the end to get it to the final copy. I’ll talk about the thesis later – lots of PhD students I know have had trouble knowing what to do for it and when, so I will write about my experience.
Once my thesis was complete, I posted two copies to my two Viva examiners. Because I had worked in my department before and during my PhD, I had two external examiners – two academics in related fields, chosen by my supervisors, to assess the scientific merit of the work I had done. Usually, PhDs have an internal and an external examiner, i.e. an examiner from the same university (although hopefully unconnected to the student, otherwise there could be bias), as well as someone from a different university. As it happened, one of my examiners didn’t receive the thesis, and my supervisor had to email them a copy (note to all PhD students: have your supervisor check your externals received your thesis).
My examiners had over a month to read my thesis and make comments on it. This task cannot have been fun – 65,000 words is the size of a small-to-medium novel, but written in the dense, complicated style of a scientific journal article (which tend to be 2,000-3,000 words, in my experience). It’s also a fairly unrewarded task – I think examiners might receive a small fee and expenses, but nothing like enough to cover the time spent reading the damn thing. So I offer huge thanks to my examiners, and to all examiners.
After submission, I had a week off. I started a job after that, but it was in the same department working with the same people, and it was a job specifically created for PhDs to write up parts of their thesis for publication in journals, revise for the Viva, and conduct some new research. It is a great job, and gave me time to revise for my Viva. In total, I had 2 months between submission of my thesis and the Viva.
In those two months, I can’t say that I did as much preparation as I could have. I think I read through my thesis once. This is always a painful task, since you inevitably notice typos, errors, and just generally unclear bits. But it needs to be done so you can talk about everything you did in the thesis. So I made notes of what I needed to change after the Viva, as well as any bits I needed to revise.
I looked up typical questions that are asked in most Vivas, and jotted down some answers. Mostly, it was along the lines of “which methods did you use and why, and which others could you have used?”, and “what potential impact will your research have?”. My supervisors had me write a detailed 2-3 minute talk about what I did and rehearse it, just in response to the most common opening question – what did you do? One piece of advice I received was to nail 5 key points, and have a couple of optional extras in case you need them. This was good advice – the first question was indeed to describe what I did.
The Viva itself was to test 3 things (or so I’ve been told): 1) did I write the thesis; 2) do I know the science behind what I did; 3) is the science any good? The first and third points I had little concern about – I knew I wrote the thing (it took up a large chunk of my life, and I went through it before the Viva), I knew the methods I used were valid (I published a paper using similar methods during the PhD), and I knew my supervisors are good at what they do. If there were any problems, they would have spotted them, and anyway, there was nothing I could do about the science once the thesis had been printed.
That left the second point, the one I focused on the most. While I have a good grasp of systematic reviews and meta-analyses, two techniques I made extensive use of in my thesis, I also used Mendelian randomisation (which I have used less and am thus less confident with), and several statistical methods that I could explain in simple terms but would be stumped to go into detail on. I therefore spent time reading up on anything I was unsure about – this turned out to be almost completely unnecessary, but it was likely worth my time revising those concepts anyway.
At the end of the Viva, the examiners make recommendations to the exam board, which determines whether the student passes, and if so whether they need to make any corrections. Passing without corrections is rare in my department (1-2%). Minor and major corrections account for almost all Viva results. These are both still passing grades; the difference is the amount of time required to get the thesis up to the standard the examiners would like. Minor corrections should take less than a month to make, whereas major corrections could take up to 6 months. If someone is working, this can be taken into account and major corrections given, so the student has more time to work on them. Rarely, the examiners recommend the student take a year to redo their work and resubmit – this is not a passing grade. There may be worse outcomes, but these would be vanishingly rare in my department. All examiner recommendations go to the exam board, which has the final say, but I don’t know in what circumstances they would ever not follow the examiners’ recommendations.
My department are great – I had two mock Vivas before the real thing. These were two hour-long sessions where professors and researchers read through specific chapters of my thesis and grilled me on them. This was both good practice for being grilled (although as an academic, I’m fairly used to that from supervisor meetings, conferences and other presentations), and finding any weak parts of the thesis that need my attention.
They were set up as close to the real thing as possible. The mock-examiners met in a room and discussed what they would ask and how, then I was called in. They asked about what I did in general terms, then went through the chapters in question systematically, asking why I did things, could I have done them other ways, pointed out issues and made recommendations if appropriate. At the end, they gave some feedback.
After the mocks, I felt more prepared, because I could answer most of their questions well enough to satisfy both myself and them. However, I was talking to a friend who did their Viva shortly before me, and they felt that if they had a mock that went poorly, their confidence would have been shaken for the real thing and they would have done worse.
As it turned out, my mocks were in fact much harder than the real thing.
My Viva started about midday. That morning, I read through key parts of my thesis and thought through some questions that might come up, but generally took it easy. I arrived about 10, bought some lunch, and found a quiet place to revise.
Because I had two external examiners, I also had an internal chair, someone to do the introductions and make sure everything is above board. I managed to catch them as they went in with the examiners for lunch and let them know where I’d be, so they could come fetch me when they were ready.
About half an hour after they went in, I was collected. After a quick introduction, the examiners took turns to ask me questions. It started with the general “what did you do?”, and progressed from there. The difficult questions that came up in the mocks made no appearance – what I remember talking about most were the new methods I developed in order to do my research slightly better, which happened to be the bits I enjoyed the most and liked talking about. There were a few things I had to change to satisfy them, but mostly this was putting work back into the thesis I had taken out, thinking it was too much detail. Overall, it was a pretty enjoyable experience, and was over (apparently) quite quickly – all told, I was in there about 1 hour 40 minutes.
Once over, the examiners asked me to leave while they had a discussion, and I was shortly called back in to be told I passed with minor corrections. We then made some awkward chitchat – my supervisors were coming to talk to (and thank) the examiners, but had planned on me being in there longer. However, everyone eventually arrived, more chitchat was exchanged, and we all left feeling quite happy.
My supervisors and I went for hot chocolate afterwards.
I’ve spoken to a few people about their Viva. Some enjoyed it, some hated it. Examiners can be great (mine were lovely), or they can be awful (someone said their examiners were known for asking extremely awkward and difficult questions, which isn’t really the point). I think the fear of the Viva is a bit disproportionate to the risk of failure – it’s true that it’s a very important exam, but so long as the supervision has been good and the supervisors are happy with the thesis itself, there should be little risk of failure. Examiners (in general) want you to pass. And problems with the science (I admit complete ignorance of non-scientific PhDs) can be fixed in corrections. Some people will of course fear talking about their work in front of strangers for two hours, in which case practice may help (as does having a supportive department – I’ve heard of professors slamming students’ work in seminars; this is not constructive).
I think the Viva can be also something of an anticlimax for many students. A PhD in this country usually takes 3 years, mine took 4, and a thesis can take months to write. There can be months of waiting between submission of the thesis and the Viva, and then hours to wait on the day. Then the Viva happens, and it goes pretty quickly. Then it’s done. Years of work, judged in a couple of hours. And then you’re a doctor, hopefully (although I’ve never actually found out when you legitimately become a doctor, probably after graduation). Still, the relief is good.
So overall, I liked my Viva. The mock Vivas were helpful, if not for the specific questions then for the experience. Preparation was not massively important for my Viva, but could easily be for others – reading through my thesis once would likely have been sufficient for me, though I might have felt woefully unprepared having done only that. And examiners should definitely be thanked, often and well.
In the second year of my PhD, I conducted a systematic review of the association between milk and IGF (another future blog post to write). We found 31 studies that examined this association, but the data presented in their papers were not suitable for meta-analysis, i.e. I couldn’t extract relevant, combinable information from most of the studies. There was lots of missing information, and a lot of different ways of presenting the data. The upshot was that we couldn’t do a meta-analysis.
This left few options. Meta-analysis is the gold standard for combining aggregate data (i.e. results presented in scientific articles, rather than the raw data), for good reason. The main benefits of meta-analysis are an overall estimate of the effect (the association between two variables, the effect of an intervention, or any other statistic), and a graphical display showing the results of all the studies that went into the analysis (a forest plot). The alternatives to meta-analysis don’t have those benefits.
The least statistical thing we could have done is a narrative synthesis, where each study found in a systematic review is described, with some conclusions drawn at the end. A narrative review is different to a narrative synthesis, and does not involve a systematic search and inclusion of studies; it’s more of a cherry-picking method. In either case, the potential for bias (conscious or sub-conscious) is large, it takes forever to write the damn things, and it also takes forever to read them. I really wanted to avoid writing a narrative synthesis, and so looked around for other options.
There were a couple of statistical options that would mean less writing. The first was vote counting (and yes, I’m linking Wikipedia articles as well as academic papers – I think it’s often more helpful than linking a whole paper or a textbook, which I know most people don’t have access to). Vote counting is where you add up the number of studies that have a positive, negative or null effect (as determined by a P value of, usually, less than 0.05). It’s the most basic statistical evidence synthesis method. It’s also an awful use of statistics (dichotomising P values at a threshold has been criticised since the turn of the millennium), and even if you decide there is evidence of an effect, you can’t tell how large it might be. Combining P values (Fisher’s method) is a bit more helpful; it combines all the P values from all the studies you want, and spits out a combined P value, indicating the amount of evidence in favour of rejecting the null hypothesis (i.e. that two variables are associated, or that a treatment works). This is slightly better than a vote count, as the P values are not dichotomised. Again, though, there is no way to tell how large the effect might be: a really tiny P value might just mean there were lots of people in the studies.
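For anyone wanting to try combining P values, here’s a minimal sketch using scipy (the P values are invented):

```python
# Minimal sketch of combining P values (Fisher's method).
from scipy.stats import combine_pvalues

p_values = [0.04, 0.20, 0.11, 0.03]   # hypothetical study P values
statistic, combined_p = combine_pvalues(p_values, method="fisher")
print(f"combined P = {combined_p:.4f}")
# A small combined P suggests evidence against the null across studies,
# but - as noted above - says nothing about how large the effect is.
```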
We considered creating harvest plots, which are bar charts showing vote count results with added information about how confident you are in the results, so well-conducted, large studies are more heavily weighted than poor-quality, small studies. The graphs let you see all the data at once, which we thought was quite good. In the end, though, we decided not to use these plots, for two reasons. The first was that we thought we could make better use of the data we had (we knew the number of people in each study, what the P value was, and which direction the effect was in). The second was that we couldn’t make the plots, at least not easily. There was no software we could find that made it simple, which makes an important point: if you design a new method of doing something, make sure you write code so other people can use it. No one will use a new method if it takes days of tinkering to make it work, unless it really is much better than what came before. Michael Crowther makes this point in one of his first papers, and as a biostatistician, he knows what he’s talking about.
The data we extracted from each study were: the total number of participants, the P value for the association between milk and IGF, and the direction of the association (positive or negative). Because I am a simple person, I plotted P values on the horizontal (x) axis, against the number of participants on the vertical (y) axis. Really low P values for negative associations were on the far left of the plot, and really low P values for positive associations were on the far right, and null P values (P = 1) were in the middle.
I put the axes on logarithmic scales to make things look better, so each unit increase along the x-axis on the right-hand side was a 10-fold decrease in P value (from 0.1 to 0.01 to 0.001 etc.). The first plots looked like the one below, with each study represented by a point, P value along the x-axis and number of participants along the y-axis.
This plot showed the P values from all the studies examining the association between milk (and dairy protein and dairy products) and IGF-I. The studies were split into Caucasian (C) and non-Caucasian (NC) groups. The left of the plot shows studies with negative associations (i.e. as milk increases, IGF-I decreases), and the right of the plot shows studies with positive associations. The P values decrease from 1 in the centre towards 0 at the edges.
This looked like it might be a good way of displaying the results of the systematic review, as we could see that there was likely an overall positive association between milk and IGF-I. But we still couldn’t tell what the overall effect estimate might be, or how large it could be. On the bright side, we could see that the largest studies all had positive associations, and we could easily identify outlier studies, for example the two studies with negative associations and reasonably small P values.
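The basic construction is easy to reproduce; here’s a minimal sketch in Python/matplotlib with invented study data (the published figures were produced separately):

```python
# Minimal sketch of the basic plot: signed P values (x) against number
# of participants (y). Study data invented for illustration.
import numpy as np
import matplotlib.pyplot as plt

# (two-sided P value, total N, direction of association: +1 or -1)
studies = [(0.001, 5000, +1), (0.04, 800, +1), (0.5, 300, -1),
           (0.2, 1200, +1), (0.0001, 90, -1)]

fig, ax = plt.subplots()
for p, n, direction in studies:
    # plot -log10(P), signed so negative associations sit to the left
    ax.scatter(direction * -np.log10(p), n, color="black")

ax.set_yscale("log")
ax.axvline(0, linestyle=":")  # P = 1 sits in the middle
ax.set_xlabel("P value (log scale; negative associations left, positive right)")
ax.set_ylabel("Number of participants (log scale)")
plt.show()
```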
Someone suggested putting effect contours onto the plots to make them more interpretable. Effect contours are lines added to the graph to show where a particular effect size should lie. For all studies that calculated their P value with a Wald-type test, where the effect estimate divided by the standard error of the estimate is compared with a normal distribution to calculate a P value, there is a defined relationship between the effect size, the number of participants, and the P value. For a particular effect size, as the number of participants increases, the P value must decrease to make everything balance.
This makes sense – for a small study (say 20 people) to have a tiny P value (say 0.001), it must have a huge effect, whereas a large study (say 20,000 people) with the same P value (0.001) must have a much smaller effect size. I’ll detail the maths in a future post, but for now I’ll just say that I derived how to calculate effect contours for several different statistical methods (for example, linear regression and standardised mean differences), and the plots looked much better for it. For the article, we also removed the distinction between Caucasian and non-Caucasian studies, and removed the two studies with very negative associations. This wasn’t to make our results seem better – those two studies were qualitatively different, and were discussed separately, immediately after we described the albatross plot. The journal also apparently bleached the background of the plot, which I wasn’t overly happy with.
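As a rough sketch of that relationship (assuming, purely for illustration, that the standard error of a standardised beta is about 1/√n – not necessarily the exact formula used in the paper), the test statistic is z = β√n, so for a fixed β the contour is n = (Φ⁻¹(1 − p/2)/β)²:

```python
# Sketch of an effect contour under the assumed approximation
# SE(beta) ~ 1/sqrt(n): a fixed standardised beta traces out
# n = (z_p / beta)^2, where z_p is the normal quantile for a
# two-sided P value p.
from scipy.stats import norm

def contour_n(p, beta):
    """Participants needed for effect `beta` to give two-sided P `p`."""
    return (norm.ppf(1 - p / 2) / beta) ** 2

for p in (0.05, 0.001):
    print(f"beta = 0.10, P = {p}: n ~ {contour_n(p, 0.10):.0f}")
# beta = 0.10, P = 0.05:  n ~ 384
# beta = 0.10, P = 0.001: n ~ 1083
```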
Now the plot has contours, it’s much easier to see what the overall effect size might be. The effect contours were for standardised beta coefficients, where the beta is the number of standard deviations (SDs) change in the outcome per SD increase in the exposure. It’s not the most intuitive effect estimate, but it has good statistical properties (unlike normal beta coefficients). For our purposes, a standardised beta coefficient of 0.05 was a small effect, 0.10 was small to medium, and 0.25 was medium to large. Our exact wording in the paper is here:
Of the 31 data points (from 28 studies) included in the main IGF-I analysis, 29 data points showed positive associations of milk and dairy intake with IGF-I levels compared to two data points that showed negative or null associations. The estimated standardized effect size was 0.10 SD increase in IGF-I per 1 SD increase in milk (estimated range 0.05–0.25 SDs), from observation of the albatross plot (Fig. 3a). The combined p value for a positive association was 2.2×10−27. All studies with non-Caucasian subjects displayed a positive association between milk intake and PCa risk; in particular, two studies [30, 52] had p values of 0.0001 and 0.001, respectively, and both studies had a sample size of less than 100. When considering only Caucasians, the overall impression from the albatross plot did not change; the effect estimate was still considered to be around 0.10 SD. Of the 31 data points, 18 had an estimated standardized effect size between 0.05 and 0.25 SDs and four had an effect size of more than 0.25 SD. Eleven of these data points (61%) used milk as an exposure, including two that had an estimated standardized effect size of more than 0.25 SD [30, 53].
For the plots to be drawn, all that is (generally) required is the total number of participants, the P value and the effect direction. However, some statistical methods either require more information, or require an assumption, to actually draw the contours. For example, standardised mean differences require the ratio of the group sizes to be specified (or assumed), so the contours are drawn for a specific effect size given a particular group size ratio (e.g. 1 case per 1 control). If you can assume all the studies have a reasonably similar group size ratio, then it’s generally fine to set the contours at that ratio. If the studies all have very different ratios, then this is more of a problem, but we’ll get to that in a bit.
In general, if the studies line up along a contour, then the overall effect size will be the same as the contour. If the studies fall basically around the contour (but with some deviation), then the overall effect size will be the same, but you’ll be less certain about the result. The corresponding situation in a meta-analysis would be studies falling widely around the effect estimate in a forest plot, with a larger confidence interval as a result. If the studies don’t fall around a contour, and are scattered across the width of the plot, then there is likely no association. If the larger and smaller studies disagree, then there may be some form of bias in the studies (e.g. small-study bias).
When interpreting the plots, I tend to focus on how closely the studies fit around a particular contour, and state what that contour might be. I also tend to give a range of possible values for the contour, so if 90% of the studies fall between two contours I will mention them. It is incredibly important not to overstate the results – this is a helpful visual aid in interpreting the results of a systematic review, but it is not a meta-analysis, and you certainly can’t give an exact effect size and leave it at that. If an exact effect size is needed, then a meta-analysis is the only option.
The plots can be created in any statistical program (I imagine Excel could do it, I just wouldn’t want to try), and I wrote a page of instructions on how to create them that I think was cut from the journal paper at the last minute. I will dig it out and post it for anyone who might be interested. However, if you have Stata, you don’t need to create your own plots, because I spent several weeks writing code so you don’t have to.
Writing a statistical program was a great learning experience, which I'm fairly certain is what people say after doing something that was unexpectedly complicated and time-consuming. The actual code for producing an individual plot is trivial – it can be done in a few lines. But the code for making sure that every plot that could be made works and looks good is much more difficult. Trying to think of all the ways people using the program could make it crash took most of the time, and required a lot of fail-safes. I'll do a future post specifically about the code; for now I'll say that, in general, writing a program is not too different from writing ordinary code, but if it involves graphics it might take much longer than expected.
If you want to try out albatross plots (and I hope that you do), and you have Stata, you can install the program by typing: ssc install albatross. The help file is very detailed, and there’s an extra help file containing lots of examples, so hopefully everything will be clear. If not, let me know. I’ll do a video at some point showing how to create plots in Stata, for those that prefer a demonstration to reading through help files.
The program has a feature that isn't discussed in the paper, as I still need to prove that it works correctly in all situations. I'm pretty sure it does, and I have used it in all the papers that feature albatross plots. The feature is called adjust, and it does just what it says: it adjusts the number of participants in a study to the effective sample size. This is the number of participants that would have been required to achieve the same statistical power if something had been different – here, if the ratio of group sizes had been 1 (when in fact it was 3.2, or anything else). The point is to make the results from studies with different ratios comparable: statistical power is highest when the group ratio is 1, so studies with high group ratios will look like they have smaller effect sizes than they really do, just because they have less power. I will write another post about adjust when I can, as it is a very useful option that definitely improves the interpretability of the plots.
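I can't show the program's internals here, but a standard formula for the effective sample size – and my assumption of what adjust computes, not something the paper confirms – is the total size of a balanced study whose comparison has the same precision:

```r
# Effective sample size of a study with group sizes n1 and n2: the total
# size of a 1:1 study with the same precision. The variance of a two-group
# comparison is proportional to 1/n1 + 1/n2, and a balanced study of total
# size N gives 4/N; setting them equal gives the formula below.
# Note: this is my assumption of what adjust does, not confirmed.
effective_n <- function(n1, n2) 4 * n1 * n2 / (n1 + n2)

effective_n(100, 100)  # balanced: 200, unchanged
effective_n(50, 160)   # 3.2:1 ratio: ~152 effective, despite 210 recruited
```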
So far, I have identified two main uses of the plots. The first use was shown in the paper reviewing whether , and is when a meta-analysis just isn’t possible given the data that are available. The World Health Organisation also used the plots this way in a systematic review of , where again, meta-analysis was not possible.
The second use of the plots is to compare the studies that were included in a meta-analysis with those that couldn't be included because they lacked data. This allows you to determine whether the studies that couldn't be included would have changed the results if they had been. I have yet to publish my results, but I used the plots in my thesis to compare the results of studies that were included in a meta-analysis looking at the association between body-mass index and prostate-specific antigen with those that couldn't be included. We found three out of the four studies were consistent, but one wasn't (I mean, it isn't as if it's completely on the other side of the plot, but it certainly isn't as close to the other studies as I'd like). The study that wasn't consistent was conducted in a different population to all the other studies, so we made sure to note there were limits on the generalisability of our results. The effect contours are for standardised beta coefficients again. The meta-analysis of the included studies gave a result of a 5.16% decrease in PSA for every 5 kg/m² increase in BMI. This is roughly equivalent to a standardised beta of -0.05, which is good because that's what I'd say is the magnitude of effect in the albatross plot below.
I think albatross plots are pretty useful, and made my life easier when otherwise I would have had to write a long narrative synthesis. By putting the code online, we made sure that people who wanted to use the plots could. By providing a tonne of examples, hopefully people will know how to use the plots too.
Incidentally, the albatross in the name refers to the fact that the contours looked like wings. Rejected names include swan plot, pelican plot and pigeon plot. Oh, and if any of you are thinking albatrosses are bad luck, they weren't until a captain decided to . A lesson for us all.
The series started in 2009 and is written with a medical audience in mind, particularly physicians who need to read and evaluate medical literature. It runs through (unsurprisingly) how to evaluate publications, but in doing so gives quite a lot of useful information about medical research in general. It is published by , a weekly German-language medical magazine, although the series itself has been translated into English. You certainly don't need to read all of the articles, and they are fairly easy to read, so feel free to dive into any of them. The articles were written by experts with reference to textbooks, academic articles and their own experiences.
I’ll write about each of the articles briefly, say who I think the articles will be most useful for, and link the PDFs (open-access, so anyone can view them). In a later post, I’ll write about medical research from a more practical perspective, rather than the perspective of a physician needing to understand an academic paper. Fair warning, I haven’t read through the entirety of all the papers, and will update this post when I have if anything needs to be changed.
This article describes the structure of scientific publications (introduction, methods, results, discussion, conclusion, acknowledgements, references), and points out some key things to look for to check the paper is trustworthy. It contains a checklist for evaluating the quality of publications, although many of the criteria are study-specific and don't apply to all studies – there are better tools for assessing risk of bias (which is essentially what they are getting at), which I'll write about later.
Definitely worth reading if you are just starting to read medical research.
This article describes six aspects of study design, which are important to consider both when reading articles and before conducting any studies.
Worth reading if you are just starting to read or conduct medical research.
This article classifies primary medical research into three distinct categories: basic research; clinical research; and epidemiological research. For each category, the article “describes the conception, implementation, advantages, disadvantages and possibilities of using the different study types”, with examples. The three categories include many subcategories, presented in a nice diagram.
Worth reading for background information on study types; I will cover something similar in a future post.
This article describes P values and confidence intervals, pretty essential concepts that all medical researchers should understand. Helpfully, the article describes the pitfalls of P values, such as dichotomising the result of a statistical test, and the difference between statistically significant (a phrase banned in my research department) and clinically significant (much more useful).
Very useful reading for anyone who wants to understand more about P values and confidence intervals. I will also do a post about P values in the future, because it is such an important topic.
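As a quick illustration of why the confidence interval deserves as much attention as the P value, here is a one-test sketch in R using its built-in sleep dataset (my example, nothing to do with the article itself):

```r
# Difference in extra sleep between two drugs, in R's built-in sleep data
result <- t.test(extra ~ group, data = sleep)
result$p.value   # ~0.079: not "statistically significant" at 0.05...
result$conf.int  # ...but the 95% CI (about -3.4 to 0.2) shows the range of
                 # plausible differences, which is far more informative
```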
This article is focused on laboratory tests, but gives a description of sensitivity, specificity and positive predictive value, all concepts that are used when designing and interpreting tests (e.g. blood tests predicting disease).
This is useful reading for health practitioners (especially those ordering and interpreting tests), but is pretty specialist information. That said, knowing about sensitivity and specificity, or at least knowing that medical tests aren't generally foolproof, would benefit many people.
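The calculations themselves are simple once you have the counts; here is a sketch in R with made-up numbers, since seeing all three side by side makes the definitions stick:

```r
# Made-up results of a diagnostic test against the true disease status
tp <- 90;  fn <- 10   # people with the disease: test positive / negative
fp <- 45;  tn <- 855  # people without it:       test positive / negative

tp / (tp + fn)  # sensitivity: 0.90, proportion of diseased people detected
tn / (tn + fp)  # specificity: 0.95, proportion of healthy people cleared
tp / (tp + fp)  # positive predictive value: 0.67, chance a positive is real
```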
This article details methods of evidence synthesis (i.e. taking previous study results and combining them for a more complete understanding of a research question), which is one of my areas of expertise.
I will write more in-depth posts about systematic reviews and meta-analyses, but this is useful as an overview.
— The articles become more statistical and specialist from this point on —
Descriptive statistics are used to describe the participants of a study (e.g. their mean age or weight) or any correlations between variables (e.g. as age rises, weight tends to increase [up to a point]). The article describes the difference between continuous variables (called metric in the article – variables measured on a scale, such as height or weight) and categorical variables (variables that can take one of a set number of values, such as gender or nationality), and provides examples of different types of graphs that can be used to display the statistics.
Useful for anyone who has limited experience with statistics, or who wants to know more about common graphs (e.g. scatter graphs, histograms).
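For a flavour of what this looks like in practice, here is a short R sketch using built-in datasets (my examples, not the article's):

```r
summary(women$height)             # mean, median and quartiles of a metric variable
table(mtcars$cyl)                 # counts of a categorical variable
cor(women$height, women$weight)   # correlation between two metric variables

hist(mtcars$mpg)                  # histogram of one continuous variable
plot(women$height, women$weight)  # scatter graph of two continuous variables
```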
The articles start to get more niche from here on out. Observational studies are those where participants are not experimented upon in any way, just observed. This has implications for the results of such studies: if you are comparing two groups (e.g. smokers and non-smokers), then there is generally no way to tell exactly why there are differences between the groups (e.g. it could be smoking, but maybe the differences are because smokers are older, or more likely to drink alcohol, or eat less). This article discusses the problems with observational studies, and some ways to combat these problems.
Useful for those interested in observational studies, but definitely getting more niche now.
A 2 x 2 (read: 2 by 2) table shows the results of a study where the exposure and outcome were both binary. Generally, the rows of the table are exposed or not (e.g. to a new drug that prevents migraines), and the columns are diseased or not (e.g. got a migraine). The number of participants is shown in each cell, for example, the number of participants who received the new drug AND got a migraine might be in the upper left cell, the number who received the drug AND did NOT get a migraine in the upper right cell, the number who did not receive the drug AND got a migraine in the bottom left cell, and finally the number who did not receive the drug AND did NOT get a migraine in the bottom right cell. The reason 2 x 2 tables are useful is because writing that all out is a nightmare. This article describes the tables and the statistics that are often performed with the tables to judge whether being exposed raises, lowers, or doesn’t change the chance of the outcome happening. It also discusses some problems with the tables and their interpretation.
Useful reading for interpreting 2 x 2 tables, but otherwise unlikely to be useful.
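If you have the four cell counts, the usual statistics fall straight out; a minimal R sketch using made-up migraine numbers:

```r
# Rows: received the new drug or not; columns: got a migraine or not
tab <- matrix(c(20, 80,    # drug:    20 migraines, 80 without
                40, 60),   # no drug: 40 migraines, 60 without
              nrow = 2, byrow = TRUE,
              dimnames = list(c("drug", "no drug"),
                              c("migraine", "no migraine")))

(20 / 80) / (40 / 60)  # odds ratio: 0.375, i.e. the drug lowers the odds
fisher.test(tab)       # is the association more than chance?
```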
If a study conducts multiple statistical tests, then the paper will likely contain many P values. This is good, in that many tests means many pieces of information, but bad because conducting many tests raises the risk that there will be a "significant" result by chance alone. This article discusses this problem and ways to combat it. However, it is worth remembering that there is no way to tell whether a study conducted 100 statistical tests and just chose to present the "significant" ones.
Useful for those with an interest in the problem of multiple tests, but this is a problem that is unlikely to affect too many people (with the larger problem that it is impossible to know what studies have actually done, since they are the ones reporting it).
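The standard corrections are built into R's p.adjust. This sketch runs 20 tests where nothing is really going on, so any raw "significant" result is a false positive:

```r
set.seed(42)
# 20 comparisons of two groups drawn from the same distribution
p_raw <- replicate(20, t.test(rnorm(30), rnorm(30))$p.value)

sum(p_raw < 0.05)                          # raw: often one or more by chance
sum(p.adjust(p_raw, "bonferroni") < 0.05)  # Bonferroni: strict, usually zero
sum(p.adjust(p_raw, "BH") < 0.05)          # Benjamini-Hochberg: less strict
```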
This article describes epidemiological studies and how they are analysed (if the title didn’t give it away). Epidemiological studies are those that seek to quantify the risks for diseases, how often a disease is diagnosed (incidence) or how many people at any one time have a disease (prevalence). The article helpfully lists how measures such as incidence and odds ratios are calculated, and gives examples of studies.
Very useful background into epidemiological studies, which make up a large proportion of medical research. The description of epidemiology measures is particularly useful.
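The measures themselves are simple ratios once you have the counts; a sketch in R with made-up numbers:

```r
new_cases <- 50; person_years <- 10000
new_cases / person_years    # incidence: 5 new diagnoses per 1,000 person-years

current_cases <- 300; population <- 10000
current_cases / population  # prevalence: 3% of people have the disease now

# Odds ratio for an exposure, from a case-control study:
(40 / 60) / (20 / 80)  # (cases exposed/unexposed) over (controls
                       # exposed/unexposed) = 2.67: exposure was more
                       # common among the cases
```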
P values are calculated differently depending on the data; this article describes the different methods and gives three tables, one detailing statistical tests (e.g. Fisher’s exact test, Student’s t-test) and two detailing which test is appropriate in different situations. This article does not deal with regression analyses (which is the vast bulk of my work), but given the ubiquity of P values in medical literature, it is definitely worth being familiar with the different tests that can be run.
Useful if you want to know more about P value calculations – the tables are particularly useful.
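In R the common tests are one-liners, which makes the article's tables easy to try out (toy data of my own, not the article's):

```r
x <- rnorm(30); y <- rnorm(30, mean = 0.5)
t.test(x, y)       # two roughly normal continuous groups: Student's t-test
wilcox.test(x, y)  # skewed continuous data: Wilcoxon rank-sum test instead

tab <- matrix(c(8, 2, 3, 7), nrow = 2)
fisher.test(tab)   # two binary variables with small counts: Fisher's exact
```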
The sample size calculation tells you that if you want to be able to find an effect size this large (say the group on a new drug gets 10% fewer infections compared to the old drug), you need this many participants. It’s pretty important if you conduct primary research – most funders won’t like it if you say “I’ll recruit as many people as I can”, but will like it if you say “We need to recruit 628 participants to be confident (i.e. 90% sure) that we will see a risk ratio of 0.9, and here are the calculations to prove it.”
It probably isn’t necessary for those not planning to conduct primary research to know how to calculate the sample size, but an appreciation of why power calculations are important is good.
Linear regression is an analysis of the association between two continuous variables (e.g. height and weight), with the option to account for multiple confounders (variables that might be associated with both the exposure and the outcome). This article describes linear regression and other regression models, discusses some of the factors to consider when using the models, and gives several examples.
A useful article for anyone needing to conduct or interpret a regression analysis.
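A minimal R example using the built-in trees dataset, with one extra variable included the way a confounder would be:

```r
# Does girth predict timber volume, after accounting for height?
model <- lm(Volume ~ Girth + Height, data = trees)
summary(model)   # coefficients and P values
confint(model)   # confidence intervals for each coefficient
```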
Survival analysis is as it sounds: an analysis of how long participants survive (although note that "survival" could be how long a person goes without contracting a disease or receiving a test, not just how long until they die), and of any factors that are associated with survival time. This article describes survival analysis and points out some things to consider when conducting or interpreting survival analyses.
Useful if you want to know about survival analysis.
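The survival package (a recommended package, bundled with standard R installs) keeps the basic analyses short; a sketch using its built-in lung cancer dataset:

```r
library(survival)

# Kaplan-Meier curves: time to death (possibly censored), split by sex
fit <- survfit(Surv(time, status) ~ sex, data = lung)
plot(fit, xlab = "Days", ylab = "Proportion surviving")

# Cox regression: which factors are associated with survival time?
coxph(Surv(time, status) ~ age + sex, data = lung)
```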
A concordance analysis assesses the degree to which two measuring or rating techniques agree, for instance the agreement between two tests for the volume of a tumour. Often, the gold-standard test (the best test at the time) will be compared with a cheaper, less intrusive or newer alternative test. This article describes concordance analyses and points out some things to consider when conducting or interpreting concordance analyses.
Useful if you want to know about concordance analysis.
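One common way to look at agreement is a Bland-Altman analysis, which takes only a few lines of base R (made-up tumour volumes):

```r
gold <- c(10.2, 15.1, 8.7, 20.3, 12.5, 18.0, 9.9, 14.2)   # gold standard, ml
new  <- c(10.8, 14.6, 9.1, 21.0, 12.1, 18.9, 10.4, 13.8)  # new test, ml

d <- new - gold
mean(d)                           # average bias between the two methods
mean(d) + c(-1.96, 1.96) * sd(d)  # 95% limits of agreement
plot((gold + new) / 2, d)         # the classic Bland-Altman plot
```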
Randomised controlled trials (often abbreviated to RCTs, and yes, I’m using the British spelling) are an incredibly useful method of determining which of two or more treatments is better. The idea is that patients are randomised to a treatment, with the hope that there will be no baseline differences overall between the participants in any treatment group (so ages, weights, ethnicities etc. will be equal between groups). Therefore, any difference in the outcome (e.g. developing the disease, death, recovery time) will be entirely due to the difference in treatments. However, RCTs need to be well-conducted, as even the slightest amount of bias can render the study meaningless. For instance, if the outcome is the amount of pain, then if a study participant believes they are on a less effective drug, they might experience more pain than if they were on a drug they believed to be more effective (placebo effect, and the reason branded anti-pain meds [analgesics] are in nicer boxes and cost 10 times as much, even though they are the same as the unbranded meds). This article describes RCTs, and provides a helpful table of how an RCT should be reported.
RCTs are pretty important for medical research, and it is definitely worth reading this article to be certain you can identify why some medical research is considered brilliant and unbiased, and some is considered useless.
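The randomisation itself is trivially easy in code, which is rather pleasing given how much heavy lifting it does; a tiny sketch in R:

```r
set.seed(2017)  # a real trial would conceal the allocation, not publish a seed
allocation <- sample(rep(c("new drug", "old drug"), each = 50))
table(allocation)  # exactly 50 per group, assigned in a random order
```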
— The articles become much more specialist from this point on —
A crossover trial is one where patients are randomised to different, consecutive treatments, so each patient takes all the different treatments at different times. This means each patient serves as their own control – the response to the treatment of interest can be compared with the response to other treatments (such as placebo pills or the gold-standard treatment) – making problems such as confounding less of an issue. Crossover trials are likely only useful for chronic conditions (e.g. pain), as an acute condition may improve before the next treatment can be started. This article describes crossover trials and details the statistical methods needed to appropriately analyse the results.
This article is only worth reading if you need to know about crossover trials.
Screening is testing patients without any symptoms of a disease to see whether they are likely to have said disease. Screening programmes (e.g. for breast and bowel cancer) run regularly in the UK, with the aim of finding cancers while they are still curable. Screening in general has many established problems, for example finding cancers that would never have caused harm (overdiagnosis), leading to unnecessary treatment (overtreatment). This article discusses screening and its potential problems, using breast cancer screening as an example.
Knowing about screening is important as most people in developed countries will be offered a screening test in later life, so this article may well be worth a read.
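One of the problems is easy to show with arithmetic: when a disease is rare, even a good test produces mostly false alarms. A sketch in R with purely illustrative numbers:

```r
sensitivity <- 0.90; specificity <- 0.95
prevalence  <- 0.005  # 1 person in 200 actually has the disease

ppv <- (sensitivity * prevalence) /
  (sensitivity * prevalence + (1 - specificity) * (1 - prevalence))
ppv  # ~0.08: more than 9 in 10 positive screening results are false alarms
```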
The purpose of RCTs is often to find out whether a new treatment is better than an old one. The purpose of equivalence or non-inferiority studies is to determine whether a new treatment (usually one that is cheaper or has fewer side-effects) is at least as good (equivalence) or not much worse (non-inferiority) than the old one. It's almost impossible to prove that two treatments are exactly the same in terms of their outcome, since effect estimates are never absolutely precise (there is always allowance for random error), so the statistics involved in equivalence/non-inferiority trials are different from those of standard RCTs. This article discusses these trials.
Pretty specific paper only useful for those involved with equivalence or non-inferiority trials.
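The flavour of the statistics: instead of asking whether a confidence interval excludes zero, you ask whether it stays inside a pre-specified margin. A simplified sketch in R with made-up numbers (real non-inferiority trials specify the margin and analysis in advance, and often use one-sided intervals):

```r
margin <- 0.05  # we accept the new drug being at most 5 points worse

# 30/200 infections on the new drug versus 38/200 on the old drug
test <- prop.test(c(30, 38), c(200, 200))
test$conf.int  # CI for the risk difference (new minus old); here the upper
               # bound is about 0.04, below the margin: non-inferior
```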
Big data is defined in this article as a dataset "on the order of a magnitude of a terabyte", which is indeed big. This article discusses big data and the procedures used to analyse it (such as machine learning).
Probably a useful article for many to read, as big data is fast becoming used in everything from genetics to business.
An indirect comparison is one where instead of comparing two treatments against each other (A versus B), you use two comparisons with a common comparator treatment (A versus C, and B versus C) to infer the difference between two treatments. This can be fairly intuitive: if A is better than C, and C is better than B, then A must be better than B (and how much better can be calculated statistically). A network meta-analysis is a meta-analysis that includes indirect comparisons, so not only are studies comparing A and B included, but studies that compare A with C, and B with C too. The aim is to use as much information as possible to arrive at the most informed answer. This article discusses indirect comparison and network meta-analyses, clarifies the assumptions that must be made, and provides a helpful checklist for evaluation of network meta-analyses.
In my experience, indirect comparisons are not commonly used outside of network meta-analyses, and this article is therefore only really useful for those interested in network meta-analyses.
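The simplest version (the Bucher method) is just arithmetic on the log odds ratio scale; a sketch in R with made-up trial results:

```r
log_or_ac <- log(0.70); se_ac <- 0.15  # A vs C: OR 0.70 from one trial
log_or_bc <- log(0.90); se_bc <- 0.12  # B vs C: OR 0.90 from another

log_or_ab <- log_or_ac - log_or_bc  # A vs B via the common comparator C
se_ab <- sqrt(se_ac^2 + se_bc^2)    # the uncertainties add, so CIs are wider

exp(log_or_ab)                           # indirect OR for A vs B: ~0.78
exp(log_or_ab + c(-1.96, 1.96) * se_ab)  # and its 95% confidence interval
```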
When analysing the results of a non-randomised study (e.g. an observational study), confounding is always an issue. One way to deal with this is to include measured variables as covariates in a regression model, accounting for differences between groups (e.g. in age, height, gender). Propensity scores are an alternative method to using covariates, where the probability of an individual receiving a treatment is calculated using the observed variables, and this is used to account for any differences between groups instead of covariates in a regression model. This article introduces propensity scores, describes four methods of using them and compares propensity scores and regression models.
In my experience, regression models with covariates are an overwhelmingly more common method of accounting for confounding, so this article may be useful when coming across studies using propensity scores, or if you are considering using them, but otherwise may be a bit specialist.
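A minimal sketch of one of the four methods (inverse probability weighting) in base R, using simulated data so the true treatment effect is known in advance:

```r
set.seed(1)
n <- 500
age     <- rnorm(n, 60, 10)
treated <- rbinom(n, 1, plogis((age - 60) / 10))      # older people treated more
outcome <- 2 + 0.05 * age + 1.0 * treated + rnorm(n)  # true effect = 1

ps <- glm(treated ~ age, family = binomial)$fitted.values  # propensity score
w  <- ifelse(treated == 1, 1 / ps, 1 / (1 - ps))           # IPW weights

coef(lm(outcome ~ treated, weights = w))["treated"]  # close to the true 1,
                                                     # despite confounding by age
```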
RCTs are brilliant, but sometimes the methods have to be adapted to face a particular challenge. This article details some of the variations of RCTs (and provides a helpful table), useful in specific circumstances.
This final article in the series (as of October 2017) is very specific to RCTs, and won’t be useful for anyone not interested in specialist RCT methods.
Stata and R are both statistical packages. You feed in data, and then usually write code to analyse the data. Stata and R both have great facilities for cleaning data, running most statistical tests, producing graphs and tables, and these days exporting results straight to Word, PDF, LaTeX or Excel.
StataCorp LLC created Stata, and calls it:
an integrated statistics, graphics, and data management solution for anyone who analyzes data
Bell Laboratories created the S language on which R is based; R itself is an open-source implementation of S, and the R Project describes it as:
an integrated suite of software facilities for data manipulation, calculation and graphical display
So pretty similar then.
One major difference between the two packages is that StataCorp charge people to use Stata, whereas R is completely free. Both packages allow anyone who uses them to create and distribute statistical programs; indeed, often the programs I run most frequently are those written by people using the packages, not people who created the packages.
Stata and R are both great packages to manipulate and analyse data. They both allow the user to write code, and then use this code to do everything one could want to the data. Why is code so great? From my experience: code makes an analysis reproducible – you (or anyone else) can rerun exactly the same steps later and get exactly the same answer – and it records every decision you made along the way, which pointing and clicking never does.
What can you use them for? Basically anything statistical. Analyse and manipulate data however you like, produce tables and graphs – anything where you have some numbers (or letters or words) and you want to do something with them.
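To give a flavour, here is a tiny sketch in R using its built-in mtcars dataset – filter, summarise, model and graph in four lines (the sort of thing either package does comfortably):

```r
heavy <- mtcars[mtcars$wt > 3, ]           # filter the data
aggregate(mpg ~ cyl, data = heavy, mean)   # a table of group means
summary(lm(mpg ~ wt + hp, data = mtcars))  # a regression model
plot(mtcars$wt, mtcars$mpg)                # and a graph
```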
Which is better? I think that's a personal question. Apart from the fact that Stata costs money and R does not, there aren't too many differences. Stata is more intuitive for people who have used spreadsheets, since at any time you can click a button and load up a view of your data in a spreadsheet, but R allows you to do more at once, as many datasets can be loaded at the same time. For some purposes Stata is faster, for others R is faster, both in terms of how much code is needed and how fast it runs.
Both packages have a good community of users who develop programs within each, so whether Stata or R is better may depend more on the purpose you are using a stats package for, rather than a blanket “one is better than the other”.
One point that sometimes comes up is that Stata limits the number of variables (columns) allowed in any one dataset, whereas R is limited only by your computer. The newest iteration of Stata (version 15, out June 2017) has three versions, ranging from cheapest to most expensive. The cheapest version, Stata/IC, has a maximum of 2,048 variables and 2.14 billion observations (rows). The most expensive version, Stata/MP, has a maximum of 120,000 variables and 20 billion observations. For many non-genetic purposes, Stata is absolutely fine. For genetic purposes, where there can legitimately be tens of thousands of variables, R is generally considered better, which is why genetics research seems to favour R.
R is free, and can be downloaded from their .
Stata costs a varying amount depending on whether you are a student, business or institution, whether you want an annual or perpetual licence, and which version of Stata you want. For a student, Stata costs $198 for Stata/IC (entry-level), $395 for Stata/SE (mid-level), $695 for Stata/MP 2-core and $995 for Stata/MP 4-core (top-level), all in US dollars and all perpetual licences. For a single university staff member, those prices jump to $595 to $1,495 for an individual licence. For a single government, non-profit or business staff member, the prices jump again to $1,195 to $2,295. Annual licences that require renewing every year are half the price of the perpetual licences. Stata/MP can also be purchased with more cores, so will work faster on supercomputers. For the UK, Stata needs to be purchased from .
Why only Stata and R? I have been a researcher for about 5 years now, and have only used Stata and R. I am aware that in other universities (even different schools within my university), they use , , or other software. I haven't used these programs and know nothing about them, so I'll stick to Stata and R. If I need to learn a new program at any time, I'll write posts comparing the new and old programs.
I’ll say that even though Excel is extremely limited for data analysis (in terms of difficulty, replicability, consistency etc.), I still use it frequently for certain tasks. For instance, if you want a table in Stata, it’s often handy to export it to Excel before putting it in Word.
Am I a Frequentist or a Bayesian? What an oddly specific question I have just asked myself. Statistical methods tend to fall into two camps, Frequentist or Bayesian, and I will cover the differences in a future post. For now, I'll say that in the absence of a sensible prior you may as well be a Frequentist – but otherwise, why wouldn't you use the other information you have? That will hopefully make way more sense after I write the Frequentist/Bayesian post. XKCD has a about the difference if you're particularly interested now.
Why medical research? Because this is my field. I have no experience of evidence synthesis in other disciplines, although I would imagine that medical research has one of the greatest demands for evidence synthesis, given the number of medical studies performed daily around the world, often looking at very similar things.
I would be interested to know if other disciplines perform evidence synthesis in different ways, so please let me know if you work in a different field and think I should be using a different method to do something.
Will I share data? Maybe.
I will never share any data that is from actual studies, but I am all for open sharing of all data and I will share what I can. The results of systematic reviews and meta-analyses are fair game, as is any code I write, simulations I run or methods I develop.
What if I've made a mistake? Please let me know, preferably as soon as possible so I can correct it quickly – but even if I made the mistake years ago, I'd still like to know.
Correcting academic papers can take an eternity, can lead to the wrong conclusions being drawn (although this is rare), and can also be intensely embarrassing. Making a mistake before publication is fine so long as it is spotted.
I have no problem being wrong. Being corrected is the best way to learn about my mistakes, and while I would prefer to be right, given the option, I’d much rather be wrong and know about it than be wrong and be ignorant of it.
Is “data” singular or plural? It depends.
Datum is a singular data point, but if you are talking about data in the sense of a collection of results or observations, then it could be either. “These data show that…” and “the data shows that” are both acceptable.
I personally prefer data as a singular though. “These data” sounds odd to me.