The Causal Effects of Health Conditions and Risk Factors on Social and Socioeconomic Outcomes: Mendelian Randomization in UK Biobank

In this post, I’m going to talk about research I’ve been working on for the past year and a half. I’ll try to make it as non-technical as possible, and explain the concepts without assuming any knowledge of genetics, stats or epidemiology. I’ll include some pictures to help explain things, because I like pictures. We’ll see how we go.

The paper is now out on MedRxiv, which is a preprint server, where we send academic papers before they’ve been peer-reviewed. If you want to read it all in detail, then here’s the place to do so.


The Research Question

This study was funded by the Health Foundation to look at the “social and economic consequences of health: causal inference methods and longitudinal, intergenerational data”.

That’s the fancy way of saying “how does health affect social and economic outcomes?”

The “causal inference” part of the project is key. We already know poor health is associated with adverse social (e.g. wellbeing, social contact) and socioeconomic (e.g. educational attainment, income, employment) outcomes. What we don’t know is how much those outcomes cause poor health (i.e. reverse causation), how much other factors affect both health and those outcomes (i.e. confounding), and how much poor health directly causes those outcomes (i.e. what we want to know).

I’ve made a (hopefully) helpful picture to show the difference between those 3 things.


That’s the general theme of what we’re doing – how does (poor) health affect social and economic outcomes. To tackle that, in this paper we’ve looked at a variety of health conditions and risk factors for poor health, and attempted to estimate by how much they affect lots of different social and economic outcomes.


The method we used

The method we’ve used is called Mendelian randomization, and it exploits the fact that at conception, we are all randomly assigned genetic variants, some of which predispose us to certain health conditions or risk factors. A genetic variant is a change in someone’s genetic code – we commonly use single nucleotide polymorphisms (commonly referred to as SNPs), which are single base pair changes in the genetic code.

Most genetic variants have a negligible effect on our day to day lives, and the environment, personal choice and luck matter much, much more. However, across an entire population, we can use these small effects to look at the effects of the health conditions and risk factors they predispose us to.

For example, there are plenty of genetic variants that affect body mass index (BMI), a measure of obesity (your weight in kilograms divided by height in metres squared). We can put all those variants together in a “genetic risk score” to get a single number that represents a person’s genetic predisposition towards having a higher or lower BMI. People with a higher genetic risk score for BMI will, on average, have a higher BMI than people with a lower genetic risk score for BMI. Having a higher genetic risk score doesn’t guarantee a person will have a higher BMI than someone with a lower score though: everything else matters too.
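For the curious, here’s a rough sketch of how a genetic risk score could be put together. I’m assuming a weighted sum, where each variant counts for more if it has a bigger effect on BMI. The variant names, effect sizes and people below are all made up for illustration; they aren’t from any real GWAS or from our paper.

```python
# Hypothetical effect sizes: how much each copy of a BMI-raising variant
# shifts BMI. These numbers are invented purely for illustration.
EFFECT_SIZES = {"snp_a": 0.10, "snp_b": 0.05, "snp_c": 0.02}


def genetic_risk_score(allele_counts, effect_sizes=EFFECT_SIZES):
    """Weighted sum of allele counts (0, 1 or 2 copies per variant)."""
    return sum(effect_sizes[snp] * allele_counts[snp] for snp in effect_sizes)


# Two made-up people with different combinations of variants.
person_1 = {"snp_a": 2, "snp_b": 1, "snp_c": 0}
person_2 = {"snp_a": 0, "snp_b": 1, "snp_c": 2}

score_1 = genetic_risk_score(person_1)  # 2*0.10 + 1*0.05 + 0*0.02
score_2 = genetic_risk_score(person_2)  # 0*0.10 + 1*0.05 + 2*0.02

print(f"Person 1: {score_1:.2f}, person 2: {score_2:.2f}")
```

Person 1 ends up with the higher score, so on average we’d expect people like person 1 to have a slightly higher BMI than people like person 2, and nothing more than that.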

The major benefit of Mendelian randomization is that, if a few assumptions hold, we end up estimating the causal effect of an exposure (poor health) on an outcome (social/economic outcomes). Confounding isn’t a problem, because your genetic variants are fixed at conception, i.e. your genetic code doesn’t change from the first day you existed. There’s no chance of reverse causation either for the same reason. In the above picture, we’re estimating just the blue line with Mendelian randomization.
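If you like seeing this sort of thing in code, here’s a toy simulation of the idea. Everything in it is an assumption made for illustration: the data are simulated, the effect sizes are invented, and the estimator is the simple ratio of two covariances (one common way of doing Mendelian randomization with a single risk score, not necessarily what we did in the paper). The true effect of the exposure on the outcome is set to -0.5, and a hidden confounder pushes both the exposure and the outcome up.

```python
import random
import statistics


def covariance(xs, ys):
    """Sample covariance of two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def simulate(n=20000, true_effect=-0.5, seed=1):
    """Return (direct estimate, Mendelian randomization estimate)."""
    rng = random.Random(seed)
    score, exposure, outcome = [], [], []
    for _ in range(n):
        g = rng.gauss(0, 1)  # genetic risk score, fixed at conception
        u = rng.gauss(0, 1)  # hidden confounder we never get to measure
        x = 0.5 * g + u + rng.gauss(0, 1)              # exposure (e.g. BMI)
        y = true_effect * x + 2 * u + rng.gauss(0, 1)  # outcome (e.g. income)
        score.append(g)
        exposure.append(x)
        outcome.append(y)

    # Direct association: slope from regressing the outcome on the exposure.
    direct = covariance(exposure, outcome) / covariance(exposure, exposure)
    # Mendelian randomization: score-outcome association divided by the
    # score-exposure association.
    mr = covariance(score, outcome) / covariance(score, exposure)
    return direct, mr


direct_estimate, mr_estimate = simulate()
print(f"True effect: -0.5, direct: {direct_estimate:.2f}, MR: {mr_estimate:.2f}")
```

In this made-up world the direct estimate comes out with the wrong sign, because the confounder swamps the real effect, while the Mendelian randomization estimate lands close to -0.5. That only works because the simulated score satisfies the assumptions by construction.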

The assumptions are pretty big though, and should be remembered (and tested, if possible):

  • Firstly, the genetic variants have to be associated with the exposure (e.g. you need variants associated with BMI if you want BMI to be your exposure). This assumption isn’t actually all that big – we choose the variants we use because they are associated with the exposure.
  • Secondly, there can’t be some mechanism by which any other factor affects the genetic variants AND the outcome of interest. This assumption is larger, and can be violated if there is some hidden structure in a population (i.e. if people don’t have children randomly). For instance, if people with a large BMI only had children with other people with large BMIs, then over time other factors associated with BMI (e.g. height) would become associated with genetic variants for BMI, and that would be a problem.
  • Thirdly, the genetic variants associated with the exposure have to affect the outcome ONLY through the exposure. It’s no good using genetic variants that affect BMI to look at social and economic outcomes if the same genetic variants ALSO affect someone’s propensity to smoke. You’d get the effect of BMI on the outcomes, along with the effect of smoking on the outcomes, which would be a big problem.

We can and do test and account for these assumptions to minimise the risk of getting the wrong answer. But, as ever, we can never be certain whether we have the “truth”. There are other reasons why any specific Mendelian randomization analysis might not give us the “truth” too, the same as any other study.

Causal inference is hard.

Here’s a good (somewhat technical) guide to reading Mendelian randomization studies, in case you’re interested.


The reason for this study

I find it difficult to give good reasons why I look at what I do.

I (personally) look at things because I’m curious, I like doing it, and/or I was told to do it by people that pay me. I work in a field that aims to improve health in general, so often the things I’m curious about, like doing, and am told to do can lead to better health too, which is nice.

However, for this specific study, we have a nice few paragraphs in the paper motivating the study that are properly justifiable:

Poor health has the potential to affect an individual’s ability to engage with society. For example, illnesses or adverse health behaviours could influence the ability to attend and concentrate at school or work and hence affect educational attainment, employment, and income. Illness and health behaviours may also affect an individual’s ability to maintain wellbeing and an active social life. From an individual perspective, maintaining good health can therefore have considerable social and socioeconomic benefits. Similarly, from a population perspective, improving population health could lead to a happier and more productive population.

Understanding the causal impacts of health on social and socioeconomic outcomes can help demonstrate the potential broader benefits of investing in effective health policy, thereby strengthening the case for cross-governmental action to improve health and its wider determinants at the population-level. Furthermore, patients require accurate information about how their lives might be affected by their health, for example on returning to work after cancer. However, studying the social and socioeconomic consequences of ill health (‘social drift’) is challenging because of social causation, i.e. the strong role of social and socioeconomic circumstances in disease causation. Social causation means that associations between health and social and socioeconomic outcomes are likely to be severely biased by confounding and reverse causality. Methodological approaches strengthening causal inference in this field are therefore essential.

Long story short, we looked at the causal effect of health conditions and risk factors on social and economic outcomes because it could be useful to public health policy.

And a little bit because it’s interesting.

The people we studied

Mendelian randomization is pretty good at getting causal answers for difficult questions, but the price of that is the results are imprecise. Much more imprecise than if we looked directly at the associations between health conditions/risk factors and social/economic outcomes without bothering with the genetics.
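To give a feel for that imprecision, here’s a small sketch that runs the same made-up study 100 times and looks at how much the answers bounce around. The data are simulated and all the numbers (study size, effect sizes, strength of the risk score) are assumptions I picked for illustration, not values from the paper.

```python
import random
import statistics


def cov_sum(a, b):
    """Sum of cross-products around the means (unscaled covariance)."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))


def one_study(rng, n=1000, true_effect=-0.5):
    """Simulate one study; return (direct estimate, MR estimate)."""
    g = [rng.gauss(0, 1) for _ in range(n)]  # genetic risk score
    u = [rng.gauss(0, 1) for _ in range(n)]  # hidden confounder
    x = [0.3 * gi + ui + rng.gauss(0, 1) for gi, ui in zip(g, u)]
    y = [true_effect * xi + ui + rng.gauss(0, 1) for xi, ui in zip(x, u)]
    direct = cov_sum(x, y) / cov_sum(x, x)
    mr = cov_sum(g, y) / cov_sum(g, x)
    return direct, mr


rng = random.Random(7)
results = [one_study(rng) for _ in range(100)]

# How much each kind of estimate varies from study to study.
spread_direct = statistics.stdev([r[0] for r in results])
spread_mr = statistics.stdev([r[1] for r in results])

print(f"Spread of direct estimates: {spread_direct:.3f}")
print(f"Spread of MR estimates:     {spread_mr:.3f}")
```

The direct estimates are tightly bunched (around the wrong answer, since the confounder is still there), while the Mendelian randomization estimates are scattered much more widely around the right one. More people in each study shrinks that scatter.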

To compensate for this, we need a lot of people to study, all of whom need to have their genetic variants measured, as well as their health conditions, risk factors, social and economic outcomes. Fortunately for us, UK Biobank exists.

UK Biobank recruited just over 500,000 people between 2006 and 2010 from 22 centres across the UK. Participants could have been anyone between 40 and 70 years old who had registered with a GP, and they provided their medical history and socioeconomic information via questionnaires, interviews and some medical-type measurements (height, weight, blood pressure etc.). Medical data from hospital episode statistics (data from all hospital admissions) and the cancer registry have been linked to participants. As such, we have a good handle on who developed which health conditions at any point in time (certainly after the year 2000 and for many people before then – hospital data doesn’t go back in time forever), and know about risk factors and social and economic outcomes at one time point.

Because we are looking at genetics, we restricted people in UK Biobank to those that were unrelated (siblings and other related people can be used in genetic analyses, but not without accounting for their relatedness, and we didn’t do that), and to those that were of white British ancestry. This was to reduce the risk of bias from the second assumption about Mendelian randomization – that there should be no other factor that affects both the genetic variants and the outcome. Because different ethnic groups have different propensities towards different genetic variants, as well as large differences in social and economic outcomes, restricting to just one ethnically similar group reduces the risk of this bias. We made a few other restrictions that were based on technical things, like excluding those with excessive missing genetic data.

Because of that, we went from just over 500,000 participants to 337,009 participants.

Still plenty of participants, so some of our analyses gave really quite precise answers. Others gave results that spanned everything from an “inconceivably large negative effect to inconceivably large positive effect” though, so hey ho.

The health conditions and risk factors (exposures) we studied

We wanted to study as many health conditions and risk factors (I’ll call these exposures from here on out, as typing health conditions and risk factors is tiring) as we could, but obviously couldn’t study everything. We pragmatically selected exposures that:

  1. caused a reasonable degree of death or disability to people in the UK, as measured by the Global Burden of Disease study (GBD) in disability-adjusted life years (DALYs)
  2. had known genetic determinants, so things like falls were excluded as we don’t know of any genetic variants that just affect your propensity to fall
  3. at least 2% of people in UK Biobank had (a 2% prevalence), so we weren’t looking at super rare exposures where we would get really imprecise results

The exposures that fit these criteria are listed in the following flowchart.

Supplementary Figure 1

For health conditions, we looked through their self-reported conditions and their hospital episode data, and for risk factors we used what UK Biobank recorded at recruitment.

There is more information in the paper if you want to know the exact details.


The social and economic outcomes we studied

Pragmatism also featured in our choice of social and economic outcomes (just called outcomes from here on in) – we selected outcomes based on what was available in UK Biobank. One consequence of this is that we didn’t have a measurement of individual income, only household income (and even that was split into broad categories). Ah well.

We have a nice box in the paper describing all the outcomes, so I may as well add that here:

Box 2

The genetic variants we used

All the exposures needed genetic variants for Mendelian randomization to work. As such, we searched for genome-wide association studies (GWAS) that told us which genetic variants to use for each exposure. GWAS are big studies that look at particular exposures (BMI, asthma, cholesterol etc.) and see which of the hundreds of thousands or millions of genetic variants they measured are associated with the exposure. Some genetic variants have stronger evidence of an association than others, and we chose variants that had a lot of evidence for an association, so we could be sure of the first Mendelian randomization assumption (the genetic variants are associated with the exposure).

We then created genetic risk scores for each exposure, based on the genetic variants we found in the GWAS. For all exposures, the higher the risk score, the more likely it was the person had the health condition or the higher their risk factor. Some risk scores were much better at predicting the exposure than others – BMI was pretty good, osteoarthritis was awful.


The extra analyses we did

Because there’s lots of checking that needs to be done, we did plenty of extra analyses and wrote about them in reasonable detail in the paper. I won’t talk about them here, because it gets a bit too technical and far too dull, but they’re in the paper if you’re interested. Specifically, we were checking that the Mendelian randomization assumptions were alright, as well as a couple of other things.


The results we estimated

We estimated a lot of results. For every exposure, we estimated its effect on every outcome. Then because we did lots of extra analyses, we actually ended up with 13 more estimates of the same thing, plus a few extra analyses. We had 8 health conditions and 6 risk factors, so 14 exposures with 19 outcomes, multiplied by 14 estimates, which gives around 3,724 results in the paper.
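In case you want to check my counting, here’s the sum written out. The variable names are just mine; the numbers are the ones given above, and the “few extra analyses” aren’t included.

```python
health_conditions = 8
risk_factors = 6
exposures = health_conditions + risk_factors  # 14 exposures

outcomes = 19

# One main estimate plus 13 more estimates of the same thing.
estimates_per_pair = 1 + 13

total_results = exposures * outcomes * estimates_per_pair
print(total_results)
```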

While I would be very willing to list them all here, to save my and your sanity I’ll just present the main things that might be interesting.

I made a figure that represents the evidence we have for each exposure causing a change in each outcome. The stronger the evidence (i.e. the more precise the estimate and/or the larger the change), the darker each cell (the square representing the association between the corresponding exposure and outcome) is shaded. Each cell also has a plus or minus, depending on whether the association is positive (the exposure increases the outcome) or negative (the exposure decreases the outcome). Not all positives are good – increasing loneliness or deprivation are both bad. The stars represent associations for which we are pretty sure something is going on.

Figure 1

A quick glance at the picture is enough to tell me that smoking initiation (the best interpretation of which is probably having ever started smoking, rather than trying a few cigarettes) is bad for all the economic outcomes, except for being retired. BMI is also pretty bad for economic outcomes, and alcohol isn’t exactly great. For social outcomes, depression is bad.

Quick technical note about the legend

The legend shows the -log10 P value, which I can imagine means nothing to most people. The P value is the probability that, if we did an infinite number of future studies doing exactly the same things as in this study and if there were no true effect of the exposure on the outcome, we would see an estimate for the association that was as large or larger (I hate defining the P value, because apparently many textbooks give the wrong definition, and I’m now completely uncertain how to define it without using all the words). The smaller the P value, the more certain we are that there is a non-zero association between the exposure and the outcome (although by itself, it says nothing about how large that association could be).

The -log10 part is a conversion (i.e. the negative of the base 10 logarithm of the P value): a value of 0 would be a P value of 1 (i.e. there is no evidence of an association between the exposure and outcome), a value of 1 would be a P value of 0.1, 2 would be 0.01, 3 would be 0.001, and 20 would be 0.00000000000000000001 (i.e. we are very sure that there is an association between the exposure and the outcome). To go from the -log10 P value to the P value then, divide 1 by 10 to the power of the number (1/10^2 = 0.01).
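If it helps, the conversion in both directions is only a couple of lines of code. The function names here are my own, made up for this example.

```python
import math


def neg_log10(p_value):
    """Convert a P value into the -log10 scale used in the legend."""
    return -math.log10(p_value)


def p_from_neg_log10(value):
    """Convert a -log10 value back into a P value: 1 / 10^value."""
    return 10 ** (-value)


for p in (1.0, 0.1, 0.01, 0.001):
    print(p, "->", neg_log10(p))

print(20, "->", p_from_neg_log10(20))
```

So a cell with a legend value of 2 corresponds to a P value of 0.01, and every extra unit on the legend means a P value ten times smaller.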

In the paper, we list off important results for each exposure, including effect estimates and our confidence around them, but that feels like overkill for a blog post. Suffice to say, our results suggested that higher BMI, smoking and alcohol use adversely affect multiple socioeconomic outcomes. For example, smoking initiation was estimated to reduce household income, the chance of owning accommodation, being satisfied with health and of receiving a university degree, and increase deprivation.

Potential reasons for adverse effects of high BMI, alcohol use and smoking on social and economic outcomes include increased disease burden, social stigma (e.g. bias against obese people, smokers etc.), or behaviours which make employment, retention of employment, or social interaction challenging. Previous analyses of UK Biobank have shown evidence of effects of BMI on social and socioeconomic outcomes, which is nicely shown in this study too.

We also had evidence that asthma increased deprivation and decreased household income and the chance of having a university degree, and migraine reduced the chance of having a weekly leisure or social activity, especially in men.

We estimated too that depression reduced satisfaction with health, financial situation and family relationships, and reduced the chance of being happy and increased the chance of being lonely. These were all expected – depression reducing happiness isn’t exactly news (but it’s definitely what I tell people when they ask what I’ve been working on for so long) – but it’s still good to see things that are expected, since it is at least some evidence things are working as they should.

We didn’t see any evidence for an association between other exposures and outcomes. This doesn’t mean there aren’t any associations there, it just means we couldn’t detect them. In fact, for almost all health conditions, the estimates had really terrible precision – we just didn’t have enough people to study, especially those that had the health conditions we were interested in, and the genetic variants weren’t as good as they could have been for health conditions. As they say, it was an absence of evidence, not evidence of absence.


The many and varied limitations of this study

This study was not without a great many limitations. We managed 682 words on them. I think I’ll list them with bullet points, since I’m not allowed to do that in academic papers:

  • Mendelian randomization requires assumptions that we can’t prove to be true (but we tried to make sure they weren’t wrong)
  • The genetic risk scores measure a lifetime of risk to the exposure, not a risk at any particular time point. For things like BMI, this means that we’re looking at having a BMI that is consistently higher or lower than someone with a different genetic risk over a lifetime, not just having a higher BMI for a while. This means we don’t know if reducing BMI at 50 years old will have the same effect as reducing it at 20 years old. Although since we’re dealing with public health at a population level, we can be reasonably sure that reducing BMI across the whole population will probably be a good thing (on average).
  • UK Biobank is large, and could have been representative of the whole population (at least, representative of those that lived close enough to the recruitment centres), but only 5% of people invited to UK Biobank were recruited. As such, it isn’t representative of the population as a whole. Whether this affects individual results is impossible to tell, and work is ongoing to find out if we can account for this. Right now, we can say that the results are generalisable to people aged 40-70 years who, had they been invited to UK Biobank, would have accepted. We know people in UK Biobank are richer and healthier than the general population, so this could have affected our results.

The paper goes into each limitation in more detail, but I am reasonably confident that Mendelian randomization is one of the best, if not the best, method to tackle the question about how much poor health affects social and economic outcomes.

The only better method would be to randomise people to have health conditions or have different levels of risk factors, which would either be impossible or just incredibly unethical.


The conclusion we made

We concluded that the results of this study imply that higher BMI, smoking and alcohol consumption are likely detrimental to socioeconomic outcomes. While people are smoking less in the UK (good for public health), the average BMI has risen and is continuing to rise worldwide (less good for public health). From this study, we think that reducing average BMI levels, and further reducing smoking and alcohol intake, in addition to health benefits, may also improve socioeconomic outcomes for individuals and populations.

We could also conclude that Mendelian randomization is a good method, but we don’t have enough precision even with over 300,000 participants to be able to find any associations that aren’t massive. This will get better over time as we find more genetic variants associated with health conditions, but for now it remains a problem.

One could argue we already knew those things – who doesn’t know that smoking is bad? However, the benefit of this study isn’t necessarily knowing that some things are bad. The benefit is knowing that things are precisely, and causally, this bad. Which means public health policy can be better informed about the potential benefits of different policies affecting, and the likely consequences of changing levels of, different health conditions and risk factors.

Also, migraines evidently cause men to be less likely to have a weekly leisure or social activity. That’s something I didn’t know before.