Estimating an Odds Ratio from a GWAS only reporting the P value

This is a post about estimating the odds ratio (OR) and its associated standard error (SE) for the effect of a SNP on an outcome from a genome-wide association study (GWAS), when the GWAS only reports P values.

For this method to work, you’ll need the minor allele frequency (MAF), the effect direction, the total number of participants and the number of cases in the GWAS.

Why?

Sometimes GWAS don’t report the effect size of each SNP, just the P values. This is intensely irritating, since the effect sizes are known, the authors just chose not to report them. If that GWAS is the only one available for an outcome, then Mendelian Randomization is impossible unless we are willing to make a few assumptions and estimate the ORs ourselves.

What assumptions?

There are two main assumptions.

The first assumption is probably more important: that any covariables the GWAS included had only a limited impact on the OR and SE of the effect estimate for each SNP, i.e. it’s as if the GWAS was a logistic regression of the outcome on one SNP at a time.

It is impossible to know by how much any covariable (age, sex, principal components) affected a SNP effect estimate. From my limited knowledge and some real-world examples, I believe that adjusting for covariables will mean the ORs estimated from using this method will be more null than they should be. This seems to me to be the right direction for any bias, since you are less likely to encounter false positives in subsequent Mendelian randomization analyses. But I caution against relying on this – I don’t know definitively whether the bias will always be towards the null.

The second assumption is likely less important: that each SNP is in Hardy-Weinberg equilibrium. If you only have the MAF, then it is impossible to know the distribution of homozygous minor and major alleles, and so impossible to estimate the SE of the log-OR definitively. In practice, I think so long as the SNPs aren’t wildly out of Hardy-Weinberg equilibrium, this is a minor assumption that will just lead to small random error and can thus be largely ignored with enough SNPs.

There may be more assumptions, but I don’t think so.

How do you estimate the SNP effect size?

With difficulty.

The SE of a log-OR is hard to estimate, since it uses matrices, and matrices are hard.

I’ve put my working at the bottom of this post, since it’s lengthy, detailed, and I can’t imagine many people care all that much. But it’s there if anyone wants to check it.

In brief, the SE of a SNP in a logistic regression model can be expressed in terms of the odds having the outcome in the entire GWAS (using the number of participants in total and the number of cases), the number of people with each dosage of SNP (0, 1 and 2 copies of the effect allele, usually the minor allele), and the SNP effect size. The P value from that regression is simply a normal transformation of the SNP effect size divided by its SE (the Z score), so we end up with an equation with 1 unknown, the SNP effect size, which is what we want to estimate anyway.

Now, the equation is complicated. Writing it out in full is a mission. I copied across the code I used from Stata to Word, and it looked like this (x is the SNP effect size, y is the Z score):

01

But the headline is this: we have an equation, and we just need to solve it to find the SNP effect size, and then its SE. It’s not, so far as I can tell, a closed-form equation (x is both normal, square-rooted and exponentiated), and thus I don’t think it can be solved by putting the effect size on the left and working it all out. It’s also incredibly long, and so even if it had a closed-form equation, it would take me ages to work out what it was and/or work out how to use an online equation solver.

And so, I chose to solve the problem with trial and error; plug numbers in for the SNP effect size, then wait until it produces the observed Z score. With Stata, this is quite quick.

I’ve put the code at the bottom of this post, but the idea is that I simulate 1,000,000 values of the SNP effect size between 0 and 1 (for positive effect directions, as based on the Z score), or 0 and -1 (for negative effect directions). I work out the difference between the observed and estimated Z scores, given the MAF, number of participants and number of cases, and look for the minimum difference. If the minimum value is at 1 (or -1), I simulate another 1,000,000 values, this time between 1 and 2 (or -1 and -2), because the real SNP effect size is likely larger than 1 (or smaller than -1).

This seems to work well – if you plot out a graph of the difference between the observed and estimated Z scores (y) for different SNP effect sizes (x), it looks like this:

02

The difference between the estimated and observed Z scores decreases to 0, then increases again. This is good, since it means the right SNP effect size is most likely where the line hits the x-axis. This means that, graphically, we can get a good estimate without the need for simulating values. But it depends on how big you expect the effect size to be, since it turns out that there isn’t just one right answer:

03

It looks like the pattern is for there to be two values of the SNP effect size that give the right Z score, it’s just one is incomprehensively massive. For context, x is the log-OR: a log-OR of 11 is equivalent to an OR of 59,874. We can probably conclude that’s nonsense and take the log-OR of 0.1 as the right-ish answer (OR = 1.11). After getting to a SNP effect size of about 40, the graph disappears, probably because it can’t handle numbers bigger than e40.

How well does this work?

In simulations, great.

I simulated 1,000 datasets with one outcome and one SNP, each with between 100 and 100,000 participants. The SNP MAF was chosen from a random uniform distribution for each dataset, with the dosages set to Hardy-Weinberg equilibrium, so the right proportion of observations had dosages of 0, 1 and 2. The log-OR for the SNP on the outcome was chosen from a random normal distribution (mean 0, SD 1), with the log-odds of having the outcome for each observation equal to a normal distribution (mean 0, SD 1) plus the log-OR multiplied by the dosage. I converted the log-odds to a probability, then the observation with got the outcome of not based on another random uniform distribution.

I used logistic regression to estimate the SE of the SNP effect, recorded everything, ending up with a dataset of 1,000 observations, each with an MAF, number of participants, number of cases, Z score, log-OR and SE. I then predicted the log-OR and SE based on the first four of those variables, and compared them to the observed numbers.

To see whether the method worked, I just looked at the regression coefficient from a linear regression of the true log-OR with the estimated log-OR: a coefficient of 1 would be perfection, less than 1 would mean I was underestimating the log-OR, and above 1 would mean I was overestimating the log-OR. The coefficient from the regression of all 1,000 simulated SNPs was 1.05, with a R2 of 0.993, so a small underestimate. The median percentage difference between the true and estimated log-ORs was -0.02%, interquartile range of -0.52% to 0.56%, with 9.5% of estimates more than 5% out. When restricting to log-ORs between -1 and 1 (ORs of between 0.37 and 2.72, which would contain almost all SNP effect sizes from any GWAS), the regression coefficient was 1.01, with an R2 of 0.9999, and the median percentage difference between the true and estimated log-ORs was -0.05%, interquartile range of -0.42% to 0.34%, and 0.4% of estimates more than 5% out. That’s pretty good.

So, in simulations, the method looks good, especially for more reasonably sized ORs.

For real-world data, I used the GWAS-significant SNPs from several GWAS: asthma, breast carcinoma, coronary heart disease, eczema, migraine, osteoarthritis and type 2 diabetes. I used data from the same studies as a MedRXiv paper of mine – all the SNP/GWAS details are in the supplementary tables there, if you’re interested.

I actually wasn’t expecting much, since real-world data is much messier than simulated data, and I was concerned about the assumptions I was making. But I was pleasantly surprised.

The coefficient from the regression of all 181 real-world GWAS SNPs was 1.20, with a R2 of 0.96, so again, on average the method underestimated the log-OR. The median percentage difference between the true and estimated log-ORs was -2.4%, interquartile range of -19.6% to 13.7%, with only 63% of estimates more than 10% out. All log-ORs were between -1 and 1.

Fortunately, this is a conservative bias, and one I’m happy with, since there’s no way to guess at exactly how much adjustment affects the results in any individual GWAS.

So, can you use this method to estimate SNP effect sizes?

Yes, I think so.

Any bias is likely to be conservative, so there’s limited risk of false positives in any subsequent Mendelian randomization analysis. And the performance is, I think, pretty good, even with real-world data.

It’s certainly no substitute for knowing what the SNP effect sizes are, but if they aren’t reported, I’m reasonably confident this method can be used to estimate them without running into problems.

Maths!

Here’s my maths, which I’ve included both because I might need to refer to this again later, and because it’s the most complicated maths I’ve done and I’m a little proud.

Feel free to skip entirely. Code is at the bottom.

The variance-covariance matrix in a logistic regression model is defined as:

04Where:
S is the variance covariance matrix
X is the n x 2 matrix of exposures
V is the diagonal matrix of weights, with n_i p_i (1-p_i) as the entries

So, assuming just one exposure variable in the model, i.e. only one Χ variable, we can break this down to the following:05

Where:

  • Χ_1 … Χ_n are the values of each level of the exposure variable, i.e. 0, 1 and 2 for the dosage of SNPs
  • n_1 … n_n are the number of participants within each level of the exposure variable, i.e. the number of participants who are homozygous for the major allele, heterozygous, or homozygous for the minor allele – this is estimated from the MAF
  • p_1 … p_n are the probability of having the outcome within each level of the exposure variable, i.e. the probability the participants who are homozygous for the major allele have the outcome, etc.

Note: I can’t see how to use subscript in text in wordpress – it renders subscript as “_”, sorry!

Inverting a matrix (i.e. to the power -1) converts it to the matrix required to form an identity matrix on multiplication (i.e. 10|01). In a 2×2 table (a and b in the first row, c and d in the second) this can be accomplished using a formula where the matrix is altered so that a = d, d = a and b and c = -b and -c, and the whole matrix is multiplied by 1/(ad+bc).

So:

06

The variance-covariance matrix gives the variance of the exposure in the top left and the variance of the constant term in the bottom right elements.

The SE of the effect estimate is therefore:

07

Knowing this is key to solving this problem, since we know there are 3 levels of the exposure (dosages of 0, 1 and 2). If we can estimate the number of participants in each level, and we know the P value of the effect estimate, then we can solve this.

We can get rid of the summation terms by substituting the actual terms for each level of the exposure, at the expense of making the equation ridiculously long:

08

We can make this even longer by substituting values of the probabilities.

The easy part is expressing probabilities in terms of odds:

09

And for completion, here’s the reverse:

10

Log-odds are the outcome in a logistic regression (shown here with one exposure variable):

11

β_0 is the constant term in the logistic regression, and β_1 is the log-OR for the exposure, and so what we are trying to estimate. We can ignore the error term, ε, and substitute into the above:

12

That’s the probability across all participants, but we can define the probability of having the outcome within each level of the exposure, as x takes the values 0, 1 and 2, denoting the dosage of the SNP [note, for most variables where I use 0, 1 and 2 as subscripts, they represent the value of the variable for that specific dosage, so p_0 is the probability of having the outcome in the participants with a dosage of 0, i.e. homozygous for the major allele. The exceptions are only β_0 and β_1, which are the constant term and SNP effect size, respectively]:
13

The β_0 is irritating, since we don’t know it. We can, however, express it in terms of β_1 and the overall odds of having the outcome across everyone in the GWAS, since we know the log-odds of having the outcome must be equal to the weighted average of the log-odds within each level of the exposure:

14

Substituting β_0 into the probability equations gives the following:

15

To make things a bit simpler, divide both the upper and lower sides of the right-hand side by the exponent and simplify:

16

Remember the long equation for the SE above?

Yeah, now substitute p_0, p_1 and p_2 into that.

Clearly, that would make the equation too long for anything other than a whiteboard, but it’s not too bad if split into parts. I’ve defined part a as the top of the fraction (and the bottom left part) in the SE equation, part b as the middle part of the bottom of the fraction, and part c as the right part of the bottom of the fraction:

17

Where:

18

Substituting the probabilities and simplifying (for a given value of “simplifying”), we get:19

We now have an (incredibly long) expression of the SE in terms of β_1, the odds of having the outcome in the whole GWAS, and the number of participants with each dosage of SNP. Now we just need to set that equal to something else we know.

Quick detour: I assume Hardy-Weinberg equilibrium for the SNPs, which means that for a given minor allele frequency, p, and major allele frequency, q (also expressed as 1-p), we can estimate what proportion of people in a population are likely to have each dosage of SNP, 0, 1 and 2:

20

Where, proportionately, p^2 people have a dosage of 0, 2pq people have a dosage of 1, and q^2 people have a dosage of 2. As an example, if the MAF is 0.2, then we can say 4% (0.22) of people have 2 copies of the minor allele, 32% (2*0.2*0.8) of people have 1 copy of the minor allele, and 64% (0.82) of people have 0 copies of the minor allele.

Right, back to it. P values are almost always estimated as follows:

21

The Φ means use the normal function on Z to get the P value, i.e. what percentage of a normal distribution is located either before (for negative Z scores) or after (for positive Z scores) the Z score. A Z score of ±1.96 corresponds to a (two-sided) P value of 0.05, which is why when we estimate 95% confidence intervals we add and subtract 1.96 times the SE to the effect estimate.

Anyway, all this means is that the P value can be converted to a Z score using an inverse normal function. In Stata, that’s the invnormal() function. In Excel, it’s the normsinv() function. Now, because most P values in GWAS are two-sided, we need to divide the P value by 2 before using the inverse normal function. Therefore, if you have a variable of P values, generate the Z score by using the following command:

22

You’ll want to make the Z score positive if the effect direction is positive, since it defaults to negative. If you don’t know the effect direction, then you can’t do anything anyway, since P values are non-directional. An OR of 2 and 0.5 will have the same SE (given the same number of participants), so give the same P value.

We know Z, so we can write:

23

As β_1 is the only unknown in this equation, we just need to find the value of β_1 that satisfies the equation.

As above, I use trial and error to find the value of β_1 that satisfies the equation.

Code!

This is my Stata code for estimating the log-OR and SE for a list of SNPs. The MAF, number of participants, number of cases and Z score are mandatory, and the Z-score denotes the effect direction. I’ve added comments, so hopefully it makes sense. Feel free to use for any purpose.

You should just be able to copy this into a .do file and run it as is.

/*

Estimating the beta and SE of SNPs given the MAF, odds of having the outcome, Z-score and effect direction

Written by Sean Harrison, 09-10 April 2020

Feel free to use for any purpose

 

For a given dataset containing values for the following variables (named like exactly as shown):

 

snp: SNP ID (string, or whatever, it’s just for reference)

maf: Minor allele frequency (MAF) of the SNP (numeric, between but not including 0 and 1)

n_cases: Number of people with the outcome (numeric, positive, not 0)

n_total: Total number of people (numeric, positive, not 0)

z: Z score

 

Assumptions:

 

1) There were no other variables in the logistic regression that produced the GWAS result, or that they had no meaningful impact on the log-OR or the SE

2) The MAF corresponds well to the expected distribution of SNPs, i.e. Hardy-Weinberg equilibrium

 

Coding:

 

This code does the following for each SNP in the dataset:

 

1) Works out the odds of having the outcome across all participants from the number with the outcome and the total number of participants

2) Works out the number of participants with 0, 1 and 2 copies of the minor allele, based on the MAF

3) Iterates to a log-odds ratio that gives the same Z score given the N, odds, effect direction (from the Z score) and MAF:

This is based on the extremely long and complicated equation for calculating the SE of a log-OR,

which is itself based on the number of participants in each category of the exposure [0,1,2] and the odds a participant in each category has the outcome

 

The code simulates a million values of Z for log-ORs between 0 and 1 or -1 (depending on the effect direction of Z), and picks the minimum difference between the observed and estimated Z scores

If the minimum score is at the maximum value of Z, the code simulates the next million values, and so on until the first minimum is reached

 

4) The log-odds ratio and SE are returned for each SNP

 

*/

 

*Set up results

gen _log_or = .

gen _log_or_se = .

 

local snp_n = c(N)

 

forvalues i = 1/`snp_n’ {

 

dis “Analysing SNP `i’ of `snp_n'”

qui {

*Odds of the outcome

local odds = n_cases[`i’]/(n_total[`i’]-n_cases[`i’])

 

*Ns given the MAF

*p^2 + 2pq + q^2 = 1

local n0 = n_total[`i’]*maf[`i’]^2

local n1 = n_total[`i’]*2*maf[`i’]*(1-maf[`i’])

local n2 = n_total[`i’]*(1-maf[`i’])^2

local N = n_total[`i’]

 

local z = z[`i’]

 

*Simulate values of the log-OR, and estimate the Z score

preserve

clear

 

*Simulate the first 1,000,000 observations between 0 and 1 or -1

set obs 1000000

if `z’ >= 0 {

gen _x = _n*0.000001

}

else {

gen _x = _n*-0.000001

}

gen n = _n

 

/*

Individual parts of the equation, just for reference – these are merged and simplified in the gen _y equation below

gen _p0 = 1/(1+exp(-(ln(`odds’) – _x*(`n1’+2*`n2′)/`N’)))

gen _p1 = 1/(1+exp(-(ln(`odds’) – _x*(`n1’+2*`n2′)/`N’)-_x))

gen _p2 = 1/(1+exp(-(ln(`odds’) – _x*(`n1’+2*`n2′)/`N’)-2*_x))

 

gen _a = `n0’*_p0*(1-_p0)+`n1’*_p1*(1-_p1)+`n2’*_p2*(1-_p2)

gen _b = `n1’*_p1*(1-_p1)+4*`n2’*_p2*(1-_p2)

gen _c = (`n1’*_p1*(1-_p1)+2*`n2’*_p2*(1-_p2))^2

 

gen _se = sqrt(_a/(_a*_b-_c))

 

gen _y = abs(_x/_se-`z’)

*/

 

*Main equation, giving the difference between the estimated and observed Z scores

gen _y = abs(_x/sqrt((`n0’*`odds’*exp(_x*(`n1’+2*`n2′)/`N’)/(`odds’+exp(_x*(`n1’+2*`n2′)/`N’))^2 + `n1’*`odds’*exp(_x*(`n1’+2*`n2′-`N’)/`N’)/(`odds’+exp(_x*(`n1’+2*`n2′-`N’)/`N’))^2 + `n2’*`odds’*exp(_x*(`n1’+2*`n2′-2*`N’)/`N’)/(`odds’+exp(_x*(`n1’+2*`n2′-2*`N’)/`N’))^2)/((`n0’*`odds’*exp(_x*(`n1’+2*`n2′)/`N’)/(`odds’+exp(_x*(`n1’+2*`n2′)/`N’))^2 + `n1’*`odds’*exp(_x*(`n1’+2*`n2′-`N’)/`N’)/(`odds’+exp(_x*(`n1’+2*`n2′-`N’)/`N’))^2 + `n2’*`odds’*exp(_x*(`n1’+2*`n2′-2*`N’)/`N’)/(`odds’+exp(_x*(`n1’+2*`n2′-2*`N’)/`N’))^2)*(`n1’*`odds’*exp(_x*(`n1’+2*`n2′-`N’)/`N’)/(`odds’+exp(_x*(`n1’+2*`n2′-`N’)/`N’))^2 + 4*`n2’*`odds’*exp(_x*(`n1’+2*`n2′-2*`N’)/`N’)/(`odds’+exp(_x*(`n1’+2*`n2′-2*`N’)/`N’))^2)-((`n1’*`odds’*exp(_x*(`n1’+2*`n2′-`N’)/`N’)/(`odds’+exp(_x*(`n1’+2*`n2′-`N’)/`N’))^2 + 2*`n2’*`odds’*exp(_x*(`n1’+2*`n2′-2*`N’)/`N’)/(`odds’+exp(_x*(`n1’+2*`n2′-2*`N’)/`N’))^2)^2)))-`z’)

 

local complete = 0

local j = 0

 

*Set a while loop to increase/decrease the simulated log-OR values by just less than 1 each iteration until a minimum is found

while `complete’ == 0 {

 

qui su _y

qui su n if _y == r(min)

 

*If the minimum isn’t the last observation

if r(mean) < 1000000 {

local complete = 1

}

*If the minimum is the last observation, it hasn’t reached the actual minimum yet, so increase/decrease the observations

else {

local j = `j’+1

if `z’ >= 0 {

replace _x = _n*0.000001 + (1000000-100)*0.000001*`j’

}

else {

replace _x = -(_n*0.000001 + (1000000-100)*0.000001*`j’)

}

*Update the difference between the estimated and observed Z scores

replace _y = abs(_x/sqrt((`n0’*`odds’*exp(_x*(`n1’+2*`n2′)/`N’)/(`odds’+exp(_x*(`n1’+2*`n2′)/`N’))^2 + `n1’*`odds’*exp(_x*(`n1’+2*`n2′-`N’)/`N’)/(`odds’+exp(_x*(`n1’+2*`n2′-`N’)/`N’))^2 + `n2’*`odds’*exp(_x*(`n1’+2*`n2′-2*`N’)/`N’)/(`odds’+exp(_x*(`n1’+2*`n2′-2*`N’)/`N’))^2)/((`n0’*`odds’*exp(_x*(`n1’+2*`n2′)/`N’)/(`odds’+exp(_x*(`n1’+2*`n2′)/`N’))^2 + `n1’*`odds’*exp(_x*(`n1’+2*`n2′-`N’)/`N’)/(`odds’+exp(_x*(`n1’+2*`n2′-`N’)/`N’))^2 + `n2’*`odds’*exp(_x*(`n1’+2*`n2′-2*`N’)/`N’)/(`odds’+exp(_x*(`n1’+2*`n2′-2*`N’)/`N’))^2)*(`n1’*`odds’*exp(_x*(`n1’+2*`n2′-`N’)/`N’)/(`odds’+exp(_x*(`n1’+2*`n2′-`N’)/`N’))^2 + 4*`n2’*`odds’*exp(_x*(`n1’+2*`n2′-2*`N’)/`N’)/(`odds’+exp(_x*(`n1’+2*`n2′-2*`N’)/`N’))^2)-((`n1’*`odds’*exp(_x*(`n1’+2*`n2′-`N’)/`N’)/(`odds’+exp(_x*(`n1’+2*`n2′-`N’)/`N’))^2 + 2*`n2’*`odds’*exp(_x*(`n1’+2*`n2′-2*`N’)/`N’)/(`odds’+exp(_x*(`n1’+2*`n2′-2*`N’)/`N’))^2)^2)))-`z’)

}

}

 

*While loop complete, so minimum difference found

su _y

su _x if _y == r(min)

local x = r(mean)

 

*Generate the log-OR SE

gen _y_se = sqrt((`n0’*`odds’*exp(_x*(`n1’+2*`n2′)/`N’)/(`odds’+exp(_x*(`n1’+2*`n2′)/`N’))^2 + `n1’*`odds’*exp(_x*(`n1’+2*`n2′-`N’)/`N’)/(`odds’+exp(_x*(`n1’+2*`n2′-`N’)/`N’))^2 + `n2’*`odds’*exp(_x*(`n1’+2*`n2′-2*`N’)/`N’)/(`odds’+exp(_x*(`n1’+2*`n2′-2*`N’)/`N’))^2)/((`n0’*`odds’*exp(_x*(`n1’+2*`n2′)/`N’)/(`odds’+exp(_x*(`n1’+2*`n2′)/`N’))^2 + `n1’*`odds’*exp(_x*(`n1’+2*`n2′-`N’)/`N’)/(`odds’+exp(_x*(`n1’+2*`n2′-`N’)/`N’))^2 + `n2’*`odds’*exp(_x*(`n1’+2*`n2′-2*`N’)/`N’)/(`odds’+exp(_x*(`n1’+2*`n2′-2*`N’)/`N’))^2)*(`n1’*`odds’*exp(_x*(`n1’+2*`n2′-`N’)/`N’)/(`odds’+exp(_x*(`n1’+2*`n2′-`N’)/`N’))^2 + 4*`n2’*`odds’*exp(_x*(`n1’+2*`n2′-2*`N’)/`N’)/(`odds’+exp(_x*(`n1’+2*`n2′-2*`N’)/`N’))^2)-((`n1’*`odds’*exp(_x*(`n1’+2*`n2′-`N’)/`N’)/(`odds’+exp(_x*(`n1’+2*`n2′-`N’)/`N’))^2 + 2*`n2’*`odds’*exp(_x*(`n1’+2*`n2′-2*`N’)/`N’)/(`odds’+exp(_x*(`n1’+2*`n2′-2*`N’)/`N’))^2)^2)))

 

su _y

su _y_se if _y == r(min)

local se = r(mean)

 

restore

 

*End of simulation, bring back the original data and update the results

replace _log_or = `x’ in `i’

replace _log_or_se = `se’ in `i’

 

}

}