Albatross plots: Part 1

In this post, I describe albatross plots, the plot I created as a PhD student to avoid having to do a narrative synthesis. I’ll detail why and how I developed them, what they’re useful for, and what they’ve been used for so far. I’ll detail the statistics of how they work in another post (Part 2), and put up (and annotate) the Stata code. The paper we wrote covers the plots perfectly fine, but I wanted to give more of a personal (and less technical) description. I also gave a talk about the plots (well, I’ve given many), and it was recorded.

Why create the plots?

In the second year of my PhD, I conducted a systematic review of the association between milk and IGF (another future blog post to write). We found 31 studies that examined this association, but the data presented in their papers was not suitable for meta-analysis, i.e. I couldn’t extract relevant, combinable information from most of the studies. There was lots of missing information, and a lot of different ways of presenting the data. The upshot was, we couldn’t do a meta-analysis.

This left few options. Meta-analysis is the gold standard for combining aggregate data (i.e. results presented in scientific articles, rather than the raw data), for good reason. The main benefits of meta-analysis are an overall effect estimate (association between two variables, effect of an intervention, or any other statistic), and a graphical display showing the results of all the studies that went into the analysis (a forest plot). The alternatives to a meta-analysis don’t have those benefits.

The least statistical thing we could have done is a narrative synthesis, where each study found in a systematic review is described, with some conclusions drawn at the end. A narrative review, as described in the , is different to a narrative synthesis, and does not involve a systematic search and inclusion of studies; it’s more of a cherry-picking method. In either case, the potential for bias is large (conscious or sub-conscious), it takes forever to write the damn things, and it also takes forever to read them. I really wanted to avoid writing a narrative synthesis, and so looked around for other options.

There were a couple of statistical options that would mean less writing. The first was vote counting (and yes, I’m linking Wikipedia articles as well as academic papers – I think it’s often more helpful than linking a whole paper or a textbook, which I know most people don’t have access to). Vote counting is where you add up the number of studies that have a positive, negative or null effect (as determined by a P value of, usually, less than 0.05). It’s the most basic statistical evidence synthesis method. It’s also an awful use of statistics (P values of less than a threshold has been since the turn of the millennium), and even if you decide there is evidence of an effect, you can’t tell how large it might be.  is a bit more helpful; it combines all the P values from all the studies you want, and spits out a combined P value, indicating the amount of evidence in favour of rejecting the null hypothesis (i.e. evidence that two variables are associated, or that a treatment works). This is slightly better than a vote count, as the P values are not dichotomised. Again, though, there is no way to tell how large the effect might be: a really tiny P value might just mean there were lots of people in the studies.
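A minimal sketch of both approaches in Python, for illustration only. It assumes Fisher's method for combining P values (a standard choice, not necessarily the one used in the review), and the function names and example P values are invented.

```python
import math

def vote_count(p_values, directions, alpha=0.05):
    """Tally studies as positive, negative or null at a P value threshold."""
    tally = {"positive": 0, "negative": 0, "null": 0}
    for p, d in zip(p_values, directions):
        if p >= alpha:
            tally["null"] += 1
        elif d > 0:
            tally["positive"] += 1
        else:
            tally["negative"] += 1
    return tally

def fisher_combined_p(p_values):
    """Combine independent P values with Fisher's method (an assumption here).

    The statistic -2 * sum(ln p) follows a chi-squared distribution with
    2k degrees of freedom under the null; for even degrees of freedom the
    upper tail has the closed form used below.
    """
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # the statistic divided by 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i   # (half_stat ** i) / i!
        total += term
    return math.exp(-half_stat) * total

# Hypothetical example data: three studies, all in the positive direction.
p_values = [0.04, 0.20, 0.01]
directions = [1, 1, 1]
votes = vote_count(p_values, directions)
combined = fisher_combined_p(p_values)
```

The vote count throws away how small each P value is, while the combined P value keeps that information. Neither says anything about the size of the effect.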

We considered creating harvest plots, which are bar charts showing vote count results with added information to show how confident you are in the results, so well-conducted, large studies are more heavily weighted than poor-quality, small studies. The graphs let you see all the data at once, which we thought was quite good. In the end, though, we decided not to use these plots for two reasons. The first was that we thought we could make better use of the data we had (we knew the number of people in each study, and what the P value was, and which direction the effect was in). The second reason was that we couldn’t make the plots, at least not easily. There was no software we could find that made it simple, which makes an important point: if you design a new method of doing something, make sure that you write code so other people can do it. No one will use a new method if it takes days of tinkering to make it work, unless it really is much better than what came before. Michael Crowther makes this point in one of his first , and as a biostatistician, he knows what he’s talking about.

How I developed the plots

The data we extracted from each study were: the total number of participants, the P value for the association between milk and IGF, and the direction of the association (positive or negative). Because I am a simple person, I plotted P values on the horizontal (x) axis, against the number of participants on the vertical (y) axis. Really low P values for negative associations were on the far left of the plot, and really low P values for positive associations were on the far right, and null P values (P = 1) were in the middle.

I put the axes on logarithmic scales to make things look better, so each unit increase along the x-axis on the right-hand side was a 10-fold decrease in P value (from 0.1 to 0.01 to 0.001 etc.). The first plots looked like the one below, with each study represented by a point, P value along the x-axis and number of participants along the y-axis.

[Figure: albatross plot of the IGF-I studies, without effect contours]

This plot showed the P values from all the studies examining the association between milk (and dairy protein and dairy products) and IGF-I. The studies were split into Caucasian (C) and non-Caucasian (NC) groups. The left of the plot shows studies with negative associations (i.e. as milk increases, IGF-I decreases), and the right of the plot shows studies with positive associations. The P values decrease from 1 in the centre towards 0 at the edges.

This looked like it might be a good way of displaying the results of the systematic review, as we could see that there was likely an overall positive association between milk and IGF-I. But we still couldn’t tell what the overall effect estimate might be, or how large it could be. On the bright side, we could see that the largest studies all had positive associations, and we could easily identify outlier studies, for example the two studies with negative associations and reasonably small P values.

It was  who suggested putting effect contours onto the plots to make them more interpretable. Effect contours are lines added to the graph to show where a particular effect size should lie. For all studies that calculated their P value with a Wald test, where the effect estimate divided by the standard error of the estimate is compared with a normal distribution to calculate a P value, there is a defined relationship between the effect size, number of participants, and the P value. For a particular effect size, as the number of participants increases, the P value must decrease to make everything balance.

This makes sense – for a small study (say 20 people) to have a tiny P value (say 0.001), it must have a huge effect, whereas a large study (say 20,000 people), to have the same P value (0.001), must have a much smaller effect size. I’ll detail the maths in a future post, but for now I’ll just say that I derived how to calculate effect contours for several different statistical methods (for example, linear regression and standardised mean differences), and the plots looked much better for it. For the article, we also removed the distinction between Caucasian and non-Caucasian studies, and removed the two studies with very negative associations. This wasn’t to make our results seem better – those two studies were qualitatively different studies that were discussed separately, immediately after we described the albatross plot. The journal also apparently bleached the background of the plot, which I wasn’t overly happy with.
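As a rough numerical check of that intuition, the sketch below takes the standardised mean difference contour equation for equally sized groups, N = Z^2 * (8 + SMD^2) / (2*SMD^2) (quoted in the comments at the end of this post), and rearranges it to give the SMD implied by a given N and two-sided P value. The milk and IGF-I plot used standardised beta coefficients rather than SMDs, so this only illustrates the principle. The function name is made up, and Python's standard-library NormalDist stands in for Stata's invnormal().

```python
import math
from statistics import NormalDist

def implied_smd(n_total, p_two_sided):
    """SMD implied by N and a two-sided P value (equal group sizes).

    Rearranges N = Z^2 * (8 + SMD^2) / (2 * SMD^2) to
    SMD^2 = 8 * Z^2 / (2 * N - Z^2).
    """
    z = NormalDist().inv_cdf(p_two_sided / 2)  # negative, but only Z^2 is used
    z2 = z * z
    if 2 * n_total <= z2:
        raise ValueError("N is too small for this P value under the equation")
    return math.sqrt(8 * z2 / (2 * n_total - z2))

small = implied_smd(20, 0.001)      # about 1.7 SDs
large = implied_smd(20_000, 0.001)  # about 0.05 SDs
```

The same P value therefore corresponds to an effect roughly 37 times smaller in the study of 20,000 people than in the study of 20.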

[Figure: albatross plot of the IGF-I studies, with effect contours]

Now the plot has contours, it’s much easier to see what the overall effect size might be. The effect contours were for standardised beta coefficients, where the beta is the number of standard deviations (SDs) change in outcome for a 1 SD increase in exposure. It’s not the most intuitive effect estimate, but it has good statistical properties (unlike normal beta coefficients). For our purposes, a standardised beta coefficient of 0.05 was a small effect, 0.10 was small to medium, and 0.25 was medium to large. Our exact wording in the paper is here:

Of the 31 data points (from 28 studies) included in the main IGF-I analysis, 29 data points showed positive associations of milk and dairy intake with IGF-I levels compared to two data points that showed negative or null associations. The estimated standardized effect size was 0.10 SD increase in IGF-I per 1 SD increase in milk (estimated range 0.05–0.25 SDs), from observation of the albatross plot (Fig. 3a). The combined p value for a positive association was 2.2×10^−27. All studies with non-Caucasian subjects displayed a positive association between milk intake and PCa risk; in particular, two studies [30, 52] had p values of 0.0001 and 0.001, respectively, and both studies had a sample size of less than 100. When considering only Caucasians, the overall impression from the albatross plot did not change; the effect estimate was still considered to be around 0.10 SD.

Of the 31 data points, 18 had an estimated standardized effect size between 0.05 and 0.25 SDs and four had an effect size of more than 0.25 SD. Eleven of these data points (61%) used milk as an exposure, including two that had an estimated standardized effect size of more than 0.25 SD [30, 53].

Drawing and interpreting albatross plots

For the plots to be drawn, all that is (generally) required is the total number of participants, the P value and the effect direction. However, some statistical methods either require more information, or require an assumption, to actually draw the contours. For example, standardised mean differences require the ratio of the group sizes to be specified (or assumed), so the contours are drawn for a specific effect size given a particular group size ratio (e.g. 1 case for 1 control). If you can assume all the studies have a reasonably similar group size ratio, then it’s generally fine to set the contours to be created at this ratio. If the studies all have very different ratios, then this is more of a problem, but we’ll get to that in a bit.

In general, if the studies line up along a contour, then the overall effect size will be the same as the contour. If the studies fall basically around the contour (but with some deviation), then the overall effect size will be the same, but you’ll be less certain about the result. The corresponding situation in a meta-analysis would be studies falling widely around the effect estimate in a forest plot, with a larger confidence interval as a result. If the studies don’t fall around a contour, and are scattered across the width of the plot, then there is likely no association. If the larger and smaller studies disagree, then there may be some form of bias in the studies (e.g. small-study bias).

When interpreting the plots, I tend to focus on how closely the studies fit around a particular contour, and state what that contour might be. I also tend to give a range of possible values for the contour, so if 90% of the studies fall between two contours I will mention them. It is incredibly important not to overstate the results – this is a helpful visual aid in interpreting the results of a systematic review, but it is not a meta-analysis, and you certainly can’t give an exact effect size and leave it at that. If an exact effect size is needed, then a meta-analysis is the only option.

Stata code

The plots can be created in any statistical program (I imagine Excel could do it, I just wouldn’t want to try), and I wrote a page of instructions on how to create the plots that I think was cut from the journal paper at the last minute. I will dig it out and post it for anyone that might be interested. However, if you have Stata, you don’t need to create your own plots, because I spent several weeks writing code so you don’t have to.

Writing a statistical program was a great learning experience, which I’m fairly certain is what people say after doing something that was unexpectedly complicated and time-consuming. The actual code for producing an individual plot is trivial; it can be done in a few lines. But the code for making sure that every plot that could be made works and looks good is much more difficult. Trying to think of all the ways people using the program could make it crash took most of the time, and required a lot of fail-safes. I’ll do a future post specifically about the code, and say for now that in general, writing a program is not too different from writing ordinary code, but if it involves graphics it might take much longer than expected.

If you want to try out albatross plots (and I hope that you do), and you have Stata, you can install the program by typing: ssc install albatross. The help file is very detailed, and there’s an extra help file containing lots of examples, so hopefully everything will be clear. If not, let me know. I’ll do a video at some point showing how to create plots in Stata, for those that prefer a demonstration to reading through help files.


The program has a feature that isn’t discussed in the paper, as I still need to prove that it works fine all the time. I’m pretty sure it does, and have used it in all the papers albatross plots are a feature of. The feature is called adjust, and it does just what it says: it adjusts the number of participants in a study to the effective sample size. This is the term for the number of participants that would have been required if something had been different, for example if the ratio of group sizes had been 1 (when in fact it was 3.2, or anything else). The point is to make the results from studies with different ratios comparable: statistical power is highest when the group ratio is 1, so studies with high group ratios will look like they have smaller effect sizes than they do, just because they have less power. I will write another post about adjust when I can, as it is a very useful option that definitely raises the interpretability of the plots.

Uses of the plots

So far, I have identified two main uses of the plots. The first use was shown in the paper reviewing whether , and is when a meta-analysis just isn’t possible given the data that are available. The World Health Organisation also used the plots this way in a systematic review of , where again, meta-analysis was not possible.

The second use of the plots is to compare the studies that were included in a meta-analysis with those that couldn’t be included in the meta-analysis because they lacked data. This allows you to determine whether the studies that couldn’t be included would have changed the results if they could have been included. I have yet to publish my results, but I used the plots in my thesis to compare the results of studies that were included in a meta-analysis looking at the association between body-mass index and prostate-specific antigen to those that couldn’t be included. We found three out of the four studies were consistent, but one wasn’t (I mean, it isn’t as if it’s completely on the other side of the plot, but it certainly isn’t as close to the other studies as I’d like). The study that wasn’t consistent was conducted in a different population to all other studies, so we made sure to note there were limits on the generalisability of our results. The effect contours are for standardised beta coefficients again. The meta-analysis of the included studies gave a result of a 5.16% decrease in PSA for every 5 kg/m² increase in BMI. This is roughly equivalent to a standardised beta of -0.05, which is good because that’s what I’d say is the magnitude of effect in the albatross plot below.



I think albatross plots are pretty useful, and made my life easier when otherwise I would have had to write a long narrative synthesis. By putting the code online, we made sure that people who wanted to use the plots could. By providing a tonne of examples, hopefully people will know how to use the plots too.

Incidentally, the albatross in the name refers to the fact the contours looked like wings. Rejected names include swan plot, pelican plot and pigeon plot. Oh, and if any of you are thinking albatross are bad luck, they were until a captain decided to . A lesson for us all.

3 thoughts on “Albatross plots: Part 1”

    1. Hi Jasmine, sorry for the slow response!

      Thanks for the message – I haven’t managed to code up an R version of albatross plots as yet, since I’m not great with ggplot2. In terms of creating one yourself, it’s totally doable, but I imagine quite arduous.

      1) You’ll need a dataframe with the number of participants (N) in each study, and the P value (P) and the effect direction (E) of whatever effect estimate you are looking at. If it’s a mean difference, standardized mean difference, odds ratio or risk ratio, you may want other information (or need to assume other information in order to generate effect contours). Exactly what you’ll need is listed in table 1 of the paper:

      2) Albatross plots are a scatter plot with some contours superimposed. The X-axis is the P value, which I coded to be on the log10 scale for the inverse P value, i.e. log10(1/P). This means the X-axis extends from P=1 in the middle to P->0 at either end. The effect direction (E) makes the X-axis value either positive or negative, i.e. negative effect directions extend to the left towards P->0, and positive extend to the right towards P->0. The Y-axis is scaled as a squared log10 scale, so [log10(N)]^2 – this just makes it easier to interpret.

      In Stata, I scale all the P and N values to fit the X and Y scales, then relabel the values. As an example, a study with a P value of 0.05 and 100 people with a positive effect direction has the co-ordinates: X=1.30, Y=4. Another example, a study with a P value of 0.001 and 1000 people with a negative effect direction has the co-ordinates: X=-3, Y=9. I imagine you can relabel the X and Y axes to show the correct P and N values in R: it’s nice numbers at least – for the X axis, 0=1, 1=0.1, 2=0.01 etc., and for the Y axis, 0=1, 1=10, 4=100, 9=1000 etc.
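The scaling in steps 1) and 2) can be sketched in a few lines of Python. This is an illustration rather than a translation of the Stata program: the function names are made up, and studies with a negative effect direction are given a negative X value so that they sit on the left of the plot, as described in step 2).

```python
import math

def albatross_coords(n, p, direction):
    """Return (x, y) plot co-ordinates for one study.

    x is log10(1/P), signed by effect direction; y is [log10(N)]^2.
    """
    x = math.log10(1 / p)
    if direction < 0:
        x = -x
    y = math.log10(n) ** 2
    return x, y

def p_tick_label(x):
    """P value to print at x-axis position x (same on both sides of zero)."""
    return 10 ** -abs(x)

def n_tick_label(y):
    """Number of participants to print at y-axis position y."""
    return 10 ** math.sqrt(y)

a = albatross_coords(100, 0.05, +1)     # positive direction, right-hand side
b = albatross_coords(1000, 0.001, -1)   # negative direction, left-hand side
```

The two label helpers invert the scaling, so they can be used to print P values and participant numbers on the axis ticks instead of the scaled values.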

      3) That’s all for the individual study points in the scatter plot – I have different markers for different groups, and have an adjust option that changes the N value to the effective N value, but the equations are a bit of a nightmare – let me know if you want them. I do intend to put them up as a post one day…

      4) For the superimposed line graphs showing the effect contours, you need the equations in table 1 of the paper (link above). First, decide on the statistical method, then make sure that you have the additional required variables (if necessary). The additional variables aren’t for each study. The equations are in the form N = Z_p^2 * [effect estimate] * [usually something else]. This is analogous to Y = X * [chosen value for the contour] * [a constant].

      For simplicity of reading, the equations use Z^2 rather than the P value, but they are interchangeable with a transformation. In Stata it’s Z = invnormal(P/2) or P = normal(-abs(Z))*2. In Excel, it’s Z = normsinv(P/2) or P = normsdist(-abs(Z))*2. In R, the equivalents are Z = qnorm(P/2) and P = pnorm(-abs(Z))*2. The P/2 business is specifically for 2-sided P values: you don’t bother with it for 1-sided P values. However, if you are using 1-sided P values, you’ll need to rescale the X axis. Let’s hope you’re not using 1-sided P values, because they are a faff…
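The same two-sided conversion can be sketched in Python with the standard library's statistics.NormalDist. The helper names here are made up for the example.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # mean 0, SD 1

def z_from_p(p_two_sided):
    """Two-sided P value -> Z score (negative, as with Stata's invnormal(P/2))."""
    return _STD_NORMAL.inv_cdf(p_two_sided / 2)

def p_from_z(z):
    """Z score -> two-sided P value."""
    return 2 * _STD_NORMAL.cdf(-abs(z))

z = z_from_p(0.05)   # roughly -1.96
p = p_from_z(z)      # back to roughly 0.05
```

Only Z^2 appears in the contour equations, so the sign of Z doesn't matter for drawing them.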

      Ok, so once you’ve decided on the statistical type, you need to make decisions about a) the contour effect size, and b) the size of any remaining variables. For a), you just need to play around with likely effect sizes, and find ~3 that you like the look of. For b), you need to know roughly what the value of each additional variable is in each of the included studies. For instance, for mean differences, the equation requires the standard deviation (SD) of the mean and the ratio of group sizes. If the ratio of group sizes is 1 (i.e. same sized groups), then you can just use the simpler equation that’s listed. But in either case, you need the SD – to be comparable, each included study should have roughly the same SD, and you can plug that number into the equation.

      For odds ratios (ORs) and risk ratios (RRs), you’ll need estimates of the ratio of group sizes and the baseline risks in the control groups. Plug these numbers into the relevant equations, and you’re left with what I put above: Y = X * [chosen value for the contour] * [a constant].

      Now, if you have wildly different ratios of group sizes, then you may want to adjust down to the effective sample size, as I mentioned above. This means estimating the number of participants required to get the same P value and effect size IF the number of participants in each group in a study were equal. The equations necessary to do this aren’t published, so let me know if you need this (or anyone else, if you’re reading this, it’ll give me an incentive to publish a post about it).
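For standardised mean differences only, one way to sketch the idea in Python is to take the two SMD contour equations from table 1 of the paper (equal groups, and unequal groups with ratio r, both written out in the example further down), solve the unequal-groups equation for the SMD implied by a study's N and P value, then put that SMD back into the equal-groups equation. This is an illustration derived from those two equations, not the unpublished equations behind adjust, and the function name is made up.

```python
from statistics import NormalDist

def effective_n_smd(n_total, p_two_sided, r):
    """Approximate effective N for an SMD if the groups had been equal.

    Unequal groups: N = Z^2 * (2*(r+1)^2 + r*SMD^2) / (2*r*SMD^2)
    Equal groups:   N = Z^2 * (8 + SMD^2) / (2*SMD^2)
    """
    z2 = NormalDist().inv_cdf(p_two_sided / 2) ** 2
    if 2 * n_total <= z2:
        raise ValueError("N is too small for this P value under the equation")
    # SMD^2 implied by the study's N, P value and group size ratio.
    smd2 = 2 * z2 * (r + 1) ** 2 / (r * (2 * n_total - z2))
    # N needed for the same P value and SMD with a 1:1 ratio.
    return z2 * (8 + smd2) / (2 * smd2)

n_equal = effective_n_smd(400, 0.05, r=1)  # ratio already 1, so N is unchanged
n_3to1 = effective_n_smd(400, 0.05, r=3)   # fewer participants needed at 1:1
```

Under these two equations the result simplifies to N * 4r / (r+1)^2 plus a small term in Z^2, so a 3:1 study of 400 people carries about as much information as a 1:1 study of roughly 300.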

      Once you have the equations you’ll need to produce the contours, you’ll have to rescale them to fit your graph. This means: a) turning the scaled P value (the X axis value) into a Z score, i.e. invnormal(10^-abs(X)/2), then b) further scaling the right hand side of the equation so Y is scaled correctly, i.e. log10 the right side then square it.

      This will give you an equation for Y in terms of X (a curve rather than a straight line), which you should be able to plot in a specified range in R on top of the scatter plot.

      Example: for a standardised mean difference (equally sized groups), the equation for the contour is:

      N = [Z^2] * (8 + SMD^2) / (2*SMD^2)

      Now do a) above:

      N = [invnormal(10^-abs(X)/2)]^2 * (8 + SMD^2) / (2*SMD^2)

      And b):

      Y = [log10([invnormal(10^-abs(X)/2)]^2 * (8 + SMD^2) / (2*SMD^2))]^2

      And let’s say you want the SMD to be 1 for a contour:

      Y = [log10([invnormal(10^-abs(X)/2)]^2 * [9/2])]^2

      If you wanted the SMD to be 2 for another contour:

      Y = [log10([invnormal(10^-abs(X)/2)]^2 * [12/8])]^2

      If you wanted the group sizes to be unequal for the contours (so if every study had a 2:1 group size ratio, for example), you’d use the equation for unequal sized groups, substituting in a value of r. For example, if you wanted a 2:1 group size ratio (r=2):

      N = [Z^2] * (2*(r+1)^2 + r*SMD^2) / (2*r*SMD^2)
      N = [invnormal(10^-abs(X)/2)]^2 * (2*(2+1)^2 + 2*SMD^2) / (2*2*SMD^2)
      Y = [log10([invnormal(10^-abs(X)/2)]^2 * (18 + 2*SMD^2) / (4*SMD^2))]^2

      So for an SMD of 1:

      Y = [log10([invnormal(10^-abs(X)/2)]^2 * [20/4])]^2

      The other statistical types follow the same rules – use the equations, substitute the values, scale everything correctly, plot the data and the contours, rescale the X and Y axes.
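The SMD example above can be sketched as a small Python function, assuming the two SMD equations as quoted (the equal-groups equation is the r = 1 case of the unequal-groups one). The function name and the choice to skip points where the contour falls below N = 1 are assumptions made for this illustration, not part of the Stata program.

```python
import math
from statistics import NormalDist

def smd_contour_y(x, smd, r=1.0):
    """Scaled Y value of an SMD effect contour at x-axis position x.

    x is the signed log10(1/P) position; r is the ratio of group sizes.
    Returns None where the contour implies N < 1 (below the Y axis).
    """
    p = 10 ** -abs(x)
    z2 = NormalDist().inv_cdf(p / 2) ** 2
    n = z2 * (2 * (r + 1) ** 2 + r * smd ** 2) / (2 * r * smd ** 2)
    if n < 1:
        return None
    return math.log10(n) ** 2

# Points along the right-hand wing of the SMD = 1 contour, up to P = 0.0001.
contour = []
for i in range(0, 41):
    x = i / 10
    y = smd_contour_y(x, smd=1.0)
    if y is not None:
        contour.append((x, y))
```

Because the function uses abs(x), the left-hand wing is the mirror image: the same Y values plotted at -x. The resulting points can be drawn as a line on top of the scatter plot in whatever plotting library is to hand.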

      I appreciate this is complicated, especially when written in a comment, so please let me know if you run into trouble!

      All the best,

