Albatross plots: Part 1

In this post, I describe albatross plots, the plot I created as a PhD student to avoid having to do a narrative synthesis. I’ll detail why and how I developed them, what they’re useful for, and what they’ve been used for so far. I’ll detail the statistics of how they work in another post (Part 2), and put up (and annotate) the Stata code. The paper we wrote covers the plots perfectly fine, but I wanted to give more of a personal (and less technical) description. I also gave a talk about the plots (well, I’ve given many), and it was recorded.

Why create the plots?

In the second year of my PhD, I conducted a systematic review of the association between milk and IGF (another future blog post to write). We found 31 studies that examined this association, but the data presented in their papers was not suitable for meta-analysis, i.e. I couldn’t extract relevant, combinable information from most of the studies. There was lots of missing information, and a lot of different ways of presenting the data. The upshot was, we couldn’t do a meta-analysis.

This left few options. Meta-analysis is the gold standard for combining aggregate data (i.e. results presented in scientific articles, rather than the raw data), for good reason. The main benefits of meta-analysis are an overall estimation of the effect estimate (association between two variables, effect of an intervention, or any other statistic), and a graphical display showing the results of all the studies that went into the analysis (a forest plot). The alternatives to a meta-analysis don’t have those benefits.

The least statistical thing we could have done is a narrative synthesis, where each study found in a systematic review is described, with some conclusions drawn at the end. A narrative review, as described in the , is different to a narrative synthesis, and does not involve a systematic search and inclusion of studies; it’s more of a cherry-picking method. In either case, the potential for bias is large (conscious or sub-conscious), it takes forever to write the damn things, and it also takes forever to read them. I really wanted to avoid writing a narrative synthesis, and so looked around for other options.

There were a couple of statistical options that would mean less writing. The first was vote counting (and yes, I’m linking Wikipedia articles as well as academic papers – I think it’s often more helpful than linking a whole paper or a textbook, which I know most people don’t have access to). Vote counting is where you add up the number of studies that have a positive, negative or null effect (as determined by a P value of, usually, less than 0.05). It’s the most basic statistical evidence synthesis method. It’s also an awful use of statistics (dichotomising P values at a threshold has been discouraged since the turn of the millennium), and even if you decide there is evidence of an effect, you can’t tell how large it might be. Combining P values is a bit more helpful: it takes all the P values from all the studies you want, and spits out a combined P value, indicating the amount of evidence in favour of rejecting the null hypothesis (i.e. evidence that two variables are associated, or that a treatment works). This is slightly better than a vote count, as the P values are not dichotomised. Again, though, there is no way to tell how large the effect might be: a really tiny P value might just mean there were lots of people in the studies.
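To make the two approaches concrete, here is a minimal Python sketch. The study data are made up, and Fisher's method is used as the way of combining P values purely as an assumption for illustration; the text above does not say which combination method was used. Note that neither number says anything about the size of the effect, and that feeding two-sided P values into Fisher's method ignores the direction of each study.

```python
import math

def vote_count(results, alpha=0.05):
    """Tally studies as positive, negative or null using a P value threshold.

    `results` is a list of (p_value, direction) pairs, direction being +1 or -1.
    """
    tally = {"positive": 0, "negative": 0, "null": 0}
    for p, direction in results:
        if p >= alpha:
            tally["null"] += 1
        elif direction > 0:
            tally["positive"] += 1
        else:
            tally["negative"] += 1
    return tally

def fisher_combined_p(p_values):
    """Combine independent P values with Fisher's method (an assumed choice).

    X = -2 * sum(ln p) follows a chi-squared distribution with 2k degrees of
    freedom under the null. For even degrees of freedom the upper tail has
    the closed form exp(-X/2) * sum_{i<k} (X/2)^i / i!.
    """
    k = len(p_values)
    half_x = -sum(math.log(p) for p in p_values)
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_x / i  # term is now (X/2)^i / i!
        total += term
    return math.exp(-half_x) * total

# Hypothetical studies: (two-sided P value, direction of effect)
studies = [(0.04, +1), (0.20, +1), (0.01, +1), (0.60, -1), (0.03, +1)]

print(vote_count(studies))
print(fisher_combined_p([p for p, _ in studies]))
```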

We considered creating harvest plots, which are bar charts showing vote count results with added information to show how confident you are in the results, so well-conducted, large studies are more heavily weighted than poor-quality, small studies. The graphs let you see all the data at once, which we thought was quite good. In the end, though, we decided not to use these plots for two reasons. The first was that we thought we could make better use of the data we had (we knew the number of people in each study, and what the P value was, and which direction the effect was in). The second reason was that we couldn’t make the plots, at least not easily. There was no software we could find that made it simple, which makes an important point: if you design a new method of doing something, make sure that you write code so other people can do it. No one will use a new method if it takes days of tinkering to make it work, unless it really is much better than what came before. Michael Crowther makes this point in one of his first , and as a biostatistician, he knows what he’s talking about.

How I developed the plots

The data we extracted from each study were: the total number of participants, the P value for the association between milk and IGF, and the direction of the association (positive or negative). Because I am a simple person, I plotted P values on the horizontal (x) axis, against the number of participants on the vertical (y) axis. Really low P values for negative associations were on the far left of the plot, and really low P values for positive associations were on the far right, and null P values (P = 1) were in the middle.

I put the axes on logarithmic scales to make things look better, so each unit increase along the x-axis on the right-hand side was a 10-fold decrease in P value (from 0.1 to 0.01 to 0.001 etc.). The first plots looked like the one below, with each study represented by a point, P value along the x-axis and number of participants along the y-axis.
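The axis layout described above can be sketched in a few lines of Python. This is only an illustration of the transformation, not the Stata program's code, and the function name `albatross_coordinates` is made up for the example.

```python
import math

def albatross_coordinates(n, p, direction):
    """Return (x, y) plot coordinates for one study.

    x is the signed -log10(P): zero when P = 1, positive for positive
    associations and negative for negative ones, so each unit step away
    from the centre is a 10-fold decrease in P value.
    y is log10 of the number of participants.
    """
    if not 0 < p <= 1:
        raise ValueError("P value must be in (0, 1]")
    if n <= 0:
        raise ValueError("number of participants must be positive")
    sign = 1 if direction >= 0 else -1
    x = sign * -math.log10(p)
    y = math.log10(n)
    return x, y

# Hypothetical studies: (participants, P value, direction)
for study in [(100, 0.01, +1), (1000, 0.001, -1), (50, 1.0, +1)]:
    print(study, albatross_coordinates(*study))
```

A study with P = 0.01 and a positive association lands two units to the right of centre; the same P value with a negative association lands two units to the left.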

[Figure: albatross plot of the IGF-I studies, without effect contours]

This plot showed the P values from all the studies examining the association between milk (and dairy protein and dairy products) and IGF-I. The studies were split into Caucasian (C) and non-Caucasian (NC) groups. The left of the plot shows studies with negative associations (i.e. as milk increases, IGF-I decreases), and the right of the plot shows studies with positive associations. The P values decrease from 1 in the centre towards 0 at the edges.

This looked like it might be a good way of displaying the results of the systematic review, as we could see that there was likely an overall positive association between milk and IGF-I. But we still couldn’t tell what the overall effect estimate might be, or how large it could be. On the bright side, we could see that the largest studies all had positive associations, and we could easily identify outlier studies, for example the two studies with negative associations and reasonably small P values.

It was  who suggested putting effect contours onto the plots to make them more interpretable. Effect contours are lines added to the graph to show where a particular effect size should lie. For all studies that calculated their P value with a Wald test, where the effect estimate divided by the standard error of the estimate is compared with a normal distribution to calculate a P value, there is a defined relationship between the effect size, number of participants, and the P value. For a particular effect size, as the number of participants increases, the P value must decrease to make everything balance.

This makes sense – for a small study (say 20 people) to have a tiny P value (say 0.001), it must have a huge effect, whereas a large study (say 20,000 people), to have the same P value (0.001), must have a much smaller effect size. I’ll detail the maths in a future post, but for now I’ll just say that I derived how to calculate effect contours for several different statistical methods (for example, linear regression and standardised mean differences), and the plots looked much better for it. For the article, we also removed the distinction between Caucasian and non-Caucasian studies, and removed the two studies with very negative associations. This wasn’t to make our results seem better – those two studies were qualitatively different studies that were discussed separately, immediately after we described the albatross plot. The journal also apparently bleached the background of the plot, which I wasn’t overly happy with.
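The trade-off between effect size, sample size and P value can be illustrated with a rough Python sketch. It rests on a loud assumption: that the standard error of the effect is 1/√N, which is only a crude approximation for small standardised effects and is not the derivation from the paper. The function names are made up for the example.

```python
from statistics import NormalDist

def z_from_p(p):
    """Absolute z statistic corresponding to a two-sided P value (Wald test)."""
    return NormalDist().inv_cdf(1 - p / 2)

def contour_n(effect, p):
    """Number of participants at which `effect` gives P value `p`.

    ASSUMPTION: SE = 1 / sqrt(N), so z = effect * sqrt(N).
    This is a simplification, not the published contour formula.
    """
    return (z_from_p(p) / effect) ** 2

def implied_effect(n, p):
    """Effect size a study of `n` participants needs to reach P value `p`,
    under the same assumption."""
    return z_from_p(p) / n ** 0.5

# The small-study versus large-study example from the text, both at P = 0.001
for n in (20, 20_000):
    print(n, round(implied_effect(n, 0.001), 3))

# Points along one contour: the P value shrinks as N grows for a fixed effect
for p in (0.1, 0.01, 0.001):
    print(p, round(contour_n(0.10, p)))
```

Under this simplification the 20-person study needs an effect of roughly 0.74 to reach P = 0.001, while the 20,000-person study needs only about 0.02, which is the pattern the contours draw.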

[Figure: albatross plot of the IGF-I studies, with effect contours]

Now that the plot has contours, it’s much easier to see what the overall effect size might be. The effect contours were for standardised beta coefficients, where the beta is the number of standard deviations (SDs) change in outcome for a one-SD increase in exposure. It’s not the most intuitive effect estimate, but it has good statistical properties (unlike normal beta coefficients). For our purposes, a standardised beta coefficient of 0.05 was a small effect, 0.10 was small to medium, and 0.25 was medium to large. Our exact wording in the paper is here:

Of the 31 data points (from 28 studies) included in the main IGF-I analysis, 29 data points showed positive associations of milk and dairy intake with IGF-I levels compared to two data points that showed negative or null associations. The estimated standardized effect size was 0.10 SD increase in IGF-I per 1 SD increase in milk (estimated range 0.05–0.25 SDs), from observation of the albatross plot (Fig. 3a). The combined p value for a positive association was 2.2×10^−27. All studies with non-Caucasian subjects displayed a positive association between milk intake and PCa risk; in particular, two studies [30, 52] had p values of 0.0001 and 0.001, respectively, and both studies had a sample size of less than 100. When considering only Caucasians, the overall impression from the albatross plot did not change; the effect estimate was still considered to be around 0.10 SD.

Of the 31 data points, 18 had an estimated standardized effect size between 0.05 and 0.25 SDs and four had an effect size of more than 0.25 SD. Eleven of these data points (61%) used milk as an exposure, including two that had an estimated standardized effect size of more than 0.25 SD [30, 53].

Drawing and interpreting albatross plots

For the plots to be drawn, all that is (generally) required is the total number of participants, the P value and the effect direction. However, some statistical methods either require more information, or require an assumption, to actually draw the contours. For example, standardised mean differences require the ratio of the group sizes to be specified (or assumed), so the contours are drawn for a specific effect size given a particular group size ratio (e.g. 1 case for 1 control). If you can assume all the studies have a reasonably similar group size ratio, then it’s generally fine to set the contours to be created at this ratio. If the studies all have very different ratios, then this is more of a problem, but we’ll get to that in a bit.

In general, if the studies line up along a contour, then the overall effect size will be the same as the contour. If the studies fall basically around the contour (but with some deviation), then the overall effect size will be the same, but you’ll be less certain about the result. The corresponding situation in a meta-analysis would be studies falling widely around the effect estimate in a forest plot, with a larger confidence interval as a result. If the studies don’t fall around a contour, and are scattered across the width of the plot, then there is likely no association. If the larger and smaller studies disagree, then there may be some form of bias in the studies (e.g. small-study bias).

When interpreting the plots, I tend to focus on how closely the studies fit around a particular contour, and state what that contour might be. I also tend to give a range of possible values for the contour, so if 90% of the studies fall between two contours I will mention them. It is incredibly important not to overstate the results – this is a helpful visual aid in interpreting the results of a systematic review, but it is not a meta-analysis, and you certainly can’t give an exact effect size and leave it at that. If an exact effect size is needed, then a meta-analysis is the only option.
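Reading "what share of studies fall between two contours" off a plot can be mimicked numerically. The Python sketch below uses made-up studies and the same crude assumption of SE = 1/√N to back out an implied effect for each study, so it illustrates the idea rather than the method used in the paper or the Stata program.

```python
from statistics import NormalDist

def implied_effect(n, p, direction):
    """Signed effect implied by a study's N and two-sided P value.

    ASSUMPTION: SE = 1 / sqrt(N); a simplification for illustration only.
    """
    z = NormalDist().inv_cdf(1 - p / 2)
    return direction * z / n ** 0.5

def share_between(studies, low, high):
    """Proportion of studies whose implied effect lies between two contours."""
    effects = [implied_effect(n, p, d) for n, p, d in studies]
    inside = sum(low <= e <= high for e in effects)
    return inside / len(effects)

# Hypothetical studies: (participants, two-sided P value, direction)
studies = [
    (400, 0.04, +1),
    (1500, 0.001, +1),
    (90, 0.30, +1),
    (2500, 0.0001, +1),
    (120, 0.60, -1),
]

print(share_between(studies, 0.05, 0.25))
```

A share like this is a description of where the points sit, not an estimate with a confidence interval, which is why it cannot stand in for a meta-analysis.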

Stata code

The plots can be created in any statistical program (I imagine Excel could do it, I just wouldn’t want to try), and I wrote a page of instructions on how to create the plots that I think was cut from the journal paper at the last minute. I will dig it out and post it for anyone that might be interested. However, if you have Stata, you don’t need to create your own plots, because I spent several weeks writing code so you don’t have to.

Writing a statistical program was a great learning experience, which I’m fairly certain is what people say after doing something that was unexpectedly complicated and time-consuming. The actual code for producing an individual plot is trivial; it can be done in a few lines. But the code for making sure that every plot that could be made works and looks good is much more difficult. Trying to think of all the ways people using the program could make it crash took most of the time, and required a lot of fail-safes. I’ll do a future post specifically about the code, and say for now that in general, writing a program is not too different from writing ordinary code, but if it involves graphics it might take much longer than expected.

If you want to try out albatross plots (and I hope that you do), and you have Stata, you can install the program by typing: ssc install albatross. The help file is very detailed, and there’s an extra help file containing lots of examples, so hopefully everything will be clear. If not, let me know. I’ll do a video at some point showing how to create plots in Stata, for those that prefer a demonstration to reading through help files.


The program has a feature that isn’t discussed in the paper, as I still need to prove that it works fine all the time. I’m pretty sure it does, and have used it in all the papers albatross plots are a feature of. The feature is called adjust, and it does just what it says: it adjusts the number of participants in a study to the effective sample size. This is the term for the number of participants that would have been required if something had been different, so if the ratio of group sizes was 1 (when in fact it was 3.2, or anything else). The point is to make the results from studies with different ratios comparable: statistical power is highest when the group ratio is 1, so studies with high group ratios will look like they have smaller effect sizes than they do, just because they have less power. I will write another post about adjust when I can, as it is a very useful option that definitely raises the interpretability of the plots.
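One common way to define an effective sample size for a two-group comparison is sketched below in Python. This is an assumption for illustration: it is the standard result for the precision of a difference between two groups, and is not necessarily the calculation the adjust option performs.

```python
def effective_sample_size(n1, n2):
    """Total N of an equally split study with the same precision.

    The variance of a difference between two groups is proportional to
    1/n1 + 1/n2. An equally split study of total size N has 4/N, so
    N_eff = 4 / (1/n1 + 1/n2) = 4 * n1 * n2 / (n1 + n2).
    ASSUMPTION: this may differ from what the Stata `adjust` option does.
    """
    if n1 <= 0 or n2 <= 0:
        raise ValueError("group sizes must be positive")
    return 4 * n1 * n2 / (n1 + n2)

# Hypothetical study with a group size ratio of 3.2 (320 versus 100)
print(effective_sample_size(320, 100))

# A balanced study is unchanged
print(effective_sample_size(100, 100))
```

With 320 and 100 participants, the 420 people in the study carry about as much information as roughly 305 people split evenly, which is why unbalanced studies look less impressive on the plot than their raw size suggests.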

Uses of the plots

So far, I have identified two main uses of the plots. The first use was shown in the paper reviewing whether , and is when a meta-analysis just isn’t possible given the data that are available. The World Health Organisation also used the plots this way in a systematic review of , where again, meta-analysis was not possible.

The second use of the plots is to compare the studies that were included in a meta-analysis with those that couldn’t be included in the meta-analyses because they lacked data. This allows you to determine whether the studies that couldn’t be included would have changed the results if they could have been included. I have yet to publish my results, but I used the plots in my thesis to compare the results of studies that were included in a meta-analysis looking at the association between body-mass index and prostate-specific antigen to those that couldn’t be included. We found three out of the four studies were consistent, but one wasn’t (I mean, it isn’t as if it’s completely on the other side of the plot, but it certainly isn’t as close to the other studies as I’d like). The study that wasn’t consistent was conducted in a different population to all other studies, so we made sure to note there were limits on the generalisability of our results. The effect contours are for standardised beta coefficients again. The meta-analysis of the included studies gave a result of a 5.16% decrease in PSA for every 5 kg/m² increase in BMI. This is roughly equivalent to a standardised beta of -0.05, which is good because that’s what I’d say is the magnitude of effect in the albatross plot below.



I think albatross plots are pretty useful, and made my life easier when otherwise I would have had to write a long narrative synthesis. By putting the code online, we made sure that people who wanted to use the plots could. By providing a tonne of examples, hopefully people will know how to use the plots too.

Incidentally, the albatross in the name refers to the fact the contours looked like wings. Rejected names include swan plot, pelican plot and pigeon plot. Oh, and if any of you are thinking albatrosses are bad luck, they were until a captain decided to . A lesson for us all.