This post was going to be a description of medical studies – what they are for, how they are done, etc. – but in researching the topic I came across the Series on Evaluation of Scientific Publications, and I decided to write a post detailing this useful resource for understanding (medical) literature.
The series started in 2009 and is written with a medical audience in mind, particularly physicians who need to read and evaluate medical literature. It runs through (unsurprisingly) how to evaluate publications, but in doing so gives quite a lot of useful information about medical research in general. It is published by Deutsches Ärzteblatt International, a weekly German-language medical magazine, although the series itself has been translated into English. You certainly don’t need to read all of the articles, and they are fairly easy to read, so feel free to dive into any of them. The articles were written by experts with reference to textbooks, academic articles and their own experiences.
I’ll write about each of the articles briefly, say who I think the articles will be most useful for, and link the PDFs (open-access, so anyone can view them). In a later post, I’ll write about medical research from a more practical perspective, rather than the perspective of a physician needing to understand an academic paper. Fair warning: I haven’t read every paper in full, and I will update this post once I have if anything needs to be changed.
Quick Links to all papers
- Critical Appraisal of Scientific Articles
- Study Design in Medical Research
- Types of Study in Medical Research
- Confidence Interval or P-Value?
- Requirements and Assessment of Laboratory Tests
- Systematic Literature Reviews and Meta-Analyses
- Descriptive Statistics
- Avoiding Bias in Observational Studies
- Interpreting Results in 2 × 2 Tables
- Judging a Plethora of p-Values
- Data Analysis of Epidemiological Studies
- Choosing Statistical Tests
- Sample Size Calculation in Clinical Trials
- Linear Regression Analysis
- Survival Analysis
- Concordance Analysis
- Randomized Controlled Trials
- On the Proper Use of the Crossover Design in Clinical Trials
- Screening
- Establishing Equivalence or Non-Inferiority in Clinical Trials
- Big Data in Medical Science – a Biostatistical View
- Indirect Comparisons and Network Meta-Analyses
- Propensity Score: an Alternative Method of Analyzing Treatment Effects
- The Range and Scientific Value of Randomized Trials
Part 1: Critical Appraisal of Scientific Articles
This article describes the structure of scientific publications (introduction, methods, results, discussion, conclusion, acknowledgements, references), and points out some key things to look for to check the paper is trustworthy. It contains a checklist for evaluating the quality of publications, although many of the criteria are study-specific and don’t apply to all studies – there are better tools for assessing risk of bias (which is essentially what the checklist is getting at), which I’ll write about later.
Definitely worth reading if you are just starting to read medical research.
Part 2: Study Design in Medical Research
This article describes six aspects of study design, which are important to consider both when reading articles and before conducting any studies. The six aspects are:
- Question to be answered: everything else stems from this
- The study population: who the target population should be (e.g. the over 50s, people with asthma)
- The type of study: primary (data collecting) or secondary (data using) research, and experimental, clinical or epidemiological research (see also Part 3)
- Unit of analysis: for example, a patient, cell, organ or study
- Measuring technique: how whatever is recorded is measured
- Calculation of sample size: how many participants/units of analysis are needed to satisfactorily answer the research question
Worth reading if you are just starting to read or conduct medical research.
Part 3: Types of Study in Medical Research
This article classifies primary medical research into three distinct categories: basic research; clinical research; and epidemiological research. For each category, the article “describes the conception, implementation, advantages, disadvantages and possibilities of using the different study types”, with examples. The three categories include many subcategories, presented in a nice diagram.
Worth reading for background information on study types; I will cover something similar in a future post.
Part 4: Confidence Interval or P-Value?
This article describes P values and confidence intervals, pretty essential concepts all medical researchers should understand. Helpfully, the article describes the pitfalls of P values, such as dichotomising the result of a statistical test, and explains the difference between statistically significant (a phrase banned in my research department) and clinically significant (much more useful).
Very useful reading for anyone who wants to understand more about P values and confidence intervals. I will also do a post about P values in the future, because it is such an important topic.
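To make the link between the two concrete, here is a minimal Python sketch of my own (it is not taken from the article). It computes a risk difference between two groups, its 95% confidence interval and a two-sided P value, all using simple normal approximations. The trial numbers are invented and the function name `risk_difference_summary` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def risk_difference_summary(events_a, n_a, events_b, n_b, level=0.95):
    """Risk difference (A minus B) with a Wald confidence interval and two-sided P value."""
    p_a, p_b = events_a / n_a, events_b / n_b
    diff = p_a - p_b

    # Unpooled standard error, used for the confidence interval
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    ci = (diff - z_crit * se, diff + z_crit * se)

    # Pooled standard error under the null hypothesis of no difference
    pooled = (events_a + events_b) / (n_a + n_b)
    se_null = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    p_value = 2 * (1 - NormalDist().cdf(abs(diff / se_null)))
    return diff, ci, p_value


# Invented example: 30/200 events on the new drug, 45/200 on the old drug
diff, ci, p_value = risk_difference_summary(30, 200, 45, 200)
print(f"Risk difference: {diff:.3f}")
print(f"95% CI: {ci[0]:.3f} to {ci[1]:.3f}")
print(f"P value: {p_value:.3f}")
```

With these invented numbers the P value lands just above 0.05 and the interval only just includes zero. Calling that "not significant" and moving on would throw away what the interval is telling you about the plausible range of effects.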
Part 5: Requirements and Assessment of Laboratory Tests
This article is focused on laboratory tests, but gives a description of sensitivity, specificity and positive predictive value, all concepts that are used when designing and interpreting tests (e.g. blood tests predicting disease).
This is useful reading for health practitioners (especially those ordering and interpreting tests), but is pretty specialist information. Saying that, knowing about sensitivity and specificity, or at least knowing that medical tests aren’t generally foolproof, would benefit many people.
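As a rough illustration of these measures (my own sketch, not the article's), the Python below calculates sensitivity, specificity and the predictive values from the four possible test results. The counts are invented: 10,000 people, 1% of whom have the disease, and a test with 90% sensitivity and 95% specificity. The function name `diagnostic_accuracy` is hypothetical.

```python
def diagnostic_accuracy(tp, fp, fn, tn):
    """Summary measures for a test, from true/false positives and negatives."""
    return {
        "sensitivity": tp / (tp + fn),  # diseased people who test positive
        "specificity": tn / (tn + fp),  # healthy people who test negative
        "ppv": tp / (tp + fp),          # positive tests that are truly diseased
        "npv": tn / (tn + fn),          # negative tests that are truly healthy
    }


# Invented example: 100 diseased (90 detected), 9,900 healthy (495 false alarms)
result = diagnostic_accuracy(tp=90, fp=495, fn=10, tn=9405)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

Even with a fairly accurate test, a positive result in this invented population means only about a 15% chance of actually having the disease, because the disease is rare. This is the sense in which tests aren't foolproof.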
Part 6: Systematic Literature Reviews and Meta-Analyses
This article details methods of evidence synthesis (i.e. taking previous study results and combining them for a more complete understanding of a research question), which is one of my areas of expertise.
I will write more in-depth posts about systematic reviews and meta-analyses, but this is useful as an overview.
— The articles become more statistical and specialist from this point on —
Part 7: Descriptive Statistics
Descriptive statistics are used to describe the participants of a study (e.g. their mean age or weight) or any correlations between variables (e.g. as age rises, weight tends to increase [up to a point]). The article describes the difference between continuous variables (called metric in the article; variables that can be measured on a scale, such as height or weight) and categorical variables (variables that can take one of a set number of values, such as gender or nationality), and provides examples of different types of graphs that can be used to display the statistics.
Useful for anyone who has limited experience with statistics, or who wants to know more about common graphs (e.g. scatter graphs, histograms).
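As a quick sketch of what this looks like in practice (my own example with invented data, not the article's), a continuous variable is usually summarised with a mean, median and standard deviation, while a categorical variable is summarised with counts:

```python
from collections import Counter
from statistics import mean, median, stdev

# Invented participants: a continuous variable and a categorical variable
ages = [34, 45, 29, 61, 45, 52]
nationalities = ["German", "British", "German", "French", "British", "German"]

# Continuous: centre and spread
age_summary = {"mean": mean(ages), "median": median(ages), "sd": stdev(ages)}

# Categorical: how many participants fall into each category
nationality_counts = Counter(nationalities)

print(age_summary)
print(nationality_counts.most_common())
```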
Part 8: Avoiding Bias in Observational Studies
The articles start to get more niche from here on out. Observational studies are those where participants are not experimented upon in any way, just observed. This has implications for the results of such studies: if you are comparing two groups (e.g. smokers and non-smokers), then there is generally no way to tell exactly why there are differences between the groups (e.g. it could be smoking, but maybe the differences are because smokers are older, or more likely to drink alcohol, or eat less). This article discusses the problems with observational studies, and some ways to combat these problems.
Useful for those interested in observational studies, but definitely getting more niche now.
Part 9: Interpreting Results in 2 × 2 Tables
A 2 x 2 (read: 2 by 2) table shows the results of a study where the exposure and outcome were both binary. Generally, the rows of the table are exposed or not (e.g. to a new drug that prevents migraines), and the columns are diseased or not (e.g. got a migraine). The number of participants is shown in each cell, for example, the number of participants who received the new drug AND got a migraine might be in the upper left cell, the number who received the drug AND did NOT get a migraine in the upper right cell, the number who did not receive the drug AND got a migraine in the bottom left cell, and finally the number who did not receive the drug AND did NOT get a migraine in the bottom right cell. The reason 2 x 2 tables are useful is because writing that all out is a nightmare. This article describes the tables and the statistics that are often performed with the tables to judge whether being exposed raises, lowers, or doesn’t change the chance of the outcome happening. It also discusses some problems with the tables and their interpretation.
Useful reading for interpreting 2 x 2 tables, but otherwise unlikely to be useful.
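Here is a minimal sketch of the usual calculations, using the cell layout described above (upper left, upper right, bottom left, bottom right). The migraine counts are invented and the function name `two_by_two` is hypothetical.

```python
def two_by_two(a, b, c, d):
    """Effect measures from a 2 x 2 table.

    a: exposed, outcome        b: exposed, no outcome
    c: unexposed, outcome      d: unexposed, no outcome
    """
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    return {
        "risk_ratio": risk_exposed / risk_unexposed,
        "risk_difference": risk_exposed - risk_unexposed,
        "odds_ratio": (a * d) / (b * c),
    }


# Invented example: 20 of 100 on the new drug got a migraine, 40 of 100 without it
measures = two_by_two(a=20, b=80, c=40, d=60)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

A risk ratio or odds ratio below 1 (or a negative risk difference) suggests the exposure lowers the chance of the outcome, above 1 suggests it raises it, and 1 suggests no change. Note that the odds ratio and risk ratio differ here, which is one of the things that makes these tables easy to misread.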
Part 10: Judging a Plethora of p-Values
If a study conducts multiple statistical tests, then the paper will likely contain many P values. This is good, in that many tests means many pieces of information, but bad because conducting many tests raises the risk that there will be a “significant” result by chance. This article discusses this problem and the ways to combat it. However, it is worth remembering that there is no way to tell whether a study conducted 100 statistical tests and just chose to present the “significant” ones.
Useful for those with an interest in the problem of multiple tests, but this is a problem that is unlikely to affect too many people (the larger problem is that it is impossible to know which tests a study actually ran, since the study authors are the ones reporting them).
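To see how quickly the risk grows, here is a small sketch of my own. It assumes the tests are independent and that there is truly no effect in any of them, which is a simplification. It also shows the Bonferroni correction, one simple and conservative way of compensating. Both function names are hypothetical.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false 'significant' result across independent tests."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test threshold that keeps the overall error rate at or below alpha."""
    return alpha / n_tests


for n in (1, 5, 20, 100):
    print(
        f"{n:>3} tests: chance of a false positive = {familywise_error_rate(n):.3f}, "
        f"Bonferroni threshold = {bonferroni_threshold(n):.4f}"
    )
```

Under these assumptions, 20 tests give roughly a 64% chance of at least one spurious "significant" result, and 100 tests make one almost certain.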
Part 11: Data Analysis of Epidemiological Studies
This article describes epidemiological studies and how they are analysed (if the title didn’t give it away). Epidemiological studies are those that seek to quantify the risks for diseases, how often a disease is diagnosed (incidence) or how many people at any one time have a disease (prevalence). The article helpfully lists how measures such as incidence and odds ratios are calculated, and gives examples of studies.
Very useful background into epidemiological studies, which make up a large proportion of medical research. The description of epidemiology measures is particularly useful.
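As a small illustration of the difference between incidence and prevalence (my own sketch with invented numbers, and hypothetical function names), incidence counts new cases over a period of follow-up, while prevalence counts existing cases at one point in time:

```python
def incidence_rate(new_cases, person_years, per=1000):
    """New cases per `per` person-years of follow-up."""
    return new_cases / person_years * per


def prevalence(existing_cases, population):
    """Proportion of the population with the disease at one point in time."""
    return existing_cases / population


# Invented example: 50 new diagnoses over 10,000 person-years of follow-up,
# and 300 people living with the disease in a town of 20,000
rate = incidence_rate(new_cases=50, person_years=10_000)
proportion = prevalence(existing_cases=300, population=20_000)
print(f"Incidence: {rate:.1f} per 1,000 person-years")
print(f"Prevalence: {proportion:.1%}")
```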
Part 12: Choosing Statistical Tests
P values are calculated differently depending on the data; this article describes the different methods and gives three tables, one detailing statistical tests (e.g. Fisher’s exact test, Student’s t-test) and two detailing which test is appropriate in different situations. This article does not deal with regression analyses (which is the vast bulk of my work), but given the ubiquity of P values in medical literature, it is definitely worth being familiar with the different tests that can be run.
Useful if you want to know more about P value calculations – the tables are particularly useful.
Part 13: Sample Size Calculation in Clinical Trials
The sample size calculation tells you that if you want to be able to find an effect size this large (say the group on a new drug gets 10% fewer infections compared to the old drug), you need this many participants. It’s pretty important if you conduct primary research – most funders won’t like it if you say “I’ll recruit as many people as I can”, but will like it if you say “We need to recruit 628 participants to be confident (i.e. 90% sure) that we will see a risk ratio of 0.9, and here are the calculations to prove it.”
It probably isn’t necessary for those not planning to conduct primary research to know how to calculate the sample size, but an appreciation of why power calculations are important is good.
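For a flavour of what such a calculation involves, here is a minimal sketch using a standard normal-approximation formula for comparing two proportions. It is my own illustration with invented infection rates, so it is not the article's worked example and it does not reproduce the 628 figure above. The function name `sample_size_per_group` is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(p_control, p_treatment, alpha=0.05, power=0.90):
    """Participants needed per group to detect a difference between two proportions."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_beta = NormalDist().inv_cdf(power)           # desired power
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    n = (z_alpha + z_beta) ** 2 * variance / (p_control - p_treatment) ** 2
    return ceil(n)


# Invented example: infections fall from 30% on the old drug to 20% on the new one
n_90 = sample_size_per_group(0.30, 0.20, power=0.90)
n_80 = sample_size_per_group(0.30, 0.20, power=0.80)
print(f"90% power: {n_90} per group")
print(f"80% power: {n_80} per group")
```

The pattern to notice is that smaller differences and higher power both push the required number of participants up, and quickly.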
Part 14: Linear Regression Analysis
Linear regression is an analysis of the association between two continuous variables (e.g. height and weight), with the option to account for multiple confounders (variables that might be associated with both the exposure and the outcome). This article describes linear regression and other regression models, discusses some of the factors to consider when using the models, and gives several examples.
A useful article for anyone needing to conduct or interpret a regression analysis.
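As a sketch of the simplest case (one exposure, one outcome, no confounders), the Python below fits a straight line by ordinary least squares. The heights and weights are invented and the function name `fit_line` is hypothetical; in real work you would use a statistics package, which also gives confidence intervals and lets you add confounders.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented example: height in cm and weight in kg for five people
heights = [150, 160, 170, 180, 190]
weights = [52, 58, 66, 75, 79]
slope, intercept = fit_line(heights, weights)
print(f"Each extra cm of height is associated with {slope:.2f} kg more weight")
print(f"Predicted weight at 175 cm: {intercept + slope * 175:.1f} kg")
```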
Part 15: Survival Analysis
Survival analysis is as it sounds: an analysis of how long participants survive (although note that “survival” could be how long a person goes without contracting a disease or receiving a test, not just how long until they die), and any factors that are associated with survival time. This article describes survival analysis and points out some things to consider when conducting or interpreting survival analyses.
Useful if you want to know about survival analysis.
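One widely used building block of survival analysis is the Kaplan-Meier estimate, which handles participants who leave the study before the event happens (censoring). Below is a minimal sketch of my own with invented follow-up times; the function name `kaplan_meier` is hypothetical.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate.

    times:  follow-up time for each participant
    events: True if the event was observed, False if the participant was censored
    Returns a list of (time, estimated probability of surviving beyond that time).
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, ei in zip(times, events) if ti == t and ei)
        leaving = sum(1 for ti in times if ti == t)  # events plus censored at t
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Invented example: follow-up in months for seven participants
times = [2, 3, 3, 5, 8, 8, 10]
events = [True, True, False, True, False, True, False]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(f"Month {t}: {s:.3f}")
```

Censored participants still count towards the number at risk up to the point they leave, which is why they cannot simply be dropped from the analysis.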
Part 16: Concordance Analysis
A concordance analysis assesses the degree to which two measuring or rating techniques agree, for instance the agreement between two tests for the volume of a tumour. Often, the gold-standard test (the best test at the time) will be compared with a cheaper, less intrusive or newer alternative test. This article describes concordance analyses and points out some things to consider when conducting or interpreting concordance analyses.
Useful if you want to know about concordance analysis.
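For two techniques that measure the same continuous quantity, one common approach is a Bland-Altman analysis: take the difference between the two measurements for each patient, then report the mean difference (bias) and the range in which most differences fall (limits of agreement). The sketch below is my own, with invented tumour volumes and a hypothetical function name `limits_of_agreement`.

```python
from statistics import mean, stdev


def limits_of_agreement(method_a, method_b):
    """Bias and approximate 95% limits of agreement between two methods."""
    diffs = [a - b for a, b in zip(method_a, method_b)]
    bias = mean(diffs)
    spread = 1.96 * stdev(diffs)
    return bias, bias - spread, bias + spread


# Invented example: tumour volume (cm^3) in five patients, measured two ways
gold_standard = [10.2, 12.5, 9.8, 14.1, 11.0]
new_method = [10.0, 12.9, 9.5, 13.8, 11.4]
bias, lower, upper = limits_of_agreement(gold_standard, new_method)
print(f"Bias: {bias:.2f} cm^3")
print(f"Limits of agreement: {lower:.2f} to {upper:.2f} cm^3")
```

Whether limits of that width are acceptable is a clinical judgement, not a statistical one.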
Part 17: Randomized Controlled Trials
Randomised controlled trials (often abbreviated to RCTs, and yes, I’m using the British spelling) are an incredibly useful method of determining which of two or more treatments is better. The idea is that patients are randomised to a treatment, with the hope that there will be no baseline differences overall between the participants in any treatment group (so ages, weights, ethnicities etc. will be equal between groups). Therefore, any difference in the outcome (e.g. developing the disease, death, recovery time) will be entirely due to the difference in treatments. However, RCTs need to be well-conducted, as even the slightest amount of bias can render the study meaningless. For instance, if the outcome is the amount of pain, then if a study participant believes they are on a less effective drug, they might experience more pain than if they were on a drug they believed to be more effective (placebo effect, and the reason branded anti-pain meds [analgesics] are in nicer boxes and cost 10 times as much, even though they are the same as the unbranded meds). This article describes RCTs, and provides a helpful table of how an RCT should be reported.
RCTs are pretty important for medical research, and it is definitely worth reading this article to be certain you can identify why some medical research is considered brilliant and unbiased, and some is considered useless.
— The articles become much more specialist from this point on —
Part 18: On the Proper Use of the Crossover Design in Clinical Trials
A crossover trial is one where patients are randomised to different, consecutive treatments, so each patient takes all the different treatments at different times. This means each patient serves as their own control – the response to the treatment of interest can be compared to other treatments (such as placebo pills or the gold-standard treatment) – making problems such as confounding less of an issue. Crossover trials are likely only useful for chronic conditions (e.g. pain), as an acute condition may improve before the next treatment can be started. This article describes crossover trials and details the statistical methods to appropriately analyse the results.
This article is only worth reading if you need to know about crossover trials.
Part 19: Screening
Screening is testing patients without any symptoms of a disease to see whether they are likely to have said disease. Breast cancer and bowel cancer screening are regularly performed in the UK, with the aim of finding those cancers while they are still curable. Screening in general has many established problems, for example finding cancers that would never have caused harm (overdiagnosis), leading to unnecessary treatment (overtreatment). This article discusses screening and its potential problems, using breast cancer screening as an example.
Knowing about screening is important as most people in developed countries will be offered a screening test in later life, so this article may well be worth a read.
Part 20: Establishing Equivalence or Non-Inferiority in Clinical Trials
The purpose of RCTs is often to find out whether a new treatment is better than an old one. The purpose of equivalence or non-inferiority studies is to determine whether a new treatment (usually one that is cheaper or has fewer side-effects) is at least as good as (equivalence) or not much worse than (non-inferiority) the old one. It’s almost impossible to prove that two treatments are exactly the same in terms of their outcome, since effect estimates are never absolutely precise (there is always allowance for random error), so the statistics involved in equivalence/non-inferiority trials are different to those used in standard RCTs. This article discusses these trials.
Pretty specific paper only useful for those involved with equivalence or non-inferiority trials.
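The basic logic of a non-inferiority analysis can be sketched as follows: decide in advance how much worse the new treatment is allowed to be (the margin), then check that the whole confidence interval for the difference sits above that limit. This is my own simplified illustration using a normal approximation; the recovery counts, the margins and the function name `non_inferiority_check` are all invented.

```python
from math import sqrt
from statistics import NormalDist


def non_inferiority_check(success_new, n_new, success_old, n_old, margin, level=0.95):
    """Return (is_non_inferior, lower confidence limit) for new minus old success rate."""
    p_new, p_old = success_new / n_new, success_old / n_old
    diff = p_new - p_old
    se = sqrt(p_new * (1 - p_new) / n_new + p_old * (1 - p_old) / n_old)
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    lower = diff - z_crit * se
    # Non-inferior if even the pessimistic end of the interval beats the margin
    return lower > -margin, lower


# Invented example: 168/200 recover on the new drug, 170/200 on the old drug
ok_wide, lower = non_inferiority_check(168, 200, 170, 200, margin=0.10)
ok_narrow, _ = non_inferiority_check(168, 200, 170, 200, margin=0.05)
print(f"Lower confidence limit: {lower:.3f}")
print(f"Non-inferior with a 10-point margin: {ok_wide}")
print(f"Non-inferior with a 5-point margin: {ok_narrow}")
```

The same data pass with a generous margin and fail with a strict one, which is why the margin has to be chosen and justified before the trial starts.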
Part 21: Big Data in Medical Science – a Biostatistical View
Big data is defined in this article as a dataset “on the order of a magnitude of a terabyte”, which is indeed big. This article discusses big data and procedures used to analyse it (such as machine learning).
Probably a useful article for many to read, as big data is fast becoming used in everything from genetics to business.
Part 22: Indirect Comparisons and Network Meta-Analyses
An indirect comparison is one where instead of comparing two treatments against each other (A versus B), you use two comparisons with a common comparator treatment (A versus C, and B versus C) to infer the difference between two treatments. This can be fairly intuitive: if A is better than C, and C is better than B, then A must be better than B (and how much better can be calculated statistically). A network meta-analysis is a meta-analysis that includes indirect comparisons, so not only are studies comparing A and B included, but studies that compare A with C, and B with C too. The aim is to use as much information as possible to arrive at the most informed answer. This article discusses indirect comparison and network meta-analyses, clarifies the assumptions that must be made, and provides a helpful checklist for evaluation of network meta-analyses.
In my experience, indirect comparisons are not commonly used outside of network meta-analyses, and this article is therefore only really useful for those interested in network meta-analyses.
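The "calculated statistically" part can be sketched with a simple adjusted indirect comparison (often called the Bucher method): subtract the two effects on the log scale and add their variances. The risk ratios and standard errors below are invented, the function name `indirect_comparison` is hypothetical, and the result is only trustworthy if the A versus C and B versus C trials are similar enough to be compared.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def indirect_comparison(rr_ac, se_log_ac, rr_bc, se_log_bc, level=0.95):
    """Indirect risk ratio for A versus B via a common comparator C."""
    log_rr_ab = log(rr_ac) - log(rr_bc)
    # Uncertainty from both comparisons carries over, so the variances add
    se_log_ab = sqrt(se_log_ac ** 2 + se_log_bc ** 2)
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    lower = exp(log_rr_ab - z_crit * se_log_ab)
    upper = exp(log_rr_ab + z_crit * se_log_ab)
    return exp(log_rr_ab), lower, upper, se_log_ab


# Invented example: A vs C risk ratio 0.80, B vs C risk ratio 0.90
rr_ab, lower, upper, se_ab = indirect_comparison(0.80, 0.10, 0.90, 0.12)
print(f"Indirect risk ratio, A vs B: {rr_ab:.2f} ({lower:.2f} to {upper:.2f})")
```

The indirect estimate is always less precise than either of the direct comparisons it is built from, which is part of the reason direct evidence is preferred when it exists.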
Part 23: Propensity Score: an Alternative Method of Analyzing Treatment Effects
When analysing the results of a non-randomised study (e.g. an observational study), confounding is always an issue. One way to deal with this is to include measured variables as covariates in a regression model, accounting for differences between groups (e.g. in age, height, gender). Propensity scores are an alternative method to using covariates, where the probability of an individual receiving a treatment is calculated using the observed variables, and this is used to account for any differences between groups instead of covariates in a regression model. This article introduces propensity scores, describes four methods of using them and compares propensity scores and regression models.
In my experience, regression with covariates is by far the more common method of accounting for confounding, so this article may be useful when coming across studies using propensity scores or if you are considering using them, but otherwise may be a bit specialist.
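To give a feel for the idea, here is a deliberately simplified sketch of one common way of using propensity scores, inverse probability weighting. It is my own toy example: there is a single confounder (age group), the propensity score is just the proportion treated within each age group, and all counts and function names are invented. Real analyses usually estimate the score with a regression model using many variables, and the approach breaks down if any group is entirely treated or entirely untreated.

```python
from collections import defaultdict


def expand(stratum, treated, events, total):
    """Build individual (stratum, treated, outcome) records from counts."""
    return [(stratum, treated, 1)] * events + [(stratum, treated, 0)] * (total - events)


def naive_risk_difference(records):
    """Treated risk minus untreated risk, ignoring the confounder."""
    risk = {}
    for arm in (0, 1):
        outcomes = [y for _, t, y in records if t == arm]
        risk[arm] = sum(outcomes) / len(outcomes)
    return risk[1] - risk[0]


def ipw_risk_difference(records):
    """Risk difference after weighting each person by 1 / P(treatment they received)."""
    n_total, n_treated = defaultdict(int), defaultdict(int)
    for stratum, treated, _ in records:
        n_total[stratum] += 1
        n_treated[stratum] += treated
    propensity = {s: n_treated[s] / n_total[s] for s in n_total}

    weighted_events = {0: 0.0, 1: 0.0}
    total_weight = {0: 0.0, 1: 0.0}
    for stratum, treated, outcome in records:
        e = propensity[stratum]
        weight = 1 / e if treated else 1 / (1 - e)
        weighted_events[treated] += weight * outcome
        total_weight[treated] += weight
    return weighted_events[1] / total_weight[1] - weighted_events[0] / total_weight[0]


# Invented example: older patients are more likely to be treated and to have the outcome
records = (
    expand("young", 1, 2, 20) + expand("young", 0, 16, 80)
    + expand("old", 1, 40, 80) + expand("old", 0, 12, 20)
)
naive = naive_risk_difference(records)
weighted = ipw_risk_difference(records)
print(f"Naive risk difference: {naive:+.2f}")
print(f"Weighted risk difference: {weighted:+.2f}")
```

In this invented dataset the naive comparison makes the treatment look harmful, purely because older and sicker patients were more likely to receive it. Once the groups are reweighted to be comparable, the treatment looks beneficial, matching what you see within each age group.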
Part 24: The Range and Scientific Value of Randomized Trials
RCTs are brilliant, but sometimes the methods have to be adapted to face a particular challenge. This article details some of the variations of RCTs that are useful in specific circumstances (and provides a helpful table).
This final article in the series (as of October 2017) is very specific to RCTs, and won’t be useful for anyone not interested in specialist RCT methods.