Sean Harrison: Blog

Introduction: Questions about Stata, R and other things

First off, let’s answer some burning questions about Stata and R, and about some other things. Yes, burning. These are very important questions (that I could think of reasonably quickly).

What are they?

Stata and R are both statistical packages. You feed in data, and then usually write code to analyse the data. Stata and R both have great facilities for cleaning data, running most statistical tests on it, producing graphs and tables, and these days exporting results straight to Word, PDF, LaTeX or Excel.

StataCorp LLC created Stata, and calls it:

an integrated statistics, graphics, and data management solution for anyone who analyzes data

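Ross Ihaka and Robert Gentleman created R at the University of Auckland, building on the S language developed at Bell Laboratories, and the R Project calls it: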

an integrated suite of software facilities for data manipulation, calculation and graphical display

So pretty similar then.

One major difference between the two packages is that StataCorp charge people to use Stata, whereas R is completely free. Both packages allow anyone who uses them to create and distribute statistical programs; indeed, often the programs I run most frequently are those written by people using the packages, not people who created the packages.

Why use them?

Stata and R are both great packages to manipulate and analyse data. They both allow the user to write code, and then use this code to do everything one could want to the data. Why is code so great? From my experience:

  1. Everything is reproducible – you can start with your initial data, clean it up, analyse it, produce tables and graphs, and save all of the output. If the initial data changes a little, no problem, just run the code again.
  2. You can check through what’s been done months after you’ve forgotten what it was you did.
  3. Errors can be found and fixed without having to completely redo the analysis.
  4. Repeating an analysis every month becomes as simple as loading new data and clicking “run” on the code.
  5. Code can be checked by other people. This is great both within organisations (e.g. when someone needs help) and for academics at peer-review. Although code is not currently routinely checked at peer-review, it could be. We take a lot on faith – we assume that everything done to the initial data in a study has been completed correctly – and I suspect this is highly unlikely to always be true.
  6. Learning to use statistical code takes a lot of time initially, but overall, it is a massive time-saver. The more code you know, the less you need to type, and the quicker things get done.
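To make the points above a bit more concrete, here is a minimal sketch of a load, clean, analyse pipeline. It is written in Python purely for illustration (the same idea applies to a Stata do-file or an R script), and the data, column names and function names are all made up for the example.

```python
import csv
import io
import statistics

# Hypothetical raw data; in practice this would be read from a file.
RAW_CSV = """id,weight_kg
1,70.5
2,
3,82.0
4,not recorded
5,64.5
"""


def load(text):
    """Read CSV text into a list of rows (one dict per row)."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Keep only rows whose weight can be read as a number."""
    cleaned = []
    for row in rows:
        try:
            cleaned.append({"id": int(row["id"]),
                            "weight_kg": float(row["weight_kg"])})
        except (TypeError, ValueError):
            # Blank or non-numeric weight: drop the row.
            continue
    return cleaned


def analyse(rows):
    """Summarise the cleaned data."""
    weights = [r["weight_kg"] for r in rows]
    return {"n": len(weights), "mean_kg": statistics.mean(weights)}


def run(text):
    """The whole analysis, start to finish, in one repeatable step."""
    return analyse(clean(load(text)))


result = run(RAW_CSV)
print(result)
```

If the initial data changes, calling `run` again on the new data repeats every step in the same order, and anyone checking the work can read `clean` to see exactly which rows were dropped and why.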

What can I do with them?

Basically anything statistical. Analyse and manipulate data however you like, produce tables and graphs, anything where you have some numbers (or letters or words) and you want to do something with them.

Is one better than the other?

I think that’s a personal question. Apart from the fact that Stata costs money and R does not, there aren’t too many differences. Stata is more intuitive for people who have used spreadsheets, since at any time you can click a button and load up a view of your data in a spreadsheet, but R allows you to do more at once, as many datasets can be loaded at the same time. For some purposes Stata is faster, for others R is faster, both in terms of how much code is needed and how fast it runs.

Both packages have a good community of users who develop programs within each, so whether Stata or R is better may depend more on the purpose you are using a stats package for, rather than a blanket “one is better than the other”.

One point that sometimes comes up is that Stata limits the number of variables (columns) allowed in any one dataset, whereas R is limited only by your computer. The newest release of Stata (version 15, out June 2017) comes in three editions at increasing prices. The cheapest edition, Stata/IC, has a maximum of 2,048 variables and 2.14 billion observations (rows). The most expensive edition, Stata/MP, has a maximum of 120,000 variables and 20 billion observations. For many non-genetic purposes, Stata is absolutely fine. For genetic purposes, where there legitimately can be tens of thousands of variables, R is generally considered better, which is probably why genetics research does seem to favour R.
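As a rough illustration of how those limits play out, the sketch below checks which Stata editions could hold a dataset of a given size. It is in Python purely for illustration, the function name is made up, and the limits are just the rounded Stata 15 figures quoted above, so check the current documentation before relying on them.

```python
# Limits as quoted in this post for Stata 15: (max variables, max observations).
# "2.14 billion" is a rounded figure, so treat the numbers as approximate.
STATA_LIMITS = {
    "Stata/IC": (2_048, 2_140_000_000),
    "Stata/MP": (120_000, 20_000_000_000),
}


def editions_that_fit(n_variables, n_observations):
    """Return the editions whose quoted limits can hold the dataset."""
    return [
        name
        for name, (max_vars, max_obs) in STATA_LIMITS.items()
        if n_variables <= max_vars and n_observations <= max_obs
    ]


print(editions_that_fit(500, 10_000))      # a typical non-genetic dataset
print(editions_that_fit(50_000, 5_000))    # tens of thousands of variables
print(editions_that_fit(500_000, 5_000))   # more variables than either limit
```

The first dataset fits in both editions, the second only in Stata/MP, and the third in neither, which is the situation where R becomes the more practical choice.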

How much do they cost?

R is free, and can be downloaded from their website.

Stata costs a varying amount depending on whether you are a student, business or institution, whether you want an annual or perpetual licence, and which edition of Stata you want. For a student, Stata costs $198 for Stata/IC (entry-level), $395 for Stata/SE (mid-level), $695 for Stata/MP 2-core and $995 for Stata/MP 4-core (top-level), all in US dollars and all perpetual licences. For a single university staff member, those prices rise to between $595 and $1,495 for an individual licence. For a single government, non-profit or business staff member, the prices rise again to between $1,195 and $2,295. Annual licences, which need renewing every year, are half the price of the perpetual licences. Stata/MP can also be purchased with more cores, so it will work faster on supercomputers. For the UK, Stata needs to be purchased from the UK distributor.

Why aren’t you covering [my favourite statistical software]?

I have been a researcher for about 5 years now, and have only used Stata and R. I am aware that other universities (and even different schools within my university) use other statistical software. I haven’t used these programs and know nothing about them, so I’ll stick to Stata and R. If I need to learn a new program at any time, I’ll write posts comparing the new and old programs.

I’ll say that even though Excel is extremely limited for data analysis (in terms of difficulty, replicability, consistency etc.), I still use it frequently for certain tasks. For instance, if you want a table in Stata, it’s often handy to export it to Excel before putting it in Word.

Do you favour Frequentist or Bayesian statistics?

What an oddly specific question I have just asked myself. Statistical methods tend to fall into two camps, Frequentist or Bayesian, and I will cover the differences in a future post. For now, I’ll say that in the absence of a sensible prior, you may as well be a Frequentist, but why wouldn’t you use other information otherwise? That will hopefully make way more sense after I write the Frequentist/Bayesian post. XKCD has a comic about the difference if you’re particularly interested now.

Why evidence synthesis for medical research?

Because this is my field. I have no experience of evidence synthesis for other disciplines, although I would imagine that medical research has one of the greatest demands for evidence synthesis given the number of medical studies performed daily around the world looking at often very similar things.

I would be interested to know if other disciplines perform evidence synthesis in different ways, so please let me know if you work in a different field and think I should be using a different method to do something.

Will you be sharing any data or results on this blog?

Maybe.

I will never share any data that is from actual studies, but I am all for open sharing of all data and I will share what I can. The results of systematic reviews and meta-analyses are fair game, as is any code I write, simulations I run or methods I develop.

What if someone spots an error in your work?

Please let me know, preferably as soon as possible so I can correct it quickly, but even if I made a mistake years ago I’d like to know.

Errors in academic papers can take an eternity to correct, can lead to the wrong conclusions being drawn (although this is rare), and can also be intensely embarrassing. Making a mistake before publication is fine so long as it is spotted.

I have no problem being wrong. Being corrected is the best way to learn about my mistakes, and while I would prefer to be right, given the option, I’d much rather be wrong and know about it than be wrong and be ignorant of it.

Is data singular or plural?

It depends.

A datum is a single data point, but if you are talking about data in the sense of a collection of results or observations, then it could be either. “These data show that…” and “the data shows that…” are both acceptable.

I personally prefer data as a singular though. “These data” sounds odd to me.