Summary
Earlier this week we posted the first paper in a series about a 48-replicate RNA-seq experiment (Gierliński et al, 2015). Today, the second paper appeared on arXiv (Schurch et al, 2015). Both papers are now in print (Gierliński et al, 2015; Schurch et al, 2016).
When we started this work over two years ago, the main questions we were aiming to answer for RNA-seq experiments that study differential gene expression (DGE) were:
1. How many replicates should we do?
2. Which of the growing number of statistical analysis methods should we use?
3. Are the assumptions made by any of the methods in (2) correct?
4. How useful are spike-ins to normalise for concerted shifts in expression?
Paper I (Gierliński et al, 2015) addressed Point 3 in the list. Our second paper looks in detail at Points 1 and 2. The high number of replicates in our experiment allowed us to see how variable the results would be if we had fewer replicates. For example, we took 100 sets of 3 replicates at a time to see the variance (uncertainty) in an experiment with only 3 replicates. We did the same thing for 4 replicates, and so on up to 40 replicates. In effect, the sampling over all the different DGE methods was like performing over 40,000 RNA-seq experiments!
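To make the subsampling concrete, here is a minimal sketch in Python of the idea, using simulated counts and a simple per-gene t-test as stand-ins for the real data and DGE tools; it illustrates the approach, not the pipeline actually used in the paper.

```python
# Toy illustration of the replicate-subsampling idea described above:
# repeatedly draw k replicates per condition from a full 48-replicate
# count matrix and see how variable the number of "significant" genes is.
# The simulated counts and the simple per-gene t-test are illustrative
# stand-ins, NOT the data or DGE methods used in the paper.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)

n_genes, n_reps = 300, 48
# Simulated counts for two conditions; ~10% of genes get a 2-fold shift.
wt = rng.negative_binomial(5, 0.01, size=(n_genes, n_reps))
mut = rng.negative_binomial(5, 0.01, size=(n_genes, n_reps))
mut[: n_genes // 10] *= 2

def count_calls(k, n_draws=100, alpha=0.05):
    """For n_draws random subsets of k replicates per condition,
    return the mean and SD of the number of genes with p < alpha."""
    calls = []
    for _ in range(n_draws):
        wt_cols = rng.choice(n_reps, size=k, replace=False)
        mut_cols = rng.choice(n_reps, size=k, replace=False)
        pvals = [ttest_ind(np.log2(wt[g, wt_cols] + 1),
                           np.log2(mut[g, mut_cols] + 1),
                           equal_var=False).pvalue
                 for g in range(n_genes)]
        calls.append(int((np.array(pvals) < alpha).sum()))
    return float(np.mean(calls)), float(np.std(calls))

for k in (3, 6, 12, 24):
    mean_c, sd_c = count_calls(k)
    print(f"{k:2d} replicates/condition: {mean_c:6.1f} +/- {sd_c:5.1f} genes called")
```

The spread in the number of genes called at low k is the kind of uncertainty we quantified, replicate level by replicate level, across all the DGE tools in the paper.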
The Abstract, Figures and Tables of the paper give a summary of the conclusions, so I won’t repeat them here, but since it is quite unusual to do 48 replicates (well, to our knowledge no one has done this before!) I thought I would briefly summarise why we did it and the kind of lessons we learned from the experiment and its analysis.
Background
My group’s core interests were originally in studying the relationship between protein sequence, structure and function. We still develop and apply techniques and tools in this area such as Jalview, JPred and other more specialised predictive tools (see: www.compbio.dundee.ac.uk). Around 2007, though, we did our first analysis of NGS data (Cole et al, 2009) in collaboration with wet-lab colleagues here in Dundee. This led us into many collaborations on the design and analysis of NGS experiments, in particular experiments to determine changes in gene expression under various experimental and biological stimuli. Since we are in a big molecular/cell biology research centre, our experience spans a wide range of species, biological questions and experiment types.
To begin with we looked at differential gene expression (DGE) by Direct RNA Sequencing (Helicos biotechnology, now seqLL) which eventually led to some publications (e.g. Sherstnev et al, 2012; Duc et al, 2013; Cole et al, 2014; Schurch et al, 2014) using that technique, but later we turned to what has become the “standard” for DGE: Illumina RNA-seq. Irrespective of the technology, we kept facing the same questions:
1. How many replicates should we do?
2. Which of the growing number of statistical analysis methods should we use?
3. Are the assumptions made by any of the methods in (2) correct?
4. How do you deal with concerted shifts in expression, i.e. when a large proportion of genes are affected? Most DGE methods normalise these shifts away (see the short demonstration after this list).
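To make the last point concrete, here is a small, hypothetical demonstration (not taken from our papers) of why a concerted shift is largely invisible to standard normalisation: because most DGE tools assume the bulk of genes are unchanged, the scaling factor itself absorbs the global change.

```python
# Hypothetical demonstration (not from our papers): a concerted shift in
# expression is largely removed by median-of-ratios style normalisation,
# because the per-sample scaling factor absorbs the global change.
import numpy as np

rng = np.random.default_rng(1)

n_genes = 10_000
control = rng.lognormal(mean=5.0, sigma=1.0, size=n_genes)

# Treated sample: 60% of genes genuinely double, the rest are unchanged.
treated = control.copy()
shifted = rng.choice(n_genes, size=int(0.6 * n_genes), replace=False)
treated[shifted] *= 2.0

# Median-of-ratios scaling (the idea behind DESeq-style size factors)
# assumes most genes are unchanged and rescales so the median ratio is 1.
size_factor = np.median(treated / control)
treated_norm = treated / size_factor

unchanged = np.setdiff1d(np.arange(n_genes), shifted)
print(f"size factor inferred from the data:        {size_factor:.2f}")
print(f"apparent fold-change, truly shifted genes: "
      f"{np.median(treated_norm[shifted] / control[shifted]):.2f}")
print(f"apparent fold-change, unchanged genes:     "
      f"{np.median(treated_norm[unchanged] / control[unchanged]):.2f}")
```

With these numbers the inferred scaling factor is 2, so the genuinely doubled genes come out with an apparent fold-change of 1 while the unchanged genes appear halved; spike-ins are one way to anchor the scaling independently of the endogenous genes.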
We wanted clear answers to these questions, because without good experimental design, the interpretation of the results becomes difficult or impossible. Our thinking was (and still is) that if we get good data from a sufficiently powered experiment, the interpretation will be much easier than if we are scrabbling around trying to figure out whether a change in gene expression is real or an artefact. Of course, we also wanted to know which of the plethora of DGE analysis methods we should use. When we tried running more than one, we often got different answers!
The Joy of Benchmarking?
Two or three years ago, when we were worrying about these questions, there was no clear guidance in the literature or from talking to others with experience of DGE, so when Nick Schurch and others in the group came to me with the idea of designing an experiment specifically to evaluate DGE methods, it seemed timely and a good idea! Indeed, most of the group said: "How hard can it be??"
My group has done a lot of benchmarking over the years (mainly in the area of sequence alignment and protein structure prediction), so I know benchmarking is always difficult to do. Indeed, I hate benchmarking, important though it is, because no benchmark is perfect and you are often making some kind of judgement about the work of others. As a result, you want to be as sure as you can possibly be that you have not messed up. As a developer of methods myself, I don’t want to be the one who says Method X is better than Method Y unless I am confident that we are doing the test as well as we can. As a consequence, I think the care you have to take in benchmarking is even greater than the normal care you take in any experiment, and so benchmarking always takes much longer than anyone can predict! Having said all that, I think in this study we have done as good a job as is reasonably possible – hopefully you will agree!
Collaboration
We don’t have a wet-lab ourselves, but we have a lot of collaborators who do, so the work was a very close collaboration between ourselves and three other groups. The experimental design was the result of discussions between the four groups, but Tom Owen-Hughes’ group selected the mutant, grew the yeast and isolated the RNA, while Mark Blaxter’s group at Edinburgh Genomics did the sequencing and "my" group did the data analysis. With the possible exception of growing the yeast and extracting the RNA, no aspect of this study was straightforward!
We settled on 48 replicates since, after doing some simulations, we thought this would be enough to model the effect of replicate number without being prohibitively expensive. Mmmm, it was still quite an expensive experiment…
Why not other species?
Originally, we planned to do this experiment in multiple species, but while we had collaborators working on Arabidopsis, C. elegans and mouse, it was Tom’s yeast team that was first with RNA (within a week of agreeing to do it!), so, since the other groups were still planning, we decided to do an initial analysis in yeast and see what that told us. That initial analysis started in March 2013 and we presented our preliminary findings at the UK Genome Sciences meeting in Nottingham in October that year. It has taken us over a year to get the papers written since everyone in the collaboration is working on other projects as their "main" activity!
What is next?
Early on, we decided to include RNA spike-ins in the experiment. These are known concentrations of RNAs that are added to the experiment to provide a calibration marker. This was a good idea, but it made the lab work and sequencing more complex to optimise. It also confused us a lot in the early stages of the analysis, so we had to do another, smaller-scale RNA-seq experiment to work out what was going on. This will be covered in detail in Paper III since we learned a lot that I hope will be of use/interest to others in the field.
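As a rough sketch of what spike-ins buy you (hypothetical numbers and code, not the design or analysis that will appear in Paper III): because the spike-in RNAs are added in known, equal amounts to every sample, their counts can anchor a per-sample scaling factor that is independent of any global shift in the endogenous genes.

```python
# Hypothetical sketch of spike-in based scaling (not the analysis from
# Paper III): equal spike-in amounts per sample mean the total spike-in
# count estimates that sample's capture/sequencing efficiency,
# independent of genuine global expression changes.
import numpy as np

rng = np.random.default_rng(7)

n_samples = 6
# Toy spike-in counts (rows: three spike-in species, columns: samples).
spike_counts = rng.poisson(lam=[[800], [400], [200]], size=(3, n_samples))

# Per-sample scale factor: total spike-in signal relative to the mean sample.
spike_totals = spike_counts.sum(axis=0)
scale = spike_totals / spike_totals.mean()

# Endogenous gene counts are then divided by these factors instead of
# library-size or median-of-ratios factors, preserving concerted shifts.
gene_counts = rng.poisson(lam=300, size=(5, n_samples))
gene_norm = gene_counts / scale

print("spike-in scale factors:", np.round(scale, 2))
print("normalised gene counts:\n", np.round(gene_norm, 1))
```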
If, after reading the paper, you have comments or questions, we’ll all be happy to hear from you!
Geoff: Fantastically interesting paper. How robust do you think your findings are to:
1. Read length: What if you had sequenced 150 bp or 250 bp?
2. Paired-end versus SE: What if you had done PE sequencing?
3. Depth of sequencing: What if you had sequenced more, or less?
All of these things could have altered the quality of the read mapping, and therefore the results of testing for differential expression. Would any of these things have altered your general recommendations for replication of RNA-seq experiments?
I’ll answer the questions in reverse:
3. Depth is a red herring. With more reps you automatically get more depth, as you’re sampling more of the sequence diversity. With 1 billion reads across 96 reps already, more reads aren’t going to make any difference. Other papers have also covered this, which we mention in the paper.
2. That’s a good question. We chose SE for cost reasons, but we do have some PE data. The biggest difference is the accuracy of mapping: ~5% of genes show a difference in ‘expression’ between PE and SE. However, it won’t change the global distributions nor, I doubt, our conclusions.
1. Similar to (2). Yes, it’ll affect the mapping of some (small? I haven’t checked) number of genes, but it’s unlikely to affect the conclusions.
All the things you mention are technical aspects, but the issue of reps is a function of the biology. With more reps you can understand the biological variability, and thus the differences, better. Technical aspects will change things at the edges, but not the intrinsic biological variability.
I’ll add to Chris’s answer here. On point 2, one of the other big differences with PE data is the incidence of bad replicates. We can’t say with any confidence that PE data results in fewer bad replicates, but the kind of problems we highlighted for the bad replicates in this study (see Gierliński et al, 2015) certainly seem to occur less often. In other (lower-replicate, but still pretty good) data we have for other experiments, the incidence of bad replicates certainly appears lower.
Matt, I’ve nothing really to add to the comments from Chris and Nick! I hope that helps!
Chris, thanks for the answers! I was thinking of substantially fewer reads, not more. Conventional wisdom says more reps are better at the cost of depth, and I assume you support this, correct?
It is these evidence-based recommendations that RNA-seq really needs; thanks for this!