This post is by David Clausen (@davidandacat). David is a PhD Candidate in Biostatistics at the University of Washington and a member of the StatDivLab. David enjoys technical replicates, sensibly-interpreted semiparametric models, and cats.
This post will argue that including technical replicates - repeated measurements on identical samples - is worthwhile in NGS experiments, and it will guide you through one way to use them productively. To that end, we will look at a human 16S dataset to get an overall sense of the scale of variability between replicates, and then we will explore how this variability may affect common analyses and what we can do about it. This is a fairly long post, so it will appear in three parts: in the first, we will examine measurement error in an important dataset; in the second, we will explore how this error may impact inferential analyses; and in the third, we will present a solution that partially (!) addresses measurement error in this context.
Part 1: A (Taxonomically) Broad View of Variability Across Technical Replicates
As with many measurements, the estimates of sample composition we obtain from 16S sequencing are less than perfectly precise. That is, repeatedly performing 16S sequencing on specimens from the same sample will typically give us different estimates of sample composition, even if we use the same sequencing and bioinformatics protocols. For an example of this, we can turn to data published by the Microbiome Quality Control (MBQC) Project, which assessed technical variation in 16S sequencing data generated by many different sequencing laboratories and under many bioinformatics protocols. To start, let's consider technical replicates of a single sample (a stool sample from a healthy 36-year-old man) produced by a single sequencing laboratory (wet laboratory B) and a single bioinformatics laboratory (dry laboratory 2). (For those of you familiar with the MBQC study design, we are only looking at specimens extracted and sequenced at the same laboratory - samples extracted at a central lab prior to shipment for sequencing are excluded here.) The plot below shows measured phylum-level compositions for each replicate.
To some extent, all of these replicates agree - most 16S reads are attributed to Firmicutes, unsurprisingly. However, we can also see that there is a fair amount of variation between replicates: for example, should we believe that the Firmicutes:Bacteroidetes ratio in this sample is in the neighborhood of 2 or 3 to 1, as replicate 2 might have us think, or is it greater than 10 to 1, as (just eyeballing it) replicate 1 suggests? In any case, we can at a minimum conclude that these replicate measurements cannot all reflect the actual composition of the sample with equal fidelity.
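(If you'd like to compute summaries like these yourself, here is a minimal sketch in Python/pandas. The counts below are invented for illustration - they are not the MBQC counts - but the same few lines work on any table of read counts.)

```python
import pandas as pd

# Hypothetical phylum-level read counts for four technical replicates;
# real counts can be substituted once loaded.
counts = pd.DataFrame(
    {
        "Firmicutes":     [9100, 5800, 7200, 6900],
        "Bacteroidetes":  [800, 2600, 1900, 1500],
        "Actinobacteria": [60, 90, 120, 80],
        "Proteobacteria": [40, 110, 80, 120],
    },
    index=["rep1", "rep2", "rep3", "rep4"],
)

# Convert counts to within-replicate proportions (each row sums to 1)
proportions = counts.div(counts.sum(axis=1), axis=0)

# Firmicutes:Bacteroidetes ratio for each replicate -- note how much it
# can vary even though all replicates come from the same sample
fb_ratio = counts["Firmicutes"] / counts["Bacteroidetes"]

print(proportions.round(3))
print(fb_ratio.round(2))
```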
In a moment, we'll look at genus-level data on the same replicates, but first, I'd like to emphasize that because these data are repeat measurements on aliquots from the same sample, if we accept that the MBQC produced uniform aliquots (I do), then the variation we see here is primarily an artifact of sample preparation, sequencing, and bioinformatics. That is, in looking at 16S data, we observe the true sample composition plus some noise that I'll call measurement error. Since we're only looking at measurements on a single sample, all (or at least most) of the differences we see between replicates should be due to noise (i.e., measurement error). We can see this variation because this laboratory sequenced technical replicates, but there is little reason to expect substantially lower noise when technical replicates are not included in a sequencing experiment - omitting them only prevents us from observing the variation due to noise.
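(To make the "truth plus noise" picture concrete, here is a toy simulation. The error model below - a log-scale perturbation of the true composition followed by multinomial counting - is my own illustrative assumption, not a model fit to the MBQC data.)

```python
import numpy as np

rng = np.random.default_rng(1)

# A toy "true" phylum composition for one sample
truth = np.array([0.70, 0.20, 0.06, 0.04])

def one_replicate(truth, n_reads=10_000, sd=0.5):
    # Protocol noise: perturb the true composition on the log scale
    perturbed = np.exp(np.log(truth) + rng.normal(0, sd, size=truth.size))
    perturbed /= perturbed.sum()
    # Counting noise: draw reads multinomially from the perturbed composition
    return rng.multinomial(n_reads, perturbed) / n_reads

# Four noisy measurements of the same underlying truth
replicates = np.stack([one_replicate(truth) for _ in range(4)])
print(replicates.round(3))
```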
Now let's take a look at the same replicates at the genus level. To clean things up a bit, the plot below only shows genera that accounted for at least 0.1% of reads in some replicate and that were assigned to a known genus by bioinformatics (i.e., we've thrown out unclassified reads for the time being).
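(As an aside, here's roughly what that filtering step looks like in Python/pandas. The genus counts are invented for illustration, and whether the 0.1% threshold is computed before or after dropping unclassified reads is a judgment call - below I compute it after.)

```python
import pandas as pd

# Hypothetical genus-level counts; rows are replicates, columns are genera
counts = pd.DataFrame(
    {
        "Prevotella":       [900, 4800, 3100, 2600],
        "Faecalibacterium": [2100, 1500, 1800, 1700],
        "Anaerostipes":     [400, 90, 200, 150],
        "unclassified":     [600, 700, 500, 650],
    },
    index=["rep1", "rep2", "rep3", "rep4"],
)

# Drop reads that were not assigned to a known genus
classified = counts.drop(columns=["unclassified"])

# Within-replicate proportions among classified reads
proportions = classified.div(classified.sum(axis=1), axis=0)

# Keep genera that reach at least 0.1% of reads in at least one replicate
keep = (proportions >= 0.001).any(axis=0)
filtered = proportions.loc[:, keep]
print(filtered.round(3))
```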
Again, we observe substantial variability between replicates: in replicate 1, Prevotella (light peach, labeled in white) accounts for less than 10% of reads, whereas in replicate 2, almost half of reads are attributed to Prevotella. This plot also highlights a difficulty of viewing relative abundance data as proportions: errors in a single taxon affect all other taxa, because we divide by the sum of reads across all taxa to obtain an estimated proportion. It's not clear from this plot, for example, whether the variability we see in Anaerostipes (dark purple, labeled in white) is due to variability in detection of Anaerostipes itself or to variability in detection of Prevotella. That is, perhaps the apparent differences in the proportion of reads attributed to Anaerostipes across replicates are primarily the result of the large variation in Prevotella.
This issue of propagation of error across taxa is in fact a fundamental feature of this kind of data, and not just a peculiarity of this dataset. Since 16S sequencing provides information about relative, but not absolute, abundances, we cannot completely disentangle variation in one taxon from variation in another: we only learn about the abundance of each taxon relative to the abundances of the others. This poses a somewhat tricky problem when our measurements contain error - how should we best view and analyze data when summary statistics for all taxa may be affected by measurement error in a single taxon? For instance, suppose that we knew that replicates 2, 3, and 4 above were contaminated with Prevotella from an outside source - what effect would this have on how we might compare those replicates with replicate 1? Which genera are affected if we do not take this information into account?
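(To see the mechanics of this concretely, consider the following small numeric sketch, again with made-up counts: adding spurious Prevotella reads changes every genus's measured proportion, even though only the Prevotella count changed.)

```python
import pandas as pd

# Hypothetical read counts for one replicate
counts = pd.Series(
    {"Prevotella": 1000, "Faecalibacterium": 2000, "Anaerostipes": 500}
)

# Proportions before contamination
before = counts / counts.sum()

# Suppose contamination adds 2000 spurious Prevotella reads: only the
# Prevotella *count* changes, but every *proportion* changes, because
# all proportions share the same (now larger) denominator.
contaminated = counts.copy()
contaminated["Prevotella"] += 2000
after = contaminated / contaminated.sum()

print(pd.DataFrame({"before": before, "after": after}).round(3))
# Faecalibacterium's count is unchanged, yet its measured proportion
# drops from about 0.57 to about 0.36.
```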
In the next part of this post, we will look at one way of simplifying the problem of analyzing relative abundance data and consider how measurement error may impact our analyses in this case.
References
- MBQC data: Rashmi Sinha et al., 'Assessment of Variation in Microbial Community Amplicon Sequencing by the Microbiome Quality Control (MBQC) Project Consortium,' Nature Biotechnology 35, no. 11 (2017): 1077-1086.
- Code to reproduce our re-analysis of the MBQC data and the figures is available here
- The "repli-cats" photo used under a CC BY-SA 2.0 license from Mathias Erhart
Appendix for the Doubtful
Should you perhaps be thinking that I have pulled the wool over your eyes by cherry-picking a 'bad' sequencing laboratory, here is a multi-laboratory plot of read proportions by phylum for the specimen examined in this post:
By my reckoning, sequencing laboratory B is fairly middling insofar as within-laboratory phylum-level variability in this particular specimen is concerned. However, if you remain unconvinced, we will revisit issues arising from within- and between-laboratory variation in measurements (and beyond!) and continue to examine data across sequencing laboratories in the next two parts of this post. (So stay tuned.)
Yet Another Appendix
It occurs to me as well that you may be wondering why I didn't start with laboratory J, which seems to have quite low variation from replicate to replicate on the specimen we've examined in this post. The short answer is that we will return to laboratory J (and the other laboratories) in subsequent posts. A slightly longer answer is that while low technical variation (i.e., low variability between replicate measurements on the same specimen) is a virtue, it is not the only virtue when it comes to measurement. We might, for example, want to know how close our measurements are to the truth (I certainly do). We can't directly ascertain that with the specimen we've been looking at because we don't know its true composition, but the MBQC wisely also chose to include mock specimens of known composition. Here is a plot of measured phylum-level relative abundances (i.e., the proportion of reads assigned to each detected phylum) in the fecal mock community, along with the true composition of the fecal mock (indicated by sequencing laboratory 'T' at the right-hand side of the figure):
I will let you draw your own conclusions here, but I hope we can agree that the relationship between measured and true relative abundances is, while far from arbitrary, not particularly straightforward.
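(Finally, to make the distinction between precision - replicate-to-replicate agreement - and accuracy - closeness to truth - concrete, here is one last sketch with invented numbers. The two labs below are hypothetical and are not meant to represent any actual MBQC laboratory.)

```python
import numpy as np

# Known truth for a hypothetical mock community of three phyla,
# plus measured proportions from two hypothetical labs (rows = replicates)
truth = np.array([0.5, 0.3, 0.2])
precise_lab = np.array([[0.80, 0.15, 0.05],
                        [0.81, 0.14, 0.05]])  # replicates agree, but sit far from truth
noisy_lab = np.array([[0.55, 0.28, 0.17],
                      [0.40, 0.35, 0.25]])    # replicates disagree, but straddle the truth

def tv_to_truth(measured, truth):
    # Total variation distance of each replicate from the known composition
    return 0.5 * np.abs(measured - truth).sum(axis=1)

print(tv_to_truth(precise_lab, truth))  # low spread, large error: [0.30, 0.31]
print(tv_to_truth(noisy_lab, truth))    # higher spread, smaller error: [0.05, 0.10]
```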