What to do when it's all relative: How to analyse relative abundances without being misled - David Lovell

DESCRIPTION

David R Lovell (CSIRO), Vera Pawlowsky-Glahn (University of Girona) and Juan José Egozcue (Universitat Politècnica de Cataluña)

Most measurement processes in molecular bioscience yield information about relative abundances only. This can result from sample preparation steps in which nucleic acids are brought to a specified concentration before measurement, or from the measurement step itself, as in the case of next-generation sequencing. Awareness is gradually growing that these data need special treatment and interpretation, but it is not yet widely appreciated just how misleading traditional statistical approaches (such as correlation) can be. Using data on absolute levels of yeast gene expression (i.e., mRNA copies per cell) over a 16-point time course experiment, we demonstrate why the concept of differential expression is challenging to interpret with relative abundances, and why correlation is an inappropriate measure of association for this kind of data. We present a simple, well-principled measure of association that can be used with confidence on relative abundances: proportionality. We give a straightforward graphical presentation of a statistic that can be used to assess the degree of proportionality between two sets of values, and then show how it can be plugged into analysis strategies that are familiar in molecular bioscience, including networks of association and clustered heatmaps. This approach gives bioscientists the means to analyse (and reanalyse) data generated in transcriptomics, proteomics, metagenomics and elsewhere, secure in the knowledge that the inferences made from the data are not artifacts of its relative nature.

Correlation is not an appropriate measure of association for relative data and can lead to contradictory conclusions if used naïvely. Figure 2 (left) shows that even though absolute expression levels are overwhelmingly positively correlated in this experiment, relative expression levels misleadingly suggest nothing of the sort.
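To make the point concrete, here is a minimal sketch using made-up numbers (not the yeast data): two hypothetical molecules whose absolute abundances both rise across four conditions, yet whose relative abundances move in opposite directions. A two-part composition is the most extreme toy case, because the two shares must sum to one.

```python
import math


def pearson(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    ss_u = sum((a - mean_u) ** 2 for a in u)
    ss_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(ss_u * ss_v)


# Illustrative absolute abundances over four conditions (assumed values).
a_abs = [10.0, 11.0, 12.0, 13.0]   # rises slowly
b_abs = [10.0, 20.0, 40.0, 80.0]   # rises quickly

# Relative abundances: each amount divided by the total at that condition.
totals = [a + b for a, b in zip(a_abs, b_abs)]
a_rel = [a / t for a, t in zip(a_abs, totals)]
b_rel = [b / t for b, t in zip(b_abs, totals)]

r_abs = pearson(a_abs, b_abs)   # both go up together: positive
r_rel = pearson(a_rel, b_rel)   # shares trade off against each other: negative

print(f"correlation of absolute abundances: {r_abs:+.3f}")
print(f"correlation of relative abundances: {r_rel:+.3f}")
```

The sign flips even though nothing about the underlying biology changed; only the division by the total did.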

Correlation with purely relative data (proportions, percentages, ppm, RPKM, etc.) gives no indication of relationships within the corresponding absolute data. Furthermore, the process of “standardizing” or “normalizing” measurements by dividing through by a common randomly distributed amount can induce what Karl Pearson in 1896 called spurious correlation.
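The spurious-correlation effect can be reproduced with a small simulation. This is a sketch under assumed distributions (uniform ranges chosen purely for illustration, fixed seed for repeatability): `x`, `y` and `z` are generated independently, so `x` and `y` are essentially uncorrelated, but dividing both by the common random amount `z` induces a strong positive correlation between the ratios.

```python
import math
import random


def pearson(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    ss_u = sum((a - mean_u) ** 2 for a in u)
    ss_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(ss_u * ss_v)


random.seed(0)  # fixed seed so the sketch is repeatable
n = 2000

# Three mutually independent quantities (ranges are assumptions).
x = [random.uniform(9.0, 11.0) for _ in range(n)]
y = [random.uniform(9.0, 11.0) for _ in range(n)]
z = [random.uniform(5.0, 15.0) for _ in range(n)]  # the common divisor

r_raw = pearson(x, y)
r_ratio = pearson([a / c for a, c in zip(x, z)],
                  [b / c for b, c in zip(y, z)])

print(f"correlation of x and y:     {r_raw:+.3f}")
print(f"correlation of x/z and y/z: {r_ratio:+.3f}")
```

The more variable the divisor is relative to the numerators, the stronger the induced correlation.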

REFERENCES

[1] Lovell, D., Müller, W., Taylor, J., Zwart, A., and Helliwell, C. “Proportions, Percentages, PPM: Do the Molecular Biosciences Treat Compositional Data Right?” In Compositional Data Analysis: Theory and Applications (Vera Pawlowsky-Glahn and Antonella Buccianti, eds), 191–207. Chichester, UK: John Wiley & Sons, Ltd, 2011.

What to do when it's all relative:

Motivation

Common measurement processes in transcriptomics, proteomics, metabolomics, metagenomics and ecology produce data that carry only relative information, requiring additional (often untested [2]) assumptions to make inferences about absolute abundance. Less common is awareness of why these relative data need special analysis and interpretation [1].

Our aim is to change that. To help ensure researchers do not draw the wrong conclusions from relative abundances we use yeast gene expression data to illustrate the issues (Figure 1).

In molecular bioscience, measurements of relative abundance are, well… abundant. Appreciation that they need special analysis and interpretation is scarce. Correlation is often used as a measure of association between different relative biomolecular abundances. This is wrong and potentially misleading except when total abundance is fixed across conditions. We present an alternative that is right for relative abundances.

How to analyse relative abundances without being misled

David R Lovell (CSIRO), Vera Pawlowsky-Glahn (University of Girona) and Juan José Egozcue (Universitat Politècnica de Cataluña)

CSIRO COMPUTATIONAL INFORMATICS

David Lovell
CSIRO/Australian Bioinformatics Network
e [email protected]
w www.csiro.au/people/David.Lovell

[2] Lovén, J, et al. 2012. “Revisiting Global Gene Expression Analysis.” Cell 151 (3) (October 26): 476–482.

[3] Marguerat, S, et al. 2012. “Quantitative Analysis of Fission Yeast Transcriptomes and Proteomes in Proliferating and Quiescent Cells.” Cell 151 (3) (October 26): 671–683.

Figure 1: (Left) Absolute and (Right) relative expression levels of 3031 yeast mRNAs over a 16-point time course experiment in which cells were deprived of nutrients [3]. The red and blue pairs of mRNAs are used in Figure 2. Note that the absolute abundances show that production of different mRNAs is generally positively correlated (i.e., expression levels change in the same direction) in this experiment. Note also that mRNA abundance spans six orders of magnitude.

[4] D. Lovell, V. Pawlowsky-Glahn, and J. J. Egozcue. “Have You Got Things in Proportion? A Practical Strategy for Exploring Association in High-dimensional Compositions.” In Proceedings of the 5th International Workshop on Compositional Data Analysis, edited by K Hron, Peter Filzmoser, and M Templ, 100–110. Vorau, Austria, 2013. http://www.codawork2013.com

Proportionality to the rescue

If the relative abundances of two different molecules stay in a fixed proportion to one another across different experimental conditions, then their absolute abundances behave proportionally also:

$x_i / t_i \propto y_i / t_i$ implies that $x_i \propto y_i$

where $x_i$ and $y_i$ are the absolute abundances of the molecules, and $t_i$ is the total abundance at condition $i$.
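The step can be spelled out in one line. In this sketch the constant of proportionality is written $c$ (a symbol introduced here for illustration) and the totals are assumed to be strictly positive:

```latex
\frac{x_i}{t_i} = c\,\frac{y_i}{t_i}
\quad\Longrightarrow\quad
x_i = c\,y_i
\quad\Longrightarrow\quad
\log x_i - \log y_i = \log c
\qquad \text{for every condition } i,\ t_i > 0,\ c > 0 .
```

The totals cancel, so the unknown $t_i$ never needs to be measured, and on the log scale proportionality means that the difference $\log x_i - \log y_i$ does not vary across conditions.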

All we need now is a measure of how close the behaviour of two amounts is to proportionality.

Figure 2: (Left) Histograms of correlation coefficients calculated (appropriately) from the absolute data (x-axis) and (inappropriately) from the relative data (y-axis). The blue and the red points correspond to the blue and red pairs of mRNA in Figure 1, illustrating that correlating relative abundances can lead us to draw the opposite conclusion about the relationship between variables. (Right) Histograms of correlation coefficients calculated (appropriately) from the absolute data (x-axis) and the φ() values from the relative data (y-axis). The red rectangle highlights the fact that while the absolute abundances of the vast majority of mRNAs are strongly positively correlated, only a very few behave proportionally.

φ: a measure of “goodness of fit to proportionality”

We have shown [4] that

$\phi(\log x, \log y) = 1 + \beta^2 - 2\beta|r|$

tells us how proportional x and y are. In this equation

β is the slope of the Standardised Major Axis of log x, log y
r is the correlation of log x, log y.

As x and y behave more proportionally, the slope of their logarithms approaches 1, as does their correlation, and φ(log x, log y) approaches 0.
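A minimal sketch of the statistic follows. It assumes the usual definition of the Standardised Major Axis slope, namely the ratio of the standard deviations of log y and log x carrying the sign of r; the function names and the toy vectors are illustrative only.

```python
import math


def variance(values):
    """Sample variance (n - 1 denominator)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


def correlation(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v)) / (n - 1)
    return cov / math.sqrt(variance(u) * variance(v))


def phi(log_x, log_y):
    """phi = 1 + beta^2 - 2*beta*|r| for two log-transformed vectors."""
    var_x, var_y = variance(log_x), variance(log_y)
    if var_x == 0 or var_y == 0:
        raise ValueError("phi is undefined when either vector is constant")
    r = correlation(log_x, log_y)
    # Assumed SMA slope: sd(log y) / sd(log x), with the sign of r.
    beta = math.copysign(math.sqrt(var_y / var_x), r)
    return 1 + beta ** 2 - 2 * beta * abs(r)


x = [1.0, 2.0, 4.0, 8.0]
y = [3.0, 6.0, 12.0, 24.0]   # exactly 3 * x: perfectly proportional
z = [8.0, 4.0, 2.0, 1.0]     # moves in the opposite direction to x

log_x = [math.log(v) for v in x]
phi_proportional = phi(log_x, [math.log(v) for v in y])
phi_opposite = phi(log_x, [math.log(v) for v in z])

print(f"phi for proportional pair: {phi_proportional:.6f}")
print(f"phi for opposing pair:     {phi_opposite:.6f}")
```

Under that assumption about β, the formula is algebraically the same as var(log x − log y) / var(log x), which is why it vanishes exactly when the log difference is constant. Note that φ(log x, log y) and φ(log y, log x) are generally not equal, because the denominator depends on which vector comes first.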

Use φ on all your relatives…

Now you can analyse the relationships between relative abundances with confidence using φ() instead of correlation. Figure 3 shows how it helps select strongly proportional mRNA pairs from the yeast data. Figure 4 shows how it can be used as the basis of familiar analyses and visualisations, including graphs, heatmaps and hierarchical clustering.

Whether you have only relative abundance data, or whether you wish to explore relative relationships in absolute abundances:

Don’t be misled: use φ

Figure 3: (Left) $\beta$ and $r^2$ values of a subset of the mRNA pairs that are strongly proportional, coloured by their φ() values. (Right) Absolute expression levels of the 424 pairs of mRNAs with $\phi(\mathrm{clr}(x_i), \mathrm{clr}(x_j)) < 0.05$ plotted on a natural scale.
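The selection step in Figure 3 can be sketched as follows. This is a toy illustration, not the analysis behind the figure: the 4 × 3 table of abundances is invented, clr is taken to be the centred log-ratio transform (each log value minus the mean log value of its sample), the SMA slope is assumed to be the ratio of standard deviations carrying the sign of r, and only pairs with i < j are evaluated for brevity.

```python
import math
from itertools import combinations


def clr(composition):
    """Centred log-ratio: log of each part minus the mean log of the sample."""
    logs = [math.log(p) for p in composition]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


def phi(u, v):
    """phi = 1 + beta^2 - 2*beta*|r| for two (log-scale) vectors."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    ss_u = sum((a - mean_u) ** 2 for a in u)
    ss_v = sum((b - mean_v) ** 2 for b in v)
    s_uv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    r = s_uv / math.sqrt(ss_u * ss_v)
    # Assumed SMA slope: sd(v) / sd(u), with the sign of r.
    beta = math.copysign(math.sqrt(ss_v / ss_u), r)
    return 1 + beta ** 2 - 2 * beta * abs(r)


# Invented absolute abundances: rows are conditions, columns are molecules.
# Molecule 1 is always twice molecule 0; molecule 2 is unrelated.
absolute = [
    [1.0, 2.0, 5.0],
    [2.0, 4.0, 3.0],
    [4.0, 8.0, 9.0],
    [8.0, 16.0, 1.0],
]

# Throw away the totals, as a relative measurement process would.
relative = [[value / sum(row) for value in row] for row in absolute]

# clr-transform each sample, then read off one vector per molecule.
clr_rows = [clr(row) for row in relative]
columns = list(zip(*clr_rows))

threshold = 0.05
selected = [
    (i, j)
    for i, j in combinations(range(len(columns)), 2)
    if phi(columns[i], columns[j]) < threshold
]
print("strongly proportional pairs:", selected)
```

Only the pair that is proportional in the absolute data passes the threshold, even though the calculation saw nothing but relative abundances.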

Figure 4: φ() can be used in place of correlation as the basis of many familiar analyses and visualisations. (Right) A subset of mRNAs clustered using φ() as a distance metric. (Below) mRNAs with $\phi(\mathrm{clr}(x_i), \mathrm{clr}(x_j)) < 0.05$ visualised as a graph with edges between strongly proportional mRNAs.
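The graph view can be sketched in a few lines. The mRNA names and φ values below are hypothetical placeholders, and grouping nodes into connected components is just one simple way to read such a graph, not necessarily how the figure was produced.

```python
# Hypothetical phi values for pairs of mRNAs (placeholders, not real data).
phi_values = {
    ("mRNA_a", "mRNA_b"): 0.01,
    ("mRNA_b", "mRNA_c"): 0.03,
    ("mRNA_c", "mRNA_d"): 0.40,
    ("mRNA_d", "mRNA_e"): 0.02,
}
threshold = 0.05


def build_graph(pair_values, cutoff):
    """Adjacency sets with an edge wherever phi falls below the cutoff."""
    graph = {}
    for (u, v), value in pair_values.items():
        graph.setdefault(u, set())
        graph.setdefault(v, set())
        if value < cutoff:
            graph[u].add(v)
            graph[v].add(u)
    return graph


def connected_components(graph):
    """Groups of nodes linked by chains of strongly proportional pairs."""
    seen = set()
    components = []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(graph[node] - seen)
        components.append(sorted(component))
    return components


graph = build_graph(phi_values, threshold)
components = connected_components(graph)
for group in components:
    print(group)
```

Each printed group is a set of mRNAs connected by edges between strongly proportional pairs; the weak link between `mRNA_c` and `mRNA_d` (φ = 0.40) keeps the two groups apart.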