Multi-experiment Viewer
Pipeline for percentage analysis
Samuel GRANJEAUD (CRCM, Marseille)
Nantes 2018-03-06
CRCM-INSERM-U1068 Marseille
• Facility "Integrative Bio-informatics"
– Ghislain Bidaut
• Team "Immunology and Cancer"
– Françoise Gondois-Rey
– Anne-Sophie Chrétien, Cyril Fauriat
– Daniel Olive, Jacques Nunès
• Centre d’Immunophénomique (Ciphe)
– Hervé Luche
– Quentin Barbier
– Camille Santa-Maria, Emilie Grégory
IMPACT Tools
http://impact.marseille.inserm.fr/
http://impact.marseille.inserm.fr/contexte/pourcentages/
Data analysis
Unsupervised methods: natural grouping, dimension reduction, data exploration
• Hierarchical clustering
• K-means clustering
• Principal component analysis
• Kohonen maps (SOM)
• Parallel coordinates
• Heatmaps
Supervised methods: guided grouping, answering a question, targeted search
• Statistical tests
• Correlation to a profile
• Discriminant score
• Neural networks, SVM
• Random Forest
• LDA, PLS, Lasso…
MeV history
• Transcriptomics
– roughly 2000 to 2009
• Data are matrices of RNA expression levels
• Columns are samples
• Rows are genes
• The numerical matrix is displayed as a color heatmap
• Numeric-to-color mapping is global
Computational analysis of microarray data
Quackenbush J
TM4: a free, open-source system for microarray data
management and analysis online
Saeed AI, …, Quackenbush J
Microarray data normalization and transformation
Quackenbush J
TDMS file format
• Numerical measurements at the center
• Annotations in the margins
• Export from Excel
– tab-delimited text format
– '.' as decimal separator
• Tips
– Set your regional settings to English (GB)
– Inspect the file with Notepad++
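As a rough illustration of why the decimal separator matters, the sketch below reads a small tab-delimited table with annotations on the left and numbers in the center. The column layout, the sample names and the helper `read_tdms` are assumptions made for this example, not MeV's own reader.

```python
import csv
import io

# Hypothetical content: two annotation columns on the left,
# then one numeric column per sample (layout assumed for illustration).
TDMS_TEXT = (
    "population\tpanel\tsample_A\tsample_B\n"
    "CD4 T cells\tT panel\t12.5\t25.0\n"
    "NK cells\tNK panel\t3.2\t1.6\n"
)

def read_tdms(text, n_annotation_columns=2):
    """Split a tab-delimited table into sample names, row annotations and a numeric matrix."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    header, body = rows[0], rows[1:]
    samples = header[n_annotation_columns:]
    annotations = [row[:n_annotation_columns] for row in body]
    # float() expects '.' as decimal separator; a value such as '12,5' raises ValueError
    matrix = [[float(v) for v in row[n_annotation_columns:]] for row in body]
    return samples, annotations, matrix

samples, annotations, matrix = read_tdms(TDMS_TEXT)
print(samples)
print(matrix)
```

A file exported with ',' as decimal separator fails at the `float()` call, which is the kind of problem the regional-settings tip is meant to prevent.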
MeV for repetitive analyses
• MeV does a comparison between columns
• MeV repeats this comparison over all rows
• Templates for coloring groups
– sex-to-color mapping is always the same
• Templates for building identifiers
– merging of treatment, mutation, sex is always the same
• Scales up to many panels, organs…
• Scales up to MFI (asinh transform) and Luminex
mev-pivot
• MeV does comparisons between columns
• Percentages arise from multiple panels, organs, time points…
– multiple dimensions
– different points of view
• mev-pivot reorganizes data
– keeps the TDMS format
– avoids copy/paste errors
– allows safe reorganisation of data for MeV and Excel
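The idea of reorganizing percentages so that the items to compare end up in columns can be sketched as a plain pivot. This is only an illustration of the principle under assumed record fields (sample, population, percentage); it is not how mev-pivot itself is implemented.

```python
# Hypothetical long-format records: (sample, population, percentage).
records = [
    ("mouse1", "CD4 T cells", 12.5),
    ("mouse1", "NK cells", 3.2),
    ("mouse2", "CD4 T cells", 25.0),
    ("mouse2", "NK cells", 1.6),
]

def pivot(records):
    """Return (row_names, column_names, matrix): populations as rows, samples as columns."""
    rows = sorted({population for _, population, _ in records})
    cols = sorted({sample for sample, _, _ in records})
    values = {(population, sample): value for sample, population, value in records}
    # Missing combinations are kept as None rather than silently filled
    matrix = [[values.get((r, c)) for c in cols] for r in rows]
    return rows, cols, matrix

row_names, column_names, wide = pivot(records)
print(column_names)
print(wide)
```

Swapping the roles of the fields in `pivot` gives another point of view on the same data without any manual copy/paste.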
Go to mev-pivot, or view the Excel presentation
Analysis Pipeline
Adjust
• Log2 transform (or asinh)
• Center each row
• Filter rows
Explore
• Unsupervised analysis: HClustering, PCA
• Identify outliers
• Remove outliers
Identify
• Supervised analysis: statistical methods
• Identify difference between groups
• Remove small differences
Adjust
• Percentages are usually displayed on a log scale
• some MFI also
• Luminex concentrations
• Centering focuses on differences between groups within populations
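A minimal sketch of the Adjust step, assuming strictly positive percentages and centering each row on its mean (median centering would be an equally plausible choice). The function name `log2_center_rows` and the example values are made up for illustration.

```python
import math

# Hypothetical matrix: rows are populations, columns are samples (percentages).
percentages = [
    [12.5, 25.0, 50.0],
    [4.0, 2.0, 1.0],
]

def log2_center_rows(matrix):
    """Log2-transform every value, then subtract each row's mean from that row."""
    centered = []
    for row in matrix:
        logged = [math.log2(v) for v in row]  # requires strictly positive values
        mean = sum(logged) / len(logged)
        centered.append([v - mean for v in logged])
    return centered

centered = log2_center_rows(percentages)
for row in centered:
    print([round(v, 3) for v in row])
```

After centering, both rows span the same range around zero although their raw percentages differ by an order of magnitude, so one global color scale can show the differences between samples.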
Log2 properties
• log2 is proportional to log10
• 1 qRT-PCR cycle ~ x 2
• ratio => addition: x 2 => + 1
• is symmetric: +100% = x 2 = +1; -50% = / 2 = -1
• stabilizes the dispersion
• log2( a / b ) = log2( a ) - log2( b )
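These properties can be checked numerically. The short sketch below uses arbitrary example values for `a`, `b` and `x`.

```python
import math

doubling = math.log2(2.0)   # +100% = x 2
halving = math.log2(0.5)    # -50% = / 2
print(doubling, halving)

# A ratio becomes a difference: log2(a / b) = log2(a) - log2(b)
a, b = 40.0, 10.0
ratio_log = math.log2(a / b)
difference_of_logs = math.log2(a) - math.log2(b)
print(ratio_log, difference_of_logs)

# log2 is proportional to log10: log2(x) = log10(x) / log10(2)
x = 37.0
from_log10 = math.log10(x) / math.log10(2.0)
print(math.log2(x), from_log10)
```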
Conclusions
• One analysis of matrices of percentages
• MeV, a graphical tool to explore and query
• Transform percentages and get informative color heatmaps
– Interpret differences between samples
• Highlight the most interesting populations
– Exploratory analyses
• Apply statistical tests for each population
– statistical significance vs practical importance
• Cope with the question of multiple tests
Hands on workflow
• Import
• Log2 transform
• Center rows
• Adjust color scale
• HClustering
– euclidean distance
• Import factor
• Find outlier sample
– using PCA
• Filter rows
• Apply stat. methods
– t-test, two-way ANOVA
– SAM
• Label rows
• Export images
• Save analysis
• Analysis pipeline fits any type of data: percentages, MFI, Luminex…
Statistics
• Important, unavoidable
• Abusive or abused
» Statistics does not tell us whether we are right. It tells us the chances of being wrong.
• Points of Significance in Nat. Methods
» http://mkweb.bcgsc.ca/pointsofsignificance/
» http://blogs.nature.com/methagora/2013/08/giving_statistics_the_attention_it_deserves.html
What is a statistical test?
• $\text{score} = \dfrac{\text{difference}}{\text{group dispersion} / \sqrt{\text{count}}} = \dfrac{\text{difference}}{\text{dispersion}} \sqrt{N}$
• score => P-value ~ the chances of
obtaining the observed (or more extreme)
score if no real effect exists (that is, if the
no-difference hypothesis is correct).
• the bigger the absolute score is,
the more "significant" the result is.
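One concrete instance of such a score is the Welch t statistic for two groups, sketched below with made-up values. This is an illustrative choice rather than the exact test MeV applies, and turning the score into a P-value needs the t distribution, which is left out here.

```python
import math
import statistics

def welch_score(group_a, group_b):
    """Difference of means divided by its standard error (Welch t statistic)."""
    difference = statistics.mean(group_a) - statistics.mean(group_b)
    standard_error = math.sqrt(
        statistics.variance(group_a) / len(group_a)
        + statistics.variance(group_b) / len(group_b)
    )
    return difference / standard_error

# Hypothetical log2 values for one population in two groups
control = [1.0, 2.0, 3.0]
treated = [3.0, 4.0, 5.0]
score = welch_score(treated, control)

# Same difference of means, twice as many samples per group
score_more_samples = welch_score(treated * 2, control * 2)
print(round(score, 3), round(score_more_samples, 3))
```

The second score is larger even though the difference between group means is unchanged: the count enters the score, which is one reason a small P-value alone says little about the size of an effect.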
Perhaps we are asking the wrong questions.
Agent Brown, Matrix.
What is "significance"?
• Statistical significance is not the same as
practical importance.
• The P-value does not tell whether the result is
of practical importance.
• Statistics does not tell us whether we are
right. It tells us the chances of being
wrong.
• Any particular threshold for declaring
significance is arbitrary.
A Difference, to Be a Difference, Must Make a Difference
Gertrude Stein
FC vs P: Volcano plot
Volcano Plots in Analyzing Differential Expressions with mRNA Microarrays
Wentian Li
[Figure: volcano plot. x-axis: log fold change; y-axis: -log10(P-value). Annotated regions: "P-value not significant", "Difference not important", "important and significant".]
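The three regions of a volcano plot can be expressed as a simple rule on fold change and P-value. In the sketch below the thresholds (a two-fold change and P ≤ 0.05) and the function names are assumptions chosen for illustration; any particular cutoff is arbitrary.

```python
import math

def volcano_coordinates(log2_fold_change, p_value):
    """Return the (x, y) position of one row on a volcano plot."""
    return log2_fold_change, -math.log10(p_value)

def volcano_region(log2_fold_change, p_value, min_abs_log2_fc=1.0, max_p=0.05):
    """Name the region of the plot a row falls in (thresholds are assumed, not prescribed)."""
    if p_value > max_p:
        return "P-value not significant"
    if abs(log2_fold_change) < min_abs_log2_fc:
        return "difference not important"
    return "important and significant"

# Hypothetical rows: (log2 fold change, P-value)
for fc, p in [(0.2, 0.001), (2.5, 0.3), (-1.8, 0.01)]:
    print(volcano_coordinates(fc, p), volcano_region(fc, p))
```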
Multiple testing
• The p-value threshold is the accepted risk of a false positive
• 5% means that among 100 statistical tests with no real effect, about 5 will be called positive although they are in fact false positives
• 5% is the risk of being wrong when rejecting the null hypothesis
• 5% for 100 populations => about 5 false positives: 5% is misleading when computing many tests
• One must control this risk => FDR
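The arithmetic behind this warning can be written out directly. The sketch assumes 100 independent tests in which no real effect exists.

```python
alpha = 0.05     # significance threshold per test
n_tests = 100    # e.g. one test per population

# Expected number of false positives when no real effect exists
expected_false_positives = alpha * n_tests

# Probability of at least one false positive, assuming independent tests
prob_at_least_one = 1 - (1 - alpha) ** n_tests

print(expected_false_positives)
print(round(prob_at_least_one, 3))
```

With these numbers at least one false positive is almost certain, which is why a per-test 5% threshold is not enough when many populations are tested at once.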
Adjusting FDR with SAM
FDR = False Discovery Rate
$\mathrm{FDR} = \dfrac{\text{False discoveries}}{\text{Discoveries}}$
YOU control FDR using the slider
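SAM estimates the FDR by permutation, which is too long to reproduce here. As a stand-in, the sketch below uses the Benjamini-Hochberg procedure, a different and simpler way of controlling the FDR from a list of P-values. The P-values are made up, and the `fdr` argument plays the role of the slider.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return one boolean per test: True where the test is kept as a discovery."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= k / m * fdr
    cutoff_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * fdr:
            cutoff_rank = rank
    discoveries = [False] * m
    for index in order[:cutoff_rank]:
        discoveries[index] = True
    return discoveries

# Hypothetical P-values, one per population
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.065, 0.074, 0.205, 0.212, 0.216]

strict = benjamini_hochberg(p_values, fdr=0.05)
relaxed = benjamini_hochberg(p_values, fdr=0.10)
print(sum(strict), sum(relaxed))
```

Raising the accepted FDR increases the number of discoveries, at the price of a larger expected share of false ones among them.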
More about multiple testing: Points of Significance, Nat. Methods
http://mkweb.bcgsc.ca/pointsofsignificance/