Weighted gene co-expression network analysis (WGCNA) and network edge orienting (NEO)

Bin Zhang and Steve Horvath
University of California, Los Angeles, USA
Departments of Human Genetics and Biostatistics

Part I: WGCNA
Part II: NEO



Challenges of Modern Genetics

1. Genetic analysis of complex diseases is difficult

• Requires searching for many small-effect genes
• Difficult to detect signal at the DNA level
• RNA-level data may identify clusters of genes

2. Microarray technology measures RNA levels (gene expression)

• But these data are noisy!
• Focusing on single genes can lead to spurious results due to outliers or array artifacts

Network analysis of RNA data: “Gene Co-expression Network Analysis” (GCNA)

Scale-free Networks: Derek J de Solla Price

• Derek J de Solla Price was a professor of applied mathematics at Raffles College which became part of the University of Singapore in 1948.

• Singapore = great location for a systems biology conference!

• In 1965 he published the first example of a scale-free network.

• The network of scientific journal articles has connections (citations) that follow a power-law distribution.

Timeline for Scale-Free Gene Co-Expression Networks

1965: Concept first conceived by Derek J. de Solla Price.

1999: Revived by Barabási and Albert, who found it applicable to modeling the internet and biological networks.

2000: The concept of modeling gene expression data as a network was introduced by Butte and Kohane.

2002: Featherstone and Broadie showed that these networks exhibit scale-free topology.

Gene Co-Expression Network Analysis (GCNA) = Systems Genetics Approach

• Goal is to understand the “system” instead of reporting a list of individual parts
• Focus on gene clusters (“modules”) rather than individual genes
• Easily integrated with other types of data: genetic marker and protein data, clinical traits
• Network structure translates to biological pathways (can be confirmed and annotated using gene ontology software)

GCNA addresses issues in microarray data & complex disease genetics

• Individual gene expressions may be poorly measured, so it is safer to study these data at the module level.

• Modules are likely to represent pathways: genes that are co-regulated and/or interact.

• The signal from these pathways tends to be stronger than the signal from a single gene.

• Alleviates multiple testing problem in traditional association/differential expression analyses.

Network Terminology

Barabási AL, Oltvai ZN (2004). Network biology: Understanding the cell's functional organization. Nature Reviews Genetics, 5, 101-113.

(A) Random Network: each node has approximately the same number of links, for example 2. Degree distribution (Poisson): $\Pr(k) = \dfrac{\lambda^{k} e^{-\lambda}}{k!}$

(B) Scale-Free Network: a few nodes are very highly connected. Degree distribution (power law): $\Pr(k) \propto k^{-\gamma}$

Definitions:
• Node = object (e.g., a gene)
• Connection = link between 2 nodes
• k = Degree(Node_i) = number of links to Node_i
• Pr(k) = probability that Node_i has k links
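
As a minimal sketch of the definitions above (not taken from the slides), the snippet below computes node degrees and the empirical Pr(k) for a toy unweighted network. The function name `degree_distribution` and the star-shaped example are assumptions made for illustration.

```python
from collections import Counter

def degree_distribution(adjacency):
    """Return {k: Pr(k)} for a symmetric 0/1 adjacency matrix."""
    n = len(adjacency)
    # Degree of node i = number of links to node i (the diagonal is ignored).
    degrees = [sum(adjacency[i][j] for j in range(n) if j != i) for i in range(n)]
    counts = Counter(degrees)
    return {k: c / n for k, c in sorted(counts.items())}

# Hypothetical toy network: node 0 is a hub linked to nodes 1-4.
star = [
    [0, 1, 1, 1, 1],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
]
pr_k = degree_distribution(star)
print(pr_k)  # one hub with 4 links, four nodes with 1 link each
```

A single highly connected hub next to many low-degree nodes is the qualitative pattern described for scale-free networks in (B).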

How to construct a gene co-expression network?

A) Microarray gene expression data

B) Use Pearson correlation to determine concordance of gene expressions $x_i$ and $x_j$: $r(x_i, x_j)$

C) The Pearson correlation matrix is transformed via an adjacency function:
• Step function: $a_{ij} = I(|r(x_i, x_j)| > \tau)$ gives an unweighted network
• Power function: $a_{ij} = |r(x_i, x_j)|^{\beta}$ gives a weighted network
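
Steps A to C can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the WGCNA implementation: the helper names (`pearson`, `adjacency`, `step_adjacency`, `power_adjacency`), the toy profiles, and the values tau = 0.85 and beta = 6 are all hypothetical. The sketch also assumes that the absolute correlation is used and that the diagonal is set to 0.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equally long expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def adjacency(profiles, transform):
    """Apply `transform` to |r| for every gene pair; diagonal stays 0 (assumption)."""
    n = len(profiles)
    a = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            a[i][j] = a[j][i] = transform(abs(pearson(profiles[i], profiles[j])))
    return a

def step_adjacency(profiles, tau):
    # Hard threshold: unweighted network.
    return adjacency(profiles, lambda r: 1.0 if r > tau else 0.0)

def power_adjacency(profiles, beta):
    # Soft threshold: weighted network.
    return adjacency(profiles, lambda r: r ** beta)

# Hypothetical data: 4 genes measured on 5 arrays.
profiles = [
    [1, 2, 3, 4, 5],
    [2, 4, 6, 8, 10],
    [5, 4, 3, 2, 1],
    [1, 3, 2, 5, 4],
]
unweighted = step_adjacency(profiles, tau=0.85)
weighted = power_adjacency(profiles, beta=6)
print(unweighted[0])
print([round(v, 3) for v in weighted[0]])
```

Gene 4 correlates with gene 1 at r = 0.8: the step function drops that link entirely, while the power function keeps it with a reduced weight.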

Two perspectives on scale-free networks: unweighted and weighted

Unweighted: $a_{ij} = I(|r(x_i, x_j)| > \tau)$
• Some genes are connected
• All connections are equal
• Hard thresholding ignores connection strength information.

Weighted: $a_{ij} = |r(x_i, x_j)|^{\beta}$
• All genes are connected
• Width of line = connection strength

Gene ($x_i$)-to-Gene ($x_j$) relationships in a network

• Adjacency matrix A = network, where each entry $a_{ij}$ gives the connection strength between $x_i$ and $x_j$

• Connectivity of gene $x_i$ = row sum of the connection strengths of gene $x_i$:

$$k_i = \sum_{j} a_{ij}$$

• Topological overlap between $x_i$ and $x_j$ = measure of clustering or shared neighbors. Ravasz et al (2002)

$$TOM_{ij} = \frac{\sum_{u} a_{iu} a_{uj} + a_{ij}}{\min(k_i, k_j) + 1 - a_{ij}}$$

where $\sum_{u} a_{iu} a_{uj}$ is the number of genes connected to both $x_i$ and $x_j$. (Note: this TOM definition is for an unweighted network.)
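
The connectivity and topological overlap formulas can be checked on a small example. The sketch below is an illustration, not the WGCNA code: the function names and the 4-gene network are hypothetical, and the sum over u is assumed to skip u = i and u = j.

```python
def connectivity(a):
    """k_i = sum over j of a_ij (diagonal excluded)."""
    n = len(a)
    return [sum(a[i][j] for j in range(n) if j != i) for i in range(n)]

def topological_overlap(a):
    """TOM_ij = (sum_u a_iu * a_uj + a_ij) / (min(k_i, k_j) + 1 - a_ij)."""
    n = len(a)
    k = connectivity(a)
    tom = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Neighbors shared by genes i and j.
            shared = sum(a[i][u] * a[u][j] for u in range(n) if u not in (i, j))
            tom[i][j] = (shared + a[i][j]) / (min(k[i], k[j]) + 1 - a[i][j])
    return tom

# Hypothetical unweighted network with edges 0-1, 0-2, 1-2 and 2-3.
a = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
k = connectivity(a)
tom = topological_overlap(a)
print(k)
print(tom[0][1], tom[0][3])
```

Genes 0 and 1 are linked and share their only other neighbor, so their overlap is maximal; genes 0 and 3 are not linked but share gene 2, which gives an intermediate overlap.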

Average Linkage Hierarchical Clustering

[Figure I] [Figure II]

• Agglomerative clustering (Figure I) defines the clusters: start with n groups of 1 gene each and keep merging until a single group of size n remains.

• Clusters are merged using “average linkage” (Figure II): at each step, the two clusters with the smallest average distance (1 – TOM) are combined.

(source: http://www.resample.com/xlminer/help/HClst/HClst_intro.htm)

Defining Network Modules

1. Hierarchical clustering of overlap measures results in a cluster tree (dendrogram)

2. Trim the tree at a level that gives a manageable number of genes and gene clusters (~1,000 genes, 3-10 clusters)

• Gene clusters are called modules
• Grey colors indicate genes outside of the modules
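
The two steps above (average linkage clustering, then trimming the tree) can be sketched as follows. This is a simplified illustration rather than the WGCNA procedure: `average_linkage_modules`, `cut_height` and `min_size` are hypothetical names, and the 6-gene dissimilarity matrix (standing in for 1 – TOM) is made up. Because average linkage merge heights never decrease, stopping the merging once the closest pair exceeds `cut_height` gives the same clusters as cutting the full dendrogram at that height.

```python
def average_linkage_modules(dist, cut_height, min_size=2):
    """Merge clusters by average linkage until the closest pair is farther
    apart than cut_height. Genes in clusters smaller than min_size are
    labeled 'grey' (outside of the modules)."""
    clusters = [[i] for i in range(len(dist))]

    def avg_dist(c1, c2):
        return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the pair of clusters with the smallest average distance.
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = avg_dist(clusters[p], clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        d, p, q = best
        if d > cut_height:
            break  # equivalent to trimming the tree at cut_height
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]

    labels = ["grey"] * len(dist)
    module_id = 0
    for members in clusters:
        if len(members) >= min_size:
            module_id += 1
            for gene in members:
                labels[gene] = f"module{module_id}"
    return labels

# Hypothetical dissimilarities: genes 0-2 and genes 3-4 form tight groups,
# gene 5 is far from everything.
dist = [
    [0.0, 0.1, 0.2, 0.9, 0.9, 0.95],
    [0.1, 0.0, 0.2, 0.9, 0.9, 0.95],
    [0.2, 0.2, 0.0, 0.9, 0.9, 0.95],
    [0.9, 0.9, 0.9, 0.0, 0.1, 0.95],
    [0.9, 0.9, 0.9, 0.1, 0.0, 0.95],
    [0.95, 0.95, 0.95, 0.95, 0.95, 0.0],
]
labels = average_linkage_modules(dist, cut_height=0.5)
print(labels)
```

The quadratic pair search is fine for a toy example; real data sets with thousands of genes call for an optimized clustering library.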

Network Module Analysis

• Identify relevant modules according to one or more of the following strategies:

• Associate module with trait, SNPs and/or connectivity data.

• Annotate module members and primary functions using gene ontology software.

Types of Network Connectivity

Recall: connectivity of a gene i:

$$k_i = \sum_{j} a_{ij}$$

Whole network connectivity is the sum of connection strengths ($a_{ij}$) across all network genes.

Intra-modular connectivity is the sum of the connection strengths of gene i within its module.

Intra-modular connectivity is more biologically meaningful than whole network connectivity.
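
The difference between the two connectivity measures is easy to see in code. The sketch below uses hypothetical function names, a made-up weighted adjacency matrix, and made-up module labels ("blue", "brown").

```python
def whole_network_connectivity(a):
    """Sum of connection strengths of gene i across all network genes."""
    n = len(a)
    return [sum(a[i][j] for j in range(n) if j != i) for i in range(n)]

def intramodular_connectivity(a, modules):
    """Sum of connection strengths of gene i with genes in its own module."""
    n = len(a)
    return [
        sum(a[i][j] for j in range(n) if j != i and modules[j] == modules[i])
        for i in range(n)
    ]

# Hypothetical weighted adjacency matrix for 4 genes in 2 modules.
a = [
    [0.0, 0.8, 0.6, 0.1],
    [0.8, 0.0, 0.5, 0.2],
    [0.6, 0.5, 0.0, 0.3],
    [0.1, 0.2, 0.3, 0.0],
]
modules = ["blue", "blue", "brown", "brown"]
k_total = whole_network_connectivity(a)
k_within = intramodular_connectivity(a, modules)
print([round(v, 2) for v in k_total])
print([round(v, 2) for v in k_within])
```

Gene 2 has the second-highest whole network connectivity here, yet most of its connection strength goes to genes outside its own module, so its intra-modular connectivity is low.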

Applications of WGCNA Part I: inter-species comparison

1. Application to human and chimp brain tissue expression (2006)

• Modules that correspond to brain regions.

• Most and least conserved regions.

• Results agreed with known evolutionary hierarchy.

• Identified groups of genes that could be evolutionary drivers.

2. Application to two mouse strains (2007)

• Differential network analysis between BxH and BxD

• Identified pathways and genes related to weight.

Applications of WGCNA Part II: finding trait-related pathways and genes

1. Analysis of endothelial cell (EC) responses to oxidized lipids (2006)

• Identified 15 pathways characterizing response

• Identified potential gene targets for atherosclerosis

2. Integrated analysis of chronic fatigue syndrome data: microarray, SNP, traits (2008)

• Tutorial on integrated WGCNA, compared with standard microarray analysis

• Systems genetics screening criteria yield genes that are causal for the parent module

WGCNA Software: stand alone and R package

Part II: Network Edge Orienting (NEO)

Undirected Weighted Network

Directed Weighted Network

Jason Aten (1, 2) and Steve Horvath (3)
1 Biomathematics, 2 Human Genetics and 3 Biostatistics

Motivation for Cause and Effect Analysis in Genetics

• Large-scale genetic marker and gene expression data sets can result in numerous genetic candidates for follow-up studies.

• Many are due to chance rather than a true clinical relationship.

• Cause and effect analysis can be performed on a weighted gene co-expression network when genetic marker data is available, based on the ‘Mendelian randomization’ concept.

• Such an analysis may:
  • Help prioritize among these gene candidates for follow-up analysis.
  • Reduce spurious findings.

Historical Rationale for Causal Inference in Genetics (Katan 1986)

1. DNA variation as measured by genetic markers can only be causal or have no effect on gene expression and trait data; it is never reactive

2. Mendel’s law of independent assortment: genetic traits are inherited randomly (‘Mendelian randomization’)

3. People with a particular DNA variation (X) that confers only a small physiological effect are otherwise comparable to people who have the normal allele (Y)

• The X subjects likely do not know of their particular genetic difference from the Y subjects, and lead comparable lives

• A study of this trait in X and Y adults would be equivalent to a prospective study that began with X and Y newborns and followed them through adulthood to see which developed the trait

How to infer causal relationships?

• Katan (1986) described how causal analysis in observational studies on the APOE gene (M) could determine whether there is a link between cholesterol (A) and cancer (B)

• Based on research findings:
  • APOE alleles influenced cholesterol levels
  • Observational studies found that low cholesterol was associated with cancer

• Three possible relationships:
  1. M → A → B, |r(M,B)| > 0
  2. M → A ← B, |r(M,B)| = 0
  3. M → A ← Confounder → B, |r(M,B)| = 0 (same correlation pattern as 2)

• Correlation information can distinguish relationship 1 from 2 and 3.

But in practice, true causality is difficult to establish.

• r(M,B) = 0 is unlikely, particularly in large data sets or if B is a quantitative trait

• M → A: may be verified if SNP and gene expression correspond to the same gene
  • Often not possible: it is expensive to have high coverage of genes with both SNP markers and gene expression profiles.
  • Confounded by other markers in linkage disequilibrium with study marker(s)

• Relationships could be confounded by
  • Gene or environment interactions
  • Population stratification

• Causality inferred by genetic associations is best considered probable causality

Network Edge Orienting Software (NEO)

• Developed by Jason Aten and Steve Horvath (2008) for estimating edge orientations in a gene co-expression network

• Methods based on structural equation modeling (SEM)
  • First conceived of by geneticist Sewall Wright (1921)
  • Allows study of causal graphs in the context of statistical distributions
  • Each variable in a graph is modeled by combinations of 1 or more other variables using linear regression

• NEO calculates Local Edge Orienting (LEO) scores
  • Based on the relative probabilities of local structural equation models (models including only 3 nodes)
  • Higher scores indicate stronger evidence for a causal relationship

NEO software: Input

1. A set of quantitative variables (traits)

• Physiological traits
• Gene expression data
• Typically input both

2. SNP marker data (or other genetic marker data)

Unoriented Network Example

[Figure: unoriented network linking markers on Chr1, Chr2, …, ChrX, gene expressions E1-E4, and the clinical traits HDL and Insulin]

Key:
• E1, …, E4 = gene expressions
• HDL, Insulin = clinical traits

1. Note that if the transcript corresponding to a SNP is known, the orientation of the edge is known

2. Edges between traits and gene expressions are not yet oriented

Network Edges Oriented

[Figure: the same network (markers on Chr1, Chr2, ..., Chr22, ChrX; gene expressions E1-E4; clinical traits HDL and Insulin) with directed edges labeled LEO=1.5, LEO=3.5, LEO=0.5 and LEO=0.8]

Edges are directed. A score, which measures the strength of evidence for this direction, is assigned to each directed edge.

NEO software: Output

1. Diagram of the directed network
2. Spreadsheet that summarizes LEO scores and provides hyperlinks to model fits (html files)

There are 5 models for a marker M and traits A and B

In the table below, “r” refers to correlation; the value “1” indicates r > 0, while “0” indicates r = 0.

Relationship     r(A, B)  r(M, A)  r(M, B)  r(M, A | B)  r(M, B | A)  r(A, B | M)
1. M → A → B     1        1        1        1            0            1
2. M → B → A     1        1        1        0            1            1
3. A ← M → B     1        1        1        1            1            0
4. M → A ← B*    1        1        0        1            1            1
5. M → B ← A*    1        0        1        1            1            1

*Note that models 4 and 5 are equivalent to the confounded model: M → X ← Confounder → Y.
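
The conditional entries in the table are partial correlations, which can be computed from the three pairwise correlations. The sketch below uses the standard first-order partial correlation formula; the function name `partial_corr` and the numeric values are hypothetical, and the chain example assumes a linear model in which r(M,B) = r(M,A) · r(A,B).

```python
import math

def partial_corr(r_xy, r_xz, r_yz):
    """Correlation between x and y after conditioning on z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Model 1 (M -> A -> B) with made-up correlations.
r_MA, r_AB = 0.6, 0.5
r_MB = r_MA * r_AB  # implied by the linear chain (assumption)

r_MB_given_A = partial_corr(r_MB, r_MA, r_AB)  # table entry r(M, B | A)
r_MA_given_B = partial_corr(r_MA, r_MB, r_AB)  # table entry r(M, A | B)
print(r_MB_given_A)
print(round(r_MA_given_B, 3))
```

Conditioning on the mediator A removes the association between M and B, which matches the single “0” in row 1 of the table, while r(M, A | B) stays nonzero.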

Scores from NEO Software

1. Scores for model selection:

• Model p-values
• Local edge orienting score (LEO.NB.SingleMarker)

2. Traditional SEM measures for assessing model fit:

• Root Mean Square Error of Approximation (RMSEA)
• Comparative Fit Index (CFI)
• Standardized Root Mean Square Residual (SRMSR)

Model P-values

• H0: correlation = 0, H1: |correlation| > 0

• Correlations close to zero: H0 cannot be rejected, so it is possible that the data fit the null distribution.

• Larger p-values = better model fit. A p-value > 0.05 is considered to indicate good fit.

• Steps for calculating a model p-value:

1. The correlation between a pair of nodes (r) is transformed to a Z-score using Fisher’s Z transformation: $z = \frac{1}{2} \ln\left(\frac{1 + r}{1 - r}\right)$

2. The corresponding p-value for this score can be obtained from a standard normal distribution table.
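
The two steps can be sketched in Python as follows. The slide does not state a sample size or how the transformed value is standardized; this sketch assumes the usual approach of multiplying Fisher’s z by sqrt(n − 3), where n is the number of samples, and uses the complementary error function in place of a normal table. The function name `fisher_z_pvalue` and the example values are hypothetical.

```python
import math

def fisher_z_pvalue(r, n):
    """Two-sided p-value for H0: correlation = 0, given r from n samples.

    Requires -1 < r < 1 and n > 3.
    """
    z = 0.5 * math.log((1 + r) / (1 - r))   # Fisher's Z transformation
    z_score = z * math.sqrt(n - 3)          # standardization (assumption)
    # Two-sided tail probability of a standard normal variable.
    p_value = math.erfc(abs(z_score) / math.sqrt(2))
    return z_score, p_value

z_null, p_null = fisher_z_pvalue(0.0, 28)
z_alt, p_alt = fisher_z_pvalue(0.5, 28)
print(p_null)
print(z_alt, p_alt)
```

A correlation of exactly zero gives the largest possible p-value, consistent with the slide’s reading that larger p-values mean the data are compatible with H0.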

LEO Score = Relative Model Fit

• Compares the p-value of model A → B with the next best p-value.

• LEO score > 1 indicates possible causal model.

• Implies the p-value of the causal model is $10^{1}$ = 10-fold higher than that of the best competing model.
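
Read literally, the slide describes the LEO score as a log10 ratio of model p-values. The sketch below follows that reading only; it is not the NEO implementation, and the function name `leo_score` and the p-values are made up.

```python
import math

def leo_score(p_causal, p_alternatives):
    """log10 of the causal model's p-value over the best competing p-value."""
    return math.log10(p_causal / max(p_alternatives))

# Hypothetical model p-values: causal model A -> B versus three competitors.
score = leo_score(0.5, [0.05, 0.01, 0.001])
print(round(score, 3))
```

With these numbers the causal model’s p-value is 10 times that of the best competitor, so the score sits at the threshold of 1 quoted on the slide.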

SEM Measures for Assessing Model Fit

• Compare observed (Sm) and expected (Σ) covariance matrices.

• Σ consists of path coefficients among traits and genetic markers.

• Recommended thresholds for assessing likely causality:
  • RMSEA ≤ 0.05
  • CFI ≥ 0.90
  • SRMSR ≤ 0.10

Multi-marker Models

• NEO analysis has been generalized to multiple markers.

• Two LEO scores per model rather than one.

• Common pleiotropic anchor (CPA) > 0.8

• Orthogonal candidate anchor (OCA) > 0.3

4 Multi-Marker Models

Multi-Marker NEO can Perform Marker Selection

• Methods
  • Selecting markers with the best correlation
  • Forward-stepwise multivariate regression approach
  • Combination

• The OCA and CPA scores are computed at each SNP selection step and should be robust to the number of SNPs selected
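
The first selection method (markers with the best correlation) can be sketched as a simple ranking by absolute Pearson correlation with the trait. This illustrates the idea only and is not NEO’s selection code: `top_correlated_markers`, the SNP names, the genotype codes (0/1/2) and the trait values are all hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equally long vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def top_correlated_markers(snps, trait, k):
    """Return the names of the k SNPs with the largest |r| with the trait."""
    scores = {name: abs(pearson(genotypes, trait)) for name, genotypes in snps.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Made-up genotypes for 6 subjects and a made-up quantitative trait.
snps = {
    "snp1": [0, 0, 1, 1, 2, 2],
    "snp2": [2, 2, 1, 1, 0, 0],
    "snp3": [0, 2, 1, 0, 1, 2],
}
trait = [1, 2, 3, 4, 5, 6]
selected = top_correlated_markers(snps, trait, k=2)
print(selected)
```

Ranking on the absolute value keeps markers whose minor allele lowers the trait (snp2 here) as well as those that raise it.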

Multi-Marker Simulation Test

• Simulated a causal network consisting of the following nodes:
  • 5 gene expressions (E1-E5)
  • Each gene expression controlled by 3 SNPs (18 correct SNPs)
  • 82 Noise SNPs
  • Trait
  • Confounder

• Simulated edges:
  E1 → E2
  E1 → E3
  E3 ← HiddenConfounder → E4
  E4 → Trait
  Trait → E5

• Can NEO retrieve the correct SNPs and the correct edge orientations?

Simulation Results

• A red or orange square in position (i,j) indicates that a trait in row i causally affects the corresponding trait from column j.

• NEO successfully reproduced the simulated orientations.

• All 18 SNPs were identified.

NEO and WGCNA Software Available Online

• R software, tutorials, and simulated and real data sets for NEO and WGCNA can be found online:

• www.genetics.ucla.edu/labs/horvath/aten/NEO/

• http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork

• Google search
  • weighted co-expression network
  • “WGCNA”
  • “co-expression network”

Summary: WGCNA & NEO

• WGCNA is a systems genetics approach that is useful for complex disease analysis
  • Genetic signal is weak for individual genes, problematic for traditional DNA-level analyses
  • RNA level data analysis may identify clusters of genes corresponding to trait-related pathways
  • Helps alleviate multiple testing problem
  • Focusing on clusters of genes rather than individual genes improves information quality from microarray data

• WGCNA is also useful for inter-species comparison of gene expression levels

• NEO can estimate edge orientation in a weighted gene co-expression network if relevant genetic marker data is available

• NEO can also perform marker selection


Acknowledgements

• WGCNA developed by Bin Zhang and Steve Horvath

• NEO developed by Jason Aten and Steve Horvath

• Lab members: Peter Langfelder, Jun Dong, Tova Fuller, Ai Li, Wen Lin, Wei Zhao

• Collaborators: Jake Lusis, Tom Drake, Anatole Ghazalpour