Imputation & Meta-analysis Alexander Teumer OHBM – 26/06/2016

PowerPoint Presentation Materials...Title PowerPoint Presentation Author Created Date 5/26/2016 9:35:46 AM



Imputation & Meta-analysis

Alexander Teumer

OHBM – 26/06/2016


Imputation

Why do we impute?

To allow comparison with other samples on other chips

To fine map, i.e. run association at variants we have not genotyped

To improve call rate, i.e. increase the number of variants available for poorly genotyped samples (not ideal)

To identify genotyping errors

[Figure: genotypes from array system A and array system B aligned against a reference panel; labels: DNA, SNP, recombination hotspots]


A quick conceptual theory of imputation

Start with some genotype data

Using the LD structure within your data, phase your data to reconstruct the haplotypes


A quick conceptual theory of imputation

Compare your phased data to the references

Use the LD structure to impute the missing genotypes

(Marchini, J. and Howie, B. 2010. Nat Rev Genet 11 499-511.)


Choose Genotyping Array

Ideally use a chip designed for imputation

All chips have data sheets. If you are obtaining genotyping, make sure you check the sheet before choosing the chip!

Also look for papers on imputation using your preferred chip and ask authors who have published using that chip

Check the manifests and make sure your favourite genes are covered!

Some arrays are less suitable for imputation

ExomeChip (covers almost only the exome, most SNPs not in the reference panel)

Cardio-MetaboChip (selected regions only)

...but some have tag SNPs added

Illumina HumanCoreExome BeadChip (exome + 300k genome-wide tag SNPs)


Easiest (and best) way of imputing

Use the Imputation Servers

Michigan: https://imputationserver.sph.umich.edu/ (Minimac3)

Sanger: https://imputation.sanger.ac.uk/ (PBWT)


Step 1 – Choose phasing method

ShapeIT

Well established method

Phased data not downloadable from imputation server (cannot be re-used for fast re-imputation with a different reference panel)

Eagle v2.0

New algorithm

Very fast and accurate

HAPI-UR

Available on Michigan server only

No reference-based phasing algorithm

This phasing does not take into account any sources of information other than the input genotypes, i.e. no family data


Step 2 – Pick your references

HapMapII

2.4M SNPs

Well imputed and well known set

Good for first imputation run – not commonly used anymore

1KGP aka 1000G

Phase1v3: ~37M SNPs+INDELs, of which ~11M will be usable

1,092 individuals

Phase3v5: ~82M SNPs+INDELs, of which ~12M will be usable

2,504 individuals

Haplotype Reference Consortium (HRC)

Release 1.1 (full panel only usable through the imputation servers)

39M SNPs (MAC≥5), 32,470 individuals (pan European + 1000G)


Step 2 – Pick your references

All Ethnicities vs Specific Ethnicity panels

Consider what the consortiums/collaborators you want to work with want to do

Case by case basis

All ethnicities panels are larger (and slower) – but often requested by collaborators

Can be more accurate – esp. for a ‘cosmopolitan US’ sample

May not improve imputation for homogeneous populations or those with strong founder effects


Main Differences of Imputation Servers

Michigan: Minimac3 very precise

Sanger: PBWT very fast

Chr X imputation coming soon for imputation servers

Durbin et al., Poster 2015


Genotype data - Make your data clean!

Convert to PLINK binary format

Exclude samples with:

Excessive missingness (>5%)

Reported vs. genotyped sex mismatch

Unusually high/low heterozygosity

Check for ancestry outliers (PCA/MDS) or related/duplicate samples

Exclude SNPs with:

Excessive missingness (>5%)

Monomorphic SNPs (may represent genotyping errors)

Genotyping platform dependent: low MAF (<1%), e.g. for HumanCoreExome or old array types

HWE violations (~P < 10^-4)

Mendelian errors (if family data are available)

Duplicate chromosomal positions
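The SNP exclusion criteria above can be sketched as a simple filter. This is only an illustration under stated assumptions: the `SnpStats` record and the `passes_snp_qc` function are hypothetical names, the per-SNP statistics are assumed to have been computed beforehand (e.g. by PLINK), and the default thresholds are the ones quoted on the slide.

```python
from dataclasses import dataclass


@dataclass
class SnpStats:
    """Hypothetical per-SNP summary, assumed to be computed upstream."""
    snp_id: str
    missing_rate: float  # fraction of samples without a genotype call
    maf: float           # minor allele frequency
    hwe_p: float         # P-value of the Hardy-Weinberg equilibrium test


def passes_snp_qc(snp, max_missing=0.05, min_maf=0.01, min_hwe_p=1e-4):
    """Return True if the SNP survives the slide's exclusion criteria."""
    if snp.missing_rate > max_missing:  # excessive missingness (>5%)
        return False
    if snp.maf == 0.0:                  # monomorphic SNP
        return False
    if snp.maf < min_maf:               # low MAF (platform dependent)
        return False
    if snp.hwe_p < min_hwe_p:           # HWE violation
        return False
    return True


snps = [
    SnpStats("snp_ok", 0.01, 0.20, 0.50),
    SnpStats("snp_missing", 0.10, 0.20, 0.50),
    SnpStats("snp_rare", 0.01, 0.005, 0.50),
    SnpStats("snp_hwe", 0.01, 0.20, 1e-6),
]
kept = [s.snp_id for s in snps if passes_snp_qc(s)]
print(kept)
```

Mendelian errors and duplicate positions need pedigree and map information and are not covered by this sketch.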


Genotype data - Make your data clean!

Align DNA strand to reference panel: usually forward (+) strand

Problem: strand-ambiguous SNPs (A/T and C/G SNPs)

Remember: DNA is composed of 2 antiparallel strands. The complement of an A is a T and the complement of a C is a G, which makes it difficult to work out whether the genotypes are strand-aligned to the references.

The (+) and (–) strand is an arbitrary construct that changes between builds and sources.

Check allele frequency, or drop these SNPs and re-impute them…

Align SNP positions to the same genome build

Imputation servers require GRCh37 (hg19)

Convert using LiftOver (http://genome.ucsc.edu/cgi-bin/hgLiftOver)


Format input file

VCF format required

One file per chr for Michigan, one file containing all chr for the Sanger imputation server

Use PLINK ≥1.9 or PSEQ to convert PLINK files to VCF

Consider sample IDs: FID, IID or both (PLINK)

Ensure chromosomes are numbered 1...22, X, Y (without prefix) (PSEQ)

Match alleles and coordinates to GRCh37 (+) strand; Sanger: also match ref alleles

checkVCF tool; PLINK: use options --a2-allele and --real-ref-alleles to set reference alleles

Sort SNPs by genomic position (per chromosome)

VCFtools

[Figure: example VCF file with the comments, info and genotypes sections labelled]


Output VCF

Comments, info and genotypes in one file

One line per variant

One column per person

Allele dosage info and genotype probabilities incl. imputation uncertainties


But I’m going to assume you have the time, computational capacity, storage space and desire to do this yourself…


Genotypes and reference panel

Sample and SNP QC are the same as for the imputation server approach

Download reference panel

match strand and genome build positions with own genotypes

HapMapII (NCBI build 36 / hg18 coordinates)

HapMapIII (NCBI build 36 / hg18 coordinates)

1000G phase1 release 3 (NCBI build 37 / GRCh37 / hg19)

1000G phase3 release 5 (NCBI build 37 / GRCh37 / hg19)

build your own reference panel...

full HRC panel not publicly available for download


Phase your data

Choose pre-phasing program

ShapeIT http://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html

Eagle https://data.broadinstitute.org/alkesgroup/Eagle/

MaCH http://www.sph.umich.edu/csg/abecasis/MaCH/

Download genetic map and reference panel

genetic map contains recombination information

appropriate reference panel (optional) improves phasing (speed + precision)


Impute your data

Choose imputation program

Minimac/Minimac3

IMPUTE2

Beagle

Never use PLINK

Similar accuracy, features, time frame

Different output formats & downstream analysis options

Take care of chrX imputation, i.e. for PAR and non-PAR:

specific options (IMPUTE2)

split by sex (Minimac/Minimac3)

[Figure: imputation program popularity; categories: Mach/Minimac, Beagle, PLINK, Impute]


File formats

Different software require different file formats

Tools for conversion available

Phasing software        MaCH                 Eagle      ShapeIT
  Input file format     Merlin               PLINK      PLINK/GEN
  Output file format    Mach                 HAPS       HAPS

Imputing software       Minimac              Minimac3   IMPUTE2
  Input file format     Mach/HAPS            VCF        HAPS
  Output file format    DOSE                 VCF/DOSE   GEN

GWAS (dosage) software  mach2QTL/ProbABEL*   EPACTS*    SNPTEST2/QUICKTEST
  Input file format     DOSE                 VCF        GEN/VCF (SNPTEST2)

* supports analysis of related samples


Meta-analysis


Approaches to GWAS meta-analysis

Fixed effects

Most common - most powerful approach for discovery under the model that the true effect of each risk allele is the same in each data set

Inverse variance weighted most common

N weighted (z-score based) also common

Random effects

Uncommon - more appropriate when the aim is to consider the generalizability of the observed association and estimate the average effect size of the associated variant and its uncertainty across different populations

Bayesian

Very uncommon – mainly MAs from the Wellcome Trust
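The two fixed-effects schemes named above can be sketched in a few lines. This is a minimal illustration using the textbook formulas, not the code of any particular meta-analysis package; the function names are hypothetical.

```python
import math


def inverse_variance_meta(betas, ses):
    """Fixed-effects inverse-variance weighted estimate.

    Each study is weighted by 1 / SE^2. Returns (beta, se, z).
    """
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)
    return beta, se, beta / se


def sample_size_weighted_z(z_scores, sample_sizes):
    """Z-score based meta-analysis with sqrt(N) weights."""
    numerator = sum(math.sqrt(n) * z for z, n in zip(z_scores, sample_sizes))
    return numerator / math.sqrt(sum(sample_sizes))


# Two hypothetical studies with equal precision and equal sample size.
beta, se, z = inverse_variance_meta([0.1, 0.3], [0.1, 0.1])
z_n = sample_size_weighted_z([2.0, 2.0], [100, 100])
print(beta, se, z, z_n)
```

With equal standard errors the pooled effect is the plain average of the study effects, and two studies of equal size with Z = 2 combine to Z = 2·√2.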


Quality control of data going into MA is critical!

Exclude rare variants

Typically 1% or 0.5% MAF; with large samples (5000+) you can consider going lower

Exclude poorly imputed variants

Imputation accuracy metric depends on the software used

Mach/minimac/QUICKTEST r2

IMPUTE properinfo/info

BEAGLE ovarimp

Typically calculated as observed variance / expected variance; can empirically go over 1, usually capped at 1

Threshold ~0.6
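The "observed variance / expected variance" idea can be illustrated as follows. This is a simplified sketch, assuming the expected variance is the binomial variance 2p(1-p) implied by the estimated allele frequency; the exact formulas used by each imputation program differ, and `dosage_r2` is a hypothetical helper name.

```python
def dosage_r2(dosages):
    """Ratio of observed dosage variance to the variance expected
    under Hardy-Weinberg proportions, capped at 1."""
    n = len(dosages)
    mean = sum(dosages) / n
    p = mean / 2.0                       # estimated allele frequency
    expected = 2.0 * p * (1.0 - p)       # variance if genotypes were known
    if expected == 0.0:                  # monomorphic: no information
        return 0.0
    observed = sum((d - mean) ** 2 for d in dosages) / n
    return min(observed / expected, 1.0)


# Well-imputed variant: dosages sit at 0, 1 and 2.
print(dosage_r2([0.0, 1.0, 1.0, 2.0]))
# Uninformative imputation: every sample gets the same dosage.
print(dosage_r2([1.0, 1.0, 1.0, 1.0]))
# Ratio empirically above 1 is capped at 1.
print(dosage_r2([0.0, 0.0, 2.0, 2.0]))
```

A variant would then be kept if the value exceeds the chosen threshold (~0.6 on the slide).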


Important considerations for MA

Duplicate QC and meta-analysis sites

Always check the input data

Column header, beta/SE/P-value distribution, allele frequencies,...

Use GWAtoolbox or EasyQC R-packages

Harmonize variant ID CHR:POS:TYPE (SNP/INDEL)

Make sure you double check meta-analysis results

QQ plots

Manhattan plots

Allele frequencies (min/max per SNP)

Heterogeneity (HetPVal / I²)

Compare inverse-variance vs. z-score based meta-analysis results

Consider allowing cohorts to ignore variants with MAF <0.5% and low r² – it will save you a lot of time and save a lot of storage space!


Input file QC: GWAtoolbox

Checks consistency and distribution of input file columns

Compares beta distribution across cohorts

Harmonizes input files (header + separator)

Corrects for genomic control and calculates effective N

Input script like METAL


GWAS Meta-Analysis

Most commonly used software for common variant analysis: METAL

Automatic strand flipping of non-ambiguous SNPs

Calculation of max/min/mean allele frequency

Inverse variance & sample size weightings

Automatic genomic control correction

Heterogeneity tests

Most commonly used software for rare variant analysis:

RAREMETAL

seqMeta (R-package)


METAL

http://www.sph.umich.edu/csg/abecasis/metal/

Documentation can be found at the METAL wiki:

http://genome.sph.umich.edu/wiki/Metal_Documentation


METAL

Requires results files

‘Script’ file

Describes the input files

Defines meta-analysis strategy

Name output file


Steps

1. Check format of results files

1.1 Ensure all necessary columns are available

1.2 Modify files to include all information

2. Prepare script file

2.1 Ensure headers match description

2.2 Cross-check that each results file matches its PROCESS name

3. Run metal

3.1 metal < metal_script_file > metal_run.log

3.2 Output: result file + info file

3.3 Check log for errors and warnings


Running METAL

Input file columns and example rows:

SNPID chr position coded_all noncoded_all strand_genome beta SE pval AF_coded_all HWE_pval callrate n_total imputed used_for_imp oevar_imp
rs10 7 92221824 C A + -0.484216 0.240421 0.0440064 0.942 1 1 2004 1 0 0.346707
rs1000000 12 125456933 G A + -0.117195 0.0814519 0.150201 0.7925 0.115932 1 2004 1 0 0.993797
...

METAL script file:

# define column names
MARKER SNPID
ALLELE coded_all noncoded_all
EFFECT beta
STDERR SE
PVALUE pval
FREQLABEL AF_coded_all

# set genomic control on/off
GENOMICCONTROL ON

# filter result file lines
ADDFILTER SE > 0
ADDFILTER pval > 0

# set weights to inverse-variance
SCHEME STDERR

# define input file separator
SEPARATOR COMMA

# add custom variable to calculate N total
CUSTOMVARIABLE TotalSampleSize
LABEL TotalSampleSize as n_total

# set prefix of output filename
OUTFILE Meta-results_invvar .txt

# define input files
PROCESS results1.txt
PROCESS results2.txt

# start meta-analysis and calc heterogeneity
ANALYZE HETEROGENEITY


Output

info file:

# This file contains a short description of the columns in the
# meta-analysis summary file, named 'Meta-results_invvar1.txt'
# Marker - this is the marker name
# Allele1 - the first allele for this marker in the first file where it occurs
# Allele2 - the second allele for this marker in the first file where it occurs
# Freq1 - weighted average of frequency for allele 1 across all studies
# FreqSE - corresponding standard error for allele frequency estimate
# Effect - overall estimated effect size for allele1
# StdErr - overall standard error for effect size estimate
# P-value - meta-analysis p-value
# Direction - summary of effect direction for each study, with one '+' or '-' per study
# HetChiSq - chi-squared statistic in simple test of heterogeneity
# df - degrees of freedom for heterogeneity statistic
# HetPVal - P-value for heterogeneity statistic
# TotalSampleSize - custom variable 1
# Input for this meta-analysis was stored in the files:
# --> Input File 1 : results1.txt
# --> Input File 2 : results2.txt

result file:

MarkerName Allele1 Allele2 Freq1 FreqSE Effect StdErr P-value Direction HetChiSq HetDf HetPVal TotalSampleSize
rs2326918 a g 0.8545 0.0053 0.0638 0.091 0.4836 +- 0.483 1 0.4873 2412
rs10760160 a c 0.5164 0.006 -0.0492 0.0625 0.431 -- 0.007 1 0.9324 2412
SNP1-152986 a c 0.3796 0 -0.147 0.3169 0.6427 ?- 0 0 1 408
...


Common Errors

###########################################################################

## Processing file 'results3.txt'

## WARNING: Bad alleles for marker '5:92717972:SNP', expecting 'a/g' found 'c/g'

## WARNING: Bad alleles for marker '9:110286832:SNP', expecting 'a/g' found 'a/c'
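Warnings like these are easy to miss in a long log, so a small script can collect them. The sketch below assumes the log lines have exactly the shape shown above; the regular expression and the function name `bad_allele_warnings` are illustrative and not part of METAL.

```python
import re

# Pattern for the "Bad alleles" warning format shown on the slide.
BAD_ALLELES = re.compile(
    r"WARNING: Bad alleles for marker '([^']+)', "
    r"expecting '([^']+)' found '([^']+)'"
)


def bad_allele_warnings(log_text):
    """Return (marker, expected, found) tuples for each matching log line."""
    hits = []
    for line in log_text.splitlines():
        match = BAD_ALLELES.search(line)
        if match:
            hits.append(match.groups())
    return hits


example_log = """## Processing file 'results3.txt'
## WARNING: Bad alleles for marker '5:92717972:SNP', expecting 'a/g' found 'c/g'
## WARNING: Bad alleles for marker '9:110286832:SNP', expecting 'a/g' found 'a/c'
"""
for marker, expected, found in bad_allele_warnings(example_log):
    print(marker, expected, found)
```

In practice the text would be read from the log file produced by the METAL run (metal_run.log in the earlier example).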


Questions?

GWAS Catalog: http://www.ebi.ac.uk/gwas/home


Appendix


Phase your data - Details

Phasing programs “use a hidden Markov model (HMM) to model the haplotypes underlying G as an imperfect mosaic of haplotypes in the set H. Compatible haplotypes are sampled for G using the forward-backward algorithm for HMMs”

Problem: complexity is quadratic and scales with sample size and number of SNPs: O(MK²)

Delaneau, O. et al. 2013. Nat Meth 10 5-6.


Phase your data

Currently best program for phasing is SHAPEIT2

Delaneau, O., Zagury, J.-F. et al. 2013. Nat Meth 10 5-6.

Avoids the quadratic bottleneck by:

“collapsing all K haplotypes in H into a graph structure, Hg, and then carrying out the HMM calculations on this graph.”

Sampling pairs of haplotypes

Transition accuracy is improved by drawing on surrogate family members


Phase your data

SHAPEIT2

Transition accuracy is improved by drawing on surrogate family members

Restricts each phasing update to a set of k template haplotypes chosen separately for each individual at each iteration

The k templates are chosen by computing Hamming distances between an individual's current sampled haplotypes and each possible template haplotype.

The k templates with the smallest distances are referred to as “surrogate family members”
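The template selection described above can be pictured with a toy example. This is only a conceptual sketch, not SHAPEIT2's implementation: haplotypes are represented as strings of 0/1 alleles, and the function names are made up for illustration.

```python
def hamming(hap_a, hap_b):
    """Number of positions at which two equal-length haplotypes differ."""
    return sum(a != b for a, b in zip(hap_a, hap_b))


def nearest_templates(current, candidates, k):
    """Pick the k candidate haplotypes closest to the current haplotype."""
    return sorted(candidates, key=lambda hap: hamming(current, hap))[:k]


current_hap = "0101"
panel = ["0101", "1111", "0100", "1010"]
surrogates = nearest_templates(current_hap, panel, k=2)
print(surrogates)
```

Here the two templates at distance 0 and 1 are selected, while those at distance 2 and 4 are ignored for this update.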


SHAPEIT2

https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html

Can multi-thread

Note: this is a genetic map based on recombination (cM), not a physical map (BP)!


Recommendation

Minimac3

Lower memory and more computationally efficient implementation

References are in a custom format (m3vcf) that can handle very large references with lower memory

Can read in the SHAPEIT2 references

Output is vcf format

Includes both SNP and individual IDs, which makes it the safest format to avoid errors

Downstream analysis with RAREMETALWORKER or other vcf input tools


vcf format


Imputing in minimac3

Can impute X

Impute males & females together for the pseudoautosomal region (PAR)

Separately for the non-PAR


Output

Comments, info and genotypes in a single file

One line per variant

One column per person


Output


The comments


The info


The genotypes


A practical example

http://labs.med.miami.edu/myers/LFuN/LFuN.html

post-mortem gene expression in ‘brain’ tissue

N=193


Imputation

Chromosome 22 only – HapMapII b36r22

MaCH phasing

(In real life with a sample this size, include the reference in the phasing)

Minimac imputation

Run twice

Once without strand alignment (badImp)

Once with strand alignment (goodImp)


How do we know there was no strand alignment from the output?

No way of telling from the phasing log

B/c we didn’t include a reference

Imputation log is FULL of errors

     rs915677-T   rs915677-R   rs9617528-T   rs9617528-R
A    0            .08          .72           0
C    .91          0            0             .17
G    0            .92          .28           0
T    .09          0            0             .83


Plot the r2 for the 2 imputation runs

How do they compare?

badImp 17,908/39,905 with r2 >= .6

goodImp 24,685/39,905 with r2 >= .6

Still quite bad b/c of small N

Should have compensated by including ref data in the phasing step

In a QIMR dataset N=19k 32,296/33,815


[Figure: three panels labelled Bad Imputation, Better Imputation and Good Imputation]


Analyses…

DO NOT ANALYSE HARDCALL GENOTYPES!!!!!!

Analyse the dosage or probabilities as this will account for the imputation uncertainty
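A short sketch of why the dosage carries more information than a hard call. It assumes the three genotype probabilities are ordered as (0, 1, 2 copies of the counted allele); the function names are illustrative.

```python
def dosage(p_aa, p_ab, p_bb):
    """Expected number of B alleles given the genotype probabilities."""
    return p_ab + 2.0 * p_bb


def hard_call(p_aa, p_ab, p_bb):
    """Most likely genotype (0, 1 or 2 B alleles); uncertainty is discarded."""
    probs = (p_aa, p_ab, p_bb)
    return probs.index(max(probs))


# An uncertain imputed genotype: the hard call says "1",
# but the dosage of 1.3 reflects the weight on the BB genotype.
uncertain = (0.1, 0.5, 0.4)
print(hard_call(*uncertain), dosage(*uncertain))

# A confidently imputed genotype: both summaries agree.
confident = (0.0, 0.0, 1.0)
print(hard_call(*confident), dosage(*confident))
```

Regressing the phenotype on the dosage (or using the full probabilities) lets poorly imputed genotypes contribute less than well imputed ones.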


Analyses in RAREMETALWORKER

Simple phenotype file formats

Can account for relatedness & twins

Can use GRM to account for relatedness (memory+++)

Ped file

(no header)

Dat file

raremetalworker --ped your.ped --dat your.dat --vcf your.vcf.gz --prefix example

raremetalworker --ped your.ped --dat your.dat --vcf your.vcf.gz --kinPedigree --prefix example


Files to practice with

Detailed cookbooks are available:

Minimac http://genome.sph.umich.edu/wiki/Minimac:_GIANT_1000_Genomes_Imputation_Cookbook

Minimac3 http://genome.sph.umich.edu/wiki/Minimac3_Imputation_Cookbook

Impute2 http://genome.sph.umich.edu/wiki/Impute2:_GIANT_1000_Genomes_Imputation_Cookbook

But really and truly consider using the Imputation Servers so that you can access the HRC references!

https://imputationserver.sph.umich.edu/

https://imputation.sanger.ac.uk/


Meta-analysis

(extended)


Setting up a Meta-analysis

Managing the personal and social connections is extremely important

meta-analyses are usually unfunded

Time line is too short and budget is too small for a grant

Meta-analyses do not work top down – to be successful they MUST be led by analysts who know what they are doing

Evangelou, E. 2013. Nat Rev Genet 14 379-389.


Columns METAL uses

SNP

Effect allele & non-effect allele

Frequency of effect allele

OR/Beta

SE [for standard error meta-analysis]

P-value [for Z-score meta-analysis]

IMPORTANT – you cannot use FDR-controlled or adaptively permuted P-values!

N/weight column [for Z-score meta-analysis]


Effect allele

Differs for different programs and analysis options

Minor/major allele

Alphabetical

1st listed

DO NOT ASSUME YOU KNOW. ALWAYS DOUBLE CHECK!
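Once the effect allele of each study is known, results can be brought onto a common coding. The sketch below assumes both allele pairs are already on the same strand and shows only the bookkeeping: if the effect and other alleles are swapped relative to the reference coding, the sign of beta and the allele frequency are flipped. The function name is hypothetical.

```python
def align_to_reference(effect_allele, other_allele, beta, eaf,
                       ref_effect, ref_other):
    """Return (beta, effect allele frequency) expressed for ref_effect.

    Assumes both allele pairs are reported on the same strand.
    """
    study = (effect_allele.upper(), other_allele.upper())
    reference = (ref_effect.upper(), ref_other.upper())
    if study == reference:
        return beta, eaf
    if study == reference[::-1]:
        # Same SNP, opposite coding: flip sign and frequency.
        return -beta, 1.0 - eaf
    raise ValueError("alleles do not match the reference coding")


# Study reports the effect for A, the reference coding counts G.
aligned_beta, aligned_eaf = align_to_reference("A", "G", 0.2, 0.3, "G", "A")
print(aligned_beta, aligned_eaf)
```

Comparing the aligned frequency with the frequency reported by other cohorts is a cheap way to catch a mislabelled effect allele.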


Genomic control

λ (lambda)

Median test statistic/ expected median test stat

Should be one
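The definition above can be written out directly. This sketch assumes the test statistics are 1-df chi-squared values (for example squared Z-scores), whose expected median under the null is about 0.4549; the function name is illustrative.

```python
import statistics

# Median of the chi-squared distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_control_lambda(chisq_stats):
    """Observed median test statistic divided by its expected median."""
    return statistics.median(chisq_stats) / CHI2_1DF_MEDIAN


# Squared Z-scores from a hypothetical (tiny) set of variants.
z_scores = [0.2, -0.5, 0.9, -1.1, 0.67]
lam = genomic_control_lambda([z ** 2 for z in z_scores])
print(round(lam, 3))
```

Values clearly above one suggest inflation, for example from population stratification or relatedness; the real calculation would use all variants genome-wide.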


Strand Ambiguous SNPs

Data from different studies are not always aligned the same way

Remember A<>T & C<>G

If a SNP is A/C then on the reverse strand it is T/G

No ambiguity: regardless of strand we know which allele is which

A/G, T/C & T/G are also non-ambiguous

METAL can align your non-ambiguous SNPs


Strand Ambiguous SNPs

Remember A<>T & C<>G

If a SNP is A/T then the reverse strand is T/A

AMBIGUOUS!!! Need to check allele freq to make sure samples are aligned

C/G SNPs are also ambiguous!

METAL can not align ambiguous SNPs
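The ambiguity rule is mechanical enough to check in code: a SNP is strand ambiguous exactly when its two alleles are complements of each other. The helper names below are illustrative.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_strand_ambiguous(allele1, allele2):
    """True for A/T and C/G SNPs, which look identical on both strands."""
    return COMPLEMENT[allele1.upper()] == allele2.upper()


def flip_strand(allele1, allele2):
    """Return the same SNP as read on the opposite strand."""
    return COMPLEMENT[allele1.upper()], COMPLEMENT[allele2.upper()]


for a1, a2 in [("A", "C"), ("A", "G"), ("A", "T"), ("C", "G")]:
    label = "ambiguous" if is_strand_ambiguous(a1, a2) else "not ambiguous"
    print(f"{a1}/{a2} -> reverse strand {'/'.join(flip_strand(a1, a2))}: {label}")
```

Ambiguous SNPs flagged this way are the ones whose allele frequencies need to be compared across cohorts before meta-analysis.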


Meta-analysis running

We will run meta-analysis based on effect size and on test statistic

For the weights of test statistic, I’ve assumed that the sample sizes are the same

METAL defaults to weight of 1 when no weight column is supplied


INPUT FILES

Results1.txt

Results2.txt


Step 2: script file: meta_run_file

# PERFORM META-ANALYSIS based on effect size and on test statistic
# Loading in the input files with results from the participating samples
# Note: Order of samples is …[sample size, alphabetic order,..]
# Phenotype is ..
# MB March 2013

# specifies column names
MARKER SNP
ALLELE A1 A2
PVALUE P
EFFECT log(OR)
STDERR SE

# processes two results files
PROCESS results1.txt
PROCESS results2.txt

# output file naming
OUTFILE meta_res_Z .txt

# conducts Z-based meta-analysis from test statistic
ANALYZE

# clears workspace
CLEAR

# changes meta-analysis scheme to beta + SE
SCHEME STDERR

# processes two results files
PROCESS results1.txt
PROCESS results2.txt

# output file naming
OUTFILE meta_res_SE .txt

# conducts effect size meta-analysis
ANALYZE


Larger Consortia

# PERFORM META-ANALYSIS on P-values

module load metal

metal << EOT

# Loading in the inputfiles with results from the participating samples

# Note: Order of samples is alphabetic

# Phenotype is WB

# 1. AGES_HAP

MARKER SNPID

ALLELE coded_all noncoded_all

EFFECT Beta

PVALUE Pval

WEIGHT n_total

GENOMICCONTROL ON

COLUMNCOUNTING LENIENT

PROCESS AGES_HAP.txt

# 2. ALSPAC_HAP

MARKER SNPID

ALLELE coded_all noncoded_all

EFFECT Beta

PVALUE Pval

WEIGHT n_total

GENOMICCONTROL ON

COLUMNCOUNTING LENIENT

PROCESS ALSPAC_HAP.txt

AND SO ON (in this case 40 files)


Running metal

metal < metal_run_file > metal_run.log

metal is the command

metal_run_file is the script file

METAL writes information about the run to standard out; with the redirect above it is captured in metal_run.log instead of appearing in the terminal

It will spawn 4 files:

2 results files: meta_res_Z1.txt + meta_res_SE1.txt

2 info files: meta_res_Z1.txt.info + meta_res_SE1.txt.info


Output you’ll see

Overview of METAL commands

Any errors

And your best hit from meta-analysis


Output


Don’t ask for stuff you don’t need (it’s annoying, and adding extra columns × 30M lines is a waste of space…)

You need:

SNP, CHR:BP, EffectAllele, NonEffectAllele, EA_Freq, Ntotal, Beta, SE, P, Rsq

Not


Some of the slides are courtesy of Sarah Medland