Characterization of exome sequencing data in population studies Exome Sequencing in the Rotterdam Study Jeroen van Rooij, PhD-Student Department of Internal Medicine, Department of Neurology SNPs and Human Diseases (15-11-2017)

Page 1: Characterization of exome sequencing data in population

Characterization of exome sequencing data in population studies

Exome Sequencing in the Rotterdam Study

Jeroen van Rooij, PhD-Student

Department of Internal Medicine, Department of Neurology

SNPs and Human Diseases

(15-11-2017)

Page 2: Characterization of exome sequencing data in population

ERGO: The Rotterdam Study

Genome-Wide Association Studies

Page 3: Characterization of exome sequencing data in population

Exome Sequencing of 3,000 samples in the Rotterdam Study

Funded by local and European grants (NGI-NCHA, NWO, BBMRI)

n = 3,000 samples from RS-I

Sequencing July 2011-2014; Nimblegen v2

3,000 RS-I samples were genotyped with Illumina’s Exome Array (overlap ~1,500)

Additional 1,000 ADSP samples

• CCDS (Sept 2009)
• miRBase (v14, Sept 2009)
• RefSeq (Jan 2010)
• 2,100,000 probes
• 30,246 coding genes
• 329,028 exons
• 710 miRNAs
• 36.5 Mb primary target
• 44.1 Mb capture target

Page 4: Characterization of exome sequencing data in population

Sequencing is done in-house, followed by standard BWA-GATK processing

Demultiplexing
• BclToFastQ (CASAVA)
• Chastity Filter

Alignment
• BWA (paired)
• SortSam, MarkDuplicates (picard)

Processing
• BaseQualityScore Recalibration, IndelRealignment (GATK)

Variant-Calling
• HaplotypeCaller
• VQSR
• VarEval

Analysis
• ANNOVAR, VCFtools
• PlinkSeq, SKAT, R
• Spotfire

Illumina HiSeq 2000 x2

Illumina Compute

Isilon Storage (150 TB)

Dell Compute (128 cores)

Second freeze released in November 2014; 2,628 samples and ~700,000 variants

Third freeze released in March 2016; added QC for consortia-based analysis

Page 5: Characterization of exome sequencing data in population

Annotating Dataset

After all QC procedures, the dataset is frozen and released to other researchers

Each variant is annotated using a range of databases

Use the genotyping matrix + annotations for various purposes:

• Normal metrics
• Backward genetics
• Population screening
• Forward genetics
• Population genetics

Page 6: Characterization of exome sequencing data in population

Determining “normal” dataset characteristics

[Figure: #SNVs versus depth of sequencing]

• Population properties
• QC (outliers, missing data)
• Technical feedback (i.e., depth of coverage needed)
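
One way to picture the outlier QC step is a per-sample check on the number of SNVs. The sketch below is only an illustration: the function name `flag_outliers`, the sample IDs, the counts, and the 3.5 cutoff on the modified z-score are all assumptions, not values from the Rotterdam Study pipeline.

```python
from statistics import median

def flag_outliers(snv_counts, threshold=3.5):
    """Return sample IDs whose SNV count is an outlier by modified z-score.

    snv_counts: dict mapping sample ID -> number of SNVs called (hypothetical input).
    threshold: cutoff on the modified z-score (assumed default, not from the slides).
    """
    values = list(snv_counts.values())
    med = median(values)
    # Median absolute deviation: robust spread estimate, insensitive to the outliers themselves
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread, nothing can be flagged
    return sorted(
        sample
        for sample, count in snv_counts.items()
        if abs(0.6745 * (count - med) / mad) > threshold
    )

# Toy data: four typical samples and one with far fewer SNVs (e.g. low depth)
counts = {"S1": 21500, "S2": 21800, "S3": 22100, "S4": 21950, "S5": 12000}
print(flag_outliers(counts))
```

A median-based rule is used here because a few failed samples would distort a mean and standard deviation; other rules would work equally well for a sketch like this.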

Page 7: Characterization of exome sequencing data in population

Cases versus controls in ERGO, or in combined datasets

Adding to GWAS: fine-mapping GWAS hits and associating rare variants

Page 8: Characterization of exome sequencing data in population

Determining variants in a (healthy) population

I.e., someone sequenced a case series and needs controls
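
As an illustration of using a population sample as controls, the sketch below compares carrier counts in a case series against carrier counts in a control set with a two-sided Fisher exact test written from scratch. The function name and all counts are hypothetical; the slides do not state which test is used.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    a, b: carriers and non-carriers among cases
    c, d: carriers and non-carriers among controls
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x carriers among cases, margins fixed
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum all tables that are as likely or less likely than the observed one
    return sum(
        p for p in (prob(x) for x in range(lo, hi + 1))
        if p <= p_obs * (1 + 1e-9)
    )

# Hypothetical counts: 4 of 50 cases carry the variant, 12 of 2,628 controls do
p_value = fisher_exact_two_sided(4, 46, 12, 2616)
print(f"p = {p_value:.3g}")
```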

Page 9: Characterization of exome sequencing data in population

Compare variants across populations

• 1000G (N=2,504)
• ExAC (N=60,706)
• ESP (N=6,503)
• GoNL (N=500)
• UK10K (N=3,781)
• Combined (N=73,994)

[Figure: variant comparison across the datasets above; extracted values]
326 114 61
288 151 307
280 159 829
236 204 972
129 311 5,098
121 319 5,368

Page 10: Characterization of exome sequencing data in population

And respective variant frequencies
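
A minimal sketch of such a cross-dataset comparison, assuming each dataset is available as a mapping from a variant key (chromosome, position, ref, alt) to an allele frequency. The dataset names, variant keys, and frequencies below are made up for illustration.

```python
def frequency_table(datasets):
    """Collect, per variant, the allele frequency in every dataset that reports it.

    datasets: dict mapping dataset name -> {variant key: allele frequency}.
    Returns: dict mapping variant key -> {dataset name: allele frequency}.
    """
    table = {}
    for name, variants in datasets.items():
        for key, freq in variants.items():
            table.setdefault(key, {})[name] = freq
    return table

def private_variants(table, dataset):
    """Variants seen in `dataset` and in no other dataset."""
    return sorted(key for key, freqs in table.items() if set(freqs) == {dataset})

# Hypothetical toy data: keys are (chromosome, position, ref, alt)
datasets = {
    "study": {("1", 1000, "A", "G"): 0.012, ("2", 5000, "C", "T"): 0.0004},
    "reference_a": {("1", 1000, "A", "G"): 0.015},
    "reference_b": {("1", 1000, "A", "G"): 0.010, ("3", 42, "G", "A"): 0.2},
}
table = frequency_table(datasets)
print(table[("1", 1000, "A", "G")])
print(private_variants(table, "study"))
```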

Page 11: Characterization of exome sequencing data in population

Population genetics; rare versus common SNVs
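
The split between rare and common SNVs can be sketched as a simple classification on minor allele frequency. The cutoffs used here (below 1% rare, 1% to 5% low-frequency, 5% and above common) are commonly used conventions and an assumption of this example; the slides do not state which thresholds were applied.

```python
def minor_allele_frequency(alt_count, total_alleles):
    """Minor allele frequency from an alternate-allele count."""
    af = alt_count / total_alleles
    return min(af, 1 - af)

def frequency_class(maf, rare_cutoff=0.01, common_cutoff=0.05):
    """Label a variant by MAF; both cutoffs are assumed defaults."""
    if maf < rare_cutoff:
        return "rare"
    if maf < common_cutoff:
        return "low-frequency"
    return "common"

# 2,628 diploid samples give 5,256 alleles; the allele counts are toy values
total = 2 * 2628
for alt_count in (1, 150, 2000):
    maf = minor_allele_frequency(alt_count, total)
    print(alt_count, round(maf, 5), frequency_class(maf))
```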

Page 12: Characterization of exome sequencing data in population

Different composition of common vs rare variants

Page 13: Characterization of exome sequencing data in population

Allowing us to characterize possibly damaging variants

Page 14: Characterization of exome sequencing data in population

And understand the layout of our genome

Count the number of variants per gene (correcting for gene size)

Score = Gene Size / #SNVs

Higher score means fewer variants

Compare tails; mutation tolerant vs intolerant genes
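
The score and the comparison of tails can be sketched as follows. Gene names, sizes, and SNV counts are placeholders, and skipping genes without any SNV is a choice made for this example only (the score is undefined when #SNVs is zero).

```python
def gene_scores(genes):
    """Score = gene size / number of SNVs, for genes with at least one SNV.

    genes: dict mapping gene name -> (size in bp, number of SNVs).
    """
    return {name: size / n_snvs for name, (size, n_snvs) in genes.items() if n_snvs > 0}

def tails(scores, k):
    """Return the k lowest-scoring and the k highest-scoring genes."""
    ranked = sorted(scores, key=scores.get)
    return ranked[:k], ranked[-k:]

# Placeholder genes: (size in bp, #SNVs)
genes = {
    "GENE_A": (3000, 60),
    "GENE_B": (1200, 4),
    "GENE_C": (900, 45),
    "GENE_D": (5000, 10),
    "GENE_E": (2000, 0),   # no SNVs observed, so no score is computed
}
scores = gene_scores(genes)
lowest, highest = tails(scores, k=1)
print(scores)
print("lowest score (most variants per bp):", lowest)
print("highest score (fewest variants per bp):", highest)
```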

Page 15: Characterization of exome sequencing data in population

TTN MBNL3 NCOR1 RAB3IP BTBD9 CLASP1 MLLT4 EPM2AIP1 SC5D VPS13C APOOL DCLRE1C NCAPH TGFB2 IKZF1 PIK3R1 BPTF RAB5B RNF41 CYB561D1

SYNE1 ANKRD12 CCDC88A NCOR2 NFIB CACNA1G CBFA2T2 ARNTL2 NHSL1 CEP170B GRSF1 KCNMA1 ZDHHC15 THSD7B SLC4A10 WAC COL18A1 GNE EMP2 STRIP2

NEB MKLN1 PHACTR2 BEND4 NEDD4L PPM1A REL MDM2 GPR107 TMEM59 NUCKS1 EPB41L1 DST IQGAP2 PLEKHA2 EIF5A2 GABRA4 LUZP2 LGSN NSUN4

DST KCNQ3 CACNA1B JMJD1C NR2C2 NFIC DLG2 COL11A1 DYSF MLLT3 SLC35D1 KLHL15 SYNGAP1 NPAS3 LRRTM2 RXRA KLF8 CEP63 SLITRK1 NR2F2

DGKH ATRX IKZF3 CACNG8 UBE2W DLG2 C1orf95 FAM168A KLHL3 ABI2 GOLGA7B OPA1 PARD3 DCAF17 ARHGEF40 KDM6A MAP3K4 RLIM GON4L ERG

TNRC6B MPRIP BCAT1 ATP2B4 UHMK1 GAS7 SLC4A4 TRIM13 GPR126 GRSF1 KLHDC10 ARPIN,C15orf38-AP3S2 SMAD3 EPB41 GIT2 ENC1 OCLN EIF4G1 LCOR UCHL5

OPRM1 RORA CLCN5 RREB1 ACVR1C KLF7 DMBT1 MAPKBP1 PCDH15 PEG10 NCOA7 PCDH9 KIF13A SMAD3 GRIA4 DYRK1A KCNJ2 MATR3 MON2 FGF5

OPRM1 APC TACC2 ROBO2 ZAN CACNA1G HNRNPR DOCK8 NEBL EPB41L5 RIMS2 SAP30L GABRA4 FAM160B1 TMEM194A PPFIA2 PUM1 SCRN1 PKNOX1 ATP8B3

PLEC MAST4 PPP1R12B TTN MGAT4A PTBP3 CACNA1D SLC45A4 GPR107 KIF21A NEURL1B CEP41 RUNX1 CTNND1 GOLGB1 STON1 PTPRE UBE3A IGF2 UTP15

CACNA1C ORAI2 SESN3 PAX5 NEDD4L KLF7 MLLT4 PLCB1 LY75 NRP2 SYT7 GNAO1 TRPM3 ZNF562 PAFAH1B1 TTC33 MBNL1 PLEKHG5 MAP3K7 UBE2K

AFF2 COL6A3 ZKSCAN1 GFPT1 USP15 CACNB4 NFIB IGSF3 MYH14 ORC4 RASGRF1 PTPN3 ANO6 EFCAB14 TP53INP1 LIN28B TPCN1 YAP1 LDLR ZMYND8

AFF2 SMAD2 IKZF3 BTBD9 POLH CHL1 FAM199X NEDD4 TMEM56,TMEM56-RWDD3 LAMP2 MLEC FGF12 KDM6A PREPL ZNF662 RGPD2 KIAA1598 SLC8A3 MRO RNF38

ENAH MTR ANKRD17 TSC1 IRAK3 PRDM2 SLC4A4 ZAK MYO16 ATP2B3 FOXP2 RAPGEF1 CDC73 IKZF1 KIAA0586 ACTR3 ALPK1 FBXO28 VAMP4 TLK2

NSL1 PTPRB ADAM22 CENPE ATP2A2 VAPB ACOX1 AGL PLAG1 SNX2 OSBPL3 PHKA1 AP3S2,C15orf38-AP3S2 RGS5 HLF VCL KITLG PVRL3 SIN3B NR5A2

RAB3B MON2 PHACTR2 IL17RA PDE11A TRAF6 MR1 PARP8 PDPK1 ETV1 SP3 DDX6 RASSF6 ZNF652 ZNF407 EIF4G1 SLC39A9 CPT1A MBNL1 TLE4

DCUN1D5 BAHCC1 CELF2 PAX5 PCDH11X CREB5 VAPB SECISBP2L UACA MOB3B SEZ6L POGZ PIK3CB PVR TBL1X CXADR PKNOX1 POLR1B PLB1 ISY1,ISY1-RAB43

LSM8 SPTBN1 IKZF3 NRXN3 PTPRD CD84 KCMF1 PITPNM3 KLF13 TMEM2 PTPRT NFYA PHF8 TLE3 ZNF75D HMHA1 CUL4B SORBS2 BIVM,BIVM-ERCC5 SYT14

ENTPD1 ITPR1 ADAM22 DGKH PCDH11X MTX3 HNRNPR PCGF5 PDE5A TBC1D5 ZDHHC3 DLG2 SMIM12 FBXO45 MKLN1 MCUR1 FAM13B DACH1 ZMIZ2 GRIP1

ABL2 TRPS1 LDLRAD4 CYLD PPM1A MTX3 EIF4EBP2 OSBPL6 SAMD8 FBXO32 MFAP3L SPATS2L DCP1A ZFAND5 MXD1 MAP2 IPO8 YAP1 FBXL20 BAG5

ERBB4 CACNA1I LDLRAD4 PTPN13 SYNRG GAB1 MAP4K4 HIPK3 CDHR1 CDH4 RALGPS1 ABI2 JADE1 GRIA1 ZMYM3 MEF2A ZDHHC20 BCL2L11 RBM47 MTSS1

ABL2 MDM4 TMOD2 PAX5 ABCA2 KIRREL ZFP14 EGLN1 CACNA1C KALRN EPB41L1 PSEN1 KRAS TNIK PPP1R12A ARHGAP19 EFNA5 WAC L1CAM FAM217B

RELN TFDP2 IL6ST ZNF678 ZYG11B TACC1 HNRNPR SYNJ2BP CPEB2 ADCYAP1R1 MAP4 TBC1D2B CCSER2 GNRHR CXXC4 TMEM236 CRKL WNK1 CDK14 DPYSL3

SCN8A CLCN5 ANKRD17 DNAL1 ILDR2 SPG11 IYD SENP6 PCLO PDE7A NRCAM TRIM44 MAVS COL25A1 ELK4 TOM1L2 LDOC1L MROH1 ZNF527 GCLM

DMXL1 NIPBL OBSCN SCN5A PDE4D CREBRF GGCX PCDH15 ZBTB20 ORC4 ZNF24 KIAA0586 CACNA1F NFATC4 PPP1R12B TBL1X FOXJ3 CNOT1 CXADR VPS8

WNK3 ARHGEF12 SRGAP3 BTBD7 PAPD5 DSTYK AGL SORT1 NOL9 MAPK1IP1L ZDHHC3 GNAL LIMCH1 USP8 GNAL ADCY10 AKT2 RAB5B UNKL LTBP4

                                           Genes in  Genes     Genes     Enrichment  Enrichment  Adjusted
KEGG pathway name                          pathway   observed  expected  ratio       P-value     P-value
MAPK signaling pathway                     268       23        5.7       4.1         1.4E-08     1.4E-06
Calcium signaling pathway                  177       15        3.7       4.0         5.6E-06     5.0E-04
Renal cell carcinoma                       70        9         1.5       6.1         1.6E-05     1.5E-03
Adherens junction                          73        9         1.5       5.8         2.2E-05     2.1E-03
Chronic myeloid leukemia                   73        9         1.5       5.8         2.2E-05     2.1E-03
Colorectal cancer                          62        8         1.3       6.1         4.6E-05     4.4E-03
Chagas disease (American trypanosomiasis)  104       10        2.2       4.6         7.0E-05     6.7E-03
Neurotrophin signaling pathway             127       11        2.7       4.1         8.1E-05     7.7E-03
Endocytosis                                201       14        4.2       3.3         9.9E-05     9.4E-03
Regulation of actin cytoskeleton           212       14        4.5       3.1         2.0E-04     1.9E-02
Hypertrophic cardiomyopathy (HCM)          83        8         1.8       4.6         4.0E-04     3.8E-02
Type II diabetes mellitus                  48        6         1.0       5.9         5.0E-04     4.8E-02
Tight junction                             132       10        2.8       3.6         5.0E-04     4.8E-02
ErbB signaling pathway                     87        8         1.8       4.4         5.0E-04     4.8E-02

Olfactory transduction                     388       71        9.1       7.8         1.5E-42     8.3E-41
Systemic lupus erythematosus               136       11        3.2       3.4         4.0E-04     2.2E-02
Protein digestion and absorption           81        8         1.9       4.2         6.0E-04     3.3E-02

500 genes with lowest gene density scores

500 genes with highest gene density scores
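
The enrichment ratio in the table is consistent with observed divided by expected genes (up to rounding of the displayed values). A generic over-representation test of this kind can be sketched with the hypergeometric distribution. The tool and the background gene set behind the table are not stated in the slides, so the background size of 23,500 used below is purely an assumption and the output is not expected to reproduce the table.

```python
from math import comb

def enrichment(pathway_size, observed, list_size, background_size):
    """Over-representation of a pathway in a gene list (hypergeometric upper tail).

    pathway_size: genes in the pathway
    observed: genes from the list that fall in the pathway
    list_size: genes in the submitted list (e.g. 500)
    background_size: genes in the reference set (assumption, not from the slides)
    Returns (expected, ratio, p_value).
    """
    expected = pathway_size * list_size / background_size
    ratio = observed / expected
    upper = min(pathway_size, list_size)
    # P(X >= observed) when drawing list_size genes without replacement
    numerator = sum(
        comb(pathway_size, k) * comb(background_size - pathway_size, list_size - k)
        for k in range(observed, upper + 1)
    )
    p_value = numerator / comb(background_size, list_size)
    return expected, ratio, p_value

# Pathway size, observed count and list size taken from the first table row;
# the background size is an assumed value
expected, ratio, p_value = enrichment(268, 23, 500, 23500)
print(f"expected={expected:.1f} ratio={ratio:.1f} p={p_value:.2g}")
```

An adjusted P-value would additionally need a multiple-testing correction over all pathways tested, which is left out of this sketch.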

Page 16: Characterization of exome sequencing data in population

C19orf45 TPTE PSMG4 ANP32C BCAR1 GAA OR4K15 HIST1H3A RIPK4 TMEM82 JSRP1 WDR18 SLC25A5 CCDC154 ESPNL PEBP4 OR5H14 TJP3 MSLN HNRNPCL1,LOC649330

SSTR4 C12orf45 JPH3 IDUA DBH HEXDC ISG15 CEACAM18 SERPINB10 PTCHD3 IL32 LCN12 MFSD12 LILRA6,LILRB3 KLC3 C4orf45 PLEKHN1 ADAD2 PRR22 APBA3

GPT OR2A7 EPPK1 TMEM132A RP1L1 OR52N2 SLC22A20 C17orf70 ADAM8 LBP RAPSN C17orf50 KRT40 LOC100996758,NPY4R OR51I1 OR4C15 OR1I1 ADAMTS7 CELA3A PRAMEF2

SGK223 OR5W2 WFIKKN1 BEST2 OMP KLHL38 OR10H3 OR14I1 AMH PDLIM7 PKD1L2 OR10T2 OR4N2 KRTAP13-4 OR5M9 SSPO CROCC C11orf40 LAMA5 OR4C16

TEX29 SYCE2 OR4N4 VSTM5 COL18A1 LKAAEAR1 TELO2 TMPRSS9 POTEG C9orf50 KRTAP10-7 MYPOP PRSS55 NPS DAPL1 NDUFAB1 HIST1H4K LHB GPR31 KRTAP22-1

OR10G3 GCAT C12orf60 B4GALNT2 TRPM5 AGRN CDC27 HIST1H3G PSG9 CCL24 C8G LIPK AQP7 OR8G5 SH3TC1 DEFB108B MYBBP1A TPSG1 KRTAP10-1 KRTAP19-7

OR51A4 ARVCF FNDC1 RAET1L OR8S1 ERCC2 SLC16A11 DNLZ HDGFRP2 DYNLL1 CLYBL TNFRSF18 CCDC27 AGXT FAM83E DEGS2 LMNTD2 TNFRSF4 CYP2A6 OR13C5

OR52E6 CPSF3L ARHGAP22 MPHOSPH6 SLC7A13 GFAP COL20A1 C1QTNF9B,C1QTNF9B-AS1 C19orf26 DPP7 PRSS3 LILRA2 OR4M2 HELZ2 FAM131C MUC5B CYP2W1 COL6A2 C2orf82 CHTF18

CLPP TUBA3C PRM1 FANK1 DACT2 IFNA10 OR52R1 GSX2 NOXA1 APOA1 ABCA7 BLVRB MAMDC4 C5orf45 IL17B MUC2 IGLL1 OR10AG1 COL9A3 KRTAP10-3

APOBEC3B NANS CELA2A TARM1 PKP3 FUZ SPNS3 PRR15 RPL19 MEGF6 TAS2R30 FAM129C CCDC9 AQP12B PIEZO1 SPANXC,SPANXD C6orf226 DBX1 OR5H6 QRFP

TMED6 OR10V1 PFKP HSPG2 LILRA3 ELSPBP1 NUDT1 PRODH2 GHSR SH2D2A KRT32 RADIL OR11H6 HIST1H2AB TEKT5 FGF22 OBP2A KRTAP12-4 MAP2K3 LOC653486,SCGB1C1

OR51A7 OR1A1 SAMD11 RESP18 MT4 HS3ST6 SPTBN5 MAD1L1 IFITM2 WDR90 PADI4 SLC38A10 OR1N2 MICALL2 SCNN1D HIST1H1C CEP131 PRSS41 KRTAP12-2 PDIA2

OR52E4 OR1L8 ATHL1 STXBP2 PYDC2 HIST1H2AH CDRT15 OGFR CAPN15 RRAS OR2T34 METTL11B FGF6 FTCD LYPD2 TAS1R3 TMEM255B C5orf52 PRM3 PRLH

OR6C68 OR4C46 QRFPR GPR142 NAPRT OR51M1 OR10A6 DEFB133 SLC25A47 SPEM1 CPAMD8 B3GNT8 ZNF414 FSCN2 OR10A2 KRTAP6-1 CYP2D6 CST1 C19orf71 SYT8

OR6N1 TAS2R46 RBMXL3 NUDT17 FERMT3 CYP2F1 OR5F1 OR10J5 SERPINH1 OR11H1 HIST1H1A TAS2R31 MRPL23 OR4C13 POLRMT HIST1H2BA LMF2 KRTAP21-2 LCE4A HIST1H4C

CEBPD ARID3C DNASE1L2 IFNL3 IRF7 EXOC3L1 MRPS26 OR4B1 RASSF7 RPS15 OR52M1 C9orf173 CCDC124 DEFB136 ZNRF4 GLTSCR2 TMEM175 OR8D4 OR51Q1 OR51B6

POR ERAS CLIC3 POLD1 CHPF COL6A1 RAET1E C5orf60 MUC6 FUT3 ACY3 MYOM2 MAPK15 OR4D6 EXOC3L4 RNF151 TMEM244 TPSAB1 CPZ GPRIN2

OR13D1 SERPINA9 NT5C3B SCRIB OR52I1 OPLAH FBN3 FFAR3 ECI1 RECQL4 PRODH EXD3 F12 OR5B3 DEFB106A,DEFB106B OR6B2 HNRNPCL2 CELA3B OR5P2 TPSD1

INTS1 GSC2 C2orf57 FAM209B LOC286238 KRTAP10-5 KRT84 ZP3 GALK1 GPR157 BPIFB3 GPR20 TMEM88B ZNF717 FAM179A OR8H3 MED16 TAS2R19 AZU1 KCNJ12,KCNJ18

CD207 TMEM86B NUDT16L1 KCNH3 DEFB119 CABP2 FPR1 CCDC88B POM121L12 CATSPER4 OR52N4 R3HDML ZC3H3 KRTAP10-10 OR11H12 ZNF593 OR10H5 GZMM KRTAP19-2 FRG1

OR7G1 MMP27 PNPLA7 APOBEC1 TRAP1 OR1B1 FLYWCH1 OR13H1 HIST1H1T OR5R1 PEMT ENTPD2 OR10X1 MUC20 HIST1H4E DHDH SLC34A3 HIST1H4H PTX4 TAS2R43

AATK TFR2 GNB1L HIST1H1B SDF2L1 FAM166A PRAMEF1 GSDMD TUBA3E SYCE3 CA9 IGFALS RTP5 OR5H15 OBP2A DEFB124 SYCE1L EGFL7 NDUFS7 OR4C3

LGALS3 ZNHIT2 CARD9 CYP4A22 SDCBP2 AGRP ASPG TBL3 CCDC64B CRYGD SOHLH1 SNAPC4 LRRC56 OR8B2 INSL3 MRGPRE OR51F1 TMEM247 FRG2B OR9G1,OR9G9

GPR108 KRT36 COX5B KRT83 AGBL1 IMP4 OR2T8 LENG1 SPANXN2 TSEN54 C1orf127 OR7A10 SLC6A18 OR1D5 SLC38A8 PRSS57 OR7G3 ASTL OR4A16 DEFB104A,DEFB104B

TAAR6 SLC22A25 TEKT1 CNN2 CBLC GPR152 BIRC7 PI3 RLN3 TSPO OR4A15 WBSCR27 OR1L3 ACTL7B OR2T4 CTU2 OR10G9 RPL3L HIST1H4B OR8U1,OR8U8

Page 17: Characterization of exome sequencing data in population

High and low genes in olfactory transduction

[Figure: olfactory transduction pathway; values as extracted: ~350, 6.91, 12, 2.63, 3, 4.52, 3, 1.11, 5, 1.66, 8, 3.36, 22, 1.38, 9, 1.70, 3, 0.79, 7, 2.38]

Page 18: Characterization of exome sequencing data in population

Reporting back incidental findings

ACTA2 DSG2 MSH2 NTRK1 SCN5A TNNI3

ACTC1 DSP MSH6 PCSK9 SDHAF2 TNNT2

APC FBN1 MUTYH PKP2 SDHB TP53

APOB GLA MYBPC3 PMS2 SDHC TPM1

BRCA1 KCNH2 MYH11 PRKAG2 SDHD TSC1

BRCA2 KCNQ1 MYH7 PTEN SMAD3 TSC2

CACNA1S LDLR MYL2 RB1 STK11 VHL

CFTR LMNA MYL3 RET TGFBR1 WT1

COL3A1 MEN1 MYLK RYR1 TGFBR2

DSC2 MLH1 NF2 RYR2 TMEM43

Page 19: Characterization of exome sequencing data in population

Compare and feedback on methods/databases

Databases compared: ClinVar, HGMD

• Benign variants
• Pathogenic mutations
• Disagreement

Retrospectively check health records
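
A minimal sketch of such a database comparison, assuming each database has been reduced to a mapping from variant ID to a "benign" or "pathogenic" label. The variant IDs and labels are invented, and real ClinVar or HGMD records carry more categories than this.

```python
from collections import Counter

def compare_calls(db_a, db_b):
    """Compare classifications for variants present in both databases.

    db_a, db_b: dict mapping variant ID -> label (e.g. "benign", "pathogenic").
    Returns (counts, disagreements): a Counter of outcomes and the sorted
    list of variant IDs on which the two databases disagree.
    """
    counts = Counter()
    disagreements = []
    for variant in sorted(db_a.keys() & db_b.keys()):
        if db_a[variant] == db_b[variant]:
            counts["both " + db_a[variant]] += 1
        else:
            counts["disagreement"] += 1
            disagreements.append(variant)
    return counts, disagreements

# Invented example records
clinvar_like = {"var1": "benign", "var2": "pathogenic", "var3": "benign", "var4": "pathogenic"}
hgmd_like = {"var1": "benign", "var2": "pathogenic", "var3": "pathogenic", "var5": "pathogenic"}

counts, disagreements = compare_calls(clinvar_like, hgmd_like)
print(dict(counts))
print("follow up in health records:", disagreements)
```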

Page 20: Characterization of exome sequencing data in population

Versions of ClinVar database of clinically relevant variants

[Figure axis: Clinical Relevance (1=benign, 3=unknown, 5=pathogenic)]

Page 21: Characterization of exome sequencing data in population

Take home messages

Large datasets help QC for smaller sets

Sequencing adds rare variants to GWAS

Use population samples as controls

Gene-level scores help rank relevant genes

Be critical and careful about which tool to use
