Human genetics and genomics: The Science of the 21st Century …and why it needs infrastructure

Ewan Birney


Page 1: Human genetics and genomics The Science of the 21 st  Century …and why it needs infrastructure

EBI is an Outstation of the European Molecular Biology Laboratory.

Ewan Birney

Human genetics and genomics: The Science of the 21st Century …and why it needs infrastructure

Page 2

Outline of the talk

• Who am I?
• A quick crash course in genomics
• Genomics 2000-2012
  • HapMap, 1000 Genomes, GWAS, ENCODE
• Route into medicine
• Elixir: Why do we need an information infrastructure?

Page 3

Who am I?

• Associate Director at European Bioinformatics Institute (EBI)
• Involved in genomics since I was 19 (almost 20 years!)
• Trained as a biochemist – most people think I am CS
• Analysed – sometimes led – human/mouse/rat/platypus etc. genomes

EBI is in Hinxton, South Cambridgeshire

EBI is part of EMBL, like CERN for molecular biology

Page 4

Crash Course in genomics for geeks

Page 5

DNA is a covalently linked polymer, nearly always found in anti-parallel, non-covalent pairs

Page 6

We represent it as strings, not worrying about one of the two paired polymers

>6 dna:chromosome chromosome:GRCh37:6:133017695:133161157:1
GCAGCAAGACAGAAGTGACTCATACATACAAGGGATCCCCAATAAGATTATCGGCAGATTTCTCATCAATAACTTTGGAGACCACAAAGCATTGAGCTGATATATTTAAAGTACTGAAAGAAAAAAAAATCTGACAACCAAGAATTCTATATCCATCAGAACTGCCCTTCAAAAGGGAGGGAGAAATGAAGACATTCTCAGATTTGAGAAGAAAGGAAAGAGAGAAGGGAGGGGAGGGGAGAGGAGGGGAGGGGAGGAGAGGAGAGGAGAGGGCACAGTGGCTCACGCCTGTAATCCTAGCACTTTGCAAGACTGAGGCCAGTGGAACACCTGAGGTCAGGAGATCGAGACCATCCTGGCTAACACGGTGAAACCCCGTCTCCACTAAAAATACAAAAAATTAGCCAGGCGTGGTGGCAGGTGCCTGTAGTTCCAGCTACTCAGGAGGCTGAGGCAGCAGAATGGCGTGAACTCGGGAGGTGGAGCTTGCAGTGAGCTGAGATTGCGCCCCTGCACTCCAGCCTGGGTGACAGAGTGAGACTCTGTCTCAAAAAAATAAAAAGTTTAAAAATATTTTAAAAAAAGAAAGAAAGAAGGGAG

1 monomer is called a “base pair” – bp
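
One strand can be left out of the string representation because the pairing rules (A with T, C with G) and the anti-parallel orientation let the partner strand be derived from it. A minimal Python sketch of that derivation follows; the function name `reverse_complement` and the short example fragment are illustrative choices, not something taken from the slides.

```python
# Pairing rules for DNA bases: A pairs with T, C pairs with G.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the partner strand, read in its own 5'-to-3' direction.

    The strands are anti-parallel, so the partner is the complement
    of each base with the order reversed.
    """
    return seq.translate(COMPLEMENT)[::-1]


if __name__ == "__main__":
    # First 12 bases of the sequence shown on the slide.
    forward = "GCAGCAAGACAG"
    print(forward)
    print(reverse_complement(forward))
```

Applying the function twice gives back the original string, which is why storing a single strand loses no information.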

Page 7

We can routinely determine small parts of DNA

1977-1990 – 500 bp, manual tracking

1990-2000 – 500 bp, computational tracking, 1D, “capillary”

2005-2012 – 20-100 bp, 2D systems (“2nd Generation” or NGS)

2012 - ?? – >5 kb, real time, “3rd Generation”

Fred Sanger, inventor of terminator DNA sequencing

Page 8

Costs have come down exponentially

Page 9

[Figure: bases (log scale, 1e5 to 1e14) against date (1980 to 2015), with series for capillary reads, assembled sequences and next gen. reads]

Volumes have gone up!

Similar explosion of image based methods

Page 10

A genome is all our DNA

Every cell has two copies of 3e9 bp (one from mum, one from dad) in 24 polymers (“chromosomes”)

E. coli: 4e6, Yeast: 12e6, Medaka: 0.9e9, White Pine: 20e9

Page 11

Human Genome project

• 1989 – 2000 – sequencing the human genome
  • Just 1 “individual” – actually a mosaic of about 24 individuals but as if it was one
  • Old school technologies
  • A bit epic
• Now
  • Same data volume generated in ~3 mins in a current large scale centre
  • It’s all about the analysis

Page 12

Reference Based Compression

0.5 to 0.01 Bits/Base

10x to 100x more compressed than current data storage
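
The idea behind reference based compression can be sketched in a few lines of Python: a read that matches the reference is stored as a position and a length, plus only the bases that differ. This is a toy illustration under simplifying assumptions (substitutions only, no quality scores, no bit-level packing), not the format used in practice. The names `encode_read`, `decode_read` and `storage_bytes` are hypothetical.

```python
def encode_read(reference, read, position):
    """Store a read as (start, length, mismatches) against the reference."""
    window = reference[position:position + len(read)]
    if len(window) != len(read):
        raise ValueError("read runs off the end of the reference")
    # Keep only the offsets where the read disagrees with the reference.
    diffs = [
        (offset, base)
        for offset, (ref_base, base) in enumerate(zip(window, read))
        if base != ref_base
    ]
    return position, len(read), diffs


def decode_read(reference, record):
    """Rebuild the read from the reference and the stored differences."""
    position, length, diffs = record
    bases = list(reference[position:position + length])
    for offset, base in diffs:
        bases[offset] = base
    return "".join(bases)


def storage_bytes(n_bases, bits_per_base):
    """Bytes needed to hold n_bases at a given average bits per base."""
    return n_bases * bits_per_base / 8


if __name__ == "__main__":
    reference = "GCAGCAAGACAGAAGTGACT"  # first 20 bases from the earlier slide
    read = "AAGACTGAAG"                 # matches position 5 with one substitution
    record = encode_read(reference, read, 5)
    print(record)
    print(decode_read(reference, record) == read)
    # Plain 2 bits/base encoding versus the range quoted on the slide.
    for bits in (2, 0.5, 0.01):
        print(bits, storage_bytes(3e9, bits))
```

A read identical to the reference costs only its position and length, which is where averages well below 2 bits per base can come from.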

Page 13

What happened next?

Page 14

We looked into human variation

3 in 10,000 bases between any two individuals are different (a bit more between Africans)

The similarity of a European to an African (any population) is only marginally smaller than European to European (2 or 3%). Only a minute amount of DNA is unique to any population
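
To put the "3 in 10,000" figure in absolute terms, the short Python sketch below scales it to the 3e9 bp genome size quoted earlier in the talk, and shows how a per-base difference rate would be measured on two aligned strings. The function names and the tiny example strings are illustrative assumptions.

```python
GENOME_BP = 3e9          # genome size quoted earlier in the talk
DIFF_RATE = 3 / 10_000   # differing bases between two individuals (from the slide)


def expected_differences(genome_bp, rate):
    """Expected number of differing positions across one genome copy."""
    return genome_bp * rate


def observed_difference_rate(seq_a, seq_b):
    """Fraction of positions that differ between two aligned, equal-length strings."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    mismatches = sum(a != b for a, b in zip(seq_a, seq_b))
    return mismatches / len(seq_a)


if __name__ == "__main__":
    # Roughly 900,000 differing positions at the slide's rate.
    print(round(expected_differences(GENOME_BP, DIFF_RATE)))
    print(observed_difference_rate("ACGTACGT", "ACGTACGA"))
```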

Page 15

… and associate this with traits or disease

(you can infer the majority of the genome by knowing a base about once every 5,000 to 10,000 bases – the experiments to look at this density are far cheaper than sequencing)

Page 16

Page 17

ENCODE Dimensions

[Figure: experiment matrix with axes Genome, Methods/Factors and Cells]

182 Cell Lines/Tissues

164 Assays (114 different ChIP)

3,010 Experiments

5 TeraBases

1716x of the Human Genome

Page 18

ENCODE Uniform Analysis Pipeline

Mapped reads from production (Bam)

Uniform Peak Calling Pipeline (SPP, PeakSeq)

IDR Processing, QC and Blacklist Filtering

Motif Discovery Stats, GSC enrichments, etc.

Signal Aggregation over peaks

Signal Generation (read extension and mappability correction)

Segmentation

[Figure: reproducibility between Rep1 and Rep2, with regions labelled poor reproducibility and good reproducibility]

Self Organising Maps

ChromHMM/Segway

Anshul Kundaje, Qunhua Li, Michael Hoffman, Jason Ernst, Joel Rozowsky, Pouya Kheradpour

Page 19

Fish Transgenics

• 7 strong positives, 6 negatives from Segmentation, 3 Blood
• 2 strong positives, 4 weak, 5 negatives from Naïve picks (Fisher’s Exact 0.0393)
• Wittbrodt Lab, Heidelberg

Page 20

Impact on Medicine

Page 21

3 big areas of impact for medicine

Page 22

Germ Line impact

• Everyone has differential risk of disease

• But the shift in risk is small

• Perhaps 1 to 2% have a striking change in risk to a serious disease (>10 fold) which is “actionable”

• This goes up to 3-4% if you count some less clinically worrying diseases

1:500 people have HCM

1:500 people have FH

Page 23

Precision cancer diagnosis

• Cancer is a genomic disease

• By sequencing a cancer you can understand its molecular form better

• Particular molecular forms respond to particular drugs

Page 24

Pathogens

• Sequencing provides a clear cut diagnosis of pathogens

• Can also be used to sequence environments (eg, hospitals)

• Immune systems for hospitals

Page 25

What can geeks do for biology?

Page 26

Biology is a big data science

• Not perhaps as big as high energy physics, but “just” one order of magnitude less (~35 PB vs ~300 PB)
  • Heterogeneity/diversity of data far larger
  • Lots and lots of details
  • Always have “dirty” data even in best case scenarios

• Big Data => scalable algorithms
  • CS ‘stringology’ – eg, Burrows-Wheeler transform

• Biological inference => statistics
  • “just” need to collapse a 3e9 dimensional problem into something more tractable

• We are often I/O not CPU bound
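
For readers who have not met the Burrows-Wheeler transform named above, here is a deliberately naive Python sketch of the transform and its inverse. It sorts all rotations of the text, which is fine for a demonstration but far too slow and memory hungry for genome scale data, where suffix array based constructions are used instead. The `$` terminator is an assumption of this sketch and must sort before every other character in the input.

```python
def bwt(text, terminator="$"):
    """Burrows-Wheeler transform via sorted rotations (naive version)."""
    if terminator in text:
        raise ValueError("input must not contain the terminator")
    text += terminator
    # All cyclic rotations of the text, in lexicographic order.
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    # The transform is the last column of the sorted rotation table.
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed, terminator="$"):
    """Invert the transform by rebuilding the sorted rotation table."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        # Prepend the transformed column to every row, then re-sort.
        table = sorted(ch + row for ch, row in zip(transformed, table))
    # The original text is the row that ends with the terminator.
    original = next(row for row in table if row.endswith(terminator))
    return original[:-1]


if __name__ == "__main__":
    encoded = bwt("GCAGCAAGACAG")
    print(encoded)
    print(inverse_bwt(encoded))
```

The transform tends to group identical characters into runs, which helps compression, and it is the basis of the compact indexes used by several short read aligners.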

Page 27

We need geeks!

• In Academia/Industry/Start-ups

• Convert physicists/statisticians to the 21st century science!

• Insist on maths skills in biology

Page 28

Why we need an infrastructure…

Page 29

Infrastructures are critical…

Page 30

But we only notice them when they go wrong

Page 31

Biology already needs an information infrastructure

• For the human genome
  • (…and the mouse, and the rat, and… x 150 now, 1000 in the future!) - Ensembl
• For the function of genes and proteins
  • For all genes, in text and computational – UniProt and GO
• For all 3D structures
  • To understand how proteins work – PDBe
• For where things are expressed
  • The differences and functionality of cells - Atlas

Page 32

..But this keeps on going…

• We have to scale across all of (interesting) life
  • There are a lot of species out there!
• We have to handle new areas, in particular medicine
  • A set of European haplotypes for good imputation
  • A set of actionable variants in germline and cancers
• We have to improve our chemical understanding
  • Of biological chemicals
  • Of chemicals which interfere with Biology

Page 33

How?

Fully Centralised

Pros: Stability, reuse, learning ease

Cons: Hard to concentrate expertise across all of life science; geographic and language placement; bottlenecks and lack of diversity

Fully Distributed

Pros: Responsive; geographically and language responsive

Cons: Internal communication overhead; harder for end users to learn; harder to provide multi-decade stability

Page 34

How?

Robust network with a strong hub

[Figure labels: General Hub (EBI), Node, “Domain Specific hub”, “National”]

Page 35

Overlapping Networks

Page 36

Other infrastructures needed for biology

• EuroBioImaging
  • Cellular Imaging and whole body imaging
• BioBanks (BBMRI)
  • We need numbers – European populations – in particular for rare diseases, but also for specific sub types of common disease
• Mouse models and phenotypes (Infrafrontier)
  • A baseline set of knockouts and phenotypes in our most tractable mammalian model
  • (it’s hard to prove something in human)
• Robust molecular assays in a clinical setting (EATRIS)
  • The ability to reliably use state of the art molecular techniques in a clinical research setting

Page 37

Questions? (you can follow me on twitter @ewanbirney)