Data science for pathogen genomic surveillance: predicting quantitative phenotype from genotype

Eric J. Ma, Islam T. M. Hussein, Jonathan A. Runstadler
Department of Biological Engineering, MIT

Analysis: HIV Drug Resistance

Cross-Validated Prediction Performance
• Good prediction performance: high correlation, low error.

Global Drug Resistance Prediction
• Model predictions are largely concordant with one another.

Important Amino Acids
• Match between expert-identified important positions and model predictions.

Position     10    33    88    84    47    46    54
Rel. Impt    47%   12%   5%    3%    2%    2%    2%
Database     Y     Y     Y     Y     Y     Y     Y

Temporal Emergence of Drug Resistance
• FDA approval dates (arrows): IDV (indinavir), 1996; FPV (fosamprenavir), 2003; DRV (darunavir), 2006.
• Drug resistance emerged a few years after approval.
• FPV and DRV have similar chemical structures.

Goal: Establish an example pipeline for genomic surveillance.
Input: HIV protease sequence & drug resistance profile.

Conclusions & Future Work
• Machine learning models can predict drug resistance from protein sequence.
• Genomic surveillance is able to capture the temporal rise of drug resistance.
• Applicable to other pathogens, given high-quality genotype-phenotype data.

Genomic Surveillance
• Zoonotic pathogens circulating in the wild may affect livestock and human health.
• Given sequence information, can we compute a pathogen “risk factor”?
• Given a computed risk, can we do preventative surveillance of zoonotic pathogens?

Influenza

Influenza Genome Structure

Segment   Gene   Length
1         PB2    2.4 kb
2         PB1    2.4 kb
3         PA     2.2 kb
4         HA     1.8 kb
5         NP     1.6 kb
6         NA     1.5 kb
7         M      1.0 kb
8         NS     0.9 kb

Reassortment
• Influenza is a zoonotic pathogen that has a broad tropic range.
• Segmented genome allows reassortment, accelerating viral evolution.
• A high polymerase mutation rate rapidly generates novel sequence diversity.

Introduction

Difficulties
• Necessity: The presence of a point mutation may “enhance” phenotype, but not necessarily cause “dangerous” phenotype levels (right).
• Epistasis: Mapping from genotype to phenotype.
• Experiments: Require assays to measure biochemical phenotype relevant to pathogenesis without infecting humans.
• Data: Lack of genotype-phenotype data.
• Biology: Novel sequence diversity generated through error-prone polymerase.

Gene(s)   Phenotype                   HT Assay
HA        receptor affinity           α(2,6) binding
NA        inhibitor resistance        α(2,6) cleavage
HA, NA    antigenic distance          hemagglutinin/neuraminidase inhibition
Pol       replication activity        polymerase assay
NS1       dampening innate immunity   IFN-β production

Significance: infection potential, treatability, disease burden, immunity.

[Figure labels: Genotypes; training data for ML; ML predicts phenotype; Risk Prediction (risk vs. phenotype panels for sequences MERI..., MKAK..., MNPN..., MDSN...); Application: Risk Profile, Informed Intervention.]

Vision
• Assay: Biochemical, quantitative measure relevant to pathogenesis
• Characterize: population diversity
• Machine learning: learn non-linear mapping from genotype to phenotype
• Model: quantitative risk profile

Experimentation Plan

Rational Library (ATGGTAACCA) → PacBio Sequencing → Polymerase Assay → Genotype-Phenotype → Machine Learning → Web Query

Sequence   PEU
MERIKEL    26
MERIREL    10
MDRIKEL    9
MERIKNL    15

• Rational sampling to cover polymorphic diversity. High throughput library construction & verification.
• Safe, scalable, standardized assay of RNA replication rate.
• Matched phenotype to genotype.
• Machine learning models to predict RNA replication rate.
• Open data release via web interface & API.
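
The poster does not say which machine learning models were used, so the following is only a minimal sketch of the general idea: learn a mapping from aligned protein sequence to a quantitative phenotype, then check it by cross-validation. It swaps in a deliberately simple stand-in, a k-nearest-neighbour regressor over Hamming distance, evaluated leave-one-out. The function names (`hamming`, `knn_predict`, `leave_one_out`, `mean_absolute_error`) are hypothetical, and the four sequence/PEU pairs from the Experimentation Plan are reused purely as toy data; four points are far too few to say anything about real performance.

```python
from statistics import mean


def hamming(a, b):
    """Number of positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def knn_predict(train, query, k=3):
    """Predict a phenotype as the mean over the k training sequences
    closest to `query` by Hamming distance (stand-in model, an assumption)."""
    nearest = sorted(train, key=lambda pair: hamming(pair[0], query))[:k]
    return mean(value for _, value in nearest)


def leave_one_out(data, k=3):
    """Hold out each (sequence, phenotype) pair in turn and predict it
    from the remaining pairs."""
    predictions = []
    for i, (seq, _) in enumerate(data):
        train = data[:i] + data[i + 1:]
        predictions.append(knn_predict(train, seq, k))
    return predictions


def mean_absolute_error(observed, predicted):
    """Average absolute difference between observed and predicted values."""
    return mean(abs(o - p) for o, p in zip(observed, predicted))


# Toy data: the sequence/PEU pairs listed on the poster, used for illustration only.
data = [("MERIKEL", 26), ("MERIREL", 10), ("MDRIKEL", 9), ("MERIKNL", 15)]

predictions = leave_one_out(data, k=3)
error = mean_absolute_error([value for _, value in data], predictions)
print(predictions, error)
```

With a realistically sized genotype-phenotype dataset, the same hold-out loop would be run in folds rather than one sequence at a time, and the nearest-neighbour stand-in would be replaced by whatever regression model is actually being evaluated.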

BE Retreat 2015 Poster

