Bioinformatics for High School

  • 8/8/2019 Bioinformatics for High School

    An Introduction to Bioinformatics (high-school version)

    Ying Xu

    Institute of Bioinformatics, and Biochemistry and Molecular Biology Department, University of Georgia

    [email protected]@bmb.uga.edu


    The Basics

    [Figure labels: cell, chromosome, genes; a stretch of genome sequence follows]

    ccgtacgtacgtagagtgctagtctagtcgtagcgccgtagtcgat

    cgtgtgggtagtagctgatatgatgcgaggtaggggataggatag

    caacagatgagcggatgctgagtgcagtggcatgcgatgtcgatg

    atagcggtaggtagacttcgcgcataaagctgcgcgagatgattg

    caaagragttagatgagctgatgctagaggtcagtgactgatgatc

    gatgcatgcatggatgatgcagctgatcgatgtagatgcaataagt

    cgatgatcgatgatgatgctagatgatagctagatgtgatcgatggt

    aggtaggatggtaggtaaattgatagatgctagatcgtaggta

    [Figure labels: genome and sequencing; protein; metabolic pathway/network]


    Bioinformatics (or computational biology)

    This interdisciplinary science is about providing computational support to studies on linking the behavior of cells, organisms and populations to the information encoded in the genomes

    Temple Smith


    Information Encoded in Genomes

    What information? And how to find and interpret it?

    Working molecules (proteins, RNAs) in our cells

    [Figure: genome sequence; bacterial cell]


    Information Encoded in Genomes

    How to find where protein-encoding genes are in a genome?

    A genome is like a book written in words consisting of 4 letters (A, C, G, T), and each protein-encoding gene is like an instruction about how the protein is made

    People have found that six-letter words (e.g., AAGTGC) have different frequencies in gene regions than in non-gene regions

    ccgtacgtacgtagagtgctagtctagtcgtagcgccgtagtcgatcgtgtgggtagtagctgatatgatgcgaggtaggggataggatagcaacagatgagcggatgctgagtgcagtggcatgc

    gatgtcgatgatagcggtaggtagacttcgcgcataaagctgcgcgagatgattgcaaagragttagatgagctgatgctagaggtcagtgactgatgatcgatgcatgcatggatgatgcagctgat

    cgatgtagatgcaataagtcgatgatcgatgatgatgctagatgatagctagatgtgatcgatggtaggtaggatggtaggtaaattgatagatgctagatcgtaggta
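
As a rough illustration of what "frequency of a six-letter word" means, the sketch below counts words in a short toy sequence. It assumes the words are read back to back (non-overlapping) from the first letter; the function name and the toy sequence are made up for this example, and real gene finders may count words differently.

```python
from collections import Counter


def word_frequencies(seq, k=6):
    """Return the relative frequency of each k-letter word in seq.

    Assumption: words are read back to back (non-overlapping), starting at
    the first letter; a trailing fragment shorter than k is ignored.
    """
    seq = seq.upper()
    words = [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]
    counts = Counter(words)
    total = len(words)
    return {word: n / total for word, n in counts.items()}


# Toy sequence made of ten six-letter words.
region = "AAAATTAAAATTAAAGACAAAATTAAAGACAAACACAAAATTAAATAGAAATAGAAAATT"
freqs = word_frequencies(region)
for word, f in sorted(freqs.items()):
    print(word, f"{f:.0%}")
```

Running the same count separately over known gene regions and known non-gene regions would give the two frequency tables that the comparison relies on.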


    Information Encoded in Genomes

    Frequency in genes (AAA ATT) = 1.4%; Frequency in non-genes (AAA ATT) = 5.2%

    Frequency in genes (AAA GAC) = 1.9%; Frequency in non-genes (AAA GAC) = 4.8%

    Frequency in genes (AAA TAG) = 0.0%; Frequency in non-genes (AAA TAG) = 6.3%

    ...

    AAAATTAAAATTAAAGACAAAATTAAAGACAAACACAAAATTAAATAGAAATAGAAAATT ...

    Is this a gene or non-gene region if you have to make a bet?


    Information Encoded in Genomes

    Preference model: for each 6-letter word X (e.g., AAA AAA),

      calculate its frequencies in gene and non-gene regions, FC(X) and FN(X)

      calculate X's preference value P(X) = log (FC(X)/FN(X))

    Properties:

      P(X) is 0 if X has the same frequency in gene and non-gene regions

      P(X) is positive if X has a higher frequency in gene than in non-gene regions; the larger the difference, the more positive the score

      P(X) is negative if X has a higher frequency in non-gene than in gene regions; the larger the difference, the more negative the score

    Gene prediction: given a DNA region, calculate the sum of the P(X) values for all 6-letter words X in the region; if the sum is larger than zero, predict gene; otherwise predict non-gene
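
The preference model can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tables hold just the three example words and percentages shown earlier, words are read back to back, the natural logarithm is used (the sign of the score does not depend on the base), and a word that never occurs in genes is given a score of minus infinity. The names `FC`, `FN`, `preference` and `predict` are chosen for this sketch.

```python
import math

# Example frequencies (as fractions) for three 6-letter words, taken from the
# percentages quoted earlier. A real table would cover all 4**6 = 4096 words.
FC = {"AAAATT": 0.014, "AAAGAC": 0.019, "AAATAG": 0.0}    # in gene regions
FN = {"AAAATT": 0.052, "AAAGAC": 0.048, "AAATAG": 0.063}  # in non-gene regions


def preference(word):
    """P(X) = log(FC(X) / FN(X)) for one 6-letter word X."""
    if FC[word] == 0:
        # Assumption: a word never seen in genes gets the lowest possible score.
        return float("-inf")
    return math.log(FC[word] / FN[word])


def predict(region, k=6):
    """Sum P(X) over the back-to-back k-letter words of region."""
    words = [region[i:i + k] for i in range(0, len(region) - k + 1, k)]
    score = sum(preference(w) for w in words)
    return ("gene" if score > 0 else "non-gene"), score


label, score = predict("AAAATTAAAGACAAAATT")
print(label, round(score, 2))
```

All three example words are more frequent in non-gene regions, so every P(X) here is negative and the toy region is predicted as non-gene.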


    Information Encoded in Genomes

    You just learned your first bioinformatics method for gene prediction. Congratulations!


    Information Encoded in Genomes

    OK, we have now learned how to find genes encoded in a genome

    How do we find out what they do (their biological functions, e.g. sensors, transporters, regulators, enzymes)?


    Information Encoded in Genomes

    People have observed that similar protein sequences tend to have similar functions

    Over the years, many genes have been thoroughly studied in different organisms, e.g., human, mouse, fly, ..., rice; their biological functions have been identified and documented

    For a new protein, scientists can possibly predict its function by identifying well-studied proteins in other organisms that have high sequence similarity to it

    This works for ~60% of genes in a newly sequenced genome
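
The idea of predicting function from similar sequences can be sketched as a lookup against a small table of well-studied proteins. Everything specific here is an assumption for illustration: the three sequences and their functions are invented, the 0.6 cut-off is arbitrary, and Python's `difflib.SequenceMatcher` is used as a crude stand-in for the dedicated sequence alignment tools that real studies rely on.

```python
from difflib import SequenceMatcher

# Hypothetical mini-database: protein sequence -> documented function.
KNOWN_PROTEINS = {
    "MKTAYIAKQRQISFVKSHFSRQ": "transporter",
    "MSDNELLKRVAEGDQEAFEELY": "regulator",
    "MAHHHGSLVPRGSQWERTYIPL": "enzyme",
}


def similarity(a, b):
    """Crude similarity in [0, 1]; not a biological alignment score."""
    return SequenceMatcher(None, a, b).ratio()


def predict_function(query, min_similarity=0.6):
    """Return (function, similarity) of the most similar known protein.

    The function is None when nothing in the table is similar enough.
    """
    best_seq = max(KNOWN_PROTEINS, key=lambda s: similarity(query, s))
    best_score = similarity(query, best_seq)
    if best_score < min_similarity:
        return None, best_score
    return KNOWN_PROTEINS[best_seq], best_score


# A new protein that differs from the first entry at two positions.
print(predict_function("MKTAYIAKQRQLSFVKSHFSKQ"))
# A sequence unlike anything in the table.
print(predict_function("G" * 22))
```

Returning `None` below the cut-off mirrors the point that this approach works only for part of the genes in a new genome.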


    Information Encoded in Genomes

    Scientists have developed computational techniques for

      identifying regulatory signals that control gene transcription

      predicting protein-protein interactions

      elucidating biological networks for a particular function

      ... and elucidating many other kinds of information


    Information Encoded in Genomes

    E. coli O157 and O111 are pathogenic to humans while E. coli K-12 is not

    Can we tell why? Which genes or pathways in E. coli O157 and O111 are responsible for the pathogenicity?


    Information Encoded in Genomes

    [Figure panels: E. coli K-12; E. coli O157; B. pseudomallei; P. furiosus; random sequence; human chromosome #1]


    Information Encoded in Genomes

    Red: prokaryotes

    Blue: eukaryotes

    Green: plastids

    Orange: plasmids

    Black: mitochondria

    x-axis: average of variations of the K-mer frequencies

    y-axis: average barcode similarity among fragments of a genome


    Information Encoded in Genomes

    Yes, biologists can derive a lot of information from genomes now

    but we are far from fully understanding any genome yet, even for the simplest living organisms, bacteria

    We can clearly use new ideas from bright young minds. Interested in doing bioinformatics?


    Linking Genome Information to Biological Systems Behaviors

    To fully understand cellular behaviors, we need to

      elucidate information encoded in the genome, and

      understand how working molecules, encoded by the genome, behave according to the physical laws on earth!

    [Figure: genome sequence with labels: gene, protein]


    Key Drivers of Bioinformatics

    The Human Genome Project has fundamentally changed biological science

    A key consequence of the genome project is that scientists learned they can produce biological data on a massive scale

      genome sequences

      microarray data for gene expression levels

      yeast two-hybrid systems for protein-protein interactions

      and other high-throughput biological data

    These data reflect the cellular states, molecular structures and functions, in complex ways


    Key Drivers of Bioinformatics

    ... and let bioinformaticians (help to) decipher the meaning of these data, as with genome sequences

    Together, high-throughput probing technologies and bioinformatics are transforming biological science into a new science more like physics


    Key Drivers of Bioinformatics

    "Like physics, where general rules and laws are taught at the start, biology will surely be presented to future generations of students as a set of basic systems ....... duplicated and adapted to a very wide range of cellular and organismic functions, following basic evolutionary principles constrained by Earth's geological history."

    Temple Smith, Current Topics in Computational Molecular Biology


    Biomarker Identification

    Our goal is to identify markers in blood that can tell if a person has a particular form of cancer, in a similar fashion to doing a pregnancy test using a test kit, possibly at home


    Biomarker Identification

    Microarray gene expression data allow comparative analyses of gene expression patterns in cancer versus normal tissues

    [Figure panels: on cancer tissues; on normal tissues]

    Finding genes showing maximum difference in their expression levels between cancer and normal tissues
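
A minimal sketch of "finding genes showing maximum difference" is to rank genes by the gap between their average expression in the two tissue types. The gene names and expression values below are invented, and a plain difference of means is the simplest possible measure; a real analysis would normally also test whether a difference is statistically reliable.

```python
from statistics import mean

# Hypothetical expression levels (arbitrary units): gene -> one value per sample.
cancer = {
    "geneA": [9.1, 8.7, 9.4],
    "geneB": [5.0, 5.2, 4.9],
    "geneC": [2.1, 2.4, 1.9],
}
normal = {
    "geneA": [3.0, 3.3, 2.8],
    "geneB": [5.1, 4.8, 5.0],
    "geneC": [6.0, 6.2, 5.7],
}


def rank_by_difference(cancer, normal):
    """Sort genes by |mean(cancer) - mean(normal)|, largest first."""
    diffs = {gene: mean(cancer[gene]) - mean(normal[gene]) for gene in cancer}
    return sorted(diffs.items(), key=lambda item: abs(item[1]), reverse=True)


ranked = rank_by_difference(cancer, normal)
for gene, diff in ranked:
    print(gene, f"{diff:+.2f}")
```

A positive difference means the gene is more highly expressed in the cancer samples, which is the direction of interest when looking for marker candidates.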


    Biomarker Identification

    proteins A, ..., Z highly expressed in cancer


    Biomarker Identification

    Question: Can we predict which of these tissue marker proteins can get secreted into blood circulation so we can get markers in blood?

    Through literature search, we found over [...] proteins being secreted into blood circulation due to various physiological conditions

    We then trained a classifier to identify features that distinguish between proteins that can be secreted into blood and proteins that cannot


    Biomarker Identification

    We have developed a classifier to distinguish blood-secretory proteins from other proteins

    On a test set with 52 positive and 3,629 negative examples, our classifier achieves

    89.6% sensitivity, 98.5% specificity and 94% AUC
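
For readers new to these terms, the sketch below shows how sensitivity and specificity are computed from counts of correct and incorrect predictions. The counts are made up for illustration and are not the results reported above; AUC is not covered here because it needs the classifier's scores rather than simple counts.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Compute sensitivity and specificity from prediction counts.

    tp: positives correctly predicted as positive (true positives)
    fn: positives wrongly predicted as negative (false negatives)
    tn: negatives correctly predicted as negative (true negatives)
    fp: negatives wrongly predicted as positive (false positives)
    """
    sensitivity = tp / (tp + fn)  # share of real positives that were found
    specificity = tn / (tn + fp)  # share of real negatives that were rejected
    return sensitivity, specificity


# Hypothetical counts: 50 secreted proteins and 1,000 other proteins.
sens, spec = sensitivity_specificity(tp=45, fn=5, tn=980, fp=20)
print(f"sensitivity = {sens:.1%}, specificity = {spec:.1%}")
```

When negatives greatly outnumber positives, as in the test set described above, even a small drop in specificity produces many false positives, which is why both numbers are reported.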


    Biomarker Identification

    The predicted marker proteins can be validated using mass spectrometry experiments


    Biomarker Identification

    If successful, it will be possible to test for cancer using a test kit, much like a pregnancy test kit


    Take-Home Message

    Biological science is under rapid transformation because of high-throughput measurement technologies and bioinformatics

    As an emerging field, bioinformatics is about using computational techniques to solve biological problems, and represents the future of biology


    THANK YOU!