
Bioinformatics The Prediction of Life Tony C Smith Department of Computer Science University of Waikato [email protected]


Bioinformatics Tony C Smith

Bioinformatics

Computation with biological data

Data: genes, proteins, microarrays, mass spectra, written documents, populations of organisms …

Goal: knowledge discovery

The essence is prediction …

My dog is very littl_ ?

We know that letters do not occur in English at random; not all letters are equally common (e.g. ‘e’ is more common than ‘x’)

We know that context changes the probability of a letter (e.g. what’s the most likely letter after the sequence “I eat Weet-Bi_”)

Prediction is important in many applications (e.g. encryption, compression, communication, graphics, simulation … and bioinformatics!)
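The effect of context on letter probabilities can be sketched with a tiny character n-gram counter. This is only an illustration, not the method used in the talk: the training text, the function names `train` and `predict`, and the context length of three characters are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def train(text, order=3):
    """Count which character follows each context of `order` characters."""
    counts = defaultdict(Counter)
    for i in range(order, len(text)):
        counts[text[i - order:i]][text[i]] += 1
    return counts

def predict(counts, context, order=3):
    """Return the most frequent next character after the context, or None if unseen."""
    followers = counts.get(context[-order:])
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Toy training text (hypothetical); a real model would be trained on a large corpus.
corpus = "my cat is very little. my dog is very little. the little dog is mine."
model = train(corpus)
guess = predict(model, "My dog is very littl")
print(guess)
```

With so little training text the model only knows a handful of contexts; an unseen context simply yields `None`.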

Prediction in bioinformatics

• Predicting the location of genes in DNA
• Predicting the function of proteins
• Predicting diseases from molecular samples
• Predicting population dynamics
• Anything that involves “making a judgment”; typically expressible as a yes/no decision about some sample datum

Representation

W e e t – B i x

0101011101100101011001010111010000101101 …

… to the computer, everything is binary!
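As a small sketch of this representation, the snippet below writes the 8-bit ASCII code of each character in binary. The helper name `to_bits` is made up for the example, and it assumes a plain ASCII hyphen rather than a typographic dash.

```python
def to_bits(text):
    """Encode text as 8-bit ASCII codes written out in binary."""
    return "".join(format(byte, "08b") for byte in text.encode("ascii"))

bits = to_bits("Weet-")
print(bits)
```

The five characters `Weet-` produce the 40-bit string shown on the slide.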

0101011101100101011001010111010000101101

0101101100100111111011010011010000101101

A A C G T C A T T C G A T G A T T C G A

Just as we can teach a computer to predict things about a sequence of letters in English prose, we can also teach it to predict things about other sequences, such as a genetic sequence.

A genetic prediction problem

ttgcaatcggcgctacgcttcaaaatttattatattcccggc
gcggctacgttcatcccagcagcagcgattttaaaattaa
cgcatcagactctcgtcgcgttcgtcgcctttattcacgcta
atggacgacatcttttactacgacggcgcctacgcatcg
cagcatacgacgcccagcatagtattttagaggcgagg
acatcatcatatcgcagctacagcgcatcagacgcat
acgacgacgactacgacgacactaacgacgatgttgcgc
acccacaccagttatatagagacgaactcgcatcagc

A genetic prediction problem

[Slide: the nucleotide sequence continues, filling the whole slide]

A genetic prediction problem

[Slide: an even longer wall of nucleotide sequence]

A genetic prediction problem

A gene encodes a protein

It is a blueprint that provides biochemical instructions on how to construct a sequence of amino acids so as to make a working protein that will perform some function in the organism

A genetic prediction problem

[Diagram labels: encoding region, untranslated region, transcription factor, RNA]

A genetic prediction problem

untranslated region

A genetic prediction problem

untranslated region

ttgcaatcggcgctacgcttcaaaatttattatattcccggc

A genetic prediction problem

ttgcaatcggcgctacgcttcaaaatttattatattcccggc

What transcription factors bind to this gene?

Where is the transcription factor binding site?

A genetic prediction problem

ttgcaatcggcgctacgcttcaaaatttattatattcccggc

Clues: A binding site is often a short general pattern

E.g. CCGATNATCGG
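Searching for a short general pattern of this kind can be sketched with a regular expression, reading `N` as "any nucleotide". The function name `find_sites` and the test sequence with a planted site are assumptions for illustration; real binding-site models are usually more permissive than an exact pattern.

```python
import re

def find_sites(pattern, dna):
    """Return 0-based start positions where the pattern occurs; N matches any base."""
    # A lookahead is used so that overlapping occurrences are all reported.
    regex = re.compile("(?=(" + pattern.upper().replace("N", "[ACGT]") + "))")
    return [m.start() for m in regex.finditer(dna.upper())]

# Hypothetical test sequence with the pattern planted at position 4.
dna = "ttgcCCGATTATCGGaatt"
print(find_sites("CCGATNATCGG", dna))
```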

A genetic prediction problem

ttgcaatcggcgctacgcttcaaaatttattatattcccggc

Clues: The patterns are often reverse complements

E.g. CCGATNATCGG
     GGCTANTAGCC
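The reverse-complement property is easy to check mechanically: complement each base (A with T, C with G) and read the result backwards. The sketch below uses a made-up helper name, `reverse_complement`, and treats `N` as its own complement.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}

def reverse_complement(seq):
    """Complement every base and reverse the order."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))

motif = "CCGATNATCGG"
print(reverse_complement(motif))
print(reverse_complement(motif) == motif)
```

For the example pattern the reverse complement is the pattern itself, so the same site reads identically on both strands.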

A genetic prediction problem

ttgcaatcggcgctacgcttcaaaatttattatattcccggc

Clues: Where there is one binding site, often there is another nearby.
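One way to use this clue is to keep only candidate sites that have a neighbour close by. The sketch below assumes the candidate positions are already known; the function name `nearby_pairs` and the 50-base threshold are hypothetical choices, not values from the talk.

```python
def nearby_pairs(positions, max_gap=50):
    """Return pairs of site positions that lie within max_gap bases of each other."""
    ordered = sorted(positions)
    pairs = []
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            if second - first > max_gap:
                break  # positions are sorted, so later ones are even further away
            pairs.append((first, second))
    return pairs

print(nearby_pairs([10, 40, 300, 320, 900]))
```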

A genetic prediction problem

All of these properties are the kinds of things for which computer science has developed algorithms and data structures to identify quickly and efficiently, and therefore it is exactly the kind of problem computer scientists should be able to solve.

proteomics

Three consecutive nucleotides in the coding region form a ‘codon’ … i.e. encode an amino acid.

A string of amino acids makes a protein.

3 nucleotides, 4 possibilities for each, so

4³ = 64 possible codons

But there are only 20 amino acids!
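The count of 64 can be confirmed by enumerating every triple over the four-letter nucleotide alphabet; the variable names here are illustrative only.

```python
from itertools import product

# Every ordered triple of the four nucleotides is a possible codon.
codons = ["".join(triple) for triple in product("ACGT", repeat=3)]
print(len(codons))
```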

proteomics

Glycine: GGA, GGC, GGG, GGT
Tyrosine: TAT, TAC
Methionine: ATG

There is quite a bit of redundancy in codons.
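The redundancy can be made concrete with a lookup table. The sketch below deliberately contains only the three amino acids listed above, so it is a partial table; the helper name `translate` and the sample DNA string are assumptions for the example.

```python
from collections import defaultdict

# Only the codon assignments quoted above; the full genetic code has 64 entries.
CODON_TABLE = {
    "GGA": "Glycine", "GGC": "Glycine", "GGG": "Glycine", "GGT": "Glycine",
    "TAT": "Tyrosine", "TAC": "Tyrosine",
    "ATG": "Methionine",
}

def translate(dna):
    """Translate DNA codon by codon; codons missing from the partial table become '?'."""
    dna = dna.upper()
    return [CODON_TABLE.get(dna[i:i + 3], "?") for i in range(0, len(dna) - 2, 3)]

# Group codons by the amino acid they encode to expose the redundancy.
by_amino_acid = defaultdict(list)
for codon, amino_acid in CODON_TABLE.items():
    by_amino_acid[amino_acid].append(codon)

print(translate("ATGGGATACGGT"))
print({name: len(group) for name, group in by_amino_acid.items()})
```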

Amino Acid

[Figure labels: amino group, carboxyl group, R group]

Amino Acid

[Figure: glycine and tyrosine]

Primary structure: MSALVSTTPSLLAGVRNVDB …..

Tertiary Structure

Secondary Structure

Signal peptide

• A relatively short sequence of amino acid residues at the N-terminus of the nascent protein
• typically 15-50 residues

MAGPRPSPWARLLLAALISVSLSGTLA
RCKKAPVSKKCETCVGQAALTGL …

• Cleaved off as protein passes through membrane (operates like a pass key)
• Knowing signal peptide helps determine protein function in the organism

How do we do it?

see any patterns?

[Slide: a slide-filling wall of nucleotide sequence]


Local biases in residues around the cleavage site

Sequence regularities can be exploited by statistical and pattern-based models
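One simple statistical model of such local biases is a table of residue counts at each offset from the cleavage site. The sketch below is only an illustration: the function name `position_counts`, the toy sequences, and the convention that a site is the index of the first residue after the cleavage are all assumptions, not taken from the slides.

```python
from collections import Counter

def position_counts(sequences, sites, before=3, after=2):
    """Count which residues occur at each offset relative to the cleavage site."""
    counts = {off: Counter() for off in range(-before, after + 1)}
    for seq, site in zip(sequences, sites):
        for off in counts:
            i = site + off
            if 0 <= i < len(seq):  # ignore offsets that fall outside the protein
                counts[off][seq[i]] += 1
    return counts

# Hypothetical toy data: three made-up sequences, each cleaved before index 10.
seqs = ["MKLLAAVASAQDT", "MRFLSLVAWAEEP", "MKTLLLASAADQG"]
sites = [10, 10, 10]
counts = position_counts(seqs, sites)

for off in sorted(counts):
    print(off, counts[off].most_common(2))
```

With real data, offsets where one residue dominates the counts are the regularities a predictor can exploit.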


Proteomic prediction

Language:
• letters combine to form words
• words combine to form phrases
• phrases combine to form sentences
• sentences combine to form paragraphs (and ultimately Harry Potter books)

Proteins:
• amino acids combine to form peptides
• peptides combine to form secondary motifs (e.g. α-helices and β-sheets)
• motifs combine to make proteins
• proteins combine to make toenails (and ultimately people)


Approach

Problem is stated as two-class:

an amino acid is either the first residue of the mature protein or it is not

Each residue is described by a single document, which includes as many electrochemical, structural or contextual facts as are available (or desirable)
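The two-class framing can be sketched as a labelling step. The function name `label_residues`, the zero-based index convention, and the toy sequence below are assumptions made for illustration only.

```python
def label_residues(sequence, mature_start):
    """Return (index, residue, is_first_mature_residue) for every residue.

    mature_start is the zero-based index of the first residue of the
    mature protein (an assumed convention, not one stated in the slides).
    """
    if not 0 <= mature_start < len(sequence):
        raise ValueError("mature_start must be an index into the sequence")
    return [(i, aa, i == mature_start) for i, aa in enumerate(sequence)]

# Hypothetical toy protein whose mature form starts at index 10.
labels = label_residues("MKLLAAVASAQDT", 10)
positives = [entry for entry in labels if entry[2]]
print(positives)
```

Each protein contributes exactly one positive example and many negatives, so the two classes are heavily imbalanced.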


Properties of amino acids


Residue as a document

E.g. Cysteine (Cys, C)

aliphatic [yes], aromatic [no], hydrophobic [yes], charge [-], polarized [yes], small [no], number of nitrogen atoms [1], contains sulphur [yes], has a carbon ring [no], ionized [yes], valence [2], cbeta [no], covalent [yes], h-bond [yes],

etc. (whatever else the experimenter wants to include)


Sample document

PRNUM:1. AANUM:21.
AMINO[-8]:L. ALIPH[-8]:-. AROMA[-8]:-. CBETA[-8]:-. CHARG[-8]:-. COVAL[-8]:-. HBOND[-8]:-. HPHOB[-8]:+. IONIZ[-8]:-. NITRO[-8]:1. POLAR[-8]:-. POSNG[-8]:0. SMALL[-8]:-. SULPH[-8]:-. TEENY[-8]:-. CRING[-8]:-. VALEN[-8]:2.
AMINO[-7]:L. ALIPH[-7]:-. AROMA[-7]:-. CBETA[-7]:-. CHARG[-7]:-. COVAL[-7]:-. HBOND[-7]:-. HPHOB[-7]:+. IONIZ[-7]:-. NITRO[-7]:1. POLAR[-7]:-. POSNG[-7]:0. SMALL[-7]:-. SULPH[-7]:-. TEENY[-7]:-. CRING[-7]:-. VALEN[-7]:2.
AMINO[-6]:F. ALIPH[-6]:+. AROMA[-6]:+. CBETA[-6]:-. CHARG[-6]:-. COVAL[-6]:-. HBOND[-6]:-. HPHOB[-6]:+. IONIZ[-6]:-. NITRO[-6]:1. POLAR[-6]:-. POSNG[-6]:0. SMALL[-6]:-. SULPH[-6]:-. TEENY[-6]:-. CRING[-6]:+. VALEN[-6]:2.
AMINO[-5]:A. ALIPH[-5]:-. AROMA[-5]:-. CBETA[-5]:-. CHARG[-5]:-. COVAL[-5]:-. HBOND[-5]:-. HPHOB[-5]:-. IONIZ[-5]:-. NITRO[-5]:1. POLAR[-5]:-. POSNG[-5]:0. SMALL[-5]:+. SULPH[-5]:-. TEENY[-5]:+. CRING[-5]:-. VALEN[-5]:2.
AMINO[-4]:T. ALIPH[-4]:+. AROMA[-4]:-. CBETA[-4]:+. CHARG[-4]:-. COVAL[-4]:-. HBOND[-4]:+. HPHOB[-4]:-. IONIZ[-4]:-. NITRO[-4]:1. POLAR[-4]:+. POSNG[-4]:0. SMALL[-4]:+. SULPH[-4]:-. TEENY[-4]:-. CRING[-4]:-. VALEN[-4]:2.
AMINO[-3]:C. ALIPH[-3]:+. AROMA[-3]:-. CBETA[-3]:-. CHARG[-3]:-. COVAL[-3]:+. HBOND[-3]:+. HPHOB[-3]:+. IONIZ[-3]:+. NITRO[-3]:1. POLAR[-3]:+. POSNG[-3]:-. SMALL[-3]:-. SULPH[-3]:+. TEENY[-3]:-. CRING[-3]:-. VALEN[-3]:2.
AMINO[-2]:I. ALIPH[-2]:-. AROMA[-2]:-. CBETA[-2]:+. CHARG[-2]:-. COVAL[-2]:-. HBOND[-2]:-. HPHOB[-2]:+. IONIZ[-2]:-. NITRO[-2]:1. POLAR[-2]:-. POSNG[-2]:0. SMALL[-2]:-. SULPH[-2]:-. TEENY[-2]:-. CRING[-2]:-. VALEN[-2]:2.
AMINO[-1]:A. ALIPH[-1]:-. AROMA[-1]:-. CBETA[-1]:-. CHARG[-1]:-. COVAL[-1]:-. HBOND[-1]:-. HPHOB[-1]:-. IONIZ[-1]:-. NITRO[-1]:1. POLAR[-1]:-. POSNG[-1]:0. SMALL[-1]:+. SULPH[-1]:-. TEENY[-1]:+. CRING[-1]:-. VALEN[-1]:2.
AMINO[0]:R. ALIPH[0]:+. AROMA[0]:-. CBETA[0]:-. CHARG[0]:+. COVAL[0]:-. HBOND[0]:+. HPHOB[0]:-. IONIZ[0]:+. NITRO[0]:4. POLAR[0]:+. POSNG[0]:+. SMALL[0]:-. SULPH[0]:-. TEENY[0]:-. CRING[0]:-. VALEN[0]:3.
AMINO[1]:H. ALIPH[1]:+. AROMA[1]:+. CBETA[1]:-. CHARG[1]:+. COVAL[1]:-. HBOND[1]:+. HPHOB[1]:-. IONIZ[1]:+. NITRO[1]:3. POLAR[1]:+. POSNG[1]:+. SMALL[1]:-. SULPH[1]:-. TEENY[1]:-. CRING[1]:+. VALEN[1]:3.
AMINO[2]:Q. ALIPH[2]:+. AROMA[2]:-. CBETA[2]:-. CHARG[2]:-. COVAL[2]:-. HBOND[2]:+. HPHOB[2]:-. IONIZ[2]:-. NITRO[2]:2. POLAR[2]:+. POSNG[2]:0. SMALL[2]:-. SULPH[2]:-. TEENY[2]:-. CRING[2]:-. VALEN[2]:2.
AMINO[3]:Q. ALIPH[3]:+. AROMA[3]:-. CBETA[3]:-. CHARG[3]:-. COVAL[3]:-. HBOND[3]:+. HPHOB[3]:-. IONIZ[3]:-. NITRO[3]:2. POLAR[3]:+. POSNG[3]:0. SMALL[3]:-. SULPH[3]:-. TEENY[3]:-. CRING[3]:-. VALEN[3]:2.
AMINO[4]:R. ALIPH[4]:+. AROMA[4]:-. CBETA[4]:-. CHARG[4]:+. COVAL[4]:-. HBOND[4]:+. HPHOB[4]:-. IONIZ[4]:+. NITRO[4]:4. POLAR[4]:+. POSNG[4]:+. SMALL[4]:-. SULPH[4]:-. TEENY[4]:-. CRING[4]:-. VALEN[4]:3.
AMINO[5]:Q. ALIPH[5]:+. AROMA[5]:-. CBETA[5]:-. CHARG[5]:-. COVAL[5]:-. HBOND[5]:+. HPHOB[5]:-. IONIZ[5]:-. NITRO[5]:2. POLAR[5]:+. POSNG[5]:0. SMALL[5]:-. SULPH[5]:-. TEENY[5]:-. CRING[5]:-. VALEN[5]:2.
AMINO[6]:Q. ALIPH[6]:+. AROMA[6]:-. CBETA[6]:-. CHARG[6]:-. COVAL[6]:-. HBOND[6]:+. HPHOB[6]:-. IONIZ[6]:-. NITRO[6]:2. POLAR[6]:+. POSNG[6]:0. SMALL[6]:-. SULPH[6]:-. TEENY[6]:-. CRING[6]:-. VALEN[6]:2.
AMINO[7]:Q. ALIPH[7]:+. AROMA[7]:-. CBETA[7]:-. CHARG[7]:-. COVAL[7]:-. HBOND[7]:+. HPHOB[7]:-. IONIZ[7]:-. NITRO[7]:2. POLAR[7]:+. POSNG[7]:0. SMALL[7]:-. SULPH[7]:-. TEENY[7]:-. CRING[7]:-. VALEN[7]:2.
AMINO[8]:Q. ALIPH[8]:+. AROMA[8]:-. CBETA[8]:-. CHARG[8]:-. COVAL[8]:-. HBOND[8]:+. HPHOB[8]:-. IONIZ[8]:-. NITRO[8]:2. POLAR[8]:+. POSNG[8]:0. SMALL[8]:-. SULPH[8]:-. TEENY[8]:-. CRING[8]:-. VALEN[8]:2.
MULT3:7. MULT5:4. MULT7:3. MULT9:2. 2GRAM:IA. GRAM2:HQ. 3GRAM:CIA. GRAM3:HQQ.
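A document in this format could be generated along the following lines. This is a minimal sketch under several assumptions: `residue_document`, `PROPS` and `FEATURES` are hypothetical names; only three of the sample's features are included, with values copied from the sample; AANUM is taken to be a 1-based position; the MULT fields are read as AANUM integer-divided by 3, 5, 7 and 9 (which matches 21 giving 7, 4, 3, 2); and the n-gram fields are read as the residues just before and just after position 0. The twelve leading residues of the toy sequence are made-up padding.

```python
# Three features per residue, with values copied from the sample document.
PROPS = {
    "L": {"HPHOB": "+", "NITRO": "1", "SMALL": "-"},
    "F": {"HPHOB": "+", "NITRO": "1", "SMALL": "-"},
    "A": {"HPHOB": "-", "NITRO": "1", "SMALL": "+"},
    "T": {"HPHOB": "-", "NITRO": "1", "SMALL": "+"},
    "C": {"HPHOB": "+", "NITRO": "1", "SMALL": "-"},
    "I": {"HPHOB": "+", "NITRO": "1", "SMALL": "-"},
    "R": {"HPHOB": "-", "NITRO": "4", "SMALL": "-"},
    "H": {"HPHOB": "-", "NITRO": "3", "SMALL": "-"},
    "Q": {"HPHOB": "-", "NITRO": "2", "SMALL": "-"},
}
FEATURES = ("HPHOB", "NITRO", "SMALL")

def residue_document(seq, index, prnum=1, window=8):
    """Describe the residue at seq[index] and its neighbours as one text document."""
    aanum = index + 1  # assumption: AANUM counts residues from 1
    fields = [f"PRNUM:{prnum}.", f"AANUM:{aanum}."]
    for off in range(-window, window + 1):
        i = index + off
        if not 0 <= i < len(seq):
            continue  # the window runs off the end of the protein
        aa = seq[i]
        fields.append(f"AMINO[{off}]:{aa}.")
        fields.extend(f"{name}[{off}]:{PROPS[aa][name]}." for name in FEATURES)
    for m in (3, 5, 7, 9):
        fields.append(f"MULT{m}:{aanum // m}.")  # assumed reading of MULTn
    for n in (2, 3):
        fields.append(f"{n}GRAM:{seq[max(0, index - n):index]}.")  # residues before
        fields.append(f"GRAM{n}:{seq[index + 1:index + 1 + n]}.")  # residues after
    return " ".join(fields)

# Hypothetical padding ("A" * 12) followed by the 17-residue window from the sample.
seq = "A" * 12 + "LLFATCIARHQQRQQQQ"
doc = residue_document(seq, 20)
print(doc)
```

Because every fact becomes a plain token such as `NITRO[0]:4.`, any learner that works on bags of words can consume the document unchanged.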


Artificial Intelligence

Computers do things only human brains can otherwise do

expert expert


Artificial Intelligence

Computers do things only human brains can otherwise do

expert system

expert


Artificial Intelligence

Computers do things only human brains can otherwise do

learning system

expert system


Machine learning

What is machine learning?
• creating computer programs that get better with experience
• learn how to make expert judgments
• discover previously hidden, potentially useful information (data mining)

How does it work?
• user provides learning system with examples of concept to be learned
• induction algorithm infers a characteristic model of the examples
• model is used to predict whether or not future novel instances are also examples – and it does this very consistently, and very, very quickly!
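The examples-to-model-to-prediction loop can be sketched with a very small learner. The slides do not say which induction algorithm is used, so the simplified naive Bayes scorer below is a stand-in chosen for brevity; `train`, `predict` and the toy examples are hypothetical, and only tokens present in a document are scored.

```python
import math
from collections import Counter

def train(examples):
    """Infer a model from (tokens, label) pairs, where label is True or False."""
    token_counts = {True: Counter(), False: Counter()}
    class_counts = Counter()
    for tokens, label in examples:
        class_counts[label] += 1
        token_counts[label].update(set(tokens))  # count each token once per document
    vocab = set(token_counts[True]) | set(token_counts[False])
    return token_counts, class_counts, vocab

def predict(model, tokens):
    """Return True if the document looks more like the positive examples."""
    token_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for label in (True, False):
        n = class_counts[label]
        score = math.log((n + 1) / (total + 2))  # smoothed class prior
        for tok in set(tokens) & vocab:
            # smoothed probability of seeing this token in the class
            score += math.log((token_counts[label][tok] + 1) / (n + 2))
        scores[label] = score
    return scores[True] > scores[False]

# Hypothetical toy examples: True marks the first residue of a mature protein.
examples = [
    ("AMINO[-3]:A. AMINO[-1]:A. AMINO[0]:R.".split(), True),
    ("AMINO[-3]:S. AMINO[-1]:A. AMINO[0]:Q.".split(), True),
    ("AMINO[-3]:L. AMINO[-1]:L. AMINO[0]:F.".split(), False),
    ("AMINO[-3]:L. AMINO[-1]:K. AMINO[0]:L.".split(), False),
]
model = train(examples)
is_site = predict(model, "AMINO[-3]:A. AMINO[-1]:A. AMINO[0]:Q.".split())
not_site = predict(model, "AMINO[-3]:L. AMINO[-1]:L. AMINO[0]:L.".split())
print(is_site, not_site)
```

Training happens once; after that each prediction is a handful of table lookups, which is why a learned model can judge new instances so quickly and consistently.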


Bioinformatics

Biologists know proteins, computer scientists know machine learning

Together, they can find hidden and potentially useful information about genes and proteins

Biotechnology is a multi-billion dollar industry

Biotechnology is one of the best funded areas of scientific research

Shortage of people educated in bioinformatics


The University of Waikato

Waikato University is ranked first in the country in computer science and in molecular, cellular, and whole-organism biology

centre of the universe for machine learning


The University of Waikato

If you’re interested in getting involved in bioinformatics, or indeed any other area along the leading edge of computer science and/or biology, then …

Waikato wants You!