MSCS282: Bioinformatics I
Introduction

Craig A. Struble, Ph.D.
Department of Mathematics, Statistics, and Computer Science
Marquette University

Michael A. Thomas, Ph.D.
Bioinformatics Research Center
Medical College of Wisconsin

Michael A. Thomas, Ph.D.

Overview

• Welcome
• Syllabus
• Student Introductions
• Introduction to Bioinformatics

What Is Bioinformatics?

• “Bioinformatics is a new subject of genetic data collection, analysis and dissemination to the research community.” Hwa A. Lim (1987)

• “Bioinformatics: Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.” NIH working definition (2000)

What Is Bioinformatics?

[Diagram: Bioinformatics shown as drawing on three areas]
• Informatics: Computer Science, Computer Engineering, Information Science
• Biology & Other Natural Sciences
• Mathematics & Statistics

Bioinformatics is sometimes called…

• Computational biology
• Computational molecular biology
• Biomolecular informatics
• Computational genomics
• …

Different perspectives on Bioinformatics

• Bioinformatics is a tool
  – Biologists, biochemists, medical professionals, etc.
  – Obtain meaningful and understandable results
• Bioinformatics is a discipline
  – Informaticians, mathematicians, statisticians, etc.
  – Generate meaningful and understandable results

Biological Data

• Genomes
  – DNA sequences of A, T, C, G
  – Annotated with function, “interesting” features
• Proteins
  – Amino acid sequences
    • Sequences of 20 letters
  – Annotated with structure, function, etc.

Biological Data

• Gene Expression
  – Dynamic behavior of genes
• Protein Expression
  – Dynamic behavior of proteins
• Structural Features
  – RNA and proteins
• …

Biological Data: Sus scrofa agouti-related protein gene

1 ggcacattct cctgttgagc caggctatgc tgaccacaat gttgctgagc tgtgccctac

61 tgctggcaat gcccaccatg ctgggggccc agataggctt ggcccccctg gagggtatcg

121 gaaggcttga ccaagccttg ttcccagaac tccaaggtca gtgcgggcag gagtgggttg

181 ggtggggctt ggacatcctc tggccacaaa gtattctgct tgtatgagcc ctttcttccc

241 cttcccaatc ccaggcctgg gaggtgggtg ttttgtgcat gggtggttct gccctcacat

301 catctgtccc agatctaggc ctgcagcccc cactgaagag gacaactgca gaacgggcag

361 aagaggctct gctgcagcag gccgaggcca aggccttggc agaggtaaca gctcagggaa

421 agggctgagg ccacaagtct tgagtgggtg tgtcaagcat caacctctat ctgtgcttgg

481 agttgccact gtggtacaac gggattggcg gtgtcttggg agcgctggga cgtggtttca

541 tccccggcca gcacaagtgg gttaaggatc tggccttgcc atcccttcag cttaggctga

601 gactgtggct tggagctgat ctctgaccgg aagctccata tgctctgggg tgaccaaaaa

661 tggaaaaaca aacatacaaa acacctctac ctgcacttcc tgaccccctc acccggggcg

721 acactgcaga ccatcccgtt cacgctccac ttccatcctg ccttgatctg gcgcattcca

781 tgaatgtgct tttggaagtc cttgtttccc aacccttgta ggtgctagat cctgaaggac

841 gcaaggcacg ctccccacgt cgctgcgtaa ggctgcacga atcctgtctg ggacaccagg

901 taccatgctg cgacccatgt gctacatgct actgccgttt cttcaacgcc ttctgctact

961 gccgcaagct gggtactgcc acgaacccct gcagccgcac ctagctggcc agccaatgtc

1021 gtcg
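A listing like the one above is usually read into a single string before any analysis. The following is a minimal sketch, assuming each line starts with the position of its first base followed by blocks of bases; the function names `parse_origin_lines` and `gc_content` are illustrative and are not part of the course material.

```python
from collections import Counter


def parse_origin_lines(lines):
    """Join numbered sequence lines into one lowercase string."""
    chunks = []
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        # The first field is the position of the first base on the line.
        if parts[0].isdigit():
            parts = parts[1:]
        chunks.extend(parts)
    return "".join(chunks).lower()


def gc_content(seq):
    """Fraction of bases that are g or c (0.0 for an empty sequence)."""
    counts = Counter(seq)
    return (counts["g"] + counts["c"]) / len(seq) if seq else 0.0


# First and last lines of the listing, as a small demonstration.
lines = [
    "1 ggcacattct cctgttgagc caggctatgc tgaccacaat gttgctgagc tgtgccctac",
    "1021 gtcg",
]
seq = parse_origin_lines(lines)
print(len(seq), Counter(seq), round(gc_content(seq), 3))
```

Feeding in every line of the listing should give the full gene sequence; base counts and GC content are among the simplest summaries one can compute from it.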

Genome Sizes

Species              Genome Size
Human                3.3 billion bp
Escherichia coli     4.7 million bp
Bacteriophage MS2    3569 bp
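To get a feel for these numbers, one can estimate how much space the raw sequences would take. This sketch assumes a fixed 2 bits per base (enough for A, C, G, T only, with no ambiguity codes or annotation), which is an illustrative choice rather than a statement about how any real database stores data.

```python
# Genome sizes in base pairs, taken from the table above.
GENOME_SIZES_BP = {
    "Human": 3_300_000_000,
    "Escherichia coli": 4_700_000,
    "Bacteriophage MS2": 3_569,
}


def packed_bytes(n_bases, bits_per_base=2):
    """Bytes needed for n_bases at a fixed bit width, rounded up."""
    return (n_bases * bits_per_base + 7) // 8


for species, size in GENOME_SIZES_BP.items():
    print(f"{species}: {size:,} bp, about {packed_bytes(size):,} bytes")
```

Under this assumption the human genome needs on the order of 825 million bytes, while the phage genome fits in under a kilobyte.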

Database Growth

• Exponential growth in sequence data
• Not much growth in sequence size
• Expect exponential growth in annotation information

What are we to do with all this data?

Challenges of Large Databases

• Storage
  – Indexing, physical layout, memory management
• Modeling
  – Relational, hierarchical, semi-structured
• Efficiency
  – Update, query, analysis
• Interpretation
  – Visualization

Problems in Bioinformatics

• Consider just sequence analysis
  – Sequence alignment
  – Gene discovery
  – Promoter discovery
  – Intron splice sites
  – Protein and RNA structure prediction
  – …

Applications of Bioinformatics

• VCMAP
• DORR and ASAP
• miRNA

VCMAP

• Comparative mapping is a strategy that allows cross-organism study of physiological genomics
• Virtual Comparative Map (VCMap) performs homology analysis with mathematical predictions to construct cross-organism maps, untested in the wet lab, between human, rat, mouse and zebrafish
• This application provides a highly modular investigative environment for the:
  – Analysis of multiple organisms, including zebrafish
  – Collection of genetic and radiation hybrid maps
  – Prediction of genes based on homology

VCMAP

• Homology analysis was based on sequence similarity (Altschul et al., 1990) and curated homologous genes.
• A criterion of 85% similarity over a 100 bp stretch across all species was used to create the maps.
• NCBI’s UniGene sequence sets, RH maps and genetic maps were chosen to create anchor objects (Kwitek-Black et al., 2001).
• 1-to-1 homologous objects were used for building the virtual comparative maps with a pipeline architecture.
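The similarity criterion on this slide can be illustrated with a sliding window. The sketch below assumes two sequences that have already been aligned to equal length without gaps, and it reads "85% similarity over a 100 bp stretch" as "at least 85 identical positions in some window of 100"; that reading, and the function name `has_similar_stretch`, are assumptions, since the slide does not spell out how VCMap applied the cutoff.

```python
def has_similar_stretch(a, b, window=100, min_matches=85):
    """True if some window of `window` aligned positions has at least
    `min_matches` identical bases. Inputs must have equal length."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    if len(a) < window:
        return False
    matches = [1 if x == y else 0 for x, y in zip(a, b)]
    # Running count of matches inside the current window.
    current = sum(matches[:window])
    if current >= min_matches:
        return True
    for i in range(window, len(matches)):
        current += matches[i] - matches[i - window]
        if current >= min_matches:
            return True
    return False


# Toy example: 85 matching positions followed by 15 mismatches.
print(has_similar_stretch("A" * 100, "A" * 85 + "C" * 15))
```

Keeping a running count makes the scan linear in the sequence length instead of recounting every window from scratch.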

VCMAP

[Diagram: VCMap pipeline. Box labels as extracted; the connections between boxes are not preserved]
• Download UniGene data from NCBI
• Load UniGene data to DB
• Mask UniGene sequences
• Format masked sequences
• Blast
• Search UniGene
• Generate anchor report
• Create Homolog UniGene Object and Scoring
• DB
• Map Data
• Anchor Report
• 1-to-1 Objects
• VC Maps Building
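One step named in the pipeline is the selection of 1-to-1 objects. The slides do not say how this was done, so the following is only a sketch under the assumption that a pair counts as 1-to-1 when neither member appears in any other pair; the identifiers in the example are made up for illustration.

```python
from collections import Counter


def one_to_one_pairs(hits):
    """Keep only (query, subject) pairs in which both ids occur exactly once."""
    pairs = set(hits)
    query_counts = Counter(q for q, _ in pairs)
    subject_counts = Counter(s for _, s in pairs)
    return sorted(
        (q, s)
        for q, s in pairs
        if query_counts[q] == 1 and subject_counts[s] == 1
    )


# Hypothetical homology hits between two species.
hits = [
    ("Hs.1", "Rn.7"),
    ("Hs.2", "Rn.8"),
    ("Hs.2", "Rn.9"),   # Hs.2 has two partners, so both pairs are dropped
    ("Hs.4", "Rn.10"),
]
print(one_to_one_pairs(hits))
```

Discarding ambiguous pairs trades coverage for confidence, which suits anchors that a comparative map is built on.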

VCMAP

• The zebrafish virtual map shows the “evolutionary breakpoints” of a region in chromosome 7 (in green) with synteny to human chromosomes 15 (pink), 11 (brown) and 18.

• The zebrafish virtual map was also able to identify a gene, pyruvate carboxylase (PC; associated with the disorder necrotizing encephalopathy), with mapped homologues for zebrafish, human and rat, and an unmapped homologue in mouse.

Disease Oriented Research Resource

• The major goal of the RGD Disease Oriented Research Resource is to create collaborative relationships between RGD and 20 particular disease-focused rat research communities to identify, collect, and integrate disease-specific components of data and information, all the way down to specific genes of interest, into RGD disease “portals”.

Specific Goals for RGD

• Prioritize data for curation and addition to RGD based on targeted disease areas
• Effectively combine automated and manual data acquisition and curation methods
• Provide a way to integrate Rat Genome Sequencing Project results with RGD activities
• Help RGD incorporate tools developed in BRC to add a focus on data mining and analysis to traditional curation and database functions

DORR Workflow

[Diagram: DORR workflow. Box labels as extracted; the connections between boxes are not preserved]
• Identify Disease
• Genes, Strains, QTLs
• Biomedical Literature
• GROIs
• Microarray, Pathways, Phenotypes
• VCMap
• ASAP
• In Development
• Curation
• DORR Website & Genome Browser
• Rat Genome Database
• Extract from research, electronic sources & Iterate

Data Acquisition

[Diagram: data acquisition flow. Box labels as extracted; the connections between boxes are not preserved]

• CRISP
  – Query with single keywords and using the keywords in different combinations
  – Review for authors, genes, strains, phenotypes, diseases and parse
  – Consolidate
• PubMed
  – Query using previous keywords, along with the new data found in CRISP search
  – Review for authors, genes, strains, phenotypes, diseases and parse
  – Consolidate
• Nomenclature
  – Query LocusLink, RGD, and MGD using gene and phenotype names
  – Review for variations in gene or phenotype names and parse
  – Consolidate
• Segregate into Disease, Strains, Phenotypes, Genes, QTLs
• Key Object Database
• LocusLink
  – Query using gene symbols
  – Record full name, LocusLink IDs, RefSeq IDs, UniGene IDs, OMIM ID, HomoloGene, chromosome, & cytogenetic map data
  – Consolidate, Segregate
• GeneMap99
  – Query using gene symbols
  – Record RH map data including cR10000 or cR3000 and the Gene Bridge interval
  – Consolidate, Segregate
• RGD & MGD
  – Query using gene symbols
  – Record chromosome and cytogenetic position
  – Consolidate, Segregate
• VCMap
  – Query using gene symbols
  – Develop gene images
  – Analyze for ROIs
  – Record ROI attributes
• Key Object Attributes Database
• Prioritize Articles for Curation
• Manual Curation Process
  – Fill object templates
  – Bulk data processing
  – Load data into RGD

Automated Sequence Analysis Pipeline

[Diagram: ASAP pipeline. Box labels as extracted; the connections between boxes are not preserved]
• Input sequences: Seq (Fasta/Trace), Markers
• VCMap seqs mapped in ROI, Seqs predicted in ROI
• RepeatMask (appears at several stages)
• Search extra genomic sequence (HTGS, TraceDB, nr)
• Assembly Seq
• Assembled Seq + Unassembled Seq
• MetaGene, “Hot Zone”
• BLAST (nr+pir)
• UniGene Search
• ePCR
• Homolog Search
• Reports: MetaGene Report, UniGene Report, Repeat Report, Marker Report, Homolog Report, Gene Report, Additional Reports
• Visualization (Clickable Image Map)

miRNA Gene and Target Prediction

• microRNA genes (miRNAs) were recently recognized as a class of functional non-coding genes
• ~70 nt precursor which has a hairpin fold
• ~20 nt RNA molecule from Dicer cutting the stem loop
• First identified were lin-4 and let-7
  – Developmental role in C. elegans

miRNA Examples

miRNA Gene Prediction

• Can we predict where miRNA genes might be?
• Microscan (Burge Lab)
  – Scan C. elegans for 70 nt hairpin folds (structure prediction)
  – Compare with C. briggsae (sequence alignment)
  – Score alignments: 3’ and 5’ conservation, overall conservation, size of loop, etc.
  – Select sequences with score > threshold
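The first step above, scanning for hairpin folds, can be illustrated with a deliberately crude test. This sketch assumes a symmetric fold around a fixed-length central loop, no bulges, and counts Watson-Crick and G-U pairs between the two arms; it is not the method used by the tool named on the slide, and real structure prediction relies on energy models rather than pair counting. All names and default thresholds are illustrative.

```python
# Base pairs allowed in the stem, including G-U wobble pairs.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def stem_pairs(rna, loop_len):
    """Return (paired positions, arm length) for a symmetric fold."""
    rna = rna.upper().replace("T", "U")
    arm = max(0, (len(rna) - loop_len) // 2)
    five_prime = rna[:arm]
    # Reverse the 3' arm so position k pairs with position k of the 5' arm.
    three_prime = rna[len(rna) - arm:][::-1]
    paired = sum((x, y) in PAIRS for x, y in zip(five_prime, three_prime))
    return paired, arm


def looks_like_hairpin(rna, loop_len=10, min_fraction=0.7):
    """True if at least min_fraction of the arm positions can pair."""
    paired, arm = stem_pairs(rna, loop_len)
    return arm > 0 and paired / arm >= min_fraction


def hairpin_candidates(genome, window=70, loop_len=10, min_fraction=0.7):
    """Yield start offsets of windows that pass the crude hairpin test."""
    for start in range(len(genome) - window + 1):
        if looks_like_hairpin(genome[start:start + window], loop_len, min_fraction):
            yield start


# Toy hairpin: 6 nt stem, 4 nt loop, perfectly paired.
print(looks_like_hairpin("GCGCAUAAAAAUGCGC", loop_len=4))
```

Candidates found this way would then go through the comparison and scoring steps listed on the slide.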

miRNA Target Prediction

• lin-4 and let-7 interact with 3’-UTR
• Idea: look for conserved 3’-UTR regions which are complementary to discovered miRNA genes
  – Find conserved sequences of ~20 nt in length from 3’-UTR database
  – Align with miRNA genes
  – Score alignment: 3’ and 5’ matches, overall matching
  – Select high scoring matches
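The idea of looking for complementary regions can be sketched as a search for the reverse complement of a miRNA within a UTR. This is a simplified illustration that assumes ungapped matching and weights every position equally; the slide's scoring distinguishes 3’ and 5’ matches, which this sketch does not attempt. The sequences in the example are made up.

```python
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna):
    """Reverse complement of an RNA string over A, C, G, U."""
    return rna.translate(COMPLEMENT)[::-1]


def best_target_site(mirna, utr):
    """Return (matches, offset) for the UTR window most complementary
    to the miRNA, or (0, -1) if the UTR is shorter than the miRNA."""
    mirna = mirna.upper().replace("T", "U")
    utr = utr.upper().replace("T", "U")
    site = reverse_complement(mirna)
    best = (0, -1)
    for offset in range(len(utr) - len(site) + 1):
        window = utr[offset:offset + len(site)]
        matches = sum(a == b for a, b in zip(window, site))
        if matches > best[0]:
            best = (matches, offset)
    return best


# Hypothetical 6 nt "miRNA" and a UTR containing its perfect target site.
print(best_target_site("AACGGU", "GGGACCGUUCCC"))
```

A selection step would then keep only sites whose match count exceeds some chosen threshold.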

Goals of the Course

• For everyone
  – Communication
• For the biologist
  – Incorporate bioinformatics into research
  – Understand computational modeling
• For the computational scientist
  – Develop tools for biological research
  – Create new algorithms for mining biological data
  – Understand how to find biologically meaningful information

Summary

• Bioinformatics is truly interdisciplinary
  – Biology (natural sciences), informatics, mathematics & statistics
• Databases
  – Large, semistructured, incomplete, inaccurate
• Wide range of problems
  – Solutions employ knowledge from the sciences with algorithms and models from informatics, mathematics, and statistics

Biological Data

• DNA and protein sequences are annotated
  – Source
  – Organism
  – Function
  – Updates
  – Etc.

Classic examples

• Sequence alignment
• Multiple sequence alignment

Examples from Setubal/Meidanis (1997)
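As a concrete instance of pairwise sequence alignment, here is a minimal sketch of global alignment by dynamic programming (the Needleman-Wunsch approach). It is not taken from the cited textbook; the scoring values (match +1, mismatch -1, gap -2) are assumptions chosen only for illustration.

```python
def global_alignment(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for an optimal global alignment."""
    n, m = len(a), len(b)
    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = match if a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + step:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


print(global_alignment("ACGT", "AGT"))
```

The table has (n+1) by (m+1) cells, so time and memory both grow with the product of the two sequence lengths; multiple sequence alignment extends the same idea to more than two sequences at much greater cost.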