Bioinformatics - Craig A. Struble 1
MSCS282: Bioinformatics I
Introduction
Craig A. Struble, Ph.D.
Department of Mathematics, Statistics, and Computer Science
Marquette University
Michael A. Thomas, Ph.D.
Bioinformatics Research Center
Medical College of Wisconsin

Michael A. Thomas, Ph.D.

Overview
• Welcome
• Syllabus
• Student Introductions
• Introduction to Bioinformatics

What Is Bioinformatics?
• “Bioinformatics is a new subject of genetic data collection, analysis and dissemination to the research community.” Hwa A. Lim (1987)
• “Bioinformatics: Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.” NIH working definition (2000)

What is Bioinformatics?
[Diagram: Bioinformatics shown at the overlap of three areas]
• Informatics: Computer Science, Computer Engineering, Information Science
• Biology & Other Natural Sciences
• Mathematics & Statistics

Bioinformatics is sometimes called…
• Computational biology
• Computational molecular biology
• Biomolecular informatics
• Computational genomics
• …

Different perspectives on Bioinformatics
• Bioinformatics is a tool
– Biologists, biochemists, medical professionals, etc.
– Obtain meaningful and understandable results
• Bioinformatics is a discipline
– Informaticians, mathematicians, statisticians, etc.
– Generate meaningful and understandable results

Biological Data
• Genomes
– DNA sequences of A, T, C, G
– Annotated with function, “interesting” features
• Proteins
– Amino acid sequences (sequences of 20 letters)
– Annotated with structure, function, etc.

Biological Data
• Gene Expression
– Dynamic behavior of genes
• Protein Expression
– Dynamic behavior of proteins
• Structural Features
– RNA and proteins
• …

Biological Data
Sus scrofa agouti-related protein gene
1 ggcacattct cctgttgagc caggctatgc tgaccacaat gttgctgagc tgtgccctac
61 tgctggcaat gcccaccatg ctgggggccc agataggctt ggcccccctg gagggtatcg
121 gaaggcttga ccaagccttg ttcccagaac tccaaggtca gtgcgggcag gagtgggttg
181 ggtggggctt ggacatcctc tggccacaaa gtattctgct tgtatgagcc ctttcttccc
241 cttcccaatc ccaggcctgg gaggtgggtg ttttgtgcat gggtggttct gccctcacat
301 catctgtccc agatctaggc ctgcagcccc cactgaagag gacaactgca gaacgggcag
361 aagaggctct gctgcagcag gccgaggcca aggccttggc agaggtaaca gctcagggaa
421 agggctgagg ccacaagtct tgagtgggtg tgtcaagcat caacctctat ctgtgcttgg
481 agttgccact gtggtacaac gggattggcg gtgtcttggg agcgctggga cgtggtttca
541 tccccggcca gcacaagtgg gttaaggatc tggccttgcc atcccttcag cttaggctga
601 gactgtggct tggagctgat ctctgaccgg aagctccata tgctctgggg tgaccaaaaa
661 tggaaaaaca aacatacaaa acacctctac ctgcacttcc tgaccccctc acccggggcg
721 acactgcaga ccatcccgtt cacgctccac ttccatcctg ccttgatctg gcgcattcca
781 tgaatgtgct tttggaagtc cttgtttccc aacccttgta ggtgctagat cctgaaggac
841 gcaaggcacg ctccccacgt cgctgcgtaa ggctgcacga atcctgtctg ggacaccagg
901 taccatgctg cgacccatgt gctacatgct actgccgttt cttcaacgcc ttctgctact
961 gccgcaagct gggtactgcc acgaacccct gcagccgcac ctagctggcc agccaatgtc
1021 gtcg
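
As an illustration of how a listing like the one above is usually handled in software, here is a minimal Python sketch. It assumes each line starts with the 1-based position of its first base, followed by space-separated blocks of bases. The function names are hypothetical, and only the first two lines of the sequence are used as sample input.

```python
from collections import Counter


def parse_sequence_lines(lines):
    """Join numbered sequence lines into one uppercase string.

    Assumes each line looks like '61 tgctggcaat gcccaccatg ...', where the
    first field is the position of the first base on that line.
    """
    blocks = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        blocks.extend(fields[1:])  # drop the leading position number
    return "".join(blocks).upper()


def base_composition(seq):
    """Return the fraction of each base in the sequence."""
    counts = Counter(seq)
    return {base: counts[base] / len(seq) for base in "ACGT"}


# First two lines of the listing above, used as sample input.
sample_lines = [
    "1 ggcacattct cctgttgagc caggctatgc tgaccacaat gttgctgagc tgtgccctac",
    "61 tgctggcaat gcccaccatg ctgggggccc agataggctt ggcccccctg gagggtatcg",
]

sequence = parse_sequence_lines(sample_lines)
composition = base_composition(sequence)
gc_fraction = composition["G"] + composition["C"]

if __name__ == "__main__":
    print(len(sequence), "bases; GC fraction =", round(gc_fraction, 3))
```

In practice a library parser would normally be used for full database records. The sketch only shows that the raw data is plain text over a four-letter alphabet.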

Genome Sizes
Species               Genome Size
Bacteriophage MS2     3569 bp
Escherichia coli      4.7 million bp
Human                 3.3 billion bp

Database Growth
[Figures: database growth charts]

Database Growth
• Exponential growth in sequence data
• Not much growth in sequence size
• Expect exponential growth in annotation information
What are we to do with all this data?

Challenges of Large Databases
• Storage
– Indexing, physical layout, memory management
• Modeling
– Relational, hierarchical, semi-structured
• Efficiency
– Update, query, analysis
• Interpretation
– Visualization

Problems in Bioinformatics
• Consider just sequence analysis
– Sequence alignment
– Gene discovery
– Promoter discovery
– Intron splice sites
– Protein and RNA structure prediction
– …

Applications of Bioinformatics
• VCMap
• DORR and ASAP
• miRNA

VCMap
• Comparative mapping is a strategy that allows cross-organism study of physiological genomics
• Virtual Comparative Map (VCMap) combines homology analysis with mathematical predictions to construct cross-organism maps between human, rat, mouse and zebrafish; the maps have not been tested in the wet lab
• The application provides a highly modular investigative environment for:
– Analysis of multiple organisms, including zebrafish
– Collection of genetic and radiation hybrid (RH) maps
– Prediction of genes based on homology

VCMap
• Homology analysis was based on sequence similarity (Altschul et al. 1990) and curated homologous genes
• A threshold of 85% similarity over a 100 bp stretch, applied across all species, was used to create the maps
• NCBI’s UniGene sequence sets, RH maps and genetic maps were chosen to create anchor objects (Kwitek-Black et al. 2001)
• 1-to-1 homologous objects were used for building the virtual comparative maps with a pipeline architecture
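
To make the similarity criterion concrete, the following Python sketch checks whether two sequences share any 100-position window with at least 85% identical positions. It is only an illustration. The slides do not say how the criterion was evaluated on BLAST output, so the sketch assumes two pre-aligned sequences of equal length, with "-" marking gaps. The function name and parameters are hypothetical.

```python
def has_similar_stretch(a, b, window=100, min_identity_pct=85):
    """Return True if some window of aligned positions meets the identity cutoff.

    a and b must be pre-aligned strings of equal length; '-' marks a gap and
    never counts as a match.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if len(a) < window:
        return False

    matches = [int(x == y and x != "-") for x, y in zip(a, b)]

    # Sliding window: keep a running count instead of recounting each window.
    current = sum(matches[:window])
    if current * 100 >= min_identity_pct * window:
        return True
    for i in range(window, len(matches)):
        current += matches[i] - matches[i - window]
        if current * 100 >= min_identity_pct * window:
            return True
    return False


if __name__ == "__main__":
    ref = "ACGT" * 25                  # 100 bases
    close = "N" * 15 + ref[15:]        # 85 of 100 positions identical
    print(has_similar_stretch(ref, close))
```

Integer arithmetic is used for the cutoff so that exactly 85 matches out of 100 passes without floating-point rounding concerns.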

VCMap
[Flowchart: VCMap pipeline. Box labels as they appear on the slide:]
• Download UniGene data from NCBI
• Load UniGene data to DB
• Mask UniGene sequences
• Format masked sequences
• Blast Search UniGene
• Generate anchor report
• Create Homolog UniGene Object and Scoring
• 1-to-1 Objects
• VC Maps Building
• Data stores: DB, Map Data, Anchor Report

VCMap
• The zebrafish virtual map shows the “evolutionary breakpoints” of a region in chromosome 7 (in green) with synteny to human chromosomes 15 (pink), 11 (brown) and 18.
• The zebrafish virtual map was also able to identify a gene, pyruvate carboxylase (PC; associated with the disorder necrotizing encephalopathy), with mapped homologues for zebrafish, human and rat, and an unmapped homologue in mouse.

Disease Oriented Research Resource
• The major goal of the RGD Disease Oriented Research Resource (DORR) is to create collaborative relationships between RGD and rat research communities for 20 particular diseases, in order to identify, collect, and integrate disease-specific data and information, down to specific genes of interest, into RGD disease “portals”.
Specific Goals for RGD
• Prioritize data for curation and addition to RGD based on targeted disease areas
• Effectively combine automated and manual data acquisition and curation methods
• Provide a way to integrate Rat Genome Sequencing Project results with RGD activities
• Help RGD incorporate tools developed in BRC, adding a focus on data mining and analysis to traditional curation and database functions

DORR Workflow
[Flowchart: DORR workflow. Box labels as they appear on the slide:]
• Identify Disease
• Genes, Strains, QTLs
• Biomedical Literature
• GROIs, Microarray, Pathways, Phenotypes
• VCMap, ASAP
• In Development
• Curation
• Extract from research, electronic sources & Iterate
• DORR Website & Genome Browser
• Rat Genome Database

Data Acquisition
[Flowchart: data acquisition. Stages and box labels recovered from the slide:]
• CRISP
– Query with single keywords and using the keywords in different combinations
– Review for authors, genes, strains, phenotypes, diseases and parse
– Consolidate
• PubMed
– Query using previous keywords, along with the new data found in CRISP search
– Review for authors, genes, strains, phenotypes, diseases and parse
– Consolidate
• Nomenclature
– Query LocusLink, RGD, and MGD using gene and phenotype names
– Review for variations in gene or phenotype names and parse
– Consolidate
• Segregate into: Disease, Strains, Phenotypes, Genes, QTLs
• LocusLink
– Query using gene symbols
– Record full name, LocusLink IDs, RefSeq IDs, UniGene IDs, OMIM ID, HomoloGene, chromosome, and cytogenetic map data
– Consolidate, Segregate
• GeneMap99
– Query using gene symbols
– Record RH map data including cR10000 or cR3000 and the GeneBridge interval
– Consolidate, Segregate
• RGD & MGD
– Query using gene symbols
– Record chromosome and cytogenetic position
– Consolidate, Segregate
• VCMap
– Query using gene symbols
– Develop gene images
– Analyze for ROIs
– Record ROI attributes
• Key Object Database; Key Object Attributes Database
• Prioritize articles for curation
• Manual curation process
– Fill object templates
– Bulk data processing
– Load data into RGD

Automated Sequence Analysis Pipeline
[Flowchart: ASAP. Box labels as they appear on the slide:]
• Inputs: input sequences, Seq (Fasta/Trace), Markers
• Assembly Seq; Assembled Seq + Unassembled Seq
• Search extra genomic sequence (HTGS, TraceDB, nr)
• RepeatMask (applied before several of the search steps)
• ePCR
• Homolog Search
• UniGene Search
• MetaGene; “Hot Zone”; BLAST (nr+pir)
• VCMap: seqs mapped in ROI; seqs predicted in ROI
• Reports: Repeat Report, Marker Report, Homolog Report, UniGene Report, MetaGene Report, Gene Report, Additional Reports
• Visualization (Clickable Image Map)

miRNA Gene and Target Prediction
• microRNA genes (miRNAs) were recently recognized as a class of functional non-coding genes
• ~70 nt precursor with a hairpin fold
• ~20 nt RNA molecule produced when Dicer cuts the stem-loop
• First identified were lin-4 and let-7
– Developmental role in C. elegans

miRNA Examples

miRNA Gene Prediction
• Can we predict where miRNA genes might be?
• MiRscan (Burge Lab)
1. Scan C. elegans for 70 nt hairpin folds (structure prediction)
2. Compare with C. briggsae (sequence alignment)
3. Score alignments
   • 3’ and 5’ conservation
   • Overall conservation
   • Size of loop, etc.
4. Select sequences with score > threshold
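
The scan-score-select pattern in these steps can be sketched in Python as follows. This is a toy stand-in, not the tool's scoring scheme: real hairpin detection uses RNA secondary-structure prediction, and the cross-species conservation terms are left out here. The sketch assumes a fixed loop length and scores a candidate by the fraction of stem positions that can base-pair. All names, the loop length and the threshold are hypothetical.

```python
# Watson-Crick pairs plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def hairpin_score(rna, loop=6):
    """Fraction of stem positions that pair when the 3' arm folds back on the 5' arm.

    Assumes a single hairpin with `loop` unpaired bases in the middle and no
    bulges; this is a crude stand-in for real structure prediction.
    """
    arm = (len(rna) - loop) // 2
    if arm <= 0:
        return 0.0
    five_prime = rna[:arm]
    three_prime_reversed = rna[len(rna) - arm:][::-1]
    paired = sum((x, y) in PAIRS for x, y in zip(five_prime, three_prime_reversed))
    return paired / arm


def select_candidates(sequences, threshold=0.8):
    """Keep sequences whose hairpin score exceeds the threshold."""
    return [s for s in sequences if hairpin_score(s) > threshold]


if __name__ == "__main__":
    stem = "GCAUGCAUGC"
    perfect = stem + "AAAAAA" + "GCAUGCAUGC"  # 3' arm is the reverse complement of the stem
    print(hairpin_score(perfect), hairpin_score("A" * 26))
```

A fuller version would add the conservation scores against the second genome before applying the threshold.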

miRNA Target Prediction
• lin-4 and let-7 interact with the 3’-UTR
• Idea: look for conserved 3’-UTR regions which are complementary to discovered miRNA genes
1. Find conserved sequences of ~20 nt in length from a 3’-UTR database
2. Align with miRNA genes
3. Score alignment
   • 3’ and 5’ matches
   • Overall matching
4. Select high scoring matches
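
A minimal Python sketch of the complementarity idea: slide the reverse complement of a miRNA along a 3’-UTR sequence and report the window with the most matching positions. This is an assumption-laden illustration with ungapped matching only, no G-U wobble pairs, and no separate weighting of 3’ and 5’ matches. The function names and the example sequences are made up for the demonstration.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna):
    """Return the reverse complement of an RNA string over A, C, G, U."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def best_target_site(mirna, utr):
    """Return (start, matches) for the UTR window most complementary to the miRNA.

    A perfectly complementary site equals the miRNA's reverse complement, so
    the score is the number of positions where the window matches that probe.
    Returns (-1, -1) if the UTR is shorter than the miRNA.
    """
    probe = reverse_complement(mirna)
    n = len(probe)
    best = (-1, -1)
    for start in range(len(utr) - n + 1):
        window = utr[start:start + n]
        matches = sum(p == w for p, w in zip(probe, window))
        if matches > best[1]:
            best = (start, matches)
    return best


if __name__ == "__main__":
    mirna = "UGAGGUAGUAGG"                      # made-up 12 nt example
    utr = "AAAA" + reverse_complement(mirna) + "CCCC"
    print(best_target_site(mirna, utr))
```

Selecting high-scoring matches then amounts to keeping sites whose match count exceeds a chosen cutoff.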

Goals of the Course
• For everyone
– Communication
• For the biologist
– Incorporate bioinformatics into research
– Understand computational modeling
• For the computational scientist
– Develop tools for biological research
– Create new algorithms for mining biological data
– Understand how to find biologically meaningful information

Summary
• Bioinformatics is truly interdisciplinary
– Biology (natural sciences), informatics, mathematics & statistics
• Databases
– Large, semistructured, incomplete, inaccurate
• Wide range of problems
– Solutions combine knowledge from the sciences with algorithms and models from informatics, mathematics, and statistics

Biological Data
• DNA and protein sequences are annotated
– Source
– Organism
– Function
– Updates
– Etc.

Classic examples
• Sequence alignment
• Multiple sequence alignment
Examples from Setubal/Meidanis (1997)
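
For pairwise sequence alignment, a minimal dynamic-programming sketch in Python is given here. It implements standard global alignment in the Needleman-Wunsch style with a linear gap penalty. The scoring values (match +1, mismatch -1, gap -2) and the function name are assumptions chosen for illustration, not values taken from the slides.

```python
def global_alignment(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for an optimal global alignment."""
    rows, cols = len(a) + 1, len(b) + 1

    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1

    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    print(global_alignment("ACGT", "AGT"))
```

The table has (len(a)+1) × (len(b)+1) cells, so time and memory grow with the product of the sequence lengths. That cost is one reason multiple sequence alignment relies on heuristics in practice.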