FATCHIYAH, M.KES. PH.D EMAIL: FATCHIYA@YAHOO.CO

Preview:

Citation preview

F A T C H I Y A H , M . K E S . P H . D

E M A I L : F A T C H I Y A @ Y A H O O . C O . I D

BIOINFORMATIKA DAN BIOLOGI KOMPUTASI DALAM

BIOLOGI MOLEKULER

Introduction

The Human Genome Project

Challenges of Molecular Biology

The changing role of the Biologist in the Age of

Information

Bioinformatics software

Genomics

Impact on medicine

What genes cause the condition? What are the normal function of the gene? What mutations have been linked to diseases? How does the mutation alter gene function?

What laboratories are performing DNA tests? Are there gene therapies or clinical trials?

What names are used to refer to the genes and the diseases?

What other conditions are linked to these same genes?

2/28/2011

3

fatchiyah JB UB Bioinformatic

What is “bioinformatics”?Bioinformatics:

The use of computers to collect, analyze, and interpret biological information at the molecular level.

"The mathematical, statistical and computing methods that aim to solve biological problems using DNA and amino acid sequences and related information."

A set of software tools for molecular sequence analysis

2/28/2011

4

fatchiyah JB UB Bioinformatic

NIH Working Definition

Bioinformatics: Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data. http://www.bisti.nih.gov/CompuBioDef.pdf

fatchiyah JB UB Bioinformatic2/28/2011

5

Bioinformatics

Interdisciplinary approach Computer science, Mathematics & Statistics.

Molecular biology, Biochemistry & Medicine.

Scopes of bioinformatics Genomics Primarily sequences (DNA and RNA) Databanks and search algorithms Supports studies of molecular evolution (“Tree wars”)

Proteomics Sequences (Protein) and structures Mass spectrometry, X-ray crystallography Databanks, knowledge bases, visualization

Functional Genomics (transcriptomics) Microarray data Databanks, analysis tools, controlled terminologies

Systems Biology (metabolomics) Metabolites and interacting systems (interactomics) Graphs, visualization, modeling, networks of entities

DNA RNA Protein PhenotypePhenotype

2/28/2011

7

fatchiyah JB UB Bioinformatic

What the function of bioinformatics analysis

1. Genome sequence: for the first time there is a blueprint of the activity of a cell

2. Gene expression, in the form of cDNA array, and proteomic studies:

how these genes interact, interfacing with each other, and how they form networks.

3. On structural level, the mechanism how these molecules work.

Major impact on diagnosis, treatment, drug discovery, regulation and metabolism, biodegradation

2/28/2011

8

fatchiyah JB UB Bioinformatic

In SilicoIn Vivo

Analysis development

In Vitro

2/28/2011

9

fatchiyah JB UB Bioinformatic

MESDAMESETMESSRSMYN

AMEISWALTERYALLKINCAL

LMEWALLYIPREFERDREVIL

MYSELFIMACENTERDIRATV

ANDYINTENNESSEEILIKENM

RANDDYNAMICSRPADNAPRI

MASERADCALCYCLINNDRKI

NASEMRPCALTRACTINKAR

KICIPCDPKIQDENVSDETAVS

WILLWINITALL

3D

structure

Cell

System Dynamics

Cell

Structures

Complexes

Sequence

Structural Scales

Organism

2/28/2011

10

fatchiyah JB UB Bioinformatic

Blue print of gene

Genome sequence: for the first time there is a blueprint of the activity of a cell

Gene expression, in the form of cDNA array, and proteomic studies: how these genes interact, interfacing with each other, and how they form networks.

On structural level, the mechanism how these molecules work.

Major impact on diagnosis, treatment, drug discovery, regulation and metabolism, biodegradation

Challenges in Computational Biology

February 28, 2011BBSI Summer School - Iowa State University

12

1. Obtain the genome of an organism.

2. Identify and annotate genes.

3. Find the sequences, three dimensional structures, and functions of proteins.

4. Find sequences of proteins that have desired three dimensional structures.

5. Compare DNA sequences and proteins sequences for similarity.

6. Study the evolution of sequences and species.

http://gila.engr.uic.edu/bioinformatics/

Bioinformatics

Computational analysis of high-throughput biological data Whole genome sequencing.

Global genomic expression & profiling.

Functional genomics.

Structural genomics/proteomics

Comparative genomics.

2/28/2011fatchiyah, JB-UB 14

In the mid-1990s, the GenBank database became part of the International Nucleotide Sequence Database Collaboration:

Internationally Networking Collaboration

NCBI investigators maintain on going collaborations with several institutes within NIH and also with numerous academic and government research laboratories

DDBJ Mishima,

Japan

GenBankNCBIUSA

EMBLEuropea

www.ncbi.nlm.nih.gov/

http://www.ebi.ac.uk/

2/28/2011fatchiyah, JB-UB 15

Nucleotide

Protein

PubMed

The original version of Entrez had just 3 nodes: nucleotides, proteins, and PubMed abstracts.

Entrez has now grown to nearly 20 nodes

2/28/2011fatchiyah, JB-UB 16

The future of genomic rests on the foundation of the Human Genome

Project

2/28/2011fatchiyah, JB-UB 17

The Wellcome Trust

Free unrestricted access for all

The door to discovery is wide open

Genome browsers

Ensemblwww.ensembl.org

University of California Santa Cruzhttp://genome.cse.ucsc.edu

European Bioinformatics Instituteswww.ebi.ac.uk

MGD the Jackson Laboratorywww.informatics.jax.org

GenBankwww.ncbi.nlm.nih.gov

DNA Data Bank of Japanwww.ddbj.nig.ac.jp

Genome Databases

I. The Human Genome Project

The genome sequence is complete - almost!

approximately 3.2 billion base pairs.

DNA

Protein

Nucleotides

sequence

Gene expression = Protein production

The Flow of Biotechnology Information

2/28/2011fatchiya

20

Gene > DNA sequenceAATTCATGAAAATCGTATACTGGTCTGGTACCGGCAACAC

TGAGAAAATGGCAGAGCTCATCGCTAAAGGTATCATCGAA

TCTGGTAAAGACGTCAACACCATCAACGTGTCTGACGTTA

ACATCGATGAACTGCTGAACGAAGATATCCTGATCCTGGG

TTGCTCTGCCATGGGCGATGAAGTTCTCGAGGAAAGCGAA

TTTGAACCGTTCATCGAAGAGATCTCTACCAAAATCTCTG

GTAAGAAGGTTGCGCTGTTCGGTTCTTACGGTTGGGGCGA

CGGTAAGTGGATGCGTGACTTCGAAGAACGTATGAACGGC

TACGGTTGCGTTGTTGTTGAGACCCCGCTGATCGTTCAGA

ACGAGCCGGACGAAGCTGAGCAGGACTGCATCGAATTTGG

TAAGAAGATCGCGAACATCTAGTAGA

> Protein sequence

MKIVYWSGTGNTEKMAELIAKGIIESGKDVNTINVSDVNIDELLNEDILILGCSAMGDEVLEESEFEPFIEEISTKISGKKVALFGSYGWGDGKWMRDFEERMNGYGCVVVETPLIVQNEPDEAEQDCIEFGKKIANI

Genome size

Organism Number of base pairs

X-174 virus 5,386

Epstein Bar Virus 172,282

Mycoplasma genitalium 580,000

Hemophilus Influenza 1.8 106

Yeast (S. Cerevisiae) 12.1 106

Human 3.2 109

Wheat 16 109

Lilium longiflorum 90 109

Salamander 100 109

Amoeba dubia 670 109

2/28/2011fatchiya 22

Sanger Technique for Sequencing

A Sequence print-out from a control sample

2/28/2011fatchiya

23

2/28/2011fatchiya 24

FASTA Format2/28/2011fatchiya

25

>identifier descriptive textnucleotide of amino-acid sequence on multiple lines if needed.

Example:>gi|41|emb|X63129.1|BTA1AT B. taurus mRNA for alpha-1-anti-trypsin

GACCAGCCCTGACCTAGGACAGTGAATCGATAATGGCACTCTCCATCACGCGGGGCCTTCTGCTGCTGGC ….

MOST important data format!!!

Modified FASTA Format

2/28/2011fatchiya

26

1) A few tools follow the convention that lower case sequences are masked. (repeat masker, some versions of blast, megablast, blastz)

2) A few analysis tools (like CLUSTAL) want a simplified identifier on the defline. So they can have a short string for the alignment.

>X63129.1

GACCAGCCCTGACCTAGGACAGTGAATCGATAATGGCACTCTC

CATCACGCGGGGCCTTCTGCTGCTGGC ….

2/28/2011fatchiya 27

H5N1

Write specific name

of gene

Click & Go

2/28/2011fatchiya 28GenBank Record-2

GeneBank Record

modification

date

Header

Molecule Type

GenBank Division

Modification Date

Version Number

Accession Number

Locus Name

Sequence Length

GeneBank Record

2/28/2011fatchiya

29

Gene Sequence

2/28/2011fatchiya

30

Searching Biological Databases

February 28, 2011BBSI Summer School - Iowa State University

31

BLAST (Basic Local Alignment Search Tool)

http://www.ncbi.nlm.nih.gov

BLASTN (DNA)

BLASTP (Protein)

BLASTX (DNA against Protein)

PSI-BLAST (Position Specific Iterative BLAST)

Multiple Alignment Software

February 28, 2011BBSI Summer School - Iowa State University

32

Clustalw (http://www.ebi.ac.uk/clustalw)

MSA (http://softlib.rice.edu/softlib/msa.html)

HMMER (http://hmmer.wustl.edu/)

SAM (http://www.cse.ucsc.edu/research/ compbio/sam.html)

2/28/2011 fatchiyah, JB-UB33

Biology Information on the Internet

Introduction to Databases

Searching the Internet for Biology Information.

General Search methods

Biology Web sites

Introduction to Genbank file format.

Introduction to Entrez and Pubmed

Ref: Chapters 1,2,5,6 of “Bioinformatics”

2/28/2011fatchiyah, JB-UB 34

The Wellcome Trust

Free unrestricted access for all

The door to discovery is wide open

Genome browsers

Ensemblwww.ensembl.org

University of California Santa Cruzhttp://genome.cse.ucsc.edu

European Bioinformatics Instituteswww.ebi.ac.uk

MGD the Jackson Laboratorywww.informatics.jax.org

GenBankwww.ncbi.nlm.nih.gov

DNA Data Bank of Japanwww.ddbj.nig.ac.jp

Genome Databases

2/28/2011 fatchiyah, JB-UB35

Refseq and LocusLink

Attempt to produce 1 mRNA, 1 protein, and 1 genomic gene for each frequently occuring allele of a protein expressing gene.

www.ncbi.nlm.nih.gov/LocusLink

Special non-genbank Accession numbers

NM_nnnnnn mRNA refseq

NP_nnnnnn protein refseq

NC_nnnnnn refseq genomic contig

NT_nnnnnn temporary genomic contig

NX_nnnnnn predicted gene

Ensembl Overviewsearching and browsing

Background

HGP draft sequence of human genome

jigsaw puzzle: assembling semi-accurate sequences coming from all over the world

Ensembl

EMBL-EBI and Sanger Institute

Automatic system track sequenced pieces (and their changes)

assemble into larger stretches

analyze to find genes

Ensembl Data

Annotation features identified on each DNA sequence

Examples genes: known genes or predicted by Ensembl

SNP (single nucleotide polymorphisms)

repeats (regions of simple repetitive seq.)

homologies (regions highly similar to other sequence in the public databases)

Data Access

A web-based genome browser which can be customized as required

A web-based system for data export and data mining

'Dumps' of sequence and other data sets for you to download

Direct access to the databases

A Perl-based object layer

Gene Prediction

finding genes GeneScan: a gene finding software

comparison with all known genes matches are considered supporting evidence

Ensembl Services

BLAST

Sequence browsing

Identifier search

Known gene names

OMIM diseases

Free text search of OMIM, SWISS-PROT, InterPro annotation

Contig View (Customization)

Marker View

Ensembl Gene Report

Recommended