Computational Methods in Bioinformatics-Dr Elshafei

  • 7/30/2019 Computational Methods in Bioinformatics-Dr Elshafei

    1/34

    March 16, 2004 1

    Computational Methods in Bioinformatics
    Dr. Moustafa Elshafei
    Systems Engineering Department

    Topics
    What is Bioinformatics?
    Introduction to molecular genetics
    Some challenging problems
    Review of the current computational techniques
    Future approaches
    Conclusion

    What is Bioinformatics?

    Bioinformatics is a management information system for molecular biology
    Organization of a huge amount of information in gene banks and protein banks
    Data mining and analysis tools
    Modeling, interpreting and predicting biological activities

    Introduction to molecular genetics
    Molecules
    Lipids
    Proteins

    Nucleus and Nucleolus
    Plant cell: note the large nucleus and nucleolus in the centre of the cell.

    Chromosomes and Genes

    How long is DNA?
    The DNA helix (2 nm wide) is wound around histones into a fibre 11 nm in diameter, then compacted into a 30 nm chromatin fibre, then coiled to a diameter of 700 nm, and finally formed into chromosomes 1400 nm in diameter.
    If the DNA strand of the human gene had a diameter of 1 mm, it would stretch to 25 km. It would be wound, twisted, and coiled until it became a chromosome 50 cm in diameter and 4 m long.

    Chromosomes
    Chromosomes are the cellular components that contain genes; in animals and plants they are located in the nucleus.
    Genes are the functional units of inheritance.
    Genes are specific segments of DNA that code for specific proteins, which control cell structure and function.

    The number of chromosomes varies from one organism to another
    Human 46
    Chicken 78
    Mouse 40
    Wheat 42
    Corn 20
    Fruit fly 8
    Scorpion 4

    Genes & Genetics

    Deoxyribonucleic acid (DNA)

    A pair of sequences built from four nucleotides: cytosine (C), guanine (G), adenine (A), and thymine (T). A pairs with T, and C pairs with G; the pairs are held together by hydrogen bonds.

    TCTCGGCATTAGGGCCT

    TCTCGGCATTAGGGCCT

    AGAGCCGTAATCCCGGA
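The pairing rule (A with T, C with G) is enough to derive the second strand from the first. A minimal sketch in Python; the function names `complement` and `reverse_complement` are illustrative and do not come from the slides:

```python
# Watson-Crick pairing: A<->T and C<->G.
PAIRS = str.maketrans("ACGT", "TGCA")

def complement(strand: str) -> str:
    """Return the base-by-base complement, written in the same left-to-right order."""
    return strand.upper().translate(PAIRS)

def reverse_complement(strand: str) -> str:
    """Return the complementary strand read in its own 5'->3' direction."""
    return complement(strand)[::-1]

top = "TCTCGGCATTAGGGCCT"
print(top)
print(complement(top))
```

Applied to the strand above, `complement` reproduces the second strand shown on the slide.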

    Genome length in nucleotide pairs
    Virus            5k
    E. coli          4,700k
    Human being      3,000,000k
    Corn             4,500,000k
    Salamander       72,500,000k

    Genes and proteins
    Genes are segments of DNA which code for proteins.
    A segment of DNA that codes for a specific protein is a structural gene.
    Protein synthesis is also governed by a genetic code.
    Every function in a cell is controlled by some kind of protein.
    Proteins are formed as strands built from 20 amino acids.
    Each group of three nucleotides is called a codon.
    The 64 possible codons are mapped to Start, Stop, and one of the 20 amino acids.

    Protein Mapping
    A protein consists of a chain of amino acids.
    There are 20 amino acids.
    Each amino acid is coded by three bases.
    During protein synthesis, DNA is transcribed into mRNA, with T replaced by U.
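The mapping from codons to amino acids can be sketched in a few lines of Python. The table below is the standard genetic code written as a compact 64-letter string (`*` marks Stop); the function names, and the assumption that the input is the coding strand and that translation starts at the first AUG, are illustrative choices rather than anything stated on the slides:

```python
from itertools import product

# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ... (base order U, C, A, G).
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def transcribe(coding_strand: str) -> str:
    """DNA -> mRNA for the coding strand: every T becomes U."""
    return coding_strand.upper().replace("T", "U")

def translate(mrna: str) -> str:
    """Translate from the first AUG (Start) until a Stop codon or the end of the sequence."""
    start = mrna.find("AUG")
    if start < 0:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # Stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)

mrna = transcribe("ATGTTTGGATGA")
print(mrna, "->", translate(mrna))
```

The table has 4 x 4 x 4 = 64 entries, covering the 20 amino acids plus three Stop codons; AUG doubles as Start and as the codon for methionine (M).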

    Protein Expression

    Gene length ranges from 30k to 250k bp; exon regions range from 69 to 3106 bp. Introns can be as large as 32k bp.
    The mean internal coding exon is 150 bp.
    Eukaryotes have only 10% of their DNA coding for proteins. Humans may have as little as 1% coding for proteins. Viruses and prokaryotes use a great deal more of their DNA.
    The Human Genome Project, completed in 2003, found 3 billion bp and about 30,000 genes, compared to 13,600 for the fruit fly, over 14,000 for mosquitoes, and 50,000 for rice.

    If the number of genes really turns out to be about 30,000, then this can be a testament to the marvellous design of life. Only a genius could create us with so few genes performing so many functions.
    A famous scientist in genetics.

    An RNA gene is any gene that is not translated into a protein. A commonly used synonym for "RNA gene" is noncoding RNA (ncRNA).
    RNA genes encode certain regulatory functions.
    RNA genes are not predictable by current algorithms. It is not clear how many of these are hidden in the human genome.

    Gene Banks

    Challenges

    1- Gene finding: identify potential gene regions in DNA; however, only 1-3% of the human genome is translated into proteins.
    2- Finding a region of interest: raw sequencing is performed on pieces of random length between 500 and 5000 bp, possibly with large overlapping parts at both ends, and with 6 possible interpretations of each strand. Algorithms are needed to align the fragments.
    3- Multiple alignment of a set of genes to reveal regions of similarity and cross-species changes.
    4- Local alignment and similarity search: statistical grouping, clustering, and statistical similarity measures for coarse classification.
    5- Protein structure prediction: given a protein sequence, determine how it would fold into a specific complex 3D shape.
    6- Locating the non-coding genes (RNA).
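One way to read the "6 possible interpretations" in challenge 2 is as the six reading frames of a sequenced fragment: three codon offsets on the fragment as read, and three more on its reverse complement. That reading is an assumption, and the function name `six_frames` is hypothetical. A minimal sketch in Python:

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def six_frames(fragment: str) -> dict:
    """Split a DNA fragment into codons for all six reading frames.

    Keys "+1".."+3" are offsets on the fragment as given;
    keys "-1".."-3" are offsets on its reverse complement.
    """
    forward = fragment.upper()
    reverse = forward.translate(COMPLEMENT)[::-1]
    frames = {}
    for sign, strand in (("+", forward), ("-", reverse)):
        for offset in range(3):
            # Only complete codons are kept; trailing 1-2 bases are dropped.
            codons = [strand[i:i + 3] for i in range(offset, len(strand) - 2, 3)]
            frames[f"{sign}{offset + 1}"] = codons
    return frames

for name, codons in six_frames("ATGGCCTAA").items():
    print(name, codons)
```

A gene finder that does not know which strand or offset a fragment came from has to consider all six codon lists.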

    Methods

    Similarity Search

    Content search

    Signal Search

    Common Software Uses
    Similarity analysis
    Sequence analysis
    Sequence alignment
    Population genetics statistical analysis
    Format conversion, database maintenance and searching

    Database Fast Search
    BLAST & FASTA
    Query a database for DNA sequences similar to a given sequence.
    Rely on identifying brief subsequences (k-tuples). Multiple k-tuples serve as seeds for extended alignment.
    Versions for DNA and protein sequences.
    Limited capability to handle gaps in coding regions.
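The seed-and-extend idea can be illustrated with a toy example. This is a simplified sketch of k-tuple seeding with ungapped, exact-match extension; it is not the actual BLAST or FASTA implementation, and the function names and the choice k = 4 are assumptions made for illustration:

```python
from collections import defaultdict

def index_ktuples(seq: str, k: int) -> dict:
    """Map every k-tuple in seq to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index

def find_seeds(query: str, subject: str, k: int = 4) -> list:
    """Return (query_pos, subject_pos) pairs where a k-tuple matches exactly."""
    index = index_ktuples(subject, k)
    seeds = []
    for q in range(len(query) - k + 1):
        for s in index.get(query[q:q + k], []):
            seeds.append((q, s))
    return seeds

def extend_seed(query: str, subject: str, q: int, s: int, k: int) -> tuple:
    """Extend a seed left and right without gaps while the bases keep matching.

    Returns (query_start, subject_start, match_length).
    """
    left = 0
    while q - left - 1 >= 0 and s - left - 1 >= 0 and query[q - left - 1] == subject[s - left - 1]:
        left += 1
    right = k
    while q + right < len(query) and s + right < len(subject) and query[q + right] == subject[s + right]:
        right += 1
    return q - left, s - left, left + right

query, subject = "GGCATTAG", "TCTCGGCATTAGGGCCT"
for q, s in find_seeds(query, subject):
    print((q, s), "->", extend_seed(query, subject, q, s, 4))
```

Because extension here never opens a gap, an insertion or deletion ends the match, which is one way to picture the limited gap handling noted above.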

    Gene Prediction/Gene analysis
    The most common:
    GRAIL*
    FGENEH/FGENES
    MZEF
    GENSCAN*
    Procrustes
    GeneID
    GeneParser
    HMMgene

    GRAIL
    Gene Recognition and Analysis Internet Link
    There are multiple versions: Grail 1, Grail 1a, Grail 2, GRAIL III, etc.
    GRAIL II uses neural networks to classify introns and exons.
    GRAIL III uses dynamic programming to find the optimal combinations of introns and exons.
    Refinements: consideration of contextual information, and linguistic methods.

    GenScan
    Predicts complete gene structures
    Input sequence may represent more than one gene
    It follows a probabilistic model
    Uses Markov models, including a generalized hidden Markov model
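To give a feel for how a hidden Markov model labels a sequence, here is a toy two-state example decoded with the Viterbi algorithm. It is a plain HMM, not GenScan's generalized HMM, and every state name and probability below is made up for illustration:

```python
import math

# Hypothetical model: "N" = noncoding (AT-rich), "C" = coding (GC-rich).
STATES = ("N", "C")
START = {"N": 0.5, "C": 0.5}
TRANS = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
EMIT = {
    "N": {"A": 0.35, "T": 0.35, "C": 0.15, "G": 0.15},
    "C": {"A": 0.15, "T": 0.15, "C": 0.35, "G": 0.35},
}

def viterbi(seq: str) -> str:
    """Return the most probable state path for a non-empty DNA string (log-space)."""
    best = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        pointers, updated = {}, {}
        for s in STATES:
            # Best predecessor state for landing in s at this position.
            prev = max(STATES, key=lambda p: best[p] + math.log(TRANS[p][s]))
            pointers[s] = prev
            updated[s] = best[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][base])
        backpointers.append(pointers)
        best = updated
    # Trace back from the best final state.
    state = max(STATES, key=best.get)
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))

print(viterbi("AAAAAAGGGGGG"))
```

The decoder switches state only where the change in base composition outweighs the cost of a transition; a real gene finder adds states for exons, introns, and splice signals, together with explicit length distributions.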

    Multiple Sequence Alignment Programs
    Discover the commonalities and evolutionary relations among a set of genes or proteins.
    Examples
    ClustalW
    DiAlign
    MAP
    Alignment editors
    Bioedit

    ClustalW
    Seeks the best global alignment for a set of input sequences (nucleic acid or protein).
    A global alignment refers to the best match over the total length of the sequences.
    Produces a similarity tree with scores.
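For two sequences, a global alignment score can be computed by dynamic programming (the Needleman-Wunsch recurrence). The sketch below is a generic textbook version with made-up scoring values (match +1, mismatch -1, gap -2); it is not ClustalW's own scoring scheme:

```python
def global_alignment_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Best score for aligning all of a against all of b, with a linear gap penalty."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per base.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
    return score[-1][-1]

print(global_alignment_score("ACGT", "AGT"))
```

Filling the table takes time proportional to the product of the two sequence lengths.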

    CLUSTALW
    Step 1: Pairwise alignment, distance matrix
      Calculates distance scores between pairs
      Cost: O(q^2 l^2), where q is the number of sequences and l is the mean length
    Step 2: Guide tree
      Group nearest first
      Build tree sequentially
      Cost: O(q^3)
    Step 3: Progressive alignment
      Align, starting at leaves of tree
      Cost: O(q l^2)
    Other programs (MAP) use DP to find the most likely evolutionary sequence.
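Steps 1 and 2 can be sketched as follows. This is a simplified illustration, not ClustalW's actual procedure: it assumes equal-length sequences, uses the fraction of differing positions as the distance instead of a pairwise alignment score, and builds the tree by average-linkage grouping. The names `p_distance` and `guide_tree` are hypothetical:

```python
from itertools import combinations

def p_distance(a: str, b: str) -> float:
    """Fraction of positions that differ (assumes equal-length sequences)."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def guide_tree(seqs: dict):
    """Group the nearest clusters first and return the tree as nested tuples of names."""
    names = list(seqs)
    # Step 1: distance matrix over all q(q-1)/2 pairs.
    dist = {frozenset(p): p_distance(seqs[p[0]], seqs[p[1]]) for p in combinations(names, 2)}

    def average(c1, c2):
        members1, members2 = c1[1], c2[1]
        total = sum(dist[frozenset((x, y))] for x in members1 for y in members2)
        return total / (len(members1) * len(members2))

    # Step 2: each cluster is (subtree, member names); merge the closest pair until one remains.
    clusters = [(name, [name]) for name in names]
    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: average(clusters[ij[0]], clusters[ij[1]]))
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]

example = {"a": "ACGTACGT", "b": "ACGTACGA", "c": "TTGTCCGA", "d": "TTGTCCGT"}
print(guide_tree(example))
```

Step 3 would then align sequences in the order given by the tree, starting from the closest pairs at the leaves.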

    Protein Structure Prediction

    Neural networks (NNs) are the basis for many well-known software packages for predicting protein structures.
    The main software packages:
    nnPredict
    PredictProtein
    Predator
    PSIPRED
    SOPMA

    POSSIBLE RESEARCH DIRECTIONS
    Neuro-fuzzy techniques
    Genetic algorithms
    Theory of error-correction codes
    Wavelets
    Spectrum analysis
    Dynamic modeling of protein expression

    THANK YOU