HANOI UNIVERSITY OF AGRICULTURE
FACULTY OF BIOTECHNOLOGY

Lecture notes
APPLIED BIOINFORMATICS

NGUYỄN ĐỨC BÁCH

Hanoi, August 2013


    PART 1. GENERAL INTRODUCTION

    CHAPTER 1. INTRODUCTION TO BIOINFORMATICS
    1.1. Concept
    1.2. Biological foundations and the development of bioinformatics
    1.3. The role of bioinformatics in biological research
    1.4. Tasks and research directions of bioinformatics
    1.5. Development trends in bioinformatics
    Chapter 1 summary
    Chapter 1 review questions

    CHAPTER 2. BIOLOGICAL FOUNDATIONS OF BIOINFORMATICS
    2.1. Nucleic acids and proteins
    2.2. Structure of nucleic acids
    2.3. Genomes and genome research
    2.4. Finding genes and determining gene function in a genome
    2.5. Gene function and the regulation of gene activity
    2.6. The proteome and protein research (proteomics)
    2.7. Evolution and the molecular nature of the evolution of organisms
    2.8. Analysing evolutionary relationships between organisms
    Chapter 2 summary
    Chapter 2 review questions

    CHAPTER 3. SEARCHING FOR AND MANAGING RESEARCH LITERATURE
    3.1. Methods of searching for information
    3.2. How to find literature for research
    3.3. Getting started with PubMed
    3.4. How to manage research literature
    Chapter 3 summary
    Chapter 3 review questions

    PART 2. BIOLOGICAL DATABASES AND SUBMITTING SEQUENCES TO DATABASES

    CHAPTER 4. BIOLOGICAL DATABASES
    4.1. Primary databases
    4.1.1. Nucleotide sequence databases
    4.1.2. Protein sequence databases
    4.1.3. Molecular structure databases
    4.2. Secondary databases
    4.3. Other databases
    4.3.1. Genotype and phenotype databases
    4.3.2. Genotype database (PhenomicDB)
    4.3.3. PubChem
    4.4. Gene banks
    Chapter 4 summary
    Chapter 4 review questions

    CHAPTER 5. SEQUENCING AND SUBMITTING SEQUENCES TO A GENE BANK
    5.1. Nucleotide sequencing
    5.2. Genome sequencing
    5.3. Sequence assembly
    5.4. Sequence submission
    5.5. Sequence submission tools
    5.5.1. Information to prepare before submitting a sequence
    5.5.2. Example: submitting a sequence with WebIn
    5.5.3. Example: submitting a sequence with Sequin
    Chapter 5 summary
    Chapter 5 review questions

    PART 3. TOOLS FOR ANALYSING, MINING AND PROCESSING BIOLOGICAL SEQUENCE DATA

    CHAPTER 6. GENOME BROWSERS
    6.1. The concept of a genome browser
    6.2. Introduction to some important genome browsers
    6.2.1. Ensembl
    6.2.2. UCSC
    6.2.3. NCBI Genomes and MapViewer
    6.3. Characteristics and applications of genome browsers
    Chapter 6 summary
    Chapter 6 review questions

    CHAPTER 7. GETTING STARTED WITH TOOLS FOR ANALYSING BIOLOGICAL DATABASES
    7.1. Getting started with basic analysis tools
    7.1.1. Finding and copying sequences
    7.1.2. Tools for searching for similar sequences
    7.2. Finding functional and conserved regions
    7.2.1. Multiple sequence alignment
    7.2.2. Restriction map construction
    7.2.3. Predicting the secondary and tertiary structure of proteins
    7.2.4. Nucleic acid sequence analysis
    7.2.5. Designing PCR primers and nucleic acid hybridization probes
    7.2.6. Identifying open reading frames
    7.2.7. Finding scientific articles
    7.2.8. Sequence assembly
    7.2.9. Analysing evolutionary relationships
    7.2.10. Protein analysis
    7.2.11. Studying gene expression
    7.3. Groups of analysis tools
    7.3.1. NCBI analysis tools
    7.3.2. EMBL tools
    7.3.3. ExPASy tools
    7.3.4. Other groups of tools
    Chapter 7 summary
    Chapter 7 review questions

    CHAPTER 8. GETTING STARTED WITH BIOLOGICAL DATA ANALYSIS
    8.1. Finding data in database repositories
    8.1.1. Sequence data
    8.1.2. Structure data
    8.1.3. Other data
    8.2. Sequence analysis
    8.2.1. Sequence comparison
    8.2.2. Analysing open reading frames and coding regions
    8.2.3. Searching for promoters and gene regulatory regions
    8.2.4. Functional motif searching in proteins
    8.2.5. Predicting and modelling protein interactions

    CHAPTER 9. SEQUENCE ALIGNMENT AND ITS PRINCIPLES
    9.1. Introduction to sequence alignment
    9.2. Principles of sequence alignment
    9.3. Multiple sequence alignment and its principles
    9.4. Tools for searching for homologous sequences

    CHAPTER 10. ANALYSING EVOLUTIONARY RELATIONSHIPS
    10.1. Concept
    10.2. Data used to build evolutionary trees
    10.2.1. Distance-based methods
    10.2.2. Character-based methods
    10.3. Choosing an evolutionary model
    10.4. Evaluating phylogenetic trees

    PART 1. GENERAL INTRODUCTION

    CHAPTER 1. INTRODUCTION TO BIOINFORMATICS

    1.1. Concept

    Bioinformatics is the science of applying mathematics and computer science to biology, particularly molecular biology and medicine. The term was first introduced by Paulien Hogeweg in 1979 to describe the study of processes in biological systems. In the late 1980s it was adopted in genetics and genome research. Bioinformatics covers sequencing and the management, analysis and exploitation of biological databases. Today it involves building and developing databases, algorithms, statistics and computational techniques to solve theoretical and experimental problems in managing and analysing biological data. It also includes simulating and predicting interactions between molecules and biological processes.

    Figure 1: Bioinformatics and its relationship to related fields

    1.2. Biological foundations and the development of bioinformatics

    The discovery that DNA carries genetic information, and the determination of its structural model, opened the era of molecular biology. DNA encodes mRNA and other types of RNA. Proteins translated from mRNA carry out many biological functions in the cell, including regulating gene activity and other biological processes. Although sequencing the genome of an organism has now become routine, clarifying the genetic information a genome contains, how its genes function, and how they interact remains a major challenge. In humans, for example, each cell contains 23 pairs of chromosomes, and the genome is about 3.2 × 10^9 nucleotide pairs long and contains about 23,000 genes (1). Transcription and translation are by now understood in outline, but determining the exact number of genes, their positions and their interactions is still a hard question.

    (1) International Human Genome Sequencing Consortium (2004). "Finishing the euchromatic sequence of the human genome". Nature 431 (7011): 931-945.

    With the rapid development of new techniques and technologies, more and more biological data, mainly nucleotide and amino acid sequences, is generated every day. Collecting and storing this data, making it accessible and searchable, and analysing and comparing the relationships between records in enormous databases are the tasks of bioinformatics. In practice this requires bioinformaticians and computer scientists to develop new algorithms that improve accuracy and save time for biological researchers.

    Bioinformatics is a multidisciplinary field. To a certain extent it rests on molecular biology (the source of the data to be analysed), computer science (which provides the hardware for analysis and the computer networks for comparing and cross-checking results), and data analysis algorithms. These three elements are vital to bioinformatics. Molecular biology is itself a relatively new field built on many basic sciences, the most important being genetics, biochemistry and cell biology. Creating, studying and applying bioinformatics therefore also requires basic interdisciplinary knowledge and an understanding of computer science. Below are some important historical milestones in the development of molecular biology and bioinformatics.

    Year    Milestone

    1930    Tiselius introduces electrophoresis for analysing proteins in solution.
    1951    Pauling and Corey propose the alpha-helix and beta-sheet structures.
    1953    Watson and Crick propose the DNA double-helix model, based on X-ray diffraction data obtained by Franklin and Wilkins.
    1954    Perutz's group develops heavy-atom methods to solve the phase problem in protein crystallography.
    1955    The first protein sequence to be determined, bovine insulin, is reported by F. Sanger.
    1970    The Needleman-Wunsch algorithm for sequence alignment is published.
    1972    The first recombinant DNA molecule is created by Paul Berg and his group.
    1973    The Brookhaven Protein Data Bank is announced.
    1974    Vint Cerf and Robert Kahn develop the TCP communication protocol, the foundation of the internet.
    1975    Two-dimensional electrophoresis is developed by P. H. O'Farrell.
            The Southern blot method is described and published by E. M. Southern.
    1977    The protein database PDB is formally established.
            Maxam and Walter Gilbert (Harvard) and Frederick Sanger (U.K. Medical Research Council) publish methods for sequencing DNA.
            The first complete genome of an organism (phage ΦX174) is published. The genome contains 5,386 base pairs encoding 9 proteins.
    1980    Multi-dimensional NMR is used to determine protein structures.
    1981    The Smith-Waterman algorithm for sequence alignment is published.
    1982    The Genetics Computer Group (GCG) creates a suite of molecular biology analysis tools at the Biotechnology Center of the University of Wisconsin.
    1985    The FASTP algorithm is published.
            PCR is described by Kary Mullis and co-workers.
    1986    The term "genomics" appears for the first time, describing the scientific discipline of mapping, sequencing and analysing genes. The term was coined by Thomas Roderick and later became the name of a well-known journal, Genomics.
            The SWISS-PROT database is created by the Department of Medical Biochemistry of the University of Geneva together with the European Molecular Biology Laboratory (EMBL).
    1987    The yeast artificial chromosome (YAC) is introduced.
            The physical map of E. coli is published.
            The Perl programming language is developed by Larry Wall.
    1988    The National Center for Biotechnology Information (NCBI) is established at the U.S. National Library of Medicine.
            The Human Genome Project is initiated (Commission on Life Sciences, National Research Council. Mapping and Sequencing the Human Genome, National Academy Press: Washington, D.C., 1988).
            The FASTA algorithm for sequence comparison is published by Pearson and Lipman.
            Des Higgins and Paul Sharp announce the CLUSTAL program.
    1990    The BLAST program is released (Altschul et al.).
            Molecular Applications Group is founded in California by Michael Levitt and Chris Lee. Its products, Look and SegMod, are used for molecular modelling and protein design.
            InforMax is founded in Bethesda, MD. Its products are software for sequence analysis, database and data management, searching, graphical data display, clone construction, mapping and primer design.
    1991    The research institute in Geneva (CERN) announces the protocols that make up the World Wide Web.
    1997    The genome of E. coli (4.7 Mbp) is published.
    1998    The genomes of Caenorhabditis elegans and baker's yeast are published.
            The Swiss Institute of Bioinformatics is established as a non-profit research foundation.
    2000    The genome of Pseudomonas aeruginosa (6.3 Mbp) is published.
            The genome of Arabidopsis thaliana (100 Mb) is sequenced.
            The genome of Drosophila melanogaster (180 Mb) is sequenced.
    2001    The human genome (3,000 Mbp) is published.
    2004    The draft genome of the rat, Rattus norvegicus, is published.
    2004    Next-generation sequencing formally begins, starting with 454 sequencing.
    2008    The 1000 Genomes Project begins (http://www.1000genomes.org/).

    1.3. The role of bioinformatics in biological research

    Over the last few decades, rapid progress in genomics and molecular biotechnology has produced a very large volume of information that serves as the basis for comparative analysis. Analysing these databases requires algorithms combined with computer science. By tightly combining databases, algorithms and computer science, bioinformatics helps to clarify the nature of biological processes. Its role can be summarized as follows:

    - Collecting, organizing and managing biological data (databases);
    - Developing tools for searching data (search tools, data mining);
    - Sequence analysis, genome annotation and genomic comparison;
    - Structure modelling, molecular interaction modelling and prediction of protein structure;
    - Protein function analysis, protein interactions and metabolic pathways, modelling biological systems, and analysis of gene expression profiles;
    - Genome sequence analysis to detect genes, mutated genes and cancer, to determine the roles of genes, and to work toward therapies (genome analysis and treatment);
    - Analysis of evolutionary relationships and population genetics using software and computational tools;
    - High-throughput image analysis;
    - Developing algorithms and software to meet the needs of scientists in biology.

    Sequence analysis

    Sequence analysis is a process made up of many operations: searching for sequence data, comparing sequences with one another, and combining the results with other tools to extract the information contained in the sequence under study. The information obtained includes homology, functionally active regions (domains), characteristic regions (motifs), the positions of genes in the genome (gene finding), and the elements that regulate gene activity (promoters, introns, exons, transcriptional regulatory regions).

    In 1977 the first genome to be sequenced was that of phage ΦX174. Since then the genomes of thousands of organisms have been sequenced and stored in gene banks. Many important bioinformatics tools and programs supporting the analysis and comparison of biological sequences have been developed and are widely used.

    Genome annotation

    In genome research, the process of marking DNA sequences and attaching biological information to them is called annotation. The first software system for genome annotation was built by Dr. Owen White in 1995, and its first subject was the bacterium Haemophilus influenzae. The initial goal of the system was to find the genes, tRNAs and other elements in the genome and then assign known biological functions to them. Many genome annotation systems have been developed since. They are fundamentally similar but differ in their algorithms and computer programs.

    Genome comparison

    The core of genome comparison is establishing the similarity or relationship between genes (orthology analysis) or between other shared features in the genomes of different organisms. Genome comparisons are displayed as maps of correspondence between genomes, which make it possible to detect the events, and the extent of genome change, that arose during evolution and led to differences between genomes, between gene regions, or between individual genes.

    Complex evolutionary events at many levels drive genome evolution. At the lowest (molecular) level, point mutations change the genome at single nucleotides. Such a change may be severe, neutral, or have no effect at all. At a higher level, duplications, inversions, deletions and changes in the position of DNA sequences within chromosomes (jumping genes, transposable elements) alter the physical organization of the genome. Over time, whole genomes take part in hybridization, polyploidization and endosymbiosis, leading to speciation. The complexity of genome evolution makes it difficult to develop algorithms and mathematical models that simulate it exactly. For this reason many algorithms in bioinformatics are heuristic, giving the most plausible answer, rather than exact. The algorithms and models in common use today include heuristics, approximation algorithms, parsimony models, Markov chain Monte Carlo algorithms, Bayesian analysis and probabilistic models.

    Building and modelling structures

    Predicting the structure of protein molecules is one of the important applications of bioinformatics. The amino acid sequence of a protein can be determined directly or deduced from the nucleotide sequence of the gene that encodes it. Modelling a structure requires specific information about the protein, ideally its crystal structure. When a protein is hard to crystallize, or only the amino acid sequence is available, the sequence of the protein or polypeptide can be compared algorithmically with known proteins in databases to find homologues, from which an approximate model structure of the unknown protein is derived. As a rule of thumb, sequences that are more than 40% similar can be used for structure prediction. Although sequence similarity and structural similarity are closely correlated, in many cases proteins with similar structures have different amino acid sequences. Determining or modelling a structure therefore cannot rely on algorithms or computer programs alone. In many cases modelling is used only for screening and as a reference.
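    The 40% rule of thumb above can be checked with a short calculation once two sequences have been aligned. The following is a minimal sketch, not the method of any particular tool: it assumes the two sequences are already aligned to equal length with "-" marking gaps, and it measures identity (exact matches) over ungapped columns, which is only one of several possible definitions of similarity. The function names and the example sequences are hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of identical residues over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == gap or y == gap:
            continue  # gapped columns are excluded from the comparison
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


def passes_threshold(aligned_a: str, aligned_b: str, threshold: float = 40.0) -> bool:
    """Apply the >40% rule of thumb (threshold is an adjustable assumption)."""
    return percent_identity(aligned_a, aligned_b) > threshold


# Invented example: 8 of the 10 ungapped columns are identical.
query = "MKT-AYIAKQR"
template = "MKSLAYLAKQR"
print(percent_identity(query, template))  # 80.0
print(passes_threshold(query, template))  # True
```

    A pair that passes this check is only a candidate template; as the text notes, similar structures can have quite different sequences, so a low score does not rule out structural similarity.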

    The correspondence between human haemoglobin and the leghemoglobin of legumes is one example of the relationship between sequence and structure. Both proteins transport oxygen. Although their amino acid sequences are very different, their structures are remarkably similar. This also reflects the relationship between structure and function.

    Molecular interaction modelling

    Molecular interaction modelling builds models that describe what happens when two or more molecules come into contact. Information about an interaction includes its location, the interacting groups and the mechanism by which the interaction forms. Molecular interactions involve thermodynamic changes and changes in molecular state (changes in charge, shifts of bonding groups, changes in conformation and spatial geometry). Typical molecular interactions include protein-protein or protein-peptide, enzyme-substrate and ligand-binding partner interactions. The term commonly used today is docking, and the corresponding algorithms are docking algorithms.

    Supporting techniques include circular dichroism (CD), X-ray crystallography and protein nuclear magnetic resonance spectroscopy (protein NMR). One important question is whether analysing the 3D molecular structure is enough to predict molecular interactions, or whether specific experiments are needed for each protein pair (protein-protein interaction experiments) in addition to protein-protein docking.

    Prediction of protein structure

    Protein structure prediction draws on information such as the amino acid sequence, mass spectrometry (MS) results, crystallization and X-ray diffraction analysis, and shared biological characteristics (similarity based on carrying out the same biological function, or enzymes that catalyse the same type of reaction or act on the same group of substrates).

    The algorithms are all based on calculations of chemical bonds, the likelihood of bond formation, interactions between molecules, thermodynamic analysis, free energy and binding energy, from which spatial structural models are built. At present, however, analysing and comparing known structures and functions is still regarded as the foundation of protein structure prediction. New proteins whose structures have not been determined are therefore usually predicted by sequence comparison combined with physical and chemical properties.

    Analysis of gene expression

    Databases of mRNA, cDNA and ESTs help to detect whether genes are expressed and at what level. Databases of protein microarray and mass spectrometry (MS) data play a very important role in analysing or detecting the presence of a particular protein in a biological sample. Comparing and cross-referencing these databases shortens research time. The process often becomes complicated, however, when large numbers of samples are processed (high-throughput analysis) and the data are noisy because of experimental error.

    From genome sequence analysis to therapy

    One of the main causes of cancer is the accumulation of mutations. Analysing many sequences can reveal hidden mutations in genes associated with cancer. Bioinformatics builds automated analysis systems to manage and store this information, supporting the search and comparison of genes and genomes to detect polymorphisms (for example, the databases dbVar, dbSNP and CancerChromosome). The results of these analyses make diagnosis and treatment easier. A typical example is the development of different drugs tailored to the response of each individual.

    New techniques now in use include comparing nucleotide sequences to detect single-nucleotide differences and find point mutations (single-nucleotide polymorphism arrays) at many positions and regions across the genome. The algorithms currently used include hidden Markov models and change-point analysis methods.
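    The idea of comparing sequences at single-nucleotide resolution can be illustrated with a very small example. This is a deliberately naive sketch, not a hidden Markov model or change-point method: it assumes the sample and reference are already aligned and of equal length, and it simply reports every position where they differ. The function name and the sequences are hypothetical.

```python
from typing import List, Tuple


def find_point_differences(reference: str, sample: str) -> List[Tuple[int, str, str]]:
    """Return (1-based position, reference base, sample base) for each mismatch."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    differences = []
    pairs = zip(reference.upper(), sample.upper())
    for position, (ref_base, sample_base) in enumerate(pairs, start=1):
        if ref_base != sample_base:
            differences.append((position, ref_base, sample_base))
    return differences


# Invented example: the sample differs from the reference at positions 6 and 11.
reference = "ATGGCCATTGTA"
sample = "ATGGCTATTGCA"
for position, ref_base, sample_base in find_point_differences(reference, sample):
    print(f"position {position}: {ref_base} -> {sample_base}")
```

    Real variant detection must also handle insertions, deletions and sequencing errors, which is where the statistical methods named above come in.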

    Computational evolutionary biology

    Evolutionary research includes determining the evolutionary origin of species as well as how species change and new species arise over time. Information technology and bioinformatics support biological researchers in many ways, including:

    - Detecting evolution by comparing DNA sequences and identifying sequence changes, rather than relying mainly on morphological change.
    - Comparing whole genomes, which allows the study of complex evolutionary events such as duplications and the exchange or acquisition of part of another species' genetic material (for example horizontal gene transfer, including transformation, transfection, transduction, symbiosis, genome recombination and gene transfer).
    - Building computer models to predict the development and outcome of populations over time.
    - Tracking and sharing information on a large number of species and individuals.
    - Building an overall picture of the phylogenetic tree.

    Image analysis

    Modern computing, together with large-scale automated experiments, produces very large numbers of images of very large size. In addition, information-rich images such as sample analyses, pathological tissue images and medical and clinical scans need to be analysed carefully at several levels. Storing these images is valuable when they have to be compared and cross-checked to extract information for diagnosis and treatment. Some examples of bioinformatics applications in image processing and analysis:

    - Quantitative analysis of features within an image, such as organelles, the size, shape and distribution of molecules, or tomographic scans of tissues and organs.
    - Identifying real-time patterns of air flow in animal lungs and of the transport of substances across cell membranes and tissues (drug delivery).
    - Estimating the size of particles and occlusions that form during surgery (real-time imagery) and during recovery from arterial injury.
    - Analysing infrared images to determine metabolic activity.
    - Analysing fluorescence images, for example in next-generation sequencing, fluorescent labelling techniques and real-time analysis.

    Protein function analysis

    Databases of MS data, sequences, structures, protein-protein interactions and protein docking are the foundation for analysing protein function. Sequence comparison and sequence alignment are very effective aids for detecting motifs, domains and patterns, and hence for identifying and analysing protein function. Protein families, and proteins that perform the same function, are also discovered on the basis of these comparisons.

    Protein interactions and metabolic pathways

    Studying the interactions between proteins and enzymes in biological processes has great practical value, for example in finding substrates for enzymes or identifying antigens and antibodies. Building models of protein interactions helps to determine the role of each participating factor and the mechanisms that regulate the expression of the genes involved in these networks. Disruption of or changes in these interactions lead to disease. Treatment based on an understanding of how the many factors are related can be very effective. This is also a direction on which biologists and bioinformaticians are currently concentrating.

    Modelling biological systems

    This is essentially the computer simulation of biological processes taking place in living systems (cells, tissues or whole organisms). It requires combining systems biology and mathematical biology. Examples include cellular systems, organelles, metabolites and the enzymes that form metabolic pathways, signal transduction pathways and gene regulation. All of these processes need to be analysed and visualized within the complex of components inside the cell or its organelles. In addition, bioinformatics and computational biology can simulate artificial life in relation to the evolution of organisms.

    Developing software and analysis tools (Software and tools)

    Algorithms and challenges in computer science

    Software and computer programs are built on many algorithms. Accuracy and processing speed depend on the algorithm and on the computer hardware. Developing new algorithms optimizes analyses, shortens analysis time, minimizes the use of computing resources and increases the reliability of analyses and simulations.

    Tools for searching for similar and homologous sequences:

    Homology: DNA sequences, or the traits under analysis, are homologous when they share a common origin, that is, an evolutionary relationship from a common ancestor. The degree of similarity between two or more sequences can be used to judge whether an apparent homology is real or due to chance.

    Tools in this group determine the similarity between a novel query sequence, whose structure and function are unknown, and all the sequences in known databases. The main tools in this group are FASTA, BLAST and their variants (see later chapters).

    Protein function analysis:

    - Function analysis
    This means determining the function of, and mapping, the functional components of a genome, including the coding and non-coding parts of genes. It relies on computer programs and tools that compare a query protein sequence with secondary protein databases containing information on motifs and domains. The search returns a list of similar proteins, from which the function of the unknown protein can be predicted.

    - Structure analysis
    This allows unknown structures to be compared with databases of known structures. The function of a protein can be determined more accurately by comparing its structure than by comparing its amino acid sequence alone, because similar structures usually go together with similar function. Determining the 2D/3D structure of a protein is therefore extremely important for studying its function. This work goes hand in hand with protein purification and crystallization, combined with methods of crystal analysis.

    - Sequence analysis
    Tools in this group allow deeper analysis of an unknown sequence, including evolutionary analysis, identification of mutations, hydropathy regions, CpG islands and compositional biases in base usage. The results support research into the function of the unknown sequence.
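    Two of the quantities mentioned above, base composition and CpG enrichment, are simple to compute. The sketch below is an illustration under stated assumptions rather than a CpG island finder: it computes the GC fraction of a sequence and the observed/expected CpG ratio using the commonly quoted formula (number of CG dinucleotides × sequence length) / (number of C × number of G). Real tools apply such measures in sliding windows with length and threshold criteria that are not covered here. The function names are hypothetical.

```python
def gc_content(sequence: str) -> float:
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_observed_expected(sequence: str) -> float:
    """Observed/expected CpG ratio: (#CG * length) / (#C * #G)."""
    seq = sequence.upper()
    c_count = seq.count("C")
    g_count = seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0  # ratio is undefined without both bases; report 0.0 by convention
    cpg_count = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    return cpg_count * len(seq) / (c_count * g_count)


# Invented example sequence.
fragment = "CGCGCGAT"
print(gc_content(fragment))             # 0.75
print(cpg_observed_expected(fragment))  # 3 * 8 / (3 * 3), about 2.67
```

    A ratio well above what is typical for the surrounding genome, together with high GC content, is the kind of compositional signal that CpG island detection looks for.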

    1.4. Tasks and research directions of bioinformatics

    In the early stage of the genomics revolution, bioinformatics concentrated on gathering and storing biological information and data to form database repositories (mainly of amino acid and nucleotide sequences). This involved designing networks of linked databases and developing web interfaces through which researchers could both access the databases and submit new sequences and data, or revised and supplemented records. Scientists' need to search and analyse data (data mining) then led to the development of search tools combined with data comparison. Using FASTA and BLAST, sequence alignment, genome assembly, gene finding, analysing the domains of a protein and determining their structure have become routine daily operations for researchers. More advanced and complex applications are becoming common in strong research laboratories, such as determining the position and role of genes on chromosomes (positional cloning), comparing the three-dimensional structures of proteins, predicting protein structure and protein-protein interactions, pattern recognition, and gene expression profile prediction.

    From the results of research on gene roles and gene interactions, scientists can compare the activity of normal cells with that of diseased cells. Doing so requires combining and cross-referencing biological databases to form an overall picture that captures the relationships between these activities through the study of metabolic pathways (metabolomics). This is also one of the major challenges facing bioinformaticians.

    Figure 2. The relationship between transcriptomics, proteomics and metabolic pathways (metabolomics) (Goodacre (2005) J Exp Bot 56: 245)

    A still more advanced direction is to build models of the interactions between metabolic models. On that basis, gene expression patterns and the interactions between genes and groups of genes can be clarified. These results will contribute to controlling gene activity and developing effective therapies.

    Figure 3. The network of genes associated with human diseases (The human disease network. PNAS, vol. 104, no. 21, 8685-8690)

    Research also develops new algorithms, software and analysis tools (software and tools), for example to help determine the presence and position of genes in a DNA sequence or on a chromosome, to predict the structure and function of proteins, or to analyse and group protein sequences into families of related sequences.

    The main bioinformatics tools

    BLAST

    BLAST stands for Basic Local Alignment Search Tool. It is a group of tools for comparing DNA and protein sequences with other sequences in databases. Several variants of BLAST exist, such as PSI-BLAST, PHI-BLAST and DELTA-BLAST. There are also specialized BLAST tools for the genomes of humans, microorganisms, malaria parasites and other organisms. Further tools help to detect sequences contaminated with vector sequence (especially important when submitting to a gene bank), immunoglobulin sequences and conserved sequences.

    FASTA

    FASTA is a database search tool used to compare a nucleotide or amino acid sequence with a sequence database. The program is based on the fast sequence search algorithm of Lipman and Pearson. It was also the first algorithm used to search databases for similar sequences.
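    Fast database search programs gain their speed by first looking for short exact word matches before attempting any detailed alignment. The sketch below illustrates only that first word-lookup step, in simplified form: it indexes every k-letter word of a subject sequence, looks up the words of the query, and counts hits per diagonal (subject offset minus query offset). It is an illustration of the general idea, not the FASTA program itself, which goes on to rescore and join the best regions. All names and sequences are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Tuple


def index_words(sequence: str, k: int) -> Dict[str, List[int]]:
    """Map each k-letter word to the list of positions where it starts."""
    index: Dict[str, List[int]] = defaultdict(list)
    for start in range(len(sequence) - k + 1):
        index[sequence[start:start + k]].append(start)
    return index


def diagonal_hits(query: str, subject: str, k: int = 3) -> Counter:
    """Count shared k-letter words per diagonal (subject start minus query start)."""
    subject_index = index_words(subject, k)
    hits: Counter = Counter()
    for q_start in range(len(query) - k + 1):
        word = query[q_start:q_start + k]
        for s_start in subject_index.get(word, ()):
            hits[s_start - q_start] += 1
    return hits


def best_diagonal(query: str, subject: str, k: int = 3) -> Optional[Tuple[int, int]]:
    """Return (diagonal, hit count) for the diagonal with most word hits, or None."""
    hits = diagonal_hits(query, subject, k)
    if not hits:
        return None
    return hits.most_common(1)[0]


# Invented example: the query occurs in the subject starting at offset 2.
print(best_diagonal("GATTACA", "CCGATTACAGG"))  # (2, 5)
```

    A diagonal with many word hits marks a region where the two sequences are likely to align well, so the expensive alignment step can be restricted to that region.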

    EMBOSS

    EMBOSS (European Molecular Biology Open Software Suite) is a free, open-source collection of analysis software for molecular biology. It contains more than 100 applications for comparing sequences, searching databases for sequences, searching for patterns, finding domains and motifs in proteins by comparing amino acid sequences, comparing nucleotide sequences to detect patterns, codon bias analysis, and more. A list of the applications can be found at:

    http://www.hgmp.mrc.ac.uk/Software/EMBOSS/Apps/
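    As an illustration of what codon bias analysis starts from, the following sketch counts how often each codon occurs in a coding sequence. It is not an EMBOSS program and does not reproduce any EMBOSS output format: it assumes the sequence is already in the correct reading frame, ignores any trailing bases that do not form a full codon, and reports raw counts and frequencies only. A full codon bias analysis would go further and compare synonymous codons for each amino acid. The function names are hypothetical.

```python
from collections import Counter
from typing import Dict


def codon_counts(coding_sequence: str) -> Counter:
    """Count codons in frame 1; RNA input is accepted by mapping U to T."""
    seq = coding_sequence.upper().replace("U", "T")
    usable_length = len(seq) - len(seq) % 3  # drop an incomplete trailing codon
    return Counter(seq[i:i + 3] for i in range(0, usable_length, 3))


def codon_frequencies(coding_sequence: str) -> Dict[str, float]:
    """Relative frequency of each codon among all codons counted."""
    counts = codon_counts(coding_sequence)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {codon: n / total for codon, n in counts.items()}


# Invented example: alanine is encoded twice by GCT and once by GCC.
cds = "ATGGCTGCTGCCTAA"
print(codon_counts(cds))
print(codon_frequencies(cds)["GCT"])  # 0.4
```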

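    ClustalW

    ClustalW is a program for comparing (aligning) DNA and protein sequences. Its purpose is to find the regions where the sequences are the same and where they differ. On that basis it supports many other applications, such as the analysis of domains, motifs and patterns and the reconstruction of evolutionary relationships.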
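    Once sequences have been aligned, the regions that are the same across all of them can be read off column by column. The sketch below assumes the multiple alignment has already been produced by some alignment program and is given as equal-length strings with "-" for gaps. It does not perform the alignment itself, and it marks only fully identical columns with "*", in the spirit of the conservation line printed under Clustal-style alignments. The function name and sequences are hypothetical.

```python
from typing import Sequence


def conserved_columns(aligned: Sequence[str], gap: str = "-") -> str:
    """Return a marker line: '*' where all sequences agree (no gaps), else ' '."""
    if not aligned:
        return ""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    marks = []
    for column in zip(*aligned):
        residues = set(column)
        fully_conserved = len(residues) == 1 and gap not in residues
        marks.append("*" if fully_conserved else " ")
    return "".join(marks)


# Invented example alignment of three short DNA sequences.
alignment = ["ATGCA", "ATGAA", "AT-CA"]
for row in alignment:
    print(row)
print(conserved_columns(alignment))
```

    Runs of "*" in the marker line point to candidate conserved regions, which are the natural starting point for the motif and domain analyses mentioned above.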

    RasMol

    RasMol is a very effective research tool for displaying the structure of DNA, proteins and small molecules. Protein Explorer is an easier-to-use derivative of RasMol.

    Programming languages and projects for bioinformatics

    - Java: because Java is platform independent, it is an important component of bioinformatics (BioJava).
    - Perl: used to process biological data (BioPerl).
    - BioXML: part of the BioPerl project, a collection of documents in XML and DTD form.

    Building databases of literature and journals for research

    - Articles and journals (PubMed);
    - Classification systems and taxonomic keys (Taxon);
    - Books (Book);
    - Articles, journals and documents on biochemical assays (PubChem BioAssay);
    - Documents on chemical compounds (PubChem Compounds);
    - Documents on chemical substances (PubChem Substances);
    - Databases for genomics, proteomics, metabolomics, microarray gene expression and phylogenetics.

    The information held in biological databases includes the gene name, the gene sequence, the position of the gene on the chromosome or genome (locus tag), the structure and function of genes, the consequences of mutations in the gene, related genes (gene families) and their structures (for proteins, RNA and so on).

    The data include gene sequences, descriptions of gene features (genes encoding mRNA, tRNA, rRNA), taxonomic terms (the origin of the gene, the organism that carries it), citations (articles related to the gene or protein) and data tables (if any).

    Database formats

    Biological data come in several forms: text, sequence data, protein structures and links.

    - Text: PubMed and OMIM.
    - Sequence: GenBank (DNA) and UniProt (protein).
    - Structure: PDB, SCOP and CATH.

    Issues concerning protein databases

    Protein structure databases are usually much harder and slower to develop than DNA sequence databases, because the three-dimensional structure of a protein is very difficult to determine. To determine the 3D structure of a protein, the protein must first be isolated or purified in large quantity, then suitable crystallization conditions must be found, and finally a structure determination technique is applied, such as X-ray crystallography, NMR spectroscopy, circular dichroism (CD) or electron microscopy. Structural data are deposited in and accessed through the member databases of wwPDB (PDBe, PDBj and RCSB PDB), and through SCOP and CATH.

    Species-specific databases

    A number of species-specific databases have been published, mainly for research use, for example Colibase (a database for E. coli). Others include FlyBase for Drosophila and WormBase for nematodes (Caenorhabditis elegans and Caenorhabditis briggsae). There are also databases for rice (Oryza sativa), Arabidopsis and other organisms.

    1.5. Trends in the development of bioinformatics

    Bioinformatics is developing along the following lines:

    - Development of algorithms and computation (Algorithms and computational challenges)
    - Analysis of protein function (Protein function)
    - Protein interactions and metabolic pathways (Protein interactions and pathways)
    - Applications in clinical practice and in the search for new drugs, and prediction of risk.

    Current trends in bioinformatics

    - Algorithms: 27%
    - Machine learning: 21%
    - Statistics: 18%
    - Biology: 10%
    - Databases: 10%
    - Other directions: 14%


    Current research topics:

    - Methods: 26%
    - Sequence analysis (motifs, domains) and sequence comparison: 25%
    - Protein structure modelling: 19%
    - Modelling of gene structure and gene regulation: 12%
    - Sequence analysis related to evolution: 12%
    - Modelling and construction of metabolic networks (metabolome): 6%

    Skills and human factors needed to develop bioinformatics:

    - A broad and deep understanding of both fields: biology and informatics
    - A grasp of the questions that matter in both fields
    - A command of computer science and software: advising on and developing algorithms

    To some extent, bioinformatics can be described as a field that is interesting, attractive, new, challenging and accessible, with room to expand research, broad impact, and opportunities for computing specialists.

    Topics still to be explored:

    - Database techniques for bioinformatics data
    - Molecular genetics (the foundation lies mainly in biology)
    - Sequence comparison, patterns, profiles
    - Pattern discovery
    - Gene expression arrays
    - Determination of protein structure (the foundation lies mainly in biology)
    - Spatial geometry (stereochemistry) of proteins (computing techniques and tools)
    - Protein structure prediction
    - Construction of biochemical networks, metabolome (the foundation lies mainly in biology)
    - Construction of metabolic pathways, regulatory pathways and gene regulation signals: databases, computing techniques and tools


    Summary of chapter 1

    Bioinformatics is a new scientific field that closely combines biology, chiefly genetics and molecular biology, with the tools of statistics, mathematics and computer science. Chapter 1 introduced the concept and role of bioinformatics, together with the tools that serve research questions in modern molecular biology, such as searching databases for homologous or similar biological sequences, simulating and predicting interactions between molecules, and detecting gene expression patterns and relationships between genes. The main topics of bioinformatics and the trends in the field were also covered, to give students an overview of an applied science that supports researchers in molecular genetics, molecular biology, medicine and related fields.

    Review questions for chapter 1

    1. Explain the concept of bioinformatics.
    2. Summarize the role of bioinformatics in biological research.
    3. What is a biological sequence? Give a few examples of biological sequence analysis.
    4. What is sequence comparison? What is its purpose?
    5. Why study the structure of macromolecules? How does bioinformatics support the prediction of molecular structure?
    6. What role does knowledge of gene function and of the relationships between genes play in modern medicine?
    7. What is meant by the evolutionary relationship between organisms? How does bioinformatics support research on evolution?
    8. State the tasks and research directions of bioinformatics today.
    9. State the topics on which bioinformaticians are currently concentrating.
    10. What do we need in order to become researchers in bioinformatics?


    CHAPTER 2

    BIOLOGICAL FOUNDATIONS OF BIOINFORMATICS

    2.1. Nucleic acids and proteins

    Nucleic acids and proteins are two biological macromolecules that play important roles in the living world. Deoxyribonucleic acid (DNA) carries genetic information, while ribonucleic acid (RNA) is involved in protein biosynthesis and takes part in regulating the activity of the cell. The building blocks of nucleic acids are nucleotides, and those of proteins are amino acids.

    2.2. Structure of nucleic acids

    DNA and RNA are built from monomers: deoxyribonucleotides and ribonucleotides, respectively. In DNA, each nucleotide consists of a phosphate group, a pentose sugar and a base. Nucleotides are joined by phosphodiester bonds between the 5' phosphate group on the pentose of one nucleotide and the 3' hydroxyl group on the pentose of the next. A nucleic acid strand therefore always has a 5' phosphate end and a 3' hydroxyl end. By convention, a nucleic acid sequence is always written in the 5' to 3' direction, from left to right.

    Figure 4. Structure of DNA

    Nucleic acids are built from five different bases: cytosine (C), uracil (U), thymine (T), adenine (A) and guanine (G). However, U is found in RNA, whereas T is found in DNA. DNA and RNA differ not only in base composition but also in their sugar: RNA contains ribose, while DNA contains 2'-deoxyribose. A DNA molecule consists of two polynucleotide chains wound around each other in antiparallel orientation. DNA can exist in single-stranded (ssDNA) and double-stranded (dsDNA) forms. In double-stranded DNA, the two strands are held together by hydrogen bonds between the bases: two hydrogen bonds between A and T, and three between C and G. The two strands are complementary, so if the sequence of one strand is known, the sequence of the other can be deduced.
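The complementarity rule described above can be illustrated with a few lines of code. What follows is only a minimal Python sketch, not part of the original lecture; the function name `reverse_complement` is a hypothetical choice, and the sketch assumes the input contains only the four unambiguous DNA letters.

```python
# Minimal sketch (illustrative assumption, not from the lecture notes):
# derive the complementary strand of a DNA sequence.
# Pairing rules: A pairs with T, C pairs with G.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the complementary strand, written 5' to 3'.

    The two strands are antiparallel, so after complementing each base
    the result is reversed to keep the 5' to 3' writing convention.
    """
    return seq.translate(COMPLEMENT)[::-1]


print(reverse_complement("ATGCGT"))  # prints ACGCAT
```

Reversing the complemented string reflects the convention that both strands are written 5' to 3'; applying the function twice returns the original sequence.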

    Storage of genetic information

    The sequence of bases carries the information that encodes proteins. Proteins are built from 20 amino acids, and each amino acid is encoded by a triplet of three nucleotides on the DNA molecule. Each such triplet is called a codon. Organisms tend to differ in which codons they use preferentially. For example, some prokaryotic species use codons differently from eukaryotes. The genetic code of the mitochondrial genome also differs in a few respects from that of the nuclear genome.

    Figure 4. The genetic code
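To make the idea of codons concrete, the sketch below translates a coding DNA sequence into a protein using the standard genetic code. It is an illustrative Python example rather than anything taken from the lecture; the names `CODON_TABLE`, `transcribe` and `translate` are hypothetical, and the table assumes the standard nuclear code (the vertebrate mitochondrial code, for instance, reads TGA as tryptophan rather than as a stop signal).

```python
from itertools import product

# Standard genetic code, written in the conventional T, C, A, G order.
# "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_strand: str) -> str:
    """Return the mRNA corresponding to a DNA coding strand (T becomes U)."""
    return coding_strand.upper().replace("T", "U")


def translate(coding_strand: str) -> str:
    """Translate a DNA coding sequence codon by codon until a stop codon."""
    dna = coding_strand.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        # Unknown codons (for example those containing N) become "X".
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(transcribe("ATGGCCTAA"))  # prints AUGGCCUAA
print(translate("ATGGCCTAA"))   # prints MA
```

Swapping in a different codon table would be the natural way to handle organellar genomes; that variation is left out here to keep the sketch short.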

    The relationship between DNA, RNA and protein is described by the central dogma (Crick 1970).

    Figure 5. The central dogma

    All the genetic information contained in the nucleus or nucleoid of an organism is called its genome. Except in RNA viruses such as retroviruses, whose genome is RNA, genetic information is stored in the nucleotide sequence of DNA. Apart from reverse transcription from RNA to DNA in some RNA viruses, information flows in one direction, from genome to transcriptome to proteome, through transcription and translation. All the RNA transcripts of an organism (mRNA, tRNA, rRNA and other non-coding RNAs) make up its transcriptome. All the proteins that can be translated from its mRNAs make up its proteome. The amino acid sequence of a protein is thus determined by the DNA sequence, and information passes from DNA to protein by way of mRNA.

    Eukaryotic and prokaryotic genomes differ in many respects. In prokaryotes, genetic information is encoded on a continuous stretch of DNA, whereas in eukaryotes the coding sequences (exons) are separated by non-coding sequences called introns. In addition, in eukaryotes the production of mature mRNA from DNA is far more complex; for example, introns are removed during mRNA splicing. Because splicing can join exons in different ways (alternative splicing), a single gene can give rise to several mRNAs and hence to several proteins. This explains why the genomes of higher organisms contain a limited number of genes, about 25,000 in humans, while the number of proteins actually produced is much larger, estimated at about 1 million.


    Figure 6. Structure of the gene region in prokaryotes and eukaryotes

    Structure of protein molecules

    Primary structure

    Proteins are biological macromolecules built from about 20 kinds of amino acid. Under suitable conditions, a protein folds into a three-dimensional structure that carries its full biological properties and function. The amino acid residues in the polypeptide chain determine chemical properties of the protein such as hydrophobicity, polarity, acidity and basicity. The primary structure of a protein, also called first-order structure, is the order of the amino acids in the polypeptide chain. The primary structure determines the spatial structures of the protein.

    In a protein, amino acids are linked together into a polypeptide chain. Each amino acid is joined to the next by an amide (peptide) bond between its carboxyl group and the amino group of the following amino acid. A polypeptide chain therefore has two ends, the N-terminus and the C-terminus. By convention, the N-terminus is written on the left and the C-terminus on the right.


    Figure 7. Amino acids in a protein molecule

    Secondary structure

    The term secondary structure refers to local spatial regions of the polypeptide chain. Secondary structure involves alpha helices (α-helix), beta strands (β-strand) and loops. These structures arise from the geometric properties of the amino acid residues. In the 1930s and 1940s, Linus Pauling and Robert Corey described the peptide bond as a rigid, planar structure that does not rotate. A polypeptide chain can therefore be viewed as a series of linked units, each lying in a plane. Alpha helices, beta sheets and loops together make up the secondary structure. Alpha helices and beta sheets are stabilized by hydrogen bonds. Beta sheets occur in two forms, parallel and antiparallel (Figure 8).


    Figure 8. Secondary structure of a protein molecule. Alpha helices and beta sheets are shown. Disulfide bridges stabilize the tertiary structure; the regions involved in catalytic activity are shown in yellow.

    Tertiary and quaternary structure

    Tertiary structure forms when the secondary-structure elements are further arranged and folded. Polypeptides longer than about 200 amino acids usually fold into several structural units called domains. Quaternary structure is the level above tertiary structure. Proteins with quaternary structure are usually formed from several polypeptide chains (subunits).

    In quaternary structure, the interactions between amino acids include hydrogen bonds between the peptide chains, disulfide bridges between cysteine residues, ionic bonds between charged groups of the side chains, and hydrophobic interactions.

    2.3. Genomes and genome research

    Genome

    The genome contains all the genetic information of an organism. Genetic information is encoded in DNA or RNA. Take the human genome as an example: if the genome is thought of as a book, the book is divided into 23 chapters (corresponding to the 23 pairs of chromosomes). Each chapter contains 48 to 250 million letters (A, C, G, T). The whole book has more than 3.2 billion letters and sits in the nucleus of the cell.

    The first genome sequencing project was completed in 1977 by Fred Sanger. He and his colleagues sequenced phage φX174, which contains 5,386 bases. The first bacterial genome to be sequenced was that of Haemophilus influenzae, in 1995. After that, the first eukaryotic genome to be sequenced was that of the yeast Saccharomyces cerevisiae. Today, with the rapid development of technology (Illumina/Solexa, 454 pyrosequencing, Ion Torrent, SOLiD sequencing...), the number of species with sequenced genomes is growing quickly.

    Genome research (genomic research)

    Genome research is not simply a matter of cataloguing the genomes that have been sequenced, or of listing the number of genes in a genome and the corresponding traits. It also covers the comparison of genome size, chromosome number (karyotype), gene order, codon usage frequency, GC content and genome evolution. In addition, it includes comparing multiple genomes to detect conserved regions and the changes that have taken place within genomes. The results of genome research are usually displayed graphically in genome browsers.
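Two of the quantities mentioned above, GC content and codon usage frequency, are simple to compute from a sequence. The following Python sketch is illustrative only; the function names `gc_content` and `codon_usage` are hypothetical, and the code assumes an in-frame coding sequence for the codon count.

```python
from collections import Counter


def gc_content(seq: str) -> float:
    """Return the fraction of G and C bases in a sequence (0.0 if empty)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def codon_usage(cds: str) -> Counter:
    """Count how often each codon occurs in an in-frame coding sequence."""
    cds = cds.upper()
    # Step through the sequence three bases at a time; a trailing
    # incomplete codon is ignored.
    return Counter(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))


print(gc_content("ATGC"))                   # prints 0.5
print(codon_usage("ATGGCCGCCTAA")["GCC"])  # prints 2
```

Comparing such counts between genomes is one simple way to see the differences in codon preference that genome research describes.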

    Genomics is a discipline closely tied to genetics. It concerns the study of the genomes of organisms, including determining the DNA sequence of the whole genome and building high-resolution genetic maps (in which the markers are very close together). Genomics also studies phenomena that occur within the genome, such as heterosis (hybrid vigour), epistasis (one gene masking or modifying the effect of another), pleiotropy (one gene affecting several traits), and interactions between loci and alleles within the genome. Unlike studies of the role and function of individual genes, genomics studies the overall relationships among the components of the genome.

    Genome duplication plays a major role in the formation of new species. Duplication can range from a small scale (short tandem repeats) to whole genes or gene clusters, whole chromosomes, or even the entire genome. These events are the basis for new genetic characteristics and hence for evolution. Horizontal gene transfer is important in explaining the similarity between small parts of the genomes of two organisms that do not share a close evolutionary origin. Such gene exchange is relatively common among microorganisms; antibiotic resistance in microorganisms is a typical example. The transfer of genetic material from the mitochondrial and chloroplast genomes into the chromosomes of eukaryotic cells is another example of this phenomenon.

    The human genome

    In 2001, the first draft of the human genome was published. In 2007, the human genome sequence was declared finished, with a very low error rate (about 1 in 20,000 bases). Assembled versions of the human genome sequence can be accessed through the UCSC Genome Browser and Ensembl.

    Genome research on viruses (bacteriophages)

    Bacteriophages play an important role in research on bacterial genetics and molecular biology. Historically, they were used to work out gene structure and to study the mechanisms and models of gene regulation. Because their genomes are small and generally lack introns, bacteriophages were chosen as the first genomes to be sequenced. However, research on bacteriophages did not set off the genome revolution (that revolution began with the sequencing of bacteria). Bacteriophage genome sequences are usually determined by direct sequencing. Analysis of bacterial genomes shows that a substantial part of bacterial DNA consists of prophage and prophage-like sequences. Mining the information in bacteriophage databases therefore helps to explain the role of prophages in shaping bacterial genomes.

    Genome research on cyanobacteria (Cyanobacteria genomics)

    At present, 24 cyanobacteria have been sequenced, 15 of which were isolated from the sea. They include six strains of the genus Prochlorococcus, seven strains of marine Synechococcus, Trichodesmium erythraeum IMS101 and Crocosphaera watsonii WH8501. Several studies indicate that these sequences can be very useful for inferring the physiological and ecological properties of marine cyanobacteria. Many more genome sequencing projects are under way, among them further isolates of Prochlorococcus and marine Synechococcus, Acaryochloris and Prochloron, the filamentous nitrogen-fixing cyanobacteria Nodularia spumigena, Lyngbya aestuarii and Lyngbya majuscula, as well as bacteriophages that affect marine cyanobacteria. Genome research thus plays an important role in explaining the evolutionary origin of organisms and biological processes such as photosynthesis.

    Relationship between C-value and gene number:

    The C-value is the DNA content of an organism's genome. This value varies enormously between species. There is no clear relationship between the C-value and the number of genes in an organism. The more complex the genome, the larger the proportion of non-coding DNA, which does not encode protein. In humans, non-coding DNA accounts for nearly 75% of the genome. The C-value paradox refers to this lack of proportionality between genome size and gene number.

    2.4. Gene discovery and determination of gene function in the genome

    Figure 10. Organization of the human genome

    When a genome sequencing project ends, the result is a set of sequences arranged along the chromosomes. The next problem is to decode the information contained in those sequences. Decoding essentially means answering questions such as: (i) how many genes the genome of the organism contains, (ii) where the genes are located on the chromosomes, (iii) what the functions of the genes are, and (iv) how gene activity is regulated and how genes relate to one another in producing a phenotype or a disease... Answering these questions takes a great deal of time and effort, and in some cases no answer can yet be found. There are many approaches to decoding a genome, and bioinformatics tools play a very large part in them. For example, to determine the number of genes, one must rely on characteristic features of genes, including coding sequences or open reading frames, promoter sequences, the junction sequences between exons and introns, and the sequences that control gene activity (the 5' UTR and 3' UTR regions)... Comparing genomes and comparing DNA sequences are the important first steps in detecting genes and predicting their function.
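Among the gene features listed above, the open reading frame is the easiest to search for computationally. The Python sketch below is a deliberately simplified illustration, not a gene finder: the function name `find_orfs` and the `min_codons` parameter are hypothetical, only the three forward reading frames are scanned, and real gene prediction also has to consider the reverse strand, introns and promoter signals.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 2) -> list[tuple[int, int]]:
    """Return (start, end) positions of ORFs on the forward strand.

    An ORF here runs from an ATG to the first in-frame stop codon.
    Positions are 0-based and `end` is exclusive, so seq[start:end]
    includes the stop codon. `min_codons` counts the start and stop
    codons as well.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ORF in this frame
    return orfs


sequence = "CCATGAAATAGGG"
for start, end in find_orfs(sequence):
    print(start, end, sequence[start:end])  # prints 2 11 ATGAAATAG
```

In practice a much larger `min_codons` threshold would be used, since short ORFs occur frequently by chance.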

    Building a physical map, based on gene order and on what is already known about the genes, is also a first step in genome research. This information is displayed graphically in genome browsers. Determining gene function is regarded as one of the major challenges for genome researchers. Although more and more information is being published on the sequence, structure and biological function of genes and other biological sequences, predicting gene function is usually very complex. There are many approaches to this problem: one can start from the genome, from the gene product (protein), or from the phenotype. Suppose one wants to know which gene encodes a trait such as plant height, resistance to pests and diseases, flower colour, or the protein content of milk. If the trait is controlled by a single gene, the task is relatively simple. If the trait is controlled by many genes (a quantitative trait), however, the task becomes extremely complex. The problem is how to identify which gene or genes, and where in the genome (on which chromosome), directly encode the trait or take part in forming it. In addition, what are the activity model, the mechanism and the expression conditions of those genes?

    In practice, whatever method or approach is used, one must in the end confirm that the gene really does take part in forming the trait. This verification is a very hard problem, especially for quantitatively inherited traits in higher organisms, because knock-out, knock-down and RNAi-based gene silencing techniques cannot always be applied, or applied successfully. Another approach to determining gene function is the microarray technique, which detects the appearance of mRNAs or changes in their expression level under given conditions, and so helps to identify genes and study their function. Comparative studies of genomes, sequences and structures (data mining and analysis) are also a trend, and are the first step to take now that databases hold ever more information on biological sequences. However, the accuracy and reliability of the results depend heavily on the algorithms and on how rich the information in the databases is.

    Number of genes in organisms

    In humans, the genome was initially predicted to contain about 50,000 to 100,000 genes. More recently, the number of genes has been put at just over 20,000. Mouse and fruit fly also have a similar number of genes. Nematodes have about 13,000 and rice about 46,000. In humans, protein-coding sequence makes up roughly 1 to 2% of the genome.

    Gene structure

    Figure 11. Diagram of the structure of a prokaryotic gene

    In prokaryotes, by convention the 5' end of a gene is placed on the left and the 3' end on the right. The structure of a typical gene is illustrated below.

    Figure 12. Diagram of the structure of the promoter region in prokaryotes

    Figure 13. Gene structure in eukaryotes (top) and the promoter region (bottom)

    2.5. Gene function and regulation of gene activity

    Gene function is a complex process that involves many components of the cell. In prokaryotes, gene function and its regulation are relatively simple. In eukaryotes, however, gene regulation is extremely complex and involves many processes: chromosome structure and the associated epigenetic mechanisms (methylation, acetylation, phosphorylation), initiation of transcription, transcription, post-transcriptional modification, translation, post-translational modification and targeted transport. If studying the activity of a single gene is complex, the regulation of a metabolic pathway is far more so, because it involves many genes and the interactions of many other proteins and enzymes in the cell. Studying gene function therefore requires comparison and cross-checking against many databases and many different genomes.

    Figure 14. Processes that regulate gene activity in eukaryotes

    2.6. The proteome and the field of protein research (proteomics)

    The proteome is the full set of proteins expressed by a genome, cell, tissue or organism at a given time or under given conditions. In terms of diversity, the proteome is much larger than the genome, especially in eukaryotes. In other words, the number of proteins is much larger than the number of genes in the genome. The reasons are the splicing and editing of pre-mRNA and post-translational modifications such as phosphorylation and glycosylation. Whereas genome data consist mainly of DNA and RNA sequences, proteome data are more complex, because besides amino acid sequences they also include data on structure, function and interactions between proteins.

    Proteome research involves many complex techniques, such as protein extraction and purification, protein analysis by two-dimensional electrophoresis, mass spectrometry, comparison of similarity between peptide fragments, comparison of amino acid sequences... Proteomics has two important components: the study of structure and the study of function. Information on amino acid sequence, structure and function helps researchers to explain the nature of biological processes and the mechanisms of disorders and diseases, and to identify new proteins and predict their function.

    2.7. Evolution and the molecular nature of the evolution of organisms

    Mutation and the accumulation of mutations

    Although the mechanisms and causes of evolution are still debated, in the modern view mutation is regarded as the raw material of evolution, because it is the route by which new alleles arise and by which regions with regulatory function are altered or created. Mutations can have serious consequences, but there are also neutral mutations and mutations that do not affect the phenotype (for example, mutations in non-coding DNA).

    Most mutations in structural genes affect the protein product, or lead to diversity of protein products through the splicing and joining of exons in mRNA. Changes in the structure and function of molecules show up as variant forms of individuals in a population. Over the course of evolutionary events, they may ultimately lead to divergence and the formation of new species. The question here is why small changes in genes caused by mutation, especially point mutations, lead to differences between one species and another. To answer it, both space and time must be considered. Space here refers to the random selection acting on the mutated individuals. Time is the consequence of a long process of natural selection. Space and time are closely linked: if selection pressure is very strong, a new species may form, or extinction may follow, within a short time.

    Gene and genome duplication (gene/genome duplication)

    If a gene is duplicated or exists in several copies, a mutation in one copy may have no effect on the life of the cell. Gene duplication in a diploid organism creates an additional pair of genes, so one pair continues to function normally while the other may be altered or persist in various combinations. What, then, is the benefit of gene duplication? Over time, one copy may acquire a new function, providing a basis for adaptation during evolution. Even when the two copies persist as paralogs, that is, with similar sequence and function, the existence of the copies is a form of redundancy (gene redundancy). This explains why, in some cases, a mouse or yeast with one gene knocked out shows no effect, or only a mild effect, on the phenotype. The loss of the knocked-out gene may be compensated by a corresponding paralog.

    After a gene has been duplicated, one copy may be altered or lost over the course of evolution. Changes that occur in many genes and at many positions in the genome lead to barriers (post-zygotic isolating mechanisms) to mating and reproduction between populations. These barriers can gradually bring about speciation.

    Mutations in regulatory regions

    Although the set of genes can be said to be the same in all cells, not all genes are expressed equally in every cell. The difference depends on the cell type, the interaction of extracellular signals, transcription factors...

    There is much evidence that mutations in regulatory regions play an important role in evolution. For example, humans have a gene (LCT) that encodes lactase, the enzyme that breaks down lactose. In most people worldwide, this gene is active in childhood but inactive in adulthood. In Northern Europeans and three African tribes, however, the gene remains active, and milk is still part of their diet. The cause is a mutation in the regulatory region of the lactase gene that allows it to go on being expressed. Another example is the gene Prx1. This gene encodes a transcription factor that determines forelimb formation in mammals. When the enhancer region of Prx1 in the mouse was replaced by the corresponding enhancer from the bat (whose forelimbs are wings), the forelimbs were 6% longer than normal. Thus some changes in morphology are driven not by a change in the Prx1 protein but by a change in the expression level of the gene.

    2.8. Analysing the evolutionary relationships of organisms

    Evolution is a gradual change in the gene pool of a population over time. Although evolution in essence takes place at the population level, evolutionary relationships can be determined and analysed at many levels, such as populations, species, groups of individuals, cells, organelles and molecules. In applied bioinformatics, the analysis of evolutionary relationships relies mainly on the molecular level, that is, on molecular evolution. For example, sequences of DNA encoding ribosomal RNA, cytochrome c, ribulose-1,5-bisphosphate carboxylase/oxygenase (RuBisCO), mitochondrial genes... are now used to classify organisms and assign them to taxa. Of course, analysis at the molecular level alone is not sufficient and needs to be combined with the results of other studies.

    Analogous

    Put simply, analogous features are similar characteristics observed in two or more species that are not related through common ancestry in respect of those features. Such similar biological features are usually the result of convergent evolution, a kind of evolution in which changes in certain characteristics arise only as adaptations to particular conditions. For example, the wings of birds and bats are similar in structure and suited to flight, but they differ in origin.

    Homologous

    Homologous traits (homology) share a common evolutionary origin. A homologous trait can be:

    - Homoplasious: evolved separately, but with a common ancestor
    - Plesiomorphic: shares a common ancestor, but over the course of evolution the trait has been lost in some descendants
    - (Syn)apomorphic: shares a common ancestor and is present in all of its descendants

    Ortholog

    Homologous sequences are considered orthologous when they were separated by a speciation event, while still sharing the same most recent common ancestor. When a species diverges into two separate species, the copies of a single gene in the two resulting species are called orthologs. Orthologous genes are genes in different species that are similar because they descend directly from a single ancestral gene. For example, the regulatory protein Flu is present in both Arabidopsis (a higher multicellular plant) and Chlamydomonas (a unicellular green alga). In Chlamydomonas the protein is more complex: it crosses the membrane twice, rather than once as in Arabidopsis. When the gene is transferred from the green alga into the plant genome by genetic engineering, it functions much as it does in its original cell. This result shows that the two genes are orthologous and were inherited from a common ancestor.

    To determine whether two similar genes are orthologous, it is enough to analyse the evolutionary origin of the genes. If the genes fall within the same clade, they are orthologs and descendants of a common ancestor. Orthologs usually have the same biological function.

    Paralogous

    Homologous sequences are called paralogous when they were separated by a gene duplication event. If a gene in an organism is duplicated and occupies two different positions in the same genome, the two copies are called paralogous (para means parallel) and may carry out similar functions. Paralogs often have the same or a similar function, but not always. The reason is relaxed selection pressure: selection acts on only one copy of the duplicated gene, leaving the other copy free to mutate, change and acquire a new function.

    Paralogous sequences provide much useful information within genomes. The genes encoding myoglobin and haemoglobin are regarded as ancient paralogs. The four groups of haemoglobin known to date (A, A2, B, F) are paralogs of one another. While each of these proteins performs the same function, oxygen transport, a slightly altered form such as haemoglobin F has a much higher affinity for oxygen than the adult haemoglobins. The function of paralogous genes is thus not necessarily conserved. Paralogous genes usually belong to the same species, but not always. For example, the human haemoglobin gene and the monkey myoglobin gene are paralogs. This is a problem often met in bioinformatics: when the genomes of different species are sequenced and compared, it is easy to conclude that genes are homologous, yet they may be paralogs whose functions have diverged.

    Ohnology

    Genes are called ohnologous when they originate from a whole-genome duplication. The term was introduced by Ken Wolfe in honour of Susumu Ohno. Ohnologs are of particular interest in evolutionary analysis because they have all been diverging for the same length of time since their common origin (the whole-genome duplication).

    Xenology

    Homologs that arise through horizontal gene transfer between two organisms are called xenologs. Most xenologs are similar in function.

    Gametology

    Gametology describes the relationship between homologous genes on non-homologous chromosomes (for example, the X and Y chromosomes in humans). Gametologs result from the origin of genetic sex determination and from barriers to recombination between the sex chromosomes.

    Summary of chapter 2

    1. Bioinformatics arose on the foundation of biology, and of molecular biology in particular. Molecular biology studies the structure and function of molecules and the life processes of cells, tissues, organs and organisms at the molecular level. In bioinformatics, molecular research focuses on determining the sequences of nucleic acids (DNA, RNA) and amino acids (protein), and on studying the structure, function and interactions of these molecules.

    2. Genetic information is stored in DNA and RNA and is expressed through transcription, translation and modification (post-transcriptional and post-translational). This is the content of the central dogma of molecular biology.

    3. With the rapid development of techniques, sequencing genes and genomes has become routine work in laboratories. Once a genome has been sequenced, describing the DNA sequences and attaching biological information to them is a task for both biologists and bioinformaticians. Biological findings on the composition and structure of genes in prokaryotes and eukaryotes form the basis for building algorithms and computer simulation models.

    4. Studies of the relationship between the sequence and structure of nucleic acids and proteins, and between structure and biological function, provide the foundation for modelling, predicting and comparing structures, and for predicting function from sequence comparison.

    5. Mutations and changes in sequence, gene structure and genome during evolution form the basis for studying relationships between species, the origin of species, and the function of genes and genomes across species. By analysing and comparing biological sequences, one can determine genetic relationships, evolutionary origins and evolutionary trends at the level of individual genes, gene families, protein families and species.

    Review questions for chapter 2

    1. Describe the composition and structure of nucleic acids.
    2. What is the genetic code, and what are its characteristics?
    3. Explain the content of the central dogma.
    4. Explain the relationship between the structure and function of proteins.
    5. What is a genome? What is the significance of genome research?
    6. Describe the gene structure of prokaryotes and eukaryotes.
    7. What is regulation of gene activity?
    8. Why study the evolutionary relationships of organisms?

    CHAPTER 3

    SEARCHING FOR AND MANAGING RESEARCH LITERATURE

    3.1. Methods for searching for information

    The rapid growth of the Internet and of the number of web pages has created an enormous volume of information that increases every day. Finding the information one needs in this vast store requires search tools combined with a suitable method. Chapter 3 introduces some tools and methods for finding general information on the Internet for study and research.

    When looking for web pages that contain specific words or phrases, search engines such as Google return results quickly and very effectively. However, the results sometimes include a great deal of information that is not directly related to the topic or scope of the search, so that much time is spent filtering. For a search focused on a specific field or topic, subject directories such as the World Wide Web Virtual Library (http://vlib.org/) can be used to narrow the scope of the search. In practice, however, search engines deliver only about one third of the information that actually exists, because they cannot reach the remaining sources. This is mainly a matter of network security and access barriers, which search engines are not allowed to cross.

    There are two kinds of information search: searching with general search engines (such as Google), and searching for specific data according to the purpose or field of research. Whichever tool is used, a search involves the following steps: (i) choose the search tool or supporting website, (ii) define the information needed, (iii) build keywords that represent the content sought (use phrases rather than single words; in English, avoid articles and prefer nouns), (iv) combine logical (Boolean) operators such as and, or, not, or +, -, double quotation marks " " and the asterisk *, to filter and narrow the search results.
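Step (iv) above can be sketched in code. The following Python example is only an illustration of how Boolean operators and quoted phrases combine into a single query string; the function name `build_query` and its parameters are hypothetical, and the exact operator syntax accepted varies between search tools.

```python
def build_query(all_of=(), any_of=(), none_of=()):
    """Combine keywords into one Boolean query string.

    all_of:  terms that must all appear (joined with AND)
    any_of:  alternatives, at least one must appear (joined with OR)
    none_of: terms to exclude (each prefixed with NOT)
    Multi-word terms are wrapped in double quotes so that they are
    searched as exact phrases.
    """
    def quote(term):
        return f'"{term}"' if " " in term else term

    parts = [quote(term) for term in all_of]
    if any_of:
        parts.append("(" + " OR ".join(quote(term) for term in any_of) + ")")
    query = " AND ".join(parts)
    for term in none_of:
        query += f" NOT {quote(term)}"
    return query.strip()


print(build_query(
    all_of=["gene expression", "rice"],
    any_of=["drought", "salinity"],
    none_of=["review"],
))
# prints "gene expression" AND rice AND (drought OR salinity) NOT review
```

Grouping the alternatives in parentheses keeps the OR from being applied to the whole query, which is the usual reason a Boolean search returns far more results than expected.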

    3.2. Finding literature for research

    Google is currently regarded as the fastest and most effective search tool and is used by most people. For general information searches, and even for directory-based searches, Google remains the dominant tool. In some cases Google can reach secured websites and display matching information, but access to those sources is then blocked for reasons of network security. Even so, for a broad search Google is generally the first tool of choice.

    A search begins by defining the information needed and then building keywords. For researchers in biology, and in molecular biology in particular, information comes mainly from foreign-language literature, so proficiency in English is almost mandatory. Keywords are built by combining words, mainly nouns, into key phrases. Google usually returns a very large number of results, so the user has to filter them by lengthening the keywords, grouping keywords into phrases, combining them with logical (boolean) operators, or using the advanced search functions. However, Google only solves the problem of finding general, overview information. Finding specific information for a research purpose requires searching again within the results already obtained, which costs a great deal of time and effort.

    In biology, a large part of the material used for research and study consists of scientific articles published in specialized journals. Using information from these articles ensures that it is accurate and specific. PubMed, an NCBI database built on MEDLINE, lets users search a very large number of research results in biology and medicine, as full-text articles or as abstracts. Many other journals have recently been added to the PubMed catalogue, so the scope of a PubMed search is no longer limited to biomedicine but extends to fields such as chemistry, physics, materials technology and information technology. Full-text articles that can be downloaded free of charge can be found in the NCBI PMC database.

    PubMed search results are presented as articles with their related information. Figure xxx shows a typical PubMed search result. Each result gives the article title, the author or group of authors, the name of the journal, the issue number and the page numbers. PubMed also provides a link to the source of the article, where readers can access it either free of charge or with the permission of the site that hosts it.

    Figure 15. Searching for research literature in the PubMed database

    3.3. Getting to know PubMed

    PubMed is an open resource developed and maintained by NCBI, which is part of NIH. PubMed contains more than 20 million citations on biomedical topics from MEDLINE, life science journals and online books. It is a large database that brings together articles, abstracts, citations and links to other databases. The MEDLINE database originally contained journals and abstracts related to the life sciences and biomedical topics. The United States National Library of Medicine (NLM) at NIH maintains this database as part of its information management and storage system. PubMed was first released in January 1996.

    Counting from 1966 to the present, PubMed contains more than 22.7 million articles, some dating back as far as 1809. About 0.5 million new articles are added each year. Of the records in PubMed, about 13.1 million have abstracts and 14.2 million have links to full-text articles, of which 3.8 million can be downloaded free of charge.

    PubMed also applies logical operators during a search, but it does so automatically. The keywords entered are translated into the variant forms of each word and into terms commonly used together with them, and these are then combined with logical operators.

    Figure 16. PubMed database search results
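
    A PubMed search can also be written as a URL for the ESearch service of NCBI's E-utilities, which these notes do not cover. The sketch below only builds such a URL and sends no request. The function name esearch_url and the example query are assumptions for illustration; db, term and retmax are ESearch parameters, and [Title] is a PubMed field tag.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities ESearch service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(term, retmax=20):
    """Return an ESearch URL that searches PubMed for `term`.

    Nothing is downloaded here; the URL can be opened in a browser
    or fetched with any HTTP client.
    """
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)


# Boolean operators are written in capitals inside the search term.
print(esearch_url("HIV[Title] AND hepatitis", retmax=10))
```

    Writing the operators explicitly, as here, is an alternative to relying on the automatic term translation described above.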

    3.4. How to manage research literature

    Finding literature that fits the research purpose takes a great deal of time and effort. Even once the relevant articles have been found, the researcher still has to arrange and organize them so that they can be read, looked up and cited efficiently.

    There are many ways to manage article information and data. Among them, EndNote is a fairly effective tool that lets researchers access and cite their sources for many different purposes. One of its advantages is that it recognizes the search-result formats of several tools, most notably the NCBI MEDLINE format. EndNote can also search the library that has been built and insert citations automatically into scientific articles, theses and dissertations. An illustration of the EndNote program is shown below. The use of EndNote is covered in detail in the practical sessions that accompany these lecture notes.


    Figure 17. Managing a database of scientific articles with EndNote
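
    To show what a reference manager reads when it imports the MEDLINE format, here is a minimal parser sketch. It assumes the usual layout of that format: a tag of up to four characters, then "- ", then the value, with continuation lines indented by six spaces. The function name parse_medline and the sample record are invented for illustration and say nothing about how EndNote itself is implemented.

```python
def parse_medline(text):
    """Parse one MEDLINE-format record into a dict: tag -> list of values."""
    record = {}
    tag = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[:4].strip() and line[4:6] == "- ":
            # New field: columns 1-4 hold the tag, the value starts at column 7.
            tag = line[:4].strip()
            record.setdefault(tag, []).append(line[6:].strip())
        elif tag is not None:
            # Continuation line: extend the last value of the current field.
            record[tag][-1] += " " + line.strip()
    return record


# Invented record, not a real PubMed entry.
sample = "\n".join([
    "PMID- 00000000",
    "TI  - An invented title that is long enough to continue",
    "      on a second line.",
    "AU  - Nguyen A",
    "AU  - Tran B",
    "DP  - 2013",
])

record = parse_medline(sample)
print(record["TI"][0])
print(record["AU"])
```

    Repeated tags such as AU are kept as a list, so each author stays a separate searchable field.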

    Chapter 3 summary

    1. The Internet holds an enormous store of information; search tools are needed to make use of it.

    2. Searching for information involves identifying the information source, building keywords and search expressions, and finally choosing a search tool.

    3. The reliability of information must be judged by criteria such as the purpose of the person who posted it, the time it was posted, and its links.

    4. PubMed is one of the important NCBI databases. Researchers can use it to search for and download a large number of works and research articles published in reputable journals.

    5. Managing literature with software tools helps researchers organize and arrange their references systematically. Citing documents in articles, theses and dissertations with EndNote saves time and effort.

    Chapter 3 review questions

    1. State the main steps in searching for information with a search engine. On what basis is the reliability of the information found assessed? Give a concrete example of the steps for searching a research topic (for example, research on transferring a herbicide-resistance gene into tobacco) using Google.

    2. Find some images of the bacterium E. coli, of the bacterial leaf blight pathogen Xanthomonas oryzae pv. oryzae, and of the principle of the PCR technique.

    3. Using search tools, find documents on the PCR technique and its applications. Requirements: state the keywords used and the number of results found. From the results, choose the single most reliable document.

    4. Using what you have learned, find the addresses of and visit the home pages of the world gene banks NCBI, EMBL, EBI, DDBJ and PubMed, and the home page of the International Rice Research Institute (IRRI).

    5. Go to PubMed and search for literature on HIV or on hepatitis. Find more than about 10 full-text articles in the PubMed database, then use EndNote to store and manage them as a library.

    6. Using the library you have just built, search for articles by field (author name, article title, year of publication, keyword). Then use EndNote to cite the articles and research works automatically in a graduation thesis.


    PART 2

    BIOLOGICAL DATABASES

    SUBMITTING SEQUENCES TO DATABASES

    CHAPTER 4. BIOLOGICAL DATABASES

    Databases

    Databases are the most important foundation of applied bioinformatics. Most of the data in biological databases are biological sequences accompanied by detailed descriptive information. Genome sequencing projects, for example, generate data every day on a worldwide scale. To make these data usable, they must be organized and arranged so that they can be stored, grouped, accessed, searched and compared. Because of the particular nature of biological data, there are also structure databases and function databases in addition to ordinary sequence databases.

    Because databases are complex and interrelated, it is difficult to arrange and classify them into clearly separate groups. By the origin of their data, they can be divided into primary and secondary databases. Primary databases contain nucleotide sequences, amino acid sequences or structures determined experimentally, together with descriptive information on function, related publications and cross-links to other databases. Secondary databases contain data that have been selected from primary databases and arranged according to defined criteria. By the nature of their data, databases can be divided into sequence databases, structure databases and other databases (Figure 18). Databases play an extremely important role as the basis for searching, analysing and comparing data. By combining analysis tools with the cross-links between databases, researchers can identify, predict and analyse the information contained in sequences, and determine the properties and functions of new biological sequences.

    Figure 18. Classification of biological databases


    4.1. Primary databases

    4.1.1. Nucleotide sequence databases

    GenBank

    GenBank is regarded as the best-known and most widely used database of the NCBI (National Center for Biotechnology Information) in the United States. It is a freely accessible database containing more than 189,000,000 sequences with a total of more than 299,000,000,000 bases from more than 380,000 organisms (as of December 2010). GenBank works together with two other large banks, the European Molecular Biology Laboratory (EMBL) database hosted at the European Bioinformatics Institute (EBI) and the DNA Data Bank of Japan (DDBJ), to form the International Nucleotide Sequence Database Collaboration (INSDC).

    Sequences submitted to NCBI must be at least 50 bases long and must be described in detail, including an accession number (AN). The accession number stays the same even when the sequence is updated. In some cases a version number is placed after the accession number, separated from it by a dot. Sequences enter GenBank through sequence submission, which is done through a web interface (BankIt) or by email (Sequin). Sequence submission is described in detail in the next chapter.
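
    The accession.version convention described above is easy to handle in code. The following sketch splits such an identifier into its two parts. The function name split_accession is an assumption made for illustration, and the identifier is used only as an example of the format.

```python
def split_accession(identifier):
    """Split 'ACCESSION.VERSION' into (accession, version).

    The version is returned as an int, or None when the identifier
    carries no version suffix.
    """
    accession, dot, version = identifier.strip().partition(".")
    return accession, (int(version) if dot else None)


print(split_accession("U49845.1"))   # ('U49845', 1)
print(split_accession("U49845"))     # ('U49845', None)
```

    Comparing only the accession part finds the same record across updates, while the version part tells two revisions of the sequence apart.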

    Each sequence stored in GenBank is called an entry. An entry begins with the keyword LOCUS followed by the locus name. Like the AN, the locus name is unique; unlike the accession number, however, the locus name can change after review or revision. The locus name consists of 8 characters: the first letters of the genus and species names, followed by the 6 digits of the accession number.

    EMBL and DDBJ

    GenBank's European and Japanese partners are EMBL/EBI and DDBJ, which are also primary sequence repositories. The three databases GenBank/EMBL/DDBJ are linked to form the INSDC. The partners exchange their data every day, so a sequence search can be carried out at any of the three banks. Although the entry format used by NCBI and DDBJ differs from that used by EMBL, the information contained in each entry is the same.

    4.1.2. Protein sequence databases

    SWISSPROT

    One of the largest databases of protein sequences with the most detailed descriptions is SWISSPROT, hosted at the Swiss Institute of Bioinformatics (SIB). It is served through a server system called ExPASy (Expert Protein Analysis System). SWISSPROT contains manually curated sequences: every record in the database is reviewed by experts and, where necessary, checked against published work. For this reason the database is of very high quality and is regarded as the gold standard for analysing and looking up information on proteins. SWISSPROT is also part of the UniProt database (Universal Protein Resource).

    Because new sequences and new information are produced continuously, the SIB experts cannot keep up, so a new database, TrEMBL, was set up alongside SWISSPROT. TrEMBL stands for Translated EMBL: it contains all the protein sequences obtained by translating DNA sequences. All of the annotation is produced automatically by computer rather than by experts, so TrEMBL is less reliable. Both databases can be accessed through the main SWISSPROT interface, where simple queries can be typed into the search box. The search and analysis tools for these databases are provided by SIB.
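
    The idea behind a translated database can be illustrated with a short sketch that converts a DNA coding sequence into protein using the standard genetic code. It only illustrates the principle and is not the pipeline that TrEMBL actually uses. The function name translate and the input sequence are assumptions made for the example.

```python
from itertools import product

# Standard genetic code, with codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA sequence in reading frame 1.

    Stop codons are written as '*'; codons containing anything other
    than A, C, G or T are written as 'X'. A trailing incomplete codon
    is ignored.
    """
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        protein.append(CODON_TABLE.get(dna[i:i + 3], "X"))
    return "".join(protein)


print(translate("ATGGCCAAGTAA"))   # MAK*
```

    A real translated database also has to locate the coding regions first. This sketch assumes that the input is already a coding sequence in the correct frame.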

    NCBI Protein database

    Another very important sequence database maintained at NCBI is the Protein database. It is not simply a set of sequence data but a collection of entries from many other protein sequence databases, such as UniProt, PIR and PDB.

    UniProt

    The amount of information on proteins continues to grow rapidly. Besides sequence information, expression patterns, predicted secondary structures and biological functions are also stored and described. All of these data are held in databases, some of which are specialized (devoted to a single field), so collecting all the information on a protein of interest can take a great deal of time. For this reason EBI, SIB and Georgetown University built a central resource for protein information called the Universal Protein Resource, abbreviated UniProt. UniProt was established in 2002 by combining the protein databases Swissprot, TrEMBL and PIR. It has three parts: (i) the UniProt Knowledgebase (UniProtKB); (ii) the database of clustered protein sequences, UniProt Reference Clusters (UniRef); and (iii) the UniProt Archive (UniParc), a collection of protein sequences together with their history.

    Of these three UniProt databases, UniProtKB is the most complete; it combines Swissprot and TrEMBL. Proteins can be searched for in UniProtKB with long keywords or with combinations of keywords. UniRef is a non-redundant sequence database, meaning that each sequence appears only once. It is well suited to searching for homologous sequences. UniRef exists in three forms, UniRef100, UniRef90 and UniRef50, which group sequences that are 100%, at least 90% and at least 50% identical, respectively.
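
    The three identity thresholds can be made concrete with a small sketch. It computes the percent identity of two sequences that are assumed to be already aligned to the same length, and reports which UniRef thresholds that value reaches. This is a deliberate simplification: the real UniRef clustering procedure involves more than a single pairwise comparison. The function names and the two short sequences are invented for illustration.

```python
def percent_identity(seq_a, seq_b):
    """Percent of aligned positions that hold the same residue.

    Both sequences must already be aligned to the same length;
    positions where both carry a gap ('-') do not count as matches.
    """
    if not seq_a or len(seq_a) != len(seq_b):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b and a != "-")
    return 100.0 * matches / len(seq_a)


def uniref_levels(identity):
    """List the UniRef thresholds that a given percent identity reaches."""
    levels = []
    if identity >= 100.0:
        levels.append("UniRef100")
    if identity >= 90.0:
        levels.append("UniRef90")
    if identity >= 50.0:
        levels.append("UniRef50")
    return levels


identity = percent_identity("MKTAYIAKQR", "MKTAYIAKQK")   # 9 of 10 positions match
print(identity, uniref_levels(identity))
```

    Two sequences that differ at one position in ten reach the 90% and 50% thresholds but not the 100% one.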

    PIR

    The Protein Information Resource (PIR) provides scientists with reliable databases of protein sequences, together with accurate information on their functions. PIR gives strong support to research on genomes, proteomes and systems biology.

    PIR was established in 1984 by the National Biomedical Research Foundation (NBRF) to help researchers identify and interpret protein sequence information. This includes comparing protein sequences and identifying sequences that are evolutionarily related on the basis of sequence alignment.


    Figure 19. The PIR database

    Over more than four decades, beginning with the Atlas of Protein Sequence and Structure, PIR has provided protein databases and analysis tools that scientists can use and access free of charge, including the Protein Sequence Database (PSD).

    4.1.3. Molecular structure databases

    PDB

    The Protein Data Bank (PDB) is a database of three-dimensional structures of biological macromolecules such as proteins and nucleic acids. The data usually come from experimental work using crystallization and X-ray crystallography, or NMR spectroscopy. They are collected from the results of scientists and research groups all over the world. PDB is regarded as the largest source of biological structure data and is linked to other large databases such as GenBank, EMBL and SwissProt.

    Starting in 1976 with only 3 determined protein structures, the PDB contained a total of 90611 structure entries by mid-May 2013.


    Experimental method   Proteins  Nucleic acids  Protein/DNA complexes  Other molecules  Total
    X-ray diffraction        74593           1457                   3864                2  79916
    NMR                       8700           1029                    192                7   9928
    Electron microscopy        374             45                    126                0    545
    Hybrid                      46              3                      2                1     52
    Other                      147              4                      6               13    170
    Total                    83860           2538                   4190               23  90611

    Figure 20. The PDB protein structure database

    PDB files can be displayed with open-source computer programs. Some programs are integrated into the web site, such as Pymol, UCSF Chimera, Rasmol and Swiss-PDB Viewer. These programs usually require support for the latest version of Javascript.

    Besides storing structural data, PDB provides tools that let researchers compare protein sequences, model structures and compare protein structures.

    SCOP

    SCOP (Structural Classification of Proteins) classifies proteins of known structure in a hierarchical system. Proteins that perform similar biological functions and are closely related in evolution have similar structures, at least in their active-site regions. The function of an unknown protein can therefore be predicted by comparing its structure with the structures of known proteins. SCOP supports this kind of function prediction and is organized into three levels: protein families, superfamilies and folds. A family contains proteins with a clear and close evolutionary relationship, defined by a sequence identity of at least 30% over the full length of the proteins. Proteins that do not meet this criterion are still placed in the same family if they are similar in structure and function. Proteins with low sequence identity whose structural and functional features nevertheless indicate a relationship are grouped into superfamilies. Proteins that have the same types of secondary structure arranged in the same fold are placed in the same fold group.

    CATH (Class Architecture Topology and Homologous Superfamily)

    The CATH database groups protein structures hierarchically into four levels: Class (C), Architecture (A), Topology (T) and Homologous Superfamily (H). The assignment of proteins to classes is largely automatic: the secondary structure content is examined and calculated without regard to how the secondary structure elements are arranged and connected. Four classes are distinguished: (i) proteins built mainly from helices (mostly alpha helices), (ii) proteins built mainly from beta sheets, (iii) proteins with both helices and sheets (alpha-beta), and (iv) proteins with very little secondary structure.

    The Architecture (A) level describes the arrangement of the secondary structure elements and is assigned manually. The Topology level describes the shape of the protein and the connectivity of its secondary structure elements. Assignment at this level relies on an algorithm that groups domains using empirically derived parameters. The Homologous Superfamily (H) level contains homologous domains, that is, domains with a common origin. Sequence similarity is determined first by sequence comparison and then by structure comparison, depending on the Topology classification. Besides these four levels there is a fifth level called sequence families (S). At this level domains are grouped by high sequence similarity (at least 35% identity over more than 60% of the length of the larger domain), so the proteins in a group usually have similar functions.
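
    CATH classifications are commonly written as four numbers separated by dots, one for each level (C.A.T.H). The sketch below unpacks such a code into its levels. The function name parse_cath_code is an assumption, the code 3.40.50.300 serves only as an example of the notation, and the class names follow the four classes listed above.

```python
# Class numbers in the order the four classes are listed in the text.
CATH_CLASSES = {
    1: "mainly alpha",
    2: "mainly beta",
    3: "alpha-beta",
    4: "few secondary structures",
}


def parse_cath_code(code):
    """Split a dotted CATH code 'C.A.T.H' into its hierarchy levels."""
    numbers = [int(part) for part in code.split(".")]
    if len(numbers) != 4:
        raise ValueError("expected four dot-separated numbers, e.g. 3.40.50.300")
    c, a, t, h = numbers
    return {
        "class": str(c),
        "class_name": CATH_CLASSES.get(c, "unknown"),
        "architecture": f"{c}.{a}",
        "topology": f"{c}.{a}.{t}",
        "homologous_superfamily": f"{c}.{a}.{t}.{h}",
    }


levels = parse_cath_code("3.40.50.300")
print(levels["class_name"], levels["architecture"], levels["topology"])
```

    Because every level's code contains the code of the level above it, two domains share a level exactly when their codes share the corresponding prefix.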

    4.2. Secondary databases

    PROSITE

    PROSITE is a secondary database in which proteins are grouped by conserved motifs (short sequence regions of 10 to 20 amino acids that are highly conserved among closely related proteins). It is a very important basis for studying protein function.

    Searching for proteins that share the same motifs makes it possible to discover their functions. This is very useful when studying an unknown protein: the motifs found in it can suggest its function and some of its biological properties. Motif detection is based on the principle of sequence alignment (see Chapter 8).
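
    PROSITE motifs are written in a compact pattern notation that maps almost directly onto regular expressions. The sketch below converts the most common elements of that notation and scans a sequence. It covers only the basic syntax (x, [..], {..}, repeat counts, < and > anchors). The function name prosite_to_regex and the test sequence are invented for illustration; the pattern N-{P}-[ST]-{P} is the well-known N-glycosylation site pattern.

```python
import re


def prosite_to_regex(pattern):
    """Convert a basic PROSITE pattern into a Python regular expression."""
    regex = pattern.rstrip(".")                           # patterns may end with '.'
    regex = regex.replace("-", "")                        # '-' only separates positions
    regex = regex.replace("x", ".")                       # x     = any amino acid
    regex = regex.replace("{", "[^").replace("}", "]")    # {P}   = anything except P
    regex = regex.replace("(", "{").replace(")", "}")     # x(2,4)= repeat 2 to 4 times
    regex = regex.replace("<", "^").replace(">", "$")     # ends of the sequence
    return regex


regex = prosite_to_regex("N-{P}-[ST]-{P}")
print(regex)                                              # N[^P][ST][^P]

sequence = "MKNGSAANPTL"                                  # invented sequence
for match in re.finditer(regex, sequence):
    print(match.start() + 1, match.group())               # 1-based position, matched motif
```

    The braces are rewritten before the parentheses on purpose: doing it the other way round would turn repeat counts into exclusion sets.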

    PRINTS

    Sequences in the PRINTS database are distinguished on the principle of fingerprinting. A fingerprint consists of several sequence motifs. PRINTS takes advantage of the fact that proteins containing the same functional regions share several sequence motifs. By comparing a number of fingerprint regions, the relationship between a protein and a known protein family can be established even when some motifs are missing.

    PRINTS is cross-linked to the entries of related databases, so users can reach many sources of information on a protein family. Like Prosite, Prints contains information on each protein family and, where possible, on the biological function of each motif in its fingerprints.

    Pfam

    The Pfam database classifies proteins by profiles. Each profile is defined by the probability, at every position of a protein sequence, that a given amino acid occurs, that residues are inserted, or that a residue is deleted. Proteins in Pfam are grouped on the basis of sequence alignment. The alignment results make it possible to distinguish groups that combine function, structure and evolutionary relationship.
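
    The notion of a profile can be illustrated by computing, for each column of a small alignment, how often each symbol occurs. This frequency table is a strong simplification of the profiles used by Pfam, which also model insertions and deletions explicitly. The function name column_profile and the toy alignment are assumptions made for the example.

```python
from collections import Counter


def column_profile(alignment):
    """Return, for every alignment column, the relative frequency of each symbol.

    `alignment` is a list of equally long strings; '-' marks a gap and
    is counted like any other symbol.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        profile.append({symbol: n / len(column) for symbol, n in counts.items()})
    return profile


# Toy alignment of four invented sequences.
toy_alignment = ["MKV-A", "MKI-A", "MRVGA", "MKVGA"]
for position, frequencies in enumerate(column_profile(toy_alignment), start=1):
    print(position, frequencies)
```

    Columns where a single residue has frequency 1.0 are fully conserved; columns with mixed frequencies, or with gaps, show where the family tolerates variation.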

    4.3. Other databases

    4.3.1. Genotype and phenotype databases

    The relationship between genotype and phenotype is studied through the phenotypic changes caused by mutated genes. Several genotype/phenotype databases have been created to store the relationships between genes and the biological traits of organisms. One of them is OMIM (Online Mendelian Inheritance in Man) at NCBI. Another is dbGaP (database of Genotypes and Phenotypes), also at NCBI. The data in this database are used to assess the statistical significance of relationships between genotype and phenotype. In addition, OMIA (Online Mendelian Inheritance in Animals) at NCBI holds genotype-phenotype relationships for many animal species other than mouse and human. For the mouse, the corresponding database is MGD (Mouse Genome Database). For two important model organisms, the fruit fly (D. melanogaster) and the nematode (C. elegans), these relationships are stored in FlyBase and WormBase. Both databases contain information on the relationship between genotype and phenotype.

    4.3.2. Genotype database (PhenomicDB)

    PhenomicDB is a database that stores genotype and phenotype information for many species, from humans to well-studied organisms such as mouse, fish, fruit fly, nematode, yeast and Arabidopsis thaliana. It combines data from many other databases.

    A special feature of PhenomicDB is that organisms can be compared with one another on the basis of their genotype-phenotype relationships. The comparison is made by combining data on homologous genes of the orthologous type (genes that diverged from a common ancestor).

    4.3.3. PubChem

    PubChem is an NCBI database that stores small molecules and information on their biological activities. It has three components: PubChem Compound, PubChem Substance and PubChem BioAssay. PubChem Compound contains more than 11 million molecules (2007) together with their two-dimensional structures.


    PubChem Substance can be used to search for substances supplied by many manufacturers, compounds of unknown composition and natural compounds whose two-dimensional structure is unknown. PubChem BioAssay provides data on biological assays. The database can be searched with query keywords. PubChem is very useful because its data are linked both internally and to external databases such as PubMed. For example, once one substance is known to inhibit an enzyme, many other substances with a similar inhibitory effect can be found. In addition, small molecules with different structures may show the same biological activity in assays. This is the basis for discovering and developing new structures for therapeutic drugs.

    Specialized databases

    Besides the databases described above, there are now thousands of databases that store information on biological sequences, molecular structures, gene maps and relationships between genotype and phenotype. With the rapid development of next-generation genome sequencing, tens of thousands of genomes have been sequenced. Genome databases with their annotations are very important for extracting information from genomes, for comparing genomes, and for studying the functions of genes and proteins by comparison not only at the level of single molecules but across whole genomes.

    For some thoroughly studied organisms, detailed information on each gene and on the mechanisms that regulate gene activity has been described. Typical examples are the databases for Arabidopsis thaliana, for rice and for several other important crops.

    The rapid growth in the number of genomes, and the results of comparing them, have led to databases of single nucleotide polymorphisms (SNPs). SNP databases are important for analysing polymorphism in organisms and the relationships between SNPs and traits, including diseases. SNP research also helps to explain why individuals respond differently to environmental influences or to therapeutic drugs. For livestock, SNP databases also provide molecular markers for use in breeding.
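
    The way SNPs emerge from sequence comparison can be shown with a short sketch that lists the positions where two aligned sequences carry different nucleotides. It assumes that the sequences are already aligned to the same length, and it skips gap positions, since insertions and deletions are not SNPs. The function name find_snps and both sequences are invented for illustration.

```python
def find_snps(reference, sample):
    """List single-nucleotide differences between two aligned sequences.

    Returns tuples (position, reference_base, sample_base) with
    1-based positions. Positions where either sequence has a gap
    ('-') are ignored.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    snps = []
    for index, (ref_base, sample_base) in enumerate(zip(reference, sample)):
        if ref_base != sample_base and "-" not in (ref_base, sample_base):
            snps.append((index + 1, ref_base, sample_base))
    return snps


print(find_snps("ATGCCGTA", "ATGTCGAA"))   # [(4, 'C', 'T'), (7, 'T', 'A')]
```

    A SNP database stores records of this kind for many individuals, which is what allows a variant position to be related to a trait or used as a molecular marker.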

    Research on genes and gene function