Datamining in Bioinformatics-1


  • 8/13/2019 Datamining in Bioinformatics-1


    DATA MINING IN BIO-INFORMATICS

    VIDYAA VIKAS COLLEGE OF ENGINEERING & TECHNOLOGY,

    Tiruchengode,

    Namakkal.

    Prepared by,

    Srimathi.K

    Pavithra.B

    B.Tech(IT),

    Pre-final yr.

    Contact:

    rimathiceo!"mail.com,(#$$%%%')


    ABSTRACT

    Biology and computer science share a natural affinity. Physicist Erwin
    Schrödinger envisioned life as an aperiodic crystal, observing that the organizing
    structure of life is neither completely regular, like a pure crystal, nor completely
    chaotic and without structure, like dust in the wind. This is why biological
    information has never satisfactorily yielded to classical mathematical analysis.
    Machine computation combines elegant algorithms with brute-force calculations,
    which seems a reasonable approach to this aperiodic structure. The solutions to
    various problems lie in the domain of organic matter. Thus, examining how organisms
    solve problems can lead to new computation- and algorithm-development approaches
    that devour the problems that are difficult to tackle in the laboratory, but easy to
    approach using a computer.

    Bioinformatics is the field of science in which biology, computer science, and
    information technology merge to form a single discipline. The ultimate goal of the
    field is to enable the discovery of new biological insights as well as to create a
    global perspective from which unifying principles in biology can be discerned. In
    the first part of this paper, we give a brief introduction to bioinformatics and data
    mining and their relationship. In the later part, we deal with data mining approaches
    in bioinformatics and their applications, particularly in biomedical and DNA data
    analysis.


    THE ITINERARY

    1. Bioinformatics
       - Where Biology meets Computer science
    2. Data mining - an introduction
    3. What is a biological database?
    4. Why need data mining in bioinformatics?
    5. Challenges in bio-industry
    6. Approaches of data mining in Bioinformatics
       - Influence based mining
       - Affinity-based mining
       - Time delay data mining
       - Trend-based mining
       - Comparative data mining
       - Predictive data mining
    7. Data mining for Biomedical and DNA data analysis
    8. Conclusion


    Defining Bioinformatics

    Bioinformatics is the computer-assisted data management discipline that helps
    gather, analyze, and represent biological information in order to understand life's
    processes.

    Bioinformatics is conceptualizing biology in terms of molecules (in the sense of
    physical chemistry) and then applying "informatics" techniques (derived from
    disciplines such as applied math, CS, and statistics) to understand and organize the
    information associated with these molecules, on a large scale.

    Bioinformatics: where biology meets computer science

    Biology is the youngest of the natural sciences. When its collected information
    reaches a critical density, a natural science progresses from information gathering
    to information processing. Combining cold silicon and hot protoplasm may constitute
    a marriage of opposites, but this union could produce genetic research prodigies.


    These days biologists use computers routinely to assist with many activities,
    including:
    - Biomolecular sequence alignment,
    - Assembly of DNA pieces,
    - Multivariate analysis of large-scale gene expression, and
    - Metabolic pathway analysis.

    Currently, the most successful uses of computers in biology are comparative
    sequence analysis and in silico cloning - the process of using a computer search of
    existing databases to clone a gene.

    Data mining - a synonym for KDD (Knowledge Discovery in Databases)

    Data mining can be defined as the process of extracting hidden predictive
    information from large databases.

    Data mining, by its simplest definition, automates the detection of relevant
    patterns in a database. For example, a pattern might indicate that married males with
    children are twice as likely to drive a particular sports car as married males with
    no children.
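A minimal sketch of how such a pattern could be checked, assuming a small made-up table of customer records. The field names (`married_male`, `children`, `sports_car`) and the data are hypothetical, chosen only so that the "twice as likely" pattern shows up. It compares the rate of sports-car ownership between married males with and without children.

```python
# Hypothetical customer records; fields and values are invented for illustration.
records = [
    {"married_male": True, "children": True, "sports_car": True},
    {"married_male": True, "children": True, "sports_car": True},
    {"married_male": True, "children": True, "sports_car": False},
    {"married_male": True, "children": True, "sports_car": False},
    {"married_male": True, "children": False, "sports_car": True},
    {"married_male": True, "children": False, "sports_car": False},
    {"married_male": True, "children": False, "sports_car": False},
    {"married_male": True, "children": False, "sports_car": False},
]


def ownership_rate(rows, has_children):
    """Fraction of married males in the chosen group who drive the sports car."""
    group = [r for r in rows if r["married_male"] and r["children"] == has_children]
    if not group:
        raise ValueError("no records in the requested group")
    return sum(r["sports_car"] for r in group) / len(group)


def relative_likelihood(rows):
    """How many times more likely the 'with children' group is to drive the car.

    Raises ZeroDivisionError if nobody in the 'no children' group drives it.
    """
    return ownership_rate(rows, True) / ownership_rate(rows, False)


print(relative_likelihood(records))
```

A real mining tool would search over many candidate attribute combinations rather than test one hand-picked pattern; this sketch only shows the comparison behind a single reported pattern.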


    Data mining uses well-established statistical and machine learning techniques to
    build models that predict customer behavior.
    Today, technology automates the mining process, integrates it with commercial
    data warehouses, and presents it in a relevant way for business users.

    What is a Biological Database?

    A biological database is a large, organized body of persistent data, usually
    associated with computerized software designed to update, query, and retrieve
    components of the data stored within the system. A simple database might be a single
    file containing many records, each of which includes the same set of information. For
    example, a record associated with a nucleotide sequence database typically contains
    information such as contact name; the input sequence with a description of the type
    of molecule; the scientific name of the source organism from which it was isolated;
    and, often, literature citations associated with the sequence.

    For researchers to benefit from the data stored in a database, two additional
    requirements must be met:
    - Easy access to the information; and
    - A method for extracting only that information needed to answer a specific
      biological question.
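The record layout and the second requirement can be sketched as follows. This is an illustrative assumption, not the schema of any real sequence database: the class name `NucleotideRecord`, its fields, the helper `find_by_organism`, and the sample entries are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class NucleotideRecord:
    """One record of a simple flat-file sequence database (hypothetical layout)."""
    contact_name: str
    sequence: str
    molecule_type: str
    organism: str
    citations: list = field(default_factory=list)


def find_by_organism(records, organism):
    """Return only the records whose source organism matches, ignoring case."""
    wanted = organism.lower()
    return [r for r in records if r.organism.lower() == wanted]


# Invented sample entries, for illustration only.
database = [
    NucleotideRecord("A. Researcher", "ATGGCC", "DNA", "Escherichia coli"),
    NucleotideRecord("B. Researcher", "AUGGCC", "mRNA", "Homo sapiens"),
    NucleotideRecord("C. Researcher", "ATGTTT", "DNA", "Homo sapiens"),
]

for record in find_by_organism(database, "homo sapiens"):
    print(record.contact_name, record.sequence)
```

Every record carries the same set of fields, and the query returns only the entries relevant to one question instead of the whole file.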

    Need for data mining in bioinformatics


    The growth of biological information databases follows an exponential curve that
    closely mimics Moore's law - doubling every 18 months or so.
    By helping researchers process this vast collection of data, computer science can
    assist in dispersing this information storm.
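As a rough illustration of what such exponential growth implies, the sketch below computes the growth factor after a given number of months. The function name `growth_factor` is hypothetical, and the default doubling period of 18 months is an assumption taken from the figure usually quoted for Moore's law.

```python
def growth_factor(months, doubling_period=18.0):
    """Factor by which a quantity grows after `months` if it doubles
    every `doubling_period` months (assumed default: 18)."""
    if doubling_period <= 0:
        raise ValueError("doubling_period must be positive")
    return 2.0 ** (months / doubling_period)


# Growth after 18 months, 3 years, and 6 years under the assumed doubling period.
for months in (18, 36, 72):
    print(months, growth_factor(months))
```

Under this assumption a database grows 4-fold in three years and 16-fold in six, which is why manual processing cannot keep pace.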

    More than 2,000,000 biological abstracts are awaiting information extraction, and
    the number is still growing.

    The biopharmaceutical industry is generating more chemical and biological
    screening data than it knows what to do with or how best to handle. As a result,
    deciding which targets and lead compounds to develop further is often a long and
    arduous task.

    - Medical data has increased dramatically.
    - Manual analysis is not adequate.
    - The traditional data analysis methods are not adequate to deal with the enormous
      data flow.
    - Data mining is necessary.
    - Comprehensive pre-processing facilities are included.
    - The generated rules are simple to understand.
    - In the medical domain the primary objective is explanation rather than prediction.
    - Medical databases typically have a high proportion of missing values. Data mining
      software can efficiently handle the missing values.

    Challenges in Bio-industry


    Explaining the scale of data that needs to be handled in biotechnology,
    Oracle General Manager S. Grover says, There are 32,000 genomes with 1.5 million
    proteins in them. Each genome requires approximately 3 terabytes of trace files. So
    32,000 times 3 TB is massive. Medical imaging generates 40 million GB of data
    annually.

    Each mass spectrometer generates 20 GB of data daily. Multiply this by the
    thousands of mass spectrometers in use in the world today and you get the picture.
    This sheer volume of data calls for intelligent databases.

    Biologists sometimes can't agree on the very definitions and concepts the
    databases are supposed to manage. In genomics, the data entered is not always
    accurate and precise. Even if it is standardized, searching a colossal database is
    no mean task. Another problem is that databases created by different organizations
    store information idiosyncratically, creating different file formats that cannot
    talk to each other.

    To begin with, biological data is itself complex and interlinked. A spot on
    a DNA array, for instance, is connected not only to immediate information about its
    intensity, but to layers of information about genomic location, DNA sequence,
    structure, function, and much more.

    Creating information systems that allow biologists to seamlessly follow
    these links without getting lost in a sea of information is a challenge for computer
    scientists. Data mining with elegant algorithms seems to be a better solution.


    Approaches of Data mining in Bioinformatics

    Influence-based mining:

    Complex and granular (as opposed to linear) data in large databases are
    scanned for influences between specific data sets, and this is done along many
    dimensions and in multi-table formats.
    These systems find applications wherever there are significant cause-and-effect
    relationships between data sets, as occurs, for example, in large and multivariant
    gene expression studies, which are behind areas such as pharmacogenomics.

    Affinity-based mining:

    Large and complex data sets are analyzed across multiple dimensions, and
    the data-mining system identifies data points or sets that tend to be grouped
    together. These systems differentiate themselves by providing hierarchies of
    associations and showing any underlying logical conditions or rules that account for
    the specific groupings of data. This approach is particularly useful in biological
    motif analysis, whereby it is important to distinguish "accidental" or incidental
    motifs from ones with biological significance.

    Time delay data mining:

    The data set is not available immediately and in complete form, but is
    collected over time. The systems designed to handle such data look for patterns that
    are confirmed or rejected as the data set increases and becomes more robust. This
    approach is geared toward long-term clinical trial analysis and multicomponent mode
    of action studies.
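One way to picture a pattern being confirmed or rejected as data accumulates is to re-evaluate its support after every new batch of observations. The sketch below is an assumption about how that could be done, not a description of any particular product: `track_pattern`, the `min_support` threshold, and the toy batches are all hypothetical.

```python
def track_pattern(batches, pattern, min_support=0.5):
    """Re-evaluate a pattern's support each time a new batch of observations arrives.

    Returns one (observations_so_far, support, status) tuple per non-empty prefix.
    """
    seen = 0
    hits = 0
    history = []
    for batch in batches:
        for observation in batch:
            seen += 1
            if pattern(observation):
                hits += 1
        if seen == 0:
            continue  # nothing observed yet, so there is nothing to report
        support = hits / seen
        status = "confirmed" if support >= min_support else "rejected"
        history.append((seen, support, status))
    return history


# Toy trial data: 1 means the patient responded to treatment, 0 means no response.
batches = [[1, 0, 0, 0], [1, 1, 1, 1], [1, 1]]
for step in track_pattern(batches, lambda outcome: outcome == 1):
    print(step)
```

In this toy run the pattern looks weak after the first batch and is confirmed only once more data arrives, which is the behaviour the approach is designed to handle.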

    Trend-based mining:

    The software analyzes large and complex data sets in terms of any
    changes that occur in specific data sets over time. The data sets can be
    user-defined, or the system can uncover them itself. Essentially, the system reports
    on anything that is changing over time.

    Comparative data mining:

    It focuses on overlaying large and complex data sets that are similar to
    each other and comparing them. This is particularly useful in all forms of clinical
    trial meta-analyses, where data collected at different sites over different time
    periods, and perhaps under similar but not always identical conditions, need to be
    compared. Here, the emphasis is on finding dissimilarities, not similarities.

    Predictive data mining:

    Data mining alone is lacking somewhat if it is unable to also offer a
    framework for making simulations, predictions, and forecasts, based on the data sets
    it has analyzed. It combines pattern matching, influence relationships, time set
    correlations, and dissimilarity analysis to offer simulations of future data sets.


    Data mining for biomedical and DNA data analysis:

    The past decade has seen an explosive growth in biomedical research,
    ranging from the development of new pharmaceuticals and advances in cancer therapies
    to the identification and study of the human genome by discovering large-scale
    sequencing patterns and gene functions. Since a great deal of biomedical research
    has focused on DNA data analysis, we study this application here. Recent research in
    DNA analysis has led to the discovery of genetic causes for many diseases and
    disabilities, as well as the discovery of new medicines and approaches for disease
    diagnosis, prevention, and treatment.

    An important focus in genome research is the study of DNA sequences since
    such sequences form the foundation of the genetic codes of all living organisms.
    All DNA sequences are comprised of four building blocks (called nucleotides):
    adenine (A), cytosine (C), guanine (G), and thymine (T). These four nucleotides are
    combined to form long sequences or chains that resemble a twisted ladder.
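Because a DNA sequence is a string over the four-letter alphabet A, C, G, T, even basic analysis starts with validating and counting symbols. A minimal sketch follows; the function name `nucleotide_counts` is hypothetical.

```python
from collections import Counter

NUCLEOTIDES = "ACGT"  # adenine, cytosine, guanine, thymine


def nucleotide_counts(sequence):
    """Count each of the four nucleotides in a DNA sequence.

    Raises ValueError if the sequence contains any other symbol.
    """
    seq = sequence.upper()
    invalid = set(seq) - set(NUCLEOTIDES)
    if invalid:
        raise ValueError(f"unexpected symbols: {sorted(invalid)}")
    counts = Counter(seq)
    return {base: counts[base] for base in NUCLEOTIDES}


print(nucleotide_counts("aacgtt"))
```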


    Human beings were once estimated to have around 100,000 genes; current genome
    annotations put the number of protein-coding genes closer to 20,000. A gene is
    usually comprised of hundreds of individual nucleotides arranged in a particular
    order. There are almost an unlimited number of ways that the nucleotides can be
    ordered and sequenced to form distinct genes. It is challenging to identify
    particular gene sequence patterns that play roles in various diseases. Since many
    interesting sequential pattern analysis and similarity search techniques have been
    developed in data mining, data mining has become a powerful tool and contributes
    substantially to DNA analysis in the following ways.

    Semantic integration of heterogeneous, distributed genome databases:

    Due to the highly distributed, uncontrolled generation and use of a wide
    variety of DNA data, the semantic integration of such heterogeneous and widely
    distributed genome databases becomes an important task for systematic and
    coordinated analysis of DNA databases. This has promoted the development of
    integrated data warehouses and distributed federated databases to store and manage
    the primary and derived genetic data.
    Data cleaning and data integration methods developed in data mining will
    help the integration of genetic data and the construction of data warehouses for
    genetic data analysis.

    Similarity search and comparison among DNA sequences:


    Gene sequences isolated from diseased and healthy tissues can be compared to
    identify critical differences between the two classes of genes. This can be done by
    first retrieving the gene sequences from the two tissue classes, and then finding and
    comparing the frequently occurring patterns of each class. Usually, sequences
    occurring more frequently in the diseased samples than in the healthy samples might
    indicate the genetic factors of the disease; on the other hand, those occurring more
    frequently only in the healthy samples might indicate mechanisms that protect the
    body from the disease. Although genetic analysis requires similarity search, the
    technique needed here is quite different from that used for time-series data. For
    example, some of the data transformation methods that are popularly used in the
    analysis of time-series data are ineffective for genetic data, since such data are
    nonnumeric and the precise interconnections between different kinds of nucleotides
    play an important role in their function. On the other hand, the analysis of frequent
    sequential patterns is important in the analysis of similarity and dissimilarity in
    genetic sequences.
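The two-step comparison described above can be sketched with fixed-length substrings (k-mers) standing in for frequent patterns. This is a deliberate simplification: real sequential-pattern mining also handles gaps and variable lengths. The function names, the `min_gap` threshold, and the toy sequences are all hypothetical.

```python
from collections import Counter


def kmer_support(sequences, k):
    """Fraction of sequences in which each length-k substring occurs at least once."""
    if not sequences:
        return {}
    counts = Counter()
    for seq in sequences:
        # A set ensures each k-mer is counted once per sequence.
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return {kmer: n / len(sequences) for kmer, n in counts.items()}


def enriched_patterns(diseased, healthy, k=3, min_gap=0.5):
    """Patterns whose support in the diseased class exceeds the healthy class by min_gap."""
    d = kmer_support(diseased, k)
    h = kmer_support(healthy, k)
    return sorted(kmer for kmer, s in d.items() if s - h.get(kmer, 0.0) >= min_gap)


# Invented toy sequences, for illustration only.
diseased = ["ACGTGA", "TTGTGA", "GTGACC"]
healthy = ["ACGTTT", "TTGACC", "ACCTTG"]
print(enriched_patterns(diseased, healthy))
```

Swapping the two arguments gives the patterns that are more frequent in the healthy class, the candidates for protective mechanisms.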

    Association analysis: identification of co-occurring gene sequences

    Currently, many studies have focused on the comparison of one gene to
    another. However, most diseases are not triggered by a single gene but by a
    combination of genes acting together. Association analysis methods can be used to
    help determine the kinds of genes that are likely to co-occur in target samples.
    Such analysis would facilitate the discovery of groups of genes and the study of
    interactions and relationships between them.
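A minimal sketch of the co-occurrence idea, restricted to gene pairs and a simple support threshold. Full association analysis would also consider larger gene sets and rule confidence. The function name `cooccurring_pairs`, the threshold, and the gene labels G1 to G4 are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurring_pairs(samples, min_support=0.5):
    """Gene pairs found together in at least `min_support` of the samples.

    Each sample is a collection of the genes observed as active in it.
    Returns a dict mapping (gene_a, gene_b) to the pair's support.
    """
    if not samples:
        return {}
    pair_counts = Counter()
    for genes in samples:
        # Sorting makes (a, b) and (b, a) count as the same pair.
        pair_counts.update(combinations(sorted(set(genes)), 2))
    total = len(samples)
    return {
        pair: n / total
        for pair, n in pair_counts.items()
        if n / total >= min_support
    }


# Invented target samples, for illustration only.
samples = [
    {"G1", "G2", "G3"},
    {"G1", "G2"},
    {"G2", "G3"},
    {"G1", "G2", "G4"},
]
print(cooccurring_pairs(samples))
```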


    Path analysis: linking genes to different stages of disease development

    While a group of genes may contribute to a disease process, different genes
    may become active at different stages of the disease. If the sequence of genetic
    activities across the different stages of disease development can be identified, it
    may be possible to develop pharmaceutical interventions that target the different
    stages separately, therefore achieving more effective treatment of the disease. Such
    path analysis is expected to play an important role in genetic studies.
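As a simplified illustration of identifying a sequence of genetic activity, the sketch below records which genes first become active at each stage. It assumes the stages are already known and ordered, which real path analysis would have to infer. The function name `activation_path`, the stage labels, and the gene labels are hypothetical.

```python
def activation_path(activity_by_stage):
    """Map each disease stage, in order, to the genes that first become active in it.

    `activity_by_stage` is an ordered list of (stage_name, active_genes) pairs.
    """
    seen = set()
    path = []
    for stage, active_genes in activity_by_stage:
        newly_active = sorted(set(active_genes) - seen)
        path.append((stage, newly_active))
        seen.update(active_genes)
    return path


# Invented observations, for illustration only.
stages = [
    ("early", ["G1"]),
    ("middle", ["G1", "G2", "G3"]),
    ("late", ["G3", "G4"]),
]
for stage, genes in activation_path(stages):
    print(stage, genes)
```

The genes listed for each stage are the natural candidates for a stage-specific intervention.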

    Visualization tools and genetic data analysis:

    Complex structures and sequencing patterns of genes are most effectively
    presented in graphs, trees, cuboids, and chains by various kinds of visualization
    tools. Such visually appealing structures and patterns facilitate pattern
    understanding, knowledge discovery, and interactive data exploration. Visualization
    therefore plays an important role in biomedical data mining.

    An Industrial look

    After the dotcom downfall, many leading companies like TCS, Wipro,
    and IBM are now looking at computerizing the medical field. IT professionals feel
    that database management and data mining solutions and services play an important
    role in this. Dr. Manoj Kumar, director, IBM research labs, says that competence in
    areas like data and storage management and data mining would aid in the pursuit of
    bioinformatics.

    Conclusion:


    Bioinformatics systems benefit from the use of data mining strategies to
    locate interesting and pertinent relationships within massive amounts of
    information. For example, data mining methods can ascertain and summarize the set of
    genes responding to a certain level of stress in an organism. Researchers can use
    graphical models and relational algorithms to mine such gene sets and model a gene
    expression network. This paper, for its part, shows the persistent role of data
    mining in experimental biology. Thus, biology combined with computer science is an
    emerging field that has come to stay and to serve humanity.
