Introduction to Bioinformatics
Shifra Ben-Dor
Bioinformatics Unit, Life Sciences Core Facilities
Lecture Outline:
• Technical Course Items
• Sequences
• Databases
– This week and next week
The technical stuff
The course is made up of one lecture and an optional exercise session each week.
The exercise sessions are not mandatory; they are there to help. Demonstrations of the programs will be done in both the lectures and the exercise sessions. The exercise sessions are an opportunity for you to do the assignment with somebody there to ask for help if you get stuck.
If you are planning on coming to the exercise sessions, please send me an email:
The course website is where you can find the syllabus, lecture notes, assignments, and links to the various programs taught and to relevant literature. It is also where we put announcements and updates. http://dors.weizmann.ac.il/course/introbioinfo/
• This course is built for biologists
• Background will be given on various topics as needed, but basic knowledge of B.Sc. level biology is taken for granted
• If you need help with the biology, contact me
Requirements for a grade
• You are required to do all of the assignments and a final project
• The course grade is computed as follows: 60% final project, 40% assignments
Assignments
• You have two weeks to hand in each assignment
• Assignments are to be handed in at the Wolfson lecture hall, by the end of the lecture (11:00)
• If for any reason you need or want an extension, talk to me BEFORE the assignment is due
• An assignment handed in late or not at all will get a 0
• You may consult with a friend while doing the assignment; however, all work must be handed in individually. If we find copying, the grade will be divided among the number of students handing in the same answer sheet
• Assignments should be printed and handed in. Electronic submission (e-mail) will NOT be accepted.
Final Project
• The final project will be given in the beginning of July.
• It will be due on August 9.
• Late projects will not be accepted
• There is NO possibility to correct projects
• If evidence is found of shared work, there will be no course grade
Announcements, Updates…
• Any news will be announced in the lectures and updated on the website
• What is said in the lecture hall is the final word, unless specified otherwise
If you have questions, comments, suggestions or complaints, please contact us. The earlier the better!
Course Staff
Main Lecturer: Shifra Ben-Dor
Teaching assistants (metargelot): Irit Orr, Bareket Dassa
What is bioinformatics?
What will we cover in this course?
What won’t we cover in this course?
• Detailed structural analysis of proteins
• Algorithm Development
• High throughput methods
• In-depth phylogenetics or evolutionary biology
• In-depth systems biology
• siRNA, miRNA
• Promoter Analysis
Skepticism and computers
The biological thinking has to be done by YOU
Lecture Outline:
• Technical Course Items
• Sequences
• Databases
– This week and next week
What “units of information” do we deal with in bioinformatics?
• DNA
• RNA
• Protein
• Sequence
• Structure
• Evolution
• Pathways
• Interactions
• Mutations
Examples of biological data used in bioinformatics
• DNA (Genome)
• RNA (Transcriptome)
• Protein (Proteome)
DNA
Raw DNA Sequence
• Coding or not coding?
• Parse into genes?
• Other important genomic elements?
• 4 bases: A, G, C, T
atggcaattaaaattggtatcaatggttttggtcgtatcggccgtatcgtattccgtgcagcacaacaccgtgatgacattgaagttgtaggtattaacgacttaatcgacgttgaatacatggcttatatgttgaaatatgattcaactcacggtcgtttcgacggcactgttgaagtgaaagatggtaacttagtggttaatggtaaaactatccgtgtaactgcagaacgtgatcca
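A raw DNA sequence is just a string over the four bases, so a first sanity check is to confirm that nothing else is in it and to count what is there. The sketch below is only an illustration: the function names and the short test fragment are made up for this example and are not part of the course material.

```python
from collections import Counter

DNA_BASES = set("ACGT")


def base_composition(seq: str) -> Counter:
    """Count each base, ignoring case and whitespace.

    Raises ValueError if the sequence contains anything other than A, C, G, T.
    """
    cleaned = "".join(seq.split()).upper()
    unexpected = set(cleaned) - DNA_BASES
    if unexpected:
        raise ValueError(f"unexpected characters: {sorted(unexpected)}")
    return Counter(cleaned)


def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    counts = base_composition(seq)
    total = sum(counts.values())
    return (counts["G"] + counts["C"]) / total if total else 0.0


if __name__ == "__main__":
    fragment = "atggca"  # hypothetical fragment, for illustration only
    print(base_composition(fragment))
    print(gc_content(fragment))
```

A check like this also catches copy-and-paste damage, since any stray character that is not one of the four bases raises an error instead of being silently counted.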
DNA/RNA sequences
• Genes are encoded in genomic sequences.
• Genes are transcribed into pre-mRNAs (including coding, intronic, 5’ and 3’ untranslated regions).
• mRNAs are spliced (introns removed) and translated into proteins.
• mRNAs are copied to cDNAs (in the lab)
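The splice-then-translate steps listed above can be sketched in a few lines. This is a toy model under stated assumptions: the sequence is kept in the DNA alphabet (as in a cDNA), exon coordinates are given by hand as 0-based half-open intervals, the standard genetic code is used, and the tiny "gene" with its intron is invented for the example.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def splice(pre_mrna: str, exons: list) -> str:
    """Join the exon intervals (start, end), dropping the introns between them."""
    return "".join(pre_mrna[start:end] for start, end in exons)


def translate(cds: str) -> str:
    """Translate a coding sequence codon by codon, stopping at the first stop codon."""
    cds = cds.upper()
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")  # "X" marks an unrecognised codon
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


if __name__ == "__main__":
    # Hypothetical two-exon gene: exon 1, a six-base intron, exon 2.
    gene = "ATGGCC" + "GTAAGT" + "AAATAA"
    mrna = splice(gene, [(0, 6), (12, 18)])
    print(mrna)             # ATGGCCAAATAA
    print(translate(mrna))  # MAK
```

Real transcripts also carry 5’ and 3’ untranslated regions, so in practice translation starts at the ATG of the coding sequence rather than at the first base of the mRNA.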
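[Figure: gene structure at three levels. Genomic DNA: promoter, TSS, exons 1-4 with ATG and Stop, polyA site, TTS. Pre-mRNA: exons 1-4 with ATG, Stop and polyA site. mRNA: cap, 5’ UTR, CDS (ATG to Stop across exons 1-4), 3’ UTR, polyA.]
Modified from Zhang MQ, Nat Rev Genet. 2002 Sep;3(9):698-709.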
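Sources of mRNAs
• Experimental
– Clone new gene
– “Clone” gene from database
– RNA-Seq
• Database
– “Typical” cDNA
– Full length cDNA
– EST (Expressed Sequence Tag)
– Short read sequences
[Figure: an mRNA (5’ mG cap, AAAA tail) compared with a full length cDNA, a typical cDNA and sequence tags. Labels: mRNA, full length cDNA, typical cDNA, 5’ mG, AAAA, TTTT, tag.]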
RNA
RNA, cDNA, and ESTs
[Figure: an mRNA with exon 1, exon 2 and exon 3, the cDNA clone made from it, and the ESTs read from the ends of the clone.]
GenBank ESTs (Expressed Sequence Tags):
~8,700,000 human ESTs
~4,850,000 mouse ESTs
Adapted with permission from Adam Sartiel
Uses of ESTs
- prediction of coding regions
- detection of alternative splicing
- clustering to form “genes”
Problems with clustering:
- incomplete coverage breaks genes up
- gene families
Problems with ESTs
- low copy number genes
- rare tissues
- mistakes
- enrichment of 3’ ends of genes
- incomplete coverage of genes
Next Generation Sequencing
• Generally short reads (though longer-read technologies are now becoming available)
• Read lengths range from 20-25 bp, through 75-100 bp, to 150 bp
• Can be 3’ end only
• Can be paired or single read
[Figure: read types. Labels: mate pair, paired end reads, fragment read, contigs or “transcripts”.]
EST vs Read
• ESTs have longer continuous sequence, so they are better for seeing gene structure (alternative splicing)
• Short reads generally have higher accuracy
• Neither can give a picture of a whole gene
Protein
• 20 letter alphabet: ACDEFGHIKLMNPQRSTVWY, but not BJOUXZ
• Strings of ~300 aa in an average protein (e.g. bacteria)
• Proteins are divided into domains
LNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTTTSSVEGKQNLVIMGKKTWFSI
LNSIVAVCQNMGIGKDGNLPWPPLRNEYKYFQRMTSTSHVEGKQNAVIMGKKTWFSI
ISLIAALAVDRVIGMENAMPWNLPADLAWFKRNTLDKPVIMGRHTWESI
TAFLWAQDRNGLIGKDGHLPWHLPDDLHYFRAQTVGKIMVVGRRTYESF
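As with DNA, the 20-letter alphabet gives a simple check for a protein sequence: any of the letters B, J, O, U, X or Z (or anything else outside the 20) deserves a second look. A minimal sketch, with a function name invented for this example:

```python
AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def invalid_residues(seq: str) -> list:
    """Return (position, character) pairs for characters outside the 20-letter alphabet.

    Positions are 0-based; the check ignores case.
    """
    return [(i, ch) for i, ch in enumerate(seq.upper()) if ch not in AMINO_ACIDS]


if __name__ == "__main__":
    print(invalid_residues("LNCIVAVSQNMGIGKNGDLPWPPLRNEF"))  # []
    print(invalid_residues("MKBZ"))  # [(2, 'B'), (3, 'Z')]
```

Some databases do use extra letters (for example X for an unknown residue), so a hit here is a prompt to look at the entry, not proof of an error.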
Protein
• Proteome of an Organism
• 2D gels
• Mass Spec
• 2D Structure
• 3D Structure
• 4D Structure (interactions)
Lecture Outline:
• Technical Course Items
• Sequences
• Databases
Databases: Outline
• Introduction
– Data and Database types
– Database components
• Data Formats
• Sample databases
• How to text search databases
What “units of information” do we deal with in bioinformatics?
• DNA, RNA, Protein
• Sequence, Structure, Evolution
• Pathways, Interactions, Mutations
Nucleotide sequence: AAGTGCCACTGCATAAATGACCATGAGTGGGCACCGGTAAGGGAGGGTGATGCTATCTGGTCTGAAG
Genes
mRNA
Protein primary sequence
Protein 3D structure
Protein function:
• Acts as a tumor suppressor in many tumor types. Induces growth arrest or apoptosis depending on the physiological circumstances or cell type, but both activities are involved in tumor suppression.
• Involved in the transport of chloride ions. Defects in CFTR are the cause of cystic fibrosis. It is the most common genetic disease in the Caucasian population, with a prevalence of about 1 in 2000 live births. CF, an autosomal recessive disorder, is a common generalized disorder of exocrine gland function.
SNPs
What do we want from databases?
All of these have databases and tools that were created to work with them
Information retrieval from sequence databases
Biological databases contain enormous amounts of data.
• Databases need to be well annotated.
• Databases need to be easily searched.
• Data found in databases should be easily retrieved.
• Data in databases should be in standard formats.
Integrated Information Retrieval
• Many databases contain logical relations between specific entries.
• One interface connecting many biological databases.
• For example: a database that connects between protein sequence, protein domain, protein structure and reference databases. (InterPro)
• Another example: connection between references, protein sequence, DNA sequence, and structure databases. (Entrez)
A Database
Entries (each with an Accession Number):
000001 tumor protein p53
000002 breast cancer 1, early onset
000003 breast cancer 1, early onset
Fields:
Chromosomal location: 17p13.1
DNA sequence:
mRNA sequence:
Protein sequence:
Protein function:
brain - liver - lung -
Interacts with genes: 000365, 025783, 004674 (Internal links)
Protein structure: PDB 1OLG, 1OLH, 1SAE (External links)
Slide provided by Dr. Vered Caspi
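The anatomy on this slide (entries keyed by accession number, each with fields, internal links to other entries and external links to other databases) maps naturally onto a small record type. The sketch below is only one possible way to model it; the class and attribute names are invented, and attaching the slide's field values to entry 000001 is an assumption read off the slide layout.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Entry:
    accession: str                      # unique identifier of the entry
    name: str
    fields: Dict[str, str] = field(default_factory=dict)
    internal_links: List[str] = field(default_factory=list)             # accessions in this database
    external_links: Dict[str, List[str]] = field(default_factory=dict)  # other database -> identifiers


def resolve_internal_links(database: Dict[str, Entry], accession: str) -> List[Entry]:
    """Follow an entry's internal links, skipping accessions not present in the database."""
    entry = database[accession]
    return [database[acc] for acc in entry.internal_links if acc in database]


# Values taken from the slide; which entry they belong to is assumed.
p53 = Entry(
    accession="000001",
    name="tumor protein p53",
    fields={"Chromosomal location": "17p13.1"},
    internal_links=["000365", "025783", "004674"],
    external_links={"PDB": ["1OLG", "1OLH", "1SAE"]},
)
brca1 = Entry(accession="000002", name="breast cancer 1, early onset")
database = {e.accession: e for e in (p53, brca1)}
```

Keying the dictionary by accession number reflects the role the accession plays in a real database: it is the one value guaranteed to identify exactly one entry.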
Core Data and Annotation
Databases generally have (at least) two types of data:
Core data: the data the database was generated to organize
Annotation: extra information that rounds out our picture of the core data
For example, in a genome database, the sequence is the core data, and the location of genes is the annotation
Database Issues
• Printed journals vs. databases
• Direct submission to databases (e.g. GenBank, PDB)
• Archival vs. curated databases
• Databases that publish experimental results of large genomic centers.
• Public vs. private databases.
For Example: Classification of Genomic Databases
Database scope
• Many genomes
• One genome
• One subject
• One gene
Information source
• Direct submission from scientific community
• Scientific literature
• Genome center’s experimental results
• Other databases
Information type
• Mapping
• Sequence & annotation
• Protein structure & function
• Variations
• Comparative genomics
• Gene networks
Slide provided by Dr. Vered Caspi
User Interface
• Database search
– free text
– field-specific
– sequence-based
• Database output
– text
– graphics
– dynamic
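The difference between a free text search and a field-specific search is easy to see on a toy set of records. In this sketch the records are plain dictionaries, the field names are invented, and the matching is a simple case-insensitive substring test; real database search engines are considerably more elaborate.

```python
records = [
    {"accession": "000001", "name": "tumor protein p53", "location": "17p13.1"},
    {"accession": "000002", "name": "breast cancer 1, early onset"},
]


def free_text_search(records: list, term: str) -> list:
    """Return accessions of records where the term appears in ANY field."""
    term = term.lower()
    return [r["accession"] for r in records
            if any(term in str(value).lower() for value in r.values())]


def field_search(records: list, field_name: str, term: str) -> list:
    """Return accessions of records where the term appears in ONE named field."""
    term = term.lower()
    return [r["accession"] for r in records
            if term in str(r.get(field_name, "")).lower()]


if __name__ == "__main__":
    print(free_text_search(records, "1"))          # matches accessions, names and locations
    print(field_search(records, "location", "1"))  # only looks at the location field
```

The free text search for "1" hits both records because the digit appears in the accession numbers, which is exactly the kind of noise a field-specific search avoids.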
Data Formats
There are many data formats used for sequences (both nucleic and amino acid)
• Fasta Format
• GenBank Format
• Fastq Format
• (EMBL Format)
Fasta Format
• Simplest format
• Least information
• Starts with a > and sequence name on one line
• The sequence in plain text follows
>OB2T2
GTGACAACATGTACAGCTGTGAGCGGTGTAAGAAGCTGCGGAACGGAGTGAAGTACTGCAAAGTCCTGCGGTTGCCCGAGATCCTGTGCATTCACCTAAAGCGCTTTCGGCACGAGGTGATGTACTCATTCAAGATCAACAGCCACGTCTCCTTGCCCTCGAGGGGCTCGACCTGCGCCCCTTCCTTGCCAAGGAGTGCACATCCCAGATCACCACCTACGACCTCCTCTCGGTCATCTGCCACCACGGCACGGCAGGCA
>TNRC_HUMAN P36941 (tumor necrosis factor c receptor)
MLLPWATSAPGLAWGPLVLGLFGLLAASQPQAVPPYASENQTCRDQEKEYYEPQHRICCSRCPPGTYVSAKCSRIRDTVCATCAENSYNEHWNYLTICQLCRPCDPVMGLEEIAPCTSKRKTQCRCQPGMFCAAWALECTHCELLSDCPPGTEAELKDEVGKGNNHCVPCKAGHFQNTSSPSARCQPHTRCENQGLVEAAPGTAQSDTTCKNPLEPLPPEMSGTMLMLAVLLPLAFFLLLATVFSCIWKSHPSLCRKLGSLLKRRPQGEGPNPVAGSWEPPKAHPYFPDLVQPLLPISGDVSPVSTGLPAAPVLEAGVPQQQSPLDLTREPQLEPGEQSQVAHGTNGIHVTGGSMTITGNIYIYNGPVLGGPPGPGDLPATPEPPYPIPEEGDPGPPGLSTPHQEDGKAWHLAETEHCGATPSNRGPRNQFITHD
>TNRC_MOUSE P50284 lymphotoxin-beta receptor precursor
MRLPRASSPCGLAWGPLLLGLSGLLVASQPQLVPPYRIENQTCWDQDKEYYEPMHDVCCSRCPPGEFVFAVCSRSQDTVCKTCPHNSYNEHWNHLSTCQLCRPCDIVLGFEEVAPCTSDRKAECRCQPGMSCVYLDNECVHCEEERLVLCQPGTEAEVTDEIMDTDVNCVPCKPGHFQNTSSPRARCQPHTRCEIQGLVEAAPGTSYSDTICKNPPEPGAMLLLAILLSLVLFLLFTTVLACAWMRHPSLCRKLGTLLKRHPEGEESPPCPAPRADPHFPDLAEPLLPMSGDLSPSPAGPPTAPSLEEVVLQQQSPLVQARELEAEPGEHGQVAHGANGIHVTGGSVTVTGNIYIYNGPVLGGTRGPGDPPAPPEPPYPTPEEGAPGPSELSTPYQEDGKAWHLAETETLGCQDL
>TNR1_RAT P22934 tumor necrosis factor receptor 1 precursor (p60)
MGLPIVPGLLLSLVLLALLMGIHPSGVTGLVPSLGDREKRDNLCPQGKYAHPKNNSICCTKCHKGTYLVSDCPSPGQETVCEVCDKGTFTASQNHVRQCLSCKTCRKEMFQVEISPCKADMDTVCGCKKNQFQRYLSETHFQCVDCSPCFNGTVTIPCKEKQNTVCNCHAGFFLSGNECTPCSHCKKNQECMKLCLPPVANVTNPQDSGTAVLLPLVIFLGLCLLFFICISLLCRYPQWRPRVYSIICRDSAPVKEVEGEGIVTKPLTPASIPAFSPNPGFNPTLGFSTTPRFSHPVSSTPISPVFGPSNWHNFVPPVREVVPTQGADPLLYGSLNPVPIPAPVRKWEDVVAAQPQRLDTADPAMLYAVVDGVPPTRWKEFMRLLGLSEHEIERLELQNGRCLREAHYSMLEAWRRRTPRHEATLDVVGRVLCDMNLRGCLENIRETLESPAHSSTTHLPR
Known Issues with Fasta Format
• Different programs treat the header line differently:
– Some read 10 characters, some 30
– Some read until the first space
• Make sure you have unique names!!!
• Header lines should be under 80 characters
• Length of sequence line can differ
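The Fasta rules and the header pitfalls above can be checked with a short script before a file is handed to another program. This is a rough sketch, not any particular program's parser: the function names are invented, and the way truncation is modelled (name up to the first space, then cut to a fixed number of characters) is an assumption that combines the behaviours listed above.

```python
from collections import Counter


def parse_fasta(text: str) -> list:
    """Split Fasta text into (header, sequence) pairs.

    A record starts at a line beginning with '>'; all following lines up to
    the next '>' are joined into one sequence, whatever their length.
    """
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def name_as_seen(header: str, max_chars=None) -> str:
    """The name a program would keep: up to the first space, optionally truncated."""
    parts = header.lstrip(">").split()
    name = parts[0] if parts else ""
    return name if max_chars is None else name[:max_chars]


def clashing_names(headers: list, max_chars=None) -> list:
    """Names that stop being unique once headers are cut the way a program cuts them."""
    counts = Counter(name_as_seen(h, max_chars) for h in headers)
    return sorted(name for name, n in counts.items() if n > 1)


def long_headers(headers: list, limit: int = 80) -> list:
    """Headers at or over the length limit (80 characters by default)."""
    return [h for h in headers if len(h) >= limit]
```

For example, two headers named `TNRC_HUMAN` and `TNRC_HUMAN_isoform2` (the second one hypothetical) are distinct when read up to the first space but collide in a program that keeps only 10 characters, so it is worth running `clashing_names` with both 10 and 30.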
Fastq Format
Four lines:
1 – starts with @ and is a unique identifier
2 – the actual sequence
3 – starts with a + and can have an identifier again
4 – the quality of the bases
@SRR2976060.1 1 length=202
NAAGCTCTCACCCATGGAGACCAAGGCGATTAGGGTTTTTCTCTTCGCTCTCCTCCT
+SRR2976060.1 1 length=202
#1=DDFFFHHHHHJJJEIJJJJJIJJJJFHGJIIJ9DHIIIJJJJGIIJJJGIIIJJ
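The four-line layout can be read mechanically, one record per group of four lines. The sketch below assumes the common Phred+33 encoding for the quality line (each character's ASCII code minus 33 is the quality score); the slides do not state which encoding is used, so treat the offset as an assumption. The function name and the tiny test record are invented for this example.

```python
def parse_fastq(text: str, offset: int = 33) -> list:
    """Parse Fastq text into (identifier, sequence, qualities) tuples.

    qualities is a list of integers, one per base, decoded with the given
    ASCII offset (33 is assumed here).
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("Fastq records must have exactly four lines each")
    records = []
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@"):
            raise ValueError(f"line {i + 1}: expected '@', got {header[:1]!r}")
        if not plus.startswith("+"):
            raise ValueError(f"line {i + 3}: expected '+', got {plus[:1]!r}")
        if len(seq) != len(qual):
            raise ValueError(f"record {header}: sequence and quality lengths differ")
        records.append((header[1:], seq, [ord(c) - offset for c in qual]))
    return records


if __name__ == "__main__":
    example = "@read1\nACGT\n+\n!5?I\n"  # hypothetical record
    print(parse_fastq(example))  # [('read1', 'ACGT', [0, 20, 30, 40])]
```

The length check matters in practice: line 4 must carry exactly one quality character per base on line 2, which is a quick way to spot a truncated or corrupted file.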
GenBank Format
• Divided into three parts:
– Information lines
– Feature table
– Sequence
EMBL sequence format
RN   [2]
RA   Wirsel S.G.R., Leibinger W., Mendgen K.W.;
RT   "Genetic diversity of fungi associated with common reed (Phragmites
RT   australis)";
RL   Unpublished.
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..581
FT                   /db_xref="taxon:112223"
FT                   /organism="ascomycota sp. 4/97-9"
FT                   /isolate="4/97-9"