Predicting Protein Function
protein, RNA, DNA
Biochemical function (molecular function): What does it do? Kinase? Ligase?
Page 245
Function based on ligand binding specificity: What (who) does it bind?
Page 245
Function based on biological process
Function based on cellular location
DNA RNA
Page 245
Where is the RNA/protein expressed? Brain? Testis? Where is it underexpressed?
GO (Gene Ontology) http://www.geneontology.org/
• The GO project aims to develop three structured, controlled vocabularies (ontologies) that describe gene products in terms of their associated:
• molecular functions (F)
• biological processes (P)
• cellular components (C)
Ontology is a description of the concepts and relationships that can exist for an agent or a community of agents
Inferring protein function: bioinformatics approach
• Based on homology
• Based on the existence of known protein domains (the protein signature)
Proteins with a common evolutionary origin
Paralogs - Proteins encoded within a given species that arose from one or more gene duplication events.
Orthologs - Proteins from different species that evolved by speciation.
Hemoglobin human vs Hemoglobin mouse
Hemoglobin human vs Myoglobin human
Homologous proteins
COGs: Clusters of Orthologous Groups of proteins
> Each COG consists of individual orthologous proteins or orthologous sets of paralogs.
> Orthologs typically have the same function, allowing transfer of functional information from one member to an entire COG.
DATABASE
Reference: Classification of conserved genes according to their homologous relationships. (Koonin et al., NAR)
The Protein Signature
Motif (or fingerprint):
• a short, conserved region of a protein
• typically 10 to 20 contiguous amino acid residues
Domain:
• a region of a protein that can adopt a 3D structure
        1                                                   50
ecblc   MRLLPLVAAA TAAFLVVACS SPTPPRGVTV VNNFDAKRYL GTWYEIARFD
vc      MRAIFLILCS V...LLNGCL G..MPESVKP VSDFELNNYL GKWYEVARLD
hsrbp   ~~~MKWVWAL LLLAAWAAAE RDCRVSSFRV KENFDKARFS GTWYAMAKKD

Conserved region: GTWYEI (variants: K at position 2, A at position 5, V or M at position 6)
Pattern: GXW[YF][EA][IVLM]
Protein Motifs
Protein motifs can be represented as a consensus or a profile
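As a rough illustration of the consensus (pattern) representation, the sketch below searches the three aligned sequences for the slide's pattern GXW[YF][EA][IVLM]. The function names are hypothetical, the sequences are copied from the alignment with the gap symbols ('.' and '~') removed, and the translation handles only the two notation features used on the slide ('X' for any residue and bracketed residue classes). It is not a full PROSITE pattern parser.

```python
import re


def pattern_to_regex(pattern: str) -> str:
    """Translate the slide's pattern notation into a Python regex.

    Assumption: only 'X' (any residue) and [..] residue classes occur,
    so replacing 'X' with '.' is enough.
    """
    return pattern.replace("X", ".")


def find_motif(pattern: str, sequence: str):
    """Return (1-based start, matched text) of the first hit, or None."""
    match = re.search(pattern_to_regex(pattern), sequence)
    if match is None:
        return None
    return match.start() + 1, match.group()


# Sequences from the alignment above, gap symbols removed.
SEQUENCES = {
    "ecblc": "MRLLPLVAAATAAFLVVACSSPTPPRGVTVVNNFDAKRYLGTWYEIARFD",
    "vc": "MRAIFLILCSVLLNGCLGMPESVKPVSDFELNNYLGKWYEVARLD",
    "hsrbp": "MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKD",
}
PATTERN = "GXW[YF][EA][IVLM]"

hits = {name: find_motif(PATTERN, seq) for name, seq in SEQUENCES.items()}
for name, hit in hits.items():
    print(name, hit)
```

Start positions refer to the ungapped sequences, so they can differ from the alignment column numbers.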
Searching for Protein Motifs
- PROSITE: a database of protein patterns that can be searched by either regular expression patterns or sequence profiles.
- PHI-BLAST: searches for a specific protein sequence pattern together with local alignments surrounding the match.
- MEME: searches for common motifs in unaligned sequences.
Protein Domains
• Domains can be considered as building blocks of proteins.
• Some domains can be found in many proteins with different functions, while others are only found in proteins with a certain function.
Varieties of protein domains
Page 228
Extending along the length of a protein
Occupying a subset of a protein sequence
Occurring one or more times
Example of a protein with 2 domains: Methyl CpG binding protein 2 (MeCP2)
MBD TRD
The protein includes a Methylated DNA Binding Domain (MBD) and a Transcriptional Repression Domain (TRD). MeCP2 is a transcriptional repressor.
Pfam
> Database that contains a large collection of multiple sequence alignments of protein domains
Based on Profile hidden Markov Models (HMMs).
Profile HMM (Hidden Markov Model)
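HMM is a probabilistic model of the MSA consisting of a number of interconnected states: match (M), delete (D) and insert (I).

[Figure: profile HMM for alignment columns 16 to 19, with match states M16-M19, delete states D16-D19 and insert states I16-I19 (insert states marked X)]

Alignment (columns 16 17 18 19):
D R T R
D R T S
S - - S
S P T R
D R T R
D P T S
D - - S
D - - S
D - - S
D - - R

Match-state emission probabilities:
M16: D 0.8, S 0.2
M17: P 0.4, R 0.6
M18: T 1.0
M19: R 0.4, S 0.6

Transition probabilities: 50% from M16 to M17 and 50% from M16 to D17; the other labelled transitions are 100%.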
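The emission probabilities on the slide can be reproduced by counting residues per alignment column. The sketch below assumes the ten alignment rows read as listed in `MSA` and uses plain frequency counts without pseudocounts, which real profile HMM software would normally add. The function names are illustrative only.

```python
from collections import Counter

# Alignment rows for columns 16-19, as read from the slide ('-' is a gap).
MSA = [
    "DRTR", "DRTS", "S--S", "SPTR", "DRTR",
    "DPTS", "D--S", "D--S", "D--S", "D--R",
]
FIRST_COLUMN = 16


def match_emissions(msa):
    """Per-column residue frequencies, ignoring gaps (no pseudocounts)."""
    emissions = []
    for column in zip(*msa):
        residues = [aa for aa in column if aa != "-"]
        counts = Counter(residues)
        emissions.append({aa: n / len(residues) for aa, n in counts.items()})
    return emissions


def gap_fraction(msa, index):
    """Fraction of sequences with a gap in one column (0-based index).

    For this alignment it equals the share of paths that use the
    delete state of that column instead of its match state.
    """
    column = [row[index] for row in msa]
    return column.count("-") / len(column)


emissions = match_emissions(MSA)
for offset, probs in enumerate(emissions):
    print(f"M{FIRST_COLUMN + offset}:", probs)
print("fraction entering D17:", gap_fraction(MSA, 1))
```

Gaps are skipped when estimating match emissions because a gapped sequence passes through the delete state for that column rather than emitting a residue.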
Pfam
> The Pfam database is based on two distinct classes of alignments
- Seed alignments, which are deemed to be accurate and are used to produce Pfam-A
- Alignments derived by automatic clustering of SwissProt, which are less reliable and give rise to Pfam-B
DNA binding domains have relatively high frequency of basic (positive) amino acids
GCN4    M K D P A A L K R A R N T E A A R R S S R A R K L Q R M
zif268  M E R P Y A C P V E S C D R R F S R S D E L T R H I R I H T
myoD    S K V N E A F E T L K R C T S S N P N Q R L P K V E I L R N A I R
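To put a number on "relatively high frequency", one option is to compute the fraction of basic residues in each fragment. The sketch below counts lysine (K) and arginine (R) by default. Whether to include histidine (H) is a judgment call, so it is left as an optional parameter. The function name is hypothetical, and the fragments are the ones shown on the slide.

```python
def basic_fraction(sequence: str, basic: str = "KR") -> float:
    """Fraction of residues that belong to the given set of basic residues."""
    residues = [aa for aa in sequence.upper() if aa.isalpha()]
    if not residues:
        raise ValueError("empty sequence")
    return sum(aa in basic for aa in residues) / len(residues)


# DNA-binding fragments as shown on the slide.
FRAGMENTS = {
    "GCN4": "MKDPAALKRARNTEAARRSSRARKLQRM",
    "zif268": "MERPYACPVESCDRRFSRSDELTRHIRIHT",
    "myoD": "SKVNEAFETLKRCTSSNPNQRLPKVEILRNAIR",
}

fractions = {name: basic_fraction(seq) for name, seq in FRAGMENTS.items()}
for name, value in fractions.items():
    print(f"{name}: {value:.2f} basic (K+R)")
```

Passing `basic="KRH"` includes histidine in the count.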
Physical properties of proteins
Many websites are available for the analysis of individual proteins, for example:
• ExPASy
• UCSC Proteome Browser
• ProtoNet (HUJI)
The accuracy of the analysis programs is variable. Predictions based on primary amino acid sequence (such as molecular weight prediction) are likely to be more trustworthy. For many other properties (such as phosphorylation sites), experimental evidence may be required rather than prediction algorithms.
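Molecular weight is a good example of a property that follows directly from the primary sequence: it is the sum of the residue masses plus one water molecule for the free termini. The sketch below uses approximate average residue masses in daltons and ignores modifications, disulfide bonds and non-standard residues. The function name is hypothetical, and this is not the implementation of any particular website.

```python
# Approximate average residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.0153  # one water is added for the free N- and C-termini


def molecular_weight(sequence: str) -> float:
    """Estimated average molecular weight (Da) of an unmodified peptide."""
    sequence = sequence.upper()
    unknown = set(sequence) - set(RESIDUE_MASS)
    if unknown:
        raise ValueError(f"non-standard residues: {sorted(unknown)}")
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER_MASS


print(f"{molecular_weight('MKDPAALKRARNTEAARRSSRARKLQRM'):.1f} Da")
```

The calculation is deterministic given the sequence, which is why such predictions tend to be more reliable than, for instance, predicted phosphorylation sites.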
Page 236
Knowledge Based Approach
• IDEA: Find the common properties of a protein family (or any group of proteins of interest) which are unique to the group and different from all the other proteins. Generate a model for the group and predict new members of the family which have similar properties.
Knowledge Based Approach
Basic Steps: 1. Building a Model
• Generate a dataset of proteins with a common function (e.g. DNA binding proteins)
• Generate a control dataset
• Calculate the properties which are characteristic of the protein family you are interested in, for all the proteins in the data (the DNA binding proteins and the non-DNA binding proteins)
• Represent each protein in a set by a vector of calculated features and build a statistical model to split the groups
Basic Steps: 2. Predicting the function of a new protein
• Calculate the properties for a new protein and represent them in a vector
• Predict whether the tested protein belongs to the family
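The two basic steps can be sketched as a toy pipeline. Everything here is an illustrative assumption and not the method used in the lecture: the feature vector holds only two properties (fraction of basic and of acidic residues), the "statistical model" is a simple nearest-centroid rule, the family set reuses the DNA-binding fragments shown earlier, and the control set reuses the three lipocalin fragments from the motif alignment on the assumption that they do not bind DNA. A real application would need far larger datasets, more features and held-out evaluation.

```python
import math


def features(sequence: str):
    """Represent a protein as a vector: (basic fraction, acidic fraction)."""
    n = len(sequence)
    basic = sum(aa in "KR" for aa in sequence) / n
    acidic = sum(aa in "DE" for aa in sequence) / n
    return (basic, acidic)


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return tuple(sum(values) / len(vectors) for values in zip(*vectors))


def build_model(family, control):
    """Step 1: one centroid per group, computed from the feature vectors."""
    return {
        "family": centroid([features(seq) for seq in family]),
        "control": centroid([features(seq) for seq in control]),
    }


def predict(model, sequence: str) -> str:
    """Step 2: assign a new protein to the group with the nearest centroid."""
    vector = features(sequence)
    return min(model, key=lambda label: math.dist(vector, model[label]))


# Toy datasets built from fragments shown on earlier slides (assumption).
FAMILY = [
    "MKDPAALKRARNTEAARRSSRARKLQRM",       # GCN4
    "MERPYACPVESCDRRFSRSDELTRHIRIHT",     # zif268
    "SKVNEAFETLKRCTSSNPNQRLPKVEILRNAIR",  # myoD
]
CONTROL = [
    "MRLLPLVAAATAAFLVVACSSPTPPRGVTVVNNFDAKRYLGTWYEIARFD",  # ecblc
    "MRAIFLILCSVLLNGCLGMPESVKPVSDFELNNYLGKWYEVARLD",       # vc
    "MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKD",     # hsrbp
]

model = build_model(FAMILY, CONTROL)
print(predict(model, "KRKRKRAAAA"))  # hypothetical query sequence
```

With only three sequences per group the centroids are crude, so the output is an illustration of the workflow and not a meaningful prediction.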
TEST CASE: Y14, a protein sequence translated from an ORF (Open Reading Frame) obtained from the complete Drosophila genome
>Y14
PQRSVGWILFVTSIHEEAQEDEIQEKFCDYGEIKNIHLNLDRRTGFSKGYALVEYETHKQALAAKEALNGAEIMGQTIQVDWCFVKGG
Y14 DOES NOT BIND RNA
Database and Tools for protein families and domains
• InterPro - Integrated Resources of Proteins Domains and Functional Sites
• PROSITE - A database of protein families and domains
• BLOCKS - BLOCKS db
• Pfam - Protein families db (HMM derived)
• PRINTS - Protein Motif fingerprint db
• ProDom - Protein domain db (automatically generated)
• PROTOMAP - An automatic hierarchical classification of Swiss-Prot proteins
• SBASE - SBASE domain db
• SMART - Simple Modular Architecture Research Tool
• TIGRFAMs - TIGR protein families db
Key dates
14.12: Lists of suggested projects published
** If you or your partner are working in a biology lab, try to find a relevant project which can help in your research
11-20/1: Presenting a proposed project in small groups
• Title
• Main question
• Major tools you are planning to use to answer the questions
1.3: Project submission
Instructions for the final project
Introduction to Bioinformatics 2009-10
2. Planning your research
After you have described the main question or questions of your project, you should carefully plan your next steps.
A. Make sure you understand the problem and read the necessary background to proceed.
B. Formulate your working plan, step by step.
C. After you have a plan, start by extracting the necessary data and decide on the relevant tools to use in the first step. When running a tool, make sure to summarize the results and extract the relevant information you need to answer your question. It is recommended to save the raw data for your records, but don't present raw data in your final written project. Your initial results should guide you towards your next steps.
D. When you feel you have explored all the tools you can apply to answer your question, summarize and draw conclusions. Remember that NO is also an answer, as long as you are sure it is NO. Also remember that this is a course project, not only a homework exercise.
3. Writing the final project (in pairs)
Background: 2-3 pages
The background should include a description of your question, including the relevant literature. Relevant literature should also include bioinformatics studies that have approached a similar question. Please use common formats for citations.
Goal and Research Plan: 1/2 page
Describe the main objective and the research plan.
Results: 3-5 pages
Describe your results. You can extract the relevant parts from the output of the tool used. Please don't present all the output; if you feel the full output is necessary, add it as an appendix. If possible, summarize your results in figures or tables.
Conclusions: up to 1 page
References: list the references used for your project