
Page 1: Analyzing Local Properties Many local properties are important for the function of your protein –Hydrophobic regions are potential transmembrane domains

Analyzing Local Properties

• Many local properties are important for the function of your protein
– Hydrophobic regions are potential transmembrane domains
– Coiled-coil regions are potential protein-interaction domains
– Hydrophilic stretches are potential loops

• You can discover these regions
– Using sliding-window techniques (easy)
– Using prediction methods such as hidden Markov models (more sophisticated)

Sliding-window Techniques

• Ideal for identifying strong signals
• Very simple methods
– Few artifacts
– Not very sensitive
• Use ProtScale on www.expasy.org
• Make the window the same size as the feature you’re looking for
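The sliding-window idea can be sketched in a few lines of R. This is only an illustration, not the ProtScale implementation: it uses the Kyte-Doolittle hydropathy scale (the slides select the Eisenberg scale on the server), and the function name `sliding_hydropathy` and the toy sequence are hypothetical.

```r
# Kyte-Doolittle hydropathy values, one per standard amino acid
kd <- c(A = 1.8, R = -4.5, N = -3.5, D = -3.5, C = 2.5, Q = -3.5, E = -3.5,
        G = -0.4, H = -3.2, I = 4.5, L = 3.8, K = -3.9, M = 1.9, F = 2.8,
        P = -1.6, S = -0.8, T = -0.7, W = -0.9, Y = -1.3, V = 4.2)

# Mean hydropathy of every window of `width` consecutive residues
# (hypothetical helper, not part of any package)
sliding_hydropathy <- function(sequence, width = 19) {
  residues <- strsplit(toupper(sequence), "")[[1]]
  scores <- unname(kd[residues])        # NA for non-standard letters
  n <- length(scores)
  if (n < width) return(numeric(0))     # sequence shorter than the window
  starts <- seq_len(n - width + 1)
  sapply(starts, function(i) mean(scores[i:(i + width - 1)]))
}

toy <- "MKTLLLAVAVLLLAGIVAVLLSDEEKRRQ"   # made-up sequence for illustration
profile <- sliding_hydropathy(toy, width = 9)
print(round(profile, 2))
```

A window of about 19 residues roughly matches the length of a transmembrane helix, which is why that width is the default here; `plot(profile, type = "l")` would draw the profile.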


www.expasy.org/cgi-bin/protscale.pl


Hphob. / Eisenberg


Using TMHMM

• TMHMM is the best method for predicting transmembrane domains

• TMHMM uses an HMM

• Its principle is very different from that of ProtScale

• TMHMM output is a prediction

Searching for PROSITE Patterns

• Search your protein against PROSITE on ExPASy
– www.expasy.org/tools/scanprosite
• PROSITE motifs are written as patterns
– Short patterns are not very informative by themselves
– They only indicate a possibility
– Combine them with other information to draw a conclusion
• Remember: Not everything is in PROSITE!
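To see why a short pattern only indicates a possibility, it helps to run one by hand. The sketch below translates a simple PROSITE pattern into a regular expression and scans a sequence with base R. The helper `prosite_to_regex` and the toy sequence are hypothetical. The helper only handles separators, `x` wildcards, `{..}` exclusions and `(n)` or `(n,m)` repeats, and it is not how ScanProsite itself works.

```r
# Translate a simple PROSITE pattern into a regular expression
# (hypothetical helper covering only the most common syntax elements)
prosite_to_regex <- function(pattern) {
  rx <- gsub("-", "", pattern, fixed = TRUE)                  # drop separators
  rx <- gsub("\\{([A-Z]+)\\}", "[^\\1]", rx, perl = TRUE)     # {P}   -> [^P]
  rx <- gsub("x", ".", rx, fixed = TRUE)                      # x     -> .
  rx <- gsub("\\(([0-9,]+)\\)", "{\\1}", rx, perl = TRUE)     # (2,4) -> {2,4}
  rx
}

# N-glycosylation site pattern: N-{P}-[ST]-{P}
regex_n <- prosite_to_regex("N-{P}-[ST]-{P}")
print(regex_n)

toy_seq <- "MKNGSAANPTQ"                                      # made-up sequence
hits <- as.integer(gregexpr(regex_n, toy_seq, perl = TRUE)[[1]])
print(hits)   # start positions of matches, -1 if there are none
```

A four-residue pattern like this one matches by chance in many proteins, which is the reason to combine a hit with other evidence. Note that `gregexpr` reports non-overlapping matches only.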


www.expasy.org/tools/scanprosite

P12259


Protein Domains

• Proteins are usually made of domains
• A domain is an autonomous folding unit
• Domains are more than 50 amino acids long
• It’s common to find these together:
– A regulatory domain
– A binding domain
– A catalytic domain


www.ebi.ac.uk/InterProScan


www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi


Secondary Structures

• Helix
– Amino-acid chain that twists like a spring
• Beta strand or extended
– Amino-acid chain that forms a line without twisting
• Random coils
– Amino-acid chain with a structure neither helical nor extended
– Amino-acid loops are usually coils


bioinf.cs.ucl.ac.uk/psipred//?program=psipred



Servers

• www.predictprotein.org

• cubic.bioc.columbia.edu/predictprotein

• www.sdsc.edu/predicprotein

• www.cbi.pku.edu.cn/predictprotein


www.rcsb.org



ncbi.nlm.nih.gov/BLAST


zhanglab.ccmb.med.umich.edu/I-TASSER/



http://zhanglab.ccmb.med.umich.edu/I-TASSER/output/S102840/


R-programming

Introduction

• R is:
– a suite of operators for calculations on arrays, in particular matrices,
– a large, coherent, integrated collection of intermediate tools for interactive data analysis,
– graphical facilities for data analysis and display either directly at the computer or on hardcopy,
– a well-developed programming language which includes conditionals, loops, user-defined recursive functions and input and output facilities.

• The core of R is an interpreted computer language.
– It allows branching and looping as well as modular programming using functions.
– Most of the user-visible functions in R are written in R, calling upon a smaller set of internal primitives.
– It is possible for the user to interface to procedures written in C, C++ or FORTRAN for efficiency, and also to write additional primitives.
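As a small illustration of branching, looping and a user-defined recursive function, here is a minimal sketch. The names `factorial_rec` and `total` are made up for this example.

```r
# A user-defined recursive function with a conditional
factorial_rec <- function(n) {
  if (n <= 1) {
    return(1)                    # base case stops the recursion
  }
  n * factorial_rec(n - 1)       # recursive call
}

# A loop that reuses the function: 1! + 2! + ... + 5!
total <- 0
for (i in 1:5) {
  total <- total + factorial_rec(i)
}
print(total)
```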

R and statistics

o Packaging: a crucial infrastructure to efficiently produce, load and keep consistent software libraries from (many) different sources / authors
o Statistics: most packages deal with statistics and data analysis
o State of the art: many statistical researchers provide their methods as R packages

Data Analysis and Presentation

• The R distribution contains functionality for a large number of statistical procedures.
– linear and generalized linear models
– nonlinear regression models
– time series analysis
– classical parametric and nonparametric tests
– clustering
– smoothing
• R also has a large set of functions which provide a flexible graphical environment for creating various kinds of data presentations.


R as a calculator

> log2(32)

[1] 5

> sqrt(2)

[1] 1.414214

> seq(0, 5, length=6)

[1] 0 1 2 3 4 5

> plot(sin(seq(0, 2*pi, length=100)))

[Plot: sin(seq(0, 2 * pi, length = 100)) against Index; x-axis from 0 to 100, y-axis from -1.0 to 1.0]

Object orientation

Primitive (or: atomic) data types in R are:
• numeric (integer, double, complex)
• character
• logical
• function

Out of these, vectors, arrays and lists can be built.
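The types listed above can be inspected with `class()`. This is a minimal sketch and the variable names are made up. Note that R itself reports a double as "numeric" and treats "complex" as its own class.

```r
# One example value per type, kept in a list so the types stay distinct
values <- list(2L, 2.17, 1+2i, "text", TRUE, sqrt)
types <- sapply(values, class)
print(types)

# Building larger structures from these values
v <- c(1.5, 2, 3)               # vector: all elements share one type
l <- list(1.5, "two", TRUE)     # list: elements may differ in type
print(class(v))
print(class(l))
```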

Object orientation

• Object: a collection of atomic variables and/or other objects that belong together
• Example: a microarray experiment
– probe intensities
– patient data (tissue location, diagnosis, follow-up)
– gene data (sequence, IDs, annotation)

Parlance:
• class: the “abstract” definition of it
• object: a concrete instance
• method: other word for ‘function’
• slot: a component of an object
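The four terms map directly onto R's S4 system. The sketch below is a toy stand-in for the microarray example. The class name `Experiment`, its slots and the generic `meanIntensity` are all hypothetical.

```r
library(methods)

# class: the abstract definition, with two slots
setClass("Experiment",
         representation(intensities = "numeric", diagnosis = "character"))

# object: a concrete instance of the class
e1 <- new("Experiment", intensities = c(120, 340, 560), diagnosis = "tumor")

# slot: a component of the object, accessed with @
print(e1@diagnosis)

# method: a function attached to the class through a generic
setGeneric("meanIntensity", function(object) standardGeneric("meanIntensity"))
setMethod("meanIntensity", "Experiment",
          function(object) mean(object@intensities))
print(meanIntensity(e1))
```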


Object orientation

Advantages:

Encapsulation (can use the objects and methods someone else has written without having to care about the internals)

Generic functions (e.g. plot, print)

Inheritance (hierarchical organization of complexity)

Caveat: Overcomplicated, baroque program architecture…

variables

numeric
> a = 24
> b <- 25
> sqrt(a+b)
[1] 7

character string
> a = "The dog ate my homework"
> sub("dog","cat",a)
[1] "The cat ate my homework"

logical
> a = (1+1==3)
> a
[1] FALSE

variables

> paste("X", "Y")

> paste("X", "Y", sep = " + ")

> paste("Fig", 1:4)

> paste(c("X", "Y"), 1:4, sep = "", collapse = " + ")
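The four calls differ in how `sep` and `collapse` are used. `sep` joins the arguments element by element, shorter vectors are recycled, and `collapse` then joins the resulting elements into one string. The variable names below are made up so that the results can be inspected.

```r
p1 <- paste("X", "Y")                       # default separator is a space
p2 <- paste("X", "Y", sep = " + ")          # custom separator between arguments
p3 <- paste("Fig", 1:4)                     # "Fig" is recycled four times
p4 <- paste(c("X", "Y"), 1:4, sep = "", collapse = " + ")  # recycle, then join

print(p1)
print(p2)
print(p3)
print(p4)
```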

x <- 2.17
y <- as.character(x)
z <- as.numeric(y)

help(as)


vectors, matrices and arrays

• vector: an ordered collection of data of the same type
> a = c(1,2,3)
> a*2
[1] 2 4 6

• Example: the mean spot intensities of all 15488 spots on a chip: a vector of 15488 numbers

• In R, a single number is the special case of a vector with 1 element.

• Other vector types: character strings, logical

vectors, matrices and arrays

• matrix: a rectangular table of data of the same type
example: the expression values for 10000 genes for 30 tissue biopsies: a matrix with 10000 rows and 30 columns.

• array: 3-, 4-, ... dimensional matrix
example: the red and green foreground and background values for 20000 spots on 120 chips: a 4 x 20000 x 120 (3D) array.
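A scaled-down version of the chip example shows how such an array is created and indexed. The sizes (20 spots, 3 chips) and the channel labels are invented for illustration.

```r
# 4 channels x 20 spots x 3 chips, filled with zeros
channels <- c("red_fg", "green_fg", "red_bg", "green_bg")
chips <- array(0, dim = c(4, 20, 3), dimnames = list(channels, NULL, NULL))

# Set and read one value: red foreground, spot 5, chip 2
chips["red_fg", 5, 2] <- 1500
print(chips["red_fg", 5, 2])

print(dim(chips))        # the three dimensions
print(length(chips))     # total number of stored values

# All four channels of chip 1 as an ordinary 4 x 20 matrix
chip1 <- chips[, , 1]
print(dim(chip1))
```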

Lists

• vector: an ordered collection of data of the same type.
> a = c(7,5,1)
> a[2]
[1] 5

• list: an ordered collection of data of arbitrary types.
> doe = list(name="john",age=28,married=F)
> doe$name
[1] "john"
> doe$age
[1] 28

• Typically, vector elements are accessed by their index (an integer), list elements by their name (a character string). But both types support both access methods.
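Both access methods on both types can be tried directly. The named vector `scores` is a made-up example.

```r
# A list accessed by name and by index
doe <- list(name = "john", age = 28, married = FALSE)
print(doe$age)         # by name
print(doe[["age"]])    # by name, as a string
print(doe[[2]])        # by index

# A vector accessed by index and by name
scores <- c(first = 7, second = 5, third = 1)
print(scores[2])           # by index
print(scores["second"])    # by name
```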

Data frames

A data frame is like a spreadsheet.

It is a rectangular table with rows and columns; data within each column has the same type (e.g. number, text, logical), but different columns may have different types.

Example:
> a
      localisation tumorsize progress
XX348     proximal       6.3    FALSE
XX234       distal       8.0     TRUE
XX987     proximal      10.0    FALSE

id <- c("xx348", "xx234", "xx987")
locallization <- c("proximal", "distal", "proximal")
progress <- c(F, T, F)
tumorsize <- c(6.3, 8.0, 10.0)
results <- data.frame(id, locallization, tumorsize, progress)

> results
     id locallization tumorsize progress
1 xx348      proximal       6.3    FALSE
2 xx234        distal       8.0     TRUE
3 xx987      proximal      10.0    FALSE


results<-edit(results)

> summary(results)
     id    locallization   tumorsize       progress
 xx234:1   distal  :1    Min.   : 6.30   Mode :logical
 xx348:1   proximal:2    1st Qu.: 7.15   FALSE:2
 xx987:1                 Median : 8.00   TRUE :1
                         Mean   : 8.10   NA's :0
                         3rd Qu.: 9.00
                         Max.   :10.00

> x <- summary(results)
> x
     id    locallization   tumorsize       progress
 xx234:1   distal  :1    Min.   : 6.30   Mode :logical
 xx348:1   proximal:2    1st Qu.: 7.15   FALSE:2
 xx987:1                 Median : 8.00   TRUE :1
                         Mean   : 8.10   NA's :0
                         3rd Qu.: 9.00
                         Max.   :10.00

Subsetting

Individual elements of a vector, matrix, array or data frame are accessed with “[ ]” by specifying their index, or their name

> results
      localisation tumorsize progress
XX348     proximal       6.3        0
XX234       distal       8.0        1
XX987     proximal      10.0        0
> results[3, 2]
[1] 10
> results["XX987", "tumorsize"]
[1] 10
> results["XX987",]
      localisation tumorsize progress
XX987     proximal        10        0

Subsetting

> results
      localisation tumorsize progress
XX348     proximal       6.3        0
XX234       distal       8.0        1
XX987     proximal      10.0        0
> results[c(1,3),]                                # subset rows by a vector of indices
      localisation tumorsize progress
XX348     proximal       6.3        0
XX987     proximal      10.0        0
> results[c(T,F,T),]                              # subset rows by a logical vector
      localisation tumorsize progress
XX348     proximal       6.3        0
XX987     proximal      10.0        0
> results$localisation                            # subset a column
[1] "proximal" "distal" "proximal"
> results$localisation=="proximal"                # comparison resulting in logical vector
[1] TRUE FALSE TRUE
> results[ results$localisation=="proximal", ]    # subset the selected rows
      localisation tumorsize progress
XX348     proximal       6.3        0
XX987     proximal      10.0        0
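The slides do not show how this row-named version of `results` was built. One possible construction, given here as an assumption, uses the `row.names` argument of `data.frame()`. With it, the subsetting calls above can be reproduced.

```r
# Assumed construction of the row-named data frame from the slides
results <- data.frame(
  localisation = c("proximal", "distal", "proximal"),
  tumorsize    = c(6.3, 8.0, 10.0),
  progress     = c(0, 1, 0),
  row.names    = c("XX348", "XX234", "XX987")
)

by_index <- results[3, 2]                      # row 3, column 2
by_name  <- results["XX987", "tumorsize"]      # same cell, by names
is_prox  <- results$localisation == "proximal" # logical vector, one per row
proximal <- results[is_prox, ]                 # keep only the TRUE rows

print(by_index)
print(by_name)
print(proximal)
```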


results[2,]

results[2,2]

results[1:3,]

results[c(1,3),]

results[c(T,F,T),]

x <- summary(results)
x
x[2,2]


x = c(1, 1, 2, 3, 5, 8)

x[c(TRUE, TRUE, FALSE, FALSE, TRUE, TRUE)]

x[c(TRUE, FALSE)]

x == 1

x[x == 1]

x[x%%2 == 0]

y = c(1, 2, 3)

y[]=3

y
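The same calls with their results stored show what each one does. A logical index keeps the positions that are TRUE, a short logical index is recycled along the vector, and an empty index `[]` addresses every element. The names `r1` to `r4` are only for illustration.

```r
x <- c(1, 1, 2, 3, 5, 8)

r1 <- x[c(TRUE, TRUE, FALSE, FALSE, TRUE, TRUE)]  # keep positions 1, 2, 5, 6
r2 <- x[c(TRUE, FALSE)]     # recycled: keeps positions 1, 3, 5
r3 <- x[x == 1]             # elements equal to 1
r4 <- x[x %% 2 == 0]        # even elements

y <- c(1, 2, 3)
y[] <- 3                    # assign to every element, keeping the length

print(r1); print(r2); print(r3); print(r4); print(y)
```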

Matrix

• a matrix is a vector with an additional attribute (dim) that defines the number of columns and rows
• only one mode (numeric, character, complex, or logical) allowed
• can be created using matrix()
x <- matrix(data=0, nrow=2, ncol=2)
or
x <- matrix(0, 2, 2)

Data Frame

• several modes allowed within a single data frame
• can be created using data.frame()
L <- LETTERS[1:4]   # A B C D
x <- 1:4            # 1 2 3 4
data.frame(x, L)    # create data frame
• attach() and detach()
– the database is attached to the R search path so that the database is searched by R when it is evaluating a variable.
– objects in the database can be accessed by simply giving their names
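A short sketch of `attach()` and `detach()` in use. The data frame `patients` and its columns are made up. The last line uses `with()`, a base R alternative that avoids changing the search path.

```r
patients <- data.frame(tumorsize = c(6.3, 8.0, 10.0),
                       progress  = c(FALSE, TRUE, FALSE))

attach(patients)                  # columns become visible by name
mean_size <- mean(tumorsize)      # no need to write patients$tumorsize
detach(patients)                  # remove the data frame from the search path

print(mean_size)

# Same computation without touching the search path
mean_size_with <- with(patients, mean(tumorsize))
```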

a = matrix(1:9, ncol = 3, nrow = 3)
a

b = matrix(c(TRUE, FALSE, TRUE), ncol = 3, nrow = 3)
b

x = 1:10
y = 11:20
z = matrix(c(x,y))
z

z = matrix(c(x,y), nrow=2)
z

z = matrix(c(x,y), nrow=4)
z

R code

max(z)
min(z)
length(z)
mean(z)
sd(z)
sum(z)

index = c(15,27,34,10,9)
welcome = c(13,26,30,10,7)
paper = c(2,1,3,0,1)
days = c("mon", "tues", "wed", "thurs", "fri")
filenames = c("index.html", "welcome.png", "paper.pdf")
downloads = matrix(c(index,welcome,paper), nrow=5, dimnames=list(days,filenames))
downloads

filesizes = c(1624, 23172, 1234065)
downloads %*% filesizes

image(as.matrix(downloads))


Factors

A character string can contain arbitrary text. Sometimes it is useful to use a limited vocabulary, with a small number of allowed words. A factor is a variable that can only take such a limited number of values, which are called levels.


expression<-factor(c("over","under","over","unchanged","under","under"))

levels(expression)
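A few more calls show what a factor stores: the levels sorted alphabetically by default, an integer code per element, and counts per level. The variable names `lv`, `codes` and `counts` are made up.

```r
expression <- factor(c("over", "under", "over", "unchanged", "under", "under"))

lv     <- levels(expression)        # the allowed values
codes  <- as.integer(expression)    # position of each element's level in lv
counts <- table(expression)         # number of elements per level

print(lv)
print(codes)
print(counts)
```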

protein <- list("glucose oxidase", "1CF3", 63355)
protein

protein <- list(name="glucose oxidase", accession="1CF3", weight=63355)

x <- c(16614, 50660, 6066, 6118)
protein$GOIDs <- x

protein


class(protein)

length(protein)

attributes(protein)

Working directory

setwd("D:/data")

x<-read.table("profiles.csv", sep=",", header=TRUE)


x<-read.table("http://www.bixsolutions.net/profiles.csv", sep=",", header=TRUE)

matplot(x, type="l")

matplot(x, type="l", xlab="fraction", ylab="quantity", col=1:6, lty=1:5, lwd=2)

lty: line type
lwd: line width

xmax <- apply(x, 2, max)
xmax

ymax <- apply(x, 1, max)
ymax

Apply the max function on columns (2) or rows (1) of matrix x
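On a matrix small enough to check by eye, the effect of the margin argument is easy to see. The matrix `m` below is a made-up example, not the profiles data.

```r
m <- matrix(1:6, nrow = 2)      # rows: (1 3 5) and (2 4 6)

col_max <- apply(m, 2, max)     # margin 2: one result per column
row_max <- apply(m, 1, max)     # margin 1: one result per row

print(m)
print(col_max)
print(row_max)
```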

# Running mean: element i is the mean of x[1], ..., x[i]
cummean = function(x) {
  n = length(x)
  z = c(1:n)
  y = cumsum(x)
  y = y/z
  return(y)
}

n = 10000
z = rnorm(n)
x = seq(1,n,1)
y = cummean(z)
X11()    # opens a new graphics window on X11 systems
plot(x,y,type= 'l',main= 'Convergence Plot')