prorepeat.bioinformatics.nl ProRepeat a comprehensive directory of exact tandem repeats in proteins


Page 1: Prorepeat.bioinformatics.nl ProRepeat a comprehensive directory of exact tandem repeats in proteins

prorepeat.bioinformatics.nl

ProRepeat a comprehensive directory of exact tandem repeats in proteins

Page 2

www.bioinformatics.nl

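PolyQ and neurodegenerative diseases

9 diseases caused by polyQ repeats:
- HD (Huntington's disease)
- DRPLA (dentatorubral-pallidoluysian atrophy)
- SCA (spinocerebellar ataxia) types 1, 2, 3, 6, 7, 17
- Kennedy’s disease (SBMA)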

Page 3

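Androgen receptor (AR)

[Figure: domain layout of the AR transcription factor from the N-terminus (NH3-) to the C-terminus (-COOH): transcriptional regulation, DNA binding, and hormone binding domains, with labels T1, T2, T3 and Region 1, Region 2, Region 3]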

9-35 residues, average of 20-25 depending on ethnic origin

PolyQ tract length has important consequences:
■ shorter tracts: prostate cancer susceptibility
■ longer tracts: feminization syndromes
■ over 40 residues: SBMA (spinal and bulbar muscular atrophy), or Kennedy’s disease

Page 4

PolyQ in AR

Collection of polyQ repeats:
- 792 human individuals, available from earlier study (Edwards, 1992)
- 26 armadillo individuals, sequenced by CP
- 77 mammals and marsupials, from protein database

Céline Poux, RU

Page 5

What about repeats in other proteins?

ProRepeat database
- Data sources: UniProt and RefSeq
- Limited to exact tandem repeats
- Standard, linear-time suffix tree algorithm
- Stored in Oracle 10g
- Interface in PHP5

Minimum number of repetitions per unit length:

unit length   repetitions
1             ≥ 5
2             ≥ 4
3             ≥ 3
4 .. N        ≥ 2

Maarten van den Bosch, WUR
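The minimum-repetition thresholds listed above can be illustrated with a short sketch. It uses a naive scan, not the linear-time suffix tree algorithm mentioned on the slide, and the names `min_repetitions`, `is_periodic`, `find_exact_tandem_repeats`, and the `max_unit` cap are assumptions made for this example only.

```python
def min_repetitions(unit_length: int) -> int:
    """Minimum repetitions for a unit of this length (thresholds from the slide)."""
    return {1: 5, 2: 4, 3: 3}.get(unit_length, 2)  # unit lengths 4..N need >= 2


def is_periodic(unit: str) -> bool:
    """True if the unit is itself a repetition of a shorter unit, e.g. 'DEDE'."""
    return len(unit) > 1 and unit in (unit + unit)[1:-1]


def find_exact_tandem_repeats(seq: str, max_unit: int = 10):
    """Return (start, unit, repetitions) tuples; start is 0-based.

    Naive scan for illustration only (not a suffix tree).
    max_unit is a hypothetical cap on the unit length.
    """
    hits = []
    for unit_len in range(1, max_unit + 1):
        i = 0
        # A repeat needs at least two copies of the unit to fit.
        while i + 2 * unit_len <= len(seq):
            unit = seq[i:i + unit_len]
            reps = 1
            # Count how many consecutive copies of the unit follow.
            while seq[i + reps * unit_len:i + (reps + 1) * unit_len] == unit:
                reps += 1
            if reps >= min_repetitions(unit_len) and not is_periodic(unit):
                hits.append((i, unit, reps))
                i += reps * unit_len  # jump past the reported repeat
            else:
                i += 1
    return hits


if __name__ == "__main__":
    for start, unit, reps in find_exact_tandem_repeats("MAQQQQQQLDEDEDEDEK"):
        print(start, unit, reps)
```

Units that are themselves repetitions of a shorter unit (such as `DEDE`) are skipped here so that the same stretch is not reported twice; whether ProRepeat applies the same rule is not stated on the slide.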

Page 6

Simple query syntax:

e.g. “Q” or “DE”

DE is equivalent to ED; DEF is equivalent to EFD and FDE

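One way to read this equivalence is that repeat units are compared up to cyclic rotation: a tandem repeat of DE read one position later is a tandem repeat of ED. The sketch below normalizes a unit to its lexicographically smallest rotation. The function names are hypothetical, and the slides do not say how ProRepeat implements the comparison.

```python
def canonical_unit(unit: str) -> str:
    """Return the lexicographically smallest cyclic rotation of a repeat unit."""
    rotations = [unit[i:] + unit[:i] for i in range(len(unit))]
    return min(rotations) if rotations else unit


def same_repeat_unit(a: str, b: str) -> bool:
    """True if a and b are cyclic rotations of each other."""
    return len(a) == len(b) and canonical_unit(a) == canonical_unit(b)


if __name__ == "__main__":
    print(canonical_unit("ED"))            # rotation of DE
    print(same_repeat_unit("DEF", "EFD"))  # rotations of each other
    print(same_repeat_unit("DEF", "DFE"))  # same residues, different order
```

Under this reading, DFE is not equivalent to DEF, because no rotation of one yields the other.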

Page 7

Or use ProSite syntax:

e.g. “[DE]-{P}-X(0,1).”

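To show what the example pattern means, the sketch below translates a PROSITE-style pattern into a Python regular expression: `[DE]` is D or E, `{P}` is any residue except P, and `X(0,1)` is zero or one residue of any kind. This is a simplified translation written for illustration (it does not handle every PROSITE feature, for instance X or an anchor inside brackets); `prosite_to_regex` is a hypothetical helper, not part of ProRepeat.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE-style pattern into a Python regex string."""
    pattern = pattern.strip().rstrip(".")  # a trailing period ends the pattern
    parts = []
    for element in pattern.split("-"):     # elements are separated by '-'
        element = element.replace("{", "[^").replace("}", "]")  # {P} -> [^P]
        element = element.replace("(", "{").replace(")", "}")   # X(0,1) -> X{0,1}
        element = re.sub(r"[xX]", ".", element)                 # X -> any residue
        element = element.replace("<", "^").replace(">", "$")   # terminus anchors
        parts.append(element)
    return "".join(parts)


if __name__ == "__main__":
    regex = prosite_to_regex("[DE]-{P}-X(0,1).")
    print(regex)
    print(bool(re.fullmatch(regex, "DA")))  # D, then a non-P residue
    print(bool(re.fullmatch(regex, "DP")))  # P is excluded at position 2
```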

Page 8

Taxonomic distributions of hits

Page 9

Page 10

Sorting/grouping options

- Identifier
- Repeat unit
- Repetitions
- Unit length
- Length
- Start location
- End location
- Protein
- Taxonomy
- Ontology

Page 11

Link to DNA data

- DNA coding sequences of available repeats also stored in the database
- Extracted from EMBL and/or RefSeq

Hong Luo, WUR
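As a rough illustration of how a protein repeat relates to its coding DNA, the sketch below maps a 1-based, inclusive protein span to the matching span in a coding sequence. It assumes a spliced CDS that starts at the first codon of the protein and uses three nucleotides per residue; the function names and this coordinate convention are assumptions, not a description of the ProRepeat pipeline.

```python
def repeat_cds_span(start_aa: int, end_aa: int) -> tuple[int, int]:
    """Map a 1-based, inclusive protein span to a 1-based, inclusive CDS span."""
    if start_aa < 1 or end_aa < start_aa:
        raise ValueError("expected 1 <= start_aa <= end_aa")
    return 3 * (start_aa - 1) + 1, 3 * end_aa


def extract_repeat_dna(cds: str, start_aa: int, end_aa: int) -> str:
    """Return the nucleotides encoding protein residues start_aa..end_aa."""
    nt_start, nt_end = repeat_cds_span(start_aa, end_aa)
    if nt_end > len(cds):
        raise ValueError("repeat extends beyond the coding sequence")
    return cds[nt_start - 1:nt_end]  # convert to 0-based slicing


if __name__ == "__main__":
    # Toy CDS: ATG CAG CAG CAA CAG TAA encodes M Q Q Q Q followed by a stop codon.
    print(extract_repeat_dna("ATGCAGCAGCAACAGTAA", 2, 5))
```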

Page 12

Link to DNA data / errors

Approximately 3% of corresponding nucleotide sequences cannot be retrieved

Errors caused by:
- No links to nucleotide database (35%)
  • NO_ANNOTATED_CDS
  • No EMBL links
- Annotation errors in the nucleotide database (65%)

Page 13

Number of different units per unit size per proteome

[Figure: number of different units (y-axis, 0 to 900) versus unit length (x-axis) for the proteomes of H. sapiens, A. thaliana, C. elegans, S. cerevisiae, P. troglodytes, G. gallus, R. norvegicus, M. musculus, and E. coli]

Guido Kappé, RU

Page 14

Single amino acid (SAA) repeat length distribution in Homo sapiens

[Figure: percentage (y-axis, 0% to 100%) per amino acid (legend A to Z) versus total SAA repeat length in aa (x-axis, 5 to >20); S, Q, P, G, E, A, and T are labeled in the plot]

Page 15

Amino acid distribution Homo sapiens

[Figure: percentage (y-axis, 0 to 30) per amino acid (x-axis, A to Z) for three series: All prot. - Rep., Rep. - SAA, and SAA]

Page 16

Amino acid distribution Arabidopsis thaliana

[Figure: percentage (y-axis, 0 to 30) per amino acid (x-axis, A to Z) for three series: All prot. - Rep., Rep. - SAA, and SAA]

Page 17

Current work

- Annotation of repeats versus function
- Adding imperfect tandem repeats, a.k.a. approximate tandem repeats (ATR), to the database
- Offering remote access via web services (WSDL and BioMoby)
- Expansion of the analysis capabilities of the interface

Page 18

PolyQ in AR (reprise)

- Impure tracts (other codons mainly CAA, CCG, and CGG) are longer and more variable than pure CAG tracts
- Presence of other codons is better explained by codon duplication than by multiple point mutations
  • interrupting codons are part of the elongation process, rather than hampering its dynamics as proposed previously
- Negative correlation between lengths of the different CAG tracts
  • maximal expansion length that the protein can handle without being deleterious

Céline Poux, RU
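The distinction between pure and impure tracts can be made concrete with a small sketch that splits a coding tract into codons and lists every codon other than CAG. The function names and the toy sequences are made up for illustration; this is not the analysis performed in the study.

```python
def split_codons(dna: str) -> list[str]:
    """Split an in-frame DNA string into codons."""
    dna = dna.upper()
    if len(dna) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    return [dna[i:i + 3] for i in range(0, len(dna), 3)]


def tract_summary(dna: str) -> dict:
    """Report tract length in codons and any codons that interrupt the CAG run."""
    codons = split_codons(dna)
    # 0-based codon index and identity of every non-CAG codon
    interruptions = [(i, c) for i, c in enumerate(codons) if c != "CAG"]
    return {
        "codons": len(codons),
        "pure": not interruptions,
        "interruptions": interruptions,
    }


if __name__ == "__main__":
    print(tract_summary("CAGCAGCAGCAG"))  # pure CAG tract
    print(tract_summary("CAGCAGCAACAG"))  # tract interrupted by CAA
```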

Page 19

Acknowledgements

Wageningen University and Research Centre
- Maarten van den Bosch
- Hong Luo
- Mark Kramer
- Harm Nijveen

Radboud University, Nijmegen
- Guido Kappé
- Céline Poux
- Wilfried W. de Jong

This work was supported in part by project grants from NWO/BMI (GK, CP) and the NBIC/BioAssist program (HN)

Page 20

prorepeat.bioinformatics.nl

Thank you for your attention!

See also our posters on phylogenetic domain visualisation (TreeDomViewer) and microarray (re)annotation at the ISMB

Post-doc positions available: contact [email protected] or [email protected]