DBBMCESMG
G. Paolella
CEINGE
CSI
INTERNET
CEINGE
University Campus
CAPRI
Image restoration and analysis
Comparative genomics
[Figure: DG-CST workflow. Labels: H. sapiens, M. musculus, alignment and identification, CST, annotation, DG-CST, LocusLink, EST, ENSEMBL, PROGR.]
Francesco Salvatore 0503
Research and Services in Bioinformatics
Research subjects
- Comparative genomics
  - DG-CST
  - KinWeb
- Non-coding RNAs
  - Bacterial
  - Eukaryotic
- Cell motility
Conserved Sequence Tags (CST)
DG-CST
DG-CST DB
Genome browser
KinWeb
[Figure: KinWeb DB, panels (a) to (e)]
[Figure: three genes (a, b, c) with their CSTs. Domain labels: Ig-I, Ig-II, Ig-III, TM, Tyr kinase; Ser-Thr kinase. CST groups I, II, III]
Multistep process of comparative sequence analysis
- Select target genes: about 1000 genes involved in genetic disease
- Identify orthologs: based on a combination of ENSEMBL and NCBI information and/or sequence alignment
- Preprocess: mask repetitive sequences with RepeatMasker
- Compare: find similar stretches using BLASTZ
- Postprocess and select CSTs: thresholds of identity >= 70% and length >= 100 bp
- Insert CSTs into DB: automatic insertion of identified CSTs and preliminary annotation
- Automatic CST annotation: based on available resources and according to different criteria
- Identify CST subpopulations: analysis of annotation results
- Test hypotheses on functional roles: according to literature and experimental data
FINDING CSTs
- Selection of homologous chromosome regions from the human and mouse genomes.
- Masking of repetitive elements with RepeatMasker, to reduce the noise that repeated sequences inevitably introduce.
- Comparison of the selected regions using BLASTZ, a program based on a local similarity algorithm.
- Selection of the definitive set of CSTs based on specified thresholds (identity >= 70%; length >= 100 bp) using StrongHits.
- Insertion of the selected CSTs into the DB and extensive annotation for:
  - type (i.e. intergenic, exonic, etc.) according to Ensembl
  - coding capability according to Ensembl
  - distances from other genes and coding regions
  - Log Score, calculated according to the UCSC comparison of the human and mouse genomes
- Further analysis of the dataset, looking for subpopulations sharing specific characteristics, using different programs, such as:
  - BLAST of CSTs vs EST, human and other species genomes
  - a program for calculation of the CPS (Coding Potential Score)
  - RNA structure prediction programs
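The threshold step (identity >= 70%, length >= 100 bp) can be illustrated with a minimal sketch. The `Alignment` record and the `select_csts` function are hypothetical names introduced here for illustration; they do not reproduce StrongHits or the BLASTZ output format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    """Hypothetical record for one human/mouse local alignment."""
    start: int        # start on the human sequence (0-based)
    end: int          # end on the human sequence (exclusive)
    identity: float   # percent identity over the aligned stretch

    @property
    def length(self) -> int:
        return self.end - self.start


def select_csts(alignments, min_identity=70.0, min_length=100):
    """Keep only alignments that satisfy both thresholds."""
    return [
        a for a in alignments
        if a.identity >= min_identity and a.length >= min_length
    ]


hits = [
    Alignment(0, 150, 82.5),     # passes both thresholds
    Alignment(200, 260, 95.0),   # too short (60 bp)
    Alignment(300, 500, 65.0),   # identity too low
    Alignment(600, 700, 70.0),   # exactly on both thresholds, kept
]
csts = select_csts(hits)
print([(c.start, c.end) for c in csts])
```

Both comparisons are inclusive, matching the ">=" used in the thresholds.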
Pipeline
Annotation is carried out by a pipeline that runs through the various phases without requiring human assistance. CPU-intensive tasks, such as BLAST homology searches, are spread across several collaborating servers using a system specifically developed for load distribution and monitoring.
CST ANNOTATION
- ENSEMBL gene and gene structure data:
  - chromosome position
  - type (i.e. intergenic, intronic, exonic, etc.)
  - coding %
  - closest gene and relative distances
  - ...
- UCSC Log Score data:
  - Max L-Score
  - Avg L-Score
  - ...
- BLAST, matches with:
  - EST
  - other genomes
  - proteins (BlastX)
- RepeatMasker:
  - repeats type
  - repeats %
- CPS: Coding Potential Score
- PHP scripts:
  - redundancy
  - overlapping
  - ...
[Diagram: the pipeline units exchange data with the DB and with remote servers]
Pipeline units
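As an illustration of the positional annotation (type and overlap with gene structure), here is a minimal sketch that works on plain coordinate intervals. The function names and the three-way classification rule are assumptions made for this example; the real pipeline takes these values from Ensembl data.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Number of bases shared by two half-open intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def exonic_percent(cst, exons):
    """Percent of the CST covered by exons (exons assumed non-overlapping)."""
    start, end = cst
    covered = sum(overlap(start, end, s, e) for s, e in exons)
    return 100.0 * covered / (end - start)


def classify(cst, gene, exons):
    """Assumed rule: intergenic, exonic (touches any exon) or intronic."""
    if overlap(*cst, *gene) == 0:
        return "intergenic"
    if exonic_percent(cst, exons) > 0:
        return "exonic"
    return "intronic"


gene = (0, 1000)
exons = [(150, 250), (600, 700)]
print(classify((100, 200), gene, exons), exonic_percent((100, 200), exons))
print(classify((300, 400), gene, exons))
print(classify((2000, 2100), gene, exons))
```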
Non coding RNAs
[Diagram: DNA is transcribed into mRNA, which is translated into proteins, with reverse transcription back to DNA; transcription and maturation also yield ncRNA classes: tRNA, rRNA, antisense RNA, miRNA, snoRNA, snRNA, self-splicing introns]
- Imprinting: H19, AIR
- X inactivation: XIST
- Chromatin structure dynamics: small RNAs
- DNA demethylation: KHPS1a
Bacillus anthracis Ames (1)
Bacillus halodurans C-125 (2)
Bacillus subtilis 168 (3)
Clostridium perfringens (4)
Clostridium tetani E88 (5)
Enterococcus faecalis V583 (6)
Lactobacillus johnsonii NCC 533 (7)
Listeria innocua (8)
Listeria monocytogenes EGD-e (9)
Staphylococcus aureus Mu50 (10)
Streptococcus pneumoniae TIGR4 (11)
Streptococcus pyogenes MGAS315 (12)
Mycoplasma genitalium (13)
Mycoplasma pneumoniae M129 (14)
Ureaplasma urealyticum (15)
Corynebacterium diphtheriae strain NCTC13129 (16)
Mycobacterium leprae (17)
Mycobacterium tuberculosis H37Rv (18)
Treponema pallidum (19)
Chlamydia pneumoniae AR39 (20)
Chlamydia trachomatis serovar D (21)
Campylobacter jejuni NCTC 11168 (22)
Helicobacter pylori 26695 (23)
Brucella melitensis (24)
Rickettsia conorii (25)
Rickettsia prowazekii Madrid E (26)
Bordetella bronchiseptica RB50 (27)
Bordetella parapertussis 12822 (28)
Bordetella pertussis (29)
Neisseria meningitidis MC58 (30)
Buchnera sp. APS (31)
Escherichia coli K12-MG1655 (32)
Escherichia coli O157:H7 (EDL933) (33)
Haemophilus influenzae KW20 (34)
Pasteurella multocida (35)
Pseudomonas aeruginosa PA01 (36)
Pseudomonas putida KT2440 (37)
Salmonella enterica serovar Typhi CT-18 (38)
Salmonella typhimurium LT2 SGSC1412 (39)
Vibrio cholerae El Tor N16961 chr1 (40)
Yersinia pestis CO92 (41)
Aquifex aeolicus VF5 (42)
Bacterial SLSs
[Chart: number of SLSs (SLS Num, axis from 0 to 250,000) for each species listed above, broken down by type: genic, antigenic, spanning, intergenic]
Genome (SLS count)               ClusterDS  Rawclusters  Elements  Avg_identity  Type        %P<=0.001  Known
Brucella M 1 (SLS = 44,719)      1          1            47        58            intergenic  43.8       Bru-1
                                 1          4            19        98            intergenic  62.5       Bru-1
                                 1          5            17        59            antigenic   11.8       Bru-1
                                 1          9            8         68            intergenic  12.5       Bru-1
                                 1          6            21        87            intergenic  69.2       Bru-2
                                 1          7            14        98            intergenic  78.9       Bru-1
                                 2          2            31        100           mixed       50.7       Bru-2
                                 2          3            24        98            mixed       71.4       Bru-2
                                 3          8            7         95            mixed       66.7       new Family
Ent. Faecalis (SLS = 40,991)     1          446          23        91            intergenic  43.2       efa-1
                                 1          447          36        85            intergenic  88.9       efa-1
                                 1          448          39        88            intergenic  95.9       efa-1
                                 1          451          7         81            intergenic  22.2       efa-1
                                 1          449          21        93            intergenic  43.2       efa-1
                                 2          452          9         100           antigenic   0          new Family
                                 3          450          7         95            antigenic   24         new Family
Bac. Anthracis (SLS = 65,220)    1          453          12        80            intergenic  48         bcr-1
                                 1          454          10        81            intergenic  66.7       bcr-1
                                 1          455          7         85            mixed       37.5       bcr-1
                                 2          456          9         59            intergenic  0          new Family
Vibrio Cholerae 1 (SLS = 45,824) 1          419          51        76            intergenic  50         TPP riboswitch
                                 1          420          33        66            intergenic  12.8       TPP riboswitch
                                 2          421          7         85            intergenic  0          new Family
                                 2          424          7         99            intergenic  0          new Family
                                 3          422          8         100           intergenic  50         new Family
                                 4          423          8         100           antigenic   0          new Family
                                 5          425          8         100           intergenic  18.2       new Family
SLS Families
Position in the genome
Position
Alignment
RNAz: P = 0.99
PFOLD
Secondary structures
Processing time

Time (sec), for a single sequence
Operation                  SLSs Proj  CSTs Proj
Sequences                  1          1
BLAST vs self              1          1
BLAST vs hum EST           -          15
BLAST vs mus EST           -          12
BLAST vs Hum Genome        -          13
BLAST vs Mus/Rat Genome    -          10
BLAST vs Small Genomes     -          6
RepeatMasker               3          5
Mfold                      2          -
RandFold                   30         30
RNA-z                      0.5        0.5

Time (days), for the whole project
Operation                  SLSs Proj  CSTs Proj
Sequences                  2469003    103340
BLAST vs self              28.6       1.2
BLAST vs hum EST           -          17.9
BLAST vs mus EST           -          14.4
BLAST vs Hum Genome        -          15.5
BLAST vs Mus/Rat Genome    -          12
BLAST vs Small Genomes     -          7.2
RepeatMasker               85.7       6
Mfold                      57.2       -
RandFold                   857.3      35.9
RNA-z                      14.3       0.6

Time (months), for the whole project
Operation                  SLSs Proj  CSTs Proj
Sequences                  2469003    103340
ALL                        33.6       3.6
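The day figures follow from the per-sequence times by simple arithmetic: seconds per sequence, times the number of sequences, divided by 86,400 seconds per day. A short sketch that reproduces a few of the table entries; the function name is hypothetical and serial execution on a single processor is assumed.

```python
SECONDS_PER_DAY = 86_400

SLS_COUNT = 2_469_003   # sequences in the SLSs project
CST_COUNT = 103_340     # sequences in the CSTs project


def projected_days(seconds_per_sequence, n_sequences):
    """Serial processing time in days for the whole dataset."""
    return seconds_per_sequence * n_sequences / SECONDS_PER_DAY


# Per-sequence times (sec) taken from the table above.
print(round(projected_days(1, SLS_COUNT), 1))    # BLAST vs self, SLSs
print(round(projected_days(30, SLS_COUNT), 1))   # RandFold, SLSs
print(round(projected_days(15, CST_COUNT), 1))   # BLAST vs hum EST, CSTs
```

The results agree with the 28.6, 857.3 and 17.9 days reported in the table, which is what motivates running these steps on a cluster.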
4x14x2=112 procs 2.8 GHz
4x14x2=112 GB RAM
2 GB/s per card, 4 GB/s aggregate
Cluster
Bioinfo portal
Bioinformatics services for research already in operation
• About 100 databases of biological interest accessible through SRS (nucleotide sequences, genomes, mutations, hereditary diseases, enzymes, etc.)
• Integrated system for the analysis of biological data, with over 150 programs for sequence analysis, evolutionary models, the study of mutations, proteins, etc.
• Databases built within research projects (DG-CST, KinWEB, etc.)
• Systems for managing experimental data (biological samples, sequences, microscopy images, etc.)
Research and services
• CEINGE
• DBBM
• IIGB
• BIOGEM
• Faculty of Medicine
• Faculty of Biotechnology
• Other faculties
• Public (limited access)
Services: who has access?
[Diagram: web server with CAPRI, SRS and PISE; programs: EMBOSS, FASTA, BLAST and others; data: user data, DB, primary remote databases, ENSEMBL]
Services organization
Graphic interface to programs
CAPRI
Various operations in a row: Complement -> Translation -> Isoelectric point of the resulting protein.
[Diagram: DNA -> Complement -> Translation -> Isoelectric point]
CAPRI workflow
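The chained workflow (complement, then translation, then isoelectric point) can be sketched in a few lines. This is an illustration only, not CAPRI's implementation: CAPRI drives external programs, whereas the functions below are self-contained stand-ins. The pKa values are illustrative assumptions, and the isoelectric point is found by bisection on the net charge.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(_BASES, repeat=3), _AMINO)}

# Illustrative pKa values (an assumption; real tools ship their own tables).
PKA_POSITIVE = {"Nterm": 8.6, "K": 10.8, "R": 12.5, "H": 6.5}
PKA_NEGATIVE = {"Cterm": 3.6, "D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}


def complement(dna):
    """Base-by-base complement (not reversed), as named on the slide."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))


def translate(dna):
    """Translate frame 1 until the first stop codon or the end."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def net_charge(protein, ph):
    """Net charge at a given pH from the Henderson-Hasselbalch relation."""
    counts = {"Nterm": 1, "Cterm": 1}
    for aa in protein:
        counts[aa] = counts.get(aa, 0) + 1
    pos = sum(counts.get(g, 0) / (1 + 10 ** (ph - pka)) for g, pka in PKA_POSITIVE.items())
    neg = sum(counts.get(g, 0) / (1 + 10 ** (pka - ph)) for g, pka in PKA_NEGATIVE.items())
    return pos - neg


def isoelectric_point(protein, tol=1e-4):
    """Bisection on pH: the net charge decreases monotonically with pH."""
    low, high = 0.0, 14.0
    while high - low > tol:
        mid = (low + high) / 2
        if net_charge(protein, mid) > 0:
            low = mid
        else:
            high = mid
    return (low + high) / 2


def run_workflow(dna):
    """Complement -> Translation -> Isoelectric point."""
    protein = translate(complement(dna))
    return protein, isoelectric_point(protein)


protein, pi = run_workflow("TACAAACTTCTAGTGATT")
print(protein, round(pi, 2))
```

Each step takes the previous step's output as its input, which is the property the workflow slide illustrates.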
[Diagram: CAPRI client/server architecture]
- Components: CGI, Base Obj., Tasks Obj., ProgramObject, Menu Table, Disk Buffering, server disks
- Plugin objects: Pise, CLI Simple, CURL, SOAP, JEMBOSS
- Programs: BLAST, FASTA, EMBOSS, HMMer, Genscan, ClustalW, Phylip
- Legend, relations between objects: use, inheritance
- Legend, arrows: program execution, data transfer, temporal relation
CAPRI architecture
[Diagram: access servers connected to the cluster nodes]
For each user request, a process is launched on a different node
Distributed execution
[Diagram: web application server, broker, cluster manager, DB server (relational DB) and cluster]
1 – Run a command
2 – Request a node IP
3 – Request the status of the cluster
4 – Search for the best resource and return the corresponding node IP
5 – Launch the command on the node
6 – Return the result
Cluster activity
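The six-step exchange can be sketched as follows. All class and function names are hypothetical, and "best resource" is assumed here to mean the available node with the lowest load; the slides do not say which metric the real broker uses.

```python
from dataclasses import dataclass


@dataclass
class NodeStatus:
    ip: str
    load: float            # assumed metric, e.g. running processes per CPU
    available: bool = True


class ClusterManager:
    """Stands in for the component that knows the state of every node."""

    def __init__(self, nodes):
        self._nodes = list(nodes)

    def status(self):
        # Step 3: report the status of the cluster.
        return list(self._nodes)


class Broker:
    def __init__(self, manager):
        self.manager = manager

    def best_node_ip(self):
        # Step 4: pick the best resource and return its IP.
        candidates = [n for n in self.manager.status() if n.available]
        if not candidates:
            raise RuntimeError("no node available")
        return min(candidates, key=lambda n: n.load).ip


def run_command(command, broker, launch):
    """Steps 1, 2, 5 and 6 as seen from the web application server."""
    ip = broker.best_node_ip()      # step 2: request a node IP
    return launch(ip, command)      # step 5: launch; step 6: return the result


def fake_launch(ip, command):
    # Placeholder for remote execution on the chosen node.
    return f"{command} ran on {ip}"


manager = ClusterManager([
    NodeStatus("10.0.0.1", load=0.9),
    NodeStatus("10.0.0.2", load=0.2),
    NodeStatus("10.0.0.3", load=0.1, available=False),
])
print(run_command("blastall", Broker(manager), fake_launch))
```

Keeping node selection in the broker means the web application server never needs to know the layout of the cluster.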
[Diagram: Grid. Labels: http, broker, virtual node (x2), DB (x2), node (x14)]
RESEARCH PROJECT
* Project title
* Experiment name
* Author, group, group leader, etc.
* Cell line
* Culture conditions
* Fixation and embedding methods, stainings, etc.
* Objective
* Focus position
* Stage position x/y
* Exposure time
* Resolution, etc.
[Diagram: the fields above are entered through the WEB INTERFACE and stored in the DB]
Image archival and management
Image-DB interface
[Screenshot: image entries labelled "timelapse at 6 positions", "timelapse", "actin", "wound healing", "timelapse 2", "adhesion", "actin staining"]
[Diagram: IPROC. Labels: HPC on cluster nodes, gateway, iPage, image area, data + images, page, iPane (x3), proc-steps]
IPROC architecture
[Diagram: access servers connected to the cluster nodes]
A tool can require the execution of multiple, simultaneous processes
Distributed execution of parallel requests
- PHP internal routines (basic drawing, processing)
- ImageMagick (more advanced processing)
- Image converters
- Special tools (PDL, deconvolution)
- Tools developed in-house (cell tracking)
- ...
What software may be linked
- Convenient graphic interface
- Access to a vast library of image processing steps
- No specific interface requirements
- Remote processing on parallel hardware
- Support for a large number of concurrent users
- System independent (works on Mac, PC, Linux, etc.)
- No need to install anything: a browser is enough
Advantages