Yike Guo/Jiancheng Lin
InforSense Ltd.
April 19, 2023
Bioinformatics workflow integration
Life Science Challenges
Information resides on different:
  • Granularity levels (individual records vs. massive repositories)
  • Abstraction levels (models ranging from entire systems to compound patterns)
  • Domain levels (clinical, sequence, instrument…)
Researchers:
  • Grouped in Virtual Organizations (VOs)
  • Working on the Grid
  • Need to communicate across physical and scientific/cultural barriers
Tools:
  • Legacy, well-established in the process
  • Novel, essential to innovation
  • In need of a consistent infrastructure to connect the two groups
Discovery Informatics in Post-Genome Era
ATGCAAGTCCCTAAGATTGCATAAGCTCGCTCAGTT
Diagram labels: polymorphism, patient records, epidemiology, linkage maps, cytogenetic maps, physical maps, sequences, alignments, expression patterns, physiology, receptors, signals, pathways, secondary structure, tertiary structure
Integrative Analytics Workflow Environment
Diagram labels:
  • Data: Files, DB
  • Applications
  • Components
  • Inbuilt Analytics
  • Oracle Data Preprocess
  • Oracle DM
  • Matlab, R, KXEN, WEKA, S-Plus, SAS
  • 3rd Party & Custom Apps: MDL, Spotfire, Daylight
  • Healthcare
  • Web Services
  • BioTeam iNquiry
  • Workflow Warehouse
  • Informatician
  • Data Analysis Group
  • Deployed Web App for End Users
  • Portal
InforSense Workflow Life Cycle
Constructing a ubiquitous workflow: by scientists
  • Integrate your information resources/software applications cross-domain
  • Support innovation and capture the best practice of your scientific research
Warehousing workflows: for scientists
  • Manage discovery processes in your organisation
  • Construct an enterprise process knowledge bank
Deployment workflow: to scientists
  • Turn your workflows into reusable applications
  • Turn every scientist into a solution builder
Workflow Creation, Integration, and Deployment
1. Select:
  • Data Sources
  • Data Mining / Statistics
  • Data Processing / Transformation
  • 3rd Party applications (e.g. Haploview)
  • Interactive data visualization / reporting
2. Connect:
  • Connect data and components in GUI
  • Workflow describes complex data processing and analysis
3. Execute:
  • “In database” processing & analytics
  • “Cluster / Grid” execution
4. Deploy:
  • Define parameters of workflow to expose
  • Publish as: portlet, web application, SOAP service, command line app
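The select, connect, execute and deploy steps can be illustrated with a small sketch. This is not the InforSense API: the `Workflow` class, its `connect`, `expose` and `execute` methods, and the two sample components are all hypothetical names chosen for illustration. The sketch only shows the idea of chaining components and exposing selected parameters to the end user of a deployed workflow.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple

# A component takes the incoming data plus the workflow settings (hypothetical interface).
Component = Callable[[Any, Dict[str, Any]], Any]


@dataclass
class Workflow:
    """A linear chain of named components (illustrative only)."""
    steps: List[Tuple[str, Component]] = field(default_factory=list)
    exposed: Dict[str, Any] = field(default_factory=dict)

    def connect(self, name: str, component: Component) -> "Workflow":
        # "Connect": append a selected component to the chain.
        self.steps.append((name, component))
        return self

    def expose(self, **defaults: Any) -> "Workflow":
        # "Deploy": declare which parameters an end user may set, with defaults.
        self.exposed.update(defaults)
        return self

    def execute(self, data: Any, **params: Any) -> Any:
        # "Execute": run every component in order, rejecting unexposed parameters.
        unknown = set(params) - set(self.exposed)
        if unknown:
            raise ValueError(f"parameters not exposed: {sorted(unknown)}")
        settings = {**self.exposed, **params}
        for _name, component in self.steps:
            data = component(data, settings)
        return data


def keep_high_scores(rows, settings):
    # Data processing component: filter rows by the exposed threshold.
    return [row for row in rows if row["score"] >= settings["min_score"]]


def gene_names(rows, settings):
    # Reporting component: reduce rows to their gene names.
    return [row["gene"] for row in rows]


# "Select": made-up records standing in for a data source.
records = [
    {"gene": "g1", "score": 0.9},
    {"gene": "g2", "score": 0.2},
    {"gene": "g3", "score": 0.7},
]

workflow = (
    Workflow()
    .connect("filter", keep_high_scores)
    .connect("names", gene_names)
    .expose(min_score=0.5)
)

print(workflow.execute(records))                 # default threshold
print(workflow.execute(records, min_score=0.8))  # end user overrides the exposed parameter
```

Only `min_score` is exposed here, so an end user of the deployed workflow can change the threshold but not the chain of components itself.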
Biology to Chemistry
Novel sequences are compared to known protein structures
The resulting set of ligands on these matching structures is used to search small molecule databases for similar compounds
Compounds are then analyzed using KDE tools such as PCA and clustering to provide a diverse, representative subset for further assays
Navigating KEGG pathways
Gene names from EMBL are used to query KEGG via their Webservice API for appropriate pathways
Further Webservice API calls allow navigation of the data to find:
  • Pathway compounds
  • Other genes in the pathways
  • Visualization of query genes on their pathways
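As a rough sketch of the gene-to-pathway lookup described above: the slides refer only to KEGG's "Webservice API", so the use of the present-day KEGG REST `link` operation here is an assumption, as are the helper names `link_url`, `parse_link_response` and `fetch_links`. The gene and pathway identifiers in the sample text are illustrative. The parsing step is kept separate from the network call so it can be checked offline.

```python
from urllib.parse import quote
from urllib.request import urlopen

KEGG_REST = "https://rest.kegg.jp"  # assumption: KEGG REST service, not the API named in the slides


def link_url(target_db: str, entry: str) -> str:
    """Build a KEGG 'link' URL that maps an entry to entries in target_db."""
    return f"{KEGG_REST}/link/{quote(target_db)}/{quote(entry, safe=':')}"


def parse_link_response(text: str) -> list:
    """Parse the two-column, tab-separated text returned by a 'link' request."""
    pairs = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        source, target = line.split("\t")[:2]
        pairs.append((source, target))
    return pairs


def fetch_links(target_db: str, entry: str) -> list:
    """Perform the request (needs network access; not called in this sketch)."""
    with urlopen(link_url(target_db, entry), timeout=30) as response:
        return parse_link_response(response.read().decode("utf-8"))


# Offline illustration with sample text in the expected two-column format.
sample = "hsa:7157\tpath:hsa04115\nhsa:7157\tpath:hsa04110\n"
print(link_url("pathway", "hsa:7157"))
print(parse_link_response(sample))
```

Pathway compounds and other genes in a pathway would presumably be retrieved with the same call pattern against other target databases.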
cDNA sequence annotation and alignment
A novel cDNA is annotated using EMBOSS tools, and a BLAST similarity search is performed against human proteins
Annotations used to aid identification of predicted proteins derived from the cDNA
Ortholog analysis using BLAST
Sequence libraries from 2 organisms are cross-compared using BLAST to determine the best bi-directional matches of sufficient quality
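The "best bi-directional matches" step is commonly implemented as a reciprocal best hit filter. The sketch below assumes the two BLAST runs have already been reduced to `(query, subject, bitscore)` tuples. The function names, the bit-score threshold used as the quality criterion, and the sample hits are all assumptions for illustration, not details from the slides.

```python
def best_hits(hits, min_bitscore):
    """Return {query: best subject}, keeping only hits at or above min_bitscore."""
    best = {}
    for query, subject, bitscore in hits:
        if bitscore < min_bitscore:
            continue  # hit is not of sufficient quality
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a, min_bitscore=50.0):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    best_ab = best_hits(a_vs_b, min_bitscore)
    best_ba = best_hits(b_vs_a, min_bitscore)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Made-up hits: organism A sequences searched against B, and the reverse.
a_vs_b = [("a1", "b1", 200.0), ("a1", "b2", 80.0), ("a2", "b2", 150.0), ("a3", "b1", 40.0)]
b_vs_a = [("b1", "a1", 195.0), ("b2", "a2", 140.0), ("b2", "a1", 90.0)]

print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

In this sample, `a3` is dropped because its only hit falls below the threshold, and `a1`/`b2` are not paired because neither is the other's best hit.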
Clustering of Affymetrix data with R
Native Affymetrix CEL files are loaded using R/Bioconductor
Differentially expressed genes calculated using KDE statistical nodes
The resulting list of genes is then clustered using HCLUST in R
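The clustering step in this workflow is run in R. As a language-neutral illustration of what hierarchical clustering does, here is a small pure-Python sketch of agglomerative clustering with complete linkage (the default method of R's `hclust`). It is a stand-in for explanation, not the workflow's actual implementation, and the gene names and expression values are made up.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def complete_linkage_clusters(profiles, n_clusters):
    """Merge the closest clusters until n_clusters remain.

    profiles: dict mapping gene name -> list of expression values.
    Cluster distance is the largest pairwise distance between members
    (complete linkage).
    """
    clusters = [[gene] for gene in profiles]

    def linkage(c1, c2):
        return max(euclidean(profiles[a], profiles[b]) for a in c1 for b in c2)

    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest complete-linkage distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return sorted(sorted(cluster) for cluster in clusters)


# Made-up expression profiles for four differentially expressed genes.
profiles = {
    "geneA": [1.0, 1.1, 0.9],
    "geneB": [1.2, 1.0, 1.0],
    "geneC": [5.0, 5.2, 4.9],
    "geneD": [5.1, 5.0, 5.0],
}

print(complete_linkage_clusters(profiles, n_clusters=2))
```

This naive version recomputes distances at every merge, which is fine for a short gene list but would be too slow for a full array.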
Microarray analysis using text mining
Microarray data normalized in KDE
Upregulated genes annotated from PubMed to obtain a set of related scientific papers
Text mining used to mine the paper collection and extract information most relevant to the researcher
Diagram labels (along a time axis):
  • Genetic data
  • Mouse ID
  • Cage ID
  • Environmental conditions
  • Management records
Diet groups: Normal Diet, Fat Fed
6 to 10 animals
Physiological data prior to change in diet:
  • Weight
  • Blood analysis
  • Urine analysis
Physiological data after change in diet (one time point in end-point experiment, several time points in longitudinal study):
  • Weight
  • Blood analysis: physiological parameters, metabonomics
  • Urine analysis: physiological parameters, metabonomics
  • Tissue sampling (liver, fat, muscle, kidney): metabonomics, proteomics (general, glyco-, phospho-proteomics), transcriptomics
Endpoint (culling or death):
  • Culling conditions
Recorded across the study:
  • Sampling conditions
  • Sample storage conditions
  • Ref of biological assays used across the study
Data formats: Affymetrix, XLS files, chromatograms, Filemaker Pro, metabonomics NMR spectra, cDNA arrays (ATF, GAL files)
Data levels:
  • Raw data
  • Normalised data
  • Processed data
Similar data will be recorded regarding experiments performed with cell lines
BAIR project
Biological Atlas of Insulin Resistance
Collaborative Visualisation
Literature mining and compound analysis
Grid Computing
BAIR Portal
Integrative support
Information:
  • Data models to support individual domains (sequences, NMR profiles…) and methods to map them into generic analysis (tables, text)
  • Annotation databases integrated through Web Service APIs
Researchers:
  • Sharing of work and knowledge through reusable workflow components
  • Aim for minimum technical overhead when linking new resources
Tools:
  • Focus on integration methods rather than one-off tool linkage
  • Researchers able to link to standard tools without the need for an IT specialist
  • Databases accessed through aggregators (SRS, BioMart…)