Yike Guo/Jiancheng Lin InforSense Ltd. 15 September 2015 Bioinformatics workflow integration

Yike Guo/Jiancheng Lin

InforSense Ltd.

April 19, 2023

Bioinformatics workflow integration

Life Science Challenges

Information resides on different:
• Granularity levels (individual records vs. massive repositories)
• Abstraction levels (models ranging from entire systems to compound patterns)
• Domain levels (clinical, sequence, instrument…)

Researchers
• Grouped in Virtual Organizations (VOs)
• Working on the Grid
• Need to communicate across physical and scientific/cultural barriers

Tools
• Legacy, well-established in the process
• Novel, essential to innovation
• In need of a consistent infrastructure to connect the two groups

Discovery Informatics in Post-Genome Era

[Figure: a DNA sequence (ATGCAAGTCCCTAAGATTGCATAAGCTCGCTCAGTT) surrounded by related data types: polymorphism, patient records, epidemiology, linkage maps, cytogenetic maps, physical maps, sequences, alignments, expression patterns, physiology, receptors, signals, pathways, secondary structure, tertiary structure]

Integrative Analytics Workflow Environment

[Architecture diagram; labels recovered from the figure]
Data: Files, DB
Components and applications: Inbuilt Analytics, Oracle Data Preprocess, Oracle DM, Matlab, R, KXEN, WEKA, S-Plus, SAS
3rd Party & Custom Apps: MDL, Spotfire, Daylight, BioTeam iNquiry, Web Services
Other labels: Workflow Warehouse, Informatician, Deployed Web App for End Users, Portal, Healthcare, Data Analysis Group

InforSense Workflow Life Cycle

Constructing a ubiquitous workflow: by scientists
• Integrate your information resources/software applications cross-domain
• Support innovation and capture the best practice of your scientific research

Warehousing workflows: for scientists
• Manage discovery processes in your organisation
• Construct an enterprise process knowledge bank

Deployment workflow: to scientists
• Turn your workflows into reusable applications
• Turn every scientist into a solution builder

Workflow Creation, Integration, and Deployment

1. Select:
• Data Sources
• Data Mining / Statistics
• Data Processing / Transformation
• 3rd Party applications (e.g. Haploview)
• Interactive data visualization / reporting

2. Connect:
• Connect data and components in GUI
• Workflow describes complex data processing and analysis

3. Execute:
• “In database” processing & analytics
• “Cluster / Grid” execution

4. Deploy:
• Define parameters of workflow to expose
• Publish as: portlet, web application, SOAP service, command line app
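The four steps above can be illustrated with a minimal sketch of a workflow as a directed acyclic graph of components. This is not the InforSense API: the `Workflow` class, the node names, and the three example functions are all hypothetical, and the "deploy" step is reduced to exposing one parameter at run time.

```python
from graphlib import TopologicalSorter


class Workflow:
    """A tiny workflow: named nodes wired together by data dependencies."""

    def __init__(self):
        self.funcs = {}   # node name -> callable
        self.inputs = {}  # node name -> list of upstream node names

    def add(self, name, func, inputs=()):
        self.funcs[name] = func
        self.inputs[name] = list(inputs)

    def run(self, **params):
        """Execute every node once, upstream nodes first; return all results."""
        results = {}
        for name in TopologicalSorter(self.inputs).static_order():
            upstream = [results[dep] for dep in self.inputs[name]]
            results[name] = self.funcs[name](*upstream, **params.get(name, {}))
        return results


# 1. Select: a data source and two processing components (hypothetical).
def load_expression():
    return {"geneA": 8.1, "geneB": 2.3, "geneC": 6.7}


def filter_high(table, threshold=5.0):
    return {gene: value for gene, value in table.items() if value >= threshold}


def report(table):
    return sorted(table)


# 2. Connect: wire the components into a graph.
wf = Workflow()
wf.add("source", load_expression)
wf.add("filter", filter_high, inputs=["source"])
wf.add("report", report, inputs=["filter"])

# 3. Execute with default settings.
default_run = wf.run()

# 4. Deploy: only the 'threshold' parameter is exposed to the end user.
strict_run = wf.run(filter={"threshold": 7.0})
```

Keeping the graph separate from the functions is what lets the same workflow be re-run with different exposed parameters without touching the components.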

Biology to Chemistry

• Novel sequences are compared to known protein structures
• The resulting set of ligands on these matching structures is used to search small molecule databases for similar compounds
• Compounds are then analyzed using KDE tools such as PCA and clustering to provide a diverse, representative subset for further assays
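The last step, picking a diverse and representative subset, can be sketched as follows. The slide names PCA and clustering in KDE; this sketch swaps in a simpler stand-in, leader clustering on Tanimoto similarity, and treats each compound as a hypothetical set of fingerprint bits. The compound names, bit sets, and the 0.5 threshold are illustrative assumptions.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of bits."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def leader_cluster(fingerprints, threshold=0.5):
    """Assign each compound to the first leader it resembles, else make it a leader."""
    clusters = {}  # leader name -> member names
    for name, bits in fingerprints.items():
        for leader in clusters:
            if tanimoto(bits, fingerprints[leader]) >= threshold:
                clusters[leader].append(name)
                break
        else:
            clusters[name] = [name]
    return clusters


# Made-up fingerprints for illustration only.
fingerprints = {
    "cmpd1": {1, 2, 3, 4},
    "cmpd2": {1, 2, 3, 5},
    "cmpd3": {7, 8, 9},
    "cmpd4": {7, 8, 9, 10},
    "cmpd5": {20, 21},
}

clusters = leader_cluster(fingerprints)
representatives = list(clusters)  # one compound per cluster
```

The cluster leaders form the subset sent on for further assays; a lower threshold gives fewer, broader clusters.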

Navigating KEGG pathways

Gene names from EMBL are used to query KEGG via their Webservice API for appropriate pathways

Further Webservice API calls allow navigation of the data to find:

• Pathway compounds
• Other genes in the pathways
• Visualization of query genes on their pathways
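The navigation step can be sketched offline. The code below assumes link responses arrive as two tab-separated columns per line, which is the shape returned by KEGG's REST-style `link` operation; the API used in the original workflow may differ. The identifiers in the sample responses are made up, and no network call is made.

```python
from collections import defaultdict


def parse_links(text):
    """Parse two-column, tab-separated link output into {source: {targets}}."""
    links = defaultdict(set)
    for line in text.splitlines():
        if line.strip():
            source, target = line.split("\t")
            links[source].add(target)
    return links


def other_genes_on_pathways(gene_to_pathways, pathway_to_genes, query_gene):
    """Collect every other gene that shares at least one pathway with the query."""
    partners = set()
    for pathway in gene_to_pathways.get(query_gene, ()):
        partners |= pathway_to_genes.get(pathway, set())
    partners.discard(query_gene)
    return sorted(partners)


# Illustrative responses with made-up identifiers.
gene_links = "gene:A\tpath:P1\ngene:A\tpath:P2\n"
pathway_links = "path:P1\tgene:A\npath:P1\tgene:B\npath:P2\tgene:C\n"

gene_to_pathways = parse_links(gene_links)
pathway_to_genes = parse_links(pathway_links)
neighbours = other_genes_on_pathways(gene_to_pathways, pathway_to_genes, "gene:A")
```

The same two-step lookup (gene to pathways, then pathways to members) applies to pathway compounds if the second response lists compounds instead of genes.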

cDNA sequence annotation and alignment

A novel cDNA is annotated using EMBOSS tools, and a BLAST similarity search is performed against human proteins

Annotations are used to aid identification of predicted proteins derived from the cDNA
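A minimal sketch of how the BLAST results might be post-processed, assuming the search was written in BLAST+ tabular format (`-outfmt 6`) with its default twelve columns. The sequence names, the sample hits, and the E-value cut-off are made up for illustration; the original workflow's settings are not given.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def parse_blast_tabular(text):
    """Turn tabular BLAST output into a list of dicts with numeric scores."""
    hits = []
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(BLAST_COLUMNS, row))
        for field in ("pident", "evalue", "bitscore"):
            hit[field] = float(hit[field])
        hits.append(hit)
    return hits


def best_hits(hits, max_evalue=1e-5):
    """Keep the highest-scoring hit per query among hits passing the E-value cut-off."""
    best = {}
    for hit in hits:
        if hit["evalue"] > max_evalue:
            continue
        current = best.get(hit["qseqid"])
        if current is None or hit["bitscore"] > current["bitscore"]:
            best[hit["qseqid"]] = hit
    return best


# Made-up hits for one cDNA against two proteins.
sample = (
    "cdna1\tprotX\t98.5\t200\t3\t0\t1\t600\t1\t200\t1e-100\t380\n"
    "cdna1\tprotY\t45.0\t150\t80\t2\t10\t460\t5\t155\t2e-3\t40.5\n"
)
top = best_hits(parse_blast_tabular(sample))
```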

Ortholog analysis using BLAST

Sequence libraries from 2 organisms are cross-compared using BLAST to determine the best bi-directional matches of sufficient quality
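The best bi-directional match criterion can be sketched as a reciprocal best hit search. The input format (query, subject, bit score), the sequence names, and the use of a minimum bit score as the "sufficient quality" filter are assumptions for illustration.

```python
def reciprocal_best_hits(a_vs_b, b_vs_a, min_bitscore=50.0):
    """Return (a, b) pairs where each sequence is the other's best hit.

    Both inputs are lists of (query, subject, bitscore) tuples.
    """
    def best(hits):
        top = {}
        for query, subject, score in hits:
            if score >= min_bitscore and (query not in top or score > top[query][1]):
                top[query] = (subject, score)
        return top

    best_ab = best(a_vs_b)
    best_ba = best(b_vs_a)
    return sorted(
        (a, b)
        for a, (b, _) in best_ab.items()
        if best_ba.get(b, (None, 0.0))[0] == a
    )


# Made-up hits between organism A and organism B.
a_vs_b = [("a1", "b1", 200.0), ("a1", "b2", 90.0), ("a2", "b2", 150.0), ("a3", "b1", 60.0)]
b_vs_a = [("b1", "a1", 195.0), ("b2", "a1", 95.0), ("b2", "a2", 140.0)]

orthologs = reciprocal_best_hits(a_vs_b, b_vs_a)
```

Here `a3` is dropped: its best hit is `b1`, but the best hit of `b1` is `a1`, so the match is not bi-directional.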

Clustering of Affymetrix data with R

Native Affymetrix CEL files are loaded using R/Bioconductor

Differentially expressed genes calculated using KDE statistical nodes

The resulting list of genes is then clustered using HCLUST in R
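The two analysis steps can be sketched in plain Python as a stand-in for the KDE statistical nodes and R's clustering. The sketch assumes Welch's t statistic as the differential expression test and complete linkage as the clustering rule; the expression values and the cut-off of 4 are made up, and the original workflow's choices are not stated.

```python
import math
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    se = math.sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se


def complete_linkage(profiles, n_clusters):
    """Naive agglomerative clustering: merge the closest pair until n_clusters remain."""
    clusters = [[name] for name in profiles]

    def distance(c1, c2):
        # Complete linkage: distance between the two farthest members.
        return max(math.dist(profiles[a], profiles[b]) for a in c1 for b in c2)

    while len(clusters) > n_clusters:
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda pair: distance(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(cluster) for cluster in clusters]


# Made-up log-scale expression values: (control replicates, treated replicates).
expression = {
    "g1": ([2.0, 2.2, 1.9], [6.0, 6.3, 5.8]),
    "g2": ([2.1, 2.0, 2.3], [6.1, 5.9, 6.2]),
    "g3": ([7.0, 7.2, 6.9], [3.0, 3.1, 2.8]),
    "g4": ([4.0, 4.1, 3.9], [4.0, 4.2, 3.9]),
}

# Step 1: keep genes whose |t| exceeds an arbitrary cut-off.
selected = {
    gene: control + treated
    for gene, (control, treated) in expression.items()
    if abs(welch_t(control, treated)) >= 4.0
}

# Step 2: cluster the selected expression profiles.
gene_clusters = complete_linkage(selected, n_clusters=2)
```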

Microarray analysis using text mining

• Microarray data normalized in KDE
• Upregulated genes annotated from PubMed to obtain a set of related scientific papers
• Text mining used to mine the paper collection and extract information most relevant to the researcher
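One simple way to surface the most relevant papers is to rank abstracts by a TF-IDF score for the researcher's terms. This is a sketch under that assumption, not the text mining method used in the original workflow; the paper identifiers and abstract texts are made up.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_abstracts(abstracts, query_terms):
    """Order paper IDs by the summed TF-IDF of the query terms, highest first."""
    docs = {paper_id: Counter(tokenize(text)) for paper_id, text in abstracts.items()}
    n_docs = len(docs)
    scores = {}
    for paper_id, counts in docs.items():
        total = sum(counts.values())
        score = 0.0
        for term in query_terms:
            doc_freq = sum(1 for other in docs.values() if term in other)
            if doc_freq and total:
                idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1.0  # smoothed IDF
                score += (counts[term] / total) * idf
        scores[paper_id] = score
    return sorted(scores, key=scores.get, reverse=True)


# Made-up abstracts for illustration.
abstracts = {
    "paper1": "Insulin resistance in adipose tissue alters insulin signalling",
    "paper2": "Gene expression changes in liver after high fat diet",
    "paper3": "Insulin secretion in pancreatic beta cells",
}

ranking = rank_abstracts(abstracts, ["insulin", "resistance"])
```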

[Study design diagram along a Time axis; 6 to 10 animals; diet groups: Normal Diet, Fat Fed]

Animal records
• Genetic data
• Mouse ID
• Cage ID
• Environmental conditions
• Management records

Physiological data prior to change in diet
• Weight
• Blood analysis
• Urine analysis

Physiological data after change in diet (one time point in end-point experiment, several time points in longitudinal study)
• Weight
• Blood analysis
• Urine analysis
• Physiological parameters
• Metabonomics

Endpoint (culling or death)
• Culling conditions
• Tissue sampling: Liver, Fat, Muscle, Kidney
• Metabonomics
• Proteomics (general, glyco-, phospho-proteomics)
• Transcriptomics

Recorded across the study
• Sampling conditions
• Sample storage conditions
• Ref of biological assays used across the study

Data levels
• Raw Data
• Normalised Data
• Processed Data

Data formats: Affymetrix, XLS files, Chromatograms, Filemaker Pro, Metabonomics, NMR spectra, cDNA arrays, ATF, GAL files

Similar data will be recorded regarding experiments performed with cell lines

BAIR project

Biological Atlas of Insulin Resistance

Collaborative Visualisation

Literature mining and compound analysis

Grid Computing

BAIR Portal

Integrative support

Information
• Data models to support individual domains (sequences, NMR profiles…) and methods to map them into generic analysis (tables, text)
• Annotation databases integrated through Web Service APIs

Researchers
• Sharing of work and knowledge through reusable workflow components
• Aim for minimum technical overhead when linking new resources

Tools
• Focus on integration methods rather than one-off tool linkage
• Researchers able to link to standard tools without the need for an IT specialist
• Databases accessed through aggregators (SRS, BioMart…)

