Supporting High- Performance Data Processing on Flat-Files Xuan Zhang Gagan Agrawal Ohio State...

Supporting High-Performance Data Processing on Flat-Files

Xuan Zhang

Gagan Agrawal

Ohio State University

Motivation

• Challenges of bioinformatics integration– Data volume: overwhelming

• DNA sequence: 100 gigabases (August, 2005)

– Data growth:

exponential

Figure provided by PDB

Existing Solutions

– (Relational) Databases• Support for indexing and high-level queries • Not suitable for biological data

– Flat Files with Scripts • Compact, Perl Scripts available • Lack indexing and high-level query processing

– Web-services • Significant overhead

• Enhance information integration systems on– Functionality

• On-the-fly data incorporation• Flat file data process

– Usability• Declarative interface• Low programming requirement

– Performance • Incorporate indexing support

Our Approach

Approach Summary

• Metadata– Declarative description of data– Data mining algorithms for semi-automatic

writing– Reusable by different requests on same data

• Code generation– Request analysis and execution separated– General modules with plug-in data module

System OverviewUnderstand Data Process Data

Data File User Request

Metadata Description

Layout Descriptor---------------------------------------------------

Schema DescriptorLayout Descriptor

---------------------------------------------------

Schema DescriptorLayout Descriptor

---------------------------------------------------

Schema Descriptor

CodeGeneration

RequestProcessor

Layout Miner

SchemaMiner

Information Integration System

Advantages

• Simple interface– At metadata level, declarative

• General data model– Semi-structured data– Flat file data

• Low human involvement– Semi-automatic data incorporation– Low maintenance cost

• OK Performance– Linear scale guaranteed – Can improve by using indexing

System Components

• Understand data– Layout mining– Schema mining

• Process data– Wrapper generation– Query Process– Query Process with indices

Data Process Overview

• Automatic code generation approach• Input

– Metadata about datasets involved– Optional:

• Implicit data transformation task• Request by users• Indexing functions

• Output– Executable programs

• General modules• Task-specific data module

Metadata Description

• Two aspects of data in flat files– Logical view of the data– Physical data organization

• Two components of every data descriptor– Schema description– Layout description

• Design goals– Powerful– Easy for writing and interpretation

Schema Descriptors

• Follow XML DTD standard for semi-structured data

• Simple attribute list for relational data

<?xml version='1.0' encoding='UTF-8'?><!ELEMENT FASTA (ID, DESCRIPTION, SEQ)><!ELEMENT ID (#PCDATA)><!ELEMENT DESCRIPTION (#PCDATA)><!ELEMENT SEQ (#PCDATA)>

[FASTA] //Schema NameID = string //Data type definitionsDESCRIPTION = stringSEQ = string

Layout Descriptors

• Overall structure (FASTA example)

DATASET “FASTAData” { //Dataset nameDATATYPE {FASTA} //Schema name

DATASPACE LINESIZE=80 {

// ---- File layout details goes here ----

}DATA {osu/fasta} //File location

Wrapper GenerationSystem Overview

DataReader DataWriterSynchronizer

SourceDataset

TargetDataset

WRAPINFO

Wrapper generationsystem

wrapper

Mapping File

Mapping Parser

Schema Mapping

Mapping Generator

Schema Descriptors

Layout Parser

Layout Descriptor

Data EntryRepresentation

Application Analyzer

Query With IndicesMotivation

• Goal– Improve the performance of query-proc program

• Index

– Maintain the advantages• Flat file based• Low requirement on programming

Challenges & Approaches

• Various indexing algorithms for various biological data– User defined indexing functions– Standard function interfaces

• Flat file data– Values parsed implicitly and ready to be indexed– Byte offset as pointer

• Metadata about indices– Layout descriptor

System Revisitedquery

Query parser

Metadatacollection

Datasetdescriptors

Descriptorparser

Application analyzer

QUERYINFOR

DataReader DataWriter

Synchronizer

Source data files

Targetdata file

Source/target names

Schema & Layout information mappings

Query analysis

Query execution

Index file Index functions

Language Enhancement

• Describe indices– Indexing is a property of dataset– Extend layout descriptors

– Maintain query format

DATASET “name”{…INDEX {attribute:index_file_loc:index_gen_fun:index_retr_fun:fun_loc[, attribute:index_file_loc:index_gen_fun:index_retr_fun:fun_loc]}}

AUTOWRAP GNAMESFROM CHIPDATA, YEASTGENOMEBY CHIPDATA.GENE = YEASTGENOME.IDWHERE …

New meaning of “=“:If index available, use index

retrieving functionElse, compare values directly

System Enhancement

• Metadata Descriptor Parser+ parse index information

• Application Analyzer+ index information: index look-up table

+ test condition: compare_field_indexing

Microarray Gene Information Look-up

• Goal: gather information about genes (120)

• Query: microarray output join genome database

• Index: gene names in genome

0.01 0.72

queryanalysis

indexgeneration

query withindices

query w/oindices

BLAST-ENHANCE Query

• Goal: Add extra information to BLAST output

• Query: BLAST output join Swiss-Prot database

• Index: protein ID in Swiss-Prot

indexgeneration

query w/indices

query w/oindices

3 5 12

OMIM-PLUS Query

• Goal: add Swiss-Prot link to OMIM

• Query: OMIM join Swiss-Prot

• Index: protein ID in Swiss-Prot

100000

1000000

10000000

indexgeneration

query w/indices

query w/oindices

Homology Search Query

• Goal: find similar sequences

• Query: query sequence list * sequence database

• Indexing algorithm– Sequence-based– Transformation of sub-string composition– Indexing n-D numerical values

Homology Search (1)

• Index (Singh’s algorithm)– Data: yeast

genome– wavelet

coefficients – minimum

bounding rectangles

1 2 3 4 5

Database size (9.8MB)

Index generation

Homology Search (2)

• Index (Ferhatosmanoglu’s algorithm)– Data: GenBank– Wavelet coefficients– Scalar quantization– R-tree 0

1 2 3 4 5

Database size (250MB)

Conclusions

• A frame work and a set of tools for on-the-fly flat file data integration– New data source understood semi-automatically

by data mining tools– New data processed automatically by generated

programs – Support for indexing incorporated flexibly

Supporting High- Performance Data Processing on Flat-Files Xuan Zhang Gagan Agrawal Ohio State...

Documents

1 Using Tiling to Scale Parallel Datacube Implementation Ruoming Jin Karthik Vaidyanathan Ge Yang Gagan Agrawal The Ohio State University

A Framework for Elastic Execution of Existing MPI Programs Aarthi Raveendran Tekin Bicer Gagan Agrawal 1

Elastic Cloud Caches for Accelerating Service-Oriented Computations Gagan Agrawal Ohio State University Columbus, OH David Chiu Washington State University

Supporting High-Level Abstractions through XML Technologies Xiaogang Li Gagan Agrawal The Ohio State University

Performance Issues in Parallelizing Data-Intensive applications on a Multi-core Cluster Vignesh Ravi and Gagan Agrawal {raviv,agrawal}@cse.ohio-state.edu

Porting Irregular Reductions on Heterogeneous CPU-GPU Configurations Xin Huo, Vignesh T. Ravi, Gagan Agrawal Department of Computer Science and Engineering

Modeling and Adaptive Scheduling of Large-Scale Wide-Area Data Transfers Raj Kettimuthu Advisors: Gagan Agrawal, P. Sadayappan

Evaluating FERMI features for Data Mining Applications Masters Thesis Presentation Sinduja Muralidharan Advised by: Dr. Gagan Agrawal

Compiler Supported High-level Abstractions for Sparse Disk-resident Datasets Renato Ferreira Gagan Agrawal Joel Saltz Ohio State University

Mining Data Streams Challenges, Techniques, and Future Work Ruoming Jin Joint work with Prof. Gagan Agrawal

An Autonomic Framework in Cloud Environment Jiedan Zhu Advisor: Prof. Gagan Agrawal

Transitioning to Semesters CSE MS Program Prof. Gagan Agrawal Grad Studies Chair

Light-Weight Data Management Solutions for Scientific Datasets Gagan Agrawal, Yu Su Ohio State Jonathan Woodring, LANL

Assigning Schema Labels Using Ontology and Heuristics Xuan Zhang, Rouming Jin, Gagan Agrawal

A Dynamic Scheduling Framework for Emerging Heterogeneous Systems Vignesh Ravi and Gagan Agrawal

High-level Interfaces and Abstractions for Data-Driven Applications in a Grid Environment Gagan Agrawal Department of Computer Science and Engineering

Data-Intensive Computing: From Multi-Cores and GPGPUs to Cloud Computing and Deep Web Gagan Agrawal u

Ex-MATE: Data-Intensive Computing with Large Reduction Objects and Its Application to Graph Mining Wei Jiang and Gagan Agrawal

ValuePack: Value-Based Scheduling Framework for CPU-GPU Clusters Vignesh Ravi, Michela Becchi, Gagan Agrawal, Srimat Chakradhar

Supporting Load Balancing for Distributed Data-Intensive Applications Leonid Glimcher, Vignesh Ravi, and Gagan Agrawal Department of ComputerScience and