2nd Texas A&M Big Data Workshop
Development of “Big Data” Scientific Workflow Management Tools for the Materials Genome Initiative: “Materials Galaxy”
Rodolfo Aramayo, Department of Biology, College of Sciences
Dr. Rodolfo Aramayo, Ricardo Perez
Department of Biology
Dr. Raymundo Arroyave, Dr. Ibrahim Karaman, Daniel Sauceda, Dr. Anjana Talapatra, Nayan Chaudhary, Ramaranjan Ruj, Vinay Akula
Department of Materials Science and Engineering
Dr. Ricardo Gutierrez-Osuna
Department of Computer Science and Engineering
The Problem…
• When data is generated faster than it can be processed, we have an “information crisis”
• This crisis is associated not only with the lack of hardware/software infrastructure to transform data into knowledge, but also with two major informatics-related needs:
– I. Accessibility: It is not uncommon to find scientists unable to process information due to their lack of programming and/or informatics expertise
– II. Reproducibility: Lack of robust frameworks to ensure reproducibility has been identified as a major issue in the scientific enterprise
Reproducibility in INFORMATICS is a major challenge, as the generation of knowledge out of data involves highly complex analysis workflows
The Problem = Opportunity
• This “Problem” will hit Materials Sciences hard, since this field is undergoing a major transformation into being “Big-Data” centric
– This is particularly true since the launch of the Materials Genome Initiative (MGI) in 2011
– The Materials data infrastructure is undergoing active development
– Materials Sciences is expected to ramp up from 0% to 100% Big-Data in a few years
• This is a tremendous opportunity for us to become leaders in this emerging field
Our Objective…
• To establish TAMU as a leading center for Materials and Materials Informatics
• How are we going to do that?
– By developing a series of computational tools designed to collect, store and analyze “Big Data” from diverse sources
– By adapting and porting informatics tools from other, more developed areas into Materials Sciences
We propose to start by adapting “Galaxy”, a complex web-based system originally developed for genomics applications, to Materials Informatics
What Is Galaxy? (Definition)
• Galaxy is an open source, web-based platform for accessible, reproducible, and transparent computational biomedical research
– Accessible: Users without programming experience can easily specify parameters and run tools and workflows
– Reproducible: Galaxy captures information so that any user can repeat and understand a complete computational analysis
– Transparent: Users share and publish analyses via the web and create Pages, interactive, web-based documents that describe a complete analysis
Source: Galaxy Wiki: https://wiki.galaxyproject.org/FrontPage
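The “Reproducible” point above can be illustrated with a minimal sketch of provenance capture: each step records which tool ran, with which parameters, on which input, so the analysis can be replayed and checked later. This is not Galaxy's API or implementation. The function names (`run_step`, `replay`), the two toy tools, and the sample alloy data are all hypothetical and serve only to show the idea.

```python
import hashlib
import json


def digest(data: str) -> str:
    """Return a short SHA-256 fingerprint of a text dataset."""
    return hashlib.sha256(data.encode("utf-8")).hexdigest()[:12]


def run_step(history, tool_name, tool, dataset, **params):
    """Run one tool and append a provenance record to the history."""
    result = tool(dataset, **params)
    history.append({
        "tool": tool_name,
        "params": params,
        "input": digest(dataset),
        "output": digest(result),
    })
    return result


def replay(history, tools, dataset):
    """Re-run every recorded step and check each output fingerprint."""
    for record in history:
        dataset = tools[record["tool"]](dataset, **record["params"])
        if digest(dataset) != record["output"]:
            raise ValueError(f"step {record['tool']} did not reproduce")
    return dataset


# Hypothetical tools: keep lines containing a keyword, then sort lines.
def keep_lines(text, keyword):
    return "\n".join(line for line in text.splitlines() if keyword in line)


def sort_lines(text):
    return "\n".join(sorted(text.splitlines()))


TOOLS = {"keep_lines": keep_lines, "sort_lines": sort_lines}

# Made-up sample data (alloy name and an arbitrary number per line).
raw = "NiTi 350\nCuZn 120\nNiAl 410"
history = []
step1 = run_step(history, "keep_lines", keep_lines, raw, keyword="Ni")
final = run_step(history, "sort_lines", sort_lines, step1)

# The history is plain JSON-serializable data, so it can be shared and replayed.
print(json.dumps(history, indent=2))
```

Calling `replay(history, TOOLS, raw)` re-runs the recorded steps and raises an error if any step's output no longer matches its recorded fingerprint, which is the kind of check a shared, captured analysis makes possible.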
Galaxy’s Internals and Interface
[Figures not reproduced. Sources: Hillman-Jackson et al. (2012), “Using Galaxy to Perform Large-Scale Interactive Data Analyses”; Blankenberg et al. (2011), “Making whole genome multiple alignments usable for biologists”]
Galaxy Runs on “Ada” (“Reveille”)