2nd Texas A&M Big Data Workshop
Development of “Big Data” Scientific Workflow Management Tools for the Materials Genome Initiative: “Materials Galaxy”
Rodolfo Aramayo, Department of Biology, College of Sciences

Department of Biology
• Dr. Rodolfo Aramayo
• Ricardo Perez

Department of Materials Science and Engineering
• Dr. Raymundo Arroyave
• Dr. Ibrahim Karaman
• Daniel Sauceda
• Dr. Anjana Talapatra
• Nayan Chaudhary
• Ramaranjan Ruj
• Vinay Akula

Department of Computer Science and Engineering
• Dr. Ricardo Gutierrez-Osuna

The Problem…

• When data is generated faster than it can be processed, we have an “information crisis”
• This crisis is not only associated with the lack of hardware/software infrastructure to transform data into knowledge but also with two major informatics-related needs:
  – I. Accessibility: It is not uncommon to find scientists unable to process information due to their lack of programming and/or informatics expertise
  – II. Reproducibility: Lack of robust frameworks to ensure reproducibility has been identified as a major issue in the scientific enterprise

Reproducibility in INFORMATICS is a major challenge, as the generation of knowledge out of data involves highly complex analysis workflows

The Problem = Opportunity

• This “Problem” will hit Materials Sciences hard, since this field is undergoing a major transformation into being “Big-Data” centric
  – This is particularly true since the launch of the Materials Genome Initiative (MGI) in 2011
  – The Materials data infrastructure is undergoing active development
  – Materials Sciences is expected to ramp up from 0% to 100% Big-Data in a few years
• This is a tremendous opportunity for us to become leaders in this emerging field

Our Objective…

• To establish TAMU as a leading center for Materials and Materials Informatics
• How are we going to do that?
  – By developing a series of computational tools designed to collect, store, and analyze "Big Data" from diverse sources
  – By adapting and porting informatics tools from other, more developed areas into Materials Sciences

We propose to start by adapting “Galaxy”, a complex web-based system originally developed for Genomics applications, for Materials Informatics

What Is Galaxy? (Definition)

• Galaxy is an open source, web-based platform for accessible, reproducible, and transparent computational biomedical research
  – Accessible: Users without programming experience can easily specify parameters and run tools and workflows
  – Reproducible: Galaxy captures information so that any user can repeat and understand a complete computational analysis
  – Transparent: Users share and publish analyses via the web and create Pages, interactive, web-based documents that describe a complete analysis

Source: Galaxy Wiki: https://wiki.galaxyproject.org/FrontPage

Galaxy’s Internals

Galaxy’s Internals

Galaxy’s Interface

Hillman-Jackson et al. (2012), Using Galaxy to Perform Large-Scale Interactive Data Analyses

Galaxy’s Internals

Hillman-Jackson et al. (2012), Using Galaxy to Perform Large-Scale Interactive Data Analyses

Galaxy’s Internals

Hillman-Jackson et al. (2012), Using Galaxy to Perform Large-Scale Interactive Data Analyses

Hillman-Jackson et al. (2012), Using Galaxy to Perform Large-Scale Interactive Data Analyses

Galaxy Interface

Hillman-Jackson et al. (2012), Using Galaxy to Perform Large-Scale Interactive Data Analyses

Galaxy’s Internals

Blankenberg et al. (2011), Making whole genome multiple alignments usable for biologists

Galaxy Runs on “Ada” (“Reveille”)
