SAN DIEGO SUPERCOMPUTER CENTER NSF Workshop on CI for Environmental Observatories, Dec 6-7, 2004
NSF Workshop on Cyberinfrastructure for Environmental Observatories
Introduction to CI Topics
Chaitan Baru, SDSC/NLADR
Bertram Ludaescher, UC Davis/SDSC
Michael Welge, NCSA/NLADR
Outline
• A nexus of CI projects
• CI project “principles”
• CI technical focus areas/topics
• CI organizational issues
CI Projects
• Biomedical
  • BIRN-CC (Ellisman PI, Papadopoulos, Gupta, Baru, …)
  • National Biomedical Computational Resource, NBCR (Arzberger PI, Ellisman, Papadopoulos, Gupta, Baru, …) ...
• Geosciences
  • GEON (Baru PI, Ludaescher, Papadopoulos, Helly, …)
  • SCEC (Jordan PI, Moore, …)
  • LEAD (Droegemeier PI, Wilhelmson, Welge, …)
  • Chronos (Cervato PI, Baru…)
  • CUAHSI-HIS (Maidment PI, Helly, Zaslavsky, …)
  • LOOKING (Smarr/Orcutt PI, Welge, Fountain, …) ...
• Bio/Eco/Environmental
  • SEEK (Michener PI, Ludaescher, Jones, Rajasekar, …)
  • LTER (Michener PI, SDSC partner (Arzberger, Baru, Fountain, Rajasekar)…)
  • NEON (Hayden/Michener Lead PIs, Krishtalka, Baru, Welge…)
  • ROADNet (Orcutt PI, Vernon, Rajasekar, Ludaescher, Fountain, …)
  • NSF/BDI Lake Metabolism (Arzberger/Kratz PIs, Fountain, …) ...
• Engineering
  • Monitoring Health of Civil Infrastructure (El Gamal PI, Fountain, …)
  • CLEANER (Minsker, Welge, Zaslavsky, Fountain, Pancake, …)
• CISE
  • OptIPuter (Smarr PI, Ellisman, Orcutt, Papadopoulos, Welge, …)
  • NMI, GRIDs Center
  • Data Intensive Grid Benchmarking (Baru PI, Snavely, Casanova)
• MPS
  • NVO, GriPhyN, …
CI Project Principles
• Use state-of-the-art IT, and develop advanced IT where needed, to support the “day-to-day” conduct of science (e-science)
  • (not just “hero” computations)
  • Based on a Web/Grid services-based distributed environment
• The “two-tier” approach
  • Use best practices, including commercial tools,
  • while developing advanced technology in open source, and doing CS research
• An equal partnership
  • IT works in close conjunction with science, to create CI, i.e., the best practices, data sharing frameworks, useful and usable capabilities and tools
• Create the “science IT infrastructure”
  • Online databases with advanced search engines
  • Robust tools and applications, etc.
• Leverage from other intersecting projects
  • Much commonality in the technologies, regardless of science disciplines
  • Constantly work towards eliminating (or, at least, minimizing) the “NIH” (not-invented-here) syndrome
  • And, importantly, try not to reinvent what industry already knows how to do…
Important Focus Areas / Topics
• Security
  • Authentication, access control, controls for data publication…
• Grid middleware
  • WSRF implementations, architecting “core” services (e.g. for metadata management, versioning, …)
• Data integration and ontologies
  • Data interoperability, schema and semantic integration
• Workflow systems
  • “system-level” and science workflows (ingestion and analysis)
• Sensor network and sensor data management
  • Extensible, scalable, autonomic software; intelligent sensor management
• Data mining
  • Online analysis, large-scale data, novel algorithms, advanced triggering and notification
• Visualization
  • Large-scale, multi-model (data viz, GIS, info viz)
Example: GEON “Software Stack”
[Figure: layered stack diagram. Layers recovered from the slide:]
• Portal (myGEON): Registration, GEONsearch, GEONworkbench; other service “consumers”
• Service interfaces
• Registration Services; Data Integration Services; GIS Mapping Services; Computational and Modeling Services
• Core Grid Services: Data, Metadata, Indexing, Logging, Other Systems Services
• “Physical” Grid: RedHat Linux, ROCKS, OGSI, Internet, I2, OptIPuter (planned)
Antelope WSRF Extensions
Courtesy: Tony Fountain, SDSC and LOOKING project
[Figure: architecture diagram. Labels recovered from the slide:]
• Field digitizers; Field Interface Module; Object Ring Buffer; Databases; Antelope Executive Module
• ORB Operations: Orb Import, Orb Export, Processing, Archiving
• WS-Resources
• Soap Request (Soap Header: Proxy Cert; Soap Body: Request Params), sent over SOAP/HTTP
• Portal; Data Analyzer
• WSRF Authentication & Authorization; Lookup Service; Service Invoker
• Antelope Web Services: ORBcommander, ORBManager, ORBMonitor, Services Subscriber, Database operator, Event Coordinator, Other Services
• Proxy Repository: certs, username, password, others
• Services Repository: name, definition, others
CI Organizational Issues
• How to foster development of common infrastructure (based upon science needs/input), across multiple science domains
  • Not just at hardware level (e.g. supercomputers, high-speed networks) or OS and system services level
  • But, at the database, data integration, data mining levels
• How to deal with the continuum of activities from basic CS research to production IT systems
• NLADR – created with above issues in mind
  • Prototype for a CI organization
NLADR: National Lab for Advanced Data Research
• Joint activity between SDSC and NCSA, started October 1, 2004
• Formed based on NSF’s requirement that SDSC and NCSA collaborate on CI activities
• Collaborative R&D activity focused on advanced data technologies
  • Guided by real applications from science communities
  • …to assemble expertise and a “knowledge base” of data technologies
  • And, also develop a broad data architecture framework
  • …within which to develop, integrate, test, and benchmark data-related technologies
  • …in the context of national-scale physical infrastructure
NLADR Services Architecture
[Figure: layered services diagram. Labels recovered from the slide:]
• Applications: NSF (LEAD, GEON, LTERGrid, CLEANER, LOOKING); NIH/NCRR (BIRN); NASA (Space & Earth Sciences); Strategic Industrial Partners (…)
• nladrSearch; DataWorkbench
• NLADR Query, Analysis, and Visualization Services: Data Registration and Indexing; Database Federation & Integration; Workflow Authoring, Execution; Data and Information Visualization; Data Analysis and Mining; Collaboration; Benchmarking
• NLADR Data Management Services: management and archiving of large simulation outputs, streaming data, databases, data collections
• Grid and Web Middleware (Globus/WSRF/WebServices/J2EE)
• Node Operating Systems (Linux, …)
• Internet2, LambdaGrids, SDSC/NCSA testbed, OptIPuter
Some Core IT Areas
• Data integration and ontologies
  • Data interoperability, schema and semantic integration
• Scientific Workflows
Schema Integration (“registering” local schemas to a global schema)
[Figure: state geologic map sources (Arizona, Colorado, Utah, Nevada, Wyoming, New Mexico, Montana E., Idaho, Montana West), each with local Formation and Age fields (some also Composition, Fabric, Texture), registered to one Integration Schema. Local column names shown: ABBREV, PERIOD, NAME, TYPE, TIME_UNIT, FMATN, FORMATION, LITHOLOGY, AGE. Example values: andesitic sandstone; Livingston formation; Tertiary-Cretaceous.]
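The idea of registering local schemas to a global schema can be sketched in a few lines. This is only an illustration, not GEON's implementation: the source names `source_a` and `source_b`, the assignment of column names to sources, the `to_global` helper, and the abbreviation value `TKl` are all assumptions; only the column names and example values come from the slide.

```python
# Global (integration) schema attributes used in this sketch.
GLOBAL_SCHEMA = ("formation", "age", "lithology")

# Hypothetical registrations: each source maps its own column names
# to global attributes. Which source uses which name is assumed.
REGISTRATIONS = {
    "source_a": {"FMATN": "formation", "PERIOD": "age"},
    "source_b": {"FORMATION": "formation", "AGE": "age", "LITHOLOGY": "lithology"},
}


def to_global(source, record):
    """Rewrite one local record into the global schema.

    Columns without a registration are dropped; global attributes the
    source does not provide stay None.
    """
    mapping = REGISTRATIONS[source]
    row = dict.fromkeys(GLOBAL_SCHEMA)
    for column, value in record.items():
        if column in mapping:
            row[mapping[column]] = value
    return row


if __name__ == "__main__":
    local = {"FMATN": "Livingston formation", "PERIOD": "Tertiary-Cretaceous", "ABBREV": "TKl"}
    print(to_global("source_a", local))
```

Once every source is registered, a single query against `formation` or `age` can be answered from all sources without knowing their local column names.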
Multihierarchical Rock Classification “Ontology” (Taxonomies) for “Thematic Queries” (GSC)
[Figure: rock classification taxonomies along four dimensions: Composition, Genesis, Fabric, Texture]
Ontology-Enabled Application Example: Geologic Map Integration
• Show formations where AGE = ‘Paleozoic’ (without age ontology)
• Show formations where AGE = ‘Paleozoic’ (with age ontology)
+/- a few hundred million years
[Figure: two Nevada map results for the same query; domain knowledge enters through knowledge representation as a Geologic Age ONTOLOGY]
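The difference between the two query results can be sketched as query expansion over an age hierarchy. This is a minimal illustration under stated assumptions: the `AGE_CHILDREN` table covers only the Paleozoic periods, the formation rows `F1` to `F4` are made up, and `expand` and `select` are hypothetical helpers, not part of any GEON tool.

```python
# Fragment of a geologic age hierarchy: era -> periods -> subperiods.
AGE_CHILDREN = {
    "Paleozoic": ["Cambrian", "Ordovician", "Silurian",
                  "Devonian", "Carboniferous", "Permian"],
    "Carboniferous": ["Mississippian", "Pennsylvanian"],
}

# Hypothetical map units; names and ages are invented for the example.
FORMATIONS = [
    {"name": "F1", "age": "Paleozoic"},
    {"name": "F2", "age": "Devonian"},
    {"name": "F3", "age": "Pennsylvanian"},
    {"name": "F4", "age": "Jurassic"},
]


def expand(age):
    """Return the age together with all of its descendants in the hierarchy."""
    result = {age}
    for child in AGE_CHILDREN.get(age, []):
        result |= expand(child)
    return result


def select(rows, age, use_ontology):
    """Names of rows whose age matches, literally or via the ontology."""
    wanted = expand(age) if use_ontology else {age}
    return [row["name"] for row in rows if row["age"] in wanted]


if __name__ == "__main__":
    print(select(FORMATIONS, "Paleozoic", use_ontology=False))
    print(select(FORMATIONS, "Paleozoic", use_ontology=True))
```

Without the ontology only rows labeled literally as Paleozoic match; with it, rows labeled with any Paleozoic period or subperiod match as well.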
Different views on State Geological Maps
Sedimentary Rocks: BGS Ontology
Sedimentary Rocks: GSC Ontology
Example: Domain Knowledge to “glue” SYNAPSE & NCMIR Data
Domain expert knowledge:
Purkinje cells and Pyramidal cells have dendrites that have higher-order branches that contain spines. Dendritic spines are ion (calcium) regulating components. Spines have ion binding proteins. Neurotransmission involves ionic activity (release). Ion-binding proteins control ion activity (propagation) in a cell. Ion-regulating components of cells affect ionic activity (release).
Formalized as domain map/ontology
Made usable for the system using Description Logic
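One way to picture how such statements become usable by a system is to store them as subject/relation/object triples and follow the links. The sketch below is only an illustration under that assumption: it is not a Description Logic reasoner, and the relation names and the `reachable` helper are hypothetical.

```python
# Statements from the slide, encoded as (subject, relation, object) triples.
TRIPLES = [
    ("Purkinje cell", "has", "dendrite"),
    ("Pyramidal cell", "has", "dendrite"),
    ("dendrite", "has", "higher-order branch"),
    ("higher-order branch", "contains", "spine"),
    ("spine", "is_a", "ion-regulating component"),
    ("spine", "has", "ion-binding protein"),
    ("ion-binding protein", "controls", "ion activity"),
]


def reachable(start, triples=TRIPLES):
    """Return every concept linked to `start` by following triples transitively."""
    seen = set()
    frontier = [start]
    while frontier:
        node = frontier.pop()
        for subj, _relation, obj in triples:
            if subj == node and obj not in seen:
                seen.add(obj)
                frontier.append(obj)
    return seen


if __name__ == "__main__":
    # Data about ion-binding proteins can be linked to data about Purkinje cells
    # because the domain map connects the two concepts.
    print("ion-binding protein" in reachable("Purkinje cell"))
```

The point of the sketch is the “glue”: two sources that never mention each other's terms can still be related when both hang their data off concepts in the same map.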
“Semantic Source Browsing”: Domain Maps/Ontologies (left) & conceptually linked data (right)
A Semantic Mediation Result View
Source Contextualization through Ontology Refinement
In addition to registering (“hanging off”) data relative to existing concepts, a source may also refine the mediator’s domain map...
sources can register new concepts at the mediator ...
increase your data usability
What is a Scientific Workflow (SWF)?
• Aims:
  • automate a scientist’s repetitive data management and analysis tasks
  • typical phases:
    • data access, scheduling, generation, transformation, aggregation, analysis, mining, visualization
  • design, test, share, deploy, execute, reuse, … SWFs
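The phases listed above can be sketched as a chain of small steps, each consuming the previous step's output. This is a toy illustration, not how Kepler or any other workflow system is implemented; the step functions, the sample readings, and the `run_workflow` helper are all hypothetical.

```python
from statistics import mean


def access():
    """Data access: return raw readings (made-up sample values)."""
    return ["12.1", "11.8", "bad", "12.4"]


def transform(raw):
    """Transformation: convert strings to floats, skipping unparsable items."""
    values = []
    for item in raw:
        try:
            values.append(float(item))
        except ValueError:
            continue
    return values


def aggregate(values):
    """Aggregation: reduce the values to a small summary."""
    return {"count": len(values), "mean": mean(values)}


def analyze(summary, limit=12.0):
    """Analysis: flag whether the mean exceeds an assumed limit."""
    return {**summary, "above_limit": summary["mean"] > limit}


def run_workflow(source, steps):
    """Execute the steps in order, feeding each output to the next step."""
    data = source()
    for step in steps:
        data = step(data)
    return data


if __name__ == "__main__":
    print(run_workflow(access, [transform, aggregate, analyze]))
```

Because the workflow is just a list of steps, it can be stored, shared, and rerun on new data, which is the automation the slide describes.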
Promoter Identification Workflow
Source: Matt Coleman (LLNL)
KEPLER/CSP: Contributors, Sponsors, Projects
(or loosely coupled Communicating Sequential Persons ;-)
Ilkay Altintas: SDM, Resurgence
Kim Baldridge: Resurgence, NMI
Chad Berkley: SEEK
Shawn Bowers: SEEK
Terence Critchlow: SDM
Tobin Fricke: ROADNet
Jeffrey Grethe: BIRN
Christopher H. Brooks: Ptolemy II
Zhengang Cheng: SDM
Dan Higgins: SEEK
Efrat Jaeger: GEON
Matt Jones: SEEK
Werner Krebs: EOL
Edward A. Lee: Ptolemy II
Kai Lin: GEON
Bertram Ludaescher: SDM, SEEK, GEON, BIRN, ROADNet
Mark Miller: EOL
Steve Mock: NMI
Steve Neuendorffer: Ptolemy II
Jing Tao: SEEK
Mladen Vouk: SDM
Xiaowen Xin: SDM
Yang Zhao: Ptolemy II
Bing Zhu: SEEK
•••
Ptolemy II
www.kepler-project.org
Scientific Workflows as a Melting Pot
Example: The Kepler SWF System
• A grass-roots project
  • collaboration at the level of developers
• Intra-project links
  • e.g. in SEEK: AMS SMS EcoGrid
• Inter-project links
  • SEEK ITR, GEON ITR, ROADNet ITRs, DOE SciDAC SDM, Ptolemy II, NIH BIRN (coming we hope …), UK eScience myGrid, …
• Inter-technology links
  • Globus, SRB, JDBC, web services, soaplab services, command line tools, R, GRASS, XSLT, …
• Interdisciplinary links
  • CS, IT, domain sciences, …
Promoter Identification Workflow in KEPLER
Web Services Actors (WS Harvester)
[Figure: WS Harvester screenshots, steps 1 to 4]
• “Minute-made” (MM) WS-based application integration
• Similarly: MM workflow design & sharing w/o implemented components
Job Management (here: NIMROD)
• Job management infrastructure in place
• Results database: under development
• Goal: 1000’s of GAMESS jobs (quantum mechanics)
Some Recent Actor Additions
in KEPLER (w/ editable script)
Source: Dan Higgins, Kepler/SEEK
Blurring Design (ToDo) and Execution
Towards Real-time Analysis Pipelines: Combining Simulations, Models, and Observations
A Briefing On Data Mining to the NSF Planning Meeting Discussion Group on Cyberinfrastructure For Environmental Observatories
December 6 & 7, Arlington, VA
Michael Welge
University Of Illinois/[email protected]
Modern Discovery and Problem Solving
• Team-oriented and collaborative
• Information-based, decision focused
• Requires large-scale data fusion and analysis
• Not all data is under the user’s control
• Geographically distributed experts
• Geographically distributed data and applications
• Multiple stakeholders – multiple objectives
Enabling Scientist
Scientists, Engineers, Decision Makers, Policy Makers, Media and Citizens
Engaging in discovery, analysis, discussion, deliberation, decisions, policy formulation and communication
Collaboration Framework facilitates Idea and Knowledge Sharing, eLearning and Multi-Objective Decision Support Processes
Analysis Framework facilitates Data and Model Discovery, Exploration, and Analysis; via the Collaboration Framework
Data Management Framework builds logical maps of distributed, heterogeneous information resources (data, models, tools, etc.) and facilitates their use via the Analysis and Collaboration Frameworks
Physical Infrastructure
Data Streams – large number of applications
• Sensor networks
• Massive simulation data sets (stored but random access is too expensive)
• Monitoring & surveillance: video streams
• Network monitoring and traffic engineering
• Text based systems
• RFID tags
• Web logs and Web page click streams
• Credit card transaction flows
• Telecommunication calling records
• Engineering & industrial processes: power supply & manufacturing
Support For Large Data Driven Problems
• Streaming Data
  • Continuous, unbounded, rapid, time-varying
  • Huge volumes of continuous data, possibly infinite
  • Unpredictable arrival
  • Fast changing and requires real-time response
  • Random access is expensive so an application can only have one look at the data
  • May require methods to detect rare events
• Large Static Data
  • Databases involving many terabytes can exceed reasonable processing capacity
  • Thousands of files create problems of management and version control
  • Thousands of fields create problems with model building
  • May require auxiliary models to support data quality issues
  • May require methods to detect rare events
  • Distributed data store necessary for some application domains
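The “one look at the data” constraint can be illustrated with a single-pass detector that keeps only running statistics. This is a generic sketch using Welford's online mean/variance update, not a method named in the talk; the class name, the three-sigma threshold, and the warm-up length are assumptions.

```python
import math


class OnePassDetector:
    """Flag values far from the running mean, seeing each value exactly once."""

    def __init__(self, threshold=3.0, warmup=10):
        self.threshold = threshold  # number of standard deviations (assumed)
        self.warmup = warmup        # values to absorb before flagging (assumed)
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0               # running sum of squared deviations

    def observe(self, x):
        """Return True if x looks rare, then fold x into the running statistics."""
        flagged = False
        if self.n >= self.warmup:
            std = math.sqrt(self.m2 / (self.n - 1))
            flagged = std > 0 and abs(x - self.mean) > self.threshold * std
        # Welford's update: constant memory, no stored history.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return flagged


if __name__ == "__main__":
    detector = OnePassDetector()
    stream = [10.0, 10.2, 9.8, 10.1, 9.9] * 4 + [50.0]
    print([i for i, x in enumerate(stream) if detector.observe(x)])
```

The detector never revisits earlier values, so memory use stays constant no matter how long the stream runs.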
Managing and Mining Data Streams
Event Federation I
• Connect with data sources.
• Parse source data to form (composite) events according to type definitions.
• Collect and stage events for retrieval.
[Figure: Data Sources 1, 2, …, N feed the Event Collector (Type Info; Parse and Compose; EQL; Persistence; Buffering), which serves Stream Clients through the Event Interface]
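The three steps of Event Federation I (connect, parse according to type definitions, collect and stage) can be sketched as follows. This is a hypothetical illustration, not the actual Event Collector: the comma-separated input format, the `TYPE_INFO` table, and the `EventCollector` class with its methods are all assumptions.

```python
from collections import deque

# Hypothetical type definitions: event type -> ordered (field name, converter) pairs.
TYPE_INFO = {
    "temp": [("sensor", str), ("celsius", float)],
    "flow": [("gauge", str), ("m3_per_s", float)],
}


def parse(line):
    """Parse 'type,value1,value2' into an event dict according to TYPE_INFO."""
    kind, *values = line.strip().split(",")
    fields = TYPE_INFO[kind]
    if len(values) != len(fields):
        raise ValueError(f"expected {len(fields)} fields for {kind!r}")
    event = {"type": kind}
    for (name, convert), raw in zip(fields, values):
        event[name] = convert(raw)
    return event


class EventCollector:
    """Stage parsed events in a bounded buffer until clients retrieve them."""

    def __init__(self, capacity=1000):
        self.buffer = deque(maxlen=capacity)  # oldest events drop out when full

    def collect(self, source):
        """Read every line from a source (any iterable of strings)."""
        for line in source:
            self.buffer.append(parse(line))

    def retrieve(self, kind=None):
        """Return staged events, optionally filtered by event type."""
        return [e for e in self.buffer if kind is None or e["type"] == kind]


if __name__ == "__main__":
    collector = EventCollector()
    collector.collect(["temp,s1,21.5", "flow,g7,3.0"])
    print(collector.retrieve("temp"))
```

A bounded buffer is one simple way to stage an unbounded stream; a real collector would also need the persistence shown in the figure.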
Event Federation 2
• Monitors are event expression recognition agents.
  • Recognize Event
  • Evaluate Conditions
  • Act
• EQL (Event Query Language) implements a compositional semantics for event expressions.
  • Composite events are “first order” events.
  • Monitors can monitor monitors.
• Clock events are part of the language implementation.
  • Easy to write queries with temporal constraints.
• Monitors are generated by users or programmatically.
[Figure: the EventWorks Event Router passes Streams to Monitor 1 … Monitor N, each defined in EQL; monitors emit New Events]
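The recognize / evaluate / act cycle, and the idea that monitors can monitor monitors, can be sketched with a tiny router. This is not EQL or EventWorks: the `Monitor` and `Router` classes, the event dictionaries, and the 30-degree threshold are hypothetical, and clock events are left out.

```python
class Monitor:
    """Recognize an event, evaluate a condition, then act by emitting a new event."""

    def __init__(self, recognize, condition, act):
        self.recognize = recognize
        self.condition = condition
        self.act = act

    def handle(self, event):
        if self.recognize(event) and self.condition(event):
            return self.act(event)
        return None


class Router:
    """Deliver each event to every monitor; emitted events are routed again."""

    def __init__(self):
        self.monitors = []
        self.log = []

    def publish(self, event):
        self.log.append(event)
        for monitor in self.monitors:
            new_event = monitor.handle(event)
            if new_event is not None:
                self.publish(new_event)  # composite events are ordinary events


def build_router():
    router = Router()
    hot_seen = {"n": 0}

    def second_hot(_event):
        # Condition with state: true from the second "hot" event onward.
        hot_seen["n"] += 1
        return hot_seen["n"] >= 2

    # Monitor 1: a temperature reading above 30 becomes a "hot" event.
    router.monitors.append(Monitor(
        recognize=lambda e: e["type"] == "temp",
        condition=lambda e: e["celsius"] > 30,
        act=lambda e: {"type": "hot", "sensor": e["sensor"]},
    ))
    # Monitor 2 watches the output of monitor 1 and raises an alarm.
    router.monitors.append(Monitor(
        recognize=lambda e: e["type"] == "hot",
        condition=second_hot,
        act=lambda e: {"type": "alarm", "sensor": e["sensor"]},
    ))
    return router


if __name__ == "__main__":
    router = build_router()
    for celsius in (31.0, 20.0, 35.0):
        router.publish({"type": "temp", "sensor": "s1", "celsius": celsius})
    print([e["type"] for e in router.log])
```

Because emitted events go back through `publish`, the second monitor needs no special wiring to observe the first; that is the sense in which composite events are treated like any other event.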
D2K : A Framework For Building Data-Driven Apps – Persistent Stream Data Analytics Foundation
Designed for Building and Maintaining Complex Persistent and Stream Data-Driven Applications
http://alg.ncsa.uiuc.edu
D2K/T2K/I2K: Data, Text, and Image Analysis
http://alg.ncsa.uiuc.edu
LOOKING: Stream Data Analytics/Information Visualization scientific “dashboard”
• Online Stream Query Engine: uses novel methods to do real-time stream data analysis.
• Online Frequent Pattern Mining: discovers association and correlation rules in data stream environment.
• Online Stream Classification: adaptable to the changes and evolution of data streams.
• Online Clustering of Data Streams: detects outliers and finds evolution of clusters in data streams.
CI Issues Architecture – NEON/CLEANER
Real-time Visualization of RFID people location sensors: Supercomputing IntelliBadge™
Atmospheric Science: Analytic Feature Extraction Scientific Visualization Techniques
LOOKING: Scientist Analytical/Spatial-temporal Visualization Techniques
LOOKING/Optiputer/Planetary Collaboratory
1024 Processor Altix
3 TB Shared Memory
>300 TeraBytes Disk
8 X 8 Processor 4 Pipe, 16 gig Memory each, Prisms coupled with InfiniBand for On-demand, Interactive
U of W
NLADR Tier 1 Architecture
… Data-Driven Science
• Collaboration
  • Information Gathering (experiments, simulation, observation – calendar of upcoming activities)
• Data Management
  • Generation and Publishing of Data (experiments, simulation, or observation)
  • Persistent Data Stores (Distributed Data Management)
  • Stream Data Management (Event Management)
• Detection
  • Mining of new types of data, such as large static data stores (>>1TB), streams, networks, …
  • Behavior Characterization (atypical, surprising, normal)
• Discovery
  • Hypothesis Generation
• Collaboration
  • Focusing results for testing and validation