National Center for Supercomputing ApplicationsUniversity of Illinois at Urbana-Champaign
Geospatial Informatics:
From Raw Data To Information and To KnowledgePresenters:Peter Bajcsy, Michal Ondrejcek, Rob Kooper and Luigi MariniNCSA/UIUC
Outline• Background to Geospatial Informatics • Introduction to Geospatial Informatics • Prediction Modeling from Spatially Sparse Data
• A Framework for Computer-Assisted Learning of Spatially Dense Recharge/Discharge Rate Models
• Problems Addressed by Spatial Pattern To Learn (SP2Learn)• SP2Learn Architecture and Functionality Overview• Running SP2Learn
• Other Similar Solutions• An Exploratory Framework for Extracting Information and Knowledge
from Remote Sensing Imagery• Integration of Ground and Remote Sensing Variables for Better
Understanding of Algal Biomass Bloom in Illinois
• Summary and Open Problems
3
Background• Funded collaborations in the interdisciplinary areas related
to hydrology, environmental engineering, water quality, ground water and computer science:• ISWS: Momcilo Markus, Yu-Feng Lin• CEE UIUC: Praveen Kumar, Barbara Minsker • UNESCO-IHE Netherlands, Instituto Tecnológico de Costa Rica,
• On-going efforts:• The current collaborations and many other preliminary collaborations
have led to proposals and are currently pending.• NCSA Petascale proposal to NSF
• Terra/Peta/Exa/Zetta/Yotta-scale Computing Will Require Large Investments
• Many Times Large Investments Have to Be Driven by Large Impact Application Drivers
• Institute for Advanced Computation and Technology (IACAT): Advanced Information Theme
• NSF funded efforts to build Earth Observatories
5
Earth Observatories• Multiple efforts to build Earth
Observatories to better understand • our environment in general• the principles governing changes in nature• the decision making required in order to
address the urban and rural environment sustainability challenges.
• Example National Efforts in US: • WATERS, CUAHSI, CLEANER, NEES,
GEON • NEON, ORION, LOOKING, LTER, CZEN,
…
6
WATERS
CUAHSI
Earth Observatories: Challenges • Task: Design, Implement, Test, Deploy, Expand
and Maintain Earth Observatories• Challenges:
• It cannot be done without a multi-disciplinary effort• It is based on rapidly changing information technology
stacks• It has to scale across multiple geographic sites and
continents• It is being built based on community consensus which
has to be reached at some level to make any progress • How can we help building Earth Observatories?
7
Common Requirements• Data:
• Access large amounts of hydrologic, geographic, meteorological, water quality, soil type, land-use and many other types of data
• Ingest and integrate heterogeneous large size data and streaming data• Computational Resources:
• Perform complex CPU and memory intensive data-driven analyses• Utilize a spectrum of distributed computational resources
• Data-driven Analyses (Software):• Design data-driven (data mining, machine learning, pattern recognition,
statistical) analyses• Visualize and interpret data-driven models• Integrate data-driven models with physics/chemistry/bio based models
• Data and Software Integration:• Exercise seamlessly functionality present in heterogeneous software
packages using available computational resources• Provide an environment where heterogeneous visualization and mining
tools could be integrated into workflows, and the analysis workflows could be re-used and modified.
10
X-informatics Problems• X-informatics Problems: Given some hardware,
software and data, extract information and knowledge from data
• X-informatics Objectives: • Analytical capabilities
• Data-driven analyses, machine learning, pattern recognition, modeling
• Automation• Integration of data and software,
computational scalability, provenance gathering
• Learning• Scientific discovery by visual exploration,
decision support, sharing and collaboration
Terminology
• Raw data: “Anything that can be ingested by a computer”
• Information: “Meaning of raw data that is narrow in scope and it has a simple organization”
• Knowledge: “Interpretation of information that is broad in scope and it is orderly synthesized”
12
Terminology• Informatics (or information science):
• “The process of going from raw data/measurements to information and knowledge”
• X-Informatics:• “Informatics with the focus on inter-
disciplinary research and development of information technology that bridges boundaries across hardware, software and domain users”
13
Why X-Informatics?
• Plethora of end-to-end engineering efforts in many scientific disciplines, where X-informatics is the key component
X-informatics Approaches
• Kumar P., J. Alameda, P. Bajcsy, M. Folk and M. Markus, “Hydroinformatics : Data Integrative Approaches in Computation, Analysis, and Modeling,” CRC Press LLC, 2006.
• Bajcsy P., “A Perspective on Cyberinfrastructure for Water Research Driven by Informatics Methodologies,” Geography Compass, Volume 2, Issue 6 (p 2040-2061), 2008 Blackwell Publishing Ltd
• Bajcsy P., R. Kooper, L. Marini and J. Myers, “Community-Scale Cyberinfrastructure for Exploratory Science,” In: Cyberinfrastructure Technologies and Applications book, Editor: Junwei Cao, Nova Science Publishers, Inc., Chapter 12, 1st quarter of 2009,
17
General Problem
• Compute a set of geo-spatially dense accurate predictions of variables • given a set of direct geo-spatially sparse point
measurements and • auxiliary variables with implicit relationships with
respect to the predicted variable• Motivation:
• minimize cost of taking direct point measurements• maximize accuracy of predictions and • automate discovering relationships among direct
field measurements and indirect variables
Formulation• Input: sets of geo-spatially sparse variables {Vi{pij}} & dense
auxiliary variables & a priori tacit knowledge of experts• Output: geo-spatially dense (raster) {Ok}• Unknown: selection of methods & workflow of
operations/methods & parameters of methods & relationships of auxiliary variables w.r.t Ok & quantitative metric of output goodness
p1j
Interpolations
Mathematical models
p2j
V1 & V2 O1Auxiliary Variables & Tacit Knowledge
Applied Problem
Discharged Recharged
Recharge and DischargeRate Prediction
Bedrock elevation
Water table elevation
Interdisciplinary Objectives
• Ground Water (Hydrologic Science) View:• Evaluation of Alternative Conceptual (implicit
relationships) and Mathematical Models (explicit relationships)
• Accurate Prediction of Groundwater Recharge and Discharge Rates from Limited Number of Field Measurements
• Computer Science View:• Computer-Assisted Learning to Assess
Alternative Conceptual and Mathematical Models• Optimization of Prediction Models From a Set of
Geo-Spatially Sparse Point MeasurementsDIALOG
21
State-of-the-Art Results• Limited Spatial Resolution and Accuracy
Min. Grid:805mX805m
Recharge zone Noisy pattern or weak R/D
Discharge zone
Discharge
Recharge
Uniform Grid:80mX80m
• MODFLOW is a three-dimensional finite-difference ground-water model • http://water.usgs.gov/nrp/gwsoftware/modflow2005/modflow2005.html -
freeware (2005)• PEST - is software for model calibration, parameter estimation and
predictive uncertainty analysis• http://www.sspa.com/pest/ - freeware (2007); University of Queensland,
Australia • Precipitation-Runoff Modeling System (PRMS) – is deterministic,
distributed-parameter modeling system developed to evaluate the impacts of various combinations of precipitation, climate, and land use on streamflow, sediment yields, and general basin hydrology • http://water.usgs.gov/software/prms.html - freeware (1996); USGS
• Deep Percolation Model (DPM) - facilitates estimation of ground-water recharge under a large range in climatic, landscape, and land-use and land-cover conditions • http://pubs.usgs.gov/sir/2006/5318/; USGS
Existing Software for Groundwater and Surface Water Modeling
Related Work
• Singh A. et al. “Expert-Driven ‘Perceptive’ Models for Reducing User Fatigue in an Interactive Hydrologic Model Calibration Framework”
Conductivity (K) and Hydraulic heads (H) for the hypothetical aquifer
Motivation• Ground Water (Hydrologic) Science:
• Currently, there is no single method that could estimate R/D rates and patterns for all practical applications.
• Therefore, cross analyzing results from various estimation methods and related field information is likely to be superior than using only a single estimation method.
• Computer Science :• It is currently impossible
• (a) to replace an expert with a lot of tacit domain knowledge by computer algorithms or
• (b) to learn by an expert new I/O relationships from a plethora of possible variables and an extremely large space of processing methods and their parameters
• Thus, assisting experts to discover, evaluate and validate new relationships in an iterative way will likely enable
• (a) better understanding of the underlying phenomena, and • (b) more automated and cost-efficient predictions
Our Approach• Data-Driven Analyses to Test Alternative Models, and
to Search the Space of Processing Operations and Their Parameters• Interpolation methods• Mathematical models• Image processing algorithms• Machine learning algorithms• Scalability of algorithms with large size data
• Computer-Assisted Comparisons and Evaluations of Multiple Models and Sub-Optimal Solutions• Model/Solution Representation • Closed Loop (Iterative) Workflows• Human Computer Interfaces
• Overall Approach: An Exploration Framework for a Class of Alternative Models/Hypotheses and Optimal Solutions
SP2Learn Problem Formulation
• Given a set of geo-spatially sparse field measurements and auxiliary variables, derive accurate, spatially dense, R/D rate map by
• (a) using physics-based model• (b) incorporating boundary conditions and • (c) exploring auxiliary variables representing prior
knowledge about R/D patterns but missing in the physics-based model
Challenges• (1) How to Recognize ‘Meaningful’ Pattern of Predicted
Map?• (2) How to Quantify the Goodness of the Pattern?
Approach:• (1a) Recognize patterns by utilizing multiple image
enhancement and segmentation techniques applied to R/D rate predictions
• (1b) Introduce relationship between R/D pattern and auxiliary (a priori reference) information
• (2a) Define goodness w.r.t. reference information using expert’s selection of ‘meaningful’ relationships
• (2b) Define goodness w.r.t. reference information using complexity of machine learning
Ground water flux=hydraulic conductivity * cell area * gradient of water table elevation (head) over cell distance
Using Physics-Based Model
Incoming water Outgoing waterBed rock
elevation +
Water table elevation +
Hydraulic conductivity +
Discharged Recharged
Field MeasurementsR/D Rate Prediction
+ + ++++ +
+
+
+
++
+
++
* * dhQ K AdL
=
Incorporating Spatial Boundary Conditions
• BC: R/D rate prediction could have smooth transitions and recharge & discharge regions (contiguous pixels) should be clearly delineated
• Approach: Apply Image Restoration and De-noising Techniques• Moving average based low pass filter• TVL (Total Variation regularized L1-norm function) based filter• Morphological operation based filter• Using multiple techniques multiple times
Discharged Recharged
Exploring Auxiliary Variables Driving R/D Patterns
moving average normalization+TVL moving averagenormalization+TVL
Proximity to River: P(R or D area/River is close)~high
Soil Type: P(R or D area/Soil=Clay)~low
Slope: P(R or D area/ slope=high)~low
Prior Tacit Knowledge about R/D and Auxiliary Variables
From Auxiliary Variables To Knowledge and Accurate R/D
Load Variables Integrate Maps
Define ROIMap Ranking
Load R/D Map
Learn from Data-Driven Rules
33
SP2Learn Output• A set of rules that define relationships between
predicted (R/D rate) variable and auxiliary variables
• Modified (more accurate) predictions according to the user selected rules defining relationships of predicted and auxiliary variables
• Sensitivity analysis results with respect to • Methods (interpolations, image enhancement, …)• Models• Parameters
Example Results
• <RULE ID=138 NUM_OF_CASES=3975 SUPPORT=32.65%>• <IF>Elevation is not in {330-344} AND• Soil type is in {Rm=Roscommon muck} AND• Proximity to water body is not {near_water} AND• Slope is in {0-0.9} </IF>• <THEN>R/D rate is -0.004,-0.002</THEN>
= +
ROI
SP2Learn Functionality Overview
Load RasterStep
IntegrationStep
RulesStep
User Driven Attribute Selection Step
Apply RuleStep
Create Mask Step
Attribute RankingStep
Registration
• Integration of all maps (raster images) to a common projection and spatial resolution
Before “Convert” After “Convert”
Attribute Selection
• Output: Predicted Variable
• Input: Auxiliary Variables
• Check-boxes• Show Table• Prune Tree
Decision Tree Based Modeling
• Tree structure can be represented as a set of rules
Discharge
yes
noyes
no
Recharge DischargeCase A..
Case E.. Case J..
Distance from river ≤ 100 ft?
Soil Type is {sand}?
Rules from Decision Tree• Num: Node number in a
decision tree. • Support(%): Among all
cases satisfying conditions, the ratio of cases having the same class (conclusion).
• # of cases: The number of cases satisfying conditions
• Class: Conclusion of a rule
• Conditions: Conditions of a rule
• MDL Score: MDL score of a decision tree. The less the score is, the better the tree is
47
Apply Rules
• Visualization of• Modified output variable• Changed pixels• Magnitude of changes (differences)
National Center for Supercomputing Applications
GeoLearn
Praveen KumarAssoc. Prof.Civil & Env. Eng.
Peter BajcsyRes. Scientist, NCSA
David TchengNCSA
Wei-Wen FengGrad Student Comp. Sci.
Pratyush SinhaGrad Student Civil and Env. Eng.
Amanda WhiteGrad Student Civil & Env. Eng.
Vikas MehraGrad Student Civil & Env. Eng. David Clutter
NCSA
Ricky RobertsonPost Doc Civil Eng.
Rob KooperNCSA
Ben RuddellGrad Student Civil & Env. Eng.
Qi Li, Grad Student, Dept. of Geology
Chulyun Kim, Grad Student, Comp. Sci.
Yakov Keselman, NCSA
Tim Nee, Undergraduate Student, NCSA
Types of Research Using Geolearn
• (1) Sensitivity Studies• Data integration and masking variables• Data modeling variables
• (2) Modeling Studies: • Find Common Characteristics
• E.g., using clustering• Find Deviations
• Data-driven modeling• Prediction analysis
• (3) Variable Relevance Ranking52
Deviations ~ Regression Tree Based Modeling of NDVI
corn &soybeans
irrigated river valleymany crops
May 2004, Residual=|RT predicted – measured|
Example of Variable Relevance Ranking
• Prediction of Chl-A in May 2002 using Support Vector Machine (SVM) data-driven modeling technique with linear kernel
• Example conclusion: Chl-A is likely to depend on temperature and unlikely to depend on elevation
Adding VariablesTemperature Slope & Elevation
Ground Measurements
Remote Sensing Measurements
Spatially sparse and temporally incomplete ground measurements & spatially dense and temporally sparse remote sensing measurements of temperature
Summary
• Sp2Learn, Geolearn and other solutions are focusing primarily on supporting analytical flow
• They are java-based desktop solutions• They address software and data integration,
as well as machine learning and exploratory analyses
Imaginations unbound
Open Problems• Automation
• Flexible creation of scripts/workflows • Provenance gathering
• Data sharing and collaboration• Data sharing is supported by exporting intermediate and
resulting files• Parameter sharing?• Visualization sharing?
• Computational and visualization scalability• Out-of-core representation• Utilization of distributed or high-performance computational
resources?
Imaginations unbound
Example: Common Characteristics ~ Eco-region level II
Unsupervised K-means Clustering of Large Size Data ~ (6 Million Pixels) x (29 Variables) x (8 bytes) = 1.4GB
Example: Processing and Visualization of Large Size Data
• Out-of-core representation
• Total file size:372MB
• Total memoryfootprint:less than 30MB
Scalability of algorithms ? Grid Computing ? Algorithms behind web services ?
60
Software and Test Data Download
• Sp2Learn and Geolearn• Download web page of Image Spatial
Data Analysis group at NCSA: http://isda.ncsa.uiuc.edu/download/
Thank you!
• Questions: • Peter Bajcsy [email protected]
• Need More Details• Sp2Learn Project URL:
http://isda.ncsa.uiuc.edu/groundwater/• GeoLearn Project URL:
http://isda.ncsa.uiuc.edu/geolearn/geolearn.html
• Publications: http://isda.ncsa.uiuc.edu
61
Acknowledgement
• This research was partially supported by National Aeronautics and Space Administration (NASA), the Faculty Fellow Program at National Center for Supercomputing Applications (NCSA), the Illinois State Water Survey (ISWS) and by NCSA Industrial partners
• The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the sponsors.
• Contributions by the members of the Image Spatial Data Analysis (ISDA) Group at NCSA and our collaborators from ISWS, CEE UIUC and International Institutions
Imaginations unbound
Model Interpretation by Input Variable Relevance Assignment
Pixel TN EVI
1 3.38 0.22
2 5.1 0.3
Extracted Table
Multiple Learning Methods
SVMKNNRT
Relevance Analyzer
VisualizationModel Relevance
Sample Relevance
Prediction Result
Prediction Error
Point and Raster Integration
Pixel TN
1 3.38
2 5.1