Big data for big river science: data intensive tools, techniques, and projects at the USGS/Columbia Environmental Research Center
Ed Bulliner
U.S. Geological Survey, Columbia Environmental Research Center
Goals of Presentation
• How are the data available to us different from those of the past?
• What different approaches are needed to analyze these data?
• What questions are we asking and answering that we could not before?
• ‘Big river science’: four examples
• How does this relate to NRDAR/ecological restoration?
“Big Data”
• What is “big data”?
  • Emerging field
  • Several definitions: volume, variety, variability
• Do we work with ‘big data’ or ‘lots of data’?
  • Is that distinction important?
• Regardless of semantics, increasing scale and complexity of problems and necessary data
• What do increasing amounts of data mean for science and scientists?
• How do we get the most value from the data available to us?
• Why is this important?
Data Intensive Science
• Paradigm shift in how we do science
• Can ask (and answer) new kinds of questions
• New tools and techniques
Traditional versus Data-Intensive Analyses
• Where do we see ‘data-intensive’ science?
  • Within river science?
  • Within USGS/government?
• Why now? (what’s different?)
  • Data availability
  • Data resolution
  • Computational power
• What are the different tools and approaches currently used?
Tools for Data-Intensive Analyses
• Data storage
  • Increased hard drive space
  • Databases
• Data manipulation
  • Scripting languages
  • Web scraping/data ‘munging’/data mining
• Modeling
  • Scripting languages
  • Modeling packages
• Data visualization
Python
OS Operations
Web Queries
Database Integration
IDL
ArcGIS & ArcPy
Data Visualization
Statistics
• General purpose scripting language
• Lots of modules
• Free*
• Tools for:
  • Data management
  • Data filtering/cleaning
  • Scientific computing
  • Geospatial analyses
  • Plotting
  • Collaborating
Pretty cool, but what can we use it for?
Question: Where do riverine sandbars exist and how do they change over time?
• Create database of rivers and flows
• Mask active channel within overlap of rivers and Landsat images
• Integrate Landsat metadata with corresponding discharge data through relational database
• Query imagery by discharge/date
• Automated download and analysis of imagery: time series of sandbars
• Identified areas of persistent sand
• Investigated flows where sand was exposed
• Examined spatial variation
• Used metrics of exposure to help model success of Least Tern nests
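The database step above can be sketched as follows. This is a minimal illustration using Python's built-in `sqlite3` module, not the project's actual schema: the table names (`discharge`, `scene`), column names, gage and scene identifiers, and all values are hypothetical. It joins image metadata to daily discharge by gage and date, then selects scenes acquired within a discharge range.

```python
import sqlite3

# In-memory database; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE discharge (
    gage_id  TEXT,
    obs_date TEXT,
    flow_cms REAL,
    PRIMARY KEY (gage_id, obs_date)
);
CREATE TABLE scene (
    scene_id TEXT PRIMARY KEY,
    gage_id  TEXT,
    acq_date TEXT
);
""")

# Made-up example rows: daily discharge at one gage and three image dates.
conn.executemany("INSERT INTO discharge VALUES (?, ?, ?)", [
    ("G1", "2010-07-01", 850.0),
    ("G1", "2010-07-17", 1400.0),
    ("G1", "2010-08-02", 620.0),
])
conn.executemany("INSERT INTO scene VALUES (?, ?, ?)", [
    ("S001", "G1", "2010-07-01"),
    ("S002", "G1", "2010-07-17"),
    ("S003", "G1", "2010-08-02"),
])


def scenes_by_discharge(conn, gage_id, min_flow, max_flow):
    """Return (scene_id, acq_date, flow_cms) for scenes acquired while
    discharge at the gage was within [min_flow, max_flow]."""
    return conn.execute(
        """
        SELECT s.scene_id, s.acq_date, d.flow_cms
        FROM scene AS s
        JOIN discharge AS d
          ON d.gage_id = s.gage_id AND d.obs_date = s.acq_date
        WHERE s.gage_id = ? AND d.flow_cms BETWEEN ? AND ?
        ORDER BY s.acq_date
        """,
        (gage_id, min_flow, max_flow),
    ).fetchall()


# Scenes taken at low flow, when sandbars are most likely to be exposed.
low_flow_scenes = scenes_by_discharge(conn, "G1", 0.0, 900.0)
```

The resulting list of scene identifiers could then drive a scripted download and analysis loop, so that imagery is selected by flow condition rather than by hand.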
Main Points
• Scripts and databases allow for automated downloading and linking of multiple data types
• Too much data for manual analysis
• Python can be used to batch-process images across programs without manual intervention
• Scripted tools can be used to directly query, plot, and perform statistics on image data
Question: What information can we synthesize from a 400+ day archive of field measurements?
[Figure: ADCP velocity cross-section. Explanation: velocity, in cubic meters per second (scale 0 to 2.5, fast to slow); velocity ensemble; velocity bin; river bottom; water column; 4-beam depths]
• Velocities and depths measured along regular transects
• Lateral, longitudinal, and vertical variability
[Figure: ADCP and single-beam survey dates, locations, and discharges, 2000-2015. Explanation: flow percentile; Low <25%, Medium 25-75%, High >75%]
• Compiled over 32,000 individual cross-sections from 2000-2015
• Joined dataset to river mile and gage to allow discharge-specific queries
• Can group data by location along river and varying discharge levels to compare
• Ongoing restoration question: how does habitat (velocity) compare in river chutes versus the main channel?
  • Chutes = restoration
  • 37 field days where measurements in chutes were taken incidentally or deliberately
• Can use geospatial tools and scripts to come up with relevant comparisons
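A discharge-specific comparison of this kind can be sketched in a few lines. The example below is an assumption-laden illustration, not the project's code: the record layout (`habitat`, `discharge`, `velocity`), the gage record, and all numbers are made up. It assigns each survey a flow class using the Low <25%, Medium 25-75%, High >75% percentile breaks, then averages velocity by habitat and flow class.

```python
from bisect import bisect_right
from collections import defaultdict
from statistics import mean


def flow_percentile(discharge, sorted_record):
    """Empirical percentile: share of the gage record at or below this discharge."""
    return 100.0 * bisect_right(sorted_record, discharge) / len(sorted_record)


def flow_class(percentile):
    """Classify a flow percentile as Low (<25), Medium (25-75), or High (>75)."""
    if percentile < 25:
        return "Low"
    if percentile <= 75:
        return "Medium"
    return "High"


def mean_velocity_by_group(surveys, gage_record):
    """Average survey velocity for each (habitat, flow class) pair."""
    record = sorted(gage_record)
    groups = defaultdict(list)
    for survey in surveys:
        cls = flow_class(flow_percentile(survey["discharge"], record))
        groups[(survey["habitat"], cls)].append(survey["velocity"])
    return {key: mean(vals) for key, vals in groups.items()}


# Hypothetical gage record and survey summaries (values are illustrative only).
gage_record = [100, 200, 300, 400, 500, 600, 700, 800]
surveys = [
    {"habitat": "chute", "discharge": 150, "velocity": 0.5},
    {"habitat": "chute", "discharge": 450, "velocity": 0.75},
    {"habitat": "main", "discharge": 150, "velocity": 1.0},
    {"habitat": "main", "discharge": 750, "velocity": 1.5},
    {"habitat": "main", "discharge": 800, "velocity": 2.0},
]
summary = mean_velocity_by_group(surveys, gage_record)
```

Grouping on a percentile class rather than raw discharge lets surveys from different reaches and gages be compared on a common footing.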
Measurement archive in lieu of hydrodynamic model: sturgeon spawning locations?
Main Points
• Scripts and databases allow for efficient querying and cleaning of archived datasets
• Python can be used to quickly and interactively summarize datasets by specific groupings
• Existing data can be repurposed and integrated with new data for value-added analyses using scripting
Question: How can we better visualize field measurements of channel velocity and bathymetry?
• Measurements of velocity collected along ‘regular’ transects
• Python used to interpolate data into structured grid (3d matrix)
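The slides do not say which interpolation method was used. As one possible sketch, the example below uses inverse-distance weighting (IDW) with NumPy to place scattered velocity samples onto a regular 3-D grid; the function name `idw_to_grid`, the sample coordinates, and the velocities are all hypothetical. It builds a dense node-by-sample distance matrix, so it only suits small inputs; a dataset of tens of millions of points would need a neighbor search or chunked processing.

```python
import numpy as np


def idw_to_grid(points, values, grid_axes, power=2.0, eps=1e-12):
    """Interpolate scattered samples onto a regular 3-D grid with
    inverse-distance weighting.

    points    : (n, 3) array of sample coordinates (x, y, z)
    values    : (n,) array of sampled values (e.g. velocity)
    grid_axes : three 1-D arrays giving the grid coordinates on each axis
    """
    gx, gy, gz = np.meshgrid(*grid_axes, indexing="ij")
    nodes = np.stack([gx.ravel(), gy.ravel(), gz.ravel()], axis=1)
    # Distance from every grid node to every sample: shape (n_nodes, n).
    dist = np.linalg.norm(nodes[:, None, :] - points[None, :, :], axis=2)
    # eps avoids division by zero where a node coincides with a sample.
    weights = 1.0 / (dist ** power + eps)
    estimate = (weights @ values) / weights.sum(axis=1)
    return estimate.reshape(gx.shape)


# Hypothetical samples: (x, y, depth) positions with made-up velocities.
points = np.array([
    [0.0, 0.0, 0.0],
    [10.0, 0.0, 0.0],
    [0.0, 5.0, 1.0],
    [10.0, 5.0, 1.0],
])
values = np.array([0.5, 1.0, 0.75, 1.25])
axes = (np.linspace(0, 10, 5), np.linspace(0, 5, 3), np.linspace(0, 1, 2))

grid = idw_to_grid(points, values, axes)  # 3-D matrix of shape (5, 3, 2)
```

A structured array like `grid` can then be written to a format that visualization tools read, since every value has a fixed position in x, y, and z.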
Paraview
• Can visualize flowlines around structures (biology)
• Identified bias in field measurements?
• Noticed systematic bias
• Collaborating with ILWSC
33 million+ data points!
Main Points
• Python scripts allow for interpolation and visualization of field data
• Using open-source (free) tools along with Python allows for replication of abilities from more expensive software
• New insights can be gained from visualizing data in different ways
Question: How can we better characterize inundation patterns along the Missouri River?
• Hydrodynamic (HEC-RAS) model provided by USACE describing water surface elevations at cross sections over time
• Used scripting to extend cross sections across floodplain for Missouri River
• Merged LIDAR and channel data provides high-resolution characterization of floodplain elevation
• Spatial interpolations of water elevation
• Calculations of inundation depths
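The last two steps can be illustrated with a simplified one-dimensional sketch. It assumes each floodplain cell has already been assigned a position (station) along the river, interpolates water surface elevation linearly between modeled cross sections, and subtracts ground elevation to get depth. The function names, stations, and elevations are hypothetical, and the actual workflow may use a different spatial interpolation.

```python
from bisect import bisect_right


def interpolate_wse(station, xs_stations, xs_wse):
    """Linearly interpolate water surface elevation at a river station
    from cross-section values; xs_stations must be sorted ascending."""
    if station <= xs_stations[0]:
        return xs_wse[0]
    if station >= xs_stations[-1]:
        return xs_wse[-1]
    i = bisect_right(xs_stations, station)
    x0, x1 = xs_stations[i - 1], xs_stations[i]
    w0, w1 = xs_wse[i - 1], xs_wse[i]
    return w0 + (w1 - w0) * (station - x0) / (x1 - x0)


def inundation_depth(ground_elev, station, xs_stations, xs_wse):
    """Water depth at a cell: interpolated water surface minus ground,
    with dry cells reported as 0.0."""
    wse = interpolate_wse(station, xs_stations, xs_wse)
    return max(wse - ground_elev, 0.0)


# Hypothetical cross sections (station in m, water surface elevation in m).
xs_stations = [0.0, 1000.0, 2000.0]
xs_wse = [180.0, 179.0, 178.0]

# Hypothetical floodplain cells as (station, ground elevation).
cells = [(500.0, 178.5), (1500.0, 179.0), (2000.0, 177.0)]
depths = [inundation_depth(g, s, xs_stations, xs_wse) for s, g in cells]
```

Repeating this for every cell and every modeled date yields one depth raster per day, which is the base unit used for the return interval statistics.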
Inundation return interval statistics
Base unit for calculations: 1 date, water depth raster grid (30m) for 1 area
Time series of rasters, 1 per day for 29,892 modeled days
…n dates…
Stack over time
Structured 3-dimensional matrix of data
x and y are geospatial coordinates (raster dimensions); z is the time coordinate (29,892 days)
Water depth for each x, y, z
Data structured as hierarchical data format (hdf) on disk to allow computationally efficient slicing in time domain
Setting inundation threshold allows for identification of inundated periods per pixel
Can aggregate data by year
Evaluate inundation status by criteria (such as longest consecutive inundated period during growing season)
Summarize metrics across all modeled years
…n years…
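The thresholding and per-pixel criteria above can be sketched with NumPy. In this illustration the array is ordered (time, y, x), the threshold of 0.1 is arbitrary, and the depths and year labels are made up; the function names are hypothetical. It finds the longest run of consecutive inundated days per pixel, over the whole record and within each year.

```python
import numpy as np


def longest_run_per_pixel(wet):
    """Longest run of consecutive True values along the time axis.

    wet : boolean array ordered (time, y, x)
    Returns an integer array of shape (y, x).
    """
    best = np.zeros(wet.shape[1:], dtype=int)
    current = np.zeros(wet.shape[1:], dtype=int)
    for day in wet:
        # Extend the run where the pixel is wet, reset it where dry.
        current = np.where(day, current + 1, 0)
        best = np.maximum(best, current)
    return best


def longest_run_by_year(wet, years):
    """Longest inundated run per pixel within each year.

    years : 1-D array giving the year label of each time step
    """
    return {int(y): longest_run_per_pixel(wet[years == y]) for y in np.unique(years)}


# Toy stack: 6 days, 1 row, 2 columns of water depth (illustrative values).
depth = np.zeros((6, 1, 2))
depth[:, 0, 0] = [0.0, 0.3, 0.4, 0.0, 0.2, 0.0]
depth[:, 0, 1] = [0.5, 0.5, 0.5, 0.5, 0.0, 0.0]
years = np.array([2000, 2000, 2000, 2001, 2001, 2001])

threshold = 0.1  # assumed inundation threshold
wet = depth > threshold

longest = longest_run_per_pixel(wet)
by_year = longest_run_by_year(wet, years)
```

Grouping by year splits runs that cross a year boundary, which is usually acceptable when the criterion is restricted to a growing season that falls within one calendar year.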
Main Points
• Python scripts allow for dealing with data too big for one computer
  • Processing across virtual machines
  • Processing large files
• Time-series analyses on large datasets are useful for answering management questions
• Computational models are a useful supplement to field data
Data Intensive Restoration?
• There have been many attempts at ecological restoration
• Meta-analysis of restoration success is nothing new
• What data are available to us in USGS/DOI that might lend themselves to these approaches?
• What data are needed by people implementing NRDAR restoration?
• How can NRDAR projects contribute useful information?
NRDAR Case Map and Document Library
Conclusions
• As scientists, we work in an expanding world of ‘big data’
• We can’t analyze data by ourselves: need tools
• Sharing data is important
• Ongoing projects are just beginning to utilize the scope of available datasets and capabilities of tools like Python
• What existing data are not fully utilized?
• Think big
• Add value
Questions?