
Big data for big river science: data intensive tools, techniques, and projects at the USGS/Columbia Environmental Research Center

Ed Bulliner, U.S. Geological Survey, Columbia Environmental Research Center

Goals of Presentation

• How are the data available to us different from those of the past?

• What different approaches are needed to analyze these data?

• What questions are we asking and answering that we could not before?

• ‘Big river science’ – four examples

• How does this relate to NRDAR/ecological restoration?

“Big Data”

• What is “big data”?

• Emerging field

• Several definitions – volume, variety, variability

• Do we work with ‘big data’ or ‘lots of data’?

• Is that distinction important?

• Regardless of semantics, increasing scale and complexity of problems and necessary data

• What do increasing amounts of data mean for science and scientists?

• How do we get the most value from the data available to us?

• Why is this important?

Data Intensive Science

• Paradigm shift in how we do science

• Can ask (and answer) new kinds of questions

• New tools and techniques

Traditional versus Data-Intensive Analyses

• Where do we see ‘data-intensive’ science?

  • Within river science?

  • Within USGS/government?

• Why now? (what’s different?)

  • Data availability

  • Data resolution

  • Computational power

• What are the different tools and approaches currently used?

Tools for Data-Intensive Analyses

• Data storage

  • Increased hard drive space

  • Databases

• Data manipulation

  • Scripting languages

  • Web scraping/data ‘munging’/data mining

• Modeling

  • Scripting languages

  • Modeling packages

• Data visualization

Python

OS Operations

Web Queries

Database Integration

IDL

ArcGIS & ArcPY

Data Visualization

Statistics

• General purpose scripting language

• Lots of modules

• Free*

• Tools for:

  • Data management

  • Data filtering/cleaning

  • Scientific computing

  • Geospatial analyses

  • Plotting

  • Collaborating

Pretty cool, but what can we use it for?

Question: Where do riverine sandbars exist and how do they change over time?

• Create database of rivers and flows

• Mask active channel within overlap of rivers and Landsat images

• Integrate Landsat metadata with corresponding discharge data through relational database

• Query imagery by discharge/date

• Automated download and analysis of imagery – timeseries of sandbars
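The linking step above can be sketched with Python's built-in `sqlite3` module. This is only a minimal illustration of joining image metadata to discharge by gage and date; the table names, column names, scene identifiers, and flow values are all hypothetical and are not taken from the project's actual database.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE scenes (scene_id TEXT PRIMARY KEY, gage_id TEXT, acq_date TEXT);
CREATE TABLE discharge (gage_id TEXT, obs_date TEXT, flow_cms REAL,
                        PRIMARY KEY (gage_id, obs_date));
""")

# Made-up rows standing in for Landsat metadata and daily gage records.
conn.executemany("INSERT INTO scenes VALUES (?, ?, ?)", [
    ("scene_a", "gage_1", "2010-07-01"),
    ("scene_b", "gage_1", "2010-07-17"),
    ("scene_c", "gage_1", "2010-08-02"),
])
conn.executemany("INSERT INTO discharge VALUES (?, ?, ?)", [
    ("gage_1", "2010-07-01", 850.0),
    ("gage_1", "2010-07-17", 2400.0),
    ("gage_1", "2010-08-02", 640.0),
])


def scenes_below_flow(connection, max_flow_cms):
    """Return (scene_id, date, flow) for scenes acquired at or below a discharge."""
    query = """
        SELECT s.scene_id, s.acq_date, d.flow_cms
        FROM scenes AS s
        JOIN discharge AS d
          ON d.gage_id = s.gage_id AND d.obs_date = s.acq_date
        WHERE d.flow_cms <= ?
        ORDER BY s.acq_date
    """
    return connection.execute(query, (max_flow_cms,)).fetchall()


# Scenes acquired at low flow, when sandbars are most likely to be exposed.
low_flow_scenes = scenes_below_flow(conn, 1000.0)
```

The list returned by the query could then drive an automated download loop, so that only imagery acquired at the discharges of interest is retrieved.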

• Identified areas of persistent sand

• Investigated flows where sand was exposed

• Examined spatial variation

• Used metrics of exposure to help model success of Least Tern nests
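One simple way to express "persistent sand" is the fraction of dates on which each pixel was classified as exposed sand. The sketch below assumes the classified images have already been stacked into a boolean NumPy array; the synthetic values and the 0.75 cutoff are illustrative assumptions, not the metrics used in the study.

```python
import numpy as np

# Synthetic stack with shape (time, rows, cols):
# True where a pixel was classified as exposed sand on that date.
sand_stack = np.array([
    [[1, 0], [1, 0]],
    [[1, 0], [0, 0]],
    [[1, 1], [1, 0]],
    [[1, 0], [1, 0]],
], dtype=bool)

# Fraction of dates on which each pixel was exposed sand.
persistence = sand_stack.mean(axis=0)

# Hypothetical cutoff: call a pixel "persistent" if exposed on >= 75% of dates.
persistent_mask = persistence >= 0.75
```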

Main Points

• Scripts and databases allow for automated downloading and linking of multiple data types

• Too much data for manual analysis

• Python can be used to batch-process images across programs without manual intervention

• Scripted tools can be used to directly query, plot, and perform statistics on image data

Question: What information can we synthesize from a 400+ day archive of field measurements?

Figure explanation: velocity, in meters per second, from 0 (slow) to 2.5 (fast); velocity ensemble; velocity bin; river bottom; water column; 4-beam depths

• Velocities and depths measured along regular transects

• Lateral, longitudinal, and vertical variability

ADCP and single-beam survey dates, locations, and discharges, 2000-2015

Explanation: flow percentile (Low <25%, Medium 25-75%, High >75%)

• Compiled over 32,000 individual cross-sections from 2000-2015

• Joined dataset to river mile and gage to allow discharge-specific queries

• Can group data by location along river and varying discharge levels to compare
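The grouping described above might look like the following sketch, which classifies each survey by the flow percentile classes in the map legend (Low <25%, Medium 25-75%, High >75%) and averages velocity by river segment and class. The discharge record, survey tuples, and 50-mile segment size are synthetic assumptions chosen only to make the example self-contained.

```python
from collections import defaultdict

import numpy as np

# Synthetic daily discharge record for one gage (values 1..100).
gage_record = np.arange(1.0, 101.0)
p25, p75 = np.percentile(gage_record, [25, 75])


def flow_class(discharge):
    """Classify a discharge as Low (<25%), Medium (25-75%), or High (>75%)."""
    if discharge < p25:
        return "Low"
    if discharge > p75:
        return "High"
    return "Medium"


# Hypothetical surveys: (river mile, discharge on survey date, mean velocity).
surveys = [
    (100.2, 10.0, 0.6),
    (100.7, 20.0, 0.8),
    (100.4, 50.0, 1.2),
    (250.1, 90.0, 1.9),
    (250.9, 80.0, 1.7),
]

# Group velocities by 50-mile river segment and flow class.
groups = defaultdict(list)
for river_mile, discharge, velocity in surveys:
    segment = int(river_mile // 50) * 50
    groups[(segment, flow_class(discharge))].append(velocity)

summary = {key: float(np.mean(vals)) for key, vals in groups.items()}
```

With a real archive the same pattern would apply to tens of thousands of cross-sections; a dataframe library would be a natural choice at that scale.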

• Ongoing restoration question: how does habitat (velocity) compare in river chutes versus the main channel?

• Chutes = restoration

• 37 field days where measurements in chutes were taken incidentally or deliberately

• Can use geospatial tools and scripts to come up with relevant comparisons

Measurement archive in lieu of hydrodynamic model: sturgeon spawning locations?

Main Points

• Scripts and databases allow for efficient querying and cleaning of archived datasets

• Python can be used to quickly and interactively summarize datasets by specific groupings

• Existing data can be repurposed and integrated with new data for value-added analyses using scripting

Question: How can we better visualize field measurements of channel velocity and bathymetry?

• Measurements of velocity collected along ‘regular’ transects

• Python used to interpolate data into structured grid (3d matrix)
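The slides do not say which interpolation method was used. As a stand-in, the sketch below uses inverse-distance weighting (IDW) in NumPy to place scattered (x, y, z) velocity samples onto a structured 3-D grid. The function name, sample points, and grid are hypothetical. This brute-force version builds a full node-by-sample distance matrix, so it suits only small examples; a dataset with tens of millions of points would need a spatial index or chunked processing.

```python
import numpy as np


def idw_to_grid(points, values, grid_axes, power=2.0, eps=1e-12):
    """Interpolate scattered (x, y, z) samples onto a structured 3-D grid
    using inverse-distance weighting."""
    gx, gy, gz = np.meshgrid(*grid_axes, indexing="ij")
    nodes = np.column_stack([gx.ravel(), gy.ravel(), gz.ravel()])
    # Distance from every grid node to every sample: shape (n_nodes, n_points).
    dist = np.linalg.norm(nodes[:, None, :] - points[None, :, :], axis=2)
    # eps avoids division by zero when a node coincides with a sample.
    weights = 1.0 / (dist + eps) ** power
    interpolated = (weights * values).sum(axis=1) / weights.sum(axis=1)
    return interpolated.reshape(gx.shape)


# Hypothetical velocity samples at four locations.
points = np.array([
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 1.0],
])
values = np.array([0.5, 1.0, 1.5, 2.0])

axis = np.linspace(0.0, 1.0, 3)
velocity_grid = idw_to_grid(points, values, (axis, axis, axis))
```

The resulting array is the kind of structured 3-D matrix that a viewer such as Paraview can read once it is written out in a supported file format.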

Paraview

• Can visualize flowlines around structures (biology)

• Identified bias in field measurements?

• Noticed systematic bias

• Collaborating with ILWSC

33 million+ data points!

Main Points

• Python scripts allow for interpolation and visualization of field data

• Using open-source (free) tools along with Python allows for replication of abilities from more expensive software

• New insights can be gained from visualizing data in different ways

Question: How can we better characterize inundation patterns along the Missouri River?

• Hydrodynamic (HEC-RAS) model provided by USACE describing water surface elevations at cross sections over time

• Used scripting to extend cross sections across floodplain for Missouri River

• Merged LIDAR and channel data provides high-resolution characterization of floodplain elevation

• Spatial interpolations of water elevation

• Calculations of inundation depths
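The last two steps can be illustrated in a few lines of NumPy: interpolate the modeled water-surface elevation between cross sections, then subtract ground elevation to get depth. The sketch assumes simple linear interpolation along river station, which may differ from the spatial interpolation actually used; all station and elevation values are made up.

```python
import numpy as np

# Hypothetical cross sections: river station (km) and modeled
# water-surface elevation (m) for a single date.
xs_station = np.array([0.0, 10.0, 20.0])
xs_wse = np.array([152.0, 150.0, 149.0])

# Hypothetical floodplain raster: station of each cell and its ground
# elevation (m), for example from merged LIDAR and channel data.
cell_station = np.array([[0.0, 5.0], [10.0, 15.0]])
ground_elev = np.array([[151.0, 151.5], [149.0, 150.0]])

# Linearly interpolate water-surface elevation to every raster cell.
wse = np.interp(cell_station.ravel(), xs_station, xs_wse).reshape(cell_station.shape)

# Inundation depth is water surface minus ground, floored at zero (dry).
depth = np.clip(wse - ground_elev, 0.0, None)
```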

Inundation return interval statistics

Base unit for calculations: 1 date, water depth raster grid (30m) for 1 area

Time series of rasters, 1 per day for 29,892 modeled days

…n dates…

Stack over time (axes: x, y, z)

Structured 3-dimensional matrix of data

x and y are geospatial coordinates (raster dims)

z is time coordinate (29,892 days)

Water depth for each x,y,z

Time

Data stored in Hierarchical Data Format (HDF) on disk to allow computationally efficient slicing in the time domain

Setting inundation threshold allows for identification of inundated periods per pixel

Can aggregate data by year

Evaluate inundation status by criteria (such as longest consecutive inundated period during growing season)

Summarize metrics across all modeled years
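The threshold and longest-consecutive-period criterion described above can be sketched as a single pass through the time axis. This illustration assumes the depth stack is a NumPy array ordered as (time, rows, cols) that fits in memory; the function name, threshold, and depth values are hypothetical, and the real workflow would read time slices from the HDF file and restrict them to the growing season of each year.

```python
import numpy as np


def longest_wet_run(depth_stack, threshold):
    """Longest run of consecutive time steps with depth above threshold,
    computed per pixel. depth_stack has shape (time, rows, cols)."""
    wet = depth_stack > threshold
    current = np.zeros(wet.shape[1:], dtype=int)
    longest = np.zeros(wet.shape[1:], dtype=int)
    for wet_today in wet:
        # Extend the run where the pixel is wet, reset it where dry.
        current = np.where(wet_today, current + 1, 0)
        longest = np.maximum(longest, current)
    return longest


# Synthetic depths (m) for six days over a 1 x 2 pixel area.
depth_stack = np.array([
    [[0.2, 0.0]],
    [[0.3, 0.0]],
    [[0.0, 0.2]],
    [[0.5, 0.0]],
    [[0.6, 0.0]],
    [[0.7, 0.0]],
])

longest = longest_wet_run(depth_stack, threshold=0.1)
```

Running the function once per year and then summarizing the yearly results per pixel gives the kind of across-years metric the slides describe.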

…n years…

Main Points

• Python scripts allow for dealing with data too big for one computer

  • Processing across virtual machines

  • Processing large files

• Time-series analyses on large datasets are useful for answering management questions

• Computational models are a useful supplement to field data

Data Intensive Restoration?

• There have been many attempts at ecological restoration

• Meta-analysis of restoration success is nothing new

• What data are available to us in USGS/DOI that might lend themselves to these approaches?

• What data are needed by people implementing NRDAR restoration?

• How can NRDAR projects contribute useful information?

NRDAR Case Map and Document Library

Conclusions

• As scientists, we work in an expanding world of ‘big data’

• We can’t analyze data by ourselves – need tools

• Sharing data is important

• Ongoing projects are just beginning to utilize the scope of available datasets and capabilities of tools like Python

• What existing data are not fully utilized?

• Think big

• Add value

Questions?
