MASSIVE Terrain Datasets: On the Importance of Efficient Algorithms
Lars Arge
Department of Computer Science
Aarhus University
Regional one-day course in computer science, 20 March 2006
Lars Arge
Massive terrain datasets
2
Outline
1. Massive (terrain) data
2. Scalability problems (I/O bottleneck)
3. Processing massive terrain data: Flow modeling on grid terrains
4. Summary
Massive Data
Massive Data
• Massive datasets are being collected everywhere
• Storage management software is billion-$ industry
Examples (2002):
• Phone: AT&T 20TB phone call database, wireless tracking
• Consumer: WalMart 70TB database, buying patterns (supermarket checkout)
• Web: Web crawl of 200M pages and 2000M links, Akamai stores 7 billion clicks per day
• Geography: NASA satellites generate 1.2TB per day
Example: Satellite Images
– Terabyte image database
Example: Grid Terrain Data
• Grid terrain data increasingly available
– NASA SRTM mission acquired 30m data for around 80% of earth land mass
– US data readily available through USGS National Map Seamless Data Distribution System
• Appalachian Mountains (800km x 800km)
– 100m resolution ~ 64M cells, ~128MB raw data (~500MB when processing)
– ~ 1.2GB at 30m resolution
– ~ 12GB at 10m resolution (much of US available from USGS)
– ~ 1.2TB at 1m resolution (selected, mostly military, availability)
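The size estimates above follow from simple arithmetic on the grid dimensions. The sketch below reproduces them under one assumption that the slide does not state explicitly: 2 bytes of raw data per cell, inferred from "64M cells ~ 128MB". The function name `grid_size` is hypothetical. The computed values are exact for 100m and only approximately match the rounded slide figures at finer resolutions.

```python
def grid_size(extent_km, resolution_m, bytes_per_cell=2):
    """Return (number of cells, raw bytes) for a square extent.

    bytes_per_cell=2 is an assumption inferred from the slide's
    "64M cells ~ 128MB"; it is not stated in the source.
    """
    cells_per_side = round(extent_km * 1000 / resolution_m)
    cells = cells_per_side ** 2
    return cells, cells * bytes_per_cell


if __name__ == "__main__":
    # Appalachian Mountains example: 800km x 800km at several resolutions.
    for resolution_m in (100, 30, 10, 1):
        cells, raw_bytes = grid_size(800, resolution_m)
        print(f"{resolution_m:>4} m: {cells:,} cells, {raw_bytes / 1e9:.2f} GB raw")
```

Every tenfold refinement of the resolution multiplies the cell count by 100, which is why the step from 100m to 1m data takes the same terrain from megabytes to terabytes.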
Example: LIDAR Terrain Data
• Massive (irregular) point sets (1-10m resolution)
– Becoming relatively cheap and easy to collect
• NC floodplain mapping program: www.ncfloodmaps.com
– Collected LIDAR for all NC after Hurricane Floyd in 1999
– Still processing it
Hurricane Floyd
• Sep. 15, 1999
[Images: Hurricane Floyd at 7 am and 3 pm]
Example: LIDAR Terrain Data
• US LIDAR data becoming available:
– www.ncfloodmaps.com
– USGS Center for LIDAR Information Coordination and Knowledge (CLICK)
– NOAA LIDAR Data Retrieval Tool (LDART)
Scalability Problems
Scalability Problems: I/O-Bottleneck
• I/O is often bottleneck when handling massive datasets
• Disk access is 10^6 times slower than main memory access
“The difference in speed between modern CPU and disk technologies is analogous to the difference in speed in sharpening a pencil using a sharpener on one’s desk or by taking an airplane to the other side of the world and using a sharpener on someone else’s desk.” (D. Comer)
• Disk systems try to amortize large access time by transferring large contiguous blocks of data
• Need to store and access data to take advantage of blocks (locality)
[Figure: disk with track, magnetic surface, read/write arm, and read/write head]
Scalability Problems: Block Access Matters
• Example: Reading an array from disk
– Array size N = 10 elements
– Disk block size B = 2 elements
– Main memory size M = 4 elements (2 blocks)
– Algorithm 1: Loads N = 10 blocks (array: 1 5 2 6 3 8 9 4 7 10)
– Algorithm 2: Loads N/B = 5 blocks (array: 1 2 10 9 5 6 3 4 8 7)
• Difference between N and N/B large since block size is large
– Example: N = 256 x 10^6, B = 8000, 1ms disk access time
N I/Os take 256 x 10^3 sec = 4266 min = 71 hr
N/B I/Os take 256/8 sec = 32 sec
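The block counts on this slide can be checked with a small simulation. The sketch below rests on assumptions the slide does not spell out: each array is read as an on-disk layout, the elements are read in increasing order 1..N, and main memory uses LRU eviction. The names `count_block_loads` and `io_time_seconds` are hypothetical. Under these assumptions the layout 1 5 2 6 3 8 9 4 7 10 needs 10 block loads and the layout 1 2 10 9 5 6 3 4 8 7 needs 5.

```python
from collections import OrderedDict


def count_block_loads(layout, block_size, memory_blocks):
    """Count block loads when reading the elements of `layout` in increasing order.

    layout[i] is the element stored at disk position i. Main memory holds
    `memory_blocks` blocks and evicts the least recently used one
    (LRU is an assumption; the slide does not name a policy).
    """
    position = {value: i for i, value in enumerate(layout)}
    cache = OrderedDict()  # block id -> True, ordered from least to most recent
    loads = 0
    for value in sorted(layout):
        block = position[value] // block_size
        if block in cache:
            cache.move_to_end(block)       # hit: mark as most recently used
        else:
            loads += 1                     # miss: one I/O to fetch the block
            cache[block] = True
            if len(cache) > memory_blocks:
                cache.popitem(last=False)  # evict least recently used block
    return loads


def io_time_seconds(num_ios, access_time_ms=1.0):
    """Total time for num_ios disk accesses at access_time_ms each."""
    return num_ios * access_time_ms / 1000.0


if __name__ == "__main__":
    scattered = [1, 5, 2, 6, 3, 8, 9, 4, 7, 10]
    clustered = [1, 2, 10, 9, 5, 6, 3, 4, 8, 7]
    print(count_block_loads(scattered, block_size=2, memory_blocks=2))  # 10
    print(count_block_loads(clustered, block_size=2, memory_blocks=2))  # 5
    n, b = 256_000_000, 8000
    print(io_time_seconds(n))       # 256000.0 seconds, about 71 hours
    print(io_time_seconds(n // b))  # 32.0 seconds
```

The same ten elements and the same memory give a factor B difference in I/Os purely because of where consecutive elements sit on disk.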
Scalability Problems: Block Access Matters
• Most programs developed without memory considerations
– Infinite memory
– Uniform access cost
• Run on large datasets because OS moves blocks as needed
• Modern OSs use sophisticated paging and prefetching strategies
– But if program makes scattered accesses even good OS cannot take advantage of block access
Scalability problems!
[Figure: running time versus data size, with the RAM size marked]
Scalability: Hierarchical Memory
• Block access not only important on disk level
• Machines have complicated memory hierarchy
– Levels get larger and slower
– Block transfers on all levels
[Figure: memory hierarchy with L1, L2, and RAM levels]
• We focus on disk level:
[Figure: running time versus data size, with the RAM size marked]
Processing Massive Terrain Data: Flow
Flow on Terrains
• Modeling of water flow on terrains has many important applications
– Predict location of streams
– Predict areas susceptible to floods
– Compute watersheds
– Predict erosion
– Predict vegetation distribution
– ……
• Conceptually flow is modeled using two basic attributes
– Flow direction: The direction water flows at a point
– Flow accumulation: Amount of water flowing through a point
• Flow accumulation used to compute other hydrological attributes, e.g. drainage network, topographic convergence index…
Flow Directions on Grid Terrains
• Common terrain representation: Grid
• Flow directions: Water in each cell flows to downslope neighbor(s)
– Commonly used:
* Single flow direction (SFD or D8): Flow to steepest downslope neighbor
* Multiple flow direction (MFD): Flow to all downslope neighbors
[Figure: SFD and MFD illustrated on a 3 x 3 elevation grid]
3 2 4
7 5 8
7 1 9
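The two flow-direction rules can be made concrete on the 3 x 3 grid from this slide. The following is a minimal sketch, assuming that slope is the height drop divided by the distance between cell centers (1 for edge neighbors, sqrt(2) for diagonal ones) and that SFD picks the steepest such slope. The function names are hypothetical and this is not code from any of the packages discussed.

```python
import math

# Offsets (row, column) of the 8 neighbors of a grid cell.
NEIGHBOR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1),
                    (0, -1),           (0, 1),
                    (1, -1),  (1, 0),  (1, 1)]


def downslope_neighbors(grid, r, c):
    """Return (slope, (row, col)) for every strictly lower neighbor of (r, c)."""
    result = []
    for dr, dc in NEIGHBOR_OFFSETS:
        nr, nc = r + dr, c + dc
        if 0 <= nr < len(grid) and 0 <= nc < len(grid[0]):
            drop = grid[r][c] - grid[nr][nc]
            if drop > 0:
                result.append((drop / math.hypot(dr, dc), (nr, nc)))
    return result


def sfd(grid, r, c):
    """Single flow direction: the steepest downslope neighbor, or None."""
    lower = downslope_neighbors(grid, r, c)
    return max(lower)[1] if lower else None


def mfd(grid, r, c):
    """Multiple flow direction: all downslope neighbors."""
    return sorted(cell for _, cell in downslope_neighbors(grid, r, c))


if __name__ == "__main__":
    grid = [[3, 2, 4],
            [7, 5, 8],
            [7, 1, 9]]
    print(sfd(grid, 1, 1))  # (2, 1): the cell of height 1
    print(mfd(grid, 1, 1))  # [(0, 0), (0, 1), (0, 2), (2, 1)]: heights 3, 2, 4, 1
```

For the center cell of height 5, SFD sends all water to the cell of height 1, while MFD spreads it over the four lower neighbors.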
Flow Accumulation on Grid Terrains
• Flow accumulation
– Initially one unit of water in each cell
– Water distributed from each cell according to flow direction(s)
– Flow accumulation of cell is total flow through it
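The three steps above can be sketched for the single-flow-direction case as follows. This is a minimal in-memory illustration that visits cells from highest to lowest and pushes each cell's water to its steepest lower neighbor. The function name and the slope definition (height drop over distance between cell centers) are assumptions, and the 3 x 3 grid is the one used on the flow-direction slide.

```python
import math


def flow_accumulation_sfd(grid):
    """Flow accumulation with one unit of water per cell and SFD routing."""
    rows, cols = len(grid), len(grid[0])
    acc = [[1.0] * cols for _ in range(rows)]  # initially one unit per cell
    # Visit cells in decreasing height order so that every cell has received
    # all of its incoming flow before it passes its own flow on.
    cells = sorted(((grid[r][c], r, c) for r in range(rows) for c in range(cols)),
                   reverse=True)
    for height, r, c in cells:
        best, best_slope = None, 0.0
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if (dr or dc) and 0 <= nr < rows and 0 <= nc < cols:
                    slope = (height - grid[nr][nc]) / math.hypot(dr, dc)
                    if slope > best_slope:
                        best, best_slope = (nr, nc), slope
        if best is not None:
            # Scattered write: the target can be anywhere in the acc grid.
            acc[best[0]][best[1]] += acc[r][c]
    return acc


if __name__ == "__main__":
    grid = [[3, 2, 4],
            [7, 5, 8],
            [7, 1, 9]]
    for row in flow_accumulation_sfd(grid):
        print(row)
    # [1.0, 3.0, 1.0]
    # [1.0, 1.0, 1.0]
    # [1.0, 6.0, 1.0]
```

All nine units end up in the two local minima (heights 2 and 1), which receive 3 and 6 units respectively.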
Flow Accumulation Example (Panama dataset)
Flow Modeling on Massive Grid Terrains
• Duke University Environmental researchers had problems with computing flow accumulation for Appalachian Mountains
– Recall ~128MB raw data and ~500MB when processing
– Running time: 14 days
• It could be much worse; recall
– ~ 1.2GB at 30m resolution
– ~ 12GB at 10m resolution
– ~ 1.2TB at 1m resolution
Flow Modeling on Massive Grid Terrains
• We surveyed other flow accumulation software
• GRASS (leading open-source GIS)
– Killed after 17 days on a 50MB dataset (6700 x 4300 grid)
• TARDEM (specialized hydrology software)
– Could handle 50MB dataset
– Killed after 20 days on a 240MB dataset (12000 x 10000 grid)
* CPU utilization 5%, 3GB swap file
• ArcGIS (leading commercial GIS)
– Could handle the 240MB dataset
– Sometimes very slow:
* 3 days to process 490MB dataset
* 1 day to process 560MB dataset
– Does not work for datasets larger than 2GB
Flow Accumulation Scalability Problem
• Natural algorithm may require ~N I/Os
– “Push” flow down the terrain by visiting cells in height order
– Problem since cells of same height scattered over terrain
• Natural to try “tiling” (ArcGIS?)
– But computation in different tiles not independent
TerraFlow
• We developed theoretically I/O-optimal algorithms using ~N/B I/Os
• Avoiding scattered access by:
– Grid storing input: Data duplication
– Grid storing flow: “Lazy write”
• Implementation was very efficient
– Appalachian Mountains flow accumulation in 3 hours!
• Developed into comprehensive software package for flow computation on massive grids (www.cs.duke.edu/geo*/terraflow)
– Efficient: 2-1000 times faster than other software on massive grids
– Scalable: 1 billion elements! (>2GB data)
– Flexible: Different flow modeling (direction) methods
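One way to read the "data duplication" and "lazy write" bullets is sketched below. It is an in-memory illustration only, not TerraFlow's actual code. Each cell record carries a copy of its neighbors' heights, so the input grid is never accessed at scattered positions. Flow sent to a lower cell is queued in a priority queue and collected when the sweep reaches that cell, instead of being written into the flow grid right away. Python's `sorted` and `heapq` stand in here for an external sort and an external-memory priority queue, and all names are hypothetical.

```python
import heapq
import math


def flow_accumulation_sweep(grid):
    """SFD flow accumulation via a height-ordered sweep with queued flow."""
    rows, cols = len(grid), len(grid[0])
    # "Data duplication": every record stores the heights of its neighbors.
    records = []
    for r in range(rows):
        for c in range(cols):
            neighbors = [(grid[nr][nc], nr, nc)
                         for nr in range(max(r - 1, 0), min(r + 2, rows))
                         for nc in range(max(c - 1, 0), min(c + 2, cols))
                         if (nr, nc) != (r, c)]
            records.append((-grid[r][c], r, c, neighbors))
    # One sequential pass in decreasing height order (key = negated height).
    records.sort(key=lambda rec: rec[:3])

    pending = []       # heap of (negated target height, row, col, flow)
    accumulation = {}
    for neg_h, r, c, neighbors in records:
        flow = 1.0     # one unit of water starts in every cell
        # "Lazy write": collect the flow that was queued for this cell.
        while pending and pending[0][:3] == (neg_h, r, c):
            flow += heapq.heappop(pending)[3]
        accumulation[(r, c)] = flow
        lower = [((-neg_h - nh) / math.hypot(nr - r, nc - c), nh, nr, nc)
                 for nh, nr, nc in neighbors if nh < -neg_h]
        if lower:
            _, nh, nr, nc = max(lower)  # steepest downslope neighbor
            heapq.heappush(pending, (-nh, nr, nc, flow))
    return accumulation


if __name__ == "__main__":
    result = flow_accumulation_sweep([[3, 2, 4],
                                      [7, 5, 8],
                                      [7, 1, 9]])
    print(result[(2, 1)], result[(0, 1)])  # 6.0 3.0
```

Because flow only moves to strictly lower cells, every queued message targets a cell that the sweep has not yet reached, so the smallest key in the queue always belongs to the next cell that needs it.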
TerraFlow
• Significant speedup over ArcInfo for large datasets
– East-Coast (100m)
TerraFlow: 8.7 Hours
ArcInfo: 78 Hours
– Washington state (10m)
TerraFlow: 63 Hours
ArcInfo: %
• Incorporated in GRASS 5.0.2 and later
• Recently also extensions for ArcGIS 8 and 9
[Figure: bar chart of running time (hours, axis 0 to 90) for the series TerraFlow 512, TerraFlow 128, ArcInfo 512, and ArcInfo 128 on six datasets: Hawaii (56M), Cumberlands (80M), Lower NE (256M), East-Coast (491M), Midwest (561M), Washington (2G). Platform: 500 MHz Alpha, FreeBSD 4.0]
Denmark?
Denmark Terrain Data
• Mainly two data suppliers in Denmark
– Kort & Matrikelstyrelsen
– COWI A/S
• Grid/vector models based on paper maps/orthophotos
• LIDAR data for major cities
• Unfortunately not available online (and not free)
– But obviously increasing interest in terrain data/applications
New Project
• New (NABIIT) project: Development of algorithms and software for processing massive terrain data
– COWI A/S
* Problems processing LIDAR data during production and analysis (e.g. railroad noise)
– Spatial analysis unit, Danish Institute of Agricultural Sciences
* Use data, e.g. to comply with EU directives
– Computer science, Aarhus University
* Efficient algorithms
• Focus on
– Terrain modeling, terrain flow analysis, influence of simplification
Example Sub-Projects
• Terrain modeling, e.g.:
– Terrain models from “raw” LIDAR
Process >10G raw data in a few hours using only 128M memory
• Terrain analysis, e.g.:
– Erosion modeling (USLE factor computation)
– Watershed hierarchy computation
NC Neuse basin at 10m resolution (~400M cells) in 3 hours
Summary
Summary
• Massive datasets appear everywhere
• Leads to scalability problems
– Due to hierarchical memory and slow I/O
• I/O-efficient algorithms greatly improve scalability
• Terrain data:
– Massive grid data exists
– New technologies are creating massive and very detailed datasets
– Processing capabilities lag behind
Summary - Resources
• Google Earth: http://earth.google.com/
• USGS national map: http://seamless.usgs.gov
• USGS center for LIDAR information: http://lidar.cr.usgs.gov
• NC floodmaps: http://www.ncfloodmaps.com
• NOAA LIDAR data retrieval tool: http://www.csc.noaa.gov/crs/tcm/about_ldart.html
• TerraFlow: http://www.cs.duke.edu/geo*/terraflow
• Duke STREAM project: http://terrain.cs.duke.edu
• Kort & Matrikelstyrelsen: http://www.kms.dk
• COWI A/S: http://www.cowi.dk
• Geoforum: http://www.geoforum.dk/