Massive Terrain Datasets: On the Importance of Efficient Algorithms


Lars Arge

Datalogisk Institut

Aarhus Universitet

Regional one-day course in computer science, 20 March 2006


Outline

1. Massive (terrain) data

2. Scalability problems (I/O bottleneck)

3. Processing massive terrain data: Flow modeling on grid terrains

4. Summary


Massive Data


Massive Data

• Massive datasets are being collected everywhere

• Storage management software is billion-$ industry

Examples (2002):

• Phone: AT&T 20TB phone call database, wireless tracking

• Consumer: WalMart 70TB database, buying patterns (supermarket checkout)

• WEB: Web crawl of 200M pages and 2000M links, Akamai stores 7 billion clicks per day

• Geography: NASA satellites generate 1.2TB per day


Example: Satellite Images

– Terabyte image database


Example: Grid Terrain Data

• Grid terrain data increasingly available

– NASA SRTM mission acquired 30m data for around 80% of earth land mass

– US data readily available through USGS National Map Seamless Data Distribution System

• Appalachian Mountains (800km x 800km)

– 100m resolution ~ 64M cells ~ 128MB raw data (~500MB when processing)

– ~ 1.2GB at 30m resolution

– ~ 12GB at 10m resolution (much of US available from USGS)

– ~ 1.2TB at 1m resolution (selected, mostly military, availability)


Example: LIDAR Terrain Data

• Massive (irregular) point sets (1-10m resolution)

– Becoming relatively cheap and easy to collect

• NC floodplain mapping program: www.ncfloodmaps.com

– Collected LIDAR for all NC after Hurricane Floyd in 1999

– Still processing it


Hurricane Floyd

• Sep. 15, 1999

[Figure: two panels, labeled 7 am and 3 pm]


Example: LIDAR Terrain Data

• US LIDAR data becoming available:

– www.ncfloodmaps.com

– USGS Center for LIDAR Information Coordination and Knowledge (CLICK)

– NOAA LIDAR Data Retrieval Tool (LDART)


Scalability Problems


Scalability Problems: I/O-Bottleneck

• I/O is often bottleneck when handling massive datasets

• Disk access is 10^6 times slower than main memory access

– Disk systems try to amortize large access time by transferring large contiguous blocks of data

• Need to store and access data to take advantage of blocks (locality)

[Figure: disk drive with track, magnetic surface, read/write arm, and read/write head]

“The difference in speed between modern CPU and disk technologies is analogous to the difference in speed in sharpening a pencil using a sharpener on one’s desk or by taking an airplane to the other side of the world and using a sharpener on someone else’s desk.” (D. Comer)


Scalability Problems: Block Access Matters

• Example: Reading an array from disk

– Array size N = 10 elements

– Disk block size B = 2 elements

– Main memory size M = 4 elements (2 blocks)

• Difference between N and N/B large since block size is large

– Example: N = 256 x 10^6, B = 8000, 1ms disk access time

N I/Os take 256 x 10^3 sec = 4266 min = 71 hr

N/B I/Os take 256/8 sec = 32 sec

Algorithm 1: Loads N=10 blocks (1 5 2 6 3 8 9 4 7 10)

Algorithm 2: Loads N/B=5 blocks (1 2 10 9 5 6 3 4 8 7)
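The block counts in this example can be reproduced with a small simulation. The sketch below assumes that each row of numbers shows the order in which the elements are stored on disk, that the program reads the elements in increasing order, and that main memory evicts the least recently used block; none of these details are stated on the slide, and the function name `count_block_loads` is made up for illustration. The last two lines redo the timing estimate for N = 256 x 10^6.

```python
from collections import OrderedDict

def count_block_loads(layout, block_size, memory_blocks):
    """Count block loads when reading the values in increasing order.

    layout[i] is the value stored at disk position i, and position i lives
    in block i // block_size. Memory holds `memory_blocks` blocks and evicts
    the least recently used one (an assumption, not taken from the slide).
    """
    position = {value: i for i, value in enumerate(layout)}
    cache = OrderedDict()  # block id -> None, least recently used first
    loads = 0
    for value in sorted(layout):
        block = position[value] // block_size
        if block in cache:
            cache.move_to_end(block)       # hit: mark block as recently used
        else:
            loads += 1                     # miss: one I/O to fetch the block
            cache[block] = None
            if len(cache) > memory_blocks:
                cache.popitem(last=False)  # evict least recently used block
    return loads

scattered = [1, 5, 2, 6, 3, 8, 9, 4, 7, 10]
blocked = [1, 2, 10, 9, 5, 6, 3, 4, 8, 7]
print(count_block_loads(scattered, block_size=2, memory_blocks=2))  # 10
print(count_block_loads(blocked, block_size=2, memory_blocks=2))    # 5

# Timing estimate from the slide: 1 ms per I/O.
n, b, access_ms = 256 * 10**6, 8000, 1
print(n * access_ms / 1000 / 3600)  # about 71 hours for N I/Os
print((n // b) * access_ms / 1000)  # 32.0 seconds for N/B I/Os
```

With consecutive elements stored in the same block, every second access is a hit, which is where the factor B = 2 comes from.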


Scalability Problems: Block Access Matters

• Most programs developed without memory considerations

– Infinite memory

– Uniform access cost

• Run on large datasets because OS moves blocks as needed

• Modern OS utilizes sophisticated paging and prefetching strategies

– But if program makes scattered accesses even good OS cannot take advantage of block access

Scalability problems!

[Figure: running time versus data size, with the RAM size marked]


Scalability: Hierarchical Memory

• Block access not only important on disk level

• Machines have complicated memory hierarchy

– Levels get larger and slower

– Block transfers on all levels

• We focus on disk level:

[Figure: running time versus data size, with the L1, L2, and RAM sizes marked]


Processing Massive Terrain Data: Flow


Flow on Terrains

• Modeling of water flow on terrains has many important applications

– Predict location of streams

– Predict areas susceptible to floods

– Compute watersheds

– Predict erosion

– Predict vegetation distribution

– ……

• Conceptually flow is modeled using two basic attributes

– Flow direction: The direction water flows at a point

– Flow accumulation: Amount of water flowing through a point

• Flow accumulation used to compute other hydrological attributes, e.g. drainage network, topographic convergence index…


Flow Directions on Grid Terrains

• Common terrain representation: Grid

• Flow directions: Water in each cell flows to downslope neighbor(s)

– Commonly used:

* Single flow direction (SFD or D8): Flow to downslope neighbor

* Multiple flow direction (MFD): Flow to all downslope neighbors

[Figure: SFD and MFD flow directions shown on a 3 x 3 elevation grid]

3 2 4
7 5 8
7 1 9
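The two flow direction models can be tried out on the 3 x 3 elevation grid from this slide. This is a minimal sketch: the function names are made up for illustration, and it assumes that SFD picks the steepest downslope neighbor, measuring slope as height drop divided by distance (1 for edge neighbors, sqrt(2) for diagonal ones).

```python
import math

# Offsets of the eight neighbors of a grid cell.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def downslope_neighbors(grid, r, c):
    """Return (slope, row, col) for every neighbor strictly lower than (r, c)."""
    result = []
    for dr, dc in OFFSETS:
        nr, nc = r + dr, c + dc
        if 0 <= nr < len(grid) and 0 <= nc < len(grid[0]):
            drop = grid[r][c] - grid[nr][nc]
            if drop > 0:
                result.append((drop / math.hypot(dr, dc), nr, nc))
    return result

def sfd(grid, r, c):
    """Single flow direction: the steepest downslope neighbor, or None."""
    candidates = downslope_neighbors(grid, r, c)
    if not candidates:
        return None
    _, nr, nc = max(candidates)
    return (nr, nc)

def mfd(grid, r, c):
    """Multiple flow direction: all downslope neighbors."""
    return [(nr, nc) for _, nr, nc in downslope_neighbors(grid, r, c)]

grid = [[3, 2, 4],
        [7, 5, 8],
        [7, 1, 9]]
print(sfd(grid, 1, 1))  # (2, 1), the cell with height 1
print(mfd(grid, 1, 1))  # [(0, 0), (0, 1), (0, 2), (2, 1)]
```

For the center cell (height 5), MFD sends water to the cells with heights 3, 2, 4, and 1, while SFD sends all of it to the cell with height 1.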


Flow Accumulation on Grid Terrains

• Flow accumulation

– Initially one unit of water in each cell

– Water distributed from each cell according to flow direction(s)

– Flow accumulation of cell is total flow through it
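The three steps above can be sketched directly for single flow direction. This is an illustrative in-memory version that ignores I/O; the function names are hypothetical and the steepest-neighbor rule is an assumption. Cells are visited from highest to lowest, so a cell has received all incoming water before it passes its own total on.

```python
import math

def steepest_downslope(grid, r, c):
    """Return the (row, col) of the steepest lower neighbor, or None for a sink."""
    best, best_slope = None, 0.0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            nr, nc = r + dr, c + dc
            if (dr or dc) and 0 <= nr < len(grid) and 0 <= nc < len(grid[0]):
                slope = (grid[r][c] - grid[nr][nc]) / math.hypot(dr, dc)
                if slope > best_slope:
                    best, best_slope = (nr, nc), slope
    return best

def flow_accumulation(grid):
    """SFD flow accumulation: one unit per cell, pushed downhill."""
    rows, cols = len(grid), len(grid[0])
    flow = [[1] * cols for _ in range(rows)]  # one unit of water per cell
    cells = sorted(((r, c) for r in range(rows) for c in range(cols)),
                   key=lambda rc: grid[rc[0]][rc[1]], reverse=True)
    for r, c in cells:  # highest cells first
        target = steepest_downslope(grid, r, c)
        if target is not None:
            flow[target[0]][target[1]] += flow[r][c]
    return flow

grid = [[3, 2, 4],
        [7, 5, 8],
        [7, 1, 9]]
for row in flow_accumulation(grid):
    print(row)
# [1, 3, 1]
# [1, 1, 1]
# [1, 6, 1]
```

The two sinks (heights 2 and 1) together collect all nine units of water.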


Flow Accumulation Example (Panama dataset)


Flow Modeling on Massive Grid Terrains

• Duke University environmental researchers had problems with computing flow accumulation for Appalachian Mountains

– Recall ~128MB raw data and ~500MB when processing

Running time: 14 days

• It could be much worse; Recall

– ~ 1.2GB at 30m resolution

– ~ 12GB at 10m resolution

– ~ 1.2TB at 1m resolution


Flow Modeling on Massive Grid Terrains

• We surveyed other flow accumulation software

• GRASS (leading open-source GIS)

– Killed after 17 days on a 50MB dataset (6700 x 4300 grid)

• TARDEM (specialized hydrology software)

– Could handle 50MB dataset

– Killed after 20 days on a 240MB dataset (12000 x 10000 grid)

* CPU utilization 5%, 3GB swap file

• ArcGIS (leading commercial GIS)

– Could handle the 240MB dataset

– Sometimes very slow:

* 3 days to process 490MB dataset

* 1 day to process 560MB dataset

– Does not work for datasets larger than 2GB


Flow Accumulation Scalability Problem

• Natural algorithm may require ~N I/Os

– “Push” flow down the terrain by visiting cells in height order

Problem since cells of same height scattered over terrain

• Natural to try “tiling” (ArcGIS?)

– But computation in different tiles not independent
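The gap between ~N and ~N/B I/Os can be illustrated by counting block loads for the two visiting orders. The sketch below is a simplification under stated assumptions: the terrain is synthetic with random heights (real terrains are smoother), the grid is stored row by row in blocks of B cells, memory holds 4 blocks with least-recently-used eviction, and only the read of the visited cell is counted, not the writes to its neighbors.

```python
import random
from collections import OrderedDict

def block_loads(access_order, block_size, memory_blocks):
    """Count block loads for a sequence of cell indices with an LRU memory."""
    cache = OrderedDict()
    loads = 0
    for index in access_order:
        block = index // block_size
        if block in cache:
            cache.move_to_end(block)
        else:
            loads += 1
            cache[block] = None
            if len(cache) > memory_blocks:
                cache.popitem(last=False)
    return loads

rng = random.Random(0)  # fixed seed so the run is repeatable
n_cells, block_size, memory_blocks = 64 * 64, 64, 4
heights = [rng.random() for _ in range(n_cells)]  # synthetic terrain (assumption)

row_order = range(n_cells)  # plain scan of the grid
height_order = sorted(range(n_cells), key=lambda i: heights[i], reverse=True)

scan_loads = block_loads(row_order, block_size, memory_blocks)
push_loads = block_loads(height_order, block_size, memory_blocks)
print(scan_loads)  # 64, i.e. N/B
print(push_loads)  # far larger, approaching N on this random terrain
```

On this synthetic input nearly every visited cell lies in a block that is not in memory, which is the scattered-access behavior the slide describes.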


TerraFlow

• We developed theoretically I/O-optimal algorithms using ~N/B I/Os

• Avoiding scattered access by:

– Grid storing input: Data duplication

– Grid storing flow: “Lazy write”

• Implementation was very efficient

– Appalachian Mountains flow accumulation in 3 hours!

• Developed into comprehensive software package for flow computation on massive grids (www.cs.duke.edu/geo*/terraflow)

– Efficient: 2-1000 times faster than other software on massive grids

– Scalable: 1 billion elements! (>2GB data)

– Flexible: Different flow modeling (direction) methods
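The slides do not show how TerraFlow implements data duplication and lazy writes, so the following is only one possible reading of the two ideas, not TerraFlow's code. Each cell record carries a copy of its neighbors' heights (duplication), so flow directions can be computed while scanning the sorted records without looking back into the grid. Flow sent to a lower cell is not written into the grid; it is placed in a priority queue and picked up when the scan reaches that cell (lazy write). Python's in-memory `heapq` stands in for an I/O-efficient priority queue, and all function names are hypothetical.

```python
import heapq
import math

OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def make_records(grid):
    """One record per cell: (-height, row, col, copies of neighbor heights)."""
    rows, cols = len(grid), len(grid[0])
    records = []
    for r in range(rows):
        for c in range(cols):
            neighbors = [(r + dr, c + dc, grid[r + dr][c + dc])
                         for dr, dc in OFFSETS
                         if 0 <= r + dr < rows and 0 <= c + dc < cols]
            records.append((-grid[r][c], r, c, neighbors))
    return records

def lazy_flow_accumulation(grid):
    """SFD flow accumulation by one scan over records sorted by height."""
    pending = []  # heap of (key of receiving cell, amount of flow)
    flow = {}
    for neg_h, r, c, neighbors in sorted(make_records(grid)):  # highest first
        key = (neg_h, r, c)
        total = 1  # the cell's own unit of water
        while pending and pending[0][0] == key:
            total += heapq.heappop(pending)[1]  # collect flow sent earlier
        flow[(r, c)] = total
        target, best_slope = None, 0.0
        for nr, nc, nh in neighbors:  # uses only the duplicated heights
            slope = (-neg_h - nh) / math.hypot(nr - r, nc - c)
            if slope > best_slope:
                target, best_slope = (-nh, nr, nc), slope
        if target is not None:
            heapq.heappush(pending, (target, total))  # lazy write
    return flow

grid = [[3, 2, 4],
        [7, 5, 8],
        [7, 1, 9]]
result = lazy_flow_accumulation(grid)
print(result[(0, 1)], result[(2, 1)])  # 3 6
```

Because a receiving cell is always lower than the sender, its key sorts later in the scan, so every queued message is still waiting when its cell comes up.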


TerraFlow

• Significant speedup over ArcInfo for large datasets

– East-Coast (100m)

TerraFlow: 8.7 Hours

ArcInfo: 78 Hours

– Washington state (10m)

TerraFlow: 63 Hours

ArcInfo: %

• Incorporated in GRASS 5.0.2 and later

• Recently also extensions for ArcGIS 8 and 9

[Figure: running time (hours, axis 0 to 90) of TerraFlow 512, TerraFlow 128, ArcInfo 512, and ArcInfo 128 on the datasets Hawaii (56M), Cumberlands (80M), Lower NE (256M), East-Coast (491M), Midwest (561M), and Washington (2G)]

500 MHz Alpha, FreeBSD 4.0


Denmark?


Denmark Terrain Data

• Mainly two data suppliers in Denmark

– Kort & Matrikelstyrelsen

– COWI A/S

• Grid/vector models based on paper maps/orthophotos

• LIDAR data for major cities

• Unfortunately not available online (and not free)

– But obviously increasing interest in terrain data/applications


New Project

• New (NABIIT) project: Development of algorithms and software for processing massive terrain data

– COWI A/S

* Problems processing LIDAR data during production and analysis (e.g. railroad noise)

– Spatial analysis unit, Danish Institute of Agricultural Sciences

* Use data, e.g. to comply with EU directives

– Computer science, Aarhus University

* Efficient algorithms

• Focus on

– Terrain modeling, terrain flow analysis, influence of simplification


Example Sub-Projects

• Terrain modeling, e.g.:

– Terrain models from “raw” LIDAR

Process >10G raw data in a few hours using only 128M memory

• Terrain analysis, e.g.:

– Erosion modeling (USLE factor computation)

– Watershed hierarchy computation

NC Neuse basin at 10m resolution (~400M cells) in 3 hours


Summary


Summary

• Massive datasets appear everywhere

• Leads to scalability problems

– Due to hierarchical memory and slow I/O

• I/O-efficient algorithms greatly improve scalability

• Terrain data:

– Massive grid data exists

– New technologies are creating massive and very detailed datasets

– Processing capabilities lag behind


Summary - Resources

• Google earth: http://earth.google.com/

• USGS national map: http://seamless.usgs.gov

• USGS center for LIDAR information: http://lidar.cr.usgs.gov

• NC floodmaps: http://www.ncfloodmaps.com

• NOAA LIDAR data retrieval tool: http://www.csc.noaa.gov/crs/tcm/about_ldart.html

• TerraFlow: http://www.cs.duke.edu/geo*/terraflow

• Duke STREAM project: http://terrain.cs.duke.edu

• Kort & Matrikelstyrelsen: http://www.kms.dk

• COWI A/S: http://www.cowi.dk

• Geoforum: http://www.geoforum.dk/


THANKS/TAK

Lars Arge

[email protected]
