MASSIVE Terrain Datasets: on the importance of efficient algorithms. Lars Arge, Datalogisk Institut, Aarhus Universitet. Regional one-day course in computer science, 20 March 2006


Page 1

MASSIVE Terrain Datasets: on the importance of efficient algorithms

Lars Arge

Datalogisk Institut

Aarhus Universitet

Regional one-day course in computer science, 20 March 2006

Page 2

Outline

1. Massive (terrain) data

2. Scalability problems (I/O bottleneck)

3. Processing massive terrain data: Flow modeling on grid terrains

4. Summary

Page 3

Massive Data

Page 4

Massive Data

• Massive datasets are being collected everywhere

• Storage management software is a billion-dollar industry

Examples (2002):

• Phone: AT&T 20TB phone call database, wireless tracking

• Consumer: WalMart 70TB database, buying patterns (supermarket checkout)

• WEB: Web crawl of 200M pages and 2000M links, Akamai stores 7 billion clicks per day

• Geography: NASA satellites generate 1.2TB per day

Page 5

Example: Satellite Images

– Terabyte image database

Page 6

Example: Grid Terrain Data

• Grid terrain data increasingly available

– NASA SRTM mission acquired 30m data for around 80% of earth land mass

– US data readily available through USGS National Map Seamless Data Distribution System

• Appalachian Mountains (800km x 800km)

– 100m resolution: ~64M cells, ~128MB raw data (~500MB when processing)

– ~1.2GB at 30m resolution

– ~12GB at 10m resolution (much of US available from USGS)

– ~1.2TB at 1m resolution (selected, mostly military, availability)
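The size estimates above follow from simple arithmetic on the grid dimensions. The sketch below is a rough illustration only: it assumes a square 800 km x 800 km area and 2 bytes per cell (inferred from ~64M cells occupying ~128MB; the slides do not state the cell size), and the function name `grid_size` is made up for this example.

```python
def grid_size(side_km, resolution_m, bytes_per_cell=2):
    """Return (cells, raw_bytes) for a square grid terrain.

    bytes_per_cell=2 is an assumption, not a figure taken from the slides.
    """
    cells_per_side = side_km * 1000 // resolution_m
    cells = cells_per_side ** 2
    return cells, cells * bytes_per_cell


for resolution in (100, 30, 10, 1):
    cells, raw_bytes = grid_size(800, resolution)
    print(f"{resolution:>3} m: {cells / 1e6:,.0f}M cells, {raw_bytes / 1e9:,.2f} GB raw")
```

Treat the output as order-of-magnitude estimates; the slide figures are rounded.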

Page 7

Example: LIDAR Terrain Data

• Massive (irregular) point sets (1-10m resolution)

– Becoming relatively cheap and easy to collect

• NC floodplain mapping program: www.ncfloodmaps.com

– Collected LIDAR for all NC after Hurricane Floyd in 1999

– Still processing it

Page 8

Hurricane Floyd

• Sep. 15, 1999

[Images: 7 am and 3 pm]

Page 9

Example: LIDAR Terrain Data

• US LIDAR data becoming available:

– www.ncfloodmaps.com

– USGS Center for LIDAR Information Coordination and Knowledge (CLICK)

– NOAA LIDAR Data Retrieval Tool (LDART)

Page 10

Scalability Problems

Page 11

Scalability Problems: I/O-Bottleneck

• I/O is often the bottleneck when handling massive datasets

• Disk access is 10^6 times slower than main memory access

– Disk systems try to amortize the large access time by transferring large contiguous blocks of data

• Need to store and access data to take advantage of blocks (locality)

[Figure: disk drive with labels: track, magnetic surface, read/write arm, read/write head]

“The difference in speed between modern CPU and disk technologies is analogous to the difference in speed in sharpening a pencil using a sharpener on one’s desk or by taking an airplane to the other side of the world and using a sharpener on someone else’s desk.” (D. Comer)

Page 12

Scalability Problems: Block Access Matters

• Example: Reading an array from disk

– Array size N = 10 elements

– Disk block size B = 2 elements

– Main memory size M = 4 elements (2 blocks)

[Figure: order in which the 10 array positions are read. Algorithm 1 (read order 1 5 2 6 3 8 9 4 7 10 across the array) loads N = 10 blocks; Algorithm 2 (read order 1 2 10 9 5 6 3 4 8 7) loads N/B = 5 blocks.]

• Difference between N and N/B is large since block size is large

– Example: N = 256 x 10^6, B = 8000, 1ms disk access time

N I/Os take 256 x 10^3 sec = 4266 min = 71 hr

N/B I/Os take 256/8 sec = 32 sec
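The block-access example can be checked with a small simulation. This is a minimal sketch: the LRU replacement policy is an assumption (the slides do not name one), the two access orders are read off the figure as zero-based array positions, and the function names are illustrative.

```python
from collections import OrderedDict


def count_block_loads(access_order, block_size, memory_blocks):
    """Count block loads for a sequence of element indices.

    Memory holds `memory_blocks` blocks and evicts the least recently
    used one (LRU is an assumption made for this sketch).
    """
    cache = OrderedDict()
    loads = 0
    for index in access_order:
        block = index // block_size
        if block in cache:
            cache.move_to_end(block)       # block already in memory
        else:
            loads += 1                     # one I/O to fetch the block
            cache[block] = True
            if len(cache) > memory_blocks:
                cache.popitem(last=False)  # evict least recently used block
    return loads


def io_time_seconds(num_ios, access_time_ms=1.0):
    """Total time for num_ios disk accesses at access_time_ms each."""
    return num_ios * access_time_ms / 1000.0


# N = 10 elements, B = 2 elements per block, memory holds 2 blocks.
scattered = [0, 2, 4, 7, 1, 3, 8, 5, 6, 9]  # Algorithm 1: jumps between blocks
blockwise = [0, 1, 6, 7, 4, 5, 9, 8, 3, 2]  # Algorithm 2: finishes each block
print(count_block_loads(scattered, 2, 2))   # 10 loads
print(count_block_loads(blockwise, 2, 2))   # 5 loads

# N = 256 x 10^6 elements, B = 8000, 1 ms per disk access.
N, B = 256 * 10**6, 8000
print(io_time_seconds(N) / 3600)            # roughly 71 hours
print(io_time_seconds(N // B))              # 32 seconds
```

Both algorithms touch every element exactly once; only the order differs, and that order alone accounts for the gap between N and N/B I/Os.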

Page 13

Scalability Problems: Block Access Matters

• Most programs are developed without memory considerations

– Infinite memory

– Uniform access cost

• They run on large datasets because the OS moves blocks as needed

• Modern OSs utilize sophisticated paging and prefetching strategies

– But if a program makes scattered accesses, even a good OS cannot take advantage of block access

Scalability problems!

[Figure: plot of running time against data size, with a RAM label]

Page 14

Scalability: Hierarchical Memory

• Block access is important not only at the disk level

• Machines have a complicated memory hierarchy

– Levels get larger and slower

– Block transfers on all levels

• We focus on the disk level:

[Figure: memory hierarchy (L1, L2, RAM) and plot of running time against data size, with a RAM label]

Page 15

Processing Massive Terrain Data: Flow

Page 16

Flow on Terrains

• Modeling of water flow on terrains has many important applications

– Predict location of streams

– Predict areas susceptible to floods

– Compute watersheds

– Predict erosion

– Predict vegetation distribution

– ……

• Conceptually flow is modeled using two basic attributes

– Flow direction: The direction water flows at a point

– Flow accumulation: Amount of water flowing through a point

• Flow accumulation used to compute other hydrological attributes, e.g. drainage network, topographic convergence index…

Page 17

Flow Directions on Grid Terrains

• Common terrain representation: Grid

• Flow directions: Water in each cell flows to downslope neighbor(s)

– Commonly used:

* Single flow direction (SFD or D8): flow to one downslope neighbor

* Multiple flow direction (MFD): flow to all downslope neighbors

[Figure: SFD and MFD flow directions on a 3 x 3 example grid with rows (3 2 4), (7 5 8), (7 1 9)]
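The two flow-direction rules can be sketched on the 3 x 3 example grid. This is an illustration under stated assumptions: SFD is taken to mean the steepest downslope neighbor (the usual D8 rule), slope is the height drop divided by the distance between cell centers, and all function names are hypothetical.

```python
import math

# Offsets of the 8 neighbours of a grid cell.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]


def downslope_neighbours(grid, r, c):
    """Return [(slope, (nr, nc)), ...] for all strictly lower neighbours."""
    result = []
    for dr, dc in OFFSETS:
        nr, nc = r + dr, c + dc
        if 0 <= nr < len(grid) and 0 <= nc < len(grid[0]):
            drop = grid[r][c] - grid[nr][nc]
            if drop > 0:
                # Diagonal neighbours are sqrt(2) cell widths away.
                result.append((drop / math.hypot(dr, dc), (nr, nc)))
    return result


def sfd(grid, r, c):
    """Single flow direction: the steepest downslope neighbour, or None."""
    lower = downslope_neighbours(grid, r, c)
    return max(lower)[1] if lower else None


def mfd(grid, r, c):
    """Multiple flow direction: all downslope neighbours."""
    return sorted(cell for _, cell in downslope_neighbours(grid, r, c))


grid = [[3, 2, 4],
        [7, 5, 8],
        [7, 1, 9]]
print(sfd(grid, 1, 1))  # (2, 1): the cell with height 1
print(mfd(grid, 1, 1))  # [(0, 0), (0, 1), (0, 2), (2, 1)]
```

For the center cell (height 5), SFD sends all water to the cell of height 1, while MFD spreads it over the four lower neighbors.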

Page 18

Flow Accumulation on Grid Terrains

• Flow accumulation

– Initially one unit of water in each cell

– Water distributed from each cell according to flow direction(s)

– Flow accumulation of cell is total flow through it
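The definition above can be turned into a small in-memory computation. A minimal sketch, assuming single flow direction to the steepest downslope neighbor and cells visited from highest to lowest; `flow_accumulation` is an illustrative name, not a function from any GIS package.

```python
import math

OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]


def flow_accumulation(grid):
    """Flow accumulation with single flow direction (steepest descent)."""
    rows, cols = len(grid), len(grid[0])
    acc = [[1.0] * cols for _ in range(rows)]  # one unit of water per cell
    # Visit cells from highest to lowest, so every cell has received all
    # incoming flow before it passes its own total downhill.
    cells = sorted(((grid[r][c], r, c) for r in range(rows) for c in range(cols)),
                   reverse=True)
    for height, r, c in cells:
        best, best_slope = None, 0.0
        for dr, dc in OFFSETS:
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                slope = (height - grid[nr][nc]) / math.hypot(dr, dc)
                if slope > best_slope:
                    best, best_slope = (nr, nc), slope
        if best is not None:
            acc[best[0]][best[1]] += acc[r][c]  # push flow to the neighbour
    return acc


grid = [[3, 2, 4],
        [7, 5, 8],
        [7, 1, 9]]
for row in flow_accumulation(grid):
    print(row)
```

On this grid the two local minima (heights 2 and 1) collect 3 and 6 units respectively, which together account for all 9 cells.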

Page 19

Flow Accumulation Example (Panama dataset)

Page 20

Flow Modeling on Massive Grid Terrains

• Duke University environmental researchers had problems computing flow accumulation for the Appalachian Mountains

– Recall ~128MB raw data and ~500MB when processing

Running time: 14 days

• It could be much worse; recall

– ~ 1.2GB at 30m resolution

– ~ 12GB at 10m resolution

– ~ 1.2TB at 1m resolution

Page 21

Flow Modeling on Massive Grid Terrains

• We surveyed other flow accumulation software

• GRASS (leading open-source GIS)

– Killed after 17 days on a 50MB dataset (6700 x 4300 grid)

• TARDEM (specialized hydrology software)

– Could handle 50MB dataset

– Killed after 20 days on a 240MB dataset (12000 x 10000 grid)

* CPU utilization 5%, 3GB swap file

• ArcGIS (leading commercial GIS)

– Could handle the 240MB dataset

– Sometimes very slow:

* 3 days to process 490MB dataset

* 1 day to process 560MB dataset

– Does not work for datasets larger than 2GB

Page 22

Flow Accumulation Scalability Problem

• Natural algorithm may require ~N I/Os

– “Push” flow down the terrain by visiting cells in height order

– Problem since cells of same height are scattered over the terrain

• Natural to try “tiling” (ArcGIS?)

– But computations in different tiles are not independent
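The cost of visiting cells in height order can be illustrated by counting block loads on a synthetic grid. Everything here is an assumption made for the sketch: a row-major layout on disk, one grid row per block, an LRU memory of two blocks, and a hypothetical terrain whose heights increase down each column so that height order walks column by column.

```python
from collections import OrderedDict


def block_loads(cell_order, cols, block_size, memory_blocks):
    """Count block loads when visiting grid cells in the given order."""
    cache, loads = OrderedDict(), 0
    for r, c in cell_order:
        block = (r * cols + c) // block_size  # row-major layout on disk
        if block in cache:
            cache.move_to_end(block)
        else:
            loads += 1
            cache[block] = True
            if len(cache) > memory_blocks:
                cache.popitem(last=False)     # evict least recently used
    return loads


rows = cols = 8
# Hypothetical terrain: height grows with the column first, then the row,
# so consecutive cells in height order lie in different grid rows.
height = [[c * rows + r for c in range(cols)] for r in range(rows)]
cells = [(r, c) for r in range(rows) for c in range(cols)]
height_order = sorted(cells, key=lambda rc: height[rc[0]][rc[1]], reverse=True)

print(block_loads(height_order, cols, block_size=8, memory_blocks=2))  # 64
print(block_loads(cells, cols, block_size=8, memory_blocks=2))         # 8
```

With N = 64 cells and B = 8, the height-order visit performs N loads while a row-by-row scan performs N/B.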

Page 23

TerraFlow

• We developed theoretically I/O-optimal algorithms using ~N/B I/Os

• Avoiding scattered access by:

– Grid storing input: Data duplication

– Grid storing flow: “Lazy write”

• Implementation was very efficient

– Appalachian Mountains flow accumulation in 3 hours!

• Developed into comprehensive software package for flow computation on massive grids (www.cs.duke.edu/geo*/terraflow)

– Efficient: 2-1000 times faster than other software on massive grids

– Scalable: 1 billion elements! (>2GB data)

– Flexible: Different flow modeling (direction) methods
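The slides name "data duplication" and "lazy write" without spelling them out. One plausible reading, sketched below, is that each cell record carries copies of its neighbors' heights, and that flow sent downhill is buffered in a priority queue keyed by the receiving cell's position in the processing order instead of being written into a scattered grid location. This is an in-memory illustration of that idea, not TerraFlow's actual code: the standard-library sort and `heapq` stand in for their I/O-efficient counterparts, flow goes to the lowest neighbor (a simplification of SFD), and all names are hypothetical.

```python
import heapq

OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]


def flow_accumulation_lazy(grid):
    """Flow accumulation without random writes into a flow grid."""
    rows, cols = len(grid), len(grid[0])

    # Data duplication: every record stores its own key plus the heights and
    # positions of its neighbours, so no other cell is looked up later.
    records = []
    for r in range(rows):
        for c in range(cols):
            neighbours = [(grid[r + dr][c + dc], r + dr, c + dc)
                          for dr, dc in OFFSETS
                          if 0 <= r + dr < rows and 0 <= c + dc < cols]
            records.append((-grid[r][c], r, c, neighbours))
    records.sort()  # highest cell first; ties broken by (row, column)

    pending = []    # heap of (key of receiving cell, amount of flow)
    acc = {}
    for neg_height, r, c, neighbours in records:
        key = (neg_height, r, c)
        flow = 1.0  # one unit of water starts in each cell
        # Lazy write: collect the flow that earlier cells sent to this one.
        while pending and pending[0][0] == key:
            flow += heapq.heappop(pending)[1]
        acc[(r, c)] = flow
        lower = [n for n in neighbours if n[0] < -neg_height]
        if lower:
            h, nr, nc = min(lower)  # lowest neighbour (simplified SFD)
            heapq.heappush(pending, ((-h, nr, nc), flow))
    return acc


acc = flow_accumulation_lazy([[3, 2, 4],
                              [7, 5, 8],
                              [7, 1, 9]])
print(acc[(2, 1)], acc[(0, 1)])  # flow collected at the two local minima
```

Because a receiving cell is always lower than the sender, its key sorts after the current one, so the queue delivers each buffered amount exactly when the sorted scan reaches that cell.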

Page 24

TerraFlow

• Significant speedup over ArcInfo for large datasets

– East-Coast (100m)

TerraFlow: 8.7 Hours

ArcInfo: 78 Hours

– Washington state (10m)

TerraFlow: 63 Hours

ArcInfo: %

• Incorporated in GRASS 5.0.2 and later

• Recently also extensions for ArcGIS 8 and 9

[Figure: bar chart of running time (hours, axis 0 to 90) for TerraFlow 512, TerraFlow 128, ArcInfo 512 and ArcInfo 128 on the datasets Hawaii (56M), Cumberlands (80M), Lower NE (256M), East-Coast (491M), Midwest (561M) and Washington (2G). Platform: 500 MHz Alpha, FreeBSD 4.0]

Page 25

Denmark?

Page 26

Denmark Terrain Data

• Mainly two data suppliers in Denmark

– Kort & Matrikelstyrelsen

– COWI A/S

• Grid/vector models based on paper maps/orthophotos

• LIDAR data for major cities

• Unfortunately not available online (and not free)

– But obviously increasing interest in terrain data/applications

Page 27

New Project

• New (NABIIT) project: Development of algorithms and software for processing massive terrain data

– COWI A/S

* Problems processing LIDAR data during production and analysis (e.g. railroad noise)

– Spatial analysis unit, Danish Institute of Agricultural Sciences

* Use data, e.g. to comply with EU directives

– Computer science, Aarhus University

* Efficient algorithms

• Focus on

– Terrain modeling, terrain flow analysis, influence of simplification

Page 28

Example Sub-Projects

• Terrain modeling, e.g.:

– Terrain models from “raw” LIDAR

Process >10G raw data in a few hours using only 128M memory

• Terrain analysis, e.g.:

– Erosion modeling (USLE factor computation)

– Watershed hierarchy computation

NC Neuse basin at 10m resolution (~400M cells) in 3 hours

Page 29

Summary

Page 30

Summary

• Massive datasets appear everywhere

• Leads to scalability problems

– Due to hierarchical memory and slow I/O

• I/O-efficient algorithms greatly improve scalability

• Terrain data:

– Massive grid data exists

– New technologies are creating massive and very detailed datasets

– Processing capabilities lag behind

Page 31

Summary - Resources

• Google earth: http://earth.google.com/

• USGS national map: http://seamless.usgs.gov

• USGS center for LIDAR information: http://lidar.cr.usgs.gov

• NC floodmaps: http://www.ncfloodmaps.com

• NOAA LIDAR data retrieval tool: http://www.csc.noaa.gov/crs/tcm/about_ldart.html

• TerraFlow: http://www.cs.duke.edu/geo*/terraflow

• Duke STREAM project: http://terrain.cs.duke.edu

• Kort & Matrikelstyrelsen: http://www.kms.dk

• COWI A/S: http://www.cowi.dk

• Geoforum: http://www.geoforum.dk/

Page 32

THANKS/TAK

Lars Arge

[email protected]