
Oct 29, 2001, CSE 373, Autumn 2001

External Storage

• For large data sets, the computer will have to access the disk.

• Disk access can take 200,000 times longer than a machine instruction.

• The RAM model does not account for disk I/O.

memory: 128 MB (fast, expensive)

disk: 60 GB (slow, cheap)


Disks, continued

• The difference between memory speed and disk speed is increasing.

• Example: State of Florida driving records (256 bytes each), 10,000,000 items, 6 disk accesses per second on a time-sharing system.

• Unbalanced binary search tree: possibly 10,000,000 accesses in the worst case.

• BST: on average about 32 accesses (5 sec.).

• AVL tree: worst case 1.44 log n accesses; typical case about log n, roughly 25 accesses (4 sec.).
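The access counts above can be reproduced with a short calculation. This is a rough sketch, not part of the original slides: it assumes the average depth of a randomly built BST is about 2 ln N (roughly 1.39 log2 N), which is where the figure of 32 comes from. The exact value of log2 N for 10,000,000 items is a little over 23; the slide rounds it up to 25.

```python
import math

N = 10_000_000          # number of records (from the slide)
ACCESSES_PER_SEC = 6    # disk accesses per second (from the slide)

log2_n = math.log2(N)

# Estimated disk accesses per find; the 2 ln N average is an assumption
# about randomly built BSTs, not something stated on the slide.
estimates = {
    "unbalanced BST, worst case": N,
    "random BST, average (2 ln N)": 2 * math.log(N),
    "AVL, worst case (1.44 log2 N)": 1.44 * log2_n,
    "AVL, typical (log2 N)": log2_n,
}

for name, accesses in estimates.items():
    seconds = accesses / ACCESSES_PER_SEC
    print(f"{name}: {accesses:,.0f} accesses, about {seconds:,.1f} s")
```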


Disk accesses

• Goal: reduce the number of disk accesses.

• We are willing to do more complicated computations in memory in order to save disk time.

• Idea: increase the branching of the tree so that the height is decreased.

• Defn: An M-ary search tree allows up to M children per node.
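To see how much branching helps, the sketch below computes the smallest height at which a tree with M children per node has room for N items. The function name `min_height` is made up for illustration, and the tree is assumed to be perfectly balanced.

```python
def min_height(n_items: int, m: int) -> int:
    """Smallest h such that m**h >= n_items, for a perfectly balanced M-ary tree."""
    height, capacity = 0, 1
    while capacity < n_items:
        capacity *= m   # each extra level multiplies the capacity by m
        height += 1
    return height

# Height needed for 10,000,000 items at several branching factors.
for m in (2, 10, 100, 1000):
    print(f"M = {m:4d}: height {min_height(10_000_000, m)}")
```

With 10,000,000 items the height drops from 24 for a binary tree to 3 for M = 1000, so each find touches far fewer nodes, and therefore far fewer disk blocks.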


B-Trees

1. All the data items are stored at the leaves.

2. The non-leaf nodes store up to M-1 keys. The ith key represents the smallest key in subtree i+1.

3. The root is either a leaf or has between 2 and M children.

4. All non-leaf nodes (except the root) have between ⌈M/2⌉ and M children.

5. All leaves are at the same depth and have between ⌈L/2⌉ and L data items.
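Property 2 is what drives a find: at each non-leaf node, the keys tell us which subtree to descend into. The following is a minimal sketch of that search only (no insertion or deletion). The class names `Leaf` and `Internal` are hypothetical, and the nodes live in memory here, whereas in practice each node would be one disk block.

```python
from bisect import bisect_right
from dataclasses import dataclass

@dataclass
class Leaf:
    items: list      # sorted data items (up to L of them)

@dataclass
class Internal:
    keys: list       # keys[i] is the smallest key in children[i + 1]
    children: list   # len(children) == len(keys) + 1

def find(node, x) -> bool:
    """Walk from the root to a leaf; one node visit per level."""
    while isinstance(node, Internal):
        # The number of keys <= x is the index of the child to follow.
        node = node.children[bisect_right(node.keys, x)]
    return x in node.items

# A tiny example tree with M = 3 and L = 2.
root = Internal(keys=[5, 9],
                children=[Leaf([1, 3]), Leaf([5, 7]), Leaf([9, 11])])
print(find(root, 7), find(root, 4))
```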


B-Trees: Choices

• Choose M and L based on the size of the keys K and on the size of the record R.

• Suppose a disk block is of size B (bytes). Choose M so that a non-leaf node fits into one block:

$B \ge (M-1) \cdot K + M \cdot 4$

• Choose L so that a leaf node fits into one block:

$B \ge L \cdot R$

• accesses: $\log_2 N$ vs. $\log_{M/2} N$
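As a worked instance of these two constraints, the sketch below picks the largest M and L that fit in one block. The record size R = 256 bytes comes from the driving-records example; the block size B = 8192 bytes and key size K = 32 bytes are assumptions chosen for illustration, and the 4 in the first constraint is read as a 4-byte child pointer.

```python
def choose_m(block_size: int, key_size: int, pointer_size: int = 4) -> int:
    """Largest M with (M - 1) * key_size + M * pointer_size <= block_size."""
    return (block_size + key_size) // (key_size + pointer_size)

def choose_l(block_size: int, record_size: int) -> int:
    """Largest L with L * record_size <= block_size."""
    return block_size // record_size

B = 8192   # assumed disk block size in bytes (not from the slides)
K = 32     # assumed key size in bytes (not from the slides)
R = 256    # record size in bytes (from the driving-records example)

M = choose_m(B, K)
L = choose_l(B, R)
print(f"M = {M}, L = {L}")
```

Under these assumptions M = 228 and L = 32, so a non-leaf node can have up to 228 children and a leaf can hold up to 32 records.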


Hash Tables

• Constant time accesses!

• A hash table is an array of some fixed size; the size is usually a prime number.

• General idea: a hash function h(K) maps each key K in the key space (e.g., strings) to a cell of the hash table, indexed from 0 to TableSize – 1.


Desirable Properties

We want a hash function to:

1. be simple/fast to compute,

2. map different keys to different cells, (impossible – why?)

3. have keys distributed evenly among cells.

Idea: If #1 and #3 are true and the hash table is not very full, then it should be fast to do a find.


Example

• key space = integers

• h(K) = K mod 10

cell  key
0
1     41
2
3
4     34
5
6
7     7
8     18
9

We lose all ordering information: findMin, findMax, inorder traversal, and printing items in sorted order.
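The example table can be rebuilt in a few lines. This is a small sketch using the four keys from the slide; it does not handle collisions, which these keys happen not to produce.

```python
TABLE_SIZE = 10

def h(key: int) -> int:
    """Hash function from the example: h(K) = K mod 10."""
    return key % TABLE_SIZE

table = [None] * TABLE_SIZE
for key in (41, 34, 7, 18):
    table[h(key)] = key   # these four keys land in four different cells

print(table)

# Ordering is lost: finding the minimum means scanning every cell.
smallest = min(k for k in table if k is not None)
print(smallest)
```

A find for a single key is one array lookup, but findMin has to visit all TableSize cells, which is the cost of giving up ordering information.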


Example 2

• key space = strings

• $s = s_0 s_1 s_2 \ldots s_{k-1}$

$h(s) = s_0 \bmod \text{TableSize}$ (BAD HASH FUNCTION)

$h(s) = \left( \sum_{i=0}^{k-1} s_i \cdot 37^i \right) \bmod \text{TableSize}$ (BETTER HASH FUNCTION)
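Both string hash functions are easy to try out. The sketch below treats each character s_i as its character code via `ord`; the table size 10007 is a prime picked for illustration and is not given on the slide.

```python
def bad_hash(s: str, table_size: int) -> int:
    """Uses only the first character: h(s) = s_0 mod TableSize."""
    return ord(s[0]) % table_size

def better_hash(s: str, table_size: int) -> int:
    """Uses every character: h(s) = (sum of s_i * 37**i) mod TableSize."""
    total = 0
    for i, ch in enumerate(s):
        total += ord(ch) * 37 ** i
    return total % table_size

TABLE_SIZE = 10007  # assumed prime table size, for illustration only

for word in ("ab", "ac"):
    print(word, bad_hash(word, TABLE_SIZE), better_hash(word, TABLE_SIZE))
```

The first function sends every string with the same first character to the same cell, so at most a few dozen cells are ever used for ordinary text. The second lets every character influence the result, which spreads keys much more evenly.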


Collision Resolution

• Separate chaining: All keys that map to the same hash value are kept in a list.

[Figure: a hash table with cells 0 to 9, each holding a list of keys. Keys shown: 10, 107, and 22, 12, 42 together in one list.]
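Separate chaining can be sketched with one Python list per cell. The hash function h(K) = K mod 10 is an assumption carried over from the earlier integer example, chosen because it puts 22, 12, and 42 in the same list; the helper names `insert` and `find` are illustrative.

```python
TABLE_SIZE = 10  # assumed, matching the earlier h(K) = K mod 10 example

def h(key: int) -> int:
    return key % TABLE_SIZE

# One (initially empty) chain per cell.
table = [[] for _ in range(TABLE_SIZE)]

def insert(key: int) -> None:
    """Add key to the chain for its cell, ignoring duplicates."""
    chain = table[h(key)]
    if key not in chain:
        chain.append(key)

def find(key: int) -> bool:
    """Search only the one chain the key could be in."""
    return key in table[h(key)]

for key in (10, 107, 22, 12, 42):
    insert(key)

for cell, chain in enumerate(table):
    print(cell, chain)
```

A find costs one hash computation plus a scan of a single chain, so it stays fast as long as the chains stay short, that is, as long as the keys are spread evenly and the table is not too full.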