32
SmartCuckoo: A Fast and Cost-Efficient Hashing Index Scheme for Cloud Storage Systems Yuanyuan Sun, Yu Hua, Song Jiang*, Qiuyu Li, Shunde Cao, Pengfei Zuo Huazhong University of Science and Technology *University of Texas, Arlington 1 Presented in the USENIX ATC 2017

SmartCuckoo: A Fast and Cost-Efficient Hashing Index ...nEndless loops widely exist in the Cuckoo hashing structures. Ø More than 25% (cuckoo hashing with a stash) nLoop ratio: the

  • Upload
    others

  • View
    6

  • Download
    0

Embed Size (px)

Citation preview

Page 1: SmartCuckoo: A Fast and Cost-Efficient Hashing Index ...nEndless loops widely exist in the Cuckoo hashing structures. Ø More than 25% (cuckoo hashing with a stash) nLoop ratio: the

SmartCuckoo: A Fast and Cost-Efficient Hashing Index Scheme for Cloud Storage Systems

Yuanyuan Sun, Yu Hua, Song Jiang*, Qiuyu Li, Shunde Cao, Pengfei Zuo

Huazhong University of Science and Technology*University of Texas, Arlington

1

Presented in the USENIX ATC 2017

Page 2: SmartCuckoo: A Fast and Cost-Efficient Hashing Index ...nEndless loops widely exist in the Cuckoo hashing structures. Ø More than 25% (cuckoo hashing with a stash) nLoop ratio: the

Indexing services in cloud storage

n Large amounts of dataØ From small hand-held devices to large-scale data centersØ 44ZB in total, 5.2TB for each user in 2020 (IDC' 2014)

n Fast query services are important to both users and systemsØ Returning accurate results in a real-time mannerØ Improving system performance and storage efficiency

2

Page 3: SmartCuckoo: A Fast and Cost-Efficient Hashing Index ...nEndless loops widely exist in the Cuckoo hashing structures. Ø More than 25% (cuckoo hashing with a stash) nLoop ratio: the

The importance of hash tables n Hash tables are widely used in data stores and caches

Ø Key-value stores, e.g., Memcached, RedisØ Relational databases, e.g., MonetDB, HyPerØ In-cache index (ICS 2014, MICRO 2015)

n Strengths:Ø Constant-scale addressing complexity ~O(1)Ø Fast query response

n Weakness:Ø Risk of high-latency for handling hashing collisions

n Cuckoo hashing3

Page 4: SmartCuckoo: A Fast and Cost-Efficient Hashing Index ...nEndless loops widely exist in the Cuckoo hashing structures. Ø More than 25% (cuckoo hashing with a stash) nLoop ratio: the

Cuckoo hashing

n Kick-out operations: like cuckoo birdsn Open addressing n Supporting fast lookups: O(1) time complexityn However, insertion latency can be very high and

unpredictable, especiallyØ when an endless loop occurs!

4

Page 5: SmartCuckoo: A Fast and Cost-Efficient Hashing Index ...nEndless loops widely exist in the Cuckoo hashing structures. Ø More than 25% (cuckoo hashing with a stash) nLoop ratio: the

How is an endless loop formed?

H1( )

a

0

1

2

3

4

5

6

7

5

Page 6: SmartCuckoo: A Fast and Cost-Efficient Hashing Index ...nEndless loops widely exist in the Cuckoo hashing structures. Ø More than 25% (cuckoo hashing with a stash) nLoop ratio: the

How is an endless loop formed?

a 0

1

2

3

4

5

6

7

H1( )c

6

Page 7: SmartCuckoo: A Fast and Cost-Efficient Hashing Index ...nEndless loops widely exist in the Cuckoo hashing structures. Ø More than 25% (cuckoo hashing with a stash) nLoop ratio: the

How is an endless loop formed?

a 0

1

2

3

4

5

6

7

cH1( )b

H2( )

7

Page 8: SmartCuckoo: A Fast and Cost-Efficient Hashing Index ...nEndless loops widely exist in the Cuckoo hashing structures. Ø More than 25% (cuckoo hashing with a stash) nLoop ratio: the

How is an endless loop formed?

a 0

1

2

3

4

5

6

7

c

T2T1

b

8

Page 9: SmartCuckoo: A Fast and Cost-Efficient Hashing Index ...nEndless loops widely exist in the Cuckoo hashing structures. Ø More than 25% (cuckoo hashing with a stash) nLoop ratio: the

How is an endless loop formed?

a 0

1

2

3

4

5

6

7

c

T2T1

b

d

e

H1( )x

H2( )

9

Page 10: SmartCuckoo: A Fast and Cost-Efficient Hashing Index ...nEndless loops widely exist in the Cuckoo hashing structures. Ø More than 25% (cuckoo hashing with a stash) nLoop ratio: the

How is an endless loop formed?

a 0

1

2

3

4

5

6

7

c

T2T1

b

d

e

x

My alternative location

10

Kickout for empty buckets

Page 11: SmartCuckoo: A Fast and Cost-Efficient Hashing Index ...nEndless loops widely exist in the Cuckoo hashing structures. Ø More than 25% (cuckoo hashing with a stash) nLoop ratio: the

How is an endless loop formed?

a 0

1

2

3

4

5

6

7

c

T2T1

b

d

e

x

My alternative location

11

Kickout for empty buckets

Page 12: SmartCuckoo: A Fast and Cost-Efficient Hashing Index ...nEndless loops widely exist in the Cuckoo hashing structures. Ø More than 25% (cuckoo hashing with a stash) nLoop ratio: the

How is an endless loop formed?

a 0

1

2

3

4

5

6

7c

T2T1

b

d

e

x

My alternative location

12

Kickout for empty buckets

Page 13: SmartCuckoo: A Fast and Cost-Efficient Hashing Index ...nEndless loops widely exist in the Cuckoo hashing structures. Ø More than 25% (cuckoo hashing with a stash) nLoop ratio: the

How is an endless loop formed?

a 0

1

2

3

4

5

6

7

c

T2T1

b

d

e

x

My alternative location

13

a b

d x

n An endless loop is formed.

n Endless kickouts for any insertion within the loop.

Page 14: SmartCuckoo: A Fast and Cost-Efficient Hashing Index ...nEndless loops widely exist in the Cuckoo hashing structures. Ø More than 25% (cuckoo hashing with a stash) nLoop ratio: the

Observations

n Endless loops widely exist in the Cuckoo hashing structures.Ø More than 25% (cuckoo hashing with a stash)

n Loop ratio: the percentage of insertion failures due to loops

05

101520253035404550

0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50 0.55 0.60 0.65 0.70 0.75 0.80 0.85 0.90

Loop

Rat

ios (

%)

Load Factor

RandomIntegerMacOSDocWords

14

Page 15: SmartCuckoo: A Fast and Cost-Efficient Hashing Index ...nEndless loops widely exist in the Cuckoo hashing structures. Ø More than 25% (cuckoo hashing with a stash) nLoop ratio: the

Existing works

n ChunkStash @USENIX ATC’10Ø Collisions: resursive strategy to relocate one of keys in candidatesØ Loops: an auxiliary linked list (or, hash table)

n MemC3 @NSDI’13Ø Collisions: random and repeat relocation (500 times)Ø Loops: an expansion processØ Stand-alone implementation: libcuckoo @ EuroSys’14

n Horton tables @USENIX ATC’16Ø Recursively evicting keys within a certain search tree height

15

Page 16: SmartCuckoo: A Fast and Cost-Efficient Hashing Index ...nEndless loops widely exist in the Cuckoo hashing structures. Ø More than 25% (cuckoo hashing with a stash) nLoop ratio: the

Motivations

n Due to endless loops:Ø Substantial resources consumption

u A large number of step-by-step kick-out operationsØ Unbounded performance

u Fruitless effort

n Design Goal:Ø Predetermining and avoiding occurrence of endless loops

16

Page 17: SmartCuckoo: A Fast and Cost-Efficient Hashing Index ...nEndless loops widely exist in the Cuckoo hashing structures. Ø More than 25% (cuckoo hashing with a stash) nLoop ratio: the

Our approach: SmartCuckoo

n Tracking item placements in the hash table

Ø Representing the hashing relationship as a directed pseudoforest

Ø Classifying item insertions into three cases

Ø Predetermining and avoiding loops during insertion without any kick-out attempts.

17

Page 18: SmartCuckoo: A Fast and Cost-Efficient Hashing Index ...nEndless loops widely exist in the Cuckoo hashing structures. Ø More than 25% (cuckoo hashing with a stash) nLoop ratio: the

n Pseudoforest:Ø A graph: each vertex has an outdegree of at most oneØ Each connected component (subgraph) has at most one cycle (loop)Ø In a subgraph:

Loop #Vertices = #Edges

How to identify loop(s)?

e

cd

a

k

jn

m

l

bi

hf g

e

cd

a

k

jn

mb

hf g Vacancy

No loop #Vertices = #Edges + 1

18Maximal Non-maximal

Page 19: SmartCuckoo: A Fast and Cost-Efficient Hashing Index ...nEndless loops widely exist in the Cuckoo hashing structures. Ø More than 25% (cuckoo hashing with a stash) nLoop ratio: the

Classification and predeterminationn Three cases depending on the number of vertices added to the graph

n v+0, v+1, and v+2n v+0: 5 possible scenarios based on the status of corresponding subgraph(s)

19

Three cases v+0 v+1 v+2Two insert

positions of a keySame subgraph Different subgraphs A new

oneTwo new

ones

Subgraph status Non-maximal

Maximal Both non-maximal

A maximaland a non-maximal

Both maximal- -

Scenarios (a) (e) (b) (c) (d) - -

Page 20: SmartCuckoo: A Fast and Cost-Efficient Hashing Index ...nEndless loops widely exist in the Cuckoo hashing structures. Ø More than 25% (cuckoo hashing with a stash) nLoop ratio: the

v+0: (a) One non-maximal subgraph

Pseudoforest

cb

d

T1 T2

0

12345

a

67

x1

H1( )

H2( )a

b

c

dH1(x1)

H2(x1)

cb

d

T1 T20

12345

a

67

x1a

b

c

d

x1

n One empty bucketn Success!

20

Page 21: SmartCuckoo: A Fast and Cost-Efficient Hashing Index ...nEndless loops widely exist in the Cuckoo hashing structures. Ø More than 25% (cuckoo hashing with a stash) nLoop ratio: the

v+0: (b) Two non-maximal subgraphs

cb

d

T1 T2

0

12345

a

fg6

7

x2H1( )

H2( )

ab

c

d

H1(x2)

H2(x2)

f

g

cb

d

T1 T20

12345

a

fg6

7

x2

ab

c

d

f

gx2

n Two empty bucketsn Success!

Pseudoforest21

Page 22: SmartCuckoo: A Fast and Cost-Efficient Hashing Index ...nEndless loops widely exist in the Cuckoo hashing structures. Ø More than 25% (cuckoo hashing with a stash) nLoop ratio: the

ab

c

d

e H1(x3)

H2(x3)

f

g

v+0: (c) One maximal and one non-maximal

cb

d

T1 T2

0

12345

a

fg6

7

ex3

H2( )

H1( )c

g

b

d

T1 T20

12345

a

fx36

7

e

ab

c

d

e

f

g x3

n One loop and one empty bucketn Conventional cuckoo hashing: taking a random walk

Ø T1: executing extra useless kick-out operationsØ T2: making a successØ SmartCuckoo: directly selecting to enter from T2

n Success!

Pseudoforest22

Page 23: SmartCuckoo: A Fast and Cost-Efficient Hashing Index ...nEndless loops widely exist in the Cuckoo hashing structures. Ø More than 25% (cuckoo hashing with a stash) nLoop ratio: the

v+0: (d) Two maximal subgraphs

cb

d

T1 T2

0

12345

a

f ig6

7

ex4

H1( )

H2( )

h

ab

c

d

e

H1(x4)

H2(x4)

fi

h g Failure!

n Two loops!n Execution:

Ø Conventional cuckoo hashing: sufficient attempts, then reporting a failureØ SmartCuckoo: reporting a failure without any kick-out operations.

Pseudoforest23

Page 24: SmartCuckoo: A Fast and Cost-Efficient Hashing Index ...nEndless loops widely exist in the Cuckoo hashing structures. Ø More than 25% (cuckoo hashing with a stash) nLoop ratio: the

v+0: (e) One maximal subgraph

cb

d

T1 T2

0

12345

a

67

ex5

H1( )

H2( )

ab

c

d

e

H1(x5)H2(x5)

Failure!

n One loop!

Pseudoforest24

Page 25: SmartCuckoo: A Fast and Cost-Efficient Hashing Index ...nEndless loops widely exist in the Cuckoo hashing structures. Ø More than 25% (cuckoo hashing with a stash) nLoop ratio: the

Case: v+1

Pseudoforest

cb

d

T1 T2

0

12345

a

67

x6

H1( )

H2( )

ab

c

d

H1(x6)

H2(x6)

cb

d

T1 T20

12345

a

67

x6

ab

c

d

x6

n A new vertex after the item's insertionn Success!

25

Page 26: SmartCuckoo: A Fast and Cost-Efficient Hashing Index ...nEndless loops widely exist in the Cuckoo hashing structures. Ø More than 25% (cuckoo hashing with a stash) nLoop ratio: the

Case: v+2

Pseudoforest

cb

d

T1 T2

0

12345

a

67

x7

H1( )

H2( )

ab

c

d

H1(x7) H2(x7)

cb

d

T1 T20

12345

a

67

x7

ab

c

d

x7

n Two new vertices after the insertionn Success!

26

Page 27: SmartCuckoo: A Fast and Cost-Efficient Hashing Index ...nEndless loops widely exist in the Cuckoo hashing structures. Ø More than 25% (cuckoo hashing with a stash) nLoop ratio: the

Evaluation methodologyn Comparisons:

Ø Baseline (Cuckoo hashing with a stash @ SIAM Journal on Computing'09)Ø libcuckoo @ EuroSys'14Ø BCHT (bucketized cuckoo hash table)

n Traces:Ø RandomInteger: random integer generator @ TOMACS'98Ø MacOS: http://tracer.filesystems.orgØ DocWords: http://archive.ics.uci.edu/ml/datasets/Bag+of+WordsØ YCSB: https://github.com/brianfrankcooper/YCSB @ SOCC'11

n Metrics: in millions of operations per secondØ Insertion throughputØ Lookup throughput: positive/negativeØ Throughput of workload with mixed queries (YCSB)

27

Page 28: SmartCuckoo: A Fast and Cost-Efficient Hashing Index ...nEndless loops widely exist in the Cuckoo hashing structures. Ø More than 25% (cuckoo hashing with a stash) nLoop ratio: the

Insertion throughput

0

0.5

1

1.5

2

2.5

3

3.5

0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 0.55 0.6 0.65 0.7 0.75 0.8 0.85 0.9

Mill

ions

of I

nser

tions

Per

Se

cond

Load Factor

BaselinelibcuckooBCHTSmartCuckoo

n SmartCuckoo significantly increases insertion throughputs. n 0.5× to 5×speedups compared to Baseline.

28

0.5×

Page 29: SmartCuckoo: A Fast and Cost-Efficient Hashing Index ...nEndless loops widely exist in the Cuckoo hashing structures. Ø More than 25% (cuckoo hashing with a stash) nLoop ratio: the

Lookup throughput

0

0.5

1

1.5

2

2.5

100% 0%

Mill

ions

of L

ooku

ps P

er

Seco

nd

Percentage of Existent Keys in the Lookup Requests

Baseline libcuckoo BCHT SmartCuckoo

n 0%: all candidate positions for a key have to be accessed.n Almost the same lookup throughput with Baseline.n Significantly higher than libcuckoo and BCHT.

29

Page 30: SmartCuckoo: A Fast and Cost-Efficient Hashing Index ...nEndless loops widely exist in the Cuckoo hashing structures. Ø More than 25% (cuckoo hashing with a stash) nLoop ratio: the

Throughput of workload with mixed queries

0

0.4

0.8

1.2

1.6

2

2.4

YCSB-1 YCSB-2 YCSB-3 YCSB-4 YCSB-5

Mill

ions

of O

pera

tions

Per

Se

cond

Workloads

Baseline

libcuckoo

BCHT

SmartCuckoo

n With the decrease of the percentage of insertions, all schemes increase the throughputs.

n In each workload, SmartCuckoo produces higher throughput than other three schemes.

30

Workload Insert Lookup Update

YCSB-1 100 0 0

YCSB-2 75 25 0

YCSB-3 50 50 0

YCSB-4 25 75 0

YCSB-5 0 95 5

Page 31: SmartCuckoo: A Fast and Cost-Efficient Hashing Index ...nEndless loops widely exist in the Cuckoo hashing structures. Ø More than 25% (cuckoo hashing with a stash) nLoop ratio: the

Conclusion and future workn Cuckoo hashing is cost-efficient to offer O(1) query

performance.n We address the problem of potential endless loops in item

insertion. n SmartCuckoo helps improve predictable performance in

storage systems.

n To-do-list:n SmartCuckoo in hash tables with more than two hash functions;n The use of multiple slots in each bucket.

31

Page 32: SmartCuckoo: A Fast and Cost-Efficient Hashing Index ...nEndless loops widely exist in the Cuckoo hashing structures. Ø More than 25% (cuckoo hashing with a stash) nLoop ratio: the

Thanks and questions?

Open-source code: https://github.com/syy804123097/SmartCuckoo

32