Accelerating Aggregation using Intra-cycle Parallelism
Ziqiang Feng, Eric Lo
Department of Computing, The Hong Kong Polytechnic University
{cszqfeng, ericlo}@comp.polyu.edu.hk


Page 2: Accelerating aggregation using intra-cycle parallelism

Accelerating Aggregation using Intra-cycle Parallelism. Ziqiang Feng and Eric Lo. The Hong Kong Polytechnic University. ICDE’15. Seoul. 2

Background
• Analytic database
• Memory-resident
• Column store
• Compression

Column values compressed into short (int) codes

salary   Encoded salary   Binary rep.
1,000    1                0001
2,000    2                0010
3,000    3                0011
4,000    4                0100
5,000    5                0101
6,000    6                0110
7,000    7                0111
8,000    8                1000

64-bit wide processor words
• Process 64 bits of information per cycle

load: 1 0 0 0 into a 64-bit register; the remaining bits are wasted.

How to utilize?
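The encoding in the table above can be sketched in a few lines. This is only an illustration: the slides do not say how codes are assigned, so a sorted dictionary with codes starting at 1 is an assumption, and the names `build_dictionary` and `encode_column` are hypothetical.

```python
def build_dictionary(values):
    """Map each distinct value to a small integer code, starting at 1.

    Assumption: codes follow the sorted order of the distinct values.
    """
    distinct = sorted(set(values))
    return {value: code for code, value in enumerate(distinct, start=1)}


def encode_column(values, dictionary):
    """Replace every column value with its dictionary code."""
    return [dictionary[value] for value in values]


salary = [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000]
dictionary = build_dictionary(salary)
codes = encode_column(salary, dictionary)

# Number of bits needed to store the largest code.
code_width = max(codes).bit_length()

# Fixed-width binary representation of each code, as in the table.
binary = [format(code, "0{}b".format(code_width)) for code in codes]
```

With these eight salaries the codes need only 4 bits each, which is why loading a single code into a 64-bit register leaves most of the register unused.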

A Better Approach …
• HBP & VBP (Li and Patel, SIGMOD’13)
• Two bit-packed storage layouts
• Horizontal Bit Packing (HBP)
• Vertical Bit Packing (VBP)
• Fast filter scans (e.g., )

• An example of HBP:

Encoded salary: 1 2 3 4 5 6 7 8

Load 8 values into a 64-bit CPU register (8 bits each):

00000001 00000010 00000011 00000100 00000101 00000110 00000111 00001000
1        2        3        4        5        6        7        8

Not wasted.

One CPU instruction processes 8 values simultaneously: 8x intra-cycle parallelism.
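The picture above can be reproduced with a small packing routine. This is a minimal sketch that mirrors the slide (eight 8-bit slots, first value in the leftmost slot); it is not meant to capture every detail of the HBP layout of Li and Patel, and `pack_hbp_word` is a hypothetical name.

```python
SLOT_BITS = 8        # width of each slot, as drawn on the slide
SLOTS_PER_WORD = 8   # 64 bits / 8 bits per slot
MASK64 = (1 << 64) - 1


def pack_hbp_word(codes):
    """Pack eight codes into one 64-bit integer, first code in the top slot."""
    if len(codes) != SLOTS_PER_WORD:
        raise ValueError("expected exactly 8 codes")
    word = 0
    for code in codes:
        if not 0 <= code < (1 << SLOT_BITS):
            raise ValueError("code does not fit in an 8-bit slot")
        # Shift earlier codes left and place the new code in the low slot.
        word = (word << SLOT_BITS) | code
    return word & MASK64


word = pack_hbp_word([1, 2, 3, 4, 5, 6, 7, 8])
bits = format(word, "064b")  # the 64-bit pattern shown in the register
```

Python integers are unbounded, so the sketch masks to 64 bits explicitly to imitate a machine word.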

Aggregation (SUM, MIN, MAX, MEDIAN, AVG, COUNT) on HBP/VBP?
• Remains open
• Baseline (example: sum)

00000001 00000010 00000011 00000100 00000101 00000110 00000111 00001000
1        2        3        4        5        6        7        8

lookup: 00000001 (1)
lookup: 00000010 (2)
lookup: 00000011 (3)
lookup: … …
1 + 2 + 3 + … … = 00100100 (36)

Wasted (again).

A lookup is expensive
• Involves a sequence of
• Scan: cycle/value
• Lookup: cycles/value!
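The baseline can be sketched as follows, assuming a "lookup" means extracting one slot with a shift and a mask (the slides do not spell out the instruction sequence). The helper names are hypothetical.

```python
def extract_slot(word, index, slot_bits=8, slots=8):
    """Return the code in slot `index` (0 = leftmost) via shift and mask."""
    shift = (slots - 1 - index) * slot_bits
    return (word >> shift) & ((1 << slot_bits) - 1)


def baseline_sum(word):
    """Sum the eight codes by looking them up one at a time."""
    total = 0
    for index in range(8):
        # One lookup and one add per value: the additions operate on a
        # single small number, so most of the 64-bit register is unused.
        total += extract_slot(word, index)
    return total


word = 0x0102030405060708  # the packed values 1..8 from the slide
total = baseline_sum(word)
```

Every value costs its own extraction and its own addition, which is the per-value overhead the slide labels as wasted.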

Bit-parallel Approach (This paper):

00000001 00000010 00000011 00000100 00000101 00000110 00000111 00001000
1        2        3        4        5        6        7        8

∧, ∨, ≪, ≫, +, ×

00100100 (36)

• Each instruction works on 8 values simultaneously.
• Uses far fewer instructions.
• Expensive lookups are avoided.

Challenge

64-bit CPU register:
00000001 00000010 00000011 00000100 00000101 00000110 00000111 00001000
1        2        3        4        5        6        7        8

What CPU sees: a 64-bit integer
0000000100000010000000110000010000000101000001100000011100001000

• We want: 36 (sum)
• How to interpret/manipulate this (meaningless) huge number?

Contribution

Bit-parallel Aggregations for {Count, Sum, Avg, Min, Max, Median} × {HBP, VBP} are covered in our paper.

Example: Sum in HBP

64-bit CPU register:
00000001 00000010 00000011 00000100 00000101 00000110 00000111 00001000
1        2        3        4        5        6        7        8

(1). Right shift 8
00000000 00000001 00000010 00000011 00000100 00000101 00000110 00000111
0        1        2        3        4        5        6        7

(2). Add
00000001 00000011 00000101 00000111 00001001 00001011 00001101 00001111
1        1+2      2+3      3+4      4+5      5+6      6+7      7+8

(3). Mask
00000000 00000011 00000000 00000111 00000000 00001011 00000000 00001111
0        1+2      0        3+4      0        5+6      0        7+8

Cont’d

00000000 00000011 00000000 00000111 00000000 00001011 00000000 00001111
0        1+2      0        3+4      0        5+6      0        7+8

(4). Multiply
00000000 00000001 00000000 00000001 00000000 00000001 00000000 00000001
0        1        0        1        0        1        0        1

0000000000100100 0000000000100001 0000000000011010 0000000000001111
1+2+3+4+5+6+7+8  3+4+5+6+7+8      5+6+7+8          7+8

(5). Right shift 48
0000000000000000 0000000000000000 0000000000000000 0000000000100100
1+2+3+4+5+6+7+8 = 36
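The five steps can be written out directly. The sketch below follows the slide's example and assumes that the sum of any two adjacent codes fits in 8 bits (for instance, all codes below 128), so that step (2) never carries into a neighbouring slot. The constant names and `bit_parallel_sum` are hypothetical.

```python
MASK64 = (1 << 64) - 1                   # emulate a 64-bit register
LOW_BYTE_OF_EACH_PAIR = 0x00FF00FF00FF00FF
ONE_PER_16_BITS = 0x0001000100010001


def bit_parallel_sum(word):
    """Sum eight 8-bit codes packed in one 64-bit word in five operations."""
    shifted = word >> 8                               # (1) right shift 8
    pair_sums = (word + shifted) & MASK64             # (2) add
    masked = pair_sums & LOW_BYTE_OF_EACH_PAIR        # (3) mask
    # (4) multiply: each 16-bit field accumulates the fields below it,
    # so the top field ends up holding the sum of all four pair sums.
    accumulated = (masked * ONE_PER_16_BITS) & MASK64
    return accumulated >> 48                          # (5) right shift 48


total = bit_parallel_sum(0x0102030405060708)  # packed values 1..8
```

The multiplication works because multiplying by a word with a 1 in every 16-bit field adds shifted copies of the masked word, which is how four partial sums are combined with a single instruction.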

Summary: Sum on HBP
• Baseline:
• Our method: use only 5 instructions
• One 64-bit word contains 8 eight-bit values
• One instruction processes 8 values in parallel
• Advantage:
  • parallelism is achieved
  • # of instructions is low


Example: min on HBP
• Baseline

00000001 00000010 00000011 00000100 00000101 00000110 00000111 00001000
1        2        3        4        5        6        7        8

lookup: 00000001 (1)
lookup: 00000010 (2)
lookup: 00000011 (3)
lookup: … …
min

Cont’d
• Bit-parallel: consider 16 values …

64-bit CPU register (𝑣1 … 𝑣8):
00000001 00000010 00000011 00000100 00000101 00000110 00000111 00001000
1        2        3        4        5        6        7        8

64-bit CPU register (𝑣9 … 𝑣16):
00000011 00000011 00000011 00000011 00000011 00000011 00000011 00000011
3        3        3        3        3        3        3        3

Bit-parallel slot-wise min (1). ; (2). ; (3). ; … …
00000001 00000010 00000011 00000011 00000011 00000011 00000011 00000011
1        2        3        3        3        3        3        3

lookup lookup lookup … …
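The individual steps of the slot-wise min did not survive in this transcript, so the sketch below substitutes a standard word-parallel comparison trick. It should not be read as the authors' exact instruction sequence. It assumes every code is below 128, which leaves the top bit of each 8-bit slot free; `slotwise_min` and `column_min` are hypothetical names.

```python
MASK64 = (1 << 64) - 1
HIGH_BITS = 0x8080808080808080  # top bit of every 8-bit slot


def slotwise_min(x, y):
    """Slot-wise minimum of two words holding eight codes below 128."""
    # Setting the spare top bit of every slot in x keeps borrows from
    # crossing slot boundaries; afterwards a slot's top bit is 1 exactly
    # when x's code is greater than or equal to y's code.
    ge_flags = ((x | HIGH_BITS) - y) & HIGH_BITS
    # Widen each flag into a full-slot mask (0xFF where y should be taken).
    take_y = ((ge_flags >> 7) * 0xFF) & MASK64
    return (y & take_y) | (x & ~take_y & MASK64)


def column_min(words):
    """Fold all words into one with slot-wise min, then look up 8 slots."""
    running = words[0]
    for word in words[1:]:
        running = slotwise_min(running, word)
    # Only the final word needs per-slot lookups.
    return min((running >> shift) & 0xFF for shift in range(0, 64, 8))


first = 0x0102030405060708   # v1 .. v8 from the slide
second = 0x0303030303030303  # v9 .. v16 from the slide
merged = slotwise_min(first, second)
smallest = column_min([first, second])
```

The per-value lookups of the baseline are replaced by a few word-wide operations per 8 values, and lookups are needed only once at the end on the single surviving word.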

In a nutshell …
• Work on the storage layout (HBP/VBP) directly
• Intra-cycle parallelism is utilized in calculation
• Avoid expensive lookup operations
• # of instructions is low

TPC-H Result: HBP
• Aggregation (this paper) vs Aggregation (baseline): aggregation time reduced by 28.1%
• Whole-query (this paper) vs Whole-query (baseline): whole-query time reduced by 20.4%

TPC-H Result: VBP
• Aggregation (this paper) vs Aggregation (baseline): aggregation time reduced by 55.0%
• Whole-query (this paper) vs Whole-query (baseline): whole-query time reduced by 44.4%

Conclusion and Future Work
• Intra-cycle parallelism is important.
• We devised a suite of algorithms to compute aggregation very efficiently.
• This paper has been built around HBP and VBP, but …

ByteSlice: better than HBP and VBP. See our SIGMOD’15 paper: “ByteSlice: Pushing the Envelop of Main Memory Data Processing with a New Storage Layout”


Thank you.

Combining Scan and Aggregation

select SUM(b) from R where a > 88

The Non-bit-parallel Approach
(1). Scan Column a to get the filter result bit vector: 0 0 0 1 0 1 0 1
(2). Identify the 4th tuple
(3). Lookup in Column b
(4). Add to the running SUM
(5). Find next
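Steps (2) to (5) amount to a loop over the set bits of the filter result. The sketch below is an illustration only: the contents of column b are not given on this slide, so the values 1..8 are an assumption, and `filtered_sum_baseline` is a hypothetical name.

```python
def filtered_sum_baseline(filter_bits, column_b):
    """Sum column_b over the tuples whose filter bit is set, one at a time."""
    running_sum = 0
    for position, bit in enumerate(filter_bits):  # (2)/(5) find next match
        if bit:
            value = column_b[position]            # (3) lookup
            running_sum += value                  # (4) add
    return running_sum


filter_bits = [0, 0, 0, 1, 0, 1, 0, 1]  # filter result bit vector from the slide
column_b = [1, 2, 3, 4, 5, 6, 7, 8]     # assumed column values
result = filtered_sum_baseline(filter_bits, column_b)
```

Each qualifying tuple costs a separate lookup and add, so the work grows with the number of matches rather than with the number of words.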

Combining Scan and Aggregation

select SUM(b) from R where a > 88

The Bit-parallel Approach
(1). Scan Column a to get the filter result bit vector: 0 0 0 1 0 1 0 1

00000001 00000010 00000011 00000100 00000101 00000110 00000111 00001000
1        2        3        4        5        6        7        8

(2). Transform to a mask
00000000 00000000 00000000 11111111 00000000 11111111 00000000 11111111
0        0        0        0xFF     0        0xFF     0        0xFF

(3). Intersect
00000000 00000000 00000000 00000100 00000000 00000110 00000000 00001000
0        0        0        4        0        6        0        8

… …
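Steps (2) and (3) can be sketched with two word-wide operations. How the scan reports its result inside a word is not shown here, so the sketch assumes one flag bit per slot, stored in the slot's top bit; `flags_to_mask` and `select_values` are hypothetical names.

```python
MASK64 = (1 << 64) - 1


def flags_to_mask(flags):
    """(2) Widen one flag bit per slot (top bit of the slot) to 0xFF."""
    return ((flags >> 7) * 0xFF) & MASK64


def select_values(column_word, flags):
    """(3) Keep the codes of qualifying slots and zero out the others."""
    return column_word & flags_to_mask(flags)


column_word = 0x0102030405060708  # packed values 1..8 from the slide
flags = 0x0000008000800080        # assumed scan output: slots 4, 6, 8 qualify
mask = flags_to_mask(flags)
selected = select_values(column_word, flags)

# The zeroed slots contribute nothing to a subsequent sum.
selected_sum = sum((selected >> shift) & 0xFF for shift in range(0, 64, 8))
```

The last line uses per-slot extraction only to check the result; in a fully bit-parallel pipeline the masked word would feed the word-wide sum instead.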

Handling Join and Group-by
• Materialized join
• Column store + compression → low space overhead
• Benefit: complex (join) query → scan-then-aggregate
• WideTable (Li and Patel, VLDB’14)
• Sorted Projection
• Exploit the replica requirement
• Multiple (copies) projections in different sort orders
• Vertica

Multiple Attributes
• Materialize pseudo-column
• A few pseudo-columns can satisfy the whole workload

Effect of Query Selectivity
• Our solutions yield significant improvement when selectivity > 1%

Effect of Value Width
• Our solutions outperform the baseline under all value widths (# of bits)

TPC-H Attributes
• All attributes can be encoded in bits
• More than a half can be encoded in bits.

HBP or VBP?
• There’s a tradeoff, depending on data distribution/workload.
• HBP: slower scan + faster lookup
• VBP: faster scan + slower lookup
• Solution?
• ByteSlice: fast scan + fast lookup (see our SIGMOD’15 paper)