84
Vectorized Execution @Andy_Pavlo // 15-721 // Spring 2019 ADVANCED DATABASE SYSTEMS Lecture #20

ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

Vectorized Execution

@Andy_Pavlo // 15-721 // Spring 2019

ADVANCEDDATABASE SYSTEMS

Le

ctu

re #

20

Page 2: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

Background

Hardware

Vectorized Algorithms (Columbia)

2

Page 3: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

VECTORIZATION

The process of converting an algorithm's scalar implementation that processes a single pair of operands at a time, to a vector implementation that processes one operation on multiple pairs of operands at once.

3

Page 4: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

WHY THIS MAT TERS

Say we can parallelize our algorithm over 32 cores.

Each core has a 4-wide SIMD registers.

Potential Speed-up: 32x × 4x = 128x

4

Page 5: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

MULTI-CORE CPUS

Use a small number of high-powered cores. → Intel Xeon Skylake / Kaby Lake→ High power consumption and area per core.

Massively superscalar and aggressive out-of-order execution→ Instructions are issued from a sequential stream.→ Check for dependencies between instructions.→ Process multiple instructions per clock cycle.

5

Page 6: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

MANY INTEGRATED CORES (MIC)

Use a larger number of low-powered cores.→ Intel Xeon Phi→ Low power consumption and area per core.→ Expanded SIMD instructions with larger register sizes.

Knights Ferry (Columbia Paper)→ Non-superscalar and in-order execution→ Cores = Intel P54C (aka Pentium from the 1990s).

Knights Landing (Since 2016)→ Superscalar and out-of-order execution.→ Cores = Silvermont (aka Atom)

6

Page 7: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

MANY INTEGRATED CORES (MIC)

Use a larger number of low-powered cores.→ Intel Xeon Phi→ Low power consumption and area per core.→ Expanded SIMD instructions with larger register sizes.

Knights Ferry (Columbia Paper)→ Non-superscalar and in-order execution→ Cores = Intel P54C (aka Pentium from the 1990s).

Knights Landing (Since 2016)→ Superscalar and out-of-order execution.→ Cores = Silvermont (aka Atom)

6

Page 8: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

S INGLE INSTRUCTION, MULTIPLE DATA

A class of CPU instructions that allow the processor to perform the same operation on multiple data points simultaneously.

All major ISAs have microarchitecture support SIMD operations.→ x86: MMX, SSE, SSE2, SSE3, SSE4, AVX, AVX2,

AVX512→ PowerPC: Altivec→ ARM: NEON

8

Page 9: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

Z

SIMD EXAMPLE

9

X + Y = Z 87654321

X

SISD

+for (i=0; i<n; i++) {Z[i] = X[i] + Y[i];

}

911111111

Y

x1

x2

⋮xn

y1

y2

⋮yn

x1+y1

x2+y2

⋮xn+yn

+ =

Page 10: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

Z

SIMD EXAMPLE

9

X + Y = Z 87654321

X

SISD

+for (i=0; i<n; i++) {Z[i] = X[i] + Y[i];

}

9 8 7 6 5 4 3 211111111

Y

x1

x2

⋮xn

y1

y2

⋮yn

x1+y1

x2+y2

⋮xn+yn

+ =

Page 11: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

Z

SIMD EXAMPLE

9

X + Y = Z 87654321

X

for (i=0; i<n; i++) {Z[i] = X[i] + Y[i];

}

11111111

Y

SIMD

+

8 7 6 5

1 1 1 1

128-bit SIMD Register

128-bit SIMD Register

x1

x2

⋮xn

y1

y2

⋮yn

x1+y1

x2+y2

⋮xn+yn

+ =

Page 12: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

Z

SIMD EXAMPLE

9

X + Y = Z 87654321

X

for (i=0; i<n; i++) {Z[i] = X[i] + Y[i];

}

9 8 7 611111111

Y

SIMD

+

8 7 6 5

1 1 1 1

128-bit SIMD Register

128-bit SIMD Register128-bit SIMD Register

x1

x2

⋮xn

y1

y2

⋮yn

x1+y1

x2+y2

⋮xn+yn

+ =

Page 13: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

Z

SIMD EXAMPLE

9

X + Y = Z 87654321

X

for (i=0; i<n; i++) {Z[i] = X[i] + Y[i];

}

9 8 7 6 5 4 3 211111111

Y

SIMD

+4 3 2 1

1 1 1 1

x1

x2

⋮xn

y1

y2

⋮yn

x1+y1

x2+y2

⋮xn+yn

+ =

Page 14: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

STREAMING SIMD EXTENSIONS (SSE)

SSE is a collection SIMD instructions that target special 128-bit SIMD registers.

These registers can be packed with four 32-bit scalars after which an operation can be performed on each of the four elements simultaneously.

First introduced by Intel in 1999.

10

Page 15: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

SIMD INSTRUCTIONS (1 )

Data Movement→ Moving data in and out of vector registers

Arithmetic Operations→ Apply operation on multiple data items (e.g., 2 doubles, 4

floats, 16 bytes)→ Example: ADD, SUB, MUL, DIV, SQRT, MAX, MIN

Logical Instructions→ Logical operations on multiple data items→ Example: AND, OR, XOR, ANDN, ANDPS, ANDNPS

11

Page 16: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

SIMD INSTRUCTIONS (2)

Comparison Instructions→ Comparing multiple data items (==,<,<=,>,>=,!=)

Shuffle instructions→ Move data in between SIMD registers

Miscellaneous→ Conversion: Transform data between x86 and SIMD

registers.→ Cache Control: Move data directly from SIMD registers

to memory (bypassing CPU cache).

12

Page 17: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

INTEL SIMD EXTENSIONS

13

Width Integers Single-P Double-P

1997 MMX 64 bits ✔

1999 SSE 128 bits ✔ ✔(×4)

2001 SSE2 128 bits ✔ ✔ ✔(×2)

2004 SSE3 128 bits ✔ ✔ ✔

2006 SSSE 3 128 bits ✔ ✔ ✔

2006 SSE 4.1 128 bits ✔ ✔ ✔

2008 SSE 4.2 128 bits ✔ ✔ ✔

2011 AVX 256 bits ✔ ✔(×8) ✔(×4)

2013 AVX2 256 bits ✔ ✔ ✔

2017 AVX-512 512 bits ✔ ✔(×16) ✔(×8)

Source: James Reinders

Page 18: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

VECTORIZATION

Choice #1: Automatic Vectorization

Choice #2: Compiler Hints

Choice #3: Explicit Vectorization

14

Source: James Reinders

Page 19: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

VECTORIZATION

Choice #1: Automatic Vectorization

Choice #2: Compiler Hints

Choice #3: Explicit Vectorization

14

Source: James Reinders

Ease of Use

ProgrammerControl

Page 20: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

AUTOMATIC VECTORIZATION

The compiler can identify when instructions inside of a loop can be rewritten as a vectorizedoperation.

Works for simple loops only and is rare in database operators. Requires hardware support for SIMD instructions.

15

Page 21: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

AUTOMATIC VECTORIZATION

This loop is not legal to automatically vectorize.

16

void add(int *X,int *Y,int *Z) {

for (int i=0; i<MAX; i++) {Z[i] = X[i] + Y[i];

}}

Page 22: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

AUTOMATIC VECTORIZATION

This loop is not legal to automatically vectorize.

16

These might point to the same address!

void add(int *X,int *Y,int *Z) {

for (int i=0; i<MAX; i++) {Z[i] = X[i] + Y[i];

}}

*Z=*X+1

Page 23: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

AUTOMATIC VECTORIZATION

This loop is not legal to automatically vectorize.

The code is written such that the addition is described as being done sequentially.

16

These might point to the same address!

void add(int *X,int *Y,int *Z) {

for (int i=0; i<MAX; i++) {Z[i] = X[i] + Y[i];

}}

*Z=*X+1

Page 24: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

COMPILER HINTS

Provide the compiler with additional information about the code to let it know that is safe to vectorize.

Two approaches:→ Give explicit information about memory locations.→ Tell the compiler to ignore vector dependencies.

17

Page 25: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

COMPILER HINTS

The restrict keyword in C++ tells the compiler that the arrays are distinct locations in memory.

18

void add(int *restrict X,int *restrict Y,int *restrict Z) {

for (int i=0; i<MAX; i++) {Z[i] = X[i] + Y[i];

}}

Page 26: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

COMPILER HINTS

This pragma tells the compiler to ignore loop dependencies for the vectors.

It’s up to you make sure that this is correct.

19

void add(int *X,int *Y,int *Z) {

#pragma ivdepfor (int i=0; i<MAX; i++) {Z[i] = X[i] + Y[i];

}}

Page 27: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

EXPLICIT VECTORIZATION

Use CPU intrinsics to manually marshal data between SIMD registers and execute vectorized instructions.

Potentially not portable.

20

Page 28: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

EXPLICIT VECTORIZATION

Store the vectors in 128-bit SIMD registers.

Then invoke the intrinsic to add together the vectors and write them to the output location.

21

void add(int *X,int *Y,int *Z) {

__mm128i *vecX = (__m128i*)X;__mm128i *vecY = (__m128i*)Y;__mm128i *vecZ = (__m128i*)Z;for (int i=0; i<MAX/4; i++) {_mm_store_si128(vecZ++,⮱_mm_add_epi32(*vecX++,

⮱*vecY++));}

}

Page 29: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

VECTORIZATION DIRECTION

Approach #1: Horizontal→ Perform operation on all elements together

within a single vector.

Approach #2: Vertical→ Perform operation in an elementwise

manner on elements of each vector.

22

Source:

0 1 2 3

SIMD Add 6

0 1 2 3

SIMD Add

1 1 1 1

1 2 3 4

Page 30: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

EXPLICIT VECTORIZATION

Linear Access Operators→ Predicate evaluation→ Compression

Ad-hoc Vectorization→ Sorting→ Merging

Composable Operations→ Multi-way trees→ Bucketized hash tables

23

Source: Orestis Polychroniou

Page 31: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

VECTORIZED DBMS ALGORITHMS

Principles for efficient vectorization by using fundamental vector operations to construct more advanced functionality.→ Favor vertical vectorization by processing different input

data per lane.→ Maximize lane utilization by executing different things

per lane subset.

24

RETHINKING SIMD VECTORIZATION FOR IN-MEMORY DATABASESSIGMOD 2015

Page 32: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

FUNDAMENTAL OPERATIONS

Selective Load

Selective Store

Selective Gather

Selective Scatter

25

Page 33: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

FUNDAMENTAL VECTOR OPERATIONS

26

Selective Load

A B C DVector

Memory

0 1 0 1Mask

U V W X Y Z • • •

Page 34: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

FUNDAMENTAL VECTOR OPERATIONS

26

Selective Load

A B C DVector

Memory

0 1 0 1Mask

U V W X Y Z • • •

Page 35: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

FUNDAMENTAL VECTOR OPERATIONS

26

Selective Load

A B C DVector

Memory

0 1 0 1Mask

U V W X Y Z • • •

U

Page 36: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

FUNDAMENTAL VECTOR OPERATIONS

26

Selective Load

A B C DVector

Memory

0 1 0 1Mask

U V W X Y Z • • •

U

Page 37: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

FUNDAMENTAL VECTOR OPERATIONS

26

Selective Load

A B C DVector

Memory

0 1 0 1Mask

U V W X Y Z • • •

U V

Page 38: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

FUNDAMENTAL VECTOR OPERATIONS

26

Selective Load Selective Store

A B C DVector

Memory

0 1 0 1Mask

U V W X Y Z • • •

U V

A B C DVector

U V W X Y Z • • •Memory

0 1 0 1Mask

Page 39: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

FUNDAMENTAL VECTOR OPERATIONS

26

Selective Load Selective Store

A B C DVector

Memory

0 1 0 1Mask

U V W X Y Z • • •

U V

A B C DVector

U V W X Y Z • • •Memory

0 1 0 1Mask

Page 40: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

FUNDAMENTAL VECTOR OPERATIONS

26

Selective Load Selective Store

A B C DVector

Memory

0 1 0 1Mask

U V W X Y Z • • •

U V

A B C DVector

U V W X Y Z • • •Memory

0 1 0 1Mask

B

Page 41: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

FUNDAMENTAL VECTOR OPERATIONS

26

Selective Load Selective Store

A B C DVector

Memory

0 1 0 1Mask

U V W X Y Z • • •

U V

A B C DVector

U V W X Y Z • • •Memory

0 1 0 1Mask

B

Page 42: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

FUNDAMENTAL VECTOR OPERATIONS

26

Selective Load Selective Store

A B C DVector

Memory

0 1 0 1Mask

U V W X Y Z • • •

U V

A B C DVector

U V W X Y Z • • •Memory

0 1 0 1Mask

B D

Page 43: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

Selective Gather

FUNDAMENTAL VECTOR OPERATIONS

27

A B DValue Vector

Memory

2 1 5 3Index Vector

U V W X Y Z • • •

CA

0 21 3 54

Page 44: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

Selective Gather

FUNDAMENTAL VECTOR OPERATIONS

27

A B DValue Vector

Memory

2 1 5 3Index Vector

U V W X Y Z • • •

CAW

0 21 3 54

Page 45: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

Selective Gather

FUNDAMENTAL VECTOR OPERATIONS

27

A B DValue Vector

Memory

2 1 5 3Index Vector

U V W X Y Z • • •

CAW V XZ

0 21 3 54

Page 46: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

Selective Gather

FUNDAMENTAL VECTOR OPERATIONS

27

Selective Scatter

A B DValue Vector

Memory

2 1 5 3Index Vector

U V W X Y Z • • • A B C DValue Vector

U V W X Y Z • • •Memory

2 1 5 3Index Vector

CAW V XZ

0 21 3 54

0 21 3 54

Page 47: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

Selective Gather

FUNDAMENTAL VECTOR OPERATIONS

27

Selective Scatter

A B DValue Vector

Memory

2 1 5 3Index Vector

U V W X Y Z • • • A B C DValue Vector

U V W X Y Z • • •Memory

2 1 5 3Index Vector

CAW V XZ B A CD

0 21 3 54

0 21 3 54

Page 48: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

ISSUES

Gathers and scatters are not really executed in parallel because the L1 cache only allows one or two distinct accesses per cycle.

Gathers are only supported in newer CPUs.

Selective loads and stores are also implemented in Xeon CPUs using vector permutations.

28

Page 49: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

VECTORIZED OPERATORS

Selection Scans

Hash Tables

Partitioning

Paper provides additional info:→ Joins, Sorting, Bloom filters.

29

RETHINKING SIMD VECTORIZATION FOR IN-MEMORY DATABASESSIGMOD 2015

Page 50: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

SELECTION SCANS

30

Scalar (Branching)

i = 0for t in table:key = t.keyif (key≥low) && (key≤high):

copy(t, output[i])i = i + 1

Scalar (Branchless)

i = 0for t in table:copy(t, output[i])key = t.keym = (key≥low ? 1 : 0) &&⮱(key≤high ? 1 : 0)

i = i + m

Source: Bogdan Raducanu

Page 51: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

SELECTION SCANS

30

Scalar (Branching)

i = 0for t in table:key = t.keyif (key≥low) && (key≤high):

copy(t, output[i])i = i + 1

Scalar (Branchless)

i = 0for t in table:copy(t, output[i])key = t.keym = (key≥low ? 1 : 0) &&⮱(key≤high ? 1 : 0)

i = i + m

Source: Bogdan Raducanu

Page 52: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

SELECTION SCANS

31

Source: Bogdan Raducanu

Page 53: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

SELECTION SCANS

32

Vectorized

i = 0for vt in table:simdLoad(vt.key, vk)vm = (vk≥low ? 1 : 0) &&↪(vk≤high ? 1 : 0)

simdStore(vt, vm, output[i])i = i + |vm≠false|

Page 54: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

SELECTION SCANS

32

Vectorized

i = 0for vt in table:simdLoad(vt.key, vk)vm = (vk≥low ? 1 : 0) &&↪(vk≤high ? 1 : 0)

simdStore(vt, vm, output[i])i = i + |vm≠false|

ID1

KEYJ

2 O3 Y4 S5 U6 X

SELECT * FROM tableWHERE key >= "O" AND key <= "U"

Page 55: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

SELECTION SCANS

32

Vectorized

i = 0for vt in table:simdLoad(vt.key, vk)vm = (vk≥low ? 1 : 0) &&↪(vk≤high ? 1 : 0)

simdStore(vt, vm, output[i])i = i + |vm≠false|

J O Y S U XKey VectorID1

KEYJ

2 O3 Y4 S5 U6 X

SELECT * FROM tableWHERE key >= "O" AND key <= "U"

Page 56: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

SELECTION SCANS

32

Vectorized

i = 0for vt in table:simdLoad(vt.key, vk)vm = (vk≥low ? 1 : 0) &&↪(vk≤high ? 1 : 0)

simdStore(vt, vm, output[i])i = i + |vm≠false|

J O Y S U XKey VectorID1

KEYJ

2 O3 Y4 S5 U6 X

Mask 0 1 0 1 1 0

SIMD Compare

SELECT * FROM tableWHERE key >= "O" AND key <= "U"

Page 57: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

SELECTION SCANS

32

Vectorized

i = 0for vt in table:simdLoad(vt.key, vk)vm = (vk≥low ? 1 : 0) &&↪(vk≤high ? 1 : 0)

simdStore(vt, vm, output[i])i = i + |vm≠false|

J O Y S U XKey VectorID1

KEYJ

2 O3 Y4 S5 U6 X

Mask 0 1 0 1 1 0

SIMD Compare

0 1 2 3 4 5All Offsets

SELECT * FROM tableWHERE key >= "O" AND key <= "U"

Page 58: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

SELECTION SCANS

32

Vectorized

i = 0for vt in table:simdLoad(vt.key, vk)vm = (vk≥low ? 1 : 0) &&↪(vk≤high ? 1 : 0)

simdStore(vt, vm, output[i])i = i + |vm≠false|

J O Y S U XKey VectorID1

KEYJ

2 O3 Y4 S5 U6 X

Mask 0 1 0 1 1 0

SIMD Compare

0 1 2 3 4 5All Offsets

SIMD Store

1 3 4Matched OffsetsSELECT * FROM tableWHERE key >= "O" AND key <= "U"

Page 59: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

SELECTION SCANS

32

Vectorized

i = 0for vt in table:simdLoad(vt.key, vk)vm = (vk≥low ? 1 : 0) &&↪(vk≤high ? 1 : 0)

simdStore(vt, vm, output[i])i = i + |vm≠false|

J O Y S U XKey VectorID1

KEYJ

2 O3 Y4 S5 U6 X

Mask 0 1 0 1 1 0

SIMD Compare

0 1 2 3 4 5All Offsets

SIMD Store

1 3 4Matched OffsetsSELECT * FROM tableWHERE key >= "O" AND key <= "U"

Page 60: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

SELECTION SCANS

33

0

16

32

48

0 1 2 5 10 20 50 100

Thr

ough

put

(bil

lion

tupl

es /

sec

)

Selectivity (%)

Scalar (Branching)

Scalar (Branchless)

Vectorized (Early Mat)

Vectorized (Late Mat)

0.0

2.0

4.0

6.0

0 1 2 5 10 20 50 100

Thr

ough

put

(bil

lion

tupl

es /

sec

)Selectivity (%)

MIC (Xeon Phi 7120P – 61 Cores + 4×HT) Multi-Core (Xeon E3-1275v3 – 4 Cores + 2×HT)

5.7 5.6 5.35.7 4.9 4.3 2.8 1.3

1.7 1.7 1.71.81.6 1.4 1.5

1.2

Page 61: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

SELECTION SCANS

33

0

16

32

48

0 1 2 5 10 20 50 100

Thr

ough

put

(bil

lion

tupl

es /

sec

)

Selectivity (%)

Scalar (Branching)

Scalar (Branchless)

Vectorized (Early Mat)

Vectorized (Late Mat)

0.0

2.0

4.0

6.0

0 1 2 5 10 20 50 100

Thr

ough

put

(bil

lion

tupl

es /

sec

)Selectivity (%)

MIC (Xeon Phi 7120P – 61 Cores + 4×HT) Multi-Core (Xeon E3-1275v3 – 4 Cores + 2×HT)

MemoryBandwidth

MemoryBandwidth

Page 62: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

PAYLOADKEY

Linear Probing Hash Table

HASH TABLES PROBING

34

Scalar

k1

Input Key

h1

Hash Index

#hash(key)

Page 63: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

PAYLOADKEY

Linear Probing Hash Table

HASH TABLES PROBING

34

Scalar

k1

Input Key

h1

Hash Index

#hash(key)

k1 k9=

Page 64: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

PAYLOADKEY

Linear Probing Hash Table

HASH TABLES PROBING

34

Scalar

k1

Input Key

h1

Hash Index

#hash(key)

k1

k9=

k3=

k8=

k1=

Page 65: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

HASH TABLES PROBING

34

Scalar

k1

Input Key

h1

Hash Index

#hash(key)

Vectorized (Horizontal)

KEYS PAYLOAD

Linear Probing Bucketized Hash Table

k1

Input Key

h1

Hash Index

#hash(key)

Page 66: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

HASH TABLES PROBING

34

Scalar

k1

Input Key

h1

Hash Index

#hash(key)

Vectorized (Horizontal)

KEYS PAYLOAD

Linear Probing Bucketized Hash Table

k1

Input Key

h1

Hash Index

#hash(key)

k9= k3 k8 k1k1

Page 67: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

HASH TABLES PROBING

34

Scalar

k1

Input Key

h1

Hash Index

#hash(key)

Vectorized (Horizontal)

KEYS PAYLOAD

Linear Probing Bucketized Hash Table

k1

Input Key

h1

Hash Index

#hash(key)

k9= k3 k8 k1k1

0 0 0 1Matched Mask

SIMD Compare

Page 68: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

PAYLOAD

k99

k1

k6

k4

KEY

k5

k88

Linear Probing Hash Table

HASH TABLES PROBING

35

Vectorized (Vertical)Input Key

Vector

k1

k2

k3

k4

Page 69: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

PAYLOAD

k99

k1

k6

k4

KEY

k5

k88

Linear Probing Hash Table

HASH TABLES PROBING

35

Vectorized (Vertical)Input Key

Vector hash(key)

#

#

#

#

Hash IndexVector

h1

h2

h3

h4

k1

k2

k3

k4

Page 70: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

PAYLOAD

k99

k1

k6

k4

KEY

k5

k88

Linear Probing Hash Table

HASH TABLES PROBING

35

Vectorized (Vertical)Input Key

Vector hash(key)

#

#

#

#

Hash IndexVector

h1

h2

h3

h4

k1

k2

k3

k4

k1

k99

k88

k4

====

SIMD Gather

k1

k2

k3

k4

Page 71: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

PAYLOAD

k99

k1

k6

k4

KEY

k5

k88

Linear Probing Hash Table

HASH TABLES PROBING

35

Vectorized (Vertical)Input Key

Vector hash(key)

#

#

#

#

Hash IndexVector

h1

h2

h3

h4

k1

k2

k3

k4

k1

k99

k88

k4

====

SIMD Compare

1

0

0

1

k1

k2

k3

k4

Page 72: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

PAYLOAD

k99

k1

k6

k4

KEY

k5

k88

Linear Probing Hash Table

HASH TABLES PROBING

35

Vectorized (Vertical)Input Key

Vector hash(key)

#

#

#

#

Hash IndexVector

h1

h2

h3

h4

k1

k2

k3

k4

k1

k99

k88

k4

====

SIMD Compare

1

0

0

1

k1

k2

k3

k4

Page 73: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

PAYLOAD

k99

k1

k6

k4

KEY

k5

k88

Linear Probing Hash Table

HASH TABLES PROBING

35

Vectorized (Vertical)Input Key

Vector hash(key)

#

#

#

#

Hash IndexVector

h1

h2

h3

h4

k1

k2

k3

k4

k1

k99

k88

k4

====

SIMD Compare

1

0

0

1

k1

k2

k3

k4

k5

k6

h5

h2+1

h3+1

h6

Page 74: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

PAYLOAD

k99

k1

k6

k4

KEY

k5

k88

Linear Probing Hash Table

HASH TABLES PROBING

35

Vectorized (Vertical)Input Key

Vector hash(key)

#

#

#

#

Hash IndexVector

h1

h2

h3

h4

k1

k2

k3

k4

k5

k6

h5

h2+1

h3+1

h6

Page 75: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

HASH TABLES PROBING

36

0

3

6

9

12

Thr

ough

put

(bil

lion

tupl

es /

sec

)

Hash Table Size

Scalar Vectorized (Horizontal) Vectorized (Vertical)

MIC (Xeon Phi 7120P – 61 Cores + 4×HT) Multi-Core (Xeon E3-1275v3 – 4 Cores + 2×HT)

0

0.5

1

1.5

2

Thr

ough

put

(bil

lion

tupl

es /

sec

)

Hash Table Size

2.3 2.2 2.12.41.1 0.9 0.7 0.6

1.1 1.10.9

1.2

0.8 0.8

0.30.2

Page 76: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

HASH TABLES PROBING

36

0

3

6

9

12

Thr

ough

put

(bil

lion

tupl

es /

sec

)

Hash Table Size

Scalar Vectorized (Horizontal) Vectorized (Vertical)

MIC (Xeon Phi 7120P – 61 Cores + 4×HT) Multi-Core (Xeon E3-1275v3 – 4 Cores + 2×HT)

0

0.5

1

1.5

2

Thr

ough

put

(bil

lion

tupl

es /

sec

)

Hash Table Size

Out of Cache

Out of Cache

Page 77: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

PARTITIONING HISTOGRAM

Use scatter and gathers to increment counts.

Replicate the histogram to handle collisions.

37

Page 78: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

PARTITIONING HISTOGRAM

Use scatter and gathers to increment counts.

Replicate the histogram to handle collisions.

37

k1

k2

k3

k4

Input KeyVector

h1

h2

h3

h4

Hash Index Vector

SIMD AddSIMD Radix

+1

+1

+1

Histogram

Page 79: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

PARTITIONING HISTOGRAM

Use scatter and gathers to increment counts.

Replicate the histogram to handle collisions.

37

k1

k2

k3

k4

Input KeyVector

h1

h2

h3

h4

Hash Index Vector

SIMD AddSIMD Radix

+1

+1

+1

Histogram

Missing Update

Page 80: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

PARTITIONING HISTOGRAM

Use scatter and gathers to increment counts.

Replicate the histogram to handle collisions.

37

k1

k2

k3

k4

Input KeyVector

h1

h2

h3

h4

Hash Index Vector Replicated Histogram

+1

+1

+1

+1

SIMD Add

# of Vector Lanes

SIMD Radix

+1

+2

+1

Histogram

SIMD Scatter

Page 81: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

JOINS

No Partitioning→ Build one shared hash table using atomics→ Partially vectorized

Min Partitioning→ Partition building table→ Build one hash table per thread→ Fully vectorized

Max Partitioning→ Partition both tables repeatedly→ Build and probe cache-resident hash tables→ Fully vectorized

38

Page 82: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

JOINS

39

0

0.5

1

1.5

2

Scalar Vector Scalar Vector Scalar VectorNo Partitioning Min Partitioning Max Partitioning

Join

Tim

e (s

ec)

Partition Build Probe Build+Probe

200M ⨝ 200M tuples (32-bit keys & payloads)Xeon Phi 7120P – 61 Cores + 4×HT

Page 83: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

PARTING THOUGHTS

Vectorization is essential for OLAP queries.

These algorithms don’t work when the data exceeds your CPU cache.

We can combine all the intra-query parallelism optimizations we’ve talked about in a DBMS.→ Multiple threads processing the same query.→ Each thread can execute a compiled plan.→ The compiled plan can invoke vectorized operations.

40

Page 84: ADVANCED 0 DATABASE SYSTEMS - CMU 15-721 · →Cores = Intel P54C (aka Pentium from the 1990s). Knights Landing (Since 2016) →Superscalar and out-of-order execution. →Cores =

CMU 15-721 (Spring 2019)

NEXT CL ASS

Compilation vs. Vectorization

41