Parallel Sorting with Skiplists and Atomic Memory...


Category: Algorithms & Numerical Techniques
Poster: AN15
Contact name: Lars Nyland (lnyland@gmail.com)

Parallel Sorting with Skiplists and Atomic Memory Operations
Lars S. Nyland, NVIDIA

Skiplists
• Hierarchical linked lists for ordered data
• Probabilistically sized nodes
• O(log n) find time, walking high to low
• O(log n) insertion time
• Reliably balanced
• About 1.5x the cost in pointers (1 data, ~2 next)
• Concurrent operations proven correct

[Chart: probability of allocation (y-axis, 0 to 5/8) versus number of "next" pointers in a node (x-axis, 1 to 8).]
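One way to obtain probabilistically sized nodes is to flip fair coins until the first tail. The sketch below assumes that scheme (height k with probability 1/2^k), which is consistent with the "~2 next" average quoted above but is not spelled out on the poster. The names `kMaxHeight`, `random_height`, and `average_height` are hypothetical, and the cap of 8 simply mirrors the range of the chart.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Hypothetical cap on node height; the chart above covers heights 1..8.
constexpr int kMaxHeight = 8;

// Draw a node height: 1 with probability 1/2, 2 with probability 1/4, and so on
// (assumed geometric distribution, truncated at kMaxHeight).
inline int random_height(std::mt19937& rng) {
    int height = 1;
    // Each low bit of a random word acts as one fair coin flip.
    std::uint32_t bits = static_cast<std::uint32_t>(rng());
    while (height < kMaxHeight && (bits & 1u)) {
        ++height;      // "heads": the node gets one more "next" pointer
        bits >>= 1;    // move on to the next coin flip
    }
    assert(height >= 1 && height <= kMaxHeight);
    return height;
}

// Average height over `samples` draws; close to 2 for large sample counts.
inline double average_height(std::mt19937& rng, int samples) {
    long total = 0;
    for (int i = 0; i < samples; ++i) total += random_height(rng);
    return static_cast<double>(total) / samples;
}
```

With this distribution a node carries one data word and about two "next" pointers on average, which matches the pointer cost listed above.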

Atomic Memory Ops
Compare-and-swap (CAS) is an atomic memory operation used to manipulate pointers concurrently. A CAS operation takes 3 inputs: an address A, a comparison value C, and a replacement value V. It compares the value in memory at location A (mem[A]) to C, and if they are equal, it stores V in mem[A]. In either case it returns what was originally in mem[A]. Other accesses to mem[A] are blocked while the CAS executes. For linked structures like skiplists, CAS is used to “swing pointers” to insert nodes, so that the list is never left in a corrupt state. The figure below shows two threads trying to insert a node at the same location, requiring updates to the same pointer. Each thread uses CAS to update the “next” pointer (orange arrows). Only one will succeed, while the other fails and retries.
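The CAS semantics described above can be sketched on the CPU with `std::atomic`. This is an illustration only, not the poster's GPU code: the helper name `cas` and the demo function `cas_demo` are hypothetical, and `compare_exchange_strong` stands in for whatever hardware CAS instruction is actually used.

```cpp
#include <atomic>
#include <cassert>

// CAS as described in the text: if mem == compare, store value into mem.
// In both cases, return what mem held before the operation.
template <typename T>
T cas(std::atomic<T>& mem, T compare, T value) {
    // On failure, compare_exchange_strong overwrites `compare` with the value it
    // observed; on success it leaves `compare` untouched. Either way `compare`
    // ends up holding the original contents of mem.
    mem.compare_exchange_strong(compare, value);
    return compare;
}

// Two competing updates of the same word: only the first sees the expected value.
inline int cas_demo() {
    std::atomic<int> word{10};
    int first = cas(word, 10, 20);   // succeeds: returns 10, word becomes 20
    int second = cas(word, 10, 30);  // fails: returns 20, word stays 20
    assert(first == 10 && second == 20);
    return word.load();              // 20
}
```

A caller detects success by checking whether the returned value equals the comparison value it passed in.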

Parallel Skiplist Insertion Sort
N items are inserted into a skiplist using P threads (N/P items each), by these steps:
1. Allocate a new node with k “next” pointers for the next value.
2. Find the insertion point by chasing the pointers from high to low, staying at one level until the value is exceeded, then stepping down a level, and repeating.
3. From low to high, set the next pointers in the new node, and then swing the previous pointer to the new node using CAS.
Because pointers are set from low to high, the skiplist is always valid, allowing other threads to chase and update it concurrently. Figure 1 shows two insertions colliding at the second level (labeled level 1 in the figure) after their bottom-level pointers (level 0) have been set successfully.
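The three steps above can be sketched in C++ with `std::atomic` pointers. This is a minimal CPU illustration under stated assumptions, not the poster's GPU implementation: the names `Node`, `SkipList`, `find`, `insert`, `sorted`, `failed_cas`, and the cap `kLevels` are all hypothetical, nodes are never removed, and the caller supplies each node's height.

```cpp
#include <atomic>
#include <cassert>
#include <climits>
#include <cstddef>
#include <vector>

constexpr int kLevels = 8;  // hypothetical cap on "next" pointers per node

struct Node {
    int value;
    int height;                          // number of "next" pointers in use
    std::atomic<Node*> next[kLevels];
    Node(int v, int h) : value(v), height(h) {
        for (auto& n : next) n.store(nullptr, std::memory_order_relaxed);
    }
};

struct SkipList {
    Node head{INT_MIN, kLevels};         // sentinel smaller than every key
    std::atomic<long> failed_cas{0};     // counts CAS attempts that lost a race

    ~SkipList() {                        // free nodes by walking the bottom level
        Node* n = head.next[0].load();
        while (n) { Node* nx = n->next[0].load(); delete n; n = nx; }
    }

    // Step 2: walk from the highest level down. At each level, record the last
    // node whose value is below `value` and the node that follows it.
    void find(int value, Node* preds[], Node* succs[]) {
        Node* pred = &head;
        for (int level = kLevels - 1; level >= 0; --level) {
            Node* curr = pred->next[level].load(std::memory_order_acquire);
            while (curr && curr->value < value) {
                pred = curr;
                curr = pred->next[level].load(std::memory_order_acquire);
            }
            preds[level] = pred;
            succs[level] = curr;
        }
    }

    // Steps 1 and 3: allocate the node, then link it from the lowest level up.
    void insert(int value, int height) {
        assert(height >= 1 && height <= kLevels);
        Node* node = new Node(value, height);
        Node* preds[kLevels];
        Node* succs[kLevels];
        find(value, preds, succs);
        for (int level = 0; level < height; ++level) {
            for (;;) {
                // Point the new node at its successor before publishing it.
                node->next[level].store(succs[level], std::memory_order_relaxed);
                Node* expected = succs[level];
                // Swing the predecessor's pointer to the new node with CAS.
                if (preds[level]->next[level].compare_exchange_strong(
                        expected, node,
                        std::memory_order_release, std::memory_order_relaxed))
                    break;
                // Another insertion changed this pointer first: re-find and retry.
                failed_cas.fetch_add(1, std::memory_order_relaxed);
                find(value, preds, succs);
            }
        }
    }

    // The bottom level visits every value in sorted order.
    std::vector<int> sorted() const {
        std::vector<int> out;
        for (Node* n = head.next[0].load(); n; n = n->next[0].load())
            out.push_back(n->value);
        return out;
    }
};
```

In a parallel run, each of the P workers would call `insert` on its own share of the N values; the sketch keeps the driver out so that only the insertion logic is shown. Linking the bottom level first means a node is reachable through level 0 as soon as its first CAS succeeds, which is why a partially linked node never invalidates the list.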

In total, there are O(N) successful CAS operations and O(N log N) pointers chased. The open question is how many CAS operations fail.
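The O(N) bound on successful CAS operations follows from the average node height. The short derivation below assumes that a node has k "next" pointers with probability 2^{-k}, which agrees with the "~2 next" figure quoted for skiplists but is an assumption rather than something the poster states. The symbols k_i (height of node i) and c (a constant) are introduced here for illustration.

```latex
% Expected number of "next" pointers per node (assumed geometric heights):
\mathbb{E}[k] \;=\; \sum_{k=1}^{\infty} k \, 2^{-k} \;=\; 2

% One successful CAS links each level of each node:
\#\,\text{successful CAS} \;=\; \sum_{i=1}^{N} k_i \;\approx\; 2N \;=\; O(N)

% Each insertion chases O(log N) pointers to find its position:
\#\,\text{pointers chased} \;\le\; c \, N \log N \;=\; O(N \log N)
```

Failed CAS operations are not covered by these counts, since each failure adds another search and another attempt.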

Concurrency, Collisions & Communication
Thousands of concurrent threads attempt to insert values into the skiplist, retrying whenever a CAS fails. At the start, when the list is short, there are many failures, but the skiplist keeps doubling in size until its length exceeds the number of concurrent threads, after which collisions become rare. Parallel skiplist insertion sort is an example of a lock-free parallel algorithm, since at least one thread makes progress at all times. The number of failed CAS operations is shown in Figure 2, indicating that far more than one thread succeeds on every insertion attempt.

[Figure: Thread j and Thread k each try to insert a new node (value, next) at the same location in a list of (value, next) nodes.]

Conclusions
1. Parallel skiplist insertion sort is work-efficient.
2. Performance is dominated by O(N log N) loads, not O(N) atomic-CAS operations.
3. Performance is limited by memory address divergence.
4. Skiplist traversal is accelerated by L2 hits.
5. CAS insertion failures drop dramatically as the number of items (N) exceeds the number of parallel threads (P).

[Chart: Sorting Time. x-axis: N, the number of elements to sort (0 to 60,000,000); y-axis: time to sort in seconds (0 to 4.5); series: GTX580, GTX680, K20c.]

Figure 3. Performance of Skiplist-insertion Sort. The time needed to insert N elements is shown. The scaling is O(N log N), as expected. We ran the same problems on 3 different GPUs (Fermi GF100, Kepler GK104, Kepler GK110) and found the performance differences surprisingly small.

References
Many thanks to William Pugh for the invention of skiplists. Skiplists and all the topics discussed in this poster are described in detail on Wikipedia. The topic areas are skiplists, atomic memory operations, comparison sorting, lock-free and wait-free parallel algorithms, along with a complete description of NVIDIA GPUs.

Figure 1. Nodes X and Y are being inserted. Their level-0 pointers have been set successfully, and now both X and Y are trying to swing A1, shown with the dotted lines, using CAS. One succeeds, say the CAS of Y1, so the level-1 pointers are now A1-Y1-C1. The thread inserting X sees its CAS fail and searches again for its level-1 insertion point, which now lies between A1 and Y1. It then retries the CAS to make A1 point to X1, with X1 in turn pointing to Y1.

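[Figure 1 diagram: nodes A, X, B, Y, C. Nodes A, X, Y and C have level-0 and level-1 pointers; node B has only a level-0 pointer.]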

[Chart: Failed Insertion Rate. x-axis: number of elements sorted (N), 32768 to 33554432; y-axis: failing insertions, by percent (ticks at 0.1%, 1.0%, 10.0%, 100.0%); series: GTX580, GTX680, K20c overhead.]

Figure 2. Overhead from failed CAS operations. When N is small (near P), CAS operations fail nearly as often as not, leading to retries. As N grows, the percentage of operations that fail drops to near 0%. The chart shows the failure rate when 25,000 concurrent threads are running.

[Figure: excerpt of a printed skiplist, rows 507 to 715; the per-row values and pointer columns did not survive extraction.]
