Nswap: A Reliable, Adaptable Network RAM System for General Purpose Clusters Tia Newhall, Daniel Amato, Alexandr Pshenichkin Computer Science Department Swarthmore College Swarthmore, PA USA [email protected]


Page 1: Nswap: A Reliable, Adaptable Network RAM System for General Purpose Clusters

Nswap: A Reliable, Adaptable Network RAM System for General Purpose Clusters

Tia Newhall, Daniel Amato, Alexandr Pshenichkin

Computer Science Department
Swarthmore College
Swarthmore, PA USA
[email protected]

Page 2: Nswap: A Reliable, Adaptable Network RAM System for General Purpose Clusters

Cluster08, Tia Newhall, 2008

Network RAM

Cluster nodes share each other’s idle RAM as a remote swap partition
• When one node’s RAM is overcommitted, swap its pages out over the network to store in idle RAM of other nodes
+ Avoid swapping to slower local disk
+ Almost always a significant amount of idle RAM, even when some nodes are overloaded

[Figure: Node A swaps a page out over the network into Node B’s RAM instead of to its local disk.]

Page 3: Nswap: A Reliable, Adaptable Network RAM System for General Purpose Clusters

Nswap Design Goals

Scalable
• No central authority
Adaptable
• Node’s RAM usage varies
• Don’t want remotely swapped page data to cause more swapping on a node
• Amount of RAM made available for storing remotely swapped page data needs to grow/shrink with local usage
Fault Tolerant
• A single node failure can lose pages from processes running on remote nodes
• One node’s failure can affect unrelated processes on other nodes

Page 4: Nswap: A Reliable, Adaptable Network RAM System for General Purpose Clusters

Nswap

Network swapping loadable kernel module (lkm) for Linux clusters
• Runs entirely in kernel space on unmodified Linux 2.6
Completely Decentralized
• Each node runs a multi-threaded client & server
• Client is active when node is swapping
• Uses local information to find a “good” Server when it swaps out
• Server is active when node has idle RAM available

[Figure: the Nswap Client on Node A swaps a page out to the Nswap Server on Node B, which stores it in its Nswap Cache. Client and Server run in kernel space on top of the Nswap Communication Layer and the network.]

Page 5: Nswap: A Reliable, Adaptable Network RAM System for General Purpose Clusters

How Pages Move Around the System

• Swap out: from client A to server B
• Swap in: from server B to client A (B is still the backing store)
• Migrate: server B shrinks its Nswap Cache and sends pages to server C

[Figure: three panels labeled SWAP OUT (Node A to Node B), SWAP IN (Node B to Node A), and MIGRATE (Node B to Node C).]

Page 6: Nswap: A Reliable, Adaptable Network RAM System for General Purpose Clusters

Adding Reliability

Requires extra time and space
• Minimize extra costs, particularly to nodes that are swapping
• Avoid reliability solutions that use disk; use cluster-wide idle RAM for reliability data
Has to work with Nswap’s:
1. Dynamic resizing of Nswap Cache
2. Varying Nswap Cache capacity at each node
3. Support for migrating remotely swapped page data between servers
=> Reliability solutions that require fixed placement of page and reliability data won’t work

Page 7: Nswap: A Reliable, Adaptable Network RAM System for General Purpose Clusters

Centralized Dynamic Parity

RAID 4 like
A single, dedicated parity server node
• In large clusters, nodes are divided into Parity Partitions; each partition has its own dedicated Parity Server
• Parity Server stores parity pages, keeps track of parity groups, implements page recovery
+ Client & server don’t need to know about parity groups

[Figure: Parity Partition 1 (Parity Server, Node 1, Node 2, ..., Node m-1) and Parity Partition 2 (Parity Server, Node m+1, ..., Node 2m-1).]

Page 8: Nswap: A Reliable, Adaptable Network RAM System for General Purpose Clusters

Centralized Dynamic Parity (cont.)

Like RAID 4
• Parity group pages striped across cluster idle RAM
• Parity pages all on single parity server
with some differences
• Parity group size and assignment is not fixed
• Pages can leave and enter a given parity group (garbage collection, migration, merging parity groups)

[Figure: data pages of Group 1 and Group 2 striped across Node 1, Node 2, Node 3, Node 4, ...; the parity page (P) of each group is stored on the Parity Server.]

Page 9: Nswap: A Reliable, Adaptable Network RAM System for General Purpose Clusters

Page Swap-out, case 1: new page swap

Parity Pool at Client:
• Client stores a set of in-progress parity pages
• As a page is swapped out, it is added to a page in the pool
• Minor computation overhead on client (XOR of 4K pages)
• As parity pages fill, they are sent to the Parity Server
• One extra page send to parity server every ~N swap-outs

[Figure: the Client sends SWAP OUT messages to several Servers and XORs each page into its Parity Pool; a full PARITY PAGE is sent to the Parity Server.]
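The parity pool described on this slide can be sketched as follows. This is a minimal Python illustration, not Nswap's kernel code: the names `ParityPool`, `add_page`, and `group_size`, and the simplification to a single in-progress parity page, are assumptions made for the example.

```python
PAGE_SIZE = 4096  # 4K pages, as on the slide


def xor_pages(a: bytes, b: bytes) -> bytes:
    """Return the bytewise XOR of two equal-sized pages."""
    return bytes(x ^ y for x, y in zip(a, b))


class ParityPool:
    """Hypothetical client-side pool holding one in-progress parity page."""

    def __init__(self, group_size: int):
        self.group_size = group_size     # N data pages per parity page (assumed fixed here)
        self.parity = bytes(PAGE_SIZE)   # an all-zero page is the XOR identity
        self.count = 0
        self.sent_to_parity_server = []  # stands in for the network send

    def add_page(self, page: bytes) -> None:
        """XOR a swapped-out page into the in-progress parity page."""
        self.parity = xor_pages(self.parity, page)
        self.count += 1
        if self.count == self.group_size:
            # The parity page is full: one extra send per N swap-outs.
            self.sent_to_parity_server.append(self.parity)
            self.parity = bytes(PAGE_SIZE)
            self.count = 0


pool = ParityPool(group_size=4)
pages = [bytes([i]) * PAGE_SIZE for i in range(1, 5)]
for p in pages:
    pool.add_page(p)
```

With a group size of 4, the four swap-outs above produce exactly one parity page, which is the source of the "one extra page send every ~N swap-outs" cost.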

Page 10: Nswap: A Reliable, Adaptable Network RAM System for General Purpose Clusters

Page Swap-out, case 2: overwrite

Server has old copy of swapped-out page:
• Client sends new page to server
• No extra overhead on client side vs. non-reliable Nswap
• Server computes the XOR of the old and new versions of the page and sends it to the Parity Server before overwriting the old version with the new

[Figure: Node A sends SWAP OUT to Node B; Node B XORs the old and new page and sends UPDATE_XOR to the Parity Server, which XORs it into the Parity Page.]
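The reason the XOR of the old and new versions is enough can be shown in a few lines: XORing that delta into the parity page cancels the old version and adds the new one. The sketch below is illustrative Python, not Nswap code; the variable names (`update_xor`, `other`, `stored`) are assumptions.

```python
PAGE_SIZE = 4096


def xor_pages(a: bytes, b: bytes) -> bytes:
    """Return the bytewise XOR of two equal-sized pages."""
    return bytes(x ^ y for x, y in zip(a, b))


old = bytes([0x0F]) * PAGE_SIZE    # version currently stored at the server
other = bytes([0x33]) * PAGE_SIZE  # another page in the same parity group
parity = xor_pages(old, other)     # parity page held by the Parity Server

new = bytes([0xF0]) * PAGE_SIZE    # client swaps out a new version of the page

# Server side: compute the delta before overwriting the old version.
update_xor = xor_pages(old, new)   # payload of the UPDATE_XOR message
stored = new

# Parity Server side: fold the delta into the existing parity page.
parity = xor_pages(parity, update_xor)
```

After the update, `parity` equals the XOR of `new` and `other`, exactly as if the group had been built from the new version, and the client never touches the parity page.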

Page 11: Nswap: A Reliable, Adaptable Network RAM System for General Purpose Clusters

Page Recovery

• Detecting node sends a RECOVERY message to Parity Server
• Page Recovery runs concurrently with cluster applications
• Parity Server rebuilds all pages that were stored at the crashed node
• As it recovers each page, it migrates it to a non-failed Nswap Server; the page may stay in the same parity group or be added to a new one
• The server receiving the recovered page tells the client of its new location

[Figure: the Parity Server XORs the Parity Page with the pages held by the servers in this parity group to produce the Recovered Page, sends MIGRATE RECOVERED to a New Server, and the New Server sends UPDATE to the lost page’s owner.]
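Rebuilding a lost page is the standard parity reconstruction: XOR the parity page with every surviving page in the group. The following Python sketch illustrates that step only; `recover_page` and the three-page group are hypothetical and do not model the RECOVERY, MIGRATE RECOVERED, or UPDATE messages.

```python
from functools import reduce
from typing import Sequence

PAGE_SIZE = 4096


def xor_pages(a: bytes, b: bytes) -> bytes:
    """Return the bytewise XOR of two equal-sized pages."""
    return bytes(x ^ y for x, y in zip(a, b))


def recover_page(parity: bytes, surviving: Sequence[bytes]) -> bytes:
    """Rebuild the one missing page of a parity group."""
    return reduce(xor_pages, surviving, parity)


# A parity group of three data pages stored on three different servers.
group = [bytes([v]) * PAGE_SIZE for v in (0x11, 0x22, 0x44)]
parity = reduce(xor_pages, group)

# The server holding group[1] crashes; the other two pages survive.
lost = group[1]
surviving = [group[0], group[2]]
recovered = recover_page(parity, surviving)
```

Each recovery reads every surviving page of the group, so the per-page cost grows with the parity group size.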

Page 12: Nswap: A Reliable, Adaptable Network RAM System for General Purpose Clusters

Decentralized Dynamic Parity

Like RAID 5:
• No dedicated parity server
• Data pages and parity pages striped across Nswap Servers
+ Not limited by Parity Server’s RAM capacity nor Parity Partitioning
- Every node is now Client, Server, Parity Server
Store with each data page its parity server & P-group ID
• For each page, need to know its parity server and to which group it belongs
• A page’s parity group ID and parity server can change due to migration or merging of two small parity groups
• First set by client on swap-out when parity logging
• Server can change when page is migrated or parity groups are merged
Client still uses parity pool
• Finds a node to take the parity page as it starts a new parity group
• One extra message per parity group to find a server for parity page
Every Nswap server has to recover lost pages that belong to parity groups whose parity page it stores.
+/- Decentralized recovery
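One way to picture the per-page state this slide calls for is a small record stored alongside each data page. The sketch below is an assumption-heavy Python illustration: the `PageRecord` fields and the `merge_groups` helper are hypothetical, and the slides do not specify how Nswap represents this metadata.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PageRecord:
    """Hypothetical metadata a server keeps with each stored data page."""
    owner: str          # client that swapped the page out
    slot: int           # the client's swap slot number
    parity_server: str  # node holding this page's parity page
    parity_group: int   # parity group (P-group) ID


def merge_groups(records: List[PageRecord], old_group: int,
                 new_group: int, new_parity_server: str) -> int:
    """Re-tag every page of old_group after it is merged into new_group."""
    changed = 0
    for rec in records:
        if rec.parity_group == old_group:
            rec.parity_group = new_group
            rec.parity_server = new_parity_server
            changed += 1
    return changed


records = [
    PageRecord("A", 3, "C", 1),
    PageRecord("A", 9, "D", 2),
    PageRecord("E", 4, "C", 1),
]
changed = merge_groups(records, old_group=2, new_group=1, new_parity_server="C")
```

Because the group ID and parity server travel with the page, a migration or merge only has to rewrite these two fields rather than consult a central directory.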

Page 13: Nswap: A Reliable, Adaptable Network RAM System for General Purpose Clusters

Kernel Benchmark Results

Workload                     Swapping to Disk   Nswap (No Reliability)   Nswap (Centralized Parity)
(1) Sequential R&W                     220.31   116.28 (speedup 1.9)     117.10 (speedup 1.9)
(2) Random R&W                        2462.90   105.24 (speedup 23.4)    109.15 (speedup 22.6)
(3) Random R&W & File I/O             3561.66   105.50 (speedup 33.8)    110.19 (speedup 32.3)

8 node Linux 2.6 cluster (Pentium 4, 512 MB RAM, TCP/IP over 1 Gbit Ethernet, 80 GB IDE (100MB/s))

Workloads:
(1) Sequential R & W to large chunk of memory (best case for disk swapping)
(2) Random R & W to memory (more disk arm seeks w/in swap partition)
(3) 1 large file I/O and 1 workload (2) (disk arm seeks between swap & file partitions)
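The speedups reported for the kernel benchmarks are the disk-swapping time divided by the corresponding Nswap time. The short Python check below recomputes them from the reported numbers and also derives the relative cost of adding centralized parity; the dictionary keys are labels chosen for this example.

```python
disk = {"sequential": 220.31, "random": 2462.90, "random+file": 3561.66}
nswap = {"sequential": 116.28, "random": 105.24, "random+file": 105.50}
parity = {"sequential": 117.10, "random": 109.15, "random+file": 110.19}

# Speedup over swapping to disk, rounded to one decimal as on the slide.
speedup_nswap = {w: round(disk[w] / nswap[w], 1) for w in disk}
speedup_parity = {w: round(disk[w] / parity[w], 1) for w in disk}

# Extra time of centralized parity relative to Nswap without reliability, in percent.
parity_overhead_pct = {
    w: round((parity[w] - nswap[w]) / nswap[w] * 100, 1) for w in disk
}
```

The recomputed speedups match the reported ones, and the parity overhead relative to non-reliable Nswap works out to roughly 0.7%, 3.7%, and 4.4% for the three workloads.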

Page 14: Nswap: A Reliable, Adaptable Network RAM System for General Purpose Clusters

Parallel Benchmark Results

Workload   Swapping to Disk   Nswap (No Reliability)   Nswap (Centralized Parity)
Linpack             1745.05   418.26 (speedup 4.2)     415.02 (speedup 4.2)
LU                 33464.99   3940.12 (speedup 8.5)    109.15 (speedup 8.2)
Radix                464.40   96.01 (speedup 4.8)      97.65 (speedup 4.8)
FFT                  156.58   94.81 (speedup 1.7)      95.95 (speedup 1.6)

8 node Linux 2.6 cluster (Pentium 4, 512 MB RAM, TCP/IP over 1 Gbit Ethernet, 80 GB IDE (100MB/s))

Application processes run on half of the nodes (the Nswap clients); the other half run no benchmark processes and act as Nswap servers.

Page 15: Nswap: A Reliable, Adaptable Network RAM System for General Purpose Clusters

Recovery Results

Timed execution of applications with and without concurrent page recovery (simulated node failure and the recovery of pages it lost)
• Concurrent recovery does not slow down application
Measured the time it takes for the Parity Server to recover each page of lost data
• ~7,000 pages recovered per second
• When parity group size is ~5: 0.15 ms per page
• When parity group size is ~6: 0.18 ms per page

Page 16: Nswap: A Reliable, Adaptable Network RAM System for General Purpose Clusters

Conclusions

Nswap’s adaptable design makes adding reliability support difficult
Our Dynamic Parity Solutions solve these difficulties, and should provide the best solutions in terms of time and space efficiency
Results from testing our Centralized Solution support implementing the Decentralized Solution
+ more adaptable
+ no dedicated Parity Server or its fixed-size RAM limitations
- more complicated protocols
- more overlapping, potentially interfering operations
- each node now a Client, Server, and Parity Server

Page 17: Nswap: A Reliable, Adaptable Network RAM System for General Purpose Clusters

Acknowledgments

Swarthmore Students:
Dan Amato ’07, Alexandr Pshenichkin ’07
Jenny Barry ’07, Heather Jones ’06
America Holloway ’05, Ben Mitchell ’05
Julian Rosse ’04, Matti Klock ’03
Sean Finney ’03, Michael Spiegel ’03
Kuzman Ganchev ’03

More information: http://www.cs.swarthmore.edu/~newhall/nswap.html

Page 18: Nswap: A Reliable, Adaptable Network RAM System for General Purpose Clusters

Nswap’s Design Goals

Transparent
• User should not have to do anything to enable swapping over the network
Adaptable
• A Network RAM system that constantly runs on a cluster must adjust to changes in the local node’s memory usage
• Local processes should get local RAM before remote processes do
Efficient
• Swapping in and out should be fast
• Should use a minimal amount of local memory state
Scalable
• System should scale to large clusters (or networked systems)
Reliable
• A crash of one node should not affect unrelated processes running on other nodes

Page 19: Nswap: A Reliable, Adaptable Network RAM System for General Purpose Clusters

Complications

Simultaneous Conflicting Operations
• Asynchrony and threads allow fast, multiple ops at once, but some overlapping ops can conflict (e.g., migration and a new swap-out for the same page)
Garbage Pages in the System
• When a process terminates we need to remove its remotely swapped pages from servers
• Swap interface doesn’t contain a call to the device to free slots since this isn’t a problem for disk swap
Node failure
• Can lose remotely swapped page data

Page 20: Nswap: A Reliable, Adaptable Network RAM System for General Purpose Clusters

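How Pages Move Around the System

[Figure, SWAP-OUT: Node A swaps out page i; the Nswap Client records server B for slot i in its shadow slot map and sends SWAP_OUT? to the Nswap Server on Node B, which replies OK and stores the page in its Nswap Cache.]

[Figure, SWAP-IN: Node A swaps in page i; the Nswap Client finds B in its shadow slot map and sends SWAP_IN to the Nswap Server on Node B, which replies YES and serves the page from its Nswap Cache.]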

Page 21: Nswap: A Reliable, Adaptable Network RAM System for General Purpose Clusters

Nswap Client

Implemented as a device driver and added as a swap device on each node
• Kernel swaps pages to it just like any other swap device
Shadow slot map stores state about remote location of each swapped-out page
- Extra space overhead that must be minimized
Swap out page:
(1) kernel finds free swap slot i
(2) kernel calls our driver’s write function
(3) add server info. to shadow slot map
(4) send the page to server B

[Figure: the kernel’s slot map marks slot i as used; the Nswap Client’s shadow slot map records server B at the same index i.]
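Steps (3) and (4) of the swap-out path, and the matching lookup on swap-in, can be sketched as below. This is a user-level Python illustration rather than driver code: `ShadowSlotMap`, `swap_out`, `swap_in`, and the dictionary standing in for remote servers are all assumptions made for the example.

```python
from typing import Dict, List, Optional


class ShadowSlotMap:
    """Hypothetical shadow slot map: one small entry per kernel swap slot."""

    def __init__(self, num_slots: int):
        # Entry i names the server holding the page swapped out to slot i.
        self.server_of: List[Optional[str]] = [None] * num_slots


def swap_out(slot_map: ShadowSlotMap, servers: Dict[str, Dict[int, bytes]],
             slot: int, page: bytes, server_name: str) -> None:
    """Models the driver's write function for a slot the kernel already chose."""
    slot_map.server_of[slot] = server_name  # step (3): record server info
    servers[server_name][slot] = page       # step (4): stand-in for the network send


def swap_in(slot_map: ShadowSlotMap, servers: Dict[str, Dict[int, bytes]],
            slot: int) -> bytes:
    """Look up which server holds the slot's page and fetch it."""
    server_name = slot_map.server_of[slot]
    return servers[server_name][slot]


servers = {"B": {}, "C": {}}
slot_map = ShadowSlotMap(num_slots=16)
page = b"\xab" * 4096
swap_out(slot_map, servers, slot=7, page=page, server_name="B")
fetched = swap_in(slot_map, servers, slot=7)
```

Keeping only a server identifier per slot is one way to hold the space overhead down, which is the concern the slide raises about the shadow slot map.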

Page 22: Nswap: A Reliable, Adaptable Network RAM System for General Purpose Clusters

Nswap Server

Manages local idle RAM currently allocated for storing remote pages
Handles swapping requests
• Swap-out: allocate page of RAM to store remote page
• Swap-in: fast lookup of page it stores
Grows and shrinks the amount of local RAM available based on the node’s local memory usage
• Acquire pages from paging system when there is idle RAM
• Release pages to paging system when they are needed locally
• Remotely swapped page data may be migrated to other servers

[Figure: the Nswap Server stores a swapped-out page in the Nswap Cache.]
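The server behavior on this slide (store on swap-out, look up on swap-in, shrink by migrating pages elsewhere) can be sketched as a small class. The Python below is illustrative only: `NswapServerSketch`, its page-count `capacity`, and the single migration target are assumptions, and notifying the client of a page's new location is not modeled.

```python
from typing import Dict, Tuple


class NswapServerSketch:
    """Hypothetical server with a resizable cache of remote pages."""

    def __init__(self, capacity: int):
        self.capacity = capacity                      # pages currently allotted to the cache
        self.cache: Dict[Tuple[str, int], bytes] = {}  # (client, slot) -> page

    def swap_out(self, client: str, slot: int, page: bytes) -> bool:
        """Store a remote page if there is room; overwrites are always accepted."""
        key = (client, slot)
        if key not in self.cache and len(self.cache) >= self.capacity:
            return False
        self.cache[key] = page
        return True

    def swap_in(self, client: str, slot: int) -> bytes:
        """Fast lookup of a stored page."""
        return self.cache[(client, slot)]

    def shrink(self, new_capacity: int, target: "NswapServerSketch") -> int:
        """Give RAM back to the local node, migrating overflow pages to target."""
        self.capacity = new_capacity
        migrated = 0
        while len(self.cache) > self.capacity:
            (client, slot), page = next(iter(self.cache.items()))
            if not target.swap_out(client, slot, page):
                break  # target is full; a real system would try another server
            del self.cache[(client, slot)]
            migrated += 1
        return migrated


b = NswapServerSketch(capacity=2)
c = NswapServerSketch(capacity=2)
b.swap_out("A", 1, b"x" * 4096)
b.swap_out("A", 2, b"y" * 4096)
rejected = not b.swap_out("A", 3, b"z" * 4096)
moved = b.shrink(new_capacity=1, target=c)
```

The page is deleted from the shrinking server only after the target accepts it, so a refused migration never drops data in this sketch.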

Page 23: Nswap: A Reliable, Adaptable Network RAM System for General Purpose Clusters

Finding a Server to take a Page

Client uses local info. to pick “best” server
• Local IP Table stores available RAM for each node
• Servers periodically broadcast their size values
• Clients update entries as they swap to servers
• IP Table also caches open sockets to nodes
+ No centralized remote memory server

IP Table:
HOST   AMT   Open Socks
B      20
C      10
F      35

[Figure: to swap out page i, the Nswap Client looks up a good candidate server in the IP Table, gets an open socket to it, and records the chosen server (B) at index i of the shadow slot map.]
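A client-side lookup over the IP Table might look like the sketch below. The slides do not state the selection rule, so choosing the host with the most advertised RAM is an assumption, as are the names `pick_server`, `record_swap_out`, and `apply_broadcast`; socket caching is omitted.

```python
from typing import Dict, Optional

# Available RAM per host, using the example values from the IP Table.
ip_table: Dict[str, int] = {"B": 20, "C": 10, "F": 35}


def pick_server(table: Dict[str, int]) -> Optional[str]:
    """Assumed policy: choose the host advertising the most available RAM."""
    candidates = {host: amt for host, amt in table.items() if amt > 0}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


def record_swap_out(table: Dict[str, int], host: str) -> None:
    """Client-side update: one fewer page available after swapping to host."""
    table[host] -= 1


def apply_broadcast(table: Dict[str, int], host: str, amt: int) -> None:
    """Replace the local estimate with the size value a server broadcast."""
    table[host] = amt


first = pick_server(ip_table)
record_swap_out(ip_table, first)
apply_broadcast(ip_table, "C", 50)
second = pick_server(ip_table)
```

Both update paths use only information already at the client, which is what lets the design avoid a centralized remote memory server.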

Page 24: Nswap: A Reliable, Adaptable Network RAM System for General Purpose Clusters

Soln 1: Mirroring

On Swap-outs: send page to primary & back-up servers
On Migrate: if new Server already has a copy of the page, it will not accept the MIGRATE request and old server picks another candidate
+ Easy to implement
- 2 pages being sent on every swap-out
- Requires 2x as much RAM space for pages
- Increases the size of the shadow slot map

[Figure: Node A sends SWAP OUT to both Node B and Node C.]
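The mirroring rules above can be sketched in a few lines: a swap-out stores the page on two servers, and a migration is refused by any server that already holds the other copy. The Python below is a hypothetical illustration; `MirrorServer`, `accept_migrate`, and `migrate` are names invented for the example.

```python
from typing import Dict, List, Optional, Tuple

Key = Tuple[str, int]  # (client, slot)


class MirrorServer:
    """Hypothetical server used to illustrate mirrored swap-outs."""

    def __init__(self, name: str):
        self.name = name
        self.pages: Dict[Key, bytes] = {}

    def swap_out(self, key: Key, page: bytes) -> None:
        self.pages[key] = page

    def accept_migrate(self, key: Key, page: bytes) -> bool:
        """Refuse the MIGRATE if this server already holds a copy of the page."""
        if key in self.pages:
            return False
        self.pages[key] = page
        return True


def migrate(old: MirrorServer, key: Key,
            candidates: List[MirrorServer]) -> Optional[MirrorServer]:
    """Old server tries candidates in order until one accepts the page."""
    page = old.pages[key]
    for server in candidates:
        if server.accept_migrate(key, page):
            del old.pages[key]
            return server
    return None


b, c, d = MirrorServer("B"), MirrorServer("C"), MirrorServer("D")
key = ("A", 7)
page = b"\x5a" * 4096
b.swap_out(key, page)  # primary copy
c.swap_out(key, page)  # back-up copy: 2 pages sent per swap-out
new_home = migrate(b, key, [c, d])
```

Here C rejects the migration because it holds the back-up copy, so the page moves from B to D and the two copies stay on different nodes.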