Service Computing http://xuwang.tech



  • Service Computing

    http://xuwang.tech

  • Background: Internet-scale services are everywhere
    – Large-scale service platforms: Facebook, Google
    – Everything as a Service (XaaS), the Internet of Services: Amazon
    – Location-based services (LBS)

  • Why it is hard (1): scalability
    – Facebook: a user base in the billions
    – Google: roughly 100 PB of data processed per day (2013)
    – Singles' Day (Nov 11, 2012): extreme peak transaction volume on
      e-commerce platforms

  • Scalability

  • Why it is hard (2): availability, 7×24 operation
    – Service-level agreements (SLA):
      • Amazon EC2 99.95%, Google Cloud Storage 99.9%, Microsoft Windows Azure 99.9%
    – Well-known outages:
      1. Amazon EC2: Apr 21–22, 2011, and again in Jun 2012
      2. Microsoft Azure: Jul 26, 2012, about 2.5 hours
      3. Google App Engine (GAE): Oct 26, 2012, about 4 hours
      4. Hotmail, Outlook.com, and SkyDrive: Mar 12, 2013, about 17 hours
      5. Twitter: Jun 3, 2013, tweets unavailable for about 45 minutes
      6. Google: Aug 17, 2013; global Internet traffic dropped about 40% during the outage
      7. Adobe Creative Cloud: May 25, 2014, about 24 hours
      8. Lync instant messaging service and Outlook/Exchange Online: Jun 23–24, 2014

  • Scalability and availability via replication
    [Figure: a Load Balancer spreads requests over replicas R1, R2, … Rk … Rn
    (scalability / load balance); replication also enables failover & recovery
    (availability). But after Write(A,2) updates one replica, the others still
    hold A=1, so Read(A) at another replica may return the stale value.
    Replication thus creates a consistency problem.]

  • Consistency trade-offs (1)
    – CAP (Brewer's PODC 2000 keynote): a replicated system cannot
      simultaneously guarantee Consistency, Availability, and tolerance of
      network Partitions
    – PACELC (IEEE Computer, 2012): if a Partition occurs, trade Availability
      against Consistency; Else, trade Latency against Consistency

  • Scalability vs. consistency: CAP
    – PODC 2000 keynote: pick at most two of C, A, P
    – IEEE Computer, 2012: "The Growing Impact of the CAP Theorem"
    – Related notions: ACID, BASE, PACELC

  • CAP illustrated
    [Figure: replicas S1 and S2 behind a Load Balancer, in Data Center A and
    Data Center B; when the link between the centers partitions, each side
    must choose between answering requests (A) and staying consistent (C).]
    – Consistency: all nodes have the same data at any time
    – Availability: the system allows operations all the time
    – Partition tolerance: the system continues to work in spite of network
      partitions

  • Consistency trade-offs (2): industry positions
    – Jim Gray
    – Amazon
    – Yahoo

  • From strong to weak consistency
    – Strong consistency: linearizability, ACID transactions
    – Weakened to eventual consistency: ACID → BASE (Basic Availability,
      Soft state, Eventual consistency)
    – The price of weakening:
      • Write conflicts
      • Read staleness: reads may return outdated values


  • Achieving replica consistency
    – Atomic commitment: two-phase commit (2PC) and three-phase commit (3PC)
    – Consensus: Paxos
    – Quorum systems: N replicas, a write quorum of W, a read quorum of R
      (see the sketch below)
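The quorum rule can be stated compactly in code. A minimal sketch (mine, not from the slides): read and write quorums are guaranteed to intersect whenever W + R > N, and two write quorums intersect whenever 2W > N.

```python
# Sketch: classifying an (N, W, R) quorum configuration.
# W + R > N  => every read quorum overlaps every write quorum,
#               so a quorum read sees the latest quorum write.
# 2W > N     => any two write quorums overlap, serializing writes.

def quorum_properties(n: int, w: int, r: int) -> dict:
    assert 1 <= w <= n and 1 <= r <= n
    return {
        "read_sees_latest_write": w + r > n,
        "writes_serialized": 2 * w > n,
    }

for n, w, r in [(3, 2, 2), (3, 1, 1), (5, 3, 3), (3, 3, 1)]:
    print((n, w, r), quorum_properties(n, w, r))
```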

  • Systems and consistency models
    – Strongly consistent systems: SMATER, POSTGRES-R, HBase
    – Weaker models: eventual consistency, timeline consistency, session
      consistency, causal consistency, snapshot-isolation consistency


  • Measuring consistency
    – Version-based staleness bounds: K-Versions
    – Continuous consistency: TACT
    – Probabilistic staleness bounds: PBS

  • Quorum systems in practice
    – Amazon Dynamo and Cassandra expose tunable W and R per operation
    – Choosing W and R trades consistency against latency and availability
    – Replicas are placed over a ring topology

  • Paxos and its variants
    – Fast Paxos, Generalized Paxos
    – Mencius, Multi-Ring Paxos
    – SSD/storage-oriented designs: Spinnaker, CORFU

  • Why latency matters
    – Measured revenue impact of added latency:
      • Bing: +2 s cut queries/user by 1.8% and revenue/user by 4.3%
      • Google: +500 ms cut traffic by 25%
      • Amazon: +100 ms cut sales by 1%

  • A spectrum of replicated-write protocols
    [Figure: three message-flow diagrams. In each, a Client sends WRITE to a
    coordinator, which forwards it to Replica1 … Replican; replicas log the
    write and ACK, and the coordinator then COMMITs. The protocols differ
    only in how many ACKs the coordinator waits for:
      1 ACK: non-uniform total order broadcast;
      ⌊n/2⌋+1 ACKs: distributed consensus (Paxos);
      n ACKs: uniform total order broadcast.]

  • A unified protocol: RSM-d (sketched below)
    [Figure: a Client sends WRITE to the coordinator, which PROPOSEs to
    Replica1 … Replican; each replica logs the write and ACKs. The
    coordinator REPLYs to the client after d ACKs (Agreement Phase), then
    COMMITs and FINISHes (Commit Phase).]
    – d is the number of ACKs the coordinator waits for, 1 ≤ d ≤ n:
      • d = 1: non-uniform total order broadcast
      • d = ⌊n/2⌋+1: Paxos
      • d = n: uniform total order broadcast
    – Assumption (1): at most f replicas fail, f ≤ ⌊n/2⌋
    – Assumption (2): when the coordinator fails, a new coordinator rebuilds
      state from the log history and the commit history
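A toy model of the RSM-d write path, a sketch of mine rather than the authors' implementation: the coordinator replies after the d-th fastest ACK, so d directly sets the latency side of the trade-off. The exponential ACK-delay distribution is an illustrative assumption.

```python
# Sketch: RSM-d write latency = time until the d-th fastest replica ACKs.
import random

def rsm_d_write(n: int, d: int, mean_ack_ms: float = 10.0) -> float:
    """One simulated write: PROPOSE to all n replicas, wait for d ACKs."""
    assert 1 <= d <= n
    delays = sorted(random.expovariate(1 / mean_ack_ms) for _ in range(n))
    return delays[d - 1]          # coordinator replies at the d-th ACK

random.seed(0)
n = 5
for d in (1, n // 2 + 1, n):      # non-uniform TOB, Paxos-like, uniform TOB
    avg = sum(rsm_d_write(n, d) for _ in range(10_000)) / 10_000
    print(f"d={d}: mean write latency ≈ {avg:.1f} ms")
```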


  • Consistency analysis of RSM-d
    – When d ≥ ⌊n/2⌋+1 (Paxos and above), the write-violation probability
      Pwc = 0:
      1. the write is PROPOSEd to 100% of the replicas;
      2. at least d replicas, a majority, log the PROPOSE before the reply;
      3. any later recovery therefore includes the write.
    – For d ≤ ⌊n/2⌋, Pwc is built up case by case:
      • write log lost: Pwc = Pwll
      • plus non-uniform writes: Pwc = Pwll · Pwnu
      • generalized to write lost: Pwc = Pwl
      • plus write duplication: Pwc = Pwl + Pwd

  • Case 1: write log lost
    [Figure: the coordinator fails after d replicas (d ≤ f) have logged the
    PROPOSE and ACKed.]
    – The d logging replicas all crash before the write propagates further,
      so the write is lost even though the client was answered.
    – Pwll: the probability that all d logging replicas crash, where Pc is
      the crash probability of a single replica:
      Pwc = Pwll = (Pc)^d

  • Case 2: non-uniform write
    [Figure: d replicas (d ≤ f) log the PROPOSE, the coordinator COMMITs and
    writes its commit history, and then the loggers fail while k of the
    remaining replicas also crash.]
    – The surviving replicas can elect a new coordinator without the write,
      so replicas diverge.
    – Pwnu: the probability that enough of the remaining n − d replicas stay
      up for a majority to form without the write:
      Pwnu = Σ_{k=0}^{⌊n/2⌋−d} C(n−d, k) (Pc)^k (1−Pc)^{n−d−k}
      Pwc = Pwll · Pwnu

  • Case 3: write lost (general form)
    – Let D be the number of replicas that logged the PROPOSE before the
      coordinator failed, with distribution Pelw(D).
    – Cases 1 and 2 correspond to D = d; in general D = m with
      d ≤ m ≤ ⌊n/2⌋. Given D = x, the write is lost with probability
      Q(x) = (Pc)^x Σ_{k=0}^{⌊n/2⌋−x} C(n−x, k) (Pc)^k (1−Pc)^{n−x−k}
      so Pwc(D = d) = Q(d) and Pwc(D = m) = Q(m).
    – Overall: Pwc = Pwl = Σ_{m=d}^{⌊n/2⌋} Pelw(m) Q(m)

  • Case 4: write duplication
    [Figure: a falsely suspected coordinator and a newly elected coordinator
    both run the same WRITE, each collecting d ACKs (1 ≤ d ≤ ⌊n/2⌋).]
    – If the old coordinator was only suspected, not actually crashed, both
      coordinators may commit the write, so it is applied twice.
    – Pno: the probability that the two coordinators' ACK sets do not expose
      the duplicate, given in closed form by ratios of binomial coefficients
      in n and d.

  • Heartbeat failure detection (estimated numerically below)
    [Figure: each replica sends a heartbeat to the coordinator every To; a
    heartbeat arrives after network delay Te; the coordinator alternates
    between Trust and Suspicion according to a timeout Tf(t).]
    – Thb1, Thb2: delays of two consecutive heartbeats.
    – Pfs: the probability of a false suspicion,
      Pfs = Pr(Thb2 + Te > Thb1 + To) = Pr(Thb2 − Thb1 > To − Te)
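A quick way to get a feel for Pfs is Monte Carlo. This is a sketch of mine, assuming (purely for illustration) exponentially distributed heartbeat delays; the slides do not fix a distribution.

```python
# Sketch: estimate P_fs = Pr(T_hb2 - T_hb1 > To - Te) by simulation.
import random

def estimate_pfs(timeout_margin_ms: float, mean_delay_ms: float = 10.0,
                 trials: int = 100_000) -> float:
    hits = 0
    for _ in range(trials):
        t_hb1 = random.expovariate(1 / mean_delay_ms)  # delay of heartbeat i
        t_hb2 = random.expovariate(1 / mean_delay_ms)  # delay of heartbeat i+1
        if t_hb2 - t_hb1 > timeout_margin_ms:          # heartbeat misses the window
            hits += 1
    return hits / trials

random.seed(1)
for margin in (5, 20, 50):   # To - Te, the slack built into the timeout
    print(f"margin={margin} ms -> P_fs ≈ {estimate_pfs(margin):.4f}")
```

As expected, a larger timeout slack drives the false-suspicion probability down, at the cost of slower detection of real failures.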

  • From one link to the whole group
    – With n replicas, a false suspicion anywhere in the group can trigger a
      duplicate coordinator. The group false-suspicion probability:
      Pgfs = Σ_{f=0}^{⌊n/2⌋} C(n−1, f) (Pc)^f (1−Pc)^{n−1−f} (1 − (1−Pfs)^{n−f−1})
    – Write-duplication probability: Pwd = (1 − Pc) · Pgfs · Pno

  • Putting it together (evaluated below)
    – Write lost: the write disappears although the client was answered.
    – Write duplication: the same write is applied twice.
    – Total write-violation probability: Pwc = Pwl + Pwd
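The two dominant terms can be evaluated directly. The formulas below follow my reconstruction of the garbled slides, so treat their exact forms as assumptions; the code is a sketch, not the authors' model implementation.

```python
# Sketch: evaluating the reconstructed failure-mode terms for RSM-d.
from math import comb

def p_wll(pc: float, d: int) -> float:
    """All d logging replicas crash before the write propagates."""
    return pc ** d

def p_wnu(pc: float, n: int, d: int) -> float:
    """At most floor(n/2)-d further crashes, so a majority can still
    form without the write (non-uniform write)."""
    return sum(comb(n - d, k) * pc**k * (1 - pc) ** (n - d - k)
               for k in range(n // 2 - d + 1))

pc, n = 0.01, 5
for d in (1, 2):
    print(f"d={d}: P_wll*P_wnu ≈ {p_wll(pc, d) * p_wnu(pc, n, d):.2e}")
```

Even this toy evaluation shows the point of the analysis: each extra ACK multiplies the loss probability by roughly another factor of Pc.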

  • Latency of RSM-d
    [Figure: the write path Client → coordinator → PROPOSE → log → ACK →
    REPLY → COMMIT → FINISHED, with per-step times tp, tlog, ta, tc, tcom,
    and tf.]
    – Per-replica ACK latency: LAi = tp(i) + tlog(i) + ta(i)
    – Sort LA1, LA2, …, LAn into order statistics LA(1) ≤ LA(2) ≤ … ≤ LA(n−1)
    – Write latency: Lw(d) = LA(d−1) + min(LCi)

  • Expected latency Lw(d) (simulated below)
    – Model LAi = tp(i) + ta(i) with CDF G(t); the k-th order statistic
      LA(k) has pdf h(k)(t) and CDF H(k)(t), 1 ≤ k ≤ n−1.
    – The density of a sum of two steps is a convolution:
      g(t) = ∫_{−∞}^{+∞} f(x) f(t−x) dx
    – Latency is strictly increasing in d:
      E[Lw(d+1)] − E[Lw(d)] = E[ΔLA(d)]
        = ∫_0^{+∞} (G(t))^d (1 − G(t))^{n−d−1} dt > 0
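The monotonicity is easy to see empirically: the write latency is the d-th order statistic of the per-replica ACK latencies. A sketch with assumed (Gaussian-plus-exponential) step-time distributions, which are mine, not the slides':

```python
# Sketch: mean L_w(d) as the d-th order statistic of simulated ACK latencies.
import random

def mean_lw(n: int, d: int, trials: int = 20_000) -> float:
    total = 0.0
    for _ in range(trials):
        las = sorted(random.gauss(10, 2) + random.expovariate(0.5)
                     for _ in range(n))   # LA_i, assumed distribution
        total += las[d - 1]               # wait for the d-th ACK
    return total / trials

random.seed(2)
n = 7
print([round(mean_lw(n, d), 2) for d in range(1, n + 1)])  # nondecreasing in d
```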

  • Bandwidth cost
    – The bandwidth cost B(d) decomposes into a consensus-traffic component
      Bc and a logging component Bl.

  • Monetary cost model
    – Parameters: w (write rate), b (bandwidth price), and the penalty rates
      rb and rl charged for slow and for lost or inconsistent operations.

  • Model validation by simulation
    – Counted lost writes Nwl and duplicated writes Nwd over 10,000,000
      simulated writes; measured Pwc ≈ (Nwl + Nwd) / 10,000,000.
    – Consistency model: n ∈ [2,9], d ∈ [1,⌊n/2⌋];
      RMSE = 0.0009%, std. dev. = 0.0052%.
    – Latency model: n ∈ [2,9], d ∈ [1,n];
      RMSE = 0.013 ms, std. dev. = 0.019 ms.

  • Effect of d
    [Figure: two panels over n = 2 … 9. Left, "Impact of d on Consistency":
    lg Pwc falls steeply as d grows (curves for d = 1 … 4). Right, "Impact
    of d on Latency": the extra latency Lw(2) … Lw(5) over Lw(1), in ms,
    grows with d (0–250 ms).]

  • Consistency vs. latency
    [Figure: consistency (1−10^−1 down to 1−10^−6) against latency
    Lw − Lw(1) (0–200 ms) for configurations (n, d) with n = 3, 5, 7: from
    (3,1), (5,1), (7,1) up through (5,2), (7,2), (7,3) to (3,2), (5,3),
    (7,4). Larger d buys consistency at the cost of latency.]

  • Revenue-aware choice of d
    – A service S produces value V per operation; slow responses are
      discounted at rate rb, and lost or stale results at rate rl.
    – Measured loss rates: Microsoft Bing rl = 0.0009%, Amazon.com
      rl = 0.01%, Google Search rl = 0.05%.
    – v(d): the expected value per operation as a function of d.

  • Choosing the optimal d


  • Quorum systems: probability of a stale read (computed below)
    – A read quorum of R replicas misses all W freshly written replicas with
      probability
      prs = C(N−W, R) / C(N, R)
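The staleness probability above translates directly into code; this short sketch evaluates it for a few configurations.

```python
# Sketch: probability that a read quorum of R misses all W written replicas.
from math import comb

def p_stale_read(n: int, w: int, r: int) -> float:
    return comb(n - w, r) / comb(n, r)   # comb() is 0 when r > n - w

for n, w, r in [(3, 1, 1), (3, 2, 1), (5, 2, 2), (5, 3, 3)]:
    print((n, w, r), f"{p_stale_read(n, w, r):.3f}")
```

Note that whenever W + R > N the probability is exactly 0, recovering the quorum intersection rule.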

  • How W/R quorums execute, and what a partition does
    [Figure: left, the normal case with N = 3, W = 2, R = 1: the Client
    sends WRITE to the coordinator (Replica2), which sends it to all N
    replicas and waits for W responses before COMMIT. Right, a network
    partition separates Replica1 from Replica2/Replica3: the coordinator
    keeps waiting, times out, and the operation becomes unavailable.]

  • Quorums meet data center networks
    – Quorum availability depends on where the replicas sit in the network.
    – Topologies: 2/3-tier basic trees, fat tree, folded Clos; built from
      top-of-rack (ToR) switches, aggregation switches (AS), and core
      switches (CS).

  • Model: a quorum system QS(DCN, PM, W/R)
    – DCN ∈ {2-tier basic tree (bt2), 3-tier basic tree (bt3), K-ary fat
      tree (ft), folded Clos (fc)}
    – PM: a placement vector giving the number of replicas under each ToR
      (PMtor)
    – W/R: the write and read quorum sizes
    – Component failure probabilities: (1) core switch Pc, (2) aggregation
      switch Pa, (3) ToR switch Pt, (4) server Ps


  • 2-tier basic tree (bt2), evaluated in the sketch below
    [Figure: a core switch (CS) over ToR switches over servers; the
    coordinator WRITEs to replicas spread across ToR groups.]
    – Ai: the number of available replicas under ToRi.
    – Distribution of the total number of reachable replicas:
      Qbt2-tor(x) = Σ_{m1+m2+…=x} Π_i P(Ai = mi)
    – Availability of a write quorum of W:
      Avail(QS(bt2, PM, W)) = (1 − Pc) Σ_{x=W}^{N} Qbt2-tor(x)
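The bt2 formula is a convolution of per-ToR distributions, discounted by the core switch. A sketch under assumed failure probabilities (the exact per-ToR distribution is my reading of the garbled slide):

```python
# Sketch: bt2 availability by convolving per-ToR replica-survival distributions.
from math import comb

def tor_dist(replicas: int, p_tor: float, p_srv: float):
    """P(A_i = m): ToR down -> 0 reachable; else Binomial(replicas, 1-p_srv)."""
    dist = [0.0] * (replicas + 1)
    dist[0] += p_tor
    for m in range(replicas + 1):
        dist[m] += (1 - p_tor) * comb(replicas, m) * (1 - p_srv)**m * p_srv**(replicas - m)
    return dist

def convolve(a, b):
    out = [0.0] * (len(a) + len(b) - 1)
    for i, pa in enumerate(a):
        for j, pb in enumerate(b):
            out[i + j] += pa * pb
    return out

def avail_bt2(placement, w, p_core=1e-4, p_tor=1e-3, p_srv=1e-2):
    q = [1.0]
    for m_i in placement:             # one distribution per ToR group
        q = convolve(q, tor_dist(m_i, p_tor, p_srv))
    return (1 - p_core) * sum(q[w:])  # need at least W reachable replicas

for pm in [(3, 0, 0), (2, 1, 0), (1, 1, 1)]:
    print(pm, f"{avail_bt2(pm, w=2):.6f}")
```

Running it shows the placement effect the slides are after: spreading the three replicas across ToRs beats packing them into one rack.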

  • 3-tier basic tree (bt3)
    [Figure: a core switch over aggregation switches (AS) over ToRs over
    servers.]
    – The per-ToR distribution Qbt2-tor(x) is reused; Xbt3-as additionally
      conditions on the aggregation layer.
    – Avail(QS(bt3, PM, W)) = (1 − Pc) Σ_{x'=W}^{N} Pr(Xbt3-as = x')

  • K-ary fat tree
    [Figure: k pods (Pod1 … Podk), each with aggregation (AS) and ToR
    layers; the core switches form k/2 groups (GCS), each connected to one
    aggregation switch per pod (SAS); the coordinator and replicas are
    spread across pods.]

  • Fat-tree availability
    [Figure: replicas reached through core groups GCS1 … GCSx and
    aggregation switches SAS1 … SASk across pods Pod1 … Podk.]
    – With x surviving GCS groups, the 3-tier analysis applies below them:
      Avail(QS(ft, PM, W)) = Σ_{x'=W}^{N} Pr(Xft-sas = x')

  • Folded Clos
    – The core switches (CS) collapse into a single group (GCS); the
      aggregation switches (AS) form the SAS layer connecting the pods.

  • Folded-Clos availability
    [Figure: one GCS over aggregation switches SAS1 … SAS_{DI/2}, over ToRs
    and servers; the 3-tier analysis applies below the GCS.]
    – Avail(QS(fc, PM, W)) = (1 − P'gc) Σ_{x'=W}^{N} Pr(Xfc-sas = x')


  • Experimental setup
    – Data blocks (8 K), N = 3 replicas; placements over ToRs:
      PM1tor = 〈1,1,1〉, PM2tor = 〈2,1,0〉, PM3tor = 〈3,0,0〉
    – Compared against the placement policies of Hadoop and Cassandra.

  • Availability vs. latency
    – A weight α balances the write cost W against the read cost R.
    – Choose the quorum pair (W, R) that minimizes αW + (1−α)R subject to
      the availability target (searched in the sketch below).
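A brute-force search over quorum pairs makes the objective concrete. This is a sketch of mine; `toy_avail` is a stand-in for the data-center availability models above, not a real measurement.

```python
# Sketch: pick (W, R) with W + R > N meeting an availability target,
# minimizing the weighted cost alpha*W + (1-alpha)*R.

def pick_quorum(n: int, alpha: float, target: float, avail):
    best = None
    for w in range(1, n + 1):
        for r in range(n - w + 1, n + 1):      # enforce W + R > N
            if avail(w) >= target:
                cost = alpha * w + (1 - alpha) * r
                if best is None or cost < best[0]:
                    best = (cost, w, r)
    return best

toy_avail = lambda w: 1 - 10 ** -(5 - w)       # assumed: larger W, lower availability
print(pick_quorum(n=3, alpha=0.05, target=0.999, avail=toy_avail))
```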

  • Results over N ∈ [3,9], W ∈ [1,N]
    – Availability ranges from four to nine "nines" depending on topology,
      placement, and quorum size.

  • Impact of W
    – With N = 3, increasing W lowers availability; fat tree (ft) and
      folded Clos (fc) sustain up to nine nines.

  • Impact of N
    – Increasing N raises availability, up to nine nines.

  • Impact of placement PM
    – Folded Clos, N = 3, six placements over nine racks:
      PM1tor = 〈1,0,0,1,0,0,1,0,0〉   PM2tor = 〈2,0,0,1,0,0,0,0,0〉
      PM3tor = 〈1,1,0,1,0,0,0,0,0〉   PM4tor = 〈2,1,0,0,0,0,0,0,0〉
      PM5tor = 〈1,1,1,0,0,0,0,0,0〉   PM6tor = 〈3,0,0,0,0,0,0,0,0〉
    – Compared against Hadoop's default placement.

  • Availability-latency trade-off (α = 0.05)
    – To reach 99.9% availability: (W, R) = (2, 1)
    – To reach 99%: (W, R) = (3, 1)


  • Replicating Web services
    – Web services interact over HTTP, e.g., Ebay's Web services and AWS.
    – Stateful services need replication for availability.
    – Candidate mechanisms: two-phase commit (2PC) and Paxos.

  • Rep4WS

  • Rep4WS: Paxos for Web service replication (sketched below)
    [Figure: a Client sends REQUEST to WS Replica3 (the leader); the leader
    assigns a sequence NUMBER (1, 2, 3, …) and PROPOSEs to the followers
    WS Replica1 and WS Replica2; all replicas LOG the request and ACK;
    after n/2+1 ACKs the leader LEARNs, every replica EXECUTEs, the
    followers RESPOND, and the leader REPLYs to the client.]
    – AGREEMENT PHASE: runs with pipeline concurrency α (multiple requests
      in flight).
    – EXECUTION PHASE: ordered by the request dependency graph (RDG).
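The agreement phase reduces to a small loop at the leader. A heavily simplified sketch of mine (synchronous, failure-free, illustrative class and method names rather than Rep4WS's actual API):

```python
# Sketch: Rep4WS-style agreement — the leader numbers each request,
# proposes it to all replicas, and learns it after a majority of ACKs.

class Replica:
    def __init__(self):
        self.log = {}
    def propose(self, seq: int, request: str) -> bool:
        self.log[seq] = request        # LOGGING step
        return True                    # ACK

class Leader:
    def __init__(self, replicas):
        self.replicas = replicas
        self.next_seq = 1
        self.learned = []
    def handle_request(self, request: str) -> int:
        seq, self.next_seq = self.next_seq, self.next_seq + 1
        acks = sum(r.propose(seq, request) for r in self.replicas)
        if acks >= len(self.replicas) // 2 + 1:   # majority logged it
            self.learned.append((seq, request))   # safe to EXECUTE in order
            return seq
        raise RuntimeError("agreement failed")

leader = Leader([Replica() for _ in range(3)])
for op in ["addToCart", "orderProduct"]:
    print("learned at seq", leader.handle_request(op))
```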

  • Commutability of service operations
    – Annotate the service API with which operation pairs are commutable;
      requests whose operations commute need no mutual ordering, which is
      recorded in the request dependency graph (RDG).
    – Example WebShopService operations: getMyCart, addToCart,
      searchProduct, orderProduct, modifyMyProfile.
    – getMyCart and searchProduct are commutable; getMyCart and addToCart
      can not commute.

  • Commutability matrix of the WebShopService (see the RDG construction
    sketch below)

          OP1  OP2  OP3  OP4  OP5
     OP1   ×    ×    ×    √    ×
     OP2        √    ×    √    √
     OP3             √    ×    √
     OP4                  ×    ×
     OP5                       √

    [Figure: a pipeline of request slots n+1 … n+8, each slot carrying one
    of OP1 … OP5.]
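Given such a matrix, the RDG is built by adding an edge between any two pipelined requests whose operations do not commute. A sketch of mine (the `COMMUTABLE` set is a toy stand-in for the full matrix):

```python
# Sketch: build a request dependency graph (RDG) from commutability annotations.

COMMUTABLE = {frozenset({"getMyCart", "searchProduct"})}   # assumed annotation

def commutes(op_a: str, op_b: str) -> bool:
    return frozenset({op_a, op_b}) in COMMUTABLE

def build_rdg(pipeline):
    """Edges point from earlier to later non-commuting requests; requests
    without a path between them may execute out of order."""
    edges = []
    for j, (seq_j, op_j) in enumerate(pipeline):
        for seq_i, op_i in pipeline[:j]:
            if not commutes(op_i, op_j):
                edges.append((seq_i, seq_j))
    return edges

pipeline = [(1, "getMyCart"), (2, "searchProduct"), (3, "addToCart")]
print(build_rdg(pipeline))   # [(1, 3), (2, 3)]: only addToCart is ordered
```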

  • Out-of-order execution with the RDG
    [Figure: requests n+1 … n+8 are LEARNed in the order (n+1)(n+5)(n+3)
    (n+8)(n+4)(n+6)(n+2)(n+7); each request EXECUTEs as soon as its RDG
    predecessors have executed, and otherwise Waits.]
    – Execution time with the RDG: 3+4+4+1 = 12 units; without the RDG
      (strictly sequential): 5+4+4+2+1 = 16 units.

  • Maintaining the RDG
    – The leader keeps a bounded RDG over the current pipeline window
      (e.g., [n, n+8]); each new request adds a vertex plus edges to
      earlier non-commuting requests.
    – Pruning: (1) executed vertices with 0 remaining successors are
      dropped from the RDG; (2) as the pipeline advances to slot (i+1),
      vertices before roughly slot (i/2) are garbage-collected.

  • Failure handling
    – Leader failure: handled by Paxos leader election; the new leader
      rebuilds the pipeline and RDG state.
    – Follower failure: recover from a checkpoint, then catch up from the
      leader's log.

  • Performance
    [Figure: two panels. Left, response time (ms, 0–120) vs. replica number
    (3–7) for 2PC, GC-TCP, and Rep4WS/Basic Paxos. Right, response time
    (ms, 0–350) vs. load (10–100 reqs/sec) at replica number = 3 for 2PC,
    GC-TCP, Basic Paxos, and Rep4WS.]

  • Effect of the RDG
    [Figure: three workloads of read and write operations over a window
    n … n+14, yielding dependency graphs RDG1 (densest), RDG2, and RDG3
    (sparsest), with average dependency degrees of roughly 7, 4.2, and 0;
    response time (ms, 0–250) vs. load (10–100 reqs/sec) for RDG1, RDG2,
    RDG3, and no RDG.]
    – The sparser the RDG, the better: RDG3 outperforms RDG1 by 15%–66%
      (26% on average).


  • Why session and causal guarantees matter
    – If an operation O depends on what a process P has observed, write
      P ⇝ O.
    – Example: 1. user A posts an update; 2. user B replies after seeing
      A's update. B's reply should never become visible without A's update.
    – Eventual consistency alone preserves neither order.

  • Ordering relations
    – Program order within one client session.
    – Data dependencies across operations.
    – Lamport's happened-before relation combines both.

  • Session guarantees
    – Monotonic reads: later reads never observe an older state than
      earlier reads.
    – Monotonic writes: a session's writes take effect in the order issued.
    – Read your writes: a read observes all earlier writes of its own
      session.
    – Writes follow reads: if a write b is issued after reading a, then b
      is ordered after a at every replica.

  • Causal consistency
    – Writes related by happened-before must be observed in that order by
      every process.
    – Example: if w(x=1) ⇝ w(x=2), then once x = 2 is visible, no read may
      return x = 1.

  • Implementation (sketched below)
    – Each write carries a pair 〈key, ts〉 with a timestamp ts.
    – A session tracks the largest ts it has written and read; a replica
      may serve a read only if its state is at least that fresh.
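A minimal sketch of that timestamp mechanism, with illustrative names of my own rather than CoCaCo's actual code: the session's freshness floor enforces read-your-writes and monotonic reads at whichever replica serves the read.

```python
# Sketch: session consistency via per-write timestamps and a session floor.

class Session:
    def __init__(self):
        self.min_ts = 0                  # freshness floor the session has seen

class Replica:
    def __init__(self):
        self.value, self.ts = None, 0
    def write(self, session: Session, value, ts: int):
        if ts > self.ts:
            self.value, self.ts = value, ts
        session.min_ts = max(session.min_ts, ts)       # read-your-writes
    def read(self, session: Session):
        if self.ts < session.min_ts:                   # replica too stale
            raise RuntimeError("stale replica for this session; retry elsewhere")
        session.min_ts = max(session.min_ts, self.ts)  # monotonic reads
        return self.value

s, r1, r2 = Session(), Replica(), Replica()
r1.write(s, "x=2", ts=5)
print(r1.read(s))            # ok: the session sees its own write
try:
    r2.read(s)               # r2 (ts=0 < 5) is rejected, not served stale
except RuntimeError as e:
    print("rejected:", e)
```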

  • Stale-read rate: Cassandra vs. CoCaCo

                           20%      40%      60%      80%
    Cassandra-Eventual     0.16%    0.08%    0.03%    0.02%
    Cassandra-RYW          >0       >0       >0       >0
    CoCaCo                 0        0        0        0


  • Q&A