Globally Synchronized Time via Datacenter Networks
Ki Suh Lee, Cornell University
Joint work with Han Wang, Vishal Shrivastav, and Hakim Weatherspoon
Synchronized Clocks
• Fundamental for network and distributed systems
  – OWD, Monitoring, Coordination, Snapshots, Updates, …
• Goal: Minimized and bounded precision with scalability
  – Minimized and bounded precision: hundreds of nanoseconds
  – Scalability: Entire datacenter
Clock Synchronization Protocol
• Offset: Time difference between two clocks
• Precision: The worst case of offset
[Figure: message exchange between Client and Timeserver, marked with timestamps $t_1$, $t_2$, $t_3$, $t_4$]
Clock Synchronization Protocol
• $\mathrm{RTT} = (t_4 - t_1) - (t_3 - t_2)$
• $\mathrm{Offset} = \frac{t_2 + t_3}{2} - \frac{t_1 + t_4}{2}$
• $\mathrm{Offset} = (t_2 - t_1) - \mathrm{RTT}/2$
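The RTT and offset formulas can be checked with a short sketch. It assumes the four timestamps are numbered in order: t1 (client sends), t2 (server receives), t3 (server replies), t4 (client receives). The function names and the example numbers are illustrative only, not taken from the talk.

```python
def rtt(t1, t2, t3, t4):
    """Round-trip time with the server's processing time (t3 - t2) removed."""
    return (t4 - t1) - (t3 - t2)


def offset(t1, t2, t3, t4):
    """Estimated offset of the server clock relative to the client clock."""
    return (t2 - t1) - rtt(t1, t2, t3, t4) / 2


# Hypothetical scenario: the server clock is 100 units ahead of the client.
# Symmetric path (10 units each way, 5 units of server processing):
symmetric = offset(t1=0, t2=110, t3=115, t4=25)

# Asymmetric path (15 units out, 5 units back): same RTT, different estimate.
asymmetric = offset(t1=0, t2=115, t3=120, t4=25)

print(symmetric, asymmetric)
```

With a symmetric path the estimate matches the true offset of 100. With the asymmetric path the estimate is off by half the difference between the two one-way delays, which is one way that uncertainty in the measured RTT turns into clock error.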
Current time protocols do NOT provide bounded precision, due to uncertainty in the measured RTT!
Challenge: RTT is not accurate
• Errors from
  – Oscillator skew
  – Inaccurate timestamping
  – Network stack
  – Network jitter
• PTP
  – Hardware timestamping
  – PTP-enabled switches
  – Filtering/Smoothing
Challenge: Scalability
• Re-synchronization period vs. network overhead
• Limited number of clients
Synchronization Protocols
Protocol  Precision  Scalability  Overhead  Extra Hardware
NTP       us         Good         Moderate  None
PTP       sub-us     Good         Moderate  PTP-enabled devices
GPS       ns         Bad          None      Timing signal receivers, cables
Solution: Use the PHY to synchronize clocks
• Protocol in the PHY
  – Each physical link is already synchronized!
  – No protocol stack overhead
  – No network overhead
  – Scalable: peer-to-peer and decentralized
[Figure: network stack (Application, Transport, Network, Data Link, Physical), with the protocol placed in the Physical layer]
DTP: Datacenter Time Protocol
• Highly scalable with bounded precision!
  – ~25 ns (4 clock ticks) between peers
  – ~150 ns for a datacenter with six hops
  – No network traffic
  – Internal clock synchronization
• End-to-end: ~200 ns precision!
Outline
• Introduction
• Design
• Evaluation
• Discussion
• Conclusion
DTP: Datacenter Time Protocol
• 10G Background
  – Continuous /I/s when there is no packet
  – At least 12 /I/s between two Ethernet frames
  – 1 control block (/E/, 66 bits) = 8 /I/s
  – At least 1 /E/ between any two frames
  – The PHY runs at 156.25 MHz
    • Period is 6.4 ns
[Figure: Packet i, Packet i+1, and Packet i+2 separated by runs of /E/ control blocks]
DTP: Datacenter Time Protocol
• DTP overwrites /E/ to send protocol messages
  – Frequent messaging
  – No overhead to Ethernet (L2)
• Layout of a DTP /E/ block (66 bits)
  – 2-bit sync header
  – 8-bit block type
  – 3-bit DTP message type
  – 53-bit DTP payload
[Figure: the /E/ blocks between Packet i, Packet i+1, and Packet i+2 carry DTP messages]
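The 2 + 8 + 3 + 53 = 66 bit layout can be sketched as plain integer packing. This is only an illustration of the field widths from the slide. The field order inside the integer (most significant field first), the example field values, and the `MSG_BEACON` code are assumptions made here; the real bit ordering on the wire is defined by the PCS and is not modeled.

```python
SYNC_BITS, TYPE_BITS, MSG_BITS, PAYLOAD_BITS = 2, 8, 3, 53
FIELD_WIDTHS = (SYNC_BITS, TYPE_BITS, MSG_BITS, PAYLOAD_BITS)

MSG_BEACON = 0b010  # hypothetical message-type code, for illustration only


def pack_block(sync, block_type, msg_type, payload):
    """Pack the four fields into one 66-bit integer, first field in the top bits."""
    word = 0
    for value, width in zip((sync, block_type, msg_type, payload), FIELD_WIDTHS):
        if not 0 <= value < (1 << width):
            raise ValueError(f"value {value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def unpack_block(word):
    """Inverse of pack_block: returns (sync, block_type, msg_type, payload)."""
    fields = []
    for width in reversed(FIELD_WIDTHS):
        fields.append(word & ((1 << width) - 1))
        word >>= width
    return tuple(reversed(fields))


# Example values are assumptions: 0b10 and 0x1E stand in for a control block.
block = pack_block(sync=0b10, block_type=0x1E, msg_type=MSG_BEACON, payload=123456789)
print(block.bit_length() <= 66, unpack_block(block))
```

Because the payload is 53 bits wide, one block can carry half of the 106-bit counter at a time.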
10GbE Network Stack
8/26/16 SoNIC NSDI 2013
[Figure: 10GbE network stack. Application data gains an L3 header and an L2 header, then an Ethernet header, CRC, preamble, and inter-frame gap. In the Physical layer, the 64/66b PCS turns each 64-bit block plus a 2-bit sync header into a block sequence such as /S/ /D/ /D/ /D/ /D/ /T/ /E/, with idle characters (/I/) in the gap. The transmit path is Encode, Scrambler, Gearbox; the receive path is Block sync, Descrambler, Decode. Below the PCS sit the PMA (16-bit interface) and the PMD, at 10.3125 Gigabits per second.]
DTP
• local counter: 106-bit clock
  – Frequently, synchronize low 53 bits
  – Occasionally, synchronize high 53 bits
• delay: one-way delay to peer
[Figure: DTP sublayer inside the 64/66b PCS, between Encode/Decode and Scrambler/Descrambler. It contains DTP Rx, DTP Tx, and DTP Control, the local counter and delay values, and a Synchronization FIFO between the remote clock domain and the local clock domain.]
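Splitting the 106-bit counter into two 53-bit halves can be sketched as follows. The helper names are hypothetical, and overwriting the low half directly is a simplification for illustration; the slides do not spell out how a received half is merged into the local counter.

```python
HALF_BITS = 53
LOW_MASK = (1 << HALF_BITS) - 1


def split_counter(counter):
    """Return (high, low) 53-bit halves of a 106-bit counter."""
    return counter >> HALF_BITS, counter & LOW_MASK


def join_counter(high, low):
    """Rebuild the 106-bit counter from its two halves."""
    return (high << HALF_BITS) | low


def apply_low_update(counter, new_low):
    """Replace only the low 53 bits, as a frequent message would (simplified)."""
    high, _ = split_counter(counter)
    return join_counter(high, new_low)


counter = join_counter(5, 123)
print(split_counter(counter), split_counter(apply_low_update(counter, 456)))
```

Each half fits the 53-bit payload of one message, which is why the low half can be sent frequently and the high half only occasionally.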
DTP
• Runs in two phases between two peers
  – Init phase: Measuring OWD
  – Beacon phase: Re-synchronization
[Figure: two peers connected at the Physical layer, each holding a local counter and a delay value]
DTP: Init Phase
• $\mathit{delay} = (t_4 - t_1 - \alpha)/2$
  – $\alpha = 3$: Ensures delay is always less than the actual delay
• Introduces 2 clock tick errors
  – Due to oscillator skew, timing, and the Sync FIFO
[Figure: init message exchange between two peers, marked with timestamps $t_1$, $t_2$, $t_3$, $t_4$]
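A minimal sketch of the init-phase delay computation, assuming t1 is the local counter when the init message leaves and t4 is the local counter when the reply returns (our reading of the figure). Integer floor division is an assumption made here because the counters are whole clock ticks; the tick values are made up.

```python
ALPHA = 3  # correction term from the slide, in clock ticks


def one_way_delay(t_send, t_recv, alpha=ALPHA):
    """Estimate the one-way delay in clock ticks from a round trip.

    Subtracting alpha before halving biases the estimate downward, which
    matches the stated goal of never exceeding the actual delay.
    """
    return (t_recv - t_send - alpha) // 2


# Hypothetical round trip of 43 ticks measured on the local counter.
print(one_way_delay(t_send=1000, t_recv=1043))  # prints 20
```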
DTP: Beacon Phase
• local = max(local, remote + delay)
• Frequent messages
  – Every 1.2 us (roughly 200 clock ticks) with MTU packets
  – Every 7.2 us (roughly 1200 clock ticks) with Jumbo packets
• Introduces 2 clock tick errors
  – Total 4 clock tick errors
[Figure: beacon message sent at $t_1$ and received by the peer at $t_2$]
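The beacon rule can be exercised with a toy simulation. Everything except the `max` rule itself is an assumption chosen for illustration: a "fast" counter that gains one extra tick every 1,000 steps (a much larger skew than a real oscillator would have), a fixed 10-tick one-way delay, and beacons flowing in one direction only.

```python
def on_beacon(local, remote, delay):
    """Beacon-phase update: the counter only ever moves forward."""
    return max(local, remote + delay)


def simulate(steps=10_000, beacon_every=200, delay=10, extra_tick_every=1_000):
    """Return the worst gap (in ticks) between a fast and a slow counter."""
    fast = slow = 0
    in_flight = []  # (arrival_step, counter_value) for beacons fast -> slow
    worst = 0
    for step in range(1, steps + 1):
        fast += 2 if step % extra_tick_every == 0 else 1
        slow += 1
        if step % beacon_every == 0:
            in_flight.append((step + delay, fast))
        while in_flight and in_flight[0][0] == step:
            _, remote = in_flight.pop(0)
            slow = on_beacon(slow, remote, delay)
        worst = max(worst, fast - slow)
    return worst


print(simulate())                      # frequent beacons keep the gap small
print(simulate(beacon_every=10**9))    # no beacons: the gap keeps growing
```

In this toy setup the gap never exceeds one tick with beacons every 200 ticks, while without beacons it grows by one tick per 1,000 steps. The real protocol's bound is the 4 clock ticks stated on the slide.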
DTP Switch
• global = max(local counters)
• Propagates global via Beacon messages
[Figure: a switch with several ports, each holding its own local counter and delay value; the maximum of the local counters becomes the global counter]
DTP Daemon
• End-to-end precision
• Access the DTP counter via PCIe
• Estimate DTP time using the invariant TSC counter
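One way a daemon could estimate the DTP counter between PCIe reads is to extrapolate linearly from two earlier (TSC, DTP) samples. This is a sketch under that assumption, not a description of the actual daemon; the function name and sample values are hypothetical.

```python
def estimate_dtp(tsc_now, sample_a, sample_b):
    """Extrapolate the DTP counter at tsc_now from two (tsc, dtp) samples.

    The slope between the samples approximates how many DTP ticks elapse
    per TSC cycle; it is applied to the cycles elapsed since sample_b.
    """
    (tsc_a, dtp_a), (tsc_b, dtp_b) = sample_a, sample_b
    ticks_per_cycle = (dtp_b - dtp_a) / (tsc_b - tsc_a)
    return dtp_b + (tsc_now - tsc_b) * ticks_per_cycle


# Made-up readings: 100 DTP ticks elapsed over 300,000 TSC cycles.
estimate = estimate_dtp(1_450_000, (1_000_000, 5_000), (1_300_000, 5_100))
print(round(estimate))  # prints 5150
```

Reading the TSC is much cheaper than a PCIe access, so the counter only needs to be fetched from the NIC occasionally to refresh the samples.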
DTP Property
• Bounded precision in hardware
  – Bounded by 4T (= 25.6 ns; T, the oscillator tick, is 6.4 ns)
  – Network precision bounded by 4TD
    • D is the network diameter in hops
• Requires NIC and switch modifications
  – PTP also requires PTP-enabled devices
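The 4T and 4TD bounds are simple to evaluate. The sketch below uses exact fractions to avoid floating-point noise; the function names are illustrative, and the six-hop diameter is the example used elsewhere in the talk.

```python
from fractions import Fraction

TICK_NS = Fraction("6.4")  # T: oscillator tick of a 156.25 MHz PHY clock


def peer_bound_ns():
    """Precision bound between two directly connected peers: 4T."""
    return 4 * TICK_NS


def network_bound_ns(diameter_hops):
    """Precision bound across a network of diameter D hops: 4TD."""
    return 4 * TICK_NS * diameter_hops


print(float(peer_bound_ns()))      # 25.6
print(float(network_bound_ns(6)))  # 153.6
```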
DTP vs PTP
                 PTP                                             DTP
Oscillator Skew
Timestamping     HW timestamping                                 PHY timestamping
Network Stack    Not involved                                    Not involved
Network Jitter   Transparent Clock, Boundary Clock               No jitter
Precision        Unbounded; tens to hundreds of ns (when idle)   Bounded

DTP: Topics discussed in paper
• Handling failure
• Different standards: 1GbE, 25GbE, 40GbE, 100GbE, etc.
• External synchronization (i.e., synchronizing to true time)
• Incremental deployment
Handling failure
• Bit errors
  – Ignores bit errors in MSBs
  – Appends a checksum for the low LSBs
• Faulty devices
  – Flagged when there are too many jumps outside the bound
Different Standards
Data Rate  Encoding  Data Width  Frequency    Period   Δ
1 GbE      8b/10b    8 bit       125 MHz      8 ns     25
10 GbE     64b/66b   32 bit      156.25 MHz   6.4 ns   20
40 GbE     64b/66b   64 bit      625 MHz      1.6 ns   5
100 GbE    64b/66b   64 bit      1562.5 MHz   0.64 ns  2
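The Period and Δ columns follow from the clock frequency. In the sketch below, the 0.32 ns counter unit is inferred from the table (it is the value for which period divided by Δ is the same in every row) rather than stated on the slide; the function names are illustrative.

```python
from fractions import Fraction

# Clock frequency in MHz for each standard, copied from the table.
FREQ_MHZ = {"1GbE": "125", "10GbE": "156.25", "40GbE": "625", "100GbE": "1562.5"}

UNIT_NS = Fraction("0.32")  # inferred common counter unit (assumption)


def period_ns(freq_mhz):
    """Clock period in nanoseconds for a frequency given in MHz."""
    return Fraction(1000) / Fraction(freq_mhz)


def increment(freq_mhz):
    """Counter increment per clock tick so that all rates share one unit."""
    return period_ns(freq_mhz) / UNIT_NS


for name, freq in FREQ_MHZ.items():
    print(name, float(period_ns(freq)), int(increment(freq)))
```

Incrementing by Δ per tick lets counters running at different link speeds advance at the same rate in time.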
External Synchronization
• A master server
  – Connected to a reference time
  – Broadcasts the mapping between DTP and wall time
• Client servers
  – Interpolate time using DTP counters
Incremental Deployment
• Updates per rack
  – DTP-enabled switch
  – DTP-enabled NICs
  – One server acting as a master for wall time
• Synchronizing racks
  – DTP-enabled switch
  – DTP beacon-join message for synchronizing DTP counters
  – Select a new master
Evaluation
• DTP prototype
  – Terasic DE5 board with Altera Stratix V
  – Using Bluespec and Connectal framework
Evaluation: DTP Topology
• Measured offsets between peers
[Figure: tree topology of DTP NICs with root S0, intermediate nodes S1 to S3, and leaves S4 to S11]
Evaluation: Logger
• Offset between peers: $t_3 - t_2 - \mathrm{OWD}$
• Offset between SW and HW: $t_2 - t_1$
[Figure: two hosts, each running a DTP Daemon above the Physical layer; $t_1$, $t_2$, and $t_3$ are recorded along the path and logged together]
Evaluation: DTP Topology
• Offset = $\mathit{dtp}_i - \mathit{dtp}_j$, the difference between the DTP counters of two nodes
[Figure: the same tree topology of DTP NICs, S0 to S11]
Evaluation: PTP Topology
[Figure: the same tree topology (S0; S1 to S3; S4 to S11) built from a Timeserver, PTP switches, and PTP NICs]
PTP: Idle Network (No traffic)
• Tens to hundreds of nanoseconds precision
[Figure: offset (nanoseconds, -600 to 600) vs. time (min)]
PTP: Medium Loaded (4 Gbps)
• Tens of microseconds precision
[Figure: offset (microseconds, -50 to 50) vs. time (min)]
PTP: Heavily Loaded (9 Gbps)
• Tens to hundreds of microseconds precision
[Figure: offset (microseconds, -150 to 150) vs. time (min)]
DTP: Heavily Loaded
• Always within 25.6 ns (4 clock ticks) between peers
[Figure: offset between peers vs. time (0 to 6 min), with y-axes in nanoseconds (-32 to 32) and clock ticks (-5 to 5); series S1-S4, S1-S5, S1-S0, S2-S7, S2-S8, S2-S0, S3-S10, S3-S11, S3-S0]
DTP Daemon
DTP Daemon (after smoothing)
• Usually can access the counter with 25.6 ns precision
[Figure: offset vs. time (min), with y-axes in clock ticks (-20 to 20) and nanoseconds (-128 to 128)]
Next Steps
• Integration with OSNT (Open Source Network Tester)
  – NetFPGA SUME board with Xilinx Virtex-7
Some Related Work
• Synchronous Ethernet (SyncE)
  – Synchronizes the frequency of clocks
  – DTP and PTP synchronize the time of clocks
• White Rabbit: PTP + SyncE
  – Sub-nanosecond precision
  – 1GbE only so far
• Commercial PTP + SyncE
  – Tens to hundreds of nanoseconds
Conclusion
• DTP provides bounded precision and scalability
  – Bounded precision: 4 clock ticks (25.6 ns) between peers
  – Scalability: 153.6 ns for a datacenter with six hops
  – Free: No network traffic
  – Applications: Usually within 25.6 ns (without bounds)
  – End-to-end: 153.6 + 25.6 × 2 = 204.8 ns (~200 ns)!
Questions?
• http://github.com/hanw/sonic-lite
• http://sonic.cs.cornell.edu
• Email: [email protected]
• Come to the Poster session tomorrow!