Thread Clustering: Sharing-Aware Scheduling on SMP-CMP-SMT Multiprocessors
Department of Electrical and Computer Engineering, University of Toronto, Canada
Presenter: Hwan-jin Yong
EuroSys '07
David Tam, Reza Azimi, Michael Stumm
Outline
• Introduction: OpenPower 720 Architecture
• Motivation
• Performance Monitoring Unit
• Design of Thread Clustering Scheme
• Evaluation
• Contribution
• Summary
OpenPower 720 Server
• Design: performance, scalability, reliability, etc.
• Power5 processors (SMP-CMP-SMT multiprocessor)
  • Multi-core architecture (chip multiprocessing, CMP) designed for high throughput
  • Shared-memory multiprocessors (SMP)
  • Simultaneous multithreading (SMT)
[Figure: IBM OpenPower 720]
Simultaneous Multithreading (SMT)
• Multiple independent threads execute simultaneously on the same core
• Increases core efficiency
• Example
  • Single thread: the processor pipeline stalls while waiting for data to arrive from memory
  • With SMT: if one thread is waiting for a floating-point operation to complete, another thread can use the integer units
• Power5
  • With SMT, 2 virtual processors per real processor
[Figure: Power5 Layout]
Existing OSes don't handle the complexity of multicore processors
Motivation
• The poor performance is … what ….
• Solution? Power5 (8 logical processors)
  • Increase cache size ... Money!
  • Add more processors … Power!
  • Hire more chip architecture engineers ..??
Overview of Thread Clustering Scheme
• Design of Thread Clustering Scheme
  1. Monitoring Stall Breakdown
  2. Detecting Sharing Patterns
  3. Thread Clustering
  4. Thread Migration
Step 1: Monitoring Stall Breakdown
• PMU (Performance Monitoring Unit)
  • Counts various hardware events that occur in the processor
• Introduction to the PMU on Cortex-R
  • Only three event registers can be selected
  • Overflow handling is required
  • Difficult to extract high-level insight from raw counts
<Cortex-R (up to 40 events), Reference Manual>
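[Figure: the PMU is switched on and off around FuncA, FuncB, and FuncC. At "PMU On" the counters read Data Cache Miss: 0, Branch Mis-Prediction: 0, Instruction Count: 0, Clock Count: 0; at "PMU Off" they read Data Cache Miss: A, Branch Mis-Prediction: B, Instruction Count: C, Clock Count: D.]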
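The on/off measurement in the figure can be sketched as taking two counter snapshots and working with their difference. The sketch below is only an illustration: `PmuSnapshot`, `delta`, and `summarize` are hypothetical names, the four fields mirror the four counters shown on the slide, and no real PMU driver API is used.

```python
from dataclasses import dataclass


@dataclass
class PmuSnapshot:
    """Hypothetical snapshot of the four counters named on the slide."""
    dcache_miss: int
    branch_mispredict: int
    instructions: int
    cycles: int


def delta(start: PmuSnapshot, end: PmuSnapshot) -> PmuSnapshot:
    """Events that occurred between 'PMU On' (start) and 'PMU Off' (end)."""
    return PmuSnapshot(
        end.dcache_miss - start.dcache_miss,
        end.branch_mispredict - start.branch_mispredict,
        end.instructions - start.instructions,
        end.cycles - start.cycles,
    )


def summarize(d: PmuSnapshot) -> dict:
    """Turn raw counts into rates that are easier to compare across runs."""
    if d.instructions == 0:
        raise ValueError("no instructions retired in the measured interval")
    return {
        "cpi": d.cycles / d.instructions,
        "dcache_mpki": 1000 * d.dcache_miss / d.instructions,
        "branch_mpki": 1000 * d.branch_mispredict / d.instructions,
    }


# Illustrative values only; they stand in for A, B, C, D from the figure.
on = PmuSnapshot(0, 0, 0, 0)
off = PmuSnapshot(dcache_miss=50, branch_mispredict=20,
                  instructions=10_000, cycles=25_000)
stats = summarize(delta(on, off))
```

Normalizing by instruction count is one way to get from raw event counts toward the "high-level insight" the slide says is hard to extract.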
Step 2: Detect Sharing Patterns
• Constructing shMaps
  • Build an shMap (a summary data structure) per thread that counts remote cache accesses with 8-bit counters
  • Mark a region's index in the shMap vector when a cache miss is satisfied by a remote cache access
  • Region size: 128 bytes (equal to the cache line size)
  • But how can the entire virtual address space be encoded with only 128 region entries?
  • Use a simple hashing function (region = address % 128?)
[Figure: shMap vector for Thread A with entries 0, 1, ..., 127, mapped from the Power5 64-bit address space (0 to 2^64).]
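A minimal sketch of the shMap update described above, under stated assumptions. The slide only guesses at the hash (`address % 128?`); here the address is first divided by the region size so that every byte of one 128-byte region lands in the same entry, which is an assumption of this sketch and not confirmed by the slides. `ShMap` and `region_index` are hypothetical names.

```python
REGION_SIZE = 128   # bytes per region, as stated on the slide
NUM_ENTRIES = 128   # shMap entries per thread, as drawn on the slide
COUNTER_MAX = 255   # largest value an 8-bit counter can hold


def region_index(address: int) -> int:
    """Assumed hash: region number modulo the number of shMap entries."""
    return (address // REGION_SIZE) % NUM_ENTRIES


class ShMap:
    """Per-thread vector of saturating 8-bit counters (illustrative)."""

    def __init__(self) -> None:
        self.counters = [0] * NUM_ENTRIES

    def record_remote_access(self, address: int) -> int:
        """Count one remote cache access; return the entry that was updated."""
        i = region_index(address)
        if self.counters[i] < COUNTER_MAX:   # saturate instead of wrapping
            self.counters[i] += 1
        return i


thread_a = ShMap()
for _ in range(300):                 # more accesses than the counter can hold
    thread_a.record_remote_access(0x1000)
```

Because 2^64 addresses fold onto 128 entries, unrelated regions can collide in one entry; the sampling techniques on the next slide are aimed at that noise.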
Step 2: Detect Sharing Patterns
• For low overhead and reduced noise (false reports) in Step 2
  • Temporal sampling: do not record and process every remote cache access; record only one in every N occurrences
  • Spatial sampling: reduces hash collisions (false reports) and keeps the memory footprint of the shMap vector small
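[Figure: two versions of Thread A's path through HashFunc into entry 2 of its shMap vector (entries 0, 1, ..., 127). Without the shMap filter the vector is labeled 6; with the shMap filter it is labeled 2. Filter rule: first come, gets the ticket!]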
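The two sampling ideas can be sketched as follows. This is an illustration under assumptions, not the presented implementation: `TemporalSampler` passes one event in every N, and `ShMapFilter` models the "first come, gets the ticket" rule by letting the first region that hashes to a filter entry own it, so later regions colliding on that entry are ignored. The class names and the ownership policy details are hypothetical.

```python
from typing import List, Optional


class TemporalSampler:
    """Pass through only one out of every n remote-access events."""

    def __init__(self, n: int) -> None:
        self.n = n
        self.seen = 0

    def should_record(self) -> bool:
        self.seen += 1
        if self.seen == self.n:
            self.seen = 0
            return True
        return False


class ShMapFilter:
    """Assumed first-come filter: each entry is owned by one region."""

    def __init__(self, num_entries: int = 128, region_size: int = 128) -> None:
        self.num_entries = num_entries
        self.region_size = region_size
        self.owner: List[Optional[int]] = [None] * num_entries

    def admit(self, address: int) -> Optional[int]:
        """Return the entry index if this region owns it, else None."""
        region = address // self.region_size
        i = region % self.num_entries
        if self.owner[i] is None:        # first region to arrive claims it
            self.owner[i] = region
        return i if self.owner[i] == region else None


sampler = TemporalSampler(n=3)
decisions = [sampler.should_record() for _ in range(6)]

flt = ShMapFilter()
first = flt.admit(0)              # region 0 claims entry 0
collide = flt.admit(128 * 128)    # region 128 also hashes to entry 0
again = flt.admit(64)             # still region 0, so it is admitted
```

Only admitted accesses would go on to update a thread's shMap vector, so colliding regions no longer inflate the same counter.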
Step 3: Thread Clustering
• Define the similarity of two shMap vectors
[Figure: shMap vector (Thread A) with entries 256, 200, 130; shMap vector (Thread B) with entries 256, 150.]
The similarity value is high when two threads share data (here, Thread A and Thread B).
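The slide does not spell out the similarity formula. One simple choice, assumed here, is the dot product of the two shMap vectors: it is large only when both threads have high counts in the same entries. The clustering loop is likewise a hypothetical sketch that compares each thread with the first member of every existing cluster; `similarity`, `cluster_threads`, the threshold, and the vector values are all illustrative.

```python
from typing import Dict, List


def similarity(a: List[int], b: List[int]) -> int:
    """Assumed metric: dot product of two shMap vectors of equal length."""
    return sum(x * y for x, y in zip(a, b))


def cluster_threads(shmaps: Dict[str, List[int]], threshold: int) -> List[List[str]]:
    """Greedy clustering: join the first cluster whose representative
    (its first member) is similar enough, otherwise start a new cluster."""
    clusters: List[List[str]] = []
    for tid, vec in shmaps.items():
        for members in clusters:
            if similarity(vec, shmaps[members[0]]) >= threshold:
                members.append(tid)
                break
        else:
            clusters.append([tid])
    return clusters


# Shortened 4-entry vectors with made-up counts, for illustration only.
shmaps = {
    "A": [255, 200, 0, 130],
    "B": [255, 0, 0, 150],
    "C": [0, 0, 90, 0],
}
clusters = cluster_threads(shmaps, threshold=10_000)
```

Threads A and B overlap in the first and last entries, so they end up in one cluster, while C shares no entries with either and forms its own.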
Evaluation (1/2)
• A continuous vertical dark line indicates clustered threads
Evaluation (2/2)
• Performance improvement of up to 7%
• Reduced stalls due to remote cache accesses
Related Work: is it being used nowadays?
• More to be added (in preparation)
Summary
• Proposes a new thread scheduling scheme
  • Uses run-time information from hardware performance counters
• Detects sharing patterns among different threads
• Finds the best placement for each thread so that remote cache accesses are avoided
• The OS job scheduler re-assigns threads that share data to the same chip domain (memory domain) with low overhead
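The re-assignment step can be pictured as mapping each cluster of sharing threads to one chip. The sketch below uses a greedy policy (largest cluster first, onto the least-loaded chip); that policy, the function name `place_clusters`, and the thread labels are assumptions made for illustration, and chip capacity limits are ignored.

```python
from typing import Dict, List


def place_clusters(clusters: List[List[str]], num_chips: int) -> Dict[str, int]:
    """Assign every thread of a cluster to the same chip (assumed greedy policy)."""
    load = [0] * num_chips                 # threads currently placed per chip
    placement: Dict[str, int] = {}
    for members in sorted(clusters, key=len, reverse=True):
        chip = load.index(min(load))       # least-loaded chip so far
        for tid in members:
            placement[tid] = chip
        load[chip] += len(members)
    return placement


placement = place_clusters([["A", "B", "C"], ["D"], ["E", "F"]], num_chips=2)
```

Keeping a cluster on one chip means its shared data can stay in that chip's cache, which is the effect the summary describes.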
Thank you!
Questions?