
Precise and Accurate Processor Simulation, by Harold Cain, Kevin Lepak, Brandon Schwartz, and Mikko H. Lipasti (University of Wisconsin-Madison)



  • Precise and Accurate Processor Simulation
    Harold Cain, Kevin Lepak, Brandon Schwartz, and Mikko H. Lipasti
    University of Wisconsin-Madison
    http://www.ece.wisc.edu/~pharm

  • Architecture Research?
    "Genius is one percent inspiration and ninety-nine percent perspiration." --Thomas Edison
    This is not a talk about inspiration
    - No new ideas or gimmicks
    This is a talk about perspiration
    - Mostly graduate student perspiration
    - Infrastructure, tools, methodology
    [CAECW Talk/Paper, February 2002]

  • Performance Modeling
    Analytical models
    Queuing models
    Simulation
    - Trace-driven
    - Execution-driven
    - Full system
    Why?
    - Most widely used in academic research
    - Perceived accuracy and precision

  • Performance Modeling
    [Diagram: Inputs -> Performance Model -> Outputs]
    Outputs
    - Performance results
    - Execution characteristics
    - Bottlenecks
    - Etc.
    Accuracy? Precision? "Garbage in, garbage out"
    Inputs
    - Program characteristics: instruction mix, miss rates
    - Execution trace: instruction, address, both
    - Program binary + input sets
    - Operating system, program(s), etc.

  • Talk Outline
    Introduction & Motivation
    Performance Modeling
    - Precision, Accuracy, Flexibility
    PharmSim Overview
    Causes of Inaccuracy
    - O/S Effects
    - Coherent I/O (DMA)
    - Wrong-path Effects
    Summary & Conclusions

  • Precision, Accuracy, Flexibility
    Precision
    - How closely simulator matches design
    - Latency, bandwidth, resource occupancy, etc.
    Accuracy
    - How closely simulation matches reality
    - Requires precision
    - Also requires replication of real-world conditions, inputs
    Flexibility?
    - Enables exploration of broad design space

  • Uses for Simulation
    Verification
    Academic Research
    Accuracy???

  • Causes of Inaccuracy
    Many possible causes
    - Software differences
    - Hardware differences
    - System effects
    - Time dilation: interaction with physical world
    Here, we consider:
    - Operating system code
    - DMA traffic
    - Wrong-path effects

  • Validating Accuracy
    How do we validate?
    Against real hardware with perf. counters
    - Different input, since O/S is now present
    - Also, post-mortem: too late
    Against HDL
    - Same input as model, same error?
    Without full system simulation, we cannot:
    - Replicate runtime environment
    - Really validate accuracy
    Compensating errors mask inaccuracy
    Hence: build a simulator that does not cheat

  • PharmSim Overview
    Device simulation, etc. from SimOS-PPC
    PharmSim replaces functional simulators
    - Full OOO core model, values in rename registers
    - Based on SimpleMP [Rajwar]
    - Adds priv. mode, MMU, TLB, exceptions, interrupts, barriers, flushes, etc.

  • PharmSim Pipeline
    Substantially similar to IBM Power4
    - Some instructions cracked (1:2 expansion)
    - Others (e.g. lmw) microcode stream
    Mem Stage
    - Interface to 2-level cache model
    - Sun Gigaplane XB snoopy MP coherence
    - Caches contain values, must remain coherent
    No cheating!
    - No flat memory model for reference/redirect


  • Operating System Effects
    Fairly well-understood for commercial workloads:
    - Must account for O/S references
    For SPEC? Widely accepted:
    - Safe to ignore O/S paths
    Most popular tool (Simplescalar)
    - Intercepts system calls
    - Emulates on host, updates flat memory
    - Returns magically with caches intact
    Is this really OK?
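    A minimal sketch of why "returns magically with caches intact" can matter, assuming a toy direct-mapped cache and a hypothetical reference trace. The class, function, addresses, and cache geometry below are illustrative assumptions, not code from Simplescalar or PharmSim. It counts user-mode misses with and without the O/S references being allowed to disturb the cache.

```python
from dataclasses import dataclass, field


@dataclass
class DirectMappedCache:
    """Toy direct-mapped cache that only tracks tags and a miss count."""
    num_lines: int = 4
    line_bytes: int = 32
    tags: dict = field(default_factory=dict)
    misses: int = 0

    def access(self, addr: int) -> None:
        block = addr // self.line_bytes
        index = block % self.num_lines
        if self.tags.get(index) != block:
            self.misses += 1
            self.tags[index] = block


def user_misses(trace, model_os: bool) -> int:
    """Count misses on user-mode references only.

    trace is a list of (mode, address) pairs with mode "user" or "os".
    With model_os=False, O/S references are skipped, which mimics a
    system call emulated on the host that never touches the cache.
    """
    cache = DirectMappedCache()
    count = 0
    for mode, addr in trace:
        if mode == "os" and not model_os:
            continue
        before = cache.misses
        cache.access(addr)
        if mode == "user":
            count += cache.misses - before
    return count


# Hypothetical trace: a user loop re-reads one line, and a system call
# between iterations touches a line that maps to the same cache index.
trace = [("user", 0x000), ("os", 0x080),
         ("user", 0x000), ("os", 0x080),
         ("user", 0x000)]

print(user_misses(trace, model_os=False))  # O/S references ignored
print(user_misses(trace, model_os=True))   # O/S references modeled
```

    In this contrived trace the user code sees one miss when the O/S is ignored and three when it is modeled, because every system call evicts the line the user loop depends on. Real workloads are far less extreme; the point is only that the two modes simulate different cache contents.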

  • Operating System Effects

    References Modeled                        Example
    User-mode only                            Atom
    User + Shared library                     Simplescalar with static link
    User + Sh Lib + O/S                       H/W bus trace
    User + Sh Lib + O/S + cache control ops   PharmSim

  • Operating System Effects
    Dramatic error (5.8x in mcf, 2-3x commonplace)
    Note compensating errors (e.g. crafty, gzip, perl)
    IPC error > 100% (more detail at ISCA)


  • Coherent I/O with DMA

  • DMA Traffic
    How do we support DMA?
    - No flat memory image in simulator
    - Lines may be in caches: invalidate (if DMA write), flush (if DMA read)
    - Must use existing coherence protocol
    - Everything has to work correctly: no subtle coherence bugs
    How much does this matter?
    - Affects cache miss rates
    - Introduces bus contention
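    The invalidate-on-write and flush-on-read rule can be sketched as follows. This is a simplified illustration under stated assumptions: word-granularity lines, a dictionary standing in for main memory, and hypothetical class and method names (`WriteBackCache`, `DmaEngine`, `snoop_invalidate`, `snoop_flush`). It is not PharmSim's coherence protocol.

```python
class WriteBackCache:
    """Toy write-back cache that holds values, one word per line."""

    def __init__(self, memory: dict):
        self.memory = memory
        self.lines = {}  # addr -> [value, dirty]

    def load(self, addr: int) -> int:
        if addr not in self.lines:  # miss: fill from memory
            self.lines[addr] = [self.memory[addr], False]
        return self.lines[addr][0]

    def store(self, addr: int, value: int) -> None:
        self.lines[addr] = [value, True]  # memory is now stale

    def snoop_invalidate(self, addr: int) -> None:
        # The DMA write overwrites the whole (one-word) line, so any
        # cached copy, dirty or clean, can simply be dropped.
        self.lines.pop(addr, None)

    def snoop_flush(self, addr: int) -> None:
        line = self.lines.get(addr)
        if line is not None and line[1]:
            self.memory[addr] = line[0]  # write dirty data back
            line[1] = False


class DmaEngine:
    """Device-side agent that snoops the caches before touching memory."""

    def __init__(self, memory: dict, caches: list):
        self.memory = memory
        self.caches = caches

    def write(self, addr: int, value: int) -> None:
        for cache in self.caches:
            cache.snoop_invalidate(addr)
        self.memory[addr] = value

    def read(self, addr: int) -> int:
        for cache in self.caches:
            cache.snoop_flush(addr)
        return self.memory[addr]


memory = {0x100: 1, 0x200: 2}
cache = WriteBackCache(memory)
dma = DmaEngine(memory, [cache])

cache.store(0x100, 42)   # dirty in cache, memory still holds 1
print(dma.read(0x100))   # flush first, so the device sees 42

cache.load(0x200)        # cache now holds the old value 2
dma.write(0x200, 7)      # invalidate, then update memory
print(cache.load(0x200)) # the next load misses and sees 7
```

    Without the snoops, the DMA read would return the stale 1 and the processor would keep reading the stale 2, which is the class of subtle coherence bug the slide warns about. The extra misses after a DMA write are also why DMA affects cache miss rates.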

  • DMA Traffic
    PharmSim incorporates accurate DMA engine:
    - Issues bus invalidates, snoops
    - Concurrent data transfer: no magic flat memory
    Bottom line:
    - Unimportant for SPECINT
    - Unimportant for SPECWEB, SPECJBB
    - Others in progress
    Contrived multiprogrammed workload
    - 4.8% of all coherence traffic due to I/O, 1% IPC effect
    - Results understated due to overbuilt MP bus
    MP workloads likely much more sensitive
    - Additional evaluation in progress


  • Wrong-Path Execution
    Branch predictor predicts control flow
    Branch execute redirects mispredictions
    Extra instructions on wrong path

  • Wrong-path Execution
    Multiple effects on unarchitected state
    - Pollute/prefetch I-cache, D-cache, TLB
    - Pollute/train branch predictor (BHR, PHT, RAS)
    PharmSim (current status):
    - BHR is updated and repaired
    - PHT is not updated speculatively
    - RAS is updated, no repair
    - No speculative TLB fill
    How can we filter wrong-path instructions?
    - No cheating: we don't know branch outcomes
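    One common way to get "updated and repaired" behavior for a branch history register is to checkpoint it at each prediction and restore the checkpoint on a misprediction. The slides do not say how PharmSim implements this, so the sketch below is an assumption, with hypothetical class and method names.

```python
class BranchHistoryRegister:
    """Global history shift register with checkpoint-based repair."""

    def __init__(self, bits: int = 8):
        self.mask = (1 << bits) - 1
        self.value = 0

    def speculative_update(self, predicted_taken: bool) -> int:
        """Shift in the prediction; return the pre-update history."""
        checkpoint = self.value
        self.value = ((self.value << 1) | int(predicted_taken)) & self.mask
        return checkpoint

    def repair(self, checkpoint: int, actual_taken: bool) -> None:
        """Discard wrong-path history and shift in the real outcome."""
        self.value = ((checkpoint << 1) | int(actual_taken)) & self.mask


bhr = BranchHistoryRegister(bits=4)
bhr.speculative_update(True)         # branch 1 predicted taken
cp2 = bhr.speculative_update(True)   # branch 2 predicted taken
bhr.speculative_update(False)        # branch 3 is fetched on the wrong path

# Branch 2 resolves as not taken: drop the history contributed by
# branch 2's wrong prediction and by the wrong-path branch 3.
bhr.repair(cp2, actual_taken=False)
print(bin(bhr.value))
```

    A return address stack that is "updated, no repair" would skip the restore step, so wrong-path calls and returns would leave it corrupted until it refills.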

  • Eliminating Wrong-Path
    [Diagram: Runahead PharmSim -> Branch Outcome Trace -> Right-path PharmSim]
    On correctly predicted branch
    - continue fetching (BAU)
    On mispredicted branch
    - stall instruction fetch
    - restart once branch resolves
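    A minimal sketch of the two fetch policies, assuming a hypothetical front end that fetches a fixed number of instructions per cycle and resolves every branch after a fixed delay. All names and parameters are illustrative. With the branch outcome trace available, the right-path run knows at fetch time that a prediction is wrong and stalls instead of fetching down the wrong path.

```python
def simulate_fetch(branches, block_len, fetch_width, resolve_cycles,
                   filter_wrong_path):
    """Count right-path and wrong-path instructions fetched.

    branches: list of (predicted_taken, actual_taken) pairs, one per
        basic block; actual_taken plays the role of the outcome trace.
    block_len: right-path instructions per basic block.
    fetch_width: instructions fetched per cycle.
    resolve_cycles: cycles from fetching a branch to resolving it.
    filter_wrong_path: True models the right-path run (stall on a
        misprediction); False models normal wrong-path fetch.
    """
    right = wrong = stall_cycles = 0
    for predicted, actual in branches:
        right += block_len
        if predicted == actual:
            continue                      # business as usual
        if filter_wrong_path:
            stall_cycles += resolve_cycles  # fetch nothing until resolved
        else:
            wrong += fetch_width * resolve_cycles
    return {"right_path": right, "wrong_path": wrong,
            "stall_cycles": stall_cycles}


# Hypothetical branch stream: two of four predictions are wrong.
branches = [(True, True), (True, False), (False, False), (False, True)]
print(simulate_fetch(branches, 5, 4, 3, filter_wrong_path=False))
print(simulate_fetch(branches, 5, 4, 3, filter_wrong_path=True))
```

    Both runs spend the same number of cycles waiting for mispredicted branches to resolve. The only difference is whether wrong-path instructions occupy the machine during that time, which isolates their effect on caches and predictors.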

  • Wrong-path Instructions
    Aggressive core model; 25%-40% wrong-path

    [Chart: % executed inst on wrong path, by benchmark]

    Spreadsheet data behind the charts (EXE = execution-driven, TRACE = trace-driven). Icache and Dcache columns are misses per 1000 insn; ucache columns are misses per 10,000 insn. The spreadsheet also lists each miss rate per instruction, which is the same value divided by 1000 (10,000 for ucache).

    Benchmark  EXE IPC  TRACE IPC  EXE Icache  TRACE Icache  EXE Dcache  TRACE Dcache  EXE ucache  TRACE ucache
    crafty     2.3917   2.4468     5.02423     4.38552       0.7376      0.68297       1.1559      1.0924
    gap        1.3444   1.3523     2.71794     2.63071       1.80715     1.79236       13.6453     13.5532
    gcc        2.0321   2.0254     6.5782      5.56356       1.54523     1.4972        1.6838      1.5907
    gzip       2.1719   2.2281     0.6007      0.57419       1.2616      1.14089       2.6858      2.612
    mcf        1.051    1.0469     4.81434     4.88691       2.46834     2.37995       8.5282      8.3668
    parser     2.335    2.3607     0.65628     0.62152       1.90745     1.70305       1.5414      1.4735
    perlbmk    1.3155   1.3212     8.06908     7.29488       4.47641     4.37086       11.0476     10.6549
    vortex     2.3699   2.3597     7.65591     7.53429       0.78306     0.76574       3.1587      3.0822
    specjbb    1.1035   1.0932     24.76913    24.32348      6.2581      5.85574       19.3472     18.2088
    specweb    1.119    1.1252     7.25809     6.55204       3.5088      3.39716       22.2713     21.2152

    RAS columns are RAS prediction rates; Executed and Committed are the "exe PISA instr executed" and "exe PISA instr committed" counts.

    Benchmark  EXE RAS  TRACE RAS  Executed   Committed  % committed  % wrong path
    crafty     70.43%   99.25%     502428148  302991728  60.31%       39.69%
    gap        70.69%   84.28%     509759809  322406214  63.25%       36.75%
    gcc        78.03%   97.88%     558356490  339035258  60.72%       39.28%
    gzip       84.51%   99.40%     510187613  308536413  60.48%       39.52%
    mcf        79.00%   96.73%     474407750  346644885  73.07%       26.93%
    parser     93.29%   99.44%     592889792  354295886  59.76%       40.24%
    perlbmk    72.14%   90.39%     520880953  362076271  69.51%       30.49%
    vortex     90.73%   96.45%     393500775  297360962  75.57%       24.43%
    specjbb    77.87%   89.01%     474522056  332410753  70.05%       29.95%
    specweb    69.81%   87.02%     412208353  289216947  70.16%       29.84%

    [Charts, by benchmark, exe-driven vs. trace-driven: IPC; icache misses per 1000 insn; dcache misses per 1000 insn; misses per 10,000 insn; RAS prediction accuracy]
    [Charts, by benchmark: % executed instr committed; % executed inst on wrong path]
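    The wrong-path percentages follow directly from the executed and committed instruction counts in the spreadsheet: every executed instruction that never commits was on a wrong path. A short check using three of the rows (the function name is illustrative):

```python
def wrong_path_fraction(executed: int, committed: int) -> float:
    """Fraction of executed instructions that were never committed."""
    return 1.0 - committed / executed


# (executed, committed) instruction counts copied from the spreadsheet rows.
counts = {
    "crafty": (502428148, 302991728),
    "mcf": (474407750, 346644885),
    "vortex": (393500775, 297360962),
}

for name, (executed, committed) in counts.items():
    pct = 100.0 * wrong_path_fraction(executed, committed)
    print(f"{name}: {pct:.2f}% wrong path")
```

    This reproduces the 39.69%, 26.93%, and 24.43% reported for crafty, mcf, and vortex.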

  • Wrong-path Memory Stalls
    Minor effect: better or worse

    [Chart: % time stalled for memory, by benchmark, exe-driven vs. trace-driven]
    Chart values in series order (exe-driven / trace-driven):
    17.48% / 17.78%, 27.63% / 27.89%, 14.42% / 14.45%, 19.51% / 19.81%, 43.98% / 45.22%,
    13.42% / 13.71%, 24.54% / 24.6%, 17.05% / 17.31%, 31.24% / 30.88%, 16.44% / 16.12%

    The accompanying spreadsheet has the same columns as the one on the previous slide, plus % Difference (IPC), RAS Difference, exe and trace %cycles stalled on mem, and their Difference. Only the first rows survive in this transcript:
    crafty: IPC 2.3917 (EXE), 2.4468 (TRACE), difference 2.30%; RAS pred rate 70.43% (exe), 99.25% (trace), difference 28.82%; cycles stalled on mem 17.48% (exe), 17.78% (trace), difference -0.30%
    gap: IPC 1.3444 (EXE), 1.3523 (TRACE), difference 0.59%; the rest of the row is cut off
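    The IPC "% Difference" column appears to be the trace-driven IPC relative to the execution-driven IPC. That reading is an assumption, checked here only against the two rows shown (crafty and gap); the function name is illustrative.

```python
def relative_difference(exe_ipc: float, trace_ipc: float) -> float:
    """Trace-driven IPC relative to execution-driven IPC, as a fraction."""
    return (trace_ipc - exe_ipc) / exe_ipc


# (EXE IPC, TRACE IPC) copied from the spreadsheet rows.
rows = {"crafty": (2.3917, 2.4468), "gap": (1.3444, 1.3523)}

for name, (exe_ipc, trace_ipc) in rows.items():
    pct = 100.0 * relative_difference(exe_ipc, trace_ipc)
    print(f"{name}: {pct:.2f}%")
```

    Both rows match the spreadsheet values of 2.30% and 0.59%. A positive value means the right-path (trace-driven) run was slightly faster than the run that executes wrong-path instructions.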
