How to get realistic C-state latency and residency?
Vincent Guittot
Agenda
● Overview
● Exit latency
● Entry latency
● Residency
● Conclusion
Overview
Overview
● PMWG uses the hikey960 for testing our development on a big.LITTLE system
  ○ The cluster-off and residency values in the DT binding looked really high
● Decided to find a way to check the correctness of the figures
● How to easily get realistic figures for the C-state table of my platform?
  ○ Without expensive equipment
  ○ Without deep knowledge of power management and idle states
  ○ Define values for a platform, or check the current ones
State                     Entry latency (us)  Exit latency (us)  Residency time (us)
CPU off (big and LITTLE)  40                  70                 3000
LITTLE cluster off        500                 5000               20000
Big cluster off           1000                5000               20000
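Before measuring anything, it is worth checking what the kernel actually believes: the values above end up exposed through the cpuidle sysfs interface. A minimal sketch of a dump helper follows; the sysfs root is a parameter so the function can be tried against a fake tree, and on a real target you would pass `/sys/devices/system/cpu/cpu0/cpuidle`.

```shell
# Dump the C-state table the kernel currently uses, for comparison with
# the DT values above. $1 is the cpuidle directory of one CPU.
dump_cstates() {
    for st in "$1"/state*; do
        printf '%-12s latency=%sus residency=%sus\n' \
            "$(cat "$st/name")" "$(cat "$st/latency")" "$(cat "$st/residency")"
    done
}

# Demo on a fake tree mimicking the sysfs layout (made-up values):
mkdir -p /tmp/cpuidle/state0
printf 'WFI' > /tmp/cpuidle/state0/name
printf '1'   > /tmp/cpuidle/state0/latency
printf '1'   > /tmp/cpuidle/state0/residency
dump_cstates /tmp/cpuidle
```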
C-state latency
● Prepare:
  ○ Cache maintenance
  ○ Abortable
● Entry:
  ○ HW & SW sequence to enter the idle state
  ○ Not abortable
● Exit:
  ○ HW & SW sequences needed to bring the CPU back to the running state
* Read Documentation/devicetree/bindings/arm/idle-states.txt for details
[Timeline: Exec → Prepare → Entry → Idle → Exit → Exec]
How to measure latency?
● Trigger contentions
  ○ Compete for access to critical resources
  ○ Look for worst-case values
● Trigger the slowest path
  ○ Cache flush for entry latency
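In practice, forcing a given state means disabling all the others through the per-state `disable` knob in sysfs. A hedged sketch follows; the sysfs root is a parameter so the function can be exercised against a fake directory tree, and on a real target you would pass `/sys/devices/system/cpu`.

```shell
# Force cpuidle towards a single state by disabling every other state.
# $1 = cpu sysfs root, $2 = index of the state to keep enabled.
enable_only_state() {
    root=$1; keep=$2
    for st in "$root"/cpu*/cpuidle/state*; do
        case $st in
            */state"$keep") echo 0 > "$st/disable" ;;  # keep this state
            *)              echo 1 > "$st/disable" ;;  # disable the rest
        esac
    done
}

# Demo on a tiny fake tree: keep only state2 (e.g. cluster off)
mkdir -p /tmp/fake/cpu0/cpuidle/state0 /tmp/fake/cpu0/cpuidle/state1 \
         /tmp/fake/cpu0/cpuidle/state2
enable_only_state /tmp/fake 2
cat /tmp/fake/cpu0/cpuidle/state0/disable   # → 1
cat /tmp/fake/cpu0/cpuidle/state2/disable   # → 0
```

As the slides note, this is not fully robust: the governor can still pick WFI when a state is aborted, which is why the results need cross-checking.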
Test environment
● CPU isolation
  ○ Isolate CPUs from external noise and background activity
  ○ Works great for the big cluster
  ○ Not enough for the little cluster
    ■ The boot CPU is in the little cluster
    ■ Interrupts pinned to CPU0
    ■ A “lot” of spurious activity pinned on the little cluster
● Use rt-app
  ○ Synchronized wake-up of CPUs
  ○ Range of wake-up periods
  ○ Log events and phase durations
● Hikey960
  ○ Modified to access the VDD_4V2 voltage domain
● Arm Energy Probe USB dongle
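A workload of the kind described above can be sketched as an rt-app JSON file: threads that do (almost) nothing but wake up on a periodic timer and go back to sleep. The key names below follow rt-app's JSON grammar as I understand it, and all values are illustrative; double-check against the rt-app documentation for your version.

```shell
# Generate a hypothetical rt-app config: 4 periodic wake-up threads,
# 100us of work per 10ms period, logged to /tmp.
cat > /tmp/idle-wakeup.json <<'EOF'
{
    "global": {
        "duration": 20,
        "default_policy": "SCHED_FIFO",
        "logdir": "/tmp"
    },
    "tasks": {
        "wakeup": {
            "instance": 4,
            "loop": -1,
            "phases": {
                "sleepy": {
                    "run": 100,
                    "timer": { "ref": "tick", "period": 10000 }
                }
            }
        }
    }
}
EOF
grep -c '"period"' /tmp/idle-wakeup.json   # → 1
```

The shared timer reference is what gives the synchronized wake-up of all instances.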
Exit latency
1st test: exit latency
● Enable only one state to force cpuidle
  ○ Not fully robust
● Wake up CPUs simultaneously
● rt-app logs the wake-up latency
  ○ Get min, max, average and std-dev
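The post-processing of the rt-app log is plain statistics over the per-wakeup latency samples. A minimal awk sketch follows; the sample values are made up for the demo.

```shell
# Compute min, max, average and std-dev of wake-up latency samples,
# one value (in us) per line on stdin.
latency_stats() {
    awk '{
        n++; sum += $1; sumsq += $1 * $1
        if (n == 1 || $1 < min) min = $1
        if (n == 1 || $1 > max) max = $1
    } END {
        avg = sum / n
        printf "min=%d max=%d avg=%.1f stddev=%.1f\n", \
               min, max, avg, sqrt(sumsq / n - avg * avg)
    }'
}

# Example with fabricated samples:
printf '2700\n2900\n2850\n3100\n2950\n' | latency_stats
# → min=2700 max=3100 avg=2900.0 stddev=130.4
```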
[Diagram: a timer IRQ fires; CPU0-CPU3 wake up simultaneously and read the clock]
1st test: exit latency
[Chart: min, max and 95th-percentile exit latency, @903MHz and @2362MHz]
1st test: exit latency
[Chart: exit latency distribution, @903MHz and @2362MHz]
1st test: exit latency
● One CPU wakes up faster than the others
  ○ Most probably the one that gets a “lock” first
● The frequency of the other cluster impacts the exit latency
  ○ Flattens the difference between min and max OPP
  ○ +400us at max OPP when the other cluster runs at its lowest OPP
● The local frequency has a limited impact in the end
  ○ Around 200us out of the 2900us budget
● Synchronized wake-up with the other cluster has a limited impact on latency
  ○ A few dozen us
● The firmware mode has an impact
  ○ Release vs debug mode
All latencies
● Big cluster off is slower than LITTLE cluster off
  ○ Most probably more things are shut down compared to the little cluster
    ■ Like powering down the power domain
● The measured latency includes the full wake-up path:
  1. Timer interrupt fires (at almost the programmed timestamp, as the granularity of the timer is 52ns)
  2. PM coprocessor HW wake-up sequence (when involved)
  3. ATF firmware resume sequence (when involved)
  4. cpuidle driver
  5. cpuidle framework
  6. Idle thread, including starting/stopping the nohz idle tick
  7. Switching to the rt-app thread
  8. Reading the time clock
           big cluster             little cluster
           CLUSTER  CPU   WFI      CLUSTER  CPU   WFI
exit (us)  2900     550   70       1600     650   100
Entry latency
2nd test: entry latency
● Enable only one state to force cpuidle
  ○ Not fully robust
● rt-app logs the phase durations
  ○ Get min, max, average and std-dev
● Increase the sleep duration step by step
  ○ The phase duration steps up when the sleep duration crosses the entry latency
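The procedure above can be read off the data mechanically: sweep the programmed sleep duration and look for the point where the measured phase duration jumps, i.e. where the sleep stops being aborted. A sketch with fabricated numbers:

```shell
# Input lines: "<programmed_sleep_us> <measured_phase_us>".
# Print the first programmed sleep at which the measured duration jumps
# by more than a 200us tolerance (tolerance is an assumption).
find_step() {
    awk 'NR > 1 && ($2 - prev) > 200 { print $1; exit } { prev = $2 }'
}

# Fabricated sweep: the phase duration jumps once the sleep exceeds ~400us
printf '100 1010\n200 1020\n300 1030\n400 1040\n500 1950\n600 2060\n' | find_step
# → 500
```

With several such jumps in the chart, each step corresponds to one of the abort points in the entry sequence.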
[Diagram: a single CPU programs a timer IRQ and sleeps; the sleep duration is increased step by step across five runs]
2nd test: entry latency (single cpu)
[Chart: phase duration vs sleep duration — the step shows where the sleep duration becomes longer than the entry latency; spurious wake-ups can be discarded]
2nd test: entry latency (multi cpu)
[Chart: the sleep duration becomes longer than the wake-up latency; the 1st abort point is visible]
2nd test: entry latency
● The wake-up duration includes:
  ○ rt-app task events
  ○ Entry latency
  ○ Extra sleep time
  ○ Exit latency
● Steps in the charts
  ○ Show the different abort points
            big cluster             little cluster
            CLUSTER  CPU   WFI      CLUSTER  CPU   WFI
entry (us)  900      400   ~0       500      400   ~0
All latencies
              big cluster             little cluster
              CLUSTER  CPU   WFI      CLUSTER  CPU    WFI
entry (us)    800      400   ~0       500      400    0
exit (us)     2900     550   70       1600     650    100
wake up (us)  3700     950   70       2100     1050   100
Residency time
● Residency time
  ○ The minimum idle time above which it is worth selecting the C-state
● Estimated idle duration
  ○ Select the state with the longest residency time that fits
● Wake-up latency
  ○ Can skip some C-states (latency constraints)
C-state residency
[Diagram: comparing timelines — deep state: Exec → Prepare → Entry → Idle → Exit → Exec; shallow state: Exec → Idle → Exec]
How to estimate residency time?
● Measure each step precisely and independently
  ○ Energy consumed during each step of each state
  ○ Isolate the CPU power domain from the others
● This implies
  ○ Having access to all power domains
  ○ Having very precise power meters (some steps are short, transient and difficult to measure)
● We don't really care about absolute values
  ○ We just want to compare idle states to each other
● We don't really care about the power impact of each step
  ○ Only interested in the end result
How to estimate residency time?
● Wake up the CPU periodically and measure the power consumption
  ○ The task does nothing but wake up and sleep
    ■ The power impact is mainly the entry/exit sequence
  ○ With decreasing periods, the entry and exit steps take more and more importance
  ○ Run the same number of wake-up/sleep sequences
    ■ Thousands of times
    ■ Relaxes the power-meter precision constraint
  ○ No need to access a dedicated power domain
    ■ Only interested in the difference
    ■ Side and noise power consumption cancels out as long as it is stable across tests
How to estimate residency time?
● Use rt-app to generate periodic wake-ups
  ○ The task does nothing but wake up and sleep
  ○ Run the thread with a decreasing period
    ■ 10ms down to 1ms with a step of 0.5ms was used for the hikey960
● Minimize the impact of background activity on the other cluster(s)
  ○ Enable only WFI
  ○ Use the lowest OPP
● Run long enough (20 seconds) and several times (x8)
  ○ Filter out background activity of the system
  ○ Keep the iteration with the minimum value
  ○ The test is really long: more than 3 days of continuous testing for the hikey960
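The break-even behind this sweep can be sketched with simple arithmetic: model the energy of one period in a state as E(T) = P_static × T + E_wakeup, and the break-even period between a shallow and a deep state is where the two lines cross. All the numbers below are made up for the demo; the method of fitting two lines and intersecting them is what the measured curves give you.

```shell
# Break-even period T = (E_deep - E_shallow) / (P_shallow - P_deep).
# Arguments: $1=E_deep(uJ) $2=E_shallow(uJ) $3=P_shallow(mW) $4=P_deep(mW)
# uJ/mW = ms, hence the *1000 to report microseconds.
break_even_us() {
    awk -v ed="$1" -v es="$2" -v ps="$3" -v pd="$4" \
        'BEGIN { printf "%.0f\n", (ed - es) * 1000 / (ps - pd) }'
}

# Hypothetical deep state: 450uJ extra per wake-up, saves 100mW while idle
break_even_us 500 50 120 20
# → 4500
```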
3rd test: residency time
[Chart: energy vs wake-up period — annotated with the wake-up latency for cluster off, the break-even point between cluster off and CPU off, and the break-even point between CPU off and WFI]
3rd test: residency time
[Chart: zoom on the break-even point between CPU off and WFI, with the wake-up latency for cluster off]
Residency
                  Big cluster             Little cluster
                  CLUSTER  CPU    WFI     CLUSTER  CPU    WFI
Lowest OPP (us)   5000     1500   N/A     8000     4500   N/A
Highest OPP (us)  0        1500   N/A     0        1500   N/A
3rd test: residency time
● Residency time differs widely with the OPP
● Understandable when we look at the “static” power consumption
  ○ big core @ lowest OPP: cluster off is 8% below WFI (absolute value)
  ○ big core @ highest OPP: cluster off is 25% below WFI (absolute value)
  ○ Need to weight the residency time value of each OPP by the % saved
● The new residency values mean increased usage of the cluster-off state
  ○ Some impact on responsiveness can be seen
  ○ 20ms residency time for cluster off versus 16ms for the display sync event
  ○ Use a CPU latency constraint instead: per CPU or system-wide
Conclusion
Conclusion
● More rt-app test cases can be used
  ○ With memory events, as an example
    ■ No real difference has been shown
● The OPP has a significant impact on the residency time
● Scripts will be publicly available soon
  ○ Run the tests and gather the results
● Next steps
  ○ Automate chart creation
  ○ Automate entry, exit and residency value extraction
Thank You
#HKG18
HKG18 keynotes and videos on: connect.linaro.org
For further information: www.linaro.org