Embedded TechCon
Practical Techniques for Embedded System Optimization Processes
Rob Oshana, robert.oshana@freescale.com
Agenda
• Follow a process
• Define the goals, quantitatively
• The platform architecture makes a big difference
• Don't be naïve about the algorithms
• Do some estimation and modelling
• Help out the compiler if possible
• Power is becoming more important
• What about multiple cores?
• Track what you are doing
There is a right way and a wrong way
• Donald Knuth: "Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil."
• Discipline and an iterative approach are the keys to effective serial performance tuning
  – use measurements and careful analysis to guide decision making
  – change one thing at a time
  – meticulously re-measure to confirm that changes have been beneficial
There are always tradeoffs
• Symptoms
  – Excessive optimization
  – Premature optimization
  – Fixation on efficiency
• These consume project resources, delay release, and compromise software design without directly improving performance
• Model first before optimizing
Follow a process
Spend time up front understanding your non-functional requirements
Functional: "The embedded software shall… (monitor, control, etc.)"
Non-functional: "The embedded software shall be… (fast, reliable, scalable, etc.)"
Example non-functional requirement for IPFwd ("fast"), measured in Kpps:
  Should: 600 Kpps
  Must:   550 Kpps
Functional = what the system should do
Non-functional = how well the system should do it
“it has to be really fast”
“it has to be able to kick <competitor A>’s butt”
-- examples of real performance “requirements”
There is a Difference Between Latency and Throughput
“It is not possible to determine both the position and momentum of an object beyond a certain amount of precision.”
-Heisenberg’s Principle
• Similarly, it is not possible to design a system that provides both the lowest latency and the highest throughput
• However, real-world systems (such as media and eNodeB) need both
• Tune the system for the right balance of latency and throughput
Latency: 10 usec avg, 50 usec max wake-up latency for RT tasks
Throughput: 50 Mbps UL, 100 Mbps DL for 512B packets
Map the application to the core
• CPU (latency-oriented cores)
• GPU (throughput-oriented cores)
[Diagram: VLIW DSP CPU core — program fetch, instruction dispatch, and instruction decode feeding two data paths (L1/S1/M1/D1 with the A register file, L2/S2/M2/D2 with the B register file), plus interrupts, control registers, control logic, emulation, and test blocks]
Or offload to the cloud
Estimating embedded performance can be done prior to writing the code
1. Maximum CPU performance: "What is the maximum number of times the CPU can execute your algorithm?" (max # channels)
2. Maximum I/O performance: "Can the I/O keep up with this maximum # channels?"
3. CPU load (% of maximum): "At this CPU load, what other functions can I perform?"
4. Available hi-speed memory: "Is there enough hi-speed internal memory?"
Example: performance calculation
• Algorithm: 200-tap (nh) low-pass FIR filter
• Frame size: 256 (nx) 16-bit elements
• Sampling frequency: 48 kHz
• How many channels can the core handle given this algorithm?
• The max # channels does not include overhead for interrupts, control code, RTOS, etc.
• Are the I/O and memory capable of handling this many channels?
• Required memory assumes 60 different filters, a 199-element delay buffer, and double buffering for rcv/xmt
Estimation results drive options
1. Application: simple, low-end (CPU load 5-20%). What do you do with the other 80-95%?
   • Additional functions/tasks
   • Increase sampling rate (increase accuracy)
   • Add more channels
   • Decrease voltage/clock speed (lower power)
2. Application: complex, high-end (CPU load 100%+). How do you split up the tasks wisely?
   • GPP/uC (user interface), DSP (all signal processing)
   • DSP (user i/f, most signal proc), FPGA (hi-speed tasks)
   • GPP (user i/f), DSP (most signal proc), FPGA (hi-speed)
Help out the compiler
► A compiler maps high-level code to a target platform
  • It preserves the defined behavior of the high-level language
  • The target may provide functionality that is not directly mapped into the high-level language
  • The application may use algorithmic concepts that are not handled by the high-level language
► Understanding how the compiler generates code is important to writing code that will achieve the desired results
Big compiler impact 1: ILP
restrict enables SIMD optimizations.

Without restrict, stores may alias loads, so the compiler must perform the operations sequentially:

    void VecAdd(int *a, int *b, int *c) {
        for (int i = 0; i < 4; i++)
            a[i] = b[i] + c[i];
    }

With restrict, the loads and stores are independent, and the operations can be performed in parallel:

    void VecAdd(int * restrict a, int *b, int *c) {
        for (int i = 0; i < 4; i++)
            a[i] = b[i] + c[i];
    }
Big compiler impact 2: data locality
Unroll the outer loop and fuse the new copies of the inner loop: spatial locality of B is enhanced, and the larger loop body increases the available ILP.
General guideline: align computation and locality.

    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            A[i][j] = B[j][i];

    for (i = 0; i < N; i += 2)
        for (j = 0; j < N; j++) {
            A[i][j]   = B[j][i];
            A[i+1][j] = B[j][i+1];
        }
Use cache efficiently
[Diagram: memory hierarchy — CPU at 600 MHz, on-chip L1 cache at 600 MHz, on-chip L2 cache at 300 MHz, external memory at ~100 MHz; memory size grows and speed/cost falls moving away from the CPU]
Use the right algorithm: think Big O
[Chart: DFT versus FFT (radix-2) performance — cycle counts on a log scale (1 to 10,000,000) as the number of points grows; the DFT's O(n^2) cost pulls away from the FFT's O(n log n) cost by orders of magnitude]
Understand Performance Patterns (and anti-patterns)
[Diagram: performance pattern graphs — green: data, red: control, blue: termination]
Power optimization: active vs. static power
Power consumption in CMOS circuits: P_total = P_active + P_static
Capacitance is charge over voltage: C = q / V, so q = CV
The energy (work) to charge that capacitance is W = V * q, or in other terms, W = CV²
Power is work over time; here, the time term is how many times a second we oscillate the circuit:
P = W / T, and since T = 1/F, P = WF, or substituting, P = CV²F
Top ten power optimization techniques
1. Architect SW to have natural "idle" points (incl. low-power boot)
2. Use interrupt-driven programming (no polling; use the OS to block)
3. Place code and data close to the processor to minimize off-chip accesses (and overlay from non-volatile to fast memory)
4. Smart placement to keep frequently accessed code/data close to the CPU (and use hierarchical memory models)
5. Size optimizations to reduce footprint, memory, and corresponding leakage
6. Optimize for speed for more CPU idle modes or reduced CPU frequency (benchmark and experiment!)
7. Don't over-calculate: use minimum data widths, reduce bus activity, smaller multipliers
8. Use DMA for efficient transfer (not the CPU)
9. Use co-processors to efficiently handle/accelerate frequent/specialized processing
10. Use more buffering and batch processing to allow more computation at once and more time in low-power modes
11. Use the OS to scale V/F and analyze/benchmark (make it right first!)
When you have more than one core to optimize (multicore)
Goal: exploit multicore resources
Step 1: optimize the serial implementation
• easier
• less time consuming
• less likely to introduce bugs
• reduces the gap, so less parallelization is needed
• allows parallelization to focus on parallel behavior, rather than a mix of serial and parallel issues
Serial optimization is not the end goal
• apply changes that will facilitate parallelization and the performance improvements that parallelization can bring
• serial optimizations that interfere with, or limit, parallelization should be avoided:
  – avoid introducing unnecessary data dependencies
  – avoid exploiting details of the single-core hardware architecture (such as cache capacity)
There's Amdahl and then there's Gustafson (know the difference)
• Amdahl (conventional wisdom)
  – Speedup decreases with an increasing portion of serial code (S): diminishing returns
  – Imposes a fundamental limit (1/S) on speedup
  – Assumes the parallel vs. serial code ratio is fixed for any given application (unrealistic?)
• Gustafson (theoretical max?)
  – Applies to applications without a fixed code ratio, e.g. networking/routing
  – Speedup becomes proportional to the number of cores in the system
  – Packet processing provides opportunity for parallelism
Many types of Parallelism (more than one may apply)
Multithreaded Programming has some hazards
Deadlock
Livelock
False sharing
Data hazards
Lock contention
Optimize for the best-case scenario, not the worst case
• No lock contention means no system call
• Very useful when the number of threads is low
• Since most operations will not require arbitration between processes, the slow path is rarely taken
Top ten performance optimization techniques for multicore
1. Achieve proper load balancing
2. Improve data locality and reduce false sharing
3. Affinity scheduling if necessary
4. Lock granularity
5. Lock frequency and ordering
6. Remove sync barriers
7. Async vs. sync communication
8. Scheduling
9. Worker thread pool
10. Manage thread count
11. Use parallel libraries (pthreads, OpenMP, etc.)
Threads, cache, naïve and smart
Recommendation: start developing crawl charts
Example targets:
1. DL throughput: 60 Mbps (with MCS=27, DL MIMO)
2. UL throughput: 20 Mbps (with MCS=20 for UL)
[Chart: optimization improvement — cycle count by optimization stage]
  "Out of Box" C:   521 cycles
  C w/ intrinsics:  412 cycles
  Hand assembly:    312 cycles
  Full entitlement: 288 cycles
[Chart: total cycles (0-20,000 scale) vs. optimization type — out of box, intrinsics, pragmas, partial summation, multi-sampling]
Recommendation: form a performance engineering team
[Diagram: performance engineering in the development flow — feature content, configuration settings, and repository/branch/patches feed feature merge and feature integration into the SoC kernel; as content is upstreamed it flows to the upstream kernel; performance engineering spans SoC features and NPIs]
Recommended.