Bill Finch, Sr. Vice President, CAST, Inc.
[email protected] • www.cast-inc.com
The BA20 Processor: Responding to IoT and Wearable Device Energy Challenges
Linley Tech Processor Conference
October 22–23, 2014
Slide 2: Announcing the BA20
Announcing the BA20 MPU
Energy-Optimized 32-bit Embedded Processor
  PipelineZero™ Architecture
  Code-Dense ISA, Power Management, Full Ecosystem
  IP core in RTL or FPGA netlist
Provided by CAST, Inc.
  20 years in IP
  BA2x Family, 8051s, GPUs, Compression, more
Developed by Beyond Semiconductor
  32-bit RISC/DSP experts
Slide 3
Keys to a Low-Power µP
Consume as little energy as possible while idle
  Leakage proportional to area
Use as little memory as possible
  Active and idle memory system power can be > CPU power
Complete tasks with the lowest possible energy cost
Small silicon footprint
High code density
High CoreMarks per MHz & per µW
Slide 4
Performance Realization
Process evolution has made processing power more than adequate for many applications
  3 pipeline stages run > 400MHz X CoreMarks (40nm)
  5 pipeline stages > 800MHz X CoreMarks (40nm)
but
Some applications just need high performance efficiency (CM/MHz), not high frequency
3-5 stage pipeline processors are overkill!
Slide 5
How Can We Do Better?
Keep the relevant best practices
  Variable length ISA for better code density
  Advanced power-management support including frequency scaling
  Interrupt handling efficiency, debugging facilities, optional arithmetic acceleration, memory protection schemes …
Re-invent the basic architecture
  What about those “old fashioned” non-pipelined CPUs?
Slide 6
Typical 3-Stage RISC Pipeline
Hazards Limit Performance
  Data & Structural Hazards: execution delay
  Branch Hazards: branch target delay
Pre-Fetching Wastes Energy when a Branch is Taken
Pipeline Registers & Hazards Resolution Increase Area
[Figure: pipeline diagram showing (Pre)Fetch, Decode, Execute, and Write]
Slide 7
BA20 PipelineZero Approach
Lack of Hazards = Higher Performance
  No Data Hazards (execution and write on same cycle)
  No Structural Hazards (single-issue/no pipeline)
  No Branch Hazards (branch resolved & next fetch address in Execute)
Shorter Branch Shadow = Less Energy Waste
Fewer Pipeline Registers & Simpler Control = Smaller Area
[Figure: diagram of the Fetch, Decode, Execute, and Write steps]
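The hazard argument on the last two slides can be illustrated with a simple cycle-count model. This is a minimal sketch, not CAST's performance model: it assumes one instruction per cycle as a baseline, and the instruction count, taken-branch count, and 2-cycle branch penalty are hypothetical numbers chosen only for illustration.

```python
def total_cycles(instructions, taken_branches, branch_penalty, stall_cycles=0):
    """Cycles to run a program, assuming a baseline of one instruction per cycle.

    branch_penalty: extra cycles lost per taken branch (hypothetical value).
    stall_cycles:   extra cycles lost to data/structural hazards (hypothetical).
    """
    return instructions + taken_branches * branch_penalty + stall_cycles


def cycles_per_instruction(instructions, taken_branches, branch_penalty, stall_cycles=0):
    """Average cycles per instruction (CPI) under the same assumptions."""
    cycles = total_cycles(instructions, taken_branches, branch_penalty, stall_cycles)
    return cycles / instructions


if __name__ == "__main__":
    # Hypothetical workload: 1000 instructions, 150 of them taken branches.
    pipelined = cycles_per_instruction(1000, 150, branch_penalty=2)
    hazard_free = cycles_per_instruction(1000, 150, branch_penalty=0)
    print(f"pipelined CPI (assumed 2-cycle penalty): {pipelined:.2f}")
    print(f"hazard-free CPI (assumed no penalty):    {hazard_free:.2f}")
```

In this model every cycle of branch penalty is also a fetch whose result is discarded, which is the energy waste the slide attributes to pre-fetching.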
Slide 8
BA20: M4 Efficiency, M0+ Size
[Figure: chart of CoreMarks/MHz (vertical axis) versus silicon area at 40 nm (horizontal axis), each axis marked Better/Worse]
Core          Pipeline stages   Area (mm2)   CoreMarks/MHz
Cortex®-M4    3–4               0.04         3.40
Cortex®-M0+   2                 0.009        2.42
BA20          0 (1)             0.01         3.41
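One way to combine the two axes of this chart is to divide CoreMarks/MHz by silicon area. The sketch below does that with the figures quoted on the slide; the combined metric and the function names are illustrative assumptions, not a metric the presentation itself reports.

```python
# Figures transcribed from the slide: (area in mm^2 at 40 nm, CoreMarks/MHz).
CORES = {
    "Cortex-M4": (0.04, 3.40),
    "Cortex-M0+": (0.009, 2.42),
    "BA20": (0.01, 3.41),
}


def coremarks_per_mhz_per_mm2(area_mm2, cm_per_mhz):
    """Performance efficiency per unit area (an illustrative combined metric)."""
    return cm_per_mhz / area_mm2


def rank_by_area_efficiency(cores):
    """Core names sorted from highest to lowest CoreMarks/MHz per mm^2."""
    return sorted(
        cores,
        key=lambda name: coremarks_per_mhz_per_mm2(*cores[name]),
        reverse=True,
    )


if __name__ == "__main__":
    for name in rank_by_area_efficiency(CORES):
        area, cm = CORES[name]
        print(f"{name:11s} {coremarks_per_mhz_per_mm2(area, cm):7.1f} CM/MHz per mm^2")
```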
Slide 9
How does Higher Performance Lead to Lower Energy?
Do more in less time
Sleep for a longer time
Slide 10
How does Higher Performance Lead to Lower Energy?
Allows lower clock rates
  Reduces clock tree and CPU power when active
  Enables use of HVT cells and a smaller implementation, both of which lower leakage power
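The "do more in less time, sleep for a longer time" argument can be written as a duty-cycle energy estimate. This is a minimal sketch under stated assumptions: a periodic task, constant active and sleep power, and no wake-up overhead. All numeric values in the example are hypothetical and are not taken from BA20 data.

```python
def energy_per_period_uj(cycles_needed, clock_mhz, active_uw, sleep_uw, period_ms):
    """Energy in microjoules for one period of a wake / compute / sleep loop.

    cycles_needed: CPU cycles the task takes (fewer on a more efficient core).
    clock_mhz:     CPU clock while active (1 MHz = 1000 cycles per ms).
    active_uw:     power while computing, in microwatts (assumed constant).
    sleep_uw:      power while idle, in microwatts (assumed constant).
    period_ms:     time between task activations, in milliseconds.
    """
    active_ms = cycles_needed / (clock_mhz * 1e3)
    if active_ms > period_ms:
        raise ValueError("task does not finish within its period")
    sleep_ms = period_ms - active_ms
    # microwatts * milliseconds = nanojoules; divide by 1000 for microjoules.
    return (active_uw * active_ms + sleep_uw * sleep_ms) / 1e3


if __name__ == "__main__":
    # Hypothetical: same task, clock, and power; one core needs fewer cycles.
    baseline = energy_per_period_uj(100_000, 10, active_uw=500, sleep_uw=2, period_ms=1000)
    efficient = energy_per_period_uj(71_000, 10, active_uw=500, sleep_uw=2, period_ms=1000)
    print(f"baseline core:  {baseline:.2f} uJ per period")
    print(f"efficient core: {efficient:.2f} uJ per period")
```

The alternative on this slide, lowering the clock rate instead of sleeping longer, would appear in this model as a lower `active_uw` at the same active time.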
Slide 11
Best Practices: Code Density
BA2 ISA developed to take advantage of state-of-the-art compilers
Variable length instruction encoding
  16-bit, 24-bit, 32-bit and 48-bit
  Compiler chooses smallest suitable encoding
  Yields denser code than fixed-length ISAs
32 General Purpose Registers mean fewer load/store operations
  Load/store ~25% of code for typical programs on RISC CPUs
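The idea of a compiler choosing the smallest suitable encoding can be sketched as follows. The four instruction lengths come from the slide, but the immediate-field widths in `HYPOTHETICAL_IMM_BITS` are invented for illustration; the presentation does not give the real BA2 field layouts, and a real compiler weighs more than immediate size.

```python
# Instruction lengths (bits) are from the slide. The signed immediate width
# assigned to each length is HYPOTHETICAL, chosen only to illustrate selection.
HYPOTHETICAL_IMM_BITS = {16: 4, 24: 8, 32: 16, 48: 32}


def fits_signed(value, bits):
    """True if value is representable as a two's-complement number of this width."""
    return -(1 << (bits - 1)) <= value < (1 << (bits - 1))


def smallest_encoding(immediate):
    """Shortest instruction length (bits) whose assumed immediate field holds the value."""
    for length in sorted(HYPOTHETICAL_IMM_BITS):
        if fits_signed(immediate, HYPOTHETICAL_IMM_BITS[length]):
            return length
    raise ValueError("immediate does not fit any encoding")


def code_size_bytes(immediates):
    """Total code size when each instruction uses its shortest encoding."""
    return sum(smallest_encoding(imm) for imm in immediates) // 8


if __name__ == "__main__":
    program = [3, -2, 100, 1000, 5, 100000]  # hypothetical immediates
    print("variable-length size:", code_size_bytes(program), "bytes")
    print("all-48-bit size:     ", len(program) * 48 // 8, "bytes")
```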
Slide 12
BA2 Code Size Advantage
[Figure: bar chart of CSiBe Benchmark code size for MIPS, PPC, ARM, Thumb2, and BA22; axis ticks run from 20000 to 45000]
Provides best-in-class code density and enables high-performance implementations
Slide 13
Code Density Benefits
Smaller FLASH, ROM, RAM for code storage
Fewer accesses to instruction memory
Slide 14
BA2 Code Size Benefits
Reduces memory size requirements and resulting product cost
Beken Chooses BA22 Processor to Satisfy Tight Constraints in New Mobile Bluetooth Audio Chip
“Beken’s evaluation determined that their program code for the BA22 core fits in a 128KB memory, versus 170KB for the next-closest competitor….”
CAST Press Release, June 2, 2014
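As a quick check on the quoted figures, the relative saving can be computed directly. This sketch treats the two memory sizes in the press release quote as a proxy for code size, which is an assumption; the function name is illustrative.

```python
def code_size_reduction_percent(smaller_kb, larger_kb):
    """Percentage by which the smaller footprint undercuts the larger one."""
    return 100.0 * (larger_kb - smaller_kb) / larger_kb


if __name__ == "__main__":
    # 128KB (BA22) versus 170KB (next-closest competitor), as quoted above.
    saving = code_size_reduction_percent(128, 170)
    print(f"BA22 footprint is about {saving:.1f}% smaller")
```

With these inputs the saving works out to roughly a quarter of the competitor's footprint.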
Slide 15
Best Practices: Power Management
Power and Clock Gating
  Automatically gates clock of unused modules
  Broadcasts module status to enable power and/or clock gating
Dynamic Frequency Scaling
  Independent SoC Bus and CPU clocks
  Over-clock CPU, under-clock or shut off peripheral bus, when computational load is high
  Under-clock CPU & bus when computational load is low
Wake-up on Interrupt, Tick-Timer or PM Event
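The frequency-scaling policy above can be sketched as a small governor that picks CPU and bus clocks from the current load. This is an illustrative sketch only: the load thresholds, the base frequency, and the scaling factors are hypothetical, and the names `ClockSetting` and `choose_clocks` are not part of any BA20 software interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClockSetting:
    cpu_mhz: float
    bus_mhz: float  # 0.0 would mean the peripheral bus clock is shut off


def choose_clocks(load, base_mhz=20.0, high=0.75, low=0.25):
    """Pick independent CPU and bus clocks from a load estimate in [0, 1].

    Thresholds and scaling factors are hypothetical illustration values.
    """
    if not 0.0 <= load <= 1.0:
        raise ValueError("load must be between 0 and 1")
    if load >= high:
        # High load: over-clock the CPU, under-clock the peripheral bus.
        return ClockSetting(cpu_mhz=base_mhz * 2, bus_mhz=base_mhz / 2)
    if load <= low:
        # Low load: under-clock both CPU and bus.
        return ClockSetting(cpu_mhz=base_mhz / 4, bus_mhz=base_mhz / 4)
    # Moderate load: run both at the base frequency.
    return ClockSetting(cpu_mhz=base_mhz, bus_mhz=base_mhz)


if __name__ == "__main__":
    for load in (0.1, 0.5, 0.9):
        print(load, choose_clocks(load))
```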
Slide 16
Best Practices: Development & Configurability
Easy software development
  Complete GNU Toolchain, customized Eclipse IDE
  Cycle-Accurate ISS, Ported C Libraries & OSs
  JTAG & Serial Debugging, Development Kits
Flexible Architecture and Options
  Options for ALU and Interrupt Controller
  JTAG or Serial Debug Interface
  Optional Memory Protection Unit
  Pre-Integrated Peripherals
Slide 17
The New BA20 Processor IP
Ultra-Low Power
  PipelineZero Architecture for higher performance efficiency and lower area
  Advanced Power Management
  BA2 ISA for extreme code density
Easy Software Development
Flexible Architecture and Peripheral Options to match different requirements
Better Business Terms (no royalties)