Modern Computing Systems
Prof. Dr. Akash Kumar
Chair for Processor Design
(Ack: my past and current students/PostDocs)
(Some slides adapted from Koren, Krishna, Anand)
© Akash Kumar
Singapore
Eindhoven University of Technology
Gliding
Finished PhD!
Life in Singapore
Faculty of Engineering, NUS
Outreach
…
Topics to be Covered
1. Real-time Embedded Systems
2. Fault-Tolerance
3. Approximate computing 1
4. Approximate computing 2
5. Machine learning embedded architectures 1
6. Machine learning embedded architectures 2
7. Going beyond CMOS
Outline
• Characteristics of real-time embedded systems
• Hardware architecture for real-time embedded systems
• Design flow and considerations
• Modern challenges and solutions
What is a real-time system?
A system that responds to events within a finite and specifiable delay in order to change its environment
[Diagram: Environment and RT system connected in a loop, with signals x(t) and y(t+Δ)]
Δ is finite and specifiable
Predictability!!
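The requirement that Δ be finite and specifiable can be illustrated with a small check. This is only a sketch: the helper names `meetsBound` and `worstDelay` and the millisecond unit are assumptions for illustration, and passing such a check on measured delays does not by itself prove a real-time guarantee, which needs worst-case analysis.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper: true if every observed response delay stays
// within the specified bound delta (all values in milliseconds).
bool meetsBound(const std::vector<double>& delaysMs, double deltaMs) {
    assert(deltaMs > 0.0);
    return std::all_of(delaysMs.begin(), delaysMs.end(),
                       [deltaMs](double d) { return d <= deltaMs; });
}

// Hypothetical helper: largest delay seen so far. This is only a lower
// bound on the true worst case, since untested inputs may be slower.
double worstDelay(const std::vector<double>& delaysMs) {
    assert(!delaysMs.empty());
    return *std::max_element(delaysMs.begin(), delaysMs.end());
}
```

For example, `meetsBound({1.0, 2.5, 4.0}, 5.0)` holds, while a single 6.0 ms response breaks a 5.0 ms bound no matter how fast the average is.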
Why is real-time important?
[Figure: the same content shown as Original and as Corrupt at 5%, 20%, and 40%. Courtesy: Rajesh Panicker]
What is an embedded system?
• Small device, like a cell phone
• Small processor installed in some other device, like a car
• Software that controls a consumer device
• Real-time response
My favorite:
Any system where the user doesn’t want to know that it includes a processor
Examples of Real-time / Embedded Systems
• Mars Pathfinder
• Car engine
• Mobile phone
• Set-top box
• Car navigation
• Industrial control
• Telecom switch
• Global Positioning System
• Air Traffic Management
• Satellite flight manager
• Satellite Ground Control
• TV receiver
• Flight control system
• Electric shaver
• Toaster
• Washing Machine
Hardware Architectures for Embedded Systems
History of Hardware / VLSI
• Vacuum tube (Lee De Forest, 1906)
• ENIAC (1946, UPenn)
• Transistor (1947, Bardeen, Brattain, Shockley)
• Integrated circuit (1958, Jack Kilby)
History of Hardware / VLSI
• Intel 4004 (1971, about 2,300 transistors)
• Intel Core i7 - Ivy Bridge (2012, >1.4 billion transistors)
• Very Large Scale Integration (VLSI): originally defined for chips with on the order of 100,000 transistors. Other terms such as ULSI came along, but the term VLSI remains dominant
http://cpudb.stanford.edu/
Moore’s Law
• In 1965, Gordon Moore (later a co-founder of Intel) predicted that the number of transistors that can be integrated on a single chip would double about every year; in 1975 he revised the rate to about every two years
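As a rough worked form of the doubling rule, the count after a given number of years is the starting count times 2^(years / doubling period). The function name `projectTransistors` and its parameters are hypothetical, chosen only for this sketch.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: projects a transistor count forward in time,
// assuming one doubling every doublingPeriodYears (two by default).
double projectTransistors(double initialCount, double years,
                          double doublingPeriodYears = 2.0) {
    assert(initialCount > 0.0 && doublingPeriodYears > 0.0);
    return initialCount * std::pow(2.0, years / doublingPeriodYears);
}
```

With a two-year doubling period, ten years give five doublings, so a design with 1,000 transistors would be projected to reach 32,000. Real products deviate from this idealized curve.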
40 Years of microprocessor trend data
Design Productivity Gap
• An increasing number of transistors makes the system harder to design
  - Late launch of products directly hurts profits
System Design Considerations
• System: sensor -> processor -> actuator
• Considerations
  - Technology
  - Performance
  - Power consumption
  - Volume of production
  - Upgradability / ease of maintenance
  - Reliability
  - Testability
  - Availability of CAD and software tools, IPs, hardware and software libraries
  - Cost, chip area
  - Legal and certification requirements, client specifications
  - …..
Digital Hardware Market Segments
• Processor, GPU
• DRAM, Flash memories
• (Co-)Processor alternatives
• ASIC (application specific integrated circuit)
• ASSP (application specific standard product)
• FPGA (field programmable gate array)
• Convergence as System on Chip (SoC), which may also contain analog, mixed-signal, and radio-frequency functions
Embedded systems architecture
• Trend towards Multi-Processor Systems-on-Chip (MPSoC)
• Homogeneous vs heterogeneous systems
• Different memory models
• Different network architectures
  - Network-on-chip
  - Buses
Homogeneous vs heterogeneous
[Diagram: Processors 1 to 5 and five memories connected through an interconnection network]
Homogeneous vs heterogeneous
• Heterogeneity is increasing
  - Different levels of parallelism in application
  - uProc: better for control-flow
  - DSP: better for signal processing
  - Dedicated hardware blocks needed for certain parts
  - Improves efficiency and saves power
• Homogeneous systems
  - Better for fault-tolerance
  - Only one compiled version of any application needed
  - Easier to design and replicate
  - Easy to support task migration
Memory usage
Embedded systems – local memory
[Diagram: Processors 1 to 5 and five memories connected through an interconnection network]
Local memory is better for predictability
Network/bus delay may be unpredictable
Embedded systems – global memory
[Diagram: Processors 1 to 5 connected through an interconnection network to a single memory with an arbiter]
Global memory may be better for shared data
Embedded systems – combination
[Diagram: Processors 1 to 5 and five memories connected through an interconnection network, plus an additional memory with an arbiter. Note on the additional memory: can also be off-chip!]
Communication pattern also determines which architecture is better
Message passing OR shared memory
Embedded systems – network
[Diagram: Processors 1 to 4, an input/output block, and a memory with an arbiter, all connected through an interconnection network]
Interconnection network-on-chip
[Diagram: Processors 1 to 4, an input/output block, and a memory with an arbiter, each attached through an NI to one of three routers]
Interconnection network – bus
[Diagram: Processors 1 to 4, an input/output block, and a memory attached to a high-speed bus, with two arbiters]
Point-to-point networks
[Diagram: Processors 1 to 4, an input/output block, and a memory with an arbiter, connected by point-to-point links]
System Design – Hw/Sw Codesign
• Decide whether to implement each function in hardware or software
  - Consider the advantages vs costs
• If hardware, decide whether to use commercial off-the-shelf (COTS) components or custom components
[Flow diagram: System Level Specification -> Hardware/Software Partitioning -> Software Model (Compilation) and Hardware Model (Synthesis), linked by Co-simulation -> Integration and Testing]
[Plot: Pareto curve of Resources against Performance^-1, with design points A, B, C, and D]
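The Pareto curve in a hardware/software trade-off keeps only the design points that no other point beats on both axes. A minimal sketch follows, assuming each design is described by a resource cost and an execution time (used here as a stand-in for Performance^-1), both to be minimized. The names `DesignPoint`, `dominates`, and `paretoFront` are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Hypothetical design point: both fields are "lower is better".
struct DesignPoint {
    double resources;  // e.g. chip area or number of processing elements
    double execTime;   // stand-in for the inverse of performance
};

// a dominates b if a is no worse on both axes and better on at least one.
bool dominates(const DesignPoint& a, const DesignPoint& b) {
    return a.resources <= b.resources && a.execTime <= b.execTime &&
           (a.resources < b.resources || a.execTime < b.execTime);
}

// Keeps the points that no other point dominates (quadratic-time filter).
std::vector<DesignPoint> paretoFront(const std::vector<DesignPoint>& points) {
    std::vector<DesignPoint> front;
    for (const DesignPoint& candidate : points) {
        bool dominated = false;
        for (const DesignPoint& other : points) {
            if (dominates(other, candidate)) {
                dominated = true;
                break;
            }
        }
        if (!dominated) front.push_back(candidate);
    }
    return front;
}
```

Given the points (1, 10), (2, 5), (3, 6), and (4, 2), the point (3, 6) is dropped because (2, 5) uses fewer resources and is also faster. The other three remain as genuine trade-offs.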
Modern Challenges
Modern Multimedia Embedded Systems
Large number of use-cases
Guarantee performance
Minimize power consumption
Run-time addition of applications
Issues and Modern Trends
• The communication bottleneck
  - 3D chips
  - Optical interconnects
• Leakage current limiting size reduction
  - Multi-gate or gate-all-around transistors (Intel 22nm uses 3D/tri-gate transistors)
  - Channel strain engineering, silicon-on-insulator-based technologies, and high-k/metal gate materials
• One may not fit all
  - Hardware/Software Co-design
  - Fault-tolerant / reconfigurable computing
• Power issues
  - Multi-core and heterogeneous architectures
Technology Scaling
• Dennard scaling principles [1]

Device Parameters          Scaling Factor
Device dimension           1/k
Doping concentration       k
Voltage                    1/k
Current                    1/k
Capacitance                1/k
Delay time per circuit     1/k
Power dissipation          1/k^2
Area                       1/k^2
Power density              1

[1] R. Dennard et al., “Design of Ion-Implanted MOSFET’s with Very Small Physical Dimensions,” IEEE Journal of Solid-State Circuits, 1974.
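The derived rows of the table follow from the scaled voltage, current, capacitance, and area: power is V·I, delay is C·V/I, and power density is power divided by area. The sketch below applies the ideal scaling factors to a normalized circuit. The struct `ScaledCircuit` and the function `dennardScale` are hypothetical names used for illustration only.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical normalized circuit; units are arbitrary but consistent.
struct ScaledCircuit {
    double voltage;
    double current;
    double capacitance;
    double area;

    double power() const { return voltage * current; }
    double delay() const { return capacitance * voltage / current; }
    double powerDensity() const { return power() / area; }
};

// Applies ideal Dennard scaling by a factor k > 1:
// V, I and C shrink by 1/k, area shrinks by 1/k^2.
ScaledCircuit dennardScale(const ScaledCircuit& c, double k) {
    assert(k > 0.0);
    return {c.voltage / k, c.current / k, c.capacitance / k,
            c.area / (k * k)};
}
```

Starting from a circuit with all four values set to 1 and scaling by k = 2, power drops to 1/4 and delay to 1/2, while power density stays at 1. This constant power density is exactly what stops holding once voltage can no longer scale with dimensions.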
Technology Scaling
• Departure from Dennard’s scaling beyond 65nm
  - Non-ideal voltage scaling: limit on threshold voltage scaling
  - Non-ideal gate oxide scaling
  - Sub-threshold leakage power
• Power dissipation increases with technology scaling
  - Heat localization (hot spots)
  - Higher temperature => device wear-out
Technology Scaling and Power Density
[Figure: power density trend, with Hot Plate and Nuclear Reactor marked as reference levels]
Technology Scaling and Power Density
[Diagram: Transistor Scaling, linked to Manufacturing defects (e.g. imperfect lithographic patterning), Increased Variability (e.g. Random Dopant Fluctuation), and Increasing Power Density, which leads to Increase T]
Approximate Computing
The Computational Efficiency Gap
[Figure: IBM Watson playing Jeopardy, 2011. Power labels: 20 W, 20 W, ~200000 W]
Humans Approximate
Task: Division
• 923/21 = _ _ . _ _ ?
• Is 923/21 > 1.75?
• Is 923/21 > 45?
Application context dictates required accuracy of results
[Figure: long division of 923 by 21 (quotient digits 43, partial steps 84, 83, 63) along an Accuracy axis]
Effort expended increases with required accuracy
~1 Petaflop/W
But Computers DO NOT
Is 923/21 > 45?
    float x = 923;
    float y = 21;
    // Parentheses are required: << binds tighter than ?:
    cout << ((x / y > 45.0) ? "YES" : "NO");
Output: NO

Is 923/21 > 1.75?
    float x = 923;
    float y = 21;
    cout << ((x / y > 1.75) ? "YES" : "NO");
Output: YES
But, I worked harder than needed

• Overkill (for many applications)
• Leads to inefficiency
• Can computers be more efficient by producing “just good enough” results?
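One way to picture "just good enough" for the 923/21 questions is long division that stops as soon as the comparison is decided, which is roughly what a person does. The sketch below is an illustration under that assumption, not a technique taken from the slides. The names `Decision` and `quotientExceeds` are hypothetical.

```cpp
#include <cassert>

struct Decision {
    bool greater;        // answer to "x / y > threshold?"
    int digitsComputed;  // quotient digits produced before answering
};

// Long division, most significant digit first, with early exit once the
// known range [low, low + step) of the quotient settles the comparison.
Decision quotientExceeds(double x, double y, double threshold,
                         int maxDigits = 8) {
    assert(x >= 0.0 && y > 0.0 && maxDigits > 0);
    double step = 1.0;  // place value of the current quotient digit
    while (y * step * 10.0 <= x) step *= 10.0;

    double low = 0.0;   // quotient digits found so far
    double remainder = x;
    int digits = 0;
    while (digits < maxDigits) {
        // One digit by repeated subtraction, as in pen-and-paper division.
        while (remainder >= y * step) {
            remainder -= y * step;
            low += step;
        }
        ++digits;
        if (low > threshold) return {true, digits};
        if (low + step <= threshold) return {false, digits};
        if (remainder == 0.0) break;  // quotient is exactly low
        step /= 10.0;
    }
    // Undecided after maxDigits: fall back to the truncated quotient.
    return {low > threshold, digits};
}
```

For 923/21, the question "> 1.75?" is settled after the first quotient digit (the quotient is at least 40), while "> 45?" needs the second digit (the quotient lies between 43 and 44). The effort grows with the accuracy the question demands.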
Intrinsic Application Resilience: Sources
[Diagram: sources of intrinsic application resilience]
• ‘Noisy’ real-world inputs
• Redundant input data
• Perceptual limitations
• Statistical / probabilistic computations
• Self-healing
[Example loop: compute distances & assign points to clusters -> update cluster means -> repeat until convergence]
• Intrinsic application resilience: ability to produce acceptable outputs despite underlying computations being performed in an approximate manner
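The clustering loop named in the diagram (assign points, update means, repeat until convergence) is the structure of k-means. Below is a minimal one-dimensional sketch. The function name `kMeans1D`, the caller-supplied initial means, and the iteration cap are assumptions made for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// One-dimensional k-means; returns the final cluster means.
std::vector<double> kMeans1D(const std::vector<double>& points,
                             std::vector<double> means,
                             int maxIterations = 100) {
    assert(!points.empty() && !means.empty());
    for (int iter = 0; iter < maxIterations; ++iter) {
        std::vector<double> sum(means.size(), 0.0);
        std::vector<std::size_t> count(means.size(), 0);
        // Compute distances and assign each point to the nearest mean.
        for (double p : points) {
            std::size_t best = 0;
            for (std::size_t c = 1; c < means.size(); ++c) {
                if (std::fabs(p - means[c]) < std::fabs(p - means[best])) {
                    best = c;
                }
            }
            sum[best] += p;
            ++count[best];
        }
        // Update cluster means; an empty cluster keeps its previous mean.
        bool changed = false;
        for (std::size_t c = 0; c < means.size(); ++c) {
            if (count[c] == 0) continue;
            double updated = sum[c] / count[c];
            if (updated != means[c]) changed = true;
            means[c] = updated;
        }
        if (!changed) break;  // repeat until convergence
    }
    return means;
}
```

With points 1, 2, 3, 10, 11, 12 and initial means 0 and 5, the first pass places 3 in the wrong cluster, and the next pass corrects it, ending at means 2 and 11. Later iterations repairing an earlier imprecise step is one plausible reading of the "self-healing" label.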
Machine Learning
What's Deep Learning/ Deep Neural Network?
• Self-learning algorithms
• Using huge data sets to learn
• Deep: many "learning layers"
• Brain inspired, based on neurons and synapses (connections)
• High classification accuracy
• Many applications; let's look at ImageNet classification and Tesla Autopilot
ImageNet Winners (top-5 classification error)
• ImageNet dataset: 10M images, 10000 classes
[Chart: top-5 error (0% to 30%) by year, 2010 to 2017. Series: Traditional methods, Deep Learning, Human]
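Top-5 error counts a prediction as wrong only when the true class is not among the five highest-scoring classes. A minimal sketch of that metric follows, assuming each sample comes with one score per class. The names `inTopK` and `topKError` are hypothetical, and ties are resolved in favour of the true class.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// True if fewer than k classes score strictly higher than the true class.
bool inTopK(const std::vector<double>& scores, std::size_t trueLabel,
            std::size_t k) {
    assert(trueLabel < scores.size());
    std::size_t higher = 0;
    for (double s : scores) {
        if (s > scores[trueLabel]) ++higher;
    }
    return higher < k;
}

// Fraction of samples whose true class is outside the top k (k = 5 here).
double topKError(const std::vector<std::vector<double>>& allScores,
                 const std::vector<std::size_t>& labels, std::size_t k = 5) {
    assert(!allScores.empty() && allScores.size() == labels.size());
    std::size_t misses = 0;
    for (std::size_t i = 0; i < labels.size(); ++i) {
        if (!inTopK(allScores[i], labels[i], k)) ++misses;
    }
    return static_cast<double>(misses) / labels.size();
}
```

This also shows why top-5 error is always at most top-1 error: raising k can only turn misses into hits.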
AI: Tesla Autopilot
• Tesla Model S demonstration of autonomous driving
• Computing system monitors radar and several cameras
  - Detect objects like cars and pedestrians
  - Monitor traffic signs
  - Lane tracking and possible lane changing
  - Auto parking
Tesla web page: www.tesla.com/videos/ November 2016
Deep Learning and High-performance HW Architectures
Beyond CMOS
Classical CMOS Transistors
PMOS on top, NMOS at bottom
Reconfigurable Transistors: Silicon Nanowire-Based Reconfigurable FETs
SiNW Dual-gate RFETs: Combine p-type and n-type functionality
Runtime reconfigurability
Conclusions
• Characteristics of real-time embedded systems
• Hardware architectures and considerations
• Modern trends and challenges
• Fault-tolerance is a major problem in modern systems
Ongoing Research Activities
Reliability/Energy Optimization
• Reconfigurable approximate computing at run-time
• Optimize energy and reliability
• Minimize thermal cycling and peak temperature
• Task remapping and scheduling for dealing with faults
Processing Architecture Design
• Determine and design appropriate system architecture
• Design predictable components: network and communication assist
• Partially reconfigurable tile-based heterogeneous multiprocessor systems
• Task-migration module in hardware for predictable delay
Low-Power and Fault-Tolerant FPGA Designs
• Improving fault-tolerance of FPGA through LUT content manipulation
• Novel error-correction mechanisms for FPGAs
• Leakage-aware resource management techniques
• Electronic Design Automation: Place and Route for FPGAs
Chair for Processor Design