How FPGAs Work When They Don’t - and how Feynman can help us understand

Page 1: How fpgas work when they don't

How FPGAs Work When They Don’t - and how Feynman can help us understand

Page 2: How fpgas work when they don't

Summary

Clock domain crossings, timing violations, single event effects and accelerated aging in hostile environments, power supply fluctuations, and so on. As if the learning curve for HDL programming weren't steep enough already, as soon as we have mastered the archaic trade of writing synthesizable code for FPGAs, we find physical reality intruding, breaking our assumptions, and removing any remaining illusions we might have had about the soothing comforts of deterministic programming. Physical reality is a nuisance; one we should deal with, but often do not. And understandably so: the non-ideal behavior of CMOS is difficult to simulate, difficult to grasp, and a hassle to mitigate.

Fortunately, as we shall see in this presentation, the learning effort can be greatly reduced if we apply the right perspective. One such perspective is Richard Feynman's File Clerk Model (FCM), which is both intuitive and instructive when the goal is to understand "how FPGAs work when they don't". Starting from the FCM, we go through the following topics:

● Basic computer organization in FPGAs
● Error mechanisms relevant in FPGA design
● Applying the FCM to explain
  ○ Clock domain crossing logic
  ○ SEE due to radiation
  ○ Timing violations
  ○ Voltage and frequency scaling

Page 3: How fpgas work when they don't

Résumé

Alex Birklykke, [email protected]

● 2010: MSc.EE in Applied Signal Processing and Implementation
● 2015: PhD - Modeling and Predicting the behavior of computers operating without guardbands (case study of FPGAs)
● 2013-2016: FPGA development at Rohde & Schwarz (WLAN layer-1)
● 2016-2017: FPGA development at GomSpace A/S
● 2017- : Newspace entrepreneur with Space Inventor

Page 4: How fpgas work when they don't

Research

● Empirical study of FPGA behavior when subject to voltage and frequency scaling
● Based on 65 nm Spartan 3E
● Objective was to determine the cause of errors, as well as to model and predict errors
● Research confirmed that
  ○ FPGAs are very noise immune devices
  ○ Timing violations are the cause of errors in voltage/frequency scaled devices
  ○ Precise error behavior is hard to predict

Page 5: How fpgas work when they don't

Presentation objective

Provide an intuition about how FPGAs work when they don’t

Page 6: How fpgas work when they don't

What could go wrong? Timing Closure

● Timing constraints not met
● Multi-seed P&R or refactoring doesn’t always solve the problem, especially for systems with high FPGA utilization
● Sometimes it is necessary to ship systems with timing violations
● How to assess the criticality of timing violations?

Page 7: How fpgas work when they don't

What could go wrong? Clock domain crossings

● Clock domain crossings are commonly encountered in FPGA applications

● Metastable behavior must be mitigated
● The error mechanism must be thoroughly understood in order to mitigate the problem

Page 8: How fpgas work when they don't

What could go wrong? Temperature effects and ageing

● Ring oscillator frequency in Virtex-5 FPGA vs:
  ○ Left) Location and temperature
  ○ Right) Localized wearout

● Might lead to unforeseen timing violations

S. Zhang, Delay Characterization in FPGA-based Reconfigurable Systems. Master Thesis. 2013

Page 9: How fpgas work when they don't

What could go wrong? Radiation induced Ageing

● Microsemi SmartFusion2 SoC FPGA (65nm)
● Irradiated with Cobalt-60 gamma source
● Accelerated ageing observed
● For comparison, 20 krad ~ 5 yrs in low Earth orbit
● 10% timing overhead must be introduced to ensure timing closure after 5 yrs
● Bad news: Other studies have found that the Flash configuration memory cannot be reprogrammed after a few krad

N. Rezzak, J. J. Wang, C. K. Huang, V. Nguyen and G. Bakker, "Total Ionizing Dose Characterization of 65 nm Flash-Based FPGA," 2014 IEEE Radiation Effects Data Workshop (REDW), Paris, 2014, pp. 1-5.

Page 10: How fpgas work when they don't

What could go wrong? Chasing better performance

Voltage and/or frequency scaling results in timing errors

A. Birklykke, P. Koch, R. Prasad, L. Alminde and Y. Le Moullec, "Empirical verification of fault models for FPGAs operating in the subcritical voltage region," 2013 23rd International Workshop on Power and Timing Modeling, Optimization and Simulation (PATMOS), Karlsruhe, 2013, pp. 16-23.

Page 11: How fpgas work when they don't

It’s all about timing

How FPGAs work when they don’t?

Page 12: How fpgas work when they don't

Feynman's Lectures on Computation

● Write-up of Feynman's lectures on computation given at Caltech from 1983 to 1987
● Includes an introductory chapter on computation, as well as five chapters addressing the limitations of computers
● Introduces the so-called “File Clerk Model” to explain the system-level behavior of sequential computers
● Feynman is known as one of the great communicators of science

Page 13: How fpgas work when they don't

The File Clerk Model

● Computers are data transfer machines first, and only secondarily arithmetic devices
● The file clerk's job is primarily data transfer. Data processing is only secondary
● Feynman: Let’s use the file clerk as a metaphor for understanding basic computer structure

Page 14: How fpgas work when they don't

The File Clerk Model

File clerk “total sales for California” procedure

Take out next “sales” card
If “Location” says California, then
    Take out “total” card
    Add sales number to number on card
    Put “total” card back
Put “sales” card back
Repeat

Sales cards

Salesman: “Smith”
Location: “Tahoe”
Salary: 100
Sales: 1000

Total card: xxx.xx

[Figure: file cabinet holding the sales cards and the total card]

Page 15: How fpgas work when they don't

The File Clerk Model

File clerk “total sales for California” procedure

Take out next “sales” card
If “Location” says California, then
    Add sales number to S
Put “sales” card back
Repeat until end
Take out “total” card
Replace total with S
Put “total” card back

Sales cards

Salesman: “Smith”
Location: “Tahoe”
Salary: 100
Sales: 1000

Total card: xxx.xx

[Figure: file cabinet holding the sales cards and the total card]

Local scratch pad

S : 0

The local scratch pad limits data transfer, thus increasing file clerk performance.
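The two clerk procedures can be compared with a minimal Python sketch that counts trips to the file cabinet. The class and function names, the `transfers` counter, and every card except the "Smith" card are hypothetical and only serve the illustration; the total card is assumed to start at zero.

```python
from dataclasses import dataclass


@dataclass
class SalesCard:
    salesman: str
    location: str
    salary: int
    sales: int


class FileCabinet:
    """Stores the cards and counts every card taken out or put back."""

    def __init__(self, cards):
        self.cards = list(cards)
        self.total = 0       # the "total" card (assumed to start at zero)
        self.transfers = 0   # number of trips to the cabinet

    def take_sales(self, i):
        self.transfers += 1
        return self.cards[i]

    def put_sales(self, i, card):
        self.transfers += 1
        self.cards[i] = card

    def take_total(self):
        self.transfers += 1
        return self.total

    def put_total(self, value):
        self.transfers += 1
        self.total = value


def total_without_scratch_pad(cabinet, location):
    """First procedure: fetch and store the total card for every match."""
    for i in range(len(cabinet.cards)):
        card = cabinet.take_sales(i)
        if card.location == location:
            total = cabinet.take_total()
            cabinet.put_total(total + card.sales)
        cabinet.put_sales(i, card)
    return cabinet.total


def total_with_scratch_pad(cabinet, location):
    """Second procedure: accumulate in S, touch the total card once."""
    s = 0  # local scratch pad
    for i in range(len(cabinet.cards)):
        card = cabinet.take_sales(i)
        if card.location == location:
            s += card.sales
        cabinet.put_sales(i, card)
    cabinet.take_total()
    cabinet.put_total(s)
    return cabinet.total


def example_cards():
    # Only the "Smith" card comes from the slide; the rest are made up.
    return [
        SalesCard("Smith", "Tahoe", 100, 1000),
        SalesCard("Jones", "California", 120, 500),
        SalesCard("Brown", "California", 90, 250),
        SalesCard("Green", "Nevada", 110, 700),
    ]


if __name__ == "__main__":
    for procedure in (total_without_scratch_pad, total_with_scratch_pad):
        cabinet = FileCabinet(example_cards())
        total = procedure(cabinet, "California")
        print(procedure.__name__, total, cabinet.transfers)
```

Both procedures arrive at the same total, but the scratch-pad version makes fewer trips to the cabinet, and the gap grows with the number of matching cards.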

Page 16: How fpgas work when they don't

The File Clerk Model - Stored Program Clerking

Clerk procedure

Fetch instruction from address PC
PC <- (PC) + 1
Do instruction

Program/Data Memory (Fibonacci.exe)

1. R2 <- 1
2. R3 <- ADD (R1) (R2)
3. R1 <- (R2)
4. R2 <- (R3)
5. R4 <- SUB 1000 (R3)
6. PC <- 8 IF (CARRY)
7. PC <- 2
8. HALT

User registers

R1 : 0
R2 : 0
R3 : 0
R4 : 0

Control register

PC : 0
CARRY : 0

Generic file clerk with instruction set
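The fetch/increment/do loop can be sketched as a small Python interpreter for the eight instructions on the slide. The tuple encoding and opcode names are hypothetical. Two further assumptions: execution starts at instruction 1 (the slide shows PC : 0 but numbers the program from 1), and CARRY is set when the subtraction 1000 - (R3) goes negative.

```python
# Hypothetical tuple encoding of the slide's program.
PROGRAM = {
    1: ("SET", "R2", 1),                # R2 <- 1
    2: ("ADD", "R3", "R1", "R2"),       # R3 <- ADD (R1) (R2)
    3: ("COPY", "R1", "R2"),            # R1 <- (R2)
    4: ("COPY", "R2", "R3"),            # R2 <- (R3)
    5: ("SUB_FROM", "R4", 1000, "R3"),  # R4 <- SUB 1000 (R3)
    6: ("JUMP_IF_CARRY", 8),            # PC <- 8 IF (CARRY)
    7: ("JUMP", 2),                     # PC <- 2
    8: ("HALT",),
}


def run_clerk(program, start_pc=1, max_steps=10_000):
    """Run the generic file clerk until HALT and return the user registers."""
    regs = {"R1": 0, "R2": 0, "R3": 0, "R4": 0}
    pc, carry = start_pc, 0
    for _ in range(max_steps):
        op, *args = program[pc]  # fetch instruction from address PC
        pc += 1                  # PC <- (PC) + 1
        # do instruction
        if op == "SET":
            regs[args[0]] = args[1]
        elif op == "ADD":
            regs[args[0]] = regs[args[1]] + regs[args[2]]
        elif op == "COPY":
            regs[args[0]] = regs[args[1]]
        elif op == "SUB_FROM":
            diff = args[1] - regs[args[2]]
            carry = 1 if diff < 0 else 0  # assumed borrow semantics
            regs[args[0]] = diff
        elif op == "JUMP_IF_CARRY":
            if carry:
                pc = args[0]
        elif op == "JUMP":
            pc = args[0]
        elif op == "HALT":
            return regs
        else:
            raise ValueError(f"unknown instruction {op!r}")
    raise RuntimeError("step limit reached without HALT")


if __name__ == "__main__":
    print(run_clerk(PROGRAM))
```

Under these assumptions the clerk steps through the Fibonacci sequence and halts once the newest term exceeds 1000, leaving the last two terms in R1 and R2.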

Page 17: How fpgas work when they don't

The File Clerk Model with deadlines

● Same model, but where results must be available at a certain deadline
● Imagine an angry office manager dictating the pace
● Claim: The time-dependence allows us to intuitively explain how computers work when they don’t
● Trick: Use sympathetic insight/empathy for our file clerk

Page 18: How fpgas work when they don't

Intuitive Explanation of Errors using the File Clerk Model

Cause | FCM equivalent | FCM effect | Real-life effect
Under-voltage | Starving clerk | Less effective clerk; more time to do the same task; unmet deadlines | Timing degradation
Overclocking | Tight deadlines | Less room for missteps | Slack reduction
Electrical noise | Office noise | Processing errors more likely; variable execution time | Lower signal integrity; probabilistic propagation delay
Device ageing | Old file clerk | Loss of wit and dexterity; more time to do the same job | Timing degradation
High temperature | Uncomfortable clerk | Harder to focus; more time to do the same job | Timing degradation
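A rough way to see how these causes meet in a single number is the setup slack of a register-to-register path. The sketch below is an assumption-laden illustration, not a vendor timing model: the function name, the single `derating` factor standing in for under-voltage, temperature and ageing, and all numeric values are hypothetical.

```python
def setup_slack(period_ns, path_delay_ns, setup_ns, derating=1.0):
    """Setup slack of one register-to-register path, in nanoseconds.

    derating > 1 models a slower clerk (under-voltage, heat, ageing);
    a smaller period_ns models a tighter deadline (overclocking).
    Negative slack means the deadline is missed (timing violation).
    """
    return period_ns - (path_delay_ns * derating + setup_ns)


if __name__ == "__main__":
    print("nominal:     ", setup_slack(10.0, 8.0, 0.5))
    print("slower clerk:", setup_slack(10.0, 8.0, 0.5, derating=1.25))
    print("overclocked: ", setup_slack(8.0, 8.0, 0.5))
```

Either a slower clerk or a tighter deadline drives the slack negative, which is the common denominator of the real-life effects listed in the table.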

Page 19: How fpgas work when they don't

Adapting the File Clerk Model for FPGAs

● Timed FCM
● Think of a really simple-minded file clerk
● Vocabulary restricted to “yes”, “no”, and “maybe”
  ○ Maybe ~ Metastability
● Instructions limited to boolean expressions: file clerks become LUTs
● Important differences:
  ○ Program is unrolled into one long pipeline
  ○ Registers and file clerks are distributed

Yes, no, maybe?

Page 20: How fpgas work when they don't

Adapting the File Clerk Model for FPGAs

● “File clerk production line”
● Information transfer is still the dominating activity
● System-level intuition about the FCM still holds

[Figure: file clerk production line. Labels: Input data, Reg, File clerks, Scratch pad, Reg, Output data]

Page 21: How fpgas work when they don't

Mechanics of Timing Errors

Q: Assuming that we have timing violations, what happens?

Q: What conditions must be met before a timing violation results in a logic error?

Q: When do we have to worry?

Page 22: How fpgas work when they don't

Sensitization Criteria

Timing violations are a necessary condition for timing errors, but not a sufficient one. The circuit must also be exercised.

FCM analogy: An idle “file clerk production line” does not make errors

[Figure: register-to-register path. Input sequence: ..., X2, X1. Output sequence: ..., Y2, Y1]

Page 23: How fpgas work when they don't

Patience solves all problems

[Figure: register-to-register path. Input sequence: ..., X, X, X, X, X. Output sequence: ..., Y, Y, #, @, ±]

By repeating the input, the output will eventually settle to the correct error-free value
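Both the sensitization criterion and the patience principle can be reproduced with a deliberately simplified timing model of a register, logic, register line. In this sketch the logic is a pure transport delay, so a late result shows up as a stale value rather than as the garbage symbols in the figure. The function names, the integer time units and the example function `f` are hypothetical.

```python
def run_line(inputs, f, delay, period, initial=0):
    """Return the values captured by the output register, one per cycle.

    inputs[n] is launched at time n * period; the output register samples
    at (n + 1) * period. The logic output at any instant reflects the input
    that was present `delay` time units earlier (pure transport delay).
    """
    if delay <= 0:
        raise ValueError("delay must be positive")
    captured = []
    for n in range(len(inputs)):
        seen_at = (n + 1) * period - delay  # instant whose input is visible
        k = seen_at // period               # index of the input seen then
        x = inputs[k] if k >= 0 else initial
        captured.append(f(x))
    return captured


def error_positions(inputs, f, delay, period, initial=0):
    """Cycles in which the captured value differs from the correct f(x)."""
    captured = run_line(inputs, f, delay, period, initial)
    return [n for n, (x, y) in enumerate(zip(inputs, captured)) if y != f(x)]


def f(x):
    # Arbitrary stand-in for the combinational logic between the registers.
    return (3 * x) & 0xF


if __name__ == "__main__":
    period = 10
    print(error_positions([1, 2, 3, 4], f, delay=8, period=period))    # timing met
    print(error_positions([1, 2, 3, 4], f, delay=15, period=period))   # violated, exercised
    print(error_positions([5, 5, 5, 5], f, delay=15, period=period, initial=5))  # violated, idle
    print(error_positions([7, 7, 7], f, delay=15, period=period))      # violated, repeated
```

With the violating delay, a changing input stream produces an error in every cycle, an idle line produces none, and a repeated input is wrong only in the first cycle before settling on the correct value.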

Page 24: How fpgas work when they don't

Two Primary Error Modes

[Figure: register-to-register path through logic function F. Input transition from X1 to X2. Output sequence: F(X2), F(X1)]

● Dynamic hazard when F(X1) != F(X2) → possible “stuck-at” error
● Static hazard when F(X1) == F(X2) → possible “bit-flip” error
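The two error modes can be illustrated by giving each input of F its own path delay and sampling the output too early. This is a minimal sketch assuming ideal per-input transport delays and a two-input XOR as F; the function names and delay values are hypothetical.

```python
def captured_value(f, x_old, x_new, delays, t_sample):
    """Value latched t_sample after the inputs switch from x_old to x_new.

    Each input reaches the logic output after its own path delay, so an
    early sample sees a mix of old and new input bits.
    """
    seen = tuple(
        new if t_sample >= d else old
        for old, new, d in zip(x_old, x_new, delays)
    )
    return f(*seen)


def xor(a, b):
    return a ^ b


if __name__ == "__main__":
    delays = (2, 4)  # the second input path is slower

    # Dynamic case: F(X1) = 0, F(X2) = 1. Sampling before any path has
    # settled captures the old value, so the output looks stuck.
    print(captured_value(xor, (0, 0), (1, 0), delays, t_sample=1))

    # Static case: F(X1) = F(X2) = 0. Sampling between the two path delays
    # sees inputs (1, 0) and captures a glitch, i.e. a flipped bit.
    print(captured_value(xor, (0, 0), (1, 1), delays, t_sample=3))

    # Sampling after the slowest path gives the correct value again.
    print(captured_value(xor, (0, 0), (1, 1), delays, t_sample=4))
```

In the dynamic case the register keeps the previous result, while in the static case the output is wrong even though it should never have changed.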

Page 25: How fpgas work when they don't

Generation of “Maybe’s”

● Register inputs must be stable during the setup and hold period (aperture).

● Unstable signals during latching → probability of metastability

● Given sufficient patience, “maybe’s” will settle to a fixed yes or no. However, there is no guarantee that the value is correct (coin flip)

● With some probability, logic hazards can result in “maybe’s”

Page 26: How fpgas work when they don't

Clock Domain Crossing

● Ubiquitous in FPGA designs
● Metastable behavior in receiving clock domain
● Critical for control signals
● Data signals are usually less critical (but it depends)
● Constant signals usually not critical (e.g. configuration signals for subsystem)

Page 27: How fpgas work when they don't

Clock Domain Crossing

Classical mitigation using synchronizer

● Decreases the probability of “maybe’s”
  ○ More levels, less probability
● No guarantee for correct signal transfer!!!
● To ensure signal integrity, the patience principle must be applied
  ○ Sig1 must be repeated
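The "more levels, less probability" point can be made quantitative with the commonly used synchronizer model MTBF = exp(t_r / tau) / (T_w * f_clk * f_data). This formula is not on the slides and is brought in here as an assumption. The sketch further assumes that each extra stage adds one clock period minus the setup time of resolution time, and every parameter value is hypothetical rather than taken from a datasheet.

```python
import math


def synchronizer_mtbf(stages, f_clk, f_data, tau, t_window, t_setup=0.0):
    """Estimated mean time between synchronizer failures, in seconds.

    stages   : number of flip-flops in the synchronizer chain (>= 2)
    f_clk    : receiving clock frequency in Hz
    f_data   : rate of asynchronous input transitions in Hz
    tau      : metastability resolution time constant in seconds
    t_window : metastability capture window in seconds
    t_setup  : setup time subtracted from each period, in seconds
    """
    if stages < 2:
        raise ValueError("a synchronizer needs at least two stages")
    period = 1.0 / f_clk
    t_resolve = (stages - 1) * (period - t_setup)  # simplifying assumption
    return math.exp(t_resolve / tau) / (t_window * f_clk * f_data)


if __name__ == "__main__":
    params = dict(f_clk=200e6, f_data=1e6, tau=50e-12,
                  t_window=100e-12, t_setup=0.1e-9)
    for n in (2, 3, 4):
        print(n, "stages:", synchronizer_mtbf(n, **params), "s")
```

Each added stage multiplies the MTBF by a large but finite factor. The failure probability shrinks without ever reaching zero, which is why the slide insists on repeating Sig1 when the transfer must be correct.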

Page 28: How fpgas work when they don't

When to worry about timing violations?

Evaluate and accept

● Some data signals
● Debug
● Configuration
● Low frequency signals relative to fclk

Evaluate and avoid

● Mitigate
  ○ Switch to level signaling
  ○ Add synchronizers
● Refactor

Page 29: How fpgas work when they don't

That’s all folks