Functional modeling style for efficient SW code generation
of video codec applications
Sang-Il Han1)2) Soo-Ik Chae1) Ahmed. A. Jerraya2)
SD Group1) SLS Group2)
Seoul National Univ., Korea TIMA laboratory, France
02/01/06 2
Summary Challenges in video codec system design
Complex High-performance within short design time Classical RTL design methodology is too slow
Requirements in video codec system design System level design methodology with functional model Explicit parallelism and conditional in functional model
Existing modeling styles Data-driven model lacks explicit conditional (ex: SDF lacks conditionals) Event-driven model lacks explicit parallelism (ex: SM lacks distributed imp.)
Proposed modeling style Abstract clock synchronous model (ACSM): An extension of CSM for RTL Both parallelism and conditional with an abstract global clock
02/01/06 3
The scope of this work System-level design methodology
Scope Modeling style to generate efficient SW code.
Target Application
TargetArchitecture
Implementationof System
Modeling
ACSM Model
Mapping
Mixed Algo./Arch. Model
AutomaticGeneration
Evaluation
Scope
02/01/06 4
ContentsIntroductionProposed modeling styleComparison with existing modeling stylesExperimentConclusion and Perspective
02/01/06 5
Target applicationMacroblock-based video codecs
MPEG-2, H.263, MPEG-4, H.264, and WMV9
A frame = WxH macroblocks
MacroblockVLD
Inverse ScanQuantization
InverseTransform + Deblocking
Filter
Video BitStream
Frame/SliceVLD
DecodedFrame Store
Spatial Compensation
Process
MultiplePrevious
Frame Store
Motion Compensation
Process
Current Frame Store
Switch
Spatial Prediction Modes
Motion Vectors
Intra/Inter MB
W
H
Block diagram of an H.264 decoder
Exploit temporal redundancy
Exploit spatial redundancy
02/01/06 6
Challenges and Requirements (1) 1) Computation Complexity
Ex: ≈ 10 Gops/sec for 4VGA decoding
Explicit parallelism Parallel and pipelined computation execution
2) Communication Complexity Ex: ≈ 500 MB/sec external memory bandwidth for 4VGA decoding
Explicit and predictable communication Parallel computation and communication execution Efficient communication scheme e.g. burst transfer
02/01/06 7
Challenges and Requirements (2) 3) Specification Complexity
Ex: Curr. MB has dependency with prev., upper., upper prev. MB
Higher level synchronization Function- and thread-level synchronization
4) Control Complexity Ex: 4x4 intra, 16x16 intra, 4x4 inter, 8x8 inter, … MB prediction
modes
Explicit data-dependent comp./comm. Sophisticated control operations e.g. memory management
02/01/06 8
Contribution Propose modeling style for video codecs
Abstract clock synchronous model (ACSM) An extension of clocked synchronous model for RTL Coarser clock: Abstract clock
1)Partially ordered dependency at function level Computation Complexity,
2)Explicit comm. channel with fixed buffer size Communication Complexity
3)Coarser clock granularity than physical clock Specification Complexity
4)Explicit conditionals through global abstract clock Control Complexity
02/01/06 9
ContentsIntroductionProposed modeling styleComparison with existing modeling stylesExperimentConclusion and Perspective
02/01/06 10
Proposed modeling styleAbstract clock synchronous model
Structure Network of state-less functions Memory elements Abstract clock
Hypothesis : Lock step execution model In every abstract clock cycle, new values propagate in the
network
F0
F1 F3
Del
ay
F2
Abstract clock
02/01/06 11
RTL model vs ACSM
Network of combinational gates Memory elements (delay) No loops in combinational logics Synch: Lock step model Synch. interval: clock
Network of state-less functions Memory elements (delay) No loops in network of functions Synch: Lock step model Synch. interval: abstract clock
*
/ +D
elay
-
clk
F0
F1 F3
Del
ay
F2
Abstract clock
RTL model ACSM
Difference: Granularity of clock and components
02/01/06 12
Basic componentsF0
i0
in
,,,
o0
om
,,, Z-K F0
F1
Z-K
(a) module (b) delay (c) arc
(d) If-action subsystem (IAS)
F1
F2
F3
If/else
…
F4 F5
…
Mer
ge
IAS1
IASi
0..3
D F0 M
(e) For-iterator subsystem (FIS)
FIS1
D
M
Demux
Mux
Data
Control
Rule on execution: If one token on all input ports, fired. Except Merge block
Rule on absent token: absent token with global abstract clock. absent token = (tag, empty value)
token = (tag, value)
02/01/06 13
Abstract clock for video codecsHierarchical iterations in video decoder and encoder
Frame level index: too coarse to express parallelism. Sub-macroblock level index: irregular delay depth Macroblock level index: proper granularity and regular
delay depth
Abstract clock for video codecs: Macroblock index
0 1
2 3
4 5
6 7
8 9
10 11
12 13
14 15
UperMB
Prev
MB
12 …
… …
20 21
… …
78 …
89 …
86 87
97 98
11
…
77
88
1 … 9 100
0 1
2 3
4 5
6 7
8 9
10 11
12 13
14 15
UperMB
Prev
MB
0 1
2 3
4 5
6 7
8 9
10 11
12 13
14 15
UperMB
Prev
MB
12 …
… …
20 21
… …
78 …
89 …
86 87
97 98
11
…
77
88
1 … 9 100
12 …
… …
20 21
… …
78 …
89 …
86 87
97 98
11
…
77
88
1 … 9 100
(a) Decoding order of macroblock (b) Decoding order of 4x4 block
Delay depth : 6
Delay depth : 166
QCIF image
02/01/06 14
Implementations of RTL model and ACSM
+
/ *
Del
ay
-
clk
F0
F1 F3
Del
ay
F2
Abstract clockre
gist
er
clk
add
div mul
sub
Abstract clock
SR
AM
IP1 IP3
F0 F2
RTL model ACSM model
16bit Carry lookahead Adder RISC processor
02/01/06 15
A simplified ACSM of H.264 decoder
+
Video BitStream
Video Out
Z-1
Z-W
Z-1
Z-W
CurrentFrame
PreviousFrames
MBVLD
Fr.VLD
8x8IQ
8x8IT
8x8DFM
0..3
Intra/Inter MB
16MVs
0..3
4 MC Types
8x8SC
8x8MC
4x4MC
4x4IF
8x8IF
D
D
D
MD
FIS1
IAS3
If/else
IAS2
IAS1
FIS1
In the full model
83 Modules 24 delays 286 arcs 43 IASs 5 FISs
02/01/06 16
ContentsIntroductionProposed modeling styleComparison with existing modeling stylesExperimentConclusion and Perspective
02/01/06 17
Synchronous Dataflow (SDF)
F4
Z-k
F1
If/else
F2
F3
Z-j
F6
Z-k
F1
If/else
F2
F3
Z-j
Fmux
(a) ACSM (b) SDF
SDF: Concurrent actors communicate with one-way FIFO channels Rule on execution: If sufficient non-zero tokens on all input ports, fired Rule on absent token: No concept of absent token. Difficulty: No explicit conditionals (!req. 4)
Redundant computation : One of F2 and F3 Redundant commutation : One of (Z-j⇒F2) and (Z-k⇒F3)
02/01/06 18
Boolean Dataflow (BDF)
F4
Z-k
F1
If/else
F2
F3
Z-j
F4
Z-k
F1
If/else
F2
F3
Z-j
SWIT
CH
SELE
CT
T
F F
T
(a) ACSM (b) BDF
BDF: Concurrent actors communicate with one-way FIFO channels Rule on execution: If sufficient non-zero tokens on all input ports, fired
Except SWITCH and SELECT blocks Rule on absent token: Absent token without global clock. Difficulty: Buffer overflow problem. (!req. 4)
Accumulate tokens between (Z-j⇒F2) and (Z-k⇒F3) To solve this problem, a lot of additional switches: 138 switches/83 modules in
case of H.264 decoder
02/01/06 19
Other modeling styles Khan Process Network (KPN)
Difficulty1: Task-level coarse granularity (!req 1) Difficulty2: Comp./comm./control mixed (!req 2, !req4)
Synchronous Dataflow with embedded control (SDF+eCtrl) Difficulty1: Coarser granularity. (!req. 1) Difficulty2: Still Redundant communication (!req. 4)
Synchronous model (SM) Rule on execution: If one present token on one or more input
ports, fired Rule on absent token: Absent tokens with virtual clock. Difficulty: distributed implementation (!req 1)
02/01/06 20
Summary of Modeling styles
Clock-drivenEvent-drivenData-drivenModel Type
√√√Explicit data-dependent op.
Abstract clkPhysic clkVirtual clksNo global clockClock
Unrestricted absent tokenDistributed
system
√√
SM
Restricted absent token with global clock
Execution ≈ Data-drivenAbsent token ≈Event-driven
No absent token (or without clock)
Explicit conditionalsAnalysis
√√
SDF+eCtrl
√√√ √Higher level synchronization
√√√√Explicit communication
√√√√Explicit fine-grain parallelism
ACSMRTLBDFSDFKPN
Requirements
Models
02/01/06 21
ContentsIntroductionProposed modeling styleComparison with existing modeling stylesExperimentConclusion and Perspective
02/01/06 22
Experiment environmentTarget application
H.264 baseline decoder Foreman QCIF format with QP=28 and IntraPeriod=5 All search modes are supported.
Target architecture Tensilica Xtensa single processor DMA + memory I/F Mapping
Frames are stored in global memory. Image fetch process is executed on DMA. All other processes are executed on Xtensa.
GlobalMem0 Proc0
Interconnection
Xtensa
localmem
DMA/Mem I/F
In Out
02/01/06 23
Experiment100
71.1 68.1
40.0
0.010.020.030.040.050.060.070.080.090.0
100.0
SDF ACSM Opt. ACSM Handedcode
Execution cycle
EtcRecITVLDChrom DFLuma DFChrom MCLuma MCTop control
100
78.1
54.0 54.0
0.010.020.030.040.050.060.070.080.090.0
100.0
SDF ACSM Opt. ACSM Handedcode
External memory access
Write ChromWrite LumaRead ChromRead Luma
100
71.1 68.1
40.0
100
78.1
54.0 54.0
H.264 decoder
C codes
Modeling
SW codeby RTW
Profile withXtensa ISS
Cycle andMemory access
ACSM (4x4)
Opt. ACSM (4x4, 8x8)
Handed code
XtensaSingle proc.
SDF(4x4 w/o IAS)
Simulink models
02/01/06 24
Experiment result: ComparisonCompared to SDF
Reduce 32% execution time Due to Redundant computation of SDF
Reduce 46% external memory access Due to redundant communication of SDF
Compared to SDF+eCtrl Similar execution time, but SDF+eCtrl limits design space. Reduce 46% external memory access
Due to redundant communication of SDF+eCtrl
Compared to BDF Buffer overflow, or a lot of switches required (138 switches/83 modules)
02/01/06 25
ContentsIntroductionContributionBasics of target application and architectureProposed modeling styleComparison with existing modeling stylesExperimentConclusion and Perspective
02/01/06 26
Conclusion and perspectiveConclusion
ACSM suitable for video codec application An high level RTL model Satisfies the requirements of functional model
Efficient SW code generation from Simulink with RTW Reduce 32% execution time than SDF Reduce 46% external memory access than SDF
Perspective SW generation tool for distributed implementation.