Implementing a RISC Multi-Core Processor Using an HLS Language – BlueSpec
Final Presentation
Liam Wigdor, Shirel Josef; Advisor: Mony Orbach
Winter Semester 2013
Department of Electrical Engineering (Electronics, Computers, Communications), Technion – Israel Institute of Technology
AGENDA
• Introduction
• BlueSpec Development Environment
• Project’s Goals
• Project’s Requirements
• Design Overview
• Design Stage 1 – Instruction Memory
• Design Stage 2 – Data Memory
• Design Stage 3 – MultiCore
• The Scalable Processor
• Benchmarks & Results
• Summary & Conclusion
Introduction
• The future of single-core scaling is gloomy
• Multiple cores can be used for parallel computing
• Multiple cores may serve as specific accelerators as well as general-purpose cores
Ecclesiastes 4:9-12 - Two are better than one
BlueSpec Development Environment
• High-level language for hardware description
• Rules – describe dynamic behavior
– Atomic; each rule fires at most once per cycle
– Can run concurrently if not conflicting
– Scheduled by BlueSpec automatically
• Module – same as an object in an object-oriented language.
• Interface – a module can be manipulated via the methods of its interface. An interface can be used by its parent module only!
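The rule semantics above can be sketched in plain Python (a toy model written for illustration only, not part of the project's code; real BlueSpec scheduling also accounts for method conflicts and rule urgency):

```python
# Toy model of BlueSpec rule semantics: each rule has a guard and an
# action; per clock cycle, every rule whose guard holds fires at most
# once, in a fixed priority order chosen by the "scheduler".

class Rule:
    def __init__(self, name, guard, action):
        self.name, self.guard, self.action = name, guard, action

def run_cycle(rules, state):
    """Fire every enabled rule once, in priority (list) order."""
    fired = []
    for rule in rules:
        if rule.guard(state):
            rule.action(state)
            fired.append(rule.name)
    return fired

# Example: a counter rule and a wrap-around rule guarded on 'count'.
state = {"count": 0}
rules = [
    Rule("increment", lambda s: s["count"] < 3,
                      lambda s: s.__setitem__("count", s["count"] + 1)),
    Rule("wrap",      lambda s: s["count"] >= 3,
                      lambda s: s.__setitem__("count", 0)),
]
for _ in range(4):        # simulate 4 clock cycles
    run_cycle(rules, state)
```

Running four cycles, the counter increments to 3, wraps to 0, and increments once more, illustrating that each rule fires at most once per cycle.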
Project’s Goals
• Main Goal:
– Implement a RISC multi-core processor using BlueSpec.
– Evaluate and analyze multi-core design performance compared to a single core.
• Derived Goals:
– Learn the BlueSpec principles, syntax, and working environment.
– Understand and use a single-core RISC processor to implement a multi-core processor.
– Validate the design at the BlueSpec level by running simple benchmark programs and comparing performance to a single core.
Project’s Requirements
• Scalable Architecture: The architecture does not depend on the number of cores.
• Shared data memory
[Diagram: single-core, dual-core, and quad-core configurations, each core connected to a shared data memory]
Baseline Processor – Single Core
The SMIPS BlueSpec code was taken from course 046004 – Architecting and Implementing Microprocessors in BlueSpec.
• 2-stage pipeline
• Data and instruction memory as sub-modules
• Includes a naïve branch predictor
Design Overview
• To achieve the project’s goals, our design consisted of 3 stages:
– Stage 1 – Instruction memory
– Stage 2 – Data memory
– Stage 3 – Multi-core
Stage 1 – Instruction Memory: Motivation
• Each core executes different instructions.
• This cannot be achieved with the I.Mem as a CPU sub-module.
• Solution: move the I.Mem out to the same hierarchy level as the CPU module.
[Diagram: Core 1 with its D.Mem and I.Mem]
Stage 1 – Instruction Memory: Implementation Method
• Modules communicate through a Get/Put interface (the CPU acts as a client, the memory as a server).
• The connect_reqs and connect_resps rules use the CPU and I.Mem interfaces to connect the requests and the responses.
[Diagram: in the test bench, the CPU’s f_out/f_in FIFOs are linked to the I.Mem’s f_in/f_out FIFOs by the connect_reqs and connect_resps rules]
Stage 1 – Instruction Memory: Latency Problem – Cycle 1
• Problem: instruction-fetch latency is 5 cycles.
• Cycle 1:
– A CPU rule enqueues the PC address, as a memory request, into f_out.
Stage 1 – Instruction Memory: Latency Problem – Cycle 2
• Cycle 2:
– connect_reqs dequeues the request from the CPU’s f_out and enqueues it into the I.Mem’s f_in FIFO.
Stage 1 – Instruction Memory: Latency Problem – Cycle 3
• Cycle 3:
– The I.Mem dequeues the request from f_in, processes it, and enqueues the response into f_out.
Stage 1 – Instruction Memory: Latency Problem – Cycle 4
• Cycle 4:
– connect_resps dequeues the response from the I.Mem’s f_out and enqueues it into the CPU’s f_in FIFO.
Stage 1 – Instruction Memory: Latency Problem – Cycle 5
• Cycle 5:
– A CPU rule dequeues the response from f_in and processes it.
Stage 1 – Instruction Memory: Solution – Overview
• Solution: use a bypass FIFO for f_in and f_out instead of a regular FIFO, allowing enqueue and dequeue in the same cycle. New latency: 1 cycle.
• doFetch executes after the response arrives.
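The five-cycle walkthrough above can be checked with a small cycle-count sketch in Python (a simplified model, not the project's BSV code): the fetch path crosses four FIFOs (CPU f_out → I.Mem f_in → I.Mem f_out → CPU f_in), and each crossing costs a cycle with a regular FIFO but nothing with a bypass FIFO.

```python
# Cycle-count sketch of the fetch path: CPU -> connect_reqs -> I.Mem
# -> connect_resps -> CPU.  With a regular FIFO, an element enqueued
# in cycle N is visible to deq only in cycle N+1; a bypass FIFO makes
# it visible in the same cycle, collapsing all four hops into one cycle.

def fetch_latency(bypass: bool) -> int:
    hops = 4                  # f_out -> f_in -> f_out -> f_in
    cycle = 1                 # cycle in which the CPU issues the request
    for _ in range(hops):
        if not bypass:
            cycle += 1        # regular FIFO: data crosses on the next cycle
        # bypass FIFO: enq and deq happen in the same cycle
    return cycle

assert fetch_latency(bypass=False) == 5   # the 5-cycle problem
assert fetch_latency(bypass=True) == 1    # the 1-cycle solution
```

Note that in hardware the bypass path becomes a combinational chain through all four FIFOs, which is the price paid for the single-cycle latency.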
Stage 2 – Data Memory: Motivation
• Each core accesses the same data memory to achieve parallelism.
• This cannot be achieved with the D.Mem as a CPU sub-module.
• Solution: move the D.Mem out to the same hierarchy level as the CPU module.
Stage 2 – Data Memory: Implementation Method
• Modules communicate through a Get/Put interface (the CPU acts as a client, the memory as a server).
• The dconnect_reqs and dconnect_resps rules use the CPU and D.Mem interfaces to connect the requests and the responses.
Stage 2 – Data Memory: Latency Problem
• A rule can fire only once per cycle.
• doExecute both initiates the memory operation and processes the response, and cannot fire twice in the same cycle.
Stage 2 – Data Memory: Suggested Solution
• Solution:
– Add a memory stage to the pipeline data path, requesting the data in the execute stage and receiving it in the memory stage.
• This solution was not implemented, as we focused on creating the multi-core processor.
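The suggested fix can be illustrated with a toy Python pipeline model (our own sketch, with invented pseudo-instructions, not the project's BSV): the load request is issued in Execute, and one cycle later the response is consumed in the new Memory stage, so the pipeline keeps advancing instead of a single rule stalling on the round trip.

```python
# Sketch of the suggested memory stage: issue the load request in
# Execute, consume the response one cycle later in Memory, modeled
# with a one-slot FIFO between the two stages.

from collections import deque

def run(program, dmem):
    """program: list of ('lw', addr) or ('alu', value) pseudo-instructions.
    Returns (results in completion order, total cycles)."""
    mem_stage = deque([None])       # one-cycle request/response slot
    results, cycles = [], 0
    for instr in program + [None]:  # extra slot to drain the pipeline
        cycles += 1
        # Memory stage: consume the response requested last cycle.
        pending = mem_stage.popleft()
        if pending is not None:
            results.append(dmem[pending])
        # Execute stage: issue this instruction's request (or compute).
        if instr is not None and instr[0] == "lw":
            mem_stage.append(instr[1])   # load request goes out now
        else:
            if instr is not None:
                results.append(instr[1]) # ALU result completes here
            mem_stage.append(None)
    return results, cycles

dmem = {0: 10, 4: 20}
res, cyc = run([("lw", 0), ("lw", 4), ("alu", 7)], dmem)
```

Three instructions finish in four cycles (one drain cycle), i.e. one load completes per cycle with no stalls, which is exactly what splitting the access across two stages buys.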
Stage 3 – Multi-Core Processor
• Connect multiple cores to their instruction memories and to the shared data memory.
• A higher-hierarchy module must be created to establish these connections.
[Diagram: Core 1 and Core 2, each with its own I.Mem, both connected to the shared data memory]
Stage 3 – Multi-Core Processor: Implementation Method
• Connections are established using dconnect_reqs and dconnect_resps between each core and the same data memory.
Stage 3 – Multi-Core Processor: Issue 1 – Scheduling
• Issue 1: the D.Mem has only one port. How can memory access be scheduled?
• Solution: BlueSpec automatically schedules the execution of the design’s rules, giving priority to the lower-numbered core.
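The fixed-priority arbitration that the compiler derives for the single memory port can be modeled in a few lines of Python (an illustrative sketch only; core ids and request encodings are ours):

```python
# Sketch of single-port D.Mem arbitration: when several cores request
# access in the same cycle, only one rule can fire, and the derived
# schedule gives priority to the lower-numbered core.

def arbitrate(requests):
    """requests: dict core_id -> request or None. Returns winning core id."""
    ready = [cid for cid, req in sorted(requests.items()) if req is not None]
    return ready[0] if ready else None

assert arbitrate({1: "lw", 2: "sw"}) == 1   # core 1 wins the port
assert arbitrate({1: None, 2: "sw"}) == 2   # core 2 proceeds when core 1 is idle
```

A fixed-priority scheme like this is simple but unfair under heavy contention; the benchmark results later in the deck reflect that only one core can use the port per cycle.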
Stage 3 – Multi-Core Processor: Issue 2 – Response Path
• Issue 2:
– The connection rules constantly try to fire.
– We need to ensure that the CPU which accessed the memory obtains the response, and not another core.
Stage 3 – Multi-Core Processor: Issue 3 – Performance
• Issue 3:
– When simulating the processor, 2 cores were unable to operate together, resulting in poor performance.
Stage 3 – Multi-Core Processor: Issue 3 – Debugging
– Using the BlueSpec tools, we observed that dconnect_resps_core2 was blocked by dconnect_resps_core1.
– Therefore, core 2’s execute stage was blocked while core 1 operated.
Stage 3 – Multi-Core Processor: Issues 2 & 3 – Solution
– The get_response interface in the D.Mem was:
– Due to f_out.deq, only one core could obtain a response, and it blocked all other cores because the D.Mem’s f_out FIFO was then empty.
– The get_response interface was changed to:
Stage 3 – Multi-Core Processor: Change in D.Mem – Step 1
– Step 1: sendMessage enqueues the response that was prepared in the previous cycle.
[Diagram: D.Mem with its f_in/f_out FIFOs, the sendMessage and dMemoryResponse rules, and the dconnect_reqs and dconnect_resps connection rules]
Stage 3 – Multi-Core Processor: Change in D.Mem – Step 2
– Step 2: the connection uses fifo.first (it does not dequeue f_out), and a new request arrives.
Stage 3 – Multi-Core Processor: Change in D.Mem – Step 3
– Step 3: dMemoryResponse prepares the new response and dequeues the response that was sent at the beginning of the cycle.
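The response-path fix can be sketched in Python (a toy model; the core-id tag and names are our illustration, not the slides' BSV): each response carries the id of the requesting core, every core's dconnect_resps rule peeks at the FIFO head with first (no dequeue), and only the owning core consumes it, so one core can no longer starve the others.

```python
# Sketch of the response-path fix: D.Mem tags each response with the
# id of the requesting core.  Every core's dconnect_resps rule peeks
# at f_out with .first (no deq); only the owning core dequeues.

from collections import deque

f_out = deque()                       # D.Mem response FIFO
f_out.append(("core2", 0xBEEF))       # response destined for core 2

def dconnect_resps(core_id):
    """Fires for every core; delivers only the core's own response."""
    if not f_out:
        return None
    owner, data = f_out[0]            # fifo.first: peek, don't dequeue
    if owner != core_id:
        return None                   # not ours; leave it for the owner
    f_out.popleft()                   # owner dequeues its response
    return data

assert dconnect_resps("core1") is None     # core 1 does not steal it
assert dconnect_resps("core2") == 0xBEEF   # core 2 obtains its data
assert not f_out                           # response consumed exactly once
```

This mirrors the three-step D.Mem change above: the response stays at the head of f_out (first) until its owner validates possession and dequeues it.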
Stage 3 – Multi-Core Processor: Parallel Execution
• Two cores execute instructions simultaneously, sharing the same data memory.
The Scalable Processor
• 3 easy steps are required to add a core:
– Step 1: create a new instruction memory.
– Step 2: connect the core to the data and instruction memories.
– Step 3: add a monitoring mechanism for the core.
• The architecture is independent of the number of cores.
Benchmark 1 – Description
• Benchmark 1 – a purely computational program
– No memory instructions
– Pure parallelism, since there is no blocking

Benchmark 1 – Results
– Results:
– With no memory instructions, all cores work independently and simultaneously.
– The results match the concept of multi-core: 8 cores do the same “job” as 1 core in 1/8 of the time.
Benchmark 2 – Description
• Benchmark 2 – short image processing
– Input: a 32×32 binary image
– Output: the inverted image
– Uses memory instructions

Benchmark 2 – Example
– Image-processing result:

Benchmark 2 – Results
– Results:
– 2 cores managed to double performance.
– The improvement for 4 and 8 cores declined, as predicted by the law of diminishing returns.
– The gap between memory instructions was large enough for 2 cores to operate with a phase difference, allowing each core to access the memory without blocking the other.
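The phase-difference observation can be illustrated with a toy contention model in Python (our own back-of-the-envelope sketch, not the project's measured data): if each core issues one memory access every `gap` cycles and the shared memory serves one access per cycle, then up to `gap` cores can interleave without blocking, and speedup flattens beyond that.

```python
# Toy contention model: n_cores each perform `accesses_per_core` memory
# accesses, one every `gap` cycles; the single-port memory serves one
# access per cycle.  Runtime is bounded by whichever is larger: one
# core's own schedule, or the memory's total service time.

def total_cycles(n_cores, gap, accesses_per_core):
    work = accesses_per_core * gap            # one core's ideal runtime
    total_accesses = n_cores * accesses_per_core
    return max(work, total_accesses)          # memory: 1 access/cycle

gap, per_core = 2, 100
single = total_cycles(1, gap, per_core)
# Speedup over one core doing all n cores' work sequentially:
speedups = {n: single * n / total_cycles(n, gap, per_core)
            for n in (1, 2, 4, 8)}
```

With a gap of 2, exactly 2 cores reach the ideal 2x speedup and additional cores gain nothing, matching the shape (though not the exact numbers) of the benchmark 2 result.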
Benchmarks 3/4 – Description
• Benchmarks 3/4 – purely memory-accessing programs
– Mostly SW (store) instructions or LW (load) instructions
– SW is a “fire and forget” instruction, whereas a load instruction waits for its response.

Benchmarks 3/4 – Results
– Results:
– A single core allocates cycles to computation, during which the memory is idle.
– With multiple cores, some cores execute computation instructions while others execute memory instructions, maximizing memory utilization.
Benchmark 5 – Description
• Benchmark 5 – long image processing
– Input: a 32×32 binary image
– Output: the inverted image
– Uses memory instructions
– However, the processing part takes longer than in benchmark 2.
– Motivation: a larger gap between memory instructions.

Benchmark 5 – Results
– Results:
– As predicted in benchmark 2, the larger gap between memory instructions resulted in greater performance for the quad-core configuration.
– The larger the gap, the more cores can operate in different phases without being blocked by other cores’ memory accesses.
Summary & Conclusion
• The design included 3 stages:
– Stage 1 – Instruction memory
– Stage 2 – Data memory
– Stage 3 – Multi-core
• The scalability and shared-data-memory requirements were achieved.
• Multi-core increases data-memory utilization (shown in benchmarks 3/4).
Summary & Conclusion
• The number of cores should be chosen with regard to the executed program.
• Using a multi-core processor can enhance performance, but past a certain number of cores, adding more cores will not improve performance further.
Summary & Conclusion – BlueSpec
• Pros:
– High abstraction level of the design – easier to focus on the goal.
– Automatic scheduling of module interactions.
– High-level language – more human-readable.
• Cons:
– Hard to optimize – understanding the automatic scheduling mechanism takes time.
– Deciphering scheduling errors and warnings is difficult.
– Lack of a “knowledge base”.
Summary & Conclusion – FAQ
• Problem: each core executes the same instructions.
• Solution: move the I.Mem out to the same hierarchy level as the CPU module.

• Problem: the Client/Server interface latency is 5 cycles.
• Solution: use a bypass FIFO instead of a regular FIFO.

• Problem: load-instruction latency cannot be 1 cycle, even when using a bypass FIFO.
• Solution: add a memory stage to the pipeline data path, requesting the data in the execute stage and receiving it in the memory stage. (Not implemented)
Summary & Conclusion – FAQ
• Problem: the D.Mem has only one port. How can memory access be scheduled?
• Solution: BlueSpec automatically schedules the execution of the design’s rules, giving priority to the lower-numbered core.

• Problem: we need to ensure that the CPU which accessed the memory obtains the response, and not another core.
• Solution: change the interface so that every core can receive the response and validate possession of it.
Future Project Possibilities: What’s Next – MultiCore 2.0
– Verify the design on hardware.
– Add a memory stage to reduce load latency:
• send the request in the execute stage and receive the response in the memory stage.
– Implement a cache to reduce memory accesses.
– Implement a multi-port data memory.
– Design a mechanism for memory coherence.
As BlueSpec’s alluring advertisement says: