Implementing a RISC Multi-Core Processor Using an HLS Language – BlueSpec
Final Presentation
Liam Wigdor, Shirel Josef; Advisor: Mony Orbach
Winter Semester 2013
Department of Electrical Engineering (Electronics, Computers, Communications), Technion – Israel Institute of Technology
AGENDA
• Introduction
• BlueSpec Development Environment
• Project’s Goals
• Project’s Requirements
• Design Overview
• Design Stage 1 – Instruction Memory
• Design Stage 2 – Data Memory
• Design Stage 3 – MultiCore
• The Scalable Processor
• Benchmarks & Results
• Summary & Conclusion
Introduction
• The future of single-core scaling is gloomy
• Multiple cores can be used for parallel computing
• Multiple cores may serve as specific accelerators as well as general-purpose cores
Ecclesiastes 4:9-12 - Two are better than one
BlueSpec Development Environment
• High-level language for hardware description
• Rules – describe dynamic behavior
– Atomic; each rule fires at most once per cycle
– Can run concurrently if not conflicting
– Scheduled by BlueSpec automatically
• Module – same as an object in an object-oriented language.
• Interface – a module can be manipulated via the methods of its interface. An interface can be used by its parent module only!
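The rule semantics above can be sketched in plain Python (a toy model written for illustration only, not part of the project's code; real BlueSpec scheduling also accounts for method conflicts and rule urgency):

```python
# Toy model of BlueSpec rule semantics: each rule has a guard and an
# action; per clock cycle, every rule whose guard holds fires at most
# once, in a fixed priority order chosen by the "scheduler".

class Rule:
    def __init__(self, name, guard, action):
        self.name, self.guard, self.action = name, guard, action

def run_cycle(rules, state):
    """Fire every enabled rule once, in priority (list) order."""
    fired = []
    for rule in rules:
        if rule.guard(state):
            rule.action(state)
            fired.append(rule.name)
    return fired

# Example: a counter rule and a wrap-around rule guarded on 'count'.
state = {"count": 0}
rules = [
    Rule("increment", lambda s: s["count"] < 3,
                      lambda s: s.__setitem__("count", s["count"] + 1)),
    Rule("wrap",      lambda s: s["count"] >= 3,
                      lambda s: s.__setitem__("count", 0)),
]
for _ in range(4):        # simulate 4 clock cycles
    run_cycle(rules, state)
```

Running four cycles, the counter increments to 3, wraps to 0, and increments once more, illustrating that each rule fires at most once per cycle.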
Project’s Goals
• Main Goal:
– Implement a RISC multi-core processor using BlueSpec.
– Evaluate and analyze multi-core design performance compared to a single core.
• Derived Goals:
– Learn the BlueSpec principles, syntax, and working environment.
– Understand and use a single-core RISC processor to implement a multi-core processor.
– Validate the design at the BlueSpec level by running simple benchmark programs and comparing performance to a single core.
Project’s Requirements
• Scalable Architecture: The architecture does not depend on the number of cores.
• Shared data memory
[Diagram: single-core, dual-core, and quad-core configurations, each core connected to a shared data memory]
Baseline Processor – Single Core
The SMIPS BlueSpec code was taken from course 046004 – Architecting and Implementing Microprocessors in BlueSpec.
• 2-stage pipeline
• Data and instruction memory as sub-modules
• Includes a naïve branch predictor
Design Overview
• To achieve the project’s goals, our design consisted of 3 stages:
– Stage 1 – Instruction memory
– Stage 2 – Data memory
– Stage 3 – Multi-core
Stage 1 – Instruction Memory: Motivation
• Each core executes different instructions.
• This cannot be achieved with the I.Mem as a CPU sub-module.
• Solution: move the I.Mem out to the same hierarchy level as the CPU module.
[Diagram: Core 1 with its D.Mem and I.Mem]
Stage 1 – Instruction Memory: Implementation Method
• Modules communicate through a Get/Put interface (the CPU acts as a client, the memory as a server).
• The connect_reqs and connect_resps rules use the CPU and I.Mem interfaces to connect the requests and the responses.
[Diagram: in the test bench, the CPU’s f_out/f_in FIFOs are linked to the I.Mem’s f_in/f_out FIFOs by the connect_reqs and connect_resps rules]
Stage 1 – Instruction Memory: Latency Problem – Cycle 1
• Problem: instruction-fetch latency is 5 cycles.
• Cycle 1:
– A CPU rule enqueues the PC address, as a memory request, into f_out.
Stage 1 – Instruction Memory: Latency Problem – Cycle 2
• Cycle 2:
– connect_reqs dequeues the request from the CPU’s f_out and enqueues it into the I.Mem’s f_in FIFO.
Stage 1 – Instruction Memory: Latency Problem – Cycle 3
• Cycle 3:
– The I.Mem dequeues the request from f_in, processes it, and enqueues the response into f_out.
Stage 1 – Instruction Memory: Latency Problem – Cycle 4
• Cycle 4:
– connect_resps dequeues the response from the I.Mem’s f_out and enqueues it into the CPU’s f_in FIFO.
Stage 1 – Instruction Memory: Latency Problem – Cycle 5
• Cycle 5:
– A CPU rule dequeues the response from f_in and processes it.
Stage 1 – Instruction Memory: Solution – Overview
• Solution: use a bypass FIFO for f_in and f_out instead of a regular FIFO, allowing enqueue and dequeue in the same cycle. New latency: 1 cycle.
• doFetch executes after the response arrives.
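The five-cycle walkthrough above can be checked with a small cycle-count sketch in Python (a simplified model, not the project's BSV code): the fetch path crosses four FIFOs (CPU f_out → I.Mem f_in → I.Mem f_out → CPU f_in), and each crossing costs a cycle with a regular FIFO but nothing with a bypass FIFO.

```python
# Cycle-count sketch of the fetch path: CPU -> connect_reqs -> I.Mem
# -> connect_resps -> CPU.  With a regular FIFO, an element enqueued
# in cycle N is visible to deq only in cycle N+1; a bypass FIFO makes
# it visible in the same cycle, collapsing all four hops into one cycle.

def fetch_latency(bypass: bool) -> int:
    hops = 4                  # f_out -> f_in -> f_out -> f_in
    cycle = 1                 # cycle in which the CPU issues the request
    for _ in range(hops):
        if not bypass:
            cycle += 1        # regular FIFO: data crosses on the next cycle
        # bypass FIFO: enq and deq happen in the same cycle
    return cycle

assert fetch_latency(bypass=False) == 5   # the 5-cycle problem
assert fetch_latency(bypass=True) == 1    # the 1-cycle solution
```

Note that in hardware the bypass path becomes a combinational chain through all four FIFOs, which is the price paid for the single-cycle latency.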
Stage 2 – Data Memory: Motivation
• Each core accesses the same data memory to achieve parallelism.
• This cannot be achieved with the D.Mem as a CPU sub-module.
• Solution: move the D.Mem out to the same hierarchy level as the CPU module.
Stage 2 – Data Memory: Implementation Method
• Modules communicate through a Get/Put interface (the CPU acts as a client, the memory as a server).
• The dconnect_reqs and dconnect_resps rules use the CPU and D.Mem interfaces to connect the requests and the responses.
Stage 2 – Data Memory: Latency Problem
• A rule can fire only once per cycle.
• doExecute both initiates the memory operation and processes the response, and cannot fire twice in the same cycle.
Stage 2 – Data Memory: Suggested Solution
• Solution:
– Add a memory stage to the pipeline data path, requesting the data in the execute stage and receiving it in the memory stage.
• This solution was not implemented, as we focused on creating the multi-core processor.
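The suggested fix can be illustrated with a toy Python pipeline model (our own sketch, with invented pseudo-instructions, not the project's BSV): the load request is issued in Execute, and one cycle later the response is consumed in the new Memory stage, so the pipeline keeps advancing instead of a single rule stalling on the round trip.

```python
# Sketch of the suggested memory stage: issue the load request in
# Execute, consume the response one cycle later in Memory, modeled
# with a one-slot FIFO between the two stages.

from collections import deque

def run(program, dmem):
    """program: list of ('lw', addr) or ('alu', value) pseudo-instructions.
    Returns (results in completion order, total cycles)."""
    mem_stage = deque([None])       # one-cycle request/response slot
    results, cycles = [], 0
    for instr in program + [None]:  # extra slot to drain the pipeline
        cycles += 1
        # Memory stage: consume the response requested last cycle.
        pending = mem_stage.popleft()
        if pending is not None:
            results.append(dmem[pending])
        # Execute stage: issue this instruction's request (or compute).
        if instr is not None and instr[0] == "lw":
            mem_stage.append(instr[1])   # load request goes out now
        else:
            if instr is not None:
                results.append(instr[1]) # ALU result completes here
            mem_stage.append(None)
    return results, cycles

dmem = {0: 10, 4: 20}
res, cyc = run([("lw", 0), ("lw", 4), ("alu", 7)], dmem)
```

Three instructions finish in four cycles (one drain cycle), i.e. one load completes per cycle with no stalls, which is exactly what splitting the access across two stages buys.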
Stage 3 – Multi-Core Processor
• Connect multiple cores to their instruction memories and to the shared data memory.
• A higher-hierarchy module must be created to establish these connections.
[Diagram: Core 1 and Core 2, each with its own I.Mem, both connected to the shared data memory]
Stage 3 – Multi-Core Processor: Implementation Method
• Connections are established using dconnect_reqs and dconnect_resps between each core and the same data memory.
Stage 3 – Multi-Core Processor: Issue 1 – Scheduling
• Issue 1: the D.Mem has only one port. How can memory access be scheduled?
• Solution: BlueSpec automatically schedules the execution of the design’s rules, giving priority to the lower-numbered core.
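The fixed-priority arbitration that the compiler derives for the single memory port can be modeled in a few lines of Python (an illustrative sketch only; core ids and request encodings are ours):

```python
# Sketch of single-port D.Mem arbitration: when several cores request
# access in the same cycle, only one rule can fire, and the derived
# schedule gives priority to the lower-numbered core.

def arbitrate(requests):
    """requests: dict core_id -> request or None. Returns winning core id."""
    ready = [cid for cid, req in sorted(requests.items()) if req is not None]
    return ready[0] if ready else None

assert arbitrate({1: "lw", 2: "sw"}) == 1   # core 1 wins the port
assert arbitrate({1: None, 2: "sw"}) == 2   # core 2 proceeds when core 1 is idle
```

A fixed-priority scheme like this is simple but unfair under heavy contention; the benchmark results later in the deck reflect that only one core can use the port per cycle.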
Stage 3 – Multi-Core Processor: Issue 2 – Response Path
• Issue 2:
– The connection rules constantly try to fire.
– We need to ensure that the CPU which accessed the memory obtains the response, and not another core.
Stage 3 – Multi-Core Processor: Issue 3 – Performance
• Issue 3:
– When simulating the processor, 2 cores were unable to operate together, resulting in poor performance.
Stage 3 – Multi-Core Processor: Issue 3 – Debugging
– Using the BlueSpec tools, we observed that dconnect_resps_core2 was blocked by dconnect_resps_core1.
– Therefore, core 2’s execute stage was blocked while core 1 operated.
Stage 3 – Multi-Core Processor: Issues 2 & 3 – Solution
– The get_response interface in the D.Mem was:
– Due to f_out.deq, only one core could obtain a response, and it blocked all other cores because the D.Mem’s f_out FIFO was then empty.
– The get_response interface was changed to:
Stage 3 – Multi-Core Processor: Change in D.Mem – Step 1
– Step 1: sendMessage enqueues the response that was prepared in the previous cycle.
[Diagram: D.Mem with its f_in/f_out FIFOs, the sendMessage and dMemoryResponse rules, and the dconnect_reqs and dconnect_resps connection rules]
Stage 3 – Multi-Core Processor: Change in D.Mem – Step 2
– Step 2: the connection uses fifo.first (it does not dequeue f_out), and a new request arrives.
Stage 3 – Multi-Core Processor: Change in D.Mem – Step 3
– Step 3: dMemoryResponse prepares the new response and dequeues the response that was sent at the beginning of the cycle.
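The response-path fix can be sketched in Python (a toy model; the core-id tag and names are our illustration, not the slides' BSV): each response carries the id of the requesting core, every core's dconnect_resps rule peeks at the FIFO head with first (no dequeue), and only the owning core consumes it, so one core can no longer starve the others.

```python
# Sketch of the response-path fix: D.Mem tags each response with the
# id of the requesting core.  Every core's dconnect_resps rule peeks
# at f_out with .first (no deq); only the owning core dequeues.

from collections import deque

f_out = deque()                       # D.Mem response FIFO
f_out.append(("core2", 0xBEEF))       # response destined for core 2

def dconnect_resps(core_id):
    """Fires for every core; delivers only the core's own response."""
    if not f_out:
        return None
    owner, data = f_out[0]            # fifo.first: peek, don't dequeue
    if owner != core_id:
        return None                   # not ours; leave it for the owner
    f_out.popleft()                   # owner dequeues its response
    return data

assert dconnect_resps("core1") is None     # core 1 does not steal it
assert dconnect_resps("core2") == 0xBEEF   # core 2 obtains its data
assert not f_out                           # response consumed exactly once
```

This mirrors the three-step D.Mem change above: the response stays at the head of f_out (first) until its owner validates possession and dequeues it.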
Stage 3 – Multi-Core Processor: Parallel Execution
• Two cores execute instructions simultaneously, sharing the same data memory.
The Scalable Processor
• 3 easy steps are required to add a core:
– Step 1: create a new instruction memory.
– Step 2: connect the core to the data and instruction memories.
– Step 3: add a monitoring mechanism for the core.
• The architecture is independent of the number of cores.
Benchmark 1 – Description
• Benchmark 1 – a purely computational program
– No memory instructions
– Pure parallelism, since there is no blocking

Benchmark 1 – Results
– Results:
– With no memory instructions, all cores work independently and simultaneously.
– The results match the concept of multi-core: 8 cores do the same “job” as 1 core in 1/8 of the time.
Benchmark 2 – Description
• Benchmark 2 – short image processing
– Input: a 32×32 binary image
– Output: the inverted image
– Uses memory instructions

Benchmark 2 – Example
– Image-processing result:

Benchmark 2 – Results
– Results:
– 2 cores managed to double performance.
– The improvement for 4 and 8 cores declined, as predicted by the law of diminishing returns.
– The gap between memory instructions was large enough for 2 cores to operate with a phase difference, allowing each core to access the memory without blocking the other.
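The phase-difference observation can be illustrated with a toy contention model in Python (our own back-of-the-envelope sketch, not the project's measured data): if each core issues one memory access every `gap` cycles and the shared memory serves one access per cycle, then up to `gap` cores can interleave without blocking, and speedup flattens beyond that.

```python
# Toy contention model: n_cores each perform `accesses_per_core` memory
# accesses, one every `gap` cycles; the single-port memory serves one
# access per cycle.  Runtime is bounded by whichever is larger: one
# core's own schedule, or the memory's total service time.

def total_cycles(n_cores, gap, accesses_per_core):
    work = accesses_per_core * gap            # one core's ideal runtime
    total_accesses = n_cores * accesses_per_core
    return max(work, total_accesses)          # memory: 1 access/cycle

gap, per_core = 2, 100
single = total_cycles(1, gap, per_core)
# Speedup over one core doing all n cores' work sequentially:
speedups = {n: single * n / total_cycles(n, gap, per_core)
            for n in (1, 2, 4, 8)}
```

With a gap of 2, exactly 2 cores reach the ideal 2x speedup and additional cores gain nothing, matching the shape (though not the exact numbers) of the benchmark 2 result.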
Benchmarks 3/4 – Description
• Benchmarks 3/4 – purely memory-accessing programs
– Mostly SW (store) instructions or LW (load) instructions
– SW is a “fire and forget” instruction, whereas a load instruction waits for its response.

Benchmarks 3/4 – Results
– Results:
– A single core allocates cycles to computation, during which the memory is idle.
– With multiple cores, some cores execute computation instructions while others execute memory instructions, maximizing memory utilization.
Benchmark 5 – Description
• Benchmark 5 – long image processing
– Input: a 32×32 binary image
– Output: the inverted image
– Uses memory instructions
– However, the processing part takes longer than in benchmark 2.
– Motivation: a larger gap between memory instructions.

Benchmark 5 – Results
– Results:
– As predicted in benchmark 2, the larger gap between memory instructions resulted in greater performance for the quad-core configuration.
– The larger the gap, the more cores can operate in different phases without being blocked by other cores’ memory accesses.
Summary & Conclusion
• The design included 3 stages:
– Stage 1 – Instruction memory
– Stage 2 – Data memory
– Stage 3 – Multi-core
• The scalability and shared-data-memory requirements were achieved.
• Multi-core increases data-memory utilization (shown in benchmarks 3/4).
Summary & Conclusion
• The number of cores should be chosen with regard to the executed program.
• Using a multi-core processor can enhance performance, but past a certain number of cores, adding more cores will not improve performance further.
Summary & Conclusion – BlueSpec
• Pros:
– High abstraction level of the design – easier to focus on the goal.
– Automatic scheduling of module interactions.
– High-level language – more human-readable.
• Cons:
– Hard to optimize – understanding the automatic scheduling mechanism takes time.
– Deciphering scheduling errors and warnings is difficult.
– Lack of a “knowledge base”.
Summary & Conclusion – FAQ
• Problem: each core executes the same instructions.
• Solution: move the I.Mem out to the same hierarchy level as the CPU module.

• Problem: the Client/Server interface latency is 5 cycles.
• Solution: use a bypass FIFO instead of a regular FIFO.

• Problem: load-instruction latency cannot be 1 cycle, even when using a bypass FIFO.
• Solution: add a memory stage to the pipeline data path, requesting the data in the execute stage and receiving it in the memory stage. (Not implemented)
Summary & Conclusion – FAQ
• Problem: the D.Mem has only one port. How can memory access be scheduled?
• Solution: BlueSpec automatically schedules the execution of the design’s rules, giving priority to the lower-numbered core.

• Problem: we need to ensure that the CPU which accessed the memory obtains the response, and not another core.
• Solution: change the interface so that every core can receive the response and validate possession of it.
Future Project Possibilities: What’s Next – MultiCore 2.0
– Verify the design on hardware.
– Add a memory stage to reduce load latency:
• send the request in the execute stage and receive the response in the memory stage.
– Implement a cache to reduce memory accesses.
– Implement a multi-port data memory.
– Design a mechanism for memory coherence.
As BlueSpec’s alluring advertisement says: