Exercise 6: PULP Programming - ETH Zgmichi/asocd_2015/exercises/ex_06.pdf · Integrated Systems Laboratory Exercise 6: PULP Programming. Introduction to the PULP Computing Platform

Integrated Systems Laboratory

Exercise 6: PULP Programming

Introduction to the PULP Computing Platform

Antonio PulliniMichael Gautschi

19.05.2015


How efficient do we need to be?

[RuchIBM11]

1012ops/J↓

1pJ/op↓

1GOPS/mW

2


Minimum energy operation

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 1.1 1.2

Total EnergyLeakage EnergyDynamic Energy

0.55 0.55 0.55 0.55 0.6 0.7 0.8 0.9 1 1.1 1.2

Ener

gy/C

ycle

(nJ)

32nm CMOS, 25oC

4.7X

Logic Vcc / Memory Vcc (V)

Source: Vivek De, INTEL – Date 2013

Near-Threshold Computing (NTC): 1. Don’t waste energy pushing devices in strong inversion

2. Recover performance with parallel execution

3


PULP2 ARCHITECTURE

4


PULP ClusterMCHAN

Private non blocking per-core command queues Ultra low latency micro-coded programming interface Support for multiple out-of-order outstanding transactions

TCDM interconnect Single-cycle access to the

shared tightly coupled data memory

Interleaved addressing to reduce memory contention

Up to 16 processing elements

Private I$ No D$

Shared L1 multi-banked data memory (TCDM)

Design choices– I$ high code locality & simple architecture– No D$ low locality & high complexity– Multibank smaller energy per access, “almost” multiported


PULP3 ARCHITECTURE

6


OR10N Architecture Overview• 4 stage integer pipeline supporting the basic OpenRISC instruction set

– Single cycle memory access– Single cycle multiplication/macs– Non-aligned memory access in two cycles– Hardware loop, pre-/post-increment, and vector support

7


SHARED ICACHE

NxM Low latency – read only Interconnect: (OPTIONAL → MULTICAST)

I$ I$ I$

Mx1 AXI 4 node

128b L0 Buffers

1KB 4Way Set associative cachebanks

From core Fetch unitCore1

CoreN

To AXI DC FIFO

. . .

. . .

8


19.05.2015 9

What will we do today?

• Try simple benchmark to see how system works, how to run code, how to get outputs and how to interact with the simulator

• Implement a simple hw semaphore and use it to protect access to shared memory

• Study a matrix multiply example and understand it’s weakness

• Improve the matrix multiply benchmark using all cores, DMA transfers and TCDM memory


10

Clust0

0xFFFF_FFFF

0x2000_0000

Clusters

0x1000_0000

0x0000_0000

0x1040_0000

0x1C00_0000L2

Test&Set Space

Load/Store Space0x1000_0000

0x1010_0000

0x1020_0000

0x1021_0000

0x1022_0000

0x1040_0000

0x1023_0000

0x1024_0000

0x103F_0000

0x1030_0000Demux Mapped

Reserved

0x1025_0000

0x1026_0000

0x1027_0000

0x1028_0000

0x102A_0000

0x102E_0000

…0x102B_0000

MMU

PIPESTAGES

TIMER

BBMUX/CKGATE

EOC UNIT

DMA Command Queue 0x1031_100A

0x1030_0000

0x1030_0004

0x1031_0000

Core ID

…

Cluster ID

DMA Synch Reg

…

0x100F_1000

0x1031_1004

0x1030_0008

SoCPeripherals

0x1A10_0000

0x1A00_0000

0x1000_0000

0x1040_0000

0x1A00_0000

0x1A01_0000

0x1A03_0000

ROM

SPI S0x1A02_0000

0x1A10_0000

CVP/FLL

CLK DIVIDER

…

0x1A04_0000STDOUT

0x1A05_0000

PULP2 MEMORY MAP


19.05.2015 11

Getting Started

• Copy data from master account:$ mkdir pulp_ex$ cp /home/soc_master/pulp_ex/pulp_ex.tar.gz pulp_ex/.

• Extract the tar.gz and build the platform:$ cd pulp_ex$ tar –xzvf pulp.tar.gz .$ source setup/setup.csh (this will make the OpenRISC compiler available)$ cd fe/ips/$ tcsh (switch to tcsh console)$ source scripts/vcompile_libs.tcl (Builds the IPs library)$ cd ../sim/$ source scripts/vcompile_work.tcl (Builds the toplevel+testbench)


19.05.2015 12

Running Helloworld!

• Compile system and string libraries once in the beginning!$ cd sw/apps/helloworld$ make sys_lib$ make string_lib

• Compile the helloworld application with the OpenRISC compiler:$ make helloworld.slm

• Start modelsim and run a simulation!$ cd ../../../fe/sim$ vsim-10.2a$Modelsim$ source tcl_files/run.tcl

• Check the output$ less stdout/stdout_rvt_pe{0-3} (one qprintf() output file of each core)

• Each core should output “Hello world!!!”


19.05.2015 13

System Verilog Basics

• C-like hardware-description language

• In HDL coding the data type logic should be used!SV: VHDL:

1bit signal: logic std_logicN-bit vector: logic [N-1:0] std_logic_vector(N-1 downto 0)

• System Verilog uses modules (entities in VHDL)

module myUnit#(

parameter WIDTH = 0)

(input logic clk_i,input logic rst_ni,input logic [WIDTH-1:0] data_di,output logic [WIDTH-1:0] data_do

);

<< Module body >>

endmodule

Module Declaration

…myUnit#(

.WIDTH(32))myUnit_i(

.clk_i(clk),

.rst_ni(rst),

.data_di(data_di),

.data_do(data_do));…

Module Instantiation


19.05.2015 14

Sequential and combinational logic in System Verilog

…logic enable;enum logic {idle, write, read} CS, NS;…assign enable = 1’b1;…always_ff @(posedge clk_i, negedge rst_ni)begin

if (~rst_ni) beginData_DP <= 32’b0;CS <= idle;

endelse if (enable) begin

Data_DP <= Data_DN;CS <= NS;

endend…

Flip-Flop (with enable and active low reset)…always_combbegin

NS = CS;Data_DO = Data_DP;

case (CS)idle:begin

NS = write;Data_DO = 32’b0;

end…default:

NS = idle;endcase…

end…

Combinational process with a state machine

• Flip-flops– Use always_ff construct– Use non-blocking assignments ( <= )

• Combinational logic– Use always_comb construct– Or simple assignments assign = …;– Use blocking assignments ( = )


19.05.2015 15

EX1: Implement HW semaphore

• Setup:– cp /home/soc_master/pulp_ex/pulp_ex_hwsem.tar.gz ./ – tar -zxvf pulp_ex_hwsem.tar.gz– cp pulp_ex_hwsem/*.sv fe/rtl/ulpsoc/ – cp pulp_ex_hwsem/*.do fe/sim/scripts– Recompile toplevel

• Task 1:– Analyze code of axi_hwsem.sv and add the hw semaphores. Now there

are 2 simple 1 bit registers. Need to modify it so that when a read occurs and the value of the reg is 0 the value is automatically updated to 1. The value should return to 0 only when writing a 0.

• Task 2:– Add a simple function to check the semaphore: It should poll the register

untill a 0 is read– Add a function to set back the semaphore value to 0– Add a program to test the semaphore


19.05.2015 16

EX2: Matrix Multiplication

• Task 1:– Go to sw/apps/matrixMul/– Study the matrixMul.c code– Compile it with “make matrixMul.slm”– Estimate the number of cycles to complete the matrix multiplication by doing

some back-of-the-envelope calculations.– Run the simulation & check how many cycles it actually took!

• Task 2:– Analyze where the losses come from and use the presented functionality of the

cluster to speedup the matrix multiplication.– You might want to use the following features in your C-code:

• Multiple processing cores• Fast TCDM memory• Synchronization barriers• DMA

– You can use the qprintf() function for debugging– In matrixMul.h you can choose different matrix sizes. Start your optimizations

with a small 8x8 matrix


19.05.2015 17

PULP – System Functions

• Timer

• DMA

• qprintf()

• dma_barrier()

• synch_barrirer()


19.05.2015 18

Q&A

Documents

Exercise 6: PULP Programming - ETH Zgmichi/asocd_2015/exercises/ex_06.pdf · Integrated Systems Laboratory Exercise 6: PULP Programming. Introduction to the PULP Computing Platform