Upload
others
View
3
Download
0
Embed Size (px)
Citation preview
Integrated Systems Laboratory
Exercise 6: PULP Programming
Introduction to the PULP Computing Platform
Antonio PulliniMichael Gautschi
19.05.2015
Integrated Systems Laboratory
How efficient do we need to be?
[RuchIBM11]
1012ops/J↓
1pJ/op↓
1GOPS/mW
2
Integrated Systems Laboratory
Minimum energy operation
0.0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 1.1 1.2
Total EnergyLeakage EnergyDynamic Energy
0.55 0.55 0.55 0.55 0.6 0.7 0.8 0.9 1 1.1 1.2
Ener
gy/C
ycle
(nJ)
32nm CMOS, 25oC
4.7X
Logic Vcc / Memory Vcc (V)
Source: Vivek De, INTEL – Date 2013
Near-Threshold Computing (NTC): 1. Don’t waste energy pushing devices in strong inversion
2. Recover performance with parallel execution
3
Integrated Systems Laboratory
PULP ClusterMCHAN
Private non blocking per-core command queues Ultra low latency micro-coded programming interface Support for multiple out-of-order outstanding transactions
TCDM interconnect Single-cycle access to the
shared tightly coupled data memory
Interleaved addressing to reduce memory contention
Up to 16 processing elements
Private I$ No D$
Shared L1 multi-banked data memory (TCDM)
Design choices– I$ high code locality & simple architecture– No D$ low locality & high complexity– Multibank smaller energy per access, “almost” multiported
Integrated Systems Laboratory
OR10N Architecture Overview• 4 stage integer pipeline supporting the basic OpenRISC instruction set
– Single cycle memory access– Single cycle multiplication/macs– Non-aligned memory access in two cycles– Hardware loop, pre-/post-increment, and vector support
7
Integrated Systems Laboratory
SHARED ICACHE
NxM Low latency – read only Interconnect: (OPTIONAL → MULTICAST)
I$ I$ I$
Mx1 AXI 4 node
128b L0 Buffers
1KB 4Way Set associative cachebanks
From core Fetch unitCore1
CoreN
To AXI DC FIFO
. . .
. . .
8
Integrated Systems Laboratory
19.05.2015 9
What will we do today?
• Try simple benchmark to see how system works, how to run code, how to get outputs and how to interact with the simulator
• Implement a simple hw semaphore and use it to protect access to shared memory
• Study a matrix multiply example and understand it’s weakness
• Improve the matrix multiply benchmark using all cores, DMA transfers and TCDM memory
Integrated Systems Laboratory
10
Clust0
0xFFFF_FFFF
0x2000_0000
Clusters
0x1000_0000
0x0000_0000
0x1040_0000
0x1C00_0000L2
Test&Set Space
Load/Store Space0x1000_0000
0x1010_0000
0x1020_0000
0x1021_0000
0x1022_0000
0x1040_0000
0x1023_0000
0x1024_0000
0x103F_0000
0x1030_0000Demux Mapped
Reserved
0x1025_0000
0x1026_0000
0x1027_0000
0x1028_0000
0x102A_0000
0x102E_0000
…0x102B_0000
MMU
PIPESTAGES
TIMER
BBMUX/CKGATE
EOC UNIT
DMA Command Queue 0x1031_100A
0x1030_0000
0x1030_0004
0x1031_0000
Core ID
…
Cluster ID
DMA Synch Reg
…
0x100F_1000
0x1031_1004
0x1030_0008
SoCPeripherals
0x1A10_0000
0x1A00_0000
0x1000_0000
0x1040_0000
0x1A00_0000
0x1A01_0000
0x1A03_0000
ROM
SPI S0x1A02_0000
0x1A10_0000
CVP/FLL
CLK DIVIDER
…
0x1A04_0000STDOUT
0x1A05_0000
PULP2 MEMORY MAP
Integrated Systems Laboratory
19.05.2015 11
Getting Started
• Copy data from master account:$ mkdir pulp_ex$ cp /home/soc_master/pulp_ex/pulp_ex.tar.gz pulp_ex/.
• Extract the tar.gz and build the platform:$ cd pulp_ex$ tar –xzvf pulp.tar.gz .$ source setup/setup.csh (this will make the OpenRISC compiler available)$ cd fe/ips/$ tcsh (switch to tcsh console)$ source scripts/vcompile_libs.tcl (Builds the IPs library)$ cd ../sim/$ source scripts/vcompile_work.tcl (Builds the toplevel+testbench)
Integrated Systems Laboratory
19.05.2015 12
Running Helloworld!
• Compile system and string libraries once in the beginning!$ cd sw/apps/helloworld$ make sys_lib$ make string_lib
• Compile the helloworld application with the OpenRISC compiler:$ make helloworld.slm
• Start modelsim and run a simulation!$ cd ../../../fe/sim$ vsim-10.2a$Modelsim$ source tcl_files/run.tcl
• Check the output$ less stdout/stdout_rvt_pe{0-3} (one qprintf() output file of each core)
• Each core should output “Hello world!!!”
Integrated Systems Laboratory
19.05.2015 13
System Verilog Basics
• C-like hardware-description language
• In HDL coding the data type logic should be used!SV: VHDL:
1bit signal: logic std_logicN-bit vector: logic [N-1:0] std_logic_vector(N-1 downto 0)
• System Verilog uses modules (entities in VHDL)
module myUnit#(
parameter WIDTH = 0)
(input logic clk_i,input logic rst_ni,input logic [WIDTH-1:0] data_di,output logic [WIDTH-1:0] data_do
);
<< Module body >>
endmodule
Module Declaration
…myUnit#(
.WIDTH(32))myUnit_i(
.clk_i(clk),
.rst_ni(rst),
.data_di(data_di),
.data_do(data_do));…
Module Instantiation
Integrated Systems Laboratory
19.05.2015 14
Sequential and combinational logic in System Verilog
…logic enable;enum logic {idle, write, read} CS, NS;…assign enable = 1’b1;…always_ff @(posedge clk_i, negedge rst_ni)begin
if (~rst_ni) beginData_DP <= 32’b0;CS <= idle;
endelse if (enable) begin
Data_DP <= Data_DN;CS <= NS;
endend…
Flip-Flop (with enable and active low reset)…always_combbegin
NS = CS;Data_DO = Data_DP;
case (CS)idle:begin
NS = write;Data_DO = 32’b0;
end…default:
NS = idle;endcase…
end…
Combinational process with a state machine
• Flip-flops– Use always_ff construct– Use non-blocking assignments ( <= )
• Combinational logic– Use always_comb construct– Or simple assignments assign = …;– Use blocking assignments ( = )
Integrated Systems Laboratory
19.05.2015 15
EX1: Implement HW semaphore
• Setup:– cp /home/soc_master/pulp_ex/pulp_ex_hwsem.tar.gz ./ – tar -zxvf pulp_ex_hwsem.tar.gz– cp pulp_ex_hwsem/*.sv fe/rtl/ulpsoc/ – cp pulp_ex_hwsem/*.do fe/sim/scripts– Recompile toplevel
• Task 1:– Analyze code of axi_hwsem.sv and add the hw semaphores. Now there
are 2 simple 1 bit registers. Need to modify it so that when a read occurs and the value of the reg is 0 the value is automatically updated to 1. The value should return to 0 only when writing a 0.
• Task 2:– Add a simple function to check the semaphore: It should poll the register
untill a 0 is read– Add a function to set back the semaphore value to 0– Add a program to test the semaphore
Integrated Systems Laboratory
19.05.2015 16
EX2: Matrix Multiplication
• Task 1:– Go to sw/apps/matrixMul/– Study the matrixMul.c code– Compile it with “make matrixMul.slm”– Estimate the number of cycles to complete the matrix multiplication by doing
some back-of-the-envelope calculations.– Run the simulation & check how many cycles it actually took!
• Task 2:– Analyze where the losses come from and use the presented functionality of the
cluster to speedup the matrix multiplication.– You might want to use the following features in your C-code:
• Multiple processing cores• Fast TCDM memory• Synchronization barriers• DMA
– You can use the qprintf() function for debugging– In matrixMul.h you can choose different matrix sizes. Start your optimizations
with a small 8x8 matrix
Integrated Systems Laboratory
19.05.2015 17
PULP – System Functions
• Timer
• DMA
• qprintf()
• dma_barrier()
• synch_barrirer()