Upload
others
View
7
Download
0
Embed Size (px)
Citation preview
Princeton University and ETH Zürich
http://openpiton.orghttp://pulp-platform.org
OpenPiton with RISC-V CoresA Hands-On Tutorial with the
Open Source Manycore Processor
Princeton Parallel Research Group• Computer Architecture after Moore’s Law
– @MICRO 2019: ComputeDRAM: In-Memory Compute Using Off-the-Shelf DRAMs (Monday)
• Redesigning the Data Center of the Future– @MICRO 2019: Architectural Implications of Function-as-a-
Service Computing (Wednesday)• Biodegradable Computing (Materials)
• 10 PhD Students• 1 Postdoc• 3 Undergraduates
2Grand Canyon Trip 2019
3
This work was partially supported by the NSF under Grants No. CNS-1823222,CCF-1823032, CCF-1217553, CCF-1453112, and CCF-1438980, Air Force Research Laboratory (AFRL) and Defense Advanced Research Projects Agency (DARPA) under agreements No. FA8650-18-2-7846, FA8650-18-2-7852, and FA8650-18-2-7862, AFOSR under Grant No. FA9550-14-1-0148, and DARPA under Grants No. N66001-14-1-4040 and HR0011-13-2-0005. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of Air Force Research Laboratory (AFRL) and Defense Advanced Research Projects Agency (DARPA), the NSF, AFOSR, or the U.S. Government.
Support
4
The world’s first open source, general purpose, multithreaded manycore processor
• Open source manycore• Written in Verilog RTL• Scales to ½ billion cores• Configurable core, uncore• Includes synthesis and back-end flow• Simulate in VCS, ModelSim, NCSim, Verilator, Icarus• ASIC & FPGA verified• ASIC power and energy fully characterized
[HPCA 2018]• Runs full stack multi-user Debian Linux• Used for Architecture, Programming Language,
Compilers, Operating Systems, Security, EDA research
Tile
Chip
chipset
• Collaboration between Princeton University and PULP team from ETH Zürich
• Goal is to develop a permissively licensed, Linux capable many-core research platform based on RISC-V
• Ariane– RV64GC Core– Linux capable
•– Research manycore system– OpenSPARC T1 based– Coherent NoC, distributed cache
OpenPiton+Ariane
5
• Project started in 2013 by Luca Benini• A collaboration between University of Bologna and ETH Zürich
– Large team. In total about 60 people, not all are working on PULP
• Key goal is
• We were able to start with a clean slate, no need to remain compatible to legacy systems.
Parallel Ultra Low Power (PULP)
How to get the most BANGfor the ENERGY consumed in a computing system
ARIANE: Linux capable 64-bit core• Application class processor• Linux Capable
– Tightly integrated D$ and I$– M, S and U privilege modes– TLB, SV39– Hardware PTW
• Optimized for performance– Frequency: 1.5 GHz (22 FDX)– Area: ~ 175 kGE– Critical path: ~ 25 logic levels
• 6-stage pipeline– In-order issue– Out-of-order write-back– In-order commit
• Scoreboarding• Designed for extendibility• Branch-prediction
– Return Address Stack (RAS)– Branch Target Buffer (BTB)– Branch History Table (BHT)
7
7
ARIANE: Linux capable 64-bit core
8
OpenPiton System Overview
9
Tile
OpenPiton System Overview
10
OpenPiton System Overview
11
Chip
OpenPiton System Overview
12
P-Mesh Off-Chip Routers (3)
Chip Bridge
P-Mesh Chipset Crossbars (3)
ChipsetChip
OpenPiton System Overview
13
P-Mesh Off-Chip Routers (3)
Chip Bridge
P-Mesh Chipset Crossbars (3)
DRAM
Chip Chipset
OpenPiton System Overview
14
P-Mesh Off-Chip Routers (3)
Chip Bridge
P-Mesh Chipset Crossbars (3)
DRAM WishboneSDHC
Chip Chipset
OpenPiton System Overview
15
P-Mesh Off-Chip Routers (3)
Chip Bridge
P-Mesh Chipset Crossbars (3)
DRAM WishboneSDHC
AXII/O
Chip Chipset
OpenPiton System Overview
16
P-Mesh Off-Chip Routers (3)
Chip Bridge
P-Mesh Chipset Crossbars (3)
DRAM WishboneSDHC
AXII/O
Chip Chipset
Tile Overview
17
To Other Tiles
L2 Cache Slice+
Directory Cache
P-MeshRouters
(3)
L1.5 Cache
CCX Arbiter
FPU
Modified OpenSPARC T1
Core
MITTS(Traffic Shaper)
Silicon Proven Designs: Ariane• Ariane has been taped-out
Globalfoundries 22nm FDXin 2017 and 2018
• The system features 16 kByte ofinstruction and 32 kByte of datacache.
• Poseidon:– Area: 0.23 mm2 – 175 kGE– 0.2 - 1.7 GHz (0.5 V – 1.15 V)
• Kosmodrom:– RV64GCXsmallFloat– Transprecision / Vector FPU– Ariane HP
• 8T library, 0.8V, 1.3 GHz• 55 mW @ 1 GHz
– Ariane LP• 7.5T ULP library, 0.5V, 250 MHz• 5 mW @ 200 MHz 18
Issue
QUENTIN KERBIN
HYPERDRIVE
Poseidon layoutAriane
Kosmodrom layout
Ariane LPAriane HP
L2
NTX
18
Silicon Proven Designs: Piton Chip• 25-core
– 2 Threads per core– 64-bit Architecture– Modified OpenSPARC T1 Core
• 3 NoCs (P-Mesh)– 64-bit, 2D Mesh– Extend off-chip enabling multichip systems
• Directory-Based Cache System– 64KB L2 Cache per core (Shared)– 8KB L1.5 Data Cache– 8KB L1 Data Cache– 16KB L1 Instruction Cache
• IBM 32nm SOI Process– 6mm x 6mm– 460 Million Transistors
• Target: 1GHz Clock @ 900mV• 208 Pin CQFP Package
19
Tile 0 Tile 1 Tile 2 Tile 3 Tile 4
Tile 20
Tile 21
Tile 22
Tile 23
Tile 24
Tile 5 Tile 6 Tile 7 Tile 8 Tile 9
Tile 10
Tile 11
Tile 12
Tile 13
Tile 14
Tile 15
Tile 16
Tile 17
Tile 18
Tile 19
PLLCB Chip Bridge (CB)
Piton Test Setup
20
DRAM + I/O
Chipset FPGAKintex 7
Bridge FPGASpartan 6
Piton + Heat Sink
Bulk Decoupling
Power Supply
Misc. Configuration
[McKeown et al, HotChips 2016] [McKeown et al, IEEE MICRO 2017] [McKeown et al, HPCA 2018]
Putting it all together
21
To Other Tiles
L2 Cache Slice+
Directory Cache
P-MeshRouters
(3)
L1.5 Cache
CCX Arbiter
FPU
Modified OpenSPARC T1
Core
MITTS(Traffic Shaper)
§ Native L1.5 interface is the ideal point to attach a new core
§ Well defined interface similar to CCX from OpenSPARC
§ Write-through cache protocol
§ Coherency mechanism: only need to support invalidation messages
Putting it all together
22
To Other Tiles
L2 Cache Slice+
Directory Cache
P-MeshRouters
(3)
L1.5 Cache
CCX Arbiter
FPU
Modified OpenSPARC T1
Core
MITTS(Traffic Shaper)
§ Native L1.5 interface is the ideal point to attach a new core
§ Well defined interface similar to CCX from OpenSPARC
§ Write-through cache protocol
§ Coherency mechanism: only need to support invalidation messages
FPGA Prototyping Platforms
Available:• Digilent Genesys2– $999 ($600 academic)– 1-2 cores at 66MHz• Xilinx VC707– $3500– 1-4 cores at 60MHz• Digilent Nexys Video– $500 ($250 academic)– 1 core at 30MHz
• BittWare XUPP3R– $7000-8000– >100MHz (12 cores)• Amazon AWS F1– Rent by the hour– 12 cores
OpenPiton Philosophy• Focus/Value is in the Uncore
– Not religious about ISA– Provide whole working system
• We are practical– Use Verilog (Ariane is SV)– Industry standard tools– Use the best tool for job (including commercial CAD tools)
• Primarily for research, but welcome industry also• Licensing
– All our code, Hypervisor, are BSD-like– Linux, T1 core (GPL or LGPL)– Ariane (Solderpad)
• Scalability (Million Core)
24
OpenPiton Community
• Visit http://openpiton.org• [email protected]
25
• Building a community– Welcome community
contributions– Thousands of Downloads
• Google Group
Doing Research with OpenPiton + Ariane
• Software– Install on Debian, test scalability
• Operating System– Recompile kernel, rebuild SW, run
• Hardware/Software Co-design– Add new instructions, change compiler/HV/OS/SW
• Architecture– Change parameters, rebuild HW, run
26
HW
ISA
HV/OS
Apps
Compiler/Runtime
Enabled Research
• Coherence Domain Restriction– Fu et al. MICRO 2015
• Execution Drafting– McKeown et al. MICRO 2014
• Memory Inter-arrival Time Traffic Shaper– Zhou et al. ISCA 2016
• Oblivious RAM– Fletcher et al. ASPLOS 2015
• DVFS modelling• Numerous outside papers• Numerous class research projects
27
Program A Instruction Program B Instruction
Fetch Stage Thread Select Stage
Decode Stage Execute Stage Memory Stage Writeback Stage
Successfully Drafted Instructions Lead Instructions
…
… … … … … … … ……
…
…
…
…
…
…
…
…App 1
App 3
App 2
Frequency
RequestInter-arrival time
2t
t 3t
Uniform Traffic
More Bursty Traffic
2tA Distribution of Traffic
Time