OpenPiton:
AnOpen-SourceFrameworkfor
EDAToolDevelopment
DavidWentzlaff
FOSDEM2020
PrincetonUniversity
1http://openpiton.org
https://github.com/PrincetonUniversity/openpiton
2
OpenPiton:Theworld’sfirstopensource,general
purpose,multithreadedmanycoreprocessor
• Opensourcemanycore
• WritteninVerilogRTL
• Scalesto½billioncores• Configurablecore,uncore• Includessynthesisandback-endflow• SimulateinVCS,ModelSim,NCSim,Verilator,
Icarus,Riviera
• ASIC&FPGAverified• ASICpowerandenergyfullycharacterized
[HPCA2018]
• Runsfullstackmulti-userDebianLinux
Tile
Chip
chipset
http://openpiton.org
https://github.com/PrincetonUniversity/openpiton
OpenPiton+Ariane:ThefirstOpen-SourceSMPLinux-
bootingRISC-Vsystemscalingfromonetomanycores
• CollaborationwithETHZurich
– IntegrateOpenPitonandArianeRISC-V
Core
– BootedSMPLinux
lessthan6months
afterstarting
integratingRTL
3
+ Ariane
http://openpiton.org
https://github.com/PrincetonUniversity/openpiton
4
SiliconProvenDesigns• Piton(25-coreinstanceofOpenPiton)
– 25-coremodified64bitOpenSPARCT1Core
– 3P-MeshNoCs– P-MeshDirectory-BasedCacheSystem– Taped-outinIBM32nmSOI
• 6mmx6mm;460MillionTransistors-Amonglargestchipsbuiltinacademia
– Target:1GHzClock@900mV
– Receivedsiliconandrunsfull-stackDebianinlab
• ArianeRISC-V(ETH-Z)– Taped-outinGlobalFoundries22nmFDX
Poseidon:
– Area:0.23mm2-175kGE
– 0.2-1.7GHz(0.5V-1.15V)
Kosmodrom:
– RV64GCXsmallFloat,Transprecision/VectorFPU
– ArianeHP• 8Tlibrary,0.8V,1.3GHz
• 55mW@1GHz
– ArianeLP• 7.5TULPlibrary,0.5V,250MHz
• 5mW@200MHz
Issue
QUENTIN KERBIN
HYPERDRIVE
Ariane
L2
NTX
ArianeDiePhotosfromETH-Z
FPGAPrototypingPlatforms
Available:
• DigilentGenesys2– $999($600academic)
– 1-2coresat66MHz
• XilinxVC707– $3500– 1-4coresat60MHz
• DigilentNexysVideo– $500($250academic)
– 1coreat30MHz
• BittWareXUPP3R
– $7000-8000– >100MHz(12cores)
• AmazonAWSF1
– ~$1.60/hr– Rentbythehour– 12cores
5
OpenPitonSystemOverview
6
P-MeshOff-Chip
Routers(3)
Chip
Bridge
P-MeshChipset
Crossbars(3)
DRAMWishbone
SDHC
AXI
I/O
Chip Chipset
http://openpiton.org
https://github.com/PrincetonUniversity/openpiton
7
Tile Chipset
OpenPitonSystemOverview
http://openpiton.org
https://github.com/PrincetonUniversity/openpiton
8
Tile Chipset
OpenPiton+ArianeSystemOverview
http://openpiton.org
https://github.com/PrincetonUniversity/openpiton
CurrentStatusofFreeandOpenSource
ChipEDAtoolsandBenchmarks
• GrowinguseofFree&OpenSourceEDAtools– MainlyusedforverificationofdesignsandwithFPGAs
• Free&OpenSourceEDAtoolshavereliedonindustrialhardwaredesignreleasesforverification
• Dependenceonindustryincurslimitations
– Designsandscaleofdesignsareoutdatedbythetime
theyarereleased
• LEON3(2004)isstilloneofthelargestdesigns– Lower-levelinformationsuchasoneneededforPlacementandRoutingtools,etc.areoften
obfuscated(ISPD,ICCADbenchmarks)
9
OpenPitonContainsDesignVariety
10
Last-Level
Cache
OpenSPARC
T1Core
Last-Level
Cache
Private
Cache
PicoRV32
Core
Private
Cache
NoC
Router
MemorySystem
NoC
Router
Last-Level
Cache
Ariane
RISC-V
Core
Private
Cache
NoC
Router
Last-Level
Cache
ao486
Core
Private
Cache
NoC
Router
DDR
Memory
Ethernet
PS/2
UART
MIAOWGPGPU
WishboneSDHC
Chipset
NoC
Crossbar
• BigCores• SmallCores
• Caches• Interconnect• GPGPU(MIAOW)
• I/O• Addinginadditional
accelerators
http://openpiton.org
https://github.com/PrincetonUniversity/openpiton
ConfigurabilityOptions
11
Component ConfigurabilityOptions
Cores(perchip) Upto65,536
Cores(persystem) Upto500million
CoreType OpenSPARCT1 Ariane64bitRISC-V
ThreadsperCore 1/2/4 1
Floating-PointUnit FP64,FP32 FP64,FP32,FP16,FP8,
BFLOAT16
TLBs 8/16/32/64entries Numberofentries(16entries)
L1I-Cache NumberofSets,Ways(16kB,4-way)
L1D-Cache NumberofSets,Ways(8kB,4-way)
L1.5Cache NumberofSets,Ways(8kB,4-way)
L2Cache NumberofSets,Ways(64kB,4-way)
Intra-chip
Topologies
2DMesh,Crossbar
Inter-chip
Topologies
2DMesh,3DMesh,Crossbar,ButterflyNetwork
Bootloading SD/SDHCCard,UART,RISC-VJTAGDebug
ScaleofDesigns
12
Component ConfigurabilityOptions
Cores(perchip) Upto65,536
Cores(persystem) Upto500million
CoreType OpenSPARCT1 Ariane64bitRISC-V
ThreadsperCore 1/2/4 1
Floating-PointUnit FP64,FP32 FP64,FP32,FP16,FP8,
BFLOAT16
TLBs 8/16/32/64entries Numberofentries(16entries)
L1I-Cache NumberofSets,Ways(16kB,4-way)
L1D-Cache NumberofSets,Ways(8kB,4-way)
L1.5Cache NumberofSets,Ways(8kB,4-way)
L2Cache NumberofSets,Ways(64kB,4-way)
Intra-chip
Topologies
2DMesh,Crossbar
Inter-chip
Topologies
2DMesh,3DMesh,Crossbar,ButterflyNetwork
Bootloading SD/SDHCCard,UART,RISC-VJTAGDebug
OpenPiton:NotJustVerilog
• Verificationtestbenches
• 8000+Testcases• PowerandThermalAnalysis• PCBdesign
13
https://parallel.princeton.edu/openpiton/piton_power_char.html
https://parallel.princeton.edu/openpiton/piton_pcb.html
PitonTestSetup
14
DRAM+I/O
ChipsetFPGAKintex7
BridgeFPGASpartan6
Piton+HeatSink
Bulk
Decoupling
PowerSupply
Misc.Configuration
[McKeownetal,HotChips2016][McKeownetal,IEEEMICRO2017][McKeownetal,HPCA2018]
PitonCharacterization
Laye
r 7La
yer 1
3La
yer 8
5V VDD BF Rsense 2.5V
1.2V BF Rsense 1.2V AF RsenseVIO Chip AF RsenseVIO Config
3.3V
12V VIO Gateway FPGA + VIO Chip BF Rsense
VCS AF Rsense
2.5V
2.7V
12V 3.5V VCS BF Rsense VDD AF Rsense
[McKeownetal,HPCA2018]
SendingData8-hops
onNoCequivalent
ofanALUop.
Debunks
conventional
wisdomthatNoCs
dominateenergyof
manycore
1LD==3ALU
instructions
Re-computemaybe
moreenergy
efficient
15
OpenSourceDesignandData
16
Tile1
Tile2
Tile3
Tile4
Tile5
Tile6
Tile7
Tile8
Tile9
Tile10
Tile11
Tile12
Tile13
Tile14
Tile15
Tile16
Tile17
Tile18
Tile19
Tile20
Tile21
Tile22
Tile23
Tile0
Tile24
ChipBridge
Design CharacterizationData
Firstdetailedpowerandenergycharacterizationofanopensourcemanycore[McKeownetal,HPCA2018]
https://parallel.princeton.edu/openpiton/piton_power_char.html
OpenPitonandFree&OpenSource
EDAtools
• OpenPitonframeworkutilizesmultipleopensourcetools
– IcarusVerilog– Verilator– Yosys– FuseSoC– SV2V
• Notonlyusebutcontributetoo– Bugreportsandfixes– Requestsfornewfeatures– Contributingnewfeatures
17
NeedforaFree&OpenSourceEDAFlow
• EDAtoolsareessential
componentsbutaflowthat
connectsisequally
important
• TheEDAflowfromachip
builder’sperspectiveis
differentthanfromanEDA
toolsdeveloper
– Viewacrosstools– Interactionbetweentools– Hierarchicalsynthesis– Twopasses– GoodsupportforGate-level
verification
18
WhatismissingforafullFree&OpenSourceFlow
• TheUCSDledOpenROADprojectprovidesmajority
ofchipimplementation
tools
• UWhaspackagedflow
usingOpenROADwith
Free45library
MissingComponents• Free/OpenSourceDRCtool
• StrengtheningYosys• BetterSystemVerilog
support
• Free/OpenSourceParasitic
Extractiontool
• PowergridandSignalIntegrityIssues
19https://github.com/bsg-idea/uw_openroad_free45
https://github.com/The-OpenROAD-Project/OpenROAD-flow
BuildingOpenSourceChipswithOpen
SourceEDAtools
• PrincetonwithUWplanningto
tapeoutbilliontransistorchip– GlobalFoundries14/12nm
– UseFree&OpenSourceCADtoolsfromOpenROAD
andDARPAIDEAproject
20
TurboXAUISerDes TurboXAUISerDes
Debug BaseJumpI/O
Deep
Learning
Accelerators
BaseJump
Manycore
MediumRISC-V
Cores
TurboXAUISerDes
TurboXAUISerDes
FMC
DC/DC
SixPackSiPpackage
BaseJump
PCB
TurboXAUISerDes
TurboXAUISerDes
RISC-Vcores
DeepReinforcementSupervisors
AAA
A A A
AADRLASIC
DECADES:anOpenSource
HeterogeneousPlatform
• SoftwareDefinedHardware(SDH)
– Designruntime-reconfigurablehardware
toacceleratedata-intensivesoftware
applications
• Machinelearninganddatascience
• Graphanalyticsandsparselinearalgebra
• DECADES:heterogeneoustile-basedchip
– Combinationofcore,accelerator,and
intelligentstoragetiles
– Princeton/Columbiacollaborationledby
PIsMargaretMartonosi,DavidWentzlaff,
LucaCarloni
• OurtoolsandHardwareisopen-source!– https://decades.cs.princeton.edu/
21
Program
Executable
Address-space
mapping
Initial Set of
In-Memory Operations
Initial Task
Mapping
DECADES Tile
Configuration
Interconnect Bridges
Configuration
Data Migration
(Across DDR nodes)
Update In-Memory
Operations
Task Migration
(across Tiles)
DECADE Tile
Reconfiguration
Bridge Ports
Re-Routing
Source Code
Compile-Time Optimizations
(Graphicionado, DeSC)
Run-Time
Self-Tuning
Data Segments
Definition
DECADES
Accelerator
Tile
DECADES
Core
Tile
DECADES
Intelligent
Storage
DECADES
Intelligent
Storage
DECADES
Intelligent
Storage
DECADES
Intelligent
Storage
DECADES
chip
DECADES
Accelerator
Tile
DECADES
Accelerator
Tile
DECADES
Accelerator
Tile
DECADES
Accelerator
Tile
DECADES
Accelerator
Tile
DECADES
Core
Tile
DECADES
Core
Tile
DECADES
Core
Tile
DECADES
Core
Tile
DECADES
Core
Tile
L1 Cache
Configurable
Interconnect
Shims
Configurable Core Pipeline
Data Supply / Compute Threads L1 Cache
Per-Tile Configurable
On-chip Memory
DECADES Monitor and Run-Time Reconfiguration Shim
DECADES Core Tile
Configurable
Interconnect
Shims
Specialized, Configurable
Data Supply / Compute
Accelerator
Per-Tile Configurable
On-chip Memory
DECADES Monitor and Run-Time Reconfiguration Shim
DECADES Accelerator Tile
Configurable
Interconnect
Shims
Near-Memory
Computation
Across-Chip Configurable
On-Chip Memory
DECADES Monitor and Run-Time Reconfiguration Shim
DECADES Intelligent Storage
Configurable Pattern-Based
Prefetcher
Run-Time
Self-Tuning
Run-Time
Self-Tuning
Run-Time
Self-Tuning
DECADES TA2 DECADES TA1
FPGAs: O↵-Chip Memory System
with In-Memory Computation
FPGAs: O↵-Chip Memory System
with In-Memory Computation
OpenPitonCommunity
• Welcomecommunitycontributions
• Homepage–http://openpiton.org
• GoogleGroup–https://groups.google.com/group/openpiton
• Directemail–[email protected]
• GitHub–https://github.com/PrincetonUniversity/openpiton
22
TeamandAcknowledgements
PrincetonParallel
Team
– SpecialthankstoGeorgiosTziantzioulis
forhelpwithslides
23
TeamBuildingGrandCanyonSummer2019
• ArianeTeamaspartofthePULPplatformledbyProf.LucaBenini
• Prof.MichaelB.TaylorandProf.RichardShiGroupsatUW
• OpenROADteamledbyProf.AndrewKahng
24
Funding/Support
ThismaterialisbasedonresearchsponsoredbytheNSFunderGrantsNo.CNS-1823222,CCF-1217553,CCF1453112,CCF-1823032,
andCCF-1438980,AFOSRunderGrantNo.FA9550-14-1-0148,AirForceResearchLaboratory(AFRL)andDefenseAdvancedResearch
ProjectsAgency(DARPA)underagreementNo.FA8650-18-2-7846,FA8650-18-2-7852,andFA8650-18-2-7862andDARPAunder
GrantsNo.N66001-14-1-4040andHR0011-13-2-0005.TheU.S.Governmentisauthorizedtoreproduceanddistributereprintsfor
Governmentalpurposesnotwithstandinganycopyrightnotationthereon.Theviewsandconclusionscontainedhereinarethoseofthe
authorsandshouldnotbeinterpretedasnecessarilyrepresentingtheofficialpoliciesorendorsements,eitherexpressedorimplied,of
AirForceResearchLaboratory(AFRL)andDefenseAdvancedResearchProjectsAgency(DARPA),theNSF,AFOSR,ortheU.S.
Government.
OpenPiton:AnOpen-SourceFrameworkforEDA
ToolDevelopment
25
DavidWentzlaff([email protected])
http://openpiton.org
https://github.com/PrincetonUniversity/openpiton
Program
Executable
Address-space
mapping
Initial Set of
In-Memory Operations
Initial Task
Mapping
DECADES Tile
Configuration
Interconnect Bridges
Configuration
Data Migration
(Across DDR nodes)
Update In-Memory
Operations
Task Migration
(across Tiles)
DECADE Tile
Reconfiguration
Bridge Ports
Re-Routing
Source Code
Compile-Time Optimizations
(Graphicionado, DeSC)
Run-Time
Self-Tuning
Data Segments
Definition
DECADES
Accelerator
Tile
DECADES
Core
Tile
DECADES
Intelligent
Storage
DECADES
Intelligent
Storage
DECADES
Intelligent
Storage
DECADES
Intelligent
Storage
DECADES
chip
DECADES
Accelerator
Tile
DECADES
Accelerator
Tile
DECADES
Accelerator
Tile
DECADES
Accelerator
Tile
DECADES
Accelerator
Tile
DECADES
Core
Tile
DECADES
Core
Tile
DECADES
Core
Tile
DECADES
Core
Tile
DECADES
Core
Tile
L1 Cache
Configurable
Interconnect
Shims
Configurable Core Pipeline
Data Supply / Compute Threads L1 Cache
Per-Tile Configurable
On-chip Memory
DECADES Monitor and Run-Time Reconfiguration Shim
DECADES Core Tile
Configurable
Interconnect
Shims
Specialized, Configurable
Data Supply / Compute
Accelerator
Per-Tile Configurable
On-chip Memory
DECADES Monitor and Run-Time Reconfiguration Shim
DECADES Accelerator Tile
Configurable
Interconnect
Shims
Near-Memory
Computation
Across-Chip Configurable
On-Chip Memory
DECADES Monitor and Run-Time Reconfiguration Shim
DECADES Intelligent Storage
Configurable Pattern-Based
Prefetcher
Run-Time
Self-Tuning
Run-Time
Self-Tuning
Run-Time
Self-Tuning
DECADES TA2 DECADES TA1
FPGAs: O↵-Chip Memory System
with In-Memory Computation
FPGAs: O↵-Chip Memory System
with In-Memory Computation