1
OSCAR Compiler SC18 Vector Accelerator Clang/LLVM Performance on FPGA OSCAR Vector Multicore System Platinum Vector Accelerator on FPGA Hardware CPU (NIOS) 1CPU + 1VA / 2CPU + 2VA Platinum Multicore Architecture Kasahara & Kimura Lab, Waseda University, TOKYO Vector Accelerator / DTU Features Attachable for any CPUs (Intel, ARM, IBM) Data driven initiation (sync flags) Function Units [tentative] Vector Function Unit 8 double precision ops/clock 64 characters ops/clock Variable vector register length Chaining LD/ST & Vector pipes Vector Length = 256 element Scalar Function Unit Data Transfer Unit Overwrap execution w/ data transfer (Currently using DMA) http://www.kasahara.cs.waseda.ac.jp Centralized Shared Memory Compiler Co-designed Interconnection Network Compiler co-designed Connection Network On-chip Shared Memory Multicore Chip Vector Data Transfer Unit CPU Local Memory Distributed Shared Memory Power Control Unit Core ×4 chips SC18 members K. Miyamoto T. Kawata K. Takahashi T. Kashimata B.A. Adhi Y. Minato H. Mikami Prof. T. Kitamura Prof. K. Kimura Prof. H. Kasahara FPGA Implementation Board: C5P development kit (intelFPGA Cyclone V / 301K LE) Specification: 16 single precision ops/cycle Local Data Memory Bandwidth 32 byte/clock All data located on Local Data Memory Local Data Memory size: 32KB Compile Flow Sequential C Programs OSCAR Compiler (C to Parallel C) Host CPU Compiler Accelerator Compiler Executable File Multi Grain Parallelization Local Memory Management Data Transfer Optimization Power Management Auto Vectorization Nios II compiler Clang / LLVM based Acc. C code to binary Download to FPGA board using Nios II toolchain OSCAR compiler can generate C code for Accelerator Automatically Host side code (OSCAR API) Accelerator side code (Using Intrinsic function) OSCAR API Translator Convert architecture independent directives to architecture specific code Applications Swim (SPEC 95) Size 257x257 Iteration = 10 times Evaluation except 1 st iteration (initialization) Compiler: nios2-elf-gcc FPU: Floating Point Hardware 2 Cache: 32KB No cache Local Memory: 32KB DMA transfer between DDR and Local Memory controlled by OSCAR Compiler Result 3.5 times speed up w/o initialization Execution time of CPU part increases w/o cache 0 0.5 1 1.5 2 2.5 3 3.5 4 CPU (cache 32K) 1CPU+1VA (No cache) 2CPU+2VA (No cache) Speedup 3.5x speedup by 2CPU + 2VA for short vector swim (257x257) on FPGA 3.5x

OSCAR Vector Multicore System OSCAR Compilercontrolled by OSCAR Compiler Result • 3.5 times speed up w/o initialization • Execution time of CPU part increases w/o cache 0 0 .5

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

OSCAR Compiler

SC18

Vector Accelerator

Clang/LLVM

Performance on FPGA

OSCAR Vector Multicore System Platinum Vector Accelerator on FPGA

Hardware• CPU (NIOS)

• 1CPU + 1VA / 2CPU + 2VA

Platinum Multicore Architecture

Kasahara & Kimura Lab, Waseda University, TOKYO

Vector Accelerator / DTUFeatures• Attachable for any CPUs (Intel, ARM, IBM)

• Data driven initiation (sync flags)

Function Units [tentative]• Vector Function Unit

• 8 double precision ops/clock• 64 characters ops/clock• Variable vector register length• Chaining LD/ST & Vector pipes• Vector Length = 256 element

• Scalar Function Unit

Data Transfer Unit• Overwrap

execution w/data transfer(Currently using DMA)

http://www.kasahara.cs.waseda.ac.jp

Centralized Shared Memory

Compiler Co-designed Interconnection Network

Compiler co-designed Connection Network

On-chip Shared Memory

Multicore Chip

VectorData

TransferUnit

CPU

Local MemoryDistributed Shared Memory

Power Control Unit

Core

×4chips

SC18 members

K. MiyamotoT. KawataK. TakahashiT. KashimataB.A. AdhiY. MinatoH. MikamiProf. T. KitamuraProf. K. KimuraProf. H. Kasahara

FPGA ImplementationBoard: C5P development kit (intelFPGA Cyclone V / 301K LE)Specification:• 16 single precision ops/cycle• Local Data Memory Bandwidth 32 byte/clock• All data located on Local Data Memory• Local Data Memory size: 32KB

Compile Flow

Sequential CPrograms

OSCAR Compiler(C to Parallel C)

Host CPUCompiler

AcceleratorCompiler

Executable File

• Multi Grain Parallelization• Local Memory Management• Data Transfer Optimization• Power Management• Auto Vectorization

• Nios II compiler

• Clang / LLVM based• Acc. C code to binary

• Download to FPGA board using Nios II toolchain

OSCAR compiler can generate C code for Accelerator Automatically

Host side code(OSCAR API)

Accelerator side code(Using Intrinsic function)

OSCAR API Translator

• Convert architecture independent directives to architecture specific code

Applications• Swim (SPEC 95)

• Size 257x257• Iteration = 10 times• Evaluation except 1st iteration (initialization)

• Compiler: nios2-elf-gcc• FPU: Floating Point

Hardware 2• Cache: 32KB

• No cache• Local Memory: 32KB• DMA transfer between

DDR and Local Memorycontrolled by OSCAR Compiler

Result• 3.5 times speed up w/o

initialization• Execution time of CPU

part increases w/ocache

0

0.5

1

1.5

2

2.5

3

3.5

4

CPU(cache 32K)

1CPU+1VA(No cache)

2CPU+2VA(No cache)

Sp

eed

up

3.5x speedup by 2CPU + 2VA for short vector swim (257x257) on FPGA

3.5x