
ParCo2017 – International Conference on Parallel Computing, Bologna, Italy, 12-15 September 2017

The brain on low power scalable architectures: efficient simulation of cortical slow waves and asynchronous states

Andrea Biagioni, INFN – Sezione di Roma

for the APE Lab ExaNeSt and WaveScalES team

Human Brain Project

• WaveScalES: 2016 – 2023
• Measures of brain slow waves during deep sleep and anesthesia, and of the transition to awareness
• Large-scale spiking simulations (hundreds of billions of synapses) distributed over (tens of) thousands of processes.

Distributed and Plastic Spiking Neural Networks (DPSNN)

• Neural networks are heavily interconnected at multiple distances, and local activity rapidly produces effects at all distances → Prototype of a non-trivial parallelization problem
• Each neural spike originates a cascade of synaptic events at multiple times: t + Δt_s → Complex data structures and synchronization. Mixed time-driven (delivery of spiking messages) and event-driven (neural dynamics and synaptic activity)
• Multiple time-scales (neural, synaptic, long- and short-term plasticity models) → Non-trivial synchronization at all scales
• Gigantic synaptic database. A key issue for large-scale simulations → Clever parallel resource management required.
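The cascade of delayed synaptic events that follows each spike can be sketched with a priority queue ordered by arrival time. This is an illustration only, not the DPSNN implementation: the `Synapse` record, the function names and the delays are all hypothetical.

```python
import heapq
from collections import namedtuple

# Hypothetical synapse record: target neuron id, weight, axonal delay in ms.
Synapse = namedtuple("Synapse", ["target", "weight", "delay_ms"])

def schedule_spike(queue, t_spike_ms, synapses):
    """Turn one spike at time t into one synaptic event per synapse at t + delay."""
    for syn in synapses:
        heapq.heappush(queue, (t_spike_ms + syn.delay_ms, syn.target, syn.weight))

def pop_due_events(queue, t_now_ms):
    """Remove and return, in time order, all events arriving at or before t_now_ms."""
    due = []
    while queue and queue[0][0] <= t_now_ms:
        due.append(heapq.heappop(queue))
    return due

queue = []
fan_out = [
    Synapse(target=7, weight=0.5, delay_ms=1.0),
    Synapse(target=3, weight=-0.2, delay_ms=4.0),
    Synapse(target=9, weight=0.1, delay_ms=2.0),
]
schedule_spike(queue, t_spike_ms=10.0, synapses=fan_out)

early = pop_due_events(queue, t_now_ms=12.0)  # events arriving at 11.0 and 12.0 ms
late = pop_due_events(queue, t_now_ms=20.0)   # the remaining event at 14.0 ms
```

In a distributed run the queue would live on the process that owns the target neurons, which is why the synaptic database dominates memory.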

13/09/2017 – Andrea Biagioni – ParCo2017 – International Conference on Parallel Computing – The brain on low power scalable architectures: efficient simulation of cortical slow waves and asynchronous states

P. S. Paolucci et al., Distributed simulation of polychronous and plastic spiking neural networks: strong and weak scaling of a representative mini-application benchmark executed on a small-scale commodity cluster, arXiv:1310.8478, Oct. 2013.

Neuron Model


• The unit of the system:
  • Simplifications are needed
  • Balance between computing cost (flops) and biological plausibility
  • Point-like Leaky Integrate-and-Fire with Spike Frequency Adaptation
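A minimal sketch of a point-like leaky integrate-and-fire neuron with spike-frequency adaptation, integrated with a fixed-step Euler scheme. The equations are a generic textbook form and every parameter value is an illustrative assumption; they are not taken from the slides or from the Gigante et al. model.

```python
def simulate_lif_sfa(i_ext=100.0, t_stop=1.0, dt=1e-4,
                     tau_m=0.02, tau_w=0.2, v_th=1.0, v_reset=0.0, b=5.0):
    """Return spike times (s) of one neuron driven by a constant input.

    v is the membrane potential (arbitrary units) and w is an adaptation
    variable that grows by b at every spike and decays with time constant tau_w.
    """
    v, w = v_reset, 0.0
    spike_times = []
    for step in range(int(round(t_stop / dt))):
        v += dt * (-v / tau_m + i_ext - w)   # leaky integration minus adaptation
        w += dt * (-w / tau_w)               # adaptation decays between spikes
        if v >= v_th:                        # threshold crossing: emit a spike
            spike_times.append((step + 1) * dt)
            v = v_reset
            w += b                           # each spike strengthens adaptation
    return spike_times

def inter_spike_intervals(spike_times):
    return [t1 - t0 for t0, t1 in zip(spike_times, spike_times[1:])]

adapting = inter_spike_intervals(simulate_lif_sfa(b=5.0))
regular = inter_spike_intervals(simulate_lif_sfa(b=0.0))
```

With `b=0.0` the intervals stay constant, while with `b>0` they lengthen over time, which is the adaptation effect the model name refers to.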


Gigante et al. 2007. Diverse population-bursting modes of adapting spiking neurons. Phys. Rev. Lett. 98:148101. DOI: 10.1103/PhysRevLett.98.148101

E. M. Izhikevich. 2004. Which model to use for cortical spiking neurons? Trans. Neur. Netw. 15, 5 (September 2004), 1063-1070. DOI: 10.1109/TNN.2004.832719

Neural Columns


• Grey matter
• Different families of neurons are in the column (excitatory, inhibitory)
• Configurable number of families
• Configurable number of neurons
• Parametric

V. Braitenberg. 2007. Grey substance and white substance. Scholarpedia, 2(11):2918.

White Matter: long-range inter-areal communication

Grey Matter: neurons + intra-areal connections, short-range communication

• 2 excitatory, 1 inhibitory
• Family ratio 1:3:1
• Neurons ~1250


• Cortical Area: a segment of the cerebral cortex that carries out a given function

• Cortical Column: a group of neurons in the cortex that can be successively penetrated by a probe inserted perpendicularly to the cortical surface.
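The example column configuration (three families in ratio 1:3:1, about 1250 neurons) can be turned into per-family neuron counts as sketched below. The helper `split_by_ratio` is hypothetical, and the slides do not say which family receives which share, so the result is left unlabeled.

```python
def split_by_ratio(total, ratio):
    """Split `total` neurons into families proportional to `ratio`.

    Uses largest-remainder rounding so the sizes always sum to `total`.
    """
    weight = sum(ratio)
    exact = [total * r / weight for r in ratio]
    sizes = [int(x) for x in exact]
    # Hand the neurons lost to truncation to the largest fractional parts.
    order = sorted(range(len(ratio)), key=lambda i: exact[i] - sizes[i], reverse=True)
    for i in order[: total - sum(sizes)]:
        sizes[i] += 1
    return sizes

family_sizes = split_by_ratio(1250, (1, 3, 1))
```

Keeping the ratio and the total as parameters mirrors the "configurable" and "parametric" nature of the column described on the slide.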

Testbed

• QUonG (32 nodes; 256 cores):
  • Intel Ivy Bridge CPU E5-2630 v2 @ 2.60 GHz (dual processor; hexa-core)
  • 128 GB per node (10 GB per core)
  • NIC: IB gen2
  • Limited to 96 cores
  • INFN Roma

• Galileo (516 nodes; 8256 cores):
  • Intel Haswell 2.40 GHz per node (dual processor; octo-core)
  • 128 GB per node (8 GB per core)
  • NIC: IB gen3 (4x QDR switch)
  • 281 on TOP500 (July 2017)
  • Limited to 1024 cores
  • CINECA (Bologna)


R. Ammendola et al., QUonG: A GPU-based HPC system dedicated to LQCD computing, in Application Accelerators in High-Performance Computing (SAAHPC), 2011 Symposium on, pp. 113–122, July 2011

DPSNN: Strong and Weak Scaling measures

Strong scaling: various total network sizes simulated on 1 to 1024 cores @ 2.4 GHz. Execution time is normalized to the synapse count.


Weak scaling for various local network sizes.


Distribution of Cortical Modules among Software Processes

A sample grid of 64 = 8x8 neural columns.

Excitatory neurons project 76% of their synapses toward neurons located in the same column, 3% to first neighbouring columns, 2% to second neighbours and 1% to third neighbours.

a) Grid of 64 processes: 1 column per process

b) Grid of 4 processes: 16 columns per process

c) Grid of 256 processes: ¼ of a column per process

One computational core hosts one software process.
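The three layouts above follow from dividing the 64 columns by the number of processes. The sketch below reproduces that arithmetic and, for the case of several whole columns per process, maps a column to its owning process. The square-block decomposition in `owner_process` is an assumption made for illustration; the slides do not state how columns are assigned.

```python
from fractions import Fraction

def columns_per_process(grid_side, n_processes):
    """Columns (possibly a fraction of one) handled by each process."""
    return Fraction(grid_side * grid_side, n_processes)

def owner_process(col_x, col_y, grid_side, procs_per_side):
    """Rank owning column (col_x, col_y), assuming square blocks of columns."""
    block = grid_side // procs_per_side        # columns per block side
    return (col_y // block) * procs_per_side + (col_x // block)

# a) 64 processes, b) 4 processes, c) 256 processes on the 8x8 grid
layouts = {p: columns_per_process(8, p) for p in (64, 4, 256)}

# Case b): a 2x2 grid of processes, each owning a 4x4 block of columns.
corner_owner = owner_process(7, 7, grid_side=8, procs_per_side=2)
```

When a column is split across processes, as in case c), the mapping has to be done per neuron rather than per column, which this sketch does not cover.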


Strong scaling measures: a), b), c) Examples of the distribution of a grid of 64 neural columns over a varying number of software processes (computational cores).

The node connectivity matrix is not equal to the column connectivity matrix.


Overview of DPSNN tasks


GRID                12x12    24x24    48x48
NEURONS             0.18M    0.71M    2.86M
SYNAPSES            0.20G    0.80G    3.20G
COLUMNS               144      576     2304
PROCESSES             144      192      192
COLUMN/PROCESS          1        3       12
SIMULATED SECONDS      30       12       18
WALL CLOCK SEC.      1484     2148    15182
COMMUNICATION       35.2%    10.7%     0.9%
SYNCHRONIZATION     22.9%    36.3%    36.2%
COMPUTATION         21.3%    34.2%    45.1%
LIST MANAGEMENT     17.1%    16.7%    16.9%
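Because the three runs simulate different durations, their wall-clock times are easier to compare after normalization. The sketch below derives the slowdown with respect to real time and the wall-clock time per synapse per simulated second from the table values. The function names are hypothetical, and the derived figures are plain arithmetic on the table, not measurements reported in the slides.

```python
# Values copied from the task-overview table (G = 1e9).
runs = {
    "12x12": {"synapses": 0.20e9, "simulated_s": 30.0, "wall_clock_s": 1484.0},
    "24x24": {"synapses": 0.80e9, "simulated_s": 12.0, "wall_clock_s": 2148.0},
    "48x48": {"synapses": 3.20e9, "simulated_s": 18.0, "wall_clock_s": 15182.0},
}

def slowdown_vs_real_time(run):
    """Wall-clock seconds needed per simulated second."""
    return run["wall_clock_s"] / run["simulated_s"]

def wall_clock_per_synapse_per_simulated_second(run):
    """Slowdown normalized to the synapse count of the network."""
    return slowdown_vs_real_time(run) / run["synapses"]

for name, run in runs.items():
    print(f"{name}: {slowdown_vs_real_time(run):7.1f}x real time, "
          f"{wall_clock_per_synapse_per_simulated_second(run):.3e} "
          f"s per synapse per simulated second")
```

This normalization ignores the number of processes, which differs between the 12x12 run (144) and the other two (192).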


4x4 GRID: Real-time domain


[Figure: Strong Scaling; Grid 4x4. Speed-up vs. processes/cores (1 to 64): measured speed-up compared with ideal.]

• Simulated time: 10 s (QUonG)
• Wall clock time
  • 32 processes: 12.03 s
  • 64 processes: 16.46 s
• Similar to expected behavior (λ = 350 µm)
• Communication doesn't scale!
• Traditional distributed computing system:
  • Throughput: OK
  • Latency: NO
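The wall-clock figures above can be related to real time with a couple of ratios. This is a small sketch with hypothetical helper names, using only the two timings quoted on the slide.

```python
SIMULATED_S = 10.0                      # simulated time of the 4x4 grid run
wall_clock_s = {32: 12.03, 64: 16.46}   # processes -> wall-clock seconds

def real_time_factor(wall_s, simulated_s):
    """Wall-clock seconds per simulated second (1.0 means real time)."""
    return wall_s / simulated_s

def relative_speedup(t_fewer_procs, t_more_procs):
    """Speed-up obtained by moving to the larger process count."""
    return t_fewer_procs / t_more_procs

rtf_32 = real_time_factor(wall_clock_s[32], SIMULATED_S)
rtf_64 = real_time_factor(wall_clock_s[64], SIMULATED_S)
speedup_32_to_64 = relative_speedup(wall_clock_s[32], wall_clock_s[64])
```

A relative speed-up below 1 when doubling from 32 to 64 processes means the run gets slower, which matches the remark that communication latency, not throughput, is the limit at this scale.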

[Figure: Strong Scaling; Grid 4x4. Seconds vs. processes/cores (1 to 64): communication, computation, barrier and total time.]

D. S. Modha et al., The Cat is Out of the Bag: Cortical Simulations with 10^9 Neurons, 10^13 Synapses, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, Portland, Oregon, pages 1-12, 2009, ACM


ExaNeSt Objectives

• H2020 - FETHPC-2014 (December 2015 – November 2018)
• System architecture for data-centric Exascale-class HPC
  • Low-latency unified interconnect (compute & storage traffic)
    • RDMA + PGAS to reduce communication overhead
  • Fast, distributed in-node non-volatile memory
• Extreme compute-power density
  • Advanced totally-liquid cooling technology
  • Scalable packaging for ARM-based (v8, 64-bit) microservers
    • Low Energy Compute (256 ARM cores + 1 TB DDR4 memory in a 1U blade)
    • Heterogeneous: FPGA accelerator (~4 TFlops per node)
• Real scientific and data-center applications
  • Applications used to identify system requirements
  • Tuned versions will evaluate our solutions

INFN activities are strongly synergistic with the project objectives:
• APE supercomputer: VLSI, system design, high-density packing
• APEnet: FPGA-based NIC for clusters (low-latency, high-throughput)



R. Ammendola et al., APEnet+: a 3D Torus network optimized for GPU-based HPC systems, Journal of Physics: Conference Series, vol. 396, no. 4, p. 042059, 2012

M. Katevenis et al., The next Generation of Exascale-class Systems: the ExaNeSt Project, in 2017 Euromicro Conference on Digital System Design (DSD), Aug 2017

DPSNN on low-power computing architectures

• Evaluate the performance of low-power processors in scalable simulations of spiking neural network models.
• Compare performance against traditional server-platform processors.
• Try to identify the critical architectural features enabling better time-to-solution and energy-to-solution figures on this application.
• Intel Xeon vs. ARM Cortex cores (two generations, to evaluate the trend):
  1. Westmere Xeon E5620 @ 2.4 GHz vs. ARMv7-A Cortex-A15 @ 2.3 GHz
  2. Haswell E5-2620 v3 @ 2.4 GHz vs. ARMv8-A Cortex-A57 @ 1.9 GHz


1st Gen

• Dimensions: 1U standard rackmountable
• Motherboard: X8DTG-DF
• CPU: Dual Intel Westmere quad-core Xeon E5620
• DRAM: 48 GB DDR3 1333 MHz
• NIC: Mellanox ConnectX VPI IB QDR
• OS: CentOS release 6.7, kernel 2.6.32-573.7.1.el6.x86_64

• Tegra K1 SOC
• CPU: NVIDIA "4-Plus-1" 2.32 GHz ARM quad-core Cortex-A15 CPU with Cortex-A15 battery-saving shadow-core
• GPU: NVIDIA Kepler "GK20a" GPU with 192 SM3.2 CUDA cores (up to 326 GFLOPS in SP)
• DRAM: 2 GB DDR3L 933 MHz EMC x16 using 64-bit data width
• Storage: 16 GB fast eMMC 4.51
• Ethernet: RTL8111GS Realtek 10/100/1000Base-T Gigabit LAN


Comparison of 1st Gen server and low-power architectures

• Same # of cores, ~same clock frequency.
• Intel Xeon E5620 supports Hyper-Threading (ARM Cortex-A15 does not).
• SIMD floating-point theoretical peak performance (Intel is 2x ARM in DP)
  • ARM Cortex-A15 (NEON):
    • 2 DP FLOPs/cycle: scalar FMA or scalar multiply-add
    • 8 SP FLOPs/cycle: 4-wide NEONv2 FMA or 4-wide NEON multiply-add
  • Intel Westmere (SSE4.2):
    • 4 DP FLOPs/cycle: 2-wide SSE2 addition + 2-wide SSE2 multiplication
    • 8 SP FLOPs/cycle: 4-wide SSE addition + 4-wide SSE multiplication
• Memory bandwidth: 14.9 GB/s (ARM Cortex-A15) vs 25.6 GB/s (Intel Xeon E5620)
  • DPSNN makes intensive use of memory (e.g. for delivering spikes to post-synaptic neuron queues).
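The per-cycle figures above translate into theoretical peak rates once multiplied by core count and clock frequency. The sketch below does this per chip, taking four cores and the nominal clocks quoted in the comparison (2.3 GHz and 2.4 GHz). The resulting GFLOPS values are derived here for illustration and are not reported in the slides.

```python
def peak_gflops(cores, clock_ghz, flops_per_cycle):
    """Theoretical peak = cores x clock (GHz) x FLOPs per cycle per core."""
    return cores * clock_ghz * flops_per_cycle

# Per-chip parameters as listed in the comparison (quad-core on both sides).
chips = {
    "ARM Cortex-A15": {"cores": 4, "clock_ghz": 2.3, "dp": 2, "sp": 8, "mem_bw_gbs": 14.9},
    "Intel Xeon E5620": {"cores": 4, "clock_ghz": 2.4, "dp": 4, "sp": 8, "mem_bw_gbs": 25.6},
}

peaks = {
    name: {
        "dp": peak_gflops(c["cores"], c["clock_ghz"], c["dp"]),
        "sp": peak_gflops(c["cores"], c["clock_ghz"], c["sp"]),
    }
    for name, c in chips.items()
}

dp_ratio = peaks["Intel Xeon E5620"]["dp"] / peaks["ARM Cortex-A15"]["dp"]
bw_ratio = chips["Intel Xeon E5620"]["mem_bw_gbs"] / chips["ARM Cortex-A15"]["mem_bw_gbs"]
```

Since DPSNN is memory-intensive, the bandwidth ratio is likely at least as relevant to the measured time-to-solution as the floating-point peak.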


Benchmark Configuration (1st Gen)

• DPSNN:
  • Simulated time: 3 s
  • 10K LIFCA neurons
  • 18M synapses
• Low-power platform:
  • 2 quad-core ARM A15 Jetson TK1 + Gigabit switch
  • 8 MPI processes
• Server platform:
  • 1 Supermicro SuperServer 6016GT-TF (2 Intel E5620 quad-core processors)
  • 8 MPI processes (hyperthreading turned off)


1st Gen Results

• TIME: Server platform is 3.3x faster than low-power platform
• POWER: Server platform is 14.4x worse than low-power platform
• ENERGY: Server platform is 4.4x worse than low-power platform
• We did not subtract any baseline power consumption.
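The three ratios above are linked by energy = power × time. A minimal sketch, assuming the power ratio refers to average power over the run; `energy_ratio` is a hypothetical helper name.

```python
def energy_ratio(power_ratio, speedup):
    """Energy-to-solution ratio (server / low-power).

    E = P * t, so E_server / E_lp = (P_server / P_lp) * (t_server / t_lp)
    = power_ratio / speedup, where speedup = t_lp / t_server.
    """
    return power_ratio / speedup

# 1st Gen: server draws 14.4x the power but finishes 3.3x sooner.
gen1_energy = energy_ratio(power_ratio=14.4, speedup=3.3)
print(f"1st Gen energy ratio: {gen1_energy:.1f}x")
```

The quotient agrees with the reported 4.4x once rounded to one decimal, so the three figures are mutually consistent.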


2nd Gen

• Tegra X1 SOC (20 nm)
• CPU: ARMv8 ARM Cortex-A57 quad-core (2 MB L2 cache) + ARM Cortex-A53 quad-core (64-bit) in big.LITTLE configuration, 102 MHz / 1.9 GHz
• GPU: NVIDIA Maxwell "GM20B" with 256 CUDA cores: 512 GFLOPS (FP32), 1 TFLOPS (FP16)
• DRAM: 4 GB LPDDR4 (25.6 GB/s BW)
• Storage: 16 GB eMMC
• Ethernet: 10/100/1000Base-T
• OS: Ubuntu 14.04.1 LTS (GNU/Linux 3.10.67-g458d45c aarch64)
• SW stack: gcc 4.8.4 (Ubuntu/Linaro 4.8.4-2ubuntu1~14.04.3), Open MPI 1.6.5

• Dimensions: 4U standard
• Motherboard: X10DRG-Q
• CPU: Dual hexa-core Intel E5-2620 v3 @ 2.4 GHz (15 MB L3 cache), 1.2 up to 3.2 GHz frequency scaling, 22 nm, mem BW up to 59 GB/s
• DRAM: 64 GB DDR4 2133 MHz
• NIC: Mellanox ConnectX VPI IB QDR
• OS: CentOS 7.2, kernel 4.5.3-1.el7.elrepo.x86_64
• SW stack: gcc 4.8.5, Open MPI 1.10.0


Benchmark Configuration (2nd Gen)

• DPSNN:
  • Simulated time: 3 s
  • 10K LIFCA neurons
  • 18M synapses
• Low-power platform:
  • 1 Jetson TX1 (quad-core ARM Cortex-A57)
  • 4 MPI processes, interactive frequency scaling governor
• Server platform:
  • 1 Supermicro SuperServer 7048GR-TR (2 hexa-core Intel E5-2620 v3 @ 2.40 GHz)
  • 4 MPI processes, powersave frequency scaling governor


2nd Gen Results

• TIME: Server platform is 5x faster than low-power platform
• POWER: Server platform is 14.5x worse than low-power platform
• ENERGY: Server platform is 2.9x worse than low-power platform
• We did not subtract any baseline power consumption.


Haswell vs. Cortex-A57: Comments on Results

• Effective Cortex-A57 usable max frequency is 1734 MHz.
• Taking into account the full baseline power consumption is unfair to the Haswell platform (4 cores used out of 12). If we renormalize the baseline to 1/3 for the Haswell, results would be:
  • Power consumption ratio: 10.9 (instead of 14.5)
  • Energy-to-solution ratio: 2.2 (instead of 2.9)
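One way to read the renormalization is sketched below. It assumes server power splits into a baseline part and a load-dependent part, and that only one third of the baseline is charged to the run. The slides give neither the baseline power nor the formula, so the baseline share computed here is an inference from the two quoted ratios, not a reported measurement.

```python
def implied_baseline_share(raw_ratio, renormalized_ratio, kept_fraction):
    """Share of server power that is baseline, inferred from the two ratios.

    With raw = (B + D) / P_lp and renormalized = (kept * B + D) / P_lp,
    the difference equals (1 - kept) * B / P_lp.
    """
    return (raw_ratio - renormalized_ratio) / ((1.0 - kept_fraction) * raw_ratio)

def renormalized_power_ratio(raw_ratio, baseline_share, kept_fraction):
    """Power ratio after charging only `kept_fraction` of the baseline."""
    return raw_ratio * (1.0 - baseline_share * (1.0 - kept_fraction))

share = implied_baseline_share(14.5, 10.9, kept_fraction=1.0 / 3.0)
power_check = renormalized_power_ratio(14.5, share, kept_fraction=1.0 / 3.0)

# Energy ratio = power ratio / speed-up, with the 5x speed-up of the 2nd Gen run.
energy_renormalized = power_check / 5.0
```

Under this reading the renormalized energy ratio rounds to the 2.2 quoted on the slide, and the raw figures give 14.5 / 5 = 2.9.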


THANK YOU!!!

APE lab: R. Ammendola (1), A. Biagioni (2), F. Capuani (2), P. Cretaro (2), G. De Bonis (2), O. Frezza (2), F. Lo Cicero (2), A. Lonardo (2), M. Martinelli (2), P. S. Paolucci (2), E. Pastorelli (2), L. Pontisso (2), F. Simula (2), P. Vicini (2)

(1) INFN, Roma Tor Vergata
(2) INFN, Roma

This project has received funding from the European Union’s Horizon 2020 Research and Innovation Programme under Grant Agreement No. 720270 (HBP SGA1) and No. 671553 (EXANEST)

