V. Milutinovic, G. Rakocevic, S. Stojanovic, and Z. Sustran University of Belgrade Oskar Mencer...

V. Milutinovic, G. Rakocevic, S. Stojanovic, and Z. SustranUniversity of Belgrade

Oskar MencerImperial College, London

Oliver Pell Maxeler Technologies, London and Palo Alto

Michael FlynnStanford University, Palo Alto

For Big Data algorithms and for the same hardware price as before, achieving:

a) speed-up, 20-200 b) monthly electricity bills, reduced

20 timesc) size, 20 times smaller

The major issues of engineering are: design cost and design complexity.

Remember, economy has its own rules: production count and market demand!

1. BigData2. WORM3. Tolerance to latency4. Over 95% of run time in loops5. Reusability of data (e.g.,

x+x2+x3+x4+…)6. Skills

Use a tractor, not a Ferrari, to drive over a plowed field

Absolutely all results achieved with:

a) All hardware produced in Europe, specifically UK

b) All software generated by programmers

of EU and WB

ControlFlow (MultiFlow and ManyFlow): Top500 ranks using Linpack

(Japanese K, IBM Sequoya, Cray Titan, …)

DataFlow: Coarse Grain (HEP) vs. Fine Grain

(Maxeler)

Compiling below the machine code level brings speedups;also a smaller power, size, and cost.

The price to pay:The machine is more difficult to program.

Consequently:Ideal for WORM applications :)

Examples using Maxeler:GeoPhysics (20-40), Banking (200-1000, with JP Morgan

20%),M&C (New York City), Datamining (Google), …

9/72Why Java? Minimal Kolmogorov Complexity, etc…

Assumptions: 1. Software includes enough parallelism to keep all cores busy 2. The only limiting factor is the number of cores.

tGPU = N * NOPS * CGPU*TclkGPU / NcoresGPU

tCPU = N * NOPS * CCPU*TclkCPU /NcoresCPU

tDF = NOPS * CDF * TclkDF + (N – 1) * TclkDF / NDF

DualCore?

Which way are the horses going?

Is it possibleto use 2000 chicken instead of two horses?

What is better, real and anecdotic?

2 x 1000 chickens (CUDA and rCUDA) 15/72

How about 2 000 000 ants?

Marmalade

Big Data Input Results

Factor: 20 to 200

MultiCore/ManyCore

Dataflow

Machine Level Code

Gate Transfer Level

Factor: 20

MultiCore/ManyCore

Dataflow

Factor: 20

Data Processing

Process ControlData Processing

Process Control

MultiCore/ManyCore

DataFlow

MultiCore:Explain what to do, to the driverCaches, instruction buffers, and predictors needed

ManyCore:Explain what to do, to many sub-driversReduced caches and instruction buffers needed

DataFlow:Make a field of processing gates: 1C+2nJava+3JavaNo caches, etc. (300 students/year: BGD, BCN, LjU,

ICL,…)

MultiCore:Business as usual

ManyCore:More difficult

DataFlow:Much more difficultDebugging both, application and configuration

MultiCore/ManyCore:Several minutes

DataFlow:Several hours for the real hardwareFortunately, only several minutes for the simulator,

and several seconds for reload (90% due to DRAM inertia)

The simulator supports both the large JPMorgan machineas well as the smallest “University Support” machine

Good news:Tabula@2GHz

MultiCore:Horse stable

ManyCore:Chicken house

DataFlow:Ant hole

MultiCore:Haystack

ManyCore:Cornbits

DataFlow:Crumbs

Small Data: Toy Benchmarks (e.g., Linpack)

Medium Data (benchmarks favorising NVidia,compared to Intel,…)

Big Data

Maxeler Hardware

CPUs plus DFEsIntel Xeon CPU cores and up to

4 DFEs with 192GB of RAM

DFEs shared over Infiniband Up to 8 DFEs with 384GB of RAM and dynamic allocation

of DFEs to CPU servers

Low latency connectivityIntel Xeon CPUs and 1-2 DFEs with up to six 10Gbit Ethernet

connections

MaxWorkstationDesktop development system

MaxCloudOn-demand scalable accelerated compute resource, hosted in London

3030/72/72

1. Coarse grained, stateful: Business– CPU requires DFE for minutes or hours

2. Fine grained, transactional with shared database: DM– CPU utilizes DFE for ms to s– Many short computations, accessing common database data

3. Fine grained, stateless transactional: Science (Phy, ...)– CPU requires DFE for ms to s– Many short computations

3131/72/72

Major Classes of Algorithms, from the Computational Perspective

• Long runtime, but:• Memory requirements

change dramatically based on modelled frequency

• Number of DFEs allocated to a CPU process can be easily varied to increase available memory

• Streaming compression• Boundary data exchanged

over chassis MaxRing

3232/72/72

Coarse Grained: Modeling

Number of MAX2 cards

15Hz peak frequency

30Hz peak frequency

45Hz peak frequency

70Hz peak frequency

0 10 20 30 40 50 60 70 80Peak Frequency (Hz)

Timesteps (thousand)

Domain points (billion)

Total computed points (trillion)

• DFE DRAM contains the database to be searched• CPUs issue transactions find(x, db)• Complex search function

– Text search against documents– Shortest distance to coordinate (multi-dimensional)– Smith Waterman sequence alignment for genomes

• Any CPU runs on any DFE that has been loaded with the database– MaxelerOS may add or remove DFEs

from the processing group to balance system demands– New DFEs must be loaded with the search DB before use

3333/72/72

Fine Grained, Shared Data: Monitoring

• Analyse > 1,000,000 scenarios• Many CPU processes run on many DFEs• ≈50x MPC-X vs. multi-core x86 node• Each transaction executes on any DFE

in the assigned group atomically

3434/72/72

Fine Grained, Stateless: The BSOP Control

CPU DFE Loop over instrumentsLoop over instruments

Random number generator and

sampling of underliers

Price instruments using Black

Scholes

Tail analysis on CPU

Scholes

DFE Loop over instrumentsLoop over instrumentsCPUMarket and instruments data

Scholes

ScholesInstrument values

3535/72/72

Selected Examples:Business,Mathematics,GeoPhysics, etc.

An MIS Example: Credit Derivatives

Climber

Tether

Orbital station

Seismic Imaging

• Running on MaxNode servers- 8 parallel compute pipelines per chip- 150MHz => low power consumption!- 30x faster than microprocessors

An Implementation of the Acoustic Wave Equation on FPGAs T. Nemeth†, J. Stefani†, W. Liu†, R. Dimond‡, O. Pell‡, R.Ergas§

†Chevron, ‡Maxeler, §Formerly Chevron, SEG 20084040/72/72

Performance of one MAX2 card vs. 1 CPU core

Land case (8 params), speedup of 230x

Marine case (6 params), speedup of 190x

The CRS Results

CPU Coherency MAX2 Coherency

4141/72/72

44444444/72/72

• DM for Monitoring and Control in Seismic processing • Velocity independent / data driven method

to obtain a stack of traces, based on 8 parameters• Search for every sample of each output trace

Trace Stacking: Speed-up 217P. Marchetti et al, 2010

parameters( emergence angle & azimuth

Normal Wave front parametersKN,11; KN,12 ; KN22

NIP Wave front parameters( KNip,11; KNip,12 ; KNip22 )

hHKHhmHKHmmw TzyNIPzy

TTzyNzy

vtthyp

4545/72/72

This is about algorithmic changes, to maximize

the algorithm to architecture match:

data choreography, process modifications,pipeline utilization,

anddecision precision.

The winning paradigm of Big Data ExaScale?

4747/72/72

Conclusion: Nota Bene

Revisiting the Top 500 SuperComputers benchmarksOur paper in Communications of the ACM

Revisiting all major Big Data DM algorithmsMassive static parallelism at low clock frequencies

Concurrency and communicationConcurrency between millions of tiny cores difficult,

“jitter” between cores will harm performance at synchronization points

Reliability and fault tolerance10-100x fewer nodes, failures much less often

Memory bandwidth and FLOP/byte ratioOptimize data choreography, data movement,

and the algorithmic computationNew architecture of n-Programming paradigms

FP7: RoMoL@BCN

4949/72/72

The SAB goal: Out of box thinking!

FP7: BalCon@SRB

5050/72/72

The SAB goal: Seed for new proposals!

The vision of Alkis Konstantellos

51/7251/72

DAFNE: Leader MISANU

52/7252/72

DAFNE = South (MaxCode) + North (BigData)

MISANU, IMP, KG, NS, BSC, UPV, U of Siena, U of Roma,IJS, FRI,IRB, QPLAN, Bogazici, U of Istanbul,U of Bucharest, U of Arad,U of Tuzla, Technion, Maxeler Israel, IPSI

UKSwedenNorway

DenmarkGermany

FranceAustria

SwissPoland

Hungary

53/7253/72

The DAFNE Map

The TriPeak @ DATAMAN

Siena+ BSC+ Imperial College + Maxeler+ Belgrade

The TriPeak: EssenceMontBlanc = A ManyCore (NVidia) + a MultiCore (ARM)Maxeler = A FineGrain DataFlow (FPGA)

How about a happy marriage?MontBlanc (ompSS) and Maxeler (an accelerator)

In each happy marriage,it is known who does what :)

The Big Data DM algorithms:What part goes to MontBlanc and what to Maxeler?

TriPeak: Core of the Symbiotic Success

An intelligent DM algorithmic scheduler,partially implemented for compile time,and partially for run time.

At compile time:Checking what part of code fits where(MontBlanc or Maxeler): LoC 1M vs 2K vs 20K

At run time:Rechecking the compile time decision,based on the current data values.

57/7257/7257

58/7258/72

Maxeler: Research (Google: good method)

Structure of a Typical Research Paper: Scenario #1[Comparison of Platforms for One Algorithm]Curve A: MultiCore of approximately the same PurchasePriceCurve B: ManyCore of approximately the same PurchasePriceCurve C: Maxeler after a direct algorithm migrationCurve D: Maxeler after algorithmic improvementsCurve E: Maxeler after data choreographyCurve F: Maxeler after precision modifications

Structure of a Typical Research Paper: Scenario #2[Ranking of Algorithms for One Application]CurveSet A: Comparison of Algorithms on a MultiCoreCurveSet B: Comparison of Algorithms on a ManyCoreCurveSet C: Comparison on Maxeler, after a direct algorithm migrationCurveSet D: Comparison on Maxeler, after algorithmic improvementsCurveSet E: Comparison on Maxeler, after data choreographyCurveSet F: Comparison on Maxeler, after precision modifications

59/7259/72

Maxeler Research in Serbia: Special Issue of IPSI Transactions Journal

KG: Blood Flow, Tijana Djukic and Prof. Filipovic

NS: Combinatorial Math, Prof. Senk and Ivan Stanojevic

MISANU: The SAT Math, Zivojin Sustran and Prof. Ognjanovic

ETF: Meteorology, Radomir Radojicic and Marko Stankovic

ETF: Physics (Gross Pitaevskii 3D real), Sasa Stojanovic

ETF: Physics (Gross Pitaevskii 3D imaginary), Lena Parezanovic

60/7260/72

Maxeler Research WorldWide:Special Issue of Advances in Computers @ SCI

Stanford, Texas,Imperial, Maxeler,ETF, MF, MISANU, IMP, KG, NS, BSC, UPV, U of Siena, U of Roma,IJS, FRI, …

62/7262/72

Maxeler: Teaching (Google: prof vm)TEACHING, VLSI, PowerPoints, Maxeler:

Maxeler Veljko Explanations, August 2012Maxeler Veljko Anegdotic, Maxeler Oskar Talk, August 2012Maxeler Forbes ArticleFlyer by JP MorganFlyer by Maxeler HPCTutorial Slides by Sasha and Veljko: Practice (Current Update)Paper, unconditionally accepted for Advances in Computers by ElsevierPaper, unconditionally accepted for Communications of the ACMTutorial Slides by Oskar: Theory (7 parts)Slides by Jacob, New YorkSlides by Jacob, AlabamaSlides by Sasha: Practice (Current Update)Maxeler in MeteorologyMaxeler in MathematicsExamples generated in Belgrade and Worldwide

THE COURSE ALSO INCLUDES DARPA METHODOLOGY FOR MICROPROCESSOR DESIGN, with an example

63/7263/72

Maxeler PreConference Tutorials (2013)

Google:

IEEE HiPeak, Berlin, Germany, January 2013

ACM iSAC, Coimbra, Portugal, March 2013

IEEE MECO, Budva, Montenegro, June 2013

ACM ISCA, Tel Aviv, Israel, June 2013

64/7264/72

Maxeler InHouse Tutorials (2013)

66/7266/72

Maxeler University Program Members

67/7267/72

How to Become a Family Member?

Options to consider:

a. MAX-UP free of charge b. Purchasing a university-level machine (min about $10K) c. Purchasing a JPM-level machine

(slowly approaching $100M), or at least a Schlumberger-level machine

(slowly moving above $10M)

68/7268/72

Good to Know!Maxeler employs close to 100 people, GBR and USA:

a. Maxeler cash burn per year = about $10M b. If a university-level machine is sold at the 100% profit margin, the company life of Maxeler is extended for about 2 hours. c. If a university-level machine is sold at the 1% profit margin, the company life of Maxeler is extended for 1 minute.

Our past or ongoing FP7 projects requiring Maxeler speeds:

a. ProSenseb. ARTreatc. HiPEAC

69/7269/72

The Educational Mission

The reality:

a. University-level machines are sold at the ZERO profit margin! b. Only the Xilinx costs, handling, and shipping. c. Email support for student doing thesis is practically unlimited!

Important note:

a. Total number of accredited universities in the whole world? b. As per WeboMetrics, about 20000. c. Consequently, all universities of the world together bring only: 20000 minutes of extra life, or about two weeks of extra life.

Conclusion: This is a chance for those who jump in first :)

70/7270/72

Our Work Impacting Maxeler

Milutinovic, V., Knezevic, P., Radunovic, B., Casselman, S., Schewel, J., Obelix Searches Internet Using Customer Data, IEEE COMPUTER, July 2000 (impact factor 2.205/2010).

Milutinovic, V., Cvetkovic, D., Mirkovic, J., Genetic Search Based on Multiple Mutation Approaches, IEEE COMPUTER, November 2000 (impact factor 2.205/2010).

Milutinovic, V., Ngom, A., Stojmenovic, I., STRIP --- A Strip Based Neural Network Growth Algorithm for Learning Multiple-Valued Functions, IEEE TRANSACTIONS ON NEURAL NETWORKS, March 2001, Vol.12, No.2, pp. 212-227.

Jovanov, E., Milutinovic, V., Hurson, A., Acceleration of Nonnumeric Operations Using Hardware Support for the Ordered Table Hashing Algorithms, IEEE TRANSACTIONS ON COMPUTERS, September 2002, Vol.51, No.9, pp. 1026-1040 (impact factor 1.822/2010).

71/7271/72

Maxeler Impacting Our Work

Tafa, Z., Rakocevic, G., Mihailovic, Dj., Milutinovic, V., Effects of Interdisciplinary Education On Technology-driven Application Design IEEE Transactions on Education, August 2011, pp.462-470. (impact factor 1.328/2010).

Tomazic, S., Pavlovic, V., Milovanovic, J., Sodnik, J., Kos, A., Stancin, S., Milutinovic, V., Fast File Existence Checking in Archiving Systems ACM Transactions on Storage (TOS) TOS Homepage archive, Volume 7 Issue 1, June 2011, ACM New York, NY, USA.

Jovanovic, Z., Milutinovic, V., FPGA Accelerator for Floating-Point Matrix Multiplication, IEE Computers & Digital Techniques, 2012, 6, (4), pp. 249-256.

Flynn, M., Mencer, O., Milutinovic, V., Rakocevic, G., Stenstrom, P., Trobec, R., and Valero, M., Moving from Petaflops (on Simple Benchmarks) to Petadata per Unit of Time and Power (On Sophisticated Benchmarks) Communications of the ACM, May 2013 (impact factor 1.919/2010).

V. Milutinovic, G. Rakocevic, S. Stojanovic, and Z. Sustran University of Belgrade Oskar Mencer...

Documents

1/9 CLASSIFICATION AND THE SURVIVAL ANALYSES IN THE ARTREAT PROJECT Goran Rakocevic Zoran Babovic Marko Novakovic Nenad Korolija Veljko Milutinovic Tuesday,

Acceleration of Cooley-Tukey algorithm using Maxeler machine

M. Rakic, S. Nektarijevic, I. Milinkovic, G. Rakocevic, M. Bumbasirevic, V. Milutinovic, V. Lekovic (SRB) with L. Luic (CRO) + M. Rudolf (SLO) 1

City Research Online Data...Milos Milojevic, Student Member, IEEE, Veselin Rakocevic, Member, IEEE 2 achieve reliable estimation from a large number of observations and overcome the

A Cross Layer Solution to Address TCP Intra-ﬂow Performance … · TCP Intra-ﬂow Performance Degradation in Multihop Ad hoc Networks Ehsan Hamadani, Veselin Rakocevic Abstract—Incorporating

FINANCIAL MANAGEMENT 597 - Symorgsymorg.fon.bg.ac.rs/proceedings/2016/papers... · Milosevic Nela, Makajic Nikolic Dragana, Barjaktarovic Rakocevic Sladjana DRIVERS OF HOTEL PROFIT

· j 1 ans.. kih iska Odgo Sla lice vic anka Rakocevic . Author: user Created Date: 20170322211244Z

Federated Authentication mechanism for mobile services Dasun Weerasinghe, Saritha Arunkumar, M Rajarajan, Veselin Rakocevic Mobile Networks Research Group

Incidence and mortality of diabetes in Serbia · Dr Ivana Rakočević / Ivana Rakocevic, MD Institut za javno zdravlje Srbije „Dr Milan Jovanović Batut”/ “Dr Milan Jovanovic

Maxeler Essence Revisited

Rakocevic S7P1 Implementation and Operation_2013

MERE1, a Low-Copy-Number Copia-Type Retroelement...MERE1, a Low-Copy-Number Copia-Type Retroelement in Medicago truncatulaActive during Tissue Culture1[C][W] Alexandra Rakocevic, Samuel

JEZICI I NACIONALIZMImedia.jezicinacionalizmi.com/2017/01/Beograd-Transkript...Ucestvuju: Ksenija Rakocevic, Ivan Colovic, Boban Arsenijevic, Nikola Vucic, Vuk Perišic i clan radne

Sustran Walking and Cycling Network Developlemt Guide · 5 Walking and Cycling Network Development Guide Technical Information Note 49 September 2017 We recommend open and transparent

King s Research Portal - COnnecting REpositories · 2017. 2. 18. · Kamran Zaidi, Milos Milojevic, Student Member, IEEE, Veselin Rakocevic, Member, IEEE, Arumugam Nallanathan, Senior

Sasa Stojanovic stojsasa@etf.rs Zivojin Sustran zika@etf.rs

Časopis za ekonomiju - emc-review.comemc-review.com/sites/default/files/2019-1.pdf · Prof. Dr Slađana Barjaktarović - Rakocevic, Vice Dean, Faculty of Organizational Sciences

Legislative Assembly of Ontario Seating Plan...Tom Rakocevic Humber River—Black Creek New Democratic Party . Suze Morrison Toronto Centre . New Democratic Party . Chris Glover Spadina—Fort

Izvestaj Komisije-Lj. Boskovic Rakocevic, 18.april 2016. · type zeolite on the content of macroelements in blackberry leaves. ... Microelements Mobility. Acta agriculturae Serbica,

Live Video Stream Processing on a Maxeler Reconfigurable