Collective Knowledge: Python and scikit-learn based open research SDK for collaborative data management and exchange

PyData, London

20 June 2015

Grigori Fursin, cTuning Foundation, France

Anton Lokhmotov, dividiti, UK


• Background

• Back to basics: major problems in computer engineering

• Machine learning for performance/energy optimization

• Collective Knowledge Infrastructure & Repository

• Organizing local code and data using Python wrappers + JSON

• Sharing all artifacts as reusable components

• Designing collaborative experiments from shared components

• Reproducing experiments

• Connecting predictive analytics

• Conclusions and future work

Outline

Interdisciplinary background (physics, electronics, ML)

1999-2004: PhD in computer science, University of Edinburgh, UK

Prepared foundation for machine-learning based performance autotuning

2007-2010: Tenured research scientist at INRIA, France; adjunct professor at Paris South University, France

Developed self-tuning compiler GCC combined with machine learning

2010-2011: Head of application optimization group at Intel Exascale Lab, France

Application characterization and optimization for exascale systems via

2012-2014: Senior tenured research scientist, INRIA, France

Collective Mind Project: open platform for sharing optimization knowledge

2014-now: Chief Scientist, non-profit cTuning foundation, France; CTO, dividiti, UK

Collective Knowledge Project: Python-based framework and repository for collaborative and reproducible experimentation in computer engineering combined with predictive analytics

Close collaboration with IBM, Intel, ARM, ARC, STMicroelectronics

Presented work and opinions are my own!

Back to 1993

Semiconductor neural element - base of neural accelerators

and brain-inspired computers Modeling and understanding

brain functions

Faced major problems during modeling:

• Too slow

• Too unreliable

• Too costly

• Too much data

θ - threshold

Result

Researchers and developers do not necessarily care about details of underlying

technology but simply want to:

• get result as fast as possible

• minimize all costs: power consumption, data/memory footprint, inaccuracies, price, size, faults …

• guarantee some constraints: power budget, real-time processing, bandwidth, QoS …

Back to basics

G. Fursin, A. Lokhmotov, et al. “Collective Mind, Part II: Towards Performance- and Cost-Aware Software Engineering as a Natural Science”, CPC’15, London, UK, available at arXiv

Result

Application

Compilers

Binary and libraries

Architecture

Run-time environment

State of the system Data set

Algorithm

Choose “best” solution from all available choices

Service/application providers (HPC, supercomputers, mobile systems)

Hardware and software designers

Back to basics: available solutions

20 years ago it was relatively simple!

Result

Back to basics: technological chaos

GCC 4.1.x, GCC 4.2.x, GCC 4.3.x, GCC 4.4.x, GCC 4.5.x, GCC 4.6.x, GCC 4.7.x

ICC 10.1, ICC 11.0, ICC 11.1, ICC 12.0, ICC 12.1

LLVM 2.6, LLVM 2.7, LLVM 2.8, LLVM 2.9, LLVM 3.0

Phoenix, MVS, XLC, Open64, Jikes, Testarossa

OpenMP, MPI, OpenCL, CUDA, TBB, ATLAS

gprof, prof, oprofile, Scalasca, Amplifier

scheduling, polyhedral transformations, LTO, pass reordering, run-time adaptation, per phase reconfiguration

algorithm-level, program-level, function-level, Codelet, loop-level, threads, process

hardware counters, cache size, frequency, bandwidth, HDD size, memory size, processors, threads

power consumption, execution time, reliability

Is your system optimal? No one knows ...

Fundamental problems:

1) Ever rising complexity of computer systems: too many design and optimization choices at ALL levels

2) It’s not only performance that matters: multiple user objectives vs choices, benefit vs optimization time

3) Complex relationships and interactions between software and hardware components

4) Too many ever changing tools with non-unified interfaces changing from version to version: technological chaos

5) No common methodology for performance/energy evaluation and benchmarking

Result

Back to basics: ever rising complexity

GCC compiler (similar trends in LLVM)

• OpenCL/CUDA/OpenMP/MPI parameters

• CPU/GPU frequency

• number of threads

• algorithm accuracy/precision

Large, multi-dimensional design and optimization spaces

Program: image corner detection

Processor: ARM v7 (Cortex A15), 2.0GHz

Compiler: GCC for ARM v4.9.2

OS: Ubuntu 14.04.02 LTS

System: ODROID-XU3

Data set: MiDataSet #1, image, 600x450x8b PGM, 263KB

500 combinations of random flags -Ox -f(no-)FLAG

GCC v4.9.2 -O3 == LLVM v3.4 -O3

Cluster around -Os with “bad” flags

Cluster around -O0 with “bad” flags

Cluster around -O1, -O2 with “bad” flags

Back to basics: SW/HW autotuning

~20% improvement
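The random exploration mentioned on this slide ("-Ox -f(no-)FLAG") can be sketched roughly as follows. This is only an illustration, not the authors' tuning script: the flag subset, the three-way on/off/default choice and the function names are assumptions made for the example.

```python
import random

# Small illustrative subset of GCC -f flags (an assumption; the real
# experiments explored far more flags).
FLAGS = ["unroll-all-loops", "modulo-sched", "if-conversion",
         "prefetch-loop-arrays", "tree-vectorize", "inline-functions"]
BASE_LEVELS = ["-O0", "-O1", "-O2", "-O3", "-Os"]


def random_flag_combination(rng):
    """Pick a base -Ox level, then switch each flag on, off, or leave it at its default."""
    parts = [rng.choice(BASE_LEVELS)]
    for flag in FLAGS:
        state = rng.choice(("on", "off", "default"))
        if state == "on":
            parts.append("-f" + flag)
        elif state == "off":
            parts.append("-fno-" + flag)
    return " ".join(parts)


def generate_combinations(n, seed=0):
    """Return n random flag strings; a fixed seed keeps the exploration repeatable."""
    rng = random.Random(seed)
    return [random_flag_combination(rng) for _ in range(n)]


combinations = generate_combinations(500)
```

Each generated string would then be passed to the compiler, and the resulting execution time and code size recorded for comparison against the -O3 baseline.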

How do realistic programs behave?

Continuously tuning 285 shared code and dataset combinations from 8 benchmarks including NAS, MiBench, SPEC2000, SPEC2006, Powerstone, UTDSP and SNU-RT

using GRID 5000; Intel E5520, 2.26GHz; GCC 4.6.3; at least 5000 random combinations of flags

Compilers are tested with a limited set of (possibly non-representative) benchmarks

Continuously tuning (crowd-tuning) shared benchmarks and datasets using GRID5000, mobile phones, tablets, laptops, and

other spare resources:

Collective Mind Node (Android Apps on Google Play): https://play.google.com/store/apps/details?id=com.collective_mind.node

Back to basics: cost of computation

P1) Intel Core i5-2540M, 2.60GHz, 2 cores
P2) Qualcomm MSM7625A FFA, ARM Cortex A5, 1 GHz, 1 core
P3) Allwinner A20 (sun7i), ARM Cortex A7, 1.6GHz, Mali400 GPU, 2 cores
P4) NVidia Quadro NVS 135M, 400MHz, 16 cores

D1) grayscale image 1, size=1536x1536
D2) grayscale image 2, size=1536x1536

O1) Windows 7 Pro SP1, cost~170 euros
O2) O1 with MinGW32
O3) OpenSuse 12.1, Kernel 3.1.10
O4) Android 4.1.2, Kernel 3.4.0
O5) Android 4.2.2, Kernel 3.3.0

T1) 7.2E10
T2) 9.6E9
T3) 2.4E9
T4) 1.0E9

W1) 32 bit processor mode
W2) 64 bit processor mode

X1) GCC 4.1.1, opt.flags~190, release date=2006
X2) GCC 4.4.1, opt.flags~270, release date=2009
X3) GCC 4.4.4, opt.flags~270, release date=2010
X4) GCC 4.6.3, opt.flags~320, release date=2012
X5) GCC 4.7.2, opt.flags~340, release date=2012
X6) GCC 4.8.3, opt.flags~350, release date=2014
X7) GCC 4.9.1, opt.flags~357, release date=2014
X8) LLVM 3.1, release date=2012
X9) LLVM 3.4.2, release date=2014
X10) Open64 5.0, release date=2011
X11) PathScale 2.3.1, release date=2006
X12) NVidia CUDA Toolkit 5.0, release date=2012
X13) Intel Composer XE 2011, cost~800 euros
X14) Microsoft Visual Studio 2013

S1) Dell Laptop Latitude E6320, Mem=8Gb, 52W, 1200 euros
S2) Samsung Mobile GT-S6312, Mem=0.8Gb, 5W, 200 euros
S3) Polaroid Tablet MID0927, Mem=1Gb, 13W, 100 euros
S4) Semiconductor neural network, 1.5 years development

Y1) Performance (usually -O3)
Y2) Size (usually -Os)
Y3) -O3 -fmodulo-sched -funroll-all-loops
Y4) -O3 -funroll-all-loops
Y5) -O3 -fprefetch-loop-arrays
Y6) -O3 -fno-if-conversion
Y7) Auto-tuning with more than 6 flags (-fif-conversion)
Y8) Auto-tuning with more than 6 flags (-fno-if-conversion)

Analysis of computation cost of my neural network kernel in the past 10 years

1) P1 O3 W2 X1 Y1 T2 D1
2) P1 O3 W2 X7 Y1 T2 D1
3) P1 O3 W2 X1 Y7 T2 D1
4) P1 O3 W2 X7 Y5 T2 D1
5) P1 O3 W2 X11 Y1 T2 D1
6) P1 O3 W2 X9 Y1 T2 D1
7) P1 O3 W2 X3 Y7 T2 D1
8) P1 O3 W2 X4 Y8 T2 D2
9) P1 O1 W1 X14 Y1 T3 D1
10) P1 O1 W1 X13 Y1 T2 D1
11) P1 O3 W2 X7 Y8 T2 D2

A) P3 O5 W1 X1 Y1 T4 D1
B) P3 O5 W1 X4 Y1 T4 D1
C) P3 O5 W1 X4 Y7 T4 D1
D) P3 O5 W1 X6 Y1 T4 D1
E) P3 O5 W1 X6 Y7 T4 D1
F) P3 O5 W1 X9 Y1 T4 D1

I) P2 O4 W1 X1 Y1 T4 D1
II) P2 O4 W1 X6 Y1 T4 D1

$) P4 O3 W1 X12 Y1 T1 D1

Available resource: P1, one core

Available resource: P1, two cores

Similar graphs can be plotted for consumed energy, price, frequency, faults or anything else, depending on user needs


Most of the time underperforming systems!

Waste of expensive resources and time!

Getting worse, not better!

What should we do?

Result

Consider tasks and computational resources as a complex physical system

Requirements ( r )

Properties ( p )

System/task state ( s )

Behavior / characteristics ( b )

Gradually expose all available algorithm, design and optimization choices

Expose additional information

Continuously observe behavior (characteristics); check for normality

Continuously learning (modeling) observed behavior

Predict optimal choices / behavior if enough knowledge

If unexpected behavior, continuously improve models (active learning), increase granularity, find more properties

Why not use machine learning to predict optimizations?

Combine interdisciplinary knowledge in physics, electronics, mathematics, neural networks and machine learning

Combine autotuning with machine learning and crowdsourcing

Plugin-based MILEPOST GCC

Plugins

Monitor and explore optimization space

Extract semantic program features

cTuning.org: plugin-based auto-tuning framework and public repository

Program or kernel1

Program or kernel N

Unseen program

MILEPOST GCC

Plugins Collect dynamic features

Cluster Build predictive model

Collect hardware counters

Predict optimization to minimize

execution time, power consumption,

code size, etc

• G. Fursin et al. MILEPOST GCC: Machine learning based self-tuning compiler. 2008, 2011

• G. Fursin and O. Temam. Collective optimization: A practical collaborative approach. 2010

• G. Fursin. Collective Tuning Initiative: automating and accelerating development and optimization of computing systems. 2009

• F. Agakov et al. Using Machine Learning to Focus Iterative Optimization. 2006


In 2009, we opened a public repository of knowledge (cTuning.org) and managed to automatically tune customer benchmarks and compiler heuristics for a range of real platforms from IBM and ARC (Synopsys)

This has now become a hot topic: is everything solved?

Combine autotuning with machine learning and crowdsourcing

Technological chaos

GCC 4.1.x, GCC 4.2.x, GCC 4.3.x, GCC 4.4.x, GCC 4.5.x, GCC 4.6.x, GCC 4.7.x, GCC 4.8.x

ICC 10.1, ICC 11.0, ICC 11.1, ICC 12.0, ICC 12.1

LLVM 2.6, LLVM 2.7, LLVM 2.8, LLVM 2.9, LLVM 3.0, LLVM 3.4

Phoenix, MVS 2013, Open64, Jikes, Testarossa

OpenMP, MPI, OpenCL, CUDA 4.x, CUDA 5.x, TBB

gprof, prof, oprofile, Scalasca, Amplifier, SimpleScalar

predictive scheduling, genetic algorithms, LTO, pass reordering, algorithm precision

algorithm-level, program-level, function-level, Codelet, loop-level, threads, process

hardware counters, cache size, frequency, bandwidth, HDD size, TLB, ISA, memory size, threads

ARM v6, ARM v8, Intel SandyBridge

execution time, reliability

We also experienced a few more problems

• Everything changes all the time

• Difficult to reproduce results collected from multiple users (including variability of performance data and constant changes in the system)

• Difficult to expose choices, observe behavior and extract features (tools are not prepared for auto-tuning and machine learning)

• Difficult to share experimental setups (many SW/HW dependencies) including code, data and their features

• Difficult to save heterogeneous and continuously changing data in MySQL

It’s not about machine learning: it’s about effective data and knowledge management

Collective Knowledge Project: how to keep track of all past R&D?

GCC 4.1.x, GCC 4.2.x, GCC 4.3.x, GCC 4.4.x, GCC 4.5.x, GCC 4.6.x, GCC 4.7.x, GCC 4.8.x

ICC 10.1, ICC 11.0, ICC 11.1, ICC 12.0, ICC 12.1

LLVM 2.6, LLVM 2.7, LLVM 2.8, LLVM 2.9, LLVM 3.0, LLVM 3.4

Phoenix, MVS 2013, Open64, Testarossa

OpenMP, MPI, OpenCL, CUDA 4.x, CUDA 5.x, TBB

gprof, prof, oprofile, Scalasca, Amplifier, SimpleScalar

predictive scheduling, genetic algorithms, pass reordering, algorithm precision

algorithm-level, program-level, function-level, Codelet, loop-level, process, threads

hardware counters, cache size, frequency, bandwidth, HDD size, TLB

ARM v6, ARM v8, Intel SandyBridge

execution time, reliability

image-jpeg-0001

bzip2-0006

txt-0012

video-raw-1280x1024

GCC 5.0.1 bin

GCC 5.0.1 source

LLVM 3.6

gmp 5.0.5

mpfr 3.1.0

lapack 2.3.0

java apache commons codec 1.7

image corner detection

matmul CUDA

compression

neural network OpenCL

Gradually cleaning up the mess

Group: programs

Have some common functions: compile, run, etc …

Have some common meta: which datasets can use, how to compile, CMD, …

Group: data sets

Have some common functions: find, extract features

Have some (common) meta: filename, size, width, height, colors, …

Group: packages

Have some common functions: install, check dependencies

Have some (common) meta: dependencies, installation scripts, …

Saviour – python, extensible json, and local repository

image-jpeg-0001

bzip2-0006

txt-0012

video-raw-1280x1024

GCC 5.0.1 bin

GCC 5.0.1 source

LLVM 3.6

gmp 5.0.5

mpfr 3.1.0

lapack 2.3.0

image corner detection

matmul CUDA

compression

neural network OpenCL

Gradually cleaning up the mess

meta.json

Python wrapper: program

Functions: compile,

Python wrapper: dataset

Functions: extract_features

Python wrapper: package

Functions: install

UID or alias (UOA) UID or alias (UOA)

compiler GCC 4.4.4

GCC 4.9.2

LLVM 3.1

LLVM 3.4

package GCC 5.0.1 bin

GCC 5.0.1 source

LLVM 3.6

gmp 5.0.5

mpfr 3.1.0

lapack 2.3.0

dataset image-jpeg-0001

bzip2-0006

video-raw-1280x1024

module compiler

package

dataset

CK module JSON meta-description Files, directories

Compiler flags

Installation info

Features

Actions

.cmr / module UOA / data UOA (UID or alias) / .cm / data.json


Data can always be found via CID (similar to DOI):

(repo UOA):module UOA : data UOA
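As a rough illustration of the CID format above, the helper below splits a CID string into its parts, treating the repository part as optional. The function name `parse_cid` and the result keys `repo_uoa` and `data_uoa` are assumptions for this sketch and are not taken from the slides; only `module_uoa`, `return` and `error` appear there.

```python
def parse_cid(cid):
    """Split '(repo UOA:)module UOA:data UOA' into a dict.

    Hypothetical helper: returns 'return' == 0 on success and
    'return' > 0 with an 'error' message otherwise.
    """
    parts = [p.strip() for p in cid.split(":")]
    if len(parts) == 2:
        repo_uoa = None  # repository part omitted
        module_uoa, data_uoa = parts
    elif len(parts) == 3:
        repo_uoa, module_uoa, data_uoa = parts
    else:
        return {"return": 1, "error": "unexpected CID format: " + cid}
    return {"return": 0, "repo_uoa": repo_uoa,
            "module_uoa": module_uoa, "data_uoa": data_uoa}
```

For example, `parse_cid("dataset:image-jpeg-0001")` yields the module `dataset` and the entry `image-jpeg-0001`, with no repository specified.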

Connecting it all together – CK python-based interface

http://github.com/ctuning/ck

Tiny CK infrastructure (~200Kb), permissive BSD license (preparing packages for Debian and pip)

Perform some action on an entry from the command line:

    ck [action] [module_uoa] [CID] @input.json

For example:

    ck list program
    ck find dataset:image-jpeg-0001
    ck compile program:slambench-1.1-opencl

From Python or an IPython notebook:

    import ck.kernel as ck
    # every call takes a dict and returns a dict with a 'return' code
    r = ck.access({'action': 'compile', 'cid': 'slambench-1.1-opencl'})
    if r['return'] > 0:
        print(r['error'])
        exit(1)

As a web service with a simple JSON-based API:

    ck start web
    firefox http://localhost:3344/?action=load&cid=dataset:image-jpeg-0001

(returns meta in JSON or HTML)

Can perform P2P information exchange

Input to all functions: schema-free dict/JSON, extended when needed and abstracted by module

Fixed keys: action, module_uoa, CID

Output from all functions: dict/JSON

Fixed keys: return, error
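The dict-in, dict-out convention described above can be illustrated with a small stand-alone sketch. It deliberately does not import CK: the `access` dispatcher, the `ACTIONS` table and the `compile_program` action below are hypothetical stand-ins that only mimic the fixed keys named on the slide (`action`, `return`, `error`).

```python
def compile_program(i):
    """Hypothetical action: takes a dict and returns a dict with a 'return' code."""
    cid = i.get("cid")
    if not cid:
        return {"return": 1, "error": "'cid' is not defined"}
    # A real wrapper would call a compiler here; this sketch only echoes the request.
    return {"return": 0, "compiled": cid, "flags": i.get("flags", "-O3")}


# Table of available actions (an assumption for this sketch).
ACTIONS = {"compile": compile_program}


def access(i):
    """Dispatch on the fixed 'action' key; every other key is free-form."""
    action = i.get("action")
    func = ACTIONS.get(action)
    if func is None:
        return {"return": 1, "error": "unknown action: %s" % action}
    return func(i)


r = access({"action": "compile", "cid": "program:slambench-1.1-opencl"})
if r["return"] > 0:
    print(r["error"])
```

Because inputs and outputs are plain dictionaries, new keys can be added by one module without breaking callers that do not know about them.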

Behavior

Choices

Features

Hardwired experimental setups, very difficult to change, scale or share

Collective Knowledge concept

Meta description that should be exposed in the information flow for auto-tuning and machine learning

Tool B VM

Tool B V2

Tool A VN

Tool A V2

Tool A V1 Tool B V1 Ad-hoc analysis and

learning scripts

Ad-hoc tuning scripts

Collection of CSV, XLS, TXT

and other files

Experiments

CK module (wrapper) with unified and formalized input and output

Tool B Vi Generated files

Original unmodified

ad-hoc input


Unified JSON input (meta-data)

Tool B Vi

Behavior

Choices

Features

Action

Action function

Generated files

Parse and unify

output

Unified JSON

output (meta-data)

Unified JSON input (if exists)

Original unmodified

ad-hoc input

b = B(c, f, s)

Formalized function (model) of a component's behavior b, given its choices c, features f and state s

Flattened JSON vectors (either string categories or integer/float values)


Set environment

for a given tool version


Multiple tool versions can co-exist, while their interface is abstracted

by CK module
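A minimal sketch of such a wrapper is shown below, assuming two versions of a tool that print their result in different ad-hoc formats. The "tools" are stand-ins launched through the current Python interpreter, and the version strings, output formats and the `run_tool` name are invented for the illustration.

```python
import subprocess
import sys

# Stand-in tools: two versions with different ad-hoc output formats (hypothetical).
TOOLS = {
    "1.0": [sys.executable, "-c", "print('time=10.3')"],
    "2.0": [sys.executable, "-c", "print('elapsed: 10.1 sec')"],
}


def parse_output(version, text):
    """Convert version-specific text output into a single float (seconds)."""
    if version == "1.0":
        return float(text.strip().split("=")[1])
    return float(text.split(":")[1].split()[0])


def run_tool(i):
    """Unified dict-in, dict-out wrapper that hides the tool version from the caller."""
    version = i.get("version", "2.0")
    if version not in TOOLS:
        return {"return": 1, "error": "unsupported tool version: " + version}
    proc = subprocess.run(TOOLS[version], capture_output=True, text=True)
    if proc.returncode != 0:
        return {"return": 1, "error": proc.stderr.strip()}
    seconds = parse_output(version, proc.stdout)
    return {"return": 0, "characteristics": {"execution_time": seconds}}
```

Callers see the same `characteristics` key regardless of which tool version produced the number, which is the point of abstracting the interface.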


ck run pipeline:program --speed --energy --dataset_uoa=image_1024_768 --record --record_uoa=test123

ck add experiment:test123

ck replay experiment:test123

Runs on any HW, SW and OS (Android, Linux, Windows, MacOS …)


Assembling, preserving, sharing and extending experimental pipeline as “LEGO”


Chaining CK components (wrappers) to an experimental pipeline for a given research and experimentation scenario
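One way to picture this chaining is a list of step functions that share a single state dictionary and each report a `return` code. The sketch below is only an illustration under that assumption: the step names are hypothetical and the recorded values are placeholders, not measurements.

```python
def prepare_flags(state):
    """Fill in a default compiler flag choice if none was given."""
    state["choices"].setdefault("compiler_flags", "-O3")
    return {"return": 0}


def compile_program(state):
    """Placeholder compile step: records a made-up code size."""
    state["characteristics"]["code_size"] = 131938  # placeholder, not measured
    return {"return": 0}


def run_program(state):
    """Placeholder run step: records made-up execution times."""
    state["characteristics"]["execution_times"] = [10.3, 10.1, 13.3]  # placeholder
    return {"return": 0}


def run_pipeline(steps, choices):
    """Run steps in order; stop at the first step that reports an error."""
    state = {"choices": dict(choices), "characteristics": {}, "log": []}
    for step in steps:
        r = step(state)
        if r["return"] > 0:
            message = "%s: %s" % (step.__name__, r.get("error", ""))
            return {"return": r["return"], "error": message}
        state["log"].append(step.__name__)
    return {"return": 0, "state": state}


result = run_pipeline([prepare_flags, compile_program, run_program],
                      {"compiler": "gcc"})
```

Steps can be added, removed or reordered without touching the others, as long as each one reads and writes the shared dictionary.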

Public modular auto-tuning and machine learning repository and buildbot

Unified web services Interdisciplinary crowd

Choose exploration

strategy

Generate choices (code sample, data set, compiler,

flags, architecture …)

Compile source code

Run code

Test behavior normality

Pareto filter

Modeling and

prediction

Complexity reduction
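The "Pareto filter" stage of the pipeline above can be illustrated with a short sketch that keeps only the non-dominated points when every objective is minimized. The candidate numbers are made up, and the function is a generic filter rather than the CK implementation.

```python
def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_filter(points):
    """Keep points that no other point dominates (all objectives minimized)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# (execution time in seconds, code size in bytes): made-up values
candidates = [(10.3, 131938), (10.1, 140000), (13.3, 120000), (12.0, 150000)]
frontier = pareto_filter(candidates)
```

Here the last candidate is dropped because the first one is both faster and smaller; the other three each win on at least one objective.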

Shared scenarios from past research

Gradually adding specification (agile development)

cTuning experiment module data.json

"characteristics":{
  "execution times": ["10.3","10.1","13.3"],
  "code size": "131938", ...},
"choices":{
  "os":"linux", "os version":"2.6.32-5-amd64",
  "compiler":"gcc", "compiler version":"4.6.3",
  "compiler_flags":"-O3 -fno-if-conversion",
  "platform":{"processor":"intel xeon e5520",
              "l2":"8192", ...}, ...},
"features":{
  "semantic features": {"number_of_bb": "24", ...},
  "hardware counters": {"cpi": "1.4", ...}, ...},
"state":{
  "frequency":"2.27", ...}

cM flattened JSON key

##characteristics#execution_times@1
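The flattened key shown above suggests a simple scheme: a leading `#`, then `#key` for each dictionary level and `@index` for each list position. The sketch below implements that reading; the scheme is inferred from the single example on the slide, and the zero-based list index is an assumption.

```python
def flatten(value, prefix="#"):
    """Flatten nested dicts/lists into {flat_key: scalar}.

    Dict keys are appended as '#key' and list positions as '@index'
    (zero-based here, which is an assumption). Empty containers are dropped.
    """
    flat = {}
    if isinstance(value, dict):
        for key, item in value.items():
            flat.update(flatten(item, prefix + "#" + str(key)))
    elif isinstance(value, list):
        for index, item in enumerate(value):
            flat.update(flatten(item, prefix + "@" + str(index)))
    else:
        flat[prefix] = value
    return flat


experiment = {
    "characteristics": {"execution_times": ["10.3", "10.1", "13.3"]},
    "choices": {"compiler": "gcc", "compiler_flags": "-O3 -fno-if-conversion"},
}
flat = flatten(experiment)
```

A flat mapping like this can be turned directly into the feature vectors that predictive models expect.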

"flattened_json_key": {
  "type": "text" | "integer" | "float" | "dict" | "list" | "uid",
  "characteristic": "yes" | "no",
  "feature": "yes" | "no",
  "state": "yes" | "no",
  "has_choice": "yes" | "no",
  "choices": [ list of strings if categorical choice ],
  "explore_start": "start number if numerical range",
  "explore_stop": "stop number if numerical range",
  "explore_step": "step if numerical range",
  "can_be_omitted": "yes" | "no"
}

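To show how such a specification could drive exploration, the sketch below expands one key's description into the list of values to try. It is a guess at the intended semantics: the `expand_choices` name is hypothetical, numeric ranges are assumed to be integers, and the stop value is assumed to be inclusive.

```python
def expand_choices(spec):
    """Enumerate candidate values for one flattened key from its specification."""
    if spec.get("has_choice") != "yes":
        return []  # nothing to explore for this key
    if "choices" in spec:
        return list(spec["choices"])  # categorical choice
    start = int(spec["explore_start"])
    stop = int(spec["explore_stop"])
    step = int(spec.get("explore_step", 1))
    return list(range(start, stop + 1, step))  # inclusive stop: an assumption


flag_spec = {"type": "text", "has_choice": "yes", "choices": ["-O2", "-O3", "-Os"]}
threads_spec = {"type": "integer", "has_choice": "yes",
                "explore_start": "1", "explore_stop": "9", "explore_step": "4"}
```

With these two specifications, `expand_choices(flag_spec)` lists the three flags and `expand_choices(threads_spec)` lists the thread counts 1, 5 and 9.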
User A

CK repo: ck-analytics

CK repo: ck-env

CK repo: ctuning-datasets-min

CK repo: ck-autotuning

CK repo: ctuning-programs

CK repo: private-experiments

User B

CK repo: private-experiments

CK repo: ctuning-programs

CK repo: ctuning-datasets-min

CK repo: ck-analytics

CK repo: ck-env

CK repo: ck-autotuning

Private GIT

Semi-private GIT

Can be used in companies via private repos, while supporting common

experimental methodology (reporting performance issues,

sharing code samples and data sets)

Enabling open collaboration and code/data sharing as reusable components with CK wrappers

Get new repo simply as ck pull repo:ck-analytics

CK web interface (with JSON API)

See cknowledge.org/repo

All interconnected and reusable artifacts (code&data),

experiments, interactive graphs, predictive models, papers, reports,

Tiny CK core ~100Kb Python + JSON API

github.com/ctuning/ck

Public GIT (github; bitbucket) See our CK repositories at github.com/ctuning

Optional JSON-based ElasticSearch

indexing

Organizing local and shared (public or private) repos

Apply top-down experimental methodology similar to physics

Gradually expose some characteristics

Gradually expose some choices

Algorithm selection

(time) productivity, variable-accuracy, complexity …

Language, MPI, OpenMP, TBB, MapReduce …

Compile Program time … compiler flags; pragmas …

Code analysis & Transformations

time; memory usage; code size …

transformation ordering; polyhedral transformations; transformation parameters; instruction ordering …

Process

Thread

Function

Codelet

Instruction

Run code Run-time environment

time; power consumption … pinning/scheduling …

System cost; size … CPU/GPU; frequency; memory hierarchy …

Data set size; values; description … precision …

Run-time analysis

time; precision … hardware counters; power meters …

Run-time state processor state; cache state …

helper threads; hardware counters …

Analyze profile time; size … instrumentation; profiling …

Coarse-grain vs. fine-grain effects: depends on user requirements and expected ROI

Assemble pipelines from shared components

• Init pipeline
• Detected system information
• Initialize parameters
• Prepare dataset
• Clean program
• Prepare compiler flags
• Use compiler profiling
• Use cTuning CC/MILEPOST GCC for fine-grain program analysis and tuning
• Use universal Alchemist plugin (with any OpenME-compatible compiler or tool)
• Use Alchemist plugin (currently for GCC)
• Compile program
• Get objdump and md5sum (if supported)
• Use OpenME for fine-grain program analysis and online tuning (build & run)
• Use 'Intel VTune Amplifier' to collect hardware counters
• Use 'perf' to collect hardware counters
• Set frequency (in Unix, if supported)
• Get system state before execution
• Run program
• Check output for correctness (use dataset UID to save different outputs)
• Finish OpenME
• Misc info
• Observed characteristics
• Observed statistical characteristics
• Finalize pipeline

We can easily assemble, extend and customize research, design and experimentation pipelines

for company needs!

We gradually unify and clean up ad-hoc setups!

http://cknowledge.org/repo

• Hundreds of benchmarks/kernels/codelets (CPU, OpenMP, OpenCL, CUDA)

• Thousands of data sets

• Description of major compilers: GCC 4.x, GCC 5.x, LLVM 3.x, ICC 12.x

Adaptive workload scheduling combined with active learning

Original features (properties): V1=GWS0, V2=GWS1, V3=GWS2, V4=cpu_freq, V5=gpu_freq, V6=block size, V7=image cols, V8=image rows

Designed features: V9=image size, V10=size_div_by_cpu_freq, V11=size_div_by_gpu_freq, V12=cpu_freq_div_by_gpu, V13=size_div_by_cpu_div_by_gpu_freq, V14=image_size_div_by_cpu_freq

Application: OpenCL based real time video stream processing for mobile devices

Experiments:

276 builds/runs with random features

Characteristics:

• CPU execution time

• GPU ONLY execution time

• GPU + MEM COPY execution time

Devices:

• Chromebook 1: 4x Mali-T60x / 2x A15

• Chromebook 2: 4x Mali-T62x / 4x A15

Objective (divide execution time): CPU/GPU COPY > 1.07 (true/false)? (useful for adaptive scheduling)

Our user had a real-time, machine-learning based image processing application running on mobile devices with GPUs: should it always be offloaded to the GPU?

ck build model.sklearn

ck validate model.sklearn

(operates with ‘features’ and ‘characteristics’ keys in JSON)

EU FP7 TETRACOM project: cTuning and ARM

Samsung Chromebook1

Automatically built decision tree

with scikit-learn when more data is available.

Not a black box - gives hints to engineers

where to focus their attention.

Can drive further exploration on areas

with “unusual” behavior.

96% prediction rate
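The kind of model described here can be sketched with scikit-learn's `DecisionTreeClassifier`. Everything numeric below is synthetic: the timing formulas, frequencies and image sizes are invented purely to produce a labelled data set, so the resulting tree and accuracy say nothing about the Chromebook results on the slides. Only the 1.07 threshold, the count of 276 runs and the feature names are taken from the text.

```python
import random

from sklearn.tree import DecisionTreeClassifier, export_text

FEATURE_NAMES = ["cpu_freq", "gpu_freq", "image_cols", "image_rows",
                 "image_size", "size_div_by_cpu_freq"]

rng = random.Random(0)
features, labels = [], []
for _ in range(276):  # same number of runs as on the slide; values are synthetic
    cols = rng.choice([320, 640, 1280])
    rows = rng.choice([240, 480, 720])
    cpu_freq = rng.choice([800, 1600])
    gpu_freq = rng.choice([177, 533])
    size = cols * rows  # designed feature: image size
    # Invented cost model: GPU is faster per pixel but pays a fixed copy overhead.
    cpu_time = size / cpu_freq
    gpu_copy_time = size / (4 * gpu_freq) + 200.0
    features.append([cpu_freq, gpu_freq, cols, rows, size, size / cpu_freq])
    labels.append(int(cpu_time / gpu_copy_time > 1.07))  # 1 = offload to GPU

# Train on the first 200 runs, validate on the remaining 76.
model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(features[:200], labels[:200])
accuracy = model.score(features[200:], labels[200:])

# A readable tree is what makes the model useful as a hint to engineers.
print(export_text(model, feature_names=FEATURE_NAMES))
```

Printing the tree as text is one way to inspect which features the model splits on, in line with the "not a black box" remark above.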

Samsung Chromebook2

Using old model 74% prediction rate

Samsung Chromebook2

More data, new model 96% prediction rate ADAPTIVE SCHEDULING gives ~32% performance

improvement in comparison with always using GPU

Results shared with the community for reproducibility:

cknowledge.org/repo/web.php?wcid=bc0409fb61f0aa82:fd54cd4b3b73b72b cknowledge.org/repo/web.php?wcid=bc0409fb61f0aa82:3bfd697a48fbba16

Reproducibility comes as a side effect!

• Can preserve the whole experimental setup with all data and software dependencies

• Can perform statistical analysis for characteristics

• Community can add missing features or improve machine learning models

Variation of experimental results:

10.5 ± 6.5 secs.

Reproducibility of experimental results as a side effect

[Figure: execution time (sec.) versus CPU frequency (800MHz, 2400MHz), with Class A and Class B]

Unexpected behavior - expose to the community including experts to explain, find missing feature and add to the system
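The effect of a missing feature on reported variation can be shown with a small sketch. It assumes that "10.5 ± 6.5" denotes the midpoint and half-range of the observed times, and it uses made-up timings; the function names are hypothetical.

```python
def summarize(times):
    """Report observed times as midpoint and half-range (center ± half_range)."""
    low, high = min(times), max(times)
    return {"center": (low + high) / 2.0, "half_range": (high - low) / 2.0}


def split_by_feature(samples, feature):
    """Group timings by one feature value and summarize each group separately."""
    groups = {}
    for sample in samples:
        groups.setdefault(sample[feature], []).append(sample["time"])
    return {value: summarize(times) for value, times in groups.items()}


# Made-up timings chosen so that the overall spread matches 10.5 ± 6.5 sec.
samples = [
    {"cpu_freq": 800, "time": 17.0}, {"cpu_freq": 800, "time": 16.0},
    {"cpu_freq": 2400, "time": 4.0}, {"cpu_freq": 2400, "time": 5.0},
]
overall = summarize([s["time"] for s in samples])
per_freq = split_by_feature(samples, "cpu_freq")
```

In this toy data the overall spread is large, but once CPU frequency is recorded as a feature each group varies by only half a second, which is the sense in which finding a missing feature explains unexpected behavior.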


Making computer engineering a data science

Current state of computer engineering

GCC 4.1.x, GCC 4.2.x, GCC 4.3.x, GCC 4.4.x, GCC 4.5.x, GCC 4.6.x, GCC 4.7.x

ICC 10.1, ICC 11.0, ICC 11.1, ICC 12.0, ICC 12.1

LLVM 2.6, LLVM 2.7, LLVM 2.8, LLVM 2.9, LLVM 3.1

Phoenix, MVS, XLC, Open64, Testarossa

OpenMP, OpenCL, ATLAS

oprofile, Scalasca, Amplifier, likwid

scheduling, pass reordering, run-time adaptation

algorithm-level, program-level, function-level, Codelet, loop-level, threads, process

hardware counters, cache size, frequency, bandwidth, HDD size, memory size, cores, processors, threads

power consumption, execution time, reliability

Classification, predictive modeling

Optimal solutions

Systematization and unification of collective knowledge

(big data)

“crowd”

Collaborative Infrastructure and repository for continuous online learning

Result

Quick, non-reproducible hack? Ad-hoc heuristic?

Quick publication? Waste of expensive resources

and energy?

cTuning.org collaborative approach

Continuous systematization and unification of design and optimization of computer systems

Extrapolate collective knowledge to build faster and more power efficient self-tuning computer systems to boost innovation in science and technology!

• Preparing Collective Knowledge for release (~August 2015) Beta already available (BSD license): http://github.com/ctuning/ck

• Testing pilot live repository for code, data and interactive report sharing: http://cknowledge.org/repo

• Crowdsourcing experiments using spare mobile phones or cloud services: https://play.google.com/store/apps/details?id=com.collective_mind.node

• Preparing documentation and interactive demos (will take some time)

Current status

• Developing a common methodology with ACM on code/data sharing along with publications, and validation of experimental results (Artifact Evaluation at major compiler/architecture conferences including CGO / PPoPP)

• Raising more funds to continue this R&D

Our approach opened up many interesting R&D opportunities

Thank you for your attention!

@c_tuning @grigori_fursin

http://github.com/ctuning/ck http://cknowledge.org/repo

Get in touch: Grigori.Fursin@cTuning.org / anton@dividiti.com

Recent publications about CK concept and community activities:

• “Collective Mind: Towards practical and collaborative autotuning”, Journal of Scientific Programming 22 (4), 2014, http://hal.inria.fr/hal-01054763

• “Collective Mind, Part II: Towards Performance- and Cost-Aware Software Engineering as a Natural Science”, CPC 2015, London, UK, http://arxiv.org/abs/1506.06256

• “Community-driven reviewing and validation of publications”, TRUST’14@PLDI’14, Edinburgh, UK, http://arxiv.org/abs/1406.4020
