Collective Knowledge: python and scikit-learn based open research SDK
for collaborative data management and exchange
PyData, London
20 June 2015
Grigori Fursin, cTuning Foundation, France
Anton Lokhmotov, dividiti, UK
Grigori Fursin , Anton Lokhmotov “Python and scikit-learn based open research SDK for collaborative data management and exchange” 2
Outline
• Background
• Back to basics: major problems in computer engineering
• Machine learning for performance/energy optimization
• Collective Knowledge Infrastructure & Repository
• Organizing local code and data using Python wrappers + JSON
• Sharing all artifacts as reusable components
• Designing collaborative experiments from shared components
• Reproducing experiments
• Connecting predictive analytics
• Conclusions and future work
Interdisciplinary background (physics, electronics, ML)
1999-2004: PhD in computer science, University of Edinburgh, UK
  Prepared foundation for machine-learning based performance autotuning
2007-2010: Tenured research scientist at INRIA, France; adjunct professor at Paris South University, France
  Developed self-tuning compiler GCC combined with machine learning
2010-2011: Head of application optimization group at Intel Exascale Lab, France
  Application characterization and optimization for exascale systems via
2012-2014: Senior tenured research scientist, INRIA, France
  Collective Mind Project – open platform for sharing optimization knowledge
2014-now: Chief Scientist, non-profit cTuning foundation, France; CTO, dividiti, UK
  Collective Knowledge Project – python-based framework and repository for collaborative and reproducible experimentation in computer engineering combined with predictive analytics
Close collaboration with IBM, Intel, ARM, ARC, STMicroelectronics
Presented work and opinions are my own!
Back to 1993
Semiconductor neural element - base of neural accelerators and brain-inspired computers
Modeling and understanding brain functions
[Figure: neural element; labels 1, -1, θ - threshold]
Faced major problems during modeling:
• Too slow
• Too unreliable
• Too costly
• Too much data
Back to basics
[Diagram: from idea to result]
Researchers and developers do not necessarily care about details of underlying technology but simply want to:
• get result as fast as possible
• minimize all costs: power consumption, data/memory footprint, inaccuracies, price, size, faults …
• guarantee some constraints: power budget, real-time processing, bandwidth, QoS …
G. Fursin, A. Lokhmotov, et al. “Collective Mind, Part II: Towards Performance- and Cost-Aware Software Engineering as a Natural Science”, CPC’15, London, UK, available at arXiv
Back to basics: available solutions
[Diagram: from idea to result]
Choose “best” solution from all available choices:
• Algorithm
• Application
• Compilers
• Binary and libraries
• Architecture
• Run-time environment
• State of the system
• Data set
Service/application providers (HPC, supercomputers, mobile systems)
Hardware and software designers
20 years ago was relatively simple!
Back to basics: technological chaos
[Word cloud: GCC 4.1.x, GCC 4.2.x, GCC 4.3.x, GCC 4.4.x, GCC 4.5.x, GCC 4.6.x, GCC 4.7.x, ICC 10.1, ICC 11.0, ICC 11.1, ICC 12.0, ICC 12.1, LLVM 2.6, LLVM 2.7, LLVM 2.8, LLVM 2.9, LLVM 3.0, Phoenix, MVS, XLC, Open64, Jikes, Testarossa, OpenMP, MPI, HMPP, OpenCL, CUDA, gprof, prof, perf, oprofile, PAPI, TAU, Scalasca, VTune Amplifier, scheduling, algorithm-level, TBB, MKL, ATLAS, program-level, function-level, Codelet, loop-level, hardware counters, IPA, polyhedral transformations, LTO, threads, process, pass reordering, run-time adaptation, per phase reconfiguration, cache size, frequency, bandwidth, HDD size, TLB, ISA, memory size, processors, threads, power consumption, execution time, reliability]
Is your system optimal? No one knows ...
Fundamental problems:
1) Ever rising complexity of computer systems: too many design and optimization choices at ALL levels
2) It’s not only performance that matters: multiple user objectives vs choices, benefit vs optimization time
3) Complex relationship and interactions between software and hardware components
4) Too many ever changing tools with non-unified interfaces changing from version to version: technological chaos
5) No common methodology for performance/energy evaluation and benchmarking
Back to basics: ever rising complexity
GCC compiler (similar trends in LLVM)
• OpenCL/CUDA/OpenMP/MPI parameters
• CPU/GPU frequency
• number of threads
• algorithm accuracy/precision
…
Large, multi-dimensional design and optimization spaces
Back to basics: SW/HW autotuning
Program: image corner detection
Processor: ARM v7 (Cortex A15), 2.0GHz
Compiler: GCC for ARM v4.9.2
OS: Ubuntu 14.04.02 LTS
System: ODROID-XU3
Data set: MiDataSet #1, image, 600x450x8b PGM, 263KB
500 combinations of random flags -Ox -f(no-)FLAG
[Figure: results for the explored flag combinations, with the Pareto frontier marked; annotations:]
• GCC v4.9.2 -O3 == LLVM v3.4 -O3
• Cluster around -Os with “bad” flags
• Cluster around -O0 with “bad” flags
• Cluster around -O1, -O2 with “bad” flags
• ~20% improvement
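The Pareto frontier mentioned on this slide can be extracted with a simple dominance filter. The sketch below is only an illustration: it assumes two objectives to minimize (execution time and code size), which the extracted text does not confirm, and the sample measurements are invented rather than taken from the experiment.

```python
def pareto_frontier(points):
    """Keep the points that no other point dominates (both objectives are minimized)."""
    frontier = []
    for name, time_s, size_b in points:
        dominated = any(
            (t <= time_s and s <= size_b) and (t < time_s or s < size_b)
            for _, t, s in points
        )
        if not dominated:
            frontier.append((name, time_s, size_b))
    # Sort by execution time so the frontier reads from fastest to smallest.
    return sorted(frontier, key=lambda p: p[1])


# Hypothetical measurements (flags, execution time in seconds, code size in bytes).
SAMPLES = [
    ("-O0", 20.0, 150000),
    ("-O3", 10.0, 140000),
    ("-Os", 12.0, 100000),
    ("-O3 -funroll-all-loops", 9.5, 170000),
    ("-O1 -fno-if-conversion", 14.0, 145000),
]
FRONTIER = pareto_frontier(SAMPLES)
```

The quadratic scan is fine for a few hundred or thousand explored combinations; a sort-based filter would be the next step for larger spaces.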
How do realistic programs behave?
Compilers are tested with a limited set of (possibly non-representative) benchmarks
Continuously tuning 285 shared code and dataset combinations from 8 benchmarks including NAS, MiBench, SPEC2000, SPEC2006, Powerstone, UTDSP and SNU-RT using GRID 5000; Intel E5520, 2.26GHz; GCC 4.6.3; at least 5000 random combinations of flags
Continuously tuning (crowd-tuning) shared benchmarks and datasets using GRID5000, mobile phones, tablets, laptops, and other spare resources:
Collective Mind Node (Android Apps on Google Play): https://play.google.com/store/apps/details?id=com.collective_mind.node
Back to basics: cost of computation
Analysis of computation cost of my neural network kernel in the past 10 years
P1) Intel Core i5-2540M, 2.60GHz, 2 cores
P2) Qualcomm MSM7625A FFA, ARM Cortex A5, 1 GHz, 1 core
P3) Allwinner A20 (sun7i), ARM Cortex A7, 1.6GHz, Mali400 GPU, 2 core
P4) NVidia Quadro NVS 135M, 400MHz, 16 cores
D1) grayscale image 1, size=1536x1536
D2) grayscale image 2, size=1536x1536
O1) Windows 7 Pro SP1, cost~170 euros
O2) O1 with MinGW32
O3) OpenSuse 12.1, Kernel 3.1.10
O4) Android 4.1.2, Kernel 3.4.0
O5) Android 4.2.2, Kernel 3.3.0
T1) 7.2E10
T2) 9.6E9
T3) 2.4E9
T4) 1.0E9
W1) 32 bit processor mode
W2) 64 bit processor mode
S1) Dell Laptop Latitude E6320, Mem=8Gb, 52W, 1200 euro
S2) Samsung Mobile GT-S6312, Mem=0.8Gb, 5W, 200 euros
S3) Polaroid Tablet MID0927, Mem=1Gb, 13W, 100 euros
S4) Semiconductor neural network, 1.5 years development
X1) GCC 4.1.1, opt.flags~190, release date=2006
X2) GCC 4.4.1, opt.flags~270, release date=2009
X3) GCC 4.4.4, opt.flags~270, release date=2010
X4) GCC 4.6.3, opt.flags~320, release date=2012
X5) GCC 4.7.2, opt.flags~340, release date=2012
X6) GCC 4.8.3, opt.flags~350, release date=2014
X7) GCC 4.9.1, opt.flags~357, release date=2014
X8) LLVM 3.1, release date=2012
X9) LLVM 3.4.2, release date=2014
X10) Open64 5.0, release date=2011
X11) PathScale 2.3.1, release date=2006
X12) NVidia CUDA Toolkit 5.0, release date=2012
X13) Intel Composer XE 2011, cost = ~800euro
X14) Microsoft Visual Studio 2013
Y1) Performance (usually -O3)
Y2) Size (usually -Os)
Y3) -O3 -fmodulo-sched -funroll-all-loops
Y4) -O3 -funroll-all-loops
Y5) -O3 -fprefetch-loop-arrays
Y6) -O3 -fno-if-conversion
Y7) Auto-tuning with more than 6 flags (-fif-conversion)
Y8) Auto-tuning with more than 6 flags (-fno-if-conversion)
Back to basics: cost of computation
[Figure: plot of the configurations listed below; point labels 1-11, A-F, I, II and $; marked levels “Available resource: P1, one core” and “Available resource: P1, two cores”]
1) P1 O3 W2 X1 Y1 T2 D1
2) P1 O3 W2 X7 Y1 T2 D1
3) P1 O3 W2 X1 Y7 T2 D1
4) P1 O3 W2 X7 Y5 T2 D1
5) P1 O3 W2 X11 Y1 T2 D1
6) P1 O3 W2 X9 Y1 T2 D1
7) P1 O3 W2 X3 Y7 T2 D1
8) P1 O3 W2 X4 Y8 T2 D2
9) P1 O1 W1 X14 Y1 T3 D1
10) P1 O1 W1 X13 Y1 T2 D1
11) P1 O3 W2 X7 Y8 T2 D2
A) P3 O5 W1 X1 Y1 T4 D1
B) P3 O5 W1 X4 Y1 T4 D1
C) P3 O5 W1 X4 Y7 T4 D1
D) P3 O5 W1 X6 Y1 T4 D1
E) P3 O5 W1 X6 Y7 T4 D1
F) P3 O5 W1 X9 Y1 T4 D1
I) P2 O4 W1 X1 Y1 T4 D1
II) P2 O4 W1 X6 Y1 T4 D1
$) P4 O3 W1 X12 Y1 T1 D1
Can plot similar graphs with consumed energy, price, frequency, faults or anything else depending on user needs
Most of the time underperforming systems!
Waste of expensive resources and time!
Getting worse, not better!
What should we do?
Why not use machine learning to predict optimizations?
Combine interdisciplinary knowledge in physics, electronics, mathematics, neural networks and machine learning
[Diagram: user task, from idea to result, described by requirements ( r ), properties ( p ), system/task state ( s ) and behavior / characteristics ( b )]
• Consider tasks and computational resources as a complex physical system
• Gradually expose all available algorithm, design and optimization choices
• Expose additional information
• Continuously observe behavior (characteristics); check for normality
• Continuously learn (model) observed behavior
• Predict optimal choices / behavior if enough knowledge
• If unexpected behavior, continuously improve models (active learning), increase granularity, find more properties
Combine autotuning with machine learning and crowdsourcing
cTuning.org: plugin-based auto-tuning framework and public repository
Training (program or kernel 1 … program or kernel N):
• Plugin-based MILEPOST GCC; plugins monitor and explore optimization space and extract semantic program features
• Collect dynamic features
• Cluster
• Build predictive model
Prediction (unseen program):
• MILEPOST GCC; plugins extract semantic program features
• Collect hardware counters
• Predict optimization to minimize execution time, power consumption, code size, etc
• G. Fursin et al. MILEPOST GCC: Machine learning based self-tuning compiler. 2008, 2011
• G. Fursin and O. Temam. Collective optimization: A practical collaborative approach. 2010
• G. Fursin. Collective Tuning Initiative: automating and accelerating development and optimization of computing systems, 2009
• F. Agakov et al. Using Machine Learning to Focus Iterative Optimization, 2006
Combine autotuning with machine learning and crowdsourcing
In 2009, we opened a public repository of knowledge (cTuning.org) and managed to automatically tune customer benchmarks and compiler heuristics for a range of real platforms from IBM and ARC (Synopsys)
Now it has become a hot topic. Is everything solved?
Technological chaos
[Word cloud: the tools and parameters from “Back to basics: technological chaos”, extended with MVS 2013, CUDA 4.x, CUDA 5.x, GCC 4.8.x, LLVM 3.4, predictive scheduling, KNN, SVM, genetic algorithms, ARM v6, ARM v8, Intel SandyBridge, SSE4, AVX, SimpleScalar, algorithm precision]
We also experienced a few more problems
• Everything changes all the time
• Difficult to reproduce results collected from multiple users (including variability of performance data and constant changes in the system)
• Difficult to expose choices, observe behavior and extract features (tools are not prepared for auto-tuning and machine learning)
• Difficult to share experimental setups (many SW/HW dependencies) including code, data and their features
• Difficult to save heterogeneous and continuously changing data in MySQL
It’s not about machine learning – it’s about effective data and knowledge management
Collective Knowledge Project: how to keep track of all past R&D?
[Word cloud: the same tools and parameters as in “Technological chaos”, now joined by the shared artifacts grouped below]
Gradually cleaning up the mess
Group: programs (image corner detection, matmul CUDA, compression, neural network OpenCL)
• Have some common functions: compile, run, etc …
• Have some common meta: which datasets can use, how to compile, CMD, …
Group: data sets (image-jpeg-0001, bzip2-0006, txt-0012, video-raw-1280x1024)
• Have some common functions: find, extract features
• Have some (common) meta: filename, size, width, height, colors, …
Group: packages (GCC 5.0.1 bin, GCC 5.0.1 source, LLVM 3.6, gmp 5.0.5, mpfr 3.1.0, lapack 2.3.0, java apache commons codec 1.7)
• Have some common functions: install, check dependencies
• Have some (common) meta: dependencies, installation scripts, …
Saviour – python, extensible json, and local repository
Gradually cleaning up the mess
Each group gets a Python wrapper and a meta.json; entries are addressed by UID or alias (UOA):
• Python wrapper: program; functions: compile, run; entries: image corner detection, matmul CUDA, compression, neural network OpenCL
• Python wrapper: dataset; functions: extract_features; entries: image-jpeg-0001, bzip2-0006, txt-0012, video-raw-1280x1024
• Python wrapper: package; functions: install; entries: GCC 5.0.1 bin, GCC 5.0.1 source, LLVM 3.6, gmp 5.0.5, mpfr 3.1.0, lapack 2.3.0, java apache commons codec 1.7
Saviour – python, extensible json, and local repository
[Diagram: dependencies between data and modules; each entry has a JSON meta-description plus files and directories]
• CK module compiler: GCC 4.4.4, GCC 4.9.2, LLVM 3.1, LLVM 3.4, … (meta: compiler flags)
• CK module package: GCC 5.0.1 bin, GCC 5.0.1 source, LLVM 3.6, gmp 5.0.5, mpfr 3.1.0, lapack 2.3.0, java apache commons codec 1.7, … (meta: installation info)
• CK module dataset: image-jpeg-0001, bzip2-0006, video-raw-1280x1024, … (meta: features)
• CK module module: compiler, package, dataset, … (meta: actions)
cM repository directory structure:
.cmr / module UOA / data UOA (UID or alias) / .cm / data.json
Data can always be found via CID (similar to DOI):
(repo UOA):module UOA : data UOA
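A minimal sketch of how such a CID could be split and mapped onto the directory structure shown above. This is not the CK implementation: the function names, the dictionary keys and the choice to place `.cmr` directly under the repository directory are assumptions made for illustration.

```python
import os


def parse_cid(cid):
    """Split '(repo UOA:)module UOA:data UOA' into its parts (repo is optional)."""
    parts = [p.strip() for p in cid.split(":")]
    if len(parts) == 2:
        repo = None
        module, data = parts
    elif len(parts) == 3:
        repo, module, data = parts
    else:
        raise ValueError("expected [repo:]module:data, got %r" % cid)
    return {"repo_uoa": repo, "module_uoa": module, "data_uoa": data}


def entry_meta_path(repo_dir, cid):
    """Build the path of an entry's data.json following the layout on the slide."""
    parsed = parse_cid(cid)
    return os.path.join(
        repo_dir, ".cmr", parsed["module_uoa"], parsed["data_uoa"], ".cm", "data.json"
    )


# Hypothetical repository directory name; the CID comes from the slides.
PARSED = parse_cid("dataset:image-jpeg-0001")
PATH = entry_meta_path("my-repo", "dataset:image-jpeg-0001")
```

Because an entry is located purely by names, the same CID resolves on any machine that has the repository checked out.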
Connecting it all together – CK python-based interface
http://github.com/ctuning/ck
Tiny CK infrastructure (~200Kb), permissive BSD-license (preparing package for DEBIAN and PIP)
Perform some action on an entry from CMD:
  ck [action] [module_uoa] [CID] @input.json
For example:
  ck list program
  ck find dataset:image-jpeg-0001
  ck compile program:slambench-1.1-opencl
From python/ipython notebook:
  import ck.kernel as ck
  r = ck.access({'action': 'compile', 'cid': 'slambench-1.1-opencl'})
  if r['return'] > 0:
      print(r['error'])
      exit(1)
As a web-service with simple JSON-based API:
  ck start web
  firefox http://localhost:3344/?action=load&cid=dataset:image-jpeg-0001
  (returns meta in JSON or HTML)
Can perform P2P information exchange
Input to all functions: schema-free dict/JSON – extended when needed and abstracted by module
Fixed keys: action, module_uoa, CID
Output from all functions: dict/JSON
Fixed keys: return, error
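The dict-in, dict-out convention can be illustrated with a toy dispatcher. This is a sketch of the calling convention only, not the code of `ck.kernel.access`; the in-memory `REPO`, the `ACTIONS` table and the `list` key in the output are hypothetical.

```python
# Toy in-memory repository: module UOA -> {data UOA: meta}.
REPO = {"dataset": {"image-jpeg-0001": {}, "bzip2-0006": {}}}


def list_entries(i):
    """Toy action: list the entries of one module."""
    entries = REPO.get(i.get("module_uoa"))
    if entries is None:
        return {"return": 1, "error": "module not found: %s" % i.get("module_uoa")}
    return {"return": 0, "list": sorted(entries)}


ACTIONS = {"list": list_entries}


def access(i):
    """Dispatch on the fixed 'action' key; always return a dict with 'return'."""
    action = i.get("action")
    if action not in ACTIONS:
        return {"return": 1, "error": "unknown action: %s" % action}
    return ACTIONS[action](i)


OK = access({"action": "list", "module_uoa": "dataset"})
BAD = access({"action": "compile", "module_uoa": "program"})
```

Callers only ever check `return` and read `error`, so new keys can be added to the input or output without breaking existing code, which is the point of keeping the dict schema-free.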
Collective Knowledge concept
Hardwired experimental setups, very difficult to change, scale or share:
• Tool A V1, Tool A V2, … Tool A VN
• Tool B V1, Tool B V2, … Tool B VM
• Ad-hoc tuning scripts
• Ad-hoc analysis and learning scripts
• Collection of CSV, XLS, TXT and other files
• Experiments
Meta description that should be exposed in the information flow for auto-tuning and machine learning:
• Behavior
• Choices
• Features
• State
Collective Knowledge concept
CK module (wrapper) with unified and formalized input and output
[Diagram: CK wrapper around Tool B Vi]
• Input: unified JSON input (meta-data), if it exists, plus the original unmodified ad-hoc input
• Action function: set environment for a given tool version, run the process (CMD) for Tool B Vi, collect generated files, parse and unify output
• Output: unified JSON output (meta-data)
• Exposed meta: behavior, choices, features, state
Formalized function (model) of a component behavior: b = B( c , f , s )
Flattened JSON vectors (either string categories or integer/float values)
Multiple tool versions can co-exist, while their interface is abstracted by CK module
Wrapped components: Tool A V1, V2, … VN; Tool B V1, V2, … VM; ad-hoc tuning scripts; ad-hoc analysis and learning scripts; collection of CSV, XLS, TXT and other files; experiments
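The wrapper idea above can be sketched as a function that sets the environment, runs an unmodified command-line tool and turns its ad-hoc text output into a unified dict. Everything here is an assumption for illustration: the `cmd` and `env` input keys, the `key=value` output format and the fake tool (a Python one-liner standing in for "Tool B Vi") are not part of CK.

```python
import os
import subprocess
import sys


def run_tool(i):
    """Wrap a command-line tool: unified dict in, unified dict out."""
    cmd = i.get("cmd")
    if not cmd:
        return {"return": 1, "error": "'cmd' is not defined"}
    # Set the environment for the requested tool version.
    env = dict(os.environ)
    env.update(i.get("env", {}))
    proc = subprocess.run(cmd, env=env, capture_output=True, text=True)
    if proc.returncode != 0:
        return {"return": 1, "error": proc.stderr.strip() or "tool failed"}
    # Parse and unify the ad-hoc "key=value" output.
    characteristics = {}
    for line in proc.stdout.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            characteristics[key.strip()] = value.strip()
    return {"return": 0, "characteristics": characteristics}


# Stand-in for a real tool: prints its version and one measurement.
FAKE_TOOL = [
    sys.executable,
    "-c",
    "import os; print('tool_version=' + os.environ['TOOL_VERSION']); "
    "print('execution_time=10.3')",
]
RESULT = run_tool({"cmd": FAKE_TOOL, "env": {"TOOL_VERSION": "v2"}})
```

Switching tool versions then only changes the `env` part of the input, while callers keep reading the same output keys.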
Collective Knowledge concept
CK module (wrapper) with unified and formalized input and output
  ck run pipeline:program --speed --energy --dataset_uoa=image_1024_768 --record --record_uoa=test123
  ck add experiment:test123
  ck replay experiment:test123
Runs on any HW, SW and OS (Android, Linux, Windows, MacOS …)
Assembling, preserving, sharing and extending experimental pipeline as “LEGO”
Chaining CK components (wrappers) to an experimental pipeline for a given research and experimentation scenario
Each component is a CK module (wrapper) with unified and formalized input and output: b = B( c , f , s )
Pipeline stages:
• Choose exploration strategy
• Generate choices (code sample, data set, compiler, flags, architecture …)
• Compile source code
• Run code
• Test behavior normality
• Pareto filter
• Modeling and prediction
• Complexity reduction
Public modular auto-tuning and machine learning repository and buildbot
Unified web services
Interdisciplinary crowd
Shared scenarios from past research
…
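Chaining components into a pipeline can be sketched as a list of functions that each take and return the same state dict. The stage names loosely follow the stages listed on this slide, but the code is an assumption-laden toy: the random flag generator follows the -Ox -f(no-)FLAG pattern from the autotuning slide, and the "run" stage uses an invented cost formula instead of compiling and measuring anything.

```python
import random


def generate_choices(state):
    """Pick a random base level and a random -f / -fno- variant of each flag."""
    rng = random.Random(state["seed"])
    base = rng.choice(["-O1", "-O2", "-O3", "-Os"])
    flags = ["-f" + ("" if rng.random() < 0.5 else "no-") + f for f in state["flags"]]
    state["choices"] = {"compiler_flags": " ".join([base] + flags)}
    return state


def run_code(state):
    """Placeholder for compile + run: an invented cost model, not a measurement."""
    flags = state["choices"]["compiler_flags"]
    state["characteristics"] = {"execution_time": 10.0 + 0.1 * len(flags)}
    return state


def check_behavior(state):
    """Trivial sanity check standing in for a real normality test."""
    state["characteristics"]["normal"] = state["characteristics"]["execution_time"] > 0
    return state


def run_pipeline(stages, state):
    """Run the stages in order, passing the unified state dict along."""
    for stage in stages:
        state = stage(state)
    return state


PIPELINE = [generate_choices, run_code, check_behavior]
RESULT = run_pipeline(PIPELINE, {"seed": 0, "flags": ["unroll-all-loops", "if-conversion"]})
```

Since every stage shares one input and output shape, stages can be reordered, replaced or inserted without touching the others.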
Gradually adding specification (agile development)
cTuning experiment module data.json
{
 "characteristics":{
 "execution_times": ["10.3","10.1","13.3"],
 "code size": "131938", ...},
 "choices":{
 "os":"linux", "os version":"2.6.32-5-amd64",
 "compiler":"gcc", "compiler version":"4.6.3",
 "compiler_flags":"-O3 -fno-if-conversion",
 "platform":{"processor":"intel xeon e5520",
 "l2":"8192", ...}, ...},
 "features":{
 "semantic features": {"number_of_bb": "24", ...},
 "hardware counters": {"cpi": "1.4", ...}, ... },
 "state":{
 "frequency":"2.27", ...}
}
cM flattened JSON key
##characteristics#execution_times@1
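One way to produce keys of this shape is a short recursive walk. The sketch below infers the rule from the single example key on the slide (a leading "#", then "#key" for dictionary members and "@index" for list elements); the actual cM/CK flattening code may differ in details.

```python
def flatten(node, prefix="#"):
    """Flatten nested dicts and lists into {flat_key: leaf_value}."""
    flat = {}
    if isinstance(node, dict):
        for key, value in node.items():
            flat.update(flatten(value, prefix + "#" + key))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            flat.update(flatten(value, prefix + "@" + str(index)))
    else:
        # Leaf: a string category or an integer/float value.
        flat[prefix] = node
    return flat


# Subset of the data.json shown above.
DATA = {
    "characteristics": {"execution_times": ["10.3", "10.1", "13.3"]},
    "state": {"frequency": "2.27"},
}
FLAT = flatten(DATA)
```

The flat dictionary can be turned directly into a feature vector, which is what makes the JSON usable by statistical and machine learning tools.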
Specification for each flattened JSON key:
"flattened_json_key":{
 "type": "text" | "integer" | "float" | "dict" | "list" | "uid",
 "characteristic": "yes" | "no",
 "feature": "yes" | "no",
 "state": "yes" | "no",
 "has_choice": "yes" | "no",
 "choices": [ list of strings if categorical choice ],
 "explore_start": "start number if numerical range",
 "explore_stop": "stop number if numerical range",
 "explore_step": "step if numerical range",
 "can_be_omitted" : "yes" | "no"
 ...
}
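Such a specification is enough to enumerate an exploration space automatically. The sketch below is hypothetical: it assumes that numerical ranges include the stop value and that numbers are stored as strings, and the example keys `##choices#compiler_flags` and `##choices#threads` are made up for illustration.

```python
def expand_choices(spec):
    """List the values to explore for one flattened key, following its spec."""
    if spec.get("has_choice") != "yes":
        return []
    if spec.get("choices"):
        # Categorical choice: use the listed strings as they are.
        return list(spec["choices"])
    cast = int if spec.get("type") == "integer" else float
    start, stop, step = (
        cast(spec[k]) for k in ("explore_start", "explore_stop", "explore_step")
    )
    if step <= 0:
        raise ValueError("explore_step must be positive")
    values, n = [], 0
    while start + n * step <= stop:  # assumption: the stop value is inclusive
        values.append(start + n * step)
        n += 1
    return values


SPEC = {
    "##choices#compiler_flags": {
        "type": "text", "has_choice": "yes", "choices": ["-O2", "-O3", "-Os"],
    },
    "##choices#threads": {
        "type": "integer", "has_choice": "yes",
        "explore_start": "1", "explore_stop": "8", "explore_step": "2",
    },
    "##characteristics#execution_times@1": {
        "type": "float", "characteristic": "yes", "has_choice": "no",
    },
}
SPACE = {key: expand_choices(spec) for key, spec in SPEC.items()}
```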
Enabling open collaboration and code/data sharing as reusable components with CK wrappers
Organizing local and shared (public or private) repos
User A and User B each hold local copies of CK repos:
• ck-analytics
• ck-env
• ck-autotuning
• ctuning-programs
• ctuning-datasets-min
• private-experiments
Repos are exchanged via:
• Public GIT (github; bitbucket): see our CK repositories at github.com/ctuning
• Semi-private GIT
• Private GIT
Can be used in companies via private repos, while supporting common experimental methodology (reporting performance issues, sharing code samples and data sets)
Get new repo simply as: ck pull repo:ck-analytics
Tiny CK core ~100Kb, Python + JSON API: github.com/ctuning/ck
Optional JSON-based ElasticSearch indexing
CK web interface (with JSON API): see cknowledge.org/repo
All interconnected and reusable artifacts (code & data), experiments, interactive graphs, predictive models, papers, reports, …
Apply top-down experimental methodology similar to physics
For each level, gradually expose some characteristics and gradually expose some choices:
• Algorithm selection. Characteristics: (time) productivity, variable-accuracy, complexity … Choices: language, MPI, OpenMP, TBB, MapReduce …
• Compile program. Characteristics: time … Choices: compiler flags; pragmas …
• Code analysis & transformations (process, thread, function, codelet, loop, instruction). Characteristics: time; memory usage; code size … Choices: transformation ordering; polyhedral transformations; transformation parameters; instruction ordering …
• Run code, run-time environment. Characteristics: time; power consumption … Choices: pinning/scheduling …
• System. Characteristics: cost; size … Choices: CPU/GPU; frequency; memory hierarchy …
• Data set. Characteristics: size; values; description … Choices: precision …
• Run-time analysis. Characteristics: time; precision … Choices: hardware counters; power meters …
• Run-time state. Characteristics: processor state; cache state … Choices: helper threads; hardware counters …
• Analyze profile. Characteristics: time; size … Choices: instrumentation; profiling …
Coarse-grain vs. fine-grain effects: depends on user requirements and expected ROI
Assemble pipelines from shared components
• Init pipeline
• Detected system information
• Initialize parameters
• Prepare dataset
• Clean program
• Prepare compiler flags
• Use compiler profiling
• Use cTuning CC/MILEPOST GCC for fine-grain program analysis and tuning
• Use universal Alchemist plugin (with any OpenME-compatible compiler or tool)
• Use Alchemist plugin (currently for GCC)
• Compile program
• Get objdump and md5sum (if supported)
• Use OpenME for fine-grain program analysis and online tuning (build & run)
• Use 'Intel VTune Amplifier' to collect hardware counters
• Use 'perf' to collect hardware counters
• Set frequency (in Unix, if supported)
• Get system state before execution
• Run program
• Check output for correctness (use dataset UID to save different outputs)
• Finish OpenME
• Misc info
• Observed characteristics
• Observed statistical characteristics
• Finalize pipeline
We can easily assemble, extend and customize research, design and experimentation pipelines for company needs!
We gradually unify and clean up ad-hoc setups!
http://cknowledge.org/repo
• Hundreds of benchmarks/kernels/codelets (CPU, OpenMP, OpenCL, CUDA)
• Thousands of data sets
• Description of major compilers: GCC 4.x, GCC 5.x, LLVM 3.x, ICC 12.x
Adaptive workload scheduling combined with active learning
Our user had a real-time, machine-learning based image processing application running on mobile devices with GPUs. Should it always be offloaded to the GPU?
Application: OpenCL based real time video stream processing for mobile devices
Devices:
• Chromebook 1: 4x Mali-T60x / 2x A15
• Chromebook 2: 4x Mali-T62x / 4x A15
Experiments: 276 builds/runs with random features
Original features (properties): V1=GWS0, V2=GWS1, V3=GWS2, V4=cpu_freq, V5=gpu_freq, V6=block size, V7=image cols, V8=image rows
Designed features: V9=image size, V10=size_div_by_cpu_freq, V11=size_div_by_gpu_freq, V12=cpu_freq_div_by_gpu, V13=size_div_by_cpu_div_by_gpu_freq, V14=image_size_div_by_cpu_freq
Characteristics:
• CPU execution time
• GPU ONLY execution time
• GPU + MEM COPY execution time
Objective (divide execution time): CPU/GPU COPY > 1.07 (true/false)? (useful for adaptive scheduling)
  ck build model.sklearn
  ck validate module.sklearn
(operates with ‘features’ and ‘characteristics’ keys in JSON)
EU FP7 TETRACOM project: cTuning and ARM
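The designed features and the true/false objective can be derived from the original measurements with a few lines of code. This sketch rests on assumptions the slide does not spell out: "image size" is taken as columns times rows, "size" in V10, V11 and V13 is read as that same image size (which makes V14 coincide with V10, so it is not computed separately), and the objective is read as CPU time divided by GPU + MEM COPY time. The sample values are invented.

```python
def design_features(v):
    """Derive the designed features from the original ones (assumed definitions)."""
    size = v["image_cols"] * v["image_rows"]
    return {
        "image_size": size,
        "size_div_by_cpu_freq": size / v["cpu_freq"],
        "size_div_by_gpu_freq": size / v["gpu_freq"],
        "cpu_freq_div_by_gpu": v["cpu_freq"] / v["gpu_freq"],
        "size_div_by_cpu_div_by_gpu_freq": size / v["cpu_freq"] / v["gpu_freq"],
    }


def objective_label(cpu_time, gpu_copy_time, threshold=1.07):
    """True when the CPU takes more than `threshold` times the GPU + copy time."""
    return cpu_time / gpu_copy_time > threshold


# Hypothetical run: a 640x480 frame, frequencies in MHz.
ORIGINAL = {"image_cols": 640, "image_rows": 480, "cpu_freq": 1600, "gpu_freq": 400}
DESIGNED = design_features(ORIGINAL)
LABEL = objective_label(cpu_time=12.0, gpu_copy_time=10.0)
```

Rows of original plus designed features, paired with this label, are the kind of table a decision-tree learner such as the one in scikit-learn could be trained on.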
Adaptive workload scheduling combined with active learning
Samsung Chromebook1
[Figure: decision tree]
Automatically built decision tree with scikit-learn when more data is available.
96% prediction rate
Not a black box - gives hints to engineers where to focus their attention.
Can drive further exploration on areas with “unusual” behavior.
EU FP7 TETRACOM project: cTuning and ARM
Samsung Chromebook2
Using old model: 74% prediction rate
Adaptive workload scheduling combined with active learning
EU FP7 TETRACOM project: cTuning and ARM
Samsung Chromebook2
More data, new model: 96% prediction rate
Adaptive scheduling gives ~32% performance improvement in comparison with always using GPU
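The comparison with always using the GPU can be sketched as follows. The timings are hypothetical and do not reproduce the ~32% figure, and "improvement" is taken here as the relative reduction in total time, which is only one possible definition. The adaptive rule below looks at measured times for simplicity; a real scheduler would call a model that predicts the choice from features.

```python
def total_time(runs, choose_gpu):
    """Total time when each run goes to the GPU (with copy) or stays on the CPU."""
    return sum(r["gpu_copy_time"] if choose_gpu(r) else r["cpu_time"] for r in runs)


# Hypothetical per-frame timings in seconds.
runs = [
    {"cpu_time": 1.0, "gpu_copy_time": 2.0},
    {"cpu_time": 4.0, "gpu_copy_time": 2.0},
    {"cpu_time": 1.5, "gpu_copy_time": 3.0},
]

always_gpu = total_time(runs, lambda r: True)
# Stand-in for a trained model: offload only when CPU/GPU COPY > 1.07.
adaptive = total_time(runs, lambda r: r["cpu_time"] / r["gpu_copy_time"] > 1.07)

improvement = (always_gpu - adaptive) / always_gpu
print(always_gpu, adaptive, round(improvement, 3))
```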
Adaptive workload scheduling combined with active learning
Results shared with the community for reproducibility:
cknowledge.org/repo/web.php?wcid=bc0409fb61f0aa82:fd54cd4b3b73b72b
cknowledge.org/repo/web.php?wcid=bc0409fb61f0aa82:3bfd697a48fbba16
Reproducibility comes as a side effect!
• Can preserve the whole experimental setup with all data and software dependencies
• Can perform statistical analysis for characteristics
• Community can add missing features or improve machine learning models
Variation of experimental results: 10.5 ± 6.5 secs.
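A small sketch of the kind of statistical summary that exposes such variation. The slide does not say what the ± denotes, so the sample standard deviation is used here as an assumption. The timings and the `max_relative_spread` cut-off are hypothetical.

```python
import statistics


def summarize(times, max_relative_spread=0.1):
    """Summarize repeated measurements and flag unusually large variation."""
    mean = statistics.mean(times)
    spread = statistics.stdev(times)  # sample standard deviation
    return {
        "mean": mean,
        "stdev": spread,
        "min": min(times),
        "max": max(times),
        "needs_attention": spread / mean > max_relative_spread,
    }


# Hypothetical execution times in seconds for repeated runs of one experiment.
times = [5.0, 5.2, 4.8, 15.0, 15.2, 14.8]
summary = summarize(times)
print("{mean:.1f} +/- {stdev:.1f} secs.".format(**summary))
print("needs attention:", summary["needs_attention"])
```

A spread that is large relative to the mean is a signal to look at the full distribution rather than report a single average.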
Reproducibility of experimental results as a side effect
Figure: distribution of execution time (sec.)
Unexpected behavior: expose it to the community, including experts, to explain it, find the missing feature and add it to the system
Figure: distribution of execution time (sec.) with two classes, Class A and Class B; CPU frequency: 800MHz and 2400MHz
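Once a missing feature such as CPU frequency is recorded, the same measurements can be grouped by it. A minimal sketch with hypothetical records follows; which class corresponds to which frequency is an assumption made for the sample data.

```python
import statistics
from collections import defaultdict


def summarize_by(records, feature, characteristic="execution_time"):
    """Group a characteristic by one feature value; return {value: (mean, stdev)}."""
    groups = defaultdict(list)
    for r in records:
        groups[r["features"][feature]].append(r["characteristics"][characteristic])
    return {value: (statistics.mean(v), statistics.pstdev(v))
            for value, v in groups.items()}


# Hypothetical records: the same program measured at two CPU frequencies (MHz).
records = [
    {"features": {"cpu_freq": 800}, "characteristics": {"execution_time": t}}
    for t in (15.0, 15.2, 14.8)
] + [
    {"features": {"cpu_freq": 2400}, "characteristics": {"execution_time": t}}
    for t in (5.0, 5.2, 4.8)
]

for freq, (mean, spread) in sorted(summarize_by(records, "cpu_freq").items()):
    print(freq, round(mean, 2), round(spread, 2))
```

With the feature in place, each group shows a small spread, and the large overall variation is explained rather than averaged away.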
Grigori Fursin “Systematizing tuning of computer systems using crowdsourcing and statistics” HPSC 2013, NTU, Taiwan March, 2013 41 / 73
Making computer engineering a data science
Current state of computer engineering
• Compilers: GCC 4.1.x, GCC 4.2.x, GCC 4.3.x, GCC 4.4.x, GCC 4.5.x, GCC 4.6.x, GCC 4.7.x, ICC 10.1, ICC 11.0, ICC 11.1, ICC 12.0, ICC 12.1, LLVM 2.6, LLVM 2.7, LLVM 2.8, LLVM 2.9, LLVM 3.1, Phoenix, MVS, XLC, Open64, Jikes, Testarossa
• Parallel programming models: OpenMP, MPI, HMPP, OpenCL, CUDA
• Profiling and analysis tools: gprof, prof, perf, oprofile, PAPI, TAU, Scalasca, VTune Amplifier, likwid
• Libraries: TBB, MKL, ATLAS
• Optimization scopes and techniques: scheduling, algorithm-level, program-level, function-level, codelet, loop-level, IPA, polyhedral transformations, LTO, threads, process, pass reordering, run-time adaptation, per phase reconfiguration
• Hardware: hardware counters, cache size, frequency, bandwidth, HDD size, TLB, ISA, memory size, cores, processors, threads
• Characteristics: power consumption, execution time, reliability
Classification, predictive modeling
Optimal solutions
Systematization and unification of collective knowledge (big data)
“crowd”
Collaborative Infrastructure and repository for continuous online learning
Task
Result
Quick, non-reproducible hack? Ad-hoc heuristic? Quick publication? Waste of expensive resources and energy?
cTuning.org collaborative approach
Continuous systematization and unification of design and optimization of computer systems
Extrapolate collective knowledge to build faster and more power efficient self-tuning computer systems to boost innovation in science and technology!
Current status
• Preparing Collective Knowledge for release (~August 2015); beta already available (BSD license): http://github.com/ctuning/ck
• Testing pilot live repository for code, data and interactive report sharing: http://cknowledge.org/repo
• Crowdsourcing experiments using spare mobile phones or cloud services: https://play.google.com/store/apps/details?id=com.collective_mind.node
• Preparing documentation and interactive demos (will take some time)
• Developing common methodology with ACM on code/data sharing along with publications, and validation of experimental results (Artifact Evaluation at major compiler/architecture conferences including CGO / PPoPP)
• Raising more funds to continue this R&D
Our approach opened up many interesting R&D opportunities
Thank you for your attention!
Follow us: @c_tuning, @grigori_fursin
http://github.com/ctuning/ck
http://cknowledge.org/repo
Get in touch: Grigori.Fursin@cTuning.org / anton@dividiti.com
Recent publications about CK concept and community activities:
• “Collective Mind: Towards practical and collaborative autotuning”, Journal of Scientific Programming 22 (4), 2014, http://hal.inria.fr/hal-01054763
• “Collective Mind, Part II: Towards Performance- and Cost-Aware Software Engineering as a Natural Science”, CPC 2015, London, UK, http://arxiv.org/abs/1506.06256
• “Community-driven reviewing and validation of publications”, TRUST’14@PLDI’14, Edinburgh, UK, http://arxiv.org/abs/1406.4020