49
Daniel Gillblad, [email protected] Swedish Institute of Computer Science BIG DATA AND ANALYTICS

BIG DATA AND ANALYTICS - Sveriges Riksbankarchive.riksbank.se/Documents/Forskning/Konferenser... · BIG DATA AND ANALYTICS. OUR RESEARCH: SURVIVING THE DATA FLOOD. SOME QUESTIONS

  • Upload
    others

  • View
    9

  • Download
    1

Embed Size (px)

Citation preview

Page 1: BIG DATA AND ANALYTICS - Sveriges Riksbankarchive.riksbank.se/Documents/Forskning/Konferenser... · BIG DATA AND ANALYTICS. OUR RESEARCH: SURVIVING THE DATA FLOOD. SOME QUESTIONS

Daniel Gillblad, [email protected] Institute of Computer Science

BIG DATA AND ANALYTICS

Page 2: BIG DATA AND ANALYTICS - Sveriges Riksbankarchive.riksbank.se/Documents/Forskning/Konferenser... · BIG DATA AND ANALYTICS. OUR RESEARCH: SURVIVING THE DATA FLOOD. SOME QUESTIONS

www.sics.se

OUR RESEARCH: SURVIVING THE DATA FLOOD

Page 3: BIG DATA AND ANALYTICS - Sveriges Riksbankarchive.riksbank.se/Documents/Forskning/Konferenser... · BIG DATA AND ANALYTICS. OUR RESEARCH: SURVIVING THE DATA FLOOD. SOME QUESTIONS

www.sics.se

SOME QUESTIONS AROUND BIG DATA ANALYTICS1. What is Big Data?

2. What is different this time?

3. Why is it important?

4. What does the future look like?

Page 4: BIG DATA AND ANALYTICS - Sveriges Riksbankarchive.riksbank.se/Documents/Forskning/Konferenser... · BIG DATA AND ANALYTICS. OUR RESEARCH: SURVIVING THE DATA FLOOD. SOME QUESTIONS

www.sics.se

BIG DATA – 3V, AND MORE?

Page 5: BIG DATA AND ANALYTICS - Sveriges Riksbankarchive.riksbank.se/Documents/Forskning/Konferenser... · BIG DATA AND ANALYTICS. OUR RESEARCH: SURVIVING THE DATA FLOOD. SOME QUESTIONS

www.sics.se

BIG DATA IS PARALLELIZATION

Read on 10 000 machines:10 000 times faster

Page 6: BIG DATA AND ANALYTICS - Sveriges Riksbankarchive.riksbank.se/Documents/Forskning/Konferenser... · BIG DATA AND ANALYTICS. OUR RESEARCH: SURVIVING THE DATA FLOOD. SOME QUESTIONS

www.sics.se

BIG DATA IS SCALABILITYON COMMODITY HARDWARE

LOTS OF COMMODITY HARDWARE……BIG DATA IS FAULT TOLERANCE

Page 7: BIG DATA AND ANALYTICS - Sveriges Riksbankarchive.riksbank.se/Documents/Forskning/Konferenser... · BIG DATA AND ANALYTICS. OUR RESEARCH: SURVIVING THE DATA FLOOD. SOME QUESTIONS

www.sics.se

BIG DATA IS PLATFORMS AND MIDDLEWARE

Page 8: BIG DATA AND ANALYTICS - Sveriges Riksbankarchive.riksbank.se/Documents/Forskning/Konferenser... · BIG DATA AND ANALYTICS. OUR RESEARCH: SURVIVING THE DATA FLOOD. SOME QUESTIONS

www.sics.se

FIRST, THERE WAS MAP REDUCE

• Distribute data over commodity hardware (HDFS etc.) in data center

• Map and distribute computation to computers storing the data

• Choose fix, batch-oriented form of calculation: Map -> Reduce

Page 9: BIG DATA AND ANALYTICS - Sveriges Riksbankarchive.riksbank.se/Documents/Forskning/Konferenser... · BIG DATA AND ANALYTICS. OUR RESEARCH: SURVIVING THE DATA FLOOD. SOME QUESTIONS

www.sics.se

MAP REDUCE, EXAMPLE• Word count: “read -> map -> reduce -> write”

...public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {

private final static IntWritable one = new IntWritable(1);private Text word = new Text();

public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {String line = value.toString();StringTokenizer tokenizer = new StringTokenizer(line);while (tokenizer.hasMoreTokens()) {word.set(tokenizer.nextToken());output.collect(word, one);

}}

}

public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter)throws IOException {

int sum = 0;while (values.hasNext()) {sum += values.next().get();

}output.collect(key, new IntWritable(sum));

}}...

Page 10: BIG DATA AND ANALYTICS - Sveriges Riksbankarchive.riksbank.se/Documents/Forskning/Konferenser... · BIG DATA AND ANALYTICS. OUR RESEARCH: SURVIVING THE DATA FLOOD. SOME QUESTIONS

www.sics.se

NEXT-GEN MAP REDUCE

• Handle real-time, interactive and iterative tasks: Necessary for machine learning etc.

• Resilient distributed data sets – in-memory caching etc.

1. Transformations: Map, filter, join…2. Actions: Count, collect, save…• Faster, easier, more powerful

file = spark.textFile("hdfs://...")

file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

Page 11: BIG DATA AND ANALYTICS - Sveriges Riksbankarchive.riksbank.se/Documents/Forskning/Konferenser... · BIG DATA AND ANALYTICS. OUR RESEARCH: SURVIVING THE DATA FLOOD. SOME QUESTIONS

www.sics.se

SPARK, REAL-WORLD EXAMPLE

• Incomprehensible example:logs.map(log => (log.data, log)).groupByKey()

.filter(_._2.size >= minimumValue)

.map(xLogsPair => new Aggregate(Pair(xLogsPair._1, logsExpanded(xLogsPair._2))))

• Defines both what/how and hints on how to parallelize

Page 12: BIG DATA AND ANALYTICS - Sveriges Riksbankarchive.riksbank.se/Documents/Forskning/Konferenser... · BIG DATA AND ANALYTICS. OUR RESEARCH: SURVIVING THE DATA FLOOD. SOME QUESTIONS

www.sics.se

IS THAT IT?

• Still not that easy to write – special kind of developers

• Fantastic productivity in the right hands• Developer support limited• Parallelization is difficult• Pretty difficult to write advanced analytics

Page 13: BIG DATA AND ANALYTICS - Sveriges Riksbankarchive.riksbank.se/Documents/Forskning/Konferenser... · BIG DATA AND ANALYTICS. OUR RESEARCH: SURVIVING THE DATA FLOOD. SOME QUESTIONS

www.sics.se

WHO USES THIS?

Growth of Big Data Analytics

• Big Data Analytics: gaining value from data –  Web analytics, fraud detection, system

management, networking monitoring, business dashboard, …

2 Need to enable more users to perform data analytics http://www.delphianalytics.net

Page 14: BIG DATA AND ANALYTICS - Sveriges Riksbankarchive.riksbank.se/Documents/Forskning/Konferenser... · BIG DATA AND ANALYTICS. OUR RESEARCH: SURVIVING THE DATA FLOOD. SOME QUESTIONS

www.sics.se

SCALABILITY IS OFTEN NO LONGER THE MAIN ISSUE –ANALYTICS, EFFICIENCY AND PRODUCTIVITY ARE

Page 15: BIG DATA AND ANALYTICS - Sveriges Riksbankarchive.riksbank.se/Documents/Forskning/Konferenser... · BIG DATA AND ANALYTICS. OUR RESEARCH: SURVIVING THE DATA FLOOD. SOME QUESTIONS

www.sics.se

ACCESSIBLE BIG DATA TOOLS?

• Higher5level7abstractions7(RDDs75>7Dataframes etc.)

• R7and7Python7interfaces• Libraries:

• Spark7MLib,7GraphX• FlinkML,7Gelly• …

• Higher7level7languages• Imperative7Big7Data7processing

• …

Page 16: BIG DATA AND ANALYTICS - Sveriges Riksbankarchive.riksbank.se/Documents/Forskning/Konferenser... · BIG DATA AND ANALYTICS. OUR RESEARCH: SURVIVING THE DATA FLOOD. SOME QUESTIONS

www.sics.se

RAW PERFORMANCE MATTERS

https://databricks.com/blog/2015/04/28/project5tungsten5 bringing5spark5clos er5to5bare5meta l.html http://bid2.berkeley.edu/bid5data5project/

Page 17: BIG DATA AND ANALYTICS - Sveriges Riksbankarchive.riksbank.se/Documents/Forskning/Konferenser... · BIG DATA AND ANALYTICS. OUR RESEARCH: SURVIVING THE DATA FLOOD. SOME QUESTIONS

www.sics.se

DATA AT REST…

Page 18: BIG DATA AND ANALYTICS - Sveriges Riksbankarchive.riksbank.se/Documents/Forskning/Konferenser... · BIG DATA AND ANALYTICS. OUR RESEARCH: SURVIVING THE DATA FLOOD. SOME QUESTIONS

www.sics.se

… TO DATA IN MOTION

Page 19: BIG DATA AND ANALYTICS - Sveriges Riksbankarchive.riksbank.se/Documents/Forskning/Konferenser... · BIG DATA AND ANALYTICS. OUR RESEARCH: SURVIVING THE DATA FLOOD. SOME QUESTIONS

www.sics.se

FROM LAMBDA ARCHITECTURE…!

!

! !! !8!

*Figure*2:*Lambda*architecture*as*described*by*MAPR2.*

!

!Figure*3:*STREAMLINE*Flink*Architecture.*

!

! Lambda*Architecture! Flink*Streamline*Architecture!

Developer*Productivity!

• Implementations!in!multiple!systems!and!programming!languages!

• Orchestrating!multiple!system!architectures!for!efficient!usage!

• Implementation!of!data!analysis!programs!purely!in!a!single!system!

• No!system!boundaries!need!to!be!crossed,!only!one!system!needs!to!be!tuned!

Data*Transfer**Between*Layers!

• Data!transfer!happens!externally!through!the!serving!layer!with!corresponding!system!latency!overhead!

• Expert!knowledge!of!all!layers!needed!for!optimal!performance!

• Data!transfer!(between!batch!and!streaming)!happens!internally!to!the!system!

• Automatically!optimized!for!maximal!performance!

Code*Base*Maintenance!

• Upgrades!and!version!changes!independently!in!multiple!systems!

• Possible!version!conflicts!across!frameworks!on!the!user7side!

• Updates!are!solely!dependent!on!Flink!and!changes!are!made!in!a!single!API!

• Versioning!related!problems!handled!by!Flink!developers!

Debuggability! • Debugging!across!multiple!systems! • Out!of!the!box!debugging!tools!for!

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!2!https://www.mapr.com/sites/default/files/otherpageimages/lambda7architecture727800.jpg!

https://www.mapr.com/sites/default/files/otherpageimages/lambda5architecture525800.jpg

…TO UNIFIED ARCHITECTURE

Page 20: BIG DATA AND ANALYTICS - Sveriges Riksbankarchive.riksbank.se/Documents/Forskning/Konferenser... · BIG DATA AND ANALYTICS. OUR RESEARCH: SURVIVING THE DATA FLOOD. SOME QUESTIONS

www.sics.se

WHAT IS DIFFERENT THIS TIME?

• Haven’t we seen this before? AI, Data Mining etc.?• Things are a bit different this time around:

1. We have the computational power and middleware (cloud, Big Data middleware)

2. We have the data (Human generated, IoT)3. We have the algorithms

Page 21: BIG DATA AND ANALYTICS - Sveriges Riksbankarchive.riksbank.se/Documents/Forskning/Konferenser... · BIG DATA AND ANALYTICS. OUR RESEARCH: SURVIVING THE DATA FLOOD. SOME QUESTIONS

www.sics.se

THE UNREASONABLE EFFECTIVENESS OF DATA• In a wide array of academic fields, the ability to effectively process data is

superseding other more classical modes of research.

“More data trumps better algorithms”*

*“The7Unreasonable7Effectiveness7of7Data”7[Halevey et7al709]

Page 22: BIG DATA AND ANALYTICS - Sveriges Riksbankarchive.riksbank.se/Documents/Forskning/Konferenser... · BIG DATA AND ANALYTICS. OUR RESEARCH: SURVIVING THE DATA FLOOD. SOME QUESTIONS

www.sics.se

EXAMPLE: POPULATION SCALE GENOMICS

[Image7source:7Patterson,7Fighting7the7Big7C7with the7Big7D,72014]

Page 23: BIG DATA AND ANALYTICS - Sveriges Riksbankarchive.riksbank.se/Documents/Forskning/Konferenser... · BIG DATA AND ANALYTICS. OUR RESEARCH: SURVIVING THE DATA FLOOD. SOME QUESTIONS

www.sics.se

EXAMPLE: POPULATION SCALE GENOMICS

[Image7source:7Patterson,7Fighting7the7Big7C7with the7Big7D,72014]

Page 24: BIG DATA AND ANALYTICS - Sveriges Riksbankarchive.riksbank.se/Documents/Forskning/Konferenser... · BIG DATA AND ANALYTICS. OUR RESEARCH: SURVIVING THE DATA FLOOD. SOME QUESTIONS

www.sics.se

Data set

Graph View

maternal

ardent

shia

deciduous

badly

guerrilla

benedictine

armoured

aerobic

franciscan

geometric

weighs

slavic

favorablearchaeological

bacterial

middleweight

southbound

wbc

fluorescent

interstellar

flyweight

turkic

stricter

extermination

featherweightbehavioural

westbound

eastbound

sta

sodium

seriously

kappa

sigma

taoist

buddhistcalibre

mechanized

welterweight

lcd

rabbinicashkenazi

psychic

impressionist

staunch

didnt

weighing

tenor

sizeable

inflicting

liver

topographic

wba

influenza

incredibly

prescription

tarot

tremendous

czar

berber

gravely

presidentialgubernatorial

pour

fiber

immense

amorphous

stem

bassist

abdel

tau

mythological

mount

ramesses

ghulam

pounder extremely

steam

risc

unfavorable

colder

thtre

projective

casimir

dietary

acidic

aluminum

philanthropic

dont

quarterback

wnba

semitic

terrible

beech

calcium

impendingconiferous

telepathic

inorganic

domesticated

authoritarian

exceptionally

volkswagen

liter

exceedingly

haji

kharkiv

hasidic

suikoden

anterior

cfm

escuela

feral

somervilleasp

corinthian

medford

freight

womens

quantitative

universitt

humid

keyboardist

multilateral

williamstown

mistakenly

antiaircraft

motocross

daytime

fcc

feminine

washing

greatly

antisubmarine

parque

zimbabwean

maulana

scriptstyle

runic

northeastern aris

teammate

neighbouring

sheikh

ammonium

bitterly

greenish

temperate

dsl

considerable

bowed

kedah

hilly

islamist

plum

microbial

atlanta

religiously

illyrian

luxurious

holyoke

viral

histoire

guitarist

extravagant

aeronautical

stuffed

infantry

magnesium

horrible

friedrichtorpedo

aryan

westport

malden

comparatively

abdominal

differentiable

fax

pashtun

nomadicalpha

khalid

thracian

wrestlemania

hash

fairbanks

evergreen

intestinal

frontman

bullion

diocesan

smythe

eocene

kamehameha

kitchener

mahayana

cosmic

cali

intentionally

calgary

vigorously

drastically

electrical

accidentally

erroneously

cincinnati

hindu

gsm

ib

frac

totally

mohammad

spruce

significantly

saturday

toledo

gg

positive

coloured

uv

medley

pius

violin

georg

safavid

articulated

devastating

huntington

fatality

epsilon

translink cylindrical

drier

monochrome

denver

bluish

turbocharged

armoredauf

developmental

vaughan

cdot

slam

ptolemaic

phi

wheel

polynesian

extratropical

ncaa

hazardous

sill

bedouin

kodiak

dans

arabian

torsion

fuzzy

seville

nottingham

plated

navajo

internationale

farmington

hennepinoak

coconut

warmer

nap

wheeled

grilled

madman

pahlavi

electromagnetic

redwood

horizontal

ultima

varying

transverse

undisclosed

csa

ideological

xia

romanesque

gnral

southeastern

zen

conflicting

amalgam

meter

gov

milling

neurological

walnut

astoria

grizzly

palais

intracellular

thurston

pi

gallic

hardwood

electronically

stout

num

sudbury

deteriorating

refresh

ftp

fertile

lankan

wichita

kilograms

jute

palepixar

fir

inner

poplar

gottfried

titanium

fda

kannadanyu

mips

continually

sitka

celestial

geo

sindh

helsinki

kidney

syed

alicante

dhaka

schematic

edwardian

algebraic

olive

pleistocene

aziz

usb

cw

synchronous

sep

fukuoka

snowy

ankara

sesame

persian

placer

confucian

acute

tai

wooster

pir

kcb

malankara

doric

adolescent

physiological

dreamworks

cistercian

couldnt

charlestown

intergovernmental

northwestern

mountainous

unilateral

hee

grassland

keynote

downward

forza

hari

gartermanganese

cauchy telugu

pusher

cyrillic

museo

respiratory

stunning

silvery

quincy

mystical

reluctantly

firestone

wirral

gnostic

lugano

herding

conifer

enid

phenomenal

poorest

oricon

syntacticrotting

mulberry

batu

drainage

tian

lethbridge

baritone

stylistic

anthropomorphic

lookup

exponential

perak

valencia

strict

dowager

dependencies

tanning

ste

defensive

falsely

sima

proxy

escorting

imam

bolivia

parser

dialog

homosexual

mlaga

sassanid

socially

melee

cigarette

salim

reformist

ultraviolet

phylogenetic

mohammed

fig

bengali

canine

anarcho

defendingenriched

vegetable

uthman

caterpillars

zoological

annual

realist

outlying

alfonso

abdullahmayan

tunisian

expressly

broadband

serotonin

supernatural

lambda

carnivorous

indigenous

caltech

warmly

bahadur

malm

lunar

unorthodox

dreamwave

marietta

thatched appleton

hasansquash

lifeguard

rightarrow

madre

psi

copper

benign

beta

han

rue

superficially

philosophical

sectarian

topological

gare

profoundlyarchie

aspx

shoe

sher

governmental

figurative

ccd

flammable

pinball

hind

intercity

default

ahmad

sanchounderwent

ctv

middletown

palazzo

hickory

parliamentary

rhythm

vegetarian

zeta

elm

cleveland

billion hilton

analogue

dense

fil

kart

sigismund

cavalry

gamma

albrecht

gravitational

yusuf

quan

orient

uncivil

sixty

flu

guyed

speciality

graz

vastly

monstrous

pulmonary

islamabad

computational

differing

caller

bodied

fbi

willow

varies

roasted

enormous

sturm

pentium

gorges

bahia

super

tinker

remotely

souvenir

geographic

ismail

markham

protestant

cherry

sandstone

inline

regenerative

sunflower

uniformly

mangrove

modernist

oslo

intangible

optimal

serra

watercolor

shaykh

byu

birch

bhp

scandinavian

ethical

wooded

vapor

surrealist

talmudic

answering

reconnaissance

walled

disastrous

satirical

huey

wadysaw

senegalese

centric

hiroshima

excessively

esoteric

seventy

indianapolis

ninth

officially

biblical

sexual

gujarati

banco

christoph

littleton

nigerian

tambon

quadruple

jat

multidisciplinary

relatively

kurdish

dull

bleach

eligibility

incoming

poultry

halifax

litchfield

es

kbe

watertown

mobile

wrongly

mallorca

hyun

paternal

sunni

severely

guerilla

anaerobic

augustinian

geometrical

weighed

germanic

favourablearcheological

fungal

northbound

incandescent

interplanetary

stringent

refugee

behavioral

santa

potassium

jain

caliber

bantamweight

motorized

crt

rabbinical

sephardic

expressionist

outspoken

doesnt

sizable

inflict

topographical

ibf

extraordinarily

illicit

greeting

tsar

theta

mayoral

fibre

crystalline

germ

drummer

abdul

mythical

mt

antiochus

rifled

superhuman

diesel

powerpc

maison

hyperbolic

nutritional

alkaline

aluminium

charitable

qb

nba

pine

imminent

organic

allegorical

totalitarian

vw

litre

lviv

messianic

posterior

stray

passenger

mens

waltham

qualitative

verlag

eucalyptus

bilateral

shaolin

bmx

primetime

masculine

asw

universidad

bangladeshi

panathinaikos

shaikh

fiercely

brownish

arid

adsl

neighboring

plucked

leftist

racially

luxury

lavish

biomedical

parachute

horrific

inflatable

stamford

thoracic

recursive

dramatically

anchorage

commemorative

suffragan

icc

miocene

guelph

theravada

medelln

alto

deliberately

mechanical

inadvertently

adverse

cdma

baccalaureate

sqrt

utterly

muhammad

fatally

substantially

weekday

columbus

qc

epa

negative

colored

freestyle

cello

johann

decker

catastrophic

rockaway

unemployment

spherical

boulder

rugged

aspirated

ausevolutionary

completely

robbie

itf

umayyad

mega

tropical

invitational

radioactive

stillwater

sous

sinai

snack

programmable

welded

cherokee

parc

langle

bloomfield

euclidmaple

cooler

unc

fried

grenadier

itc

vertical

bowel

clement

longitudinal

unspecified

bayesian

doctrinal

zhou

gothic southwestern

tibetan

edmonton

metre

edu

cnc

autoimmune

cedar

hillsboro

hershey

machine

extracellular

mandy

balkan

placebo

flattened

prod

worsening

toxic

dns

smallpox

alluvial

cuyahoga

flour

celincorrectly

outer

marathi

constantly

heavenly

zinc

balochistan

turku

chittagong

feynman

victorian

scsi

ayatollah

upn

numerical

asynchronous

jul

nara

misty

zmir

tonkin

coal

chronic

ho

hazrat

syriac

ionic

teenage

biochemical

cant

balfour

ding

montanes

ipv

upward

davao

amar

venomous

lithium

nucleotidemalayalam

bladed

phoenician

castel

spectacular

springfield

magical

willingly

goodyear

heretical

fribourg

blooded

castor

poorer

thrice

grammatical

napolon

decaying

karim

geyser

guang

soprano

morphological

periodic

penang

cdiz

zane

consort

hydrographic

bunk

brandywine

sainte

offensive

xiao

escorted

janeiro

oppressive

cardboard

heterosexual

seleucid

environmentally

cigar

infrared

mohamed

montreux

umar

gastrointestinal

meiji

valle

rooster

ska

reigning

depleted

larvae

landscaped

yearly

moroccan

explicitly

wireless

dopamine

herbivorous

enthusiastically

nader

stockholm

meteorite

unconventional

jesuit

cmg

fawcett

retractable

milwaukee

tennis

faire

pear

picket

cordillera

substantialcypress

malignant

vaguelymetaphysical

escalating

hausdorff

negatively

orc

slr

colorless

fore

aboutus

piazza

undergoing

sidi

millimeter

cbc

sabha

flamenco

vegan

trillion

hyatt

analog

lush

gael

harness

erich

electrostatic

ali

pci

sarcastic

specialty

linz

renal

analytical

divergent

arboreal

dea

vary

roast

hugemusik

earthen

hula

stringer

antique

geographical

rennes

pentecostal

peach

limestone

roller

mustard

evenly

postmodern

trondheim

tangible

thyroid

optimum

estdio

ozark

prophet

empress

demonic

horsepower

moral

forested

atmospheric

johor

recon

affine

pelvic

humorous

juliette

ghanaian

aerospace

overly

occult

eighty

valparaiso

regulatory

eighth

formallyphilipp

thoma

jutland

muban

triple

rajput

holistic

fairly

albanian

ranma

mango

psychotic

aboriginal

outgoing

dairy

vila

ja

kcmg

cellular

gcb

majorca

eun

Double-click on empty area to zoom in, shift-double-click to zoom out. Click and drag on empty area to pan. Double-click on node to highlight immediate neighours.

SICS, Swedish Institute of Computer Science, 2014

Page 25: BIG DATA AND ANALYTICS - Sveriges Riksbankarchive.riksbank.se/Documents/Forskning/Konferenser... · BIG DATA AND ANALYTICS. OUR RESEARCH: SURVIVING THE DATA FLOOD. SOME QUESTIONS

www.sics.se

BIG DATA – A BIG MISTAKE?

• “Four oversimplified articles of faith:”1. Data analysis produces uncannily accurate results2. Every single data point can be captured, making old

statistical sampling techniques obsolete3. It is passé to fret about what causes what, because

statistical correlation tells us what we need to know4. “The End of Theory”: With enough data, the numbers speak

for themselves• Practitioners do not necessarily believe this, and

neither should youFrom Big Data: are we making a big mistake?, Tim Harford, Financial Times March 28 2014

Page 26: BIG DATA AND ANALYTICS - Sveriges Riksbankarchive.riksbank.se/Documents/Forskning/Konferenser... · BIG DATA AND ANALYTICS. OUR RESEARCH: SURVIVING THE DATA FLOOD. SOME QUESTIONS

www.sics.se

BIG DATA TO BIG VALUE?

Big Data Big Value

Page 27: BIG DATA AND ANALYTICS - Sveriges Riksbankarchive.riksbank.se/Documents/Forskning/Konferenser... · BIG DATA AND ANALYTICS. OUR RESEARCH: SURVIVING THE DATA FLOOD. SOME QUESTIONS

www.sics.se

BIG DATA TO BIG VALUE?

Big Data Big Value

Real-world Data Model Conclusion

Question

Page 28: BIG DATA AND ANALYTICS - Sveriges Riksbankarchive.riksbank.se/Documents/Forskning/Konferenser... · BIG DATA AND ANALYTICS. OUR RESEARCH: SURVIVING THE DATA FLOOD. SOME QUESTIONS

www.sics.se

BDA: WORKFLOW NOT FUNDAMENTALLY DIFFERENT

Acquisition

Cleaning

RepresentationCase-based

Statistical Logical Kernel-based

Validation

Deployment

Page 29: BIG DATA AND ANALYTICS - Sveriges Riksbankarchive.riksbank.se/Documents/Forskning/Konferenser... · BIG DATA AND ANALYTICS. OUR RESEARCH: SURVIVING THE DATA FLOOD. SOME QUESTIONS

www.sics.se

EVERYTHING IS ABOUT MACHINE LEARNING ANYWAY…

A Few Useful Things to Know About Machine LearningPedro DomingosCACM 55:10http://dl.acm.org/citation.cfm?id=2347755

Page 30: BIG DATA AND ANALYTICS - Sveriges Riksbankarchive.riksbank.se/Documents/Forskning/Konferenser... · BIG DATA AND ANALYTICS. OUR RESEARCH: SURVIVING THE DATA FLOOD. SOME QUESTIONS

www.sics.se

LET’S START WITH THE OBVIOUS

It is a key component of digital services It will help us understand the world better

Lindh, J et al.: PLoS ONE 7(11), 2012.

Page 31: BIG DATA AND ANALYTICS - Sveriges Riksbankarchive.riksbank.se/Documents/Forskning/Konferenser... · BIG DATA AND ANALYTICS. OUR RESEARCH: SURVIVING THE DATA FLOOD. SOME QUESTIONS

www.sics.se

…BUT IT IS REALLY, REALLY DIFFICULT

Sample bias, data veracity Model complexity

…+ user and analyst bias…

Page 32: BIG DATA AND ANALYTICS - Sveriges Riksbankarchive.riksbank.se/Documents/Forskning/Konferenser... · BIG DATA AND ANALYTICS. OUR RESEARCH: SURVIVING THE DATA FLOOD. SOME QUESTIONS

www.sics.se

SCALABLE, PRIVACY-PRESERVING MOBILITY ANALYTICSFROM NETWORK DATA

A1 B1

C1 D1 E1

F1

G1

A2

A3 A4 A5 A6

A7 A8

B2

B3 B4 B5

B6

C2 C3

C4 C5

D2

D3

D4

E2 E3

E4F2 F3

F4 F5 F6

F7 F8

G2

G4G3 G5

G6 G7 G8

H2 H3

H1

Trajectory 1

Trajectory 2

Trajectory 3

A3, A4, A5, C1, D1, D2, B4, B1 F2, F3, F6, G1, C5, D3, E2, E1, B5, B2 F7, F5, F6, G1, G2, D4, E2, E3, B6

Urban planning Traffic management

Crisis management

Consumerapplications

Networkmanagement

Page 33: BIG DATA AND ANALYTICS - Sveriges Riksbankarchive.riksbank.se/Documents/Forskning/Konferenser... · BIG DATA AND ANALYTICS. OUR RESEARCH: SURVIVING THE DATA FLOOD. SOME QUESTIONS

www.sics.se

AN UNSOLVABLE PROBLEM?• Moving datasets can be difficult

• Large and sensitive (business value, integrity issues, etc.)• Can allow for re-identification in anonymized data

• Combining several datasets outside their collection entity is key to many (public) applications, but…

Centralized datasets: Technically feasible Politically hardFederated datasets: Technically difficult Politically solvable

Page 34: BIG DATA AND ANALYTICS - Sveriges Riksbankarchive.riksbank.se/Documents/Forskning/Konferenser... · BIG DATA AND ANALYTICS. OUR RESEARCH: SURVIVING THE DATA FLOOD. SOME QUESTIONS

www.sics.se

BIG DATA: KEY COMPONENT IN AUTOMATION

Page 35: BIG DATA AND ANALYTICS - Sveriges Riksbankarchive.riksbank.se/Documents/Forskning/Konferenser... · BIG DATA AND ANALYTICS. OUR RESEARCH: SURVIVING THE DATA FLOOD. SOME QUESTIONS

www.sics.se

MASSIVE LEARNING SYSTEMS

Page 36: BIG DATA AND ANALYTICS - Sveriges Riksbankarchive.riksbank.se/Documents/Forskning/Konferenser... · BIG DATA AND ANALYTICS. OUR RESEARCH: SURVIVING THE DATA FLOOD. SOME QUESTIONS

www.sics.se

LARGER AND LARGER REPRESENTATIONS

ConvolutionPoolingSoftmaxOther

Page 37: BIG DATA AND ANALYTICS - Sveriges Riksbankarchive.riksbank.se/Documents/Forskning/Konferenser... · BIG DATA AND ANALYTICS. OUR RESEARCH: SURVIVING THE DATA FLOOD. SOME QUESTIONS

www.sics.se

UNIFICATION OF MACHINE LEARNING?

Alternating Direction Method of MultipliersS. Boyd, N. Parikh, et al.http://stanford.edu/~boyd/papers/admm_distr_stats.html

Page 38: BIG DATA AND ANALYTICS - Sveriges Riksbankarchive.riksbank.se/Documents/Forskning/Konferenser... · BIG DATA AND ANALYTICS. OUR RESEARCH: SURVIVING THE DATA FLOOD. SOME QUESTIONS

www.sics.se

MOVING TO PROBABILISTIC APPROXIMATIONS

https://highlyscalable.wordpress.com/2012/05/01/probabilistic5structures5web5analytics5data5mining/

Page 39: BIG DATA AND ANALYTICS - Sveriges Riksbankarchive.riksbank.se/Documents/Forskning/Konferenser... · BIG DATA AND ANALYTICS. OUR RESEARCH: SURVIVING THE DATA FLOOD. SOME QUESTIONS

www.sics.se

ANALYTICS EVERYWHERE

Page 40: BIG DATA AND ANALYTICS - Sveriges Riksbankarchive.riksbank.se/Documents/Forskning/Konferenser... · BIG DATA AND ANALYTICS. OUR RESEARCH: SURVIVING THE DATA FLOOD. SOME QUESTIONS

www.sics.se

ANALYTICS AS A SERVICE

Page 41: BIG DATA AND ANALYTICS - Sveriges Riksbankarchive.riksbank.se/Documents/Forskning/Konferenser... · BIG DATA AND ANALYTICS. OUR RESEARCH: SURVIVING THE DATA FLOOD. SOME QUESTIONS

www.sics.se

WHERE ARE WE GOING?① Advanced analytics is moving towards large-scale Machine

Learning② Computation and storage platforms need to adapt and develop

LearningMachines@SICSAnalytics and system development on real data and use cases

Page 42: BIG DATA AND ANALYTICS - Sveriges Riksbankarchive.riksbank.se/Documents/Forskning/Konferenser... · BIG DATA AND ANALYTICS. OUR RESEARCH: SURVIVING THE DATA FLOOD. SOME QUESTIONS

www.sics.se

THE DATA DRIVEN SYSTEMS STACK, SICS

Hops PaaS (http://hops.io)

HDFS

Compute andresource management

Storage and streams

Data processingengines

Algorithms andframeworks

Analytics

Mesos

Flink

Graph algorithms

HBase Kafka

Yarn

Spark Storm

Scalable /On-line

Structural / DeepLearning

Diagnosis / classification

Anomaly detection Clustering

SICS ContributionsSicsthSense

ICN, SDN

SDN Monitoring

Text, Social Media

Autonomous RANNetworking

Data collection

Page 43: BIG DATA AND ANALYTICS - Sveriges Riksbankarchive.riksbank.se/Documents/Forskning/Konferenser... · BIG DATA AND ANALYTICS. OUR RESEARCH: SURVIVING THE DATA FLOOD. SOME QUESTIONS

www.sics.se

Network analytics

Vehicles &Transport

eHealth

Hops PaaS (http://hops.io)

HDFS

Compute andresource management

Storage and streams

Data processingengines

Algorithms andframeworks

Analytics

Mesos

Flink

Graph algorithms

HBase Kafka

Yarn

Spark Storm

Scalable /On-line

Structural / DeepLearning

Diagnosis / classification

Anomaly detection Clustering

SICS ContributionsSicsthSense

ICN, SDN

SDN Monitoring

Text, Social Media

Autonomous RANNetworking

Data collection

DATA DRIVEN SYSTEMS, APPLICATION EXAMPLES

Page 44: BIG DATA AND ANALYTICS - Sveriges Riksbankarchive.riksbank.se/Documents/Forskning/Konferenser... · BIG DATA AND ANALYTICS. OUR RESEARCH: SURVIVING THE DATA FLOOD. SOME QUESTIONS

www.sics.se

EXPERIENCE FROM APPLICATIONS• Big Data can be made small – consider the complete application

• Big Data can become small very quickly

• Beware of sample bias

• Distribute models, not data

• Distributed solutions, Statistical Machine Learning, and Bayesian statistics can help

Page 45: BIG DATA AND ANALYTICS - Sveriges Riksbankarchive.riksbank.se/Documents/Forskning/Konferenser... · BIG DATA AND ANALYTICS. OUR RESEARCH: SURVIVING THE DATA FLOOD. SOME QUESTIONS

www.sics.se

DISCUSSING APPLICATIONS: IT IS NOT OBVIOUS WHAT IS DIFFICULT

Page 46: BIG DATA AND ANALYTICS - Sveriges Riksbankarchive.riksbank.se/Documents/Forskning/Konferenser... · BIG DATA AND ANALYTICS. OUR RESEARCH: SURVIVING THE DATA FLOOD. SOME QUESTIONS

www.sics.se

WHAT ABOUT INTEGRITY?

• Not all Big Data data is integrity sensitive!• Media, measurements, science, …

• How data is (or potentially is) used is everything• Surveillance and data driven services are very different• It all comes down to trust• Transparency is key• Laws and regulations? How do we manage sales,

sharing?

Page 47: BIG DATA AND ANALYTICS - Sveriges Riksbankarchive.riksbank.se/Documents/Forskning/Konferenser... · BIG DATA AND ANALYTICS. OUR RESEARCH: SURVIVING THE DATA FLOOD. SOME QUESTIONS

www.sics.se

THE END OF COMPUTER SCIENCE

Computer science Data science

Page 48: BIG DATA AND ANALYTICS - Sveriges Riksbankarchive.riksbank.se/Documents/Forskning/Konferenser... · BIG DATA AND ANALYTICS. OUR RESEARCH: SURVIVING THE DATA FLOOD. SOME QUESTIONS

www.sics.se

…BUT JUST REMEMBER…

We’re not there yet – things are moving fast!

Page 49: BIG DATA AND ANALYTICS - Sveriges Riksbankarchive.riksbank.se/Documents/Forskning/Konferenser... · BIG DATA AND ANALYTICS. OUR RESEARCH: SURVIVING THE DATA FLOOD. SOME QUESTIONS

www.sics.se