Trends in Big Data

PRIMEFLEX for Hadoop

Winterschule (winter school), 22 February 2017, Berchtesgaden

Dr. Fritz Schinkel

Copyright 2016 FUJITSU




Big Data Hands-On Platform

Disk failure prediction

Machine Tool Anomaly Detection

Production Idle Time Classification

Big Data Value Chain

Process steps: Extract, Collect → Clean, Transform → Analyze, Discover → Decide, Act

Inputs:
• Structured & unstructured data
• Devices, sensors, Internet of Things
• Social media, open data, linked data

Outputs:
• Research & development, science
• Operations, automation, production
• Interactive reports, advertising

A structured approach through consulting, infrastructure and tooling.

Big Data Infrastructure Reference Architecture: A Platform That Fits the Business Idea

Diverse data → Consolidated data → Distilled essence → Applied knowledge

Extract, collect → Clean, transform → Analyze, visualize → Decide, act

Data sources: databases, application servers, web content, sensor data

Analytics platform: batch processing, event processing, interactive (dialog) processing

Access: apps, services, queries, visualization, reporting, notifications

More Than MapReduce: Hadoop Software Stack (Selection)

Layers: Storage, Resource Mgt., Data Mgt., Data Access

Components:
• HDFS: redundant, reliable persistent storage
• YARN: cluster resource management
• Kafka: queueing
• MapReduce: execution engine (linear)
• TEZ: execution engine (DAG)
• Spark: resilient distributed data execution engine (in-memory, DAG)
• Spark SQL, Spark Streaming, Spark GraphX, Spark MLlib
• Hive: SQL
• Pig: script
• Impala: SQL
• HBase: NoSQL key-value store
• Datameer: visual analytics
• SAP Vora: SQL
• SAP HANA engine

Operation: Data Instead of Technology

Collection → Analysis → Action

• Structured & unstructured data
• Devices, sensors, Internet of Things
• Social media, open / linked data


Big Data Hands-On Platform

Disk failure prediction

Machine Tool Anomaly Detection

Production Idle Time Classification


Predictive Maintenance for Disk Arrays

Goals

Early detection of disk failures

Prevent onsite interventions at night and weekends

Asset: Storage system logs

Error statistics per disk

Disk replacements

Approach: Pattern finding / Training

Find early warning criteria

Evaluate criteria against historical data (economic value)

Process steps: Formulate Use Case → Data Preparation and Exploration → Data Selection and Transformation → Develop Model and Visualization → Validate → Deploy → Evaluate and Monitor

Data basis: 101 log files from 71 systems

Overview: Flow of Analysis

(Flow diagram; recoverable steps and annotations)
• Import to analysis tool: 101 log files from 71 systems
• Check potential: 875 disks faulted
• Find indicators: error points on 58% of faulted disks
• Split input data: training data, evaluation data
• Search for criteria
• Define metrics
• Financial model, result weighting
• What-if analysis: best parameters
• Use best parameter
• Visualize result
• Improve and repeat
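A minimal sketch of the "split input data" and "what-if analysis" steps named in this flow, assuming each disk is reduced to one number (its peak error-point value) plus a flag for whether it failed. The function names, the scoring rule and the sample records are hypothetical and not taken from the slides.

```python
import random


def split_disks(disk_ids, train_fraction=0.5, seed=0):
    """Randomly split disk ids into a training and an evaluation set."""
    ids = sorted(disk_ids)
    random.Random(seed).shuffle(ids)
    cut = int(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]


def score(threshold, disks):
    """Count correct decisions: failed disks flagged plus healthy disks not flagged."""
    correct = 0
    for peak_error_points, failed in disks:
        flagged = peak_error_points >= threshold
        correct += (flagged == failed)
    return correct


def best_threshold(candidates, disks):
    """What-if search: try every candidate threshold and keep the best-scoring one."""
    return max(candidates, key=lambda threshold: score(threshold, disks))


# Hypothetical records: disk id -> (peak error points, disk failed?)
records = {
    "d1": (2, False), "d2": (9, True), "d3": (3, False),
    "d4": (12, True), "d5": (1, False), "d6": (8, True),
}
train_ids, evaluation_ids = split_disks(records)
chosen = best_threshold([1, 5, 10, 15], [records[d] for d in train_ids])
evaluation_score = score(chosen, [records[d] for d in evaluation_ids])
```

The threshold is chosen on the training half only and then scored on the evaluation half, so the reported quality is not inflated by the search itself.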

Data Selection and Transformation

Hypothesis: error point values and error frequency grow in the run-up to a failure

Use error point histories with the failure as endpoint

(Figure annotations: 13 days, 3 days, 4 days)
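One way to build error point histories that end at the failure is to re-index every event by its distance to the failure day. The sketch below assumes events are `(date, error_points)` pairs; the function name and the sample dates are hypothetical.

```python
from datetime import date


def align_to_failure(error_events, failure_day):
    """Return (days_before_failure, error_points) pairs in chronological order.

    Events after the failure day are dropped, so the failure is the endpoint.
    """
    aligned = []
    for day, points in sorted(error_events):
        if day <= failure_day:
            aligned.append(((failure_day - day).days, points))
    return aligned


# Hypothetical history of one disk that failed on 21 March 2016.
events = [
    (date(2016, 3, 1), 2),
    (date(2016, 3, 14), 3),
    (date(2016, 3, 17), 5),
    (date(2016, 3, 21), 8),
]
history = align_to_failure(events, date(2016, 3, 21))
# history == [(20, 2), (7, 3), (4, 5), (0, 8)]
```

With a common time axis counted backwards from the failure, histories of different disks can be overlaid and compared.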

Develop Model (and Visualization)

Find suitable test criteria
• Time series of error points: heavily oscillating, no obvious trend or threshold
• Moving average of error points: smoother time series, trend becomes visible
• Moving average of error frequency

Try thresholds for
• Short / mid / long moving averages
• Linear combinations of averages

Modulate the moving average window

(Figure annotations: 13 days, disk failure)
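The idea of smoothing the error points and thresholding a linear combination of moving averages can be sketched as follows. This is an illustration under assumptions: a trailing window, two averages only, and hypothetical weights, windows and data; the slides do not state the actual parameters.

```python
def moving_average(values, window):
    """Trailing moving average; early positions average over the shorter prefix."""
    averages = []
    running_sum = 0.0
    for i, value in enumerate(values):
        running_sum += value
        if i >= window:
            running_sum -= values[i - window]  # drop the value leaving the window
        averages.append(running_sum / min(i + 1, window))
    return averages


def first_alarm(values, short_window, long_window, weight_short, weight_long, threshold):
    """Index of the first day on which the weighted combination of a short and a
    long moving average reaches the threshold, or None if it never does."""
    short = moving_average(values, short_window)
    long_ = moving_average(values, long_window)
    for i, (s, l) in enumerate(zip(short, long_)):
        if weight_short * s + weight_long * l >= threshold:
            return i
    return None


# Hypothetical, heavily oscillating daily error points of one disk.
error_points = [0, 6, 0, 6, 12]
alarm_day = first_alarm(error_points, 2, 4, 0.5, 0.5, 5.0)
```

In this example the raw value already exceeds 5.0 on day 1, while the smoothed combination raises the alarm only on day 4, once the growth is sustained.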

Data Selection and Transformation Revisited: Error Histories of Failing vs. Non-Failing Disks

Higher frequency of error points in the run-up to a failure

Strong growth in the "final" phase implies spontaneous healing of the surviving disks!

Plausible?

Observation: Gaps in the Log Files

(Figure legend: begin / end of log file; day without entry)
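Days without entry can be found by walking the calendar between the first and last logged day. The sketch assumes the log has already been reduced to the dates on which at least one entry exists; the function names are hypothetical.

```python
from datetime import date, timedelta


def missing_days(entry_days):
    """Days between the first and last log entry that have no entry at all."""
    present = set(entry_days)
    if not present:
        return []
    day, last = min(present), max(present)
    gaps = []
    while day < last:
        day += timedelta(days=1)
        if day not in present:
            gaps.append(day)
    return gaps


def is_gap_free(entry_days):
    """True if every day between begin and end of the log has an entry."""
    return not missing_days(entry_days)


log_days = [date(2016, 1, 1), date(2016, 1, 2), date(2016, 1, 5)]
gaps = missing_days(log_days)  # 3 and 4 January 2016 are missing
```

A filter like `is_gap_free` is one way to select only log files in which a missing error point really means "no error" rather than "no data".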

Data Selection: Gap-Free Log Files

Result

Positive economic effect
• Savings from avoided onsite interventions vs. cost of prematurely removed disks

Hit ratio depends on the reason for degrading
• Overall hit ratio is between 40 and 50%
• Excellent for disks degraded by "Disk statistics": 91-94%
• 20% of disks degraded "At once" are detected

Further improvements possible with direct data sources
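The trade-off between savings and cost can be written as a simple weighted count. The following is a sketch only: the cost figures and disk counts are hypothetical placeholders, and the slides do not disclose the actual financial model.

```python
def hit_ratio(predicted_failures, total_failures):
    """Share of failed disks that were predicted in advance."""
    return predicted_failures / total_failures


def net_benefit(hits, false_alarms, saving_per_hit, cost_per_false_alarm):
    """Savings from avoided unplanned onsite interventions minus the cost of
    disks that were removed although they would not have failed."""
    return hits * saving_per_hit - false_alarms * cost_per_false_alarm


# Hypothetical numbers for illustration only.
ratio = hit_ratio(45, 100)                    # 0.45
benefit = net_benefit(45, 10, 300.0, 200.0)   # 11500.0
```

A positive `net_benefit` corresponds to the "positive economic effect" stated above; the what-if analysis would vary the detection parameters to maximise it.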


Big Data Hands-On Platform

Disk failure prediction

Machine Tool Anomaly Detection

Production Idle Time Classification


Analyze Sensor Data From CNC Lathe

Sensor logs from turning machine using multiple tools on a work piece

Many files (one per tool application) with sensor readings (100/second)

Short-term target: Find unusual sensor readings that point to a production failure

Mid-term target: Find metrics and thresholds to detect a faulty tool application in real-time sensor data

Long-term target: Find rules to predict tool failure before it happens

Step 1: Import Data 1)

Import of many files to HDFS from various shared or remote sources (NFS, SSH, FTP, HTTP, …)

Import wizards for many source formats (CSV, JSON, XML, …)

Transformation to an Excel-like table format

1) Evaluation data set kindly provided by Prof. Dr.-Ing. Joachim Imiela, Managing Director, Optvia Unternehmensberatung (http://www.optvia.de)

Step 2: Get a Quick Overview

Use Flip Sheet to view standard column statistics

Build drag-and-drop infographics to discover more details

8017 is the most used tool

Step 3: Create Metric for Automatic Detection

Idea: Build the average of all graphs and calculate the distance of each graph to the average graph using the L2 norm
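A minimal sketch of this metric, assuming every tool application yields a curve with the same number of samples (real data would first need resampling or alignment, which is not shown). Function names and sample curves are hypothetical.

```python
import math


def average_curve(curves):
    """Pointwise mean of equally long curves."""
    n = len(curves)
    return [sum(samples) / n for samples in zip(*curves)]


def l2_distance(curve, reference):
    """Euclidean (L2) distance between a curve and the reference curve."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(curve, reference)))


# Three hypothetical sensor curves; the third deviates in its first sample.
curves = [
    [1.0, 2.0, 3.0],
    [1.0, 2.0, 3.0],
    [4.0, 2.0, 3.0],
]
mean = average_curve(curves)                      # [2.0, 2.0, 3.0]
distances = [l2_distance(c, mean) for c in curves]  # [1.0, 1.0, 2.0]
```

The deviating curve gets the largest distance. Note that the outlier also pulls the average towards itself, which is one reason to eliminate anomalies and recompute the average afterwards.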

Step 4: Visualize Metric and Eliminate Anomalies

Tools with two different workflows: find criteria in the data to separate them

Step 5: Determine Threshold

A threshold of 0.6 can be used in real-time metric processing to quickly detect defective parts

(Figure: application failure of tool 8017)
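For real-time use, the distance to a reference curve can be accumulated sample by sample, so an alarm is raised as soon as the threshold is crossed rather than at the end of the tool application. This is a sketch under assumptions: that the threshold applies to an L2 distance on the same scale as the samples, and that a reference curve is available; names and data are hypothetical.

```python
def first_exceedance(samples, reference, threshold):
    """Return the sample index at which the running L2 distance to the
    reference first exceeds the threshold, or None if it never does."""
    squared = 0.0
    limit = threshold ** 2  # compare squared values to avoid a sqrt per sample
    for i, (value, expected) in enumerate(zip(samples, reference)):
        squared += (value - expected) ** 2
        if squared > limit:
            return i
    return None


reference = [0.0, 0.0, 0.0, 0.0]
alarm_at = first_exceedance([0.1, 0.2, 0.7, 0.1], reference, 0.6)  # 2
no_alarm = first_exceedance([0.1, 0.1, 0.1, 0.1], reference, 0.6)  # None
```

Because the running sum of squares never decreases, an early alarm is always consistent with the final distance of the complete curve.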


Big Data Hands-On Platform

Disk failure prediction

Machine Tool Anomaly Detection

Production Idle Time Classification

Objective

Learning to understand machine data

Improving production and maintenance planning

Focus: loss times must be recorded in a differentiated way in order to derive targeted improvement measures

(Figure: production time, net operating time, loss time; loss time categories: malfunctions, short stoppages, setup, tool change, start-up losses, material shortage, maintenance. Wil/86773 © IFW)

Approach to Problem Solving

Whitebox model
• Comprehensive observation and data collection
• Detailed understanding of all parameters
• Model built from a combination of parameters

Blackbox model
• Observation of the basic parameters
• Grouping of the idle-time events
• Model built from typical individual events

Low modeling effort

Model transferability

Unexpected insights

Precise evaluation

Low computational effort

Process Analysis

Select machine data
• Program status
• Machine status
• Axis position

Assign loss times
• Analysis of parameter curves
• Process knowledge required

Derive calculation logic
• Transfer the visually determined loss periods into formulas

(Figure: program status over time, with phases labelled break, setup, tool change and "?". Wil/86775 © IFW)

\[
\text{malfunction time} = \int_{t_{start}}^{t_{end}} \chi_{status\_6}(t)\,dt = \sum_{i=0}^{N-1} \chi_{status\_6}(t_{start} + i \cdot 0.01\,\mathrm{s}) \cdot 0.01\,\mathrm{s}
\]

where \(\chi_{status\_6}\) is the characteristic function of the times with status = 6,

\[
\chi_{status\_6}(t) = \begin{cases} 1 & \text{if } status(t) = 6 \\ 0 & \text{if } status(t) \neq 6 \end{cases}
\]

and N is the number of time intervals when discretizing into steps of 0.01 s:

\[
N = \frac{t_{end} - t_{start}}{0.01\,\mathrm{s}}
\]
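The discretized sum above translates directly into code: count the samples whose status equals 6 and multiply by the step width. The sketch assumes the status signal is available as one value per 0.01 s step; the function name and the sample signal are hypothetical.

```python
def time_in_status(status_samples, status=6, step_seconds=0.01):
    """Discretized integral of the characteristic function of `status`:
    each sample with that status contributes one step of `step_seconds`."""
    matching = sum(1 for s in status_samples if s == status)
    return matching * step_seconds


# Hypothetical signal: 150 samples with status 6, then 50 samples with status 3.
signal = [6] * 150 + [3] * 50
malfunction_seconds = time_in_status(signal)  # about 1.5 s
```

The same function gives the time spent in any other status by changing the `status` argument, which is how further loss categories could be derived once their status codes are known.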

Results

Gross operating time = 9.01 days

Net operating time = 6.06 days

Production time = 5.56 days

Break time = 2.95 days

Loss time = 0.5 days

Loss time categories (in slide order): short stoppages, malfunctions, setup time, tool change, start-up losses, maintenance

Values (in slide order): 1.4 h, 0.8 h, 8.2 h, 1.6 h, 0 h, 0 h

(Figure credit: Wil/86772 © IFW)
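As a plausibility check, the figures on this slide are mutually consistent. The decomposition below is inferred from the numbers themselves (production plus loss gives net, net plus breaks gives gross) and is not stated explicitly on the slide; d denotes days.

```latex
\begin{align*}
\text{net operating time}   &= 5.56\,\text{d} + 0.5\,\text{d} = 6.06\,\text{d} \\
\text{gross operating time} &= 6.06\,\text{d} + 2.95\,\text{d} = 9.01\,\text{d} \\
\text{loss time}            &= (1.4 + 0.8 + 8.2 + 1.6 + 0 + 0)\,\text{h} = 12.0\,\text{h} = 0.5\,\text{d}
\end{align*}
```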

Unsupervised Learning: k-Means Clustering (Lloyd, 1957)

Goal: groups of neighboring individuals (clusters)

Small distance of the individuals to the cluster centroid ("cost")

Algorithm

Start: place k crosses of different colors

Iteration: "color" and "average"
• Color each individual like the nearest cross
• Move each cross to the center of the individuals of the same color

Stop when nothing changes anymore

(Figure sequence: start, color, average, color, average, color / stop)
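The two-step iteration described on this slide can be sketched in plain Python. The initial centers (the "crosses") are passed in by the caller; how they were chosen in the original experiments is not stated, and the function names and sample points are hypothetical.

```python
def k_means(points, centers, max_iterations=100):
    """Lloyd's algorithm on tuples of floats; `centers` are the initial crosses."""
    centers = [tuple(c) for c in centers]
    labels = None
    for _ in range(max_iterations):
        # "Color": assign each point to its nearest center (squared Euclidean distance).
        new_labels = [
            min(range(len(centers)),
                key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centers[j])))
            for point in points
        ]
        if new_labels == labels:
            break  # stop: nothing changed
        labels = new_labels
        # "Average": move each center to the mean of the points of its color.
        for j in range(len(centers)):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # an empty cluster keeps its center
                centers[j] = tuple(sum(coord) / len(members) for coord in zip(*members))
    return labels, centers


def cost(points, labels, centers):
    """Sum of squared distances of all points to their cluster center."""
    return sum(
        sum((p - c) ** 2 for p, c in zip(point, centers[label]))
        for point, label in zip(points, labels)
    )


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
labels, centers = k_means(points, [(1.0, 1.0), (9.0, 1.0)])
# labels == [0, 0, 1, 1]; centers == [(0.0, 1.0), (10.0, 1.0)]
```

Evaluating `cost` for several values of k gives the cost curve whose "elbow" is commonly used to pick the number of clusters.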

k-Means Clustering Experiments

Setup
• Dimensions: time and mean spindle positions
• Runs for k = 5, 6, 7, 8, 15

Execution
• PRIMEFLEX for Hadoop

Result
• Elbow of the cost curve at k = 6
• Well-structured clustering (silhouette coefficient 0.76)
• Group the idle periods into 6 clusters

(Figures: cost for k = 5, 6, 7, 8, 15; silhouette coefficients of the idle phases on an axis from 0.0 to 0.8, overall system strongly structured at 0.76. Silhouette values lie between -1 and 1; high is good.)
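For reference, the silhouette coefficient used on this slide can be computed as below. This follows the standard definition with Euclidean distance and is a sketch, not the implementation used in the experiments; it assumes at least two clusters, and the sample data are hypothetical.

```python
import math


def silhouette(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance).

    For each point: a = mean distance to the other points of its own cluster,
    b = smallest mean distance to the points of any other cluster,
    score = (b - a) / max(a, b). Requires at least two clusters.
    """
    def mean_distance(point, others):
        return sum(math.dist(point, o) for o in others) / len(others)

    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    scores = []
    for point, label in zip(points, labels):
        own = list(clusters[label])
        own.remove(point)  # exclude the point itself
        if not own:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        a = mean_distance(point, own)
        b = min(mean_distance(point, members)
                for other, members in clusters.items() if other != label)
        scores.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return sum(scores) / len(scores)


# Two well-separated hypothetical clusters on a line.
value = silhouette([(0.0,), (1.0,), (10.0,), (11.0,)], [0, 0, 1, 1])
```

Values close to 1 indicate compact, well-separated clusters, which matches the "high is good" note on the slide.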

Result for k = 6

Clusters #2, #5 and #6: good spatial separation (x coordinate)

Cluster #1: clear temporal separation

Clusters #4 and #6: temporal separation and focus

Cluster #3: poorly separable in all dimensions

(Figure: duration on a logarithmic axis from 0.1 to 100,000, and position, per cluster index)

Interpretation of the Clusters

#1: long duration (days) → breaks

#2: duration (seconds), scattered → production

#6: focused duration of ~10 seconds → tool change

#5: focused positions → setup?

#4: focused duration of ~12 minutes → unclear; time series shows contour movement while the spindle is stopped → measuring

#3: not focused → other

(Figure: duration on a logarithmic axis from 0.1 to 100,000, and position, per cluster index)

Relevance of the Cluster Analysis (Blackbox)

Production, breaks and tool change times are recognized well

Setup, maintenance and other are not yet sharply separable

Cluster #4 points to measuring operations (unexpected result)

Detailed analysis of cluster #4: 4 h measuring, 8 h setup

Plausible breakdown of the idle times by clustering

Profile                    Whitebox     Cluster   Blackbox
Production                 3.77 days    #2        3.59 days
Breaks                     2.93 days    #1        2.99 days
Setup time                 8.7 h        #5        0.18 h
Tool change                1.14 h       #6        1.14 h
Measuring and setup time   -            #4        12.36 h (4 h + 8 h)
Other                      1.41 h       #3        0.24 h

(Figure: duration on a logarithmic axis from 0.1 to 100,000 per cluster index)


Big Data Hands-On Platform

Fast end-to-end import, analysis and visualization

Disk failure prediction

What-if optimized combination of metrics

Machine Tool Anomaly Detection

Generalization from visualized torque time series

Production Idle Time Classification

k-means based blackbox model

Trend: Usage of algorithms, machine learning, AI …
