Microsoft on big data


Thursday, 28 May 2015

Before we start:

We are live on Meerkat today

Agenda

What is Big Data?

How it works and approaches

Microsoft architecture

Hadoop and MapReduce

The 3 Vs

Source: http://www.datasciencecentral.com/forum/topics/the-3vs-that-define-big-data

What is Big Data?

Why Big Data?

2008: Google processes 20 PB a day

2009: Facebook has 2.5 PB user data + 15 TB/day

2009: eBay has 6.5 PB user data + 50 TB/day

2011: Yahoo! has 180-200 PB of data

2012: Facebook ingests 500 TB/day

The next big data source

How it works and approaches

How to store data?

Data storage is not trivial

Data volumes are massive

Reliably storing PBs of data is challenging

Disk/hardware/network failures

Probability of failure event increases with number of machines

For example: 1000 hosts, each with 10 disks. If a disk lasts 3 years, how many failures per day?
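A back-of-the-envelope answer to the question above can be sketched as follows. It assumes that every disk fails exactly once per lifetime, that failures are spread evenly over time, and that a year has 365 days; none of these assumptions is stated on the slide.

```javascript
// Expected disk failures per day for a cluster, assuming each disk fails
// once per lifetime and failures are spread evenly over time.
function expectedFailuresPerDay(hosts, disksPerHost, diskLifetimeYears) {
  const totalDisks = hosts * disksPerHost;
  const lifetimeDays = diskLifetimeYears * 365; // assumption: 365-day years
  return totalDisks / lifetimeDays;
}

const perDay = expectedFailuresPerDay(1000, 10, 3);
console.log(perDay.toFixed(1)); // 9.1
```

Under these assumptions the cluster loses roughly nine disks every day, which is why the storage layer has to treat failure as the normal case.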

Historical basics

Hadoop is an open-source implementation based on GFS and MapReduce from Google

Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung (2003). The Google File System.

Jeffrey Dean and Sanjay Ghemawat (2004). MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004.

Classic Big Data architecture: Hadoop

Characteristics and Features

Distributed file system

Redundant storage

Designed to reliably store data using commodity hardware

Designed to expect hardware failures

Intended for large files

Designed for batch inserts

The Hadoop Distributed File System

HDFS - files and blocks

Files are stored as a collection of blocks

Blocks are 64 MB chunks of a file (configurable)

Blocks are replicated on 3 nodes (configurable)

The NameNode (NN) manages metadata about files and blocks

The SecondaryNameNode (SNN) periodically checkpoints the NN metadata (it is not a failover standby)

DataNodes (DN) store and serve blocks
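The block and replication figures above translate into simple arithmetic. The sketch below uses the defaults named on the slide (64 MB blocks, 3 replicas); the function name and the returned field names are made up for illustration and are not part of any Hadoop API.

```javascript
const MB = 1024 * 1024;

// Describe how a file of `fileSize` bytes is cut into blocks and how much
// raw disk space all replicas occupy together.
function blockLayout(fileSize, blockSize = 64 * MB, replication = 3) {
  if (fileSize === 0) {
    return { blockCount: 0, lastBlockBytes: 0, rawBytes: 0 };
  }
  const blockCount = Math.ceil(fileSize / blockSize);
  return {
    blockCount,
    // The last block only holds the remaining bytes, it is not padded.
    lastBlockBytes: fileSize - (blockCount - 1) * blockSize,
    rawBytes: fileSize * replication,
  };
}

const layout = blockLayout(1000 * MB);
console.log(layout.blockCount);           // 16
console.log(layout.lastBlockBytes / MB);  // 40
console.log(layout.rawBytes / MB);        // 3000
```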

Replication

Multiple copies of a block are stored

Replication strategy:

Copy #1 on another node on the same rack

Copy #2 on another node on a different rack
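The strategy as stated above can be sketched in a few lines. The cluster map, node names, and the function `placeReplicas` are hypothetical, and the sketch simply takes the first matching node; the placement policy of an actual HDFS release may differ in detail.

```javascript
// Hypothetical cluster description: node name -> rack name.
const cluster = {
  "node-a1": "rack-a",
  "node-a2": "rack-a",
  "node-b1": "rack-b",
  "node-b2": "rack-b",
};

// Choose three nodes for a block written from `writer`: the writer itself,
// another node on the same rack, and a node on a different rack.
function placeReplicas(cluster, writer) {
  const rack = cluster[writer];
  const nodes = Object.keys(cluster);
  const sameRack = nodes.find((n) => n !== writer && cluster[n] === rack);
  const otherRack = nodes.find((n) => cluster[n] !== rack);
  if (sameRack === undefined || otherRack === undefined) {
    throw new Error("cluster too small for this placement strategy");
  }
  return [writer, sameRack, otherRack];
}

const placement = placeReplicas(cluster, "node-a1");
console.log(placement.join(", ")); // node-a1, node-a2, node-b1
```

Keeping one copy on a different rack means a whole-rack outage (for example a failed switch) still leaves a readable replica.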

DataNode failure

DNs check in with the NN to report health

Upon failure NN orders DNs to replicate under-replicated blocks
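How the NameNode might react to a lost DataNode can be illustrated with a small planning function. The block map, node names, and `planReReplication` are hypothetical; the sketch handles a single failed node and adds at most one new replica per block.

```javascript
// Hypothetical block map as a NameNode might track it: block id -> DataNodes.
const blockMap = {
  blk_1: ["dn1", "dn2", "dn3"],
  blk_2: ["dn2", "dn3", "dn4"],
  blk_3: ["dn1", "dn3", "dn4"],
};

// For every block that lost a replica on `failedNode`, pick a surviving
// source and a live target node that does not hold the block yet.
function planReReplication(blockMap, liveNodes, failedNode, replication = 3) {
  const plan = [];
  for (const [blockId, holders] of Object.entries(blockMap)) {
    const survivors = holders.filter((dn) => dn !== failedNode);
    // Skip fully replicated blocks and blocks with no surviving copy.
    if (survivors.length >= replication || survivors.length === 0) continue;
    const target = liveNodes.find((dn) => !survivors.includes(dn));
    if (target !== undefined) {
      plan.push({ blockId, source: survivors[0], target });
    }
  }
  return plan;
}

const plan = planReReplication(blockMap, ["dn2", "dn3", "dn4", "dn5"], "dn1");
console.log(plan);
```

With `dn1` gone, `blk_1` and `blk_3` are under-replicated and each gets one copy order, while `blk_2` is untouched.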

Microsoft

HDInsight / Hadoop ecosystem

[Figure: HDInsight / Hadoop ecosystem diagram. Legible components: Distributed Storage (HDFS), Distributed Processing (MapReduce), Query (Hive), Scripting (Pig), Metadata (HCatalog), Machine Learning (Mahout), Graph (Pegasus), Stats processing (RHadoop), Active Directory (Security), Monitoring & Deployment (System Center), C#, F#, .NET, JavaScript, Azure Storage Vault (ASV), World's Data (Azure Data Marketplace). The remaining labels are truncated in the source.]

Legend: Red = Core Hadoop; Blue = Data processing; Purple = Microsoft integration points and value adds; Orange = Data movement; Green = Packages

How Hadoop works

Hadoop Distributed Architecture

So how does it work?

First, store the data

Second, take the processing to the data

// MapReduce functions in JavaScript (word count)

// map: split a line of text into words and emit (word, 1) for each one
var map = function (key, value, context) {
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};

// reduce: sum all counts emitted for one word
var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next(), 10);
    }
    context.write(key, sum);
};


MapReduce – Workflow
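The three phases of the workflow (map, shuffle, reduce) can be tried out locally with a tiny single-process stand-in for the runtime. This is only a sketch: `runJob` and the `context` objects are hypothetical, and `reduce` receives a plain array instead of the iterator a real runtime would pass in.

```javascript
// Word-count map and reduce, written against a minimal context.write API.
function map(key, value, context) {
  for (const word of value.split(/[^a-zA-Z]/)) {
    if (word !== "") context.write(word.toLowerCase(), 1);
  }
}

function reduce(key, values, context) {
  let sum = 0;
  for (const v of values) sum += v;
  context.write(key, sum);
}

// Single-process stand-in for the runtime: map, shuffle, reduce.
function runJob(lines) {
  // Map phase: collect every emitted (key, value) pair.
  const emitted = [];
  lines.forEach((line, i) =>
    map(i, line, { write: (k, v) => emitted.push([k, v]) })
  );

  // Shuffle phase: group all values by key.
  const groups = new Map();
  for (const [k, v] of emitted) {
    if (!groups.has(k)) groups.set(k, []);
    groups.get(k).push(v);
  }

  // Reduce phase: one call per key, in sorted key order.
  const result = {};
  for (const k of [...groups.keys()].sort()) {
    reduce(k, groups.get(k), { write: (key, sum) => (result[key] = sum) });
  }
  return result;
}

const counts = runJob(["Big data is big", "Data is data"]);
console.log(JSON.stringify(counts)); // {"big":2,"data":3,"is":2}
```

On a cluster the map calls run on the nodes that hold the input blocks and the shuffle moves data across the network; here everything happens in memory.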

Programming Models

Pig: data scripting language

Hive: SQL-like set-oriented language

Pegasus, Giraph: graph processing

Example: video streams

Meerkat API

Approach

Goal: distribution of streams over the day and by user

C# service: collect data

Persist to Azure

Preparation and analysis with Hive

Analysis in Excel
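The aggregation behind the stated goal can be sketched in plain JavaScript. The record shape (`user`, `startedAt`) and the sample values are hypothetical and are not taken from the Meerkat API; the grouping is roughly what a GROUP BY in the analysis step would compute.

```javascript
// Hypothetical stream records; field names and values are illustrative only.
const streams = [
  { user: "alice", startedAt: "2015-05-28T09:15:00Z" },
  { user: "bob", startedAt: "2015-05-28T09:40:00Z" },
  { user: "alice", startedAt: "2015-05-28T18:05:00Z" },
];

// Count records per key, where keyOf derives the grouping key of a record.
function countBy(records, keyOf) {
  const counts = {};
  for (const r of records) {
    const k = keyOf(r);
    counts[k] = (counts[k] || 0) + 1;
  }
  return counts;
}

// Streams per hour of day (UTC) and streams per user.
const perHour = countBy(streams, (s) => new Date(s.startedAt).getUTCHours());
const perUser = countBy(streams, (s) => s.user);

console.log(perHour); // hour 9 -> 2 streams, hour 18 -> 1 stream
console.log(perUser); // alice -> 2 streams, bob -> 1 stream
```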

Expected result

Further examples

Example: social media analysis

Evaluating social networks

• Study of media consumption behavior
• Quantitative statistical evaluation of communication content
• Detection of trends, influencers, and competitor activity
• Use of Facebook, Twitter, and other social networks as data sources
• High data growth
• Semi-structured data formats
• Frequent changes to the data structures

Source: Facebook Graph API

Analysis of the results with Excel

Custom MapReduce tasks

Example: free-text analysis

Text analysis of meeting minutes

• Discovery of structures of meaning in unstructured or weakly structured text data
• Fast identification of the key information in the processed texts
• Detection of previously unknown relationships
• Generating, testing, and step-by-step refining of hypotheses
• Extraction of attitudes toward a topic using semantic algorithms
• High data growth

Source: Bundestag plenary minutes (Plenarprotokolle)

Processing the data with Hadoop

Analysis of the results with Excel

DocumentDB

What is Azure DocumentDB?

It is a fully managed, highly scalable, queryable, schema-free document database, delivered as a service, for modern applications.

Query against schema-free JSON

Multi-document transactions

Tunable, high performance

Designed for cloud first

Azure DocumentDB Resources

Source: http://azure.microsoft.com/en-us/documentation/articles/documentdb-introduction/

DocumentDB data model

Management in Azure

Presentation as a web page

Traditional RDBMS vs. MapReduce

             Traditional RDBMS          MapReduce
Data size    Gigabytes (Terabytes)      Petabytes (Exabytes)
Access       Interactive and batch      Batch
Updates      Read / write many times    Write once, read many times
Structure    Static schema              Dynamic schema
Integrity    High (ACID)                Low
Scaling      Nonlinear                  Linear
DBA ratio    1:40                       1:3000

Reference: Tom White’s Hadoop: The Definitive Guide

Do I really need Hadoop?

[Figure: technologies placed along two axes, Velocity (Batch to Realtime) and Variety (Highly Structured to Poly Structured). Labels: Generalized NoSQL, Hadoop, Standard SQL or MPP Appliances, Specialized NoSQL, Streaming, In-Memory Analytics.]

Outlook: data management processes

Goal: combine the Big Data pipeline

Controlling and administering services

Product: Azure Data Factory

[Figure: Azure Data Factory example. Labels: Azure Blob Storage, On Premises, Data Mart, Call Log Files, Customer Table, Azure DB, Customer Churn Table, Visualize.]

Data Factory Concepts

Data set: a collection of files, a DB table, etc.

Activity: a processing step (Hadoop job, custom code, ML model, etc.)

Pipeline: a sequence of activities (logical group)
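The three concepts can be modeled in a few lines to show how they relate. This is a conceptual sketch only and not the Azure Data Factory API; `runPipeline`, the activity names, and the sample call records are hypothetical.

```javascript
// Run a pipeline: feed each activity's output data set into the next one.
function runPipeline(pipeline, inputDataSet) {
  return pipeline.activities.reduce(
    (dataSet, activity) => activity.run(dataSet),
    inputDataSet
  );
}

// Hypothetical pipeline; activity names and logic are illustrative only.
const churnPipeline = {
  name: "customer-churn",
  activities: [
    {
      name: "drop-short-calls",
      run: (rows) => rows.filter((r) => r.minutes >= 1),
    },
    {
      name: "count-calls-per-customer",
      run: (rows) => {
        const counts = {};
        for (const r of rows) counts[r.customer] = (counts[r.customer] || 0) + 1;
        return counts;
      },
    },
  ],
};

// Hypothetical input data set: a small collection of call records.
const callLog = [
  { customer: "c1", minutes: 5 },
  { customer: "c1", minutes: 0.5 },
  { customer: "c2", minutes: 3 },
];

const callsPerCustomer = runPipeline(churnPipeline, callLog);
console.log(JSON.stringify(callsPerCustomer)); // {"c1":1,"c2":1}
```

Each activity only sees a data set and returns a data set, which is what allows activities of different kinds to be chained into one pipeline.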

[Figure: pipeline stages from the data sources: Ingest, Transform & Analyze, Publish. Labels: Customer Call Details; Customers Likely to Churn; Transform, Combine, etc.; Analyze; Move.]

Summary

Data analysis is changing

Weigh up the technologies (JSON in Integration Services)

Data analysts are not redundant

The toolset has to expand

A cool lecture course for going further: http://blogs.ischool.berkeley.edu/i290-abdt-s12/

Thank you very much!
