12
Cloud based Data Lake - modern Data Platform Guido Jacobs Cloud Solution Architect – Big Data/ Data & AI [email protected] Microsoft Deutschland GmbH

Cloud based Data Lake - modern Data Platform Management Azure Data Explorer Azure Storage Event Hub IoT Hub Customer Data Lake Kafka Sync Logstash Plugin Event Grid Azure Portal Power

  • Upload
    others

  • View
    13

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Cloud based Data Lake - modern Data Platform Management Azure Data Explorer Azure Storage Event Hub IoT Hub Customer Data Lake Kafka Sync Logstash Plugin Event Grid Azure Portal Power

Cloud based Data Lake -modern Data Platform

Guido Jacobs

Cloud Solution Architect – Big Data/ Data & AI

[email protected]

Microsoft Deutschland GmbH

Page 2: Cloud based Data Lake - modern Data Platform Management Azure Data Explorer Azure Storage Event Hub IoT Hub Customer Data Lake Kafka Sync Logstash Plugin Event Grid Azure Portal Power

Cloud based Data Lake Platform

CONTROL EASE OF USE

Azure Data Lake Store

Azure Storage

Any Hadoop technology, any distribution

Workload optimized, managed clusters

Complete CI/ CD Service forML/ AI Projects

Azure MarketplaceHDP | CDH

Azure ML Services

IaaS Clusters Managed Clusters ML/AI as-a-Service

Azure HDInsight

Frictionless & Optimized Spark clusters

Azure Databricks

BIG

DA

TA

S

TO

RA

GE

BIG

DA

TA

A

NA

LYT

ICS

Re

du

ced

Ad

min

istr

atio

n

Page 3: Cloud based Data Lake - modern Data Platform Management Azure Data Explorer Azure Storage Event Hub IoT Hub Customer Data Lake Kafka Sync Logstash Plugin Event Grid Azure Portal Power

Quelle: https://www.blue-granite.com/data-lakes-in-a-modern-data-architecture-ebook

Page 4: Cloud based Data Lake - modern Data Platform Management Azure Data Explorer Azure Storage Event Hub IoT Hub Customer Data Lake Kafka Sync Logstash Plugin Event Grid Azure Portal Power

Cloud-based Data Lake „in action“

Page 5: Cloud based Data Lake - modern Data Platform Management Azure Data Explorer Azure Storage Event Hub IoT Hub Customer Data Lake Kafka Sync Logstash Plugin Event Grid Azure Portal Power

On-premise HDP Cluster

AzureBig Data Storage/

Azure Data Lake Store

Optimized for MPP based Analytical Workloads

Authorized Accessby Azure AD

Access with Oauth2:• adl:// (gen1)• abfss:// (gen2)

No upfront costNo pre-allocationPay-for-stored-Data

Shared Meta-Data

RANGER

HIVE

OOZIE

HN HN

WN WN …

RAM & CPU are configured to fulfill the workload requirements

WN

HDInsight ClusterType: HIVE LLAP

HN HN

WN WN …

RAM & CPU are configured to fulfill the workload requirements

WN

HDInsight ClusterType: Spark

DN

WN WN …

RAM & CPU are configured to fulfill the workload requirements

WN

Databricks ClusterType: Spark

HN HN

WN WN …WN

HDP (IaaS)Type: Cloudbreak

HN

HDFS (only for temp & spilling data)

Edge-Node

Synchronisation on File-level

Page 6: Cloud based Data Lake - modern Data Platform Management Azure Data Explorer Azure Storage Event Hub IoT Hub Customer Data Lake Kafka Sync Logstash Plugin Event Grid Azure Portal Power
Page 7: Cloud based Data Lake - modern Data Platform Management Azure Data Explorer Azure Storage Event Hub IoT Hub Customer Data Lake Kafka Sync Logstash Plugin Event Grid Azure Portal Power

What is Azure Databricks?A fast, easy and collaborative Apache® Spark™ based analytics platform optimized for Azure

Best of Databricks Best of Microsoft

Designed in collaboration with the founders of Apache Spark

One-click set up; streamlined workflows

Interactive workspace that enables collaboration between data scientists, data engineers, and business analysts.

Native integration with Azure services (Power BI, SQL DW, Cosmos DB, Blob Storage)

Enterprise-grade Azure security (Active Directory integration, compliance, enterprise -grade SLAs)

Page 8: Cloud based Data Lake - modern Data Platform Management Azure Data Explorer Azure Storage Event Hub IoT Hub Customer Data Lake Kafka Sync Logstash Plugin Event Grid Azure Portal Power

Optimized Databricks Runtime Engine

DATABRICKS I/O SERVERLESS

Collaborative Workspace

Cloud storage

Data warehouses

Hadoop storage

IoT / streaming data

Rest APIs

Machine learning models

BI tools

Data exports

Data warehouses

Azure Databricks

Enhance Productivity

Deploy Production Jobs & Workflows

APACHE SPARK

MULTI-STAGE PIPELINES

DATA ENGINEER

JOB SCHEDULER NOTIFICATION & LOGS

DATA SCIENTIST BUSINESS ANALYST

Build on secure & trusted cloud Scale without limits

Azure Databricks

Page 9: Cloud based Data Lake - modern Data Platform Management Azure Data Explorer Azure Storage Event Hub IoT Hub Customer Data Lake Kafka Sync Logstash Plugin Event Grid Azure Portal Power
Page 10: Cloud based Data Lake - modern Data Platform Management Azure Data Explorer Azure Storage Event Hub IoT Hub Customer Data Lake Kafka Sync Logstash Plugin Event Grid Azure Portal Power

Azure Data Explorer Übersicht

1. Unterstützung vieler Dateiformate und

Strukturen

Strukturiert (Zahlenbasiert), Semi-Strukturiert

(JSON\XML) und Frei-Text

2. Batch oder Streaming ingestion

Nutzung von „managed“ Ingestion Pipelines

oder automatisierte „Pull Ingestion“

3. Trennung von Compute und Storage

• Flexible Skallierung

• Langfriste Speicherung in Azure Blob Storage

• „Low-latency“ Caching auf den Compute

Ressourcen

4. Verschiedene Standard Schnittstellen für den

Zugriff

Out-of-the box Tools und Schnittstellen

oder mit Hilfe von APIs/SDKs frei definierbar

Data Lake/ Blob

IoT

Ingested Data

EngineData

Management

Azure Data Explorer

Azure Storage

Event Hub

IoT Hub

Customer Data

Lake

Kafka Sync

Logstash Plugin

Event Grid

Azure Portal

Power BI

ADX Web UI

ODBC / JDBC Apps

Apps (Via API)

Logstash Plugin

Apps (Via API)

Create,

Manage

Stream

Batch

Grafana

Query,

Control Commands

Azure OSS Applications

Active Data

Connections

Page 11: Cloud based Data Lake - modern Data Platform Management Azure Data Explorer Azure Storage Event Hub IoT Hub Customer Data Lake Kafka Sync Logstash Plugin Event Grid Azure Portal Power

Intuitive Abfrage Leistungsfähig und simple in der Anwendung

• Umfangreiche Querysprache (Filter, Aggregate, Joins,

Calculated columns, …)

• Built-in Full-text Search, Time Series, User Analytics,

und Machine Learning Operatoren

• Out-of-the box Visualisierungs-Möglichkeiten

• Easy-to-use syntax + Microsoft IntelliSense

Übergreifend Einsetzbar

• Optimiert für Analysen über strukturierte, semi-

strukturierte und unstrukturierte Daten zeitgleich

Flexible Erweiterbar

• In-line Python

• SQL

Page 12: Cloud based Data Lake - modern Data Platform Management Azure Data Explorer Azure Storage Event Hub IoT Hub Customer Data Lake Kafka Sync Logstash Plugin Event Grid Azure Portal Power

© 2016 Microsoft Corporation. All rights reserved.