
Copyright © 2018 MicroStrategy Incorporated. All Rights Reserved.

MicroStrategy Hadoop Gateway

Comparing this native gateway with other big data connectors

Benjamin Reyes, Product Management, Data



Agenda

Comparing this native gateway with other big data connectors

• What is the MicroStrategy Hadoop Gateway?

• Benefits of using a native connector vs. other types of connectors

• The MicroStrategy Hadoop Gateway architecture

• In-memory vs. Live Connect datasets

• Filtering, aggregating and wrangling data

• How to install and configure the MicroStrategy Hadoop Gateway

• How to secure the MicroStrategy Hadoop Gateway with Kerberos authentication

• Real-world examples

• Q&A


What is the MicroStrategy Hadoop Gateway?

High-performance native access to data in Hadoop

At a high level:

• A high-performance, native gateway for querying and processing data stored in the Hadoop Distributed File System (HDFS)

• A Spark-based distributed data processing engine that runs directly on the Hadoop cluster.

• Enables parallel data transfer from the Hadoop nodes directly to the Intelligence Server, thus achieving much higher throughput than via SQL-on-Hadoop with ODBC

• Data processing tasks for data wrangling are distributed to the nodes of the Hadoop cluster, instead of being performed on the Intelligence Server.
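The parallel-transfer idea in the bullets above can be sketched in a few lines of Python. This is only an illustration, not the gateway's implementation: `fetch_partition`, the node names and the in-memory `PARTITIONS` data are hypothetical stand-ins, and threads stand in for network streams from the worker nodes.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each worker node holds one partition of the rows.
PARTITIONS = {
    "worker-1": [1, 2, 3],
    "worker-2": [4, 5],
    "worker-3": [6, 7, 8, 9],
}

def fetch_partition(node):
    """Stand-in for pulling one partition directly from a worker node."""
    return PARTITIONS[node]

def fetch_serial(nodes):
    """Single-connection pattern: partitions arrive one after another."""
    rows = []
    for node in nodes:
        rows.extend(fetch_partition(node))
    return rows

def fetch_parallel(nodes):
    """One concurrent stream per node; results are merged on the receiver."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        chunks = list(pool.map(fetch_partition, nodes))  # map keeps node order
    return [row for chunk in chunks for row in chunk]

nodes = sorted(PARTITIONS)
serial_rows = fetch_serial(nodes)
parallel_rows = fetch_parallel(nodes)
```

Both functions return the same rows; the difference is that the parallel version does not wait for one node to finish before asking the next, which is where the throughput gain would come from on a real cluster.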


MicroStrategy on Hadoop

Two approaches to analytics on Hadoop

ODBC/JDBC

• SQL-based access for reporting and dashboarding

• Leverage Project Schema to build models on top of Hadoop or use Data Import to create in-memory or live-connect datasets.

• Build reports, documents and dashboards via live-connect or in-memory datasets

• Preferred method if requirements include:
  • Leverage Hadoop layer security at runtime
  • Project schema is required

Hadoop Gateway

• High-performance, parallelized native access to Hadoop

• Uses Data Import functionality to publish in-memory datasets. Since 10.9, users can create live-connect datasets to access more detail data on the source.

• Build reports, documents and dashboards via live-connect or in-memory datasets

• Preferred method if requirements include:
  • Data wrangling on Spark
  • Browse and preview Hadoop files via the Data Import interface


Supported Data File Formats


Data file formats

Import files directly from Hadoop Distributed File System

• Avro: row-oriented

• Parquet: column-oriented

• ORC: Optimized Row Columnar

• CSV: text

• JSON
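The "row-oriented" and "column-oriented" labels above describe how records are laid out on disk. The following sketch illustrates the difference with plain Python structures. It does not use the real Avro or Parquet encodings, and the sample records are invented for illustration.

```python
# Three hypothetical records, used only for illustration.
records = [
    {"id": 1, "region": "East", "amount": 10.0},
    {"id": 2, "region": "West", "amount": 25.5},
    {"id": 3, "region": "East", "amount": 4.5},
]

def to_row_layout(rows):
    """Row-oriented: all fields of one record are stored together."""
    return [tuple(r.values()) for r in rows]

def to_column_layout(rows):
    """Column-oriented: all values of one field are stored together."""
    return {name: [r[name] for r in rows] for name in rows[0]}

row_layout = to_row_layout(records)
column_layout = to_column_layout(records)

# Summing one field touches a single list in the column layout,
# but every record in the row layout.
total_from_columns = sum(column_layout["amount"])
total_from_rows = sum(row[2] for row in row_layout)
```

Analytical queries that read a few columns out of many tend to favor the column layout, while workloads that read or write whole records tend to favor the row layout.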


Hadoop Gateway Architecture

Built from the ground up for speed and scale

Browse files and preview data

[Diagram: MicroStrategy Intelligence Server, MicroStrategy Hadoop Gateway, and a Hadoop cluster with a YARN Resource Manager, a Name Node and four Worker Nodes]


Hadoop Gateway Architecture

Requests are distributed to the corresponding nodes

[Diagram: Hadoop cluster with a Name Node and four Worker Nodes]


Hadoop Gateway Architecture

Parallel data transfer

[Diagram: Hadoop cluster with a Name Node and four Worker Nodes]


In-memory vs. Live Connect Datasets

Response time vs. data volume

In-memory

• All the data is transferred to the Intelligence Server in order to populate a dataset in memory

• The amount of data that can be in the dataset is limited by the amount of memory on the Intelligence Server

• Big Data scenarios commonly require aggregating or filtering data to limit the volume brought into memory:
  • Wrangle
  • Aggregate
  • Filter


[Diagram: HDFS, in-memory dataset, Interactive Dossier]
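A back-of-the-envelope check shows why aggregation or filtering is usually needed before publishing an in-memory dataset. The sketch below is a rough estimate under stated assumptions: the bytes per row, the memory budget and the store count are invented figures, and real in-memory cubes are typically compressed, so actual sizes will differ.

```python
def estimated_size_gb(rows, bytes_per_row):
    """Rough uncompressed size in GB (1 GB = 1024**3 bytes)."""
    return rows * bytes_per_row / 1024**3

def fits_in_memory(rows, bytes_per_row, memory_budget_gb):
    """True if the rough estimate stays within the memory budget."""
    return estimated_size_gb(rows, bytes_per_row) <= memory_budget_gb

# Assumed figures, for illustration only.
rows_per_day = 12_000_000   # order of magnitude quoted in the customer examples
bytes_per_row = 200         # assumption
budget_gb = 64              # assumption: memory set aside for this dataset

detail_rows = rows_per_day * 365   # one year of transaction-level data
aggregated_rows = 365 * 1_000      # one row per day per store, 1,000 stores (assumption)

detail_fits = fits_in_memory(detail_rows, bytes_per_row, budget_gb)
aggregated_fits = fits_in_memory(aggregated_rows, bytes_per_row, budget_gb)
```

With these assumed numbers the detail data is estimated at roughly 800 GB and does not fit, while the daily aggregate is well under 1 GB.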


In-memory vs. Live Connect Datasets

Response time vs. data volume

Live Connect

• Supported since version 10.9, Live Connect lets datasets query data live from the source

• Enables access to the full breadth of detail data on the source vs. only aggregated or filtered data

• Implies a trade-off of response time vs. breadth of detail data

All interactive queries are executed live on the source


[Diagram: Interactive Dossier, live-connect dataset, HDFS on the Hadoop cluster]


Wrangle, Aggregate and Filter


Data Wrangling

Create extracts of data for fast in-memory analysis

• Lets users transform and refine their data for analytics and visualizations without relying on IT

• Wrangling functions are performed natively at the source, distributed on each HDFS node

• There are 30+ wrangling operations available for data preparation

• All wrangling steps can be saved as a script, so they can be reapplied whenever the dataset is updated with new data
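The idea of saving wrangling steps as a replayable script can be sketched as follows. The script format, the operation names (`trim`, `uppercase`, `fill_empty`) and the sample rows are hypothetical and chosen only for illustration; they are not the product's actual wrangling operations or script syntax.

```python
# Each step is (operation name, column, optional argument); names are hypothetical.
SCRIPT = [
    ("trim", "city", None),
    ("uppercase", "city", None),
    ("fill_empty", "country", "UNKNOWN"),
]

OPERATIONS = {
    "trim": lambda value, arg: value.strip(),
    "uppercase": lambda value, arg: value.upper(),
    "fill_empty": lambda value, arg: value if value else arg,
}

def apply_script(rows, script):
    """Return new rows with every scripted step applied in order."""
    result = []
    for row in rows:
        row = dict(row)  # copy so the input is left untouched
        for op, column, arg in script:
            row[column] = OPERATIONS[op](row[column], arg)
        result.append(row)
    return result

first_load = [{"city": " vienna ", "country": ""}]
later_load = [{"city": "Lyon  ", "country": "FR"}]

cleaned_first = apply_script(first_load, SCRIPT)
cleaned_later = apply_script(later_load, SCRIPT)  # same script, new data
```

Because the steps are data rather than one-off manual edits, the same preparation runs unchanged on each refresh.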




Aggregation


• Users can aggregate data from the source files directly on the Hadoop cluster nodes at scale, without moving data to the Intelligence Server. Examples:
  • Basic
  • Date and time
  • Math

• By aggregating data, users reduce the data volume to an amount appropriate for in-memory cubes.

• Separate datasets can be created for fast in-memory analytics and for detail data queries.




Filtering


• Users can also define filters to limit the number of rows to be brought into the system without compromising on the granularity of detail.

• Both aggregation and filtering expressions are pushed down to the cluster nodes to leverage the advantages of Spark distributed computing performance.
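Pushing filters and aggregations down to the nodes can be pictured as a two-stage computation: each node filters and pre-aggregates its own partition, and only the small partial results are merged on the receiving side. The sketch below models this in plain Python rather than Spark; the partitions, the `(store, amount)` row shape and the function names are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical partitions of (store, amount) rows, one list per cluster node.
partitions = [
    [("A", 5), ("B", 20), ("A", 30)],
    [("B", 15), ("C", 8)],
    [("A", 12), ("C", 40)],
]

def process_partition(rows, min_amount):
    """Runs where the data lives: filter rows, then pre-aggregate by store."""
    partial = Counter()
    for store, amount in rows:
        if amount >= min_amount:      # filter pushed down to the node
            partial[store] += amount  # partial aggregate on the node
    return partial

def merge_partials(partials):
    """Runs on the receiving side: combine the small per-node results."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds the values per key
    return dict(total)

partials = [process_partition(p, min_amount=10) for p in partitions]
totals = merge_partials(partials)
rows_scanned = sum(len(p) for p in partitions)
rows_transferred = sum(len(p) for p in partials)
```

This two-stage pattern works directly for sums and counts; an aggregate such as an average would need each node to return both a sum and a count so that the merge step stays correct.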


Demo: MicroStrategy Hadoop Gateway

• Browse files

• Wrangle data

• Publish in-memory dataset

• Blend with existing dataset





Demo: Aggregation and Filtering

• Browse files

• Aggregation

• Filtering

• Publish in-memory dataset





Installation and Deployment


Deploy the gateway automatically from MicroStrategy Web


Use the gateway manager in MicroStrategy Web to create, modify, delete, deploy, undeploy, start and stop the Hadoop Gateway remotely




Configuration and automatic deployment demo

Hadoop Properties:
  • Hadoop NameNode: FQDN or IP
  • HDFS Port: used to browse files (default 8020)
  • WebHDFS: used to preview files (default 50070)

Gateway Properties:
  • Host: machine on which to install the Gateway
  • Port: Intelligence Server to Hadoop Gateway (default 30004)

Spark Properties:
  • YARN:
  • Jar: path of the Spark assembly
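The properties above can be collected into a small configuration object and checked before deployment. The following is a sketch only: the `GatewayConfig` class, its field names and the host and path values are hypothetical, and only the three default port numbers come from the slide.

```python
from dataclasses import dataclass

@dataclass
class GatewayConfig:
    """Hypothetical container for the properties listed above."""
    namenode: str                 # FQDN or IP of the Hadoop NameNode
    gateway_host: str             # machine on which the gateway is installed
    spark_jar: str                # path of the Spark assembly JAR
    hdfs_port: int = 8020         # default from the slide: browse files
    webhdfs_port: int = 50070     # default from the slide: preview files
    gateway_port: int = 30004     # default from the slide: I-Server to gateway

    def validate(self):
        """Return a list of problems; an empty list means the values look sane."""
        problems = []
        for name in ("namenode", "gateway_host", "spark_jar"):
            if not getattr(self, name).strip():
                problems.append(f"{name} must not be empty")
        for name in ("hdfs_port", "webhdfs_port", "gateway_port"):
            port = getattr(self, name)
            if not 1 <= port <= 65535:
                problems.append(f"{name} out of range: {port}")
        return problems

config = GatewayConfig(
    namenode="namenode.example.com",          # placeholder host
    gateway_host="gateway.example.com",       # placeholder host
    spark_jar="/path/to/spark-assembly.jar",  # placeholder path
)
bad = GatewayConfig(namenode="", gateway_host="gw", spark_jar="x", hdfs_port=0)
```

Calling `config.validate()` returns an empty list, while `bad.validate()` reports the empty NameNode and the out-of-range HDFS port.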





Manual installation and deployment

• Automatic deployment remotely installs and deploys the gateway on the cluster node, which requires a user with root privileges.

• In some cases, Hadoop administrators prefer to install the components manually using their own tools to manage the application.

• Refer to the product documentation for step-by-step instructions for manual installation and deployment commands.

• Also refer to the Hadoop Gateway FAQ on the MicroStrategy Community portal for more details.


Securing the Hadoop Gateway

Kerberos support

Authentication

• Support for Kerberos authentication: MIT Kerberos and Active Directory (LDAP)

• Support for Secure Sockets Layer (SSL) encryption

Authorization

• Integration with Ranger policies (Hortonworks)

• Integration with Sentry policies (Cloudera)

• The established policies are applied, enforcing user-level authorization.

[Diagram: user credentials, security policies (Sentry for Cloudera, Ranger for Hortonworks), data]



Customer Stories: Big Data Validation Program

Performance, Agility, Security


Hadoop Gateway Customer Validation

The Hadoop Gateway has been validated with some of the largest MicroStrategy customers

Performance: validated at one of the largest multi-media companies

• Looking to publish cubes from a rapidly growing set of viewership data

• Big Data ODBC connections unable to publish the cubes fast enough

• Took more than 6 hours to publish a cube via Hive

• Took less than an hour with Teradata

• Hadoop Gateway published in less than an hour

Agility: validated at one of the largest retailers

• Transaction-level data (12M+ rows/day) loaded into Parquet and Avro files

• Looking to give end users direct access to HDFS

• Previously needed to wait for files to load into Hive tables

• Hadoop Gateway outperformed Hive on Spark and Impala via ODBC drivers

• Data wrangling optimized with Hadoop Gateway

Security: validated at one of the largest financial organizations

• Looking to directly access secure data and publish cubes

• MIT Kerberos had been enabled on the cluster

• Secure Sockets Layer (SSL) encryption enabled

• Cluster had been enabled for High Availability


Hadoop Gateway vs. ODBC

Hadoop Gateway Customer Validation

Hadoop Gateway performs well vs. ODBC

• The Hadoop Gateway was directly compared to a large relational database at one of the largest digital media companies in the world

• While publishing these cubes, the Hadoop Gateway outperformed this database




















Hadoop Gateway vs. ODBC with Data Wrangling

Hadoop Gateway Customer Validation

Hadoop Gateway performs well vs. ODBC

• The Hadoop Gateway was directly compared to Hive on Spark at one of the largest retailers in the world

• While publishing these cubes, the Hadoop Gateway outperformed Hive on Spark

• Data wrangling functions have been integrated with the Hadoop Gateway to reduce data movement and leverage the processing capacity of the Hadoop cluster




















Q&A


Thank you
