Copyright © 2018 MicroStrategy Incorporated. All Rights Reserved.
MicroStrategy Hadoop Gateway
Comparing this native gateway with other big data connectors
Benjamin Reyes, Product Management, Data
Agenda
Comparing this native gateway with other big data connectors
• What is the MicroStrategy Hadoop Gateway?
• Benefits of using a native connector vs. other types of connectors
• The MicroStrategy Hadoop Gateway architecture
• In-memory vs. Live Connect datasets
• Filtering, aggregating and wrangling data
• How to install and configure the MicroStrategy Hadoop Gateway
• How to secure the MicroStrategy Hadoop Gateway with Kerberos authentication
• Real-world examples
• Q&A
What is the MicroStrategy Hadoop Gateway?
At a high level:
High-performance native access to data in Hadoop
• A high-performance, native gateway for querying and processing data stored in the Hadoop Distributed File System (HDFS)
• A Spark-based distributed data processing engine that runs directly on the Hadoop cluster.
• Enables parallel data transfer from the Hadoop nodes directly to the Intelligence Server, achieving much higher throughput than SQL-on-Hadoop access over ODBC
• Data processing tasks for data wrangling are distributed to the nodes of the Hadoop cluster, instead of being performed on the Intelligence Server.
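The parallel transfer described in the bullets above can be pictured with a minimal Python sketch. It is not the gateway's implementation: the worker names, the `fetch_partition` helper and the simulated delay are all assumptions made for illustration. It contrasts pulling partitions one after another, as a single connection would, with pulling them from every node at once.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical worker nodes, each holding one partition of the data.
PARTITIONS = {
    "worker-1": [1, 2, 3],
    "worker-2": [4, 5, 6],
    "worker-3": [7, 8, 9],
    "worker-4": [10, 11, 12],
}

def fetch_partition(node, delay=0.05):
    """Simulate transferring one node's partition (the delay stands in for I/O)."""
    time.sleep(delay)
    return PARTITIONS[node]

def fetch_serial(nodes):
    """Single-connection style: partitions arrive one after another."""
    rows = []
    for node in nodes:
        rows.extend(fetch_partition(node))
    return rows

def fetch_parallel(nodes):
    """Parallel style: every node sends its partition at the same time."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        parts = list(pool.map(fetch_partition, nodes))  # keeps node order
    return [row for part in parts for row in part]

if __name__ == "__main__":
    nodes = list(PARTITIONS)
    for fetch in (fetch_serial, fetch_parallel):
        start = time.perf_counter()
        rows = fetch(nodes)
        elapsed = time.perf_counter() - start
        print(fetch.__name__, len(rows), "rows in", round(elapsed, 2), "s")
```

Both functions return the same rows. In this toy setup the parallel version waits for roughly one simulated transfer instead of four, which is the effect the slide attributes to node-to-server parallelism.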
MicroStrategy on Hadoop
Two approaches to analytics on Hadoop
ODBC/JDBC
• SQL-based access for reporting and dashboarding
• Leverage the project schema to build models on top of Hadoop, or use Data Import to create in-memory or live-connect datasets
• Build reports, documents and dashboards via live-connect or in-memory datasets
• Preferred method if requirements include:
  • Leveraging Hadoop-layer security at runtime
  • A project schema
Hadoop Gateway
• High-performance, parallelized native access to Hadoop
• Uses Data Import functionality to publish in-memory datasets. Since 10.9, users can also create live-connect datasets to access more detail data on the source
• Build reports, documents and dashboards via live-connect or in-memory datasets
• Preferred method if requirements include:
  • Data wrangling on Spark
  • Browsing and previewing Hadoop files via the Data Import interface
Supported Data File Formats
Import files directly from Hadoop Distributed File System
• Avro: row-oriented
• Parquet: column-oriented
• ORC: Optimized Row Columnar
• CSV: text
• JSON
Hadoop Gateway Architecture
Built from the ground up for speed and scale
[Diagram components: MicroStrategy Intelligence Server, MicroStrategy Hadoop Gateway, YARN Resource Manager, Name Node, four Worker Nodes, Hadoop Cluster. The same diagram is shown in three steps:]
• Browse files and preview data
• Requests are distributed to the corresponding nodes
• Parallel data transfer
In-memory vs. Live Connect Datasets
Response time vs. data volume
In-memory
• All the data is transferred to the Intelligence Server to populate a dataset in memory
• The amount of data the dataset can hold is limited by the memory available on the Intelligence Server
• Big Data scenarios commonly require aggregating or filtering data to limit what is brought into memory
[Diagram components: HDFS, MicroStrategy Hadoop Gateway, MicroStrategy Intelligence Server, in-memory dataset, interactive dossier. Processing steps shown: Wrangle, Aggregate, Filter]
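The memory limit on in-memory datasets can be checked with back-of-the-envelope arithmetic. The sketch below is an illustration only: the row counts, the bytes-per-row figure, the server memory and the 50% headroom are all made-up assumptions, and real in-memory dataset sizes depend on data types and compression.

```python
def estimated_size_bytes(rows, bytes_per_row):
    """Rough, uncompressed size estimate of a dataset."""
    return rows * bytes_per_row

def fits_in_memory(rows, bytes_per_row, available_bytes, headroom=0.5):
    """True if the estimate uses at most `headroom` of the available memory."""
    return estimated_size_bytes(rows, bytes_per_row) <= available_bytes * headroom

if __name__ == "__main__":
    server_memory = 128 * 1024**3  # hypothetical 128 GiB Intelligence Server
    candidates = {
        "detail data": 2_000_000_000,  # hypothetical row counts
        "aggregated": 50_000_000,
    }
    for name, rows in candidates.items():
        verdict = "in-memory" if fits_in_memory(rows, 100, server_memory) else "aggregate, filter or go live"
        print(name, "->", verdict)
```

With these assumed numbers the detail data (about 200 GB) does not fit, while the aggregated extract (about 5 GB) does.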
Live Connect
• Supported since version 10.9, Live Connect lets datasets query data live from the source
• Enables access to the full breadth of detail data on the source, rather than only aggregated or filtered data
• Implies a trade-off between response time and breadth of detail data
All interactive queries are executed live on the source
[Diagram components: HDFS, Hadoop Cluster, MicroStrategy Hadoop Gateway, MicroStrategy Intelligence Server, live-connect dataset, interactive dossier]
Wrangle, Aggregate and Filter
Data Wrangling
Create extracts of data for fast in-memory analysis
• Lets users transform and refine their data for analytics and visualizations without relying on IT
• Wrangling functions are performed natively at the source, distributed on each HDFS node
• There are 30+ wrangling operations available for data preparation
• All wrangling steps can be saved as a script, so they can be reapplied whenever the dataset is updated with new data
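The idea of saving wrangling steps as a script and replaying them on new data can be sketched in Python. This is only an illustration of the concept: the step format, the operation names and `apply_script` are assumptions, not the gateway's script format or its actual wrangling operations.

```python
# Each recorded step is (operation name, column, argument); names are illustrative.
SCRIPT = [
    ("trim", "city", None),
    ("uppercase", "city", None),
    ("fill_empty", "country", "UNKNOWN"),
]

OPERATIONS = {
    "trim": lambda value, arg: value.strip(),
    "uppercase": lambda value, arg: value.upper(),
    "fill_empty": lambda value, arg: value if value else arg,
}

def apply_script(rows, script):
    """Replay every recorded step, in order, on a fresh batch of rows."""
    result = []
    for row in rows:
        cleaned = dict(row)  # copy so the input batch is left untouched
        for op, column, arg in script:
            cleaned[column] = OPERATIONS[op](cleaned[column], arg)
        result.append(cleaned)
    return result

if __name__ == "__main__":
    new_batch = [{"city": " paris ", "country": ""}]
    print(apply_script(new_batch, SCRIPT))
```

Because the script is plain data, the same list of steps can be applied unchanged each time a new batch arrives.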
Aggregation
• Users can aggregate data from the source files directly on the Hadoop cluster nodes at scale, without moving the data to the Intelligence Server. Examples:
  • Basic
  • Date and time
  • Math
• By aggregating data, users reduce the data volume to an amount appropriate for in-memory cubes.
• Separate datasets can be created for fast in-memory analytics and for detail data queries.
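Aggregating on the cluster nodes before any data moves can be illustrated with a small two-stage sketch: each node reduces its own rows, and only the partial totals are merged. The data, the column meaning (region, amount) and the function names are assumptions for illustration; the gateway itself runs this kind of work on Spark.

```python
from collections import Counter

# Hypothetical (region, amount) rows stored on three different nodes.
NODE_ROWS = [
    [("east", 10), ("west", 5), ("east", 7)],
    [("west", 3), ("north", 8)],
    [("east", 1), ("north", 2), ("west", 4)],
]

def aggregate_partition(rows):
    """Runs on one node: sum the amount per region for local rows only."""
    totals = Counter()
    for region, amount in rows:
        totals[region] += amount
    return totals

def merge_partials(partials):
    """Runs once: combine the small per-node totals into the final result."""
    totals = Counter()
    for partial in partials:
        totals.update(partial)  # Counter.update adds values per key
    return dict(totals)

if __name__ == "__main__":
    partials = [aggregate_partition(rows) for rows in NODE_ROWS]
    print(merge_partials(partials))
```

Eight detail rows shrink to three totals here; the same reduction is what makes a large source small enough for an in-memory cube.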
Filtering
• Users can also define filters to limit the number of rows to be brought into the system without compromising on the granularity of detail.
• Both aggregation and filtering expressions are pushed down to the cluster nodes to take advantage of Spark's distributed computing performance.
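Filter pushdown can be pictured the same way: the predicate is evaluated where the data lives, so only matching rows are transferred, and those rows keep every column. The sketch below is a plain-Python illustration with made-up data and a hypothetical `scan_with_pushdown` helper, not the gateway's API.

```python
# Hypothetical order rows split across two nodes.
PARTITIONS = [
    [
        {"order_id": 1, "year": 2017, "amount": 20},
        {"order_id": 2, "year": 2018, "amount": 35},
    ],
    [
        {"order_id": 3, "year": 2018, "amount": 12},
        {"order_id": 4, "year": 2016, "amount": 50},
    ],
]

def scan_with_pushdown(partitions, predicate):
    """Apply the filter on each node so only matching rows are transferred."""
    transferred = []
    for partition in partitions:
        transferred.extend(row for row in partition if predicate(row))
    return transferred

if __name__ == "__main__":
    rows = scan_with_pushdown(PARTITIONS, lambda row: row["year"] == 2018)
    print(len(rows), "of 4 rows transferred, each at full detail")
```

Unlike aggregation, the filter reduces the number of rows without changing their granularity.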
Demo: MicroStrategy Hadoop Gateway
• Browse files
• Wrangle data
• Publish in-memory dataset
• Blend with existing dataset
Demo: Aggregation and Filtering
• Browse files
• Aggregation
• Filtering
• Publish in-memory dataset
Installation and Deployment
Automatic deployment via MicroStrategy Web
• Use the gateway manager in MicroStrategy Web to create, modify, delete, deploy, undeploy, start and stop the Hadoop Gateway remotely
Configuration and automatic deployment demo
Hadoop Properties:
• Hadoop NameNode: FQDN or IP
• HDFS Port: used to browse files, default 8020
• WebHDFS: used to preview files, default 50070
Gateway Properties:
• Host: machine on which to install the gateway
• Port: Intelligence Server to Hadoop Gateway, default 30004
Spark Properties:
• YARN: Jar: path of the Spark assembly
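The properties listed above can be gathered into one small configuration object. The sketch below is an assumption-laden illustration: the class name, field names and host names are hypothetical and do not correspond to any MicroStrategy configuration file or API; only the default ports (8020, 50070, 30004) come from the slide.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HadoopGatewayConfig:
    namenode: str               # FQDN or IP of the Hadoop NameNode
    gateway_host: str           # machine on which the gateway is installed
    spark_jar: str              # path of the Spark assembly
    hdfs_port: int = 8020       # used to browse files
    webhdfs_port: int = 50070   # used to preview files
    gateway_port: int = 30004   # Intelligence Server to Hadoop Gateway

    def validate(self):
        """Reject port numbers outside the valid TCP range."""
        for name in ("hdfs_port", "webhdfs_port", "gateway_port"):
            port = getattr(self, name)
            if not 1 <= port <= 65535:
                raise ValueError(f"{name} out of range: {port}")
        return self

if __name__ == "__main__":
    # Placeholder values; replace with the real hosts and path.
    config = HadoopGatewayConfig(
        namenode="namenode.example.com",
        gateway_host="gateway.example.com",
        spark_jar="/path/to/spark-assembly.jar",
    ).validate()
    print(config)
```

Keeping the defaults in one place makes it easy to see which values only need to change when the cluster uses non-standard ports.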
Manual installation and deployment
• Automatic deployment remotely installs and deploys the gateway on the cluster node and requires a user with root privileges.
• In some cases, Hadoop administrators prefer to install the components manually, using their own tools to manage the application.
• Refer to the product documentation for step-by-step manual installation instructions and deployment commands.
• Also refer to the Hadoop Gateway FAQ on the MicroStrategy Community portal for more details.
Securing the Hadoop Gateway
Kerberos support
Authentication
• Support for Kerberos authentication: MIT Kerberos and Active Directory (LDAP)
• Support for Secure Sockets Layer (SSL) encryption
Authorization
• Integration with Ranger policies (Hortonworks)
• Integration with Sentry policies (Cloudera)
• The established policies are applied, enforcing user-level authorization
[Diagram components: user credentials, security policies, data, MicroStrategy Hadoop Gateway, Sentry (Cloudera), Ranger (Hortonworks)]
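User-level authorization of the kind described above amounts to checking each request against the policies defined for that user. The toy sketch below only illustrates that idea: the policy table, the path-prefix rule and `is_authorized` are assumptions, and it does not use or imitate the Ranger or Sentry APIs.

```python
# Hypothetical policies: user -> HDFS path prefixes the user may read.
POLICIES = {
    "alice": ["/data/sales/"],
    "bob": ["/data/sales/", "/data/finance/"],
}

def is_authorized(user, path, policies=POLICIES):
    """Allow the read only if one of the user's prefixes covers the path."""
    return any(path.startswith(prefix) for prefix in policies.get(user, []))

if __name__ == "__main__":
    for user in ("alice", "bob", "carol"):
        print(user, is_authorized(user, "/data/finance/ledger.parquet"))
```

A user with no entry in the table is denied by default, which mirrors the usual deny-unless-allowed stance of policy engines.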
Customer Stories
Big Data Validation Program
Hadoop Gateway Customer Validation
The Hadoop Gateway has been validated with some of the largest MSTR customers
Performance: validated at one of the largest multi-media companies
• Looking to publish cubes from a rapidly growing set of viewership data
• Big Data ODBC connections unable to publish the cubes fast enough
• Took more than 6 hours to publish a cube via Hive
• Took less than an hour with Teradata
• Hadoop Gateway published in less than an hour
Agility: validated at one of the largest retailers
• Transaction-level data (12M+ rows/day) loaded into Parquet and Avro files
• Looking to give end users direct access to HDFS
• Previously needed to wait for files to load into Hive tables
• Hadoop Gateway outperformed Hive on Spark and Impala via ODBC drivers
• Data wrangling optimized with Hadoop Gateway
Security: validated at one of the largest financial organizations
• Looking to directly access secure data and publish cubes
• MIT Kerberos had been enabled on the cluster
• Secure Sockets Layer (SSL) encryption enabled
• Cluster had been enabled for High Availability
Hadoop Gateway vs. ODBC
Hadoop Gateway performs well vs. ODBC
• The Hadoop Gateway was directly compared to a large relational database at one of the largest digital media companies in the world
• While publishing these cubes, the Hadoop Gateway outperformed this database
Hadoop Gateway vs. ODBC with Data Wrangle
Hadoop Gateway performs well vs. ODBC
• The Hadoop Gateway was directly compared to Hive on Spark at one of the largest retailers in the world
• While publishing these cubes, the Hadoop Gateway outperformed Hive on Spark
• Data wrangling functions have been integrated with the Hadoop Gateway to reduce data movement and leverage the processing capacity of the Hadoop cluster
Q&A
Thank you