The Landscape of Big Data Technologies
– Kumar V, Dec 15, 2016
Lightning Talk at Mississauga .NET User Group https://www.meetup.com/MississaugaNETUG
Agenda
• Some components in the Hadoop Ecosystem
– Data Ingestion
– Data Processing
– Data Visualization/Reporting
– SQL On Hadoop
– Security
• NoSQL Data stores
• IoT, Big Data and Cloud Technologies
Popular Distributed Processing Systems
• Mesos
– Mesos is built on the same principles as the Linux kernel, only at a different level of abstraction.
– Allows you to view an entire cluster as one computer.
– Provides applications with API for resource management and resource scheduling across entire datacenter and cloud environments.
• Hadoop – HDFS
– YARN
– And an overcrowded component ecosystem
HDFS
• HDFS, the Hadoop Distributed File System, is a distributed file system designed to hold very large amounts of data (hundreds of terabytes to tens of petabytes) and provide high-throughput access to this data.
• Hadoop uses HDFS to store files efficiently in the cluster. When a file is placed in HDFS it is broken down into blocks, 128 MB block size by default.
• These blocks are then replicated across the different nodes (DataNodes) in the cluster. The default replication value is 3.
• The block size and replication factor are configurable per file.
• A Hadoop deployment can comprise a single node or thousands of nodes.
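The block and replication figures above translate into simple arithmetic. A small pure-Python sketch (the function name and the file sizes are illustrative, not part of Hadoop):

```python
import math

def hdfs_footprint(file_size_mb, block_size_mb=128, replication=3):
    """Number of HDFS blocks a file occupies and the total raw cluster
    storage consumed once every block is replicated across DataNodes."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    raw_storage_mb = file_size_mb * replication  # each byte is stored `replication` times
    return blocks, raw_storage_mb

# A 1 GB (1024 MB) file with the default 128 MB block size and replication 3:
blocks, raw = hdfs_footprint(1024)
print(blocks, raw)  # 8 blocks, 3072 MB of raw cluster storage
```

Because both the block size and the replication factor are per-file settings, the same file can have a very different footprint on clusters tuned differently.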
HDFS … contd(1)
• Hadoop YARN
Resource Manager
• Scheduler
– Fair Scheduler (scheduling based on tenants/apps getting an equal share of resources; by default only memory is considered, but CPU can be added)
– Capacity Scheduler (Capacity guarantees)
• Applications Manager
– Accepts job submissions and negotiates the first container for the job-specific App Master.
– The job’s App Master then negotiates resources (memory, number of cores) for each container.
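To make the Fair Scheduler bullet concrete, the fragment below sketches a fair-scheduler.xml allocation file. The queue names and weights are hypothetical; setting drf (Dominant Resource Fairness) as the default policy is what makes the scheduler consider CPU in addition to memory:

```xml
<?xml version="1.0"?>
<allocations>
  <!-- drf makes CPU count alongside memory when computing fair shares -->
  <defaultQueueSchedulingPolicy>drf</defaultQueueSchedulingPolicy>
  <!-- hypothetical tenant queues: "analytics" gets twice the share of "adhoc" -->
  <queue name="analytics">
    <weight>2.0</weight>
  </queue>
  <queue name="adhoc">
    <weight>1.0</weight>
  </queue>
</allocations>
```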
Some Data Ingestion Tools
• Flume: A distributed, reliable, and available system for efficiently collecting, aggregating, and moving large amounts of the following kinds of data from multiple sources to a centralized data store:
– Log data
– Event data (network traffic data, social-media-generated data, email messages, and pretty much any other data)
• Sqoop:
Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
• Kafka
• HDFS CLI, WebHDFS REST API
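WebHDFS exposes file operations as plain HTTP calls of the form http://host:port/webhdfs/v1/&lt;path&gt;?op=… A small Python sketch that only builds such a URL (the host, port, and path are hypothetical; no cluster is contacted):

```python
from urllib.parse import urlencode

def webhdfs_url(host, port, path, op, **params):
    """Build a WebHDFS REST URL for an HDFS path and operation."""
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{path}?{query}"

# Read a file (hypothetical NameNode host and file path):
print(webhdfs_url("namenode.example.com", 9870, "/data/events.log", "OPEN"))
# http://namenode.example.com:9870/webhdfs/v1/data/events.log?op=OPEN
```

In practice you would issue an HTTP GET against that URL (e.g. with curl) and follow the redirect to the DataNode that serves the block data.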
Distributed Processing Frameworks
• Spark, the cluster application framework/engine: an in-memory application programming framework that can run on Hadoop. Spark’s cluster manager talks to YARN to obtain the resources needed for its workers and driver.
Apache Spark Ecosystem
• Spark Core
• Spark SQL + Dataframes
• Spark Streaming
• Spark Structured Streaming
• Spark MLlib
• Spark GraphX
• Scala
• Java
• Python
• R
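Word count is the canonical example of Spark Core’s transformation chain. The sketch below is plain Python, not Spark, imitating the flatMap → map → reduceByKey steps so the shape of the model is visible without a cluster:

```python
from collections import defaultdict

lines = ["big data on hadoop", "spark on hadoop"]  # stand-in for an RDD of lines

# flatMap: split every line into words
words = [w for line in lines for w in line.split()]

# map: pair each word with a count of 1
pairs = [(w, 1) for w in words]

# reduceByKey: sum counts per word (Spark would shuffle by key across the cluster first)
counts = defaultdict(int)
for w, n in pairs:
    counts[w] += n

print(dict(counts))  # {'big': 1, 'data': 1, 'on': 2, 'hadoop': 2, 'spark': 1}
```

The real PySpark version is the same pipeline expressed as `rdd.flatMap(...).map(...).reduceByKey(...)`, with each stage distributed over partitions.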
Popular ML Libraries
• Weka (Java)
• Scikit-learn (Python)
• Theano (Python; excellent for deep learning)
• TensorFlow (Python library from Google)
• DeepLearning4J (Java library for Deep Learning)
• H2O.ai (Open source library. It can work with Java, Python and R)
Data Visualization
• Tableau: http://www.tableau.com/
• Microsoft Power BI: https://powerbi.microsoft.com/en-us/
• Qlik: http://www.qlik.com/
• Oracle Visual Analyzer
• SAS Visual Analytics
• TIBCO Spotfire
• Dundas BI
Multidimensional Analysis
• Atscale http://www.atscale.com/
• Apache Kylin: http://kylin.apache.org/
• Kyvos Insights: http://www.kyvosinsights.com/solution
• Druid: http://druid.io/druid.html
SQL On Hadoop
• Hive (SQL like querying on data stored on HDFS. Schema on read. External tables)
• Presto (queries data on HDFS, Cassandra, HBase, and many other data sources; can join data from multiple sources as table joins)
• Drill (Similar to Presto. Design is based on Google’s Dremel: Aggregation on 1 trillion records in seconds)
Security
• Apache Knox (gateway to a cluster; provides authentication, SSO, service-level authorization, auditing, etc.)
• Apache Ranger (fine-grained access control for Hadoop components such as YARN, HDFS, Hive, Kafka, Knox, etc.)
• Kerberos
IoT
• A network of connected objects that can collect and exchange data using sensors.
• Sensors on cars and aircraft engines
• Devices sending out speed/gps data from cars
• Smart homes (Nest, Ecobee, etc)
• Wearables: Fitbit, Apple watches
• Manufacturing industry
Application Domains
• Manufacturing: Data from production lines
• Insurance industry: Data from wearables, gps data from cars (speed-location), etc.
• Defense: Drones, Military robots, etc.
• Retail: Unmanned grocery outlets
• Logistics: tracking shipments
• Aircraft: GE processes petabytes of data collected from its aircraft engines across the world
Processing IoT Data
What’s involved in processing IoT data?
– Data Streaming
– Data Storage
– Processing frameworks that can handle Streaming / Batch data
– Cloud Technologies
– Data Security
– Data Analytics (Machine Learning)
– Data Visualization
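As a toy illustration of the streaming step above, the sketch below computes a rolling average over a stream of sensor readings in plain Python (the readings and window size are made up; a real pipeline would run this kind of aggregate in a stream processor such as Spark Streaming):

```python
from collections import deque

def windowed_averages(readings, window=3):
    """Yield the average of the last `window` readings — the kind of
    rolling aggregate a stream processor computes over sensor data."""
    buf = deque(maxlen=window)  # only the most recent `window` values are kept
    for r in readings:
        buf.append(r)
        if len(buf) == window:
            yield sum(buf) / window

speeds = [60, 62, 64, 80, 81]  # hypothetical speed readings from a car
print(list(windowed_averages(speeds)))  # ≈ [62.0, 68.67, 75.0]
```

The same sliding-window idea scales up to event-time windows over Kafka topics, where late or out-of-order readings add the real complexity.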
An IoT Use Case: Dutch Border Security Monitoring
• KMAR (Koninklijke Marechaussee) is the Royal Dutch Police force, which has military status.
• The Schengen Agreement in the EU resulted in the disappearance of internal border controls.
• http://downloads.typesafe.com/website/casestudies/Dutch-Border-Police-Case-Study-v1.3.pdf (Or Google “Royal Dutch Police Akka case study”)
Thank You
meetup.com/Mississauga-Big-Data-Analytics-Meetup/