The Landscape of Big Data Technologies
– Kumar V, Dec 15, 2016
Lightning Talk at Mississauga .NET User Group https://www.meetup.com/MississaugaNETUG
Agenda
• Some components in the Hadoop Ecosystem
– Data Ingestion
– Data Processing
– Data Visualization/Reporting
– SQL On Hadoop
– Security
• NoSQL Data stores
• IoT, Big Data and Cloud Technologies
Popular Distributed Processing Systems
• Mesos
– Mesos is built on the same principles as the Linux kernel, only at a different level of abstraction.
– Allows you to view an entire cluster as one computer.
– Provides applications with API for resource management and resource scheduling across entire datacenter and cloud environments.
• Hadoop – HDFS
– YARN
– And an overcrowded component ecosystem
HDFS
• HDFS, the Hadoop Distributed File System, is a distributed file system designed to hold very large amounts of data (hundreds of terabytes to tens of petabytes) and provide high-throughput access to this data.
• Hadoop uses HDFS to store files efficiently in the cluster. When a file is placed in HDFS it is broken down into blocks, 128 MB block size by default.
• These blocks are then replicated across the different nodes (DataNodes) in the cluster. The default replication value is 3.
• The block size and replication factor are configurable per file.
• A Hadoop deployment can comprise a single node or thousands of nodes.
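The block and replication figures above translate into simple arithmetic. A small pure-Python sketch (the function name and the file sizes are illustrative, not part of Hadoop):

```python
import math

def hdfs_footprint(file_size_mb, block_size_mb=128, replication=3):
    """Number of HDFS blocks a file occupies and the total raw cluster
    storage consumed once every block is replicated across DataNodes."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    raw_storage_mb = file_size_mb * replication  # each byte is stored `replication` times
    return blocks, raw_storage_mb

# A 1 GB (1024 MB) file with the default 128 MB block size and replication 3:
blocks, raw = hdfs_footprint(1024)
print(blocks, raw)  # 8 blocks, 3072 MB of raw cluster storage
```

Because both the block size and the replication factor are per-file settings, the same file can have a very different footprint on clusters tuned differently.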
HDFS … contd(1)
• Hadoop YARN
Resource Manager
• Scheduler
– Fair Scheduler (scheduling based on tenants/apps getting an equal share of resources; by default only memory is considered, but CPU can be added)
– Capacity Scheduler (Capacity guarantees)
• Applications Manager
– Accepts job submissions and negotiates the first container for the job-specific App Master.
– The job’s App Master then negotiates resources (memory, number of cores) for each container.
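To make the Fair Scheduler bullet concrete, the fragment below sketches a fair-scheduler.xml allocation file. The queue names and weights are hypothetical; setting drf (Dominant Resource Fairness) as the default policy is what makes the scheduler consider CPU in addition to memory:

```xml
<?xml version="1.0"?>
<allocations>
  <!-- drf makes CPU count alongside memory when computing fair shares -->
  <defaultQueueSchedulingPolicy>drf</defaultQueueSchedulingPolicy>
  <!-- hypothetical tenant queues: "analytics" gets twice the share of "adhoc" -->
  <queue name="analytics">
    <weight>2.0</weight>
  </queue>
  <queue name="adhoc">
    <weight>1.0</weight>
  </queue>
</allocations>
```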
Some Data Ingestion Tools
• Flume: A distributed, reliable, and available system for efficiently collecting, aggregating, and moving large amounts of the following kinds of data from multiple sources to a centralized data store:
– Log data
– Event data (network traffic data, social-media-generated data, email messages, and pretty much any other data)
• Sqoop:
Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
• Kafka
• HDFS CLI, WebHDFS REST API
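WebHDFS exposes file operations as plain HTTP calls of the form http://host:port/webhdfs/v1/&lt;path&gt;?op=… A small Python sketch that only builds such a URL (the host, port, and path are hypothetical; no cluster is contacted):

```python
from urllib.parse import urlencode

def webhdfs_url(host, port, path, op, **params):
    """Build a WebHDFS REST URL for an HDFS path and operation."""
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{path}?{query}"

# Read a file (hypothetical NameNode host and file path):
print(webhdfs_url("namenode.example.com", 9870, "/data/events.log", "OPEN"))
# http://namenode.example.com:9870/webhdfs/v1/data/events.log?op=OPEN
```

In practice you would issue an HTTP GET against that URL (e.g. with curl) and follow the redirect to the DataNode that serves the block data.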
Distributed Processing Frameworks
• Spark, the cluster application framework/engine: an in-memory application programming framework that can run on Hadoop. Spark’s cluster manager talks to YARN to obtain the resources needed for its workers and driver.
Apache Spark Ecosystem
• Spark Core
• Spark SQL + Dataframes
• Spark Streaming
• Spark Structured Streaming
• Spark MLlib
• Spark GraphX
• Scala
• Java
• Python
• R
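Word count is the canonical example of Spark Core’s transformation chain. The sketch below is plain Python, not Spark, imitating the flatMap → map → reduceByKey steps so the shape of the model is visible without a cluster:

```python
from collections import defaultdict

lines = ["big data on hadoop", "spark on hadoop"]  # stand-in for an RDD of lines

# flatMap: split every line into words
words = [w for line in lines for w in line.split()]

# map: pair each word with a count of 1
pairs = [(w, 1) for w in words]

# reduceByKey: sum counts per word (Spark would shuffle by key across the cluster first)
counts = defaultdict(int)
for w, n in pairs:
    counts[w] += n

print(dict(counts))  # {'big': 1, 'data': 1, 'on': 2, 'hadoop': 2, 'spark': 1}
```

The real PySpark version is the same pipeline expressed as `rdd.flatMap(...).map(...).reduceByKey(...)`, with each stage distributed over partitions.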
Popular ML Libraries
• Weka (Java)
• Scikit-learn (Python)
• Theano (Python; excellent for deep learning)
• TensorFlow (Python library from Google)
• DeepLearning4J (Java library for Deep Learning)
• H2O.ai (Open source library. It can work with Java, Python and R)
Data Visualization
• Tableau: http://www.tableau.com/
• Microsoft Power BI: https://powerbi.microsoft.com/en-us/
• Qlik: http://www.qlik.com/
• Oracle Visual Analyzer
• SAS Visual Analytics
• TIBCO Spotfire
• Dundas BI
Multidimensional Analysis
• Atscale http://www.atscale.com/
• Apache Kylin: http://kylin.apache.org/
• Kyvos Insights: http://www.kyvosinsights.com/solution
• Druid: http://druid.io/druid.html
SQL On Hadoop
• Hive (SQL like querying on data stored on HDFS. Schema on read. External tables)
• Presto (queries data on HDFS, Cassandra, HBase, and many other data sources; can join data from multiple sources as table joins)
• Drill (Similar to Presto. Design is based on Google’s Dremel: Aggregation on 1 trillion records in seconds)
Security
• Apache Knox (gateway to a cluster; provides authentication, SSO, service-level authorization, auditing, etc.)
• Apache Ranger (fine-grained access control for Hadoop components such as YARN, HDFS, Hive, Kafka, Knox, etc.)
• Kerberos
IoT
• A network of connected objects that can collect and exchange data using sensors.
• Sensors on cars and aircraft engines
• Devices sending out speed/gps data from cars
• Smart homes (Nest, Ecobee, etc)
• Wearables: Fitbit, Apple watches
• Manufacturing industry
Application Domains
• Manufacturing: Data from production lines
• Insurance industry: Data from wearables, gps data from cars (speed-location), etc.
• Defense: Drones, Military robots, etc.
• Retail: Unmanned grocery outlets
• Logistics: tracking shipments
• Aircraft: GE processes petabytes of data collected from its aircraft engines across the world
Processing IoT Data
What’s involved in processing IoT data?
– Data Streaming
– Data Storage
– Processing frameworks that can handle Streaming / Batch data
– Cloud Technologies
– Data Security
– Data Analytics (Machine Learning)
– Data Visualization
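As a toy illustration of the streaming step above, the sketch below computes a rolling average over a stream of sensor readings in plain Python (the readings and window size are made up; a real pipeline would run this kind of aggregate in a stream processor such as Spark Streaming):

```python
from collections import deque

def windowed_averages(readings, window=3):
    """Yield the average of the last `window` readings — the kind of
    rolling aggregate a stream processor computes over sensor data."""
    buf = deque(maxlen=window)  # only the most recent `window` values are kept
    for r in readings:
        buf.append(r)
        if len(buf) == window:
            yield sum(buf) / window

speeds = [60, 62, 64, 80, 81]  # hypothetical speed readings from a car
print(list(windowed_averages(speeds)))  # ≈ [62.0, 68.67, 75.0]
```

The same sliding-window idea scales up to event-time windows over Kafka topics, where late or out-of-order readings add the real complexity.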
An IoT Use Case: Dutch Border Security Monitoring
• KMAR (Koninklijke Marechaussee) is the Royal Dutch Police force, which has military status.
• The Schengen Agreement in the EU resulted in the disappearance of internal border controls.
• http://downloads.typesafe.com/website/casestudies/Dutch-Border-Police-Case-Study-v1.3.pdf (Or Google “Royal Dutch Police Akka case study”)
Thank You
meetup.com/Mississauga-Big-Data-Analytics-Meetup/