Adnan Masood, PhD, Systems Architect / Data Scientist, adnan.masood@owasp.org (http://blog.adnanmasood.com), GitHub (github.com/adnanmasood), Twitter (@adnanmasood). Presented at Microsoft Data Science Group – Tampa Bay Data Science Professionals, http://www.meetup.com/Data-Scientists-Tampa-Bay/events/231293077/

Spark with Azure HDInsight - Tampa Bay Data Science - Adnan Masood, PhD


Page 1: Spark with Azure HDInsight  - Tampa Bay Data Science - Adnan Masood, PhD

Adnan Masood, PhD

Systems Architect / Data Scientist
adnan.masood@owasp.org

(http://blog.adnanmasood.com)

GitHub (github.com/adnanmasood),

Twitter (@adnanmasood).

Presented at Microsoft Data Science Group – Tampa Bay Data Science Professionals

http://www.meetup.com/Data-Scientists-Tampa-Bay/events/231293077/

Spark with Azure HDInsight

Page 2: Spark with Azure HDInsight  - Tampa Bay Data Science - Adnan Masood, PhD

About the Speaker

Adnan Masood, Ph.D. is a developer, software architect, and researcher who specializes in FinTech, machine learning, and Bayesian belief networks. Before joining PDS Health care and GDC (a leading prepaid financial technology institution), he enjoyed life as a principal engineer of a start-up and worked for a leading UK-based nonprofit organization as a solutions architect.

A strong believer in the development community, Adnan is an active member of the Open Web Application Security Project (OWASP), an organization dedicated to software security. In the .NET community, he is a cofounder and president of the Pasadena .NET Developers group, which he has been successfully leading for 8 years. He has led a number of successful enterprise solutions and consulted on projects for several Fortune 500 companies.

Adnan devotes himself to his own continual, practical education. He holds certifications in big data, machine learning, and systems architecture from Massachusetts Institute of Technology; an Application Security certification from Stanford University; an SOA Smarts certification from Carnegie Mellon University; and certifications as a ScrumMaster, Microsoft Certified Trainer, Microsoft Certified Solutions Developer, and Sun Certified Java Developer.

For more details, visit Adnan's blog (http://blog.adnanmasood.com), GitHub repository (http://github.com/adnanmasood), and Twitter (@adnanmasood). Adnan can be reached at adnan.masood@owasp.org.

Page 5: Spark with Azure HDInsight  - Tampa Bay Data Science - Adnan Masood, PhD

Spark 101

Spark is a unified framework for big data analytics. It provides one integrated API that developers, data scientists, and analysts can use for diverse tasks, such as batch analytics, stream processing, and statistical modeling, that would previously have required separate processing engines. Spark supports a wide range of popular languages, including Python, R, Scala, SQL, and Java. It can read from diverse data sources and scale to thousands of nodes.
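To make the "scale to thousands of nodes" idea concrete, here is a minimal plain-Python sketch of the partition, compute, and merge pattern that a Spark word count follows. It is an illustration only and does not use the Spark API: the function names, the round-robin partitioning, and the sample lines are all assumptions made for this example, and the "partitions" are ordinary lists processed one after another rather than on separate machines.

```python
from collections import Counter
from functools import reduce


def split_into_partitions(records, num_partitions):
    """Deal records round-robin into lists (a stand-in for data spread over nodes)."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def count_words(partition):
    """Per-partition work: count words using only the local records."""
    counts = Counter()
    for line in partition:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Combine two partial results; the order of merging does not matter."""
    return left + right


def word_count(records, num_partitions=3):
    partitions = split_into_partitions(records, num_partitions)
    # On a real cluster each partition would be processed in parallel.
    partials = [count_words(p) for p in partitions]
    return reduce(merge_counts, partials, Counter())


# Hypothetical sample input.
lines = [
    "spark reads data",
    "spark scales out",
    "data data data",
]
totals = word_count(lines)
print(totals.most_common(2))
```

Because the merge step is associative and commutative, the result is the same whether the data sits in one partition or many, which is the property that lets this style of computation spread across nodes.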

Page 6: Spark with Azure HDInsight  - Tampa Bay Data Science - Adnan Masood, PhD

Spark with Azure HDInsight

Page 7: Spark with Azure HDInsight  - Tampa Bay Data Science - Adnan Masood, PhD

Deployment Models

Page 8: Spark with Azure HDInsight  - Tampa Bay Data Science - Adnan Masood, PhD

Big Data Deployment – Public Cloud

• Hadoop-as-a-Service

- Amazon Web Services EC2 and EMR

- Microsoft Azure HDInsight

- Google Cloud Dataproc

- IBM Bluemix ... and others

• Spark-as-a-Service

- All of the above

- Databricks

Page 9: Spark with Azure HDInsight  - Tampa Bay Data Science - Adnan Masood, PhD

Big Data Deployment – On-Premises

• Bare-Metal

• Virtual Machines

- VMware Big Data Extensions

- OpenStack Sahara

• Containers

- BlueData

- Mesos

Page 10: Spark with Azure HDInsight  - Tampa Bay Data Science - Adnan Masood, PhD

HDInsight as Part of Azure Portal

Page 11: Spark with Azure HDInsight  - Tampa Bay Data Science - Adnan Masood, PhD
Page 12: Spark with Azure HDInsight  - Tampa Bay Data Science - Adnan Masood, PhD

Spark - Benefits

Page 13: Spark with Azure HDInsight  - Tampa Bay Data Science - Adnan Masood, PhD

Spark – Use Cases

Page 14: Spark with Azure HDInsight  - Tampa Bay Data Science - Adnan Masood, PhD

Spark is Fast!

Page 15: Spark with Azure HDInsight  - Tampa Bay Data Science - Adnan Masood, PhD

Spark is Fast!

Page 16: Spark with Azure HDInsight  - Tampa Bay Data Science - Adnan Masood, PhD

Demo - Creating an HDInsight Spark Cluster

Page 17: Spark with Azure HDInsight  - Tampa Bay Data Science - Adnan Masood, PhD

HDInsight Spark Streaming

“Along with traditional Hadoop technologies, HDInsight also provides Spark as a cloud service. Spark is an integrated set of open source technologies that can run on a Hadoop cluster. The Spark family includes options for analyzing large amounts of operational data, doing machine learning, and more. It also includes Spark Streaming, a technology for working with streaming data. Spark Streaming is similar to Storm in some ways. Like Storm, it’s a general-purpose technology for processing streaming data. Unlike Storm, Spark Streaming is implemented as an extension to the basic Spark engine—it’s not an add-on technology. This tight connection can make Spark applications faster, since there’s less need to move data between components, and easier to create, since everything uses the same core Spark technology. Because of this, Spark Streaming (and Spark in general) are getting more popular by the day”

David Chappell, Streaming Scenarios Using the Microsoft Data Platform: A Guide for IT Leaders

Page 18: Spark with Azure HDInsight  - Tampa Bay Data Science - Adnan Masood, PhD

HDInsight Spark Streaming

• What is it?

- Distributed compute framework, an extension of the core Apache Spark API

- Allows users to integrate real-time data from disparate event streams (e.g. Kafka, HDFS, Twitter) in event-driven, asynchronous, scalable, type-safe, and fault-tolerant applications

• When to use it?

- When organizations need real-time decision making

- When you are working with streams of continuous data

• Why Spark Streaming?

- Enables high-throughput and reliable processing of live data streams

- Batch, iterative, and streaming analysis on the same platform

- Easily add machine learning for streaming data pathways
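The point about running batch and streaming analysis on the same platform can be illustrated with a small plain-Python model of micro-batching, the approach Spark Streaming takes: incoming events are grouped into short time intervals, and ordinary batch logic runs on each group. This is a sketch under stated assumptions, not the Spark Streaming API. The event tuples, the 2-second interval, and every function name below are hypothetical.

```python
from collections import Counter


def micro_batches(events, batch_interval):
    """Group (timestamp, text) events into consecutive intervals of batch_interval seconds."""
    batches = {}
    for timestamp, text in events:
        batches.setdefault(int(timestamp // batch_interval), []).append(text)
    # Emit batches in time order; intervals with no events are skipped.
    for index in sorted(batches):
        yield index, batches[index]


def count_hashtags(batch):
    """Ordinary batch logic, reused unchanged for every micro-batch."""
    return Counter(
        word for text in batch for word in text.split() if word.startswith("#")
    )


def run_stream(events, batch_interval=2):
    """Apply the batch logic to each micro-batch and keep a running total."""
    running = Counter()
    snapshots = []
    for index, batch in micro_batches(events, batch_interval):
        running += count_hashtags(batch)
        snapshots.append((index, dict(running)))
    return snapshots


# Hypothetical event stream: (seconds since start, message text).
events = [
    (0.5, "loving #spark"),
    (1.2, "#spark on #azure"),
    (2.1, "#azure hdinsight"),
    (5.0, "more #spark"),
]
snapshots = run_stream(events)
for index, totals in snapshots:
    print(index, totals)
```

Note that `count_hashtags` knows nothing about streaming. The same function could be applied to a full historical dataset, which is the sense in which batch and streaming code can share one code path.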

Page 19: Spark with Azure HDInsight  - Tampa Bay Data Science - Adnan Masood, PhD

Getting Started.

Page 20: Spark with Azure HDInsight  - Tampa Bay Data Science - Adnan Masood, PhD

References & Further Reading

Use MapReduce in Hadoop on HDInsight https://azure.microsoft.com/en-us/documentation/articles/hdinsight-use-mapreduce

Get started: Create Apache Spark cluster on HDInsight Linux and run interactive queries using Spark SQL https://azure.microsoft.com/en-us/documentation/articles/hdinsight-apache-spark-zeppelin-notebook-jupyter-spark-sql/

Azure Machine Learning https://azure.microsoft.com/en-us/services/machine-learning/

Page 21: Spark with Azure HDInsight  - Tampa Bay Data Science - Adnan Masood, PhD

References & Further Reading

Announcing Apache Spark on Azure HDInsight https://channel9.msdn.com/Shows/Azure-Friday/Announcing-Apache-Spark-on-Azure-HDInsight

Apache Zeppelin https://zeppelin.incubator.apache.org

Project Jupyter http://jupyter.org/

https://azure.microsoft.com/en-us/services/hdinsight/

https://azure.microsoft.com/en-us/blog/apache-spark-for-azure-hdinsight-now-generally-available/

Microsoft expands its commitment to Apache Spark big-data framework

https://azure.microsoft.com/en-us/documentation/articles/hdinsight-apache-spark-use-zeppelin-notebook/

http://www.c-sharpcorner.com/UploadFile/aa700f/jumpstart-into-big-data-with-hdinsight/

Page 22: Spark with Azure HDInsight  - Tampa Bay Data Science - Adnan Masood, PhD

References & Further Reading

Get started: Create Apache Spark cluster on HDInsight Linux and run interactive queries using Spark SQL https://azure.microsoft.com/en-us/documentation/articles/hdinsight-apache-spark-jupyter-spark-sql/

edX Course: Processing Big Data with Azure HDInsight. Learn how to use Hadoop technologies in Microsoft Azure HDInsight to process big data in this five-week, hands-on course. https://www.edx.org/course/processing-big-data-azure-hdinsight-microsoft-dat202-1x-0

Apache Spark for Azure HDInsight https://azure.microsoft.com/en-us/services/hdinsight/apache-spark/

Build Machine Learning applications to run on Apache Spark clusters on HDInsight Linux https://azure.microsoft.com/en-us/documentation/articles/hdinsight-apache-spark-ipython-notebook-machine-learning/

Page 23: Spark with Azure HDInsight  - Tampa Bay Data Science - Adnan Masood, PhD

References & Further Reading

Page 24: Spark with Azure HDInsight  - Tampa Bay Data Science - Adnan Masood, PhD

Questions