Adnan Masood, PhD
Systems Architect / Data Scientist
adnan.masood@owasp.org (http://blog.adnanmasood.com)
GitHub (github.com/adnanmasood), Twitter (@adnanmasood).
Presented at Microsoft Data Science Group – Tampa Bay Data Science Professionals
http://www.meetup.com/data-scientists-tampa-bay/events/231293077/
Spark with Azure HDInsight
About the Speaker
Adnan Masood, Ph.D., is a developer, software architect, and researcher who specializes in FinTech, machine learning, and Bayesian belief networks. Before joining PDS Health care and GDC (a leading prepaid financial technology institution), he enjoyed life as a principal engineer at a start-up and worked for a leading UK-based nonprofit organization as a solutions architect.
A strong believer in the development community, Adnan is an active member of the Open Web Application Security Project (OWASP), an organization dedicated to software security. In the .NET community, he is a cofounder and president of the Pasadena .NET Developers group, which he has been successfully leading for 8 years. He led a number of successful enterprise solutions and consulted for several Fortune 500 company projects.
Adnan devotes himself to his own continual, practical education. He holds certifications in big data, machine learning, and systems architecture from Massachusetts Institute of Technology; an Application Security certification from Stanford University; an SOA Smarts certification from Carnegie Mellon University; and certifications as a ScrumMaster, Microsoft Certified Trainer, Microsoft Certified Solutions Developer, and Sun Certified Java Developer.
For more details, visit Adnan's blog (http://blog.adnanmasood.com), GitHub repository (http://github.com/adnanmasood), and Twitter (@adnanmasood). Adnan can be reached at adnan.masood@owasp.org.
Announcement: Apache Spark for Azure HDInsight now generally available
Channel 9 Walk through of Apache Spark on Azure HDInsight
Spark 101
Spark is a unified framework for big data analytics. Spark provides one integrated API for use by developers, data scientists, and analysts to perform diverse tasks that would have previously required separate processing engines such as batch analytics, stream processing and statistical modeling. Spark supports a wide range of popular languages including Python, R, Scala, SQL, and Java. Spark can read from diverse data sources and scale to thousands of nodes.
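To make the "scale to thousands of nodes" idea a little more concrete, here is a minimal sketch in plain Python (not Spark code) of the split, process, and merge pattern that distributed engines such as Spark build on: the data is divided into partitions, each partition is processed independently, and the partial results are merged. The sample lines, the partition count, and the helper names `partition` and `count_words` are assumptions made purely for illustration.

```python
from collections import Counter
from functools import reduce

def partition(items, n):
    """Split a list into n chunks (round-robin). Hypothetical helper."""
    return [items[i::n] for i in range(n)]

def count_words(lines):
    """Work done independently on a single partition."""
    return Counter(word for line in lines for word in line.lower().split())

# Made-up input data
lines = [
    "Spark reads from diverse data sources",
    "Spark supports Python R Scala SQL and Java",
    "Spark scales out across many nodes",
]

# Each partition could be handled by a different worker node.
partials = [count_words(p) for p in partition(lines, 2)]

# Merge step: combine the per-partition counts into one result.
total = reduce(lambda a, b: a + b, partials, Counter())

print(total["spark"])
```

In an actual Spark job the engine decides how data is partitioned and where each piece runs; the sketch only shows why work that can be split this way scales with the number of nodes.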
Deployment Models
Big Data Deployment – Public Cloud
• Hadoop-as-a-Service
- Amazon Web Services EC2 and EMR
- Microsoft Azure HDInsight
- Google Cloud Dataproc
- IBM Bluemix ... and others
• Spark-as-a-Service
- All of the above
- Databricks
Big Data Deployment – On-Premises
• Bare-Metal
• Virtual Machines
- VMware Big Data Extensions
- OpenStack Sahara
• Containers
- BlueData
- Mesos
HDInsight as Part of Azure Portal
Spark - Benefits
Spark – Use Cases
Spark is Fast!
Demo - Creating a HDInsight Spark Cluster
HDInsight Spark Streaming
“Along with traditional Hadoop technologies, HDInsight also provides Spark as a cloud service. Spark is an integrated set of open source technologies that can run on a Hadoop cluster. The Spark family includes options for analyzing large amounts of operational data, doing machine learning, and more. It also includes Spark Streaming, a technology for working with streaming data. Spark Streaming is similar to Storm in some ways. Like Storm, it’s a general-purpose technology for processing streaming data. Unlike Storm, Spark Streaming is implemented as an extension to the basic Spark engine—it’s not an add-on technology. This tight connection can make Spark applications faster, since there’s less need to move data between components, and easier to create, since everything uses the same core Spark technology. Because of this, Spark Streaming (and Spark in general) are getting more popular by the day”
David Chappell, Streaming Scenarios Using the Microsoft Data Platform: A Guide for IT Leaders
HDInsight Spark Streaming
• What is it?
- A distributed compute framework, an extension of the core Apache Spark API
- Allows users to integrate real-time data from disparate event streams (e.g. Kafka, HDFS, Twitter) in event-driven, asynchronous, scalable, type-safe, and fault-tolerant applications
• When to use it?
- When organizations need real-time decision making
- When you are working with streams of continuous data
• Why Spark Streaming?
- Enables high-throughput and reliable processing of live data streams
- Batch, iterative, and streaming analysis on the same platform
- Easily add machine learning to streaming data pathways
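As a rough illustration of "batch and streaming analysis on the same platform": classic Spark Streaming treats a live stream as a sequence of small batches and runs ordinary batch logic on each one. The sketch below imitates that idea in plain Python, without Spark. The event data, the 2-second batch interval, and the helper names `micro_batches` and `word_count` are all hypothetical and are not part of any Spark API.

```python
from collections import Counter, defaultdict

def word_count(lines):
    """Batch logic: count words in a list of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts

def micro_batches(events, interval):
    """Group (timestamp_seconds, text) events into fixed-length time windows.

    Returns a list of (window_start, [text, ...]) sorted by window start.
    Hypothetical helper for illustration; not a Spark function.
    """
    windows = defaultdict(list)
    for ts, text in events:
        window_start = int(ts // interval) * interval
        windows[window_start].append(text)
    return sorted(windows.items())

# Made-up events: (arrival time in seconds, message text)
events = [
    (0.4, "spark streaming"),
    (1.7, "spark sql"),
    (2.1, "kafka spark"),
    (5.0, "hdinsight"),
]

# The same batch function is reused for every micro-batch of the stream.
results = [(start, word_count(lines))
           for start, lines in micro_batches(events, interval=2)]

for start, counts in results:
    print(start, dict(counts))
```

In Spark Streaming itself, the engine performs the batching and scheduling, and sources such as Kafka are read through connectors; the point of the sketch is only that the per-batch function is the same code one would write for an offline job.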
Getting Started.
References & Further Reading
Use MapReduce in Hadoop on HDInsight https://azure.microsoft.com/en-us/documentation/articles/hdinsight-use-mapreduce
Get started: Create Apache Spark cluster on HDInsight Linux and run interactive queries using Spark SQL https://azure.microsoft.com/en-us/documentation/articles/hdinsight-apache-spark-zeppelin-notebook-jupyter-spark-sql/
Azure Machine Learning https://azure.microsoft.com/en-us/services/machine-learning/
Announcing Apache Spark on Azure HDInsight https://channel9.msdn.com/Shows/Azure-Friday/Announcing-Apache-Spark-on-Azure-HDInsight
Apache Zeppelin https://zeppelin.incubator.apache.org
Project Jupyter http://jupyter.org/
https://azure.microsoft.com/en-us/services/hdinsight/
https://azure.microsoft.com/en-us/blog/apache-spark-for-azure-hdinsight-now-generally-available/
Microsoft expands its commitment to Apache Spark big-data framework
https://azure.microsoft.com/en-us/documentation/articles/hdinsight-apache-spark-use-zeppelin-notebook/
http://www.c-sharpcorner.com/UploadFile/aa700f/jumpstart-into-big-data-with-hdinsight/
Get started: Create Apache Spark cluster on HDInsight Linux and run interactive queries using Spark SQL https://azure.microsoft.com/en-us/documentation/articles/hdinsight-apache-spark-jupyter-spark-sql/
edX course: Processing Big Data with Azure HDInsight. Learn how to use Hadoop technologies in Microsoft Azure HDInsight to process big data in this five-week, hands-on course. https://www.edx.org/course/processing-big-data-azure-hdinsight-microsoft-dat202-1x-0
Apache Spark for Azure HDInsight https://azure.microsoft.com/en-us/services/hdinsight/apache-spark/
Build Machine Learning applications to run on Apache Spark clusters on HDInsight Linux https://azure.microsoft.com/en-us/documentation/articles/hdinsight-apache-spark-ipython-notebook-machine-learning/
Questions