Apache Spark is an open-source data analytics cluster computing framework originally developed in the AMPLab at UC Berkeley. Spark fits into the Hadoop open-source community, building on top of the Hadoop Distributed File System (HDFS). However, Spark is not tied to the two-stage MapReduce paradigm, and promises performance up to 100 times faster than Hadoop MapReduce for certain applications. Spark provides primitives for in-memory cluster computing that allow user programs to load data into a cluster's memory and query it repeatedly, making it well suited to machine learning algorithms.
Guide: Mrs. Juhi Singh
Submitted by: Hitesh Dua, CSE 4th Year, 05510402711
Spark has seen sustained exponential growth and is one of the most active Apache projects.
• Apache Spark is an open source parallel processing framework that enables users to run large-scale data analytics applications across clustered computers.
• Apache Spark can process data from a variety of data repositories. It supports in-memory processing to boost the performance of big data analytics applications, but it can also do conventional disk-based processing when data sets are too large to fit into the available system memory.
● Open Source
● Alternative to MapReduce for certain applications
● A low-latency cluster computing system for very large data sets
● Higher-level library for stream processing, through Spark Streaming
● May be 100 times faster than MapReduce for
– Iterative algorithms
– Interactive data mining
• Started as a research project at the UC Berkeley AMPLab in 2009, and was open sourced in early 2010.
• After being released, Spark grew a developer community on GitHub and entered Apache in 2013 as its permanent home.
• Codebase size
Spark : 20,000 LOC
Hadoop 1.0 : 90,000 LOC
• MapReduce greatly simplified big data analysis.
• But as soon as it got popular, users wanted more:
» More complex, multi-stage applications (e.g. iterative graph algorithms and machine learning)
» More interactive ad-hoc queries
• Both multi-stage and interactive apps require faster data sharing across parallel jobs.
• Resilient Distributed Datasets (RDDs) are the basic building block.
Distributed collections of objects that can be cached in memory across cluster nodes.
Automatically rebuilt on failure.
• RDD operations
Transformations: Create a new dataset from an existing one, e.g. map.
Actions: Return a value to the driver program after running a computation on the dataset, e.g. reduce.
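To make the transformation/action split concrete, here is a minimal sketch in plain Python. The class name `ToyRDD` and its internals are invented for illustration; it imitates Spark's lazy evaluation (transformations only record work, actions trigger it) on a single machine, and is not the real Spark API.

```python
# Toy, single-machine analogue of the RDD model: transformations are lazy,
# actions force computation. Illustrative only; real RDDs are distributed
# and fault-tolerant.
from functools import reduce as _reduce

class ToyRDD:
    def __init__(self, data):
        self._data = list(data)
        self._pending = []          # recorded transformations, not yet run

    def map(self, fn):
        # Transformation: records the step and returns a new dataset lazily.
        child = ToyRDD(self._data)
        child._pending = self._pending + [fn]
        return child

    def collect(self):
        # Action: actually runs the recorded pipeline.
        out = self._data
        for fn in self._pending:
            out = [fn(x) for x in out]
        return out

    def reduce(self, fn):
        # Action: folds the computed dataset down to a single value.
        return _reduce(fn, self.collect())

rdd = ToyRDD([1, 2, 3, 4])
squared = rdd.map(lambda x: x * x)          # transformation: nothing runs yet
total = squared.reduce(lambda a, b: a + b)  # action: triggers the computation
print(total)  # 1 + 4 + 9 + 16 = 30
```

In real Spark the same shape appears as `sc.parallelize([1,2,3,4]).map(...).reduce(...)`, with the work spread across the cluster.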
Spark: Programming Model
Spark Stack Extension
Spark powers a stack of high-level tools including:
• Spark SQL
• Spark Streaming
• MLlib for machine learning
• GraphX
You can combine these frameworks seamlessly in the same application.
• Spark Streaming is a Spark component that enables processing live streams of data.
• Examples of data streams include log files generated by production web servers, or queues of messages containing status updates posted by users of a web service.
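Spark Streaming works by cutting a live stream into a series of small batches ("micro-batches") and running a computation on each. The loop below imitates that idea in plain Python on an invented list of web-server status lines; it is an analogue of the concept, not the Spark Streaming API.

```python
# Micro-batch sketch: process a "stream" a few records at a time while
# maintaining running state across batches, as Spark Streaming does.
from collections import Counter

def micro_batches(lines, batch_size):
    """Yield successive fixed-size batches from the stream."""
    for i in range(0, len(lines), batch_size):
        yield lines[i:i + batch_size]

# Fake stream of HTTP status codes (stand-in for a real log source).
stream = ["200", "404", "200", "500", "200", "404"]

running = Counter()                       # state carried across batches
for batch in micro_batches(stream, batch_size=2):
    running.update(batch)                 # per-batch computation

print(running["200"])  # 3
```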
GraphX is a library added in Spark 0.9 that provides an API for manipulating graphs
(e.g., a social network’s friend graph) and performing graph-parallel computations.
• Allows us to create a directed graph with arbitrary properties attached to each
vertex and edge.
• GraphX also provides a set of operators for manipulating graphs.
• It includes a library of common graph algorithms (e.g., PageRank and triangle counting).
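As a feel for the kind of graph-parallel computation GraphX runs at scale, here is a tiny single-machine PageRank over an adjacency list. The three-node graph and the damping factor are made-up example values; real GraphX distributes this across the cluster.

```python
# Minimal PageRank sketch: each iteration, every vertex spreads its rank
# along its out-edges; a damping factor d models random jumps.
def pagerank(graph, iterations=50, d=0.85):
    n = len(graph)
    rank = {v: 1.0 / n for v in graph}
    for _ in range(iterations):
        new = {v: (1 - d) / n for v in graph}
        for v, outs in graph.items():
            if outs:
                share = d * rank[v] / len(outs)
                for w in outs:
                    new[w] += share
            else:
                # Dangling vertex: spread its rank evenly over all vertices.
                for w in graph:
                    new[w] += d * rank[v] / n
        rank = new
    return rank

# Toy "friend graph": edges point from follower to followee.
g = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(g)
# "c" is linked by both "a" and "b", so it ends up ranked highest.
print(max(ranks, key=ranks.get))  # c
```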
MLlib provides multiple types of machine learning algorithms, including binary classification, regression, clustering and collaborative filtering.
• Supports functionality such as model evaluation and data import.
• Designed to scale out across a cluster.
• MLlib contains high-quality algorithms that leverage iteration, and can yield better results than the one-pass approximations sometimes used on MapReduce.
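The iterative point above can be seen in miniature with gradient descent, the workhorse behind several MLlib algorithms: each iteration is one more pass over the (cached) data, and the fit sharpens with every pass. The data points and learning rate below are invented example values; this is plain Python, not the MLlib API.

```python
# Toy iterative fit: gradient descent on mean squared error for y = w * x.
# Each loop iteration is one pass over the dataset; a single pass would
# leave w far from the true value of 2.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # points on the line y = 2x

w = 0.0                      # model weight, starting from zero
lr = 0.05                    # learning rate (made-up value)
for _ in range(200):         # 200 passes over the data
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= lr * grad

print(round(w, 3))  # ≈ 2.0
```

Spark makes this pattern cheap because the dataset can stay cached in cluster memory between passes, instead of being re-read from disk each iteration as in MapReduce.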
Spark SQL provides support for interacting with Spark via SQL as well as the Apache Hive variant of SQL, called the Hive Query Language (HiveQL).
• Spark SQL represents database tables as Spark RDDs and translates SQL queries into Spark operations.
• Spark SQL lets you query structured data as a distributed dataset (RDD) in Spark.
• Spark SQL includes a server mode with industry standard JDBC and ODBC connectivity.
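The translation Spark SQL performs can be illustrated in miniature: a SQL query becomes filter, projection, and sort operations over a distributed dataset. The sketch below evaluates one invented query over a hand-made list of row dicts in plain Python; the table, columns, and values are example data, and this is not the Spark SQL API.

```python
# Miniature of what Spark SQL does: turn a declarative query into
# operations over rows of structured data.
rows = [
    {"name": "ada", "lang": "scala", "commits": 120},
    {"name": "bob", "lang": "python", "commits": 45},
    {"name": "eve", "lang": "scala", "commits": 80},
]

# Equivalent of:
#   SELECT name FROM rows WHERE lang = 'scala' ORDER BY commits DESC
result = [r["name"]
          for r in sorted(rows, key=lambda r: -r["commits"])
          if r["lang"] == "scala"]

print(result)  # ['ada', 'eve']
```

In Spark SQL the same query would run against an RDD-backed table, with the filter and sort distributed across the cluster.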