Apache Spark is an open-source data analytics cluster computing framework originally developed in the AMPLab at UC Berkeley. Spark fits into the Hadoop open-source community, building on top of the Hadoop Distributed File System (HDFS). However, Spark is not tied to the two-stage MapReduce paradigm, and promises performance up to 100 times faster than Hadoop MapReduce for certain applications. Spark provides primitives for in-memory cluster computing that allow user programs to load data into a cluster's memory and query it repeatedly, making it well suited to machine learning algorithms.
Guide: Mrs. Juhi Singh
Submitted by: Hitesh Dua, CSE 4th Year, 05510402711
Spark has seen sustained exponential growth and is one of the most active Apache projects.
• Apache Spark is an open source parallel processing framework that enables users to run large-scale data analytics applications across clustered computers.
• Apache Spark can process data from a variety of data repositories. It supports in-memory processing to boost the performance of big data analytics applications, but it can also do conventional disk-based processing when data sets are too large to fit into the available system memory.
● Open Source
● Alternative to MapReduce for certain applications
● A low-latency cluster computing system for very large data sets
● Higher-level library for stream processing, through Spark Streaming
● May be 100 times faster than MapReduce for
– Iterative algorithms
– Interactive data mining
• Started as a research project at the UC Berkeley AMPLab in 2009, and was open sourced in early 2010.
• After being released, Spark grew a developer community on GitHub and entered Apache in 2013 as its permanent home.
• Codebase size
Spark : 20,000 LOC
Hadoop 1.0 : 90,000 LOC
• MapReduce greatly simplified big data analysis.
• But as soon as it got popular, users wanted more:
» More complex, multi-stage applications (e.g. iterative graph algorithms and machine learning)
» More interactive ad-hoc queries
• Both multi-stage and interactive apps require faster data sharing across parallel jobs.
• Resilient Distributed Datasets (RDDs) are the basic building block.
Distributed collections of objects that can be cached in memory across cluster nodes.
Automatically rebuilt on failure.
• RDD operations
Transformations: Create a new dataset from an existing one, e.g. map.
Actions: Return a value to the driver program after running a computation on the dataset, e.g. reduce.
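To make the transformation/action split concrete, here is a minimal sketch in plain Python. The class name `ToyRDD` and its internals are invented for illustration; it imitates Spark's lazy evaluation (transformations only record work, actions trigger it) on a single machine, and is not the real Spark API.

```python
# Toy, single-machine analogue of the RDD model: transformations are lazy,
# actions force computation. Illustrative only; real RDDs are distributed
# and fault-tolerant.
from functools import reduce as _reduce

class ToyRDD:
    def __init__(self, data):
        self._data = list(data)
        self._pending = []          # recorded transformations, not yet run

    def map(self, fn):
        # Transformation: records the step and returns a new dataset lazily.
        child = ToyRDD(self._data)
        child._pending = self._pending + [fn]
        return child

    def collect(self):
        # Action: actually runs the recorded pipeline.
        out = self._data
        for fn in self._pending:
            out = [fn(x) for x in out]
        return out

    def reduce(self, fn):
        # Action: folds the computed dataset down to a single value.
        return _reduce(fn, self.collect())

rdd = ToyRDD([1, 2, 3, 4])
squared = rdd.map(lambda x: x * x)          # transformation: nothing runs yet
total = squared.reduce(lambda a, b: a + b)  # action: triggers the computation
print(total)  # 1 + 4 + 9 + 16 = 30
```

In real Spark the same shape appears as `sc.parallelize([1,2,3,4]).map(...).reduce(...)`, with the work spread across the cluster.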
Spark: Programming Model
Spark Stack Extension
Spark powers a stack of high-level tools including:
• Spark SQL
• Spark Streaming
• MLlib for machine learning
• GraphX
You can combine these frameworks seamlessly in the same application.
• Spark Streaming is a Spark component that enables processing live streams of data.
• Examples of data streams include log files generated by production web servers, or queues of messages containing status updates posted by users of a web service.
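Spark Streaming works by cutting a live stream into a series of small batches ("micro-batches") and running a computation on each. The loop below imitates that idea in plain Python on an invented list of web-server status lines; it is an analogue of the concept, not the Spark Streaming API.

```python
# Micro-batch sketch: process a "stream" a few records at a time while
# maintaining running state across batches, as Spark Streaming does.
from collections import Counter

def micro_batches(lines, batch_size):
    """Yield successive fixed-size batches from the stream."""
    for i in range(0, len(lines), batch_size):
        yield lines[i:i + batch_size]

# Fake stream of HTTP status codes (stand-in for a real log source).
stream = ["200", "404", "200", "500", "200", "404"]

running = Counter()                       # state carried across batches
for batch in micro_batches(stream, batch_size=2):
    running.update(batch)                 # per-batch computation

print(running["200"])  # 3
```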
GraphX is a library added in Spark 0.9 that provides an API for manipulating graphs
(e.g., a social network’s friend graph) and performing graph-parallel computations.
• Allows us to create a directed graph with arbitrary properties attached to each
vertex and edge.
• GraphX also provides a set of operators for manipulating graphs.
• It includes a library of common graph algorithms (e.g., PageRank and triangle counting).
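As a feel for the kind of graph-parallel computation GraphX runs at scale, here is a tiny single-machine PageRank over an adjacency list. The three-node graph and the damping factor are made-up example values; real GraphX distributes this across the cluster.

```python
# Minimal PageRank sketch: each iteration, every vertex spreads its rank
# along its out-edges; a damping factor d models random jumps.
def pagerank(graph, iterations=50, d=0.85):
    n = len(graph)
    rank = {v: 1.0 / n for v in graph}
    for _ in range(iterations):
        new = {v: (1 - d) / n for v in graph}
        for v, outs in graph.items():
            if outs:
                share = d * rank[v] / len(outs)
                for w in outs:
                    new[w] += share
            else:
                # Dangling vertex: spread its rank evenly over all vertices.
                for w in graph:
                    new[w] += d * rank[v] / n
        rank = new
    return rank

# Toy "friend graph": edges point from follower to followee.
g = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(g)
# "c" is linked by both "a" and "b", so it ends up ranked highest.
print(max(ranks, key=ranks.get))  # c
```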
MLlib provides multiple types of machine learning algorithms, including binary classification, regression, clustering and collaborative filtering.
• Supports functionality such as model evaluation and data import.
• Designed to scale out across a cluster.
• MLlib contains high-quality algorithms that leverage iteration, and can yield better results than the one-pass approximations sometimes used on MapReduce.
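The iterative point above can be seen in miniature with gradient descent, the workhorse behind several MLlib algorithms: each iteration is one more pass over the (cached) data, and the fit sharpens with every pass. The data points and learning rate below are invented example values; this is plain Python, not the MLlib API.

```python
# Toy iterative fit: gradient descent on mean squared error for y = w * x.
# Each loop iteration is one pass over the dataset; a single pass would
# leave w far from the true value of 2.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # points on the line y = 2x

w = 0.0                      # model weight, starting from zero
lr = 0.05                    # learning rate (made-up value)
for _ in range(200):         # 200 passes over the data
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= lr * grad

print(round(w, 3))  # ≈ 2.0
```

Spark makes this pattern cheap because the dataset can stay cached in cluster memory between passes, instead of being re-read from disk each iteration as in MapReduce.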
Spark SQL provides support for interacting with Spark via SQL as well as the Apache Hive variant of SQL, called the Hive Query Language (HiveQL).
• Spark SQL represents database tables as Spark RDDs and translates SQL queries into Spark operations.
• Spark SQL lets you query structured data as a distributed dataset (RDD) in Spark.
• Spark SQL includes a server mode with industry standard JDBC and ODBC connectivity.
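The translation Spark SQL performs can be illustrated in miniature: a SQL query becomes filter, projection, and sort operations over a distributed dataset. The sketch below evaluates one invented query over a hand-made list of row dicts in plain Python; the table, columns, and values are example data, and this is not the Spark SQL API.

```python
# Miniature of what Spark SQL does: turn a declarative query into
# operations over rows of structured data.
rows = [
    {"name": "ada", "lang": "scala", "commits": 120},
    {"name": "bob", "lang": "python", "commits": 45},
    {"name": "eve", "lang": "scala", "commits": 80},
]

# Equivalent of:
#   SELECT name FROM rows WHERE lang = 'scala' ORDER BY commits DESC
result = [r["name"]
          for r in sorted(rows, key=lambda r: -r["commits"])
          if r["lang"] == "scala"]

print(result)  # ['ada', 'eve']
```

In Spark SQL the same query would run against an RDD-backed table, with the filter and sort distributed across the cluster.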