2. Introduction to Big Data
   - What is Big Data?
   - What makes data "Big Data"?
3. Big Data Definition
   - There is no single standard definition.
   - Big Data is data whose scale, diversity, and complexity require new architectures, techniques, algorithms, and analytics to manage it and to extract value and hidden knowledge from it.
4. Characteristics of Big Data: 1-Scale (Volume)
   - Data volume: a 44x increase from 2009 to 2020, from 0.8 zettabytes (ZB) to 35 ZB.
   - The volume of collected and generated data is increasing exponentially.
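As a quick check on the figures above, the sketch below computes the growth factor from 0.8 ZB to 35 ZB and the constant yearly growth rate that would produce it between 2009 and 2020. The two endpoints come from the slide; the constant-rate model is an assumption made only for illustration.

```python
# Endpoints taken from the slide; the constant yearly rate is an assumption.
start_zb = 0.8          # data volume in 2009, in zettabytes
end_zb = 35.0           # data volume in 2020, in zettabytes
years = 2020 - 2009     # 11 years between the two endpoints

# Ratio of the two volumes: the "44x" on the slide is this value rounded.
growth_factor = end_zb / start_zb

# Compound annual growth rate r such that start * (1 + r) ** years == end.
annual_rate = growth_factor ** (1 / years) - 1

print(f"growth factor: {growth_factor:.2f}x")
print(f"implied annual growth: {annual_rate:.1%}")
```

Under this assumption the volume grows by roughly two fifths every year, which is what "exponential" means here: the same percentage increase each year, not the same absolute increase.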
5. Characteristics of Big Data: 2-Complexity (Variety)
   - Various formats, types, and structures: text, numerical, images, audio, video, sequences, time series, social media data, multi-dimensional arrays, etc.
   - Static data vs. streaming data
   - A single application can generate or collect many types of data.
   - To extract knowledge, all these types of data need to be linked together.
6. Characteristics of Big Data: 3-Speed (Velocity)
   - Data is being generated fast and needs to be processed fast.
   - Online data analytics
   - Late decisions mean missed opportunities.
   - Examples:
     - E-promotions: based on your current location, your purchase history, and what you like, send promotions right now for the store next to you.
     - Healthcare monitoring: sensors monitor your activities and body; any abnormal measurement requires an immediate reaction.
7. Big Data: the 3Vs (Volume, Variety, Velocity)
8. Some Make It 4Vs
9. Who's Generating Big Data?
   - Social media and networks (all of us are generating data)
   - Scientific instruments (collecting all sorts of data)
   - Mobile devices (tracking all objects all the time)
   - Sensor technology and networks (measuring all kinds of data)
   Progress and innovation are no longer hindered by the ability to collect data, but by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion.
10. What Technology Do We Have for Big Data?
12. Which Movie Do You Like?
   - Designing a movie recommendation system
13. Can you describe the movie you would like?
14. Recommender Systems
   - Movie problem: find movies similar to my taste.
   - Movies have many features: Western, Clint Eastwood, Tarantino, 90s, ...
   - A viewer has preferences over those features: likes Westerns; hates ... movies.
   - Matching movie features to viewer preferences is content-based filtering.
   Netflix Prize (from Wikipedia, the free encyclopedia): The Netflix Prize was an open competition for the best collaborative filtering algorithm to predict user ratings for films, based on previous ratings without any other information about the users or films, i.e. without the users or the films being identified except by numbers assigned for the contest. The competition was held by Netflix, an online DVD-rental service, and was open to anyone not connected with Netflix (current and former employees, agents, close relatives of Netflix employees, etc.) or a resident of Cuba, Iran, Syria, North Korea, Burma or Sudan.[1] On 21 September 2009, the grand prize of US$1,000,000 was given to the BellKor's Pragmatic Chaos team which bested Netflix's own algorithm for predicting ratings by 10.06%.[2]
15. A Highly Simple Solution
   Known movies and Saurav's ratings:
     Comedy  Action  Blockbuster  ...  Is Tom Cruise the Lead?  Saurav
     6       5       0                 1                        2
     7       8       1                 0                        8
   Saurav's Score = 0.2*Comedy + 0.1*Action + 10*Blockbuster + ... - 0.9*Tom Cruise
   New movie:
     Comedy  Action  Blockbuster  ...  Is Tom Cruise the Lead?  Saurav
     2       8       0                 0                        7
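The scoring formula on this slide is a weighted sum of movie features, which can be sketched in a few lines. The weights below are the ones shown in the slide's formula; the terms the slide elides are simply left out, and the two movies with their feature values are hypothetical, chosen only for illustration.

```python
# Weights copied from the slide's formula; the terms it elides are omitted here.
WEIGHTS = {
    "comedy": 0.2,
    "action": 0.1,
    "blockbuster": 10.0,
    "tom_cruise_lead": -0.9,
}


def score(movie, weights=WEIGHTS):
    """Weighted sum of a movie's feature values (a linear scoring model)."""
    return sum(weight * movie.get(name, 0.0) for name, weight in weights.items())


# Hypothetical movies and feature values, for illustration only.
movies = {
    "movie_a": {"comedy": 6, "action": 5, "blockbuster": 0, "tom_cruise_lead": 1},
    "movie_b": {"comedy": 2, "action": 8, "blockbuster": 1, "tom_cruise_lead": 0},
}

# Recommend by sorting movies from highest to lowest predicted score.
ranked = sorted(movies, key=lambda title: score(movies[title]), reverse=True)
for title in ranked:
    print(title, round(score(movies[title]), 2))
```

With these weights the blockbuster flag dominates the score, so any blockbuster outranks any non-blockbuster. In a real system the weights would be learned from a viewer's past ratings rather than set by hand.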
16. Quiz #1: Is Google search a recommender system?
17. Supervised Learning
   - Design an accurate vending machine.
   - This is a classification problem.
   - The line that separates the classes is called the decision boundary or separating hyperplane.
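To make the vending machine example concrete, here is a minimal sketch of a classifier that tells two kinds of coins apart. It uses a nearest-centroid rule, which is one simple way to get a linear decision boundary (the perpendicular bisector between the two class centroids); the slide does not say which algorithm it has in mind. The coin measurements and the labels "small" and "large" are hypothetical.

```python
# Hypothetical training data: (weight in grams, diameter in mm) -> coin label.
training = [
    ((2.5, 19.0), "small"),
    ((2.6, 19.2), "small"),
    ((2.4, 18.9), "small"),
    ((5.6, 24.2), "large"),
    ((5.7, 24.3), "large"),
    ((5.5, 24.1), "large"),
]


def fit_centroids(samples):
    """Average the feature vectors of each class (the 'training' step)."""
    sums, counts = {}, {}
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def classify(features, model):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def dist2(center):
        return sum((f - c) ** 2 for f, c in zip(features, center))
    return min(model, key=lambda label: dist2(model[label]))


model = fit_centroids(training)
print(classify((2.5, 19.1), model))  # a coin close to the "small" examples
print(classify((5.4, 24.0), model))  # a coin close to the "large" examples
```

The labeled examples are what makes this supervised learning: the machine is shown coins together with the correct answer, and then has to label coins it has not seen.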
18. Quiz #2: Give an example where you think supervised learning is used.
   - Hint: spam vs. ham in emails
19. Some Common Supervised Algorithms
   - Classification: Decision Trees, Random Forest, Support Vector Machine, Neural Network, Logistic Regression
   - Regression: Linear Regression, Non-linear Regression, Logistic Regression
   - Association Rule Learning: Arules, Event Sequence Analysis
20. In Action: Handwriting Recognition System
   - Classification: what is the input? What is the output?
   - [Figure: numeric values (features) paired with labels]
21. Note the Similarity
   - Classification algorithms try to separate items into classes.
22. Demo
23. Quiz #3: Is driving a driverless car a learning problem?
   - What are the features?
   - What is the label?
26. Clustering
   - Cluster: a collection (group) of data objects or points that are
     - similar (or related) to one another within the same group, and
     - dissimilar (or unrelated) to the objects in other groups.
   - Cluster analysis: finding similarities between data according to the characteristics underlying the data, and grouping similar data objects into clusters.
   - Unsupervised learning: there are no predefined classes for a training data set.
   - Two general tasks: identify the natural number of clusters, and properly group objects into sensible clusters.
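The grouping task described above can be sketched with k-means, one common clustering algorithm (the slide does not name a specific one). The sketch assumes the number of clusters `k` is already known, so it covers only the second of the two tasks. The points are hypothetical, and starting from the first `k` points is a simplification made for this example.

```python
def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=10):
    """Plain k-means: alternate between assigning points and updating centers."""
    centers = [points[i] for i in range(k)]  # naive start: the first k points
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: dist2(p, centers[j]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centers[j]  # keep the old center if a cluster is empty
            for j, cluster in enumerate(clusters)
        ]
    return centers, clusters


# Hypothetical 2-D points forming two visibly separate groups.
points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (0.8, 1.1), (8.2, 7.9), (7.9, 8.3)]
centers, clusters = kmeans(points, k=2)
for center, cluster in zip(centers, clusters):
    print(center, cluster)
```

No labels are supplied anywhere in this code: the groups come only from the distances between points, which is what makes it unsupervised.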
27. Plot
28. Quiz #4: How many types (species) of flowers are there?
31. Quiz #5: Which of the following are supervised and which are unsupervised?
   - Take a collection of 1000 essays written on the US economy, and find a way to automatically group these essays into a small number of groups of essays that are somehow "similar" or "related".
   - Examine a large collection of emails that are known to be spam, to discover whether there are sub-types of spam mail.
   - Given historical data of children's ages and heights, predict children's height as a function of their age.
   - Have a computer examine an audio clip of a piece of music, and classify whether there are vocals (i.e., a human voice singing) in that clip, or whether it is a clip of only musical instruments (and no vocals).
   - Given a set of news articles from many different news websites, find out what the main topics covered are.
   - Suppose you are working on weather prediction, and you would like to predict whether or not it will be raining at 5pm tomorrow. You want to use a learning algorithm for this. Would you treat this as a classification or a regression problem?
32. Where is Big Data???
33. Let's Start from (Big) Data
   - How do you design this system?
   - How do you pay for this?
   - How do you trust someone to do it right?
   - How expensive will such a system be?
   - I need data: good, reusable, high-quality data. Otherwise all the smarts are wasted.
34. Here Comes BIG Data to Help
   - Image
   - Audio
   - Learning from HUGE data sets