1st Birmingham Big Data Science Group meetup

Welcome to the Birmingham Big Data Science Group (BIDS)

Faizan Javed, 5/25/2011

Intermark Group

Sponsor: Intermark Group

BIDS Stats

• Founded April 10, 2011

• 9 members (and counting)

• Founder: Faizan Javed, Co-Founder: Qasim Ijaz

• Online presence:

  • Meetup.com (for coordinating meetups): http://www.meetup.com/bham-bids

  • LinkedIn (for related articles and announcements): http://www.linkedin.com/groups/Birmingham-Big-Data-Science-Group-3865219

  • Facebook (for related articles and announcements): http://www.facebook.com/home.php?sk=group_202221519811444







Agenda

• What is Big Data?

• Quick overview of related technologies:

  • Large-scale distributed systems and platforms

  • NoSQL data stores

  • Intelligent algorithms/web-mining/information retrieval techniques

  • Highly-scalable systems

What is Big Data?

• More people connected to the internet

• Social media explosion (Web 2.0): Facebook, Twitter, etc.

• Huge volumes of data being collected: sensors, mobile devices, machine-to-machine communications, and web logs from social media and retail sites (for browsing patterns)

• "Big" in Big Data is relative: today's "big" is certainly tomorrow's "medium" and next week's "small."

• "Big Data" is when the size of the data itself becomes part of the problem. Going from gigabytes to petabytes! Source: http://radar.oreilly.com/2010/06/what-is-data-science.html


Big Data, Big Numbers

McKinsey report, May 2011: http://www.mckinsey.com/mgi/publications/big_data/index.asp


Why care about big data?

• Deep analysis of data can be a competitive advantage.

• More data makes it easier to find consistent patterns

• More data usually beats better algorithms

• Ex 1: Predict customer preferences and target ads on an ecommerce website.

• Ex 2: Improve search quality.

• Ex 3: Bank risk modeling (aggregating customer activity from different lines of business)

http://blog.mikepearce.net/2010/08/18/10-hadoop-able-problems-a-summary/

http://www.ft.com/intl/cms/s/0/64095dba-7cd5-11e0-994d-00144feabdc0.html#axzz1NHn8icSC

Key point: “Many different sources” & “unstructured data”


Big Players on the Big Data Scene

The Government

http://us1.campaign-archive1.com/?u=4cb4c08d876d7481bbc4bc70f&id=6889126aef


The need for new techniques

• Traditional "relational" techniques break down at scale.

Solutions:

• NoSQL databases: Cassandra, HBase, Riak, etc.

• Large-scale "commodity" scale-out distributed computing techniques: MapReduce/Hadoop, Percolator, etc.

• Analytics platforms: IBM BigInsights, EMC Greenplum

The NoSQL revolution

http://www.infoq.com/news/2011/04/newsql


Prominent NoSQL database users

• Cassandra: Facebook, Twitter, Rackspace, Reddit, Digg.com

• Riak: Mozilla, Ask.com, Comcast

• Voldemort: LinkedIn

• MongoDB: Foursquare, Etsy, bit.ly, Intuit

• HBase: StumbleUpon, Twitter, Infolinks, Adobe, Meetup.com

Hadoop-based SMAQ stack

http://radar.oreilly.com/2010/09/the-smaq-stack-for-big-data.html

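    // Word-count reducer. Nested inside the job's driver class; needs
    // java.io.IOException, org.apache.hadoop.io.IntWritable,
    // org.apache.hadoop.io.Text and org.apache.hadoop.mapreduce.Reducer.
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum every count emitted by the mappers for this key (word).
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            // Emit the word with its total count.
            context.write(key, new IntWritable(sum));
        }
    }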
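To make the data flow around that reducer concrete, here is a minimal in-memory sketch of the map, shuffle and reduce phases for word count. It is plain Python, not Hadoop: the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration, and a real job distributes these phases across many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every whitespace-separated token."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: sum the counts for one key, mirroring the Java reducer."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

Calling word_count(["big data", "big numbers"]) counts "big" twice and the other two words once each.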


Hadoop-based SMAQ stack

• Hadoop comes with HDFS (Hadoop Distributed File System).

• Can be used alongside various NoSQL systems (HBase is the most common).

Hadoop-based SMAQ stack

• Pig (Yahoo):

    input = LOAD 'input/sentences.txt' USING TextLoader();
    words = FOREACH input GENERATE FLATTEN(TOKENIZE($0));
    grouped = GROUP words BY $0;
    counts = FOREACH grouped GENERATE group, COUNT(words);
    ordered = ORDER counts BY $0;
    STORE ordered INTO 'output/wordCount' USING PigStorage();

• Hive (Facebook):

    INSERT OVERWRITE TABLE xyz_com_page_views
    SELECT page_views.*
    FROM page_views
    WHERE page_views.date >= '2008-03-01'
      AND page_views.date <= '2008-03-31'
      AND page_views.referrer_url LIKE '%xyz.com';
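As a rough illustration of what the Hive query selects, the sketch below applies the same filter to a small in-memory list. It is plain Python, not Hive: the row layout (dicts with "date" and "referrer_url" keys) and the function name select_xyz_page_views are assumptions for illustration only. Dates are compared as ISO-formatted strings, as in the query, and LIKE '%xyz.com' is treated as "ends with xyz.com".

```python
def select_xyz_page_views(page_views, start="2008-03-01", end="2008-03-31",
                          referrer_suffix="xyz.com"):
    """Keep rows dated within [start, end] whose referrer URL ends with the suffix."""
    return [
        row for row in page_views
        if start <= row["date"] <= end
        and row["referrer_url"].endswith(referrer_suffix)
    ]


# Hypothetical sample rows, made up for this sketch.
sample_page_views = [
    {"date": "2008-03-15", "referrer_url": "http://www.xyz.com"},
    {"date": "2008-04-02", "referrer_url": "http://www.xyz.com"},
    {"date": "2008-03-20", "referrer_url": "http://www.example.org"},
]
march_xyz_views = select_xyz_page_views(sample_page_views)
```

Only the first sample row passes both conditions. Hive compiles the equivalent query into MapReduce jobs so the same filter can run over data far too large for one machine.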

Next-generation systems: going beyond MapReduce/Hadoop

http://www.nytimes.com/external/gigaom/2010/10/23/23gigaom-beyond-hadoop-next-generation-big-data-architectu-81730.html

• Mostly Google and Yahoo innovations.

• Percolator: "real-time" MapReduce. Powers Google Instant.

• Dremel: a superfast "Hive" for interacting with large datasets. In-house at Google.

• Pregel: highly efficient graph computing for analyzing social graphs. In-house at Google. Open-source projects available.

• Megastore: a scalable NoSQL-like system with ACID semantics within partitions but weaker consistency across partitions. In-house at Google.

• Next-gen Hadoop at Yahoo: enhanced scalability (going beyond 4,000-node clusters), support for multiple programming paradigms, enhanced cluster utilization.


Intelligent Web & machine learning

• Recommendation systems, data/web mining, natural language processing

• Recommendation systems:

  • A type of collaborative filtering/information retrieval technique.

  • Uses user profiles, ratings, and browsing habits to recommend items not yet considered.

  • First made famous in the commercial arena by Amazon.com
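The idea of recommending "items not yet considered" from other users' ratings can be sketched with a small user-based collaborative filter. This is an illustrative toy under stated assumptions, not how Amazon.com or any other site implements it: the ratings layout (user -> {item: rating}) and the function names cosine and recommend are made up for this example.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two users' rating dicts (item -> rating)."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[item] * b[item] for item in shared)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def recommend(ratings, user, top_n=3):
    """Rank items the user has not rated by similarity-weighted ratings of other users."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        similarity = cosine(ratings[user], other_ratings)
        if similarity <= 0:
            continue  # ignore users with nothing in common
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + similarity * rating
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]
```

For example, if "ann" and "bob" both rated items A and B highly and only "bob" rated C, then C is recommended to "ann", while items rated only by users with no overlap are ignored.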

Amazon.com & Netflix recommendation systems

Foursquare (3/2011) and Google Places (5/2011)

http://engineering.foursquare.com/2011/03/22/building-a-recommendation-engine-foursquare-style/

http://places.blogspot.com/2011/05/discover-more-places-youll-like-based.html


Hot area! Netflix and Overstock.com competitions

Search Engines (Google, Bing, Wolfram, Lucene/Nutch, etc.)

Search innovations @ LinkedIn

http://thenoisychannel.com/2010/01/31/linkedin-search-a-look-beneath-the-hood/

http://blog.linkedin.com/2009/12/14/linkedin-faceted-search/

• Uses open-source Lucene project for social graph search and real-time indexing and searching.

• Dynamic filters automatically generated based on your query results!
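The "dynamic filters" idea (faceted search) can be sketched as counting field values over whatever the query matched, so the filter options always reflect the current results. The code below is a plain-Python stand-in, not Lucene and not LinkedIn's implementation: the profile fields ("title", "company", "location") and the function names search and facets are hypothetical.

```python
from collections import Counter


def search(profiles, query):
    """Naive match: keep profiles whose title contains the query (case-insensitive)."""
    needle = query.lower()
    return [p for p in profiles if needle in p["title"].lower()]


def facets(results, fields=("company", "location")):
    """Count each field's values across the results; these counts become the filters."""
    return {field: Counter(p[field] for p in results) for field in fields}


# Hypothetical profiles, made up for this sketch.
profiles = [
    {"title": "Data Engineer", "company": "Acme", "location": "Birmingham"},
    {"title": "Software Engineer", "company": "Acme", "location": "Atlanta"},
    {"title": "Search Engineer", "company": "Globex", "location": "Birmingham"},
    {"title": "Analyst", "company": "Globex", "location": "Atlanta"},
]
engineer_facets = facets(search(profiles, "engineer"))
```

Here a query for "engineer" yields the company filters Acme (2) and Globex (1); a different query would produce different filter options and counts.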


http://blog.linkedin.com/2010/03/05/designing-linkedin-faceted-search/


Conclusion

• Big Data is a very challenging and promising area

• Can be used to get a competitive advantage

• Often brings about advances in computer science

• Vast area of topics: NoSQL systems, large-scale distributed computing systems, highly scalable web system designs

• Machine learning techniques: search engines, recommender systems
