BIG DATA & NEXT GENERATION ANALYTICS
Krishna Kulkarni, Keith W. Hare
ISO/IEC JTC1 SC32 Opening Plenary
May 27, 2013, Gyeongju, Korea
ISO/IEC JTC1 SC32N2383



Introduction
• Goal of this talk is to provide additional input to the discussion.
• “Next Generation Analytics is essentially dealing with Big Data – with the same concepts for predictive analysis in discovering hidden patterns, discovering unknown correlations by analyzing huge volumes of transactional data and other untapped data (data mining, data warehouses, unstructured etc.), and, essentially using the same toolsets (NoSQL, Hadoop etc.).” Baba Piprani


Next Generation Analytics Goals
• Cost of acquiring and storing data is rapidly decreasing
• Enterprises are collecting huge amounts of extremely fine-grained data.
• Enable enterprises to get newer actionable business insights from vast amounts of raw fine-grained data dramatically faster than is possible today


Sample use case – Retail
• Utilize transactional and query logs collected by retail companies
  • Finer segmentation of customers for direct marketing campaigns
  • Generate differentiated pricing structures
  • Predicting future customer demands.


For retail use cases…
• Critical to narrow time gap between:
  • Data acquisition
  • And acting on a business decision based on the data.
• Referred to as:
  • Near Real-time Business Analytics
  • Or Operational Business Intelligence
• For example, a retailer would
  • Decide on promotions for the next week based on the data collected during this week
  • For on-line stores, take action based on data even more quickly
  • Real time marketing e.g. as customers are walking down the street


Sample use case – Medical
• Cancer treatment regimen
  • 100% effective in 80% of the patients
  • Completely ineffective in 20% of patients
• Need to identify the 20%
  • Sufficient to identify correlations
  • Causations can come later
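
Illustrative sketch (not part of the original slides): one simple way to look for a correlation is to compare response rates across the values of a candidate feature. The patient records, the "marker" field, and the function name are hypothetical; the data is synthetic and chosen only to mirror the 80%/20% split above.

```python
from collections import Counter


def response_rate_by_feature(patients, feature):
    """Return {feature value: fraction of patients who responded}."""
    totals = Counter()
    responders = Counter()
    for patient in patients:
        value = patient[feature]
        totals[value] += 1
        responders[value] += int(patient["responded"])
    return {value: responders[value] / totals[value] for value in totals}


# Synthetic records (assumption): a made-up "marker" that happens to
# separate responders from non-responders perfectly.
patients = [
    {"marker": "A", "responded": True},
    {"marker": "A", "responded": True},
    {"marker": "A", "responded": True},
    {"marker": "A", "responded": True},
    {"marker": "B", "responded": False},
]

rates = response_rate_by_feature(patients, "marker")
```

A large gap between the rates flags the feature as worth investigating; it says nothing about causation, which matches the point that causation can come later.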


Requirements for Achieving Goals
• Handling diverse data formats/structures
• Handling high speed of data collection
• Analytics capability beyond what is offered by traditional business intelligence
• Low cost, highly scalable analytics platforms
• Heterogeneous infrastructure


Diversity of data
• Small fraction is in structured formats: relational, XML, etc.
• Fair amount is semi-structured, such as web logs, etc.
• Rest of the data is unstructured: text, photographs, etc.
Very difficult to implement a single data model that can handle the diversity


Velocity of data
• Continuously streaming data
  • Need to analyze data in-flight
  • Combine with data at-rest
• Need a good answer quickly
  • A precisely correct answer
    • May not exist
    • May not be required
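
Illustrative sketch (not part of the original slides): a running aggregate is one minimal way to combine data at rest with data in flight and still have a usable answer after every new value. The class name and the sample numbers are assumptions made for the example.

```python
class RunningMean:
    """Incrementally maintained mean, optionally seeded from stored data."""

    def __init__(self, count=0, total=0.0):
        self.count = count
        self.total = total

    def update(self, value):
        """Fold one in-flight value into the aggregate and return the new estimate."""
        self.count += 1
        self.total += value
        return self.estimate()

    def estimate(self):
        """Current best answer, or None if nothing has been seen yet."""
        return self.total / self.count if self.count else None


at_rest = [4.0, 6.0]   # hypothetical values already stored
stream = [8.0, 2.0]    # hypothetical values arriving in flight

mean = RunningMean(count=len(at_rest), total=sum(at_rest))
estimates = [mean.update(value) for value in stream]
```

Each update costs constant time and memory, so an estimate is available immediately instead of after a batch recomputation over all stored data.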


Analytics capability
• Current technologies are not sufficient or are too static:
  • Business Intelligence (BI) techniques
  • Data Warehousing (static, batch oriented style)
  • Built-in analytic functions in SQL
  • Data Mining
• “Machine learning” viewed as
  • key technology
  • will unlock novel insights in data.
• Statistical packages
  • R Project – open source
  • SAS – proprietary
  • SPSS – proprietary
Effective leveraging of the machine learning tool kits requires understanding of probability and statistics.


Significant challenges in identifying deep insights from data
• How to identify relevant fragments of data easily from a multitude of data sources?
• How to use data cleaning techniques across multiple data sources?
• How to sample results of a query progressively?
• How to obtain rich visualization?
Best successes so far have been vertically integrated machine learning software packages for use in specific use cases, e.g., detection of credit card fraud


Significant Challenges in Storing Data
• Next Generation Analytics Operate on “Big Data”
• Data Storage May Span
  • Multiple Servers
  • Multiple Storage sub systems
  • Multiple data centers
• NoSQL Databases often used to store “Big Data”
  • Large variety of products
  • Diverse sets of features
  • No standard interface


Low Cost, Highly Scalable Analytics Platforms

• Infrastructure based on MapReduce framework emerging as a popular retrieval and consolidation solution

• However, this infrastructure is very low-level
  • Responsibility for exploiting the platform is on the user
  • Lacks much of the maturity of the relational world.

Integration with existing relational/BI platforms is a must for long-term success


Significant Challenges for Retrieving Data

• MapReduce
  • Framework for managing partitioned query & retrieval of distributed data
  • Retrieves data from distributed data stores and presents it to the analysis layer
  • Custom Map operation
  • Custom Reduce operation
  • No high level declarative language
  • Languages specific to underlying data stores
No automated way to apply MapReduce to extremely complex questions
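
Illustrative sketch (not part of the original slides): the custom Map and custom Reduce operations can be pictured with a single-process simulation. This is not Hadoop or any real framework API. The function names, the (store, amount) record layout, and the partitions are hypothetical.

```python
from collections import defaultdict


def map_sale(record):
    """Custom Map: emit one (store, amount) pair per sales record."""
    store, amount = record
    yield store, amount


def reduce_total(store, amounts):
    """Custom Reduce: collapse all amounts for one store into a total."""
    return store, sum(amounts)


def run_mapreduce(partitions, map_fn, reduce_fn):
    """Apply map_fn to every record, group by key, then apply reduce_fn per key."""
    groups = defaultdict(list)
    for partition in partitions:          # each partition stands in for one data store
        for record in partition:
            for key, value in map_fn(record):
                groups[key].append(value)  # the "shuffle" step: group values by key
    return dict(reduce_fn(key, values) for key, values in groups.items())


# Hypothetical sales records spread over two partitions.
partitions = [
    [("north", 10), ("south", 5)],
    [("north", 7), ("east", 3)],
]

totals = run_mapreduce(partitions, map_sale, reduce_total)
```

Even for this simple question the user writes both operations by hand, which illustrates the lack of a high level declarative language noted above.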


Summary
• Community experimentation and understanding are evolving rapidly
• Need complete eco-system to make this all work
• Standards are essential – Niche solutions will lead to vendor lock-in


How the pieces fit together

[Diagram with boxes labeled: Statistical Analysis Engine; Machine Learning Engine; Big Data / NoSQL; Relational / XML; Data Retrieval & Summary / MapReduce]


Sources
• Chaudhuri, S., "What next?: A half-dozen data management research goals for big data and the cloud", In Proceedings of the 31st Symposium on Principles of Database Systems, ACM, 2012.
• “Big Data Now: 2012 Edition”, http://oreilly.com/data/radarreports/big-data-now-2012.csp


Additional Discussion…
• The following slides were incomplete and beyond the scope of this presentation, but worth preserving for future discussions.


Domain, Range, & Function
• In traditional mathematics
  • Given a domain and a function, solve for range
  • Given a domain and a range, identify a function, if it exists
• Example:
  • Given the set of pairs {(2,-3),(4,6),(3,-1),(6,6),(2,3)}
    • Domain of the relation is the set {2,3,4,6}
    • Range is {-3,-1,3,6}
    • Answer is no, the relation is not a function
      • One X value (2) produces 2 different Y values
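
Illustrative sketch (not part of the original slides): the example can be checked mechanically by collecting, for each X value, the set of Y values it maps to. The function name is an assumption made for the example.

```python
from collections import defaultdict


def analyze_relation(pairs):
    """Return (domain, range, conflicts) for a set of (x, y) pairs.

    conflicts maps each x that has more than one y to its set of y values;
    the relation is a function exactly when conflicts is empty.
    """
    images = defaultdict(set)
    for x, y in pairs:
        images[x].add(y)
    domain = set(images)
    range_ = {y for ys in images.values() for y in ys}
    conflicts = {x: ys for x, ys in images.items() if len(ys) > 1}
    return domain, range_, conflicts


pairs = {(2, -3), (4, 6), (3, -1), (6, 6), (2, 3)}
domain, range_, conflicts = analyze_relation(pairs)
is_function = not conflicts
```

For the pairs above, the only conflict is at X = 2, which maps to both -3 and 3, so the relation is not a function.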


In Analytics
• Determine the range, given a set of candidate domains
• Solve for function that will give range for candidate domains.
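
Illustrative sketch (not part of the original slides): as a deliberately simple stand-in for "solving for the function", an ordinary least-squares line can be fitted to observed (domain, range) pairs. The choice of a straight line, the function name, and the sample values are all assumptions; real analytics problems would involve far richer function families and incomplete data.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observations: candidate domain values and the range values seen.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

slope, intercept = fit_line(xs, ys)


def predict(x):
    """Apply the fitted function to a new domain value."""
    return slope * x + intercept
```

The fitted function can then be applied to domain values that were never observed, which is the direction of inference described on this slide.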


National Security Example
• Range:
  • Find candidate national security issues related to attacks on American assets
• Candidate domains:
  • Banking records
  • Money flows
  • E-mail
  • Social Media Networks
  • Telephone Calls
  • Reports from human intelligence
  • Satellite photos
• Find function(s) that use those domains to produce the range
• Data is always incomplete


Cancer Research Example
• Range
  • Identify patients who will not respond to specific treatment
• Domains
  • Genotype
  • Health History
  • Family History
  • Geology of residence
  • Work history
• Find function(s) that use those domains to produce the range
• Data is always incomplete