Big data & Next generation analytics

Krishna Kulkarni, Keith W. Hare
ISO/IEC JTC1 SC32 Opening Plenary, May 27, 2013, Gyeongju, Korea


ISO/IEC JTC1 SC32N2383

Introduction

  • The goal of this talk is to provide additional input to the discussion.
  • Next Generation Analytics is essentially dealing with Big Data with the same concepts for predictive analysis in discovering hidden patterns, discovering unknown correlations by analyzing huge volumes of transactional data and other untapped data (data mining, data warehouses, unstructured etc.), and, essentially using the same toolsets (NoSQL, Hadoop etc.). (Baba Priprani)

Next Generation Analytics Goals

  • Cost of acquiring and storing data is rapidly decreasing.
  • Enterprises are collecting huge amounts of extremely fine-grained data.
  • Enable enterprises to get newer actionable business insights from vast amounts of raw fine-grained data dramatically faster than is possible today.

Sample use case: Retail

  • Utilize transactional and query logs collected by retail companies
  • Finer segmentation of customers for direct marketing campaigns
  • Generate differentiated pricing structures
  • Predict future customer demands

For retail use cases

  • Critical to narrow the time gap between data acquisition and acting on a business decision based on the data.
  • Referred to as Near Real-time Business Analytics or Operational Business Intelligence.
  • For example, a retailer would decide on promotions for the next week based on the data collected during this week.
  • For on-line stores, take action based on data even more quickly.
  • Real-time marketing, e.g. as customers are walking down the street.

Sample use case: Medical

  • Cancer treatment regimen
      • 100% effective in 80% of the patients
      • Completely ineffective in 20% of patients
  • Need to identify the 20%
  • Sufficient to identify correlations
  • Causations can come later

Requirements for Achieving Goals

  • Handling diverse data formats/structures
  • Handling high speed of data collection
  • Analytics capability beyond what is offered by traditional business intelligence
  • Low cost, highly scalable analytics platforms
  • Heterogeneous infrastructure

Diversity of data

  • A small fraction is in structured formats: Relational, XML, etc.
  • A fair amount is semi-structured, such as web logs, etc.
  • The rest of the data is unstructured: text, photographs, etc.
  • Very difficult to implement a single data model that can handle the diversity

Velocity of data

  • Continuously streaming data
  • Need to analyze data in-flight
  • Combine with data at-rest
  • Need a good answer quickly
  • A precisely correct answer:
      • May not exist
      • May not be required

Analytics capability

  • Current technologies are not sufficient or are too static:
      • Business Intelligence (BI) techniques
      • Data Warehousing (static, batch-oriented style)
      • Built-in analytic functions in SQL
      • Data Mining
  • Machine learning is viewed as a key technology that will unlock novel insights in data.
  • Statistical packages:
      • R Project (open source)
      • SAS (proprietary)
      • SPSS (proprietary)
  • Effective leveraging of the machine learning tool kits requires an understanding of probability and statistics.

Significant challenges in identifying deep insights from data

  • How to identify relevant fragments of data easily from a multitude of data sources?
  • How to use data cleaning techniques across multiple data sources?
  • How to sample results of a query progressively?
  • How to obtain rich visualization?
  • Best successes so far have been vertically integrated machine learning software packages for specific use cases, e.g., detection of credit card fraud.
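The slides ask how to sample the results of a query progressively, and note that a good answer delivered quickly may matter more than a precisely correct one. The following is a minimal sketch of one way to do that, not a method taken from the talk: it keeps a running average over a stream of values and reports an approximate 95% error margin after each batch. The function name `progressive_mean`, the batch size, the normal-approximation constant 1.96, and the synthetic purchase amounts are all assumptions made for illustration.

```python
import math
import random


def progressive_mean(values, batch_size=1000, z=1.96):
    """Yield (rows_seen, estimate, half_width) after each batch of rows.

    Hypothetical helper: the estimate is the running mean, and half_width is
    an approximate 95% margin based on a normal approximation (z = 1.96).
    """
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations (Welford's method)
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n % batch_size == 0:
            # Sample variance needs at least two rows
            variance = m2 / (n - 1) if n > 1 else 0.0
            yield n, mean, z * math.sqrt(variance / n)


if __name__ == "__main__":
    # Synthetic data standing in for a stream of purchase amounts (assumption)
    random.seed(0)
    amounts = (random.gauss(50.0, 10.0) for _ in range(10_000))
    for rows, estimate, half_width in progressive_mean(amounts, batch_size=2000):
        print(f"{rows} rows: mean ~ {estimate:.2f} +/- {half_width:.2f}")
```

A caller can stop consuming the generator as soon as the margin is small enough for the business decision at hand, which is the trade-off the "Velocity of data" slide describes.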

Significant Challenges in Storing Data

  • Next Generation Analytics operate on Big Data
  • Data storage may span:
      • Multiple servers
      • Multiple storage subsystems
      • Multiple data centers
  • NoSQL databases are often used to store Big Data
      • Large variety of products
      • Diverse sets of features
      • No standard interface

Low Cost, Highly Scalable Analytics Platforms

  • Infrastructure based on the MapReduce framework is emerging as a popular retrieval and consolidation solution
  • However, this infrastructure is very low-level
      • Responsibility for exploiting the platform is on the user
      • Lacks much of the maturity of the relational world
  • Integration with existing relational/BI platforms is a must for long-term success

Significant Challenges for Retrieving Data

  • MapReduce
      • Framework for managing partitioned query and retrieval of distributed data
      • Retrieves data from distributed data stores and presents it to the analysis layer
      • Custom Map operation
      • Custom Reduce operation
  • No high-level declarative language
      • Languages are specific to the underlying data stores
  • No automated way to apply MapReduce to extremely complex questions

Summary

  • Community experimentation and understanding are evolving rapidly
  • Need a complete ecosystem to make this all work
  • Standards are essential
  • Niche solutions will lead to vendor lock-in
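To make the "Custom Map operation" and "Custom Reduce operation" on the MapReduce slide concrete, here is a minimal single-process sketch of the pattern. It is not the API of Hadoop or any other product: `map_sale`, `reduce_sales`, `run_map_reduce`, and the record fields `store` and `amount` are hypothetical names chosen to echo the retail use case, and the "partitions" are plain in-memory lists rather than distributed data stores.

```python
from collections import defaultdict


def map_sale(record):
    """Custom Map operation (hypothetical): emit (key, value) pairs for one record."""
    yield record["store"], record["amount"]


def reduce_sales(key, values):
    """Custom Reduce operation (hypothetical): combine all values sharing a key."""
    return key, sum(values)


def run_map_reduce(partitions, mapper, reducer):
    """Run mapper over every partition, group by key, then apply reducer."""
    groups = defaultdict(list)
    # Map phase: each partition could be processed on a different server
    for partition in partitions:
        for record in partition:
            for key, value in mapper(record):
                # Shuffle phase: collect values under their key
                groups[key].append(value)
    # Reduce phase: one call per distinct key
    return dict(reducer(key, values) for key, values in groups.items())


if __name__ == "__main__":
    # Two in-memory lists standing in for data on two servers (assumption)
    partitions = [
        [{"store": "A", "amount": 10}, {"store": "B", "amount": 5}],
        [{"store": "A", "amount": 7}],
    ]
    print(run_map_reduce(partitions, map_sale, reduce_sales))
```

Even in this toy form the user has to write the map and reduce steps by hand for each question, which illustrates the slide's point that there is no high-level declarative language on top of the framework.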

How the pieces fit together

  [Diagram; labels: Statistical Analysis Engine, Machine Learning Engine, Data Retrieval & Summary, MapReduce, Big Data, NoSQL, Relational, XML]

Sources

  • Chaudhuri, S., "What next?: A half-dozen data management research goals for big data and the cloud", In Proceedings of the 31st Symposium on Principles of Database Systems, ACM, 2012.
  • Big Data Now: 2012 Edition

Additional Discussion

The following slides were incomplete and beyond the scope of this presentation, but worth preserving for future discussions.

Domain, Range, & Function

  • In traditional mathematics:
      • Given a domain and a function, solve for the range
      • Given a domain and a range, identify a function, if it exists
  • Example: given the set of pairs {(2,-3),(4,6),(3,-1),(6,6),(2,3)}
      • Domain of the relation is the set {2,3,4,6}
      • Range is {-3,-1,3,6}
      • Is the relation a function? The answer is no: one X value (2) produces two different Y values
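The example on the Domain, Range, & Function slide can be checked mechanically. The sketch below uses the slide's own set of pairs; the helper names `domain_of`, `range_of`, and `is_function` are hypothetical and introduced only for illustration.

```python
def domain_of(pairs):
    """Set of all first components (X values) in the relation."""
    return {x for x, _ in pairs}


def range_of(pairs):
    """Set of all second components (Y values) in the relation."""
    return {y for _, y in pairs}


def is_function(pairs):
    """True if every X value maps to exactly one Y value."""
    seen = {}
    for x, y in pairs:
        if x in seen and seen[x] != y:
            return False  # same X, two different Y values
        seen[x] = y
    return True


# The set of pairs given on the slide
pairs = [(2, -3), (4, 6), (3, -1), (6, 6), (2, 3)]

print(sorted(domain_of(pairs)))  # [2, 3, 4, 6]
print(sorted(range_of(pairs)))   # [-3, -1, 3, 6]
print(is_function(pairs))        # False, because 2 maps to both -3 and 3
```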

In Analytics

  • Determine the range, given a set of candidate domains
  • Solve for a function that will give the range for the candidate domains

National Security Example

  • Range: find candidate national security issues related to attacks on American assets
  • Candidate domains:
      • Banking records
      • Money flows
      • E-mail
      • Social media networks
      • Telephone calls
      • Reports from human intelligence
      • Satellite photos
  • Find function(s) that use those domains to produce the range
  • Data is always incomplete

Cancer Research Example

  • Range: identify patients who will not respond to a specific treatment
  • Domains:
      • Genotype
      • Health history
      • Family history
      • Geology of residence
      • Work history
  • Find function(s) that use those domains to produce the range
  • Data is always incomplete
