Big Data
EXABYTE
PETABYTE
TERABYTE
GIGABYTE
KILOBYTE
Management Issues
Career Direction
Technical Resources
Speaker:
Dr. Kang Mun Arturo Tan
Assistant Professor
Management Information Systems, Management Sciences Department
Date: Dec 23, 2012
Time: 12:20 – 13:10
Place: Room 75, Yanbu University College
Open to the Public
Free Admission
The Big Data
Report Highlights and Units
The McKinsey Global Institute Report
Capturing Big Data Value
What is Big Data?
Various Dimensions of Big Data
Tools and Technologies
Additional Tools and Technologies
How can we benefit from big data?
Transforming the organization
Talent Specifications
Educational Courses / Training
References
Cloudera Distribution including Hadoop
Conclusion
The Units
Multiples of bytes (SI decimal prefixes)

Name (Symbol)      Value
kilobyte (kB)      10^3
megabyte (MB)      10^6
gigabyte (GB)      10^9
terabyte (TB)      10^12
petabyte (PB)      10^15
exabyte (EB)       10^18
zettabyte (ZB)     10^21
yottabyte (YB)     10^24
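As a minimal sketch of how these prefixes are applied, the helper below converts a raw byte count into the largest SI unit that keeps the value at or above 1. The function name `format_bytes` and the two-decimal rounding are illustrative assumptions, not part of the talk.

```python
# SI decimal prefixes from the units table: each step is a factor of 1000.
SI_UNITS = ["B", "kB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def format_bytes(num_bytes: int) -> str:
    """Express a byte count in the largest SI unit that keeps the value >= 1.

    Hypothetical helper, for illustration only.
    """
    if num_bytes < 0:
        raise ValueError("byte count must be non-negative")
    value = float(num_bytes)
    unit_index = 0
    # Divide by 1000 until the value fits the current unit or the units run out.
    while value >= 1000 and unit_index < len(SI_UNITS) - 1:
        value /= 1000
        unit_index += 1
    return f"{value:.2f} {SI_UNITS[unit_index]}"


print(format_bytes(235 * 10**12))  # the Library of Congress figure used in the talk
print(format_bytes(18 * 10**20))   # the 2011 global data output estimate
```

Because the prefixes are decimal, the loop divides by 1000 rather than 1024; binary prefixes (KiB, MiB, and so on) would need a different table.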
Talk Highlights
CERN generates 40 TB/sec of data.
In 2009, nearly all sectors of the US economy held at least 200 TB of stored data per company on average.
One exabyte is roughly 4,000 times the information stored in the US Library of Congress.
The US Library of Congress had collected 235 terabytes of data by April 2011.
In 15 out of 17 US sectors, the average company stores more data than the US Library of Congress.
By 2018, the US will face a shortage of 140,000 to 190,000 deep data analysts and 1.5 million data-savvy managers.
Training on big data is still in its infancy.
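The Library of Congress comparison in these highlights can be checked with simple arithmetic. The sketch below assumes the SI definitions of exabyte and terabyte and uses the 235 TB figure quoted in the highlights; the variable names are illustrative.

```python
# Back-of-the-envelope check of "one exabyte is about 4,000 Libraries of Congress".
EXABYTE = 10**18    # bytes, SI decimal prefix
TERABYTE = 10**12   # bytes, SI decimal prefix

library_of_congress = 235 * TERABYTE  # figure quoted in the talk highlights

ratio = EXABYTE / library_of_congress
print(f"1 EB is about {ratio:,.0f} times the Library of Congress collection")
```

The ratio comes out a little above 4,000, which is consistent with the rounded figure on the slide.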
The McKinsey Global Institute (MGI) Report
In May 2011, MGI issued a report about organizations being deluged with data. This large amount of data is generally referred to as “Big Data.”
The use and analysis of Big Data will be the basis of innovation, competition and productivity.
Corporations will be using “Data Science” to properly manage and utilize “Big Data.”
Data Scientists are an elite, specialized class of highly compensated experts in data cleaning, analysis and visualization.
Capturing its Value
$300 Billion/year - Potential annual value to US Health Care
€250 Billion/year – Potential annual value to the European government administration
$600 Billion – Potential annual consumer surplus from using personal location data globally
60% – Potential increase in retailers’ operating margins possible with big data
However, the USA alone needs 140,000 to 190,000 more people with deep analytical talent and 1.5 million more data-savvy managers to take full advantage of big data.
What is Big Data?
Large data sets that are impossible to manage with conventional database tools.
Size is relative. What is big today will be small tomorrow.
In 2011, our global output of data was estimated at 1.8 zettabytes.
Big Data consists of:
Structured, machine-friendly information
Unstructured, human-friendly information (email, social media, video, audio, click-streams and images)
Various Dimensions
Volume – terabytes … petabytes of information
Variety – extends well beyond structured data: text, audio, video, click streams, log files, etc.
Velocity – frequently time-sensitive; big data must be used as it streams into the enterprise in order to maximize its value.
Example, static average (recomputed from all stored values):
7:00 AM: 1, 3 (4/2 -> avg = 2)
10:00 AM: 1, 3, 5 (9/3 -> avg = 3)
Example, dynamic average (apparently new value, running sum, count, average): 1, 3 (_, 4, 2, 2), then (5, 9, 3, 3)
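The contrast in the velocity example can be sketched in a few lines of Python. This is one possible reading, not the speaker's own code: it assumes the dynamic tuples mean (new value, running sum, count, average), and the names `static_average` and `RunningAverage` are made up for illustration.

```python
from typing import List


def static_average(values: List[float]) -> float:
    """Batch approach: re-read every stored value each time an average is needed."""
    return sum(values) / len(values)


class RunningAverage:
    """Streaming approach: keep only the running sum and count."""

    def __init__(self) -> None:
        self.total = 0.0
        self.count = 0

    def update(self, value: float) -> float:
        """Fold one new value into the state and return the current average."""
        self.total += value
        self.count += 1
        return self.total / self.count


# Static: recompute from the full list at each point in time.
print(static_average([1, 3]))      # values available at 7:00 AM
print(static_average([1, 3, 5]))   # values available at 10:00 AM

# Dynamic: update the state as each value streams in.
stream = RunningAverage()
for reading in [1, 3, 5]:
    average = stream.update(reading)
    print(reading, stream.total, stream.count, average)
```

Both approaches give the same averages; the difference is that the streaming version never has to revisit earlier values, which matters when data arrives faster than it can be re-read.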
Tools and Technologies
Hadoop – a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment.
Facebook, LinkedIn, Twitter and eBay use Hadoop.
Hadoop is at the center of this decade’s Big Data revolution.
In 2011, five major companies embraced Hadoop: EMC, IBM, Informatica, Microsoft and Oracle.
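Hadoop's core processing model is MapReduce: a map step emits key/value pairs, the framework groups them by key, and a reduce step aggregates each group. The sketch below imitates those three phases in a single Python process to show the idea. It does not use Hadoop or any Hadoop API, and the function names and sample lines are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as a framework would between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    # In a real cluster, different lines (or blocks of lines) could be mapped on different nodes.
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


sample_lines = ["big data is big", "data science uses big data"]
print(word_count(sample_lines))
```

The point of the split is that the map and reduce functions only ever see one line or one key at a time, so the work can be spread across many machines without changing the logic.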
Additional Technologies
Cassandra – a scalable multi-master database with no single point of failure
Chukwa – a data collection system for managing large distributed systems
HBase – a scalable, distributed database that supports structured data storage for large tables
Hive – a data warehouse infrastructure that provides data summarization and ad hoc querying
Mahout – a scalable machine learning and data mining library
Pig – a high-level data-flow language and execution framework for parallel computation
ZooKeeper – a high-performance coordination service for distributed applications
Commercial Technology (CDH)
CDH (Cloudera Distribution including Hadoop)
File System Mount (FUSE-DFS)
UI Framework/SDK (Hue)
Data Mining (Apache Mahout)
Workflow (Apache Oozie)
Scheduling (Apache Oozie)
Metadata (Apache Hive)
Data Integration (Apache Flume, Apache Sqoop)
Languages/Compilers (Apache Pig, Apache Hive)
Fast Read/Write Access (Apache HBase)
Hadoop
Coordination (Apache ZooKeeper)
SCM Express (Installation Wizard)
How to benefit from Big Data?
Choose the right data: data should be in line with corporate objectives.
Build models that predict and optimize outcomes: hypothesis-based model building is better.
Transform your company’s capabilities: Data Science is not a replacement for human judgment.
Transforming your company
Leadership – companies succeed because they have leadership teams that set clear goals, define what success looks like and ask the right questions.
Talent Management – companies need to manage a unique breed of individuals who are scientists but who are comfortable with the language of business.
Technology – The tools to handle the volume, velocity and variety of big data are always a necessary component of big data strategy.
Decision Making – An effective organization puts information and the relevant decision rights in the same location.
Company Culture – Companies should NOT ask “What do we think?” but should ask “What do we know?”
Talent Specifications
Hybrid of data hacker, communicator and trusted adviser
Universal skill: ability to write code
Can communicate in a language that stakeholders understand
Can tell a story with data, whether verbally, visually or both
Many of the brightest data scientists hold PhDs in esoteric fields like ecology and systems biology
Talent Specifications - 2
Roumeliotis, PhD in Astrophysics, Head of the Data Science Team at Intuit in Silicon Valley, begins his search for candidates by:
asking candidates if they can develop prototypes in a mainstream programming language, like Java.
seeking a skill set consisting of mathematics, statistics, probability and computer science, plus certain habits of mind (curiosity, inventiveness, discipline, endurance?).
looking for people with a feel for business issues and empathy for customers.
immersing the candidate in on-the-job training, with an occasional course in a particular technology.
Talent Specifications - 3
Many of the data scientists working in business today were formally trained in computer science, mathematics, statistics or economics.
They can emerge from any field that has a strong data and computational focus.
Hal Varian, the chief economist at Google, is known to have said, “The sexy job in the next 10 years will be statisticians.”
Courses
There are only a few formal courses being offered right now.
Data Science is at the center of: Computer Science, Operations Research, Statistics and Business
[Diagram: Data Science at the intersection of Statistics, Business, Operations Research and Computer Science]
Schools offering Data Science
Master of Science in Analytics (MSA), Institute for Advanced Analytics, North Carolina State University
The class of 2012 has the following job statistics:
15 interviews per student
Average base salary offer with professional experience: $99,600
$65,000 to $160,000 for candidates with experience
$60,000 to $100,000 for candidates with no experience
Schools offering Data Science - 2
Insight Data Science Fellows Program - a postdoctoral fellowship designed by Jake Klamka (a high-energy physicist by training) that takes scientists from academia and in six weeks prepares them to succeed as data scientists
Syracuse University’s School of Information Studies (iSchool)
Rensselaer Polytechnic’s Data Science Research Center
Conclusion
Big Data is now a reality with a huge profit potential.
Tools and technologies are available as open source.
Each one of us can benefit from working with Big Data in its pure form (dynamic) or in its traditional form (static).
Data Science is the path towards the full utilization of Big Data.
Schools are in the process of offering Data Science programs.
Students could pursue a career in Data Science through these programs.
Statistical interpretation is the everyday work routine of Data Science. (Many commercial implementations exist.)
Commercial Implementations
SAP Hana – Metscale
Microsoft Parallel Data Warehouse
Exadata Database Machine (Oracle)
Exalytics In-Memory Machine (Oracle)
Greenplum Data Computing Appliance (EMC)
Netezza Data Warehouse Appliance (IBM)
Vertica Analytics Platform (HP)
solidDB (IBM)
Terracotta BigMemory (Software AG)
…
References
1. McKinsey Global Institute Report, 2011
2. A Simple Introduction to Data Science - Noreen Burlingame and Lars Nielsen, 2012
3. Big Data Now - Allen Noren, 2011 (O’Reilly Radar Team)
4. What is Data Science - Mike Loukides, 2011 (O’Reilly Media)
5. Big Data: The Management Revolution - Andrew McAfee and Erik Brynjolfsson (Harvard Business Review, Oct 2012)
6. Data Scientist: The Sexiest Job of the 21st Century - Thomas H. Davenport and D.J. Patil (Harvard Business Review, Oct 2012)
7. Making Advanced Analytics Work for You - Dominic Barton and David Court (Harvard Business Review, Oct 2012)
8. Various YouTube Materials / Hadoop - Stanford University