
On Big Data


Page 1: On Big Data

Big Data

EXABYTE

PETABYTE

TERABYTE

GIGABYTE

KILOBYTE

Management Issues

Career Direction

Technical Resources

Speaker: Dr. Kang Mun Arturo Tan

Assistant Professor, Management Information Systems, Management Sciences Department

Date: Dec 23, 2012

Time: 12:20 – 13:10

Place: Room 75, Yanbu University College

Open to the Public

Free Admission

Page 2: On Big Data

The Big Data

Report Highlights and Units

The McKinsey Global Institute Report

Capturing Big Data Value

What is Big Data?

Various Dimensions of Big Data

Tools and Technologies

Additional Tools and Technologies

How can we benefit from big data?

Transforming the organization

Talent Specifications

Educational Courses / Training

References

Cloudera Distribution including Hadoop

Conclusion

Page 3: On Big Data

The Units

Multiples of bytes (SI decimal prefixes)

Name (Symbol)     Value
kilobyte (kB)     10^3
megabyte (MB)     10^6
gigabyte (GB)     10^9
terabyte (TB)     10^12
petabyte (PB)     10^15
exabyte (EB)      10^18
zettabyte (ZB)    10^21
yottabyte (YB)    10^24
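The prefixes in the table above step up by a factor of 1000 each time. As a minimal sketch of that relationship, the following converts a raw byte count into the largest fitting SI unit. The function name `format_bytes` and the list `SI_UNITS` are illustrative names, not part of the talk.

```python
# SI decimal prefixes from the units table: each step is a factor of 1000.
SI_UNITS = ["B", "kB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def format_bytes(n: int) -> str:
    """Express n bytes in the largest SI unit that keeps the value below 1000.

    Hypothetical helper for illustration only.
    """
    value = float(n)
    for unit in SI_UNITS:
        # Stop at the first unit where the value fits, or at the last unit.
        if value < 1000 or unit == SI_UNITS[-1]:
            return f"{value:g} {unit}"
        value /= 1000


print(format_bytes(10**12))        # one terabyte
print(format_bytes(235 * 10**12))  # 235 terabytes
```

Note that these are decimal (SI) units, as in the table, not the binary multiples (1024-based) that some operating systems report.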


Talk Highlights

CERN generates 40 TB of data per second.

In 2009, nearly all sectors of the US economy held at least 200 TB of stored data per company on average.

One exabyte is roughly 4,000 times the information stored in the US Library of Congress.

The US Library of Congress had collected 235 terabytes of data by April 2011.

In 15 of 17 US sectors, companies store more data on average than the US Library of Congress holds.

By 2018, the US will face a shortage of 140,000 to 190,000 deep data analysts and 1.5 million data-savvy managers.

Training on big data is still in its infancy.
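The figures in these highlights can be cross-checked with a little arithmetic. The sketch below uses only numbers quoted in the talk (235 TB, one exabyte, 40 TB per second). The variable names are illustrative.

```python
TB = 10**12  # terabyte, SI decimal
EB = 10**18  # exabyte, SI decimal

# Library of Congress collection size quoted in the talk.
loc_bytes = 235 * TB

# How many Library of Congress collections fit in one exabyte.
loc_per_exabyte = EB / loc_bytes

# Seconds needed to accumulate one exabyte at the quoted CERN rate.
cern_rate = 40 * TB
seconds_per_exabyte = EB / cern_rate

print(round(loc_per_exabyte))       # 4255
print(seconds_per_exabyte / 3600)   # hours to reach one exabyte
```

The ratio of about 4,255 agrees with the "roughly 4,000 times" figure, and at the quoted rate one exabyte would accumulate in just under seven hours.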

Page 4: On Big Data

The McKinsey Global Institute (MGI) Report

In May 2011, MGI issued a report about organizations being deluged with data. This large amount of data is generally referred to as “Big Data.”

The use and analysis of Big Data will be the basis of innovation, competition and productivity.

Corporations will use “Data Science” to properly manage and utilize “Big Data.”

Data Scientists are an elite, specialized class of highly compensated experts in data cleaning, analysis and visualization.

Page 5: On Big Data

Capturing its Value

$300 billion/year – potential annual value to US health care

€250 billion/year – potential annual value to Europe's public-sector administration

$600 billion/year – potential annual consumer surplus from using personal location data globally

60% – potential increase in retailers' operating margins possible with big data

However, to take full advantage of big data, the USA alone needs 140,000 to 190,000 more people in deep analytical talent positions and 1.5 million more data-savvy managers.

Page 6: On Big Data

What is Big Data?

Large data sets that are impossible to manage with conventional database tools.

Size is relative. What is big today will be small tomorrow.

In 2011, our global output of data was estimated at 1.8 zettabytes.

Big Data consists of:

Structured, machine-friendly information

Unstructured, human-friendly information (email, social media, video, audio, click-streams and images)

Page 7: On Big Data

Various Dimensions

Volume – terabytes to petabytes of information

Variety – extends well beyond structured data: text, audio, video, click streams, log files, etc.

Velocity – big data is frequently time-sensitive, so it must be used as it streams into the enterprise in order to maximize its value.

Example, static average (recomputed from all stored values):

7:00 AM: values 1, 3 (4/2 -> average = 2)

10:00 AM: values 1, 3, 5 (9/3 -> average = 3)

Example, dynamic average (updated as each value arrives; state is new value, sum, count, average):

After 1, 3: (_, 4, 2, 2)

After 5 arrives: (5, 9, 3, 3)
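The static versus dynamic example can be sketched in code. This is one reading of the slide's notation, assuming the tuple holds (new value, sum, count, average). The names `static_average` and `RunningAverage` are illustrative, not from the talk.

```python
def static_average(values):
    """Static approach: recompute the average from every stored value."""
    return sum(values) / len(values)


class RunningAverage:
    """Dynamic approach: keep only the sum and count, update per arrival."""

    def __init__(self):
        self.total = 0
        self.count = 0

    def add(self, value):
        """Fold one new value into the state and return the current average."""
        self.total += value
        self.count += 1
        return self.total / self.count


# Static: all values must be kept and re-read at each reporting time.
print(static_average([1, 3]))      # 7:00 AM
print(static_average([1, 3, 5]))   # 10:00 AM

# Dynamic: the stream is consumed one value at a time.
stream = RunningAverage()
stream.add(1)
print(stream.add(3), stream.total, stream.count)
print(stream.add(5), stream.total, stream.count)
```

The dynamic version never revisits old values, which is what makes it suitable for data that must be used as it streams in.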


Page 8: On Big Data

Tools and Technologies

Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment.

Facebook, LinkedIn, Twitter and eBay use Hadoop.

Hadoop is at the center of this decade’s Big Data revolution.

In 2011, five major companies embraced Hadoop: EMC, IBM, Informatica, Microsoft and Oracle.
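The slide does not name it, but Hadoop's core processing model is MapReduce: a map step emits key/value pairs, the pairs are grouped by key, and a reduce step combines each group. The sketch below imitates that flow in a single Python process to show the idea. It does not use any Hadoop API, and all function names are illustrative assumptions.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


print(word_count(["big data", "Big tools"]))
```

In a real cluster the map and reduce steps would run in parallel on many machines, each working on its own slice of the data. Here everything runs locally to keep the sketch self-contained.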

Page 9: On Big Data

Additional Technologies

Cassandra – a scalable multi-master database with no single point of failure

Chukwa – a data collection system for managing large distributed systems

HBase – a scalable, distributed database that supports structured data storage for large tables

Hive – a data warehouse infrastructure that provides data summarization and ad hoc querying

Mahout – a scalable machine learning and data mining library

Pig – a high-level data-flow language and execution framework for parallel computation

ZooKeeper – a high-performance coordination service for distributed applications

Page 10: On Big Data

Commercial Technology (CDH)

CDH (Cloudera Distribution including Hadoop)

File System Mount (Fuse-DFS)

UI Framework/SDK (Hue)

Data Mining (Apache Mahout)

Workflow (Apache Oozie)

Scheduling (Apache Oozie)

Metadata (Apache Hive)

Data Integration (Apache Flume, Apache Sqoop)

Languages/Compilers (Apache Pig, Apache Hive)

Fast Read/Write Access (Apache HBase)

Hadoop

Coordination (Apache ZooKeeper)

SCM Express (Installation Wizard)

Page 11: On Big Data

How to benefit from Big Data?

Choose the right data. Data should be in line with corporate objectives.

Build models that predict and optimize outcomes. Hypothesis-based model building is better.

Transform your company’s capabilities. Data Science is not a replacement for human judgment.

Page 12: On Big Data

Transforming your company

Leadership – companies succeed because they have leadership teams that set clear goals, define what success looks like and ask the right questions.

Talent Management – companies need to manage a unique breed of individuals who are scientists but who are comfortable with the language of business.

Technology – The tools to handle the volume, velocity and variety of big data are always a necessary component of big data strategy.

Decision Making – An effective organization puts information and the relevant decision rights in the same location.

Company Culture – Companies should NOT ask “What do we think?” but should ask “What do we know?”


Page 13: On Big Data

Talent Specifications

Hybrid of data hacker, communicator and trusted adviser

Universal skill: ability to write code

Can communicate in a language that stakeholders understand

Can tell a story with data, whether verbally, visually or both

Many of the brightest data scientists are PhDs in esoteric fields like ecology and systems biology

Page 14: On Big Data

Talent Specifications - 2

Roumeliotis, who holds a PhD in astrophysics and is head of the Data Science Team at Intuit in Silicon Valley, begins his search for candidates by:

asking candidates whether they can develop prototypes in a mainstream programming language, like Java.

seeking a skill set consisting of mathematics, statistics, probability and computer science, plus certain habits of mind (curiosity, inventiveness, discipline, endurance?).

looking for people with a feel for business issues and empathy for customers.

immersing the candidate in on-the-job training, with an occasional course in a particular technology.

Page 15: On Big Data

Talent Specifications - 3

Many of the data scientists working in business today were formally trained in computer science, mathematics, statistics or economics.

They can emerge from any field that has a strong data and computational focus.

Hal Varian, the chief economist at Google, is known to have said, “The sexy job in the next 10 years will be statisticians.”

Page 16: On Big Data

Courses

There are only a few formal courses being offered right now.

Data Science is at the center of Computer Science, Operations Research, Statistics and Business.

[Diagram: Data Science at the center, surrounded by Statistics, Business, Operations Research and Computer Science]

Page 17: On Big Data

Schools offering Data Science

Master of Science in Analytics (MSA), Institute for Advanced Analytics, North Carolina State University

The class of 2012 has the following job statistics:

15 interviews per student

Average base salary offer with professional experience: $99,600

$65,000 to $160,000 for candidates with experience

$60,000 to $100,000 for candidates with no experience

Page 18: On Big Data

Schools offering Data Science - 2

Insight Data Science Fellows Program – a postdoctoral fellowship designed by Jake Klamka (a high-energy physicist by training) that takes scientists from academia and in six weeks prepares them to succeed as data scientists

Syracuse University’s School of Information Studies (iSchool)

Rensselaer Polytechnic’s Data Science Research Center


Page 19: On Big Data

Conclusion

Big Data is now a reality with huge profit potential.

Tools and technologies are available as open source.

Each one of us can benefit from working with Big Data, whether in its pure form (dynamic) or in its traditional form (static).

Data Science is the path towards the full utilization of Big Data.

Schools are in the process of offering Data Science programs.

Students could pursue a career in Data Science.

Statistical interpretation is the everyday work of Data Science. (Many commercial implementations exist.)

Page 20: On Big Data

Commercial Implementations

SAP HANA – Metscale

Microsoft Parallel Data Warehouse

Exadata Database Machine (Oracle)

Exalytics In-Memory Machine (Oracle)

Greenplum Data Computing Appliance (EMC)

Netezza Data Warehouse Appliance (IBM)

Vertica Analytics Platform (HP)

solidDB (IBM)

Terracotta BigMemory (Software AG)

…

Page 21: On Big Data

References

1. McKinsey Global Institute Report, 2011
2. A Simple Introduction to Data Science - Noreen Burlingame and Lars Nielsen, 2012
3. Big Data Now - Allen Noren, 2011 (O’Reilly Radar Team)
4. What is Data Science - Mike Loukides, 2011 (O’Reilly Media)
5. Big Data: The Management Revolution - Andrew McAfee and Erik Brynjolfsson (Harvard Bus Rev, Oct 2012)
6. Data Scientist: The Sexiest Job of the 21st Century - Thomas H. Davenport and D.J. Patil (Harvard Bus Rev, Oct 2012)
7. Making Advanced Analytics Work for You - Dominic Barton and David Court (Harvard Bus Rev, Oct 2012)
8. Various YouTube Materials / Hadoop - Stanford University
