Big Data from an Information Management Perspective – First Experiences in Teaching and Research Directions

HSG Big Data Seminar

St. Gallen, 29.03.2016
Ass.-Prof. Dr. Jochen Wulf

Agenda

A socio-technical perspective on big data

Initial experiences from teaching big data management

Current and future research activities

2016

Dr. Jochen Wulf, IWI-HSG

Slide 2

“Big data refers to data that is too big to fit on a single server, too unstructured to fit into a row-and-column database, or too continuously flowing to fit into a static data warehouse. While its size receives all the attention, the most difficult aspect of big data really involves its lack of structure” (Davenport 2014)

Chen, M., Mao, S., Zhang, Y., Leung, V.C. (2014) Big data: related technologies, challenges and future prospects. Springer.

Hadoop reduces the «Total Cost per Bit»: Historical and fine-grained data become economically exploitable.

https://www.youtube.com/watch?v=d2xeNpfzsYI

There are different perspectives on the ‘big data’ phenomenon

Data science postulates interdisciplinarity.

[Figure: Data Science shown in relation to three fields: Computer Science, Statistics, and Management/Data-driven decisions]

Socio-technical theory

Data science = software development?

Fisher, D., DeLine, R., Czerwinski, M., & Drucker, S. (2012). Interactions with big data analytics. interactions, 19(3), 50-59.

A procedural view on data mining

Azevedo, A. I. R. L. (2008). KDD, SEMMA and CRISP-DM: a parallel overview. IADS-DM.

Big Data presents new organizational, procedural, and technological challenges in a company context.

Organization

Development
• Expansion and change of current business
• Collaboration (BA, IT, business departments)

Challenge
• Data-driven culture & organization (e.g. collaboration, competence center)
• Talent management
• Leadership

Process

Development
• Changes in decision making processes (e.g. automation, change of locus of decision)

Challenge
• Change management
• Exploratory data analysis

Technology

Development
• New characteristics (e.g. volume, velocity, variety, veracity, variability)
• New technologies (e.g. in-memory, cloud)
• New methods in data extraction, transformation, retrieval and exploration

Challenge
• Development of technological capabilities

[Figure labels: Organization, Process, Technology, Business Value]

Initial experiences from teaching big data management

Dashboard

Classroom

Data Science (Written Report)

For the class project, please write a brief blog post (1-5 pages) describing the New York City Subway system. In this blog post, you should pose a question about the New York City Subway system that can be addressed by the MTA dataset that you’ve worked with in the course and then, by working with the dataset, draw an interesting conclusion about the New York City Subway system itself.

More specifically:

You can choose to pursue the question that was discussed in the course: how does rain affect ridership in the New York City Subway system? If you choose to address this question, you can use the work that you’ve done in projects 2 through 5 in your blog post.

You can also choose a different question to investigate about the New York Subway System. However, this question should be complex enough to warrant the use of all the following skills and tools that you have learned in the class:

Statistical Test and Linear Regression (Lesson 3)

Data Visualization (Lesson 4)

MapReduce (Lesson 5)

Design

Template: http://icis2012.aisnet.org/Files/template_submissions2012.doc

Size restriction: 5 pages max
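
The rain-versus-ridership question in the assignment above can be illustrated with a small map/reduce-style aggregation. This is only a sketch under stated assumptions: the records, the field layout (unit, hourly entries, rain flag), and the function names are hypothetical and do not come from the MTA dataset or the course material. It runs in plain Python rather than on a Hadoop cluster.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical records: (station unit, hourly entries, rain flag).
# The values are made up for illustration; they are not MTA data.
RECORDS = [
    ("R001", 1200, 1),
    ("R001", 900, 0),
    ("R002", 300, 1),
    ("R002", 500, 0),
    ("R001", 1100, 0),
]


def mapper(record):
    """Emit a (key, value) pair: key is the rain flag, value is (entries, 1)."""
    _unit, entries, rain = record
    return rain, (entries, 1)


def reducer(acc, pair):
    """Accumulate the sum of entries and the record count per key."""
    key, (entries, count) = pair
    total, n = acc[key]
    acc[key] = (total + entries, n + count)
    return acc


def mean_entries_by_rain(records):
    """Return the mean hourly entries for rainy (1) and dry (0) records."""
    totals = reduce(reducer, map(mapper, records), defaultdict(lambda: (0, 0)))
    return {key: total / n for key, (total, n) in totals.items()}


if __name__ == "__main__":
    print(mean_entries_by_rain(RECORDS))
```

A comparison of the two means is only descriptive; the assignment additionally asks for a statistical test and a regression before drawing a conclusion.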

Methods for classical ETL-processes based on SAS

SAS Prog

1. Introduction

2. SAS Programs (2 practices)

3. Accessing Data (2 practices)

4. Producing Detail Reports (3 practices)

5. Formatting Data Values (2 practices)

6. Reading SAS Data Sets (2 practices)

7. Reading Spreadsheet and Database Data (1 practice)

8. Reading Raw Data Files (3 practices)

9. Manipulating Data (2 practices)

10. Combining Data Sets (4 practices)

11. Creating Summary Reports (4 practices)

Certification: Content and Exam Format

The “SAS Certified Base Programmer for SAS 9” certification is designed for SAS programmers who have mastered the fundamentals of data management with SAS software. Candidates should be familiar with the features and enhancements of SAS 9.3. Passing this exam is a prerequisite for taking the “SAS Certified Advanced Programmer for SAS 9” certification.

This certification requires knowledge of:

• importing and exporting raw data
• managing data in SAS data sets
• combining SAS data sets
• creating list reports and summary reports
• identifying programming and syntax errors

http://www.sas.com/de_de/training/services/zertifizierung/bp.html#-berblick

The Hortonworks Data Platform (HDP)

[Figure: Hortonworks Data Platform architecture]

Governance & Integration: Data Workflow, Lifecycle & Governance (Falcon, Sqoop, Flume, NFS, WebHDFS)

Data Access: Batch (MapReduce), Script (Pig), SQL (Hive/Tez, HCatalog), NoSQL (HBase, Accumulo), Stream (Storm), Search (Solr), Others (In-Memory Analytics, ISV Engines)

Data Management: YARN (Data Operating System), HDFS (Hadoop Distributed File System), nodes 1 to n

Security: Authentication, Authorization, Accounting, Data Protection (Storage: HDFS; Resources: YARN; Access: Hive, …; Pipeline: Falcon; Cluster: Knox)

Operations: Provision, Manage & Monitor (Ambari, Zookeeper), Scheduling (Oozie)

Student project example: Predicting the mobile operating system based on tweet information

Envisioned Syllabus: Lecture „Managing Big Data“

Cluster Computing and cloud computing

Data sources

Data acquisition

Data storage

Data exploration and transformation

Querying data

Advanced and scalable algorithms

Current and future research activities

Big data platform at IWI-HSG (based on Hadoop & Spark)

Exemplary Hive statements

Data import in H2O FLOW

Projects

Scraping:

Social media APIs:

Three dimensions of big data technology use

Data use refers to all activities associated with the inherent characteristics of data sources, such as the analysis of a large variety of data sources (Das et al., 2015; Dhar, 2013; Domingos, 2012), the management of high-volume data (Dhar, 2013), the handling of personally identifiable information (Jagadish et al., 2014), the integration of heterogeneous data sources (Das et al., 2015; Fisher et al., 2012), and the transformation of data into usable forms (Fisher et al., 2012).

Platform use includes all aspects related to the configuration and management of storage and processing capacities. It covers the design of computing architectures (e.g. server clusters) (Fisher et al., 2012; Jagadish et al., 2014), the configuration of processing capacities (Fisher et al., 2012), and the support of distributed access and multiple users (Jagadish et al., 2014).

Application use covers the analytical activities performed with the data on the basis of the computing platform. It includes the in-house development of software and distributed algorithms (Fisher et al., 2012), the automated processing of analytical models (Domingos, 2012), and the application of visualization instruments.

How are these dimensions of use interrelated?

How do they influence the perceived usefulness of big data technology?

Data collection strategy: log mining

Ren, K., Kwon, Y., Balazinska, M., Howe, B. (2013) Hadoop's adolescence: an analysis of Hadoop usage in scientific workloads. Proceedings of the VLDB Endowment 6, 853-864.
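
As a rough illustration of what log mining can look like in practice, the sketch below counts submitted jobs per processing framework from job-history lines. The line format ("timestamp user framework status"), the sample lines, and the function names are assumptions made for this example; they do not reproduce an actual Hadoop log layout or the method of the cited study.

```python
from collections import Counter

# Hypothetical job-history lines: "timestamp user framework status".
# The format is an assumption for illustration, not a real Hadoop log layout.
LOG_LINES = [
    "2016-03-01T10:00:00 alice hive SUCCEEDED",
    "2016-03-01T10:05:00 bob mapreduce FAILED",
    "2016-03-01T11:00:00 alice hive SUCCEEDED",
    "2016-03-02T09:30:00 carol pig SUCCEEDED",
    "malformed line",
]


def parse_line(line):
    """Return (user, framework, status), or None if the line does not match."""
    parts = line.split()
    if len(parts) != 4:
        return None
    _timestamp, user, framework, status = parts
    return user, framework, status


def usage_by_framework(lines):
    """Count submitted jobs per framework, skipping unparseable lines."""
    parsed = (parse_line(line) for line in lines)
    return Counter(record[1] for record in parsed if record is not None)


if __name__ == "__main__":
    print(usage_by_framework(LOG_LINES))
```

The same counting pattern could be keyed by user or by job status to approximate other usage measures.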

Goth, G. (2015) Bringing big data to the big tent. Communications of the ACM 58, 17-19.

Relative weights of evaluation criteria for open-source/proprietary ERP systems

Benlian, A., Hess, T. (2011) Comparing the relative importance of evaluation criteria in proprietary and open-source enterprise application software selection – a conjoint study of ERP and Office systems. Information Systems Journal 21, 503-525.
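
In conjoint studies, relative weights of this kind are commonly derived from part-worth utilities: the weight of a criterion is the range of its part-worths divided by the sum of the ranges over all criteria. The sketch below shows that calculation. The criteria, levels, and utility values are hypothetical and are not taken from Benlian and Hess (2011); whether the cited study used exactly this normalization is an assumption.

```python
# Hypothetical part-worth utilities per criterion level (illustrative values
# only; they are not taken from Benlian and Hess, 2011).
PART_WORTHS = {
    "cost": {"low": 1.0, "high": -1.0},
    "functionality": {"basic": -0.5, "extended": 0.5},
    "support": {"none": -0.25, "vendor": 0.25},
}


def relative_weights(part_worths):
    """Relative weight of a criterion = its part-worth range / sum of all ranges."""
    ranges = {
        name: max(levels.values()) - min(levels.values())
        for name, levels in part_worths.items()
    }
    total = sum(ranges.values())
    if total == 0:
        raise ValueError("all part-worth ranges are zero")
    return {name: value / total for name, value in ranges.items()}


if __name__ == "__main__":
    for name, weight in relative_weights(PART_WORTHS).items():
        print(f"{name}: {weight:.3f}")
```

Because the weights are normalized by the total range, they sum to one and can be compared across criteria within one study.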

How does the delivery mode, i.e. the choice of open-source versus proprietary software, influence the success of big data technologies?

The role of big data cloud platforms

Hashem, I.A.T., Yaqoob, I., Anuar, N.B., Mokhtar, S., Gani, A., Khan, S.U. (2015) The rise of “big data” on cloud computing: Review and open research issues. Information Systems 47, 98-115.

Through case study research, prior literature has identified several factors that guide the selection of an operational mode for enterprise resource planning systems, including flexibility, customization, cost, operation, and maintenance (Link and Back, 2015).

How does the operational mode (cloud computing vs on-premise) influence the success of big data technologies?

Model of big data technology success (based on DeLone and McLean, 2003)

Prof. Dr. Jochen Wulf

Assistant Professor

Müller-Friedberg-Strasse 8

CH-9000 St. Gallen

Tel.: +41-(0)71-224-3865

Fax: +41-(0)71-224-3296

www.iwi.unisg.ch