St. Gallen, 29.03.2016
Ass.-Prof. Dr. Jochen Wulf
Big Data from an Information Management Perspective – First Experiences in Teaching and Research Directions
HSG Big Data Seminar
Agenda
A socio-technical perspective on big data
Initial experiences from teaching big data
management
Current and future research activities
2016
Dr. Jochen Wulf, IWI-HSG
Slide 2
“Big data refers to data that is too big to fit on a single server, too unstructured to fit into a row-and-column database, or too continuously flowing to fit into a static data warehouse. While its size receives all the attention, the most difficult aspect of big data really involves its lack of structure” (Davenport 2014)
Chen, M., Mao, S., Zhang, Y., Leung, V.C. (2014) Big data: related technologies, challenges and future prospects. Springer.
Hadoop reduces the «Total Cost per Bit»: Historical and fine-grained data become economically exploitable.
https://www.youtube.com/watch?v=d2xeNpfzsYI
There are different perspectives on the ‘big data’ phenomenon
Data science postulates interdisciplinarity.
[Diagram labels: Computer Science; Statistics; Management / Data-driven decisions; Data Science; Socio-technical theory]
Data science = software development?
Fisher, D., DeLine, R., Czerwinski, M., & Drucker, S. (2012).
Interactions with big data analytics. interactions, 19(3), 50-59.
A procedural view on data mining
Azevedo, A. I. R. L. (2008). KDD, SEMMA and CRISP-DM: a parallel
overview. IADS-DM.
Big Data presents new organizational, procedural, and technological challenges in a company context.
Organization
Development
• Expansion and change of current business
• Collaboration (BA, IT, business departments)
Challenge
• Data-driven culture & organization (e.g. collaboration, competence center)
• Talent management
• Leadership
Process
Development
• Changes in decision making processes (e.g. automation, change of locus of decision)
Challenge
• Change management
• Exploratory data analysis
Technology
Development
• New characteristics (e.g. volume, velocity, variety, veracity, variability)
• New technologies (e.g. in-memory, cloud)
• New methods in data extraction, transformation, retrieval and exploration
Challenge
• Development of technological capabilities
[Additional diagram label: Business Value]
Initial experiences from teaching big data management
Dashboard
Classroom
Data Science (Written Report)
For the class project, please write a brief blog post (1-5 pages) describing the New York City Subway system. In this blog post, you should pose a question about the New York City Subway system that can be addressed by the MTA dataset that you’ve worked with in the course and then, through working with the dataset, draw an interesting conclusion about the New York City Subway system itself.
More specifically:
You can choose to pursue the question that was discussed in the course: how does rain affect ridership in the New York City Subway system? If you choose to address this question, you can use the work that you’ve done in projects 2 through 5 in your blog post.
You can also choose a different question to investigate about the New York Subway System. However, this question should be complex enough to warrant the use of all the following skills and tools that you have learned in the class:
• Statistical Test and Linear Regression (Lesson 3)
• Data Visualization (Lesson 4)
• MapReduce (Lesson 5)
Design
Template: http://icis2012.aisnet.org/Files/template_submissions2012.doc
Size restriction: 5 pages max
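As a rough illustration of two of the skills named in the assignment (linear regression and MapReduce), the following sketch fits ridership against a rain indicator and sums entries per station in a map/reduce style. The records, station names, and function names are invented for illustration and are not MTA data or course code; a real analysis would read the course dataset instead.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy records: (station, rain indicator, hourly entries).
# These values are invented for illustration and are NOT MTA data.
records = [
    ("A", 0, 100), ("A", 1, 120),
    ("B", 0, 200), ("B", 1, 220),
    ("C", 0, 300), ("C", 1, 320),
]


def ols_fit(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def map_phase(record):
    """Map step: emit one (station, entries) pair per record."""
    station, _rain, entries = record
    yield station, entries


def reduce_phase(pairs):
    """Reduce step: sum the emitted values per key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


rain = [r[1] for r in records]
entries = [r[2] for r in records]
intercept, slope = ols_fit(rain, entries)
station_totals = reduce_phase(pair for rec in records for pair in map_phase(rec))

print(f"entries = {intercept:.1f} + {slope:.1f} * rain")
print(station_totals)
```

With a binary rain indicator, the fitted slope equals the difference between the mean entries on rainy and non-rainy observations, which is why a regression and a two-group comparison address the same question here.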
Methods for classical ETL-processes based on SAS
SAS Prog
1. Introduction
2. SAS Programs (2 practices)
3. Accessing Data (2 practices)
4. Producing Detail Reports (3 practices)
5. Formatting Data Values (2 practices)
6. Reading SAS Data Sets (2 practices)
7. Reading Spreadsheet and Database Data (1 practice)
8. Reading Raw Data Files (3 practices)
9. Manipulating Data (2 practices)
10. Combining Data Sets (4 practices)
11. Creating Summary Reports (4 practices)
Certification: content and exam format
The “SAS Certified Base Programmer for SAS 9” certification is designed for SAS programmers who have mastered the fundamentals of data management with SAS software. Candidates should be familiar with the features and enhancements of SAS 9.3. Passing this exam is a prerequisite for taking the “SAS Certified Advanced Programmer for SAS 9” certification.
This certification requires knowledge of
importing and exporting raw data.
managing SAS data sets.
combining SAS data sets.
creating list reports and summary reports.
identifying programming and syntax errors.
http://www.sas.com/de_de/training/services/zertifizierung/bp.html#-berblick
The Hortonworks Data Platform (HDP)
[Architecture diagram of the Hortonworks Data Platform, with the following areas and components:
Governance & Integration: Data Workflow, Lifecycle & Governance (Falcon, Sqoop, Flume, NFS, WebHDFS)
Data Access: Batch (MapReduce), Script (Pig), SQL (Hive/Tez, HCatalog), NoSQL (HBase, Accumulo), Stream (Storm), Search (Solr), Others (In-Memory Analytics, ISV Engines)
Data Management: YARN: Data Operating System; HDFS (Hadoop Distributed File System)
Security: Authentication, Authorization, Accounting, Data Protection; Storage: HDFS; Resources: YARN; Access: Hive, …; Pipeline: Falcon; Cluster: Knox
Operations: Provision, Manage & Monitor (Ambari, Zookeeper); Scheduling (Oozie)]
Student project example: Predicting the mobile operating system based on tweet information
Envisioned Syllabus: Lecture “Managing Big Data”
Cluster computing and cloud computing
Data sources
Data acquisition
Data storage
Data exploration and transformation
Querying data
Advanced and scalable algorithms
Current and future research activities
Big data platform at IWI-HSG (based on Hadoop & Spark)
Exemplary Hive statements
Data import in H2O Flow
Projects
Scraping:
Social media APIs:
Three dimensions of big data technology use
Data use refers to all activities associated with the inherent characteristics of data sources, such as the analysis of a large variety of data sources (Das et al., 2015; Dhar, 2013; Domingos, 2012), the management of high-volume data (Dhar, 2013), the handling of personally identifiable information (Jagadish et al., 2014), the integration of heterogeneous data sources (Das et al., 2015; Fisher et al., 2012), and the transformation of data into usable forms (Fisher et al., 2012).
Platform use includes all aspects related to the configuration and management of storage and processing capacities. It covers the design of computing architectures (e.g. server clusters) (Fisher et al., 2012; Jagadish et al., 2014), the configuration of processing capacities (Fisher et al., 2012), and the support of distributed access and multiple users (Jagadish et al., 2014).
Application use covers the analytical activities performed with the data on the basis of the computing platform. It includes the in-house development of software and distributed algorithms (Fisher et al., 2012), the automated processing of analytical models (Domingos, 2012), and the application of visualization instruments.
How are these dimensions of use interrelated?
How do they influence the perceived usefulness of big data
technology?
Data collection strategy: log mining
Ren, K., Kwon, Y., Balazinska, M., Howe, B. (2013) Hadoop's adolescence: an analysis of Hadoop usage in scientific workloads. Proceedings of the VLDB Endowment 6, 853-864.
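To make the log mining idea concrete, here is a minimal sketch that aggregates per-user usage measures from job log lines. The log format, field names, and sample lines are assumptions made up for this illustration; they do not describe the logs of the IWI-HSG platform or of any particular Hadoop distribution.

```python
import re
from collections import defaultdict

# Hypothetical log format (an assumption for illustration only):
# "<timestamp> user=<id> engine=<name> input_mb=<number>"
LINE_RE = re.compile(
    r"^(?P<ts>\S+) user=(?P<user>\S+) engine=(?P<engine>\S+) input_mb=(?P<mb>\d+)$"
)


def summarize_usage(lines):
    """Aggregate per-user job count, engines used, and total input volume."""
    usage = defaultdict(lambda: {"jobs": 0, "engines": set(), "input_mb": 0})
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip lines that do not follow the assumed format
        record = usage[match["user"]]
        record["jobs"] += 1
        record["engines"].add(match["engine"])
        record["input_mb"] += int(match["mb"])
    return dict(usage)


# Invented sample lines, including one malformed line that is skipped.
sample_log = [
    "2016-03-01T10:00:00 user=u1 engine=hive input_mb=120",
    "2016-03-01T10:05:00 user=u1 engine=spark input_mb=300",
    "2016-03-01T10:07:00 user=u2 engine=hive input_mb=50",
    "malformed line",
]
summary = summarize_usage(sample_log)
for user, stats in sorted(summary.items()):
    print(user, stats["jobs"], sorted(stats["engines"]), stats["input_mb"])
```

In such a sketch, input volume could serve as a proxy for data use and the set of engines as a proxy for application use; which log fields map to which dimension is a research design decision that the slides leave open.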
Goth, G. (2015) Bringing big data to the big tent. Communications
of the ACM 58, 17-19.
Relative weights of evaluation criteria for open-
source/proprietary ERP systems
Benlian, A., Hess, T. (2011) Comparing the relative importance of
evaluation criteria in proprietary and open‐source enterprise application
software selection–a conjoint study of ERP and Office systems.
Information Systems Journal 21, 503-525.
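For readers unfamiliar with conjoint analysis, relative weights of this kind are commonly derived from part-worth utilities: the importance of a criterion is the range of its part-worths divided by the sum of the ranges over all criteria. The sketch below shows that calculation only. The criteria names and part-worth values are invented for illustration and are not results from Benlian and Hess (2011), and the cited study may have used a different estimation procedure.

```python
def relative_importance(part_worths):
    """Return each attribute's share of the total part-worth range.

    part_worths maps an attribute name to a dict of level -> utility.
    """
    ranges = {
        attribute: max(levels.values()) - min(levels.values())
        for attribute, levels in part_worths.items()
    }
    total = sum(ranges.values())
    return {attribute: value / total for attribute, value in ranges.items()}


# Hypothetical part-worth utilities (invented values, not study results).
example_part_worths = {
    "cost": {"low": 0.6, "high": -0.6},
    "functionality": {"basic": -0.9, "extended": 0.9},
    "support": {"community": -0.5, "vendor": 0.5},
}

weights = relative_importance(example_part_worths)
for attribute, weight in weights.items():
    print(f"{attribute}: {weight:.2f}")
```

The resulting weights sum to one, so they can be compared across respondent groups, for example between evaluators of open-source and proprietary systems.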
How does the delivery mode, i.e. the choice of open-source
versus proprietary software, influence the success of big
data technologies?
The role of big data cloud platforms
Hashem, I.A.T., Yaqoob, I., Anuar, N.B., Mokhtar, S., Gani, A., Khan, S.U. (2015) The rise of “big
data” on cloud computing: Review and open research issues. Information Systems 47, 98-115.
Through case study research, prior literature has identified several factors that guide the selection of an operational mode for enterprise resource planning systems, including flexibility, customization, cost, operation, and maintenance (Link and Back, 2015).
How does the operational mode (cloud computing vs on-
premise) influence the success of big data technologies?
Model of big data technology success (based on DeLone and McLean, 2003)
Prof. Dr. Jochen Wulf
Assistant Professor
Müller-Friedberg-Strasse 8
CH-9000 St. Gallen
Tel.: +41-(0)71-224-3865
Fax: +41-(0)71-224-3296
www.iwi.unisg.ch