Lluis Belanche + Alfredo Vellido Intelligent Data Analysis and Data Mining a.k.a. Data Mining II



IDADM 2012/2013. Alfredo Vellido

An Introduction to Mining (coda)


RECAP: CRISP: The virtuous loop of methodology phases


A note on CRISP-DM 2.0

CRISP-2.0: Updating the Methodology

Why?

Many changes have occurred in the business application of data mining since CRISP‐DM 1.0 was published. Emerging issues and requirements include:

• The availability of new types of data (text, Web, and attitudinal data, for example), along with new techniques for pre‐processing, analyzing, and combining them with related case data

• Integration and deployment of results with operational systems such as call centers and Web sites

• Far more demanding requirements for scalability and for deployment into real‐time environments

• The need to package analytical tasks for non‐analytical end users and integrate these tasks in business workflows

• The need to seamlessly integrate the deployment of results and closed‐loop feedback with existing business processes

• The need to mine large‐scale databases in situ, rather than exporting an analytical dataset

• Organizations’ increasing reliance on teams, making it important to educate greater numbers of people on the processes and best practices associated with data mining and predictive analytics

In July 2006 the consortium announced that it was going to start the process of working towards a second version of CRISP‐DM. On 26 September 2006, the CRISP‐DM SIG met to discuss potential enhancements for CRISP‐DM 2.0 and the subsequent roadmap. However, these efforts appear to be stalled. The SIG has not met, updated the CRISP website, or communicated anything to members since early 2007.


resources


Some bibliography available at books.google.com:

Data mining: practical machine learning tools and techniques. I.H. Witten, E. Frank (2005)

Data mining: concepts and techniques. J. Han, M. Kamber (2006)

Principles of data mining. D. J. Hand, H. Mannila, P. Smyth (2001)


Some FREE SOFTWARE to know about …


KEEL (.es)


WEKA (.nz)


RapidMiner (.us)


An insider’s view …


Geoff Holmes


Current hot topics in DM


A company that sells DM (ML) for big data, in the cloud


PMML: Predictive Model Markup Language

PMML is an XML‐based markup language developed by the Data Mining Group (DMG, a public‐private consortium) to provide a way for applications to define models related to data mining and to share those models between PMML‐compliant applications.

PMML provides applications a vendor‐independent method of defining models so that proprietary issues and incompatibilities are not a barrier to the exchange of models between applications. It allows users to develop models within one vendor's application and use other vendors' applications to visualize, analyze, evaluate or otherwise use the models. With PMML, the exchange of models between compliant applications is straightforward.
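To make the exchange idea concrete, here is a minimal Python sketch in which one "application" has written a linear regression model as PMML and another reads and applies it using only the standard library. The XML fragment is hand‐written in the style of PMML 4.1 and has not been validated against the DMG schema; the field names `x` and `y`, the coefficients, and the helper functions `load_linear_model` and `predict` are all invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hand-written, unvalidated fragment in the style of PMML 4.1 (assumption).
PMML_TEXT = """
<PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1">
  <Header copyright="example" description="toy linear model"/>
  <DataDictionary numberOfFields="2">
    <DataField name="x" optype="continuous" dataType="double"/>
    <DataField name="y" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="x"/>
      <MiningField name="y" usageType="predicted"/>
    </MiningSchema>
    <RegressionTable intercept="1.5">
      <NumericPredictor name="x" coefficient="2.0"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""

# Namespace prefix mapping used for element lookups.
NS = {"p": "http://www.dmg.org/PMML-4_1"}


def load_linear_model(pmml_text):
    """Read the intercept and coefficients of the first regression table."""
    root = ET.fromstring(pmml_text.strip())
    table = root.find("p:RegressionModel/p:RegressionTable", NS)
    intercept = float(table.get("intercept"))
    coefficients = {
        predictor.get("name"): float(predictor.get("coefficient"))
        for predictor in table.findall("p:NumericPredictor", NS)
    }
    return intercept, coefficients


def predict(model, row):
    """Apply the linear model to one input record (a dict of field values)."""
    intercept, coefficients = model
    return intercept + sum(c * row[name] for name, c in coefficients.items())


model = load_linear_model(PMML_TEXT)
print(predict(model, {"x": 3.0}))
```

The consumer never needs to know which tool trained the model; it only relies on the element and attribute names of the shared format.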


Process Mining (PM)

PM sits between computational intelligence (CI) and DM on the one hand, and process modeling and analysis on the other. PM aims to discover, monitor and improve real processes by extracting knowledge from event logs.

Why PM? … an ever‐increasing number of events are being recorded, providing detailed information about the history of processes. On the other hand, there is a need to improve and support business processes in rapidly changing and aggressively competitive environments.

PM includes (automated) process discovery (extracting process models from an event log), conformance checking (monitoring deviations by comparing model and log), organizational mining (including social networks), automated construction of simulation models, model extension, model repair, case prediction, and history‐based recommendations.


Process Mining (PM)

PM could be a bridge between DM and business process modeling and analysis, under the umbrella concept of Business Intelligence (BI). It can also be seen as the "missing link" between DM and traditional model‐driven BPM. Most DM techniques are not, as such, suited to process analysis.

Co‐existing analytical concepts:

• Business Activity Monitoring (BAM): technologies enabling the real‐time monitoring of business processes.

• Complex Event Processing (CEP): technologies to process large amounts of events for optimizing the business in real time.

• Corporate Performance Management (CPM): measuring the performance of a process or organization.

Co‐existing management concepts: Continuous Process Improvement (CPI), Business Process Improvement (BPI), Total Quality Management (TQM), and Six Sigma.

PM enables all these within a single framework.


Process Mining (PM)

Event logs: All PM techniques assume that it is possible to sequentially record events such that each event refers to an activity (a well‐defined step in some process) and is related to a particular case (a process instance). An event log (EL) may store additional information about events: the resource (person or device) executing the activity, the timestamp of the event, or data elements recorded together with the event.
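The event log structure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Event` class, its field names, the `traces` helper and the toy order‐handling log are invented here, and real tools typically read logs from standard formats rather than from hand‐built lists.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Event:
    case_id: str                     # process instance the event belongs to
    activity: str                    # well-defined step in the process
    timestamp: datetime              # additional information, used for ordering
    resource: Optional[str] = None   # person or device executing the activity


def traces(log):
    """Group events by case and order each case by timestamp."""
    by_case = defaultdict(list)
    for event in log:
        by_case[event.case_id].append(event)
    return {
        case: [e.activity for e in sorted(events, key=lambda e: e.timestamp)]
        for case, events in by_case.items()
    }


# Invented toy log: two cases of a hypothetical order-handling process.
LOG = [
    Event("c1", "register", datetime(2013, 1, 7, 9, 0), "Ann"),
    Event("c2", "register", datetime(2013, 1, 7, 9, 5), "Ann"),
    Event("c1", "check", datetime(2013, 1, 7, 10, 0), "Bob"),
    Event("c1", "ship", datetime(2013, 1, 8, 8, 30), "Eve"),
    Event("c2", "reject", datetime(2013, 1, 7, 11, 0), "Bob"),
]

print(traces(LOG))
```

Each case becomes one trace, i.e. the sequence of activities recorded for that process instance; most PM techniques start from this view of the log.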


Process Mining (PM)

Discovery: The first type of PM is discovery. A discovery technique takes an event log and produces a model without using any a priori information.

Conformance: The second is conformance: an existing process model is compared with an event log of the same process. Conformance checking can be used to check whether the real process, as recorded in the EL, conforms to the model and vice versa. Conformance checking can be applied to procedural models, organizational models, declarative process models, etc.

Enhancement: The third is enhancement: extending or improving an existing process model using information about the actual process recorded in some EL. This third type of PM aims at changing or extending the a priori model.
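The three types of PM can be illustrated with a deliberately simple model: a directly‐follows graph (which activity may start a case, which may end it, and which may follow which). This is a stand‐in chosen for brevity, not a technique named in these slides; real discovery algorithms produce richer models such as Petri nets. All function names and the toy traces are invented for this sketch.

```python
def discover_dfg(traces):
    """Discovery: build a directly-follows model from traces alone."""
    starts, ends, edges = set(), set(), set()
    for trace in traces:
        if not trace:
            continue
        starts.add(trace[0])
        ends.add(trace[-1])
        edges.update(zip(trace, trace[1:]))
    return {"starts": starts, "ends": ends, "edges": edges}


def fits(model, trace):
    """Conformance: can the trace be replayed on the model step by step?"""
    if not trace:
        return False
    return (
        trace[0] in model["starts"]
        and trace[-1] in model["ends"]
        and all(pair in model["edges"] for pair in zip(trace, trace[1:]))
    )


def fitness(model, traces):
    """Fraction of traces that the model can replay."""
    return sum(fits(model, t) for t in traces) / len(traces)


def enhance(model, trace):
    """Enhancement: extend the a priori model with behaviour seen in the log."""
    return {
        "starts": model["starts"] | {trace[0]},
        "ends": model["ends"] | {trace[-1]},
        "edges": model["edges"] | set(zip(trace, trace[1:])),
    }


# Invented traces (one activity sequence per case).
old_log = [["register", "check", "ship"], ["register", "reject"]]
new_log = old_log + [["register", "check", "reject"]]

model = discover_dfg(old_log)          # discovery from the old log
print(fitness(model, new_log))         # conformance against the new log
repaired = enhance(model, new_log[2])  # enhancement with the deviating case
print(fitness(repaired, new_log))
```

In the toy data, the third case deviates from the discovered model because "reject" directly follows "check", which the old log never showed; after enhancement the model replays all three cases.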


Process Mining (PM): perspectives

• Control‐flow perspective: focuses on the ordering of activities. The goal of mining this perspective is to find a good characterization of all possible paths. The result is typically expressed in terms of a Petri net or some other process notation (event‐driven process chains (EPCs), BPMN, or UML activity diagrams).

• Organizational perspective: focuses on information about resources hidden in the event log, i.e., which actors (people, systems, roles, or departments) are involved and how they are related. The goal is to either structure the organization by classifying people in terms of roles and organizational units or to map a social network.

• Case perspective: focuses on properties of cases. A case can be characterized by its path in the process or by the actors working on it.
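For the organizational perspective, one common way to map a social network from a log is to count handovers of work: how often one resource is followed by another within the same case. The sketch below assumes events are given as `(case_id, activity, resource)` tuples already ordered in time within each case; the function name and the toy data are invented for illustration.

```python
from collections import Counter, defaultdict


def handover_of_work(events):
    """Count how often work on a case passes from one resource to the next.

    `events` is an iterable of (case_id, activity, resource) tuples,
    assumed to be ordered in time within each case.
    """
    by_case = defaultdict(list)
    for case_id, _activity, resource in events:
        by_case[case_id].append(resource)

    handovers = Counter()
    for resources in by_case.values():
        for giver, receiver in zip(resources, resources[1:]):
            if giver != receiver:  # consecutive steps by one person are no handover
                handovers[(giver, receiver)] += 1
    return handovers


# Invented toy events for two cases.
EVENTS = [
    ("c1", "register", "Ann"),
    ("c1", "check", "Bob"),
    ("c1", "ship", "Eve"),
    ("c2", "register", "Ann"),
    ("c2", "check", "Bob"),
    ("c2", "reject", "Bob"),
]

network = handover_of_work(EVENTS)
for (giver, receiver), count in sorted(network.items()):
    print(f"{giver} -> {receiver}: {count}")
```

The resulting counts are the weighted edges of a directed graph over resources, which can then be analyzed like any other social network.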


Business Process Model and Notation (BPMN) example. A graphical representation for specifying business processes in a business process model.


Process Mining (PM): BPM vs. PM

• Business Process Modeling: 7 phases: In the (re)design phase a new process model is created or an existing process model is adapted. In the analysis phase a candidate model and its alternatives are analyzed. Then, the model is implemented (implementation phase) or an existing system is (re)configured (reconfiguration phase). In the execution phase, the designed model is enacted. During the execution phase the process is monitored. Moreover, smaller adjustments may be made without redesigning the process (adjustment phase). In the diagnosis phase the enacted process is analyzed and the output of this phase may trigger a new process redesign phase.


Process Mining (PM): BPM vs. PM

Process Mining: 5 stages:

• Plan and justify: includes understanding the available data and the process domain.

• Extract: event data, models, objectives, and questions need to be extracted from systems, domain experts, and management.

• Control‐flow modelling: a control‐flow model is constructed and linked to the event log. Here automated process discovery techniques can be used. The event log may be filtered or adapted using the model (e.g., removing outlier cases and imputing missing events).

• Integrated process model: the control‐flow model may be extended with other perspectives (e.g., data, time, and resources).

• Operational support: the integrated model is used to support running cases, e.g. through case prediction and history‐based recommendations.
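The log filtering mentioned in the control‐flow modelling stage (removing outlier cases) can be sketched as follows. Treating rarely seen activity sequences as outliers is only one possible heuristic, and the threshold `min_count`, the function name and the toy traces are assumptions made for this example.

```python
from collections import Counter


def filter_rare_variants(traces, min_count=2):
    """Keep only cases whose activity sequence occurs at least `min_count` times.

    `traces` maps a case id to its list of activities. Rare sequences are
    treated as outlier cases here; this is a heuristic, not a fixed rule.
    """
    variant_counts = Counter(tuple(trace) for trace in traces.values())
    return {
        case: trace
        for case, trace in traces.items()
        if variant_counts[tuple(trace)] >= min_count
    }


# Invented toy traces: case c3 follows an unusual order.
TRACES = {
    "c1": ["register", "check", "ship"],
    "c2": ["register", "check", "ship"],
    "c3": ["register", "ship", "check"],
}

filtered = filter_rare_variants(TRACES)
print(sorted(filtered))
```

Filtering of this kind trades completeness for a simpler control‐flow model, so the discarded cases are usually worth inspecting separately rather than ignoring.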


Process Mining (PM): Guiding principles


Process Mining (PM)

PM as a building block of BI


Process Mining (PM)

PM IEEE TF
