Lluis Belanche + Alfredo Vellido
Intelligent Data Analysis and Data Mining, a.k.a. Data Mining II
IDADM 2012/2013. Alfredo Vellido
An Introduction to Mining (coda)
RECAP: CRISP: The virtuous loop of methodology phases
IDADM
A note on CRISP-DM 2.0
CRISP-DM 2.0: Updating the Methodology
Why?
Many changes have occurred in the business application of data mining since CRISP‐DM 1.0 was published. Emerging issues and requirements include:
The availability of new types of data—text, Web, and attitudinal data, for example—along with new techniques for pre‐processing, analyzing, and combining them with related case data
Integration and deployment of results with operational systems such as call centers and Web sites
Far more demanding requirements for scalability and for deployment into real‐time environments
The need to package analytical tasks for non‐analytical end users and integrate these tasks in business workflows
The need to seamlessly integrate the deployment of results and closed‐loop feedback with existing business processes
The need to mine large-scale databases in situ, rather than exporting an analytical dataset
Organizations' increasing reliance on teams, making it important to educate greater numbers of people on the processes and best practices associated with data mining and predictive analytics
In July 2006 the consortium announced that it was going to start the process of working towards a second version of CRISP‐DM. On 26 September 2006, the CRISP‐DM SIG met to discuss potential enhancements for CRISP‐DM 2.0 and the subsequent roadmap. However, these efforts appear to be stalled. The SIG has not met, updated the CRISP website, or communicated anything to members since early 2007.
resources
Some bibliography available at books.google.com:
Data mining: practical machine learning tools and techniques. I.H. Witten, E. Frank (2005)
Data mining: concepts and techniques. J. Han, M. Kamber (2006)
Principles of data mining. D.J. Hand, H. Mannila, P. Smyth (2001)
Some FREE SOFTWARE to know about …
KEEL (.es)
WEKA (.nz)
RapidMiner (.us)
An insider’s view …
Geoff Holmes
Current hot topics in DM
A company that sells DM (ML) for big data, in the cloud
PMML: Predictive Model Markup Language
PMML is an XML‐based markup language developed by the Data Mining Group (DMG, a public‐private consortium) to provide a way for applications to define models related to data mining and to share those models between PMML‐compliant applications.
PMML provides applications a vendor‐independent method of defining models so that proprietary issues and incompatibilities are not a barrier to the exchange of models between applications. It allows users to develop models within one vendor's application and use other vendors' applications to visualize, analyze, evaluate or otherwise use the models. With PMML, the exchange of models between compliant applications is straightforward.
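To make the exchange idea concrete, the following is a minimal sketch in Python: a hand-written PMML fragment for a toy linear regression is parsed with the standard library and used for scoring. The model itself (y = 1 + 2x), the field names `x` and `y`, and the helper `score` are illustrative assumptions, not part of the slides. The element names are meant to follow the PMML 4.x `RegressionModel` layout; a real consumer would rely on a full PMML engine rather than this ad hoc reader.

```python
import xml.etree.ElementTree as ET

# Hypothetical model y = 1.0 + 2.0 * x, written by hand for illustration.
PMML_DOC = """
<PMML version="4.4" xmlns="http://www.dmg.org/PMML-4_4">
  <Header description="toy linear model"/>
  <DataDictionary numberOfFields="2">
    <DataField name="x" optype="continuous" dataType="double"/>
    <DataField name="y" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="x"/>
      <MiningField name="y" usageType="target"/>
    </MiningSchema>
    <RegressionTable intercept="1.0">
      <NumericPredictor name="x" coefficient="2.0"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""

# Namespace prefix used in the XPath-style lookups below.
NS = {"p": "http://www.dmg.org/PMML-4_4"}


def score(pmml_text, inputs):
    """Evaluate the first regression table found in a PMML document."""
    root = ET.fromstring(pmml_text)
    table = root.find("p:RegressionModel/p:RegressionTable", NS)
    result = float(table.get("intercept"))
    # Add coefficient * value for every numeric predictor in the table.
    for predictor in table.findall("p:NumericPredictor", NS):
        result += float(predictor.get("coefficient")) * inputs[predictor.get("name")]
    return result


prediction = score(PMML_DOC, {"x": 3.0})
```

The point of the sketch is that the scoring side never sees the tool that trained the model: everything it needs is in the XML document.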
Process Mining (PM)
PM sits between computational intelligence (CI) and DM on the one hand, and process modeling and analysis on the other. PM aims to discover, monitor and improve real processes by extracting knowledge from event logs.
Why PM? … an ever-increasing number of events are being recorded, providing detailed information about the history of processes. On the other hand, there is a need to improve and support business processes in rapidly changing and aggressively competitive environments.
PM includes (automated) process discovery (extracting process models from an event log), conformance checking (monitoring deviations by comparing model and log), organizational mining (including social networks), automated construction of simulation models, model extension, model repair, case prediction, and history-based recommendations.
Process Mining (PM)
PM could be a bridge between DM and business process modeling and analysis, under the umbrella concept of Business Intelligence (BI). It can also be seen as the "missing link" between DM and traditional model-driven BPM. Most DM techniques are not, as such, suited to process analysis.
Co-existing analytical concepts:
• Business Activity Monitoring (BAM): technologies enabling the real-time monitoring of business processes.
• Complex Event Processing (CEP): technologies to process large amounts of events for optimizing the business in real time.
• Corporate Performance Management (CPM): measuring the performance of a process or organization.
Co-existing management concepts: Continuous Process Improvement (CPI), Business Process Improvement (BPI), Total Quality Management (TQM), and Six Sigma.
PM enables all these within a single framework.
Process Mining (PM)
Event logs: All PM techniques assume that it is possible to sequentially record events such that each event refers to an activity (a well-defined step in some process) and is related to a particular case (a process instance). An event log (EL) may store additional information about events: the resource (person or device) executing the activity, the timestamp of the event, or data elements recorded together with the event.
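As a rough illustration of this structure, the sketch below represents an event log in Python and groups it into one trace (ordered list of activities) per case. The `Event` class, the toy activities, the resource names and the `traces` helper are all hypothetical; real logs are usually exchanged in dedicated formats rather than built by hand.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    case_id: str         # the process instance the event belongs to
    activity: str        # the well-defined step that was executed
    timestamp: datetime  # when the event was recorded
    resource: str        # person or device executing the activity


# Hypothetical log with two interleaved cases.
EVENT_LOG = [
    Event("c1", "register", datetime(2013, 1, 7, 9, 0), "Ann"),
    Event("c2", "register", datetime(2013, 1, 7, 9, 5), "Ann"),
    Event("c1", "check", datetime(2013, 1, 7, 9, 30), "Bob"),
    Event("c2", "check", datetime(2013, 1, 7, 9, 45), "Bob"),
    Event("c1", "pay", datetime(2013, 1, 7, 10, 0), "Cid"),
    Event("c2", "reject", datetime(2013, 1, 7, 10, 15), "Bob"),
]


def traces(log):
    """Group events by case and return each case's activities in time order."""
    by_case = defaultdict(list)
    for event in sorted(log, key=lambda e: e.timestamp):
        by_case[event.case_id].append(event.activity)
    return dict(by_case)


TRACES = traces(EVENT_LOG)
```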
Process Mining (PM)
Discovery: The first type of PM is discovery. A discovery technique takes an event log and produces a model without using any a priori information.
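One of the simplest things that can be discovered from a log is a directly-follows graph: which activity is immediately followed by which, and how often. The sketch below computes it in Python from hypothetical traces; it is only a toy stand-in for real discovery algorithms, which produce richer models such as Petri nets.

```python
from collections import Counter

# Hypothetical traces: one ordered list of activities per case.
TRACES = {
    "c1": ["register", "check", "pay"],
    "c2": ["register", "check", "reject"],
    "c3": ["register", "check", "pay"],
}


def directly_follows(traces):
    """Count how often activity a is immediately followed by activity b."""
    counts = Counter()
    for activities in traces.values():
        for a, b in zip(activities, activities[1:]):
            counts[(a, b)] += 1
    return counts


DFG = directly_follows(TRACES)
```

No prior model is supplied: the edges and their frequencies come from the log alone, which is what distinguishes discovery from the other two types of PM.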
Conformance: The second type is conformance: an existing process model is compared with an event log of the same process. Conformance checking can be used to check whether reality, as recorded in the event log, conforms to the model and vice versa. Conformance checking can be applied to procedural models, organizational models, declarative process models, etc.
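A minimal sketch of the idea in Python, assuming a deliberately simple model: a set of allowed start activities, end activities and directly-follows pairs. The model, the log and the `deviations` helper are hypothetical, and this check is far cruder than the techniques used in practice (for example token replay or alignments).

```python
# Hypothetical model: allowed start/end activities and allowed successions.
MODEL = {
    "start": {"register"},
    "end": {"pay", "reject"},
    "follows": {("register", "check"), ("check", "pay"), ("check", "reject")},
}

# Hypothetical log: one trace per case.
LOG = {
    "c1": ["register", "check", "pay"],
    "c2": ["register", "pay"],   # skips the check
    "c3": ["check", "reject"],   # does not start with registration
}


def deviations(trace, model):
    """List the points where a trace departs from the model."""
    if not trace:
        return ["empty trace"]
    found = []
    if trace[0] not in model["start"]:
        found.append(f"unexpected start: {trace[0]}")
    for a, b in zip(trace, trace[1:]):
        if (a, b) not in model["follows"]:
            found.append(f"unexpected step: {a} -> {b}")
    if trace[-1] not in model["end"]:
        found.append(f"unexpected end: {trace[-1]}")
    return found


REPORT = {case: deviations(trace, MODEL) for case, trace in LOG.items()}
# Share of cases that the model fully explains.
FITTING = sum(1 for d in REPORT.values() if not d) / len(REPORT)
```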
Enhancement: Extending or improving an existing process model using information about the actual process recorded in some event log. This third type of PM aims at changing or extending the a priori model.
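One common form of extension is to annotate an existing model with timing information taken from the log. The Python sketch below does this for a hypothetical model given as a set of directly-follows edges, attaching the mean delay in minutes observed on each edge; the edges, timestamps and the `mean_delays` helper are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical a priori model: the allowed directly-follows edges.
MODEL_EDGES = {("register", "check"), ("check", "pay")}

# Hypothetical log: per case, a time-ordered list of (activity, timestamp).
LOG = {
    "c1": [
        ("register", datetime(2013, 1, 7, 9, 0)),
        ("check", datetime(2013, 1, 7, 9, 30)),
        ("pay", datetime(2013, 1, 7, 10, 30)),
    ],
    "c2": [
        ("register", datetime(2013, 1, 7, 9, 0)),
        ("check", datetime(2013, 1, 7, 10, 30)),
        ("pay", datetime(2013, 1, 7, 11, 0)),
    ],
}


def mean_delays(log, edges):
    """Return the mean delay in minutes observed on each model edge."""
    samples = defaultdict(list)
    for events in log.values():
        for (a, t_a), (b, t_b) in zip(events, events[1:]):
            if (a, b) in edges:
                samples[(a, b)].append((t_b - t_a).total_seconds() / 60)
    return {edge: sum(values) / len(values) for edge, values in samples.items()}


ANNOTATED = mean_delays(LOG, MODEL_EDGES)
```

The structure of the model is unchanged; it is extended with a time perspective that was only implicit in the log.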
Process Mining (PM): perspectives
• Control-flow perspective: focuses on the ordering of activities. The goal of mining this perspective is to find a good characterization of all possible paths. The result is typically expressed in terms of a Petri net or some other process notation (e.g., event-driven process chains (EPCs), BPMN, or UML activity diagrams).
• Organizational perspective: focuses on information about resources hidden in the event log, i.e., which actors (people, systems, roles, or departments) are involved and how they are related. The goal is either to structure the organization by classifying people in terms of roles and organizational units or to map a social network.
• Case perspective: focuses on properties of cases. A case can be characterized by its path in the process or by the actors working on it.
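For the organizational perspective, a simple social network can be derived by counting handovers of work: how often one resource's activity is directly followed, within the same case, by an activity of a different resource. The Python sketch below assumes a hypothetical log of (activity, resource) pairs; the names and the `handovers` helper are illustrative only.

```python
from collections import Counter

# Hypothetical log: per case, an ordered list of (activity, resource).
LOG = {
    "c1": [("register", "Ann"), ("check", "Bob"), ("pay", "Cid")],
    "c2": [("register", "Ann"), ("check", "Bob"), ("reject", "Bob")],
}


def handovers(log):
    """Count how often work passes from one resource to a different one."""
    counts = Counter()
    for events in log.values():
        for (_, giver), (_, receiver) in zip(events, events[1:]):
            if giver != receiver:
                counts[(giver, receiver)] += 1
    return counts


NETWORK = handovers(LOG)
```

Each counted pair is a weighted, directed edge of the social network; consecutive steps by the same resource are not handovers and are ignored.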
Business Process Model and Notation (BPMN) example. A graphical representation for specifying business processes in a business process model.
Process Mining (PM): BPM vs. PM
• Business Process Modeling: 7 phases: In the (re)design phase a new process model is created or an existing process model is adapted. In the analysis phase a candidate model and its alternatives are analyzed. Then, the model is implemented (implementation phase) or an existing system is (re)configured (reconfiguration phase). In the execution phase, the designed model is enacted. During the execution phase the process is monitored. Moreover, smaller adjustments may be made without redesigning the process (adjustment phase). In the diagnosis phase the enacted process is analyzed and the output of this phase may trigger a new process redesign phase.
Process Mining (PM): BPM vs. PM
Process Mining: 5 stages:
• Plan and justify: includes understanding the available data and the process domain.
• Extract: event data, models, objectives, and questions need to be extracted from systems, domain experts, and management.
• Control-flow modelling: a control-flow model is constructed and linked to the event log. Here, automated process discovery techniques can be used. The event log may be filtered or adapted using the model (e.g., removing outlier cases and imputing missing events).
• Integrated process model: the control-flow model may be extended with other perspectives (e.g., data, time, and resources).
• Operational support.
Process Mining (PM): Guiding principles
Process Mining (PM)
PM as a building block of BI
Process Mining (PM)
PM IEEE TF