Data Mining IIavellido/teaching/13-14/... · * Analyze a huge amount of data by using data mining and statistics techniques; generate actionable insights for improving advertising

Lluis Belanche + Alfredo Vellido

Intelligent Data Analysis and Data Miningor …

Data Analysis and Knowledge Discoverya.k.a. Data Mining II

Last sessions wrap‐up

CRISP: The virtuous loop of methodology phases

IDADM

CRISP: Phases: Problem understanding

DETERMINEPROBLEMGOAL

ASSESS SITUATION

DETERMINEDM

GOALS

PRODUCE PROJECTPLAN

PROBLEM UNDERSTANDING

DATA

UNDERST’ING

DATA

PREPARATIONMODELLING EVALUATION

IMPLEMEN

TATION

BACKGROUND

INVENTORY RESOURCES

GOALS DM

PROJECT

PLAN

PROBLEM

GOALS

SUCCESS

CRITERIA

SUCCESS CRITERIA DM

REQUERIMS. ASSUMPTIONS LIMITATIONS

RISKS CONTINGEN. TERMINOLOG. COSTS &

BENEFITS

INITIAL SELECTION OF

TOOLS

IDADM

DM application areas (’10‐>’11)IDADM

CRISP: Phases: Data understanding

OBTAIN INITIAL DATA

DESCRIPTION DATA

DATAEXPLORATION

DATA QUALITYVERIFICATION

INITIAL DATA REPORT

DATA DESCRIPTIVE REPORT

DATA EXPLORATION

REPORT

DATA QUALITY REPORT


DATA

UNDERST’ING

DATA


IMPLEMEN

TATION

IDADM

METROFANG: a real story about data understanding (2)caudal entrada

0,00

50,00

100,00

150,00

200,00

250,00

300,00

350,00

1 1768 3535 5302 7069 8836 10603 12370 14137 15904 17671

Par motor Secador A

0,00

20,00

40,00

60,00

80,00

100,00

120,00

140,00

1 1768 3535 5302 7069 8836 10603 12370 14137 15904 17671

Missing data

Stationality

Outliers

Time Series

Weekend?

FORUM???

IDADM

www.secadolodos.com/73027_es/METROFANG‐(Barcelona‐España)

CRISP: Phases: Data preparation

DATA SELECTION

DATA CLEANING

RECONSTRUCT DATA

INTEGRATE DATA

DATA FORMATTING

ARGUMENTS FOR SELECTION

DATA CLEANING REPORT

DERIVATED VARIABLES

INTEGRATED DATA

OSERVATIONS GENERATED

DATA WITH NEW FORMAT


DATA

UNDERST’ING

DATA


IMPLEMEN

TATION

IDADM

Is data preparation that important?

IDADM

How large is it? … (‘13)IDADM

Some fun facts:• Google processes over 20 PB worth of data every day.

• Back in December 2007, YouTube generated 27 PB of traffic.

• The CERN Large Hadron Collider (HLC) generetes about 20 PB of usable data per year.

• The volume of global annual data traffic is expected exceed 60,000 PB in 2016, from 8,000

petabytes in 2011

• In the next decade, astronomers expect to be processing 10 PB of data every hour from the Square

Kilometre Array (SKA) telescope ►one exabyte every four days.

Data “manipulation”tools …(‐>’13)

IDADM

CRISP: Phases: Modelling

SELECT MODELINGTECHNIQUE

CREATE TEST DESIGN

BUILDMODEL

VALIDATE MODEL

SELECTED

TECHNIQUE

TEST DESIGN

PARAMETER SELECTION

MODEL VALIDATION

MODEL MODEL DESCRIPTION


DATA

UNDERST’ING

DATA


IMPLEMEN

TATION

IDADM

CRISP: Selection of techniquesU N I V E R S E OF T E C H N I Q U E S

TECHNIQUES SUITED TO A PROBLEM

POLITICAL REQUIREMENTS

(Business, executive)

LIMITATIONS

Data types, knowledge

SELECTED TOOL(S)

Money, time, hh.rr.

(Definided by tools)

IDADM

end of last session wrap‐up

Commonly used models/techniques (‘07)…

IDADM

Commonly used models/techniques (‘11)…IDADM

CRISP: Phases: Evaluation

EVALUATE RESULTS

REVISE PROCESSES

DETERMINE NEXT STEPS

EVALUATION OF DM RESULTS

REVISION OF THE PROCESS

LIST OF POSSIBLE ACTIONS

DECISSIONS

APPROVED MODELS


DATA

UNDERST’ING

DATA


IMPLEMEN

TATION

IDADM

Model results should be evaluated in the context of the problem objectives established in the first phase (problem understanding)

This will lead to the identification of other needs, frequently reverting to prior phases of CRISP‐DM. Gaining problem understanding is an iterative procedure in DM, where the results show the user new relationships that provide a deeper understanding of the problem.

CRISP: Phases: Deployment

PLAN IMPLEMENTATION

PLAN MONITORIZATION & MAINTENANCE

FINAL REPORT PRODUCTION

PROJECTREVISION

IMPLEMENTATION PLAN

MONITORIZATION & MAINTENANCE PLAN

FINAL REPORT

DOCUMENTATION OF EXPERIENCE

FINAL PRESENTATION


DATA

UNDERST’ING

DATA


IMPLEMEN

TATION

IDADM

Through the knowledge discovered in the earlier phases of the CRISP‐DM process, sound models can be obtained that may then be applied to address practical issues in the area of the problem.These models need to be monitored for changes in operating conditions, because what might be true today may not be true even in the near future. If significant changes do occur, the model should be reapprised.

How do you deploy it? (’06‐>’09)IDADM

Software popularity (‘07)

Free vs. commercial:

debate!

IDADM

Software popularity (‘09)

IDADM

Ouch! (‘10)

Consolidated new scene (‘12)

A note on CRISP-DM 2.0CRISP-2.0: Updating the Methodology

Why?

Many changes have occurred in the business application of data mining since CRISP‐DM 1.0 was published. Emerging issues and requirements include:

The availability of new types of data—text, Web, and attitudinal data, for example—along with new techniques for pre‐processing, analyzing, and combining them with related case data

Integration and deployment of results with operational systems such as call centers and Web sites

Far more demanding requirements for scalability and for deployment into real‐time environments

The need to package analytical tasks for non‐analytical end users and integrate these tasks in business workflows

The need to seamlessly integrate the deployment of results and closed‐loop feedback with existing business processes

The need to mine large‐scale databases in situ, rather than exporting an analytical dataset Organizations’ increasing reliance on teams, making it important to educate greater numbers of people on the processes and best practices associated with data mining and predictive analytics

In July 2006 the consortium announced that it was going to start the process of working towards a second version of CRISP‐DM. On 26 September 2006, the CRISP‐DM SIG met to discuss potential enhancements for CRISP‐DM 2.0 and the subsequent roadmap. However, these efforts appear to be stalled. The SIG has not met, updated the CRISP website, or communicated anything to members since early 2007.

IDADM

Show me the money!

Mining jobs…Company: TravelersPosition: Predictive Modeling DirectorLocation: Minnesota, US We are one of the leading insurance companies in the United States … As a Director, you will employ the most advanced predictive modeling and data mining tools and methodologies in cutting‐edge research. You will manage strategic projects like next generation customer pricing plans, optimization of marketing practices, assessment of commission structures, and improvement in reserving practices.Work Experience* 7+ years of research, modeling or actuarial experience* Proven ability in statistical modeling and data mining techniques (GLM, clustering, decision trees, etc.)* Experience in preparing large, complex, and dirty datasets for analysis* Knowledge of insurance products and operations* Familiarity with SAS programming a plus

Certificates/Degrees* College degree in statistics, mathematics, economics, finance, actuarial science, engineering or related field.

Keywords* Statistical Analysis GLM Generalized Linear Models Math* Statistics Database Marketing SAS Actuary* Probability Predictive Modeling Time Series Regression* Economics S‐Plus Data Mining Data Analysis

Mining jobs …

Company: Microsoft adCenter LabPosition: Data Mining AnalystLocation: Redmond, WA * Analyze a huge amount of data by using data mining and statistics techniques; generate actionable insights for improving advertising technology and systems.* Collaborate with other researchers in the lab on discovering problems in different areas where data mining/machine learning/statistics can help; act as an expert in the area of data mining/machine learning/statistics in the advertising technology field* Design and carry out experiments to evaluate different algorithms and their real impact on advertising business* Research and exploration in the areas of data mining, machine learning, and statistics.Qualifications:* Extensive knowledge and experience in data mining/machine learning/statistics/databases* Solid experience in very large real world data analysis* Experience/knowledge with various data analysis tools, data mining tools, and statistical packages such as R, SAS* Ability to explain statistical concepts to non‐statisticians.* Proficiency with databases and programming languages such as C#, Perl or F#* Master degree or PhD (preferred) in the area of data mining/machine learning/statistics is required.

Mining jobs…Company: Dow AgroSciencesPosition: Computational BiologistLocation: Indianapolis, US Dow AgroSciences, LLC, is a top tier agricultural company providing innovative crop protection, pest and vegetation management, seed, and agricultural biotechnology solutions to serve the world’s growing population.DescriptionDow AgroSciences seeks a Computational Biologist to assist in the analyses of biological datasets. Key job responsibilities will include development and application of machine learning and statistical algorithms and software to mine biological datasets. Extensive development of software to integrate and automate data for mining is also expected. Topics of interest include bioinformatics, Bayesian networks, support vector machines, instance‐based algorithms, decision trees, neural networks, text mining, mixed models, spatial models, and time‐series modeling. Analysis of biological experiments, interpretation and presentation of results are also expected as well as to act as a guide for the ongoing research process.

Qualifications:* Ph.D. in computer sciences, bioinformatics, computational biology, statistics or closely related field, is required.* An emphasis in machine learning is highly desired.* A solid foundation on statistics is preferred.* Demonstrated ability to write scientific papers is desirable.

¡Empleos! de minero…Company: G2 Marketing Intelligence (WPP Group)Position: Senior Data MinerLocation: Madrid

* Position requires a demonstrated understanding and wide experience in CRM international projects, related with segmentation, modeling, relationship programs, ...* Experience in experimental design and artificial intelligence* Position requires >5 years experience designing own methodologies and utilizing multiple learning techniques, such as, for instance, cluster analysis, discriminant analysis, optimization, decision trees and neural networks.* Experience with relational databases such as Oracle, SQLServer, and Data Marts/Data Warehouses.* Experience with tools as: SPSS, Clementine, SAS, MATLAB, Reporting and Analysis Services, ...Qualifications:* Position requires a Master degree in Engineering, Mathematics, Statistics, or related IT,

experimental studies.* High level of English language.

Mining jobs…

Mining jobs … socially

email lists and groups…The CS department at UC Santa Cruz invites applications for a tenured position at all levels in areas relevant to Big Data and Data Science broadly conceived, such as database management, machine learning, and data storage systems

Please visit: http://apo.ucsc.edu/academic_employment/jobs/JPF00059‐14.pdf

… for a full description and application instructions. To ensure full consideration, applications for this position, including letters of reference, must arrive by December 17, 2013

email lists and groups…Xerox Research Centre Europe (XRCE) and the Computer Vision Center (CVC) at the Universitat Autònoma de Barcelona (UAB) …

…are seeking a candidate for a 3‐year PhD on the domain of computer vision. The funded research will aim at exploring the use of synthetic data to tackle object detection and image classification challenges building on the strong competencies of CVC and XRCE in those aspects. The PhD student will be physically located at XRCE and will occasionally visit the CVC/UAB.

We seek a motivated candidate with a university degree in computer science, telecommunications, mathematics, physics or similar discipline… Moreover, a master degree in either Computer Vision, Machine Learning, Applied Mathematics or similar is also required. Criteria: ‐ High motivation for research. ‐ Capability of working in an autonomous way. ‐ Good

mathematical understanding. ‐ Good programming skills (C, C++, MatLab, Python) … start is expected by January 2014 (if possible, earlier).

Candidates should send an e‐mail to [email protected] and jose‐[email protected] (subject: XRCE‐ADAS_CVC‐2013). Applications are considered until October 1st, 2013.

Some (hopefully) useful resources …

Some bibliography available at books.google.com:

Data mining: practical machine learning tools and techniquesI.H. Witten, E. Frank (2005)

Data mining: concepts and techniquesJ. Han, M. Kamber (2006)

Principles of data miningD. J. Hand, H. Mannila, P. Smyth (2001)

Some FREE SOFTWARE to know about …

KEEL (.es)

IDADM

WEKA (.nz)

IDADM

RapidMiner (.us)

IDADM

KNIME (.de)

IDADM

Documents

Data Mining IIavellido/teaching/13-14/... · * Analyze a huge amount of data by using data mining and statistics techniques; generate actionable insights for improving advertising