Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
Lluis Belanche + Alfredo Vellido
Intelligent Data Analysis and Data Miningor …
Data Analysis and Knowledge Discoverya.k.a. Data Mining II
Last sessions wrap‐up
CRISP: The virtuous loop of methodology phases
IDADM
CRISP: Phases: Problem understanding
DETERMINEPROBLEMGOAL
ASSESS SITUATION
DETERMINEDM
GOALS
PRODUCE PROJECTPLAN
PROBLEM UNDERSTANDING
DATA
UNDERST’ING
DATA
PREPARATIONMODELLING EVALUATION
IMPLEMEN
TATION
BACKGROUND
INVENTORY RESOURCES
GOALS DM
PROJECT
PLAN
PROBLEM
GOALS
SUCCESS
CRITERIA
SUCCESS CRITERIA DM
REQUERIMS. ASSUMPTIONS LIMITATIONS
RISKS CONTINGEN. TERMINOLOG. COSTS &
BENEFITS
INITIAL SELECTION OF
TOOLS
IDADM
DM application areas (’10‐>’11)IDADM
CRISP: Phases: Data understanding
OBTAIN INITIAL DATA
DESCRIPTION DATA
DATAEXPLORATION
DATA QUALITYVERIFICATION
INITIAL DATA REPORT
DATA DESCRIPTIVE REPORT
DATA EXPLORATION
REPORT
DATA QUALITY REPORT
PROBLEM UNDERSTANDING
DATA
UNDERST’ING
DATA
PREPARATIONMODELLING EVALUATION
IMPLEMEN
TATION
IDADM
METROFANG: a real story about data understanding (2)caudal entrada
0,00
50,00
100,00
150,00
200,00
250,00
300,00
350,00
1 1768 3535 5302 7069 8836 10603 12370 14137 15904 17671
Par motor Secador A
0,00
20,00
40,00
60,00
80,00
100,00
120,00
140,00
1 1768 3535 5302 7069 8836 10603 12370 14137 15904 17671
Missing data
Stationality
Outliers
Time Series
Weekend?
FORUM???
IDADM
www.secadolodos.com/73027_es/METROFANG‐(Barcelona‐España)
CRISP: Phases: Data preparation
DATA SELECTION
DATA CLEANING
RECONSTRUCT DATA
INTEGRATE DATA
DATA FORMATTING
ARGUMENTS FOR SELECTION
DATA CLEANING REPORT
DERIVATED VARIABLES
INTEGRATED DATA
OSERVATIONS GENERATED
DATA WITH NEW FORMAT
PROBLEM UNDERSTANDING
DATA
UNDERST’ING
DATA
PREPARATIONMODELLING EVALUATION
IMPLEMEN
TATION
IDADM
Is data preparation that important?
IDADM
How large is it? … (‘13)IDADM
Some fun facts:• Google processes over 20 PB worth of data every day.
• Back in December 2007, YouTube generated 27 PB of traffic.
• The CERN Large Hadron Collider (HLC) generetes about 20 PB of usable data per year.
• The volume of global annual data traffic is expected exceed 60,000 PB in 2016, from 8,000
petabytes in 2011
• In the next decade, astronomers expect to be processing 10 PB of data every hour from the Square
Kilometre Array (SKA) telescope ►one exabyte every four days.
Data “manipulation”tools …(‐>’13)
IDADM
CRISP: Phases: Modelling
SELECT MODELINGTECHNIQUE
CREATE TEST DESIGN
BUILDMODEL
VALIDATE MODEL
SELECTED
TECHNIQUE
TEST DESIGN
PARAMETER SELECTION
MODEL VALIDATION
MODEL MODEL DESCRIPTION
PROBLEM UNDERSTANDING
DATA
UNDERST’ING
DATA
PREPARATIONMODELLING EVALUATION
IMPLEMEN
TATION
IDADM
CRISP: Selection of techniquesU N I V E R S E OF T E C H N I Q U E S
TECHNIQUES SUITED TO A PROBLEM
POLITICAL REQUIREMENTS
(Business, executive)
LIMITATIONS
Data types, knowledge
SELECTED TOOL(S)
Money, time, hh.rr.
(Definided by tools)
IDADM
end of last session wrap‐up
Commonly used models/techniques (‘07)…
IDADM
Commonly used models/techniques (‘11)…IDADM
CRISP: Phases: Evaluation
EVALUATE RESULTS
REVISE PROCESSES
DETERMINE NEXT STEPS
EVALUATION OF DM RESULTS
REVISION OF THE PROCESS
LIST OF POSSIBLE ACTIONS
DECISSIONS
APPROVED MODELS
PROBLEM UNDERSTANDING
DATA
UNDERST’ING
DATA
PREPARATIONMODELLING EVALUATION
IMPLEMEN
TATION
IDADM
Model results should be evaluated in the context of the problem objectives established in the first phase (problem understanding)
This will lead to the identification of other needs, frequently reverting to prior phases of CRISP‐DM. Gaining problem understanding is an iterative procedure in DM, where the results show the user new relationships that provide a deeper understanding of the problem.
CRISP: Phases: Deployment
PLAN IMPLEMENTATION
PLAN MONITORIZATION & MAINTENANCE
FINAL REPORT PRODUCTION
PROJECTREVISION
IMPLEMENTATION PLAN
MONITORIZATION & MAINTENANCE PLAN
FINAL REPORT
DOCUMENTATION OF EXPERIENCE
FINAL PRESENTATION
PROBLEM UNDERSTANDING
DATA
UNDERST’ING
DATA
PREPARATIONMODELLING EVALUATION
IMPLEMEN
TATION
IDADM
Through the knowledge discovered in the earlier phases of the CRISP‐DM process, sound models can be obtained that may then be applied to address practical issues in the area of the problem.These models need to be monitored for changes in operating conditions, because what might be true today may not be true even in the near future. If significant changes do occur, the model should be reapprised.
How do you deploy it? (’06‐>’09)IDADM
Software popularity (‘07)
Free vs. commercial:
debate!
IDADM
Software popularity (‘09)
IDADM
Ouch! (‘10)
Consolidated new scene (‘12)
A note on CRISP-DM 2.0CRISP-2.0: Updating the Methodology
Why?
Many changes have occurred in the business application of data mining since CRISP‐DM 1.0 was published. Emerging issues and requirements include:
The availability of new types of data—text, Web, and attitudinal data, for example—along with new techniques for pre‐processing, analyzing, and combining them with related case data
Integration and deployment of results with operational systems such as call centers and Web sites
Far more demanding requirements for scalability and for deployment into real‐time environments
The need to package analytical tasks for non‐analytical end users and integrate these tasks in business workflows
The need to seamlessly integrate the deployment of results and closed‐loop feedback with existing business processes
The need to mine large‐scale databases in situ, rather than exporting an analytical dataset Organizations’ increasing reliance on teams, making it important to educate greater numbers of people on the processes and best practices associated with data mining and predictive analytics
In July 2006 the consortium announced that it was going to start the process of working towards a second version of CRISP‐DM. On 26 September 2006, the CRISP‐DM SIG met to discuss potential enhancements for CRISP‐DM 2.0 and the subsequent roadmap. However, these efforts appear to be stalled. The SIG has not met, updated the CRISP website, or communicated anything to members since early 2007.
IDADM
Show me the money!
Mining jobs…Company: TravelersPosition: Predictive Modeling DirectorLocation: Minnesota, US We are one of the leading insurance companies in the United States … As a Director, you will employ the most advanced predictive modeling and data mining tools and methodologies in cutting‐edge research. You will manage strategic projects like next generation customer pricing plans, optimization of marketing practices, assessment of commission structures, and improvement in reserving practices.Work Experience* 7+ years of research, modeling or actuarial experience* Proven ability in statistical modeling and data mining techniques (GLM, clustering, decision trees, etc.)* Experience in preparing large, complex, and dirty datasets for analysis* Knowledge of insurance products and operations* Familiarity with SAS programming a plus
Certificates/Degrees* College degree in statistics, mathematics, economics, finance, actuarial science, engineering or related field.
Keywords* Statistical Analysis GLM Generalized Linear Models Math* Statistics Database Marketing SAS Actuary* Probability Predictive Modeling Time Series Regression* Economics S‐Plus Data Mining Data Analysis
Mining jobs …
Company: Microsoft adCenter LabPosition: Data Mining AnalystLocation: Redmond, WA * Analyze a huge amount of data by using data mining and statistics techniques; generate actionable insights for improving advertising technology and systems.* Collaborate with other researchers in the lab on discovering problems in different areas where data mining/machine learning/statistics can help; act as an expert in the area of data mining/machine learning/statistics in the advertising technology field* Design and carry out experiments to evaluate different algorithms and their real impact on advertising business* Research and exploration in the areas of data mining, machine learning, and statistics.Qualifications:* Extensive knowledge and experience in data mining/machine learning/statistics/databases* Solid experience in very large real world data analysis* Experience/knowledge with various data analysis tools, data mining tools, and statistical packages such as R, SAS* Ability to explain statistical concepts to non‐statisticians.* Proficiency with databases and programming languages such as C#, Perl or F#* Master degree or PhD (preferred) in the area of data mining/machine learning/statistics is required.
Mining jobs…Company: Dow AgroSciencesPosition: Computational BiologistLocation: Indianapolis, US Dow AgroSciences, LLC, is a top tier agricultural company providing innovative crop protection, pest and vegetation management, seed, and agricultural biotechnology solutions to serve the world’s growing population.DescriptionDow AgroSciences seeks a Computational Biologist to assist in the analyses of biological datasets. Key job responsibilities will include development and application of machine learning and statistical algorithms and software to mine biological datasets. Extensive development of software to integrate and automate data for mining is also expected. Topics of interest include bioinformatics, Bayesian networks, support vector machines, instance‐based algorithms, decision trees, neural networks, text mining, mixed models, spatial models, and time‐series modeling. Analysis of biological experiments, interpretation and presentation of results are also expected as well as to act as a guide for the ongoing research process.
Qualifications:* Ph.D. in computer sciences, bioinformatics, computational biology, statistics or closely related field, is required.* An emphasis in machine learning is highly desired.* A solid foundation on statistics is preferred.* Demonstrated ability to write scientific papers is desirable.
¡Empleos! de minero…Company: G2 Marketing Intelligence (WPP Group)Position: Senior Data MinerLocation: Madrid
* Position requires a demonstrated understanding and wide experience in CRM international projects, related with segmentation, modeling, relationship programs, ...* Experience in experimental design and artificial intelligence* Position requires >5 years experience designing own methodologies and utilizing multiple learning techniques, such as, for instance, cluster analysis, discriminant analysis, optimization, decision trees and neural networks.* Experience with relational databases such as Oracle, SQLServer, and Data Marts/Data Warehouses.* Experience with tools as: SPSS, Clementine, SAS, MATLAB, Reporting and Analysis Services, ...Qualifications:* Position requires a Master degree in Engineering, Mathematics, Statistics, or related IT,
experimental studies.* High level of English language.
Mining jobs…
Mining jobs … socially
email lists and groups…The CS department at UC Santa Cruz invites applications for a tenured position at all levels in areas relevant to Big Data and Data Science broadly conceived, such as database management, machine learning, and data storage systems
Please visit: http://apo.ucsc.edu/academic_employment/jobs/JPF00059‐14.pdf
… for a full description and application instructions. To ensure full consideration, applications for this position, including letters of reference, must arrive by December 17, 2013
email lists and groups…Xerox Research Centre Europe (XRCE) and the Computer Vision Center (CVC) at the Universitat Autònoma de Barcelona (UAB) …
…are seeking a candidate for a 3‐year PhD on the domain of computer vision. The funded research will aim at exploring the use of synthetic data to tackle object detection and image classification challenges building on the strong competencies of CVC and XRCE in those aspects. The PhD student will be physically located at XRCE and will occasionally visit the CVC/UAB.
We seek a motivated candidate with a university degree in computer science, telecommunications, mathematics, physics or similar discipline… Moreover, a master degree in either Computer Vision, Machine Learning, Applied Mathematics or similar is also required. Criteria: ‐ High motivation for research. ‐ Capability of working in an autonomous way. ‐ Good
mathematical understanding. ‐ Good programming skills (C, C++, MatLab, Python) … start is expected by January 2014 (if possible, earlier).
Candidates should send an e‐mail to [email protected] and jose‐[email protected] (subject: XRCE‐ADAS_CVC‐2013). Applications are considered until October 1st, 2013.
Some (hopefully) useful resources …
Some bibliography available at books.google.com:
Data mining: practical machine learning tools and techniquesI.H. Witten, E. Frank (2005)
Data mining: concepts and techniquesJ. Han, M. Kamber (2006)
Principles of data miningD. J. Hand, H. Mannila, P. Smyth (2001)
Some FREE SOFTWARE to know about …
KEEL (.es)
IDADM
WEKA (.nz)
IDADM
RapidMiner (.us)
IDADM
KNIME (.de)
IDADM