If you can't read please download the document
Upload
dayton
View
31
Download
1
Tags:
Embed Size (px)
DESCRIPTION
Issues in Data Mining Infrastructure. Authors:Nemanja Jovanovic, [email protected] Valentina Milenkovic, [email protected] Prof. Dr. Veljko Milutinovic, [email protected] http://galeb.etf.bg.ac.yu/~vm. Data Mining in the Nutshell. Uncovering the hidden knowledge. - PowerPoint PPT Presentation
Citation preview
Issues in Data Mining InfrastructureAuthors:Nemanja Jovanovic, [email protected] Milenkovic, [email protected] Prof. Dr. Veljko Milutinovic, [email protected]
http://galeb.etf.bg.ac.yu/~vm
Data Mining in the NutshellUncovering the hidden knowledgeHuge n-p complete search spaceMultidimensional interface
A Problem You are a marketing manager for a cellular phone companyProblem: Churn is too highBringing back a customer after quitting is both difficult and expensiveGiving a new telephone to everyone whose contract is expiring is very expensive (as well as wasteful)You pay a sales commission of 250$ per contractCustomers receive free phone (cost 125$) with contractTurnover (after contract expires) is 40%
A SolutionThree months before a contract expires, predict which customers will leaveIf you want to keep a customer that is predicted to churn, offer them a new phoneThe ones that are not predicted to churn need no attentionIf you dont want to keep the customer, do nothingHow can you predict future behavior?Tarot Cards?Magic Ball?Data Mining?
Still Skeptical?
The DefinitionAutomatedThe automated extraction of predictive information from (large) databases ExtractionPredictiveDatabases
History of Data Mining
Repetition in Solar Activity1613 Galileo Galilei1859 Heinrich Schwabe
The Return of the Halley Comet191019862061 ???153116071682239 BCEdmund Halley (1656 - 1742)
Data Mining is NotData warehousingAd-hoc query/reportingOnline Analytical Processing (OLAP)Data visualization
Data Mining isAutomated extraction of predictive information from various data sourcesPowerful technology with great potential to help users focus on the most important information stored in data warehouses or streamed through communication lines
Data Mining canAnswer question that were too time consuming to resolve in the pastPredict future trends and behaviors, allowing us to make proactive, knowledge driven decision
Focus of this PresentationData Mining problem typesData Mining models and algorithmsEfficient Data MiningAvailable software
Data Mining Problem Types
Data Mining Problem Types6 types Often a combination solves the problem
Data Description and SummarizationAims at concise description of data characteristicsLower end of scale of problem typesProvides the user an overview of the data structureTypically a sub goal
SegmentationSeparates the data into interesting and meaningful subgroups or classesManual or (semi)automaticA problem for itself or just a step in solving a problem
ClassificationAssumption: existence of objects with characteristics that belong to different classesBuilding classification models which assign correct labels in advanceExists in wide range of various applicationSegmentation can provide labels or restrict data sets
Concept DescriptionUnderstandable description of concepts or classesClose connection to both segmentation and classificationSimilarity and differences to classification
Prediction (Regression)Similar to classification- difference: discrete becomes continuousFinds the numerical value of the target attribute for unseen objects
Dependency AnalysisFinding the model that describes significant dependences between data items or eventsPrediction of value of a data itemSpecial case: associations
Data Mining Models
Neural NetworksCharacterizes processed data with single numeric valueEfficient modeling of large and complex problemsBased on biological structures NeuronsNetwork consists of neurons grouped into layers
Neuron FunctionalityI1I2I3InOutputW1W2W3WnfOutput = f (W1*I1, W2*I1, , Wn*In)
Training Neural Networks
Neural Networks - ConclusionOnce trained, Neural Networks can efficiently estimate value of output variable for given inputNeurons and network topology are essentialsUsually used for prediction or regression problem typesDifficult to understandData pre-processing often required
Decision TreesA way of representing a series of rules that lead to a class or valueIterative splitting of data into discrete groups maximizing distance between them at each splitCHAID, CHART, Quest, C5.0Classification trees and regression treesUnlimited growth and stopping rulesUnivariate splits and multivariate splits
Decision Trees
Rule InductionMethod of deriving a set of rules to classify casesCreates independent rules that are unlikely to form a treeRules may not cover all possible situationsRules may sometimes conflict in a prediction
K-nearest Neighbor and Memory-Based Reasoning (MBR)Usage of knowledge of previously solved similar problems in solving the new problemAssigning the class to the group where most of the k-neighbors belongFirst step finding the suitable measure for distance between attributes in the dataHow far is black from green?+ Easy handling of non-standard data types- Huge models
K-nearest Neighbor and Memory-Based Reasoning (MBR)
Data Mining Models and AlgorithmsLogistic regressionDiscriminant analysisGeneralized Adaptive Models (GAM)Genetic algorithmsEtcMany other available models and algorithmsMany application specific variations of known modelsFinal implementation usually involves several techniquesSelection of solution that match best results
Efficient Data Mining
Dont Mess With It!YESNOYESYou Shouldnt Have!NOWill it ExplodeIn Your Hands?NOLook The Other WayAnyone ElseKnows?Youre in TROUBLE!YESYESNOHide ItCan You Blame Someone Else?NONO PROBLEM!YESIs It Working?Did You Mess With It?
DM Process ModelCRISPDM tends to become a standard5A used by SPSS Clementine (Assess, Access, Analyze, Act and Automate)SEMMA used by SAS Enterprise Miner (Sample, Explore, Modify, Model and Assess)
CRISP - DMCRoss-Industry Standard for DMConceived in 1996 by three companies:
CRISP DM methodologyFour level breakdown of the CRISP-DM methodology:PhasesGeneric TasksProcess InstancesSpecialized Tasks
Mapping generic models to specialized modelsAnalyze the specific contextRemove any details not applicable to the contextAdd any details specific to the contextSpecialize generic context according to concrete characteristic of the contextPossibly rename generic contents to provide more explicit meanings
Generalized and Specialized CookingPreparing food on your ownFind out what you want to eatFind the recipe for that mealGather the ingredientsPrepare the mealEnjoy your foodClean up everything (or leave it for later)Raw stake with vegetables?Check the Cookbook or call momDefrost the meat (if you had it in the fridge)Buy missing ingredients or borrow the from the neighborsCook the vegetables and fry the meatEnjoy your food or even moreYou were cooking so convince someone else to do the dishes
CRISP DM modelBusiness understandingData understandingData preparationModelingEvaluationDeploymentBusiness understandingData understandingData preparationModelingEvaluationDeployment
Business UnderstandingDetermine business objectivesAssess situationDetermine data mining goalsProduce project plan
Data UnderstandingCollect initial dataDescribe dataExplore dataVerify data quality
Data PreparationSelect dataClean dataConstruct dataIntegrate dataFormat data
ModelingSelect modeling techniqueGenerate test designBuild modelAssess model
EvaluationEvaluate resultsReview processDetermine next stepsresults = models + findings
DeploymentPlan deploymentPlan monitoring and maintenance Produce final reportReview project
At Last
Available Software14
Conclusions
WWW.NBA.COM
Se7en
CD ROM
CreditsAnne Stern, SPSS, Inc.Djuro Gluvajic, ITE, DenmarkObrad Milivojevic, PC PRO, Yugoslavia
ReferencesBruha, I., Data Mining, KDD and Knowledge Integration: Methodology and A case Study, SSGRR 2000Fayyad, U., Shapiro, P., Smyth, P., Uthurusamy, R., Advances in Knowledge Discovery and Data Mining, MIT Press, 1996Glumour, C., Maddigan, D., Pregibon, D., Smyth, P., Statistical Themes nad Lessons for Data Mining, Data Mining And Knowledge Discovery 1, 11-28, 1997Hecht-Nilsen, R., Neurocomputing, Addison-Wesley, 1990Pyle, D., Data Preparation for Data Mining, Morgan Kaufman, 1999galeb.etf.bg.ac.yu/~vmwww.thearling.comwww.crisp-dm.comwww.twocrows.comwww.sas.com/products/minerwww.spss.com/clementine
The END