8/8/2019 DM Online UNIT 1 P1 Data Mining Concepts PDF
CECS 632-50 UNIT 1
Data Mining Concepts
Mehmed Kantardzic
Louisville, 2006
Questions about Data Mining
- What is data mining?
- Why data mining: motivation and benefits?
- What kind of data to mine?
- When to mine the data?
- How to organize the mining process?
- What are challenges in data mining?
Why Data Mining Now?
Trends leading to data flood:
- Bank, telecom, other business transactions ...
- Scientific data: astronomy, biology, etc.
- Web, text, and e-commerce
5 Million Terabytes Created in 2002!
- UC Berkeley 2003 estimate: 5 exabytes (5 million terabytes) of new data was created in 2002.
  www.sims.berkeley.edu/research/projects/how-much-info-2003/
- US produces ~40% of new stored data worldwide.
Largest Databases
Commercial databases:
- Winter Corp. 2003 Survey: France Telecom has largest decision-support DB, ~30 TB; AT&T ~26 TB
Europe's Very Long Baseline Interferometry (VLBI) has 16 telescopes, each of which produces 1 Gigabit/second of astronomical data over a 25-day observation session.
Web:
- Alexa internet archive: 7 years of data, 500 TB
- Google searches 4+ billion pages, many hundreds TB
- IBM WebFountain, 160 TB (2003)
- Internet Archive (www.archive.org), ~300 TB
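As a rough check on the VLBI figures above, the volume of one observation session can be worked out directly. This is only a sketch: it assumes decimal units (1 Gigabit = 10^9 bits) and that all 16 telescopes record continuously for the full 25 days. Neither assumption is stated on the slide.

```python
# Back-of-the-envelope data volume for one VLBI observation session.
# Assumptions (not from the slide): decimal units, continuous recording.
TELESCOPES = 16
BITS_PER_SECOND = 10**9        # 1 Gigabit/second per telescope
SESSION_DAYS = 25
SECONDS_PER_DAY = 24 * 60 * 60


def session_bytes(telescopes, bits_per_second, days):
    """Total bytes produced by all telescopes over the whole session."""
    total_bits = telescopes * bits_per_second * days * SECONDS_PER_DAY
    return total_bits // 8


total_bytes = session_bytes(TELESCOPES, BITS_PER_SECOND, SESSION_DAYS)
print(f"{total_bytes / 1e15:.2f} PB")  # prints "4.32 PB"
```

Under these assumptions a single session is on the order of petabytes, well above the commercial and Web figures quoted on the same slide.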
Do You Need More Examples?
- MEDLINE text database
  - 12 million published articles
- Google
  - 4.2 billion Web pages indexed
  - 80 million site visitors per day
- CALTRANS loop sensor data
  - Every 30 seconds, thousands of sensors, 2 Gbytes per second
- NASA MODIS satellite
  - Coverage at 250 m resolution, 37 bands, whole earth, every day
- Walmart transaction data
  - Order of 100 million transactions per day
Why Data Mining Now?
Data explosion causes data wasting:
- Only a small portion (5% - 10%) of the collected data is ever analyzed.
- Data that may never be analyzed continues to be collected at great expense.
WE ARE DROWNING IN DATA, BUT STARVING FOR KNOWLEDGE!
Why Data Mining Now?
Sources of data overload:
- Distributed data sources
- Remote sensing
- Internet
- Multimedia data
[Figure: Exponential growth of digital information. Internet hosts: about 4 x 10^5 in 1988, about 2 x 10^7 in 2000.]
Data size and dimensionality are too large for manual analysis and interpretation.
There exists a gap between data collection and organization capabilities, and abilities to analyze large data sets and extract useful information for decision processes.
Managers Believe
- 61% believe that information overload is present in their workplace.
- 80% believe the situation will get worse.
- 50% ignore large data sets in current decision process.
- 84% store the data for the future without current use or analysis.
- 60% believe that the cost of gathering information outweighs its value!
Why Mine Data Now? Commercial Viewpoint
- Lots of data is being collected and warehoused
  - Web data, e-commerce
  - Purchases at department/grocery stores
  - Bank/credit card transactions
- Computers have become affordable and more powerful
- Competitive pressure is strong
  - Provide better, customized services for an edge, and information is becoming a product in its own right.
Data mining may help?
Why Mine Data Now? Scientific Viewpoint
- Data collected and stored at enormous speeds (GB/hour)
  - remote sensors on a satellite
  - telescopes scanning the skies
  - microarrays generating gene expression data
  - scientific simulations generating terabytes of data
- Traditional techniques infeasible for raw data
- Data mining may help scientists
  - in new discoveries
  - in classifying and segmenting data, detecting patterns
  - in hypothesis formation
Data Mining Now: Opportunity and Challenges
[Diagram centered on DM, with the following labels]
- Data Rich, Knowledge Poor (the resource)
- Enabling Technology (new sensors, OLAP, parallel computing, Web, etc.)
- Competitive Pressure
- Data Mining Technology Mature
What Is Data Mining?
The magic phrase used to ....
- put in your resume
- use in a proposal to NSF, NIH, NASA, etc.
- market database software
- sell statistical analysis software
- sell parallel computing hardware
- sell consulting services
- take refuge from the collapse of some AI promises
Data Mining Is NOT
- Brute-force crunching of bulk data
- Blind application of algorithms
- Going to find relationships where none exist
- Presenting data in different ways
- A database intensive task
- A difficult to understand technology requiring an advanced degree in computer science
Also Data Mining Is NOT
- Data warehousing
- SQL / Ad Hoc Queries / Reporting
- Software Agents
- Online Analytical Processing (OLAP)
- Data Visualization
What Is Data Mining?
In many domains there is a shift from classical modeling and analyses based on first principles to developing models and corresponding analyses directly from data.
[Diagram: Data -> DATA MINING PROCESS -> Model]
What Is Data Mining?
Data Mining is a process for the automatic extraction of non-obvious, hidden knowledge from large volumes of data.
- 10^6 to 10^12 bytes: never see the whole data set or put it in the memory of computers
- Data mining process?
- What knowledge? How to represent and use it?
What Is Data Mining?
Potential point of confusion:
- The "extracting ore from rock" metaphor does not really apply to the practice of data mining
- If it did, then standard database queries would fit under the rubric of data mining
In practice, DM refers to:
- finding patterns/models across large datasets
- discovering unknown information
What Is Data Mining?
"The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns/models in data"
Fayyad, Piatetsky-Shapiro, Smyth (1996)
- non-trivial process: multiple processes and iterations
- valid: justified patterns/models
- novel: previously unknown
- useful: can be used by end user
- understandable: by human and machine
From Data to Knowledge
...10, M, 0, 10, 10, 0, 0, 0, SUBACUTE, 37, 2, 1, 0,15,-,-, 6000, 2, 0, abnormal, abnormal,-, 2852, 2148,712, 97, 49, F,-,multiple,,2137, negative, n, n, ABSCESS,VIRUS
12, M, 0, 5, 5, 0, 0, 0, ACUTE, 38.5, 2, 1, 0,15, -,-, 10700,4,0,normal, abnormal, +, 1080, 680, 400, 71,59, F,-,ABPC+CZX,, 70, negative, n, n, n, BACTERIA, BACTERIA
15, M, 0, 3, 2, 3, 0, 0, ACUTE, 39.3, 3, 1, 0,15, -, -, 6000, 0,0, normal, abnormal, +, 1124, 622, 502, 47,63, F, -,FMOX+AMK, , 48, negative, n, n, n, BACTE(E), BACTERIA
16, M, 0, 32, 32, 0, 0, 0, SUBACUTE, 38, 2, 0, 0, 15, -, +, 12600, 4, 0,abnormal, abnormal, +, 41, 39,2, 44, 57, F, -, ABPC+CZX, ?, ? ,negative, ?, n, n, ABSCESS, VIRUS...
Medical Data by Dr. Tsumoto, Tokyo Med. & Dent. Univ., 38 attributes
Annotations on the data: numerical attribute, categorical attribute, missing values, class labels
IF cell_poly 15 THEN Prediction = VIRUS [87.5%] (predictive accuracy)
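The bracketed percentage on a rule like the one above is its predictive accuracy: among the records the rule covers, the fraction whose class label matches the prediction. The sketch below shows that calculation on a made-up sample. The records are hypothetical (not Dr. Tsumoto's data), and the `<=` comparison is an assumption, since the operator is missing from the rule as extracted.

```python
# Hypothetical toy records: (cell_poly, diagnosis). NOT the medical data set.
records = [
    (10, "VIRUS"), (12, "VIRUS"), (15, "VIRUS"), (8, "BACTERIA"),
    (40, "BACTERIA"), (32, "BACTERIA"), (5, "VIRUS"), (60, "BACTERIA"),
]


def rule_accuracy(records, threshold, predicted_class):
    """Evaluate the rule: IF cell_poly <= threshold THEN predicted_class.

    Returns (number of covered records, accuracy on the covered records).
    The "<=" comparison is an assumed reading of the rule.
    """
    covered = [label for value, label in records if value <= threshold]
    if not covered:
        return 0, 0.0
    correct = sum(1 for label in covered if label == predicted_class)
    return len(covered), correct / len(covered)


coverage, accuracy = rule_accuracy(records, threshold=15, predicted_class="VIRUS")
print(coverage, accuracy)  # prints "5 0.8"
```

Coverage and accuracy are worth reporting together: a rule can be highly accurate while covering only a handful of records.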
Possible Business Discoveries
Table 1.3 Acme Investors Incorporated
Customer  Account     Margin   Transaction  Trades/              Favorite    Annual
ID        Type        Account  Method       Month    Sex  Age    Recreation  Income
1005      Joint       No       Online       12.5     F    30-39  Tennis      40-59K
1013      Custodial   No       Broker        0.5     F    50-59  Skiing      80-99K
1245      Joint       No       Online        3.6     M    20-29  Golf        20-39K
2110      Individual  Yes      Broker       22.3     M    30-39  Fishing     40-59K
1001      Individual  Yes      Online        5.0     M    40-49  Golf        60-79K
- Can I develop a general characterisation/profile of different investor types? (CLASSIFICATION)
- What characteristics distinguish between Online and Broker investors? (DISCRIMINATION)
- Can I develop a model which will predict the average trades/month for a new investor? (PREDICTION)
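A first step toward the DISCRIMINATION question above is to compare the two groups on one attribute. The sketch below does this for trades/month, using the five rows of Table 1.3 as transcribed here. The function name is illustrative, and five customers are far too few to support any real conclusion.

```python
from collections import defaultdict

# (customer_id, account_type, transaction_method, trades_per_month)
# Values transcribed from Table 1.3; other columns omitted for brevity.
rows = [
    (1005, "Joint", "Online", 12.5),
    (1013, "Custodial", "Broker", 0.5),
    (1245, "Joint", "Online", 3.6),
    (2110, "Individual", "Broker", 22.3),
    (1001, "Individual", "Online", 5.0),
]


def mean_trades_by_method(rows):
    """Average trades/month for each transaction method."""
    groups = defaultdict(list)
    for _customer_id, _account_type, method, trades in rows:
        groups[method].append(trades)
    return {method: sum(values) / len(values) for method, values in groups.items()}


means = mean_trades_by_method(rows)
for method, value in sorted(means.items()):
    print(f"{method}: {value:.2f} trades/month")
```

The same grouping could be repeated for the other attributes (age band, income band, margin account) to see which of them separate the two groups.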
Data Mining: What's in a Name?
- Information Harvesting
- Knowledge Mining
- Data Mining
- Intelligent Data Analysis
- Knowledge Discovery in Databases
- Data Dredging
- Data Pattern Processing
- Data Archaeology
- Database Mining
- Siftware
- Data Fishing
- Knowledge Extraction
Data Mining Roots
- Statistics: driven by the notion of a model
- Database Technology: concentration on large amounts of data
- Machine Learning: emphasizes algorithms
- Control Theory: to predict a system's behavior, to explain the interactions
- Artificial Neural Networks
- Pattern Recognition
- Chaos Theory
- Data Visualization
It's Mine!!
Let the data speak
The data may have quite a lot to say... but it may just be noise!
Data Mining Process
STATE THE PROBLEM
(COLLECT THE DATA)
DATA PREPROCESSING
ESTIMATE THE MODEL
MODEL INTERPRETATION & CONCLUSIONS
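The five steps above can be sketched as a small pipeline, one function per step. Everything in this example is hypothetical: the function names, the toy observations, and the choice of a least-squares line as the model. It is meant only to show how the stages hand data to one another.

```python
# Step 1, state the problem: predict y from x with a straight line.

def collect_data():
    """Step 2: hypothetical (x, y) observations; None marks a missing value."""
    return [(1.0, 2.1), (2.0, 3.9), (3.0, None), (4.0, 8.2), (5.0, 9.8)]


def preprocess(rows):
    """Step 3: drop records that contain missing values."""
    return [(x, y) for x, y in rows if x is not None and y is not None]


def estimate_model(rows):
    """Step 4: fit y = slope * x + intercept by ordinary least squares."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def interpret(model):
    """Step 5: turn the fitted parameters into a readable statement."""
    slope, intercept = model
    return f"y changes by about {slope:.2f} per unit of x (intercept {intercept:.2f})"


clean_rows = preprocess(collect_data())
model = estimate_model(clean_rows)
print(interpret(model))
```

In practice the steps are revisited repeatedly: an unconvincing interpretation usually sends the analyst back to preprocessing or even to the problem statement.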
Data Mining & (or) KDD Process
Other view: data mining as the core of the knowledge discovery process.
[Diagram: Databases -> Data Cleaning, Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]
Characteristics of Raw Data
- Missing data
- Misrecorded data
- Data may be from the other population (heterogeneous)
- Different structures & formats
- With or without compression
- Redundant
- With implicit temporal & spatial components
Data Mining Techniques
Raw Data = Messy Data
ALGORITHMS for PREPROCESSING:
- Scaling & Normalization
- Encoding
- Outlier Detection & Removal
- Feature Selection & Composition
- Data Cleansing & Scrubbing
- Data Smoothing
- Missing Data Elimination
- Sampling
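Three of the preprocessing steps listed above (missing data elimination, outlier removal, and scaling) can be chained in a few lines. This is a minimal sketch under stated assumptions: the sample values are invented, the z-score cutoff of 2.0 is an arbitrary choice, and min-max scaling is only one of several normalization schemes.

```python
import statistics

# Hypothetical raw measurements; None marks a missing value.
raw = [10, 12, 11, 13, 12, 95, None]


def drop_missing(values):
    """Missing data elimination: discard records with no value."""
    return [v for v in values if v is not None]


def remove_outliers(values, z_cutoff=2.0):
    """Outlier removal: drop values more than z_cutoff population
    standard deviations from the mean (the cutoff is an assumption)."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return list(values)
    return [v for v in values if abs(v - mean) / std <= z_cutoff]


def min_max_scale(values):
    """Scaling: map values linearly onto the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


cleaned = remove_outliers(drop_missing(raw))
scaled = min_max_scale(cleaned)
print(cleaned)  # prints "[10, 12, 11, 13, 12]"
```

The order matters: scaling before removing the outlier would squeeze all the ordinary values into a narrow band near zero.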
Primary Tasks of Data Mining
- Classification: finding the description of several predefined classes and classifying a data item into one of them.
- Regression: maps a data item to a real-valued prediction variable.
- Clustering: identifying a finite set of categories or clusters to describe the data.
- Summarization: finding a compact description for a subset of data.
- Dependency Modeling: finding a model which describes significant dependencies between variables.
- Deviation and change detection: discovering the most significant changes in the data.
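To make the classification task concrete, the sketch below assigns a new data item to one of two predefined classes by copying the label of its closest training example. The nearest-neighbor rule is just one convenient technique chosen for illustration; the training points and class names "A" and "B" are hypothetical.

```python
import math

# Hypothetical labeled examples: ((feature_1, feature_2), class_label)
training_set = [
    ((1.0, 1.0), "A"),
    ((1.5, 2.0), "A"),
    ((5.0, 5.0), "B"),
    ((6.0, 5.5), "B"),
]


def nearest_neighbor_classify(training_set, item):
    """Return the class label of the training example closest to item
    (Euclidean distance, 1-nearest-neighbor)."""
    _features, label = min(
        training_set, key=lambda example: math.dist(example[0], item)
    )
    return label


print(nearest_neighbor_classify(training_set, (1.2, 1.4)))  # prints "A"
print(nearest_neighbor_classify(training_set, (5.5, 5.0)))  # prints "B"
```

Replacing the class label with a real-valued target and averaging over the nearest examples would turn the same idea into a regression method.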
Data Mining Techniques are:
Decision Trees
Nearest Neighbor Classification
Neural Networks
Rule Induction
K-means Clustering
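As an illustration of the last technique in the list, here is a minimal K-means sketch: assign each point to its nearest centroid, move each centroid to the mean of its points, and repeat until nothing changes. The points are invented, and initializing the centroids with the first k points is a simplification chosen for reproducibility, not a recommended practice.

```python
import math


def k_means(points, k, max_iterations=10):
    """Basic K-means. Centroids start at the first k points (an
    assumption made for determinism; real use needs better seeding)."""
    centroids = list(points[:k])
    clusters = [[] for _ in range(k)]
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k), key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Update step: each centroid moves to the mean of its cluster.
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append(
                    tuple(sum(coords) / len(cluster) for coords in zip(*cluster))
                )
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return centroids, clusters


# Hypothetical 2-D points forming two visible groups.
points = [(1.0, 1.0), (5.0, 5.0), (1.5, 0.5), (5.5, 4.5), (0.5, 1.5), (4.5, 5.5)]
centroids, clusters = k_means(points, k=2)
print(centroids)
```

The result depends on the starting centroids, which is why practical implementations run several random restarts and keep the best clustering.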
Data Mining Bubble?
Is data mining a lot of hammers looking for nails?
Potential Data Mining Applications
Business
- Marketing and sales data analysis
- Investment analysis
- Loan approval
- Fraud detection
- etc.
Manufacturing
- Controlling and scheduling
- Network management
- Sensor monitoring
- etc.
Science
- Gene analysis
- Space image classification
- Experiment result analysis
- etc.
Personal
Data Mining Is Spreading
FINANCIAL INSTITUTIONS
RETAIL INDUSTRY
TELECOMMUNICATION INDUSTRY
HEALTH INDUSTRY
SCIENCE & ENGINEERING
GOVERNMENT
E-COMMERCE
Data Mining: When & How?
SOME HINTS FOR SUCCESS:
- Business or scientific needs are more important than the razzle-dazzle of a technical solution.
- Preparing data is as much as 80% of the mining process.
- Don't rely on a single methodology!
- Keep the end-users informed and involved: this is an interdisciplinary task.
- Data mining is an iterative and interactive process: be prepared to generate a lot of garbage until you hit something that is actionable, meaningful, and useful.