Data mining: some basic ideas
Francisco Moreno
Excerpts from Fundamentals of DB Systems, Elmasri & Navathe, and other sources

Data mining

• For many years, organizations have generated a large amount of data in the form of files and databases

• These data can be processed using database technology with languages such as SQL

• SQL drawbacks: it assumes that the user knows the DB schema, and some queries can become very complex, for example, those oriented to discovering information…

Data mining

• Data mining refers to the discovery of information in terms of patterns or rules from vast amounts of data

• To be useful, data mining must be carried out efficiently on large files and databases

• Data mining uses techniques from areas such as machine learning, statistics, neural networks, and genetic algorithms, among others.

Data mining

• We will highlight the nature of the information that is discovered, the types of problems faced in databases and potential applications

• Data mining is related to a broader area called knowledge discovery (see below)

Data mining

• Remember: the goal of a Data Warehouse (DW) is to support decision making with data. Data mining can be used in conjunction with a DW to help with decision-making processes

• It is possible to apply data mining to operational databases (or files) with individual transactions

• However, to make data mining more efficient, a DW could be used, so that we can take advantage of its aggregated collection of data

Data mining

• Data mining helps in extracting meaningful patterns that cannot necessarily be found by merely querying or processing data in the DW

• Data mining requirements should be considered early, during the design of a DW

• Indeed, for very large databases, successful use of data mining will depend first on the construction of the DW

Data mining

• Data mining is a part of the knowledge discovery process

• Knowledge discovery in databases (KDD) typically encompasses more than data mining

KDD

• The KDD process comprises six phases:
– Data cleansing
– Enrichment
– Data transformation and encoding
– Data selection
– Data mining
– Reporting and display of the discovered information

Data integration covers the first three phases: data cleansing, enrichment, and data transformation and encoding.

KDD

[Figure: KDD process diagram with the stages Databases, Data integration (data cleansing, enrichment, data transformation, encoding), Data Warehouse, Selection, Data Mining, Pattern Evaluation, and Knowledge]

KDD: Data integration

• During data cleansing, invalid data can be fixed or removed: for example, zip codes can be corrected and records with wrong phone prefixes eliminated
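The cleansing step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout, the five-digit zip format, and the set of valid phone prefixes (`VALID_PREFIXES`) are hypothetical and do not come from the slides.

```python
import re

# Hypothetical customer records; field names and values are illustrative only.
records = [
    {"name": "Ana", "zip": " 05001", "phone": "604-5550101"},
    {"name": "Luis", "zip": "5001", "phone": "999-5550102"},
    {"name": "Marta", "zip": "05002-", "phone": "604-5550103"},
]

VALID_PREFIXES = {"604"}  # assumed set of valid phone prefixes


def clean_zip(raw: str) -> str:
    """Keep digits only and left-pad to five characters (assumed zip format)."""
    digits = re.sub(r"\D", "", raw)
    return digits.zfill(5)


def cleanse(rows):
    """Fix zip codes and drop rows whose phone prefix is not valid."""
    cleaned = []
    for row in rows:
        prefix = row["phone"].split("-")[0]
        if prefix not in VALID_PREFIXES:
            continue  # eliminate records with a wrong phone prefix
        cleaned.append({**row, "zip": clean_zip(row["zip"])})
    return cleaned


cleaned_records = cleanse(records)
```

Here one record is dropped because of its prefix, and the remaining zip codes are normalized.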

KDD: Data integration

• Enrichment typically enhances the data with additional information from other sources. For example, given the customer names and phone numbers, an organization can get (perhaps buy) other data such as age, income, and credit card rating and then append them to each customer record.
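As a rough sketch, enrichment amounts to a lookup join between the organization's own records and an outside source. The field names, the `external` table, and its values below are hypothetical; matching on the phone number is an assumption made for the example.

```python
# Hypothetical in-house customer records, keyed by phone number.
customers = [
    {"name": "Ana", "phone": "604-5550101"},
    {"name": "Marta", "phone": "604-5550103"},
]

# Hypothetical data obtained (perhaps bought) from another source.
external = {
    "604-5550101": {"age": 34, "income": 42000, "credit_rating": "A"},
}


def enrich(rows, extra, key="phone"):
    """Append the external attributes to each record that has a match."""
    defaults = {"age": None, "income": None, "credit_rating": None}
    return [{**row, **defaults, **extra.get(row[key], {})} for row in rows]


enriched = enrich(customers, external)
```

Records without a match keep the new attributes set to `None`, so every enriched record has the same fields.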

KDD: Data integration

• Data transformation and encoding may be done to reduce the amount of data. For example, product codes may be grouped in terms of product categories. Zip codes may be aggregated into geographic regions, incomes may be divided into ranges, and so on.
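The three reductions mentioned above can be sketched as simple mappings. The product codes, categories, regions, and income cut-offs are all hypothetical values chosen for the example.

```python
# Hypothetical lookup tables; codes, categories, and regions are illustrative.
PRODUCT_CATEGORY = {"P100": "dairy", "P101": "dairy", "P200": "bakery"}
ZIP_REGION = {"05001": "north", "05002": "north", "11001": "center"}


def income_range(income: float) -> str:
    """Map an income to a coarse range (cut-offs are assumed)."""
    if income < 20000:
        return "low"
    if income < 60000:
        return "medium"
    return "high"


def encode(row):
    """Replace detailed values with their coarser groups."""
    return {
        "category": PRODUCT_CATEGORY[row["product_code"]],
        "region": ZIP_REGION[row["zip"]],
        "income_range": income_range(row["income"]),
    }


purchases = [
    {"product_code": "P100", "zip": "05001", "income": 15000},
    {"product_code": "P101", "zip": "05002", "income": 45000},
    {"product_code": "P200", "zip": "11001", "income": 90000},
]
encoded = [encode(p) for p in purchases]
```

After encoding, three distinct product codes collapse into two categories, which is the kind of reduction the slide describes.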

Data mining

• During data selection, data about specific products or categories of specific products, or from stores in a specific region, may be selected

• After such preprocessing, data mining techniques are used to discover rules and patterns
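Data selection can be pictured as a filter applied before mining. The sketch below assumes hypothetical sales records and a small helper, `select`, that is not part of any library.

```python
# Hypothetical sales records after cleansing, enrichment, and encoding.
sales = [
    {"store": "S1", "region": "north", "category": "dairy", "amount": 12.5},
    {"store": "S2", "region": "south", "category": "dairy", "amount": 8.0},
    {"store": "S1", "region": "north", "category": "bakery", "amount": 3.2},
]


def select(rows, **criteria):
    """Keep only the rows whose fields equal every given criterion."""
    return [r for r in rows if all(r[k] == v for k, v in criteria.items())]


# Data about one product category from stores in one region.
north_dairy = select(sales, region="north", category="dairy")
```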

Data mining

• For example, mining could discover:
– Association rules: whenever a customer buys video equipment, he also buys another electronic gadget
– Sequential patterns: a customer who buys a camera usually buys photographic supplies within the next three months and an accessory item within six months. A customer who buys more than twice in the lean periods* may be likely to buy at least once during the Christmas period

* Lean periods: periods of scarcity

Data mining

– Classification trees: customers may be classified by frequency of visits, by types of financing used, by amount of purchase, or by affinity for types of items, and some revealing statistics may be generated for such classes

Data mining

• This information can then be used:
– to plan additional store locations based on demographics
– to run store promotions
– to combine products in advertisements
– to plan seasonal marketing strategies

Goals of data mining and knowledge discovery

• The goals of data mining fall into the following classes:
– Prediction
– Identification
– Classification
– Optimization

Goals of data mining and knowledge discovery

• Prediction: data mining can show how certain attributes within the data will behave in the future. For example, buying transactions can be analyzed to predict what consumers will buy under certain discounts, how much sales volume a store would generate in a given period, and whether deleting a product line would yield more profits

Goals of data mining and knowledge discovery

• Identification: to identify the existence of an item, an event, or an activity: intruders may be identified by the programs executed, files accessed, and CPU time per session; a gene can be identified by certain sequences of nucleotide symbols in the DNA sequence.

Goals of data mining and knowledge discovery

• Classification: data mining can partition the data so that different classes can be identified based on combinations of parameters: customers in a supermarket can be classified into discount-seekers or shoppers in a rush.

Goals of data mining and knowledge discovery

• Optimization: to optimize the use of limited resources such as time, space, money, or materials and to maximize output variables such as sales under a given set of constraints. This goal strongly resembles the objective function in operations research (there is no sharp line separating data mining from this and other related disciplines)

Data mining

• Some types of knowledge discovered during data mining:
– Association rules
– Sequential patterns
– Patterns within time series
– Categorization and segmentation

Data mining

• Association rules*: correlate the presence of a set of items with a range of values for another set of variables: when a female retail shopper buys a handbag, she is likely to buy shoes.

* Later, we will focus on this type of knowledge.

Data mining

• Sequential patterns: a sequence of actions or events is sought: if a patient underwent cardiac bypass surgery and developed high blood urea within a year of surgery, he is likely to suffer from kidney failure within the next year.

• Note that detection of sequential patterns is equivalent to detecting association among events with certain temporal relationships

Data mining

• Patterns within time series: similarities can be detected within positions of a time series: the stocks of a utility (service) company A and a financial company B show the same pattern during a year; two products show the same selling price pattern in summer but a different one in winter.

Data mining

• Categorization and segmentation: a given population of events or items can be partitioned into sets of “similar” elements:
– a population of treatment data may be divided into groups based on similarity of side effects
– a population may be categorized into groups from “most likely to buy” to “least likely to buy”
– web accesses made by users may be analyzed in terms of keywords to reveal clusters of users (Web usage mining)

Association rules

• The database is regarded as a collection of transactions (for example, purchases), each involving a set of items

• A common example is that of market-basket data

• Consider the following example with four transactions:

Association rules

Transaction_id   Items_bought
1                milk, bread, juice
2                milk, juice
3                milk, eggs
4                bread, cookies, coffee

Note: Some important information is not considered, for example, the quantity of each item purchased in each transaction
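To make the note concrete, the sketch below starts from purchase lines that carry a quantity and reduces them to the item sets used for association rules. The items match the table above; the quantities and the helper name `to_item_sets` are made up for illustration.

```python
# Purchase lines as (transaction_id, item, quantity).
# The items match the table above; the quantities are invented for the example.
purchase_lines = [
    (1, "milk", 2), (1, "bread", 1), (1, "juice", 3),
    (2, "milk", 1), (2, "juice", 1),
    (3, "milk", 4), (3, "eggs", 12),
    (4, "bread", 2), (4, "cookies", 1), (4, "coffee", 1),
]


def to_item_sets(lines):
    """Group purchase lines by transaction, keeping only which items appear."""
    item_sets = {}
    for tid, item, _quantity in lines:  # the quantity is deliberately ignored
        item_sets.setdefault(tid, set()).add(item)
    return item_sets


transactions = to_item_sets(purchase_lines)
```

Once the data is in this form, a transaction with twelve eggs and one with a single egg look identical to the mining step.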

Association rules

• Another example: a text document data set, where each document is treated as a set of keywords:

• Doc 1: {student, teach, school}

• Doc 2: {student, school}

• Doc 3: {teach, school, city, game}

• Doc 4: {baseball, basketball}

• Doc 5: {basketball, team, city, game}

Text mining, Web content mining

Association rules

• An association rule is of the form LHS (left-hand side) $\Rightarrow$ RHS (right-hand side):

$X \Rightarrow Y$

where $X = \{x_1, x_2, \ldots, x_n\}$ and $Y = \{y_1, y_2, \ldots, y_m\}$ are sets of items, with $x_i$ and $y_j$ being distinct items for all $i$ and $j$, and $X \cap Y = \emptyset$

• This association states that if a customer buys X, he is also likely to buy Y.

Association rules

• Association rules should include both support (prevalence) and confidence (strength)

• The support for a rule LHS $\Rightarrow$ RHS is the percentage of transactions that hold all the items in the set LHS $\cup$ RHS.

• If the support is low, it implies that there is no overwhelming evidence that the items in LHS $\cup$ RHS occur together.

Association rules: Support examples

• Milk $\Rightarrow$ Juice has 50% support.

• Bread $\Rightarrow$ Juice has 25% support.
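These two figures can be checked against the four example transactions with a small sketch. The function name `support` and the choice to return a percentage are assumptions made for the example.

```python
# The four example transactions, each as a set of items.
transactions = [
    {"milk", "bread", "juice"},
    {"milk", "juice"},
    {"milk", "eggs"},
    {"bread", "cookies", "coffee"},
]


def support(lhs, rhs, db):
    """Percentage of transactions that contain every item of LHS and RHS."""
    items = set(lhs) | set(rhs)
    matching = sum(1 for t in db if items <= t)
    return 100.0 * matching / len(db)


milk_juice = support({"milk"}, {"juice"}, transactions)    # 50.0
bread_juice = support({"bread"}, {"juice"}, transactions)  # 25.0
```

Milk and juice appear together in transactions 1 and 2 (2 of 4), while bread and juice appear together only in transaction 1 (1 of 4).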

Association rules

• To compute confidence, we consider all transactions that include the items in LHS. The confidence for LHS $\Rightarrow$ RHS is the percentage of such transactions that also include RHS.

Association rules: Confidence examples

• Milk $\Rightarrow$ Juice has 66.7% confidence.

• Bread $\Rightarrow$ Juice has 50% confidence.
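The confidence values for the milk/juice and bread/juice rules can be reproduced the same way. As before, this is a minimal sketch: the function name `confidence` and the percentage scale are assumptions.

```python
# The four example transactions, each as a set of items.
transactions = [
    {"milk", "bread", "juice"},
    {"milk", "juice"},
    {"milk", "eggs"},
    {"bread", "cookies", "coffee"},
]


def confidence(lhs, rhs, db):
    """Percentage of the transactions containing LHS that also contain RHS."""
    lhs, rhs = set(lhs), set(rhs)
    with_lhs = [t for t in db if lhs <= t]
    if not with_lhs:
        raise ValueError("no transaction contains the LHS items")
    with_both = sum(1 for t in with_lhs if rhs <= t)
    return 100.0 * with_both / len(with_lhs)


milk_juice = confidence({"milk"}, {"juice"}, transactions)    # 2 of 3, about 66.7
bread_juice = confidence({"bread"}, {"juice"}, transactions)  # 1 of 2, i.e. 50.0
```

Milk occurs in three transactions and juice accompanies it in two of them; bread occurs in two transactions and juice accompanies it in one.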

Association rules

• If $n$ is the number of transactions, then:

$$\text{Support} = \frac{(X \cup Y).\text{count}}{n}$$

$$\text{Confidence} = \frac{(X \cup Y).\text{count}}{X.\text{count}}$$
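Putting the two formulas together, the sketch below lists every rule over the four example transactions that meets a minimum support and a minimum confidence. It is an exhaustive search meant only to illustrate the definitions, not an efficient mining algorithm. The thresholds (0.5 and 0.6, expressed as fractions rather than percentages) and the function names are hypothetical.

```python
from itertools import combinations

transactions = [
    {"milk", "bread", "juice"},
    {"milk", "juice"},
    {"milk", "eggs"},
    {"bread", "cookies", "coffee"},
]


def count(itemset, db):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in db if itemset <= t)


def rules(db, min_support, min_confidence):
    """Exhaustively list rules X => Y that meet both thresholds (as fractions)."""
    n = len(db)
    # Candidate itemsets: every subset of some transaction with 2+ items.
    candidates = set()
    for t in db:
        for size in range(2, len(t) + 1):
            candidates.update(frozenset(c) for c in combinations(sorted(t), size))
    found = []
    for itemset in candidates:
        both = count(itemset, db)  # (X u Y).count
        if both / n < min_support:
            continue
        # Split the itemset into a non-empty X and a non-empty Y = itemset - X.
        for size in range(1, len(itemset)):
            for x in combinations(sorted(itemset), size):
                x = frozenset(x)
                conf = both / count(x, db)  # (X u Y).count / X.count
                if conf >= min_confidence:
                    found.append((x, itemset - x, both / n, conf))
    return found


result = rules(transactions, min_support=0.5, min_confidence=0.6)
```

With these thresholds only the itemset {milk, juice} has enough support, so the search returns the two rules milk ⇒ juice (confidence 2/3) and juice ⇒ milk (confidence 1).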