Mineração de Dados Aplicada
The Pattern Discovery Process
Loïc Cerf
August 7th, 2017
DCC – ICEx – UFMG
Practical matters
hello, world
I am:
Loïc Cerf;
French;
Still learning Portuguese;
Your teacher for this course;
A free software advocate.
The Web page of the course, which hosts these slides, is http://dcc.ufmg.br/~lcerf/pt/mda.html.
Loïc Cerf, Mineração de Dados Aplicada
Organization of the course
During the sessions, there will be:
“Theory”;
Practice;
Your (inter)active involvement.
Following the ICEx rule, a student who misses more than 25% of the course fails it.
The subject is elective but, to pass it, the work is compulsory!
“Theory”
The “theory” will:
mainly give “big pictures”;
decrease (in volume) over the sessions;
be adapted to your practical needs.
Practice
The practice will:
be done within a data mining platform;
consist of a few practical exercises;
mainly consist of projects in groups of three students, with a different, freely chosen dataset for each group.
Assessment
The final mark will be based on:
an exam on the POSIX text-processing commands (20 points);
random questions on the content of the course (20 points);
a non-trivial step of the project (10 points);
the rest of the project (40 points);
the clarity of a 12-page report (5 points);
the clarity of a 20-minute presentation (5 points).
Individual adjustments based on:
questions about the project (in particular, justifications);
the help brought to other groups.
Collaboration
Collaboration is good.
Every group will regularly present its current progress (five minutes). The other students are invited to help with remarks and suggestions.
A little bit of psychology
You (and I) have a natural tendency to prefer easy activities with short-term rewards (watching blockbusters vs. watching documentaries, writing tweets vs. writing an article, eating candies vs. eating healthy food, playing video games vs. doing homework, etc.).
The solution is not in time management but in “metacognition”. To succeed in a project (in life?), the most important factor may not be intelligence but resistance to immediate desires.
Imposing regular work trains this ability and, for sure, the resulting work will be better.
Outline
1 Many perspectives on data mining
2 The pattern discovery process
Many perspectives on data mining
Epistemological Perspective
Knowledge passes through patterns;
It is acquired from data;
It is assessed by quality measures.
Data mining is a form of empiricism. Note, however, that it is necessarily biased (choice of a type of model/pattern, of a quality measure to maximize, etc.).
Data Mining Perspective
From data to databases to data warehouses to patterns.
Knowledge arises from the organization of the data.
Typical data-mining task
Local pattern discovery: enumerating subsets of a dataset that stand out from the rest of it.
Inductive Databases
Querying data: {d ∈ D | q(d, D)}
where:
D is a dataset (tuples),
q is a query.
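As a minimal sketch of this notation, the set {d ∈ D | q(d, D)} can be written as a Python comprehension. The dataset and the query below are hypothetical, chosen only to show why q receives the whole dataset D in addition to the tuple d: here the answer for one tuple depends on an aggregate computed over all of them.

```python
# Hypothetical dataset D: each tuple is (name, age).
D = [("ana", 31), ("bruno", 19), ("carla", 45), ("davi", 19)]

def q(d, D):
    """Hypothetical query: is d strictly older than the average age in D?"""
    mean_age = sum(age for _, age in D) / len(D)
    return d[1] > mean_age

# Direct transcription of {d in D | q(d, D)}.
answer = [d for d in D if q(d, D)]
print(answer)
```

The answer is a subset of the data itself: querying data never returns anything that was not already stored in D.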
Querying patterns: {p ∈ P | Q(p, D)}
where:
P is the pattern space,
D is the dataset,
Q is an inductive query.
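The same notation can be sketched for patterns. The example below assumes, purely for illustration, that patterns are itemsets and that the inductive query Q keeps those contained in at least two transactions; the dataset, the `support` helper and the `min_support` threshold are all hypothetical and not taken from the course.

```python
from itertools import combinations

# Hypothetical dataset D: each transaction is a set of items.
D = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

# Pattern space P: every non-empty itemset over the items seen in D.
items = sorted(set().union(*D))
P = [frozenset(c) for k in range(1, len(items) + 1)
     for c in combinations(items, k)]

def support(p, D):
    """Number of transactions of D that contain every item of p."""
    return sum(1 for t in D if p <= t)

def Q(p, D, min_support=2):
    """Hypothetical inductive query: p is frequent in D."""
    return support(p, D) >= min_support

# Direct transcription of {p in P | Q(p, D)}.
frequent = [p for p in P if Q(p, D)]
for p in frequent:
    print(sorted(p), support(p, D))
```

Unlike a data query, the answer ranges over P, not over D. Enumerating P exhaustively, as done here, grows exponentially with the number of items, so this sketch only suits toy datasets; practical miners prune the pattern space instead.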
Machine Learning Perspective
Patterns are an abstraction of the data. They are information an algorithm can learn from.
Machine learning focuses on the different ways to learn from data. It is the artificial intelligence side of data mining. It has strong ties with computational statistics and mathematical optimization.
Typical machine-learning task
Supervised classification: learning, from the descriptions of classified objects, what characterizes every class.
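One very simple way to illustrate supervised classification is a nearest-centroid classifier: every class is characterized by the mean of the descriptions of its members, and a new object receives the class of the closest mean. This is only a sketch under assumed data (2-D numeric descriptions, two hypothetical classes); it is one technique among many, not the one prescribed by the course.

```python
from collections import defaultdict

# Hypothetical training set: (description, class); descriptions are 2-D points.
training = [
    ((1.0, 1.0), "small"),
    ((1.5, 2.0), "small"),
    ((8.0, 9.0), "large"),
    ((9.0, 8.0), "large"),
]

def learn_centroids(examples):
    """Characterize every class by the mean of its members' descriptions."""
    groups = defaultdict(list)
    for description, label in examples:
        groups[label].append(description)
    return {
        label: tuple(sum(coords) / len(points) for coords in zip(*points))
        for label, points in groups.items()
    }

def classify(description, centroids):
    """Predict the class whose centroid is closest (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: sq_dist(description, centroids[label]))

centroids = learn_centroids(training)
print(centroids)
print(classify((2.0, 1.0), centroids))
```

The learned centroids play the role of the patterns: they abstract the classified objects and are all the classifier needs to label new ones.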
Computational Statistics Perspective
Patterns are statistics that summarize the data.
Computational statistics aims to design efficient algorithms that implement statistical methods.
Typical computational statistics task
Representative-based clustering: summarizing data as a mixture of distributions.
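As a rough illustration of representative-based clustering, the sketch below runs k-means (Lloyd's algorithm) on 1-D points. k-means is swapped in here because it is the simplest representative-based method: each cluster is summarized by a single center. A genuine mixture of distributions would usually be fitted with expectation-maximization instead. The points, the initial centers and the iteration count are assumptions made for the example.

```python
def k_means(points, centers, iterations=10):
    """Lloyd's algorithm on 1-D points, starting from the given centers."""
    for _ in range(iterations):
        # Assignment step: attach every point to its closest center.
        clusters = [[] for _ in centers]
        for x in points:
            closest = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
            clusters[closest].append(x)
        # Update step: move every center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers, clusters = k_means(points, centers=[0.0, 5.0])
print(centers)
print(clusters)
```

The two final centers summarize the six points, which is the sense in which patterns are statistics that summarize the data.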
Big Data Perspective
Thanks to ever-greater computing capabilities, large datasets can be stored. Big data analytics is about understanding these large volumes of data (which usually requires parallel computing).
Bioinformatics
Genomes are now easily sequenced. The current challenge is to understand the expression mechanism (from genomics to proteomics to phenotypes). DNA chips give the expression levels of tens of thousands of genes in different samples.
Data mining methods are designed with time and space complexities in mind.
Data Science Perspective
Data science takes an application perspective. It encompasses everything that can be done on data from a specific field, hence machine learning, computational statistics and data mining. It emphasizes the necessity to understand the application domain.
Business Intelligence Perspective
Patterns allow a better understanding of the “activity” of a company, hence better decisions by its managers.
Besides data mining, business intelligence emphasizes heterogeneous data, reporting, visualization, etc.
Big Brother Perspective
Knowing everything about everyone is a new business...
... and a political threat. The mere collection of large amounts of personal data in centralized repositories is unethical: these data will eventually be misused. Data anonymization techniques can be employed.
Be responsible!
Buzzwords
Data mining is also known as information discovery, knowledge discovery, Knowledge Discovery in Databases (KDD), data analytics, cognitive computing, etc.
Absence of a Unified Theory
Data mining is a collection of tasks. Each of them can be solved by various techniques. These techniques are applicable to particular types of data only.
There is no unified theory of data mining.
The pattern discovery process
The naive view of pattern discovery
[Figure: Raw data → Extraction → Patterns]
© 2005 Tim Morgan (from flickr®)
These icons are licensed under the Creative Commons Attribution 2.0 License.
The pattern discovery process
[Figure: Raw data → Pre-process → Data → Extraction → Patterns]
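The chain from raw data to patterns can be sketched as two functions applied in sequence. Everything below is hypothetical: the raw lines, the cleaning rules in `pre_process` and the frequency threshold in `extract` are assumptions chosen to show why the pre-processing step cannot be skipped, not a description of any specific tool used in the course.

```python
from collections import Counter

# Hypothetical raw data: free-text lines with inconsistent case and spacing.
raw_data = [
    "  Bread, Milk ",
    "bread,BUTTER",
    "",
    "Milk , bread, butter",
]

def pre_process(lines):
    """Turn raw lines into clean transactions (sets of lower-case items)."""
    data = []
    for line in lines:
        items = {item.strip().lower() for item in line.split(",") if item.strip()}
        if items:  # Drop empty records.
            data.append(items)
    return data

def extract(data, min_support=2):
    """Keep the items that occur in at least min_support transactions."""
    counts = Counter(item for transaction in data for item in transaction)
    return {item: n for item, n in counts.items() if n >= min_support}

data = pre_process(raw_data)   # Raw data -> Data
patterns = extract(data)       # Data -> Patterns
print(patterns)
```

Running `extract` directly on the raw lines would treat "Bread" and "bread" as different items and count the empty record, which is the gap between the naive view and the actual process.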
License
© 2011–2017 Loïc Cerf
These slides are licensed under the Creative Commons Attribution-ShareAlike 4.0 International License.