Objectives
At the end of this presentation you will:
- Understand data science and its applications
- Have an overview of machine learning
- Know some types of clustering algorithms
- Be able to implement clustering with R
Data science and its applications
Data science extracts knowledge or insight from data.
Applications range from speech recognition and search engines to health care and the humanities.
These scenarios involve:
- Storing, organizing and integrating huge amounts of unstructured data
- Processing and analyzing data
- Extracting knowledge and insight from data, and predicting the future from it
Processing, analyzing, and extracting knowledge and insight are done through machine learning.
Machine learning
The field of study that gives computers the ability to learn without being explicitly programmed.
It is classified into three broad categories:
- Supervised learning
- Unsupervised learning
- Reinforcement learning*
Machine learning categories
Supervised learning:
- Decision tree learning
- Classification
- …
Unsupervised learning:
- Clustering
- Association rule learning
- …
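Cluster definition
Cluster analysis, or clustering, is the grouping of similar objects together. Each group is called a cluster.
Two kinds of similarity describe a clustering:
- Intra-class similarity: similarity between objects in the same cluster (should be high)
- Inter-class similarity: similarity between objects in different clusters (should be low)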
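The two kinds of similarity can be made concrete with a small sketch. It is written in Python for illustration, uses Euclidean distance as an assumed (inverse) measure of similarity, and runs on made-up toy points; the function names are hypothetical.

```python
from itertools import combinations
from math import dist

def mean_intra_distance(clusters):
    """Average distance between pairs of points that share a cluster."""
    pairs = [dist(p, q) for c in clusters for p, q in combinations(c, 2)]
    return sum(pairs) / len(pairs)

def mean_inter_distance(clusters):
    """Average distance between pairs of points from different clusters."""
    pairs = [dist(p, q)
             for a, b in combinations(clusters, 2)
             for p in a for q in b]
    return sum(pairs) / len(pairs)

# Hypothetical toy data: two well-separated groups of 2-D points.
clusters = [
    [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)],
    [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0)],
]
intra = mean_intra_distance(clusters)
inter = mean_inter_distance(clusters)
print(intra < inter)  # small distance inside clusters, large distance between them
```

A small intra-cluster distance corresponds to high intra-class similarity, and a large inter-cluster distance to low inter-class similarity.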
Clustering scenarios
Clustering is used in the following scenarios:
- Market segmentation
- Summarized news (cluster, then find the centroid)
- City planning
- Image segmentation
Methods of clustering
- Partitioning methods (centroid models)
- Hierarchical methods (connectivity models)
- Density-based methods
- Grid-based methods
- Model-based methods
- Constraint-based methods
Partitioning method
Given a database of n objects, the partitioning method constructs k partitions of the data (k ≤ n) that satisfy the following:
- Each group contains at least one object
- Each object belongs to exactly one group
Points to remember:
- The method creates an initial partitioning
- It then uses an iterative relocation technique to improve the partitioning
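One common partitioning method is k-means, which the slide does not name explicitly. The following is a minimal Python sketch of it, assuming Euclidean distance and taking the first k points as the initial centroids (an arbitrary choice made here only to keep the example deterministic). Note that plain k-means does not strictly guarantee non-empty groups; this sketch simply keeps the old centroid if a group becomes empty.

```python
from math import dist

def kmeans(points, k, max_iter=100):
    # Initial partitioning (assumption): the first k points act as centroids.
    centroids = [points[i] for i in range(k)]
    assignment = None
    for _ in range(max_iter):
        # Relocation step: move every object to the group of its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda j: dist(p, centroids[j])) for p in points
        ]
        if new_assignment == assignment:
            break  # no object changed group, so the partitioning is stable
        assignment = new_assignment
        # Update step: recompute each centroid as the mean of its group.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old centroid if a group becomes empty
                centroids[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return assignment, centroids

# Hypothetical toy data.
points = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0),
          (3.5, 5.0), (4.5, 5.0), (3.5, 4.5)]
assignment, centroids = kmeans(points, k=2)
print(assignment)
print(centroids)
```

Each object ends up in exactly one group because `assignment` stores a single group index per point.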
Density-based methods
Areas of higher density are considered clusters.
Sparse areas are usually considered noise.
These methods use two basic ideas:
- Density reachability
- Density connectivity
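The two ideas can be sketched in Python as follows. The parameters `eps` (neighbourhood radius) and `min_pts` (minimum number of neighbours for a core point) are assumptions borrowed from the usual DBSCAN formulation; they do not appear on the slide, and the function names are hypothetical.

```python
from math import dist

def neighbors(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def is_core(points, i, eps, min_pts):
    """A core point has at least min_pts points in its eps-neighbourhood."""
    return len(neighbors(points, i, eps)) >= min_pts

def density_reachable(points, i, j, eps, min_pts):
    """True if a chain of core points leads from i to j."""
    seen, stack = {i}, [i]
    while stack:
        cur = stack.pop()
        if cur == j:
            return True
        if is_core(points, cur, eps, min_pts):  # only core points extend the chain
            for n in neighbors(points, cur, eps):
                if n not in seen:
                    seen.add(n)
                    stack.append(n)
    return False

def density_connected(points, i, j, eps, min_pts):
    """True if some point o reaches both i and j."""
    return any(
        density_reachable(points, o, i, eps, min_pts)
        and density_reachable(points, o, j, eps, min_pts)
        for o in range(len(points))
    )

# Hypothetical toy data: four points on a line plus one far-away point.
pts = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0)]
print(density_reachable(pts, 1, 3, eps=1.1, min_pts=3))  # True
print(density_reachable(pts, 3, 1, eps=1.1, min_pts=3))  # False
print(density_connected(pts, 0, 3, eps=1.1, min_pts=3))  # True
print(density_connected(pts, 0, 4, eps=1.1, min_pts=3))  # False
```

Reachability is not symmetric (a border point cannot start a chain), while connectivity is, which is why clusters are defined through connectivity.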
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
Advantages:
- Does not require a priori specification of the number of clusters
- Is able to identify noise data while clustering
- Is able to find clusters of arbitrary size and arbitrary shape
Disadvantages:
- Fails on neck-type datasets
- Does not work well on high-dimensional data
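A minimal Python sketch of DBSCAN is given below for illustration. It assumes Euclidean distance and a brute-force neighbourhood search, and the toy points are made up. Note that the number of clusters is never passed in, and that the outlier is labelled as noise.

```python
from math import dist

NOISE = -1

def dbscan(points, eps, min_pts):
    labels = [None] * len(points)  # None means "not visited yet"

    def region(i):
        # Brute-force eps-neighbourhood of points[i], including i itself.
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = region(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        cluster += 1  # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = region(j)
            if len(nbrs) >= min_pts:  # j is a core point: keep growing
                queue.extend(nbrs)
    return labels

# Hypothetical toy data: two dense groups and one isolated point.
data = [(0, 0), (0, 1), (1, 0), (1, 1),
        (8, 8), (8, 9), (9, 8), (9, 9),
        (4, 20)]
labels = dbscan(data, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```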
Grid-based algorithms
- Use a multi-resolution grid data structure
- Clustering complexity depends on the number of grid cells, not on the number of objects
- The object space is divided into a finite number of cells that form a grid structure, on which all of the clustering operations are performed
- Examples: CLIQUE, STING, WaveCluster
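The general idea can be sketched in Python as follows. This is a generic single-resolution illustration for 2-D points, not an implementation of CLIQUE, STING or WaveCluster; `cell_size` and `min_count` are assumed parameters, and the toy points are made up.

```python
from collections import Counter

def grid_clusters(points, cell_size, min_count):
    # Quantize: map every 2-D point to the integer coordinates of its cell.
    cell_of = [tuple(int(x // cell_size) for x in p) for p in points]
    counts = Counter(cell_of)
    dense = {c for c, n in counts.items() if n >= min_count}

    # From here on only cells are visited, not individual objects.
    label, next_label = {}, 0
    for start in sorted(dense):
        if start in label:
            continue
        label[start] = next_label
        stack = [start]
        while stack:  # flood-fill over edge-adjacent dense cells
            cx, cy = stack.pop()
            for nb in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                if nb in dense and nb not in label:
                    label[nb] = next_label
                    stack.append(nb)
        next_label += 1

    # Points in non-dense cells get the label -1.
    return [label.get(c, -1) for c in cell_of]

# Hypothetical toy data.
pts = [(0.1, 0.2), (0.5, 0.5), (0.9, 0.1),  # cell (0, 0)
       (1.2, 0.3), (1.7, 0.8),              # cell (1, 0), adjacent to (0, 0)
       (5.5, 5.5), (5.1, 5.9),              # cell (5, 5)
       (9.5, 0.5)]                          # alone in cell (9, 0)
grid_labels = grid_clusters(pts, cell_size=1.0, min_count=2)
print(grid_labels)  # [0, 0, 0, 0, 0, 1, 1, -1]
```

After the single pass that assigns points to cells, the merging step works on the dense cells only, which is why the cost of clustering depends on the number of cells.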
CLIQUE (CLustering In QUEst)
- CLIQUE is used for clustering high-dimensional data
- High-dimensional data means data with many attributes
- CLIQUE identifies the dense units in subspaces
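The search for dense units can be sketched in Python as follows. This covers only the first step (dense units in 1-D and 2-D subspaces) and is a simplified illustration, not the full algorithm. The parameters `n_intervals` (intervals per attribute) and `tau` (minimum fraction of points in a dense unit), the value range [0, 1), and the toy data are all assumptions.

```python
from collections import Counter
from itertools import combinations

def dense_units(points, n_intervals, tau, lo=0.0, hi=1.0):
    n, dims = len(points), len(points[0])
    width = (hi - lo) / n_intervals

    def interval(x):
        # Index of the interval containing x; hi is clamped into the last one.
        return min(int((x - lo) / width), n_intervals - 1)

    # Dense units in every 1-D subspace (one attribute at a time).
    one_d = {}
    for d in range(dims):
        counts = Counter(interval(p[d]) for p in points)
        one_d[d] = {i for i, c in counts.items() if c / n >= tau}

    # 2-D candidates are built only from dense 1-D units: a 2-D unit
    # cannot be dense unless both of its 1-D projections are dense.
    two_d = {}
    for d1, d2 in combinations(range(dims), 2):
        counts = Counter(
            (interval(p[d1]), interval(p[d2])) for p in points
            if interval(p[d1]) in one_d[d1] and interval(p[d2]) in one_d[d2]
        )
        two_d[(d1, d2)] = {u for u, c in counts.items() if c / n >= tau}
    return one_d, two_d

# Hypothetical toy data: attributes 0 and 1 are concentrated, attribute 2 is spread out.
data = [(0.10, 0.15, 0.05), (0.12, 0.18, 0.55), (0.15, 0.11, 0.95),
        (0.18, 0.13, 0.35), (0.82, 0.85, 0.75), (0.85, 0.65, 0.15)]
one_d, two_d = dense_units(data, n_intervals=5, tau=0.5)
print(one_d)  # {0: {0}, 1: {0}, 2: set()}
print(two_d)  # {(0, 1): {(0, 0)}, (0, 2): set(), (1, 2): set()}
```

In this toy data the only dense 2-D unit lies in the subspace of attributes 0 and 1; attribute 2 contributes no dense unit, so no subspace containing it is explored further.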