Clustering algorithms for extracting information from textual databases
M. Marino, G. Scepi, C. Tortora
Hierarchical clustering
Co-clustering
Two-mode FPDC
PD-clustering
Factor PD-clustering
Clustering algorithms for extracting information from textual databases
Marina Marino (1), Germana Scepi (1), Cristina Tortora (2)
(1) Università di Napoli Federico II
(2) Stazione Zoologica Anton Dohrn, Napoli
Symposium on learning and data science 1
BLUE-ETS project
This work is financially supported by the BLUE-ETS European project.
In this context, we are testing algorithms on reports, financial statements and additional notes for a sample from an Italian company which provides databases of balance sheets for the assessment of creditworthiness.
The possibility of referring to both structured and unstructured data will help us define a strategy suitable for cases where numerical data are not available.
Aim and Scope
Aim: to visualize the relationships between terms belonging to different corpora and to identify classes of terms and corpora (firms).
Scope: to compare different two-mode clustering techniques on the management commentaries dataset.
Framework
Two-mode clustering aims at finding:
- a partition of the rows, R;
- a partition of the columns, C;
- a partition of R × C obtained by fully crossing the row and column partitions.
The rows and columns of the data matrix are permuted so that every row cluster and every column cluster consists of neighboring elements.
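The permutation step can be illustrated with a minimal sketch. The toy matrix, the cluster labels and the helper names `block_order` and `permute` are assumptions made for illustration; they are not taken from the slides.

```python
def block_order(labels):
    """Return indices sorted so that members of the same cluster are adjacent."""
    return sorted(range(len(labels)), key=lambda i: (labels[i], i))


def permute(matrix, row_labels, col_labels):
    """Reorder rows and columns so that each cluster forms a contiguous block."""
    rows = block_order(row_labels)
    cols = block_order(col_labels)
    blocked = [[matrix[i][j] for j in cols] for i in rows]
    return blocked, rows, cols


# Hypothetical 3 x 4 data matrix with a hidden block structure.
matrix = [
    [5, 0, 4, 0],
    [0, 3, 0, 2],
    [6, 0, 5, 0],
]
row_labels = [0, 1, 0]      # assumed partition of the rows
col_labels = [0, 1, 0, 1]   # assumed partition of the columns

blocked, rows, cols = permute(matrix, row_labels, col_labels)
for row in blocked:
    print(row)
```

After the permutation, the crossing of row cluster 0 with column cluster 0 and of row cluster 1 with column cluster 1 appear as dense rectangular blocks.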
Outline
1 Hierarchical clustering
2 Co-clustering
3 Two-mode Factor Probabilistic Distance clustering
   PD-clustering
   Factor PD-clustering
Dataset
Among the 406 Italian companies listed in 2009, 25 were selected using a sampling technique.
The management commentary is an important part of the official financial statements of listed companies. The management commentaries of the selected companies were processed using ad hoc techniques; interested readers can refer to (Spano and Triunfo 2012).
The resulting matrix is a frequency matrix (companies x words):
- 25 listed companies on the rows;
- 81 words on the columns.
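As a rough illustration of what a companies x words frequency matrix looks like, the sketch below counts a fixed vocabulary in a few toy texts. The texts, the vocabulary and the function `frequency_matrix` are invented for the example; the actual preprocessing is the one described in (Spano and Triunfo 2012) and is not reproduced here.

```python
from collections import Counter


def frequency_matrix(documents, vocabulary):
    """Count how often each vocabulary word occurs in each document."""
    counts = [Counter(text.lower().split()) for text in documents]
    return [[c[word] for word in vocabulary] for c in counts]


# Hypothetical management commentaries, one per company.
commentaries = {
    "firm_a": "revenue growth revenue risk",
    "firm_b": "risk debt risk",
}
vocabulary = ["debt", "growth", "revenue", "risk"]  # assumed selected words

matrix = frequency_matrix(list(commentaries.values()), vocabulary)
for name, row in zip(commentaries, matrix):
    print(name, row)
```

Each row corresponds to a company and each column to a word, which is the layout used by all the methods compared in the following slides.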
Hierarchical Clustering
Co-clustering (Dhillon et al. 2003)
Aim: simultaneous clustering of the rows and columns of a contingency table, yielding row clusters and column clusters at the same time.
Hypothesis: the contingency table is regarded as an empirical joint probability distribution of two discrete random variables that take values over the rows and the columns.
Definition: co-clustering classifies terms and documents simultaneously by working on the 'terms x documents' matrix.
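The hypothesis can be made concrete with a small sketch: dividing every cell of a contingency table by the grand total gives an empirical joint distribution, and summing over rows or columns gives the marginals. The table values and the function names are assumptions chosen for illustration.

```python
def joint_distribution(table):
    """Turn a table of counts into an empirical joint distribution."""
    total = sum(sum(row) for row in table)
    return [[cell / total for cell in row] for row in table]


def marginals(joint):
    """Return the row marginal and the column marginal of a joint distribution."""
    p_rows = [sum(row) for row in joint]
    p_cols = [sum(col) for col in zip(*joint)]
    return p_rows, p_cols


# Hypothetical 2 x 3 contingency table (counts).
table = [
    [2, 0, 2],
    [0, 4, 0],
]
joint = joint_distribution(table)
p_rows, p_cols = marginals(joint)
print(joint, p_rows, p_cols)
```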
Information-theoretic Co-clustering (Dhillon et al. 2003)
Let us define:
- $Y$: a discrete random variable taking values in $y_1, \dots, y_p$;
- $Z$: a discrete random variable taking values in $z_1, \dots, z_p$;
- $p(Y, Z)$: the joint probability distribution of $Y$ and $Z$, estimated from the observed values;
- $\hat{Y}$: $K$ disjoint clusters of $Y$;
- $\hat{Z}$: $H$ disjoint clusters of $Z$.
The partition functions for rows and for columns are allowed to depend on the entire joint distribution $p(Y, Z)$.
Information-theoretic Co-clustering
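Mutual Information
$$I(Y, Z) = \sum_{Y} \sum_{Z} p(Y, Z) \cdot \log\left(\frac{p(Y, Z)}{p(Y)\,p(Z)}\right)$$
Loss in Mutual Information
$$I(Y, Z) - I(\hat{Y}, \hat{Z})$$
where $\hat{Y}$ and $\hat{Z}$ denote the clustered versions of $Y$ and $Z$.
The method aims at minimizing the loss in mutual information by maximizing $I(\hat{Y}, \hat{Z})$.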
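The two quantities above can be computed directly from a joint distribution. The following is a minimal sketch, not the co-clustering algorithm of Dhillon et al.: it only evaluates the mutual information before and after aggregating rows and columns into given clusters. The joint distribution, the cluster labels and the function names are assumptions made for the example.

```python
import math


def mutual_information(joint):
    """I = sum over cells of p * log(p / (p_row * p_col)), skipping empty cells."""
    p_rows = [sum(row) for row in joint]
    p_cols = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:
                mi += p * math.log(p / (p_rows[i] * p_cols[j]))
    return mi


def aggregate(joint, row_labels, col_labels, n_row_clusters, n_col_clusters):
    """Sum the joint probabilities inside each (row cluster, column cluster) block."""
    agg = [[0.0] * n_col_clusters for _ in range(n_row_clusters)]
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            agg[row_labels[i]][col_labels[j]] += p
    return agg


# Hypothetical joint distribution p(Y, Z) with 3 rows and 4 columns.
joint = [
    [0.25, 0.15, 0.00, 0.00],
    [0.05, 0.15, 0.00, 0.00],
    [0.00, 0.00, 0.15, 0.25],
]
row_labels = [0, 0, 1]       # assumed row clusters (K = 2)
col_labels = [0, 0, 1, 1]    # assumed column clusters (H = 2)

i_original = mutual_information(joint)
i_clustered = mutual_information(aggregate(joint, row_labels, col_labels, 2, 2))
loss = i_original - i_clustered
print(i_original, i_clustered, loss)
```

The loss is never negative, since aggregating cells cannot create information; a co-clustering with a small loss preserves most of the association between rows and columns.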
Co-clustering
PD-clustering in a nutshell (Ben-Israel and Iyigun 2008)
Some essential notation:
- $x_{ij} \in X$: original data matrix, with $i = 1, \dots, n$ and $j = 1, \dots, J$;
- $c_{kj} \in C$: matrix of cluster centers, with $k = 1, \dots, K$;
- $p_k(x)$: probability that a unit $x$ (a row of $X$) belongs to cluster $k$;
- $d_k(x)$: distance of the unit $x$ from the center $c_k$.
PD-clustering is a probabilistic clustering method based on the following assumption:
$$p_k(x)\, d_k(x) = JDF(x)$$
$JDF(x)$, the Joint Distance Function, is a constant depending only on $x$, that is, it takes the same value for every cluster $k$.
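The assumption, combined with the requirement that the probabilities of one unit sum to 1, determines the probabilities once the distances are known: $p_k(x)$ is proportional to $1/d_k(x)$. The sketch below shows this for a single unit. The distances and the function name `pd_probabilities` are illustrative assumptions, and the handling of zero distances (equal split among the coinciding centers) is a convention chosen here, not something stated in the slides.

```python
def pd_probabilities(distances):
    """Membership probabilities with p_k * d_k constant over k and sum(p_k) == 1."""
    if any(d == 0 for d in distances):
        # Assumed convention: a unit lying on a center belongs to it with
        # probability 1 (split equally if it lies on several centers).
        zeros = [1.0 if d == 0 else 0.0 for d in distances]
        return [z / sum(zeros) for z in zeros]
    inverse = [1.0 / d for d in distances]
    total = sum(inverse)
    return [w / total for w in inverse]


# Hypothetical distances of one unit from K = 3 cluster centers.
distances = [1.0, 2.0, 4.0]
probs = pd_probabilities(distances)
jdf = [p * d for p, d in zip(probs, distances)]
print(probs)
print(jdf)  # the same value for every cluster
```

The closer a center is, the higher the probability of belonging to its cluster, and the product of probability and distance is identical across clusters.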
Factor PD-clustering
Factor PD-clustering consists in finding a transformation of the original data, $x^*_{iq} = \sum_{j=1}^{J} x_{ij} b_{jq}$, where $b_{jq} \in B$ is a weighting system, and cluster centers $c^*_{kq}$ such that the JDF is minimized:
$$\widehat{JDF} = \arg\min_{C;B} \sum_{i=1}^{n} \sum_{q=1}^{Q} \sum_{k=1}^{K} (x^*_{iq} - c^*_{kq})^2\, p_{ik}^2 .$$
- $x^*_{iq}$ is the generic element of the $1 \times Q$ vector of the coordinates of the $i$-th statistical unit;
- $c^*_{kq}$ is the generic element of the $1 \times Q$ vector $c^*_k$, the center of the generic cluster $k$ in the factorial space;
- $p_{ik}$ is the probability that the $i$-th statistical unit belongs to the generic cluster $k$.
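To make the criterion concrete, the sketch below projects a small data matrix with a given weighting matrix and evaluates the triple sum for given centers and probabilities. It only evaluates the objective; it does not perform the minimization over C and B. All numerical values and the function names are assumptions made for illustration.

```python
def project(X, B):
    """x*_iq = sum_j x_ij * b_jq, i.e. the matrix product X B."""
    return [[sum(x * b for x, b in zip(row, col)) for col in zip(*B)] for row in X]


def jdf_criterion(X_star, C_star, P):
    """Sum over units i, factors q and clusters k of (x*_iq - c*_kq)^2 * p_ik^2."""
    total = 0.0
    for x_row, p_row in zip(X_star, P):
        for c_row, p in zip(C_star, p_row):
            total += sum((x - c) ** 2 for x, c in zip(x_row, c_row)) * p ** 2
    return total


# Hypothetical sizes: n = 2 units, J = 3 variables, Q = 2 factors, K = 2 clusters.
X = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 1.0]]
B = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]
C_star = [[3.0, 2.0],
          [1.0, 1.0]]
P = [[1.0, 0.0],
     [0.5, 0.5]]

X_star = project(X, B)
value = jdf_criterion(X_star, C_star, P)
print(X_star, value)
```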
Matrices B and C are obtained by optimizing the same function.
An iterative two-step algorithm is required because B and C cannot be computed simultaneously.
Factor PD-clustering
$g_{iqk} = (x^*_{iq} - c^*_{kq})$ is the general element of $G$.
It has been demonstrated (Tortora 2011) that the matrix $B$ that minimizes the JDF can be obtained through a Tucker3 decomposition of the distance matrix $G$:
$$G = U \Lambda (V' \otimes B') + E$$
Core of the method:
- Tucker3 decomposition;
- PD-clustering on the Tucker3 factors;
- iteration of the procedure until convergence.
Number of factors: $Q < J$
Two-mode Factor Probabilistic Distance Clustering
Generalization of FPDC to two-mode clustering.
Iterative clustering procedure based on the following steps:
1 Random initialization of the clusters of units and computation of the distance matrix;
2 Three-way decomposition of the distance matrix;
3 Projection of units and variables on the factorial space;
4 PD-clustering of the variables on the factorial space;
5 Computation of the distance matrix of the variables;
6 Three-way decomposition of the distance matrix;
7 Projection of units and variables on the factorial space;
8 PD-clustering of the units on the factorial space.
Steps 2 to 8 are iterated until convergence to the solution.
Two-mode FPDC
Conclusions
Hierarchical clustering
- Advantages: finds homogeneous groups of words and firms
- Disadvantages: there is no block structure
Co-clustering
- Advantages: finds a block structure
- Disadvantages: 1 big cluster and 3 very small clusters
Two-mode FPDC
- Advantages: finds a block structure, with a balanced number of elements in each group
- Disadvantages: iterative algorithm; convergence must be analytically verified
Bibliography
Balbi, S., Miele, R. and Scepi, G. (2012). Clustering of documents from a two-way viewpoint. 10th Int. Conf. on Statistical Analysis of Textual Data.
Ben-Israel, A. and Iyigun, C. (2008). Probabilistic D-clustering. Journal of Classification.
Dhillon, I., Mallela, S. and Modha, D. (2003). Information-theoretic co-clustering. Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
Kroonenberg, P. M. (2008). Applied multiway data analysis. Wiley Series in Probability and Statistics.
Spano, M. and Triunfo, N. (2012). La relazione sulla gestione delle società italiane quotate sul mercato regolamentato. JADT 2012: 11es Journées internationales d'Analyse statistique des Données Textuelles.
Tortora, C. (2011). Non-hierarchical clustering methods on factorial subspaces. PhD thesis, Università di Napoli Federico II.
Tortora, C., Gettler Summa, M. and Palumbo, F. (2011). Factorial PD-clustering. Proceedings of the Joint Conference of the German Classification Society.
Van Mechelen, I., Bock, H. and Boeck, P. D. (2004). Two-mode clustering methods: a structured overview. Statistical Methods in Medical Research, 13(5):363-394.
Vichi, M. and Kiers, H. A. L. (2000). Factorial k-means analysis for two-way data. Computational Statistics and Data Analysis.