
Clustering algorithms for extracting information from textual databases

Marina Marino (1), Germana Scepi (1), Cristina Tortora (2)

(1) Università di Napoli Federico II
(2) Stazione Zoologica Anton Dohrn, Napoli

Symposium on Learning and Data Science






BLUE-ETS project

This work is financially supported by the BLUE-ETS European project.

In this context, we are testing algorithms on reports, financial statements and additional notes, using a sample from an Italian company that provides databases of balance sheets for the assessment of creditworthiness.

Being able to draw on both structured and unstructured data will help us define a strategy that remains suitable when numerical data are not available.







Aim and Scope

Aim: to visualize the relationships between terms belonging to different corpora and to identify classes of terms and corpora (firms).

Scope: to compare different two-mode clustering techniques on the management commentaries dataset.











Framework

Two-mode clustering aims at finding:
- a partition of rows R;
- a partition of columns C;
- a partition of R × C obtained by fully crossing the row and column partitions.

Rows and columns of the data matrix are permuted such that all row and column clusters consist of neighboring elements.
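The permutation step can be illustrated with a minimal sketch. It assumes the row and column cluster labels are already known (how they are obtained depends on the method); the function name `reorder_by_clusters` and the toy matrix are hypothetical.

```python
import numpy as np

def reorder_by_clusters(data, row_labels, col_labels):
    """Permute rows and columns so that members of a cluster are adjacent."""
    # A stable sort keeps the original order inside each cluster.
    row_order = np.argsort(row_labels, kind="stable")
    col_order = np.argsort(col_labels, kind="stable")
    return data[np.ix_(row_order, col_order)], row_order, col_order

# Toy 4 x 4 frequency matrix with two row clusters and two column clusters.
data = np.array([
    [5, 0, 4, 0],
    [0, 3, 0, 6],
    [4, 0, 5, 0],
    [0, 7, 0, 2],
])
row_labels = np.array([0, 1, 0, 1])
col_labels = np.array([0, 1, 0, 1])

blocks, row_order, col_order = reorder_by_clusters(data, row_labels, col_labels)
print(blocks)
```

After the permutation, the non-zero cells of the toy matrix sit in two diagonal blocks, which is the kind of block structure the crossed partition R × C describes.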







Outline

1 Hierarchical clustering

2 Co-clustering

3 Two-mode Factor Probabilistic Distance clustering
   PD-clustering
   Factor PD-clustering







Dataset

Among the 406 Italian companies listed in 2009, 25 companies were selected using a sampling technique.

An important part of the official budget of listed companies is the management commentary. The management commentaries of the selected companies were processed using ad hoc techniques; interested readers can refer to (Spano and Triunfo 2012).

The matrix obtained is a frequency matrix (companies × words):
- 25 listed companies on rows;
- 81 words on columns.
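To make the shape of such a matrix concrete, here is a small sketch that counts a fixed vocabulary in a few texts. It does not reproduce the ad hoc preprocessing referenced above; the whitespace tokenization, the function `frequency_matrix`, and the two toy documents are assumptions made purely for illustration.

```python
from collections import Counter

def frequency_matrix(documents, vocabulary):
    """Return one row of term counts per document, columns ordered as `vocabulary`."""
    rows = []
    for text in documents:
        counts = Counter(text.lower().split())  # naive whitespace tokenization
        rows.append([counts[word] for word in vocabulary])
    return rows

# Hypothetical toy corpus: one short text per company.
documents = {
    "firm_a": "revenue growth and revenue risk",
    "firm_b": "risk exposure and market risk",
}
vocabulary = ["revenue", "risk", "market"]

matrix = frequency_matrix(documents.values(), vocabulary)
print(matrix)
```

With 25 commentaries and an 81-word vocabulary, the same construction would give a 25 × 81 table.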












Hierarchical Clustering












Co-clustering (Dhillon et al. 2003)

Aim: simultaneous clustering of the rows and columns of a contingency table, so that row clusters and column clusters are obtained at the same time.

Hypothesis: the contingency table is considered as an empirical joint probability distribution of two discrete random variables that take values over the rows and columns.

Definition: co-clustering classifies both terms and documents simultaneously by considering the 'terms × documents' matrix.











Information-theoretic Co-clustering (Dhillon et al. 2003)

Let us define:
- $Y$: discrete random variable taking values in $y_1, \ldots, y_p$;
- $Z$: discrete random variable taking values in $z_1, \ldots, z_p$;
- $p(Y,Z)$: the joint probability distribution of $Y$ and $Z$, estimated from the observed values;
- $\hat{Y}$: the $K$ disjoint clusters of $Y$;
- $\hat{Z}$: the $H$ disjoint clusters of $Z$.

The partition functions for rows and for columns are allowed to depend upon the entire joint distribution $p(Y,Z)$.







Information-theoretic Co-clustering

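Mutual Information

$$I(Y,Z) = \sum_{Y} \sum_{Z} p(Y,Z) \cdot \log\left( \frac{p(Y,Z)}{p(Y)\,p(Z)} \right)$$

Loss in Mutual Information

$$I(Y,Z) - I(\hat{Y},\hat{Z})$$

where $\hat{Y}$ and $\hat{Z}$ are the clustered versions of $Y$ and $Z$.

The method aims at minimizing the loss in mutual information by maximizing $I(\hat{Y},\hat{Z})$.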
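The two quantities above can be computed directly from a contingency table. The sketch below assumes natural logarithms (the slide does not state the base) and a given pair of row and column partitions; it only evaluates the loss and does not implement the search for the best partitions. The function names and the toy counts are hypothetical.

```python
import numpy as np

def mutual_information(joint):
    """I(Y, Z) for a joint probability table; zero cells contribute nothing."""
    joint = np.asarray(joint, dtype=float)
    p_y = joint.sum(axis=1, keepdims=True)   # row marginals
    p_z = joint.sum(axis=0, keepdims=True)   # column marginals
    expected = p_y * p_z                     # product of marginals
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / expected[mask])))

def aggregate(joint, row_labels, col_labels, n_row_clusters, n_col_clusters):
    """Sum the joint probabilities inside each (row cluster, column cluster) block."""
    clustered = np.zeros((n_row_clusters, n_col_clusters))
    np.add.at(clustered, (row_labels[:, None], col_labels[None, :]), joint)
    return clustered

counts = np.array([
    [5, 4, 0, 0],
    [4, 5, 0, 0],
    [0, 0, 3, 6],
    [0, 0, 7, 2],
])
joint = counts / counts.sum()                # empirical joint distribution
row_labels = np.array([0, 0, 1, 1])
col_labels = np.array([0, 0, 1, 1])

clustered = aggregate(joint, row_labels, col_labels, 2, 2)
mi_full = mutual_information(joint)
mi_clustered = mutual_information(clustered)
loss = mi_full - mi_clustered
print(mi_full, mi_clustered, loss)
```

Because aggregating cells can never increase mutual information, the loss is non-negative, and a good co-clustering is one that keeps it small.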







Co-clustering












PD-clustering in a nutshell (Ben-Israel and Iyigun 2008)

Some essential notation:
- $x_{ij} \in X$: original data matrix, with $i = 1, \ldots, n$ and $j = 1, \ldots, J$;
- $c_{kj} \in C$: cluster centers matrix, with $k = 1, \ldots, K$;
- $p_k(x)$: probability that the unit $x$ belongs to cluster $k$;
- $d_k(x)$: distance of the unit $x$ from the center $c_k$.

PD-clustering is a probabilistic clustering method based on the following assumption, which holds for every cluster $k$:

$$p_k(x)\, d_k(x) = JDF(x)$$

$JDF(x)$, the Joint Distance Function, is a constant depending only on $x$.
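As a short worked derivation: if one additionally requires the probabilities of a unit to sum to one over the $K$ clusters (an assumption not written on the slide, though natural for membership probabilities) and all distances to be strictly positive, the assumption above determines both $p_k(x)$ and $JDF(x)$ in closed form.

$$p_k(x) = \frac{JDF(x)}{d_k(x)}, \qquad \sum_{k=1}^{K} p_k(x) = 1 \;\Longrightarrow\; JDF(x) = \frac{1}{\sum_{m=1}^{K} 1/d_m(x)}, \qquad p_k(x) = \frac{1/d_k(x)}{\sum_{m=1}^{K} 1/d_m(x)}$$

For $K = 2$ this reduces to $p_1(x) = d_2(x) / (d_1(x) + d_2(x))$ and $JDF(x) = d_1(x)\,d_2(x) / (d_1(x) + d_2(x))$. For example, with $d_1 = 1$ and $d_2 = 3$ one gets $p_1 = 0.75$, $p_2 = 0.25$ and $p_1 d_1 = p_2 d_2 = 0.75$: the closer center receives the higher probability.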








Factor PD-clustering consists in finding a transformation of the original data, $x^*_{iq} = \sum_{j=1}^{J} x_{ij} b_{jq}$, where $b_{jq} \in B$ is a weighting system, and cluster centers $c^*_{kq}$ such that the JDF is minimized:

$$\widehat{JDF} = \arg\min_{C;B} \sum_{i=1}^{n} \sum_{q=1}^{Q} \sum_{k=1}^{K} (x^*_{iq} - c^*_{kq})^2 \, p_{ik}^2 .$$

- $x^*_{iq}$ is the generic element of the $1 \times Q$ vector of coordinates of the $i$-th statistical unit;
- $c^*_{kq}$ is the generic element of the $1 \times Q$ vector $c^*_k$, the center of the generic cluster $k$ on the factorial space;
- $p_{ik}$ is the probability that the generic $i$-th statistical unit belongs to the generic cluster $k$.





Matrices B and C are computed by optimizing the same function.

An iterative two-step algorithm is required because B and C cannot be computed simultaneously.
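Both steps of such an alternating scheme evaluate the same criterion, the triple sum of $(x^*_{iq} - c^*_{kq})^2 p_{ik}^2$. The sketch below only evaluates that criterion for given matrices; it assumes the transformation is the matrix product X* = XB and does not implement the update rules for B or C. The function name and the random inputs are hypothetical.

```python
import numpy as np

def fpdc_objective(X, B, C_star, P):
    """Sum over i, q, k of (x*_iq - c*_kq)^2 * p_ik^2."""
    X_star = X @ B                                   # projected units, shape (n, Q)
    diff = X_star[:, None, :] - C_star[None, :, :]   # shape (n, K, Q)
    squared = (diff ** 2).sum(axis=2)                # squared distances, shape (n, K)
    return float((squared * P ** 2).sum())

rng = np.random.default_rng(0)
n, J, Q, K = 6, 4, 2, 3
X = rng.random((n, J))            # data matrix
B = rng.random((J, Q))            # weighting system
C_star = rng.random((K, Q))       # centers on the factorial space
P = rng.random((n, K))
P /= P.sum(axis=1, keepdims=True) # each row of probabilities sums to one

value = fpdc_objective(X, B, C_star, P)
print(value)
```

An alternating procedure would hold B fixed while updating the centers and probabilities, then hold those fixed while updating B, monitoring this value between iterations.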








$g_{iqk} = (x^*_{iq} - c^*_{kq})$ is the general element of $G$.

It has been demonstrated (Tortora 2011) that the value of matrix $B$ that minimizes the JDF can be obtained through a Tucker3 decomposition of the distance matrix $G$:

$$G = U \Lambda (V' \otimes B') + E$$

Core of the method:
- Tucker3 decomposition;
- PD-clustering on the Tucker3 factors;
- iteration of the procedure until convergence.

Number of factors: $Q < J$.
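The matricized form above can be checked numerically on a small three-way array. The sketch below uses a truncated higher-order SVD (HOSVD) as a simple stand-in; it is not the fitting algorithm of the cited work, which is not described here, and a least-squares Tucker3 fit would generally give a somewhat better approximation. The array shape (n, J, K), the chosen ranks, and the function names are assumptions for illustration.

```python
import numpy as np

def unfold(tensor, mode):
    """Matricize a three-way array with the given mode on the rows."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def hosvd(G, ranks):
    """Truncated HOSVD: one orthonormal factor matrix per mode plus a core array."""
    factors = []
    for mode, rank in enumerate(ranks):
        left, _, _ = np.linalg.svd(unfold(G, mode), full_matrices=False)
        factors.append(left[:, :rank])
    U, B, V = factors
    core = np.einsum("ijk,ia,jb,kc->abc", G, U, B, V)
    return U, B, V, core

rng = np.random.default_rng(0)
G = rng.random((6, 4, 3))                   # toy array: n = 6, J = 4, K = 3
U, B, V, core = hosvd(G, ranks=(3, 2, 2))   # Q = 2 factors for the second mode

G_hat = np.einsum("abc,ia,jb,kc->ijk", core, U, B, V)
E = G - G_hat                               # residual term

# Matricized check: unfolded G_hat equals U * Lambda * (V' kron B').
n, J, K = G.shape
G_hat_mat = G_hat.transpose(0, 2, 1).reshape(n, K * J)
core_mat = core.transpose(0, 2, 1).reshape(core.shape[0], -1)
matricized = U @ core_mat @ np.kron(V.T, B.T)
print(np.allclose(G_hat_mat, matricized), B.shape)
```

Here `B` has J rows and Q < J orthonormal columns, so `X @ B` would project J-dimensional units onto Q factors.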







Two-mode Factor Probabilistic Distance Clustering

Generalization of FPDC to two-mode clustering.

Iterative clustering procedure based on the following steps:
1 Random initialization of the clusters of units and computation of the distance matrix;
2 Three-way decomposition of the distance matrix;
3 Projection of units and variables on the factorial space;
4 PD-clustering of variables on the factorial space;
5 Computation of the distance matrix of variables;
6 Three-way decomposition of the distance matrix;
7 Projection of units and variables on the factorial space;
8 PD-clustering of units on the factorial space.

Steps 2 to 8 are iterated until convergence to the solution.
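The PD-clustering building block used in steps 4 and 8 can be sketched on its own. The code below is a plain one-mode PD-clustering loop with Euclidean distances, not the two-mode procedure: probabilities are taken proportional to 1/d_k, which follows from p_k d_k = JDF when the probabilities sum to one, and centers are updated as weighted means with weights p_k^2 / d_k, as I understand the update in Ben-Israel and Iyigun (2008). The function names, the small constant `eps`, the starting centers, and the toy data are assumptions.

```python
import numpy as np

def memberships(X, centers, eps=1e-12):
    """Distances to each center and probabilities proportional to 1 / distance."""
    D = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + eps
    inv = 1.0 / D
    return D, inv / inv.sum(axis=1, keepdims=True)

def pd_clustering(X, centers, n_iter=100, tol=1e-8):
    """Alternate between membership probabilities and weighted center updates."""
    centers = np.asarray(centers, dtype=float).copy()
    for _ in range(n_iter):
        D, P = memberships(X, centers)
        W = P ** 2 / D                                   # weights, shape (n, K)
        new_centers = (W.T @ X) / W.sum(axis=0)[:, None]
        shift = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if shift < tol:
            break
    _, P = memberships(X, centers)                       # probabilities at the final centers
    return centers, P

# Toy data: two well-separated groups of 20 points each.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 0.5, size=(20, 2)),
    rng.normal(10.0, 0.5, size=(20, 2)),
])
centers, P = pd_clustering(X, centers=[[1.0, 1.0], [9.0, 9.0]])
labels = P.argmax(axis=1)
print(centers)
print(labels)
```

In the two-mode setting the same routine would be applied once to the projected variables and once to the projected units, each time on the coordinates given by the three-way decomposition.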







Two-mode FPDC







Conclusions

Hierarchical clustering
- Advantages: finds homogeneous groups of words and firms
- Disadvantages: there is no block structure

Co-clustering
- Advantages: finds a block structure
- Disadvantages: 1 big cluster and 3 very small clusters

Two-mode FPDC
- Advantages: finds a block structure, with a balanced number of elements in each group
- Disadvantages: iterative algorithm, whose convergence must be analytically verified














Bibliography

Balbi S., Miele R., Scepi G. (2012). Clustering of documents from a two-way viewpoint. 10th Int. Conf. on Statistical Analysis of Textual Data.

Ben-Israel A., Iyigun C. (2008). Probabilistic D-clustering. Journal of Classification.

Dhillon I., Mallela S., Modha D. (2003). Information-theoretic co-clustering. Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

Kroonenberg P. M. (2008). Applied multiway data analysis. Wiley Series in Probability and Statistics.

Spano M., Triunfo N. (2012). La relazione sulla gestione delle società italiane quotate sul mercato regolamentato. JADT 2012: 11es Journées internationales d'Analyse statistique des Données Textuelles.

Tortora C. (2011). Non-hierarchical clustering methods on factorial subspaces. PhD thesis, Università di Napoli Federico II.

Tortora C., Gettler Summa M., Palumbo F. (2011). Factorial PD-clustering. Proceedings of the Joint Conference of the German Classification Society.

Van Mechelen I., Bock H., De Boeck P. (2004). Two-mode clustering methods: a structured overview. Statistical Methods in Medical Research, 13(5):363-394.

Vichi M., Kiers H.A.L. (2000). Factorial k-means analysis for two-way data. Computational Statistics and Data Analysis.

