Clustering algorithms for extracting information from textual databases

Marina Marino (1), Germana Scepi (1), Cristina Tortora (2)

(1) Università di Napoli Federico II
(2) Stazione Zoologica Anton Dohrn, Napoli

Symposium on learning and data science


BLUE-ETS project

This work is financially supported by the BLUE-ETS European project.

In this context, we are testing algorithms on reports, financial statements and additional notes on a sample of Italian company which provides databases of balance sheets for the assessment of credit worthiness.

The possibility of referring to both structured and unstructured data will help us define a proper strategy for cases where numerical data are not available.


Aim and Scope

Aim: To visualize the relationships between terms belonging to different corpora and to identify classes of terms and corpora (firms).

Scope: To compare different two-mode clustering techniques on the management commentaries dataset.


Framework

Two-mode clustering aims at finding:
- a partition of rows R
- a partition of columns C
- a partition of R × C obtained by fully crossing row and column partitioning

Rows and columns of the data matrix are permuted such that all row and column clusters consist of neighboring elements.
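The permutation step can be illustrated with a minimal sketch. It assumes the row and column partitions are already known and given as integer labels; the helper names `block_order` and `permute`, and the toy matrix, are illustrative and not part of the original slides.

```python
def block_order(labels):
    """Indices sorted so that members of the same cluster become adjacent."""
    return sorted(range(len(labels)), key=lambda i: (labels[i], i))


def permute(matrix, row_labels, col_labels):
    """Reorder rows and columns so every (row cluster, column cluster) block is contiguous."""
    rows = block_order(row_labels)
    cols = block_order(col_labels)
    return [[matrix[r][c] for c in cols] for r in rows]


# Toy 4 x 4 frequency matrix (hypothetical values) with interleaved clusters.
X = [
    [5, 0, 4, 0],
    [0, 3, 0, 2],
    [6, 0, 5, 0],
    [0, 2, 0, 3],
]
row_labels = [0, 1, 0, 1]  # assumed row partition R
col_labels = [0, 1, 0, 1]  # assumed column partition C

blocked = permute(X, row_labels, col_labels)
for row in blocked:
    print(row)
```

After the permutation, the non-zero entries sit in two diagonal blocks, which is the block structure a two-mode method is looking for.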


Outline

1 Hierarchical clustering

2 Co-clustering

3 Two-mode Factor Probabilistic Distance clustering
   PD-clustering
   Factor PD-clustering


Dataset

Among the 406 Italian companies listed in 2009, 25 companies have been selected using a sampling technique.

An important part of the official budget of listed companies is the management commentary. The management commentaries of the selected companies have been processed using ad hoc techniques; interested readers can refer to (Spano and Triunfo 2012).

The matrix obtained is a frequency matrix (companies x words):
- 25 listed companies on rows;
- 81 words on columns.
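As a rough illustration of what a companies x words frequency matrix looks like, the sketch below counts a fixed vocabulary in already tokenized texts. It is not the ad hoc processing referred to above: the function name `frequency_matrix`, the firm names and the tokens are all hypothetical.

```python
from collections import Counter


def frequency_matrix(docs, vocabulary):
    """Build a (documents x words) count matrix.

    docs: dict mapping a document name to its list of tokens.
    vocabulary: ordered list of the words kept as columns.
    """
    names = sorted(docs)
    matrix = []
    for name in names:
        counts = Counter(docs[name])
        # Counter returns 0 for words that never occur in the document.
        matrix.append([counts[word] for word in vocabulary])
    return names, matrix


# Hypothetical toy corpus: two firms, four vocabulary words.
docs = {
    "firm_a": ["risk", "growth", "risk", "market"],
    "firm_b": ["market", "market", "debt"],
}
vocabulary = ["debt", "growth", "market", "risk"]

names, F = frequency_matrix(docs, vocabulary)
print(names)
print(F)
```

With 25 firms and 81 retained words, the same construction would give the 25 x 81 matrix described on the slide.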


Hierarchical Clustering


Co-clustering (Dhillon et al. 2003)

Aim: simultaneous clustering of the rows and columns of a contingency table, yielding row clusters and column clusters at the same time.

Hypothesis: the contingency table is considered as an empirical joint probability distribution of two discrete random variables that take values over the rows and columns.

Definition: Co-clustering classifies both terms and documents simultaneously by considering the 'terms x documents' matrix.
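The hypothesis can be made concrete with a short sketch: dividing each count by the grand total turns the contingency table into an empirical joint distribution, and row and column sums give the marginals. The function names and the toy table are assumptions made for illustration.

```python
def joint_distribution(table):
    """Normalize a contingency table so that its entries sum to 1."""
    total = sum(sum(row) for row in table)
    return [[value / total for value in row] for row in table]


def marginals(p):
    """Row and column marginals of a joint distribution given as a list of rows."""
    p_rows = [sum(row) for row in p]
    p_cols = [sum(col) for col in zip(*p)]
    return p_rows, p_cols


# Hypothetical 2 x 3 contingency table (rows x columns).
table = [
    [2, 0, 2],
    [0, 4, 0],
]
p = joint_distribution(table)
p_rows, p_cols = marginals(p)
print(p)
print(p_rows, p_cols)
```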


Information-theoretic Co-clustering (Dhillon et al. 2003)

Let us define:
- $Y$: discrete random variable taking values in $y_1, \dots, y_p$
- $Z$: discrete random variable taking values in $z_1, \dots, z_p$
- $p(Y, Z)$: the joint probability distribution of $Y$ and $Z$, estimated from the observed values
- $\hat{Y}$: $K$ disjoint clusters of $Y$
- $\hat{Z}$: $H$ disjoint clusters of $Z$

The partition functions for rows and for columns are allowed to depend upon the entire joint distribution $p(Y, Z)$.


Information-theoretic Co-clustering

Mutual Information

$$I(Y, Z) = \sum_{Y} \sum_{Z} p(Y, Z) \cdot \log\left( \frac{p(Y, Z)}{p(Y)\, p(Z)} \right)$$

Loss in Mutual Information

$$I(Y, Z) - I(\hat{Y}, \hat{Z})$$

The method aims at minimizing the loss in mutual information by maximizing $I(\hat{Y}, \hat{Z})$.
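The two quantities above can be evaluated directly for a fixed pair of partitions. The sketch below only scores a given co-clustering; it is not the iterative algorithm of Dhillon et al. The helper names, the natural logarithm, and the toy joint distribution are assumptions.

```python
import math


def mutual_information(p):
    """I(Y, Z) for a joint distribution given as a list of rows (natural log)."""
    p_rows = [sum(row) for row in p]
    p_cols = [sum(col) for col in zip(*p)]
    mi = 0.0
    for i, row in enumerate(p):
        for j, value in enumerate(row):
            if value > 0:  # zero cells contribute nothing to the sum
                mi += value * math.log(value / (p_rows[i] * p_cols[j]))
    return mi


def aggregate(p, row_labels, col_labels, K, H):
    """Joint distribution of the clustered variables: sum p over each block."""
    q = [[0.0] * H for _ in range(K)]
    for i, row in enumerate(p):
        for j, value in enumerate(row):
            q[row_labels[i]][col_labels[j]] += value
    return q


def information_loss(p, row_labels, col_labels, K, H):
    """I(Y, Z) minus the mutual information of the clustered variables."""
    q = aggregate(p, row_labels, col_labels, K, H)
    return mutual_information(p) - mutual_information(q)


# Toy joint distribution with two clear diagonal blocks.
p = [
    [0.125, 0.125, 0.0, 0.0],
    [0.125, 0.125, 0.0, 0.0],
    [0.0, 0.0, 0.125, 0.125],
    [0.0, 0.0, 0.125, 0.125],
]
loss_good = information_loss(p, [0, 0, 1, 1], [0, 0, 1, 1], 2, 2)
loss_bad = information_loss(p, [0, 1, 0, 1], [0, 1, 0, 1], 2, 2)
print(loss_good, loss_bad)
```

The partition that follows the block structure loses no information, while the one that mixes the blocks loses all of it, which is why maximizing the mutual information of the clustered variables favors block-shaped solutions.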


Co-clustering


PD-clustering in a nutshell (Ben-Israel and Iyigun 2008)

Some essential notation:
- $x_{ij} \in X$: original data matrix, with $i = 1, \dots, n$ and $j = 1, \dots, J$;
- $c_{kj} \in C$: cluster centers matrix, with $k = 1, \dots, K$;
- $p_k(x)$: probability that the element $x_{ij}$ belongs to cluster $k$;
- $d_k(x)$: distance of the element $x_{ij}$ from the center $c_{kj}$.

PD-clustering is a probabilistic clustering method based on the following assumption:

$$p_k(x)\, d_k(x) = JDF(x)$$

$JDF(x)$, the Joint Distance Function, is a constant depending on $x$.
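Combined with the requirement that the probabilities of one point sum to 1, the assumption gives $p_k(x) = \frac{1/d_k(x)}{\sum_m 1/d_m(x)}$ and $JDF(x) = \frac{1}{\sum_m 1/d_m(x)}$. A minimal sketch for a single point with fixed centers follows; the Euclidean distance, the function name and the handling of a point lying exactly on a center are assumptions, and the center update of the full algorithm is not shown.

```python
import math


def pd_probabilities(x, centers):
    """Membership probabilities and JDF of one point for fixed cluster centers.

    Uses p_k * d_k = JDF for every k together with sum_k p_k = 1.
    """
    dists = [math.dist(x, c) for c in centers]
    zeros = [k for k, d in enumerate(dists) if d == 0.0]
    if zeros:
        # Assumed convention: a point sitting on a center belongs to it with certainty.
        probs = [1.0 / len(zeros) if k in zeros else 0.0 for k in range(len(centers))]
        return probs, 0.0
    inverse = [1.0 / d for d in dists]
    total = sum(inverse)
    probs = [value / total for value in inverse]
    return probs, 1.0 / total


# Toy example: one point and two centers on a line (hypothetical values).
x = (1.0, 0.0)
centers = [(0.0, 0.0), (4.0, 0.0)]
probs, jdf = pd_probabilities(x, centers)
print(probs, jdf)
```

In this example the distances are 1 and 3, so the closer center receives the larger probability, and the product of probability and distance is the same for both clusters.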


Factor PD-clustering

Factor PD-clustering consists in finding a transformation of the original data, $x^*_{iq} = x_{ij} b_{jq}$, where $b_{jq} \in B$ is a weighting system, and cluster centers $c^*_{kq}$ such that the JDF is minimized:

$$\widehat{JDF} = \arg\min_{C;B} \sum_{i=1}^{n} \sum_{q=1}^{Q} \sum_{k=1}^{K} (x^*_{iq} - c^*_{kq})^2\, p_{ik}^2 .$$

- $x^*_{iq}$ is the generic element of the $1 \times Q$ vector of the coordinates of the $i$-th statistical unit;
- $c^*_{kq}$ is the generic element of the $1 \times Q$ vector $c^*_k$ indicating the generic cluster center in the factorial space;
- $p_{ik}$ is the probability that the $i$-th statistical unit belongs to cluster $k$.
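To make the criterion concrete, the sketch below evaluates the triple sum for given B, centers and probabilities. It does not perform the minimization. It assumes that the transformation sums over $j$, that is $x^*_{iq} = \sum_j x_{ij} b_{jq}$; the function names and the toy values are illustrative.

```python
def project(X, B):
    """Project the n x J data matrix X with the J x Q weights B (assumed sum over j)."""
    Q = len(B[0])
    return [
        [sum(x_ij * B[j][q] for j, x_ij in enumerate(row)) for q in range(Q)]
        for row in X
    ]


def fpdc_objective(X_star, C_star, P):
    """Sum over i, q, k of (x*_iq - c*_kq)^2 * p_ik^2."""
    total = 0.0
    for i, x in enumerate(X_star):
        for k, c in enumerate(C_star):
            squared = sum((x[q] - c[q]) ** 2 for q in range(len(x)))
            total += squared * P[i][k] ** 2
    return total


# Toy example with n = 2 units, J = 3 variables, Q = 2 factors, K = 2 clusters.
X = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
B = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # hypothetical weights: keep the first two variables
C_star = [[1.0, 0.0], [0.0, 1.0]]         # hypothetical centers in the factorial space
P = [[0.5, 0.5], [0.5, 0.5]]              # hypothetical membership probabilities

X_star = project(X, B)
objective = fpdc_objective(X_star, C_star, P)
print(X_star, objective)
```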


Matrices B and C are computed according to the optimization of the same function.

An iterative two-step algorithm is required because B and C cannot be computed simultaneously.


Factor PD-clustering

$g_{iqk} = (x^*_{iq} - c^*_{kq})$ is the general element of $G$.

It has been demonstrated (Tortora 2011) that the value of matrix B that minimizes the JDF can be obtained through a Tucker3 decomposition of the distance matrix G:

$$G = U \Lambda (V' \otimes B') + E$$

Core of the method:
- Tucker3 decomposition;
- PD-clustering on Tucker3 factors;
- iteration of the procedure until convergence.

Number of factors: $Q < J$


Two-mode Factor Probabilistic Distance Clustering

Generalization of FPDC to two-mode clustering.

Iterative clustering procedure based on the following steps:
1 Random initialization of the clusters of units and computation of the distance matrix;
2 Three-way decomposition of the distance matrix;
3 Projection of units and variables on the factorial space;
4 PD-clustering of variables on the factorial space;
5 Computation of the distance matrix of variables;
6 Three-way decomposition of the distance matrix;
7 Projection of units and variables on the factorial space;
8 PD-clustering of units on the factorial space.

Steps 2 to 8 are iterated until convergence to the solution.


Two-mode FPDC


Conclusions

Hierarchical clustering
- Advantages: finds homogeneous groups of words and firms
- Disadvantages: there is no block structure

Co-clustering
- Advantages: finds a block structure
- Disadvantages: 1 big cluster and 3 very small clusters

Two-mode FPDC
- Advantages: finds a block structure, with a balanced number of elements in each group
- Disadvantages: iterative algorithm; convergence must be analytically verified


Bibliography

Balbi, S., Miele, R., and Scepi, G.: 2012, Clustering of documents from a two-way viewpoint, 10th Int. Conf. on Statistical Analysis of Textual Data.

Ben-Israel, A. and Iyigun, C.: 2008, Probabilistic D-clustering, Journal of Classification.

Dhillon, I., Mallela, S., and Modha, D.: 2003, Information-theoretic co-clustering, Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

Kroonenberg, P. M.: 2008, Applied multiway data analysis, Wiley series in probability and statistics.

Spano, M. and Triunfo, N.: 2012, La relazione sulla gestione delle società italiane quotate sul mercato regolamentato, JADT 2012: 11es Journées internationales d'Analyse statistique des Données Textuelles.

Tortora, C.: 2011, Non-hierarchical clustering methods on factorial subspaces, PhD thesis, Università di Napoli Federico II.

Tortora, C., Gettler Summa, M., and Palumbo, F.: 2011, Factorial PD-clustering, Proceedings of the Joint Conference of the German Classification Society.

Van Mechelen, I., Bock, H., and De Boeck, P.: 2004, Two-mode clustering methods: a structured overview, Statistical Methods in Medical Research, 13(5):363-394.

Vichi, M. and Kiers, H. A. L.: 2000, Factorial k-means analysis for two-way data, Computational Statistics and Data Analysis.
