Diffusion Mapping and PCA on the WikiLeaks Cable Database

Fei Xue and Samer Sabri
CPSC 675

May 7, 2012

Contents

1 Introduction
  1.1 Background
  1.2 Significance
  1.3 Objectives

2 Method
  2.1 Preprocessing: from Text to Matrix
  2.2 Analyzing the Data: PCA and Diffusion Mapping
  2.3 Theoretical and Practical Obstacles

3 Results
  3.1 Results of Analysis with PCA
  3.2 Results of Analysis with Diffusion Mapping

4 Discussion
  4.1 Interpretation of Results
  4.2 Limitations, Problems and Possible Improvements

5 Conclusion

A Appendix: Implementation Details

B Appendix: Navigating the output of PCA and Diffusion Mapping

List of Figures

1  Data along PC dimensions 4 and 5, colored by region
2  Data along eigenfunctions 4 and 5, colored by region
3  Regions along eigenfunctions 1 and 4 for the year 2007
4  Data along PC dimensions 2 and 6, colored by region
5  Regions along eigenfunctions 1 and 8 for the year 2007
6  Regions along eigenfunctions 2 and 6 for the year 2008
7  Regions along eigenfunctions 1 and 3 for the year 2004
8  Regions along eigenfunctions 1 and 3 for the year 2005
9  Regions along eigenfunctions 1 and 3 for the year 2006
10 Regions along eigenfunctions 1 and 3 for the year 2007
11 Regions along eigenfunctions 1 and 3 for the year 2008
12 Regions along eigenfunctions 1 and 3 for the year 2009

1 Introduction

Finding the underlying structure of datasets consisting of assortments of related texts can sound like an esoteric endeavor at first. However, access to this structure can provide many insights, ranging from understanding and quantifying political and rhetorical trends to deriving important structures that, given a specific corpus, parallel a larger metastructure: for example, one can imagine a theory of the internal organization of science arrived at through a careful analysis of a large subset of the science literature.

In this paper, we apply this idea by using two dimensionality reduction techniques (principal component analysis and diffusion mapping) to analyze the database of United States diplomatic cables published by WikiLeaks [1], also known as Cablegate.

1.1 Background

Principal Component Analysis (PCA) PCA is a mathematical procedure that uses an orthogonal transformation to convert a matrix with correlated rows into a set of values of linearly uncorrelated variables. The transformed dimension is often much smaller than the original dimension [2]. The transformation is defined so that the principal components are ordered by the variance they capture; by using only the first several dimensions, we can start gaining an understanding of the dataset’s organization.
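
For concreteness, the following is a minimal PCA sketch via the singular value decomposition, written in Python/NumPy for illustration only (the analysis in this paper used the PCA routine written for the course assignment, as noted in subsection 2.2); X is assumed to be an M x N document-by-word matrix.

    import numpy as np

    def pca(X, k=10):
        # Center each column (word) at zero mean.
        Xc = X - X.mean(axis=0)
        # SVD of the centered matrix; the rows of Vt are the principal directions.
        U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
        components = Vt[:k]              # k x N principal directions
        scores = Xc @ components.T       # M x k coordinates of each document
        explained_variance = (S[:k] ** 2) / (Xc.shape[0] - 1)
        return components, scores, explained_variance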

Diffusion Mapping Diffusion mapping is a dimensionality reduction technique that consists of building diffusion maps, which generate lower-dimensional representations of complex geometric structures [3]. The technique is predicated upon the definition of a notion of distance between points on the manifold. Diffusion mapping works best if the data set points lie on a non-linear manifold in the high-dimensional space.
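
For concreteness, a minimal sketch of one common diffusion map construction (Gaussian kernel, Markov normalization, eigendecomposition) is given below in Python. It is illustrative only and is not the routine used for the analysis in this paper; the bandwidth eps is an assumed parameter.

    import numpy as np
    from scipy.linalg import eigh
    from scipy.spatial.distance import pdist, squareform

    def diffusion_map(X, k=10, eps=1.0):
        # Gaussian affinities between all pairs of document vectors.
        D = squareform(pdist(X, metric="euclidean"))
        W = np.exp(-D ** 2 / eps)
        d = W.sum(axis=1)
        # Symmetric normalization; shares eigenvalues with the Markov matrix
        # P = D^(-1) W that defines the diffusion process.
        S = W / np.sqrt(np.outer(d, d))
        vals, vecs = eigh(S)                   # eigenvalues in ascending order
        idx = np.argsort(vals)[::-1][1:k + 1]  # skip the trivial top eigenvector
        # Right eigenvectors of P, scaled by their eigenvalues, give the
        # diffusion coordinates.
        psi = vecs[:, idx] / np.sqrt(d)[:, None]
        return psi * vals[idx]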

WikiLeaks The secret US Embassy Cables, also known as Cablegate, are “the largest set of confidential documents ever to be released into the public domain” [1]. WikiLeaks published 251,287 leaked US Embassy Cables, ranging in date from December 1966 to February 2010, and originating from over 274 locations around the world [1].


1.2 Significance

As outlined in the introduction, applying dimensionality reduction techniques to texts is a very promising idea, and one of the main purposes of this project is to build a framework to make similar work easier in the future.

In addition, the WikiLeaks database presents a real problem to political analysts because of its massive size: if one were able to read each cable in 10 seconds, without pausing to attend to any biological needs, it would take over 29 days to read every cable in the database. However, we contend that the 251,287 US Embassy Cables yield more information than the “scandals” that were relayed in the press: they should provide insights into shifts in US policy (especially in the period from 2002 to 2010, where the database is densest) and changes in the way US diplomats view different aspects of foreign policy.

1.3 Objectives

Although extracting all these insights is too ambitious a task, we attempt in this paper to show that it is possible (1) to use quantitative techniques such as dimensionality reduction on a corpus of texts and (2) to gain at least some metainformation about the given corpus and its internal organization.

More practically, we ask ourselves: can we “plot” cables in a basis that has significantly lower dimensionality than the number of texts? If so, does this plot tell us anything about US foreign policy and the way US diplomats’ view of the world varies across time and regions?

2 Method

In this section, we will introduce the method adopted to analyze the data set using both diffusion mapping and PCA. In subsection 2.1, we will outline the main steps involved in preprocessing the cable database to “translate” it into a data set on which we can apply PCA and diffusion mapping. Subsection 2.2 will explain the specific information obtained from applying PCA and diffusion mapping to our dataset. Finally, in subsection 2.3, we will provide an account of the main theoretical and practical obstacles we encountered in the process.

2.1 Preprocessing: from Text to Matrix

Ideally, each point in our post-processed database should represent a document as a vector. The distribution of vectors should be dense, at least in the same components, in order for dimensionality reduction to yield significant results. The vectors should also be distinct enough from each other that they do not simply form an ambiguous cluster.

The cable database was downloaded in .sql format from the WikiLeaks website [1]. The file was loaded into a relational database and queried on its columns. The database has 251,287 rows, each representing a cable and its metainformation; the columns correspond to the following fields in each tuple: ID, date, origin, content. The ID is a unique number sorted by date of generation, the date is in yyyy/mm/dd format, the origin is most often a foreign consulate or embassy, and the content is the plain text content of the cable.

Querying the database was easy. For instance, one can use the following simple statement in PostgreSQL to get the content of the document with the id 5535: SELECT content FROM cable WHERE id = 5535.

In order to make use of the database, we wrote a script to obtain each document in a separate .txt file. Apart from the document content, we also needed supplementary markers for other properties of the document. We chose the following features: year, month, country and region. To obtain those properties, we first used queries similar to the one above to get the date and origin for each tuple, ordered by ID. We then mapped each origin to its country, and each country to one of the following 10 regions: North America, South America, Africa, Arab World, Europe, East Asia, Australia, Russia, Turkic Countries, and Other Islands. For dates, we extracted the year and month and turned each pair into a 6-digit integer we will call ‘yearmonth.’
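
As an illustration, the ‘yearmonth’ encoding and the origin-to-region lookup could be sketched as follows. This is a hypothetical Python sketch: the lookup tables are placeholders, and it is not the preprocessing code we actually used (see Appendix A).

    # Hypothetical sketch; ORIGIN_TO_COUNTRY and COUNTRY_TO_REGION are
    # placeholder tables, not the mapping actually used.
    ORIGIN_TO_COUNTRY = {"Embassy Paris": "France", "Embassy Tokyo": "Japan"}
    COUNTRY_TO_REGION = {"France": "Europe", "Japan": "East Asia"}

    def yearmonth(date_str):
        # date_str is in yyyy/mm/dd format, e.g. "2007/03/15" -> 200703.
        year, month, _ = date_str.split("/")
        return int(year) * 100 + int(month)

    def region_of(origin):
        country = ORIGIN_TO_COUNTRY.get(origin, "Unknown")
        return COUNTRY_TO_REGION.get(country, "Unknown")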

Naturally, we thought of using (normalized) word frequency as a representation of the documents. In other words, each component of a document’s vector corresponds to the frequency of a specific word in that document. We also needed to remove generally frequent words that do not provide any valuable information, such as “for,” “get,” etc. Therefore, the next step of preprocessing was to generate a dictionary tailored to the WikiLeaks database. In the rest of this paper, we will call this dictionary the “positive dictionary,” to contrast it with the “negative dictionary” used to filter out words that are too frequent. After creating this “positive dictionary,” we performed a word count on each document, using only the words this dictionary contains, to generate a vector of the length of this dictionary.

To obtain the dictionary, we did an overall word count on all the files in the database. We then imposed a threshold on the minimum number of appearances and deleted words that are in a list of over 200 common English words (the “negative dictionary”). We then went over the resulting word list to make sure that most of the words in this “positive dictionary” were related to issues relevant to the WikiLeaks database. After some fine tuning, we obtained a 2500-word “positive dictionary.” Using this dictionary, we went over the database again to generate a “positive dictionary vector” for each document. Each vector was stored in a separate text file, then loaded into Matlab.
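
A sketch of the dictionary construction and per-document counting follows, as hypothetical Python (the actual pipeline used the Perl scripts “analyze.pl” and “myscript.pl” described in Appendix A, and the min_count threshold shown here is a placeholder value).

    import re
    from collections import Counter

    def build_positive_dictionary(documents, negative_dictionary, min_count=100):
        # Overall word count across all documents, then threshold and filter
        # against the list of common English words (the "negative dictionary").
        counts = Counter()
        for text in documents:
            counts.update(re.findall(r"[a-z]+", text.lower()))
        return sorted(w for w, c in counts.items()
                      if c >= min_count and w not in negative_dictionary)

    def document_vector(text, positive_dictionary):
        # For one cable, count only the words in the positive dictionary.
        counts = Counter(re.findall(r"[a-z]+", text.lower()))
        return [counts[w] for w in positive_dictionary]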

At that point, we had an M * N matrix of non-negative integers that we will call MAT (M is the number of documents and N is the dictionary size), as well as three M * 1 vectors for country, region and ‘yearmonth’ respectively. As one might imagine, due to the selection of the dictionary, some of the documents, especially short ones, had very few words in common with our “positive dictionary,” and thus contained zeros in the large majority of their vector’s components. These vectors should not be included in the dimensionality reduction process because they do not convey enough information to be mapped to a lower dimension. We therefore applied a threshold cut that excluded vectors with too many zero components. As we deleted these sparse rows from the matrix, we also deleted the corresponding positions in the country, region and ‘yearmonth’ vectors. We then normalized the M * N matrix so that each vector has mean 0 and length 1. This completed the “translation” process.
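
A sketch of the sparsity cut and normalization, again as hypothetical Python (the threshold value is a placeholder; the actual step was performed in Matlab, see Appendix A):

    import numpy as np

    def filter_and_normalize(MAT, meta, min_nonzero=50):
        # Drop rows (documents) whose vectors have too few non-zero counts.
        keep = (MAT != 0).sum(axis=1) >= min_nonzero
        X, meta = MAT[keep].astype(float), meta[keep]
        # Center each row to mean 0, then scale it to unit length.
        X -= X.mean(axis=1, keepdims=True)
        X /= np.linalg.norm(X, axis=1, keepdims=True)
        return X, meta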


2.2 Analyzing the Data: PCA and Diffusion Mapping

Principal Component Analysis We ran the PCA algorithm written for the course assignment on the dataset, obtaining the top k (10 in this case) principal components of the data set. We were looking to understand how the cables would fall and cluster in the lower dimensional space formed by these k principal components. For each unordered pair of principal component dimensions, we projected all the vectors in MAT onto these two directions and obtained two-dimensional coordinates. We then plotted all these points on a plane and colored the points according to time, country and region in order to obtain a better visualization.
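
A hypothetical sketch of this projection-and-coloring step (Python/matplotlib; our plots were generated in Matlab, and the variable names here are assumptions):

    import numpy as np
    import matplotlib.pyplot as plt

    def plot_projection(scores, regions, i, j):
        # scores: M x k matrix of coordinates; i, j: component indices.
        regions = np.asarray(regions)
        for region in np.unique(regions):
            mask = regions == region
            plt.scatter(scores[mask, i], scores[mask, j], s=5, label=region)
        plt.xlabel("PC %d" % (i + 1))
        plt.ylabel("PC %d" % (j + 1))
        plt.legend(markerscale=3, fontsize="small")
        plt.show()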

Figure 1 is an example of one of these visual representations of our data in 2 of the principal components of the data set.

Figure 1: Data along PC dimensions 4 and 5, colored by region

Diffusion Mapping Diffusion mapping allowed us to obtain k eigenfunctions (10 in this case too) that constitute the basis of our lower dimensional space. We plotted the data in a way analogous to the one described above, also coloring the points according to time, country and region. Figure 2 is an example of one of these visual representations of our data in 2 dimensions, using as a basis two of the eigenfunctions generated by our diffusion mapping analysis.

Figure 2: Data along eigenfunctions 4 and 5, colored by region

Given that the results for diffusion mapping were more clearly clustered in terms of region, we had the idea of visualizing the evolution of each region over time. We did this by finding the center of gravity of all the points in each region for each year, and plotting each region as one point in a graph for each year. We then made an animated gif file for each component pair to demonstrate how the regions change over time (see Appendix B for a description of these files and their location). Figure 3 is an example of such a graph.
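
A hypothetical sketch of the center-of-gravity computation for one year (Python; the actual computation was done in Matlab, and the variable names are assumptions):

    import numpy as np

    def region_centroids(coords, regions, years, year):
        # coords: M x k diffusion coordinates; regions, years: per-point labels.
        regions, years = np.asarray(regions), np.asarray(years)
        centroids = {}
        for region in np.unique(regions):
            mask = (regions == region) & (years == year)
            if mask.any():
                centroids[region] = coords[mask].mean(axis=0)
        return centroids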


Figure 3: Regions along eigenfunctions 1 and 4 for the year 2007

2.3 Theoretical and Practical Obstacles

Given the difference in nature between a database of text files and a collection of same-sized vectors, we encountered a fair number of theoretical and practical obstacles in generating vectors compatible with PCA and diffusion mapping. We elaborate on some of these difficulties in this subsection.

Data Sparsity Text data is complex; it is thus hard to represent it as numbers and to manipulate these numbers while still preserving the meaning contained in that data. Also, since these texts cover various topics on a global basis, their mathematical equivalent is by nature very high-dimensional and sparse.

As described in subsection 2.1, we grouped together all cables from the same month and location (more granular than regions) as one data point, and excluded all the vectors below a certain threshold of pre-normalization weight.
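
The grouping step can be sketched as follows in hypothetical Python (our version is the Matlab script “desparse.m” mentioned in Appendix A; this is only an illustration of the idea):

    import numpy as np
    from collections import defaultdict

    def group_by_month_and_origin(MAT, yearmonths, origins):
        # Sum the count vectors of all cables sharing a (yearmonth, origin) pair.
        groups = defaultdict(list)
        for row, (ym, origin) in enumerate(zip(yearmonths, origins)):
            groups[(ym, origin)].append(row)
        keys = sorted(groups)
        grouped = np.vstack([MAT[groups[k]].sum(axis=0) for k in keys])
        return grouped, keys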


“Negative” and “Positive” Dictionary Selection The need for a “negative dictionary” is evident, in the sense that some words convey no useful information because they are too frequent. In addition, a “positive dictionary” is necessary because including all the words in the dataset would imply very long vectors that are even more sparse. The words that are “lost” by using a “positive dictionary” are for the most part words that do not allow for a meaningful comparison between cables, because they only exist in a few of them.

The problem of defining these dictionaries would ideally be solved by reading each dictionary and deciding which words to include or exclude, while making sure that the implied vector size is still manageable. However, given the sheer size of these dictionaries (over 200 words and over 2500 words respectively), such an endeavor is simply too time consuming. The imperfect solution we chose was to fine-tune the thresholds until we had a reasonably sized “positive dictionary” that seemed to contain relevant words upon a quick scan.

Grouping and Representing Data The number of data points prevents one from meaningfully labeling them individually in a graphical representation. Given the non-quantitative nature of the differences between texts, translating back from a quantitative representation proved the most complex problem we faced. It too was resolved imperfectly, by grouping country locations into regions and subsequently plotting graphs containing each region as a point (the weighted average of all the points in that region).

Interpreting Lower Dimensional Representations When one sees a graphical representation like figure 3, one is tempted to interpret the inverse of the distance between regions as a representation of affinity. However, what this lower dimensional space represents is more complex. Indeed, these diplomatic cables are for the most part reports of meetings and descriptions of situations sent from US embassies around the world. They are therefore a representation of the world from the perspective of US foreign policy and US interests. If two points are close to each other in the graph, it means that the words used in the corresponding cables are similar, at least along a certain “axis.” There are many interpretations of what this means; the one we found most plausible is that two groups of cables from different regions sharing many words signify that, along a certain dimension (in the geopolitical sense of the word this time), these groups represent similar problems, concerns or interests from a US perspective - which, in turn, can have to do with many different topics, ranging from human rights concerns to long-term energy interests.

In subsection 4.2, we discuss ways this information could be extracted that would allow these interpretations to be made with more certitude.

Scale of the Database The scale of the database caused some practical problems that could become real obstacles when dealing with a bigger data set (such as the entirety of the Gutenberg project). For instance, we encountered problems with space limitations when we first attempted to extract each cable as an individual text file from the database. In addition, generating vectors from text files took several hours, which was a problem whenever we made changes to our dictionaries, since it meant that the whole process had to be repeated.

3 Results

3.1 Results of Analysis with PCA

As we see in Figure 4, some of the underlying structure is captured by PCA, given that points from the same region tend to cluster together in some of the plots. This is not surprising, given that points belonging to the same region will tend to contain the same vocabulary, or a vocabulary that is similar in comparison with the differences from region to region.

However, there is never a clear segregation of these clusters, and they are congregated in one place in most dimensions, which is likely to indicate that these dimensions do not capture the differences among cables correctly, or capture them along variables that we do not have access to.

3.2 Results of Analysis with Diffusion Mapping

The results of the analysis with diffusion mapping are more encouraging than the results with PCA. As in figure 5, we observe the same clustering in terms of regions as we do in PCA.


Figure 4: Data along PC dimensions 2 and 6, colored by region

However, the clusters are further apart from each other, and each one seems to form a coherent region of the data space as we plot the data along different eigenfunctions.

As these results seemed promising, it was natural to try to extract more information regarding the evolution of each region over time. We hence plotted the different regions as one point each (the center of gravity of all points from the same region), and obtained several figures such as figure 3. As we moved into a more in-depth interpretation of our data, it was important to keep in mind the remarks about what this data means made in subsection 2.3.


Figure 5: Regions along eigenfunctions 1 and 8 for the year 2007

4 Discussion

4.1 Interpretation of Results

It is clear at first glance that diffusion mapping fares better than PCA in providing a convincing reduction of the dimensionality of our data. This can be attributed to the fact that the distribution is far from a multivariate normal distribution, which would have made it easier for PCA to “disentangle” its principal components. Instead, crawling through the data space and using a definition of distance that is very natural - two texts are “close” to each other if the normalized frequencies of many words are similar between them - seems to provide a much better way to understand our data and reduce its dimensionality without losing too much information.

Nevertheless, given the sheer size of the database and the difficulty of translating this quantitative data back into qualitative information outlined in subsection 2.3, we found ourselves only able to obtain certain insights into the structure of the data rather than a comprehensive view. We present here two interesting observations we made:

1. In many of our plots in which regions are represented as one point each, East Asia, Africa and the Arab World seem to be far from each other and from the other data points, which tend to be clustered together. This is illustrated by figure 6. This can be explained by the fact that the concerns tied to these three regions are very particular and unique - which is reflected in differences in word use, hence setting them apart from the other regions in our representation.

2. In some plots, such as figures 7 through 12, we see Africa moving closer to the Arab World from 2004 to 2009 - while the two remain relatively far apart, for, in our opinion, the reasons outlined in our first point. This parallels the increase of the US’s focus, starting in 2004, on counterterrorism initiatives in Africa and a framing of Africa as relevant to the US from a national security perspective. This focus is illustrated by the approval by the US Congress of US$500 million for the Trans-Saharan Counterterrorism Initiative over 6 years, and culminates in the creation of Africom, the United States Africa Command, in 2007. We conjecture that the phenomenon observed reflects an increasing closeness in the language used to describe these regions from a US perspective, because they increasingly evoke similar interests for the United States’ foreign policy.

4.2 Limitations, Problems and Possible Improvements

As conveyed throughout this paper, the greatest difficulty encountered in our work was finding one-to-one correspondences between ideas in the realm of geopolitical relations and depictions on one hand, and our mathematical representation of them on the other. Due to the nature of the database, we had access to a limited amount of metadata, namely the date and location of origin, from which we generated an additional “region” field. This metadata can explain broad distribution trends, as observed in our description of the plots obtained from diffusion mapping, but there is no doubt that more metadata would do more justice to the power of the methods used.

Figure 6: Regions along eigenfunctions 2 and 6 for the year 2008

One field which we had access to, but could not think of a scalable way to use, is the tags associated with many of the cables. These tags identify very specific subjects, and provide a topical organization of the cables. Some examples are “MARR” for “Military and Defense Arrangements” and “TINT” for “Internet Technology.” Making use of these tags in future similar analyses would be a great step forward.

As explained in subsection 2.3, the dictionaries could have been made more powerful by spending more time carefully selecting the words and fine-tuning the thresholds to arrive at a concise yet powerful “positive dictionary.” This was rendered difficult by the time constraints at hand, but could be an important theoretical and practical avenue to explore in the future.


Figure 7: Regions along eigenfunctions 1 and 3 for the year 2004

Figure 8: Regions along eigenfunctions 1 and 3 for the year 2005

Figure 9: Regions along eigenfunctions 1 and 3 for the year 2006

Figure 10: Regions along eigenfunctions 1 and 3 for the year 2007


Figure 11: Regions along eigenfunctions 1 and 3 for the year 2008

Figure 12: Regions along eigenfunctions 1 and 3 for the year 2009

5 Conclusion

The main conceptual problem we tried to tackle through this project is the translation from quantitative to qualitative information. The ultimate goal would be to be able to interpret the axes in terms of what they indicate on a geopolitical level - or along axes that shed light upon the structure of the corpus in terms of topics and themes. This can be easier for other analyses using diffusion mapping on corpora represented by normalized text frequency: for instance, an analysis of US presidential speeches would organize around topic, and then possibly along party lines and periods of recession and economic booms. This kind of interpretation is more difficult to do in our case, but still theoretically possible.

We can conclude that this study provided us with some insights into the texts we were exploring, and it seems that applying diffusion mapping to normalized text frequency vectors is a very promising method, which we attribute to the “naturalness” of distance on such a manifold. This method is relevant in politics, but, as we tried to show, it is extensible to literature, academic papers, and any large body of information whose synthesis or reading is rendered problematic by its size.


References

[1] Wikileaks, May 2012. URL: http://wikileaks.org/.

[2] H. Abdi and L.J. Williams. Principal component analysis. Wiley Interdisciplinary Reviews: Computational Statistics, 2:433–459, 2010.

[3] R.R. Coifman and S. Lafon. Diffusion maps. Applied and Computational Harmonic Analysis, 21:5–30, 2006.


A Appendix: Implementation Details

The code used for preprocessing and postprocessing and the plots made with the results of diffusion mapping and PCA are available at “http://ge.tt/1RIcWHH/v/0?c”. The files used for pre- and postprocessing are contained in the ‘code’ folder. We used Perl for text manipulation, bash shell scripts for batch file processing, and Matlab for manipulation and analysis of vectors and matrices.

For preprocessing, we wrote “analyze.pl” to take a subset of our text and a “negative dictionary” and generate a “positive dictionary.” “myscript.pl” takes as input a “positive dictionary” and a cable (in .txt format) and outputs a vector representing the number of times each word in the “positive dictionary” is found in the cable. This file is used by “letsdoit.sh” to generate a data file for each text file as a batch job. Finally, the data is imported into Matlab, normalized and preprocessed using “import.m,” “normalization.m” and “preproc.m” (written by Daniel Holtmann-Rice) respectively. “desparse.m” was then used to organize all the vectors with the same month, year and origin into one vector and to remove null vectors from the data, while keeping the date, origin and region vectors aligned with our matrix.

We then used the PCA and Diffusion Mapping routines written for our homework, and used “plotAll.m” and “plotAndYar.m” as well as other bash shell scripts to generate the plots that we describe in Appendix B.

B Appendix: Navigating the output of PCA and Diffusion Mapping

The link provided in the previous appendix also contains a folder called ‘figures,’ in which one can find the figures generated from our analysis. This folder in turn contains 3 other folders.

1. PCA: this folder contains the output of PCA. Each file is called “(countries/dates/regions) i j.jpg” and represents all the data points, colored depending on their corresponding countries/dates/regions and plotted along the ith and jth principal components.


2. DiffMap: this folder contains the output of diffusion mapping. Each file is called “(countries/dates/regions) i j.jpg” and represents all the data points, colored depending on their corresponding countries/dates/regions and plotted along the ith and jth eigenfunctions.

3. DiffMapByRegion: this folder contains the output of plotting the center of gravity of each bundle of data points from the same region along the eigenfunctions generated by diffusion mapping. Files named “regions i j yearyyyy.jpg” represent the centers of gravity of regions in the year yyyy, plotted along eigenfunctions i and j. Files named “i j.gif” are animated image files that represent the centers of gravity of regions along eigenfunctions i and j for each year (on a separate image) from 2002 to 2010.
