Discov

  • 7/28/2019 Discov

    Advanced Quantitative Research Methodology, Lecture Notes: Text Analysis II: Unsupervised Learning via Cluster Analysis [1]

    Gary King, http://GKing.Harvard.Edu

    December 23, 2011

    [1] Copyright 2010 Gary King, All Rights Reserved.

    Reading

    Justin Grimmer and Gary King. 2010. Quantitative Discovery of Qualitative Information: A General Purpose Document Clustering Methodology. http://gking.harvard.edu/files/abs/discov-abs.shtml

    Gary King (Harvard, IQSS) Quantitative Discovery from Text 2 / 23

    The Problem: Discovery from Unstructured Text

    Examples: scholarly literature, news stories, medical information, blog posts, comments, product reviews, emails, social media updates, audio-to-text summaries, speeches, press releases, legal decisions, etc.

    10 minutes of worldwide email = 1 Library of Congress (LOC) equivalent

    An essential part of discovery is classification: "one of the most central and generic of all our conceptual exercises. . . . the foundation not only for conceptualization, language, and speech, but also for mathematics, statistics, and data analysis. . . . Without classification, there could be no advanced conceptualization, reasoning, language, data analysis or, for that matter, social science research." (Bailey, 1994)

    We focus on cluster analysis: discovery through (1) classification and (2) simultaneously inventing a classification scheme

    (We analyze text; our methods apply more generally)

    Why Johnny Can't Classify (Optimally)

    Bell(n) = number of ways of partitioning n objects

    Bell(2) = 2 (AB, A B)

    Bell(3) = 5 (ABC, AB C, A BC, AC B, A B C)

    Bell(5) = 52

    Bell(100) ≈ 10^28 × number of elementary particles in the universe

    Now imagine choosing the optimal classification scheme by hand!

    That we think of all this as astonishing . . . is astonishing
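
To make the growth of Bell(n) concrete, the following is a minimal Python sketch that computes Bell numbers with the Bell triangle. The function name `bell` is chosen here for illustration and is not part of the lecture notes.

```python
def bell(n: int) -> int:
    """Return the n-th Bell number (number of partitions of n objects)."""
    if n < 0:
        raise ValueError("n must be non-negative")
    row = [1]  # row 0 of the Bell triangle; Bell(0) = 1
    for _ in range(n):
        # Each new row starts with the last entry of the previous row;
        # each further entry is its left neighbour plus the entry above that neighbour.
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[0]


for n in (2, 3, 5):
    print(n, bell(n))

# Bell(100) is an integer with more than one hundred digits.
print(len(str(bell(100))))
```

Python integers are arbitrary precision, so Bell(100) is computed exactly rather than approximated.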

    Why HAL Can't Classify Either

    The Goal, an optimal application-independent cluster analysis method, is mathematically impossible:

      No free lunch theorem: every possible clustering method performs equally well on average over all possible substantive applications

    Existing methods:

      Many choices: model-based, subspace, spectral, grid-based, graph-based, fuzzy k-modes, affinity propagation, self-organizing maps, . . .
      Well-defined statistical, data analytic, or machine learning foundations
      How to add substantive knowledge: With few exceptions, who knows?!
      The literature: little guidance on when methods apply

    Deep problem in cluster analysis literature: no way to know which method will work ex ante

    If Ex Ante doesn't work, try Ex Post

    Methods and substance must be connected (no free lunch theorem)

    The usual approach fails: hard to do it by understanding the model

    We do it ex post (by qualitative choice). For example:

      Create long list of clusterings; choose the best
      Too hard for mere humans!
      An organized list will make the search possible
      E.g., consider two clusterings that differ only because one document (of many) moves from category 5 to 6

    Our Idea: Meaning Through Geography

    We develop a (conceptual) geography of clusterings

    A New Strategy

    Make it easy to choose best clustering from millions of choices

    1. Code text as numbers (in one or more of several ways)
    2. Apply all clustering methods we can find to the data, each representing different (unstated) substantive assumptions (
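
The idea of running many clustering methods on the same coded text can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy term-count vectors, the naive agglomerative routine `hclust`, and the small set of distances and linkages are all hypothetical stand-ins, not the methods or software used in the lecture.

```python
from itertools import product


def euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def maximum(x, y):
    return max(abs(a - b) for a, b in zip(x, y))


DISTANCES = {"euclidean": euclidean, "manhattan": manhattan, "maximum": maximum}
LINKAGES = {
    "single": min,
    "complete": max,
    "average": lambda ds: sum(ds) / len(ds),
}


def hclust(docs, k, dist, link):
    """Naive agglomerative clustering; returns one cluster label per document."""
    clusters = [[i] for i in range(len(docs))]
    while len(clusters) > k:
        # Find the pair of clusters with the smallest linkage distance.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ds = [dist(docs[i], docs[j]) for i in clusters[a] for j in clusters[b]]
                score = link(ds)
                if best is None or score < best[0]:
                    best = (score, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    labels = [0] * len(docs)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


def all_clusterings(docs, k):
    """Run every distance/linkage combination and collect the results by name."""
    return {
        f"hclust {d} {l}": hclust(docs, k, DISTANCES[d], LINKAGES[l])
        for d, l in product(DISTANCES, LINKAGES)
    }


# Hypothetical term-count vectors: two documents on one topic, two on another.
docs = [(5, 0, 1), (4, 1, 0), (0, 6, 5), (1, 5, 6)]
results = all_clusterings(docs, k=2)
for name, labels in results.items():
    print(name, labels)
```

Each distance and linkage choice encodes a different assumption about what makes documents similar. On this toy input the combinations agree, but on real text they generally produce different clusterings.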

    You choose one (or more), based on insight, discovery, useful information, . . .

    [Figure: Space of Cluster Solutions. Each point is the clustering produced by one method, labeled by method and distance measure (biclust_spectral, clust_convex, mult_dirproc, dismea, rock, som, spec_*, mspec_*, affprop, divisive, kmedoids, mixvmf, mixvmfVA, kmeans, and hclust variants); Cluster Solution 1 and Cluster Solution 2 are marked.]

    Application-Independent Distance Metric: Axioms

    Metric based on 3 assumptions

    1. Distance between clusterings: a function of the pairwise document agreements (pairwise agreements ⇒ triples, quadruples, etc.)
    2. Invariance: distance is invariant to the number of documents (for any fixed number of clusters)
    3. Scale: the maximum distance is set to log(num clusters)

    Only one measure satisfies all three (the variation of information)

    Meila (2007): derives same metric using different axioms (lattice theory)
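
As a rough illustration of the variation of information, the sketch below computes it in its standard form, VI(a, b) = H(a|b) + H(b|a), for two clusterings given as label lists. The function name and the example labelings are assumptions for illustration, and the value is in nats with no rescaling applied.

```python
from collections import Counter
from math import log


def variation_of_information(a, b):
    """VI(a, b) = H(a|b) + H(b|a), in nats, for two label sequences."""
    if len(a) != len(b):
        raise ValueError("clusterings must label the same documents")
    n = len(a)
    count_a = Counter(a)
    count_b = Counter(b)
    count_ab = Counter(zip(a, b))
    vi = 0.0
    for (i, j), n_ij in count_ab.items():
        p_ij = n_ij / n
        # -p_ij * log(p_ij / p_i) contributes to H(b|a);
        # -p_ij * log(p_ij / p_j) contributes to H(a|b).
        vi -= p_ij * (log(p_ij / (count_a[i] / n)) + log(p_ij / (count_b[j] / n)))
    return vi


# Hypothetical clusterings of ten documents.
base = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
one_moved = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]   # one document changes category
unrelated = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]   # cuts across the base clustering

print(variation_of_information(base, base))
print(variation_of_information(base, one_moved))
print(variation_of_information(base, unrelated))
```

Two clusterings that differ by a single document sit close together under this distance, while clusterings that cut across each other sit far apart, which is what lets a long list of clusterings be organized spatially.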

    The Future of Political Science: 100 Perspectives
    Edited by Gary King, Harvard University; Kay Lehman Schlozman, Boston College; and Norman H. Nie, Stanford University

    Available March 2009: 304pp
    Pb: 978-0-415-99701-0: $24.95
    www.routledge.com/politics

    "The list of authors in The Future of Political Science is a 'who's who' of political science. As I was reading it, I came to think of it as a platter of tasty hors d'oeuvres. It hooked me thoroughly."
    Peter Kingstone, University of Connecticut

    "In this one-of-a-kind collection, an eclectic set of contributors offer short but forceful forecasts about the future of the discipline. The resulting assortment is captivating, consistently thought-provoking, often intriguing, and sure to spur discussion and debate."
    Wendy K. Tam Cho, University of Illinois at Urbana-Champaign

    "King, Schlozman, and Nie have created a visionary and stimulating volume. The organization of the essays strikes me as nothing less than brilliant. . . It is truly a joy to read."
    Lawrence C. Dodd, Manning J. Dauer Eminent Scholar in Political Science, University of Florida

    Evaluators Rate Machine Choices Better Than Their Own

    Scale: (1) unrelated, (2) loosely related, or (3) closely related
    Table reports: mean(scale)

    Pairs from            Overall Mean   Evaluator 1   Evaluator 2
    Random Selection      1.38           1.16          1.60
    Hand-Coded Clusters   1.58           1.48          1.68
    Hand-Coding           2.06           1.88          2.24
    Machine               2.24           2.08          2.40

    p.s. The hand-coders did the evaluation!

    Evaluating Performance

    Goals:

      Validate claim: computer-assisted conceptualization outperforms human conceptualization
      Demonstrate: new experimental designs for cluster evaluation
      Inject human judgement: relying on insights from survey research

    We now present three evaluations

      Cluster quality → RA coders
      Informative discoveries → experienced scholars analyzing texts
      Discovery → you're the judge

    Evaluation 1: Cluster Quality

    What Are Humans Good For?

      They can't: keep many documents & clusters in their head
      They can: compare two documents at a time
      ⇒ Cluster quality evaluation: human judgement of document pairs

    Experimental Design to Assess Cluster Quality

      Automated visualization to choose one clustering
      Many pairs of documents
      For coders: (1) unrelated, (2) loosely related, (3) closely related
      Quality = mean(within cluster) - mean(between clusters)
      Bias results against ourselves by not letting evaluators choose clustering
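
The quality measure above, mean(within cluster) - mean(between clusters), can be sketched as follows. The data structures (a dict of pair ratings and a dict of cluster labels) and the ratings themselves are hypothetical, chosen only to show the arithmetic.

```python
from statistics import mean


def cluster_quality(ratings, labels):
    """Mean rating of same-cluster pairs minus mean rating of different-cluster pairs.

    ratings: {(doc_i, doc_j): score on the 1-3 scale}
    labels:  {doc: cluster id}
    """
    within = [s for (i, j), s in ratings.items() if labels[i] == labels[j]]
    between = [s for (i, j), s in ratings.items() if labels[i] != labels[j]]
    # statistics.mean raises StatisticsError if either list is empty.
    return mean(within) - mean(between)


# Hypothetical coder ratings: 1 = unrelated, 2 = loosely related, 3 = closely related.
labels = {"a": 0, "b": 0, "c": 1, "d": 1}
ratings = {
    ("a", "b"): 3,
    ("c", "d"): 2,
    ("a", "c"): 1,
    ("b", "d"): 1,
    ("a", "d"): 2,
}
quality = cluster_quality(ratings, labels)
print(round(quality, 3))
```

A larger value means coders judged pairs placed in the same cluster to be more closely related than pairs placed in different clusters.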

    Gary King (Harvard IQSS) Quantitative Discovery from Text 14 / 23

    [Figure: cluster quality for three data sets, with axis labels (Our Method) and (Human Coders) and ticks 0.3, 0.2, 0.1, 0.1, 0.2, 0.3; one point each for Lautenberg Press Releases, Policy Agendas Project, and Reuters Gold Standard.]

    - Lautenberg: 200 Senate press releases (appropriations, economy, education, tax, veterans, ...)
    - Policy Agendas: 213 quasi-sentences from Bush's State of the Union (agriculture, banking & commerce, civil rights/liberties, defense, ...)
    - Reuters: financial news (trade, earnings, copper, gold, coffee, ...); gold standard for supervised learning studies


    Evaluation 2: More Informative Discoveries

    Found 2 scholars analyzing lots of textual data for their work

    Created 6 clusterings:

    - 2 clusterings selected with our method (biased against us)
    - 2 clusterings from each of 2 other methods (varying tuning parameters)

    Created info packet on each clustering (for each cluster: exemplar document, automated content summary)

    Asked for $\binom{6}{2} = 15$ pairwise comparisons

    The user chooses, so we care only about the one clustering that wins

    In both cases there was a Condorcet winner:

    - Immigration: Our Method 1 → vMF 1 → vMF 2 → Our Method 2 → K-Means 1 → K-Means 2
    - Genetic testing: Our Method 1 → {Our Method 2, K-Means 1, K-Means 2} → Dir Proc. 1 → Dir Proc. 2
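The pairwise design above can be illustrated with a short Python sketch that counts the comparisons among six clusterings and looks for a Condorcet winner, meaning a clustering preferred in every head-to-head comparison it takes part in. The function `condorcet_winner` and the `preferred` dictionary are hypothetical. The responses below are made up to agree with the immigration ordering; the scholars' actual 15 choices are not given on the slide.

```python
from itertools import combinations
from math import comb


def condorcet_winner(candidates, preferred):
    """Return the candidate that wins all of its pairwise comparisons, or None.

    preferred maps frozenset({a, b}) to whichever of a and b the evaluator chose.
    """
    for c in candidates:
        if all(preferred[frozenset((c, o))] == c for o in candidates if o != c):
            return c
    return None


# Labels follow the immigration example on the slide.
clusterings = ["Our Method 1", "vMF 1", "vMF 2",
               "Our Method 2", "K-Means 1", "K-Means 2"]

pairs = list(combinations(clusterings, 2))
assert len(pairs) == comb(6, 2) == 15  # six clusterings give 15 comparisons

# Made-up responses: the evaluator always picks the clustering listed earlier.
preferred = {frozenset(p): p[0] for p in pairs}

winner = condorcet_winner(clusterings, preferred)
print(winner)
```

A Condorcet winner need not exist: if the choices form a cycle, the function returns `None`.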


    Evaluation 3: What Do Members of Congress Do?

    - David Mayhew's (1974) famous typology
        - Advertising
        - Credit Claiming
        - Position Taking
    - Data: 200 press releases from Frank Lautenberg's office (D-NJ)
    - Apply our method

    Example Discovery

    [Figure: plot of clusterings, one labeled point per method. Labels:
    mult_dirproc; sot_cor, sot_euc; mixvmf, mixvmfVA; biclust_spectral; clust_convex; dismea; mec; rock; som;
    dist_cos, dist_fbinary, dist_ebinary, dist_minkowski, dist_max, dist_canb, dist_binary;
    spec_cos, spec_euc, spec_man, spec_mink, spec_max, spec_canb;
    mspec_cos, mspec_euc, mspec_man, mspec_mink, mspec_max, mspec_canb;
    affprop (cosine, euclidean, manhattan, info.co, maximum);
    divisive (stand.euc, euclidean, manhattan);
    kmedoids (stand.euc, euclidean, manhattan);
    kmeans (euclidean, maximum, manhattan, canberra, binary, pearson, correlation, spearman, kendall);
    hclust with each distance (euclidean, maximum, manhattan, canberra, binary, pearson, correlation, spearman, kendall) and each linkage (ward, single, complete, average, mcquitty, median, centroid).]


    Red point: a clustering by Affinity Propagation-Cosine (Dueck and Frey 2007)

    Close to: Mixture of von Mises-Fisher distributions (Banerjee et al. 2005)


    Space between methods: local cluster ensemble
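The slide does not spell out how a local cluster ensemble is built, so the following Python sketch is only one plausible reading. It assumes a clustering at a point between existing methods is formed by averaging the co-membership matrices of nearby clusterings, with larger weights for closer methods, and then reclustering from that averaged similarity. All function names are hypothetical. The final thresholding step is a simple stand-in chosen for brevity, not a procedure stated in the lecture.

```python
def comembership(labels):
    """n x n matrix with 1.0 where two documents share a cluster, else 0.0."""
    n = len(labels)
    return [[1.0 if labels[i] == labels[j] else 0.0 for j in range(n)]
            for i in range(n)]


def ensemble_similarity(clusterings, weights):
    """Weighted average of co-membership matrices (weights are normalized to sum to 1)."""
    total = sum(weights)
    n = len(clusterings[0])
    sim = [[0.0] * n for _ in range(n)]
    for labels, w in zip(clusterings, weights):
        m = comembership(labels)
        for i in range(n):
            for j in range(n):
                sim[i][j] += (w / total) * m[i][j]
    return sim


def threshold_clusters(sim, cutoff=0.5):
    """Assumed final step: link documents whose averaged similarity exceeds
    cutoff and return the connected components as cluster labels."""
    n = len(sim)
    parent = list(range(n))

    def find(x):
        # Follow parent links to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] > cutoff:
                parent[find(j)] = find(i)

    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


# Two toy clusterings of four documents; the first is weighted as the closer one.
nearby = [[0, 0, 1, 1], [0, 0, 0, 1]]
sim = ensemble_similarity(nearby, weights=[0.75, 0.25])
print(threshold_clusters(sim))
```

Shifting the weights toward the second clustering moves the result toward it, which is the sense in which points between methods yield new clusterings in this sketch.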


    Found a region with particularly insightful clusterings


    hclust correlation mcquitty Mixture:


    Example Discovery

    [Figure: plot of the space of clusterings. Each labeled point is a clustering produced by one method and distance-metric combination (affprop, divisive, kmedoids, kmeans, hclust, spectral, and others).]

    Mixture:

    0.39 Hclust-Canberra-McQuitty


    Example Discovery

    [Figure: plot of the space of clusterings. Each labeled point is a clustering produced by one method and distance-metric combination (affprop, divisive, kmedoids, kmeans, hclust, spectral, and others).]

    Mixture:

    0.39 Hclust-Canberra-McQuitty

    0.30 Spectral clustering, Random Walk (Metrics 1-6)


    Example Discovery

    [Figure: plot of the space of clusterings. Each labeled point is a clustering produced by one method and distance-metric combination (affprop, divisive, kmedoids, kmeans, hclust, spectral, and others).]

    Mixture:

    0.39 Hclust-Canberra-McQuitty

    0.30 Spectral clustering, Random Walk (Metrics 1-6)

    0.13 Hclust-Correlation-Ward


    Example Discovery

    [Figure: plot of the space of clusterings. Each labeled point is a clustering produced by one method and distance-metric combination (affprop, divisive, kmedoids, kmeans, hclust, spectral, and others).]

    Mixture:

    0.39 Hclust-Canberra-McQuitty

    0.30 Spectral clustering, Random Walk (Metrics 1-6)

    0.13 Hclust-Correlation-Ward

    0.09 Hclust-Pearson-Ward


    Example Discovery

    [Figure: plot of the space of clusterings. Each labeled point is a clustering produced by one method and distance-metric combination (affprop, divisive, kmedoids, kmeans, hclust, spectral, and others).]

    Mixture:

    0.39 Hclust-Canberra-McQuitty

    0.30 Spectral clustering, Random Walk (Metrics 1-6)

    0.13 Hclust-Correlation-Ward

    0.09 Hclust-Pearson-Ward

    0.05 Kmedoids-Cosine


    Example Discovery

    [Figure: plot of the space of clusterings. Each labeled point is a clustering produced by one method and distance-metric combination (affprop, divisive, kmedoids, kmeans, hclust, spectral, and others).]

    Mixture:

    0.39 Hclust-Canberra-McQuitty

    0.30 Spectral clustering, Random Walk (Metrics 1-6)

    0.13 Hclust-Correlation-Ward

    0.09 Hclust-Pearson-Ward

    0.05 Kmedoids-Cosine

    0.04 Spectral clustering, Symmetric (Metrics 1-6)
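The slide gives mixture weights (0.39, 0.30, 0.13, 0.09, 0.05, 0.04) over six existing clusterings but does not show how the weights are applied. The sketch below is one plausible reading, not necessarily the authors' procedure: it assumes each clustering is turned into a co-membership matrix and the matrices are averaged with the slide's weights. The function names and the toy cluster labels for five documents are hypothetical and exist only for illustration.

```python
import numpy as np

def comembership(labels):
    """Return an n x n 0/1 matrix: entry (i, j) is 1 when items i and j share a cluster."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)

def mix_clusterings(clusterings, weights):
    """Weighted average of co-membership matrices (assumed combination rule)."""
    weights = np.asarray(weights, dtype=float)
    if not np.isclose(weights.sum(), 1.0):
        raise ValueError("weights must sum to 1")
    return sum(w * comembership(c) for w, c in zip(weights, clusterings))

# Weights as listed on the slide, in the order shown there.
weights = [0.39, 0.30, 0.13, 0.09, 0.05, 0.04]

# Hypothetical toy clusterings of five documents, one per method on the slide.
clusterings = [
    [0, 0, 1, 1, 2],  # stands in for Hclust-Canberra-McQuitty
    [0, 0, 0, 1, 1],  # stands in for Spectral clustering, Random Walk
    [0, 1, 1, 1, 2],  # stands in for Hclust-Correlation-Ward
    [0, 0, 1, 2, 2],  # stands in for Hclust-Pearson-Ward
    [0, 1, 0, 1, 2],  # stands in for Kmedoids-Cosine
    [0, 0, 1, 1, 1],  # stands in for Spectral clustering, Symmetric
]

# Entry (i, j) is the total weight of the clusterings that put i and j together.
similarity = mix_clusterings(clusterings, weights)
```

Under this assumption, the resulting matrix could be handed to any routine that clusters from a similarity matrix in order to produce a single new clustering for the selected point in the space.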


    Example Discovery

    [Figure: plot of the space of clusterings. Each labeled point is a clustering produced by one method and distance-metric combination (affprop, divisive, kmedoids, kmeans, hclust, spectral, and others).]

    Clusters in this Clustering

    Mayhew


    [Figure: plot of the space of clusterings. Each labeled point is a clustering produced by one method and distance-metric combination (affprop, divisive, kmedoids, kmeans, hclust, spectral, and others).]

    hclust maximum mc