Click here to load reader

Near-Duplicate Detection for eRulemaking

Embed Size (px)

DESCRIPTION

Near-Duplicate Detection for eRulemaking. Hui Yang, Jamie Callan Language Technologies Institute School of Computer Science Carnegie Mellon University {huiyang, callan}@cs.cmu.edu. Presentation Outline. Introduction Problem Definition System Architecture Feature-based Document Retrieval - PowerPoint PPT Presentation

Citation preview

  • Near-Duplicate Detection for eRulemakingHui Yang, Jamie CallanLanguage Technologies InstituteSchool of Computer ScienceCarnegie Mellon University{huiyang, callan}@cs.cmu.edu

  • Presentation OutlineIntroductionProblem DefinitionSystem ArchitectureFeature-based Document RetrievalSimilarity-based ClusteringEvaluation and Experimental ResultsRelated WorkConclusion and Demo

  • Introduction - IU.S. regulatory agencies are required to solicit, consider, and respond to public comments before issuing the final regulations. Some popular regulations attract hundreds of thousands of comments from the general public. In late 1990s USDAs national organic standard, manually sort over 250,000 public comments. In 2004 the EPAs proposed Mercury rule (USEPA-OAR-2002-0056) attracted over 530,000 email messages.Very labor-intensive

  • Introduction - IIThings Become Worse Now Many Online form letters available Written by special interest groupsModifying an electronic form letter is extremely easySpecial Interest GroupsBuild electronic advocacy groups when there is a disconnect between broad public opinion and legislative action.Provide information and tools to help each individual have the greatest possible impact once a group is assembled. Moveon.org, http://www.moveon.orgGetActive, http://www.getactive.org

  • Introduction - IIIPublic comments will be near-duplicates if created from the same form letter.Near-Duplicates increase the likelihood of overlooking substantive information that an individual adds to a form letter. Goal: Recognizing the Near-duplicates and organize themFinding the added information by an individualFinding the Unique commentsOur research focused on recognizing and organizing near-duplicates by text mining and clustering, as well as handling large amount of data

  • Problem Definition - IWhat is a near-duplicate?Pugh declared that two documents are considered near duplicates if they have more than r features in common. Conrad, et al., stated that two documents are near duplicates if they share >80% terminology defined by human expertsOur definition based on the ways to create near-duplicates

  • Problem Definition - IISources of Public Commentsfrom scratch (unique comments)based on a form letter (exact- or near duplicates)A Category-based DefinitionBlock editKey blockMinor changeMinor change + block editExact

  • Block Edit

    Dear Environmental Protection Agency,

    The EPA should require power plants to cut mercury pollution by 90% by 2008. These reductions are consistent with national standards for other pollutants and achievable through available pollution-control technology.

    Sincerely,

    Mary10 Kennaday RoadMendham, NJ 07945

    ________This is an important constituent communication initiated through MoveOn.org.

    Dear Environmental Protection Agency,

    I urge the EPA to require controls at all power plants to stop mercury pollution. The health of our air and water, and especially our children is by far more important than any political agenda.

    The EPA should require power plants to cut mercury pollution by 90% by 2008. These reductions are consistent with national standards for other pollutants and achievable through available pollution-controltechnology.

    Sincerely,

    JohnPO Box 91067Atqasuk, AK 99791

    ________This is an important constituent communication initiated through MoveOn.org.

  • Key Block

    Dear Environmental Protection Agency,

    The EPA should require power plants to cut mercury pollution by 90% by 2008. These reductions are consistent with national standards for other pollutants and achievable through available pollution-control technology.

    Sincerely,

    Mary10 Kennaday RoadMendham, NJ 07945

    ________This is an important constituent communication initiated through MoveOn.org.

    Dear Environmental Protection Agency,

    American citizens need to stand up for their rights. Which means the freedom to pursue life, liberty, health, and happiness. Everybody has the right to wake up each morning and breath the freshest air that this green earth can provide us, not what some government organization says that we need to put up with because they want their standards so lax. This is a democracy, by the people, for the people, not what Bush decides because it suits his mood. It concerns me to see so many people that I care about every day trying so hard to live with the mental and neurological problems that they have acquired, or were born with due to mercury poisoning.

    The EPA should require power plants to cut mercury pollution by 90% by 008.These reductions are consistent with national standards for other pollutants and achievable through available pollution-control technology.

    Sincerely, Rose!

    The EPA should require power plants to cut mercury pollution by 90% by2008. These reductions are consistent with national standards for otherpollutants and achievable through available pollution-controltechnology.

  • Minor Change

    July 31, 2004

    Dear Administrator Leavitt,

    I am writing to urge you to take prompt action to clean up mercury and other toxic air pollution from power plants. EPA's current proposals permit far ore mercury pollution than what the Clean Air Act allows, while at the same time fail to address over sixty other hazardous air pollutants like dioxin.

    Sincerely,

    Ken12168 Duke Dr NEBlaine, MN 55434-3111

    March 31, 2004

    Dear Administrator Michael Leavitt,

    I am writing to urge you to take prompt action to clean up mercury and other toxic air pollution from the power plants. EPAs current proposals permit far more mercury pollution than what the Clean Air Act allows, while at the same time fail to address over sixty other hazardous air pollutants like dioxin.

    Sincerely,

    Sandra H

    ,USA

  • Minor Change + Block Edit

    As someone who cares about protecting the health of children and ourenvironment, I am deeply concerned about the mercury contamination of our lakes and streams. Mercury descends from polluted air into water and then works its way up the food chain. It is especially dangerous to people and wildlife that consume large amounts of fish.I urge you to reconsider your agencys approach and require power plants to reduce their emissions of mercury to the greatest extent possible. his is what the federal law requires, and also what the people and wildlife of this country deserve.

    Thank you for your consideration,

    jUDY Berk232 Beech HIll rd.Northport, ME 04849

    As someone who cares about protecting our wildlife and wild places, I am deeply upset about the mercury contamination of our lakes and streams. Mercury descends from polluted air into water and then works its way up the food chain. It is especially dangerous to people and wildlife that consume large amounts of fish.

    Specifically, I am concerned that EPA is proposing a cap-and-trade system to manage mercury emissions. Under such a system, not all plants would have to reduce their harmful missions of mercury and some could even increase! This approach =s unacceptable for dealing with such a toxic pollutant which is precisely why the Clean Air Act does not allow it. I urge you to reconsider your approach and require power plants to reduce heir emissions of mercury to the greatest extent possible. This is what the federal law requires, and also what the people and wildlife of this country deserve.

    Thank you for your consideration.

    Sincerely,

    jamie duval22 manville hill rdcumberland, Rhode Island 02864

  • Block Reordering

    Dear Environmental Protection Agency,

    The EPA should require power plants to cut mercury pollution by 90% by 2008. These reductions are consistent with national standards for other pollutants and achievable through available pollution-controltechnology.

    I urge the EPA to require controls at all power plants to stop mercury pollution. The health of our air and water, and especially our children is by far more important than any political agenda.

    Sincerely,

    JohnPO Box 91067Atqasuk, AK 99791

    Dear Environmental Protection Agency,

    I urge the EPA to require controls at all power plants to stop mercury pollution. The health of our air and water, and especially our children is by far more important than any political agenda.

    The EPA should require power plants to cut mercury pollution by 90% by 2008. These reductions are consistent with national standards for other pollutants and achievable through available pollution-controltechnology.

    Sincerely,

    JohnPO Box 91067Atqasuk, AK 99791

  • Exact

    The EPA should require power plants to cut mercury pollution by 90% by2008. These reductions are consistent with national standards for otherpollutants and achievable through available pollution-controltechnology.

    The EPA should require power plants to cut mercury pollution by 90% by2008. These reductions are consistent with national standards for otherpollutants and achievable through available pollution-controltechnology.

  • Presentation OutlineIntroductionProblem DefinitionSystem ArchitectureFeature-based Document RetrievalSimilarity-based ClusteringEvaluation and Experimental ResultsRelated WorkConclusion and Demo

  • System Architecture

  • Feature-based Document Retrieval - ITo get duplicate candidate set for each seedAvoid working on the entire datasetSteps: Each seed document is broken into chunks Select the most informative words from each chunka text span around term t* which is the term with minimal document frequency in the chunk

  • Feature-based Document Retrieval - IIMetadata extraction by Information Extractionemail senders, receivers, signatures, docket IDs, delivered dates,email relayers

  • Feature-based Document Retrieval - IIIQuery Formulation

    #AND ( docketoar.20020056 router.moveon #OR(standards proposed by will harm thousands unborn children for coal plants should other cleaner alternative by 90 by with national standards available pollution control) )

  • Presentation OutlineIntroductionProblem DefinitionSystem ArchitectureFeature-based Document RetrievalSimilarity-based ClusteringEvaluation and Experimental ResultsRelated WorkConclusion and Demo

  • Similarity-based Clustering - IDocument dissimlilarity based on Kullback-Leibler (KL) divergenceKL-divergence, a distributional similarity measure, is one way to measure the similarity of one document (one unigram distribution) given another.

    Clustering AlgorithmSoft, Non-Hierarchical Clustering (Partition)Single Pass Clustering with carefully selected seed documents each timeClose to K-MeansNo need to define K before hand

  • Similarity-based Clustering - IIDup CandidatesDup Set 1Dup Set 2

  • Adaptive ThresholdingCut-off threshold for different clusters should be differentDocuments in a cluster are sorted by their document-centroid similarity scores. Sample sorted scores at 10 document intervalsIf there is greater than 5% of the initial cut-off threshold within an interval, a new cut-off threshold is set at the beginning of the interval.

  • Does Feature-based Document Retrieval Helps?It works fairly efficient for large clusters cuts the number of documents needed to be clustered from the size of entire dataset to a reasonable number. 536,975 ->10,995 documentsBad for small clusters (especially for those only containing a single unique document)Disable feature-based retrieval after most of the big clusters have been found.assume that most of the remaining unclustered documents are unique. Only similarity-based clustering is used on them

  • Presentation OutlineIntroductionProblem DefinitionSystem ArchitectureFeature-based Document RetrievalSimilarity-based ClusteringEvaluation and Experimental ResultsRelated WorkConclusion and Demo

  • Evaluation Methodology - IDifficult to evaluate clustering performance lack of man power to produce ground truth for large datasetTwo subsets of 1,000 email messages each were selected randomly from the Mercury dataset. Assessors: two graduate research assistantsManually organized the documents into clusters of documents that they felt were near-duplicatesManually went through one of the experimental clustering results pair by pair ( compare document-centroid pair)

  • Evaluation Methodology - IIClass j vs. Cluster iF-measurepij = nij/ ni , rij = nij/ nj F = , Fj = maxi {Fij}

    Purity = , i = maxj{pij},

    Pairwise-measureFolkes and Mallows index Kappa = p(A) = (a+d)/m , p(E) =

  • Experimental Results - I

    (BC = .74)

    AB

    1

    0.14

    0.24

    0.08

    Baseline2 router

    0.28

    0.31

    3

    Metadata-based (router+filesizerange+docket)

    0.68

    0.34

    0.42

    0.42

    0.56

    5

    0.70

    0.40

    0.68

    Similarity-based clustering

    0.72

    0.56

    7

    0.74

    0.55

    0.71

  • Experimental Results - II

    Chart2

    10

    11

    11

    11

    21

    21

    21

    21

    21

    21

    21

    22

    22

    22

    22

    22

    22

    22

    33

    33

    33

    33

    43

    44

    54

    55

    55

    55

    65

    66

    66

    66

    76

    76

    76

    76

    77

    77

    77

    77

    77

    77

    77

    77

    77

    77

    77

    77

    77

    77

    77

    77

    77

    77

    78

    78

    78

    79

    79

    710

    710

    811

    812

    812

    812

    812

    813

    813

    913

    913

    1014

    1014

    1014

    1115

    1215

    1216

    1216

    1216

    1216

    1317

    1317

    1317

    1318

    1418

    1419

    1419

    1419

    1520

    1520

    1520

    1520

    1520

    1521

    1621

    1621

    1622

    1622

    1622

    1622

    1723

    1723

    1723

    1723

    1824

    1824

    1824

    1824

    1824

    1824

    1925

    1925

    1925

    2026

    2026

    2026

    2026

    2026

    2027

    2027

    2027

    2127

    2127

    2128

    2128

    2128

    2228

    2228

    2228

    2328

    2328

    2329

    2329

    2429

    2529

    2530

    2530

    2531

    2531

    2532

    2632

    2632

    2733

    2733

    2734

    2734

    2734

    2735

    2835

    2835

    2836

    2836

    2836

    2936

    2937

    disable document retrieval for small clusters

    enable document retrieval for small clusters

    number of clusters

    time(s)

    Effects of disable document retrieval for small clusters

    med

    maxmedminduration(s)new clusternumclusterspullkeep

    100005-07-2004-1034341701305

    111105-10-2004-4956792496113

    221102-16-2004-0095343349110

    221006-22-2004-520077410292

    231102-20-2004-134023550146

    331006-12-2004-513426623831

    431005-08-2004-10871275242

    441103-20-2004-175227849619

    441005-14-2004-50100792421

    541002-29-2004-1533581026911

    551104-23-2004-400538114999

    662104-26-2004-413950123975

    662003-13-2004-16377913185

    662003-04-2004-15674214724

    762003-23-2004-231149156511

    772104-23-2004-405656164362

    772005-25-2004-50538217127

    772004-17-2004-3766861881

    873003-23-2004-250642195274

    883103-23-2004-187947208244

    883003-17-2004-02330521221

    883003-18-2004-173068222551

    893103-23-2004-322133235431

    994003-23-2004-294855246521

    9104103-23-2004-054436258241

    10115105-07-2004-104528268511

    10125103-23-2004-243812276511

    11125003-23-2004-314528284441

    12125003-23-2004-278681295461

    12136103-23-2004-056321306511

    12146103-23-2004-042780318251

    13146003-23-2004-243382324451

    14156103-23-2004-278037338251

    14156007-11-2004-536083342682

    14156004-10-2004-3717523534223

    15166103-22-2004-1763883631

    15167003-23-2004-058660374491

    16167003-23-2004-310907384241

    16167006-18-2004-51833739101

    17167006-26-2004-529546401787

    17177103-29-2004-354292411185

    17177006-17-2004-51774242109

    18177003-16-2004-1711244321

    18177002-11-2004-118701442271

    18177003-23-2004-306565454391

    19177004-29-2004-4257784611

    19177006-14-2004-51401647141

    19187103-23-2004-335273483911

    19187006-14-2004-5145094921

    19197104-21-2004-379784504961

    19197004-01-2004-3619165111

    20197002-27-2004-15206152124

    20197003-30-2004-3565695321

    20197003-23-2004-335305541151

    21208103-23-2004-031697558241

    21208003-23-2004-200499566793

    21218103-23-2004-037385578223

    22229103-23-2004-275010588221

    22239103-23-2004-294087598221

    232310003-23-2004-274428605553

    232410103-23-2004-301926616781

    242411003-23-2004-305721626521

    252512103-23-2004-321036636521

    252512003-23-2004-033552644601

    252512003-23-2004-260795654431

    262612103-23-2004-050671665261

    262713103-23-2004-210128676781

    272713003-23-2004-037996686551

    272713003-23-2004-274690694411

    282813103-23-2004-244841704311

    282814003-23-2004-221874714361

    282814003-23-2004-023621725921

    292914103-23-2004-052150735451

    292915003-23-2004-272340748241

    303015103-23-2004-250568758241

    303116103-23-2004-045736768231

    313116003-23-2004-211832774591

    313216103-23-2004-262974784851

    313216003-23-2004-180376794581

    323317103-23-2004-304517805461

    333317005-07-2004-107485814611

    333317003-23-2004-262411825421

    343418103-23-2004-269881835501

    343418003-23-2004-300077844601

    353519103-23-2004-241889856431

    353519003-23-2004-295156864341

    363519003-23-2004-210177876511

    363620103-23-2004-050230886511

    373620003-23-2004-240092895471

    373720103-29-2004-074879905481

    383720003-23-2004-053645914551

    383820103-23-2004-204719926511

    393821003-23-2004-069103934341

    393921103-23-2004-330106945541

    403921003-23-2004-309476954511

    414022103-23-2004-194032966511

    414022003-23-2004-277633974361

    424022003-23-2004-210055984281

    424122103-23-2004-068553998231

    434223103-23-2004-0613841008231

    434223003-23-2004-3047421014501

    444223003-23-2004-2787291023691

    444323103-23-2004-2555311038241

    454424105-07-2004-1052111048221

    454424003-23-2004-3121221055331

    464524103-23-2004-2238951064121

    474524003-23-2004-2003731079251

    474624103-19-2004-174337108271

    484624003-23-2004-177676109261

    484625003-23-2004-3141361104711

    494725103-26-2004-3518931114801

    504725003-23-2004-1884911128231

    504826103-23-2004-3339701134361

    514926103-23-2004-2219211149251

    515026103-23-2004-2713781159191

    515026003-23-2004-0633521168231

    525126103-23-2004-2337731174461

    525127003-23-2004-2199881186991

    525227103-12-2004-1599231191011

    535227002-23-2004-1348701202101

    535227003-23-2004-1810051214431

    545227003-23-2004-2345541224341

    545328103-23-2004-3193491238241

    555428103-23-2004-3063891248231

    555428003-15-2004-17009912531

    555428003-10-2004-159055126961

    565428003-18-2004-173164127131

    565528103-23-2004-2234761286341

    565528002-03-2004-110960129302

    575628103-23-2004-3145561308231

    575629003-23-2004-2712851315691

    585729103-23-2004-0633741324432

    595729003-23-2004-2766281338431

    595729003-10-2004-15905713411

    595830103-23-2004-2131441354363

    605830003-23-2004-0533321365293

    615931103-23-2004-0313811378263

    616031103-23-2004-2536971386781

    616032003-23-2004-0267641398221

    626132103-23-2004-2589221408221

    626132003-23-2004-2448351414441

    636233103-23-2004-3290851426521

    646333103-23-2004-0423851438351

    646434103-23-2004-0509241448231

    656534103-23-2004-3230751458231

    656534003-23-2004-2133571468251

    666635103-23-2004-1856111478231

    666735103-23-2004-1786641489431

    666735003-23-2004-2017891494731

    666836103-23-2004-3140051504361

    676936103-23-2004-0568141518261

    686936003-23-2004-0634581528241

    687036103-23-2004-2205831534351

    697137103-23-2004-2086721548231

    med

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    000

    med

    min

    max

    number of clusters so far

    time(s)

    Time vs. Number of Clusters

    min

    max

    med

    min

    number of clusters

    time(s)

    Time vs. Number of Clusters

    max

    seed counttime so far(s)duration(s)cluster size# redistributed docsnew clusternumclusterspullkeep

    2650043643605-07-2004-1034341701305

    2650005-10-2004-4956792114113

    2651102-16-2004-0095343113110

    2651006-22-2004-52007749492

    2651002-20-2004-134023526646

    2651006-12-2004-51342663131

    2651005-08-2004-10871274242

    2651003-20-2004-17522781919

    2651005-14-2004-50100792121

    2651002-29-2004-153358101111

    2651004-23-2004-40053811179

    2651004-26-2004-41395012145

    2652103-13-2004-1637791355

    2652003-04-2004-1567421444

    2652003-23-2004-231149154341

    2652004-23-2004-405656161672

    2652005-25-2004-5053821777

    2652004-17-2004-3766861881

    2652003-23-2004-250642195264

    2653103-23-2004-187947204404

    2653003-17-2004-0233052111

    2653003-18-2004-1730682251

    2653003-23-2004-322133235461

    2653003-23-2004-294855244361

    2654103-23-2004-054436258231

    2654005-07-2004-104528264361

    2655103-23-2004-243812274361

    2655003-23-2004-314528284221

    2655003-23-2004-278681294311

    2655003-23-2004-056321304361

    2656103-23-2004-042780314221

    2656003-23-2004-243382324221

    2656003-23-2004-278037334261

    2656007-11-2004-5360833422

    2656004-10-2004-371752355223

    2656003-22-2004-1763883611

    2656003-23-2004-058660374351

    2657103-23-2004-310907384351

    2657006-18-2004-51833739101

    2657006-26-2004-52954640937

    2657003-29-2004-3542924155

    2657006-17-2004-51774242109

    2657003-16-2004-1711244321

    2657002-11-2004-11870144461

    2657003-23-2004-306565454351

    2657004-29-2004-4257784611

    2657006-14-2004-5140164711

    2657003-23-2004-3352734811

    2657006-14-2004-5145094921

    2657004-21-2004-3797845011

    2657004-01-2004-3619165111

    2657002-27-2004-1520615211

    2657003-31-2004-3603835344

    2657003-30-2004-3565695421

    2657003-23-2004-3353055511

    2658103-23-2004-031697564401

    2658003-23-2004-200499575473

    2658003-23-2004-037385585243

    2659103-23-2004-275010595241

    2659003-23-2004-294087605241

    26510103-23-2004-274428615463

    26510003-23-2004-301926625461

    26511103-23-2004-305721638251

    26512103-23-2004-321036648251

    26512003-23-2004-033552654411

    26512003-23-2004-260795664361

    26512003-23-2004-050671674361

    26513103-23-2004-210128684361

    26513003-23-2004-037996694361

    26513003-23-2004-274690705481

    26513003-23-2004-244841714191

    26514103-23-2004-221874725461

    26514003-23-2004-023621734361

    26514003-23-2004-052150744221

    26515103-23-2004-272340754361

    26515003-23-2004-250568768231

    26516103-23-2004-045736774361

    26516003-23-2004-211832784361

    26516003-23-2004-262974794361

    26516003-23-2004-180376804361

    26517103-23-2004-304517814311

    26517005-07-2004-107485824361

    26517003-23-2004-262411835421

    26518103-23-2004-269881848231

    26518003-23-2004-300077854361

    26519103-23-2004-241889865421

    26519003-23-2004-295156874341

    26519003-23-2004-210177884361

    26520103-23-2004-050230894361

    26520003-23-2004-240092904311

    26520003-29-2004-074879911201

    26520003-23-2004-053645924221

    26520003-23-2004-204719934221

    26521103-23-2004-069103944341

    26521003-23-2004-330106955421

    26521003-23-2004-309476965531

    26522103-23-2004-194032974361

    26522003-23-2004-277633984351

    26522003-23-2004-210055994361

    26522003-23-2004-0685531004221

    26523103-23-2004-0613841014361

    26523003-23-2004-3047421024241

    26523003-23-2004-2787291034251

    26523003-23-2004-2555311044361

    26524105-07-2004-1052111054361

    26524003-23-2004-3121221064221

    26524003-23-2004-2238951074221

    26524003-23-2004-2003731084221

    26524003-19-2004-174337109391

    26524003-23-2004-17767611071

    26525103-23-2004-3141361115091

    26525003-26-2004-3518931121671

    26525003-23-2004-1884911134221

    26526103-23-2004-3339701144221

    26526003-23-2004-2219211154221

    26526003-23-2004-2713781164351

    26526003-23-2004-0633521174071

    26526003-23-2004-2337731184081

    26527103-23-2004-2199881194071

    26527003-12-2004-15992312021

    26527002-23-2004-1348701211671

    26527003-23-2004-1810051224221

    26527003-23-2004-2345541234221

    26528103-23-2004-3193491244231

    26528003-23-2004-3063891254221

    26528003-15-2004-17009912641

    26528003-10-2004-15905512711

    26528003-18-2004-17316412881

    26528003-23-2004-2234761294081

    26528002-03-2004-11096013022

    26528003-23-2004-3145561314361

    26529103-23-2004-2712851325721

    26529003-23-2004-0633741334242

    26529003-23-2004-2766281344081

    26529003-10-2004-15905713511

    26530103-23-2004-2131441364363

    26530003-23-2004-0533321375243

    26531103-23-2004-0313811388263

    26531003-23-2004-2536971395461

    26532103-23-2004-0267641405241

    26532003-23-2004-2589221415241

    26532003-23-2004-2448351425461

    26533103-23-2004-3290851434361

    26533003-23-2004-0423851448271

    26534103-23-2004-0509241454361

    26534003-23-2004-3230751464361

    26534003-23-2004-2133571474381

    26535103-23-2004-1856111486341

    26535003-23-2004-1786641495421

    26535003-23-2004-2017891505241

    26536103-23-2004-3140051514341

    26536003-23-2004-0568141525431

    26536003-23-2004-0634581534221

    26536003-23-2004-2205831544231

    26537003-23-2004-2086721556341

    max

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    time so far(s)

    seed counttime so far(s)duration(s)cluster size# redistributed docsnew clusternumclusterspullkeep

    2651152752705-07-2004-1034341701305

    2651005-10-2004-4956792263113

    2652102-16-2004-0095343389110

    2652006-22-2004-520077411792

    2652002-20-2004-134023524146

    2653106-12-2004-513426659331

    2654105-08-2004-108712718942

    2654003-20-2004-17522782419

    2654005-14-2004-501007922721

    2655102-29-2004-1533581033411

    2655004-23-2004-400538114209

    2656104-26-2004-413950122915

    2656003-13-2004-16377913695

    2656003-04-2004-156742141934

    2657103-23-2004-231149155321

    2657004-23-2004-405656166312

    2657005-25-2004-5053821787

    2657004-17-2004-3766861881

    2658103-23-2004-250642195304

    2658003-23-2004-187947205454

    2658003-17-2004-02330521471

    2658003-18-2004-1730682251

    2658003-23-2004-322133235271

    2659103-23-2004-294855245291

    2659003-23-2004-054436256831

    26510105-07-2004-104528265371

    26510003-23-2004-243812275271

    26511103-23-2004-314528286821

    26512103-23-2004-278681296911

    26512003-23-2004-056321305271

    26512003-23-2004-042780316211

    26513103-23-2004-243382325491

    26514103-23-2004-278037337701

    26514007-11-2004-53608334352

    26514004-10-2004-3717523546523

    26515103-22-2004-176388366621

    26515003-23-2004-058660375251

    26516103-23-2004-310907385241

    26516006-18-2004-51833739491

    26517106-26-2004-529546405647

    26517003-29-2004-354292415925

    26517006-17-2004-51774242139

    26518103-16-2004-171124431061

    26518002-11-2004-118701441161

    26518003-23-2004-306565456981

    26519104-29-2004-42577846361

    26519006-14-2004-51401647361

    26519003-23-2004-335273481071

    26519006-14-2004-51450949411

    26519004-21-2004-37978450281

    26519004-01-2004-3619165111

    26520102-27-2004-152061523214

    26520003-30-2004-3565695321

    26520003-23-2004-335305544011

    26521103-23-2004-031697554401

    26521003-23-2004-200499565283

    26521003-23-2004-037385575243

    26522103-23-2004-275010586691

    26522003-23-2004-294087595251

    26523103-23-2004-274428605293

    26523003-23-2004-301926615271

    26524103-23-2004-305721628461

    26525103-23-2004-321036638231

    26525003-23-2004-033552644421

    26525003-23-2004-260795655271

    26526103-23-2004-050671665281

    26526003-23-2004-210128676181

    26527103-23-2004-037996686851

    26527003-23-2004-274690696161

    26528103-23-2004-244841705251

    26528003-23-2004-221874715311

    26528003-23-2004-023621725281

    26529103-23-2004-052150736391

    26529003-23-2004-272340745941

    26530103-23-2004-250568756811

    26530003-23-2004-045736765241

    26531103-23-2004-211832776391

    26531003-23-2004-262974784411

    26531003-23-2004-180376794401

    26532103-23-2004-304517807091

    26533105-07-2004-107485816811

    26533003-23-2004-262411824801

    26534103-23-2004-269881836891

    26534003-23-2004-300077846811

    26535103-23-2004-241889855741

    26535003-23-2004-295156864691

    26536103-23-2004-210177876321

    26536003-23-2004-050230885271

    26537103-23-2004-240092896811

    26537003-29-2004-074879902851

    26538103-23-2004-053645916821

    26538003-23-2004-204719926291

    26539103-23-2004-069103934431

    26539003-23-2004-330106946821

    26540103-23-2004-309476956881

    26541103-23-2004-194032966711

    26541003-23-2004-277633976811

    26542103-23-2004-210055987021

    26542003-23-2004-068553995251

    26543103-23-2004-0613841005241

    26543003-23-2004-3047421016361

    26544103-23-2004-2787291025491

    26544003-23-2004-2555311036881

    26545105-07-2004-1052111045221

    26545003-23-2004-3121221056401

    26546103-23-2004-2238951069251

    26547103-23-2004-2003731075391

    26547003-19-2004-1743371085931

    26548103-23-2004-1776761092991

    26548003-23-2004-3141361105951

    26549103-26-2004-3518931118851

    26550103-23-2004-1884911125271

    26550003-23-2004-3339701135451

    26551103-23-2004-2219211146331

    26551003-23-2004-2713781154391

    26551003-23-2004-0633521165191

    26552103-23-2004-2337731175841

    26552003-23-2004-2199881187021

    26552003-12-2004-15992311931

    26553102-23-2004-1348701205281

    26553003-23-2004-1810051214621

    26554103-23-2004-2345541227341

    26554003-23-2004-3193491234611

    26555103-23-2004-3063891246251

    26555003-15-2004-170099125111

    26555003-10-2004-1590551263751

    26556003-18-2004-17316412791

    26556003-23-2004-2234761284431

    26556002-03-2004-1109601292792

    26557103-23-2004-3145561305251

    26557003-23-2004-2712851317121

    26558103-23-2004-0633741326822

    26559103-23-2004-2766281336851

    26559003-10-2004-1590571341241

    26559003-23-2004-2131441355313

    26560103-23-2004-0533321365273

    26561103-23-2004-0313811378253

    26561003-23-2004-2536971385271

    26561003-23-2004-0267641395261

    26562103-23-2004-2589221405241

    26562003-23-2004-2448351416441

    26563103-23-2004-3290851427031

    26564103-23-2004-0423851436851

    26564003-23-2004-0509241445241

    26565103-23-2004-3230751457751

    26565003-23-2004-2133571465361

    26566103-23-2004-1856111476291

    26566003-23-2004-1786641484431

    26566003-23-2004-2017891495251

    26566003-23-2004-3140051504431

    26567103-23-2004-0568141516941

    26568103-23-2004-0634581526131

    26568003-23-2004-2205831537481

    26569103-23-2004-2086721547591

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    time so far(s)

  • ConclusionLarge Volume Working SetDuplicate Definition and Automatic evaluationFeature-based Duplicate Candidate RetrievalSimilarity-based ClusteringImproved Efficiency

  • Related Work - IDuplicate detection in other domains:databases [Bilenko and Mooney 2003]to find records referring to the same entity but possibly in different representationselectronic publishing [Brin et al. 1995] to detect plagiarism or to identify different versions of the same document. web search [Chowdhury et al. 2002] [Pugh 2004]more efficient web-crawlingeffective search results rankingeasy web documents archiving

  • Related Work - IIFingerprintinga compact description of a document, and then do pair-wise comparison of document fingerprintsShingling [Broder et al.]represents a document as a series of simple numeric encodings for an n-term windowretain every mth shingle to produce a document sketchsuper shinglesSelective fingerprinting [Heintze]selected a subset of the substrings to generate fingerprintsStatistical approach [Chowdhury et al.]n high idf terms Improved accuracy over shinglingEfficient: one-fifth of the time over shinglingFingerprints Reliability in dynamic environment [Conrad et al.]Consider time factor on the Web

  • ReferencesM. Bilenko and R. Mooney. Adaptive duplicate detection using learnable string similarity measures. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD-2003), Washington D.C., August 2003.S. Brin, J. Davis, and H. Garcia-Molina. Copy detection mechanisms for digital documents. In Proceedings of the Special Interest Group on Management of Data (SIGMOD 1995), pages 398409. ACM Press, May 1995. A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic clustering of the web. In Proceedings of WWW6 97, pages 391404. Elsevier Science, April 1997.J. Callan, E-Rulemaking testbed. http://hartford.lti.cs.cmu.edu/eRulemaking/Data/. 2004A. Chowdhury. O. Frieder, D. Grossman, and M. McCabe. Collection statistics for fast Duplicate document detection. In ACM Transactions on Information Systems (TOIS), Volume 20, Issue 2, 2002.J. Cohen. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37-46, 1960.J. Conrad, X. S. Guo, and C. P. Schriber. Online duplicate document detection: Signature reliability in a dynamic retrieval environment. In Proceedings of CIKM03, pages 443452. ACM Press, Nov. 2003.N. Heintze. Scalable document fingerprinting. In Proceedings of the Second USENIX electronic Commerce Workshop, pages 191200, Nov. 1996.W. Pugh. US Patent 6,658,423 http://www.cs.umd.edu/~pugh/google/Duplicates.pdf. 2003

  • Demohttp://hartford.lti.cs.cmu.edu/eRulemaking/Data/USEPA-OAR-2002-0056/

    Modifying an electronic form letter is extremely easy. Moreover, it can better represent their opinions or stylistic preferences

    When there is a disconnect between broad public opinion and legislative action, MoveOn builds electronic advocacy groups.Examples of such issues are campaign finance, environmental and energy issues, media consolidation.. Once a group is assembled, MoveOn provides information and tools to help each individual have the greatest possible impact.

    Many of the form letter-based comments are near-duplicates that are similar, but not exactly identical, to a (usually unknown) initial form letter, and to one another.

    Recognizing, organizing, and considering near-duplicates are significant challenges for the agencies because near-duplicates increase the likelihood of overlooking substantive information (information that by law the agency must consider) that an individual adds to a form letter.However, it is unsupportable to have fully human assessments on huge document collections such as public comments in eRulemaking. Moreover, this definition assumes that the original document can be identified easily.

    For example, excerpt means one document takes first section from another article and elaboration means one document adds one or more paragraphs to another article.

    However, in domains such as eRulemaking it may not be possible to tell by just looking at the documents which document is the source and which is the modified copy. In order to have a fully automatic assessment of duplicate and near-duplicate detection without human experts involved, we need to have a quantitative definition of near-duplicates. That is to say, our definition is a guideline for machines. In the eRulemaking domain, there are two broad categories of documents: Comments that are written more-or-less from scratch (unique), and comments that are based on a form letter (exact-duplicates and near-duplicates). The near-duplicate category can be further divided into subcategories (Table 1). Our subcategories are similar in spirit to those of Conrad, et al., but reflect typical patterns appearing in public comment datasets Our duplicate and near-duplicate detection system includes three modules: Text preprocessing, feature-based document retrieval and similarity-based document clustering (Figure 1).The text preprocessing module consists of information extraction (IE), binning by length, and seed document selection. The IE module identifies the email sender, the receiver, address, salutation and signature lines; docket IDs mentioned in the text; delivery dates and the email-relaying organizations identified in email headers. These features are used later in duplicate detection. Documents are then binned based on their content lengths (i.e., after removal of headers, address, salutation, and signature lines are removed). Binning insures that exact-duplicates are grouped together. Moreover, even before duplicate clustering starts, the system has a rough guess about what are the large duplicate sets. These sets are handled first, to improve efficiency. Finally, to identify the dominating document within a bin, the document that repeats most (i.e., has the most exact duplicates) is selected to be the initial seed in any bin whose size is greater than 100.In the feature-based document retrieval module, the system begins near-duplicate detection by going through the size-based bins in a round robin manner. On each pass a document is selected from a bin, and a feature-based query is generated based on fingerprint and metadata. The system then performs probabilistic document retrieval, using a Boolean query, to get candidate duplicates for the document. The similarity-based clustering module measures the similarity of each document in the candidate set to the current seed; similar documents are grouped together. The next section describes the algorithms for feature-based document retrieval and similarity-based clustering. during preprocessing; the retrieval system indexes them, and supports their use in queries. Since many of the near duplicates are based on form letters created by special interest groups (mass mailers) or are commercial advertising (spam), duplicate documents are likely to share the same features as the seed document.

    One of our test collections contains 536,975 public comments; the average size of candidate duplicate sets is 10,995. This reduction is extremely significant for the big clusters. However, for small clusters, especially for those only containing a single unique document, it is inefficient to retrieve and look through the candidate set. Therefore, after most of the big clusters have been found, the feature-based retrieval module is disabled based on the assumption that most of the remaining unclustered documents are unique. Only similarity-based clustering is used on them. during preprocessing; the retrieval system indexes them, and supports their use in queries. Since many of the near duplicates are based on form letters created by special interest groups (mass mailers) or are commercial advertising (spam), duplicate documents are likely to share the same features as the seed document.Note that this requires an extra step to first store the similarity scores above a relatively loose initial cut-off threshold

    It works fairly efficient for large clusters since it usually cuts the number of documents needed to be clustered from the size of entire dataset to a reasonable number. One of our test collections contains 536,975 public comments; the average size of candidate duplicate sets is 10,995. This reduction is extremely significant for the big clusters. However, for small clusters, especially for those only containing a single unique document, it is inefficient to retrieve and look through the candidate set. Therefore, after most of the big clusters have been found, the feature-based retrieval module is disabled based on the assumption that most of the remaining unclustered documents are unique. Only similarity-based clustering is used on them FIn the context of clustering, the quality of a clustering with respect to the gold standard is assessed in terms of precision (p) and recall (r). p and r compare each cluster i with each class j in the ground truth:pij = nij/ ni (5) rij = nij/ nj (6)where nij is the number of documents from class j in cluster i, ni is the number of documents in cluster i and nj is the number of documents in class j.The F-measure widely used in IR community tries to capture how well the groups of the investigated clustering at the best match the groups of the ground truth. The F-measure for a cluster i and class j is: Fij = 2rij / (p ij +r ij)(7)The F-measure of the whole clustering set is: F = (8)where Fj = maxi {Fij} , n is the total number of document.PurityIf the purpose of clustering is to classify clusters of documents rather than single documents, purity of each cluster tries to capture how well the groups of the first on average matches those of the ground truth. The weighted average purity over all clusters is a measure of quality of the whole clustering. It is defined as: = (9)where i = maxj{pij}, n is the total number of document and pij is precision of cluster i with reference to class j.Pair-wise MeasureOne may also define measures based on the distribution of pairs of texts. Pair measures reflect degrees of overlap and are thus not dependent on the direction of the comparison. Folkes and Mallows index [9] is one of such pair measures that have been used for cluster validation(10) where a is the number of pairs in the same group in the ground truth and in the clustering (agreement), b is the number of pairs in the same group in the ground truth but in different in the clustering (false negative), c is the number of pairs in different groups in the ground truth but in the same in the clustering (false positive), d is the number of pairs in different groups in the ground truth and in the clustering (agreement).KappaFinally, we suggest using the Cohens kappa statistics [7] to assess agreement between clustering results relative to two ground truth given by different human assessors. We consider the investigated clustering results as another assessor. The kappa coefficient is used to compare inter-assessor concordances and is defined is: = (11)where p(A) is the observed agreement between the two assessments. It can be calculated as (a+d)/m by using a, b, c and d given in Equation 10 and m=a+b+c+d. p(E) is the agreement expected by chance. It can be calculated as .Our idea is that if the concordance of the clustering results given by our system and one of the assessors are as good as or even better than that of two human assessors, our algorithm could be considered as effective. In computational linguistics > 0.67 has often been required to draw any conclusions of agreement at all [5]. However, there has been a number of objections to this standard and less strict evaluations consider 0.20 < < 0.40 indicating fair agreement and 0.40 < < 0.60 indicating moderate agreement.Early research on duplicate detection was done mostly in the areas of databases [1][16] and electronic publishing [2][11] . In database research, the goal of duplicate detection is to find records referring to the same entity but possibly in different representations. In electronic publishing, duplicate detection is used to detect plagiarism or to identify different versions of the same document. Most recently, duplicate detection is extensively studied for web search tasks, for example, to give more efficient web-crawling, effective search results ranking, and easy web documents archiving.A widespread duplicate detection technique is to generate a document fingerprint, which is a compact description of the document, and then to do pair-wise comparison of document fingerprints. The assumption is that fingerprints can be compared much more quickly than complete documents. A common method of generating fingerprints is to select a set of character sequences from a document, and to generate a fingerprint based on the hash values of these sequences. Similarity between two documents is measured by counting the number of common sequences in the complete fingerprints. Different algorithms are characterized, and their computational costs are determined, by the hash function and how substrings are selected.

    The dominant approach in duplicate detection is shingling, which was proposed by Broder, et al. [1]. It represents a document as a series of simple numeric encodings for an n-term window. Filtering techniques were used to retain every mth shingle to produce a document sketch. The authors also presented a super-shingle technique that creates meta-sketches to further reduce the computational complexity. Pairs of documents that have a high shingle match coefficient are asserted to be near-duplicates.

    Heintze studied selective document fingerprinting that scales to large environments and that can identify similarities between documents which have as little as 5% or less in common [10].To reduce the size of the fingerprints, he selected a subset of the substrings to generate them. He demonstrated that selective fingerprints normally perform within a factor of two of complete fingerprints.

    Pugh, from Google, claimed that duplicate and near-duplicate detection techniques may assign a number of fingerprints to a given document by (i) extracting parts from the document, (ii) assigning the extracted parts to one or more of a predetermined number of lists, and (iii) generating a fingerprint from each of the populated lists. Two documents may be considered to be near-duplicates if any one of their fingerprints matches [13].

    Recent duplicate detection research in the Web environment has focused on issues of computational efficiency and detection effectiveness while relying on collection statistics to consistently recognize document replicas in full-text collections [6][14]. Since a terms inverse document frequency (idf) is a measurement of a words rareness across a given collection, its value measures how well a term discriminates one document from others. Chowdhury, et al., selected terms best characterizing a document, i.e., n high idf words. Moreover, a term is not considered unless it appears across the collections at least five times to filter out possible misspelling tokens and typos. They claimed that in addition to improving accuracy over shingling, it executes in one-fifth of the time. They showed that a statistically-based approach (i) is more efficient than a fingerprint-based approach, (ii) scales in the number of documents and (iii) works well for documents of diverse sizes.