
CS6604 Digital Libraries
Global Events Team Final Presentation

Presenters: Liuqing Li, Islam Harb, Andrej Galad

{liuqing, iharb, agalad}@vt.edu

Instructor: Dr. Edward A. Fox

Virginia Polytechnic Institute and State University
Blacksburg, VA, 24061

April 27, 2017

Outline

• Background
• Implementation
  • Data Collection
  • Data Processing
  • Data Visualization
• Future Work
• Acknowledgement

Background

• GETAR*
  • Global Event and Trend Archive Research
  • Architecture

* Edward A. Fox, Donald Shoemaker, Chandan Reddy, Andrea Kavanaugh, III: Small: Collaborative Research: Global Event and Trend Archive Research (GETAR), NSF grant IIS-1619028, 2017-2019. http://eventsarchive.org

Implementation – Architecture

[Architecture diagram. Stages: Data Collection, Data Processing, Data Visualization. Components: Event Focused Crawler (EFC), WARC Files, CDX Writer, CDX Files, ArchiveSpark, Apache Spark, Stanford NER, Regular Expression, Score Function, Entity-based Results, Standalone HBase, Web Application.]

Events of Interest

School Shooting Events                     Year
Virginia Tech Shooting                     2007
Northern Illinois University Shooting      2008
Dunbar High School Shooting                2009
University of Alabama Shooting             2010
Worthing High School Shooting              2011
Sandy Hook Elementary School Shooting      2012
Sparks Middle School Shooting              2013
Reynolds High School Shooting              2014
Umpqua Community College Shooting          2015
Townville Elementary School Shooting       2016

Focused Crawler – Collecting / Archiving

[Flowchart. Nodes: START; Manually Curate Seeds; URLs Queue; Download Page; Process Page & Convert into WARC Format; Extract URLs; Calculate Relevancy; Relevant? (Yes: Append Result to the warc.gz Event File; No: Discard); All URLs? (Yes: END; No: continue with the URLs Queue).]
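The crawl loop in the flowchart can be sketched roughly as follows. This is a minimal illustration and not the team's crawler: the keyword-overlap `relevancy` score, the `threshold` value, the `fetch` callback, and the choice to follow links only from relevant pages are all assumptions made for the example. A real crawler would append a WARC record to the event's warc.gz file where the sketch only records the URL.

```python
import re
from collections import deque

# Naive link extractor for the sketch; a real crawler would use an HTML parser.
LINK_RE = re.compile(r'href="(http[^"]+)"')


def relevancy(text, keywords):
    """Hypothetical score: fraction of event keywords found in the page text."""
    lowered = text.lower()
    return sum(1 for k in keywords if k.lower() in lowered) / len(keywords)


def focused_crawl(seeds, fetch, keywords, threshold=0.5, max_pages=100):
    """Breadth-first crawl that keeps relevant pages and discards the rest."""
    queue = deque(seeds)          # URLs Queue, initialised with curated seeds
    seen = set(seeds)
    archived = []
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        page = fetch(url)         # Download Page (returns HTML text or None)
        fetched += 1
        if page is None:
            continue
        links = LINK_RE.findall(page)                  # Extract URLs
        if relevancy(page, keywords) >= threshold:     # Relevant?
            archived.append(url)  # stand-in for appending to the event WARC
            for link in links:
                if link not in seen:
                    seen.add(link)
                    queue.append(link)
        # otherwise the page is discarded and its links are not followed
    return archived


# Tiny in-memory "web" so the sketch runs without network access.
pages = {
    "http://a": 'Virginia Tech shooting <a href="http://b">b</a> <a href="http://c">c</a>',
    "http://b": 'Shooting at Virginia Tech campus <a href="http://a">a</a>',
    "http://c": 'Cooking recipes <a href="http://d">d</a>',
}
archived = focused_crawl(["http://a"], pages.get, ["virginia tech", "shooting"])
print(archived)
```

Passing `fetch` as a parameter keeps the loop independent of any particular download tool, so the same logic could sit on top of Wget, Wpull, or a plain HTTP client.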

WARC Libraries

• Wget (version 1.14 or later)
• Wpull
• WARCIO: WARC (and ARC) Streaming Library
  • Python 2.7+ and 3.3+
  • Post-Processing: Read/Write WARC format

Ten Events Collections

• Naming Convention
  • [location]_[year].warc.gz

Tools for Data Processing

• ArchiveSpark
  • Apache Spark framework for Web Archives
  • Easy data extraction
  • Input: WARC and CDX files
• CDX Writer
  • Python script to create CDX files of WARC files
  • Format: CDX N b a m s k r M S V g
  • e.g., edu,vt,cnre)/ 20170422005601 http://cnre.vt.edu text/html 200 BT3ILJXROIILHBKQPNYDUCUVZRDKG3OA - - 947820104749 data/Virginia-Tech-Shooting_20070416.warc.gz
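To make the CDX format string concrete, here is a small sketch that maps the field letters of the header `CDX N b a m s k r M S V g` to names and splits one index line into those fields. The descriptive field names are chosen for the example, the sample line is entirely made up (its digest, size, offset, and file name are placeholders), and the sketch assumes fields are separated by single spaces with no spaces inside a field.

```python
# Field letters from the CDX header, with descriptive names chosen for this sketch.
FIELD_NAMES = {
    "N": "massaged_url",      # canonicalised (SURT-style) URL key
    "b": "timestamp",         # capture date, YYYYMMDDhhmmss
    "a": "original_url",
    "m": "mime_type",
    "s": "status_code",
    "k": "digest",            # new-style checksum
    "r": "redirect",
    "M": "meta_tags",
    "S": "compressed_size",   # compressed record size
    "V": "offset",            # compressed file offset
    "g": "filename",          # WARC file that holds the record
}


def parse_cdx_header(header):
    """Return the list of field letters from a header such as ' CDX N b a ...'."""
    parts = header.split()
    if not parts or parts[0] != "CDX":
        raise ValueError("not a CDX header: %r" % header)
    return parts[1:]


def parse_cdx_line(line, letters):
    """Split one index line into a dict keyed by descriptive field name."""
    values = line.split()
    if len(values) != len(letters):
        raise ValueError("expected %d fields, got %d" % (len(letters), len(values)))
    return {FIELD_NAMES[letter]: value for letter, value in zip(letters, values)}


letters = parse_cdx_header(" CDX N b a m s k r M S V g")
# Made-up record for illustration only.
sample = ("com,example)/ 20170422005601 http://example.com/ text/html 200 "
          "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA - - 1234 5678 data/example.warc.gz")
record = parse_cdx_line(sample, letters)
print(record["timestamp"], record["filename"], record["offset"])
```

The `offset` and `filename` fields are what let a tool such as ArchiveSpark jump straight to a record inside the WARC file instead of scanning it from the start.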

Data Preprocessing

• Webpage Cleaning
  • Extract Raw Text
    • payload.string.html.body.text
  • Remove jQuery & JavaScript
    • {WPGroHo.syncProfileData(hash,id);}, …
  • Remove tags
    • <br>, <p>, …
  • Remove markers
    • *, |, +, …
  • Remove stop words
    • a, about, the, …
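The cleaning steps listed above could look roughly like this in plain Python. It is only a sketch: the regular expressions are simple approximations written for this example, and the stop-word set is a tiny illustrative subset rather than the list the team used.

```python
import re

# Tiny illustrative stop-word list; a real list would be much longer.
STOP_WORDS = {"a", "about", "the", "of", "and", "in", "to"}

# <script>...</script> and <style>...</style> blocks, including their content.
SCRIPT_RE = re.compile(r"<(script|style)\b.*?</\1\s*>", re.IGNORECASE | re.DOTALL)
# Leftover inline calls such as {WPGroHo.syncProfileData(hash,id);}
INLINE_JS_RE = re.compile(r"\{[^{}]*\([^{}]*\);?\s*\}")
# Any remaining tag, e.g. <br> or <p>
TAG_RE = re.compile(r"<[^>]+>")
# Marker characters named on the slide
MARKER_RE = re.compile(r"[*|+]")


def clean_page(html):
    """Apply the cleaning steps in order and return space-joined tokens."""
    text = SCRIPT_RE.sub(" ", html)
    text = INLINE_JS_RE.sub(" ", text)
    text = TAG_RE.sub(" ", text)
    text = MARKER_RE.sub(" ", text)
    tokens = [t for t in text.split() if t.lower() not in STOP_WORDS]
    return " ".join(tokens)


sample = ("<p>The shooting * happened | about noon<br></p>"
          "<script>var x = 1;</script>{WPGroHo.syncProfileData(hash,id);}")
cleaned = clean_page(sample)
print(cleaned)
```

Scripts are removed before tags so that their contents do not survive as ordinary text once the surrounding `<script>` tags are stripped.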

Data Processing

• Entity Extraction
  • Basic Parsing
    • event name and date
  • Stanford NER (Integrated model)
    • entities, shooter name
  • Regular Expression
    • event date
    • shooter name and age
    • number of victims
    • weapon list
• Score Function
  • tf * df
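A rough sketch of the regular-expression step is shown below. The patterns, the `WEAPONS` vocabulary, and the sample sentence are assumptions made for illustration; the slides do not give the actual expressions. Shooter-name extraction is left out because it depends on the NER output.

```python
import re

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

# Dates written like "April 16, 2007"
DATE_RE = re.compile(r"\b(" + "|".join(MONTHS) + r")\s+(\d{1,2}),\s+(\d{4})\b")
# Ages written like "23-year-old"
AGE_RE = re.compile(r"\b\d{1,3}-year-old\b")
# Counts written like "32 victims"
VICTIMS_RE = re.compile(r"\b\d+\s+victims\b", re.IGNORECASE)
# Hypothetical weapon vocabulary for the example
WEAPONS = ("handgun", "pistol", "rifle", "shotgun")


def extract_features(text):
    """Return the event features that the patterns can find in the text."""
    features = {}
    date = DATE_RE.search(text)
    if date:
        month = MONTHS.index(date.group(1)) + 1
        features["date"] = "%s%02d%02d" % (date.group(3), month, int(date.group(2)))
    age = AGE_RE.search(text)
    if age:
        features["shooter_age"] = age.group(0)
    victims = VICTIMS_RE.search(text)
    if victims:
        features["shooting_victims"] = victims.group(0)
    features["weapons"] = [
        w for w in WEAPONS if re.search(r"\b" + w + r"s?\b", text, re.IGNORECASE)
    ]
    return features


text = ("On April 16, 2007, a 23-year-old gunman armed with two handguns "
        "killed 32 victims.")
features = extract_features(text)
print(features)
```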

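One possible reading of the tf * df score function is sketched here: tf is taken as the total number of times an entity occurs across all pages of an event, and df as the number of pages that mention it. That interpretation and the toy token lists are assumptions; the slides only give the formula.

```python
from collections import Counter


def score_entities(docs):
    """Score each entity as (total occurrences) * (number of docs containing it)."""
    tf = Counter()
    df = Counter()
    for tokens in docs:
        tf.update(tokens)        # every occurrence counts toward tf
        df.update(set(tokens))   # each document counts once toward df
    return {term: tf[term] * df[term] for term in tf}


def top_entities(docs, n=10):
    """Return the n highest-scoring (entity, score) pairs."""
    scores = score_entities(docs)
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:n]


# Toy input: one token list per page.
docs = [
    ["Virginia", "Tech", "Virginia"],
    ["Virginia", "VA"],
    ["Tech"],
]
ranked = top_entities(docs)
print(ranked)
```

Unlike tf-idf, multiplying by df rewards entities that appear on many pages, which suits the goal of finding terms that are representative of the whole event collection.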

HBase

• Built-in ImportTsv Utility
  • Import Data into HBase

Table Name       globalevents
Row_Key          Event_Date + Event Hash Value    (e.g., 20070416217787922)
Column Family    event

Column                     Example Value
event:name                 Virginia Tech Shooting
event:date                 20070416
event:shooter_age          23-year-old
event:shooting_victims     32 victims
event:entities             Virginia;Tech;VA;University;…
event:entities_count       146900;62415;13940;7732;…
event:entities_url         url1,url2,url3,url4,url6;url2,url3,url4,url5;url1,url3,url4,url5,url6;…
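The sketch below shows one way a row for this table could be serialised as a TSV line for ImportTsv. The row key is the event date followed by a hash of the event name; the slides do not say which hash function was used, so `zlib.crc32` is a stand-in and the resulting key will not match the example key in the table. The sample event values are shortened placeholders.

```python
import zlib

# Column qualifiers of the "event" family, in the order they are written.
QUALIFIERS = ["name", "date", "shooter_age", "shooting_victims",
              "entities", "entities_count", "entities_url"]


def make_row_key(event_date, event_name):
    """Event date followed by a hash of the name (CRC32 is an assumption)."""
    return "%s%d" % (event_date, zlib.crc32(event_name.encode("utf-8")))


def to_tsv_line(event):
    """Serialise one event as: row key, then one cell per qualifier."""
    key = make_row_key(event["date"], event["name"])
    cells = [
        str(event.get(q, "")).replace("\t", " ").replace("\n", " ")
        for q in QUALIFIERS
    ]
    return "\t".join([key] + cells)


def importtsv_columns():
    """Value for ImportTsv's -Dimporttsv.columns option."""
    return "HBASE_ROW_KEY," + ",".join("event:" + q for q in QUALIFIERS)


event = {
    "name": "Virginia Tech Shooting",
    "date": "20070416",
    "shooter_age": "23-year-old",
    "shooting_victims": "32 victims",
    "entities": "Virginia;Tech;VA",
    "entities_count": "3;2;1",
    "entities_url": "url1,url2;url2;url1",
}
line = to_tsv_line(event)
print(importtsv_columns())
print(line)
```

The generated file would then be loaded with a command of the form `hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=<columns> globalevents <input path>`, where `<columns>` is the string printed above and `<input path>` is wherever the TSV file is stored.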

Data Processing – Demo

• Key Stages
  • Initialization
    • Create Spark Session
    • Create NLP Core
    • Create Storage
  • Processing
    • Extract Event Name/Date/URL
    • Extract Named Entities
    • Extract Other Event Features
  • Export and Import
    • Generate TSV file
    • Import TSV file into HBase

Global Events Viewer

• Efficient visualization of long-term global events
  • Show representative terms -> link to corresponding URLs
  • Visualize events’ trends over time (time series)
• Java 7 Spring Boot Web application
  • Build system - Gradle
  • Embedded Tomcat Web server
  • Backend - HBase, in-memory
  • Frontend - D3.js, Bootstrap

https://github.com/dedocibula/global-events-viewer

Global Events Viewer – Demo

• Key Components
  • Word Cloud, Range Selection, URL List, Trends
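To show how the word cloud and URL list could be fed from the stored data, the sketch below decodes the three delimited HBase columns (entities, entities_count, entities_url) into one record per term. The viewer itself is a Java application, so this Python version is purely illustrative; the function name and the short sample strings are made up for the example.

```python
def decode_entities(entities, counts, urls):
    """Combine the ';'-separated columns into one record per term.

    entities: "term1;term2;..."
    counts:   "n1;n2;..."
    urls:     "urlA,urlB;urlC;..." (comma-separated URLs per term)
    """
    names = entities.split(";")
    numbers = [int(c) for c in counts.split(";")]
    url_lists = [group.split(",") for group in urls.split(";")]
    if not (len(names) == len(numbers) == len(url_lists)):
        raise ValueError("columns have different numbers of entries")
    return [
        {"term": name, "count": number, "urls": url_list}
        for name, number, url_list in zip(names, numbers, url_lists)
    ]


records = decode_entities(
    "Virginia;Tech;VA",
    "146900;62415;13940",
    "url1,url2;url2,url3;url1",
)
print(records[0])
```

Each record carries what a word cloud entry needs: the term, a weight for its size, and the URLs to list when the term is selected.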

Problems Faced

Data Collection
• Encoding problems (UTF-8, ASCII and others)
• Get more relevant seeds for old events

Data Processing
• Lack of documentation (ArchiveSpark)
• Version conflict (CDX Writer, Kernel in Jupyter)
• JVM issue (Spark)

Data Visualization
• Spring Boot
• IntelliJ setup
• jQuery UI
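For the encoding problems noted under Data Collection, one common workaround is to try a short list of encodings in order and fall back to one that never fails. The sketch below illustrates that idea; the encoding list and the function name are assumptions, not the approach the team took.

```python
def decode_payload(raw, encodings=("utf-8", "cp1252", "latin-1")):
    """Decode bytes with the first encoding that works.

    latin-1 maps every byte to a character, so it acts as a last resort
    that cannot raise; the result may still be wrong for other encodings.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Only reached if the caller passes a list without a catch-all encoding.
    return raw.decode("utf-8", errors="replace"), "utf-8 (lossy)"


print(decode_payload("café".encode("utf-8")))
print(decode_payload(b"caf\xe9"))
```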

Lessons Learned

Data Collection
• WARCIO
• Focused Crawler

Data Processing
• ArchiveSpark
• Spark & Scala (Map/Reduce Process)

Data Visualization
• D3 Word Cloud
• D3 Dynamic Line Charts

Future Work

Data Collection
• Wayback Machine
• Automatic Routine for Focused Crawler
• Event Extension (Sources, Time, Space)

Data Processing
• Standalone Mode -> Cluster Mode
• Named Entity Recognizer
• Automatic Processing (CDX Writer and HBase)

Data Visualization
• Localization – Datamaps
• Weapons

Acknowledgement

Projects
• NSF IIS-1319578 III: Small: Integrated Digital Event Archiving and Library (IDEAL)
• NSF IIS-1619028 III: Small: Collaborative Research: Global Event and Trend Archive Research (GETAR)

Organizations
• Internet Archive
• L3S Research Center

Persons
• Instructor: Dr. Edward A. Fox
• Alumnus: Dr. Mohamed Magdy Farag
• Labmates: Prashant Chandrasekar, Xuan Zhang

Thank you!

Questions?
