An Extensible Framework for Creating Personal Web Archives of Content Behind Authentication

  • Published on
    23-Feb-2016

DESCRIPTION

An Extensible Framework for Creating Personal Web Archives of Content Behind Authentication. Mat Kelly. Background: Internet Archive crawls and preserves webpages, creating web archives. Only public sites are preserved. Problems: A lot of content on the web is not preserved.

Transcript

Slide 1

An Extensible Framework for Creating Personal Web Archives of Content Behind Authentication
Mat Kelly
Director: Michele C. Weigle
Committee: Michael L. Nelson, Yaohang Li
8/3/2012
MS Thesis - August 2012

Background
  • Internet Archive crawls and preserves webpages, creating web archives
  • Only public sites are preserved

  • Maintaining a record of the web is preserving digital heritage
  • Internet Archive crawls and preserves webpages, creating web archives
  • Preserved pages are replayable at archive.org
  • Only publicly accessible sites are preserved

Problems
  • A lot of content on the web is not preserved, e.g., social media content
  • As more people document their lives on social media, the importance of preserving it becomes greater
  • Content not preserved = heritage lost

Problems: Unsuitability of Institutional Tools
  • Overhead and learning curve is steep
  • Institutional tools meant for larger scale
  • Works well if you expend the energy to learn

Problems: Complete Lack of Preservation

State of the Art in Personal Web Archiving
  • Personal web archiving tools
      • Break when target site's hierarchy changes
      • Produce sub-optimal archives
  • Some conventional web archiving practices not easily translatable to personal web archiving

Goals of Thesis
  • Show social media content can be preserved
      • With output more optimal than current offerings
  • Remedy the tools-breaking problem
      • Remotely specify target sites' hierarchies
      • Show spec is easily adaptable to tools
  • Identify and consider solutions to domain-specific nuances
  • Establish section commonality between social media websites

Extent of the Unpreserved
  • Internet Archive (IA) captured only the public web
  • Crawlers miss content behind authentication
  • Quantity of content behind auth > public web
  • Large amount of content is not preserved

Ways to Capture Missing Content: Supply crawler with auth credentials
  • Unsuitable for institutional crawlers
  • Other personal web archiving problems remain

Ways to Capture Missing Content: "Save As" Desired Pages
  • Misses metadata
  • Doesn't produce interoperable output
  • Lose look & feel
  • Difficult capturing all content desired
  • Frequently sub-optimal output format

Ways to Capture Missing Content: Utilize Fetching Tools

Tools Utilized In Thesis: Archive Facebook
  • Firefox add-on
  • Creates navigable web archives
  • Outputs files w/ original file type
  • Sequential archiving

Tools Utilized In Thesis: WARCreate
  • Google Chrome extension
  • Creates Wayback-compatible Web ARChive (WARC) files
  • Allows page manipulation prior to generating archive

Integration with Other Tools
  • Wayback (WARC replay system)
      • Allows WARCreate output to be re-experienced
      • Provides content for Memento
  • Memento
      • Allows temporal traversal of archived pages
      • TimeGate serves as relay only to local Wayback instance
  • XAMPP (Client-Side Server Suite)
      • Overcome JavaScript inadequacies
      • Provide foundation for replay system

Institutional vs. Personal Web Archiving

[Figure sequence built up over several slides. Labels shown: Crawls, WWW, WARC, outputs, Indexes, Publicly viewable, Archive replay. The later slides in the sequence show only WARC and Indexes.]

Problems Specific to Personal Web Archiving
  • Personalization/Authentication
      • Different users, facebook.com, different content
  • Context
      • Different browsing tools, different site experience
  • Output Format
      • Ad hoc approaches are often used that lose metadata, context, content, etc.

Personalization/Authentication
  • Two users, same URI, vastly different content
  • One user, same URI, authentication vs. no authentication, different content
      • As shown in IA's archive of FB

Context
  • Same URI + different devices = different content served
  • Mobile vs. PC
  • Firefox vs. Chrome
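The personalization and context problems above come down to one point: the representation served for a URI depends on request headers, not on the URI alone. The following is a minimal sketch of that idea, not taken from the thesis; the `render` function, the `c_user` cookie check, and the "Mobile" User-Agent test are illustrative assumptions.

```python
def render(uri, headers):
    """Toy server (hypothetical): the response depends on the request
    headers as well as the URI."""
    # Personalization: a session cookie identifies the user (assumed cookie name).
    cookie = headers.get("Cookie", "")
    user = cookie.partition("c_user=")[2].split(";")[0]
    # Context: the User-Agent selects a device-specific layout (assumed rule).
    mobile = "Mobile" in headers.get("User-Agent", "")

    who = f"feed for user {user}" if user else "login page"
    layout = "mobile layout" if mobile else "desktop layout"
    return f"{uri}: {who}, {layout}"


uri = "http://www.facebook.com/"
anonymous = render(uri, {})
logged_in_mobile = render(uri, {"Cookie": "c_user=42; p=1",
                                "User-Agent": "ExampleBrowser Mobile"})
```

An archive that records only the URI and the returned HTML cannot tell these two captures apart, which is why the inputs (headers) matter for replay.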

Output Format
  • Shows result from AFB / "save webpage as"
  • Files chaotically named

Output Format
  • Saving only HTML is not enough
  • Local references need manipulation
  • Browser alone is insufficient replay system
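To illustrate what "local references need manipulation" involves, here is a small sketch that rewrites absolute `src`/`href` URLs in saved HTML to local file paths and records the mapping. It is an assumption-laden illustration, not how Archive Facebook or any browser's "Save As" works: the `localize` function and the `files/` directory are hypothetical, only double-quoted attributes are handled, and file name collisions are ignored.

```python
import posixpath
import re
from urllib.parse import urlparse


def localize(html, local_dir="files"):
    """Rewrite absolute src/href URLs to paths under local_dir.

    Returns the rewritten HTML and a {original_url: local_path} mapping.
    Simplification: two URLs with the same file name would collide.
    """
    mapping = {}

    def repl(match):
        attr, url = match.group(1), match.group(2)
        name = posixpath.basename(urlparse(url).path) or "index.html"
        local = f"{local_dir}/{name}"
        mapping[url] = local
        return f'{attr}="{local}"'

    rewritten = re.sub(r'\b(src|href)="(https?://[^"]+)"', repl, html)
    return rewritten, mapping


page = '<img src="http://cdn.example.org/a/pic.jpg"><a href="/relative">x</a>'
rewritten, mapping = localize(page)
```

Note that the original URLs survive only in the side mapping; once the HTML is rewritten, the saved page itself no longer says where its resources came from.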

  • Misses HTTP headers
      • Request & Response
      • e.g., Auth
  • If headers are included, inputs for personalization can be viewed

REQUEST
    GET / HTTP/1.1
    Host: www.facebook.com
    User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:14.0) Gecko/20100101 Firefox/14.0.1
    Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
    Accept-Language: en-us,en;q=0.5
    Accept-Encoding: gzip, deflate
    Connection: keep-alive
    Cookie: datr=KMo6T3jicPEdEl4pY2yFnr6F; lu=TgU4dhoSBG0ZmEnThtLeyqIA; c_user=100003509861423; fr=0KMqEWNPPgver2SIx.AWXf-6Ww_7iQFPPP9sFtiiMPaV0; s=Aa4dL41H8UGZ-4Lf.BQGryl; xs=1%3Am7APtmN9-ev4Vg%3A0%3A1343929509; act=1343929622029%2F3%3A2; p=1; presence=EM343929627EuserFA21B03509861423A2EstateFDsb2F0Et2F_5b_5dElm2FnullEuct2F1343929017BEtrFnullEtwF3302582290EatF1343929627063EutF0EsndF1EnotF0CEchFDp_5f1B03509861423F1CC

RESPONSE
    HTTP/1.1 200 OK
    Cache-Control: private, no-cache, no-store, must-revalidate
    Expires: Sat, 01 Jan 2000 00:00:00 GMT
    P3P: CP="Facebook does not have a P3P policy. Learn why here: http://fb.me/p3p"
    Pragma: no-cache
    X-Content-Type-Options: nosniff
    x-frame-options: DENY
    X-XSS-Protection: 1; mode=block
    Content-Encoding: gzip
    Content-Type: text/html; charset=utf-8
    X-FB-Debug: uMXm8343NOn0OOIeDna2teVECApUiEqj6s7GTwNx+Ss=
    Date: Thu, 02 Aug 2012 19:26:12 GMT
    Transfer-Encoding: chunked
    Connection: keep-alive

NOT CAPTURED BY BACKUP TOOLS/METHODS

  • State of resource is subject to inputs
  • Browser sends but usually hides headers; they are not caught on capture by AFB
  • REQUEST headers are important: they allow overcoming context/personalization issues
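The WARC format can keep the request headers alongside the response by storing them in a record of type `request`. The sketch below assembles one such record by hand to show where the headers end up; it is not WARCreate's code. The function name is hypothetical, and only the basic WARC/1.0 named fields are written.

```python
import uuid
from datetime import datetime, timezone


def warc_request_record(target_uri, raw_http_request, when=None):
    """Build one WARC/1.0 'request' record (bytes) wrapping a raw HTTP request.

    Illustrative only: a real tool would also write warcinfo and response
    records and link them to each other.
    """
    when = when or datetime.now(timezone.utc)
    block = raw_http_request.encode("utf-8")
    header_lines = [
        "WARC/1.0",
        "WARC-Type: request",
        f"WARC-Target-URI: {target_uri}",
        f"WARC-Date: {when.strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        "Content-Type: application/http; msgtype=request",
        f"Content-Length: {len(block)}",
    ]
    head = "\r\n".join(header_lines).encode("utf-8") + b"\r\n\r\n"
    # A record ends with two CRLFs after the content block.
    return head + block + b"\r\n\r\n"


raw = "GET / HTTP/1.1\r\nHost: www.facebook.com\r\nCookie: c_user=1\r\n\r\n"
record = warc_request_record(
    "http://www.facebook.com/", raw,
    when=datetime(2012, 8, 2, 19, 26, 12, tzinfo=timezone.utc),
)
```

Because the cookie and User-Agent lines are stored verbatim in the record block, the inputs that produced a personalized page remain inspectable later.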

Specification and OOP
  • Sites' hierarchies resemble OOP concepts (polymorphism, inheritance)
  • Sites' sections can be represented as classes
  • Classes converted to XML specification
  • Personal web archiving tools utilize this specification to become adaptive

Commonality of Sections Between Social Media Websites

Abstracted media type, followed by the corresponding section names on individual sites:
  • personal stream: wall, posts, my tweets
  • global stream: news feed, streams, followees' tweets
  • multimedia - photos: photos, photos
  • multimedia - videos: videos, videos
  • photo collection: albums
  • posts: notes
  • friends: friends, circles

Example: Facebook Section Objects

    SocialMediaWebsite facebook = new SocialMediaWebsite(
        homepage => "http://www.facebook.com"
    )
    facebook->decorate([
        new SocialMediaWebsiteSectionPersonalStream(
            name => "Wall",
            url => "http://www.facebook.com/profile.php?sk=wall",
            preprocessor => new SocialMediaScrollPreprocessor(
                timeBetweenFirings => 0,
                maxFirings => 0,
                conditionBeforeSubsequentFirings => null
            )
        ),
        new SocialMediaWebsiteSectionUserInfo(
            name => "Info",
            url => "http://www.facebook.com/profile.php?sk=info"
        ),
        new SocialMediaWebsiteSectionMultimediaCollection(
            name => "Photos",
            url => "http://www.facebook.com/profile.php?sk=photos",
            preprocessor => new SocialMediaScrollPreprocessor(
                timeBetweenFirings => 0,
                maxFirings => 0,
                conditionBeforeSubsequentFirings => null
            )
        ),
        ...
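As a rough illustration of "classes converted to XML specification", the sketch below models sections as Python classes and serializes them with the standard library. The class names are shortened from the slide's pseudocode, and the XML element and attribute names (`site`, `section`, `type`, `scrollPreprocessor`, and so on) are assumptions; the thesis's actual schema is not shown in these slides.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScrollPreprocessor:
    time_between_firings: int = 0
    max_firings: int = 0


@dataclass
class Section:
    kind = "section"  # class-level tag, overridden by subclasses
    name: str
    url: str
    preprocessor: Optional[ScrollPreprocessor] = None


class PersonalStream(Section):
    kind = "personalStream"


class MultimediaCollection(Section):
    kind = "multimediaCollection"


def to_xml(homepage, sections):
    """Serialize section objects to a (hypothetical) site spec document."""
    root = ET.Element("site", homepage=homepage)
    for s in sections:
        el = ET.SubElement(root, "section", type=s.kind)
        ET.SubElement(el, "name").text = s.name
        ET.SubElement(el, "url").text = s.url
        if s.preprocessor is not None:
            pre = ET.SubElement(el, "scrollPreprocessor")
            ET.SubElement(pre, "timeBetweenFirings").text = str(s.preprocessor.time_between_firings)
            ET.SubElement(pre, "maxFirings").text = str(s.preprocessor.max_firings)
    return ET.tostring(root, encoding="unicode")


spec_xml = to_xml("http://www.facebook.com", [
    PersonalStream("Wall", "http://www.facebook.com/profile.php?sk=wall",
                   ScrollPreprocessor()),
    MultimediaCollection("Photos", "http://www.facebook.com/profile.php?sk=photos"),
])
```

Subclassing keeps the per-site differences in data (name, URL, preprocessor settings) while the section type carries the shared behavior, which is the inheritance analogy the slides draw.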

Example: Hierarchical Similarities

[Slide repeats the section-object listing from "Example: Facebook Section Objects".]

Spec Retrieval Process
  • Tool accesses root spec w/ URI parameter
  • Spec returns with reference to site-specific hierarchy spec
  • Tool fetches site spec
  • Updated site hierarchy returned

[Figure labels: Root Spec; (spec)/facebook.xml; Site Spec]

Concrete Usage: Tool Adaptation
  • Archive Facebook
      • Map current URIs to remotely fetched URIs
      • Perform pre-processing defined in FB spec
  • WARCreate
      • Implement sequential/cohesive archiving
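The two-step spec retrieval described above can be sketched as follows. To stay runnable without a network, HTTP is replaced by an in-memory dictionary; the host `spec.example.org`, the `?uri=` query form, and the XML element names are all hypothetical stand-ins, since the slides do not give the real endpoints or schema.

```python
import xml.etree.ElementTree as ET

# Stand-in for the web: URI -> response body (all URIs here are made up).
FAKE_WEB = {
    "http://spec.example.org/root?uri=http://www.facebook.com":
        '<root><siteSpec href="http://spec.example.org/facebook.xml"/></root>',
    "http://spec.example.org/facebook.xml":
        "<site>"
        "<section><name>Wall</name>"
        "<url>http://www.facebook.com/profile.php?sk=wall</url></section>"
        "</site>",
}


def fetch(uri):
    """Pretend HTTP GET backed by FAKE_WEB."""
    return FAKE_WEB[uri]


def retrieve_hierarchy(root_spec_uri, site_uri, fetch=fetch):
    # Step 1: ask the root spec which site-specific spec applies to this site.
    root = ET.fromstring(fetch(f"{root_spec_uri}?uri={site_uri}"))
    site_spec_uri = root.find("siteSpec").get("href")
    # Step 2: fetch the site spec and read the current section hierarchy.
    site = ET.fromstring(fetch(site_spec_uri))
    return {s.findtext("name"): s.findtext("url") for s in site.iter("section")}


hierarchy = retrieve_hierarchy("http://spec.example.org/root",
                               "http://www.facebook.com")
```

A tool that reads its section URIs from `hierarchy` at run time, rather than hard-coding them, picks up hierarchy changes as soon as the remote spec is updated.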

Evaluation 1: Tool Adaptability
  • Set up synthetic social media website
  • Define site's remote spec
  • Change AFB to preserve synthetic site
  • Change hierarchy of synthetic site
  • Show AFB breaking
  • Change synthetic site spec
  • Show AFB functionality restored

Evaluation 1: Tool Adaptability, Step 1: Synthetic Site Creation
  • Simple hierarchy for base case testing
  • Requires auth
  • Utilizes CDN
  • Can be manipulated
  • Recursive sections
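The logic of this evaluation can be mimicked with a toy model: a tool with fixed paths breaks when the site hierarchy changes, while a tool that reads paths from an updatable spec recovers once the spec is changed. This is only a simulation of the idea, not the thesis's experiment; the `archive` function and the `/photos/albums` path are invented for illustration.

```python
def archive(site_pages, paths):
    """Fetch each section by path; raises KeyError if a path no longer exists."""
    return {name: site_pages[path] for name, path in paths.items()}


# Synthetic site: path -> content.
site_pages = {"/personal": "stream v1", "/albums": "albums v1"}

# The remote spec and a tool's frozen copy of it start out identical.
spec = {"Personal Stream": "/personal", "Photo Albums": "/albums"}
hardcoded = dict(spec)

# Change the hierarchy of the synthetic site (hypothetical new path).
site_pages["/photos/albums"] = site_pages.pop("/albums")

# The tool with fixed paths now breaks.
try:
    archive(site_pages, hardcoded)
    broke = False
except KeyError:
    broke = True

# Update the site spec; a spec-driven tool works again.
spec["Photo Albums"] = "/photos/albums"
restored = archive(site_pages, spec)
```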


http://test.socialstandard.org
  • Personal Stream, http://test.socialstandard.org/personal, 0, 0, ?
  • Photo Albums, http://test.socialstandard.org/albums, 0, 0, ?

Example: Implicit Recursion
