Advanced Archive-It Application Training: Archiving Social Networking and Social Media Sites



Agenda

• Overview of Social Networking/Media sites
• Why archive these sites?
• Typical Challenges
• Best Practices:
  – Twitter, Facebook, YouTube, Flickr
• Looking toward the future…
• Questions/Discussion


Why Archive These Sites?

• State Agencies: An increasing number have decided that the content on these sites is a record and needs to be archived: “A tweet is a record.”

• University libraries: Used to share information with students and alumni and contain important records about a school's culture, student body and campus events.

• Non-Governmental, Non-Profit Organizations: Used to record online presence and impact

• Researchers: Used to preserve valuable social reactions and change on topics of interest


Archive-It and Social Media Overview

• Capturing social media sites is becoming more necessary for Archive-It partners

• Still focused on: Flickr, Facebook, Twitter, and YouTube

• On our radar: Vimeo, LinkedIn, others?

• Join the Archive-It social media listserv to hear breaking news, including fixes and adjustments within Archive-It


Social Media Crawling Notes

• Content behind log-ins cannot currently be archived (feature in the 4.8 release, April 2013)

• Some parts of sites are not “archive-friendly” (e.g., complicated JavaScript)

• These sites tend to change both their technical structure and policy quickly and often.


Scoping Social Media Sites

• Because of the way many of these sites are structured, scoping crawls correctly is very important when archiving them.
  – Each site has its own unique structure
  – Not scoping correctly can result in crawling much more than you intend, or in not capturing the content you want to archive.


Scoping - Overall Approaches

• Trial and Error: Try to harvest with a variety of settings and a variety of seeds

• Quality Review: review archived content thoroughly

• Collaborate: compare approaches and results with other Archive-It users

• Document detailed instructions, lessons learned, and best practices for other partners


Best Practices

• Best practices for various social networking and social media sites are documented on the Archive-It Help Wiki:

https://webarchive.jira.com/wiki/display/ARIH/Archiving+Social+Networking+Sites+with+Archive-It


Best Practices

• Be specific with your seed URLs: list only the page you would like to archive as a seed. Do NOT use the larger site as a seed (for example, do NOT use www.facebook.com or www.twitter.com as seeds; DO use http://twitter.com/internetarchive/).

• Double-check your seed: do you need an ending slash (/)?

• Ignore robots.txt as needed: some sites block content using robots.txt
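
The seed advice above can be turned into a quick pre-flight check. The following is a minimal sketch, not part of Archive-It: `check_seed` is a hypothetical helper, and the warning texts are illustrative. It only flags seeds that look like a whole site or that lack an ending slash, so you can confirm them by hand before a test crawl.

```python
from typing import List
from urllib.parse import urlsplit


def check_seed(seed: str) -> List[str]:
    """Return warnings for a candidate seed URL (hypothetical helper)."""
    parts = urlsplit(seed)
    # Without a scheme, urlsplit puts the whole string in the path.
    if not parts.scheme or not parts.netloc:
        return ["not an absolute URL"]
    warnings = []
    if parts.path in ("", "/") and not parts.query:
        # e.g. http://www.twitter.com: the larger site, not one page
        warnings.append("seed is a whole site, not a specific page")
    elif not parts.path.endswith("/") and not parts.query:
        warnings.append("no ending slash: confirm this is intended")
    return warnings


if __name__ == "__main__":
    for seed in ("http://www.twitter.com", "http://twitter.com/internetarchive/"):
        print(seed, check_seed(seed))
```

An empty list means the sketch found nothing to question; it does not guarantee that the seed is scoped correctly.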


Best Practices

• ALWAYS run a test crawl when first setting up these seeds to avoid using more of your document budget than expected. You may need to run more than one until you get it right.


Best Practices

• After your first crawl…
  – Review post-crawl reports (did you crawl too much?)
  – Review archived content in Wayback
    • Did you capture all the areas you expected?
    • Are there any display issues?
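
One way to answer "did you crawl too much?" is to count captured documents per host. The sketch below assumes you have a plain list of crawled URLs (for example, copied out of a report); the format of Archive-It's own reports is not described here, and both function names are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def documents_per_host(crawled_urls):
    """Count crawled documents per host, most frequent first."""
    hosts = Counter(urlsplit(url).hostname or "" for url in crawled_urls)
    return hosts.most_common()


def over_budget(crawled_urls, expected_max):
    """Return (host, count) pairs whose count exceeds expected_max."""
    return [(host, n) for host, n in documents_per_host(crawled_urls)
            if n > expected_max]


if __name__ == "__main__":
    # Illustrative URLs only.
    sample = ["http://twitter.com/a", "http://twitter.com/b", "http://t.co/x"]
    print(documents_per_host(sample))
    print(over_budget(sample, expected_max=1))
```

Hosts that exceed what you expected are candidates for a tighter scope or a document limit on the next test crawl.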


Reviewing Scoping Rules

To the web app!


Twitter – Sample URLs

– Individual user feeds
  • https://twitter.com/archiveitorg/
– Searches
  • https://twitter.com/search?q=web%20archiving&src=typd
– Lists
  • https://twitter.com/smithsonian/smithsonian/
– A specific tweet
  • https://twitter.com/archiveitorg/status/294819565320413184
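
Search seeds need the query percent-encoded, as in the `web%20archiving` example above. A minimal sketch using Python's standard library follows; `twitter_search_seed` is a hypothetical helper, and the `src=typd` parameter is simply copied from the sample URL, not something this sketch can vouch for.

```python
from urllib.parse import quote, urlencode


def twitter_search_seed(query, src="typd"):
    """Build a search seed URL shaped like the sample above."""
    # quote_via=quote encodes spaces as %20 rather than "+".
    params = urlencode({"q": query, "src": src}, quote_via=quote)
    return "https://twitter.com/search?" + params


if __name__ == "__main__":
    print(twitter_search_seed("web archiving"))
```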


Twitter - Scoping

Expand Scope (using SURTs) to capture dynamically loading content:

– Individual Twitter feed:
  • +http://(com,twitter,)/i/profiles/show/BrowardCollege/
– Multiple Twitter feeds:
  • +http://(com,twitter,)/i/profiles/show/
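
The SURT rules above reverse the host name and keep the path. The sketch below shows that transformation in a simplified form: it ignores ports, user info, and query strings, and it always emits `http://` to match the rules shown above (an assumption of this sketch). Both function names are hypothetical and are not part of Archive-It or Heritrix.

```python
from urllib.parse import urlsplit


def to_surt_prefix(url):
    """Convert a URL to a simplified SURT prefix (host reversed, path kept)."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    reversed_host = ",".join(reversed(host.split(".")))
    return "http://(%s,)%s" % (reversed_host, parts.path)


def expand_scope_rule(url):
    """Prefix the SURT with '+', as in the Expand Scope rules above."""
    return "+" + to_surt_prefix(url)


if __name__ == "__main__":
    print(expand_scope_rule("http://twitter.com/i/profiles/show/BrowardCollege/"))
```

Compare the output with the rule you intend to enter: a shorter prefix (ending at `/i/profiles/show/`) covers multiple feeds, while a longer one covers a single feed.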


Links in Tweets

• Can I archive a URL linked to using a “URL shortener”?
  – Yes! Use an Expand Scope rule for http://t.co/ (all URLs posted on Twitter redirect through that domain)
  – Note: just the one page that the URL shortener link points to will be archived (plus embedded content)
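
To see why one rule for http://t.co/ covers every shortened link, it helps to think of an Expand Scope rule as a prefix match. The sketch below is a simplified model of that idea, not the crawler's actual logic: `to_surt` and `in_expanded_scope` are hypothetical helpers, the SURT form of the t.co rule is derived by this sketch's own conversion, and the link `abc123` is made up.

```python
from urllib.parse import urlsplit


def to_surt(url):
    """Simplified SURT form of a URL (host reversed, path kept)."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    return "http://(%s,)%s" % (",".join(reversed(host.split("."))), parts.path)


def in_expanded_scope(url, surt_prefixes):
    """True if the URL's SURT form starts with any of the given prefixes."""
    surt = to_surt(url)
    return any(surt.startswith(prefix) for prefix in surt_prefixes)


if __name__ == "__main__":
    rules = [to_surt("http://t.co/")]  # assumed SURT form of the t.co rule
    print(in_expanded_scope("http://t.co/abc123", rules))       # shortened link
    print(in_expanded_scope("http://example.com/page", rules))  # unrelated page
```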


Twitter

• Examples of Archived Pages


Facebook – Sample URLs

– Individual User Profiles – Timeline view
  • http://www.facebook.com/tonyforsenate/
– Pages – Timeline view
  • http://www.facebook.com/ArchiveIt/
– Events
  • http://www.facebook.com/events/265897963430841/
– Albums
  • https://www.facebook.com/media/set/?set=a.13499334573.18616.6193904573&type=3


Facebook - Scoping

– Ignoring robots.txt:
  • www.facebook.com
  • fbcdn.net
  • akamaihd.net
– Document limit on www.facebook.com (recommended 2000 for each seed)
  • Note: you cannot limit a crawl to capture content from *just* one Facebook account
– Expand Scope:
  • SURT +http://(net,fbcdn,
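
The effect of a per-host document limit can be pictured as a counter that stops accepting URLs from one host once the cap is reached, while other hosts (such as the image hosts listed above) are unaffected. This is an illustrative model only; `HostDocumentLimit` is a hypothetical class, and the crawler's real bookkeeping may differ.

```python
from urllib.parse import urlsplit


class HostDocumentLimit:
    """Toy model of a document limit applied to a single host."""

    def __init__(self, host, limit):
        self.host = host
        self.limit = limit
        self.count = 0

    def allow(self, url):
        """Return True if the URL may still be captured under the limit."""
        if urlsplit(url).hostname != self.host:
            return True  # the limit applies to this one host only
        if self.count >= self.limit:
            return False
        self.count += 1
        return True


if __name__ == "__main__":
    limit = HostDocumentLimit("www.facebook.com", 2000)  # recommended value above
    print(limit.allow("http://www.facebook.com/ArchiveIt/"))
```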


Facebook

• Currently we can capture the initial content on a Facebook timeline; however, the dynamically loading content can be difficult to capture because Facebook frequently changes the way that content is served

• Our engineers are working on keeping up to date with these changes, and we are also investigating alternate methods for capturing Facebook pages


Facebook

• Examples of Archived Pages


YouTube - Sample URLs

– Channel/User pages
  • http://www.youtube.com/whitehouse
– Watch pages (individual videos)
  • http://www.youtube.com/watch?v=5lVIuW8vJ_E
– Uploaded Document RSS Feed
  • http://gdata.youtube.com/feeds/api/users/whitehouse/uploads/
– Embedded YouTube Videos on other sites:
  • http://www.whitehouse.gov/photos-and-video/video/2013/01/29/president-obama-speaks-comprehensive-immigration-reform


YouTube - Scoping

• For all YouTube content, ignore robots.txt for:
  – youtube.com
  – ytimg.com
• For Watch pages (individual videos)
  – Use “One Page Only” Seed Type
• For Channel/User pages
  – Crawl with a document limit or using RSS/News Feed seed type
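
The YouTube rules above amount to a small decision table keyed on the URL shape. The sketch below encodes that table; `suggest_youtube_seed_type` is a hypothetical helper, the returned strings are labels for this sketch rather than values accepted by the web application, and treating `gdata.youtube.com` feed URLs as RSS/News Feed seeds is an assumption based on the sample URLs.

```python
from urllib.parse import parse_qs, urlsplit


def suggest_youtube_seed_type(url):
    """Suggest a seed setting for a YouTube URL, following the rules above."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host != "youtube.com" and not host.endswith(".youtube.com"):
        return "not a YouTube URL"
    if parts.path == "/watch" and "v" in parse_qs(parts.query):
        return "One Page Only"  # individual video
    if host == "gdata.youtube.com":
        return "RSS/News Feed"  # uploads feed (assumed)
    return "crawl with a document limit"  # channel/user page


if __name__ == "__main__":
    for url in ("http://www.youtube.com/watch?v=5lVIuW8vJ_E",
                "http://www.youtube.com/whitehouse"):
        print(url, "->", suggest_youtube_seed_type(url))
```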


YouTube

• Viewing YouTube videos:
  – YouTube videos for Watch pages and most embedded YouTube videos will play back normally in Wayback
  – For Channel/User pages or other pages where videos are not playing back within the page, view videos from the video report or the public video page for that seed.


YouTube

• Examples of Archived Pages


Flickr

What types of pages can be archived?
– Photo streams
  • Ex: http://www.flickr.com/photos/whitehouse/
– Individual photos
  • Ex: http://www.flickr.com/photos/whitehouse/8390033709/in/photostream


Flickr

• Examples of Archived Pages


Other Sites

• Can sites other than those already mentioned be archived?
  – Yes! There are many more sites out there that can be archived. Please send us sites you are interested in archiving.
  – Other sites mentioned by partners currently are Google+, LinkedIn, Vimeo, and SlideShare.


Moving Forward

• These best practices will change as the sites themselves make changes. Please be sure to check the Help Wiki page for updates.

• We continue to focus on working with our partners to improve the capture and display of archived social networking sites

• The Archive-It team is exploring other capture mechanisms besides using a traditional crawler resource (Heritrix)
  – Headless browsers
  – Hybrid architecture
  – API
  – Partnering with third-party software
• Enhance the display and search capabilities


Thank you!

• Questions? Discussion?

• Please take our quick survey: http://www.surveymonkey.com/s/GZ8CWC8