Have you ever been curious as to how widely Google Analytics is used across the web? Stop pondering, start coding! In this presentation, Stephen discusses how he used the Common Crawl dataset to perform wide scale analysis over billions of web pages and what this means for privacy on the web at large.
Measuring the impact:
Stephen Merity / smerity.com @smerity
Smerity @ Common Crawl
Continuing the crawl
Documenting best practices
Guides for newcomers to Common Crawl + big data
Reference for seasoned veterans
Spending many hours blessing and/or cursing Hadoop
Before: University of Sydney '11, Harvard '14
Google Sydney, Freelancer.com, Grok Learning
I was hoping on creating a tool that will automatically extract some of the most common memes ("But does it run Linux?" and "In Soviet Russia..." style jokes etc) and I needed a corpus - . I do intensely apologise.
I wrote a primitive (threaded :S) web crawler and started it before I considered robots.txt
-- Past Smerity (16/12/2007)
Where did all the HTTP referrers go?
Referrers: leaking browsing history
If you click from
http://www.reddit.com/r/sanfrancisco
to
http://www.sfbike.org/news/protected-bikeways-planned-for-the-embarcadero/
then SFBike knows you came from Reddit
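As a minimal sketch of the mechanism: when a browser follows that link, the request it sends to the destination carries the source page in the `Referer` header (the header name is misspelled in HTTP itself). The snippet below only builds such a request with Python's standard library and never sends it; the helper name `request_for_click` is hypothetical.

```python
from urllib.request import Request


def request_for_click(source_url: str, target_url: str) -> Request:
    """Build (but do not send) the request a browser would make for the click."""
    # Browsers attach the page you came from as the Referer header.
    return Request(target_url, headers={"Referer": source_url})


req = request_for_click(
    "http://www.reddit.com/r/sanfrancisco",
    "http://www.sfbike.org/news/protected-bikeways-planned-for-the-embarcadero/",
)
print(req.host)                   # the site receiving the request
print(req.get_header("Referer"))  # what that site learns about where you came from
```

Any third-party script loaded by the destination page can read the same information, which is what makes the header useful for tracking.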
1) How many websites is Google Analytics (GA) on?
2) How much of a user's browsing history does GA capture?
Top 10k domains: 65.7%
Top 100k domains: 64.2%
Top million domains: 50.8%
It keeps dropping off, but by how much..?
Estimate of captured browsing history... ?
Referrers allow easy web tracking when done at Google's scale!
No information: !GA → !GA
Full information: !GA → GA
GA → !GA → GA
GA → !GA → GA → !GA → GA → !GA → GA → !GA → GA
Key insight: leaked browsing history
Google only needs one in every two links to have GA in order to have your full browsing path*
*possibly less if link graph + click timing + machine learning used
Estimating leaked browser history
for each link = {page A} → {page B}:
    total_links += 1
    if {page A} or {page B} has GA:
        total_leaked += 1
Estimate of leaked browser history is simply: total_leaked / total_links
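The pseudocode above translates almost directly into Python. This is a minimal sketch, assuming links arrive as `(source, target)` pairs and GA usage is known as a set of names; the function name `estimate_leaked` and the `*.example` domains are hypothetical.

```python
from typing import Iterable, Set, Tuple


def estimate_leaked(links: Iterable[Tuple[str, str]], has_ga: Set[str]) -> float:
    """Fraction of links where at least one endpoint runs Google Analytics."""
    total_links = 0
    total_leaked = 0
    for page_a, page_b in links:
        total_links += 1
        # A click is visible to GA if either end of the link has GA installed.
        if page_a in has_ga or page_b in has_ga:
            total_leaked += 1
    return total_leaked / total_links if total_links else 0.0


links = [
    ("a.example", "b.example"),
    ("b.example", "c.example"),
    ("c.example", "d.example"),
    ("d.example", "e.example"),
]
has_ga = {"b.example"}
print(estimate_leaked(links, has_ga))  # 2 of the 4 links touch b.example -> 0.5
```

Note how a single GA site leaks both the link into it and the link out of it.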
Joint project with Chad Hornbaker* at Harvard IACS
*Best full name ever: Captain Charles Lafforest Hornbaker II
The task
Google Analytics count: ".google-analytics.com/ga.js"
Generate link graph
Merge link graph & GA count
    www.winradio.net.au  NoGA  1
    www.winrar.com.cn    GA    6
    www.winratzart.com   GA    1
    www.winrenner.ch     GA    244
    domainA.com -> domainB.com <total times>
    cnet-cnec-driver.softutopia.com -> www.softutopia.com 24
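The counting step can be sketched as a plain substring check for the `.google-analytics.com/ga.js` marker, tallied per domain. This is an illustration under assumptions, not the project's actual Hadoop job: the function name `ga_counts`, the `(url, html)` input shape, and the sample pages are all hypothetical, and a substring test will miss any GA loader that does not reference `ga.js`.

```python
from collections import Counter
from urllib.parse import urlsplit

GA_MARKER = ".google-analytics.com/ga.js"


def ga_counts(pages):
    """pages: iterable of (url, html). Returns a Counter keyed by (domain, 'GA' | 'NoGA')."""
    counts = Counter()
    for url, html in pages:
        domain = urlsplit(url).hostname
        if domain is None:
            continue  # skip URLs without a host
        state = "GA" if GA_MARKER in html else "NoGA"
        counts[(domain, state)] += 1
    return counts


pages = [
    ("http://www.example.com/", '<script src="http://www.google-analytics.com/ga.js"></script>'),
    ("http://www.example.com/about", "<p>no tracking here</p>"),
    ("http://blog.example.org/", "<script>var s='https://ssl.google-analytics.com/ga.js';</script>"),
]
for (domain, state), n in sorted(ga_counts(pages).items()):
    print(domain, state, n)
```

The printed lines follow the same `domain state count` layout as the sample output on the slide.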
Exciting age of open data
Open data +
Open tools +
Cloud computing
WARC: raw web data
WAT: metadata (links, title, ...) for each page
WET: extracted text
WARC = GA usage (raw web data)
WAT = hyperlink graph (metadata (links, title, ...) for each page)
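To make the WAT-to-link-graph step concrete, here is a hedged sketch of pulling hyperlinks out of one WAT metadata record. It assumes the record payload is JSON with links stored under `Envelope` → `Payload-Metadata` → `HTTP-Response-Metadata` → `HTML-Metadata` → `Links`, each entry carrying a `url` and a `path` such as `A@/href`; that layout should be checked against a real WAT file before relying on it. The function name `outgoing_domains` and the sample record are hypothetical.

```python
import json
from urllib.parse import urljoin, urlsplit


def outgoing_domains(wat_json: str):
    """Return (source_host, target_host) pairs for the <a href> links in one WAT record."""
    envelope = json.loads(wat_json).get("Envelope", {})
    page_url = envelope.get("WARC-Header-Metadata", {}).get("WARC-Target-URI") or ""
    links = (
        envelope.get("Payload-Metadata", {})
        .get("HTTP-Response-Metadata", {})
        .get("HTML-Metadata", {})
        .get("Links", [])
    )
    source = urlsplit(page_url).hostname
    edges = []
    for link in links:
        # Keep only anchor hrefs; images, scripts and stylesheets are not clicks.
        if link.get("path") != "A@/href" or "url" not in link:
            continue
        # Resolve relative links against the page URL before taking the host.
        target = urlsplit(urljoin(page_url, link["url"])).hostname
        if source and target:
            edges.append((source, target))
    return edges


sample = json.dumps({"Envelope": {
    "WARC-Header-Metadata": {"WARC-Target-URI": "http://a.example/index.html"},
    "Payload-Metadata": {"HTTP-Response-Metadata": {"HTML-Metadata": {"Links": [
        {"path": "A@/href", "url": "http://b.example/page"},
        {"path": "A@/href", "url": "/about"},
        {"path": "IMG@/src", "url": "http://cdn.example/logo.png"},
    ]}}},
}})
print(outgoing_domains(sample))
```

Working from the metadata records avoids re-parsing the raw HTML just to recover links.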
Estimating the task's size
Page level ( ): http://en.wikipedia.org/
    3.5 billion nodes, 128 billion edges, 331 GB compressed
Subdomain level ( ):
    101 million nodes, 2 billion edges, 9.2 GB compressed
Decided on using subdomains instead of page level
http:// /
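Collapsing a page-level graph to subdomain level amounts to keeping only the host part of each URL and summing duplicate edges. A minimal sketch, assuming edges arrive as `(source_url, target_url)` pairs; the helper names `subdomain_key` and `collapse` and the sample URLs are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def subdomain_key(url: str):
    """Reduce a full page URL to its host (urlsplit already lower-cases it)."""
    return urlsplit(url).hostname


def collapse(page_edges):
    """Turn page -> page edges into weighted subdomain -> subdomain edges."""
    counts = Counter()
    for source_url, target_url in page_edges:
        a, b = subdomain_key(source_url), subdomain_key(target_url)
        if a and b:
            counts[(a, b)] += 1
    return counts


page_edges = [
    ("http://EN.wikipedia.org/wiki/Sydney", "http://www.example.com/a"),
    ("http://en.wikipedia.org/wiki/Harvard", "http://www.example.com/b"),
    ("http://en.wikipedia.org/wiki/Sydney", "http://en.wikipedia.org/wiki/Australia"),
]
for (a, b), n in sorted(collapse(page_edges).items()):
    print(f"{a} -> {b} {n}")
```

Three page-level edges become two weighted subdomain-level edges here, which is the kind of shrinkage behind the node and edge counts above.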
Engineering for scale
✓ Use the framework that matches best
✓ Debug locally
✓ Standard Hadoop optimizations (combiner, compression, re-use JVMs...)
✓ Many small jobs ≫ one big job
✓ Ganglia for metrics & monitoring
Hadoop :'(
Monitoring & metrics with Ganglia
Engineering for cost
✓ Avoid Hadoop if it's simple enough
✓ Use spot instances everywhere*
✖ Use EMR if highly cost sensitive
(Elastic MapReduce = hosted Hadoop)
*Everywhere but the master node!
Juggling spot instances
c1.xlarge goes from $0.58 p/h to $0.064 p/h
EMR: The good, the bad, the ugly
significantly easier, one click setup
price is insane when using spot instances (spot = $0.075 with EMR = $0.12)
Guess how many log files for a 100 node cluster?
584,764+ log files.
Ouch.
Cost projection
Best optimized small Hadoop job: 1/177th the dataset in 23 minutes (12 c1.xlarge machines + Hadoop master)
Estimated full dataset job:
~210 TB for web data + ~90 TB for link data
~$60 in EC2 costs (177 hours of spot instances)
~$100 in EMR costs (avoid EMR for cost!)
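One way to sanity-check the projection is to scale the sample job up linearly. The sketch below makes several assumptions that are not stated on the slides: the full job is 177 runs of the 23-minute sample on the same 13 machines, every machine (master included) is billed at the same hourly rate, and the $0.075 and $0.12 figures from the EMR slide are per-machine-hour prices without and with EMR.

```python
SAMPLE_RUNS = 177       # the sample job covered 1/177th of the dataset
SAMPLE_MINUTES = 23     # wall-clock time of the sample job
MACHINES = 12 + 1       # 12 c1.xlarge workers + Hadoop master (assumed same price)


def projected_cost(price_per_machine_hour: float) -> float:
    """Linear scale-up of the sample job to the full dataset."""
    cluster_hours = SAMPLE_RUNS * SAMPLE_MINUTES / 60
    machine_hours = cluster_hours * MACHINES
    return machine_hours * price_per_machine_hour


print(round(projected_cost(0.075)))  # plain EC2 spot price (assumed)
print(round(projected_cost(0.12)))   # spot price with EMR on top (assumed)
```

Under these assumptions the totals come out in the same range as the ~$60 and ~$100 quoted above, with EMR adding more than half again to the bill.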
Final results
29.96% of 48 million domains have GA (top million domains was 50.8%)
That means that one in every two hyperlinks will leak information to Google
The wider impact
Want Big Open Data?
Web Data
Covers everything at scale!
Languages... Topics... Demographics...
Processing the web is feasible
Downloading it is a pain! Common Crawl does that for you
Processing it is scary! Big data frameworks exist and are (relatively) painless
These experiments are too expensive! Cloud computing means experiments can be just a few dollars
Get started now..!
Want raw web data? CommonCrawl.org
Want hyperlink graph / web tables / RDFa? WebDataCommons.org
Want example code to get you started? https://github.com/Smerity/cc-warc-examples
Full write-up: http://smerity.com/cs205_ga/