DIGGING INTO DATA COLLECTION

Jason Packer [email protected] @jhpacker

Feb 17, 2016 #cbuswaw


WHAT DRIVES OUR METRICS?

*Note: all metrics may be inaccurate by some amount.

**But we’re not sure which ones or by how much.

DATA COLLECTION 1.0: SERVER LOGS, HITS, IP ADDRESSES

• Server logs, valid in 1996 and 2016

• Basic, but still contains highly useful data!

• Unanalyzed raw logs get big, fast.

128.135.189.9 - - [15/Feb/1996:15:16:27] "GET / HTTP/1.1" 200 5397 "Mozilla/1.0 (Win3.1)"
65.60.216.104 - - [15/Feb/2016:15:16:27] "GET / HTTP/1.1" 200 5397 "Mozilla/5.0 (Mac OS)"
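Both sample lines share one layout, so a single pattern can split either of them into fields. The following is a minimal sketch that assumes the simplified format shown here (no referrer field, no timezone offset); `parseLogLine` and `LOG_PATTERN` are hypothetical names, not part of any log-analysis library.

```javascript
// Parse a simplified access-log line of the form shown on the slide:
//   IP - - [timestamp] "METHOD path protocol" status bytes "user-agent"
// Assumption: no referrer field and no timezone offset, as in the samples.
const LOG_PATTERN =
  /^(\S+) (\S+) (\S+) \[([^\]]+)\] "(\S+) (\S+) (\S+)" (\d{3}) (\d+|-) "([^"]*)"$/;

function parseLogLine(line) {
  const m = LOG_PATTERN.exec(line.trim());
  if (m === null) return null; // line does not match the assumed layout
  return {
    ip: m[1],
    timestamp: m[4],
    method: m[5],
    path: m[6],
    protocol: m[7],
    status: Number(m[8]),
    bytes: m[9] === '-' ? 0 : Number(m[9]),
    userAgent: m[10],
  };
}

const sampleLines = [
  '128.135.189.9 - - [15/Feb/1996:15:16:27] "GET / HTTP/1.1" 200 5397 "Mozilla/1.0 (Win3.1)"',
  '65.60.216.104 - - [15/Feb/2016:15:16:27] "GET / HTTP/1.1" 200 5397 "Mozilla/5.0 (Mac OS)"',
];
const parsedHits = sampleLines.map(parseLogLine);
console.log(parsedHits);
```

Real server logs usually carry more fields (referrer, timezone, response time), so the pattern would need extending before use on production data.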

WEB ANALYST, CIRCA 2000

flickr: boston_public_library, CC BY-NC-ND 2.0

DATA COLLECTION 2.0: CLIENT-SIDE JAVASCRIPT, COOKIES

• Easier to implement (“just a few lines of JavaScript…”)

• Cookies match users more closely than IP addresses do

• Much more info available on client-side

HOW DOES CLIENT-SIDE JS WORK? …SPECIFICALLY GOOGLE ANALYTICS

Two requests: the first fetches the code, the second sends the measurement

TRACKING CODE SNIPPETS

• Sets up command queue

• Loads analytics.js, which does the real work.

<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');

ga('create', 'UA-34128028-1', 'auto');
ga('send', 'pageview');
</script>
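The "command queue" half of the snippet is easier to follow once de-minified. The sketch below reproduces only that half and runs it against a plain object instead of the browser's `window`, so it works outside a browser; `installCommandQueue` and `fakeWindow` are hypothetical names introduced for illustration, and nothing here loads analytics.js.

```javascript
// De-minified version of the queue half of the tracking snippet.
// `win` stands in for `window`; `name` is the global function name ('ga').
function installCommandQueue(win, name) {
  win.GoogleAnalyticsObject = name;
  // Until analytics.js loads, calls are simply stored in win[name].q.
  win[name] = win[name] || function () {
    (win[name].q = win[name].q || []).push(arguments);
  };
  win[name].l = 1 * new Date(); // timestamp of when the snippet ran
  return win[name];
}

const fakeWindow = {}; // hypothetical stand-in so this runs outside a browser
const ga = installCommandQueue(fakeWindow, 'ga');
ga('create', 'UA-XXXX-Y', 'auto');
ga('send', 'pageview');

// Each queued entry is an `arguments` object; convert to arrays for display.
const queuedCommands = fakeWindow.ga.q.map((args) => Array.from(args));
console.log(queuedCommands);
```

This is why `ga(...)` calls placed right after the snippet are safe even though the library loads asynchronously: they wait in the queue until analytics.js arrives and processes them.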

MEASUREMENT PROTOCOL

https://www.google-analytics.com/collect?v=1&_v=j41&a=702618035&t=pageview&_s=1&dl=https://www.quantable.com/&ul=en-us&de=UTF-8&dt=Quantable - Analytics & Optimization&sd=24-bit&sr=1680x1050&vp=1442x464&je=0&_u=SCCAAUAjK~&jid=&cid=157092037.1441829013&tid=UA-34128028-1&z=823826407

This hit, once made readable, is this data… (from the ObservePoint tag debugger)
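A tag debugger does little more than split the hit's query string and label each parameter. The sketch below shows the idea with the standard `URL` API. It assumes a shortened, percent-encoded version of the hit above (the real request encodes values such as the page URL and title); `describeHit` and `PARAM_LABELS` are hypothetical names, and the label table covers only some Measurement Protocol parameters.

```javascript
// Labels for a few Measurement Protocol parameters seen in the hit above.
const PARAM_LABELS = {
  v: 'protocol version',
  t: 'hit type',
  dl: 'document location',
  dt: 'document title',
  ul: 'user language',
  de: 'document encoding',
  sd: 'screen colors',
  sr: 'screen resolution',
  vp: 'viewport size',
  cid: 'client ID',
  tid: 'tracking ID',
  z: 'cache buster',
};

function describeHit(hitUrl) {
  const described = {};
  for (const [key, value] of new URL(hitUrl).searchParams) {
    // Fall back to the raw parameter name when no label is known.
    described[PARAM_LABELS[key] || key] = value;
  }
  return described;
}

// Shortened, URL-encoded version of the slide's hit (an assumption made so
// that the "&" inside the page title does not split the parameter).
const exampleHit =
  'https://www.google-analytics.com/collect?v=1&t=pageview' +
  '&dl=https%3A%2F%2Fwww.quantable.com%2F&ul=en-us&de=UTF-8' +
  '&dt=Quantable%20-%20Analytics%20%26%20Optimization' +
  '&sr=1680x1050&vp=1442x464&cid=157092037.1441829013' +
  '&tid=UA-34128028-1&z=823826407';

console.log(describeHit(exampleHit));
```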

SEEMS GREAT, WHAT COULD POSSIBLY GO WRONG?

Some data still exists only on the server side…

• Bot traffic (mostly)

• HTTP errors

• Pages we forgot to tag

• Content-blocking users
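Two of these items, HTTP errors and untagged pages, can be surfaced by comparing server-side records with what the analytics tool reported. The sketch below uses small hand-made records, not real data; `serverOnlyFindings`, the record shape, and the path set are all hypothetical.

```javascript
// Hypothetical, hand-made records standing in for parsed server-log lines.
const serverRecords = [
  { path: '/', status: 200 },
  { path: '/pricing', status: 200 },
  { path: '/old-page', status: 404 },
  { path: '/landing/promo', status: 200 },
  { path: '/api/report', status: 500 },
];
// Hypothetical set of page paths that the analytics tool reported.
const pathsSeenInAnalytics = new Set(['/', '/pricing']);

function serverOnlyFindings(records, taggedPaths) {
  // Requests that failed never run the page's JavaScript tag.
  const errors = records.filter((r) => r.status >= 400);
  // Successful pages that analytics never reported are tagging candidates.
  const untagged = records
    .filter((r) => r.status < 400 && !taggedPaths.has(r.path))
    .map((r) => r.path);
  return { errors, untagged };
}

const findings = serverOnlyFindings(serverRecords, pathsSeenInAnalytics);
console.log(findings);
```

A path in `untagged` is only a candidate: it may be missing a tag, or it may have been requested solely by bots or content-blocking users.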

SERVER LOGS, AGAIN

• Distributed systems, distributed logs

• As before, but somewhat different consumers

AS ANALYSTS, WHAT’S GIVING US GRIEF

• Cookie-deleting users

• Bots

• Analytics “referrer” spam

• Ad blocker users

COOKIE-DELETING USERS: IS IT STILL ~30%?

• Artificially increases user counts

• A visit after deletion looks direct, so attribution is lost

• Base stats on user accounts instead?

flickr: diskant, CC BY-NC 2.0

BROWSER FINGERPRINTS

• Survive cookie deletion

• 2010 EFF Panopticlick: 84% of browsers unique

• Invasive?

• Piwik uses browser fingerprint + IP as a cookie fallback

• Can be thought of as next-gen User-Agent + IP
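The "next-gen User-Agent + IP" idea can be sketched as hashing a handful of browser attributes into one identifier. This is an illustration only: it is not the algorithm used by Piwik or fingerprintjs2, the attribute list is an assumption, and `fnv1a32` and `fingerprint` are hypothetical helper names (the hash is an FNV-1a-style function over UTF-16 code units).

```javascript
// Small non-cryptographic hash (FNV-1a style, 32-bit), returned as hex.
function fnv1a32(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // keep as unsigned 32-bit
  }
  return hash.toString(16).padStart(8, '0');
}

// Combine attributes that tend to stay stable when cookies are deleted.
// Assumption: this attribute list is illustrative, not any tool's real list.
function fingerprint(attrs) {
  const parts = [attrs.ip, attrs.userAgent, attrs.language, attrs.screen, attrs.timezone];
  return fnv1a32(parts.join('|'));
}

const visitor = {
  ip: '65.60.216.104',
  userAgent: 'Mozilla/5.0 (Mac OS)',
  language: 'en-us',
  screen: '1680x1050',
  timezone: 'UTC-5',
};
const sameVisitorLater = { ...visitor };              // cookies deleted, same browser
const otherVisitor = { ...visitor, screen: '1280x1050' };

console.log(fingerprint(visitor) === fingerprint(sameVisitorLater));
console.log(fingerprint(visitor) === fingerprint(otherVisitor));
```

The more attributes go into the hash, the more unique the result, which is exactly what makes the technique both effective and potentially invasive.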

BOTS

• About 50% of all traffic may be bots (48.5%, Incapsula 2015)

• Most of these don’t show in GA (yet?)

• The smaller the site, the higher the bot % (85% for sites with <1k visits/day)

flickr: skynoir, CC BY-NC 2.0
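Because most bots never run JavaScript, the place to measure them is the server log. A rough first pass is to match user-agent strings against known bot markers, as sketched below. The pattern is a short illustrative list (an assumption), and `botShare` is a hypothetical helper; bots that send browser-like user agents will slip through.

```javascript
// Assumption: a short, illustrative list of markers. Real bot lists are far
// longer, and many bots send browser-like user agents.
const BOT_UA_PATTERN = /bot|crawler|spider|curl|wget/i;

// Fraction of requests whose user agent looks like a bot.
function botShare(userAgents) {
  if (userAgents.length === 0) return 0;
  const bots = userAgents.filter((ua) => BOT_UA_PATTERN.test(ua)).length;
  return bots / userAgents.length;
}

// Hand-made sample user agents, not real traffic.
const sampleUserAgents = [
  'Mozilla/5.0 (Mac OS)',
  'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',
  'curl/7.43.0',
  'Mozilla/5.0 (Windows NT 10.0)',
];
console.log(botShare(sampleUserAgents));
```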


ANALYTICS SPAM

• free-social-buttons.biz, top-seo-blah-blah-blah.com, number-one-analytics.fail

• A way to get traffic, SEO, and lulz since before 2009

• Not GA-specific; GA is just the #1 target

• Two kinds: Crawler & Ghost

WHO’S SPAMMING US TODAY?

List of 2016 GA spammers from Analytics Edge

Google is blocking offenders, but often not quickly.

WHY IS IT SO PREVALENT?

“Ghost” version via Measurement Protocol abuse:

$ curl "https://www.google-analytics.com/collect?v=1&t=pageview&tid=UA-XXXX-X&cid=fa0c8140-eef8-47c5-a244-b4c60cf46f74&dr=http%3A%2F%2Fmyspamsite.pizza&dp=%2Fhome"

Spammers just iterate through UA-XXXX-1 tracking IDs.

HOW DO I FIX IT?

• Filters for new traffic, segments for historical data

• Tool available on my site: quantable.com/spamfilter

• For a new site, use a property whose tracking ID is higher than UA-XXXX-1
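The filtering idea can be sketched as two checks: a referrer pattern for crawler spam, and a valid-hostname check for ghost spam (ghost hits never touch the site, so their hostname is typically missing or wrong). This is an assumption-laden sketch, not the logic of the quantable.com tool: the hit shape, `buildReferrerFilter`, and `isSpamHit` are hypothetical, and the domain list is just the examples named earlier in this deck.

```javascript
// Example spam referrers taken from the "ANALYTICS SPAM" slide.
const SPAM_REFERRERS = [
  'free-social-buttons.biz',
  'top-seo-blah-blah-blah.com',
  'number-one-analytics.fail',
];

// Escape characters that have special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Build one pattern that matches each domain and any of its subdomains.
function buildReferrerFilter(domains) {
  return new RegExp('(^|\\.)(' + domains.map(escapeRegExp).join('|') + ')$', 'i');
}

// A hit is treated as spam if its referrer is listed (crawler spam) or its
// hostname is not one of ours (ghost spam). Hit fields are hypothetical.
function isSpamHit(hit, referrerFilter, validHostnames) {
  const ghost = !validHostnames.has(hit.hostname);
  const crawler = referrerFilter.test(hit.referrerDomain);
  return ghost || crawler;
}

const referrerFilter = buildReferrerFilter(SPAM_REFERRERS);
const validHostnames = new Set(['www.quantable.com']);

console.log(referrerFilter.source);
console.log(isSpamHit(
  { hostname: '(not set)', referrerDomain: 'example.com' },
  referrerFilter,
  validHostnames
));
```

The same pattern serves both halves of the first bullet: as an exclude filter going forward, and as a segment condition when cleaning up historical reports.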

AD BLOCKING IS MAKING SOME OF OUR USERS DISAPPEAR

• Blockers such as AdBlock Plus, Ghostery, uBlock Origin, and Purify can block analytics tools, not just ads

• ABP has the largest install base (300M downloads)

• These users are still in your server logs, but may never show up in your web analytics

HOW DOES THE BLOCKING WORK?

• Long lists of URL patterns to block from loading, e.g.:

google-analytics.com/analytics.js
/piwik.php
?[AQB]&ndh=1&t=
com/0.gif?

• EasyPrivacy list (used by ABP and others) is over 10,000 lines long and very actively maintained
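At its simplest, a blocker checks every outgoing request URL against its pattern list and cancels the request on a match. The sketch below uses plain substring matching with the four example patterns from this slide. Real filter lists such as EasyPrivacy use a richer syntax (domain anchors, wildcards, options), so treat `isBlocked` as a hypothetical simplification.

```javascript
// The four example patterns from the slide, treated as plain substrings.
// Assumption: real filter syntax is richer than substring matching.
const BLOCK_PATTERNS = [
  'google-analytics.com/analytics.js',
  '/piwik.php',
  '?[AQB]&ndh=1&t=',
  'com/0.gif?',
];

// True when the request URL contains any blocked pattern.
function isBlocked(requestUrl, patterns) {
  return patterns.some((pattern) => requestUrl.includes(pattern));
}

const requests = [
  'https://www.google-analytics.com/analytics.js',
  'https://example.com/piwik.php?idsite=1',
  'https://www.quantable.com/',
];
for (const url of requests) {
  console.log(url, isBlocked(url, BLOCK_PATTERNS) ? 'BLOCKED' : 'allowed');
}
```

Note that blocking analytics.js itself means the measurement request is never even attempted, which is why these users leave no trace in client-side tools.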

HOW MANY USERS BLOCK GA?

My study showed 8.7% of users blocking GA (for one particular site)

HOW DO I COUNT BLOCKERS?

• Can’t really be “fixed” client-side

• Still show up server-side

• Workarounds may be against the GA terms (can’t circumvent the Opt-Out Add-on)
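Since blocked users still appear server-side, one rough way to estimate the blocking rate is to compare human pageviews counted in server logs with pageviews recorded by the analytics tool over the same period. This is a sketch under stated assumptions, not necessarily the method behind the 8.7% figure: `blockingRate` is a hypothetical helper, the counts are invented for illustration, and the server count must already exclude bots and non-page requests.

```javascript
// Estimate the share of pageviews that never reached the analytics tool.
// Assumption: serverPageviews already excludes bots, errors, and assets.
function blockingRate(serverPageviews, analyticsPageviews) {
  if (serverPageviews <= 0) {
    throw new RangeError('serverPageviews must be positive');
  }
  return (serverPageviews - analyticsPageviews) / serverPageviews;
}

// Invented counts for illustration only.
const serverCount = 1000;
const analyticsCount = 913;
const rate = blockingRate(serverCount, analyticsCount);
console.log((rate * 100).toFixed(1) + '% of pageviews missing from analytics');
```

The result is an upper bound on blocking: untagged pages, tag errors, and visitors who leave before the script loads all widen the same gap.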

SQUARING THAT CIRCLE

…because sometimes 22/7 is good enough.

THANKS!

Slides & recap to be posted at cbuswaw.com

References & Further Reading

Quantable GA Blocking Analysis: https://www.quantable.com/analytics/how-many-users-block-google-analytics/

GA Tracking Code walkthrough: http://code.stephenmorley.org/javascript/understanding-the-google-analytics-tracking-code/

GA Measurement Protocol Hit Builder: https://ga-dev-tools.appspot.com/hit-builder/

Fingerprintjs2: http://valve.github.io/fingerprintjs2/

Incapsula 2015 Bot Report: https://www.incapsula.com/blog/bot-traffic-report-2015.html

Analytics Edge’s Guide to GA Spam: http://help.analyticsedge.com/spam-filter/definitive-guide-to-removing-google-analytics-spam/