Fun Learning Web Scraping
queensjs.s02e02.mp4
Danny Garcia (@buzzedword)
• Director of Engineering/Operations at ClassPass
• Full stack engineer, DevOps day to day
• First love is JavaScript
Screen Scraping
You should know about it
Simple Scraping
• Layout dependent
• Difficult to traverse
• Error prone
• RegEx not made for XML
• Doesn’t support AJAX
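The fragility listed above can be seen in a minimal sketch. The markup, the `price` class name, and the pattern below are hypothetical examples made up for illustration, not material from the talk: a hand-written regular expression matches one exact layout and silently fails after a harmless attribute reorder.

```python
import re

# Hypothetical page markup (an assumption for illustration only).
ORIGINAL = '<span class="price" id="p1">19.99</span>'
# The same data after a harmless layout tweak: attribute order swapped.
TWEAKED = '<span id="p1" class="price">19.99</span>'

# A typical hand-written pattern, tied to the exact markup it was written against.
PRICE_RE = re.compile(r'<span class="price" id="p1">([\d.]+)</span>')

def scrape_price(html):
    """Return the price as a string, or None when the pattern no longer matches."""
    match = PRICE_RE.search(html)
    return match.group(1) if match else None

print(scrape_price(ORIGINAL))  # 19.99
print(scrape_price(TWEAKED))   # None
```

Nothing about the data changed between the two snippets, which is why this style of scraper is both layout dependent and error prone.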
Advanced Scraping
• Layout dependent
• Doesn’t support AJAX
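A parser-based scraper might look roughly like the sketch below, which uses Python's standard `html.parser` module. The `events` id, the sample pages, and the `loadEvents()` call are assumptions invented for the example. The parser tolerates attribute reordering, but it still depends on the id, and it sees nothing when the content is filled in later by a script.

```python
from html.parser import HTMLParser

class IdTextExtractor(HTMLParser):
    """Collect the text found inside elements that carry a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # greater than 0 while inside the target element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # Note: void tags such as <br> are not special-cased in this sketch.
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)

def text_by_id(html, target_id):
    parser = IdTextExtractor(target_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()

# Hypothetical pages: server-rendered, renamed id, and an empty AJAX shell.
STATIC = '<div class="list" id="events"><p>QueensJS meetup</p></div>'
RENAMED = '<div class="list" id="x9f2"><p>QueensJS meetup</p></div>'
AJAX_SHELL = '<div id="events"></div><script>loadEvents()</script>'

print(repr(text_by_id(STATIC, "events")))      # 'QueensJS meetup'
print(repr(text_by_id(RENAMED, "events")))     # ''
print(repr(text_by_id(AJAX_SHELL, "events")))  # ''
```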
Browser Emulation
• Layout dependent
• Slow
Screen Scraping
You can beat it
Don’t make it easy
• Are you using IDs?
• Are you HTML templating?
• How identifiable are your components?
• Randomly break your layout to stop scripts
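One way to make components less identifiable is to regenerate class names on every deploy, so selectors written against yesterday's markup stop matching. The sketch below is only an illustration under that assumption; `build_class_map`, `render_price`, and the logical names are hypothetical, and the talk does not prescribe a specific mechanism.

```python
import secrets

def build_class_map(logical_names):
    """Map stable internal names to random class names for one deploy."""
    return {name: "c" + secrets.token_hex(4) for name in logical_names}

def render_price(class_map, amount):
    """Render a price using the obfuscated class instead of a readable one."""
    return '<span class="{}">{}</span>'.format(class_map["price"], amount)

# Templates and stylesheets would both be generated from the same map,
# so the page keeps working while the public names keep changing.
class_map = build_class_map(["price", "title", "pager"])
print(render_price(class_map, "19.99"))
```

The trade-off is extra build complexity: stylesheets and scripts have to be generated from the same map as the templates.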
All about AJAX
• Defeats most RegEx scrapers
• Defeats most XML parsers
• Browsers can render AJAX
• Don’t use easy-to-access AJAX routes
• Instead of https://queensjs.com/page/2
• use https://queensjs.com/_ajax/pagination&?q=1
Know your enemy!
• CSRF tokens for fields
• User authentication required
• Use a paywall to discourage anonymity
• Trace IP addresses
• DMCA takedown request
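As a rough sketch of the first bullet, a CSRF token can be derived from the user's session with an HMAC, so a script that posts to a form without first loading the page in that session is rejected. The names `SECRET_KEY`, `issue_token`, and `is_valid` are hypothetical, and a real application would normally rely on its web framework's built-in CSRF protection rather than this hand-rolled version.

```python
import hashlib
import hmac
import secrets

# Assumption: a server-side secret generated at startup and never sent to clients.
SECRET_KEY = secrets.token_bytes(32)

def issue_token(session_id):
    """Derive a token for a session; embed it in a hidden form field."""
    return hmac.new(SECRET_KEY, session_id.encode("utf-8"), hashlib.sha256).hexdigest()

def is_valid(session_id, token):
    """Check a submitted token using a constant-time comparison."""
    expected = issue_token(session_id)
    return hmac.compare_digest(expected.encode("utf-8"), token.encode("utf-8"))

token = issue_token("session-abc")
print(is_valid("session-abc", token))    # True
print(is_valid("session-other", token))  # False
```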
Screen Scraping
…it’ll still happen
Why?
• Your data is amazing
• You don’t have an API
• Backdoor feature request
• Hackers!
Scraping happens
but who does it?
Is it worth it?
Costs
• Development time
• Anchoring bad features
• Never stops
• Alienating good engineers
What can you do?
• No, really, don’t use IDs.
• Layout changes do break scraping
• Integrate attack vectors as new product features
• Don’t panic
• Scraping on an individual level is a query
• Scraping in a cluster is an attack
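The last two bullets suggest treating volume, not scraping itself, as the signal. A minimal sketch of that idea is a sliding-window counter per client key. Everything here is an assumption for illustration: the `RequestTracker` class, the window and threshold values, and the choice of key (an IP address, a subnet, or an account; a distributed cluster would likely need a coarser key than a single IP).

```python
from collections import defaultdict, deque

class RequestTracker:
    """Classify traffic per client key using a sliding time window."""

    def __init__(self, window_seconds=60.0, burst_threshold=100):
        self.window = window_seconds
        self.threshold = burst_threshold
        self.hits = defaultdict(deque)  # client key -> timestamps in the window

    def record(self, client_key, now):
        """Record a request at time `now` (in seconds) and return a label."""
        timestamps = self.hits[client_key]
        timestamps.append(now)
        # Drop requests that have fallen out of the window.
        while timestamps and now - timestamps[0] > self.window:
            timestamps.popleft()
        return "attack" if len(timestamps) > self.threshold else "query"

tracker = RequestTracker(window_seconds=10, burst_threshold=3)
labels = [tracker.record("203.0.113.7", t) for t in (0, 1, 2, 3)]
print(labels)  # ['query', 'query', 'query', 'attack']
```

A caller could respond to the "attack" label with throttling or a challenge rather than an outright block, which keeps occasional individual lookups working.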