Fun Learning Web Scraping - QueensJS - 9/2/2015

Fun Learning Web Scraping

queensjs.s02e02.mp4

Danny Garcia

@buzzedword

• Director of Engineering/Operations at ClassPass

• Full stack engineer, DevOps day to day

• First love is JavaScript

Screen Scraping

You should know about it

Simple Scraping

• Layout dependent

• Difficult to traverse

• Error prone

• RegEx not made for XML

• Doesn’t support AJAX
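
To make "layout dependent" and "error prone" concrete, here is a minimal sketch of a RegEx scraper. The markup, the `talk` class name, and the function name are all invented for illustration; none of them come from the talk.

```python
import re

# Hypothetical page markup, invented for this example.
PAGE_V1 = (
    '<ul><li class="talk">Web Scraping</li>'
    '<li class="talk">Node Streams</li></ul>'
)

# The same content after a small layout change:
# an extra attribute and a nested <b> tag.
PAGE_V2 = (
    '<ul><li data-id="1" class="talk"><b>Web Scraping</b></li>'
    '<li data-id="2" class="talk"><b>Node Streams</b></li></ul>'
)

# The pattern hard-codes the exact tag text, so it depends on the layout.
TALK_PATTERN = re.compile(r'<li class="talk">([^<]+)</li>')


def scrape_titles(html: str) -> list[str]:
    """Return every talk title the pattern can find in the raw HTML."""
    return TALK_PATTERN.findall(html)


print(scrape_titles(PAGE_V1))  # finds both titles
print(scrape_titles(PAGE_V2))  # finds nothing after the layout change
```

The second call fails silently: no exception, just an empty list, which is why this style of scraper is easy to break and hard to trust.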

Advanced Scraping

• Layout dependent

• Doesn’t support AJAX
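
A rough sketch of the parser-based approach, assuming Python's standard `html.parser` module stands in for whatever parser a real scraper would use. The markup, the `talk` class, and the `/_ajax/talks` route are hypothetical. The parser tolerates extra attributes and nested tags, but it still relies on the class name, and it only sees what is in the static HTML.

```python
from html.parser import HTMLParser

# Hypothetical markup with extra attributes and nested tags.
STATIC_PAGE = (
    '<ul><li data-id="1" class="talk"><b>Web Scraping</b></li>'
    '<li data-id="2" class="talk"><b>Node Streams</b></li></ul>'
)

# Hypothetical page whose list is filled in later by a script.
AJAX_PAGE = (
    '<ul id="talks"></ul>'
    '<script>fetch("/_ajax/talks").then(render)</script>'
)


class TalkTitleParser(HTMLParser):
    """Collect the text inside every <li> whose class list contains "talk".

    Simplification: unclosed void tags such as <br> inside a talk item
    would confuse the depth counter.
    """

    def __init__(self):
        super().__init__()
        self.titles = []
        self._depth = 0    # > 0 while inside a matching <li>
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if self._depth:
            self._depth += 1
        elif tag == "li" and "talk" in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.titles.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def parse_titles(html: str) -> list[str]:
    parser = TalkTitleParser()
    parser.feed(html)
    parser.close()
    return parser.titles


print(parse_titles(STATIC_PAGE))  # both titles, despite the changed markup
print(parse_titles(AJAX_PAGE))    # nothing: the script is never executed
```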

Browser Emulation

• Layout dependent

• Slow

Screen Scraping

You can beat it

Don’t make it easy

• Are you using IDs?

• Are you HTML templating?

• How identifiable are your components?

• Randomly break your layout to stop scripts
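
One possible way to make components less identifiable is to replace stable class names with random ones on every build or deploy. This is only a sketch under that assumption; the helper names and the `talk` / `talk-list` names are hypothetical, and a real site would need its CSS and JavaScript generated from the same mapping.

```python
import html
import secrets


def build_class_map(names):
    """Map each semantic class name to a fresh random token."""
    return {name: "c" + secrets.token_hex(4) for name in names}


def render_talks(titles, class_map):
    """Render a list of talks using only the randomized class names."""
    items = "".join(
        f'<li class="{class_map["talk"]}">{html.escape(title)}</li>'
        for title in titles
    )
    return f'<ul class="{class_map["talk-list"]}">{items}</ul>'


class_map = build_class_map(["talk", "talk-list"])
page = render_talks(["Web Scraping", "Node Streams"], class_map)
print(page)
```

A scraper that selects on `class="talk"` finds nothing in this output, and a selector copied from one build stops matching after the next.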

All about AJAX

• Defeats most RegEx scrapers

• Defeats most XML parsers

• Browsers can render AJAX

• Don’t use easy to access AJAX routes

• Instead of https://queensjs.com/page/2

• use https://queensjs.com/_ajax/pagination&?q=1

Know your enemy!

• CSRF tokens for fields

• User authentication required

• Use a paywall to discourage anonymity

• Trace IP addresses

• DMCA takedown request
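
As a rough illustration of the CSRF-token bullet, the sketch below binds a token to a session with an HMAC, so a script cannot submit a form without first loading the page inside a valid session. The secret, function names, and session IDs are assumptions made for this example, not something shown in the talk.

```python
import hashlib
import hmac
import secrets

# Hypothetical per-deployment secret; a real app would load it from config.
SERVER_SECRET = secrets.token_bytes(32)


def issue_csrf_token(session_id: str) -> str:
    """Derive the token to embed in forms rendered for this session."""
    return hmac.new(SERVER_SECRET, session_id.encode(), hashlib.sha256).hexdigest()


def is_valid_csrf_token(session_id: str, token: str) -> bool:
    """Check a submitted token in constant time (token must be ASCII text)."""
    expected = issue_csrf_token(session_id)
    return hmac.compare_digest(expected, token)


token = issue_csrf_token("session-abc")
print(is_valid_csrf_token("session-abc", token))    # True
print(is_valid_csrf_token("session-other", token))  # False
```

This raises the cost of scripted form submissions rather than preventing them: a scraper that keeps a session and parses the token out of the page can still get through.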

Screen Scraping

…it’ll still happen

Why?

• Your data is amazing

• You don’t have an API

• Backdoor feature request

• Hackers!

Scraping happens

but who does it?

Is it worth it?

Costs

• Development time

• Anchoring bad features

• Never stops

• Alienating good engineers

What can you do?

• No, really, don’t use IDs.

• Layout changes do break scraping

• Integrate attack vectors as new product features

• Don’t panic

• Scraping on an individual level is a query

• Scraping in a cluster is an attack
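
One loose reading of the query-versus-attack distinction is request volume: a handful of requests from a client looks like a query, a sustained burst looks like an attack. The sketch below counts requests per client key in a sliding time window. The class name, window, and threshold are all hypothetical, and a real cluster spread across many IP addresses would need a broader key than a single address.

```python
from collections import defaultdict, deque


class RequestTracker:
    """Flag a client once it exceeds `threshold` requests per window."""

    def __init__(self, window_seconds=60.0, threshold=100):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._hits = defaultdict(deque)  # client key -> request timestamps

    def record(self, client, now):
        """Record one request at time `now`; return True if over threshold."""
        hits = self._hits[client]
        hits.append(now)
        # Drop timestamps that have fallen out of the window.
        while hits and hits[0] <= now - self.window_seconds:
            hits.popleft()
        return len(hits) > self.threshold


tracker = RequestTracker(window_seconds=10, threshold=3)
for t in (0, 1, 2):
    print(tracker.record("203.0.113.7", t))  # False: still looks like a query
print(tracker.record("203.0.113.7", 3))      # True: fourth hit inside the window
```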

Thank you!

Danny Garcia

@buzzedword

ClassPass

https://classpass.com/jobs