Ruby Robots http://www.flickr.com/photos/flysi/183272970 Daniel Cukier @danicuki

DESCRIPTION

A talk on building web robots in the Ruby programming language with rest-client, Nokogiri, and Mechanize, given at the rsonrails event.

Page 1: Ruby Robots

Ruby Robots

http://www.flickr.com/photos/flysi/183272970

Daniel Cukier @danicuki

Page 2: Ruby Robots
Page 3: Ruby Robots

Relatives

• spiders

• crawlers

• bots

Page 4: Ruby Robots

Why robot?

Page 6: Ruby Robots

require 'anemone'

Anemone.crawl(url) do |anemone|
  anemone.on_every_page do |page|
    puts page.url
  end
end

http://www.cantora.mus.br/
http://www.cantora.mus.br/fotos
http://www.cantora.mus.br/?locale=en
http://www.cantora.mus.br/?locale=pt-BR
http://www.cantora.mus.br/musicas
http://www.cantora.mus.br/videos
http://www.cantora.mus.br/agenda
http://www.cantora.mus.br/novidades
http://www.cantora.mus.br/musicas/baixar
http://www.cantora.mus.br/visitors/baixar
http://www.cantora.mus.br/social
http://www.cantora.mus.br/fotos?locale=pt-BR
http://www.cantora.mus.br/musicas?locale=en
http://www.cantora.mus.br/fotos?locale=en

Page 7: Ruby Robots
Page 8: Ruby Robots

XPath

<html>
...
<div class="bla">
  <a>legal</a>
</div>
...
</html>

html_doc = Nokogiri::HTML(html)
info = html_doc.xpath("//div[@class='bla']/a")
info.text
=> "legal"

Page 9: Ruby Robots

XPath

<table id="super">
  <tr> <td>L1C1</td> <td>L1C2</td> </tr>
  <tr> <td>L2C1</td> <td>L2C2</td> </tr>
  <tr> <td>L3C1</td> <td>L3C2</td> </tr>
</table>

>> html_doc = Nokogiri::HTML(html)
>> info = html_doc.xpath("//table[@id='super']/tr")
>> info.size
=> 3

>> info[0].xpath("td").size
=> 2

>> info[2].xpath("td")[1].text
=> "L3C2"

Page 10: Ruby Robots

rest-client

Page 11: Ruby Robots

http://www.flickr.com/photos/amortize/766738216

GET

GET

Page 13: Ruby Robots

Good bot

/robots.txt

User-agent: *

Disallow:

http://www.flickr.com/photos/temily/5645585162
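The robots.txt on this slide ("User-agent: *" with an empty "Disallow:") allows everything. A minimal sketch of how a well-behaved bot might check a path before fetching it, using only the Ruby standard library. The helper names `disallowed_prefixes` and `allowed?` are hypothetical, and the sketch assumes the simplest case only: rules under "User-agent: *" and plain prefix matching, with no Allow rules, wildcards, or grouped user agents.

```ruby
# Collect the Disallow prefixes that apply to every bot ("User-agent: *").
# Hypothetical helper: ignores Allow rules, wildcards and per-bot groups.
def disallowed_prefixes(robots_txt)
  prefixes = []
  applies = false
  robots_txt.each_line do |line|
    line = line.sub(/#.*/, "").strip          # drop comments and whitespace
    next if line.empty?
    field, value = line.split(":", 2).map(&:strip)
    case field.downcase
    when "user-agent"
      applies = (value == "*")
    when "disallow"
      # An empty Disallow value means "nothing is disallowed".
      prefixes << value if applies && value && !value.empty?
    end
  end
  prefixes
end

# A path is allowed when no collected prefix matches its beginning.
def allowed?(robots_txt, path)
  disallowed_prefixes(robots_txt).none? { |prefix| path.start_with?(prefix) }
end

open_site   = "User-agent: *\nDisallow:\n"          # the file from the slide
closed_area = "User-agent: *\nDisallow: /private\n" # made-up example

puts allowed?(open_site, "/fotos")           # prints true
puts allowed?(closed_area, "/private/data")  # prints false
puts allowed?(closed_area, "/fotos")         # prints true
```

In a real bot the robots.txt body would come from a GET to "/robots.txt" on the target site before any other request.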

Page 14: Ruby Robots

Ruby Robots

http://www.flickr.com/photos/flysi/183272970

Daniel Cukier @danicuki

Page 16: Ruby Robots
Page 17: Ruby Robots

maxRowsList=16

WTF?

Page 18: Ruby Robots
Page 19: Ruby Robots
Page 20: Ruby Robots

>> body = RestClient.get(url)
>> json = JSON.parse(body)
>> content = json["Content"]
>> content.size
=> 16

AHA!!!

http://.../artistas?maxRowsList=1600&filter=Recentes

>> body = RestClient.get(url)
>> json = JSON.parse(body)
>> content = json["Content"]
>> content.size
=> 1600

http://.../artistas?maxRowsList=1600000&filter=Recentes

>> content.size
=> 9154

Bingo!!!
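The trick on these slides is just changing one query-string parameter and re-parsing the JSON. A minimal sketch of that step with the standard `uri` and `json` libraries. The host `example.com` is a placeholder (the slides elide the real one), `listing_url` and `profile_urls` are hypothetical helper names, and the sample body is made up; only the "Content" and "ProfileUrl" keys come from the slides.

```ruby
require "json"
require "uri"

# Build the listing URL with a chosen page size.
# "http://example.com/artistas" is a placeholder; the slides elide the host.
def listing_url(base, max_rows, filter = "Recentes")
  uri = URI.parse(base)
  uri.query = URI.encode_www_form("maxRowsList" => max_rows, "filter" => filter)
  uri.to_s
end

# Pull the profile names out of a response body shaped like the slides' JSON.
def profile_urls(body)
  JSON.parse(body).fetch("Content", []).map { |c| c["ProfileUrl"] }
end

puts listing_url("http://example.com/artistas", 1600)
# prints http://example.com/artistas?maxRowsList=1600&filter=Recentes

# Made-up sample body, only to show the parsing step without network access.
sample = '{"Content":[{"ProfileUrl":"caravella"},{"ProfileUrl":"tomleite"}]}'
p profile_urls(sample)   # prints ["caravella", "tomleite"]
```

With rest-client, the body passed to `profile_urls` would be the result of `RestClient.get(listing_url(...))`.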

Page 21: Ruby Robots

>> json["Content"].map {|c| c["ProfileUrl"]}
=> ["caravella", "tomleite", "jeffersontavares", "rodrigoaraujo", "jorgemendes", "bossapunk", "daviddepiro", "freetools", "ironia", "tiagorosa", "outprofile", "lucianokoscky", "bandateatraldecarona", "tlounge", "almanaque", "razzyoficial", "cretinosecanalhas", "cincorios", "ninoantunes", "caiocorsalette", "alinedelima", "thelio", "grupodomdesamba", "ladoz", "alexandrepontes", "poeiradgua", "betimalu", "leonardobessa", "kamaross", "marcusdocavaco", "atividadeinformal", "angelkeys", "locojohn", "forcamusic", "tiaguinhoabreu", "marcelonegrao", "jstonemghiphop", "uniaoglobal", "bandaefex", "severarock", "manitu", "sasso", "kakka", "xsopretty", "belepoke", "caixaazul", "wknd", "bandastarven", "bleiamusic", "3porcentoaocubo", "lucianoterra", "hipnoia", "influencianegra", "bandaursamaior", "mariafreitas", "jessejames", "vagnerrockxe", "stageo3", "lemoneight", "innocence", "dinda", "marcelocapela", "paulocamoeseoslusiadas", "magnussrock", "bandatheburk", "mercantes", "bandaturnerock", "flaviasaolli", "tonysagga", "thiagoponde", "centeio", "grupodeubranco", "bocadeleao", "eusoueliascardan", "notoriaoficial", "planomasterrock", "rofgod", "dreemonphc", "chicobrant", "osz", "bandalightspeed", "cavernadenarnia", "sergiobenevenuto", "viniciusdeoliveira", ...]

Page 22: Ruby Robots

email? phone?

Page 23: Ruby Robots
Page 24: Ruby Robots

>> html = RestClient.get("http://.../robomacaco")
>> html_doc = Nokogiri::HTML(html)
>> info = html_doc.xpath("//span[@class='name']")
>> info.text
=> "[email protected] DE JANEIRO - RJ - Brasil21 9675-0199"

Page 25: Ruby Robots
Page 26: Ruby Robots

cookies

cookies = {}
c = "s_nr=12954999; s_v19=12978609471; ... __utmc=206845458"
# split on the first "=" only, so values that contain "=" stay intact
cook = c.split(";").map { |i| i.strip.split("=", 2) }
cook.each { |u| cookies[u[0]] = u[1] }

RestClient.get(url, :cookies => cookies)

Page 27: Ruby Robots

Proxies

Page 29: Ruby Robots
Page 30: Ruby Robots
Page 31: Ruby Robots
Page 32: Ruby Robots

>> response = RestClient.get(url)
>> html_doc = Nokogiri::HTML(response)
>> table = html_doc.xpath("//table[@class='proxylist']")
>> lines = table.children
>> lines.shift # drop the header row

>> lines[1].text
=> "208.52.144.55 document.write(\":\"+i+r+i+r) anonymous proxy server-2 minutes ago United States"

Text

IP WTF?

Page 33: Ruby Robots

<script type="text/javascript"> z=5;i=8;x=4;l=1;o=9;q=6;n=3;u=2;k=7;r=0;</script>

Page 34: Ruby Robots

JAVASCRIPT = RUBY

http://www.flickr.com/photos/drics/4266471776/

Page 35: Ruby Robots

<script type="text/javascript"> z=5;i=8;x=4;l=1;o=9;q=6;n=3;u=2;k=7;r=0;</script>

>> script = html_doc.xpath("//script")[1]
>> eval script.text
>> z
=> 5
>> i
=> 8

Page 36: Ruby Robots

>> lines[1].text
=> "208.52.144.55 document.write(\":\"+i+r+i+r) anonymous proxy server-2 minutes ago United States"

>> digits = lines[1].text.split(")")[0].split("+")
=> ["208.52.144.55 document.write(\":\"", "i", "r", "i", "r"]
>> digits.shift
>> digits
=> ["i", "r", "i", "r"]
>> port = digits.map {|c| eval(c)}.join("")
=> "8080"

>> server = lines[1].text.split[0]
=> "208.52.144.55"

RestClient.proxy = "http://#{server}:#{port}"

Voilà
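Running `eval` on text scraped from a remote page executes whatever that page sends. Since the script here is only a run of `name=digit;` assignments, a regular expression can read them into a Hash instead. A minimal sketch of that alternative, not the approach shown on the slides: `parse_digit_table` and `decode_port` are hypothetical helper names, and the input strings are copied from the slides.

```ruby
# Read "z=5;i=8;..." style assignments into a Hash instead of eval-ing them.
def parse_digit_table(script_text)
  script_text.scan(/([a-z])\s*=\s*(\d)/).to_h
end

# Turn 'document.write(":"+i+r+i+r)' into "8080" using the digit table.
def decode_port(cell_text, table)
  expr = cell_text[/document\.write\((.*?)\)/, 1]
  return nil unless expr
  # The first "+"-separated piece is the ":" literal; the rest are variable names.
  expr.split("+").drop(1).map { |name| table.fetch(name.strip) }.join
end

script = "z=5;i=8;x=4;l=1;o=9;q=6;n=3;u=2;k=7;r=0;"
cell   = '208.52.144.55 document.write(":"+i+r+i+r) anonymous proxy server-2 minutes ago United States'

table  = parse_digit_table(script)
server = cell.split.first
port   = decode_port(cell, table)

puts "http://#{server}:#{port}"   # prints http://208.52.144.55:8080
```

`table.fetch` raises a KeyError on an unknown variable name, so a changed page layout fails loudly instead of producing a wrong port.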

Page 37: Ruby Robots

require 'mechanize'

agent = Mechanize.new
site = "http://www.cantora.mus.br"
page = agent.get("#{site}/baixar")
form = page.forms.first # first form on the page
form['visitor[name]'] = 'daniel'
form['visitor[email]'] = "[email protected]"
page = agent.submit(form)
# keep only the links that point to tracks
tracks = page.links.select { |l| l.href =~ /track/ }
tracks.each do |t|
  file = agent.get("#{site}#{t.href}")
  file.save
end

mechanize

Page 38: Ruby Robots

protection techniques

javascript

text as image

captcha

don’t be naive

Page 39: Ruby Robots

captcha

YES you can!

prove you are not a robot

Page 40: Ruby Robots

3 steps

1. Download image
2. Filter image
3. Run OCR software

Page 41: Ruby Robots

Good Luck!

Page 42: Ruby Robots

scaling

http://www.flickr.com/photos/liquene/3330714590

Page 43: Ruby Robots

clouds

$ knife ec2 server create

Page 44: Ruby Robots

threads + queues
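One common way to combine the two is a worker pool: a fixed number of threads pull URLs from a shared queue. A minimal sketch using only Ruby's built-in `Queue` and `Thread`, assuming Ruby 2.3 or later for `Queue#close`. The name `crawl_in_parallel` is hypothetical, and `fake_fetch` is a stand-in for a real request such as `RestClient.get(url)` so the sketch runs without network access.

```ruby
# Worker pool: a fixed number of threads pull URLs from a shared queue.
# `fetcher` is any callable that takes a URL and returns a result.
def crawl_in_parallel(urls, worker_count, fetcher)
  jobs = Queue.new
  urls.each { |url| jobs << url }
  jobs.close                      # pop returns nil once the queue is drained

  results = Queue.new             # thread-safe place to collect results
  workers = Array.new(worker_count) do
    Thread.new do
      while (url = jobs.pop)
        results << [url, fetcher.call(url)]
      end
    end
  end
  workers.each(&:join)

  # All workers are done, so draining the results queue is safe here.
  Array.new(results.size) { results.pop }.to_h
end

urls = %w[
  http://www.cantora.mus.br/fotos
  http://www.cantora.mus.br/musicas
  http://www.cantora.mus.br/videos
]
fake_fetch = ->(url) { url.length }   # stand-in for RestClient.get(url)

pages = crawl_in_parallel(urls, 2, fake_fetch)
puts pages.size   # prints 3
```

The worker count caps how many requests are in flight at once, which also keeps the load on the target site bounded.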

Page 45: Ruby Robots
Page 46: Ruby Robots

In this crazy programmer's life
I run into every kind of situation
Suddenly a client, a hefty offer
To grab information from a site
You're crazy, I don't commit that kind of crime
If you want, I have some friends down south
Do it for me and I'll pay you with this cool jewel

I'll give you a ruby
For you to steal
With your robot

Want to build a robot?
Just use Ruby
Just use Ruby
To build a robot

http://www.flickr.com/photos/jobafunky/5572503988

Page 47: Ruby Robots

Thank you

Daniel Cukier @danicuki