Ruby Robots

http://www.flickr.com/photos/flysi/183272970

Daniel Cukier @danicuki



Relatives

• spiders

• crawlers

• bots

Why robot?

http://www.flickr.com/photos/nhankamer/5016628611



require 'anemone'

Anemone.crawl(url) do |anemone|
  anemone.on_every_page do |page|
    puts page.url
  end
end

http://www.cantora.mus.br/

http://www.cantora.mus.br/fotos
http://www.cantora.mus.br/?locale=en
http://www.cantora.mus.br/?locale=pt-BR
http://www.cantora.mus.br/musicas
http://www.cantora.mus.br/videos
http://www.cantora.mus.br/agenda
http://www.cantora.mus.br/novidades
http://www.cantora.mus.br/musicas/baixar
http://www.cantora.mus.br/visitors/baixar
http://www.cantora.mus.br/social
http://www.cantora.mus.br/fotos?locale=pt-BR
http://www.cantora.mus.br/musicas?locale=en
http://www.cantora.mus.br/fotos?locale=en

XPath

<html>
...
<div class="bla">
  <a>legal</a>
</div>
...
</html>

require 'nokogiri'

html_doc = Nokogiri::HTML(html)
info = html_doc.xpath("//div[@class='bla']/a")
info.text
=> "legal"

XPath

<table id="super">
  <tr> <td>L1C1</td> <td>L1C2</td> </tr>
  <tr> <td>L2C1</td> <td>L2C2</td> </tr>
  <tr> <td>L3C1</td> <td>L3C2</td> </tr>
</table>

>> html_doc = Nokogiri::HTML(html)
>> info = html_doc.xpath("//table[@id='super']/tr")
>> info.size
=> 3

>> info[0].xpath("td").size
=> 2

>> info[2].xpath("td")[1].text
=> "L3C2"

rest-client

http://www.flickr.com/photos/amortize/766738216

GET




http://www.flickr.com/photos/abbeychristine/223898960



Good bot

/robots.txt

User-agent: *

Disallow:
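A good bot reads /robots.txt before crawling. The sketch below is a deliberately simplified reader, not a full implementation of the protocol: it only understands User-agent and Disallow lines and ignores Allow rules and wildcards. The method names disallowed_paths and allowed? are made up for this example.

```ruby
# Simplified robots.txt reader (assumption: only User-agent and Disallow matter).
# Returns the Disallow prefixes that apply to the given user agent.
def disallowed_paths(robots_txt, user_agent = "*")
  rules = []
  applies = false
  previous_was_agent = false
  robots_txt.each_line do |line|
    # Drop comments, then split "Field: value" into its two parts.
    field, value = line.sub(/#.*/, "").strip.split(":", 2).map(&:strip)
    next if value.nil?
    if field.downcase == "user-agent"
      matches = value == "*" || value.casecmp?(user_agent)
      # Consecutive User-agent lines form one group.
      applies = previous_was_agent ? (applies || matches) : matches
      previous_was_agent = true
    else
      # An empty Disallow value means "nothing is forbidden".
      rules << value if field.downcase == "disallow" && applies && !value.empty?
      previous_was_agent = false
    end
  end
  rules
end

# True when no applicable Disallow prefix matches the path.
def allowed?(robots_txt, path, user_agent = "*")
  disallowed_paths(robots_txt, user_agent).none? { |prefix| path.start_with?(prefix) }
end

open_site = "User-agent: *\nDisallow:\n"
puts allowed?(open_site, "/fotos")
```

With the file shown on the slide (an empty Disallow), every path is allowed.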

http://www.flickr.com/photos/temily/5645585162








http://www.flickr.com/photos/nephelim/5632618462



maxRowsList=16

WTF?

>> body = RestClient.get(url)
>> json = JSON.parse(body)
>> content = json["Content"]
>> content.size
=> 16

AHA!!!

http://.../artistas?maxRowsList=1600&filter=Recentes

>> body = RestClient.get(url)
>> json = JSON.parse(body)
>> content = json["Content"]
>> content.size
=> 1600

http://.../artistas?maxRowsList=1600000&filter=Recentes

>> content.size
=> 9154

Bingo!!!

http://www.oinovosom.com.br/artistas?_=1307730230027&page=2&maxRowsList=1600&filter=Recentes




>> b["Content"].map {|c| c["ProfileUrl"]}
=> ["caravella", "tomleite", "jeffersontavares", "rodrigoaraujo", "jorgemendes", "bossapunk", "daviddepiro", "freetools", "ironia", "tiagorosa", "outprofile", "lucianokoscky", "bandateatraldecarona", "tlounge", "almanaque", "razzyoficial", "cretinosecanalhas", "cincorios", "ninoantunes", "caiocorsalette", "alinedelima", "thelio", "grupodomdesamba", "ladoz", "alexandrepontes", "poeiradgua", "betimalu", "leonardobessa", "kamaross", "marcusdocavaco", "atividadeinformal", "angelkeys", "locojohn", "forcamusic", "tiaguinhoabreu", "marcelonegrao", "jstonemghiphop", "uniaoglobal", "bandaefex", "severarock", "manitu", "sasso", "kakka", "xsopretty", "belepoke", "caixaazul", "wknd", "bandastarven", "bleiamusic", "3porcentoaocubo", "lucianoterra", "hipnoia", "influencianegra", "bandaursamaior", "mariafreitas", "jessejames", "vagnerrockxe", "stageo3", "lemoneight", "innocence", "dinda", "marcelocapela", "paulocamoeseoslusiadas", "magnussrock", "bandatheburk", "mercantes", "bandaturnerock", "flaviasaolli", "tonysagga", "thiagoponde", "centeio", "grupodeubranco", "bocadeleao", "eusoueliascardan", "notoriaoficial", "planomasterrock", "rofgod", "dreemonphc", "chicobrant", "osz", "bandalightspeed", "cavernadenarnia", "sergiobenevenuto", "viniciusdeoliveira", ...]
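The maxRowsList trick can be wrapped in two small helpers. This is only a sketch using the standard library: artists_url and profile_urls are names invented here, and the sample JSON is a hand-written stand-in that assumes the response has the "Content" and "ProfileUrl" keys seen in the session above.

```ruby
require "json"
require "uri"

# Builds the listing URL with the row limit and filter as query parameters.
def artists_url(base, max_rows:, filter: "Recentes", page: nil)
  params = { "maxRowsList" => max_rows, "filter" => filter }
  params["page"] = page if page
  "#{base}?#{URI.encode_www_form(params)}"
end

# Pulls the profile slug out of every entry under "Content".
def profile_urls(json_body)
  JSON.parse(json_body).fetch("Content", []).map { |artist| artist["ProfileUrl"] }
end

# Hand-written sample response, for illustration only.
sample = '{"Content":[{"ProfileUrl":"caravella"},{"ProfileUrl":"tomleite"}]}'

puts artists_url("http://www.oinovosom.com.br/artistas", max_rows: 1600)
puts profile_urls(sample).inspect
```

In a real run the body would come from RestClient.get(artists_url(...)) instead of the sample string.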

email? phone?

>> html = RestClient.get("http://.../robomacaco")
>> html_doc = Nokogiri::HTML(html)
>> info = html_doc.xpath("//span[@class='name']")
>> info.text
=> "[email protected] DE JANEIRO - RJ - Brasil21 9675-0199"

http://www.oinovosom.com.br/daniellaalcarpe


mailto:%[email protected]


cookies

cookies = {}
c = "s_nr=12954999; s_v19=12978609471; ... __utmc=206845458"
cook = c.split(";").map {|i| i.strip.split("=")}
cook.each {|u| cookies[u[0]] = u[1]}

RestClient.get(url, :cookies => cookies)
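The same cookie parsing can be written as one small method. A minimal sketch: parse_cookie_header is a name invented here, and the only behavioral difference from the slide is split("=", 2), which keeps a value intact when it contains an "=" sign itself.

```ruby
# Turns a browser cookie string ("a=1; b=2") into a Hash for :cookies.
def parse_cookie_header(header)
  header.split(";").each_with_object({}) do |pair, cookies|
    # Limit the split to 2 parts so "=" inside the value is preserved.
    name, value = pair.strip.split("=", 2)
    cookies[name] = value unless name.nil? || name.empty?
  end
end

cookies = parse_cookie_header("s_nr=12954999; s_v19=12978609471; __utmc=206845458")
puts cookies.inspect
```

The resulting Hash can be passed as RestClient.get(url, :cookies => cookies), as on the slide.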

Proxies

http://www.ip-adress.com/proxy_list



>> response = RestClient.get(url)
>> html_doc = Nokogiri::HTML(response)
>> table = html_doc.xpath("//table[@class='proxylist']")
>> lines = table.children
>> lines.shift # drop the header row

>> lines[1].text
=> "208.52.144.55 document.write(\":\"+i+r+i+r) anonymous proxy server-2 minutes ago United States"


IP WTF?

<script type="text/javascript"> z=5;i=8;x=4;l=1;o=9;q=6;n=3;u=2;k=7;r=0;</script>

JAVASCRIPT = RUBY

http://www.flickr.com/photos/drics/4266471776/



<script type="text/javascript"> z=5;i=8;x=4;l=1;o=9;q=6;n=3;u=2;k=7;r=0;</script>

>> script = html_doc.xpath("//script")[1]
>> eval script.text
>> z
=> 5
>> i
=> 8

>> digits = lines[1].text.split(")")[0].split("+")
=> ["208.52.144.55document.write(\":\"", "i", "r", "i", "r"]
>> digits.shift
>> digits
=> ["i", "r", "i", "r"]
>> port = digits.map {|c| eval(c)}.join("")
=> "8080"

>> lines[1].text
=> "208.52.144.55 document.write(\":\"+i+r+i+r) anonymous proxy server-2 minutes ago United States"

>> server = lines[1].text.split[0]
=> "208.52.144.55"

Voilà

RestClient.proxy = "http://#{server}:#{port}"
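The eval trick works in an irb session, but it runs whatever text the remote page sends, and in a script file the variables assigned inside eval are not visible to the code that follows. As a sketch of an alternative, the letter-to-digit table can be read with a regular expression instead. The method names parse_digit_table and decode_proxy are invented here; the inputs are the script and the row text shown above.

```ruby
# Reads "z=5;i=8;..." into {"z"=>"5", "i"=>"8", ...} without calling eval.
def parse_digit_table(script_text)
  script_text.scan(/([a-z])=(\d)/).to_h
end

# Extracts [server, port] from a row such as
# "208.52.144.55 document.write(\":\"+i+r+i+r) anonymous proxy ..."
def decode_proxy(row_text, digit_table)
  server = row_text[/\A\s*(\d{1,3}(?:\.\d{1,3}){3})/, 1]
  # Letters that follow a "+" inside document.write(...) spell the port.
  letters = row_text[/document\.write\(([^)]*)\)/, 1].to_s.scan(/\+([a-z])/).flatten
  port = letters.map { |letter| digit_table.fetch(letter) }.join
  [server, port]
end

script = "z=5;i=8;x=4;l=1;o=9;q=6;n=3;u=2;k=7;r=0;"
row = "208.52.144.55 document.write(\":\"+i+r+i+r) anonymous proxy server-2 minutes ago United States"

server, port = decode_proxy(row, parse_digit_table(script))
puts "http://#{server}:#{port}"
```

The printed value is the string to assign to RestClient.proxy.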

mechanize

require 'mechanize'

agent = Mechanize.new
site = "http://www.cantora.mus.br"
page = agent.get("#{site}/baixar")
form = page.form
form['visitor[name]'] = 'daniel'
form['visitor[email]'] = "[email protected]"
page = agent.submit(form)
# keep only the links that point to tracks
tracks = page.links.select { |l| l.href =~ /track/ }
tracks.each do |t|
  # t.href is the link target; interpolating t alone would give the link text
  file = agent.get("#{site}#{t.href}")
  file.save
end

protection techniques

javascript

text as image

captcha

don’t be naive

captcha

YES you can!

prove you are not a robot

3 steps

1. Download image
2. Filter image
3. Run OCR software
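The three steps above can be sketched as follows. Everything here is an assumption rather than the talk's own code: it shells out to ImageMagick's convert and to the tesseract command-line tool (both must be installed and on the PATH), the 50% threshold is an arbitrary starting point, and the method names are invented. Real captchas usually need more filtering than this.

```ruby
require "net/http"
require "open3"
require "uri"

# Step 1: download the captcha image to a local file.
def download_image(url, path)
  File.binwrite(path, Net::HTTP.get(URI(url)))
  path
end

# Step 2: ImageMagick command that reduces the image to black and white.
def filter_command(input, output)
  ["convert", input, "-colorspace", "Gray", "-threshold", "50%", output]
end

# Step 3: Tesseract command that prints the recognized text to stdout.
def ocr_command(image)
  ["tesseract", image, "stdout"]
end

# Runs steps 2 and 3 on a downloaded image and returns the recognized text.
def read_captcha(input, filtered = "filtered.png")
  _output, status = Open3.capture2(*filter_command(input, filtered))
  raise "image filter failed" unless status.success?
  text, status = Open3.capture2(*ocr_command(filtered))
  raise "OCR failed" unless status.success?
  text.strip
end
```

A call would look like read_captcha(download_image(captcha_url, "captcha.png")), where captcha_url is whatever image address the target form uses.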

Good Luck!

scaling

http://www.flickr.com/photos/liquene/3330714590



clouds

$ knife ec2 server create

threads + queues
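A minimal sketch of threads plus a queue, using only Ruby's built-in Thread and Queue. The method name process_in_parallel and the default of four workers are choices made for this example; in a robot, the block would fetch and parse one URL.

```ruby
# Runs the block once per job on a fixed pool of worker threads
# and returns a Hash of job => result. Jobs must not be nil or false.
def process_in_parallel(jobs, workers: 4)
  queue = Queue.new
  jobs.each { |job| queue << job }
  queue.close # after close, pop returns nil once the queue is empty

  results = Queue.new
  threads = Array.new(workers) do
    Thread.new do
      while (job = queue.pop)
        results << [job, yield(job)]
      end
    end
  end
  threads.each(&:join)

  # Drain the results queue; completion order is not guaranteed.
  Array.new(results.size) { results.pop }.to_h
end

pages = ["/fotos", "/musicas", "/videos", "/agenda"]
sizes = process_in_parallel(pages, workers: 2) { |path| path.size }
puts sizes.inspect
```

A fixed pool keeps the number of simultaneous requests bounded, which also helps the robot stay polite to the target site.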

In this crazy programmer's life
All kinds of situations come my way
Suddenly a client, a hefty proposal
To grab information from a site
You're crazy, I don't commit that kind of crime
If you want, I have some friends down south
Do it for me and I'll pay you with this cool jewel

I'll give you a ruby
For you to steal
With your robot

Want to build a robot?
Just use Ruby
Just use Ruby
To build a robot

http://www.flickr.com/photos/jobafunky/5572503988



Thank you

