A talk about creating web robots in the Ruby programming language using rest-client, Nokogiri, and Mechanize, presented at the rsonrails event.
Ruby Robots
http://www.flickr.com/photos/flysi/183272970
Daniel Cukier @danicuki
Relatives
• spiders
• crawlers
• bots
Why a robot?
http://www.flickr.com/photos/nhankamer/5016628611
require 'anemone'

Anemone.crawl(url) do |anemone|
  anemone.on_every_page do |page|
    puts page.url
  end
end

http://www.cantora.mus.br/
http://www.cantora.mus.br/fotos
http://www.cantora.mus.br/?locale=en
http://www.cantora.mus.br/?locale=pt-BR
http://www.cantora.mus.br/musicas
http://www.cantora.mus.br/videos
http://www.cantora.mus.br/agenda
http://www.cantora.mus.br/novidades
http://www.cantora.mus.br/musicas/baixar
http://www.cantora.mus.br/visitors/baixar
http://www.cantora.mus.br/social
http://www.cantora.mus.br/fotos?locale=pt-BR
http://www.cantora.mus.br/musicas?locale=en
http://www.cantora.mus.br/fotos?locale=en
XPath

<html>
...
<div class="bla">
  <a>legal</a>
</div>
...
</html>

require 'nokogiri'

html_doc = Nokogiri::HTML(html)
info = html_doc.xpath("//div[@class='bla']/a")
info.text
=> "legal"
XPath

<table id="super">
  <tr> <td>L1C1</td> <td>L1C2</td> </tr>
  <tr> <td>L2C1</td> <td>L2C2</td> </tr>
  <tr> <td>L3C1</td> <td>L3C2</td> </tr>
</table>

>> html_doc = Nokogiri::HTML(html)
>> info = html_doc.xpath("//table[@id='super']/tr")
>> info.size
=> 3
>> info[0].xpath("td").size
=> 2
>> info[2].xpath("td")[1].text
=> "L3C2"
rest-client
http://www.flickr.com/photos/amortize/766738216
GET
http://www.flickr.com/photos/abbeychristine/223898960
Good bot
/robots.txt
User-agent: *
Disallow:
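A good bot reads /robots.txt before crawling. The following is a minimal sketch of that check, assuming only `User-agent` and `Disallow` lines matter and that rules are plain path prefixes. The helper names `disallowed_paths` and `allowed?` are made up for illustration and are not part of any library.

```ruby
# Collect the Disallow prefixes that apply to the given user agent.
# Only "User-agent" and "Disallow" fields are handled in this sketch.
def disallowed_paths(robots_txt, agent = '*')
  rules = Hash.new { |hash, key| hash[key] = [] }
  current_agents = []
  last_was_agent = false

  robots_txt.each_line do |line|
    line = line.sub(/#.*/, '').strip          # drop comments and whitespace
    next if line.empty?
    field, value = line.split(':', 2).map(&:strip)
    next if value.nil?

    case field.downcase
    when 'user-agent'
      # Consecutive User-agent lines share one group of rules.
      current_agents = [] unless last_was_agent
      current_agents << value
      last_was_agent = true
    when 'disallow'
      # An empty Disallow value means "nothing is disallowed".
      current_agents.each { |a| rules[a] << value } unless value.empty?
      last_was_agent = false
    else
      last_was_agent = false
    end
  end

  rules.key?(agent) ? rules[agent] : rules['*']
end

# True when no Disallow prefix matches the path.
def allowed?(robots_txt, path, agent = '*')
  disallowed_paths(robots_txt, agent).none? { |prefix| path.start_with?(prefix) }
end

open_site   = "User-agent: *\nDisallow:\n"
closed_area = "User-agent: *\nDisallow: /private\n"

puts allowed?(open_site, '/fotos')
puts allowed?(closed_area, '/private/x')
```

The first file is the one shown on the slide: an empty Disallow allows everything. A real crawler would also need to handle `Allow`, wildcards, and crawl delays, which this sketch leaves out.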
http://www.flickr.com/photos/temily/5645585162
http://www.flickr.com/photos/nephelim/5632618462
maxRowsList=16
WTF?
>> body = RestClient.get(url)
>> json = JSON.parse(body)
>> content = json["Content"]
>> content.size
=> 16
AHA!!!
http://.../artistas?maxRowsList=1600&filter=Recentes
>> body = RestClient.get(url)
>> json = JSON.parse(body)
>> content = json["Content"]
>> content.size
=> 1600

http://.../artistas?maxRowsList=1600000&filter=Recentes
>> content.size
=> 9154
Bingo!!!
>> json["Content"].map { |c| c["ProfileUrl"] }
=> ["caravella", "tomleite", "jeffersontavares", "rodrigoaraujo", "jorgemendes", "bossapunk", "daviddepiro", "freetools", "ironia", "tiagorosa", "outprofile", "lucianokoscky", "bandateatraldecarona", "tlounge", "almanaque", "razzyoficial", "cretinosecanalhas", "cincorios", "ninoantunes", "caiocorsalette", "alinedelima", "thelio", "grupodomdesamba", "ladoz", "alexandrepontes", "poeiradgua", "betimalu", "leonardobessa", "kamaross", "marcusdocavaco", "atividadeinformal", "angelkeys", "locojohn", "forcamusic", "tiaguinhoabreu", "marcelonegrao", "jstonemghiphop", "uniaoglobal", "bandaefex", "severarock", "manitu", "sasso", "kakka", "xsopretty", "belepoke", "caixaazul", "wknd", "bandastarven", "bleiamusic", "3porcentoaocubo", "lucianoterra", "hipnoia", "influencianegra", "bandaursamaior", "mariafreitas", "jessejames", "vagnerrockxe", "stageo3", "lemoneight", "innocence", "dinda", "marcelocapela", "paulocamoeseoslusiadas", "magnussrock", "bandatheburk", "mercantes", "bandaturnerock", "flaviasaolli", "tonysagga", "thiagoponde", "centeio", "grupodeubranco", "bocadeleao", "eusoueliascardan", "notoriaoficial", "planomasterrock", "rofgod", "dreemonphc", "chicobrant", "osz", "bandalightspeed", "cavernadenarnia", "sergiobenevenuto", "viniciusdeoliveira", ...]
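The trick above is just a query parameter the server never validates. A small sketch of the two steps, building the URL and pulling the profile slugs out of the JSON, could look like this. The host `example.com` is a placeholder because the real URL is elided in the slides, and `listing_url` and `profile_urls` are hypothetical helper names. The sample body only mimics the `Content` / `ProfileUrl` shape shown above.

```ruby
require 'json'
require 'uri'

# Placeholder host: the slides elide the real one.
BASE_URL = 'http://example.com/artistas'

# Build the listing URL with whatever page size we ask for.
def listing_url(max_rows, filter: 'Recentes')
  query = URI.encode_www_form(maxRowsList: max_rows, filter: filter)
  "#{BASE_URL}?#{query}"
end

# Pull the profile slugs out of a response body shaped like the one above.
def profile_urls(body)
  JSON.parse(body).fetch('Content', []).map { |entry| entry['ProfileUrl'] }
end

# Stand-in for what RestClient.get(listing_url(1600)) would return.
sample_body = '{"Content":[{"ProfileUrl":"caravella"},{"ProfileUrl":"tomleite"}]}'

puts listing_url(1600)
p profile_urls(sample_body)
```

In the talk the body comes from `RestClient.get(url)`; the sample string here keeps the sketch runnable without network access.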
email? phone?

>> html = RestClient.get("http://.../robomacaco")
>> html_doc = Nokogiri::HTML(html)
>> info = html_doc.xpath("//span[@class='name']")
>> info.text
=> "[email protected] DE JANEIRO - RJ - Brasil21 9675-0199"
cookies
cookies = {}
c = "s_nr=12954999; s_v19=12978609471; ... __utmc=206845458"
# split on the first "=" only, so values that contain "=" stay intact
cook = c.split(";").map { |i| i.strip.split("=", 2) }
cook.each { |u| cookies[u[0]] = u[1] }

RestClient.get(url, :cookies => cookies)
Proxies
http://www.ip-adress.com/proxy_list
>> response = RestClient.get(url)
>> html_doc = Nokogiri::HTML(response)
>> table = html_doc.xpath("//table[@class='proxylist']")
>> lines = table.children
>> lines.shift # drop the header row
>> lines[1].text
=> "208.52.144.55 document.write(\":\"+i+r+i+r) anonymous proxy server-2 minutes ago United States"
IP WTF?
<script type="text/javascript"> z=5;i=8;x=4;l=1;o=9;q=6;n=3;u=2;k=7;r=0;</script>
JAVASCRIPT = RUBY
http://www.flickr.com/photos/drics/4266471776/
<script type="text/javascript">
  z=5;i=8;x=4;l=1;o=9;q=6;n=3;u=2;k=7;r=0;
</script>

>> script = html_doc.xpath("//script")[1]
>> eval script.text
>> z
=> 5
>> i
=> 8

>> lines[1].text
=> "208.52.144.55 document.write(\":\"+i+r+i+r) anonymous proxy server-2 minutes ago United States"
>> digits = lines[1].text.split(")")[0].split("+")
=> ["208.52.144.55document.write(\":\"", "i", "r", "i", "r"]
>> digits.shift
>> digits
=> ["i", "r", "i", "r"]
>> port = digits.map { |c| eval(c) }.join("")
=> "8080"

Voilà

>> server = lines[1].text.split[0]
=> "208.52.144.55"

RestClient.proxy = "http://#{server}:#{port}"
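Running `eval` on text fetched from a remote page executes whatever that page sends. As an alternative sketch, not the approach used in the talk, the same port can be decoded by reading the variable assignments with a regular expression and looking each letter up in a hash. The helper names `digit_table` and `decode_port` are made up for illustration, and the sketch assumes single-letter variables holding single digits, as in the script above.

```ruby
# Parse "z=5;i=8;..." into {"z"=>"5", "i"=>"8", ...} without executing it.
def digit_table(script_text)
  script_text.scan(/([a-z])\s*=\s*(\d)/).to_h
end

# Turn 'document.write(":"+i+r+i+r)' into "8080" using the lookup table.
def decode_port(cell_text, table)
  expression = cell_text[/document\.write\((.*?)\)/, 1]
  return nil unless expression

  expression.split('+')
            .map(&:strip)
            .reject { |token| token.start_with?('"') } # skip the ":" literal
            .map { |name| table.fetch(name) }
            .join
end

script = 'z=5;i=8;x=4;l=1;o=9;q=6;n=3;u=2;k=7;r=0;'
cell   = '208.52.144.55 document.write(":"+i+r+i+r) anonymous proxy server-2 minutes ago United States'

table  = digit_table(script)
server = cell.split.first
port   = decode_port(cell, table)

puts "http://#{server}:#{port}"   # prints http://208.52.144.55:8080
```

`table.fetch` raises a `KeyError` for an unknown variable, which makes a change in the page's obfuscation fail loudly instead of yielding a wrong port.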
mechanize

require 'mechanize'

agent = Mechanize.new
site = "http://www.cantora.mus.br"
page = agent.get("#{site}/baixar")
form = page.forms.first # the first form on the page
form['visitor[name]'] = 'daniel'
form['visitor[email]'] = "[email protected]"
page = agent.submit(form)
# keep only the links that point at a track
tracks = page.links.select { |l| l.href =~ /track/ }
tracks.each do |t|
  file = agent.get("#{site}#{t.href}")
  file.save
end
protection techniques
javascript
text as image
captcha
don’t be naive
captcha
YES you can!
prove you are not a robot
3 steps
1. Download the image
2. Filter the image
3. Run OCR software
Good Luck!
scaling
http://www.flickr.com/photos/liquene/3330714590
clouds
$ knife ec2 server create
threads + queues
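One way to combine threads and queues, sketched here with Ruby's built-in `Queue` and `Thread`, is a fixed pool of workers that pull URLs from a shared queue until it is empty. The method name `crawl_in_parallel` and the block passed to it are hypothetical; in a real robot the block would fetch and parse the page.

```ruby
# Run the given block for every URL using a fixed number of worker threads.
# Returns a hash mapping each URL to the block's result.
def crawl_in_parallel(urls, workers: 4, &fetch)
  pending = Queue.new
  urls.each { |url| pending << url }
  results = Queue.new

  threads = Array.new(workers) do
    Thread.new do
      loop do
        url = begin
          pending.pop(true)     # non-blocking pop raises ThreadError when empty
        rescue ThreadError
          break                 # nothing left to do, so this worker stops
        end
        results << [url, fetch.call(url)]
      end
    end
  end

  threads.each(&:join)
  Array.new(results.size) { results.pop }.to_h
end

# Stand-in for a real fetch: upcase the path instead of hitting the network.
pages = crawl_in_parallel(%w[/fotos /musicas /videos], workers: 2) { |path| path.upcase }
p pages
```

Because every URL is queued before the workers start, an empty queue reliably means the work is done. A crawler that discovers new links while running would need a different stop condition.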
In this crazy programmer's life
I run into all kinds of situations
Out of the blue a client, with a hefty offer
To grab information from a site
You're crazy, I don't commit that kind of crime
If you want, I have some friends down south
Do it for me and I'll pay you with this cool jewel

I'll give you a ruby
So you can steal
With your robot

Want to build a robot?
Just use ruby
Just use ruby
To build a robot
http://www.flickr.com/photos/jobafunky/5572503988
Thank you
Daniel Cukier @danicuki