GooSweep Paper 2007

Chapter 17

USING SEARCH ENGINES TO ACQUIRE NETWORK FORENSIC EVIDENCE

Robert McGrew and Rayford Vaughn

Abstract: Search engine APIs can be used very effectively to automate the surreptitious gathering of information about network assets. This paper describes GooSweep, a tool that uses the Google API to automate the search for references to individual IP addresses in a target network. GooSweep is a promising investigative tool. It can assist network forensic investigators in gathering information about individual computers such as referral logs, guest books, spam blacklists, and instructions for logging into servers. GooSweep also provides valuable intelligence about a suspect's Internet activities, including browsing habits and communications in web-based forums.

Keywords: Network forensics, search engines, evidence gathering

Please use the following format when citing this chapter: McGrew, R., Vaughn, R., 2007, in IFIP International Federation for Information Processing, Volume 242, Advances in Digital Forensics III; eds. P. Craiger and S. Shenoi; (Boston: Springer), pp. 247-253.

1. Introduction

Individuals and groups involved in penetration testing of network assets often use search engines to locate target websites. The search results may reveal information about unsecured administrative interfaces to the websites, vulnerable versions of web applications and the locations of these applications. Similarly, attackers seeking to deface websites or host phishing websites often attempt to identify targets with older versions of web applications with known vulnerabilities. Large numbers of potentially vulnerable hosts can be enumerated quickly using search engines. The Google Hacking Database [5] posts the results of such searches and provides applications that run the searches on target websites.

Information about computing assets collected by search engines can also be used in network forensic investigations. This paper describes the design and implementation of GooSweep, a tool for gathering network forensic information by performing searches using specific ranges of IP addresses and their corresponding host names. Written in Python, GooSweep uses the Google Search Engine API to gather information about target networks without requiring direct communication with the networks [6]. In particular, GooSweep can provide useful network forensic information related to server compromise and web use policy violations. While the quality and quantity of information obtained by GooSweep may vary dramatically from one case to another, its ability to gather potentially valuable forensic information quickly and efficiently makes it a powerful tool for network forensic investigations.

The next section discusses how search engines can be used to obtain information pertaining to hosts and applications. Section 3 describes the GooSweep tool and its application to network forensics. The final section, Section 4, presents our conclusions.

2. Searching for Hosts

An Internet search for references to a specific IP address often returns instructions that inform users how to log into the host. For example, if an organization has a database server, there may be instructions on the organization's web server for employees, informing users about client software for connecting to the server, as well as the IP address and/or host name of the server. If an email server is among the hosts searched, its presence and purpose will often be apparent in the results, especially if users of the server post information to publicly-archived mailing lists. If these mailing list archives are accessible to the public via the web and indexed by search engines, emails to the list from users will often include detailed header information. The header information may contain the host name and IP address of the originating email server (allowing it to be indexed and discovered using the technique described in this paper), along with detailed version information about the email server software, client software and time stamps. Email message content provides useful information as well; one of our searches returned a post by a systems administrator seeking help with a specific software package.
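
To make the point about archived email headers concrete, the following is a minimal sketch (in Python 3, not part of GooSweep) of how a host name and IP address might be pulled from the Received headers of an archived message. The message text, the function name relay_hosts, and the assumed "from host (name [ip])" header layout are illustrative assumptions; real Received headers vary considerably between mail servers.

```python
import re
from email import message_from_string

# Assumed shape of a Received header: "from <host> (<name> [<ip>]) by ..."
RECEIVED_PATTERN = re.compile(
    r'from\s+(?P<host>\S+)\s+\([^)]*?\[(?P<ip>\d{1,3}(?:\.\d{1,3}){3})\]'
)

def relay_hosts(raw_message):
    """Return (host, ip) pairs found in the Received headers of a raw message."""
    message = message_from_string(raw_message)
    pairs = []
    for header in message.get_all("Received", []):
        match = RECEIVED_PATTERN.search(header)
        if match:
            pairs.append((match.group("host"), match.group("ip")))
    return pairs

# Hypothetical archived message; 192.0.2.0/24 is a documentation address range.
raw = (
    "Received: from mail.example.org (mail.example.org [192.0.2.25])\n"
    "\tby lists.example.net (Postfix) with ESMTP id 4F2A1;\n"
    "\tMon, 13 Mar 2006 09:15:02 -0600\n"
    "From: admin@example.org\n"
    "Subject: Help with backup software\n"
    "X-Mailer: ExampleMail 1.2\n"
    "\n"
    "Body text.\n"
)

print(relay_hosts(raw))
# Client software version information, when the sender's mail client adds it.
print(message_from_string(raw)["X-Mailer"])
```

Because such headers are indexed verbatim, a search for either the host name or the IP address can surface the archived message.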

Client workstations also provide varying amounts of information that may be indexed by search engines. Some web servers, including many government and academic systems, maintain access logs that are publicly accessible. This information can be used by a forensic investigator to identify the sites visited by users. In some cases, these log files include time stamps, operating system and web browser version information, and the referring URLs (the websites that led users to the destination) [1]. The referrals may also cite other websites that the users visited or reveal the search terms they used to arrive at the site that logged them.
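
As a rough illustration of what a single publicly indexed log entry can reveal, the sketch below (Python 3, not part of GooSweep) parses one line in Apache's combined log format, which extends the common log format [1] with referrer and user-agent fields, and extracts the search terms from the referring URL. The sample entry, the name parse_log_line, and the assumption that the referrer carries its query in a "q" parameter are illustrative only.

```python
import re
from urllib.parse import urlparse, parse_qs

# Apache "combined" log format: common log format plus referrer and user agent.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_log_line(line):
    """Return the fields of one combined-format entry, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Assumption: the referring search engine passes the query in a "q" parameter.
    query = parse_qs(urlparse(fields["referrer"]).query)
    fields["search_terms"] = query.get("q", [""])[0]
    return fields

# Hypothetical entry; 192.0.2.0/24 is a documentation address range.
sample = (
    '192.0.2.17 - - [12/Mar/2006:14:05:33 -0600] "GET /faq.html HTTP/1.1" 200 5120 '
    '"http://www.google.com/search?q=python+socket+tutorial" '
    '"Mozilla/5.0 (X11; U; Linux i686)"'
)

entry = parse_log_line(sample)
print(entry["host"], entry["time"], entry["search_terms"])
```

The client address, time stamp, browser string and search terms all come from one line, which is why such logs are useful once a search engine has indexed them.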

Communications channels such as Internet Relay Chat (IRC), web-based forums and website guest books also record and display IP and host name information that may be indexed by search engines. When a user joins an IRC channel (analogous to a chat room) on most IRC networks, a line similar to the following is displayed to other users:

    11:41 -!- handle [n=username@c-xx-xx-xx-xx.hsd1.mi.example.net] has joined #channelname

In this example, handle is the name adopted by the user in the channel, username is the user name on the single-user workstation or multi-user system, and the text following the @ symbol is the host name of the computer that connected to the IRC server [7]. Often, users who frequent IRC channels will post logs of interesting chat sessions on the web. In the case of many open source projects, where meetings are held over IRC, all chat sessions on project channels are automatically logged and are publicly accessible. Search engines index all this information, enabling it to be found by tools like GooSweep. Web-based forums and guest books work in a similar way, logging and, sometimes, displaying the IP address or host name of the user who made the post in an effort to discourage spam and abuse.
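
The fields of a join notice like the one above can be separated mechanically. The following Python 3 sketch is illustrative and not part of GooSweep: the pattern assumes the client log layout shown in the example, and treating a single character followed by "=" before the user name as a server-added marker is an assumption rather than something the text specifies.

```python
import re

# Matches join notices of the form:
#   11:41 -!- handle [n=username@host.example.net] has joined #channelname
JOIN_PATTERN = re.compile(
    r'(?P<time>\d{1,2}:\d{2}) -!- (?P<handle>\S+) '
    r'\[(?:\w=)?(?P<username>[^@\]]+)@(?P<host>[^\]]+)\] '
    r'has joined (?P<channel>#\S+)'
)

def parse_join(line):
    """Return the fields of a join notice as a dict, or None if the line differs."""
    match = JOIN_PATTERN.search(line)
    return match.groupdict() if match else None

line = ("11:41 -!- handle [n=username@c-xx-xx-xx-xx.hsd1.mi.example.net] "
        "has joined #channelname")
info = parse_join(line)
print(info["handle"], info["username"], info["host"], info["channel"])
```

The host field is the part an investigator would match against the host names in the address range under examination.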

Security-related information can also be found regarding hosts in a subnet. Spam blacklists, which contain lists of hosts known to relay spam email, are used by system administrators to track and block unwanted email; by design they contain host names and IP addresses [3]. If a system was once compromised and used as a platform for attacks or to host phishing sites, often there will be discussion on public mailing lists about blocking the machine or shutting down the host. This information is valuable to a forensic investigator as historical information about hosts or networks, or as intelligence about hosts and networks that were involved in an attack.

Querying Internet search engines for information about individual hosts in a range of IP addresses is promising because of the type of results it can return. In addition to facilitating network intelligence and penetration testing activities, the information gathered can be very valuable in incident response and forensic investigations.

3. GooSweep

GooSweep is a Python script that automates web searches of IP address ranges and their corresponding host names. Like many other search engines, Google does not permit automated scripts to use its normal web interface; these scripts increase the load on the interface and ignore advertisements. However, Google provides an API for programmers to develop applications that utilize its search engine [2]. This enables Google to provide a separate interface for scripted search requests and also to limit the rate at which automated searches are conducted.

GooSweep uses the Google Search Engine API to perform searches. The API currently limits each script user to 1,000 requests in a 24-hour period. GooSweep uses a single API request for each IP address and each host name. With reverse-DNS resolution of host names enabled, an investigator can use GooSweep to search a class C subnet in a 24-hour period (256 hosts, each with an IP address and host name search, require a total of 512 API requests). Fortunately, many networks do not have host names assigned to every IP address in their address ranges; this reduces the number of API requests required to scan a network. Consequently, an investigator can typically run GooSweep on two class C subnets in a 24-hour period. The "burst mode" can be employed for larger IP address ranges. This mode causes the script to idle after its API requests are expended; the script is activated when more requests can be issued. GooSweep generates an HTML report with the search results, including the number of websites found that match each host in the IP address range.

3.1 Running GooSweep

Executing GooSweep requires a Python interpreter [8] and the PyGoogle interface to the Google Search Engine API [4]. PyGoogle requires the SOAPpy web service library to be installed as well [9]. A GooSweep user must register for a Google API key to run scripts that issue queries. This key must be placed in a location specified by the PyGoogle documentation (typically in a file named .googlekey in the user's home directory). The GooSweep script itself is contained in a file named goosweep.py, which does not require any separate installation procedures. GooSweep has been extensively tested on Linux systems. Several users have had success running it on Windows systems without modification.

GooSweep may be executed from the command line using the following syntax:

    ./goosweep.py [-h num] [-r] [-b num]

The required -s argument specifies the subnet to be searched. The argument is specified in "dotted-quad" format, with an asterisk as a wild card to denote the part of the address that is to be changed for each search. For example, -s 192.168.5.* directs GooSweep to scan the IP address range 192.168.5.0 through 192.168.5.255.
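
The wild-card expansion described above can be sketched as follows. This is an illustrative Python 3 reading of the -s format, not GooSweep's actual code; the function name expand_subnet is hypothetical, and restricting the pattern to exactly one asterisk is an assumption, since the text does not say whether several wild cards are accepted.

```python
import ipaddress

def expand_subnet(pattern):
    """Expand a dotted-quad pattern with one '*' octet into its 256 addresses."""
    octets = pattern.split(".")
    if len(octets) != 4 or octets.count("*") != 1:
        raise ValueError("expected a dotted quad with exactly one '*' octet")
    addresses = []
    for value in range(256):
        candidate = ".".join(str(value) if octet == "*" else octet for octet in octets)
        # ipaddress validates the fixed octets (for example, it rejects 999).
        addresses.append(str(ipaddress.IPv4Address(candidate)))
    return addresses

hosts = expand_subnet("192.168.5.*")
print(hosts[0], hosts[-1], len(hosts))
```

Each address in the resulting list then becomes one search term, which is what ties the size of the range to the number of API requests consumed.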

Either or both of the -o and -d arguments are required to produce an output. A filename should be supplied to -o to produce an HTML report with horizontal bars indicating the relative number of hits for each host. A filename should be supplied to the -d option to generate a comma-delimited output file for analysis using other programs, e.g., Microsoft Excel.
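
The two output forms can be approximated in a few lines. The Python 3 sketch below is an assumption about how such output could be produced, not a description of GooSweep's report code: the column names, the 100-unit maximum bar width and the function names are all hypothetical.

```python
import csv
import io

def relative_widths(hit_counts, max_width=100):
    """Scale each hit count to a bar width relative to the largest count."""
    largest = max(hit_counts.values(), default=0)
    if largest == 0:
        return {host: 0 for host in hit_counts}
    return {host: round(max_width * hits / largest)
            for host, hits in hit_counts.items()}

def to_csv(hit_counts):
    """Render host/hit pairs as comma-delimited text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["host", "hits"])  # hypothetical column names
    for host, hits in hit_counts.items():
        writer.writerow([host, hits])
    return buffer.getvalue()

# Hypothetical hit counts for three hosts in a private address range.
hits = {"192.168.5.1": 12, "192.168.5.2": 0, "192.168.5.7": 3}
widths = relative_widths(hits)
print(widths)
print(to_csv(hits))
```

Scaling against the largest count keeps the bars comparable within one scan, which matches the "relative number of hits" wording; it says nothing about absolute volumes across different scans.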

The -b option, if specified, supports the burst mode. The Google API limits each user to 1,000 API requests in a 24-hour period. The burst mode option enables a user to specify the number of searches that GooSweep should perform in a 24-hour period. After performing the specified number of searches, GooSweep idles for the remainder of the 24-hour period and then continues with another set of searches. This allows GooSweep to automatically perform large scans without violating the limitations imposed by the Google API. Users may also use the -b option to budget the number of GooSweep API requests per day so that other Google API applications can run simultaneously.
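
The request arithmetic behind burst mode is simple enough to check directly. The Python 3 sketch below works through it under the figures given in the text (one request per IP address, one per host name, 1,000 requests per day); the function names are hypothetical, and the sketch only plans the batches rather than idling between them as the real tool is described as doing.

```python
import math

DAILY_API_LIMIT = 1000  # per-user limit stated in the text

def requests_needed(num_addresses, num_host_names=0):
    """One API request per IP address plus one per resolved host name."""
    return num_addresses + num_host_names

def days_required(total_requests, per_day=DAILY_API_LIMIT):
    """Number of 24-hour windows needed at a given daily request budget."""
    if not 1 <= per_day <= DAILY_API_LIMIT:
        raise ValueError("daily budget must be between 1 and the API limit")
    return math.ceil(total_requests / per_day)

def daily_batches(targets, per_day):
    """Split targets into consecutive batches of at most per_day items."""
    return [targets[i:i + per_day] for i in range(0, len(targets), per_day)]

full_class_c = requests_needed(256, 256)             # every address has a host name
print(full_class_c, days_required(full_class_c))     # fits within one day's limit
print(days_required(2 * full_class_c))               # two fully named subnets do not
print(days_required(2 * full_class_c, per_day=250))  # a smaller budget stretches the scan
```

The second figure shows why scanning two class C subnets in one day depends on some addresses lacking host names: two fully named subnets would need 1,024 requests, just over the limit.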

The -h option enables the user to specify how often GooSweep should output hash marks (#) to the screen to indicate the progress of its search. The option may be turned off if GooSweep is being run as part of a wrapper script or application, or the option may be set as necessary to determine if GooSweep is running at a normal pace. The default option outputs one hash mark for every eight hosts searched.

The -r option allows the user to specify that a reverse-DNS lookup should be performed for each IP address, and if a host name is returned, that it is to be searched for as well. This option is turned off by default.
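
A minimal Python 3 sketch of this behavior follows. It is not GooSweep's code: the function names are hypothetical, and the resolver is passed in as a parameter so that the example can run with a stand-in instead of sending real DNS queries. The default resolver uses the standard library call socket.gethostbyaddr, which does generate DNS traffic when invoked.

```python
import socket

def reverse_lookup(ip):
    """Return the host name for ip, or None if the reverse lookup fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:  # covers socket.herror and socket.gaierror
        return None

def search_terms(ip, use_reverse_dns=False, resolver=reverse_lookup):
    """Return the terms to search for: the address, plus its host name if resolved."""
    terms = [ip]
    if use_reverse_dns:
        name = resolver(ip)
        if name and name != ip:
            terms.append(name)
    return terms

# Stand-in resolver so the example sends no DNS traffic; the name is made up.
def fake_resolver(ip):
    return "host10.example.net"

print(search_terms("192.168.5.10"))
print(search_terms("192.168.5.10", use_reverse_dns=True, resolver=fake_resolver))
```

With lookups disabled each address costs one search; with them enabled an address that resolves costs two, which is the doubling reflected in the API request counts discussed earlier.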

GooSweep was originally designed to provide information about a target network in a stealthy manner, without sending any packets to the target. A reverse-DNS lookup submits a DNS request to the target network, assuming that the result is not cached in a local DNS server. Issuing a large number of DNS requests can set off intrusion detection system sensors (these requests are often submitted by attackers performing network enumeration or reconnaissance). The -r option should be turned off during penetration testing in order to "fly under the radar." In general, reverse-DNS lookups should be activated in network forensic scenarios that involve scanning one's own networks.

The following is a sample GooSweep scan and dialog:

    ./goosweep.py -s 192.168.5.* -o report.html -r -h 4
    #######...###
    Generating report (report.html)
    Completed.

For privacy reasons, the subnet scanned is in the "private" non-routable range. A report generated by the scan consists of an HTML table with each row containing an IP address, a host name (if the -r option is specified and a name is found), the results returned for each host, and a bar chart showing the number of hits for each host relative to other hosts in the scan. To assist digital forensic investigators, the IP addresses and host names are rendered as hyperlinks to the relevant Google search engine results.

Figure 1. Sample GooSweep report.

3.2 GooSweep Example

Figure 1 illustrates the results of executing GooSweep, targeting a network typically used by students. The results have been censored to obscure the actual IP addresses scanned. IP addresses in the range that resulted in no search engine hits are omitted for brevity.

For each result with one or more hits, the IP address can be selected to view the corresponding Google search results. For most of the IP addresses in the example, web server logs were found at other academic institutions that had logged visits by these hosts. The visits were to web pages related to topics such as programming assistance and upcoming conferences. One IP address resulted in finding a security-related paper published at an academic conference that used a host in the address range in an example. The results dated from as far back as 2004 to as recently as the current year (2006). Note that while this example was executed without reverse-DNS lookups, some of the web server logs contained the results of their own reverse-DNS lookups, allowing the naming scheme for this IP address range to be determined without having to issue queries using GooSweep.

4. Conclusions

GooSweep leverages the latest Internet search engine technology to provide valuable information gathering capabilities for network forensic investigators. There is no guarantee that GooSweep will be fruitful in any given situation, but few, if any, forensic techniques or tools can make this claim. Nevertheless, given its ease of execution and the richness of the information it can gather, GooSweep is an attractive tool for network forensic investigations. GooSweep and its source code [6] are available free-of-charge to members of the information assurance and digital forensics community.

References

[1] Apache Software Foundation, Apache Common Log Format (httpd.apache.org/docs/1.3/logs.html#common), 2006.

[2] Google, Google APIs (code.google.com/apis.html).

[3] N. Krawetz, Anti-spam solutions and security (www.securityfocus.com/infocus/1763), 2004.

[4] B. Landers, PyGoogle: A Python interface to the Google API (pygoogle.sourceforge.net).

[5] J. Long, The Google Hacking Database (johnny.ihackstuff.com/ghdb.php).

[6] R. McGrew, GooSweep, McGrew Security Services and Research (www.mcgrewsecurity.com/projects/goosweep), 2006.

[7] J. Oikarinen and D. Reed, RFC 1459: Internet Relay Chat Protocol, IETF Network Working Group (www.ietf.org/rfc/rfc1459.txt?number=1459), 1993.

[8] Python Software Foundation, Python programming language (python.org).

[9] G. Warnes and C. Blunck, Python web services (pywebsvcs.sourceforge.net).