The Current State of Anonymous File Sharing

Embed Size (px)

Citation preview

  • 8/9/2019 The Current State of Anonymous File Sharing

    1/64

    Abstract

    This thesis will discuss the current situation of anonymous file-sharing. An overviewof the currently most popular file-sharing protocols and their properties concerninganonymity and scalability will be discussed.

    The new generation of "designed-for-anonymity" protocols, the patterns behind themand their current implementations will be covered as well as the usual attacks onanonymity in computer networks.

    II

  • 8/9/2019 The Current State of Anonymous File Sharing

    2/64

    Declaration of originality

    I hereby declare that all information in this document has been obtained and presentedin accordance with academic rules and ethical conduct. I also declare that, as requiredby these rules and conduct, I have fully cited and referenced all material and results

    that are not original to this work.

    Marc SeegerStuttgart, July 22, 2008

    III

  • 8/9/2019 The Current State of Anonymous File Sharing

    3/64

    Contents

    1 Different needs for anonymity 1

    2 Basic concept of anonymity in file-sharing 2

    2.1 What is there to know about me? . . . . . . . . . . . . . . . . . . . . . . 22.2 The wrong place at the wrong time . . . . . . . . . . . . . . . . . . . . . 22.3 Motivation: Freedom of speech / Freedom of the press . . . . . . . . . . 3

    3 Technical Networking Background 33.1 Namespaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33.2 Routing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43.3 Identification throughout the OSI layers . . . . . . . . . . . . . . . . . . 5

    4 Current popular file-sharing-protocols 74.1 A small history of file-sharing . . . . . . . . . . . . . . . . . . . . . . . . 7

    4.1.1 The beginning: Server Based . . . . . . . . . . . . . . . . . . . . . 74.1.2 The evolution: Centralized Peer-to-Peer . . . . . . . . . . . . . . 84.1.3 The current state: Decentralized Peer-to-Peer . . . . . . . . . . . 84.1.4 The new kid on the block: Structured Peer-to-Peer networks . . . 94.1.5 The amount of data . . . . . . . . . . . . . . . . . . . . . . . . . 10

    4.2 Bittorrent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114.2.1 Roles in a Bittorrent network . . . . . . . . . . . . . . . . . . . . 124.2.2 Creating a network . . . . . . . . . . . . . . . . . . . . . . . . . . 124.2.3 Downloading/Uploading a file . . . . . . . . . . . . . . . . . . . . 134.2.4 Superseeding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

    4.3 eDonkey . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154.3.1 Joining the network . . . . . . . . . . . . . . . . . . . . . . . . . . 154.3.2 File identification . . . . . . . . . . . . . . . . . . . . . . . . . . . 154.3.3 Searching the network . . . . . . . . . . . . . . . . . . . . . . . . 164.3.4 Downloading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

    4.4 Gnutella . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174.4.1 Joining the network . . . . . . . . . . . . . . . . . . . . . . . . . . 194.4.2 Scalability problems in the Gnutella Network . . . . . . . . . . . 214.4.3 QRP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224.4.4 Dynamic Querying . . . . . . . . . . . . . . . . . . . . . . . . . . 23

    4.5 Anonymity in the current file-sharing protocols . . . . . . . . . . . . . . 244.5.1 Bittorrent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244.5.2 eDonkey . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274.5.3 Gnutella . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

    5 Patterns supporting anonymity 295.1 The Darknet pattern . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

    5.1.1 Extending Darknets - Turtle Hopping . . . . . . . . . . . . . . . . 30

    IV

  • 8/9/2019 The Current State of Anonymous File Sharing

    4/64

    5.2 The Brightnet pattern . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315.2.1 the legal situation of brightnets . . . . . . . . . . . . . . . . . . . 32

    5.3 Proxy chaining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335.3.1 What is a proxy server? . . . . . . . . . . . . . . . . . . . . . . . 335.3.2 How can I create a network out of this? (aka: the proxy chaining

    pattern) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335.3.3 How does this improve anonymity? . . . . . . . . . . . . . . . . . 345.3.4 legal implications of being a proxy . . . . . . . . . . . . . . . . . 355.3.5 illegal content stored on a node (child pornography as an example) 365.3.6 copyright infringing content . . . . . . . . . . . . . . . . . . . . . 37

    5.4 Hop to Hop / End to End Encyrption . . . . . . . . . . . . . . . . . . . . 385.4.1 Hop to Hop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 385.4.2 End to End . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 395.4.3 Problems with end to end encryption in anonymous networks . . 39

    6 Current software-implementations 416.1 OFF- the owner free filesystem - a brightnet . . . . . . . . . . . . . . . . 416.1.1 Structure of the network . . . . . . . . . . . . . . . . . . . . . . . 416.1.2 Cache and Bucket . . . . . . . . . . . . . . . . . . . . . . . . . . . 446.1.3 Bootstrapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 456.1.4 Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 456.1.5 Downloading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 466.1.6 Sharing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 466.1.7 Scaling OFF under load . . . . . . . . . . . . . . . . . . . . . . . 46

    6.2 Retroshare - a friend to friend network . . . . . . . . . . . . . . . . . . . 496.2.1 Bootstrapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 496.2.2 Encryption . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

    6.3 Stealthnet - Proxychaining . . . . . . . . . . . . . . . . . . . . . . . . . . 496.3.1 Bootstrapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 506.3.2 Encryption . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 506.3.3 Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 506.3.4 "Stealthnet decloaked" . . . . . . . . . . . . . . . . . . . . . . . . 51

    7 Attack-Patterns on anonymity 537.1 Attacks on communication channel . . . . . . . . . . . . . . . . . . . . . 53

    7.1.1 Denial of Service . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

    7.2 Attacks on Anonymity of peers . . . . . . . . . . . . . . . . . . . . . . . 547.2.1 Passive Logging Attacks . . . . . . . . . . . . . . . . . . . . . . . 547.2.2 Passive Logging: The Intersection Attack . . . . . . . . . . . . . . 557.2.3 Passive Logging: Flow Correlation Attacks . . . . . . . . . . . . . 557.2.4 Active Logging: Surrounding a node . . . . . . . . . . . . . . . . 56

    V

  • 8/9/2019 The Current State of Anonymous File Sharing

    5/64

    1 Different needs for anonymity

    If John Doe hears the word "anonymity", it usually leaves the impression of a "you

    cant see me!" kind of situation. While this might be the common understanding of the

    word "anonymity", it has to be redefined when it comes to actions taking place on theinternet.

    In real life, the amount of work that has to be put into taking fingerprints, footprints,

    analyzing DNA and other ways of identifying who spend some time in a given location

    involves a lot more work than the equivalents in the IT world.

    It is in the nature of anonymity, that it can be achieved in various places and using

    various techniques. The one thing they all have in common is that a person tries to hide

    something from a third party.

    In general, and especially in the case of IT-networks, a person usually tries to hide:

    Who they are (Their identity)

    What information they acquire/share (The content they download/upload)

    Who they talk to (The people they communicate with)

    What they do (That they participate at all)

    The third party they are trying to hide from differs from country to country and from

    society to society.

    The main reason a person is trying to hide one of the things mentioned above is becausethe society as a whole, certain groups in society or the local legal system is considering

    the content or the informations they share/download either amoral or illegal.

    Downloading e.g. Cat Stephens Song "Peace train could get you in trouble for different

    reasons.

    In Germany or the USA, youd probably be infringing copyright. In Iran, the ruling

    government simply wouldnt want you to listen to that kind of music.

    There are also people who have the urge to protect their privacy because they dont see

    the need for any third party to collect any kind of information about them.Over the past years there have been many attempts to keep third parties from gathering

    information on digital communication. To allow me to keep this thesis at a certain level

    of detail, I decided to focus on the attempts that allow the sharing of larger amounts of

    data in a reasonable and scaleable way. Having to explain all the different methods that

    allow general purpose anonymous communication would extend the scope way beyond

    the targeted size of this paper.

    1

  • 8/9/2019 The Current State of Anonymous File Sharing

    6/64

    2 Basic concept of anonymity in file-sharing

    This chapter will give a short overview about the motivation for an anonymous file-

    sharing network and anonymity in general.

    2.1 What is there to know about me?

    Especially when it comes to file-sharing, the type of videos you watch, the kind of texts

    you read and the persons you interact with reveal a lot about you as a person.

    As sad as it may seem, but in the times of dragnet investigations and profiling, you

    may end up on one list or another just because of the movies you rented over the past

    months. Arvin Nayaranan and Vitaly Shmatikov of the University of Texas in Austin

    showed in their paper "Robust De-anonymization of Large Sparse Datasets" [1] how a

    supposedly anonymous dataset, which was released by the online movie-rental-companyNetflix, could be connected to real-life persons just by combining data freely available

    on the internet.

    This demonstrates how easy it is to use just fragments of data and connect them to a

    much bigger piece of knowledge about a person.

    2.2 The wrong place at the wrong time

    One of the social problems with current peer to peer networks is the way in which

    peer to peer copyright enforcement agencies are able "detect" violation of copyright

    laws. A recent paper by Michael Piatek, Tadayoshi Kohno and Arvind Krishnamurthy

    of the University of Washingtons department of Computer Science & Engineering titled

    Challenges and Directions for Monitoring P2P File Sharing Networks or Why My

    Printer Received a DMCA Takedown Notice[2] shows how the mere presence on a non-

    anonymous file-sharing network (in this case: Bittorrent) can lead to accusations of

    copyright infringement. They even managed to receive DMCA Takedown Notices for

    the IPs of 3 laser printers and a wireless access point by simple experiments. All of the

    current popular file-sharing networks are based on the principle of direct connectionsbetween peers which makes it easy for any third party to identify the users which are

    participating in the network on a file-basis.

    2

  • 8/9/2019 The Current State of Anonymous File Sharing

    7/64

    2.3 Motivation: Freedom of speech / Freedom of the press

    The whole situation of free speech and a free press could be summarized by a famous

    quote of American journalist A. J. Liebling: Freedom of the press is guaranteed only to

    those who own one.

    While free speech and a free press are known to be a cornerstone of most modern civi-

    lizations (nearly every country in the western hemisphere has freedom of speech/freedom

    of the press protected by its constitution ), recent tendencies like the patriot act in the

    USA or the pressure on the media in Russia have shown that a lot of societies tend

    to trade freedom for a (supposed) feeling of security in times of terror or arent able

    to defend themselves against their power-hungry leaders. With fear-mongering media

    and overzealous government agencies all over the world, it might be a good idea to

    have some form of censorship resistant network which allows publishing and receiving

    of papers and articles in an anonymous manner and still allow the public to access theinformation within this network. Moreover, it should be possible to allow sharing of

    textual content as well as video and audio. This tends to be difficult as file sizes of video

    and audio content are pretty big when compared to simple textual content.

    3 Technical Networking Background

    3.1 Namespaces

    To really understand the patterns which allow accessing networks in an anonymous

    way, it is first necessary to understand the basic concepts of networks and the unique

    characteristics a PC possesses in any given network.

    There basically are two things that identify a PC in a network:

    The NICs 1 MAC2 address (e.g. "00:17:cb:a5:7c:4f")

    The IP address used by the PC to communicate (e.g. "80.131.43.29")

    Although a MAC address should be unique to a NIC and an IP addresses should be

    unique on a network, both of them can be altered. Both of them can be spoofed to

    imitate other Clients or simply disguise the own identity. Spoofing the MAC address of

    another client that is connected to the same switch will usually lead to something called

    1NIC: Network Interface Card2MAC: Media Access Layer

    3

    http://-/?-http://-/?-http://-/?-http://-/?-
  • 8/9/2019 The Current State of Anonymous File Sharing

    8/64

    "MAC flapping" and make proper communication impossible.

    Also: MAC Addresses are only of concern in the same collision domain. As soon as a

    Network-Packet is being transfered over a gateway to another network, the MAC Address

    of the source node will be overwritten by the MAC address of the gateway. As this is

    the case on the internet, we can mainly concentrate on the IP address as a means toidentify a Computer on the internet. There are other means of identifying a Computer

    within a group of peers, this will be covered in Chapter 3.3, but first, we have to take a

    look at how the data is being transfered over the Internet.

    3.2 Routing

    A main factor in the whole concept of anonymity is the way a file takes when traveling

    through the Internet. There are some patterns that allow an anonymous exchange of

    files when being on the same physical network, but that is usually not a "real-world"

    case and will not be discussed in this paper.

    When sending a file from one edge of the Internet to another, you usually end up going

    through several different stations in between. Lets look at the usual "direct way" a

    TCP/IP packet will go by when traveling from the Stuttgart Media University to the

    MIT University to give a certain understanding of who is involved in the exchange of a

    simple IP packet: The tool (traceroute) visualizes the path a packet will usually take to

    Figure 1: IP hops from Germany to the US

    reach its target. As you can see in the trace log, there are 18 different nodes involved

    in the transfer of a packet from stuttgart to MIT. While this is necessary to keep the

    internet stable in the event of outages, it also increases the numbers of possibilities of a

    4

  • 8/9/2019 The Current State of Anonymous File Sharing

    9/64

    "man in the middle attack. To keep this routing process scaleable and still hiding the

    receiver and sender information is one of the biggest tasks to accomplish when designing

    an anonymous file-sharing protocol.

    3.3 Identification throughout the OSI layers

    The previously mentioned basic namespaces (IP/MAC-addresses) arent the only means

    of identifying a specific PC within a group of others. In the different layers of the Open

    Systems Interconnection Basic Reference Model (OSI Reference Model or OSI Model for

    short), you can find several means of identification. Figure 2 shows a small summary of

    Figure 2: OSI Layer Model

    the different Layers of the OSI Model together with a sample of often found protocols

    residing in these layers. It has to be noted that describing the actual layer concept in

    connection with implemented systems is not always easy to do as many of the protocols

    that are in use on the Internet today were designed as part of the TCP/IP model, and

    may not fit cleanly into the OSI model. As the whole networking community seems to

    be especially in distress about the tasks taking place inside the Session and Presentation

    layer and the matching protocols, I decided to treat them as one and only give a short

    overview of their possible contents in respect to anonymity.

    5

  • 8/9/2019 The Current State of Anonymous File Sharing

    10/64

    Physical Link Layer: Anonymity in the Physical Link Layer can be undermined by an

    unauthorized 3rd party tapping into the physical medium itself. In the case of

    Wireless LAN or any other Radio-Networks, this is a pretty trivial task, tapping

    into ethernet can be a bit of a challenge and tapping into a fiberglass connection

    is pretty much impossible due to the precision with which the material has tobe handled. The result is always the same, communication can be recorded and

    analyzed.

    Data Link Layer (Ethernet): The Data Link Layer is the first thing when it comes to

    identifying a computer on a local network. Its frames usually carry the Network

    interface cards physical (MAC) address of the sender and the destination of a

    packet.

    Network Layer (IP): When it comes to bigger networks, the communication is basedon IP addresses which identify each computer and are embedded in the Network

    packets themselves to allow routing protocols to work.

    These IP addresses are managed by central instances. At the moment, there are

    five organizations giving out IP addresses:

    AfriNIC Africa Region

    APNIC Asia/Pacific Region

    ARIN North America Region

    LACNIC Latin America and some Caribbean Islands

    RIPE NCC Europe, the Middle East, and Central Asia

    This centralized IP distribution usually allows anybody to look up the company

    that acquired a certain address-space. If those companies are handing out IP

    addresses to people (e.g. your Internet Service Provider), they have to, at least in

    Germany, keep records which allow to identify who used a given IP address at a

    given point of time.

    The only thing that might keep people from identifying a single computer on a

    larger private network is the fact, that doing Network Address Translation will

    only show the IP of the gateway to the outside world.

    Transport Layer (TCP): In the Transport Layer, a port is the thing identifying the use

    of a certain application. While this usually doesnt give a third party that much

    information about the identity of a Computer as the Network Layer, it enriches

    this information with the Actions of the machine in question. People connecting

    6

  • 8/9/2019 The Current State of Anonymous File Sharing

    11/64

    to e.g. Port 21 will most probably try to download or upload a file from or to an

    FTP Server.

    Presentation+Session Layer: These layers might only help identifying a node if en-

    cryption based on public key cryptography is used. The certificate exchanged ine.g. TLS could give away a persons Identity, but it will keep the Data in the

    application layer under protection.

    Application Layer (FTP/HTTP): The application layer usually adds the last piece of

    information to the actions of a node under surveillance. While Layers 1-4 will tell

    you who is doing something and what this node is doing in general, the application

    layer will tell what the node is doing in particular. This is where Instant Messengers

    carry their Text Messages or Mail-Clients transmit login information.

    4 Current popular file-sharing-protocols

    4.1 A small history of file-sharing

    Genealogist Cornelia Floyd Nichols once said "You have to know where you came from

    to know where youre going". This turns out to be true especially for fast-moving fields

    such as Computer Science. Before one can talk about the current and future file-sharing

    protocols, it would be wise to take a look at where file-sharing came from.

    4.1.1 The beginning: Server Based

    Even before the creation of internet people were trading files using Bulletin Board Sys-

    tems ("BBS"). You had to dial into a remote machine using telephone-infrastructure.

    After some time of running on their own, Those remote machines were brought together

    and synchronized with each other and formed a network called the "FIDO-NET" which

    was later on replaced by the internet.

    In the first years of the Internet and until today, Usenet (based on the NNTP Protocol)

    allowed people to discuss topics ranging from politics to movies in different "Groups"

    with random strangers from all over the world. Usenet Providers synchronized each oth-

    ers servers so everybody could talk on the same logical network. Usenet also carries so

    called "Binary groups" starting with the prefix "alt.binaries." . In those groups, binary

    files were uploaded as text-messages using BASE64-encoding (the same encoding used

    7

  • 8/9/2019 The Current State of Anonymous File Sharing

    12/64

    in email attachments until today). This leads to a 33% increase in size and isnt the

    best way to exchange files, especially since the usability of newsreader applications able

    to decode binary files never really was something for most endusers. On a smaller scale,

    servers using the File Transfer Protocol (FTP) are until today one of the main means

    of file exchange between peers.

    4.1.2 The evolution: Centralized Peer-to-Peer

    When Napster came up in 1998, it was the birth of the revolutionary idea of Peer-to-Peer

    (P2P) file-sharing. Napster searched each computer it was started on for MP3 audio files

    and reported the results to a central server. This central server received search requests

    by every other Napster application and answered with a list IP addresses of computers

    which carried the files in question and some metadata (filenames, usernames, ...). Using

    these IP addresses, the Napster client was able to directly connect to other clients onthe network and transfer files.

    Although this server-based approach allowed search results to be returned quickly with-

    out any overhead, it was what brought napster down in the end as it gave corporations

    something to sue. Napster had to install filters on its servers which should filter out

    requests for copyrighted files and lost its user base.

    Other protocols fitting in this generation of P2P applications are Fasttrack (Kazaa as

    the usual client) and eDonkey. What separates Fasttrack and eDonkey from Napster

    is the possibility for everybody to set up a server and propagate its existence to thenetwork.

    4.1.3 The current state: Decentralized Peer-to-Peer

    Right about the time when the first lawsuits against Napster rolled in, Justin Frankel

    released the first version of a decentralized P2P application he called Gnutella. His em-

    ployer (AOL) forced him to take down the app a short time later, but it already had

    spread through the web. Although it only was released as a binary and no source code

    was available, the protocol used by the application was reverse-engineered and severalother compatible clients appeared. Gnutella used a flooding-technique to be able to keep

    a network connected without the use of a central server (more information in chapter

    4.4).

    There are basically two ways of searching for data in a decentralized Peer-to-Peer net-

    work: flooding or random walk.

    The flooding approach uses intermediate nodes to forward request to all of its neighbors

    8

  • 8/9/2019 The Current State of Anonymous File Sharing

    13/64

  • 8/9/2019 The Current State of Anonymous File Sharing

    14/64

    structured search and content distribution

    In the case of e.g. eMule, the hash table-based Kademlia protocol has been adopted

    by the EMule file-sharing client (more about it in chapter 4.3) and is been used as an

    overlay protocol to use forefficient searching

    . The download is still being handledby the eDonkey protocol itself. The same goes for Bittorrent with the limitation of

    not being able to do a text-based search. Bittorrent clients only use the hash table to

    automatically allocate peers which are also downloading the same file (based on the files

    hash value. more about Bittorrent in chapter 4.2).

    The other way of implementing the hash table approach is to use the hash table not

    only for searching, but also for logical content distribution within the network.

    A model that can be often found is to hash the content a node is sharing and transferring

    it over to the node with the networks equivalent to the hash of the file. This way, every

    Download request for a certain file hash can be directed directly to the node closest,

    according to the networks criteria, to the filehash itself. This means that every node in

    the network donates a bit of hard disc space and stores data according to the networks

    criteria. The specifics of each of the networks are different, but the main idea can be

    easily categorized this way. Examples of the the currently developing structured P2P-

    file-sharing Networks are Chord, CAN, Tapestry or Pastry or the more public Wua.la

    and OFF (details about OFF are discussed in chapter 6.1).

    4.1.5 The amount of data

    According to the ipoque p2p study 2007 [3], file-sharing applications are responsible for

    roughly 70% of the total Internet traffic in Germany: The current "big" file-sharing

    protocols that make up the most traffic on the internet are:

    Bittorrent

    eDonkey

    Gnutella

    To understand the problems and weaknesses that allow 3rd parties to receive information

    on a nodes file transfers, we first have to take a short look at how these protocols handle

    file search and file distribution.

    In the next sections we will take a more detailed look on the currently biggest peer to

    peer networks: Bittorrent, eDonkey and Gnutella.

    10

  • 8/9/2019 The Current State of Anonymous File Sharing

    15/64

  • 8/9/2019 The Current State of Anonymous File Sharing

    16/64

    about a "network" in context with Bittorrent, it relates to the "network" that is being

    built for a specific torrent file. The protocol has no built-in search possibilities or other

    bells and whistles, it is a pretty slim protocol and there are 2 or 3 different types of

    nodes involved in a file transfer on the bittorrent network depending on the way the

    network is used.

    4.2.1 Roles in a Bittorrent network

    In a bittorrent network, you can find 3 different roles that clients can take:

    a "seeder": the node serving the file to the network

    a "leecher": a node downloading the file from the network

    And usually there also is:

    a "tracker":

    A place where all the information about the participants of the network come

    together. The tracker is a node serving meta-information about the torrent file.

    According to the current protocol specs, a tracker can be replaced by a distributed

    hashtable which is being distributed throughout the users of the different scat-

    ternets, given that the clients on the network support it. The major Bittorrent

    applications and libraries all support the so called mainline DHT.

    4.2.2 Creating a network

    A file (or a group of files) that is being distributed to the bittorrent network first

    has to undergo a process to create a matching ".torrent" file. The file(s) is/are

    being cut into chunks of 128 kb size and those chunks are being hashed using the

    SHA1 algorithm. The information about the name of the file(s), the amount of

    chunks, the chunks respective position in the original file(s) and the hash of eachchunk is being saved in the torrent file. After the meta-information about the

    files has been added, an "announce URL" is being embedded in the torrent file.

    The announce URL contains the IP/DNS of the tracker and a hash identifying the

    torrent file itself. Usually, the resulting file is being transfered to (and therefor:

    registered by) the tracker. The seeder now uses a Bittorrent-client to announce to

    the tracker that it has 100% of the file and can be reached using its IP address

    12

  • 8/9/2019 The Current State of Anonymous File Sharing

    17/64

    (e.g. 80.121.24.76).

    4.2.3 Downloading/Uploading a file

    To download this file, a client now has to gain access to the meta-information

    embedded into the torrent file. Trackers often have a web-interface that provides

    an HTTP-based download for the registered torrent files. After downloading the

    torrent file, the client now announces its presence on the files network (which is

    identified by the hash of the torrent file) and starts "scraping" (asking) the tracker

    for the IP addresses of other nodes that also are on the files logical network. The

    tracker then sends a list of IP+port combinations to the client containing the other

    peers on the files network. The client starts to connect to the given IP addressesand asks the peers for chunks that itself is missing and sends a list of chunks that

    it has already acquired. To signal which pieces the client still misses and which

    ones are available for others to download, the Bittorrent protocol uses a special

    "bitfield message" with a bitmap-datastructure like this:

    000000000000000000 nothing downloaded yet (its optional to send this)

    100000000000100010 three blocks completed

    111111111111111111 successfully downloaded the whole fileThis way, every node on the network is able to upload/share what they have

    acquired so far and download the blocks they still miss at the same time. To dis-

    tribute the load put on the tracker by every client asking about other peers, some

    additions like PEX3 and the DHT4 were integrated in the widespread clients.

    3PEX: Peer Exchange4DHT: Distributed Hashtable

    13

    http://-/?-http://-/?-http://-/?-http://-/?-
  • 8/9/2019 The Current State of Anonymous File Sharing

    18/64

    Figure 5: schematic view of a files bittorrent network

    4.2.4 Superseeding

    One problem of Bittorrent is, that the original uploader of a given file will show up as

    the first node having a 100% file availability and therefor losing a bit of anonymity.

    To keep nodes from showing up as "the original uploader", a mode called "superseeding"

    is used, which results in the original client disguising as yet another downloader. This

    is done is by simply having the uploading node lying about its own statistics and telling

    other nodes that its also still missing files. To allow the distribution process to be

    completed, the client varies the supposedly "missing" pieces over time, making it look

    as if itself was still in the process of downloading.

    It has to be noted that this might only be of help if the tracker itself isnt compromised

    and being monitored by a third party.

    14

  • 8/9/2019 The Current State of Anonymous File Sharing

    19/64

    4.3 eDonkey

    The eDonkey Network came into existence with the "eDonkey2000" client by a com-

    pany called Metamachine. After some time there were several open-source Clients that

    had been created by reverse-engineering the network. Especially a client called "Emule"

    gained popularity and soon was the major client in the network with a thriving developer

    community creating modified versions of the main emule client that introduced lots of

    new features.

    The eDonkey network, unlike the Bittorrent network, is a "real" network. It is fully

    searchable and information about files on the network can therefor be gathered without

    the need of any metadata besides the files name.

    4.3.1 Joining the network

    To join the edonkey network, there are 2 basic possibilities:

    1. Connecting using an eDonkey Server

    2. Connecting using the Kademlia DHT

    The concept of eDonkey Servers was the "original" way to connect to the network and

    already implemented in the original eDonkey2000 client. There are a lot of public servers

    available which are interconnected and able forward search requests and answers to oneanother.

    The Kademlia DHT was integrated in an application called Overnet by the creators

    of the original eDonkey2000 client and later on by EMule clients. It has to be noted

    that although they both used the same DHT algorithm, these networks are incompatible

    with one another.

    On an interesting side note: As of 16 October 2007 the Overnet Protocol was still being

    used by the Storm botnet for communication between infected machines. For more infor-

    mation on the Storm Worm, I recommend reading A Multi-perspective Analysis of the

    Storm (Peacomm) Worm by Phillip Porras and Hassen Sadi and Vinod Yegneswaran

    of the Computer Science Laboratory [25].

    4.3.2 File identification

    Files inside the eDonkey network are identified by a so called "ed2k hash" and usually

    are exchanged as a string looking like this:

    15

  • 8/9/2019 The Current State of Anonymous File Sharing

    20/64

    ed2k://|file||||/

    a real-world example would be:

    ed2k://|file|eMule0.49a.zip|2852296|ED6EC2CA480AFD4694129562D156B563|/

    The file hash part of the link is an MD4 root hash of an MD4 hash list, and gives a

    different result for files over 9,28 MB than simply MD4. If the file size is smaller orequal to 9728000 bytes, a simple MD4 of the file in question is being generated. The

    algorithm for computing ed2k hash of bigger files than 9728000 bytes is as follows:

    1. split the file into blocks (also called chunks) of 9728000 bytes (9.28 MB) plus a

    zero-padded remainder chunk

    2. compute MD4 hash digest of each file block separately

    3. concatenate the list of block digests into one big byte array

    4. compute MD4 of the array created in step 3.

    The result of step 4 is called the ed2k hash.

    It should be noted that it is undefined what to do when a file ends on a full chunk and

    there is no remainder chunk. The hash of a file which is a multiple of 9728000 bytes in

    size may be computed one of two ways:

    1. using the existing MD4 chunk hashes

    2. using the existing MD4 hashes plus the MD4 hash of an all-zeros chunk.

    Some programs use one method some the other, and a few use both. This results to the

    situation where two different ed2k hashes may be produced for the same file in different

    programs if the file is a multiple of 9728000 bytes in size.

    The information about all of this can be found in the emule source (method: CKnown-

    File::CreateFromFile) and the MD4 section of the aMule wiki [16].

    4.3.3 Searching the network

    When connecting to a server, the client sends a list of its shared files. Upon receiving this

    list, the server provides a list of other known servers and the source lists (IP addresses and

    ports) for the files currently in a downloading state. This provides an initial, automatic

    search-result for the clients currently active file transfers. When searching for new files,

    the server checks the search request against his connected clients file lists and returns

    a list matches and their corresponding ed2k hashes

    16

    http://ed2k//%7Cfile%7C%3Cfile%20name%3E%7C%3Cfile%20size%3E%7C%3Cfile%20hash%3E%7C/http://ed2k//%7Cfile%7CeMule0.49a.zip%7C2852296%7CED6EC2CA480AFD4694129562D156B563%7C/http://ed2k//%7Cfile%7CeMule0.49a.zip%7C2852296%7CED6EC2CA480AFD4694129562D156B563%7C/http://ed2k//%7Cfile%7C%3Cfile%20name%3E%7C%3Cfile%20size%3E%7C%3Cfile%20hash%3E%7C/
  • 8/9/2019 The Current State of Anonymous File Sharing

    21/64

    4.3.4 Downloading

    The downloading process itself is based on the Multisource File Transfer Protocol (MFTP)

    and has slight differences between the different clients on the network. I will not discuss

    the handshake procedure and other protocol specific details, but rather the things that

    make the protocol interesting when compared to its competitors.

    The protocol has the concept of an upload queue which clients have to wait in when

    requesting a chunk of data and other transfers are already in progress.

    Something else noteworthy is the possibility to share metadata about a given file within

    the network (such as: this file is good, this file is corrupted, this file isnt what the name

    may indicate); in this case, the files are identified with their ed2k hash numbers rather

    than their filenames (which might vary between nodes).

    There are several unique forks of the protocol:

    eDonkey2000 implemented a feature called "hording" of sources. This allows the user

    to define one download which has top priority. Several nodes which have set the

    same file as a top priority one will try to maximize the upload between each other

    and e.g. skip waiting queues.

    eMule uses a credit system to reward users that maintained a positive upload/download

    ratio towards the client in question. This will shorten the time a client has to wait

    in the upload queue.

    xMule has extended the credit system to help in the transfer of rare files

    4.4 Gnutella

    The Gnutella Protocol is one of the most widespread Peer to Peer file-sharing protocols.

    It is implemented in many clients such as Frostwire/Limewire, Bearshare, Shareaza,

    Phex, MLDonkey and others. The development history is kind of special. It was devel-

    oped by Nullsoft (the makers of the popular MP3 Player Winamp) and acquisitioned

    by AOL. After the binary version of the application was available for only one day,AOL stopped the distribution due to legal concerns. These concerns also prohibited the

    planned release of the source code. Luckily for the open-source community, the binary

    version of the application had been downloaded thousands of times and allowed the pro-

    tocol to be reverse engineered.

    Gnutella is part of the second big generation of decentralized peer to peer applications.

    The gnutella network is based on a "Friend of a Friend" principle. This principle is

    17

  • 8/9/2019 The Current State of Anonymous File Sharing

    22/64

    based on the property of societies, that you dont need to know every node on the whole

    network to reach a large audience (e.g. when sending out searches for filenames). Its

    enough to know 5-6 people who themselves know 5-6 other people.

    If each node knew 5 contacts that havent been reachable from the "root node" in the

    net before, this little sample illustrates how many nodes could be reached by a simplesearch request that is being relayed be the nodes the request is sent to:

    1st hop: 5 nodes (the ones you know directly)

    2nd hop: 25 nodes (the ones that your direct contacts know)

    3rd hop: 125 nodes

    4th hop: 625 nodes

    5th hop: 3125 nodes

    6th hop: 15625 nodes

    7th hop: 78125 nodes

    ...

    Before the discussion of the "how does my node get on the network" topic, Id like toprovide the "basic vocabulary" of the gnutella protocol as taken from the Gnutella Pro-

    tocol Specification v0.4[5]:

    Ping :

    Used to actively discover hosts on the network.

    A servent receiving a Ping descriptor is expected to respond with one or more

    Pong descriptors.

    Pong :

    The response to a Ping.

    Includes the address of a connected Gnutella servent and information regarding

    the amount of data it is making available to the network.

    18

  • 8/9/2019 The Current State of Anonymous File Sharing

    23/64

    Query :

    The primary mechanism for searching the distributed network.

    A servent receiving a Query descriptor will respond with a QueryHit if a match is

    found against its local data set.

    QueryHit :

    The response to a Query.

    This descriptor provides the recipient with enough information to acquire the data

    matching the corresponding Query.

    Push :

    A mechanism that allows a firewalled servent to contribute file-based data to the

    network.

    4.4.1 Joining the network

    Now that the meaning of the packets in the gnutella protocol are understood, we can

    address the topic of how to gain access to the network.

    To gain access to the network, you need to know the IP address of a node already

    participating in the network. Once you have the IP of a node that is integrated in the

    network, you have basically two possibilities to get to know other nodes:

    Pong-Caching :

    "Pong Caching" means to simply ask the node you already know to introduce you

    to its "friends". This might lead to a segmentation of the net as certain "cliques"

    tend to form which are highly interconnected but lack links to the "outside" of

    this group of nodes.

    Searching :

    Another way of getting to know more nodes is to simply search for a specific

    item. The search will be relayed by the node you know and the incoming resultsare containing the information about other nodes on the network, e.g. "JohnDoe"

    at the IP+Port 192.168.1.9:6347 has a files called "Linux.iso" in his share. Using

    this information, a network of known nodes that share the same interests (at least

    the search results reflect these interests by the filenames) is established. As most

    people tend to share different kinds of media those cliques that were built based

    upon interests will interconnect or at least "stay close" (hop-wise) to the other

    19

  • 8/9/2019 The Current State of Anonymous File Sharing

    24/64

    cliques. A little example would be, that a node that has an interest in Linux ISO

    files and the music of Bob Dylan. This Node could have 2 connections to other

    nodes that gave an result on the search term "Linux iso" and 3 Nodes that gave

    an result on the search term "dylan mp3".

    The two methods of pong-caching and searching are working fine as long as you already

    know at least one node in the network. This however leaves us standing there with

    a chicken or the egg causality. To gain information about other nodes we have to

    have information about other nodes. As not everybody wants to go through the hassle

    of exchanging their current IPs with others on IRC/Mail/Instant Message, something

    called GWebCaches have been implemented which automate that process.

    That is why the current preferred way to gain information about the FIRST node to

    connect to if your own "address book of nodes" is empty, is to look up the information

    in a publicly available "web cache".

    A web cache is usually a scripting file on a web server that is publicly available on

    the internet, which a gnutella clients queries for information. There are GWebcache

    implementations in PHP, Perl, ASP, C and may other programming languages.

    The web cache itself basically does 3 things:

    1. Save the IP+Port information about the client asking for other nodes

    2. Give out information about the recently seen contacts that also asked for other

    nodes

    3. Give out information about other public web caches

    Please note that a web cache should only be used if the initial "address book" of nodes

    is empty or outdated. otherwise the other two possibilities should be used to gather

    information about nodes to connect to.

    An alternative to the previously discussed ways to connect to nodes is called "UDP Host

    Caching" Its basically a mix between Pong-Caching and web caches. The communi-cation to an UDP Host Cache relies, as the name suggests, on UDP instead of TCP

    which keeps the overhead to a minimum. The UDP Host Cache can be either built into

    a Gnutella Client or it can be installed on a web server.

    20

  • 8/9/2019 The Current State of Anonymous File Sharing

    25/64

    4.4.2 Scalability problems in the Gnutella Network

    The way that searches are being relayed though the network poses the risk of your

    directly connected nodes acting quite similar to "smurf amplifiers" in a smurf IP Denial-

    of-Service attack (more information: CERT Advisory CA-1998-01 Smurf IP Denial-of-

    Service Attacks5). In network technology, a smurf attack consists of sending spoofed

    ICMP echo (aka "ping") packets to a large number of different systems. The resulting

    answers will be directed to the spoofed IP address and put heavy load on the PC behind

    that IP. In Gnutella this can happen if a node starts searching for a term that will most

    likely result in lots of hits (e.g. just the file extension ".mp3").

    As a solution to this problem, the equality of all nodes was divided into two classes with

    the introduction of the (not yet finished) Gnutella Specifications 0.6[6]:

    Ultrapeers and leaves.

    An Ultrapeer is a node with an high-bandwidth connection to the internet and enoughresources to be able to cope with. It also acts as a pong cache for other nodes. The

    "leaves" (which are the "outer" Nodes, similar to the leaves on a tree) are connected

    to an ultrapeer and the ultrapeers themselves are connected to other ultrapeers. In

    the Gnutella Client Limewire, there usually there are 30 leaves per Ultrapeer, each leaf

    usually connected to 3-5 ultrapeers and the ultrapeers themselves maintain 32 intra-

    ultrapeer connections. Upon connecting, the leaves send the equivalent of a list of shared

    files (represented by cryptographic hashes of the shared filenames. See the "QRP"

    chapter) to their ultrapeer. After the exchange of information, search requests thatprobably wont result in a response usually dont reach the leaves anymore and are

    instead answered answered by their ultrapeers. Interestingly, the same situation took

    place in regular computer networks too. Gnutella transformed from a big broadcasting

    network (Hubs) to a "routed" network in which the ultrapeers act as routers between

    the network segments. this way, broadcasting gets limited to a minimum and the overall

    traffic is reduced. The current state of gnutella is that leaves dont accept Gnutella

    connections. They search for ultrapeers and initiate connections to them instead.

    To sum this up, we can basically say that there are 3 kinds of connections in the GnutellaNetwork:

    1. A leaf connected up to an ultrapeer (3 connection slots)

    2. An ultrapeer connected to another ultrapeer (32 connection slots)

    5http://www.cert.org/advisories/CA-1998-01.html

    21

    http://-/?-http://www.cert.org/advisories/CA-1998-01.htmlhttp://www.cert.org/advisories/CA-1998-01.htmlhttp://-/?-
  • 8/9/2019 The Current State of Anonymous File Sharing

    26/64

    3. An ultrapeer connected down to one of its leaves (30 connection slots)

    Figure 6: schematic view of the Gnutella network

    4.4.3 QRP

    As previously mentioned, a leaf is sending a list of its shared files to its ultrapeer. As

    sending every filename and a matching filehash would take up way too much bandwidth,Leafs create an QRP Table and send it instead. A QRP Table consists of an array of

    65,536 bits (-> 8192 byte). Each filename the leaf has in its share is being cryptograph-

    ically hashed to an 8 byte hash value and the the bits at the according position in the

    QRP array is being set to 1. After all filenames have been hashed, the resulting 8 kb

    table is being sent to the corresponding ultrapeers upon connection. The ultrapeers

    themselves hash the search querries and compare the resulting number with the position

    22

  • 8/9/2019 The Current State of Anonymous File Sharing

    27/64

    of its leaves QRP Tables. If the bit in the leaves QRP table is set to 0, the search querry

    wont be forwarded as there wont probably be any result anyway. If the leaf has a 1 at

    the corresponding position of its QRP Table, the result is forwarded and the information

    about filename, filesize etc. are transmitted back to the ultrapeer that forwarded the

    the search query. To keep unnecessary querries between ultrapeers down to a minimum,each ultrapeer XORs the QRP-Tables of each of its leaves and generates a composite

    table of all the things his leaves are sharing. These composite tables are being exchanged

    between ultrapeers and help the a ultrapeer to decide for which one of its own ultrapeer

    connections the search queries wont generate any results anyway. This allows the ultra-

    peer to direct search queries with a TTL of 1 only to ultrapeers which will answer with

    a result. Because of the high number of intra-ultrapeer connections (32 in Limewire),

    each relayed search can produce up to 32 times the bandwidth of the previous hop. A

    TTL 3 query being sent to one ultrapeer will generate up to 32 query messages in itssecond hop, and up to 1024 (32*32) in its third hop. The final hop accounts for 97%

    of the total traffic (up to 32768 messages). This is exactly where ultrapeer QRP comes

    really into play.

    4.4.4 Dynamic Querying

    Another thing that is being used in Gnutella to keep the incoming results and network

    traffic down to a minimum is called "Dynamic Querying". Dynamic querying is a series

    of searches which are separated by a time interval. As an ultrapeer performing dynamicquerying, it will do the first step, wait 2.4 seconds, and then do the second step. If the

    first or second search gets more than 150 hits, no more searches will be started because

    150 hits should be enough and further searches should use more specific search terms to

    get the amount of results down. If the last search lasts for more than 200 seconds, the

    ultrapeer will give up and stop.

    The first step:

    The first search is being directed only the leaves directly connected the the ultra-

    peer that received the search query (> TTL 1: only one hop). The ultrapeersearches the QRP Tables and forwards the search to the leaves that are likely

    to have an result. If less than 150 results are returned, the second step is being

    started. If more than 150 results are returned, the search has officially ended and

    no more "deeper" searches will be started.

    The second step:

    23

  • 8/9/2019 The Current State of Anonymous File Sharing

    28/64

    This step is being called the "probe query". Its a broad search of the top layers of

    the network and doesnt put too much stress on it. If the number of hits returned

    is above 150, no more further searches will be started. If the number of results

    is below 150, it is used to decide how "deep" future searches will penetrate the

    gnutella network.At first, a TTL 1 message is sent by the ultrapeer to its connected ultrapeers.

    This will only search the connected ultrapeers and their nodes, but not forward

    the message to other ultrapeers than the directly connected ones. If a TTL1 search

    is being blocked by all of the connected ultrapeers QRP tables, the TTL will be

    raised to 2 for some of the ultrapeers (meaning that the search will be forwarded

    by the directly connected ultrapeer to another ultrapeer once and end there).

    If after the "broad" search less then 150 results were returned, a search with a

    TTL of 2 or 3 is being sent to a single ultrapeer each 2.4 seconds. For a very raresearch, the queries all have TTLs of 3 and are only 0.4 seconds apart. The time

    between searches and the depth has been decided by the results the probe query

    got.

    According to the developers of Limewire, "a modern Gnutella program should

    never make a query message with a TTL bigger than 3."

    If you feel like implementing your own client by now, you should take a look at

    the excellent Gnutella for Users [7] Documentation and the Limewire Documentation

    concerning Dynamic Querrying [8]

    4.5 Anonymity in the current file-sharing protocols

    4.5.1 Bittorrent

    At first, the Bittorrent-protocol wasnt designed to have any anonymizing functions, but

    there are some properties and widespread additions that might add a certain layer of

    anonymity.

    The first thing that can be identified as "anonymizing" is the way a person joins a Bit-torrent network. To gain knowledge about the people participating in the distribution

    of a certain file, you first have to have the information about the network itself. This

    information is usually embedded within the torrent file itself in the form of the announce

    URL and identifying hash. This information can also be received by using PEX or the

    mainline DHT. Both of these additional ways can be disabled by placing a "private" flag

    24

  • 8/9/2019 The Current State of Anonymous File Sharing

    29/64

    in the torrent file. Marking a torrent file as private should be respected by bittorrent-

    client-applications and should allow people to create private communities (similar to the

    Darknet pattern described in chapter 5.1 ).

    Another way of hiding unnecessary information from third parties came up when ISPs

    6 started to disturb Bittorrent traffic to keep the bandwidth requirements of their cus-tomers down (see: "ISPs who are bad for Bittorrent"[10] over at the Azureus Wiki).

    This resulted in two new features being implemented in the main clients:

    Lazy Bitfield:

    Wherever Bittorrent connects to another client, it sends a Bitfield Message to signal

    which parts of the file it has already got and which ones are still missing (compare

    Chapter 4.2.3). More detail is available in the official Bittorrent Specification 1.0

    [11]:

    bitfield:

    The bitfield message may only be sent immediately after the handshaking

    sequence is completed, and before any other messages are sent. It is

    optional, and need not be sent if a client has no pieces.

    The bitfield message is variable length, where X is the length of the

    bitfield. The payload is a bitfield representing the pieces that have been

    successfully downloaded. The high bit in the first byte corresponds to

    piece index 0. Bits that are cleared indicated a missing piece, and set

    bits indicate a valid and available piece. Spare bits at the end are set to

    zero.

    A bitfield of the wrong length is considered an error. Clients should drop

    the connection if they receive bitfields that are not of the correct size,

    or if the bitfield has any of the spare bits set.

    Some ISPs who were especially keen on keeping clients from uploading data (which

    is potentially more cost intensive for the ISP) installed third party appliances

    which filtered traffic for "complete bitfields" meaning that the peer has already

    downloaded the complete file and only wants to upload things from that point

    on. Upon detection of such a packet, usually forged TCP RST packets are sent

    by the ISP to both ends of the connection to forcefully close it or the connection

    was shaped down to a point where no more than 1-2 kb/s throughput are being

    transfered.

    6ISP: Internet Service Provider

    25

    http://-/?-http://-/?-
  • 8/9/2019 The Current State of Anonymous File Sharing

    30/64

    To keep the ISP from detecting this, Lazy Bitfield was invented. When enabling

    Lazy Bitfield, the Bittorrent client handshake is altered, so that even if a Bittorrent

    Client has all available Data already downloaded (> complete bitfield), the client

    reports that it is still missing pieces. This will make a BitTorrent client that

    usually just wants to upload data initially appear to be a downloading peer (alsoreferred to as "a leecher"). Once the handshake is done, the client notifies its peer

    that it now has the pieces that were originally indicated as missing.

    Message Stream Encryption[12]:

    Heres a little quote taken from the azureus wiki[12], that describes the use of

    Message Stream Encryption (or if only the header is encrypted but the payload

    left unchanged: "Protocol Header Encryption") pretty good:

    The following encapsulation protocol is designed to provide a completely

    random-looking header and (optionally) payload to avoid passive proto-

    col identification and traffic shaping. When it is used with the stronger

    encryption mode (RC4) it also provides reasonable security for the en-

    capsulated content against passive eavesdroppers.

    It was developed when some ISPs started traffic-shaping Bittorrent traffic down

    to a point where no more communication was possible.

    Message Stream Encryption establishes an RC4 encryption key which is derivedfrom a Diffie-Hellman Key Exchange combined with the use of the torrents info

    hash. Thanks to the Diffie-Hellman Key Exchange, eavesdroppers cant gather

    important information about the crypto process and the resulting RC4 encryption

    key. The use of the info hash of the torrent minimizes the risk of a man-in-the-

    middle-attack. The attacker would previously have to know the file the clients are

    trying to share in order to establish a connection as a man in the middle.

    This is also emphasized in the specifications[12]:

    It is also designed to provide limited protection against active MITMattacks and port scanning by requiring a weak shared secret to com-

    plete the handshake. You should note that the major design goal was

    payload and protocol obfuscation, not peer authentication and data in-

    tegrity verification. Thus it does not offer protection against adversaries

    which already know the necessary data to establish connections (that is

    IP/Port/Shared Secret/Payload protocol).

    26

  • 8/9/2019 The Current State of Anonymous File Sharing

    31/64

    RC4 is used as a stream cipher (mainly because of its performance). To prevent

    the attack on RC4 demonstrated in 2001 by Fluhrer, Mantin and Shamir[13] the

    first kilobyte of the RC4 output is discarded.

    The "Standard Cryptographic Algorithm Naming" database[14] has a "most con-

    servative" recommendation of dropping the first 3072 bit produced by the RC4algorithm, but also mentions that dropping 768 bit "appears to be just as reason-

    able".

    According to these recommendations, the 1024 bit dropped in the MSE specifica-

    tion seem just fine.

    According to the ipoque internet study 2007[3], the percentage of encrypted bittorrent

    traffic compared to the total amount of bittorrent traffic averages at about 19%. In

    Figure 7: percentage of encrypted Bittorrent P2P Traffic

    a press release [15] by Allot Communications dating back to the end of 2006, they

    state that their NetEnforcer appliance detects and is able to filter encrypted Bittorrent

    packages.

    4.5.2 eDonkey

    Emule, the client with the biggest amount of users in the eDonkey network, supportsprotocol obfuscation similar to the Bittorrent protocol header encryption starting from

    version 0.47c to keep ISPs from throttling the Emule file-sharing traffic. The keys to

    encrypt the protocol are generated by using an MD5 hash function with the user hash,

    a magic value and a random key.

    Encrypting the protocol (Handshakes, ...) tries to keep 3rd parties from detecting the

    usage of Emule. The number of concurrent connections, connections to certain ports

    27

  • 8/9/2019 The Current State of Anonymous File Sharing

    32/64

    and the amount of traffic generated by them might be a hint, but the protocol itself

    cant be distinguished from other encrypted connections (VPN, HTTPS...), at least in

    theory. The only technical documentation about this feature that can be found without

    hours of research consists of copy & paste source codes comments in the Emule Web

    Wiki[17]

    4.5.3 Gnutella

    The way searches are forwarded over multiple hops doesnt allow a single node (besides

    the connected supernode) to determine who actually sent the search in the beginning.

    Although this wouldnt keep a 3rd party which acts as an ISP to detect who sends out

    searches, but it keeps a single malicious attacker from "sniffing around" unless he acts

    as a supernode.

    28

  • 8/9/2019 The Current State of Anonymous File Sharing

    33/64

    5 Patterns supporting anonymity

    So far, this thesis has discussed the currently "popular" file-sharing networks and their

    protocol design concerning scalability . It was also pointed out that although these pro-

    tocols werent designed with anonymity in mind, they tend to have properties or modesof usage which make them anonymize parts of a users action in the network.

    Now it is time to take a look at network-protocols that were designed with anonymity

    in mind. This thesis will first take a look at the basic patterns behind nearly all of the

    currently available implementations of anonymous file-sharing software and then discuss

    the implementations themselves with a focus on their scalability.

    5.1 The Darknet patternThe darknet pattern targets the social part of the whole anonymous file-sharing problem.

    As explained in the Chapter "Different needs for anonymity" (1), there are different

    reasons why someone might want to achieve a certain level of anonymity, but most of

    them deal with the problem of hiding information about the things you do/say from a

    third party. This third party most of the time isnt anyone you know personally and

    tend to socialize with. This fact is exactly where Darknets come to play. A Darknet

    exists between friends and well-known contacts which share a mutual level of trust.

    By creating a darknet (also known as a "friend to friend" (F2F) network), you manuallyadd the nodes your application is allowed to connect and exchange information with.

    In the end, the "pure" concept resembles an instant messenger with an added layer of

    file-sharing. Most of the major implementations have an "all or nothing" approach to

    sharing files, so having private information in your shared folder which is personally

    directed to only one of your peers usually isnt a good idea.

    One of the major problems of darknets is, that they usually tend to model the social

    contacts of a person into a network. While this is good for protection against certain

    third parties, the amount of data shared and the number of contacts that are online

    at a single instance is tiny compared to the popular file-sharing protocols. The reason

    behind that is the fact that a person can only add a limited number of friends without

    compromising security by adding peers they dont really know (The usual seen at a

    party once kind of contact). In some implementations (e.g. Alliance P2P7 ), it is

    possible to allow friends to interconnect even if they might not really know each other.

    7AllianceP2P: http://alliancep2p.sf.net

    29

    http://-/?-http://alliancep2p.sf.net/http://alliancep2p.sf.net/http://-/?-
  • 8/9/2019 The Current State of Anonymous File Sharing

    34/64

    This might help to solve the problem of having too few nodes to share files with, but it

    destroys the idea behind darknets as a protection against unknown third parties.

    A good overview of darknets can be found in the paper How to Disappear Completely:

    A survey of Private Peer-to-Peer Networks[20] by Michael Rogers (University College

    London) and Saleem Bhatti (University of St. Andrews).

    5.1.1 Extending Darknets - Turtle Hopping

    As previously mentioned in Section 5.1, the main problem with Darknets is, that the

    number of peers available at a single instance is usually only a fraction of the number

    of total peers in the users contact list. Different time-zones, different schedules at

    work/university/school tend to diminish the number of shared files in a persons darknet

    to the point where it loses its attractiveness to run the software. This problem is being

    solved by the "Turtle Hopping pattern, described on turtle4privacy homepage [18]. TheTurtle Hopping pattern combines proxy chaining and Darknets by using your direct

    connections as gateways to connect to your friends friends. While this solution poses

    Figure 8: Turtle Hopping

    30

  • 8/9/2019 The Current State of Anonymous File Sharing

    35/64

    a certain throughput/security tradeoff, it extremely extends the amount and availability

    of files. Both, search queries and results are only forwarded between trusted peers. The

    "official" implementation of turtle hopping, which was developed at the Vrije Universiteit

    in Amsterdam also uses encryption between hops to keep eavesdroppers from listening

    to the searches.A great source of further information concerning the darknet pattern is "The Darknet

    and the Future of Content Distribution" by Microsoft [9]

    5.2 The Brightnet pattern

    The word Brightnet makes one thing pretty clear: its the opposite of a Darknet. To

    freshen up you memories: In the Darknet pattern, you only control WHO you share

    your data with, but once you decide to share with a peer, you usually dont have any

    secrets from that node about WHAT you share.

    In the Brightnet pattern it is exactly the opposite: you usually dont care that much

    about who you trade data with, but the data itself is a "secret".

    This might seem a little bit confusing, how can data that is openly shared with everyone

    be a secret?

    To understand this, one has to look at the typical bits and bytes that are shared in a

    file-sharing application: Usually, every piece of data does not only consists of arbitrary

    numbers, but it has a certain meaning. It is part of the movie "xyz", the song "abc"

    or the book "123". When uploading even a small portion of this data, you might beinfringing copyright or, depending where you live, might get yourself into all sorts of

    trouble for the content you downloaded. This could lead to the situation that every

    government agency or corporation could tell if youre sharing/acquiring a file by either

    simply receiving a request from or sending a request to you (given that youre on an

    "open" network) or looking at the transfered data on an ISP level.

    The main idea behind the brightnet pattern is to strip the meaning from the data

    you transfer. Without any distinct meaning, the data itself isnt protected by any

    copyright law or deemed questionable in any way. It would be totally fine to shareand download this meaningless data. To end up with "meaningless" data, the current

    brightnet implementations use a method called "multi-use encoding" as described in the

    paper "On copyrightable numbers with an application to the Gesetzklageproblem[19] by

    "Cracker Jack" (The title is a play on Turings "On computable numbers").

    The simplified idea is to rip apart two or more files and use an XOR algorithm to

    recombine them to a new blob of binary data which, depending on the way you look at

    31

  • 8/9/2019 The Current State of Anonymous File Sharing

    36/64

    it, has either no specific meaning or any arbitrary meaning you want it to. The whole

    meaning of the block lies in the way you would have to combine it with other blocks to

    end up with either one of the "original" files used in the process.

    The main idea of the people behind the brightnet ideas is the fact that the current

    copyright-laws originate from a time where a "digital copy" without any cost wasntpossible and that those laws simply dont reflect the current possibilities. They claim

    that digital copyright is not mathematically self consistent. Therefore logically/legally

    speaking the laws are inconsistent as well.

    5.2.1 the legal situation of brightnets

    In most western societies, the interesting thing is the way copyrighted original blocks

    would influence the whole situation concerning the possible copyright of the resulting

    block.The specific details of this can be found in the great paper by Hummelcalled "Wrdigung

    der verschiedenen Benutzungshandlungen im OFF System Network (ON) aus urheber-

    rechtlicher Sicht" [24].

    To summarize it, there are basically 3 different possibilities the resulting block could be

    treated as according to German law:

    a new work As the process of creation was fully automated when generating the result-

    ing block, the new block doesnt qualify as a real geistige Schpfung according

    to 2 II UrhG in connection with 7 UrhG.

    a derivation of the old work In this case, the original copyright holder would still have

    a copyright claim on the newly generate block.

    To be able to argue this case, the operation used to create the resulting block

    would have to be reversible and a single person would need to be able to claim

    copyright.

    As the resulting and original blocks which are needed to reverse the transformation

    find themselves on a peer to peer network, the reversibility is only given when all

    of the nodes carrying the blocks in question are actually on the network.

    As the resulting block could contain traces of more than one copyrighted file (in

    theory: as many as there are blocks in the network), everybody could start claiming

    copyright on that block.

    shared copyright In german law, according to 8 UrhG, a shared copyright only ex-

    ists when there was a combined creation of the work (gemeinsame Schpfung

    32

  • 8/9/2019 The Current State of Anonymous File Sharing

    37/64

    des Werkes). As the generation of the block was fully automated, there is no

    "innovative process" to constitute a shared copyright.

    This leads to the conclusion that everyone could legally download arbitrary blocks.

    5.3 Proxy chaining

    When it comes to the pattern of proxy chaining, we have to look at two things:

    What is a proxy server?

    How does this improve anonymity? (aka: the proxy chaining pattern)

    What does being a proxy mean in the legal sense?

    5.3.1 What is a proxy server?The first question can be easily answered: A proxy server is a computer system or an

    application, which takes the requests of its clients and forwards these requests to other

    servers.

    The thing differentiating proxies is their visibility. Sometimes a proxy has to be know-

    ingly configured in the clients application. Companies often use SOCKS Proxies to

    allow clients behind a firewall to access external servers by connecting to a SOCKS

    proxy server instead.

    These server allow companies to filter out questionable content or manage weather a userhas the proper authorization to access the external server. If everything that is checked

    has been deemed fine, the SOCKS Server passes the request on to the target server and.

    SOCKS can also be used in the other direction, allowing the clients outside the firewall

    zone to connect to servers inside the firewall zone. Another use of socks proxies can be

    caching of data to improve latency and keep the outbound network traffic down to a

    minimum.

    Transparent proxies can do pretty much the same thing, the difference to the usual

    SOCKS proxies is, that the client doesnt need to tell his applications to use any form

    of proxy because they are built as part of the network architecture. One could compare

    transparent proxies to a positive version of a "man in the middle".

    5.3.2 How can I create a network out of this? (aka: the proxy chaining pattern)

    The main feature of a proxy chaining network is, that it is a logical network being built

    upon the internets TCP/IP infrastructure.

    33

  • 8/9/2019 The Current State of Anonymous File Sharing

    38/64

    This means that the packets being send into the network dont have specific IP addresses

    as a source or destination on a logical level. They usually use some sort of generated

    Identifier (e.g. a random 32 bit string). This can be compared to the way Layer 3

    switches /routers manage IP addresses. A switch itself doesnt know where in the world

    a certain IP address is stationed. The switch itself only knows on which port someone islistening who knows a way to transfer a package to the destination IP. While receiving

    new packets, the switch makes a list (a so called "routing table") of the different senders

    IP addresses in conjunction with the MAC Address and the (hardware-) port the packet

    came from. this way, the switch learns to identify a certain port with a certain MAC/IP

    address.

    When joining a proxy-chaining network, the client tries to connect to a certain number of

    other clients on the network. These initial connections can be compared to the different

    ports a switch usually has. After connecting, the client notes the Source IDs of packetsit receives from one of those initial connections and generates a routing table. This

    way, the client begins to draw its own little map of the network which helps forwarding

    packets to the proper connection.

    The way a client in a proxy chaining network reacts when receiving a packet with an

    unknown target ID can also be compared to the way a switch works. This time its

    comparable to an unknown target MAC address. In case a switch doesnt know a target

    MAC address, it simply floods the ethernet-frame out of every port. This way, the packet

    should arrive at the destination if one of the machines behind one of the (hardware-)

    ports knows a way leading to the destination address. To keep packets from floating

    though the network forever, some sort of "hop-to-live/time-to-live" counter is usually

    introduced which gets decreased by 1 every time someone forwards the packet. If there

    is no reasonable short way to get the packet to its destination.

    5.3.3 How does this improve anonymity?

    As there is no central registration that gives out the IDs used to identify nodes on

    the logical network, it is virtually impossible to gather knowledge about the identity of

    a person operating a node by looking at the ID of e.g. a search request. In current

    networks such as Gnutella (Chapter 4.4) or eDonkey (Chapter 4.3), the identity of a

    client is based on its IP address. As IP addresses are managed globally, an IP can be

    tracked down to the company issuing it (in most cases: you internet Service Provider)

    and, at least in germany, they are supposed to keep records when a customer was issued

    34

  • 8/9/2019 The Current State of Anonymous File Sharing

    39/64

    an IP.

    The only thing that can be said with certainty is that a node forwarded a packet, but

    as I will explain in Chapter 5.3.4, this shouldnt be a problem in most modern societies.

    5.3.4 legal implications of being a proxy

    Before I start talking about the legal implications, Id first like to clarify that I am not

    a lawyer and can only rely on talks with people who have actually studied law.

    As this thesis is written in Germany, Id like to give a short overview of the current

    german laws regarding the situation of a proxy in a network. This is important to

    know as some of the implementations of anonymous file-sharing protocols like the ones

    implementing a form of proxy chaining will put data on your node and make it the origin

    of material that might be copyright infringing or pornographic without the your node

    specifically asking for it.While communication is encrypted in most of the cases, it might happen that the node

    directly in front or directly behind you in a proxy chain will try to sue you for forwarding

    (in a simpler view: "sending out") or receiving a file. That could be because the file your

    node forwarded is being protected by copyright or the content itself is illegal (e.g. child

    pornography). Especially in the case of severely illegal things like child pornography,

    its important to know how the current legal situation might affect the owner of a node

    who might not even know that kind of data was forwarded.

    As a simple "proxy", the user should be covered by 9 TDG:

    9 TDG: Durchleitung von Informationen

    1. Dienstanbieter sind fr fremde Informationen, die sie in einem Kommu-

    nikationsnetz bermitteln, nicht verantwortlich, sofern sie:

    1. die bermittlung nicht veranlasst,

    2. den Adressaten der bermittelten Informationen nicht ausgewhlt

    und

    3. die bermittelten Informationen nicht ausgewhlt oder verndert

    haben.

    Satz 1 findet keine Anwendung, wenn der Dienstanbieter absichtlich mit

    einem Nutzer seines Dienstes zusammenarbeitet, um eine rechtswiedrige

    Handlung zu begehen

    35

  • 8/9/2019 The Current State of Anonymous File Sharing

    40/64

    2. Die bermittlung von Informationen nach Absatz 1 und die Vermit-

    tlung des Zugangs zu ihnen umfasst auch die automatische kurzzeitige

    Zwischenspeicherung dieser Informationen, soweit dies nur zur Durch-

    fhrung der bermittlung im Kommunikationsnetz geschieht und die

    Informationen nicht lnger gespeichert werden, als fr die bermittlungblicherweise erforderlich ist.

    You have to take in concern that most of the time, there is hop-to-hop encryption

    between the nodes in anonymous file-sharing network, but its not very likely that this

    could be interpreted as "changing" the transported information according to 9 (1) Nr.3

    While the general situation for nodes acting as a proxy server seems to be perfectly fine,

    there could be a differentiation between nodes concerning the content they forward or

    the network such a node is positioned in.

    I will discuss two of the different worst-case scenarios a node-owner could face when

    operating a client in a file-sharing network in a proxy-like manner. These scenarios

    would be a lawsuit for:

    1. Distribution of illegal content

    2. Distribution of copyright protected content

    5.3.5 illegal content stored on a node (child pornography as an example)

    184b StGB clearly states that the distribution, ownership or acquisition of child pornog-

    raphy is illegal. The paragraph doesnt explicitly make negligence illegal.

    Its a bit hard to grasp the term "negligence" when it comes to child pornography, but

    keep in mind that in some of the networks, your PC only acts as a proxy for file-transfers

    started by other nodes. A users node might even store some data to serve it to the rest

    of the network without the users knowledge. In this case, the user couldnt expect this

    to happen and therefor might have acted negligent and it could be deducted from 15

    StGB that this isnt punishable:

    15 StGB: Vorstzliches und fahrlssiges Handeln

    Strafbar ist nur vorstzliches Handeln, wenn nicht das Gesetz fahrlssiges

    Handeln ausdrcklich mit Strafe bedroht.

    The burden of proof in this case is yet another problem which could fill a thesis on its on

    and will therefor not be discussed in this paper, but usually the legal term "Ei incumbit

    36

  • 8/9/2019 The Current State of Anonymous File Sharing

    41/64

    probatio qui dict, non que negat" comes into play (Translates to: "The burden of proof

    rests on who asserts, not on who denies").

    The latin legal maxim of "in dubio pro reo" is in effect in most modern societies and

    "local" versions of it can be found

    in English: "the presumption of innocence"

    in French: "La principe de la prsomption dinnocence"

    in German: "Im Zweifel fr den Angeklagten"

    Unluckily for the situation of anonymous file-sharing, it is always a question of the

    point of view if there are any doubts concerning the legality of a persons actions.

    In case of file-sharing networks, it is always of interest what the network in mainly used

    for. If the network doesnt degenerate to a meeting point for child-molesters to swap

    pornography, and the existence of illegal content stays an exception in the network, a

    node that acts as a proxy and forwards it shouldnt be facing any punishment.

    In case of the network degenerating or being covered badly in the media (usually the

    yellow press is pretty interested in "big headlines"... ), it could be difficult to establish

    the "negligence" part when running a proxy node in the network.

    In this case it could be seen as a potentially premeditated distribution (eventualvorst-

    zliche Verbreitung) which would be judged as a premeditated offense (Vorsatzdelikt).

    At the moment, the current german legal system seems to be under the impression that

    a node forwarding possibly illegal data (acting as a proxy) in an anonymous file-sharingnetwork shouldnt face any punishment for doing so.

    For people more interested in the corresponding german laws, Id recommend reading

    Art 103 II GG, Art 6 EMRK and Art 261 StPO

    5.3.6 copyright infringing content

    When it comes to data that is subject to copyright, the situation is basically the same

    as in Chapter 5.3.4. The main difference is, that in this case the files are protected by

    106 UrhG:

    106 UrhG: Unerlaubte Verwertung urheberrechtlich geschtzter Werke

    1. Wer in anderen als den gesetzlich zugelassenen Fllen ohne Einwilligung

    des Berechtigten ein Werk oder eine Bearbeitung oder Umgestaltung

    eines Werkes vervielfltigt, verbreitet oder ffentlich wiedergibt, wird

    mit Freiheitsstrafe bis zu drei Jahren oder mit Geldstrafe bestraft.

    37

  • 8/9/2019 The Current State of Anonymous File Sharing

    42/64

    2. Der Versuch ist strafbar

    Another interesting Paragraph concerning copyright is 109 UrhG, which shouldnt go

    unmentioned in this chapter:109 UrhG: Strafantrag

    In den Fllen der 106 bis 108 und des 108b wird die Tat nur auf Antrag

    verfolgt, es sei denn, dass die Strafverfolgungsbehrden ein Einschreiten von

    Amts wegen fr geboten halten.

    The main difference to 5.3.4 is, that the border between negligence and probable intent is

    very small. Especially in the case of some of the bigger anonymous file-sharing networks

    which are at the moment almost exclusively used to swap copyrighted movies/music/software,

    it would be pretty unbelievable to argue that a user wouldnt tolerate his node forward-

    ing copyright protected files (according to 106 UrhG).

    5.4 Hop to Hop / End to End Encyrption

    Encryption can be used as a means to keep a passive third party from eavesdropping on

    the content of your file-transfers. They will still allow third parties to monitor who you

    share files with, but besides a little hit on performance, it wont harm using it.

    5.4.1 Hop to Hop

    Hop to Hop encryption can also be described as "Link encryption", because every intra

    node communication is secured by the use of cryptography. Using a Hop to Hop encryp-

    tion only secures connections between nodes and keeps 3rd parties which are not involved

    in the process from spying on the data which is exchanged. The nodes which are for-

    warding the data and are active components of the file-sharing-network (not just routing

    instances on the "normal" internet) will be able to read the data in cleartext. Everynode that receives a message will decrypt, read, re-encrypt and forward this message

    to the next node on the route. It will also put extra stress on the network because the

    encryption has to occur on every hop again. Usually the speeds of internet connections

    arent that fast, so recent PCs dont have a problem encrypting the data fast enough.

    38

  • 8/9/2019 The Current State of Anonymous File Sharing

    43/64

    Figure 9: Hop to Hop encryption

    5.4.2 End to End

    End to end encryption adds encryption from the source to the end. This keeps every

    node besides the source and the destination from spying on the content of the data.

    5.4.3 Problems with end to end encryption in anonymous networks

    As Jason Rohrer, the inventor of the anonymous file-sharing software "Mute"8

    pointsout in his paper dubbed "End-to-end encryption (and the Person-in-the-middle attacks)"

    [21], using end-to-end encryption in an anonymous network poses a technical problem:

    How can the sender get the receivers key to start the encrypted transfer in the first

    place? In the majority of anonymous requests are routed over several nodes acting as

    proxies (compare: Chapter 5.3). Each of those nodes could act as a "man in the middle"

    and manipulate th