Int J Parallel Prog
DOI 10.1007/s10766-013-0282-5

Inaccuracy in Private BitTorrent Measurements

Hai Jin · Honglei Jiang · Shadi Ibrahim · Xiaofei Liao

Received: 8 April 2013 / Accepted: 4 October 2013
© Springer Science+Business Media New York 2013

H. Jin (B) · H. Jiang · X. Liao
Cluster and Grid Computing Lab, Services Computing Technology and System Lab, School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China
e-mail: [email protected]

S. Ibrahim
INRIA Rennes-Bretagne Atlantique, Rennes, France
e-mail: [email protected]

Abstract BitTorrent communities have recently been evolving rapidly into private torrent (PT) sites. PT sites employ several incentive rules to improve the performance and availability of the system. Many studies have been dedicated to measuring and modeling PT systems in order to better understand the new rules and their impact on users’ behavior, and thereby to improve the usability of the system. These studies have been performed on different PT sites that differ in their implementation of the system and in their user incentive rules. Therefore, current measurement findings cannot reflect accurate results and, more importantly, the current conclusions may be biased. In this paper, we investigate the accuracy of previous measurement studies on PT sites, while emphasizing the incentive rules employed and the interplay between these rules and corresponding objective factors. We evaluate the behavior regulation policies of the front-end website and the tracker and examine the semantics of the provided data. Using this information, we have designed a new crawling methodology and conducted a large-scale measurement study across four representative PT sites over a year. Interestingly, we find that most reported measurements have neither considered design goals nor thought through the incentive policies and their interplay. This lack of awareness may in turn lead to inaccurate conclusions about system properties. For example, the Seeder to Leecher Ratio (SLR), which is reported in most of the available measurements, is routinely at least 16–45% less than the real SLR because the sites ignored the “partial seeders” in their calculation. This study aims to offer fundamental insights into designing an accurate methodology when conducting measurement studies on PT sites.

Keywords BitTorrent · Private torrent · Inaccuracy · Incentive policies · Measurements

1 Introduction

BitTorrent (BT) has been dominating Internet traffic as one of the most common protocols for transferring large files. It has been estimated that BT accounted for roughly 43–70% of all Internet traffic [11]. As users’ performance requirements increase, BitTorrent communities have witnessed a large shift towards Private Torrent (PT) sites with strong built-in incentive mechanisms that aid in retaining users. Unlike in public BT, a user needs an invitation to join a PT site before browsing or downloading any content. More importantly, PT sites require users to maintain a certain upload/download ratio, a policy called Share Ratio Enforcement (SRE), in order to avoid being banned or deleted and thus to keep downloading content. In addition to SRE, PT sites employ other rules, such as torrent promotion and bonus systems, to incentivize users to contribute efficiently to the system. For instance, some PT sites decrease the download cost and/or increase the upload reward of some torrents to motivate users to leech or seed them. Moreover, some PT sites count seeding time instead of upload volume in order to encourage users with limited network bandwidth to spend more time seeding old and unpopular torrents. Through these features and techniques, PT sites provide higher download speeds and longer torrent lifetimes than public BT [2,14].

The increasing popularity of PT sites necessitates empirical measurements in order to better understand these systems and to aid in developing technical innovations that could improve them. Accordingly, many studies have been dedicated to measuring and modeling PT systems [1,2,12–14,22]. These studies have been conducted on different PT sites, which differ in their system implementation and incentive policies, and therefore in their usability. In addition, different studies use different data collection and crawling methodologies. Consequently, current measurements cannot reflect accurate results and, more importantly, bias can easily occur. Therefore, in this paper, we focus on exploring the inaccuracies present in PT measurements and analyzing their causes.

By evaluating the behavior regulation of front-end websites and trackers, we have designed a new crawling methodology that addresses the possibility of missing data during collection. By obtaining Staff membership on both the KMGTP and SJTU sites, we were able to browse almost all the information provided by these sites and to reduce the data lost during the traces. Then, over the course of one year, we applied our crawling methodology to four representative PT sites in China: ChinaHDTV (http://www.chinahdtv.org/), HDChina (http://hdchina.org/), KMGTP (http://www.kmgtp.org/), and SJTU (http://pt.sjtu.edu.cn/). By studying the details of the collected data, not only from the perspective of an observer but also from the perspective of an administrator, we provide a numerical analysis of the inaccuracy in PT measurements, which is due to the interaction between the employed incentive rules and their objective factors (torrent age, popularity, and size).

Interestingly, our study demonstrates that some reported results are highly biased when different PT sites are targeted. For example, users’ Share Ratio ($SR = \frac{\text{Upload Volume}}{\text{Download Volume}}$) values vary among different sites by up to 60%. This is explained by the impact of different incentive policies with varying degrees of implementation. Thus, it is better to provide the “real” SR values, which indicate users’ real upload and download volumes in these sites, side by side with the “virtual” SR values. The two differ because the “virtual” SR values take into account the effect of the incentive policies. Moreover, the Seeder to Leecher Ratio (SLR) reported by most PT sites is at least 16–45% less than the “real” SLR because the sites ignored “partial seeders.”

In Sect. 2 of this paper, we provide an overview of the Private BitTorrent incentive policies and the existing crawling methods. This study focuses on three key areas: data-source analysis, crawler design, and data analysis. Section 3 covers data-source analysis and describes how to analyze and understand the sites investigated. Building on Sect. 3, Sect. 4 concentrates on crawler design. It provides insight into some of the challenges in avoiding missing data when crawling and then introduces our crawling methodology, which takes these challenges into account and tries to rectify them. In Sect. 5, we demonstrate the inaccuracies encountered during our analysis of the crawled data. Section 6 summarizes related work, and we conclude in Sect. 7.

2 Background

In this section we briefly introduce the incentive policies used in PT sites, and then discuss the current crawling methodologies used in BT and PT measurements.

2.1 Incentive Mechanism in Private BitTorrent Communities

Recently, the BitTorrent community has witnessed a remarkable formation of Private Torrent clubs (PT). PT communities provide policies to guarantee download performance and content availability. The policies include Share Ratio Enforcement (SRE), torrent promotion, and the credit (bonus) system.

2.1.1 Share Ratio Enforcement

As mentioned before, a user needs an account before browsing or downloading any content from a PT site. Moreover, most PT sites employ a built-in incentive mechanism called “Share Ratio Enforcement” in order to incentivize users to contribute more to the download performance of the site.

Each PT site keeps track of the download volume and upload volume of all users. Consequently, each user is associated with a share ratio (SR) value. In accordance with the SRE rules, PT sites determine the amount of data that the user can download, upgrade or downgrade the level of a user, and find users that need to be banned or deleted. The caste system (also called the member-ranking system) has been developed as a way to encourage users. For example, users with a greater number of contributions are given some privileges, such as being able to buy invitation codes for their friends. In the caste system, users are classified into different levels on the basis of their join time, upload volume, download volume, share ratio, and so on (see Table 1). For example, in the SJTU site, newly joined users, categorized by default as “User”, can download up to 20 GB without any contribution to the site. The inspection period lasts one month. After the inspection period, they may stay as “User” or may be upgraded to “Power user” based on their SRs and download volumes. Otherwise, if they do not meet the minimum requirements of the “User” level, they will be warned and demoted to “Peasant user”. A “Peasant user” is given a fixed period of 8 days to raise the SR to the minimum requirement of “User”; otherwise the account is reported to the system administrator and deleted.

Table 1 SJTU share-ratio enforcement and user levels

User class      Join weeks   Downloaded   Share ratio
User            –            > 20 GB      ≥ 0.3
                             > 50 GB      ≥ 0.4
                             > 100 GB     ≥ 0.5
                             > 200 GB     ≥ 0.6
                             > 400 GB     ≥ 0.7
                             > 800 GB     ≥ 0.8
Power user      ≥ 4          ≥ 50 GB      ≥ 1.05
Elite user      ≥ 8          ≥ 120 GB     ≥ 1.55
Crazy user      ≥ 15         ≥ 300 GB     ≥ 2.05
Insane user     ≥ 25         ≥ 500 GB     ≥ 2.55
Veteran user    ≥ 40         ≥ 750 GB     ≥ 3.05
Extreme user    ≥ 60         ≥ 1 TB       ≥ 3.55
Ultimate user   ≥ 80         ≥ 1.5 TB     ≥ 4.05
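The class thresholds in Table 1 can be expressed as a small lookup. The following is a minimal sketch, not the site's actual code: the function name `classify`, the rule that all three thresholds of a class must hold at once, the conversion 1 TB = 1024 GB, and the immediate demotion to “Peasant user” (ignoring the inspection period) are assumptions made for illustration.

```python
TB = 1024  # assumption: volumes are expressed in GB and 1 TB = 1024 GB

# (class name, min weeks since joining, min downloaded GB, min share ratio),
# listed from the highest class to the lowest, following Table 1.
PROMOTED_CLASSES = [
    ("Ultimate user", 80, 1.5 * TB, 4.05),
    ("Extreme user", 60, 1 * TB, 3.55),
    ("Veteran user", 40, 750, 3.05),
    ("Insane user", 25, 500, 2.55),
    ("Crazy user", 15, 300, 2.05),
    ("Elite user", 8, 120, 1.55),
    ("Power user", 4, 50, 1.05),
]

# Minimum share ratio for the plain "User" class:
# (downloaded more than X GB, required share ratio), largest volume first.
USER_MIN_RATIO = [(800, 0.8), (400, 0.7), (200, 0.6),
                  (100, 0.5), (50, 0.4), (20, 0.3)]


def classify(join_weeks, downloaded_gb, share_ratio):
    """Return the user class implied by the Table 1 thresholds."""
    # Assumption: a class is granted only if all three conditions hold.
    for name, weeks, volume, ratio in PROMOTED_CLASSES:
        if (join_weeks >= weeks and downloaded_gb >= volume
                and share_ratio >= ratio):
            return name
    for threshold, min_ratio in USER_MIN_RATIO:
        if downloaded_gb > threshold:
            return "User" if share_ratio >= min_ratio else "Peasant user"
    # Up to 20 GB downloaded: no ratio requirement applies yet.
    return "User"


print(classify(10, 150, 1.6))  # Elite user
print(classify(2, 30, 0.2))    # Peasant user
```

Keeping the thresholds in data tables rather than in nested conditions makes it easy to swap in the rules of another site when comparing communities.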

2.1.2 Promotion Attributes of Torrents

PT sites have implemented a system of award attributes for certain torrents in order to improve the download performance and availability of some torrents, including large, old, or unpopular files. SRE alone may cause a “credit squeeze” problem, in which system performance is reduced by a lack of user credit: users with small SRs cannot download more content, which means that they cannot seed it either. To solve this problem, some PT sites promote particular torrents by applying award attributes. Users are therefore willing to download and seed these torrents in order to increase their SRs. For instance, to encourage uploading, torrents are marked as UX-up, where U > 1 (e.g., users’ upload volumes are counted twice for a 2X-up torrent). To encourage downloading, administrators mark target torrents as Y%-leech, where 0 ≤ Y < 100 (e.g., for a 50%-leech torrent, only half of the download volume is counted).
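The effect of the two promotion attributes on the volumes credited to a user can be illustrated with a short calculation. This is a sketch under stated assumptions: the function names `credited_volumes` and `share_ratio` and their parameters are hypothetical, and real trackers may apply further rules (user type, bonus points) that are not modeled here.

```python
def credited_volumes(real_upload, real_download,
                     upload_multiplier=1.0, leech_percent=100.0):
    """Apply a UX-up multiplier and a Y%-leech discount to real volumes.

    upload_multiplier is U in "UX-up"; leech_percent is Y in "Y%-leech"
    (100 means no discount, 0 means free leech).
    """
    if upload_multiplier < 1.0:
        raise ValueError("upload_multiplier must be at least 1")
    if not 0.0 <= leech_percent <= 100.0:
        raise ValueError("leech_percent must be between 0 and 100")
    credited_upload = real_upload * upload_multiplier
    credited_download = real_download * leech_percent / 100.0
    return credited_upload, credited_download


def share_ratio(upload, download):
    """Upload/download ratio; infinite when nothing was downloaded."""
    return float("inf") if download == 0 else upload / download


# A user uploads 10 GB and downloads 4 GB on a 2X-up, 50%-leech torrent.
up, down = credited_volumes(10, 4, upload_multiplier=2.0, leech_percent=50)
print(share_ratio(10, 4))     # ratio of the real volumes: 2.5
print(share_ratio(up, down))  # ratio of the credited volumes: 10.0
```

The gap between the two printed ratios shows how promotion attributes alone can inflate the ratio a site displays relative to the traffic actually exchanged.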


Table 2 Promotion torrents in HDChina, based on a snapshot taken on November 22nd, 2011, when the site had 48,551 torrents in total

Promotion attribute   No. of torrents   Percentage (%)
30% Leech             1,432             2.9
50% Leech             9,066             18.6
Free                  2,439             5
Free in 24 h          13                0.26
Total                 12,950            26.6

Through the use of these promotion attributes, content becomes more affordable to users, which alleviates the “credit squeeze” problem, especially for large torrents. That is, by decreasing the cost of downloading some files, more users are able to download these files and then seed them. For example, the site CHDBits (http://chdbits.org/) speeds up the distribution of new or large torrents by setting newly uploaded torrents as free (“0%-leech”) for the first 24 h. Consequently, more users can download these files and seed them later. The most efficient way for new users to earn upload volume is to download as many free-leech torrents as possible and seed them. Moreover, setting the attributes of old torrents to UX-up further improves content availability, as more users are willing to keep seeding them. Promotion attributes have proved to be an efficient way to improve the performance and availability of content and are therefore used across many PT sites. For example, 26.6% of the torrents in HDChina are set with the promotion attributes shown in Table 2.

2.1.3 Seeding Bonus System

To improve the availability of their content, some PT sites reward users for seeding more, larger, older, or fewer-seeder torrents by converting their seeding time into “points per hour” instead of counting only upload volume. Users can then exchange their points for upload volume; in particular, users with low upload bandwidth (such as ADSL users) may be able to meet the minimum SRE requirements through the “points per hour” system even though their upload volume is low. Moreover, in order to incentivize users with high upload volume to contribute more to the site, users can use their points to buy an account invitation, or are given the privilege to start a topic or make a post on the PT forum.

2.2 Existing Crawling Methodologies

Different methodologies have been used to collect data traces for BitTorrent measurements, including: (a) obtaining the tracker logs; (b) using scripts to gather information from both torrent websites and directly from peers; (c) analyzing packet traces collected at Internet access links; (d) joining an ongoing torrent with a modified client configured to collect event logs; and (e) conducting experiments on network testbeds, like PlanetLab, or user-constructed networks of PCs.

Among these methods, two are mainly used in PT measurements, as shown in Fig. 1:


Fig. 1 Crawling methods used in PT site measurement

(1) Active method. The crawler pretends to be a BitTorrent client. It repeatedly requests the tracker for lists of peers participating in the torrent and then repeatedly contacts every peer discovered to collect information. However, the limitation of this method is that user identification is imprecise, allowing only heuristic-based identification of peers at a “torrent level” [7];

(2) Passive method. The crawler collects the web pages provided by each PT site. Since all communication between client and tracker is account-based, the periodic reports that BitTorrent clients send about their users are recorded by the tracker, and then processed and displayed on the web site. By crawling the web pages, researchers are able to trace the seeding/leeching behaviors of users at a community level [1].

Although PT sites are able to track all reports from users’ clients, a user may have multiple clients. The information that the site provides in its HTML pages is sometimes restricted to certain members, and little information is published publicly. For example, “bt.neu6.edu.cn”, an IPv6-only Private BitTorrent community that targets CERNET users, provides only the names of the users that have joined the swarm and no further information about the peers. However, many NexusPHP-based PT sites, like ChinaHDTV, KMGTP, and SJTU, provide multi-level information to high-level users. NexusPHP is an open-source private tracker implementation [15].

3 Data Source Analysis

Understanding the data source is the basic premise of crawling and analyzing the collected data. Accordingly, we need to know the meaning of the data and the behavioral characteristics of the tracker. That is, we need to know how the data reports in the PT community are processed and displayed by the front-end web server. However, different PT sites are coded differently, and each PT site has its own implementation or modification of the open-source version on which it is based, which may cause dissimilarity in the fields, in the update frequency of the fields, and in the tracker behavior. Even more problematically, different PT sites employ different incentive policies, which vary in both function and implementation. As a result, PT measurement is complex and prone to bias.

3.1 The Behavior of the Private Tracker

The huge diversity in both PT implementations and the individual modifications of each site results in the following problems, which make the measurement of PT sites a rather complex, tricky, and error-prone process:

– The behavior regulation of trackers. Many new BitTorrent Enhancement Proposals, like partial seeds [5] and IPv6 [6], have been published in the past few years [9]. However, different private tracker and client implementations support them to different degrees. Limited exploration and understanding of the tracker’s behaviors can easily result in a noticeable bias in measurement. For example, our research shows that 52% of the torrents have at least one “partial seeding peer,” and 36% of leeching peers are actually “partial seeding peers.” Mistaking a “partial seeder” for a “leecher” may cause incorrect results.

– The ambiguous processing of the reports that the tracker records from BitTorrent clients and that are, in turn, displayed on the front-end websites. When the tracker receives a report, it first performs validation and anti-cheat checks. If the report passes, the tracker records it and uses the upload and download volumes reported by the client to update the user information. Depending on the torrent promotion attributes and user types, the download volume recorded may be lower than that actually reported (e.g., if the torrent is currently 30%-leech, then 30% of the download volume that the user produced since the last report is recorded in the user info). Some information displayed on the websites is updated immediately after a user’s report is received, while other information is updated periodically at different intervals. For this reason, the semantics of data fields and their update frequency are ambiguous. Our research shows that revealing this information enables us to adjust the frequency and the sequence of requests to decrease the amount of missed data and, therefore, to infer implicit facts.

– Rich and diverse incentive rules among different PT sites. Besides Share Ratio Enforcement, new incentive mechanisms are introduced frequently in many PT sites. Thus, users’ behaviors are affected not only by objective factors, like torrent age/size/type and the number of seeders/leechers in a torrent, but also by these incentive rules, as shown in Fig. 9. Our research reveals that a clear understanding of the effects of the diverse incentive rules and objective factors enables us to design more accurate methods to analyze and process the crawled data sets.
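To see why mistaking partial seeders for leechers matters, consider how the seeder to leecher ratio of a single torrent changes once they are reclassified. The sketch below is illustrative only: the text does not spell out how the corrected ratio is computed, so the formula (moving partial seeders from the leecher count to the seeder count), the function name, and the sample numbers are all assumptions.

```python
def seeder_leecher_ratios(seeders, listed_leechers, partial_seeders):
    """Return (reported SLR, corrected SLR) for one torrent.

    Assumption: the site lists partial seeders among the leechers, and the
    corrected ratio counts them as seeders instead.
    """
    if not 0 <= partial_seeders <= listed_leechers:
        raise ValueError("partial seeders must be a subset of listed leechers")
    true_leechers = listed_leechers - partial_seeders
    if listed_leechers == 0 or true_leechers == 0:
        raise ValueError("ratio undefined without leechers")
    reported = seeders / listed_leechers
    corrected = (seeders + partial_seeders) / true_leechers
    return reported, corrected


# Hypothetical torrent: 10 seeders, 5 listed leechers, 1 of them partial.
reported, corrected = seeder_leecher_ratios(10, 5, 1)
print(reported, corrected)  # 2.0 2.75
```

Under these assumed numbers the reported ratio is roughly 27% below the corrected one, which shows how a single misclassified peer can shift the metric for a small swarm.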

3.2 The Behavior of Front-end Web Sites

There are many types of information associated with a private torrent website. Before crawling, knowing the meaning and update frequency of each data field is imperative. This drives us to obtain the real meaning of each field displayed on torrent-discovery sites, and each field’s update frequency, to limit the loss of information during crawling. We use two methods to achieve this goal:

1. Since many PT sites are based on open-source tracker implementations (like TBSource [19], TBDEV.NET [18], NexusPHP [15], etc.), by analyzing the source code we are able to identify the basic process flow and the semantics of many fields. For example, in NexusPHP-based PT sites, the user’s last-active time in the user’s profile is the last time that the user browsed the front-end web site, not the time of the last report to the tracker. In contrast, in the torrent profile pages, the last-active time refers to the time that the tracker last received a report from the peer. The last-active time in the user’s profile is updated immediately when the user produces a new web request, while the seeding/leeching time in the user’s profile is not updated at the same time. Having revealed this discrepancy, we observe that we cannot use the seeding/leeching time in the user’s profile to collect current information.

2. Although many PT sites are open-source based, each PT site updates and modifies the source code periodically. This makes the behaviors of the tracker and of the front-end web sites change from time to time. We use BitTorrent clients to join real swarms, record the communication between the BitTorrent client and the tracker with Wireshark [20], and simultaneously check when and how the data fields change on the web site. To ensure consistent and accurate data, we use two accounts for each PT site: one that takes part in the swarm and another for observation. For example, if we want to ascertain whether the user level affects users’ seeding behaviors, we have to consider that the user level displayed in the user’s info may not be current, since it may take tens of days before this info is updated. Thus, we may still need to retrieve the user-level info from the peer list when crawling.

4 Crawler Design

In Sect. 3, we explained how data is produced by analyzing the behavior regulation of the front-end website and the tracker and by understanding the meaning of the various data fields. In this section, we optimize our crawling to limit data loss.

4.1 Choosing the Frequency of the Crawling Request

Our aim is to get all reports from peers in a certain period through crawling several snapshots of all the online clients of all the torrents on a site. To do this effectively, we have to choose an appropriate crawling frequency. To avoid missing reports from BitTorrent clients, we have two main considerations. First, our crawling interval should be less than or equal to the report interval of clients, as shown in Fig. 3. Second, we need periodic crawling. However, since a snapshot of a site’s peer lists consists of tens of thousands of HTTP requests to the community’s web server, the frequency of data collection must be moderated to minimize the crawling overhead, which depends on the following:

– Payload of the server: Crawling more frequently than the report interval will cause congestion, thus we should limit our crawling accordingly.

– Anti-DoS mechanism of the server: In some sites, like CHDBits, an IP address that accesses the server too frequently is banned. In this situation, we need a distributed crawler using enough HTTP proxies with independent IPs, and we need to limit our visit frequency to bypass this mechanism.

– Unessential requests: Since the torrent ID is incremental and not reusable, by storing the IDs of deleted torrents we avoid crawling their peer lists.

– Incomplete HTML pages and HTTP errors, like 503, are common: We need to occasionally retry crawling these pages to decrease the amount of missed data, preferably using a different HTTP proxy each time.

In summary, knowing the update frequency of the data sets is a good way to avoid data loss. If a client quits normally, it disappears immediately from the peer-list page on the site, but if the client leaves without any report, its information remains in the peer list of the website for more than 30 minutes. If we cannot get any further information, we need to reconstruct the time at which the peer left [1]. The longer the time between two snapshots, the more inaccuracies exist in our reconstruction. If the crawling period of a snapshot for all torrents is longer than the minimum report interval, it will be easier to trigger this mechanism, as continuous crawling is impossible.
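The requirement that the crawling interval not exceed the clients' report interval can be checked with a small simulation. This is a simplified sketch, not the crawler itself: it assumes each snapshot exposes only the most recent report of a peer, and the 30-minute report interval and the function name `observed_reports` are hypothetical.

```python
from bisect import bisect_right


def observed_reports(report_times, snapshot_times):
    """Return the reports that at least one snapshot can see.

    Assumption: a snapshot shows only the latest report made at or before
    the snapshot time, so earlier reports are overwritten and lost.
    """
    reports = sorted(report_times)
    seen = set()
    for snapshot in snapshot_times:
        index = bisect_right(reports, snapshot)
        if index > 0:
            seen.add(reports[index - 1])
    return sorted(seen)


# Times in minutes; the peer reports every 30 minutes (hypothetical value).
reports = range(0, 150, 30)

# Crawling every 30 minutes captures every report.
print(observed_reports(reports, range(10, 160, 30)))  # [0, 30, 60, 90, 120]

# Crawling every 60 minutes misses every second report.
print(observed_reports(reports, range(10, 160, 60)))  # [0, 60, 120]
```

A simulation of this kind can also be used to estimate how much data a given crawling budget would lose before committing to a long measurement.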

4.2 Adjust the Request Sequence of Crawling

Crawling data periodically is not enough to collect sufficient data for PT measurements. The sequence of the crawling requests should be taken into account as well. For example, to measure a user’s seeding disk space, we need to identify both (a) the torrents that each user is currently seeding and (b) the size of each torrent. Info (a) can be crawled either from the user torrents or from the peer lists. If a PT site has more users than torrents, we may choose the peer lists, as this produces fewer HTTP requests and thus shortens the crawling period and eases the payload on the server. Moreover, on NexusPHP-based sites we are able to get anonymous users’ torrent information from the user torrents pages¹, even though on these sites, as on others, we cannot get the seeding torrents of anonymous users from the peer lists for user-privilege reasons. Info (b) can be crawled from the torrent info. Since users and torrents in PT sites are added and deleted irregularly, if we miss items in the torrent info, the corresponding records in the peer list cannot be adequately analyzed.
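The dependency between the peer lists (a) and the torrent info (b) can be made concrete with a small join. The following is a minimal sketch with a hypothetical record layout and function name; it only illustrates that a peer-list record becomes unusable when the matching torrent size was never crawled.

```python
from collections import defaultdict


def seeding_disk_space(peer_records, torrent_sizes):
    """Sum the sizes of the torrents each user is seeding.

    peer_records: iterable of (user_id, torrent_id, is_seeder) tuples taken
    from the crawled peer lists (hypothetical layout).
    torrent_sizes: dict mapping torrent_id to size, from the torrent info.
    Returns (totals per user, records whose torrent size is unknown).
    """
    totals = defaultdict(int)
    unresolved = []
    for user_id, torrent_id, is_seeder in peer_records:
        if not is_seeder:
            continue
        size = torrent_sizes.get(torrent_id)
        if size is None:
            # The torrent info was missed or the torrent was deleted before
            # it could be crawled, so this record cannot be analyzed.
            unresolved.append((user_id, torrent_id))
        else:
            totals[user_id] += size
    return dict(totals), unresolved


records = [("u1", 1, True), ("u1", 2, True), ("u2", 1, False), ("u2", 3, True)]
sizes = {1: 700, 2: 300}  # sizes in MB; torrent 3 was never crawled
print(seeding_disk_space(records, sizes))
# ({'u1': 1000}, [('u2', 3)])
```

Tracking the unresolved records explicitly makes it possible to schedule a follow-up request for the missing torrent info within the same crawling round.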

4.3 Design Crawler Based on the Correlation of Data-sets

Some information is not explicitly provided; knowing the correlation between different data sets is therefore a good way to uncover implicit information by reordering the sequence of crawling requests. For example, the time when a peer left the swarm cannot be retrieved from the crawled peer-list info, as the peer disappears from the list. However, a peer’s seeding/leeching duration in each of the user’s torrents is shown in the complete download list in the user info pages. We have also verified that this seeding/leeching duration is updated immediately when the tracker receives a new report from a client. This information allows us to get the exact time at which a peer left the swarm. The seeding/leeching duration changes promptly when a new peer appears in the peer list of a torrent or an existing peer disappears from it.

¹ It is a bug in the NexusPHP code that the tracker does not check the privilege of a user in the user-torrents pages.

Fig. 2 Crawling sequence

As shown in Fig. 2, at the first crawling snapshot (S1), we find a new peer (P) in a torrent (TorrentA), and P’s last report time is $T_1$. The crawled seeding/leeching duration of TorrentA in P’s user torrents is $D_1$. At snapshot (S2), we find that P has disappeared from TorrentA’s peer list; we therefore crawl P’s user torrents and find that the seeding/leeching duration of TorrentA is $D_2$. We can then infer that P’s left time $= T_1 + (D_2 - D_1)$. By using the same methodology, we can find the time at which a peer switches from leecher to seeder. This approach combines crawling and web analysis, and adds automatic decision logic to the crawling code. It differs significantly from the traditional method, in which there is no real-time analysis and all analysis is done after crawling is complete. We cannot use the seeding/leeching time in the user info because these two values represent the total seeding/leeching time over all torrents of a user and are calculated hours after the actual data changes. If we cannot get the user’s torrents info for our analysis, we can instead use the methodology outlined by Andrade et al. [1], which is based on the assumption that peers are likely to stay in a swarm (torrent) until they finish downloading.
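The inference of a peer's leaving time from two snapshots amounts to one line of date arithmetic. The sketch below is illustrative: the function name, the parameter names, and the sample timestamps are hypothetical, and it assumes the crawled durations have already been parsed into `timedelta` values.

```python
from datetime import datetime, timedelta


def infer_left_time(last_report_at_s1, duration_at_s1, duration_at_s2):
    """Estimate when a peer left: last report time plus the extra duration.

    last_report_at_s1: the peer's last report time seen at snapshot S1.
    duration_at_s1, duration_at_s2: seeding/leeching duration of the torrent
    in the user torrents page at S1 and after the peer disappeared at S2.
    """
    if duration_at_s2 < duration_at_s1:
        raise ValueError("duration cannot decrease between snapshots")
    return last_report_at_s1 + (duration_at_s2 - duration_at_s1)


t1 = datetime(2012, 1, 1, 12, 0)
d1 = timedelta(hours=2)
d2 = timedelta(hours=3, minutes=30)
print(infer_left_time(t1, d1, d2))  # 2012-01-01 13:30:00
```

The same function can be reused for the leecher-to-seeder switch time by passing the leeching duration instead of the combined duration.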

To simplify the implementation of our crawler, each crawler corresponds to a certain type of data found in a PT site, and we re-run the crawler as an independent process in order to periodically crawl certain data types, like the peer list, as shown in Fig. 3. To ease the disk I/O load, we dump our crawled data at the end of crawling. As mentioned before, we need to check whether a new peer has joined or an existing peer has left. To this end, when we crawl the peer list of a torrent, we need the last report time of each peer in the torrent. We store this temporary information in the memory-based key-value database Redis [3] rather than extracting it from the last snapshot. In conclusion, knowing the correlation of data sets is a good way to uncover this information by reordering the crawl sequence according to our specific needs.
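Detecting joined and departed peers from a stored map of last report times can be sketched as follows. To keep the example self-contained, a plain Python dictionary stands in for the key-value store; the key layout `(torrent_id, peer_id)` and the function name are assumptions and do not reflect how the crawler actually organizes its Redis keys.

```python
def diff_snapshot(last_seen, torrent_id, current_peers):
    """Compare a freshly crawled peer list with the stored state.

    last_seen: dict mapping (torrent_id, peer_id) to the last report time;
    it stands in for the in-memory key-value store and is updated in place.
    current_peers: dict mapping peer_id to last report time for one torrent.
    Returns (peers that joined, peers that left) since the previous crawl.
    """
    known = {pid for (tid, pid) in last_seen if tid == torrent_id}
    current = set(current_peers)
    joined = sorted(current - known)
    left = sorted(known - current)
    for peer_id in left:
        # A departed peer would trigger a crawl of its user torrents page.
        del last_seen[(torrent_id, peer_id)]
    for peer_id, report_time in current_peers.items():
        last_seen[(torrent_id, peer_id)] = report_time
    return joined, left


store = {}
print(diff_snapshot(store, 7, {"a": 100, "b": 105}))  # (['a', 'b'], [])
print(diff_snapshot(store, 7, {"b": 135, "c": 130}))  # (['c'], ['a'])
```

Because only the latest report time per peer is kept, the stored state stays small even when the full snapshots are dumped to disk only at the end of a crawl.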


Fig. 3 Crawling period problem

5 Data Analysis

In this section, we analyze the trace data collected from four representative PT sites in China, namely ChinaHDTV, HDChina, KMGTP, and SJTU, in order to evaluate different issues related to the incentive policies and their interaction with the data analysis. The changes in the relationships between different incentive policies affect the accuracy of PT measurements.

5.1 Effects of Rules

5.1.1 Incentive Rules

As discussed earlier, different PT communities employ different incentive rules and/or ban policies in order to improve the download performance and availability of their content. This leads to high variability in measurement results, even when the measurement studies use the same underlying data source as we do. Studies [13] and [22] reported that 72% of users have a share-ratio SR > 1, while [8] noted that only 35% of users have SR > 1. However, their results are based on the data provided in the user info field. A more holistic analysis would take into account that the total uploaded and downloaded volume of a user can be obtained from either the user info data or the user torrents data. Figure 4 shows the cumulative distribution functions (CDFs) of users’ share-ratio obtained from these two data sets for SJTU. Sixty-eight percent of users have SR > 1 according to the user info, but this drops to only 26% when using the user torrents data, as detailed in Fig. 5 (we only show the part of the CDF graph for users with SR ≤ 2). The reason is that the information provided in the torrent info represents the “real” upload and download volume of the user, while the user info displays the “virtual” upload and download volume. The “virtual” volume is derived from the underlying “real” data by modifying it according to the other incentive rules used by the PT site, such as torrent promotion and the credit system. For example, for torrents marked as 2X, users are credited with twice their “real” upload volume.

Fig. 4 Users’ virtual share-ratio and real share-ratio

Fig. 5 Users’ upload/download scatters from two different sources
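To make the difference between the “real” and the “virtual” share-ratio concrete, the following sketch recomputes both from per-torrent records. The record layout and the `upload_factor`/`download_factor` fields are assumptions for illustration (a 2X torrent is modeled as `upload_factor = 2.0`); actual sites may apply promotions and credits differently.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TorrentRecord:
    uploaded: float               # "real" volume uploaded for this torrent
    downloaded: float             # "real" volume downloaded for this torrent
    upload_factor: float = 1.0    # hypothetical: 2.0 for a torrent marked 2X
    download_factor: float = 1.0  # hypothetical: 0.3 for a 30%-download torrent


def share_ratio(uploaded: float, downloaded: float) -> float:
    """Uploaded volume divided by downloaded volume."""
    return float("inf") if downloaded == 0 else uploaded / downloaded


def real_share_ratio(records: List[TorrentRecord]) -> float:
    """Share-ratio from the raw per-torrent volumes."""
    return share_ratio(sum(r.uploaded for r in records),
                       sum(r.downloaded for r in records))


def virtual_share_ratio(records: List[TorrentRecord]) -> float:
    """Share-ratio after applying the promotion factors to each torrent."""
    return share_ratio(sum(r.uploaded * r.upload_factor for r in records),
                       sum(r.downloaded * r.download_factor for r in records))
```

For a user with one 2X torrent (10 uploaded, 20 downloaded) and one ordinary torrent (6 uploaded, 4 downloaded), the real share-ratio is 16/24 and stays below 1, while the virtual one is 26/24 and exceeds 1. The same user therefore falls on different sides of the SR > 1 threshold depending on the data source.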

This explains the differing results of [13] and [22] on the one hand and [8] on the other. In the HDStar (http://hdstar.org/) and HDChina sites, which were measured in [13] and [22] respectively, users have a higher share-ratio than in the DIME site (http://www.dimeadozen.org/), measured in [8], because these sites employ other incentive rules, namely torrent promotion and a credit system, in addition to the share-ratio, while in the DIME site only the SRE is used.² In summary, we observe that the degree and number of incentive rules employed by different PT sites lead to different results. Accordingly, since the share-ratios crawled from the user info do not reflect the “real” upload and download volume of users in the site, it is much more instructive to present the “real” upload and download volume crawled from the torrent info alongside the “virtual” one crawled from the user info in order to accurately compare the performance and the efficiency of the incentive rules between different sites.

² In DIME, besides the share-ratio rule, the site employs a free-leeching attribute, which has a small impact on the “virtual” share-ratio because only a small number of torrents are classified as “free-leeching”.

5.1.2 Banned Rules

When measuring the activity of users in PT sites (independent of whether users of a site are inclined to keep seeding/leeching), studies use the CDF of the last active/online time of registered users [22] or the ratio of active users to total users [2]. However, as shown in Fig. 6, the last active time of users in different PT sites is highly variable. This can be explained by differences in the sites’ rules for banning or deleting users, especially in the duration and methods by which this is done. For example, within 100 days, the share of active users in SJTU is 21% higher than in KMGTP. The reason is that users in SJTU are automatically banned by a tracker program if they are inactive for more than 120 days, while in KMGTP users are banned manually and randomly by the administrator; users could be inactive for 300 days before they are removed. Thus, we should not use all registered users of a site as the main basis for comparing the level of activity of different sites.

This leads us to observe that the raw data cannot be directly trusted as a real reflection of users’ behavior, as there are many factors that should be carefully considered when conducting measurement studies. To support this observation, we plot the CDFs of monthly active users in the four sites in Fig. 7. Interestingly, the results are quite different from Fig. 6: all four PT sites have roughly the same ratio of active users.
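The effect of different ban rules on the last-active-time CDF can be illustrated with a small sketch. It is not the procedure used to produce Fig. 7; the `horizon` parameter, which restricts the population to users inactive for at most a common number of days, is a hypothetical normalization introduced only to show why the denominator matters.

```python
from typing import List, Optional


def last_active_cdf(days_inactive: List[float], at_days: float,
                    horizon: Optional[float] = None) -> float:
    """Fraction of users whose last activity was at most `at_days` ago.

    If `horizon` is given, only users inactive for at most `horizon` days
    are counted in the population, so that sites which delete inactive
    users at different times are compared over the same window.
    """
    population = [d for d in days_inactive if horizon is None or d <= horizon]
    if not population:
        return 0.0
    return sum(1 for d in population if d <= at_days) / len(population)
```

With made-up data, a site that removes users after 120 days (`[1, 10, 50, 110]`) shows 75% of users active within 100 days, while a site that keeps long-inactive accounts (`[1, 10, 50, 110, 200, 290]`) shows only 50%. Restricting both to `horizon=120` gives 75% for each, even though the activity of the remaining users is identical.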

Fig. 6 Users activity in PT sites: CDF of user’s last active time


Fig. 7 Users activity in PT sites: CDF of monthly active users

5.2 Effects of Protocols

As an application-layer protocol, the protocol between the BitTorrent client and tracker is still evolving, which may cause significant bias in the collected data, especially for the purposes of PT measurements.

5.2.1 Partial Seed

A partial seed is a peer that is incomplete but will not download any more of the files bundled into the torrent. This happens for multi-file torrents where users download only some of the files. For example, many film .torrent files contain a sample file so that users can take a look at the content before downloading the complete film, and some .torrent files contain URL link files, which users tend not to download. An extension for partial seeds has been proposed for BT systems so that trackers and clients are able to recognize them [5], but it is not widely used in practice because partial seeds have relatively little effect on the efficiency of the BitTorrent system. However, partial seeds are more common in PT sites, because the SRE makes typical user behavior differ from that in the BT system: PT members pay more attention to their share-ratio, as they should maintain it above 1.

To our knowledge, this is the first study that analyzes the partial seeder problem and its impact on measurement results for PT sites. Mistaking a “partial seeder” for a leecher results in incorrect numbers of seeders and leechers in a swarm.

We use all snapshots from 8pm to midnight on May 26th 2011 for all four PT sites. Considering the limited support for partial seeders in many PT trackers, the very first question we face is how to distinguish partial seeders from leechers. If there are seeders available in the swarm, the complete ratio that a leecher reports to the tracker will increase, while a partial seeder’s ratio will not. To make sure that the leeching peer has a chance to download the contents, we use the following standards to select the torrent set:

– at least one connectable seeding peer must be present at each snapshot during this period;

– at least one leeching peer must be present at one snapshot during this period.

To distinguish “partial seeders” from other leechers, we use the following standard:

– the partial seeder should be noted as “leeching” for at least one hour in this period;

– their complete-ratio does not change in this period; and

– their complete-ratio >0.

Table 3 Data sets for partial seeders

| Site | Torrents set | Leeching users | Leeching peers | Partial seeders | Torrents with partial seeder | Users acted as partial seeders | Crawling interval (min) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ChinaHDTV | 2,984 | 4,733 | 14,915 | 5,348 | 1,556 | 1,659 | 20 |
| KMGTP | 1,401 | 1,853 | 3,442 | 786 | 479 | 460 | 15 |
| SJTU | 2,460 | 2,572 | 4,772 | 2,142 | 1,190 | 1,226 | 30 |
| HDStar∗ | 2,952 | 3,853 | 9,124 | 1,446 | 917 | 827 | 30 |

∗ 11.9% of leeching peers in HDStar are anonymous; partial seeders among these anonymous peers are not included

Following this standard, we not only restrict the possibility of mistaking a leecher for a partial seeder, but also decrease the chance of taking a partial seeder for a leecher. We record 2,984 torrents in ChinaHDTV, as shown in Table 3, with 4,733 leeching users and 14,915 leeching peers taking part in these torrents. Interestingly, we find 5,348 partial seeders among the 14,915 leeching peers (35.8%). Fifty-two percent of the torrents have at least one “partial seeder”, and 35% of the leeching users act as a “partial-seeding peer” in at least one torrent.
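The three criteria for identifying a partial seeder can be written as a small predicate over a peer’s observations in the snapshot series. This is only a sketch under stated assumptions: the `Observation` tuple layout is hypothetical, and “leeching for at least one hour” is approximated by the time between the first and the last snapshot in which the peer is reported as leeching.

```python
from typing import List, Tuple

# One observation of a peer in a snapshot:
# (timestamp in minutes, reported state, complete-ratio in [0, 1])
Observation = Tuple[int, str, float]


def is_partial_seeder(observations: List[Observation],
                      min_leech_minutes: int = 60) -> bool:
    """Apply the three criteria to one peer's observations in one torrent."""
    leeching = sorted(o for o in observations if o[1] == "leeching")
    if not leeching:
        return False
    # Criterion 1: noted as "leeching" for at least one hour (approximation).
    observed_span = leeching[-1][0] - leeching[0][0]
    # Criterion 2: the complete-ratio never changes during the period.
    ratios = {ratio for _, _, ratio in leeching}
    # Criterion 3: the complete-ratio is greater than zero.
    return (observed_span >= min_leech_minutes
            and len(ratios) == 1
            and leeching[0][2] > 0)
```

A peer that stays at a complete-ratio of 0.4 across four snapshots taken 20 minutes apart is classified as a partial seeder, whereas a peer whose ratio grows from snapshot to snapshot, or one stuck at 0, is not.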

Obviously, we cannot count partial seeders as leechers for the purposes of data collection. However, we cannot simply classify them as seeders either, because they are missing some files in the torrent. This leads to an important underlying question: how do the “partial seeders” affect the analysis of PT sites?

In many BT or PT sites, and as outlined in existing papers [1,2,12,13], the Seeder to Leecher Ratio (SLR) is considered to be the key indicator of the download speed of users. In order to show the distribution of SLR in each swarm, we take a snapshot of the site “ChinaHDTV” and use the torrents that have at least one leecher (or partial seeder) in the swarm as the total set. Using this methodology, Fig. 8a shows the CDF of each torrent’s SLR. We take

$$SLR_{site} = \frac{\sum \text{Seeding Peers}}{\sum \text{Leeching Peers}}$$

as an example, as this is used in many PT sites as a summary statistic. Table 4 shows $SLR_{site}$ before and after removing partial seeders from a peer list snapshot for different sites. The relative difference between the two values ranges from 17 to 79%, indicating that the results are highly biased: after removing the “partial seeders,” the SLR value is much higher than indicated before. Thus, “Excess Supply” [12] or “Over Seed,” which means that seeders in the swarm provide more upload bandwidth than the leechers need, is actually more prevalent in PT than previously thought.


Fig. 8 Partial seeders in ChinaHDTV

Table 4 Partial seeder in a snapshot

| | ChinaHDTV | KMGTP | SJTU | HDStar |
| --- | --- | --- | --- | --- |
| Total seeding peer | 74,391 | 42,726 | 145,672 | 89,944 |
| Total leeching peer (including partial seeder) | 12,231 | 2,454 | 3,579 | 8,221 |
| Total leeching peer (without partial seeder) | 8,150 | 1,761 | 2,004 | 7,002 |
| Total seeding peer/total leeching peer (including partial seeder) | 6.1 | 17.4 | 40.7 | 10.9 |
| Total seeding peer/total leeching peer (without partial seeder) | 9.1 | 24.3 | 72.7 | 12.8 |
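The site-level ratios in Table 4 can be recomputed directly from the peer counts. The sketch below does so and also reports the relative increase in the site-level SLR once partial seeders are removed from the leeching peers; the function names are illustrative, and the interpretation of the bias as a relative increase is an assumption of this example.

```python
from typing import Dict, Tuple

# Peer counts copied from Table 4: (seeding peers,
# leeching peers incl. partial seeders, leeching peers excl. partial seeders)
TABLE_4: Dict[str, Tuple[int, int, int]] = {
    "ChinaHDTV": (74_391, 12_231, 8_150),
    "KMGTP": (42_726, 2_454, 1_761),
    "SJTU": (145_672, 3_579, 2_004),
    "HDStar": (89_944, 8_221, 7_002),
}


def slr_site(seeding_peers: int, leeching_peers: int) -> float:
    """Site-level SLR: total seeding peers over total leeching peers."""
    return seeding_peers / leeching_peers


def slr_shift(counts: Tuple[int, int, int]) -> Tuple[float, float, float]:
    """Return (SLR with partial seeders, SLR without, relative increase in %)."""
    seeding, leeching_all, leeching_true = counts
    before = slr_site(seeding, leeching_all)
    after = slr_site(seeding, leeching_true)
    return before, after, 100.0 * (after - before) / before
```

Under this reading, the relative increase is about 17% for HDStar and about 79% for SJTU, which matches the range quoted in the text.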

We define an active torrent as a torrent with at least one peer, following the methodology in [22,23]. Figure 8b, c show the distribution of leechers in each active torrent for the site “ChinaHDTV” before and after removing partial seeders. Note that after removing partial seeders, the number of active torrents with leechers decreases significantly in Fig. 8b. The ratio of torrents having few leechers increases dramatically, and the ratio of no-leecher torrents grows from 12 to 31%.


Fig. 9 Factors affecting user behavior

In summary, “partial seeders” dramatically affect the research results for PT sites. We have demonstrated that mistaking “partial seeders” for leechers leads to inaccurate SLR results for the different sites.

5.3 The Effects of Objective Factors

As mentioned earlier, different incentive rules have been introduced in PT communities to improve the performance and availability of the sites’ content. Figure 9 shows the factors affecting user behavior, including both objective factors (torrent age, popularity and size) and incentive rules.

We need to check whether these objective factors affect our results when we analyze the effectiveness of incentives. For example, when we check the effect of torrent promotion on user download behavior, we should consider the influence of torrent age and torrent size, taking into account that promotion attributes may be correlated with torrent age/size.

As shown in Fig. 10, grouping the downloaded torrents by their promotion attributes alone does not show the complete picture of the effectiveness of this rule. Even worse, it may lead to incorrect conclusions. For instance, Fig. 10 shows that 25% of the 30%-download-torrents (i.e., torrents for which only 30% of the actual volume is counted as download volume) have been completely downloaded more than 200 times, compared with only 20% of free torrents and 9% of 50%-download-torrents. This suggests the incorrect conclusion that users are more likely to download 30%-download-torrents than free torrents or 50%-download-torrents, even though the number of 50%-download-torrents is almost 6 times the number of 30%-download-torrents, as shown in Table 5.

Fig. 10 Completely downloaded torrents grouped by torrent attributes in HDStar

Table 5 Promotion torrents in HDStar

| Promotion attribute | No. of torrents |
| --- | --- |
| 30%-download-torrents | 756 |
| 50%-download-torrents | 4,320 |
| Free | 359 |

Using a snapshot taken on November 22nd 2011, when the total number of torrents was 15,326

Fig. 11 The distribution of torrent attributes grouped by torrents size and age in HDStar

Accordingly, in order to correctly take into account the torrent promotion policies, we need to show and discuss the distribution of the torrents’ attributes in relation to the torrents’ objective factors, including size and age. Figure 11 shows the distribution of torrent promotion attributes grouped by torrents’ size and age for the site “HDStar”. We note that the torrent promotion attributes are correlated with torrent age, as shown in Fig. 11a. This explains the previous results: the lower number of completely downloaded torrents among the 50%-download-torrents, in contrast to the 30%-download-torrents, arises because 95% of the 50%-download-torrents are older than 100 days, while only 11% of the 30%-download-torrents are older than 100 days. Since most free or 30% torrents are newly published while many 50% torrents are older, comparing the torrent sets grouped by promotion attribute directly yields incorrect results, because the differences are caused by torrent age rather than by the promotion attributes assigned to the torrents.
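One way to separate the effect of torrent age from the effect of the promotion attribute is to compare torrents only within the same age group. The sketch below illustrates this kind of stratified comparison; the `Torrent` record, the two age strata and the function name are assumptions for the example, while the thresholds (100 days, 200 completed downloads) are the ones quoted in the text.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Tuple


class Torrent(NamedTuple):
    promotion: str    # e.g. "30%", "50%" or "free"
    age_days: int     # days since the torrent was published
    completions: int  # number of completed downloads


def popular_share_by_stratum(torrents: Iterable[Torrent],
                             age_split: int = 100,
                             min_completions: int = 200
                             ) -> Dict[Tuple[str, str], float]:
    """Share of torrents completed more than `min_completions` times,
    computed separately for each (age group, promotion attribute) pair."""
    counts = defaultdict(lambda: [0, 0])  # stratum -> [popular, total]
    for t in torrents:
        age_group = "old" if t.age_days > age_split else "new"
        stratum = (age_group, t.promotion)
        counts[stratum][0] += int(t.completions > min_completions)
        counts[stratum][1] += 1
    return {s: popular / total for s, (popular, total) in counts.items()}
```

Comparing, for instance, the `("new", "30%")` and `("new", "50%")` entries contrasts promotion attributes among torrents of similar age, instead of mixing mostly new 30% torrents with mostly old 50% torrents.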

In summary, since user behavior results from the interaction between objective attributes and incentive rules, we should account for the effects of these objective attributes when analyzing the effects of incentive policies.

6 Related Work

Many studies have been dedicated to measuring public BitTorrent [4,16] and private BitTorrent systems [1,2,12–14,22]. However, most of these studies are based on data crawled from (1) the front-end torrent site, and (2) the tracker, using a modified BitTorrent client that requests peers from the tracker as a normal client and then performs a handshake with the peers in order to learn which pieces each peer reports to have [10,21]. Few studies have analyzed the existence of bias in public BitTorrent measurements [1,16,21]. To avoid this problem when studying torrents, Andrade et al. [1] applied the create-based method proposed by Roselli et al. [17], which allows obtaining an unbiased sample of torrents with a maximum duration of τ. However, to our knowledge, no one has yet analyzed the inaccuracies and biases present in private torrent measurements, which are caused by the relationships between the different incentive features employed by these systems.

7 Conclusions

Measuring BT systems has proven to be an efficient way to understand their properties and how they interact with users, resulting in conclusions that help to improve their usability. As BitTorrent communities are rapidly evolving towards a Private Torrent model with strong built-in incentive mechanisms, many studies have been dedicated to measuring and modeling PT systems for the same purpose. These studies have been applied to different PT sites, which differ in their system implementation and in their incentive policies. In addition, different studies use different data collection and crawling methodologies. Therefore, current measurements cannot reflect accurate results, and bias is prevalent.

In this paper, we investigate the accuracy of PT site measurements and emphasize the effect that the employed incentive rules have on conclusions. We further document the relationship between the incentive rules and objective factors, such as torrent age and size. Accordingly, we have designed a new crawling methodology and conducted large-scale data collection across four representative PT sites over a year. Our results show that ignoring the incentive rules and their interplay may lead to inaccurate conclusions about system properties. This study aims to offer fundamental insights into designing an accurate and general methodology for measuring PT sites.


References

1. Andrade, N., Santos-Neto, E., Brasileiro, F., Ripeanu, M.: Resource demand and supply in bittorrent content-sharing communities. Comput. Netw. 53(4), 515–527 (2009)
2. Chen, X., Jiang, Y., Chu, X.: Measurements, analysis and modeling of private trackers. In: Proceedings of the 10th International Conference on Peer-to-Peer Computing (P2P’10), pp. 1–10. IEEE (2010)
3. Citrusbyte: Redis. http://redis.io/ (2011)
4. Cuevas, R., Laoutaris, N., Yang, X., Siganos, G., Rodriguez, P.: Deep diving into bittorrent locality. In: Proceedings of the ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS’10), pp. 349–350. ACM, New York, NY, USA (2010)
5. Extension for Partial Seeds. http://www.bittorrent.org/beps/bep_0021.html (2008)
6. Greg Hazel, A.N.: IPv6 Tracker Extension. http://www.bittorrent.org/beps/bep_0007.html (2008)
7. Guo, L., Chen, S., Xiao, Z., Tan, E., Ding, X., Zhang, X.: Measurements, analysis, and modeling of bittorrent-like systems. In: Proceedings of the 5th ACM SIGCOMM conference on Internet Measurement, pp. 4–18. USENIX Association (2005)
8. Hales, D., Rahman, R., Zhang, B., Meulpolder, M., Pouwelse, J.: Bittorrent or bitcrunch: Evidence of a credit squeeze in bittorrent? In: Proceedings of the 18th International Workshops on Enabling Technologies: Infrastructures for Collaborative Enterprises (WETICE ’09), pp. 99–104. IEEE (2009)
9. Harrison, D.: Index of BitTorrent Enhancement Proposals. http://www.bittorrent.org/beps/bep_0000.html (2009)
10. Iosup, A., Garbacki, P., Pouwelse, J., Epema, D.: Correlating topology and path characteristics of overlay networks and the internet. In: Proceedings of the 6th IEEE International Symposium on Cluster Computing and the Grid (CCGRID’06), pp. 10–17. IEEE Computer Society, Washington, DC, USA (2006)
11. Internet Study 2008/2009. http://www.ipoque.com/sites/default/files/mediafiles/documents/internet-study-2008-2009.pdf (2009)
12. Kash, I.A., Lai, J.K., Zhang, H., Zohar, A.: Economics of bittorrent communities. In: Proceedings of the 21st International Conference on World Wide Web (WWW’12), pp. 221–230. ACM, New York, NY, USA (2012)
13. Liu, Z., Dhungel, P., Wu, D., Zhang, C., Ross, K.W.: Understanding and improving ratio incentives in private communities. In: Proceedings of the 30th International Conference on Distributed Computing Systems (ICDCS’10), pp. 610–621. IEEE Computer Society, Washington, DC, USA (2010)
14. Meulpolder, M., D’Acunto, L., Capota, M., Wojciechowski, M., Pouwelse, J.A., Epema, D.H.J., Sips, H.J.: Public and private bittorrent communities: a measurement study. In: Proceedings of the 9th International Conference on Peer-to-peer systems (IPTPS’10), pp. 10–10. USENIX Association, Berkeley, CA, USA (2010)
15. NexusPHP. http://sourceforge.net/projects/nexusphp/ (2010)
16. Otto, J.S., Sánchez, M.A., Choffnes, D.R., Bustamante, F.E., Siganos, G.: On blind mice and the elephant: Understanding the network impact of a large distributed system. SIGCOMM Comput. Commun. Rev. 41(4), 110–121 (2011)
17. Roselli, D., Lorch, J., Anderson, T.: A comparison of file system workloads. In: Proceedings of the Annual Conference on USENIX Annual Technical Conference, pp. 4–4. USENIX Association (2000)
18. Tbdev.net. http://sourceforge.net/projects/tbdevnet/ (2010)
19. TBsource PHP/MySql BitTorrent Tracker. http://sourceforge.net/projects/tbsource/ (2010)
20. WireShark. http://www.wireshark.org/ (2011)
21. Zhang, B., Iosup, A., Pouwelse, J., Epema, D., Sips, H.: Sampling bias in bittorrent measurements. In: Proceedings of the 16th International Euro-Par Conference on Parallel processing: Part I (EuroPar’10), pp. 484–496. Springer, Berlin, Heidelberg (2010)
22. Zhang, C., Dhungel, P., Wu, D., Liu, Z., Ross, K.W.: Bittorrent darknets. In: Proceedings of the 29th Conference on Information Communications (INFOCOM’10), pp. 1460–1468. IEEE Press, Piscataway, NJ, USA (2010)
23. Zhang, C., Dhungel, P., Wu, D., Ross, K.W.: Unraveling the bittorrent ecosystem. IEEE Trans. Parallel Distrib. Syst. (TPDS) 22(7), 1164–1177 (2011)
