
Measuring Sustainability and Adoption Trends of Open Source Web Browsers

Olga Baysal and Michael W. Godfrey
School of Computer Science
University of Waterloo
Waterloo, Canada
{obaysal, migod}@uwaterloo.ca

Abstract. Web-based applications typically collect enormous archives of usage data (including profiles of users, their usage environment, locality, and their browsing behaviour) that are often disregarded or unused by the development team. This paper explores whether such usage data can be unified with development information, such as the release history, to assess the sustainability of a software project. In particular, we have combined usage data culled from web server logs with product release histories to study the history and sustainability of two open source web browsers: Firefox and Chrome. Our findings suggest that Firefox fosters better hardware sustainability, while Chrome facilitates greater geographic and ethnic diversity among its users. We detected no evidence of age-specific differences in navigational behaviour among Chrome and Firefox users; however, we hypothesize that younger users are more likely to have more up-to-date versions than more mature users.

Keywords-Release history, user adoption, web usage mining, software sustainability

I. INTRODUCTION

With the continued growth of web services, the volume of user data collected by organizations has grown enormously. Analyzing such data can help software projects determine the values of users, evaluate the success of the product, design marketing strategies, etc. Such analyses involve searching for meaningful patterns in a large collection of web server access logs.

In our previous work on comparing two open source browsers, Chrome and Firefox, two distinct profiles emerged: Firefox, as the older and established system, with long product version cycles but short bug fix cycles; and Chrome, as the new and fast evolving system, with short version cycles and longer bug fix cycles [1]. These findings encouraged us to further study the user community and adoption of these two browsers, and to investigate whether knowledge about users and their browsing behaviour, together with the release characteristics, can provide insights into the sustainability of a browser. Understanding how users adopt and use web browsers is important for a number of reasons. First, studying how users adopt software systems leads to an evaluation of currently popular web browsers and to recommendations on better design of next-generation web browsers that are more sustainable. Second, understanding key characteristics of the user population is crucial to measuring the success of a product and to performing target market analysis. We note that we do not aim at providing web analytics and statistics on the obtained usage data, as there is a plethora of tools available for such purposes. Rather, we apply web usage mining techniques to search for meaningful patterns, and perform statistical analysis on the results. The main goal of our research is to investigate whether available usage data combined with development information can provide enough evidence on how users adopt a software system, determine key characteristics of the user adoption, and whether these characteristics could provide insights into the sustainability of the product.

This paper addresses a number of research questions:

Q1: Are there differences in platform preferences between end-users of the browsers?

Q2: Is there a difference in geographic distribution between user populations?

Q3: Is there a difference in navigational behaviour between two user groups?

Q4: Can the combination of usage data with development information provide insight into the sustainability of the browser? And if so, which is more sustainable?

Our study reveals several notable differences in user populations, and in their adoption and use of open source web browsers. Chrome undergoes continual and regular updates and has short release cycles, while Firefox is more traditional in delivering major updates, yet provides support for more and older platforms, which in turn fosters hardware sustainability. Our data suggests that Firefox users are primarily centred in North America, while Chrome users are better distributed across the globe, and thus Chrome supports better geographic and ethnic diversity among users. We detected no evidence of age-specific differences in navigational behaviour among Chrome and Firefox users. However, we hypothesize that a younger population of users is more likely to have more up-to-date versions of a web browser than more mature users.

Our work makes several contributions. First, by mining web usage data we define several characteristics of the user population for empirical evaluation. Second, we analyze the usage patterns and highlight the main differences in how the browsers provide OS support to end-users, appeal to users across the globe, and exhibit age-specific differences among their users in the adoption of new releases. Third, we demonstrate how characteristics of user population and adoption, together with the development patterns, can provide insight into the nature of the sustainable practices of a software project. Our findings may also help improve user experience. First, development team members may consider our findings to target a wider user population. Second, our findings have implications for a more sustainable design of a web browser that appeals to a wider population across the globe, supports older platforms and a variety of platforms, and reduces age-specific usability issues. Finally, our work might facilitate further research on user adoption and acceptance of software products.

The rest of the paper is organized as follows. Section II summarizes prior work. Section III provides background information on mining web usage data. Section IV describes the setup of our study. Section V presents the results of the empirical study, and Section VI discusses our findings on adoption trends and behavioural characteristics of users, and also addresses threats to validity. Finally, in Section VII we summarize our main findings.

II. RELATED WORK

The most relevant related work is the research on mining usage data and measuring software sustainability.

Mining Web Usage Data Web usage mining applies data mining techniques to discover usage patterns in web data. Web usage mining research provides a number of taxonomies summarizing existing research efforts in the area, as well as commercial offerings of a variety of tools [2], [3].

Mobasher [4] discussed web usage mining, including sources and types of data such as web server application logs. He indicated that there are four primary groups of data sources: usage, content, structure, and user data. He discussed the key elements of web usage data pre-processing, whose high-level tasks include the integration of click-stream data with other sources such as content or semantic information, as well as user and product information from operational databases.

Empirical software engineering research has focused on mining software development data (source code, electronic communication, defect data, requirements documentation, etc.). Relatively little work has been done on mining usage data. El-Ramly and Stroulia [5] mined software usage data to support re-engineering and program comprehension. They studied system-user interaction data that contained temporal sequences of events that took place when users were interacting with the system. They developed a process for mining interaction patterns and applied it to legacy and web-based systems. The discovered patterns were used for user interface re-engineering and personalization. Li et al. [6] investigated how usage characteristics relate to field quality and how usage characteristics differ between beta and post-releases. They analyzed anonymous failure and usage data from millions of pre-release and post-release Windows® machines.

In our previous work we examined development artifacts (release histories, bug reporting and fixing data) as well as usage data of the Firefox and Chrome web browsers. In that study, two distinct profiles emerged: Firefox, as the older and established system, with long product version cycles but short bug fix cycles, and a user base that is slow to adopt newer versions; and Chrome, as the new and fast evolving system, with short version cycles, longer bug fix cycles, and a user base that very quickly adopts new versions as they become available (due largely to Chrome's mandatory automatic updates). When analyzing the usage data, we focused only on the difference in adoption trends and on whether the volume of defects affects the popularity of a browser. Figure 1 depicts the observed trends in user adoption of the two browsers. In this paper, we take a more detailed look at the usage data by studying characteristics of the user populations of the browsers.

[Figure 1 plots: two panels, Chrome (top, releases c02 to c70) and Firefox (bottom, releases f08 to f36); x-axis: time, February 2007 to September 2010; y-axis: number of hits, scale 1:1,000.]

Figure 1. User adoption trends for Chrome (top) and Firefox (bottom).

Google Research performed a study comparing the update mechanisms of web browsers [7]. Their work investigates the effectiveness of web browser update mechanisms in securing end-users from various vulnerabilities. They performed a global-scale measurement of update effectiveness, comparing the update strategies of Google Chrome, Mozilla Firefox, Opera, Apple Safari, and MS Internet Explorer. By tracking the usage shares over three weeks after a new release, they determined how fast users update to the latest version and compared the update performance between different releases of the same and other browsers. They applied a similar approach of parsing the user-agent string to determine the browser's name and version number. They evaluated the approach on data obtained from Google web servers distributed all over the world. Unlike the Google study, which investigates updates within the same major version of various web browsers, we studied major releases of the web browsers. We realize that our data set is several orders of magnitude smaller than the Google data. However, we address different research questions related to the characteristics of the user population.

Software Sustainability The research community has yet to agree on a common definition of a sustainable software system and to define metrics to measure the sustainability of a software system. Seacord et al. [8] define software sustainability as "the ability of a sustainment team to modify a software system based on customer needs and deploy these modifications". In their work, they addressed limitations of existing sustainability measures, introduced new measures for sustainability assessment, and provided an analysis of such measures. However, their definition of software sustainability refers to the maintenance activity of software development rather than to the broader notion we consider here.

The work of Albertao et al. [9] introduces a method to monitor the sustainability of software projects by measuring a set of metrics over several releases of a software product. They selected and organized a set of software engineering metrics that can be used to assess the economic, social and environmental sustainability of software projects. They applied these metrics to assess the sustainability of a real software project. In our work, we do not try to define metrics for measuring sustainability. We are interested in exploring whether user characteristics, together with characteristics of software development (e.g., release delivery, release lifespan), can help to reason about the sustainability of a software system.

III. BACKGROUND ON WEB USAGE MINING

This section describes usage log data, provides a sample of the web server log, and explains the process of web usage mining.

A. Web server logs

Web server logs are automatically generated by web servers, such as Apache. These logs contain detailed information about the browsing behaviour of visitors to a website. Each HTTP request to the server, called a hit, is recorded in the server access log. Each log record may contain the following fields: the IP address of the client (remote host) that made the request to the server, the time and date of the request, the requested resource, the status of the request, the HTTP method used, the size of the object returned to the client, the referring web resource, and the user-agent of the client. Log files can be stored in various formats such as the common log, extended log, or combined log formats. An example of the combined log format obtained from the www.cs.uwaterloo.ca server is given in Figure 2. IP addresses of the visitors have been changed to protect their privacy. The user-agent field

10.0.0.1 - - [20/Oct/2008:23:05:24 -0400] "GET /undergrad/handbook/courses/waitlist/index.shtml HTTP/1.1" 301 368 "http://www.cs.uwaterloo.ca/current/" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_5; en-us) AppleWebKit/525.18 (KHTML, like Gecko) Version/3.1.2 Safari/525.20.1"

10.0.0.2 - - [26/Oct/2008:16:47:49 -0400] "GET /~fwtompa/.papers/xmldb-desiderata.pdf HTTP/1.1" 301 365 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"

10.0.0.3 - - [17/Nov/2008:07:17:27 -0500] "GET /Prospective/what_is_se.htm HTTP/1.1" 200 18721 "http://www.google.ba/search?hl=bs&client=firefox-a&rls=org.mozilla%3Aen-US%3Aofficial&hs=nZi&q=software+engineering+vs+information+systems&btnG=Tra%C5%BEi&meta=" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.4) Gecko/2008102920 Firefox/3.0.4"

10.0.0.4 - - [26/Nov/2008:20:01:34 -0500] "GET /images/frontlogo_1.jpg HTTP/1.1" 200 8923 "http://www.softeng.uwaterloo.ca/" "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/525.19 (KHTML, like Gecko) Chrome/0.4.154.25 Safari/525.19"

10.0.0.5 - - [06/Dec/2008:08:48:04 -0500] "GET /swag.css HTTP/1.1" 200 2013 "http://www.swag.uwaterloo.ca/tools.html" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.3) Gecko/2008100320 GranParadiso/3.0.3"

Figure 2. An example of server access log

identifies the client's browser type and version, as well as information on the operating system used.
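The fields listed above can be pulled out of a combined-format record with a regular expression. The following is a minimal sketch, not the tooling used in this study: the pattern names, the `parse_line` and `detect_browser` helpers, and the short browser list are assumptions made for illustration. Chrome is tested before Safari because Chrome user-agent strings also contain the token "Safari".

```python
import re

# Combined log format: host, identity, user, [time], "request",
# status, size, "referrer", "user-agent".
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<resource>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Illustrative, incomplete list. Order matters: Chrome before Safari.
BROWSER_PATTERNS = [
    ("Chrome", re.compile(r"Chrome/(\d+\.\d+)")),
    ("Firefox", re.compile(r"Firefox/(\d+\.\d+)")),
    ("MSIE", re.compile(r"MSIE (\d+\.\d+)")),
    ("Safari", re.compile(r"Version/(\d+\.\d+).*Safari/")),
]


def parse_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def detect_browser(agent):
    """Return (browser name, major.minor version) from a user-agent string."""
    for name, pattern in BROWSER_PATTERNS:
        found = pattern.search(agent)
        if found:
            return name, found.group(1)
    return "Other", None


# One of the records shown in Figure 2.
sample = (
    '10.0.0.4 - - [26/Nov/2008:20:01:34 -0500] '
    '"GET /images/frontlogo_1.jpg HTTP/1.1" 200 8923 '
    '"http://www.softeng.uwaterloo.ca/" '
    '"Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/525.19 '
    '(KHTML, like Gecko) Chrome/0.4.154.25 Safari/525.19"'
)
record = parse_line(sample)
browser = detect_browser(record["agent"])
```

Records that do not fit the pattern yield `None`, so malformed lines can be counted and set aside rather than silently dropped.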

B. Web usage mining

Organizations that provide web-based information or services collect and store high volumes of user data. Analyzing such data can help organizations to better understand users and their values, develop better marketing strategies, optimize the structure of their web space, etc. Such analysis involves searching for interesting patterns in a large collection of user data. User or usage data is gathered automatically by web servers and stored in web server access logs. A web server log is one of the sources for performing web usage mining since it explicitly records the browsing behaviour of the site visitors [2]. Web usage mining is the process of extracting useful information from web server logs [2], [3], [4]. The main goal is to analyze the behavioural patterns and profiles of users visiting a web site. Mining users' browsing history provides insights into what users are looking for on the web site. A typical web usage mining process consists of three stages: data collection and pre-processing, pattern discovery, and pattern analysis [2]. Figure 3 presents the architecture of a web usage mining process adapted from [2].

[Figure 3 diagram: Web server logs, Pre-processing, Database, Pattern discovery, Pattern analysis, Interesting Patterns and Statistics.]

Figure 3. High level architecture of web usage mining.

In the pre-processing stage, the data is cleaned and divided into transactions representing the activities of each user during different visits. Depending on the analysis, the usage data is then transformed and aggregated at different levels of abstraction (users, sessions, click-streams, or page views).

During the pattern discovery stage, various operational methods are applied to uncover hidden patterns reflecting the behaviour of users. The most commonly used methods are descriptive statistical analysis, association rule mining, clustering, classification, sequential pattern analysis, dependency modelling [2], and prediction. These techniques are typically used for personalization, marketing intelligence, system improvements such as web caching and network traffic improvements, and site modification. Statistical analysis provides statistical measures to organize and summarize information. Association rule mining concerns the discovery of relationships between items in a transaction. Clustering is an unsupervised grouping of objects, while classification is a supervised grouping. In web mining, the objects can be users, pages, sessions, events, etc. Sequential pattern analysis is similar to association rule mining, but it also considers the sequence of events. The fact that page A is requested before page B is an example of a discovered pattern. All these techniques were designed for knowledge discovery from very large databases of numerical data and were adapted for web mining with relative success.

In the final stage, pattern analysis, the discovered patterns and statistics are further processed and used as input to applications such as recommender systems, report generation tools, visualization tools, etc.

IV. SETUP OF THE CASE STUDY

This paper presents an empirical study that extracts knowledge about the user populations and adoption of two popular open source web browsers, and explores whether this knowledge can be used to evaluate the sustainable practices of software projects. The study considers the release histories of the browsers and the web usage data. This section explains these data.

A. Release History

Mozilla Firefox is an older web browser, originally released in November 2004 as the successor of the Mozilla project. Firefox's codebase has a long and rich history going back to Netscape, the second historically important web browser after Mosaic. In 1998, the Netscape codebase was branched into the open source Mozilla project suite, which included the browser, an email client, and an HTML editor. In 2004, the browser and email clients were decoupled into separate projects (Firefox and Thunderbird, respectively), and the Firefox browser was officially born.

Google Chrome is a younger web browser that was first released in September 2008 as a beta version, with the first official stable release being deployed in December 2008. Chrome is based on the WebKit layout engine, which is also used by Apple's Safari browser, whereas Firefox uses its own rendering engine, called Gecko, that has since been adopted by several other projects. Strictly speaking, Chrome is not open source, but its core base, a project called Chromium, is.

[Figure 4 plot: x-axis: release (b3, b2, b1, r1 to r7); y-axis: lifespan in days.]

Figure 4. Lifespan of major releases for Chrome and Firefox. The difference in release delivery is statistically significant (p-value<0.005).

The release history consists of 8 major releases for Firefox [10] and 10 major releases for Chrome, starting with version 0.2 [11]. In this paper, we use the following labels when comparing releases of the browsers: b3 refers to the Chrome 0.2 release; b2 to Chrome 0.3 and Firefox 0.8; b1 to Chrome 0.4 and Firefox 0.9; r1 to Chrome 1.0 and Firefox 1.0; r2 to Chrome 2.0 and Firefox 1.5; r3 to Chrome 3.0 and Firefox 2.0; r4 to Chrome 4.0 and Firefox 3.0; r5 to Chrome 5.0 and Firefox 3.5; r6 to Chrome 6.0 and Firefox 3.6; and r7 to Chrome 7.0. On average, a new release of the Firefox browser is launched every 10 months, while a new version of the Chrome browser is released every 2.5 months (see Figure 4). The difference in release delivery of the browsers is statistically significant (p-value<0.005).

B. Web Traffic Data

Web server logs were obtained from the School of Computer Science, University of Waterloo (http://www.cs.uwaterloo.ca). The web server logs gather data in the combined log format, including information about the user agent and the referrer. The obtained log spans 2007 to 2010. It is 36 GB in size and consists of 349,738,887 entries in its raw format. After the pre-processing phase, the usage data contains 331,677,001 entries.

In this work, we evaluate accesses to the web resources by considering page accesses or "hits" (Section VI-A explains our decision). A typical data cleaning process eliminates all the log entries generated by web agents such as web crawlers, spiders, robots, indexers, or other intelligent agents that pre-fetch pages for caching purposes. We did not remove the requests from such user agents but marked them as "robots" during the transformation phase, because we were interested in comparing the traffic generated by the automated agents with the rest of the traffic. We also restructured the date and time fields of the log entry to the [year-month-day hour:minute:second] format. After the cleaning and transformation stages, the web log is loaded into a relational database.
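The two transformation steps described above, labelling automated agents and restructuring the timestamp, can be sketched as follows. This is only an illustration under stated assumptions: the paper does not say how robots were recognized, so the keyword list and the `label_agent` and `reformat_timestamp` helpers are hypothetical.

```python
from datetime import datetime

# Hypothetical keyword list; the study's actual robot detection is not described.
ROBOT_KEYWORDS = ("bot", "crawler", "spider", "slurp", "indexer")


def reformat_timestamp(raw):
    """Convert '20/Oct/2008:23:05:24 -0400' to '2008-10-20 23:05:24'.

    The UTC offset is parsed and then dropped, so the result is the
    server's local wall-clock time. Month names assume an English locale.
    """
    parsed = datetime.strptime(raw, "%d/%b/%Y:%H:%M:%S %z")
    return parsed.strftime("%Y-%m-%d %H:%M:%S")


def label_agent(user_agent):
    """Mark a request as 'robot' or 'user' instead of discarding it."""
    lowered = user_agent.lower()
    return "robot" if any(word in lowered for word in ROBOT_KEYWORDS) else "user"
```

Keeping the label as a column, rather than deleting robot rows, preserves the ability to compare automated traffic against the rest, which is the stated reason for not removing these entries.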

We examined the web log data to determine the browser type of our visitors. We analyzed the HTTP user agent strings that web browsers report when requesting a web page. We extracted the name of the browser from the user agent strings and calculated the number of accesses to our web site for each release of a browser. The share of the accesses for each web browser is shown in Figure 5. The left pie chart shows the percentage of the total volume that each browser makes up. To our surprise, Firefox dominates the web traffic with 31%, followed by Internet Explorer with 24%, while Chrome users contribute only 3%. To obtain a fairer picture of the web usage share, we eliminated the "Other" slice and normalized the cumulative access count per browser by its average market share: Chrome (10.70%), Firefox (20.16%), Microsoft's Internet Explorer or MSIE (52.37%) [12]. As can be observed from the right-hand pie chart in Figure 5, Firefox is the clearly preferred choice among the visitors to our web site.
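One plausible reading of this normalization is to divide each browser's traffic by its market share and rescale the resulting ratios so that they sum to one. The sketch below follows that reading; the formula and the `normalized_shares` helper are assumptions, since the exact computation is not spelled out. It uses the rounded percentages quoted in the text rather than raw access counts, so it only approximates the slices shown in Figure 5.

```python
def normalized_shares(traffic_pct, market_pct):
    """Divide observed traffic by market share, then rescale to sum to 1."""
    ratios = {name: traffic_pct[name] / market_pct[name] for name in traffic_pct}
    total = sum(ratios.values())
    return {name: ratio / total for name, ratio in ratios.items()}


# Rounded figures quoted in the text (the "Other" slice is excluded).
traffic = {"Firefox": 31.0, "MSIE": 24.0, "Chrome": 3.0}
market = {"Firefox": 20.16, "MSIE": 52.37, "Chrome": 10.70}

shares = normalized_shares(traffic, market)
for name, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{name}: {share:.1%}")
```

Under this reading Firefox remains the largest slice by a wide margin, which agrees with the qualitative conclusion drawn from the right-hand chart.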

[Figure 5 data: left chart: MSIE 24%, Firefox 31%, Chrome 3%, Other 42%; right chart: MSIE 18%, Firefox 70%, Chrome 11%.]

Figure 5. Pie charts representing volumes of accesses for a web browser. The pie chart on the left depicts percentages of the total hits per browser, while the right-hand side shows "normalized" traffic shares among the three leading browsers.

V. EMPIRICAL STUDY

This section addresses each research question by describing how we approach the problem and reporting the obtained findings. When applicable, we report the results of statistical analysis of the data.

Q1: Are there differences in platform preferences between end-users of the browsers?

Motivation Since web browsers are typically developed to run on multiple operating systems, we were interested in comparing the browsers' adoption and support for different environments.

Approach We examined the choice of computing platform of the visitors to our website. For each release of the browser, we extracted the number of accesses from three operating systems: Windows, Linux, and OSX. This data is then normalized by the operating system market share obtained from StatOwl.com, which predominantly measures United States web sites [13], [14]. Since our server is located in Canada, we consider the market share statistics from StatOwl.com reasonable to use for our analysis. The OS market share data represents the "real" web site browsing community (it excludes automated systems like search robots) and excludes mobile usage. For each release of a browser, we calculated an average OS market share percentage (reported every month) for the period of the release's lifespan. Since the market share numbers were reported starting from 2008, we applied the total average market share (Windows: 88.21%, OSX: 11.10%, Linux: 0.54%) to the releases deployed prior to September 2008. Table I provides the percentages we used to normalize our usage data with respect to the users' choice of platform.

Table I
OPERATING SYSTEMS MARKET SHARE.

Release   Win      OSX      Linux
cm02      90.67%    8.87%   0.43%
cm03      90.39%    9.13%   0.45%
cm04      90.34%    9.00%   0.63%
cm10      91.12%    8.21%   0.61%
cm20      89.74%    9.59%   0.57%
cm30      88.43%   11.06%   0.41%
cm40      88.35%   11.07%   0.44%
cm50      87.93%   11.40%   0.51%
cm60      87.65%   11.66%   0.53%
cm70      87.43%   11.84%   0.59%
ff08      88.21%   11.10%   0.54%
ff09      88.21%   11.10%   0.54%
ff10      88.21%   11.10%   0.54%
ff15      88.21%   11.10%   0.54%
ff20      88.21%   11.10%   0.54%
ff30      90.89%    8.49%   0.56%
ff35      89.02%   10.37%   0.51%
ff36      87.68%   11.65%   0.51%

Results The distribution of the number of user accesses per platform is presented in Figure 6. A beanplot, an alternative to the boxplot, consists of a one-dimensional scatter plot, its distribution as a density shape, and an average line for the distribution [15]. The left side of a beanplot represents the density of the distribution for Chrome, while the right side shows the distribution for the Firefox browser. Applying the Mann-Whitney statistical test, we compared the distributions of each platform across the two browsers. The results show that the difference in density for both the OSX and Linux platforms between the two browsers is statistically significant (p-value<0.05), while the distributions of Windows users across Chrome and Firefox are fairly similar (p-value=0.40). Users do not adopt browsers equally across operating systems. Users on a Linux or OSX platform prefer Firefox over Chrome. On Windows, users opt equally for either of the two browsers.

[Figure 6 plot: three panels (Win, OSX, Linux), each an asymmetric beanplot of Chrome against Firefox; y-axis: number of page requests.]

Figure 6. Asymmetric beanplots showing the density of the page requests by user's platform. The left side of each bean shows hits for the Chrome browser, whereas the right side shows hits for Firefox. The horizontal lines represent the average. The difference in distributions for both the OSX and Linux platforms between the two browsers is statistically significant (p-value<0.05).
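For readers unfamiliar with the Mann-Whitney test used in this comparison, the sketch below computes the U statistic by pair counting and a two-sided p-value from the normal approximation. It is an illustration only, not the analysis code behind the reported p-values: it omits the tie and continuity corrections, and for samples as small as 8 to 10 releases per browser an exact test from a statistics package would be the safer choice. The function name and the toy inputs are assumptions.

```python
import math


def mann_whitney_u(xs, ys):
    """Return (U, two-sided p) using the plain normal approximation."""
    n1, n2 = len(xs), len(ys)
    # U counts the pairs in which x exceeds y; a tie counts as one half.
    u = sum(
        1.0 if x > y else 0.5 if x == y else 0.0
        for x in xs
        for y in ys
    )
    mean = n1 * n2 / 2.0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return u, p


# Toy per-release hit counts, invented purely to exercise the function.
u_stat, p_value = mann_whitney_u([1, 2, 3], [4, 5, 6])
```

A small p-value indicates that values from one sample tend to rank above those of the other, which is the sense in which the OSX and Linux distributions differ between the two browsers.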

We then performed the Kruskal-Wallis statistical test to compare distributions between the three different platforms for each browser (see Figure 7). Unlike for Chrome (p-value=0.23), the difference between the distributions of the operating systems for the Firefox browser is statistically significant (p-value=0.05). This suggests that Firefox users have an order of magnitude higher preference for Linux than for OSX or Windows systems, while Chrome is adopted fairly similarly across platforms.

By analyzing historical trends in how Chrome and Firefox provide support for different operating systems (see Figure 8), we noticed that Firefox offers outstanding OS compatibility from the very beginning, while Chrome begins to reach Linux and OSX users only starting from release r4, i.e., Chrome 4.0 (Google officially started to offer OSX and Linux support with the release of Chrome 5.0). Firefox reaches the peak of its adoption among Windows users in release r3 (Firefox 2.0); among the OSX population, releases r3 (Firefox 2.0) and r5 (Firefox 3.5) are better adopted than others, while Linux users seem to favour release r4 (Firefox 3.0). We should also mention that Firefox can run not only on Windows, OSX and Linux, but also on BSD and other Unix platforms. Therefore, Firefox provides early and better OS compatibility.

[Figure 7 plot: beanplots for Win, OSX and Linux within each browser (Chrome, then Firefox); y-axis: number of page requests, scale 1:1,000.]

Figure 7. Beanplots showing the density of the page requests by user's platform within a browser. Black beans represent Chrome, and grey beans represent Firefox. The difference between OS platforms for Firefox is statistically significant (p-value=0.05).

[Figure 8 plots: three panels (Win, OSX, Linux), each comparing Chrome and Firefox; x-axis: release (b3, b2, b1, r1 to r7); y-axis: number of hits, scale 1:1,000,000 for Win and 1:100,000 for OSX and Linux.]

Figure 8. Graphs showing support for Windows, OSX and Linux platforms across releases of Chrome and Firefox.

Q2: Is there a difference in geographic distribution between user populations?

Motivation The previous question has shown that there are clear differences in how the two browsers are adopted across operating systems. We now study the geographical location of the users and whether there is a difference in the adoption of the browsers across the globe.

Approach We used a geolocation service to track the geographic distribution of visitors to our website. We used Geo::IPfree, a Perl module, to look up the country of an IP address. Since there is no standard convention on the number of continents among scientists, we referred to the list of countries by continent provided by WorldAtlas.com [16]. We used six continents to map a user's IP address to a geographical location in our analysis. Unsurprisingly, our server logs contained no page requests from Antarctica, and thus this continent is not present in the results. The list of continents, which we call regions, includes Africa (AF), Asia (AS), Europe (EU), North America (NA), South America (SA) and Australia/Oceania (OC). During the process of mapping IP addresses to countries and regions, we detected a number of private IP addresses (local network), which we excluded from the analysis.

Results Figure 9 illustrates the differences in the distribution of the user populations by world region. The Geo::IP database contains information from various registry sources. In some cases, a country is only indicated as Europe, which means that the requests from such hosts may come from anywhere in the European Union. To account for the global digital divide, that is, the disparities in the opportunities to access the Internet between developed and developing countries [17], we normalized our user accesses by the world's Internet usage data. The statistics on the distribution of Internet users by world region report the following numbers: NA 13.0%, AS 44.0%, EU 22.7%, SA 10.3%, AF 5.7% and OC 1.0% [18]. We found that 85% of Firefox users and only 72% of Chrome users are located in North America. Overall, Chrome adoption is better distributed across the globe.
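The mapping step can be sketched in a few lines. This is an illustration rather than the Perl pipeline used in the study: the country lookup is passed in as a caller-supplied function standing in for the geolocation service, and `COUNTRY_TO_REGION` is a hypothetical five-entry excerpt of the full country-to-continent list. Private addresses are recognized with the standard `ipaddress` module and excluded.

```python
import ipaddress

# Hypothetical excerpt; a complete table would follow the WorldAtlas.com list.
COUNTRY_TO_REGION = {"CA": "NA", "US": "NA", "IN": "AS", "CN": "AS", "GB": "EU"}


def is_private(ip):
    """True for local-network addresses such as 10.x.x.x or 192.168.x.x."""
    return ipaddress.ip_address(ip).is_private


def region_for(ip, lookup_country):
    """Map an IP address to a region code, or None if it is private.

    lookup_country is any function that returns a country code for an IP
    string; it stands in for the geolocation service.
    """
    if is_private(ip):
        return None
    return COUNTRY_TO_REGION.get(lookup_country(ip), "unknown")
```

Returning `None` for private addresses keeps the exclusion explicit, so the number of dropped requests can be reported alongside the regional totals.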

[Figure 9 data: Chrome: NA 72%, AS 5%, EU 6%, SA 2%, AF 3%, OC 12%; Firefox: NA 85%, AS 2%, EU 4%, SA 1%, AF 2%, OC 7%.]

Figure 9. Pie charts showing the density of the page requests by region for Chrome (left) and Firefox (right).

We then compared user adoption of the browsers with respect to country coverage. We found that Chrome is adopted by users in 187 countries, while the Firefox user population covers 207 countries. Table II presents statistics on the user accesses by the top 5 countries. We were not surprised to see China and India in the top 5 countries list, as these two countries contribute the majority of the international students in our school's undergraduate program.

Table II
TOP 5 COUNTRIES OF USER'S ACCESSES

Chrome   Firefox
Canada   Canada
USA      USA
India    India
China    Europe
UK       UK

The results suggest that Firefox adoption has been mainly concentrated in North America, while Chrome users are better distributed across the globe and thus, Chrome supports larger diversity among its user population.

Q3: Is there a difference in navigational behaviour be-tween two user groups?

Motivation By looking at the content of the pages re-quested by the visitors, we wanted to identify whether userpopulation of two browsers have different browsing goalsand behaviour.

Approach. Unlike Google, we do not have enough information to track individual users and their navigational history. However, we were interested in classifying users according to their navigational behaviour on our web site. To investigate patterns in the navigational behaviour of the users, we first determined the types of web content our web site offers to the visitors. Our school's web site is mainly intended for the following visitors:

1) students – offering information to current and prospective students about the courses, their description, schedules, lectures, assignments, exams, etc.

2) researchers/industry partners – offering information on faculty's and grad students' research interests, current projects, publications, potential collaboration opportunities, etc.

Based on the content of the page requests, we defined two profiles of the user accesses, related to teaching or research. We note that not every access is related to either of the two profiles. Table III defines our rules for classifying visitors' requests into the two profiles.

Requests to the publications are defined as ones to any .pdf document located under the /pubs/, /publications/ or /papers/ directories.
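The classification rules of Table III can be sketched as a small pattern matcher. The URL layout is an assumption made for illustration: course pages are taken to contain a segment such as /~cs246/ or /cs846/, and the prefixes used for undergraduate information pages are hypothetical placeholders. Only the course levels, the publication directories and the /research prefix come from the text.

```python
import re
from typing import Optional

# Assumed course URL shape: "/cs" or "/~cs" followed by a three-digit course number.
COURSE_RE = re.compile(r"/~?cs(\d)\d{2}\b", re.IGNORECASE)
# Publications: any .pdf under /pubs/, /publications/ or /papers/ (from the text).
PUBLICATION_RE = re.compile(r"/(pubs|publications|papers)/.*\.pdf$", re.IGNORECASE)
# Hypothetical prefixes for course descriptions, schedules and prospective-student pages.
UNDERGRAD_PREFIXES = ("/undergrad/", "/prospective/")

UNDERGRAD_LEVELS = "12346"  # CS 100-, 200-, 300-, 400- and 600-level courses
RESEARCH_LEVELS = "78"      # CS 700- and 800-level courses


def classify_request(path: str) -> Optional[str]:
    """Return "undergrad", "research", or None when neither profile applies."""
    path = path.split("?", 1)[0]  # ignore any query string
    course = COURSE_RE.search(path)
    if course:
        level = course.group(1)
        if level in UNDERGRAD_LEVELS:
            return "undergrad"
        if level in RESEARCH_LEVELS:
            return "research"
        return None
    if PUBLICATION_RE.search(path):
        return "research"
    if path == "/research" or path.startswith("/research/"):
        return "research"
    if path.startswith(UNDERGRAD_PREFIXES):
        return "undergrad"
    return None
```

Requests that match neither profile return None, which mirrors the observation that not every access belongs to one of the two profiles.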

Results. Figure 10 illustrates the differences in the browsing behaviour between Chrome and Firefox users. As we expected, undergrad pages are accessed more often than research-related ones by both Chrome and Firefox users.

Table III
PATTERN MATCHING RULES TO CLASSIFY USER ACCESS TYPE

Undergrad
• requests to any CS 100-, 200-, 300-, 400- and 600-level course
• requests to course descriptions and course schedules for undergrads
• requests to information for future undergrads and prospective students

Research
• requests to any CS 700- and 800-level course
• requests to publications
• requests to anything under /research

While we detected no statistical evidence for age-specific differences in browsing behaviour among Firefox and Chrome users, Figure 10 suggests that the browsing habits of Chrome users follow an approximately normal distribution, while for Firefox the distributions of the accesses of both research and undergrad pages are more spread out. Since we did not detect any statistical difference between the distributions, we performed the Kolmogorov-Smirnov test to test for the equality of the two distributions. The results suggested that for the undergrad profile, the Chrome and Firefox samples come from the same distribution. This tells us that when navigating to undergrad content, Chrome and Firefox users behave similarly.
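For readers unfamiliar with the two-sample Kolmogorov-Smirnov test, the sketch below computes its statistic, the largest gap between the two empirical distribution functions, and compares it with the standard large-sample critical value. It is a pure-Python illustration and not the tooling used in the study; the function names and the default significance level of 0.05 are assumptions.

```python
import math
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest absolute difference between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    # The maximum gap is always attained at one of the observed values.
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def same_distribution(sample_a, sample_b, alpha=0.05):
    """True when the test does NOT reject equality of distributions at level alpha.

    Uses the asymptotic critical value c(alpha) * sqrt((n + m) / (n * m)),
    with c(alpha) = sqrt(-ln(alpha / 2) / 2).
    """
    n, m = len(sample_a), len(sample_b)
    critical = math.sqrt(-0.5 * math.log(alpha / 2)) * math.sqrt((n + m) / (n * m))
    return ks_statistic(sample_a, sample_b) <= critical
```

The asymptotic critical value is only an approximation for small samples; a statistics package that computes an exact p-value would be preferable in practice.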

[Figure 10: beanplots in two panels, Research and Undergrad, each comparing Chrome and Firefox. Vertical axis: number of page requests (logarithmic scale).]

Figure 10. Beanplots showing the differences in navigational behaviour between Chrome and Firefox users.

Historical trends of the user accesses for each release of a browser are shown in Figure 11. Chrome users have similar navigational patterns in viewing both types of web content: both research- and undergrad-related pages were accessed from more recent releases of the browser, starting from Chrome 3.0. Unlike Chrome, we found quite different patterns in viewing web content among Firefox users. Most hits to the research-related pages came from Firefox 1.0 (shown as release r1 in Figure 11). We were surprised to see no accesses from the earlier releases of Firefox to the undergrad content. Firefox 2.0 is the oldest browser used to navigate to the undergrad information, while the largest volume of the page views to this content originated from the Firefox releases 3.0 and 3.5. These findings suggest that undergraduate students, a younger population of users, have more up-to-date versions of a web browser (true for both Chrome and Firefox), while researchers, a more mature population of users, do not update their browsers as quickly as their younger counterparts.

Surprisingly, the first five releases of the Chrome browser have no hits, or comparatively few hits, to both types of web content on our web site. Since our web traffic data dates back to February 2007 and Google Chrome was released later, in 2008, we expected to see our visitors having early releases of Chrome installed on their computers. This observation suggests that Chrome adoption had a slow start, with the first substantial wave arriving a year later with Chrome 3.0.

Firefox 1.0 was released in November 2004, so at the time of the first records in our logs, this release was more than two years old. Thus, Firefox was well adopted from the very first release (mainly due to earlier deployment of the browser under different names: m/b (mozilla/browser) under the Mozilla Suite, Phoenix, and Mozilla Firebird), and users stayed quite loyal to the initial release of the browser, hesitating to update it to Firefox 1.5 up until Firefox 2.0 became available in October 2006.

[Figure 11: two panels, Research and Undergrad, each comparing Chrome and Firefox. Horizontal axis: browser release (b3, b2, b1, r1, r2, r3, r4, r5, r6, r7). Vertical axis: number of hits (scale 1:1,000).]

Figure 11. Plots showing the distributions of user accesses to research and undergrad web content for each release of a browser.
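A per-release breakdown such as the one in Figure 11 can be approximated by extracting the browser name and version from each request's User-Agent string and counting hits per release and content profile. The sketch below assumes that the Chrome/x.y and Firefox/x.y tokens identify the release; the function names and the (user agent, profile) record layout are hypothetical.

```python
import re
from collections import Counter

# Chrome and Firefox both advertise themselves as "<Name>/<major>.<minor>...".
RELEASE_RE = re.compile(r"(Chrome|Firefox)/(\d+)\.(\d+)")


def release_of(user_agent):
    """Return (browser, "major.minor"), or None for other browsers."""
    match = RELEASE_RE.search(user_agent)
    if match is None:
        return None
    return match.group(1), f"{match.group(2)}.{match.group(3)}"


def hits_per_release(records):
    """Count hits per (browser, release, profile).

    `records` is an iterable of (user_agent, profile) pairs, where profile is
    "research", "undergrad", or None for requests outside both profiles.
    """
    counts = Counter()
    for user_agent, profile in records:
        release = release_of(user_agent)
        if release is not None and profile is not None:
            counts[(release[0], release[1], profile)] += 1
    return counts
```

Grouping by major.minor is a simplification: patch-level builds such as 3.5.2 are folded into 3.5, which matches the release granularity discussed in the text.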

Q4: Can the combination of usage data with development information provide insight into the sustainability of the browser? And if so, which is more sustainable?

Motivation. Previous questions have investigated the differences in release and user characteristics between the Firefox and Chrome browsers. We can now ask whether these characteristics can be used to evaluate the sustainability of a browser and, if so, which browser is more sustainable.

Approach. Sustainability recognizes the integration of environmental, social, and economic spheres and meets “the needs of the present without compromising the ability of future generations to meet their own needs” [19]. A sustainable software system should be socially and environmentally bearable, viable economically without introducing impacts to the environment, and socially and economically equitable and accessible to everyone (see Figure 12) [20]. By studying the user population, we can infer some of the properties of the sustainability of a software project.

• Development process and practices, in particular the release history of a product, can account for the maintenance quality. For example, shorter release cycles underlie better maintenance and delivery of more reliable and defect-free software.

• Knowledge about users' environment can inform on how well projects provide support for various operating systems. Development of cross-platform applications that can be executed on older computers contributes to hardware sustainability and thus reduces e-waste.

• Reaching for a wider user population across the globe promotes cultural diversity and ethnicity. Tracking cultural trends of the user adoption can provide a picture of the globalization of a project.

• By looking at the navigational behaviour of the users, we can reason about the population and age diversity, as well as the success of a software release.

Figure 12. The three pillars of sustainable development. Adapted from [20].

Results. In terms of delivering more reliable software, Chrome's automated update mechanisms are desirable as they provide an up-to-date shield against security threats. However, we found in our previous study [1] that Firefox addresses software bugs faster than Chrome but does not deploy major releases as fast as Chrome does. Therefore, when assessing the reliability of a software system, one must consider both the release deployment rate and the defect resolution rate. In terms of platform compatibility, Firefox provides support for more and older operating systems, which in turn fosters hardware sustainability. Adoption of the web browsers is not evenly diffused around the world. Firefox adoption has been concentrated predominantly in North America, while Chrome users appear to be better distributed through the different regions of the world; thus, Chrome supports larger diversity and ethnicity among its user population.

Software sustainability in this context needs to recognize not only environmental aspects (e.g., development of software solutions that require lower energy consumption), but also social ones (user adoption, interactions, etc.). To preserve sustainability, the full cycle of development, use, maintenance and disposal of software must be considered. Knowing user populations, their values and needs can improve the sustainability of software so that it remains longer on the market. Focusing on the development of architectures for more effective integration and performing analysis of the usage data from different sources are likely to result in more useful and more sustainable software that can infer intelligence from user interaction with the web.

VI. DISCUSSION

This section discusses our findings and lessons learned about the differences in user populations and adoption of two open source web browsers. It also addresses several threats to validity.

Software projects collect and store enormous archives of web usage data that are often disregarded or unused. Such archives contain data on user characteristics, including user environment, locality, and browsing behaviour. This paper shows that usage characteristics can be helpful in assessing the sustainability of a project.

Web browsers are designed with the goal of bringing Internet resources to the users. And yet, they differ in their performance, available feature sets (e.g., how they preserve user privacy and security), and extensibility through additional plug-ins and extensions.

We were limited by the data collected by our web server. The information on user adoption and user characteristics can be extracted not only from the traffic logs but also from the product download centre, customer support centre, etc. For open source products, it can be quite challenging to construct product popularity trends based on the number of product downloads due to the lack of a central repository to track such downloads. However, analysis of web usage data can provide valuable information on how users adopt software projects, and analysis of historical trends can explain the popularity of a certain release of a software product. It is important to know the end-users of a product not only from the statistics collected by marketing surveys but also by analyzing real usage data such as web logs to infer knowledge on the user population, their technical environment, locality and navigational behaviour. The knowledge of these usage characteristics can lead to a better understanding of the sustainability of software projects.

A. Threats To Validity

External validity. Our findings are limited by the obtained data set: web server logs. It is arguable whether our school's web traffic is representative of the world-wide user population of the two browsers. Usage data sets are typically not publicly available due to privacy and business concerns. In our analysis we tried to balance the data representativeness and to avoid bias by normalizing the number of page requests by the system's usage share. Further studies may be necessary to confirm our findings.

Internal validity. Our web usage data has a few gaps due to the specifics of the university's backup routine, making the quality of the logs an important threat. A small number of CS undergraduate courses are offered through UW-ACE, a web-based course management system. Our web server logs do not include user accesses to such courses. We also need to mention that CS graduate courses normally reside on the faculty's web space. However, some faculty members have their web sites hosted by the web servers belonging to the Faculty of Mathematics. In such cases, we were not able to track accesses to these courses. For example, the CS846 course has been taught by several professors through the years and its web site is located on both the plg.uwaterloo.ca and se.uwaterloo.ca sub-domains. Neither the plg. nor the se. sub-domain is hosted under cs.uwaterloo.ca.

Our choice of granularity in analyzing web logs is determined by the existing challenges in identifying users. Accurate tracking of individual users by IP address is not always possible. A user who accesses the web from different machines (e.g., work vs. home computer) might have different IP addresses. A user who uses multiple browsers on the same machine will appear as multiple users (the user agents will differ). ISPs can assign a different IP address to a user for each request, or several users might share the same IP address.
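The identification problem can be made concrete with a short sketch that approximates a "visitor" by the pair (IP address, user agent). It assumes the logs follow the Apache-style Combined Log Format, which the text does not confirm; the sample lines, addresses and function names are hypothetical.

```python
import re

# Apache-style Combined Log Format (an assumption about the log layout):
# ip ident user [time] "request" status bytes "referer" "user-agent"
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"$'
)


def visitor_key(line):
    """Return the (ip, user_agent) pair for a log line, or None if malformed."""
    match = LOG_RE.match(line.strip())
    return (match.group("ip"), match.group("ua")) if match else None


def count_visitors(lines):
    """Number of distinct (ip, user_agent) pairs among well-formed lines."""
    return len({key for key in map(visitor_key, lines) if key is not None})


FIREFOX = "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2"
CHROME = "Mozilla/5.0 (X11; U; Linux i686; en-US) AppleWebKit/532.0 (KHTML, like Gecko) Chrome/3.0.195.27 Safari/532.0"

# Hypothetical sample: possibly one person using two browsers and two machines.
SAMPLE = [
    f'192.0.2.10 - - [12/Oct/2009:10:00:00 -0400] "GET /research/ HTTP/1.1" 200 512 "-" "{FIREFOX}"',
    f'192.0.2.10 - - [12/Oct/2009:10:05:00 -0400] "GET /research/ HTTP/1.1" 200 512 "-" "{CHROME}"',
    f'192.0.2.11 - - [12/Oct/2009:19:30:00 -0400] "GET /research/ HTTP/1.1" 200 512 "-" "{FIREFOX}"',
]
```

Here `count_visitors(SAMPLE)` reports three visitors even though the requests may all come from a single person, while two people behind a shared address and identical browsers would be counted as one. Both effects are the reason the analysis works at the level of page requests rather than individual users.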

Unlike Google, which uses DoubleClick cookies to track individual users and their navigational behaviour over the web, we are limited to the data captured in typical web traffic logs.

Construct validity. We have chosen a set of metrics to quantify the value of the collected data that captures only a part of its potential meaning. Our choices are a function of our interest in exploring the data and the availability and structure of the data sets.

Conclusion validity. We reported findings based on statistical significance. We applied statistical analysis when needed, and were able to reject null hypotheses and detect interesting patterns.

VII. CONCLUSION

In this study we explored a number of research questions related to web usage logs and product release histories¹. We found that Chrome and Firefox have different release and user population characteristics. Chrome undergoes continual and regular updates and has short release cycles, while Firefox is more traditional in delivering major updates, yet it provides support for more and older platforms, fostering hardware sustainability. While Firefox has been well adopted from the initial release, our data suggests that its adoption has been concentrated predominantly in North America. Chrome adoption started more slowly, yet Chrome users are better distributed across the globe, and thus Chrome facilitates larger diversity among its user population. Having diversity in behaviour and demographics is one of the main principles for sustainable practices [20]. Chrome users are more likely to have up-to-date browsers, while Firefox users are more loyal to an aging browser version. A sustainable web browser should be able to adapt effectively to changing requirements, especially innovative technologies such as HTML5 and CSS3, yet it should be able to support a variety of choices and resources. Browser releases that force users to upgrade their systems increase e-waste. A sustainable browser needs to support cross-platform independence and to run comfortably on outdated hardware.

Software development teams focus mainly on the technical and economic value of a software system, with little awareness of the social and environmental impacts of the developed system. Assessing, measuring and managing the sustainability of a software system, as well as focusing more on software adoption rather than development, would make software development a more sustainable practice.

ACKNOWLEDGMENT

We wish to thank the Computer Science Computing Facility (CSCF) of the University of Waterloo for providing web traffic log files for this study, as well as Dr. Ian Davis for his help on extracting the raw usage data from the web traffic archives.

REFERENCES

[1] O. Baysal, I. Davis, and M. W. Godfrey, “A tale of two browsers,” in Proceeding of the 8th working conference on Mining software repositories, ser. MSR ’11, 2011, pp. 238–241.

[2] J. Srivastava, R. Cooley, M. Deshpande, and P.-N. Tan, “Web usage mining: discovery and applications of usage patterns from web data,” SIGKDD Explor. Newsl., vol. 1, pp. 12–23, January 2000.

¹ Available at http://www.cs.uwaterloo.ca/~obaysal/sustainability.html

[3] R. Cooley, B. Mobasher, and J. Srivastava, “Web mining: information and pattern discovery on the world wide web,” in Tools with Artificial Intelligence, 1997. Proceedings., Ninth IEEE International Conference on, Nov. 1997, pp. 558–567.

[4] B. Mobasher, N. Jain, E.-H. S. Han, and J. Srivastava, “Web mining: Pattern discovery from world wide web transactions,” Tech. Rep., 1996.

[5] M. El-Ramly and E. Stroulia, “Mining software usage data,” in Proceedings 1st International Workshop on Mining Software Repositories, ser. MSR ’04, 2004, pp. 64–68.

[6] P. L. Li, R. Kivett, Z. Zhan, S.-e. Jeon, N. Nagappan, B. Murphy, and A. J. Ko, “Characterizing the differences between pre- and post- release versions of software,” in Proceeding of the 33rd international conference on Software engineering, ser. ICSE ’11. New York, NY, USA: ACM, 2011, pp. 716–725.

[7] T. Duebendorfer and S. Frei, “Why Silent Updates Boost Security,” TIK, ETH Zurich, Tech. Rep. 302, May 2009.

[8] R. C. Seacord, J. Elm, W. Goethert, G. A. Lewis, D. Plakosh, J. Robert, L. Wrage, and M. Lindvall, “Measuring software sustainability,” Software Maintenance, IEEE International Conference on, vol. 0, p. 450, 2003.

[9] F. Albertao, J. Xiao, C. Tian, Y. Lu, K. Q. Zhang, and C. Liu, “Measuring the sustainability performance of software projects,” E-Business Engineering, IEEE International Conference on, vol. 0, pp. 369–373, 2010.

[10] Wikipedia, “Mozilla Firefox - Wikipedia, the free encyclopedia,” http://en.wikipedia.org/wiki/Mozilla_Firefox, [Online; accessed 28-November-2010].

[11] Wikipedia, “Google Chrome - Wikipedia, the free encyclopedia,” http://en.wikipedia.org/wiki/Google_Chrome, [Online; accessed 28-November-2010].

[12] StatOwl.com, “Web browser market share,” August 2011. [Online]. Available: http://www.statowl.com/web_browser_market_share.php

[13] StatOwl.com, “Operating systems market share,” July 2011. [Online]. Available: http://statowl.com/operating_system_market_share.php

[14] StatOwl.com, “About our data,” July 2011. [Online]. Available: http://statowl.com/about_our_data.php

[15] P. Kampstra, “Beanplot: A boxplot alternative for visual comparison of distributions,” Journal of Statistical Software, Code Snippets, vol. 28, no. 1, pp. 1–9, 2008. [Online]. Available: http://www.jstatsoft.org/v28/c01/

[16] WorldAtlas.com, “Countries listed by continent,” July 2011. [Online]. Available: http://www.worldatlas.com/cntycont.htm

[17] M.-T. Lu, “Digital divide in developing countries,” Journal of Global Information Technology Management, vol. 4, no. 3, pp. 1–4, 2001.

[18] InternetWorldStats.com, “Internet usage and population statistics,” [Online; accessed 20-September-2011]. Available: http://www.internetworldstats.com/stats.htm

[19] United Nations, “Report of the world commission on environment and development,” General Assembly Resolution 42/187, Tech. Rep., 11 December 1987. [Online]. Available: http://www.un.org/documents/ga/res/42/ares42-187.htm

[20] W. Adams, “The future of sustainability: Re-thinking environment and development in the twenty-first century,” IUCN Renowned Thinkers Meeting, Tech. Rep., January 2006. [Online]. Available: http://cmsdata.iucn.org/downloads/iucn_future_of_sustanability.pdf
