
How Do Mobile Phones Fail? A Failure Data Analysis of Symbian OS Smart Phones


Marcello Cinque, Domenico Cotroneo

Dipartimento di Informatica e Sistemistica

Università degli Studi di Napoli Federico II

Via Claudio 21, 80125 - Naples, Italy

{macinque, cotroneo}@unina.it

Zbigniew Kalbarczyk, Ravishankar K. Iyer

Center for Reliable and High Performance Computing

University of Illinois at Urbana-Champaign

1308 W. Main St., Urbana, IL 61801

{kalbar, iyer}@crhc.uiuc.edu

Abstract

While the new generation of hand-held devices, e.g., smart phones, supports a rich set of applications, the growing complexity of the hardware and runtime environment makes the devices susceptible to accidental errors and malicious attacks. Despite these concerns, very few studies have looked into the dependability of mobile phones. This paper presents a measurement-based failure characterization of mobile phones. The analysis starts with a high-level failure characterization of mobile phones based on data from publicly available web forums, where users post information on their experiences in using hand-held devices. This initial analysis is then used to guide the development of a failure data logger for collecting failure-related information on Symbian OS-based smart phones. Failure data is collected from 25 phones (in Italy and the USA) over a period of 14 months. Key findings indicate that: (i) the majority of kernel exceptions are due to memory access violation errors (56%) and heap management problems (18%), and (ii) on average users experience a failure (freeze or self-shutdown) every 11 days. While the study provides valuable insight into the failure sensitivity of smart phones, more data and further analysis are needed before generalizing the results.

1 Introduction

The new generation of mobile and embedded devices, such as smart phones and PDAs (personal digital assistants), supports a rich set of applications, e.g., web browsing and entertainment software. Moreover, time-to-market pressure forces manufacturers to deliver products with new features within very short time windows (e.g., six months), often sacrificing testing efforts. As a result, we witness an increasing susceptibility of hand-held devices to accidental errors and malicious attacks. An example is the recently reported first mobile phone virus, Cabir, affecting Symbian OS-based smart phones.

Reliability becomes even more critical as new critical applications emerge for mobile phones, e.g., robot control [15, 10], traffic control [2], and telemedicine [4]. In such scenarios, a phone failure affecting the application could result in a significant loss or hazard, e.g., a robot performing uncontrolled actions.

Despite these concerns, very few studies have looked into the dependability of mobile phones. As a result, there is little understanding of how and why mobile phones fail.

This paper presents a measurement-based failure analysis of mobile phones. The analysis starts with a high-level failure characterization of mobile phones based on everyday users' experiences. Data for this study spans a four-year period (between 2003 and 2006) and is obtained from publicly available web forums, where users post information on their experiences in using hand-held devices. The information collected in these forums is not well structured, and only a relatively small number of entries can be considered failure reports. However, the collected data enables: (i) identification of the high-level failure manifestations, (ii) categorization of the user-initiated recovery from device failures, and (iii) characterization of failure severity.

This initial analysis is then used to guide the development of a failure data logger for smart phones, initially introduced in [1]. The logger employs a heartbeat mechanism to detect system/application failures. Upon failure detection, the logger records information about the phone activities, the list of running applications, and error conditions signaled by the system/application modules. The logger has been deployed on 25 Symbian-based smart phones in Italy and in the US since September 2005. The Symbian OS was chosen because of: (i) its open programmability features supporting the C++ and Java programming languages and (ii) its relatively widespread use at the time of this analysis.

The analysis of the collected failure data shows: (i) The majority of kernel exceptions are due to memory access violation errors (56%) and heap management problems (18%). This is despite the microkernel design model and advanced memory management facilities provided by the Symbian OS. (ii) System panics often occur in bursts (two or more panic events in short succession), which indicates error propagation between applications (especially between real-time tasks and interactive applications). (iii) Users experience a failure (a phone freeze or self-shutdown) every 11 days on average.

2 Background

Evolution of Mobile Phones. Mobile phone evolution can be described according to three waves, each one characterized by a specific class of mobile terminal [8]:

• Voice-centric mobile phone (first wave): a hand-held mobile radiotelephone for use in an area divided into small sections (cells) and supporting SMS (Short Message Service).

• Rich-experience mobile phone (second wave): a mobile phone with numerous advanced features, typically including the ability to handle data (web browsing, e-mail, personal information management, images, music) through high-resolution color screens.

• Smart phone (third wave): a general-purpose, programmable mobile phone with enhanced processing and storage capabilities. It can be viewed as a combination of a mobile phone and a PDA, and it may have a PDA-like screen and input devices.

Recent mobile phone models on the market feature more computing and storage capabilities, new operating systems, new embedded devices (e.g., cameras, radio), and communication technologies (Bluetooth, IrDA, WAP, GPRS, UMTS). The number of units sold during the third quarter of 2005 (205 million) more than doubled with respect to the third quarter of 2001 (97 million units; sources: http://www.itfacts.biz, http://www.theregister.co.uk). In the same period, the percentage of smart phones sold has sextupled. According to industry, the time from conception to market deployment of a new phone model is between 4 and 6 months. Clearly, the pressure to deliver a product on time frequently results in compromising the device reliability. The hope is that any potential reliability problems can be fixed quickly by deploying new releases of the phone firmware, which can be installed on the phone by phone service centers.

Symbian OS. Symbian [8] is a lightweight operating system developed for mobile phones and supported by several leading mobile phone manufacturers. The design of Symbian is based on a hard real-time, multithreaded microkernel. All system services are provided by server applications. Clients access servers using kernel-supported message passing mechanisms.

Since mobile phone resources are highly constrained, special care is taken for memory management. Specific programming rules are defined to ensure that unused memory is freed and memory leaks are avoided even in the case of failures. In particular, the following mechanisms are provided: (i) the clean-up stack, which is an OS resource for storing references to objects allocated on the heap memory; (ii) the trap-leave technique, which is similar to the try-catch paradigm defined for the C++ and Java languages: when an exception is raised during the execution of a trap block, the current method "leaves" and control returns to the caller, which handles the problem. In the meantime, the operating system frees the memory allocated for all objects stored on the clean-up stack during the execution of the trap block, thus avoiding potential memory leaks; (iii) the two-phase construction paradigm, which is defined to construct objects with dynamic extensions. The mechanism ensures that, when errors occur during the construction of an object, the dynamic extension is freed using the clean-up stack mechanism.
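The interplay between the clean-up stack and the trap-leave technique can be illustrated with a small model. The sketch below is plain Python, not Symbian code: the names `Leave`, `Resource`, `CleanupStack`, and `trap` are hypothetical and chosen only for illustration. It mimics just the behavior described above, namely that objects pushed on the clean-up stack inside a trap block are freed when the block leaves.

```python
class Leave(Exception):
    """Models a 'leave' (hypothetical name, not the Symbian API)."""
    def __init__(self, code):
        super().__init__(code)
        self.code = code


class Resource:
    """A heap object that must be released explicitly."""
    def __init__(self, name):
        self.name = name
        self.freed = False

    def free(self):
        self.freed = True


class CleanupStack:
    """Toy clean-up stack: remembers objects to free if a leave occurs."""
    def __init__(self):
        self._items = []

    def push(self, obj):
        self._items.append(obj)
        return obj

    def mark(self):
        # Current depth; used to know what was pushed inside a trap block.
        return len(self._items)

    def unwind_to(self, mark):
        # Free everything pushed after the mark, most recent first.
        while len(self._items) > mark:
            self._items.pop().free()


def trap(stack, block):
    """Run block(stack); on a leave, free the objects pushed inside it.

    Returns 0 on success or the leave code on failure.
    """
    mark = stack.mark()
    try:
        block(stack)
        return 0
    except Leave as leave:
        stack.unwind_to(mark)
        return leave.code


created = []


def failing_block(stack):
    buf = stack.push(Resource("buffer"))
    created.append(buf)
    raise Leave(-4)  # arbitrary error code used for this example


stack = CleanupStack()
error = trap(stack, failing_block)
print(error, created[0].freed)
```

In this model the caller only inspects the returned code; the buffer allocated before the leave has already been released by the unwinding step.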

The Symbian OS defines two levels of multitasking: (i) threads, which execute at the lower level and are scheduled by a time-sharing, preemptive, priority-based OS thread scheduler, and (ii) Active Objects (AOs), which execute at the upper level and are scheduled by a non-preemptive, event-driven active scheduler. Multiple AOs can run within a thread. The use of active objects enables a lightweight OS design, since AOs eliminate the need for synchronization primitives and hence incur a lower context-switch overhead than threads.

A crucial aspect of Symbian that is of interest to us is panic events. A panic event represents a non-recoverable error condition signaled to the kernel by either user or system applications. Information associated with a panic (panic category and panic type) is delivered to the kernel, which decides on the recovery action, e.g., application termination or system reboot.
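The non-preemptive, event-driven scheduling of active objects can be pictured with a toy model. The sketch below is a simplification written in Python, not the Symbian active scheduler: `ActiveScheduler`, `signal`, and the handler names are hypothetical, and handlers are served in FIFO order rather than by priority. It shows why no synchronization primitive is needed: each handler runs to completion before the next one starts.

```python
from collections import deque


class ActiveScheduler:
    """Toy non-preemptive scheduler: one handler runs at a time, to completion."""

    def __init__(self):
        self._ready = deque()
        self.trace = []

    def signal(self, name, handler):
        # An awaited event completed; queue the active object's handler.
        self._ready.append((name, handler))

    def run(self):
        while self._ready:
            name, handler = self._ready.popleft()
            self.trace.append(name)
            handler()  # cannot be interrupted by another active object


scheduler = ActiveScheduler()
counter = {"value": 0}


def on_network():
    counter["value"] += 10


def on_timer():
    counter["value"] += 1  # no lock needed: nothing can preempt this handler
    scheduler.signal("network", on_network)


scheduler.signal("timer", on_timer)
scheduler.signal("keypad", lambda: None)
scheduler.run()
print(scheduler.trace, counter["value"])
```

The flip side of this design is that a handler that runs for too long delays every other active object in the same thread.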

3 Related Research

The goal of measurement-based analysis of failure data of computer systems is to classify errors/failures, to characterize their temporal behavior, and to guide the development of detection and recovery mechanisms. [17] identifies trends (shifting error sources, explosive complexity, and global volume) in the computer industry that impact computer system dependability and security. The evolution of three research threads in experimental dependable systems (error monitoring and failure data analysis, fault injection, and design methodology) is traced to illustrate how research responds to or anticipates the direction of the computer industry. The authors indicate a need for more research, especially on issues of complexity, security, and reliability of current and new generation computing systems and applications. Towards this, our study proposes a method for experimental measurement-based analysis of failure mechanisms of emerging smart handheld devices.

A number of studies focus on measurement-based dependability analysis of operating systems, e.g., Windows NT [9, 20], Windows 2000 [19], and Linux [7, 18]. Other studies characterize failures of networked systems and, more recently, large-scale heterogeneous server environments [11, 14].

In the field of mobile distributed systems, an architecture for gathering and analyzing failure data for Bluetooth distributed systems is proposed in [6], whereas [12] reports on an experimental study of the drop impact on mobile devices' hardware failures. [13] discusses failure data collected from the base stations of the cellular system.

All these studies exploit failure information collected in system event logs, or failure reports provided by specialized maintenance staff. In the case of smart phone devices (analyzed in this paper), logging facilities are limited and not fully exploited. In particular, the Symbian OS provides a server application (flogger), which allows logging of application-specific information. However, in order to access the data logged by a generic system/application module, it is necessary to create (on the device) a directory with a well-defined, system-specific name (e.g., Xdir). The problem is that the names of such directories are not made publicly available to developers. These directories are used by manufacturers during development/testing. Recently, a tool called D_EXC (see footnote 2) has been introduced to enable collecting panic events generated on a phone. However, the tool does not relate panic events to failure manifestations, running applications, and phone activities as we do in our study.

4 Smart Phones’ High-level Failures Characterization

In order to conduct a high-level failure characterization of mobile phones, we use publicly available data found on several web forums (www.howardforums.com, cellphoneforums.net, www.phonescoop.com, and www.mobiledia.com), where mobile phone users post information on their experience in using hand-held devices. The posted data has a free format, and a relatively small number of entries report on device errors/failures. Here are two examples of user reports: “the phone freezes whenever I try to write a text message, and stays frozen until I take the battery out” and “the phone exhibits random wallpaper disappearing and power cycling, due to UI memory leaks”. Note that the latter report gives details on a potential failure cause. The posted information is filtered (to extract entries related to device failures), classified, and analyzed along several dimensions, as discussed further in this section.

2 D_EXC is a Symbian project available at www.symbian.com/developer/downloads/tools.html

Failure Types. The following failure categories are identified based on the extracted data (see footnote 4).

• Freeze (lock-up or a halting failure [3]): The device’s output becomes constant, and the device does not respond to the user’s input.

• Self-shutdown (silent failure [3]): The device shuts itself down, and no service is delivered at the user interface.

• Unstable behavior (erratic failure [3]): The device exhibits erratic behavior without any input from the user, e.g., backlight flashing and self-activation of applications.

• Output failure (value failure [5]): The device, in response to an input sequence, delivers an output sequence that deviates from the expected one. Examples include inaccuracy in the charge indicator, ring or music volume different from the configured one, and event reminders going off at wrong times.

• Input failure (omission value failure [5]): User inputs have no effect on device behavior, e.g., soft keys do not work.

User-Initiated Recovery. User-initiated actions to recover from a device failure can be classified according to the following categories:

• Repeat the action: Repeating the action is sometimes sufficient to get the phone working properly, i.e., the problem was transient.

• Wait an amount of time: Often it is enough to wait for a certain amount of time (the exact amount is not reported by users) to let the device deliver the expected service.

• Reboot (power cycle or reset): The user turns off the device and then turns it on to restore correct operation (a temporarily corrupted state is cleaned up by the reboot).

• Remove battery: Battery removal is mainly performed when the phone freezes. In this case, the phone often does not respond to the power on/off button. Battery removal can clean up a permanently corrupted state (e.g., due to a user’s customized settings).

• Service the phone: The user has to bring the phone to a service center for assistance. Often, when the failure is firmware-related, the recovery consists of either a master reset (all the settings are reset to the factory settings and the user’s content is removed from the memory) or a firmware update, i.e., uploading a new version of the firmware. Hardware problems are fixed by substituting malfunctioning components (e.g., the screen or the keypad) or replacing the entire device with a new one.

If a failure report does not contain any information about the recovery, we classify the recovery as unreported.

4 It is possible that other failure categories, not present in the analyzed logs, exist.

Failure Severity. In introducing failure severity, this study takes the user perspective and defines severity levels corresponding to the difficulty of the recovery action(s).

• High: A failure is considered to be highly severe when recovery requires the assistance of service personnel.

• Medium: A failure is considered to be of medium severity when the recovery requires reboot or battery removal.

• Low: A failure is considered to be of low severity if the device operation can be reestablished by repeating the action or waiting for an amount of time.
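The severity definition above is effectively a lookup from recovery action to severity level. A minimal Python sketch of that mapping follows; the string labels and the function name `severity` are chosen for illustration, and returning `None` for reports without recovery information is an assumption of this sketch, since the text assigns no severity to unreported recoveries.

```python
# Severity level for each user-initiated recovery action.
SEVERITY_BY_RECOVERY = {
    "repeat": "low",
    "wait": "low",
    "reboot": "medium",
    "battery removal": "medium",
    "service phone": "high",
}


def severity(recovery_action):
    """Return the severity level implied by a recovery action.

    Unreported (or unknown) recovery actions yield None.
    """
    return SEVERITY_BY_RECOVERY.get(recovery_action)


print(severity("battery removal"))
```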

4.1 Reports analysis

The results discussed in this section are obtained from the analysis of failure reports posted between January 2003 and March 2006. A total of 533 reports are used in this study. Phone models from all major vendors are present: Motorola, Nokia, Samsung, Sony Ericsson, and LG, as well as Kyocera, Audiovox, HP, BlackBerry, Handspring, and Danger. Smart phones account for 22.3% of the failure reports, although they represented only 6.3% of the market share in 2005. We attribute this to the fact that smart phones: (i) have a more complex architecture than voice-centric or rich-experience mobile phones and (ii) are open for users to download and install third-party applications and/or develop their own applications.

Note that not all considered phones are Symbian-based smart phones. Consequently, while the discussion in this section provides a high-level characterization of phone failures, the reported figures may differ from the results given in section 6, which discusses failure data collected by the logger software run on the Symbian-based smart phones. Nevertheless, these considerations do not change our conclusions, since the purpose of this preliminary study is to gain an initial understanding of the observed phenomena, rather than conducting a detailed failure analysis.

Table 1. Failure frequency distribution with respect to failure types and recovery actions; the numbers are percentages of the total number of failures

| Failure type | unreported | repeat | wait | battery removal | reboot | service phone |
| --- | --- | --- | --- | --- | --- | --- |
| freeze | 6.01 | 0 | 4.29 | 9.01 | 2.36 | 3.65 |
| input failure | 0.86 | 0.64 | 0 | 0.21 | 0.64 | 0.64 |
| output failure | 13.73 | 5.79 | 0.64 | 0.43 | 8.80 | 6.87 |
| self-shutdown | 7.73 | 0 | 0.43 | 2.15 | 0 | 6.65 |
| unstable behavior | 8.80 | 0.64 | 0.21 | 0.21 | 1.72 | 6.87 |

The most frequent failure type is output failure (36.3%), followed by freeze (25.3%), unstable behavior (18.5%), self-shutdown (16.9%), and input failure (3%). Despite their high occurrence, output failures are often of low severity, since repeating the action is often sufficient to restore correct device operation (5.8%; see Table 1). On the other hand, self-shutdown and unstable behavior can be considered high-severity failures, because they are effectively recovered by servicing the phone or removing the battery. Phone freezes are usually of medium severity, since a reboot (2.4% of the total number of failures; see Table 1) or battery removal (9.0%; see Table 1) usually reestablishes proper operation. Only in about 3.7% of the total number of failures (see Table 1) does a freeze require the user to seek assistance.

To gain an understanding of the relationship between failure types and recovery actions, Table 1 reports the failure distribution with respect to failure types and corresponding recovery actions. From the recovery action perspective, it should be noted that reboots are an effective way to recover from output failures (8.8% of the total number of failures). This indicates that output failures are often due to a temporarily corrupted software state, which is cleaned up by the reboot. This is also confirmed by the fact that repeating the action is often sufficient to restore correct device operation. Freezes are usually recovered by pulling out the battery (9.01%), although a significant number of them (4.29%) are recovered by simply waiting for the phone to respond. This may indicate that a certain fraction of battery removals and reboots in response to freezes are due to impatient users. In general, this leads us to observe that freezes are more annoying than output failures, for which the user seldom needs to pull out the battery.
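The per-type percentages quoted above are the row sums of Table 1, and the column sums give the share of each recovery action. The following Python sketch recomputes both from the table entries; the dictionary layout and variable names are illustrative choices, while the numbers are those of Table 1. The row sums reproduce the quoted figures up to rounding.

```python
RECOVERY_ACTIONS = ["unreported", "repeat", "wait", "battery removal",
                    "reboot", "service phone"]

# Percentages of all failure reports, copied from Table 1
# (one list per failure type, ordered as RECOVERY_ACTIONS).
TABLE_1 = {
    "freeze":            [6.01, 0.00, 4.29, 9.01, 2.36, 3.65],
    "output failure":    [13.73, 5.79, 0.64, 0.43, 8.80, 6.87],
    "unstable behavior": [8.80, 0.64, 0.21, 0.21, 1.72, 6.87],
    "self-shutdown":     [7.73, 0.00, 0.43, 2.15, 0.00, 6.65],
    "input failure":     [0.86, 0.64, 0.00, 0.21, 0.64, 0.64],
}

# Row sums: share of each failure type among all reports.
type_share = {ftype: sum(row) for ftype, row in TABLE_1.items()}

# Column sums: share of each recovery action among all reports.
action_share = {
    action: sum(row[i] for row in TABLE_1.values())
    for i, action in enumerate(RECOVERY_ACTIONS)
}

for ftype, share in sorted(type_share.items(), key=lambda kv: -kv[1]):
    print(f"{ftype:18s} {share:6.2f}%")
```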

The analyzed data also allows correlating failure occurrences with the user activity at the time of the failure. In particular, 13% of failures occur during voice calls, 5.4% while creating/sending/receiving text messages, 3.6% while using Bluetooth, and 2.4% when manipulating images. Finally, several reports (presumably from more sophisticated users) provide insight into the failure causes, e.g., there are indications of memory leaks, incorrect use of the device resources, bad handling (by the software) of indexes/pointers to objects, and incorrect management of buffer sizes.

5 Data Collection

In order to gain an in-depth understanding of the failure behavior of handheld devices, we developed a failure data logger for Symbian-based smart phones. The logger enables: (i) recording the occurrences of user-perceived application/system failures and (ii) associating high-level failure events with the low-level error conditions signaled by applications and system modules in the form of panics. The collected data provides a basis for analyzing the low-level causes of failures observed by users. Towards this, it is important to record the phone status at the time of failure. For example, when a phone freezes while a text message is being received, the stored failure data should enable answering the following questions:

1. Was the text message received despite the failure?

2. Did any user/system module fail?

3. What other applications were running on the device at the time of the failure?

In order to address these questions, it is necessary to relate the failure (the freeze event in our example) with the phone activity/status at the time of the failure and with a panic event, which can be signaled by application or system modules.

In this study, we focus on freeze and self-shutdown failures, since they can be relatively easily detected without human intervention. The automated detection of value and erratic failures (output failures, input failures, and unstable behavior identified in the previous section) requires the implementation of a perfect observer, which has complete knowledge of the system specification [5]. An alternative could be to involve the user in the detection process, by asking him/her to report the occurrence of a value or erratic failure. However, as our experience with the analysis of Bluetooth failures shows [6], users are quite unreliable and often neglect or forget to post the required information, thus biasing the results. While this approach can be considered acceptable for an initial evaluation, as discussed in section 4, it becomes too unreliable for a more detailed analysis. Regardless of its limited scope, the study of freezes and self-shutdowns enables us to gain valuable insight into the failure behavior of Symbian-based smart phones.

[Figure 1: diagram of the logger's active objects (Heartbeat, Running Applications Detector, Log Engine, Power Manager, Panic Detector) and the files they write (beats, runapp, activity, power, Log File).]

Figure 1. Overall architecture

5.1 Failure Logger Architecture

The high-level architecture of the failure data logger is shown in Figure 1. The logger is implemented as a daemon application that starts at phone start-up time and executes in the background. It consists of a set of Active Objects (AOs) responsible for the following tasks:

• Heartbeat: in charge of detecting both freezes and self-shutdowns (the next subsection provides more details on the heartbeat active object).

• Running Applications Detector: periodically stores (in the runapp file) the list of IDs of the applications running on the phone. The list is obtained from the Application Architecture Server.

• Log Engine: collects the smart phone activity (e.g., calls, messages, and web browsing). The information is gathered from the Database Log Server and stored into the activity file.

• Power Manager: provides information about the battery status and enables differentiating self-shutdowns due to failures from those due to low battery. The battery status is gathered from the System Agent Server and stored into the power file.

• Panic Detector: collects panic events as soon as they are notified. In order to gather panic-related information (panic category and type), the Panic Detector exploits services provided by the RDebug object in the Symbian OS Kernel Server. The Panic Detector is also responsible for collecting data produced by the other active objects into a single Log File. This operation is performed either when a panic is detected or when the logger application starts (i.e., when the phone starts).

5.2 Detection mechanisms

Detection of freezes and self-shutdowns is accomplished by means of the heartbeat technique, a well-known approach for crash detection. The Heartbeat AO periodically writes heartbeat events to the beats file. During normal execution, the Heartbeat writes an ALIVE event. When a shutdown is performed, the Heartbeat writes a REBOOT event. Note that before the phone reboots, the Symbian OS allows applications to complete their tasks. This is sufficient for the Heartbeat to record the REBOOT event. When the user deliberately turns off the logger application, a MAOFF (Manual OFF) event is written to the log file. Finally, if a shutdown is due to low battery (the battery status is requested from the Power Manager), a LOWBT (LOW BaTtery) event is written.

When the phone is turned on and the logger starts, the Panic Detector checks the last event logged by the Heartbeat. An ALIVE event indicates that the phone was shut down by pulling out the battery, since in all other cases (i.e., a shutdown due to low battery, the user, or the kernel) the Heartbeat would have logged a REBOOT or LOWBT event. This in turn means that the phone was frozen, which is consistent with the fact that pulling out the battery is the only reasonable user-initiated recovery action for a freeze. Therefore, a freeze is recorded by the Panic Detector, along with the information gathered by the Log Engine and the Running Applications Detector. On the other hand, a REBOOT event can be logged because either the phone rebooted itself or it was rebooted by the user. Hence, it becomes important to distinguish the two cases.
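The start-up check described above amounts to a decision on the last heartbeat event. The Python sketch below illustrates that decision; it is not the logger's actual code. The event names come from the text, whereas the function names, the returned labels, and the assumption that the beats file holds one event name per line are made up for this example.

```python
from enum import Enum


class Beat(Enum):
    ALIVE = "ALIVE"    # periodic heartbeat during normal execution
    REBOOT = "REBOOT"  # orderly shutdown (self-shutdown or user-initiated)
    MAOFF = "MAOFF"    # the user manually turned the logger off
    LOWBT = "LOWBT"    # shutdown caused by low battery


def last_event_in(beats_lines):
    """Return the last event of a beats file given as a list of lines."""
    return Beat(beats_lines[-1].strip())


def classify_last_event(last_event):
    """Interpret the last heartbeat event found when the logger restarts."""
    if last_event is Beat.ALIVE:
        # No orderly shutdown was recorded: the battery was pulled,
        # which is taken as evidence of a freeze.
        return "freeze"
    if last_event is Beat.REBOOT:
        # Self-shutdown or user shutdown; the event alone cannot tell.
        return "shutdown (self or user)"
    if last_event is Beat.LOWBT:
        return "low-battery shutdown"
    if last_event is Beat.MAOFF:
        return "logger turned off by user"
    raise ValueError(f"unknown heartbeat event: {last_event!r}")


beats = ["ALIVE", "ALIVE", "ALIVE"]
print(classify_last_event(last_event_in(beats)))
```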

More details on the logger, including the tuning of the heartbeat frequency and the description of the software infrastructure for automated transfer of Log Files from the phones used in this study, can be found in [1].

6 Experimental Results

This section reports results from the analysis of failure data collected over a period of 14 months from 25 phones, which run Symbian OS versions 6.1 to 8.0 or version 9.0. The majority of phones use Symbian version 8.0, the most popular on the market at the time the analysis started. The targeted phones belong to students, researchers, and professors from both Italy and the USA. The phones have the logger installed and have been under normal use during the period of the experiment.

Self-shutdowns Identification. As a first step in the failure data analysis, we isolate the self-shutdowns from the user-triggered shutdowns. Unfortunately, it is not possible to automatically distinguish the two types of shutdowns, because the generated event (i.e., the one captured by the Heartbeat AO) is the same in both cases. We discriminate between these two events by examining the phone off-time (or reboot duration) recorded by the Panic Detector. Figure 2 shows the distribution of reboot durations. The histogram includes all recorded shutdown events (1778 events). Two local maxima can be noticed in the figure: the first for reboot durations shorter than 500 s, which corresponds to self-shutdowns, and the second around 30000 seconds (about eight hours and 20 minutes), corresponding to the phone off-time during the night, when users usually turn off their phones. The inner histogram zooms in on the data around the first local maximum (reboot durations of less than 500 seconds) and shows a peak around 80 seconds, which corresponds to the median self-shutdown duration. Note that the number of events approaches zero for durations longer than 360 seconds. We filtered out all shutdown events with durations longer than 360 seconds. The remaining events are assumed to be self-shutdown events (471 events or 24.2% of the overall data set).

[Figure 2: histogram of reboot durations; x-axis: reboot duration (s); y-axis: % shutdown events; inset: durations < 500 s.]

Figure 2. Distribution of reboot durations; the inner histogram zooms in on the outer one for durations < 500 s
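The filtering step can be expressed as a simple threshold on the phone off-time. The Python sketch below shows the idea; the 360-second threshold is the one stated above, while the function name and the sample off-times are hypothetical values invented for this example.

```python
SELF_SHUTDOWN_MAX_OFF_TIME_S = 360  # threshold taken from the text


def is_self_shutdown(off_time_s, threshold_s=SELF_SHUTDOWN_MAX_OFF_TIME_S):
    """Treat a REBOOT event as a self-shutdown if the phone came back quickly."""
    return off_time_s <= threshold_s


# Hypothetical off-times in seconds, invented for this example.
off_times = [75, 82, 140, 400, 29500, 31000]
self_shutdowns = [t for t in off_times if is_self_shutdown(t)]
print(self_shutdowns)
```

A fixed threshold inevitably misclassifies a user who switches the phone off and on again within a few minutes; the text accepts this by stating that the remaining events are assumed to be self-shutdowns.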

Freezes and Self-shutdowns. A total of 360 freezes and 471 self-shutdowns are reported by the logger. Based on this data we estimate the Mean Time Between Freezes (MTBFr) and the Mean Time Between Self-shutdowns (MTBS), in terms of wall-clock hours, averaged per single phone. The results show an MTBFr of 313 hours and an MTBS of 250 hours. Hence, on average, a user experiences a phone freeze about every 13 days and a self-shutdown about every 10 days. These figures give an overall idea of the user-perceived dependability of today’s mobile phones. While these values are acceptable for everyday dependability requirements [16], they indicate potential limitations in using smart phones for critical applications.
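As an illustration of how such per-phone averages could be obtained, the sketch below computes a mean time between events from per-phone observation times and event counts, and converts the reported hour values into days. The exact averaging procedure is not spelled out in the text, so the per-phone averaging shown here, the function names, and the sample inputs are all assumptions.

```python
def mean_time_between_events(observed_hours, event_counts):
    """Average, over phones, of observed wall-clock hours per event.

    Phones with no recorded event are skipped (an assumption of this sketch).
    """
    per_phone = [h / n for h, n in zip(observed_hours, event_counts) if n > 0]
    return sum(per_phone) / len(per_phone)


def hours_to_days(hours):
    """Convert wall-clock hours to days."""
    return hours / 24.0


# Hypothetical observation windows (hours) and freeze counts for three phones.
hours = [2400.0, 4800.0, 1200.0]
freezes = [8, 12, 0]
mtbfr_example = mean_time_between_events(hours, freezes)

# Conversion of the values reported in the text.
mtbfr_days = hours_to_days(313)  # about 13 days
mtbs_days = hours_to_days(250)   # about 10 days
print(f"example MTBFr: {mtbfr_example:.0f} h")
print(f"reported MTBFr: {mtbfr_days:.1f} days, reported MTBS: {mtbs_days:.1f} days")
```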

Captured Panic Events. Table 2 reports on the panic events recorded during the experiment. The panics are classified according to their categories and types. The table also gives the relative frequency (with respect to the total number of panics) of occurrences of the different panic types. In addition, a brief description (extracted from the Symbian OS documentation) of each panic is given.

The data on panic events provides an overall insight into the software defects that lead to application/system failures. The most frequent panics are due to access violations caused, for example, by dereferencing null pointers. In this case the Symbian kernel executive terminates the offending application and signals a KERN-EXEC type 3 panic. Other frequent panic causes include invalid object indexes (KERN-EXEC type 0 panic), runtime errors related to heap management (causing E32USER-CBase panics), and copy operations causing a descriptor to exceed its maximum length (USER type 11 panic). These findings are consistent with our observations from the analysis of failure data reported in the public web forums and discussed earlier in this paper.

Further analysis of panic events reveals that in many cases (25%) a cascade of more than one panic event is recorded in the logs (see Figure 3). Since generating a panic is the last operation performed by an application or a system module (immediately afterwards, the application is terminated by the kernel), multiple panic events in short succession indicate error propagation within the operating system. The observable consequence of this phenomenon is the termination of multiple applications.

Table 2. Collected panic events

Panic            Type   %       Meaning
KERN-EXEC        0      6.31    This panic is raised when the Kernel Executive cannot find an object in the object index for the current process or current thread using the specified object index number (the raw handle number).
KERN-EXEC        3      56.31   This panic is raised when an unhandled exception occurs. Exceptions have many causes, but the most common are access violations caused, for example, by dereferencing NULL. Among other possible causes are: general protection faults, executing an invalid instruction, alignment checks, etc.
KERN-EXEC        15     0.51    This panic is raised when a timer event is requested from an asynchronous timer service, an RTimer, and a timer event is already outstanding. It is caused by calling either the At(), After() or Lock() member functions after a previous call to any of these functions but before the timer event requested by those functions has completed.
E32USER-CBase    33     5.56    Raised by the destructor of a CObject. It is caused if an attempt is made to delete the CObject when the reference count is not zero.
E32USER-CBase    46     0.76    This panic is raised by an active scheduler, a CActiveScheduler. It is caused by a stray signal.
E32USER-CBase    47     0.25    This panic is raised by the Error() virtual member function of an active scheduler, a CActiveScheduler. This function is called when an active object’s RunL() function leaves. Applications always replace the Error() function in a class derived from CActiveScheduler; the default behaviour provided by CActiveScheduler raises this panic.
E32USER-CBase    69     10.10   This panic is raised if no trap handler has been installed. In practice, this occurs if CTrapCleanup::New() has not been called before using the cleanup stack.
E32USER-CBase    91     0.51    Not documented
E32USER-CBase    92     0.76    Not documented
USER             10     1.52    This panic is raised when the position value passed to a 16-bit variant descriptor member function is out of bounds. It may be raised by the Left(), Right(), Mid(), Insert(), Delete() and Replace() member functions of TDes16.
USER             11     5.81    This panic is raised when any operation that moves or copies data to a 16-bit variant descriptor causes the length of that descriptor to exceed its maximum length. It may be caused by any of the copying, appending or formatting member functions and, specifically, by the Insert(), Replace(), Fill(), Fillz() and ZeroTerminate() descriptor member functions. It can also be caused by the SetLength() function.
USER             70     0.76    This panic is raised when attempting to complete a client/server request and the RMessagePtr is null.
KERN-SVR         0      0.25    This panic is raised by the Kernel Server when it attempts to close a Kernel object in response to an RHandleBase::Close() request. The panic occurs when the object represented by the handle cannot be found. The panic is also raised by the Kernel Server when it cannot find an object in the object index for the current process or current thread using the specified object index number (the raw handle number). The most likely cause is a corrupt handle.
ViewSrv          11     2.53    Occurs when one active object’s event handler monopolizes the thread’s active scheduler loop and the application’s ViewSrv active object cannot respond in time (the View Server monitors applications for activity/inactivity; if it thinks the application is in some kind of infinite loop state, it will close it. Clever use of Active Objects should help overcome this).
EIKON-LISTBOX    3      0.25    Occurs when using a listbox object from the eikon framework and no view is defined to display the object.
EIKON-LISTBOX    5      0.76    Occurs when using a listbox object from the eikon framework and an invalid Current Item Index is specified.
Phone.app        2      0.25    Not documented
EIKCOCTL         70     0.25    Corrupt edwin state for inlining editing
MSGS Client      3      6.31    Failed to write data into asynchronous call descriptor to be passed back to client
MMFAudioClient   4      0.25    Appears when the TInt value passed to SetVolume(TInt) is 10 or more
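The per-type frequencies of Table 2 can be aggregated by panic category to see which categories dominate. The sketch below is only an illustration: the tuples reproduce the type and percentage columns as we read them from Table 2, and the helper `share_by_category` is a hypothetical name.

```python
from collections import defaultdict

# (category, type, % of all panics), as read from Table 2.
TABLE2 = [
    ("KERN-EXEC", 0, 6.31), ("KERN-EXEC", 3, 56.31), ("KERN-EXEC", 15, 0.51),
    ("E32USER-CBase", 33, 5.56), ("E32USER-CBase", 46, 0.76),
    ("E32USER-CBase", 47, 0.25), ("E32USER-CBase", 69, 10.10),
    ("E32USER-CBase", 91, 0.51), ("E32USER-CBase", 92, 0.76),
    ("USER", 10, 1.52), ("USER", 11, 5.81), ("USER", 70, 0.76),
    ("KERN-SVR", 0, 0.25), ("ViewSrv", 11, 2.53),
    ("EIKON-LISTBOX", 3, 0.25), ("EIKON-LISTBOX", 5, 0.76),
    ("Phone.app", 2, 0.25), ("EIKCOCTL", 70, 0.25),
    ("MSGS Client", 3, 6.31), ("MMFAudioClient", 4, 0.25),
]


def share_by_category(rows):
    """Sum the per-type percentages for each panic category."""
    totals = defaultdict(float)
    for category, _panic_type, percent in rows:
        totals[category] += percent
    return {category: round(percent, 2) for category, percent in totals.items()}


shares = share_by_category(TABLE2)
for category, percent in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{category:15s} {percent:6.2f}")
```

With these values, KERN-EXEC type 3 alone accounts for about 56% of the panics and the E32USER-CBase types sum to about 18%, which matches the headline figures for access violations and heap management problems.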

Figure 3. Distribution of subsequent panics (x-axis: no. of subsequent panics; y-axis: % of panics)

Figure 4. Panics and HL events coalescence scheme (timeline showing panics, isolated panics, freezes, isolated freezes, self-shutdowns, and the temporal windows used for grouping)

Panics and High Level Events. From the collected data we can infer the relationship between panics and high-level (HL) events, e.g., freezes and self-shutdowns. To this end, we correlate panic events with freeze and self-shutdown events as depicted in Figure 4. When a panic is found in the Log File, we search for freeze and self-shutdown events within a predefined temporal window. As indicated in Figure 4, there can be panic events that do not relate to HL events, as well as isolated HL events. The temporal window for grouping the events must be carefully selected to avoid misinterpretation of the results. Analysis of the collected data shows that the number of coalesced events increases for window sizes up to five minutes. A further increase in the number of coalesced events is observed only for much larger temporal windows (of the order of hours), which indicates that the additionally coalesced events are most likely uncorrelated. For these reasons, we fix the temporal window size at five minutes.
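The coalescence procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the analysis code used in the study: events are represented as (timestamp in seconds, label) tuples, the window is assumed to open at the panic and extend forward in time, and the function name `coalesce` and the sample events are hypothetical.

```python
WINDOW_S = 5 * 60  # five-minute temporal window, as fixed in the text


def coalesce(panics, hl_events, window_s=WINDOW_S):
    """Relate each panic to the HL events that fall inside its window.

    panics:    list of (timestamp_s, category)
    hl_events: list of (timestamp_s, kind), kind being "freeze" or "self-shutdown"
    Returns (related, isolated_panics, isolated_hl).
    """
    related, isolated_panics, matched = [], [], set()
    for p_time, p_cat in panics:
        # Assumption: the window starts at the panic and extends forward.
        hits = [i for i, (t, _kind) in enumerate(hl_events)
                if 0 <= t - p_time <= window_s]
        if hits:
            matched.update(hits)
            related.append(((p_time, p_cat), [hl_events[i] for i in hits]))
        else:
            isolated_panics.append((p_time, p_cat))
    isolated_hl = [e for i, e in enumerate(hl_events) if i not in matched]
    return related, isolated_panics, isolated_hl


# Hypothetical log excerpt (timestamps in seconds).
panics = [(1000, "KERN-EXEC"), (5000, "USER"), (9000, "EIKCOCTL")]
hl_events = [(1100, "freeze"), (5400, "self-shutdown"), (20000, "freeze")]
related, isolated_panics, isolated_hl = coalesce(panics, hl_events)
print(len(related), "related;", len(isolated_panics), "isolated panics;",
      len(isolated_hl), "isolated HL events")
```

Re-running such a procedure with different `window_s` values is one way to observe how the number of coalesced events changes with the window size.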

Figure 5 shows the results of this coalescence procedure (including the distribution of isolated panics, i.e., those panics which cannot be related to any HL event; these panic events most likely relate to output failures, which our failure logger, in its current implementation, is not able to collect).

The results show that more than half of the recorded panics (51%) are related to HL events. Given the relatively small number of HL events (one every 11 days), these relationships cannot be just a coincidence. Furthermore, if we include all shutdown events recorded in the logs (hence about a 300% increase in the number of events, from 471 to 1778 shutdown events), the percentage of panics related to HL events increases to 55%, i.e., by only 4 percentage points. This also confirms our previous observation that the shutdown events which we filtered out from the data analysis are user-triggered shutdowns.

Figure 5a also shows panic categories (EIKON-LISTBOX, EIKCOCTL, MMFAudioClient, and KERN-SVR) which do not manifest as HL events. The first three are typical application panics, concerning the view or the audio streaming. This indicates good OS resilience with respect to application panics. More frequent system panics, such as KERN-EXEC, E32USER-CBase, USER, and ViewSrv, usually lead to an HL event. Depending on the component that caused the panic, (i) the phone can crash if the panic is raised by a critical system server, or (ii) the phone keeps working properly once the offending application is terminated by the kernel. As a further observation, there are panics, e.g., Phone.app and MSGS Client, which always cause a self-shutdown. These two panics correspond to core applications provided by the phone; hence, the OS kernel always reboots the phone if either of these applications fails.

Figure 5b details the relationship between specific panic events and HL events (freezes and self-shutdowns). The data enables identifying panic categories which are symptomatic of freezes, e.g., heap management (E32USER-CBase), USER, ViewSrv, and KERN-EXEC (type 0) panics. On the other hand, access violation-related panics (KERN-EXEC type 3) can trigger both phone freezes and self-shutdowns.

Figure 5. Panics and HL events: a) across all events, b) details with respect to freeze and self-shutdown events

Table 3. Panic-activity relationship (values in %)

                E32USER-CBase    KERN-EXEC       MSGS Client   Phone.app   USER    ViewSrv   All categ.
act. type       33      47       0       3       3             2           11      11
message         1.10    .        .       4.41    .             1.10        .       .         6.62
Voice call      6.62    1.10     .       17.3    .             .           9.56    4.04      38.6
unspecified     4.78    .        0.37    40.4    9.19          .           .       .         54.8

Figure 6. Distribution of the number of running applications at panic time (x-axis: no. of apps at panic time; y-axis: percentage)

Phone Activity at Panic Time. Table 3 reports the user activity at the time of the panic, in terms of voice calls and text messages (the only activities registered by the Symbian Database Log Server). Only panics which lead to an HL event are considered in this analysis. Interestingly, about 45% of panics are recorded when the user performs real-time activities, e.g., a voice call or sending/receiving a short message. This confirms our earlier observation (based on failure data from the web forums), which indicates the presence of interference between various applications/system modules. In other terms, this is also a symptom of insufficient isolation (to prevent error propagation) between real-time and time-sharing modules. Thus, more effort should be directed toward enhancing the isolation between the two types of system modules. Also, there are panics, such as USER and ViewSrv, which are triggered only while a voice call is in progress. Similarly, there are panics, e.g., Phone.app, which manifest only when a short message is sent/received.
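The observations above can be checked against the cell values of Table 3 with a short sketch. The numbers are the percentages as we read them from Table 3 (the extraction of the table may be imperfect, so they should be treated as an assumption), and the helper names are hypothetical.

```python
# activity -> {(panic category, type): % of panics leading to an HL event},
# as read from Table 3; empty cells are omitted.
TABLE3 = {
    "voice call": {("E32USER-CBase", 33): 6.62, ("E32USER-CBase", 47): 1.10,
                   ("KERN-EXEC", 3): 17.3, ("USER", 11): 9.56,
                   ("ViewSrv", 11): 4.04},
    "message": {("E32USER-CBase", 33): 1.10, ("KERN-EXEC", 3): 4.41,
                ("Phone.app", 2): 1.10},
    "unspecified": {("E32USER-CBase", 33): 4.78, ("KERN-EXEC", 0): 0.37,
                    ("KERN-EXEC", 3): 40.4, ("MSGS Client", 3): 9.19},
}
REAL_TIME = ("voice call", "message")


def activity_share(table):
    """Total percentage of panics recorded during each activity."""
    return {act: round(sum(cells.values()), 2) for act, cells in table.items()}


def exclusive_to(table, activity):
    """Panic categories recorded only during the given activity."""
    here = {cat for (cat, _type) in table[activity]}
    elsewhere = {cat for act, cells in table.items() if act != activity
                 for (cat, _type) in cells}
    return sorted(here - elsewhere)


shares = activity_share(TABLE3)
real_time_share = round(sum(shares[a] for a in REAL_TIME), 2)
print(f"real-time activities: {real_time_share:.1f}%")
print("only during voice calls:", exclusive_to(TABLE3, "voice call"))
print("only during messages:", exclusive_to(TABLE3, "message"))
```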

The Running Application Detector allowed us to collect the set of running applications at the time of the panic. It is interesting to notice that often only one user application is found to be running at panic time, as can be observed in Figure 6. This indicates, somewhat counterintuitively, that concurrent execution of multiple applications does not necessarily lead to more frequent panics.

Table 4 summarizes the panic-running applications relationship. Only cases with a significant percentage are taken into account, covering 53% of the total number of panics. The rows correspond to HL events and panic categories. The columns indicate applications which execute at the time of a panic. Numbers reported in every cell of the table represent percentages of the total number of panics, e.g., the Clock application is present in 3.2% of all recorded KERN-EXEC panics which lead to a freeze. Consistent with our findings from the web forums, the Message application is one of the main panic causes. Other potential dependability bottlenecks are the camera, the Bluetooth browsing tool, and the log of incoming/outgoing calls. The table also gives an insight into the applications which, even when panicking, do not cause HL events.

7 Conclusions and Lessons Learned

This work presented a measurement-based failure analysis of mobile phones. A dedicated logger has been implemented to gather failure-related information on Symbian-OS-based smart phones. Failure data has been collected from 25 phones over a period of 14 months. Key findings indicate that: (i) The majority of kernel exceptions are due to memory access violation errors and heap management problems (despite the adoption of the micro-kernel model in the Symbian design and the provision of advanced memory management facilities). This is consistent with our initial analysis of failure data on hand-held devices obtained from publicly available web forums, which pinpoints memory leaks as one of the main causes of failures. (ii) Similarly, analysis of data collected by the logger and data from the web forums shows that the majority of failures occur when the user performs real-time tasks, e.g., a voice call or sending/receiving a text message. This indicates the need to strengthen the isolation between interactive and real-time tasks. (iii) Users experience a failure (freeze or self-shutdown) every 11 days, on average. Since these figures are obtained from a single study, more data and further analysis are needed before generalizing the results.

Future effort will focus on: (i) conducting experiments on a larger set of phones, including other platforms, e.g., MS Windows, and (ii) enhancing the logging mechanism to enable capturing output failures (this may require involvement of users).

8 Acknowledgments

This work has been supported in part by the University of Naples Federico II - Ufficio Programmi Internazionali, by the Italian Ministry for Education, University, and Research (MIUR) in the framework of the PRIN Project “COMMUTA: Mutant hardware/software components for dynamically reconfigurable distributed systems”, and by the Motorola Corporation as part of the Motorola Center at the University of Illinois at Urbana-Champaign, USA. We also thank Paolo Ascione for excellent work on the implementation of the logger and Daniel Chen for help in the collection of the failure data.

Table 4. Panic-running applications relationship (rows: KERN-EXEC freeze, KERN-EXEC self-shutdown, MSGS Client, and, with no HL event, E32USER-CBase, EIKCOCTL, EIKON-LISTBOX, KERN-EXEC, USER, and ViewSrv, plus a Total row; columns: applications running at panic time, among them Messages, Log, Camera, Clock, Contacts, Telephone, battery, TomTom, FExplorer, and BT_Browser, alone or in combination; cells: percentages of the total number of panics)

References

[1] P. Ascione, M. Cinque, and D. Cotroneo. Automated Logging of Mobile Phones Failure Data. Proc. of the 9th IEEE International Symposium on Object-oriented Real-time Distributed Computing (ISORC 2006), April 2006.

[2] V. Astarita and M. Florian. The use of Mobile Phones in Traffic Management and Control. Proc. of the 2001 IEEE Intelligent Transportation Systems Conference, August 2001.

[3] A. Avizienis, J. Laprie, B. Randell, and C. Landwehr. Basic Concepts and Taxonomy of Dependable and Secure Computing. IEEE Transactions on Dependable and Secure Computing, 1(1):11–33, 2004.

[4] A. A. Aziz and R. Besar. Application of Mobile Phone in Medical Image Transmission. Proc. of the 4th National Conference on Telecommunication Technology, January 2003.

[5] A. Bondavalli and L. Simoncini. Failures Classification with Respect to Detection. Proc. of the 2nd IEEE Workshop on Future Trends in Distributed Computing Systems, 1990.

[6] M. Cinque, D. Cotroneo, and S. Russo. Collecting and Analyzing Failure Data of Bluetooth Personal Area Networks. Proc. of the 2006 International Conference on Dependable Systems and Networks (DSN’06), June 2006.

[7] W. Gu, Z. Kalbarczyk, R. K. Iyer, and Z. Yang. Characterization of Linux Kernel Behavior under Errors. Proc. of the 2003 International Conference on Dependable Systems and Networks (DSN’03), June 2003.

[8] R. Harrison. Symbian OS C++ for Mobile Phones Volume 2. Symbian Press, 2004.

[9] R. K. Iyer, Z. Kalbarczyk, and M. Kalyanakrishnam. Measurement-Based Analysis of Networked System Availability. Performance Evaluation Origins and Directions, Ed. G. Haring, Ch. Lindemann, M. Reiser, Lecture Notes in Computer Science 1769, Springer Verlag, 2000.

[10] T. Kubik and M. Sugisaka. Use of a Cellular Phone in mobile robot voice control. Proc. of the 40th SICE Annual Conference, July 2001.

[11] Y. Liang, Y. Zhang, A. Sivasubramaniam, R. K. Sahoo, and M. Jette. BlueGene/L Failure Analysis and Prediction Models. Proc. of the 2006 International Conference on Dependable Systems and Networks (DSN’06), June 2006.

[12] C. Lim. Drop Impact Study of Handheld Electronic Products. Proc. of the 5th International Symposium on Impact Engineering, July 2004.

[13] S. M. Matz, L. G. Votta, and M. Malkawi. Analysis of Failure Recovery Rates in a Wireless Telecommunication System. Proc. of the 2002 International Conference on Dependable Systems and Networks (DSN’02), June 2002.

[14] B. Schroeder and G. Gibson. A Large-Scale Study of Failures in High-Performance Computing Systems. Proc. of the IEEE International Conference on Dependable Systems and Networks (DSN 2006), June 2006.

[15] A. Sekman, A. B. Koku, and S. Z. Sabatto. Human Robot Interaction via Cellular Phones. Proc. of the 2003 IEEE Int. Conf. on Systems, Man and Cybernetics, October 2003.

[16] M. Shaw. Everyday Dependability for Everyday Needs. Proc. of the 13th IEEE International Symposium on Software Reliability Engineering, November 2002.

[17] D. P. Siewiorek, R. Chillarege, and Z. Kalbarczyk. Reflections on industry trends and experimental research in dependability. IEEE Transactions on Dependable and Secure Computing, 1(2), 2004.

[18] C. Simache and M. Kaaniche. Measurement-Based Availability Analysis of Unix Systems in a Distributed Environment. Proc. of the 12th International Symposium on Software Reliability Engineering (ISSRE’01), November 2001.

[19] C. Simache, M. Kaaniche, and A. Saidane. Event Log based Dependability Analysis of Windows NT and 2K Systems. Proc. of the 2002 Pacific Rim International Symposium on Dependable Computing (PRDC’02), December 2002.

[20] J. Xu, Z. Kalbarczyk, and R. K. Iyer. Networked Windows NT System Field Data Analysis. Proc. of the 1999 Pacific Rim International Symposium on Dependable Computing (PRDC’99), December 1999.