User Action Framework

Heuristic evaluation: Comparing ways of nding and reportingusability problems

Ebba Thora Hvannberg a,*, Ee Lai-Chong Law b, Marta Kristn Larusdottir c

a University of Iceland, Hjardarhaga 2-6, 107 Reykjavik, Icelandb Eidgenossische Technische Hochschule Zurich Gloriastrasse 35, CH-8902, Zurich, Switzerland

c Reykjavik University, Ofanleiti 2, 103 Reykjavik, Iceland

Received 28 June 2005; received in revised form 29 August 2006; accepted 11 October 2006Available online 1 December 2006

Abstract

Research on heuristic evaluation in recent years has focused on improving its eectiveness and eciency with respect to user testing.The aim of this paper is to rene a research agenda for comparing and contrasting evaluation methods. To reach this goal, a frameworkis presented to evaluate the eectiveness of dierent types of support for structured usability problem reporting. This paper reports on anempirical study of this framework that compares two sets of heuristics, Nielsens heuristics and the cognitive principles of Gerhardt-Pow-als, and two media of reporting a usability problem, i.e. either using a web tool or paper. The study found that there were no signicantdierences between any of the four groups in eectiveness, eciency and inter-evaluator reliability. A more signicant contribution ofthis research is that the framework used for the experiments proved successful and should be reusable by other researchers because of itsthorough structure. 2006 Elsevier B.V. All rights reserved.

Keywords: User interface; Heuristic evaluation; Reporting; Web tool; Eectiveness; Eciency; Comparison framework

1. Introduction

Since the early 1990s, researchers have carried out stud-ies comparing and contrasting some of the methodsbrought forward to uncover usability problems of interac-tive computer systems, (Desurvire et al., 1992; Holzinger,2005; Jeries et al., 1991; Karat et al., 1992). Currentresearch on usability evaluation clearly searches for meth-ods that produce benecial results for users and developersalike at low cost in an ever-increasing competitive industry.

In this paper, we report a case study of a framework forvalidating the use of usability evaluation methods and ofproblem registration tools and other support for enablingstructured usability problem reporting. The focus is onrening a research agenda for comparing and contrastingevaluation methods. In the case study, the number andthe seriousness of problems found per evaluator in heuris-tic evaluation, with two dierent sets of usability heuristics:Nielsens heuristics and the cognitive principles of Ger-hardt-Powals (Gerhardt-Powals, 1996), were compared.Furthermore, two dierent ways of reporting usabilityproblems, on paper and with the help of a web tool, arecompared to the results found in user testing. The casestudy thus serves as an example of how the frameworkcan be used.

1.1. Tool vs. paper

In our previous empirical studies on heuristic evaluation(Law and Hvannberg, 2004a), evaluators complained that

0953-5438/$ - see front matter 2006 Elsevier B.V. All rights reserved.doi:10.1016/j.intcom.2006.10.001

Abbreviations: AE, actual eciency; DV, dependent variable; HE, heu-ristic evaluation; ICT, Information and Communication Technology; IV,independent variable; PUP, predicted usability problems; SUPEX, Struc-tured Usability Problem EXtraction; SUS, System Usability Scale; UAF,User Action Framework; UP, usability problem; UT, user test.* Corresponding author. Tel.: +354 525 4702; fax: +354 525 4937.E-mail addresses: [email protected] (E.T. Hvannberg), [email protected]

(E.L.-C. Law), [email protected] (M.K. Larusdottir).

www.elsevier.com/locate/intcom

Interacting with Computers 19 (2007) 225240

reporting problems on paper was cumbersome and time-consuming. This motivated us to attempt to improve theevaluation method by providing evaluators with a webtool. We wanted to improve the problem descriptions withfurther characterization of usability problems, such as con-text, cause, severity and relevant usability heuristics. Theultimate goal is to advance the validity of predictive meth-ods, such as heuristic evaluation, with respect to user test-ing. In other words, to try to predict serious problems sothat they are corrected in revisions of the user interfaceand to try to minimize the number of serious problemsfalsely predicted. We can reason that reporting usabilityproblems using a software tool may help due to the follow-ing qualities:

More accessible explanation of usability heuristics alongwith concrete examples.

Easier to search, review, modify problem descriptionsand link to relevant material (Gladwell, 2002).

Faster entry of usability problems, thus making it moreecient.

In addition, the following may improve immediate man-agement of usability problems:

Merging of problem sets from dierent evaluators to geta unique set of problems.

Measuring reliability of usability problems, i.e. whetherone or more evaluators report them.

Prioritizing usability problems according to impact, e.g.severity and cost of removal; tracking them throughrevisions, etc.

Locating problematic contexts or tasks in the applica-tion, followed by designing task scenarios for user teststo evaluate these problems.

Associating problems with previously proven patterns ofsolutions.

On the other hand, a software tool can have a negativeeect on the usability problem reporting:

Switching back and forth between the application beingtested and the software tool can decrease the sense ofcontext in the application for the evaluator. On theother hand, users frequently use a number of softwareapplications in their work so such context-switching iscommon.

More noise in the problem descriptions because of easyreporting, i.e. more False Alarms.

Bias towards certain values in classication because ofdefault values or order of values presented in menus.

So far, only a few software tools have been developed tosupport usability problem analysis, classication andreporting. A tool to assist evaluators with cognitive walk-through and to record the results has been shown toimprove the evaluation process over paper walkthrough

(Rieman et al., 1991). The User Action Framework(UAF) (Andre et al., 2001) has a number of tools, includ-ing a usability design guide to be used during interactiondesign and usability lab testing, usability problem inspectorto be used during formative evaluation, usability problemclassier and usability problem database. This databasecan provide valuable input to project management, solu-tions to problems, guidelines and relevant on-line litera-ture. Emerging holistic frameworks, such as UAF, whichcan potentially support evaluators in using more thanone method, can prepare them with appropriate training,help them with analysis, problem tracking and manage-ment. To the best of our knowledge, there has not beenany empirical study, which compares the eectiveness ofpaper-based vs. tool-based usability problem reporting.

1.2. Nielsen vs. Gerhardt-Powals

Numerous sets of heuristics can be applied during heu-ristic evaluation (Folmer and Bosch, 2004). Many of themhave common factors, such as consistency, task match,appropriate visual presentation, user control, memory-loadreduction, error handling and guidance and support. Niel-sens heuristics have resulted from studies of practicalapplications in various contexts (Nielsen, 1993; Nielsenand Molich, 1990) but there is a lack of a sound theoreticalframework to explain how they work. The less well-knownprinciples, which were put forward by Gerhardt-Powals(1996), and are based on situation awareness and cognitivetheory, have proven useful in a dynamic application suchas anti-submarine warfare. Nielsens heuristics are synthe-sized from a number of guidelines. Alternative sets ofusability heuristics have been developed and tested. Mostof the heuristics are design guidelines and refer to the sys-tems user interface, and only a few heuristic sets are basedon the understanding of user cognition (cf. Normans the-ory of action model (Norman, 1986)) or situation aware-ness. Gerhardt-Powals put forward a set of guidelinesbased on cognitive principles. These guidelines are basedon theory, but have not been adequately evaluated in prac-tice. Specically, we hypothesize that principles derivedfrom cognitive engineering, which are strongly rooted intheories of cognitive psychology and other related disci-plines, can well serve as a promising tool for heuristicevaluation.

Several factors, besides the set of usability heuristics,can inuence the performance of heuristic evaluation,such as evaluator training, evaluator knowledge of theapplication domain, task coverage, problem extraction/description, merging, etc. A number of attempts havebeen carried out to improve heuristic evaluation, includ-ing the inspective, descriptive and analytical part. Toimprove the descriptive and the analytical part, Cocktonand Woolrych (2001) have suggested a problem-reportingform and a way to analyse more accurately researchprocedures which can accurately count the number ofproblems discovered.

226 E.T. Hvannberg et al. / Interacting with Computers 19 (2007) 225240

Evaluators selection of tasks is an important part of auser test. Model-based evaluation has supported this inpart, for example, by letting a task model guide the evalu-ator through the application. As most surveys (Rosenbaumet al., 2000) on usability methods indicate, practitionersapply more than one method, some of which are orthogo-nal to one another, such as paper prototyping and heuristicevaluation, and some of which complement one anotherlike expert reviews and user testing. The discount usabilityengineering method (Nielsen, 1993, p. 17) uses four tech-niques: user and task observation, scenarios, simpliedthinking aloud and heuristic evaluation. Nielsen (1994b,p. 58) suggests that there are two major reasons for alter-nating between heuristic evaluation and user testing. First,heuristic evaluation does not require users, who can behard to get, and can be used initially to rinse out a numberof usability problems. Second, a number of studies haveshown that the two methods, heuristic evaluation and usertesting, nd distinct sets of problems. Frkjr andLarusdottir (1999) conducted an empirical study, whichshowed that performing heuristic evaluation (HE) priorto user tests on the same system could somehow helpnon-expert evaluators uncover more usability problems inuser testing, especially the severe ones. Such observationwas derived from comparing the results of the two usertests, which were performed with and without doing HEbeforehand. Besides, this combination of evaluation meth-ods could eliminate one of the most important weaknessesof HE when used by non-experts, the proneness to address-ing many false problems.

1.3. Task selection

One reason for a large number of predicted problemsbeing False Alarms, i.e. not conrmed in user testing, isthat users may not have been instructed to carry out theappropriate tasks and their set of tasks covered dierentcontexts from the contexts that evaluators visited duringheuristic evaluation. Thus, the issue of coverage of theapplication is of concern, when applying user tests to vali-date heuristic evaluation by comparing the respective listsof usability problems generated by these two UEMs(Usability Evaluation Methods). In heuristic evaluation,the portion of the application covered is up to the evalua-tor, especially if he/she is given total freedom of whataspects to cover. In a study of discovery methods, Cocktonet al. (2003b) report that most evaluators choose to use sys-tem-searching or system-scanning more often than goal-playing and method-following. In a think-aloud user-test-ing, the coverage is inuenced by the set of tasks presentedto the user. Comparing two independent usability problemsets from two UEMs has drawbacks. An evaluator maypredict problems that are miscoded as False Alarms sincea user never had an opportunity to see them during usertesting because he or she did not visit that part of the appli-cation (Cockton and Woolrych, 2001; Cockton et al.,2003b, 2004; Woolrych et al., 2004). Cockton and Wool-

rych (2001) suggest to systematically derive task sets foruser testing from the initial set of predicted problems iden-tied in heuristic evaluation so as to increase the power ofuser testing for exposing all predicted problems that reallyexist, i.e. eliminating instances of genuine False Alarms.Another reason for the small overlap of predicted and actu-al usability problems is that the predicted problems statecauses of problems but usability problems reported duringuser testing are frequently described as eects on the users(Cockton and Woolrych, 2001; Doubleday et al., 1997).

In view of the above discussion, we put forward in thispaper the following research questions:

1.3.1. Tool vs. Paper

Do we achieve benets in increased eectiveness, e-ciency and inter-evaluator reliability in terms of a highernumber of real usability problems in a shorter period byusing a software tool to report predicted problems overusing paper?

1.3.2. Nielsen vs. Gerhardt-Powals

Are Gerhardt-Powals cognitive engineering principlesmore eective than Nielsens usability heuristics in enablingevaluators to identify a higher number of real usabilityproblems (higher validity) in a shorter time (highereciency)?

1.3.3. Task selection

Can the validity of heuristic evaluation be increased byusing the set of usability problems so identied to guidethe task selection of user tests whose results are in turn usedto validate the outcomes of heuristic evaluation?

2. Materials and methods

Two experiments are described in this paper. A web por-tal called EducaNext (www.educanext.org) was evaluatedin both experiments. In the rst experiment, the portalwas evaluated with heuristic evaluation with two sets ofusability heuristics and two ways of reporting usabilityproblems. In a two by two between-subject experimentaldesign involving ve evaluators in each of four cells,labelled AD, we collected qualitative and quantitativedata of predicted usability problems discovered duringheuristic evaluation. The four groups are A, C (Nielsenheuristics), B, D (Gerhardt-Powals principles), where Aand B used paper forms and C and D used the web toolfor problem reporting.

Next, 8 task scenarios were designed based on the resultsfrom heuristic evaluation. In the second experiment, Edu-caNext was evaluated in user tests with 10 participantssolving the task scenarios to check how many of the pre-dicted usability problems (PUP) reported during heuristicevaluation were experienced by users as real problems(UP). We describe in detail how the predicted usabilityproblems are ltered and matched with the usability prob-lems discovered in the user test. The task selection process

E.T. Hvannberg et al. / Interacting with Computers 19 (2007) 225240 227

for the user test is also described. The two experiments aredescribed subsequently and an overview of the workow isgiven in Fig. 1.

2.1. Heuristic evaluation

Members of higher education, research organizationsand professional communities share, retrieve and reuselearning resources in a web portal, called EducaNext. Edu-caNext fosters collaboration among educators andresearchers, allowing them to participate in knowledgecommunities; communicate with experts in their eld; toexchange learning resources; to work together on the pro-duction of educational material e.g. textbooks, lecturenotes, case studies and simulations; to deliver distributededucational activities: lectures, courses, workshops andcase study discussions and to distribute electronic contentunder license. Nineteen Computer Science students in theirnal year of their BS-degree studies and one BS Computer

Science graduate evaluated EducaNext using heuristic eval-uation (HE). They had good knowledge of usability evalu-ation but little practice. Ten evaluators evaluated the portalusing heuristics from Nielsen (1993), and 10 using heuris-tics from Gerhardt-Powals (1996) as a basis for the heuris-tic evaluation. Furthermore, usability problems werereported in two ways, 10 evaluators used a paper formand 10 used a web tool specially made for reporting prob-lems for heuristic evaluation. Hence, there were ve evalu-ators in each of the four groups, as seen in Table 1. Insummary, two independent variables are Medium ofReporting (Paper vs. Tool) and Set of Heuristics (Nielsenvs. Gerhardt-Powals). A 2 2 between-subject factorialdesign was employed.

All evaluators were asked to evaluate EducaNext inde-pendently. They received email containing a checklist ofthe activities they were asked to perform. The instructionsincluded a pre-evaluation questionnaire, a post-evaluationquestionnaire, and an introduction material in a digital

Fig. 1. Workow of experiment.


audio-le and slides including: (a) guidelines for the proce-dure of the evaluation; (b) introduction to the heuristicevaluation; (c) introduction to the EducaNext system and(d) introduction to the reporting of the usability problems.In the post-evaluation questionnaire, evaluators wereasked to report the time spent, in hours, on the evaluation,and were asked to give a list of facilitators and hindrancesof the heuristic evaluation method that they had applied.

2.2. Medium of reporting

Evaluators reported the usability problems in one of twoways, using a web tool (see Fig. 2) or on a paper form. Astructured problem report format adapted from Cocktonand Woolrych (2001) was used to report seven attributesof each predicted usability problem (PUP) (Table 2). Note-worthy is that the use of the structured problem report for-mat can improve the reliability of merging PUPs and thereliability of matching predicted to actual problems (Cock-ton et al., 2003a), thereby increasing the overall internalvalidity of the usability evaluation results. The same attri-butes were reported on paper and with the tool, and thedesign of the forms on both media was similar. Heuristicsused and severity rate were selected with combo boxesand a default value was given in both cases. Examplesfor reporting problems were provided on paper for thepaper form, but in the web tool, evaluators could get helpon attributes and a complete example usability problem. Alist of all the heuristics and a list of the levels of severity,with further explanations via tooltips, were given in thetool, but the evaluators using the paper form were encour-aged to print these out and have in front of them duringevaluation.

The people using the web tool for reporting problemsreceived a short introduction to the tool. They were recom-mended to use two computers, one for keeping EducaNextmaximized and a laptop for reporting the usability prob-lems in parallel.

2.3. Translating predicted problems into task scenarios

According to Cockton et al. (2003a), usability inspectionmethods (UIMs), such as heuristic evaluation, can serve asdiscovery resource for user testing, which are designed tofocus on potential problems predicted by UIMs, therebyimproving construct validity (Gray and Salzman, 1998).Furthermore, Cockton and his colleagues (2004) demon-strate how UIM predicted problems could be translatedinto task scenarios of user testing. As the goal of falsica-tion testing is to maximize condence in false positive cod-ing, denition of task sets for user tests shouldsystematically be derived from UIM analysts predictions.A task denition methodology essentially consists of threeprocedures: processing predictions, translating predictionsinto tasks, and verifying tasks against predictions. Here,we delineate how these three procedures have been imple-mented in our studies.

2.3.1. Processing PUPs

A usability specialist (E1), who is highly knowledgeableabout the system tested, rst examined closely each of thePUPs discovered by individual evaluators to discard with-in-evaluator duplicate or incomprehensible PUP descrip-tions. Then, E1 applied the problem reduction methoddescribed in Connell and Hammond (1999) to lter outany overlapping PUPs to generate a list of unique PUPs.E1 then grouped these PUPs according to the attributeSpecic Context in the standard problem report form(Table 2), resulting in 15 groups, e.g. Access Content,Advanced Search, Simple Search, Browse Catalogue,Content Provision, Left Navigation Bar, etc.

2.3.2. Translating PUPs

With reference to her experiences about the usages ofthe system, E1 assessed the severity of individual PUPsand prioritized them within each group. PUPs with highpriority were translated into task scenarios by abstractingthe actions leading to their discovery (cf. the attributeShort Description in the structured problem report).

Table 1The experimental design of the Heuristic Evaluation

Nielsen Gerhardt-Powals

Paper Group A: 5 novice evaluators Group B: 5 novice evaluatorsTool Group C: 5 novice evaluators Group D: 5 novice evaluators

Fig. 2. User interface of a web tool for reporting usability problems.

Table 2Structured problem report format (adapted from Cockton and Woolrych(2001))

1. A numeric identier of the problem2. A short description of the problem3. Likely diculties for the user4. Specic context (the location of the problem in the interface)5. Possible causes of the problem (what is wrong in the design)6. The heuristic(s) used7. The severity rate, containing 3 levels: severe, moderate and minor


Caution was exercised to avoid over-constraining a usersaction with too detailed task descriptions. As the 15 groupsof PUPs are somewhat interrelated, one task scenario canaddress more than one group of PUPs. Finally, eight taskscenarios (Table 3) with a set of sub-tasks addressing dier-ent PUPs were developed.

2.3.3. Verifying task scenarios

It was critical to check the coverage of the task scenariosto see whether any signicant PUPs were left out from fur-ther validation. Another usability specialist (E2) mappedeach of the unique PUPs against the eight task scenariosand found that 15 PUPs were not covered by any of thescenarios. E1 repeated the same mapping exercise and iden-tied two mapped cases that E2 interpreted dierently. E1and E2 then negotiated to assess the relevance of the omit-ted PUPs. Consequently, they decided to ignore eight ofthese PUPs, which were highly situational (e.g. dependingon the type of the content that the user retrieved fromthe portal) or trivial, and incorporated the other ve intothe existing task scenarios.

2.4. User tests

The user tests (UT) were conducted to nd out howmany of the predicted problems reported by the evaluatorsusing the heuristic evaluation were real problems for users.The task sets for the user tests were derived from the resultsof the heuristic evaluation, asking the users to use thosefeatures of EducaNext that were found to be problematicin the heuristic evaluation (Section 2.3). This method wasadopted to increase the power of the user tests for validat-ing the predicted usability problems.

Ten participants (P1, . . ., P10) were asked to solve eighttasks (Table 3) while thinking aloud. First, the participantsanswered a pre-test questionnaire about their personal andtechnical background. Then, they solved the task scenariosone by one and answered a so-called After Scenario Ques-tionnaire (Lewis, 1991) in between measuring their subjec-tive satisfaction about that task. Everything that happenedon the screen was captured using a screen capture tool anda web camera was used to record both sound and video ofthe session. After completing the eight tasks, the users wereasked to answer a post-test questionnaire SUS (SystemUsability Scale) (Brooke, 1996) measuring their subjectivesatisfaction about EducaNext as a whole. SUS has been

proved a robust, reliable, and low-cost usability assessmenttool that is widely used in industry. The 10 participantswere all university sta, either professors or administrators.There were ve males and ve females. Two participantswere in their thirties, seven participants were in their fortiesand one in her fties. The average ICT (Information andCommunication Technology) competence was self-assessedat 3.5 (out of 5, N = 10) (SD = 0.97). Participants reportedan average of 3.5 (out of 5, N = 10) grade when askedabout experience in e-learning (SD = 1.27). Seven of theparticipants had developed or organized online learningcontent.

An experimenter observed the users unobtrusively, inthe same room, while they were performing the task scenar-ios. For each task, she reported the users performance(start-time, end-time, number and type of errors, instanceand type of help seeking, instance of expressed frustration),comments and problems with a template on a laptop. Theexperimenter checked these data against the screen cap-tures and audio recordings to ensure their accuracy.

2.5. Problem extraction, ltering and matching procedures

The inability to expose all actual problems is recognizedas an inherent limitation of user testing. Notwithstanding,user test results are employed to validate UIMs. Errors ingenerating a set of actual problems, identied in user tests,may lead to miscoding predicted problems as False Alarms.The SUPEX (Structured Usability Problem EXtraction)method (Cockton and Lavery, 1999) addresses this issueof problem extraction. However, the usability of thismethod, especially the learnability, is questionable becauseof its involved stages and sub-stages. The eciency ofapplying SUPEX appears low, given the time-consumingprocesses such as segmentation, transcription and coding.The cost-eectiveness of the method is not clear. Depend-ing on the budgetary constraint and other contextual fac-tors, stages and sub-stages of SUPEX can be skipped.What is left, when stripping the method to its core ele-ments, will be a common approach to problem extraction,which is more or less the same as the one adopted in thecurrent study. Such exibility in applying the methodmakes it dicult to decide between SUPEX and non-SUPEX, and to prove its claimed advantages. The compre-hensive assessment and development of SUPEX is yet to bedone, especially its usability and reliability. Consequently,we relied on the traditional approach to problem extractionfrom the user tests.

The usability specialist E1, who has performed severaluser tests on dierent versions of EducaNext, became veryknowledgeable about the system, which was vital for eec-tive and ecient problem extraction. Specically, shereferred to the conventional denition of usability problemto guide the extraction task, namely:

Simply stated, a usability problem is any aspect of a user

interface that is expected [or observed] to cause users

Table 3List of eight tasks for the user-test

Task 1: Apply for an EducaNext Portal User AccountTask 2: Login and Edit User PreferencesTask 3: Browse the Catalogue of the EducaNext PortalTask 4: Simple SearchTask 5: Advanced SearchTask 6: Check the Booking HistoryTask 7: Create and Join an EducaNext CommunityTask 8: Provide and Oer Educational Material


problems with respect to some salient usability measure

(e.g. learnability, performance, error rate, subjective sat-

isfaction) and that can be attributed to a single design

aspect (Nielsen, 1993, p. 388), [our addition].

For individual users, E1 derived a list of usability prob-lems from their think-aloud protocols and from the exper-imenters detailed observation notes. Furthermore, the 10lists of usability problems were merged and the overlappingones ltered out. Problem instances rather than problemtypes were counted (John and Mashyna, 1997). In otherwords, the same problem identied in two dierent con-texts would be counted as occurring twice rather than once.Observed UPs were recorded in the same structured reportformat used for PUPs, sharing the same set of attributes,including Identier, Users/Evaluators Involved, Descrip-tion, Context, Severity and Frequency, thereby facilitatingthe matching task.

In addition, E1 devised a so-called two-way mappingprocedure (forward- and backward-matching) for validat-ing the PUPs of heuristic evaluation against the UPs ofuser tests. First, we took PUP1 and mapped it to each ofthe 58 UPs; then repeated the same procedure for PUP2up to PUP85. Then, we took the UP1 and mapped it toeach of the 85 PUPs; then repeated the same procedurefor the remaining UPs. With this approach, the reliabilityof the mapping results could be enhanced, though it wastedious and time-consuming. Fig. 1 illustrates the afore-mentioned procedures.

3. Results

In the following three subsections, we report on theresults of the heuristic evaluation experiment, user testsand the validity and thoroughness of heuristic evaluationcompared to the user tests after the usability problems ofthe two experiments have been ltered and compared.We, thus, report how well the evaluators conducting heu-ristic evaluation were able to predict usability problemsreported by the users. Finally, we compare two and twogroups together, i.e. on the one hand, those that applyNielsen vs. Gerhardt-Powals and those applying paper vs.tool, on the other hand. To compare the dierent groups,we calculated the thoroughness, validity and eciency ofthe evaluation. The any-two-agreement measures of theevaluators (Hertzum and Jacobsen, 2001) within thegroups were evaluated.

3.1. Data analysis from heuristic evaluation

During heuristic evaluation, there were altogether 160predicted usability problems (PUPs) reported by 20 evalu-ators. Note that no PUP was reported by B1 (see Table 4).Groups A, B, C and D reported 33, 37, 43 and 47 PUPs,respectively. v2 test showed that there were no signicantdierences in the number of PUPs between the four groups.Examining means of the four groups revealed that there

was close to signicant dierence at the 0.05 level in thenumber of PUPs between Group A and Group C(t = 2.21, df = 8, p = 0.06), i.e. between paper and tool reg-istration when using the Nielsens heuristics, but no dier-ence when using the Gerhardt-Powals principles.

Note that B1, who apparently was not motivated for thegiven task, uncovered zero PUPs. When that subject wasremoved from the population, analysis further showed thatthere was signicant dierence in the number of PUPsbetween Group A and Group B (t = 2.62, df = 7,p = 0.04), i.e. between Nielsen and Gerhardt-Powals whenusing paper for problem reporting, with the latter discover-ing more usability problems than the former.

A usability specialist examined each of the PUPs anddiscarded 12 of them, which were reported more than onceby the same participant or were found to be incomprehen-sible. Consequently, 148 out of the original 160 PUPs wereconsolidated to eliminate any duplicate (Section 3.2), usingthe procedure described in Connell and Hammond (1999).This exercise ltered out duplicates and led to a list of 85unique instances of PUPs, on which the design of task sce-narios of a user test (UT) were based; these PUPs were val-idated by the results of the user test thus developed (seebelow).

Hertzum and Jacobsen (2001) developed a metric knownas Any-Two Agreement (see Eq. 1) to estimate inter-evaluator reliability, i.e. the probability that a particularusability problem is identied by more than one evaluator.This measure is supposed to be more accurate than the con-ventional method of using simple problem discovery rate(p), which can be much inated when a sample size is small.

Any-two-agreement Average of jP i \ P jj=jP i [ P jj 1over all 1/2 n (n 1) pairs of users, where Pi and Pj are thesets of UPs identied by useri and userj, and n is the num-ber of users.

We computed Any-two-agreement for all the 20 evalua-tors of heuristic evaluation as a whole, for each of the fourgroups and for a combination of the groups sharing a com-mon feature. The overall Any-two-agreement for the 20evaluators was 0.043 (or 0.046 for the 19 evaluators exclud-ing B1). The results are shown in Table 5, with the numbersin parentheses including evaluator B1.

Table 4Number of predicted usability problems (PUPs) per evaluator in eachgroup

Nielsen Heuristics

Paper ID A1 A2 A3 A4 A5 M = 6.6PUPs 6 5 6 9 7 SD = 1.4

Tool ID C1 C2 C3 C4 C5 M = 8.6PUPs 10 10 8 8 7 SD = 1.2

Gerhardt-Powals PrinciplesPaper ID B1 B2 B3 B4 B5 M = 7.4

PUPs 0 10 8 8 11 SD = 3.9Tool ID D1 D2 D3 D4 D5 M = 9.4

PUPs 4 8 11 13 11 SD = 3.1


The overall inter-evaluator reliability was very low, or0.046, in comparison with the average Any-two-agreementvalue of 0.101 within the group of user-test participants.The low reliability can be attributed to the heterogeneousbackgrounds of the evaluators, especially their experienceand knowledge of usability evaluation and of the domainof the system evaluated (i.e. e-learning). Besides, the twoindependent variables (i.e. set of heuristics and type ofmedium for problem reporting) had certain eects on thenature and number of usability problems identied. Fur-thermore, the dierences between the groups were not sta-tistically signicant. Group C showed the highest level ofagreement. It may be explained by the two factors: theevaluators familiarity with Nielsens heuristics and thefacilitating eect of the tool. Nonetheless, due to the smallnumber of participants, these assumptions cannot beconrmed.

After coding the qualitative data on facilitators and hin-drances the evaluators mentioned, we obtained the resultslisted in Table 6. It is interesting to note that the heuristicset was considered as both a facilitator (n = 6) and a hin-drance (n = 7), and that Gerhardt-Powals were perceivedas more hindering than were Nielsens. This nding raisesthe issue whether evaluators, without being given anyguidelines (i.e. control), could outperform those with anyof the two sets. At least two studies, comparing guidelinesto a control group, where the control group does not relyon any guidelines but only their own knowledge, have beenperformed in the past (Bastien et al., 1999; Connell andHammond, 1999). In the study by Bastien et al. (1999),where median number of usability problems uncoveredby the participants was used as a metric, the control andthe ergonomic criteria groups diered signicantly, withthe latter uncovering more problems but also spendingmore time on the evaluation. In the same study, signicantdierences did not appear between the control group and agroup using ISO/DIS 9241-10 dialogue principles. In thestudy by Connell and Hammond (1999), where Nielsens

heuristics and a set of 30 principles were compared to acontrol group, no signicant dierences were foundbetween the conditions for a group of novices. That moreevaluators perceived training as a facilitator than a hin-drance implies that the training given was somewhat eec-tive and desirable. Three of the 20 evaluators thought thatthey did not have enough time to evaluate the application.This may indicate that the task was more time-consumingthan they expected and could have been broken down tomore than one assignment. It raises the question whetherthe number of usability problems discovered is highlydependent on the amount of time spent. The Pearson cor-relation showed that there is a moderate relationshipbetween the two variables (r = 0.48, p = 0.03).

3.2. Analysis of usability problems from the user tests

There were altogether 125 unconsolidated usabilityproblems (UPs) experienced by the 10 participants in usertesting. The average number of problems reported per par-ticipant during user testing was M = 12.5, SD = 4.2,N = 10, which was higher than that of heuristic evaluation(M = 8.0, SD = 2.9, N = 20).

Because of the ltering procedure, a list of 58 UPs wasproduced, of which 10 were severe, 27 were moderate and21 were minor. Besides, only one participant experienced24 of these 58 UPs and six participants experienced onlyone UP, which was severe (see Fig. 3). This list was usedto validate the PUPs reported in heuristic evaluation.

The average problem discovery rate (p) over the 10 par-ticipants was 0.22, which was not particularly high. Inother words, at least seven users were required to identify80% of the discoverable UPs (cf. the assumption of MagicNumber 5 (Barnum, 2003) ve users can yield 80% ofthe ndings from a usability test also supported by Niel-sen (2000) and other human factors engineer (e.g. Virzi,1992) but questioned by several researchers (Faulkner,2003; Law and Hvannberg, 2004b; Spool and Schroeder,2001; Woolrych and Cockton, 2001)). Furthermore, theproblem discovery rate (p) for severe problems was 0.3,which was higher than that of moderate (p = 0.21) and thatof minor problems (p = 0.18).

Table 5Inter-evaluator reliability measured with any-two-agreement

Nielsen Gerhardt-Powals

Paper A = 0.042 B = 0.088 (0.053)a A + B = 0.045 (0.041)Tool C = 0.073 D = 0.070 C + D = 0.055

A + C = 0.047 B + D = 0.046 (0.034) All = 0.046 (0.043)

a Numbers in parenthesis include evaluator B1.

Table 6Hindrances and facilitators

Hindrance Freq. Facilitator Freq.

Unclear training/still not aproblem

1 Training Material 8

Heuristic guidelines 7 Heuristic guidelines 6Lack of experience 1 Previous experience 1Lack of time 3 Using two computers 1Dicult application/

EducaNext3 Easy application/

EducaNext1

24

1411

6

2 10

5

10

15

20

25

30

1 2 3 4 5 6Number of User

Freq

uenc

y of

UP

Fig. 3. Number of users experiencing UPs with dierent frequencies.


The 10 participants were required to ll out an After-Scenario Questionnaire (ASQ) for each of the eightscenarios to assess the perceived ease, perceived time andperceived utility of online help. Each of these three aspectswas measured with the seven-point Likert Scale with theleft and right anchor being Strongly Disagreed andStrongly Agreed, respectively. Furthermore, uponattempting all the tasks, the participants were required tocomplete a post-test questionnaire (SUS) to assess theiroverall subjective satisfaction with EducaNext. The aver-age SUS score was 57.25 (N = 10, SD = 16.43). Note thatSUS scores have a range of 0 to 100 (Brooke, 1996). Foreach of the 10 participants, we computed two dierentobjective measures: UP-severity-ratings and total-time-on-tasks. Specically, we added up the severity ratings(severe = 3, moderate = 2 and minor = 1) of all the UPsthat a participant experienced, resulting in his or herUP-severity-ratings. Similarly, we added up the time thata participant spent on all the tasks attempted. The averageUP-severity-rating was 23.6 (N = 10, SD = 7.79) and theaverage total-time-on-tasks was 54.6 min (N = 10,SD = 12.04).

3.3. Mapping the results of heuristic evaluations with those of

user tests

Using the two-way mapping procedure described in Sec-tion 2.5, the following results were obtained:

Number of Hits i:e: PUPs verified by UPs 32Number of False Alarms

i:e: PUPs not verified by any UP 53We studied how many of Hits and False Alarms were

identied by dierent groups of HE evaluators. Resultsare displayed in Table 7. Note that within-group duplicateswere not counted, yielding so-called ltered Hits. Forinstance, PUP20 was a hit uncovered by A1, A3, C1 andC5; however, the corresponding frequency of hit was onlyincreased by 1 for Group A and for Group C. v2 tests wereapplied to these results. On average, an evaluator in theheuristic evaluation demonstrated similar performance interms of Hits and False Alarms, irrespective of the typesof supports given, i.e. heuristic sets and reporting medium.Compared to the results of heuristic evaluation, the groupthat used Nielsens heuristics and reported on paper

(Group A) identied on average 6.6 usability problemsper evaluator, but on average only 2.6 of these problemsper evaluator could be veried in the user test.

Furthermore, to nd out whether tool-based reportingwas more eective than reporting on paper, we collapsedthe data of the respective groups (Table 7) and performedt-tests between Group A + Group B and GroupC + Group D. No signicant dierences in the number ofHits or False Alarms between tool-based and paper-basedevaluator were found. Similarly, to nd out whether Niel-sens usability heuristics were more eective than Ger-hardt-Powals principles, we collapsed the data of therespective groups and performed t-tests between GroupA + Group C and Group B + Group D. No signicant dif-ferences were found either. In summary, on average, anevaluator in the heuristic evaluation demonstrated similarperformance, irrespective of the types of supports given.

Besides, it is intriguing to know how many severe veri-ed UPs and unique veried UPs (i.e. UPs were reportedonly by the evaluators belonging to one particular group)individual groups identied. Table 8 displays the results.Group B appeared to be the most eective in identifyingsevere UPs and Group B and C in identifying additionalUPs, but the dierences were insignicant.

In addition to Hits and False Alarms, misses are denot-ed as those usability problems that inspection methodshave missed but are found in user testing. We computedthe overall eectiveness of heuristic evaluation based onthe following formulae (Hartson et al., 2001): Eective-ness = Validity * Thoroughness, where

Validity Hits=Hits False Alarms 32=85 0:38 2

Thoroughness Hits=HitsMisses 32=58 0:55 3

Overall Effectiveness 0:38 0:55 0:21 4

In addition, we researched whether the HE evaluators andUT users tended to identify the real UPs with a similar fre-quency. To answer this question, we correlated the two fre-quencies of individual Hits. The non-parametric SpearmanRank-Order Correlation Coecient rs was 0.23 (p = 0.2,N = 32), showing that the two groups, HE evaluatorsand UT users, were not signicantly correlated in thisrespect.

Table 7Distribution of Hits and False Alarms ltered sets

Hits False Alarms

NielsenHeuristics

Gerhardt-Powalsprinciple

NielsenHeuristics


Paper GroupA = 8

Group B = 13 GroupA = 18

Group B = 12

Tool GroupC = 15

Group D = 11 GroupC = 19

Group D = 26

Table 8Distribution of severe veried problems and unique veried problems

Severe problems Unique problems

NielsenHeuristics


NielsenHeuristics


Paper GroupA = 2

Group B = 5 GroupA = 2

Group B = 6

Tool GroupC = 2

Group D = 2 GroupC = 8

Group D = 4


3.4. Results according to heuristics

To answer the question which of the two sets of heuris-tics can result in higher validity and thoroughness, we col-lapsed the data of the respective groups and analysed them.

We adopted the following denition (Law and Hvann-berg, 2004a):

Actual efficiencyAE Number of Hits identifiedwithin the Testing Session=Total durationhours ofthe Testing Session 5From Table 9, we see that the validity is almost the samefor the two sets of heuristics, i.e. less than half of the pre-dicted problems could be veried in the user tests. (Notethat if a problem was discovered with Nielsens heuristicseither with paper or the tool, it was only counted once.)If the results of the user tests are the baseline, i.e. the truththat we can compare to, the measures indicate that about60% of the eort of doing heuristic evaluation has beenwasted. Similarly, the thoroughness for the two sets was al-most identical. The measures indicate that more than 60%of the usability problems discovered by the users wereundetected by the evaluators in the heuristic evaluation.

The usefulness of a usability evaluation method dependsin part on whether it enables the evaluator to discover themost severe problems and whether it avoids misleadingdevelopers to x False Alarms (i.e. those reported problemsnot veried by the user tests). The distribution of problemseverity between heuristic sets in Table 10 was notsignicant.

The average time it took each evaluator to do the eval-uation with Nielsens set of heuristics (2 h 18 min) was only5 min less than the average time it took an evaluator withGerhardt-Powals principles (2 h 23 min).

The evaluators satisfaction was mixed. Independent ofheuristic sets, some said it was time-consuming to use theheuristics, there were too many heuristics and hard tounderstand, while others said they had no problems. Manymentioned that while evaluating, they rst found a problem

and then had a hard time nding the right heuristic to referto. This may indicate that the heuristics are not alwaysexplicitly guiding the evaluators to discover problems orthat the evaluators are nding problems for which thereexist no heuristics in the respective set of heuristics. Relat-ing to the rst issue, i.e. guidance of heuristics in discover-ing a problem, in a study where an extended problemreport format was used, Cockton et al. (2003b) found sig-nicant improvements in appropriateness scores increasingto a mean score of 61% from an earlier one of 31% (Cock-ton and Woolrych, 2001). They contributed at least part ofthe dierence to discovery methods and part to the extend-ed problem report format. Cockton and Woolrych (2001)have termed inappropriate heuristics when an expert ratherthan a heuristic evaluation is being applied. Regarding thesecond issue, i.e. no heuristics existing, apparently, it is dif-cult to design a heuristic set, which has total coverage.Already, Nielsen (1994a) showed that in a factor analysis,only 30% of the variance was due to seven main heuristicsfactors which motivated him to relax the criterion of cover-age from a problem matching a single heuristic perfectly(rating of 5) to a partial match (rating of 3). With this cri-terion, he assumed that usability problems are due to abroad variety of underlying phenomena.

Law and Hvannberg (2004a) show that not only is itsometimes dicult for the evaluators to nd the matchingheuristic, but they nd it dicult to associate a problemwith any of the heuristics. Their study concluded that eval-uators identied quite a number of the severe UPs based ontheir own personal experiences and intuitions (Law andHvannberg, 2004a, p. 247), and more so for the Ger-hardt-Powals principles than for the Nielsens heuristics.

3.5. Eectiveness and eciency according to the medium of

reporting

The medium of reporting could have some impact onhow many usability problems are found during heuristicevaluation and how serious the problems are rated. Wecomputed the validity, thoroughness and eciency of theresults for the two dierent media used to report the usabil-ity problems, paper or the tool, with 10 evaluators in eachgroup.

Table 11 shows that the validity was almost the same forusing a paper and using a tool, i.e. even though many moreproblems were reported using the tool than the paper. Soalmost 60% of the eort has been wasted using paper andabout 55% of the eort has been wasted using a tool.

Table 9Validity and thoroughness of heuristics problem sets

Nielsen (A + C) Gerhardt-Powals (B + D)

Validity 21/53 = 0.40 21/54 = 0.39Thoroughness 21/58 = 0.36 21/58 = 0.36Eciency (AE) 21/21.3 = 0.98 21/23.3 = 0.90

Table 10Veried problems according to problem severity and heuristic sets

Nielsen (A + C) Gerhardt-Powals (B + D) User test

Severe 3 (30%)a 5 (50%) 10Moderate 13 (48%) 10 (37%) 27Minor 5 (24%) 6 (26%) 21

a (Validity) = % of veried problems against the total number ofusability problems of respective severity identied in the user test.

Table 11Validity of heuristic evaluation according to the medium of problemreporting

Paper (A + B) Tool (C + D)

Validity 18/44 = 0.41 23/51 = 0.45Thoroughness 18/58 = 0.31 23/58 = 0.40Eciency (AE) 18/24 = 0.75 23/20.7 = 1.1


The thoroughness for using a tool was slightly higher.The measure indicates that almost 70% of the usabilityproblems experienced by the users were not reported bythe evaluators in the heuristic evaluation using a paperform, and 60% of the usability problems experienced bythe users were not reported by the evaluators in the heuris-tic evaluation using a tool to report the problems.

There was no signicant dierence in the distribution ofsevere, moderate and minor problems (see Table 12).

The average length of an evaluation session per evalua-tor using a tool was 2 h 14 min, which was 16 min less thanusing a paper (2 h 30 min).

The evaluators using the paper form stated that it wouldbe tiring in the long run to use the paper form. The toolusers were very positive, stating, that the tool was easy touse; it was eective to have tooltips and examples for guid-ance for reporting the problems; it was ecient to usedrop-down lists to choose the severity rate and heuristicsused to nd the problem. They would rather use the toolthan similar paper form.

4. Comparing two empirical studies

Previously, the authors performed an empirical study onestimating and improving the eectiveness of heuristicevaluation. The former study (Study 1) (Law and Hvann-berg, 2004a) shared some similarities with the current one(Study 2), including:

Employing Nielsens heuristics and Gerhardt-Powalsprinciples and applying them to an e-learning platform(NB: versions v.0.85 and v.1.0 of the same platform weretested in Study 1 and Study 2, respectively).

Validating the results of heuristic evaluations with thoseof user tests.

Comparable number of participants (n = 4 or 5) foreach of the four groups.

In contrast, there are four major important dierences(see Table 13) between the two studies, including:

1. More structured and detailed information about the sys-tem evaluated was provided to the evaluators in Study 2than in Study 1.

2. Between-subject design was used in Study 2 instead ofwithin-subject design in Study 1.

3. Using Paper-based reporting vs. tool-based reportingas one of the two independent variables in Study 2instead of using Textual descriptions versus Graphicalrepresentations about the system tested in Study 1.

4. User tests were conducted after heuristic evaluations andwere designed based on the data of heuristic evaluationsin Study 2, whereas user tests were conducted beforeheuristic evaluations in Study 1.

The major nding in Study 1 was that Nielsens heuris-tics could enable the evaluators to identify signicantlymore actual usability problems (Hits) than Gerhardt-Pow-als principles, irrespective of the type of the other support-ing information the evaluator received. However, Study 2could not corroborate this nding. It is intriguing to tryto understand what may contribute to this contradictoryconclusion. We examine the four factors delineated above:

Factor 1: the extra information given to the evaluatorsin Study 2 focused on the system evaluated but not onthe heuristics or principles. Hence, assuming that the posi-tive eect of increasing the understanding of the system was

Table 12Veried problems according to the problem severity and medium ofreporting

Paper (A + B) Tool (C + D) User Test

Severe 6 (60%)a 4 (40%) 10Moderate 7 (26%) 15 (56%) 27Minor 5 (29%) 4 (19%) 21

a (Validity) = % of veried problems against the total number ofusability problems of respective severity identied in the user test.

Table 13Comparison of the two studies

Former (Study 1) Current (Study 2)

Researchdesign

Independentvariables (IV)

IV1: Textual vs. Graphical description of the system IV1: Paper- vs. Tool-based problemreporting

IV2: Reference set Nielsen vs. Gerhardt-PowalsDependentvariable (DV)

Number and quality of usability problems

Experimentaldesign

Within-subject design (4 cells, each with 5 subjects) Between-subject design (4 cells, eachwith 5 subjects)

Pre-test training Focus on the usability guidelines Focus on the main features of thesystem

Validation Results of user tests [validate] Heuristic Evaluation [guide]design of user tests

Results of Heuristic Evaluation Results of user tests [validate]Results of Heuristic Evaluation

Mainndings

Form of support(IV1)

No eect on dependent variable

Heuristic set (IV2) Signicant dierence: Nielsen more eective No signicant dierences


more or less equal for all the four groups, it is ratherunlikely that this factor can explain the insignicant dier-ence between Nielsen and Gerhardt-Powals in Study 2.

Factor 2: We argued elsewhere for the legitimacy ofusing within-subject design in the context of Study 1 andalso analyzed the inherent characteristics of Nielsens heu-ristics that may render them more eective than Gerhardt-Powals principles (Law and Hvannberg, 2004a).

Factor 3: Given the insignicant eects of the indepen-dent variable textual vs. graphical of Study 1 and paperversus tool of Study 2 in yielding more Hits or even moreFalse Alarms, this factor should play no role in explainingthe contradictory conclusion.

Factor 4: The special arrangements of the user tests forvalidating heuristic evaluations in Study 2 were to enhancethe validity of the mapping results, and this purportedeect should be more or less equal for all the four groups.

However, we cannot denitely eliminate the possibilitythat the signicant dierences found in Study 1 could bean artifact of the experimental design. Hence, Factor 4may partially, but not fully, explain the insignicant dier-ences found in Study 2. Apparently, the so-called user eect(Law and Hvannberg, 2004b) and evaluator eect (Hertz-um and Jacobsen, 2001; Molich et al., 2004) can accountfor the contradictory nding between the two studies. Infact, evaluator eect can boil down to individual dierences(Dillon and Watson, 1996) that are dicult to control inempirical studies in HCI.

Nonetheless, given the small number of evaluators pergroup, the eect size or power of the statistical testsemployed in both studies is not particularly high. Clearly,an empirical study of a larger scale is required. Noteworthyis that the overall eectiveness of heuristic evaluation wasdisappointingly low in both studies 0.22 for Study 1and 0.21 for Study 2. This nding may threaten the claimthat heuristic evaluation is an eective discount methodfor evaluating usability.

5. Discussion

Before we summarise the answers to the research ques-tions we raised in Section 1 and suggest research issuesfor further work, we will point out the context of the studythat will help us draw conclusions from the results.

Novice evaluators performed the heuristic evaluation inthe rst of the two experiments presented in this paper. Asstated by Molich and Jeries (2003), the former being oneof the inventors of heuristic evaluation, it can be applied bysomeone without particular knowledge of usability engi-

neering to evaluate a user interface (Molich and Jeries,2003, p. 1060). Hertzum and Jacobsen (2001, p. 424) alsostated that: Any computer professional should be able toapply heuristic evaluation, but the informality of the method

leaves much to the evaluator.

Indeed, some of the gaps between the results of theheuristic evaluation and those of the user tests in thispaper could be attributed to the evaluator eect (Hertzum

and Jacobsen, 2001), i.e. levels of expertise and experienceof individual evaluators have observable inuences onusability evaluation outcomes. We tried to mitigate theevaluator eect by providing highly structured trainingmaterial on heuristic evaluation and on the system. Theevaluators got a checklist of the activities they were askedto perform and a digital audio-le introducing heuristicevaluation, the EducaNext system and the process ofreporting problems. An analysis of the qualitative dataon facilitators and hindrances of the heuristic evaluationshowed that 8 of 20 evaluators found that the trainingmaterial helped them, but one evaluator mentioned thattraining was a hindrance and one mentioned lack ofexperience.

In the following, we summarize the answers to theresearch questions we raised in Section 1 and put forwardfurther research issues.

5.1. Tool-based vs. Paper-based Reporting

Part of this study is an attempt to answer whether tool-based reporting is superior to paper-based in terms ofyielding a higher number of real usability problems(increased validity) within a shorter period of time(increased eciency). The proposition is that typing canbe faster than handwriting, digital content is more accessi-ble and easily modiable than its paper version and instruc-tions are more readily available. We achieved small benetsof tool over paper, the eectiveness was 0.18 when using atool compared to 0.13 using paper, but the dierence wasnot signicant. No signicant dierences in Actual Ecien-cies per evaluator were found between Group A (AE:M = 1.34, SD = 1.1) and Group C (AE: M = 2.1,SD = 2.2) or between Group B (AE: M = 1.32, SD = 0.9)and Group D (AE: M = 2.2, SD = 1.1) or between GroupA + B and Group C + D. While the evaluators using Niel-sens heuristics tended to type more with the web tool thanto write with the conventional paper-and-pen (t = 2.21,df = 8, p = 0.06), it did not imply a higher quality of thesePUPs. When asking participants to list hindrances andfacilitators of heuristic evaluation, none of them mentionedthe registration tool but one in group A mentioned thepaper form as a hindrance. The ineectiveness of our toolin enhancing the validity or eciency can be attributedby the three negative eects addressed earlier in Section1, including the cognitive load caused by switching betweenthe two software systems, hasty data entry resulting in falsealarms, and biased use of certain classication values.

We may need to implement more intelligence in the toolto gain further advantages. As it is, the tool provides agood basis for problem recording, providing help to itsusers that they reportedly liked. Although an earlier study(Law and Hvannberg, 2004a) showed no dierencebetween those that used training material (Lavery et al.,1996) for heuristic evaluation and those who did not, thereason may have been that the training material was notat hand during evaluation but was exposed to the evaluator


before the analysis. Additional assistance to evaluatorsmay be given in a tool by:

Linking the usability problem better to the context interms of screen scenarios or individual design features,for richer problem description and hopefully more e-cient problem xes.

Helping evaluators go through falsication testing tomitigate, if not totally eradicate, instances of falsealarms.

Pointing out gaps in testing coverage of the applicationto avoid missed problems.

Pointing to previously proven problematic areas thatneed to be retested (i.e. inspecting a revised version ofthe system).

Giving them usability problems prole that consists ofproblematic areas commonly identied for the type ofproducts or interfaces under evaluation (Chattratichartand Brodie, 2004).

Fostering the reliability and validity of problem severityratings with automatable problem-rating rules derivedfrom a robust theoretical and computational model(Blackmon et al., 2005) that are built in an intelligentreporting tool.

Research of the features and aordances of tool vs.paper has taken place in dierent domains, such as air traf-c control, document handling (Gladwell, 2002) and userinterface design (Cook and Bailey, 2005). Whereas paperhas a unique set of aordances such as being tangible, spa-tially exible and tailorable, digital documents can easily besearched, shared, stored, accessed remotely and linked toother related material (Gladwell, 2002). These aordancesof paper, it is claimed, make paper attractive for creativecollaboration and help maintain workers mental models.In addition to these features, designers have reported thatthey nd paper quicker, easier and more portable, in earlydesigns (Cook and Bailey, 2005). Note that the dierencebetween the representation of designers and evaluators ofthis study is that the latter produced text as deliverablesbut the former worked with sketches and text. We didnot study the collaborative task of merging usability prob-lems, but this may be an interesting subject for furtherstudy.

5.2. Nielsen vs. Gerhardt-Powals

Another approach to improving heuristic evaluation isto search for better heuristic sets. The validity of the twoheuristic sets, Nielsens and Gerhardt-Powals, with respectto user testing was the same. The overall eectiveness was0.14 in both cases. Percentage wise, Gerhardt-Powals prin-ciples could enable more severe problems to be foundalthough the dierence was not signicant. Only lookingat predicted problems, both the average number of prob-lems and the standard deviation for Gerhardt-Powalsprinciples were higher than those for Nielsens heuristics.

This indicates that the Gerhardt-Powals principles couldfurther be exploited but that evaluators need more training.The ndings of this study are contradictory to those of aprevious study (Law and Hvannberg, 2004a), which impliesthat an empirical study of larger scale is needed.

The low eectiveness may be due to mismatch in theapplication domain knowledge between the evaluators inthe two groups. Furthermore, in comparison with Nielsensheuristics, the hypothesized strengths of cognitive engineer-ing heuristics may not be exploited in a web-applicationwith standard operations such as search, insert, access,and delete. The implication of this observation for futureresearch is to compile and systematically evaluate a list ofusability heuristics for e-learning systems. Indeed, therehave been some recent studies in developing tailor-madeusability heuristics to t special application contexts (e.g.ambient displays, (Manko et al., 2003)) and large screendisplay information exhibits (Somervell et al., 2003), andthey are proven to be more eective than Nielsens heuris-tics (Somervell and McCrickard, 2005). A tool, with theadded intelligence of creating taxonomies of problems invarious application contexts to context specic heuristics,can make such a selection of heuristics more eective.

5.3. Task selection

The diculty in creating tasks for user testing was two-fold: the heuristic evaluation revealed that some of theproblems were not reproducible at all from the problemdescriptions, and some problems were situational and con-textual, i.e. depending on particular data retrieved orentered by the evaluator, and on the stability of the serverwhere the system evaluated resided.

A simple count of number of problem contexts, that canbe dened as an identiable place within the application, ineach of the two problem sets heuristic evaluation and usertests shows that of the 23 contexts with usability prob-lems in the user tests, PUPs covered 18 of those contexts.Evaluators predicted problems in nine contexts not discov-ered in user test. Note that the denition of a context needsfurther investigation. Other researchers have moved awayfrom using problem counts and used a more qualitativeapproach with an analysis of types of usability issues anduser-system mists (Connell et al., 2004). Another steptowards acknowledging the qualitative nature of the prob-lem descriptions, instead of using merely the stringent con-cepts of Hits, Misses and False Alarms as is traditionallyused, is to introduce Possible Hits (PH), which are lessclear-cut, ambiguous, but nevertheless plausible matchesbetween two methods, and Not Directly Observable conse-quences (NDO) (Connell et al., 2004). Therefore, false pos-itives and NDOs are restricted to expert analysis and maybe given less priority in revisions unless their frequencies orseverities give rise to other actions. An empirical study ofhow developers prioritize usability problems to correctand whether there is a dierence in problem revision strat-egies for two sets of usability problems being derived from


dierent usability evaluation methods should help shedlight on this issue.

6. Concluding remark

The framework for comparing evaluation methods thatwe have described in this paper can be reused by otherresearchers because of its thorough structure. The studycan be seen as a rst one of this framework but subsequentstudies may show that it needs to be improved. As a nalphase of this framework, we will in this section rene aresearch agenda for comparing and contrasting evaluationmethods.

While no conclusive claims about the two variables ofinterest two sets of usability heuristics and two mediaof problem reporting can be derived from the results ofthe present study, some implications for usability practitio-ners and researchers can be drawn. From the practicalpoint of view, a web tool for capturing and recordingusability problems in heuristic evaluation is recommended,especially when remote evaluation becomes increasinglypopular, thanks to the relentless expansion of the Internet.Besides, easy access to and eective management of thedata being captured in usability evaluation will enable col-laborative eorts of practitioners and researchers distribut-ed in dierent locations to work on common problems ofinterest. Further development of the tool creates opportu-nities for more intelligence in all aspects of evaluation,namely inspective, descriptive and analytical parts. In otherwords, the tool can improve the way we conduct the inspec-tion of user interfaces, the way we report usability prob-lems thus identied, and the way we consolidate a list ofusability problems (i.e. eliminating duplicates as well asFalse Alarms and accurate severity rating), enabling prac-titioners to prioritize and correct urgent problems.

Similar to our two sets of heuristics, the literature hasexamined two dierent types of heuristics. One type is syn-thesized, (Connell and Hammond, 1999; Nielsen, 1994a),that is, a heuristic set is created bottom up from a largerset of heuristics or types of usability problems. Anothertype of heuristics is dened from abstract theories (Horn-bk and Frkjr, 2004). The potential challenges thatevaluators have with the former type is that, whenresearchers or practitioners nd a common title for a syn-thesized heuristic, some details may be lost. With the sec-ond type, when the heuristics are described concretely,with one or two examples, evaluators may miss problemsif they are unable to understand the abstract description.

Our evaluators appreciated the help they receivedthrough examples in the tool. Learning by Examples isa well-researched topic in cognitive psychology (Renkland Atkinson, 2003). A caveat needs to be made that exam-ples themselves should be of good quality; otherwise, theywould impede rather than facilitate learning. Moreover,examples can rigidify how learners interpret heuristics,i.e. lower their free creative responses. Zhangs perspectivebased evaluation (Zhang et al., 1999) asks the evaluator to

view the human computer interaction with a certain per-spective, i.e. user type. Concrete examples, e.g. of noviceand expert users are given. This method has resulted in30% increase of usability problems over heuristics evalua-tion for 3 evaluators.

One recommendation that comes out of the work pre-sented in this paper is that better training schemes needto be devised. More good concrete examples need to beshown, but also help with understanding the abstractmeaning behind the heuristics, which will give evaluatorsenough freedom to identify problems not explicitly listedin the training material. Depending on the knowledgeand experience gaps being identied, training should beadapted and personalized to specic proles of individualevaluators. To help devise those examples and abstracttools, the following research question is posed:

How can usability heuristics help evaluators identifyproblems? What is the cognitive mechanism underpin-ning heuristic evaluation? Has the heuristic or principlenamed really guided the evaluator to uncover a usabilityproblem or the evaluator named it to justify her/hisbehaviour? Indeed, Cockton and Woolrych (2001)attempted to check the accuracy of heuristics that theirevaluators attributed to usability problems. The corre-sponding metric was coined as appropriateness. Accord-ingly, appropriate heuristic applications can bedetermined by correspondence between predicted di-culties and applicability criteria as stated in a HE train-ing manual (Lavery et al., 1996). Nonetheless, weassume that the reliability of such accuracy checks varieswith assessors level of expertise, both in the heuristicsand application domain.

One of the reasons that Gerhardt-Powals set of heuris-tics did not yield better outcomes than Nielsens may bethat it did not t the application of a brokerage system.Furthermore, with the high complexity of an applicationbut limited resource, evaluation thus needs to be scopedand goal-oriented. Somervells and McCrickards (2005)work focuses on the creation of so-called critical parame-ter-based heuristics, based on critical parameters proposedby Newman and Taylor (1999). For the class of large-scaledisplay systems Somervell and McCrickard identied threecritical parameters, namely Interruption, Reaction andComprehension. Clearly, the mapping of heuristics to thesystems characteristics is important. But Somervell andMcCrickard (2005) emphasize that mapping is not at thelevel of an individual system but a class of systems. Thisassertion aligns with Nielsens recommendation Further-more, it is possible to develop category-specic heuristics

that apply to a specic class of products as a supplement

to the general heuristics (Nielsen, 1994b, p. 29). He furthersuggests performing competitive analysis and user testingto create abstract categories of specic heuristics, but Som-ervell and McCrickard (2005) use claims-analysis as a basisfor heuristic creation. A registration tool may help collect


the most commonly found faults in this class of applica-tions or design features, depending on the granularity ofthe subject of evaluation. For brokerage systems, criticalparameters could be Relevance (search results), Ease ofupload/download (learning objects) and Security (intellec-tual property right). To further develop this issue we for-mulate another research question:

How should a set of usability heuristics be so selectedthat they can best t the context of the applicationdomain, the goal of the evaluation and availableresources? Is it possible to develop meta-guidelines toaddress this context tness?

One reason for the low thoroughness of the two heuris-tic sets may be the large set of features or task scenariosthat are described to the evaluator. Instead of inspectingan application for a few hours, a more iterative approachin a series of inspection sessions where a set of featuresor tasks are inspected each time may be more eective(Molich et al., 2004; Nielsen, 1994b). Inspection work isvery tedious and tiring, and it could be that at the end ofthe session evaluators stopped or showed decreased perfor-mance because of fatigue. An iterative inspection sessioncould reveal more dierences between the two heuristic setslists. Such an iterative approach may also give rise to adebrieng session in between iterations. Thus, evaluatorscan discuss problems discovered and, hence, raise theirunderstanding of the heuristics or the domain. A thirdresearch question we pose is:

Are evaluation results not only dependent on number ofevaluators but also on the number and duration of iter-ative inspection sessions?

In summary, there remain challenges for usabilitypractitioners and researchers to overcome. To cope withthe problem of generalizability and transferability acrosscontexts, extensive collaboration within the usabilitycommunity to conduct multi-site experiments and tosupport exchange of ideas and experiences is deemedessential.

References

Andre, T.S., Hartson, R.H., Belz, S.M., McCreary, F.A., 2001. The useraction framework: a reliable foundation for usability engineeringsupport tools. International Journal of Human-Computer Studies 54,107136.

Barnum, C.R., 2003. Whats in a number. Available from : , Accessed 2006,Society for Technical Communication.

Bastien, J.M.C., Scapin, D., Leulier, C., 1999. The Ergonomic Criteriaand the ISO 9241-10 Dialogue Principles: a comparison in anevaluation task. Interacting with Computers 11, 299322.

Blackmon, M.H., Kitajima, M., Polson, P.G., 2005. Tool for accuratelypredicting website navigation problems, non-problems, problemseverity, and eectiveness of repairs. In: Conference on HumanFactors in Computing Systems archive Proceedings of the SIGCHI

Conference on Human factors in Computing Systems, ACM, Port-land, Oregon, USA, pp. 3140.

Brooke, J., 1996. SUS: A quick and dirty usability scale. In: Jordan, P.W.,Thomas, B., Weerdmeester, B.A., McClelland, I.L. (Eds.), UsabilityEvaluation in Industry. Taylor & Francis, London, pp. 189194.

Chattratichart, J., Brodie, J., 2004. Applying user testing data to UEMperformance metrics. In: CHI 2004, ACM, Vienna, Austria, pp. 11191122.

Cockton, G., Lavery, D., 1999. A framework for usability problemextraction. In: Sasse, A., Johnson C. (Eds.), INTERACT 1999, pp.347355.

Cockton, G., Woolrych, A., 2001. Understanding inspection methods:lessons from an assessment of heuristic evaluation. In: Blandford, A.,Vanderdonckt, J., Gray, P.D. (Eds.), People and Computers XV.Springer-Verlag, Lille, France, pp. 171182.

Cockton, G., Lavery, D., Woolrych, A., 2003a. Inspection-based evalu-ation. In: Jacko, J.A., Sears, A. (Eds.), The Human-ComputerInteraction Handbook, Lawrence Erlbaum Associates, NJ.

Cockton, G., Woolrych, A., Hall, L., Hindmarch, M., 2003b. Changinganalysts tunes: the surprising impact of a new instrument for usabilityinspection method assessment. In: Palangue, P., Johnson, P., ONeill,E. (Eds.), HCI 2003. Springer-Verlag, Bath, pp. 145162.

Cockton, G., Woolrych, A., Hindmarch, M., 2004. Reconditionedmerchandise: extended structured report formats in usability inspec-tion. In: CHI2004, ACM, Vienna, Austria, pp. 14331436.

Connell, I.W., Hammond, N.V., 1999. Comparing usability evaluationprinciples with heuristics. In: Sasse, A., Johnson C. (Eds.), Proceedingsof the 7th IFIP international conference on Human-computer Inter-action, INTERACT99, IOS Press, Edinburgh.

Connell, I., Blandford, A., Green, T., 2004. CASSM and cognitivewalkthrough: usability issues with ticket vending machines. Behaviourand Information Technology 23, 307320.

Cook, D.J., Bailey, B.P., 2005. Designers use of paper and theimplications for informal tools. In: OZCHI 2005, vol. 122, ACM,Canberra, Australia, pp. 110.

Desurvire, H.W., Kondziela, J.M., Atwood, M.E., 1992. What is gainedand lost using evaluation methods other than empirical testing. In:Monk, A., Diaper, D., Harrison M.D. (Eds.), HCI, Proceedings of theConference on People and Computers VII, pp. 89102.

Dillon, A., Watson, C., 1996. User analysis HCI the historical lessonsfrom individual dierences research. International Journal of Human-Computer Studies 45, 619638.

Doubleday, A., Ryan, M., Springett, M., Sutclie, A., 1997. A comparisonof usability techniques for evaluating design. In: DIS97, ACM Press,Amsterdam, pp. 101110.

Faulkner, L., 2003. Beyond the ve-user assumption: benets of increasedsample sizes in usability testing. Behavior Research Methods, Instru-ments, and Computers 35, 379383.

Folmer, E., Bosch, J., 2004. Architecting for usability: a survey. TheJournal of Systems and Software 70, 6178.

Frkjr, E., Larusdottir, M.K., 1999. Predictions of usability: comparingmethod combinations. In:Managing InformationTechnologyResourc-es in Organizations in the Next Millennium, Idea group publishing.

Gerhardt-Powals, J., 1996. Cognitive engineering principles for enhancinghuman-computer performance. International Journal of Human-Computer Interaction 8, 189211.

Gladwell, M., 2002. The social life of paper. In: The New Yorker, TheNew Yorker Magazine , Newyork, NY, pp. 9296.

Gray, W.D., Salzman, M.C., 1998. Damaged merchandise? Human-Computer Interaction 13, 203262.

Hartson, H.R., Andre, T.S., Williges, R.C., 2001. Criteria for evaluatingusability evaluation methods. International Journal of Human-Com-puter Interaction 13, 373410.

Hertzum, M., Jacobsen, N.E., 2001. The evaluator eect: a chilling factabout usability evaluation methods. International Journal of Human-Computer Interaction 13, 421443.

Holzinger, A., 2005. Usability engineering methods for software develop-ers. Communication of the ACM 48, 7174.


Hornbk, K., Frkjr, E., 2004. Usability inspection by metaphors ofhuman thinking compared to heuristic evaluation. InternationalJournal of Human-Computer Interaction 17, 357374.

Jeries, R., Miller, J.R., Wharton, C., Uyeda, K.M., 1991. User interfaceevaluation in the real world: a comparison of four techniques. In:ACM CHI91, New Orleans, LA, pp. 119124.

John, B.E., Mashyna, M.E., 1997. Evaluating a multimedia authoringtool. Journal of the American Society for Information Science 48,10041022.

Karat, C., Campbell, R., Fiegel, T., 1992. Comparison of empirical testingand walkthrough methods in user interface evaluation. In: ACMCHI92 Conference, Monterey, California, pp. 397404.

Lavery, D., Cockton, G., Atkinson, M.P., Heuristic Evaluation. UsabilityEvaluation Materials, , 1996.

Law, E.L.-C., Hvannberg, E.T., 2004a. Analysis of strategies forimproving and estimating the eectiveness of heuristic evaluation. In:NordiCHI 2004, Tampere, Finland, pp. 241250.

Law, E.L.-C., Hvannberg, E.T., 2004b. Analysis of the combinatorial usereect of international usability tests. In: CHI 2004, ACM, Vienna,Austria, pp. 916.

Lewis, J.R., 1991. Psychometric evaluation of an after-scenario question-naire for computer usability studies: the ASQ. In: ACM SIGCHIBulletin, vol. 23, pp. 7881.

Manko, J., Dey, A.K., Hsieh, G., Kientz, J., Lederer, S., Ames, M., 2003.Heuristic evaluation of ambient displays. In: CHI 2003, ACM Press,Florida, USA., pp. 169176.

Molich, R., Jeries, R., 2003. Comparative Expert Reviews. In: CHI 2003:New Horizons, ACM, Ft. Lauderdale, Florida, USA, pp. 10601061.

Molich, R., Ede, M.R., Kaasgaard, K., Karyukin, B., 2004. Comparativeusability evaluation. Behaviour and Information Technology 23, 6574.

Newman, W., Taylor, A., 1999. Towards a methodology employingcritical parameters to deliver performance improvements in interactivesystems. In: Sasse, M.A., Tauber M. (Eds.), INTERACT99, 7th IFIPTC.13 International Conference on Human-Computer Interaction,Edinburgh, Scotland, pp. 605612.

Nielsen, J., 1993. Usability Engineering. Academic Press, New York.Nielsen, J., 1994a. Enhancing the explanatory power of usability heuris-

tics. In: CHI94, ACM, Boston, Massachusetts USA, pp. 152158.Nielsen, J., 1994b. Heuristic evaluation. In: Nielsen, J., Mack, R.L. (Eds.),

Usability Inspection Methods. John Wiley & Sons, pp. 2562.

Nielsen, J., 2000. Why you only need to test with 5 users? Available from:, Accessed 2005.

Nielsen, J., Molich, R., 1990. Heuristic evaluation of user interfaces. In:Proceedings ACM CHI90 Conference, ACM, Seattle, WA, pp. 249256.

Norman, D.A., 1986. Cognitive engineering. In: Norman, D.A., Draper,S.W. (Eds.), User Centered System Design: New Perspectives onHuman-Computer Interaction. Lawrence Erlbaum Associates, Hills-dale, NJ, pp. 3161.

Renkl, A., Atkinson, R.R., 2003. Structuring the transition from examplestudy to problem solving in cognitive skills acquisition: a cognitiveload perspective. Educational Psychologists 38, 1522.

Rieman, J., Davies, S., Hair, D.C., Esemplare, M., Polson, P., Lewis, C.,1991. An automated cognitive walkthrough. In: Proceedings of theSIGCHI conference on Human factors in computing systems: Reach-ing through technology, ACM, New Orleans, Louisiana, UnitedStates, pp. 427428.

Rosenbaum, S., Ronn, J.A., Humburg, J., 2000. A Toolkit for StrategicUsability: Results from Workshops, Panels, and Surveys. In: Turner,T., Szwillus, G., Czerwinski, M., Paterno F. (Eds.), CHI2000, ACM,Hague, Amsterdam, pp. 337344.

Somervell, J., McCrickard, D.S., 2005. Better discount evaluation:illustrating how critical parameters support heuristic creation. Inter-acting with Computers 17, 592612.

Somervell, J., Wahid, S., McCrickard, D.S., 2003. Usability heuristics forlarge screen information exhibits. In: Rauterberg, M., Menozzi, M.,Wesson J. (Eds.), INTERACT 2003, Zurich, Switzerland., pp. 904907.

Spool, J., Schroeder, W., 2001. Testing web sites: ve users is nowherenear enough. In: CHI 01 Extended Abstracts on Human factors inComputing Cystems, Seattle, Washington, pp. 285286.

Virzi, R.A., 1992. Rening the test phase of usability evaluation: Howmany subjects is enough? Human Factors 34, 457468.

Woolrych, A., Cockton, G., 2001. Why and when ve test users arentenough. In: Vanderdonckt, J., Blandford, A., Derycke, A. (Eds.),IHM-HCI, vol. 2. Toulouse, France, pp. 105108.

Woolrych, A., Cockton, G., Hindmarch, M., 2004. Falsication testingfor usability inspection method assessment. In: Fincher, S., Marko-poulos, P., Moore, D., Ruddle R. (Eds.), HCI, BCS, Bath.

Zhang, Z., Basili, V., Shneiderman, B., 1999. Perspective-based usabilityinspection: an empirical validation of ecacy. Empirical SoftwareEngineering 4 (1), 4369.


Heuristic evaluation: Comparing ways of .nding and reporting usability problemsIntroductionMaterials and methodsResultsComparing two empirical studiesDiscussionConcluding remarkReferences

Documents

User Action Framework