

TECHNICAL REPORT*: TR-2020-12-03

App Review Analysis for Software Engineering: A Systematic Literature Review

Jacek Dąbrowski†, Emmanuel Letier, Anna Perini, and Angelo Susi

Abstract—Analysing app reviews has become an active area of software engineering research in the last decade. App reviews provide a rich source of on-line user feedback that can guide different software engineering activities. A large and diverse research effort has been undertaken to understand what relevant information can be found in app reviews; how the information can be mined using manual and automatic approaches; and how the information can help software engineers. However, this knowledge is scattered in the literature, and consequently there is no thorough view on how mining app reviews can support software engineering. To consolidate the knowledge, we have performed a systematic literature review of the software engineering literature covering 149 papers published between 2010 and 2019. In this survey, we provide a comprehensive overview of the approaches and techniques for review mining, their empirical evaluation, as well as a thorough synthesis of software engineering activities that could be supported. The results show that review mining approaches may potentially help software engineers in their requirements, design, testing and maintenance activities; researchers should, however, pay more attention to justifying the goals and use cases of their approaches. It should also be clarified whether existing techniques are good enough to be useful in practice; evaluation of these techniques should go beyond standard information retrieval metrics and also assess their impact on software engineering processes. Finally, empirical evaluations should continue to improve in scale and reproducibility. This survey can be of interest to researchers and practitioners who are looking to use and to study app review analysis in a software engineering context. It can help to identify gaps in the current research, suggest areas for further investigation, and provide a background to position new research activities.

Index Terms—App store analysis, mining app reviews, user feedback, mining software repository, software engineering.


1 INTRODUCTION

WITHIN the last ten years app stores have successfully distributed millions of mobile phone apps [1]. The number of apps available in leading app stores (e.g., Google Play Store and Apple Store) exceeds 5 million in 2020 [2]. Software users are, more than ever, provided with access to large numbers of diverse mobile applications. With over 2.7 billion smartphone users across the world, it is no surprise that the mobile app industry is thriving [3]; mobile apps are anticipated to generate more than 935 billion USD in 2023 through in-app advertising and paid downloads [4]. This lucrative business makes it imperative to engineer competitive software products that satisfy ever-increasing user expectations [5].

App store platforms provide a rich source of data that can potentially help developers to improve their app products as well as elevate app construction [1]. In 2017, Martin et al. coined the term 'app store analysis' to denote the widespread research effort making use of app store data for software engineering [1]. Presumably, the quantity, availability and usefulness of app store data have contributed to the increase of scholars' interest in analysing app store data. In particular, user-submitted feedback (a.k.a. app reviews) has attracted major research attention in app store analysis [1]. As of 2015, research making use of app reviews accounted for 24% of the studies in app store analysis [1].

• *This document should not be distributed, remixed, adapted, and used in any medium or format without the authors' permission.

• †J. Dąbrowski is the corresponding author. He is with University College London and Fondazione Bruno Kessler. E-mail: [email protected]

• E. Letier is with University College London.
• A. Perini and A. Susi are with Fondazione Bruno Kessler.

Manuscript received Month X, Year; revised Month Y, Year.


1.1 Motivation and Problem Description

App reviews are a rich source of on-line user feedback that can affect different software engineering practices [5]; for instance, in requirements engineering, analyzing the feedback can help engineers to elicit new requirements about app features that users desire [6], [7]; in testing, in addition to finding bugs [8], [9], studying reviews can inform developers about general users' reactions to released beta versions of their apps [10], [11]; whereas, in maintenance, examining user comments may help to identify modification requests [12], [13] and prioritise expected enhancements [14].

A large and diverse research effort has been made to study what relevant information can be found in app reviews; how the information can be analysed using manual and automatic approaches; and how it can help software engineers. However, this knowledge is scattered in the literature, and consequently there is no clear view on how mining app reviews can support software engineering. There exist surveys reviewing the literature on app review mining [1], [15], [16], [17]. These works have different scopes and review the literature at varied levels of depth; their objectives are to provide an overview of app store analysis research [1], [17], or to study specific techniques and tools for review mining [15], [16]. None of these studies, however, provides a comprehensive synthesis on mining app reviews in the context of software engineering. Consequently, there is a limited understanding of why analysing app reviews can be useful for software engineers; to what extent review mining approaches can support software engineering practices; how


good the performance of these approaches is; and how strong the evidence supporting the usefulness of these approaches in practice is. This calls for a consolidation and systematization of the knowledge.

1.2 Research Approach and Contribution

To consolidate the knowledge and provide a comprehensive overview of the current state of the art, this paper presents a systematic literature review on how mining app reviews supports software engineering. With this survey we aim to achieve the following objectives:

• Provide an overview of review analyses that mining approaches facilitate and synthesize knowledge on techniques employed to realize the analyses.

• Provide a summary of software engineering scenarios supported by mining app reviews and synthesize information on how the support is realized.

• Outline how state-of-the-art approaches mining app reviews have been empirically evaluated in the context of intended software engineering scenarios.

To accomplish these objectives and consolidate this knowledge, we conducted a systematic literature review, following a well-defined methodology that identifies, evaluates, and interprets the relevant studies with respect to specific research questions [18], [19]. After a systematic selection and screening procedure, we ended up with a set of 149 papers, covering the period 2010 to 2020, that were carefully examined to answer the research questions.

The primary contributions of the study are: (i) a synthesis of approaches and techniques for mining app reviews, (ii) new knowledge on how software engineering scenarios can be supported by mining app reviews, (iii) a summary of the empirical evaluation of review mining approaches, and finally (iv) a study of literature growth patterns, gaps, and directions for future research.

1.3 Organization

The rest of the paper is organized as follows: Section 2 describes the research methodology followed in this study, including research questions, literature search and selection, and data extraction. Section 3 reports on the analysis and results obtained. A discussion and interpretation of the results take place in Section 4. Section 5 comments on the threats to validity, and Section 6 discusses related work. Finally, Section 7 concludes the paper.

2 RESEARCH METHOD

To conduct our systematic literature review, we followed the methodology proposed by Kitchenham et al. [18]. We first defined research questions and prepared a review protocol, which guided our conduct of the review and the collection of data. We then performed the literature search and selection based on agreed criteria. The selected studies were read thoroughly, and data items as in Table 2 were collected using a data extraction form. Finally, we synthesized the results for reporting.

2.1 Research Questions

The primary aim of the study is to understand how analysing app reviews can support software engineering. Based on this objective, the following research questions have been derived:

• RQ1: What are the different types of app review analyses?

• RQ2: What techniques are used to realize app review analyses?

• RQ3: Which software engineering activities can be supported by analysing app reviews?

• RQ4: How are mining approaches empirically evaluated?

• RQ5: How well can mining approaches support software engineers?

In answer to RQ1, we aim to provide an overview of the review analyses that primary studies perform to obtain useful information from app reviews and support software engineering. In particular, with RQ1 we derive insight on how these analyses can be classified, what objectives these analyses try to accomplish, and what knowledge these analyses can facilitate. In answer to RQ2, we aim to synthesize information on the techniques primary studies employ to facilitate different review analyses. We formulated RQ3 to identify activities in the software development life cycle which primary studies claim to support by review analysis. Specifically, we provide a thorough synthesis of software engineering scenarios; we explain how mining information from app reviews can guide specific software engineering activities. RQ4 aims to understand how primary studies obtain empirical evidence about the effectiveness and perceived quality of their review analysis approaches. Finally, with RQ5 we seek to understand the extent to which mining approaches can be used in practice. To answer this question, we provide an overview of the reported effectiveness of the approaches as well as their user-perceived quality.

2.2 Literature Search and Selection

We followed a systematic search and selection process to collect relevant literature published between January 2010¹ and January 2020. Figure 1 outlines the process as a PRISMA diagram²; it illustrates the main steps of the process and their outcomes (the number of publications).

The initial identification of publications was performed using keyword-based search on six major digital libraries: ACM Digital Library, IEEE Xplore Digital Library, Springer Link Online Library, Wiley Online Library and Elsevier Science Direct. We defined two search queries that we applied to both the meta-data and the full-text (when available) of the publications. To construct the first query, we looked at the content of several dozen publications analysing reviews for software engineering.³ We identified key terms that these papers share and used the terms to formulate a specific query:

1. We selected 2010 to be the initial period of our search as the earliest study had been reported that year [1].

2. A description of the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) method can be found in [20].

3. We identified the papers from previous surveys on app store analysis research [1].


[PRISMA flow (Identification, Screening, Included): 1,506 publications identified from digital libraries; 272 duplicates removed; 1,234 publications screened; 1,133 publications excluded; 101 publications meeting inclusion criteria; 11 publications included from issue-to-issue search and 37 from snowballing; 149 publications surveyed.]

Fig. 1: PRISMA diagram showing study search and selection.

(app review mining OR mining user review OR review mining OR review analysis OR analyzing user review OR analyzing app review) AND (app store)

To not omit other relevant papers not covered by this specific query, we formulated a general query based on phrases reflecting key concepts of our research objective:

(app review OR user review OR app store review OR user feedback) AND (software engineering OR requirement engineering OR software requirement OR software design OR software construction OR software testing OR software maintenance OR software configuration OR software development OR software quality OR software coding) AND (app store)

The initial search via digital libraries resulted in 1,506 studies in total, of which 272 were duplicates. We screened the 1,234 studies obtained through the initial search and selected those meeting the following inclusion criteria: (i) they were related to software engineering, and may have actionable consequences for engineers or researchers, (ii) they were related to mobile app stores, concerning the use of app reviews from an app store to support at least one software engineering activity (directly or indirectly) [21], (iii) they were peer-reviewed and published as conference, journal, or workshop papers or a book chapter. We excluded articles if they: (i) were not written in English; (ii) focused on analyzing app reviews but without the purpose of supporting software engineering; (iii) were secondary or tertiary studies (e.g., systematic literature reviews, surveys, etc.), technical reports or manuals. We used these criteria to screen the papers based on the title, abstract and full-text (if needed). To ensure the reliability of our screening process, four authors

TABLE 1: Selected conference proceedings and journals for manual search.

Venue (Abbr.)
International Conference on Software Engineering (ICSE)
European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE)
International Conference on Automated Software Engineering (ASE)
International Conference on Software Maintenance (ICSM)
Conference on Advanced Information Systems Engineering (CAiSE)
International Requirements Engineering Conference (RE)
IEEE Transactions on Software Engineering (TSE)
ACM Transactions on Software Engineering and Methodology (TOSEM)
IEEE Software (IEEE SW)
Empirical Software Engineering (EMSE)
Information and Software Technology (IST)
Requirements Engineering Journal (REJ)

TABLE 2: Data Extraction Form

Item ID / Field / Use
F1  Author(s) / Documentation
F2  Year / Documentation
F3  Title / Documentation
F4  Venue / Documentation
F5  Citation / Documentation
F8  Review Analysis / RQ1
F9  Mining Technique / RQ2
F10 Software Engineering Scenario / RQ3
F11 Justification / RQ3
F12 Evaluation Objective / RQ4
F13 Evaluation Procedure / RQ4
F14 Evaluation Metrics and Criteria / RQ4
F15 Evaluation Result / RQ5
F16 Annotated Dataset / RQ4
F17 Annotation Task / RQ4
F18 Number of Annotators / RQ4
F19 Quality Measure / RQ4
F20 Replication Package / RQ4

collaboratively classified a sample of 20 papers (each paper was assigned to two authors). We then assessed their inter-rater agreement (Cohen's Kappa > 0.9) [22].
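
For readers unfamiliar with the agreement measure, the following minimal sketch (with hypothetical labels, not our actual screening data) shows how Cohen's kappa quantifies agreement between two raters beyond chance.

```python
# Minimal sketch of the inter-rater check: two authors label the same sample
# of papers, and Cohen's kappa measures chance-corrected agreement.
from sklearn.metrics import cohen_kappa_score

rater_a = ["include", "include", "exclude", "include", "exclude", "include"]
rater_b = ["include", "include", "exclude", "include", "include", "include"]

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa = {kappa:.2f}")  # values above 0.9 indicate near-perfect agreement
```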

Due to the conservative search, the majority of the studies were found to be unrelated to the scope of the survey. We excluded 1,133 publications that did not meet the inclusion criteria. Subsequently, we complemented our search process with two other strategies to find relevant papers that could have been missed in the initial search. We performed a manual issue-by-issue search of major conference proceedings and journals in software engineering in the period from January 2010 to January 2020. The searched journals and proceedings are listed in Table 1. That step produced another 11 unique publications. Finally, we completed the search with a snowballing procedure following Wohlin's guidelines [23]. We performed backward snowballing considering all the references from relevant studies found by the previous search strategies. Moreover, we conducted forward snowballing based on the 10 most cited papers. Using the snowballing procedure, an additional 37 relevant articles were identified. Accordingly, we ended up with 149 articles included in the survey.

2.3 Data Extraction

The first author created a data extraction form to collect detailed contents for each of the selected studies. They used


the extracted data items to synthesize information from primary studies and answer research questions RQ1-RQ5. Table 2 lists the data items the first author extracted. We here provide descriptions of the extracted data items:

• F1-F5: These data items are used for documentation, to keep meta-information of papers and bibliographic references. For item F5 we record the citation count per paper based on Google Scholar (as of the 8th of July 2020).

• F8: Review analysis used in the study. We separated the collected data into: review analysis type (F8.1), e.g., review classification; mined information (F8.2), e.g., bug report; and subsidiary description (F8.3).

• F9: Technique used to realize a specific review analysis. We recorded the technique type (F9.1), e.g., machine learning, and its name (F9.2), e.g., Naïve Bayes.

• F10: Software engineering scenario supported by the study. We recorded the software engineering activity (F10.1), e.g., requirements elicitation, and the software engineering phase (F10.2) the activity pertains to, e.g., requirements. We referred to activities generally accepted in the software engineering community [21] and explicitly mentioned in a primary study, to avoid subjective and biased decisions.

• F11: We recorded the argumentation type (F11.1) a study provides about a supported software engineering scenario and the argumentation itself (F11.2). The argumentation type can be one of the following: claim, description, or none.

• F12: We collected information on the overall evaluation purpose (F12.1), i.e., effectiveness or user-perceived quality, as well as the analysis that the evaluation targeted (F12.2).

• F13: Evaluation procedure employed by a study to validate the overall approach or the review analysis the approach facilitates. The content of this data item details the evaluation steps.

• F14: Metrics used for quantitative assessment (effectiveness), e.g., precision, and criteria defined for qualitative evaluation (user-perceived quality), e.g., usability.

• F15: The result of the empirical evaluation. We recorded results with reference to evaluation metrics and criteria; we also recorded details about the review analysis these results concern, e.g., review classification with a precision of 52%.

• F16: The characteristics of the annotated dataset used in the study. We stored information about the app store from which reviews were collected (F16.1), e.g., Google Play, and the number of annotated reviews.

• F17: The task that human annotators performed when labeling a sample of app reviews, e.g., classifying reviews by discussed issue types.

• F18: The number of human annotators labeling app reviews for empirical evaluation.

• F19: The measure employed for assessing the reliability of the annotated dataset, e.g., Cohen's kappa.

• F20: The availability of a replication package for the study, i.e., yes or no. We also recorded details about whether one of the following artefacts was available: collected dataset, annotated dataset, or an implemented approach.

The reliability of data extraction was evaluated through inter- and intra-rater agreements [24]. The agreements were measured using percentage agreement [25]. To evaluate intra-rater agreement, the first author re-extracted data items from a random sample of 20% of the selected papers. An external assessor⁴ then validated the extraction results between the first and second rounds and computed the percentage agreement. To evaluate inter-rater agreement, the assessor cross-checked the data extraction; the assessor independently extracted data items from a new random sample of 10% of the selected papers. The first author and the assessor then compared their results and computed agreement. Both intra- and inter-rater agreements were higher than 90%, indicating nearly complete agreement [24].
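
For clarity, percentage agreement is simply the share of data items on which two extraction rounds (or raters) coincide; the sketch below uses hypothetical extracted values, not our actual data.

```python
# Sketch of the percentage-agreement measure used for the data-extraction check.
def percentage_agreement(items_a, items_b):
    """Share of data items on which two extraction rounds (or raters) agree."""
    assert len(items_a) == len(items_b)
    matches = sum(a == b for a, b in zip(items_a, items_b))
    return 100.0 * matches / len(items_a)

round_1 = ["classification", "machine learning", "requirements elicitation", "yes"]
round_2 = ["classification", "machine learning", "problem analysis", "yes"]
print(f"{percentage_agreement(round_1, round_2):.0f}% agreement")
```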

2.4 Data Synthesis

Most data in our review are grounded in qualitative research. As found by other researchers, tabulating the data is useful for aggregation, comparison, and synthesis of information [18]. The data were thus stored in spreadsheets, manually reviewed, and interpreted to answer the research questions. Parts of the extracted data were synthesized using descriptive statistics.

3 RESULT ANALYSIS

3.1 Demographics

Figure 2 presents the frequency of primary studies per year, including a breakdown by publication type (Journal, Conference, Workshop, and Book). The publication dates of primary studies range from 2012 to 2019⁵. We observed that 60% of the primary studies were published in the last 3 years, indicating a growing interest in research on analyzing app reviews to support software engineering.

The overall distribution of primary study types is presented in Figure 3. The majority of the studies have been published as conference or journal papers. The results revealed that primary studies were scattered across 70 venues. Table 3 lists the ten major venues in terms of the number of published papers⁶. The venues include prestigious conferences/journals in the software engineering community, accounting for 42% of all primary papers.

Key insights from demographics:

• The interest in research on app review analysis rose substantially in the last 3 years.

• The publications of the primary studies are scattered across different venues.

• The main venues publishing research in app review analysis include the main general software engineering conferences and journals (ICSE, FSE, ASE, TSE) as well as the main specialized venues in empirical software engineering (ESEM) and requirements engineering (RE).

4. The assessor has an engineering background and experience with manual annotation; they have no relationship with this research.

5. No study was published in 2010 and 2011.
6. The complete list of venues can be found in the replication package of this survey [26].


Fig. 2: Histogram showing the number of research papers with a breakdown by venue type in the period from 2010 to December 31, 2019. The data for the years 2010-11 are not presented as no study was published at that time.

Fig. 3: Pie chart showing the distribution of research papers per venue type in the period from 2010 to December 31, 2019.

TABLE 3: Top ten venues in which primary studies were published in the period from 2010 to December 31, 2019.

Venue / No. Studies
Empirical Software Engineering Journal (EMSE): 10
International Requirements Engineering Conference (RE): 9
International Conference on Software Engineering (ICSE): 6
International Symposium on Foundations of Software Engineering (FSE): 6
International Working Conference on Requirements Engineering (REFSQ): 6
International Conference on Automated Software Engineering (ASE): 6
Intl. Conference on Mobile Software Engineering and Systems (MOBILESoft): 5
International Workshop on App Market Analytics (WAMA): 5
IEEE Software (IEEE Softw): 5
IEEE Transactions on Software Engineering (TSE): 4

3.2 RQ1: App Review Analysis

In this section, we answer RQ1 (what are the different types of app review analysis) based on the data extracted in F8 (review analyses). To answer the question, we grouped data items into one of nine general categories, each representing a different review analysis type (F8.1). We used

categories previously proposed in the context of app store analysis [1] and text mining research [27]. Here, we focused on an abstract representation, because primary studies use various terminologies and levels of granularity when referring to these analyses. Table 4 lists the review analyses and shows the distribution of studies per analysis. The rest of this section provides definitions of each review analysis and a synthesis of the corresponding primary studies.

TABLE 4: Distribution of app review analyses performed in research papers in the period from 2010 to December 31, 2019.

App Review Analysis / No. Studies / Percentage
Information Extraction: 43 (29%)
Classification: 85 (57%)
Clustering: 37 (25%)
Search and Information Retrieval: 16 (11%)
Sentiment Analysis: 29 (19%)
Content Analysis: 39 (26%)
Recommendation: 24 (16%)
Summarization: 22 (15%)
Visualization: 18 (12%)

3.2.1 Information Extraction

App reviews come with unstructured textual content. Information relevant from a software engineering perspective can be found at different text locations. For instance, a problematic feature can be discussed in the middle of a sentence [7], or a requested improvement can be expressed anywhere in a review [28]. Clearly, identifying and extracting text snippets manually is not cost-effective [29]. To address the problem, 43 of the primary studies (29%) proposed approaches facilitating information extraction. Formally, information extraction is the task of extracting specific (pre-specified) information from the content of a review; this information may concern app features [6], [7], qualities [30], problem reports and/or new feature requests (e.g., [12], [31], [32]), as well as user opinions about favored or unfavored features (e.g., [7], [10], [29], [33]).
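
As a purely illustrative sketch (the review text and the part-of-speech pattern are invented, not taken from any primary study), extraction of candidate feature mentions can be approximated by keeping adjective/noun pairs:

```python
# Illustrative sketch of extraction via part-of-speech patterns:
# adjective/noun + noun pairs are kept as candidate feature mentions.
import nltk
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

review = "The photo upload keeps crashing and the dark mode drains my battery."
tokens = nltk.word_tokenize(review.lower())
tagged = nltk.pos_tag(tokens)

candidates = [
    f"{w1} {w2}"
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:])
    if t1.startswith(("JJ", "NN")) and t2.startswith("NN")  # e.g., "photo upload"
]
print(candidates)
```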

3.2.2 Classification

Classification can be seen as the task of assigning predefined categories to reviews or their textual snippets (e.g., sentences or phrases). Generally, review classification is widely used to address the problem of filtering out informative reviews from those containing noise (e.g., [34], [35], [36]), spam [37] and fake information [38]. Informative reviews can be subsequently classified to detect the user intentions these reviews convey (e.g., [11]), or the topics users discuss (e.g., [39]). Detecting user intentions is typically realized by grouping reviews into those reporting problems or asking for new features [40], [41].

The topics these reviews discuss can vary from application aspects such as installation problems, user interface, or price [42], [43]; topics concerning user perception, e.g., rating, user experience or praise [44]; to topics reporting different types of issues [45], [46]. For instance, review classification has been proposed to detect different types of usability and user experience issues [47], [48], quality concerns [49]

Page 6: TECHNICAL REPORT App Review Analysis for Software ...TECHNICAL REPORT : TR-2020-12-03 1 App Review Analysis for Software Engineering: A Systematic Literature Review Jacek Dabro˛ wski†,

TECHNICAL REPORT⇤: TR-2020-12-03 6

or different types of security and privacy issues [50]. Similarly, app store feedback can be classified by the type of requirements reported [51], [52], [53], [54], [55]. This can help distinguish reviews reporting functional requirements from those reporting non-functional requirements [51], [52], [53], and distil non-functional requirements into fine-grained quality categories such as reliability, performance, or efficiency [54], [55]. Another key use of the classification task is rationale mining; it involves detecting the types of argumentation and justification users describe in reviews when making certain decisions, e.g., about upgrading, installing, or switching apps [56], [57].
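
A minimal sketch of the classification task is given below; the toy reviews and labels are hypothetical, and Naïve Bayes over a bag-of-words representation is used only as one of the techniques commonly reported in the surveyed studies.

```python
# Sketch: routing reviews into predefined categories such as
# bug report vs. feature request with a bag-of-words Naive Bayes classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_reviews = [
    "The app crashes every time I open the camera",
    "Login fails after the latest update",
    "Please add a dark mode option",
    "It would be great to support offline playlists",
]
train_labels = ["bug report", "bug report", "feature request", "feature request"]

clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
clf.fit(train_reviews, train_labels)
print(clf.predict(["The video player freezes on startup"]))  # -> ['bug report']
```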

3.2.3 Clustering

Clustering is the task of organizing reviews, sentences, and/or snippets into groups (called clusters) whose members are similar in some way. Members in the same group are more similar (in some sense) to each other than to those in other groups. As opposed to the classification task, in clustering there is no prior knowledge about group labels. Clustering is thus widely used as an exploratory analysis to suggest topics commonly discussed by users [7], [44], [58], [59] and to aggregate reviews containing semantically related information [34], [60], [61]. The analysis, for instance, is widely used to cluster reviews requesting the same features [36], [62], reporting similar problems [13], [14], [63], or providing opinions with respect to the same app aspect [64], [65]. Such generated clusters help software engineers to synthesize information from a group of reviews referring to the same topics rather than examining each user comment separately [28], [32], [66].
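
The following sketch (toy data, one possible technique among those surveyed) illustrates the idea: reviews are embedded as TF-IDF vectors and grouped so that comments about the same topic receive the same cluster id.

```python
# Sketch of the clustering task over a handful of invented reviews.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

reviews = [
    "notifications stopped working after the update",
    "I no longer receive any notifications",
    "please add a dark theme",
    "a dark mode would be easier on the eyes",
]
X = TfidfVectorizer(stop_words="english").fit_transform(reviews)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # reviews sharing a topic receive the same cluster id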

3.2.4 Search and Information Retrieval

Search and information retrieval is the task of finding and tracing reviews (or their textual snippets) that match needed information. The task can be used, e.g., to find reviews discussing a queried app feature [29], [67], [68], to obtain the most diverse user opinions in reviews [60], or to trace which features described in the app description are discussed by users [6], [69]. In particular, detecting linkage between app reviews and other software artefacts has become an important application of the task in app store research [70], [71]. The task is, for example, used to detect links between reviews and source code [61], stack traces [72], issues from tracking systems [70], [73], or warnings from static analysis tools [74] in order to locate problems in source code [61], [75], [76], suggest potential changes [61], [70], or flag errors and bugs in an application under test [74].
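
As an illustration (invented reviews and query; cosine similarity over TF-IDF is just one common retrieval choice), a feature query can be matched against reviews and the results ranked by similarity:

```python
# Sketch of search and information retrieval over reviews.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

reviews = [
    "the photo upload is painfully slow",
    "love the new interface design",
    "uploading pictures fails on wifi",
]
query = "photo upload"

vectorizer = TfidfVectorizer()
review_vectors = vectorizer.fit_transform(reviews)
query_vector = vectorizer.transform([query])

scores = cosine_similarity(query_vector, review_vectors).ravel()
for score, text in sorted(zip(scores, reviews), reverse=True):
    print(f"{score:.2f}  {text}")
```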

3.2.5 Sentiment Analysis

Sentiment analysis (also known as opinion mining) refers to the task of interpreting user emotions in app reviews. The task aims at detecting the sentiment polarity of a text (e.g., positive, neutral, or negative), whether it is a whole review [77], [78], a sentence [7], [41], or a phrase [10]. In particular, sentiment analysis attracts widespread interest due to its practical applications in mining user opinions. App reviews are a rich source of user opinions [7], [78], [79], [80]. Mining these opinions involves identifying user sentiment about discussed topics [10], features [7] or software qualities [47], [79]. These opinions can help developers to understand how users perceive their app [7], [10], [81], what users' requirements are [67], [82], what users' preferences are [7], [47], [80], [83], and what affects the sales of their mobile apps [84]. Not surprisingly, knowing user opinions is an important information need developers seek to satisfy [85], [86].

3.2.6 Content Analysis

Content analysis is used to study the presence of certain words, themes, or concepts within app reviews. Scholars analyze review content to characterize and quantify the presence and relationships of such words, themes, and concepts. Several studies, for example, examine review vocabulary, review size and the relationship between user rating and review length [87], [88]. It has been demonstrated that users discuss diverse topics in reviews [44], such as app features, qualities [89], requirements [52], [55] or issues [45], [48]. Using content analysis, scholars, for example, discover recurring types of issues reported by users [46], their distribution in reviews, as well as relations between app issue type and other information such as price or rating [90], [91]. Interestingly, studies have pointed out that users' perception of the same apps can vary per country [92], user gender [93], development framework [94], or the app store in which the apps were released [95]. Content analysis can also help software engineers to understand whether their cross-platform apps achieve consistency of users' perceptions across different app stores [96], [97], or whether hybrid development tools achieve their main purpose: delivering an app that is perceived similarly by users across platforms [97]. Finally, studying the dialogue between users and developers has shown evidence that the chances of users updating their rating for an app increase as a result of developers' responses to reviews [98], [99].

3.2.7 Recommendation

The recommendation task aims to suggest courses of action that software engineers should follow. Several mining approaches, for instance [14], [34], [100], have been proposed to recommend reviews that software engineers should investigate. These approaches typically assign priorities to a group of comments reporting the same bug [13], [28], [101], requesting the same modification [14], [100], or asking for the same improvement [102]. Such assigned priorities indicate the relative importance, from the users' perspective, of the information that these reviews convey. Factors affecting the importance vary from, e.g., the number of reviews in these groups [34], to the influence of this feedback on app downloads [103], and the overall sentiment these comments convey [104]. In line with this direction, mining approaches have been elaborated to recommend feature refinement plans for the next release [104], [105], to highlight static analysis warnings that developers should check [74], to indicate mobile devices that should be tested [106], and to suggest reviews that developers should reply to [107], [108]; the approaches can analogously recommend responses for these reviews [107], [108], stimulating users to upgrade their ratings or to revise feedback to be more positive [98], [109].
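
A toy sketch of this idea follows: clusters of reviews reporting the same issue are ranked by how many users raised them and how negative the feedback is. The weighting and the cluster data are illustrative assumptions, not taken from any particular study.

```python
# Sketch: rank review clusters so engineers see the most pressing group first.
clusters = [
    {"topic": "login crash", "num_reviews": 120, "avg_sentiment": -0.8},
    {"topic": "dark mode request", "num_reviews": 45, "avg_sentiment": 0.1},
    {"topic": "slow sync", "num_reviews": 60, "avg_sentiment": -0.4},
]

def priority(cluster):
    # more reviews and more negative sentiment -> higher priority
    return cluster["num_reviews"] * (1.0 - cluster["avg_sentiment"])

for c in sorted(clusters, key=priority, reverse=True):
    print(f"{priority(c):7.1f}  {c['topic']}")
```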


3.2.8 Summarization

Review summarization aims to provide a concise and precise summary of one or more reviews, which would be costly to produce manually. The task has been carried out to wrap up the content of reviews along different facets. The literature offers several examples of summarizing reviews based on commonly discussed topics, user intentions with respect to these topics, or the overall user sentiment per topic (e.g., [7], [43]). For example, Di Sorbo et al. proposed summarizing thousands of user reviews in an interactive report. The summary can tell software engineers what maintenance tasks need to be performed (e.g., bug fixing or feature enhancement) with respect to specific topics discussed in reviews (e.g., UI improvements) [36], [39]. In a similar manner, wrapping up reviews gives developers a quick overview of users' perception specific to the core features of their apps [7], [31], software qualities [43], and/or the main users' concerns [9], [75], [110]. With the addition of statistics, e.g., the number of reviews discussing each topic or requesting specific changes, such a summary can help developers to prioritize their work by focusing on the most important modifications [75]. In addition, such a summary can be exported to other software management tools (e.g., GitHub, JIRA) [9] to generate new issue tickets and help in problem resolution [111].

3.2.9 Visualization

Visualization can aid developers in identifying patterns, trends and outliers, making it easier to interpret information mined from reviews [59]. To communicate information clearly and efficiently, review visualization uses tables, charts, and other graphical representations [11], [59], accompanied by numerical data [11], [47]. Maalej et al., for example, demonstrated that trend analysis of review types (e.g., bug report, feature request, user experience) over time can be used by software engineers as an overall indicator of how the project is going [11]. Other studies propose visualizing the dynamics of the main themes discussed in reviews to identify emerging issues [12], [13], [28], [112], or to show the issue distribution for an app across different app stores [101]. Simple statistics about these issues (e.g., how many reviews reported specific issues?) can give an overall idea of the main problems, in particular if compared against other apps (e.g., do users complain more about the security of my app compared to similar apps?). Similarly, analyzing the evolution of user opinions or reported bugs referring to specific features could help developers to monitor the health of these features and to prioritize maintenance tasks [29], [47], [64], [113]. For instance, developers who spotted negative opinions mentioning certain features could benchmark how often these opinions emerge, for how long these opinions have been reported, and whether their frequency is rising or declining [10], [29]. This information could provide developers with evidence of the relative importance of these opinions from a users' perspective [47], [67].
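
A small sketch of such a trend visualization follows; the monthly counts are invented and serve only to illustrate plotting review types over time as a project-health indicator.

```python
# Sketch of a trend chart: how many reviews of each type arrive over time.
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
bug_reports = [12, 18, 25, 40, 33, 20]
feature_requests = [8, 9, 11, 10, 14, 16]

plt.plot(months, bug_reports, marker="o", label="bug reports")
plt.plot(months, feature_requests, marker="s", label="feature requests")
plt.ylabel("number of reviews")
plt.legend()
plt.title("Review types over time")
plt.show()
```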

RQ1: App Review Analysis

• 9 broad types of review analyses have been identified in the literature: (1) information extraction; (2) classification; (3) clustering; (4) search and information retrieval; (5) sentiment analysis; (6) content analysis; (7) recommendation; (8) summarization and (9) visualization.

• Review classification, clustering, and information extraction are the most frequently applied tasks; they help to group reviews, discover hidden patterns and focus on relevant parts of reviews.

• Content analysis is used to characterize reviews, to identify discussed topics, and to explore information needs that can be satisfied by the feedback.

• Search and information retrieval aids software engineers in querying reviews for information of interest, and in tracing it to other software artefacts (e.g., stack traces or issue tracking systems).

• Summarizing and visualizing information scattered across a large amount of reviews can aid developers in interpreting information that would be costly and time-consuming to digest manually.

• Mined information is commonly used to recommend to engineers a course of maintenance action, e.g., bugs in need of urgent intervention, or to localize problems in the source code.

3.3 RQ2: Mining Techniques

App review analyses (see Section 3.2) are realized by different text mining techniques. In this section, we address RQ2 (what techniques are used to realize app review analyses) based on the data extracted in F9 (mining technique). To answer the question, we look at manual analysis, natural language processing, machine learning and statistical analysis techniques.

3.3.1 Manual Analysis

Scholars have shown an interest in the manual analysis of app reviews [57]. The technique is used to facilitate Content Analysis (e.g., to understand the topics users discuss [44]) and to develop datasets for training and/or evaluating mining techniques, so-called ground truth [56]. Manual analysis typically takes the form of tagging a group of sample reviews with one or more meaningful tags (representing certain concepts). For example, tags might indicate the type of user complaint [114], the feature discussed in a review [8], or the sentiment users express [115]. To make replicable and valid inferences from manual analysis, studies perform it in a systematic manner. Figure 4 illustrates the overall procedure of manual analysis. Scholars first formulate the analysis objective, corresponding to the exploration of review content (e.g., understanding types of user complaints) or the development of a ground truth (e.g., labelling the type of user feedback). They then select the reviews to be analysed, and specify the unit of analysis (e.g., a review or a sentence). Next, one or more humans (so-called coders) follow a coding process. A coder examines a sample of reviews and tags them with specific concepts. Unless these concepts are known in advance or coders agree about the tagging, the step is iterative; when, for example, new concepts are identified, coders examine once again all


the previously tagged reviews and check whether they should also be tagged with the new concepts. Such iterations minimize the threat of human error when tagging the reviews. Once all the reviews are tagged, authors either analyse the findings or use the dataset to evaluate other mining techniques [116].

Fig. 4: Figure showing the overall process of manual analysis.

Manual analysis is time-consuming and requires vast human effort [7], [44]; a pilot study typically precedes the actual analysis [56], [115]; subsequently the actual tagging, focusing on a statistically representative sample of reviews, takes place [114]. For example, Guzman et al. [7] involved seven coders who independently tagged 2,800 randomly sampled user reviews. For each review, two coders independently tagged the type of user feedback, the features mentioned in the review and the sentiments associated with these features. The study reports that coders spent between 8 and 12.5 hours coding around 900 reviews.

3.3.2 Natural Language Processing

The user-generated content of app reviews takes the form of text [87], [88]. Such text has plenty of linguistic structure intended for human consumption rather than for computers [117]. The content must, therefore, undergo a good amount of natural language processing (NLP) before it can be used [117], [118]. Given this fact, it is not surprising that the majority of primary studies (75% of surveyed papers) adopt NLP techniques to support review analysis (see Section 3.2).

At a high level, pre-processing can simply be seen as turning review content into a form that is analysable for a specific mining task (see Section 3.2). There are different ways to pre-process reviews, including text normalization, cleaning and augmenting [117], [118]. Transforming a review into a standard form is typically achieved by converting text into lowercase [66], [115], breaking up a text into individual sentences [54], [119], separating out words (i.e., tokenization) [9], [61], spelling correction [61], [76], as well as turning words into their base forms (e.g., stemming or lemmatization) [8], [54]. Of course, not all review content is meaningful [7], [34]. Some parts are noisy and obstruct text analysis [61], [70]. The content is thus cleaned by removing punctuation [97], [120], filtering out noisy words like stop words [6], [75], or non-English words [70], [116]. Such normalized and cleaned text tends to be augmented with additional information based on linguistic analysis, e.g., part-of-speech (PoS) tagging [105], [120] or dependency parsing [10], [58].
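
A minimal sketch of such a pre-processing pipeline is shown below (lowercasing, sentence splitting, tokenization, stop-word removal, stemming), using NLTK; the review text is invented and the exact steps vary across the surveyed studies.

```python
# Sketch of a typical review pre-processing pipeline.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

review = "Great app! But it CRASHES whenever I try uploading photos."
stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

for sentence in nltk.sent_tokenize(review.lower()):
    tokens = [t for t in nltk.word_tokenize(sentence) if t.isalpha()]  # drop punctuation
    cleaned = [stemmer.stem(t) for t in tokens if t not in stop_words]
    print(cleaned)
```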

Represented as a word sequence [6], bag-of-words (BoW) [8] or in a vector space model (VSM) [29], a review serves as input for other techniques mining the actual content. In particular, primary studies refer to NLP techniques comparing text similarity [52], [68], pattern matching [6], [30] and collocation finding [7], [69], [82].

Text similarity techniques (employed in 21 studies) determine how "close" two textual snippets (e.g., review sentences) are [118]. These snippets, represented in VSM or BoW, are compared using similarity measures like cosine similarity [29], the Dice similarity coefficient [70] or the Jaccard index [9]. These techniques support Search and Information Retrieval (e.g., linking reviews with issue reports from issue tracking systems [73]), Recommendation (e.g., recommending review responses based on old ones that have been posted to similar reviews [107]), Clustering (e.g., grouping semantically similar user opinions [64]), and Content Analysis (e.g., comparing review content [94]).

Pattern matching techniques (employed in 21 studies) localize parts of review text (or of its linguistic analysis) matching hand-crafted patterns. Such patterns can take many forms, such as regular expressions [30], [53], PoS sequences [6], [64], dependencies between words [10], [62] or simple keyword matching [11], [53]. The technique has been adopted in Information Extraction (e.g., extracting requirements from reviews [30], [53]), Classification (e.g., classifying requirements into functional and non-functional [53]) and Summarization (e.g., providing a bug report summary [30]).

Collocation finding techniques are employed for Information Extraction (e.g., extracting features [7] or issues [13] from reviews). Such collocations are phrases consisting of two or more words, where these words appear side-by-side in a given context more commonly than the word parts appear separately [117]. The most common type of collocation detected in the primary studies is the bigram, i.e., two adjacent words [7], [82]. Co-occurrence counts alone may be insufficient, as phrases such as 'all the' may co-occur frequently but are not meaningful. Hence, primary studies explore several methods to filter out the most meaningful collocations, such as Pointwise Mutual Information (PMI) [13] and hypothesis testing [7], [117].
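
The sketch below illustrates collocation finding with PMI scoring using NLTK; the tiny corpus is invented and serves only to show how recurring two-word phrases (e.g., feature names) can be surfaced.

```python
# Sketch of bigram collocation finding with PMI scoring.
import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

nltk.download("punkt", quiet=True)

corpus = ("photo upload fails again photo upload is slow "
          "love the photo upload feature dark mode please add dark mode")
tokens = nltk.word_tokenize(corpus)

finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(2)  # keep bigrams occurring at least twice
print(finder.nbest(BigramAssocMeasures().pmi, 3))
```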

3.3.3 Machine Learning

TABLE 5: Distribution of machine learning techniques used in primary studies in the period from 2010 to December 31, 2019.

Type / Machine Learning Technique / No. Studies

Supervised:
  Naïve Bayes: 35
  Support Vector Machine: 31
  Decision Tree: 26
  Logistic Regression: 20
  Random Forest: 12
  Neural Network: 9
  Linear Regression: 5
  K-Nearest Neighbor: 4

Unsupervised:
  Latent Dirichlet Allocation: 30
  K-Means: 4

Overall, 90 of the 149 primary studies (60%) reported the use of machine learning (ML) techniques to facilitate mining tasks and review analysis. Table 5 reports the ten most


commonly applied ML techniques in primary studies. Most of them (i.e., 8 techniques) are supervised, whereas 2 of them are unsupervised [121]. The widespread interest in ML techniques may be attributed to the fact that Clustering (e.g., grouping reviews discussing the same topics [66]) and Classification (e.g., categorizing user feedback based on user intention [122]), among the most common review analyses (see Table 4), are mainly facilitated using ML. When looking at the whole spectrum of review analyses these ML techniques support, we have also recorded their use for Sentiment Analysis (e.g., identifying feature-specific sentiment [10]), Recommendation (e.g., assigning priorities to reviews reporting bugs [14]) and Information Extraction (e.g., identifying features [123]).

Scholars have experimented with manifold textual and non-textual review properties to make ML techniques work best [8], [124]. Obviously, choosing informative and independent properties is a crucial step to make these techniques effective [11], [121]. Textual properties, for example, concern: text length, tense of the text [56], [57], importance of words (e.g., tf-idf) [54], word sequences (e.g., n-grams) [8] as well as linguistic analysis (e.g., dependency relationships) [125]. These properties are commonly combined with non-textual properties like user sentiment [11], review rating [56] or app category [108]. We found that primary studies experiment with different properties [11], [57]. None of the surveyed papers, however, focuses on engineering meaningful properties to boost the effectiveness of ML techniques.
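
In the spirit of these feature combinations, the sketch below (toy data, not any study's pipeline) combines a textual property (TF-IDF of the review text) with a non-textual property (the star rating) before training a classifier.

```python
# Sketch: combining textual and non-textual review properties for classification.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

data = pd.DataFrame({
    "text": ["app crashes on start", "crashes after update",
             "please add offline mode", "would love a widget"],
    "rating": [1, 2, 4, 5],
    "label": ["bug report", "bug report", "feature request", "feature request"],
})

features = ColumnTransformer([
    ("tfidf", TfidfVectorizer(), "text"),      # textual property
    ("rating", "passthrough", ["rating"]),     # non-textual property
])
model = make_pipeline(features, LogisticRegression())
model.fit(data[["text", "rating"]], data["label"])
print(model.predict(pd.DataFrame({"text": ["freezes when I open it"], "rating": [1]})))
```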

3.3.4 Statistical Analysis

Statistical analysis has become an essential part of primary studies to report research results [63], [115], demonstrate their significance [88], [126], and draw conclusions about a large population of reviews by analysing a small portion of it [44], [49]. We observed an interest in the use of descriptive and inferential techniques for Content Analysis (e.g., [44], [49], [88], [127]). Summary statistics, box plots, and cumulative distribution charts help to gain understanding of review characteristics like their vocabulary size [87], [88], issue type distribution [46], [96], or the topics these reviews convey [44], [128]. Scholars employ different statistical tests to check their hypotheses [93], [126], to examine relationships between review characteristics [93], [128], and to study how sampling bias affects the validity of research results [63].

Guzman et al., for example, conducted an exploratory study investigating 919 reviews from eight countries [93], [127]. They studied how reviews written by male and female users differ in terms of content, sentiment, rating, timing, and length. The authors employed Chi-square (e.g., content) and Mann-Whitney (e.g., rating) non-parametric tests for nominal and ordinal variables, respectively [93]. Srisopha et al. studied whether there exists a relationship between user satisfaction and an application's internal quality characteristics [128]. Employing the Pearson correlation coefficient, the authors studied to what extent warnings reported by static code analysis tools correlate with different types of user feedback and the average user ratings. Similarly, another study employed the Mann-Whitney test to examine whether the densities of such warnings differ between apps with high and low ratings [126].
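
For illustration only, the sketch below runs a Mann-Whitney U test of the kind mentioned above, comparing static-analysis warning densities between high- and low-rated apps; the two samples are invented.

```python
# Sketch of a Mann-Whitney U test on two invented samples of warning densities.
from scipy.stats import mannwhitneyu

warnings_high_rated = [0.8, 1.1, 0.6, 0.9, 1.0, 0.7]
warnings_low_rated = [1.9, 2.4, 1.6, 2.1, 1.8, 2.2]

statistic, p_value = mannwhitneyu(warnings_high_rated, warnings_low_rated,
                                  alternative="two-sided")
print(f"U = {statistic:.1f}, p = {p_value:.4f}")  # small p suggests the densities differ
```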

RQ2: Mining Techniques

• Primary studies employ 4 broad types of techniques to realize app review analyses: (1) manual analysis; (2) natural language processing; (3) machine learning and (4) statistical analysis.

• Manual analysis is used to study review content and to develop datasets for training/evaluating data mining techniques. The technique is time-consuming and requires substantial human effort.

• NLP techniques play an important role in review analysis. The majority of primary studies (75%) use these techniques for a wide spectrum of review analyses: Search and Information Retrieval, Classification, Clustering, Content Analysis, Information Extraction, Summarization or Recommendation.

• ML is employed by ca. 60% of papers for Clustering, Classification, Sentiment Analysis, Recommendation, or Information Extraction. Scholars experiment with textual and non-textual review properties to boost the effectiveness of the techniques.

• Statistical analysis is used to support Content Analysis: to summarize findings; to draw statistically significant conclusions; or to check their validity.

3.4 RQ3: Supporting Software Engineering

To answer RQ3 (which software engineering activities can be supported by analysing app reviews), we used data extracted based on F10 (software engineering scenario) and F11 (justification). Table 6 provides a mapping between primary studies and the SE activities that the studies claim to support⁷; it also reports the number and the percentage of studies per activity. We can observe that primary studies motivated their approaches to support activities across different software engineering phases, including requirements (44%), maintenance (39%), testing (17%) and design (3%); 14 SE activities are supported in total; most research effort is focused on requirements elicitation (26%), requirements prioritization (10%), validation by users (11%), problem and modification analysis (23%), and requested modification prioritization (10%). We also recorded that 62 studies (42%) did not specify any SE activity that their approaches support.

3.4.1 Requirements

Requirements engineering typically concerns involving system users, obtaining their feedback and agreeing on the purpose of the software to be built [131]. It is therefore not surprising that review analysis has received much attention as a means of supporting requirements engineering activities, including requirements elicitation, requirements classification, requirements prioritization and requirements specification (see Table 6).

Requirements Elicitation. Driven by their own needs, users give feedback describing their experience with apps, expressing their satisfaction with software products and raising needs for improvements [5], [44]. Developers can make use of the feedback to elicit new requirements [5], [67], [82].

7. It is worth noting that some papers fall into more than one category, i.e., claim to support more than one activity. In such cases, we assigned the study to all the claimed activities.


TABLE 6: Mapping between references of primary studies and the SE activities these studies support through app review analysis; the table also reports the number and the percentage of studies per SE activity.

SE Activity | No. Studies | Percentage | References
REQUIREMENTS | 66 | 44% |
Requirements Elicitation | 39 | 26% | [5], [6], [7], [8], [9], [11], [14], [30], [31], [35], [36], [39], [43], [44], [53], [54], [55], [62], [65], [67], [69], [82], [89], [100], [110], [112], [113], [122], [129], [130], [131], [132], [133], [134], [135], [136], [137], [138], [139]
Requirements Classification | 7 | 5% | [30], [51], [52], [53], [54], [55], [138]
Requirements Prioritization | 15 | 10% | [5], [6], [7], [14], [30], [44], [56], [57], [87], [88], [100], [105], [113], [124], [131]
Requirements Specification | 5 | 3% | [11], [44], [56], [57], [131]
DESIGN | 5 | 4% |
Design Rationale Capture | 4 | 3% | [30], [56], [57], [138]
User Interface Design | 1 | 1% | [48]
TESTING | 26 | 17% |
Validation by Users | 17 | 11% | [5], [7], [8], [10], [10], [11], [12], [31], [43], [47], [58], [59], [66], [90], [110], [113], [140]
Test Documentation | 3 | 2% | [9], [72], [76]
Test Design | 3 | 2% | [30], [101], [131]
Test Prioritization | 3 | 2% | [101], [106], [114]
MAINTENANCE | 58 | 39% |
Problem and Modification Analysis | 34 | 23% | [5], [6], [9], [12], [13], [28], [32], [36], [40], [45], [47], [50], [59], [61], [64], [66], [70], [71], [72], [74], [80], [82], [103], [104], [111], [112], [113], [114], [124], [130], [141], [142], [143], [144]
Requested Modification Prioritization | 15 | 10% | [9], [10], [14], [28], [67], [73], [74], [97], [100], [101], [102], [104], [114], [144], [145]
Help Desk | 5 | 3% | [98], [99], [107], [108], [109]
Impact Analysis | 4 | 3% | [61], [70], [71], [75]
NOT SPECIFIED | 58 | 39% | [29], [33], [34], [37], [38], [41], [42], [46], [49], [60], [63], [68], [77], [78], [79], [81], [83], [84], [91], [92], [93], [94], [95], [96], [115], [116], [119], [120], [123], [125], [126], [127], [128], [146], [147], [148], [149], [150], [151], [152], [153], [154], [155], [156], [157], [158], [159], [160], [161], [162], [163], [164], [165], [166], [167], [168], [169], [170]

For instance, they can employ opinion mining approaches to examine reviews talking negatively about app features [7], [67], [69], [82], [113], [133], [137]. This can help developers to understand user concerns about problematic features, and potentially help them elicit new requirements [6], [67], [82]. Additionally, finding user reviews that refer to a specific feature of interest allows developers to quickly identify what users have been saying about that particular feature [67], [69], [137]. In line with this direction, approaches have been proposed to group reviews by user intention (e.g., a reviewer requesting a new feature) [8], [11], [14], [100], [110] and by the type of requirements these reviews formulate (e.g., functional or non-functional) [53], [54], [134], [138]. Such aggregated information can be further synthesized and presented to developers as a report of all the feature requests reported for an app [9], [36], [39], [43], [110].

Requirements Classification. User feedback can be categorized along a number of dimensions [21]. Several studies classified user comments based on the types of requirements the feedback conveys [30], [51], [52], [53], [54], [55], [138]. These works typically categorized the feedback into two broad categories: functional requirements (FRs) specifying the behavior of an app, and non-functional requirements (NFRs) describing the constraints as well as the quality characteristics of the app. Categorization at a finer level of granularity has also been demonstrated [54], [55], [138]; user feedback can be classified into the concrete quality characteristics it refers to (e.g., those defined by the ISO 25010 model [171]) so that developers can analyse raised requirements more efficiently.

Requirements Prioritization. When augmented with statistics, certain information such as user opinions or user requests can help developers to prioritize their work [6], [7], [44], [67], [131]. Suppose a development team is aware of problems with certain features that are commented upon negatively. Finding negative opinions mentioning these features could help them to compare how often these opinions appear, for how long they have been made, and whether their frequency is increasing or decreasing. Analogously, a user request to add a feature or to fix a bug can be given a higher priority when it is frequently mentioned as a reason for switching to a competing app or abandoning the software [14], [56], [57], [100]. This information may not be sufficient to prioritize the request on its own, but it can provide useful evidence-based data to contribute to prioritization decisions [113], [131].

Requirements Specification. Specification of requirements typically refers to the devising of a formal document that is systematically reviewed, evaluated, and approved [21]. App reviews can instead serve for producing ad hoc documentation; they convey information about FRs and NFRs, usage scenarios and user experience [11], [44], [56], [57], [131]. Software engineers can benefit from review mining approaches that present this information in the form of provisional software requirements specifications (SRS) or user stories [11], [44], [131]. These approaches can, for example, group reviews by the type of requests users make (e.g., asking for new functions), summarise reviews referring to the same requests, and generate a provisional SRS based on this information. Such an SRS may list new functions that users require; recap scenarios in which these functions are used; and report statistics indicating the relative importance of the requirements, e.g., by the number of users requesting the functions [11]. Since users often justify their needs and opinions, the SRS may also document user rationales serving later for requirements negotiation or design decisions [56], [57].

3.4.2 Design

A few studies motivated app review analysis to assist software design activities: user interface (UI) design [48] and capturing design rationale [30], [56], [57], [138].

User interface design. Software engineers indicate that the success of mobile applications depends substantially on user experience [5]. To realize an app's full potential, software engineers should design the interface to match the experience, skills and needs of users [21]. Alqahtani and Orji analysed the content of user reviews to identify usability issues in mental health apps [48]. They manually tagged 1,236 reviews for 106 apps from Apple's App Store and Google Play with different types of usability issues. Poor design of the user interface was the second most frequently reported issue. It has been found that user-submitted content concerning the interface may provide valuable design recommendations on how to improve interface layout, boost readability and ease app navigation. UI/UX designers should therefore take advantage of the feedback; if addressed, it would likely increase user engagement with the apps and reduce the attrition rate.

Design rationale capture. Design rationale is essential for making the right design decisions and for evaluating architectural alternatives for a software system [172], [173]. A few studies motivated their approaches to capture potential reasons for design decisions [30], [56], [57], [138]. Kurtanovic and Maalej devised a grounded theory for gathering user rationale and evaluated different review classification approaches to mine this information from app reviews [56], [57]. User justifications, e.g., about problems they encounter or criteria they chose for app assessment (e.g., reliability or performance), can enrich documentation with new design rationale and guide design decisions. Similarly, user-reported NFRs can convey architecturally significant requirements and serve as rationale behind an architecture decision [30], [172]. To capture such requirements, app reviews can be classified by the quality characteristics users discuss [30], [172].

3.4.3 Testing

Software testing has received considerable attention in the surveyed studies. We observed that review analysis has been motivated to support various testing activities: validation by users [5], [7], [8], [10], [10], [11], [12], [31], [43], [47], [58], [59], [90], [110], [113], [140], test documentation [9], [72], [76], test design [30], [101], [131] and test prioritization [106].

Validation by Users. Evaluating a software system with users usually involves expensive usability testing in a laboratory [110] or acceptance testing performed in a formal manner [174]. In the case of mobile apps, software engineers can exploit user feedback to assess user satisfaction [10], [43], [47], [66], [90], [110], [113] and to identify any glitches with their products [5], [8], [10], [11], [12], [43], [110]. A recent survey with practitioners has shown that developers release alpha/beta versions of their apps to test the general reaction of users and to discover any bugs [5].

In line with this direction, several approaches have been proposed to mine user opinions [7], [10], [47], [59], [113] and detect bug reports [8], [11], [43], [58], [101], [110], [113]. Opinion mining approaches help to discover the most problematic features and to quantify the number of negative opinions. Knowing what features users praise or hate can give a developer a hint about user acceptance of these features [5], [47]. Assuming core features have been modified, the team may want to know how users react to them so that they can fix any issues quickly and refine these features. Analogously, identifying and quantifying reported bugs within a given time frame can help a development team during beta testing before an official release [12], [43], [90], [110], [113]. If the number of reported issues is unusually high, development teams can reschedule the release of a new version in order to refocus on quality management and testing [8], [11].

Test Documentation. Documentation is an integral part of the testing process [21]. Limited work, however, has motivated the use of app store feedback to support it [9], [72], [76]. Iacob et al. proposed the MARAM tool to generate a summary of bugs reported in reviews (with a breakdown by app version and the features these bugs refer to) [9]. Such a summary can form the basis for later debugging the app and fixing the problems. More recently, scholars integrated user comments into mobile app testing tools [72], [76]. Originally, these tools generate a report of stack traces leading to an app crash [72], [76]. Analyzing this information to understand the root of the problem can often be unintuitive. In such cases, user comments can be used as a human-readable companion to the report; linked to a related stack trace, a user-written description of the problem can instantly guide testers where to look for the emerged fault [72], [76].

Test Design. Developers can get inspiration from app store feedback to design test cases [30], [101], [131]. Analysing reported issues can help them to determine app behavior, features, or functionality to be tested [101]. The feedback may describe a particular use of the software in which users encountered an unusual situation (e.g., crashing without informing users of what happened) or report a lack of support for finding a workaround [131]. Such information may help developers to design test cases capturing exceptions leading to a problem or to exercise new alternative scenarios other than those initially considered [30], [131]. Additionally, identifying negative comments on quality characteristics can help in specifying acceptance criteria an app should satisfy [30]. For example, user complaints about performance efficiency can indicate time criteria for functions that are expected to finish faster or more smoothly [30].

Test Prioritization. Reviews and their ratings have been found to correlate with download rank, a key measure of an app's success [1], [114]. User complaints about specific issues can have a negative impact on the rating, and in turn discourage users from downloading the app [114]. It has therefore been suggested to prioritize issue-related test cases based on the frequency and impact of these complaints [101], [114].

To address device-specific problems, a development team must test their apps on a large number of devices, which is inefficient and costly [175]. The problem can be partially ameliorated by selecting the devices mentioned in reviews that have the greatest impact on app ratings [106]. The strategy can be particularly useful for a team with limited resources that can only afford to buy a few devices. Using the strategy, they can determine the optimal set of devices to buy on which to test their app [106].

3.4.4 Maintenance

In an attempt to support software maintenance, review analysis has been proposed for problem and modification analysis, requested modification prioritization, help desk and impact analysis (see Table 6).

Problem and Modification Analysis. Software engineers strive continuously to satisfy user needs and keep their app product competitive in the market [5]. To this end, they can exploit approaches facilitating problem and modification analysis [5], [6], [9], [12], [13], [28], [32], [36], [40], [45], [47], [50], [59], [61], [64], [66], [70], [71], [72], [74], [80], [82], [103], [104], [111], [112], [113], [114], [124], [130], [141], [142], [143], [144]. These approaches detect user requests in app store feedback and classify them as problem reports and modification requests. Fine-grained categorization can be carried out too, for example, to detect specific issues like privacy [45], [50] or concrete change requests like feature enhancements [61]. Mining such information allows software engineers to determine and analyze user demands in a timely and efficient fashion [12], [13], [28], [32]. By analysing the dynamics of reported problems over time, software engineers can immediately spot when a "hot issue" emerges and link it to a possibly flawed release [59], [66], [112], [113]. Moreover, they can generate a summary of user demands to obtain interim documentation serving as a change request/problem report [9], [36], [111].

Requested Modification Prioritization. App developers may receive hundreds or even thousands of reviews requesting modifications and reporting problems [14], [73]. It is therefore not a trivial task for developers to select those requests which should be addressed in the next release [14]. As with requirements, developers can investigate statistics concerning these requests (e.g., how many people requested specific modifications), estimate their impact on perceived app quality (e.g., expressed as user rating) or analyze how these requests change over time [9], [10], [14], [28], [67], [73], [74], [97], [100], [101], [102], [104], [114], [144], [145]. Assuming developers have to decide which change to address first, they could select the one with the largest share of requests, or the one whose feedback drives down the app rating the most [10], [67]. Similarly, a sharp growth in feedback reporting a specific problem (e.g., security or privacy) may suggest that the issue is harmful to users and should be resolved quickly.

Help Desk. A help desk typically provides end-users with answers to their questions, resolves their problems or assists in troubleshooting [21]. Analogously, app developers can respond to specific user reviews to answer users' questions, to inform them about fixing problems or to thank users for their kind remarks about the app [98], [99]. Though the task is not traditionally included in the typical responsibilities of software engineers, user support and managing the product reputation on the app store are essential to an app's success; they should be viewed as important activities in the software lifecycle. In fact, responding to reviews motivates app users to revise their feedback and ratings to be more positive [98]. Some users even update their feedback to inform developers that the response solved their problem or to thank them for the help [98], [99]. Since responding to a large number of reviews can be time-consuming, developers can make use of approaches highlighting reviews that are more likely to require a response and generating automatic replies to these reviews [99], [107], [108], [109].

Impact Analysis. Review mining approaches help developers to discover modification requests posted in reviews; to identify app source code affected by these modifications; and to estimate how implementing the modifications may impact user satisfaction [61], [70], [71], [75]. These approaches typically group feedback requesting the same modifications [61], [75] and detect links between review groups and the corresponding source code artefacts referring to the modifications [61], [70], [71], [75]. Such information can be useful for engineers before issuing a new release as well as afterwards. Software engineers can track which requests have (not) been implemented; monitor the proportion of reviews linked to software changes; and estimate the number of users affected by these changes. After the release has been issued, software engineers can also use the approaches to observe the gain/loss in average rating with respect to the implemented changes.

RQ3: Supporting Software Engineering

• Analysing app reviews can support software engineers in requirements, design, testing, and maintenance activities.

• Most primary studies analyse app reviews to support (i) requirements elicitation, (ii) requirements prioritization, (iii) validation by users, (iv) problem and modification analysis, and (v) requested modification prioritization.

• 62 primary studies (42%) do not describe software engineering use cases of their review mining approaches.

3.5 RQ4: Empirical Evaluation

To answer RQ4 (how are mining approaches empirically evaluated), we used data items F12 (evaluation objective), F13 (evaluation procedure), F14 (metrics and criteria), F16 (annotated datasets), F17 (annotation task), F18 (number of annotators), F19 (quality measure) and F20 (replication package). We found that 86 primary studies performed an empirical evaluation of review mining approaches in terms of their effectiveness (83 studies) and user-perceived quality (17 studies).

3.5.1 Effectiveness Evaluation

A generalized procedure for effectiveness assessment consists of four steps: (i) formulate an evaluation objective, (ii) create an annotated dataset, (iii) compare the output of an approach with the annotated dataset, and (iv) quantify the effectiveness. The evaluation objective refers to assessing the degree to which an approach can correctly perform a specific mining task or analysis (see Section 3.2). A conclusive method for the assessment relies on human judgment. Primary studies involved humans to perform the task manually on a sample of reviews and to annotate the sample with correct solutions. Such an annotated dataset (so-called ground truth) served as a baseline for evaluating the approach and quantifying the outcome.
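As a minimal illustration of steps (iii) and (iv), the sketch below compares an approach's predicted review labels against a hypothetical human-annotated ground truth and computes precision, recall and F1 with scikit-learn; the labels are invented for illustration and are not taken from any primary study.

    # Minimal sketch: comparing an approach's output against an annotated
    # ground truth; both label lists are hypothetical.
    from sklearn.metrics import precision_recall_fscore_support

    ground_truth = ["bug", "feature", "other", "bug", "feature", "other", "bug"]
    predicted    = ["bug", "other",   "other", "bug", "feature", "feature", "bug"]

    precision, recall, f1, _ = precision_recall_fscore_support(
        ground_truth, predicted, average="macro", zero_division=0)
    print(f"precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}")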

Most studies provided a detailed description of how each step of their evaluation method was performed. Hence, we could record additional information:

• Scholars employed annotated datasets to evaluate the effectiveness of their approaches in performing: Classification, Clustering, Sentiment Analysis, Information Extraction, Searching and Information Retrieval, Recommendation and Summarization.

• Most studies have not released their annotated dataset, or their dataset could not be accessed based on the information reported in the paper. Table 7 provides an overview of the 13 annotated datasets we could access based on the reported information.

• 22 primary studies (27%) reported how the quality of their annotated datasets was measured [7], [29], [33], [35], [36], [40], [41], [49], [52], [54], [56], [57], [61], [70], [71], [81], [103], [115], [124], [136], [147], [157]. The three most common metrics were Cohen's Kappa [176], Percentage Agreement [177] and the Jaccard index [118] (a small illustrative sketch follows this list). Percentage Agreement and Cohen's Kappa were used to measure the quality of human annotation for Classification, Sentiment Analysis, or Feature Extraction. The Jaccard index was used for assessing human agreement for the task of Searching and Information Retrieval. We have not found information on how agreement was measured when annotators performed Clustering, Recommendation, or Summarization tasks.

• The number of annotators labelling the same review sample (or a fragment thereof) ranges from 1 to 5, with an average of 2 human annotators.

• Most primary studies collect reviews published in Google Play and the Apple App Store. The size of the annotated datasets ranges from 80 to 36,464 reviews, with an average of 4,047 annotated reviews.

• The three most common metrics employed for assessing effectiveness are precision, recall, and F1-measure [118]. The metrics were employed for evaluating Classification, Clustering, Information Extraction, Searching and Information Retrieval, Sentiment Analysis, Recommendation and Summarization.
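The following sketch illustrates, on invented annotations, how the inter-annotator agreement metrics mentioned above could be computed for two annotators labelling the same review sample; percentage agreement is computed directly and Cohen's Kappa via scikit-learn. It is an illustrative assumption of how such checks may look, not the procedure of any specific study.

    # Hypothetical annotations by two coders of the same review sample.
    from sklearn.metrics import cohen_kappa_score

    annotator_1 = ["bug", "feature", "other", "bug", "other", "feature", "bug", "other"]
    annotator_2 = ["bug", "feature", "bug",   "bug", "other", "other",   "bug", "other"]

    # Percentage agreement: share of reviews on which both coders agree.
    agreement = sum(a == b for a, b in zip(annotator_1, annotator_2)) / len(annotator_1)
    # Cohen's Kappa: agreement corrected for chance.
    kappa = cohen_kappa_score(annotator_1, annotator_2)
    print(f"percentage agreement={agreement:.2f}, Cohen's kappa={kappa:.2f}")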

Naturally, the above procedure constitutes a certain generalization. We recorded individual studies that exploited evaluation methods departing from this procedure. The major differences we observed can be synthesized as follows:

• Six of the primary studies asked annotators to assess the quality of the output produced by their approaches, instead of creating an annotated dataset. This was practiced for evaluating Classification [33], Clustering [7], [29], [61], Information Extraction [6], [33], and Searching and Information Retrieval [74].

• Three studies used other software artefacts as an evaluation baseline rather than creating an annotated dataset [13], [28], [101]. To evaluate Recommendation (i.e., determining priorities for reported issues), the studies compared the recommended priorities with the priorities of the issues reported in user forums or changelogs.

TABLE 7: Released Annotated Datasets (Reviews)

Ref. | Description | Size
[8] | Reviews labeled with a type of user request (bug report, feature request, rating, user experience). | 4,400
[115] | Identified user opinions (feature and sentiment). | 1,760
[116] | Annotated a type of user feedback (problem reports, inquiries, and irrelevant). | 6,406
[36] | Reviews labeled with 12 topics (e.g., security) and user intention (e.g., problem discovery). | 3,439
[75] | Tagged reviews with their topics (e.g., compatibility, usage, resources, protection). | 3,600
[165] | Identified features discussed in reviews. | 3,500
[100] | Reviews labeled with feedback category (e.g., bug report, feature request). | 3,000
[170] | Annotated a type of user feedback (feature request, bug report, and others). | 2,930
[138] | Labeled reviews with the non-functional requirements users discuss (e.g., usability, dependability). | 6,000
[34] | Indicated whether the content of each review is informative or uninformative. | 12,000
[10] | Identified the type of user request each review conveys (e.g., bug report, feature request). | 2,000
[76] | Annotated reviews with their topics and the type of issue users report. | 6,600
[124] | Tagged reviews with topics (e.g., bug report, feature shortcoming, complaint, usage scenario). | 4,500

3.5.2 User Study

Seventeen studies evaluated their review mining approaches through user studies [10], [11], [13], [14], [36], [39], [41], [43], [58], [59], [60], [61], [75], [82], [100], [107], [137]. The objective of these evaluations was to qualitatively assess how the approach and/or the analysis it facilitates is perceived by intended users (e.g., software engineers). Such an evaluation procedure typically consists of the following steps: (i) define an evaluation subject and assessment criteria, (ii) recruit participants, (iii) instruct participants to perform a task with an approach or a produced analysis, and (iv) survey and/or interview participants.

We looked in detail at how studies performed each of these steps. The extracted data yield the following insights:

• User studies evaluated the following review analyses: Clustering, Classification, Sentiment Analysis, Information Extraction, Search and Information Retrieval, Recommendation, Summarization, and Visualization.

• Five evaluation criteria were typically taken into account: 1) Usefulness, denoting the quality of being applicable or having practical worth; 2) Accuracy, indicating the ability to be correct; 3) Usability, signifying the quality of being easy to use; 4) Efficiency, indicating the capability of producing desired results with little or no human effort; and 5) Informativeness, denoting the condition of being informative and instructive. Table 8 provides a reference mapping of user studies with a breakdown of evaluation criteria and evaluated subjects.

• The number of participants involved in a study ranges from 2 to 85, with an average of 18 participants. The participants included professionals with varied roles, including developers, testers, requirements engineers, software engineers, analysts, architects and project managers, as well as students and researchers.

• The participants were typically tasked with performing a subject analysis with and without the evaluated approach, revising the analysis produced by the approach, or simply exploring the approach and trialling its functionalities.

RQ4: Empirical Evaluation

• Review mining approaches are evaluated in terms of their effectiveness in performing review analyses and their user-perceived quality.

• To evaluate effectiveness, studies compare outputs of mining approaches with human-generated ones on sample datasets. Most datasets, however, have not been published.

• To assess perceived quality, studies perform user studies with professionals (e.g., software engineers). Participants are typically tasked with using a mining approach for a certain objective, and then assessing it against specific quality criteria (e.g., usefulness).

TABLE 8: Reference mapping of user studies with a breakdown by evaluation criterion and app review analysis.

Criterion | App Review Analysis
Accuracy | Information Extraction [13], [82], Classification [14], [36], [41], [82], [100], Clustering [61], Summarization [39].
Efficiency | Classification [43], [75], Recommendation [107], Summarization [36], [39], [137].
Informativeness | Classification [58], [75], [82], Summarization [36], [39], Visualization [13], [59].
Usability | Recommendation [107], Summarization [36], [39], [82].
Usefulness | Information Extraction [13], [60], [82], Classification [11], [36], [41], [43], [58], [75], Clustering [61], Search and Information Retrieval [61], Sentiment Analysis [60], Recommendation [14], [100], Summarization [39], [137], Visualization [10].

3.6 RQ5: Empirical Results

We answered RQ5 (how well can mining approaches support software engineers) based on data item F15 (evaluation result). The data come from 86 studies reporting the results of their empirical evaluations: effectiveness evaluations (83 studies) and user studies (17 studies). We synthesize the results of these studies in the subsequent subsections.

3.6.1 Effectiveness Evaluation Results

The methodology that primary studies employed for effectiveness evaluation was too diverse to undertake a meta-analysis or other statistical synthesis methods [178]; these studies differed, for example, in their treatment (e.g., review mining approach), population (e.g., review dataset) or study design (e.g., annotation procedure). We thus employed the 'summarizing effect estimates' method [178]; Table 9 reports the magnitude and range of effectiveness results that primary studies reported for different review analyses with a breakdown by mined information type.8

Information Extraction. The effectiveness of extracting information from reviews varies depending on the type of mined information. Mining approaches scored the lowest performance for identifying features in reviews (a median precision of 41% [6], [7] and a median recall of 62% [6], [7]); the wide range of reported effectiveness (e.g., precision between 21% and 79% [80], [165]) indicates divergent results across these studies. On the other hand, high effectiveness was reported for extracting both user requests and NFRs (e.g., median precision above 90% [9], [30]). The small variability of results shows the consistency of findings between the studies.

Classification. App reviews can be well classified by the information types these reviews convey, i.e., user requests, NFRs and issues; the median precision and recall are consistent for classifying reviews no matter what information the approaches mine (precision above 70% [53], [54], [150], [160], recall around 80% [54], [62], [138], [148], [160]).

Clustering. Studies emphasize that app reviews can be accurately grouped into semantically related clusters (accuracy of 83% [29]); these results are also in line with findings concerning the quality of review clustering (MojoFM of 80% [14], [100]).

Search and Information Retrieval. Mining approaches showed effectiveness in retrieving reviews relevant to specific information needs; in particular, the results show that tracing information between reviews and issues from ticket systems as well as between reviews and source code can be precise (median precision above 75% [61], [71], [72]) and complete (median recall above 70% [70], [71], [72], [76]). Similarly, finding reviews related to specific features has been reported with a precision of 70% and a recall of 56% [6]. The variability of the results (e.g., precision between 36% and 80% [67], [137]), however, may lead to inconclusive findings.

Sentiment Analysis. The overall review sentiment can be identified with an accuracy of 91% [79]. Analyzing more fine-grained information about user sentiment w.r.t. specific features is less effective, as reflected by a precision/recall of around 60% [47].

Recommendation. Recommending priorities for user requests was reported with medium to high effectiveness: a median accuracy of 78% [14], [100] and a precision of 62% [13], [28]. Generating review responses was reported with a BLEU-4 score greater than 30% [108], reflecting human-understandable text.

Summarization. Mining techniques were recorded to generate a compact description outlining the main themes present in reviews with a recall of 71% [170].

8. No effectiveness evaluation was performed w.r.t. Content Analysis and Visualization.


TABLE 9: Results of effectiveness evaluations; the table reports how well review analyses mine specific information.

App Review Analysis | Mined Information | Results
Information Extraction | Features | Precision ranges from 21% to 79% [80], [165] with a median precision of 41% [6], [7]; recall ranges from 42% to 77% [80], [165] with a median recall of 62% [6], [7].
Information Extraction | User Request | Precision ranges from 85% to 91% [31], [110] with a median precision of 91% [9]; recall ranges from 87% to 89% [31], [110] with a median recall of 89% [9].
Information Extraction | NFR | Precision at the level of 92% [30].
Classification | User Request Type | Precision ranges from 35% to 94% [63], [166] with a median precision of 79% [150]; recall ranges from 51% to 99% [120], [166] with a median recall of 81% [62], [148].
Classification | NFR Type | Precision ranges from 63% to 100% [49], [52] with a median precision of 74% [53], [54]; recall ranges from 63% to 92% [52], [53] with a median recall of 79% [54], [138].
Classification | Issue Type | Precision ranges from 66% to 100% [46], [49] with a median precision of 76% [160]; recall ranges from 65% to 92% [46], [49] with a median recall of 79% [160].
Clustering | Similar Reviews | Accuracy ranges from 80% to 99% [73], [111] with a median accuracy of 83% [29]; MojoFM ranges from 73% to 87% with a median MojoFM of 80% [14], [100].
Search and Information Retrieval | Feature-Specific Review | Precision ranges from 36% to 83% [67], [137] with a median precision of 70% [6]; recall ranges from 26% to 70% [67], [137] with a median recall of 56% [6].
Search and Information Retrieval | Review-Issue Links | Precision ranges from 77% to 79% [70], [73] with a median precision of 77% [71]; recall at the level of 73% [70], [71].
Search and Information Retrieval | Review-Code Links | Precision ranges from 51% to 82% [75], [76] with a median precision of 82% [61], [72]; recall ranges from 70% to 79% [61], [75] with a median recall of 75% [72], [76].
Sentiment Analysis | Feature-Specific | Precision at the level of 69%; recall at the level of 64% [47]. F1 score ranges between 64% and 85% [10], [47].
Sentiment Analysis | Review | Accuracy at the level of 91% [79].
Recommendation | User Request Priority | Accuracy ranges from 72% to 83% with a median accuracy of 78% [14], [100]; precision ranges from 60% to 63% with a median precision of 62% [13], [28].
Recommendation | Review Response | BLEU-4 at the level of 36% [108].
Summarization | Review Summary | Recall at the level of 71% [170].

3.6.2 User Study Results

Seventeen studies evaluated the user-perceived quality of review mining approaches. Table 10 provides a synthesis of their results; it reports how review analyses were perceived according to different evaluation criteria.

Information Extraction. Extracting information from reviews, e.g., issue reports and user opinions, is useful for developers [13]; it can help to elicit new requirements or prioritize development effort [60], [82]. In particular, machine learning techniques are able to identify issues with acceptable accuracy [13]; feature extraction methods instead produce analyses too imprecise to be applicable in practice [82].

Classification. Review classification showed its utility for identifying different user needs, e.g., feature requests or bug reports [11], [36], [41], [43], [58], [75]. Such categorized feedback is informative and eases further manual review inspection [58], [75], [82]. Practitioners reported saving up to 75% of their time thanks to the analysis [43], [75], and that its accuracy is sufficient for practical application [14], [36], [41], [100].

Clustering. Review clustering is convenient for grouping feedback conveying similar content, for example, reviews reporting the same feature request or discussing the same topic [61]. Evaluated approaches can perform the analysis with a high level of precision and completeness [61].

Searching and Information Retrieval. Developers acknowledged the usefulness of linking reviews to the source code components to be changed [61]; the task traditionally requires enormous manual effort and is highly error-prone.

Sentiment Analysis. Analyzing user opinions can help to identify problematic features and to prioritize development effort to improve these features [60].

Recommendation. Project managers found recommending priorities of user requests useful for release planning [14], [100]; it can support their decision-making w.r.t. the requirements and modifications that users wish to be addressed. Besides, developers perceived an automatic review response system as more usable than the traditional mechanism [107]; recommending reviews that require responding and suggesting responses to these reviews can reduce developers' workload [107].

Summarization. A compact description outlining the most important review content is useful for developers in their software engineering activities [39], [137]; in particular, summaries conveying information about frequently discussed topics, user opinions as well as user requests. Presenting this information in a tabular form is easy to read and expressive [36], [39], [82]. Such summaries are generated with sufficient accuracy to be used in practical scenarios [39]; in fact, developers reported saving up to 50% of their time thanks to the analysis [36], [39], [137].

Visualization. Presenting trends of frequently discussed topics can inform developers about urgent issues, 'hot features', or popular user opinions [13], [59]. A heat-map illustrating feature-specific sentiment (i.e., user opinions) helps developers to understand user experience with these features [10]; it indicates which features users praise and which are problematic. Visualizing how user opinions change over time aids developers in examining users' reactions, e.g., to newly implemented modifications of these features.


TABLE 10: Results of user studies; the table reports the user-perceived quality of app review analyses.

App Review Analysis | Criterion | Results
Information Extraction | Accuracy | Machine learning techniques extract issues with acceptable accuracy [13]; collocation algorithms generate too many false positives for feature extraction [82].
Information Extraction | Usefulness | Extracting issues is useful for app maintenance [13]; feature extraction may help to analyze user opinions and to prioritize development effort [60], [82].
Classification | Accuracy | Classifying reviews by types of user requests or topics is sufficiently precise for practical use [14], [36], [41], [100]; ML techniques are superior in accuracy to simple NLP [82].
Classification | Efficiency | Classifying reviews automatically (e.g., by user requests) could reduce up to 75% of the time that app developers spend on manual feedback inspection [43], [75].
Classification | Informativeness | Grouping feedback by topics is instructive; a group may express shared user needs specific to, e.g., app features, emerging issues, or new requirements [58], [75], [82].
Classification | Usefulness | Classifying reviews is useful for identifying different user needs [36], [41], [58] and for comprehending reported issues and problematic features [11], [43], [58], [75].
Clustering | Accuracy | Clustering reviews conveying semantically similar information is performed with high precision and completeness [61].
Clustering | Usefulness | Useful for grouping reviews bearing a similar message, e.g., reporting the same issues [61].
Search and Information Retrieval | Usefulness | Helpful for linking reviews to source components (e.g., methods, class or field names) [61].
Sentiment Analysis | Usefulness | Helpful for detecting user opinions about app features; it can support prioritising development effort to improve these features [60].
Recommendation | Efficiency | Suggesting reviews requiring responses lowers app developers' workload [107].
Recommendation | Usability | An automatic review response system shows higher usability compared to the original app store mechanism; survey respondents favored the system [107].
Recommendation | Usefulness | Recommending the priority of user requests is useful for release planning [14], [100].
Summarization | Accuracy | Summarizing topics and requests that users post can be done with high accuracy [39].
Summarization | Efficiency | Saves more than half of the time required for analyzing feedback [36], [39]; it helps to immediately understand user opinions and user requests [137].
Summarization | Informativeness | Listing user requests (e.g., bug reports) and topics through a summary table is informative and expressive for software engineers [36], [39].
Summarization | Usability | A table showing frequently posted user requests and topics is easy to read [36], [39], [82].
Summarization | Usefulness | Summarizing the user requests and/or topics that reviews convey is useful for better understanding user needs [39], [137].
Visualization | Informativeness | Presenting what topics users discuss and how they change over time can inform about emerging issues [13] or relevant user opinions [59].
Visualization | Usefulness | Showing a heat-map of feature-specific sentiments and their trends over time can help developers to identify problematic features [10].

RQ5: Evaluation Results

• Mining approaches perform well for 5 broad app review analyses: (1) Classification; (2) Clustering; (3) Searching and Information Retrieval; (4) Recommendation and (5) Summarization.

• Software engineers generally consider app review analyses useful. The analyses ease their SDLC activities, reduce their workload, and support their decision-making.

• Software engineers find the accuracy of most app review analyses promising for practical usage; yet the performance of (6) Information Extraction and (7) Sentiment Analysis seems to be insufficient.

4 DISCUSSION

In this section we highlight and discuss findings from our study.

4.1 Mining App Reviews Matters

Mining app reviews for software engineering is a relatively new research area. The first use of app reviews for software engineering purposes can be dated back to 2012. Nevertheless, the analysis of demographics has revealed that the research area increasingly attracts the attention of scholars. The number of papers published in this direction has grown substantially in the last three years. A recent survey on app store analysis found 45 papers relevant to app review analysis published up to 2015 [1]. Our findings show that the number of published papers in the area had tripled by the end of 2019. The most frequent venues where scholars have published their work are high-quality software engineering conferences and journals (see Table 3). These observations show not only that there is increasing effort in exploring the research direction, but also that the contributions of these efforts are relevant from a software engineering perspective; in fact, empirical evidence (RQ5) indicates that software engineers find app review analysis useful in support of their SDLC activities, that mining approaches can substantially reduce their workload, and that they facilitate knowledge that would be difficult to obtain manually. Similarly to other works [1], we hypothesize that the factors driving research interest in the field include the increased popularity of mobile apps, easy access to user feedback on a scale not seen before, and a general interest in adopting data mining techniques for mining software repositories.

4.2 App Review Analysis

Scholars employed different types of review analyses that can be grouped into nine categories (RQ1). Most of them concern typical text mining tasks [27]. Similarly to Martin et al. [1], we found that these analyses concern Classification, Sentiment Analysis, Content Analysis and Summarization. Our findings have also revealed other widely used analyses that have not been discussed in that survey, i.e., Clustering, Information Extraction, Search and Information Retrieval, Recommendation and Visualization. We conjecture the study either overlooked these analyses or performed a more coarse-grained categorization. As that study did not provide definitions of these analyses, we could not confirm our hypothesis.

Scholars showed a strong interest in Classification, Clustering and Information Extraction to filter out noisy reviews and focus on relevant content (e.g., bug reports or feature requests). Content Analysis, on the other hand, was employed to characterize reviews, identify topics users discuss, and explore the possibility of using the feedback to satisfy different information needs. This observation is in line with previous findings [1]. Scholars exploited Sentiment Analysis to determine user opinions about a specific app feature or discussed topic. The analysis was adopted to support, i.a., requirements elicitation and requirements prioritization as well as validation by users (RQ3). Our results have revealed that primary studies employed Searching and Information Retrieval to find reviews discussing specific information (e.g., app features or bug reports), or to trace the information to other software artefacts (e.g., stack traces or issue tracking systems). Summarization and Visualization were employed to aid software engineers in interpreting mined information, which could be time-consuming if undertaken manually. Finally, we recorded that information mined from reviews is often exploited to recommend a course of action to software engineers, e.g., which bug report should be resolved first, or where the problem can be localized in the source code.

4.3 Data Mining Techniques

We found that data mining techniques could be categorized into four broad categories (RQ2). These techniques include statistical analysis, manual analysis, natural language processing and machine learning. The findings are in line with observations from previous surveys [15], [16], [179]. Our results indicate statistical analysis techniques were used mainly to summarize research results, check their validity or draw statistically significant conclusions. Manual analysis, on the other hand, was performed for Content Analysis and to develop annotated datasets for training and evaluating mining techniques. The results revealed that a majority of studies employed different NLP techniques (75%) and ML techniques (60%). The most widely used ML algorithms were standard ones, i.a., Naïve Bayes or Support Vector Machines. These findings are not surprising as both NLP and ML techniques are widely known for text mining tasks [27]. Almost no study, however, benchmarked these techniques or engineered features for them [8], [11]. Consequently, it is not known which techniques would achieve the best performance in the context of different review analyses. Empirical studies could address this gap and provide a valuable contribution to the community. Not surprisingly, good performance of a technique would dictate its adoption in practice [180], [181].

4.4 Mining Reviews Can Support Software Engineering

Our investigation of the results (RQ3) concludes that mining app reviews can be used to support software engineering. Scholars analysed user feedback to support fourteen different activities in the context of requirements, design, testing and maintenance (see Table 6). Major efforts have been made to support RE-related activities, supposedly because user feedback is an intrinsic element of that phase. In fact, venues dedicated to requirements engineering are among those where scholars mainly published their work (see Table 2). On the other hand, a recent survey with practitioners has shown that app store feedback can affect software engineering practices and be useful for different activities [5]. That observation is also in line with the results of our study.

No previous study, however, has made an attempt to consolidate this knowledge and provide a thorough overview of the activities that could be supported. Consequently, the knowledge about the potential of review mining has been implicit and scattered over different publications. The fact that 28% of the studies (42 studies) mining app reviews did not explicitly anchor their work in any software engineering scenario challenges their utility even more. We thus believe our results bring new evidence-based knowledge about the potential use of app reviews in the context of software engineering scenarios [182].

4.5 Missing Scenario Realization

Most studies (79%) claim to support software engineering activities through review mining approaches, but do not explain how the support is realized (RQ3). The lack of justifications challenges the understanding of how these approaches can be used in practice, and in general how review mining can be exploited for software engineering purposes.

We partially addressed the problem by providing a thorough synthesis of software engineering scenarios (see Section 3.4). The synthesis elucidates what information can be mined from app reviews, what needs the information satisfies and how it can support the realization of software engineering activities. We believe our results should help researchers and practitioners to understand the rationale behind review mining and communicate to them the potential use of mined information. In the future, scholars could address the problem from another perspective. For example, they could elaborate a reference architecture (e.g., a generalization of existing mining approaches) explaining how combining different mining tasks can realize specific software engineering scenarios, and how these tasks are operationalized using data mining techniques. Provided that one wants to understand how to make use of review mining for a certain scenario (e.g., requirements elicitation), they could look for the realization of that scenario in the architecture description and obtain an overall understanding of what components are needed and how they could be combined to realize the scenario. Not only would such an architecture communicate knowledge about the realization of software engineering scenarios, but it would also provide a template guiding the design of specific solutions.

4.6 Limitation of Effectiveness Evaluation

A great deal of effort has been made to evaluate the effectiveness of data mining techniques (RQ4). Results have shown that 83% of empirical evaluations concerned assessing the capability to perform mining tasks. Scholars employed standard procedures for evaluating NLP and ML techniques, including the creation of annotated datasets [118], [176].

Primary studies, however, used evaluation datasets of small size (on average 4,047 reviews). This is a tiny portion of the user-submitted feedback in app stores. Popular mobile apps (like WhatsApp or Instagram) can receive more than 5,000 reviews per day, and more than one million reviews in a year [183]. This is a significant threat to the validity of their results when trying to generalize them. Naturally, the problem is attributed to the substantial effort it takes for humans to manually annotate reviews. For example, Guzman and Maalej reported that labeling 900 reviews took an annotator up to 12.5 hours [7]. Not surprisingly, primary studies admitted this limitation as a major threat to the validity of their results (e.g., [51], [67], [75]). As none of the surveyed studies tried to tackle the problem, it opens an avenue for future research. Scholars could, for example, experiment with automated data labeling techniques currently exploited to minimize the effort of preparing training datasets [122], [130], [184]. Even if that problem were handled, researchers should be mindful of a potential sampling bias when collecting datasets from app stores [185]. The latter problem, however, has been well studied in a recent study [63].

Most papers also did not provide a link to the dataset they used for evaluation. In fact, we found references to only 13 datasets (see Table 7). This hampers the replicability of the studies as well as new comparative studies. In fact, only a single replication study was published out of the surveyed studies [165]. The study reported the challenge of validating the results of the original work due to the absence of an annotated dataset and an insufficiently documented evaluation procedure. We recommend that future studies provide replication packages, including evaluation datasets, procedures, and approaches, so that scholars will be able to validate existing works and confirm reported findings. This will also help in benchmarking approaches and provide a baseline for evaluating new approaches aiming to improve performance.

4.7 Implications for Software Engineering Practice

There is no clear view of whether the effectiveness of review mining approaches is sufficient to support software engineers in their tasks. Identifying what performance review mining approaches should have in order to be useful for software engineers in practice is an important open question [180], [181]. Essentially, an approach facilitating review analysis should synthesize reviews so that the effort of their further manual inspection is negligible or at least manageable. Clearly, the effort would depend on the scenario an approach aims to realize. None of the primary studies, however, investigated that aspect. We found 17 studies assessing their approaches with intended users, e.g., software engineers (RQ4). These studies evaluated the user-perceived quality of their approaches and/or facilitated analyses with respect to certain criteria like efficiency or usefulness (see Table 8). Though the results of these studies are promising (see Table 10), they do not provide conclusive evidence of whether these approaches indeed support software engineering activities. These studies evaluated their approaches in isolation from the software engineering scenarios that the approaches aim to realize. Therefore, future works could address this gap and study to what extent mining approaches can be useful for their intended scenarios, and what performance these approaches should provide to be used in practice.

5 THREATS TO VALIDITY

One of the main threats to the validity of this systematic literature review is incompleteness. The risk of this threat highly depends on the selected list of keywords forming the search queries. To decrease the risk of an incomplete keyword list, we used an iterative approach to keyword-list construction. We constructed two queries: one generic and one specific. The generic query was formed using keywords appearing in the index of terms of sample studies analysing app reviews for SE. The specific query was formed based on a set of keywords representing concepts of our research objective. As in any other literature survey, we are also prone to publication bias. To mitigate this threat, we complemented the digital library search with other strategies: we conducted an issue-by-issue search of top-level conferences and journals as well as performed backward and forward snowballing.

To ensure the quality and reliability of our study, we defined a systematic procedure for conducting our survey, including the research questions to answer, search strategies and selection criteria for determining primary studies of interest. We conducted a pilot study to assess technical issues, such as the completeness of the data form, and usability issues, such as the clarity of the procedure instructions. The protocol was reviewed by a panel of researchers in addition to the authors of the study. It was then revised based on their critical feedback. Consequently, the selection of primary studies followed a strict protocol in accordance with well-founded guidelines [18], [19], [182].

Another threat to validity we would like to highlight is our subjectivity in the screening, data extraction and classification of the studied papers. To mitigate this threat, each step was performed by one coder, who was the first author of this paper. Then, the step was cross-checked by a second coder. Each step was validated on a randomly selected sample of 10% of the selected papers. The percentage inter-coder agreement reached for all the phases was higher than 90%, indicating high agreement between the authors. In addition, intra-rater agreement was assessed. The first author re-coded a randomly selected sample of 20% of the studied papers. Then an external evaluator, who has no relationship with the research, verified the agreement between the first and the second rounds. The percentage intra-coder agreement was higher than 90%, indicating near complete agreement [24].

A similar threat concerns whether our taxonomies are reliable enough for analysing and classifying the extracted data. To mitigate this issue, we used an iterative content analysis method to continuously develop each taxonomy. New concepts that emerged when studying the papers were introduced into a taxonomy and the corresponding changes were made. The taxonomies were discussed among all the authors, who agreed on their final form.


6 RELATED WORK

This review is not the first effort synthesizing knowledge from the literature analysing app reviews for SE [1], [15], [16], [17]. Our SLR, however, differs substantially from existing studies in its scope and in the depth at which the literature is surveyed. Table 11 shows the differences between our study and previous works along the dimensions we considered for the comparison. We grouped the dimensions into study characteristics and topics surveyed in our study. The characteristics concern the study type (i.e., systematic literature review or survey), the time period covered, and the number of papers surveyed. The topics concern: Paper Demographics, App Review Analyses (RQ1), Mining Techniques (RQ2), Supporting Software Engineering (RQ3), Empirical Evaluation (RQ4), and Empirical Results (RQ5).

Martin et al. [1] surveyed the literature with the aim of demonstrating a newly emerging research area, i.e., app store analysis for software engineering. The scope of their survey is much broader than that of our study, as it covers literature analyzing various types of app store data (e.g., APIs, download ranks, or price). Our work has a much narrower scope, as its main objective is different: we synthesized only literature analyzing app reviews, with the aim of understanding how review analysis can support software engineering. We therefore examined this literature guided by several research questions. Though the related survey also addresses RQ1, our study is more up-to-date and larger in scale, covering 149 papers. More importantly, most dimensions of our SLR, i.e., RQ2-RQ5, are missing in this other study.

Two other studies also addressed our RQ2, but only partially, as they are narrower in scope [15], [16]. Tavakoli et al. [16] surveyed the literature on techniques and tools for mining app reviews. Similarly, Genc-Nayebi et al. [15] consolidated the literature to synthesize information on techniques for opinion mining. Our SLR addresses this dimension more broadly, rather than in the context of techniques for a specific review analysis or of tool-supported approaches. We have made an effort to consolidate general knowledge on the techniques the literature employs for 9 broad types of review analyses. We also provided a mapping between the different review analyses and the techniques facilitating their realization.

Noei et al. summarized 21 papers analysing app reviews from Google Play [17]. The authors provided an overview of each paper, briefly explaining its applications and mentioning its limitations. The surveyed papers were selected subjectively, rather than following a systematic search procedure. In contrast, our study is an SLR rather than a summary: following a systematic procedure, we selected 149 studies that we carefully read and then synthesized to answer five research questions. The related work marginally covers information for RQ1 and RQ2, but it does not do so with the precision or at the scale of our study.

In summary, none of the previous studies covers the full range of dimensions addressed in our work. Moreover, when a study does cover a dimension, it does not do so with the same level of detail (e.g., providing the precise number of papers in that dimension), precision (e.g., following a systematic protocol), or scale (e.g., number of surveyed papers) as our study.

TABLE 11: The summarized differences between our study and related works.

Dimensions                    Our Study   [1]       [15]      [16]      [17]
Study Type                    SLR         Survey    SLR       SLR       Survey
Time Period                   '10-'19     '00-'15   '11-'15   '11-'17   '12-'19
No. Papers                    149         45        24        34        21
Paper Demographics            X           X         X         X
App Review Analyses (RQ1)     X           X                             X
Mining Techniques (RQ2)       X                     X         X         X
Supporting SE (RQ3)           X
Empirical Evaluation (RQ4)    X
Empirical Results (RQ5)       X

7 CONCLUSION

In this paper, we presented the results of our SLR consolidating the knowledge from the literature analysing app reviews for software engineering. Through a systematic search, we identified 149 relevant studies in the research area, which we thoroughly examined and synthesized to answer our research questions. Our findings reveal a growing interest in the research area: the number of papers has tripled in the last three years, and many of these works were published in top SE venues, indicating their importance for the community.

During the review process, we acquired knowledge on how information can be mined from app reviews. We identified, categorized, and discussed the main types of review analysis as well as the data mining techniques used for their realization. We also mapped these techniques to specific analyses to clarify how a given analysis can be facilitated. This information should help scholars to understand how information is mined from reviews at both an abstract and a more technical level. It also provides the background necessary to understand how review analysis supports software engineering. With that knowledge, we consolidated information on the software engineering scenarios supported through review analysis. We found that scholars motivated mining app reviews to support 14 activities across different SDLC phases. We also learned that most studies lack explicit justifications of how the motivated scenarios are realized via review analysis. We addressed this gap and provided a thorough description of how these scenarios are realized. This knowledge should promote understanding of the motivation behind review analysis in the context of software engineering. We also summarized information on how mining approaches have been empirically evaluated, characterizing the procedures the primary studies undertook for effectiveness evaluations as well as user studies. Our results indicate that research efforts are mostly focused on evaluating the effectiveness of mining approaches, rather than their ability to satisfy the needs of intended users (e.g., app developers). Finally, we discussed the results of our SLR and highlighted potential directions for future research. We hope our SLR will help to communicate knowledge on analysing app reviews in the context of software engineering to scholars and practitioners, inspire scholars to advance the research area, and assist them in positioning their new works.


ACKNOWLEDGEMENTS

We thank the anonymous reviewers for their valuable feedback, and our colleagues who provided us with very helpful comments on this survey.

REFERENCES

[1] W. J. Martin, F. Sarro, Y. Jia, Y. Zhang, and M. Harman, “A survey of app store analysis for software engineering,” IEEE Trans. Software Eng., vol. 43, no. 9, pp. 817–847, 2017. [Online]. Available: https://doi.org/10.1109/TSE.2016.2630689

[2] J. Clement, “Number of apps available in leading appstores as of 1st quarter 2020,” Statista, 2020. [On-line]. Available: https://www.statista.com/statistics/276623/number-of-apps-available-in-leading-app-stores/

[3] S. O’Dea, “Number of smartphone users world-wide from 2016 to 2021,” Statista, 2020. [On-line]. Available: https://www.statista.com/statistics/330695/number-of-smartphone-users-worldwide/

[4] J. Clement, “Worldwide mobile app revenues in 2014 to 2023,”Statista, 2020. [Online]. Available: https://www.statista.com/statistics/269025/worldwide-mobile-app-revenue-forecast/

[5] A. AlSubaihin, F. Sarro, S. Black, L. Capra, and M. Harman, “Appstore effects on software engineering practices,” IEEE Transactionson Software Engineering, pp. 1–1, 2019.

[6] T. Johann, C. Stanik, A. M. A. B., and W. Maalej, “Safe: Asimple approach for feature extraction from app descriptionsand app reviews,” in 2017 IEEE 25th International RequirementsEngineering Conference (RE), Sep. 2017, pp. 21–30.

[7] E. Guzman and W. Maalej, “How do users like this feature? afine grained sentiment analysis of app reviews,” in 2014 IEEE22nd International Requirements Engineering Conference (RE), Aug2014, pp. 153–162.

[8] W. Maalej and H. Nabil, “Bug report, feature request, or simplypraise? on automatically classifying app reviews,” in 2015 IEEE23rd International Requirements Engineering Conference (RE), Aug2015, pp. 116–125.

[9] C. Iacob, S. Faily, and R. Harrison, “Maram: Tool supportfor mobile app review management,” in Proceedings ofthe 8th EAI International Conference on Mobile Computing,Applications and Services, ser. MobiCASE’16. Brussels, BEL:ICST (Institute for Computer Sciences, Social-Informatics andTelecommunications Engineering), 2016, p. 42–50. [Online].Available: https://doi.org/10.4108/eai.30-11-2016.2266941

[10] X. Gu and S. Kim, “"what parts of your apps are loved byusers?" (T),” in 30th IEEE/ACM International Conference onAutomated Software Engineering, ASE 2015, Lincoln, NE, USA,November 9-13, 2015, 2015, pp. 760–770. [Online]. Available:https://doi.org/10.1109/ASE.2015.57

[11] W. Maalej, Z. Kurtanovic, H. Nabil, and C. Stanik, “Onthe automatic classification of app reviews,” Requir. Eng.,vol. 21, no. 3, pp. 311–331, 2016. [Online]. Available:https://doi.org/10.1007/s00766-016-0251-9

[12] C. Gao, W. Zheng, Y. Deng, D. Lo, J. Zeng, M. R. Lyu, andI. King, “Emerging app issue identification from user feedback:Experience on wechat,” in Proceedings of the 41st InternationalConference on Software Engineering: Software Engineering inPractice, ser. ICSE-SEIP ’19. IEEE Press, 2019, p. 279–288.[Online]. Available: https://doi.org/10.1109/ICSE-SEIP.2019.00040

[13] C. Gao, J. Zeng, M. R. Lyu, and I. King, “Online app reviewanalysis for identifying emerging issues,” in Proceedings of the40th International Conference on Software Engineering, ser. ICSE’18. New York, NY, USA: Association for Computing Machinery,2018, p. 48–58. [Online]. Available: https://doi.org/10.1145/3180155.3180218

[14] L. Villarroel, G. Bavota, B. Russo, R. Oliveto, and M. Di Penta,“Release planning of mobile apps based on user reviews,” in 2016IEEE/ACM 38th International Conference on Software Engineering(ICSE), May 2016, pp. 14–24.

[15] N. Genc-Nayebi and A. Abran, “A systematic literature review: Opinion mining studies from mobile app store user reviews,” Journal of Systems and Software, vol. 125, pp. 207–219, 2017. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0164121216302291

[16] M. Tavakoli, L. Zhao, A. Heydari, and G. Nenadic, “Extracting useful software development information from mobile application reviews: A survey of intelligent mining techniques and tools,” Expert Systems with Applications, vol. 113, pp. 186–199, 2018. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0957417418303361

[17] E. Noei and K. Lyons, “A survey of utilizing user-reviews posted on google play store,” in Proceedings of the 29th Annual International Conference on Computer Science and Software Engineering, ser. CASCON ’19. USA: IBM Corp., 2019, p. 54–63.

[18] B. A. Kitchenham, “Procedures for performing systematic reviews,” 2004.

[19] P. Ralph, S. Baltes, D. Bianculli, Y. Dittrich, M. Felderer, R. Feldt, A. Filieri, C. A. Furia, D. Graziotin, P. He, R. Hoda, N. Juristo, B. A. Kitchenham, R. Robbes, D. Méndez, J. Molleri, D. Spinellis, M. Staron, K. Stol, D. Tamburri, M. Torchiano, C. Treude, B. Turhan, and S. Vegas, “ACM SIGSOFT empirical standards,” CoRR, vol. abs/2010.03525, 2020. [Online]. Available: https://arxiv.org/abs/2010.03525

[20] D. Moher, A. Liberati, J. Tetzlaff, and D. Altman, “Preferredreporting items for systematic reviews and meta-analyses: theprisma statement,” Br Med J, vol. 8, pp. 336–341, 01 2009.

[21] P. Bourque, R. Dupuis, A. Abran, J. Moore, and L. Tripp, “Theguide to the software engineering body of knowledge.” IEEESoftware, vol. 16, pp. 35–44, 12 1999.

[22] A. J. Viera and J. M. Garrett, “Understanding interobserver agree-ment: the kappa statistic.” Family medicine, vol. 37 5, pp. 360–3,2005.

[23] C. Wohlin, “Guidelines for snowballing in systematic literaturestudies and a replication in software engineering,” in Proceedingsof the 18th International Conference on Evaluation and Assessmentin Software Engineering, ser. EASE ’14. New York, NY, USA:Association for Computing Machinery, 2014. [Online]. Available:https://doi.org/10.1145/2601248.2601268

[24] N. Ide and J. Pustejovsky, Eds., Handbook of Linguistic Annotation. Springer Netherlands, 2017.

[25] M. Graham, A. T. Milanowski, and J. Miller, “Measuring andpromoting inter-rater agreement of teacher and principal perfor-mance ratings.” 2012.

[26] J. Dabrowski, “Replication package for System Literature Review:App Review Analysis for Software Engineering,” https://github.com/jsdabrowski/SLR-SE/, Oct. 2020.

[27] G. Miner, J. Elder, T. Hill, R. Nisbet, D. Delen, and A. Fast, PracticalText Mining and Statistical Analysis for Non-Structured Text DataApplications, 1st ed. USA: Academic Press, Inc., 2012.

[28] C. Gao, B. Wang, P. He, J. Zhu, Y. Zhou, and M. R. Lyu, “Paid:Prioritizing app issues for developers by tracking user reviewsover versions,” in 2015 IEEE 26th International Symposium onSoftware Reliability Engineering (ISSRE), Nov 2015, pp. 35–45.

[29] P. M. Vu, T. T. Nguyen, H. V. Pham, and T. T. Nguyen,“Mining user opinions in mobile app reviews: A keyword-based approach,” in Proceedings of the 30th IEEE/ACMInternational Conference on Automated Software Engineering, ser.ASE ’15. IEEE Press, 2015, p. 749–459. [Online]. Available:https://doi.org/10.1109/ASE.2015.85

[30] E. C. Groen, S. Kopczynska, M. P. Hauer, T. D. Krafft, and J. Doerr,“Users — the hidden software product quality experts?: A studyon how app users report quality aspects in online reviews,” in2017 IEEE 25th International Requirements Engineering Conference(RE), Sep. 2017, pp. 80–89.

[31] C. Iacob and R. Harrison, “Retrieving and analyzing mobile appsfeature requests from online reviews,” in Proceedings of the 10thWorking Conference on Mining Software Repositories. IEEE Press,2013, pp. 41–44.

[32] Y. Wang, H. Wang, and H. Fang, “Extracting user-reported mobileapplication defects from online reviews,” in 2017 IEEE Interna-tional Conference on Data Mining Workshops (ICDMW), Nov 2017,pp. 422–429.

[33] Y. Li, B. Jia, Y. Guo, and X. Chen, “Mining user reviews formobile app comparisons,” Proc. ACM Interact. Mob. WearableUbiquitous Technol., vol. 1, no. 3, Sep. 2017. [Online]. Available:https://doi.org/10.1145/3130935

[34] N. Chen, J. Lin, S. C. H. Hoi, X. Xiao, and B. Zhang, “Ar-miner: Mining informative reviews for developers from mobile app marketplace,” in Proceedings of the 36th International Conference on Software Engineering, ser. ICSE 2014. New York, NY, USA: Association for Computing Machinery, 2014, p. 767–778. [Online]. Available: https://doi.org/10.1145/2568225.2568263

[35] J. Oh, D. Kim, U. Lee, J.-G. Lee, and J. Song, “Facilitatingdeveloper-user interactions with mobile app review digests,” inCHI ’13 Extended Abstracts on Human Factors in ComputingSystems, ser. CHI EA ’13. New York, NY, USA: Association forComputing Machinery, 2013, p. 1809–1814. [Online]. Available:https://doi.org/10.1145/2468356.2468681

[36] A. Di Sorbo, S. Panichella, C. V. Alexandru, J. Shimagaki,C. A. Visaggio, G. Canfora, and H. C. Gall, “What would userschange in my app? summarizing app reviews for recommendingsoftware changes,” in Proceedings of the 2016 24th ACMSIGSOFT International Symposium on Foundations of SoftwareEngineering, ser. FSE 2016. New York, NY, USA: Association forComputing Machinery, 2016, p. 499–510. [Online]. Available:https://doi.org/10.1145/2950290.2950299

[37] R. Chandy and H. Gu, “Identifying spam in the ios app store,” inProceedings of the 2Nd Joint WICOW/AIRWeb Workshop on WebQuality. ACM, 2012, pp. 56–59.

[38] D. Martens and W. Maalej, “Towards understanding and detectingfake reviews in app stores,” Empirical Software Engineering,vol. 24, no. 6, pp. 3316–3355, 2019. [Online]. Available:https://doi.org/10.1007/s10664-019-09706-9

[39] A. Di Sorbo, S. Panichella, C. V. Alexandru, C. A. Visaggio,and G. Canfora, “Surf: Summarizer of user reviews feedback,”in Proceedings of the 39th International Conference on SoftwareEngineering Companion, ser. ICSE-C ’17. IEEE Press, 2017,p. 55–58. [Online]. Available: https://doi.org/10.1109/ICSE-C.2017.5

[40] S. Panichella, A. Di Sorbo, E. Guzman, C. A. Visaggio, G. Canfora,and H. C. Gall, “How can i improve my app? classifying userreviews for software maintenance and evolution,” in 2015 IEEEInternational Conference on Software Maintenance and Evolution(ICSME), Sep. 2015, pp. 281–290.

[41] S. Panichella, A. Di Sorbo, E. Guzman, C. A. Visaggio,G. Canfora, and H. C. Gall, “Ardoc: App reviews developmentoriented classifier,” in Proceedings of the 2016 24th ACMSIGSOFT International Symposium on Foundations of SoftwareEngineering, ser. FSE 2016. New York, NY, USA: Association forComputing Machinery, 2016, p. 1023–1027. [Online]. Available:https://doi.org/10.1145/2950290.2983938

[42] S. Mujahid, G. Sierra, R. Abdalkareem, E. Shihab, and W. Shang,“Examining user complaints of wearable apps: A case study onandroid wear,” in 2017 IEEE/ACM 4th International Conferenceon Mobile Software Engineering and Systems (MOBILESoft), May2017, pp. 96–99.

[43] A. Ciurumelea, S. Panichella, and H. C. Gall, “Poster: Automateduser reviews analyser,” in 2018 IEEE/ACM 40th International Con-ference on Software Engineering: Companion (ICSE-Companion),May 2018, pp. 317–318.

[44] D. Pagano and W. Maalej, “User feedback in the appstore: Anempirical study,” in 2013 21st IEEE International RequirementsEngineering Conference (RE), July 2013, pp. 125–134.

[45] H. Khalid, “On identifying user complaints of ios apps,” in Proceed-ings of the 2013 International Conference on Software Engineering.IEEE Press, 2013, pp. 1474–1476.

[46] S. McIlroy, N. Ali, H. Khalid, and A. E. Hassan, “Analyzingand automatically labelling the types of user issues that areraised in mobile app reviews,” Empirical Software Engineering,vol. 21, no. 3, pp. 1067–1106, 2016. [Online]. Available:https://doi.org/10.1007/s10664-015-9375-7

[47] E. Bakiu and E. Guzman, “Which feature is unusable? detect-ing usability and user experience issues from user reviews,” in2017 IEEE 25th International Requirements Engineering ConferenceWorkshops (REW), Sep. 2017, pp. 182–187.

[48] F. Alqahtani and R. Orji, “Usability issues in mentalhealth applications,” in Adjunct Publication of the 27thConference on User Modeling, Adaptation and Personalization,ser. UMAP’19 Adjunct. New York, NY, USA: Association forComputing Machinery, 2019, p. 343–348. [Online]. Available:https://doi.org/10.1145/3314183.3323676

[49] I. T. Mercado, N. Munaiah, and A. Meneely, “The impact ofcross-platform development approaches for mobile applicationsfrom the user’s perspective,” in Proceedings of the InternationalWorkshop on App Market Analytics, ser. WAMA 2016. New York,NY, USA: Association for Computing Machinery, 2016, p. 43–49.[Online]. Available: https://doi.org/10.1145/2993259.2993268

[50] L. Cen, L. Si, N. Li, and H. Jin, “User comment analysis for androidapps and cspi detection with comment expansion,” in Proceedingof the 1st International Workshop on Privacy-Preserving IR (PIR),2014, pp. 25–30.

[51] R. Deocadez, R. Harrison, and D. Rodriguez, “Automatically clas-sifying requirements from app stores: A preliminary study,” in2017 IEEE 25th International Requirements Engineering ConferenceWorkshops (REW), Sep. 2017, pp. 367–371.

[52] C. Wang, F. Zhang, P. Liang, M. Daneva, and M. van Sinderen,“Can app changelogs improve requirements classification fromapp reviews? an exploratory study,” in Proceedings of the12th ACM/IEEE International Symposium on Empirical SoftwareEngineering and Measurement, ser. ESEM ’18. New York, NY,USA: Association for Computing Machinery, 2018. [Online].Available: https://doi.org/10.1145/3239235.3267428

[53] H. Yang and P. Liang, “Identification and classification ofrequirements from app user reviews,” in The 27th InternationalConference on Software Engineering and Knowledge Engineering,SEKE 2015, Wyndham Pittsburgh University Center, Pittsburgh,PA, USA, July 6-8, 2015, 2015, pp. 7–12. [Online]. Available:https://doi.org/10.18293/SEKE2015-63

[54] M. Lu and P. Liang, “Automatic classification of non-functionalrequirements from augmented app user reviews,” in Proceedingsof the 21st International Conference on Evaluation and Assessmentin Software Engineering, ser. EASE’17. New York, NY, USA:Association for Computing Machinery, 2017, p. 344–353.[Online]. Available: https://doi.org/10.1145/3084226.3084241

[55] T. Wang, P. Liang, and M. Lu, “What aspects do non-functionalrequirements in app user reviews describe? an exploratory andcomparative study,” in 2018 25th Asia-Pacific Software EngineeringConference (APSEC), Dec 2018, pp. 494–503.

[56] Z. Kurtanovic and W. Maalej, “Mining user rationale from soft-ware reviews,” in 2017 IEEE 25th International RequirementsEngineering Conference (RE), Sep. 2017, pp. 61–70.

[57] Z. Kurtanovic and W. Maalej, “On user rationale in softwareengineering,” Requir. Eng., vol. 23, no. 3, pp. 357–379, 2018. [On-line]. Available: https://doi.org/10.1007/s00766-018-0293-2

[58] Y. Liu, L. Liu, H. Liu, and X. Wang, “Analyzing reviews guided byapp descriptions for the software development and evolution,”Journal of Software: Evolution and Process, vol. 30, no. 12,p. e2112, 2018, e2112 JSME-17-0184.R2. [Online]. Available:https://onlinelibrary.wiley.com/doi/abs/10.1002/smr.2112

[59] E. Guzman, P. Bhuvanagiri, and B. Bruegge, “Fave: Visualizinguser feedback for software evolution,” in 2014 Second IEEE Work-ing Conference on Software Visualization, Sep. 2014, pp. 167–171.

[60] E. Guzman, O. Aly, and B. Bruegge, “Retrieving diverse opinionsfrom app reviews,” in 2015 ACM/IEEE International Symposiumon Empirical Software Engineering and Measurement (ESEM), Oct2015, pp. 1–10.

[61] F. Palomba, P. Salza, A. Ciurumelea, S. Panichella, H. Gall,F. Ferrucci, and A. De Lucia, “Recommending and localizingchange requests for mobile apps based on user reviews,” inProceedings of the 39th International Conference on SoftwareEngineering, ser. ICSE ’17. IEEE Press, 2017, p. 106–117.[Online]. Available: https://doi.org/10.1109/ICSE.2017.18

[62] Z. Peng, J. Wang, K. He, and M. Tang, “An approach ofextracting feature requests from app reviews,” in CollaborateComputing: Networking, Applications and Worksharing - 12thInternational Conference, CollaborateCom 2016, Beijing, China,November 10-11, 2016, Proceedings, 2016, pp. 312–323. [Online].Available: https://doi.org/10.1007/978-3-319-59288-6_28

[63] W. Martin, M. Harman, Y. Jia, F. Sarro, and Y. Zhang, “The appsampling problem for app store mining,” in Proceedings of the 12thWorking Conference on Mining Software Repositories, ser. MSR ’15.IEEE Press, 2015, p. 123–133.

[64] P. M. Vu, H. V. Pham, T. T. Nguyen, and T. T. Nguyen,“Phrase-based extraction of user opinions in mobile app reviews,”in Proceedings of the 31st IEEE/ACM International Conferenceon Automated Software Engineering, ASE 2016, Singapore,September 3-7, 2016, 2016, pp. 726–731. [Online]. Available:https://doi.org/10.1145/2970276.2970365

[65] R. Chen, Q. Wang, and W. Xu, “Mining user requirements tofacilitate mobile app quality upgrades with big data,” ElectronicCommerce Research and Applications, vol. 38, p. 100889, 2019.

[66] B. Fu, J. Lin, L. Li, C. Faloutsos, J. Hong, and N. Sadeh, “Why people hate your app: Making sense of user feedback in a mobile app store,” in Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD ’13. New York, NY, USA: Association for Computing Machinery, 2013, p. 1276–1284. [Online]. Available: https://doi.org/10.1145/2487575.2488202

[67] J. Dabrowski, E. Letier, A. Perini, and A. Susi, “Findingand analyzing app reviews related to specific features: Aresearch preview,” in Requirements Engineering: Foundation forSoftware Quality - 25th International Working Conference, REFSQ2019, Essen, Germany, March 18-21, 2019, Proceedings, 2019,pp. 183–189. [Online]. Available: https://doi.org/10.1007/978-3-030-15538-4_14

[68] P. M. Vu, H. V. Pham, T. T. Nguyen, and T. T. Nguyen, “Toolsupport for analyzing mobile app reviews,” in 30th IEEE/ACMInternational Conference on Automated Software Engineering, ASE2015, Lincoln, NE, USA, November 9-13, 2015, 2015, pp. 789–794.[Online]. Available: https://doi.org/10.1109/ASE.2015.101

[69] T. Li, F. Zhang, and D. Wang, “Automatic user preferenceselicitation: A data-driven approach,” in Requirements Engineering:Foundation for Software Quality - 24th International WorkingConference, REFSQ 2018, Utrecht, The Netherlands, March 19-22,2018, Proceedings, 2018, pp. 324–331. [Online]. Available:https://doi.org/10.1007/978-3-319-77243-1_21

[70] F. Palomba, M. Linares-Vásquez, G. Bavota, R. Oliveto, M. DiPenta, D. Poshyvanyk, and A. De Lucia, “User reviews matter!tracking crowdsourced reviews to support evolution of successfulapps,” in 2015 IEEE International Conference on Software Mainte-nance and Evolution (ICSME), Sep. 2015, pp. 291–300.

[71] F. Palomba, M. Linares-Vásquez, G. Bavota, R. Oliveto, M. D.Penta, D. Poshyvanyk, and A. D. Lucia, “Crowdsourcing userreviews to support the evolution of mobile apps,” Journal ofSystems and Software, vol. 137, pp. 143 – 162, 2018.

[72] L. Pelloni, G. Grano, A. Ciurumelea, S. Panichella, F. Palomba,and H. C. Gall, “Becloma: Augmenting stack traces with userreview information,” in 2018 IEEE 25th International Conferenceon Software Analysis, Evolution and Reengineering (SANER), March2018, pp. 522–526.

[73] E. Noei, F. Zhang, S. Wang, and Y. Zou, “Towards prioritizing user-related issue reports of mobile applications,” Empirical SoftwareEngineering, vol. 24, no. 4, pp. 1964–1996, 2019. [Online].Available: https://doi.org/10.1007/s10664-019-09684-y

[74] L. Wei, Y. Liu, and S.-C. Cheung, “Oasis: Prioritizing staticanalysis warnings for android apps based on app user reviews,”in Proceedings of the 2017 11th Joint Meeting on Foundationsof Software Engineering, ser. ESEC/FSE 2017. New York, NY,USA: Association for Computing Machinery, 2017, p. 672–682.[Online]. Available: https://doi.org/10.1145/3106237.3106294

[75] A. Ciurumelea, A. Schaufelbühl, S. Panichella, and H. C. Gall,“Analyzing reviews and code of mobile apps for better releaseplanning,” in 2017 IEEE 24th International Conference on SoftwareAnalysis, Evolution and Reengineering (SANER), Feb 2017, pp. 91–102.

[76] G. Grano, A. Ciurumelea, S. Panichella, F. Palomba, and H. C. Gall,“Exploring the integration of user feedback in automated testingof android applications,” in 2018 IEEE 25th International Confer-ence on Software Analysis, Evolution and Reengineering (SANER),March 2018, pp. 72–83.

[77] D. Martens and T. Johann, “On the emotion of users in appreviews,” in Proceedings of the 2nd International Workshop onEmotion Awareness in Software Engineering, ser. SEmotion ’17.IEEE Press, 2017, p. 8–14.

[78] D. Martens and W. Maalej, “Release early, release often, andwatch your users’ emotions: Lessons from emotional patterns,”IEEE Software, vol. 36, no. 5, pp. 32–37, Sep. 2019.

[79] R. A. Masrury, Fannisa, and A. Alamsyah, “Analyzing tourismmobile applications perceived quality using sentiment analysisand topic modeling,” in 2019 7th International Conference onInformation and Communication Technology (ICoICT), July 2019,pp. 1–6.

[80] H. Malik, E. M. Shakshuki, and W.-S. Yoo, “Comparing mobileapps by identifying ‘hot’ features,” Future Generation ComputerSystems, 2018.

[81] J. Huebner, R. M. Frey, C. Ammendola, E. Fleisch, and A. Ilic,“What people like in mobile finance apps: An analysis of userreviews,” in Proceedings of the 17th International Conference onMobile and Ubiquitous Multimedia, MUM 2018, Cairo, Egypt,November 25-28, 2018, 2018, pp. 293–304. [Online]. Available:https://doi.org/10.1145/3282894.3282895

[82] F. Dalpiaz and M. Parente, “RE-SWOT: from user feedbackto requirements via competitor analysis,” in RequirementsEngineering: Foundation for Software Quality - 25th InternationalWorking Conference, REFSQ 2019, Essen, Germany, March 18-21, 2019, Proceedings, 2019, pp. 55–70. [Online]. Available:https://doi.org/10.1007/978-3-030-15538-4_4

[83] M. Nicolai, L. Pascarella, F. Palomba, and A. Bacchelli,“Healthcare android apps: a tale of the customers’ perspective,”in Proceedings of the 3rd ACM SIGSOFT International Workshop onApp Market Analytics, WAMA@ESEC/SIGSOFT FSE 2019, Tallinn,Estonia, August 27, 2019, 2019, pp. 33–39. [Online]. Available:https://doi.org/10.1145/3340496.3342758

[84] T.-P. Liang, X. Li, C.-T. Yang, and M. Wang, “What in consumerreviews affects the sales of mobile apps: A multifacet sentimentanalysis approach,” International Journal of Electronic Commerce,vol. 20, no. 2, pp. 236–260, 2015.

[85] R. P. L. Buse and T. Zimmermann, “Information needs forsoftware development analytics,” in 34th International Conferenceon Software Engineering, 2012, pp. 987–996. [Online]. Available:https://doi.org/10.1109/ICSE.2012.6227122

[86] A. Begel and T. Zimmermann, “Analyze this! 145 questions fordata scientists in software engineering,” in 36th InternationalConference on Software Engineering, 2014, pp. 12–13. [Online].Available: https://doi.org/10.1145/2568225.2568233

[87] L. Hoon, R. Vasa, J.-G. Schneider, and K. Mouzakis, “A preliminaryanalysis of vocabulary in mobile app user reviews,” in Proceedingsof the 24th Australian Computer-Human Interaction Conference.ACM, 2012, pp. 245–248.

[88] R. Vasa, L. Hoon, K. Mouzakis, and A. Noguchi, “A preliminaryanalysis of mobile app user reviews,” in Proceedings of the 24thAustralian Computer-Human Interaction Conference. ACM, 2012,pp. 241–244.

[89] G. Williams and A. Mahmoud, “Modeling user concerns in the appstore: A case study on the rise and fall of yik yak,” in 2018 IEEE26th International Requirements Engineering Conference (RE), Aug2018, pp. 64–75.

[90] C. Iacob, V. Veerappa, and R. Harrison, “What are you complain-ing about?: A study of online reviews of mobile applications,”in Proceedings of the 27th International BCS Human ComputerInteraction Conference. British Computer Society, 2013, pp. 29:1–29:6.

[91] S. Hassan, C. Bezemer, and A. E. Hassan, “Studying bad updatesof top free-to-download apps in the google play store,” IEEETransactions on Software Engineering, pp. 1–1, 2018.

[92] K. Srisopha, C. Phonsom, K. Lin, and B. Boehm, “Same app,different countries: A preliminary user reviews study on mostdownloaded ios apps,” in 2019 IEEE International Conference onSoftware Maintenance and Evolution (ICSME), Sep. 2019, pp. 76–80.

[93] E. Guzman and A. Paredes Rojas, “Gender and user feedback: Anexploratory study,” in 2019 IEEE 27th International RequirementsEngineering Conference (RE), Sep. 2019, pp. 381–385.

[94] I. Malavolta, S. Ruberto, T. Soru, and V. Terragni, “End users’perception of hybrid mobile apps in the google play store,” inProceedings of the 4th International Conference on Mobile Services(MS). IEEE, 2015.

[95] M. Ali, M. E. Joorabchi, and A. Mesbah, “Same app, differentapp stores: A comparative study,” in Proceedings of the 4thInternational Conference on Mobile Software Engineering andSystems, ser. MOBILESoft ’17. IEEE Press, 2017, p. 79–90.[Online]. Available: https://doi.org/10.1109/MOBILESoft.2017.3

[96] H. Hu, C. Bezemer, and A. E. Hassan, “Studying the consistencyof star ratings and the complaints in 1 & 2-star user reviews fortop free cross-platform android and ios apps,” Empirical SoftwareEngineering, vol. 23, no. 6, pp. 3442–3475, 2018.

[97] H. Hu, S. Wang, C. Bezemer, and A. E. Hassan, “Studyingthe consistency of star ratings and reviews of popularfree hybrid android and ios apps,” Empirical SoftwareEngineering, vol. 24, no. 1, pp. 7–32, 2019. [Online]. Available:https://doi.org/10.1007/s10664-018-9617-6

[98] S. McIlroy, W. Shang, N. Ali, and A. Hassan, “Is it worth respond-ing to reviews? a case study of the top free apps in the googleplay store,” IEEE Software, vol. PP, 2015.

[99] S. Hassan, C. Tantithamthavorn, C. Bezemer, and A. E. Hassan, “Studying the dialogue between users and developers of free apps in the google play store,” Empirical Software Engineering, vol. 23, no. 3, pp. 1275–1312, 2018. [Online]. Available: https://doi.org/10.1007/s10664-017-9538-9

[100] S. Scalabrino, G. Bavota, B. Russo, M. D. Penta, and R. Oliveto,“Listening to the crowd for the release planning of mobile apps,”IEEE Transactions on Software Engineering, vol. 45, no. 1, pp. 68–86, Jan 2019.

[101] Y. Man, C. Gao, M. R. Lyu, and J. Jiang, “Experience report:Understanding cross-platform app issues from user reviews,” in2016 IEEE 27th International Symposium on Software ReliabilityEngineering (ISSRE), Oct 2016, pp. 138–149.

[102] S. Keertipati, B. T. R. Savarimuthu, and S. A. Licorish,“Approaches for prioritizing feature improvements extractedfrom app reviews,” in Proceedings of the 20th InternationalConference on Evaluation and Assessment in Software Engineering,ser. EASE ’16. New York, NY, USA: Association for ComputingMachinery, 2016. [Online]. Available: https://doi.org/10.1145/2915970.2916003

[103] G. Tong, B. Guo, O. Yi, and Y. Zhiwen, “Mining and analyzing userfeedback from app reviews: An econometric approach,” in 2018IEEE SmartWorld, Ubiquitous Intelligence Computing, AdvancedTrusted Computing, Scalable Computing Communications, CloudBig Data Computing, Internet of People and Smart City Innovation(SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI), Oct 2018,pp. 841–848.

[104] S. A. Licorish, B. T. R. Savarimuthu, and S. Keertipati, “Attributesthat predict which features to fix: Lessons for app store mining,” inProceedings of the 21st International Conference on Evaluation andAssessment in Software Engineering, ser. EASE’17. New York, NY,USA: Association for Computing Machinery, 2017, p. 108–117.[Online]. Available: https://doi.org/10.1145/3084226.3084246

[105] J. Zhang, Y. Wang, and T. Xie, “Software feature refinementprioritization based on online user review mining,” Informationand Software Technology, vol. 108, pp. 30 – 34, 2019.

[106] H. Khalid, M. Nagappan, E. Shihab, and A. E. Hassan, “Prioritizingthe devices to test your app on: a case study of android gameapps,” in Proceedings of the 22nd ACM SIGSOFT InternationalSymposium on Foundations of Software Engineering, (FSE-22),Hong Kong, China, November 16 - 22, 2014, 2014, pp. 610–620.[Online]. Available: https://doi.org/10.1145/2635868.2635909

[107] G. Greenheld, B. T. R. Savarimuthu, and S. A. Licorish, “Automat-ing developers’ responses to app reviews,” in 2018 25th Aus-tralasian Software Engineering Conference (ASWEC), Nov 2018,pp. 66–70.

[108] C. Gao, J. Zeng, X. Xia, D. Lo, M. R. Lyu, and I. King, “Automatingapp review response generation,” in 2019 34th IEEE/ACM Interna-tional Conference on Automated Software Engineering (ASE), Nov2019, pp. 163–175.

[109] P. M. Vu, T. T. Nguyen, and T. T. Nguyen, “Why do app reviewsget responded: A preliminary study of the relationship betweenreviews and responses in mobile apps,” in Proceedings of the 2019ACM Southeast Conference, ser. ACM SE ’19. New York, NY,USA: Association for Computing Machinery, 2019, p. 237–240.[Online]. Available: https://doi.org/10.1145/3299815.3314473

[110] C. Iacob, R. Harrison, and S. Faily, “Online reviews as first classartifacts in mobile app development,” in Proceedings of the 5thInternational Conference on Mobile Computing, Applications, andServices. MobiCASE ’13, 2013.

[111] K. Phetrungnapha and T. Senivongse, “Classification of mobileapplication user reviews for generating tickets on issue trackingsystem,” in 2019 12th International Conference on InformationCommunication Technology and System (ICTS), July 2019, pp.229–234.

[112] C. Gao, H. Xu, J. Hu, and Y. Zhou, “Ar-tracker: Track the dynamicsof mobile apps via user review mining,” in 2015 IEEE Symposiumon Service-Oriented System Engineering, SOSE ’15, 2015, pp. 284–290.

[113] F. A. Shah, K. Sirts, and D. Pfahl, “Using app reviews forcompetitive analysis: Tool support,” in Proceedings of the 3rdACM SIGSOFT International Workshop on App Market Analytics,ser. WAMA 2019. New York, NY, USA: Association forComputing Machinery, 2019, p. 40–46. [Online]. Available:https://doi.org/10.1145/3340496.3342756

[114] H. Khalid, E. Shihab, M. Nagappan, and A. E. Hassan, “What domobile app users complain about?” IEEE Software, vol. 32, no. 3,pp. 70–77, May 2015.

[115] M. Sänger, U. Leser, S. Kemmerer, P. Adolphs, and R. Klinger, “SCARE - the sentiment corpus of app reviews with fine-grained annotations in German,” in Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), 2016. [Online]. Available: https://www.aclweb.org/anthology/L16-1178

[116] C. Stanik, M. Haering, and W. Maalej, “Classifying multilingualuser feedback using traditional machine learning and deep learn-ing,” in 2019 IEEE 27th International Requirements EngineeringConference Workshops (REW), Sep. 2019, pp. 220–226.

[117] D. Jurafsky and J. H. Martin, Speech and Language Processing (2ndEdition). USA: Prentice-Hall, Inc., 2009.

[118] C. D. Manning, P. Raghavan, and H. Schütze,Introduction to Information Retrieval. Cambridge, UK:Cambridge University Press, 2008. [Online]. Available: http://nlp.stanford.edu/IR-book/information-retrieval-book.html

[119] N. Jha and A. Mahmoud, “MARC: A mobile applicationreview classifier,” in Joint Proceedings of REFSQ-2017 Workshops,Doctoral Symposium, Research Method Track, and Poster Trackco-located with the 22nd International Conference on RequirementsEngineering: Foundation for Software Quality (REFSQ 2017),Essen, Germany, February 27, 2017, 2017. [Online]. Available:http://ceur-ws.org/Vol-1796/poster-paper-1.pdf

[120] A. Puspaningrum, D. Siahaan, and C. Fatichah, “Mobile appreview labeling using lda similarity and term frequency-inversecluster frequency (tf-icf),” in 2018 10th International Conferenceon Information Technology and Electrical Engineering (ICITEE),July 2018, pp. 365–370.

[121] C. M. Bishop, Pattern Recognition and Machine Learning (Informa-tion Science and Statistics). Berlin, Heidelberg: Springer-Verlag,2006.

[122] V. T. Dhinakaran, R. Pulle, N. Ajmeri, and P. K. Murukannaiah,“App review analysis via active learning: Reducing supervisioneffort without compromising classification accuracy,” in 2018 IEEE26th International Requirements Engineering Conference (RE), Aug2018, pp. 170–181.

[123] M. Sänger, U. Leser, and R. Klinger, “Fine-grained opinion miningfrom mobile app reviews with word embedding features,” inNatural Language Processing and Information Systems - 22ndInternational Conference on Applications of Natural Languageto Information Systems, NLDB 2017, Liège, Belgium, June21-23, 2017, Proceedings, 2017, pp. 3–14. [Online]. Available:https://doi.org/10.1007/978-3-319-59569-6_1

[124] E. Guzman, M. El-Halaby, and B. Bruegge, “Ensemble methodsfor app review classification: An approach for softwareevolution,” in Proceedings of the 30th IEEE/ACM InternationalConference on Automated Software Engineering, ser. ASE’15. IEEE Press, 2015, p. 771–776. [Online]. Available:https://doi.org/10.1109/ASE.2015.88

[125] F. A. Shah, K. Sirts, and D. Pfahl, “Simplifying theclassification of app reviews using only lexical features,”in Software Technologies - 13th International Conference,ICSOFT 2018, Porto, Portugal, July 26-28, 2018, RevisedSelected Papers, 2018, pp. 173–193. [Online]. Available:https://doi.org/10.1007/978-3-030-29157-0_8

[126] H. Khalid, M. Nagappan, and A. E. Hassan, “Examining therelationship between findbugs warnings and app ratings,” IEEESoftware, vol. 33, no. 4, pp. 34–39, July 2016.

[127] E. Guzman, L. Oliveira, Y. Steiner, L. C. Wagner, and M. Glinz,“User feedback in the app store: A cross-cultural study,” in 2018IEEE/ACM 40th International Conference on Software Engineering:Software Engineering in Society (ICSE-SEIS), May 2018, pp. 13–22.

[128] K. Srisopha and R. Alfayez, “Software quality through theeyes of the end-user and static analysis tools: A study onandroid oss applications,” in Proceedings of the 1st InternationalWorkshop on Software Qualities and Their Dependencies,ser. SQUADE ’18. New York, NY, USA: Association forComputing Machinery, 2018, p. 1–4. [Online]. Available:https://doi.org/10.1145/3194095.3194096

[129] M. Goul, O. Marjanovic, S. Baxley, and K. Vizecky, “Managingthe enterprise business intelligence app store: Sentiment analysissupported requirements engineering,” in 2012 45th Hawaii Inter-national Conference on System Sciences, Jan 2012, pp. 4168–4177.

[130] R. Deocadez, R. Harrison, and D. Rodriguez, “Preliminary studyon applying semi-supervised learning to app store analysis,” inProceedings of the 21st International Conference on Evaluation andAssessment in Software Engineering, ser. EASE’17. New York, NY,USA: Association for Computing Machinery, 2017, p. 320–323.[Online]. Available: https://doi.org/10.1145/3084226.3084285


[131] W. Maalej, M. Nayebi, T. Johann, and G. Ruhe, “Toward data-driven requirements engineering,” IEEE Software, vol. 33, no. 1,pp. 48–54, Jan 2016.

[132] Z. S. H. Abad, S. D. V. Sims, A. Cheema, M. B. Nasir, and P. Haris-inghani, “Learn more, pay less! lessons learned from applying thewizard-of-oz technique for exploring mobile app requirements,”in 2017 IEEE 25th International Requirements Engineering Confer-ence Workshops (REW), Sep. 2017, pp. 132–138.

[133] F. A. Shah, Y. Sabanin, and D. Pfahl, “Feature-based evaluationof competing apps,” in Proceedings of the International Workshopon App Market Analytics, ser. WAMA 2016. New York, NY, USA:Association for Computing Machinery, 2016, p. 15–21. [Online].Available: https://doi.org/10.1145/2993259.2993267

[134] N. Al Kilani, R. Tailakh, and A. Hanani, “Automatic classifica-tion of apps reviews for requirement engineering: Exploring thecustomers need from healthcare applications,” in 2019 SixthInternational Conference on Social Networks Analysis, Managementand Security (SNAMS), Oct 2019, pp. 541–548.

[135] L. V. Galvis Carreño and K. Winbladh, “Analysis of user comments:An approach for software requirements evolution,” in Proceedingsof the 2013 International Conference on Software Engineering, ser.ICSE ’13. IEEE Press, 2013, p. 582–591.

[136] L. Zhang, X. Huang, J. Jiang, and Y. Hu, “Cslabel: An approachfor labelling mobile app reviews,” J. Comput. Sci. Technol.,vol. 32, no. 6, pp. 1076–1089, 2017. [Online]. Available:https://doi.org/10.1007/s11390-017-1784-1

[137] Y. Liu, L. Liu, H. Liu, and X. Yin, “App store mining foriterative domain analysis: Combine app descriptions with userreviews,” Software: Practice and Experience, vol. 49, no. 6,pp. 1013–1040, 2019, sPE-19-0009.R1. [Online]. Available:https://onlinelibrary.wiley.com/doi/abs/10.1002/spe.2693

[138] N. Jha and A. Mahmoud, “Mining non-functional requirementsfrom app store reviews,” Empirical Software Engineering,vol. 24, no. 6, pp. 3659–3695, 2019. [Online]. Available:https://doi.org/10.1007/s10664-019-09716-7

[139] D. Sun and R. Peng, “A scenario model aggregation approach formobile app requirements evolution based on user comments,” inRequirements Engineering in the Big Data Era. Springer BerlinHeidelberg, 2015, vol. 558, pp. 75–91.

[140] V. H. S. Durelli, R. S. Durelli, A. T. Endo, E. Cirilo,W. Luiz, and L. Rocha, “Please please me: Does the presenceof test cases influence mobile app users’ satisfaction?” inProceedings of the XXXII Brazilian Symposium on SoftwareEngineering, ser. SBES ’18. New York, NY, USA: Association forComputing Machinery, 2018, p. 132–141. [Online]. Available:https://doi.org/10.1145/3266237.3266272

[141] M. Khalid, U. Shehzaib, and M. Asif, “A case of mobile appreviews as a crowdsource,” International Journal of InformationEngineering and Electronic Business (IJIEEB), vol. 7, no. 5, p. 39,2015.

[142] M. Gomez, R. Rouvoy, M. Monperrus, and L. Seinturier, “A recom-mender system of buggy app checkers for app store moderators,”in 2nd ACM International Conference on Mobile Software Engineer-ing and Systems. IEEE, 2015.

[143] H. Malik and E. M. Shakshuki, “Mining collective opinions forcomparison of mobile apps,” Procedia Computer Science, vol. 94,pp. 168 – 175, 2016, the 11th International Conference on FutureNetworks and Communications (FNC 2016) / The 13th Interna-tional Conference on Mobile Systems and Pervasive Computing(MobiSPC 2016) / Affiliated Workshops.

[144] S. Muñoz, O. Araque, A. F. Llamas, and C. A. Iglesias, “A cognitiveagent for mining bugs reports, feature suggestions and sentimentin a mobile application store,” in 2018 4th International Confer-ence on Big Data Innovations and Applications (Innovate-Data),Aug 2018, pp. 17–24.

[145] E. Noei, F. Zhang, and Y. Zou, “Too many user-reviews, whatshould app developers look at first?” IEEE Transactions on Soft-ware Engineering, pp. 1–1, 2019.

[146] M. Khalid, M. Asif, and U. Shehzaib, “Towards improving the qual-ity of mobile app reviews,” International Journal of InformationTechnology and Computer Science (IJITCS), vol. 7, no. 10, p. 35,2015.

[147] M. Nayebi, H. Cho, H. Farrahi, and G. Ruhe, “App store mining isnot enough,” in 2017 IEEE/ACM 39th International Conference onSoftware Engineering Companion (ICSE-C), May 2017, pp. 152–154.

[148] M. Nayebi, H. Cho, and G. Ruhe, “App store mining is notenough for app improvement,” Empirical Software Engineering,vol. 23, no. 5, pp. 2764–2794, 2018. [Online]. Available:https://doi.org/10.1007/s10664-018-9601-1

[149] G. Grano, A. Di Sorbo, F. Mercaldo, C. A. Visaggio, G. Canfora,and S. Panichella, “Android apps and user feedback: A datasetfor software evolution and quality improvement,” in Proceedingsof the 2nd ACM SIGSOFT International Workshop on App MarketAnalytics, ser. WAMA 2017. New York, NY, USA: Associationfor Computing Machinery, 2017, p. 8–11. [Online]. Available:https://doi.org/10.1145/3121264.3121266

[150] G. Deshpande and J. Rokne, “User feedback from tweets vs appstore reviews: An exploratory study of frequency, timing and con-tent,” in 2018 5th International Workshop on Artificial Intelligencefor Requirements Engineering (AIRE), Aug 2018, pp. 15–21.

[151] A. Simmons and L. Hoon, “Agree to disagree: On labelling helpfulapp reviews,” in Proceedings of the 28th Australian Conferenceon Computer-Human Interaction, ser. OzCHI ’16. New York, NY,USA: Association for Computing Machinery, 2016, p. 416–420.[Online]. Available: https://doi.org/10.1145/3010915.3010976

[152] S. Mcilroy, W. Shang, N. Ali, and A. E. Hassan, “User reviewsof top mobile apps in apple and google app stores,” Commun.ACM, vol. 60, no. 11, p. 62–67, Oct. 2017. [Online]. Available:https://doi.org/10.1145/3141771

[153] L. Hoon, R. Vasa, G. Y. Martino, J.-G. Schneider, andK. Mouzakis, “Awesome! conveying satisfaction on the appstore,” in Proceedings of the 25th Australian Computer-HumanInteraction Conference: Augmentation, Application, Innovation,Collaboration, ser. OzCHI ’13. New York, NY, USA: Associationfor Computing Machinery, 2013, p. 229–232. [Online]. Available:https://doi.org/10.1145/2541016.2541067

[154] W. Maalej, M. Nayebi, and G. Ruhe, “Data-driven requirementsengineering: An update,” in Proceedings of the 41st InternationalConference on Software Engineering: Software Engineering inPractice, ser. ICSE-SEIP ’19. IEEE Press, 2019, p. 289–290.[Online]. Available: https://doi.org/10.1109/ICSE-SEIP.2019.00041

[155] P. Weichbroth and A. Baj-Rogowska, “Do online reviews revealmobile application usability and user experience? the case ofwhatsapp,” in 2019 Federated Conference on Computer Science andInformation Systems (FedCSIS), Sep. 2019, pp. 747–754.

[156] E. Ha and D. Wagner, “Do android users write about electricsheep? examining consumer reviews in google play,” in ConsumerCommunications and Networking Conference (CCNC), 2013 IEEE,2013, pp. 149–157.

[157] F. A. Shah, K. Sirts, and D. Pfahl, “Simulating the impact ofannotation guidelines and annotated data on extracting app fea-tures from app reviews,” in International Conference on SoftwareTechnologies (ICSOFT), 2019.

[158] Z. Sun, Z. Ji, P. Zhang, C. Chen, X. Qian, X. Du, and Q. Wan,“Automatic labeling of mobile apps by the type of psychologicalneeds they satisfy,” Telematics and Informatics, vol. 34, no. 5, pp.767 – 778, 2017.

[159] L. Hoon, M. Rodriguez-García, R. Vasa, R. Valencia-García, and J.-G. Schneider, “App reviews: Breaking the user and developer lan-guage barrier,” in Trends and Applications in Software Engineering.Springer International Publishing, 2016, vol. 405, pp. 223–233.

[160] G. L. Scoccia, S. Ruberto, I. Malavolta, M. Autili, andP. Inverardi, “An investigation into android run-time permissionsfrom the end users’ perspective,” in Proceedings of the 5thInternational Conference on Mobile Software Engineering andSystems, ser. MOBILESoft ’18. New York, NY, USA: Associationfor Computing Machinery, 2018, p. 45–55. [Online]. Available:https://doi.org/10.1145/3197231.3197236

[161] S. Wang, Z. Wang, X. Xu, and Q. Z. Sheng, “App updatepatterns: How developers act on user reviews in mobileapp stores,” in Service-Oriented Computing - 15th InternationalConference, ICSOC 2017, Malaga, Spain, November 13-16,2017, Proceedings, 2017, pp. 125–141. [Online]. Available:https://doi.org/10.1007/978-3-319-69035-3_9

[162] K. Bailey, M. Nagappan, and D. Dig, “Examining user-developerfeedback loops in the ios app store,” in 52nd Hawaii InternationalConference on System Sciences, HICSS 2019, Grand Wailea, Maui,Hawaii, USA, January 8-11, 2019, 2019, pp. 1–10. [Online].Available: http://hdl.handle.net/10125/60178

[163] I. Malavolta, S. Ruberto, V. Terragni, and T. Soru, “Hybrid mobile apps in the google play store: an exploratory investigation,” in Proceedings of the 2nd ACM International Conference on Mobile Software Engineering and Systems. ACM, 2015.

[164] E. Noei, D. A. Da Costa, and Y. Zou, “Winning the appproduction rally,” in Proceedings of the 2018 26th ACMJoint Meeting on European Software Engineering Conferenceand Symposium on the Foundations of Software Engineering,ser. ESEC/FSE 2018. New York, NY, USA: Association forComputing Machinery, 2018, p. 283–294. [Online]. Available:https://doi.org/10.1145/3236024.3236044

[165] F. A. Shah, K. Sirts, and D. Pfahl, “Is the SAFE approachtoo simple for app feature extraction? A replication study,” inRequirements Engineering: Foundation for Software Quality - 25thInternational Working Conference, REFSQ 2019, Essen, Germany,March 18-21, 2019, Proceedings, 2019, pp. 21–36. [Online].Available: https://doi.org/10.1007/978-3-030-15538-4_2

[166] N. Jha and A. Mahmoud, “Mining user requirements fromapplication store reviews using frame semantics,” in RequirementsEngineering: Foundation for Software Quality - 23rd InternationalWorking Conference, REFSQ 2017, Essen, Germany, February 27- March 2, 2017, Proceedings, 2017, pp. 273–287. [Online].Available: https://doi.org/10.1007/978-3-319-54045-0_20

[167] C. Gao, J. Zeng, D. Lo, C.-Y. Lin, M. R. Lyu, and I. King,“Infar: Insight extraction from app reviews,” in Proceedingsof the 2018 26th ACM Joint Meeting on European SoftwareEngineering Conference and Symposium on the Foundations ofSoftware Engineering, ser. ESEC/FSE 2018. New York, NY,USA: Association for Computing Machinery, 2018, p. 904–907.[Online]. Available: https://doi.org/10.1145/3236024.3264595

[168] M. Nagappan and E. Shihab, “Mobile app store analytics,” inPerspectives on Data Science for Software Engineering, T. Menzies,L. Williams, and T. Zimmermann, Eds. Boston: Morgan Kauf-mann, 2016, pp. 47 – 49.

[169] S. Mujahid, G. Sierra, R. Abdalkareem, E. Shihab, andW. Shang, “An empirical study of android wear usercomplaints,” Empirical Software Engineering, vol. 23, no. 6, pp.3476–3502, 2018. [Online]. Available: https://doi.org/10.1007/s10664-018-9615-8

[170] N. Jha and A. Mahmoud, “Using frame semantics for classifyingand summarizing application store reviews,” Empirical SoftwareEngineering, vol. 23, no. 6, pp. 3734–3767, 2018. [Online].Available: https://doi.org/10.1007/s10664-018-9605-x

[171] ISO/IEC 25010, ISO/IEC 25010:2011, Systems and software engi-neering — Systems and software Quality Requirements and Evalua-tion (SQuaRE) — System and software quality models, Std., 2011.

[172] B. Nuseibeh, “Weaving together requirements and architectures,”Computer, vol. 34, no. 3, pp. 115–119, 2001.

[173] J. E. Burge, J. M. Carroll, R. McCall, and I. Mistrk, Rationale-BasedSoftware Engineering, 1st ed. Springer Publishing Company,Incorporated, 2008.

[174] IEEE, IEEE Standard Glossary of Software Engineering Terminology,IEEE Std., 1990.

[175] M. Erfani, A. Mesbah, and P. Kruchten, “Real challenges in mobileapp development,” in 2013 ACM/IEEE International Symposiumon Empirical Software Engineering and Measurement (ESEM), 102013, pp. 15–24.

[176] J. Pustejovsky and A. Stubbs, Natural Language Annotationfor Machine Learning - a Guide to Corpus-Building forApplications. O’Reilly, 2012. [Online]. Available: http://www.oreilly.de/catalog/9781449306663/index.html

[177] K. Hallgren, “Computing inter-rater reliability for observationaldata: An overview and tutorial,” Tutorials in Quantitative Methodsfor Psychology, vol. 8, pp. 23–34, 07 2012.

[178] C. J. C. M. L. T. P. M. W. V. Higgins JPT, Thomas J, Ed., CochraneHandbook for Systematic Reviews of Interventions, 2nd ed. Chich-ester (UK): John Wiley & Sons, 2019.

[179] B. Guo, Y. Ouyang, T. Guo, L. Cao, and Z. Yu, “Enhancing mo-bile app user understanding and marketing with heterogeneouscrowdsourced data: A review,” IEEE Access, vol. 7, pp. 68 557–68 571, 2019.

[180] D. M. Berry, “Evaluation of tools for hairy requirements and software engineering tasks,” in IEEE 25th International Requirements Engineering Conference Workshops, RE 2017 Workshops, Lisbon, Portugal, September 4-8, 2017, 2017, pp. 284–291. [Online]. Available: https://doi.org/10.1109/REW.2017.25

[181] D. Berry, “Keynote: Evaluation of NLP tools for hairy RE tasks,” in Joint Proceedings of REFSQ-2018 Workshops, Doctoral Symposium, Live Studies Track, and Poster Track co-located with the 23rd International Conference on Requirements Engineering: Foundation for Software Quality (REFSQ 2018), Utrecht, The Netherlands, March 19, 2018, 2018. [Online]. Available: http://ceur-ws.org/Vol-2075/NLP4RE_keynote.pdf

[182] B. A. Kitchenham, T. Dyba, and M. Jorgensen, “Evidence-based software engineering,” in Proceedings of the 26th International Conference on Software Engineering, ser. ICSE ’04. USA: IEEE Computer Society, 2004, p. 273–281.

[183] “App Annie Inc.” 2020. [Online]. Available: https://www.appannie.com/

[184] B. Miller, F. Linder, and W. R. Mebane, “Active learning ap-proaches for labeling text: Review and assessment of the perfor-mance of active learning approaches,” Political Analysis, p. 1–20,2020.

[185] D. H. Annis, “Probability and statistics: The science ofuncertainty, michael j. evans and jeffrey s. rosenthal,” TheAmerican Statistician, vol. 59, pp. 276–276, 2005. [Online].Available: https://EconPapers.repec.org/RePEc:bes:amstat:v:59:y:2005:m:august:p:276-276