19
Research Article On-Device Detection of Repackaged Android Malware via Traffic Clustering Gaofeng He, 1,2 Bingfeng Xu, 3 Lu Zhang, 4 and Haiting Zhu 1 1 College of Internet of ings, Nanjing University of Posts and Telecommunications, Nanjing 210003, China 2 Key Laboratory of Computer Network and Information Integration (Southeast University), Ministry of Education, Nanjing, China 3 College of Information Science and Technology, Nanjing Forestry University, Nanjing 210037, China 4 College of Information Engineering, Nanjing University of Finance & Economics, Nanjing 210046, China Correspondence should be addressed to Haiting Zhu; [email protected] Received 3 July 2019; Revised 14 March 2020; Accepted 18 May 2020; Published 31 May 2020 Academic Editor: Pino Caballero-Gil Copyright © 2020 Gaofeng He et al. is is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Malware has become a significant problem on the Android platform. To defend against Android malware, researchers have proposed several on-device detection methods. Typically, these on-device detection methods are composed of two steps: (i) extracting the apps’ behavior features from the mobile devices and (ii) sending the extracted features to remote servers (such as a cloud platform) for analysis. By monitoring the behaviors of the apps that are running on mobile devices, available methods can detect suspicious applications (simply, apps) accurately. However, mobile devices are typically resource limited. e feature extraction and massive data transmission might consume substantial power and CPU resources; thus, the performance of mobile devices will be degraded. To address this issue, we propose a novel method for detecting Android malware by clustering apps’ traffic at the edge computing nodes. First, a new integrated architecture of the cloud, edge, and mobile devices for Android malware detection is presented. en, for repackaged Android malware, the network traffic content and statistics are extracted at the edge as detection features. Finally, in the cloud, similarities between apps are calculated, and the similarity values are automatically clustered to separate the original apps and the malware. e experimental results demonstrate that the proposed method can detect repackaged Android malware with high precision and with a minimal impact on the performance of mobile devices. 1.Introduction Recently, Android platforms (e.g., smartphone, smartwatch, and tablet) have become increasingly popular worldwide. According to Statista, Android is accounted for more than 74%oftheglobalmobileOSmarketasofDecember2019[1]. One of the most important reasons for Android’s popularity is that mobile users can conveniently download various types of apps from app stores [2]. For example, an official Android market, namely, the Google Play store, already had more than 2.9 million apps in January 2020 [3]. ese feature-rich apps make the Android platform attractive and vibrant. Android apps, i.e., Android Package Kit (APK) files, are archives in ZIP format, which include developer bytecodes, resource files, and a manifest file. Unfortunately, in contrast to traditional executable software, Android apps are easy to repackage. With open-source tools such as apktool (https:// github.com/iBotPeaches/Apktool) and jadx (https://github. com/skylot/jadx), one can easily add code or modify re- source files to realize various objectives. For instance, malware writers can graft ad code on popular apps to make money or graft malicious code to steal user privacy infor- mation (for example, the Xavir malware has infected over 800 android apps. ese repackaged apps are often self- signed and do not require mobile devices to be rooted). Hindawi Security and Communication Networks Volume 2020, Article ID 8630748, 19 pages https://doi.org/10.1155/2020/8630748

On-DeviceDetectionofRepackagedAndroidMalwarevia …downloads.hindawi.com/journals/scn/2020/8630748.pdf · 2020-05-31 · According to [4–6], most Android malware programs are repackagedapps,andthemajorityofnewAndroidmalware

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: On-DeviceDetectionofRepackagedAndroidMalwarevia …downloads.hindawi.com/journals/scn/2020/8630748.pdf · 2020-05-31 · According to [4–6], most Android malware programs are repackagedapps,andthemajorityofnewAndroidmalware

Research ArticleOn-Device Detection of Repackaged Android Malware viaTraffic Clustering

Gaofeng He12 Bingfeng Xu3 Lu Zhang4 and Haiting Zhu 1

1College of Internet of ings Nanjing University of Posts and Telecommunications Nanjing 210003 China2Key Laboratory of Computer Network and Information Integration (Southeast University) Ministry of EducationNanjing China3College of Information Science and Technology Nanjing Forestry University Nanjing 210037 China4College of Information Engineering Nanjing University of Finance amp Economics Nanjing 210046 China

Correspondence should be addressed to Haiting Zhu htzhunjupteducn

Received 3 July 2019 Revised 14 March 2020 Accepted 18 May 2020 Published 31 May 2020

Academic Editor Pino Caballero-Gil

Copyright copy 2020 Gaofeng He et al is is an open access article distributed under the Creative Commons Attribution Licensewhich permits unrestricted use distribution and reproduction in any medium provided the original work is properly cited

Malware has become a significant problem on the Android platform To defend against Android malware researchers haveproposed several on-device detection methods Typically these on-device detection methods are composed of two steps (i)extracting the appsrsquo behavior features from the mobile devices and (ii) sending the extracted features to remote servers (such as acloud platform) for analysis By monitoring the behaviors of the apps that are running on mobile devices available methods candetect suspicious applications (simply apps) accurately However mobile devices are typically resource limited e featureextraction and massive data transmission might consume substantial power and CPU resources thus the performance of mobiledevices will be degraded To address this issue we propose a novel method for detecting Android malware by clustering appsrsquotraffic at the edge computing nodes First a new integrated architecture of the cloud edge and mobile devices for Androidmalware detection is presented en for repackaged Android malware the network traffic content and statistics are extracted atthe edge as detection features Finally in the cloud similarities between apps are calculated and the similarity values areautomatically clustered to separate the original apps and the malware e experimental results demonstrate that the proposedmethod can detect repackaged Android malware with high precision and with a minimal impact on the performance ofmobile devices

1 Introduction

Recently Android platforms (eg smartphone smartwatchand tablet) have become increasingly popular worldwideAccording to Statista Android is accounted for more than74 of the global mobile OSmarket as of December 2019 [1]One of the most important reasons for Androidrsquos popularityis that mobile users can conveniently download varioustypes of apps from app stores [2] For example an officialAndroid market namely the Google Play store already hadmore than 29 million apps in January 2020 [3] esefeature-rich apps make the Android platform attractive andvibrant

Android apps ie Android Package Kit (APK) files arearchives in ZIP format which include developer bytecodesresource files and a manifest file Unfortunately in contrastto traditional executable software Android apps are easy torepackage With open-source tools such as apktool (httpsgithubcomiBotPeachesApktool) and jadx (httpsgithubcomskylotjadx) one can easily add code or modify re-source files to realize various objectives For instancemalware writers can graft ad code on popular apps to makemoney or graft malicious code to steal user privacy infor-mation (for example the Xavir malware has infected over800 android apps ese repackaged apps are often self-signed and do not require mobile devices to be rooted)

HindawiSecurity and Communication NetworksVolume 2020 Article ID 8630748 19 pageshttpsdoiorg10115520208630748

According to [4ndash6] most Android malware programs arerepackaged apps and the majority of new Android malwaresamples are polymorphic variants of known malwareRepackaged Android malware is becoming a substantialthreat to Android security

Researchers have extensively investigated the detectionof (repackaged) Android malware and several methods havebeen proposed and evaluated [7ndash14] Previous works can begenerally divided into off-device and on-device detectionOff-device detection is typically adopted by security ana-lyzers who analyze APK files at a dedicated analysis servere analyzers can check the APKs with signatures or dy-namically run them in a real or virtual execution environ-ment to record their behaviors [15] If malicious code orbehaviors are identified the apps will be judged as malwareOff-device detection can also be conducted via networkmonitoring [14] We will refer to off-device detectionmethods of this type as network-based methods in thefollowing

On-device detection is implemented partially or com-pletely locally namely at usersrsquo mobile devices e rec-ognition of malware from the installed apps is attemptedTechnically on-device detection also identifies malware byanalyzing the behaviors of the apps Compared to off-devicedetection on-device detection can detect malware that isrunning on mobile usersrsquo devices and identify more so-phisticated malicious apps such as those which are triggeredby specified events For example by observing the deviationsof app network behaviors at the mobile device one candetect self-updating malware that may be triggered by aspecified geo-location device type or OS version [8]Nevertheless on-device detection is more challenging due tothe limited resources of mobile devices

Typically several steps should be conducted to detectmalicious Android apps at mobile devices First detectionfeatures such as the userrsquos operating behaviors API usagesand application network behaviors should be defined andextracted en based on the extracted features detectionmodels are constructed and finally new apps are comparedwith the constructed models for malware detectionDepending on where these steps are performed currentapproaches can be categorized into two main groups client-side detection and server-side detection [14] For client-sidedetection these steps are all conducted at the mobile deviceswhile for server-side detection the main steps (modelconstruction and malware detection) are conducted onremote servers

In practice the application of the client-side approachescould be restricted because they consume substantialamounts of resources and power According to reference [8]the storage of 10 Android malware detection models re-quires 7272 kB plusmn 15 memory consumption and the learningof these models requires 13 plusmn 15 CPU consumptionWorse the moremodels that are trained and saved the moreresources and power are consumed erefore the server-side methods are typically adopted [16] However the re-source and power consumptions of the server-side methodsremain high (eg 7 performance overhead and approxi-mately 3 battery overhead [17]) is is because the

detection features are still mainly collected and processed atthe mobile devices [10 17]

In this paper we propose a novel method for detectingrepackaged Android malware based on clustering appsrsquotraffic in the edge computing platform Although Androidmalware can use emulatorsandbox detection [18] andpacking [19 20] to impede dynamic and static analysis it isdifficult for them to hide the traffic to be inspectede basicstrategy of our method is that the feature extraction andpreprocessing can be offloaded to the edge us theworkloads of mobile devices are reduced e designedarchitecture for Android malware detection consists of threecomponents mobile devices edge servers and the cloude mobile devices are used normally and an additionaledge-client app will be installed and run in the backgrounde edge-client app operates identically to an Android VPNapp such as Packet Capture (httpsplaygooglecomstoreappsdetailsidappgreyshirtssslcaptureamphlen US) ex-cept it only records the flow metainformation such as theflow starting time and flow protocol and it does not need tosave and manipulate the packet contents Hence it islightweight and imposes a negligible burden on the mobiledevices

e edge servers collect mobile network traffic and extractdetection features from the traffic e traffic preprocessingthat is conducted by the edge servers not only reduces theamount of data to be sent but also protects the mobile usersrsquoprivacy because the cloud can only observe the coarse-grainedfeatures e extracted network traffic features are sent to thecloud for final processing In the cloud similarities betweenapps are calculated based on the received features After thatthese similarity values are automatically clustered to separateoriginal apps and repackaged malware We conducted anextensive set of experiments and the results demonstratedthat the average detection accuracy was 969

Our main contributions are as follows

(i) A lightweight and efficient framework that is basedon edge computing for the detection of Androidmalware is proposed Compared to the typicalserver-side approaches the proposed frameworkwill have minimal impact on the performances ofthe mobile devices In addition compared to thenetwork-based methods the detection accuracy canbe increased and the privacy of mobile users can bebetter protected

(ii) A traffic clustering method for the identification ofrepackaged Android malware is proposed Wecluster the network traffic that is generated by thesame apps to determine which one have beenrepackaged First we use the plaintext contents andflow statistical features to calculate the similaritiesbetween the apps en the similarity values areclustered to separate the original apps and therepackagedmalware automatically Via clustering itis unnecessary tomodel the appsrsquo network behaviorsor extract matching signatures for malware detec-tion in advance Hence this approach will be moreefficient in practice

2 Security and Communication Networks

(iii) Extensive experiments are conducted on publicdatasets

e remainder of this paper is organized as followse preliminaries of Android malware and edge com-puting are introduced in Section 2 In Section 3 wediscuss relevant related work Section 4 describes the newintegration architecture of the cloud edge and mobiledevices for Android malware detection e methodologyis presented in Section 5 An experimental evaluation ofour method in terms of accuracy and performance ispresented in Section 6 We discuss the limitations of theproposed method in Section 7 Finally in Section 8 wepresent the conclusions of the paper and discuss potentialfuture work

2 Preliminaries

In this section we introduce important concepts regardingAndroid malware and edge computing which form the basisof this work

21 Android Malware As discussed in Section 1 Androidapps (APKs) are easy to repackage which is convenient forattackers who would like to create Android malware In factmost Android malware is produced by repackaging benignapps especially favorite apps to ensure a wide diffusion ofthe malicious code [4] As evidence MalGenome [21] whichis a reference dataset in the Android security community80 of the malicious samples were built via repackagingother apps Repackaged apps can be distributed throughthird-party markets which do not typically requirescreening or integrity evaluation of the uploaded apps [22]For example many of the apps in F-Droid are found to berepackaged by the same developer [23] Consequently wedetect mainly malicious repackaged Android apps in thisstudy

Repackaged Android malware can be efficiently detectedby analyzing its network traffic [7 14] is is because therepackaged apps are typically functionalized to interact withremote servers to receive commands or return sensitive dataand their network behaviors will differ from the originalnetwork behaviors To better explain this we decompile twoapps namely 00575DE650F78413C91E8E613ADF9099812F325AA9CF36C72E81C79B230F58F4 and A55EA5807203CE31470D1905ED30F9267AE3459809DDA25C4F339197DE6486FE and compare their corresponding folders toidentify the newly added files e added files are furtheranalyzed (by reading the Java code) to investigate theirnetwork behaviors In this example the apprsquos name is theSHA256 value of the APK file and 00575lowast corresponds tothe original app and A55EAlowast is the repackaged app Weused Jadx to decompile APKs and the file comparison isconducted with WinMerge (httpwinmergeorg) Fig-ure 1 presents the comparison results

As shown in Figure 1 there are a total of 250 additionalJava code files and these code files are distributed in thesourcescom folder We read these added Java codes care-fully and determine that Android API HttpPost is invoked 8

times and HttpGet is invoked 5 times Hence there are atleast 13 flows that would never be produced by the originalapp namely 00575lowast erefore the original app can beefficiently distinguished from the repackaged version bycomparing their network traffic Inspired by these obser-vations we propose clustering the mobile appsrsquo traffic toseparate original apps and repackaged malwareautomatically

22 Edge Computing Edge computing can be defined interms of the technologies that enable computations to beperformed at the edge of the network on downstreamdata on behalf of cloud services and on upstream data onbehalf of Internet of ings (IoT) services [24] Perhapsthe most important concept of edge computing is that theedge can consist of any computing and network resourcesalong the path between mobileIoT devices and clouddata centers For example the edge could be a phonebetween a smart bracelet and the cloud or a wirelessrouter between the smart TV and the cloud if the phoneor the wireless router performs computations on behalfof the smart devices or the cloud In these examples thecomputations could be data anonymization for privacyprotection or content caching for performance im-provement Figure 2 illustrates the paradigm of edgecomputing

As illustrated in Figure 2 cloud computing-like capa-bilities ie edge servers are deployed at the edge of thenetwork for data (pre) processing and forwarding In edgecomputing the mobileIoT devices could send computationtasks to the edge servers namely the devices can distributecode to the edge to be executed hence the mobileIoTdevices can be designed to conduct complicated operationssuch as machine learning or encryption Similarly the cloudmay push services to the edge servers and the edge can bedeployed and updated online In most cases the devicesshould be connected to the edge and the cloud

Figure 1 Folders and files that were only present in repackagedapp A55EAlowast as indicated by left only in the third column

Security and Communication Networks 3

simultaneously as represented by the solid and dotted linesin Figure 2is is because not all data must be preprocessedby the edge servers and more importantly the robustness ofthe entire system can be improved If the edge breaks downthe cloud can continue to provide services In this study weobey the edge computing paradigm that is illustrated inFigure 2 in the design of the Android malware detectionsystem

3 Related Work

Repackaged Android apps are one of the major sources ofmobile malware and extensive studies have been conductedon Android malware detection In this section we will re-view the most closely related studies e typical studies ofedge computing are also reviewed to provide more detailedbackground information

31 Off-Device Detection Repackaged Android apps andmalware can be efficiently detected via code comparison ofAPKs DroidMoss [25] focuses on app code and conductspairwise similarity comparisons for the detection ofrepackaged apps in Android markets e detection resultsdemonstrate that 5 to 13 of apps that are hosted on third-party marketplaces were repackaged However the pairwisecomparison approach cannot be scaled up because millionsof Android apps are now available in markets To resolve thisissue Shao et al [26] proposed ResDroid for the detection ofrepackaged apps in Android markets via a divide-and-conquer strategy First they built a k-d tree and insertedclusters that contained one or more apps into the k-d treeen the apps were compared with their neighbors in the k-d tree and the number of comparisons was reduced As-suming that the apps from the official market were benignMassvet [27] downloaded all apps from Android markets inadvance and constructed v-core (a geometric center of a viewgraph) andm-core (mapping the features of a Java method toan index) databases for repackaged Android malware de-tection Both databases were used to vet new apps that weresubmitted to the market New apps would be reported asrepackaged malware if their v-cores or m-cores were similarto records in the databases and the code was not legitimatelyreused

In addition to code comparison behavior-based andUI-based detection methods are also emerging Droid-Chain [28] constructs a behavior chain model that iscomposed of typical behavior processes of Android apps

for the detection of malware In DroidChain the typicalbehaviors include privacy leakage SMS financial charg-ing malware installation and privilege escalationAccording to Tian et al [29] repackaged malware wasdifficult to detect partly because of their behavioralsimilarities to benign apps To overcome this problemthey proposed a new repackaged Android malware de-tection technique that was based on code heterogeneityanalysis e code structure of an app was partitioned intomultiple dependence-based regions and each region wasindependently classified according to its behavioral fea-tures De Lorenzo et al [30] proposed VizMal for visu-alizing the execution traces of Android applications andhighlighting potentially malicious behaviors Lin et al[31] proposed the detection of repackaged Android appsbased on static UI features ey detected a total of 3723repackaged app pairs in the Anzhi market and 15856repackaged pairs in the Mi market ese results dem-onstrate that repackaged malware still poses a severethreat to the security of mobile users even though varioustypes of detection methods have been proposed andadopted by Android markets [32]

Recently several network-based approaches have beenproposed for the detection of repackaged apps and malwareArora and Peddoju [33] demonstrated the effectiveness ofusing network traffic to detect Android malware Wu et al[34] divided app HTTP traffic into 2 categories primarymodule traffic and nonprimary module traffic For primarymodule traffic the HTTP flow distance algorithm and theHungarian Method were used to calculate the traffic simi-larity and identify similar app pairs CREDROID [35]identified malicious apps on the basis of Domain NameServer (DNS) queries and the data that are transmitted toremote servers Zulkifli et al [36] proposed a method fordetecting Android malware that is based on seven networktraffic features and the J48 decision tree algorithm Chenet al [37] introduced the imbalanced data gravitation-basedclassification algorithm for the classification of imbalanceddata of malicious apps He et al [14 38] proposed theidentification of encrypted appsrsquo flows via traffic correlationand the detection of repackaged Android apps via com-parison of the network behaviors of similar apps In thesemethods the detection is conducted mostly in the router orat the network monitoring node hence the performances ofmobile devices are not affected However the network be-haviors of malware must be modeled in advance which ischallenging in practice

Edge servers

Datacomputations

Data

Data

Data

DataservicesIoT devices

Cloud

Figure 2 Paradigm of edge computing

4 Security and Communication Networks

32On-DeviceDetection e detection of Android malwarecan also be conducted at usersrsquo mobile devices [39]Alzaylaee et al [40] extracted features from real devices andused a deep learning system to detect malicious Androidapps Shamili et al [41] proposed a distributed SupportVector Machine (SVM) algorithm for the detection ofAndroid malware on a network of mobile devices eirdetection features include short duration calls mediumduration calls and the number of outgoing SMS (ShortMessage Service) among others e experimental resultsdemonstrated that the average computation time per clientduring the training phase is nearly 10 seconds Zhao et al[42] implemented a malware detection framework namelyRobotDroid using an SVM active learning algorithmRobotDroid monitors all the running apps to identify theirrunning characteristics ese characteristics are later inputinto learning modules to generate the behavioral charac-teristics for the classification of normal apps and malwareAccording to their performance evaluation the time ofaccessing the GPS is increased from 80ms to almost 120mswhen the detection system is executed Hence the perfor-mance of a smartphone that utilizes this framework wouldbe significantly degraded Shabtai et al [8] monitored theappsrsquo network behaviors at the mobile devices to detectrepackaged Android malware and the detection accuracyexceeds 80 however the performance overhead of theirmethod is not negligible For example it requires7272 kB plusmn 15 memory consumption and 13 plusmn 15 CPUconsumption for learning 10 Android malware detectionmodels

To further improve the performance of on-device de-tection Crowdroid [10] sends the detection features to acentralized server for processing Crowdroid only recordsthe system calls as the detection features and it is light-weight Talha et al [43] proposed APK Auditor which is asystem that uses permissions as static analysis features forAndroid malware detection APK Auditor consists of threecomponents (1) a signature database (2) an Android clientand (3) a central server e Android client sends the app tothe central server for analysis e central server extracts thepermissions that are requested by the app and computes thepermission malware score based on the permissions inmalware If the score exceeds a threshold value the app isclassified as malware e results demonstrate that APKAuditor achieves 925 specificity however it lacks thebenefits of dynamic analysis as it cannot detect dynamicmalicious payloads

Monet [17] also uses backend servers and a client app todetect Androidmalwaree client app constructs a directedgraph of the app components and the system componentsand it records the potentially dangerous system calls as thedetection features ese features are forwarded to thebackend server which conducts further detection by ap-plying a signature matching algorithm and returns thedetection results e experimental results demonstrate thatMonet can detect Android malware with 99 accuracyHowever the performance overhead of Monet is high Asshown in [17] Monet had 8 overhead on memory and IObenchmarks and 55 overhead for battery Most recently

Arshad et al [44] proposed SAMADroid which is a 3-levelhybrid model for the detection of Android malware Sim-ilarly SAMADroid has a client app and an analysis servere client app hooks the Strace tool and traces the systemcalls that are invoked by the user app If the user app remainsrunning on the device the client app continues tracing thesystem calls of the user app and generates a log file of thesystem calls Later the log file is sent to the SAMADroidserver with the identifier of the user app At the server thesame user app is downloaded from stores and staticallyanalyzed en the static features and the dynamic featuresthat are extracted from the log file are embedded in a vectorspace Finally the features are provided as input to themachine learning tool for classification eir experimentalresults demonstrate that random forest yields 9907 ac-curacy on the static features and 8276 on the dynamicfeatures e performance overhead of SAMADroid is 18for memory and 06 for CPU In previous work [45] weused mobile edge computing to detect malicious apps Weextend it to the edge computing paradigm and refine theclustering method in this study

33 Edge Computing Research e vision and challenges ofedge computing are introduced in the literature [24] ebasic strategy of edge computing is to deploy cloud com-puting at the edge of the network to improve the perfor-mance and create new applications for IoT [46] Howeverthe deployment location of the edge and the functionalitiesof the edge are controversialerefore in applications edgecomputing can be implemented as fog computing [47]mobile edge computing [48 49] and mobile cloud com-puting [50] In the fog computing paradigm the analysis oflocal information is conducted on the ldquogroundrdquo and thecoordination and global analytics are conducted in theldquocloudrdquo One can argue that all objects that are outside thecloud constitute the ground thus fog computing is almostthe same as edge computing

Mobile edge computing deploys cloud services at theedge of mobile networks such as 5G Namely the edge isdeployed in close proximity to the eNodeB which loops thetraffic through the mobile edge computing servers for fur-ther data processing us in mobile edge computing theedge is fixed and the edge servers can be provided bytelecommunication companies Mobile cloud computingfocuses mainly on mobile delegation mobile devices shoulddelegate the storage of bulk data and the execution ofcomputationally intensive tasks to remote entities A remoteentity can be a centralized cloud or another mobile deviceerefore the edge can consist of mobile devices in mobilecloud computing Hence one of the most active areas ofresearch in the field of mobile cloud computing is thedelegation of tasks to external services especially to othermobile devices Task delegation is also essential for edgecomputing [51]

In this study we provide a lightweight and efficientframework for the detection of repackaged Androidmalwarethat is based on edge computing In the proposed frame-work the mobile devices need only to record the flow

Security and Communication Networks 5

starting time flow protocol server-side IP address server-side port app name and version for each network flowLater these data are sent to the edge servers for traffictagging Since the amounts of data that are saved and sent aresmall the performances of mobile devices are not signifi-cantly degraded e edge servers extract detection featuresfrom the labeled app network traffic and send the extractedfeatures to the cloud to be clustered After clustering theoriginal and repackaged apps can be efficiently distin-guished In our study only network traffic is considered inAndroid malware detection In contrast to previous net-work-based methods such as [14 34ndash36] we do not need tomodel appsrsquo network behaviors in advance and our methodis more useful in practice

4 Framework for Android Malware Detection

e on-device detection of Android malware is challengingbecause mobile devices typically have limited resources Toalleviate the storage and computing limitations and prolongthe lifetimes of the mobile devices in this study we proposea new framework for on-device detection of repackagedAndroid malware e proposed framework is illustrated inFigure 3

Our framework is composed of three layers mobiledevices the edge and the cloud First we explain thepossible locations of the edge later we introduce thefunctionalities of each component In our framework theonly requirement for the edge is that it can monitor andprocess the mobile network traffice mobile devices couldconnect to the Internet via WiFi or 3G4G hence the edgenodes can be wireless gateways household security gatewayssuch as BitDefender Box and LTE eNodeBs e edge nodescan also be routers that are located in the backbone networkif they can observe the traffic that is generated by the mobiledevices erefore the deployment locations of the edgenodes are flexible in practice Figure 4 illustrates the possibledeployment scenarios of the edge

In the proposed framework the mobile devices mustcollect and send flow metaformation to the edge for datapreprocessing For this an edge-client app is designed to berun on themobile devicese edge-client app operates as anAndroid VPN app Its main objective is to collect andtransfer the flow metainformation which includes the appname flow starting time and flow protocol to the edgeiteratively With the flow metainformation the edge serverpreprocesses the network traffic by adding a label (the appname) to each flow After that it extracts detection featuresfrom these labeled flows e traffic labeling and featureextraction are illustrated in detail in Section 52 Once thefeature extraction has been completed the edge server sendsthese features to the cloud With the received features thecloud conducts the malware detection tasks and notifies themobile devices of the detection results

To further illustrate the advantages of the proposedframework we compare it with the available frameworksSince the proposed framework is based on edge computingwe will refer to it as the edge-computing-based framework inthe following Similarly the available frameworks are termed

as the cloud-based framework (the server-side detectionmethods that were described in Section 1) and the network-based framework (the network-based methods that wereintroduced in Sections 1 and 3) for differentiation Com-pared to the popular cloud-based framework (Figure 5(a))the most prominent advantage of the edge-computing-basedframework is that the feature extraction process is offloadedto the edge and the mobile devices must only save andtransmit the flow metainformation us the proposededge-computing-based framework is resource-friendly formobile devices

e network-based framework (Figure 5(b)) monitorsnetwork traffic without any interference withmobile devicesHowever it is less accurate since the network nodes cannotknow the origins of the network flows precisely namely itcannot identify app traffic with 100 accuracy As dem-onstrated in the literature [14] since the network nodecannot identify the app for each flow (eg some flows lackssignatures or are encrypted) the appsrsquo network behaviors areincomplete and the malware detection accuracy is reducedIn the proposed edge-computing-based framework wedesign a lightweight edge-client app for gathering app in-formation Furthermore mobile usersrsquo privacy can be betterprotected in our proposed framework By deploying onersquosown edge server a mobile user can control the types ofinformation that can be used as detection features Incontrast in the network-based framework the networknodes can observe the full traffic content and extract anyinformation they want which severely threatens the privacyof mobile users

Android malwaredetectionCloud

Edge

Mobiledevices

(1) Flowmetainformation

(2) Extractedfeatures

(3) Detectionresults

Network connectionData transmission

Figure 3 Framework for android malware on-device detectionthat is based on edge computing Mobile devices send flow met-ainformation such as the app name and the server-side IP addressto the edge Edge servers capture app network traffic and extractHTTP contents and flow statistical features from the capturedtraffic Later the features are sent to the cloud for malware de-tection e cloud also interacts with the mobile devices to returnthe detection results

6 Security and Communication Networks

After introducing the general architecture we outline thedetection methods that can be used in the edge-computing-based framework e cloud can use various methods todetect Android malware For example it can use conven-tional signature matching [35] and network behavioranalysis [8 14] to identify malicious apps However thesetypes of methods require a priori information (such assignatures or behavior models) In this paper we propose anovel repackaged malware detection method that clustersthe network traffic that is generated by the same app fromvarious mobile devices Figure 6 presents an example of thisprocess

e motivations behind the traffic clustering are two-fold first most Android malware is produced by repack-aging popular apps [4] and popular apps are often widelydistributed Second the network traffic generated by therepackaged malware differs from that generated by theoriginal app (as illustrated in Section 21) As illustrated inFigure 3 many mobile devices would connect to the cloudfor malware detection Hence the cloud can observe thenetwork traffic features of both the original apps and theirrepackaged versions with large probabilities due to the widedistribution of popular apps and the large number of mobiledevices If we regard the features of the original apps as thenormal data and the features of the repackaged versions asthe abnormal data the repackaged malware detectionproblem is transformed into the abnormal detection

problem Clustering is an efficient method for solving thisproblem [52] By clustering appsrsquo network traffic features weneed not model appsrsquo network behaviors in advance andmore importantly we can detect various repackaged ver-sions of the original app as illustrated in Figure 6e detailsof the proposed malware detection method will be describedin detail in Sections 53ndash55

5 Methodology

In this section we introduce the technical details of theproposed clustering method for Android malware detectionTable 1 defines the main notation that we use in this paper

(a) (b) (c)

Figure 4 Possible deployments of the edge e edge could be the wireless gateways or the dedicated servers that are connected to theeNodeB or the routers (a) Wireless gateway (b) eNodeB+ server (c) Router + server

Clustering

Traffic features of an app collected from

different mobile devicesOriginal apps

Repackaged apps

Figure 6 Illustration of traffic clustering for repackaged malwaredetection e plain circles represent original apps and the circleswith triangles and inverted triangles are repackaged ones

Network connectionData transmission

Android malwaredetectionCloud

Mobiledevices

(1) Extractedfeatures

(2) Detectionresults

(a)

Network connectionData transmission

Network node

Mobiledevices

(1) Detectionresults

(b)

Figure 5 Available frameworks for android malware detection Various approaches such as [10] send the extracted features to a remoteserver other than to the cloud We still regard them as following the cloud-based framework because there is no essential difference formalware detection As illustrated in (b) the detection results are typically presented to the network administrator and are not fed back to themobile devices (a) Framework based on cloud (b) Framework based on network

Security and Communication Networks 7

51 System Overview Figure 7 illustrates the main proce-dure of the proposed approach App i is the app for vettingand it has been installed on umobile devicesese u devicesare connected to e edge servers We assume that r deviceshave installed the repackaged malware (the malware is alsopresented as App i to deceive the users) and rlt (12)uHence most of the devices have installed the original ver-sion is situation is reasonable because repackaged An-droidmalware is typically distributed in third-party marketswhich are less popular than official markets [23] us thenumbers of downloads and installations are relatively smallWe will use this characteristic to classify the clusteringresults

To detect the repackaged malware the edge server labelsthe network flows that are generated by App i according tothe flow metainformation en it extracts suitable trafficfeatures (such as traffic content and statistics) from thelabeled flows and sends the features to the cloud for pro-cessing erefore there are u feature sets for App i in thecloud e cloud filters these u sets according to the appversions and removes common traffic After that it calculatesthe pairwise similarities of the feature sets Finally it clustersthe similarity values and identifies r repackaged malwareinstances

e traffic labeling feature extraction and filtering(labels①② and③ in Figure 7) the similarity calculationon the traffic contents and behaviors (labels④ and⑤) andclustering of similarity values (label⑥) are the core elementsof the proposed method ey will be described in detail inthe following sections

52 Traffic Labeling Filtering and Feature Extraction

521 Traffic Labeling In this study traffic labeling isconducted to identify the corresponding app for each net-work flow which is known as the app identification problem

[53] Extensive works have been conducted on the identi-fication of apps from mobile network traffic [38 54 55]However the achieved identification accuracies are all lowerthan 100 In our method we design an edge-client app forcollecting flow metaformation and identifying apps accu-rately e flow metainformation is defined in the following

Definition 1 (flow metainformation) For flow f the flowmetainformation includes the client-side IP address C-IPfthe flow starting time Tf the flow protocol Pf the server-sideIP address S-IPf the server-side port S-Portf the app nameAppNf and the app version AppVfe flow f is bidirectionaland Pf is the transport layer protocol such as TCP or UDP

e edge-client app operates similarly to a VPN appeapp sends the flow metainformation to the edge server at afixed time interval Ti Once it has received the flow meta-information the edge server saves it into a metainformationdatabase and can label mobile app traffic accurately Denotethe flow observed by the edge server as fu e edge serverlabels fu as follows

Step 1 extract the recording time Tuf of fu and its flow

protocol Puf client-side IP address CminusIPu

f server-sideIP address SminusIPu

f and server-side port SminusPortufStep 2 search the metainformation database with Pu

fCminusIPu

fSminusIPuf and SminusPortuf If records match the da-

tabase returns the record that has the most recent flowstarting time Tr

f Otherwise the edge server waits for awhile and repeats Step 2Step 3 if Tu

f is close to Trf the edge server labels flow fu

as ltAppNf AppVfgtOtherwise it waits for a while andrepeats Steps 2 and 3

In the above steps the client-side IP address CminusIPuf is

used to identify a mobile device uniquely which is correctwhen the edge server is the wireless gateway or the eNodeBIn such scenarios the mobile devices directly connect to theedge hence the IP address is unique for each deviceHowever if the edge server is deployed at the backbonerouter as illustrated in subgraph (c) of Figure 4 the CminusIPu

f

may be confusing because mobile devices could be locatedbehind NATs (Network Address Translations) and they willshare the same public IP address To solve this issue theedge-client app can insert special indicators into the packetsto uniquely represent a device In this study we alwaysassume that CminusIPu

f is unique and we will investigate theimplementation of the device-indicator in future work

In steps 2 and 3 the edge server utilizes a wait-and-repeat strategy to ensure that the traffic labeling is fresh andaccurate e edge server must wait because the mobiledevices send flow metainformation to the edge server at afixed time interval Ti (60 s in our experiments) and the savedflow metainformation in the database may be outdated Inour implementations the waiting time is set to Ti and if|Tu

f minus Trf|gt 24 h the returned flow metainformation is

regarded as obsolete

522 Traffic Feature Extraction After traffic labeling theedge server extracts detection features from the labeled

Table 1 Main notation used in this paper

Notation Meaningi Android app iu Number of mobile devices with app i installedr Number of devices with repackaged app i

f Network flowC-IPf Client-side IP address of fS-IPf Server-side IP address of fS-Portf Server-side port of fAppNf Name of the app that generates fAppVf Version of the app that generates fTi Set time intervalTu

f Recording time of flow fu at the edge serverdi Feature set of app iwj Plaintext word in the feature setsV(di) Numerical vector of traffic contentsF

j

diFeature vector for the encrypted flow j in di

TBntimesm(di) Traffic behaviors of diCSdik

Content similarity between di and dkBSdik

Behavior similarity between di and dkSdik

Final similarity between di and dk

8 Security and Communication Networks

traffic e extracted traffic features are categorized into twogroups traffic content features and traffic behavior featuresTraffic content features are the plaintext contents that areextracted from HTTP flows and traffic behavior features arethe statistics of network flows such as packet sizes andintervals Since Android apps may use an encrypted HTTPSprotocol to transmit data [54] traffic behavior features alsomust be considered e traffic content and traffic behaviorfeatures are described in detail in Sections 53 and 54

523 Traffic Filtering After traffic feature extraction theedge servers send the extracted features along with theclient-side IP address server-side IP addresses server-sideports app name AppNf and app version AppVf to the cloudNote that App i is installed on u devices hence the cloudeventually has u feature sets for App i Figure 8 illustrates thestructure of the feature set e cloud further filters these ufeature sets to increase the detection accuracy First it di-vides the u sets into v categories according to the app versionAppVf erefore each category contains the feature setsthat are generated by the same version of App i en foreach category the cloud removes the common traffic fromthe feature sets e common traffic is defined as network

flows that are contained in all features sets ese flows havethe same server-side IP address and the same server-sideport number e reason for this step is that the commontraffic will obfuscate the similarity calculations betweenapps Otherwise the similarity values between features setsare all high and they cannot be effectively used to distin-guish the repackaged malware from the original apps

After the filtering has been completed the similaritiesbetween feature sets are calculated and the similarity valuesare clustered to identify the repackaged malware

Client-side IP addressApp name AppN_fApp version AppV_fFlow 1 server-side IP address 1 server-side port 1 flow features

Flow f server-side IP address f server-side port f flow features

Figure 8 Illustration of the feature set e set contains the clientIP address app name and app version Each network flow that isgenerated by App i in a device is represented by the featuresextracted from the flow and by the server IP address and port

Trafficlabeling

Edge server 1

Traffic featureextraction

Similarity calculation of

trafficbehaviors

Similarity calculation of traffic contents Clustering of

similarity values

Original app

Repackaged malware

Cloud

Mobile device 1 2 (with app i installed)

Results

Traffic features

Traffic and flowmetainformation

ę

ę

Edge server e

Traffic features

Traffic and flow metainformation

Filtering

Mobile device u(with app i installed)

1

3

4

5

6

2

Figure 7emain procedure of the proposedmethod for Androidmalware detectione core parts of the proposedmethod are labeled bycircled numbers

Security and Communication Networks 9

53 Similarity Calculation of Traffic Content Android appsuse mainly HTTP and HTTPS protocols to communicatewith their remote servers [54] For HTTP traffic since it isplaintext based we can efficiently use the flow content tocalculate the similarity e basic strategy is to convertHTTP traffic to document files and the document files canbe handled by information retrieval technologies for simi-larity calculation HTTP flow content has also been used inapp identification from network traffic [55] However in thisstudy we only consider partial HTTP header information tobetter protect mobile usersrsquo privacy

In detail first we reassemble HTTP flows from packetsand extract the flow contents by saving the correspondingpacket contents into a text file as ASCII code Figure 9 showssuch an example en we preprocess the text file by de-leting the HTTP request content and the HTTP requestheader except for the GET (or POST) and HOST fields eHTTP response content is also deleted namely only theHTTP response header the GET (or POST) and the HOSTfiled in the HTTP request header are saved Via this ap-proach the mobile usersrsquo privacy can be protected from thecloud (for example the mobile phone information (HUA-WEI) is leaked by the User-Agent field in Figure 9) Afterthat the text file is spilt using special characters (eg ) and the processed file becomes the flow contentfeatures

e similarity is calculated via the improved TF-IDFalgorithm [56] and the cosine similarity e term frequencymeasures the count of words in a document We use theaugmented frequency measurement Denote the document(the feature set) as di and the word as wj Term frequencytf(wj di) is calculated as

tf wj di1113872 1113873 05 +05 times f wj di1113872 1113873

max f w di( 1113857 w isin di1113864 1113865 (1)

where f(wj di) is the number of occurrences of wj indocument di

e inverse document frequency measures of theamount of information a word provides Typically it is thelogarithmically scaled inverse fraction of the number ofdocuments that contain the word e inverse documentfrequency of wj is calculated as

idf wj1113872 1113873 logu

d wj isin d1113966 111396711138681113868111386811138681113868

11138681113868111386811138681113868+ 001⎛⎝ ⎞⎠ (2)

where u is the number of documents and | d wj isin d1113966 1113967| isnumber of documents in which the term wj appears

Denote by p(wj di) the product of TF(wj di) andI DF(wj) it is formulated as

p wj di1113872 1113873 log tf wj di1113872 11138731113872 1113873 times idf wj1113872 1113873

1113936

cj0 log

2 tf wj di1113872 11138731113872 1113873 times idf2 wj1113872 11138731113969 (3)

After the TF-IDF operation all documents are convertedinto equal-length numerical vectors We use the cosinesimilarity to measure the similarity between two documentvectors For documents di and dk the corresponding

numerical vectors are denoted as V(di) and V(dk) and thelength of vector is t e similarity between di and dk iscalculated as

CSdik

1113936tl1 V di( 1113857l times V dk( 1113857l( 1113857

1113936tl1 V di( 1113857

2l

1113969

times

1113936tl1 V dk( 1113857

2l

1113969 (4)

Suppose that in category vi (we divide the u feature setsgenerated by App i in to v categories in the traffic filteringstep) there are c feature sets For each feature set we save theflow features of all HTTP flows into a text file and c text filesare obtained en we calculate the numerical vector foreach feature set via equation (3) Last we compare thenumerical vector with each other as equation (4) and gain csimilarity values (CSdii

1) for each feature set We rep-resent these similarity values of feature set di (i 1 2 c)as a vector Si

contents and Sicontents (CSdi1

CSdi2 CSdic

)e vectors Si

contents (i 1 2 c) in combination with thesimilarity values of traffic behaviors will be clustered toidentify the abnormal data points namely the repackagedmalware

54 SimilarityCalculation of TrafficBehaviors In addition tocalculating the similarity values of traffic contents thesimilarity of traffic behaviors must be calculated due toencrypted flows We use m (m 11 in our study) statisticalfeatures as listed in Table 2 to represent the behavior of anencrypted network flow e units of the flow duration andthe packet interval time are milliseconds in our experiments

With the flow statistical features for feature set di the jthencrypted flow can be represented as a vector in the featurespace

Fj

di f

1j f

2j f

mj1113872 1113873 (5)

where flj represents the value of the lth statistical feature

1le llem Figure 10 shows an example of feature represen-tation for encrypted traffic Consequently the traffic be-haviors of di can be represented as a matrix of f

pj

Figure 9 Example of converting an HTTP flow to a text file

10 Security and Communication Networks

TBntimesm di( 1113857 F1di

F2di

Fndi

1113872 1113873T (6)

where n is the number of encrypted flows that are containedin di and nge 1

Since different feature sets may differ in terms of thenumber of encrypted flows traffic behaviors TBnitimesm(di) andTBnktimesm(dk) may differ in terms of different dimensionnamely the value of ni should not be equal to that of nk Toefficiently calculate the similarity between TBnitimesm(di) andTBnktimesm(dk) we calculate the Frobenius norm of TBnitimesm(di)

and TBnktimesm(dk) e formula is presented in equation (7)where norm(middot) normalizes the value of fl

j to [0 1]

FN(d)

1113944

n

j11113944

m

l1norm fl

j1113872 111387311138681113868111386811138681113868

111386811138681113868111386811138682

11139741113972

(7)

en the distance between TBnitimesm(di) and TBnktimesm(dk) iscalculated as

dis di dk( 1113857 FN di( 1113857 minus FN dk( 11138571113868111386811138681113868

1113868111386811138681113868 (8)

Finally the similarity between TBnitimesm(di) andTBnktimesm(dk) can be calculated by

BSdik 1 minus

dis di dk( 1113857

max dis dr dq1113872 1113873 r q 1 2 u rne q1113960 1113961 (9)

In the above process ni and nk are assumed to exceed 1If ni 0 (nk 0) and nkge 1 (nige 1) BSdik

is set to 0 to reflectthe significant difference between di and dk However ifni 0 and nk 0 namely there is no encrypted flow in bothdi and dk BSdik

is set to 1 because they exhibit the samebehaviors ie they do not produce any encrypted traffic

Similarly we let BSdii 1 and represent these similarity

values as a vector Sibehavior BSdi1

BSdi2 BSdiu

1113966 1113967

55 Clustering of Similarity Values and Malware DetectionFrom the traffic content similarity CSdik

and the traffic be-havior similarity BSdik

we can determine the final similarityvalue BSdik

between the feature sets di and dk as expressed inequation (10) In this equation q is the weight value In ourexperiments we set q 05 namely the traffic contentsimilarity and the traffic behavior similarity are assigned thesame weight is is because apps typically produce almostthe same numbers of HTTP and HTTPS flows [57] We alsoevaluate various weighted values in Section 62 and theexperimental results support that 05 is the most suitablevalue For feature set di a vector Sdi

(Sdi1 Sdi2

Sdiu) is

finally obtained which contains the similarity values be-tween di and the other feature setse set of vector Sdi

1113966 1113967 i

1 2 u will be clustered for the detection of repackagedmalware

Sdik q times CSdik

+(1 minus q) times BSdik (10)

To be specific we use the density peak clustering method[58] to realize our objective Density peak clustering is basedon the following assumption cluster centers are charac-terized by a higher density than their neighbors and by arelatively large distance from points that have higher den-sities Hence density peak clustering can automaticallyidentify the correct number of clusters is is importantbecause we do not know whether there is repackagedmalware or how many types of repackaged malware thereare In other words we cannot determine the number ofclusters in advance Density peak clustering can help usovercome this challenge

We assume that most mobile devices have installed theoriginal apps as discussed in Section 51 erefore afterclustering the cluster that has the largest number of voxelswill be recognized as original and the remaining clusters willbe regarded as suspicious However suspicious apps are notnecessarily repackaged malware For example apps can berepackaged to show ads only Repackaged apps of this typeshould not be simply judged as malware Additionallyoriginal apps can be suspicious due to different user be-haviors To further reduce the frequency of the false alarmswe can check the server hostnames that are contained in therepackaged clusters using VirusTotal (the URLs are includedin the HTTP traffic contents) If all of the serversrsquo hostnamesfor a suspicious cluster Cr are judged as clean it will berecognized as a cluster of abnormal but harmless appsOtherwise it will be identified as repackaged malware Viathis approach we can accurately detect repackaged Androidmalware In this step we only use VirusTotal as a white list toexclude false alarms We do not use it as a backlist (detectionsignatures) to detect repackaged malware Malware detec-tion is still realized by clustering the similarity values

Table 2 Flow statistical featuresBytes sent by the app and bytes sent by the serverNumbers of outgoing and incoming packetsMean and std dev of the outgoing and incoming packet sizesrsquoflow durationMean and std dev of the interpacket time

lt2761 16696 16 12 173 1294 289 1177 3392 121 190gt

Figure 10 Example of the conversion of an encrypted flow to afeature vector

Security and Communication Networks 11

6 Evaluation

61 System Implementation and Dataset Figure 11illustrates our experimental setup e Android phoneswere connected to the edge server via WiFi and each phonehad installed the edge-client app which sends the flowmetainformation to the edge server using Socket (jav-anetSocket) e edge is a Dell PowerEdge R730 server(with 32 processors 16GB RAM and 12 TB hard drives)that is configured as an access point In the edge server weused tcpdump (httpwwwtcpdumporg) to collect networktraffic e similarity calculations of traffic contents andbehaviors and the clustering of similarity values wereimplemented on the Windows Azure cloud computingplatforme URL checking was conducted using Virustotal(httpswwwvirustotalcom)

In the implementation we used aria2c (httpsaria2githubio) to download apps from Androzoo (httpsandrozoounilu) aria2c was encapsulated as commandsin Python code for parallel downloading After downloadingthe apps we automatically installed them on phones usingthe adb shell command Each phone had 40 tested appsinstalled We have implemented the edge-client app basedon Android VPN service In our implementations the edge-client app correlated network flows and apps by resolvingthe ports and PIDs (Process IDs) send the newly collectedflow metainformation to the edge server every 60 secondse edge server used tcpdump to collect network trafficwhich was saved as pcap files To extract traffic features weused Splitcap (httpswwwnetreseccompageSplitCap)to restore tcp flows from pcap files en for HTTP trafficwe extracted the contents using tshark (httpswwwwiresharkorgdocsman-pagestsharkhtml) commands -zfollow tcp and ascii e packet sizes and packet intervaltimes were also obtained by tshark with commands -eframelen ndashe frametime_relative In the cloud CSdik

and BSdik

were calculated using scikit-learn (httpscikit-learnorg)and Numpy (httpsplotlynumpynorm) e code ofdensity peak clustering is available on GitHub (httpsgithubcomjasonwbwDensityPeakCluster) and we havefixed various bugs that were present in the original code

We have downloaded 200 pairs of original andrepackaged apps from Androzoo for testing However notall of these downloaded repackaged apps are maliciousSome apps are only repackaged to show extra ads To screenout Android malware from the downloaded repackagedapps we further checked the 200 repackaged apps usingVirustotal If an app was judged as safe by all the antivirustools it was labelled as repackaged_safe in our experimentsOtherwise it was labelled as repackaged_malware In total143 repackaged apps were detected as malware by VirustotalNevertheless in our experiments we evaluated the proposedclusteringmethod on all 200 repackaged apps to obtainmorerealistic performance results

e downloaded apps were run in 10 rooted Androidphones in parallel to accelerate the experimental process Forall apps we used Droidbot [59] to send input events to themand to generate network traffic automatically We ran eachrepackaged app (repackaged_safe and repackaged_malware

apps) 10 times and the corresponding original app 30 timesto generate sufficient network traffic e running times ofthe original apps exceeded those of the repackaged appsbecause we assume that most of the apps are original asdiscussed in Section 51 Consequently we obtained 40traffic sets for each repackaged-original app pair Note thatfrom the perspectives of the mobile devices and the edgeserver there are only 200 distinct apps because therepackaged app has the same name as the correspondingoriginal app erefore our experiments have simulated40lowast 200 8000 mobile devices For each repackaged-orig-inal app pair there were 30 devices with the original app and10 devices with the repackaged version

In total we have gathered almost 89GB network traffice datasets that are used in our experiments are summa-rized in Table 3 In the table the traffic dataset sizes are thesizes of the captured network traffic and the feature datasetsizes are the sizes of the extracted features In our experi-ments the HTTP traffic contents and the flow statisticalfeatures which are listed in Table 2 were both saved as textfiles As shown in Table 3 the feature datasets are signifi-cantly smaller than the traffic datasets (25G vs 63G and 9G vs26G)us the edge computing can reduce the data size andincrease the data acquisition speed for cloud computing[24] Since the feature dataset sizes are still large (25G+ 9G)we did not send all these feature data to the cloud simul-taneously Instead we sent the feature data one by oneWhen the feature data of an app were successfully receivedby the cloud its traffic similarities were calculated andclustered immediately After that the feature data of anotherapp would be sent and analyzed e experimental resultswill be presented in the following

62 ExperimentalResults In this section wemainly evaluatethe detection efficiency of Android malware by clusteringsimilarity values e performance evaluation of the pro-posed method in terms of power memory and CPU con-sumptions is elaborated in the following Section 63

Similar to previous studies [8 14] to measure theperformance of the proposed method Accuracy and F-measure metrics are utilized which are defined in equations(11) and (12) In equation (11) TP FN FP and TN aredefined in Table 4 In Table 4 repackaged_x representsrepackaged_safe or repackaged_malware e precision andrecall in equation (12) are calculated via equations (13) and(14)

Accuracy TP + TN

TP + FN + FP + TN (11)

F minus measure 2 times precision times recallprecision + recall

(12)

precision TP

TP + FP (13)

recall TP

TP + FN (14)

12 Security and Communication Networks

We have evaluated the Accuracy and F-measure for eachrepackaged app Recall that we ran each repackaged app 10times and the corresponding original app 30 timeserefore for each app the number of the original apptraffic sets is 30 and the number of the repackaged app trafficsets is 10 namely TP + FN 10 and TN+FP 30 for eachapp We have randomly chosen 20 apps from repack-age_malware and the experimental results for these apps arepresented in Figures 12 and 13 Detailed descriptions of thechosen 20 apps are listed in Table 5 In the table the mainfunction of each app is determined by analyzing the packagename from the AndroidManifestxml and from the Googlesearch results of the package name As depicted in Figure 12100 accuracy was realized on a total of 6 apps (app3 app5app9 app10 app13 and app18) e minimum accuracy is85 (app2 3440) and the average accuracy is 952 Fig-ure 13 presents the F-measure values In our experimentsthe maximum F-measure value is 1 and the minimum valueis 0737 e average F-measure value is 0888 ese resultsdemonstrate the satisfactory performance of our proposedclustering method

We have also calculated the average accuracy and F-measure values for all repackaged_safe and repack-aged_malware apps In these calculationsTP + FN 57lowast10 570 and TN+FP 57lowast 301710 for

repackaged_safe apps For repackaged_malware appsTP + FN 143lowast101430 and TN+FP 143lowast 30 4290is is because in our experiments 143 apps were identifiedas repackaged malware and 57 apps as repackaged safe byVirustotal us the parameter values differ e experi-mental results are listed in Table 6

According to Table 6 the average accuracy values forrepackaged_safe and repackaged_malware apps are bothhigh However the average F-measure for repackaged_safeapps is only 078 which is lower than that of repackagedmalware is indicates that repackaged safe apps are

Router Windows Azure

Edge server DellPowerEdge R730

10 android phones

Repackagedoriginal apps

Androzoo

Figure 11 Experimental setup

Table 3 Datasets used in the experiments

of apps of runs of HTTP flows of encrypted flows Traffic dataset sizes Feature dataset sizesOriginal apps 200 30 340015 183452 asymp63G bytes asymp25G bytesRepackaged apps 200 10 110670 93041 asymp26G bytes asymp9G bytes

Table 4 Confusion matrix

ObservedDetected

Repackaged_x app Original appRepackaged_x app TP ( of TPs) FN ( of FNs)Original app FP ( of FPs) TN ( of TNs)

Apps

0

01

02

03

04

05

06

07

08

09

1

Acc

urac

y

App

1A

pp2

App

3A

pp4

App

5A

pp6

App

7A

pp8

App

9A

pp10

App

11A

pp12

App

13A

pp14

App

15A

pp16

App

18A

pp17

App

19A

pp20

Figure 12 Detection accuracies for the 20 selected apps

Security and Communication Networks 13

difficult to distinguish from their original apps is isbecause repackaged_safe apps only generate smallamounts of additional network flows and their networktraffic features are not clearly distinguishable from thoseof normal traffic Fortunately the average F-measure ofrepackaged malware apps is outstanding is may be dueto the fact that malware typically send stolen data to orreceive commands from remote servers Hence theirnetwork traffic features are special and distinguishableFigure 14 presents an example of app clustering and il-lustrates the distinguishability of the repackaged malwareand the original app ese results demonstrate thatrepackaged Android malware can be effectively detectedby comparing their network traffic with that of originalapps

For repackaged malware apps the accuracy results withdifferent weight values (q in equation (10)) are presented inFigure 15 As depicted in the figure when q 05 we realizethe highest accuracy rate 969 If q 0 namely only thefeatures of encrypted traffic are used the accuracy result is753 e result is 617 when q 1 namely only thefeatures of plaintext traffic are considered ese resultsdemonstrate that both the encrypted and plaintext trafficcontribute to the differentiation of the original apps and therepackaged malware However the encrypted traffic con-tributes more than the plaintext traffic is may becausemalicious apps usually connect to their servers by securityprotocols to evade IDS detection

In the previous experiments we checked the serversrsquohostnames using VirusTotal to reduce the frequency of thefalse positives For various clusters that contained less than20 elements if all the serversrsquo hostnames were judged asclean the cluster was identified as an original app rather thana repackaged one As argued in Section 55 this process onlyreduces the frequency of false alarms To support this ar-gument we list the numbers of true positives and falsepositives of the 143 repackaged malware in Table 7 forcomparison When conducting the hostname checking thenumber of true positives is 1400 and the result is the samewhen the hostname checking is turned off is findingsupports that the malware is detected by clustering thesimilarity values However the number of false positivesincreased from 47 to 213 when the hostname checking wasturned off erefore we suggest checking the serversrsquohostnames after clustering to increase the practicality of themethod Note that false positives still occur when we checkthe hostnames by VirusTotal Hence there may be malwarein the original apps In our dataset the typical examples arecommobovideoplayerpro and cncfshop ele1taoxiaosanWe further analyze these two apps by reading theirdecompiled codes and we determine that they indeed es-tablish connections with suspicious servers In detail invarious runs app commobovideoplayerpro connects withhttpwwwtopappsquarecomwhich is judged as maliciousby CyRadar in VirusTotal App cncfshop ele1taoxiaosanvisits suspicious URL http955ccjSn9 Since these apps arethe original apps they are identified as false positives in theexperiments

In our experiments we chose density peak clustering asthe clustering method because density peak clustering canautomatically determine the correct number of clusters [58]Hence our method can detect various versions of malwarethat are repackaged from the same original app Unfortu-nately in the apps that were downloaded from AndroZoono malware was repackaged from the same original app Toevaluate the efficiency of detecting versions of malware wehave created two malware instances from a popular game

Apps

App

1A

pp2

App

3A

pp4

App

5A

pp6

App

7A

pp8

App

9A

pp10

App

11A

pp12

App

13A

pp14

App

15A

pp16

App

18A

pp17

App

19A

pp20

0

01

02

03

04

05

06

07

08

09

1

F-m

easu

re

Figure 13 Detection F-measure for the 20 selected apps

Table 5 Detailed descriptions of the chosen 20 apps

Main function ofapp

of HTTPflows

of encryptedflows

App1 VideoPlayer 453 51App2 Game 657 1455App3 Browser 1531 1478App4 MusicPlayer 341 25App5 Game 612 289App6 Game 788 123App7 Game 1336 524App8 Fitness 357 892App9 Email 122 431App10 Game 963 790App11 MusicPlayer 1128 547App12 Game 1598 426App13 News 2460 1178App14 News 2891 1543App15 Chat 171 49App16 Email 143 368App17 WeatherForecast 397 120App18 Game 1832 921App19 Fitness 732 362App20 OnlineLearning 2378 834

Table 6 Average accuracy and F-measure

Average accuracy () AverageF-measure

Repackaged_safe 903 078Repackaged_malware 969 094

14 Security and Communication Networks

app namely aircomaceviralmotox3 One is produced bygrafting AnserverBotrsquos source code into air-comaceviralmotox3 and the other is created by adding codefor stealing usersrsquo contacts and short messages and sendingthem to a remote server Similarly each malware is run 10times and the original app aircomaceviralmotox3 is alsorun 10 times for validatione clustering result is presentedin Figure 16 Apparently there are 3 clusters e

experimental results show that the value of F-measure foreach malware is 1 Hence our method can detect variousmalware that were repackaged from the same original app

63 Comparison In this section we compare our methodwith the typical methods that are based on network or cloudas illustrated in Figure 5 e comparative indicators are thedetection accuracy F-measure average power CPU andmemory consumptions of the mobile device First wecompare the proposed method with AppFA [14] which isimplemented completely at the network level Since AppFAalso must analyze app traffic we directly use the traffic sets ofthe repackaged malware apps for comparison While testingAppFA we deleted the label information that was containedin the traffic sets and mixed the network flows as a wholeen we used the signature matching and the constrainedK-means clustering algorithm proposed in [14] to restoreapp sessions namely the label for each flow e similarapps used in AppFA were chosen in Google Play via themethod introduced in [14] and for each repackaged mal-ware 10 similar apps were selected as the peer group epeer group apps were run once by Droidbot to generatenetwork traffic e comparison results are listed in Table 8

Since AppFA need not to install some special apps onmobile devices the extra power memory and CPU con-sumptions are 0 However the detection accuracy is muchlower (863 vs 969) than that of the proposedmethod ascompared in Table 8 In the proposed method we use theedge-client app to record flow metainformation thusnetwork flows can be accurately labeled However inAppFA flow labeling is realized via traffic clustering and theclustering accuracy cannot be 100 Meanwhile AppFAcompares the appsrsquo network behaviors with those of theirsimilar apps which also results in errors In the proposedmethod we directly compare the repackaged malware withits original app and the detection accuracy is increased eaverage power CPU and memory consumptions of theproposed method are also low and acceptable

In Table 8 the power memory and CPU consumptionsfor our edge-client app were obtained with the help of appGT (httpsgithubcomTencentGT) GT is a portable toolthat runs on smartphones for debugging the APP internalparameters and the code time-consuming statistics eaverage values for the power memory and CPU con-sumptions are calculated as follows Before running an appwith Droidbot we started the edge-client app by GT Whenthe Droidbot was stopped we recoded the power CPU andmemory consumptions We repeated the process until allapps (including 143 repackaged malware apps and theiroriginal versions) had been tested Via this way we obtainedthe average power CPU and memory consumptions for theedge-client app According to Table 8 the edge-client app islightweight and has minimal impact on the performances ofthe mobile devices

en we compare the proposed method with SAMA-Droid [44] SAMADroid monitors Android system calls andsends the calls to a centralized server for the detection ofAndroid malware SAMADroid is a typical cloud-based on-

Decision graph rho-delta-rho

20 251510 35 403050Sorted rho

00

02

04

06

08

Rho lowast

delt

a

Figure 14 Example of app clustering Obviously two clusters areidentified e dot in the top left corner is an outlier e outlier isdetected as an abnormal but harmless app by our method isfigure was generated by the density peak clustering tool

01 02 03 04 05 06 07 08 09 10Value of q

0

01

02

03

04

05

06

07

08

09

1

Acc

urac

y

Figure 15 Accuracies of various weight values for traffic contentsimilarity and traffic behavior similarity

Table 7 True positives and false positives with and withouthostname checking

of true positives of false positivesWith hostname checking 1400 47Without hostnamechecking 1400 213

Security and Communication Networks 15

device detection method We have implemented the dy-namic detection according to the description in [44] Indetail the APIs that are invoked by the running apps aremonitored by the tool Sensitive API Monitor (httpsgithubcomcyruliuSensitive-API-Monitor) As done in [44] thefrequency values of 10 system calls (open ioctl brk readwrite close sendto sendmsg recvfrom and recvmsg) arerecorded ese recorded frequency values are sent toWindows Azure for malware detection We evaluate thedynamic detection performance of SAMADroid on our 143repackaged_malware apps and their original versions In theevaluation the frequency values of the original apps are usedto create the normality model and the frequency values ofthe repackaged malware apps are used for testing We alsouse the app GT to record the power memory and CPUconsumptions of SAMADroid Note that our phones wereall rooted and Sensitive_API_Monitor and GT can be usede experimental results are presented in Table 9

Last we compare our method with Shabtairsquos method [8]Shabtairsquos method detects Android malware at the mobiledevices and it considers the apprsquos network traffic patternsonly the HTTP traffic contents are not considered isgives us an opportunity to compare our method withmethods that have been specially designed for encryptedtraffic We have implemented Shabtairsquos method and chosenthe feature subset 2 as the detection features Feature subset2 includes Avg Sent Bytes Avg Rcvd Bytes and Pct ofAvg Rcvd Bytes among other values and it is proved to be

the best feature subset for malware detection [8] e De-cisionRegression tree is chosen as the classification algo-rithm e testing dataset is the traffic that was generated bythe 143 repackaged malware instances and the training set isthe traffic that was generated by the corresponding originalapps e classification results are listed in Table 10According to the table the detection accuracy and F-mea-sure values of our method are significantly higher than thoseof Shabtairsquos method is indicates that the HTTP trafficprovides useful information for Android malware detectionSince Shabtairsquos method is conducted at the mobile devices itconsumes more resources than our method as compared inTable 10

7 Discussion

e results of our extensive experiments demonstrate thatrepackaged Android malware can be efficiently and effec-tively detected by clustering network traffic at the edgedevices However there are still limitations for our methodFirst we assume the most of the devices have installed theoriginal apps (rlt (12)u as described in Section 51) is isreasonable in most cases since mobile users typicallydownload apps from the official markets However in areaswhere official markets such as the Google play and Amazonare banned it is possible that the most mobile devices haveinstalled the repackaged apps is may cause false positivesSecond if the malware does not generate sufficient trafficour method cannot detect it For example some malwareonly sends fraudulent premium SMS messages is type ofmalware cannot be detected by monitoring its networktraffic ird we rely on third-party tools such as Virustotalfor the identification of malicious URLs and to reduce thenumber of false alarms If a malicious app rents legalplatforms such as the public cloud as the remote controlserver our method will judge it as a normal app Hence falsenegatives will be generated

To overcome these limitations we could combine theproposed method with available methods such as SAMA-Droid In such cases the edge-client app will not only recordflow metainformation but also monitor app system be-haviors However the system behavior monitoring does notneed to be conducted all the time If abnormal but harmlessor lower traffic apps are detected the edge-client app will beinformed to record the corresponding appsrsquo system be-haviors Similarly the edge-client app can send the appsystem behaviors to the edge server for further processinge performance of this combined method will be evaluatedin our future work One can also combine the proposedframework with AppFA [14] for the detection of other typesof Android malware For example the edge extracts apptraffic behavior features and the cloud can compare appsrsquobehaviors with their peer groups to identify abnormalitiesFurthermore one could first use clustering techniques [60]to group repackaged apps according to the same maliciousmodules and then analyze the behaviors of the representativerepackaged apps in each group to obtain more accuratebehavior features

y

2D nonclassical multidimensional scaling

00ndash02 02ndash04x

ndash04

ndash03

ndash02

ndash01

00

01

02

03

04

Figure 16 Clustering results of various malware that wererepackaged from aircomaceviralmotox3e outlier in the middleis detected as an abnormal but harmless app is figure wasgenerated by the density peak clustering tool

Table 8 Comparison of the proposed method with AppFA

Proposed method AppFAAccuracy 969 863F-measure 094 075Average power consumption 12 0Average CPU consumption 06 0Average memory consumption 654MB 0

16 Security and Communication Networks

e paradigms of edge computing are used to detectAndroid malware in this study However several challengeswould be encountered when deploying it in practice Firstthe mobile devices must install a specified app for thecollection of essential information which may cause con-cerns and mobile users may refuse to cooperate Second inedge computing the data should be preprocessed by the edgeto better protect the usersrsquo privacy However the edge servercan observe all the data (including the sensitive informa-tion) and it threatens the privacy of users For example ifthe edge is deployed by the cloud provider the usersrsquo privacyinformation is also available to the cloud ere is no dif-ference with the user-cloud model for privacy protectionunless the edge server is deployed by the userird the edgeserver should be well protected from network attacksOtherwise it will be convenient for hackers to steal usersrsquoprivate information

8 Conclusion

In this paper we propose a novel method for detectingAndroid malware based on edge computing To the best ofour knowledge the detection of Android malware by uti-lizing edge computing and traffic clustering has not beenconsidered previously By introducing the edge layer themain tasks of the mobile device are offloaded to the edgeserver and the mobile device only needs to record flowmetainformation which substantially conserves the re-sources of the mobile device e edge server extracts appsrsquonetwork traffic features and sends these features to the cloudplatform for further analysis us the usersrsquo privacy can bebetter protected With the received features the cloud cal-culates the similarities between apps and clusters thesesimilarity values to separate the original apps and themalware automatically We evaluated our methods on 400Android apps and the experimental results demonstratethat the detection accuracy can reach 100 for several appsand the average accuracy is 969 We also tested theperformance of the proposed method and compared it with

the typical approaches e comparison results show thatour method can detect repackaged Android malware withboth higher detection accuracy and lower power and re-source consumptions In the future we will continue to workto overcome the discussed limitations and to evaluate theproposed method in real world scenarios

Data Availability

e downloaded apps used to support the findings of thisstudy have been deposited in httpspanbaiducoms1b8vMceIdtdd1gBRC5Gbzvw Its extraction code is vd5re clustering code is also in the repository Since appnetwork traffic contains usersrsquo sensitive information itcannot be publicly shared However we have explained allthe steps to generate network traffic automatically esesteps are also described in the instructionspdf in therepository

Conflicts of Interest

e authors declare that they have no conflicts of interest

Acknowledgments

is work was supported by National Key TechnologiesResearch and Development Program of China under Grant2018YFD0401404 National Natural Science Foundation ofChina under Grants 61702282 61802192 and 71801123Natural Science Foundation of the Jiangsu Higher EducationInstitutions of China under Grants 18KJB520024 and17KJB520023 Nanjing Forestry University under GrantsGXL016 and CX2016026 and NUPTSF under GrantNY217143

References

[1] Statista Mobile Operating Systemsrsquo Market Share Worldwidefrom January 2012 to December 2019 Statista HamburgGermany 2009 httpswwwstatistacomstatistics272698global-market-share-held-by-mobile-operating-systems-since-2009

[2] W Martin F Sarro Y Jia Y Zhang and M Harman ldquoAsurvey of app store analysis for software engineeringrdquo IEEETransactions on Software Engineering vol 43 no 9pp 817ndash847 2017

[3] httpswwwappbraincomstatsnumber-of-android-apps2020

[4] L Li D Li T F Bissyande et al ldquoUnderstanding android apppiggybacking a systematic study of malicious code graftingrdquoIEEE Transactions on Information Forensics and Securityvol 12 no 6 pp 1269ndash1284 2017

[5] M Fan J Liu X Luo et al ldquoAndroid malware familialclassification and representative sample selection via frequentsubgraph analysisrdquo IEEE Transactions on Information Fo-rensics and Security vol 13 no 8 pp 1890ndash1905 2018

[6] J Zhang Z Qin K Zhang H Yin and J Zou ldquoDalvik opcodegraph based android malware variants detection using globaltopology featuresrdquo IEEE Access vol 6 pp 51964ndash51974 2018

[7] S Garg S K Peddoju and A K Sarje ldquoNetwork-baseddetection of android malicious appsrdquo International Journal ofInformation Security vol 16 no 4 pp 385ndash400 2016

Table 9 Comparison of the proposed method with SAMADroid

Proposed method SAMADroidAccuracy 969 831F-measure 0940 0793Average power consumption 12 23Average CPU consumption 06 08Average memory consumption 654MB 773MB

Table 10 Comparison of the proposed method with Shabtairsquosmethod

Proposed method Shabtairsquosmethod

Accuracy 969 771F-measure 0940 0695Average power consumption 12 77Average CPU consumption 06 23Average memory consumption 654MB 937MB

Security and Communication Networks 17

[8] A Shabtai L Tenenboim-Chekina D Mimran L RokachB Shapira and Y Elovici ldquoMobile malware detection throughanalysis of deviations in application network behaviorrdquoComputers amp Security vol 43 pp 1ndash18 2014

[9] L Xie X Zhang J-P Seifert and S Zhu ldquopBMDS a be-havior-based malware detection system for cellphone de-vicesrdquo in Proceedings of the ird ACM Conference onWireless Network Security ACM New York NY USApp 37ndash48 2010

[10] I Burguera U Zurutuza and S Nadjm-Tehrani ldquoCrowdroidbehavior-based malware detection system for androidrdquo inProceedings of the 1st ACMWorkshop on Security and Privacyin Smartphones and Mobile Devices ACM Chicago IL USApp 15ndash26 October 2011

[11] G Dini F Martinelli A Saracino and D SgandurraldquoMADAM a multi-level anomaly detector for android mal-warerdquo in Proceedings of the International Conference onMathematical Methods Models and Architectures for Com-puter Network Security vol 12 Springer St PetersburgRussia pp 240ndash253 2012

[12] Y Zhang M Yang B Xu et al ldquoVetting undesirable be-haviors in android apps with permission use analysisrdquo inProceedings of the 2013 ACM SIGSAC Conference on Com-puter amp Communications Security ACM Berlin Germanypp 611ndash622 November 2013

[13] SWang Z Chen X Li LWang K Ji and C Zhao ldquoAndroidmalware clustering analysis on network-level behaviorrdquo inProceedings of the International Conference on IntelligentComputing Springer Liverpool UK pp 796ndash807 August2017

[14] G He B Xu and H Zhu ldquoAppFA a novel approach to detectmalicious android applications on the networkrdquo in Securityand Communication Networks Wiley Hoboken NJ USA2018

[15] X Wang Y Yang and Y Zeng ldquoAccurate mobile malwaredetection and classification in the cloudrdquo SpringerPlus vol 4no 1 p 583 2015

[16] K Tam A Feizollah N B Anuar R Salleh and L Cavallaroldquoe evolution of android malware and android analysistechniquesrdquo ACM Computing Surveys (CSUR) vol 49 no 4p 76 2017

[17] M Sun X Li J C S Lui R T B Ma and Z Liang ldquoMonet auser-oriented behavior-based malware variants detectionsystem for androidrdquo IEEE Transactions on Information Fo-rensics and Security vol 12 no 5 pp 1103ndash1112 2017

[18] T Vidas and N Christin ldquoEvading android runtime analysisvia sandbox detectionrdquo in Proceedings of the 9th ACMSymposium on Information Computer and CommunicationsSecurity pp 447ndash458 Kyoto Japan June 2014

[19] Y Duan M Zhang A V Bhaskar et al ldquoings you may notknow about android (Un) packers a systematic study basedon whole-system emulationrdquo in Proceedings of the 2018Network and Distributed System Security Symposium pp 1ndash15 San Diego CA USA February 2018

[20] L Xue X Luo L Yu S Wang and D Wu ldquoAdaptiveunpacking of android appsrdquo in Proceedings of the 2017 IEEEACM 39th International Conference on Software Engineering(ICSE) IEEE Buenos Aires Argentina pp 358ndash369 May2017

[21] Y Zhou and X Jiang ldquoDissecting android malware char-acterization and evolutionrdquo in Proceedings of the 2012 IEEESymposium on Security and Privacy IEEE San Francisco CAUSA pp 95ndash109 2012

[22] N W Lo S K Lu and Y H Chuang ldquoA framework for thirdparty android marketplaces to identify repackaged appsrdquo inProceedings of the 2016 IEEE 14th International Conference onDependable Autonomic and Secure Computing 14th Inter-national Conference on Pervasive Intelligence and Computing2nd International Conference on Big Data Intelligence andComputing and Cyber Science and Technology Congresspp 475ndash482 Auckland New Zealand August 2016

[23] L Li J Gao M Hurier et al ldquoAndrozoo++ collectingmillions of android apps and their metadata for the researchcommunityrdquo 2017 httpsarxivorgabs170905281

[24] W Shi J Cao Q Zhang Y Li and L Xu ldquoEdge computingvision and challengesrdquo IEEE Internet of ings Journal vol 3no 5 pp 637ndash646 2016

[25] W Zhou Y Zhou X Jiang and P Ning ldquoDetectingrepackaged smart-phone applications in third-party androidmarketplacesrdquo in Proceedings of the Second ACM Conferenceon Data and Application Security and Privacy ACM AntonioTX USA pp 317ndash326 February 2012

[26] Y Shao X Luo C Qian P Zhu and L Zhang ldquoTowards ascalable resource-driven approach for detecting repackagedandroid applicationsrdquo in Proceedings of the 30th AnnualComputer Security Applications Conference ACM NewOrleans LA USA pp 56ndash65 December 2014

[27] K Chen P Wang Y Lee et al ldquoFinding unknown malice in10 seconds mass vetting for new threats at the google-playscalerdquo in Proceedings of the USENIX Security Symposiumvol 15 Washington DC USA August 2015

[28] Z Wang C Li Y Guan and Y Xue ldquoDroidchain a novelmalware detection method for android based on behaviorchainrdquo in Proceedings of the 2015 IEEE Conference onCommunications and Network Security (CNS) IEEE Flor-ence Italy pp 727-728 September 2015

[29] K Tian D D Yao B G Ryder G Tan and G PengldquoDetection of repackaged android malware with code-het-erogeneity featuresrdquo IEEE Transactions on Dependable andSecure Computing vol 17 no 1 pp 64ndash77 2017

[30] A De Lorenzo F Martinelli E Medvet F Mercaldo andA Santone ldquoVisualizing the outcome of dynamic analysis ofandroid malware with vizmalrdquo Journal of Information Se-curity and Applications vol 50 Article ID 102423 2020

[31] M Lin D Zhang X Su and T Yu ldquoEffective and scalablerepackaged application detection based on user interfacerdquo inProceedings of the 2017 IEEE Conference on ComputerCommunications Workshops (INFOCOM WKSHPS) IEEEAtlanta GA USA May 2017

[32] G Meng M Patrick Y Xue Y Liu and J Zhang ldquoSecuringandroid app markets via modelling and predicting malwarespread between marketsrdquo IEEE Transactions on InformationForensics and Security vol 14 no 7 pp 1944ndash1959 2018

[33] A Arora and S K Peddoju ldquoMinimizing network trafficfeatures for android mobile malware detectionrdquo in Proceed-ings of the 18th International Conference on DistributedComputing and Networking ACM Hyderabad India January2017

[34] X Wu D Zhang X Su and W Li ldquoDetect repackagedAndroid application based on HTTP traffic similarity httptraffic similarityrdquo Security and Communication Networksvol 8 no 13 pp 2257ndash2266 2015

[35] J Malik and R Kaushal ldquoCredroid android malware de-tection by network traffic analysisrdquo in Proceedings of the 1stACM Workshop on Privacy-Aware Mobile Computing ACMPaderborn Germany pp 28ndash36 July 2016

18 Security and Communication Networks

[36] A Zulkifli I R A Hamid W M Shah and Z AbdullahldquoAndroid malware detection based on network traffic usingdecision tree algorithmrdquo in Proceedings of the InternationalConference on Soft Computing and Data Mining SpringerSenai Malaysia pp 485ndash494 January 2018

[37] Z Chen Q Yan H Han et al ldquoMachine learning basedmobile malware detection using highly imbalanced networktrafficrdquo Information Sciences vol 433-434 pp 346ndash364 2018

[38] G He B Xu L Zhang and H Zhu ldquoMobile app identifi-cation for encrypted network flows by traffic correlationrdquoInternational Journal of Distributed Sensor Networks vol 14no 12 pp 1ndash17 2018

[39] L Xue Y Zhou T Chen X Luo and G Gu ldquoMalton to-wards on-device non-invasive mobile malware analysis forARTrdquo in Proceedings of the 26th USENIX Security Symposiumpp 289ndash306 Vancouver Canada August 2017

[40] M K Alzaylaee S Y Yerima and S Sezer ldquoDL-Droid deeplearning based android malware detection using real devicesrdquoComputers amp Security vol 89 Article ID 101663 2020

[41] A S Shamili C Bauckhage and T Alpcan ldquoMalware de-tection on mobile devices using distributed machine learn-ingrdquo in Proceedings of the 20th International Conference onPattern Recognition (ICPR) IEEE Istanbul Turkeypp 4348ndash4351 August 2010

[42] M Zhao T Zhang F Ge and Z Yuan ldquoRobotdroid alightweight malware detection framework on smartphonesrdquoJournal of Networks vol 7 no 4 p 715 2012

[43] K A Talha D I Alper and C Aydin ldquoAPK auditor per-mission-based android malware detection systemrdquo DigitalInvestigation vol 13 pp 1ndash14 2015

[44] S Arshad M A Shah A Wahid A Mehmood H Song andH Yu ldquoSamadroid a novel 3-level hybrid malware detectionmodel for android operating systemrdquo IEEE Access vol 6pp 4321ndash4339 2018

[45] G He L Zhang B Xu and H Zhu ldquoDetecting repackagedandroid malware based on mobile edge computingrdquo inProceedings of the 2018 Sixth International Conference onAdvanced Cloud and Big Data (CBD) IEEE Lanzhou Chinapp 360ndash365 August 2018

[46] R Roman J Lopez andMMambo ldquoMobile edge computingFog et al a survey and analysis of security threats andchallengesrdquo Future Generation Computer Systems vol 78pp 680ndash698 2018

[47] F Bonomi R Milito J Zhu and S Addepalli ldquoFog com-puting and its role in the internet of thingsrdquo in Proceedings ofthe First Edition of the MCC Workshop on Mobile CloudComputing ACM Helsinki Finland pp 13ndash16 August 2012

[48] M T Beck MWerner S Feld and S Schimper ldquoMobile edgecomputing a taxonomyrdquo in Proceedings of the Sixth Inter-national Conference on Advances in Future Internet pp 48ndash55 Lisbon Portugal November 2014

[49] P Mach and Z Becvar ldquoMobile edge computing a survey onarchitecture and computation offloadingrdquo IEEE Communi-cations Surveys amp Tutorials vol 19 no 3 pp 1628ndash1656 2017

[50] H T Dinh C Lee D Niyato and P Wang ldquoA survey ofmobile cloud computing architecture applications and ap-proachesrdquo Wireless Communications and Mobile Computingvol 13 no 18 pp 1587ndash1611 2013

[51] X Lyu H Tian L Jiang et al ldquoSelective offloading in mobileedge computing for the green internet of thingsrdquo IEEENetwork vol 32 no 1 pp 54ndash60 2018

[52] S Agrawal and J Agrawal ldquoSurvey on anomaly detectionusing data mining techniquesrdquo Procedia Computer Sciencevol 60 pp 708ndash713 2015

[53] A Tongaonkar ldquoA look at the mobile app identificationlandscaperdquo IEEE Internet Computing vol 20 no 4 pp 9ndash152016

[54] G He B Xu and H Zhu ldquoIdentifying mobile applications forencrypted network trafficrdquo in Proceedings of the 2017 FifthInternational Conference on Advanced Cloud and Big Data(CBD) IEEE Shanghai China pp 279ndash284 August 2017

[55] G Ranjan A Tongaonkar and R Torres ldquoApproximatematching of persistent lexicon using search-engines forclassifying mobile app trafficrdquo in Proceedings of the 35thAnnual IEEE International Conference on Computer Com-munications IIEEE San Francisco CA USA April 2016

[56] L Li and S Qu ldquoShort text classification based on improvedITCrdquo Journal of Computer and Communications vol 1 no 4pp 22ndash27 2013

[57] A H Lashkari A F A Kadir L Taheri and A A GhorbanildquoToward developing a systematic approach to generatebenchmark android malware datasets and classificationrdquo inProceedings of the 2018 International Carnahan Conference onSecurity Technology (ICCST) IEEE Montreal Canada Oc-tober 2018

[58] A Rodriguez and A Laio ldquoClustering by fast search and findof density peaksrdquo Science vol 344 no 6191 pp 1492ndash14962014

[59] Y Li Z Yang Y Guo and X Chen ldquoDroidbot a lightweightUI-guided test input generator for androidrdquo in Proceedings ofthe Software Engineering Companion (ICSE-C) pp 23ndash26Buenos Aires Argentina May 2017

[60] M Fan X Luo J Liu et al ldquoGraph embedding based familialanalysis of android malware using unsupervised learningrdquo inProceedings of the 2019 IEEEACM 41st International Con-ference on Software Engineering (ICSE) IEEE Piscataway NJUSA pp 771ndash782 May 2019

Security and Communication Networks 19

Page 2: On-DeviceDetectionofRepackagedAndroidMalwarevia …downloads.hindawi.com/journals/scn/2020/8630748.pdf · 2020-05-31 · According to [4–6], most Android malware programs are repackagedapps,andthemajorityofnewAndroidmalware

According to [4ndash6] most Android malware programs arerepackaged apps and the majority of new Android malwaresamples are polymorphic variants of known malwareRepackaged Android malware is becoming a substantialthreat to Android security

Researchers have extensively investigated the detectionof (repackaged) Android malware and several methods havebeen proposed and evaluated [7ndash14] Previous works can begenerally divided into off-device and on-device detectionOff-device detection is typically adopted by security ana-lyzers who analyze APK files at a dedicated analysis servere analyzers can check the APKs with signatures or dy-namically run them in a real or virtual execution environ-ment to record their behaviors [15] If malicious code orbehaviors are identified the apps will be judged as malwareOff-device detection can also be conducted via networkmonitoring [14] We will refer to off-device detectionmethods of this type as network-based methods in thefollowing

On-device detection is implemented partially or com-pletely locally namely at usersrsquo mobile devices e rec-ognition of malware from the installed apps is attemptedTechnically on-device detection also identifies malware byanalyzing the behaviors of the apps Compared to off-devicedetection on-device detection can detect malware that isrunning on mobile usersrsquo devices and identify more so-phisticated malicious apps such as those which are triggeredby specified events For example by observing the deviationsof app network behaviors at the mobile device one candetect self-updating malware that may be triggered by aspecified geo-location device type or OS version [8]Nevertheless on-device detection is more challenging due tothe limited resources of mobile devices

Typically several steps should be conducted to detectmalicious Android apps at mobile devices First detectionfeatures such as the userrsquos operating behaviors API usagesand application network behaviors should be defined andextracted en based on the extracted features detectionmodels are constructed and finally new apps are comparedwith the constructed models for malware detectionDepending on where these steps are performed currentapproaches can be categorized into two main groups client-side detection and server-side detection [14] For client-sidedetection these steps are all conducted at the mobile deviceswhile for server-side detection the main steps (modelconstruction and malware detection) are conducted onremote servers

In practice the application of the client-side approachescould be restricted because they consume substantialamounts of resources and power According to reference [8]the storage of 10 Android malware detection models re-quires 7272 kB plusmn 15 memory consumption and the learningof these models requires 13 plusmn 15 CPU consumptionWorse the moremodels that are trained and saved the moreresources and power are consumed erefore the server-side methods are typically adopted [16] However the re-source and power consumptions of the server-side methodsremain high (eg 7 performance overhead and approxi-mately 3 battery overhead [17]) is is because the

detection features are still mainly collected and processed atthe mobile devices [10 17]

In this paper we propose a novel method for detectingrepackaged Android malware based on clustering appsrsquotraffic in the edge computing platform Although Androidmalware can use emulatorsandbox detection [18] andpacking [19 20] to impede dynamic and static analysis it isdifficult for them to hide the traffic to be inspectede basicstrategy of our method is that the feature extraction andpreprocessing can be offloaded to the edge us theworkloads of mobile devices are reduced e designedarchitecture for Android malware detection consists of threecomponents mobile devices edge servers and the cloude mobile devices are used normally and an additionaledge-client app will be installed and run in the backgrounde edge-client app operates identically to an Android VPNapp such as Packet Capture (httpsplaygooglecomstoreappsdetailsidappgreyshirtssslcaptureamphlen US) ex-cept it only records the flow metainformation such as theflow starting time and flow protocol and it does not need tosave and manipulate the packet contents Hence it islightweight and imposes a negligible burden on the mobiledevices

e edge servers collect mobile network traffic and extractdetection features from the traffic e traffic preprocessingthat is conducted by the edge servers not only reduces theamount of data to be sent but also protects the mobile usersrsquoprivacy because the cloud can only observe the coarse-grainedfeatures e extracted network traffic features are sent to thecloud for final processing In the cloud similarities betweenapps are calculated based on the received features After thatthese similarity values are automatically clustered to separateoriginal apps and repackaged malware We conducted anextensive set of experiments and the results demonstratedthat the average detection accuracy was 969

Our main contributions are as follows

(i) A lightweight and efficient framework that is basedon edge computing for the detection of Androidmalware is proposed Compared to the typicalserver-side approaches the proposed frameworkwill have minimal impact on the performances ofthe mobile devices In addition compared to thenetwork-based methods the detection accuracy canbe increased and the privacy of mobile users can bebetter protected

(ii) A traffic clustering method for the identification ofrepackaged Android malware is proposed Wecluster the network traffic that is generated by thesame apps to determine which one have beenrepackaged First we use the plaintext contents andflow statistical features to calculate the similaritiesbetween the apps en the similarity values areclustered to separate the original apps and therepackagedmalware automatically Via clustering itis unnecessary tomodel the appsrsquo network behaviorsor extract matching signatures for malware detec-tion in advance Hence this approach will be moreefficient in practice

2 Security and Communication Networks

(iii) Extensive experiments are conducted on publicdatasets

e remainder of this paper is organized as followse preliminaries of Android malware and edge com-puting are introduced in Section 2 In Section 3 wediscuss relevant related work Section 4 describes the newintegration architecture of the cloud edge and mobiledevices for Android malware detection e methodologyis presented in Section 5 An experimental evaluation ofour method in terms of accuracy and performance ispresented in Section 6 We discuss the limitations of theproposed method in Section 7 Finally in Section 8 wepresent the conclusions of the paper and discuss potentialfuture work

2 Preliminaries

In this section we introduce important concepts regardingAndroid malware and edge computing which form the basisof this work

21 Android Malware As discussed in Section 1 Androidapps (APKs) are easy to repackage which is convenient forattackers who would like to create Android malware In factmost Android malware is produced by repackaging benignapps especially favorite apps to ensure a wide diffusion ofthe malicious code [4] As evidence MalGenome [21] whichis a reference dataset in the Android security community80 of the malicious samples were built via repackagingother apps Repackaged apps can be distributed throughthird-party markets which do not typically requirescreening or integrity evaluation of the uploaded apps [22]For example many of the apps in F-Droid are found to berepackaged by the same developer [23] Consequently wedetect mainly malicious repackaged Android apps in thisstudy

Repackaged Android malware can be efficiently detectedby analyzing its network traffic [7 14] is is because therepackaged apps are typically functionalized to interact withremote servers to receive commands or return sensitive dataand their network behaviors will differ from the originalnetwork behaviors To better explain this we decompile twoapps namely 00575DE650F78413C91E8E613ADF9099812F325AA9CF36C72E81C79B230F58F4 and A55EA5807203CE31470D1905ED30F9267AE3459809DDA25C4F339197DE6486FE and compare their corresponding folders toidentify the newly added files e added files are furtheranalyzed (by reading the Java code) to investigate theirnetwork behaviors In this example the apprsquos name is theSHA256 value of the APK file and 00575lowast corresponds tothe original app and A55EAlowast is the repackaged app Weused Jadx to decompile APKs and the file comparison isconducted with WinMerge (httpwinmergeorg) Fig-ure 1 presents the comparison results

As shown in Figure 1 there are a total of 250 additionalJava code files and these code files are distributed in thesourcescom folder We read these added Java codes care-fully and determine that Android API HttpPost is invoked 8

times and HttpGet is invoked 5 times Hence there are atleast 13 flows that would never be produced by the originalapp namely 00575lowast erefore the original app can beefficiently distinguished from the repackaged version bycomparing their network traffic Inspired by these obser-vations we propose clustering the mobile appsrsquo traffic toseparate original apps and repackaged malwareautomatically

22 Edge Computing Edge computing can be defined interms of the technologies that enable computations to beperformed at the edge of the network on downstreamdata on behalf of cloud services and on upstream data onbehalf of Internet of ings (IoT) services [24] Perhapsthe most important concept of edge computing is that theedge can consist of any computing and network resourcesalong the path between mobileIoT devices and clouddata centers For example the edge could be a phonebetween a smart bracelet and the cloud or a wirelessrouter between the smart TV and the cloud if the phoneor the wireless router performs computations on behalfof the smart devices or the cloud In these examples thecomputations could be data anonymization for privacyprotection or content caching for performance im-provement Figure 2 illustrates the paradigm of edgecomputing

As illustrated in Figure 2 cloud computing-like capa-bilities ie edge servers are deployed at the edge of thenetwork for data (pre) processing and forwarding In edgecomputing the mobileIoT devices could send computationtasks to the edge servers namely the devices can distributecode to the edge to be executed hence the mobileIoTdevices can be designed to conduct complicated operationssuch as machine learning or encryption Similarly the cloudmay push services to the edge servers and the edge can bedeployed and updated online In most cases the devicesshould be connected to the edge and the cloud

Figure 1 Folders and files that were only present in repackagedapp A55EAlowast as indicated by left only in the third column

Security and Communication Networks 3

simultaneously as represented by the solid and dotted linesin Figure 2is is because not all data must be preprocessedby the edge servers and more importantly the robustness ofthe entire system can be improved If the edge breaks downthe cloud can continue to provide services In this study weobey the edge computing paradigm that is illustrated inFigure 2 in the design of the Android malware detectionsystem

3 Related Work

Repackaged Android apps are one of the major sources ofmobile malware and extensive studies have been conductedon Android malware detection In this section we will re-view the most closely related studies e typical studies ofedge computing are also reviewed to provide more detailedbackground information

31 Off-Device Detection Repackaged Android apps andmalware can be efficiently detected via code comparison ofAPKs DroidMoss [25] focuses on app code and conductspairwise similarity comparisons for the detection ofrepackaged apps in Android markets e detection resultsdemonstrate that 5 to 13 of apps that are hosted on third-party marketplaces were repackaged However the pairwisecomparison approach cannot be scaled up because millionsof Android apps are now available in markets To resolve thisissue Shao et al [26] proposed ResDroid for the detection ofrepackaged apps in Android markets via a divide-and-conquer strategy First they built a k-d tree and insertedclusters that contained one or more apps into the k-d treeen the apps were compared with their neighbors in the k-d tree and the number of comparisons was reduced As-suming that the apps from the official market were benignMassvet [27] downloaded all apps from Android markets inadvance and constructed v-core (a geometric center of a viewgraph) andm-core (mapping the features of a Java method toan index) databases for repackaged Android malware de-tection Both databases were used to vet new apps that weresubmitted to the market New apps would be reported asrepackaged malware if their v-cores or m-cores were similarto records in the databases and the code was not legitimatelyreused

In addition to code comparison behavior-based andUI-based detection methods are also emerging Droid-Chain [28] constructs a behavior chain model that iscomposed of typical behavior processes of Android apps

for the detection of malware In DroidChain the typicalbehaviors include privacy leakage SMS financial charg-ing malware installation and privilege escalationAccording to Tian et al [29] repackaged malware wasdifficult to detect partly because of their behavioralsimilarities to benign apps To overcome this problemthey proposed a new repackaged Android malware de-tection technique that was based on code heterogeneityanalysis e code structure of an app was partitioned intomultiple dependence-based regions and each region wasindependently classified according to its behavioral fea-tures De Lorenzo et al [30] proposed VizMal for visu-alizing the execution traces of Android applications andhighlighting potentially malicious behaviors Lin et al[31] proposed the detection of repackaged Android appsbased on static UI features ey detected a total of 3723repackaged app pairs in the Anzhi market and 15856repackaged pairs in the Mi market ese results dem-onstrate that repackaged malware still poses a severethreat to the security of mobile users even though varioustypes of detection methods have been proposed andadopted by Android markets [32]

Recently several network-based approaches have beenproposed for the detection of repackaged apps and malwareArora and Peddoju [33] demonstrated the effectiveness ofusing network traffic to detect Android malware Wu et al[34] divided app HTTP traffic into 2 categories primarymodule traffic and nonprimary module traffic For primarymodule traffic the HTTP flow distance algorithm and theHungarian Method were used to calculate the traffic simi-larity and identify similar app pairs CREDROID [35]identified malicious apps on the basis of Domain NameServer (DNS) queries and the data that are transmitted toremote servers Zulkifli et al [36] proposed a method fordetecting Android malware that is based on seven networktraffic features and the J48 decision tree algorithm Chenet al [37] introduced the imbalanced data gravitation-basedclassification algorithm for the classification of imbalanceddata of malicious apps He et al [14 38] proposed theidentification of encrypted appsrsquo flows via traffic correlationand the detection of repackaged Android apps via com-parison of the network behaviors of similar apps In thesemethods the detection is conducted mostly in the router orat the network monitoring node hence the performances ofmobile devices are not affected However the network be-haviors of malware must be modeled in advance which ischallenging in practice

Edge servers

Datacomputations

Data

Data

Data

DataservicesIoT devices

Cloud

Figure 2 Paradigm of edge computing

4 Security and Communication Networks

32On-DeviceDetection e detection of Android malwarecan also be conducted at usersrsquo mobile devices [39]Alzaylaee et al [40] extracted features from real devices andused a deep learning system to detect malicious Androidapps Shamili et al [41] proposed a distributed SupportVector Machine (SVM) algorithm for the detection ofAndroid malware on a network of mobile devices eirdetection features include short duration calls mediumduration calls and the number of outgoing SMS (ShortMessage Service) among others e experimental resultsdemonstrated that the average computation time per clientduring the training phase is nearly 10 seconds Zhao et al[42] implemented a malware detection framework namelyRobotDroid using an SVM active learning algorithmRobotDroid monitors all the running apps to identify theirrunning characteristics ese characteristics are later inputinto learning modules to generate the behavioral charac-teristics for the classification of normal apps and malwareAccording to their performance evaluation the time ofaccessing the GPS is increased from 80ms to almost 120mswhen the detection system is executed Hence the perfor-mance of a smartphone that utilizes this framework wouldbe significantly degraded Shabtai et al [8] monitored theappsrsquo network behaviors at the mobile devices to detectrepackaged Android malware and the detection accuracyexceeds 80 however the performance overhead of theirmethod is not negligible For example it requires7272 kB plusmn 15 memory consumption and 13 plusmn 15 CPUconsumption for learning 10 Android malware detectionmodels

To further improve the performance of on-device de-tection Crowdroid [10] sends the detection features to acentralized server for processing Crowdroid only recordsthe system calls as the detection features and it is light-weight Talha et al [43] proposed APK Auditor which is asystem that uses permissions as static analysis features forAndroid malware detection APK Auditor consists of threecomponents (1) a signature database (2) an Android clientand (3) a central server e Android client sends the app tothe central server for analysis e central server extracts thepermissions that are requested by the app and computes thepermission malware score based on the permissions inmalware If the score exceeds a threshold value the app isclassified as malware e results demonstrate that APKAuditor achieves 925 specificity however it lacks thebenefits of dynamic analysis as it cannot detect dynamicmalicious payloads

Monet [17] also uses backend servers and a client app todetect Androidmalwaree client app constructs a directedgraph of the app components and the system componentsand it records the potentially dangerous system calls as thedetection features ese features are forwarded to thebackend server which conducts further detection by ap-plying a signature matching algorithm and returns thedetection results e experimental results demonstrate thatMonet can detect Android malware with 99 accuracyHowever the performance overhead of Monet is high Asshown in [17] Monet had 8 overhead on memory and IObenchmarks and 55 overhead for battery Most recently

Arshad et al [44] proposed SAMADroid which is a 3-levelhybrid model for the detection of Android malware Sim-ilarly SAMADroid has a client app and an analysis servere client app hooks the Strace tool and traces the systemcalls that are invoked by the user app If the user app remainsrunning on the device the client app continues tracing thesystem calls of the user app and generates a log file of thesystem calls Later the log file is sent to the SAMADroidserver with the identifier of the user app At the server thesame user app is downloaded from stores and staticallyanalyzed en the static features and the dynamic featuresthat are extracted from the log file are embedded in a vectorspace Finally the features are provided as input to themachine learning tool for classification eir experimentalresults demonstrate that random forest yields 9907 ac-curacy on the static features and 8276 on the dynamicfeatures e performance overhead of SAMADroid is 18for memory and 06 for CPU In previous work [45] weused mobile edge computing to detect malicious apps Weextend it to the edge computing paradigm and refine theclustering method in this study

33 Edge Computing Research e vision and challenges ofedge computing are introduced in the literature [24] ebasic strategy of edge computing is to deploy cloud com-puting at the edge of the network to improve the perfor-mance and create new applications for IoT [46] Howeverthe deployment location of the edge and the functionalitiesof the edge are controversialerefore in applications edgecomputing can be implemented as fog computing [47]mobile edge computing [48 49] and mobile cloud com-puting [50] In the fog computing paradigm the analysis oflocal information is conducted on the ldquogroundrdquo and thecoordination and global analytics are conducted in theldquocloudrdquo One can argue that all objects that are outside thecloud constitute the ground thus fog computing is almostthe same as edge computing

Mobile edge computing deploys cloud services at theedge of mobile networks such as 5G Namely the edge isdeployed in close proximity to the eNodeB which loops thetraffic through the mobile edge computing servers for fur-ther data processing us in mobile edge computing theedge is fixed and the edge servers can be provided bytelecommunication companies Mobile cloud computingfocuses mainly on mobile delegation mobile devices shoulddelegate the storage of bulk data and the execution ofcomputationally intensive tasks to remote entities A remoteentity can be a centralized cloud or another mobile deviceerefore the edge can consist of mobile devices in mobilecloud computing Hence one of the most active areas ofresearch in the field of mobile cloud computing is thedelegation of tasks to external services especially to othermobile devices Task delegation is also essential for edgecomputing [51]

In this study we provide a lightweight and efficientframework for the detection of repackaged Androidmalwarethat is based on edge computing In the proposed frame-work the mobile devices need only to record the flow

Security and Communication Networks 5

starting time flow protocol server-side IP address server-side port app name and version for each network flowLater these data are sent to the edge servers for traffictagging Since the amounts of data that are saved and sent aresmall the performances of mobile devices are not signifi-cantly degraded e edge servers extract detection featuresfrom the labeled app network traffic and send the extractedfeatures to the cloud to be clustered After clustering theoriginal and repackaged apps can be efficiently distin-guished In our study only network traffic is considered inAndroid malware detection In contrast to previous net-work-based methods such as [14 34ndash36] we do not need tomodel appsrsquo network behaviors in advance and our methodis more useful in practice

4 Framework for Android Malware Detection

e on-device detection of Android malware is challengingbecause mobile devices typically have limited resources Toalleviate the storage and computing limitations and prolongthe lifetimes of the mobile devices in this study we proposea new framework for on-device detection of repackagedAndroid malware e proposed framework is illustrated inFigure 3

Our framework is composed of three layers mobiledevices the edge and the cloud First we explain thepossible locations of the edge later we introduce thefunctionalities of each component In our framework theonly requirement for the edge is that it can monitor andprocess the mobile network traffice mobile devices couldconnect to the Internet via WiFi or 3G4G hence the edgenodes can be wireless gateways household security gatewayssuch as BitDefender Box and LTE eNodeBs e edge nodescan also be routers that are located in the backbone networkif they can observe the traffic that is generated by the mobiledevices erefore the deployment locations of the edgenodes are flexible in practice Figure 4 illustrates the possibledeployment scenarios of the edge

In the proposed framework the mobile devices mustcollect and send flow metaformation to the edge for datapreprocessing For this an edge-client app is designed to berun on themobile devicese edge-client app operates as anAndroid VPN app Its main objective is to collect andtransfer the flow metainformation which includes the appname flow starting time and flow protocol to the edgeiteratively With the flow metainformation the edge serverpreprocesses the network traffic by adding a label (the appname) to each flow After that it extracts detection featuresfrom these labeled flows e traffic labeling and featureextraction are illustrated in detail in Section 52 Once thefeature extraction has been completed the edge server sendsthese features to the cloud With the received features thecloud conducts the malware detection tasks and notifies themobile devices of the detection results

To further illustrate the advantages of the proposedframework we compare it with the available frameworksSince the proposed framework is based on edge computingwe will refer to it as the edge-computing-based framework inthe following Similarly the available frameworks are termed

as the cloud-based framework (the server-side detectionmethods that were described in Section 1) and the network-based framework (the network-based methods that wereintroduced in Sections 1 and 3) for differentiation Com-pared to the popular cloud-based framework (Figure 5(a))the most prominent advantage of the edge-computing-basedframework is that the feature extraction process is offloadedto the edge and the mobile devices must only save andtransmit the flow metainformation us the proposededge-computing-based framework is resource-friendly formobile devices

e network-based framework (Figure 5(b)) monitorsnetwork traffic without any interference withmobile devicesHowever it is less accurate since the network nodes cannotknow the origins of the network flows precisely namely itcannot identify app traffic with 100 accuracy As dem-onstrated in the literature [14] since the network nodecannot identify the app for each flow (eg some flows lackssignatures or are encrypted) the appsrsquo network behaviors areincomplete and the malware detection accuracy is reducedIn the proposed edge-computing-based framework wedesign a lightweight edge-client app for gathering app in-formation Furthermore mobile usersrsquo privacy can be betterprotected in our proposed framework By deploying onersquosown edge server a mobile user can control the types ofinformation that can be used as detection features Incontrast in the network-based framework the networknodes can observe the full traffic content and extract anyinformation they want which severely threatens the privacyof mobile users

Android malwaredetectionCloud

Edge

Mobiledevices

(1) Flowmetainformation

(2) Extractedfeatures

(3) Detectionresults

Network connectionData transmission

Figure 3 Framework for android malware on-device detectionthat is based on edge computing Mobile devices send flow met-ainformation such as the app name and the server-side IP addressto the edge Edge servers capture app network traffic and extractHTTP contents and flow statistical features from the capturedtraffic Later the features are sent to the cloud for malware de-tection e cloud also interacts with the mobile devices to returnthe detection results

6 Security and Communication Networks

After introducing the general architecture we outline thedetection methods that can be used in the edge-computing-based framework e cloud can use various methods todetect Android malware For example it can use conven-tional signature matching [35] and network behavioranalysis [8 14] to identify malicious apps However thesetypes of methods require a priori information (such assignatures or behavior models) In this paper we propose anovel repackaged malware detection method that clustersthe network traffic that is generated by the same app fromvarious mobile devices Figure 6 presents an example of thisprocess

e motivations behind the traffic clustering are two-fold first most Android malware is produced by repack-aging popular apps [4] and popular apps are often widelydistributed Second the network traffic generated by therepackaged malware differs from that generated by theoriginal app (as illustrated in Section 21) As illustrated inFigure 3 many mobile devices would connect to the cloudfor malware detection Hence the cloud can observe thenetwork traffic features of both the original apps and theirrepackaged versions with large probabilities due to the widedistribution of popular apps and the large number of mobiledevices If we regard the features of the original apps as thenormal data and the features of the repackaged versions asthe abnormal data the repackaged malware detectionproblem is transformed into the abnormal detection

problem Clustering is an efficient method for solving thisproblem [52] By clustering appsrsquo network traffic features weneed not model appsrsquo network behaviors in advance andmore importantly we can detect various repackaged ver-sions of the original app as illustrated in Figure 6e detailsof the proposed malware detection method will be describedin detail in Sections 53ndash55

5 Methodology

In this section we introduce the technical details of theproposed clustering method for Android malware detectionTable 1 defines the main notation that we use in this paper

(a) (b) (c)

Figure 4 Possible deployments of the edge e edge could be the wireless gateways or the dedicated servers that are connected to theeNodeB or the routers (a) Wireless gateway (b) eNodeB+ server (c) Router + server

Clustering

Traffic features of an app collected from

different mobile devicesOriginal apps

Repackaged apps

Figure 6 Illustration of traffic clustering for repackaged malwaredetection e plain circles represent original apps and the circleswith triangles and inverted triangles are repackaged ones

Network connectionData transmission

Android malwaredetectionCloud

Mobiledevices

(1) Extractedfeatures

(2) Detectionresults

(a)

Network connectionData transmission

Network node

Mobiledevices

(1) Detectionresults

(b)

Figure 5 Available frameworks for android malware detection Various approaches such as [10] send the extracted features to a remoteserver other than to the cloud We still regard them as following the cloud-based framework because there is no essential difference formalware detection As illustrated in (b) the detection results are typically presented to the network administrator and are not fed back to themobile devices (a) Framework based on cloud (b) Framework based on network

Security and Communication Networks 7

51 System Overview Figure 7 illustrates the main proce-dure of the proposed approach App i is the app for vettingand it has been installed on umobile devicesese u devicesare connected to e edge servers We assume that r deviceshave installed the repackaged malware (the malware is alsopresented as App i to deceive the users) and rlt (12)uHence most of the devices have installed the original ver-sion is situation is reasonable because repackaged An-droidmalware is typically distributed in third-party marketswhich are less popular than official markets [23] us thenumbers of downloads and installations are relatively smallWe will use this characteristic to classify the clusteringresults

To detect the repackaged malware the edge server labelsthe network flows that are generated by App i according tothe flow metainformation en it extracts suitable trafficfeatures (such as traffic content and statistics) from thelabeled flows and sends the features to the cloud for pro-cessing erefore there are u feature sets for App i in thecloud e cloud filters these u sets according to the appversions and removes common traffic After that it calculatesthe pairwise similarities of the feature sets Finally it clustersthe similarity values and identifies r repackaged malwareinstances

e traffic labeling feature extraction and filtering(labels①② and③ in Figure 7) the similarity calculationon the traffic contents and behaviors (labels④ and⑤) andclustering of similarity values (label⑥) are the core elementsof the proposed method ey will be described in detail inthe following sections

52 Traffic Labeling Filtering and Feature Extraction

521 Traffic Labeling In this study traffic labeling isconducted to identify the corresponding app for each net-work flow which is known as the app identification problem

[53] Extensive works have been conducted on the identi-fication of apps from mobile network traffic [38 54 55]However the achieved identification accuracies are all lowerthan 100 In our method we design an edge-client app forcollecting flow metaformation and identifying apps accu-rately e flow metainformation is defined in the following

Definition 1 (flow metainformation) For flow f the flowmetainformation includes the client-side IP address C-IPfthe flow starting time Tf the flow protocol Pf the server-sideIP address S-IPf the server-side port S-Portf the app nameAppNf and the app version AppVfe flow f is bidirectionaland Pf is the transport layer protocol such as TCP or UDP

e edge-client app operates similarly to a VPN appeapp sends the flow metainformation to the edge server at afixed time interval Ti Once it has received the flow meta-information the edge server saves it into a metainformationdatabase and can label mobile app traffic accurately Denotethe flow observed by the edge server as fu e edge serverlabels fu as follows

Step 1 extract the recording time Tuf of fu and its flow

protocol Puf client-side IP address CminusIPu

f server-sideIP address SminusIPu

f and server-side port SminusPortufStep 2 search the metainformation database with Pu

fCminusIPu

fSminusIPuf and SminusPortuf If records match the da-

tabase returns the record that has the most recent flowstarting time Tr

f Otherwise the edge server waits for awhile and repeats Step 2Step 3 if Tu

f is close to Trf the edge server labels flow fu

as ltAppNf AppVfgtOtherwise it waits for a while andrepeats Steps 2 and 3

In the above steps the client-side IP address CminusIPuf is

used to identify a mobile device uniquely which is correctwhen the edge server is the wireless gateway or the eNodeBIn such scenarios the mobile devices directly connect to theedge hence the IP address is unique for each deviceHowever if the edge server is deployed at the backbonerouter as illustrated in subgraph (c) of Figure 4 the CminusIPu

f

may be confusing because mobile devices could be locatedbehind NATs (Network Address Translations) and they willshare the same public IP address To solve this issue theedge-client app can insert special indicators into the packetsto uniquely represent a device In this study we alwaysassume that CminusIPu

f is unique and we will investigate theimplementation of the device-indicator in future work

In steps 2 and 3 the edge server utilizes a wait-and-repeat strategy to ensure that the traffic labeling is fresh andaccurate e edge server must wait because the mobiledevices send flow metainformation to the edge server at afixed time interval Ti (60 s in our experiments) and the savedflow metainformation in the database may be outdated Inour implementations the waiting time is set to Ti and if|Tu

f minus Trf|gt 24 h the returned flow metainformation is

regarded as obsolete

522 Traffic Feature Extraction After traffic labeling theedge server extracts detection features from the labeled

Table 1 Main notation used in this paper

Notation Meaningi Android app iu Number of mobile devices with app i installedr Number of devices with repackaged app i

f Network flowC-IPf Client-side IP address of fS-IPf Server-side IP address of fS-Portf Server-side port of fAppNf Name of the app that generates fAppVf Version of the app that generates fTi Set time intervalTu

f Recording time of flow fu at the edge serverdi Feature set of app iwj Plaintext word in the feature setsV(di) Numerical vector of traffic contentsF

j

diFeature vector for the encrypted flow j in di

TBntimesm(di) Traffic behaviors of diCSdik

Content similarity between di and dkBSdik

Behavior similarity between di and dkSdik

Final similarity between di and dk

8 Security and Communication Networks

traffic e extracted traffic features are categorized into twogroups traffic content features and traffic behavior featuresTraffic content features are the plaintext contents that areextracted from HTTP flows and traffic behavior features arethe statistics of network flows such as packet sizes andintervals Since Android apps may use an encrypted HTTPSprotocol to transmit data [54] traffic behavior features alsomust be considered e traffic content and traffic behaviorfeatures are described in detail in Sections 53 and 54

523 Traffic Filtering After traffic feature extraction theedge servers send the extracted features along with theclient-side IP address server-side IP addresses server-sideports app name AppNf and app version AppVf to the cloudNote that App i is installed on u devices hence the cloudeventually has u feature sets for App i Figure 8 illustrates thestructure of the feature set e cloud further filters these ufeature sets to increase the detection accuracy First it di-vides the u sets into v categories according to the app versionAppVf erefore each category contains the feature setsthat are generated by the same version of App i en foreach category the cloud removes the common traffic fromthe feature sets e common traffic is defined as network

flows that are contained in all features sets ese flows havethe same server-side IP address and the same server-sideport number e reason for this step is that the commontraffic will obfuscate the similarity calculations betweenapps Otherwise the similarity values between features setsare all high and they cannot be effectively used to distin-guish the repackaged malware from the original apps

After the filtering has been completed the similaritiesbetween feature sets are calculated and the similarity valuesare clustered to identify the repackaged malware

Client-side IP addressApp name AppN_fApp version AppV_fFlow 1 server-side IP address 1 server-side port 1 flow features

Flow f server-side IP address f server-side port f flow features

Figure 8 Illustration of the feature set e set contains the clientIP address app name and app version Each network flow that isgenerated by App i in a device is represented by the featuresextracted from the flow and by the server IP address and port

Trafficlabeling

Edge server 1

Traffic featureextraction

Similarity calculation of

trafficbehaviors

Similarity calculation of traffic contents Clustering of

similarity values

Original app

Repackaged malware

Cloud

Mobile device 1 2 (with app i installed)

Results

Traffic features

Traffic and flowmetainformation

ę

ę

Edge server e

Traffic features

Traffic and flow metainformation

Filtering

Mobile device u(with app i installed)

1

3

4

5

6

2

Figure 7emain procedure of the proposedmethod for Androidmalware detectione core parts of the proposedmethod are labeled bycircled numbers

Security and Communication Networks 9

53 Similarity Calculation of Traffic Content Android appsuse mainly HTTP and HTTPS protocols to communicatewith their remote servers [54] For HTTP traffic since it isplaintext based we can efficiently use the flow content tocalculate the similarity e basic strategy is to convertHTTP traffic to document files and the document files canbe handled by information retrieval technologies for simi-larity calculation HTTP flow content has also been used inapp identification from network traffic [55] However in thisstudy we only consider partial HTTP header information tobetter protect mobile usersrsquo privacy

In detail first we reassemble HTTP flows from packetsand extract the flow contents by saving the correspondingpacket contents into a text file as ASCII code Figure 9 showssuch an example en we preprocess the text file by de-leting the HTTP request content and the HTTP requestheader except for the GET (or POST) and HOST fields eHTTP response content is also deleted namely only theHTTP response header the GET (or POST) and the HOSTfiled in the HTTP request header are saved Via this ap-proach the mobile usersrsquo privacy can be protected from thecloud (for example the mobile phone information (HUA-WEI) is leaked by the User-Agent field in Figure 9) Afterthat the text file is spilt using special characters (eg ) and the processed file becomes the flow contentfeatures

e similarity is calculated via the improved TF-IDFalgorithm [56] and the cosine similarity e term frequencymeasures the count of words in a document We use theaugmented frequency measurement Denote the document(the feature set) as di and the word as wj Term frequencytf(wj di) is calculated as

tf wj di1113872 1113873 05 +05 times f wj di1113872 1113873

max f w di( 1113857 w isin di1113864 1113865 (1)

where f(wj di) is the number of occurrences of wj indocument di

e inverse document frequency measures of theamount of information a word provides Typically it is thelogarithmically scaled inverse fraction of the number ofdocuments that contain the word e inverse documentfrequency of wj is calculated as

idf wj1113872 1113873 logu

d wj isin d1113966 111396711138681113868111386811138681113868

11138681113868111386811138681113868+ 001⎛⎝ ⎞⎠ (2)

where u is the number of documents and | d wj isin d1113966 1113967| isnumber of documents in which the term wj appears

Denote by p(wj di) the product of TF(wj di) andI DF(wj) it is formulated as

p wj di1113872 1113873 log tf wj di1113872 11138731113872 1113873 times idf wj1113872 1113873

1113936

cj0 log

2 tf wj di1113872 11138731113872 1113873 times idf2 wj1113872 11138731113969 (3)

After the TF-IDF operation all documents are convertedinto equal-length numerical vectors We use the cosinesimilarity to measure the similarity between two documentvectors For documents di and dk the corresponding

numerical vectors are denoted as V(di) and V(dk) and thelength of vector is t e similarity between di and dk iscalculated as

CSdik

1113936tl1 V di( 1113857l times V dk( 1113857l( 1113857

1113936tl1 V di( 1113857

2l

1113969

times

1113936tl1 V dk( 1113857

2l

1113969 (4)

Suppose that in category vi (we divide the u feature setsgenerated by App i in to v categories in the traffic filteringstep) there are c feature sets For each feature set we save theflow features of all HTTP flows into a text file and c text filesare obtained en we calculate the numerical vector foreach feature set via equation (3) Last we compare thenumerical vector with each other as equation (4) and gain csimilarity values (CSdii

1) for each feature set We rep-resent these similarity values of feature set di (i 1 2 c)as a vector Si

contents and Sicontents (CSdi1

CSdi2 CSdic

)e vectors Si

contents (i 1 2 c) in combination with thesimilarity values of traffic behaviors will be clustered toidentify the abnormal data points namely the repackagedmalware

54 SimilarityCalculation of TrafficBehaviors In addition tocalculating the similarity values of traffic contents thesimilarity of traffic behaviors must be calculated due toencrypted flows We use m (m 11 in our study) statisticalfeatures as listed in Table 2 to represent the behavior of anencrypted network flow e units of the flow duration andthe packet interval time are milliseconds in our experiments

With the flow statistical features for feature set di the jthencrypted flow can be represented as a vector in the featurespace

Fj

di f

1j f

2j f

mj1113872 1113873 (5)

where flj represents the value of the lth statistical feature

1le llem Figure 10 shows an example of feature represen-tation for encrypted traffic Consequently the traffic be-haviors of di can be represented as a matrix of f

pj

Figure 9 Example of converting an HTTP flow to a text file

10 Security and Communication Networks

TBntimesm di( 1113857 F1di

F2di

Fndi

1113872 1113873T (6)

where n is the number of encrypted flows that are containedin di and nge 1

Since different feature sets may differ in terms of thenumber of encrypted flows traffic behaviors TBnitimesm(di) andTBnktimesm(dk) may differ in terms of different dimensionnamely the value of ni should not be equal to that of nk Toefficiently calculate the similarity between TBnitimesm(di) andTBnktimesm(dk) we calculate the Frobenius norm of TBnitimesm(di)

and TBnktimesm(dk) e formula is presented in equation (7)where norm(middot) normalizes the value of fl

j to [0 1]

FN(d)

1113944

n

j11113944

m

l1norm fl

j1113872 111387311138681113868111386811138681113868

111386811138681113868111386811138682

11139741113972

(7)

en the distance between TBnitimesm(di) and TBnktimesm(dk) iscalculated as

dis di dk( 1113857 FN di( 1113857 minus FN dk( 11138571113868111386811138681113868

1113868111386811138681113868 (8)

Finally the similarity between TBnitimesm(di) andTBnktimesm(dk) can be calculated by

BSdik 1 minus

dis di dk( 1113857

max dis dr dq1113872 1113873 r q 1 2 u rne q1113960 1113961 (9)

In the above process ni and nk are assumed to exceed 1If ni 0 (nk 0) and nkge 1 (nige 1) BSdik

is set to 0 to reflectthe significant difference between di and dk However ifni 0 and nk 0 namely there is no encrypted flow in bothdi and dk BSdik

is set to 1 because they exhibit the samebehaviors ie they do not produce any encrypted traffic

Similarly we let BSdii 1 and represent these similarity

values as a vector Sibehavior BSdi1

BSdi2 BSdiu

1113966 1113967

55 Clustering of Similarity Values and Malware DetectionFrom the traffic content similarity CSdik

and the traffic be-havior similarity BSdik

we can determine the final similarityvalue BSdik

between the feature sets di and dk as expressed inequation (10) In this equation q is the weight value In ourexperiments we set q 05 namely the traffic contentsimilarity and the traffic behavior similarity are assigned thesame weight is is because apps typically produce almostthe same numbers of HTTP and HTTPS flows [57] We alsoevaluate various weighted values in Section 62 and theexperimental results support that 05 is the most suitablevalue For feature set di a vector Sdi

(Sdi1 Sdi2

Sdiu) is

finally obtained which contains the similarity values be-tween di and the other feature setse set of vector Sdi

1113966 1113967 i

1 2 u will be clustered for the detection of repackagedmalware

Sdik q times CSdik

+(1 minus q) times BSdik (10)

To be specific we use the density peak clustering method[58] to realize our objective Density peak clustering is basedon the following assumption cluster centers are charac-terized by a higher density than their neighbors and by arelatively large distance from points that have higher den-sities Hence density peak clustering can automaticallyidentify the correct number of clusters is is importantbecause we do not know whether there is repackagedmalware or how many types of repackaged malware thereare In other words we cannot determine the number ofclusters in advance Density peak clustering can help usovercome this challenge

We assume that most mobile devices have installed theoriginal apps as discussed in Section 51 erefore afterclustering the cluster that has the largest number of voxelswill be recognized as original and the remaining clusters willbe regarded as suspicious However suspicious apps are notnecessarily repackaged malware For example apps can berepackaged to show ads only Repackaged apps of this typeshould not be simply judged as malware Additionallyoriginal apps can be suspicious due to different user be-haviors To further reduce the frequency of the false alarmswe can check the server hostnames that are contained in therepackaged clusters using VirusTotal (the URLs are includedin the HTTP traffic contents) If all of the serversrsquo hostnamesfor a suspicious cluster Cr are judged as clean it will berecognized as a cluster of abnormal but harmless appsOtherwise it will be identified as repackaged malware Viathis approach we can accurately detect repackaged Androidmalware In this step we only use VirusTotal as a white list toexclude false alarms We do not use it as a backlist (detectionsignatures) to detect repackaged malware Malware detec-tion is still realized by clustering the similarity values

Table 2 Flow statistical featuresBytes sent by the app and bytes sent by the serverNumbers of outgoing and incoming packetsMean and std dev of the outgoing and incoming packet sizesrsquoflow durationMean and std dev of the interpacket time

lt2761 16696 16 12 173 1294 289 1177 3392 121 190gt

Figure 10 Example of the conversion of an encrypted flow to afeature vector

Security and Communication Networks 11

6 Evaluation

61 System Implementation and Dataset Figure 11illustrates our experimental setup e Android phoneswere connected to the edge server via WiFi and each phonehad installed the edge-client app which sends the flowmetainformation to the edge server using Socket (jav-anetSocket) e edge is a Dell PowerEdge R730 server(with 32 processors 16GB RAM and 12 TB hard drives)that is configured as an access point In the edge server weused tcpdump (httpwwwtcpdumporg) to collect networktraffic e similarity calculations of traffic contents andbehaviors and the clustering of similarity values wereimplemented on the Windows Azure cloud computingplatforme URL checking was conducted using Virustotal(httpswwwvirustotalcom)

In the implementation we used aria2c (httpsaria2githubio) to download apps from Androzoo (httpsandrozoounilu) aria2c was encapsulated as commandsin Python code for parallel downloading After downloadingthe apps we automatically installed them on phones usingthe adb shell command Each phone had 40 tested appsinstalled We have implemented the edge-client app basedon Android VPN service In our implementations the edge-client app correlated network flows and apps by resolvingthe ports and PIDs (Process IDs) send the newly collectedflow metainformation to the edge server every 60 secondse edge server used tcpdump to collect network trafficwhich was saved as pcap files To extract traffic features weused Splitcap (httpswwwnetreseccompageSplitCap)to restore tcp flows from pcap files en for HTTP trafficwe extracted the contents using tshark (httpswwwwiresharkorgdocsman-pagestsharkhtml) commands -zfollow tcp and ascii e packet sizes and packet intervaltimes were also obtained by tshark with commands -eframelen ndashe frametime_relative In the cloud CSdik

and BSdik

were calculated using scikit-learn (httpscikit-learnorg)and Numpy (httpsplotlynumpynorm) e code ofdensity peak clustering is available on GitHub (httpsgithubcomjasonwbwDensityPeakCluster) and we havefixed various bugs that were present in the original code

We have downloaded 200 pairs of original andrepackaged apps from Androzoo for testing However notall of these downloaded repackaged apps are maliciousSome apps are only repackaged to show extra ads To screenout Android malware from the downloaded repackagedapps we further checked the 200 repackaged apps usingVirustotal If an app was judged as safe by all the antivirustools it was labelled as repackaged_safe in our experimentsOtherwise it was labelled as repackaged_malware In total143 repackaged apps were detected as malware by VirustotalNevertheless in our experiments we evaluated the proposedclusteringmethod on all 200 repackaged apps to obtainmorerealistic performance results

e downloaded apps were run in 10 rooted Androidphones in parallel to accelerate the experimental process Forall apps we used Droidbot [59] to send input events to themand to generate network traffic automatically We ran eachrepackaged app (repackaged_safe and repackaged_malware

apps) 10 times and the corresponding original app 30 timesto generate sufficient network traffic e running times ofthe original apps exceeded those of the repackaged appsbecause we assume that most of the apps are original asdiscussed in Section 51 Consequently we obtained 40traffic sets for each repackaged-original app pair Note thatfrom the perspectives of the mobile devices and the edgeserver there are only 200 distinct apps because therepackaged app has the same name as the correspondingoriginal app erefore our experiments have simulated40lowast 200 8000 mobile devices For each repackaged-orig-inal app pair there were 30 devices with the original app and10 devices with the repackaged version

In total we have gathered almost 89GB network traffice datasets that are used in our experiments are summa-rized in Table 3 In the table the traffic dataset sizes are thesizes of the captured network traffic and the feature datasetsizes are the sizes of the extracted features In our experi-ments the HTTP traffic contents and the flow statisticalfeatures which are listed in Table 2 were both saved as textfiles As shown in Table 3 the feature datasets are signifi-cantly smaller than the traffic datasets (25G vs 63G and 9G vs26G)us the edge computing can reduce the data size andincrease the data acquisition speed for cloud computing[24] Since the feature dataset sizes are still large (25G+ 9G)we did not send all these feature data to the cloud simul-taneously Instead we sent the feature data one by oneWhen the feature data of an app were successfully receivedby the cloud its traffic similarities were calculated andclustered immediately After that the feature data of anotherapp would be sent and analyzed e experimental resultswill be presented in the following

62 ExperimentalResults In this section wemainly evaluatethe detection efficiency of Android malware by clusteringsimilarity values e performance evaluation of the pro-posed method in terms of power memory and CPU con-sumptions is elaborated in the following Section 63

Similar to previous studies [8 14] to measure theperformance of the proposed method Accuracy and F-measure metrics are utilized which are defined in equations(11) and (12) In equation (11) TP FN FP and TN aredefined in Table 4 In Table 4 repackaged_x representsrepackaged_safe or repackaged_malware e precision andrecall in equation (12) are calculated via equations (13) and(14)

Accuracy TP + TN

TP + FN + FP + TN (11)

F minus measure 2 times precision times recallprecision + recall

(12)

precision TP

TP + FP (13)

recall TP

TP + FN (14)

12 Security and Communication Networks

We have evaluated the Accuracy and F-measure for eachrepackaged app Recall that we ran each repackaged app 10times and the corresponding original app 30 timeserefore for each app the number of the original apptraffic sets is 30 and the number of the repackaged app trafficsets is 10 namely TP + FN 10 and TN+FP 30 for eachapp We have randomly chosen 20 apps from repack-age_malware and the experimental results for these apps arepresented in Figures 12 and 13 Detailed descriptions of thechosen 20 apps are listed in Table 5 In the table the mainfunction of each app is determined by analyzing the packagename from the AndroidManifestxml and from the Googlesearch results of the package name As depicted in Figure 12100 accuracy was realized on a total of 6 apps (app3 app5app9 app10 app13 and app18) e minimum accuracy is85 (app2 3440) and the average accuracy is 952 Fig-ure 13 presents the F-measure values In our experimentsthe maximum F-measure value is 1 and the minimum valueis 0737 e average F-measure value is 0888 ese resultsdemonstrate the satisfactory performance of our proposedclustering method

We have also calculated the average accuracy and F-measure values for all repackaged_safe and repack-aged_malware apps In these calculationsTP + FN 57lowast10 570 and TN+FP 57lowast 301710 for

repackaged_safe apps For repackaged_malware appsTP + FN 143lowast101430 and TN+FP 143lowast 30 4290is is because in our experiments 143 apps were identifiedas repackaged malware and 57 apps as repackaged safe byVirustotal us the parameter values differ e experi-mental results are listed in Table 6

According to Table 6 the average accuracy values forrepackaged_safe and repackaged_malware apps are bothhigh However the average F-measure for repackaged_safeapps is only 078 which is lower than that of repackagedmalware is indicates that repackaged safe apps are

Router Windows Azure

Edge server DellPowerEdge R730

10 android phones

Repackagedoriginal apps

Androzoo

Figure 11 Experimental setup

Table 3 Datasets used in the experiments

of apps of runs of HTTP flows of encrypted flows Traffic dataset sizes Feature dataset sizesOriginal apps 200 30 340015 183452 asymp63G bytes asymp25G bytesRepackaged apps 200 10 110670 93041 asymp26G bytes asymp9G bytes

Table 4 Confusion matrix

ObservedDetected

Repackaged_x app Original appRepackaged_x app TP ( of TPs) FN ( of FNs)Original app FP ( of FPs) TN ( of TNs)

Apps

0

01

02

03

04

05

06

07

08

09

1

Acc

urac

y

App

1A

pp2

App

3A

pp4

App

5A

pp6

App

7A

pp8

App

9A

pp10

App

11A

pp12

App

13A

pp14

App

15A

pp16

App

18A

pp17

App

19A

pp20

Figure 12 Detection accuracies for the 20 selected apps

Security and Communication Networks 13

difficult to distinguish from their original apps is isbecause repackaged_safe apps only generate smallamounts of additional network flows and their networktraffic features are not clearly distinguishable from thoseof normal traffic Fortunately the average F-measure ofrepackaged malware apps is outstanding is may be dueto the fact that malware typically send stolen data to orreceive commands from remote servers Hence theirnetwork traffic features are special and distinguishableFigure 14 presents an example of app clustering and il-lustrates the distinguishability of the repackaged malwareand the original app ese results demonstrate thatrepackaged Android malware can be effectively detectedby comparing their network traffic with that of originalapps

For repackaged malware apps the accuracy results withdifferent weight values (q in equation (10)) are presented inFigure 15 As depicted in the figure when q 05 we realizethe highest accuracy rate 969 If q 0 namely only thefeatures of encrypted traffic are used the accuracy result is753 e result is 617 when q 1 namely only thefeatures of plaintext traffic are considered ese resultsdemonstrate that both the encrypted and plaintext trafficcontribute to the differentiation of the original apps and therepackaged malware However the encrypted traffic con-tributes more than the plaintext traffic is may becausemalicious apps usually connect to their servers by securityprotocols to evade IDS detection

In the previous experiments we checked the serversrsquohostnames using VirusTotal to reduce the frequency of thefalse positives For various clusters that contained less than20 elements if all the serversrsquo hostnames were judged asclean the cluster was identified as an original app rather thana repackaged one As argued in Section 55 this process onlyreduces the frequency of false alarms To support this ar-gument we list the numbers of true positives and falsepositives of the 143 repackaged malware in Table 7 forcomparison When conducting the hostname checking thenumber of true positives is 1400 and the result is the samewhen the hostname checking is turned off is findingsupports that the malware is detected by clustering thesimilarity values However the number of false positivesincreased from 47 to 213 when the hostname checking wasturned off erefore we suggest checking the serversrsquohostnames after clustering to increase the practicality of themethod Note that false positives still occur when we checkthe hostnames by VirusTotal Hence there may be malwarein the original apps In our dataset the typical examples arecommobovideoplayerpro and cncfshop ele1taoxiaosanWe further analyze these two apps by reading theirdecompiled codes and we determine that they indeed es-tablish connections with suspicious servers In detail invarious runs app commobovideoplayerpro connects withhttpwwwtopappsquarecomwhich is judged as maliciousby CyRadar in VirusTotal App cncfshop ele1taoxiaosanvisits suspicious URL http955ccjSn9 Since these apps arethe original apps they are identified as false positives in theexperiments

In our experiments we chose density peak clustering asthe clustering method because density peak clustering canautomatically determine the correct number of clusters [58]Hence our method can detect various versions of malwarethat are repackaged from the same original app Unfortu-nately in the apps that were downloaded from AndroZoono malware was repackaged from the same original app Toevaluate the efficiency of detecting versions of malware wehave created two malware instances from a popular game

Apps

App

1A

pp2

App

3A

pp4

App

5A

pp6

App

7A

pp8

App

9A

pp10

App

11A

pp12

App

13A

pp14

App

15A

pp16

App

18A

pp17

App

19A

pp20

0

01

02

03

04

05

06

07

08

09

1

F-m

easu

re

Figure 13 Detection F-measure for the 20 selected apps

Table 5 Detailed descriptions of the chosen 20 apps

Main function ofapp

of HTTPflows

of encryptedflows

App1 VideoPlayer 453 51App2 Game 657 1455App3 Browser 1531 1478App4 MusicPlayer 341 25App5 Game 612 289App6 Game 788 123App7 Game 1336 524App8 Fitness 357 892App9 Email 122 431App10 Game 963 790App11 MusicPlayer 1128 547App12 Game 1598 426App13 News 2460 1178App14 News 2891 1543App15 Chat 171 49App16 Email 143 368App17 WeatherForecast 397 120App18 Game 1832 921App19 Fitness 732 362App20 OnlineLearning 2378 834

Table 6 Average accuracy and F-measure

Average accuracy () AverageF-measure

Repackaged_safe 903 078Repackaged_malware 969 094

14 Security and Communication Networks

app namely aircomaceviralmotox3 One is produced bygrafting AnserverBotrsquos source code into air-comaceviralmotox3 and the other is created by adding codefor stealing usersrsquo contacts and short messages and sendingthem to a remote server Similarly each malware is run 10times and the original app aircomaceviralmotox3 is alsorun 10 times for validatione clustering result is presentedin Figure 16 Apparently there are 3 clusters e

experimental results show that the value of F-measure foreach malware is 1 Hence our method can detect variousmalware that were repackaged from the same original app

63 Comparison In this section we compare our methodwith the typical methods that are based on network or cloudas illustrated in Figure 5 e comparative indicators are thedetection accuracy F-measure average power CPU andmemory consumptions of the mobile device First wecompare the proposed method with AppFA [14] which isimplemented completely at the network level Since AppFAalso must analyze app traffic we directly use the traffic sets ofthe repackaged malware apps for comparison While testingAppFA we deleted the label information that was containedin the traffic sets and mixed the network flows as a wholeen we used the signature matching and the constrainedK-means clustering algorithm proposed in [14] to restoreapp sessions namely the label for each flow e similarapps used in AppFA were chosen in Google Play via themethod introduced in [14] and for each repackaged mal-ware 10 similar apps were selected as the peer group epeer group apps were run once by Droidbot to generatenetwork traffic e comparison results are listed in Table 8

Since AppFA need not to install some special apps onmobile devices the extra power memory and CPU con-sumptions are 0 However the detection accuracy is muchlower (863 vs 969) than that of the proposedmethod ascompared in Table 8 In the proposed method we use theedge-client app to record flow metainformation thusnetwork flows can be accurately labeled However inAppFA flow labeling is realized via traffic clustering and theclustering accuracy cannot be 100 Meanwhile AppFAcompares the appsrsquo network behaviors with those of theirsimilar apps which also results in errors In the proposedmethod we directly compare the repackaged malware withits original app and the detection accuracy is increased eaverage power CPU and memory consumptions of theproposed method are also low and acceptable

In Table 8 the power memory and CPU consumptionsfor our edge-client app were obtained with the help of appGT (httpsgithubcomTencentGT) GT is a portable toolthat runs on smartphones for debugging the APP internalparameters and the code time-consuming statistics eaverage values for the power memory and CPU con-sumptions are calculated as follows Before running an appwith Droidbot we started the edge-client app by GT Whenthe Droidbot was stopped we recoded the power CPU andmemory consumptions We repeated the process until allapps (including 143 repackaged malware apps and theiroriginal versions) had been tested Via this way we obtainedthe average power CPU and memory consumptions for theedge-client app According to Table 8 the edge-client app islightweight and has minimal impact on the performances ofthe mobile devices

en we compare the proposed method with SAMA-Droid [44] SAMADroid monitors Android system calls andsends the calls to a centralized server for the detection ofAndroid malware SAMADroid is a typical cloud-based on-

Decision graph rho-delta-rho

20 251510 35 403050Sorted rho

00

02

04

06

08

Rho lowast

delt

a

Figure 14 Example of app clustering Obviously two clusters areidentified e dot in the top left corner is an outlier e outlier isdetected as an abnormal but harmless app by our method isfigure was generated by the density peak clustering tool

01 02 03 04 05 06 07 08 09 10Value of q

0

01

02

03

04

05

06

07

08

09

1

Acc

urac

y

Figure 15 Accuracies of various weight values for traffic contentsimilarity and traffic behavior similarity

Table 7 True positives and false positives with and withouthostname checking

of true positives of false positivesWith hostname checking 1400 47Without hostnamechecking 1400 213

Security and Communication Networks 15

device detection method We have implemented the dy-namic detection according to the description in [44] Indetail the APIs that are invoked by the running apps aremonitored by the tool Sensitive API Monitor (httpsgithubcomcyruliuSensitive-API-Monitor) As done in [44] thefrequency values of 10 system calls (open ioctl brk readwrite close sendto sendmsg recvfrom and recvmsg) arerecorded ese recorded frequency values are sent toWindows Azure for malware detection We evaluate thedynamic detection performance of SAMADroid on our 143repackaged_malware apps and their original versions In theevaluation the frequency values of the original apps are usedto create the normality model and the frequency values ofthe repackaged malware apps are used for testing We alsouse the app GT to record the power memory and CPUconsumptions of SAMADroid Note that our phones wereall rooted and Sensitive_API_Monitor and GT can be usede experimental results are presented in Table 9

Last we compare our method with Shabtairsquos method [8]Shabtairsquos method detects Android malware at the mobiledevices and it considers the apprsquos network traffic patternsonly the HTTP traffic contents are not considered isgives us an opportunity to compare our method withmethods that have been specially designed for encryptedtraffic We have implemented Shabtairsquos method and chosenthe feature subset 2 as the detection features Feature subset2 includes Avg Sent Bytes Avg Rcvd Bytes and Pct ofAvg Rcvd Bytes among other values and it is proved to be

the best feature subset for malware detection [8] e De-cisionRegression tree is chosen as the classification algo-rithm e testing dataset is the traffic that was generated bythe 143 repackaged malware instances and the training set isthe traffic that was generated by the corresponding originalapps e classification results are listed in Table 10According to the table the detection accuracy and F-mea-sure values of our method are significantly higher than thoseof Shabtairsquos method is indicates that the HTTP trafficprovides useful information for Android malware detectionSince Shabtairsquos method is conducted at the mobile devices itconsumes more resources than our method as compared inTable 10

7 Discussion

e results of our extensive experiments demonstrate thatrepackaged Android malware can be efficiently and effec-tively detected by clustering network traffic at the edgedevices However there are still limitations for our methodFirst we assume the most of the devices have installed theoriginal apps (rlt (12)u as described in Section 51) is isreasonable in most cases since mobile users typicallydownload apps from the official markets However in areaswhere official markets such as the Google play and Amazonare banned it is possible that the most mobile devices haveinstalled the repackaged apps is may cause false positivesSecond if the malware does not generate sufficient trafficour method cannot detect it For example some malwareonly sends fraudulent premium SMS messages is type ofmalware cannot be detected by monitoring its networktraffic ird we rely on third-party tools such as Virustotalfor the identification of malicious URLs and to reduce thenumber of false alarms If a malicious app rents legalplatforms such as the public cloud as the remote controlserver our method will judge it as a normal app Hence falsenegatives will be generated

To overcome these limitations we could combine theproposed method with available methods such as SAMA-Droid In such cases the edge-client app will not only recordflow metainformation but also monitor app system be-haviors However the system behavior monitoring does notneed to be conducted all the time If abnormal but harmlessor lower traffic apps are detected the edge-client app will beinformed to record the corresponding appsrsquo system be-haviors Similarly the edge-client app can send the appsystem behaviors to the edge server for further processinge performance of this combined method will be evaluatedin our future work One can also combine the proposedframework with AppFA [14] for the detection of other typesof Android malware For example the edge extracts apptraffic behavior features and the cloud can compare appsrsquobehaviors with their peer groups to identify abnormalitiesFurthermore one could first use clustering techniques [60]to group repackaged apps according to the same maliciousmodules and then analyze the behaviors of the representativerepackaged apps in each group to obtain more accuratebehavior features

y

2D nonclassical multidimensional scaling

00ndash02 02ndash04x

ndash04

ndash03

ndash02

ndash01

00

01

02

03

04

Figure 16 Clustering results of various malware that wererepackaged from aircomaceviralmotox3e outlier in the middleis detected as an abnormal but harmless app is figure wasgenerated by the density peak clustering tool

Table 8 Comparison of the proposed method with AppFA

Proposed method AppFAAccuracy 969 863F-measure 094 075Average power consumption 12 0Average CPU consumption 06 0Average memory consumption 654MB 0

16 Security and Communication Networks

e paradigms of edge computing are used to detectAndroid malware in this study However several challengeswould be encountered when deploying it in practice Firstthe mobile devices must install a specified app for thecollection of essential information which may cause con-cerns and mobile users may refuse to cooperate Second inedge computing the data should be preprocessed by the edgeto better protect the usersrsquo privacy However the edge servercan observe all the data (including the sensitive informa-tion) and it threatens the privacy of users For example ifthe edge is deployed by the cloud provider the usersrsquo privacyinformation is also available to the cloud ere is no dif-ference with the user-cloud model for privacy protectionunless the edge server is deployed by the userird the edgeserver should be well protected from network attacksOtherwise it will be convenient for hackers to steal usersrsquoprivate information

8 Conclusion

In this paper we propose a novel method for detectingAndroid malware based on edge computing To the best ofour knowledge the detection of Android malware by uti-lizing edge computing and traffic clustering has not beenconsidered previously By introducing the edge layer themain tasks of the mobile device are offloaded to the edgeserver and the mobile device only needs to record flowmetainformation which substantially conserves the re-sources of the mobile device e edge server extracts appsrsquonetwork traffic features and sends these features to the cloudplatform for further analysis us the usersrsquo privacy can bebetter protected With the received features the cloud cal-culates the similarities between apps and clusters thesesimilarity values to separate the original apps and themalware automatically We evaluated our methods on 400Android apps and the experimental results demonstratethat the detection accuracy can reach 100 for several appsand the average accuracy is 969 We also tested theperformance of the proposed method and compared it with

the typical approaches e comparison results show thatour method can detect repackaged Android malware withboth higher detection accuracy and lower power and re-source consumptions In the future we will continue to workto overcome the discussed limitations and to evaluate theproposed method in real world scenarios

Data Availability

e downloaded apps used to support the findings of thisstudy have been deposited in httpspanbaiducoms1b8vMceIdtdd1gBRC5Gbzvw Its extraction code is vd5re clustering code is also in the repository Since appnetwork traffic contains usersrsquo sensitive information itcannot be publicly shared However we have explained allthe steps to generate network traffic automatically esesteps are also described in the instructionspdf in therepository

Conflicts of Interest

e authors declare that they have no conflicts of interest

Acknowledgments

is work was supported by National Key TechnologiesResearch and Development Program of China under Grant2018YFD0401404 National Natural Science Foundation ofChina under Grants 61702282 61802192 and 71801123Natural Science Foundation of the Jiangsu Higher EducationInstitutions of China under Grants 18KJB520024 and17KJB520023 Nanjing Forestry University under GrantsGXL016 and CX2016026 and NUPTSF under GrantNY217143

References

[1] Statista Mobile Operating Systemsrsquo Market Share Worldwidefrom January 2012 to December 2019 Statista HamburgGermany 2009 httpswwwstatistacomstatistics272698global-market-share-held-by-mobile-operating-systems-since-2009

[2] W Martin F Sarro Y Jia Y Zhang and M Harman ldquoAsurvey of app store analysis for software engineeringrdquo IEEETransactions on Software Engineering vol 43 no 9pp 817ndash847 2017

[3] httpswwwappbraincomstatsnumber-of-android-apps2020

[4] L Li D Li T F Bissyande et al ldquoUnderstanding android apppiggybacking a systematic study of malicious code graftingrdquoIEEE Transactions on Information Forensics and Securityvol 12 no 6 pp 1269ndash1284 2017

[5] M Fan J Liu X Luo et al ldquoAndroid malware familialclassification and representative sample selection via frequentsubgraph analysisrdquo IEEE Transactions on Information Fo-rensics and Security vol 13 no 8 pp 1890ndash1905 2018

[6] J Zhang Z Qin K Zhang H Yin and J Zou ldquoDalvik opcodegraph based android malware variants detection using globaltopology featuresrdquo IEEE Access vol 6 pp 51964ndash51974 2018

[7] S Garg S K Peddoju and A K Sarje ldquoNetwork-baseddetection of android malicious appsrdquo International Journal ofInformation Security vol 16 no 4 pp 385ndash400 2016

Table 9 Comparison of the proposed method with SAMADroid

Proposed method SAMADroidAccuracy 969 831F-measure 0940 0793Average power consumption 12 23Average CPU consumption 06 08Average memory consumption 654MB 773MB

Table 10 Comparison of the proposed method with Shabtairsquosmethod

Proposed method Shabtairsquosmethod

Accuracy 969 771F-measure 0940 0695Average power consumption 12 77Average CPU consumption 06 23Average memory consumption 654MB 937MB

Security and Communication Networks 17

[8] A Shabtai L Tenenboim-Chekina D Mimran L RokachB Shapira and Y Elovici ldquoMobile malware detection throughanalysis of deviations in application network behaviorrdquoComputers amp Security vol 43 pp 1ndash18 2014

[9] L Xie X Zhang J-P Seifert and S Zhu ldquopBMDS a be-havior-based malware detection system for cellphone de-vicesrdquo in Proceedings of the ird ACM Conference onWireless Network Security ACM New York NY USApp 37ndash48 2010

[10] I Burguera U Zurutuza and S Nadjm-Tehrani ldquoCrowdroidbehavior-based malware detection system for androidrdquo inProceedings of the 1st ACMWorkshop on Security and Privacyin Smartphones and Mobile Devices ACM Chicago IL USApp 15ndash26 October 2011

[11] G Dini F Martinelli A Saracino and D SgandurraldquoMADAM a multi-level anomaly detector for android mal-warerdquo in Proceedings of the International Conference onMathematical Methods Models and Architectures for Com-puter Network Security vol 12 Springer St PetersburgRussia pp 240ndash253 2012

[12] Y Zhang M Yang B Xu et al ldquoVetting undesirable be-haviors in android apps with permission use analysisrdquo inProceedings of the 2013 ACM SIGSAC Conference on Com-puter amp Communications Security ACM Berlin Germanypp 611ndash622 November 2013

[13] SWang Z Chen X Li LWang K Ji and C Zhao ldquoAndroidmalware clustering analysis on network-level behaviorrdquo inProceedings of the International Conference on IntelligentComputing Springer Liverpool UK pp 796ndash807 August2017

[14] G He B Xu and H Zhu ldquoAppFA a novel approach to detectmalicious android applications on the networkrdquo in Securityand Communication Networks Wiley Hoboken NJ USA2018

[15] X Wang Y Yang and Y Zeng ldquoAccurate mobile malwaredetection and classification in the cloudrdquo SpringerPlus vol 4no 1 p 583 2015

[16] K Tam A Feizollah N B Anuar R Salleh and L Cavallaroldquoe evolution of android malware and android analysistechniquesrdquo ACM Computing Surveys (CSUR) vol 49 no 4p 76 2017

[17] M Sun X Li J C S Lui R T B Ma and Z Liang ldquoMonet auser-oriented behavior-based malware variants detectionsystem for androidrdquo IEEE Transactions on Information Fo-rensics and Security vol 12 no 5 pp 1103ndash1112 2017

[18] T Vidas and N Christin ldquoEvading android runtime analysisvia sandbox detectionrdquo in Proceedings of the 9th ACMSymposium on Information Computer and CommunicationsSecurity pp 447ndash458 Kyoto Japan June 2014

[19] Y Duan M Zhang A V Bhaskar et al ldquoings you may notknow about android (Un) packers a systematic study basedon whole-system emulationrdquo in Proceedings of the 2018Network and Distributed System Security Symposium pp 1ndash15 San Diego CA USA February 2018

[20] L Xue X Luo L Yu S Wang and D Wu ldquoAdaptiveunpacking of android appsrdquo in Proceedings of the 2017 IEEEACM 39th International Conference on Software Engineering(ICSE) IEEE Buenos Aires Argentina pp 358ndash369 May2017

[21] Y Zhou and X Jiang ldquoDissecting android malware char-acterization and evolutionrdquo in Proceedings of the 2012 IEEESymposium on Security and Privacy IEEE San Francisco CAUSA pp 95ndash109 2012

[22] N W Lo S K Lu and Y H Chuang ldquoA framework for thirdparty android marketplaces to identify repackaged appsrdquo inProceedings of the 2016 IEEE 14th International Conference onDependable Autonomic and Secure Computing 14th Inter-national Conference on Pervasive Intelligence and Computing2nd International Conference on Big Data Intelligence andComputing and Cyber Science and Technology Congresspp 475ndash482 Auckland New Zealand August 2016

[23] L Li J Gao M Hurier et al ldquoAndrozoo++ collectingmillions of android apps and their metadata for the researchcommunityrdquo 2017 httpsarxivorgabs170905281

[24] W Shi J Cao Q Zhang Y Li and L Xu ldquoEdge computingvision and challengesrdquo IEEE Internet of ings Journal vol 3no 5 pp 637ndash646 2016

[25] W Zhou Y Zhou X Jiang and P Ning ldquoDetectingrepackaged smart-phone applications in third-party androidmarketplacesrdquo in Proceedings of the Second ACM Conferenceon Data and Application Security and Privacy ACM AntonioTX USA pp 317ndash326 February 2012

[26] Y Shao X Luo C Qian P Zhu and L Zhang ldquoTowards ascalable resource-driven approach for detecting repackagedandroid applicationsrdquo in Proceedings of the 30th AnnualComputer Security Applications Conference ACM NewOrleans LA USA pp 56ndash65 December 2014

[27] K Chen P Wang Y Lee et al ldquoFinding unknown malice in10 seconds mass vetting for new threats at the google-playscalerdquo in Proceedings of the USENIX Security Symposiumvol 15 Washington DC USA August 2015

[28] Z Wang C Li Y Guan and Y Xue ldquoDroidchain a novelmalware detection method for android based on behaviorchainrdquo in Proceedings of the 2015 IEEE Conference onCommunications and Network Security (CNS) IEEE Flor-ence Italy pp 727-728 September 2015

[29] K Tian D D Yao B G Ryder G Tan and G PengldquoDetection of repackaged android malware with code-het-erogeneity featuresrdquo IEEE Transactions on Dependable andSecure Computing vol 17 no 1 pp 64ndash77 2017

[30] A De Lorenzo F Martinelli E Medvet F Mercaldo andA Santone ldquoVisualizing the outcome of dynamic analysis ofandroid malware with vizmalrdquo Journal of Information Se-curity and Applications vol 50 Article ID 102423 2020

[31] M Lin D Zhang X Su and T Yu ldquoEffective and scalablerepackaged application detection based on user interfacerdquo inProceedings of the 2017 IEEE Conference on ComputerCommunications Workshops (INFOCOM WKSHPS) IEEEAtlanta GA USA May 2017

[32] G Meng M Patrick Y Xue Y Liu and J Zhang ldquoSecuringandroid app markets via modelling and predicting malwarespread between marketsrdquo IEEE Transactions on InformationForensics and Security vol 14 no 7 pp 1944ndash1959 2018

[33] A Arora and S K Peddoju ldquoMinimizing network trafficfeatures for android mobile malware detectionrdquo in Proceed-ings of the 18th International Conference on DistributedComputing and Networking ACM Hyderabad India January2017

[34] X Wu D Zhang X Su and W Li ldquoDetect repackagedAndroid application based on HTTP traffic similarity httptraffic similarityrdquo Security and Communication Networksvol 8 no 13 pp 2257ndash2266 2015

[35] J Malik and R Kaushal ldquoCredroid android malware de-tection by network traffic analysisrdquo in Proceedings of the 1stACM Workshop on Privacy-Aware Mobile Computing ACMPaderborn Germany pp 28ndash36 July 2016

18 Security and Communication Networks

[36] A Zulkifli I R A Hamid W M Shah and Z AbdullahldquoAndroid malware detection based on network traffic usingdecision tree algorithmrdquo in Proceedings of the InternationalConference on Soft Computing and Data Mining SpringerSenai Malaysia pp 485ndash494 January 2018

[37] Z Chen Q Yan H Han et al ldquoMachine learning basedmobile malware detection using highly imbalanced networktrafficrdquo Information Sciences vol 433-434 pp 346ndash364 2018

[38] G He B Xu L Zhang and H Zhu ldquoMobile app identifi-cation for encrypted network flows by traffic correlationrdquoInternational Journal of Distributed Sensor Networks vol 14no 12 pp 1ndash17 2018

[39] L Xue Y Zhou T Chen X Luo and G Gu ldquoMalton to-wards on-device non-invasive mobile malware analysis forARTrdquo in Proceedings of the 26th USENIX Security Symposiumpp 289ndash306 Vancouver Canada August 2017

[40] M K Alzaylaee S Y Yerima and S Sezer ldquoDL-Droid deeplearning based android malware detection using real devicesrdquoComputers amp Security vol 89 Article ID 101663 2020

[41] A S Shamili C Bauckhage and T Alpcan ldquoMalware de-tection on mobile devices using distributed machine learn-ingrdquo in Proceedings of the 20th International Conference onPattern Recognition (ICPR) IEEE Istanbul Turkeypp 4348ndash4351 August 2010

[42] M Zhao T Zhang F Ge and Z Yuan ldquoRobotdroid alightweight malware detection framework on smartphonesrdquoJournal of Networks vol 7 no 4 p 715 2012

[43] K A Talha D I Alper and C Aydin ldquoAPK auditor per-mission-based android malware detection systemrdquo DigitalInvestigation vol 13 pp 1ndash14 2015

[44] S Arshad M A Shah A Wahid A Mehmood H Song andH Yu ldquoSamadroid a novel 3-level hybrid malware detectionmodel for android operating systemrdquo IEEE Access vol 6pp 4321ndash4339 2018

[45] G He L Zhang B Xu and H Zhu ldquoDetecting repackagedandroid malware based on mobile edge computingrdquo inProceedings of the 2018 Sixth International Conference onAdvanced Cloud and Big Data (CBD) IEEE Lanzhou Chinapp 360ndash365 August 2018

[46] R Roman J Lopez andMMambo ldquoMobile edge computingFog et al a survey and analysis of security threats andchallengesrdquo Future Generation Computer Systems vol 78pp 680ndash698 2018

[47] F Bonomi R Milito J Zhu and S Addepalli ldquoFog com-puting and its role in the internet of thingsrdquo in Proceedings ofthe First Edition of the MCC Workshop on Mobile CloudComputing ACM Helsinki Finland pp 13ndash16 August 2012

[48] M T Beck MWerner S Feld and S Schimper ldquoMobile edgecomputing a taxonomyrdquo in Proceedings of the Sixth Inter-national Conference on Advances in Future Internet pp 48ndash55 Lisbon Portugal November 2014

[49] P Mach and Z Becvar ldquoMobile edge computing a survey onarchitecture and computation offloadingrdquo IEEE Communi-cations Surveys amp Tutorials vol 19 no 3 pp 1628ndash1656 2017

[50] H T Dinh C Lee D Niyato and P Wang ldquoA survey ofmobile cloud computing architecture applications and ap-proachesrdquo Wireless Communications and Mobile Computingvol 13 no 18 pp 1587ndash1611 2013

[51] X Lyu H Tian L Jiang et al ldquoSelective offloading in mobileedge computing for the green internet of thingsrdquo IEEENetwork vol 32 no 1 pp 54ndash60 2018

[52] S Agrawal and J Agrawal ldquoSurvey on anomaly detectionusing data mining techniquesrdquo Procedia Computer Sciencevol 60 pp 708ndash713 2015

[53] A Tongaonkar ldquoA look at the mobile app identificationlandscaperdquo IEEE Internet Computing vol 20 no 4 pp 9ndash152016

[54] G He B Xu and H Zhu ldquoIdentifying mobile applications forencrypted network trafficrdquo in Proceedings of the 2017 FifthInternational Conference on Advanced Cloud and Big Data(CBD) IEEE Shanghai China pp 279ndash284 August 2017

[55] G Ranjan A Tongaonkar and R Torres ldquoApproximatematching of persistent lexicon using search-engines forclassifying mobile app trafficrdquo in Proceedings of the 35thAnnual IEEE International Conference on Computer Com-munications IIEEE San Francisco CA USA April 2016

[56] L Li and S Qu ldquoShort text classification based on improvedITCrdquo Journal of Computer and Communications vol 1 no 4pp 22ndash27 2013

[57] A H Lashkari A F A Kadir L Taheri and A A GhorbanildquoToward developing a systematic approach to generatebenchmark android malware datasets and classificationrdquo inProceedings of the 2018 International Carnahan Conference onSecurity Technology (ICCST) IEEE Montreal Canada Oc-tober 2018

[58] A Rodriguez and A Laio ldquoClustering by fast search and findof density peaksrdquo Science vol 344 no 6191 pp 1492ndash14962014

[59] Y Li Z Yang Y Guo and X Chen ldquoDroidbot a lightweightUI-guided test input generator for androidrdquo in Proceedings ofthe Software Engineering Companion (ICSE-C) pp 23ndash26Buenos Aires Argentina May 2017

[60] M Fan X Luo J Liu et al ldquoGraph embedding based familialanalysis of android malware using unsupervised learningrdquo inProceedings of the 2019 IEEEACM 41st International Con-ference on Software Engineering (ICSE) IEEE Piscataway NJUSA pp 771ndash782 May 2019

Security and Communication Networks 19

Page 3: On-DeviceDetectionofRepackagedAndroidMalwarevia …downloads.hindawi.com/journals/scn/2020/8630748.pdf · 2020-05-31 · According to [4–6], most Android malware programs are repackagedapps,andthemajorityofnewAndroidmalware

(iii) Extensive experiments are conducted on publicdatasets

e remainder of this paper is organized as followse preliminaries of Android malware and edge com-puting are introduced in Section 2 In Section 3 wediscuss relevant related work Section 4 describes the newintegration architecture of the cloud edge and mobiledevices for Android malware detection e methodologyis presented in Section 5 An experimental evaluation ofour method in terms of accuracy and performance ispresented in Section 6 We discuss the limitations of theproposed method in Section 7 Finally in Section 8 wepresent the conclusions of the paper and discuss potentialfuture work

2 Preliminaries

In this section we introduce important concepts regardingAndroid malware and edge computing which form the basisof this work

21 Android Malware As discussed in Section 1 Androidapps (APKs) are easy to repackage which is convenient forattackers who would like to create Android malware In factmost Android malware is produced by repackaging benignapps especially favorite apps to ensure a wide diffusion ofthe malicious code [4] As evidence MalGenome [21] whichis a reference dataset in the Android security community80 of the malicious samples were built via repackagingother apps Repackaged apps can be distributed throughthird-party markets which do not typically requirescreening or integrity evaluation of the uploaded apps [22]For example many of the apps in F-Droid are found to berepackaged by the same developer [23] Consequently wedetect mainly malicious repackaged Android apps in thisstudy

Repackaged Android malware can be efficiently detectedby analyzing its network traffic [7 14] is is because therepackaged apps are typically functionalized to interact withremote servers to receive commands or return sensitive dataand their network behaviors will differ from the originalnetwork behaviors To better explain this we decompile twoapps namely 00575DE650F78413C91E8E613ADF9099812F325AA9CF36C72E81C79B230F58F4 and A55EA5807203CE31470D1905ED30F9267AE3459809DDA25C4F339197DE6486FE and compare their corresponding folders toidentify the newly added files e added files are furtheranalyzed (by reading the Java code) to investigate theirnetwork behaviors In this example the apprsquos name is theSHA256 value of the APK file and 00575lowast corresponds tothe original app and A55EAlowast is the repackaged app Weused Jadx to decompile APKs and the file comparison isconducted with WinMerge (httpwinmergeorg) Fig-ure 1 presents the comparison results

As shown in Figure 1 there are a total of 250 additionalJava code files and these code files are distributed in thesourcescom folder We read these added Java codes care-fully and determine that Android API HttpPost is invoked 8

times and HttpGet is invoked 5 times Hence there are atleast 13 flows that would never be produced by the originalapp namely 00575lowast erefore the original app can beefficiently distinguished from the repackaged version bycomparing their network traffic Inspired by these obser-vations we propose clustering the mobile appsrsquo traffic toseparate original apps and repackaged malwareautomatically

22 Edge Computing Edge computing can be defined interms of the technologies that enable computations to beperformed at the edge of the network on downstreamdata on behalf of cloud services and on upstream data onbehalf of Internet of ings (IoT) services [24] Perhapsthe most important concept of edge computing is that theedge can consist of any computing and network resourcesalong the path between mobileIoT devices and clouddata centers For example the edge could be a phonebetween a smart bracelet and the cloud or a wirelessrouter between the smart TV and the cloud if the phoneor the wireless router performs computations on behalfof the smart devices or the cloud In these examples thecomputations could be data anonymization for privacyprotection or content caching for performance im-provement Figure 2 illustrates the paradigm of edgecomputing

As illustrated in Figure 2 cloud computing-like capa-bilities ie edge servers are deployed at the edge of thenetwork for data (pre) processing and forwarding In edgecomputing the mobileIoT devices could send computationtasks to the edge servers namely the devices can distributecode to the edge to be executed hence the mobileIoTdevices can be designed to conduct complicated operationssuch as machine learning or encryption Similarly the cloudmay push services to the edge servers and the edge can bedeployed and updated online In most cases the devicesshould be connected to the edge and the cloud

Figure 1 Folders and files that were only present in repackagedapp A55EAlowast as indicated by left only in the third column

Security and Communication Networks 3

simultaneously as represented by the solid and dotted linesin Figure 2is is because not all data must be preprocessedby the edge servers and more importantly the robustness ofthe entire system can be improved If the edge breaks downthe cloud can continue to provide services In this study weobey the edge computing paradigm that is illustrated inFigure 2 in the design of the Android malware detectionsystem

3 Related Work

Repackaged Android apps are one of the major sources ofmobile malware and extensive studies have been conductedon Android malware detection In this section we will re-view the most closely related studies e typical studies ofedge computing are also reviewed to provide more detailedbackground information

31 Off-Device Detection Repackaged Android apps andmalware can be efficiently detected via code comparison ofAPKs DroidMoss [25] focuses on app code and conductspairwise similarity comparisons for the detection ofrepackaged apps in Android markets e detection resultsdemonstrate that 5 to 13 of apps that are hosted on third-party marketplaces were repackaged However the pairwisecomparison approach cannot be scaled up because millionsof Android apps are now available in markets To resolve thisissue Shao et al [26] proposed ResDroid for the detection ofrepackaged apps in Android markets via a divide-and-conquer strategy First they built a k-d tree and insertedclusters that contained one or more apps into the k-d treeen the apps were compared with their neighbors in the k-d tree and the number of comparisons was reduced As-suming that the apps from the official market were benignMassvet [27] downloaded all apps from Android markets inadvance and constructed v-core (a geometric center of a viewgraph) andm-core (mapping the features of a Java method toan index) databases for repackaged Android malware de-tection Both databases were used to vet new apps that weresubmitted to the market New apps would be reported asrepackaged malware if their v-cores or m-cores were similarto records in the databases and the code was not legitimatelyreused

In addition to code comparison behavior-based andUI-based detection methods are also emerging Droid-Chain [28] constructs a behavior chain model that iscomposed of typical behavior processes of Android apps

for the detection of malware In DroidChain the typicalbehaviors include privacy leakage SMS financial charg-ing malware installation and privilege escalationAccording to Tian et al [29] repackaged malware wasdifficult to detect partly because of their behavioralsimilarities to benign apps To overcome this problemthey proposed a new repackaged Android malware de-tection technique that was based on code heterogeneityanalysis e code structure of an app was partitioned intomultiple dependence-based regions and each region wasindependently classified according to its behavioral fea-tures De Lorenzo et al [30] proposed VizMal for visu-alizing the execution traces of Android applications andhighlighting potentially malicious behaviors Lin et al[31] proposed the detection of repackaged Android appsbased on static UI features ey detected a total of 3723repackaged app pairs in the Anzhi market and 15856repackaged pairs in the Mi market ese results dem-onstrate that repackaged malware still poses a severethreat to the security of mobile users even though varioustypes of detection methods have been proposed andadopted by Android markets [32]

Recently several network-based approaches have beenproposed for the detection of repackaged apps and malwareArora and Peddoju [33] demonstrated the effectiveness ofusing network traffic to detect Android malware Wu et al[34] divided app HTTP traffic into 2 categories primarymodule traffic and nonprimary module traffic For primarymodule traffic the HTTP flow distance algorithm and theHungarian Method were used to calculate the traffic simi-larity and identify similar app pairs CREDROID [35]identified malicious apps on the basis of Domain NameServer (DNS) queries and the data that are transmitted toremote servers Zulkifli et al [36] proposed a method fordetecting Android malware that is based on seven networktraffic features and the J48 decision tree algorithm Chenet al [37] introduced the imbalanced data gravitation-basedclassification algorithm for the classification of imbalanceddata of malicious apps He et al [14 38] proposed theidentification of encrypted appsrsquo flows via traffic correlationand the detection of repackaged Android apps via com-parison of the network behaviors of similar apps In thesemethods the detection is conducted mostly in the router orat the network monitoring node hence the performances ofmobile devices are not affected However the network be-haviors of malware must be modeled in advance which ischallenging in practice

Edge servers

Datacomputations

Data

Data

Data

DataservicesIoT devices

Cloud

Figure 2 Paradigm of edge computing

4 Security and Communication Networks

32On-DeviceDetection e detection of Android malwarecan also be conducted at usersrsquo mobile devices [39]Alzaylaee et al [40] extracted features from real devices andused a deep learning system to detect malicious Androidapps Shamili et al [41] proposed a distributed SupportVector Machine (SVM) algorithm for the detection ofAndroid malware on a network of mobile devices eirdetection features include short duration calls mediumduration calls and the number of outgoing SMS (ShortMessage Service) among others e experimental resultsdemonstrated that the average computation time per clientduring the training phase is nearly 10 seconds Zhao et al[42] implemented a malware detection framework namelyRobotDroid using an SVM active learning algorithmRobotDroid monitors all the running apps to identify theirrunning characteristics ese characteristics are later inputinto learning modules to generate the behavioral charac-teristics for the classification of normal apps and malwareAccording to their performance evaluation the time ofaccessing the GPS is increased from 80ms to almost 120mswhen the detection system is executed Hence the perfor-mance of a smartphone that utilizes this framework wouldbe significantly degraded Shabtai et al [8] monitored theappsrsquo network behaviors at the mobile devices to detectrepackaged Android malware and the detection accuracyexceeds 80 however the performance overhead of theirmethod is not negligible For example it requires7272 kB plusmn 15 memory consumption and 13 plusmn 15 CPUconsumption for learning 10 Android malware detectionmodels

To further improve the performance of on-device de-tection Crowdroid [10] sends the detection features to acentralized server for processing Crowdroid only recordsthe system calls as the detection features and it is light-weight Talha et al [43] proposed APK Auditor which is asystem that uses permissions as static analysis features forAndroid malware detection APK Auditor consists of threecomponents (1) a signature database (2) an Android clientand (3) a central server e Android client sends the app tothe central server for analysis e central server extracts thepermissions that are requested by the app and computes thepermission malware score based on the permissions inmalware If the score exceeds a threshold value the app isclassified as malware e results demonstrate that APKAuditor achieves 925 specificity however it lacks thebenefits of dynamic analysis as it cannot detect dynamicmalicious payloads

Monet [17] also uses backend servers and a client app todetect Androidmalwaree client app constructs a directedgraph of the app components and the system componentsand it records the potentially dangerous system calls as thedetection features ese features are forwarded to thebackend server which conducts further detection by ap-plying a signature matching algorithm and returns thedetection results e experimental results demonstrate thatMonet can detect Android malware with 99 accuracyHowever the performance overhead of Monet is high Asshown in [17] Monet had 8 overhead on memory and IObenchmarks and 55 overhead for battery Most recently

Arshad et al [44] proposed SAMADroid which is a 3-levelhybrid model for the detection of Android malware Sim-ilarly SAMADroid has a client app and an analysis servere client app hooks the Strace tool and traces the systemcalls that are invoked by the user app If the user app remainsrunning on the device the client app continues tracing thesystem calls of the user app and generates a log file of thesystem calls Later the log file is sent to the SAMADroidserver with the identifier of the user app At the server thesame user app is downloaded from stores and staticallyanalyzed en the static features and the dynamic featuresthat are extracted from the log file are embedded in a vectorspace Finally the features are provided as input to themachine learning tool for classification eir experimentalresults demonstrate that random forest yields 9907 ac-curacy on the static features and 8276 on the dynamicfeatures e performance overhead of SAMADroid is 18for memory and 06 for CPU In previous work [45] weused mobile edge computing to detect malicious apps Weextend it to the edge computing paradigm and refine theclustering method in this study

33 Edge Computing Research e vision and challenges ofedge computing are introduced in the literature [24] ebasic strategy of edge computing is to deploy cloud com-puting at the edge of the network to improve the perfor-mance and create new applications for IoT [46] Howeverthe deployment location of the edge and the functionalitiesof the edge are controversialerefore in applications edgecomputing can be implemented as fog computing [47]mobile edge computing [48 49] and mobile cloud com-puting [50] In the fog computing paradigm the analysis oflocal information is conducted on the ldquogroundrdquo and thecoordination and global analytics are conducted in theldquocloudrdquo One can argue that all objects that are outside thecloud constitute the ground thus fog computing is almostthe same as edge computing

Mobile edge computing deploys cloud services at theedge of mobile networks such as 5G Namely the edge isdeployed in close proximity to the eNodeB which loops thetraffic through the mobile edge computing servers for fur-ther data processing us in mobile edge computing theedge is fixed and the edge servers can be provided bytelecommunication companies Mobile cloud computingfocuses mainly on mobile delegation mobile devices shoulddelegate the storage of bulk data and the execution ofcomputationally intensive tasks to remote entities A remoteentity can be a centralized cloud or another mobile deviceerefore the edge can consist of mobile devices in mobilecloud computing Hence one of the most active areas ofresearch in the field of mobile cloud computing is thedelegation of tasks to external services especially to othermobile devices Task delegation is also essential for edgecomputing [51]

In this study we provide a lightweight and efficientframework for the detection of repackaged Androidmalwarethat is based on edge computing In the proposed frame-work the mobile devices need only to record the flow

Security and Communication Networks 5

starting time flow protocol server-side IP address server-side port app name and version for each network flowLater these data are sent to the edge servers for traffictagging Since the amounts of data that are saved and sent aresmall the performances of mobile devices are not signifi-cantly degraded e edge servers extract detection featuresfrom the labeled app network traffic and send the extractedfeatures to the cloud to be clustered After clustering theoriginal and repackaged apps can be efficiently distin-guished In our study only network traffic is considered inAndroid malware detection In contrast to previous net-work-based methods such as [14 34ndash36] we do not need tomodel appsrsquo network behaviors in advance and our methodis more useful in practice

4 Framework for Android Malware Detection

e on-device detection of Android malware is challengingbecause mobile devices typically have limited resources Toalleviate the storage and computing limitations and prolongthe lifetimes of the mobile devices in this study we proposea new framework for on-device detection of repackagedAndroid malware e proposed framework is illustrated inFigure 3

Our framework is composed of three layers mobiledevices the edge and the cloud First we explain thepossible locations of the edge later we introduce thefunctionalities of each component In our framework theonly requirement for the edge is that it can monitor andprocess the mobile network traffice mobile devices couldconnect to the Internet via WiFi or 3G4G hence the edgenodes can be wireless gateways household security gatewayssuch as BitDefender Box and LTE eNodeBs e edge nodescan also be routers that are located in the backbone networkif they can observe the traffic that is generated by the mobiledevices erefore the deployment locations of the edgenodes are flexible in practice Figure 4 illustrates the possibledeployment scenarios of the edge

In the proposed framework the mobile devices mustcollect and send flow metaformation to the edge for datapreprocessing For this an edge-client app is designed to berun on themobile devicese edge-client app operates as anAndroid VPN app Its main objective is to collect andtransfer the flow metainformation which includes the appname flow starting time and flow protocol to the edgeiteratively With the flow metainformation the edge serverpreprocesses the network traffic by adding a label (the appname) to each flow After that it extracts detection featuresfrom these labeled flows e traffic labeling and featureextraction are illustrated in detail in Section 52 Once thefeature extraction has been completed the edge server sendsthese features to the cloud With the received features thecloud conducts the malware detection tasks and notifies themobile devices of the detection results

To further illustrate the advantages of the proposedframework we compare it with the available frameworksSince the proposed framework is based on edge computingwe will refer to it as the edge-computing-based framework inthe following Similarly the available frameworks are termed

as the cloud-based framework (the server-side detectionmethods that were described in Section 1) and the network-based framework (the network-based methods that wereintroduced in Sections 1 and 3) for differentiation Com-pared to the popular cloud-based framework (Figure 5(a))the most prominent advantage of the edge-computing-basedframework is that the feature extraction process is offloadedto the edge and the mobile devices must only save andtransmit the flow metainformation us the proposededge-computing-based framework is resource-friendly formobile devices

e network-based framework (Figure 5(b)) monitorsnetwork traffic without any interference withmobile devicesHowever it is less accurate since the network nodes cannotknow the origins of the network flows precisely namely itcannot identify app traffic with 100 accuracy As dem-onstrated in the literature [14] since the network nodecannot identify the app for each flow (eg some flows lackssignatures or are encrypted) the appsrsquo network behaviors areincomplete and the malware detection accuracy is reducedIn the proposed edge-computing-based framework wedesign a lightweight edge-client app for gathering app in-formation Furthermore mobile usersrsquo privacy can be betterprotected in our proposed framework By deploying onersquosown edge server a mobile user can control the types ofinformation that can be used as detection features Incontrast in the network-based framework the networknodes can observe the full traffic content and extract anyinformation they want which severely threatens the privacyof mobile users

Android malwaredetectionCloud

Edge

Mobiledevices

(1) Flowmetainformation

(2) Extractedfeatures

(3) Detectionresults

Network connectionData transmission

Figure 3 Framework for android malware on-device detectionthat is based on edge computing Mobile devices send flow met-ainformation such as the app name and the server-side IP addressto the edge Edge servers capture app network traffic and extractHTTP contents and flow statistical features from the capturedtraffic Later the features are sent to the cloud for malware de-tection e cloud also interacts with the mobile devices to returnthe detection results

6 Security and Communication Networks

After introducing the general architecture we outline thedetection methods that can be used in the edge-computing-based framework e cloud can use various methods todetect Android malware For example it can use conven-tional signature matching [35] and network behavioranalysis [8 14] to identify malicious apps However thesetypes of methods require a priori information (such assignatures or behavior models) In this paper we propose anovel repackaged malware detection method that clustersthe network traffic that is generated by the same app fromvarious mobile devices Figure 6 presents an example of thisprocess

e motivations behind the traffic clustering are two-fold first most Android malware is produced by repack-aging popular apps [4] and popular apps are often widelydistributed Second the network traffic generated by therepackaged malware differs from that generated by theoriginal app (as illustrated in Section 21) As illustrated inFigure 3 many mobile devices would connect to the cloudfor malware detection Hence the cloud can observe thenetwork traffic features of both the original apps and theirrepackaged versions with large probabilities due to the widedistribution of popular apps and the large number of mobiledevices If we regard the features of the original apps as thenormal data and the features of the repackaged versions asthe abnormal data the repackaged malware detectionproblem is transformed into the abnormal detection

problem Clustering is an efficient method for solving thisproblem [52] By clustering appsrsquo network traffic features weneed not model appsrsquo network behaviors in advance andmore importantly we can detect various repackaged ver-sions of the original app as illustrated in Figure 6e detailsof the proposed malware detection method will be describedin detail in Sections 53ndash55

5 Methodology

In this section we introduce the technical details of theproposed clustering method for Android malware detectionTable 1 defines the main notation that we use in this paper

(a) (b) (c)

Figure 4 Possible deployments of the edge e edge could be the wireless gateways or the dedicated servers that are connected to theeNodeB or the routers (a) Wireless gateway (b) eNodeB+ server (c) Router + server

Clustering

Traffic features of an app collected from

different mobile devicesOriginal apps

Repackaged apps

Figure 6 Illustration of traffic clustering for repackaged malwaredetection e plain circles represent original apps and the circleswith triangles and inverted triangles are repackaged ones

Network connectionData transmission

Android malwaredetectionCloud

Mobiledevices

(1) Extractedfeatures

(2) Detectionresults

(a)

Network connectionData transmission

Network node

Mobiledevices

(1) Detectionresults

(b)

Figure 5 Available frameworks for android malware detection Various approaches such as [10] send the extracted features to a remoteserver other than to the cloud We still regard them as following the cloud-based framework because there is no essential difference formalware detection As illustrated in (b) the detection results are typically presented to the network administrator and are not fed back to themobile devices (a) Framework based on cloud (b) Framework based on network

Security and Communication Networks 7

51 System Overview Figure 7 illustrates the main proce-dure of the proposed approach App i is the app for vettingand it has been installed on umobile devicesese u devicesare connected to e edge servers We assume that r deviceshave installed the repackaged malware (the malware is alsopresented as App i to deceive the users) and rlt (12)uHence most of the devices have installed the original ver-sion is situation is reasonable because repackaged An-droidmalware is typically distributed in third-party marketswhich are less popular than official markets [23] us thenumbers of downloads and installations are relatively smallWe will use this characteristic to classify the clusteringresults

To detect the repackaged malware the edge server labelsthe network flows that are generated by App i according tothe flow metainformation en it extracts suitable trafficfeatures (such as traffic content and statistics) from thelabeled flows and sends the features to the cloud for pro-cessing erefore there are u feature sets for App i in thecloud e cloud filters these u sets according to the appversions and removes common traffic After that it calculatesthe pairwise similarities of the feature sets Finally it clustersthe similarity values and identifies r repackaged malwareinstances

e traffic labeling feature extraction and filtering(labels①② and③ in Figure 7) the similarity calculationon the traffic contents and behaviors (labels④ and⑤) andclustering of similarity values (label⑥) are the core elementsof the proposed method ey will be described in detail inthe following sections

52 Traffic Labeling Filtering and Feature Extraction

521 Traffic Labeling In this study traffic labeling isconducted to identify the corresponding app for each net-work flow which is known as the app identification problem

[53] Extensive works have been conducted on the identi-fication of apps from mobile network traffic [38 54 55]However the achieved identification accuracies are all lowerthan 100 In our method we design an edge-client app forcollecting flow metaformation and identifying apps accu-rately e flow metainformation is defined in the following

Definition 1 (flow metainformation) For flow f the flowmetainformation includes the client-side IP address C-IPfthe flow starting time Tf the flow protocol Pf the server-sideIP address S-IPf the server-side port S-Portf the app nameAppNf and the app version AppVfe flow f is bidirectionaland Pf is the transport layer protocol such as TCP or UDP

e edge-client app operates similarly to a VPN appeapp sends the flow metainformation to the edge server at afixed time interval Ti Once it has received the flow meta-information the edge server saves it into a metainformationdatabase and can label mobile app traffic accurately Denotethe flow observed by the edge server as fu e edge serverlabels fu as follows

Step 1 extract the recording time Tuf of fu and its flow

protocol Puf client-side IP address CminusIPu

f server-sideIP address SminusIPu

f and server-side port SminusPortufStep 2 search the metainformation database with Pu

fCminusIPu

fSminusIPuf and SminusPortuf If records match the da-

tabase returns the record that has the most recent flowstarting time Tr

f Otherwise the edge server waits for awhile and repeats Step 2Step 3 if Tu

f is close to Trf the edge server labels flow fu

as ltAppNf AppVfgtOtherwise it waits for a while andrepeats Steps 2 and 3

In the above steps the client-side IP address CminusIPuf is

used to identify a mobile device uniquely which is correctwhen the edge server is the wireless gateway or the eNodeBIn such scenarios the mobile devices directly connect to theedge hence the IP address is unique for each deviceHowever if the edge server is deployed at the backbonerouter as illustrated in subgraph (c) of Figure 4 the CminusIPu

f

may be confusing because mobile devices could be locatedbehind NATs (Network Address Translations) and they willshare the same public IP address To solve this issue theedge-client app can insert special indicators into the packetsto uniquely represent a device In this study we alwaysassume that CminusIPu

f is unique and we will investigate theimplementation of the device-indicator in future work

In steps 2 and 3 the edge server utilizes a wait-and-repeat strategy to ensure that the traffic labeling is fresh andaccurate e edge server must wait because the mobiledevices send flow metainformation to the edge server at afixed time interval Ti (60 s in our experiments) and the savedflow metainformation in the database may be outdated Inour implementations the waiting time is set to Ti and if|Tu

f minus Trf|gt 24 h the returned flow metainformation is

regarded as obsolete

522 Traffic Feature Extraction After traffic labeling theedge server extracts detection features from the labeled

Table 1 Main notation used in this paper

Notation Meaningi Android app iu Number of mobile devices with app i installedr Number of devices with repackaged app i

f Network flowC-IPf Client-side IP address of fS-IPf Server-side IP address of fS-Portf Server-side port of fAppNf Name of the app that generates fAppVf Version of the app that generates fTi Set time intervalTu

f Recording time of flow fu at the edge serverdi Feature set of app iwj Plaintext word in the feature setsV(di) Numerical vector of traffic contentsF

j

diFeature vector for the encrypted flow j in di

TBntimesm(di) Traffic behaviors of diCSdik

Content similarity between di and dkBSdik

Behavior similarity between di and dkSdik

Final similarity between di and dk

8 Security and Communication Networks

traffic e extracted traffic features are categorized into twogroups traffic content features and traffic behavior featuresTraffic content features are the plaintext contents that areextracted from HTTP flows and traffic behavior features arethe statistics of network flows such as packet sizes andintervals Since Android apps may use an encrypted HTTPSprotocol to transmit data [54] traffic behavior features alsomust be considered e traffic content and traffic behaviorfeatures are described in detail in Sections 53 and 54

523 Traffic Filtering After traffic feature extraction theedge servers send the extracted features along with theclient-side IP address server-side IP addresses server-sideports app name AppNf and app version AppVf to the cloudNote that App i is installed on u devices hence the cloudeventually has u feature sets for App i Figure 8 illustrates thestructure of the feature set e cloud further filters these ufeature sets to increase the detection accuracy First it di-vides the u sets into v categories according to the app versionAppVf erefore each category contains the feature setsthat are generated by the same version of App i en foreach category the cloud removes the common traffic fromthe feature sets e common traffic is defined as network

flows that are contained in all features sets ese flows havethe same server-side IP address and the same server-sideport number e reason for this step is that the commontraffic will obfuscate the similarity calculations betweenapps Otherwise the similarity values between features setsare all high and they cannot be effectively used to distin-guish the repackaged malware from the original apps

After the filtering has been completed the similaritiesbetween feature sets are calculated and the similarity valuesare clustered to identify the repackaged malware

Client-side IP addressApp name AppN_fApp version AppV_fFlow 1 server-side IP address 1 server-side port 1 flow features

Flow f server-side IP address f server-side port f flow features

Figure 8 Illustration of the feature set e set contains the clientIP address app name and app version Each network flow that isgenerated by App i in a device is represented by the featuresextracted from the flow and by the server IP address and port

Trafficlabeling

Edge server 1

Traffic featureextraction

Similarity calculation of

trafficbehaviors

Similarity calculation of traffic contents Clustering of

similarity values

Original app

Repackaged malware

Cloud

Mobile device 1 2 (with app i installed)

Results

Traffic features

Traffic and flowmetainformation

ę

ę

Edge server e

Traffic features

Traffic and flow metainformation

Filtering

Mobile device u(with app i installed)

1

3

4

5

6

2

Figure 7emain procedure of the proposedmethod for Androidmalware detectione core parts of the proposedmethod are labeled bycircled numbers

Security and Communication Networks 9

53 Similarity Calculation of Traffic Content Android appsuse mainly HTTP and HTTPS protocols to communicatewith their remote servers [54] For HTTP traffic since it isplaintext based we can efficiently use the flow content tocalculate the similarity e basic strategy is to convertHTTP traffic to document files and the document files canbe handled by information retrieval technologies for simi-larity calculation HTTP flow content has also been used inapp identification from network traffic [55] However in thisstudy we only consider partial HTTP header information tobetter protect mobile usersrsquo privacy

In detail first we reassemble HTTP flows from packetsand extract the flow contents by saving the correspondingpacket contents into a text file as ASCII code Figure 9 showssuch an example en we preprocess the text file by de-leting the HTTP request content and the HTTP requestheader except for the GET (or POST) and HOST fields eHTTP response content is also deleted namely only theHTTP response header the GET (or POST) and the HOSTfiled in the HTTP request header are saved Via this ap-proach the mobile usersrsquo privacy can be protected from thecloud (for example the mobile phone information (HUA-WEI) is leaked by the User-Agent field in Figure 9) Afterthat the text file is spilt using special characters (eg ) and the processed file becomes the flow contentfeatures

e similarity is calculated via the improved TF-IDFalgorithm [56] and the cosine similarity e term frequencymeasures the count of words in a document We use theaugmented frequency measurement Denote the document(the feature set) as di and the word as wj Term frequencytf(wj di) is calculated as

tf wj di1113872 1113873 05 +05 times f wj di1113872 1113873

max f w di( 1113857 w isin di1113864 1113865 (1)

where f(wj di) is the number of occurrences of wj indocument di

e inverse document frequency measures of theamount of information a word provides Typically it is thelogarithmically scaled inverse fraction of the number ofdocuments that contain the word e inverse documentfrequency of wj is calculated as

idf wj1113872 1113873 logu

d wj isin d1113966 111396711138681113868111386811138681113868

11138681113868111386811138681113868+ 001⎛⎝ ⎞⎠ (2)

where u is the number of documents and | d wj isin d1113966 1113967| isnumber of documents in which the term wj appears

Denote by p(wj di) the product of TF(wj di) andI DF(wj) it is formulated as

p wj di1113872 1113873 log tf wj di1113872 11138731113872 1113873 times idf wj1113872 1113873

1113936

cj0 log

2 tf wj di1113872 11138731113872 1113873 times idf2 wj1113872 11138731113969 (3)

After the TF-IDF operation all documents are convertedinto equal-length numerical vectors We use the cosinesimilarity to measure the similarity between two documentvectors For documents di and dk the corresponding

numerical vectors are denoted as V(di) and V(dk) and thelength of vector is t e similarity between di and dk iscalculated as

CSdik

1113936tl1 V di( 1113857l times V dk( 1113857l( 1113857

1113936tl1 V di( 1113857

2l

1113969

times

1113936tl1 V dk( 1113857

2l

1113969 (4)

Suppose that in category vi (we divide the u feature setsgenerated by App i in to v categories in the traffic filteringstep) there are c feature sets For each feature set we save theflow features of all HTTP flows into a text file and c text filesare obtained en we calculate the numerical vector foreach feature set via equation (3) Last we compare thenumerical vector with each other as equation (4) and gain csimilarity values (CSdii

1) for each feature set We rep-resent these similarity values of feature set di (i 1 2 c)as a vector Si

contents and Sicontents (CSdi1

CSdi2 CSdic

)e vectors Si

contents (i 1 2 c) in combination with thesimilarity values of traffic behaviors will be clustered toidentify the abnormal data points namely the repackagedmalware

54 SimilarityCalculation of TrafficBehaviors In addition tocalculating the similarity values of traffic contents thesimilarity of traffic behaviors must be calculated due toencrypted flows We use m (m 11 in our study) statisticalfeatures as listed in Table 2 to represent the behavior of anencrypted network flow e units of the flow duration andthe packet interval time are milliseconds in our experiments

With the flow statistical features for feature set di the jthencrypted flow can be represented as a vector in the featurespace

Fj

di f

1j f

2j f

mj1113872 1113873 (5)

where flj represents the value of the lth statistical feature

1le llem Figure 10 shows an example of feature represen-tation for encrypted traffic Consequently the traffic be-haviors of di can be represented as a matrix of f

pj

Figure 9 Example of converting an HTTP flow to a text file

10 Security and Communication Networks

TBntimesm di( 1113857 F1di

F2di

Fndi

1113872 1113873T (6)

where n is the number of encrypted flows that are containedin di and nge 1

Since different feature sets may differ in terms of thenumber of encrypted flows traffic behaviors TBnitimesm(di) andTBnktimesm(dk) may differ in terms of different dimensionnamely the value of ni should not be equal to that of nk Toefficiently calculate the similarity between TBnitimesm(di) andTBnktimesm(dk) we calculate the Frobenius norm of TBnitimesm(di)

and TBnktimesm(dk) e formula is presented in equation (7)where norm(middot) normalizes the value of fl

j to [0 1]

FN(d)

1113944

n

j11113944

m

l1norm fl

j1113872 111387311138681113868111386811138681113868

111386811138681113868111386811138682

11139741113972

(7)

en the distance between TBnitimesm(di) and TBnktimesm(dk) iscalculated as

dis di dk( 1113857 FN di( 1113857 minus FN dk( 11138571113868111386811138681113868

1113868111386811138681113868 (8)

Finally the similarity between TBnitimesm(di) andTBnktimesm(dk) can be calculated by

BSdik 1 minus

dis di dk( 1113857

max dis dr dq1113872 1113873 r q 1 2 u rne q1113960 1113961 (9)

In the above process ni and nk are assumed to exceed 1If ni 0 (nk 0) and nkge 1 (nige 1) BSdik

is set to 0 to reflectthe significant difference between di and dk However ifni 0 and nk 0 namely there is no encrypted flow in bothdi and dk BSdik

is set to 1 because they exhibit the samebehaviors ie they do not produce any encrypted traffic

Similarly we let BSdii 1 and represent these similarity

values as a vector Sibehavior BSdi1

BSdi2 BSdiu

1113966 1113967

55 Clustering of Similarity Values and Malware DetectionFrom the traffic content similarity CSdik

and the traffic be-havior similarity BSdik

we can determine the final similarityvalue BSdik

between the feature sets di and dk as expressed inequation (10) In this equation q is the weight value In ourexperiments we set q 05 namely the traffic contentsimilarity and the traffic behavior similarity are assigned thesame weight is is because apps typically produce almostthe same numbers of HTTP and HTTPS flows [57] We alsoevaluate various weighted values in Section 62 and theexperimental results support that 05 is the most suitablevalue For feature set di a vector Sdi

(Sdi1 Sdi2

Sdiu) is

finally obtained which contains the similarity values be-tween di and the other feature setse set of vector Sdi

1113966 1113967 i

1 2 u will be clustered for the detection of repackagedmalware

Sdik q times CSdik

+(1 minus q) times BSdik (10)

To be specific we use the density peak clustering method[58] to realize our objective Density peak clustering is basedon the following assumption cluster centers are charac-terized by a higher density than their neighbors and by arelatively large distance from points that have higher den-sities Hence density peak clustering can automaticallyidentify the correct number of clusters is is importantbecause we do not know whether there is repackagedmalware or how many types of repackaged malware thereare In other words we cannot determine the number ofclusters in advance Density peak clustering can help usovercome this challenge

We assume that most mobile devices have installed theoriginal apps as discussed in Section 51 erefore afterclustering the cluster that has the largest number of voxelswill be recognized as original and the remaining clusters willbe regarded as suspicious However suspicious apps are notnecessarily repackaged malware For example apps can berepackaged to show ads only Repackaged apps of this typeshould not be simply judged as malware Additionallyoriginal apps can be suspicious due to different user be-haviors To further reduce the frequency of the false alarmswe can check the server hostnames that are contained in therepackaged clusters using VirusTotal (the URLs are includedin the HTTP traffic contents) If all of the serversrsquo hostnamesfor a suspicious cluster Cr are judged as clean it will berecognized as a cluster of abnormal but harmless appsOtherwise it will be identified as repackaged malware Viathis approach we can accurately detect repackaged Androidmalware In this step we only use VirusTotal as a white list toexclude false alarms We do not use it as a backlist (detectionsignatures) to detect repackaged malware Malware detec-tion is still realized by clustering the similarity values

Table 2 Flow statistical featuresBytes sent by the app and bytes sent by the serverNumbers of outgoing and incoming packetsMean and std dev of the outgoing and incoming packet sizesrsquoflow durationMean and std dev of the interpacket time

lt2761 16696 16 12 173 1294 289 1177 3392 121 190gt

Figure 10 Example of the conversion of an encrypted flow to afeature vector

Security and Communication Networks 11

6 Evaluation

61 System Implementation and Dataset Figure 11illustrates our experimental setup e Android phoneswere connected to the edge server via WiFi and each phonehad installed the edge-client app which sends the flowmetainformation to the edge server using Socket (jav-anetSocket) e edge is a Dell PowerEdge R730 server(with 32 processors 16GB RAM and 12 TB hard drives)that is configured as an access point In the edge server weused tcpdump (httpwwwtcpdumporg) to collect networktraffic e similarity calculations of traffic contents andbehaviors and the clustering of similarity values wereimplemented on the Windows Azure cloud computingplatforme URL checking was conducted using Virustotal(httpswwwvirustotalcom)

In the implementation we used aria2c (httpsaria2githubio) to download apps from Androzoo (httpsandrozoounilu) aria2c was encapsulated as commandsin Python code for parallel downloading After downloadingthe apps we automatically installed them on phones usingthe adb shell command Each phone had 40 tested appsinstalled We have implemented the edge-client app basedon Android VPN service In our implementations the edge-client app correlated network flows and apps by resolvingthe ports and PIDs (Process IDs) send the newly collectedflow metainformation to the edge server every 60 secondse edge server used tcpdump to collect network trafficwhich was saved as pcap files To extract traffic features weused Splitcap (httpswwwnetreseccompageSplitCap)to restore tcp flows from pcap files en for HTTP trafficwe extracted the contents using tshark (httpswwwwiresharkorgdocsman-pagestsharkhtml) commands -zfollow tcp and ascii e packet sizes and packet intervaltimes were also obtained by tshark with commands -eframelen ndashe frametime_relative In the cloud CSdik

and BSdik

were calculated using scikit-learn (httpscikit-learnorg)and Numpy (httpsplotlynumpynorm) e code ofdensity peak clustering is available on GitHub (httpsgithubcomjasonwbwDensityPeakCluster) and we havefixed various bugs that were present in the original code

We have downloaded 200 pairs of original andrepackaged apps from Androzoo for testing However notall of these downloaded repackaged apps are maliciousSome apps are only repackaged to show extra ads To screenout Android malware from the downloaded repackagedapps we further checked the 200 repackaged apps usingVirustotal If an app was judged as safe by all the antivirustools it was labelled as repackaged_safe in our experimentsOtherwise it was labelled as repackaged_malware In total143 repackaged apps were detected as malware by VirustotalNevertheless in our experiments we evaluated the proposedclusteringmethod on all 200 repackaged apps to obtainmorerealistic performance results

e downloaded apps were run in 10 rooted Androidphones in parallel to accelerate the experimental process Forall apps we used Droidbot [59] to send input events to themand to generate network traffic automatically We ran eachrepackaged app (repackaged_safe and repackaged_malware

apps) 10 times and the corresponding original app 30 timesto generate sufficient network traffic e running times ofthe original apps exceeded those of the repackaged appsbecause we assume that most of the apps are original asdiscussed in Section 51 Consequently we obtained 40traffic sets for each repackaged-original app pair Note thatfrom the perspectives of the mobile devices and the edgeserver there are only 200 distinct apps because therepackaged app has the same name as the correspondingoriginal app erefore our experiments have simulated40lowast 200 8000 mobile devices For each repackaged-orig-inal app pair there were 30 devices with the original app and10 devices with the repackaged version

In total we have gathered almost 89GB network traffice datasets that are used in our experiments are summa-rized in Table 3 In the table the traffic dataset sizes are thesizes of the captured network traffic and the feature datasetsizes are the sizes of the extracted features In our experi-ments the HTTP traffic contents and the flow statisticalfeatures which are listed in Table 2 were both saved as textfiles As shown in Table 3 the feature datasets are signifi-cantly smaller than the traffic datasets (25G vs 63G and 9G vs26G)us the edge computing can reduce the data size andincrease the data acquisition speed for cloud computing[24] Since the feature dataset sizes are still large (25G+ 9G)we did not send all these feature data to the cloud simul-taneously Instead we sent the feature data one by oneWhen the feature data of an app were successfully receivedby the cloud its traffic similarities were calculated andclustered immediately After that the feature data of anotherapp would be sent and analyzed e experimental resultswill be presented in the following

62 ExperimentalResults In this section wemainly evaluatethe detection efficiency of Android malware by clusteringsimilarity values e performance evaluation of the pro-posed method in terms of power memory and CPU con-sumptions is elaborated in the following Section 63

Similar to previous studies [8 14] to measure theperformance of the proposed method Accuracy and F-measure metrics are utilized which are defined in equations(11) and (12) In equation (11) TP FN FP and TN aredefined in Table 4 In Table 4 repackaged_x representsrepackaged_safe or repackaged_malware e precision andrecall in equation (12) are calculated via equations (13) and(14)

Accuracy TP + TN

TP + FN + FP + TN (11)

F minus measure 2 times precision times recallprecision + recall

(12)

precision TP

TP + FP (13)

recall TP

TP + FN (14)

12 Security and Communication Networks

We have evaluated the Accuracy and F-measure for eachrepackaged app Recall that we ran each repackaged app 10times and the corresponding original app 30 timeserefore for each app the number of the original apptraffic sets is 30 and the number of the repackaged app trafficsets is 10 namely TP + FN 10 and TN+FP 30 for eachapp We have randomly chosen 20 apps from repack-age_malware and the experimental results for these apps arepresented in Figures 12 and 13 Detailed descriptions of thechosen 20 apps are listed in Table 5 In the table the mainfunction of each app is determined by analyzing the packagename from the AndroidManifestxml and from the Googlesearch results of the package name As depicted in Figure 12100 accuracy was realized on a total of 6 apps (app3 app5app9 app10 app13 and app18) e minimum accuracy is85 (app2 3440) and the average accuracy is 952 Fig-ure 13 presents the F-measure values In our experimentsthe maximum F-measure value is 1 and the minimum valueis 0737 e average F-measure value is 0888 ese resultsdemonstrate the satisfactory performance of our proposedclustering method

We have also calculated the average accuracy and F-measure values for all repackaged_safe and repack-aged_malware apps In these calculationsTP + FN 57lowast10 570 and TN+FP 57lowast 301710 for

repackaged_safe apps For repackaged_malware appsTP + FN 143lowast101430 and TN+FP 143lowast 30 4290is is because in our experiments 143 apps were identifiedas repackaged malware and 57 apps as repackaged safe byVirustotal us the parameter values differ e experi-mental results are listed in Table 6

According to Table 6 the average accuracy values forrepackaged_safe and repackaged_malware apps are bothhigh However the average F-measure for repackaged_safeapps is only 078 which is lower than that of repackagedmalware is indicates that repackaged safe apps are

Router Windows Azure

Edge server DellPowerEdge R730

10 android phones

Repackagedoriginal apps

Androzoo

Figure 11 Experimental setup

Table 3 Datasets used in the experiments

of apps of runs of HTTP flows of encrypted flows Traffic dataset sizes Feature dataset sizesOriginal apps 200 30 340015 183452 asymp63G bytes asymp25G bytesRepackaged apps 200 10 110670 93041 asymp26G bytes asymp9G bytes

Table 4 Confusion matrix

ObservedDetected

Repackaged_x app Original appRepackaged_x app TP ( of TPs) FN ( of FNs)Original app FP ( of FPs) TN ( of TNs)

Apps

0

01

02

03

04

05

06

07

08

09

1

Acc

urac

y

App

1A

pp2

App

3A

pp4

App

5A

pp6

App

7A

pp8

App

9A

pp10

App

11A

pp12

App

13A

pp14

App

15A

pp16

App

18A

pp17

App

19A

pp20

Figure 12 Detection accuracies for the 20 selected apps

Security and Communication Networks 13

difficult to distinguish from their original apps is isbecause repackaged_safe apps only generate smallamounts of additional network flows and their networktraffic features are not clearly distinguishable from thoseof normal traffic Fortunately the average F-measure ofrepackaged malware apps is outstanding is may be dueto the fact that malware typically send stolen data to orreceive commands from remote servers Hence theirnetwork traffic features are special and distinguishableFigure 14 presents an example of app clustering and il-lustrates the distinguishability of the repackaged malwareand the original app ese results demonstrate thatrepackaged Android malware can be effectively detectedby comparing their network traffic with that of originalapps

For repackaged malware apps the accuracy results withdifferent weight values (q in equation (10)) are presented inFigure 15 As depicted in the figure when q 05 we realizethe highest accuracy rate 969 If q 0 namely only thefeatures of encrypted traffic are used the accuracy result is753 e result is 617 when q 1 namely only thefeatures of plaintext traffic are considered ese resultsdemonstrate that both the encrypted and plaintext trafficcontribute to the differentiation of the original apps and therepackaged malware However the encrypted traffic con-tributes more than the plaintext traffic is may becausemalicious apps usually connect to their servers by securityprotocols to evade IDS detection

In the previous experiments we checked the serversrsquohostnames using VirusTotal to reduce the frequency of thefalse positives For various clusters that contained less than20 elements if all the serversrsquo hostnames were judged asclean the cluster was identified as an original app rather thana repackaged one As argued in Section 55 this process onlyreduces the frequency of false alarms To support this ar-gument we list the numbers of true positives and falsepositives of the 143 repackaged malware in Table 7 forcomparison When conducting the hostname checking thenumber of true positives is 1400 and the result is the samewhen the hostname checking is turned off is findingsupports that the malware is detected by clustering thesimilarity values However the number of false positivesincreased from 47 to 213 when the hostname checking wasturned off erefore we suggest checking the serversrsquohostnames after clustering to increase the practicality of themethod Note that false positives still occur when we checkthe hostnames by VirusTotal Hence there may be malwarein the original apps In our dataset the typical examples arecommobovideoplayerpro and cncfshop ele1taoxiaosanWe further analyze these two apps by reading theirdecompiled codes and we determine that they indeed es-tablish connections with suspicious servers In detail invarious runs app commobovideoplayerpro connects withhttpwwwtopappsquarecomwhich is judged as maliciousby CyRadar in VirusTotal App cncfshop ele1taoxiaosanvisits suspicious URL http955ccjSn9 Since these apps arethe original apps they are identified as false positives in theexperiments

In our experiments we chose density peak clustering asthe clustering method because density peak clustering canautomatically determine the correct number of clusters [58]Hence our method can detect various versions of malwarethat are repackaged from the same original app Unfortu-nately in the apps that were downloaded from AndroZoono malware was repackaged from the same original app Toevaluate the efficiency of detecting versions of malware wehave created two malware instances from a popular game

Apps

App

1A

pp2

App

3A

pp4

App

5A

pp6

App

7A

pp8

App

9A

pp10

App

11A

pp12

App

13A

pp14

App

15A

pp16

App

18A

pp17

App

19A

pp20

0

01

02

03

04

05

06

07

08

09

1

F-m

easu

re

Figure 13 Detection F-measure for the 20 selected apps

Table 5 Detailed descriptions of the chosen 20 apps

Main function ofapp

of HTTPflows

of encryptedflows

App1 VideoPlayer 453 51App2 Game 657 1455App3 Browser 1531 1478App4 MusicPlayer 341 25App5 Game 612 289App6 Game 788 123App7 Game 1336 524App8 Fitness 357 892App9 Email 122 431App10 Game 963 790App11 MusicPlayer 1128 547App12 Game 1598 426App13 News 2460 1178App14 News 2891 1543App15 Chat 171 49App16 Email 143 368App17 WeatherForecast 397 120App18 Game 1832 921App19 Fitness 732 362App20 OnlineLearning 2378 834

Table 6 Average accuracy and F-measure

Average accuracy () AverageF-measure

Repackaged_safe 903 078Repackaged_malware 969 094

14 Security and Communication Networks

app namely aircomaceviralmotox3 One is produced bygrafting AnserverBotrsquos source code into air-comaceviralmotox3 and the other is created by adding codefor stealing usersrsquo contacts and short messages and sendingthem to a remote server Similarly each malware is run 10times and the original app aircomaceviralmotox3 is alsorun 10 times for validatione clustering result is presentedin Figure 16 Apparently there are 3 clusters e

experimental results show that the value of F-measure foreach malware is 1 Hence our method can detect variousmalware that were repackaged from the same original app

63 Comparison In this section we compare our methodwith the typical methods that are based on network or cloudas illustrated in Figure 5 e comparative indicators are thedetection accuracy F-measure average power CPU andmemory consumptions of the mobile device First wecompare the proposed method with AppFA [14] which isimplemented completely at the network level Since AppFAalso must analyze app traffic we directly use the traffic sets ofthe repackaged malware apps for comparison While testingAppFA we deleted the label information that was containedin the traffic sets and mixed the network flows as a wholeen we used the signature matching and the constrainedK-means clustering algorithm proposed in [14] to restoreapp sessions namely the label for each flow e similarapps used in AppFA were chosen in Google Play via themethod introduced in [14] and for each repackaged mal-ware 10 similar apps were selected as the peer group epeer group apps were run once by Droidbot to generatenetwork traffic e comparison results are listed in Table 8

Since AppFA need not to install some special apps onmobile devices the extra power memory and CPU con-sumptions are 0 However the detection accuracy is muchlower (863 vs 969) than that of the proposedmethod ascompared in Table 8 In the proposed method we use theedge-client app to record flow metainformation thusnetwork flows can be accurately labeled However inAppFA flow labeling is realized via traffic clustering and theclustering accuracy cannot be 100 Meanwhile AppFAcompares the appsrsquo network behaviors with those of theirsimilar apps which also results in errors In the proposedmethod we directly compare the repackaged malware withits original app and the detection accuracy is increased eaverage power CPU and memory consumptions of theproposed method are also low and acceptable

In Table 8 the power memory and CPU consumptionsfor our edge-client app were obtained with the help of appGT (httpsgithubcomTencentGT) GT is a portable toolthat runs on smartphones for debugging the APP internalparameters and the code time-consuming statistics eaverage values for the power memory and CPU con-sumptions are calculated as follows Before running an appwith Droidbot we started the edge-client app by GT Whenthe Droidbot was stopped we recoded the power CPU andmemory consumptions We repeated the process until allapps (including 143 repackaged malware apps and theiroriginal versions) had been tested Via this way we obtainedthe average power CPU and memory consumptions for theedge-client app According to Table 8 the edge-client app islightweight and has minimal impact on the performances ofthe mobile devices

en we compare the proposed method with SAMA-Droid [44] SAMADroid monitors Android system calls andsends the calls to a centralized server for the detection ofAndroid malware SAMADroid is a typical cloud-based on-

Decision graph rho-delta-rho

20 251510 35 403050Sorted rho

00

02

04

06

08

Rho lowast

delt

a

Figure 14 Example of app clustering Obviously two clusters areidentified e dot in the top left corner is an outlier e outlier isdetected as an abnormal but harmless app by our method isfigure was generated by the density peak clustering tool

01 02 03 04 05 06 07 08 09 10Value of q

0

01

02

03

04

05

06

07

08

09

1

Acc

urac

y

Figure 15 Accuracies of various weight values for traffic contentsimilarity and traffic behavior similarity

Table 7 True positives and false positives with and withouthostname checking

of true positives of false positivesWith hostname checking 1400 47Without hostnamechecking 1400 213

Security and Communication Networks 15

device detection method We have implemented the dy-namic detection according to the description in [44] Indetail the APIs that are invoked by the running apps aremonitored by the tool Sensitive API Monitor (httpsgithubcomcyruliuSensitive-API-Monitor) As done in [44] thefrequency values of 10 system calls (open ioctl brk readwrite close sendto sendmsg recvfrom and recvmsg) arerecorded ese recorded frequency values are sent toWindows Azure for malware detection We evaluate thedynamic detection performance of SAMADroid on our 143repackaged_malware apps and their original versions In theevaluation the frequency values of the original apps are usedto create the normality model and the frequency values ofthe repackaged malware apps are used for testing We alsouse the app GT to record the power memory and CPUconsumptions of SAMADroid Note that our phones wereall rooted and Sensitive_API_Monitor and GT can be usede experimental results are presented in Table 9

Last we compare our method with Shabtairsquos method [8]Shabtairsquos method detects Android malware at the mobiledevices and it considers the apprsquos network traffic patternsonly the HTTP traffic contents are not considered isgives us an opportunity to compare our method withmethods that have been specially designed for encryptedtraffic We have implemented Shabtairsquos method and chosenthe feature subset 2 as the detection features Feature subset2 includes Avg Sent Bytes Avg Rcvd Bytes and Pct ofAvg Rcvd Bytes among other values and it is proved to be

the best feature subset for malware detection [8] e De-cisionRegression tree is chosen as the classification algo-rithm e testing dataset is the traffic that was generated bythe 143 repackaged malware instances and the training set isthe traffic that was generated by the corresponding originalapps e classification results are listed in Table 10According to the table the detection accuracy and F-mea-sure values of our method are significantly higher than thoseof Shabtairsquos method is indicates that the HTTP trafficprovides useful information for Android malware detectionSince Shabtairsquos method is conducted at the mobile devices itconsumes more resources than our method as compared inTable 10

7 Discussion

e results of our extensive experiments demonstrate thatrepackaged Android malware can be efficiently and effec-tively detected by clustering network traffic at the edgedevices However there are still limitations for our methodFirst we assume the most of the devices have installed theoriginal apps (rlt (12)u as described in Section 51) is isreasonable in most cases since mobile users typicallydownload apps from the official markets However in areaswhere official markets such as the Google play and Amazonare banned it is possible that the most mobile devices haveinstalled the repackaged apps is may cause false positivesSecond if the malware does not generate sufficient trafficour method cannot detect it For example some malwareonly sends fraudulent premium SMS messages is type ofmalware cannot be detected by monitoring its networktraffic ird we rely on third-party tools such as Virustotalfor the identification of malicious URLs and to reduce thenumber of false alarms If a malicious app rents legalplatforms such as the public cloud as the remote controlserver our method will judge it as a normal app Hence falsenegatives will be generated

To overcome these limitations we could combine theproposed method with available methods such as SAMA-Droid In such cases the edge-client app will not only recordflow metainformation but also monitor app system be-haviors However the system behavior monitoring does notneed to be conducted all the time If abnormal but harmlessor lower traffic apps are detected the edge-client app will beinformed to record the corresponding appsrsquo system be-haviors Similarly the edge-client app can send the appsystem behaviors to the edge server for further processinge performance of this combined method will be evaluatedin our future work One can also combine the proposedframework with AppFA [14] for the detection of other typesof Android malware For example the edge extracts apptraffic behavior features and the cloud can compare appsrsquobehaviors with their peer groups to identify abnormalitiesFurthermore one could first use clustering techniques [60]to group repackaged apps according to the same maliciousmodules and then analyze the behaviors of the representativerepackaged apps in each group to obtain more accuratebehavior features

y

2D nonclassical multidimensional scaling

00ndash02 02ndash04x

ndash04

ndash03

ndash02

ndash01

00

01

02

03

04

Figure 16 Clustering results of various malware that wererepackaged from aircomaceviralmotox3e outlier in the middleis detected as an abnormal but harmless app is figure wasgenerated by the density peak clustering tool

Table 8 Comparison of the proposed method with AppFA

Proposed method AppFAAccuracy 969 863F-measure 094 075Average power consumption 12 0Average CPU consumption 06 0Average memory consumption 654MB 0

16 Security and Communication Networks

e paradigms of edge computing are used to detectAndroid malware in this study However several challengeswould be encountered when deploying it in practice Firstthe mobile devices must install a specified app for thecollection of essential information which may cause con-cerns and mobile users may refuse to cooperate Second inedge computing the data should be preprocessed by the edgeto better protect the usersrsquo privacy However the edge servercan observe all the data (including the sensitive informa-tion) and it threatens the privacy of users For example ifthe edge is deployed by the cloud provider the usersrsquo privacyinformation is also available to the cloud ere is no dif-ference with the user-cloud model for privacy protectionunless the edge server is deployed by the userird the edgeserver should be well protected from network attacksOtherwise it will be convenient for hackers to steal usersrsquoprivate information

8 Conclusion

In this paper we propose a novel method for detectingAndroid malware based on edge computing To the best ofour knowledge the detection of Android malware by uti-lizing edge computing and traffic clustering has not beenconsidered previously By introducing the edge layer themain tasks of the mobile device are offloaded to the edgeserver and the mobile device only needs to record flowmetainformation which substantially conserves the re-sources of the mobile device e edge server extracts appsrsquonetwork traffic features and sends these features to the cloudplatform for further analysis us the usersrsquo privacy can bebetter protected With the received features the cloud cal-culates the similarities between apps and clusters thesesimilarity values to separate the original apps and themalware automatically We evaluated our methods on 400Android apps and the experimental results demonstratethat the detection accuracy can reach 100 for several appsand the average accuracy is 969 We also tested theperformance of the proposed method and compared it with

the typical approaches e comparison results show thatour method can detect repackaged Android malware withboth higher detection accuracy and lower power and re-source consumptions In the future we will continue to workto overcome the discussed limitations and to evaluate theproposed method in real world scenarios

Data Availability

e downloaded apps used to support the findings of thisstudy have been deposited in httpspanbaiducoms1b8vMceIdtdd1gBRC5Gbzvw Its extraction code is vd5re clustering code is also in the repository Since appnetwork traffic contains usersrsquo sensitive information itcannot be publicly shared However we have explained allthe steps to generate network traffic automatically esesteps are also described in the instructionspdf in therepository

Conflicts of Interest

e authors declare that they have no conflicts of interest

Acknowledgments

is work was supported by National Key TechnologiesResearch and Development Program of China under Grant2018YFD0401404 National Natural Science Foundation ofChina under Grants 61702282 61802192 and 71801123Natural Science Foundation of the Jiangsu Higher EducationInstitutions of China under Grants 18KJB520024 and17KJB520023 Nanjing Forestry University under GrantsGXL016 and CX2016026 and NUPTSF under GrantNY217143

References

[1] Statista Mobile Operating Systemsrsquo Market Share Worldwidefrom January 2012 to December 2019 Statista HamburgGermany 2009 httpswwwstatistacomstatistics272698global-market-share-held-by-mobile-operating-systems-since-2009

[2] W Martin F Sarro Y Jia Y Zhang and M Harman ldquoAsurvey of app store analysis for software engineeringrdquo IEEETransactions on Software Engineering vol 43 no 9pp 817ndash847 2017

[3] httpswwwappbraincomstatsnumber-of-android-apps2020

[4] L Li D Li T F Bissyande et al ldquoUnderstanding android apppiggybacking a systematic study of malicious code graftingrdquoIEEE Transactions on Information Forensics and Securityvol 12 no 6 pp 1269ndash1284 2017

[5] M Fan J Liu X Luo et al ldquoAndroid malware familialclassification and representative sample selection via frequentsubgraph analysisrdquo IEEE Transactions on Information Fo-rensics and Security vol 13 no 8 pp 1890ndash1905 2018

[6] J Zhang Z Qin K Zhang H Yin and J Zou ldquoDalvik opcodegraph based android malware variants detection using globaltopology featuresrdquo IEEE Access vol 6 pp 51964ndash51974 2018

[7] S Garg S K Peddoju and A K Sarje ldquoNetwork-baseddetection of android malicious appsrdquo International Journal ofInformation Security vol 16 no 4 pp 385ndash400 2016

Table 9 Comparison of the proposed method with SAMADroid

Proposed method SAMADroidAccuracy 969 831F-measure 0940 0793Average power consumption 12 23Average CPU consumption 06 08Average memory consumption 654MB 773MB

Table 10 Comparison of the proposed method with Shabtairsquosmethod

Proposed method Shabtairsquosmethod

Accuracy 969 771F-measure 0940 0695Average power consumption 12 77Average CPU consumption 06 23Average memory consumption 654MB 937MB

Security and Communication Networks 17

[8] A Shabtai L Tenenboim-Chekina D Mimran L RokachB Shapira and Y Elovici ldquoMobile malware detection throughanalysis of deviations in application network behaviorrdquoComputers amp Security vol 43 pp 1ndash18 2014

[9] L Xie X Zhang J-P Seifert and S Zhu ldquopBMDS a be-havior-based malware detection system for cellphone de-vicesrdquo in Proceedings of the ird ACM Conference onWireless Network Security ACM New York NY USApp 37ndash48 2010

[10] I Burguera U Zurutuza and S Nadjm-Tehrani ldquoCrowdroidbehavior-based malware detection system for androidrdquo inProceedings of the 1st ACMWorkshop on Security and Privacyin Smartphones and Mobile Devices ACM Chicago IL USApp 15ndash26 October 2011

[11] G Dini F Martinelli A Saracino and D SgandurraldquoMADAM a multi-level anomaly detector for android mal-warerdquo in Proceedings of the International Conference onMathematical Methods Models and Architectures for Com-puter Network Security vol 12 Springer St PetersburgRussia pp 240ndash253 2012

[12] Y Zhang M Yang B Xu et al ldquoVetting undesirable be-haviors in android apps with permission use analysisrdquo inProceedings of the 2013 ACM SIGSAC Conference on Com-puter amp Communications Security ACM Berlin Germanypp 611ndash622 November 2013

[13] SWang Z Chen X Li LWang K Ji and C Zhao ldquoAndroidmalware clustering analysis on network-level behaviorrdquo inProceedings of the International Conference on IntelligentComputing Springer Liverpool UK pp 796ndash807 August2017

[14] G He B Xu and H Zhu ldquoAppFA a novel approach to detectmalicious android applications on the networkrdquo in Securityand Communication Networks Wiley Hoboken NJ USA2018

[15] X Wang Y Yang and Y Zeng ldquoAccurate mobile malwaredetection and classification in the cloudrdquo SpringerPlus vol 4no 1 p 583 2015

[16] K Tam A Feizollah N B Anuar R Salleh and L Cavallaroldquoe evolution of android malware and android analysistechniquesrdquo ACM Computing Surveys (CSUR) vol 49 no 4p 76 2017

[17] M Sun X Li J C S Lui R T B Ma and Z Liang ldquoMonet auser-oriented behavior-based malware variants detectionsystem for androidrdquo IEEE Transactions on Information Fo-rensics and Security vol 12 no 5 pp 1103ndash1112 2017

[18] T Vidas and N Christin ldquoEvading android runtime analysisvia sandbox detectionrdquo in Proceedings of the 9th ACMSymposium on Information Computer and CommunicationsSecurity pp 447ndash458 Kyoto Japan June 2014

[19] Y Duan M Zhang A V Bhaskar et al ldquoings you may notknow about android (Un) packers a systematic study basedon whole-system emulationrdquo in Proceedings of the 2018Network and Distributed System Security Symposium pp 1ndash15 San Diego CA USA February 2018

[20] L Xue X Luo L Yu S Wang and D Wu ldquoAdaptiveunpacking of android appsrdquo in Proceedings of the 2017 IEEEACM 39th International Conference on Software Engineering(ICSE) IEEE Buenos Aires Argentina pp 358ndash369 May2017

[21] Y Zhou and X Jiang ldquoDissecting android malware char-acterization and evolutionrdquo in Proceedings of the 2012 IEEESymposium on Security and Privacy IEEE San Francisco CAUSA pp 95ndash109 2012

[22] N W Lo S K Lu and Y H Chuang ldquoA framework for thirdparty android marketplaces to identify repackaged appsrdquo inProceedings of the 2016 IEEE 14th International Conference onDependable Autonomic and Secure Computing 14th Inter-national Conference on Pervasive Intelligence and Computing2nd International Conference on Big Data Intelligence andComputing and Cyber Science and Technology Congresspp 475ndash482 Auckland New Zealand August 2016

[23] L Li J Gao M Hurier et al ldquoAndrozoo++ collectingmillions of android apps and their metadata for the researchcommunityrdquo 2017 httpsarxivorgabs170905281

[24] W Shi J Cao Q Zhang Y Li and L Xu ldquoEdge computingvision and challengesrdquo IEEE Internet of ings Journal vol 3no 5 pp 637ndash646 2016

[25] W Zhou Y Zhou X Jiang and P Ning ldquoDetectingrepackaged smart-phone applications in third-party androidmarketplacesrdquo in Proceedings of the Second ACM Conferenceon Data and Application Security and Privacy ACM AntonioTX USA pp 317ndash326 February 2012

[26] Y Shao X Luo C Qian P Zhu and L Zhang ldquoTowards ascalable resource-driven approach for detecting repackagedandroid applicationsrdquo in Proceedings of the 30th AnnualComputer Security Applications Conference ACM NewOrleans LA USA pp 56ndash65 December 2014

[27] K Chen P Wang Y Lee et al ldquoFinding unknown malice in10 seconds mass vetting for new threats at the google-playscalerdquo in Proceedings of the USENIX Security Symposiumvol 15 Washington DC USA August 2015

[28] Z Wang C Li Y Guan and Y Xue ldquoDroidchain a novelmalware detection method for android based on behaviorchainrdquo in Proceedings of the 2015 IEEE Conference onCommunications and Network Security (CNS) IEEE Flor-ence Italy pp 727-728 September 2015

[29] K Tian D D Yao B G Ryder G Tan and G PengldquoDetection of repackaged android malware with code-het-erogeneity featuresrdquo IEEE Transactions on Dependable andSecure Computing vol 17 no 1 pp 64ndash77 2017

[30] A De Lorenzo F Martinelli E Medvet F Mercaldo andA Santone ldquoVisualizing the outcome of dynamic analysis ofandroid malware with vizmalrdquo Journal of Information Se-curity and Applications vol 50 Article ID 102423 2020

[31] M Lin D Zhang X Su and T Yu ldquoEffective and scalablerepackaged application detection based on user interfacerdquo inProceedings of the 2017 IEEE Conference on ComputerCommunications Workshops (INFOCOM WKSHPS) IEEEAtlanta GA USA May 2017

[32] G Meng M Patrick Y Xue Y Liu and J Zhang ldquoSecuringandroid app markets via modelling and predicting malwarespread between marketsrdquo IEEE Transactions on InformationForensics and Security vol 14 no 7 pp 1944ndash1959 2018

[33] A Arora and S K Peddoju ldquoMinimizing network trafficfeatures for android mobile malware detectionrdquo in Proceed-ings of the 18th International Conference on DistributedComputing and Networking ACM Hyderabad India January2017

[34] X Wu D Zhang X Su and W Li ldquoDetect repackagedAndroid application based on HTTP traffic similarity httptraffic similarityrdquo Security and Communication Networksvol 8 no 13 pp 2257ndash2266 2015

[35] J Malik and R Kaushal ldquoCredroid android malware de-tection by network traffic analysisrdquo in Proceedings of the 1stACM Workshop on Privacy-Aware Mobile Computing ACMPaderborn Germany pp 28ndash36 July 2016

18 Security and Communication Networks

[36] A Zulkifli I R A Hamid W M Shah and Z AbdullahldquoAndroid malware detection based on network traffic usingdecision tree algorithmrdquo in Proceedings of the InternationalConference on Soft Computing and Data Mining SpringerSenai Malaysia pp 485ndash494 January 2018

[37] Z Chen Q Yan H Han et al ldquoMachine learning basedmobile malware detection using highly imbalanced networktrafficrdquo Information Sciences vol 433-434 pp 346ndash364 2018

[38] G He B Xu L Zhang and H Zhu ldquoMobile app identifi-cation for encrypted network flows by traffic correlationrdquoInternational Journal of Distributed Sensor Networks vol 14no 12 pp 1ndash17 2018

[39] L Xue Y Zhou T Chen X Luo and G Gu ldquoMalton to-wards on-device non-invasive mobile malware analysis forARTrdquo in Proceedings of the 26th USENIX Security Symposiumpp 289ndash306 Vancouver Canada August 2017

[40] M K Alzaylaee S Y Yerima and S Sezer ldquoDL-Droid deeplearning based android malware detection using real devicesrdquoComputers amp Security vol 89 Article ID 101663 2020

[41] A S Shamili C Bauckhage and T Alpcan ldquoMalware de-tection on mobile devices using distributed machine learn-ingrdquo in Proceedings of the 20th International Conference onPattern Recognition (ICPR) IEEE Istanbul Turkeypp 4348ndash4351 August 2010

[42] M Zhao T Zhang F Ge and Z Yuan ldquoRobotdroid alightweight malware detection framework on smartphonesrdquoJournal of Networks vol 7 no 4 p 715 2012

[43] K A Talha D I Alper and C Aydin ldquoAPK auditor per-mission-based android malware detection systemrdquo DigitalInvestigation vol 13 pp 1ndash14 2015

[44] S Arshad M A Shah A Wahid A Mehmood H Song andH Yu ldquoSamadroid a novel 3-level hybrid malware detectionmodel for android operating systemrdquo IEEE Access vol 6pp 4321ndash4339 2018

[45] G He L Zhang B Xu and H Zhu ldquoDetecting repackagedandroid malware based on mobile edge computingrdquo inProceedings of the 2018 Sixth International Conference onAdvanced Cloud and Big Data (CBD) IEEE Lanzhou Chinapp 360ndash365 August 2018

[46] R Roman J Lopez andMMambo ldquoMobile edge computingFog et al a survey and analysis of security threats andchallengesrdquo Future Generation Computer Systems vol 78pp 680ndash698 2018

[47] F Bonomi R Milito J Zhu and S Addepalli ldquoFog com-puting and its role in the internet of thingsrdquo in Proceedings ofthe First Edition of the MCC Workshop on Mobile CloudComputing ACM Helsinki Finland pp 13ndash16 August 2012

[48] M T Beck MWerner S Feld and S Schimper ldquoMobile edgecomputing a taxonomyrdquo in Proceedings of the Sixth Inter-national Conference on Advances in Future Internet pp 48ndash55 Lisbon Portugal November 2014

[49] P Mach and Z Becvar ldquoMobile edge computing a survey onarchitecture and computation offloadingrdquo IEEE Communi-cations Surveys amp Tutorials vol 19 no 3 pp 1628ndash1656 2017

[50] H T Dinh C Lee D Niyato and P Wang ldquoA survey ofmobile cloud computing architecture applications and ap-proachesrdquo Wireless Communications and Mobile Computingvol 13 no 18 pp 1587ndash1611 2013

[51] X Lyu H Tian L Jiang et al ldquoSelective offloading in mobileedge computing for the green internet of thingsrdquo IEEENetwork vol 32 no 1 pp 54ndash60 2018

[52] S Agrawal and J Agrawal ldquoSurvey on anomaly detectionusing data mining techniquesrdquo Procedia Computer Sciencevol 60 pp 708ndash713 2015

[53] A Tongaonkar ldquoA look at the mobile app identificationlandscaperdquo IEEE Internet Computing vol 20 no 4 pp 9ndash152016

[54] G He B Xu and H Zhu ldquoIdentifying mobile applications forencrypted network trafficrdquo in Proceedings of the 2017 FifthInternational Conference on Advanced Cloud and Big Data(CBD) IEEE Shanghai China pp 279ndash284 August 2017

[55] G Ranjan A Tongaonkar and R Torres ldquoApproximatematching of persistent lexicon using search-engines forclassifying mobile app trafficrdquo in Proceedings of the 35thAnnual IEEE International Conference on Computer Com-munications IIEEE San Francisco CA USA April 2016

[56] L Li and S Qu ldquoShort text classification based on improvedITCrdquo Journal of Computer and Communications vol 1 no 4pp 22ndash27 2013

[57] A H Lashkari A F A Kadir L Taheri and A A GhorbanildquoToward developing a systematic approach to generatebenchmark android malware datasets and classificationrdquo inProceedings of the 2018 International Carnahan Conference onSecurity Technology (ICCST) IEEE Montreal Canada Oc-tober 2018

[58] A Rodriguez and A Laio ldquoClustering by fast search and findof density peaksrdquo Science vol 344 no 6191 pp 1492ndash14962014

[59] Y Li Z Yang Y Guo and X Chen ldquoDroidbot a lightweightUI-guided test input generator for androidrdquo in Proceedings ofthe Software Engineering Companion (ICSE-C) pp 23ndash26Buenos Aires Argentina May 2017

[60] M Fan X Luo J Liu et al ldquoGraph embedding based familialanalysis of android malware using unsupervised learningrdquo inProceedings of the 2019 IEEEACM 41st International Con-ference on Software Engineering (ICSE) IEEE Piscataway NJUSA pp 771ndash782 May 2019

Security and Communication Networks 19

Page 4: On-DeviceDetectionofRepackagedAndroidMalwarevia …downloads.hindawi.com/journals/scn/2020/8630748.pdf · 2020-05-31 · According to [4–6], most Android malware programs are repackagedapps,andthemajorityofnewAndroidmalware

simultaneously as represented by the solid and dotted linesin Figure 2is is because not all data must be preprocessedby the edge servers and more importantly the robustness ofthe entire system can be improved If the edge breaks downthe cloud can continue to provide services In this study weobey the edge computing paradigm that is illustrated inFigure 2 in the design of the Android malware detectionsystem

3 Related Work

Repackaged Android apps are one of the major sources ofmobile malware and extensive studies have been conductedon Android malware detection In this section we will re-view the most closely related studies e typical studies ofedge computing are also reviewed to provide more detailedbackground information

31 Off-Device Detection Repackaged Android apps andmalware can be efficiently detected via code comparison ofAPKs DroidMoss [25] focuses on app code and conductspairwise similarity comparisons for the detection ofrepackaged apps in Android markets e detection resultsdemonstrate that 5 to 13 of apps that are hosted on third-party marketplaces were repackaged However the pairwisecomparison approach cannot be scaled up because millionsof Android apps are now available in markets To resolve thisissue Shao et al [26] proposed ResDroid for the detection ofrepackaged apps in Android markets via a divide-and-conquer strategy First they built a k-d tree and insertedclusters that contained one or more apps into the k-d treeen the apps were compared with their neighbors in the k-d tree and the number of comparisons was reduced As-suming that the apps from the official market were benignMassvet [27] downloaded all apps from Android markets inadvance and constructed v-core (a geometric center of a viewgraph) andm-core (mapping the features of a Java method toan index) databases for repackaged Android malware de-tection Both databases were used to vet new apps that weresubmitted to the market New apps would be reported asrepackaged malware if their v-cores or m-cores were similarto records in the databases and the code was not legitimatelyreused

In addition to code comparison behavior-based andUI-based detection methods are also emerging Droid-Chain [28] constructs a behavior chain model that iscomposed of typical behavior processes of Android apps

for the detection of malware In DroidChain the typicalbehaviors include privacy leakage SMS financial charg-ing malware installation and privilege escalationAccording to Tian et al [29] repackaged malware wasdifficult to detect partly because of their behavioralsimilarities to benign apps To overcome this problemthey proposed a new repackaged Android malware de-tection technique that was based on code heterogeneityanalysis e code structure of an app was partitioned intomultiple dependence-based regions and each region wasindependently classified according to its behavioral fea-tures De Lorenzo et al [30] proposed VizMal for visu-alizing the execution traces of Android applications andhighlighting potentially malicious behaviors Lin et al[31] proposed the detection of repackaged Android appsbased on static UI features ey detected a total of 3723repackaged app pairs in the Anzhi market and 15856repackaged pairs in the Mi market ese results dem-onstrate that repackaged malware still poses a severethreat to the security of mobile users even though varioustypes of detection methods have been proposed andadopted by Android markets [32]

Recently several network-based approaches have beenproposed for the detection of repackaged apps and malwareArora and Peddoju [33] demonstrated the effectiveness ofusing network traffic to detect Android malware Wu et al[34] divided app HTTP traffic into 2 categories primarymodule traffic and nonprimary module traffic For primarymodule traffic the HTTP flow distance algorithm and theHungarian Method were used to calculate the traffic simi-larity and identify similar app pairs CREDROID [35]identified malicious apps on the basis of Domain NameServer (DNS) queries and the data that are transmitted toremote servers Zulkifli et al [36] proposed a method fordetecting Android malware that is based on seven networktraffic features and the J48 decision tree algorithm Chenet al [37] introduced the imbalanced data gravitation-basedclassification algorithm for the classification of imbalanceddata of malicious apps He et al [14 38] proposed theidentification of encrypted appsrsquo flows via traffic correlationand the detection of repackaged Android apps via com-parison of the network behaviors of similar apps In thesemethods the detection is conducted mostly in the router orat the network monitoring node hence the performances ofmobile devices are not affected However the network be-haviors of malware must be modeled in advance which ischallenging in practice

Edge servers

Datacomputations

Data

Data

Data

DataservicesIoT devices

Cloud

Figure 2 Paradigm of edge computing

4 Security and Communication Networks

32On-DeviceDetection e detection of Android malwarecan also be conducted at usersrsquo mobile devices [39]Alzaylaee et al [40] extracted features from real devices andused a deep learning system to detect malicious Androidapps Shamili et al [41] proposed a distributed SupportVector Machine (SVM) algorithm for the detection ofAndroid malware on a network of mobile devices eirdetection features include short duration calls mediumduration calls and the number of outgoing SMS (ShortMessage Service) among others e experimental resultsdemonstrated that the average computation time per clientduring the training phase is nearly 10 seconds Zhao et al[42] implemented a malware detection framework namelyRobotDroid using an SVM active learning algorithmRobotDroid monitors all the running apps to identify theirrunning characteristics ese characteristics are later inputinto learning modules to generate the behavioral charac-teristics for the classification of normal apps and malwareAccording to their performance evaluation the time ofaccessing the GPS is increased from 80ms to almost 120mswhen the detection system is executed Hence the perfor-mance of a smartphone that utilizes this framework wouldbe significantly degraded Shabtai et al [8] monitored theappsrsquo network behaviors at the mobile devices to detectrepackaged Android malware and the detection accuracyexceeds 80 however the performance overhead of theirmethod is not negligible For example it requires7272 kB plusmn 15 memory consumption and 13 plusmn 15 CPUconsumption for learning 10 Android malware detectionmodels

To further improve the performance of on-device de-tection Crowdroid [10] sends the detection features to acentralized server for processing Crowdroid only recordsthe system calls as the detection features and it is light-weight Talha et al [43] proposed APK Auditor which is asystem that uses permissions as static analysis features forAndroid malware detection APK Auditor consists of threecomponents (1) a signature database (2) an Android clientand (3) a central server e Android client sends the app tothe central server for analysis e central server extracts thepermissions that are requested by the app and computes thepermission malware score based on the permissions inmalware If the score exceeds a threshold value the app isclassified as malware e results demonstrate that APKAuditor achieves 925 specificity however it lacks thebenefits of dynamic analysis as it cannot detect dynamicmalicious payloads

Monet [17] also uses backend servers and a client app todetect Androidmalwaree client app constructs a directedgraph of the app components and the system componentsand it records the potentially dangerous system calls as thedetection features ese features are forwarded to thebackend server which conducts further detection by ap-plying a signature matching algorithm and returns thedetection results e experimental results demonstrate thatMonet can detect Android malware with 99 accuracyHowever the performance overhead of Monet is high Asshown in [17] Monet had 8 overhead on memory and IObenchmarks and 55 overhead for battery Most recently

Arshad et al [44] proposed SAMADroid which is a 3-levelhybrid model for the detection of Android malware Sim-ilarly SAMADroid has a client app and an analysis servere client app hooks the Strace tool and traces the systemcalls that are invoked by the user app If the user app remainsrunning on the device the client app continues tracing thesystem calls of the user app and generates a log file of thesystem calls Later the log file is sent to the SAMADroidserver with the identifier of the user app At the server thesame user app is downloaded from stores and staticallyanalyzed en the static features and the dynamic featuresthat are extracted from the log file are embedded in a vectorspace Finally the features are provided as input to themachine learning tool for classification eir experimentalresults demonstrate that random forest yields 9907 ac-curacy on the static features and 8276 on the dynamicfeatures e performance overhead of SAMADroid is 18for memory and 06 for CPU In previous work [45] weused mobile edge computing to detect malicious apps Weextend it to the edge computing paradigm and refine theclustering method in this study

33 Edge Computing Research e vision and challenges ofedge computing are introduced in the literature [24] ebasic strategy of edge computing is to deploy cloud com-puting at the edge of the network to improve the perfor-mance and create new applications for IoT [46] Howeverthe deployment location of the edge and the functionalitiesof the edge are controversialerefore in applications edgecomputing can be implemented as fog computing [47]mobile edge computing [48 49] and mobile cloud com-puting [50] In the fog computing paradigm the analysis oflocal information is conducted on the ldquogroundrdquo and thecoordination and global analytics are conducted in theldquocloudrdquo One can argue that all objects that are outside thecloud constitute the ground thus fog computing is almostthe same as edge computing

Mobile edge computing deploys cloud services at theedge of mobile networks such as 5G Namely the edge isdeployed in close proximity to the eNodeB which loops thetraffic through the mobile edge computing servers for fur-ther data processing us in mobile edge computing theedge is fixed and the edge servers can be provided bytelecommunication companies Mobile cloud computingfocuses mainly on mobile delegation mobile devices shoulddelegate the storage of bulk data and the execution ofcomputationally intensive tasks to remote entities A remoteentity can be a centralized cloud or another mobile deviceerefore the edge can consist of mobile devices in mobilecloud computing Hence one of the most active areas ofresearch in the field of mobile cloud computing is thedelegation of tasks to external services especially to othermobile devices Task delegation is also essential for edgecomputing [51]

In this study we provide a lightweight and efficientframework for the detection of repackaged Androidmalwarethat is based on edge computing In the proposed frame-work the mobile devices need only to record the flow

Security and Communication Networks 5

starting time flow protocol server-side IP address server-side port app name and version for each network flowLater these data are sent to the edge servers for traffictagging Since the amounts of data that are saved and sent aresmall the performances of mobile devices are not signifi-cantly degraded e edge servers extract detection featuresfrom the labeled app network traffic and send the extractedfeatures to the cloud to be clustered After clustering theoriginal and repackaged apps can be efficiently distin-guished In our study only network traffic is considered inAndroid malware detection In contrast to previous net-work-based methods such as [14 34ndash36] we do not need tomodel appsrsquo network behaviors in advance and our methodis more useful in practice

4 Framework for Android Malware Detection

e on-device detection of Android malware is challengingbecause mobile devices typically have limited resources Toalleviate the storage and computing limitations and prolongthe lifetimes of the mobile devices in this study we proposea new framework for on-device detection of repackagedAndroid malware e proposed framework is illustrated inFigure 3

Our framework is composed of three layers mobiledevices the edge and the cloud First we explain thepossible locations of the edge later we introduce thefunctionalities of each component In our framework theonly requirement for the edge is that it can monitor andprocess the mobile network traffice mobile devices couldconnect to the Internet via WiFi or 3G4G hence the edgenodes can be wireless gateways household security gatewayssuch as BitDefender Box and LTE eNodeBs e edge nodescan also be routers that are located in the backbone networkif they can observe the traffic that is generated by the mobiledevices erefore the deployment locations of the edgenodes are flexible in practice Figure 4 illustrates the possibledeployment scenarios of the edge

In the proposed framework the mobile devices mustcollect and send flow metaformation to the edge for datapreprocessing For this an edge-client app is designed to berun on themobile devicese edge-client app operates as anAndroid VPN app Its main objective is to collect andtransfer the flow metainformation which includes the appname flow starting time and flow protocol to the edgeiteratively With the flow metainformation the edge serverpreprocesses the network traffic by adding a label (the appname) to each flow After that it extracts detection featuresfrom these labeled flows e traffic labeling and featureextraction are illustrated in detail in Section 52 Once thefeature extraction has been completed the edge server sendsthese features to the cloud With the received features thecloud conducts the malware detection tasks and notifies themobile devices of the detection results

To further illustrate the advantages of the proposedframework we compare it with the available frameworksSince the proposed framework is based on edge computingwe will refer to it as the edge-computing-based framework inthe following Similarly the available frameworks are termed

as the cloud-based framework (the server-side detectionmethods that were described in Section 1) and the network-based framework (the network-based methods that wereintroduced in Sections 1 and 3) for differentiation Com-pared to the popular cloud-based framework (Figure 5(a))the most prominent advantage of the edge-computing-basedframework is that the feature extraction process is offloadedto the edge and the mobile devices must only save andtransmit the flow metainformation us the proposededge-computing-based framework is resource-friendly formobile devices

e network-based framework (Figure 5(b)) monitorsnetwork traffic without any interference withmobile devicesHowever it is less accurate since the network nodes cannotknow the origins of the network flows precisely namely itcannot identify app traffic with 100 accuracy As dem-onstrated in the literature [14] since the network nodecannot identify the app for each flow (eg some flows lackssignatures or are encrypted) the appsrsquo network behaviors areincomplete and the malware detection accuracy is reducedIn the proposed edge-computing-based framework wedesign a lightweight edge-client app for gathering app in-formation Furthermore mobile usersrsquo privacy can be betterprotected in our proposed framework By deploying onersquosown edge server a mobile user can control the types ofinformation that can be used as detection features Incontrast in the network-based framework the networknodes can observe the full traffic content and extract anyinformation they want which severely threatens the privacyof mobile users

Android malwaredetectionCloud

Edge

Mobiledevices

(1) Flowmetainformation

(2) Extractedfeatures

(3) Detectionresults

Network connectionData transmission

Figure 3 Framework for android malware on-device detectionthat is based on edge computing Mobile devices send flow met-ainformation such as the app name and the server-side IP addressto the edge Edge servers capture app network traffic and extractHTTP contents and flow statistical features from the capturedtraffic Later the features are sent to the cloud for malware de-tection e cloud also interacts with the mobile devices to returnthe detection results

6 Security and Communication Networks

After introducing the general architecture we outline thedetection methods that can be used in the edge-computing-based framework e cloud can use various methods todetect Android malware For example it can use conven-tional signature matching [35] and network behavioranalysis [8 14] to identify malicious apps However thesetypes of methods require a priori information (such assignatures or behavior models) In this paper we propose anovel repackaged malware detection method that clustersthe network traffic that is generated by the same app fromvarious mobile devices Figure 6 presents an example of thisprocess

e motivations behind the traffic clustering are two-fold first most Android malware is produced by repack-aging popular apps [4] and popular apps are often widelydistributed Second the network traffic generated by therepackaged malware differs from that generated by theoriginal app (as illustrated in Section 21) As illustrated inFigure 3 many mobile devices would connect to the cloudfor malware detection Hence the cloud can observe thenetwork traffic features of both the original apps and theirrepackaged versions with large probabilities due to the widedistribution of popular apps and the large number of mobiledevices If we regard the features of the original apps as thenormal data and the features of the repackaged versions asthe abnormal data the repackaged malware detectionproblem is transformed into the abnormal detection

problem Clustering is an efficient method for solving thisproblem [52] By clustering appsrsquo network traffic features weneed not model appsrsquo network behaviors in advance andmore importantly we can detect various repackaged ver-sions of the original app as illustrated in Figure 6e detailsof the proposed malware detection method will be describedin detail in Sections 53ndash55

5 Methodology

In this section we introduce the technical details of theproposed clustering method for Android malware detectionTable 1 defines the main notation that we use in this paper

(a) (b) (c)

Figure 4 Possible deployments of the edge e edge could be the wireless gateways or the dedicated servers that are connected to theeNodeB or the routers (a) Wireless gateway (b) eNodeB+ server (c) Router + server

Clustering

Traffic features of an app collected from

different mobile devicesOriginal apps

Repackaged apps

Figure 6 Illustration of traffic clustering for repackaged malwaredetection e plain circles represent original apps and the circleswith triangles and inverted triangles are repackaged ones

Network connectionData transmission

Android malwaredetectionCloud

Mobiledevices

(1) Extractedfeatures

(2) Detectionresults

(a)

Network connectionData transmission

Network node

Mobiledevices

(1) Detectionresults

(b)

Figure 5 Available frameworks for android malware detection Various approaches such as [10] send the extracted features to a remoteserver other than to the cloud We still regard them as following the cloud-based framework because there is no essential difference formalware detection As illustrated in (b) the detection results are typically presented to the network administrator and are not fed back to themobile devices (a) Framework based on cloud (b) Framework based on network

Security and Communication Networks 7

51 System Overview Figure 7 illustrates the main proce-dure of the proposed approach App i is the app for vettingand it has been installed on umobile devicesese u devicesare connected to e edge servers We assume that r deviceshave installed the repackaged malware (the malware is alsopresented as App i to deceive the users) and rlt (12)uHence most of the devices have installed the original ver-sion is situation is reasonable because repackaged An-droidmalware is typically distributed in third-party marketswhich are less popular than official markets [23] us thenumbers of downloads and installations are relatively smallWe will use this characteristic to classify the clusteringresults

To detect the repackaged malware the edge server labelsthe network flows that are generated by App i according tothe flow metainformation en it extracts suitable trafficfeatures (such as traffic content and statistics) from thelabeled flows and sends the features to the cloud for pro-cessing erefore there are u feature sets for App i in thecloud e cloud filters these u sets according to the appversions and removes common traffic After that it calculatesthe pairwise similarities of the feature sets Finally it clustersthe similarity values and identifies r repackaged malwareinstances

e traffic labeling feature extraction and filtering(labels①② and③ in Figure 7) the similarity calculationon the traffic contents and behaviors (labels④ and⑤) andclustering of similarity values (label⑥) are the core elementsof the proposed method ey will be described in detail inthe following sections

52 Traffic Labeling Filtering and Feature Extraction

521 Traffic Labeling In this study traffic labeling isconducted to identify the corresponding app for each net-work flow which is known as the app identification problem

[53] Extensive works have been conducted on the identi-fication of apps from mobile network traffic [38 54 55]However the achieved identification accuracies are all lowerthan 100 In our method we design an edge-client app forcollecting flow metaformation and identifying apps accu-rately e flow metainformation is defined in the following

Definition 1 (flow metainformation) For flow f the flowmetainformation includes the client-side IP address C-IPfthe flow starting time Tf the flow protocol Pf the server-sideIP address S-IPf the server-side port S-Portf the app nameAppNf and the app version AppVfe flow f is bidirectionaland Pf is the transport layer protocol such as TCP or UDP

e edge-client app operates similarly to a VPN appeapp sends the flow metainformation to the edge server at afixed time interval Ti Once it has received the flow meta-information the edge server saves it into a metainformationdatabase and can label mobile app traffic accurately Denotethe flow observed by the edge server as fu e edge serverlabels fu as follows

Step 1 extract the recording time Tuf of fu and its flow

protocol Puf client-side IP address CminusIPu

f server-sideIP address SminusIPu

f and server-side port SminusPortufStep 2 search the metainformation database with Pu

fCminusIPu

fSminusIPuf and SminusPortuf If records match the da-

tabase returns the record that has the most recent flowstarting time Tr

f Otherwise the edge server waits for awhile and repeats Step 2Step 3 if Tu

f is close to Trf the edge server labels flow fu

as ltAppNf AppVfgtOtherwise it waits for a while andrepeats Steps 2 and 3

In the above steps the client-side IP address CminusIPuf is

used to identify a mobile device uniquely which is correctwhen the edge server is the wireless gateway or the eNodeBIn such scenarios the mobile devices directly connect to theedge hence the IP address is unique for each deviceHowever if the edge server is deployed at the backbonerouter as illustrated in subgraph (c) of Figure 4 the CminusIPu

f

may be confusing because mobile devices could be locatedbehind NATs (Network Address Translations) and they willshare the same public IP address To solve this issue theedge-client app can insert special indicators into the packetsto uniquely represent a device In this study we alwaysassume that CminusIPu

f is unique and we will investigate theimplementation of the device-indicator in future work

In steps 2 and 3 the edge server utilizes a wait-and-repeat strategy to ensure that the traffic labeling is fresh andaccurate e edge server must wait because the mobiledevices send flow metainformation to the edge server at afixed time interval Ti (60 s in our experiments) and the savedflow metainformation in the database may be outdated Inour implementations the waiting time is set to Ti and if|Tu

f minus Trf|gt 24 h the returned flow metainformation is

regarded as obsolete

522 Traffic Feature Extraction After traffic labeling theedge server extracts detection features from the labeled

Table 1 Main notation used in this paper

Notation Meaningi Android app iu Number of mobile devices with app i installedr Number of devices with repackaged app i

f Network flowC-IPf Client-side IP address of fS-IPf Server-side IP address of fS-Portf Server-side port of fAppNf Name of the app that generates fAppVf Version of the app that generates fTi Set time intervalTu

f Recording time of flow fu at the edge serverdi Feature set of app iwj Plaintext word in the feature setsV(di) Numerical vector of traffic contentsF

j

diFeature vector for the encrypted flow j in di

TBntimesm(di) Traffic behaviors of diCSdik

Content similarity between di and dkBSdik

Behavior similarity between di and dkSdik

Final similarity between di and dk

8 Security and Communication Networks

traffic e extracted traffic features are categorized into twogroups traffic content features and traffic behavior featuresTraffic content features are the plaintext contents that areextracted from HTTP flows and traffic behavior features arethe statistics of network flows such as packet sizes andintervals Since Android apps may use an encrypted HTTPSprotocol to transmit data [54] traffic behavior features alsomust be considered e traffic content and traffic behaviorfeatures are described in detail in Sections 53 and 54

523 Traffic Filtering After traffic feature extraction theedge servers send the extracted features along with theclient-side IP address server-side IP addresses server-sideports app name AppNf and app version AppVf to the cloudNote that App i is installed on u devices hence the cloudeventually has u feature sets for App i Figure 8 illustrates thestructure of the feature set e cloud further filters these ufeature sets to increase the detection accuracy First it di-vides the u sets into v categories according to the app versionAppVf erefore each category contains the feature setsthat are generated by the same version of App i en foreach category the cloud removes the common traffic fromthe feature sets e common traffic is defined as network

flows that are contained in all features sets ese flows havethe same server-side IP address and the same server-sideport number e reason for this step is that the commontraffic will obfuscate the similarity calculations betweenapps Otherwise the similarity values between features setsare all high and they cannot be effectively used to distin-guish the repackaged malware from the original apps

After the filtering has been completed the similaritiesbetween feature sets are calculated and the similarity valuesare clustered to identify the repackaged malware

Client-side IP addressApp name AppN_fApp version AppV_fFlow 1 server-side IP address 1 server-side port 1 flow features

Flow f server-side IP address f server-side port f flow features

Figure 8 Illustration of the feature set e set contains the clientIP address app name and app version Each network flow that isgenerated by App i in a device is represented by the featuresextracted from the flow and by the server IP address and port

Trafficlabeling

Edge server 1

Traffic featureextraction

Similarity calculation of

trafficbehaviors

Similarity calculation of traffic contents Clustering of

similarity values

Original app

Repackaged malware

Cloud

Mobile device 1 2 (with app i installed)

Results

Traffic features

Traffic and flowmetainformation

ę

ę

Edge server e

Traffic features

Traffic and flow metainformation

Filtering

Mobile device u(with app i installed)

1

3

4

5

6

2

Figure 7emain procedure of the proposedmethod for Androidmalware detectione core parts of the proposedmethod are labeled bycircled numbers

Security and Communication Networks 9

53 Similarity Calculation of Traffic Content Android appsuse mainly HTTP and HTTPS protocols to communicatewith their remote servers [54] For HTTP traffic since it isplaintext based we can efficiently use the flow content tocalculate the similarity e basic strategy is to convertHTTP traffic to document files and the document files canbe handled by information retrieval technologies for simi-larity calculation HTTP flow content has also been used inapp identification from network traffic [55] However in thisstudy we only consider partial HTTP header information tobetter protect mobile usersrsquo privacy

In detail first we reassemble HTTP flows from packetsand extract the flow contents by saving the correspondingpacket contents into a text file as ASCII code Figure 9 showssuch an example en we preprocess the text file by de-leting the HTTP request content and the HTTP requestheader except for the GET (or POST) and HOST fields eHTTP response content is also deleted namely only theHTTP response header the GET (or POST) and the HOSTfiled in the HTTP request header are saved Via this ap-proach the mobile usersrsquo privacy can be protected from thecloud (for example the mobile phone information (HUA-WEI) is leaked by the User-Agent field in Figure 9) Afterthat the text file is spilt using special characters (eg ) and the processed file becomes the flow contentfeatures

e similarity is calculated via the improved TF-IDFalgorithm [56] and the cosine similarity e term frequencymeasures the count of words in a document We use theaugmented frequency measurement Denote the document(the feature set) as di and the word as wj Term frequencytf(wj di) is calculated as

tf wj di1113872 1113873 05 +05 times f wj di1113872 1113873

max f w di( 1113857 w isin di1113864 1113865 (1)

where f(wj di) is the number of occurrences of wj indocument di

e inverse document frequency measures of theamount of information a word provides Typically it is thelogarithmically scaled inverse fraction of the number ofdocuments that contain the word e inverse documentfrequency of wj is calculated as

idf wj1113872 1113873 logu

d wj isin d1113966 111396711138681113868111386811138681113868

11138681113868111386811138681113868+ 001⎛⎝ ⎞⎠ (2)

where u is the number of documents and | d wj isin d1113966 1113967| isnumber of documents in which the term wj appears

Denote by p(wj di) the product of TF(wj di) andI DF(wj) it is formulated as

p wj di1113872 1113873 log tf wj di1113872 11138731113872 1113873 times idf wj1113872 1113873

1113936

cj0 log

2 tf wj di1113872 11138731113872 1113873 times idf2 wj1113872 11138731113969 (3)

After the TF-IDF operation all documents are convertedinto equal-length numerical vectors We use the cosinesimilarity to measure the similarity between two documentvectors For documents di and dk the corresponding

numerical vectors are denoted as V(di) and V(dk) and thelength of vector is t e similarity between di and dk iscalculated as

CSdik

1113936tl1 V di( 1113857l times V dk( 1113857l( 1113857

1113936tl1 V di( 1113857

2l

1113969

times

1113936tl1 V dk( 1113857

2l

1113969 (4)

Suppose that in category vi (we divide the u feature setsgenerated by App i in to v categories in the traffic filteringstep) there are c feature sets For each feature set we save theflow features of all HTTP flows into a text file and c text filesare obtained en we calculate the numerical vector foreach feature set via equation (3) Last we compare thenumerical vector with each other as equation (4) and gain csimilarity values (CSdii

1) for each feature set We rep-resent these similarity values of feature set di (i 1 2 c)as a vector Si

contents and Sicontents (CSdi1

CSdi2 CSdic

)e vectors Si

contents (i 1 2 c) in combination with thesimilarity values of traffic behaviors will be clustered toidentify the abnormal data points namely the repackagedmalware

54 SimilarityCalculation of TrafficBehaviors In addition tocalculating the similarity values of traffic contents thesimilarity of traffic behaviors must be calculated due toencrypted flows We use m (m 11 in our study) statisticalfeatures as listed in Table 2 to represent the behavior of anencrypted network flow e units of the flow duration andthe packet interval time are milliseconds in our experiments

With the flow statistical features for feature set di the jthencrypted flow can be represented as a vector in the featurespace

Fj

di f

1j f

2j f

mj1113872 1113873 (5)

where flj represents the value of the lth statistical feature

1le llem Figure 10 shows an example of feature represen-tation for encrypted traffic Consequently the traffic be-haviors of di can be represented as a matrix of f

pj

Figure 9 Example of converting an HTTP flow to a text file

10 Security and Communication Networks

TBntimesm di( 1113857 F1di

F2di

Fndi

1113872 1113873T (6)

where n is the number of encrypted flows that are containedin di and nge 1

Since different feature sets may differ in terms of thenumber of encrypted flows traffic behaviors TBnitimesm(di) andTBnktimesm(dk) may differ in terms of different dimensionnamely the value of ni should not be equal to that of nk Toefficiently calculate the similarity between TBnitimesm(di) andTBnktimesm(dk) we calculate the Frobenius norm of TBnitimesm(di)

and TBnktimesm(dk) e formula is presented in equation (7)where norm(middot) normalizes the value of fl

j to [0 1]

FN(d)

1113944

n

j11113944

m

l1norm fl

j1113872 111387311138681113868111386811138681113868

111386811138681113868111386811138682

11139741113972

(7)

en the distance between TBnitimesm(di) and TBnktimesm(dk) iscalculated as

dis di dk( 1113857 FN di( 1113857 minus FN dk( 11138571113868111386811138681113868

1113868111386811138681113868 (8)

Finally the similarity between TBnitimesm(di) andTBnktimesm(dk) can be calculated by

BSdik 1 minus

dis di dk( 1113857

max dis dr dq1113872 1113873 r q 1 2 u rne q1113960 1113961 (9)

In the above process ni and nk are assumed to exceed 1If ni 0 (nk 0) and nkge 1 (nige 1) BSdik

is set to 0 to reflectthe significant difference between di and dk However ifni 0 and nk 0 namely there is no encrypted flow in bothdi and dk BSdik

is set to 1 because they exhibit the samebehaviors ie they do not produce any encrypted traffic

Similarly we let BSdii 1 and represent these similarity

values as a vector Sibehavior BSdi1

BSdi2 BSdiu

1113966 1113967

55 Clustering of Similarity Values and Malware DetectionFrom the traffic content similarity CSdik

and the traffic be-havior similarity BSdik

we can determine the final similarityvalue BSdik

between the feature sets di and dk as expressed inequation (10) In this equation q is the weight value In ourexperiments we set q 05 namely the traffic contentsimilarity and the traffic behavior similarity are assigned thesame weight is is because apps typically produce almostthe same numbers of HTTP and HTTPS flows [57] We alsoevaluate various weighted values in Section 62 and theexperimental results support that 05 is the most suitablevalue For feature set di a vector Sdi

(Sdi1 Sdi2

Sdiu) is

finally obtained which contains the similarity values be-tween di and the other feature setse set of vector Sdi

1113966 1113967 i

1 2 u will be clustered for the detection of repackagedmalware

Sdik q times CSdik

+(1 minus q) times BSdik (10)

To be specific we use the density peak clustering method[58] to realize our objective Density peak clustering is basedon the following assumption cluster centers are charac-terized by a higher density than their neighbors and by arelatively large distance from points that have higher den-sities Hence density peak clustering can automaticallyidentify the correct number of clusters is is importantbecause we do not know whether there is repackagedmalware or how many types of repackaged malware thereare In other words we cannot determine the number ofclusters in advance Density peak clustering can help usovercome this challenge

We assume that most mobile devices have installed theoriginal apps as discussed in Section 51 erefore afterclustering the cluster that has the largest number of voxelswill be recognized as original and the remaining clusters willbe regarded as suspicious However suspicious apps are notnecessarily repackaged malware For example apps can berepackaged to show ads only Repackaged apps of this typeshould not be simply judged as malware Additionallyoriginal apps can be suspicious due to different user be-haviors To further reduce the frequency of the false alarmswe can check the server hostnames that are contained in therepackaged clusters using VirusTotal (the URLs are includedin the HTTP traffic contents) If all of the serversrsquo hostnamesfor a suspicious cluster Cr are judged as clean it will berecognized as a cluster of abnormal but harmless appsOtherwise it will be identified as repackaged malware Viathis approach we can accurately detect repackaged Androidmalware In this step we only use VirusTotal as a white list toexclude false alarms We do not use it as a backlist (detectionsignatures) to detect repackaged malware Malware detec-tion is still realized by clustering the similarity values

Table 2 Flow statistical featuresBytes sent by the app and bytes sent by the serverNumbers of outgoing and incoming packetsMean and std dev of the outgoing and incoming packet sizesrsquoflow durationMean and std dev of the interpacket time

lt2761 16696 16 12 173 1294 289 1177 3392 121 190gt

Figure 10 Example of the conversion of an encrypted flow to afeature vector

Security and Communication Networks 11

6 Evaluation

61 System Implementation and Dataset Figure 11illustrates our experimental setup e Android phoneswere connected to the edge server via WiFi and each phonehad installed the edge-client app which sends the flowmetainformation to the edge server using Socket (jav-anetSocket) e edge is a Dell PowerEdge R730 server(with 32 processors 16GB RAM and 12 TB hard drives)that is configured as an access point In the edge server weused tcpdump (httpwwwtcpdumporg) to collect networktraffic e similarity calculations of traffic contents andbehaviors and the clustering of similarity values wereimplemented on the Windows Azure cloud computingplatforme URL checking was conducted using Virustotal(httpswwwvirustotalcom)

In the implementation we used aria2c (httpsaria2githubio) to download apps from Androzoo (httpsandrozoounilu) aria2c was encapsulated as commandsin Python code for parallel downloading After downloadingthe apps we automatically installed them on phones usingthe adb shell command Each phone had 40 tested appsinstalled We have implemented the edge-client app basedon Android VPN service In our implementations the edge-client app correlated network flows and apps by resolvingthe ports and PIDs (Process IDs) send the newly collectedflow metainformation to the edge server every 60 secondse edge server used tcpdump to collect network trafficwhich was saved as pcap files To extract traffic features weused Splitcap (httpswwwnetreseccompageSplitCap)to restore tcp flows from pcap files en for HTTP trafficwe extracted the contents using tshark (httpswwwwiresharkorgdocsman-pagestsharkhtml) commands -zfollow tcp and ascii e packet sizes and packet intervaltimes were also obtained by tshark with commands -eframelen ndashe frametime_relative In the cloud CSdik

and BSdik

were calculated using scikit-learn (httpscikit-learnorg)and Numpy (httpsplotlynumpynorm) e code ofdensity peak clustering is available on GitHub (httpsgithubcomjasonwbwDensityPeakCluster) and we havefixed various bugs that were present in the original code

We have downloaded 200 pairs of original andrepackaged apps from Androzoo for testing However notall of these downloaded repackaged apps are maliciousSome apps are only repackaged to show extra ads To screenout Android malware from the downloaded repackagedapps we further checked the 200 repackaged apps usingVirustotal If an app was judged as safe by all the antivirustools it was labelled as repackaged_safe in our experimentsOtherwise it was labelled as repackaged_malware In total143 repackaged apps were detected as malware by VirustotalNevertheless in our experiments we evaluated the proposedclusteringmethod on all 200 repackaged apps to obtainmorerealistic performance results

e downloaded apps were run in 10 rooted Androidphones in parallel to accelerate the experimental process Forall apps we used Droidbot [59] to send input events to themand to generate network traffic automatically We ran eachrepackaged app (repackaged_safe and repackaged_malware

apps) 10 times and the corresponding original app 30 timesto generate sufficient network traffic e running times ofthe original apps exceeded those of the repackaged appsbecause we assume that most of the apps are original asdiscussed in Section 51 Consequently we obtained 40traffic sets for each repackaged-original app pair Note thatfrom the perspectives of the mobile devices and the edgeserver there are only 200 distinct apps because therepackaged app has the same name as the correspondingoriginal app erefore our experiments have simulated40lowast 200 8000 mobile devices For each repackaged-orig-inal app pair there were 30 devices with the original app and10 devices with the repackaged version

In total we have gathered almost 89GB network traffice datasets that are used in our experiments are summa-rized in Table 3 In the table the traffic dataset sizes are thesizes of the captured network traffic and the feature datasetsizes are the sizes of the extracted features In our experi-ments the HTTP traffic contents and the flow statisticalfeatures which are listed in Table 2 were both saved as textfiles As shown in Table 3 the feature datasets are signifi-cantly smaller than the traffic datasets (25G vs 63G and 9G vs26G)us the edge computing can reduce the data size andincrease the data acquisition speed for cloud computing[24] Since the feature dataset sizes are still large (25G+ 9G)we did not send all these feature data to the cloud simul-taneously Instead we sent the feature data one by oneWhen the feature data of an app were successfully receivedby the cloud its traffic similarities were calculated andclustered immediately After that the feature data of anotherapp would be sent and analyzed e experimental resultswill be presented in the following

62 ExperimentalResults In this section wemainly evaluatethe detection efficiency of Android malware by clusteringsimilarity values e performance evaluation of the pro-posed method in terms of power memory and CPU con-sumptions is elaborated in the following Section 63

Similar to previous studies [8 14] to measure theperformance of the proposed method Accuracy and F-measure metrics are utilized which are defined in equations(11) and (12) In equation (11) TP FN FP and TN aredefined in Table 4 In Table 4 repackaged_x representsrepackaged_safe or repackaged_malware e precision andrecall in equation (12) are calculated via equations (13) and(14)

Accuracy TP + TN

TP + FN + FP + TN (11)

F minus measure 2 times precision times recallprecision + recall

(12)

precision TP

TP + FP (13)

recall TP

TP + FN (14)

12 Security and Communication Networks

We have evaluated the Accuracy and F-measure for eachrepackaged app Recall that we ran each repackaged app 10times and the corresponding original app 30 timeserefore for each app the number of the original apptraffic sets is 30 and the number of the repackaged app trafficsets is 10 namely TP + FN 10 and TN+FP 30 for eachapp We have randomly chosen 20 apps from repack-age_malware and the experimental results for these apps arepresented in Figures 12 and 13 Detailed descriptions of thechosen 20 apps are listed in Table 5 In the table the mainfunction of each app is determined by analyzing the packagename from the AndroidManifestxml and from the Googlesearch results of the package name As depicted in Figure 12100 accuracy was realized on a total of 6 apps (app3 app5app9 app10 app13 and app18) e minimum accuracy is85 (app2 3440) and the average accuracy is 952 Fig-ure 13 presents the F-measure values In our experimentsthe maximum F-measure value is 1 and the minimum valueis 0737 e average F-measure value is 0888 ese resultsdemonstrate the satisfactory performance of our proposedclustering method

We have also calculated the average accuracy and F-measure values for all repackaged_safe and repack-aged_malware apps In these calculationsTP + FN 57lowast10 570 and TN+FP 57lowast 301710 for

repackaged_safe apps For repackaged_malware appsTP + FN 143lowast101430 and TN+FP 143lowast 30 4290is is because in our experiments 143 apps were identifiedas repackaged malware and 57 apps as repackaged safe byVirustotal us the parameter values differ e experi-mental results are listed in Table 6

According to Table 6 the average accuracy values forrepackaged_safe and repackaged_malware apps are bothhigh However the average F-measure for repackaged_safeapps is only 078 which is lower than that of repackagedmalware is indicates that repackaged safe apps are

Router Windows Azure

Edge server DellPowerEdge R730

10 android phones

Repackagedoriginal apps

Androzoo

Figure 11 Experimental setup

Table 3 Datasets used in the experiments

of apps of runs of HTTP flows of encrypted flows Traffic dataset sizes Feature dataset sizesOriginal apps 200 30 340015 183452 asymp63G bytes asymp25G bytesRepackaged apps 200 10 110670 93041 asymp26G bytes asymp9G bytes

Table 4 Confusion matrix

ObservedDetected

Repackaged_x app Original appRepackaged_x app TP ( of TPs) FN ( of FNs)Original app FP ( of FPs) TN ( of TNs)

Apps

0

01

02

03

04

05

06

07

08

09

1

Acc

urac

y

App

1A

pp2

App

3A

pp4

App

5A

pp6

App

7A

pp8

App

9A

pp10

App

11A

pp12

App

13A

pp14

App

15A

pp16

App

18A

pp17

App

19A

pp20

Figure 12 Detection accuracies for the 20 selected apps

Security and Communication Networks 13

difficult to distinguish from their original apps is isbecause repackaged_safe apps only generate smallamounts of additional network flows and their networktraffic features are not clearly distinguishable from thoseof normal traffic Fortunately the average F-measure ofrepackaged malware apps is outstanding is may be dueto the fact that malware typically send stolen data to orreceive commands from remote servers Hence theirnetwork traffic features are special and distinguishableFigure 14 presents an example of app clustering and il-lustrates the distinguishability of the repackaged malwareand the original app ese results demonstrate thatrepackaged Android malware can be effectively detectedby comparing their network traffic with that of originalapps

For repackaged malware apps the accuracy results withdifferent weight values (q in equation (10)) are presented inFigure 15 As depicted in the figure when q 05 we realizethe highest accuracy rate 969 If q 0 namely only thefeatures of encrypted traffic are used the accuracy result is753 e result is 617 when q 1 namely only thefeatures of plaintext traffic are considered ese resultsdemonstrate that both the encrypted and plaintext trafficcontribute to the differentiation of the original apps and therepackaged malware However the encrypted traffic con-tributes more than the plaintext traffic is may becausemalicious apps usually connect to their servers by securityprotocols to evade IDS detection

In the previous experiments we checked the serversrsquohostnames using VirusTotal to reduce the frequency of thefalse positives For various clusters that contained less than20 elements if all the serversrsquo hostnames were judged asclean the cluster was identified as an original app rather thana repackaged one As argued in Section 55 this process onlyreduces the frequency of false alarms To support this ar-gument we list the numbers of true positives and falsepositives of the 143 repackaged malware in Table 7 forcomparison When conducting the hostname checking thenumber of true positives is 1400 and the result is the samewhen the hostname checking is turned off is findingsupports that the malware is detected by clustering thesimilarity values However the number of false positivesincreased from 47 to 213 when the hostname checking wasturned off erefore we suggest checking the serversrsquohostnames after clustering to increase the practicality of themethod Note that false positives still occur when we checkthe hostnames by VirusTotal Hence there may be malwarein the original apps In our dataset the typical examples arecommobovideoplayerpro and cncfshop ele1taoxiaosanWe further analyze these two apps by reading theirdecompiled codes and we determine that they indeed es-tablish connections with suspicious servers In detail invarious runs app commobovideoplayerpro connects withhttpwwwtopappsquarecomwhich is judged as maliciousby CyRadar in VirusTotal App cncfshop ele1taoxiaosanvisits suspicious URL http955ccjSn9 Since these apps arethe original apps they are identified as false positives in theexperiments

In our experiments we chose density peak clustering asthe clustering method because density peak clustering canautomatically determine the correct number of clusters [58]Hence our method can detect various versions of malwarethat are repackaged from the same original app Unfortu-nately in the apps that were downloaded from AndroZoono malware was repackaged from the same original app Toevaluate the efficiency of detecting versions of malware wehave created two malware instances from a popular game

Apps

App

1A

pp2

App

3A

pp4

App

5A

pp6

App

7A

pp8

App

9A

pp10

App

11A

pp12

App

13A

pp14

App

15A

pp16

App

18A

pp17

App

19A

pp20

0

01

02

03

04

05

06

07

08

09

1

F-m

easu

re

Figure 13 Detection F-measure for the 20 selected apps

Table 5 Detailed descriptions of the chosen 20 apps

Main function ofapp

of HTTPflows

of encryptedflows

App1 VideoPlayer 453 51App2 Game 657 1455App3 Browser 1531 1478App4 MusicPlayer 341 25App5 Game 612 289App6 Game 788 123App7 Game 1336 524App8 Fitness 357 892App9 Email 122 431App10 Game 963 790App11 MusicPlayer 1128 547App12 Game 1598 426App13 News 2460 1178App14 News 2891 1543App15 Chat 171 49App16 Email 143 368App17 WeatherForecast 397 120App18 Game 1832 921App19 Fitness 732 362App20 OnlineLearning 2378 834

Table 6 Average accuracy and F-measure

Average accuracy () AverageF-measure

Repackaged_safe 903 078Repackaged_malware 969 094

14 Security and Communication Networks

app namely aircomaceviralmotox3 One is produced bygrafting AnserverBotrsquos source code into air-comaceviralmotox3 and the other is created by adding codefor stealing usersrsquo contacts and short messages and sendingthem to a remote server Similarly each malware is run 10times and the original app aircomaceviralmotox3 is alsorun 10 times for validatione clustering result is presentedin Figure 16 Apparently there are 3 clusters e

experimental results show that the value of F-measure foreach malware is 1 Hence our method can detect variousmalware that were repackaged from the same original app

63 Comparison In this section we compare our methodwith the typical methods that are based on network or cloudas illustrated in Figure 5 e comparative indicators are thedetection accuracy F-measure average power CPU andmemory consumptions of the mobile device First wecompare the proposed method with AppFA [14] which isimplemented completely at the network level Since AppFAalso must analyze app traffic we directly use the traffic sets ofthe repackaged malware apps for comparison While testingAppFA we deleted the label information that was containedin the traffic sets and mixed the network flows as a wholeen we used the signature matching and the constrainedK-means clustering algorithm proposed in [14] to restoreapp sessions namely the label for each flow e similarapps used in AppFA were chosen in Google Play via themethod introduced in [14] and for each repackaged mal-ware 10 similar apps were selected as the peer group epeer group apps were run once by Droidbot to generatenetwork traffic e comparison results are listed in Table 8

Since AppFA need not to install some special apps onmobile devices the extra power memory and CPU con-sumptions are 0 However the detection accuracy is muchlower (863 vs 969) than that of the proposedmethod ascompared in Table 8 In the proposed method we use theedge-client app to record flow metainformation thusnetwork flows can be accurately labeled However inAppFA flow labeling is realized via traffic clustering and theclustering accuracy cannot be 100 Meanwhile AppFAcompares the appsrsquo network behaviors with those of theirsimilar apps which also results in errors In the proposedmethod we directly compare the repackaged malware withits original app and the detection accuracy is increased eaverage power CPU and memory consumptions of theproposed method are also low and acceptable

In Table 8 the power memory and CPU consumptionsfor our edge-client app were obtained with the help of appGT (httpsgithubcomTencentGT) GT is a portable toolthat runs on smartphones for debugging the APP internalparameters and the code time-consuming statistics eaverage values for the power memory and CPU con-sumptions are calculated as follows Before running an appwith Droidbot we started the edge-client app by GT Whenthe Droidbot was stopped we recoded the power CPU andmemory consumptions We repeated the process until allapps (including 143 repackaged malware apps and theiroriginal versions) had been tested Via this way we obtainedthe average power CPU and memory consumptions for theedge-client app According to Table 8 the edge-client app islightweight and has minimal impact on the performances ofthe mobile devices

en we compare the proposed method with SAMA-Droid [44] SAMADroid monitors Android system calls andsends the calls to a centralized server for the detection ofAndroid malware SAMADroid is a typical cloud-based on-

Decision graph rho-delta-rho

20 251510 35 403050Sorted rho

00

02

04

06

08

Rho lowast

delt

a

Figure 14 Example of app clustering Obviously two clusters areidentified e dot in the top left corner is an outlier e outlier isdetected as an abnormal but harmless app by our method isfigure was generated by the density peak clustering tool

01 02 03 04 05 06 07 08 09 10Value of q

0

01

02

03

04

05

06

07

08

09

1

Acc

urac

y

Figure 15 Accuracies of various weight values for traffic contentsimilarity and traffic behavior similarity

Table 7 True positives and false positives with and withouthostname checking

of true positives of false positivesWith hostname checking 1400 47Without hostnamechecking 1400 213

Security and Communication Networks 15

device detection method We have implemented the dy-namic detection according to the description in [44] Indetail the APIs that are invoked by the running apps aremonitored by the tool Sensitive API Monitor (httpsgithubcomcyruliuSensitive-API-Monitor) As done in [44] thefrequency values of 10 system calls (open ioctl brk readwrite close sendto sendmsg recvfrom and recvmsg) arerecorded ese recorded frequency values are sent toWindows Azure for malware detection We evaluate thedynamic detection performance of SAMADroid on our 143repackaged_malware apps and their original versions In theevaluation the frequency values of the original apps are usedto create the normality model and the frequency values ofthe repackaged malware apps are used for testing We alsouse the app GT to record the power memory and CPUconsumptions of SAMADroid Note that our phones wereall rooted and Sensitive_API_Monitor and GT can be usede experimental results are presented in Table 9

Last we compare our method with Shabtairsquos method [8]Shabtairsquos method detects Android malware at the mobiledevices and it considers the apprsquos network traffic patternsonly the HTTP traffic contents are not considered isgives us an opportunity to compare our method withmethods that have been specially designed for encryptedtraffic We have implemented Shabtairsquos method and chosenthe feature subset 2 as the detection features Feature subset2 includes Avg Sent Bytes Avg Rcvd Bytes and Pct ofAvg Rcvd Bytes among other values and it is proved to be

the best feature subset for malware detection [8] e De-cisionRegression tree is chosen as the classification algo-rithm e testing dataset is the traffic that was generated bythe 143 repackaged malware instances and the training set isthe traffic that was generated by the corresponding originalapps e classification results are listed in Table 10According to the table the detection accuracy and F-mea-sure values of our method are significantly higher than thoseof Shabtairsquos method is indicates that the HTTP trafficprovides useful information for Android malware detectionSince Shabtairsquos method is conducted at the mobile devices itconsumes more resources than our method as compared inTable 10

7 Discussion

e results of our extensive experiments demonstrate thatrepackaged Android malware can be efficiently and effec-tively detected by clustering network traffic at the edgedevices However there are still limitations for our methodFirst we assume the most of the devices have installed theoriginal apps (rlt (12)u as described in Section 51) is isreasonable in most cases since mobile users typicallydownload apps from the official markets However in areaswhere official markets such as the Google play and Amazonare banned it is possible that the most mobile devices haveinstalled the repackaged apps is may cause false positivesSecond if the malware does not generate sufficient trafficour method cannot detect it For example some malwareonly sends fraudulent premium SMS messages is type ofmalware cannot be detected by monitoring its networktraffic ird we rely on third-party tools such as Virustotalfor the identification of malicious URLs and to reduce thenumber of false alarms If a malicious app rents legalplatforms such as the public cloud as the remote controlserver our method will judge it as a normal app Hence falsenegatives will be generated

To overcome these limitations we could combine theproposed method with available methods such as SAMA-Droid In such cases the edge-client app will not only recordflow metainformation but also monitor app system be-haviors However the system behavior monitoring does notneed to be conducted all the time If abnormal but harmlessor lower traffic apps are detected the edge-client app will beinformed to record the corresponding appsrsquo system be-haviors Similarly the edge-client app can send the appsystem behaviors to the edge server for further processinge performance of this combined method will be evaluatedin our future work One can also combine the proposedframework with AppFA [14] for the detection of other typesof Android malware For example the edge extracts apptraffic behavior features and the cloud can compare appsrsquobehaviors with their peer groups to identify abnormalitiesFurthermore one could first use clustering techniques [60]to group repackaged apps according to the same maliciousmodules and then analyze the behaviors of the representativerepackaged apps in each group to obtain more accuratebehavior features

y

2D nonclassical multidimensional scaling

00ndash02 02ndash04x

ndash04

ndash03

ndash02

ndash01

00

01

02

03

04

Figure 16 Clustering results of various malware that wererepackaged from aircomaceviralmotox3e outlier in the middleis detected as an abnormal but harmless app is figure wasgenerated by the density peak clustering tool

Table 8 Comparison of the proposed method with AppFA

Proposed method AppFAAccuracy 969 863F-measure 094 075Average power consumption 12 0Average CPU consumption 06 0Average memory consumption 654MB 0

16 Security and Communication Networks

e paradigms of edge computing are used to detectAndroid malware in this study However several challengeswould be encountered when deploying it in practice Firstthe mobile devices must install a specified app for thecollection of essential information which may cause con-cerns and mobile users may refuse to cooperate Second inedge computing the data should be preprocessed by the edgeto better protect the usersrsquo privacy However the edge servercan observe all the data (including the sensitive informa-tion) and it threatens the privacy of users For example ifthe edge is deployed by the cloud provider the usersrsquo privacyinformation is also available to the cloud ere is no dif-ference with the user-cloud model for privacy protectionunless the edge server is deployed by the userird the edgeserver should be well protected from network attacksOtherwise it will be convenient for hackers to steal usersrsquoprivate information

8 Conclusion

In this paper we propose a novel method for detectingAndroid malware based on edge computing To the best ofour knowledge the detection of Android malware by uti-lizing edge computing and traffic clustering has not beenconsidered previously By introducing the edge layer themain tasks of the mobile device are offloaded to the edgeserver and the mobile device only needs to record flowmetainformation which substantially conserves the re-sources of the mobile device e edge server extracts appsrsquonetwork traffic features and sends these features to the cloudplatform for further analysis us the usersrsquo privacy can bebetter protected With the received features the cloud cal-culates the similarities between apps and clusters thesesimilarity values to separate the original apps and themalware automatically We evaluated our methods on 400Android apps and the experimental results demonstratethat the detection accuracy can reach 100 for several appsand the average accuracy is 969 We also tested theperformance of the proposed method and compared it with

the typical approaches e comparison results show thatour method can detect repackaged Android malware withboth higher detection accuracy and lower power and re-source consumptions In the future we will continue to workto overcome the discussed limitations and to evaluate theproposed method in real world scenarios

Data Availability

e downloaded apps used to support the findings of thisstudy have been deposited in httpspanbaiducoms1b8vMceIdtdd1gBRC5Gbzvw Its extraction code is vd5re clustering code is also in the repository Since appnetwork traffic contains usersrsquo sensitive information itcannot be publicly shared However we have explained allthe steps to generate network traffic automatically esesteps are also described in the instructionspdf in therepository

Conflicts of Interest

e authors declare that they have no conflicts of interest

Acknowledgments

is work was supported by National Key TechnologiesResearch and Development Program of China under Grant2018YFD0401404 National Natural Science Foundation ofChina under Grants 61702282 61802192 and 71801123Natural Science Foundation of the Jiangsu Higher EducationInstitutions of China under Grants 18KJB520024 and17KJB520023 Nanjing Forestry University under GrantsGXL016 and CX2016026 and NUPTSF under GrantNY217143

References

[1] Statista Mobile Operating Systemsrsquo Market Share Worldwidefrom January 2012 to December 2019 Statista HamburgGermany 2009 httpswwwstatistacomstatistics272698global-market-share-held-by-mobile-operating-systems-since-2009

[2] W Martin F Sarro Y Jia Y Zhang and M Harman ldquoAsurvey of app store analysis for software engineeringrdquo IEEETransactions on Software Engineering vol 43 no 9pp 817ndash847 2017

[3] httpswwwappbraincomstatsnumber-of-android-apps2020

[4] L Li D Li T F Bissyande et al ldquoUnderstanding android apppiggybacking a systematic study of malicious code graftingrdquoIEEE Transactions on Information Forensics and Securityvol 12 no 6 pp 1269ndash1284 2017

[5] M Fan J Liu X Luo et al ldquoAndroid malware familialclassification and representative sample selection via frequentsubgraph analysisrdquo IEEE Transactions on Information Fo-rensics and Security vol 13 no 8 pp 1890ndash1905 2018

[6] J Zhang Z Qin K Zhang H Yin and J Zou ldquoDalvik opcodegraph based android malware variants detection using globaltopology featuresrdquo IEEE Access vol 6 pp 51964ndash51974 2018

[7] S Garg S K Peddoju and A K Sarje ldquoNetwork-baseddetection of android malicious appsrdquo International Journal ofInformation Security vol 16 no 4 pp 385ndash400 2016

Table 9 Comparison of the proposed method with SAMADroid

Proposed method SAMADroidAccuracy 969 831F-measure 0940 0793Average power consumption 12 23Average CPU consumption 06 08Average memory consumption 654MB 773MB

Table 10 Comparison of the proposed method with Shabtairsquosmethod

Proposed method Shabtairsquosmethod

Accuracy 969 771F-measure 0940 0695Average power consumption 12 77Average CPU consumption 06 23Average memory consumption 654MB 937MB

Security and Communication Networks 17

[8] A Shabtai L Tenenboim-Chekina D Mimran L RokachB Shapira and Y Elovici ldquoMobile malware detection throughanalysis of deviations in application network behaviorrdquoComputers amp Security vol 43 pp 1ndash18 2014

[9] L Xie X Zhang J-P Seifert and S Zhu ldquopBMDS a be-havior-based malware detection system for cellphone de-vicesrdquo in Proceedings of the ird ACM Conference onWireless Network Security ACM New York NY USApp 37ndash48 2010

[10] I Burguera U Zurutuza and S Nadjm-Tehrani ldquoCrowdroidbehavior-based malware detection system for androidrdquo inProceedings of the 1st ACMWorkshop on Security and Privacyin Smartphones and Mobile Devices ACM Chicago IL USApp 15ndash26 October 2011

[11] G Dini F Martinelli A Saracino and D SgandurraldquoMADAM a multi-level anomaly detector for android mal-warerdquo in Proceedings of the International Conference onMathematical Methods Models and Architectures for Com-puter Network Security vol 12 Springer St PetersburgRussia pp 240ndash253 2012

[12] Y Zhang M Yang B Xu et al ldquoVetting undesirable be-haviors in android apps with permission use analysisrdquo inProceedings of the 2013 ACM SIGSAC Conference on Com-puter amp Communications Security ACM Berlin Germanypp 611ndash622 November 2013

[13] SWang Z Chen X Li LWang K Ji and C Zhao ldquoAndroidmalware clustering analysis on network-level behaviorrdquo inProceedings of the International Conference on IntelligentComputing Springer Liverpool UK pp 796ndash807 August2017

[14] G He B Xu and H Zhu ldquoAppFA a novel approach to detectmalicious android applications on the networkrdquo in Securityand Communication Networks Wiley Hoboken NJ USA2018

[15] X Wang Y Yang and Y Zeng ldquoAccurate mobile malwaredetection and classification in the cloudrdquo SpringerPlus vol 4no 1 p 583 2015

[16] K Tam A Feizollah N B Anuar R Salleh and L Cavallaroldquoe evolution of android malware and android analysistechniquesrdquo ACM Computing Surveys (CSUR) vol 49 no 4p 76 2017

[17] M Sun X Li J C S Lui R T B Ma and Z Liang ldquoMonet auser-oriented behavior-based malware variants detectionsystem for androidrdquo IEEE Transactions on Information Fo-rensics and Security vol 12 no 5 pp 1103ndash1112 2017

[18] T Vidas and N Christin ldquoEvading android runtime analysisvia sandbox detectionrdquo in Proceedings of the 9th ACMSymposium on Information Computer and CommunicationsSecurity pp 447ndash458 Kyoto Japan June 2014

[19] Y Duan M Zhang A V Bhaskar et al ldquoings you may notknow about android (Un) packers a systematic study basedon whole-system emulationrdquo in Proceedings of the 2018Network and Distributed System Security Symposium pp 1ndash15 San Diego CA USA February 2018

[20] L Xue X Luo L Yu S Wang and D Wu ldquoAdaptiveunpacking of android appsrdquo in Proceedings of the 2017 IEEEACM 39th International Conference on Software Engineering(ICSE) IEEE Buenos Aires Argentina pp 358ndash369 May2017

[21] Y Zhou and X Jiang ldquoDissecting android malware char-acterization and evolutionrdquo in Proceedings of the 2012 IEEESymposium on Security and Privacy IEEE San Francisco CAUSA pp 95ndash109 2012

[22] N W Lo S K Lu and Y H Chuang ldquoA framework for thirdparty android marketplaces to identify repackaged appsrdquo inProceedings of the 2016 IEEE 14th International Conference onDependable Autonomic and Secure Computing 14th Inter-national Conference on Pervasive Intelligence and Computing2nd International Conference on Big Data Intelligence andComputing and Cyber Science and Technology Congresspp 475ndash482 Auckland New Zealand August 2016

[23] L Li J Gao M Hurier et al ldquoAndrozoo++ collectingmillions of android apps and their metadata for the researchcommunityrdquo 2017 httpsarxivorgabs170905281

[24] W Shi J Cao Q Zhang Y Li and L Xu ldquoEdge computingvision and challengesrdquo IEEE Internet of ings Journal vol 3no 5 pp 637ndash646 2016

[25] W Zhou Y Zhou X Jiang and P Ning ldquoDetectingrepackaged smart-phone applications in third-party androidmarketplacesrdquo in Proceedings of the Second ACM Conferenceon Data and Application Security and Privacy ACM AntonioTX USA pp 317ndash326 February 2012

[26] Y Shao X Luo C Qian P Zhu and L Zhang ldquoTowards ascalable resource-driven approach for detecting repackagedandroid applicationsrdquo in Proceedings of the 30th AnnualComputer Security Applications Conference ACM NewOrleans LA USA pp 56ndash65 December 2014

[27] K Chen P Wang Y Lee et al ldquoFinding unknown malice in10 seconds mass vetting for new threats at the google-playscalerdquo in Proceedings of the USENIX Security Symposiumvol 15 Washington DC USA August 2015

[28] Z Wang C Li Y Guan and Y Xue ldquoDroidchain a novelmalware detection method for android based on behaviorchainrdquo in Proceedings of the 2015 IEEE Conference onCommunications and Network Security (CNS) IEEE Flor-ence Italy pp 727-728 September 2015

[29] K Tian D D Yao B G Ryder G Tan and G PengldquoDetection of repackaged android malware with code-het-erogeneity featuresrdquo IEEE Transactions on Dependable andSecure Computing vol 17 no 1 pp 64ndash77 2017

[30] A De Lorenzo F Martinelli E Medvet F Mercaldo andA Santone ldquoVisualizing the outcome of dynamic analysis ofandroid malware with vizmalrdquo Journal of Information Se-curity and Applications vol 50 Article ID 102423 2020

[31] M Lin D Zhang X Su and T Yu ldquoEffective and scalablerepackaged application detection based on user interfacerdquo inProceedings of the 2017 IEEE Conference on ComputerCommunications Workshops (INFOCOM WKSHPS) IEEEAtlanta GA USA May 2017

[32] G Meng M Patrick Y Xue Y Liu and J Zhang ldquoSecuringandroid app markets via modelling and predicting malwarespread between marketsrdquo IEEE Transactions on InformationForensics and Security vol 14 no 7 pp 1944ndash1959 2018

[33] A Arora and S K Peddoju ldquoMinimizing network trafficfeatures for android mobile malware detectionrdquo in Proceed-ings of the 18th International Conference on DistributedComputing and Networking ACM Hyderabad India January2017

[34] X Wu D Zhang X Su and W Li ldquoDetect repackagedAndroid application based on HTTP traffic similarity httptraffic similarityrdquo Security and Communication Networksvol 8 no 13 pp 2257ndash2266 2015

[35] J Malik and R Kaushal ldquoCredroid android malware de-tection by network traffic analysisrdquo in Proceedings of the 1stACM Workshop on Privacy-Aware Mobile Computing ACMPaderborn Germany pp 28ndash36 July 2016

18 Security and Communication Networks

[36] A Zulkifli I R A Hamid W M Shah and Z AbdullahldquoAndroid malware detection based on network traffic usingdecision tree algorithmrdquo in Proceedings of the InternationalConference on Soft Computing and Data Mining SpringerSenai Malaysia pp 485ndash494 January 2018

[37] Z Chen Q Yan H Han et al ldquoMachine learning basedmobile malware detection using highly imbalanced networktrafficrdquo Information Sciences vol 433-434 pp 346ndash364 2018

[38] G He B Xu L Zhang and H Zhu ldquoMobile app identifi-cation for encrypted network flows by traffic correlationrdquoInternational Journal of Distributed Sensor Networks vol 14no 12 pp 1ndash17 2018

[39] L Xue Y Zhou T Chen X Luo and G Gu ldquoMalton to-wards on-device non-invasive mobile malware analysis forARTrdquo in Proceedings of the 26th USENIX Security Symposiumpp 289ndash306 Vancouver Canada August 2017

[40] M K Alzaylaee S Y Yerima and S Sezer ldquoDL-Droid deeplearning based android malware detection using real devicesrdquoComputers amp Security vol 89 Article ID 101663 2020

[41] A S Shamili C Bauckhage and T Alpcan ldquoMalware de-tection on mobile devices using distributed machine learn-ingrdquo in Proceedings of the 20th International Conference onPattern Recognition (ICPR) IEEE Istanbul Turkeypp 4348ndash4351 August 2010

[42] M Zhao T Zhang F Ge and Z Yuan ldquoRobotdroid alightweight malware detection framework on smartphonesrdquoJournal of Networks vol 7 no 4 p 715 2012

[43] K A Talha D I Alper and C Aydin ldquoAPK auditor per-mission-based android malware detection systemrdquo DigitalInvestigation vol 13 pp 1ndash14 2015

[44] S Arshad M A Shah A Wahid A Mehmood H Song andH Yu ldquoSamadroid a novel 3-level hybrid malware detectionmodel for android operating systemrdquo IEEE Access vol 6pp 4321ndash4339 2018

[45] G He L Zhang B Xu and H Zhu ldquoDetecting repackagedandroid malware based on mobile edge computingrdquo inProceedings of the 2018 Sixth International Conference onAdvanced Cloud and Big Data (CBD) IEEE Lanzhou Chinapp 360ndash365 August 2018

[46] R Roman J Lopez andMMambo ldquoMobile edge computingFog et al a survey and analysis of security threats andchallengesrdquo Future Generation Computer Systems vol 78pp 680ndash698 2018

[47] F Bonomi R Milito J Zhu and S Addepalli ldquoFog com-puting and its role in the internet of thingsrdquo in Proceedings ofthe First Edition of the MCC Workshop on Mobile CloudComputing ACM Helsinki Finland pp 13ndash16 August 2012

[48] M T Beck MWerner S Feld and S Schimper ldquoMobile edgecomputing a taxonomyrdquo in Proceedings of the Sixth Inter-national Conference on Advances in Future Internet pp 48ndash55 Lisbon Portugal November 2014

[49] P Mach and Z Becvar ldquoMobile edge computing a survey onarchitecture and computation offloadingrdquo IEEE Communi-cations Surveys amp Tutorials vol 19 no 3 pp 1628ndash1656 2017

[50] H T Dinh C Lee D Niyato and P Wang ldquoA survey ofmobile cloud computing architecture applications and ap-proachesrdquo Wireless Communications and Mobile Computingvol 13 no 18 pp 1587ndash1611 2013

[51] X Lyu H Tian L Jiang et al ldquoSelective offloading in mobileedge computing for the green internet of thingsrdquo IEEENetwork vol 32 no 1 pp 54ndash60 2018

[52] S Agrawal and J Agrawal ldquoSurvey on anomaly detectionusing data mining techniquesrdquo Procedia Computer Sciencevol 60 pp 708ndash713 2015

[53] A Tongaonkar ldquoA look at the mobile app identificationlandscaperdquo IEEE Internet Computing vol 20 no 4 pp 9ndash152016

[54] G He B Xu and H Zhu ldquoIdentifying mobile applications forencrypted network trafficrdquo in Proceedings of the 2017 FifthInternational Conference on Advanced Cloud and Big Data(CBD) IEEE Shanghai China pp 279ndash284 August 2017

[55] G Ranjan A Tongaonkar and R Torres ldquoApproximatematching of persistent lexicon using search-engines forclassifying mobile app trafficrdquo in Proceedings of the 35thAnnual IEEE International Conference on Computer Com-munications IIEEE San Francisco CA USA April 2016

[56] L Li and S Qu ldquoShort text classification based on improvedITCrdquo Journal of Computer and Communications vol 1 no 4pp 22ndash27 2013

[57] A H Lashkari A F A Kadir L Taheri and A A GhorbanildquoToward developing a systematic approach to generatebenchmark android malware datasets and classificationrdquo inProceedings of the 2018 International Carnahan Conference onSecurity Technology (ICCST) IEEE Montreal Canada Oc-tober 2018

[58] A Rodriguez and A Laio ldquoClustering by fast search and findof density peaksrdquo Science vol 344 no 6191 pp 1492ndash14962014

[59] Y Li Z Yang Y Guo and X Chen ldquoDroidbot a lightweightUI-guided test input generator for androidrdquo in Proceedings ofthe Software Engineering Companion (ICSE-C) pp 23ndash26Buenos Aires Argentina May 2017

[60] M Fan X Luo J Liu et al ldquoGraph embedding based familialanalysis of android malware using unsupervised learningrdquo inProceedings of the 2019 IEEEACM 41st International Con-ference on Software Engineering (ICSE) IEEE Piscataway NJUSA pp 771ndash782 May 2019

Security and Communication Networks 19

Page 5: On-DeviceDetectionofRepackagedAndroidMalwarevia …downloads.hindawi.com/journals/scn/2020/8630748.pdf · 2020-05-31 · According to [4–6], most Android malware programs are repackagedapps,andthemajorityofnewAndroidmalware

32On-DeviceDetection e detection of Android malwarecan also be conducted at usersrsquo mobile devices [39]Alzaylaee et al [40] extracted features from real devices andused a deep learning system to detect malicious Androidapps Shamili et al [41] proposed a distributed SupportVector Machine (SVM) algorithm for the detection ofAndroid malware on a network of mobile devices eirdetection features include short duration calls mediumduration calls and the number of outgoing SMS (ShortMessage Service) among others e experimental resultsdemonstrated that the average computation time per clientduring the training phase is nearly 10 seconds Zhao et al[42] implemented a malware detection framework namelyRobotDroid using an SVM active learning algorithmRobotDroid monitors all the running apps to identify theirrunning characteristics ese characteristics are later inputinto learning modules to generate the behavioral charac-teristics for the classification of normal apps and malwareAccording to their performance evaluation the time ofaccessing the GPS is increased from 80ms to almost 120mswhen the detection system is executed Hence the perfor-mance of a smartphone that utilizes this framework wouldbe significantly degraded Shabtai et al [8] monitored theappsrsquo network behaviors at the mobile devices to detectrepackaged Android malware and the detection accuracyexceeds 80 however the performance overhead of theirmethod is not negligible For example it requires7272 kB plusmn 15 memory consumption and 13 plusmn 15 CPUconsumption for learning 10 Android malware detectionmodels

To further improve the performance of on-device de-tection Crowdroid [10] sends the detection features to acentralized server for processing Crowdroid only recordsthe system calls as the detection features and it is light-weight Talha et al [43] proposed APK Auditor which is asystem that uses permissions as static analysis features forAndroid malware detection APK Auditor consists of threecomponents (1) a signature database (2) an Android clientand (3) a central server e Android client sends the app tothe central server for analysis e central server extracts thepermissions that are requested by the app and computes thepermission malware score based on the permissions inmalware If the score exceeds a threshold value the app isclassified as malware e results demonstrate that APKAuditor achieves 925 specificity however it lacks thebenefits of dynamic analysis as it cannot detect dynamicmalicious payloads

Monet [17] also uses backend servers and a client app todetect Androidmalwaree client app constructs a directedgraph of the app components and the system componentsand it records the potentially dangerous system calls as thedetection features ese features are forwarded to thebackend server which conducts further detection by ap-plying a signature matching algorithm and returns thedetection results e experimental results demonstrate thatMonet can detect Android malware with 99 accuracyHowever the performance overhead of Monet is high Asshown in [17] Monet had 8 overhead on memory and IObenchmarks and 55 overhead for battery Most recently

Arshad et al [44] proposed SAMADroid which is a 3-levelhybrid model for the detection of Android malware Sim-ilarly SAMADroid has a client app and an analysis servere client app hooks the Strace tool and traces the systemcalls that are invoked by the user app If the user app remainsrunning on the device the client app continues tracing thesystem calls of the user app and generates a log file of thesystem calls Later the log file is sent to the SAMADroidserver with the identifier of the user app At the server thesame user app is downloaded from stores and staticallyanalyzed en the static features and the dynamic featuresthat are extracted from the log file are embedded in a vectorspace Finally the features are provided as input to themachine learning tool for classification eir experimentalresults demonstrate that random forest yields 9907 ac-curacy on the static features and 8276 on the dynamicfeatures e performance overhead of SAMADroid is 18for memory and 06 for CPU In previous work [45] weused mobile edge computing to detect malicious apps Weextend it to the edge computing paradigm and refine theclustering method in this study

33 Edge Computing Research e vision and challenges ofedge computing are introduced in the literature [24] ebasic strategy of edge computing is to deploy cloud com-puting at the edge of the network to improve the perfor-mance and create new applications for IoT [46] Howeverthe deployment location of the edge and the functionalitiesof the edge are controversialerefore in applications edgecomputing can be implemented as fog computing [47]mobile edge computing [48 49] and mobile cloud com-puting [50] In the fog computing paradigm the analysis oflocal information is conducted on the ldquogroundrdquo and thecoordination and global analytics are conducted in theldquocloudrdquo One can argue that all objects that are outside thecloud constitute the ground thus fog computing is almostthe same as edge computing

Mobile edge computing deploys cloud services at theedge of mobile networks such as 5G Namely the edge isdeployed in close proximity to the eNodeB which loops thetraffic through the mobile edge computing servers for fur-ther data processing us in mobile edge computing theedge is fixed and the edge servers can be provided bytelecommunication companies Mobile cloud computingfocuses mainly on mobile delegation mobile devices shoulddelegate the storage of bulk data and the execution ofcomputationally intensive tasks to remote entities A remoteentity can be a centralized cloud or another mobile deviceerefore the edge can consist of mobile devices in mobilecloud computing Hence one of the most active areas ofresearch in the field of mobile cloud computing is thedelegation of tasks to external services especially to othermobile devices Task delegation is also essential for edgecomputing [51]

In this study we provide a lightweight and efficientframework for the detection of repackaged Androidmalwarethat is based on edge computing In the proposed frame-work the mobile devices need only to record the flow

Security and Communication Networks 5

starting time flow protocol server-side IP address server-side port app name and version for each network flowLater these data are sent to the edge servers for traffictagging Since the amounts of data that are saved and sent aresmall the performances of mobile devices are not signifi-cantly degraded e edge servers extract detection featuresfrom the labeled app network traffic and send the extractedfeatures to the cloud to be clustered After clustering theoriginal and repackaged apps can be efficiently distin-guished In our study only network traffic is considered inAndroid malware detection In contrast to previous net-work-based methods such as [14 34ndash36] we do not need tomodel appsrsquo network behaviors in advance and our methodis more useful in practice

4 Framework for Android Malware Detection

e on-device detection of Android malware is challengingbecause mobile devices typically have limited resources Toalleviate the storage and computing limitations and prolongthe lifetimes of the mobile devices in this study we proposea new framework for on-device detection of repackagedAndroid malware e proposed framework is illustrated inFigure 3

Our framework is composed of three layers mobiledevices the edge and the cloud First we explain thepossible locations of the edge later we introduce thefunctionalities of each component In our framework theonly requirement for the edge is that it can monitor andprocess the mobile network traffice mobile devices couldconnect to the Internet via WiFi or 3G4G hence the edgenodes can be wireless gateways household security gatewayssuch as BitDefender Box and LTE eNodeBs e edge nodescan also be routers that are located in the backbone networkif they can observe the traffic that is generated by the mobiledevices erefore the deployment locations of the edgenodes are flexible in practice Figure 4 illustrates the possibledeployment scenarios of the edge

In the proposed framework the mobile devices mustcollect and send flow metaformation to the edge for datapreprocessing For this an edge-client app is designed to berun on themobile devicese edge-client app operates as anAndroid VPN app Its main objective is to collect andtransfer the flow metainformation which includes the appname flow starting time and flow protocol to the edgeiteratively With the flow metainformation the edge serverpreprocesses the network traffic by adding a label (the appname) to each flow After that it extracts detection featuresfrom these labeled flows e traffic labeling and featureextraction are illustrated in detail in Section 52 Once thefeature extraction has been completed the edge server sendsthese features to the cloud With the received features thecloud conducts the malware detection tasks and notifies themobile devices of the detection results

To further illustrate the advantages of the proposedframework we compare it with the available frameworksSince the proposed framework is based on edge computingwe will refer to it as the edge-computing-based framework inthe following Similarly the available frameworks are termed

as the cloud-based framework (the server-side detectionmethods that were described in Section 1) and the network-based framework (the network-based methods that wereintroduced in Sections 1 and 3) for differentiation Com-pared to the popular cloud-based framework (Figure 5(a))the most prominent advantage of the edge-computing-basedframework is that the feature extraction process is offloadedto the edge and the mobile devices must only save andtransmit the flow metainformation us the proposededge-computing-based framework is resource-friendly formobile devices

e network-based framework (Figure 5(b)) monitorsnetwork traffic without any interference withmobile devicesHowever it is less accurate since the network nodes cannotknow the origins of the network flows precisely namely itcannot identify app traffic with 100 accuracy As dem-onstrated in the literature [14] since the network nodecannot identify the app for each flow (eg some flows lackssignatures or are encrypted) the appsrsquo network behaviors areincomplete and the malware detection accuracy is reducedIn the proposed edge-computing-based framework wedesign a lightweight edge-client app for gathering app in-formation Furthermore mobile usersrsquo privacy can be betterprotected in our proposed framework By deploying onersquosown edge server a mobile user can control the types ofinformation that can be used as detection features Incontrast in the network-based framework the networknodes can observe the full traffic content and extract anyinformation they want which severely threatens the privacyof mobile users

Android malwaredetectionCloud

Edge

Mobiledevices

(1) Flowmetainformation

(2) Extractedfeatures

(3) Detectionresults

Network connectionData transmission

Figure 3 Framework for android malware on-device detectionthat is based on edge computing Mobile devices send flow met-ainformation such as the app name and the server-side IP addressto the edge Edge servers capture app network traffic and extractHTTP contents and flow statistical features from the capturedtraffic Later the features are sent to the cloud for malware de-tection e cloud also interacts with the mobile devices to returnthe detection results

6 Security and Communication Networks

After introducing the general architecture we outline thedetection methods that can be used in the edge-computing-based framework e cloud can use various methods todetect Android malware For example it can use conven-tional signature matching [35] and network behavioranalysis [8 14] to identify malicious apps However thesetypes of methods require a priori information (such assignatures or behavior models) In this paper we propose anovel repackaged malware detection method that clustersthe network traffic that is generated by the same app fromvarious mobile devices Figure 6 presents an example of thisprocess

e motivations behind the traffic clustering are two-fold first most Android malware is produced by repack-aging popular apps [4] and popular apps are often widelydistributed Second the network traffic generated by therepackaged malware differs from that generated by theoriginal app (as illustrated in Section 21) As illustrated inFigure 3 many mobile devices would connect to the cloudfor malware detection Hence the cloud can observe thenetwork traffic features of both the original apps and theirrepackaged versions with large probabilities due to the widedistribution of popular apps and the large number of mobiledevices If we regard the features of the original apps as thenormal data and the features of the repackaged versions asthe abnormal data the repackaged malware detectionproblem is transformed into the abnormal detection

problem Clustering is an efficient method for solving thisproblem [52] By clustering appsrsquo network traffic features weneed not model appsrsquo network behaviors in advance andmore importantly we can detect various repackaged ver-sions of the original app as illustrated in Figure 6e detailsof the proposed malware detection method will be describedin detail in Sections 53ndash55

5 Methodology

In this section we introduce the technical details of theproposed clustering method for Android malware detectionTable 1 defines the main notation that we use in this paper

(a) (b) (c)

Figure 4 Possible deployments of the edge e edge could be the wireless gateways or the dedicated servers that are connected to theeNodeB or the routers (a) Wireless gateway (b) eNodeB+ server (c) Router + server

Clustering

Traffic features of an app collected from

different mobile devicesOriginal apps

Repackaged apps

Figure 6 Illustration of traffic clustering for repackaged malwaredetection e plain circles represent original apps and the circleswith triangles and inverted triangles are repackaged ones

Network connectionData transmission

Android malwaredetectionCloud

Mobiledevices

(1) Extractedfeatures

(2) Detectionresults

(a)

Network connectionData transmission

Network node

Mobiledevices

(1) Detectionresults

(b)

Figure 5 Available frameworks for android malware detection Various approaches such as [10] send the extracted features to a remoteserver other than to the cloud We still regard them as following the cloud-based framework because there is no essential difference formalware detection As illustrated in (b) the detection results are typically presented to the network administrator and are not fed back to themobile devices (a) Framework based on cloud (b) Framework based on network

Security and Communication Networks 7

51 System Overview Figure 7 illustrates the main proce-dure of the proposed approach App i is the app for vettingand it has been installed on umobile devicesese u devicesare connected to e edge servers We assume that r deviceshave installed the repackaged malware (the malware is alsopresented as App i to deceive the users) and rlt (12)uHence most of the devices have installed the original ver-sion is situation is reasonable because repackaged An-droidmalware is typically distributed in third-party marketswhich are less popular than official markets [23] us thenumbers of downloads and installations are relatively smallWe will use this characteristic to classify the clusteringresults

To detect the repackaged malware the edge server labelsthe network flows that are generated by App i according tothe flow metainformation en it extracts suitable trafficfeatures (such as traffic content and statistics) from thelabeled flows and sends the features to the cloud for pro-cessing erefore there are u feature sets for App i in thecloud e cloud filters these u sets according to the appversions and removes common traffic After that it calculatesthe pairwise similarities of the feature sets Finally it clustersthe similarity values and identifies r repackaged malwareinstances

e traffic labeling feature extraction and filtering(labels①② and③ in Figure 7) the similarity calculationon the traffic contents and behaviors (labels④ and⑤) andclustering of similarity values (label⑥) are the core elementsof the proposed method ey will be described in detail inthe following sections

52 Traffic Labeling Filtering and Feature Extraction

521 Traffic Labeling In this study traffic labeling isconducted to identify the corresponding app for each net-work flow which is known as the app identification problem

[53] Extensive works have been conducted on the identi-fication of apps from mobile network traffic [38 54 55]However the achieved identification accuracies are all lowerthan 100 In our method we design an edge-client app forcollecting flow metaformation and identifying apps accu-rately e flow metainformation is defined in the following

Definition 1 (flow metainformation) For flow f the flowmetainformation includes the client-side IP address C-IPfthe flow starting time Tf the flow protocol Pf the server-sideIP address S-IPf the server-side port S-Portf the app nameAppNf and the app version AppVfe flow f is bidirectionaland Pf is the transport layer protocol such as TCP or UDP

e edge-client app operates similarly to a VPN appeapp sends the flow metainformation to the edge server at afixed time interval Ti Once it has received the flow meta-information the edge server saves it into a metainformationdatabase and can label mobile app traffic accurately Denotethe flow observed by the edge server as fu e edge serverlabels fu as follows

Step 1 extract the recording time Tuf of fu and its flow

protocol Puf client-side IP address CminusIPu

f server-sideIP address SminusIPu

f and server-side port SminusPortufStep 2 search the metainformation database with Pu

fCminusIPu

fSminusIPuf and SminusPortuf If records match the da-

tabase returns the record that has the most recent flowstarting time Tr

f Otherwise the edge server waits for awhile and repeats Step 2Step 3 if Tu

f is close to Trf the edge server labels flow fu

as ltAppNf AppVfgtOtherwise it waits for a while andrepeats Steps 2 and 3

In the above steps the client-side IP address CminusIPuf is

used to identify a mobile device uniquely which is correctwhen the edge server is the wireless gateway or the eNodeBIn such scenarios the mobile devices directly connect to theedge hence the IP address is unique for each deviceHowever if the edge server is deployed at the backbonerouter as illustrated in subgraph (c) of Figure 4 the CminusIPu

f

may be confusing because mobile devices could be locatedbehind NATs (Network Address Translations) and they willshare the same public IP address To solve this issue theedge-client app can insert special indicators into the packetsto uniquely represent a device In this study we alwaysassume that CminusIPu

f is unique and we will investigate theimplementation of the device-indicator in future work

In steps 2 and 3 the edge server utilizes a wait-and-repeat strategy to ensure that the traffic labeling is fresh andaccurate e edge server must wait because the mobiledevices send flow metainformation to the edge server at afixed time interval Ti (60 s in our experiments) and the savedflow metainformation in the database may be outdated Inour implementations the waiting time is set to Ti and if|Tu

f minus Trf|gt 24 h the returned flow metainformation is

regarded as obsolete

522 Traffic Feature Extraction After traffic labeling theedge server extracts detection features from the labeled

Table 1 Main notation used in this paper

Notation Meaningi Android app iu Number of mobile devices with app i installedr Number of devices with repackaged app i

f Network flowC-IPf Client-side IP address of fS-IPf Server-side IP address of fS-Portf Server-side port of fAppNf Name of the app that generates fAppVf Version of the app that generates fTi Set time intervalTu

f Recording time of flow fu at the edge serverdi Feature set of app iwj Plaintext word in the feature setsV(di) Numerical vector of traffic contentsF

j

diFeature vector for the encrypted flow j in di

TBntimesm(di) Traffic behaviors of diCSdik

Content similarity between di and dkBSdik

Behavior similarity between di and dkSdik

Final similarity between di and dk

8 Security and Communication Networks

traffic e extracted traffic features are categorized into twogroups traffic content features and traffic behavior featuresTraffic content features are the plaintext contents that areextracted from HTTP flows and traffic behavior features arethe statistics of network flows such as packet sizes andintervals Since Android apps may use an encrypted HTTPSprotocol to transmit data [54] traffic behavior features alsomust be considered e traffic content and traffic behaviorfeatures are described in detail in Sections 53 and 54

523 Traffic Filtering After traffic feature extraction theedge servers send the extracted features along with theclient-side IP address server-side IP addresses server-sideports app name AppNf and app version AppVf to the cloudNote that App i is installed on u devices hence the cloudeventually has u feature sets for App i Figure 8 illustrates thestructure of the feature set e cloud further filters these ufeature sets to increase the detection accuracy First it di-vides the u sets into v categories according to the app versionAppVf erefore each category contains the feature setsthat are generated by the same version of App i en foreach category the cloud removes the common traffic fromthe feature sets e common traffic is defined as network

flows that are contained in all features sets ese flows havethe same server-side IP address and the same server-sideport number e reason for this step is that the commontraffic will obfuscate the similarity calculations betweenapps Otherwise the similarity values between features setsare all high and they cannot be effectively used to distin-guish the repackaged malware from the original apps

After the filtering has been completed the similaritiesbetween feature sets are calculated and the similarity valuesare clustered to identify the repackaged malware

Client-side IP addressApp name AppN_fApp version AppV_fFlow 1 server-side IP address 1 server-side port 1 flow features

Flow f server-side IP address f server-side port f flow features

Figure 8 Illustration of the feature set e set contains the clientIP address app name and app version Each network flow that isgenerated by App i in a device is represented by the featuresextracted from the flow and by the server IP address and port

Trafficlabeling

Edge server 1

Traffic featureextraction

Similarity calculation of

trafficbehaviors

Similarity calculation of traffic contents Clustering of

similarity values

Original app

Repackaged malware

Cloud

Mobile device 1 2 (with app i installed)

Results

Traffic features

Traffic and flowmetainformation

ę

ę

Edge server e

Traffic features

Traffic and flow metainformation

Filtering

Mobile device u(with app i installed)

1

3

4

5

6

2

Figure 7emain procedure of the proposedmethod for Androidmalware detectione core parts of the proposedmethod are labeled bycircled numbers

Security and Communication Networks 9

53 Similarity Calculation of Traffic Content Android appsuse mainly HTTP and HTTPS protocols to communicatewith their remote servers [54] For HTTP traffic since it isplaintext based we can efficiently use the flow content tocalculate the similarity e basic strategy is to convertHTTP traffic to document files and the document files canbe handled by information retrieval technologies for simi-larity calculation HTTP flow content has also been used inapp identification from network traffic [55] However in thisstudy we only consider partial HTTP header information tobetter protect mobile usersrsquo privacy

In detail first we reassemble HTTP flows from packetsand extract the flow contents by saving the correspondingpacket contents into a text file as ASCII code Figure 9 showssuch an example en we preprocess the text file by de-leting the HTTP request content and the HTTP requestheader except for the GET (or POST) and HOST fields eHTTP response content is also deleted namely only theHTTP response header the GET (or POST) and the HOSTfiled in the HTTP request header are saved Via this ap-proach the mobile usersrsquo privacy can be protected from thecloud (for example the mobile phone information (HUA-WEI) is leaked by the User-Agent field in Figure 9) Afterthat the text file is spilt using special characters (eg ) and the processed file becomes the flow contentfeatures

e similarity is calculated via the improved TF-IDFalgorithm [56] and the cosine similarity e term frequencymeasures the count of words in a document We use theaugmented frequency measurement Denote the document(the feature set) as di and the word as wj Term frequencytf(wj di) is calculated as

tf wj di1113872 1113873 05 +05 times f wj di1113872 1113873

max f w di( 1113857 w isin di1113864 1113865 (1)

where f(wj di) is the number of occurrences of wj indocument di

e inverse document frequency measures of theamount of information a word provides Typically it is thelogarithmically scaled inverse fraction of the number ofdocuments that contain the word e inverse documentfrequency of wj is calculated as

idf wj1113872 1113873 logu

d wj isin d1113966 111396711138681113868111386811138681113868

11138681113868111386811138681113868+ 001⎛⎝ ⎞⎠ (2)

where u is the number of documents and | d wj isin d1113966 1113967| isnumber of documents in which the term wj appears

Denote by p(wj di) the product of TF(wj di) andI DF(wj) it is formulated as

p wj di1113872 1113873 log tf wj di1113872 11138731113872 1113873 times idf wj1113872 1113873

1113936

cj0 log

2 tf wj di1113872 11138731113872 1113873 times idf2 wj1113872 11138731113969 (3)

After the TF-IDF operation all documents are convertedinto equal-length numerical vectors We use the cosinesimilarity to measure the similarity between two documentvectors For documents di and dk the corresponding

numerical vectors are denoted as V(di) and V(dk) and thelength of vector is t e similarity between di and dk iscalculated as

CSdik

1113936tl1 V di( 1113857l times V dk( 1113857l( 1113857

1113936tl1 V di( 1113857

2l

1113969

times

1113936tl1 V dk( 1113857

2l

1113969 (4)

Suppose that in category vi (we divide the u feature setsgenerated by App i in to v categories in the traffic filteringstep) there are c feature sets For each feature set we save theflow features of all HTTP flows into a text file and c text filesare obtained en we calculate the numerical vector foreach feature set via equation (3) Last we compare thenumerical vector with each other as equation (4) and gain csimilarity values (CSdii

1) for each feature set We rep-resent these similarity values of feature set di (i 1 2 c)as a vector Si

contents and Sicontents (CSdi1

CSdi2 CSdic

)e vectors Si

contents (i 1 2 c) in combination with thesimilarity values of traffic behaviors will be clustered toidentify the abnormal data points namely the repackagedmalware

54 SimilarityCalculation of TrafficBehaviors In addition tocalculating the similarity values of traffic contents thesimilarity of traffic behaviors must be calculated due toencrypted flows We use m (m 11 in our study) statisticalfeatures as listed in Table 2 to represent the behavior of anencrypted network flow e units of the flow duration andthe packet interval time are milliseconds in our experiments

With the flow statistical features for feature set di the jthencrypted flow can be represented as a vector in the featurespace

Fj

di f

1j f

2j f

mj1113872 1113873 (5)

where flj represents the value of the lth statistical feature

1le llem Figure 10 shows an example of feature represen-tation for encrypted traffic Consequently the traffic be-haviors of di can be represented as a matrix of f

pj

Figure 9 Example of converting an HTTP flow to a text file

10 Security and Communication Networks

TBntimesm di( 1113857 F1di

F2di

Fndi

1113872 1113873T (6)

where n is the number of encrypted flows that are containedin di and nge 1

Since different feature sets may differ in terms of thenumber of encrypted flows traffic behaviors TBnitimesm(di) andTBnktimesm(dk) may differ in terms of different dimensionnamely the value of ni should not be equal to that of nk Toefficiently calculate the similarity between TBnitimesm(di) andTBnktimesm(dk) we calculate the Frobenius norm of TBnitimesm(di)

and TBnktimesm(dk) e formula is presented in equation (7)where norm(middot) normalizes the value of fl

j to [0 1]

FN(d)

1113944

n

j11113944

m

l1norm fl

j1113872 111387311138681113868111386811138681113868

111386811138681113868111386811138682

11139741113972

(7)

en the distance between TBnitimesm(di) and TBnktimesm(dk) iscalculated as

dis di dk( 1113857 FN di( 1113857 minus FN dk( 11138571113868111386811138681113868

1113868111386811138681113868 (8)

Finally the similarity between TBnitimesm(di) andTBnktimesm(dk) can be calculated by

BSdik 1 minus

dis di dk( 1113857

max dis dr dq1113872 1113873 r q 1 2 u rne q1113960 1113961 (9)

In the above process ni and nk are assumed to exceed 1If ni 0 (nk 0) and nkge 1 (nige 1) BSdik

is set to 0 to reflectthe significant difference between di and dk However ifni 0 and nk 0 namely there is no encrypted flow in bothdi and dk BSdik

is set to 1 because they exhibit the samebehaviors ie they do not produce any encrypted traffic

Similarly we let BSdii 1 and represent these similarity

values as a vector Sibehavior BSdi1

BSdi2 BSdiu

1113966 1113967

55 Clustering of Similarity Values and Malware DetectionFrom the traffic content similarity CSdik

and the traffic be-havior similarity BSdik

we can determine the final similarityvalue BSdik

between the feature sets di and dk as expressed inequation (10) In this equation q is the weight value In ourexperiments we set q 05 namely the traffic contentsimilarity and the traffic behavior similarity are assigned thesame weight is is because apps typically produce almostthe same numbers of HTTP and HTTPS flows [57] We alsoevaluate various weighted values in Section 62 and theexperimental results support that 05 is the most suitablevalue For feature set di a vector Sdi

(Sdi1 Sdi2

Sdiu) is

finally obtained which contains the similarity values be-tween di and the other feature setse set of vector Sdi

1113966 1113967 i

1 2 u will be clustered for the detection of repackagedmalware

Sdik q times CSdik

+(1 minus q) times BSdik (10)

To be specific we use the density peak clustering method[58] to realize our objective Density peak clustering is basedon the following assumption cluster centers are charac-terized by a higher density than their neighbors and by arelatively large distance from points that have higher den-sities Hence density peak clustering can automaticallyidentify the correct number of clusters is is importantbecause we do not know whether there is repackagedmalware or how many types of repackaged malware thereare In other words we cannot determine the number ofclusters in advance Density peak clustering can help usovercome this challenge

We assume that most mobile devices have installed theoriginal apps as discussed in Section 51 erefore afterclustering the cluster that has the largest number of voxelswill be recognized as original and the remaining clusters willbe regarded as suspicious However suspicious apps are notnecessarily repackaged malware For example apps can berepackaged to show ads only Repackaged apps of this typeshould not be simply judged as malware Additionallyoriginal apps can be suspicious due to different user be-haviors To further reduce the frequency of the false alarmswe can check the server hostnames that are contained in therepackaged clusters using VirusTotal (the URLs are includedin the HTTP traffic contents) If all of the serversrsquo hostnamesfor a suspicious cluster Cr are judged as clean it will berecognized as a cluster of abnormal but harmless appsOtherwise it will be identified as repackaged malware Viathis approach we can accurately detect repackaged Androidmalware In this step we only use VirusTotal as a white list toexclude false alarms We do not use it as a backlist (detectionsignatures) to detect repackaged malware Malware detec-tion is still realized by clustering the similarity values

Table 2 Flow statistical featuresBytes sent by the app and bytes sent by the serverNumbers of outgoing and incoming packetsMean and std dev of the outgoing and incoming packet sizesrsquoflow durationMean and std dev of the interpacket time

lt2761 16696 16 12 173 1294 289 1177 3392 121 190gt

Figure 10 Example of the conversion of an encrypted flow to afeature vector

Security and Communication Networks 11

6 Evaluation

61 System Implementation and Dataset Figure 11illustrates our experimental setup e Android phoneswere connected to the edge server via WiFi and each phonehad installed the edge-client app which sends the flowmetainformation to the edge server using Socket (jav-anetSocket) e edge is a Dell PowerEdge R730 server(with 32 processors 16GB RAM and 12 TB hard drives)that is configured as an access point In the edge server weused tcpdump (httpwwwtcpdumporg) to collect networktraffic e similarity calculations of traffic contents andbehaviors and the clustering of similarity values wereimplemented on the Windows Azure cloud computingplatforme URL checking was conducted using Virustotal(httpswwwvirustotalcom)

In the implementation we used aria2c (httpsaria2githubio) to download apps from Androzoo (httpsandrozoounilu) aria2c was encapsulated as commandsin Python code for parallel downloading After downloadingthe apps we automatically installed them on phones usingthe adb shell command Each phone had 40 tested appsinstalled We have implemented the edge-client app basedon Android VPN service In our implementations the edge-client app correlated network flows and apps by resolvingthe ports and PIDs (Process IDs) send the newly collectedflow metainformation to the edge server every 60 secondse edge server used tcpdump to collect network trafficwhich was saved as pcap files To extract traffic features weused Splitcap (httpswwwnetreseccompageSplitCap)to restore tcp flows from pcap files en for HTTP trafficwe extracted the contents using tshark (httpswwwwiresharkorgdocsman-pagestsharkhtml) commands -zfollow tcp and ascii e packet sizes and packet intervaltimes were also obtained by tshark with commands -eframelen ndashe frametime_relative In the cloud CSdik

and BSdik

were calculated using scikit-learn (httpscikit-learnorg)and Numpy (httpsplotlynumpynorm) e code ofdensity peak clustering is available on GitHub (httpsgithubcomjasonwbwDensityPeakCluster) and we havefixed various bugs that were present in the original code

We have downloaded 200 pairs of original andrepackaged apps from Androzoo for testing However notall of these downloaded repackaged apps are maliciousSome apps are only repackaged to show extra ads To screenout Android malware from the downloaded repackagedapps we further checked the 200 repackaged apps usingVirustotal If an app was judged as safe by all the antivirustools it was labelled as repackaged_safe in our experimentsOtherwise it was labelled as repackaged_malware In total143 repackaged apps were detected as malware by VirustotalNevertheless in our experiments we evaluated the proposedclusteringmethod on all 200 repackaged apps to obtainmorerealistic performance results

e downloaded apps were run in 10 rooted Androidphones in parallel to accelerate the experimental process Forall apps we used Droidbot [59] to send input events to themand to generate network traffic automatically We ran eachrepackaged app (repackaged_safe and repackaged_malware

apps) 10 times and the corresponding original app 30 timesto generate sufficient network traffic e running times ofthe original apps exceeded those of the repackaged appsbecause we assume that most of the apps are original asdiscussed in Section 51 Consequently we obtained 40traffic sets for each repackaged-original app pair Note thatfrom the perspectives of the mobile devices and the edgeserver there are only 200 distinct apps because therepackaged app has the same name as the correspondingoriginal app erefore our experiments have simulated40lowast 200 8000 mobile devices For each repackaged-orig-inal app pair there were 30 devices with the original app and10 devices with the repackaged version

In total we have gathered almost 89GB network traffice datasets that are used in our experiments are summa-rized in Table 3 In the table the traffic dataset sizes are thesizes of the captured network traffic and the feature datasetsizes are the sizes of the extracted features In our experi-ments the HTTP traffic contents and the flow statisticalfeatures which are listed in Table 2 were both saved as textfiles As shown in Table 3 the feature datasets are signifi-cantly smaller than the traffic datasets (25G vs 63G and 9G vs26G)us the edge computing can reduce the data size andincrease the data acquisition speed for cloud computing[24] Since the feature dataset sizes are still large (25G+ 9G)we did not send all these feature data to the cloud simul-taneously Instead we sent the feature data one by oneWhen the feature data of an app were successfully receivedby the cloud its traffic similarities were calculated andclustered immediately After that the feature data of anotherapp would be sent and analyzed e experimental resultswill be presented in the following

62 ExperimentalResults In this section wemainly evaluatethe detection efficiency of Android malware by clusteringsimilarity values e performance evaluation of the pro-posed method in terms of power memory and CPU con-sumptions is elaborated in the following Section 63

Similar to previous studies [8 14] to measure theperformance of the proposed method Accuracy and F-measure metrics are utilized which are defined in equations(11) and (12) In equation (11) TP FN FP and TN aredefined in Table 4 In Table 4 repackaged_x representsrepackaged_safe or repackaged_malware e precision andrecall in equation (12) are calculated via equations (13) and(14)

Accuracy TP + TN

TP + FN + FP + TN (11)

F minus measure 2 times precision times recallprecision + recall

(12)

precision TP

TP + FP (13)

recall TP

TP + FN (14)

12 Security and Communication Networks

We have evaluated the Accuracy and F-measure for eachrepackaged app Recall that we ran each repackaged app 10times and the corresponding original app 30 timeserefore for each app the number of the original apptraffic sets is 30 and the number of the repackaged app trafficsets is 10 namely TP + FN 10 and TN+FP 30 for eachapp We have randomly chosen 20 apps from repack-age_malware and the experimental results for these apps arepresented in Figures 12 and 13 Detailed descriptions of thechosen 20 apps are listed in Table 5 In the table the mainfunction of each app is determined by analyzing the packagename from the AndroidManifestxml and from the Googlesearch results of the package name As depicted in Figure 12100 accuracy was realized on a total of 6 apps (app3 app5app9 app10 app13 and app18) e minimum accuracy is85 (app2 3440) and the average accuracy is 952 Fig-ure 13 presents the F-measure values In our experimentsthe maximum F-measure value is 1 and the minimum valueis 0737 e average F-measure value is 0888 ese resultsdemonstrate the satisfactory performance of our proposedclustering method

We have also calculated the average accuracy and F-measure values for all repackaged_safe and repack-aged_malware apps In these calculationsTP + FN 57lowast10 570 and TN+FP 57lowast 301710 for

repackaged_safe apps For repackaged_malware appsTP + FN 143lowast101430 and TN+FP 143lowast 30 4290is is because in our experiments 143 apps were identifiedas repackaged malware and 57 apps as repackaged safe byVirustotal us the parameter values differ e experi-mental results are listed in Table 6

According to Table 6 the average accuracy values forrepackaged_safe and repackaged_malware apps are bothhigh However the average F-measure for repackaged_safeapps is only 078 which is lower than that of repackagedmalware is indicates that repackaged safe apps are

Router Windows Azure

Edge server DellPowerEdge R730

10 android phones

Repackagedoriginal apps

Androzoo

Figure 11 Experimental setup

Table 3 Datasets used in the experiments

of apps of runs of HTTP flows of encrypted flows Traffic dataset sizes Feature dataset sizesOriginal apps 200 30 340015 183452 asymp63G bytes asymp25G bytesRepackaged apps 200 10 110670 93041 asymp26G bytes asymp9G bytes

Table 4 Confusion matrix

ObservedDetected

Repackaged_x app Original appRepackaged_x app TP ( of TPs) FN ( of FNs)Original app FP ( of FPs) TN ( of TNs)

Apps

0

01

02

03

04

05

06

07

08

09

1

Acc

urac

y

App

1A

pp2

App

3A

pp4

App

5A

pp6

App

7A

pp8

App

9A

pp10

App

11A

pp12

App

13A

pp14

App

15A

pp16

App

18A

pp17

App

19A

pp20

Figure 12 Detection accuracies for the 20 selected apps

Security and Communication Networks 13

difficult to distinguish from their original apps is isbecause repackaged_safe apps only generate smallamounts of additional network flows and their networktraffic features are not clearly distinguishable from thoseof normal traffic Fortunately the average F-measure ofrepackaged malware apps is outstanding is may be dueto the fact that malware typically send stolen data to orreceive commands from remote servers Hence theirnetwork traffic features are special and distinguishableFigure 14 presents an example of app clustering and il-lustrates the distinguishability of the repackaged malwareand the original app ese results demonstrate thatrepackaged Android malware can be effectively detectedby comparing their network traffic with that of originalapps

For repackaged malware apps the accuracy results withdifferent weight values (q in equation (10)) are presented inFigure 15 As depicted in the figure when q 05 we realizethe highest accuracy rate 969 If q 0 namely only thefeatures of encrypted traffic are used the accuracy result is753 e result is 617 when q 1 namely only thefeatures of plaintext traffic are considered ese resultsdemonstrate that both the encrypted and plaintext trafficcontribute to the differentiation of the original apps and therepackaged malware However the encrypted traffic con-tributes more than the plaintext traffic is may becausemalicious apps usually connect to their servers by securityprotocols to evade IDS detection

In the previous experiments we checked the serversrsquohostnames using VirusTotal to reduce the frequency of thefalse positives For various clusters that contained less than20 elements if all the serversrsquo hostnames were judged asclean the cluster was identified as an original app rather thana repackaged one As argued in Section 55 this process onlyreduces the frequency of false alarms To support this ar-gument we list the numbers of true positives and falsepositives of the 143 repackaged malware in Table 7 forcomparison When conducting the hostname checking thenumber of true positives is 1400 and the result is the samewhen the hostname checking is turned off is findingsupports that the malware is detected by clustering thesimilarity values However the number of false positivesincreased from 47 to 213 when the hostname checking wasturned off erefore we suggest checking the serversrsquohostnames after clustering to increase the practicality of themethod Note that false positives still occur when we checkthe hostnames by VirusTotal Hence there may be malwarein the original apps In our dataset the typical examples arecommobovideoplayerpro and cncfshop ele1taoxiaosanWe further analyze these two apps by reading theirdecompiled codes and we determine that they indeed es-tablish connections with suspicious servers In detail invarious runs app commobovideoplayerpro connects withhttpwwwtopappsquarecomwhich is judged as maliciousby CyRadar in VirusTotal App cncfshop ele1taoxiaosanvisits suspicious URL http955ccjSn9 Since these apps arethe original apps they are identified as false positives in theexperiments

In our experiments we chose density peak clustering asthe clustering method because density peak clustering canautomatically determine the correct number of clusters [58]Hence our method can detect various versions of malwarethat are repackaged from the same original app Unfortu-nately in the apps that were downloaded from AndroZoono malware was repackaged from the same original app Toevaluate the efficiency of detecting versions of malware wehave created two malware instances from a popular game

Apps

App

1A

pp2

App

3A

pp4

App

5A

pp6

App

7A

pp8

App

9A

pp10

App

11A

pp12

App

13A

pp14

App

15A

pp16

App

18A

pp17

App

19A

pp20

0

01

02

03

04

05

06

07

08

09

1

F-m

easu

re

Figure 13 Detection F-measure for the 20 selected apps

Table 5 Detailed descriptions of the chosen 20 apps

Main function ofapp

of HTTPflows

of encryptedflows

App1 VideoPlayer 453 51App2 Game 657 1455App3 Browser 1531 1478App4 MusicPlayer 341 25App5 Game 612 289App6 Game 788 123App7 Game 1336 524App8 Fitness 357 892App9 Email 122 431App10 Game 963 790App11 MusicPlayer 1128 547App12 Game 1598 426App13 News 2460 1178App14 News 2891 1543App15 Chat 171 49App16 Email 143 368App17 WeatherForecast 397 120App18 Game 1832 921App19 Fitness 732 362App20 OnlineLearning 2378 834

Table 6 Average accuracy and F-measure

Average accuracy () AverageF-measure

Repackaged_safe 903 078Repackaged_malware 969 094

14 Security and Communication Networks

app namely aircomaceviralmotox3 One is produced bygrafting AnserverBotrsquos source code into air-comaceviralmotox3 and the other is created by adding codefor stealing usersrsquo contacts and short messages and sendingthem to a remote server Similarly each malware is run 10times and the original app aircomaceviralmotox3 is alsorun 10 times for validatione clustering result is presentedin Figure 16 Apparently there are 3 clusters e

experimental results show that the value of F-measure foreach malware is 1 Hence our method can detect variousmalware that were repackaged from the same original app

63 Comparison In this section we compare our methodwith the typical methods that are based on network or cloudas illustrated in Figure 5 e comparative indicators are thedetection accuracy F-measure average power CPU andmemory consumptions of the mobile device First wecompare the proposed method with AppFA [14] which isimplemented completely at the network level Since AppFAalso must analyze app traffic we directly use the traffic sets ofthe repackaged malware apps for comparison While testingAppFA we deleted the label information that was containedin the traffic sets and mixed the network flows as a wholeen we used the signature matching and the constrainedK-means clustering algorithm proposed in [14] to restoreapp sessions namely the label for each flow e similarapps used in AppFA were chosen in Google Play via themethod introduced in [14] and for each repackaged mal-ware 10 similar apps were selected as the peer group epeer group apps were run once by Droidbot to generatenetwork traffic e comparison results are listed in Table 8

Since AppFA need not to install some special apps onmobile devices the extra power memory and CPU con-sumptions are 0 However the detection accuracy is muchlower (863 vs 969) than that of the proposedmethod ascompared in Table 8 In the proposed method we use theedge-client app to record flow metainformation thusnetwork flows can be accurately labeled However inAppFA flow labeling is realized via traffic clustering and theclustering accuracy cannot be 100 Meanwhile AppFAcompares the appsrsquo network behaviors with those of theirsimilar apps which also results in errors In the proposedmethod we directly compare the repackaged malware withits original app and the detection accuracy is increased eaverage power CPU and memory consumptions of theproposed method are also low and acceptable

In Table 8 the power memory and CPU consumptionsfor our edge-client app were obtained with the help of appGT (httpsgithubcomTencentGT) GT is a portable toolthat runs on smartphones for debugging the APP internalparameters and the code time-consuming statistics eaverage values for the power memory and CPU con-sumptions are calculated as follows Before running an appwith Droidbot we started the edge-client app by GT Whenthe Droidbot was stopped we recoded the power CPU andmemory consumptions We repeated the process until allapps (including 143 repackaged malware apps and theiroriginal versions) had been tested Via this way we obtainedthe average power CPU and memory consumptions for theedge-client app According to Table 8 the edge-client app islightweight and has minimal impact on the performances ofthe mobile devices

en we compare the proposed method with SAMA-Droid [44] SAMADroid monitors Android system calls andsends the calls to a centralized server for the detection ofAndroid malware SAMADroid is a typical cloud-based on-

Decision graph rho-delta-rho

20 251510 35 403050Sorted rho

00

02

04

06

08

Rho lowast

delt

a

Figure 14 Example of app clustering Obviously two clusters areidentified e dot in the top left corner is an outlier e outlier isdetected as an abnormal but harmless app by our method isfigure was generated by the density peak clustering tool

01 02 03 04 05 06 07 08 09 10Value of q

0

01

02

03

04

05

06

07

08

09

1

Acc

urac

y

Figure 15 Accuracies of various weight values for traffic contentsimilarity and traffic behavior similarity

Table 7 True positives and false positives with and withouthostname checking

of true positives of false positivesWith hostname checking 1400 47Without hostnamechecking 1400 213

Security and Communication Networks 15

device detection method We have implemented the dy-namic detection according to the description in [44] Indetail the APIs that are invoked by the running apps aremonitored by the tool Sensitive API Monitor (httpsgithubcomcyruliuSensitive-API-Monitor) As done in [44] thefrequency values of 10 system calls (open ioctl brk readwrite close sendto sendmsg recvfrom and recvmsg) arerecorded ese recorded frequency values are sent toWindows Azure for malware detection We evaluate thedynamic detection performance of SAMADroid on our 143repackaged_malware apps and their original versions In theevaluation the frequency values of the original apps are usedto create the normality model and the frequency values ofthe repackaged malware apps are used for testing We alsouse the app GT to record the power memory and CPUconsumptions of SAMADroid Note that our phones wereall rooted and Sensitive_API_Monitor and GT can be usede experimental results are presented in Table 9

Last we compare our method with Shabtairsquos method [8]Shabtairsquos method detects Android malware at the mobiledevices and it considers the apprsquos network traffic patternsonly the HTTP traffic contents are not considered isgives us an opportunity to compare our method withmethods that have been specially designed for encryptedtraffic We have implemented Shabtairsquos method and chosenthe feature subset 2 as the detection features Feature subset2 includes Avg Sent Bytes Avg Rcvd Bytes and Pct ofAvg Rcvd Bytes among other values and it is proved to be

the best feature subset for malware detection [8] e De-cisionRegression tree is chosen as the classification algo-rithm e testing dataset is the traffic that was generated bythe 143 repackaged malware instances and the training set isthe traffic that was generated by the corresponding originalapps e classification results are listed in Table 10According to the table the detection accuracy and F-mea-sure values of our method are significantly higher than thoseof Shabtairsquos method is indicates that the HTTP trafficprovides useful information for Android malware detectionSince Shabtairsquos method is conducted at the mobile devices itconsumes more resources than our method as compared inTable 10

7 Discussion

e results of our extensive experiments demonstrate thatrepackaged Android malware can be efficiently and effec-tively detected by clustering network traffic at the edgedevices However there are still limitations for our methodFirst we assume the most of the devices have installed theoriginal apps (rlt (12)u as described in Section 51) is isreasonable in most cases since mobile users typicallydownload apps from the official markets However in areaswhere official markets such as the Google play and Amazonare banned it is possible that the most mobile devices haveinstalled the repackaged apps is may cause false positivesSecond if the malware does not generate sufficient trafficour method cannot detect it For example some malwareonly sends fraudulent premium SMS messages is type ofmalware cannot be detected by monitoring its networktraffic ird we rely on third-party tools such as Virustotalfor the identification of malicious URLs and to reduce thenumber of false alarms If a malicious app rents legalplatforms such as the public cloud as the remote controlserver our method will judge it as a normal app Hence falsenegatives will be generated

To overcome these limitations we could combine theproposed method with available methods such as SAMA-Droid In such cases the edge-client app will not only recordflow metainformation but also monitor app system be-haviors However the system behavior monitoring does notneed to be conducted all the time If abnormal but harmlessor lower traffic apps are detected the edge-client app will beinformed to record the corresponding appsrsquo system be-haviors Similarly the edge-client app can send the appsystem behaviors to the edge server for further processinge performance of this combined method will be evaluatedin our future work One can also combine the proposedframework with AppFA [14] for the detection of other typesof Android malware For example the edge extracts apptraffic behavior features and the cloud can compare appsrsquobehaviors with their peer groups to identify abnormalitiesFurthermore one could first use clustering techniques [60]to group repackaged apps according to the same maliciousmodules and then analyze the behaviors of the representativerepackaged apps in each group to obtain more accuratebehavior features

y

2D nonclassical multidimensional scaling

00ndash02 02ndash04x

ndash04

ndash03

ndash02

ndash01

00

01

02

03

04

Figure 16 Clustering results of various malware that wererepackaged from aircomaceviralmotox3e outlier in the middleis detected as an abnormal but harmless app is figure wasgenerated by the density peak clustering tool

Table 8 Comparison of the proposed method with AppFA

Proposed method AppFAAccuracy 969 863F-measure 094 075Average power consumption 12 0Average CPU consumption 06 0Average memory consumption 654MB 0

16 Security and Communication Networks

e paradigms of edge computing are used to detectAndroid malware in this study However several challengeswould be encountered when deploying it in practice Firstthe mobile devices must install a specified app for thecollection of essential information which may cause con-cerns and mobile users may refuse to cooperate Second inedge computing the data should be preprocessed by the edgeto better protect the usersrsquo privacy However the edge servercan observe all the data (including the sensitive informa-tion) and it threatens the privacy of users For example ifthe edge is deployed by the cloud provider the usersrsquo privacyinformation is also available to the cloud ere is no dif-ference with the user-cloud model for privacy protectionunless the edge server is deployed by the userird the edgeserver should be well protected from network attacksOtherwise it will be convenient for hackers to steal usersrsquoprivate information

8 Conclusion

In this paper we propose a novel method for detectingAndroid malware based on edge computing To the best ofour knowledge the detection of Android malware by uti-lizing edge computing and traffic clustering has not beenconsidered previously By introducing the edge layer themain tasks of the mobile device are offloaded to the edgeserver and the mobile device only needs to record flowmetainformation which substantially conserves the re-sources of the mobile device e edge server extracts appsrsquonetwork traffic features and sends these features to the cloudplatform for further analysis us the usersrsquo privacy can bebetter protected With the received features the cloud cal-culates the similarities between apps and clusters thesesimilarity values to separate the original apps and themalware automatically We evaluated our methods on 400Android apps and the experimental results demonstratethat the detection accuracy can reach 100 for several appsand the average accuracy is 969 We also tested theperformance of the proposed method and compared it with

the typical approaches e comparison results show thatour method can detect repackaged Android malware withboth higher detection accuracy and lower power and re-source consumptions In the future we will continue to workto overcome the discussed limitations and to evaluate theproposed method in real world scenarios

Data Availability

e downloaded apps used to support the findings of thisstudy have been deposited in httpspanbaiducoms1b8vMceIdtdd1gBRC5Gbzvw Its extraction code is vd5re clustering code is also in the repository Since appnetwork traffic contains usersrsquo sensitive information itcannot be publicly shared However we have explained allthe steps to generate network traffic automatically esesteps are also described in the instructionspdf in therepository

Conflicts of Interest

e authors declare that they have no conflicts of interest

Acknowledgments

is work was supported by National Key TechnologiesResearch and Development Program of China under Grant2018YFD0401404 National Natural Science Foundation ofChina under Grants 61702282 61802192 and 71801123Natural Science Foundation of the Jiangsu Higher EducationInstitutions of China under Grants 18KJB520024 and17KJB520023 Nanjing Forestry University under GrantsGXL016 and CX2016026 and NUPTSF under GrantNY217143

References

[1] Statista Mobile Operating Systemsrsquo Market Share Worldwidefrom January 2012 to December 2019 Statista HamburgGermany 2009 httpswwwstatistacomstatistics272698global-market-share-held-by-mobile-operating-systems-since-2009

[2] W Martin F Sarro Y Jia Y Zhang and M Harman ldquoAsurvey of app store analysis for software engineeringrdquo IEEETransactions on Software Engineering vol 43 no 9pp 817ndash847 2017

[3] httpswwwappbraincomstatsnumber-of-android-apps2020

[4] L Li D Li T F Bissyande et al ldquoUnderstanding android apppiggybacking a systematic study of malicious code graftingrdquoIEEE Transactions on Information Forensics and Securityvol 12 no 6 pp 1269ndash1284 2017

[5] M Fan J Liu X Luo et al ldquoAndroid malware familialclassification and representative sample selection via frequentsubgraph analysisrdquo IEEE Transactions on Information Fo-rensics and Security vol 13 no 8 pp 1890ndash1905 2018

[6] J Zhang Z Qin K Zhang H Yin and J Zou ldquoDalvik opcodegraph based android malware variants detection using globaltopology featuresrdquo IEEE Access vol 6 pp 51964ndash51974 2018

[7] S Garg S K Peddoju and A K Sarje ldquoNetwork-baseddetection of android malicious appsrdquo International Journal ofInformation Security vol 16 no 4 pp 385ndash400 2016

Table 9 Comparison of the proposed method with SAMADroid

Proposed method SAMADroidAccuracy 969 831F-measure 0940 0793Average power consumption 12 23Average CPU consumption 06 08Average memory consumption 654MB 773MB

Table 10 Comparison of the proposed method with Shabtairsquosmethod

Proposed method Shabtairsquosmethod

Accuracy 969 771F-measure 0940 0695Average power consumption 12 77Average CPU consumption 06 23Average memory consumption 654MB 937MB

Security and Communication Networks 17

[8] A Shabtai L Tenenboim-Chekina D Mimran L RokachB Shapira and Y Elovici ldquoMobile malware detection throughanalysis of deviations in application network behaviorrdquoComputers amp Security vol 43 pp 1ndash18 2014

[9] L Xie X Zhang J-P Seifert and S Zhu ldquopBMDS a be-havior-based malware detection system for cellphone de-vicesrdquo in Proceedings of the ird ACM Conference onWireless Network Security ACM New York NY USApp 37ndash48 2010

[10] I Burguera U Zurutuza and S Nadjm-Tehrani ldquoCrowdroidbehavior-based malware detection system for androidrdquo inProceedings of the 1st ACMWorkshop on Security and Privacyin Smartphones and Mobile Devices ACM Chicago IL USApp 15ndash26 October 2011

[11] G Dini F Martinelli A Saracino and D SgandurraldquoMADAM a multi-level anomaly detector for android mal-warerdquo in Proceedings of the International Conference onMathematical Methods Models and Architectures for Com-puter Network Security vol 12 Springer St PetersburgRussia pp 240ndash253 2012

[12] Y Zhang M Yang B Xu et al ldquoVetting undesirable be-haviors in android apps with permission use analysisrdquo inProceedings of the 2013 ACM SIGSAC Conference on Com-puter amp Communications Security ACM Berlin Germanypp 611ndash622 November 2013

[13] SWang Z Chen X Li LWang K Ji and C Zhao ldquoAndroidmalware clustering analysis on network-level behaviorrdquo inProceedings of the International Conference on IntelligentComputing Springer Liverpool UK pp 796ndash807 August2017

[14] G He B Xu and H Zhu ldquoAppFA a novel approach to detectmalicious android applications on the networkrdquo in Securityand Communication Networks Wiley Hoboken NJ USA2018

[15] X Wang Y Yang and Y Zeng ldquoAccurate mobile malwaredetection and classification in the cloudrdquo SpringerPlus vol 4no 1 p 583 2015

[16] K Tam A Feizollah N B Anuar R Salleh and L Cavallaroldquoe evolution of android malware and android analysistechniquesrdquo ACM Computing Surveys (CSUR) vol 49 no 4p 76 2017

[17] M Sun X Li J C S Lui R T B Ma and Z Liang ldquoMonet auser-oriented behavior-based malware variants detectionsystem for androidrdquo IEEE Transactions on Information Fo-rensics and Security vol 12 no 5 pp 1103ndash1112 2017

[18] T Vidas and N Christin ldquoEvading android runtime analysisvia sandbox detectionrdquo in Proceedings of the 9th ACMSymposium on Information Computer and CommunicationsSecurity pp 447ndash458 Kyoto Japan June 2014

[19] Y Duan M Zhang A V Bhaskar et al ldquoings you may notknow about android (Un) packers a systematic study basedon whole-system emulationrdquo in Proceedings of the 2018Network and Distributed System Security Symposium pp 1ndash15 San Diego CA USA February 2018

[20] L Xue X Luo L Yu S Wang and D Wu ldquoAdaptiveunpacking of android appsrdquo in Proceedings of the 2017 IEEEACM 39th International Conference on Software Engineering(ICSE) IEEE Buenos Aires Argentina pp 358ndash369 May2017

[21] Y Zhou and X Jiang ldquoDissecting android malware char-acterization and evolutionrdquo in Proceedings of the 2012 IEEESymposium on Security and Privacy IEEE San Francisco CAUSA pp 95ndash109 2012

[22] N W Lo S K Lu and Y H Chuang ldquoA framework for thirdparty android marketplaces to identify repackaged appsrdquo inProceedings of the 2016 IEEE 14th International Conference onDependable Autonomic and Secure Computing 14th Inter-national Conference on Pervasive Intelligence and Computing2nd International Conference on Big Data Intelligence andComputing and Cyber Science and Technology Congresspp 475ndash482 Auckland New Zealand August 2016

[23] L Li J Gao M Hurier et al ldquoAndrozoo++ collectingmillions of android apps and their metadata for the researchcommunityrdquo 2017 httpsarxivorgabs170905281

[24] W Shi J Cao Q Zhang Y Li and L Xu ldquoEdge computingvision and challengesrdquo IEEE Internet of ings Journal vol 3no 5 pp 637ndash646 2016

[25] W Zhou Y Zhou X Jiang and P Ning ldquoDetectingrepackaged smart-phone applications in third-party androidmarketplacesrdquo in Proceedings of the Second ACM Conferenceon Data and Application Security and Privacy ACM AntonioTX USA pp 317ndash326 February 2012

[26] Y Shao X Luo C Qian P Zhu and L Zhang ldquoTowards ascalable resource-driven approach for detecting repackagedandroid applicationsrdquo in Proceedings of the 30th AnnualComputer Security Applications Conference ACM NewOrleans LA USA pp 56ndash65 December 2014

[27] K Chen P Wang Y Lee et al ldquoFinding unknown malice in10 seconds mass vetting for new threats at the google-playscalerdquo in Proceedings of the USENIX Security Symposiumvol 15 Washington DC USA August 2015

[28] Z Wang C Li Y Guan and Y Xue ldquoDroidchain a novelmalware detection method for android based on behaviorchainrdquo in Proceedings of the 2015 IEEE Conference onCommunications and Network Security (CNS) IEEE Flor-ence Italy pp 727-728 September 2015

[29] K Tian D D Yao B G Ryder G Tan and G PengldquoDetection of repackaged android malware with code-het-erogeneity featuresrdquo IEEE Transactions on Dependable andSecure Computing vol 17 no 1 pp 64ndash77 2017

[30] A De Lorenzo F Martinelli E Medvet F Mercaldo andA Santone ldquoVisualizing the outcome of dynamic analysis ofandroid malware with vizmalrdquo Journal of Information Se-curity and Applications vol 50 Article ID 102423 2020

[31] M Lin D Zhang X Su and T Yu ldquoEffective and scalablerepackaged application detection based on user interfacerdquo inProceedings of the 2017 IEEE Conference on ComputerCommunications Workshops (INFOCOM WKSHPS) IEEEAtlanta GA USA May 2017

[32] G Meng M Patrick Y Xue Y Liu and J Zhang ldquoSecuringandroid app markets via modelling and predicting malwarespread between marketsrdquo IEEE Transactions on InformationForensics and Security vol 14 no 7 pp 1944ndash1959 2018

[33] A Arora and S K Peddoju ldquoMinimizing network trafficfeatures for android mobile malware detectionrdquo in Proceed-ings of the 18th International Conference on DistributedComputing and Networking ACM Hyderabad India January2017

[34] X Wu D Zhang X Su and W Li ldquoDetect repackagedAndroid application based on HTTP traffic similarity httptraffic similarityrdquo Security and Communication Networksvol 8 no 13 pp 2257ndash2266 2015

[35] J Malik and R Kaushal ldquoCredroid android malware de-tection by network traffic analysisrdquo in Proceedings of the 1stACM Workshop on Privacy-Aware Mobile Computing ACMPaderborn Germany pp 28ndash36 July 2016

18 Security and Communication Networks

[36] A Zulkifli I R A Hamid W M Shah and Z AbdullahldquoAndroid malware detection based on network traffic usingdecision tree algorithmrdquo in Proceedings of the InternationalConference on Soft Computing and Data Mining SpringerSenai Malaysia pp 485ndash494 January 2018

[37] Z Chen Q Yan H Han et al ldquoMachine learning basedmobile malware detection using highly imbalanced networktrafficrdquo Information Sciences vol 433-434 pp 346ndash364 2018

[38] G He B Xu L Zhang and H Zhu ldquoMobile app identifi-cation for encrypted network flows by traffic correlationrdquoInternational Journal of Distributed Sensor Networks vol 14no 12 pp 1ndash17 2018

[39] L Xue Y Zhou T Chen X Luo and G Gu ldquoMalton to-wards on-device non-invasive mobile malware analysis forARTrdquo in Proceedings of the 26th USENIX Security Symposiumpp 289ndash306 Vancouver Canada August 2017

[40] M K Alzaylaee S Y Yerima and S Sezer ldquoDL-Droid deeplearning based android malware detection using real devicesrdquoComputers amp Security vol 89 Article ID 101663 2020

[41] A S Shamili C Bauckhage and T Alpcan ldquoMalware de-tection on mobile devices using distributed machine learn-ingrdquo in Proceedings of the 20th International Conference onPattern Recognition (ICPR) IEEE Istanbul Turkeypp 4348ndash4351 August 2010

[42] M Zhao T Zhang F Ge and Z Yuan ldquoRobotdroid alightweight malware detection framework on smartphonesrdquoJournal of Networks vol 7 no 4 p 715 2012

[43] K A Talha D I Alper and C Aydin ldquoAPK auditor per-mission-based android malware detection systemrdquo DigitalInvestigation vol 13 pp 1ndash14 2015

[44] S Arshad M A Shah A Wahid A Mehmood H Song andH Yu ldquoSamadroid a novel 3-level hybrid malware detectionmodel for android operating systemrdquo IEEE Access vol 6pp 4321ndash4339 2018

[45] G He L Zhang B Xu and H Zhu ldquoDetecting repackagedandroid malware based on mobile edge computingrdquo inProceedings of the 2018 Sixth International Conference onAdvanced Cloud and Big Data (CBD) IEEE Lanzhou Chinapp 360ndash365 August 2018

[46] R Roman J Lopez andMMambo ldquoMobile edge computingFog et al a survey and analysis of security threats andchallengesrdquo Future Generation Computer Systems vol 78pp 680ndash698 2018

[47] F Bonomi R Milito J Zhu and S Addepalli ldquoFog com-puting and its role in the internet of thingsrdquo in Proceedings ofthe First Edition of the MCC Workshop on Mobile CloudComputing ACM Helsinki Finland pp 13ndash16 August 2012

[48] M T Beck MWerner S Feld and S Schimper ldquoMobile edgecomputing a taxonomyrdquo in Proceedings of the Sixth Inter-national Conference on Advances in Future Internet pp 48ndash55 Lisbon Portugal November 2014

[49] P Mach and Z Becvar ldquoMobile edge computing a survey onarchitecture and computation offloadingrdquo IEEE Communi-cations Surveys amp Tutorials vol 19 no 3 pp 1628ndash1656 2017

[50] H T Dinh C Lee D Niyato and P Wang ldquoA survey ofmobile cloud computing architecture applications and ap-proachesrdquo Wireless Communications and Mobile Computingvol 13 no 18 pp 1587ndash1611 2013

[51] X Lyu H Tian L Jiang et al ldquoSelective offloading in mobileedge computing for the green internet of thingsrdquo IEEENetwork vol 32 no 1 pp 54ndash60 2018

[52] S Agrawal and J Agrawal ldquoSurvey on anomaly detectionusing data mining techniquesrdquo Procedia Computer Sciencevol 60 pp 708ndash713 2015

[53] A Tongaonkar ldquoA look at the mobile app identificationlandscaperdquo IEEE Internet Computing vol 20 no 4 pp 9ndash152016

[54] G He B Xu and H Zhu ldquoIdentifying mobile applications forencrypted network trafficrdquo in Proceedings of the 2017 FifthInternational Conference on Advanced Cloud and Big Data(CBD) IEEE Shanghai China pp 279ndash284 August 2017

[55] G Ranjan A Tongaonkar and R Torres ldquoApproximatematching of persistent lexicon using search-engines forclassifying mobile app trafficrdquo in Proceedings of the 35thAnnual IEEE International Conference on Computer Com-munications IIEEE San Francisco CA USA April 2016

[56] L Li and S Qu ldquoShort text classification based on improvedITCrdquo Journal of Computer and Communications vol 1 no 4pp 22ndash27 2013

[57] A H Lashkari A F A Kadir L Taheri and A A GhorbanildquoToward developing a systematic approach to generatebenchmark android malware datasets and classificationrdquo inProceedings of the 2018 International Carnahan Conference onSecurity Technology (ICCST) IEEE Montreal Canada Oc-tober 2018

[58] A Rodriguez and A Laio ldquoClustering by fast search and findof density peaksrdquo Science vol 344 no 6191 pp 1492ndash14962014

[59] Y Li Z Yang Y Guo and X Chen ldquoDroidbot a lightweightUI-guided test input generator for androidrdquo in Proceedings ofthe Software Engineering Companion (ICSE-C) pp 23ndash26Buenos Aires Argentina May 2017

[60] M Fan X Luo J Liu et al ldquoGraph embedding based familialanalysis of android malware using unsupervised learningrdquo inProceedings of the 2019 IEEEACM 41st International Con-ference on Software Engineering (ICSE) IEEE Piscataway NJUSA pp 771ndash782 May 2019

Security and Communication Networks 19

Page 6: On-DeviceDetectionofRepackagedAndroidMalwarevia …downloads.hindawi.com/journals/scn/2020/8630748.pdf · 2020-05-31 · According to [4–6], most Android malware programs are repackagedapps,andthemajorityofnewAndroidmalware

starting time flow protocol server-side IP address server-side port app name and version for each network flowLater these data are sent to the edge servers for traffictagging Since the amounts of data that are saved and sent aresmall the performances of mobile devices are not signifi-cantly degraded e edge servers extract detection featuresfrom the labeled app network traffic and send the extractedfeatures to the cloud to be clustered After clustering theoriginal and repackaged apps can be efficiently distin-guished In our study only network traffic is considered inAndroid malware detection In contrast to previous net-work-based methods such as [14 34ndash36] we do not need tomodel appsrsquo network behaviors in advance and our methodis more useful in practice

4 Framework for Android Malware Detection

e on-device detection of Android malware is challengingbecause mobile devices typically have limited resources Toalleviate the storage and computing limitations and prolongthe lifetimes of the mobile devices in this study we proposea new framework for on-device detection of repackagedAndroid malware e proposed framework is illustrated inFigure 3

Our framework is composed of three layers mobiledevices the edge and the cloud First we explain thepossible locations of the edge later we introduce thefunctionalities of each component In our framework theonly requirement for the edge is that it can monitor andprocess the mobile network traffice mobile devices couldconnect to the Internet via WiFi or 3G4G hence the edgenodes can be wireless gateways household security gatewayssuch as BitDefender Box and LTE eNodeBs e edge nodescan also be routers that are located in the backbone networkif they can observe the traffic that is generated by the mobiledevices erefore the deployment locations of the edgenodes are flexible in practice Figure 4 illustrates the possibledeployment scenarios of the edge

In the proposed framework the mobile devices mustcollect and send flow metaformation to the edge for datapreprocessing For this an edge-client app is designed to berun on themobile devicese edge-client app operates as anAndroid VPN app Its main objective is to collect andtransfer the flow metainformation which includes the appname flow starting time and flow protocol to the edgeiteratively With the flow metainformation the edge serverpreprocesses the network traffic by adding a label (the appname) to each flow After that it extracts detection featuresfrom these labeled flows e traffic labeling and featureextraction are illustrated in detail in Section 52 Once thefeature extraction has been completed the edge server sendsthese features to the cloud With the received features thecloud conducts the malware detection tasks and notifies themobile devices of the detection results

To further illustrate the advantages of the proposedframework we compare it with the available frameworksSince the proposed framework is based on edge computingwe will refer to it as the edge-computing-based framework inthe following Similarly the available frameworks are termed

as the cloud-based framework (the server-side detectionmethods that were described in Section 1) and the network-based framework (the network-based methods that wereintroduced in Sections 1 and 3) for differentiation Com-pared to the popular cloud-based framework (Figure 5(a))the most prominent advantage of the edge-computing-basedframework is that the feature extraction process is offloadedto the edge and the mobile devices must only save andtransmit the flow metainformation us the proposededge-computing-based framework is resource-friendly formobile devices

e network-based framework (Figure 5(b)) monitorsnetwork traffic without any interference withmobile devicesHowever it is less accurate since the network nodes cannotknow the origins of the network flows precisely namely itcannot identify app traffic with 100 accuracy As dem-onstrated in the literature [14] since the network nodecannot identify the app for each flow (eg some flows lackssignatures or are encrypted) the appsrsquo network behaviors areincomplete and the malware detection accuracy is reducedIn the proposed edge-computing-based framework wedesign a lightweight edge-client app for gathering app in-formation Furthermore mobile usersrsquo privacy can be betterprotected in our proposed framework By deploying onersquosown edge server a mobile user can control the types ofinformation that can be used as detection features Incontrast in the network-based framework the networknodes can observe the full traffic content and extract anyinformation they want which severely threatens the privacyof mobile users

Android malwaredetectionCloud

Edge

Mobiledevices

(1) Flowmetainformation

(2) Extractedfeatures

(3) Detectionresults

Network connectionData transmission

Figure 3 Framework for android malware on-device detectionthat is based on edge computing Mobile devices send flow met-ainformation such as the app name and the server-side IP addressto the edge Edge servers capture app network traffic and extractHTTP contents and flow statistical features from the capturedtraffic Later the features are sent to the cloud for malware de-tection e cloud also interacts with the mobile devices to returnthe detection results

6 Security and Communication Networks

After introducing the general architecture we outline thedetection methods that can be used in the edge-computing-based framework e cloud can use various methods todetect Android malware For example it can use conven-tional signature matching [35] and network behavioranalysis [8 14] to identify malicious apps However thesetypes of methods require a priori information (such assignatures or behavior models) In this paper we propose anovel repackaged malware detection method that clustersthe network traffic that is generated by the same app fromvarious mobile devices Figure 6 presents an example of thisprocess

e motivations behind the traffic clustering are two-fold first most Android malware is produced by repack-aging popular apps [4] and popular apps are often widelydistributed Second the network traffic generated by therepackaged malware differs from that generated by theoriginal app (as illustrated in Section 21) As illustrated inFigure 3 many mobile devices would connect to the cloudfor malware detection Hence the cloud can observe thenetwork traffic features of both the original apps and theirrepackaged versions with large probabilities due to the widedistribution of popular apps and the large number of mobiledevices If we regard the features of the original apps as thenormal data and the features of the repackaged versions asthe abnormal data the repackaged malware detectionproblem is transformed into the abnormal detection

problem Clustering is an efficient method for solving thisproblem [52] By clustering appsrsquo network traffic features weneed not model appsrsquo network behaviors in advance andmore importantly we can detect various repackaged ver-sions of the original app as illustrated in Figure 6e detailsof the proposed malware detection method will be describedin detail in Sections 53ndash55

5 Methodology

In this section we introduce the technical details of theproposed clustering method for Android malware detectionTable 1 defines the main notation that we use in this paper

(a) (b) (c)

Figure 4 Possible deployments of the edge e edge could be the wireless gateways or the dedicated servers that are connected to theeNodeB or the routers (a) Wireless gateway (b) eNodeB+ server (c) Router + server

Clustering

Traffic features of an app collected from

different mobile devicesOriginal apps

Repackaged apps

Figure 6 Illustration of traffic clustering for repackaged malwaredetection e plain circles represent original apps and the circleswith triangles and inverted triangles are repackaged ones

Network connectionData transmission

Android malwaredetectionCloud

Mobiledevices

(1) Extractedfeatures

(2) Detectionresults

(a)

Network connectionData transmission

Network node

Mobiledevices

(1) Detectionresults

(b)

Figure 5 Available frameworks for android malware detection Various approaches such as [10] send the extracted features to a remoteserver other than to the cloud We still regard them as following the cloud-based framework because there is no essential difference formalware detection As illustrated in (b) the detection results are typically presented to the network administrator and are not fed back to themobile devices (a) Framework based on cloud (b) Framework based on network

Security and Communication Networks 7

51 System Overview Figure 7 illustrates the main proce-dure of the proposed approach App i is the app for vettingand it has been installed on umobile devicesese u devicesare connected to e edge servers We assume that r deviceshave installed the repackaged malware (the malware is alsopresented as App i to deceive the users) and rlt (12)uHence most of the devices have installed the original ver-sion is situation is reasonable because repackaged An-droidmalware is typically distributed in third-party marketswhich are less popular than official markets [23] us thenumbers of downloads and installations are relatively smallWe will use this characteristic to classify the clusteringresults

To detect the repackaged malware the edge server labelsthe network flows that are generated by App i according tothe flow metainformation en it extracts suitable trafficfeatures (such as traffic content and statistics) from thelabeled flows and sends the features to the cloud for pro-cessing erefore there are u feature sets for App i in thecloud e cloud filters these u sets according to the appversions and removes common traffic After that it calculatesthe pairwise similarities of the feature sets Finally it clustersthe similarity values and identifies r repackaged malwareinstances

e traffic labeling feature extraction and filtering(labels①② and③ in Figure 7) the similarity calculationon the traffic contents and behaviors (labels④ and⑤) andclustering of similarity values (label⑥) are the core elementsof the proposed method ey will be described in detail inthe following sections

52 Traffic Labeling Filtering and Feature Extraction

521 Traffic Labeling In this study traffic labeling isconducted to identify the corresponding app for each net-work flow which is known as the app identification problem

[53] Extensive works have been conducted on the identi-fication of apps from mobile network traffic [38 54 55]However the achieved identification accuracies are all lowerthan 100 In our method we design an edge-client app forcollecting flow metaformation and identifying apps accu-rately e flow metainformation is defined in the following

Definition 1 (flow metainformation) For flow f the flowmetainformation includes the client-side IP address C-IPfthe flow starting time Tf the flow protocol Pf the server-sideIP address S-IPf the server-side port S-Portf the app nameAppNf and the app version AppVfe flow f is bidirectionaland Pf is the transport layer protocol such as TCP or UDP

e edge-client app operates similarly to a VPN appeapp sends the flow metainformation to the edge server at afixed time interval Ti Once it has received the flow meta-information the edge server saves it into a metainformationdatabase and can label mobile app traffic accurately Denotethe flow observed by the edge server as fu e edge serverlabels fu as follows

Step 1 extract the recording time Tuf of fu and its flow

protocol Puf client-side IP address CminusIPu

f server-sideIP address SminusIPu

f and server-side port SminusPortufStep 2 search the metainformation database with Pu

fCminusIPu

fSminusIPuf and SminusPortuf If records match the da-

tabase returns the record that has the most recent flowstarting time Tr

f Otherwise the edge server waits for awhile and repeats Step 2Step 3 if Tu

f is close to Trf the edge server labels flow fu

as ltAppNf AppVfgtOtherwise it waits for a while andrepeats Steps 2 and 3

In the above steps the client-side IP address CminusIPuf is

used to identify a mobile device uniquely which is correctwhen the edge server is the wireless gateway or the eNodeBIn such scenarios the mobile devices directly connect to theedge hence the IP address is unique for each deviceHowever if the edge server is deployed at the backbonerouter as illustrated in subgraph (c) of Figure 4 the CminusIPu

f

may be confusing because mobile devices could be locatedbehind NATs (Network Address Translations) and they willshare the same public IP address To solve this issue theedge-client app can insert special indicators into the packetsto uniquely represent a device In this study we alwaysassume that CminusIPu

f is unique and we will investigate theimplementation of the device-indicator in future work

In steps 2 and 3 the edge server utilizes a wait-and-repeat strategy to ensure that the traffic labeling is fresh andaccurate e edge server must wait because the mobiledevices send flow metainformation to the edge server at afixed time interval Ti (60 s in our experiments) and the savedflow metainformation in the database may be outdated Inour implementations the waiting time is set to Ti and if|Tu

f minus Trf|gt 24 h the returned flow metainformation is

regarded as obsolete

522 Traffic Feature Extraction After traffic labeling theedge server extracts detection features from the labeled

Table 1 Main notation used in this paper

Notation Meaningi Android app iu Number of mobile devices with app i installedr Number of devices with repackaged app i

f Network flowC-IPf Client-side IP address of fS-IPf Server-side IP address of fS-Portf Server-side port of fAppNf Name of the app that generates fAppVf Version of the app that generates fTi Set time intervalTu

f Recording time of flow fu at the edge serverdi Feature set of app iwj Plaintext word in the feature setsV(di) Numerical vector of traffic contentsF

j

diFeature vector for the encrypted flow j in di

TBntimesm(di) Traffic behaviors of diCSdik

Content similarity between di and dkBSdik

Behavior similarity between di and dkSdik

Final similarity between di and dk

8 Security and Communication Networks

traffic e extracted traffic features are categorized into twogroups traffic content features and traffic behavior featuresTraffic content features are the plaintext contents that areextracted from HTTP flows and traffic behavior features arethe statistics of network flows such as packet sizes andintervals Since Android apps may use an encrypted HTTPSprotocol to transmit data [54] traffic behavior features alsomust be considered e traffic content and traffic behaviorfeatures are described in detail in Sections 53 and 54

523 Traffic Filtering After traffic feature extraction theedge servers send the extracted features along with theclient-side IP address server-side IP addresses server-sideports app name AppNf and app version AppVf to the cloudNote that App i is installed on u devices hence the cloudeventually has u feature sets for App i Figure 8 illustrates thestructure of the feature set e cloud further filters these ufeature sets to increase the detection accuracy First it di-vides the u sets into v categories according to the app versionAppVf erefore each category contains the feature setsthat are generated by the same version of App i en foreach category the cloud removes the common traffic fromthe feature sets e common traffic is defined as network

flows that are contained in all features sets ese flows havethe same server-side IP address and the same server-sideport number e reason for this step is that the commontraffic will obfuscate the similarity calculations betweenapps Otherwise the similarity values between features setsare all high and they cannot be effectively used to distin-guish the repackaged malware from the original apps

After the filtering has been completed the similaritiesbetween feature sets are calculated and the similarity valuesare clustered to identify the repackaged malware

Client-side IP addressApp name AppN_fApp version AppV_fFlow 1 server-side IP address 1 server-side port 1 flow features

Flow f server-side IP address f server-side port f flow features

Figure 8 Illustration of the feature set e set contains the clientIP address app name and app version Each network flow that isgenerated by App i in a device is represented by the featuresextracted from the flow and by the server IP address and port

Trafficlabeling

Edge server 1

Traffic featureextraction

Similarity calculation of

trafficbehaviors

Similarity calculation of traffic contents Clustering of

similarity values

Original app

Repackaged malware

Cloud

Mobile device 1 2 (with app i installed)

Results

Traffic features

Traffic and flowmetainformation

ę

ę

Edge server e

Traffic features

Traffic and flow metainformation

Filtering

Mobile device u(with app i installed)

1

3

4

5

6

2

Figure 7emain procedure of the proposedmethod for Androidmalware detectione core parts of the proposedmethod are labeled bycircled numbers

Security and Communication Networks 9

53 Similarity Calculation of Traffic Content Android appsuse mainly HTTP and HTTPS protocols to communicatewith their remote servers [54] For HTTP traffic since it isplaintext based we can efficiently use the flow content tocalculate the similarity e basic strategy is to convertHTTP traffic to document files and the document files canbe handled by information retrieval technologies for simi-larity calculation HTTP flow content has also been used inapp identification from network traffic [55] However in thisstudy we only consider partial HTTP header information tobetter protect mobile usersrsquo privacy

In detail first we reassemble HTTP flows from packetsand extract the flow contents by saving the correspondingpacket contents into a text file as ASCII code Figure 9 showssuch an example en we preprocess the text file by de-leting the HTTP request content and the HTTP requestheader except for the GET (or POST) and HOST fields eHTTP response content is also deleted namely only theHTTP response header the GET (or POST) and the HOSTfiled in the HTTP request header are saved Via this ap-proach the mobile usersrsquo privacy can be protected from thecloud (for example the mobile phone information (HUA-WEI) is leaked by the User-Agent field in Figure 9) Afterthat the text file is spilt using special characters (eg ) and the processed file becomes the flow contentfeatures

e similarity is calculated via the improved TF-IDFalgorithm [56] and the cosine similarity e term frequencymeasures the count of words in a document We use theaugmented frequency measurement Denote the document(the feature set) as di and the word as wj Term frequencytf(wj di) is calculated as

tf wj di1113872 1113873 05 +05 times f wj di1113872 1113873

max f w di( 1113857 w isin di1113864 1113865 (1)

where f(wj di) is the number of occurrences of wj indocument di

e inverse document frequency measures of theamount of information a word provides Typically it is thelogarithmically scaled inverse fraction of the number ofdocuments that contain the word e inverse documentfrequency of wj is calculated as

idf wj1113872 1113873 logu

d wj isin d1113966 111396711138681113868111386811138681113868

11138681113868111386811138681113868+ 001⎛⎝ ⎞⎠ (2)

where u is the number of documents and | d wj isin d1113966 1113967| isnumber of documents in which the term wj appears

Denote by p(wj di) the product of TF(wj di) andI DF(wj) it is formulated as

p wj di1113872 1113873 log tf wj di1113872 11138731113872 1113873 times idf wj1113872 1113873

1113936

cj0 log

2 tf wj di1113872 11138731113872 1113873 times idf2 wj1113872 11138731113969 (3)

After the TF-IDF operation all documents are convertedinto equal-length numerical vectors We use the cosinesimilarity to measure the similarity between two documentvectors For documents di and dk the corresponding

numerical vectors are denoted as V(di) and V(dk) and thelength of vector is t e similarity between di and dk iscalculated as

CSdik

1113936tl1 V di( 1113857l times V dk( 1113857l( 1113857

1113936tl1 V di( 1113857

2l

1113969

times

1113936tl1 V dk( 1113857

2l

1113969 (4)

Suppose that in category vi (we divide the u feature setsgenerated by App i in to v categories in the traffic filteringstep) there are c feature sets For each feature set we save theflow features of all HTTP flows into a text file and c text filesare obtained en we calculate the numerical vector foreach feature set via equation (3) Last we compare thenumerical vector with each other as equation (4) and gain csimilarity values (CSdii

1) for each feature set We rep-resent these similarity values of feature set di (i 1 2 c)as a vector Si

contents and Sicontents (CSdi1

CSdi2 CSdic

)e vectors Si

contents (i 1 2 c) in combination with thesimilarity values of traffic behaviors will be clustered toidentify the abnormal data points namely the repackagedmalware

54 SimilarityCalculation of TrafficBehaviors In addition tocalculating the similarity values of traffic contents thesimilarity of traffic behaviors must be calculated due toencrypted flows We use m (m 11 in our study) statisticalfeatures as listed in Table 2 to represent the behavior of anencrypted network flow e units of the flow duration andthe packet interval time are milliseconds in our experiments

With the flow statistical features for feature set di the jthencrypted flow can be represented as a vector in the featurespace

Fj

di f

1j f

2j f

mj1113872 1113873 (5)

where flj represents the value of the lth statistical feature

1le llem Figure 10 shows an example of feature represen-tation for encrypted traffic Consequently the traffic be-haviors of di can be represented as a matrix of f

pj

Figure 9 Example of converting an HTTP flow to a text file

10 Security and Communication Networks

TBntimesm di( 1113857 F1di

F2di

Fndi

1113872 1113873T (6)

where n is the number of encrypted flows that are containedin di and nge 1

Since different feature sets may differ in terms of thenumber of encrypted flows traffic behaviors TBnitimesm(di) andTBnktimesm(dk) may differ in terms of different dimensionnamely the value of ni should not be equal to that of nk Toefficiently calculate the similarity between TBnitimesm(di) andTBnktimesm(dk) we calculate the Frobenius norm of TBnitimesm(di)

and TBnktimesm(dk) e formula is presented in equation (7)where norm(middot) normalizes the value of fl

j to [0 1]

FN(d)

1113944

n

j11113944

m

l1norm fl

j1113872 111387311138681113868111386811138681113868

111386811138681113868111386811138682

11139741113972

(7)

en the distance between TBnitimesm(di) and TBnktimesm(dk) iscalculated as

dis di dk( 1113857 FN di( 1113857 minus FN dk( 11138571113868111386811138681113868

1113868111386811138681113868 (8)

Finally the similarity between TBnitimesm(di) andTBnktimesm(dk) can be calculated by

BSdik 1 minus

dis di dk( 1113857

max dis dr dq1113872 1113873 r q 1 2 u rne q1113960 1113961 (9)

In the above process ni and nk are assumed to exceed 1If ni 0 (nk 0) and nkge 1 (nige 1) BSdik

is set to 0 to reflectthe significant difference between di and dk However ifni 0 and nk 0 namely there is no encrypted flow in bothdi and dk BSdik

is set to 1 because they exhibit the samebehaviors ie they do not produce any encrypted traffic

Similarly we let BSdii 1 and represent these similarity

values as a vector Sibehavior BSdi1

BSdi2 BSdiu

1113966 1113967

55 Clustering of Similarity Values and Malware DetectionFrom the traffic content similarity CSdik

and the traffic be-havior similarity BSdik

we can determine the final similarityvalue BSdik

between the feature sets di and dk as expressed inequation (10) In this equation q is the weight value In ourexperiments we set q 05 namely the traffic contentsimilarity and the traffic behavior similarity are assigned thesame weight is is because apps typically produce almostthe same numbers of HTTP and HTTPS flows [57] We alsoevaluate various weighted values in Section 62 and theexperimental results support that 05 is the most suitablevalue For feature set di a vector Sdi

(Sdi1 Sdi2

Sdiu) is

finally obtained which contains the similarity values be-tween di and the other feature setse set of vector Sdi

1113966 1113967 i

1 2 u will be clustered for the detection of repackagedmalware

Sdik q times CSdik

+(1 minus q) times BSdik (10)

To be specific we use the density peak clustering method[58] to realize our objective Density peak clustering is basedon the following assumption cluster centers are charac-terized by a higher density than their neighbors and by arelatively large distance from points that have higher den-sities Hence density peak clustering can automaticallyidentify the correct number of clusters is is importantbecause we do not know whether there is repackagedmalware or how many types of repackaged malware thereare In other words we cannot determine the number ofclusters in advance Density peak clustering can help usovercome this challenge

We assume that most mobile devices have installed theoriginal apps as discussed in Section 51 erefore afterclustering the cluster that has the largest number of voxelswill be recognized as original and the remaining clusters willbe regarded as suspicious However suspicious apps are notnecessarily repackaged malware For example apps can berepackaged to show ads only Repackaged apps of this typeshould not be simply judged as malware Additionallyoriginal apps can be suspicious due to different user be-haviors To further reduce the frequency of the false alarmswe can check the server hostnames that are contained in therepackaged clusters using VirusTotal (the URLs are includedin the HTTP traffic contents) If all of the serversrsquo hostnamesfor a suspicious cluster Cr are judged as clean it will berecognized as a cluster of abnormal but harmless appsOtherwise it will be identified as repackaged malware Viathis approach we can accurately detect repackaged Androidmalware In this step we only use VirusTotal as a white list toexclude false alarms We do not use it as a backlist (detectionsignatures) to detect repackaged malware Malware detec-tion is still realized by clustering the similarity values

Table 2 Flow statistical featuresBytes sent by the app and bytes sent by the serverNumbers of outgoing and incoming packetsMean and std dev of the outgoing and incoming packet sizesrsquoflow durationMean and std dev of the interpacket time

lt2761 16696 16 12 173 1294 289 1177 3392 121 190gt

Figure 10 Example of the conversion of an encrypted flow to afeature vector

Security and Communication Networks 11

6 Evaluation

61 System Implementation and Dataset Figure 11illustrates our experimental setup e Android phoneswere connected to the edge server via WiFi and each phonehad installed the edge-client app which sends the flowmetainformation to the edge server using Socket (jav-anetSocket) e edge is a Dell PowerEdge R730 server(with 32 processors 16GB RAM and 12 TB hard drives)that is configured as an access point In the edge server weused tcpdump (httpwwwtcpdumporg) to collect networktraffic e similarity calculations of traffic contents andbehaviors and the clustering of similarity values wereimplemented on the Windows Azure cloud computingplatforme URL checking was conducted using Virustotal(httpswwwvirustotalcom)

In the implementation we used aria2c (httpsaria2githubio) to download apps from Androzoo (httpsandrozoounilu) aria2c was encapsulated as commandsin Python code for parallel downloading After downloadingthe apps we automatically installed them on phones usingthe adb shell command Each phone had 40 tested appsinstalled We have implemented the edge-client app basedon Android VPN service In our implementations the edge-client app correlated network flows and apps by resolvingthe ports and PIDs (Process IDs) send the newly collectedflow metainformation to the edge server every 60 secondse edge server used tcpdump to collect network trafficwhich was saved as pcap files To extract traffic features weused Splitcap (httpswwwnetreseccompageSplitCap)to restore tcp flows from pcap files en for HTTP trafficwe extracted the contents using tshark (httpswwwwiresharkorgdocsman-pagestsharkhtml) commands -zfollow tcp and ascii e packet sizes and packet intervaltimes were also obtained by tshark with commands -eframelen ndashe frametime_relative In the cloud CSdik

and BSdik

were calculated using scikit-learn (httpscikit-learnorg)and Numpy (httpsplotlynumpynorm) e code ofdensity peak clustering is available on GitHub (httpsgithubcomjasonwbwDensityPeakCluster) and we havefixed various bugs that were present in the original code

We have downloaded 200 pairs of original andrepackaged apps from Androzoo for testing However notall of these downloaded repackaged apps are maliciousSome apps are only repackaged to show extra ads To screenout Android malware from the downloaded repackagedapps we further checked the 200 repackaged apps usingVirustotal If an app was judged as safe by all the antivirustools it was labelled as repackaged_safe in our experimentsOtherwise it was labelled as repackaged_malware In total143 repackaged apps were detected as malware by VirustotalNevertheless in our experiments we evaluated the proposedclusteringmethod on all 200 repackaged apps to obtainmorerealistic performance results

e downloaded apps were run in 10 rooted Androidphones in parallel to accelerate the experimental process Forall apps we used Droidbot [59] to send input events to themand to generate network traffic automatically We ran eachrepackaged app (repackaged_safe and repackaged_malware

apps) 10 times and the corresponding original app 30 timesto generate sufficient network traffic e running times ofthe original apps exceeded those of the repackaged appsbecause we assume that most of the apps are original asdiscussed in Section 51 Consequently we obtained 40traffic sets for each repackaged-original app pair Note thatfrom the perspectives of the mobile devices and the edgeserver there are only 200 distinct apps because therepackaged app has the same name as the correspondingoriginal app erefore our experiments have simulated40lowast 200 8000 mobile devices For each repackaged-orig-inal app pair there were 30 devices with the original app and10 devices with the repackaged version

In total we have gathered almost 89GB network traffice datasets that are used in our experiments are summa-rized in Table 3 In the table the traffic dataset sizes are thesizes of the captured network traffic and the feature datasetsizes are the sizes of the extracted features In our experi-ments the HTTP traffic contents and the flow statisticalfeatures which are listed in Table 2 were both saved as textfiles As shown in Table 3 the feature datasets are signifi-cantly smaller than the traffic datasets (25G vs 63G and 9G vs26G)us the edge computing can reduce the data size andincrease the data acquisition speed for cloud computing[24] Since the feature dataset sizes are still large (25G+ 9G)we did not send all these feature data to the cloud simul-taneously Instead we sent the feature data one by oneWhen the feature data of an app were successfully receivedby the cloud its traffic similarities were calculated andclustered immediately After that the feature data of anotherapp would be sent and analyzed e experimental resultswill be presented in the following

62 ExperimentalResults In this section wemainly evaluatethe detection efficiency of Android malware by clusteringsimilarity values e performance evaluation of the pro-posed method in terms of power memory and CPU con-sumptions is elaborated in the following Section 63

Similar to previous studies [8 14] to measure theperformance of the proposed method Accuracy and F-measure metrics are utilized which are defined in equations(11) and (12) In equation (11) TP FN FP and TN aredefined in Table 4 In Table 4 repackaged_x representsrepackaged_safe or repackaged_malware e precision andrecall in equation (12) are calculated via equations (13) and(14)

Accuracy TP + TN

TP + FN + FP + TN (11)

F minus measure 2 times precision times recallprecision + recall

(12)

precision TP

TP + FP (13)

recall TP

TP + FN (14)

12 Security and Communication Networks

We have evaluated the Accuracy and F-measure for eachrepackaged app Recall that we ran each repackaged app 10times and the corresponding original app 30 timeserefore for each app the number of the original apptraffic sets is 30 and the number of the repackaged app trafficsets is 10 namely TP + FN 10 and TN+FP 30 for eachapp We have randomly chosen 20 apps from repack-age_malware and the experimental results for these apps arepresented in Figures 12 and 13 Detailed descriptions of thechosen 20 apps are listed in Table 5 In the table the mainfunction of each app is determined by analyzing the packagename from the AndroidManifestxml and from the Googlesearch results of the package name As depicted in Figure 12100 accuracy was realized on a total of 6 apps (app3 app5app9 app10 app13 and app18) e minimum accuracy is85 (app2 3440) and the average accuracy is 952 Fig-ure 13 presents the F-measure values In our experimentsthe maximum F-measure value is 1 and the minimum valueis 0737 e average F-measure value is 0888 ese resultsdemonstrate the satisfactory performance of our proposedclustering method

We have also calculated the average accuracy and F-measure values for all repackaged_safe and repack-aged_malware apps In these calculationsTP + FN 57lowast10 570 and TN+FP 57lowast 301710 for

repackaged_safe apps For repackaged_malware appsTP + FN 143lowast101430 and TN+FP 143lowast 30 4290is is because in our experiments 143 apps were identifiedas repackaged malware and 57 apps as repackaged safe byVirustotal us the parameter values differ e experi-mental results are listed in Table 6

According to Table 6 the average accuracy values forrepackaged_safe and repackaged_malware apps are bothhigh However the average F-measure for repackaged_safeapps is only 078 which is lower than that of repackagedmalware is indicates that repackaged safe apps are

Router Windows Azure

Edge server DellPowerEdge R730

10 android phones

Repackagedoriginal apps

Androzoo

Figure 11 Experimental setup

Table 3 Datasets used in the experiments

of apps of runs of HTTP flows of encrypted flows Traffic dataset sizes Feature dataset sizesOriginal apps 200 30 340015 183452 asymp63G bytes asymp25G bytesRepackaged apps 200 10 110670 93041 asymp26G bytes asymp9G bytes

Table 4 Confusion matrix

ObservedDetected

Repackaged_x app Original appRepackaged_x app TP ( of TPs) FN ( of FNs)Original app FP ( of FPs) TN ( of TNs)

Apps

0

01

02

03

04

05

06

07

08

09

1

Acc

urac

y

App

1A

pp2

App

3A

pp4

App

5A

pp6

App

7A

pp8

App

9A

pp10

App

11A

pp12

App

13A

pp14

App

15A

pp16

App

18A

pp17

App

19A

pp20

Figure 12 Detection accuracies for the 20 selected apps

Security and Communication Networks 13

difficult to distinguish from their original apps is isbecause repackaged_safe apps only generate smallamounts of additional network flows and their networktraffic features are not clearly distinguishable from thoseof normal traffic Fortunately the average F-measure ofrepackaged malware apps is outstanding is may be dueto the fact that malware typically send stolen data to orreceive commands from remote servers Hence theirnetwork traffic features are special and distinguishableFigure 14 presents an example of app clustering and il-lustrates the distinguishability of the repackaged malwareand the original app ese results demonstrate thatrepackaged Android malware can be effectively detectedby comparing their network traffic with that of originalapps

For repackaged malware apps the accuracy results withdifferent weight values (q in equation (10)) are presented inFigure 15 As depicted in the figure when q 05 we realizethe highest accuracy rate 969 If q 0 namely only thefeatures of encrypted traffic are used the accuracy result is753 e result is 617 when q 1 namely only thefeatures of plaintext traffic are considered ese resultsdemonstrate that both the encrypted and plaintext trafficcontribute to the differentiation of the original apps and therepackaged malware However the encrypted traffic con-tributes more than the plaintext traffic is may becausemalicious apps usually connect to their servers by securityprotocols to evade IDS detection

In the previous experiments we checked the serversrsquohostnames using VirusTotal to reduce the frequency of thefalse positives For various clusters that contained less than20 elements if all the serversrsquo hostnames were judged asclean the cluster was identified as an original app rather thana repackaged one As argued in Section 55 this process onlyreduces the frequency of false alarms To support this ar-gument we list the numbers of true positives and falsepositives of the 143 repackaged malware in Table 7 forcomparison When conducting the hostname checking thenumber of true positives is 1400 and the result is the samewhen the hostname checking is turned off is findingsupports that the malware is detected by clustering thesimilarity values However the number of false positivesincreased from 47 to 213 when the hostname checking wasturned off erefore we suggest checking the serversrsquohostnames after clustering to increase the practicality of themethod Note that false positives still occur when we checkthe hostnames by VirusTotal Hence there may be malwarein the original apps In our dataset the typical examples arecommobovideoplayerpro and cncfshop ele1taoxiaosanWe further analyze these two apps by reading theirdecompiled codes and we determine that they indeed es-tablish connections with suspicious servers In detail invarious runs app commobovideoplayerpro connects withhttpwwwtopappsquarecomwhich is judged as maliciousby CyRadar in VirusTotal App cncfshop ele1taoxiaosanvisits suspicious URL http955ccjSn9 Since these apps arethe original apps they are identified as false positives in theexperiments

In our experiments we chose density peak clustering asthe clustering method because density peak clustering canautomatically determine the correct number of clusters [58]Hence our method can detect various versions of malwarethat are repackaged from the same original app Unfortu-nately in the apps that were downloaded from AndroZoono malware was repackaged from the same original app Toevaluate the efficiency of detecting versions of malware wehave created two malware instances from a popular game

Apps

App

1A

pp2

App

3A

pp4

App

5A

pp6

App

7A

pp8

App

9A

pp10

App

11A

pp12

App

13A

pp14

App

15A

pp16

App

18A

pp17

App

19A

pp20

0

01

02

03

04

05

06

07

08

09

1

F-m

easu

re

Figure 13 Detection F-measure for the 20 selected apps

Table 5 Detailed descriptions of the chosen 20 apps

Main function ofapp

of HTTPflows

of encryptedflows

App1 VideoPlayer 453 51App2 Game 657 1455App3 Browser 1531 1478App4 MusicPlayer 341 25App5 Game 612 289App6 Game 788 123App7 Game 1336 524App8 Fitness 357 892App9 Email 122 431App10 Game 963 790App11 MusicPlayer 1128 547App12 Game 1598 426App13 News 2460 1178App14 News 2891 1543App15 Chat 171 49App16 Email 143 368App17 WeatherForecast 397 120App18 Game 1832 921App19 Fitness 732 362App20 OnlineLearning 2378 834

Table 6 Average accuracy and F-measure

Average accuracy () AverageF-measure

Repackaged_safe 903 078Repackaged_malware 969 094

14 Security and Communication Networks

app namely aircomaceviralmotox3 One is produced bygrafting AnserverBotrsquos source code into air-comaceviralmotox3 and the other is created by adding codefor stealing usersrsquo contacts and short messages and sendingthem to a remote server Similarly each malware is run 10times and the original app aircomaceviralmotox3 is alsorun 10 times for validatione clustering result is presentedin Figure 16 Apparently there are 3 clusters e

experimental results show that the value of F-measure foreach malware is 1 Hence our method can detect variousmalware that were repackaged from the same original app

63 Comparison In this section we compare our methodwith the typical methods that are based on network or cloudas illustrated in Figure 5 e comparative indicators are thedetection accuracy F-measure average power CPU andmemory consumptions of the mobile device First wecompare the proposed method with AppFA [14] which isimplemented completely at the network level Since AppFAalso must analyze app traffic we directly use the traffic sets ofthe repackaged malware apps for comparison While testingAppFA we deleted the label information that was containedin the traffic sets and mixed the network flows as a wholeen we used the signature matching and the constrainedK-means clustering algorithm proposed in [14] to restoreapp sessions namely the label for each flow e similarapps used in AppFA were chosen in Google Play via themethod introduced in [14] and for each repackaged mal-ware 10 similar apps were selected as the peer group epeer group apps were run once by Droidbot to generatenetwork traffic e comparison results are listed in Table 8

Since AppFA need not to install some special apps onmobile devices the extra power memory and CPU con-sumptions are 0 However the detection accuracy is muchlower (863 vs 969) than that of the proposedmethod ascompared in Table 8 In the proposed method we use theedge-client app to record flow metainformation thusnetwork flows can be accurately labeled However inAppFA flow labeling is realized via traffic clustering and theclustering accuracy cannot be 100 Meanwhile AppFAcompares the appsrsquo network behaviors with those of theirsimilar apps which also results in errors In the proposedmethod we directly compare the repackaged malware withits original app and the detection accuracy is increased eaverage power CPU and memory consumptions of theproposed method are also low and acceptable

In Table 8 the power memory and CPU consumptionsfor our edge-client app were obtained with the help of appGT (httpsgithubcomTencentGT) GT is a portable toolthat runs on smartphones for debugging the APP internalparameters and the code time-consuming statistics eaverage values for the power memory and CPU con-sumptions are calculated as follows Before running an appwith Droidbot we started the edge-client app by GT Whenthe Droidbot was stopped we recoded the power CPU andmemory consumptions We repeated the process until allapps (including 143 repackaged malware apps and theiroriginal versions) had been tested Via this way we obtainedthe average power CPU and memory consumptions for theedge-client app According to Table 8 the edge-client app islightweight and has minimal impact on the performances ofthe mobile devices

en we compare the proposed method with SAMA-Droid [44] SAMADroid monitors Android system calls andsends the calls to a centralized server for the detection ofAndroid malware SAMADroid is a typical cloud-based on-

Decision graph rho-delta-rho

20 251510 35 403050Sorted rho

00

02

04

06

08

Rho lowast

delt

a

Figure 14 Example of app clustering Obviously two clusters areidentified e dot in the top left corner is an outlier e outlier isdetected as an abnormal but harmless app by our method isfigure was generated by the density peak clustering tool

01 02 03 04 05 06 07 08 09 10Value of q

0

01

02

03

04

05

06

07

08

09

1

Acc

urac

y

Figure 15 Accuracies of various weight values for traffic contentsimilarity and traffic behavior similarity

Table 7 True positives and false positives with and withouthostname checking

of true positives of false positivesWith hostname checking 1400 47Without hostnamechecking 1400 213

Security and Communication Networks 15

device detection method We have implemented the dy-namic detection according to the description in [44] Indetail the APIs that are invoked by the running apps aremonitored by the tool Sensitive API Monitor (httpsgithubcomcyruliuSensitive-API-Monitor) As done in [44] thefrequency values of 10 system calls (open ioctl brk readwrite close sendto sendmsg recvfrom and recvmsg) arerecorded ese recorded frequency values are sent toWindows Azure for malware detection We evaluate thedynamic detection performance of SAMADroid on our 143repackaged_malware apps and their original versions In theevaluation the frequency values of the original apps are usedto create the normality model and the frequency values ofthe repackaged malware apps are used for testing We alsouse the app GT to record the power memory and CPUconsumptions of SAMADroid Note that our phones wereall rooted and Sensitive_API_Monitor and GT can be usede experimental results are presented in Table 9

Last we compare our method with Shabtairsquos method [8]Shabtairsquos method detects Android malware at the mobiledevices and it considers the apprsquos network traffic patternsonly the HTTP traffic contents are not considered isgives us an opportunity to compare our method withmethods that have been specially designed for encryptedtraffic We have implemented Shabtairsquos method and chosenthe feature subset 2 as the detection features Feature subset2 includes Avg Sent Bytes Avg Rcvd Bytes and Pct ofAvg Rcvd Bytes among other values and it is proved to be

the best feature subset for malware detection [8] e De-cisionRegression tree is chosen as the classification algo-rithm e testing dataset is the traffic that was generated bythe 143 repackaged malware instances and the training set isthe traffic that was generated by the corresponding originalapps e classification results are listed in Table 10According to the table the detection accuracy and F-mea-sure values of our method are significantly higher than thoseof Shabtairsquos method is indicates that the HTTP trafficprovides useful information for Android malware detectionSince Shabtairsquos method is conducted at the mobile devices itconsumes more resources than our method as compared inTable 10

7 Discussion

e results of our extensive experiments demonstrate thatrepackaged Android malware can be efficiently and effec-tively detected by clustering network traffic at the edgedevices However there are still limitations for our methodFirst we assume the most of the devices have installed theoriginal apps (rlt (12)u as described in Section 51) is isreasonable in most cases since mobile users typicallydownload apps from the official markets However in areaswhere official markets such as the Google play and Amazonare banned it is possible that the most mobile devices haveinstalled the repackaged apps is may cause false positivesSecond if the malware does not generate sufficient trafficour method cannot detect it For example some malwareonly sends fraudulent premium SMS messages is type ofmalware cannot be detected by monitoring its networktraffic ird we rely on third-party tools such as Virustotalfor the identification of malicious URLs and to reduce thenumber of false alarms If a malicious app rents legalplatforms such as the public cloud as the remote controlserver our method will judge it as a normal app Hence falsenegatives will be generated

To overcome these limitations we could combine theproposed method with available methods such as SAMA-Droid In such cases the edge-client app will not only recordflow metainformation but also monitor app system be-haviors However the system behavior monitoring does notneed to be conducted all the time If abnormal but harmlessor lower traffic apps are detected the edge-client app will beinformed to record the corresponding appsrsquo system be-haviors Similarly the edge-client app can send the appsystem behaviors to the edge server for further processinge performance of this combined method will be evaluatedin our future work One can also combine the proposedframework with AppFA [14] for the detection of other typesof Android malware For example the edge extracts apptraffic behavior features and the cloud can compare appsrsquobehaviors with their peer groups to identify abnormalitiesFurthermore one could first use clustering techniques [60]to group repackaged apps according to the same maliciousmodules and then analyze the behaviors of the representativerepackaged apps in each group to obtain more accuratebehavior features

y

2D nonclassical multidimensional scaling

00ndash02 02ndash04x

ndash04

ndash03

ndash02

ndash01

00

01

02

03

04

Figure 16 Clustering results of various malware that wererepackaged from aircomaceviralmotox3e outlier in the middleis detected as an abnormal but harmless app is figure wasgenerated by the density peak clustering tool

Table 8 Comparison of the proposed method with AppFA

Proposed method AppFAAccuracy 969 863F-measure 094 075Average power consumption 12 0Average CPU consumption 06 0Average memory consumption 654MB 0

16 Security and Communication Networks

e paradigms of edge computing are used to detectAndroid malware in this study However several challengeswould be encountered when deploying it in practice Firstthe mobile devices must install a specified app for thecollection of essential information which may cause con-cerns and mobile users may refuse to cooperate Second inedge computing the data should be preprocessed by the edgeto better protect the usersrsquo privacy However the edge servercan observe all the data (including the sensitive informa-tion) and it threatens the privacy of users For example ifthe edge is deployed by the cloud provider the usersrsquo privacyinformation is also available to the cloud ere is no dif-ference with the user-cloud model for privacy protectionunless the edge server is deployed by the userird the edgeserver should be well protected from network attacksOtherwise it will be convenient for hackers to steal usersrsquoprivate information

8 Conclusion

In this paper we propose a novel method for detectingAndroid malware based on edge computing To the best ofour knowledge the detection of Android malware by uti-lizing edge computing and traffic clustering has not beenconsidered previously By introducing the edge layer themain tasks of the mobile device are offloaded to the edgeserver and the mobile device only needs to record flowmetainformation which substantially conserves the re-sources of the mobile device e edge server extracts appsrsquonetwork traffic features and sends these features to the cloudplatform for further analysis us the usersrsquo privacy can bebetter protected With the received features the cloud cal-culates the similarities between apps and clusters thesesimilarity values to separate the original apps and themalware automatically We evaluated our methods on 400Android apps and the experimental results demonstratethat the detection accuracy can reach 100 for several appsand the average accuracy is 969 We also tested theperformance of the proposed method and compared it with

the typical approaches e comparison results show thatour method can detect repackaged Android malware withboth higher detection accuracy and lower power and re-source consumptions In the future we will continue to workto overcome the discussed limitations and to evaluate theproposed method in real world scenarios

Data Availability

e downloaded apps used to support the findings of thisstudy have been deposited in httpspanbaiducoms1b8vMceIdtdd1gBRC5Gbzvw Its extraction code is vd5re clustering code is also in the repository Since appnetwork traffic contains usersrsquo sensitive information itcannot be publicly shared However we have explained allthe steps to generate network traffic automatically esesteps are also described in the instructionspdf in therepository

Conflicts of Interest

e authors declare that they have no conflicts of interest

Acknowledgments

is work was supported by National Key TechnologiesResearch and Development Program of China under Grant2018YFD0401404 National Natural Science Foundation ofChina under Grants 61702282 61802192 and 71801123Natural Science Foundation of the Jiangsu Higher EducationInstitutions of China under Grants 18KJB520024 and17KJB520023 Nanjing Forestry University under GrantsGXL016 and CX2016026 and NUPTSF under GrantNY217143

References

[1] Statista Mobile Operating Systemsrsquo Market Share Worldwidefrom January 2012 to December 2019 Statista HamburgGermany 2009 httpswwwstatistacomstatistics272698global-market-share-held-by-mobile-operating-systems-since-2009

[2] W Martin F Sarro Y Jia Y Zhang and M Harman ldquoAsurvey of app store analysis for software engineeringrdquo IEEETransactions on Software Engineering vol 43 no 9pp 817ndash847 2017

[3] httpswwwappbraincomstatsnumber-of-android-apps2020

[4] L Li D Li T F Bissyande et al ldquoUnderstanding android apppiggybacking a systematic study of malicious code graftingrdquoIEEE Transactions on Information Forensics and Securityvol 12 no 6 pp 1269ndash1284 2017

[5] M Fan J Liu X Luo et al ldquoAndroid malware familialclassification and representative sample selection via frequentsubgraph analysisrdquo IEEE Transactions on Information Fo-rensics and Security vol 13 no 8 pp 1890ndash1905 2018

[6] J Zhang Z Qin K Zhang H Yin and J Zou ldquoDalvik opcodegraph based android malware variants detection using globaltopology featuresrdquo IEEE Access vol 6 pp 51964ndash51974 2018

[7] S Garg S K Peddoju and A K Sarje ldquoNetwork-baseddetection of android malicious appsrdquo International Journal ofInformation Security vol 16 no 4 pp 385ndash400 2016

Table 9 Comparison of the proposed method with SAMADroid

Proposed method SAMADroidAccuracy 969 831F-measure 0940 0793Average power consumption 12 23Average CPU consumption 06 08Average memory consumption 654MB 773MB

Table 10 Comparison of the proposed method with Shabtairsquosmethod

Proposed method Shabtairsquosmethod

Accuracy 969 771F-measure 0940 0695Average power consumption 12 77Average CPU consumption 06 23Average memory consumption 654MB 937MB

Security and Communication Networks 17

[8] A Shabtai L Tenenboim-Chekina D Mimran L RokachB Shapira and Y Elovici ldquoMobile malware detection throughanalysis of deviations in application network behaviorrdquoComputers amp Security vol 43 pp 1ndash18 2014

[9] L Xie X Zhang J-P Seifert and S Zhu ldquopBMDS a be-havior-based malware detection system for cellphone de-vicesrdquo in Proceedings of the ird ACM Conference onWireless Network Security ACM New York NY USApp 37ndash48 2010

[10] I Burguera U Zurutuza and S Nadjm-Tehrani ldquoCrowdroidbehavior-based malware detection system for androidrdquo inProceedings of the 1st ACMWorkshop on Security and Privacyin Smartphones and Mobile Devices ACM Chicago IL USApp 15ndash26 October 2011

[11] G Dini F Martinelli A Saracino and D SgandurraldquoMADAM a multi-level anomaly detector for android mal-warerdquo in Proceedings of the International Conference onMathematical Methods Models and Architectures for Com-puter Network Security vol 12 Springer St PetersburgRussia pp 240ndash253 2012

[12] Y Zhang M Yang B Xu et al ldquoVetting undesirable be-haviors in android apps with permission use analysisrdquo inProceedings of the 2013 ACM SIGSAC Conference on Com-puter amp Communications Security ACM Berlin Germanypp 611ndash622 November 2013

[13] SWang Z Chen X Li LWang K Ji and C Zhao ldquoAndroidmalware clustering analysis on network-level behaviorrdquo inProceedings of the International Conference on IntelligentComputing Springer Liverpool UK pp 796ndash807 August2017

[14] G He B Xu and H Zhu ldquoAppFA a novel approach to detectmalicious android applications on the networkrdquo in Securityand Communication Networks Wiley Hoboken NJ USA2018

[15] X Wang Y Yang and Y Zeng ldquoAccurate mobile malwaredetection and classification in the cloudrdquo SpringerPlus vol 4no 1 p 583 2015

[16] K Tam A Feizollah N B Anuar R Salleh and L Cavallaroldquoe evolution of android malware and android analysistechniquesrdquo ACM Computing Surveys (CSUR) vol 49 no 4p 76 2017

[17] M Sun X Li J C S Lui R T B Ma and Z Liang ldquoMonet auser-oriented behavior-based malware variants detectionsystem for androidrdquo IEEE Transactions on Information Fo-rensics and Security vol 12 no 5 pp 1103ndash1112 2017

[18] T Vidas and N Christin ldquoEvading android runtime analysisvia sandbox detectionrdquo in Proceedings of the 9th ACMSymposium on Information Computer and CommunicationsSecurity pp 447ndash458 Kyoto Japan June 2014

[19] Y Duan M Zhang A V Bhaskar et al ldquoings you may notknow about android (Un) packers a systematic study basedon whole-system emulationrdquo in Proceedings of the 2018Network and Distributed System Security Symposium pp 1ndash15 San Diego CA USA February 2018

[20] L Xue X Luo L Yu S Wang and D Wu ldquoAdaptiveunpacking of android appsrdquo in Proceedings of the 2017 IEEEACM 39th International Conference on Software Engineering(ICSE) IEEE Buenos Aires Argentina pp 358ndash369 May2017

[21] Y Zhou and X Jiang ldquoDissecting android malware char-acterization and evolutionrdquo in Proceedings of the 2012 IEEESymposium on Security and Privacy IEEE San Francisco CAUSA pp 95ndash109 2012

[22] N W Lo S K Lu and Y H Chuang ldquoA framework for thirdparty android marketplaces to identify repackaged appsrdquo inProceedings of the 2016 IEEE 14th International Conference onDependable Autonomic and Secure Computing 14th Inter-national Conference on Pervasive Intelligence and Computing2nd International Conference on Big Data Intelligence andComputing and Cyber Science and Technology Congresspp 475ndash482 Auckland New Zealand August 2016

[23] L Li J Gao M Hurier et al ldquoAndrozoo++ collectingmillions of android apps and their metadata for the researchcommunityrdquo 2017 httpsarxivorgabs170905281

[24] W Shi J Cao Q Zhang Y Li and L Xu ldquoEdge computingvision and challengesrdquo IEEE Internet of ings Journal vol 3no 5 pp 637ndash646 2016

[25] W Zhou Y Zhou X Jiang and P Ning ldquoDetectingrepackaged smart-phone applications in third-party androidmarketplacesrdquo in Proceedings of the Second ACM Conferenceon Data and Application Security and Privacy ACM AntonioTX USA pp 317ndash326 February 2012

[26] Y Shao X Luo C Qian P Zhu and L Zhang ldquoTowards ascalable resource-driven approach for detecting repackagedandroid applicationsrdquo in Proceedings of the 30th AnnualComputer Security Applications Conference ACM NewOrleans LA USA pp 56ndash65 December 2014

[27] K Chen P Wang Y Lee et al ldquoFinding unknown malice in10 seconds mass vetting for new threats at the google-playscalerdquo in Proceedings of the USENIX Security Symposiumvol 15 Washington DC USA August 2015

[28] Z Wang C Li Y Guan and Y Xue ldquoDroidchain a novelmalware detection method for android based on behaviorchainrdquo in Proceedings of the 2015 IEEE Conference onCommunications and Network Security (CNS) IEEE Flor-ence Italy pp 727-728 September 2015

[29] K Tian D D Yao B G Ryder G Tan and G PengldquoDetection of repackaged android malware with code-het-erogeneity featuresrdquo IEEE Transactions on Dependable andSecure Computing vol 17 no 1 pp 64ndash77 2017

[30] A De Lorenzo F Martinelli E Medvet F Mercaldo andA Santone ldquoVisualizing the outcome of dynamic analysis ofandroid malware with vizmalrdquo Journal of Information Se-curity and Applications vol 50 Article ID 102423 2020

[31] M Lin D Zhang X Su and T Yu ldquoEffective and scalablerepackaged application detection based on user interfacerdquo inProceedings of the 2017 IEEE Conference on ComputerCommunications Workshops (INFOCOM WKSHPS) IEEEAtlanta GA USA May 2017

[32] G Meng M Patrick Y Xue Y Liu and J Zhang ldquoSecuringandroid app markets via modelling and predicting malwarespread between marketsrdquo IEEE Transactions on InformationForensics and Security vol 14 no 7 pp 1944ndash1959 2018

[33] A Arora and S K Peddoju ldquoMinimizing network trafficfeatures for android mobile malware detectionrdquo in Proceed-ings of the 18th International Conference on DistributedComputing and Networking ACM Hyderabad India January2017

[34] X Wu D Zhang X Su and W Li ldquoDetect repackagedAndroid application based on HTTP traffic similarity httptraffic similarityrdquo Security and Communication Networksvol 8 no 13 pp 2257ndash2266 2015

[35] J Malik and R Kaushal ldquoCredroid android malware de-tection by network traffic analysisrdquo in Proceedings of the 1stACM Workshop on Privacy-Aware Mobile Computing ACMPaderborn Germany pp 28ndash36 July 2016

18 Security and Communication Networks

[36] A Zulkifli I R A Hamid W M Shah and Z AbdullahldquoAndroid malware detection based on network traffic usingdecision tree algorithmrdquo in Proceedings of the InternationalConference on Soft Computing and Data Mining SpringerSenai Malaysia pp 485ndash494 January 2018

[37] Z Chen Q Yan H Han et al ldquoMachine learning basedmobile malware detection using highly imbalanced networktrafficrdquo Information Sciences vol 433-434 pp 346ndash364 2018

[38] G He B Xu L Zhang and H Zhu ldquoMobile app identifi-cation for encrypted network flows by traffic correlationrdquoInternational Journal of Distributed Sensor Networks vol 14no 12 pp 1ndash17 2018

[39] L Xue Y Zhou T Chen X Luo and G Gu ldquoMalton to-wards on-device non-invasive mobile malware analysis forARTrdquo in Proceedings of the 26th USENIX Security Symposiumpp 289ndash306 Vancouver Canada August 2017

[40] M K Alzaylaee S Y Yerima and S Sezer ldquoDL-Droid deeplearning based android malware detection using real devicesrdquoComputers amp Security vol 89 Article ID 101663 2020

[41] A S Shamili C Bauckhage and T Alpcan ldquoMalware de-tection on mobile devices using distributed machine learn-ingrdquo in Proceedings of the 20th International Conference onPattern Recognition (ICPR) IEEE Istanbul Turkeypp 4348ndash4351 August 2010

[42] M Zhao T Zhang F Ge and Z Yuan ldquoRobotdroid alightweight malware detection framework on smartphonesrdquoJournal of Networks vol 7 no 4 p 715 2012

[43] K A Talha D I Alper and C Aydin ldquoAPK auditor per-mission-based android malware detection systemrdquo DigitalInvestigation vol 13 pp 1ndash14 2015

[44] S Arshad M A Shah A Wahid A Mehmood H Song andH Yu ldquoSamadroid a novel 3-level hybrid malware detectionmodel for android operating systemrdquo IEEE Access vol 6pp 4321ndash4339 2018

[45] G He L Zhang B Xu and H Zhu ldquoDetecting repackagedandroid malware based on mobile edge computingrdquo inProceedings of the 2018 Sixth International Conference onAdvanced Cloud and Big Data (CBD) IEEE Lanzhou Chinapp 360ndash365 August 2018

[46] R Roman J Lopez andMMambo ldquoMobile edge computingFog et al a survey and analysis of security threats andchallengesrdquo Future Generation Computer Systems vol 78pp 680ndash698 2018

[47] F Bonomi R Milito J Zhu and S Addepalli ldquoFog com-puting and its role in the internet of thingsrdquo in Proceedings ofthe First Edition of the MCC Workshop on Mobile CloudComputing ACM Helsinki Finland pp 13ndash16 August 2012

[48] M T Beck MWerner S Feld and S Schimper ldquoMobile edgecomputing a taxonomyrdquo in Proceedings of the Sixth Inter-national Conference on Advances in Future Internet pp 48ndash55 Lisbon Portugal November 2014

[49] P Mach and Z Becvar ldquoMobile edge computing a survey onarchitecture and computation offloadingrdquo IEEE Communi-cations Surveys amp Tutorials vol 19 no 3 pp 1628ndash1656 2017

[50] H T Dinh C Lee D Niyato and P Wang ldquoA survey ofmobile cloud computing architecture applications and ap-proachesrdquo Wireless Communications and Mobile Computingvol 13 no 18 pp 1587ndash1611 2013

[51] X Lyu H Tian L Jiang et al ldquoSelective offloading in mobileedge computing for the green internet of thingsrdquo IEEENetwork vol 32 no 1 pp 54ndash60 2018

[52] S Agrawal and J Agrawal ldquoSurvey on anomaly detectionusing data mining techniquesrdquo Procedia Computer Sciencevol 60 pp 708ndash713 2015

[53] A Tongaonkar ldquoA look at the mobile app identificationlandscaperdquo IEEE Internet Computing vol 20 no 4 pp 9ndash152016

[54] G He B Xu and H Zhu ldquoIdentifying mobile applications forencrypted network trafficrdquo in Proceedings of the 2017 FifthInternational Conference on Advanced Cloud and Big Data(CBD) IEEE Shanghai China pp 279ndash284 August 2017

[55] G Ranjan A Tongaonkar and R Torres ldquoApproximatematching of persistent lexicon using search-engines forclassifying mobile app trafficrdquo in Proceedings of the 35thAnnual IEEE International Conference on Computer Com-munications IIEEE San Francisco CA USA April 2016

[56] L Li and S Qu ldquoShort text classification based on improvedITCrdquo Journal of Computer and Communications vol 1 no 4pp 22ndash27 2013

[57] A H Lashkari A F A Kadir L Taheri and A A GhorbanildquoToward developing a systematic approach to generatebenchmark android malware datasets and classificationrdquo inProceedings of the 2018 International Carnahan Conference onSecurity Technology (ICCST) IEEE Montreal Canada Oc-tober 2018

[58] A Rodriguez and A Laio ldquoClustering by fast search and findof density peaksrdquo Science vol 344 no 6191 pp 1492ndash14962014

[59] Y Li Z Yang Y Guo and X Chen ldquoDroidbot a lightweightUI-guided test input generator for androidrdquo in Proceedings ofthe Software Engineering Companion (ICSE-C) pp 23ndash26Buenos Aires Argentina May 2017

[60] M Fan X Luo J Liu et al ldquoGraph embedding based familialanalysis of android malware using unsupervised learningrdquo inProceedings of the 2019 IEEEACM 41st International Con-ference on Software Engineering (ICSE) IEEE Piscataway NJUSA pp 771ndash782 May 2019

Security and Communication Networks 19

Page 7: On-DeviceDetectionofRepackagedAndroidMalwarevia …downloads.hindawi.com/journals/scn/2020/8630748.pdf · 2020-05-31 · According to [4–6], most Android malware programs are repackagedapps,andthemajorityofnewAndroidmalware

After introducing the general architecture we outline thedetection methods that can be used in the edge-computing-based framework e cloud can use various methods todetect Android malware For example it can use conven-tional signature matching [35] and network behavioranalysis [8 14] to identify malicious apps However thesetypes of methods require a priori information (such assignatures or behavior models) In this paper we propose anovel repackaged malware detection method that clustersthe network traffic that is generated by the same app fromvarious mobile devices Figure 6 presents an example of thisprocess

e motivations behind the traffic clustering are two-fold first most Android malware is produced by repack-aging popular apps [4] and popular apps are often widelydistributed Second the network traffic generated by therepackaged malware differs from that generated by theoriginal app (as illustrated in Section 21) As illustrated inFigure 3 many mobile devices would connect to the cloudfor malware detection Hence the cloud can observe thenetwork traffic features of both the original apps and theirrepackaged versions with large probabilities due to the widedistribution of popular apps and the large number of mobiledevices If we regard the features of the original apps as thenormal data and the features of the repackaged versions asthe abnormal data the repackaged malware detectionproblem is transformed into the abnormal detection

problem Clustering is an efficient method for solving thisproblem [52] By clustering appsrsquo network traffic features weneed not model appsrsquo network behaviors in advance andmore importantly we can detect various repackaged ver-sions of the original app as illustrated in Figure 6e detailsof the proposed malware detection method will be describedin detail in Sections 53ndash55

5 Methodology

In this section we introduce the technical details of theproposed clustering method for Android malware detectionTable 1 defines the main notation that we use in this paper

(a) (b) (c)

Figure 4 Possible deployments of the edge e edge could be the wireless gateways or the dedicated servers that are connected to theeNodeB or the routers (a) Wireless gateway (b) eNodeB+ server (c) Router + server

Clustering

Traffic features of an app collected from

different mobile devicesOriginal apps

Repackaged apps

Figure 6 Illustration of traffic clustering for repackaged malwaredetection e plain circles represent original apps and the circleswith triangles and inverted triangles are repackaged ones

Network connectionData transmission

Android malwaredetectionCloud

Mobiledevices

(1) Extractedfeatures

(2) Detectionresults

(a)

Network connectionData transmission

Network node

Mobiledevices

(1) Detectionresults

(b)

Figure 5 Available frameworks for android malware detection Various approaches such as [10] send the extracted features to a remoteserver other than to the cloud We still regard them as following the cloud-based framework because there is no essential difference formalware detection As illustrated in (b) the detection results are typically presented to the network administrator and are not fed back to themobile devices (a) Framework based on cloud (b) Framework based on network

Security and Communication Networks 7

51 System Overview Figure 7 illustrates the main proce-dure of the proposed approach App i is the app for vettingand it has been installed on umobile devicesese u devicesare connected to e edge servers We assume that r deviceshave installed the repackaged malware (the malware is alsopresented as App i to deceive the users) and rlt (12)uHence most of the devices have installed the original ver-sion is situation is reasonable because repackaged An-droidmalware is typically distributed in third-party marketswhich are less popular than official markets [23] us thenumbers of downloads and installations are relatively smallWe will use this characteristic to classify the clusteringresults

To detect the repackaged malware the edge server labelsthe network flows that are generated by App i according tothe flow metainformation en it extracts suitable trafficfeatures (such as traffic content and statistics) from thelabeled flows and sends the features to the cloud for pro-cessing erefore there are u feature sets for App i in thecloud e cloud filters these u sets according to the appversions and removes common traffic After that it calculatesthe pairwise similarities of the feature sets Finally it clustersthe similarity values and identifies r repackaged malwareinstances

e traffic labeling feature extraction and filtering(labels①② and③ in Figure 7) the similarity calculationon the traffic contents and behaviors (labels④ and⑤) andclustering of similarity values (label⑥) are the core elementsof the proposed method ey will be described in detail inthe following sections

52 Traffic Labeling Filtering and Feature Extraction

521 Traffic Labeling In this study traffic labeling isconducted to identify the corresponding app for each net-work flow which is known as the app identification problem

[53] Extensive works have been conducted on the identi-fication of apps from mobile network traffic [38 54 55]However the achieved identification accuracies are all lowerthan 100 In our method we design an edge-client app forcollecting flow metaformation and identifying apps accu-rately e flow metainformation is defined in the following

Definition 1 (flow metainformation) For flow f the flowmetainformation includes the client-side IP address C-IPfthe flow starting time Tf the flow protocol Pf the server-sideIP address S-IPf the server-side port S-Portf the app nameAppNf and the app version AppVfe flow f is bidirectionaland Pf is the transport layer protocol such as TCP or UDP

e edge-client app operates similarly to a VPN appeapp sends the flow metainformation to the edge server at afixed time interval Ti Once it has received the flow meta-information the edge server saves it into a metainformationdatabase and can label mobile app traffic accurately Denotethe flow observed by the edge server as fu e edge serverlabels fu as follows

Step 1 extract the recording time Tuf of fu and its flow

protocol Puf client-side IP address CminusIPu

f server-sideIP address SminusIPu

f and server-side port SminusPortufStep 2 search the metainformation database with Pu

fCminusIPu

fSminusIPuf and SminusPortuf If records match the da-

tabase returns the record that has the most recent flowstarting time Tr

f Otherwise the edge server waits for awhile and repeats Step 2Step 3 if Tu

f is close to Trf the edge server labels flow fu

as ltAppNf AppVfgtOtherwise it waits for a while andrepeats Steps 2 and 3

In the above steps the client-side IP address CminusIPuf is

used to identify a mobile device uniquely which is correctwhen the edge server is the wireless gateway or the eNodeBIn such scenarios the mobile devices directly connect to theedge hence the IP address is unique for each deviceHowever if the edge server is deployed at the backbonerouter as illustrated in subgraph (c) of Figure 4 the CminusIPu

f

may be confusing because mobile devices could be locatedbehind NATs (Network Address Translations) and they willshare the same public IP address To solve this issue theedge-client app can insert special indicators into the packetsto uniquely represent a device In this study we alwaysassume that CminusIPu

f is unique and we will investigate theimplementation of the device-indicator in future work

In steps 2 and 3 the edge server utilizes a wait-and-repeat strategy to ensure that the traffic labeling is fresh andaccurate e edge server must wait because the mobiledevices send flow metainformation to the edge server at afixed time interval Ti (60 s in our experiments) and the savedflow metainformation in the database may be outdated Inour implementations the waiting time is set to Ti and if|Tu

f minus Trf|gt 24 h the returned flow metainformation is

regarded as obsolete

522 Traffic Feature Extraction After traffic labeling theedge server extracts detection features from the labeled

Table 1 Main notation used in this paper

Notation Meaningi Android app iu Number of mobile devices with app i installedr Number of devices with repackaged app i

f Network flowC-IPf Client-side IP address of fS-IPf Server-side IP address of fS-Portf Server-side port of fAppNf Name of the app that generates fAppVf Version of the app that generates fTi Set time intervalTu

f Recording time of flow fu at the edge serverdi Feature set of app iwj Plaintext word in the feature setsV(di) Numerical vector of traffic contentsF

j

diFeature vector for the encrypted flow j in di

TBntimesm(di) Traffic behaviors of diCSdik

Content similarity between di and dkBSdik

Behavior similarity between di and dkSdik

Final similarity between di and dk

8 Security and Communication Networks

traffic e extracted traffic features are categorized into twogroups traffic content features and traffic behavior featuresTraffic content features are the plaintext contents that areextracted from HTTP flows and traffic behavior features arethe statistics of network flows such as packet sizes andintervals Since Android apps may use an encrypted HTTPSprotocol to transmit data [54] traffic behavior features alsomust be considered e traffic content and traffic behaviorfeatures are described in detail in Sections 53 and 54

523 Traffic Filtering After traffic feature extraction theedge servers send the extracted features along with theclient-side IP address server-side IP addresses server-sideports app name AppNf and app version AppVf to the cloudNote that App i is installed on u devices hence the cloudeventually has u feature sets for App i Figure 8 illustrates thestructure of the feature set e cloud further filters these ufeature sets to increase the detection accuracy First it di-vides the u sets into v categories according to the app versionAppVf erefore each category contains the feature setsthat are generated by the same version of App i en foreach category the cloud removes the common traffic fromthe feature sets e common traffic is defined as network

flows that are contained in all features sets ese flows havethe same server-side IP address and the same server-sideport number e reason for this step is that the commontraffic will obfuscate the similarity calculations betweenapps Otherwise the similarity values between features setsare all high and they cannot be effectively used to distin-guish the repackaged malware from the original apps

After the filtering has been completed the similaritiesbetween feature sets are calculated and the similarity valuesare clustered to identify the repackaged malware

Client-side IP addressApp name AppN_fApp version AppV_fFlow 1 server-side IP address 1 server-side port 1 flow features

Flow f server-side IP address f server-side port f flow features

Figure 8 Illustration of the feature set e set contains the clientIP address app name and app version Each network flow that isgenerated by App i in a device is represented by the featuresextracted from the flow and by the server IP address and port

Trafficlabeling

Edge server 1

Traffic featureextraction

Similarity calculation of

trafficbehaviors

Similarity calculation of traffic contents Clustering of

similarity values

Original app

Repackaged malware

Cloud

Mobile device 1 2 (with app i installed)

Results

Traffic features

Traffic and flowmetainformation

ę

ę

Edge server e

Traffic features

Traffic and flow metainformation

Filtering

Mobile device u(with app i installed)

1

3

4

5

6

2

Figure 7emain procedure of the proposedmethod for Androidmalware detectione core parts of the proposedmethod are labeled bycircled numbers

Security and Communication Networks 9

53 Similarity Calculation of Traffic Content Android appsuse mainly HTTP and HTTPS protocols to communicatewith their remote servers [54] For HTTP traffic since it isplaintext based we can efficiently use the flow content tocalculate the similarity e basic strategy is to convertHTTP traffic to document files and the document files canbe handled by information retrieval technologies for simi-larity calculation HTTP flow content has also been used inapp identification from network traffic [55] However in thisstudy we only consider partial HTTP header information tobetter protect mobile usersrsquo privacy

In detail first we reassemble HTTP flows from packetsand extract the flow contents by saving the correspondingpacket contents into a text file as ASCII code Figure 9 showssuch an example en we preprocess the text file by de-leting the HTTP request content and the HTTP requestheader except for the GET (or POST) and HOST fields eHTTP response content is also deleted namely only theHTTP response header the GET (or POST) and the HOSTfiled in the HTTP request header are saved Via this ap-proach the mobile usersrsquo privacy can be protected from thecloud (for example the mobile phone information (HUA-WEI) is leaked by the User-Agent field in Figure 9) Afterthat the text file is spilt using special characters (eg ) and the processed file becomes the flow contentfeatures

e similarity is calculated via the improved TF-IDFalgorithm [56] and the cosine similarity e term frequencymeasures the count of words in a document We use theaugmented frequency measurement Denote the document(the feature set) as di and the word as wj Term frequencytf(wj di) is calculated as

tf wj di1113872 1113873 05 +05 times f wj di1113872 1113873

max f w di( 1113857 w isin di1113864 1113865 (1)

where f(wj di) is the number of occurrences of wj indocument di

e inverse document frequency measures of theamount of information a word provides Typically it is thelogarithmically scaled inverse fraction of the number ofdocuments that contain the word e inverse documentfrequency of wj is calculated as

idf wj1113872 1113873 logu

d wj isin d1113966 111396711138681113868111386811138681113868

11138681113868111386811138681113868+ 001⎛⎝ ⎞⎠ (2)

where u is the number of documents and | d wj isin d1113966 1113967| isnumber of documents in which the term wj appears

Denote by p(wj di) the product of TF(wj di) andI DF(wj) it is formulated as

p wj di1113872 1113873 log tf wj di1113872 11138731113872 1113873 times idf wj1113872 1113873

1113936

cj0 log

2 tf wj di1113872 11138731113872 1113873 times idf2 wj1113872 11138731113969 (3)

After the TF-IDF operation all documents are convertedinto equal-length numerical vectors We use the cosinesimilarity to measure the similarity between two documentvectors For documents di and dk the corresponding

numerical vectors are denoted as V(di) and V(dk) and thelength of vector is t e similarity between di and dk iscalculated as

CSdik

1113936tl1 V di( 1113857l times V dk( 1113857l( 1113857

1113936tl1 V di( 1113857

2l

1113969

times

1113936tl1 V dk( 1113857

2l

1113969 (4)

Suppose that in category vi (we divide the u feature setsgenerated by App i in to v categories in the traffic filteringstep) there are c feature sets For each feature set we save theflow features of all HTTP flows into a text file and c text filesare obtained en we calculate the numerical vector foreach feature set via equation (3) Last we compare thenumerical vector with each other as equation (4) and gain csimilarity values (CSdii

1) for each feature set We rep-resent these similarity values of feature set di (i 1 2 c)as a vector Si

contents and Sicontents (CSdi1

CSdi2 CSdic

)e vectors Si

contents (i 1 2 c) in combination with thesimilarity values of traffic behaviors will be clustered toidentify the abnormal data points namely the repackagedmalware

54 SimilarityCalculation of TrafficBehaviors In addition tocalculating the similarity values of traffic contents thesimilarity of traffic behaviors must be calculated due toencrypted flows We use m (m 11 in our study) statisticalfeatures as listed in Table 2 to represent the behavior of anencrypted network flow e units of the flow duration andthe packet interval time are milliseconds in our experiments

With the flow statistical features for feature set di the jthencrypted flow can be represented as a vector in the featurespace

Fj

di f

1j f

2j f

mj1113872 1113873 (5)

where flj represents the value of the lth statistical feature

1le llem Figure 10 shows an example of feature represen-tation for encrypted traffic Consequently the traffic be-haviors of di can be represented as a matrix of f

pj

Figure 9 Example of converting an HTTP flow to a text file

10 Security and Communication Networks

TBntimesm di( 1113857 F1di

F2di

Fndi

1113872 1113873T (6)

where n is the number of encrypted flows that are containedin di and nge 1

Since different feature sets may differ in terms of thenumber of encrypted flows traffic behaviors TBnitimesm(di) andTBnktimesm(dk) may differ in terms of different dimensionnamely the value of ni should not be equal to that of nk Toefficiently calculate the similarity between TBnitimesm(di) andTBnktimesm(dk) we calculate the Frobenius norm of TBnitimesm(di)

and TBnktimesm(dk) e formula is presented in equation (7)where norm(middot) normalizes the value of fl

j to [0 1]

FN(d)

1113944

n

j11113944

m

l1norm fl

j1113872 111387311138681113868111386811138681113868

111386811138681113868111386811138682

11139741113972

(7)

en the distance between TBnitimesm(di) and TBnktimesm(dk) iscalculated as

dis di dk( 1113857 FN di( 1113857 minus FN dk( 11138571113868111386811138681113868

1113868111386811138681113868 (8)

Finally the similarity between TBnitimesm(di) andTBnktimesm(dk) can be calculated by

BSdik 1 minus

dis di dk( 1113857

max dis dr dq1113872 1113873 r q 1 2 u rne q1113960 1113961 (9)

In the above process ni and nk are assumed to exceed 1If ni 0 (nk 0) and nkge 1 (nige 1) BSdik

is set to 0 to reflectthe significant difference between di and dk However ifni 0 and nk 0 namely there is no encrypted flow in bothdi and dk BSdik

is set to 1 because they exhibit the samebehaviors ie they do not produce any encrypted traffic

Similarly we let BSdii 1 and represent these similarity

values as a vector Sibehavior BSdi1

BSdi2 BSdiu

1113966 1113967

55 Clustering of Similarity Values and Malware DetectionFrom the traffic content similarity CSdik

and the traffic be-havior similarity BSdik

we can determine the final similarityvalue BSdik

between the feature sets di and dk as expressed inequation (10) In this equation q is the weight value In ourexperiments we set q 05 namely the traffic contentsimilarity and the traffic behavior similarity are assigned thesame weight is is because apps typically produce almostthe same numbers of HTTP and HTTPS flows [57] We alsoevaluate various weighted values in Section 62 and theexperimental results support that 05 is the most suitablevalue For feature set di a vector Sdi

(Sdi1 Sdi2

Sdiu) is

finally obtained which contains the similarity values be-tween di and the other feature setse set of vector Sdi

1113966 1113967 i

1 2 u will be clustered for the detection of repackagedmalware

Sdik q times CSdik

+(1 minus q) times BSdik (10)

To be specific we use the density peak clustering method[58] to realize our objective Density peak clustering is basedon the following assumption cluster centers are charac-terized by a higher density than their neighbors and by arelatively large distance from points that have higher den-sities Hence density peak clustering can automaticallyidentify the correct number of clusters is is importantbecause we do not know whether there is repackagedmalware or how many types of repackaged malware thereare In other words we cannot determine the number ofclusters in advance Density peak clustering can help usovercome this challenge

We assume that most mobile devices have installed theoriginal apps as discussed in Section 51 erefore afterclustering the cluster that has the largest number of voxelswill be recognized as original and the remaining clusters willbe regarded as suspicious However suspicious apps are notnecessarily repackaged malware For example apps can berepackaged to show ads only Repackaged apps of this typeshould not be simply judged as malware Additionallyoriginal apps can be suspicious due to different user be-haviors To further reduce the frequency of the false alarmswe can check the server hostnames that are contained in therepackaged clusters using VirusTotal (the URLs are includedin the HTTP traffic contents) If all of the serversrsquo hostnamesfor a suspicious cluster Cr are judged as clean it will berecognized as a cluster of abnormal but harmless appsOtherwise it will be identified as repackaged malware Viathis approach we can accurately detect repackaged Androidmalware In this step we only use VirusTotal as a white list toexclude false alarms We do not use it as a backlist (detectionsignatures) to detect repackaged malware Malware detec-tion is still realized by clustering the similarity values

Table 2 Flow statistical featuresBytes sent by the app and bytes sent by the serverNumbers of outgoing and incoming packetsMean and std dev of the outgoing and incoming packet sizesrsquoflow durationMean and std dev of the interpacket time

lt2761 16696 16 12 173 1294 289 1177 3392 121 190gt

Figure 10 Example of the conversion of an encrypted flow to afeature vector

Security and Communication Networks 11

6 Evaluation

61 System Implementation and Dataset Figure 11illustrates our experimental setup e Android phoneswere connected to the edge server via WiFi and each phonehad installed the edge-client app which sends the flowmetainformation to the edge server using Socket (jav-anetSocket) e edge is a Dell PowerEdge R730 server(with 32 processors 16GB RAM and 12 TB hard drives)that is configured as an access point In the edge server weused tcpdump (httpwwwtcpdumporg) to collect networktraffic e similarity calculations of traffic contents andbehaviors and the clustering of similarity values wereimplemented on the Windows Azure cloud computingplatforme URL checking was conducted using Virustotal(httpswwwvirustotalcom)

In the implementation we used aria2c (httpsaria2githubio) to download apps from Androzoo (httpsandrozoounilu) aria2c was encapsulated as commandsin Python code for parallel downloading After downloadingthe apps we automatically installed them on phones usingthe adb shell command Each phone had 40 tested appsinstalled We have implemented the edge-client app basedon Android VPN service In our implementations the edge-client app correlated network flows and apps by resolvingthe ports and PIDs (Process IDs) send the newly collectedflow metainformation to the edge server every 60 secondse edge server used tcpdump to collect network trafficwhich was saved as pcap files To extract traffic features weused Splitcap (httpswwwnetreseccompageSplitCap)to restore tcp flows from pcap files en for HTTP trafficwe extracted the contents using tshark (httpswwwwiresharkorgdocsman-pagestsharkhtml) commands -zfollow tcp and ascii e packet sizes and packet intervaltimes were also obtained by tshark with commands -eframelen ndashe frametime_relative In the cloud CSdik

and BSdik

were calculated using scikit-learn (httpscikit-learnorg)and Numpy (httpsplotlynumpynorm) e code ofdensity peak clustering is available on GitHub (httpsgithubcomjasonwbwDensityPeakCluster) and we havefixed various bugs that were present in the original code

We have downloaded 200 pairs of original andrepackaged apps from Androzoo for testing However notall of these downloaded repackaged apps are maliciousSome apps are only repackaged to show extra ads To screenout Android malware from the downloaded repackagedapps we further checked the 200 repackaged apps usingVirustotal If an app was judged as safe by all the antivirustools it was labelled as repackaged_safe in our experimentsOtherwise it was labelled as repackaged_malware In total143 repackaged apps were detected as malware by VirustotalNevertheless in our experiments we evaluated the proposedclusteringmethod on all 200 repackaged apps to obtainmorerealistic performance results

e downloaded apps were run in 10 rooted Androidphones in parallel to accelerate the experimental process Forall apps we used Droidbot [59] to send input events to themand to generate network traffic automatically We ran eachrepackaged app (repackaged_safe and repackaged_malware

apps) 10 times and the corresponding original app 30 timesto generate sufficient network traffic e running times ofthe original apps exceeded those of the repackaged appsbecause we assume that most of the apps are original asdiscussed in Section 51 Consequently we obtained 40traffic sets for each repackaged-original app pair Note thatfrom the perspectives of the mobile devices and the edgeserver there are only 200 distinct apps because therepackaged app has the same name as the correspondingoriginal app erefore our experiments have simulated40lowast 200 8000 mobile devices For each repackaged-orig-inal app pair there were 30 devices with the original app and10 devices with the repackaged version

In total we have gathered almost 89GB network traffice datasets that are used in our experiments are summa-rized in Table 3 In the table the traffic dataset sizes are thesizes of the captured network traffic and the feature datasetsizes are the sizes of the extracted features In our experi-ments the HTTP traffic contents and the flow statisticalfeatures which are listed in Table 2 were both saved as textfiles As shown in Table 3 the feature datasets are signifi-cantly smaller than the traffic datasets (25G vs 63G and 9G vs26G)us the edge computing can reduce the data size andincrease the data acquisition speed for cloud computing[24] Since the feature dataset sizes are still large (25G+ 9G)we did not send all these feature data to the cloud simul-taneously Instead we sent the feature data one by oneWhen the feature data of an app were successfully receivedby the cloud its traffic similarities were calculated andclustered immediately After that the feature data of anotherapp would be sent and analyzed e experimental resultswill be presented in the following

62 ExperimentalResults In this section wemainly evaluatethe detection efficiency of Android malware by clusteringsimilarity values e performance evaluation of the pro-posed method in terms of power memory and CPU con-sumptions is elaborated in the following Section 63

Similar to previous studies [8 14] to measure theperformance of the proposed method Accuracy and F-measure metrics are utilized which are defined in equations(11) and (12) In equation (11) TP FN FP and TN aredefined in Table 4 In Table 4 repackaged_x representsrepackaged_safe or repackaged_malware e precision andrecall in equation (12) are calculated via equations (13) and(14)

Accuracy TP + TN

TP + FN + FP + TN (11)

F minus measure 2 times precision times recallprecision + recall

(12)

precision TP

TP + FP (13)

recall TP

TP + FN (14)

12 Security and Communication Networks

We have evaluated the Accuracy and F-measure for eachrepackaged app Recall that we ran each repackaged app 10times and the corresponding original app 30 timeserefore for each app the number of the original apptraffic sets is 30 and the number of the repackaged app trafficsets is 10 namely TP + FN 10 and TN+FP 30 for eachapp We have randomly chosen 20 apps from repack-age_malware and the experimental results for these apps arepresented in Figures 12 and 13 Detailed descriptions of thechosen 20 apps are listed in Table 5 In the table the mainfunction of each app is determined by analyzing the packagename from the AndroidManifestxml and from the Googlesearch results of the package name As depicted in Figure 12100 accuracy was realized on a total of 6 apps (app3 app5app9 app10 app13 and app18) e minimum accuracy is85 (app2 3440) and the average accuracy is 952 Fig-ure 13 presents the F-measure values In our experimentsthe maximum F-measure value is 1 and the minimum valueis 0737 e average F-measure value is 0888 ese resultsdemonstrate the satisfactory performance of our proposedclustering method

We have also calculated the average accuracy and F-measure values for all repackaged_safe and repack-aged_malware apps In these calculationsTP + FN 57lowast10 570 and TN+FP 57lowast 301710 for

repackaged_safe apps For repackaged_malware appsTP + FN 143lowast101430 and TN+FP 143lowast 30 4290is is because in our experiments 143 apps were identifiedas repackaged malware and 57 apps as repackaged safe byVirustotal us the parameter values differ e experi-mental results are listed in Table 6

According to Table 6 the average accuracy values forrepackaged_safe and repackaged_malware apps are bothhigh However the average F-measure for repackaged_safeapps is only 078 which is lower than that of repackagedmalware is indicates that repackaged safe apps are

Router Windows Azure

Edge server DellPowerEdge R730

10 android phones

Repackagedoriginal apps

Androzoo

Figure 11 Experimental setup

Table 3 Datasets used in the experiments

of apps of runs of HTTP flows of encrypted flows Traffic dataset sizes Feature dataset sizesOriginal apps 200 30 340015 183452 asymp63G bytes asymp25G bytesRepackaged apps 200 10 110670 93041 asymp26G bytes asymp9G bytes

Table 4 Confusion matrix

ObservedDetected

Repackaged_x app Original appRepackaged_x app TP ( of TPs) FN ( of FNs)Original app FP ( of FPs) TN ( of TNs)

Apps

0

01

02

03

04

05

06

07

08

09

1

Acc

urac

y

App

1A

pp2

App

3A

pp4

App

5A

pp6

App

7A

pp8

App

9A

pp10

App

11A

pp12

App

13A

pp14

App

15A

pp16

App

18A

pp17

App

19A

pp20

Figure 12 Detection accuracies for the 20 selected apps

Security and Communication Networks 13

difficult to distinguish from their original apps is isbecause repackaged_safe apps only generate smallamounts of additional network flows and their networktraffic features are not clearly distinguishable from thoseof normal traffic Fortunately the average F-measure ofrepackaged malware apps is outstanding is may be dueto the fact that malware typically send stolen data to orreceive commands from remote servers Hence theirnetwork traffic features are special and distinguishableFigure 14 presents an example of app clustering and il-lustrates the distinguishability of the repackaged malwareand the original app ese results demonstrate thatrepackaged Android malware can be effectively detectedby comparing their network traffic with that of originalapps

For repackaged malware apps the accuracy results withdifferent weight values (q in equation (10)) are presented inFigure 15 As depicted in the figure when q 05 we realizethe highest accuracy rate 969 If q 0 namely only thefeatures of encrypted traffic are used the accuracy result is753 e result is 617 when q 1 namely only thefeatures of plaintext traffic are considered ese resultsdemonstrate that both the encrypted and plaintext trafficcontribute to the differentiation of the original apps and therepackaged malware However the encrypted traffic con-tributes more than the plaintext traffic is may becausemalicious apps usually connect to their servers by securityprotocols to evade IDS detection

In the previous experiments we checked the serversrsquohostnames using VirusTotal to reduce the frequency of thefalse positives For various clusters that contained less than20 elements if all the serversrsquo hostnames were judged asclean the cluster was identified as an original app rather thana repackaged one As argued in Section 55 this process onlyreduces the frequency of false alarms To support this ar-gument we list the numbers of true positives and falsepositives of the 143 repackaged malware in Table 7 forcomparison When conducting the hostname checking thenumber of true positives is 1400 and the result is the samewhen the hostname checking is turned off is findingsupports that the malware is detected by clustering thesimilarity values However the number of false positivesincreased from 47 to 213 when the hostname checking wasturned off erefore we suggest checking the serversrsquohostnames after clustering to increase the practicality of themethod Note that false positives still occur when we checkthe hostnames by VirusTotal Hence there may be malwarein the original apps In our dataset the typical examples arecommobovideoplayerpro and cncfshop ele1taoxiaosanWe further analyze these two apps by reading theirdecompiled codes and we determine that they indeed es-tablish connections with suspicious servers In detail invarious runs app commobovideoplayerpro connects withhttpwwwtopappsquarecomwhich is judged as maliciousby CyRadar in VirusTotal App cncfshop ele1taoxiaosanvisits suspicious URL http955ccjSn9 Since these apps arethe original apps they are identified as false positives in theexperiments

In our experiments we chose density peak clustering asthe clustering method because density peak clustering canautomatically determine the correct number of clusters [58]Hence our method can detect various versions of malwarethat are repackaged from the same original app Unfortu-nately in the apps that were downloaded from AndroZoono malware was repackaged from the same original app Toevaluate the efficiency of detecting versions of malware wehave created two malware instances from a popular game

Apps

App

1A

pp2

App

3A

pp4

App

5A

pp6

App

7A

pp8

App

9A

pp10

App

11A

pp12

App

13A

pp14

App

15A

pp16

App

18A

pp17

App

19A

pp20

0

01

02

03

04

05

06

07

08

09

1

F-m

easu

re

Figure 13 Detection F-measure for the 20 selected apps

Table 5 Detailed descriptions of the chosen 20 apps

Main function ofapp

of HTTPflows

of encryptedflows

App1 VideoPlayer 453 51App2 Game 657 1455App3 Browser 1531 1478App4 MusicPlayer 341 25App5 Game 612 289App6 Game 788 123App7 Game 1336 524App8 Fitness 357 892App9 Email 122 431App10 Game 963 790App11 MusicPlayer 1128 547App12 Game 1598 426App13 News 2460 1178App14 News 2891 1543App15 Chat 171 49App16 Email 143 368App17 WeatherForecast 397 120App18 Game 1832 921App19 Fitness 732 362App20 OnlineLearning 2378 834

Table 6 Average accuracy and F-measure

Average accuracy () AverageF-measure

Repackaged_safe 903 078Repackaged_malware 969 094

14 Security and Communication Networks

app namely aircomaceviralmotox3 One is produced bygrafting AnserverBotrsquos source code into air-comaceviralmotox3 and the other is created by adding codefor stealing usersrsquo contacts and short messages and sendingthem to a remote server Similarly each malware is run 10times and the original app aircomaceviralmotox3 is alsorun 10 times for validatione clustering result is presentedin Figure 16 Apparently there are 3 clusters e

experimental results show that the value of F-measure foreach malware is 1 Hence our method can detect variousmalware that were repackaged from the same original app

63 Comparison In this section we compare our methodwith the typical methods that are based on network or cloudas illustrated in Figure 5 e comparative indicators are thedetection accuracy F-measure average power CPU andmemory consumptions of the mobile device First wecompare the proposed method with AppFA [14] which isimplemented completely at the network level Since AppFAalso must analyze app traffic we directly use the traffic sets ofthe repackaged malware apps for comparison While testingAppFA we deleted the label information that was containedin the traffic sets and mixed the network flows as a wholeen we used the signature matching and the constrainedK-means clustering algorithm proposed in [14] to restoreapp sessions namely the label for each flow e similarapps used in AppFA were chosen in Google Play via themethod introduced in [14] and for each repackaged mal-ware 10 similar apps were selected as the peer group epeer group apps were run once by Droidbot to generatenetwork traffic e comparison results are listed in Table 8

Since AppFA need not to install some special apps onmobile devices the extra power memory and CPU con-sumptions are 0 However the detection accuracy is muchlower (863 vs 969) than that of the proposedmethod ascompared in Table 8 In the proposed method we use theedge-client app to record flow metainformation thusnetwork flows can be accurately labeled However inAppFA flow labeling is realized via traffic clustering and theclustering accuracy cannot be 100 Meanwhile AppFAcompares the appsrsquo network behaviors with those of theirsimilar apps which also results in errors In the proposedmethod we directly compare the repackaged malware withits original app and the detection accuracy is increased eaverage power CPU and memory consumptions of theproposed method are also low and acceptable

In Table 8 the power memory and CPU consumptionsfor our edge-client app were obtained with the help of appGT (httpsgithubcomTencentGT) GT is a portable toolthat runs on smartphones for debugging the APP internalparameters and the code time-consuming statistics eaverage values for the power memory and CPU con-sumptions are calculated as follows Before running an appwith Droidbot we started the edge-client app by GT Whenthe Droidbot was stopped we recoded the power CPU andmemory consumptions We repeated the process until allapps (including 143 repackaged malware apps and theiroriginal versions) had been tested Via this way we obtainedthe average power CPU and memory consumptions for theedge-client app According to Table 8 the edge-client app islightweight and has minimal impact on the performances ofthe mobile devices

en we compare the proposed method with SAMA-Droid [44] SAMADroid monitors Android system calls andsends the calls to a centralized server for the detection ofAndroid malware SAMADroid is a typical cloud-based on-

Decision graph rho-delta-rho

20 251510 35 403050Sorted rho

00

02

04

06

08

Rho lowast

delt

a

Figure 14 Example of app clustering Obviously two clusters areidentified e dot in the top left corner is an outlier e outlier isdetected as an abnormal but harmless app by our method isfigure was generated by the density peak clustering tool

01 02 03 04 05 06 07 08 09 10Value of q

0

01

02

03

04

05

06

07

08

09

1

Acc

urac

y

Figure 15 Accuracies of various weight values for traffic contentsimilarity and traffic behavior similarity

Table 7 True positives and false positives with and withouthostname checking

of true positives of false positivesWith hostname checking 1400 47Without hostnamechecking 1400 213

Security and Communication Networks 15

device detection method We have implemented the dy-namic detection according to the description in [44] Indetail the APIs that are invoked by the running apps aremonitored by the tool Sensitive API Monitor (httpsgithubcomcyruliuSensitive-API-Monitor) As done in [44] thefrequency values of 10 system calls (open ioctl brk readwrite close sendto sendmsg recvfrom and recvmsg) arerecorded ese recorded frequency values are sent toWindows Azure for malware detection We evaluate thedynamic detection performance of SAMADroid on our 143repackaged_malware apps and their original versions In theevaluation the frequency values of the original apps are usedto create the normality model and the frequency values ofthe repackaged malware apps are used for testing We alsouse the app GT to record the power memory and CPUconsumptions of SAMADroid Note that our phones wereall rooted and Sensitive_API_Monitor and GT can be usede experimental results are presented in Table 9

Last we compare our method with Shabtairsquos method [8]Shabtairsquos method detects Android malware at the mobiledevices and it considers the apprsquos network traffic patternsonly the HTTP traffic contents are not considered isgives us an opportunity to compare our method withmethods that have been specially designed for encryptedtraffic We have implemented Shabtairsquos method and chosenthe feature subset 2 as the detection features Feature subset2 includes Avg Sent Bytes Avg Rcvd Bytes and Pct ofAvg Rcvd Bytes among other values and it is proved to be

the best feature subset for malware detection [8] e De-cisionRegression tree is chosen as the classification algo-rithm e testing dataset is the traffic that was generated bythe 143 repackaged malware instances and the training set isthe traffic that was generated by the corresponding originalapps e classification results are listed in Table 10According to the table the detection accuracy and F-mea-sure values of our method are significantly higher than thoseof Shabtairsquos method is indicates that the HTTP trafficprovides useful information for Android malware detectionSince Shabtairsquos method is conducted at the mobile devices itconsumes more resources than our method as compared inTable 10

7 Discussion

e results of our extensive experiments demonstrate thatrepackaged Android malware can be efficiently and effec-tively detected by clustering network traffic at the edgedevices However there are still limitations for our methodFirst we assume the most of the devices have installed theoriginal apps (rlt (12)u as described in Section 51) is isreasonable in most cases since mobile users typicallydownload apps from the official markets However in areaswhere official markets such as the Google play and Amazonare banned it is possible that the most mobile devices haveinstalled the repackaged apps is may cause false positivesSecond if the malware does not generate sufficient trafficour method cannot detect it For example some malwareonly sends fraudulent premium SMS messages is type ofmalware cannot be detected by monitoring its networktraffic ird we rely on third-party tools such as Virustotalfor the identification of malicious URLs and to reduce thenumber of false alarms If a malicious app rents legalplatforms such as the public cloud as the remote controlserver our method will judge it as a normal app Hence falsenegatives will be generated

To overcome these limitations we could combine theproposed method with available methods such as SAMA-Droid In such cases the edge-client app will not only recordflow metainformation but also monitor app system be-haviors However the system behavior monitoring does notneed to be conducted all the time If abnormal but harmlessor lower traffic apps are detected the edge-client app will beinformed to record the corresponding appsrsquo system be-haviors Similarly the edge-client app can send the appsystem behaviors to the edge server for further processinge performance of this combined method will be evaluatedin our future work One can also combine the proposedframework with AppFA [14] for the detection of other typesof Android malware For example the edge extracts apptraffic behavior features and the cloud can compare appsrsquobehaviors with their peer groups to identify abnormalitiesFurthermore one could first use clustering techniques [60]to group repackaged apps according to the same maliciousmodules and then analyze the behaviors of the representativerepackaged apps in each group to obtain more accuratebehavior features

y

2D nonclassical multidimensional scaling

00ndash02 02ndash04x

ndash04

ndash03

ndash02

ndash01

00

01

02

03

04

Figure 16 Clustering results of various malware that wererepackaged from aircomaceviralmotox3e outlier in the middleis detected as an abnormal but harmless app is figure wasgenerated by the density peak clustering tool

Table 8 Comparison of the proposed method with AppFA

Proposed method AppFAAccuracy 969 863F-measure 094 075Average power consumption 12 0Average CPU consumption 06 0Average memory consumption 654MB 0

16 Security and Communication Networks

e paradigms of edge computing are used to detectAndroid malware in this study However several challengeswould be encountered when deploying it in practice Firstthe mobile devices must install a specified app for thecollection of essential information which may cause con-cerns and mobile users may refuse to cooperate Second inedge computing the data should be preprocessed by the edgeto better protect the usersrsquo privacy However the edge servercan observe all the data (including the sensitive informa-tion) and it threatens the privacy of users For example ifthe edge is deployed by the cloud provider the usersrsquo privacyinformation is also available to the cloud ere is no dif-ference with the user-cloud model for privacy protectionunless the edge server is deployed by the userird the edgeserver should be well protected from network attacksOtherwise it will be convenient for hackers to steal usersrsquoprivate information

8 Conclusion

In this paper we propose a novel method for detectingAndroid malware based on edge computing To the best ofour knowledge the detection of Android malware by uti-lizing edge computing and traffic clustering has not beenconsidered previously By introducing the edge layer themain tasks of the mobile device are offloaded to the edgeserver and the mobile device only needs to record flowmetainformation which substantially conserves the re-sources of the mobile device e edge server extracts appsrsquonetwork traffic features and sends these features to the cloudplatform for further analysis us the usersrsquo privacy can bebetter protected With the received features the cloud cal-culates the similarities between apps and clusters thesesimilarity values to separate the original apps and themalware automatically We evaluated our methods on 400Android apps and the experimental results demonstratethat the detection accuracy can reach 100 for several appsand the average accuracy is 969 We also tested theperformance of the proposed method and compared it with

the typical approaches e comparison results show thatour method can detect repackaged Android malware withboth higher detection accuracy and lower power and re-source consumptions In the future we will continue to workto overcome the discussed limitations and to evaluate theproposed method in real world scenarios

Data Availability

e downloaded apps used to support the findings of thisstudy have been deposited in httpspanbaiducoms1b8vMceIdtdd1gBRC5Gbzvw Its extraction code is vd5re clustering code is also in the repository Since appnetwork traffic contains usersrsquo sensitive information itcannot be publicly shared However we have explained allthe steps to generate network traffic automatically esesteps are also described in the instructionspdf in therepository

Conflicts of Interest

e authors declare that they have no conflicts of interest

Acknowledgments

is work was supported by National Key TechnologiesResearch and Development Program of China under Grant2018YFD0401404 National Natural Science Foundation ofChina under Grants 61702282 61802192 and 71801123Natural Science Foundation of the Jiangsu Higher EducationInstitutions of China under Grants 18KJB520024 and17KJB520023 Nanjing Forestry University under GrantsGXL016 and CX2016026 and NUPTSF under GrantNY217143

References

[1] Statista Mobile Operating Systemsrsquo Market Share Worldwidefrom January 2012 to December 2019 Statista HamburgGermany 2009 httpswwwstatistacomstatistics272698global-market-share-held-by-mobile-operating-systems-since-2009

[2] W Martin F Sarro Y Jia Y Zhang and M Harman ldquoAsurvey of app store analysis for software engineeringrdquo IEEETransactions on Software Engineering vol 43 no 9pp 817ndash847 2017

[3] httpswwwappbraincomstatsnumber-of-android-apps2020

[4] L Li D Li T F Bissyande et al ldquoUnderstanding android apppiggybacking a systematic study of malicious code graftingrdquoIEEE Transactions on Information Forensics and Securityvol 12 no 6 pp 1269ndash1284 2017

[5] M Fan J Liu X Luo et al ldquoAndroid malware familialclassification and representative sample selection via frequentsubgraph analysisrdquo IEEE Transactions on Information Fo-rensics and Security vol 13 no 8 pp 1890ndash1905 2018

[6] J Zhang Z Qin K Zhang H Yin and J Zou ldquoDalvik opcodegraph based android malware variants detection using globaltopology featuresrdquo IEEE Access vol 6 pp 51964ndash51974 2018

[7] S Garg S K Peddoju and A K Sarje ldquoNetwork-baseddetection of android malicious appsrdquo International Journal ofInformation Security vol 16 no 4 pp 385ndash400 2016

Table 9 Comparison of the proposed method with SAMADroid

Proposed method SAMADroidAccuracy 969 831F-measure 0940 0793Average power consumption 12 23Average CPU consumption 06 08Average memory consumption 654MB 773MB

Table 10 Comparison of the proposed method with Shabtairsquosmethod

Proposed method Shabtairsquosmethod

Accuracy 969 771F-measure 0940 0695Average power consumption 12 77Average CPU consumption 06 23Average memory consumption 654MB 937MB

Security and Communication Networks 17

[8] A Shabtai L Tenenboim-Chekina D Mimran L RokachB Shapira and Y Elovici ldquoMobile malware detection throughanalysis of deviations in application network behaviorrdquoComputers amp Security vol 43 pp 1ndash18 2014

[9] L Xie X Zhang J-P Seifert and S Zhu ldquopBMDS a be-havior-based malware detection system for cellphone de-vicesrdquo in Proceedings of the ird ACM Conference onWireless Network Security ACM New York NY USApp 37ndash48 2010

[10] I Burguera U Zurutuza and S Nadjm-Tehrani ldquoCrowdroidbehavior-based malware detection system for androidrdquo inProceedings of the 1st ACMWorkshop on Security and Privacyin Smartphones and Mobile Devices ACM Chicago IL USApp 15ndash26 October 2011

[11] G Dini F Martinelli A Saracino and D SgandurraldquoMADAM a multi-level anomaly detector for android mal-warerdquo in Proceedings of the International Conference onMathematical Methods Models and Architectures for Com-puter Network Security vol 12 Springer St PetersburgRussia pp 240ndash253 2012

[12] Y Zhang M Yang B Xu et al ldquoVetting undesirable be-haviors in android apps with permission use analysisrdquo inProceedings of the 2013 ACM SIGSAC Conference on Com-puter amp Communications Security ACM Berlin Germanypp 611ndash622 November 2013

[13] SWang Z Chen X Li LWang K Ji and C Zhao ldquoAndroidmalware clustering analysis on network-level behaviorrdquo inProceedings of the International Conference on IntelligentComputing Springer Liverpool UK pp 796ndash807 August2017

[14] G He B Xu and H Zhu ldquoAppFA a novel approach to detectmalicious android applications on the networkrdquo in Securityand Communication Networks Wiley Hoboken NJ USA2018

[15] X Wang Y Yang and Y Zeng ldquoAccurate mobile malwaredetection and classification in the cloudrdquo SpringerPlus vol 4no 1 p 583 2015

[16] K Tam A Feizollah N B Anuar R Salleh and L Cavallaroldquoe evolution of android malware and android analysistechniquesrdquo ACM Computing Surveys (CSUR) vol 49 no 4p 76 2017

[17] M Sun X Li J C S Lui R T B Ma and Z Liang ldquoMonet auser-oriented behavior-based malware variants detectionsystem for androidrdquo IEEE Transactions on Information Fo-rensics and Security vol 12 no 5 pp 1103ndash1112 2017

[18] T Vidas and N Christin ldquoEvading android runtime analysisvia sandbox detectionrdquo in Proceedings of the 9th ACMSymposium on Information Computer and CommunicationsSecurity pp 447ndash458 Kyoto Japan June 2014

[19] Y Duan M Zhang A V Bhaskar et al ldquoings you may notknow about android (Un) packers a systematic study basedon whole-system emulationrdquo in Proceedings of the 2018Network and Distributed System Security Symposium pp 1ndash15 San Diego CA USA February 2018

[20] L Xue X Luo L Yu S Wang and D Wu ldquoAdaptiveunpacking of android appsrdquo in Proceedings of the 2017 IEEEACM 39th International Conference on Software Engineering(ICSE) IEEE Buenos Aires Argentina pp 358ndash369 May2017

[21] Y Zhou and X Jiang ldquoDissecting android malware char-acterization and evolutionrdquo in Proceedings of the 2012 IEEESymposium on Security and Privacy IEEE San Francisco CAUSA pp 95ndash109 2012

[22] N W Lo S K Lu and Y H Chuang ldquoA framework for thirdparty android marketplaces to identify repackaged appsrdquo inProceedings of the 2016 IEEE 14th International Conference onDependable Autonomic and Secure Computing 14th Inter-national Conference on Pervasive Intelligence and Computing2nd International Conference on Big Data Intelligence andComputing and Cyber Science and Technology Congresspp 475ndash482 Auckland New Zealand August 2016

[23] L Li J Gao M Hurier et al ldquoAndrozoo++ collectingmillions of android apps and their metadata for the researchcommunityrdquo 2017 httpsarxivorgabs170905281

[24] W Shi J Cao Q Zhang Y Li and L Xu ldquoEdge computingvision and challengesrdquo IEEE Internet of ings Journal vol 3no 5 pp 637ndash646 2016

[25] W Zhou Y Zhou X Jiang and P Ning ldquoDetectingrepackaged smart-phone applications in third-party androidmarketplacesrdquo in Proceedings of the Second ACM Conferenceon Data and Application Security and Privacy ACM AntonioTX USA pp 317ndash326 February 2012

[26] Y Shao X Luo C Qian P Zhu and L Zhang ldquoTowards ascalable resource-driven approach for detecting repackagedandroid applicationsrdquo in Proceedings of the 30th AnnualComputer Security Applications Conference ACM NewOrleans LA USA pp 56ndash65 December 2014

[27] K Chen P Wang Y Lee et al ldquoFinding unknown malice in10 seconds mass vetting for new threats at the google-playscalerdquo in Proceedings of the USENIX Security Symposiumvol 15 Washington DC USA August 2015

[28] Z Wang C Li Y Guan and Y Xue ldquoDroidchain a novelmalware detection method for android based on behaviorchainrdquo in Proceedings of the 2015 IEEE Conference onCommunications and Network Security (CNS) IEEE Flor-ence Italy pp 727-728 September 2015

[29] K Tian D D Yao B G Ryder G Tan and G PengldquoDetection of repackaged android malware with code-het-erogeneity featuresrdquo IEEE Transactions on Dependable andSecure Computing vol 17 no 1 pp 64ndash77 2017

[30] A De Lorenzo F Martinelli E Medvet F Mercaldo andA Santone ldquoVisualizing the outcome of dynamic analysis ofandroid malware with vizmalrdquo Journal of Information Se-curity and Applications vol 50 Article ID 102423 2020

[31] M Lin D Zhang X Su and T Yu ldquoEffective and scalablerepackaged application detection based on user interfacerdquo inProceedings of the 2017 IEEE Conference on ComputerCommunications Workshops (INFOCOM WKSHPS) IEEEAtlanta GA USA May 2017

[32] G Meng M Patrick Y Xue Y Liu and J Zhang ldquoSecuringandroid app markets via modelling and predicting malwarespread between marketsrdquo IEEE Transactions on InformationForensics and Security vol 14 no 7 pp 1944ndash1959 2018

[33] A Arora and S K Peddoju ldquoMinimizing network trafficfeatures for android mobile malware detectionrdquo in Proceed-ings of the 18th International Conference on DistributedComputing and Networking ACM Hyderabad India January2017

[34] X Wu D Zhang X Su and W Li ldquoDetect repackagedAndroid application based on HTTP traffic similarity httptraffic similarityrdquo Security and Communication Networksvol 8 no 13 pp 2257ndash2266 2015

[35] J Malik and R Kaushal ldquoCredroid android malware de-tection by network traffic analysisrdquo in Proceedings of the 1stACM Workshop on Privacy-Aware Mobile Computing ACMPaderborn Germany pp 28ndash36 July 2016

18 Security and Communication Networks

[36] A Zulkifli I R A Hamid W M Shah and Z AbdullahldquoAndroid malware detection based on network traffic usingdecision tree algorithmrdquo in Proceedings of the InternationalConference on Soft Computing and Data Mining SpringerSenai Malaysia pp 485ndash494 January 2018

[37] Z Chen Q Yan H Han et al ldquoMachine learning basedmobile malware detection using highly imbalanced networktrafficrdquo Information Sciences vol 433-434 pp 346ndash364 2018

[38] G He B Xu L Zhang and H Zhu ldquoMobile app identifi-cation for encrypted network flows by traffic correlationrdquoInternational Journal of Distributed Sensor Networks vol 14no 12 pp 1ndash17 2018

[39] L Xue Y Zhou T Chen X Luo and G Gu ldquoMalton to-wards on-device non-invasive mobile malware analysis forARTrdquo in Proceedings of the 26th USENIX Security Symposiumpp 289ndash306 Vancouver Canada August 2017

[40] M K Alzaylaee S Y Yerima and S Sezer ldquoDL-Droid deeplearning based android malware detection using real devicesrdquoComputers amp Security vol 89 Article ID 101663 2020

[41] A S Shamili C Bauckhage and T Alpcan ldquoMalware de-tection on mobile devices using distributed machine learn-ingrdquo in Proceedings of the 20th International Conference onPattern Recognition (ICPR) IEEE Istanbul Turkeypp 4348ndash4351 August 2010

[42] M Zhao T Zhang F Ge and Z Yuan ldquoRobotdroid alightweight malware detection framework on smartphonesrdquoJournal of Networks vol 7 no 4 p 715 2012

[43] K A Talha D I Alper and C Aydin ldquoAPK auditor per-mission-based android malware detection systemrdquo DigitalInvestigation vol 13 pp 1ndash14 2015

[44] S Arshad M A Shah A Wahid A Mehmood H Song andH Yu ldquoSamadroid a novel 3-level hybrid malware detectionmodel for android operating systemrdquo IEEE Access vol 6pp 4321ndash4339 2018

[45] G He L Zhang B Xu and H Zhu ldquoDetecting repackagedandroid malware based on mobile edge computingrdquo inProceedings of the 2018 Sixth International Conference onAdvanced Cloud and Big Data (CBD) IEEE Lanzhou Chinapp 360ndash365 August 2018

[46] R Roman J Lopez andMMambo ldquoMobile edge computingFog et al a survey and analysis of security threats andchallengesrdquo Future Generation Computer Systems vol 78pp 680ndash698 2018

[47] F Bonomi R Milito J Zhu and S Addepalli ldquoFog com-puting and its role in the internet of thingsrdquo in Proceedings ofthe First Edition of the MCC Workshop on Mobile CloudComputing ACM Helsinki Finland pp 13ndash16 August 2012

[48] M T Beck MWerner S Feld and S Schimper ldquoMobile edgecomputing a taxonomyrdquo in Proceedings of the Sixth Inter-national Conference on Advances in Future Internet pp 48ndash55 Lisbon Portugal November 2014

[49] P Mach and Z Becvar ldquoMobile edge computing a survey onarchitecture and computation offloadingrdquo IEEE Communi-cations Surveys amp Tutorials vol 19 no 3 pp 1628ndash1656 2017

[50] H T Dinh C Lee D Niyato and P Wang ldquoA survey ofmobile cloud computing architecture applications and ap-proachesrdquo Wireless Communications and Mobile Computingvol 13 no 18 pp 1587ndash1611 2013

[51] X Lyu H Tian L Jiang et al ldquoSelective offloading in mobileedge computing for the green internet of thingsrdquo IEEENetwork vol 32 no 1 pp 54ndash60 2018

[52] S Agrawal and J Agrawal ldquoSurvey on anomaly detectionusing data mining techniquesrdquo Procedia Computer Sciencevol 60 pp 708ndash713 2015

[53] A Tongaonkar ldquoA look at the mobile app identificationlandscaperdquo IEEE Internet Computing vol 20 no 4 pp 9ndash152016

[54] G He B Xu and H Zhu ldquoIdentifying mobile applications forencrypted network trafficrdquo in Proceedings of the 2017 FifthInternational Conference on Advanced Cloud and Big Data(CBD) IEEE Shanghai China pp 279ndash284 August 2017

[55] G Ranjan A Tongaonkar and R Torres ldquoApproximatematching of persistent lexicon using search-engines forclassifying mobile app trafficrdquo in Proceedings of the 35thAnnual IEEE International Conference on Computer Com-munications IIEEE San Francisco CA USA April 2016

[56] L Li and S Qu ldquoShort text classification based on improvedITCrdquo Journal of Computer and Communications vol 1 no 4pp 22ndash27 2013

[57] A H Lashkari A F A Kadir L Taheri and A A GhorbanildquoToward developing a systematic approach to generatebenchmark android malware datasets and classificationrdquo inProceedings of the 2018 International Carnahan Conference onSecurity Technology (ICCST) IEEE Montreal Canada Oc-tober 2018

[58] A Rodriguez and A Laio ldquoClustering by fast search and findof density peaksrdquo Science vol 344 no 6191 pp 1492ndash14962014

[59] Y Li Z Yang Y Guo and X Chen ldquoDroidbot a lightweightUI-guided test input generator for androidrdquo in Proceedings ofthe Software Engineering Companion (ICSE-C) pp 23ndash26Buenos Aires Argentina May 2017

[60] M Fan X Luo J Liu et al ldquoGraph embedding based familialanalysis of android malware using unsupervised learningrdquo inProceedings of the 2019 IEEEACM 41st International Con-ference on Software Engineering (ICSE) IEEE Piscataway NJUSA pp 771ndash782 May 2019

Security and Communication Networks 19

Page 8: On-DeviceDetectionofRepackagedAndroidMalwarevia …downloads.hindawi.com/journals/scn/2020/8630748.pdf · 2020-05-31 · According to [4–6], most Android malware programs are repackagedapps,andthemajorityofnewAndroidmalware

51 System Overview Figure 7 illustrates the main proce-dure of the proposed approach App i is the app for vettingand it has been installed on umobile devicesese u devicesare connected to e edge servers We assume that r deviceshave installed the repackaged malware (the malware is alsopresented as App i to deceive the users) and rlt (12)uHence most of the devices have installed the original ver-sion is situation is reasonable because repackaged An-droidmalware is typically distributed in third-party marketswhich are less popular than official markets [23] us thenumbers of downloads and installations are relatively smallWe will use this characteristic to classify the clusteringresults

To detect the repackaged malware the edge server labelsthe network flows that are generated by App i according tothe flow metainformation en it extracts suitable trafficfeatures (such as traffic content and statistics) from thelabeled flows and sends the features to the cloud for pro-cessing erefore there are u feature sets for App i in thecloud e cloud filters these u sets according to the appversions and removes common traffic After that it calculatesthe pairwise similarities of the feature sets Finally it clustersthe similarity values and identifies r repackaged malwareinstances

e traffic labeling feature extraction and filtering(labels①② and③ in Figure 7) the similarity calculationon the traffic contents and behaviors (labels④ and⑤) andclustering of similarity values (label⑥) are the core elementsof the proposed method ey will be described in detail inthe following sections

52 Traffic Labeling Filtering and Feature Extraction

521 Traffic Labeling In this study traffic labeling isconducted to identify the corresponding app for each net-work flow which is known as the app identification problem

[53] Extensive works have been conducted on the identi-fication of apps from mobile network traffic [38 54 55]However the achieved identification accuracies are all lowerthan 100 In our method we design an edge-client app forcollecting flow metaformation and identifying apps accu-rately e flow metainformation is defined in the following

Definition 1 (flow metainformation) For flow f the flowmetainformation includes the client-side IP address C-IPfthe flow starting time Tf the flow protocol Pf the server-sideIP address S-IPf the server-side port S-Portf the app nameAppNf and the app version AppVfe flow f is bidirectionaland Pf is the transport layer protocol such as TCP or UDP

e edge-client app operates similarly to a VPN appeapp sends the flow metainformation to the edge server at afixed time interval Ti Once it has received the flow meta-information the edge server saves it into a metainformationdatabase and can label mobile app traffic accurately Denotethe flow observed by the edge server as fu e edge serverlabels fu as follows

Step 1 extract the recording time Tuf of fu and its flow

protocol Puf client-side IP address CminusIPu

f server-sideIP address SminusIPu

f and server-side port SminusPortufStep 2 search the metainformation database with Pu

fCminusIPu

fSminusIPuf and SminusPortuf If records match the da-

tabase returns the record that has the most recent flowstarting time Tr

f Otherwise the edge server waits for awhile and repeats Step 2Step 3 if Tu

f is close to Trf the edge server labels flow fu

as ltAppNf AppVfgtOtherwise it waits for a while andrepeats Steps 2 and 3

In the above steps the client-side IP address CminusIPuf is

used to identify a mobile device uniquely which is correctwhen the edge server is the wireless gateway or the eNodeBIn such scenarios the mobile devices directly connect to theedge hence the IP address is unique for each deviceHowever if the edge server is deployed at the backbonerouter as illustrated in subgraph (c) of Figure 4 the CminusIPu

f

may be confusing because mobile devices could be locatedbehind NATs (Network Address Translations) and they willshare the same public IP address To solve this issue theedge-client app can insert special indicators into the packetsto uniquely represent a device In this study we alwaysassume that CminusIPu

f is unique and we will investigate theimplementation of the device-indicator in future work

In steps 2 and 3 the edge server utilizes a wait-and-repeat strategy to ensure that the traffic labeling is fresh andaccurate e edge server must wait because the mobiledevices send flow metainformation to the edge server at afixed time interval Ti (60 s in our experiments) and the savedflow metainformation in the database may be outdated Inour implementations the waiting time is set to Ti and if|Tu

f minus Trf|gt 24 h the returned flow metainformation is

regarded as obsolete

522 Traffic Feature Extraction After traffic labeling theedge server extracts detection features from the labeled

Table 1 Main notation used in this paper

Notation Meaningi Android app iu Number of mobile devices with app i installedr Number of devices with repackaged app i

f Network flowC-IPf Client-side IP address of fS-IPf Server-side IP address of fS-Portf Server-side port of fAppNf Name of the app that generates fAppVf Version of the app that generates fTi Set time intervalTu

f Recording time of flow fu at the edge serverdi Feature set of app iwj Plaintext word in the feature setsV(di) Numerical vector of traffic contentsF

j

diFeature vector for the encrypted flow j in di

TBntimesm(di) Traffic behaviors of diCSdik

Content similarity between di and dkBSdik

Behavior similarity between di and dkSdik

Final similarity between di and dk

8 Security and Communication Networks

traffic e extracted traffic features are categorized into twogroups traffic content features and traffic behavior featuresTraffic content features are the plaintext contents that areextracted from HTTP flows and traffic behavior features arethe statistics of network flows such as packet sizes andintervals Since Android apps may use an encrypted HTTPSprotocol to transmit data [54] traffic behavior features alsomust be considered e traffic content and traffic behaviorfeatures are described in detail in Sections 53 and 54

523 Traffic Filtering After traffic feature extraction theedge servers send the extracted features along with theclient-side IP address server-side IP addresses server-sideports app name AppNf and app version AppVf to the cloudNote that App i is installed on u devices hence the cloudeventually has u feature sets for App i Figure 8 illustrates thestructure of the feature set e cloud further filters these ufeature sets to increase the detection accuracy First it di-vides the u sets into v categories according to the app versionAppVf erefore each category contains the feature setsthat are generated by the same version of App i en foreach category the cloud removes the common traffic fromthe feature sets e common traffic is defined as network

flows that are contained in all features sets ese flows havethe same server-side IP address and the same server-sideport number e reason for this step is that the commontraffic will obfuscate the similarity calculations betweenapps Otherwise the similarity values between features setsare all high and they cannot be effectively used to distin-guish the repackaged malware from the original apps

After the filtering has been completed the similaritiesbetween feature sets are calculated and the similarity valuesare clustered to identify the repackaged malware

Client-side IP addressApp name AppN_fApp version AppV_fFlow 1 server-side IP address 1 server-side port 1 flow features

Flow f server-side IP address f server-side port f flow features

Figure 8 Illustration of the feature set e set contains the clientIP address app name and app version Each network flow that isgenerated by App i in a device is represented by the featuresextracted from the flow and by the server IP address and port

Trafficlabeling

Edge server 1

Traffic featureextraction

Similarity calculation of

trafficbehaviors

Similarity calculation of traffic contents Clustering of

similarity values

Original app

Repackaged malware

Cloud

Mobile device 1 2 (with app i installed)

Results

Traffic features

Traffic and flowmetainformation

ę

ę

Edge server e

Traffic features

Traffic and flow metainformation

Filtering

Mobile device u(with app i installed)

1

3

4

5

6

2

Figure 7emain procedure of the proposedmethod for Androidmalware detectione core parts of the proposedmethod are labeled bycircled numbers

Security and Communication Networks 9

53 Similarity Calculation of Traffic Content Android appsuse mainly HTTP and HTTPS protocols to communicatewith their remote servers [54] For HTTP traffic since it isplaintext based we can efficiently use the flow content tocalculate the similarity e basic strategy is to convertHTTP traffic to document files and the document files canbe handled by information retrieval technologies for simi-larity calculation HTTP flow content has also been used inapp identification from network traffic [55] However in thisstudy we only consider partial HTTP header information tobetter protect mobile usersrsquo privacy

In detail first we reassemble HTTP flows from packetsand extract the flow contents by saving the correspondingpacket contents into a text file as ASCII code Figure 9 showssuch an example en we preprocess the text file by de-leting the HTTP request content and the HTTP requestheader except for the GET (or POST) and HOST fields eHTTP response content is also deleted namely only theHTTP response header the GET (or POST) and the HOSTfiled in the HTTP request header are saved Via this ap-proach the mobile usersrsquo privacy can be protected from thecloud (for example the mobile phone information (HUA-WEI) is leaked by the User-Agent field in Figure 9) Afterthat the text file is spilt using special characters (eg ) and the processed file becomes the flow contentfeatures

e similarity is calculated via the improved TF-IDFalgorithm [56] and the cosine similarity e term frequencymeasures the count of words in a document We use theaugmented frequency measurement Denote the document(the feature set) as di and the word as wj Term frequencytf(wj di) is calculated as

tf wj di1113872 1113873 05 +05 times f wj di1113872 1113873

max f w di( 1113857 w isin di1113864 1113865 (1)

where f(wj di) is the number of occurrences of wj indocument di

e inverse document frequency measures of theamount of information a word provides Typically it is thelogarithmically scaled inverse fraction of the number ofdocuments that contain the word e inverse documentfrequency of wj is calculated as

idf wj1113872 1113873 logu

d wj isin d1113966 111396711138681113868111386811138681113868

11138681113868111386811138681113868+ 001⎛⎝ ⎞⎠ (2)

where u is the number of documents and | d wj isin d1113966 1113967| isnumber of documents in which the term wj appears

Denote by p(wj di) the product of TF(wj di) andI DF(wj) it is formulated as

p wj di1113872 1113873 log tf wj di1113872 11138731113872 1113873 times idf wj1113872 1113873

1113936

cj0 log

2 tf wj di1113872 11138731113872 1113873 times idf2 wj1113872 11138731113969 (3)

After the TF-IDF operation all documents are convertedinto equal-length numerical vectors We use the cosinesimilarity to measure the similarity between two documentvectors For documents di and dk the corresponding

numerical vectors are denoted as V(di) and V(dk) and thelength of vector is t e similarity between di and dk iscalculated as

CSdik

1113936tl1 V di( 1113857l times V dk( 1113857l( 1113857

1113936tl1 V di( 1113857

2l

1113969

times

1113936tl1 V dk( 1113857

2l

1113969 (4)

Suppose that in category vi (we divide the u feature setsgenerated by App i in to v categories in the traffic filteringstep) there are c feature sets For each feature set we save theflow features of all HTTP flows into a text file and c text filesare obtained en we calculate the numerical vector foreach feature set via equation (3) Last we compare thenumerical vector with each other as equation (4) and gain csimilarity values (CSdii

1) for each feature set We rep-resent these similarity values of feature set di (i 1 2 c)as a vector Si

contents and Sicontents (CSdi1

CSdi2 CSdic

)e vectors Si

contents (i 1 2 c) in combination with thesimilarity values of traffic behaviors will be clustered toidentify the abnormal data points namely the repackagedmalware

54 SimilarityCalculation of TrafficBehaviors In addition tocalculating the similarity values of traffic contents thesimilarity of traffic behaviors must be calculated due toencrypted flows We use m (m 11 in our study) statisticalfeatures as listed in Table 2 to represent the behavior of anencrypted network flow e units of the flow duration andthe packet interval time are milliseconds in our experiments

With the flow statistical features for feature set di the jthencrypted flow can be represented as a vector in the featurespace

Fj

di f

1j f

2j f

mj1113872 1113873 (5)

where flj represents the value of the lth statistical feature

1le llem Figure 10 shows an example of feature represen-tation for encrypted traffic Consequently the traffic be-haviors of di can be represented as a matrix of f

pj

Figure 9 Example of converting an HTTP flow to a text file

10 Security and Communication Networks

TBntimesm di( 1113857 F1di

F2di

Fndi

1113872 1113873T (6)

where n is the number of encrypted flows that are containedin di and nge 1

Since different feature sets may differ in terms of thenumber of encrypted flows traffic behaviors TBnitimesm(di) andTBnktimesm(dk) may differ in terms of different dimensionnamely the value of ni should not be equal to that of nk Toefficiently calculate the similarity between TBnitimesm(di) andTBnktimesm(dk) we calculate the Frobenius norm of TBnitimesm(di)

and TBnktimesm(dk) e formula is presented in equation (7)where norm(middot) normalizes the value of fl

j to [0 1]

FN(d)

1113944

n

j11113944

m

l1norm fl

j1113872 111387311138681113868111386811138681113868

111386811138681113868111386811138682

11139741113972

(7)

en the distance between TBnitimesm(di) and TBnktimesm(dk) iscalculated as

dis di dk( 1113857 FN di( 1113857 minus FN dk( 11138571113868111386811138681113868

1113868111386811138681113868 (8)

Finally the similarity between TBnitimesm(di) andTBnktimesm(dk) can be calculated by

BSdik 1 minus

dis di dk( 1113857

max dis dr dq1113872 1113873 r q 1 2 u rne q1113960 1113961 (9)

In the above process ni and nk are assumed to exceed 1If ni 0 (nk 0) and nkge 1 (nige 1) BSdik

is set to 0 to reflectthe significant difference between di and dk However ifni 0 and nk 0 namely there is no encrypted flow in bothdi and dk BSdik

is set to 1 because they exhibit the samebehaviors ie they do not produce any encrypted traffic

Similarly we let BSdii 1 and represent these similarity

values as a vector Sibehavior BSdi1

BSdi2 BSdiu

1113966 1113967

55 Clustering of Similarity Values and Malware DetectionFrom the traffic content similarity CSdik

and the traffic be-havior similarity BSdik

we can determine the final similarityvalue BSdik

between the feature sets di and dk as expressed inequation (10) In this equation q is the weight value In ourexperiments we set q 05 namely the traffic contentsimilarity and the traffic behavior similarity are assigned thesame weight is is because apps typically produce almostthe same numbers of HTTP and HTTPS flows [57] We alsoevaluate various weighted values in Section 62 and theexperimental results support that 05 is the most suitablevalue For feature set di a vector Sdi

(Sdi1 Sdi2

Sdiu) is

finally obtained which contains the similarity values be-tween di and the other feature setse set of vector Sdi

1113966 1113967 i

1 2 u will be clustered for the detection of repackagedmalware

Sdik q times CSdik

+(1 minus q) times BSdik (10)

To be specific we use the density peak clustering method[58] to realize our objective Density peak clustering is basedon the following assumption cluster centers are charac-terized by a higher density than their neighbors and by arelatively large distance from points that have higher den-sities Hence density peak clustering can automaticallyidentify the correct number of clusters is is importantbecause we do not know whether there is repackagedmalware or how many types of repackaged malware thereare In other words we cannot determine the number ofclusters in advance Density peak clustering can help usovercome this challenge

We assume that most mobile devices have installed theoriginal apps as discussed in Section 51 erefore afterclustering the cluster that has the largest number of voxelswill be recognized as original and the remaining clusters willbe regarded as suspicious However suspicious apps are notnecessarily repackaged malware For example apps can berepackaged to show ads only Repackaged apps of this typeshould not be simply judged as malware Additionallyoriginal apps can be suspicious due to different user be-haviors To further reduce the frequency of the false alarmswe can check the server hostnames that are contained in therepackaged clusters using VirusTotal (the URLs are includedin the HTTP traffic contents) If all of the serversrsquo hostnamesfor a suspicious cluster Cr are judged as clean it will berecognized as a cluster of abnormal but harmless appsOtherwise it will be identified as repackaged malware Viathis approach we can accurately detect repackaged Androidmalware In this step we only use VirusTotal as a white list toexclude false alarms We do not use it as a backlist (detectionsignatures) to detect repackaged malware Malware detec-tion is still realized by clustering the similarity values

Table 2 Flow statistical featuresBytes sent by the app and bytes sent by the serverNumbers of outgoing and incoming packetsMean and std dev of the outgoing and incoming packet sizesrsquoflow durationMean and std dev of the interpacket time

lt2761 16696 16 12 173 1294 289 1177 3392 121 190gt

Figure 10 Example of the conversion of an encrypted flow to afeature vector

Security and Communication Networks 11

6 Evaluation

61 System Implementation and Dataset Figure 11illustrates our experimental setup e Android phoneswere connected to the edge server via WiFi and each phonehad installed the edge-client app which sends the flowmetainformation to the edge server using Socket (jav-anetSocket) e edge is a Dell PowerEdge R730 server(with 32 processors 16GB RAM and 12 TB hard drives)that is configured as an access point In the edge server weused tcpdump (httpwwwtcpdumporg) to collect networktraffic e similarity calculations of traffic contents andbehaviors and the clustering of similarity values wereimplemented on the Windows Azure cloud computingplatforme URL checking was conducted using Virustotal(httpswwwvirustotalcom)

In the implementation we used aria2c (httpsaria2githubio) to download apps from Androzoo (httpsandrozoounilu) aria2c was encapsulated as commandsin Python code for parallel downloading After downloadingthe apps we automatically installed them on phones usingthe adb shell command Each phone had 40 tested appsinstalled We have implemented the edge-client app basedon Android VPN service In our implementations the edge-client app correlated network flows and apps by resolvingthe ports and PIDs (Process IDs) send the newly collectedflow metainformation to the edge server every 60 secondse edge server used tcpdump to collect network trafficwhich was saved as pcap files To extract traffic features weused Splitcap (httpswwwnetreseccompageSplitCap)to restore tcp flows from pcap files en for HTTP trafficwe extracted the contents using tshark (httpswwwwiresharkorgdocsman-pagestsharkhtml) commands -zfollow tcp and ascii e packet sizes and packet intervaltimes were also obtained by tshark with commands -eframelen ndashe frametime_relative In the cloud CSdik

and BSdik

were calculated using scikit-learn (httpscikit-learnorg)and Numpy (httpsplotlynumpynorm) e code ofdensity peak clustering is available on GitHub (httpsgithubcomjasonwbwDensityPeakCluster) and we havefixed various bugs that were present in the original code

We have downloaded 200 pairs of original andrepackaged apps from Androzoo for testing However notall of these downloaded repackaged apps are maliciousSome apps are only repackaged to show extra ads To screenout Android malware from the downloaded repackagedapps we further checked the 200 repackaged apps usingVirustotal If an app was judged as safe by all the antivirustools it was labelled as repackaged_safe in our experimentsOtherwise it was labelled as repackaged_malware In total143 repackaged apps were detected as malware by VirustotalNevertheless in our experiments we evaluated the proposedclusteringmethod on all 200 repackaged apps to obtainmorerealistic performance results

e downloaded apps were run in 10 rooted Androidphones in parallel to accelerate the experimental process Forall apps we used Droidbot [59] to send input events to themand to generate network traffic automatically We ran eachrepackaged app (repackaged_safe and repackaged_malware

apps) 10 times and the corresponding original app 30 timesto generate sufficient network traffic e running times ofthe original apps exceeded those of the repackaged appsbecause we assume that most of the apps are original asdiscussed in Section 51 Consequently we obtained 40traffic sets for each repackaged-original app pair Note thatfrom the perspectives of the mobile devices and the edgeserver there are only 200 distinct apps because therepackaged app has the same name as the correspondingoriginal app erefore our experiments have simulated40lowast 200 8000 mobile devices For each repackaged-orig-inal app pair there were 30 devices with the original app and10 devices with the repackaged version

In total we have gathered almost 89GB network traffice datasets that are used in our experiments are summa-rized in Table 3 In the table the traffic dataset sizes are thesizes of the captured network traffic and the feature datasetsizes are the sizes of the extracted features In our experi-ments the HTTP traffic contents and the flow statisticalfeatures which are listed in Table 2 were both saved as textfiles As shown in Table 3 the feature datasets are signifi-cantly smaller than the traffic datasets (25G vs 63G and 9G vs26G)us the edge computing can reduce the data size andincrease the data acquisition speed for cloud computing[24] Since the feature dataset sizes are still large (25G+ 9G)we did not send all these feature data to the cloud simul-taneously Instead we sent the feature data one by oneWhen the feature data of an app were successfully receivedby the cloud its traffic similarities were calculated andclustered immediately After that the feature data of anotherapp would be sent and analyzed e experimental resultswill be presented in the following

62 ExperimentalResults In this section wemainly evaluatethe detection efficiency of Android malware by clusteringsimilarity values e performance evaluation of the pro-posed method in terms of power memory and CPU con-sumptions is elaborated in the following Section 63

Similar to previous studies [8 14] to measure theperformance of the proposed method Accuracy and F-measure metrics are utilized which are defined in equations(11) and (12) In equation (11) TP FN FP and TN aredefined in Table 4 In Table 4 repackaged_x representsrepackaged_safe or repackaged_malware e precision andrecall in equation (12) are calculated via equations (13) and(14)

Accuracy TP + TN

TP + FN + FP + TN (11)

F minus measure 2 times precision times recallprecision + recall

(12)

precision TP

TP + FP (13)

recall TP

TP + FN (14)

12 Security and Communication Networks

We have evaluated the Accuracy and F-measure for eachrepackaged app Recall that we ran each repackaged app 10times and the corresponding original app 30 timeserefore for each app the number of the original apptraffic sets is 30 and the number of the repackaged app trafficsets is 10 namely TP + FN 10 and TN+FP 30 for eachapp We have randomly chosen 20 apps from repack-age_malware and the experimental results for these apps arepresented in Figures 12 and 13 Detailed descriptions of thechosen 20 apps are listed in Table 5 In the table the mainfunction of each app is determined by analyzing the packagename from the AndroidManifestxml and from the Googlesearch results of the package name As depicted in Figure 12100 accuracy was realized on a total of 6 apps (app3 app5app9 app10 app13 and app18) e minimum accuracy is85 (app2 3440) and the average accuracy is 952 Fig-ure 13 presents the F-measure values In our experimentsthe maximum F-measure value is 1 and the minimum valueis 0737 e average F-measure value is 0888 ese resultsdemonstrate the satisfactory performance of our proposedclustering method

We have also calculated the average accuracy and F-measure values for all repackaged_safe and repack-aged_malware apps In these calculationsTP + FN 57lowast10 570 and TN+FP 57lowast 301710 for

repackaged_safe apps For repackaged_malware appsTP + FN 143lowast101430 and TN+FP 143lowast 30 4290is is because in our experiments 143 apps were identifiedas repackaged malware and 57 apps as repackaged safe byVirustotal us the parameter values differ e experi-mental results are listed in Table 6

According to Table 6 the average accuracy values forrepackaged_safe and repackaged_malware apps are bothhigh However the average F-measure for repackaged_safeapps is only 078 which is lower than that of repackagedmalware is indicates that repackaged safe apps are

Router Windows Azure

Edge server DellPowerEdge R730

10 android phones

Repackagedoriginal apps

Androzoo

Figure 11 Experimental setup

Table 3 Datasets used in the experiments

of apps of runs of HTTP flows of encrypted flows Traffic dataset sizes Feature dataset sizesOriginal apps 200 30 340015 183452 asymp63G bytes asymp25G bytesRepackaged apps 200 10 110670 93041 asymp26G bytes asymp9G bytes

Table 4 Confusion matrix

ObservedDetected

Repackaged_x app Original appRepackaged_x app TP ( of TPs) FN ( of FNs)Original app FP ( of FPs) TN ( of TNs)

Apps

0

01

02

03

04

05

06

07

08

09

1

Acc

urac

y

App

1A

pp2

App

3A

pp4

App

5A

pp6

App

7A

pp8

App

9A

pp10

App

11A

pp12

App

13A

pp14

App

15A

pp16

App

18A

pp17

App

19A

pp20

Figure 12 Detection accuracies for the 20 selected apps

Security and Communication Networks 13

difficult to distinguish from their original apps is isbecause repackaged_safe apps only generate smallamounts of additional network flows and their networktraffic features are not clearly distinguishable from thoseof normal traffic Fortunately the average F-measure ofrepackaged malware apps is outstanding is may be dueto the fact that malware typically send stolen data to orreceive commands from remote servers Hence theirnetwork traffic features are special and distinguishableFigure 14 presents an example of app clustering and il-lustrates the distinguishability of the repackaged malwareand the original app ese results demonstrate thatrepackaged Android malware can be effectively detectedby comparing their network traffic with that of originalapps

For repackaged malware apps the accuracy results withdifferent weight values (q in equation (10)) are presented inFigure 15 As depicted in the figure when q 05 we realizethe highest accuracy rate 969 If q 0 namely only thefeatures of encrypted traffic are used the accuracy result is753 e result is 617 when q 1 namely only thefeatures of plaintext traffic are considered ese resultsdemonstrate that both the encrypted and plaintext trafficcontribute to the differentiation of the original apps and therepackaged malware However the encrypted traffic con-tributes more than the plaintext traffic is may becausemalicious apps usually connect to their servers by securityprotocols to evade IDS detection

In the previous experiments we checked the serversrsquohostnames using VirusTotal to reduce the frequency of thefalse positives For various clusters that contained less than20 elements if all the serversrsquo hostnames were judged asclean the cluster was identified as an original app rather thana repackaged one As argued in Section 55 this process onlyreduces the frequency of false alarms To support this ar-gument we list the numbers of true positives and falsepositives of the 143 repackaged malware in Table 7 forcomparison When conducting the hostname checking thenumber of true positives is 1400 and the result is the samewhen the hostname checking is turned off is findingsupports that the malware is detected by clustering thesimilarity values However the number of false positivesincreased from 47 to 213 when the hostname checking wasturned off erefore we suggest checking the serversrsquohostnames after clustering to increase the practicality of themethod Note that false positives still occur when we checkthe hostnames by VirusTotal Hence there may be malwarein the original apps In our dataset the typical examples arecommobovideoplayerpro and cncfshop ele1taoxiaosanWe further analyze these two apps by reading theirdecompiled codes and we determine that they indeed es-tablish connections with suspicious servers In detail invarious runs app commobovideoplayerpro connects withhttpwwwtopappsquarecomwhich is judged as maliciousby CyRadar in VirusTotal App cncfshop ele1taoxiaosanvisits suspicious URL http955ccjSn9 Since these apps arethe original apps they are identified as false positives in theexperiments

In our experiments we chose density peak clustering asthe clustering method because density peak clustering canautomatically determine the correct number of clusters [58]Hence our method can detect various versions of malwarethat are repackaged from the same original app Unfortu-nately in the apps that were downloaded from AndroZoono malware was repackaged from the same original app Toevaluate the efficiency of detecting versions of malware wehave created two malware instances from a popular game

Apps

App

1A

pp2

App

3A

pp4

App

5A

pp6

App

7A

pp8

App

9A

pp10

App

11A

pp12

App

13A

pp14

App

15A

pp16

App

18A

pp17

App

19A

pp20

0

01

02

03

04

05

06

07

08

09

1

F-m

easu

re

Figure 13 Detection F-measure for the 20 selected apps

Table 5 Detailed descriptions of the chosen 20 apps

Main function ofapp

of HTTPflows

of encryptedflows

App1 VideoPlayer 453 51App2 Game 657 1455App3 Browser 1531 1478App4 MusicPlayer 341 25App5 Game 612 289App6 Game 788 123App7 Game 1336 524App8 Fitness 357 892App9 Email 122 431App10 Game 963 790App11 MusicPlayer 1128 547App12 Game 1598 426App13 News 2460 1178App14 News 2891 1543App15 Chat 171 49App16 Email 143 368App17 WeatherForecast 397 120App18 Game 1832 921App19 Fitness 732 362App20 OnlineLearning 2378 834

Table 6 Average accuracy and F-measure

Average accuracy () AverageF-measure

Repackaged_safe 903 078Repackaged_malware 969 094

14 Security and Communication Networks

app namely aircomaceviralmotox3 One is produced bygrafting AnserverBotrsquos source code into air-comaceviralmotox3 and the other is created by adding codefor stealing usersrsquo contacts and short messages and sendingthem to a remote server Similarly each malware is run 10times and the original app aircomaceviralmotox3 is alsorun 10 times for validatione clustering result is presentedin Figure 16 Apparently there are 3 clusters e

experimental results show that the value of F-measure foreach malware is 1 Hence our method can detect variousmalware that were repackaged from the same original app

63 Comparison In this section we compare our methodwith the typical methods that are based on network or cloudas illustrated in Figure 5 e comparative indicators are thedetection accuracy F-measure average power CPU andmemory consumptions of the mobile device First wecompare the proposed method with AppFA [14] which isimplemented completely at the network level Since AppFAalso must analyze app traffic we directly use the traffic sets ofthe repackaged malware apps for comparison While testingAppFA we deleted the label information that was containedin the traffic sets and mixed the network flows as a wholeen we used the signature matching and the constrainedK-means clustering algorithm proposed in [14] to restoreapp sessions namely the label for each flow e similarapps used in AppFA were chosen in Google Play via themethod introduced in [14] and for each repackaged mal-ware 10 similar apps were selected as the peer group epeer group apps were run once by Droidbot to generatenetwork traffic e comparison results are listed in Table 8

Since AppFA need not to install some special apps onmobile devices the extra power memory and CPU con-sumptions are 0 However the detection accuracy is muchlower (863 vs 969) than that of the proposedmethod ascompared in Table 8 In the proposed method we use theedge-client app to record flow metainformation thusnetwork flows can be accurately labeled However inAppFA flow labeling is realized via traffic clustering and theclustering accuracy cannot be 100 Meanwhile AppFAcompares the appsrsquo network behaviors with those of theirsimilar apps which also results in errors In the proposedmethod we directly compare the repackaged malware withits original app and the detection accuracy is increased eaverage power CPU and memory consumptions of theproposed method are also low and acceptable

In Table 8 the power memory and CPU consumptionsfor our edge-client app were obtained with the help of appGT (httpsgithubcomTencentGT) GT is a portable toolthat runs on smartphones for debugging the APP internalparameters and the code time-consuming statistics eaverage values for the power memory and CPU con-sumptions are calculated as follows Before running an appwith Droidbot we started the edge-client app by GT Whenthe Droidbot was stopped we recoded the power CPU andmemory consumptions We repeated the process until allapps (including 143 repackaged malware apps and theiroriginal versions) had been tested Via this way we obtainedthe average power CPU and memory consumptions for theedge-client app According to Table 8 the edge-client app islightweight and has minimal impact on the performances ofthe mobile devices

en we compare the proposed method with SAMA-Droid [44] SAMADroid monitors Android system calls andsends the calls to a centralized server for the detection ofAndroid malware SAMADroid is a typical cloud-based on-

Decision graph rho-delta-rho

20 251510 35 403050Sorted rho

00

02

04

06

08

Rho lowast

delt

a

Figure 14 Example of app clustering Obviously two clusters areidentified e dot in the top left corner is an outlier e outlier isdetected as an abnormal but harmless app by our method isfigure was generated by the density peak clustering tool

01 02 03 04 05 06 07 08 09 10Value of q

0

01

02

03

04

05

06

07

08

09

1

Acc

urac

y

Figure 15 Accuracies of various weight values for traffic contentsimilarity and traffic behavior similarity

Table 7 True positives and false positives with and withouthostname checking

of true positives of false positivesWith hostname checking 1400 47Without hostnamechecking 1400 213

Security and Communication Networks 15

device detection method We have implemented the dy-namic detection according to the description in [44] Indetail the APIs that are invoked by the running apps aremonitored by the tool Sensitive API Monitor (httpsgithubcomcyruliuSensitive-API-Monitor) As done in [44] thefrequency values of 10 system calls (open ioctl brk readwrite close sendto sendmsg recvfrom and recvmsg) arerecorded ese recorded frequency values are sent toWindows Azure for malware detection We evaluate thedynamic detection performance of SAMADroid on our 143repackaged_malware apps and their original versions In theevaluation the frequency values of the original apps are usedto create the normality model and the frequency values ofthe repackaged malware apps are used for testing We alsouse the app GT to record the power memory and CPUconsumptions of SAMADroid Note that our phones wereall rooted and Sensitive_API_Monitor and GT can be usede experimental results are presented in Table 9

Last we compare our method with Shabtairsquos method [8]Shabtairsquos method detects Android malware at the mobiledevices and it considers the apprsquos network traffic patternsonly the HTTP traffic contents are not considered isgives us an opportunity to compare our method withmethods that have been specially designed for encryptedtraffic We have implemented Shabtairsquos method and chosenthe feature subset 2 as the detection features Feature subset2 includes Avg Sent Bytes Avg Rcvd Bytes and Pct ofAvg Rcvd Bytes among other values and it is proved to be

the best feature subset for malware detection [8] e De-cisionRegression tree is chosen as the classification algo-rithm e testing dataset is the traffic that was generated bythe 143 repackaged malware instances and the training set isthe traffic that was generated by the corresponding originalapps e classification results are listed in Table 10According to the table the detection accuracy and F-mea-sure values of our method are significantly higher than thoseof Shabtairsquos method is indicates that the HTTP trafficprovides useful information for Android malware detectionSince Shabtairsquos method is conducted at the mobile devices itconsumes more resources than our method as compared inTable 10

7 Discussion

e results of our extensive experiments demonstrate thatrepackaged Android malware can be efficiently and effec-tively detected by clustering network traffic at the edgedevices However there are still limitations for our methodFirst we assume the most of the devices have installed theoriginal apps (rlt (12)u as described in Section 51) is isreasonable in most cases since mobile users typicallydownload apps from the official markets However in areaswhere official markets such as the Google play and Amazonare banned it is possible that the most mobile devices haveinstalled the repackaged apps is may cause false positivesSecond if the malware does not generate sufficient trafficour method cannot detect it For example some malwareonly sends fraudulent premium SMS messages is type ofmalware cannot be detected by monitoring its networktraffic ird we rely on third-party tools such as Virustotalfor the identification of malicious URLs and to reduce thenumber of false alarms If a malicious app rents legalplatforms such as the public cloud as the remote controlserver our method will judge it as a normal app Hence falsenegatives will be generated

To overcome these limitations we could combine theproposed method with available methods such as SAMA-Droid In such cases the edge-client app will not only recordflow metainformation but also monitor app system be-haviors However the system behavior monitoring does notneed to be conducted all the time If abnormal but harmlessor lower traffic apps are detected the edge-client app will beinformed to record the corresponding appsrsquo system be-haviors Similarly the edge-client app can send the appsystem behaviors to the edge server for further processinge performance of this combined method will be evaluatedin our future work One can also combine the proposedframework with AppFA [14] for the detection of other typesof Android malware For example the edge extracts apptraffic behavior features and the cloud can compare appsrsquobehaviors with their peer groups to identify abnormalitiesFurthermore one could first use clustering techniques [60]to group repackaged apps according to the same maliciousmodules and then analyze the behaviors of the representativerepackaged apps in each group to obtain more accuratebehavior features

y

2D nonclassical multidimensional scaling

00ndash02 02ndash04x

ndash04

ndash03

ndash02

ndash01

00

01

02

03

04

Figure 16 Clustering results of various malware that wererepackaged from aircomaceviralmotox3e outlier in the middleis detected as an abnormal but harmless app is figure wasgenerated by the density peak clustering tool

Table 8 Comparison of the proposed method with AppFA

Proposed method AppFAAccuracy 969 863F-measure 094 075Average power consumption 12 0Average CPU consumption 06 0Average memory consumption 654MB 0

16 Security and Communication Networks

e paradigms of edge computing are used to detectAndroid malware in this study However several challengeswould be encountered when deploying it in practice Firstthe mobile devices must install a specified app for thecollection of essential information which may cause con-cerns and mobile users may refuse to cooperate Second inedge computing the data should be preprocessed by the edgeto better protect the usersrsquo privacy However the edge servercan observe all the data (including the sensitive informa-tion) and it threatens the privacy of users For example ifthe edge is deployed by the cloud provider the usersrsquo privacyinformation is also available to the cloud ere is no dif-ference with the user-cloud model for privacy protectionunless the edge server is deployed by the userird the edgeserver should be well protected from network attacksOtherwise it will be convenient for hackers to steal usersrsquoprivate information

8 Conclusion

In this paper we propose a novel method for detectingAndroid malware based on edge computing To the best ofour knowledge the detection of Android malware by uti-lizing edge computing and traffic clustering has not beenconsidered previously By introducing the edge layer themain tasks of the mobile device are offloaded to the edgeserver and the mobile device only needs to record flowmetainformation which substantially conserves the re-sources of the mobile device e edge server extracts appsrsquonetwork traffic features and sends these features to the cloudplatform for further analysis us the usersrsquo privacy can bebetter protected With the received features the cloud cal-culates the similarities between apps and clusters thesesimilarity values to separate the original apps and themalware automatically We evaluated our methods on 400Android apps and the experimental results demonstratethat the detection accuracy can reach 100 for several appsand the average accuracy is 969 We also tested theperformance of the proposed method and compared it with

the typical approaches e comparison results show thatour method can detect repackaged Android malware withboth higher detection accuracy and lower power and re-source consumptions In the future we will continue to workto overcome the discussed limitations and to evaluate theproposed method in real world scenarios

Data Availability

e downloaded apps used to support the findings of thisstudy have been deposited in httpspanbaiducoms1b8vMceIdtdd1gBRC5Gbzvw Its extraction code is vd5re clustering code is also in the repository Since appnetwork traffic contains usersrsquo sensitive information itcannot be publicly shared However we have explained allthe steps to generate network traffic automatically esesteps are also described in the instructionspdf in therepository

Conflicts of Interest

e authors declare that they have no conflicts of interest

Acknowledgments

is work was supported by National Key TechnologiesResearch and Development Program of China under Grant2018YFD0401404 National Natural Science Foundation ofChina under Grants 61702282 61802192 and 71801123Natural Science Foundation of the Jiangsu Higher EducationInstitutions of China under Grants 18KJB520024 and17KJB520023 Nanjing Forestry University under GrantsGXL016 and CX2016026 and NUPTSF under GrantNY217143

References

[1] Statista Mobile Operating Systemsrsquo Market Share Worldwidefrom January 2012 to December 2019 Statista HamburgGermany 2009 httpswwwstatistacomstatistics272698global-market-share-held-by-mobile-operating-systems-since-2009

[2] W Martin F Sarro Y Jia Y Zhang and M Harman ldquoAsurvey of app store analysis for software engineeringrdquo IEEETransactions on Software Engineering vol 43 no 9pp 817ndash847 2017

[3] httpswwwappbraincomstatsnumber-of-android-apps2020

[4] L Li D Li T F Bissyande et al ldquoUnderstanding android apppiggybacking a systematic study of malicious code graftingrdquoIEEE Transactions on Information Forensics and Securityvol 12 no 6 pp 1269ndash1284 2017

[5] M Fan J Liu X Luo et al ldquoAndroid malware familialclassification and representative sample selection via frequentsubgraph analysisrdquo IEEE Transactions on Information Fo-rensics and Security vol 13 no 8 pp 1890ndash1905 2018

[6] J Zhang Z Qin K Zhang H Yin and J Zou ldquoDalvik opcodegraph based android malware variants detection using globaltopology featuresrdquo IEEE Access vol 6 pp 51964ndash51974 2018

[7] S Garg S K Peddoju and A K Sarje ldquoNetwork-baseddetection of android malicious appsrdquo International Journal ofInformation Security vol 16 no 4 pp 385ndash400 2016

Table 9 Comparison of the proposed method with SAMADroid

Proposed method SAMADroidAccuracy 969 831F-measure 0940 0793Average power consumption 12 23Average CPU consumption 06 08Average memory consumption 654MB 773MB

Table 10 Comparison of the proposed method with Shabtairsquosmethod

Proposed method Shabtairsquosmethod

Accuracy 969 771F-measure 0940 0695Average power consumption 12 77Average CPU consumption 06 23Average memory consumption 654MB 937MB

Security and Communication Networks 17

[8] A Shabtai L Tenenboim-Chekina D Mimran L RokachB Shapira and Y Elovici ldquoMobile malware detection throughanalysis of deviations in application network behaviorrdquoComputers amp Security vol 43 pp 1ndash18 2014

[9] L Xie X Zhang J-P Seifert and S Zhu ldquopBMDS a be-havior-based malware detection system for cellphone de-vicesrdquo in Proceedings of the ird ACM Conference onWireless Network Security ACM New York NY USApp 37ndash48 2010

[10] I Burguera U Zurutuza and S Nadjm-Tehrani ldquoCrowdroidbehavior-based malware detection system for androidrdquo inProceedings of the 1st ACMWorkshop on Security and Privacyin Smartphones and Mobile Devices ACM Chicago IL USApp 15ndash26 October 2011

[11] G Dini F Martinelli A Saracino and D SgandurraldquoMADAM a multi-level anomaly detector for android mal-warerdquo in Proceedings of the International Conference onMathematical Methods Models and Architectures for Com-puter Network Security vol 12 Springer St PetersburgRussia pp 240ndash253 2012

[12] Y Zhang M Yang B Xu et al ldquoVetting undesirable be-haviors in android apps with permission use analysisrdquo inProceedings of the 2013 ACM SIGSAC Conference on Com-puter amp Communications Security ACM Berlin Germanypp 611ndash622 November 2013

[13] SWang Z Chen X Li LWang K Ji and C Zhao ldquoAndroidmalware clustering analysis on network-level behaviorrdquo inProceedings of the International Conference on IntelligentComputing Springer Liverpool UK pp 796ndash807 August2017

[14] G He B Xu and H Zhu ldquoAppFA a novel approach to detectmalicious android applications on the networkrdquo in Securityand Communication Networks Wiley Hoboken NJ USA2018

[15] X Wang Y Yang and Y Zeng ldquoAccurate mobile malwaredetection and classification in the cloudrdquo SpringerPlus vol 4no 1 p 583 2015

[16] K Tam A Feizollah N B Anuar R Salleh and L Cavallaroldquoe evolution of android malware and android analysistechniquesrdquo ACM Computing Surveys (CSUR) vol 49 no 4p 76 2017

[17] M Sun X Li J C S Lui R T B Ma and Z Liang ldquoMonet auser-oriented behavior-based malware variants detectionsystem for androidrdquo IEEE Transactions on Information Fo-rensics and Security vol 12 no 5 pp 1103ndash1112 2017

[18] T Vidas and N Christin ldquoEvading android runtime analysisvia sandbox detectionrdquo in Proceedings of the 9th ACMSymposium on Information Computer and CommunicationsSecurity pp 447ndash458 Kyoto Japan June 2014

[19] Y Duan M Zhang A V Bhaskar et al ldquoings you may notknow about android (Un) packers a systematic study basedon whole-system emulationrdquo in Proceedings of the 2018Network and Distributed System Security Symposium pp 1ndash15 San Diego CA USA February 2018

[20] L Xue X Luo L Yu S Wang and D Wu ldquoAdaptiveunpacking of android appsrdquo in Proceedings of the 2017 IEEEACM 39th International Conference on Software Engineering(ICSE) IEEE Buenos Aires Argentina pp 358ndash369 May2017

[21] Y Zhou and X Jiang ldquoDissecting android malware char-acterization and evolutionrdquo in Proceedings of the 2012 IEEESymposium on Security and Privacy IEEE San Francisco CAUSA pp 95ndash109 2012

[22] N W Lo S K Lu and Y H Chuang ldquoA framework for thirdparty android marketplaces to identify repackaged appsrdquo inProceedings of the 2016 IEEE 14th International Conference onDependable Autonomic and Secure Computing 14th Inter-national Conference on Pervasive Intelligence and Computing2nd International Conference on Big Data Intelligence andComputing and Cyber Science and Technology Congresspp 475ndash482 Auckland New Zealand August 2016

[23] L Li J Gao M Hurier et al ldquoAndrozoo++ collectingmillions of android apps and their metadata for the researchcommunityrdquo 2017 httpsarxivorgabs170905281

[24] W Shi J Cao Q Zhang Y Li and L Xu ldquoEdge computingvision and challengesrdquo IEEE Internet of ings Journal vol 3no 5 pp 637ndash646 2016

[25] W Zhou Y Zhou X Jiang and P Ning ldquoDetectingrepackaged smart-phone applications in third-party androidmarketplacesrdquo in Proceedings of the Second ACM Conferenceon Data and Application Security and Privacy ACM AntonioTX USA pp 317ndash326 February 2012

[26] Y Shao X Luo C Qian P Zhu and L Zhang ldquoTowards ascalable resource-driven approach for detecting repackagedandroid applicationsrdquo in Proceedings of the 30th AnnualComputer Security Applications Conference ACM NewOrleans LA USA pp 56ndash65 December 2014

[27] K Chen P Wang Y Lee et al ldquoFinding unknown malice in10 seconds mass vetting for new threats at the google-playscalerdquo in Proceedings of the USENIX Security Symposiumvol 15 Washington DC USA August 2015

[28] Z Wang C Li Y Guan and Y Xue ldquoDroidchain a novelmalware detection method for android based on behaviorchainrdquo in Proceedings of the 2015 IEEE Conference onCommunications and Network Security (CNS) IEEE Flor-ence Italy pp 727-728 September 2015

[29] K Tian D D Yao B G Ryder G Tan and G PengldquoDetection of repackaged android malware with code-het-erogeneity featuresrdquo IEEE Transactions on Dependable andSecure Computing vol 17 no 1 pp 64ndash77 2017

[30] A De Lorenzo F Martinelli E Medvet F Mercaldo andA Santone ldquoVisualizing the outcome of dynamic analysis ofandroid malware with vizmalrdquo Journal of Information Se-curity and Applications vol 50 Article ID 102423 2020

[31] M Lin D Zhang X Su and T Yu ldquoEffective and scalablerepackaged application detection based on user interfacerdquo inProceedings of the 2017 IEEE Conference on ComputerCommunications Workshops (INFOCOM WKSHPS) IEEEAtlanta GA USA May 2017

[32] G Meng M Patrick Y Xue Y Liu and J Zhang ldquoSecuringandroid app markets via modelling and predicting malwarespread between marketsrdquo IEEE Transactions on InformationForensics and Security vol 14 no 7 pp 1944ndash1959 2018

[33] A Arora and S K Peddoju ldquoMinimizing network trafficfeatures for android mobile malware detectionrdquo in Proceed-ings of the 18th International Conference on DistributedComputing and Networking ACM Hyderabad India January2017

[34] X Wu D Zhang X Su and W Li ldquoDetect repackagedAndroid application based on HTTP traffic similarity httptraffic similarityrdquo Security and Communication Networksvol 8 no 13 pp 2257ndash2266 2015

[35] J Malik and R Kaushal ldquoCredroid android malware de-tection by network traffic analysisrdquo in Proceedings of the 1stACM Workshop on Privacy-Aware Mobile Computing ACMPaderborn Germany pp 28ndash36 July 2016

18 Security and Communication Networks

[36] A Zulkifli I R A Hamid W M Shah and Z AbdullahldquoAndroid malware detection based on network traffic usingdecision tree algorithmrdquo in Proceedings of the InternationalConference on Soft Computing and Data Mining SpringerSenai Malaysia pp 485ndash494 January 2018

[37] Z Chen Q Yan H Han et al ldquoMachine learning basedmobile malware detection using highly imbalanced networktrafficrdquo Information Sciences vol 433-434 pp 346ndash364 2018

[38] G He B Xu L Zhang and H Zhu ldquoMobile app identifi-cation for encrypted network flows by traffic correlationrdquoInternational Journal of Distributed Sensor Networks vol 14no 12 pp 1ndash17 2018

[39] L Xue Y Zhou T Chen X Luo and G Gu ldquoMalton to-wards on-device non-invasive mobile malware analysis forARTrdquo in Proceedings of the 26th USENIX Security Symposiumpp 289ndash306 Vancouver Canada August 2017

[40] M K Alzaylaee S Y Yerima and S Sezer ldquoDL-Droid deeplearning based android malware detection using real devicesrdquoComputers amp Security vol 89 Article ID 101663 2020

[41] A S Shamili C Bauckhage and T Alpcan ldquoMalware de-tection on mobile devices using distributed machine learn-ingrdquo in Proceedings of the 20th International Conference onPattern Recognition (ICPR) IEEE Istanbul Turkeypp 4348ndash4351 August 2010

[42] M Zhao T Zhang F Ge and Z Yuan ldquoRobotdroid alightweight malware detection framework on smartphonesrdquoJournal of Networks vol 7 no 4 p 715 2012

[43] K A Talha D I Alper and C Aydin ldquoAPK auditor per-mission-based android malware detection systemrdquo DigitalInvestigation vol 13 pp 1ndash14 2015

[44] S Arshad M A Shah A Wahid A Mehmood H Song andH Yu ldquoSamadroid a novel 3-level hybrid malware detectionmodel for android operating systemrdquo IEEE Access vol 6pp 4321ndash4339 2018

[45] G He L Zhang B Xu and H Zhu ldquoDetecting repackagedandroid malware based on mobile edge computingrdquo inProceedings of the 2018 Sixth International Conference onAdvanced Cloud and Big Data (CBD) IEEE Lanzhou Chinapp 360ndash365 August 2018

[46] R Roman J Lopez andMMambo ldquoMobile edge computingFog et al a survey and analysis of security threats andchallengesrdquo Future Generation Computer Systems vol 78pp 680ndash698 2018

[47] F Bonomi R Milito J Zhu and S Addepalli ldquoFog com-puting and its role in the internet of thingsrdquo in Proceedings ofthe First Edition of the MCC Workshop on Mobile CloudComputing ACM Helsinki Finland pp 13ndash16 August 2012

[48] M T Beck MWerner S Feld and S Schimper ldquoMobile edgecomputing a taxonomyrdquo in Proceedings of the Sixth Inter-national Conference on Advances in Future Internet pp 48ndash55 Lisbon Portugal November 2014

[49] P Mach and Z Becvar ldquoMobile edge computing a survey onarchitecture and computation offloadingrdquo IEEE Communi-cations Surveys amp Tutorials vol 19 no 3 pp 1628ndash1656 2017

[50] H T Dinh C Lee D Niyato and P Wang ldquoA survey ofmobile cloud computing architecture applications and ap-proachesrdquo Wireless Communications and Mobile Computingvol 13 no 18 pp 1587ndash1611 2013

[51] X Lyu H Tian L Jiang et al ldquoSelective offloading in mobileedge computing for the green internet of thingsrdquo IEEENetwork vol 32 no 1 pp 54ndash60 2018

[52] S Agrawal and J Agrawal ldquoSurvey on anomaly detectionusing data mining techniquesrdquo Procedia Computer Sciencevol 60 pp 708ndash713 2015

[53] A Tongaonkar ldquoA look at the mobile app identificationlandscaperdquo IEEE Internet Computing vol 20 no 4 pp 9ndash152016

[54] G He B Xu and H Zhu ldquoIdentifying mobile applications forencrypted network trafficrdquo in Proceedings of the 2017 FifthInternational Conference on Advanced Cloud and Big Data(CBD) IEEE Shanghai China pp 279ndash284 August 2017

[55] G Ranjan A Tongaonkar and R Torres ldquoApproximatematching of persistent lexicon using search-engines forclassifying mobile app trafficrdquo in Proceedings of the 35thAnnual IEEE International Conference on Computer Com-munications IIEEE San Francisco CA USA April 2016

[56] L Li and S Qu ldquoShort text classification based on improvedITCrdquo Journal of Computer and Communications vol 1 no 4pp 22ndash27 2013

[57] A H Lashkari A F A Kadir L Taheri and A A GhorbanildquoToward developing a systematic approach to generatebenchmark android malware datasets and classificationrdquo inProceedings of the 2018 International Carnahan Conference onSecurity Technology (ICCST) IEEE Montreal Canada Oc-tober 2018

[58] A Rodriguez and A Laio ldquoClustering by fast search and findof density peaksrdquo Science vol 344 no 6191 pp 1492ndash14962014

[59] Y Li Z Yang Y Guo and X Chen ldquoDroidbot a lightweightUI-guided test input generator for androidrdquo in Proceedings ofthe Software Engineering Companion (ICSE-C) pp 23ndash26Buenos Aires Argentina May 2017

[60] M Fan X Luo J Liu et al ldquoGraph embedding based familialanalysis of android malware using unsupervised learningrdquo inProceedings of the 2019 IEEEACM 41st International Con-ference on Software Engineering (ICSE) IEEE Piscataway NJUSA pp 771ndash782 May 2019

Security and Communication Networks 19

Page 9: On-DeviceDetectionofRepackagedAndroidMalwarevia …downloads.hindawi.com/journals/scn/2020/8630748.pdf · 2020-05-31 · According to [4–6], most Android malware programs are repackagedapps,andthemajorityofnewAndroidmalware

traffic e extracted traffic features are categorized into twogroups traffic content features and traffic behavior featuresTraffic content features are the plaintext contents that areextracted from HTTP flows and traffic behavior features arethe statistics of network flows such as packet sizes andintervals Since Android apps may use an encrypted HTTPSprotocol to transmit data [54] traffic behavior features alsomust be considered e traffic content and traffic behaviorfeatures are described in detail in Sections 53 and 54

523 Traffic Filtering After traffic feature extraction theedge servers send the extracted features along with theclient-side IP address server-side IP addresses server-sideports app name AppNf and app version AppVf to the cloudNote that App i is installed on u devices hence the cloudeventually has u feature sets for App i Figure 8 illustrates thestructure of the feature set e cloud further filters these ufeature sets to increase the detection accuracy First it di-vides the u sets into v categories according to the app versionAppVf erefore each category contains the feature setsthat are generated by the same version of App i en foreach category the cloud removes the common traffic fromthe feature sets e common traffic is defined as network

flows that are contained in all features sets ese flows havethe same server-side IP address and the same server-sideport number e reason for this step is that the commontraffic will obfuscate the similarity calculations betweenapps Otherwise the similarity values between features setsare all high and they cannot be effectively used to distin-guish the repackaged malware from the original apps

After the filtering has been completed the similaritiesbetween feature sets are calculated and the similarity valuesare clustered to identify the repackaged malware

Client-side IP addressApp name AppN_fApp version AppV_fFlow 1 server-side IP address 1 server-side port 1 flow features

Flow f server-side IP address f server-side port f flow features

Figure 8 Illustration of the feature set e set contains the clientIP address app name and app version Each network flow that isgenerated by App i in a device is represented by the featuresextracted from the flow and by the server IP address and port

Trafficlabeling

Edge server 1

Traffic featureextraction

Similarity calculation of

trafficbehaviors

Similarity calculation of traffic contents Clustering of

similarity values

Original app

Repackaged malware

Cloud

Mobile device 1 2 (with app i installed)

Results

Traffic features

Traffic and flowmetainformation

ę

ę

Edge server e

Traffic features

Traffic and flow metainformation

Filtering

Mobile device u(with app i installed)

1

3

4

5

6

2

Figure 7emain procedure of the proposedmethod for Androidmalware detectione core parts of the proposedmethod are labeled bycircled numbers

Security and Communication Networks 9

53 Similarity Calculation of Traffic Content Android appsuse mainly HTTP and HTTPS protocols to communicatewith their remote servers [54] For HTTP traffic since it isplaintext based we can efficiently use the flow content tocalculate the similarity e basic strategy is to convertHTTP traffic to document files and the document files canbe handled by information retrieval technologies for simi-larity calculation HTTP flow content has also been used inapp identification from network traffic [55] However in thisstudy we only consider partial HTTP header information tobetter protect mobile usersrsquo privacy

In detail first we reassemble HTTP flows from packetsand extract the flow contents by saving the correspondingpacket contents into a text file as ASCII code Figure 9 showssuch an example en we preprocess the text file by de-leting the HTTP request content and the HTTP requestheader except for the GET (or POST) and HOST fields eHTTP response content is also deleted namely only theHTTP response header the GET (or POST) and the HOSTfiled in the HTTP request header are saved Via this ap-proach the mobile usersrsquo privacy can be protected from thecloud (for example the mobile phone information (HUA-WEI) is leaked by the User-Agent field in Figure 9) Afterthat the text file is spilt using special characters (eg ) and the processed file becomes the flow contentfeatures

e similarity is calculated via the improved TF-IDFalgorithm [56] and the cosine similarity e term frequencymeasures the count of words in a document We use theaugmented frequency measurement Denote the document(the feature set) as di and the word as wj Term frequencytf(wj di) is calculated as

tf wj di1113872 1113873 05 +05 times f wj di1113872 1113873

max f w di( 1113857 w isin di1113864 1113865 (1)

where f(wj di) is the number of occurrences of wj indocument di

e inverse document frequency measures of theamount of information a word provides Typically it is thelogarithmically scaled inverse fraction of the number ofdocuments that contain the word e inverse documentfrequency of wj is calculated as

idf wj1113872 1113873 logu

d wj isin d1113966 111396711138681113868111386811138681113868

11138681113868111386811138681113868+ 001⎛⎝ ⎞⎠ (2)

where u is the number of documents and | d wj isin d1113966 1113967| isnumber of documents in which the term wj appears

Denote by p(wj di) the product of TF(wj di) andI DF(wj) it is formulated as

p wj di1113872 1113873 log tf wj di1113872 11138731113872 1113873 times idf wj1113872 1113873

1113936

cj0 log

2 tf wj di1113872 11138731113872 1113873 times idf2 wj1113872 11138731113969 (3)

After the TF-IDF operation all documents are convertedinto equal-length numerical vectors We use the cosinesimilarity to measure the similarity between two documentvectors For documents di and dk the corresponding

numerical vectors are denoted as V(di) and V(dk) and thelength of vector is t e similarity between di and dk iscalculated as

CSdik

1113936tl1 V di( 1113857l times V dk( 1113857l( 1113857

1113936tl1 V di( 1113857

2l

1113969

times

1113936tl1 V dk( 1113857

2l

1113969 (4)

Suppose that in category vi (we divide the u feature setsgenerated by App i in to v categories in the traffic filteringstep) there are c feature sets For each feature set we save theflow features of all HTTP flows into a text file and c text filesare obtained en we calculate the numerical vector foreach feature set via equation (3) Last we compare thenumerical vector with each other as equation (4) and gain csimilarity values (CSdii

1) for each feature set We rep-resent these similarity values of feature set di (i 1 2 c)as a vector Si

contents and Sicontents (CSdi1

CSdi2 CSdic

)e vectors Si

contents (i 1 2 c) in combination with thesimilarity values of traffic behaviors will be clustered toidentify the abnormal data points namely the repackagedmalware

54 SimilarityCalculation of TrafficBehaviors In addition tocalculating the similarity values of traffic contents thesimilarity of traffic behaviors must be calculated due toencrypted flows We use m (m 11 in our study) statisticalfeatures as listed in Table 2 to represent the behavior of anencrypted network flow e units of the flow duration andthe packet interval time are milliseconds in our experiments

With the flow statistical features for feature set di the jthencrypted flow can be represented as a vector in the featurespace

Fj

di f

1j f

2j f

mj1113872 1113873 (5)

where flj represents the value of the lth statistical feature

1le llem Figure 10 shows an example of feature represen-tation for encrypted traffic Consequently the traffic be-haviors of di can be represented as a matrix of f

pj

Figure 9 Example of converting an HTTP flow to a text file

10 Security and Communication Networks

TBntimesm di( 1113857 F1di

F2di

Fndi

1113872 1113873T (6)

where n is the number of encrypted flows that are containedin di and nge 1

Since different feature sets may differ in terms of thenumber of encrypted flows traffic behaviors TBnitimesm(di) andTBnktimesm(dk) may differ in terms of different dimensionnamely the value of ni should not be equal to that of nk Toefficiently calculate the similarity between TBnitimesm(di) andTBnktimesm(dk) we calculate the Frobenius norm of TBnitimesm(di)

and TBnktimesm(dk) e formula is presented in equation (7)where norm(middot) normalizes the value of fl

j to [0 1]

FN(d)

1113944

n

j11113944

m

l1norm fl

j1113872 111387311138681113868111386811138681113868

111386811138681113868111386811138682

11139741113972

(7)

en the distance between TBnitimesm(di) and TBnktimesm(dk) iscalculated as

dis di dk( 1113857 FN di( 1113857 minus FN dk( 11138571113868111386811138681113868

1113868111386811138681113868 (8)

Finally the similarity between TBnitimesm(di) andTBnktimesm(dk) can be calculated by

BSdik 1 minus

dis di dk( 1113857

max dis dr dq1113872 1113873 r q 1 2 u rne q1113960 1113961 (9)

In the above process ni and nk are assumed to exceed 1If ni 0 (nk 0) and nkge 1 (nige 1) BSdik

is set to 0 to reflectthe significant difference between di and dk However ifni 0 and nk 0 namely there is no encrypted flow in bothdi and dk BSdik

is set to 1 because they exhibit the samebehaviors ie they do not produce any encrypted traffic

Similarly we let BSdii 1 and represent these similarity

values as a vector Sibehavior BSdi1

BSdi2 BSdiu

1113966 1113967

55 Clustering of Similarity Values and Malware DetectionFrom the traffic content similarity CSdik

and the traffic be-havior similarity BSdik

we can determine the final similarityvalue BSdik

between the feature sets di and dk as expressed inequation (10) In this equation q is the weight value In ourexperiments we set q 05 namely the traffic contentsimilarity and the traffic behavior similarity are assigned thesame weight is is because apps typically produce almostthe same numbers of HTTP and HTTPS flows [57] We alsoevaluate various weighted values in Section 62 and theexperimental results support that 05 is the most suitablevalue For feature set di a vector Sdi

(Sdi1 Sdi2

Sdiu) is

finally obtained which contains the similarity values be-tween di and the other feature setse set of vector Sdi

1113966 1113967 i

1 2 u will be clustered for the detection of repackagedmalware

Sdik q times CSdik

+(1 minus q) times BSdik (10)

To be specific we use the density peak clustering method[58] to realize our objective Density peak clustering is basedon the following assumption cluster centers are charac-terized by a higher density than their neighbors and by arelatively large distance from points that have higher den-sities Hence density peak clustering can automaticallyidentify the correct number of clusters is is importantbecause we do not know whether there is repackagedmalware or how many types of repackaged malware thereare In other words we cannot determine the number ofclusters in advance Density peak clustering can help usovercome this challenge

We assume that most mobile devices have installed theoriginal apps as discussed in Section 51 erefore afterclustering the cluster that has the largest number of voxelswill be recognized as original and the remaining clusters willbe regarded as suspicious However suspicious apps are notnecessarily repackaged malware For example apps can berepackaged to show ads only Repackaged apps of this typeshould not be simply judged as malware Additionallyoriginal apps can be suspicious due to different user be-haviors To further reduce the frequency of the false alarmswe can check the server hostnames that are contained in therepackaged clusters using VirusTotal (the URLs are includedin the HTTP traffic contents) If all of the serversrsquo hostnamesfor a suspicious cluster Cr are judged as clean it will berecognized as a cluster of abnormal but harmless appsOtherwise it will be identified as repackaged malware Viathis approach we can accurately detect repackaged Androidmalware In this step we only use VirusTotal as a white list toexclude false alarms We do not use it as a backlist (detectionsignatures) to detect repackaged malware Malware detec-tion is still realized by clustering the similarity values

Table 2 Flow statistical featuresBytes sent by the app and bytes sent by the serverNumbers of outgoing and incoming packetsMean and std dev of the outgoing and incoming packet sizesrsquoflow durationMean and std dev of the interpacket time

lt2761 16696 16 12 173 1294 289 1177 3392 121 190gt

Figure 10 Example of the conversion of an encrypted flow to afeature vector

Security and Communication Networks 11

6 Evaluation

61 System Implementation and Dataset Figure 11illustrates our experimental setup e Android phoneswere connected to the edge server via WiFi and each phonehad installed the edge-client app which sends the flowmetainformation to the edge server using Socket (jav-anetSocket) e edge is a Dell PowerEdge R730 server(with 32 processors 16GB RAM and 12 TB hard drives)that is configured as an access point In the edge server weused tcpdump (httpwwwtcpdumporg) to collect networktraffic e similarity calculations of traffic contents andbehaviors and the clustering of similarity values wereimplemented on the Windows Azure cloud computingplatforme URL checking was conducted using Virustotal(httpswwwvirustotalcom)

In the implementation we used aria2c (httpsaria2githubio) to download apps from Androzoo (httpsandrozoounilu) aria2c was encapsulated as commandsin Python code for parallel downloading After downloadingthe apps we automatically installed them on phones usingthe adb shell command Each phone had 40 tested appsinstalled We have implemented the edge-client app basedon Android VPN service In our implementations the edge-client app correlated network flows and apps by resolvingthe ports and PIDs (Process IDs) send the newly collectedflow metainformation to the edge server every 60 secondse edge server used tcpdump to collect network trafficwhich was saved as pcap files To extract traffic features weused Splitcap (httpswwwnetreseccompageSplitCap)to restore tcp flows from pcap files en for HTTP trafficwe extracted the contents using tshark (httpswwwwiresharkorgdocsman-pagestsharkhtml) commands -zfollow tcp and ascii e packet sizes and packet intervaltimes were also obtained by tshark with commands -eframelen ndashe frametime_relative In the cloud CSdik

and BSdik

were calculated using scikit-learn (httpscikit-learnorg)and Numpy (httpsplotlynumpynorm) e code ofdensity peak clustering is available on GitHub (httpsgithubcomjasonwbwDensityPeakCluster) and we havefixed various bugs that were present in the original code

We have downloaded 200 pairs of original andrepackaged apps from Androzoo for testing However notall of these downloaded repackaged apps are maliciousSome apps are only repackaged to show extra ads To screenout Android malware from the downloaded repackagedapps we further checked the 200 repackaged apps usingVirustotal If an app was judged as safe by all the antivirustools it was labelled as repackaged_safe in our experimentsOtherwise it was labelled as repackaged_malware In total143 repackaged apps were detected as malware by VirustotalNevertheless in our experiments we evaluated the proposedclusteringmethod on all 200 repackaged apps to obtainmorerealistic performance results

e downloaded apps were run in 10 rooted Androidphones in parallel to accelerate the experimental process Forall apps we used Droidbot [59] to send input events to themand to generate network traffic automatically We ran eachrepackaged app (repackaged_safe and repackaged_malware

apps) 10 times and the corresponding original app 30 timesto generate sufficient network traffic e running times ofthe original apps exceeded those of the repackaged appsbecause we assume that most of the apps are original asdiscussed in Section 51 Consequently we obtained 40traffic sets for each repackaged-original app pair Note thatfrom the perspectives of the mobile devices and the edgeserver there are only 200 distinct apps because therepackaged app has the same name as the correspondingoriginal app erefore our experiments have simulated40lowast 200 8000 mobile devices For each repackaged-orig-inal app pair there were 30 devices with the original app and10 devices with the repackaged version

In total we have gathered almost 89GB network traffice datasets that are used in our experiments are summa-rized in Table 3 In the table the traffic dataset sizes are thesizes of the captured network traffic and the feature datasetsizes are the sizes of the extracted features In our experi-ments the HTTP traffic contents and the flow statisticalfeatures which are listed in Table 2 were both saved as textfiles As shown in Table 3 the feature datasets are signifi-cantly smaller than the traffic datasets (25G vs 63G and 9G vs26G)us the edge computing can reduce the data size andincrease the data acquisition speed for cloud computing[24] Since the feature dataset sizes are still large (25G+ 9G)we did not send all these feature data to the cloud simul-taneously Instead we sent the feature data one by oneWhen the feature data of an app were successfully receivedby the cloud its traffic similarities were calculated andclustered immediately After that the feature data of anotherapp would be sent and analyzed e experimental resultswill be presented in the following

62 ExperimentalResults In this section wemainly evaluatethe detection efficiency of Android malware by clusteringsimilarity values e performance evaluation of the pro-posed method in terms of power memory and CPU con-sumptions is elaborated in the following Section 63

Similar to previous studies [8 14] to measure theperformance of the proposed method Accuracy and F-measure metrics are utilized which are defined in equations(11) and (12) In equation (11) TP FN FP and TN aredefined in Table 4 In Table 4 repackaged_x representsrepackaged_safe or repackaged_malware e precision andrecall in equation (12) are calculated via equations (13) and(14)

Accuracy TP + TN

TP + FN + FP + TN (11)

F minus measure 2 times precision times recallprecision + recall

(12)

precision TP

TP + FP (13)

recall TP

TP + FN (14)

12 Security and Communication Networks

We have evaluated the Accuracy and F-measure for eachrepackaged app Recall that we ran each repackaged app 10times and the corresponding original app 30 timeserefore for each app the number of the original apptraffic sets is 30 and the number of the repackaged app trafficsets is 10 namely TP + FN 10 and TN+FP 30 for eachapp We have randomly chosen 20 apps from repack-age_malware and the experimental results for these apps arepresented in Figures 12 and 13 Detailed descriptions of thechosen 20 apps are listed in Table 5 In the table the mainfunction of each app is determined by analyzing the packagename from the AndroidManifestxml and from the Googlesearch results of the package name As depicted in Figure 12100 accuracy was realized on a total of 6 apps (app3 app5app9 app10 app13 and app18) e minimum accuracy is85 (app2 3440) and the average accuracy is 952 Fig-ure 13 presents the F-measure values In our experimentsthe maximum F-measure value is 1 and the minimum valueis 0737 e average F-measure value is 0888 ese resultsdemonstrate the satisfactory performance of our proposedclustering method

We have also calculated the average accuracy and F-measure values for all repackaged_safe and repack-aged_malware apps In these calculationsTP + FN 57lowast10 570 and TN+FP 57lowast 301710 for

repackaged_safe apps For repackaged_malware appsTP + FN 143lowast101430 and TN+FP 143lowast 30 4290is is because in our experiments 143 apps were identifiedas repackaged malware and 57 apps as repackaged safe byVirustotal us the parameter values differ e experi-mental results are listed in Table 6

According to Table 6 the average accuracy values forrepackaged_safe and repackaged_malware apps are bothhigh However the average F-measure for repackaged_safeapps is only 078 which is lower than that of repackagedmalware is indicates that repackaged safe apps are

Router Windows Azure

Edge server DellPowerEdge R730

10 android phones

Repackagedoriginal apps

Androzoo

Figure 11 Experimental setup

Table 3 Datasets used in the experiments

of apps of runs of HTTP flows of encrypted flows Traffic dataset sizes Feature dataset sizesOriginal apps 200 30 340015 183452 asymp63G bytes asymp25G bytesRepackaged apps 200 10 110670 93041 asymp26G bytes asymp9G bytes

Table 4 Confusion matrix

ObservedDetected

Repackaged_x app Original appRepackaged_x app TP ( of TPs) FN ( of FNs)Original app FP ( of FPs) TN ( of TNs)

Apps

0

01

02

03

04

05

06

07

08

09

1

Acc

urac

y

App

1A

pp2

App

3A

pp4

App

5A

pp6

App

7A

pp8

App

9A

pp10

App

11A

pp12

App

13A

pp14

App

15A

pp16

App

18A

pp17

App

19A

pp20

Figure 12 Detection accuracies for the 20 selected apps

Security and Communication Networks 13

difficult to distinguish from their original apps is isbecause repackaged_safe apps only generate smallamounts of additional network flows and their networktraffic features are not clearly distinguishable from thoseof normal traffic Fortunately the average F-measure ofrepackaged malware apps is outstanding is may be dueto the fact that malware typically send stolen data to orreceive commands from remote servers Hence theirnetwork traffic features are special and distinguishableFigure 14 presents an example of app clustering and il-lustrates the distinguishability of the repackaged malwareand the original app ese results demonstrate thatrepackaged Android malware can be effectively detectedby comparing their network traffic with that of originalapps

For repackaged malware apps the accuracy results withdifferent weight values (q in equation (10)) are presented inFigure 15 As depicted in the figure when q 05 we realizethe highest accuracy rate 969 If q 0 namely only thefeatures of encrypted traffic are used the accuracy result is753 e result is 617 when q 1 namely only thefeatures of plaintext traffic are considered ese resultsdemonstrate that both the encrypted and plaintext trafficcontribute to the differentiation of the original apps and therepackaged malware However the encrypted traffic con-tributes more than the plaintext traffic is may becausemalicious apps usually connect to their servers by securityprotocols to evade IDS detection

In the previous experiments we checked the serversrsquohostnames using VirusTotal to reduce the frequency of thefalse positives For various clusters that contained less than20 elements if all the serversrsquo hostnames were judged asclean the cluster was identified as an original app rather thana repackaged one As argued in Section 55 this process onlyreduces the frequency of false alarms To support this ar-gument we list the numbers of true positives and falsepositives of the 143 repackaged malware in Table 7 forcomparison When conducting the hostname checking thenumber of true positives is 1400 and the result is the samewhen the hostname checking is turned off is findingsupports that the malware is detected by clustering thesimilarity values However the number of false positivesincreased from 47 to 213 when the hostname checking wasturned off erefore we suggest checking the serversrsquohostnames after clustering to increase the practicality of themethod Note that false positives still occur when we checkthe hostnames by VirusTotal Hence there may be malwarein the original apps In our dataset the typical examples arecommobovideoplayerpro and cncfshop ele1taoxiaosanWe further analyze these two apps by reading theirdecompiled codes and we determine that they indeed es-tablish connections with suspicious servers In detail invarious runs app commobovideoplayerpro connects withhttpwwwtopappsquarecomwhich is judged as maliciousby CyRadar in VirusTotal App cncfshop ele1taoxiaosanvisits suspicious URL http955ccjSn9 Since these apps arethe original apps they are identified as false positives in theexperiments

In our experiments we chose density peak clustering asthe clustering method because density peak clustering canautomatically determine the correct number of clusters [58]Hence our method can detect various versions of malwarethat are repackaged from the same original app Unfortu-nately in the apps that were downloaded from AndroZoono malware was repackaged from the same original app Toevaluate the efficiency of detecting versions of malware wehave created two malware instances from a popular game

Apps

App

1A

pp2

App

3A

pp4

App

5A

pp6

App

7A

pp8

App

9A

pp10

App

11A

pp12

App

13A

pp14

App

15A

pp16

App

18A

pp17

App

19A

pp20

0

01

02

03

04

05

06

07

08

09

1

F-m

easu

re

Figure 13 Detection F-measure for the 20 selected apps

Table 5 Detailed descriptions of the chosen 20 apps

Main function ofapp

of HTTPflows

of encryptedflows

App1 VideoPlayer 453 51App2 Game 657 1455App3 Browser 1531 1478App4 MusicPlayer 341 25App5 Game 612 289App6 Game 788 123App7 Game 1336 524App8 Fitness 357 892App9 Email 122 431App10 Game 963 790App11 MusicPlayer 1128 547App12 Game 1598 426App13 News 2460 1178App14 News 2891 1543App15 Chat 171 49App16 Email 143 368App17 WeatherForecast 397 120App18 Game 1832 921App19 Fitness 732 362App20 OnlineLearning 2378 834

Table 6 Average accuracy and F-measure

Average accuracy () AverageF-measure

Repackaged_safe 903 078Repackaged_malware 969 094

14 Security and Communication Networks

app namely aircomaceviralmotox3 One is produced bygrafting AnserverBotrsquos source code into air-comaceviralmotox3 and the other is created by adding codefor stealing usersrsquo contacts and short messages and sendingthem to a remote server Similarly each malware is run 10times and the original app aircomaceviralmotox3 is alsorun 10 times for validatione clustering result is presentedin Figure 16 Apparently there are 3 clusters e

experimental results show that the value of F-measure foreach malware is 1 Hence our method can detect variousmalware that were repackaged from the same original app

63 Comparison In this section we compare our methodwith the typical methods that are based on network or cloudas illustrated in Figure 5 e comparative indicators are thedetection accuracy F-measure average power CPU andmemory consumptions of the mobile device First wecompare the proposed method with AppFA [14] which isimplemented completely at the network level Since AppFAalso must analyze app traffic we directly use the traffic sets ofthe repackaged malware apps for comparison While testingAppFA we deleted the label information that was containedin the traffic sets and mixed the network flows as a wholeen we used the signature matching and the constrainedK-means clustering algorithm proposed in [14] to restoreapp sessions namely the label for each flow e similarapps used in AppFA were chosen in Google Play via themethod introduced in [14] and for each repackaged mal-ware 10 similar apps were selected as the peer group epeer group apps were run once by Droidbot to generatenetwork traffic e comparison results are listed in Table 8

Since AppFA need not to install some special apps onmobile devices the extra power memory and CPU con-sumptions are 0 However the detection accuracy is muchlower (863 vs 969) than that of the proposedmethod ascompared in Table 8 In the proposed method we use theedge-client app to record flow metainformation thusnetwork flows can be accurately labeled However inAppFA flow labeling is realized via traffic clustering and theclustering accuracy cannot be 100 Meanwhile AppFAcompares the appsrsquo network behaviors with those of theirsimilar apps which also results in errors In the proposedmethod we directly compare the repackaged malware withits original app and the detection accuracy is increased eaverage power CPU and memory consumptions of theproposed method are also low and acceptable

In Table 8 the power memory and CPU consumptionsfor our edge-client app were obtained with the help of appGT (httpsgithubcomTencentGT) GT is a portable toolthat runs on smartphones for debugging the APP internalparameters and the code time-consuming statistics eaverage values for the power memory and CPU con-sumptions are calculated as follows Before running an appwith Droidbot we started the edge-client app by GT Whenthe Droidbot was stopped we recoded the power CPU andmemory consumptions We repeated the process until allapps (including 143 repackaged malware apps and theiroriginal versions) had been tested Via this way we obtainedthe average power CPU and memory consumptions for theedge-client app According to Table 8 the edge-client app islightweight and has minimal impact on the performances ofthe mobile devices

en we compare the proposed method with SAMA-Droid [44] SAMADroid monitors Android system calls andsends the calls to a centralized server for the detection ofAndroid malware SAMADroid is a typical cloud-based on-

Decision graph rho-delta-rho

20 251510 35 403050Sorted rho

00

02

04

06

08

Rho lowast

delt

a

Figure 14 Example of app clustering Obviously two clusters areidentified e dot in the top left corner is an outlier e outlier isdetected as an abnormal but harmless app by our method isfigure was generated by the density peak clustering tool

01 02 03 04 05 06 07 08 09 10Value of q

0

01

02

03

04

05

06

07

08

09

1

Acc

urac

y

Figure 15 Accuracies of various weight values for traffic contentsimilarity and traffic behavior similarity

Table 7 True positives and false positives with and withouthostname checking

of true positives of false positivesWith hostname checking 1400 47Without hostnamechecking 1400 213

Security and Communication Networks 15

device detection method We have implemented the dy-namic detection according to the description in [44] Indetail the APIs that are invoked by the running apps aremonitored by the tool Sensitive API Monitor (httpsgithubcomcyruliuSensitive-API-Monitor) As done in [44] thefrequency values of 10 system calls (open ioctl brk readwrite close sendto sendmsg recvfrom and recvmsg) arerecorded ese recorded frequency values are sent toWindows Azure for malware detection We evaluate thedynamic detection performance of SAMADroid on our 143repackaged_malware apps and their original versions In theevaluation the frequency values of the original apps are usedto create the normality model and the frequency values ofthe repackaged malware apps are used for testing We alsouse the app GT to record the power memory and CPUconsumptions of SAMADroid Note that our phones wereall rooted and Sensitive_API_Monitor and GT can be usede experimental results are presented in Table 9

Last we compare our method with Shabtairsquos method [8]Shabtairsquos method detects Android malware at the mobiledevices and it considers the apprsquos network traffic patternsonly the HTTP traffic contents are not considered isgives us an opportunity to compare our method withmethods that have been specially designed for encryptedtraffic We have implemented Shabtairsquos method and chosenthe feature subset 2 as the detection features Feature subset2 includes Avg Sent Bytes Avg Rcvd Bytes and Pct ofAvg Rcvd Bytes among other values and it is proved to be

the best feature subset for malware detection [8] e De-cisionRegression tree is chosen as the classification algo-rithm e testing dataset is the traffic that was generated bythe 143 repackaged malware instances and the training set isthe traffic that was generated by the corresponding originalapps e classification results are listed in Table 10According to the table the detection accuracy and F-mea-sure values of our method are significantly higher than thoseof Shabtairsquos method is indicates that the HTTP trafficprovides useful information for Android malware detectionSince Shabtairsquos method is conducted at the mobile devices itconsumes more resources than our method as compared inTable 10

7 Discussion

e results of our extensive experiments demonstrate thatrepackaged Android malware can be efficiently and effec-tively detected by clustering network traffic at the edgedevices However there are still limitations for our methodFirst we assume the most of the devices have installed theoriginal apps (rlt (12)u as described in Section 51) is isreasonable in most cases since mobile users typicallydownload apps from the official markets However in areaswhere official markets such as the Google play and Amazonare banned it is possible that the most mobile devices haveinstalled the repackaged apps is may cause false positivesSecond if the malware does not generate sufficient trafficour method cannot detect it For example some malwareonly sends fraudulent premium SMS messages is type ofmalware cannot be detected by monitoring its networktraffic ird we rely on third-party tools such as Virustotalfor the identification of malicious URLs and to reduce thenumber of false alarms If a malicious app rents legalplatforms such as the public cloud as the remote controlserver our method will judge it as a normal app Hence falsenegatives will be generated

To overcome these limitations we could combine theproposed method with available methods such as SAMA-Droid In such cases the edge-client app will not only recordflow metainformation but also monitor app system be-haviors However the system behavior monitoring does notneed to be conducted all the time If abnormal but harmlessor lower traffic apps are detected the edge-client app will beinformed to record the corresponding appsrsquo system be-haviors Similarly the edge-client app can send the appsystem behaviors to the edge server for further processinge performance of this combined method will be evaluatedin our future work One can also combine the proposedframework with AppFA [14] for the detection of other typesof Android malware For example the edge extracts apptraffic behavior features and the cloud can compare appsrsquobehaviors with their peer groups to identify abnormalitiesFurthermore one could first use clustering techniques [60]to group repackaged apps according to the same maliciousmodules and then analyze the behaviors of the representativerepackaged apps in each group to obtain more accuratebehavior features

y

2D nonclassical multidimensional scaling

00ndash02 02ndash04x

ndash04

ndash03

ndash02

ndash01

00

01

02

03

04

Figure 16 Clustering results of various malware that wererepackaged from aircomaceviralmotox3e outlier in the middleis detected as an abnormal but harmless app is figure wasgenerated by the density peak clustering tool

Table 8 Comparison of the proposed method with AppFA

Proposed method AppFAAccuracy 969 863F-measure 094 075Average power consumption 12 0Average CPU consumption 06 0Average memory consumption 654MB 0

16 Security and Communication Networks

e paradigms of edge computing are used to detectAndroid malware in this study However several challengeswould be encountered when deploying it in practice Firstthe mobile devices must install a specified app for thecollection of essential information which may cause con-cerns and mobile users may refuse to cooperate Second inedge computing the data should be preprocessed by the edgeto better protect the usersrsquo privacy However the edge servercan observe all the data (including the sensitive informa-tion) and it threatens the privacy of users For example ifthe edge is deployed by the cloud provider the usersrsquo privacyinformation is also available to the cloud ere is no dif-ference with the user-cloud model for privacy protectionunless the edge server is deployed by the userird the edgeserver should be well protected from network attacksOtherwise it will be convenient for hackers to steal usersrsquoprivate information

8 Conclusion

In this paper we propose a novel method for detectingAndroid malware based on edge computing To the best ofour knowledge the detection of Android malware by uti-lizing edge computing and traffic clustering has not beenconsidered previously By introducing the edge layer themain tasks of the mobile device are offloaded to the edgeserver and the mobile device only needs to record flowmetainformation which substantially conserves the re-sources of the mobile device e edge server extracts appsrsquonetwork traffic features and sends these features to the cloudplatform for further analysis us the usersrsquo privacy can bebetter protected With the received features the cloud cal-culates the similarities between apps and clusters thesesimilarity values to separate the original apps and themalware automatically We evaluated our methods on 400Android apps and the experimental results demonstratethat the detection accuracy can reach 100 for several appsand the average accuracy is 969 We also tested theperformance of the proposed method and compared it with

the typical approaches e comparison results show thatour method can detect repackaged Android malware withboth higher detection accuracy and lower power and re-source consumptions In the future we will continue to workto overcome the discussed limitations and to evaluate theproposed method in real world scenarios

Data Availability

e downloaded apps used to support the findings of thisstudy have been deposited in httpspanbaiducoms1b8vMceIdtdd1gBRC5Gbzvw Its extraction code is vd5re clustering code is also in the repository Since appnetwork traffic contains usersrsquo sensitive information itcannot be publicly shared However we have explained allthe steps to generate network traffic automatically esesteps are also described in the instructionspdf in therepository

Conflicts of Interest

e authors declare that they have no conflicts of interest

Acknowledgments

is work was supported by National Key TechnologiesResearch and Development Program of China under Grant2018YFD0401404 National Natural Science Foundation ofChina under Grants 61702282 61802192 and 71801123Natural Science Foundation of the Jiangsu Higher EducationInstitutions of China under Grants 18KJB520024 and17KJB520023 Nanjing Forestry University under GrantsGXL016 and CX2016026 and NUPTSF under GrantNY217143

References

[1] Statista Mobile Operating Systemsrsquo Market Share Worldwidefrom January 2012 to December 2019 Statista HamburgGermany 2009 httpswwwstatistacomstatistics272698global-market-share-held-by-mobile-operating-systems-since-2009

[2] W Martin F Sarro Y Jia Y Zhang and M Harman ldquoAsurvey of app store analysis for software engineeringrdquo IEEETransactions on Software Engineering vol 43 no 9pp 817ndash847 2017

[3] httpswwwappbraincomstatsnumber-of-android-apps2020

[4] L Li D Li T F Bissyande et al ldquoUnderstanding android apppiggybacking a systematic study of malicious code graftingrdquoIEEE Transactions on Information Forensics and Securityvol 12 no 6 pp 1269ndash1284 2017

[5] M Fan J Liu X Luo et al ldquoAndroid malware familialclassification and representative sample selection via frequentsubgraph analysisrdquo IEEE Transactions on Information Fo-rensics and Security vol 13 no 8 pp 1890ndash1905 2018

[6] J Zhang Z Qin K Zhang H Yin and J Zou ldquoDalvik opcodegraph based android malware variants detection using globaltopology featuresrdquo IEEE Access vol 6 pp 51964ndash51974 2018

[7] S Garg S K Peddoju and A K Sarje ldquoNetwork-baseddetection of android malicious appsrdquo International Journal ofInformation Security vol 16 no 4 pp 385ndash400 2016

Table 9 Comparison of the proposed method with SAMADroid

Proposed method SAMADroidAccuracy 969 831F-measure 0940 0793Average power consumption 12 23Average CPU consumption 06 08Average memory consumption 654MB 773MB

Table 10 Comparison of the proposed method with Shabtairsquosmethod

Proposed method Shabtairsquosmethod

Accuracy 969 771F-measure 0940 0695Average power consumption 12 77Average CPU consumption 06 23Average memory consumption 654MB 937MB

Security and Communication Networks 17

[8] A Shabtai L Tenenboim-Chekina D Mimran L RokachB Shapira and Y Elovici ldquoMobile malware detection throughanalysis of deviations in application network behaviorrdquoComputers amp Security vol 43 pp 1ndash18 2014

[9] L Xie X Zhang J-P Seifert and S Zhu ldquopBMDS a be-havior-based malware detection system for cellphone de-vicesrdquo in Proceedings of the ird ACM Conference onWireless Network Security ACM New York NY USApp 37ndash48 2010

[10] I Burguera U Zurutuza and S Nadjm-Tehrani ldquoCrowdroidbehavior-based malware detection system for androidrdquo inProceedings of the 1st ACMWorkshop on Security and Privacyin Smartphones and Mobile Devices ACM Chicago IL USApp 15ndash26 October 2011

[11] G Dini F Martinelli A Saracino and D SgandurraldquoMADAM a multi-level anomaly detector for android mal-warerdquo in Proceedings of the International Conference onMathematical Methods Models and Architectures for Com-puter Network Security vol 12 Springer St PetersburgRussia pp 240ndash253 2012

[12] Y Zhang M Yang B Xu et al ldquoVetting undesirable be-haviors in android apps with permission use analysisrdquo inProceedings of the 2013 ACM SIGSAC Conference on Com-puter amp Communications Security ACM Berlin Germanypp 611ndash622 November 2013

[13] SWang Z Chen X Li LWang K Ji and C Zhao ldquoAndroidmalware clustering analysis on network-level behaviorrdquo inProceedings of the International Conference on IntelligentComputing Springer Liverpool UK pp 796ndash807 August2017

[14] G He B Xu and H Zhu ldquoAppFA a novel approach to detectmalicious android applications on the networkrdquo in Securityand Communication Networks Wiley Hoboken NJ USA2018

[15] X Wang Y Yang and Y Zeng ldquoAccurate mobile malwaredetection and classification in the cloudrdquo SpringerPlus vol 4no 1 p 583 2015

[16] K Tam A Feizollah N B Anuar R Salleh and L Cavallaroldquoe evolution of android malware and android analysistechniquesrdquo ACM Computing Surveys (CSUR) vol 49 no 4p 76 2017

[17] M Sun X Li J C S Lui R T B Ma and Z Liang ldquoMonet auser-oriented behavior-based malware variants detectionsystem for androidrdquo IEEE Transactions on Information Fo-rensics and Security vol 12 no 5 pp 1103ndash1112 2017

[18] T Vidas and N Christin ldquoEvading android runtime analysisvia sandbox detectionrdquo in Proceedings of the 9th ACMSymposium on Information Computer and CommunicationsSecurity pp 447ndash458 Kyoto Japan June 2014

[19] Y Duan M Zhang A V Bhaskar et al ldquoings you may notknow about android (Un) packers a systematic study basedon whole-system emulationrdquo in Proceedings of the 2018Network and Distributed System Security Symposium pp 1ndash15 San Diego CA USA February 2018

[20] L Xue X Luo L Yu S Wang and D Wu ldquoAdaptiveunpacking of android appsrdquo in Proceedings of the 2017 IEEEACM 39th International Conference on Software Engineering(ICSE) IEEE Buenos Aires Argentina pp 358ndash369 May2017

[21] Y Zhou and X Jiang ldquoDissecting android malware char-acterization and evolutionrdquo in Proceedings of the 2012 IEEESymposium on Security and Privacy IEEE San Francisco CAUSA pp 95ndash109 2012

[22] N W Lo S K Lu and Y H Chuang ldquoA framework for thirdparty android marketplaces to identify repackaged appsrdquo inProceedings of the 2016 IEEE 14th International Conference onDependable Autonomic and Secure Computing 14th Inter-national Conference on Pervasive Intelligence and Computing2nd International Conference on Big Data Intelligence andComputing and Cyber Science and Technology Congresspp 475ndash482 Auckland New Zealand August 2016

[23] L Li J Gao M Hurier et al ldquoAndrozoo++ collectingmillions of android apps and their metadata for the researchcommunityrdquo 2017 httpsarxivorgabs170905281

[24] W Shi J Cao Q Zhang Y Li and L Xu ldquoEdge computingvision and challengesrdquo IEEE Internet of ings Journal vol 3no 5 pp 637ndash646 2016

[25] W Zhou Y Zhou X Jiang and P Ning ldquoDetectingrepackaged smart-phone applications in third-party androidmarketplacesrdquo in Proceedings of the Second ACM Conferenceon Data and Application Security and Privacy ACM AntonioTX USA pp 317ndash326 February 2012

[26] Y Shao X Luo C Qian P Zhu and L Zhang ldquoTowards ascalable resource-driven approach for detecting repackagedandroid applicationsrdquo in Proceedings of the 30th AnnualComputer Security Applications Conference ACM NewOrleans LA USA pp 56ndash65 December 2014

[27] K Chen P Wang Y Lee et al ldquoFinding unknown malice in10 seconds mass vetting for new threats at the google-playscalerdquo in Proceedings of the USENIX Security Symposiumvol 15 Washington DC USA August 2015

[28] Z Wang C Li Y Guan and Y Xue ldquoDroidchain a novelmalware detection method for android based on behaviorchainrdquo in Proceedings of the 2015 IEEE Conference onCommunications and Network Security (CNS) IEEE Flor-ence Italy pp 727-728 September 2015

[29] K Tian D D Yao B G Ryder G Tan and G PengldquoDetection of repackaged android malware with code-het-erogeneity featuresrdquo IEEE Transactions on Dependable andSecure Computing vol 17 no 1 pp 64ndash77 2017

[30] A De Lorenzo F Martinelli E Medvet F Mercaldo andA Santone ldquoVisualizing the outcome of dynamic analysis ofandroid malware with vizmalrdquo Journal of Information Se-curity and Applications vol 50 Article ID 102423 2020

[31] M Lin D Zhang X Su and T Yu ldquoEffective and scalablerepackaged application detection based on user interfacerdquo inProceedings of the 2017 IEEE Conference on ComputerCommunications Workshops (INFOCOM WKSHPS) IEEEAtlanta GA USA May 2017

[32] G Meng M Patrick Y Xue Y Liu and J Zhang ldquoSecuringandroid app markets via modelling and predicting malwarespread between marketsrdquo IEEE Transactions on InformationForensics and Security vol 14 no 7 pp 1944ndash1959 2018

[33] A Arora and S K Peddoju ldquoMinimizing network trafficfeatures for android mobile malware detectionrdquo in Proceed-ings of the 18th International Conference on DistributedComputing and Networking ACM Hyderabad India January2017

[34] X Wu D Zhang X Su and W Li ldquoDetect repackagedAndroid application based on HTTP traffic similarity httptraffic similarityrdquo Security and Communication Networksvol 8 no 13 pp 2257ndash2266 2015

[35] J Malik and R Kaushal ldquoCredroid android malware de-tection by network traffic analysisrdquo in Proceedings of the 1stACM Workshop on Privacy-Aware Mobile Computing ACMPaderborn Germany pp 28ndash36 July 2016

18 Security and Communication Networks

[36] A Zulkifli I R A Hamid W M Shah and Z AbdullahldquoAndroid malware detection based on network traffic usingdecision tree algorithmrdquo in Proceedings of the InternationalConference on Soft Computing and Data Mining SpringerSenai Malaysia pp 485ndash494 January 2018

[37] Z Chen Q Yan H Han et al ldquoMachine learning basedmobile malware detection using highly imbalanced networktrafficrdquo Information Sciences vol 433-434 pp 346ndash364 2018

[38] G He B Xu L Zhang and H Zhu ldquoMobile app identifi-cation for encrypted network flows by traffic correlationrdquoInternational Journal of Distributed Sensor Networks vol 14no 12 pp 1ndash17 2018

[39] L Xue Y Zhou T Chen X Luo and G Gu ldquoMalton to-wards on-device non-invasive mobile malware analysis forARTrdquo in Proceedings of the 26th USENIX Security Symposiumpp 289ndash306 Vancouver Canada August 2017

[40] M K Alzaylaee S Y Yerima and S Sezer ldquoDL-Droid deeplearning based android malware detection using real devicesrdquoComputers amp Security vol 89 Article ID 101663 2020

[41] A S Shamili C Bauckhage and T Alpcan ldquoMalware de-tection on mobile devices using distributed machine learn-ingrdquo in Proceedings of the 20th International Conference onPattern Recognition (ICPR) IEEE Istanbul Turkeypp 4348ndash4351 August 2010

[42] M Zhao T Zhang F Ge and Z Yuan ldquoRobotdroid alightweight malware detection framework on smartphonesrdquoJournal of Networks vol 7 no 4 p 715 2012

[43] K A Talha D I Alper and C Aydin ldquoAPK auditor per-mission-based android malware detection systemrdquo DigitalInvestigation vol 13 pp 1ndash14 2015

[44] S Arshad M A Shah A Wahid A Mehmood H Song andH Yu ldquoSamadroid a novel 3-level hybrid malware detectionmodel for android operating systemrdquo IEEE Access vol 6pp 4321ndash4339 2018

[45] G He L Zhang B Xu and H Zhu ldquoDetecting repackagedandroid malware based on mobile edge computingrdquo inProceedings of the 2018 Sixth International Conference onAdvanced Cloud and Big Data (CBD) IEEE Lanzhou Chinapp 360ndash365 August 2018

[46] R Roman J Lopez andMMambo ldquoMobile edge computingFog et al a survey and analysis of security threats andchallengesrdquo Future Generation Computer Systems vol 78pp 680ndash698 2018

[47] F Bonomi R Milito J Zhu and S Addepalli ldquoFog com-puting and its role in the internet of thingsrdquo in Proceedings ofthe First Edition of the MCC Workshop on Mobile CloudComputing ACM Helsinki Finland pp 13ndash16 August 2012

[48] M T Beck MWerner S Feld and S Schimper ldquoMobile edgecomputing a taxonomyrdquo in Proceedings of the Sixth Inter-national Conference on Advances in Future Internet pp 48ndash55 Lisbon Portugal November 2014

[49] P Mach and Z Becvar ldquoMobile edge computing a survey onarchitecture and computation offloadingrdquo IEEE Communi-cations Surveys amp Tutorials vol 19 no 3 pp 1628ndash1656 2017

[50] H T Dinh C Lee D Niyato and P Wang ldquoA survey ofmobile cloud computing architecture applications and ap-proachesrdquo Wireless Communications and Mobile Computingvol 13 no 18 pp 1587ndash1611 2013

[51] X Lyu H Tian L Jiang et al ldquoSelective offloading in mobileedge computing for the green internet of thingsrdquo IEEENetwork vol 32 no 1 pp 54ndash60 2018

[52] S Agrawal and J Agrawal ldquoSurvey on anomaly detectionusing data mining techniquesrdquo Procedia Computer Sciencevol 60 pp 708ndash713 2015

[53] A Tongaonkar ldquoA look at the mobile app identificationlandscaperdquo IEEE Internet Computing vol 20 no 4 pp 9ndash152016

[54] G He B Xu and H Zhu ldquoIdentifying mobile applications forencrypted network trafficrdquo in Proceedings of the 2017 FifthInternational Conference on Advanced Cloud and Big Data(CBD) IEEE Shanghai China pp 279ndash284 August 2017

[55] G Ranjan A Tongaonkar and R Torres ldquoApproximatematching of persistent lexicon using search-engines forclassifying mobile app trafficrdquo in Proceedings of the 35thAnnual IEEE International Conference on Computer Com-munications IIEEE San Francisco CA USA April 2016

[56] L Li and S Qu ldquoShort text classification based on improvedITCrdquo Journal of Computer and Communications vol 1 no 4pp 22ndash27 2013

[57] A H Lashkari A F A Kadir L Taheri and A A GhorbanildquoToward developing a systematic approach to generatebenchmark android malware datasets and classificationrdquo inProceedings of the 2018 International Carnahan Conference onSecurity Technology (ICCST) IEEE Montreal Canada Oc-tober 2018

[58] A Rodriguez and A Laio ldquoClustering by fast search and findof density peaksrdquo Science vol 344 no 6191 pp 1492ndash14962014

[59] Y Li Z Yang Y Guo and X Chen ldquoDroidbot a lightweightUI-guided test input generator for androidrdquo in Proceedings ofthe Software Engineering Companion (ICSE-C) pp 23ndash26Buenos Aires Argentina May 2017

[60] M Fan X Luo J Liu et al ldquoGraph embedding based familialanalysis of android malware using unsupervised learningrdquo inProceedings of the 2019 IEEEACM 41st International Con-ference on Software Engineering (ICSE) IEEE Piscataway NJUSA pp 771ndash782 May 2019

Security and Communication Networks 19

Page 10: On-DeviceDetectionofRepackagedAndroidMalwarevia …downloads.hindawi.com/journals/scn/2020/8630748.pdf · 2020-05-31 · According to [4–6], most Android malware programs are repackagedapps,andthemajorityofnewAndroidmalware

53 Similarity Calculation of Traffic Content Android appsuse mainly HTTP and HTTPS protocols to communicatewith their remote servers [54] For HTTP traffic since it isplaintext based we can efficiently use the flow content tocalculate the similarity e basic strategy is to convertHTTP traffic to document files and the document files canbe handled by information retrieval technologies for simi-larity calculation HTTP flow content has also been used inapp identification from network traffic [55] However in thisstudy we only consider partial HTTP header information tobetter protect mobile usersrsquo privacy

In detail first we reassemble HTTP flows from packetsand extract the flow contents by saving the correspondingpacket contents into a text file as ASCII code Figure 9 showssuch an example en we preprocess the text file by de-leting the HTTP request content and the HTTP requestheader except for the GET (or POST) and HOST fields eHTTP response content is also deleted namely only theHTTP response header the GET (or POST) and the HOSTfiled in the HTTP request header are saved Via this ap-proach the mobile usersrsquo privacy can be protected from thecloud (for example the mobile phone information (HUA-WEI) is leaked by the User-Agent field in Figure 9) Afterthat the text file is spilt using special characters (eg ) and the processed file becomes the flow contentfeatures

e similarity is calculated via the improved TF-IDFalgorithm [56] and the cosine similarity e term frequencymeasures the count of words in a document We use theaugmented frequency measurement Denote the document(the feature set) as di and the word as wj Term frequencytf(wj di) is calculated as

tf wj di1113872 1113873 05 +05 times f wj di1113872 1113873

max f w di( 1113857 w isin di1113864 1113865 (1)

where f(wj di) is the number of occurrences of wj indocument di

e inverse document frequency measures of theamount of information a word provides Typically it is thelogarithmically scaled inverse fraction of the number ofdocuments that contain the word e inverse documentfrequency of wj is calculated as

idf wj1113872 1113873 logu

d wj isin d1113966 111396711138681113868111386811138681113868

11138681113868111386811138681113868+ 001⎛⎝ ⎞⎠ (2)

where u is the number of documents and | d wj isin d1113966 1113967| isnumber of documents in which the term wj appears

Denote by p(wj di) the product of TF(wj di) andI DF(wj) it is formulated as

p wj di1113872 1113873 log tf wj di1113872 11138731113872 1113873 times idf wj1113872 1113873

1113936

cj0 log

2 tf wj di1113872 11138731113872 1113873 times idf2 wj1113872 11138731113969 (3)

After the TF-IDF operation all documents are convertedinto equal-length numerical vectors We use the cosinesimilarity to measure the similarity between two documentvectors For documents di and dk the corresponding

numerical vectors are denoted as V(di) and V(dk) and thelength of vector is t e similarity between di and dk iscalculated as

CSdik

1113936tl1 V di( 1113857l times V dk( 1113857l( 1113857

1113936tl1 V di( 1113857

2l

1113969

times

1113936tl1 V dk( 1113857

2l

1113969 (4)

Suppose that in category vi (we divide the u feature setsgenerated by App i in to v categories in the traffic filteringstep) there are c feature sets For each feature set we save theflow features of all HTTP flows into a text file and c text filesare obtained en we calculate the numerical vector foreach feature set via equation (3) Last we compare thenumerical vector with each other as equation (4) and gain csimilarity values (CSdii

1) for each feature set We rep-resent these similarity values of feature set di (i 1 2 c)as a vector Si

contents and Sicontents (CSdi1

CSdi2 CSdic

)e vectors Si

contents (i 1 2 c) in combination with thesimilarity values of traffic behaviors will be clustered toidentify the abnormal data points namely the repackagedmalware

54 SimilarityCalculation of TrafficBehaviors In addition tocalculating the similarity values of traffic contents thesimilarity of traffic behaviors must be calculated due toencrypted flows We use m (m 11 in our study) statisticalfeatures as listed in Table 2 to represent the behavior of anencrypted network flow e units of the flow duration andthe packet interval time are milliseconds in our experiments

With the flow statistical features for feature set di the jthencrypted flow can be represented as a vector in the featurespace

Fj

di f

1j f

2j f

mj1113872 1113873 (5)

where flj represents the value of the lth statistical feature

1le llem Figure 10 shows an example of feature represen-tation for encrypted traffic Consequently the traffic be-haviors of di can be represented as a matrix of f

pj

Figure 9 Example of converting an HTTP flow to a text file

10 Security and Communication Networks

TBntimesm di( 1113857 F1di

F2di

Fndi

1113872 1113873T (6)

where n is the number of encrypted flows that are containedin di and nge 1

Since different feature sets may differ in terms of thenumber of encrypted flows traffic behaviors TBnitimesm(di) andTBnktimesm(dk) may differ in terms of different dimensionnamely the value of ni should not be equal to that of nk Toefficiently calculate the similarity between TBnitimesm(di) andTBnktimesm(dk) we calculate the Frobenius norm of TBnitimesm(di)

and TBnktimesm(dk) e formula is presented in equation (7)where norm(middot) normalizes the value of fl

j to [0 1]

FN(d)

1113944

n

j11113944

m

l1norm fl

j1113872 111387311138681113868111386811138681113868

111386811138681113868111386811138682

11139741113972

(7)

en the distance between TBnitimesm(di) and TBnktimesm(dk) iscalculated as

dis di dk( 1113857 FN di( 1113857 minus FN dk( 11138571113868111386811138681113868

1113868111386811138681113868 (8)

Finally the similarity between TBnitimesm(di) andTBnktimesm(dk) can be calculated by

BSdik 1 minus

dis di dk( 1113857

max dis dr dq1113872 1113873 r q 1 2 u rne q1113960 1113961 (9)

In the above process ni and nk are assumed to exceed 1If ni 0 (nk 0) and nkge 1 (nige 1) BSdik

is set to 0 to reflectthe significant difference between di and dk However ifni 0 and nk 0 namely there is no encrypted flow in bothdi and dk BSdik

is set to 1 because they exhibit the samebehaviors ie they do not produce any encrypted traffic

Similarly we let BSdii 1 and represent these similarity

values as a vector Sibehavior BSdi1

BSdi2 BSdiu

1113966 1113967

55 Clustering of Similarity Values and Malware DetectionFrom the traffic content similarity CSdik

and the traffic be-havior similarity BSdik

we can determine the final similarityvalue BSdik

between the feature sets di and dk as expressed inequation (10) In this equation q is the weight value In ourexperiments we set q 05 namely the traffic contentsimilarity and the traffic behavior similarity are assigned thesame weight is is because apps typically produce almostthe same numbers of HTTP and HTTPS flows [57] We alsoevaluate various weighted values in Section 62 and theexperimental results support that 05 is the most suitablevalue For feature set di a vector Sdi

(Sdi1 Sdi2

Sdiu) is

finally obtained which contains the similarity values be-tween di and the other feature setse set of vector Sdi

1113966 1113967 i

1 2 u will be clustered for the detection of repackagedmalware

Sdik q times CSdik

+(1 minus q) times BSdik (10)

To be specific we use the density peak clustering method[58] to realize our objective Density peak clustering is basedon the following assumption cluster centers are charac-terized by a higher density than their neighbors and by arelatively large distance from points that have higher den-sities Hence density peak clustering can automaticallyidentify the correct number of clusters is is importantbecause we do not know whether there is repackagedmalware or how many types of repackaged malware thereare In other words we cannot determine the number ofclusters in advance Density peak clustering can help usovercome this challenge

We assume that most mobile devices have installed theoriginal apps as discussed in Section 51 erefore afterclustering the cluster that has the largest number of voxelswill be recognized as original and the remaining clusters willbe regarded as suspicious However suspicious apps are notnecessarily repackaged malware For example apps can berepackaged to show ads only Repackaged apps of this typeshould not be simply judged as malware Additionallyoriginal apps can be suspicious due to different user be-haviors To further reduce the frequency of the false alarmswe can check the server hostnames that are contained in therepackaged clusters using VirusTotal (the URLs are includedin the HTTP traffic contents) If all of the serversrsquo hostnamesfor a suspicious cluster Cr are judged as clean it will berecognized as a cluster of abnormal but harmless appsOtherwise it will be identified as repackaged malware Viathis approach we can accurately detect repackaged Androidmalware In this step we only use VirusTotal as a white list toexclude false alarms We do not use it as a backlist (detectionsignatures) to detect repackaged malware Malware detec-tion is still realized by clustering the similarity values

Table 2 Flow statistical featuresBytes sent by the app and bytes sent by the serverNumbers of outgoing and incoming packetsMean and std dev of the outgoing and incoming packet sizesrsquoflow durationMean and std dev of the interpacket time

lt2761 16696 16 12 173 1294 289 1177 3392 121 190gt

Figure 10 Example of the conversion of an encrypted flow to afeature vector

Security and Communication Networks 11

6 Evaluation

61 System Implementation and Dataset Figure 11illustrates our experimental setup e Android phoneswere connected to the edge server via WiFi and each phonehad installed the edge-client app which sends the flowmetainformation to the edge server using Socket (jav-anetSocket) e edge is a Dell PowerEdge R730 server(with 32 processors 16GB RAM and 12 TB hard drives)that is configured as an access point In the edge server weused tcpdump (httpwwwtcpdumporg) to collect networktraffic e similarity calculations of traffic contents andbehaviors and the clustering of similarity values wereimplemented on the Windows Azure cloud computingplatforme URL checking was conducted using Virustotal(httpswwwvirustotalcom)

In the implementation we used aria2c (httpsaria2githubio) to download apps from Androzoo (httpsandrozoounilu) aria2c was encapsulated as commandsin Python code for parallel downloading After downloadingthe apps we automatically installed them on phones usingthe adb shell command Each phone had 40 tested appsinstalled We have implemented the edge-client app basedon Android VPN service In our implementations the edge-client app correlated network flows and apps by resolvingthe ports and PIDs (Process IDs) send the newly collectedflow metainformation to the edge server every 60 secondse edge server used tcpdump to collect network trafficwhich was saved as pcap files To extract traffic features weused Splitcap (httpswwwnetreseccompageSplitCap)to restore tcp flows from pcap files en for HTTP trafficwe extracted the contents using tshark (httpswwwwiresharkorgdocsman-pagestsharkhtml) commands -zfollow tcp and ascii e packet sizes and packet intervaltimes were also obtained by tshark with commands -eframelen ndashe frametime_relative In the cloud CSdik

and BSdik

were calculated using scikit-learn (httpscikit-learnorg)and Numpy (httpsplotlynumpynorm) e code ofdensity peak clustering is available on GitHub (httpsgithubcomjasonwbwDensityPeakCluster) and we havefixed various bugs that were present in the original code

We have downloaded 200 pairs of original andrepackaged apps from Androzoo for testing However notall of these downloaded repackaged apps are maliciousSome apps are only repackaged to show extra ads To screenout Android malware from the downloaded repackagedapps we further checked the 200 repackaged apps usingVirustotal If an app was judged as safe by all the antivirustools it was labelled as repackaged_safe in our experimentsOtherwise it was labelled as repackaged_malware In total143 repackaged apps were detected as malware by VirustotalNevertheless in our experiments we evaluated the proposedclusteringmethod on all 200 repackaged apps to obtainmorerealistic performance results

e downloaded apps were run in 10 rooted Androidphones in parallel to accelerate the experimental process Forall apps we used Droidbot [59] to send input events to themand to generate network traffic automatically We ran eachrepackaged app (repackaged_safe and repackaged_malware

apps) 10 times and the corresponding original app 30 timesto generate sufficient network traffic e running times ofthe original apps exceeded those of the repackaged appsbecause we assume that most of the apps are original asdiscussed in Section 51 Consequently we obtained 40traffic sets for each repackaged-original app pair Note thatfrom the perspectives of the mobile devices and the edgeserver there are only 200 distinct apps because therepackaged app has the same name as the correspondingoriginal app erefore our experiments have simulated40lowast 200 8000 mobile devices For each repackaged-orig-inal app pair there were 30 devices with the original app and10 devices with the repackaged version

In total we have gathered almost 89GB network traffice datasets that are used in our experiments are summa-rized in Table 3 In the table the traffic dataset sizes are thesizes of the captured network traffic and the feature datasetsizes are the sizes of the extracted features In our experi-ments the HTTP traffic contents and the flow statisticalfeatures which are listed in Table 2 were both saved as textfiles As shown in Table 3 the feature datasets are signifi-cantly smaller than the traffic datasets (25G vs 63G and 9G vs26G)us the edge computing can reduce the data size andincrease the data acquisition speed for cloud computing[24] Since the feature dataset sizes are still large (25G+ 9G)we did not send all these feature data to the cloud simul-taneously Instead we sent the feature data one by oneWhen the feature data of an app were successfully receivedby the cloud its traffic similarities were calculated andclustered immediately After that the feature data of anotherapp would be sent and analyzed e experimental resultswill be presented in the following

62 ExperimentalResults In this section wemainly evaluatethe detection efficiency of Android malware by clusteringsimilarity values e performance evaluation of the pro-posed method in terms of power memory and CPU con-sumptions is elaborated in the following Section 63

Similar to previous studies [8 14] to measure theperformance of the proposed method Accuracy and F-measure metrics are utilized which are defined in equations(11) and (12) In equation (11) TP FN FP and TN aredefined in Table 4 In Table 4 repackaged_x representsrepackaged_safe or repackaged_malware e precision andrecall in equation (12) are calculated via equations (13) and(14)

Accuracy TP + TN

TP + FN + FP + TN (11)

F minus measure 2 times precision times recallprecision + recall

(12)

precision TP

TP + FP (13)

recall TP

TP + FN (14)

12 Security and Communication Networks

We have evaluated the Accuracy and F-measure for eachrepackaged app Recall that we ran each repackaged app 10times and the corresponding original app 30 timeserefore for each app the number of the original apptraffic sets is 30 and the number of the repackaged app trafficsets is 10 namely TP + FN 10 and TN+FP 30 for eachapp We have randomly chosen 20 apps from repack-age_malware and the experimental results for these apps arepresented in Figures 12 and 13 Detailed descriptions of thechosen 20 apps are listed in Table 5 In the table the mainfunction of each app is determined by analyzing the packagename from the AndroidManifestxml and from the Googlesearch results of the package name As depicted in Figure 12100 accuracy was realized on a total of 6 apps (app3 app5app9 app10 app13 and app18) e minimum accuracy is85 (app2 3440) and the average accuracy is 952 Fig-ure 13 presents the F-measure values In our experimentsthe maximum F-measure value is 1 and the minimum valueis 0737 e average F-measure value is 0888 ese resultsdemonstrate the satisfactory performance of our proposedclustering method

We have also calculated the average accuracy and F-measure values for all repackaged_safe and repack-aged_malware apps In these calculationsTP + FN 57lowast10 570 and TN+FP 57lowast 301710 for

repackaged_safe apps For repackaged_malware appsTP + FN 143lowast101430 and TN+FP 143lowast 30 4290is is because in our experiments 143 apps were identifiedas repackaged malware and 57 apps as repackaged safe byVirustotal us the parameter values differ e experi-mental results are listed in Table 6

According to Table 6 the average accuracy values forrepackaged_safe and repackaged_malware apps are bothhigh However the average F-measure for repackaged_safeapps is only 078 which is lower than that of repackagedmalware is indicates that repackaged safe apps are

Router Windows Azure

Edge server DellPowerEdge R730

10 android phones

Repackagedoriginal apps

Androzoo

Figure 11 Experimental setup

Table 3 Datasets used in the experiments

of apps of runs of HTTP flows of encrypted flows Traffic dataset sizes Feature dataset sizesOriginal apps 200 30 340015 183452 asymp63G bytes asymp25G bytesRepackaged apps 200 10 110670 93041 asymp26G bytes asymp9G bytes

Table 4 Confusion matrix

ObservedDetected

Repackaged_x app Original appRepackaged_x app TP ( of TPs) FN ( of FNs)Original app FP ( of FPs) TN ( of TNs)

Apps

0

01

02

03

04

05

06

07

08

09

1

Acc

urac

y

App

1A

pp2

App

3A

pp4

App

5A

pp6

App

7A

pp8

App

9A

pp10

App

11A

pp12

App

13A

pp14

App

15A

pp16

App

18A

pp17

App

19A

pp20

Figure 12 Detection accuracies for the 20 selected apps

Security and Communication Networks 13

difficult to distinguish from their original apps is isbecause repackaged_safe apps only generate smallamounts of additional network flows and their networktraffic features are not clearly distinguishable from thoseof normal traffic Fortunately the average F-measure ofrepackaged malware apps is outstanding is may be dueto the fact that malware typically send stolen data to orreceive commands from remote servers Hence theirnetwork traffic features are special and distinguishableFigure 14 presents an example of app clustering and il-lustrates the distinguishability of the repackaged malwareand the original app ese results demonstrate thatrepackaged Android malware can be effectively detectedby comparing their network traffic with that of originalapps

For repackaged malware apps the accuracy results withdifferent weight values (q in equation (10)) are presented inFigure 15 As depicted in the figure when q 05 we realizethe highest accuracy rate 969 If q 0 namely only thefeatures of encrypted traffic are used the accuracy result is753 e result is 617 when q 1 namely only thefeatures of plaintext traffic are considered ese resultsdemonstrate that both the encrypted and plaintext trafficcontribute to the differentiation of the original apps and therepackaged malware However the encrypted traffic con-tributes more than the plaintext traffic is may becausemalicious apps usually connect to their servers by securityprotocols to evade IDS detection

In the previous experiments we checked the serversrsquohostnames using VirusTotal to reduce the frequency of thefalse positives For various clusters that contained less than20 elements if all the serversrsquo hostnames were judged asclean the cluster was identified as an original app rather thana repackaged one As argued in Section 55 this process onlyreduces the frequency of false alarms To support this ar-gument we list the numbers of true positives and falsepositives of the 143 repackaged malware in Table 7 forcomparison When conducting the hostname checking thenumber of true positives is 1400 and the result is the samewhen the hostname checking is turned off is findingsupports that the malware is detected by clustering thesimilarity values However the number of false positivesincreased from 47 to 213 when the hostname checking wasturned off erefore we suggest checking the serversrsquohostnames after clustering to increase the practicality of themethod Note that false positives still occur when we checkthe hostnames by VirusTotal Hence there may be malwarein the original apps In our dataset the typical examples arecommobovideoplayerpro and cncfshop ele1taoxiaosanWe further analyze these two apps by reading theirdecompiled codes and we determine that they indeed es-tablish connections with suspicious servers In detail invarious runs app commobovideoplayerpro connects withhttpwwwtopappsquarecomwhich is judged as maliciousby CyRadar in VirusTotal App cncfshop ele1taoxiaosanvisits suspicious URL http955ccjSn9 Since these apps arethe original apps they are identified as false positives in theexperiments

In our experiments we chose density peak clustering asthe clustering method because density peak clustering canautomatically determine the correct number of clusters [58]Hence our method can detect various versions of malwarethat are repackaged from the same original app Unfortu-nately in the apps that were downloaded from AndroZoono malware was repackaged from the same original app Toevaluate the efficiency of detecting versions of malware wehave created two malware instances from a popular game

Apps

App

1A

pp2

App

3A

pp4

App

5A

pp6

App

7A

pp8

App

9A

pp10

App

11A

pp12

App

13A

pp14

App

15A

pp16

App

18A

pp17

App

19A

pp20

0

01

02

03

04

05

06

07

08

09

1

F-m

easu

re

Figure 13 Detection F-measure for the 20 selected apps

Table 5 Detailed descriptions of the chosen 20 apps

Main function ofapp

of HTTPflows

of encryptedflows

App1 VideoPlayer 453 51App2 Game 657 1455App3 Browser 1531 1478App4 MusicPlayer 341 25App5 Game 612 289App6 Game 788 123App7 Game 1336 524App8 Fitness 357 892App9 Email 122 431App10 Game 963 790App11 MusicPlayer 1128 547App12 Game 1598 426App13 News 2460 1178App14 News 2891 1543App15 Chat 171 49App16 Email 143 368App17 WeatherForecast 397 120App18 Game 1832 921App19 Fitness 732 362App20 OnlineLearning 2378 834

Table 6 Average accuracy and F-measure

Average accuracy () AverageF-measure

Repackaged_safe 903 078Repackaged_malware 969 094

14 Security and Communication Networks

app namely aircomaceviralmotox3 One is produced bygrafting AnserverBotrsquos source code into air-comaceviralmotox3 and the other is created by adding codefor stealing usersrsquo contacts and short messages and sendingthem to a remote server Similarly each malware is run 10times and the original app aircomaceviralmotox3 is alsorun 10 times for validatione clustering result is presentedin Figure 16 Apparently there are 3 clusters e

experimental results show that the value of F-measure foreach malware is 1 Hence our method can detect variousmalware that were repackaged from the same original app

63 Comparison In this section we compare our methodwith the typical methods that are based on network or cloudas illustrated in Figure 5 e comparative indicators are thedetection accuracy F-measure average power CPU andmemory consumptions of the mobile device First wecompare the proposed method with AppFA [14] which isimplemented completely at the network level Since AppFAalso must analyze app traffic we directly use the traffic sets ofthe repackaged malware apps for comparison While testingAppFA we deleted the label information that was containedin the traffic sets and mixed the network flows as a wholeen we used the signature matching and the constrainedK-means clustering algorithm proposed in [14] to restoreapp sessions namely the label for each flow e similarapps used in AppFA were chosen in Google Play via themethod introduced in [14] and for each repackaged mal-ware 10 similar apps were selected as the peer group epeer group apps were run once by Droidbot to generatenetwork traffic e comparison results are listed in Table 8

Since AppFA need not to install some special apps onmobile devices the extra power memory and CPU con-sumptions are 0 However the detection accuracy is muchlower (863 vs 969) than that of the proposedmethod ascompared in Table 8 In the proposed method we use theedge-client app to record flow metainformation thusnetwork flows can be accurately labeled However inAppFA flow labeling is realized via traffic clustering and theclustering accuracy cannot be 100 Meanwhile AppFAcompares the appsrsquo network behaviors with those of theirsimilar apps which also results in errors In the proposedmethod we directly compare the repackaged malware withits original app and the detection accuracy is increased eaverage power CPU and memory consumptions of theproposed method are also low and acceptable

In Table 8 the power memory and CPU consumptionsfor our edge-client app were obtained with the help of appGT (httpsgithubcomTencentGT) GT is a portable toolthat runs on smartphones for debugging the APP internalparameters and the code time-consuming statistics eaverage values for the power memory and CPU con-sumptions are calculated as follows Before running an appwith Droidbot we started the edge-client app by GT Whenthe Droidbot was stopped we recoded the power CPU andmemory consumptions We repeated the process until allapps (including 143 repackaged malware apps and theiroriginal versions) had been tested Via this way we obtainedthe average power CPU and memory consumptions for theedge-client app According to Table 8 the edge-client app islightweight and has minimal impact on the performances ofthe mobile devices

en we compare the proposed method with SAMA-Droid [44] SAMADroid monitors Android system calls andsends the calls to a centralized server for the detection ofAndroid malware SAMADroid is a typical cloud-based on-

Decision graph rho-delta-rho

20 251510 35 403050Sorted rho

00

02

04

06

08

Rho lowast

delt

a

Figure 14 Example of app clustering Obviously two clusters areidentified e dot in the top left corner is an outlier e outlier isdetected as an abnormal but harmless app by our method isfigure was generated by the density peak clustering tool

01 02 03 04 05 06 07 08 09 10Value of q

0

01

02

03

04

05

06

07

08

09

1

Acc

urac

y

Figure 15 Accuracies of various weight values for traffic contentsimilarity and traffic behavior similarity

Table 7 True positives and false positives with and withouthostname checking

of true positives of false positivesWith hostname checking 1400 47Without hostnamechecking 1400 213

Security and Communication Networks 15

device detection method We have implemented the dy-namic detection according to the description in [44] Indetail the APIs that are invoked by the running apps aremonitored by the tool Sensitive API Monitor (httpsgithubcomcyruliuSensitive-API-Monitor) As done in [44] thefrequency values of 10 system calls (open ioctl brk readwrite close sendto sendmsg recvfrom and recvmsg) arerecorded ese recorded frequency values are sent toWindows Azure for malware detection We evaluate thedynamic detection performance of SAMADroid on our 143repackaged_malware apps and their original versions In theevaluation the frequency values of the original apps are usedto create the normality model and the frequency values ofthe repackaged malware apps are used for testing We alsouse the app GT to record the power memory and CPUconsumptions of SAMADroid Note that our phones wereall rooted and Sensitive_API_Monitor and GT can be usede experimental results are presented in Table 9

Last we compare our method with Shabtairsquos method [8]Shabtairsquos method detects Android malware at the mobiledevices and it considers the apprsquos network traffic patternsonly the HTTP traffic contents are not considered isgives us an opportunity to compare our method withmethods that have been specially designed for encryptedtraffic We have implemented Shabtairsquos method and chosenthe feature subset 2 as the detection features Feature subset2 includes Avg Sent Bytes Avg Rcvd Bytes and Pct ofAvg Rcvd Bytes among other values and it is proved to be

the best feature subset for malware detection [8] e De-cisionRegression tree is chosen as the classification algo-rithm e testing dataset is the traffic that was generated bythe 143 repackaged malware instances and the training set isthe traffic that was generated by the corresponding originalapps e classification results are listed in Table 10According to the table the detection accuracy and F-mea-sure values of our method are significantly higher than thoseof Shabtairsquos method is indicates that the HTTP trafficprovides useful information for Android malware detectionSince Shabtairsquos method is conducted at the mobile devices itconsumes more resources than our method as compared inTable 10

7 Discussion

e results of our extensive experiments demonstrate thatrepackaged Android malware can be efficiently and effec-tively detected by clustering network traffic at the edgedevices However there are still limitations for our methodFirst we assume the most of the devices have installed theoriginal apps (rlt (12)u as described in Section 51) is isreasonable in most cases since mobile users typicallydownload apps from the official markets However in areaswhere official markets such as the Google play and Amazonare banned it is possible that the most mobile devices haveinstalled the repackaged apps is may cause false positivesSecond if the malware does not generate sufficient trafficour method cannot detect it For example some malwareonly sends fraudulent premium SMS messages is type ofmalware cannot be detected by monitoring its networktraffic ird we rely on third-party tools such as Virustotalfor the identification of malicious URLs and to reduce thenumber of false alarms If a malicious app rents legalplatforms such as the public cloud as the remote controlserver our method will judge it as a normal app Hence falsenegatives will be generated

To overcome these limitations we could combine theproposed method with available methods such as SAMA-Droid In such cases the edge-client app will not only recordflow metainformation but also monitor app system be-haviors However the system behavior monitoring does notneed to be conducted all the time If abnormal but harmlessor lower traffic apps are detected the edge-client app will beinformed to record the corresponding appsrsquo system be-haviors Similarly the edge-client app can send the appsystem behaviors to the edge server for further processinge performance of this combined method will be evaluatedin our future work One can also combine the proposedframework with AppFA [14] for the detection of other typesof Android malware For example the edge extracts apptraffic behavior features and the cloud can compare appsrsquobehaviors with their peer groups to identify abnormalitiesFurthermore one could first use clustering techniques [60]to group repackaged apps according to the same maliciousmodules and then analyze the behaviors of the representativerepackaged apps in each group to obtain more accuratebehavior features

y

2D nonclassical multidimensional scaling

00ndash02 02ndash04x

ndash04

ndash03

ndash02

ndash01

00

01

02

03

04

Figure 16 Clustering results of various malware that wererepackaged from aircomaceviralmotox3e outlier in the middleis detected as an abnormal but harmless app is figure wasgenerated by the density peak clustering tool

Table 8 Comparison of the proposed method with AppFA

Proposed method AppFAAccuracy 969 863F-measure 094 075Average power consumption 12 0Average CPU consumption 06 0Average memory consumption 654MB 0

16 Security and Communication Networks

e paradigms of edge computing are used to detectAndroid malware in this study However several challengeswould be encountered when deploying it in practice Firstthe mobile devices must install a specified app for thecollection of essential information which may cause con-cerns and mobile users may refuse to cooperate Second inedge computing the data should be preprocessed by the edgeto better protect the usersrsquo privacy However the edge servercan observe all the data (including the sensitive informa-tion) and it threatens the privacy of users For example ifthe edge is deployed by the cloud provider the usersrsquo privacyinformation is also available to the cloud ere is no dif-ference with the user-cloud model for privacy protectionunless the edge server is deployed by the userird the edgeserver should be well protected from network attacksOtherwise it will be convenient for hackers to steal usersrsquoprivate information

8 Conclusion

In this paper we propose a novel method for detectingAndroid malware based on edge computing To the best ofour knowledge the detection of Android malware by uti-lizing edge computing and traffic clustering has not beenconsidered previously By introducing the edge layer themain tasks of the mobile device are offloaded to the edgeserver and the mobile device only needs to record flowmetainformation which substantially conserves the re-sources of the mobile device e edge server extracts appsrsquonetwork traffic features and sends these features to the cloudplatform for further analysis us the usersrsquo privacy can bebetter protected With the received features the cloud cal-culates the similarities between apps and clusters thesesimilarity values to separate the original apps and themalware automatically We evaluated our methods on 400Android apps and the experimental results demonstratethat the detection accuracy can reach 100 for several appsand the average accuracy is 969 We also tested theperformance of the proposed method and compared it with

the typical approaches e comparison results show thatour method can detect repackaged Android malware withboth higher detection accuracy and lower power and re-source consumptions In the future we will continue to workto overcome the discussed limitations and to evaluate theproposed method in real world scenarios

Data Availability

e downloaded apps used to support the findings of thisstudy have been deposited in httpspanbaiducoms1b8vMceIdtdd1gBRC5Gbzvw Its extraction code is vd5re clustering code is also in the repository Since appnetwork traffic contains usersrsquo sensitive information itcannot be publicly shared However we have explained allthe steps to generate network traffic automatically esesteps are also described in the instructionspdf in therepository

Conflicts of Interest

e authors declare that they have no conflicts of interest

Acknowledgments

is work was supported by National Key TechnologiesResearch and Development Program of China under Grant2018YFD0401404 National Natural Science Foundation ofChina under Grants 61702282 61802192 and 71801123Natural Science Foundation of the Jiangsu Higher EducationInstitutions of China under Grants 18KJB520024 and17KJB520023 Nanjing Forestry University under GrantsGXL016 and CX2016026 and NUPTSF under GrantNY217143

References

[1] Statista Mobile Operating Systemsrsquo Market Share Worldwidefrom January 2012 to December 2019 Statista HamburgGermany 2009 httpswwwstatistacomstatistics272698global-market-share-held-by-mobile-operating-systems-since-2009

[2] W Martin F Sarro Y Jia Y Zhang and M Harman ldquoAsurvey of app store analysis for software engineeringrdquo IEEETransactions on Software Engineering vol 43 no 9pp 817ndash847 2017

[3] httpswwwappbraincomstatsnumber-of-android-apps2020

[4] L Li D Li T F Bissyande et al ldquoUnderstanding android apppiggybacking a systematic study of malicious code graftingrdquoIEEE Transactions on Information Forensics and Securityvol 12 no 6 pp 1269ndash1284 2017

[5] M Fan J Liu X Luo et al ldquoAndroid malware familialclassification and representative sample selection via frequentsubgraph analysisrdquo IEEE Transactions on Information Fo-rensics and Security vol 13 no 8 pp 1890ndash1905 2018

[6] J Zhang Z Qin K Zhang H Yin and J Zou ldquoDalvik opcodegraph based android malware variants detection using globaltopology featuresrdquo IEEE Access vol 6 pp 51964ndash51974 2018

[7] S Garg S K Peddoju and A K Sarje ldquoNetwork-baseddetection of android malicious appsrdquo International Journal ofInformation Security vol 16 no 4 pp 385ndash400 2016

Table 9 Comparison of the proposed method with SAMADroid

Proposed method SAMADroidAccuracy 969 831F-measure 0940 0793Average power consumption 12 23Average CPU consumption 06 08Average memory consumption 654MB 773MB

Table 10 Comparison of the proposed method with Shabtairsquosmethod

Proposed method Shabtairsquosmethod

Accuracy 969 771F-measure 0940 0695Average power consumption 12 77Average CPU consumption 06 23Average memory consumption 654MB 937MB

Security and Communication Networks 17

[8] A Shabtai L Tenenboim-Chekina D Mimran L RokachB Shapira and Y Elovici ldquoMobile malware detection throughanalysis of deviations in application network behaviorrdquoComputers amp Security vol 43 pp 1ndash18 2014

[9] L Xie X Zhang J-P Seifert and S Zhu ldquopBMDS a be-havior-based malware detection system for cellphone de-vicesrdquo in Proceedings of the ird ACM Conference onWireless Network Security ACM New York NY USApp 37ndash48 2010

[10] I Burguera U Zurutuza and S Nadjm-Tehrani ldquoCrowdroidbehavior-based malware detection system for androidrdquo inProceedings of the 1st ACMWorkshop on Security and Privacyin Smartphones and Mobile Devices ACM Chicago IL USApp 15ndash26 October 2011

[11] G Dini F Martinelli A Saracino and D SgandurraldquoMADAM a multi-level anomaly detector for android mal-warerdquo in Proceedings of the International Conference onMathematical Methods Models and Architectures for Com-puter Network Security vol 12 Springer St PetersburgRussia pp 240ndash253 2012

[12] Y Zhang M Yang B Xu et al ldquoVetting undesirable be-haviors in android apps with permission use analysisrdquo inProceedings of the 2013 ACM SIGSAC Conference on Com-puter amp Communications Security ACM Berlin Germanypp 611ndash622 November 2013

[13] SWang Z Chen X Li LWang K Ji and C Zhao ldquoAndroidmalware clustering analysis on network-level behaviorrdquo inProceedings of the International Conference on IntelligentComputing Springer Liverpool UK pp 796ndash807 August2017

[14] G He B Xu and H Zhu ldquoAppFA a novel approach to detectmalicious android applications on the networkrdquo in Securityand Communication Networks Wiley Hoboken NJ USA2018

[15] X Wang Y Yang and Y Zeng ldquoAccurate mobile malwaredetection and classification in the cloudrdquo SpringerPlus vol 4no 1 p 583 2015

[16] K Tam A Feizollah N B Anuar R Salleh and L Cavallaroldquoe evolution of android malware and android analysistechniquesrdquo ACM Computing Surveys (CSUR) vol 49 no 4p 76 2017

[17] M Sun X Li J C S Lui R T B Ma and Z Liang ldquoMonet auser-oriented behavior-based malware variants detectionsystem for androidrdquo IEEE Transactions on Information Fo-rensics and Security vol 12 no 5 pp 1103ndash1112 2017

[18] T Vidas and N Christin ldquoEvading android runtime analysisvia sandbox detectionrdquo in Proceedings of the 9th ACMSymposium on Information Computer and CommunicationsSecurity pp 447ndash458 Kyoto Japan June 2014

[19] Y Duan M Zhang A V Bhaskar et al ldquoings you may notknow about android (Un) packers a systematic study basedon whole-system emulationrdquo in Proceedings of the 2018Network and Distributed System Security Symposium pp 1ndash15 San Diego CA USA February 2018

[20] L Xue X Luo L Yu S Wang and D Wu ldquoAdaptiveunpacking of android appsrdquo in Proceedings of the 2017 IEEEACM 39th International Conference on Software Engineering(ICSE) IEEE Buenos Aires Argentina pp 358ndash369 May2017

[21] Y Zhou and X Jiang ldquoDissecting android malware char-acterization and evolutionrdquo in Proceedings of the 2012 IEEESymposium on Security and Privacy IEEE San Francisco CAUSA pp 95ndash109 2012

[22] N W Lo S K Lu and Y H Chuang ldquoA framework for thirdparty android marketplaces to identify repackaged appsrdquo inProceedings of the 2016 IEEE 14th International Conference onDependable Autonomic and Secure Computing 14th Inter-national Conference on Pervasive Intelligence and Computing2nd International Conference on Big Data Intelligence andComputing and Cyber Science and Technology Congresspp 475ndash482 Auckland New Zealand August 2016

[23] L Li J Gao M Hurier et al ldquoAndrozoo++ collectingmillions of android apps and their metadata for the researchcommunityrdquo 2017 httpsarxivorgabs170905281

[24] W Shi J Cao Q Zhang Y Li and L Xu ldquoEdge computingvision and challengesrdquo IEEE Internet of ings Journal vol 3no 5 pp 637ndash646 2016

[25] W Zhou Y Zhou X Jiang and P Ning ldquoDetectingrepackaged smart-phone applications in third-party androidmarketplacesrdquo in Proceedings of the Second ACM Conferenceon Data and Application Security and Privacy ACM AntonioTX USA pp 317ndash326 February 2012

[26] Y Shao X Luo C Qian P Zhu and L Zhang ldquoTowards ascalable resource-driven approach for detecting repackagedandroid applicationsrdquo in Proceedings of the 30th AnnualComputer Security Applications Conference ACM NewOrleans LA USA pp 56ndash65 December 2014

[27] K Chen P Wang Y Lee et al ldquoFinding unknown malice in10 seconds mass vetting for new threats at the google-playscalerdquo in Proceedings of the USENIX Security Symposiumvol 15 Washington DC USA August 2015

[28] Z Wang C Li Y Guan and Y Xue ldquoDroidchain a novelmalware detection method for android based on behaviorchainrdquo in Proceedings of the 2015 IEEE Conference onCommunications and Network Security (CNS) IEEE Flor-ence Italy pp 727-728 September 2015

[29] K Tian D D Yao B G Ryder G Tan and G PengldquoDetection of repackaged android malware with code-het-erogeneity featuresrdquo IEEE Transactions on Dependable andSecure Computing vol 17 no 1 pp 64ndash77 2017

[30] A De Lorenzo F Martinelli E Medvet F Mercaldo andA Santone ldquoVisualizing the outcome of dynamic analysis ofandroid malware with vizmalrdquo Journal of Information Se-curity and Applications vol 50 Article ID 102423 2020

[31] M Lin D Zhang X Su and T Yu ldquoEffective and scalablerepackaged application detection based on user interfacerdquo inProceedings of the 2017 IEEE Conference on ComputerCommunications Workshops (INFOCOM WKSHPS) IEEEAtlanta GA USA May 2017

[32] G Meng M Patrick Y Xue Y Liu and J Zhang ldquoSecuringandroid app markets via modelling and predicting malwarespread between marketsrdquo IEEE Transactions on InformationForensics and Security vol 14 no 7 pp 1944ndash1959 2018

[33] A Arora and S K Peddoju ldquoMinimizing network trafficfeatures for android mobile malware detectionrdquo in Proceed-ings of the 18th International Conference on DistributedComputing and Networking ACM Hyderabad India January2017

[34] X Wu D Zhang X Su and W Li ldquoDetect repackagedAndroid application based on HTTP traffic similarity httptraffic similarityrdquo Security and Communication Networksvol 8 no 13 pp 2257ndash2266 2015

[35] J Malik and R Kaushal ldquoCredroid android malware de-tection by network traffic analysisrdquo in Proceedings of the 1stACM Workshop on Privacy-Aware Mobile Computing ACMPaderborn Germany pp 28ndash36 July 2016

18 Security and Communication Networks

[36] A Zulkifli I R A Hamid W M Shah and Z AbdullahldquoAndroid malware detection based on network traffic usingdecision tree algorithmrdquo in Proceedings of the InternationalConference on Soft Computing and Data Mining SpringerSenai Malaysia pp 485ndash494 January 2018

[37] Z Chen Q Yan H Han et al ldquoMachine learning basedmobile malware detection using highly imbalanced networktrafficrdquo Information Sciences vol 433-434 pp 346ndash364 2018

[38] G He B Xu L Zhang and H Zhu ldquoMobile app identifi-cation for encrypted network flows by traffic correlationrdquoInternational Journal of Distributed Sensor Networks vol 14no 12 pp 1ndash17 2018

[39] L Xue Y Zhou T Chen X Luo and G Gu ldquoMalton to-wards on-device non-invasive mobile malware analysis forARTrdquo in Proceedings of the 26th USENIX Security Symposiumpp 289ndash306 Vancouver Canada August 2017

[40] M K Alzaylaee S Y Yerima and S Sezer ldquoDL-Droid deeplearning based android malware detection using real devicesrdquoComputers amp Security vol 89 Article ID 101663 2020

[41] A S Shamili C Bauckhage and T Alpcan ldquoMalware de-tection on mobile devices using distributed machine learn-ingrdquo in Proceedings of the 20th International Conference onPattern Recognition (ICPR) IEEE Istanbul Turkeypp 4348ndash4351 August 2010

[42] M Zhao T Zhang F Ge and Z Yuan ldquoRobotdroid alightweight malware detection framework on smartphonesrdquoJournal of Networks vol 7 no 4 p 715 2012

[43] K A Talha D I Alper and C Aydin ldquoAPK auditor per-mission-based android malware detection systemrdquo DigitalInvestigation vol 13 pp 1ndash14 2015

[44] S Arshad M A Shah A Wahid A Mehmood H Song andH Yu ldquoSamadroid a novel 3-level hybrid malware detectionmodel for android operating systemrdquo IEEE Access vol 6pp 4321ndash4339 2018

[45] G He L Zhang B Xu and H Zhu ldquoDetecting repackagedandroid malware based on mobile edge computingrdquo inProceedings of the 2018 Sixth International Conference onAdvanced Cloud and Big Data (CBD) IEEE Lanzhou Chinapp 360ndash365 August 2018

[46] R Roman J Lopez andMMambo ldquoMobile edge computingFog et al a survey and analysis of security threats andchallengesrdquo Future Generation Computer Systems vol 78pp 680ndash698 2018

[47] F Bonomi R Milito J Zhu and S Addepalli ldquoFog com-puting and its role in the internet of thingsrdquo in Proceedings ofthe First Edition of the MCC Workshop on Mobile CloudComputing ACM Helsinki Finland pp 13ndash16 August 2012

[48] M T Beck MWerner S Feld and S Schimper ldquoMobile edgecomputing a taxonomyrdquo in Proceedings of the Sixth Inter-national Conference on Advances in Future Internet pp 48ndash55 Lisbon Portugal November 2014

[49] P Mach and Z Becvar ldquoMobile edge computing a survey onarchitecture and computation offloadingrdquo IEEE Communi-cations Surveys amp Tutorials vol 19 no 3 pp 1628ndash1656 2017

[50] H T Dinh C Lee D Niyato and P Wang ldquoA survey ofmobile cloud computing architecture applications and ap-proachesrdquo Wireless Communications and Mobile Computingvol 13 no 18 pp 1587ndash1611 2013

[51] X Lyu H Tian L Jiang et al ldquoSelective offloading in mobileedge computing for the green internet of thingsrdquo IEEENetwork vol 32 no 1 pp 54ndash60 2018

[52] S Agrawal and J Agrawal ldquoSurvey on anomaly detectionusing data mining techniquesrdquo Procedia Computer Sciencevol 60 pp 708ndash713 2015

[53] A Tongaonkar ldquoA look at the mobile app identificationlandscaperdquo IEEE Internet Computing vol 20 no 4 pp 9ndash152016

[54] G He B Xu and H Zhu ldquoIdentifying mobile applications forencrypted network trafficrdquo in Proceedings of the 2017 FifthInternational Conference on Advanced Cloud and Big Data(CBD) IEEE Shanghai China pp 279ndash284 August 2017

[55] G Ranjan A Tongaonkar and R Torres ldquoApproximatematching of persistent lexicon using search-engines forclassifying mobile app trafficrdquo in Proceedings of the 35thAnnual IEEE International Conference on Computer Com-munications IIEEE San Francisco CA USA April 2016

[56] L Li and S Qu ldquoShort text classification based on improvedITCrdquo Journal of Computer and Communications vol 1 no 4pp 22ndash27 2013

[57] A H Lashkari A F A Kadir L Taheri and A A GhorbanildquoToward developing a systematic approach to generatebenchmark android malware datasets and classificationrdquo inProceedings of the 2018 International Carnahan Conference onSecurity Technology (ICCST) IEEE Montreal Canada Oc-tober 2018

[58] A Rodriguez and A Laio ldquoClustering by fast search and findof density peaksrdquo Science vol 344 no 6191 pp 1492ndash14962014

[59] Y Li Z Yang Y Guo and X Chen ldquoDroidbot a lightweightUI-guided test input generator for androidrdquo in Proceedings ofthe Software Engineering Companion (ICSE-C) pp 23ndash26Buenos Aires Argentina May 2017

[60] M Fan X Luo J Liu et al ldquoGraph embedding based familialanalysis of android malware using unsupervised learningrdquo inProceedings of the 2019 IEEEACM 41st International Con-ference on Software Engineering (ICSE) IEEE Piscataway NJUSA pp 771ndash782 May 2019

Security and Communication Networks 19

Page 11: On-DeviceDetectionofRepackagedAndroidMalwarevia …downloads.hindawi.com/journals/scn/2020/8630748.pdf · 2020-05-31 · According to [4–6], most Android malware programs are repackagedapps,andthemajorityofnewAndroidmalware

TBntimesm di( 1113857 F1di

F2di

Fndi

1113872 1113873T (6)

where n is the number of encrypted flows that are containedin di and nge 1

Since different feature sets may differ in terms of thenumber of encrypted flows traffic behaviors TBnitimesm(di) andTBnktimesm(dk) may differ in terms of different dimensionnamely the value of ni should not be equal to that of nk Toefficiently calculate the similarity between TBnitimesm(di) andTBnktimesm(dk) we calculate the Frobenius norm of TBnitimesm(di)

and TBnktimesm(dk) e formula is presented in equation (7)where norm(middot) normalizes the value of fl

j to [0 1]

FN(d)

1113944

n

j11113944

m

l1norm fl

j1113872 111387311138681113868111386811138681113868

111386811138681113868111386811138682

11139741113972

(7)

en the distance between TBnitimesm(di) and TBnktimesm(dk) iscalculated as

dis di dk( 1113857 FN di( 1113857 minus FN dk( 11138571113868111386811138681113868

1113868111386811138681113868 (8)

Finally the similarity between TBnitimesm(di) andTBnktimesm(dk) can be calculated by

BSdik 1 minus

dis di dk( 1113857

max dis dr dq1113872 1113873 r q 1 2 u rne q1113960 1113961 (9)

In the above process ni and nk are assumed to exceed 1If ni 0 (nk 0) and nkge 1 (nige 1) BSdik

is set to 0 to reflectthe significant difference between di and dk However ifni 0 and nk 0 namely there is no encrypted flow in bothdi and dk BSdik

is set to 1 because they exhibit the samebehaviors ie they do not produce any encrypted traffic

Similarly we let BSdii 1 and represent these similarity

values as a vector Sibehavior BSdi1

BSdi2 BSdiu

1113966 1113967

55 Clustering of Similarity Values and Malware DetectionFrom the traffic content similarity CSdik

and the traffic be-havior similarity BSdik

we can determine the final similarityvalue BSdik

between the feature sets di and dk as expressed inequation (10) In this equation q is the weight value In ourexperiments we set q 05 namely the traffic contentsimilarity and the traffic behavior similarity are assigned thesame weight is is because apps typically produce almostthe same numbers of HTTP and HTTPS flows [57] We alsoevaluate various weighted values in Section 62 and theexperimental results support that 05 is the most suitablevalue For feature set di a vector Sdi

(Sdi1 Sdi2

Sdiu) is

finally obtained which contains the similarity values be-tween di and the other feature setse set of vector Sdi

1113966 1113967 i

1 2 u will be clustered for the detection of repackagedmalware

Sdik q times CSdik

+(1 minus q) times BSdik (10)

To be specific we use the density peak clustering method[58] to realize our objective Density peak clustering is basedon the following assumption cluster centers are charac-terized by a higher density than their neighbors and by arelatively large distance from points that have higher den-sities Hence density peak clustering can automaticallyidentify the correct number of clusters is is importantbecause we do not know whether there is repackagedmalware or how many types of repackaged malware thereare In other words we cannot determine the number ofclusters in advance Density peak clustering can help usovercome this challenge

We assume that most mobile devices have installed theoriginal apps as discussed in Section 51 erefore afterclustering the cluster that has the largest number of voxelswill be recognized as original and the remaining clusters willbe regarded as suspicious However suspicious apps are notnecessarily repackaged malware For example apps can berepackaged to show ads only Repackaged apps of this typeshould not be simply judged as malware Additionallyoriginal apps can be suspicious due to different user be-haviors To further reduce the frequency of the false alarmswe can check the server hostnames that are contained in therepackaged clusters using VirusTotal (the URLs are includedin the HTTP traffic contents) If all of the serversrsquo hostnamesfor a suspicious cluster Cr are judged as clean it will berecognized as a cluster of abnormal but harmless appsOtherwise it will be identified as repackaged malware Viathis approach we can accurately detect repackaged Androidmalware In this step we only use VirusTotal as a white list toexclude false alarms We do not use it as a backlist (detectionsignatures) to detect repackaged malware Malware detec-tion is still realized by clustering the similarity values

Table 2 Flow statistical featuresBytes sent by the app and bytes sent by the serverNumbers of outgoing and incoming packetsMean and std dev of the outgoing and incoming packet sizesrsquoflow durationMean and std dev of the interpacket time

lt2761 16696 16 12 173 1294 289 1177 3392 121 190gt

Figure 10 Example of the conversion of an encrypted flow to afeature vector

Security and Communication Networks 11

6 Evaluation

61 System Implementation and Dataset Figure 11illustrates our experimental setup e Android phoneswere connected to the edge server via WiFi and each phonehad installed the edge-client app which sends the flowmetainformation to the edge server using Socket (jav-anetSocket) e edge is a Dell PowerEdge R730 server(with 32 processors 16GB RAM and 12 TB hard drives)that is configured as an access point In the edge server weused tcpdump (httpwwwtcpdumporg) to collect networktraffic e similarity calculations of traffic contents andbehaviors and the clustering of similarity values wereimplemented on the Windows Azure cloud computingplatforme URL checking was conducted using Virustotal(httpswwwvirustotalcom)

In the implementation we used aria2c (httpsaria2githubio) to download apps from Androzoo (httpsandrozoounilu) aria2c was encapsulated as commandsin Python code for parallel downloading After downloadingthe apps we automatically installed them on phones usingthe adb shell command Each phone had 40 tested appsinstalled We have implemented the edge-client app basedon Android VPN service In our implementations the edge-client app correlated network flows and apps by resolvingthe ports and PIDs (Process IDs) send the newly collectedflow metainformation to the edge server every 60 secondse edge server used tcpdump to collect network trafficwhich was saved as pcap files To extract traffic features weused Splitcap (httpswwwnetreseccompageSplitCap)to restore tcp flows from pcap files en for HTTP trafficwe extracted the contents using tshark (httpswwwwiresharkorgdocsman-pagestsharkhtml) commands -zfollow tcp and ascii e packet sizes and packet intervaltimes were also obtained by tshark with commands -eframelen ndashe frametime_relative In the cloud CSdik

and BSdik

were calculated using scikit-learn (httpscikit-learnorg)and Numpy (httpsplotlynumpynorm) e code ofdensity peak clustering is available on GitHub (httpsgithubcomjasonwbwDensityPeakCluster) and we havefixed various bugs that were present in the original code

We have downloaded 200 pairs of original andrepackaged apps from Androzoo for testing However notall of these downloaded repackaged apps are maliciousSome apps are only repackaged to show extra ads To screenout Android malware from the downloaded repackagedapps we further checked the 200 repackaged apps usingVirustotal If an app was judged as safe by all the antivirustools it was labelled as repackaged_safe in our experimentsOtherwise it was labelled as repackaged_malware In total143 repackaged apps were detected as malware by VirustotalNevertheless in our experiments we evaluated the proposedclusteringmethod on all 200 repackaged apps to obtainmorerealistic performance results

e downloaded apps were run in 10 rooted Androidphones in parallel to accelerate the experimental process Forall apps we used Droidbot [59] to send input events to themand to generate network traffic automatically We ran eachrepackaged app (repackaged_safe and repackaged_malware

apps) 10 times and the corresponding original app 30 timesto generate sufficient network traffic e running times ofthe original apps exceeded those of the repackaged appsbecause we assume that most of the apps are original asdiscussed in Section 51 Consequently we obtained 40traffic sets for each repackaged-original app pair Note thatfrom the perspectives of the mobile devices and the edgeserver there are only 200 distinct apps because therepackaged app has the same name as the correspondingoriginal app erefore our experiments have simulated40lowast 200 8000 mobile devices For each repackaged-orig-inal app pair there were 30 devices with the original app and10 devices with the repackaged version

In total we have gathered almost 89GB network traffice datasets that are used in our experiments are summa-rized in Table 3 In the table the traffic dataset sizes are thesizes of the captured network traffic and the feature datasetsizes are the sizes of the extracted features In our experi-ments the HTTP traffic contents and the flow statisticalfeatures which are listed in Table 2 were both saved as textfiles As shown in Table 3 the feature datasets are signifi-cantly smaller than the traffic datasets (25G vs 63G and 9G vs26G)us the edge computing can reduce the data size andincrease the data acquisition speed for cloud computing[24] Since the feature dataset sizes are still large (25G+ 9G)we did not send all these feature data to the cloud simul-taneously Instead we sent the feature data one by oneWhen the feature data of an app were successfully receivedby the cloud its traffic similarities were calculated andclustered immediately After that the feature data of anotherapp would be sent and analyzed e experimental resultswill be presented in the following

62 ExperimentalResults In this section wemainly evaluatethe detection efficiency of Android malware by clusteringsimilarity values e performance evaluation of the pro-posed method in terms of power memory and CPU con-sumptions is elaborated in the following Section 63

Similar to previous studies [8 14] to measure theperformance of the proposed method Accuracy and F-measure metrics are utilized which are defined in equations(11) and (12) In equation (11) TP FN FP and TN aredefined in Table 4 In Table 4 repackaged_x representsrepackaged_safe or repackaged_malware e precision andrecall in equation (12) are calculated via equations (13) and(14)

Accuracy TP + TN

TP + FN + FP + TN (11)

F minus measure 2 times precision times recallprecision + recall

(12)

precision TP

TP + FP (13)

recall TP

TP + FN (14)

12 Security and Communication Networks

We have evaluated the Accuracy and F-measure for eachrepackaged app Recall that we ran each repackaged app 10times and the corresponding original app 30 timeserefore for each app the number of the original apptraffic sets is 30 and the number of the repackaged app trafficsets is 10 namely TP + FN 10 and TN+FP 30 for eachapp We have randomly chosen 20 apps from repack-age_malware and the experimental results for these apps arepresented in Figures 12 and 13 Detailed descriptions of thechosen 20 apps are listed in Table 5 In the table the mainfunction of each app is determined by analyzing the packagename from the AndroidManifestxml and from the Googlesearch results of the package name As depicted in Figure 12100 accuracy was realized on a total of 6 apps (app3 app5app9 app10 app13 and app18) e minimum accuracy is85 (app2 3440) and the average accuracy is 952 Fig-ure 13 presents the F-measure values In our experimentsthe maximum F-measure value is 1 and the minimum valueis 0737 e average F-measure value is 0888 ese resultsdemonstrate the satisfactory performance of our proposedclustering method

We have also calculated the average accuracy and F-measure values for all repackaged_safe and repack-aged_malware apps In these calculationsTP + FN 57lowast10 570 and TN+FP 57lowast 301710 for

repackaged_safe apps For repackaged_malware appsTP + FN 143lowast101430 and TN+FP 143lowast 30 4290is is because in our experiments 143 apps were identifiedas repackaged malware and 57 apps as repackaged safe byVirustotal us the parameter values differ e experi-mental results are listed in Table 6

According to Table 6 the average accuracy values forrepackaged_safe and repackaged_malware apps are bothhigh However the average F-measure for repackaged_safeapps is only 078 which is lower than that of repackagedmalware is indicates that repackaged safe apps are

Router Windows Azure

Edge server DellPowerEdge R730

10 android phones

Repackagedoriginal apps

Androzoo

Figure 11 Experimental setup

Table 3 Datasets used in the experiments

of apps of runs of HTTP flows of encrypted flows Traffic dataset sizes Feature dataset sizesOriginal apps 200 30 340015 183452 asymp63G bytes asymp25G bytesRepackaged apps 200 10 110670 93041 asymp26G bytes asymp9G bytes

Table 4 Confusion matrix

ObservedDetected

Repackaged_x app Original appRepackaged_x app TP ( of TPs) FN ( of FNs)Original app FP ( of FPs) TN ( of TNs)

Apps

0

01

02

03

04

05

06

07

08

09

1

Acc

urac

y

App

1A

pp2

App

3A

pp4

App

5A

pp6

App

7A

pp8

App

9A

pp10

App

11A

pp12

App

13A

pp14

App

15A

pp16

App

18A

pp17

App

19A

pp20

Figure 12 Detection accuracies for the 20 selected apps

Security and Communication Networks 13

difficult to distinguish from their original apps is isbecause repackaged_safe apps only generate smallamounts of additional network flows and their networktraffic features are not clearly distinguishable from thoseof normal traffic Fortunately the average F-measure ofrepackaged malware apps is outstanding is may be dueto the fact that malware typically send stolen data to orreceive commands from remote servers Hence theirnetwork traffic features are special and distinguishableFigure 14 presents an example of app clustering and il-lustrates the distinguishability of the repackaged malwareand the original app ese results demonstrate thatrepackaged Android malware can be effectively detectedby comparing their network traffic with that of originalapps

For repackaged malware apps the accuracy results withdifferent weight values (q in equation (10)) are presented inFigure 15 As depicted in the figure when q 05 we realizethe highest accuracy rate 969 If q 0 namely only thefeatures of encrypted traffic are used the accuracy result is753 e result is 617 when q 1 namely only thefeatures of plaintext traffic are considered ese resultsdemonstrate that both the encrypted and plaintext trafficcontribute to the differentiation of the original apps and therepackaged malware However the encrypted traffic con-tributes more than the plaintext traffic is may becausemalicious apps usually connect to their servers by securityprotocols to evade IDS detection

In the previous experiments we checked the serversrsquohostnames using VirusTotal to reduce the frequency of thefalse positives For various clusters that contained less than20 elements if all the serversrsquo hostnames were judged asclean the cluster was identified as an original app rather thana repackaged one As argued in Section 55 this process onlyreduces the frequency of false alarms To support this ar-gument we list the numbers of true positives and falsepositives of the 143 repackaged malware in Table 7 forcomparison When conducting the hostname checking thenumber of true positives is 1400 and the result is the samewhen the hostname checking is turned off is findingsupports that the malware is detected by clustering thesimilarity values However the number of false positivesincreased from 47 to 213 when the hostname checking wasturned off erefore we suggest checking the serversrsquohostnames after clustering to increase the practicality of themethod Note that false positives still occur when we checkthe hostnames by VirusTotal Hence there may be malwarein the original apps In our dataset the typical examples arecommobovideoplayerpro and cncfshop ele1taoxiaosanWe further analyze these two apps by reading theirdecompiled codes and we determine that they indeed es-tablish connections with suspicious servers In detail invarious runs app commobovideoplayerpro connects withhttpwwwtopappsquarecomwhich is judged as maliciousby CyRadar in VirusTotal App cncfshop ele1taoxiaosanvisits suspicious URL http955ccjSn9 Since these apps arethe original apps they are identified as false positives in theexperiments

In our experiments we chose density peak clustering asthe clustering method because density peak clustering canautomatically determine the correct number of clusters [58]Hence our method can detect various versions of malwarethat are repackaged from the same original app Unfortu-nately in the apps that were downloaded from AndroZoono malware was repackaged from the same original app Toevaluate the efficiency of detecting versions of malware wehave created two malware instances from a popular game

Apps

App

1A

pp2

App

3A

pp4

App

5A

pp6

App

7A

pp8

App

9A

pp10

App

11A

pp12

App

13A

pp14

App

15A

pp16

App

18A

pp17

App

19A

pp20

0

01

02

03

04

05

06

07

08

09

1

F-m

easu

re

Figure 13 Detection F-measure for the 20 selected apps

Table 5 Detailed descriptions of the chosen 20 apps

Main function ofapp

of HTTPflows

of encryptedflows

App1 VideoPlayer 453 51App2 Game 657 1455App3 Browser 1531 1478App4 MusicPlayer 341 25App5 Game 612 289App6 Game 788 123App7 Game 1336 524App8 Fitness 357 892App9 Email 122 431App10 Game 963 790App11 MusicPlayer 1128 547App12 Game 1598 426App13 News 2460 1178App14 News 2891 1543App15 Chat 171 49App16 Email 143 368App17 WeatherForecast 397 120App18 Game 1832 921App19 Fitness 732 362App20 OnlineLearning 2378 834

Table 6 Average accuracy and F-measure

Average accuracy () AverageF-measure

Repackaged_safe 903 078Repackaged_malware 969 094

14 Security and Communication Networks

app namely aircomaceviralmotox3 One is produced bygrafting AnserverBotrsquos source code into air-comaceviralmotox3 and the other is created by adding codefor stealing usersrsquo contacts and short messages and sendingthem to a remote server Similarly each malware is run 10times and the original app aircomaceviralmotox3 is alsorun 10 times for validatione clustering result is presentedin Figure 16 Apparently there are 3 clusters e

experimental results show that the value of F-measure foreach malware is 1 Hence our method can detect variousmalware that were repackaged from the same original app

63 Comparison In this section we compare our methodwith the typical methods that are based on network or cloudas illustrated in Figure 5 e comparative indicators are thedetection accuracy F-measure average power CPU andmemory consumptions of the mobile device First wecompare the proposed method with AppFA [14] which isimplemented completely at the network level Since AppFAalso must analyze app traffic we directly use the traffic sets ofthe repackaged malware apps for comparison While testingAppFA we deleted the label information that was containedin the traffic sets and mixed the network flows as a wholeen we used the signature matching and the constrainedK-means clustering algorithm proposed in [14] to restoreapp sessions namely the label for each flow e similarapps used in AppFA were chosen in Google Play via themethod introduced in [14] and for each repackaged mal-ware 10 similar apps were selected as the peer group epeer group apps were run once by Droidbot to generatenetwork traffic e comparison results are listed in Table 8

Since AppFA need not to install some special apps onmobile devices the extra power memory and CPU con-sumptions are 0 However the detection accuracy is muchlower (863 vs 969) than that of the proposedmethod ascompared in Table 8 In the proposed method we use theedge-client app to record flow metainformation thusnetwork flows can be accurately labeled However inAppFA flow labeling is realized via traffic clustering and theclustering accuracy cannot be 100 Meanwhile AppFAcompares the appsrsquo network behaviors with those of theirsimilar apps which also results in errors In the proposedmethod we directly compare the repackaged malware withits original app and the detection accuracy is increased eaverage power CPU and memory consumptions of theproposed method are also low and acceptable

In Table 8 the power memory and CPU consumptionsfor our edge-client app were obtained with the help of appGT (httpsgithubcomTencentGT) GT is a portable toolthat runs on smartphones for debugging the APP internalparameters and the code time-consuming statistics eaverage values for the power memory and CPU con-sumptions are calculated as follows Before running an appwith Droidbot we started the edge-client app by GT Whenthe Droidbot was stopped we recoded the power CPU andmemory consumptions We repeated the process until allapps (including 143 repackaged malware apps and theiroriginal versions) had been tested Via this way we obtainedthe average power CPU and memory consumptions for theedge-client app According to Table 8 the edge-client app islightweight and has minimal impact on the performances ofthe mobile devices

en we compare the proposed method with SAMA-Droid [44] SAMADroid monitors Android system calls andsends the calls to a centralized server for the detection ofAndroid malware SAMADroid is a typical cloud-based on-

Decision graph rho-delta-rho

20 251510 35 403050Sorted rho

00

02

04

06

08

Rho lowast

delt

a

Figure 14 Example of app clustering Obviously two clusters areidentified e dot in the top left corner is an outlier e outlier isdetected as an abnormal but harmless app by our method isfigure was generated by the density peak clustering tool

01 02 03 04 05 06 07 08 09 10Value of q

0

01

02

03

04

05

06

07

08

09

1

Acc

urac

y

Figure 15 Accuracies of various weight values for traffic contentsimilarity and traffic behavior similarity

Table 7 True positives and false positives with and withouthostname checking

of true positives of false positivesWith hostname checking 1400 47Without hostnamechecking 1400 213

Security and Communication Networks 15

device detection method We have implemented the dy-namic detection according to the description in [44] Indetail the APIs that are invoked by the running apps aremonitored by the tool Sensitive API Monitor (httpsgithubcomcyruliuSensitive-API-Monitor) As done in [44] thefrequency values of 10 system calls (open ioctl brk readwrite close sendto sendmsg recvfrom and recvmsg) arerecorded ese recorded frequency values are sent toWindows Azure for malware detection We evaluate thedynamic detection performance of SAMADroid on our 143repackaged_malware apps and their original versions In theevaluation the frequency values of the original apps are usedto create the normality model and the frequency values ofthe repackaged malware apps are used for testing We alsouse the app GT to record the power memory and CPUconsumptions of SAMADroid Note that our phones wereall rooted and Sensitive_API_Monitor and GT can be usede experimental results are presented in Table 9

Last we compare our method with Shabtairsquos method [8]Shabtairsquos method detects Android malware at the mobiledevices and it considers the apprsquos network traffic patternsonly the HTTP traffic contents are not considered isgives us an opportunity to compare our method withmethods that have been specially designed for encryptedtraffic We have implemented Shabtairsquos method and chosenthe feature subset 2 as the detection features Feature subset2 includes Avg Sent Bytes Avg Rcvd Bytes and Pct ofAvg Rcvd Bytes among other values and it is proved to be

the best feature subset for malware detection [8] e De-cisionRegression tree is chosen as the classification algo-rithm e testing dataset is the traffic that was generated bythe 143 repackaged malware instances and the training set isthe traffic that was generated by the corresponding originalapps e classification results are listed in Table 10According to the table the detection accuracy and F-mea-sure values of our method are significantly higher than thoseof Shabtairsquos method is indicates that the HTTP trafficprovides useful information for Android malware detectionSince Shabtairsquos method is conducted at the mobile devices itconsumes more resources than our method as compared inTable 10

7 Discussion

e results of our extensive experiments demonstrate thatrepackaged Android malware can be efficiently and effec-tively detected by clustering network traffic at the edgedevices However there are still limitations for our methodFirst we assume the most of the devices have installed theoriginal apps (rlt (12)u as described in Section 51) is isreasonable in most cases since mobile users typicallydownload apps from the official markets However in areaswhere official markets such as the Google play and Amazonare banned it is possible that the most mobile devices haveinstalled the repackaged apps is may cause false positivesSecond if the malware does not generate sufficient trafficour method cannot detect it For example some malwareonly sends fraudulent premium SMS messages is type ofmalware cannot be detected by monitoring its networktraffic ird we rely on third-party tools such as Virustotalfor the identification of malicious URLs and to reduce thenumber of false alarms If a malicious app rents legalplatforms such as the public cloud as the remote controlserver our method will judge it as a normal app Hence falsenegatives will be generated

To overcome these limitations we could combine theproposed method with available methods such as SAMA-Droid In such cases the edge-client app will not only recordflow metainformation but also monitor app system be-haviors However the system behavior monitoring does notneed to be conducted all the time If abnormal but harmlessor lower traffic apps are detected the edge-client app will beinformed to record the corresponding appsrsquo system be-haviors Similarly the edge-client app can send the appsystem behaviors to the edge server for further processinge performance of this combined method will be evaluatedin our future work One can also combine the proposedframework with AppFA [14] for the detection of other typesof Android malware For example the edge extracts apptraffic behavior features and the cloud can compare appsrsquobehaviors with their peer groups to identify abnormalitiesFurthermore one could first use clustering techniques [60]to group repackaged apps according to the same maliciousmodules and then analyze the behaviors of the representativerepackaged apps in each group to obtain more accuratebehavior features

y

2D nonclassical multidimensional scaling

00ndash02 02ndash04x

ndash04

ndash03

ndash02

ndash01

00

01

02

03

04

Figure 16 Clustering results of various malware that wererepackaged from aircomaceviralmotox3e outlier in the middleis detected as an abnormal but harmless app is figure wasgenerated by the density peak clustering tool

Table 8 Comparison of the proposed method with AppFA

Proposed method AppFAAccuracy 969 863F-measure 094 075Average power consumption 12 0Average CPU consumption 06 0Average memory consumption 654MB 0

16 Security and Communication Networks

e paradigms of edge computing are used to detectAndroid malware in this study However several challengeswould be encountered when deploying it in practice Firstthe mobile devices must install a specified app for thecollection of essential information which may cause con-cerns and mobile users may refuse to cooperate Second inedge computing the data should be preprocessed by the edgeto better protect the usersrsquo privacy However the edge servercan observe all the data (including the sensitive informa-tion) and it threatens the privacy of users For example ifthe edge is deployed by the cloud provider the usersrsquo privacyinformation is also available to the cloud ere is no dif-ference with the user-cloud model for privacy protectionunless the edge server is deployed by the userird the edgeserver should be well protected from network attacksOtherwise it will be convenient for hackers to steal usersrsquoprivate information

8 Conclusion

In this paper we propose a novel method for detectingAndroid malware based on edge computing To the best ofour knowledge the detection of Android malware by uti-lizing edge computing and traffic clustering has not beenconsidered previously By introducing the edge layer themain tasks of the mobile device are offloaded to the edgeserver and the mobile device only needs to record flowmetainformation which substantially conserves the re-sources of the mobile device e edge server extracts appsrsquonetwork traffic features and sends these features to the cloudplatform for further analysis us the usersrsquo privacy can bebetter protected With the received features the cloud cal-culates the similarities between apps and clusters thesesimilarity values to separate the original apps and themalware automatically We evaluated our methods on 400Android apps and the experimental results demonstratethat the detection accuracy can reach 100 for several appsand the average accuracy is 969 We also tested theperformance of the proposed method and compared it with

the typical approaches e comparison results show thatour method can detect repackaged Android malware withboth higher detection accuracy and lower power and re-source consumptions In the future we will continue to workto overcome the discussed limitations and to evaluate theproposed method in real world scenarios

Data Availability

e downloaded apps used to support the findings of thisstudy have been deposited in httpspanbaiducoms1b8vMceIdtdd1gBRC5Gbzvw Its extraction code is vd5re clustering code is also in the repository Since appnetwork traffic contains usersrsquo sensitive information itcannot be publicly shared However we have explained allthe steps to generate network traffic automatically esesteps are also described in the instructionspdf in therepository

Conflicts of Interest

e authors declare that they have no conflicts of interest

Acknowledgments

is work was supported by National Key TechnologiesResearch and Development Program of China under Grant2018YFD0401404 National Natural Science Foundation ofChina under Grants 61702282 61802192 and 71801123Natural Science Foundation of the Jiangsu Higher EducationInstitutions of China under Grants 18KJB520024 and17KJB520023 Nanjing Forestry University under GrantsGXL016 and CX2016026 and NUPTSF under GrantNY217143

References

[1] Statista Mobile Operating Systemsrsquo Market Share Worldwidefrom January 2012 to December 2019 Statista HamburgGermany 2009 httpswwwstatistacomstatistics272698global-market-share-held-by-mobile-operating-systems-since-2009

[2] W Martin F Sarro Y Jia Y Zhang and M Harman ldquoAsurvey of app store analysis for software engineeringrdquo IEEETransactions on Software Engineering vol 43 no 9pp 817ndash847 2017

[3] httpswwwappbraincomstatsnumber-of-android-apps2020

[4] L Li D Li T F Bissyande et al ldquoUnderstanding android apppiggybacking a systematic study of malicious code graftingrdquoIEEE Transactions on Information Forensics and Securityvol 12 no 6 pp 1269ndash1284 2017

[5] M Fan J Liu X Luo et al ldquoAndroid malware familialclassification and representative sample selection via frequentsubgraph analysisrdquo IEEE Transactions on Information Fo-rensics and Security vol 13 no 8 pp 1890ndash1905 2018

[6] J Zhang Z Qin K Zhang H Yin and J Zou ldquoDalvik opcodegraph based android malware variants detection using globaltopology featuresrdquo IEEE Access vol 6 pp 51964ndash51974 2018

[7] S Garg S K Peddoju and A K Sarje ldquoNetwork-baseddetection of android malicious appsrdquo International Journal ofInformation Security vol 16 no 4 pp 385ndash400 2016

Table 9 Comparison of the proposed method with SAMADroid

Proposed method SAMADroidAccuracy 969 831F-measure 0940 0793Average power consumption 12 23Average CPU consumption 06 08Average memory consumption 654MB 773MB

Table 10 Comparison of the proposed method with Shabtairsquosmethod

Proposed method Shabtairsquosmethod

Accuracy 969 771F-measure 0940 0695Average power consumption 12 77Average CPU consumption 06 23Average memory consumption 654MB 937MB

Security and Communication Networks 17

[8] A Shabtai L Tenenboim-Chekina D Mimran L RokachB Shapira and Y Elovici ldquoMobile malware detection throughanalysis of deviations in application network behaviorrdquoComputers amp Security vol 43 pp 1ndash18 2014

[9] L Xie X Zhang J-P Seifert and S Zhu ldquopBMDS a be-havior-based malware detection system for cellphone de-vicesrdquo in Proceedings of the ird ACM Conference onWireless Network Security ACM New York NY USApp 37ndash48 2010

[10] I Burguera U Zurutuza and S Nadjm-Tehrani ldquoCrowdroidbehavior-based malware detection system for androidrdquo inProceedings of the 1st ACMWorkshop on Security and Privacyin Smartphones and Mobile Devices ACM Chicago IL USApp 15ndash26 October 2011

[11] G Dini F Martinelli A Saracino and D SgandurraldquoMADAM a multi-level anomaly detector for android mal-warerdquo in Proceedings of the International Conference onMathematical Methods Models and Architectures for Com-puter Network Security vol 12 Springer St PetersburgRussia pp 240ndash253 2012

[12] Y Zhang M Yang B Xu et al ldquoVetting undesirable be-haviors in android apps with permission use analysisrdquo inProceedings of the 2013 ACM SIGSAC Conference on Com-puter amp Communications Security ACM Berlin Germanypp 611ndash622 November 2013

[13] SWang Z Chen X Li LWang K Ji and C Zhao ldquoAndroidmalware clustering analysis on network-level behaviorrdquo inProceedings of the International Conference on IntelligentComputing Springer Liverpool UK pp 796ndash807 August2017

[14] G He B Xu and H Zhu ldquoAppFA a novel approach to detectmalicious android applications on the networkrdquo in Securityand Communication Networks Wiley Hoboken NJ USA2018

[15] X Wang Y Yang and Y Zeng ldquoAccurate mobile malwaredetection and classification in the cloudrdquo SpringerPlus vol 4no 1 p 583 2015

[16] K Tam A Feizollah N B Anuar R Salleh and L Cavallaroldquoe evolution of android malware and android analysistechniquesrdquo ACM Computing Surveys (CSUR) vol 49 no 4p 76 2017

[17] M Sun X Li J C S Lui R T B Ma and Z Liang ldquoMonet auser-oriented behavior-based malware variants detectionsystem for androidrdquo IEEE Transactions on Information Fo-rensics and Security vol 12 no 5 pp 1103ndash1112 2017

[18] T Vidas and N Christin ldquoEvading android runtime analysisvia sandbox detectionrdquo in Proceedings of the 9th ACMSymposium on Information Computer and CommunicationsSecurity pp 447ndash458 Kyoto Japan June 2014

[19] Y Duan M Zhang A V Bhaskar et al ldquoings you may notknow about android (Un) packers a systematic study basedon whole-system emulationrdquo in Proceedings of the 2018Network and Distributed System Security Symposium pp 1ndash15 San Diego CA USA February 2018

[20] L Xue X Luo L Yu S Wang and D Wu ldquoAdaptiveunpacking of android appsrdquo in Proceedings of the 2017 IEEEACM 39th International Conference on Software Engineering(ICSE) IEEE Buenos Aires Argentina pp 358ndash369 May2017

[21] Y Zhou and X Jiang ldquoDissecting android malware char-acterization and evolutionrdquo in Proceedings of the 2012 IEEESymposium on Security and Privacy IEEE San Francisco CAUSA pp 95ndash109 2012

[22] N W Lo S K Lu and Y H Chuang ldquoA framework for thirdparty android marketplaces to identify repackaged appsrdquo inProceedings of the 2016 IEEE 14th International Conference onDependable Autonomic and Secure Computing 14th Inter-national Conference on Pervasive Intelligence and Computing2nd International Conference on Big Data Intelligence andComputing and Cyber Science and Technology Congresspp 475ndash482 Auckland New Zealand August 2016

[23] L Li J Gao M Hurier et al ldquoAndrozoo++ collectingmillions of android apps and their metadata for the researchcommunityrdquo 2017 httpsarxivorgabs170905281

[24] W Shi J Cao Q Zhang Y Li and L Xu ldquoEdge computingvision and challengesrdquo IEEE Internet of ings Journal vol 3no 5 pp 637ndash646 2016

[25] W Zhou Y Zhou X Jiang and P Ning ldquoDetectingrepackaged smart-phone applications in third-party androidmarketplacesrdquo in Proceedings of the Second ACM Conferenceon Data and Application Security and Privacy ACM AntonioTX USA pp 317ndash326 February 2012

[26] Y Shao X Luo C Qian P Zhu and L Zhang ldquoTowards ascalable resource-driven approach for detecting repackagedandroid applicationsrdquo in Proceedings of the 30th AnnualComputer Security Applications Conference ACM NewOrleans LA USA pp 56ndash65 December 2014

[27] K Chen P Wang Y Lee et al ldquoFinding unknown malice in10 seconds mass vetting for new threats at the google-playscalerdquo in Proceedings of the USENIX Security Symposiumvol 15 Washington DC USA August 2015

[28] Z Wang C Li Y Guan and Y Xue ldquoDroidchain a novelmalware detection method for android based on behaviorchainrdquo in Proceedings of the 2015 IEEE Conference onCommunications and Network Security (CNS) IEEE Flor-ence Italy pp 727-728 September 2015

[29] K Tian D D Yao B G Ryder G Tan and G PengldquoDetection of repackaged android malware with code-het-erogeneity featuresrdquo IEEE Transactions on Dependable andSecure Computing vol 17 no 1 pp 64ndash77 2017

[30] A De Lorenzo F Martinelli E Medvet F Mercaldo andA Santone ldquoVisualizing the outcome of dynamic analysis ofandroid malware with vizmalrdquo Journal of Information Se-curity and Applications vol 50 Article ID 102423 2020

[31] M Lin D Zhang X Su and T Yu ldquoEffective and scalablerepackaged application detection based on user interfacerdquo inProceedings of the 2017 IEEE Conference on ComputerCommunications Workshops (INFOCOM WKSHPS) IEEEAtlanta GA USA May 2017

[32] G Meng M Patrick Y Xue Y Liu and J Zhang ldquoSecuringandroid app markets via modelling and predicting malwarespread between marketsrdquo IEEE Transactions on InformationForensics and Security vol 14 no 7 pp 1944ndash1959 2018

[33] A Arora and S K Peddoju ldquoMinimizing network trafficfeatures for android mobile malware detectionrdquo in Proceed-ings of the 18th International Conference on DistributedComputing and Networking ACM Hyderabad India January2017

[34] X Wu D Zhang X Su and W Li ldquoDetect repackagedAndroid application based on HTTP traffic similarity httptraffic similarityrdquo Security and Communication Networksvol 8 no 13 pp 2257ndash2266 2015

[35] J Malik and R Kaushal ldquoCredroid android malware de-tection by network traffic analysisrdquo in Proceedings of the 1stACM Workshop on Privacy-Aware Mobile Computing ACMPaderborn Germany pp 28ndash36 July 2016

18 Security and Communication Networks

[36] A Zulkifli I R A Hamid W M Shah and Z AbdullahldquoAndroid malware detection based on network traffic usingdecision tree algorithmrdquo in Proceedings of the InternationalConference on Soft Computing and Data Mining SpringerSenai Malaysia pp 485ndash494 January 2018

[37] Z Chen Q Yan H Han et al ldquoMachine learning basedmobile malware detection using highly imbalanced networktrafficrdquo Information Sciences vol 433-434 pp 346ndash364 2018

[38] G He B Xu L Zhang and H Zhu ldquoMobile app identifi-cation for encrypted network flows by traffic correlationrdquoInternational Journal of Distributed Sensor Networks vol 14no 12 pp 1ndash17 2018

[39] L Xue Y Zhou T Chen X Luo and G Gu ldquoMalton to-wards on-device non-invasive mobile malware analysis forARTrdquo in Proceedings of the 26th USENIX Security Symposiumpp 289ndash306 Vancouver Canada August 2017

[40] M K Alzaylaee S Y Yerima and S Sezer ldquoDL-Droid deeplearning based android malware detection using real devicesrdquoComputers amp Security vol 89 Article ID 101663 2020

[41] A S Shamili C Bauckhage and T Alpcan ldquoMalware de-tection on mobile devices using distributed machine learn-ingrdquo in Proceedings of the 20th International Conference onPattern Recognition (ICPR) IEEE Istanbul Turkeypp 4348ndash4351 August 2010

[42] M Zhao T Zhang F Ge and Z Yuan ldquoRobotdroid alightweight malware detection framework on smartphonesrdquoJournal of Networks vol 7 no 4 p 715 2012

[43] K A Talha D I Alper and C Aydin ldquoAPK auditor per-mission-based android malware detection systemrdquo DigitalInvestigation vol 13 pp 1ndash14 2015

[44] S Arshad M A Shah A Wahid A Mehmood H Song andH Yu ldquoSamadroid a novel 3-level hybrid malware detectionmodel for android operating systemrdquo IEEE Access vol 6pp 4321ndash4339 2018

[45] G He L Zhang B Xu and H Zhu ldquoDetecting repackagedandroid malware based on mobile edge computingrdquo inProceedings of the 2018 Sixth International Conference onAdvanced Cloud and Big Data (CBD) IEEE Lanzhou Chinapp 360ndash365 August 2018

[46] R Roman J Lopez andMMambo ldquoMobile edge computingFog et al a survey and analysis of security threats andchallengesrdquo Future Generation Computer Systems vol 78pp 680ndash698 2018

[47] F Bonomi R Milito J Zhu and S Addepalli ldquoFog com-puting and its role in the internet of thingsrdquo in Proceedings ofthe First Edition of the MCC Workshop on Mobile CloudComputing ACM Helsinki Finland pp 13ndash16 August 2012

[48] M T Beck MWerner S Feld and S Schimper ldquoMobile edgecomputing a taxonomyrdquo in Proceedings of the Sixth Inter-national Conference on Advances in Future Internet pp 48ndash55 Lisbon Portugal November 2014

[49] P Mach and Z Becvar ldquoMobile edge computing a survey onarchitecture and computation offloadingrdquo IEEE Communi-cations Surveys amp Tutorials vol 19 no 3 pp 1628ndash1656 2017

[50] H T Dinh C Lee D Niyato and P Wang ldquoA survey ofmobile cloud computing architecture applications and ap-proachesrdquo Wireless Communications and Mobile Computingvol 13 no 18 pp 1587ndash1611 2013

[51] X Lyu H Tian L Jiang et al ldquoSelective offloading in mobileedge computing for the green internet of thingsrdquo IEEENetwork vol 32 no 1 pp 54ndash60 2018

[52] S Agrawal and J Agrawal ldquoSurvey on anomaly detectionusing data mining techniquesrdquo Procedia Computer Sciencevol 60 pp 708ndash713 2015

[53] A Tongaonkar ldquoA look at the mobile app identificationlandscaperdquo IEEE Internet Computing vol 20 no 4 pp 9ndash152016

[54] G He B Xu and H Zhu ldquoIdentifying mobile applications forencrypted network trafficrdquo in Proceedings of the 2017 FifthInternational Conference on Advanced Cloud and Big Data(CBD) IEEE Shanghai China pp 279ndash284 August 2017

[55] G Ranjan A Tongaonkar and R Torres ldquoApproximatematching of persistent lexicon using search-engines forclassifying mobile app trafficrdquo in Proceedings of the 35thAnnual IEEE International Conference on Computer Com-munications IIEEE San Francisco CA USA April 2016

[56] L Li and S Qu ldquoShort text classification based on improvedITCrdquo Journal of Computer and Communications vol 1 no 4pp 22ndash27 2013

[57] A H Lashkari A F A Kadir L Taheri and A A GhorbanildquoToward developing a systematic approach to generatebenchmark android malware datasets and classificationrdquo inProceedings of the 2018 International Carnahan Conference onSecurity Technology (ICCST) IEEE Montreal Canada Oc-tober 2018

[58] A Rodriguez and A Laio ldquoClustering by fast search and findof density peaksrdquo Science vol 344 no 6191 pp 1492ndash14962014

[59] Y Li Z Yang Y Guo and X Chen ldquoDroidbot a lightweightUI-guided test input generator for androidrdquo in Proceedings ofthe Software Engineering Companion (ICSE-C) pp 23ndash26Buenos Aires Argentina May 2017

[60] M Fan X Luo J Liu et al ldquoGraph embedding based familialanalysis of android malware using unsupervised learningrdquo inProceedings of the 2019 IEEEACM 41st International Con-ference on Software Engineering (ICSE) IEEE Piscataway NJUSA pp 771ndash782 May 2019

Security and Communication Networks 19

Page 12: On-DeviceDetectionofRepackagedAndroidMalwarevia …downloads.hindawi.com/journals/scn/2020/8630748.pdf · 2020-05-31 · According to [4–6], most Android malware programs are repackagedapps,andthemajorityofnewAndroidmalware

6 Evaluation

61 System Implementation and Dataset Figure 11illustrates our experimental setup e Android phoneswere connected to the edge server via WiFi and each phonehad installed the edge-client app which sends the flowmetainformation to the edge server using Socket (jav-anetSocket) e edge is a Dell PowerEdge R730 server(with 32 processors 16GB RAM and 12 TB hard drives)that is configured as an access point In the edge server weused tcpdump (httpwwwtcpdumporg) to collect networktraffic e similarity calculations of traffic contents andbehaviors and the clustering of similarity values wereimplemented on the Windows Azure cloud computingplatforme URL checking was conducted using Virustotal(httpswwwvirustotalcom)

In the implementation we used aria2c (httpsaria2githubio) to download apps from Androzoo (httpsandrozoounilu) aria2c was encapsulated as commandsin Python code for parallel downloading After downloadingthe apps we automatically installed them on phones usingthe adb shell command Each phone had 40 tested appsinstalled We have implemented the edge-client app basedon Android VPN service In our implementations the edge-client app correlated network flows and apps by resolvingthe ports and PIDs (Process IDs) send the newly collectedflow metainformation to the edge server every 60 secondse edge server used tcpdump to collect network trafficwhich was saved as pcap files To extract traffic features weused Splitcap (httpswwwnetreseccompageSplitCap)to restore tcp flows from pcap files en for HTTP trafficwe extracted the contents using tshark (httpswwwwiresharkorgdocsman-pagestsharkhtml) commands -zfollow tcp and ascii e packet sizes and packet intervaltimes were also obtained by tshark with commands -eframelen ndashe frametime_relative In the cloud CSdik

and BSdik

were calculated using scikit-learn (httpscikit-learnorg)and Numpy (httpsplotlynumpynorm) e code ofdensity peak clustering is available on GitHub (httpsgithubcomjasonwbwDensityPeakCluster) and we havefixed various bugs that were present in the original code

We have downloaded 200 pairs of original andrepackaged apps from Androzoo for testing However notall of these downloaded repackaged apps are maliciousSome apps are only repackaged to show extra ads To screenout Android malware from the downloaded repackagedapps we further checked the 200 repackaged apps usingVirustotal If an app was judged as safe by all the antivirustools it was labelled as repackaged_safe in our experimentsOtherwise it was labelled as repackaged_malware In total143 repackaged apps were detected as malware by VirustotalNevertheless in our experiments we evaluated the proposedclusteringmethod on all 200 repackaged apps to obtainmorerealistic performance results

e downloaded apps were run in 10 rooted Androidphones in parallel to accelerate the experimental process Forall apps we used Droidbot [59] to send input events to themand to generate network traffic automatically We ran eachrepackaged app (repackaged_safe and repackaged_malware

apps) 10 times and the corresponding original app 30 timesto generate sufficient network traffic e running times ofthe original apps exceeded those of the repackaged appsbecause we assume that most of the apps are original asdiscussed in Section 51 Consequently we obtained 40traffic sets for each repackaged-original app pair Note thatfrom the perspectives of the mobile devices and the edgeserver there are only 200 distinct apps because therepackaged app has the same name as the correspondingoriginal app erefore our experiments have simulated40lowast 200 8000 mobile devices For each repackaged-orig-inal app pair there were 30 devices with the original app and10 devices with the repackaged version

In total we have gathered almost 89GB network traffice datasets that are used in our experiments are summa-rized in Table 3 In the table the traffic dataset sizes are thesizes of the captured network traffic and the feature datasetsizes are the sizes of the extracted features In our experi-ments the HTTP traffic contents and the flow statisticalfeatures which are listed in Table 2 were both saved as textfiles As shown in Table 3 the feature datasets are signifi-cantly smaller than the traffic datasets (25G vs 63G and 9G vs26G)us the edge computing can reduce the data size andincrease the data acquisition speed for cloud computing[24] Since the feature dataset sizes are still large (25G+ 9G)we did not send all these feature data to the cloud simul-taneously Instead we sent the feature data one by oneWhen the feature data of an app were successfully receivedby the cloud its traffic similarities were calculated andclustered immediately After that the feature data of anotherapp would be sent and analyzed e experimental resultswill be presented in the following

62 ExperimentalResults In this section wemainly evaluatethe detection efficiency of Android malware by clusteringsimilarity values e performance evaluation of the pro-posed method in terms of power memory and CPU con-sumptions is elaborated in the following Section 63

Similar to previous studies [8 14] to measure theperformance of the proposed method Accuracy and F-measure metrics are utilized which are defined in equations(11) and (12) In equation (11) TP FN FP and TN aredefined in Table 4 In Table 4 repackaged_x representsrepackaged_safe or repackaged_malware e precision andrecall in equation (12) are calculated via equations (13) and(14)

Accuracy TP + TN

TP + FN + FP + TN (11)

F minus measure 2 times precision times recallprecision + recall

(12)

precision TP

TP + FP (13)

recall TP

TP + FN (14)

12 Security and Communication Networks

We have evaluated the Accuracy and F-measure for eachrepackaged app Recall that we ran each repackaged app 10times and the corresponding original app 30 timeserefore for each app the number of the original apptraffic sets is 30 and the number of the repackaged app trafficsets is 10 namely TP + FN 10 and TN+FP 30 for eachapp We have randomly chosen 20 apps from repack-age_malware and the experimental results for these apps arepresented in Figures 12 and 13 Detailed descriptions of thechosen 20 apps are listed in Table 5 In the table the mainfunction of each app is determined by analyzing the packagename from the AndroidManifestxml and from the Googlesearch results of the package name As depicted in Figure 12100 accuracy was realized on a total of 6 apps (app3 app5app9 app10 app13 and app18) e minimum accuracy is85 (app2 3440) and the average accuracy is 952 Fig-ure 13 presents the F-measure values In our experimentsthe maximum F-measure value is 1 and the minimum valueis 0737 e average F-measure value is 0888 ese resultsdemonstrate the satisfactory performance of our proposedclustering method

We have also calculated the average accuracy and F-measure values for all repackaged_safe and repack-aged_malware apps In these calculationsTP + FN 57lowast10 570 and TN+FP 57lowast 301710 for

repackaged_safe apps For repackaged_malware appsTP + FN 143lowast101430 and TN+FP 143lowast 30 4290is is because in our experiments 143 apps were identifiedas repackaged malware and 57 apps as repackaged safe byVirustotal us the parameter values differ e experi-mental results are listed in Table 6

According to Table 6 the average accuracy values forrepackaged_safe and repackaged_malware apps are bothhigh However the average F-measure for repackaged_safeapps is only 078 which is lower than that of repackagedmalware is indicates that repackaged safe apps are

Router Windows Azure

Edge server DellPowerEdge R730

10 android phones

Repackagedoriginal apps

Androzoo

Figure 11 Experimental setup

Table 3 Datasets used in the experiments

of apps of runs of HTTP flows of encrypted flows Traffic dataset sizes Feature dataset sizesOriginal apps 200 30 340015 183452 asymp63G bytes asymp25G bytesRepackaged apps 200 10 110670 93041 asymp26G bytes asymp9G bytes

Table 4 Confusion matrix

ObservedDetected

Repackaged_x app Original appRepackaged_x app TP ( of TPs) FN ( of FNs)Original app FP ( of FPs) TN ( of TNs)

Apps

0

01

02

03

04

05

06

07

08

09

1

Acc

urac

y

App

1A

pp2

App

3A

pp4

App

5A

pp6

App

7A

pp8

App

9A

pp10

App

11A

pp12

App

13A

pp14

App

15A

pp16

App

18A

pp17

App

19A

pp20

Figure 12 Detection accuracies for the 20 selected apps

Security and Communication Networks 13

difficult to distinguish from their original apps is isbecause repackaged_safe apps only generate smallamounts of additional network flows and their networktraffic features are not clearly distinguishable from thoseof normal traffic Fortunately the average F-measure ofrepackaged malware apps is outstanding is may be dueto the fact that malware typically send stolen data to orreceive commands from remote servers Hence theirnetwork traffic features are special and distinguishableFigure 14 presents an example of app clustering and il-lustrates the distinguishability of the repackaged malwareand the original app ese results demonstrate thatrepackaged Android malware can be effectively detectedby comparing their network traffic with that of originalapps

For repackaged malware apps the accuracy results withdifferent weight values (q in equation (10)) are presented inFigure 15 As depicted in the figure when q 05 we realizethe highest accuracy rate 969 If q 0 namely only thefeatures of encrypted traffic are used the accuracy result is753 e result is 617 when q 1 namely only thefeatures of plaintext traffic are considered ese resultsdemonstrate that both the encrypted and plaintext trafficcontribute to the differentiation of the original apps and therepackaged malware However the encrypted traffic con-tributes more than the plaintext traffic is may becausemalicious apps usually connect to their servers by securityprotocols to evade IDS detection

In the previous experiments we checked the serversrsquohostnames using VirusTotal to reduce the frequency of thefalse positives For various clusters that contained less than20 elements if all the serversrsquo hostnames were judged asclean the cluster was identified as an original app rather thana repackaged one As argued in Section 55 this process onlyreduces the frequency of false alarms To support this ar-gument we list the numbers of true positives and falsepositives of the 143 repackaged malware in Table 7 forcomparison When conducting the hostname checking thenumber of true positives is 1400 and the result is the samewhen the hostname checking is turned off is findingsupports that the malware is detected by clustering thesimilarity values However the number of false positivesincreased from 47 to 213 when the hostname checking wasturned off erefore we suggest checking the serversrsquohostnames after clustering to increase the practicality of themethod Note that false positives still occur when we checkthe hostnames by VirusTotal Hence there may be malwarein the original apps In our dataset the typical examples arecommobovideoplayerpro and cncfshop ele1taoxiaosanWe further analyze these two apps by reading theirdecompiled codes and we determine that they indeed es-tablish connections with suspicious servers In detail invarious runs app commobovideoplayerpro connects withhttpwwwtopappsquarecomwhich is judged as maliciousby CyRadar in VirusTotal App cncfshop ele1taoxiaosanvisits suspicious URL http955ccjSn9 Since these apps arethe original apps they are identified as false positives in theexperiments

In our experiments we chose density peak clustering asthe clustering method because density peak clustering canautomatically determine the correct number of clusters [58]Hence our method can detect various versions of malwarethat are repackaged from the same original app Unfortu-nately in the apps that were downloaded from AndroZoono malware was repackaged from the same original app Toevaluate the efficiency of detecting versions of malware wehave created two malware instances from a popular game

Apps

App

1A

pp2

App

3A

pp4

App

5A

pp6

App

7A

pp8

App

9A

pp10

App

11A

pp12

App

13A

pp14

App

15A

pp16

App

18A

pp17

App

19A

pp20

0

01

02

03

04

05

06

07

08

09

1

F-m

easu

re

Figure 13 Detection F-measure for the 20 selected apps

Table 5 Detailed descriptions of the chosen 20 apps

Main function ofapp

of HTTPflows

of encryptedflows

App1 VideoPlayer 453 51App2 Game 657 1455App3 Browser 1531 1478App4 MusicPlayer 341 25App5 Game 612 289App6 Game 788 123App7 Game 1336 524App8 Fitness 357 892App9 Email 122 431App10 Game 963 790App11 MusicPlayer 1128 547App12 Game 1598 426App13 News 2460 1178App14 News 2891 1543App15 Chat 171 49App16 Email 143 368App17 WeatherForecast 397 120App18 Game 1832 921App19 Fitness 732 362App20 OnlineLearning 2378 834

Table 6 Average accuracy and F-measure

Average accuracy () AverageF-measure

Repackaged_safe 903 078Repackaged_malware 969 094

14 Security and Communication Networks

app namely aircomaceviralmotox3 One is produced bygrafting AnserverBotrsquos source code into air-comaceviralmotox3 and the other is created by adding codefor stealing usersrsquo contacts and short messages and sendingthem to a remote server Similarly each malware is run 10times and the original app aircomaceviralmotox3 is alsorun 10 times for validatione clustering result is presentedin Figure 16 Apparently there are 3 clusters e

experimental results show that the value of F-measure foreach malware is 1 Hence our method can detect variousmalware that were repackaged from the same original app

63 Comparison In this section we compare our methodwith the typical methods that are based on network or cloudas illustrated in Figure 5 e comparative indicators are thedetection accuracy F-measure average power CPU andmemory consumptions of the mobile device First wecompare the proposed method with AppFA [14] which isimplemented completely at the network level Since AppFAalso must analyze app traffic we directly use the traffic sets ofthe repackaged malware apps for comparison While testingAppFA we deleted the label information that was containedin the traffic sets and mixed the network flows as a wholeen we used the signature matching and the constrainedK-means clustering algorithm proposed in [14] to restoreapp sessions namely the label for each flow e similarapps used in AppFA were chosen in Google Play via themethod introduced in [14] and for each repackaged mal-ware 10 similar apps were selected as the peer group epeer group apps were run once by Droidbot to generatenetwork traffic e comparison results are listed in Table 8

Since AppFA need not to install some special apps onmobile devices the extra power memory and CPU con-sumptions are 0 However the detection accuracy is muchlower (863 vs 969) than that of the proposedmethod ascompared in Table 8 In the proposed method we use theedge-client app to record flow metainformation thusnetwork flows can be accurately labeled However inAppFA flow labeling is realized via traffic clustering and theclustering accuracy cannot be 100 Meanwhile AppFAcompares the appsrsquo network behaviors with those of theirsimilar apps which also results in errors In the proposedmethod we directly compare the repackaged malware withits original app and the detection accuracy is increased eaverage power CPU and memory consumptions of theproposed method are also low and acceptable

In Table 8 the power memory and CPU consumptionsfor our edge-client app were obtained with the help of appGT (httpsgithubcomTencentGT) GT is a portable toolthat runs on smartphones for debugging the APP internalparameters and the code time-consuming statistics eaverage values for the power memory and CPU con-sumptions are calculated as follows Before running an appwith Droidbot we started the edge-client app by GT Whenthe Droidbot was stopped we recoded the power CPU andmemory consumptions We repeated the process until allapps (including 143 repackaged malware apps and theiroriginal versions) had been tested Via this way we obtainedthe average power CPU and memory consumptions for theedge-client app According to Table 8 the edge-client app islightweight and has minimal impact on the performances ofthe mobile devices

en we compare the proposed method with SAMA-Droid [44] SAMADroid monitors Android system calls andsends the calls to a centralized server for the detection ofAndroid malware SAMADroid is a typical cloud-based on-

Decision graph rho-delta-rho

20 251510 35 403050Sorted rho

00

02

04

06

08

Rho lowast

delt

a

Figure 14 Example of app clustering Obviously two clusters areidentified e dot in the top left corner is an outlier e outlier isdetected as an abnormal but harmless app by our method isfigure was generated by the density peak clustering tool

01 02 03 04 05 06 07 08 09 10Value of q

0

01

02

03

04

05

06

07

08

09

1

Acc

urac

y

Figure 15 Accuracies of various weight values for traffic contentsimilarity and traffic behavior similarity

Table 7 True positives and false positives with and withouthostname checking

of true positives of false positivesWith hostname checking 1400 47Without hostnamechecking 1400 213

Security and Communication Networks 15

device detection method We have implemented the dy-namic detection according to the description in [44] Indetail the APIs that are invoked by the running apps aremonitored by the tool Sensitive API Monitor (httpsgithubcomcyruliuSensitive-API-Monitor) As done in [44] thefrequency values of 10 system calls (open ioctl brk readwrite close sendto sendmsg recvfrom and recvmsg) arerecorded ese recorded frequency values are sent toWindows Azure for malware detection We evaluate thedynamic detection performance of SAMADroid on our 143repackaged_malware apps and their original versions In theevaluation the frequency values of the original apps are usedto create the normality model and the frequency values ofthe repackaged malware apps are used for testing We alsouse the app GT to record the power memory and CPUconsumptions of SAMADroid Note that our phones wereall rooted and Sensitive_API_Monitor and GT can be usede experimental results are presented in Table 9

Last we compare our method with Shabtairsquos method [8]Shabtairsquos method detects Android malware at the mobiledevices and it considers the apprsquos network traffic patternsonly the HTTP traffic contents are not considered isgives us an opportunity to compare our method withmethods that have been specially designed for encryptedtraffic We have implemented Shabtairsquos method and chosenthe feature subset 2 as the detection features Feature subset2 includes Avg Sent Bytes Avg Rcvd Bytes and Pct ofAvg Rcvd Bytes among other values and it is proved to be

the best feature subset for malware detection [8] e De-cisionRegression tree is chosen as the classification algo-rithm e testing dataset is the traffic that was generated bythe 143 repackaged malware instances and the training set isthe traffic that was generated by the corresponding originalapps e classification results are listed in Table 10According to the table the detection accuracy and F-mea-sure values of our method are significantly higher than thoseof Shabtairsquos method is indicates that the HTTP trafficprovides useful information for Android malware detectionSince Shabtairsquos method is conducted at the mobile devices itconsumes more resources than our method as compared inTable 10

7 Discussion

e results of our extensive experiments demonstrate thatrepackaged Android malware can be efficiently and effec-tively detected by clustering network traffic at the edgedevices However there are still limitations for our methodFirst we assume the most of the devices have installed theoriginal apps (rlt (12)u as described in Section 51) is isreasonable in most cases since mobile users typicallydownload apps from the official markets However in areaswhere official markets such as the Google play and Amazonare banned it is possible that the most mobile devices haveinstalled the repackaged apps is may cause false positivesSecond if the malware does not generate sufficient trafficour method cannot detect it For example some malwareonly sends fraudulent premium SMS messages is type ofmalware cannot be detected by monitoring its networktraffic ird we rely on third-party tools such as Virustotalfor the identification of malicious URLs and to reduce thenumber of false alarms If a malicious app rents legalplatforms such as the public cloud as the remote controlserver our method will judge it as a normal app Hence falsenegatives will be generated

To overcome these limitations we could combine theproposed method with available methods such as SAMA-Droid In such cases the edge-client app will not only recordflow metainformation but also monitor app system be-haviors However the system behavior monitoring does notneed to be conducted all the time If abnormal but harmlessor lower traffic apps are detected the edge-client app will beinformed to record the corresponding appsrsquo system be-haviors Similarly the edge-client app can send the appsystem behaviors to the edge server for further processinge performance of this combined method will be evaluatedin our future work One can also combine the proposedframework with AppFA [14] for the detection of other typesof Android malware For example the edge extracts apptraffic behavior features and the cloud can compare appsrsquobehaviors with their peer groups to identify abnormalitiesFurthermore one could first use clustering techniques [60]to group repackaged apps according to the same maliciousmodules and then analyze the behaviors of the representativerepackaged apps in each group to obtain more accuratebehavior features

y

2D nonclassical multidimensional scaling

00ndash02 02ndash04x

ndash04

ndash03

ndash02

ndash01

00

01

02

03

04

Figure 16 Clustering results of various malware that wererepackaged from aircomaceviralmotox3e outlier in the middleis detected as an abnormal but harmless app is figure wasgenerated by the density peak clustering tool

Table 8 Comparison of the proposed method with AppFA

Proposed method AppFAAccuracy 969 863F-measure 094 075Average power consumption 12 0Average CPU consumption 06 0Average memory consumption 654MB 0

16 Security and Communication Networks

e paradigms of edge computing are used to detectAndroid malware in this study However several challengeswould be encountered when deploying it in practice Firstthe mobile devices must install a specified app for thecollection of essential information which may cause con-cerns and mobile users may refuse to cooperate Second inedge computing the data should be preprocessed by the edgeto better protect the usersrsquo privacy However the edge servercan observe all the data (including the sensitive informa-tion) and it threatens the privacy of users For example ifthe edge is deployed by the cloud provider the usersrsquo privacyinformation is also available to the cloud ere is no dif-ference with the user-cloud model for privacy protectionunless the edge server is deployed by the userird the edgeserver should be well protected from network attacksOtherwise it will be convenient for hackers to steal usersrsquoprivate information

8 Conclusion

In this paper we propose a novel method for detectingAndroid malware based on edge computing To the best ofour knowledge the detection of Android malware by uti-lizing edge computing and traffic clustering has not beenconsidered previously By introducing the edge layer themain tasks of the mobile device are offloaded to the edgeserver and the mobile device only needs to record flowmetainformation which substantially conserves the re-sources of the mobile device e edge server extracts appsrsquonetwork traffic features and sends these features to the cloudplatform for further analysis us the usersrsquo privacy can bebetter protected With the received features the cloud cal-culates the similarities between apps and clusters thesesimilarity values to separate the original apps and themalware automatically We evaluated our methods on 400Android apps and the experimental results demonstratethat the detection accuracy can reach 100 for several appsand the average accuracy is 969 We also tested theperformance of the proposed method and compared it with

the typical approaches e comparison results show thatour method can detect repackaged Android malware withboth higher detection accuracy and lower power and re-source consumptions In the future we will continue to workto overcome the discussed limitations and to evaluate theproposed method in real world scenarios

Data Availability

e downloaded apps used to support the findings of thisstudy have been deposited in httpspanbaiducoms1b8vMceIdtdd1gBRC5Gbzvw Its extraction code is vd5re clustering code is also in the repository Since appnetwork traffic contains usersrsquo sensitive information itcannot be publicly shared However we have explained allthe steps to generate network traffic automatically esesteps are also described in the instructionspdf in therepository

Conflicts of Interest

e authors declare that they have no conflicts of interest

Acknowledgments

is work was supported by National Key TechnologiesResearch and Development Program of China under Grant2018YFD0401404 National Natural Science Foundation ofChina under Grants 61702282 61802192 and 71801123Natural Science Foundation of the Jiangsu Higher EducationInstitutions of China under Grants 18KJB520024 and17KJB520023 Nanjing Forestry University under GrantsGXL016 and CX2016026 and NUPTSF under GrantNY217143

References

[1] Statista Mobile Operating Systemsrsquo Market Share Worldwidefrom January 2012 to December 2019 Statista HamburgGermany 2009 httpswwwstatistacomstatistics272698global-market-share-held-by-mobile-operating-systems-since-2009

[2] W Martin F Sarro Y Jia Y Zhang and M Harman ldquoAsurvey of app store analysis for software engineeringrdquo IEEETransactions on Software Engineering vol 43 no 9pp 817ndash847 2017

[3] httpswwwappbraincomstatsnumber-of-android-apps2020

[4] L Li D Li T F Bissyande et al ldquoUnderstanding android apppiggybacking a systematic study of malicious code graftingrdquoIEEE Transactions on Information Forensics and Securityvol 12 no 6 pp 1269ndash1284 2017

[5] M Fan J Liu X Luo et al ldquoAndroid malware familialclassification and representative sample selection via frequentsubgraph analysisrdquo IEEE Transactions on Information Fo-rensics and Security vol 13 no 8 pp 1890ndash1905 2018

[6] J Zhang Z Qin K Zhang H Yin and J Zou ldquoDalvik opcodegraph based android malware variants detection using globaltopology featuresrdquo IEEE Access vol 6 pp 51964ndash51974 2018

[7] S Garg S K Peddoju and A K Sarje ldquoNetwork-baseddetection of android malicious appsrdquo International Journal ofInformation Security vol 16 no 4 pp 385ndash400 2016

Table 9 Comparison of the proposed method with SAMADroid

Proposed method SAMADroidAccuracy 969 831F-measure 0940 0793Average power consumption 12 23Average CPU consumption 06 08Average memory consumption 654MB 773MB

Table 10 Comparison of the proposed method with Shabtairsquosmethod

Proposed method Shabtairsquosmethod

Accuracy 969 771F-measure 0940 0695Average power consumption 12 77Average CPU consumption 06 23Average memory consumption 654MB 937MB

Security and Communication Networks 17

[8] A Shabtai L Tenenboim-Chekina D Mimran L RokachB Shapira and Y Elovici ldquoMobile malware detection throughanalysis of deviations in application network behaviorrdquoComputers amp Security vol 43 pp 1ndash18 2014

[9] L Xie X Zhang J-P Seifert and S Zhu ldquopBMDS a be-havior-based malware detection system for cellphone de-vicesrdquo in Proceedings of the ird ACM Conference onWireless Network Security ACM New York NY USApp 37ndash48 2010

[10] I Burguera U Zurutuza and S Nadjm-Tehrani ldquoCrowdroidbehavior-based malware detection system for androidrdquo inProceedings of the 1st ACMWorkshop on Security and Privacyin Smartphones and Mobile Devices ACM Chicago IL USApp 15ndash26 October 2011

[11] G Dini F Martinelli A Saracino and D SgandurraldquoMADAM a multi-level anomaly detector for android mal-warerdquo in Proceedings of the International Conference onMathematical Methods Models and Architectures for Com-puter Network Security vol 12 Springer St PetersburgRussia pp 240ndash253 2012

[12] Y Zhang M Yang B Xu et al ldquoVetting undesirable be-haviors in android apps with permission use analysisrdquo inProceedings of the 2013 ACM SIGSAC Conference on Com-puter amp Communications Security ACM Berlin Germanypp 611ndash622 November 2013

[13] SWang Z Chen X Li LWang K Ji and C Zhao ldquoAndroidmalware clustering analysis on network-level behaviorrdquo inProceedings of the International Conference on IntelligentComputing Springer Liverpool UK pp 796ndash807 August2017

[14] G He B Xu and H Zhu ldquoAppFA a novel approach to detectmalicious android applications on the networkrdquo in Securityand Communication Networks Wiley Hoboken NJ USA2018

[15] X Wang Y Yang and Y Zeng ldquoAccurate mobile malwaredetection and classification in the cloudrdquo SpringerPlus vol 4no 1 p 583 2015

[16] K Tam A Feizollah N B Anuar R Salleh and L Cavallaroldquoe evolution of android malware and android analysistechniquesrdquo ACM Computing Surveys (CSUR) vol 49 no 4p 76 2017

[17] M Sun X Li J C S Lui R T B Ma and Z Liang ldquoMonet auser-oriented behavior-based malware variants detectionsystem for androidrdquo IEEE Transactions on Information Fo-rensics and Security vol 12 no 5 pp 1103ndash1112 2017

[18] T Vidas and N Christin ldquoEvading android runtime analysisvia sandbox detectionrdquo in Proceedings of the 9th ACMSymposium on Information Computer and CommunicationsSecurity pp 447ndash458 Kyoto Japan June 2014

[19] Y Duan M Zhang A V Bhaskar et al ldquoings you may notknow about android (Un) packers a systematic study basedon whole-system emulationrdquo in Proceedings of the 2018Network and Distributed System Security Symposium pp 1ndash15 San Diego CA USA February 2018

[20] L Xue X Luo L Yu S Wang and D Wu ldquoAdaptiveunpacking of android appsrdquo in Proceedings of the 2017 IEEEACM 39th International Conference on Software Engineering(ICSE) IEEE Buenos Aires Argentina pp 358ndash369 May2017

[21] Y Zhou and X Jiang ldquoDissecting android malware char-acterization and evolutionrdquo in Proceedings of the 2012 IEEESymposium on Security and Privacy IEEE San Francisco CAUSA pp 95ndash109 2012

[22] N W Lo S K Lu and Y H Chuang ldquoA framework for thirdparty android marketplaces to identify repackaged appsrdquo inProceedings of the 2016 IEEE 14th International Conference onDependable Autonomic and Secure Computing 14th Inter-national Conference on Pervasive Intelligence and Computing2nd International Conference on Big Data Intelligence andComputing and Cyber Science and Technology Congresspp 475ndash482 Auckland New Zealand August 2016

[23] L Li J Gao M Hurier et al ldquoAndrozoo++ collectingmillions of android apps and their metadata for the researchcommunityrdquo 2017 httpsarxivorgabs170905281

[24] W Shi J Cao Q Zhang Y Li and L Xu ldquoEdge computingvision and challengesrdquo IEEE Internet of ings Journal vol 3no 5 pp 637ndash646 2016

[25] W Zhou Y Zhou X Jiang and P Ning ldquoDetectingrepackaged smart-phone applications in third-party androidmarketplacesrdquo in Proceedings of the Second ACM Conferenceon Data and Application Security and Privacy ACM AntonioTX USA pp 317ndash326 February 2012

[26] Y Shao X Luo C Qian P Zhu and L Zhang ldquoTowards ascalable resource-driven approach for detecting repackagedandroid applicationsrdquo in Proceedings of the 30th AnnualComputer Security Applications Conference ACM NewOrleans LA USA pp 56ndash65 December 2014

[27] K Chen P Wang Y Lee et al ldquoFinding unknown malice in10 seconds mass vetting for new threats at the google-playscalerdquo in Proceedings of the USENIX Security Symposiumvol 15 Washington DC USA August 2015

[28] Z Wang C Li Y Guan and Y Xue ldquoDroidchain a novelmalware detection method for android based on behaviorchainrdquo in Proceedings of the 2015 IEEE Conference onCommunications and Network Security (CNS) IEEE Flor-ence Italy pp 727-728 September 2015

[29] K Tian D D Yao B G Ryder G Tan and G PengldquoDetection of repackaged android malware with code-het-erogeneity featuresrdquo IEEE Transactions on Dependable andSecure Computing vol 17 no 1 pp 64ndash77 2017

[30] A De Lorenzo F Martinelli E Medvet F Mercaldo andA Santone ldquoVisualizing the outcome of dynamic analysis ofandroid malware with vizmalrdquo Journal of Information Se-curity and Applications vol 50 Article ID 102423 2020

[31] M Lin D Zhang X Su and T Yu ldquoEffective and scalablerepackaged application detection based on user interfacerdquo inProceedings of the 2017 IEEE Conference on ComputerCommunications Workshops (INFOCOM WKSHPS) IEEEAtlanta GA USA May 2017

[32] G Meng M Patrick Y Xue Y Liu and J Zhang ldquoSecuringandroid app markets via modelling and predicting malwarespread between marketsrdquo IEEE Transactions on InformationForensics and Security vol 14 no 7 pp 1944ndash1959 2018

[33] A Arora and S K Peddoju ldquoMinimizing network trafficfeatures for android mobile malware detectionrdquo in Proceed-ings of the 18th International Conference on DistributedComputing and Networking ACM Hyderabad India January2017

[34] X Wu D Zhang X Su and W Li ldquoDetect repackagedAndroid application based on HTTP traffic similarity httptraffic similarityrdquo Security and Communication Networksvol 8 no 13 pp 2257ndash2266 2015

[35] J Malik and R Kaushal ldquoCredroid android malware de-tection by network traffic analysisrdquo in Proceedings of the 1stACM Workshop on Privacy-Aware Mobile Computing ACMPaderborn Germany pp 28ndash36 July 2016

18 Security and Communication Networks

[36] A Zulkifli I R A Hamid W M Shah and Z AbdullahldquoAndroid malware detection based on network traffic usingdecision tree algorithmrdquo in Proceedings of the InternationalConference on Soft Computing and Data Mining SpringerSenai Malaysia pp 485ndash494 January 2018

[37] Z Chen Q Yan H Han et al ldquoMachine learning basedmobile malware detection using highly imbalanced networktrafficrdquo Information Sciences vol 433-434 pp 346ndash364 2018

[38] G He B Xu L Zhang and H Zhu ldquoMobile app identifi-cation for encrypted network flows by traffic correlationrdquoInternational Journal of Distributed Sensor Networks vol 14no 12 pp 1ndash17 2018

[39] L Xue Y Zhou T Chen X Luo and G Gu ldquoMalton to-wards on-device non-invasive mobile malware analysis forARTrdquo in Proceedings of the 26th USENIX Security Symposiumpp 289ndash306 Vancouver Canada August 2017

[40] M K Alzaylaee S Y Yerima and S Sezer ldquoDL-Droid deeplearning based android malware detection using real devicesrdquoComputers amp Security vol 89 Article ID 101663 2020

[41] A S Shamili C Bauckhage and T Alpcan ldquoMalware de-tection on mobile devices using distributed machine learn-ingrdquo in Proceedings of the 20th International Conference onPattern Recognition (ICPR) IEEE Istanbul Turkeypp 4348ndash4351 August 2010

[42] M Zhao T Zhang F Ge and Z Yuan ldquoRobotdroid alightweight malware detection framework on smartphonesrdquoJournal of Networks vol 7 no 4 p 715 2012

[43] K A Talha D I Alper and C Aydin ldquoAPK auditor per-mission-based android malware detection systemrdquo DigitalInvestigation vol 13 pp 1ndash14 2015

[44] S Arshad M A Shah A Wahid A Mehmood H Song andH Yu ldquoSamadroid a novel 3-level hybrid malware detectionmodel for android operating systemrdquo IEEE Access vol 6pp 4321ndash4339 2018

[45] G He L Zhang B Xu and H Zhu ldquoDetecting repackagedandroid malware based on mobile edge computingrdquo inProceedings of the 2018 Sixth International Conference onAdvanced Cloud and Big Data (CBD) IEEE Lanzhou Chinapp 360ndash365 August 2018

[46] R Roman J Lopez andMMambo ldquoMobile edge computingFog et al a survey and analysis of security threats andchallengesrdquo Future Generation Computer Systems vol 78pp 680ndash698 2018

[47] F Bonomi R Milito J Zhu and S Addepalli ldquoFog com-puting and its role in the internet of thingsrdquo in Proceedings ofthe First Edition of the MCC Workshop on Mobile CloudComputing ACM Helsinki Finland pp 13ndash16 August 2012

[48] M T Beck MWerner S Feld and S Schimper ldquoMobile edgecomputing a taxonomyrdquo in Proceedings of the Sixth Inter-national Conference on Advances in Future Internet pp 48ndash55 Lisbon Portugal November 2014

[49] P Mach and Z Becvar ldquoMobile edge computing a survey onarchitecture and computation offloadingrdquo IEEE Communi-cations Surveys amp Tutorials vol 19 no 3 pp 1628ndash1656 2017

[50] H T Dinh C Lee D Niyato and P Wang ldquoA survey ofmobile cloud computing architecture applications and ap-proachesrdquo Wireless Communications and Mobile Computingvol 13 no 18 pp 1587ndash1611 2013

[51] X Lyu H Tian L Jiang et al ldquoSelective offloading in mobileedge computing for the green internet of thingsrdquo IEEENetwork vol 32 no 1 pp 54ndash60 2018

[52] S Agrawal and J Agrawal ldquoSurvey on anomaly detectionusing data mining techniquesrdquo Procedia Computer Sciencevol 60 pp 708ndash713 2015

[53] A Tongaonkar ldquoA look at the mobile app identificationlandscaperdquo IEEE Internet Computing vol 20 no 4 pp 9ndash152016

[54] G He B Xu and H Zhu ldquoIdentifying mobile applications forencrypted network trafficrdquo in Proceedings of the 2017 FifthInternational Conference on Advanced Cloud and Big Data(CBD) IEEE Shanghai China pp 279ndash284 August 2017

[55] G Ranjan A Tongaonkar and R Torres ldquoApproximatematching of persistent lexicon using search-engines forclassifying mobile app trafficrdquo in Proceedings of the 35thAnnual IEEE International Conference on Computer Com-munications IIEEE San Francisco CA USA April 2016

[56] L Li and S Qu ldquoShort text classification based on improvedITCrdquo Journal of Computer and Communications vol 1 no 4pp 22ndash27 2013

[57] A H Lashkari A F A Kadir L Taheri and A A GhorbanildquoToward developing a systematic approach to generatebenchmark android malware datasets and classificationrdquo inProceedings of the 2018 International Carnahan Conference onSecurity Technology (ICCST) IEEE Montreal Canada Oc-tober 2018

[58] A Rodriguez and A Laio ldquoClustering by fast search and findof density peaksrdquo Science vol 344 no 6191 pp 1492ndash14962014

[59] Y Li Z Yang Y Guo and X Chen ldquoDroidbot a lightweightUI-guided test input generator for androidrdquo in Proceedings ofthe Software Engineering Companion (ICSE-C) pp 23ndash26Buenos Aires Argentina May 2017

[60] M Fan X Luo J Liu et al ldquoGraph embedding based familialanalysis of android malware using unsupervised learningrdquo inProceedings of the 2019 IEEEACM 41st International Con-ference on Software Engineering (ICSE) IEEE Piscataway NJUSA pp 771ndash782 May 2019

Security and Communication Networks 19

Page 13: On-DeviceDetectionofRepackagedAndroidMalwarevia …downloads.hindawi.com/journals/scn/2020/8630748.pdf · 2020-05-31 · According to [4–6], most Android malware programs are repackagedapps,andthemajorityofnewAndroidmalware

We have evaluated the Accuracy and F-measure for eachrepackaged app Recall that we ran each repackaged app 10times and the corresponding original app 30 timeserefore for each app the number of the original apptraffic sets is 30 and the number of the repackaged app trafficsets is 10 namely TP + FN 10 and TN+FP 30 for eachapp We have randomly chosen 20 apps from repack-age_malware and the experimental results for these apps arepresented in Figures 12 and 13 Detailed descriptions of thechosen 20 apps are listed in Table 5 In the table the mainfunction of each app is determined by analyzing the packagename from the AndroidManifestxml and from the Googlesearch results of the package name As depicted in Figure 12100 accuracy was realized on a total of 6 apps (app3 app5app9 app10 app13 and app18) e minimum accuracy is85 (app2 3440) and the average accuracy is 952 Fig-ure 13 presents the F-measure values In our experimentsthe maximum F-measure value is 1 and the minimum valueis 0737 e average F-measure value is 0888 ese resultsdemonstrate the satisfactory performance of our proposedclustering method

We have also calculated the average accuracy and F-measure values for all repackaged_safe and repack-aged_malware apps In these calculationsTP + FN 57lowast10 570 and TN+FP 57lowast 301710 for

repackaged_safe apps For repackaged_malware appsTP + FN 143lowast101430 and TN+FP 143lowast 30 4290is is because in our experiments 143 apps were identifiedas repackaged malware and 57 apps as repackaged safe byVirustotal us the parameter values differ e experi-mental results are listed in Table 6

According to Table 6 the average accuracy values forrepackaged_safe and repackaged_malware apps are bothhigh However the average F-measure for repackaged_safeapps is only 078 which is lower than that of repackagedmalware is indicates that repackaged safe apps are

Router Windows Azure

Edge server DellPowerEdge R730

10 android phones

Repackagedoriginal apps

Androzoo

Figure 11 Experimental setup

Table 3 Datasets used in the experiments

of apps of runs of HTTP flows of encrypted flows Traffic dataset sizes Feature dataset sizesOriginal apps 200 30 340015 183452 asymp63G bytes asymp25G bytesRepackaged apps 200 10 110670 93041 asymp26G bytes asymp9G bytes

Table 4 Confusion matrix

ObservedDetected

Repackaged_x app Original appRepackaged_x app TP ( of TPs) FN ( of FNs)Original app FP ( of FPs) TN ( of TNs)

Apps

0

01

02

03

04

05

06

07

08

09

1

Acc

urac

y

App

1A

pp2

App

3A

pp4

App

5A

pp6

App

7A

pp8

App

9A

pp10

App

11A

pp12

App

13A

pp14

App

15A

pp16

App

18A

pp17

App

19A

pp20

Figure 12 Detection accuracies for the 20 selected apps

Security and Communication Networks 13

difficult to distinguish from their original apps is isbecause repackaged_safe apps only generate smallamounts of additional network flows and their networktraffic features are not clearly distinguishable from thoseof normal traffic Fortunately the average F-measure ofrepackaged malware apps is outstanding is may be dueto the fact that malware typically send stolen data to orreceive commands from remote servers Hence theirnetwork traffic features are special and distinguishableFigure 14 presents an example of app clustering and il-lustrates the distinguishability of the repackaged malwareand the original app ese results demonstrate thatrepackaged Android malware can be effectively detectedby comparing their network traffic with that of originalapps

For repackaged malware apps the accuracy results withdifferent weight values (q in equation (10)) are presented inFigure 15 As depicted in the figure when q 05 we realizethe highest accuracy rate 969 If q 0 namely only thefeatures of encrypted traffic are used the accuracy result is753 e result is 617 when q 1 namely only thefeatures of plaintext traffic are considered ese resultsdemonstrate that both the encrypted and plaintext trafficcontribute to the differentiation of the original apps and therepackaged malware However the encrypted traffic con-tributes more than the plaintext traffic is may becausemalicious apps usually connect to their servers by securityprotocols to evade IDS detection

In the previous experiments we checked the serversrsquohostnames using VirusTotal to reduce the frequency of thefalse positives For various clusters that contained less than20 elements if all the serversrsquo hostnames were judged asclean the cluster was identified as an original app rather thana repackaged one As argued in Section 55 this process onlyreduces the frequency of false alarms To support this ar-gument we list the numbers of true positives and falsepositives of the 143 repackaged malware in Table 7 forcomparison When conducting the hostname checking thenumber of true positives is 1400 and the result is the samewhen the hostname checking is turned off is findingsupports that the malware is detected by clustering thesimilarity values However the number of false positivesincreased from 47 to 213 when the hostname checking wasturned off erefore we suggest checking the serversrsquohostnames after clustering to increase the practicality of themethod Note that false positives still occur when we checkthe hostnames by VirusTotal Hence there may be malwarein the original apps In our dataset the typical examples arecommobovideoplayerpro and cncfshop ele1taoxiaosanWe further analyze these two apps by reading theirdecompiled codes and we determine that they indeed es-tablish connections with suspicious servers In detail invarious runs app commobovideoplayerpro connects withhttpwwwtopappsquarecomwhich is judged as maliciousby CyRadar in VirusTotal App cncfshop ele1taoxiaosanvisits suspicious URL http955ccjSn9 Since these apps arethe original apps they are identified as false positives in theexperiments

In our experiments we chose density peak clustering asthe clustering method because density peak clustering canautomatically determine the correct number of clusters [58]Hence our method can detect various versions of malwarethat are repackaged from the same original app Unfortu-nately in the apps that were downloaded from AndroZoono malware was repackaged from the same original app Toevaluate the efficiency of detecting versions of malware wehave created two malware instances from a popular game

Apps

App

1A

pp2

App

3A

pp4

App

5A

pp6

App

7A

pp8

App

9A

pp10

App

11A

pp12

App

13A

pp14

App

15A

pp16

App

18A

pp17

App

19A

pp20

0

01

02

03

04

05

06

07

08

09

1

F-m

easu

re

Figure 13 Detection F-measure for the 20 selected apps

Table 5 Detailed descriptions of the chosen 20 apps

Main function ofapp

of HTTPflows

of encryptedflows

App1 VideoPlayer 453 51App2 Game 657 1455App3 Browser 1531 1478App4 MusicPlayer 341 25App5 Game 612 289App6 Game 788 123App7 Game 1336 524App8 Fitness 357 892App9 Email 122 431App10 Game 963 790App11 MusicPlayer 1128 547App12 Game 1598 426App13 News 2460 1178App14 News 2891 1543App15 Chat 171 49App16 Email 143 368App17 WeatherForecast 397 120App18 Game 1832 921App19 Fitness 732 362App20 OnlineLearning 2378 834

Table 6 Average accuracy and F-measure

Average accuracy () AverageF-measure

Repackaged_safe 903 078Repackaged_malware 969 094

14 Security and Communication Networks

app namely aircomaceviralmotox3 One is produced bygrafting AnserverBotrsquos source code into air-comaceviralmotox3 and the other is created by adding codefor stealing usersrsquo contacts and short messages and sendingthem to a remote server Similarly each malware is run 10times and the original app aircomaceviralmotox3 is alsorun 10 times for validatione clustering result is presentedin Figure 16 Apparently there are 3 clusters e

experimental results show that the value of F-measure foreach malware is 1 Hence our method can detect variousmalware that were repackaged from the same original app

63 Comparison In this section we compare our methodwith the typical methods that are based on network or cloudas illustrated in Figure 5 e comparative indicators are thedetection accuracy F-measure average power CPU andmemory consumptions of the mobile device First wecompare the proposed method with AppFA [14] which isimplemented completely at the network level Since AppFAalso must analyze app traffic we directly use the traffic sets ofthe repackaged malware apps for comparison While testingAppFA we deleted the label information that was containedin the traffic sets and mixed the network flows as a wholeen we used the signature matching and the constrainedK-means clustering algorithm proposed in [14] to restoreapp sessions namely the label for each flow e similarapps used in AppFA were chosen in Google Play via themethod introduced in [14] and for each repackaged mal-ware 10 similar apps were selected as the peer group epeer group apps were run once by Droidbot to generatenetwork traffic e comparison results are listed in Table 8

Since AppFA need not to install some special apps onmobile devices the extra power memory and CPU con-sumptions are 0 However the detection accuracy is muchlower (863 vs 969) than that of the proposedmethod ascompared in Table 8 In the proposed method we use theedge-client app to record flow metainformation thusnetwork flows can be accurately labeled However inAppFA flow labeling is realized via traffic clustering and theclustering accuracy cannot be 100 Meanwhile AppFAcompares the appsrsquo network behaviors with those of theirsimilar apps which also results in errors In the proposedmethod we directly compare the repackaged malware withits original app and the detection accuracy is increased eaverage power CPU and memory consumptions of theproposed method are also low and acceptable

In Table 8 the power memory and CPU consumptionsfor our edge-client app were obtained with the help of appGT (httpsgithubcomTencentGT) GT is a portable toolthat runs on smartphones for debugging the APP internalparameters and the code time-consuming statistics eaverage values for the power memory and CPU con-sumptions are calculated as follows Before running an appwith Droidbot we started the edge-client app by GT Whenthe Droidbot was stopped we recoded the power CPU andmemory consumptions We repeated the process until allapps (including 143 repackaged malware apps and theiroriginal versions) had been tested Via this way we obtainedthe average power CPU and memory consumptions for theedge-client app According to Table 8 the edge-client app islightweight and has minimal impact on the performances ofthe mobile devices

en we compare the proposed method with SAMA-Droid [44] SAMADroid monitors Android system calls andsends the calls to a centralized server for the detection ofAndroid malware SAMADroid is a typical cloud-based on-

Decision graph rho-delta-rho

20 251510 35 403050Sorted rho

00

02

04

06

08

Rho lowast

delt

a

Figure 14 Example of app clustering Obviously two clusters areidentified e dot in the top left corner is an outlier e outlier isdetected as an abnormal but harmless app by our method isfigure was generated by the density peak clustering tool

01 02 03 04 05 06 07 08 09 10Value of q

0

01

02

03

04

05

06

07

08

09

1

Acc

urac

y

Figure 15 Accuracies of various weight values for traffic contentsimilarity and traffic behavior similarity

Table 7 True positives and false positives with and withouthostname checking

of true positives of false positivesWith hostname checking 1400 47Without hostnamechecking 1400 213

Security and Communication Networks 15

device detection method We have implemented the dy-namic detection according to the description in [44] Indetail the APIs that are invoked by the running apps aremonitored by the tool Sensitive API Monitor (httpsgithubcomcyruliuSensitive-API-Monitor) As done in [44] thefrequency values of 10 system calls (open ioctl brk readwrite close sendto sendmsg recvfrom and recvmsg) arerecorded ese recorded frequency values are sent toWindows Azure for malware detection We evaluate thedynamic detection performance of SAMADroid on our 143repackaged_malware apps and their original versions In theevaluation the frequency values of the original apps are usedto create the normality model and the frequency values ofthe repackaged malware apps are used for testing We alsouse the app GT to record the power memory and CPUconsumptions of SAMADroid Note that our phones wereall rooted and Sensitive_API_Monitor and GT can be usede experimental results are presented in Table 9

Last we compare our method with Shabtairsquos method [8]Shabtairsquos method detects Android malware at the mobiledevices and it considers the apprsquos network traffic patternsonly the HTTP traffic contents are not considered isgives us an opportunity to compare our method withmethods that have been specially designed for encryptedtraffic We have implemented Shabtairsquos method and chosenthe feature subset 2 as the detection features Feature subset2 includes Avg Sent Bytes Avg Rcvd Bytes and Pct ofAvg Rcvd Bytes among other values and it is proved to be

the best feature subset for malware detection [8] e De-cisionRegression tree is chosen as the classification algo-rithm e testing dataset is the traffic that was generated bythe 143 repackaged malware instances and the training set isthe traffic that was generated by the corresponding originalapps e classification results are listed in Table 10According to the table the detection accuracy and F-mea-sure values of our method are significantly higher than thoseof Shabtairsquos method is indicates that the HTTP trafficprovides useful information for Android malware detectionSince Shabtairsquos method is conducted at the mobile devices itconsumes more resources than our method as compared inTable 10

7 Discussion

e results of our extensive experiments demonstrate thatrepackaged Android malware can be efficiently and effec-tively detected by clustering network traffic at the edgedevices However there are still limitations for our methodFirst we assume the most of the devices have installed theoriginal apps (rlt (12)u as described in Section 51) is isreasonable in most cases since mobile users typicallydownload apps from the official markets However in areaswhere official markets such as the Google play and Amazonare banned it is possible that the most mobile devices haveinstalled the repackaged apps is may cause false positivesSecond if the malware does not generate sufficient trafficour method cannot detect it For example some malwareonly sends fraudulent premium SMS messages is type ofmalware cannot be detected by monitoring its networktraffic ird we rely on third-party tools such as Virustotalfor the identification of malicious URLs and to reduce thenumber of false alarms If a malicious app rents legalplatforms such as the public cloud as the remote controlserver our method will judge it as a normal app Hence falsenegatives will be generated

To overcome these limitations we could combine theproposed method with available methods such as SAMA-Droid In such cases the edge-client app will not only recordflow metainformation but also monitor app system be-haviors However the system behavior monitoring does notneed to be conducted all the time If abnormal but harmlessor lower traffic apps are detected the edge-client app will beinformed to record the corresponding appsrsquo system be-haviors Similarly the edge-client app can send the appsystem behaviors to the edge server for further processinge performance of this combined method will be evaluatedin our future work One can also combine the proposedframework with AppFA [14] for the detection of other typesof Android malware For example the edge extracts apptraffic behavior features and the cloud can compare appsrsquobehaviors with their peer groups to identify abnormalitiesFurthermore one could first use clustering techniques [60]to group repackaged apps according to the same maliciousmodules and then analyze the behaviors of the representativerepackaged apps in each group to obtain more accuratebehavior features

y

2D nonclassical multidimensional scaling

00ndash02 02ndash04x

ndash04

ndash03

ndash02

ndash01

00

01

02

03

04

Figure 16 Clustering results of various malware that wererepackaged from aircomaceviralmotox3e outlier in the middleis detected as an abnormal but harmless app is figure wasgenerated by the density peak clustering tool

Table 8 Comparison of the proposed method with AppFA

Proposed method AppFAAccuracy 969 863F-measure 094 075Average power consumption 12 0Average CPU consumption 06 0Average memory consumption 654MB 0

16 Security and Communication Networks

e paradigms of edge computing are used to detectAndroid malware in this study However several challengeswould be encountered when deploying it in practice Firstthe mobile devices must install a specified app for thecollection of essential information which may cause con-cerns and mobile users may refuse to cooperate Second inedge computing the data should be preprocessed by the edgeto better protect the usersrsquo privacy However the edge servercan observe all the data (including the sensitive informa-tion) and it threatens the privacy of users For example ifthe edge is deployed by the cloud provider the usersrsquo privacyinformation is also available to the cloud ere is no dif-ference with the user-cloud model for privacy protectionunless the edge server is deployed by the userird the edgeserver should be well protected from network attacksOtherwise it will be convenient for hackers to steal usersrsquoprivate information

8 Conclusion

In this paper we propose a novel method for detectingAndroid malware based on edge computing To the best ofour knowledge the detection of Android malware by uti-lizing edge computing and traffic clustering has not beenconsidered previously By introducing the edge layer themain tasks of the mobile device are offloaded to the edgeserver and the mobile device only needs to record flowmetainformation which substantially conserves the re-sources of the mobile device e edge server extracts appsrsquonetwork traffic features and sends these features to the cloudplatform for further analysis us the usersrsquo privacy can bebetter protected With the received features the cloud cal-culates the similarities between apps and clusters thesesimilarity values to separate the original apps and themalware automatically We evaluated our methods on 400Android apps and the experimental results demonstratethat the detection accuracy can reach 100 for several appsand the average accuracy is 969 We also tested theperformance of the proposed method and compared it with

the typical approaches e comparison results show thatour method can detect repackaged Android malware withboth higher detection accuracy and lower power and re-source consumptions In the future we will continue to workto overcome the discussed limitations and to evaluate theproposed method in real world scenarios

Data Availability

e downloaded apps used to support the findings of thisstudy have been deposited in httpspanbaiducoms1b8vMceIdtdd1gBRC5Gbzvw Its extraction code is vd5re clustering code is also in the repository Since appnetwork traffic contains usersrsquo sensitive information itcannot be publicly shared However we have explained allthe steps to generate network traffic automatically esesteps are also described in the instructionspdf in therepository

Conflicts of Interest

e authors declare that they have no conflicts of interest

Acknowledgments

is work was supported by National Key TechnologiesResearch and Development Program of China under Grant2018YFD0401404 National Natural Science Foundation ofChina under Grants 61702282 61802192 and 71801123Natural Science Foundation of the Jiangsu Higher EducationInstitutions of China under Grants 18KJB520024 and17KJB520023 Nanjing Forestry University under GrantsGXL016 and CX2016026 and NUPTSF under GrantNY217143

References

[1] Statista Mobile Operating Systemsrsquo Market Share Worldwidefrom January 2012 to December 2019 Statista HamburgGermany 2009 httpswwwstatistacomstatistics272698global-market-share-held-by-mobile-operating-systems-since-2009

[2] W Martin F Sarro Y Jia Y Zhang and M Harman ldquoAsurvey of app store analysis for software engineeringrdquo IEEETransactions on Software Engineering vol 43 no 9pp 817ndash847 2017

[3] httpswwwappbraincomstatsnumber-of-android-apps2020

[4] L Li D Li T F Bissyande et al ldquoUnderstanding android apppiggybacking a systematic study of malicious code graftingrdquoIEEE Transactions on Information Forensics and Securityvol 12 no 6 pp 1269ndash1284 2017

[5] M Fan J Liu X Luo et al ldquoAndroid malware familialclassification and representative sample selection via frequentsubgraph analysisrdquo IEEE Transactions on Information Fo-rensics and Security vol 13 no 8 pp 1890ndash1905 2018

[6] J Zhang Z Qin K Zhang H Yin and J Zou ldquoDalvik opcodegraph based android malware variants detection using globaltopology featuresrdquo IEEE Access vol 6 pp 51964ndash51974 2018

[7] S Garg S K Peddoju and A K Sarje ldquoNetwork-baseddetection of android malicious appsrdquo International Journal ofInformation Security vol 16 no 4 pp 385ndash400 2016

Table 9 Comparison of the proposed method with SAMADroid

Proposed method SAMADroidAccuracy 969 831F-measure 0940 0793Average power consumption 12 23Average CPU consumption 06 08Average memory consumption 654MB 773MB

Table 10 Comparison of the proposed method with Shabtairsquosmethod

Proposed method Shabtairsquosmethod

Accuracy 969 771F-measure 0940 0695Average power consumption 12 77Average CPU consumption 06 23Average memory consumption 654MB 937MB

Security and Communication Networks 17

[8] A Shabtai L Tenenboim-Chekina D Mimran L RokachB Shapira and Y Elovici ldquoMobile malware detection throughanalysis of deviations in application network behaviorrdquoComputers amp Security vol 43 pp 1ndash18 2014

[9] L Xie X Zhang J-P Seifert and S Zhu ldquopBMDS a be-havior-based malware detection system for cellphone de-vicesrdquo in Proceedings of the ird ACM Conference onWireless Network Security ACM New York NY USApp 37ndash48 2010

[10] I Burguera U Zurutuza and S Nadjm-Tehrani ldquoCrowdroidbehavior-based malware detection system for androidrdquo inProceedings of the 1st ACMWorkshop on Security and Privacyin Smartphones and Mobile Devices ACM Chicago IL USApp 15ndash26 October 2011

[11] G Dini F Martinelli A Saracino and D SgandurraldquoMADAM a multi-level anomaly detector for android mal-warerdquo in Proceedings of the International Conference onMathematical Methods Models and Architectures for Com-puter Network Security vol 12 Springer St PetersburgRussia pp 240ndash253 2012

[12] Y Zhang M Yang B Xu et al ldquoVetting undesirable be-haviors in android apps with permission use analysisrdquo inProceedings of the 2013 ACM SIGSAC Conference on Com-puter amp Communications Security ACM Berlin Germanypp 611ndash622 November 2013

[13] SWang Z Chen X Li LWang K Ji and C Zhao ldquoAndroidmalware clustering analysis on network-level behaviorrdquo inProceedings of the International Conference on IntelligentComputing Springer Liverpool UK pp 796ndash807 August2017

[14] G He B Xu and H Zhu ldquoAppFA a novel approach to detectmalicious android applications on the networkrdquo in Securityand Communication Networks Wiley Hoboken NJ USA2018

[15] X Wang Y Yang and Y Zeng ldquoAccurate mobile malwaredetection and classification in the cloudrdquo SpringerPlus vol 4no 1 p 583 2015

[16] K Tam A Feizollah N B Anuar R Salleh and L Cavallaroldquoe evolution of android malware and android analysistechniquesrdquo ACM Computing Surveys (CSUR) vol 49 no 4p 76 2017

[17] M Sun X Li J C S Lui R T B Ma and Z Liang ldquoMonet auser-oriented behavior-based malware variants detectionsystem for androidrdquo IEEE Transactions on Information Fo-rensics and Security vol 12 no 5 pp 1103ndash1112 2017

[18] T Vidas and N Christin ldquoEvading android runtime analysisvia sandbox detectionrdquo in Proceedings of the 9th ACMSymposium on Information Computer and CommunicationsSecurity pp 447ndash458 Kyoto Japan June 2014

[19] Y Duan M Zhang A V Bhaskar et al ldquoings you may notknow about android (Un) packers a systematic study basedon whole-system emulationrdquo in Proceedings of the 2018Network and Distributed System Security Symposium pp 1ndash15 San Diego CA USA February 2018

[20] L Xue X Luo L Yu S Wang and D Wu ldquoAdaptiveunpacking of android appsrdquo in Proceedings of the 2017 IEEEACM 39th International Conference on Software Engineering(ICSE) IEEE Buenos Aires Argentina pp 358ndash369 May2017

[21] Y Zhou and X Jiang ldquoDissecting android malware char-acterization and evolutionrdquo in Proceedings of the 2012 IEEESymposium on Security and Privacy IEEE San Francisco CAUSA pp 95ndash109 2012

[22] N W Lo S K Lu and Y H Chuang ldquoA framework for thirdparty android marketplaces to identify repackaged appsrdquo inProceedings of the 2016 IEEE 14th International Conference onDependable Autonomic and Secure Computing 14th Inter-national Conference on Pervasive Intelligence and Computing2nd International Conference on Big Data Intelligence andComputing and Cyber Science and Technology Congresspp 475ndash482 Auckland New Zealand August 2016

[23] L Li J Gao M Hurier et al ldquoAndrozoo++ collectingmillions of android apps and their metadata for the researchcommunityrdquo 2017 httpsarxivorgabs170905281

[24] W Shi J Cao Q Zhang Y Li and L Xu ldquoEdge computingvision and challengesrdquo IEEE Internet of ings Journal vol 3no 5 pp 637ndash646 2016

[25] W Zhou Y Zhou X Jiang and P Ning ldquoDetectingrepackaged smart-phone applications in third-party androidmarketplacesrdquo in Proceedings of the Second ACM Conferenceon Data and Application Security and Privacy ACM AntonioTX USA pp 317ndash326 February 2012

[26] Y Shao X Luo C Qian P Zhu and L Zhang ldquoTowards ascalable resource-driven approach for detecting repackagedandroid applicationsrdquo in Proceedings of the 30th AnnualComputer Security Applications Conference ACM NewOrleans LA USA pp 56ndash65 December 2014

[27] K Chen P Wang Y Lee et al ldquoFinding unknown malice in10 seconds mass vetting for new threats at the google-playscalerdquo in Proceedings of the USENIX Security Symposiumvol 15 Washington DC USA August 2015

[28] Z Wang C Li Y Guan and Y Xue ldquoDroidchain a novelmalware detection method for android based on behaviorchainrdquo in Proceedings of the 2015 IEEE Conference onCommunications and Network Security (CNS) IEEE Flor-ence Italy pp 727-728 September 2015

[29] K Tian D D Yao B G Ryder G Tan and G PengldquoDetection of repackaged android malware with code-het-erogeneity featuresrdquo IEEE Transactions on Dependable andSecure Computing vol 17 no 1 pp 64ndash77 2017

[30] A De Lorenzo F Martinelli E Medvet F Mercaldo andA Santone ldquoVisualizing the outcome of dynamic analysis ofandroid malware with vizmalrdquo Journal of Information Se-curity and Applications vol 50 Article ID 102423 2020

[31] M Lin D Zhang X Su and T Yu ldquoEffective and scalablerepackaged application detection based on user interfacerdquo inProceedings of the 2017 IEEE Conference on ComputerCommunications Workshops (INFOCOM WKSHPS) IEEEAtlanta GA USA May 2017

[32] G Meng M Patrick Y Xue Y Liu and J Zhang ldquoSecuringandroid app markets via modelling and predicting malwarespread between marketsrdquo IEEE Transactions on InformationForensics and Security vol 14 no 7 pp 1944ndash1959 2018

[33] A Arora and S K Peddoju ldquoMinimizing network trafficfeatures for android mobile malware detectionrdquo in Proceed-ings of the 18th International Conference on DistributedComputing and Networking ACM Hyderabad India January2017

[34] X Wu D Zhang X Su and W Li ldquoDetect repackagedAndroid application based on HTTP traffic similarity httptraffic similarityrdquo Security and Communication Networksvol 8 no 13 pp 2257ndash2266 2015

[35] J Malik and R Kaushal ldquoCredroid android malware de-tection by network traffic analysisrdquo in Proceedings of the 1stACM Workshop on Privacy-Aware Mobile Computing ACMPaderborn Germany pp 28ndash36 July 2016

18 Security and Communication Networks

[36] A Zulkifli I R A Hamid W M Shah and Z AbdullahldquoAndroid malware detection based on network traffic usingdecision tree algorithmrdquo in Proceedings of the InternationalConference on Soft Computing and Data Mining SpringerSenai Malaysia pp 485ndash494 January 2018

[37] Z Chen Q Yan H Han et al ldquoMachine learning basedmobile malware detection using highly imbalanced networktrafficrdquo Information Sciences vol 433-434 pp 346ndash364 2018

[38] G He B Xu L Zhang and H Zhu ldquoMobile app identifi-cation for encrypted network flows by traffic correlationrdquoInternational Journal of Distributed Sensor Networks vol 14no 12 pp 1ndash17 2018

[39] L Xue Y Zhou T Chen X Luo and G Gu ldquoMalton to-wards on-device non-invasive mobile malware analysis forARTrdquo in Proceedings of the 26th USENIX Security Symposiumpp 289ndash306 Vancouver Canada August 2017

[40] M K Alzaylaee S Y Yerima and S Sezer ldquoDL-Droid deeplearning based android malware detection using real devicesrdquoComputers amp Security vol 89 Article ID 101663 2020

[41] A S Shamili C Bauckhage and T Alpcan ldquoMalware de-tection on mobile devices using distributed machine learn-ingrdquo in Proceedings of the 20th International Conference onPattern Recognition (ICPR) IEEE Istanbul Turkeypp 4348ndash4351 August 2010

[42] M Zhao T Zhang F Ge and Z Yuan ldquoRobotdroid alightweight malware detection framework on smartphonesrdquoJournal of Networks vol 7 no 4 p 715 2012

[43] K A Talha D I Alper and C Aydin ldquoAPK auditor per-mission-based android malware detection systemrdquo DigitalInvestigation vol 13 pp 1ndash14 2015

[44] S Arshad M A Shah A Wahid A Mehmood H Song andH Yu ldquoSamadroid a novel 3-level hybrid malware detectionmodel for android operating systemrdquo IEEE Access vol 6pp 4321ndash4339 2018

[45] G He L Zhang B Xu and H Zhu ldquoDetecting repackagedandroid malware based on mobile edge computingrdquo inProceedings of the 2018 Sixth International Conference onAdvanced Cloud and Big Data (CBD) IEEE Lanzhou Chinapp 360ndash365 August 2018

[46] R Roman J Lopez andMMambo ldquoMobile edge computingFog et al a survey and analysis of security threats andchallengesrdquo Future Generation Computer Systems vol 78pp 680ndash698 2018

[47] F Bonomi R Milito J Zhu and S Addepalli ldquoFog com-puting and its role in the internet of thingsrdquo in Proceedings ofthe First Edition of the MCC Workshop on Mobile CloudComputing ACM Helsinki Finland pp 13ndash16 August 2012

[48] M T Beck MWerner S Feld and S Schimper ldquoMobile edgecomputing a taxonomyrdquo in Proceedings of the Sixth Inter-national Conference on Advances in Future Internet pp 48ndash55 Lisbon Portugal November 2014

[49] P Mach and Z Becvar ldquoMobile edge computing a survey onarchitecture and computation offloadingrdquo IEEE Communi-cations Surveys amp Tutorials vol 19 no 3 pp 1628ndash1656 2017

[50] H T Dinh C Lee D Niyato and P Wang ldquoA survey ofmobile cloud computing architecture applications and ap-proachesrdquo Wireless Communications and Mobile Computingvol 13 no 18 pp 1587ndash1611 2013

[51] X Lyu H Tian L Jiang et al ldquoSelective offloading in mobileedge computing for the green internet of thingsrdquo IEEENetwork vol 32 no 1 pp 54ndash60 2018

[52] S Agrawal and J Agrawal ldquoSurvey on anomaly detectionusing data mining techniquesrdquo Procedia Computer Sciencevol 60 pp 708ndash713 2015

[53] A Tongaonkar ldquoA look at the mobile app identificationlandscaperdquo IEEE Internet Computing vol 20 no 4 pp 9ndash152016

[54] G He B Xu and H Zhu ldquoIdentifying mobile applications forencrypted network trafficrdquo in Proceedings of the 2017 FifthInternational Conference on Advanced Cloud and Big Data(CBD) IEEE Shanghai China pp 279ndash284 August 2017

[55] G Ranjan A Tongaonkar and R Torres ldquoApproximatematching of persistent lexicon using search-engines forclassifying mobile app trafficrdquo in Proceedings of the 35thAnnual IEEE International Conference on Computer Com-munications IIEEE San Francisco CA USA April 2016

[56] L Li and S Qu ldquoShort text classification based on improvedITCrdquo Journal of Computer and Communications vol 1 no 4pp 22ndash27 2013

[57] A H Lashkari A F A Kadir L Taheri and A A GhorbanildquoToward developing a systematic approach to generatebenchmark android malware datasets and classificationrdquo inProceedings of the 2018 International Carnahan Conference onSecurity Technology (ICCST) IEEE Montreal Canada Oc-tober 2018

[58] A Rodriguez and A Laio ldquoClustering by fast search and findof density peaksrdquo Science vol 344 no 6191 pp 1492ndash14962014

[59] Y Li Z Yang Y Guo and X Chen ldquoDroidbot a lightweightUI-guided test input generator for androidrdquo in Proceedings ofthe Software Engineering Companion (ICSE-C) pp 23ndash26Buenos Aires Argentina May 2017

[60] M Fan X Luo J Liu et al ldquoGraph embedding based familialanalysis of android malware using unsupervised learningrdquo inProceedings of the 2019 IEEEACM 41st International Con-ference on Software Engineering (ICSE) IEEE Piscataway NJUSA pp 771ndash782 May 2019

Security and Communication Networks 19

Page 14: On-DeviceDetectionofRepackagedAndroidMalwarevia …downloads.hindawi.com/journals/scn/2020/8630748.pdf · 2020-05-31 · According to [4–6], most Android malware programs are repackagedapps,andthemajorityofnewAndroidmalware

difficult to distinguish from their original apps is isbecause repackaged_safe apps only generate smallamounts of additional network flows and their networktraffic features are not clearly distinguishable from thoseof normal traffic Fortunately the average F-measure ofrepackaged malware apps is outstanding is may be dueto the fact that malware typically send stolen data to orreceive commands from remote servers Hence theirnetwork traffic features are special and distinguishableFigure 14 presents an example of app clustering and il-lustrates the distinguishability of the repackaged malwareand the original app ese results demonstrate thatrepackaged Android malware can be effectively detectedby comparing their network traffic with that of originalapps

For repackaged malware apps the accuracy results withdifferent weight values (q in equation (10)) are presented inFigure 15 As depicted in the figure when q 05 we realizethe highest accuracy rate 969 If q 0 namely only thefeatures of encrypted traffic are used the accuracy result is753 e result is 617 when q 1 namely only thefeatures of plaintext traffic are considered ese resultsdemonstrate that both the encrypted and plaintext trafficcontribute to the differentiation of the original apps and therepackaged malware However the encrypted traffic con-tributes more than the plaintext traffic is may becausemalicious apps usually connect to their servers by securityprotocols to evade IDS detection

In the previous experiments we checked the serversrsquohostnames using VirusTotal to reduce the frequency of thefalse positives For various clusters that contained less than20 elements if all the serversrsquo hostnames were judged asclean the cluster was identified as an original app rather thana repackaged one As argued in Section 55 this process onlyreduces the frequency of false alarms To support this ar-gument we list the numbers of true positives and falsepositives of the 143 repackaged malware in Table 7 forcomparison When conducting the hostname checking thenumber of true positives is 1400 and the result is the samewhen the hostname checking is turned off is findingsupports that the malware is detected by clustering thesimilarity values However the number of false positivesincreased from 47 to 213 when the hostname checking wasturned off erefore we suggest checking the serversrsquohostnames after clustering to increase the practicality of themethod Note that false positives still occur when we checkthe hostnames by VirusTotal Hence there may be malwarein the original apps In our dataset the typical examples arecommobovideoplayerpro and cncfshop ele1taoxiaosanWe further analyze these two apps by reading theirdecompiled codes and we determine that they indeed es-tablish connections with suspicious servers In detail invarious runs app commobovideoplayerpro connects withhttpwwwtopappsquarecomwhich is judged as maliciousby CyRadar in VirusTotal App cncfshop ele1taoxiaosanvisits suspicious URL http955ccjSn9 Since these apps arethe original apps they are identified as false positives in theexperiments

In our experiments we chose density peak clustering asthe clustering method because density peak clustering canautomatically determine the correct number of clusters [58]Hence our method can detect various versions of malwarethat are repackaged from the same original app Unfortu-nately in the apps that were downloaded from AndroZoono malware was repackaged from the same original app Toevaluate the efficiency of detecting versions of malware wehave created two malware instances from a popular game

Apps

App

1A

pp2

App

3A

pp4

App

5A

pp6

App

7A

pp8

App

9A

pp10

App

11A

pp12

App

13A

pp14

App

15A

pp16

App

18A

pp17

App

19A

pp20

0

01

02

03

04

05

06

07

08

09

1

F-m

easu

re

Figure 13 Detection F-measure for the 20 selected apps

Table 5 Detailed descriptions of the chosen 20 apps

Main function ofapp

of HTTPflows

of encryptedflows

App1 VideoPlayer 453 51App2 Game 657 1455App3 Browser 1531 1478App4 MusicPlayer 341 25App5 Game 612 289App6 Game 788 123App7 Game 1336 524App8 Fitness 357 892App9 Email 122 431App10 Game 963 790App11 MusicPlayer 1128 547App12 Game 1598 426App13 News 2460 1178App14 News 2891 1543App15 Chat 171 49App16 Email 143 368App17 WeatherForecast 397 120App18 Game 1832 921App19 Fitness 732 362App20 OnlineLearning 2378 834

Table 6 Average accuracy and F-measure

Average accuracy () AverageF-measure

Repackaged_safe 903 078Repackaged_malware 969 094

14 Security and Communication Networks

app namely aircomaceviralmotox3 One is produced bygrafting AnserverBotrsquos source code into air-comaceviralmotox3 and the other is created by adding codefor stealing usersrsquo contacts and short messages and sendingthem to a remote server Similarly each malware is run 10times and the original app aircomaceviralmotox3 is alsorun 10 times for validatione clustering result is presentedin Figure 16 Apparently there are 3 clusters e

experimental results show that the value of F-measure foreach malware is 1 Hence our method can detect variousmalware that were repackaged from the same original app

63 Comparison In this section we compare our methodwith the typical methods that are based on network or cloudas illustrated in Figure 5 e comparative indicators are thedetection accuracy F-measure average power CPU andmemory consumptions of the mobile device First wecompare the proposed method with AppFA [14] which isimplemented completely at the network level Since AppFAalso must analyze app traffic we directly use the traffic sets ofthe repackaged malware apps for comparison While testingAppFA we deleted the label information that was containedin the traffic sets and mixed the network flows as a wholeen we used the signature matching and the constrainedK-means clustering algorithm proposed in [14] to restoreapp sessions namely the label for each flow e similarapps used in AppFA were chosen in Google Play via themethod introduced in [14] and for each repackaged mal-ware 10 similar apps were selected as the peer group epeer group apps were run once by Droidbot to generatenetwork traffic e comparison results are listed in Table 8

Since AppFA need not to install some special apps onmobile devices the extra power memory and CPU con-sumptions are 0 However the detection accuracy is muchlower (863 vs 969) than that of the proposedmethod ascompared in Table 8 In the proposed method we use theedge-client app to record flow metainformation thusnetwork flows can be accurately labeled However inAppFA flow labeling is realized via traffic clustering and theclustering accuracy cannot be 100 Meanwhile AppFAcompares the appsrsquo network behaviors with those of theirsimilar apps which also results in errors In the proposedmethod we directly compare the repackaged malware withits original app and the detection accuracy is increased eaverage power CPU and memory consumptions of theproposed method are also low and acceptable

In Table 8 the power memory and CPU consumptionsfor our edge-client app were obtained with the help of appGT (httpsgithubcomTencentGT) GT is a portable toolthat runs on smartphones for debugging the APP internalparameters and the code time-consuming statistics eaverage values for the power memory and CPU con-sumptions are calculated as follows Before running an appwith Droidbot we started the edge-client app by GT Whenthe Droidbot was stopped we recoded the power CPU andmemory consumptions We repeated the process until allapps (including 143 repackaged malware apps and theiroriginal versions) had been tested Via this way we obtainedthe average power CPU and memory consumptions for theedge-client app According to Table 8 the edge-client app islightweight and has minimal impact on the performances ofthe mobile devices

en we compare the proposed method with SAMA-Droid [44] SAMADroid monitors Android system calls andsends the calls to a centralized server for the detection ofAndroid malware SAMADroid is a typical cloud-based on-

Decision graph rho-delta-rho

20 251510 35 403050Sorted rho

00

02

04

06

08

Rho lowast

delt

a

Figure 14 Example of app clustering Obviously two clusters areidentified e dot in the top left corner is an outlier e outlier isdetected as an abnormal but harmless app by our method isfigure was generated by the density peak clustering tool

01 02 03 04 05 06 07 08 09 10Value of q

0

01

02

03

04

05

06

07

08

09

1

Acc

urac

y

Figure 15 Accuracies of various weight values for traffic contentsimilarity and traffic behavior similarity

Table 7 True positives and false positives with and withouthostname checking

of true positives of false positivesWith hostname checking 1400 47Without hostnamechecking 1400 213

Security and Communication Networks 15

device detection method We have implemented the dy-namic detection according to the description in [44] Indetail the APIs that are invoked by the running apps aremonitored by the tool Sensitive API Monitor (httpsgithubcomcyruliuSensitive-API-Monitor) As done in [44] thefrequency values of 10 system calls (open ioctl brk readwrite close sendto sendmsg recvfrom and recvmsg) arerecorded ese recorded frequency values are sent toWindows Azure for malware detection We evaluate thedynamic detection performance of SAMADroid on our 143repackaged_malware apps and their original versions In theevaluation the frequency values of the original apps are usedto create the normality model and the frequency values ofthe repackaged malware apps are used for testing We alsouse the app GT to record the power memory and CPUconsumptions of SAMADroid Note that our phones wereall rooted and Sensitive_API_Monitor and GT can be usede experimental results are presented in Table 9

Last we compare our method with Shabtairsquos method [8]Shabtairsquos method detects Android malware at the mobiledevices and it considers the apprsquos network traffic patternsonly the HTTP traffic contents are not considered isgives us an opportunity to compare our method withmethods that have been specially designed for encryptedtraffic We have implemented Shabtairsquos method and chosenthe feature subset 2 as the detection features Feature subset2 includes Avg Sent Bytes Avg Rcvd Bytes and Pct ofAvg Rcvd Bytes among other values and it is proved to be

the best feature subset for malware detection [8] e De-cisionRegression tree is chosen as the classification algo-rithm e testing dataset is the traffic that was generated bythe 143 repackaged malware instances and the training set isthe traffic that was generated by the corresponding originalapps e classification results are listed in Table 10According to the table the detection accuracy and F-mea-sure values of our method are significantly higher than thoseof Shabtairsquos method is indicates that the HTTP trafficprovides useful information for Android malware detectionSince Shabtairsquos method is conducted at the mobile devices itconsumes more resources than our method as compared inTable 10

7 Discussion

e results of our extensive experiments demonstrate thatrepackaged Android malware can be efficiently and effec-tively detected by clustering network traffic at the edgedevices However there are still limitations for our methodFirst we assume the most of the devices have installed theoriginal apps (rlt (12)u as described in Section 51) is isreasonable in most cases since mobile users typicallydownload apps from the official markets However in areaswhere official markets such as the Google play and Amazonare banned it is possible that the most mobile devices haveinstalled the repackaged apps is may cause false positivesSecond if the malware does not generate sufficient trafficour method cannot detect it For example some malwareonly sends fraudulent premium SMS messages is type ofmalware cannot be detected by monitoring its networktraffic ird we rely on third-party tools such as Virustotalfor the identification of malicious URLs and to reduce thenumber of false alarms If a malicious app rents legalplatforms such as the public cloud as the remote controlserver our method will judge it as a normal app Hence falsenegatives will be generated

To overcome these limitations we could combine theproposed method with available methods such as SAMA-Droid In such cases the edge-client app will not only recordflow metainformation but also monitor app system be-haviors However the system behavior monitoring does notneed to be conducted all the time If abnormal but harmlessor lower traffic apps are detected the edge-client app will beinformed to record the corresponding appsrsquo system be-haviors Similarly the edge-client app can send the appsystem behaviors to the edge server for further processinge performance of this combined method will be evaluatedin our future work One can also combine the proposedframework with AppFA [14] for the detection of other typesof Android malware For example the edge extracts apptraffic behavior features and the cloud can compare appsrsquobehaviors with their peer groups to identify abnormalitiesFurthermore one could first use clustering techniques [60]to group repackaged apps according to the same maliciousmodules and then analyze the behaviors of the representativerepackaged apps in each group to obtain more accuratebehavior features

y

2D nonclassical multidimensional scaling

00ndash02 02ndash04x

ndash04

ndash03

ndash02

ndash01

00

01

02

03

04

Figure 16 Clustering results of various malware that wererepackaged from aircomaceviralmotox3e outlier in the middleis detected as an abnormal but harmless app is figure wasgenerated by the density peak clustering tool

Table 8 Comparison of the proposed method with AppFA

Proposed method AppFAAccuracy 969 863F-measure 094 075Average power consumption 12 0Average CPU consumption 06 0Average memory consumption 654MB 0

16 Security and Communication Networks

e paradigms of edge computing are used to detectAndroid malware in this study However several challengeswould be encountered when deploying it in practice Firstthe mobile devices must install a specified app for thecollection of essential information which may cause con-cerns and mobile users may refuse to cooperate Second inedge computing the data should be preprocessed by the edgeto better protect the usersrsquo privacy However the edge servercan observe all the data (including the sensitive informa-tion) and it threatens the privacy of users For example ifthe edge is deployed by the cloud provider the usersrsquo privacyinformation is also available to the cloud ere is no dif-ference with the user-cloud model for privacy protectionunless the edge server is deployed by the userird the edgeserver should be well protected from network attacksOtherwise it will be convenient for hackers to steal usersrsquoprivate information

8 Conclusion

In this paper we propose a novel method for detectingAndroid malware based on edge computing To the best ofour knowledge the detection of Android malware by uti-lizing edge computing and traffic clustering has not beenconsidered previously By introducing the edge layer themain tasks of the mobile device are offloaded to the edgeserver and the mobile device only needs to record flowmetainformation which substantially conserves the re-sources of the mobile device e edge server extracts appsrsquonetwork traffic features and sends these features to the cloudplatform for further analysis us the usersrsquo privacy can bebetter protected With the received features the cloud cal-culates the similarities between apps and clusters thesesimilarity values to separate the original apps and themalware automatically We evaluated our methods on 400Android apps and the experimental results demonstratethat the detection accuracy can reach 100 for several appsand the average accuracy is 969 We also tested theperformance of the proposed method and compared it with

the typical approaches e comparison results show thatour method can detect repackaged Android malware withboth higher detection accuracy and lower power and re-source consumptions In the future we will continue to workto overcome the discussed limitations and to evaluate theproposed method in real world scenarios

Data Availability

e downloaded apps used to support the findings of thisstudy have been deposited in httpspanbaiducoms1b8vMceIdtdd1gBRC5Gbzvw Its extraction code is vd5re clustering code is also in the repository Since appnetwork traffic contains usersrsquo sensitive information itcannot be publicly shared However we have explained allthe steps to generate network traffic automatically esesteps are also described in the instructionspdf in therepository

Conflicts of Interest

e authors declare that they have no conflicts of interest

Acknowledgments

is work was supported by National Key TechnologiesResearch and Development Program of China under Grant2018YFD0401404 National Natural Science Foundation ofChina under Grants 61702282 61802192 and 71801123Natural Science Foundation of the Jiangsu Higher EducationInstitutions of China under Grants 18KJB520024 and17KJB520023 Nanjing Forestry University under GrantsGXL016 and CX2016026 and NUPTSF under GrantNY217143

References

[1] Statista Mobile Operating Systemsrsquo Market Share Worldwidefrom January 2012 to December 2019 Statista HamburgGermany 2009 httpswwwstatistacomstatistics272698global-market-share-held-by-mobile-operating-systems-since-2009

[2] W Martin F Sarro Y Jia Y Zhang and M Harman ldquoAsurvey of app store analysis for software engineeringrdquo IEEETransactions on Software Engineering vol 43 no 9pp 817ndash847 2017

[3] httpswwwappbraincomstatsnumber-of-android-apps2020

[4] L Li D Li T F Bissyande et al ldquoUnderstanding android apppiggybacking a systematic study of malicious code graftingrdquoIEEE Transactions on Information Forensics and Securityvol 12 no 6 pp 1269ndash1284 2017

[5] M Fan J Liu X Luo et al ldquoAndroid malware familialclassification and representative sample selection via frequentsubgraph analysisrdquo IEEE Transactions on Information Fo-rensics and Security vol 13 no 8 pp 1890ndash1905 2018

[6] J Zhang Z Qin K Zhang H Yin and J Zou ldquoDalvik opcodegraph based android malware variants detection using globaltopology featuresrdquo IEEE Access vol 6 pp 51964ndash51974 2018

[7] S Garg S K Peddoju and A K Sarje ldquoNetwork-baseddetection of android malicious appsrdquo International Journal ofInformation Security vol 16 no 4 pp 385ndash400 2016

Table 9 Comparison of the proposed method with SAMADroid

Proposed method SAMADroidAccuracy 969 831F-measure 0940 0793Average power consumption 12 23Average CPU consumption 06 08Average memory consumption 654MB 773MB

Table 10 Comparison of the proposed method with Shabtairsquosmethod

Proposed method Shabtairsquosmethod

Accuracy 969 771F-measure 0940 0695Average power consumption 12 77Average CPU consumption 06 23Average memory consumption 654MB 937MB

Security and Communication Networks 17

[8] A Shabtai L Tenenboim-Chekina D Mimran L RokachB Shapira and Y Elovici ldquoMobile malware detection throughanalysis of deviations in application network behaviorrdquoComputers amp Security vol 43 pp 1ndash18 2014

[9] L Xie X Zhang J-P Seifert and S Zhu ldquopBMDS a be-havior-based malware detection system for cellphone de-vicesrdquo in Proceedings of the ird ACM Conference onWireless Network Security ACM New York NY USApp 37ndash48 2010

[10] I Burguera U Zurutuza and S Nadjm-Tehrani ldquoCrowdroidbehavior-based malware detection system for androidrdquo inProceedings of the 1st ACMWorkshop on Security and Privacyin Smartphones and Mobile Devices ACM Chicago IL USApp 15ndash26 October 2011

[11] G Dini F Martinelli A Saracino and D SgandurraldquoMADAM a multi-level anomaly detector for android mal-warerdquo in Proceedings of the International Conference onMathematical Methods Models and Architectures for Com-puter Network Security vol 12 Springer St PetersburgRussia pp 240ndash253 2012

[12] Y Zhang M Yang B Xu et al ldquoVetting undesirable be-haviors in android apps with permission use analysisrdquo inProceedings of the 2013 ACM SIGSAC Conference on Com-puter amp Communications Security ACM Berlin Germanypp 611ndash622 November 2013

[13] SWang Z Chen X Li LWang K Ji and C Zhao ldquoAndroidmalware clustering analysis on network-level behaviorrdquo inProceedings of the International Conference on IntelligentComputing Springer Liverpool UK pp 796ndash807 August2017

[14] G He B Xu and H Zhu ldquoAppFA a novel approach to detectmalicious android applications on the networkrdquo in Securityand Communication Networks Wiley Hoboken NJ USA2018

[15] X Wang Y Yang and Y Zeng ldquoAccurate mobile malwaredetection and classification in the cloudrdquo SpringerPlus vol 4no 1 p 583 2015

[16] K Tam A Feizollah N B Anuar R Salleh and L Cavallaroldquoe evolution of android malware and android analysistechniquesrdquo ACM Computing Surveys (CSUR) vol 49 no 4p 76 2017

[17] M Sun X Li J C S Lui R T B Ma and Z Liang ldquoMonet auser-oriented behavior-based malware variants detectionsystem for androidrdquo IEEE Transactions on Information Fo-rensics and Security vol 12 no 5 pp 1103ndash1112 2017

[18] T Vidas and N Christin ldquoEvading android runtime analysisvia sandbox detectionrdquo in Proceedings of the 9th ACMSymposium on Information Computer and CommunicationsSecurity pp 447ndash458 Kyoto Japan June 2014

[19] Y Duan M Zhang A V Bhaskar et al ldquoings you may notknow about android (Un) packers a systematic study basedon whole-system emulationrdquo in Proceedings of the 2018Network and Distributed System Security Symposium pp 1ndash15 San Diego CA USA February 2018

[20] L Xue X Luo L Yu S Wang and D Wu ldquoAdaptiveunpacking of android appsrdquo in Proceedings of the 2017 IEEEACM 39th International Conference on Software Engineering(ICSE) IEEE Buenos Aires Argentina pp 358ndash369 May2017

[21] Y Zhou and X Jiang ldquoDissecting android malware char-acterization and evolutionrdquo in Proceedings of the 2012 IEEESymposium on Security and Privacy IEEE San Francisco CAUSA pp 95ndash109 2012

[22] N W Lo S K Lu and Y H Chuang ldquoA framework for thirdparty android marketplaces to identify repackaged appsrdquo inProceedings of the 2016 IEEE 14th International Conference onDependable Autonomic and Secure Computing 14th Inter-national Conference on Pervasive Intelligence and Computing2nd International Conference on Big Data Intelligence andComputing and Cyber Science and Technology Congresspp 475ndash482 Auckland New Zealand August 2016

[23] L Li J Gao M Hurier et al ldquoAndrozoo++ collectingmillions of android apps and their metadata for the researchcommunityrdquo 2017 httpsarxivorgabs170905281

[24] W Shi J Cao Q Zhang Y Li and L Xu ldquoEdge computingvision and challengesrdquo IEEE Internet of ings Journal vol 3no 5 pp 637ndash646 2016

[25] W Zhou Y Zhou X Jiang and P Ning ldquoDetectingrepackaged smart-phone applications in third-party androidmarketplacesrdquo in Proceedings of the Second ACM Conferenceon Data and Application Security and Privacy ACM AntonioTX USA pp 317ndash326 February 2012

[26] Y Shao X Luo C Qian P Zhu and L Zhang ldquoTowards ascalable resource-driven approach for detecting repackagedandroid applicationsrdquo in Proceedings of the 30th AnnualComputer Security Applications Conference ACM NewOrleans LA USA pp 56ndash65 December 2014

[27] K Chen P Wang Y Lee et al ldquoFinding unknown malice in10 seconds mass vetting for new threats at the google-playscalerdquo in Proceedings of the USENIX Security Symposiumvol 15 Washington DC USA August 2015

[28] Z Wang C Li Y Guan and Y Xue ldquoDroidchain a novelmalware detection method for android based on behaviorchainrdquo in Proceedings of the 2015 IEEE Conference onCommunications and Network Security (CNS) IEEE Flor-ence Italy pp 727-728 September 2015

[29] K Tian D D Yao B G Ryder G Tan and G PengldquoDetection of repackaged android malware with code-het-erogeneity featuresrdquo IEEE Transactions on Dependable andSecure Computing vol 17 no 1 pp 64ndash77 2017

[30] A De Lorenzo F Martinelli E Medvet F Mercaldo andA Santone ldquoVisualizing the outcome of dynamic analysis ofandroid malware with vizmalrdquo Journal of Information Se-curity and Applications vol 50 Article ID 102423 2020

[31] M Lin D Zhang X Su and T Yu ldquoEffective and scalablerepackaged application detection based on user interfacerdquo inProceedings of the 2017 IEEE Conference on ComputerCommunications Workshops (INFOCOM WKSHPS) IEEEAtlanta GA USA May 2017

[32] G Meng M Patrick Y Xue Y Liu and J Zhang ldquoSecuringandroid app markets via modelling and predicting malwarespread between marketsrdquo IEEE Transactions on InformationForensics and Security vol 14 no 7 pp 1944ndash1959 2018

[33] A Arora and S K Peddoju ldquoMinimizing network trafficfeatures for android mobile malware detectionrdquo in Proceed-ings of the 18th International Conference on DistributedComputing and Networking ACM Hyderabad India January2017

[34] X Wu D Zhang X Su and W Li ldquoDetect repackagedAndroid application based on HTTP traffic similarity httptraffic similarityrdquo Security and Communication Networksvol 8 no 13 pp 2257ndash2266 2015

[35] J Malik and R Kaushal ldquoCredroid android malware de-tection by network traffic analysisrdquo in Proceedings of the 1stACM Workshop on Privacy-Aware Mobile Computing ACMPaderborn Germany pp 28ndash36 July 2016

18 Security and Communication Networks

[36] A Zulkifli I R A Hamid W M Shah and Z AbdullahldquoAndroid malware detection based on network traffic usingdecision tree algorithmrdquo in Proceedings of the InternationalConference on Soft Computing and Data Mining SpringerSenai Malaysia pp 485ndash494 January 2018

[37] Z Chen Q Yan H Han et al ldquoMachine learning basedmobile malware detection using highly imbalanced networktrafficrdquo Information Sciences vol 433-434 pp 346ndash364 2018

[38] G He B Xu L Zhang and H Zhu ldquoMobile app identifi-cation for encrypted network flows by traffic correlationrdquoInternational Journal of Distributed Sensor Networks vol 14no 12 pp 1ndash17 2018

[39] L Xue Y Zhou T Chen X Luo and G Gu ldquoMalton to-wards on-device non-invasive mobile malware analysis forARTrdquo in Proceedings of the 26th USENIX Security Symposiumpp 289ndash306 Vancouver Canada August 2017

[40] M K Alzaylaee S Y Yerima and S Sezer ldquoDL-Droid deeplearning based android malware detection using real devicesrdquoComputers amp Security vol 89 Article ID 101663 2020

[41] A S Shamili C Bauckhage and T Alpcan ldquoMalware de-tection on mobile devices using distributed machine learn-ingrdquo in Proceedings of the 20th International Conference onPattern Recognition (ICPR) IEEE Istanbul Turkeypp 4348ndash4351 August 2010

[42] M Zhao T Zhang F Ge and Z Yuan ldquoRobotdroid alightweight malware detection framework on smartphonesrdquoJournal of Networks vol 7 no 4 p 715 2012

[43] K A Talha D I Alper and C Aydin ldquoAPK auditor per-mission-based android malware detection systemrdquo DigitalInvestigation vol 13 pp 1ndash14 2015

[44] S Arshad M A Shah A Wahid A Mehmood H Song andH Yu ldquoSamadroid a novel 3-level hybrid malware detectionmodel for android operating systemrdquo IEEE Access vol 6pp 4321ndash4339 2018

[45] G He L Zhang B Xu and H Zhu ldquoDetecting repackagedandroid malware based on mobile edge computingrdquo inProceedings of the 2018 Sixth International Conference onAdvanced Cloud and Big Data (CBD) IEEE Lanzhou Chinapp 360ndash365 August 2018

[46] R Roman J Lopez andMMambo ldquoMobile edge computingFog et al a survey and analysis of security threats andchallengesrdquo Future Generation Computer Systems vol 78pp 680ndash698 2018

[47] F Bonomi R Milito J Zhu and S Addepalli ldquoFog com-puting and its role in the internet of thingsrdquo in Proceedings ofthe First Edition of the MCC Workshop on Mobile CloudComputing ACM Helsinki Finland pp 13ndash16 August 2012

[48] M T Beck MWerner S Feld and S Schimper ldquoMobile edgecomputing a taxonomyrdquo in Proceedings of the Sixth Inter-national Conference on Advances in Future Internet pp 48ndash55 Lisbon Portugal November 2014

[49] P Mach and Z Becvar ldquoMobile edge computing a survey onarchitecture and computation offloadingrdquo IEEE Communi-cations Surveys amp Tutorials vol 19 no 3 pp 1628ndash1656 2017

[50] H T Dinh C Lee D Niyato and P Wang ldquoA survey ofmobile cloud computing architecture applications and ap-proachesrdquo Wireless Communications and Mobile Computingvol 13 no 18 pp 1587ndash1611 2013

[51] X Lyu H Tian L Jiang et al ldquoSelective offloading in mobileedge computing for the green internet of thingsrdquo IEEENetwork vol 32 no 1 pp 54ndash60 2018

[52] S Agrawal and J Agrawal ldquoSurvey on anomaly detectionusing data mining techniquesrdquo Procedia Computer Sciencevol 60 pp 708ndash713 2015

[53] A Tongaonkar ldquoA look at the mobile app identificationlandscaperdquo IEEE Internet Computing vol 20 no 4 pp 9ndash152016

[54] G He B Xu and H Zhu ldquoIdentifying mobile applications forencrypted network trafficrdquo in Proceedings of the 2017 FifthInternational Conference on Advanced Cloud and Big Data(CBD) IEEE Shanghai China pp 279ndash284 August 2017

[55] G Ranjan A Tongaonkar and R Torres ldquoApproximatematching of persistent lexicon using search-engines forclassifying mobile app trafficrdquo in Proceedings of the 35thAnnual IEEE International Conference on Computer Com-munications IIEEE San Francisco CA USA April 2016

[56] L Li and S Qu ldquoShort text classification based on improvedITCrdquo Journal of Computer and Communications vol 1 no 4pp 22ndash27 2013

[57] A H Lashkari A F A Kadir L Taheri and A A GhorbanildquoToward developing a systematic approach to generatebenchmark android malware datasets and classificationrdquo inProceedings of the 2018 International Carnahan Conference onSecurity Technology (ICCST) IEEE Montreal Canada Oc-tober 2018

[58] A Rodriguez and A Laio ldquoClustering by fast search and findof density peaksrdquo Science vol 344 no 6191 pp 1492ndash14962014

[59] Y Li Z Yang Y Guo and X Chen ldquoDroidbot a lightweightUI-guided test input generator for androidrdquo in Proceedings ofthe Software Engineering Companion (ICSE-C) pp 23ndash26Buenos Aires Argentina May 2017

[60] M Fan X Luo J Liu et al ldquoGraph embedding based familialanalysis of android malware using unsupervised learningrdquo inProceedings of the 2019 IEEEACM 41st International Con-ference on Software Engineering (ICSE) IEEE Piscataway NJUSA pp 771ndash782 May 2019

Security and Communication Networks 19

Page 15: On-DeviceDetectionofRepackagedAndroidMalwarevia …downloads.hindawi.com/journals/scn/2020/8630748.pdf · 2020-05-31 · According to [4–6], most Android malware programs are repackagedapps,andthemajorityofnewAndroidmalware

app namely aircomaceviralmotox3 One is produced bygrafting AnserverBotrsquos source code into air-comaceviralmotox3 and the other is created by adding codefor stealing usersrsquo contacts and short messages and sendingthem to a remote server Similarly each malware is run 10times and the original app aircomaceviralmotox3 is alsorun 10 times for validatione clustering result is presentedin Figure 16 Apparently there are 3 clusters e

experimental results show that the value of F-measure foreach malware is 1 Hence our method can detect variousmalware that were repackaged from the same original app

63 Comparison In this section we compare our methodwith the typical methods that are based on network or cloudas illustrated in Figure 5 e comparative indicators are thedetection accuracy F-measure average power CPU andmemory consumptions of the mobile device First wecompare the proposed method with AppFA [14] which isimplemented completely at the network level Since AppFAalso must analyze app traffic we directly use the traffic sets ofthe repackaged malware apps for comparison While testingAppFA we deleted the label information that was containedin the traffic sets and mixed the network flows as a wholeen we used the signature matching and the constrainedK-means clustering algorithm proposed in [14] to restoreapp sessions namely the label for each flow e similarapps used in AppFA were chosen in Google Play via themethod introduced in [14] and for each repackaged mal-ware 10 similar apps were selected as the peer group epeer group apps were run once by Droidbot to generatenetwork traffic e comparison results are listed in Table 8

Since AppFA need not to install some special apps onmobile devices the extra power memory and CPU con-sumptions are 0 However the detection accuracy is muchlower (863 vs 969) than that of the proposedmethod ascompared in Table 8 In the proposed method we use theedge-client app to record flow metainformation thusnetwork flows can be accurately labeled However inAppFA flow labeling is realized via traffic clustering and theclustering accuracy cannot be 100 Meanwhile AppFAcompares the appsrsquo network behaviors with those of theirsimilar apps which also results in errors In the proposedmethod we directly compare the repackaged malware withits original app and the detection accuracy is increased eaverage power CPU and memory consumptions of theproposed method are also low and acceptable

In Table 8 the power memory and CPU consumptionsfor our edge-client app were obtained with the help of appGT (httpsgithubcomTencentGT) GT is a portable toolthat runs on smartphones for debugging the APP internalparameters and the code time-consuming statistics eaverage values for the power memory and CPU con-sumptions are calculated as follows Before running an appwith Droidbot we started the edge-client app by GT Whenthe Droidbot was stopped we recoded the power CPU andmemory consumptions We repeated the process until allapps (including 143 repackaged malware apps and theiroriginal versions) had been tested Via this way we obtainedthe average power CPU and memory consumptions for theedge-client app According to Table 8 the edge-client app islightweight and has minimal impact on the performances ofthe mobile devices

en we compare the proposed method with SAMA-Droid [44] SAMADroid monitors Android system calls andsends the calls to a centralized server for the detection ofAndroid malware SAMADroid is a typical cloud-based on-

Decision graph rho-delta-rho

20 251510 35 403050Sorted rho

00

02

04

06

08

Rho lowast

delt

a

Figure 14 Example of app clustering Obviously two clusters areidentified e dot in the top left corner is an outlier e outlier isdetected as an abnormal but harmless app by our method isfigure was generated by the density peak clustering tool

01 02 03 04 05 06 07 08 09 10Value of q

0

01

02

03

04

05

06

07

08

09

1

Acc

urac

y

Figure 15 Accuracies of various weight values for traffic contentsimilarity and traffic behavior similarity

Table 7 True positives and false positives with and withouthostname checking

of true positives of false positivesWith hostname checking 1400 47Without hostnamechecking 1400 213

Security and Communication Networks 15

device detection method We have implemented the dy-namic detection according to the description in [44] Indetail the APIs that are invoked by the running apps aremonitored by the tool Sensitive API Monitor (httpsgithubcomcyruliuSensitive-API-Monitor) As done in [44] thefrequency values of 10 system calls (open ioctl brk readwrite close sendto sendmsg recvfrom and recvmsg) arerecorded ese recorded frequency values are sent toWindows Azure for malware detection We evaluate thedynamic detection performance of SAMADroid on our 143repackaged_malware apps and their original versions In theevaluation the frequency values of the original apps are usedto create the normality model and the frequency values ofthe repackaged malware apps are used for testing We alsouse the app GT to record the power memory and CPUconsumptions of SAMADroid Note that our phones wereall rooted and Sensitive_API_Monitor and GT can be usede experimental results are presented in Table 9

Last we compare our method with Shabtairsquos method [8]Shabtairsquos method detects Android malware at the mobiledevices and it considers the apprsquos network traffic patternsonly the HTTP traffic contents are not considered isgives us an opportunity to compare our method withmethods that have been specially designed for encryptedtraffic We have implemented Shabtairsquos method and chosenthe feature subset 2 as the detection features Feature subset2 includes Avg Sent Bytes Avg Rcvd Bytes and Pct ofAvg Rcvd Bytes among other values and it is proved to be

the best feature subset for malware detection [8] e De-cisionRegression tree is chosen as the classification algo-rithm e testing dataset is the traffic that was generated bythe 143 repackaged malware instances and the training set isthe traffic that was generated by the corresponding originalapps e classification results are listed in Table 10According to the table the detection accuracy and F-mea-sure values of our method are significantly higher than thoseof Shabtairsquos method is indicates that the HTTP trafficprovides useful information for Android malware detectionSince Shabtairsquos method is conducted at the mobile devices itconsumes more resources than our method as compared inTable 10

7 Discussion

e results of our extensive experiments demonstrate thatrepackaged Android malware can be efficiently and effec-tively detected by clustering network traffic at the edgedevices However there are still limitations for our methodFirst we assume the most of the devices have installed theoriginal apps (rlt (12)u as described in Section 51) is isreasonable in most cases since mobile users typicallydownload apps from the official markets However in areaswhere official markets such as the Google play and Amazonare banned it is possible that the most mobile devices haveinstalled the repackaged apps is may cause false positivesSecond if the malware does not generate sufficient trafficour method cannot detect it For example some malwareonly sends fraudulent premium SMS messages is type ofmalware cannot be detected by monitoring its networktraffic ird we rely on third-party tools such as Virustotalfor the identification of malicious URLs and to reduce thenumber of false alarms If a malicious app rents legalplatforms such as the public cloud as the remote controlserver our method will judge it as a normal app Hence falsenegatives will be generated

To overcome these limitations we could combine theproposed method with available methods such as SAMA-Droid In such cases the edge-client app will not only recordflow metainformation but also monitor app system be-haviors However the system behavior monitoring does notneed to be conducted all the time If abnormal but harmlessor lower traffic apps are detected the edge-client app will beinformed to record the corresponding appsrsquo system be-haviors Similarly the edge-client app can send the appsystem behaviors to the edge server for further processinge performance of this combined method will be evaluatedin our future work One can also combine the proposedframework with AppFA [14] for the detection of other typesof Android malware For example the edge extracts apptraffic behavior features and the cloud can compare appsrsquobehaviors with their peer groups to identify abnormalitiesFurthermore one could first use clustering techniques [60]to group repackaged apps according to the same maliciousmodules and then analyze the behaviors of the representativerepackaged apps in each group to obtain more accuratebehavior features

y

2D nonclassical multidimensional scaling

00ndash02 02ndash04x

ndash04

ndash03

ndash02

ndash01

00

01

02

03

04

Figure 16 Clustering results of various malware that wererepackaged from aircomaceviralmotox3e outlier in the middleis detected as an abnormal but harmless app is figure wasgenerated by the density peak clustering tool

Table 8 Comparison of the proposed method with AppFA

Proposed method AppFAAccuracy 969 863F-measure 094 075Average power consumption 12 0Average CPU consumption 06 0Average memory consumption 654MB 0

16 Security and Communication Networks

e paradigms of edge computing are used to detectAndroid malware in this study However several challengeswould be encountered when deploying it in practice Firstthe mobile devices must install a specified app for thecollection of essential information which may cause con-cerns and mobile users may refuse to cooperate Second inedge computing the data should be preprocessed by the edgeto better protect the usersrsquo privacy However the edge servercan observe all the data (including the sensitive informa-tion) and it threatens the privacy of users For example ifthe edge is deployed by the cloud provider the usersrsquo privacyinformation is also available to the cloud ere is no dif-ference with the user-cloud model for privacy protectionunless the edge server is deployed by the userird the edgeserver should be well protected from network attacksOtherwise it will be convenient for hackers to steal usersrsquoprivate information

8 Conclusion

In this paper we propose a novel method for detectingAndroid malware based on edge computing To the best ofour knowledge the detection of Android malware by uti-lizing edge computing and traffic clustering has not beenconsidered previously By introducing the edge layer themain tasks of the mobile device are offloaded to the edgeserver and the mobile device only needs to record flowmetainformation which substantially conserves the re-sources of the mobile device e edge server extracts appsrsquonetwork traffic features and sends these features to the cloudplatform for further analysis us the usersrsquo privacy can bebetter protected With the received features the cloud cal-culates the similarities between apps and clusters thesesimilarity values to separate the original apps and themalware automatically We evaluated our methods on 400Android apps and the experimental results demonstratethat the detection accuracy can reach 100 for several appsand the average accuracy is 969 We also tested theperformance of the proposed method and compared it with

the typical approaches e comparison results show thatour method can detect repackaged Android malware withboth higher detection accuracy and lower power and re-source consumptions In the future we will continue to workto overcome the discussed limitations and to evaluate theproposed method in real world scenarios

Data Availability

e downloaded apps used to support the findings of thisstudy have been deposited in httpspanbaiducoms1b8vMceIdtdd1gBRC5Gbzvw Its extraction code is vd5re clustering code is also in the repository Since appnetwork traffic contains usersrsquo sensitive information itcannot be publicly shared However we have explained allthe steps to generate network traffic automatically esesteps are also described in the instructionspdf in therepository

Conflicts of Interest

e authors declare that they have no conflicts of interest

Acknowledgments

is work was supported by National Key TechnologiesResearch and Development Program of China under Grant2018YFD0401404 National Natural Science Foundation ofChina under Grants 61702282 61802192 and 71801123Natural Science Foundation of the Jiangsu Higher EducationInstitutions of China under Grants 18KJB520024 and17KJB520023 Nanjing Forestry University under GrantsGXL016 and CX2016026 and NUPTSF under GrantNY217143

References

[1] Statista Mobile Operating Systemsrsquo Market Share Worldwidefrom January 2012 to December 2019 Statista HamburgGermany 2009 httpswwwstatistacomstatistics272698global-market-share-held-by-mobile-operating-systems-since-2009

[2] W Martin F Sarro Y Jia Y Zhang and M Harman ldquoAsurvey of app store analysis for software engineeringrdquo IEEETransactions on Software Engineering vol 43 no 9pp 817ndash847 2017

[3] httpswwwappbraincomstatsnumber-of-android-apps2020

[4] L Li D Li T F Bissyande et al ldquoUnderstanding android apppiggybacking a systematic study of malicious code graftingrdquoIEEE Transactions on Information Forensics and Securityvol 12 no 6 pp 1269ndash1284 2017

[5] M Fan J Liu X Luo et al ldquoAndroid malware familialclassification and representative sample selection via frequentsubgraph analysisrdquo IEEE Transactions on Information Fo-rensics and Security vol 13 no 8 pp 1890ndash1905 2018

[6] J Zhang Z Qin K Zhang H Yin and J Zou ldquoDalvik opcodegraph based android malware variants detection using globaltopology featuresrdquo IEEE Access vol 6 pp 51964ndash51974 2018

[7] S Garg S K Peddoju and A K Sarje ldquoNetwork-baseddetection of android malicious appsrdquo International Journal ofInformation Security vol 16 no 4 pp 385ndash400 2016

Table 9 Comparison of the proposed method with SAMADroid

Proposed method SAMADroidAccuracy 969 831F-measure 0940 0793Average power consumption 12 23Average CPU consumption 06 08Average memory consumption 654MB 773MB

Table 10 Comparison of the proposed method with Shabtairsquosmethod

Proposed method Shabtairsquosmethod

Accuracy 969 771F-measure 0940 0695Average power consumption 12 77Average CPU consumption 06 23Average memory consumption 654MB 937MB

Security and Communication Networks 17

[8] A Shabtai L Tenenboim-Chekina D Mimran L RokachB Shapira and Y Elovici ldquoMobile malware detection throughanalysis of deviations in application network behaviorrdquoComputers amp Security vol 43 pp 1ndash18 2014

[9] L Xie X Zhang J-P Seifert and S Zhu ldquopBMDS a be-havior-based malware detection system for cellphone de-vicesrdquo in Proceedings of the ird ACM Conference onWireless Network Security ACM New York NY USApp 37ndash48 2010

[10] I Burguera U Zurutuza and S Nadjm-Tehrani ldquoCrowdroidbehavior-based malware detection system for androidrdquo inProceedings of the 1st ACMWorkshop on Security and Privacyin Smartphones and Mobile Devices ACM Chicago IL USApp 15ndash26 October 2011

[11] G Dini F Martinelli A Saracino and D SgandurraldquoMADAM a multi-level anomaly detector for android mal-warerdquo in Proceedings of the International Conference onMathematical Methods Models and Architectures for Com-puter Network Security vol 12 Springer St PetersburgRussia pp 240ndash253 2012

[12] Y Zhang M Yang B Xu et al ldquoVetting undesirable be-haviors in android apps with permission use analysisrdquo inProceedings of the 2013 ACM SIGSAC Conference on Com-puter amp Communications Security ACM Berlin Germanypp 611ndash622 November 2013

[13] SWang Z Chen X Li LWang K Ji and C Zhao ldquoAndroidmalware clustering analysis on network-level behaviorrdquo inProceedings of the International Conference on IntelligentComputing Springer Liverpool UK pp 796ndash807 August2017

[14] G He B Xu and H Zhu ldquoAppFA a novel approach to detectmalicious android applications on the networkrdquo in Securityand Communication Networks Wiley Hoboken NJ USA2018

[15] X Wang Y Yang and Y Zeng ldquoAccurate mobile malwaredetection and classification in the cloudrdquo SpringerPlus vol 4no 1 p 583 2015

[16] K Tam A Feizollah N B Anuar R Salleh and L Cavallaroldquoe evolution of android malware and android analysistechniquesrdquo ACM Computing Surveys (CSUR) vol 49 no 4p 76 2017

[17] M Sun X Li J C S Lui R T B Ma and Z Liang ldquoMonet auser-oriented behavior-based malware variants detectionsystem for androidrdquo IEEE Transactions on Information Fo-rensics and Security vol 12 no 5 pp 1103ndash1112 2017

[18] T Vidas and N Christin ldquoEvading android runtime analysisvia sandbox detectionrdquo in Proceedings of the 9th ACMSymposium on Information Computer and CommunicationsSecurity pp 447ndash458 Kyoto Japan June 2014

[19] Y Duan M Zhang A V Bhaskar et al ldquoings you may notknow about android (Un) packers a systematic study basedon whole-system emulationrdquo in Proceedings of the 2018Network and Distributed System Security Symposium pp 1ndash15 San Diego CA USA February 2018

[20] L Xue X Luo L Yu S Wang and D Wu ldquoAdaptiveunpacking of android appsrdquo in Proceedings of the 2017 IEEEACM 39th International Conference on Software Engineering(ICSE) IEEE Buenos Aires Argentina pp 358ndash369 May2017

[21] Y Zhou and X Jiang ldquoDissecting android malware char-acterization and evolutionrdquo in Proceedings of the 2012 IEEESymposium on Security and Privacy IEEE San Francisco CAUSA pp 95ndash109 2012

[22] N W Lo S K Lu and Y H Chuang ldquoA framework for thirdparty android marketplaces to identify repackaged appsrdquo inProceedings of the 2016 IEEE 14th International Conference onDependable Autonomic and Secure Computing 14th Inter-national Conference on Pervasive Intelligence and Computing2nd International Conference on Big Data Intelligence andComputing and Cyber Science and Technology Congresspp 475ndash482 Auckland New Zealand August 2016

[23] L Li J Gao M Hurier et al ldquoAndrozoo++ collectingmillions of android apps and their metadata for the researchcommunityrdquo 2017 httpsarxivorgabs170905281

[24] W Shi J Cao Q Zhang Y Li and L Xu ldquoEdge computingvision and challengesrdquo IEEE Internet of ings Journal vol 3no 5 pp 637ndash646 2016

[25] W Zhou Y Zhou X Jiang and P Ning ldquoDetectingrepackaged smart-phone applications in third-party androidmarketplacesrdquo in Proceedings of the Second ACM Conferenceon Data and Application Security and Privacy ACM AntonioTX USA pp 317ndash326 February 2012

[26] Y Shao X Luo C Qian P Zhu and L Zhang ldquoTowards ascalable resource-driven approach for detecting repackagedandroid applicationsrdquo in Proceedings of the 30th AnnualComputer Security Applications Conference ACM NewOrleans LA USA pp 56ndash65 December 2014

[27] K Chen P Wang Y Lee et al ldquoFinding unknown malice in10 seconds mass vetting for new threats at the google-playscalerdquo in Proceedings of the USENIX Security Symposiumvol 15 Washington DC USA August 2015

[28] Z Wang C Li Y Guan and Y Xue ldquoDroidchain a novelmalware detection method for android based on behaviorchainrdquo in Proceedings of the 2015 IEEE Conference onCommunications and Network Security (CNS) IEEE Flor-ence Italy pp 727-728 September 2015

[29] K Tian D D Yao B G Ryder G Tan and G PengldquoDetection of repackaged android malware with code-het-erogeneity featuresrdquo IEEE Transactions on Dependable andSecure Computing vol 17 no 1 pp 64ndash77 2017

[30] A De Lorenzo F Martinelli E Medvet F Mercaldo andA Santone ldquoVisualizing the outcome of dynamic analysis ofandroid malware with vizmalrdquo Journal of Information Se-curity and Applications vol 50 Article ID 102423 2020

[31] M Lin D Zhang X Su and T Yu ldquoEffective and scalablerepackaged application detection based on user interfacerdquo inProceedings of the 2017 IEEE Conference on ComputerCommunications Workshops (INFOCOM WKSHPS) IEEEAtlanta GA USA May 2017

[32] G Meng M Patrick Y Xue Y Liu and J Zhang ldquoSecuringandroid app markets via modelling and predicting malwarespread between marketsrdquo IEEE Transactions on InformationForensics and Security vol 14 no 7 pp 1944ndash1959 2018

[33] A Arora and S K Peddoju ldquoMinimizing network trafficfeatures for android mobile malware detectionrdquo in Proceed-ings of the 18th International Conference on DistributedComputing and Networking ACM Hyderabad India January2017

[34] X Wu D Zhang X Su and W Li ldquoDetect repackagedAndroid application based on HTTP traffic similarity httptraffic similarityrdquo Security and Communication Networksvol 8 no 13 pp 2257ndash2266 2015

[35] J Malik and R Kaushal ldquoCredroid android malware de-tection by network traffic analysisrdquo in Proceedings of the 1stACM Workshop on Privacy-Aware Mobile Computing ACMPaderborn Germany pp 28ndash36 July 2016

18 Security and Communication Networks

[36] A Zulkifli I R A Hamid W M Shah and Z AbdullahldquoAndroid malware detection based on network traffic usingdecision tree algorithmrdquo in Proceedings of the InternationalConference on Soft Computing and Data Mining SpringerSenai Malaysia pp 485ndash494 January 2018

[37] Z Chen Q Yan H Han et al ldquoMachine learning basedmobile malware detection using highly imbalanced networktrafficrdquo Information Sciences vol 433-434 pp 346ndash364 2018

[38] G He B Xu L Zhang and H Zhu ldquoMobile app identifi-cation for encrypted network flows by traffic correlationrdquoInternational Journal of Distributed Sensor Networks vol 14no 12 pp 1ndash17 2018

[39] L Xue Y Zhou T Chen X Luo and G Gu ldquoMalton to-wards on-device non-invasive mobile malware analysis forARTrdquo in Proceedings of the 26th USENIX Security Symposiumpp 289ndash306 Vancouver Canada August 2017

[40] M K Alzaylaee S Y Yerima and S Sezer ldquoDL-Droid deeplearning based android malware detection using real devicesrdquoComputers amp Security vol 89 Article ID 101663 2020

[41] A S Shamili C Bauckhage and T Alpcan ldquoMalware de-tection on mobile devices using distributed machine learn-ingrdquo in Proceedings of the 20th International Conference onPattern Recognition (ICPR) IEEE Istanbul Turkeypp 4348ndash4351 August 2010

[42] M Zhao T Zhang F Ge and Z Yuan ldquoRobotdroid alightweight malware detection framework on smartphonesrdquoJournal of Networks vol 7 no 4 p 715 2012

[43] K A Talha D I Alper and C Aydin ldquoAPK auditor per-mission-based android malware detection systemrdquo DigitalInvestigation vol 13 pp 1ndash14 2015

[44] S Arshad M A Shah A Wahid A Mehmood H Song andH Yu ldquoSamadroid a novel 3-level hybrid malware detectionmodel for android operating systemrdquo IEEE Access vol 6pp 4321ndash4339 2018

[45] G He L Zhang B Xu and H Zhu ldquoDetecting repackagedandroid malware based on mobile edge computingrdquo inProceedings of the 2018 Sixth International Conference onAdvanced Cloud and Big Data (CBD) IEEE Lanzhou Chinapp 360ndash365 August 2018

[46] R Roman J Lopez andMMambo ldquoMobile edge computingFog et al a survey and analysis of security threats andchallengesrdquo Future Generation Computer Systems vol 78pp 680ndash698 2018

[47] F Bonomi R Milito J Zhu and S Addepalli ldquoFog com-puting and its role in the internet of thingsrdquo in Proceedings ofthe First Edition of the MCC Workshop on Mobile CloudComputing ACM Helsinki Finland pp 13ndash16 August 2012

[48] M T Beck MWerner S Feld and S Schimper ldquoMobile edgecomputing a taxonomyrdquo in Proceedings of the Sixth Inter-national Conference on Advances in Future Internet pp 48ndash55 Lisbon Portugal November 2014

[49] P Mach and Z Becvar ldquoMobile edge computing a survey onarchitecture and computation offloadingrdquo IEEE Communi-cations Surveys amp Tutorials vol 19 no 3 pp 1628ndash1656 2017

[50] H T Dinh C Lee D Niyato and P Wang ldquoA survey ofmobile cloud computing architecture applications and ap-proachesrdquo Wireless Communications and Mobile Computingvol 13 no 18 pp 1587ndash1611 2013

[51] X Lyu H Tian L Jiang et al ldquoSelective offloading in mobileedge computing for the green internet of thingsrdquo IEEENetwork vol 32 no 1 pp 54ndash60 2018

[52] S Agrawal and J Agrawal ldquoSurvey on anomaly detectionusing data mining techniquesrdquo Procedia Computer Sciencevol 60 pp 708ndash713 2015

[53] A Tongaonkar ldquoA look at the mobile app identificationlandscaperdquo IEEE Internet Computing vol 20 no 4 pp 9ndash152016

[54] G He B Xu and H Zhu ldquoIdentifying mobile applications forencrypted network trafficrdquo in Proceedings of the 2017 FifthInternational Conference on Advanced Cloud and Big Data(CBD) IEEE Shanghai China pp 279ndash284 August 2017

[55] G Ranjan A Tongaonkar and R Torres ldquoApproximatematching of persistent lexicon using search-engines forclassifying mobile app trafficrdquo in Proceedings of the 35thAnnual IEEE International Conference on Computer Com-munications IIEEE San Francisco CA USA April 2016

[56] L Li and S Qu ldquoShort text classification based on improvedITCrdquo Journal of Computer and Communications vol 1 no 4pp 22ndash27 2013

[57] A H Lashkari A F A Kadir L Taheri and A A GhorbanildquoToward developing a systematic approach to generatebenchmark android malware datasets and classificationrdquo inProceedings of the 2018 International Carnahan Conference onSecurity Technology (ICCST) IEEE Montreal Canada Oc-tober 2018

[58] A Rodriguez and A Laio ldquoClustering by fast search and findof density peaksrdquo Science vol 344 no 6191 pp 1492ndash14962014

[59] Y Li Z Yang Y Guo and X Chen ldquoDroidbot a lightweightUI-guided test input generator for androidrdquo in Proceedings ofthe Software Engineering Companion (ICSE-C) pp 23ndash26Buenos Aires Argentina May 2017

[60] M Fan X Luo J Liu et al ldquoGraph embedding based familialanalysis of android malware using unsupervised learningrdquo inProceedings of the 2019 IEEEACM 41st International Con-ference on Software Engineering (ICSE) IEEE Piscataway NJUSA pp 771ndash782 May 2019

Security and Communication Networks 19

Page 16: On-DeviceDetectionofRepackagedAndroidMalwarevia …downloads.hindawi.com/journals/scn/2020/8630748.pdf · 2020-05-31 · According to [4–6], most Android malware programs are repackagedapps,andthemajorityofnewAndroidmalware

device detection method We have implemented the dy-namic detection according to the description in [44] Indetail the APIs that are invoked by the running apps aremonitored by the tool Sensitive API Monitor (httpsgithubcomcyruliuSensitive-API-Monitor) As done in [44] thefrequency values of 10 system calls (open ioctl brk readwrite close sendto sendmsg recvfrom and recvmsg) arerecorded ese recorded frequency values are sent toWindows Azure for malware detection We evaluate thedynamic detection performance of SAMADroid on our 143repackaged_malware apps and their original versions In theevaluation the frequency values of the original apps are usedto create the normality model and the frequency values ofthe repackaged malware apps are used for testing We alsouse the app GT to record the power memory and CPUconsumptions of SAMADroid Note that our phones wereall rooted and Sensitive_API_Monitor and GT can be usede experimental results are presented in Table 9

Last we compare our method with Shabtairsquos method [8]Shabtairsquos method detects Android malware at the mobiledevices and it considers the apprsquos network traffic patternsonly the HTTP traffic contents are not considered isgives us an opportunity to compare our method withmethods that have been specially designed for encryptedtraffic We have implemented Shabtairsquos method and chosenthe feature subset 2 as the detection features Feature subset2 includes Avg Sent Bytes Avg Rcvd Bytes and Pct ofAvg Rcvd Bytes among other values and it is proved to be

the best feature subset for malware detection [8] e De-cisionRegression tree is chosen as the classification algo-rithm e testing dataset is the traffic that was generated bythe 143 repackaged malware instances and the training set isthe traffic that was generated by the corresponding originalapps e classification results are listed in Table 10According to the table the detection accuracy and F-mea-sure values of our method are significantly higher than thoseof Shabtairsquos method is indicates that the HTTP trafficprovides useful information for Android malware detectionSince Shabtairsquos method is conducted at the mobile devices itconsumes more resources than our method as compared inTable 10

7 Discussion

e results of our extensive experiments demonstrate thatrepackaged Android malware can be efficiently and effec-tively detected by clustering network traffic at the edgedevices However there are still limitations for our methodFirst we assume the most of the devices have installed theoriginal apps (rlt (12)u as described in Section 51) is isreasonable in most cases since mobile users typicallydownload apps from the official markets However in areaswhere official markets such as the Google play and Amazonare banned it is possible that the most mobile devices haveinstalled the repackaged apps is may cause false positivesSecond if the malware does not generate sufficient trafficour method cannot detect it For example some malwareonly sends fraudulent premium SMS messages is type ofmalware cannot be detected by monitoring its networktraffic ird we rely on third-party tools such as Virustotalfor the identification of malicious URLs and to reduce thenumber of false alarms If a malicious app rents legalplatforms such as the public cloud as the remote controlserver our method will judge it as a normal app Hence falsenegatives will be generated

To overcome these limitations we could combine theproposed method with available methods such as SAMA-Droid In such cases the edge-client app will not only recordflow metainformation but also monitor app system be-haviors However the system behavior monitoring does notneed to be conducted all the time If abnormal but harmlessor lower traffic apps are detected the edge-client app will beinformed to record the corresponding appsrsquo system be-haviors Similarly the edge-client app can send the appsystem behaviors to the edge server for further processinge performance of this combined method will be evaluatedin our future work One can also combine the proposedframework with AppFA [14] for the detection of other typesof Android malware For example the edge extracts apptraffic behavior features and the cloud can compare appsrsquobehaviors with their peer groups to identify abnormalitiesFurthermore one could first use clustering techniques [60]to group repackaged apps according to the same maliciousmodules and then analyze the behaviors of the representativerepackaged apps in each group to obtain more accuratebehavior features

y

2D nonclassical multidimensional scaling

00ndash02 02ndash04x

ndash04

ndash03

ndash02

ndash01

00

01

02

03

04

Figure 16 Clustering results of various malware that wererepackaged from aircomaceviralmotox3e outlier in the middleis detected as an abnormal but harmless app is figure wasgenerated by the density peak clustering tool

Table 8 Comparison of the proposed method with AppFA

Proposed method AppFAAccuracy 969 863F-measure 094 075Average power consumption 12 0Average CPU consumption 06 0Average memory consumption 654MB 0

16 Security and Communication Networks

e paradigms of edge computing are used to detectAndroid malware in this study However several challengeswould be encountered when deploying it in practice Firstthe mobile devices must install a specified app for thecollection of essential information which may cause con-cerns and mobile users may refuse to cooperate Second inedge computing the data should be preprocessed by the edgeto better protect the usersrsquo privacy However the edge servercan observe all the data (including the sensitive informa-tion) and it threatens the privacy of users For example ifthe edge is deployed by the cloud provider the usersrsquo privacyinformation is also available to the cloud ere is no dif-ference with the user-cloud model for privacy protectionunless the edge server is deployed by the userird the edgeserver should be well protected from network attacksOtherwise it will be convenient for hackers to steal usersrsquoprivate information

8 Conclusion

In this paper we propose a novel method for detectingAndroid malware based on edge computing To the best ofour knowledge the detection of Android malware by uti-lizing edge computing and traffic clustering has not beenconsidered previously By introducing the edge layer themain tasks of the mobile device are offloaded to the edgeserver and the mobile device only needs to record flowmetainformation which substantially conserves the re-sources of the mobile device e edge server extracts appsrsquonetwork traffic features and sends these features to the cloudplatform for further analysis us the usersrsquo privacy can bebetter protected With the received features the cloud cal-culates the similarities between apps and clusters thesesimilarity values to separate the original apps and themalware automatically We evaluated our methods on 400Android apps and the experimental results demonstratethat the detection accuracy can reach 100 for several appsand the average accuracy is 969 We also tested theperformance of the proposed method and compared it with

the typical approaches e comparison results show thatour method can detect repackaged Android malware withboth higher detection accuracy and lower power and re-source consumptions In the future we will continue to workto overcome the discussed limitations and to evaluate theproposed method in real world scenarios

Data Availability

e downloaded apps used to support the findings of thisstudy have been deposited in httpspanbaiducoms1b8vMceIdtdd1gBRC5Gbzvw Its extraction code is vd5re clustering code is also in the repository Since appnetwork traffic contains usersrsquo sensitive information itcannot be publicly shared However we have explained allthe steps to generate network traffic automatically esesteps are also described in the instructionspdf in therepository

Conflicts of Interest

e authors declare that they have no conflicts of interest

Acknowledgments

is work was supported by National Key TechnologiesResearch and Development Program of China under Grant2018YFD0401404 National Natural Science Foundation ofChina under Grants 61702282 61802192 and 71801123Natural Science Foundation of the Jiangsu Higher EducationInstitutions of China under Grants 18KJB520024 and17KJB520023 Nanjing Forestry University under GrantsGXL016 and CX2016026 and NUPTSF under GrantNY217143

References

[1] Statista Mobile Operating Systemsrsquo Market Share Worldwidefrom January 2012 to December 2019 Statista HamburgGermany 2009 httpswwwstatistacomstatistics272698global-market-share-held-by-mobile-operating-systems-since-2009

[2] W Martin F Sarro Y Jia Y Zhang and M Harman ldquoAsurvey of app store analysis for software engineeringrdquo IEEETransactions on Software Engineering vol 43 no 9pp 817ndash847 2017

[3] httpswwwappbraincomstatsnumber-of-android-apps2020

[4] L Li D Li T F Bissyande et al ldquoUnderstanding android apppiggybacking a systematic study of malicious code graftingrdquoIEEE Transactions on Information Forensics and Securityvol 12 no 6 pp 1269ndash1284 2017

[5] M Fan J Liu X Luo et al ldquoAndroid malware familialclassification and representative sample selection via frequentsubgraph analysisrdquo IEEE Transactions on Information Fo-rensics and Security vol 13 no 8 pp 1890ndash1905 2018

[6] J Zhang Z Qin K Zhang H Yin and J Zou ldquoDalvik opcodegraph based android malware variants detection using globaltopology featuresrdquo IEEE Access vol 6 pp 51964ndash51974 2018

[7] S Garg S K Peddoju and A K Sarje ldquoNetwork-baseddetection of android malicious appsrdquo International Journal ofInformation Security vol 16 no 4 pp 385ndash400 2016

Table 9 Comparison of the proposed method with SAMADroid

Proposed method SAMADroidAccuracy 969 831F-measure 0940 0793Average power consumption 12 23Average CPU consumption 06 08Average memory consumption 654MB 773MB

Table 10 Comparison of the proposed method with Shabtairsquosmethod

Proposed method Shabtairsquosmethod

Accuracy 969 771F-measure 0940 0695Average power consumption 12 77Average CPU consumption 06 23Average memory consumption 654MB 937MB

Security and Communication Networks 17

[8] A Shabtai L Tenenboim-Chekina D Mimran L RokachB Shapira and Y Elovici ldquoMobile malware detection throughanalysis of deviations in application network behaviorrdquoComputers amp Security vol 43 pp 1ndash18 2014

[9] L Xie X Zhang J-P Seifert and S Zhu ldquopBMDS a be-havior-based malware detection system for cellphone de-vicesrdquo in Proceedings of the ird ACM Conference onWireless Network Security ACM New York NY USApp 37ndash48 2010

[10] I Burguera U Zurutuza and S Nadjm-Tehrani ldquoCrowdroidbehavior-based malware detection system for androidrdquo inProceedings of the 1st ACMWorkshop on Security and Privacyin Smartphones and Mobile Devices ACM Chicago IL USApp 15ndash26 October 2011

[11] G Dini F Martinelli A Saracino and D SgandurraldquoMADAM a multi-level anomaly detector for android mal-warerdquo in Proceedings of the International Conference onMathematical Methods Models and Architectures for Com-puter Network Security vol 12 Springer St PetersburgRussia pp 240ndash253 2012

[12] Y Zhang M Yang B Xu et al ldquoVetting undesirable be-haviors in android apps with permission use analysisrdquo inProceedings of the 2013 ACM SIGSAC Conference on Com-puter amp Communications Security ACM Berlin Germanypp 611ndash622 November 2013

[13] SWang Z Chen X Li LWang K Ji and C Zhao ldquoAndroidmalware clustering analysis on network-level behaviorrdquo inProceedings of the International Conference on IntelligentComputing Springer Liverpool UK pp 796ndash807 August2017

[14] G He B Xu and H Zhu ldquoAppFA a novel approach to detectmalicious android applications on the networkrdquo in Securityand Communication Networks Wiley Hoboken NJ USA2018

[15] X Wang Y Yang and Y Zeng ldquoAccurate mobile malwaredetection and classification in the cloudrdquo SpringerPlus vol 4no 1 p 583 2015

[16] K Tam A Feizollah N B Anuar R Salleh and L Cavallaroldquoe evolution of android malware and android analysistechniquesrdquo ACM Computing Surveys (CSUR) vol 49 no 4p 76 2017

[17] M Sun X Li J C S Lui R T B Ma and Z Liang ldquoMonet auser-oriented behavior-based malware variants detectionsystem for androidrdquo IEEE Transactions on Information Fo-rensics and Security vol 12 no 5 pp 1103ndash1112 2017

[18] T Vidas and N Christin ldquoEvading android runtime analysisvia sandbox detectionrdquo in Proceedings of the 9th ACMSymposium on Information Computer and CommunicationsSecurity pp 447ndash458 Kyoto Japan June 2014

[19] Y Duan M Zhang A V Bhaskar et al ldquoings you may notknow about android (Un) packers a systematic study basedon whole-system emulationrdquo in Proceedings of the 2018Network and Distributed System Security Symposium pp 1ndash15 San Diego CA USA February 2018

[20] L Xue X Luo L Yu S Wang and D Wu ldquoAdaptiveunpacking of android appsrdquo in Proceedings of the 2017 IEEEACM 39th International Conference on Software Engineering(ICSE) IEEE Buenos Aires Argentina pp 358ndash369 May2017

[21] Y Zhou and X Jiang ldquoDissecting android malware char-acterization and evolutionrdquo in Proceedings of the 2012 IEEESymposium on Security and Privacy IEEE San Francisco CAUSA pp 95ndash109 2012

[22] N W Lo S K Lu and Y H Chuang ldquoA framework for thirdparty android marketplaces to identify repackaged appsrdquo inProceedings of the 2016 IEEE 14th International Conference onDependable Autonomic and Secure Computing 14th Inter-national Conference on Pervasive Intelligence and Computing2nd International Conference on Big Data Intelligence andComputing and Cyber Science and Technology Congresspp 475ndash482 Auckland New Zealand August 2016

[23] L Li J Gao M Hurier et al ldquoAndrozoo++ collectingmillions of android apps and their metadata for the researchcommunityrdquo 2017 httpsarxivorgabs170905281

[24] W Shi J Cao Q Zhang Y Li and L Xu ldquoEdge computingvision and challengesrdquo IEEE Internet of ings Journal vol 3no 5 pp 637ndash646 2016

[25] W Zhou Y Zhou X Jiang and P Ning ldquoDetectingrepackaged smart-phone applications in third-party androidmarketplacesrdquo in Proceedings of the Second ACM Conferenceon Data and Application Security and Privacy ACM AntonioTX USA pp 317ndash326 February 2012

[26] Y Shao X Luo C Qian P Zhu and L Zhang ldquoTowards ascalable resource-driven approach for detecting repackagedandroid applicationsrdquo in Proceedings of the 30th AnnualComputer Security Applications Conference ACM NewOrleans LA USA pp 56ndash65 December 2014

[27] K Chen P Wang Y Lee et al ldquoFinding unknown malice in10 seconds mass vetting for new threats at the google-playscalerdquo in Proceedings of the USENIX Security Symposiumvol 15 Washington DC USA August 2015

[28] Z Wang C Li Y Guan and Y Xue ldquoDroidchain a novelmalware detection method for android based on behaviorchainrdquo in Proceedings of the 2015 IEEE Conference onCommunications and Network Security (CNS) IEEE Flor-ence Italy pp 727-728 September 2015

[29] K Tian D D Yao B G Ryder G Tan and G PengldquoDetection of repackaged android malware with code-het-erogeneity featuresrdquo IEEE Transactions on Dependable andSecure Computing vol 17 no 1 pp 64ndash77 2017

[30] A De Lorenzo F Martinelli E Medvet F Mercaldo andA Santone ldquoVisualizing the outcome of dynamic analysis ofandroid malware with vizmalrdquo Journal of Information Se-curity and Applications vol 50 Article ID 102423 2020

[31] M Lin D Zhang X Su and T Yu ldquoEffective and scalablerepackaged application detection based on user interfacerdquo inProceedings of the 2017 IEEE Conference on ComputerCommunications Workshops (INFOCOM WKSHPS) IEEEAtlanta GA USA May 2017

[32] G Meng M Patrick Y Xue Y Liu and J Zhang ldquoSecuringandroid app markets via modelling and predicting malwarespread between marketsrdquo IEEE Transactions on InformationForensics and Security vol 14 no 7 pp 1944ndash1959 2018

[33] A Arora and S K Peddoju ldquoMinimizing network trafficfeatures for android mobile malware detectionrdquo in Proceed-ings of the 18th International Conference on DistributedComputing and Networking ACM Hyderabad India January2017

[34] X Wu D Zhang X Su and W Li ldquoDetect repackagedAndroid application based on HTTP traffic similarity httptraffic similarityrdquo Security and Communication Networksvol 8 no 13 pp 2257ndash2266 2015

[35] J Malik and R Kaushal ldquoCredroid android malware de-tection by network traffic analysisrdquo in Proceedings of the 1stACM Workshop on Privacy-Aware Mobile Computing ACMPaderborn Germany pp 28ndash36 July 2016

18 Security and Communication Networks

[36] A Zulkifli I R A Hamid W M Shah and Z AbdullahldquoAndroid malware detection based on network traffic usingdecision tree algorithmrdquo in Proceedings of the InternationalConference on Soft Computing and Data Mining SpringerSenai Malaysia pp 485ndash494 January 2018

[37] Z Chen Q Yan H Han et al ldquoMachine learning basedmobile malware detection using highly imbalanced networktrafficrdquo Information Sciences vol 433-434 pp 346ndash364 2018

[38] G He B Xu L Zhang and H Zhu ldquoMobile app identifi-cation for encrypted network flows by traffic correlationrdquoInternational Journal of Distributed Sensor Networks vol 14no 12 pp 1ndash17 2018

[39] L Xue Y Zhou T Chen X Luo and G Gu ldquoMalton to-wards on-device non-invasive mobile malware analysis forARTrdquo in Proceedings of the 26th USENIX Security Symposiumpp 289ndash306 Vancouver Canada August 2017

[40] M K Alzaylaee S Y Yerima and S Sezer ldquoDL-Droid deeplearning based android malware detection using real devicesrdquoComputers amp Security vol 89 Article ID 101663 2020

[41] A S Shamili C Bauckhage and T Alpcan ldquoMalware de-tection on mobile devices using distributed machine learn-ingrdquo in Proceedings of the 20th International Conference onPattern Recognition (ICPR) IEEE Istanbul Turkeypp 4348ndash4351 August 2010

[42] M Zhao T Zhang F Ge and Z Yuan ldquoRobotdroid alightweight malware detection framework on smartphonesrdquoJournal of Networks vol 7 no 4 p 715 2012

[43] K A Talha D I Alper and C Aydin ldquoAPK auditor per-mission-based android malware detection systemrdquo DigitalInvestigation vol 13 pp 1ndash14 2015

[44] S Arshad M A Shah A Wahid A Mehmood H Song andH Yu ldquoSamadroid a novel 3-level hybrid malware detectionmodel for android operating systemrdquo IEEE Access vol 6pp 4321ndash4339 2018

[45] G He L Zhang B Xu and H Zhu ldquoDetecting repackagedandroid malware based on mobile edge computingrdquo inProceedings of the 2018 Sixth International Conference onAdvanced Cloud and Big Data (CBD) IEEE Lanzhou Chinapp 360ndash365 August 2018

[46] R Roman J Lopez andMMambo ldquoMobile edge computingFog et al a survey and analysis of security threats andchallengesrdquo Future Generation Computer Systems vol 78pp 680ndash698 2018

[47] F Bonomi R Milito J Zhu and S Addepalli ldquoFog com-puting and its role in the internet of thingsrdquo in Proceedings ofthe First Edition of the MCC Workshop on Mobile CloudComputing ACM Helsinki Finland pp 13ndash16 August 2012

[48] M T Beck MWerner S Feld and S Schimper ldquoMobile edgecomputing a taxonomyrdquo in Proceedings of the Sixth Inter-national Conference on Advances in Future Internet pp 48ndash55 Lisbon Portugal November 2014

[49] P Mach and Z Becvar ldquoMobile edge computing a survey onarchitecture and computation offloadingrdquo IEEE Communi-cations Surveys amp Tutorials vol 19 no 3 pp 1628ndash1656 2017

[50] H T Dinh C Lee D Niyato and P Wang ldquoA survey ofmobile cloud computing architecture applications and ap-proachesrdquo Wireless Communications and Mobile Computingvol 13 no 18 pp 1587ndash1611 2013

[51] X Lyu H Tian L Jiang et al ldquoSelective offloading in mobileedge computing for the green internet of thingsrdquo IEEENetwork vol 32 no 1 pp 54ndash60 2018

[52] S Agrawal and J Agrawal ldquoSurvey on anomaly detectionusing data mining techniquesrdquo Procedia Computer Sciencevol 60 pp 708ndash713 2015

[53] A Tongaonkar ldquoA look at the mobile app identificationlandscaperdquo IEEE Internet Computing vol 20 no 4 pp 9ndash152016

[54] G He B Xu and H Zhu ldquoIdentifying mobile applications forencrypted network trafficrdquo in Proceedings of the 2017 FifthInternational Conference on Advanced Cloud and Big Data(CBD) IEEE Shanghai China pp 279ndash284 August 2017

[55] G Ranjan A Tongaonkar and R Torres ldquoApproximatematching of persistent lexicon using search-engines forclassifying mobile app trafficrdquo in Proceedings of the 35thAnnual IEEE International Conference on Computer Com-munications IIEEE San Francisco CA USA April 2016

[56] L Li and S Qu ldquoShort text classification based on improvedITCrdquo Journal of Computer and Communications vol 1 no 4pp 22ndash27 2013

[57] A H Lashkari A F A Kadir L Taheri and A A GhorbanildquoToward developing a systematic approach to generatebenchmark android malware datasets and classificationrdquo inProceedings of the 2018 International Carnahan Conference onSecurity Technology (ICCST) IEEE Montreal Canada Oc-tober 2018

[58] A Rodriguez and A Laio ldquoClustering by fast search and findof density peaksrdquo Science vol 344 no 6191 pp 1492ndash14962014

[59] Y Li Z Yang Y Guo and X Chen ldquoDroidbot a lightweightUI-guided test input generator for androidrdquo in Proceedings ofthe Software Engineering Companion (ICSE-C) pp 23ndash26Buenos Aires Argentina May 2017

[60] M Fan X Luo J Liu et al ldquoGraph embedding based familialanalysis of android malware using unsupervised learningrdquo inProceedings of the 2019 IEEEACM 41st International Con-ference on Software Engineering (ICSE) IEEE Piscataway NJUSA pp 771ndash782 May 2019

Security and Communication Networks 19

Page 17: On-DeviceDetectionofRepackagedAndroidMalwarevia …downloads.hindawi.com/journals/scn/2020/8630748.pdf · 2020-05-31 · According to [4–6], most Android malware programs are repackagedapps,andthemajorityofnewAndroidmalware

e paradigms of edge computing are used to detectAndroid malware in this study However several challengeswould be encountered when deploying it in practice Firstthe mobile devices must install a specified app for thecollection of essential information which may cause con-cerns and mobile users may refuse to cooperate Second inedge computing the data should be preprocessed by the edgeto better protect the usersrsquo privacy However the edge servercan observe all the data (including the sensitive informa-tion) and it threatens the privacy of users For example ifthe edge is deployed by the cloud provider the usersrsquo privacyinformation is also available to the cloud ere is no dif-ference with the user-cloud model for privacy protectionunless the edge server is deployed by the userird the edgeserver should be well protected from network attacksOtherwise it will be convenient for hackers to steal usersrsquoprivate information

8 Conclusion

In this paper we propose a novel method for detectingAndroid malware based on edge computing To the best ofour knowledge the detection of Android malware by uti-lizing edge computing and traffic clustering has not beenconsidered previously By introducing the edge layer themain tasks of the mobile device are offloaded to the edgeserver and the mobile device only needs to record flowmetainformation which substantially conserves the re-sources of the mobile device e edge server extracts appsrsquonetwork traffic features and sends these features to the cloudplatform for further analysis us the usersrsquo privacy can bebetter protected With the received features the cloud cal-culates the similarities between apps and clusters thesesimilarity values to separate the original apps and themalware automatically We evaluated our methods on 400Android apps and the experimental results demonstratethat the detection accuracy can reach 100 for several appsand the average accuracy is 969 We also tested theperformance of the proposed method and compared it with

the typical approaches e comparison results show thatour method can detect repackaged Android malware withboth higher detection accuracy and lower power and re-source consumptions In the future we will continue to workto overcome the discussed limitations and to evaluate theproposed method in real world scenarios

Data Availability

e downloaded apps used to support the findings of thisstudy have been deposited in httpspanbaiducoms1b8vMceIdtdd1gBRC5Gbzvw Its extraction code is vd5re clustering code is also in the repository Since appnetwork traffic contains usersrsquo sensitive information itcannot be publicly shared However we have explained allthe steps to generate network traffic automatically esesteps are also described in the instructionspdf in therepository

Conflicts of Interest

e authors declare that they have no conflicts of interest

Acknowledgments

is work was supported by National Key TechnologiesResearch and Development Program of China under Grant2018YFD0401404 National Natural Science Foundation ofChina under Grants 61702282 61802192 and 71801123Natural Science Foundation of the Jiangsu Higher EducationInstitutions of China under Grants 18KJB520024 and17KJB520023 Nanjing Forestry University under GrantsGXL016 and CX2016026 and NUPTSF under GrantNY217143

References

[1] Statista Mobile Operating Systemsrsquo Market Share Worldwidefrom January 2012 to December 2019 Statista HamburgGermany 2009 httpswwwstatistacomstatistics272698global-market-share-held-by-mobile-operating-systems-since-2009

[2] W Martin F Sarro Y Jia Y Zhang and M Harman ldquoAsurvey of app store analysis for software engineeringrdquo IEEETransactions on Software Engineering vol 43 no 9pp 817ndash847 2017

[3] httpswwwappbraincomstatsnumber-of-android-apps2020

[4] L Li D Li T F Bissyande et al ldquoUnderstanding android apppiggybacking a systematic study of malicious code graftingrdquoIEEE Transactions on Information Forensics and Securityvol 12 no 6 pp 1269ndash1284 2017

[5] M Fan J Liu X Luo et al ldquoAndroid malware familialclassification and representative sample selection via frequentsubgraph analysisrdquo IEEE Transactions on Information Fo-rensics and Security vol 13 no 8 pp 1890ndash1905 2018

[6] J Zhang Z Qin K Zhang H Yin and J Zou ldquoDalvik opcodegraph based android malware variants detection using globaltopology featuresrdquo IEEE Access vol 6 pp 51964ndash51974 2018

[7] S Garg S K Peddoju and A K Sarje ldquoNetwork-baseddetection of android malicious appsrdquo International Journal ofInformation Security vol 16 no 4 pp 385ndash400 2016

Table 9 Comparison of the proposed method with SAMADroid

Proposed method SAMADroidAccuracy 969 831F-measure 0940 0793Average power consumption 12 23Average CPU consumption 06 08Average memory consumption 654MB 773MB

Table 10 Comparison of the proposed method with Shabtairsquosmethod

Proposed method Shabtairsquosmethod

Accuracy 969 771F-measure 0940 0695Average power consumption 12 77Average CPU consumption 06 23Average memory consumption 654MB 937MB

Security and Communication Networks 17

[8] A Shabtai L Tenenboim-Chekina D Mimran L RokachB Shapira and Y Elovici ldquoMobile malware detection throughanalysis of deviations in application network behaviorrdquoComputers amp Security vol 43 pp 1ndash18 2014

[9] L Xie X Zhang J-P Seifert and S Zhu ldquopBMDS a be-havior-based malware detection system for cellphone de-vicesrdquo in Proceedings of the ird ACM Conference onWireless Network Security ACM New York NY USApp 37ndash48 2010

[10] I Burguera U Zurutuza and S Nadjm-Tehrani ldquoCrowdroidbehavior-based malware detection system for androidrdquo inProceedings of the 1st ACMWorkshop on Security and Privacyin Smartphones and Mobile Devices ACM Chicago IL USApp 15ndash26 October 2011

[11] G Dini F Martinelli A Saracino and D SgandurraldquoMADAM a multi-level anomaly detector for android mal-warerdquo in Proceedings of the International Conference onMathematical Methods Models and Architectures for Com-puter Network Security vol 12 Springer St PetersburgRussia pp 240ndash253 2012

[12] Y Zhang M Yang B Xu et al ldquoVetting undesirable be-haviors in android apps with permission use analysisrdquo inProceedings of the 2013 ACM SIGSAC Conference on Com-puter amp Communications Security ACM Berlin Germanypp 611ndash622 November 2013

[13] SWang Z Chen X Li LWang K Ji and C Zhao ldquoAndroidmalware clustering analysis on network-level behaviorrdquo inProceedings of the International Conference on IntelligentComputing Springer Liverpool UK pp 796ndash807 August2017

[14] G He B Xu and H Zhu ldquoAppFA a novel approach to detectmalicious android applications on the networkrdquo in Securityand Communication Networks Wiley Hoboken NJ USA2018

[15] X Wang Y Yang and Y Zeng ldquoAccurate mobile malwaredetection and classification in the cloudrdquo SpringerPlus vol 4no 1 p 583 2015

[16] K Tam A Feizollah N B Anuar R Salleh and L Cavallaroldquoe evolution of android malware and android analysistechniquesrdquo ACM Computing Surveys (CSUR) vol 49 no 4p 76 2017

[17] M Sun X Li J C S Lui R T B Ma and Z Liang ldquoMonet auser-oriented behavior-based malware variants detectionsystem for androidrdquo IEEE Transactions on Information Fo-rensics and Security vol 12 no 5 pp 1103ndash1112 2017

[18] T Vidas and N Christin ldquoEvading android runtime analysisvia sandbox detectionrdquo in Proceedings of the 9th ACMSymposium on Information Computer and CommunicationsSecurity pp 447ndash458 Kyoto Japan June 2014

[19] Y Duan M Zhang A V Bhaskar et al ldquoings you may notknow about android (Un) packers a systematic study basedon whole-system emulationrdquo in Proceedings of the 2018Network and Distributed System Security Symposium pp 1ndash15 San Diego CA USA February 2018

[20] L Xue X Luo L Yu S Wang and D Wu ldquoAdaptiveunpacking of android appsrdquo in Proceedings of the 2017 IEEEACM 39th International Conference on Software Engineering(ICSE) IEEE Buenos Aires Argentina pp 358ndash369 May2017

[21] Y Zhou and X Jiang ldquoDissecting android malware char-acterization and evolutionrdquo in Proceedings of the 2012 IEEESymposium on Security and Privacy IEEE San Francisco CAUSA pp 95ndash109 2012

[22] N W Lo S K Lu and Y H Chuang ldquoA framework for thirdparty android marketplaces to identify repackaged appsrdquo inProceedings of the 2016 IEEE 14th International Conference onDependable Autonomic and Secure Computing 14th Inter-national Conference on Pervasive Intelligence and Computing2nd International Conference on Big Data Intelligence andComputing and Cyber Science and Technology Congresspp 475ndash482 Auckland New Zealand August 2016

[23] L Li J Gao M Hurier et al ldquoAndrozoo++ collectingmillions of android apps and their metadata for the researchcommunityrdquo 2017 httpsarxivorgabs170905281

[24] W Shi J Cao Q Zhang Y Li and L Xu ldquoEdge computingvision and challengesrdquo IEEE Internet of ings Journal vol 3no 5 pp 637ndash646 2016

[25] W Zhou Y Zhou X Jiang and P Ning ldquoDetectingrepackaged smart-phone applications in third-party androidmarketplacesrdquo in Proceedings of the Second ACM Conferenceon Data and Application Security and Privacy ACM AntonioTX USA pp 317ndash326 February 2012

[26] Y Shao X Luo C Qian P Zhu and L Zhang ldquoTowards ascalable resource-driven approach for detecting repackagedandroid applicationsrdquo in Proceedings of the 30th AnnualComputer Security Applications Conference ACM NewOrleans LA USA pp 56ndash65 December 2014

[27] K Chen P Wang Y Lee et al ldquoFinding unknown malice in10 seconds mass vetting for new threats at the google-playscalerdquo in Proceedings of the USENIX Security Symposiumvol 15 Washington DC USA August 2015

[28] Z Wang C Li Y Guan and Y Xue ldquoDroidchain a novelmalware detection method for android based on behaviorchainrdquo in Proceedings of the 2015 IEEE Conference onCommunications and Network Security (CNS) IEEE Flor-ence Italy pp 727-728 September 2015

[29] K Tian D D Yao B G Ryder G Tan and G PengldquoDetection of repackaged android malware with code-het-erogeneity featuresrdquo IEEE Transactions on Dependable andSecure Computing vol 17 no 1 pp 64ndash77 2017

[30] A De Lorenzo F Martinelli E Medvet F Mercaldo andA Santone ldquoVisualizing the outcome of dynamic analysis ofandroid malware with vizmalrdquo Journal of Information Se-curity and Applications vol 50 Article ID 102423 2020

[31] M Lin D Zhang X Su and T Yu ldquoEffective and scalablerepackaged application detection based on user interfacerdquo inProceedings of the 2017 IEEE Conference on ComputerCommunications Workshops (INFOCOM WKSHPS) IEEEAtlanta GA USA May 2017

[32] G Meng M Patrick Y Xue Y Liu and J Zhang ldquoSecuringandroid app markets via modelling and predicting malwarespread between marketsrdquo IEEE Transactions on InformationForensics and Security vol 14 no 7 pp 1944ndash1959 2018

[33] A Arora and S K Peddoju ldquoMinimizing network trafficfeatures for android mobile malware detectionrdquo in Proceed-ings of the 18th International Conference on DistributedComputing and Networking ACM Hyderabad India January2017

[34] X Wu D Zhang X Su and W Li ldquoDetect repackagedAndroid application based on HTTP traffic similarity httptraffic similarityrdquo Security and Communication Networksvol 8 no 13 pp 2257ndash2266 2015

[35] J Malik and R Kaushal ldquoCredroid android malware de-tection by network traffic analysisrdquo in Proceedings of the 1stACM Workshop on Privacy-Aware Mobile Computing ACMPaderborn Germany pp 28ndash36 July 2016

18 Security and Communication Networks

[36] A Zulkifli I R A Hamid W M Shah and Z AbdullahldquoAndroid malware detection based on network traffic usingdecision tree algorithmrdquo in Proceedings of the InternationalConference on Soft Computing and Data Mining SpringerSenai Malaysia pp 485ndash494 January 2018

[37] Z Chen Q Yan H Han et al ldquoMachine learning basedmobile malware detection using highly imbalanced networktrafficrdquo Information Sciences vol 433-434 pp 346ndash364 2018

[38] G He B Xu L Zhang and H Zhu ldquoMobile app identifi-cation for encrypted network flows by traffic correlationrdquoInternational Journal of Distributed Sensor Networks vol 14no 12 pp 1ndash17 2018

[39] L Xue Y Zhou T Chen X Luo and G Gu ldquoMalton to-wards on-device non-invasive mobile malware analysis forARTrdquo in Proceedings of the 26th USENIX Security Symposiumpp 289ndash306 Vancouver Canada August 2017

[40] M K Alzaylaee S Y Yerima and S Sezer ldquoDL-Droid deeplearning based android malware detection using real devicesrdquoComputers amp Security vol 89 Article ID 101663 2020

[41] A S Shamili C Bauckhage and T Alpcan ldquoMalware de-tection on mobile devices using distributed machine learn-ingrdquo in Proceedings of the 20th International Conference onPattern Recognition (ICPR) IEEE Istanbul Turkeypp 4348ndash4351 August 2010

[42] M Zhao T Zhang F Ge and Z Yuan ldquoRobotdroid alightweight malware detection framework on smartphonesrdquoJournal of Networks vol 7 no 4 p 715 2012

[43] K A Talha D I Alper and C Aydin ldquoAPK auditor per-mission-based android malware detection systemrdquo DigitalInvestigation vol 13 pp 1ndash14 2015

[44] S Arshad M A Shah A Wahid A Mehmood H Song andH Yu ldquoSamadroid a novel 3-level hybrid malware detectionmodel for android operating systemrdquo IEEE Access vol 6pp 4321ndash4339 2018

[45] G He L Zhang B Xu and H Zhu ldquoDetecting repackagedandroid malware based on mobile edge computingrdquo inProceedings of the 2018 Sixth International Conference onAdvanced Cloud and Big Data (CBD) IEEE Lanzhou Chinapp 360ndash365 August 2018

[46] R Roman J Lopez andMMambo ldquoMobile edge computingFog et al a survey and analysis of security threats andchallengesrdquo Future Generation Computer Systems vol 78pp 680ndash698 2018

[47] F Bonomi R Milito J Zhu and S Addepalli ldquoFog com-puting and its role in the internet of thingsrdquo in Proceedings ofthe First Edition of the MCC Workshop on Mobile CloudComputing ACM Helsinki Finland pp 13ndash16 August 2012

[48] M T Beck MWerner S Feld and S Schimper ldquoMobile edgecomputing a taxonomyrdquo in Proceedings of the Sixth Inter-national Conference on Advances in Future Internet pp 48ndash55 Lisbon Portugal November 2014

[49] P Mach and Z Becvar ldquoMobile edge computing a survey onarchitecture and computation offloadingrdquo IEEE Communi-cations Surveys amp Tutorials vol 19 no 3 pp 1628ndash1656 2017

[50] H T Dinh C Lee D Niyato and P Wang ldquoA survey ofmobile cloud computing architecture applications and ap-proachesrdquo Wireless Communications and Mobile Computingvol 13 no 18 pp 1587ndash1611 2013

[51] X Lyu H Tian L Jiang et al ldquoSelective offloading in mobileedge computing for the green internet of thingsrdquo IEEENetwork vol 32 no 1 pp 54ndash60 2018

[52] S Agrawal and J Agrawal ldquoSurvey on anomaly detectionusing data mining techniquesrdquo Procedia Computer Sciencevol 60 pp 708ndash713 2015

[53] A Tongaonkar ldquoA look at the mobile app identificationlandscaperdquo IEEE Internet Computing vol 20 no 4 pp 9ndash152016

[54] G He B Xu and H Zhu ldquoIdentifying mobile applications forencrypted network trafficrdquo in Proceedings of the 2017 FifthInternational Conference on Advanced Cloud and Big Data(CBD) IEEE Shanghai China pp 279ndash284 August 2017

[55] G Ranjan A Tongaonkar and R Torres ldquoApproximatematching of persistent lexicon using search-engines forclassifying mobile app trafficrdquo in Proceedings of the 35thAnnual IEEE International Conference on Computer Com-munications IIEEE San Francisco CA USA April 2016

[56] L Li and S Qu ldquoShort text classification based on improvedITCrdquo Journal of Computer and Communications vol 1 no 4pp 22ndash27 2013

[57] A H Lashkari A F A Kadir L Taheri and A A GhorbanildquoToward developing a systematic approach to generatebenchmark android malware datasets and classificationrdquo inProceedings of the 2018 International Carnahan Conference onSecurity Technology (ICCST) IEEE Montreal Canada Oc-tober 2018

[58] A Rodriguez and A Laio ldquoClustering by fast search and findof density peaksrdquo Science vol 344 no 6191 pp 1492ndash14962014

[59] Y Li Z Yang Y Guo and X Chen ldquoDroidbot a lightweightUI-guided test input generator for androidrdquo in Proceedings ofthe Software Engineering Companion (ICSE-C) pp 23ndash26Buenos Aires Argentina May 2017

[60] M Fan X Luo J Liu et al ldquoGraph embedding based familialanalysis of android malware using unsupervised learningrdquo inProceedings of the 2019 IEEEACM 41st International Con-ference on Software Engineering (ICSE) IEEE Piscataway NJUSA pp 771ndash782 May 2019

Security and Communication Networks 19

Page 18: On-DeviceDetectionofRepackagedAndroidMalwarevia …downloads.hindawi.com/journals/scn/2020/8630748.pdf · 2020-05-31 · According to [4–6], most Android malware programs are repackagedapps,andthemajorityofnewAndroidmalware

[8] A Shabtai L Tenenboim-Chekina D Mimran L RokachB Shapira and Y Elovici ldquoMobile malware detection throughanalysis of deviations in application network behaviorrdquoComputers amp Security vol 43 pp 1ndash18 2014

[9] L Xie X Zhang J-P Seifert and S Zhu ldquopBMDS a be-havior-based malware detection system for cellphone de-vicesrdquo in Proceedings of the ird ACM Conference onWireless Network Security ACM New York NY USApp 37ndash48 2010

[10] I Burguera U Zurutuza and S Nadjm-Tehrani ldquoCrowdroidbehavior-based malware detection system for androidrdquo inProceedings of the 1st ACMWorkshop on Security and Privacyin Smartphones and Mobile Devices ACM Chicago IL USApp 15ndash26 October 2011

[11] G Dini F Martinelli A Saracino and D SgandurraldquoMADAM a multi-level anomaly detector for android mal-warerdquo in Proceedings of the International Conference onMathematical Methods Models and Architectures for Com-puter Network Security vol 12 Springer St PetersburgRussia pp 240ndash253 2012

[12] Y Zhang M Yang B Xu et al ldquoVetting undesirable be-haviors in android apps with permission use analysisrdquo inProceedings of the 2013 ACM SIGSAC Conference on Com-puter amp Communications Security ACM Berlin Germanypp 611ndash622 November 2013

[13] SWang Z Chen X Li LWang K Ji and C Zhao ldquoAndroidmalware clustering analysis on network-level behaviorrdquo inProceedings of the International Conference on IntelligentComputing Springer Liverpool UK pp 796ndash807 August2017

[14] G He B Xu and H Zhu ldquoAppFA a novel approach to detectmalicious android applications on the networkrdquo in Securityand Communication Networks Wiley Hoboken NJ USA2018

[15] X Wang Y Yang and Y Zeng ldquoAccurate mobile malwaredetection and classification in the cloudrdquo SpringerPlus vol 4no 1 p 583 2015

[16] K Tam A Feizollah N B Anuar R Salleh and L Cavallaroldquoe evolution of android malware and android analysistechniquesrdquo ACM Computing Surveys (CSUR) vol 49 no 4p 76 2017

[17] M Sun X Li J C S Lui R T B Ma and Z Liang ldquoMonet auser-oriented behavior-based malware variants detectionsystem for androidrdquo IEEE Transactions on Information Fo-rensics and Security vol 12 no 5 pp 1103ndash1112 2017

[18] T Vidas and N Christin ldquoEvading android runtime analysisvia sandbox detectionrdquo in Proceedings of the 9th ACMSymposium on Information Computer and CommunicationsSecurity pp 447ndash458 Kyoto Japan June 2014

[19] Y Duan M Zhang A V Bhaskar et al ldquoings you may notknow about android (Un) packers a systematic study basedon whole-system emulationrdquo in Proceedings of the 2018Network and Distributed System Security Symposium pp 1ndash15 San Diego CA USA February 2018

[20] L Xue X Luo L Yu S Wang and D Wu ldquoAdaptiveunpacking of android appsrdquo in Proceedings of the 2017 IEEEACM 39th International Conference on Software Engineering(ICSE) IEEE Buenos Aires Argentina pp 358ndash369 May2017

[21] Y Zhou and X Jiang ldquoDissecting android malware char-acterization and evolutionrdquo in Proceedings of the 2012 IEEESymposium on Security and Privacy IEEE San Francisco CAUSA pp 95ndash109 2012

[22] N W Lo S K Lu and Y H Chuang ldquoA framework for thirdparty android marketplaces to identify repackaged appsrdquo inProceedings of the 2016 IEEE 14th International Conference onDependable Autonomic and Secure Computing 14th Inter-national Conference on Pervasive Intelligence and Computing2nd International Conference on Big Data Intelligence andComputing and Cyber Science and Technology Congresspp 475ndash482 Auckland New Zealand August 2016

[23] L Li J Gao M Hurier et al ldquoAndrozoo++ collectingmillions of android apps and their metadata for the researchcommunityrdquo 2017 httpsarxivorgabs170905281

[24] W Shi J Cao Q Zhang Y Li and L Xu ldquoEdge computingvision and challengesrdquo IEEE Internet of ings Journal vol 3no 5 pp 637ndash646 2016

[25] W Zhou Y Zhou X Jiang and P Ning ldquoDetectingrepackaged smart-phone applications in third-party androidmarketplacesrdquo in Proceedings of the Second ACM Conferenceon Data and Application Security and Privacy ACM AntonioTX USA pp 317ndash326 February 2012

[26] Y Shao X Luo C Qian P Zhu and L Zhang ldquoTowards ascalable resource-driven approach for detecting repackagedandroid applicationsrdquo in Proceedings of the 30th AnnualComputer Security Applications Conference ACM NewOrleans LA USA pp 56ndash65 December 2014

[27] K Chen P Wang Y Lee et al ldquoFinding unknown malice in10 seconds mass vetting for new threats at the google-playscalerdquo in Proceedings of the USENIX Security Symposiumvol 15 Washington DC USA August 2015

[28] Z Wang C Li Y Guan and Y Xue ldquoDroidchain a novelmalware detection method for android based on behaviorchainrdquo in Proceedings of the 2015 IEEE Conference onCommunications and Network Security (CNS) IEEE Flor-ence Italy pp 727-728 September 2015

[29] K Tian D D Yao B G Ryder G Tan and G PengldquoDetection of repackaged android malware with code-het-erogeneity featuresrdquo IEEE Transactions on Dependable andSecure Computing vol 17 no 1 pp 64ndash77 2017

[30] A De Lorenzo F Martinelli E Medvet F Mercaldo andA Santone ldquoVisualizing the outcome of dynamic analysis ofandroid malware with vizmalrdquo Journal of Information Se-curity and Applications vol 50 Article ID 102423 2020

[31] M Lin D Zhang X Su and T Yu ldquoEffective and scalablerepackaged application detection based on user interfacerdquo inProceedings of the 2017 IEEE Conference on ComputerCommunications Workshops (INFOCOM WKSHPS) IEEEAtlanta GA USA May 2017

[32] G Meng M Patrick Y Xue Y Liu and J Zhang ldquoSecuringandroid app markets via modelling and predicting malwarespread between marketsrdquo IEEE Transactions on InformationForensics and Security vol 14 no 7 pp 1944ndash1959 2018

[33] A Arora and S K Peddoju ldquoMinimizing network trafficfeatures for android mobile malware detectionrdquo in Proceed-ings of the 18th International Conference on DistributedComputing and Networking ACM Hyderabad India January2017

[34] X Wu D Zhang X Su and W Li ldquoDetect repackagedAndroid application based on HTTP traffic similarity httptraffic similarityrdquo Security and Communication Networksvol 8 no 13 pp 2257ndash2266 2015

[35] J Malik and R Kaushal ldquoCredroid android malware de-tection by network traffic analysisrdquo in Proceedings of the 1stACM Workshop on Privacy-Aware Mobile Computing ACMPaderborn Germany pp 28ndash36 July 2016

18 Security and Communication Networks

[36] A Zulkifli I R A Hamid W M Shah and Z AbdullahldquoAndroid malware detection based on network traffic usingdecision tree algorithmrdquo in Proceedings of the InternationalConference on Soft Computing and Data Mining SpringerSenai Malaysia pp 485ndash494 January 2018

[37] Z Chen Q Yan H Han et al ldquoMachine learning basedmobile malware detection using highly imbalanced networktrafficrdquo Information Sciences vol 433-434 pp 346ndash364 2018

[38] G He B Xu L Zhang and H Zhu ldquoMobile app identifi-cation for encrypted network flows by traffic correlationrdquoInternational Journal of Distributed Sensor Networks vol 14no 12 pp 1ndash17 2018

[39] L Xue Y Zhou T Chen X Luo and G Gu ldquoMalton to-wards on-device non-invasive mobile malware analysis forARTrdquo in Proceedings of the 26th USENIX Security Symposiumpp 289ndash306 Vancouver Canada August 2017

[40] M K Alzaylaee S Y Yerima and S Sezer ldquoDL-Droid deeplearning based android malware detection using real devicesrdquoComputers amp Security vol 89 Article ID 101663 2020

[41] A S Shamili C Bauckhage and T Alpcan ldquoMalware de-tection on mobile devices using distributed machine learn-ingrdquo in Proceedings of the 20th International Conference onPattern Recognition (ICPR) IEEE Istanbul Turkeypp 4348ndash4351 August 2010

[42] M Zhao T Zhang F Ge and Z Yuan ldquoRobotdroid alightweight malware detection framework on smartphonesrdquoJournal of Networks vol 7 no 4 p 715 2012

[43] K A Talha D I Alper and C Aydin ldquoAPK auditor per-mission-based android malware detection systemrdquo DigitalInvestigation vol 13 pp 1ndash14 2015

[44] S Arshad M A Shah A Wahid A Mehmood H Song andH Yu ldquoSamadroid a novel 3-level hybrid malware detectionmodel for android operating systemrdquo IEEE Access vol 6pp 4321ndash4339 2018

[45] G He L Zhang B Xu and H Zhu ldquoDetecting repackagedandroid malware based on mobile edge computingrdquo inProceedings of the 2018 Sixth International Conference onAdvanced Cloud and Big Data (CBD) IEEE Lanzhou Chinapp 360ndash365 August 2018

[46] R Roman J Lopez andMMambo ldquoMobile edge computingFog et al a survey and analysis of security threats andchallengesrdquo Future Generation Computer Systems vol 78pp 680ndash698 2018

[47] F Bonomi R Milito J Zhu and S Addepalli ldquoFog com-puting and its role in the internet of thingsrdquo in Proceedings ofthe First Edition of the MCC Workshop on Mobile CloudComputing ACM Helsinki Finland pp 13ndash16 August 2012

[48] M T Beck MWerner S Feld and S Schimper ldquoMobile edgecomputing a taxonomyrdquo in Proceedings of the Sixth Inter-national Conference on Advances in Future Internet pp 48ndash55 Lisbon Portugal November 2014

[49] P Mach and Z Becvar ldquoMobile edge computing a survey onarchitecture and computation offloadingrdquo IEEE Communi-cations Surveys amp Tutorials vol 19 no 3 pp 1628ndash1656 2017

[50] H T Dinh C Lee D Niyato and P Wang ldquoA survey ofmobile cloud computing architecture applications and ap-proachesrdquo Wireless Communications and Mobile Computingvol 13 no 18 pp 1587ndash1611 2013

[51] X Lyu H Tian L Jiang et al ldquoSelective offloading in mobileedge computing for the green internet of thingsrdquo IEEENetwork vol 32 no 1 pp 54ndash60 2018

[52] S Agrawal and J Agrawal ldquoSurvey on anomaly detectionusing data mining techniquesrdquo Procedia Computer Sciencevol 60 pp 708ndash713 2015

[53] A Tongaonkar ldquoA look at the mobile app identificationlandscaperdquo IEEE Internet Computing vol 20 no 4 pp 9ndash152016

[54] G He B Xu and H Zhu ldquoIdentifying mobile applications forencrypted network trafficrdquo in Proceedings of the 2017 FifthInternational Conference on Advanced Cloud and Big Data(CBD) IEEE Shanghai China pp 279ndash284 August 2017

[55] G Ranjan A Tongaonkar and R Torres ldquoApproximatematching of persistent lexicon using search-engines forclassifying mobile app trafficrdquo in Proceedings of the 35thAnnual IEEE International Conference on Computer Com-munications IIEEE San Francisco CA USA April 2016

[56] L Li and S Qu ldquoShort text classification based on improvedITCrdquo Journal of Computer and Communications vol 1 no 4pp 22ndash27 2013

[57] A H Lashkari A F A Kadir L Taheri and A A GhorbanildquoToward developing a systematic approach to generatebenchmark android malware datasets and classificationrdquo inProceedings of the 2018 International Carnahan Conference onSecurity Technology (ICCST) IEEE Montreal Canada Oc-tober 2018

[58] A Rodriguez and A Laio ldquoClustering by fast search and findof density peaksrdquo Science vol 344 no 6191 pp 1492ndash14962014

[59] Y Li Z Yang Y Guo and X Chen ldquoDroidbot a lightweightUI-guided test input generator for androidrdquo in Proceedings ofthe Software Engineering Companion (ICSE-C) pp 23ndash26Buenos Aires Argentina May 2017

[60] M Fan X Luo J Liu et al ldquoGraph embedding based familialanalysis of android malware using unsupervised learningrdquo inProceedings of the 2019 IEEEACM 41st International Con-ference on Software Engineering (ICSE) IEEE Piscataway NJUSA pp 771ndash782 May 2019

Security and Communication Networks 19

Page 19: On-DeviceDetectionofRepackagedAndroidMalwarevia …downloads.hindawi.com/journals/scn/2020/8630748.pdf · 2020-05-31 · According to [4–6], most Android malware programs are repackagedapps,andthemajorityofnewAndroidmalware

[36] A Zulkifli I R A Hamid W M Shah and Z AbdullahldquoAndroid malware detection based on network traffic usingdecision tree algorithmrdquo in Proceedings of the InternationalConference on Soft Computing and Data Mining SpringerSenai Malaysia pp 485ndash494 January 2018

[37] Z Chen Q Yan H Han et al ldquoMachine learning basedmobile malware detection using highly imbalanced networktrafficrdquo Information Sciences vol 433-434 pp 346ndash364 2018

[38] G He B Xu L Zhang and H Zhu ldquoMobile app identifi-cation for encrypted network flows by traffic correlationrdquoInternational Journal of Distributed Sensor Networks vol 14no 12 pp 1ndash17 2018

[39] L Xue Y Zhou T Chen X Luo and G Gu ldquoMalton to-wards on-device non-invasive mobile malware analysis forARTrdquo in Proceedings of the 26th USENIX Security Symposiumpp 289ndash306 Vancouver Canada August 2017

[40] M K Alzaylaee S Y Yerima and S Sezer ldquoDL-Droid deeplearning based android malware detection using real devicesrdquoComputers amp Security vol 89 Article ID 101663 2020

[41] A S Shamili C Bauckhage and T Alpcan ldquoMalware de-tection on mobile devices using distributed machine learn-ingrdquo in Proceedings of the 20th International Conference onPattern Recognition (ICPR) IEEE Istanbul Turkeypp 4348ndash4351 August 2010

[42] M Zhao T Zhang F Ge and Z Yuan ldquoRobotdroid alightweight malware detection framework on smartphonesrdquoJournal of Networks vol 7 no 4 p 715 2012

[43] K A Talha D I Alper and C Aydin ldquoAPK auditor per-mission-based android malware detection systemrdquo DigitalInvestigation vol 13 pp 1ndash14 2015

[44] S Arshad M A Shah A Wahid A Mehmood H Song andH Yu ldquoSamadroid a novel 3-level hybrid malware detectionmodel for android operating systemrdquo IEEE Access vol 6pp 4321ndash4339 2018

[45] G He L Zhang B Xu and H Zhu ldquoDetecting repackagedandroid malware based on mobile edge computingrdquo inProceedings of the 2018 Sixth International Conference onAdvanced Cloud and Big Data (CBD) IEEE Lanzhou Chinapp 360ndash365 August 2018

[46] R Roman J Lopez andMMambo ldquoMobile edge computingFog et al a survey and analysis of security threats andchallengesrdquo Future Generation Computer Systems vol 78pp 680ndash698 2018

[47] F Bonomi R Milito J Zhu and S Addepalli ldquoFog com-puting and its role in the internet of thingsrdquo in Proceedings ofthe First Edition of the MCC Workshop on Mobile CloudComputing ACM Helsinki Finland pp 13ndash16 August 2012

[48] M T Beck MWerner S Feld and S Schimper ldquoMobile edgecomputing a taxonomyrdquo in Proceedings of the Sixth Inter-national Conference on Advances in Future Internet pp 48ndash55 Lisbon Portugal November 2014

[49] P Mach and Z Becvar ldquoMobile edge computing a survey onarchitecture and computation offloadingrdquo IEEE Communi-cations Surveys amp Tutorials vol 19 no 3 pp 1628ndash1656 2017

[50] H T Dinh C Lee D Niyato and P Wang ldquoA survey ofmobile cloud computing architecture applications and ap-proachesrdquo Wireless Communications and Mobile Computingvol 13 no 18 pp 1587ndash1611 2013

[51] X Lyu H Tian L Jiang et al ldquoSelective offloading in mobileedge computing for the green internet of thingsrdquo IEEENetwork vol 32 no 1 pp 54ndash60 2018

[52] S Agrawal and J Agrawal ldquoSurvey on anomaly detectionusing data mining techniquesrdquo Procedia Computer Sciencevol 60 pp 708ndash713 2015

[53] A Tongaonkar ldquoA look at the mobile app identificationlandscaperdquo IEEE Internet Computing vol 20 no 4 pp 9ndash152016

[54] G He B Xu and H Zhu ldquoIdentifying mobile applications forencrypted network trafficrdquo in Proceedings of the 2017 FifthInternational Conference on Advanced Cloud and Big Data(CBD) IEEE Shanghai China pp 279ndash284 August 2017

[55] G Ranjan A Tongaonkar and R Torres ldquoApproximatematching of persistent lexicon using search-engines forclassifying mobile app trafficrdquo in Proceedings of the 35thAnnual IEEE International Conference on Computer Com-munications IIEEE San Francisco CA USA April 2016

[56] L Li and S Qu ldquoShort text classification based on improvedITCrdquo Journal of Computer and Communications vol 1 no 4pp 22ndash27 2013

[57] A H Lashkari A F A Kadir L Taheri and A A GhorbanildquoToward developing a systematic approach to generatebenchmark android malware datasets and classificationrdquo inProceedings of the 2018 International Carnahan Conference onSecurity Technology (ICCST) IEEE Montreal Canada Oc-tober 2018

[58] A Rodriguez and A Laio ldquoClustering by fast search and findof density peaksrdquo Science vol 344 no 6191 pp 1492ndash14962014

[59] Y Li Z Yang Y Guo and X Chen ldquoDroidbot a lightweightUI-guided test input generator for androidrdquo in Proceedings ofthe Software Engineering Companion (ICSE-C) pp 23ndash26Buenos Aires Argentina May 2017

[60] M Fan X Luo J Liu et al ldquoGraph embedding based familialanalysis of android malware using unsupervised learningrdquo inProceedings of the 2019 IEEEACM 41st International Con-ference on Software Engineering (ICSE) IEEE Piscataway NJUSA pp 771ndash782 May 2019

Security and Communication Networks 19