White Paper On Mobile Subscriber Insights – A Big Data Approach

Mobile Subscriber Fingerprinting: A Big Data Approach

Jobin Wilson

R&D, Flytxt Technology Pvt. Ltd.

Trivandrum, India

[email protected]

Vikram Garg

R&D, Flytxt Technology Pvt. Ltd

Trivandrum, India

[email protected]

Abstract—Mobile advertising campaigns must use subscriber data to target the subscribers most relevant to the product or service being recommended. However, matching recommendations to subscribers is usually inexact, and no straightforward “attribute-value” matching algorithm suffices. Arbitrary matching degrades the subscriber experience and results in lower conversion rates.

One viable solution is to cluster subscribers by common attributes and then match recommendations carrying the same attributes to the subscribers in the associated cluster. This approach poses significant challenges for the data architecture: the clusters are extremely large and need not have exact boundaries, and their representation must support real-time recommendations. We describe a novel architecture that we call subscriber “fingerprinting”. The proposed system can analyze extremely large volumes of data (on the order of terabytes) using large-scale distributed Extract-Transform-Load (ETL) operations followed by distributed analytics based on statistical models. The insights generated by this process can be used to serve personalized recommendations in real time. The system uses big data analytics built on the Hadoop ecosystem, is deployed on a private cloud infrastructure, and includes a simple, secure, and lightweight REST-based integration API.

Keywords- recommendation systems; big data analytics; distributed computing; cloud computing; content recommendations; service personalization; mobile advertisements

I. INTRODUCTION

Mobile telecommunication service providers are in a unique position to offer their subscribers quality of service in the form of accurate service personalization, customized service recommendations, and contextual advertisements. A service provider may also want to predict and influence subscriber behavior, for example in churn prediction and management. This requires the operator to maintain a 360-degree view of each subscriber (including his/her behavior, preferences, and service usage patterns), which the system must mine and learn automatically [1][2] while ensuring strict subscriber privacy. The rich, noise-free metadata in the transaction logs generated by a subscriber's interactions with the telecom network is an apt input for this process. Further, mobile devices offer multiple modalities for real-time content delivery, such as SMS, MMS, outbound calls, and the mobile internet, which makes the mobile phone a promising advertisement channel.

PC-based internet usage, by contrast, is inherently hard to link to a subscriber identity. It therefore provides only limited and noisy metadata for personalization and offers few modalities for real-time advertisement delivery, which makes it a less preferred channel than mobile for the advertiser [9].

The subscriber fingerprinting model presented here assists the mobile service provider in personalization by:

a) Classifying subscribers and generating a holistic view of them automatically in near real time

b) Providing a secure application programming interface for targeting micro-segments and for seamless integration with third-party systems

c) Allowing marketers the flexibility to define the classification criteria and market segments they are interested in, based on their specific requirements

d) Providing service personalization and recommendations based on each subscriber's preferences and behavior

e) Ensuring strict privacy for subscribers.

The work presented here is based on a live system that we developed for a major telecom service provider. The system addresses privacy concerns by using a non-invertible, hashing-based approach to generate the insights, so the insights cannot be used to deduce the raw facts that led to a specific classification of any subscriber. Because the real-world data is confidential, we have masked identities wherever necessary.

The paper is organized as follows. Section II describes the proposed solution. Section III presents a live example of the system in use. Section IV concludes the paper.

II. OUR SOLUTION

We propose a system that provides near-real-time subscriber intelligence and service personalization to any touch-point¹ system. Using continuously updated data streams from the service provider's network and business infrastructure, our system maintains a real-time unified profile for each subscriber. The profile consists of static information about the subscriber as well as learned insights such as his/her socio-economic profile, data usage pattern, personas, and general preferences. The system consists of the components shown in Figure 1.

Figure 1 Product Architecture

A. Data Stream Uploader Engine (DSUE)

The uploader engine connects to live data streams, pulls raw data files over FTP/SFTP, and stores the relevant data in the distributed data warehouse. This task can be time-consuming because the files can be large, on the order of a few hundred gigabytes. The DSUE is a distributed engine with a pool of worker nodes running on a private cloud infrastructure [4][10]. When a data upload arrives, the master identifies a free worker node and assigns the task to it. If that worker fails, the task is reassigned to the next free worker node. Apache Hive [6] serves as the distributed data warehouse on which we run data warehousing workloads during insight generation.
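As an illustration only, the assign-and-reassign behavior of the master can be sketched in a few lines of Python. This is a minimal single-process sketch, not the DSUE implementation: the function `run_upload`, the `upload` callback, and the worker and file names are all hypothetical, and a node failure is modeled simply as a raised exception.

```python
from collections import deque


def run_upload(task, free_workers, upload):
    """Assign `task` to free workers in turn and return the one that succeeds.

    free_workers: names of the currently free worker nodes (hypothetical).
    upload(worker, task): caller-supplied function that raises on failure.
    """
    queue = deque(free_workers)
    while queue:
        worker = queue.popleft()      # master picks the next free worker
        try:
            upload(worker, task)
            return worker
        except Exception:
            continue                  # failure: reassign to the next free worker
    raise RuntimeError(f"no free worker could complete {task!r}")


def flaky_upload(worker, task):
    """Simulated upload in which the first worker node fails."""
    if worker == "worker-1":
        raise IOError("connection lost")


assigned = run_upload("recharge_cdr.csv", ["worker-1", "worker-2"], flaky_upload)
```

In this toy run the first worker fails, so the task ends up on the second one. A production master would also have to track busy workers, timeouts, and partially transferred files, none of which is modeled here.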

B. Continuous Insight Engine (CIE)

The CIE is the intelligent component of the system; it generates meaningful insights about subscribers in near real time. It consists of models that continuously analyze data using a massively distributed cluster of nodes deployed on a private cloud infrastructure. The CIE is highly scalable and can adapt to changing file formats and new requirements. Models are triggered by a data-driven framework, which means that a model runs only when a fresh data set becomes available to it. An important aspect of the CIE is managing job schedules and the actions within a job. We use job workflows to generate subscriber insights. A workflow is a directed acyclic graph of actions, each of which executes on a Hadoop [3] cluster as one or more map-reduce jobs. Actions perform tasks such as data pre-processing and validation, advanced analytics and statistical modeling, and persistence of insights into a data store. Yahoo! Oozie [7] is used as the workflow and scheduling engine.
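To make the workflow idea concrete, the following is a minimal sketch of a data-driven job workflow expressed as a directed acyclic graph. It is plain Python and does not use Oozie or its workflow definition format. The action names, the `run_workflow` function, and the `fresh_data_available` flag are assumptions made for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_workflow(actions, dependencies, fresh_data_available):
    """Run actions in dependency order, but only when fresh data has arrived.

    actions: dict mapping action name -> zero-argument callable
    dependencies: dict mapping action name -> set of actions it depends on
    Returns the list of action names in the order they were executed.
    """
    if not fresh_data_available:
        return []                     # data-driven trigger: nothing to run
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        actions[name]()
    return order


log = []
actions = {
    "validate": lambda: log.append("validate"),   # pre-processing and validation
    "model": lambda: log.append("model"),         # analytics / statistical modeling
    "persist": lambda: log.append("persist"),     # write insights to a data store
}
dependencies = {"validate": set(), "model": {"validate"}, "persist": {"model"}}
executed = run_workflow(actions, dependencies, fresh_data_available=True)
```

Each action here is a local function call. In the system described above, each action would instead be submitted to the Hadoop cluster as one or more map-reduce jobs.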

1) Canonical Models

These are essentially content-based filtering models. The insights they generate are deterministic in nature and require only rule-based calculations.
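A deterministic, rule-based insight of this kind might look like the sketch below. The text does not publish the rules actually used, so the tag names, input field names, and the SMS threshold are purely hypothetical.

```python
def canonical_tags(usage):
    """Derive deterministic tags from one subscriber's usage record.

    usage: dict with hypothetical keys "day_minutes", "night_minutes", "sms_per_day".
    """
    tags = {}
    # Hypothetical rule: compare minutes of usage by night and by day.
    tags["calling_period"] = (
        "night" if usage["night_minutes"] > usage["day_minutes"] else "day"
    )
    # Hypothetical rule: fixed threshold on the average number of SMSs per day.
    tags["sms_usage"] = "heavy" if usage["sms_per_day"] >= 20 else "light"
    return tags


example = canonical_tags({"day_minutes": 12.0, "night_minutes": 30.5, "sms_per_day": 4})
```

Because the rules are fixed, the same usage record always yields the same tags, in contrast to the probabilistic custom models.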

2) Custom Models: Context-based clustering

These are more complex content-based filtering models. The insights they generate are probabilistic in nature, and we use unsupervised machine learning algorithms to generate them. After analyzing the telecom Call Data Records (CDR), we introduced unsupervised Gaussian Mixture Modeling (GMM) as one of our custom models to discover the natural clusters present in the telecom data. To address scalability, we designed and implemented a MapReduce-based multivariate GMM over Hadoop along the lines of the Mahout libraries [5]. The feature vector comprises a subscriber's call and data usage parameters, such as the average monthly revenue generated by the user, the average number of SMSs per day, the minutes of usage per day or night, and the amount of GPRS usage. The resulting soft clusters allow a user to be mapped to a specific market segment along with a confidence measure. This insight can be leveraged to provide personalized recommendations and campaigns.

2.1) Map-Reduce paradigm for Gaussian Mixture Model

A GMM is generally learned using the expectation-maximization (EM) algorithm [5][11]. We found that the expectation and maximization steps of this process map directly onto the map and reduce phases, respectively, of the Map-Reduce paradigm.

Assume we have a data set $\{x_1, \dots, x_N\}$ consisting of N observations of a random D-dimensional variable x. The goal is to maximize the likelihood function with respect to the parameters of the GMM.

1. Initialize the means $\mu_k$, covariance matrices $\Sigma_k$, and mixing coefficients $\pi_k$, and evaluate the initial value of the log likelihood. The means act as the keys and the data points act as the values in MapReduce's key/value framework.

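2. Expectation/Map step: Evaluate the posterior probabilities $\gamma(z_{nk})$ using the current parameters, as defined below

$$\gamma(z_{nk}) = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} \tag{2.1}$$

where $\mathcal{N}$ is a multivariate Gaussian pdf

3. Maximization/Reduce step: Re-estimate the parameters using the current posterior probabilities, using the equations given below

$$\mu_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, x_n \tag{2.2}$$

$$\Sigma_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, (x_n - \mu_k^{new})(x_n - \mu_k^{new})^{T} \tag{2.3}$$

$$\pi_k^{new} = \frac{N_k}{N} \tag{2.4}$$

where $N_k = \sum_{n=1}^{N} \gamma(z_{nk})$

4. Evaluate the log likelihood and check for convergence of either the parameters or the log likelihood. If the convergence criterion is not satisfied, return to step 2.

$$\ln p(X \mid \mu, \Sigma, \pi) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\} \tag{2.5}$$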
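The four steps above can be sketched in Python as follows. This is a single-process illustration of how the E-step and M-step split into map and reduce phases, not the Hadoop implementation described in this paper. To stay dependency-free it treats the univariate case (D = 1), so each covariance matrix reduces to a scalar variance. The function names, the toy data, and the variance floor are assumptions introduced for the sketch.

```python
import math


def gaussian_pdf(x, mean, var):
    """Univariate Gaussian density N(x | mean, var)."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def map_step(x, params):
    """E-step for one point: return [(k, gamma_nk)] and ln p(x) under `params`.

    params is a list of (pi_k, mu_k, var_k) tuples, one per component.
    """
    weighted = [pi * gaussian_pdf(x, mu, var) for pi, mu, var in params]
    total = sum(weighted)                        # denominator of (2.1)
    return [(k, w / total) for k, w in enumerate(weighted)], math.log(total)


def reduce_step(values, n_total, var_floor=1e-6):
    """M-step for one component; values is a list of (gamma_nk, x_n) pairs."""
    n_k = sum(g for g, _ in values)
    mu = sum(g * x for g, x in values) / n_k                  # (2.2)
    var = sum(g * (x - mu) ** 2 for g, x in values) / n_k     # (2.3), scalar case
    # var_floor is a numerical safeguard added for this sketch only.
    return (n_k / n_total, mu, max(var, var_floor))           # (2.4)


def fit_gmm(data, params, tol=1e-6, max_iter=200):
    """Alternate map (E) and reduce (M) phases until the log likelihood converges."""
    prev_ll = None
    for _ in range(max_iter):
        grouped = {k: [] for k in range(len(params))}   # shuffle: group by component key
        ll = 0.0
        for x in data:                                   # map phase
            pairs, log_px = map_step(x, params)
            ll += log_px                                 # accumulates (2.5)
            for k, gamma in pairs:
                grouped[k].append((gamma, x))
        if prev_ll is not None and abs(ll - prev_ll) < tol:
            break                                        # step 4: converged
        prev_ll = ll
        params = [reduce_step(grouped[k], len(data)) for k in sorted(grouped)]  # reduce phase
    return params, ll


# Hypothetical toy data: two well-separated groups of values.
revenue = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
fitted, log_likelihood = fit_gmm(revenue, [(0.5, 0.0, 1.0), (0.5, 6.0, 1.0)])
```

On this toy data the two fitted means settle near 1.0 and 5.0 with roughly equal mixing coefficients. The sketch does not handle a component whose responsibilities all vanish, and a distributed version would emit partial sums from combiners rather than collect every (gamma, x) pair per key.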

¹ Touch-points: systems with which a subscriber interacts, e.g., the WAP portal, self-care portal, IVR systems, etc.

Figure 2 An example GMM over monthly revenue from subscribers

C. Tag Store

A tag is a concise piece of information with attributes such as a name, a value, a timestamp, and an associated confidence measure. A subscriber's fingerprint consists of such tags. The tag store is a distributed, column-oriented NoSQL database running on a cloud; it is decentralized and highly scalable. HBase [12] serves as its foundation, providing low-latency, scalable data access along with versioning. Because the number of attributes differs between subscribers, a relational data model would be neither scalable nor well suited to this problem. Consistency and availability are ensured, along with failover mechanisms. DRBD [8] is used to counter the Hadoop name node being a single point of failure. Data replication is handled at the Hadoop level.
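The shape of a tag and of a sparse, versioned fingerprint can be illustrated with a small in-memory stand-in. This sketch does not use HBase or its client API. The class names, field types, and the opaque subscriber key are assumptions made for illustration only.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Tag:
    name: str
    value: str
    timestamp: int       # assumed: Unix epoch seconds
    confidence: float    # assumed: a probability in [0, 1]


class InMemoryTagStore:
    """Toy stand-in for the tag store: sparse per-subscriber tags, all versions kept."""

    def __init__(self):
        # subscriber key -> tag name -> list of stored versions
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, subscriber_key, tag):
        self._rows[subscriber_key][tag.name].append(tag)

    def latest(self, subscriber_key, name):
        """Return the most recent version of a tag, or None if it was never set."""
        versions = self._rows.get(subscriber_key, {}).get(name, [])
        return max(versions, key=lambda t: t.timestamp, default=None)

    def fingerprint(self, subscriber_key):
        """Return the latest version of every tag held for one subscriber."""
        names = self._rows.get(subscriber_key, {})
        return {name: self.latest(subscriber_key, name) for name in names}


store = InMemoryTagStore()
store.put("a1b2c3", Tag("segment", "plan_50", timestamp=100, confidence=0.71))
store.put("a1b2c3", Tag("segment", "plan_100", timestamp=200, confidence=0.64))
store.put("a1b2c3", Tag("sms_usage", "light", timestamp=150, confidence=1.0))
current = store.fingerprint("a1b2c3")
```

Subscribers with few tags simply have fewer entries, which is the property that makes a sparse, column-oriented layout a better fit than a fixed relational schema.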

D. Tag Serving Cluster

The tag serving cluster is an array of web servers behind a load balancer (software or hardware) that gives external touch-point systems secure access to our system API. Incoming requests for tags and recommendations can arrive as HTTP REST calls or SOAP calls.

E. Policy Manager

The Tag Policy Manager authenticates and authorizes all incoming requests.

III. A LIVE EXAMPLE

The system provides a short-code number that a user can call to ask for the best p recharge offers for him/her, given his/her network usage trends, out of the k recharge plans² that the system offers at that time. This way, a busy user does not need to browse through all the recharge plans to make a decision. We used live data files generated by a subscriber base of 50 million. We observed that nearly 4 million subscribers recharge every day. We process the recharge-CDR files to obtain the recharge values of these subscribers. Using Gaussian-mixture-based clustering, we find the best p recharge options for each subscriber as follows.

1. We find the k-mode GMM and map users into each resulting segment with a confidence measure.

2. The confidence measure is the probability of the subscriber being a member of a specific recharge segment.

3. These confidence measures are used to order the best p (< k) recharge options, which are given as a personalized recommendation.
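The ranking in these three steps can be sketched as follows, assuming a univariate GMM over recharge values has already been fitted. The plan names, mixture parameters, and the function `top_p_plans` are hypothetical and chosen only to make the example runnable.

```python
import math


def gaussian_pdf(x, mean, var):
    """Univariate Gaussian density N(x | mean, var)."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def top_p_plans(recharge_value, segments, p):
    """Return the p most likely recharge segments as (plan, confidence) pairs.

    segments: list of (plan_name, pi_k, mu_k, var_k) from a fitted k-component GMM.
    The confidence is the posterior probability of membership in each segment.
    """
    if not 0 < p < len(segments):
        raise ValueError("p must satisfy 0 < p < k")
    weighted = [(plan, pi * gaussian_pdf(recharge_value, mu, var))
                for plan, pi, mu, var in segments]
    total = sum(w for _, w in weighted)
    ranked = sorted(((plan, w / total) for plan, w in weighted),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:p]


# Hypothetical fitted mixture with k = 3 recharge segments.
segments = [
    ("plan_10", 0.5, 10.0, 4.0),
    ("plan_50", 0.3, 50.0, 25.0),
    ("plan_100", 0.2, 100.0, 100.0),
]
best = top_p_plans(45.0, segments, p=1)
```

For a subscriber whose typical recharge value is 45, the sketch ranks the segment centered at 50 first, which corresponds to the p = 1, k = 3 setting.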

Figure 2 shows one of our experimental results from applying a GMM to recommend the best recharge plan when the provider was offering only 3 plans (p = 1 and k = 3).

IV. CONCLUSIONS

In this paper, we presented our subscriber fingerprinting model, a mobile service personalization and recommendation system built on a distributed framework. Its distributed shared-nothing architecture scales to large volumes of subscriber data (on the order of tens of millions of subscribers) and is capable of delivering low-latency, real-time recommendations. The underlying cloud infrastructure makes the platform elastic and future-proof, able to accommodate workloads of varying complexity. We also showed the utility of statistical models, such as Gaussian Mixture Models, for recommendation systems, and provided empirical results from applying this model to real-world data that demonstrate improved matching of recommendations to subscribers.

REFERENCES

[1] Ho and Ho, "The Attraction of Personalized Service for Users in Mobile Commerce: An Empirical Study", ACM SIGecom Exchanges, Vol. 3, No. 4, January 2003, pp. 10-18.

[2] Kurkovsky and Harihar, "Using ubiquitous computing in interactive mobile marketing", Personal and Ubiquitous Computing, Vol. 10, No. 4, 1 May 2006, pp. 227-240.

[3] Dean and Ghemawat, "MapReduce: simplified data processing on large clusters", Operating Systems Design & Implementation, 2004, pp. 10-10.

[4] Ananthanarayanan et al., "Cloud analytics: Do we really need to reinvent the storage stack?", Workshop on Hot Topics in Cloud Computing, 2009.

[5] Xu and Jordan, "On convergence properties of the EM algorithm for Gaussian mixtures", Neural Computation, 8:129-151, 1996.

[6] Thusoo et al., "Hive: a warehousing solution over a Map-Reduce framework", VLDB, 2009.

[7] Nguyen and Halem, "A MapReduce workflow system for architecting scientific data intensive applications", Workshop on Software Engineering for Cloud Computing, 2011.

[8] Philipp R., "DRBD", UNIX en High Availability, Ede, 2001, pp. 93-104.

[9] Ducoffe R., "Advertising value and advertising on the Web", Journal of Advertising Research, 1996, pp. 21-35.

[10] Vogels, "Head in the Clouds - The Power of Infrastructure as a Service", Workshop on Cloud Computing and its Applications, 2008.

[11] Bishop, C. M., "Pattern Recognition and Machine Learning", Springer, 2006.

[12] A. Khetrapal and V. Ganesh, "HBase and Hypertable for large scale distributed storage systems", Dept. of Computer Science, Purdue University, 2008.

² Recharge plan: a prepaid scheme offered by service providers that allows subscribers to pay for their products and services.
