Mobile Subscriber Fingerprinting: A Big Data Approach
Jobin Wilson
R&D, Flytxt Technology Pvt. Ltd.
Trivandrum, India
Vikram Garg
R&D, Flytxt Technology Pvt. Ltd.
Trivandrum, India
Abstract—Mobile advertising campaigns must use subscriber
data to target subscribers relevant to the product or service being
recommended. However, the matching of recommendations to
subscribers is most often inexact, and no straightforward
“attribute-value” matching algorithm suffices. Arbitrary
matching degrades subscriber experience and results in lower
conversion rates.
One viable solution is to cluster subscribers by common
attributes, and then match recommendations with the same
attributes to subscribers in the associated cluster. This approach
presents a few significant challenges in the data architecture: the
clusters are extremely large and need not have exact boundaries,
and their representation must facilitate real-time
recommendations. We describe here a novel architecture, which
we call subscriber “fingerprinting”. The proposed system is
capable of analyzing extremely large volumes of data (on the
order of terabytes) using large-scale distributed
Extract-Transform-Load (ETL) operations followed by
distributed data analytics involving statistical models. The
insights generated from this process can be used to serve
personalized recommendations in real-time. The proposed system
uses big data analytics built over a Hadoop ecosystem, and
leverages a private cloud infrastructure for deployment. The
system also includes a simple, secure and lightweight integration
API using REST.
Keywords- recommendation systems; big data analytics; distributed computing; cloud computing; content recommendations; service personalization; mobile advertisements
I. INTRODUCTION
Mobile telecommunication service providers are in a unique position to offer their subscribers accurate service personalization, customized service recommendations, and contextual advertisements. A service provider may also want to predict and influence subscriber behavior, for example in churn prediction and management. This requires the operator to maintain a 360-degree view of each subscriber (including his/her behavior, preferences and service usage patterns), which must be mined and learned by the system automatically [1][2] while ensuring strict subscriber privacy. The rich, noise-free metadata in the transaction logs generated by a subscriber's interactions with the telecom network is an apt input for this process. Further, mobile devices support multiple modalities of real-time content delivery, such as SMS, MMS, outbound calls and the mobile internet, which makes mobile phones a promising advertisement channel.
By contrast, PC-based internet usage is inherently hard to link to a subscriber identity. It therefore provides limited and noisy metadata for personalization and offers only limited modalities for real-time advertisement delivery, making it a less preferred channel than mobile for the advertiser [9].
The subscriber fingerprinting model that we present here assists the mobile service provider in personalization by:
a) Classifying subscribers and generating a holistic view of them automatically in near-real time
b) Providing a secure application programming interface to target micro-segments and to integrate seamlessly with third-party systems
c) Allowing marketers the flexibility to define the classification criteria and market segments they are interested in, based on their specific requirements
d) Providing service personalization and recommendations based on subscribers' preferences and behavior
e) Ensuring strict privacy for subscribers.
The work presented here is based on a live system that we have developed for one of the major telecom service providers. The system addresses privacy concerns by using a non-invertible, hashing-based approach to generate the insights, so these insights cannot be used to deduce the raw facts that led to a specific classification of any subscriber. Because the real-world data is confidential, we have masked identities wherever necessary.
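The paper does not specify the hashing scheme. As a minimal sketch of one possible reading, assuming the subscriber identity is replaced by a keyed one-way digest before insights are stored, the following uses Python's standard `hmac` and `hashlib` modules. The key value, the function name `pseudonymize`, and the sample number are hypothetical and are not taken from the live system.

```python
import hashlib
import hmac

# Hypothetical secret held only by the operator; not from the paper.
SECRET_KEY = b"operator-held-secret"


def pseudonymize(msisdn: str, key: bytes = SECRET_KEY) -> str:
    """Return a keyed SHA-256 digest that stands in for the subscriber identity."""
    return hmac.new(key, msisdn.encode("utf-8"), hashlib.sha256).hexdigest()


# The same input always maps to the same token, so tags can still be joined
# per subscriber, while the token itself does not reveal the number.
token = pseudonymize("919800000001")
other = pseudonymize("919800000002")
```

A keyed digest is used here rather than a plain hash because phone numbers come from a small, enumerable space; without a secret key, an attacker could hash every possible number and invert the mapping by lookup.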
The paper is organized as follows. Section II describes the proposed solution. Section III presents a live example of the system in use. Section IV concludes.
II. OUR SOLUTION
We propose a system that provides near-real-time subscriber intelligence and service personalization to any touch-point system¹. Using continuously updated data streams from the service provider's network and business infrastructure, our system maintains a real-time unified profile for each subscriber. The profile consists of both static information about the subscriber and learned insights such as his/her socio-economic profile, data usage pattern, personas, and general preferences. The proposed system consists of the components shown in Figure 1.
Figure 1 Product Architecture
A. Data Stream Uploader Engine (DSUE)
The uploader engine connects to live data streams, pulls raw data files over FTP/SFTP, and stores the relevant data in the distributed data warehouse. This task can be time-consuming because individual files can be as large as a few hundred gigabytes. The DSUE is a distributed engine with a pool of worker nodes running on a private cloud infrastructure [4][10]. At the time of a data upload, the master identifies a free worker node and assigns the task to it. If the worker fails, the task is reassigned to the next free worker node. Apache Hive [6] serves as the distributed data warehouse, where we run data warehousing workloads during insight generation.
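The assign-and-reassign behavior described above can be sketched as follows. This is an illustrative, single-process stand-in, not the DSUE implementation: the function `run_upload`, the worker names, the file name, and the `attempt` callback (which would perform the actual FTP/SFTP pull and warehouse load) are all assumptions made for the example.

```python
from collections import deque
from typing import Callable, Deque, Optional


def run_upload(task: str, free_workers: Deque[str],
               attempt: Callable[[str, str], bool]) -> Optional[str]:
    """Assign `task` to free workers in turn until one succeeds.

    Returns the worker that completed the task, or None if every worker failed.
    """
    while free_workers:
        worker = free_workers.popleft()   # master picks the next free worker
        if attempt(worker, task):         # True means the upload succeeded
            free_workers.append(worker)   # the worker becomes free again
            return worker
        # On failure the worker is dropped and the task goes to the next one.
    return None


def flaky_attempt(worker: str, task: str) -> bool:
    """Simulated upload in which worker w1 always fails."""
    return worker != "w1"


workers = deque(["w1", "w2", "w3"])
done_by = run_upload("recharge_cdr.gz", workers, flaky_attempt)
```

In this run the task is first given to `w1`, fails, and is then completed by `w2`, which returns to the back of the free pool.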
B. Continuous Insight Engine (CIE)
The CIE is the intelligent component of the system; it generates meaningful insights about subscribers in near-real time. It consists of models that continuously analyze data on a massively distributed cluster of nodes deployed over a private cloud infrastructure. The CIE is highly scalable and can adapt to changing file formats and new requirements. Models are triggered by a data-driven framework, which means that a model runs only when a fresh data set is made available to it. An important aspect of the CIE is managing job schedules and the actions within a job. We use job workflows to generate subscriber insights. A workflow is a directed acyclic graph of actions, each of which executes on a Hadoop [3] cluster as one or more MapReduce jobs. Typical actions include data pre-processing and validation, advanced analytics and statistical modeling, and persistence of insights into a data store. Yahoo! Oozie [7] is used as the workflow and scheduling engine.
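To illustrate the idea of a data-triggered workflow whose actions form a directed acyclic graph, the sketch below orders four actions by their dependencies using Python's standard `graphlib` module (Python 3.9 or later). It is a local stand-in only: Oozie workflows are defined and scheduled differently, and the action names and the `run_workflow` function are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical action names; each maps to the set of actions it depends on.
WORKFLOW = {
    "validate": set(),
    "preprocess": {"validate"},
    "fit_model": {"preprocess"},
    "persist_tags": {"fit_model"},
}


def run_workflow(fresh_data_available: bool, actions: dict) -> list:
    """Run every action in dependency order; do nothing without fresh data."""
    if not fresh_data_available:
        return []                      # data-driven trigger: nothing to run
    executed = []
    for name in TopologicalSorter(WORKFLOW).static_order():
        actions[name]()                # on a cluster this would launch a job
        executed.append(name)
    return executed


log = []
actions = {name: (lambda n=name: log.append(n)) for name in WORKFLOW}
order = run_workflow(True, actions)     # fresh data: all four actions run
skipped = run_workflow(False, actions)  # no fresh data: workflow is skipped
```

Because each action here depends on exactly one predecessor, the execution order is fully determined: validation, pre-processing, model fitting, then persistence.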
1) Canonical Models
These models are essentially content-based filtering models. The insights they generate are deterministic in nature and require rule-based calculations.
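A canonical model can be thought of as a fixed set of rules applied to a subscriber's usage summary. The sketch below is purely illustrative: the field names, the tag names, and the revenue thresholds are assumptions chosen for the example and do not come from the deployed system.

```python
from typing import Dict

# Hypothetical thresholds and field names, chosen only for illustration.
REVENUE_TIERS = ((500.0, "high"), (100.0, "medium"))


def canonical_tags(usage: Dict[str, float]) -> Dict[str, str]:
    """Derive deterministic tags from a usage summary using fixed rules."""
    revenue = usage.get("avg_monthly_revenue", 0.0)
    tier = "low"
    for threshold, label in REVENUE_TIERS:   # listed from highest to lowest
        if revenue >= threshold:
            tier = label
            break
    data_user = "yes" if usage.get("gprs_mb_per_month", 0.0) > 0 else "no"
    return {"revenue_tier": tier, "data_user": data_user}


tags = canonical_tags({"avg_monthly_revenue": 250.0, "gprs_mb_per_month": 0.0})
```

The same input always yields the same tags, which is what distinguishes these rule-based insights from the probabilistic ones produced by the custom models.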
2) Custom Models: Context based clustering
These are more complex content-based filtering models. The insights they generate are probabilistic in nature, and we use unsupervised machine learning algorithms to generate them. After analyzing the telecom Call Data Records (CDR), we introduced an unsupervised Gaussian Mixture Model (GMM) technique as one of our custom models to find the natural clusters present in the telecom data. A MapReduce-based multivariate GMM was designed and implemented over Hadoop, along the lines of the Mahout libraries [5], to address the problem of scalability. The feature vector comprises a subscriber's call and data usage parameters, such as the average monthly revenue generated by that user, the average number of SMSs per day, minutes of usage per day or night, and the amount of GPRS usage. The soft clusters thus generated allow a user to be mapped into a specific market segment along with a confidence measure. This insight can be leveraged to provide personalized recommendations and campaigns.
2.1) Map-Reduce paradigm for Gaussian Mixture Model
A GMM is generally learned using the expectation-maximization (EM) algorithm [5][11]. We found that the expectation and maximization steps of this process map directly onto the map and reduce phases, respectively, of the MapReduce paradigm.
Assume we have a data set $\{x_1, \dots, x_N\}$ consisting of N observations of a random D-dimensional variable x. The goal is to maximize the likelihood function with respect to the parameters of the GMM.
1. Initialize the means $\mu_k$, covariance matrices $\Sigma_k$ and mixing coefficients $\pi_k$, and evaluate the initial value of the log likelihood. The means act as the keys, while the data points act as the values, in MapReduce's key/value framework.
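2. Expectation/Map step: Evaluate the posterior probabilities $\gamma(z_{nk})$ using the current parameters as defined below
$$\gamma(z_{nk}) = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} \tag{2.1}$$
where $\mathcal{N}$ is a multivariate Gaussian pdf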
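3. Maximization/Reduce step: Re-estimate the parameters using the current posterior probabilities using the equations given below
$$\mu_k^{\mathrm{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, x_n \tag{2.2}$$
$$\Sigma_k^{\mathrm{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, (x_n - \mu_k^{\mathrm{new}})(x_n - \mu_k^{\mathrm{new}})^{T} \tag{2.3}$$
$$\pi_k^{\mathrm{new}} = \frac{N_k}{N} \tag{2.4}$$
where $N_k = \sum_{n=1}^{N} \gamma(z_{nk})$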
4. Evaluate the log likelihood and check for convergence of either the parameters or the log likelihood. If the convergence criterion is not satisfied, return to step 2.
$$\ln p(X \mid \mu, \Sigma, \pi) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\} \tag{2.5}$$
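The four steps can be sketched in a single process as follows. This is a simplified illustration under stated assumptions, not the Hadoop implementation: it handles only the one-dimensional case (so each covariance is a scalar variance), it uses the component index k as the key rather than the mean, and it runs the map, shuffle and reduce phases as ordinary Python loops. The function names and the sample data are hypothetical.

```python
import math
from collections import defaultdict


def gauss(x, mu, var):
    """Univariate Gaussian pdf."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def map_step(x, params):
    """E-step for one point (2.1): emit (k, (gamma, gamma*x, gamma*x^2)) pairs
    together with the point's log likelihood contribution."""
    weighted = [pi * gauss(x, mu, var) for pi, mu, var in params]
    total = sum(weighted)
    pairs = [(k, (w / total, w / total * x, w / total * x * x))
             for k, w in enumerate(weighted)]
    return pairs, math.log(total)


def reduce_step(values, n):
    """M-step for one component (2.2)-(2.4) from its emitted statistics."""
    nk = sum(v[0] for v in values)
    mu = sum(v[1] for v in values) / nk
    # Weighted E[x^2] - mu^2 equals the weighted squared deviation in (2.3).
    var = sum(v[2] for v in values) / nk - mu * mu
    return nk / n, mu, max(var, 1e-6)   # floor guards against a collapsed variance


def em(data, params, iters=50, tol=1e-6):
    """Iterate map and reduce until the log likelihood (2.5) stops changing."""
    prev = -math.inf
    loglik = prev
    for _ in range(iters):
        grouped, loglik = defaultdict(list), 0.0
        for x in data:
            pairs, ll = map_step(x, params)
            loglik += ll
            for k, v in pairs:
                grouped[k].append(v)    # shuffle: group emitted values by key
        params = [reduce_step(grouped[k], len(data)) for k in range(len(params))]
        if abs(loglik - prev) < tol:
            break
        prev = loglik
    return params, loglik


# Hypothetical recharge values forming two well-separated groups.
data = [9.0, 10.0, 11.0, 10.0, 49.0, 50.0, 51.0, 50.0]
init = [(0.5, 0.0, 100.0), (0.5, 60.0, 100.0)]   # (pi_k, mu_k, var_k)
fitted, loglik = em(data, init)
```

Emitting the partial sums of gamma, gamma times x, and gamma times x squared lets each reducer compute its component's parameters without seeing the raw points again, which is what makes the M-step a natural fit for the reduce phase.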
¹ Touch-points: systems with which a subscriber interacts, e.g. the WAP portal, self-care portal, and IVR systems.
Figure 2 An example GMM over monthly revenue from subscribers
C. Tag Store
A tag is a concise piece of information with attributes such as a name, value, timestamp, and associated confidence measure. A subscriber's fingerprint consists of such tags. The tag store is a distributed, column-oriented NoSQL database running on a cloud; it is decentralized and highly scalable. HBase [12] serves as the foundation for the tag store and provides low-latency, scalable data access along with versioning. Since the number of attributes differs between subscribers, a relational model of data organization would be neither scalable nor well suited to this problem. Consistency and availability are ensured, along with failover mechanisms. Hadoop DRBD [8] is leveraged to counter the issue of the name node being a single point of failure. Data replication is handled at the Hadoop level.
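The tag and fingerprint structure can be illustrated with a small in-memory stand-in. This sketch does not use HBase or its API; the `Tag` and `TagStore` classes, their method names, and the sample values are hypothetical, and the nested dictionary only mimics the idea of a sparse row per subscriber with several timestamped versions per tag.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Tag:
    name: str
    value: str
    timestamp: int      # e.g. epoch seconds when the insight was generated
    confidence: float   # 1.0 for deterministic tags, a probability otherwise


class TagStore:
    """In-memory stand-in: subscriber -> tag name -> versions (oldest first)."""

    def __init__(self) -> None:
        self._rows: Dict[str, Dict[str, List[Tag]]] = defaultdict(
            lambda: defaultdict(list))

    def put(self, subscriber: str, tag: Tag) -> None:
        versions = self._rows[subscriber][tag.name]
        versions.append(tag)
        versions.sort(key=lambda t: t.timestamp)

    def latest(self, subscriber: str, name: str) -> Optional[Tag]:
        versions = self._rows.get(subscriber, {}).get(name, [])
        return versions[-1] if versions else None

    def fingerprint(self, subscriber: str) -> Dict[str, Tag]:
        """Latest version of every tag held for the subscriber."""
        return {n: v[-1] for n, v in self._rows.get(subscriber, {}).items() if v}


store = TagStore()
store.put("sub-a", Tag("recharge_segment", "plan_50", 1000, 0.62))
store.put("sub-a", Tag("recharge_segment", "plan_100", 2000, 0.91))
store.put("sub-a", Tag("data_user", "yes", 1500, 1.0))
current = store.latest("sub-a", "recharge_segment")
```

Each subscriber carries only the tags that were actually generated for them, which reflects why a fixed relational schema is a poor fit when the number of attributes varies between subscribers.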
D. Tag Serving Cluster
The tag serving cluster is an array of web servers behind a load balancer (software or hardware) that provides external touch-point systems with secure access to our system API. Incoming requests for tags and recommendations can arrive as HTTP REST calls or SOAP calls.
E. Policy Manager
The Tag Policy Manager authenticates and authorizes all incoming requests.
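How a serving node might consult the policy manager before returning a tag can be sketched as below. The paper does not describe the API or the credential scheme, so everything here is an assumption made for illustration: the API keys, the per-key grants, the `PolicyManager` and `handle_request` names, and the use of HTTP-style status codes.

```python
from typing import Dict, Optional, Set, Tuple


class PolicyManager:
    """Hypothetical policy check: api_key -> tag names the caller may read."""

    def __init__(self, grants: Dict[str, Set[str]]) -> None:
        self._grants = grants

    def authenticate(self, api_key: str) -> bool:
        return api_key in self._grants

    def authorize(self, api_key: str, tag_name: str) -> bool:
        return tag_name in self._grants.get(api_key, set())


def handle_request(policy: PolicyManager, api_key: str,
                   fingerprint: Dict[str, str],
                   tag_name: str) -> Tuple[int, Optional[str]]:
    """Return an HTTP-style status code and the tag value, if permitted."""
    if not policy.authenticate(api_key):
        return 401, None                 # unknown caller
    if not policy.authorize(api_key, tag_name):
        return 403, None                 # known caller, tag not granted
    if tag_name not in fingerprint:
        return 404, None                 # no such tag for this subscriber
    return 200, fingerprint[tag_name]


policy = PolicyManager({"ivr-key": {"recharge_segment"}})
fingerprint = {"recharge_segment": "plan_100", "revenue_tier": "medium"}
ok = handle_request(policy, "ivr-key", fingerprint, "recharge_segment")
denied = handle_request(policy, "ivr-key", fingerprint, "revenue_tier")
unknown = handle_request(policy, "bad-key", fingerprint, "recharge_segment")
```

Separating authentication from authorization lets a touch-point system be recognized while still being restricted to the micro-segments it is entitled to query.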
III. A LIVE EXAMPLE
The system provides a short-code number that a user can call to ask for the best p recharge offers for him/her, given his/her network usage trends, out of the k recharge plans² the system is offering at that time. This way, a busy user does not need to browse all the recharge plans to make a decision. We used live data files generated by a base of 50 million subscribers. We observed that nearly 4 million subscribers recharge every day. We process the recharge-CDR files to obtain the recharge values of these subscribers. Using Gaussian-mixture-based clustering, we find the best p recharge options for each subscriber as follows.
1. We fit a k-mode GMM and map users into each segment with a confidence measure.
2. The confidence measure is the probability that the subscriber is a member of a specific recharge segment.
3. These confidence measures are used to rank the best p (< k) recharge options, which are given as a personalized recommendation.
Figure 2 shows one of our experimental results from applying a GMM to recommend the best recharge plan when the provider was offering only 3 plans (that is, p = 1 and k = 3).
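The ranking in the three steps above can be sketched for the k = 3, p = 1 case as follows. The mixture parameters below are invented for illustration and are not the fitted values behind Figure 2; the plan names and the functions `posteriors` and `recommend` are likewise hypothetical. The confidence for each plan is the posterior probability of the corresponding one-dimensional Gaussian component given the subscriber's recharge value.

```python
import math


def posteriors(x, components):
    """Posterior membership probability of recharge value x for each plan.

    `components` is a list of (plan, pi, mu, var) tuples for a 1-D mixture.
    """
    weighted = {
        plan: pi * math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
        for plan, pi, mu, var in components
    }
    total = sum(weighted.values())
    return {plan: w / total for plan, w in weighted.items()}


def recommend(x, components, p):
    """Return the p plans with the highest confidence, best first."""
    ranked = sorted(posteriors(x, components).items(),
                    key=lambda item: item[1], reverse=True)
    return ranked[:p]


# Hypothetical 3-component mixture over recharge value: (plan, pi, mu, var).
PLANS = [
    ("plan_50", 0.5, 50.0, 100.0),
    ("plan_100", 0.3, 100.0, 400.0),
    ("plan_200", 0.2, 200.0, 900.0),
]

top = recommend(95.0, PLANS, p=1)   # a subscriber who typically recharges 95
```

For this subscriber the component centered on 100 dominates, so that plan is returned along with its confidence; choosing a larger p would simply return more entries from the same ranking.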
IV. CONCLUSIONS
In this paper, we demonstrated our subscriber fingerprinting model, a mobile service personalization and recommendation system built on a distributed framework. This distributed shared-nothing architecture scales to large volumes of subscriber data (on the order of tens of millions of subscribers) and is capable of delivering low-latency, real-time recommendations. The underlying cloud infrastructure makes the platform elastic and able to accommodate workloads of varying complexity. We also showed the utility of statistical models, such as Gaussian Mixture Models, for recommendation systems, and provided empirical results from applying this model to real-world data to demonstrate improved matching of recommendations to subscribers.
REFERENCES
[1] Ho and Ho, "The Attraction of Personalized Service for Users in Mobile Commerce: An Empirical Study", ACM SIGecom Exchanges, Vol. 3, No. 4, January 2003, pp. 10-18.
[2] Kurkovsky and Harihar, "Using ubiquitous computing in interactive mobile marketing", Personal and Ubiquitous Computing, Vol. 10, No. 4, May 2006, pp. 227-240.
[3] Dean and Ghemawat, "MapReduce: simplified data processing on large clusters", Operating Systems Design & Implementation, 2004, pp. 10-10.
[4] Ananthanarayanan et al., "Cloud analytics: Do we really need to reinvent the storage stack?", Workshop on Hot Topics in Cloud Computing, 2009.
[5] Xu and Jordan, "On convergence properties of the EM algorithm for Gaussian mixtures", Neural Computation, 8:129-151, 1996.
[6] Thusoo et al., "Hive: a warehousing solution over a Map-Reduce framework", VLDB, 2009.
[7] Nguyen and Halem, "A MapReduce workflow system for architecting scientific data intensive applications", Workshop on Software Engineering for Cloud Computing, 2011.
[8] Philipp R., "DRBD", UNIX en High Availability, 2001, Ede, pp. 93-104.
[9] Ducoffe R., "Advertising value and advertising on the Web", Journal of Advertising Research, 1996, pp. 21-35.
[10] Vogels, "Head in the Clouds: The Power of Infrastructure as a Service", Workshop on Cloud Computing and Its Applications, 2008.
[11] Bishop, C. M., "Pattern Recognition and Machine Learning", Springer, 2006.
[12] A. Khetrapal and V. Ganesh, "HBase and Hypertable for large scale distributed storage systems", Dept. of Computer Science, Purdue University, 2008.
² Recharge plan: a prepaid scheme offered by service providers that allows subscribers to pay for their products and services.