7/30/2019 Write Up Raka Bigdata Analytic Telco
Warming Up
This document is an attempt at summarizing the technologies that play a part in the (big data) analytics
ecosystem. I put the words big data between parentheses because I will be focusing on the analytics aspect.
Big data is a relatively new (industry) term, but not necessarily a new concept. So if we limit our understanding of big data only to the volume, maybe we're missing the bigger picture (because big
is relative; our civilization has been dealing with an increasingly bigger volume of data, and what was big 10
years ago is quite likely not so big today). Just a heads up: from this point on, the phrase big data sometimes refers to the techniques & tools for dealing with big data.
People coined the Vs (Volume, Velocity, Variety) as a way to characterize big data. I think talking about Variety and Velocity is a good / easier way to start a discussion about big data (especially with
people coming from a strong database administration / data warehousing background). At times I need to
convince (or doubt) myself about anything, including big data, and I found it easier to convince myself
about big data if I start from Variety and Velocity.
Why Variety? Because people (as users of technologies) can easily appreciate the fact that data is
coming from more and more varied sources, thanks to cheaper sensors, mobility, and hyperconnectivity. Everybody with a cellphone generates data now. Every single activity they do, on various online
services, through their cellphone, generates the so-called data exhaust. Consequently, we are dealing
with a variety of data formats, from structured to unstructured [1].
At this point I'm skeptical, wondering: what unique role does big data play in this situation? We already
have and perform Extract Transform Load (ETL; common to data warehousing practitioners) to address
that challenge. Right, so maybe we should look at the other aspect to support this argument: velocity.
Why? Because the ETL process implies non-realtime analytics; the data is not processed just-in-time; it has to go through the transformation stage before it is loaded into the data warehouse, where the data is eventually
picked up to be analyzed.
Actually there are several articles about data warehousing in the advent of big data. I was trying to
understand: are they competing, or do they complement each other? I admit I haven't read through those articles, so I'll just put their links here for now; one from O'Reilly [2], and the other from Teradata [3].
People are talking about soft real-time analysis of data (or events). There's a phrase for that: Complex
Event Processing (CEP). A CEP platform basically enables us to observe a moving window of events (taken from a continuous stream of data) and do time-series analysis, such as pattern
matching, on that window [4]. One open-source product for CEP that I know is JBoss Drools Fusion [5].
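To make the moving-window idea concrete, here is a minimal sketch in plain Python (not Drools; the function name, events, and thresholds are all my own invention) of watching a trailing time window over an event stream and flagging a simple pattern:

```python
from collections import deque

def detect_burst(events, window_seconds=60, threshold=3):
    """Flag timestamps where more than `threshold` events fall
    inside the trailing `window_seconds` moving window."""
    window = deque()  # timestamps currently inside the moving window
    alerts = []
    for ts in events:  # events = sorted timestamps, in seconds
        window.append(ts)
        # evict events that have slid out of the window
        while window and ts - window[0] > window_seconds:
            window.popleft()
        if len(window) > threshold:
            alerts.append(ts)
    return alerts

# Four events within one minute trigger an alert at t=30;
# the lone event at t=300 does not:
print(detect_burst([0, 10, 20, 30, 300], window_seconds=60, threshold=3))
```

A real CEP engine adds much more (declarative rules, temporal operators, many concurrent windows), but the observe-a-window-then-match loop is the core of it.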
There's a strong link between Velocity and Variety, provided by Value [6]. Fusing data from a broad range
of sources (variety) opens up the risk of having low-value data: data with a low signal-to-noise ratio.
[1] http://www.finextra.com/community/FullBlog.aspx?blogid=6129
[2] http://strata.oreilly.com/2011/01/data-warehouse-big-data.html
[3] http://www.teradata.com/white-papers/Hadoop-and-the-Data-Warehouse-When-to-Use-Which/?type=WP
[4] http://bit.ly/xrLyV1
[5] http://www.jboss.org/drools/drools-fusion.html
[6] http://www.finextra.com/community/fullblog.aspx?blogid=6222
Example: Facebook / Twitter. The need for speed here is to find the high-value data among piles of low-value data.
At this point maybe I should stop for a while trying to convince myself about big data. The practical insights I gather along the way will either clear things up or leave me with better / bigger / more
fundamental questions. Either outcome is beneficial. For now I'll just take it as a fact that big data is
important, it's here, and it's part of the continuum of tools we need to employ to bring up intelligence.
I'll switch back to analytics, but before that I would like to do a round-up of the big data tools I've found so
far. The following table is a summary of the tools I'm currently learning to use. The descriptions are in my own words, based on my current understanding, so the list is not comprehensive and may contain inaccuracies.
Hadoop — Provides a programming framework that implements MapReduce, and a
platform for its execution.
It also provides a distributed filesystem smart enough to ensure minimum
data motion [7] during the parallel processing of a (big) dataset spread across a cluster of machines.
It splits the dataset so that the task assigned a subset of the data
executes on the machine where that subset is located (locally or
nearby); this is the Map phase. It then coordinates the aggregation of the computation results collected from the task nodes (during the Reduce phase).
It is still not clear to me how this process affinity is achieved [8], but it looks to me like it comes from the configuration of our Hadoop cluster, specifically
optimized for the application we're working on (knowing the nature of the
data distribution, the data-processing pipeline, etc.).
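The Map / shuffle / Reduce flow described above can be illustrated with a toy, single-machine word count in Python. Hadoop would run the same logic as distributed tasks written against its own API; this sketch only shows the dataflow, with invented helper names:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(line):
    # emit (key, value) pairs, one per word
    return [(word, 1) for word in line.split()]

def reduce_phase(key, values):
    # aggregate all values that share a key
    return (key, sum(values))

lines = ["big data is big", "data is data"]

# Map: run the mapper over every input split
mapped = [pair for line in lines for pair in map_phase(line)]
# Shuffle: group pairs by key (Hadoop does this across the network)
mapped.sort(key=itemgetter(0))
grouped = groupby(mapped, key=itemgetter(0))
# Reduce: one call per distinct key
result = dict(reduce_phase(k, [v for _, v in pairs]) for k, pairs in grouped)
print(result)  # {'big': 2, 'data': 3, 'is': 2}
```

The point of Hadoop is that the map calls run on the machines holding the data, and the shuffle / reduce coordination happens for us.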
Apache Pig — While Hadoop provides the programming library for implementing the map
and reduce tasks, the Apache Pig project raises the level of abstraction, allowing people to specify those operations in an SQL-like syntax. It brings
productivity / efficiency to a project. Yahoo is said to have 40% to
60% of its Hadoop workloads implemented as Pig scripts [9].
Apache Mahout — A collection of machine-learning algorithms (e.g. clustering, naive Bayes, covering, linear modeling, etc.), ready to use, for execution over a
Hadoop cluster.
A big relief: thanks to Mahout we can save the project time otherwise spent making those standard machine-learning algorithms parallelizable, using the
MapReduce approach, specifically for execution on a Hadoop cluster.
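As an illustration of the kind of algorithm Mahout parallelizes for us, here is a deliberately tiny, sequential k-means sketch in Python (1-D, integer data to keep it exact; Mahout's value is precisely that we do not have to rewrite this ourselves as MapReduce jobs):

```python
def kmeans_1d(points, centers, iterations=10):
    """Toy 1-D k-means: assign each point to its nearest center,
    then move each center to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = {c: [] for c in centers}
        for p in points:
            nearest = min(centers, key=lambda c: abs(c - p))
            clusters[nearest].append(p)
        # recompute centers; keep a center unchanged if it got no points
        centers = [sum(ps) / len(ps) if ps else c
                   for c, ps in clusters.items()]
    return sorted(centers)

# Two obvious groups, around 1 and around 10:
print(kmeans_1d([1, 2, 0, 9, 10, 11], centers=[0.0, 5.0]))  # [1.0, 10.0]
```

The distributed version distributes the assignment step (a map over points) and the recomputation step (a reduce per center), which is exactly the MapReduce shape.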
Drools Fusion — A platform & programming libraries for complex event processing.
Basically we define rules where we specify how to correlate
an event with past event(s), and make a conclusion based on that. Based
[7] http://hortonworks.com/blog/apache-hadoop-yarn-background-and-an-overview/
[8] http://www.plexxi.com/wp-content/uploads/2013/03/Plexxi-Use-Case-Hadoop-March-2013.pdf
[9] http://www.ibm.com/developerworks/library/l-apachepigdataquery/
on that conclusion, an action is taken (which we also have to specify /
program). Obviously the rules have to be discovered beforehand (e.g. by applying machine-learning techniques over historical records).
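A hypothetical Python rendering of the "correlate with past events, conclude, then act" loop (Drools rules are actually written in its own DRL language; the rule, event shapes, and action here are entirely made up):

```python
# Rule sketch: if a subscriber suffers 3+ dropped calls within the
# last hour, trigger a retention action. Names are illustrative only.
RECENT_SECONDS = 3600

def apply_rule(history, event, actions):
    """history: list of past (subscriber, event_type, timestamp) tuples."""
    history.append(event)
    sub, etype, ts = event
    if etype == "dropped_call":
        recent = [e for e in history
                  if e[0] == sub and e[1] == "dropped_call"
                  and ts - e[2] <= RECENT_SECONDS]
        if len(recent) >= 3:
            actions.append(("offer_credit", sub))  # the action we programmed

history, actions = [], []
for ev in [("alice", "dropped_call", 100),
           ("alice", "dropped_call", 200),
           ("bob", "dropped_call", 250),
           ("alice", "dropped_call", 300)]:
    apply_rule(history, ev, actions)
print(actions)  # [('offer_credit', 'alice')]
```

A real rule engine keeps this declarative and handles event retraction, time windows, and many rules at once; the sketch only shows the correlate-then-act shape.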
MongoDB — One of many NoSQL databases available on the market. NoSQL is a
catch-all term for non-relational databases (which do not use SQL as the query
language). My understanding at this moment is that NoSQL databases
excel at scalability by sacrificing the ACID (Atomicity, Consistency, Isolation, and Durability) properties we can expect from an RDBMS.
MongoDB is a document-oriented database [10], meaning (according to Wikipedia): designed for storing, retrieving, and managing document-oriented
information, also known as semi-structured data. So, I guess,
if what we're capturing & storing looks like documents (e.g. forms, receipts, articles, etc.), we should probably consider MongoDB.
The articles / book I've read on MongoDB gave me the impression that data modeling for MongoDB does not put much emphasis on
normalizing our data (unlike data modeling for an RDBMS, where normalization is the cornerstone of achieving consistency). Document-oriented
databases seem to focus on speed of retrieval and horizontal scaling (a denormalized database is easier to spread across
machines).
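To illustrate the denormalized, document-oriented modeling style, here is a plain-Python sketch of a single "document" (the same nested shape MongoDB stores as BSON; collection and field names are invented, and this does not touch a real database):

```python
# A denormalized document: the customer and the line items are embedded
# directly, instead of living in separate, joined relational tables.
order_doc = {
    "_id": 1001,
    "customer": {"name": "Ana", "city": "Mexico City"},
    "items": [
        {"sku": "A1", "qty": 2, "price": 50.0},
        {"sku": "B7", "qty": 1, "price": 120.0},
    ],
}

# Retrieval needs no joins: everything arrives in one read.
total = sum(i["qty"] * i["price"] for i in order_doc["items"])
print(total)  # 220.0
```

The trade-off is the usual one: duplicated customer data across orders must be kept in sync by the application, which is exactly what normalization would have prevented.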
Others in the NoSQL landscape — Existing approaches in the NoSQL landscape can be categorized as:
Key/Value stores
Column-oriented stores
Document-oriented stores
Object databases
Graph databases
I believe our choice should be based on the structure (or lack thereof) of the
data we're dealing with, the kind of questions we want answered,
and obviously the application requirements.
A common cliché: a NoSQL database is not the right choice for a financial
system. It is a sweeping statement, as if a financial system were one big
single application. Maybe it should be rephrased as: NoSQL, for its lack of support for transactions, should not be used in applications where
we have to support use-cases like transferring money between accounts. For other applications, such as capturing tick data (deals, transfer events, stock prices, etc.), a NoSQL database has its place in a
financial system [11].
Finally, I'd like to refer to a book I might recommend later, which can help us better understand the nature of several popular NoSQL databases:
[10] http://en.wikipedia.org/wiki/Document-oriented_database
[11] http://www.10gen.com/post/45116404296/how-banks-use-mongodb-as-a-tick-database
Seven Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL Movement [12].
I think that's quite a solid toolset for big data analytics. I won't spend much more time on philosophizing; not until I get more insight from using those tools to crack some problems. What
problems? Well, one good place to start is Kaggle.com, where we can get some problems (accompanied
by datasets) and compete.
[12] http://www.amazon.com/Seven-Databases-Weeks-Movement-ebook/dp/B00AYQNR50/ref=tmm_kin_title_0?ie=UTF8&qid=1366862920&sr=8-1
Wandering
Among the many challenges that CSPs (Communications Service Providers) are facing these days, we will have a look at two of them, namely
customer retention and dynamic pricing. How did they end up there? Several factors:
1. Saturated market.
2. Number portability.
3. Inability of CSPs to come up with killer-app(s).
Factors 1 and 2 lead to churn prevention. Factor 3 leads to dynamic pricing. It's only
logical: with not much room left for expansion, maintaining the customer base becomes a basic
necessity for survival. With number portability, the situation is even worse for the CSP; a subscriber can
switch to another CSP without worrying about losing her current number.
On to point #3, no killer app. CSPs have been complaining about decreasing ARPU (Average Revenue
Per User). They can't just hike prices now (after sacrificing them for the sake of expansion), so they
look for revenue generation from VAS (Value-Added Services). The following screenshots show the VAS offered by Telcel.
I personally don't find those services valuable enough to pay for; I never use them. Take, from the
last picture, Identificador de Llamadas (description: know the number of whoever is calling you, and / or the name if that
person is registered in the contact list of your device). I'm not sure how that is a VAS. Are they
charging me for having that service? I certainly hope not! I can accept Banca Móvil as a
valuable service, but then again, banks (such as Bancomer) already offer an app for Android & iOS
to do just that, and I'm pretty sure it has nothing to do with Telcel, because the app uses internet technologies all the way (meaning: no income for Telcel from usage of the Bancomer-provided app).
I guess the contribution of VAS to CSP revenue is not significant enough to compensate for the decreasing
ARPU. Here is another common complaint by CSPs: they act only as a pipe, without benefiting enough from their investment in building the infrastructure. The one that benefits from the traffic generated by
video browsing on YouTube is... mainly Google. From photo uploads on Instagram... mainly Instagram.
The same goes for Facebook, Twitter, etc. [13]
Therefore, they have started experimenting with online-charging schemes that take traffic into account.
Something nicer than a simple bytes-to-cents function, in order to avoid a backlash from subscribers. To
me, being nice here means: at the least, there would be a notification whenever a browsing activity can incur additional charges, letting the user decide the next action (proceed with the extra payment, or cancel);
minimizing surprises (and complaints). More on that after the box below.
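The "notify before charging" behavior I have in mind could be sketched like this (purely hypothetical logic, invented quota and rate, not any CSP's actual scheme):

```python
def check_charge(used_mb, quota_mb, requested_mb, rate_per_mb=0.05):
    """Return a 'proceed' decision, or a notification asking the
    subscriber to confirm the extra charge before continuing."""
    overflow = used_mb + requested_mb - quota_mb
    if overflow <= 0:
        return {"action": "proceed", "extra_cost": 0.0}
    return {
        "action": "ask_user",  # proceed with extra payment, or cancel
        "extra_cost": round(overflow * rate_per_mb, 2),
    }

# 100 MB requested with only 50 MB of quota left -> ask first:
print(check_charge(used_mb=950, quota_mb=1000, requested_mb=100))
```

The design point is that the surprise is moved from the bill to the moment of use, where the subscriber can still cancel.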
A clarification about the contribution and trend of VAS in CSP revenue generation: this article is not
scientific; it's only a summary of what I gathered from various sources. I tried to find something to back up my hunch that VAS will die out. It turns out the situation differs from market to market, and
even from one segment to another within the same market.
In Telcel's case, for example, I don't know how they label me in their system :) but certainly not as low-end. I have a smartphone with an unlimited 3G plan. Obviously I use the internet for everything. I have no
idea what percentage of Telcel subscribers still use low-end cellphones, or what percentage of them actually use the VAS displayed above.
So I googled for VAS revenue decline and VAS revenue increase (and similar queries). I even
googled for Telcel earning reports (and found nothing). I found conflicting results. In India, more than 33% of the total revenue of Tata Docomo comes from VAS [14]. I'd like to think this result draws on the skills
and experience of NTT in VAS (43% of whose revenue reportedly came from VAS in 2010 [15]).
So NTT knows how to do it, very well. Part of the strategy, I guess, is providing incentives for
innovation [16].
[13] Although that can change if CSPs open their platforms, giving software developers access to build / enhance applications using the telecommunication services provided by the CSP, for example by providing an API like Twilio (http://www.twilio.com) or BlueVia (http://www.bluevia.com).
[14] http://www.wirelessduniya.com/2011/11/11/tata-docomo-vas-revenue-surpasses-33-of-its-entire-revenue/
[15] http://www.voicendata.com/voice-data/news/166226/data-services-revenue-exceed-usd330-bn-2013
[16] http://www.thehindubusinessline.com/todays-paper/tp-info-tech/ntt-docomo-laments-limited-valueadd-in-india/article1060116.ece
It's very similar to the revenue share developers get from the Android store & the Apple App Store [17].
Back on Track
Now I will steer the essay back to (big data) analytics. To recap, the situation provides the motives for the
following use-cases:
1. Churn prevention.
2. Dynamic pricing.
They are just two among a dozen use-cases, listed in the following diagram [18].
The line of thinking I use in this essay is the usual one: start from the goal. Specifically, the approach will be
in this order:
1. What questions need / want to be answered? (one end)
2. What data do we have? (the other end)
3. Techniques? (fill in the blank in between)
In the case of churn modeling: what are the questions normally / possibly asked by a CSP (that lead
them to building their churn model)? My lame attempt at mimicking Shakespeare: What are the
questions? That is the question. I googled around, found nothing concrete enough to my liking, so I
[17] http://www.techrepublic.com/blog/app-builder/app-store-fees-percentages-and-payouts-what-developers-need-to-know/1205
[18] http://www.intracom-svyaz.com/download/eng_pdf/bigdata/BigStreamer.pdf
have to resort to writing something that sounds logical to me. So, basically, based (mainly) on historical data,
we want to find out how certain actions, events, and / or conditions lead up to customer churn.
Actions can be a price change or a marketing campaign; events can be dropped calls or congestion;
conditions are basically attributes related to the customers.
All this information is gathered from sources like the OSS (Operations Support System) and the BSS
(Business Support System).
The diagram is copied from http://ossline.typepad.com/.a/6a0105359f53d8970c0147e06562af970b-pi
From the OSS we get network-related information (things like dropped calls are gathered from there).
From the BSS we get business-related information (customer details, billed amounts, and contracts are all in that area). For more information about OSS/BSS, this page is a good start:
http://www.quora.com/How-would-you-explain-OSS-and-BSS-to-a-layman
Now, let's get really practical. Oracle, for example, has an offering called the Oracle Communications Data Model [19] (part of its BSS offerings). As the name implies, it models the entities needed for the business
operation of a CSP. One of the components of OCDM is a data-mining model. The following screenshot of
Oracle's online documentation can give you an idea of what it is:
[19] http://www.oracle.com/us/products/applications/communications/industry-analytics/data-model/overview/index.html
From that we can glimpse the features relevant for churn modeling [20]. The general idea can be transferred
to situations where we don't use Oracle's product and have to build the model by hand. Here
are some of them:
Customer ID
Target column of the churn model
Future contract count in last 3 months
Subscription count in last 3 months
Suspension count in last 3 months
Contract count in last 3 months
Complaint count in last 3 months
Complaint call count to call center in last 3 months
Complaint call count to call center in the lifetime, in last 3 months
Contract left days in last 3 months
Account left value in last 3 months
Remaining contract sum in last 3 months
Debt total in last 3 months
Loyalty program balance in last 3 months
Total payment revenue in last 3 months
Monthly revenue (ARPU) in last 3 months
Contract ARPU amount in last 3 months
Party type code (individual or organizational) in last 3 months
Business legal status
Marital status for individual user
Household size
Job code
Nationality code
...
For how long the billing address has been in effect, in days
...
Of course, those attributes in OCDM are tied to the algorithms used in the product for generating
prediction models. Some of the dimensions might not be relevant for our specific case. As in any data
analysis activity, we have to apply some algorithm to select the relevant dimensions to base our analysis on (http://en.wikipedia.org/wiki/Feature_selection). But at least all those features in OCDM give us a
starting point.
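As a minimal illustration of feature selection, here is a sketch that ranks candidate features by the absolute Pearson correlation of each column with the churn label. This is just one of many filter methods, and the records below are fabricated:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# rows: (complaints_3m, arpu_3m, churned?) -- fabricated records
rows = [(5, 10, 1), (4, 25, 1), (0, 30, 0), (1, 28, 0), (6, 9, 1), (0, 12, 0)]
features = {"complaints_3m": [r[0] for r in rows],
            "arpu_3m": [r[1] for r in rows]}
label = [r[2] for r in rows]

# Rank features: strongest (absolute) correlation with churn first
ranked = sorted(features, key=lambda f: -abs(pearson(features[f], label)))
print(ranked)
```

In practice we would use a proper library and better criteria (mutual information, wrapper methods), but the idea is the same: score each OCDM-style feature against the churn target and keep the informative ones.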
From there we can work backward to the parts of the system where that data can be obtained (e.g. the billing and charging system, the CRM, etc.). Let's look at another product, BSCS iX (a billing and charging
solution now offered by Ericsson), just to get a glimpse of what is out there. The screenshot shows a
table where invoices are stored.
[20] http://docs.oracle.com/cd/E11882_01/doc.112/e15886/data_mining_cdm.htm#autoId4
Churn modeling itself is quite a feat. There's a good book on the matter, from Rob Mattison, given
away for free, and also available on Google Books:
http://books.google.com.mx/books/about/The_Telco_Churn_Management_Handbook.html?id=M_uuQx7vMngC&redir_esc=y . And of course, exercise.
Teradata, in collaboration with Duke University and several CSPs, held a contest on churn modeling in 2003. From the tournament's site at http://www.fuqua.duke.edu/centers/ccrm/ we can download
the problem description and dataset, among other materials.
Wrapping up, a few words about dynamic pricing. I was wondering how it is related to big data
analytics. It turns out dynamic pricing is a matter of yield management [21], which can benefit from
analytics, specifically forecasting techniques.
[21] http://en.wikipedia.org/wiki/Yield_management
Here are some scenarios of dynamic pricing, which is basically an attempt at offering bandwidth to the
customer at an attractive price, especially when network conditions allow it. This has
something to do with giving flexibility to subscribers: not forcing them to pick between subscribing
to unlimited capacity (at a higher cost) or staying with limited bandwidth (barring them from using more services when needed). All in all, this can draw more people into subscribing to, or staying with,
the CSP. These diagrams are copied from a whitepaper by Tango Telecom, Beyond Policy, a New Era
in Real-Time Charging [22].
This is an example of the case for real-time analytics over a stream of data coming in from the network
elements (OSS), fed into a forecasting model (built from historical data), to figure out, for example, whether
traffic is low (so that offering a turbo boost to one customer wouldn't affect the rest), etc. For more use-cases in this area, the following whitepaper from Telcordia is a good source: Applying Yield
Management in the Mobile Broadband Market [23].
[22] http://www.tango.ie/dynamic-marketplace.html
[23] http://bit.ly/ZXdXPV (www.telecomtv.com/DocSend.aspx?fileid...9f55...yield-management...)