7/30/2019 Write Up Raka Bigdata Analytic Telco
Warming Up
This document is an attempt at summarizing the technologies that play a part in the (big data) analytics
ecosystem. I put the words big data between parentheses because I will be focusing on the analytics aspect.
Big data is a relatively new (industry) term, but not necessarily a new concept. So if we limit our understanding of big data only to the volume, maybe we're missing the bigger picture (because big
is relative; our civilization has been dealing with an increasingly bigger volume of data, and what was big 10
years ago is quite likely not so big today). Just a heads up: from this point on, the phrase big data sometimes refers to the techniques & tools for dealing with big data.
People coined the Vs (Volume, Velocity, Variety) as a way to characterize big data. I think talking about Variety and Velocity is a good / easier way to start a discussion about big data (especially with
people coming from a strong database administration / data warehousing background). At times I need to
convince (or doubt) myself about anything, including big data, and I found it easier to convince myself
about big data if I start from Variety and Velocity.
Why Variety? Because people (as users of technologies) can easily appreciate the fact that data is
coming from more and more varied sources, thanks to cheaper sensors, mobility, and hyperconnectivity. Everybody with a cellphone generates data now. Every single activity they do, on various online
services, through their cellphone, generates the so-called data exhaust. Consequently, we are dealing
with a variety of data formats, from structured to unstructured [1].
At this point I'm skeptical, wondering: what unique role does big data play in this situation? We already
have and perform Extract Transform Load (ETL; common to data warehousing practitioners) to address
that challenge. Right, so maybe we should look at the other aspect to support this argument: velocity.
Why? Because the ETL process implies non-realtime analytics; the data is not processed just-in-time; it has to go through the transformation stage before it is loaded into the data warehouse, where the data is eventually
picked up to be analyzed.
Actually there are several articles about data warehousing in the advent of big data. I was trying to
understand: are they competing, or do they complement each other? I admit I haven't read through those articles, so I'll just put their links here for now; one from O'Reilly [2], and the other from Teradata [3].
People are talking about soft real-time analysis of data (or events). There's a phrase for that: Complex
Event Processing (CEP). A CEP platform basically enables us to observe a moving window of events (taken from a continuous stream of data) and do time-series analysis, such as pattern
matching, on that window [4]. One open-source product for CEP that I know is JBoss Drools Fusion [5].
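To make the moving-window idea concrete, here is a minimal sketch in plain Python (not Drools; the function name, events, and thresholds are all my own invention) of watching a trailing time window over an event stream and flagging a simple pattern:

```python
from collections import deque

def detect_burst(events, window_seconds=60, threshold=3):
    """Flag timestamps where more than `threshold` events fall
    inside the trailing `window_seconds` moving window."""
    window = deque()  # timestamps currently inside the moving window
    alerts = []
    for ts in events:  # events = sorted timestamps, in seconds
        window.append(ts)
        # evict events that have slid out of the window
        while window and ts - window[0] > window_seconds:
            window.popleft()
        if len(window) > threshold:
            alerts.append(ts)
    return alerts

# Four events within one minute trigger an alert at t=30;
# the lone event at t=300 does not:
print(detect_burst([0, 10, 20, 30, 300], window_seconds=60, threshold=3))
```

A real CEP engine adds much more (declarative rules, temporal operators, many concurrent windows), but the observe-a-window-then-match loop is the core of it.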
There's a strong link between Velocity and Variety, provided by Value [6]. Fusing data from a broad range
of sources (variety) opens up the risk of having low-value data: data with a low signal-to-noise ratio.
[1] http://www.finextra.com/community/FullBlog.aspx?blogid=6129
[2] http://strata.oreilly.com/2011/01/data-warehouse-big-data.html
[3] http://www.teradata.com/white-papers/Hadoop-and-the-Data-Warehouse-When-to-Use-Which/?type=WP
[4] http://bit.ly/xrLyV1
[5] http://www.jboss.org/drools/drools-fusion.html
[6] http://www.finextra.com/community/fullblog.aspx?blogid=6222
Example: Facebook / Twitter. The need for speed here is to find the high-value data among piles of low-value data.
At this point maybe I should stop for a while trying to convince myself about big data. The practical insights I gather along the way will either clear things up or leave me with better / bigger / more
fundamental questions. Either outcome is beneficial. For now I'll just take it as a fact that big data is
important, it's here, and it's part of the continuum of tools we need to employ to bring up intelligence.
I'll switch back to analytics, but before that I would like to do a round-up of the big data tools I've found so
far. The following table is a summary of the tools I'm currently learning to use. The descriptions are in my own words, based on my current understanding, so the list is not comprehensive and may contain inaccuracies.
Hadoop — Provides a programming framework that implements MapReduce, and a
platform for its execution.
It also provides a distributed filesystem smart enough to ensure minimum
data motion [7] during the parallel processing of a (big) dataset spread across a cluster of machines.
It splits the dataset so that the task assigned a subset of the data
executes on the machine where that subset is located (locally or
nearby); this is the Map phase. It then coordinates the aggregation of the computation results collected from the task nodes (during the Reduce phase).
It is still not clear to me how this process affinity is achieved [8], but it looks to me like it comes from the configuration of our Hadoop cluster, specifically
optimized for the application we're working on (knowing the nature of the
data distribution, the data-processing pipeline, etc.).
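The Map / shuffle / Reduce flow described above can be illustrated with a toy, single-machine word count in Python. Hadoop would run the same logic as distributed tasks written against its own API; this sketch only shows the dataflow, with invented helper names:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(line):
    # emit (key, value) pairs, one per word
    return [(word, 1) for word in line.split()]

def reduce_phase(key, values):
    # aggregate all values that share a key
    return (key, sum(values))

lines = ["big data is big", "data is data"]

# Map: run the mapper over every input split
mapped = [pair for line in lines for pair in map_phase(line)]
# Shuffle: group pairs by key (Hadoop does this across the network)
mapped.sort(key=itemgetter(0))
grouped = groupby(mapped, key=itemgetter(0))
# Reduce: one call per distinct key
result = dict(reduce_phase(k, [v for _, v in pairs]) for k, pairs in grouped)
print(result)  # {'big': 2, 'data': 3, 'is': 2}
```

The point of Hadoop is that the map calls run on the machines holding the data, and the shuffle / reduce coordination happens for us.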
Apache Pig — While Hadoop provides the programming library for implementing the map
and reduce tasks, the Apache Pig project raises the level of abstraction, allowing people to specify those operations in an SQL-like syntax. It brings
productivity / efficiency to a project. Yahoo is said to have 40% to
60% of its Hadoop workloads implemented as Pig scripts [9].
Apache Mahout — A collection of machine-learning algorithms (e.g. clustering, naive Bayes, covering, linear modeling, etc.), ready to use, for execution over a
Hadoop cluster.
A big relief: thanks to Mahout we can save the project time otherwise spent making those standard machine-learning algorithms parallelizable, using the
MapReduce approach, specifically for execution on a Hadoop cluster.
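As an illustration of the kind of algorithm Mahout parallelizes for us, here is a deliberately tiny, sequential k-means sketch in Python (1-D, integer data to keep it exact; Mahout's value is precisely that we do not have to rewrite this ourselves as MapReduce jobs):

```python
def kmeans_1d(points, centers, iterations=10):
    """Toy 1-D k-means: assign each point to its nearest center,
    then move each center to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = {c: [] for c in centers}
        for p in points:
            nearest = min(centers, key=lambda c: abs(c - p))
            clusters[nearest].append(p)
        # recompute centers; keep a center unchanged if it got no points
        centers = [sum(ps) / len(ps) if ps else c
                   for c, ps in clusters.items()]
    return sorted(centers)

# Two obvious groups, around 1 and around 10:
print(kmeans_1d([1, 2, 0, 9, 10, 11], centers=[0.0, 5.0]))  # [1.0, 10.0]
```

The distributed version distributes the assignment step (a map over points) and the recomputation step (a reduce per center), which is exactly the MapReduce shape.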
Drools Fusion — A platform & programming libraries for complex event processing.
Basically we define rules where we specify how to correlate
an event with past event(s), and make a conclusion based on that. Based
[7] http://hortonworks.com/blog/apache-hadoop-yarn-background-and-an-overview/
[8] http://www.plexxi.com/wp-content/uploads/2013/03/Plexxi-Use-Case-Hadoop-March-2013.pdf
[9] http://www.ibm.com/developerworks/library/l-apachepigdataquery/
on that conclusion, an action is taken (which we also have to specify /
program). Obviously the rules have to be discovered beforehand (e.g. by applying machine-learning techniques over historical records).
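A hypothetical Python rendering of the "correlate with past events, conclude, then act" loop (Drools rules are actually written in its own DRL language; the rule, event shapes, and action here are entirely made up):

```python
# Rule sketch: if a subscriber suffers 3+ dropped calls within the
# last hour, trigger a retention action. Names are illustrative only.
RECENT_SECONDS = 3600

def apply_rule(history, event, actions):
    """history: list of past (subscriber, event_type, timestamp) tuples."""
    history.append(event)
    sub, etype, ts = event
    if etype == "dropped_call":
        recent = [e for e in history
                  if e[0] == sub and e[1] == "dropped_call"
                  and ts - e[2] <= RECENT_SECONDS]
        if len(recent) >= 3:
            actions.append(("offer_credit", sub))  # the action we programmed

history, actions = [], []
for ev in [("alice", "dropped_call", 100),
           ("alice", "dropped_call", 200),
           ("bob", "dropped_call", 250),
           ("alice", "dropped_call", 300)]:
    apply_rule(history, ev, actions)
print(actions)  # [('offer_credit', 'alice')]
```

A real rule engine keeps this declarative and handles event retraction, time windows, and many rules at once; the sketch only shows the correlate-then-act shape.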
MongoDB — One of many NoSQL databases available on the market. NoSQL is a
catch-all term for non-relational databases (which do not use SQL as the query
language). My understanding at this moment is that NoSQL databases
excel at scalability by sacrificing the ACID (Atomicity, Consistency, Isolation, and Durability) properties we can expect from an RDBMS.
MongoDB is a document-oriented database [10], meaning (according to Wikipedia): designed for storing, retrieving, and managing document-oriented
information, also known as semi-structured data. So, I guess,
if what we're capturing & storing looks like documents (e.g. forms, receipts, articles, etc.), we should probably consider MongoDB.
The articles / book I've read on MongoDB gave me the impression that data modeling for MongoDB does not put much emphasis on
normalizing our data (unlike data modeling for an RDBMS, where normalization is the cornerstone of achieving consistency). Document-oriented
databases seem to focus on speed of retrieval and horizontal scaling (a denormalized database is easier to spread across
machines).
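To illustrate the denormalized, document-oriented modeling style, here is a plain-Python sketch of a single "document" (the same nested shape MongoDB stores as BSON; collection and field names are invented, and this does not touch a real database):

```python
# A denormalized document: the customer and the line items are embedded
# directly, instead of living in separate, joined relational tables.
order_doc = {
    "_id": 1001,
    "customer": {"name": "Ana", "city": "Mexico City"},
    "items": [
        {"sku": "A1", "qty": 2, "price": 50.0},
        {"sku": "B7", "qty": 1, "price": 120.0},
    ],
}

# Retrieval needs no joins: everything arrives in one read.
total = sum(i["qty"] * i["price"] for i in order_doc["items"])
print(total)  # 220.0
```

The trade-off is the usual one: duplicated customer data across orders must be kept in sync by the application, which is exactly what normalization would have prevented.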
Others in the NoSQL landscape — Existing approaches in the NoSQL landscape can be categorized as:
Key/Value stores
Column-oriented stores
Document-oriented stores
Object databases
Graph databases
I believe our choice should be based on the structure (or lack thereof) of the
data we're dealing with, the kind of questions we want answered,
and obviously the application requirements.
A common cliché: a NoSQL database is not the right choice for a financial
system. It is a sweeping statement, as if a financial system were one big
single application. Maybe it should be rephrased as: NoSQL, for its lack of support for transactions, should not be used in applications where
we have to support use-cases like transferring money between accounts. For other applications, such as capturing tick data (deals, transfer events, stock prices, etc.), a NoSQL database has its place in a
financial system [11].
Finally, I'd like to refer to a book I might recommend later, which can help us better understand the nature of several popular NoSQL databases:
[10] http://en.wikipedia.org/wiki/Document-oriented_database
[11] http://www.10gen.com/post/45116404296/how-banks-use-mongodb-as-a-tick-database
Seven Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL Movement [12].
I think that's quite a solid toolset for big data analytics. I won't spend much more time on philosophizing; not until I get more insight from using those tools to crack some problems. What
problems? Well, one good place to start is Kaggle.com, where we can get some problems (accompanied
by datasets) and compete.
[12] http://www.amazon.com/Seven-Databases-Weeks-Movement-ebook/dp/B00AYQNR50/ref=tmm_kin_title_0?ie=UTF8&qid=1366862920&sr=8-1
Wandering
Among the many challenges that CSPs (Communications Service Providers) are facing these days, we will have a look at two of them, namely
customer retention and dynamic pricing. How did they end up there? Several factors:
1. Saturated market.
2. Number portability.
3. Inability of CSPs to come up with killer-app(s).
Factors 1 and 2 lead to churn prevention. Factor 3 leads to dynamic pricing. It's only
logical: with not much room left for expansion, maintaining the customer base becomes a basic
necessity for survival. With number portability, the situation is even worse for the CSP; a subscriber can
switch to another CSP without worrying about losing her current number.
On to point #3, no killer app. CSPs have been complaining about decreasing ARPU (Average Revenue
Per User). They can't just hike prices now (after sacrificing them for the sake of expansion), so they
look for revenue generation from VAS (Value-Added Services). The following screenshots show the VAS offered by Telcel.
I personally don't find those services valuable enough to pay for; I never use them. Take, from the
last picture, Identificador de Llamadas (description: know the number of whoever is calling you, and / or the name if that
person is registered in the contact list of your device). I'm not sure how that is a VAS. Are they
charging me for having that service? I certainly hope not! I can accept Banca Móvil as a
valuable service, but then again, banks (such as Bancomer) already offer an app for Android & iOS
to do just that, and I'm pretty sure it has nothing to do with Telcel, because the app uses internet technologies all the way (meaning: no income for Telcel from usage of the Bancomer-provided app).
I guess the contribution of VAS to CSP revenue is not significant enough to compensate for the decreasing
ARPU. Here is another common complaint by CSPs: they act only as a pipe, without benefiting enough from their investment in building the infrastructure. The one that benefits from the traffic generated by
video browsing on YouTube is... mainly Google. From photo uploads on Instagram... mainly Instagram.
The same goes for Facebook, Twitter, etc. [13]
Therefore, they have started experimenting with online-charging schemes that take traffic into account.
Something nicer than a simple bytes-to-cents function, in order to avoid a backlash from subscribers. To
me, being nice here means: at the least, there would be a notification whenever a browsing activity can incur additional charges, letting the user decide the next action (proceed with the extra payment, or cancel);
minimizing surprises (and complaints). More on that after the box below.
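The "notify before charging" behavior I have in mind could be sketched like this (purely hypothetical logic, invented quota and rate, not any CSP's actual scheme):

```python
def check_charge(used_mb, quota_mb, requested_mb, rate_per_mb=0.05):
    """Return a 'proceed' decision, or a notification asking the
    subscriber to confirm the extra charge before continuing."""
    overflow = used_mb + requested_mb - quota_mb
    if overflow <= 0:
        return {"action": "proceed", "extra_cost": 0.0}
    return {
        "action": "ask_user",  # proceed with extra payment, or cancel
        "extra_cost": round(overflow * rate_per_mb, 2),
    }

# 100 MB requested with only 50 MB of quota left -> ask first:
print(check_charge(used_mb=950, quota_mb=1000, requested_mb=100))
```

The design point is that the surprise is moved from the bill to the moment of use, where the subscriber can still cancel.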
A clarification about the contribution and trend of VAS in CSP revenue generation: this article is not
scientific; it's only a summary of what I gathered from various sources. I tried to find something to back up my hunch that VAS will die out. It turns out the situation differs from market to market, and
even from one segment to another within the same market.
In Telcel's case, for example, I don't know how they label me in their system :) but certainly not as low-end. I have a smartphone with an unlimited 3G plan. Obviously I use the internet for everything. I have no
idea what percentage of Telcel subscribers still use low-end cellphones, or what percentage of them actually use the VAS displayed above.
So I googled for VAS revenue decline and VAS revenue increase (and similar queries). I even
googled for Telcel earning reports (and found nothing). I found conflicting results. In India, more than 33% of the total revenue of Tata Docomo comes from VAS [14]. I'd like to think this result draws on the skills
and experience of NTT in VAS (43% of whose revenue reportedly came from VAS in 2010 [15]).
So NTT knows how to do it, very well. Part of the strategy, I guess, is providing incentives for
innovation [16].
[13] Although that can change if CSPs open their platforms, giving software developers access to build / enhance applications using the telecommunication services provided by the CSP, for example by providing an API like Twilio (http://www.twilio.com) or BlueVia (http://www.bluevia.com).
[14] http://www.wirelessduniya.com/2011/11/11/tata-docomo-vas-revenue-surpasses-33-of-its-entire-revenue/
[15] http://www.voicendata.com/voice-data/news/166226/data-services-revenue-exceed-usd330-bn-2013
[16] http://www.thehindubusinessline.com/todays-paper/tp-info-tech/ntt-docomo-laments-limited-valueadd-in-india/article1060116.ece
It's very similar to the revenue share developers get from the Android store & the Apple App Store [17].
Back on Track
Now I will steer the essay back to (big data) analytics. To recap, the situation provides the motives for the
following use-cases:
1. Churn prevention.
2. Dynamic pricing.
They are just two among a dozen use-cases, listed in the following diagram [18].
The line of thinking I use in this essay is the usual one: start from the goal. Specifically, the approach will be
in this order:
1. What questions need / want to be answered? (one end)
2. What data do we have? (the other end)
3. Techniques? (fill in the blank in between)
In the case of churn modeling: what are the questions normally / possibly asked by a CSP (that lead
them to building their churn model)? My lame attempt at mimicking Shakespeare: What are the
questions? That is the question. I googled around, found nothing concrete enough to my liking, so I
[17] http://www.techrepublic.com/blog/app-builder/app-store-fees-percentages-and-payouts-what-developers-need-to-know/1205
[18] http://www.intracom-svyaz.com/download/eng_pdf/bigdata/BigStreamer.pdf
have to resort to writing something that sounds logical to me. So, basically, based (mainly) on historical data,
we want to find out how certain actions, events, and / or conditions lead up to customer churn.
Actions can be a price change or a marketing campaign; events can be dropped calls or congestion;
conditions are basically attributes related to the customers.
All this information is gathered from sources like the OSS (Operations Support System) and the BSS
(Business Support System).
The diagram is copied from http://ossline.typepad.com/.a/6a0105359f53d8970c0147e06562af970b-pi
From the OSS we get network-related information (things like dropped calls are gathered from there).
From the BSS we get business-related information (customer details, billed amounts, and contracts are all in that area). For more information about OSS/BSS, this page is a good start:
http://www.quora.com/How-would-you-explain-OSS-and-BSS-to-a-layman
Now, let's get really practical. Oracle, for example, has an offering called the Oracle Communications Data Model [19] (part of its BSS offerings). As the name implies, it models the entities needed for the business
operation of a CSP. One of the components of OCDM is a data-mining model. The following screenshot of
Oracle's online documentation can give you an idea of what it is:
[19] http://www.oracle.com/us/products/applications/communications/industry-analytics/data-model/overview/index.html
From that we can glimpse the features relevant for churn modeling [20]. The general idea can be transferred
to situations where we don't use Oracle's product and have to build the model by hand. Here
are some of them:
Customer ID
Target column of the churn model
Future contract count in last 3 months
Subscription count in last 3 months
Suspension count in last 3 months
Contract count in last 3 months
Complaint count in last 3 months
Complaint call count to call center in last 3 months
Complaint call count to call center in the lifetime, in last 3 months
Contract left days in last 3 months
Account left value in last 3 months
Remaining contract sum in last 3 months
Debt total in last 3 months
Loyalty program balance in last 3 months
Total payment revenue in last 3 months
Monthly revenue (ARPU) in last 3 months
Contract ARPU amount in last 3 months
Party type code (individual or organizational) in last 3 months
Business legal status
Marital status for individual user
Household size
Job code
Nationality code
...
For how long the billing address has been in effect, in days
...
Of course, those attributes in OCDM are tied to the algorithms used in the product for generating
prediction models. Some of the dimensions might not be relevant for our specific case. As in any data
analysis activity, we have to apply some algorithm to select the relevant dimensions to base our analysis on (http://en.wikipedia.org/wiki/Feature_selection). But at least all those features in OCDM give us a
starting point.
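As a minimal illustration of feature selection, here is a sketch that ranks candidate features by the absolute Pearson correlation of each column with the churn label. This is just one of many filter methods, and the records below are fabricated:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# rows: (complaints_3m, arpu_3m, churned?) -- fabricated records
rows = [(5, 10, 1), (4, 25, 1), (0, 30, 0), (1, 28, 0), (6, 9, 1), (0, 12, 0)]
features = {"complaints_3m": [r[0] for r in rows],
            "arpu_3m": [r[1] for r in rows]}
label = [r[2] for r in rows]

# Rank features: strongest (absolute) correlation with churn first
ranked = sorted(features, key=lambda f: -abs(pearson(features[f], label)))
print(ranked)
```

In practice we would use a proper library and better criteria (mutual information, wrapper methods), but the idea is the same: score each OCDM-style feature against the churn target and keep the informative ones.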
From there we can work backward to the parts of the system where that data can be obtained (e.g. the billing and charging system, the CRM, etc.). Let's look at another product, BSCS iX (a billing and charging
solution now offered by Ericsson), just to get a glimpse of what is out there. The screenshot shows a
table where invoices are stored.
[20] http://docs.oracle.com/cd/E11882_01/doc.112/e15886/data_mining_cdm.htm#autoId4
Churn modeling itself is quite a feat. There's a good book on the matter, from Rob Mattison, given
away for free, and also available on Google Books:
http://books.google.com.mx/books/about/The_Telco_Churn_Management_Handbook.html?id=M_uuQx7vMngC&redir_esc=y . And of course, exercise.
Teradata, in collaboration with Duke University and several CSPs, held a contest on churn modeling in 2003. From the tournament's site at http://www.fuqua.duke.edu/centers/ccrm/ we can download
the problem description and dataset, among other materials.
Wrapping up, a few words about dynamic pricing. I was wondering how it is related to big data
analytics. It turns out dynamic pricing is a matter of yield management [21], which can benefit from
analytics, specifically forecasting techniques.
[21] http://en.wikipedia.org/wiki/Yield_management
Here are some scenarios of dynamic pricing, which is basically an attempt at offering bandwidth to the
customer at an attractive price, especially when network conditions allow it. This has
something to do with giving flexibility to subscribers: not forcing them to pick between subscribing
to unlimited capacity (at a higher cost) or staying with limited bandwidth (barring them from using more services when needed). All in all, this can draw more people into subscribing to, or staying with,
the CSP. These diagrams are copied from a whitepaper by Tango Telecom, Beyond Policy, a New Era
in Real-Time Charging [22].
This is an example of the case for real-time analytics over a stream of data coming in from the network
elements (OSS), fed into a forecasting model (built from historical data), to figure out, for example, whether
traffic is low (so that offering a turbo boost to one customer wouldn't affect the rest), etc. For more use-cases in this area, the following whitepaper from Telcordia is a good source: Applying Yield
Management in the Mobile Broadband Market [23].
[22] http://www.tango.ie/dynamic-marketplace.html
[23] http://bit.ly/ZXdXPV (www.telecomtv.com/DocSend.aspx?fileid...9f55...yield-management...)