
So you want to data science.

Adam Muise

Chief Architect

Who am I?

•  Chief Architect at Paytm Labs

•  Paytm Labs is a data-driven lab founded to take on the really hard problems of scaling up Fraud, Recommendation, Rating, and Platform at Paytm

•  Paytm is an Indian Payments/Wallet company. It has 50 Million wallets already, adds almost 1 Million wallets a day, and will have more than 100 Million customers by the end of the year. Alibaba recently invested in us; perhaps you heard.

•  I’ve also worked with Data Science teams at IBM, Cloudera, and Hortonworks

Paytm

This presentation is short so that you can ask a lot of questions.

Wisdom Nuggets…

The Leadership

If you are creating a data science team, chances are that you are not a Data Scientist. Data Scientists are best applied to the problems of data, not management.

The Leadership

Your boss (should ask): Why do you even need data science to solve this problem?

You (should) answer: The problem is too complex to solve without machine learning. Here’s why.

You (should not) answer: Big data and data science are on the roadmap.

The Leadership

You have your budget for a team of 2 data scientists. That’s a good start, right? Get ready to ask for more money.

The Leadership

You need to ask your management for:

-  Budget for 2 data engineers for every data scientist you hire

-  Access to the data lake; failing that, access to the data warehouse

-  DevOps

-  Time to gain domain expertise before producing results

-  Exec-level cooperation from the teams who own the data and tools you need and those who understand the data you need

-  A budget for servers/tools/additional storage based on a TCO calculation you already did (right?)

-  A dedicated place for your team to work

The Leadership

Got DataLake?! No? Depending on your problem space, chances are you are building one, unless you can pull what you need from an existing Data Warehouse.

The Leadership

You didn’t do a TCO (Total Cost of Ownership) calculation? Ok, here you go:

1.  Internal/external cloud instances that can run Spark/Hadoop/etc.

2.  Storage costs (S3, internal, etc.) for your analytical data sets

3.  Lead time to get started, something like 1-2 months depending on the complexity of the problem (Fraud might take 3 months whereas Recommendation Engines might be 1 month)

4.  Training time and costs for tools you didn’t know you needed
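As a sanity check, the line items above can be summed into a rough range. A minimal sketch in Python, using only the illustrative figures from the cost table in this deck (these are example ranges, not real quotes):

```python
# Rough monthly TCO sanity check. The (low, high) ranges below are the
# illustrative numbers from the cost table in this deck -- not quotes.
MONTHLY_COSTS = {
    "aws_instances": (15_000, 45_000),  # 24-32 medium/large instances
    "s3_storage": (12_000, 57_000),     # 400TB to 2PB
}
ONE_TIME_COSTS = {
    "training": (5_000, 15_000),        # courses, maybe a conference trip
}

def tco_range(months):
    """Return a (low, high) total over `months` months, excluding
    salaries (plug in your own operating costs for those)."""
    low = sum(lo for lo, _ in MONTHLY_COSTS.values()) * months
    high = sum(hi for _, hi in MONTHLY_COSTS.values()) * months
    low += sum(lo for lo, _ in ONE_TIME_COSTS.values())
    high += sum(hi for _, hi in ONE_TIME_COSTS.values())
    return low, high
```

Even a back-of-the-envelope calculation like this makes the budget conversation with your management concrete.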

What: 24-32 medium to large instances on AWS each month
How much: $15,000 to $45,000 per month

What: Storage costs for S3 (400TB to 2PB)
How much: $12,000 to $57,000 per month

What: Salaries & Operating Expenses
How much: 2 x $xxxxx, your operating costs including salaries for yourself and 3 people

What: Training (courses for tools and perhaps a conference trip for hiring)
How much: $5,000 to $15,000

The Team

So you have permission, resources, and a corner in an office. How do you start?

The Team

Assemble your team in the following order:

1. Get a Data Engineer with a good analytical mind. Have them beg, borrow, or steal whatever data sets might be applicable to the problem. Without data, no data-sciencey stuff can happen.

The Team

Assemble your team in the following order:

2. While you are getting your data, hire or recruit an internal Data Scientist. Easy, right?

!!!!!! WARNING !!!!!!

Data Science is not a mystical art form handed down by monks and taught over 50 years. You just need:

•  a good math background

•  academic or job experience with machine learning

•  business context

•  an understanding of how to code

That can be easier to find than you think.


That being said, everybody seems to think they are a data scientist these days, from the guy who writes the monthly SQL reports to your office manager who is a whiz at Excel.

The Team

Assemble your team in the following order:

3. More Data Engineers.

4. DevOps support (if you don’t have a common resource pool to draw from).

The Team

Keep your data science team innovative, keep them away from bureaucracy, keep them cool. Don’t discount the cool factor.

They are supposed to solve hard problems, not deal with everyday business issues. To stay objective they need to be decoupled from the emergencies and the mediocre.

If that sounds elitist, then I challenge you to create a scaling fraud detection system with your existing data warehouse team. No really, try it.

The Team

What will they do?

The Data Engineer

Your data engineer is the heart and soul of your data science team and will get almost none of the credit in the end. They will help build your data pipeline, perform data transformations, optimize training, automate validation, and take the results into production.

If you are lucky, you have Data Scientists who respect this role and will often take some of these duties on to help ensure their vision reaches production. Instead of relying on luck, you can hire this way too.

The Team

What will they do?

The Data Scientist

Your Data Scientist will explore the data, create models, validate, explore the data again, go in a different direction, clarify requirements, model again, validate, retract, and then produce a good model. The process is not deterministic and is a mix of research and implementation. A good Data Scientist will be able to code in the tools you intend to implement production code with, something like Scala on Spark.

Your Data Scientist will have, or at least learn, the business context required to solve your problem. They will need to communicate with business experts to validate that their solutions actually solve the problem, or to help drive them in a new direction.

The Team

What will they do?

DevOps

Developer Operations will help build that data pipeline for you. If you have to build a Data Lake from scratch, you are going to really rely on these folks. They should be elite, understand distributed systems, ride a motorcycle, and be someone you feel uncomfortable standing next to in an elevator.

Managing The Team

If your Data Scientists are not stellar coders, put a Data Engineer in their grill and make them produce code. They can’t contribute if they can’t get their hands dirty. Data Science is not an ivory tower.

Managing The Team

Introduce your team to the business team that knows the data or business processes better than anyone else. Often that’s not the CIO-favored DWH team, but rather the Customer Service Representatives*

*This was especially true in fighting Fraud.

Managing The Team

Ways to make your team hate you:

Data Scientists:

•  Don’t provide the data they need to create their models

•  Suggest that they create their own training data, from scratch

•  Provide ambiguous goals for the accuracy and precision of their models

•  Tell them to mine the data / don’t have a plan

•  Don’t respect the time it takes to create a model

Data Engineers:

•  Let the Data Scientists use whatever tool they want without respect to parallel processing or implementation

•  Have no management control over your data sources

DevOps:

•  Use anything by IBM, Microsoft, SAS, or Oracle in your pipeline

•  Let the Data Engineers decide on the infrastructure


The Work

Start out with a clear goal that is unambiguous:

“I want to detect and prevent 50% of Fraud in my payments system.”

“I want to increase conversion rates in my eCommerce platform by 20%.”

The Work

Get as much of the raw data as you can, as soon as you can and as fast as you can. Don’t have a Data Lake? Get your Hadoop on ASAP.

The Work

Give the team time to research the data, gain context, and become experts.

The Work

Data without context == a complete lack of direction in research.

Research needs constant checks to ensure that the primary problem is being solved.

The Work

Data Science Development != Engineering Software Development.

You will have to separate your research process from the engineering process that delivers the models to production.

The Work

Data Engineering is an ongoing process. You will need to maintain pipelines, adapt to schema changes, implement data cleansing, maintain metadata in the data lake, optimize processing workflows, etc. You will never outgrow the need for your Data Engineers.

The Architecture

The Architecture

Start with the cloud. You need to get your infrastructure up as quickly as possible. At the beginning, this is cheaper than you think compared to the time and startup costs of creating an on-premise data lake, even/especially if you have an existing IT Team*

*If you are a big corporation, your IT team is often the biggest barrier to your success in creating an independent Data Science team.

The Architecture

We had to build a data lake. It looks like this:

The Architecture

Lambda Architecture

Batch Ingest:

•  SQOOP from MySQL instances

•  Keep as much in HDFS as you can; offload to S3 for DR/Archive and when you have colder data

•  Spark and other Hadoop processing tools can run natively over S3 data, so it’s never really gone (don’t use Glacier in a processing workflow)

Realtime Ingest:

•  Mypipe to get events from binary log data and push into Kafka topics (under construction)

•  VoltDB connector to get events from the DB and push to Kafka (under construction)

•  Streaming data piped through Kafka

•  All realtime data processed with Spark Streaming or Storm from Kafka
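The split above is the classic Lambda pattern: every event lands in an immutable master data set (batch layer) and in a fast incremental view (speed layer), and queries merge the two. A toy sketch, with plain Python containers standing in for Kafka, HDFS, and Spark:

```python
# Toy sketch of the Lambda pattern: plain Python lists and dicts stand
# in for Kafka, HDFS/S3, and Spark. Illustrative only.
from collections import defaultdict

class LambdaPipeline:
    def __init__(self):
        self.master_data = []                  # stands in for HDFS/S3
        self.batch_view = defaultdict(int)     # recomputed by batch jobs
        self.realtime_view = defaultdict(int)  # incremented per event

    def ingest(self, user, amount):
        """Every event lands in the master data set AND the speed layer."""
        self.master_data.append((user, amount))
        self.realtime_view[user] += amount

    def run_batch(self):
        """Periodic batch job: recompute the view from all raw data,
        then discard the speed-layer state it now covers."""
        self.batch_view = defaultdict(int)
        for user, amount in self.master_data:
            self.batch_view[user] += amount
        self.realtime_view.clear()

    def query(self, user):
        """Serving layer merges the batch view with the realtime delta."""
        return self.batch_view[user] + self.realtime_view[user]
```

The payoff is that queries give the same answer before and after a batch run; the batch layer can always be recomputed from the raw master data if a bug creeps into a view.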

The Architecture

As you grow, your processing and storage needs will likely mature. Consider moving to an on-premise solution for your Hadoop/processing architecture. You can always archive to S3 if you need DR and don’t have the appetite to create two clusters.

The Architecture

With an on-premise architecture, you can interact with existing on-premise production systems quickly. For us, that means real-time Fraud detection and action. You may find yourself maintaining both in the long run.

What Actual Data Science looks like…

armando@paytm.com - @jabenitez

Supervised learning vs Anomaly detection

Anomaly detection:

๏  Very small number of positive examples

๏  Large number of negative examples

๏  Many different “types” of anomalies. Hard for any algorithm to learn from positive examples what the anomalies look like; future anomalies may look nothing like any of the anomalous examples we’ve seen so far.

Supervised learning:

๏  Ideally large number of positive and negative examples

๏  Enough positive examples for the algorithm to get a sense of what positive examples are like; future positive examples likely to be similar to ones in the training set

* Anomaly Detection - Andrew Ng - Coursera ML Course


What approach to follow?

๏  Not so good: One model to rule them all

๏  Better:

๏  Many models competing against each other

๏  100s or 1000s of rules running in parallel

๏  Know thy customer


Feature Selection

๏  Want p(x) large (small) for normal examples, p(x) small (large) for anomalous examples

๏  Most common problem: comparable distributions for both normal and anomalous examples

๏  Possible solutions:

๏  Apply transformations and variable combinations, e.g. x_{n+1} = (x_1 + x_4)^2 / x_3

๏  Focus on variable ratios and transaction velocity

๏  Use deep learning for feature extraction

๏  Dimensionality reduction

๏  your solution here
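The variable-combination idea above can be sketched in a couple of lines. The column names and sample row are hypothetical, purely to show the shape of such a derived feature:

```python
# Sketch of "transformations and variable combinations": derive a new
# feature x_new = (x1 + x4)**2 / x3 from raw columns. The column names
# and values here are hypothetical examples.
def derive_feature(row):
    """Combine raw variables into one ratio-style feature that may
    separate normal from anomalous examples better than any single input."""
    return (row["x1"] + row["x4"]) ** 2 / row["x3"]
```

In practice you would add a column like this to your analytical data set, re-plot the normal vs anomalous distributions, and keep the combination only if the separation actually improves.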

Feature Selection

[Figures: distributions of Variable X, counts for background (BKG) vs signal (SIG)]


What have we tried?

๏  Density estimator

๏  2D Profiles

๏  Anomaly detection

๏  Clustering

๏  Model ensemble (Random Forest)

๏  Deep learning (RBM)

๏  Logistic Regression

Combine


Gaussian distribution

[Figure: the Gaussian (normal) probability density]


Anomaly Detection* - Example

๏  Choose features x_i that are indicative of anomalous examples

๏  Fit parameters (mu_j, sigma_j^2) of a normal distribution to each feature on the training data

๏  Given a new example x, compute p(x) as the product of the per-feature Gaussian densities

๏  Anomaly if p(x) < epsilon

* Anomaly Detection - Andrew Ng - Coursera ML Course
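A minimal sketch of this recipe, assuming independent univariate Gaussians per feature (as in the Ng course material the slide cites). The epsilon value and the data are hypothetical:

```python
# Minimal density-estimation anomaly detector: fit an independent
# Gaussian to each feature on (assumed normal) training data, score new
# examples with the product of densities, flag p(x) < epsilon.
import math

def fit_gaussians(rows):
    """Return per-feature (mu, var) fitted on assumed-normal examples."""
    n, dims = len(rows), len(rows[0])
    mus = [sum(r[j] for r in rows) / n for j in range(dims)]
    vars_ = [sum((r[j] - mus[j]) ** 2 for r in rows) / n for j in range(dims)]
    return list(zip(mus, vars_))

def p(x, params):
    """p(x) = product of univariate Gaussian densities over the features."""
    out = 1.0
    for xi, (mu, var) in zip(x, params):
        out *= math.exp(-(xi - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
    return out

def is_anomaly(x, params, epsilon=1e-3):
    """Flag x as anomalous when its density falls below epsilon."""
    return p(x, params) < epsilon
```

Choosing epsilon is itself a modeling decision; the usual move is to pick the value that maximizes F1 on the cross-validation set.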


Algorithm Evaluation

๏  Fit the model on the training set

๏  On a cross validation/test example, predict

๏  Possible evaluation metrics:

๏  True positives, false positives, false negatives, true negatives

๏  Precision/Recall

๏  F1-score
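These metrics are a few lines of code. A sketch, with labels encoded as 1 = anomaly/fraud and 0 = normal:

```python
# Compute precision, recall, and F1 from labeled predictions
# (1 = anomaly/fraud, 0 = normal). Plain accuracy is misleading on
# heavily skewed fraud data, which is why these metrics are used.
def evaluate(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

F1 is the harmonic mean of precision and recall, so it punishes a model that games one metric at the expense of the other.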


Implementation



Anomaly Detection*

Assume we have some labeled data of anomalous and non-anomalous examples: y = 0 if standard behaviour, y = 1 if anomalous.

Training set: (assume normal examples/not anomalous)

Cross validation set and test set: include some labeled anomalous examples.

* Anomaly Detection - Andrew Ng - Coursera ML Course
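The split described above can be sketched directly. The 60/20/20 proportions are the conventional choice from the cited course material, not a hard rule:

```python
# Sketch of the anomaly-detection data split: train only on (assumed)
# normal examples; put the labeled anomalies into the cross validation
# and test sets. The 60/20/20 split is a conventional choice.
def split_for_anomaly_detection(normals, anomalies):
    """Return (train, cv, test): 60/20/20 over the normal examples,
    with the anomalies divided between CV and test."""
    n = len(normals)
    train = normals[: int(n * 0.6)]
    cv = normals[int(n * 0.6): int(n * 0.8)] + anomalies[: len(anomalies) // 2]
    test = normals[int(n * 0.8):] + anomalies[len(anomalies) // 2:]
    return train, cv, test
```

Keeping anomalies out of the training set matters because the density estimator is supposed to learn what *normal* looks like; the rare positives are spent where they do the most good, tuning epsilon and measuring performance.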


Transform, Normalize, Calculate

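One common reading of the “Normalize” step above is z-scoring each feature so that differently-scaled variables (amounts, counts, velocities) become comparable before fitting the density model. A minimal sketch, assuming one feature column at a time:

```python
# Z-score normalization for one feature column: shift to zero mean,
# scale to unit standard deviation, so differently-scaled variables
# are comparable before density estimation.
import math

def zscore(values):
    """Return (normalized values, mean, std) for one feature column."""
    mu = sum(values) / len(values)
    std = math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))
    return [(v - mu) / std for v in values], mu, std
```

Remember to store the fitted mean and std: incoming production data must be normalized with the *training* statistics, not its own.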


Scala


Creating Scalable Architecture

Futures


The lake again

[Diagram: the data lake, from “Lake Simcoe” to “Lake Superior”: classic Lambda Architecture, various processing frameworks, near-realtime scoring/alerting*]


Fraud Capabilities and Technology

Capabilities:

A.  Batch ingest and analysis of transaction data from the database

B.  Batch behavioural and portfolio heuristic fraud detection

C.  Near-realtime anomaly and heuristic fraud detection

D.  Online model scoring

Technology:

A.  Traditional ETL tools for transfer, HDFS/S3 for storage, Spark for processing

B.  Model analysis with iPython/Scala Notebook, Spark for processing, HDFS/HBase/Cassandra for storage

C.  Kafka real-time ingest, introduce Storm/Spark Streaming for near-realtime interception of data, HBase for model/rule storage and lookup

D.  JPMML/Spark Streaming for realtime model scoring


Our framework shopping list

Explore & Train:

๏  iPython & Scala Notebooks

๏  Spark::Core ::MLLib ::Streaming ::GraphX?

Ingest, Store, Score, & Act:

๏  Kafka, Hadoop, HBase, Cassandra, SolrCloud, & S3

๏  Intercept with Storm? Spark Streaming?

๏  OpenScoring? JPMML? R?


Fin