Scala and Python
Integrating scikit-learn into a Scala Stack to build realtime predictive models
Dan Chiao, VP Engineering
Why it was necessary
We pivoted
The original product
• Social data append
  – PeopleGraph: match email addresses to public demographics and social profiles
  – BrandGraph: match company URLs to public firmographics and social profiles
• Requirements
  – Integrate a large (and expanding) number of web data sources (REST, SOAP, flat files)
  – Realtime processing of large volumes of contacts (60 queries/s)
The original technology stack
• Scala
  – Best of both worlds
    • Concise functional syntax
    • Java libraries and deployment architecture
    • Scala-specific libraries (Dispatch, Lift Web Framework)
• Twitter (soon to be Apache) Storm
  – Streaming intake and normalization of large amounts of data
• MongoDB
  – Expanding data sources = constantly updating schema
  – Most sophisticated query syntax of NoSQL options
• AWS and Azure
  – Well, duh
The new product
• Moving up the application stack
  – Focus on the most compelling single-use case for our data
  – Fliptop SpendScore
    • Predictive analytics for sales and marketing teams
    • “Machine learning for Salesforce”
The updated technology stack
• Still need to wrangle large amounts of data, so no changes there
• New requirement: fast, scalable machine learning
Why not Scala (Java) native?
• The options
  – Apache Mahout
    • Only skeleton implementations for most sophisticated machine learning techniques (e.g. Random Forest, AdaBoost)
    • Customer-specific models – don’t need Big Data
  – Weka – GPL
  – Scala-native libraries – Too early to use in production
Why Python?
• scikit-learn
  – Mature – around since 2007
  – Actively developed – Last stable release Aug 2013
  – Sophisticated – Random Forest and AdaBoost classifiers show comparable performance to R
• Why not R? Not really production grade.
Requirements
• APIs to exploit Python’s modeling power
  – Train, predict, model info query, etc.
• Scalability
  – On-demand Python serving nodes
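The API requirement above can be pictured as a small service object on the Python side. This is a minimal sketch under stated assumptions, not Fliptop's implementation: `ModelService`, its method names, and the majority-class stand-in model are all hypothetical, and in practice a scikit-learn estimator would sit behind `train` and `predict`.

```python
from collections import Counter


class ModelService:
    """Hypothetical service exposing train / predict / info operations."""

    def __init__(self):
        self._majority = None   # label returned by the stand-in model
        self._n_samples = 0     # number of training rows seen

    def train(self, features, labels):
        # Stand-in for fitting a real classifier: remember the most common label.
        if not labels or len(features) != len(labels):
            raise ValueError("features and labels must be non-empty and the same length")
        self._majority = Counter(labels).most_common(1)[0][0]
        self._n_samples = len(labels)

    def predict(self, features):
        if self._majority is None:
            raise RuntimeError("model has not been trained")
        return [self._majority for _ in features]

    def info(self):
        # Model info query: report whether a model exists and how much data it saw.
        return {"trained": self._majority is not None, "n_samples": self._n_samples}


service = ModelService()
service.train([[1, 0], [0, 1], [1, 1]], ["buy", "skip", "buy"])
predictions = service.predict([[0, 0], [1, 0]])
```

Keeping the three operations behind one object means the transport (JNI, Thrift, or REST) only has to map calls onto `train`, `predict`, and `info`.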
Tools for Scala-Python Integration
• Reimplementation of Python
  – Jython (JPython)
• Communication through JNI
  – Jepp
• Communication through IPC
  – Apache Thrift
• Communication through REST API calls
  – Bottle
Jython
• Reimplementation of Python in Java
• Can import and use any Java class
• Includes almost all of the modules in the standard Python distribution
  – Except some of the modules originally implemented in C
• Compiles to Java bytecode
  – Either on demand or statically
[Diagram: Scala code and Python code both run inside a single JVM, with Jython executing the Python code.]
Jython
• Lacks support for many scientific computing extensions
  – NumPy, SciPy, etc.
• JyNI (Jython Native Interface) to the rescue?
  – Specifically designed to support CPython extensions like NumPy, SciPy
  – Still in alpha
Communication through JNI
• Jepp (Java Embedded Python)
  – Embeds CPython in Java
  – Runs Python code in CPython
  – Leverages both JNI and the Python/C API for integration
[Diagram: Scala code in the JVM calls through JNI into Jepp, which runs Python code in an embedded CPython interpreter.]
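Jepp
python_util.py:
    def python_add(a, b):
        return a + b
TestJepp.scala:
    import jep.Jep  // assumed package for the Jepp bindings

    object TestJepp extends App {
      val jep = new Jep()
      // Load the Python module so python_add is defined in the interpreter
      jep.runScript("python_util.py")
      // invoke() takes Java objects, so box the Scala Ints
      val a = (2).asInstanceOf[AnyRef]
      val b = (3).asInstanceOf[AnyRef]
      val sumByPython = jep.invoke("python_add", a, b)
      println(sumByPython.asInstanceOf[Int])
    }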
Communication through IPC
• Apache Thrift
  – Developed & open-sourced by Facebook
  – More community support than Protobuf, Avro
  – IDL-based (Interface Definition Language)
  – Generates server/client code in specified languages
  – Takes care of protocol and transport layer details
  – Comes with generators for Java, Python, C++, etc.
    • No Scala generator
    • Scrooge (Twitter) to the rescue!
Thrift – IDL
TestThrift.thrift:
    namespace java python_service_test
    namespace py python_service_test

    service PythonAddService {
      i32 pythonAdd(1: i32 a, 2: i32 b),
    }
Generate the Java and Python code:
    $ thrift --gen java --gen py TestThrift.thrift
Thrift – Python Server
PythonAddService.py (generated):
    class Iface:
        def pythonAdd(self, a, b):
            pass
PythonAddServer.py:
    # Assumes the generated gen-py directory is on sys.path
    from thrift.protocol import TBinaryProtocol
    from thrift.server import TServer
    from thrift.transport import TSocket, TTransport

    from python_service_test import PythonAddService

    class ExampleHandler(PythonAddService.Iface):
        def pythonAdd(self, a, b):
            return a + b

    handler = ExampleHandler()
    processor = PythonAddService.Processor(handler)
    transport = TSocket.TServerSocket(port=9090)
    tfactory = TTransport.TBufferedTransportFactory()
    pfactory = TBinaryProtocol.TBinaryProtocolFactory()
    server = TServer.TThreadedServer(processor, transport, tfactory, pfactory)
    server.serve()
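Thrift – Scala Client
PythonAddClient.scala:
    import org.apache.thrift.protocol.{TBinaryProtocol, TProtocol}
    import org.apache.thrift.transport.{TSocket, TTransport}
    import python_service_test.PythonAddService

    object PythonAddClient extends App {
      val transport: TTransport = new TSocket("localhost", 9090)
      val protocol: TProtocol = new TBinaryProtocol(transport)
      val client = new PythonAddService.Client(protocol)

      transport.open()
      // The method name matches the IDL definition (pythonAdd)
      val sumByPython = client.pythonAdd(3, 5)
      println("3 + 5 = " + sumByPython)
      transport.close()
    }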
Thrift
[Diagram: Scala code in the JVM connects through Thrift to multiple Python interpreters, each running Python code behind a Thrift server.]
Auto balancing, built-in encryption
REST API Architecture
[Diagram: Scala code in the JVM calls multiple Bottle servers, each running Python code.]
Auto balancer? Encoding?
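A REST call of this shape can be sketched as follows, reusing the add example from the Jepp and Thrift slides. This is only an illustration under stated assumptions: Python's standard-library `http.server` stands in for Bottle so the sketch has no dependencies, and the `/add` route, the JSON payload shape, and the `call_add` helper are hypothetical. The client half is written in Python for brevity; in the architecture above it would be Scala code in the JVM.

```python
import json
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer


class AddHandler(BaseHTTPRequestHandler):
    """Serves POST /add with a JSON body {"a": ..., "b": ...}."""

    def do_POST(self):
        if self.path != "/add":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        body = json.dumps({"sum": payload["a"] + payload["b"]}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # silence per-request logging


def call_add(port, a, b):
    # Encode the arguments as JSON, POST them, and decode the JSON reply.
    conn = HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request("POST", "/add", body=json.dumps({"a": a, "b": b}).encode("utf-8"),
                     headers={"Content-Type": "application/json"})
        return json.loads(conn.getresponse().read())["sum"]
    finally:
        conn.close()


server = HTTPServer(("127.0.0.1", 0), AddHandler)  # port 0: the OS picks a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
result = call_add(server.server_address[1], 3, 5)
server.shutdown()
server.server_close()
```

Both sides only need an HTTP library and a JSON encoder, which is the "low learning curve, no dependency" trade-off in a nutshell; load balancing has to come from somewhere else.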
Thrift vs. REST
                      Thrift   REST   Does it matter?
Load Balancer           ✔             No (AWS & Azure)
Encode/Decode           ✔             No (We’re already doing it)
Low Learning Curve               ✔    Yes
No Dependency                    ✔    Yes
Winner: REST
Fliptop’s Architecture
[Diagram: Scala code in the JVM sends requests through a load balancer to multiple Bottle servers, each running Python code.]
5 Python servers, ~5,000 requests/sec
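As a rough illustration of what those figures mean per node, the sketch below spreads one second of traffic over five servers in round-robin order. It assumes perfectly even balancing, which the slides do not claim, and the server names are hypothetical; the real load balancer is the cloud provider's.

```python
from collections import Counter
from itertools import cycle

servers = [f"python-{i}" for i in range(1, 6)]  # hypothetical names for the 5 servers
next_server = cycle(servers)                    # round-robin iterator over the servers

# Route one second's worth of traffic at the quoted ~5,000 requests/sec.
load = Counter(next(next_server) for _ in range(5000))
per_server = load["python-1"]
```

Under that assumption each Python server handles about 1,000 requests per second.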
Summary
• Jython
  – (✓) Tight integration with Scala/Java
  – (✗) Lacks support for C extensions (JyNI might help in the future)
• Jepp
  – (✓) Access to high-quality Python extensions at CPython speed
  – (✗) Two runtime environments
• Thrift, REST
  – (✓) Language-independent development
  – (✗) Bigger communication overhead
Questions?
Ask this guy
Thank You