THE FORTUNE TELLER API
Bas Geerdink
Doing data science with Apache Spark
TODAY’S MISSION: TO PREDICT THE FUTURE…
• Data Science
• Spark and MLlib
• API
DATA SCIENCE
Process:
1. Formulate a question
2. Gather data
3. Model data
4. Create data product
Source: Drew Conway, The Data Science Venn Diagram, 2013
DATA SCIENCE METHOD
Source: Foundational Methodology for Data Science, IBM, 2015
1. Formulate a question
2. Gather data
3. Analyze data
4. Product
DATA SCIENCE METHOD
1. Formulate a question
BUSINESS PROBLEM
Fortune Teller at the circus
Input:
• Glass ball
• Lines on hand
• Star sign
• Astrology
• Tarot cards
Output:
• Vague prediction about future
Product Owner: “We should be able to do better than this!”
HOW TO CALCULATE HAPPINESS
Input: (personal details)
• Country of residence
• Age
• Male / female
• Partner (yes / no)
• Number of children
• Level of education (yes / no)
Output: (the happiness score)
• Health
  • Life expectancy
  • Disease
• Wealth
  • Poverty (yes / no)
  • Income
• “Psychological well-being”
  • Enjoyment
  • Stress
  • Anger
  • Worry
  • Sadness
DATA SCIENCE METHOD
1. Formulate a question
2. Gather data
DATA SOURCES
• Gallup-Healthways Well-Being Index
• The World Bank
• Google Scholar
• www.data.gov
• Global Health Data Exchange
• World Health Organization
• Simple Online Data Archive for Population Studies (Sodapop)
• The World Factbook
• UCI Machine Learning Repository
WINNING DATASET
National Health Interview Survey 2012
• 43345 surveys
• 133 questions
• Well documented
• Free to download and use
HOW TO CALCULATE HAPPINESS
Input: (personal details)
• Country of residence
• Age
• Male / female
• Partner (yes / no)
• Number of children
• Level of education
Output: (the happiness score)
• Health
  • Life expectancy
  • Disease
• Wealth
  • Poverty (yes / no)
  • Income
• “Psychological well-being”
  • Enjoyment
  • Stress
  • Anger
  • Worry
  • Sadness
DATA SCIENCE METHOD
1. Formulate a question
2. Gather data
3. Analyze data
• General purpose computing engine
• In-memory processing
• Support of streaming data, machine learning, graphs
• (much) faster than Hadoop MapReduce
• Small player in the (OS) world of Machine Learning: Python and R are leading, followed by SAS, Weka, RapidMiner, …
• It’s just a tool… no solution or holy grail
• “I predict that mean cluster size will remain very close to one until the end of humanity. The vast majority of problems are small. Honestly, the combined utility of PyData and Spark pales in comparison to the utility of Excel.”
SPARK OVERVIEW
• Core: Spark Core
• Libraries: Spark SQL, Spark Streaming, MLlib, GraphX
• Cluster managers: Standalone, YARN, Mesos
• Languages: Scala, Python, R, Java
• Storage: File system, HDFS, HBase, Cassandra, …
SPARK CLUSTER MODE
• Standalone
• Mesos
• YARN
DEMO
CORRELATION <> CAUSATION
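As a minimal sketch of what a correlation measures here, the block below computes Pearson's r in plain Python. The function name `pearson` and the sample numbers are made up for illustration; they are not taken from the survey data or from the Fortune Teller code. A high r only says that two columns move together, not that one causes the other.

```python
from math import sqrt


def pearson(xs, ys):
    """Return Pearson's correlation coefficient r for two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Made-up toy columns (illustration only, not survey data).
column_a = [1.0, 2.0, 3.0, 4.0]
column_b = [2.0, 4.0, 6.0, 8.0]
r_same_direction = pearson(column_a, column_b)
r_opposite_direction = pearson(column_a, list(reversed(column_b)))
```

With these toy columns the first pair rises together and the second pair moves in opposite directions, so the two coefficients sit at the two ends of the range from -1 to 1.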
BIG DATA IS OUT, ML IS IN
Source: Gartner, Hype Cycle for Emerging Technologies, 2015
MACHINE LEARNING
• Actually, this is… algorithms maximizing scores using a statistical approach to problem solving
• Producing… systems that can learn from and make decisions and predictions based on data
“The field of study that gives computers the ability to learn without being explicitly programmed.” (Arthur Samuel, 1959)
MACHINE LEARNING TASKS
Recommendation Using Association Rules (Similarity Matching)
• Predict items that have a high similarity to others within a given set of items.
• Example: Predicting movies or books based on someone’s historic purchase behavior.
Classification
• Predict to which class/category a certain item belongs. These categories are predefined. A classification task can be binary or multi-class.
• Example: Determining whether a message is spam or non-spam (binary); determining characters from a handwriting sample (multi-class).
Regression
• Focus on predicting numeric values.
• Example: Predicting the number of ice cream cones to be sold on a certain day based on weather data.
Clustering
• Divide items into groups, but unlike in classification tasks, these groups are not previously defined.
• Example: Grouping customers based on certain properties to discover customer segments.
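To make the regression task above concrete, here is a minimal sketch of a one-variable least-squares fit in plain Python, using the ice cream example. It is not the MLlib code from the demo; the function names `fit_line` and `predict` and the temperature and sales numbers are made up for illustration.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def predict(model, x):
    """Apply a fitted (slope, intercept) pair to a new x value."""
    slope, intercept = model
    return slope * x + intercept


# Made-up toy data: daily temperature (°C) and cones sold (illustration only).
temperatures = [15.0, 20.0, 25.0, 30.0]
cones_sold = [100.0, 150.0, 200.0, 250.0]

model = fit_line(temperatures, cones_sold)
forecast = predict(model, 22.0)  # predicted cones sold on a 22 °C day
```

Classification would replace the numeric target with a predefined label, and clustering would drop the target altogether and group the rows by similarity.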
PICK AN ALGORITHM…
DEMO
DATA SCIENCE METHOD
1. Formulate a question
2. Gather data
3. Analyze data
4. Product
API DESIGN
• Start Spark server: GET http://fortuneteller/start
• Stop Spark server: GET http://fortuneteller/stop
• Add survey records: POST http://fortuneteller/survey
• Train model: GET http://fortuneteller/train
• Correlations: GET http://fortuneteller/correlations
• Predict Health: GET http://fortuneteller/prediction/health
• Predict Wealth: GET http://fortuneteller/prediction/wealth
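The endpoint list above can be read as a routing table. The sketch below expresses it as a plain Python dictionary with a small lookup function. Only the HTTP methods and paths come from the slide; the handler names (`start_spark`, `train_model`, and so on) and the `resolve` function are hypothetical and say nothing about how the actual project wires its routes.

```python
from urllib.parse import urlparse

# (HTTP method, path) -> handler name.
# Methods and paths are from the slide; handler names are hypothetical.
ROUTES = {
    ("GET", "/start"): "start_spark",
    ("GET", "/stop"): "stop_spark",
    ("POST", "/survey"): "add_survey_records",
    ("GET", "/train"): "train_model",
    ("GET", "/correlations"): "list_correlations",
    ("GET", "/prediction/health"): "predict_health",
    ("GET", "/prediction/wealth"): "predict_wealth",
}


def resolve(method, url):
    """Return the handler name for a request, or None if no route matches."""
    # Tolerate a trailing slash, but keep the root path as "/".
    path = urlparse(url).path.rstrip("/") or "/"
    return ROUTES.get((method.upper(), path))


handler = resolve("GET", "http://fortuneteller/prediction/health")
missing = resolve("POST", "http://fortuneteller/train")
```

Keying the table on both method and path keeps `POST /survey` distinct from any future `GET /survey`, and an unknown combination simply yields `None` so the caller can answer with a 404 or 405.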
DEMO
Next steps…
Web app?
Deploy to cloud?
Streaming linear regression?
Questions?
https://github.com/geerdink/FortuneTellerApi