C* Summit 2013: Suicide Risk Prediction Using Social Media and Cassandra by Ken Krugler

DESCRIPTION

In this presentation, Ken will describe a portion of an early-phase project that uses social media data (tweets, Facebook posts, etc.) from service personnel to predict suicide risk. There's a lot of motivation to provide better data for military psychologists, since more military personnel take their own lives than are killed in the line of duty. By analyzing social media data that is voluntarily provided by personnel with a predictive analytics system, we can provide assessments that help mental health workers focus their time and energy on the most at-risk individuals. This project uses Cassandra as the scalable storage system for the social media data, which is then analyzed in a distributed environment using Hadoop. The project also uses the Solr search support in DataStax Enterprise to give users ways to dig into the underlying data, which is critical for understanding the assigned risk levels.

#CASSANDRA13

Ken Krugler | President, Scale Unlimited

Suicide Prevention Using Social Media and Cassandra

What we will discuss today...

*Using Cassandra to store social media content

*Combining Hadoop workflows with Cassandra

*Leveraging Solr search support in DataStax Enterprise

*Doing good with big data

This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) and Space and Naval Warfare Systems Center Pacific under Contract N66001-11-4006. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Defense Advanced Research Projects Agency (DARPA) or Space and Naval Warfare Systems Center Pacific.

Fine Print!

Obligatory Background

*Ken Krugler, Scale Unlimited - Nevada City, CA

*Consulting on big data workflows, machine learning & search

*Training for Hadoop, Cascading, Solr & Cassandra

Durkheim Project Overview

Including things we didn't work on...

What's the problem?

*More soldiers die from suicide than combat

*Suicide rate has gone up 80% since 2002

*Civilian suicide rates are also climbing

*More suicides than homicides

*Intervention after an "event" is often too late

What is The Durkheim Project?

*DARPA-funded initiative to help military physicians

*Uses predictive analytics to estimate suicide risk from what people write online

*Each user is assigned a suicidality risk rating of red, yellow or green.

Émile Durkheim

Current Status of Durkheim

*Collaborative effort involving Patterns and Predictions, Dartmouth Medical School & Facebook

*Details at http://www.durkheimproject.org/

*Finished phase I, now being rolled out to wider audience

Predictive Analytics

*Guessing at state of mind from text

-"There are very few people in this world that know the REAL me."

-"I lay down to go to sleep, but all I can do is cry"

*Uses labeled training data from clinical notes

*Phase I results promising, for small sample set

-"ensemble" of predictors is a powerful ML technique
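An ensemble combines several independent predictors into one rating. A minimal majority-vote sketch of the idea (the model functions, labels, and confidence measure here are illustrative stand-ins, not the project's actual models):

```python
from collections import Counter

def ensemble_rating(text, predictors):
    """Combine independent risk predictors by majority vote.

    predictors: list of functions mapping text -> "red" | "yellow" | "green"
    Returns the winning label and its share of the vote as a rough confidence.
    """
    votes = Counter(p(text) for p in predictors)
    label, count = votes.most_common(1)[0]
    return label, count / len(predictors)

# Illustrative toy predictors (NOT the project's real models)
keyword_model = lambda t: "yellow" if "cry" in t else "green"
sleep_model   = lambda t: "yellow" if "sleep" in t else "green"
length_model  = lambda t: "green"

label, conf = ensemble_rating(
    "I lay down to go to sleep, but all I can do is cry",
    [keyword_model, sleep_model, length_model],
)
```

The strength of an ensemble is that uncorrelated errors tend to cancel: a single weak predictor's mistake is outvoted by the others.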

Clinician Dashboard

*Multiple views on patient

*Prediction & confidence

*Backing data (key phrases, etc)

Data Collection

Where _do_ you put a billion text snippets?

Saving Social Media Activity

*System to continuously save new activity

-Scalable data store

*Also needs a scalable, reliable way to access data

-Processed in bulk (workflows)

-Accessed at individual level

-Searched at activity level

Data Collection

*Pink is what we wrote

*Green is in Cassandra

*Key data path in red

[Architecture diagram showing: Exciting Social Media Activity, Gigya Service, Gigya Daemon, Durkheim Social API, Durkheim App, and the Users and Activity tables in Cassandra]

Designing the Column Families

*What queries do we need to handle?

-Always by user id (what we assign)

*We want all the data for a user

-Both for Users table, and Activities table

-Sometimes we want a date range of activities

*So one row per user

-And ordered by date in the Activities table

Users Table (Column Family)

*One row per user - row key is a UUID we assign

*Standard "static" columns

-First name, last name, opt_in status, etc.

*Easy to add more xxx_id columns for new services

row key first_name last_name facebook_id twitter_id opt_in

Activities Table (Column Family)

*One row per user - row key is a UUID we assign

*One composite column per social media event

-Timestamp (long value)

-Source (FB, TW, GP, etc)

-Type of column (data, activity id, user id, type of activity)

row key ts_src_data ts_src_id ts_src_providerUid ts_src_type
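The composite-column layout can be mimicked in a Python sketch: zero-padding the timestamp prefix makes lexicographic order match time order, which is what makes the date-range queries mentioned earlier cheap. Column names follow the slide's ts_src_suffix pattern; the padding width and example values are assumptions:

```python
def activity_columns(ts, src, data, activity_id, provider_uid, activity_type):
    """Build the four composite columns for one social media event.

    Zero-padding the timestamp makes lexicographic order match time order,
    so a contiguous column slice is a date-range query.
    """
    prefix = f"{ts:013d}_{src}"
    return {
        f"{prefix}_data": data,
        f"{prefix}_id": activity_id,
        f"{prefix}_providerUid": provider_uid,
        f"{prefix}_type": activity_type,
    }

row = {}  # one row per user, keyed by composite column name
row.update(activity_columns(213, "FB", "I feel tired", "FB post #32",
                            "FB user #66", "Status update"))
row.update(activity_columns(307, "TW", "Where am I?", "Tweet #17",
                            "TW user #109", "Tweet"))

# date-range "slice": all columns with timestamp in [200, 300)
in_range = {k: v for k, v in sorted(row.items())
            if 200 <= int(k.split("_")[0]) < 300}
```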

Two Views of Composite Columns

*As a row/column view:

row key 213_FB_data 213_FB_id 213_FB_providerUid 213_FB_type
"uuid1" "I feel tired" "FB post #32" "FB user #66" "Status update"

*As a key-value map:

"uuid1" → { 213_FB_data: "I feel tired", 213_FB_id: "FB post #32", 213_FB_providerUid: "FB user #66", 213_FB_type: "Status update" }

Implementation Details

*API access protected via signature

*Gigya Daemon on both t1.micro servers

-But only active on one of them

*Astyanax client talks to Cassandra

*Cluster uses 3 m1.large servers

[Deployment diagram showing the Durkheim App, an AWS Load Balancer, Durkheim Social API instances on EC2 t1.micro servers, and EC2 m1.large servers]

Predictive Analytics at Scale

Running workflows against Cassandra data

How to process all this social media goodness?

*Models are defined elsewhere

*These are "black boxes" to us

213_FB_data 213_FB_id 213_FB_providerUid 213_FB_type
"I feel tired" "FB post #32" "FB user #66" "Status update"

307_TW_data 307_TW_id 307_TW_providerUid 307_TW_type
"Where am I?" "Tweet #17" "TW user #109" "Tweet"

Feature Extraction → Model → model, rating, probability, keywords

Why do we need Hadoop?

*Running one model on one user is easy

-And n models on one user is still OK

*But when a model changes...

-all users with the model need processing

Batch processing is OK

*No strict minimum latency requirements

*So we use Hadoop, for scalability and reliability

Hadoop Workflow Details

*Implemented using Cascading

*Read Activities Table using Cassandra Tap

*Read models from MySQL via JDBC

Hadoop Bulk Classification Workflow

1. Read Social Media Activity Table → Convert from Cassandra

2. Read User Profiles Table → Convert from Cassandra

3. CoGroup by user profile ID

4. Run Classifier models

5. Convert to Cassandra → Write Classification Result Table
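The join-and-classify logic of the workflow, sketched in plain Python. Cascading's CoGroup is approximated by a dict-based join; function names, field names, and the toy model are assumptions, not the actual workflow code:

```python
def bulk_classify(profiles, activities, models):
    """Approximate the Cascading workflow: group activities by user id,
    join with profiles, then run every classifier model per user.

    profiles:   {user_id: profile dict}
    activities: list of (user_id, text) pairs
    models:     {model_name: function(list_of_texts) -> (rating, probability)}
    Returns rows for the classification result table.
    """
    # CoGroup by user profile ID
    by_user = {}
    for user_id, text in activities:
        by_user.setdefault(user_id, []).append(text)

    results = []
    for user_id, texts in by_user.items():
        if user_id not in profiles:
            continue  # inner join: drop activities with no profile
        for name, model in models.items():
            rating, prob = model(texts)
            results.append({"user": user_id, "model": name,
                            "rating": rating, "probability": prob})
    return results

# toy model for illustration only
toy_model = lambda texts: (("yellow", 0.7)
                           if any("cry" in t for t in texts)
                           else ("green", 0.9))
rows = bulk_classify({"uuid1": {"name": "A"}},
                     [("uuid1", "all I can do is cry")],
                     {"toy": toy_model})
```

This also shows why a model change is expensive: the inner loop runs every model over every matched user, so updating one model means re-reading all of that model's users.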

Workflow Issues

*Currently manual operation

-Ultimately needs a daemon to trigger (time, users, models)

*Runs in separate cluster

-Lots of network activity to pull data from Cassandra cluster

-With DSE we could run on same cluster

*Fun with AWS security groups

Solr Search

Poking at the data

Solr Search

*Model results include key terms for classification result

-"feel angry" (0.732)

*Now you want to check actual usage of these terms

Poking at the Data

*Hadoop turns petabytes into pie charts

*How do you verify results?

*Search works really well here

Solr Search

*Want "narrow" table for search

-Solr dynamic fields are usually not a great idea

-Limit to 1024 dynamic fields per document

*So we'll replicate some of our Activity CF data into a new CF

*Don't be afraid of making copies of data

The "Search" Column Family

*Row key is derived from Activity CF UUID + target column name

*One column ("data") has content from that row + column in Activity CF

Activity Column Family:

row key 213_FB_data 213_FB_id
"uuid1" "I feel tired" "FB post #32"

Search Column Family:

row key "data"
"uuid1_213_FB" "I feel tired"
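Deriving the Search rows from the Activity rows is a simple denormalization pass. A sketch under assumptions: the row-key format follows the slide (activity row key + column name), and which columns count as text-bearing is illustrative:

```python
def build_search_rows(activity_rows, target_suffix="data"):
    """Copy the text-bearing columns of the Activity CF into a narrow
    Search CF: one row per (user, activity), with a single "data" column.
    """
    search = {}
    for row_key, columns in activity_rows.items():
        for col_name, value in columns.items():
            if col_name.endswith("_" + target_suffix):
                # row key = Activity CF row key + target column name
                search[f"{row_key}_{col_name}"] = {"data": value}
    return search

activity = {"uuid1": {"213_FB_data": "I feel tired",
                      "213_FB_id": "FB post #32"}}
search = build_search_rows(activity)
```

Because each search row holds exactly one text value, Solr only ever sees one static "data" field, sidestepping the dynamic-field limits mentioned above.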

Solr Schema

*Very simple (which is how we like it)

*Direct one-to-one mapping with Cassandra columns

*Hits have key field, which contains UUID/Timestamp/Service

<fields>
  <field name="key" type="string" indexed="true" stored="true" />
  <field name="data" type="text" indexed="true" stored="true" />
</fields>
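With that schema, checking how a key term is actually used is a plain Solr select query against the "data" field. A sketch of building the request URL; the host, port, and core name are assumptions:

```python
from urllib.parse import urlencode

def solr_query_url(base, term, rows=10):
    """Build a Solr select URL that searches the "data" field for a phrase."""
    params = urlencode({"q": f'data:"{term}"', "rows": rows, "wt": "json"})
    return f"{base}/select?{params}"

url = solr_query_url("http://localhost:8983/solr/durkheim", "feel angry")
```

Each hit's key field then points back at the Activity CF row (UUID/Timestamp/Service), so the clinician can jump from a matched phrase to the original post.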

Combined Cluster

*One Cassandra Cluster can allocate nodes for Hadoop & Search

Security

Locking things down

The Most Important Detail

*We don't have any personal medical data!!!

*We don't have any personal medical data!!!

*We don't have any personal medical data!!!

Three Aspects of Security

*Server-level

-ssh via restricted private key

*API-level

-validate requests using signature

-secure SHA1 hash

*Services-level

-Restrict open ports using security groups
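The API-level signature check can be sketched with HMAC-SHA1. This is the generic request-signing pattern, not the project's exact scheme; the canonical-string layout and parameter names are assumptions:

```python
import hashlib
import hmac

def sign_request(secret: bytes, method: str, path: str,
                 timestamp: str, body: bytes) -> str:
    """Compute an HMAC-SHA1 signature over the request's canonical parts."""
    msg = b"\n".join([method.encode(), path.encode(), timestamp.encode(), body])
    return hmac.new(secret, msg, hashlib.sha1).hexdigest()

def is_valid(secret, method, path, timestamp, body, signature):
    expected = sign_request(secret, method, path, timestamp, body)
    # constant-time compare avoids leaking the signature via timing
    return hmac.compare_digest(expected, signature)

sig = sign_request(b"secret", "POST", "/api/activity", "1368144000", b"{}")
```

Including a timestamp in the signed material also lets the server reject stale (replayed) requests.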

Summary

Bringing it all home

Key Points...

*You can effectively use Cassandra as:

-A repository for social media data

-The data source for workflows

-A search index, via Solr integration

And the Meta-Point

*It is possible to do more with big data than optimize ad yields

THANK YOU
