Topics-oriented APIs
May 2015 – APIdays Barcelona
Tyler Singletary - @harmophone
Director of Platform
HI.
HI. WITH CONTEXT.
A Practical Application of Social Media Machine Learning and NLP 1
WHAT IS KLOUT, REALLY?
• Klout is an API client application of the social web.
• Federated identity across platforms
• Macro and micro understanding of profile, conversation, and content.
People linked by Topics.
UNIFYING PRINCIPLE: TOPICS
• TBs of Social Interactions a Day
• NLP applied to posts
• Aggregated to profiles:
  – effects are Klout Score, topical strengths
  – The what becomes topics
  – The why becomes TopicSets
• Links crawled, NLP summarization
Content and people linked by Topics.
TOPIC SETS + USERS + SCORING
• Allow for time-series slicing
• Aggregate counting
• Slicing of set to create ordered list
Topic-oriented view
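The three operations on this slide (time-series slicing, aggregate counting, and slicing a set into an ordered list) can be sketched as below. This is a minimal illustration, not Klout's implementation: the `TopicScore` record, the function names, and the sample rows are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class TopicScore:
    """One scored topic observation for a user on a given day (hypothetical shape)."""
    user_id: str
    topic: str
    day: date
    score: float


def slice_by_time(rows, start, end):
    """Time-series slicing: keep observations with start <= day <= end."""
    return [r for r in rows if start <= r.day <= end]


def count_users_per_topic(rows):
    """Aggregate counting: number of distinct users seen for each topic."""
    users = defaultdict(set)
    for r in rows:
        users[r.topic].add(r.user_id)
    return {topic: len(ids) for topic, ids in users.items()}


def top_users_for_topic(rows, topic, limit=10):
    """Ordered list: best score per user for one topic, highest first."""
    best = {}
    for r in rows:
        if r.topic == topic:
            best[r.user_id] = max(best.get(r.user_id, 0.0), r.score)
    ranked = sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:limit]


# Made-up sample data for illustration only.
rows = [
    TopicScore("u1", "apis", date(2015, 5, 1), 0.91),
    TopicScore("u2", "apis", date(2015, 5, 2), 0.97),
    TopicScore("u1", "twitter", date(2015, 5, 3), 0.80),
    TopicScore("u3", "apis", date(2015, 4, 1), 0.99),
]
may = slice_by_time(rows, date(2015, 5, 1), date(2015, 5, 31))
counts = count_users_per_topic(may)
ranking = top_users_for_topic(may, "apis")
```

Slicing by time first and ranking second keeps the April observation for `u3` out of the May ranking, which is the point of making the set sliceable before it is ordered.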
NLP-based Building Blocks 2
KLOUT DEALS WITH RIDICULOUS AMOUNTS OF DATA
o Topic assignment at scale:
  o ~650M new pieces of data daily
  o hundreds of millions of profiles
  o ~10,000 topics in a 3-level hierarchy
  o Daily update
o Multiple social networks and various data sources:
  o Twitter, Facebook, LinkedIn, Google+, Wikipedia
  o User activity, profiles, connections
o Topics normalized to an evolving, managed ontology
WEIGHTING, NORMALIZATION, CALIBRATION
Signals are weighted and normalized to mirror real-world influence
– Machine-learned weighting based on regression analysis of survey data
Advanced algorithm based on 1500 signal combinations of relationships and ratios
– Where: On which network is the action taking place?
– What: What action was taken?
– Who: Who acted on your content?
– How much: How many actions and unique actors?
– When: When was the action performed?
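To make the five questions concrete, here is a toy scoring sketch that folds where, what, who, how much, and when into one number. Everything in it is an assumption for illustration: the `Action` record, the weight table, the 30-day half-life, and the unique-actor ratio are invented, whereas the slides say the real weights are learned by regression over survey data and span about 1500 signal combinations.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Action:
    network: str        # where
    kind: str           # what
    actor_id: str       # who
    actor_score: float  # who: the actor's own influence on an assumed 0..1 scale
    day: date           # when


# Hypothetical hand-picked weights standing in for machine-learned ones.
ACTION_WEIGHTS = {
    ("twitter", "retweet"): 1.0,
    ("twitter", "mention"): 0.6,
    ("facebook", "like"): 0.3,
}
HALF_LIFE_DAYS = 30.0  # assumed recency half-life


def raw_signal(actions, today):
    """Weighted, recency-decayed sum of actions, scaled by the unique-actor ratio."""
    if not actions:
        return 0.0
    total = 0.0
    for a in actions:
        weight = ACTION_WEIGHTS.get((a.network, a.kind), 0.0)
        age_days = (today - a.day).days
        recency = 0.5 ** (age_days / HALF_LIFE_DAYS)
        total += weight * a.actor_score * recency
    # How much: many actions from few actors count for less than broad engagement.
    unique_ratio = len({a.actor_id for a in actions}) / len(actions)
    return total * unique_ratio


today = date(2015, 5, 20)
actions = [
    Action("twitter", "retweet", "a1", 0.8, date(2015, 5, 20)),
    Action("twitter", "retweet", "a1", 0.8, date(2015, 5, 20)),
    Action("facebook", "like", "a2", 0.5, date(2015, 4, 20)),
]
signal = raw_signal(actions, today)
```

A production system would normalize this raw value against the population before exposing it as a score; that calibration step is not shown here.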
TOPIC SETS FOR CONTEXT
User’s Influence
With various Scores
User’s Interests
With various Scores
User’s Self-selection
Based on registered self-declared interest
Audience Influence
Rollup of User’s Influence within a User’s downlevel and uplevel networks
Audience Interests
Rollup of User’s Interests within a User’s downlevel (and uplevel) networks
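One simple way to picture an audience rollup is to average each topic's score over every member of a user's downlevel network. The sketch below assumes that reading; the function name, the plain `{topic: score}` dictionaries, and the choice of a mean (with a missing topic counting as zero) are illustrative assumptions, not Klout's actual aggregation.

```python
from collections import defaultdict


def rollup_topic_sets(audience_topic_sets):
    """Average each topic's score over all audience members.

    A topic missing from a member's set contributes 0 for that member.
    Returns (topic, mean_score) pairs, highest mean first.
    """
    if not audience_topic_sets:
        return []
    totals = defaultdict(float)
    for topic_set in audience_topic_sets:
        for topic, score in topic_set.items():
            totals[topic] += score
    n = len(audience_topic_sets)
    rolled = {topic: total / n for topic, total in totals.items()}
    return sorted(rolled.items(), key=lambda kv: (-kv[1], kv[0]))


# Made-up interest sets for three followers of one user.
follower_interests = [
    {"apis": 0.8, "twitter": 0.4},
    {"apis": 0.6},
    {"marketing": 0.9},
]
audience_interests = rollup_topic_sets(follower_interests)
audience_topics = [topic for topic, _ in audience_interests]
```

The same function applies to the uplevel network by passing the topic sets of the accounts the user follows instead.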
CHALLENGES IN BIG DATA
● Message size: Overall data size may be huge, but message size per user may be small.
● Text Sparsity: Many users may be passive consumers of content.
● Noise: colloquial language, slang, grammatical errors, abbreviations.
● Context: Need to expand context to get more information
● False positives are embarrassing when user-facing
CHALLENGES TO SCALE
NLP* - StanfordNLP english.conll.4class.distsim.crf.ser.gz
● Speed Matters (650M messages a day):
  ○ Stanford Named Entity Extraction - 10.959 ms (82.0 CPU days)
  ○ Dictionary - 0.056 ms (0.42 CPU days)
● Corpus
  ○ Stanford Named Entity Extraction:
    ■ {‘the rule of law’=1.0}
  ○ Dictionary based:
    ■ {‘the rule of law’=1.0, ‘nsa’=1.0, ‘eff’=1.0}
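The CPU-day figures follow from per-message latency multiplied by daily volume, and the dictionary approach amounts to whole-word phrase lookup. The sketch below shows both under stated assumptions: the round 650M volume is the slide's approximation (so the result lands near, not exactly on, 82.0), and `dictionary_extract`, its phrase list, and the sample sentence are hypothetical stand-ins for Klout's dictionary matcher.

```python
import re

MESSAGES_PER_DAY = 650_000_000       # approximate daily volume from the slide
MS_PER_DAY = 24 * 60 * 60 * 1000


def cpu_days(ms_per_message, messages=MESSAGES_PER_DAY):
    """CPU days needed to process one day of messages at a given latency."""
    return ms_per_message * messages / MS_PER_DAY


def dictionary_extract(text, phrases):
    """Return {phrase: 1.0} for each dictionary phrase found as whole words."""
    lowered = text.lower()
    found = {}
    for phrase in phrases:
        # Lookarounds stop 'eff' from matching inside a word such as 'effort'.
        pattern = r"(?<!\w)" + re.escape(phrase.lower()) + r"(?!\w)"
        if re.search(pattern, lowered):
            found[phrase] = 1.0
    return found


ner_cost = cpu_days(10.959)    # a little over 82 CPU days
dict_cost = cpu_days(0.056)    # under half a CPU day

# Hypothetical dictionary and input sentence.
phrases = ["the rule of law", "nsa", "eff", "api"]
text = "The EFF says NSA surveillance undermines the rule of law."
entities = dictionary_extract(text, phrases)
```

The lookup only finds what is already in the dictionary, which is why the slides pair it with a managed, evolving ontology rather than treating it as a drop-in replacement for a trained extractor.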
WEBSTER
MACHINE LEARNING AT KLOUT
We leverage our past machine learning and NLP classification assets to:
• Train new models for additional data sources
• Retrain Topics classification
• Predict “actionability” of support
• Predict virality of content [macro and micro]
• Predict the “personhood” of a social media account
• Content-targeting based on downlevel predictions
How do you productize this in APIs? 3
INPUTS AND OUTPUTS
People-Specific Insights
Input: People(s)
Output: TopicSet(s)
Topic-Specific People
Input: Topic(s)
Output: People
Topic-Aggregate Insights
Input: Topic(s)
Output: Metadata, Aggregation
People-Aggregate Insights
Input: User(s)
Output: Metadata, Aggregate Sets
GET user.json/[id]/insights/influence-topics
GET user.json/insights/aggregated/influence-topics?userIds=1,2,3
GET topic.json/[ids]/people
GET topic.json/[ids]/insights
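A small set of path builders shows how the four endpoints map onto the input/output pairs above. This is a sketch only: the function names are invented, the paths are kept relative because the slides give no host, and joining multiple topic ids with commas is an assumption (the slides write `[ids]` without specifying a separator).

```python
from urllib.parse import urlencode


def user_influence_topics(user_id):
    """People-specific insights: one user in, TopicSet out."""
    return f"user.json/{user_id}/insights/influence-topics"


def aggregated_influence_topics(user_ids):
    """People-aggregate insights: several users in, aggregate sets out."""
    ids = ",".join(str(i) for i in user_ids)
    # safe="," keeps the comma-separated list readable instead of %2C-encoded.
    query = urlencode({"userIds": ids}, safe=",")
    return f"user.json/insights/aggregated/influence-topics?{query}"


def _join_ids(topic_ids):
    # Assumption: multiple ids are comma-separated in the path segment.
    return ",".join(str(i) for i in topic_ids)


def topic_people(topic_ids):
    """Topic-specific people: topics in, people out."""
    return f"topic.json/{_join_ids(topic_ids)}/people"


def topic_insights(topic_ids):
    """Topic-aggregate insights: topics in, metadata and aggregation out."""
    return f"topic.json/{_join_ids(topic_ids)}/insights"


paths = [
    user_influence_topics(42),
    aggregated_influence_topics([1, 2, 3]),
    topic_people(["7516448513106795305"]),
    topic_insights(["7516448513106795305", "8961164588331655920"]),
]
```

Topic ids are passed as strings because values such as 10000000000000008253 exceed the signed 64-bit range and would lose precision in clients that parse them as numbers.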
PAYLOADS
{
  topicSetType: "expertise",
  topicSet: [
    {
      topicId: "7516448513106795305",
      score: 0.999596145670965,
      strength: "strong",
      displayName: "APIs",
      name: "APIs",
      slug: "api",
      imageUrl: "http://kcdn3.klout.com/static/images/topics/api_6bae2a67e1a5a9b68d526b4d483c4eb8.png",
      displayType: "visible",
      topicType: "entity"
    },
    {
      topicId: "10000000000000008253",
      score: 0.9992839644220868,
      strength: "strong",
      displayName: "Twitter",
      name: "Twitter",
      slug: "twitter",
      imageUrl: "http://kcdn3.klout.com/static/images/icons/generic-topic.png",
      displayType: "visible",
      topicType: "entity"
    },
    {
      topicId: "8961164588331655920",
      score: 0.9992326280041798,
      strength: "strong",
      displayName: "Klout",
      name: "Klout",
      slug: "klout",
      imageUrl: "http://kcdn3.klout.com/static/images/klout-topic-image-1333588028647.jpg",
      displayType: "visible",
      topicType: "entity"

  topicSetType: "interest",
  topicSet: [
    {
      topicId: "10000000000000008253",
      score: 0.9946672348339362,
      strength: "strong",
      displayName: "Twitter",
      name: "Twitter",
      slug: "twitter",
      imageUrl: "http://kcdn3.klout.com/static/images/icons/generic-topic.png",
      displayType: "visible",
      topicType: "entity"
    },
    {
      topicId: "6485494992525344250",
      score: 0.9918719149780779,
      strength: "strong",
      displayName: "Marketing",
      name: "Marketing",
      slug: "marketing",
      imageUrl: "http://kcdn3.klout.com/static/images/topics/people.png",
      displayType: "visible",
      topicType: "sub"
    },
    {
      topicId: "7516448513106795305",
      score: 0.9888798650771197,
      strength: "strong",
      displayName: "APIs",
      name: "APIs",
      slug: "api",
      imageUrl: "http://kcdn3.klout.com/static/images/topics/api_6bae2a67e1a5a9b68d526b4d483c4eb8.png",
      displayType: "visible",
      topicType: "entity"
    },
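Because every TopicSet shares one shape, a client can compare sets directly. The sketch below finds topics a user is both expert in and interested in, using a trimmed copy of the payload fields shown above (only `topicId`, `score`, and `slug` are kept). The helper name `shared_topics` is hypothetical, and the data is written as Python dictionaries since the slide's payload uses unquoted keys.

```python
# Trimmed from the slide's payloads; other fields omitted for brevity.
expertise = {
    "topicSetType": "expertise",
    "topicSet": [
        {"topicId": "7516448513106795305", "score": 0.999596145670965, "slug": "api"},
        {"topicId": "10000000000000008253", "score": 0.9992839644220868, "slug": "twitter"},
        {"topicId": "8961164588331655920", "score": 0.9992326280041798, "slug": "klout"},
    ],
}
interest = {
    "topicSetType": "interest",
    "topicSet": [
        {"topicId": "10000000000000008253", "score": 0.9946672348339362, "slug": "twitter"},
        {"topicId": "6485494992525344250", "score": 0.9918719149780779, "slug": "marketing"},
        {"topicId": "7516448513106795305", "score": 0.9888798650771197, "slug": "api"},
    ],
}


def shared_topics(first, second):
    """Slugs present in both sets, ordered by score in the first set (highest first)."""
    other_ids = {t["topicId"] for t in second["topicSet"]}
    common = [t for t in first["topicSet"] if t["topicId"] in other_ids]
    common.sort(key=lambda t: -t["score"])
    return [t["slug"] for t in common]


overlap = shared_topics(expertise, interest)
```

Matching on `topicId` rather than `name` or `slug` keeps the comparison stable if display names change as the ontology evolves.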
Let’s get practical, prescriptive and talk about the future 4
PARAMETERIZATION
• Topics Scoring uses different models in each topic set
• Overall Topic Scoring is based on hundreds of features, weights, decays, spanning short and long term
• Parameterize scoring for different contexts
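Parameterizing scoring for different contexts can be pictured as one scoring function fed by per-context parameter sets. The sketch below blends a short-term and a long-term decayed sum; the `ScoringParams` fields, the two named contexts, and all numeric values are hypothetical, and the real models use hundreds of features rather than a single event value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoringParams:
    short_half_life_days: float
    long_half_life_days: float
    short_term_weight: float  # share of the score given to the short-term part, 0..1


# Hypothetical parameter sets for two contexts.
PARAMS = {
    "trending": ScoringParams(2.0, 90.0, 0.8),
    "expertise": ScoringParams(7.0, 365.0, 0.2),
}


def decayed_sum(events, half_life_days):
    """Sum event values, halving each value's contribution every half-life."""
    return sum(value * 0.5 ** (age / half_life_days) for age, value in events)


def topic_score(events, params):
    """Blend short-term and long-term decayed sums of (age_days, value) events."""
    short_term = decayed_sum(events, params.short_half_life_days)
    long_term = decayed_sum(events, params.long_half_life_days)
    w = params.short_term_weight
    return w * short_term + (1 - w) * long_term


old_activity = [(60, 1.0)]  # one event, 60 days old
trending_score = topic_score(old_activity, PARAMS["trending"])
expertise_score = topic_score(old_activity, PARAMS["expertise"])
```

With these assumed parameters, two-month-old activity barely registers in the trending context but still carries most of its weight in the expertise context, which is the kind of difference a per-context parameter set is meant to express.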
EXAMPLES
Use interchangeable, specified models with rule modifiers
EXAMPLES
• When an API is treated like a product, you must think through the implementations others would make.
• Maybe even make them your own.
POLICY
• Data is great.
• Representation of data is hard.
• Raw data rarely if ever needs to be displayed.
• Balance innovation on data assets with brand, utility, and allowed use cases.
KLOUT RESEARCH ONLINE
• LASTA