BEHIND NOTEY’S ANALYTICS ENGINE: FROM HINDSIGHT TO INSIGHT TO FORESIGHT

Notey's talk 20160923



Page 1: Notey's talk 20160923

BEHIND NOTEY’S ANALYTICS ENGINE

FROM HINDSIGHT TO INSIGHT TO FORESIGHT

Page 2: Notey's talk 20160923

WHAT IS NOTEY?

Page 3: Notey's talk 20160923

INSIDE NOTEY

Page 4: Notey's talk 20160923

WHY ARE WE BETTER THAN GOOGLE?

Page 5: Notey's talk 20160923

1. MORE RELEVANT SEARCH RESULTS

Page 6: Notey's talk 20160923

2. CURATED CONTENT

Page 7: Notey's talk 20160923

3. CONTEXTUAL TARGETING

Page 8: Notey's talk 20160923

HOW DO WE USE BIG DATA?

Page 9: Notey's talk 20160923

DATA @ NOTEY

Each day, Notey:

1. Crawls 10,000+ existing blogs

2. Imports ~50 new blogs

3. Imports 15,000 articles

4. Adds 2,000+ new topics

5. Processes 10 GB of logs (user logs, access logs, crawl logs)

6. Generates thousands of recommendations and insights

Page 10: Notey's talk 20160923

SCALING NOTEY

Java + Node.js for web servers, batch servers and API engines

Ehcache for scalability and to boost database performance

MySQL (master + slaves) for persistent data storage

Solr search engine for live queries, R for statistical processing

Google BigQuery for big data processing

Amazon EC2, S3 + CloudFront for compute and delivery

Elastic Load Balancing (ELB) for load balancing across instances and real-time scalability

Page 11: Notey's talk 20160923

BIG DATA TECHNOLOGIES

Apache Spark
• Open source
• Supports native Java applications
• Real-time processing
• Can run up to 100x faster than traditional Hadoop MapReduce

Google BigQuery
• Simple SQL syntax
• Easy to use, even for non-developers
• Suitable for ad-hoc queries
• Analytics as a service: a hassle-free, managed service

Page 12: Notey's talk 20160923

OUR ALGORITHMS

Page 13: Notey's talk 20160923

HOT RANKING ALGORITHM

A story-ranking algorithm that combines relevancy, popularity and time of posting to deliver a compelling online experience. It is a simple, open-source algorithm of the kind used by sites such as amazon.com and reddit.com.

Scoring is based on:

1. Submission time
Newer stories rank higher than older ones, based on a logarithmic algorithm.

2. Number of upvotes
The first 10 upvotes count as much as the next 100.

3. Relevancy
The more keywords a story matches in its title and content, the higher its score.
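The three scoring factors above can be sketched roughly as follows. This is a minimal illustration modeled on the publicly known Reddit-style "hot" formula, not Notey's actual implementation. The names `hot_score`, `EPOCH`, `DECAY_SECONDS` and `relevancy_weight` are hypothetical, and so are their values. In this sketch the logarithm is applied to the upvote count, which is what makes the first 10 upvotes count as much as the next 100, while submission time is assumed to enter linearly.

```python
import math
from datetime import datetime, timezone

# Hypothetical reference date; only differences between scores matter.
EPOCH = datetime(2016, 1, 1, tzinfo=timezone.utc)
# Assumed constant: 45,000 s (12.5 h) of recency offsets a 10x gain in upvotes.
DECAY_SECONDS = 45000


def hot_score(upvotes, posted_at, keyword_matches=0, relevancy_weight=0.5):
    """Combine popularity, recency and relevancy into one ranking score."""
    # Popularity: log10 means 1 -> 10 upvotes adds as much as 10 -> 100.
    vote_term = math.log10(max(upvotes, 1))
    # Recency: later submission times give a larger term (linear in time).
    age_term = (posted_at - EPOCH).total_seconds() / DECAY_SECONDS
    # Relevancy: each keyword matched in title or content adds a fixed bonus.
    relevancy_term = relevancy_weight * keyword_matches
    return vote_term + age_term + relevancy_term
```

Stories would then be sorted by `hot_score` in descending order; the weights would need tuning against real traffic.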

Page 14: Notey's talk 20160923

TOPIC CLASSIFICATION

A data-mining, statistics-based approach to classifying articles into relevant topics:

1. Keyword / title / tags analysis
2. Text analysis
3. Machine learning
4. Topic merging & disabling
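The first step, keyword / title / tags analysis, might look something like the sketch below. It is a simple weighted keyword-matching illustration under assumed inputs, not Notey's classifier. The topic keyword sets, the weights and the `min_score` threshold are all hypothetical, and the machine learning and topic merging steps are not covered.

```python
import re
from collections import Counter

# Hypothetical topic -> keyword sets; a real system would learn or curate these.
TOPIC_KEYWORDS = {
    "food": {"restaurant", "recipe", "dessert", "cafe"},
    "travel": {"flight", "hotel", "itinerary", "beach"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def classify(title, tags, body, title_weight=3, tag_weight=2, min_score=2):
    """Return matching topics, best first, weighting title and tags above body."""
    counts = Counter()
    sources = ((title_weight, title), (tag_weight, " ".join(tags)), (1, body))
    for weight, text in sources:
        for token in tokenize(text):
            counts[token] += weight
    # A topic's score is the weighted count of its keywords found in the article.
    scores = {
        topic: sum(counts[word] for word in keywords)
        for topic, keywords in TOPIC_KEYWORDS.items()
    }
    matched = [topic for topic, score in scores.items() if score >= min_score]
    return sorted(matched, key=lambda topic: -scores[topic])
```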

Page 15: Notey's talk 20160923

USER CENTRIC PERSONALISATION

A machine learning mechanism that recommends stories based on users’ persona, interests, activities and demographics.

1. Collaborative filtering
Analyzing a large amount of information on user behaviour and predicting user preferences based on their similarity to other users.

2. Content-based filtering
Based on a description of the item and a profile of the user’s preferences.

3. Hybrid recommender systems
Collaborative filtering + content-based filtering
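To make the three approaches concrete, here is a minimal sketch of a hybrid recommender, assuming user-based collaborative filtering with cosine similarity and a topic-vector user profile for the content-based part. The data layout, the function names and the blending weight `alpha` are hypothetical and are not taken from Notey's system.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[k] * b[k] for k in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def collaborative_scores(user, ratings):
    """Score unseen stories by the ratings of similar users."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        similarity = cosine(ratings[user], other_ratings)
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + similarity * rating
    return scores


def content_scores(user_profile, item_topics, seen):
    """Score unseen stories by how well their topics match the user's profile."""
    return {
        item: cosine(user_profile, topics)
        for item, topics in item_topics.items()
        if item not in seen
    }


def hybrid_recommend(user, ratings, user_profile, item_topics, alpha=0.5):
    """Blend both scores; alpha is the (assumed) weight of the collaborative part."""
    collab = collaborative_scores(user, ratings)
    content = content_scores(user_profile, item_topics, seen=ratings[user])
    blended = {
        item: alpha * collab.get(item, 0.0) + (1 - alpha) * content.get(item, 0.0)
        for item in set(collab) | set(content)
    }
    return sorted(blended, key=blended.get, reverse=True)
```

In practice the two score ranges would need normalizing before blending; the sketch skips that for brevity.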

Page 16: Notey's talk 20160923

SOCIAL NETWORK DATA ANALYSIS

Using big data analytics to uncover patterns and identify trending topics across different demographics.

Based on the daily blog content and user activities on Notey, we answer the following questions:

1. What are your friends talking about?

2. What is the hottest topic in Hong Kong right now?

3. Which movie will be a hit next week?
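A question such as "What is the hottest topic in Hong Kong right now?" can be approximated, in its simplest form, by counting topic mentions per demographic segment. The sketch below assumes activity arrives as hypothetical `(segment, topic)` pairs. It only illustrates the counting step, not the predictive analysis a question like the third one would need.

```python
from collections import Counter, defaultdict


def trending_topics(events, top_n=3):
    """Return the most frequent topics for each demographic segment.

    `events` is an iterable of (segment, topic) pairs, an assumed input shape.
    """
    counts = defaultdict(Counter)
    for segment, topic in events:
        counts[segment][topic] += 1
    return {
        segment: [topic for topic, _ in counter.most_common(top_n)]
        for segment, counter in counts.items()
    }
```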

Page 17: Notey's talk 20160923

SEARCH ENGINE OPTIMISATION

A feedback mechanism that improves web indexing and reduces keyword noise, based on search crawler activity.

SEO is done by analyzing:

1. Bot crawling logs
2. User browsing activity
3. User search behaviour
4. Google page ranking

Page 18: Notey's talk 20160923

EXPLORE THE UNIVERSE YOURSELF

www.notey.com

Currently available on web and iOS. Android app coming in October 2016.