rosanna-man
BEHIND NOTEY’S ANALYTICS ENGINE
FROM HINDSIGHT TO INSIGHT TO FORESIGHT
WHAT IS NOTEY?
INSIDE NOTEY
WHY ARE WE BETTER THAN GOOGLE?
1. MORE RELEVANT SEARCH RESULTS
2. CURATED CONTENT
3. CONTEXTUAL TARGETING
HOW DO WE USE BIG DATA?
DATA @ NOTEY
Each day, Notey:
1. Crawls 10,000+ existing blogs
2. Imports ~50 new blogs
3. Imports 15,000 articles
4. Adds 2,000+ new topics
5. Processes 10 GB of logs (user logs, access logs, crawl logs)
6. Generates thousands of recommendations and insights
SCALING NOTEY
Java + Node.js for web servers, batch servers and API engines
Ehcache for scalability and boosting database performance
MySQL (master + slaves) for persistent data storage
Solr search engine for live queries, R for statistical processing
Google BigQuery for big data processing
Amazon EC2, S3 + CloudFront for compute and delivery
Elastic Load Balancing (ELB) for load balancing across instances and real-time scalability
BIG DATA TECHNOLOGIES
Apache Spark
• Open source
• Supports native Java apps
• Real-time processing
• Can run up to 100x faster than traditional Hadoop MapReduce
Google BigQuery
• Simple SQL syntax
• Easy to use, even for non-developers
• Suitable for ad-hoc queries
• Analytics as a service: hassle-free and fully managed
OUR ALGORITHMS
HOT RANKING ALGORITHM
A story ranking algorithm that combines relevancy, popularity and time since posting to deliver a compelling online experience. It is a simple, open-source algorithm of the kind used by sites such as amazon.com and reddit.com.
Scoring is based on:
1. Submission time
Newer stories rank higher than older ones, based on a logarithmic algorithm.
2. Number of upvotes
The first 10 upvotes count as much as the next 100.
3. Relevancy
The more keywords a story matches in its title and content, the higher its score.
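The three scoring inputs above can be sketched in a few lines of Python. This is only an illustration, not Notey's actual formula, which the slides do not give. It assumes a Reddit-style combination in which upvotes are log-scaled and recency adds a steadily growing bonus. The names `hot_score`, `relevancy`, `REFERENCE`, `DECAY_SECONDS` and `relevancy_weight` are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from math import log10

# Hypothetical constants: any fixed reference point and time scale would work.
REFERENCE = datetime(2016, 1, 1, tzinfo=timezone.utc)
DECAY_SECONDS = 45000  # about 12.5 hours of age is worth one factor of 10 in upvotes


def relevancy(query_terms, title, content):
    """Count how many query terms appear in the story's title or content."""
    words = set((title + " " + content).lower().split())
    return sum(1 for term in query_terms if term.lower() in words)


def hot_score(upvotes, posted_at, relevancy_score=0.0, relevancy_weight=1.0):
    """Combine popularity, recency and relevancy into a single ranking score."""
    # log10 makes upvotes 1..10 worth as much as upvotes 10..100.
    vote_term = log10(max(upvotes, 1))
    # Later submission times get a larger bonus, so newer stories rank higher.
    age_term = (posted_at - REFERENCE).total_seconds() / DECAY_SECONDS
    return vote_term + age_term + relevancy_weight * relevancy_score


now = datetime(2016, 6, 1, tzinfo=timezone.utc)
older = hot_score(100, now - timedelta(days=1))
newer = hot_score(10, now)
print("newer story ranks first:", newer > older)
```

With these assumed constants, a story posted a day later outranks one with ten times as many upvotes, which is the kind of trade-off the weights would need tuning for.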
TOPIC CLASSIFICATION
A data mining, statistics-based approach to classifying articles into relevant topics:
1. Keyword / title / tags analysis
2. Text analysis
3. Machine learning
4. Topic merging & disabling
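As a rough illustration of the first step only (keyword, title and tag analysis), the sketch below scores an article against per-topic keyword sets and weights title and tag matches more heavily than body matches. The topics, keywords, weights and the `classify` function are all hypothetical. The slides do not describe the real classifier, which also involves text analysis and machine learning.

```python
import re
from collections import Counter

# Hypothetical topic vocabulary; a real system would learn or curate this.
TOPIC_KEYWORDS = {
    "food": {"restaurant", "recipe", "noodles"},
    "travel": {"flight", "hotel", "itinerary"},
}
# Assumed weights: a match in the title counts more than one in the body.
TITLE_WEIGHT, TAG_WEIGHT, BODY_WEIGHT = 3, 2, 1


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def classify(title, tags, body, min_score=2):
    """Return topics whose weighted keyword score reaches min_score, best first."""
    scores = Counter()
    fields = ((TITLE_WEIGHT, title), (TAG_WEIGHT, " ".join(tags)), (BODY_WEIGHT, body))
    for weight, text in fields:
        for token in tokenize(text):
            for topic, keywords in TOPIC_KEYWORDS.items():
                if token in keywords:
                    scores[topic] += weight
    return [topic for topic, score in scores.most_common() if score >= min_score]


print(classify("Best noodles in town", ["restaurant"], "We booked a hotel nearby"))
```

The `min_score` threshold keeps a single passing mention in the body from assigning a topic on its own.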
USER CENTRIC PERSONALISATION
A machine learning mechanism that recommends stories based on users’ persona, interests, activities and demographics.
1. Collaborative filtering
Analyzing a large amount of information on user behaviour and predicting user preferences based on their similarity to other users.
2. Content-based filtering
Based on a description of the item and a profile of the user’s preferences.
3. Hybrid recommender systems
Collaborative filtering + content-based filtering
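A minimal sketch of how the three approaches above can fit together, assuming user-based collaborative filtering with cosine similarity, tag overlap (Jaccard) for the content-based part, and a simple weighted blend for the hybrid. All function names, the `alpha` blend weight and the data layout are assumptions for illustration. The slides do not say which techniques Notey actually uses.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two users' {item: rating} dicts."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def collaborative_score(user, item, ratings):
    """Similarity-weighted average of other users' ratings for the item."""
    numerator = denominator = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        similarity = cosine(ratings[user], other_ratings)
        numerator += similarity * other_ratings[item]
        denominator += similarity
    return numerator / denominator if denominator else 0.0


def content_score(user_interests, item_tags):
    """Jaccard overlap between a user's interests and an item's tags."""
    union = user_interests | item_tags
    return len(user_interests & item_tags) / len(union) if union else 0.0


def hybrid_score(user, item, ratings, interests, item_tags, alpha=0.5):
    """Blend both signals; alpha is the weight on collaborative filtering."""
    collaborative = collaborative_score(user, item, ratings)
    content = content_score(interests[user], item_tags[item])
    return alpha * collaborative + (1 - alpha) * content


# Ratings are assumed to lie in [0, 1], e.g. 1.0 for "read and liked".
ratings = {"ann": {"a": 1.0, "b": 1.0}, "bob": {"a": 1.0, "b": 1.0, "c": 1.0}}
interests = {"ann": {"food", "travel"}}
item_tags = {"c": {"food"}}
print(hybrid_score("ann", "c", ratings, interests, item_tags))
```

Blending is useful because the content-based term still produces a score for a brand-new story that nobody has rated yet, where the collaborative term alone would return zero.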
SOCIAL NETWORK DATA ANALYSIS
Using big data analytics to uncover patterns and identify trending topics in different demographics.
Based on the daily blog content and user activities on Notey, we answer the following questions:
1. What are your friends talking about?
2. What is the hottest topic in Hong Kong right now?
3. Which movie will be a hit next week?
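A question such as "what is the hottest topic in Hong Kong right now?" can be approximated by comparing today's topic mentions in a region against each topic's usual daily volume. The sketch below is one simple way to do that, not Notey's method. The function name, the `(region, topic)` input layout, the baseline dictionary and the `smoothing` term are all assumptions.

```python
from collections import Counter


def trending_topics(mentions, baseline_daily_avg, region, top_n=3, smoothing=1.0):
    """Rank topics in a region by how far today's mentions exceed their baseline.

    mentions: iterable of (region, topic) pairs observed today.
    baseline_daily_avg: {topic: average mentions per day} from earlier days.
    """
    today = Counter(topic for r, topic in mentions if r == region)
    # Smoothing avoids division by zero for topics with no history.
    lift = {
        topic: count / (baseline_daily_avg.get(topic, 0.0) + smoothing)
        for topic, count in today.items()
    }
    # Highest lift first; ties broken alphabetically for a stable order.
    return sorted(lift, key=lambda topic: (-lift[topic], topic))[:top_n]


mentions = [("HK", "typhoon")] * 8 + [("HK", "dim sum")] * 10 + [("SG", "haze")] * 50
baseline = {"dim sum": 9.0}
print(trending_topics(mentions, baseline, region="HK", top_n=2))
```

Ranking by lift rather than raw count is what lets a newly emerging topic outrank one that is always popular.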
SEARCH ENGINE OPTIMISATION
A feedback mechanism that improves web indexing and reduces keyword noise based on search crawler activity.
SEO is done by analyzing:
1. Bot crawling logs
2. User browsing activity
3. User search behaviour
4. Google page ranking
EXPLORE THE UNIVERSE YOURSELF
www.notey.com
Currently available on web and iOS. Android app coming in October 2016.