Drone Emprit: Concepts and Technology
Ismail Fahmi, PhD, Drone Emprit
Media Kernels Indonesia
IT CAMP – BIG DATA & DATA MINING
Onno Center, Situ Gintung - Jakarta
1 October 2017
1992 – 1997: Bachelor's degree (S1), Electrical Engineering, ITB
2003 – 2004: Master's degree (S2), Computational Linguistics, University of Groningen, the Netherlands
2004 – 2009: Doctorate (S3), Computational Linguistics, University of Groningen, the Netherlands
2000 – 2003: Initiator of IndonesiaDLN (the first Digital Library Network in Indonesia); developed Ganesha Digital Library (GDL); founded the Knowledge Management Research Group (KMRG) ITB; built the ITB Digital Library
2009 – present: Engineer at Weborama, a big data company (Paris/Amsterdam)
2012 – present: Co-Founder of Awesometrics, a media monitoring & analytics company
2014 – present: Founder of PT. Media Kernels Indonesia, a natural language processing company
2015 – present: Consultant to the National Library of Indonesia (Perpustakaan Nasional), initiator of Indonesia OneSearch
2017 – present: Permanent lecturer, Master's program in Informatics Engineering, Universitas Islam Indonesia
Ismail Fahmi, [email protected]
Agenda
SESSION 1
• Concepts
  • About Drone Emprit
  • Data, the new gold mine
  • Architecture & Features
• Technology
  • Crawler
    • Twitter
    • Facebook
    • Online News
  • Indexing
    • Sharding
    • Replication
  • Analytics
    • Sentiment Analysis
    • Opinion Analysis
    • Term Extraction
    • Clustering
    • Social Network Analysis
  • Visualization
SESSION 2
• Case Studies
  • Analysis of the West Java regional election (Pilkada Jawa Barat)
  • Analysis of the pro and contra PKI debate
  • Reading media agenda setting
• Demo
  • Creating a new monitoring topic
  • Reading the analysis results
  • Editing sentiment
  • Social Network Analysis
About Drone Emprit
Media Kernels, a.k.a. Drone Emprit
• A system for monitoring and analyzing online and social media, built on big data technology.
• Developed since 2009 in Amsterdam, the Netherlands, by Indonesian engineers, through Media Kernels Netherlands B.V.
• In use in Indonesia since 2012.
• Based on Artificial Intelligence (Machine Learning) and Natural Language Processing (NLP).
• Known as ‘Drone Emprit’ in coverage on TV and in national media.
Drone Emprit
2-8 January 2017
TEMPO
Topic: Hoax farms on social media
Media Kernels:
• Reported under the name ‘Drone Emprit’.
• Presented Social Network Analysis (SNA) maps showing where a hoax originates, how it spreads, who the main influencers are, and which groups they belong to.
• Issues analyzed included the 10 million Chinese workers claim and Aleppo (ISIS).
TEMPO COVER STORY (LAPORAN UTAMA), 2-8 January 2017
confidential
12 January 2017
KANTOR STAF PRESIDEN (Presidential Staff Office)
Case: A hoax attacking the government about 10 million illegal Chinese workers.
Media Kernels:
• Presented two case studies: the 10 million illegal Chinese workers claim, and negative sentiment toward the anti-hoax movement.
• Showed the timeline of the issue's resonance and a conversation map built with the SNA feature.
• Showed that government communication was not effective enough, and what could be done to improve it.
FOCUS GROUP DISCUSSION (FGD) ON PUBLIC RELATIONS FOR ALL MINISTRIES AND AGENCIES AT THE PRESIDENTIAL STAFF OFFICE (KSP)
22 March 2017
MATA NAJWA
Case: Virus Dusta (the "virus of lies", i.e. hoaxes)
Speakers:
• Stanley (Press Council)
• Johan Budi (Presidential Special Staff)
• Boy Rafli (National Police Public Relations)
• Ismail Fahmi (MK)
• Septiaji & Khairul Anshar (Masy. Anti Hoax, the Anti-Hoax Society)
Media Kernels:
• Presented an analysis of the 10 million illegal Chinese workers claim.
• The TNI Commander vs PKI hoax.
MATA NAJWA LIVE ‘VIRUS DUSTA’
Data is New Gold
6 May 2017
Data Collection: Gold = Expensive
Free Data
Twitter Analysis: World Eco. Forum 2016
https://medium.com/@swainjo/wef16-davos-twitter-sna-analysis-4c38cf4bc46d
Architecture
MK Big Data Architecture
[Diagram labels: News Crawler, Twitter Crawler, Twitter Streaming, FB Page Crawler, Google Custom Search, Other sources; Data Pipeline; Data Ingest Management & Queue; Realtime Job Processing; Scheduled Job Processing; Map Reduce; Sentiment Analysis; Other Processings; SOLR Indexer 1-4; Hadoop Framework; Database Framework; Physical Hardware; Data & Workflow Management; Data Access; Visualization; Analytics UI; Insight]
System Architecture
[Diagram labels. Sources: Social Media (Search + JSON); Online News via RSS + HTML (Detik (ID), Reuters (EN), etc.) and via HTML (Gatra (ID), Bloomberg (EN), etc.); Forums via HTML (Kaskus, Detik Forum, etc.); Twitter Stream (JSON); Kompas, Warta Ekonomi, etc. (TEXT; PUSH JSON Subscriber). Crawlers: Search + Account Crawler, RSS + HTML Crawler, HTML Crawler (x2), Converter. Processing and storage: Redis Queue, Cache Manager, Keywords + Accounts Filters (deletes), Backtrack Filters, Sentiment Analysis, Sentiment Models, Analyses, Projects Storage, Mentions Storage. Index Servers: SOLR Nodes Shard 1 ... Shard N. Client(s): Control Room Screens, smartphones, tablets, desktops.]
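The System Architecture diagram names a Redis Queue, Keywords + Accounts Filters, and Mentions Storage. A minimal sketch of how such a filter stage might work is given here, assuming crawled posts arrive as dictionaries with `author` and `text` keys. The standard-library `queue.Queue` stands in for Redis, and the function names and post layout are hypothetical, not taken from Drone Emprit.

```python
from queue import Queue


def matches(post, keywords, accounts):
    """True if the post comes from a tracked account or contains a tracked keyword."""
    text = post["text"].lower()
    return post["author"].lower() in accounts or any(k in text for k in keywords)


def drain(queue, keywords, accounts):
    """Consume every queued post and keep only the matching ones as mentions."""
    mentions = []
    while not queue.empty():
        post = queue.get()
        if matches(post, keywords, accounts):
            mentions.append(post)
    return mentions


# Stand-in for the Redis queue, filled with hypothetical crawler output.
incoming = Queue()
incoming.put({"author": "newsbot", "text": "Big data camp opens in Jakarta"})
incoming.put({"author": "someone", "text": "Nice weather today"})
incoming.put({"author": "TrackedUser", "text": "Unrelated remark"})

mentions = drain(incoming, keywords={"big data"}, accounts={"trackeduser"})
```

Here the first post is kept because of the keyword and the third because of the account; the second is dropped.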
Media Kernels Features
Trends
DASHBOARD
Comparison
Topic Map
NEWS PORTAL
Latest News
Media
ANALYTICS
News Sites
Page Ranks
Sentiment Analysis
PF-Chart
Engagement
Exposure
Retweets
TOPICS
Replies
Most Shared URLs
Most Shared Videos
Topic Map
Word Cloud
Impact
INFLUENCERS
Engagement
Reach
Most Engaged
Followers
Influencer Network
SNA
Topic Network
PR-Values
Reach
Hashtags Posts
Bubble Map
Twitter User Map
DEMOGRAPHY
User Locations
Edit Sentiments
MENTIONS
Training & Learning
Backtracking
Compare SNA
COMPARE
Compare Projects
Popularity vs Favorability
Background Jobs
Upload Report
REPORTING
Download Report
User Management
ADMIN
Project Management
Client Management
Source Management
Label and Training
OPINION ANALYSIS
Opinion Chart
Insight Explorer
News Crawler
Online News
And hundreds of non-mainstream media outlets
Crawling Online News
Crawler, Index Server
Web Crawler Tools
http://bigdata-madesimple.com/top-50-open-source-web-crawlers-for-data-mining/
Web Crawler Tools (2)
Example: Scrapy.org
Web Crawler Drone Emprit
Built in-house, powered by:
Anatomy: Metadata and Full Text
Keep: date, title, article body, author, image URL
Discard: ads, headline lists, comments.
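The keep/discard rule above can be sketched with Python's standard `html.parser`. This is only an illustration under stated assumptions: the class names (`title`, `author`, `content`, `ads`, `headlines`, `comments`) are hypothetical, the HTML is assumed to be well formed, and the actual Drone Emprit crawler is built in-house with tooling the slides do not detail.

```python
from html.parser import HTMLParser

# Hypothetical class names: every news site needs its own mapping.
SKIP_CLASSES = {"ads", "headlines", "comments"}
FIELD_BY_CLASS = {"title": "title", "author": "author", "content": "body"}
VOID_TAGS = {"img", "br", "hr", "meta", "link", "input"}


class ArticleParser(HTMLParser):
    """Collects article fields and ignores ads, headline lists, and comments."""

    def __init__(self):
        super().__init__()
        self.date = None
        self.image_url = None
        self.chunks = {"title": [], "author": [], "body": []}
        self.stack = []  # one marker per open tag: "skip", a field name, or None

    def _current_field(self):
        # Innermost enclosing field, or None when inside a skipped block.
        if "skip" in self.stack:
            return None
        return next((m for m in reversed(self.stack) if m), None)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "time" and "skip" not in self.stack:
            self.date = attrs.get("datetime")
        if tag == "img" and self._current_field() == "body" and self.image_url is None:
            self.image_url = attrs.get("src")
        if tag in VOID_TAGS:
            return  # void tags have no closing tag, so they are not tracked
        classes = (attrs.get("class") or "").split()
        if any(c in SKIP_CLASSES for c in classes):
            self.stack.append("skip")
        else:
            field = next((FIELD_BY_CLASS[c] for c in classes if c in FIELD_BY_CLASS), None)
            self.stack.append(field)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        field = self._current_field()
        if field:
            self.chunks[field].append(data)


def extract_article(html):
    parser = ArticleParser()
    parser.feed(html)
    parser.close()
    record = {k: " ".join(" ".join(v).split()) for k, v in parser.chunks.items()}
    record["date"] = parser.date
    record["image_url"] = parser.image_url
    return record


SAMPLE_HTML = """<html><body>
<div class="ads">Buy now!</div>
<h1 class="title">Sample headline</h1>
<time datetime="2017-10-01">1 October 2017</time>
<span class="author">A. Reporter</span>
<ul class="headlines"><li>Other story</li></ul>
<div class="content"><img src="http://example.com/a.jpg"><p>First paragraph.</p><p>Second paragraph.</p></div>
<div class="comments"><p>Nice article</p></div>
</body></html>"""

article = extract_article(SAMPLE_HTML)
```

A real crawler would need one such mapping per news site, plus handling for malformed markup, which this sketch leaves out.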
Twitter API
API: search/tweets
Example: Free Twitter Search
History: 7 days
Start search
100% results
API: Realtime (Sample)
Random Sample
All Statuses
Less than 10%
API: Realtime (Filter)
Filtered Statuses
All Statuses
~ 100%
POST statuses/filter
Filter: max 400 keywords
API: > 400 keywords?
All Statuses
Server IP Addr 1: max 400 keywords
Server IP Addr 2: max 400 keywords
Server IP Addr n: max 400 keywords
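The idea on this slide, spreading more than 400 keywords over several servers that each hold their own filtered stream, can be sketched as a simple partitioning step. The function names and the server labels are hypothetical; the 400-keyword limit is the figure given on the slide, and the actual stream connection is left out.

```python
def partition_keywords(keywords, limit=400):
    """Split a keyword list into batches of at most `limit` unique keywords."""
    unique = list(dict.fromkeys(k.strip().lower() for k in keywords if k.strip()))
    return [unique[i:i + limit] for i in range(0, len(unique), limit)]


def assign_to_servers(keywords, servers, limit=400):
    """Give each server (one IP address each) its own batch of keywords."""
    batches = partition_keywords(keywords, limit)
    if len(batches) > len(servers):
        raise ValueError(f"need {len(batches)} servers, only {len(servers)} available")
    return dict(zip(servers, batches))


# Hypothetical input: 1000 keywords and three servers.
all_keywords = [f"keyword{i}" for i in range(1000)]
plan = assign_to_servers(all_keywords, ["server-1", "server-2", "server-3"])
```

Each server would then open its own `POST statuses/filter` connection with only its batch, so no single connection exceeds the limit.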
Twitter API Tools
Net::Twitter
Twitter API: Drone Emprit
Net::Twitter
AnyEvent::Twitter::Stream
Facebook API
FB API (v1): Public Search
April 2014 → discontinued by Facebook
FB API (v2): Searching
FB API (v2): Object
https://graph.facebook.com/$object_id/$type?fields=id,parent_id,from,to,type,status_type,story,message,link,likes.summary(true),shares,comments.order(reverse_chronological).summary(true),created_time,updated_time&order=reverse_chronological&access_token=$access_token&limit=$limit&until=$last_timestamp
$object_id = FB Page ID, etc
$type = [feed, comment, ...]
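The URL template above can be filled in programmatically. The sketch below only builds the request URL from the fields shown on the slide; it sends nothing, and whether current Graph API versions still accept these fields is not checked here. The function name and the sample values are hypothetical.

```python
from urllib.parse import urlencode

# Field list copied from the slide's URL template.
FIELDS = [
    "id", "parent_id", "from", "to", "type", "status_type", "story", "message",
    "link", "likes.summary(true)", "shares",
    "comments.order(reverse_chronological).summary(true)",
    "created_time", "updated_time",
]


def build_graph_url(object_id, edge, access_token, limit=100, until=None):
    """Build the Graph API URL for one page of an object's edge (feed, comments, ...)."""
    params = {
        "fields": ",".join(FIELDS),
        "order": "reverse_chronological",
        "access_token": access_token,
        "limit": limit,
    }
    if until is not None:
        # Timestamp of the oldest item already fetched, used to page backwards.
        params["until"] = until
    # Keep commas and parentheses readable instead of percent-encoding them.
    return f"https://graph.facebook.com/{object_id}/{edge}?{urlencode(params, safe=',()')}"


# Hypothetical page ID, token, and timestamp.
url = build_graph_url("12345", "feed", "TOKEN", limit=25, until=1506816000)
```

Passing the timestamp of the last item received as `until` on the next call is one way to walk backwards through a page's history, which is what `$last_timestamp` in the template suggests.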
FB API Tools
Facebook::Graph
fb 0.4.0
FB API: Drone Emprit
Built in-house, powered by:
WWW::Curl
Question: Perl or Python?
Of course!
Why Perl?
Perl (the pearl) is what helped humankind after the fall to earth, and is of course more ‘nyunah’ (in keeping with the sunnah)
Python (the serpent) is what tempted Adam and Eve, who were then sent down from heaven
Search Engine/Indexing
Full Text Indexing
Data Sources Search Engine
Full Text Search Engines
Search Engine: Drone Emprit
Simple - Powerful - Robust - Scalable
Solr Server Configuration
Sharding
Replication
Analytics
Analytics: Server Configuration
Slave Analysis Results
Analysis Processes
Analytics Engine
Search by Keywords
News, Tweets, Statuses, etc.
Sentiment Analysis
Opinion Analysis
Term Extraction
Segmentation
Quote Extraction
Named Entity Recognition
Search Results
Paragraph Segmentation
NEWS ARTICLES MENTIONS
Sentiment Analysis
Sentiment Analysis
Positive
Negative
Neutral
?
MENTIONS
Sentiment Analysis
Positive
?
MENTIONS
For Setya Novanto
Sentiment Analysis
Negative
?
MENTIONS
For KPK
Sentiment Analysis
Neutral
?
MENTIONS
For Judge Cepi Iskandar
Sentiment Analysis Techniques
http://www.sciencedirect.com/science/article/pii/S2090447914000550
Evaluation
A “one model for all” approach cannot assign the right label for every subject.
A lexicon-based approach depends on whether the words appear in the sentiment dictionary, and cannot assign the right label for different subjects.
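The limitation named above can be seen in a minimal lexicon-based scorer. The tiny word list and the example sentence are hypothetical, chosen only to show that the label comes from dictionary words and stays the same whichever subject the mention is read for.

```python
import re

# Hypothetical miniature sentiment dictionary: word -> polarity.
LEXICON = {"good": 1, "great": 1, "wins": 1, "bad": -1, "loses": -1, "corrupt": -1}


def lexicon_sentiment(text):
    """Label a text by summing the polarity of the dictionary words it contains."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(LEXICON.get(token, 0) for token in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


mention = "Candidate A wins the case against Agency B"

# The scorer has no notion of a subject, so both readings get the same label.
label_for_candidate_a = lexicon_sentiment(mention)
label_for_agency_b = lexicon_sentiment(mention)
```

Both labels come out as positive, although the same mention is arguably negative for Agency B. A text whose words are all missing from the dictionary is labelled neutral regardless of its tone.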
Sentiment Analysis Tools
https://breakthroughanalysis.com/2012/01/08/what-are-the-most-powerful-open-source-sentiment-analysis-tools/
Text Mining Module
Sentiment Analysis: Drone Emprit
Adaptive Multiple Models
Training Data
DOI: 10.1109/ICMLA.2015.22
81,000
Opinion Analysis
Kapolri: Opinion Analysis
With the National Police Public Relations Division (DivHumas Polri) on Kompas Petang
MK Opinion Analysis Feature
Analysis of the Statistics
Reading the Voice, not the Noise
Analysis Affected by Noise
Unfortunately, it was this ‘noise’-based analysis that went viral.
Opinion Analysis Techniques
Drone Emprit: Regular Expression
Opinion Analysis
Quote Extraction
Quote Extraction
QUOTE
QUOTE HOLDER
Quote Extraction: Drone Emprit
Pattern matching with regular expressions
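A minimal sketch of quote extraction by pattern matching follows. It covers a single pattern, a quoted passage followed by a reporting verb and a capitalized name. The verb list and the sample sentences are assumptions for illustration, not the patterns Drone Emprit uses.

```python
import re

# One illustrative pattern: "quote," <reporting verb> <Capitalized Name>.
# The English and Indonesian reporting verbs listed here are assumptions.
QUOTE_PATTERN = re.compile(
    r'["“](?P<quote>[^"”]+)["”],?\s+'
    r'(?:said|says|kata|ujar|ucap)\s+'
    r'(?P<holder>[A-Z]\w*(?:\s+[A-Z]\w*)*)'
)


def extract_quotes(text):
    """Return (quote, quote holder) pairs found in the text."""
    return [
        (match.group("quote").strip(" ,"), match.group("holder"))
        for match in QUOTE_PATTERN.finditer(text)
    ]


sample = (
    'The plan is on track. "We will finish the index this week," said Budi Santoso '
    'on Monday. "Datanya sudah lengkap," kata Rina.'
)
quotes = extract_quotes(sample)
```

A production rule set would need further patterns, for example the holder placed before the quote, or indirect speech, each added as another expression.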
Named Entity Recognition
Named Entity Recognition
LOCATION PERSON ORGANIZATION
NER Tools
NER: Drone Emprit
NER Example
Clustering
Clustering
Clustering Types
Clustering Tools
http://bonsai.hgc.jp/~mdehoon/software/cluster/software.htm
Topic Map: Document Clustering
Social Network Analysis
SNA: Social Network Analysis
• SNA maps the relations between people, organizations, topics, locations, and other information entities.
• A node (point) in the network represents a person, an organization, a location, or an information entity.
• The lines connecting the nodes represent the relations between them.
Betweenness Centrality
Betweenness Centrality: a measure of centrality.
Highest betweenness centrality (8 connections)
Lowest betweenness centrality (4 connections)
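Betweenness centrality counts how often a node lies on the shortest paths between other nodes. Below is a minimal sketch using Brandes' algorithm for an unweighted, undirected graph. The example graph, two triangles joined by a bridge node `d`, is hypothetical and is not the network shown on the slide.

```python
from collections import deque


def build_graph(edges):
    """Build an undirected adjacency map from a list of (u, v) pairs."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def betweenness_centrality(graph):
    """Unnormalized betweenness centrality (Brandes' algorithm, unweighted graph)."""
    centrality = {v: 0.0 for v in graph}
    for source in graph:
        order = []                               # nodes in order of discovery
        predecessors = {v: [] for v in graph}    # previous nodes on shortest paths
        paths = {v: 0 for v in graph}            # number of shortest paths from source
        distance = {v: -1 for v in graph}
        paths[source], distance[source] = 1, 0
        queue = deque([source])
        while queue:                             # breadth-first search from source
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if distance[w] < 0:
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:
                    paths[w] += paths[v]
                    predecessors[w].append(v)
        dependency = {v: 0.0 for v in graph}
        for w in reversed(order):                # accumulate from the farthest nodes back
            for v in predecessors[w]:
                dependency[v] += paths[v] / paths[w] * (1 + dependency[w])
            if w != source:
                centrality[w] += dependency[w]
    # Each unordered pair was counted from both ends, so halve the totals.
    return {v: c / 2 for v, c in centrality.items()}


edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"),
         ("d", "e"), ("e", "f"), ("e", "g"), ("f", "g")]
scores = betweenness_centrality(build_graph(edges))
```

In this graph the bridge node `d` has only two connections but the highest score, because every path between the two triangles passes through it. This is why betweenness is read differently from a plain connection count.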
Anatomy of a Tweet
Retweet Relations
Link Functions: Retweet / Mention
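Retweets and mentions can each be turned into directed edges of a network. The sketch below derives them from the tweet text alone, assuming the classic `RT @user:` prefix for retweets. The function name and the sample tweets are hypothetical, and tweet objects returned by the Twitter API also carry structured fields that would be more reliable than text parsing.

```python
import re

RETWEET = re.compile(r"^RT @(\w+):")
MENTION = re.compile(r"@(\w+)")


def tweet_edges(author, text):
    """Return (source, target, relation) edges for one tweet."""
    retweet = RETWEET.match(text)
    if retweet:
        # A retweet links the retweeting user to the original author only;
        # mentions inside the retweeted text belong to the original tweet.
        return [(author, retweet.group(1), "retweet")]
    targets = dict.fromkeys(MENTION.findall(text))  # de-duplicate, keep order
    return [(author, user, "mention") for user in targets if user != author]


retweet_edges = tweet_edges("alice", "RT @bob: hello @carol")
mention_edges = tweet_edges("dave", "@alice @bob thanks @alice")
```

Collecting the `retweet` edges over all tweets gives a retweet network and the `mention` edges a mention network, and the two can then be laid out and analyzed separately.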
Retweet Network
Mention Network
Information Arbitrage
Information arbitrage: translate information across groups
Visualization
User Dashboard
Analysis Results
Slave
Visualization Tools
D3js.org
Drone Emprit is Hiring
System Administrator & Programmer
Thank you
Ismail Fahmi, PhD
Drone Emprit
PT Media Kernels Indonesia
Email: [email protected]: 0812 8908 3894