Multi-Domain Alias Matching Using Machine Learning
Amendra Shrestha (1), Lisa Kaati (1), Michael Ashcroft (1), Fredrik Johansson (2)
(1) Uppsala University
(2) Swedish Defence Research Agency (FOI)
September 5, 2016
Outline Introduction Methodology Experiments & Results Summary
1 Introduction: Multiple aliases, Online anonymity
2 Methodology
3 Experiments & Results
4 Summary
Multiple aliases
Possible reasons for multiple aliases
• Banned by administrator
• Banned for inactivity
• Lost trust of the group
• Developed bad personal relationships
• To support his arguments
• Privacy reasons
Online anonymity
People are often open about who they are online, but sometimes they want to remain anonymous, for example when:
• Spreading terrorism propaganda
• Performing "hate crimes" online
• Participating in political debates
• Acting as whistle blowers
• Protesting against totalitarian regimes
Obtaining online anonymity
• Creation of anonymous user accounts (potentially in combination with use of Tor or internet cafes).
Example of Author identification (manually)
• Theodore John "Ted" Kaczynski ("the Unabomber")
• Bombing campaign against people involved with modern technology
• Killed 3 people and injured 23 others
• Manifesto: "Industrial Society and Its Future"
• Stopped after his brother recognized the writing style
Attacking online anonymity
Can the user be identified anyway?
• Stylometric profiling (S)
• Time-based profiling (T)
• Emotion-based profiling (E)
• Electronic posts from social media (blog posts, tweets, forum posts, etc.)
• Textual content
• Metadata (e.g. publishing times)
[Diagram: fingerprints are created from this data; the fingerprint of the anonymous user is compared with the fingerprints of known users.]
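The comparison step in the diagram can be illustrated with a minimal sketch. It assumes each fingerprint is a plain numeric feature vector and uses cosine similarity to rank known users. Both choices are assumptions made for illustration; the experiments in this talk use trained classifiers instead. The user names and vectors are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero fingerprint matches nothing
    return dot / (norm_a * norm_b)

def rank_candidates(anonymous, known):
    """Return (user, similarity) pairs, most similar first."""
    scores = [(user, cosine_similarity(anonymous, fp)) for user, fp in known.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Hypothetical three-dimensional fingerprints, for illustration only.
known_users = {
    "alice": [0.5, 0.3, 0.2],
    "bob": [0.1, 0.1, 0.8],
}
anonymous_user = [0.45, 0.35, 0.2]
ranking = rank_candidates(anonymous_user, known_users)
```

Here `ranking[0]` is the known user whose fingerprint lies closest to the anonymous one.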
Stylometric techniques
• Statistical analysis of writing style
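As a rough illustration of what a statistical description of writing style can look like, the sketch below computes a few common stylometric measurements: relative frequencies of function words, average word length, and punctuation rate. The feature set and the short `FUNCTION_WORDS` list are assumptions chosen for brevity, not the feature set used in this work.

```python
import string
from collections import Counter

# Hypothetical, deliberately short list of function words.
FUNCTION_WORDS = ["the", "and", "of", "to", "in"]

def stylometric_features(text):
    """Return a small dictionary of writing-style measurements for one text."""
    words = [w.strip(string.punctuation) for w in text.lower().split()]
    words = [w for w in words if w]          # drop tokens that were only punctuation
    total_words = len(words) or 1            # avoid division by zero on empty text
    total_chars = len(text) or 1
    counts = Counter(words)
    features = {"fw_" + w: counts[w] / total_words for w in FUNCTION_WORDS}
    features["avg_word_length"] = sum(len(w) for w in words) / total_words
    features["punctuation_rate"] = (
        sum(ch in string.punctuation for ch in text) / total_chars
    )
    return features

sample = "The cat and the dog."
features = stylometric_features(sample)
```

Computed over all posts of one alias, such values form the stylometric part (S) of that alias's fingerprint.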
Time profile
• Hour of Day: Hour1, Hour2, . . . , Hour24
• Period of Day: MidNight, EarlyMorning, Morning, MidDay, Evening, Night
• Month: Jan, Feb, . . . , Dec
• Day: Sunday, Monday, . . . , Saturday
• Type of Day: WeekDay, WeekEnd
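The five feature groups above can be computed from post timestamps roughly as follows. This is a sketch under stated assumptions: the slides do not give the boundaries of the six periods of day, so equal 4-hour blocks starting at midnight are assumed, and Hour1 is assumed to cover 00:00 to 00:59. The function name `time_profile` and the example timestamps are hypothetical.

```python
from datetime import datetime

def time_profile(timestamps):
    """Return normalized posting-time distributions for one author."""
    counts = {
        "hour_of_day": [0] * 24,    # Hour1 ... Hour24
        "period_of_day": [0] * 6,   # MidNight ... Night
        "month": [0] * 12,          # Jan ... Dec
        "day": [0] * 7,             # Sunday ... Saturday
        "type_of_day": [0] * 2,     # WeekDay, WeekEnd
    }
    for ts in timestamps:
        counts["hour_of_day"][ts.hour] += 1
        # Assumption: six equal 4-hour periods starting at midnight.
        counts["period_of_day"][ts.hour // 4] += 1
        counts["month"][ts.month - 1] += 1
        # datetime.weekday() uses Monday=0 ... Sunday=6; shift so Sunday=0.
        counts["day"][(ts.weekday() + 1) % 7] += 1
        counts["type_of_day"][1 if ts.weekday() >= 5 else 0] += 1
    total = len(timestamps) or 1  # avoid division by zero for empty input
    return {name: [c / total for c in values] for name, values in counts.items()}

# Hypothetical posts: two on a Monday, one on a Saturday.
posts = [
    datetime(2016, 9, 5, 9, 30),
    datetime(2016, 9, 5, 22, 15),
    datetime(2016, 9, 10, 9, 5),
]
profile = time_profile(posts)
```

Each group is normalized to sum to 1, so profiles of authors with different post counts remain comparable.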
Examples of time profiles
Analysis of forum posts (boards.ie) suggests that time profiles of authors are often quite stable over time.
Emotions and Twitter-specific features
Datasets
• Discussion Board : Top-1000 posters (DB-All) & posts limited to 60 (DB-60)
• Twitter : Top-1000 tweeps (TW-All) & tweets limited to 60 (TW-60)
• Blog : 1414 distinct bloggers, of whom 260 have at least 2 blogs
Performance of Classifiers
• Classifiers : AdaBoost, SVM, Naive Bayes
• Used all features (S + T + E)
• Experiments done on 4 different datasets
Performance of Different Datasets and Features
• Classifier : AdaBoost
• Experiments on combinations of features: T, S, (S + T), (S + E), (S + T + E)
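One simple way to build the inputs for these experiments is to concatenate the feature blocks of each group. The sketch below assumes every group (S, T, E) has already been turned into a list of numbers; the block values and the helper `combine` are hypothetical and only show the bookkeeping.

```python
# The five feature combinations evaluated in the experiments.
COMBINATIONS = [("T",), ("S",), ("S", "T"), ("S", "E"), ("S", "T", "E")]

def combine(blocks, groups):
    """Concatenate the feature blocks named in `groups` into one vector."""
    vector = []
    for group in groups:
        vector.extend(blocks[group])
    return vector

# Hypothetical feature blocks for a single alias.
blocks = {"S": [0.4, 0.2], "T": [0.7], "E": [0.1, 0.9]}
vectors = {" + ".join(groups): combine(blocks, groups) for groups in COMBINATIONS}
```

Each entry of `vectors` would then be fed to the same classifier, so that differences in performance can be attributed to the feature groups.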
Cross-classification
• Data used : discussion forum & Twitter
• Classifier : AdaBoost
• Not worse than models trained on a single dataset
Evaluation on blog data
• Non-synthetic data
• Precision : 0.966
• Recall : 0.567
• Accuracy : 0.929
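For reference, the three measures are defined from the counts of true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN). The sketch below uses hypothetical counts, not the counts behind the reported figures.

```python
def precision_recall_accuracy(tp, fp, fn, tn):
    """Compute precision, recall and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0  # share of predicted matches that are correct
    recall = tp / (tp + fn) if tp + fn else 0.0     # share of true matches that are found
    accuracy = (tp + tn) / (tp + fp + fn + tn)      # share of all decisions that are correct
    return precision, recall, accuracy

# Hypothetical counts for illustration only.
precision, recall, accuracy = precision_recall_accuracy(tp=8, fp=2, fn=4, tn=86)
```

High precision combined with lower recall, as reported above, means that predicted alias matches are rarely wrong, while a sizeable share of true alias pairs goes undetected.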
Summary
• Techniques to identify multiple aliases
• AdaBoost outperforms SVM and Naive Bayes
• Combination of stylometric and time-based features yields better results
• Can be used for real-world linkage of user accounts
Thank You