11
12/14/2017 1 DAMI, December 11-13, 2017 Outline Introduction to DSA@SAU research group Data Deluge and New Dimensions of Research Text Mining and Exploratory Data Analysis Blog Data Analytics for Open Source Intelligence Group Members: Muhammad Abulaish (faculty member) Mohd. Fazil (RS): Social bot characterization and detection Vineet Sejwal (RS): Context- and trust-aware recommender system Jayati Gulati (RS): Big graph mining for community detection Ashraf Kamal (RS): Figurative language detection in Twitter Harshita Dalal (RS): Identity deception detection using deep learning Mohd. Tufail (RS): Cyberbullying characterization and detection Seven M.Sc. (CS) students SMAC Mobile Computing Analytics Cloud Computing Social Computing Trending 21 st Century Technologies SMAC Mobile Computing Analytics Cloud Computing Social Computing DSA group’s focus OCMiner: A Density-Based Overlapping Community Detection Method for Social Networks, Intelligent Data Analysis, Vol. 19, Issue 4, IOS Press, Netherland, 2015.

Trending 21 Technologies - HOME - SAUdami17/Abulaish-PPT-Slides.pdf · HOCTracker: Tracking the Evolution of Hierarchical and Overlapping Communities in Dynamic Social ... report

  • Upload
    vothu

  • View
    217

  • Download
    0

Embed Size (px)

Citation preview

12/14/2017

1

DAMI, December 11-13, 2017

Outline

Introduction to DSA@SAU research group

Data Deluge and New Dimensions of Research

Text Mining and Exploratory Data Analysis

Blog Data Analytics for Open Source Intelligence

Group Members:

Muhammad Abulaish (faculty member)

Mohd. Fazil (RS): Social bot characterization and detection

Vineet Sejwal (RS): Context- and trust-aware recommender system

Jayati Gulati (RS): Big graph mining for community detection

Ashraf Kamal (RS): Figurative language detection in Twitter

Harshita Dalal (RS): Identity deception detection using deep learning

Mohd. Tufail (RS): Cyberbullying characterization and detection

Seven M.Sc. (CS) students

SMAC

Mobile Computing

Analytics

Cloud Computing

Social Computing

Trending 21st

Century Technologies

SMAC

Mobile Computing

Analytics

Cloud Computing

Social Computing

DSA group’s focus

OCMiner: A Density-Based Overlapping Community

Detection Method for Social Networks, Intelligent Data

Analysis, Vol. 19, Issue 4, IOS Press, Netherland, 2015.

12/14/2017

2

HOCTracker: Tracking the Evolution of Hierarchical

and Overlapping Communities in Dynamic Social

Networks, IEEE Transactions on Knowledge and Data

Engineering, Vol. 27, Issue 4, pp. 1019-1032, 2014.

Namesake Alias Mining on the Web and its Role

Towards Suspect Tracking, Information Sciences,

Elsevier, Vol. 276, pp. 123-145, 2014.

Ranking Radically Influential Web Forum Users, IEEE

Transactions on Information Forensics and Security,

IEEE, 2015.

• Various modalities: Tables,

Images, Video, Audio, Text, hyper-

text, “semantic” text, Networks,

Spreadsheets, Multi-lingual

• Enriched Data: Weighted, Multi-

labeled, Temporal/spatial

attributes

• Dynamic, Distributed, Uncertain

• Massive: Tera/Peta-scale

Water, water, every where, nor any drop to drink:

l l C l id i f h i

Bioinformatics

• Datasets:– Genomes

– Protein structure

– DNA/Protein arrays

– Interaction Networks

– Pathways

– Metagenomics

• Integrative Science

– Systems Biology

– Network Biology

PubMed Dataset containing more than 26 millions citations for

biomedical lietratures

12/14/2017

3

Geo-Informatics

Cheminformatics

N

N

Cl

O

AAACCTCATAGGAAGCATACCAGGAATTACATCA…

Structural Descriptors Physiochemical Descriptors

Topological Descriptors Geometrical Descriptors

Business Intelligence

Web & Social Networks

WWW map

Data are not Independent, Rather Linked! Traditional notion of independent

data items is increasingly false,

Everything is Linked!

Many datasets can be modeled as

complex attributed graphs with

various attributes on nodes and

edges: labels, text, time series,

confidences, etc. Social (human) networks, Biological

networks, etc.

Even non-graph data, e.g., relational

tables, Tweets, Blogs data can be

modeled as graphs via some similarity

measure

Graphs, Graphs Everywhere

Data Science & Big Data

Hypothesis

Experiment

Data

Result

Design

Data analysis

Process/Experiment

DataNo Prior HypothesisNew Science of Data

Hypothesis DrivenTraditional Approach

12/14/2017

4

New Science of Data• New data models: dynamic, streaming, etc.

• New mining, learning, and statistical algorithms that offer timely and reliable inference and information extraction: online, approximate

• Data languages

• Data and model compression

• Data provenance, security and privacy

• Data sensation: visual, aural, tactile

• Knowledge validation: domain experts

• Leverage synergy across subdisciplies: Data Mining and Machine Learning, Databases, Statististics, Optimization, Algebra and Geometry, HPC, Information Theory, Visualization, Sonification, Social/ethical/legal Dimensions

Future So Bright I GottaWear Shades! (Timbuk3)

Why Text Mining?

Growth of E-contents: exponential

E-contents are not well-organized

unstructured or semi-structured

Inherent ambiguities in natural languages texts

Lexical, Syntactic, Semantic, Pragmatic

Problem: Extraction of core information becomes very

expensive.

Text mining: A solution to transform unstructured texts into

structured form and infer knowledge from them

Many Definitions in the literature

Definition-1: (Hearst, 1999)

A process of exploratory data analysis that leads to unknown

information, or to answers for questions for which the answer is

not currently known.

Definition-2:

The extraction of implicit, previously unknown, and

potentially useful information from (large amount of) textual

data

What is Text Mining?

Strict definition

Information that not even the writer knows

For example, discovering a new method for a hair growth that

is described as a side effect for a different procedure

Lenient definition

Rediscover the information that the author encoded in the

text

For example, automatically extracting a product’s name from

a text document.

What is “previously unknown” information?

Exploiting information contained in textual

documents (Grobelnik et al., 2001)

To discover patterns and trends in data

To discover associations among entities

To discover predictive rules

Objective of Text Mining

Large textual databases

High dimensionality

Dependency

Noisy data

Ambiguity

Not well-structured data

Text Mining Challenges

12/14/2017

5

Text Mining Challenges (cont…)

Large textual database

Efficiency consideration

• Google: Indexing capacity is increasing day-by –day

• Almost all research publications are in textual format

– PubMed/Medline : Biomedical research papers

High dimensionality

Consider each word/phrase as a dimension

The Size of the WWW

Large textual database

Efficiency consideration

• Google: Indexing capacity is increasing day-by –day

• Almost all research publications are in textual format

– PubMed/Medline : Biomedical research papers

High dimensionality

Consider each word/phrase as a dimension

Dependency

Relevant information is a complex conjunction of

words/phrases

• Document categorization

• Pronoun disambiguation

Noisy Data

Spelling mistakes, Grammar mistakes, etc.

Text Mining Challenges (cont…)

Text Mining Challenges (cont…)

Ambiguity

Word ambiguity

• Buy vs. Purchase• Cricket (game) vs Cricket (insect)

Semantic ambiguity

• The king saw the rabbit with his glasses (many meanings)

Not well-structured text

R u available?

Hey whazzzz up

Text Mining

Data Mining

Database

Computational Linguistics

Information Retrieval &

IE

Exploratory Data

Analysis

Text Mining: A Confluence of Multiple Disciplines

How does it relate to Data Mining?

How does it relate to Computational Linguistics?

How does it relate to Exploratory Data Analysis?

How does it relate to Database?

How does it relate to Information Retrieval?

Text Mining (Hearst-1999)

Finding PatternsFinding “Nuggets”

Novel Non-Novel

Non-Textual data Data Mining Exploratory

Data

Analysis

Database Queries

Textual dataComputational

LinguisticsInformation

Retrieval

12/14/2017

6

An approach for data analysis that employs a variety of

techniques (mostly graphical – scatter plots, box plot,

histograms, probability plot, etc.) to

Uncover underlying structure of data

Extract important variables

Detect outliers and anomalies in data

Develop models

Steps:

Problem Data Analysis Model Conclusions

Exploratory Data Analysis (EDA)

Example: Biomedical Data Exploration (Swanson, and Smalheiser, 1997)

Extract pieces of evidence from article titles in the

biomedical literature

“stress is associated with migraines”

“stress can lead to loss of magnesium”

“magnesium is a natural calcium channel blocker”

“calcium channel blockers prevent some migraines”

Induce a new hypothesis not in the literature by

combining culled text fragments with human

medical expertise

Magnesium deficiency may play a role in some kinds of

migraine headache

Semantic Net Representation

Stress Migraine

Magnesium

Calcium Channel Broker

associated with

lead to loss

is a

prevents

Magnesium deficiency may play a role in some kinds of migraine headache

Refers to the use of papers and other academic

publications (the "literature") to find new relationships

between existing knowledge (the "discovery").

Pioneered by Don R. Swanson in the 1980s.

LBD does not generate new knowledge through laboratory

experiments, as is customary for empirical sciences.

Instead it seeks to connect existing knowledge from

empirical results by bringing to light relationships that are

implicated and "neglected”

Literature Based Discovery (LBD)

Swanson linking is a term proposed in 2003 that refers to

connecting two pieces of knowledge previously thought to

be unrelated.

For example, it may be known that

Illness A is caused by chemical B, and

Drug C is known to reduce the amount of chemical B in the body.

If the respective articles not analyzed together then

relationship between illness A and drug C may be

unknown.

Swanson linking aims to find these relationships and

report them.

LBD (cont…): Swanson Linking

LBD (cont…): Swanson Linking

(i) Illness A is caused by chemical B, and

(ii) Drug C is known to reduce the amount of chemical B in the body

12/14/2017

7

Forum Crawling and Parsing

User List

Body Text

Quotations Timestamp

User Id

Message ID

Thread ID

Ranked Users

DataPre-processing

Metadata Extraction

Cleaning

Chunking

User Radicalness Identification

 

Threat list

Radicalness

User Posts

As we know that the natural language, be it the spoken or wr itten, is highly ambiguous in

nature, identifying the radicalness in natural language text is a major challenge. Many pr ior works have attempted to identify the radical elements based on their textual

contents exchanged in discussions. However, the foundation of their automatic radical identification

process is laid on a set of manually

User Collocations

User Collocation Identification

7.09.03.05.04.08.03.04.01.00.0

0.01.03.08.07.05.07.05.01.03.0

0.05.03.05.09.08.01.00.00.03.0

5.08.01.03.08.07.06.01.08.05.0

6.08.01.06.05.08.03.09.06.00.0

4.07.03.08.05.06.02.00.04.08.0

0.03.04.07.02.09.03.05.08.09.0

0.01.00.02.06.03.01.00.05.04.0

8.05.06.09.04.08.05.06.04.01.0

3.04.07.00.05.08.03.07.06.01.0

User Collocations

As we know that the natural language, be it the spoken or wri tten, is highly ambiguous in

nature, identifying the radicalness in natural language text is a major challenge. Many pr ior works have attempted to identify the radical elem ents based on their textual

contents exchanged in discussions. However, the foundation of their automatic radical identification

process is la id on a set of manually

Radically Influential User Ranking

Ranking

Ass

oci

ativ

ity

Com

put

atio

n

Who is Radical?

A person galvanizing people by fanatic thoughts beyond the

norm to an extreme antagonistic political, religious, racial,

nationalist or any other ideology.

Such thoughts arouse in minds when they feel of some unjust or

discrimination happened with them either directly or indirectly,

though it actually may be false.

Such thoughts are sometimes triggered by their personal

involvement (e.g., death of a close relative or friend), political

involvement (e.g., being a follower of a political or religious

belief), and social involvement (e.g., racism, nationalism).

Hostility may be against a race, or a political party, or a religion,

or a nation, or any organization with a mass of followers.

Who is Influential? Very hard to define concretely or measure tangibly, despite the

large number of existing theories of sociology.

Can be approximated as something attained by the activeness of

a person. However, Just being active in communication does

not make someone influential in a social network.

Influential users generally get a very good response from others

in their comments, and it differentiates them from the

spammers, who in spite of being active do not receive much

attention.

Major factors that make a user influential - recognition, activity

generation, novelty, and eloquence.

Radically Influential User

Characterized by two key properties – radicalness and

influence.

Two different approaches to tackle the problem of radically

influential user identification – two-stage sequential

ranking and one-stage parallel ranking

Two-Stage Sequential Process

One-Stage Parallel Process

Measuring Radicalness

Ωj = A threat list containing 25 radically jihadi ideology

terms (Compiled by Kramer [ACM ISI-SIGKDD, 2010] from

Ansar Web Forum )

12/14/2017

8

Identifying Collocation

Collocation can be defined as the association of users co-

interacting in same threads.

Collocation theory is applied to study the structural

association of different users, and estimate their influence

while propagating an ideology through their interactions.

Contingency Table

Association Metrics

1. Co-Occurrence Frequency (μ1)

2. CF-ITF (μ2)

Association Metrics (cont...)

3. PMI (μ3): Used in the fields of information theory andstatistics to determine the association or dependence of twoprobabilistic events.

4. Cosine (μ4): Used to measure the strength of associationbetween a pair of objects having feature vectors

Association Metrics (cont...)

5. Overlap (μ5)

6. Dice (μ6)

Association Metrics (cont...)

7. Jaccard (μ7)

8. Chi-Square (μ8): Used to determine the difference between the

distribution of an actually observed sample and another hypothetical orpreviously established distribution

12/14/2017

9

Association Metrics (cont...)

9. LLR (μ9)

Association Metrics (cont...)

10. Phi Coefficient (μ10)

11. Contingency Coefficient (μ11)

Ranking Radically Influential Users

Page Rank:

d = Damping Factor (probability of a surfer to click on a link)

An ideal value for d is 0.85 (empirical determination)

Page Rank ExampleAA

CCBB

PR(A) = 0.15 + 0.85 x PR(C)

PR(B) = 0.15 + 0.85 x [PR(A)/2]

PR(C) = 0.15 + 0.85 x [PR(A)/2 + PR(B)]

Ranking Radically Influential Users (cont…)

Modified Page Rank

Experimental Data Set A set of threads provided for a challenge at the ACM ISI-

KDD’12 workshop to find radical and infectious threads,

members, postings, ideas and ideologies.

Consists of a total of 1,29,425 message posts commented as

response to a total of 27,968 threads by 2803 users.

Includes all discussions carried on in the forum from April 28,

2004 to May 20, 2010.

12/14/2017

10

Lifespan of Threads in Experimental Data Set

10 Top-Ranked Members Using Different Association Measures

Evaluation Metric: Mean Reciprocal Rank (MRR)

A statistical measure for evaluating any process that

produces a list of possible responses to a sample of

queries, ordered by probability of correctness.

Reciprocal Rank of a query response is the multiplicative

inverse of the rank of the first correct answer

MRR is the average of the reciprocal ranks of results for

a sample of queries Q:

Mean Reciprocal Rank (cont…)

MRR = (1/3 + 1/2 + 1)/3 = 11/18 ≈ 0.61.

Evaluation Results using MRR

For Further Details:

Tarique Anwar and Muhammad Abulaish,

Ranking Radically Influential Web Forum

Users, IEEE Transactions on Information

Forensics and Security, 2015.

Available at www.abulaish.com

12/14/2017

11

Development of a real-time online social media

monitoring system (incremental learning, stream data

mining).

Exploiting content and structural data together for

message authentication/ rumor detection

Exploring the application of community (cliques)

structures and their evolutionary behaviors for groups

monitoring (organized crimes, social botnet, etc.)

Development of Context- and Trust-aware recommender

systems (target marketing, sales and promotions, etc.)

Ranked Retrieval

Early IR focused on set-based retrieval

Boolean queries, set of conditions to be satisfied

document either matches the query or not

• like classifying the collection into relevant / non-relevant sets

still used by professional searchers

“advanced search” in many systems

Modern IR: ranked retrieval

free-form query expresses user’s information need

rank documents by decreasing likelihood of relevance

many studies prove it is superior