Transcript

Building knowledge graphs in DIG

Pedro Szekely and Craig Knoblock
University of Southern California, Information Sciences Institute
dig.isi.edu

Goal

USC Information Sciences Institute CC-By 2.0 2

raw • messy • disconnected → clean • organized • linked

hard to query, analyze & visualize → easy to query, analyze & visualize

Use Case: Human Trafficking


100 million pages, ~100 Web sites

help victims, prosecute traffickers

Salient Statistics on Human Trafficking

• Profits per Year: $32 Billion
• Average Age of Entry To Prostitution in the US: 14
• Pimp’s Profit Per Victim Per Year: $150,000
• Advertising Budget On the Web: $45 Million

Task: Tracking the Victim’s Locations

>100 million pages advertising adult services

Example: Investigating a Reported Victim

San Diego, where else?

DIG Interface: Find the locations where a potential victim was advertised


Steps To Build a DIG

• Data Acquisition: crawling
• Feature Extraction: extraction
• Feature Alignment: mapping to ontology (schema.org, geonames)
• Entity Resolution: entity linking & similarity
• Graph Construction: knowledge graph deployment (ElasticSearch, GraphDB)
• User Interface: query & visualization

Data Acquisition

downloading relevant data

batch • real-time

Web pages • Web service • database • CSV • Excel • XML • JSON


Feature Extraction


from raw sources to structured data

• trainable text extractors

• extraction from structured Web pages

• image features

• PDF extractor

Feature Extraction from Text


“YOU don't wanna miss out on ME :) Perfect lil booty Green eyes Long curly black hair Im a Irish, Armenian and Filipino mixed princess :) ❤ Kim ❤7○7~7two7~7four77 ❤ HH 80 roses ❤ Hour 120 roses ❤ 15 mins 60 roses”

name: Kim
eye-color: green
hair-color: black
phone: 707-727-7477
rate: $60/15min, $80/30min, $120/60min
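The phone and rate fields above can be recovered from the ad text with a few hand-written rules. The sketch below is only an illustration of what the extraction has to undo (spelled-out digits, look-alike characters, "roses" as currency); DIG itself uses trainable extractors, and the lookup tables and function names here are assumptions made for this example.

```python
import re

# Illustrative lookup tables (assumptions for this sketch, not part of DIG).
NUMBER_WORDS = {"zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
                "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9"}
LOOKALIKES = {"○": "0", "O": "0", "o": "0"}

# Matches "HH 80 roses", "Hour 120 roses", "15 mins 60 roses".
RATE_PATTERN = re.compile(r"(HH|Hour|\d+\s*mins?)\s+(\d+)\s+roses", re.IGNORECASE)


def normalize_phone(token):
    """Turn an obfuscated 10-digit phone token into NNN-NNN-NNNN, or None."""
    text = token
    # Replace spelled-out digits first, so the "o" in "two" or "four" is not
    # mistaken for a zero by the look-alike pass below.
    for word, digit in NUMBER_WORDS.items():
        text = re.sub(word, digit, text, flags=re.IGNORECASE)
    for char, digit in LOOKALIKES.items():
        text = text.replace(char, digit)
    digits = re.sub(r"\D", "", text)
    if len(digits) != 10:
        return None
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"


def extract_rates(text):
    """Return sorted (minutes, price) pairs; 'HH' is read as half an hour."""
    rates = []
    for duration, price in RATE_PATTERN.findall(text):
        label = duration.lower()
        if label == "hh":
            minutes = 30
        elif label == "hour":
            minutes = 60
        else:
            minutes = int(re.match(r"\d+", label).group())
        rates.append((minutes, int(price)))
    return sorted(rates)


ad = "Kim 7○7~7two7~7four77 HH 80 roses Hour 120 roses 15 mins 60 roses"
print(normalize_phone("7○7~7two7~7four77"))  # 707-727-7477
print(extract_rates(ad))  # [(15, 60), (30, 80), (60, 120)]
```

Rules like these are brittle against new obfuscations, which is one motivation for the trainable extractors described on the following slides.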

20 Examples


1,000’s of Tasks (2 Cents/Sentence)


Performance of CRF Extractors

Eyes
                     Precision  Recall  F
Regular Expressions  80         10      18
DIG                  99         91      94

Hair
                     Precision  Recall  F
Regular Expressions  80         6       12
DIG                  99         73      84
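The F column is presumably the usual F-measure, the harmonic mean of precision and recall; the slide does not define it, so that reading is an assumption. A minimal check, reading the eye-color chart as precision 80 and recall 10 for regular expressions and the hair-color chart as precision 99 and recall 73 for DIG:

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values read off the chart; the function name is chosen for this sketch.
print(round(f_measure(80, 10)))  # 18
print(round(f_measure(99, 73)))  # 84
```

The other plotted F values differ by about a point when recomputed this way, which is expected if the chart rounds precision and recall before display.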

Structured Extraction


Automated Extraction

input: a pile of pages
→ Classify by Templates
→ pages clustered by template
→ Infer Extractor (repeated for each cluster)
→ extractor
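The "classify by templates" step can be pictured as grouping pages whose HTML skeletons match. The slides do not say how DIG clusters pages, so the following is only a sketch under a simple assumption: two pages share a template when they contain the same set of tag paths. The class and function names are made up for this example.

```python
from collections import defaultdict
from html.parser import HTMLParser


class TagPathCollector(HTMLParser):
    """Collects the set of tag paths (e.g. 'html/body/div') seen in a page."""

    VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in self.VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass


def template_signature(html):
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return frozenset(collector.paths)


def cluster_by_template(pages):
    """Group page names by identical tag-path signature."""
    clusters = defaultdict(list)
    for name, html in pages.items():
        clusters[template_signature(html)].append(name)
    return sorted(sorted(names) for names in clusters.values())


pages = {
    "a": "<html><body><h1>Ad one</h1><div><span>Price</span></div></body></html>",
    "b": "<html><body><h1>Ad two</h1><div><span>Cost</span></div></body></html>",
    "c": "<html><body><table><tr><td>x</td></tr></table></body></html>",
}
print(cluster_by_template(pages))  # [['a', 'b'], ['c']]
```

Exact signature matching is deliberately strict; real sites vary within a template, so a practical version would likely compare signatures with a similarity threshold instead.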

Unsupervised Extraction Tool


Extraction Evaluation

10 websites, 5 pages each

Field        Perfect      Pretty Good
Title        1.0 (50/50)  1.0 (50/50)
Desc         .76 (37/49)  .98 (48/49)
Seller       .95 (40/42)  .95 (40/42)
Date         .83 (40/48)  .83 (40/48)
Price        .87 (39/45)  .98 (44/45)
Loc          .51 (23/45)  .84 (38/45)
Cat          .68 (34/50)  .88 (44/50)
MemberSince  1.0 (35/35)  1.0 (35/35)
Expires      .52 (15/29)  .55 (16/29)
Views        .76 (19/25)  1.0 (25/25)
ID           .97 (35/36)  1.0 (36/36)


Feature Alignment

from multiple schemas to a common domain schema

- CSV, Excel
- Database tables
- Web services
- Extractors

- Nomenclature
- Spelling

Multiple Schemas

Karma: Mapping Data to Ontologies

Services • Relational Sources • Hierarchical Sources → Karma (Schema.org) → {JSON-LD}

karma.isi.edu

Karma Solves Feature Alignment


Provenance Domain Schema

took ~30 minutes to align the output of the Stanford name extractor
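Feature alignment amounts to renaming each source's fields to the terms of one domain schema. A minimal sketch, assuming two hypothetical sources whose column names differ and a hand-written mapping to schema.org Person properties; Karma builds such models interactively, and this is not its API:

```python
import json

# Hypothetical per-source mappings from column names to schema.org properties.
SOURCE_A = {"Name": "name", "Phone": "telephone"}
SOURCE_B = {"full_name": "name", "tel": "telephone", "mail": "email"}


def to_jsonld(record, mapping, entity_type="Person"):
    """Rename the mapped columns of one record and wrap them as JSON-LD."""
    doc = {"@context": "http://schema.org", "@type": entity_type}
    for column, value in record.items():
        prop = mapping.get(column)
        # Unmapped columns and empty values are dropped.
        if prop is not None and value not in ("", None):
            doc[prop] = value
    return doc


row_a = {"Name": "Kim", "Phone": "707-727-7477"}
row_b = {"full_name": "Kim", "tel": "707-727-7477", "mail": ""}

print(json.dumps(to_jsonld(row_a, SOURCE_A), indent=2))
print(to_jsonld(row_a, SOURCE_A) == to_jsonld(row_b, SOURCE_B))  # True
```

Once both sources emit the same property names, later steps such as entity resolution can compare records without knowing where they came from.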

Feature Alignment Statistics

• 5 contractors provided data
• ~15 datasets
• >30 Karma models
• >200 million records
• 1 hour processing in 20-node Hadoop cluster


Entity Resolution


merging records that refer to the same entity

missing data

incorrect data

scale (~50 million records)

currently working on techniques to address

Entity Resolution on Strong Attributes

[Figure: knowledge graph fragment for two ads]

Offer-1 (seller: Person-1; itemProvided: AdultService-1)
- availableAt: Santa Barbara
- price: 250/hour
- startDate: 2014-12-07
- phone: 619-319-7315
- name: Jessica
- hairColor: red
- eyeColor: blue

Offer-2 (seller: Person-2; itemProvided: AdultService-2)
- availableAt: Washington DC
- price: 250/hour
- startDate: 2014-05-28
- phone, email
- name: Jessica
- eyeColor: blue
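Resolution on strong attributes such as phone or email can be pictured as grouping ads that share a value. A minimal sketch, assuming a union-find over hypothetical ad records; the IDs, phone numbers, and e-mail addresses below are made up, and this is not DIG's implementation:

```python
from collections import defaultdict


def resolve_by_strong_attributes(records, strong_keys=("phone", "email")):
    """Cluster record IDs that share any strong attribute value."""
    parent = {rid: rid for rid in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[max(root_a, root_b)] = min(root_a, root_b)

    first_seen = {}  # (attribute, value) -> first record that carried it
    for rid, record in records.items():
        for key in strong_keys:
            value = record.get(key)
            if not value:
                continue  # missing data: nothing to link on
            owner = first_seen.setdefault((key, value), rid)
            union(owner, rid)

    clusters = defaultdict(list)
    for rid in records:
        clusters[find(rid)].append(rid)
    return sorted(sorted(members) for members in clusters.values())


ads = {
    "ad-1": {"phone": "555-0100"},
    "ad-2": {"phone": "555-0100", "email": "a@example.com"},
    "ad-3": {"email": "a@example.com"},
    "ad-4": {"phone": "555-0199"},
}
print(resolve_by_strong_attributes(ads))
# [['ad-1', 'ad-2', 'ad-3'], ['ad-4']]
```

Note that ad-1 and ad-3 share no attribute directly; they end up together because ad-2 links to both, which is why a transitive structure like union-find fits here.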

Linking Using Text Similarity

E M I L Y SEXY. ** wHiTe/lATin girl ** bUsTy SWEET. LoTs Of fUn. Call Me. O_U_T_C___A___L_L_S

L A Y L A SEXY. ** wHiTe girl ** bUsTy SWEET. LoTs Of fUn. Call Me. O____U____T____C___A___L____L____S

L I L A SEXY. ** WhiTe girl ** bUsTy SWEET. LoTs Of fUn. Call Me. O_U_T_C___A___L_L_S
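The three ads above differ mainly in the name and in decorative spacing, so a similarity measure that ignores case and punctuation scores them as near-duplicates. The slides do not name the measure DIG uses; the sketch below assumes Jaccard similarity over character trigrams purely as an illustration.

```python
import re

EMILY = ("E M I L Y SEXY. ** wHiTe/lATin girl ** bUsTy SWEET. LoTs Of fUn. "
         "Call Me. O_U_T_C___A___L_L_S")
LAYLA = ("L A Y L A SEXY. ** wHiTe girl ** bUsTy SWEET. LoTs Of fUn. "
         "Call Me. O____U____T____C___A___L____L____S")
LILA = ("L I L A SEXY. ** WhiTe girl ** bUsTy SWEET. LoTs Of fUn. "
        "Call Me. O_U_T_C___A___L_L_S")


def normalize(text):
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def shingles(text, n=3):
    """Set of character n-grams of the normalized text."""
    s = normalize(text)
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def jaccard(a, b, n=3):
    """Jaccard similarity of the two texts' n-gram sets, in [0, 1]."""
    set_a, set_b = shingles(a, n), shingles(b, n)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


for left, right in [(LAYLA, LILA), (EMILY, LILA)]:
    print(round(jaccard(left, right), 2))
```

Pairs above some threshold would be treated as candidate links; choosing that threshold is a tuning decision that this sketch leaves open.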


Linking Using Image Similarity

100 Million Images
Technology: Deep Learning

[Figure: the two-ad graph from the strong-attributes slide (Offer-1, Person-1, AdultService-1 and Offer-2, Person-2, AdultService-2), annotated with "same victim" and "same Trafficker"]

Unsupervised Collective Entity Resolution



Graph Construction

assembling the data for efficient query & analysis

- ElasticSearch: scalable, efficient query
- graph databases: network analytics
- NoSQL: scalable analytics

- bulk loading: massive data imports
- real-time updates: live, changing data

Elastic Search Data Model

AdultService • Offer • Person • Phone • Web Page
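One way to picture a document-store data model for a graph is to nest each entity's neighbours inside it, so that a single document lookup answers what would otherwise be a multi-hop graph query. Whether DIG denormalizes its graph exactly this way is an assumption; the slide only lists the entity types. The node IDs, values, and function name below are hypothetical.

```python
import json


def embed(node_id, nodes, edges, seen=None):
    """Build a nested dict rooted at node_id by following outgoing edges."""
    seen = (seen or set()) | {node_id}
    doc = {"@id": node_id, **nodes[node_id]}
    for source, predicate, target in edges:
        # Skip nodes already on the current path to avoid infinite nesting.
        if source == node_id and target not in seen:
            doc.setdefault(predicate, []).append(embed(target, nodes, edges, seen))
    return doc


nodes = {
    "offer-1": {"@type": "Offer", "price": "250/hour"},
    "person-1": {"@type": "Person"},
    "phone-1": {"@type": "Phone", "name": "555-0100"},
}
edges = [
    ("offer-1", "seller", "person-1"),
    ("person-1", "phone", "phone-1"),
]

document = embed("offer-1", nodes, edges)
print(json.dumps(document, indent=2))
```

The trade-off is duplication: a phone shared by many sellers is copied into every offer document, which costs storage and update work in exchange for fast reads.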

Indexing for High Performance Knowledge Graph Queries

Avg. Query Times in Milliseconds, Single User Query Load

1.2 billion triples

State of the Art Graph Database (RDF)

DIG indexing deployed in ElasticSearch


DIG Deployment for Human Trafficking

- 100 million Web pages
- Live updates (~5,000 pages/hour)
- ElasticSearch database (7 nodes)
- Hadoop workflows (20 nodes)

- District Attorney
- Law Enforcement
- NGOs

Deployed to 6 Law Enforcement Agencies and Successfully Used to Prosecute Traffickers

DIG Applications

• Human Trafficking: large, real users
• Material Science Research: 70,000 paper abstracts (built in 1 week)
• Arms Trafficking: identify illegal sales
• Patent Trolls: identify patent trolls
• Cyber Attacks: predict cyber attacks from dark web data

Conclusions

• Complete tool-chain to build domain-specific knowledge graphs

• Integrates heterogeneous data: web pages, databases, CSV, web APIs, images, etc.

• Scales to ~100 million pages, ~3 billion facts

• Deployed to law enforcement


Questions?

dig.isi.edu
Open Source, Apache 2 License