Building Knowledge Graphs in DIG

Pedro Szekely and Craig Knoblock
University of Southern California, Information Sciences Institute
dig.isi.edu

Goal

USC Information Sciences Institute CC-By 2.0 2

raw · messy · disconnected → clean · organized · linked

hard to query, analyze & visualize → easy to query, analyze & visualize

Use Case: Human Trafficking


100 million pages
~100 Web sites

help victims
prosecute traffickers

Salient Statistics on Human Trafficking

• Profits per Year: $32 Billion
• Average Age of Entry To Prostitution in the US: 14
• Pimp’s Profit Per Victim Per Year: $150,000
• Advertising Budget On the Web: $45 Million

Task: Tracking the Victim’s Locations

> 100 million pages advertising adult services

Example: Investigating a Reported Victim

San Diego, where else?

DIG Interface: Find the locations where a potential victim was advertised


Steps To Build a DIG

Pipeline stages: Data Acquisition → Feature Extraction → Feature Alignment → Entity Resolution → Graph Construction → User Interface

Diagram labels: Crawling · Extraction · Mapping to Ontology · Entity Linking & Similarity · Knowledge Graph Deployment · Query & Visualization

Resources shown: ElasticSearch · GraphDB · schema.org · geonames

Data Acquisition

downloading relevant data

batch · real-time

Web pages · Web service · database · CSV · Excel · XML · JSON


Feature Extraction


from raw sources to structured data

• trainable text extractors

• extraction from structured Web pages

• image features

• PDF extractor

Feature Extraction from Text


“YOU don't wanna miss out on ME :) Perfect lil booty Green eyes Long curly black hair Im a Irish, Armenian and Filipino mixed princess :) ❤ Kim ❤7○7~7two7~7four77 ❤ HH 80 roses ❤ Hour 120 roses ❤ 15 mins 60 roses”

name: Kim
eye-color: green
hair-color: black
phone: 707-727-7477
rate: $60/15min, $80/30min, $120/60min
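To make the phone field concrete, here is a minimal rule-based sketch of recovering an obfuscated number such as the one in the ad above. It is a hand-written baseline of the kind the deck later compares against trained extractors, not DIG's extractor. The substitution tables, the separator set, and the 10-digit length are assumptions of this sketch.

```python
import re

# Assumed substitution tables: spelled-out digits and look-alike symbols.
WORD_DIGITS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}
LOOKALIKES = {"○": "0", "O": "0"}

# A token is a digit word, a look-alike symbol, or a plain digit.
_TOKEN = "|".join(list(WORD_DIGITS) + [re.escape(c) for c in LOOKALIKES] + [r"\d"])
_SEPARATORS = r"[\s~.\-_()]*"
# Assumption: a phone number is a run of exactly 10 tokens with optional separators.
_RUN = re.compile(rf"(?:(?:{_TOKEN}){_SEPARATORS}){{10}}")
_PIECE = re.compile(_TOKEN)


def extract_phones(text):
    """Return phone numbers found in text, normalized to NNN-NNN-NNNN."""
    phones = []
    for run in _RUN.finditer(text):
        digits = "".join(
            WORD_DIGITS.get(tok) or LOOKALIKES.get(tok) or tok
            for tok in _PIECE.findall(run.group())
        )
        phones.append(f"{digits[:3]}-{digits[3:6]}-{digits[6:]}")
    return phones


ad = "❤ Kim ❤7○7~7two7~7four77 ❤ HH 80 roses ❤ Hour 120 roses"
print(extract_phones(ad))  # ['707-727-7477']
```

Rules like these break as soon as advertisers invent a new obfuscation, which is one motivation for trainable extractors.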

20 Examples


1,000’s of Tasks (2 Cents/Sentence)


Performance of CRF Extractors

Bar-chart values, as labeled on the slide:

Attribute  Extractor            Precision  Recall  F
Eyes       Regular Expressions  80         10      18
Eyes       DIG                  99         91      94
Hair       Regular Expressions  80         6       12
Hair       DIG                  99         73      84
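The slide does not define F. Assuming it is the usual balanced F-measure (the harmonic mean of precision $P$ and recall $R$), and reading the chart labels for the regular-expression eye-color baseline as $P = 80$ and $R = 10$, the arithmetic works out as follows:

```latex
F = \frac{2PR}{P + R}
  = \frac{2 \cdot 80 \cdot 10}{80 + 10}
  = \frac{1600}{90}
  \approx 17.8
```

This agrees with the plotted value of 18 after rounding. The same formula applied to the DIG hair-color extractor ($P = 99$, $R = 73$) gives $14454 / 172 \approx 84.0$, matching the chart.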

Structured Extraction


Automated Extraction

input: a pile of pages → Classify by Templates → pages clustered by template → Infer Extractor (once per cluster) → extractor
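The "classify by templates" step can be illustrated with a small sketch. The slides do not say how DIG clusters pages, so everything here is an assumption: pages are compared by the set of root-to-element tag paths they contain, and a page joins a cluster when its Jaccard similarity to the cluster's first member reaches a hypothetical threshold of 0.8.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class TagPathCollector(HTMLParser):
    """Collects the set of root-to-element tag paths seen in a page."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating sloppy HTML.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def tag_paths(html):
    collector = TagPathCollector()
    collector.feed(html)
    return frozenset(collector.paths)


def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 1.0


def cluster_by_template(pages, threshold=0.8):
    """Greedy clustering: a page joins the first cluster whose first
    member is similar enough, otherwise it starts a new cluster."""
    clusters = []  # list of (representative_paths, [page indices])
    for index, html in enumerate(pages):
        paths = tag_paths(html)
        for representative, members in clusters:
            if jaccard(paths, representative) >= threshold:
                members.append(index)
                break
        else:
            clusters.append((paths, [index]))
    return [members for _, members in clusters]


pages = [
    "<html><body><h1>Ad A</h1><div><span>$10</span></div></body></html>",
    "<html><body><h1>Ad B</h1><div><span>$20</span></div></body></html>",
    "<html><body><table><tr><td>x</td></tr></table></body></html>",
]
print(cluster_by_template(pages))  # [[0, 1], [2]]
```

Each resulting cluster would then be handed to a separate extractor-inference step, which this sketch does not attempt.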

Unsupervised Extraction Tool


Extraction Evaluation

10 websites, 5 pages each

Field        Perfect       Pretty Good
Title        1.0 (50/50)   1.0 (50/50)
Desc         .76 (37/49)   .98 (48/49)
Seller       .95 (40/42)   .95 (40/42)
Date         .83 (40/48)   .83 (40/48)
Price        .87 (39/45)   .98 (44/45)
Loc          .51 (23/45)   .84 (38/45)
Cat          .68 (34/50)   .88 (44/50)
MemberSince  1.0 (35/35)   1.0 (35/35)
Expires      .52 (15/29)   .55 (16/29)
Views        .76 (19/25)   1.0 (25/25)
ID           .97 (35/36)   1.0 (36/36)
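The per-field scores on this slide are simple ratios of the counts in parentheses, which a few lines of arithmetic can reproduce. The pooled figure computed at the end (all correct extractions over all expected ones) is an aggregate added by this sketch and does not appear on the slide.

```python
FIELDS = ["Title", "Desc", "Seller", "Date", "Price", "Loc", "Cat",
          "MemberSince", "Expires", "Views", "ID"]

# (correct, expected) pairs copied from the Extraction Evaluation slide.
PERFECT = [(50, 50), (37, 49), (40, 42), (40, 48), (39, 45), (23, 45),
           (34, 50), (35, 35), (15, 29), (19, 25), (35, 36)]
PRETTY_GOOD = [(50, 50), (48, 49), (40, 42), (40, 48), (44, 45), (38, 45),
               (44, 50), (35, 35), (16, 29), (25, 25), (36, 36)]


def per_field(counts):
    """Ratio per field, rounded to two decimals as on the slide."""
    return {f: round(c / t, 2) for f, (c, t) in zip(FIELDS, counts)}


def pooled(counts):
    """Pooled ratio over all fields (not reported on the slide)."""
    correct = sum(c for c, _ in counts)
    expected = sum(t for _, t in counts)
    return correct / expected


print(per_field(PERFECT)["Loc"], per_field(PRETTY_GOOD)["Loc"])
print(round(pooled(PERFECT), 3), round(pooled(PRETTY_GOOD), 3))
```

The largest gaps between the two criteria are in Loc, Desc, Views and Cat, suggesting that many extractions for those fields were close but not exact.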


Feature Alignment


from multiple schemas to a common domain schema

Multiple Schemas

- CSV, Excel
- Database tables
- Web services
- Extractors

- Nomenclature
- Spelling

Karma: Mapping Data to Ontologies

Inputs: services · relational sources · hierarchical sources
Ontology: Schema.org
Output: JSON-LD

karma.isi.edu

Karma Solves Feature Alignment


Provenance Domain Schema

took ~30 minutes to align the output of the Stanford name extractor
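As a rough illustration of what feature alignment produces, the sketch below maps records from two differently named source schemas onto one set of domain-schema properties and emits JSON-LD. It is a hand-written stand-in, not a Karma model: the alignment table, the second source's field names (`eyes`, `hair`), and the `example.org` identifiers are hypothetical. The property names come from the slides' example graph.

```python
import json

# Hypothetical alignment table: source field name -> domain-schema property.
# Two sources use different nomenclature for the same attributes.
FIELD_TO_PROPERTY = {
    "name": "name",
    "eye-color": "eyeColor",
    "eyes": "eyeColor",
    "hair-color": "hairColor",
    "hair": "hairColor",
}


def to_jsonld(record, node_id, node_type="AdultService"):
    """Rename mapped fields and wrap the record as a JSON-LD node."""
    node = {"@context": "http://schema.org/", "@id": node_id, "@type": node_type}
    for field, value in record.items():
        prop = FIELD_TO_PROPERTY.get(field)
        if prop is not None:  # unmapped fields are dropped in this sketch
            node[prop] = value
    return node


source_a = {"name": "Kim", "eye-color": "green", "hair-color": "black"}
source_b = {"name": "Kim", "eyes": "green", "hair": "black"}

print(json.dumps(to_jsonld(source_a, "http://example.org/service/1"), indent=2))
```

Both sources yield the same properties after alignment, so downstream steps such as entity resolution can compare them directly.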

Feature Alignment Statistics

• 5 contractors provided data
• ~15 datasets
• > 30 Karma models
• > 200 million records
• 1 hour processing in 20 node Hadoop cluster


Entity Resolution


merging records that refer to the same entity

missing data
incorrect data
scale (~50 million records)

currently working on techniques to address these challenges

Entity Resolution on Strong Attributes

Offer-1
  availableAt: Santa Barbara
  price: 250/hour
  startDate: 2014-12-07
  seller: Person-1 (phone: 619-319-7315)
  itemProvided: AdultService-1 (name: Jessica, eyeColor: blue, hairColor: red)

Offer-2
  availableAt: Washington DC
  price: 250/hour
  startDate: 2014-05-28
  seller: Person-2 (phone, email)
  itemProvided: AdultService-2 (name: Jessica, eyeColor: blue)
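Merging on strong attributes can be sketched as a connected-components problem: records that share an exact phone or email value end up in the same group, transitively. The union-find approach below is an illustration under that assumption, not the method DIG uses. The sample records are made up for the sketch, including the assumption that Person-1 and Person-2 share the phone number shown on the slide.

```python
from collections import defaultdict


def resolve_by_strong_attributes(records, attributes):
    """Group record ids that share a value for any of the given attributes."""
    parent = {rid: rid for rid in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    first_seen = {}  # (attribute, value) -> first record id carrying it
    for rid, record in records.items():
        for attribute in attributes:
            value = record.get(attribute)
            if not value:
                continue  # missing data: never merge on an absent value
            key = (attribute, value)
            if key in first_seen:
                union(rid, first_seen[key])
            else:
                first_seen[key] = rid

    groups = defaultdict(set)
    for rid in records:
        groups[find(rid)].add(rid)
    return sorted(sorted(group) for group in groups.values())


# Illustrative records; values other than the phone number are hypothetical.
people = {
    "Person-1": {"phone": "619-319-7315"},
    "Person-2": {"phone": "619-319-7315", "email": "a@example.org"},
    "Person-3": {"email": "a@example.org"},
    "Person-4": {"phone": "555-000-0000"},
}
print(resolve_by_strong_attributes(people, ["phone", "email"]))
# [['Person-1', 'Person-2', 'Person-3'], ['Person-4']]
```

Person-3 never shares a phone with Person-1, yet the two are grouped through Person-2. Exact matching also shows why incorrect data is a risk: a single wrong phone number can fuse two unrelated groups.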

Linking Using Text Similarity

E M I L Y SEXY. ** wHiTe/lATin girl ** bUsTy SWEET. LoTs Of fUn. Call Me. O_U_T_C___A___L_L_S

L A Y L A SEXY. ** wHiTe girl ** bUsTy SWEET. LoTs Of fUn. Call Me. O____U____T____C___A___L____L____S

L I L A SEXY. ** WhiTe girl ** bUsTy SWEET. LoTs Of fUn. Call Me. O_U_T_C___A___L_L_S
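The three ads above differ mainly in the name and in spacing tricks, which a simple token-overlap measure can expose. The sketch below uses Jaccard similarity over normalized word sets. The slides do not name the similarity measure DIG uses, so this choice and the normalization rules (lowercasing, removing underscores, dropping single-letter tokens such as letter-spaced names) are assumptions.

```python
import re


def tokens(ad_text):
    """Lowercase, undo underscore padding, and keep words longer than one letter."""
    text = ad_text.lower().replace("_", "")
    words = re.findall(r"[a-z0-9]+", text)
    return {w for w in words if len(w) > 1}


def jaccard(a, b):
    """Jaccard similarity of the two ads' token sets."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0


emily = ("E M I L Y SEXY. ** wHiTe/lATin girl ** bUsTy SWEET. "
         "LoTs Of fUn. Call Me. O_U_T_C___A___L_L_S")
layla = ("L A Y L A SEXY. ** wHiTe girl ** bUsTy SWEET. "
         "LoTs Of fUn. Call Me. O____U____T____C___A___L____L____S")
lila = ("L I L A SEXY. ** WhiTe girl ** bUsTy SWEET. "
        "LoTs Of fUn. Call Me. O_U_T_C___A___L_L_S")

print(jaccard(layla, lila))            # 1.0
print(round(jaccard(emily, layla), 3)) # 0.917
```

Dropping the letter-spaced names means the score measures the shared ad template rather than the advertised name, which is what links the three ads to a common author.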


Linking Using Image Similarity

100 Million Images
Technology: Deep Learning

(The slide repeats the Offer-1 / Offer-2 example graph from the strong-attributes slide, annotated with the labels "same victim" and "same trafficker".)

Unsupervised Collective Entity Resolution



Graph Construction


assembling the data for efficient query & analysis

- ElasticSearch: scalable, efficient query
- graph databases: network analytics
- NoSQL: scalable analytics

- bulk loading: massive data imports
- real-time updates: live, changing data

ElasticSearch Data Model

AdultService · Offer · Person · Phone · Web Page
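One plausible reading of this data model is that each entity type gets its own index of JSON documents, with linked entities nested inside so that a query needs no joins. The slide does not show the documents, so the sketch below is an assumption: the nesting, the `uri` key, and the index name `offer` are hypothetical, and the sample values come from the example graph earlier in the deck. It serializes documents in the action-then-source newline-delimited layout that the Elasticsearch bulk API accepts, without calling any client library.

```python
import json


def bulk_lines(index, docs):
    """Yield NDJSON lines: one action line, then one source line, per document."""
    for doc in docs:
        yield json.dumps({"index": {"_index": index, "_id": doc["uri"]}})
        yield json.dumps(doc)


# Hypothetical denormalized Offer document with its linked entities nested.
offer = {
    "uri": "offer-1",
    "availableAt": "Santa Barbara",
    "price": "250/hour",
    "startDate": "2014-12-07",
    "seller": {"uri": "person-1", "phone": "619-319-7315"},
    "itemProvided": {
        "uri": "adultservice-1",
        "name": "Jessica",
        "eyeColor": "blue",
        "hairColor": "red",
    },
}

# The bulk request body must end with a newline.
payload = "\n".join(bulk_lines("offer", [offer])) + "\n"
print(payload)
```

Nesting trades storage and update cost for query speed: a change to Person-1 would have to be rewritten into every Offer document that embeds it.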

Indexing for High Performance Knowledge Graph Queries

[Chart: average query times in milliseconds under a single-user query load on 1.2 billion triples, comparing a state-of-the-art graph database (RDF) with DIG indexing deployed in ElasticSearch.]


DIG Deployment for Human Trafficking


- 100 million Web pages
- Live updates (~5,000 pages/hour)
- ElasticSearch database (7 nodes)
- Hadoop workflows (20 nodes)

- District Attorney
- Law Enforcement
- NGOs

Deployed to 6 Law Enforcement Agencies and Successfully Used to Prosecute Traffickers

DIG Applications

Human Trafficking: large, real users
Material Science Research: 70,000 paper abstracts (built in 1 week)
Arms Trafficking: identify illegal sales
Patent Trolls: identify patent trolls
Cyber Attacks: predict cyber attacks from dark web data

Conclusions

• Complete tool-chain to build domain-specific knowledge graphs
• Integrates heterogeneous data: web pages, databases, CSV, web APIs, images, etc.
• Scales to ~100 million pages, ~3 billion facts
• Deployed to law enforcement

Questions?

dig.isi.edu
Open Source, Apache 2 License