
Mapping Domain Names to Categories


DESCRIPTION

Oversee.net + UCLA IPAM RIPS Summer internship project 2013


Page 1: Mapping Domain Names to Categories

Mapping Domain Names to Categories

Maya Rotmensch, Sorcha Gilroy, Corina Gurău

Academic Mentor: Cristina Garcia-Cardona

Industry Sponsor: Oversee.net (Kryztof Urban)

Institute for Pure and Applied Mathematics

Research in Industrial Projects

August 15, 2013

Institute for Pure & Applied Mathematics

University of California, Los Angeles

(Institute of Pure and Applied Mathematics) Mapping Domain Names to Categories August 15, 2013 1 / 41

Page 2: Mapping Domain Names to Categories

Outline

1 Oversee.net

2 Problem Statement
    Why so complicated?
    ESA - Explicit Semantic Analysis
    How Oversee.net Does It

3 Our Project
    Our Focus
    Methodology
    Results

4 Concluding Remarks



Page 4: Mapping Domain Names to Categories

Oversee.net’s Business Model

Person Website


Page 5: Mapping Domain Names to Categories

Person looking for games A gaming website


Page 6: Mapping Domain Names to Categories

Oversee.net’s Business Model

Person looking for games Domain A gaming website

Direct Navigation: when users navigate to a website by using the address bar instead of a search engine.

Looking for a gaming website → navigates to ’addictinggamas.com’


Page 7: Mapping Domain Names to Categories

Oversee.net’s Business Model

Domain parking + traffic matching −→ Oversee.net

Person Domain Category Website

Monetized Domain Parking

- The registration of internet domain names without placing any content on the domain.

- Owners monetize traffic by displaying links and advertisements.


Page 8: Mapping Domain Names to Categories

Oversee.net’s Business Model

Advertisers

- Partners of Oversee.net

- Choose the types of traffic they want from Oversee.net’s category tree


Page 9: Mapping Domain Names to Categories

Oversee.net’s Business Model

Parked domains do not have any content

Mapping Domains to Categories is extremely difficult

- Oversee.net uses keywords to describe domains and categories

Domain Keywords Keywords Category

Not enough, as the two sets of keywords are not guaranteed to use the same language!



Page 11: Mapping Domain Names to Categories

So what’s the big deal?

Reasoning about concepts

Scarcity of input information

- Example 1 - Spelling error: cheapvacatins.com

- Example 2 - Ambiguous meaning: bigbearhuts.com (animals? huts? it’s supposed to be winter sports)


Page 12: Mapping Domain Names to Categories

Text Categorization

Our problem can be thought of as a problem of categorization. We need to assign a domain to one or more classes or categories.

- A natural choice is topic modeling

- However, unlike most text categorization problems, we don’t actually have documents to classify, as we are dealing with undeveloped domains


Page 13: Mapping Domain Names to Categories

Topic Modeling

This method analyzes the relationships between documents in a corpus by isolating a set of topics from the documents

For meaningful results, one must work with a set of large texts

- Our data set consists of keywords, as our domains are undeveloped

This method results in organic generation of topics

- The categories we are attempting to map into are pre-defined


Page 14: Mapping Domain Names to Categories

ESA - Explicit Semantic Analysis

Building a Semantic Interpreter

Using a Vector Space Model + an exogenous knowledge base → represent the meaning of text [1]

# of articles ∼ 3.5 million

# of terms ∼ 45 million

[1] Evgeniy Gabrilovich and Shaul Markovitch. Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis. Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI), 2007.


Page 15: Mapping Domain Names to Categories

ESA - Explicit Semantic Analysis

| | Government | Finance | Toys | Children | Bank | School | ... |
| Law | 0.2 | 0.3 | 0.8 | 0.9 | 0.2 | 0.7 | ... |
| Article2 | 0.8 | 0.9 | 0.1 | 0.3 | 0.7 | 0.5 | ... |
| Article3 | 0.5 | 0.2 | 0.3 | 0.6 | 0.4 | 0.8 | ... |
| Article4 | 0.1 | 0.2 | 0.1 | 0.3 | 0.4 | 0.2 | ... |
| ... | ... | ... | ... | ... | ... | ... | ... |

Term frequency inverse document frequency:

$$\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \times \mathrm{idf}(t, D)$$

Logarithmically scaled term frequency:

$$\mathrm{tf}(t, d) = \log\bigl(f(t, d) + 1\bigr)$$

Inverse document frequency:

$$\mathrm{idf}(t, D) = \log \frac{|D|}{|\{d \in D : t \in d\}|}$$
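The TF-IDF weighting on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy corpus, the function names, and the choice to return 0 for a term that appears in no document are all hypothetical and not taken from the project.

```python
import math
from collections import Counter


def tf(term, doc_tokens):
    """Logarithmically scaled term frequency: log(f(t, d) + 1)."""
    return math.log(Counter(doc_tokens)[term] + 1)


def idf(term, corpus):
    """Inverse document frequency: log(|D| / |{d in D : t in d}|)."""
    containing = sum(1 for doc in corpus if term in doc)
    if containing == 0:
        # Assumption: a term unseen in the corpus gets weight 0
        # rather than triggering a division by zero.
        return 0.0
    return math.log(len(corpus) / containing)


def tfidf(term, doc_tokens, corpus):
    """tfidf(t, d, D) = tf(t, d) * idf(t, D)."""
    return tf(term, doc_tokens) * idf(term, corpus)


# Hypothetical toy corpus: each document is a list of tokens.
corpus = [
    ["bank", "finance", "bank", "government"],
    ["toys", "children", "school"],
    ["school", "children", "government"],
]
weight = tfidf("bank", corpus[0], corpus)
```

Here "bank" occurs twice in the first document and in no other document, so both factors equal log 3.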

Page 16: Mapping Domain Names to Categories

ESA - Explicit Semantic Analysis

Using a Semantic Interpreter

Cosine similarity measure

$$\text{similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}$$
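A minimal sketch of the cosine similarity measure in plain Python follows. The function name and the zero-vector convention are assumptions, and the two example vectors are simply the "Government" and "Finance" columns restricted to the four rows shown in the ESA matrix, so the resulting score is illustrative only.

```python
import math


def cosine_similarity(a, b):
    """cos(theta) = (A . B) / (||A|| ||B||) for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption: similarity with an all-zero vector is defined as 0.
        return 0.0
    return dot / (norm_a * norm_b)


# Concept vectors of two terms, truncated to the four articles shown.
government = [0.2, 0.8, 0.5, 0.1]
finance = [0.3, 0.9, 0.2, 0.2]
score = cosine_similarity(government, finance)
```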


Page 17: Mapping Domain Names to Categories

How Oversee.net Does It

Instead of comparing two texts - compare two small sets of words!

Use keywords to describe domains and categories

Represent these keywords in terms of DBpedia articles

- A keyword is significantly related to an article if the TF-IDF is above a certain threshold

- The set of articles associated with a domain/category is the union of the sets of articles associated with its keywords
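The two rules for turning keywords into article sets can be sketched as follows. The index layout (keyword mapped to article weights), the toy values, and the threshold of 0.4 are hypothetical; the slides do not state the actual threshold or data structures used by Oversee.net.

```python
def articles_for_keyword(keyword, tfidf_index, threshold):
    """Articles whose TF-IDF weight for the keyword is above the threshold."""
    weights = tfidf_index.get(keyword, {})
    return {article for article, w in weights.items() if w > threshold}


def articles_for_keywords(keywords, tfidf_index, threshold):
    """Union of the article sets of all keywords of a domain or category."""
    result = set()
    for keyword in keywords:
        result |= articles_for_keyword(keyword, tfidf_index, threshold)
    return result


# Hypothetical toy index: keyword -> {article title: TF-IDF weight}.
toy_index = {
    "bank": {"Bank": 0.9, "Finance": 0.6, "River": 0.1},
    "loan": {"Finance": 0.7, "Credit": 0.5},
}
domain_articles = articles_for_keywords(["bank", "loan"], toy_index, threshold=0.4)
```

With this toy index, "River" is dropped because its weight for "bank" falls below the threshold.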


Page 18: Mapping Domain Names to Categories

How Oversee.net Does It

Compare the two sets of articles (A - domains, B - categories) using the Jaccard Index:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

Categories with the highest scores using this index are matched to a domain
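As a rough sketch of this matching step, the Jaccard index and a ranking of categories by score might look like this in Python. The article sets, category names, and the `top_n` parameter are hypothetical illustrations, not project data.

```python
def jaccard(a, b):
    """J(A, B) = |A intersect B| / |A union B|, defined as 0 for two empty sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def rank_categories(domain_articles, category_articles, top_n=3):
    """Return the top_n (category, score) pairs, highest Jaccard score first."""
    scored = [
        (jaccard(domain_articles, articles), name)
        for name, articles in category_articles.items()
    ]
    # Sort by descending score; break ties alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(name, score) for score, name in scored[:top_n]]


# Hypothetical article sets for one domain and three categories.
domain = {"Bank", "Finance", "Credit"}
categories = {
    "banking": {"Bank", "Finance", "Savings"},
    "credit cards": {"Credit", "Finance", "Debit card", "Rewards"},
    "games": {"Video game"},
}
ranked = rank_categories(domain, categories)
```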



Page 20: Mapping Domain Names to Categories

Our Focus

Domain Keywords Keywords Category

Critical link: domains to keywords

Improve quality of keywords

- Click Through Rate

- String Similarity

- Semantic Analysis

| Keyword | CTR | String Similarity | Semantic Similarity |
| industrial | 20 | 80 | 0 |
| industriel | 20 | 89 | 0 |
| industrie | 20 | 100 | 0 |
| china manufacturer | 20 | 0 | 88 |
| industries | 20 | 80 | 98 |
| industrial companies | 20 | 0 | 86 |


Page 21: Mapping Domain Names to Categories

Domain Keywords

Focusing on developing the link between domains and keywords, the two main questions we posed for our research were:

Could we use ESA to extend the number of meaningful keywords per domain?

Could we use the keywords obtained through Oversee.net in-house statistics as the basis of the new keywords?



Page 23: Mapping Domain Names to Categories

Methodology

Extending the set of keywords:

When generating new keywords:

Only take top 3 articles

Only take top 2 terms
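One way to read the "top 3 articles, top 2 terms" rule is sketched here. The slide's diagram is not available, so the exact procedure is an assumption: for each existing keyword, pick the articles where it has the highest weight, then take the highest-weighted other terms of those articles as new keywords. The matrix layout and all toy values are hypothetical.

```python
def expand_keyword(keyword, matrix, top_articles=3, top_terms=2):
    """Generate new keywords from the best articles for an existing keyword.

    matrix maps article -> {term: TF-IDF weight} (assumed layout).
    """
    scored = [
        (weights[keyword], article)
        for article, weights in matrix.items()
        if weights.get(keyword, 0) > 0
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))

    new_keywords = []
    for _, article in scored[:top_articles]:
        terms = sorted(matrix[article].items(), key=lambda kv: (-kv[1], kv[0]))
        # Assumption: the seed keyword itself is not re-emitted.
        picked = [term for term, _ in terms if term != keyword][:top_terms]
        for term in picked:
            if term not in new_keywords:
                new_keywords.append(term)
    return new_keywords


# Hypothetical toy matrix.
toy_matrix = {
    "Bank": {"bank": 0.9, "finance": 0.8, "loan": 0.7, "river": 0.1},
    "Credit union": {"bank": 0.6, "credit": 0.9, "member": 0.5},
    "River": {"bank": 0.2, "water": 0.9, "fish": 0.4},
    "Piggy bank": {"bank": 0.1, "toys": 0.8, "children": 0.7},
}
expanded = expand_keyword("bank", toy_matrix)
```

In this toy run the third article contributes "water" and "fish", which illustrates how an ambiguous seed keyword can pull in unrelated terms.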


Page 24: Mapping Domain Names to Categories

Methodology

Method 2 for extending the set of keywords:

Breaking up and correcting the domain name

Example: domain = ’chaselogon.com’

Candidate strings: haselogon, aselogon, cha selogon, chas elogon, chase logon, chasel ogon, chaselo gon, chaselog, chaselogo

If the entire string matches a word in the reference file, then stop

If both parts of the broken string are exact words, then stop

If a substring is an exact word, then correct the other part using edit distances

- Corrections used: deletions, transpositions, replacements, insertions
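The three parsing rules can be sketched in Python as follows. This is a simplified illustration, not the project's implementation: it only considers two-way splits, corrects at edit distance 1 with the four listed operations, and uses a tiny hypothetical vocabulary in place of the reference file.

```python
import string


def edits1(word):
    """All strings one deletion, transposition, replacement or insertion away."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def parse_domain(domain, vocabulary):
    """Split a domain name into known words, correcting one part if needed."""
    name = domain.lower().split(".")[0]  # assumes no "www." style prefix
    if name in vocabulary:
        return [name]  # rule 1: the entire string is a known word

    splits = [(name[:i], name[i:]) for i in range(1, len(name))]
    for left, right in splits:
        if left in vocabulary and right in vocabulary:
            return [left, right]  # rule 2: both parts are exact words

    keywords = []
    for left, right in splits:
        if left in vocabulary:
            known, other = left, right
        elif right in vocabulary:
            known, other = right, left
        else:
            continue
        # Rule 3: one part is exact, correct the other part.
        for fixed in sorted(edits1(other) & vocabulary):
            for keyword in (known, fixed):
                if keyword not in keywords:
                    keywords.append(keyword)
    return keywords


# Hypothetical stand-in for the reference file.
vocabulary = {"chase", "login", "cheap", "vacations"}
parsed = parse_domain("chaselogon.com", vocabulary)
```

With this vocabulary, "chase" is found as an exact word and "logon" is corrected to "login" by a single replacement.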


Page 25: Mapping Domain Names to Categories

Methodology

Method 2 for extending the set of keywords:

The reference file is made up of collections of text, to which we have added more information:

- Company names

- Popular websites

- Brand and store names

- Countries and major cities

| Initial Keywords | Keywords after parsing |
| chameloeon | chas, chase, elson, login |


Page 26: Mapping Domain Names to Categories

Methodology

Generating new keywords and mapping to categories

bankfianancial.com

Keywords: ncofinancialban, bankfinancial, financial institutions, financial centre, lobsters, official personal, societies chairman, ...

Jaccard Index = 0.240492: finance (retirement pension, debit card, tenant credit check, ...)

Jaccard Index = 0.348147: credit cards (debit card, credit applications, rewards program, ...)

Jaccard Index = 0.219457: banking (savings banking, checks, community bank, ...)


Page 27: Mapping Domain Names to Categories

Results: Comparing Their Keywords to Semantic

We were given a sample of 300 domains that had been matched by hand to a total of 500 categories

| | CTR & String Similarity | CTR, String Similarity & Semantic Analysis |
| Number of matches | 25 | 309 |
| Percentage of matches | 5% | 61.8% |


Page 28: Mapping Domain Names to Categories

Results: Generating New Keywords

Using Method 1:

| | CTR & String Similarity | Method 1 | CTR & String Similarity & 7 Random |
| Number of matches | 25 | 21 | 24 |
| Percentage of matches | 5% | 4.2% | 4.8% |

Most of the time, the different methods yielded the same results

Cases where the new keywords improved the system:

- thhetrainline.com

Cases where the base case did better:

- inindustries.com


Page 29: Mapping Domain Names to Categories

Results

thhetrainline.com

Keywords: thetrainline

Jaccard Index = 0.0001: microcars & city cars

Jaccard Index = 0.0002: property management

thhetrainline.com

Keywords: thetrainline, strafe train, moving departing, train station, telecommunications, georgia, rain shine, ...

Jaccard Index = 0.1348: bus & rail

Jaccard Index = 0.2255: libraries & museums


Page 30: Mapping Domain Names to Categories

Results

inindustries.com

Keywords: industrial, industrias, industriel, ...

Jaccard Index = 0.0786: manufacturing

inindustries.com

Keywords: industrial, industrias, industriel, ..., ministry, quarterly garden/outdoor, filipino footballer, ...

Jaccard Index = 0.099: tourist destinations

Jaccard Index = 0.1326: real estate


Page 31: Mapping Domain Names to Categories

Results: Parsing the Domains

Using Method 1 & 2:

| | CTR & String Similarity | Method 1 & 2 | CTR & String Similarity & 15 Random |
| Number of matches | 25 | 93 | 23 |
| Percentage of matches | 5% | 18.6% | 4.6% |


Page 32: Mapping Domain Names to Categories

Results - Parsing the Domains

chaselogon.com

Keywords: chameloeon

No category matched

chaselogon.com

Keywords: chameloeon, chas, chase, elson, login, password, journalists cyber, logins expensive, beatles, ...

Jaccard Index = 0.4637: credit cards

Jaccard Index = 0.4637: banking


Page 33: Mapping Domain Names to Categories

Results: Parsing the Domains

Using Method 2:

| | CTR & String Sim. | Method 1 & 2 | Method 2 |
| Number of matches | 25 | 97 | 77 out of 356 |
| Percentage of matches | 5% | 19.4% | ∼ 21.6% |

Initial results show that, overall, using parsing alone might be more beneficial → depends on the amount of noise.

Example with a lot of noise:

- mobilestorage.ca

Example with minimal noise:

- addictinggamas.com


Page 34: Mapping Domain Names to Categories

Results - Amplification of noise

mobilestorage.ca

Keywords: gfilestorage, mobileshop, mobilestorage, age, investor, vilest, ...

Jaccard Index = 0.1011: mobile & wireless

Jaccard Index = 0.0959: music & audio

mobilestorage.ca

Keywords: gfilestorage, mobileshop, mobilestorage, age, investor, vilest, ..., legal age, taylor, phone companies, mobil, ...

Jaccard Index = 0.0942: music & audio

Jaccard Index = 0.0887: education


Page 35: Mapping Domain Names to Categories

Results - Minimal noise

addictinggamas.com

Keywords: addictinggams, addictivegames, adictigegames, ..., addict, addictinggames, ingram, ...

Jaccard Index = 0.0153: software

addictinggamas.com

Keywords: addictinggams, addictivegames, adictigegames, ..., addict, addictinggames, ingram, ..., gameplay requires, gameimpulsedriven flash, add ons, ...

Jaccard Index = 0.2019: computer & video games

Jaccard Index = 0.1975: games


Page 36: Mapping Domain Names to Categories

Results: Extended Matches

Using Extended Matches:

We extended possible matches to parent and root nodes of the category tree.

- We checked in how many cases the parent or root node of the categories we obtained matched the manual matching.

| | CTR & String Sim. | Method 1 | Method 1 & 2 | Method 2 |
| Number of matches | 25 | 21 | 97 | 77 out of 356 |
| Percentage of matches | 5% | 4.2% | 19.4% | ∼ 21.6% |
| Number of extended matches | 32 | 29 | 128 | 102 out of 356 |
| Percentage of extended matches | 6.4% | 5.8% | 25.6% | ∼ 28.7% |
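The extended-match check can be sketched as a walk up the category tree. The tree below, the parent map representation, and the function names are hypothetical; the actual structure of Oversee.net's category tree is not given in the slides.

```python
def ancestors(category, parent_of):
    """Chain of ancestors from the direct parent up to the root."""
    chain = []
    seen = {category}
    while parent_of.get(category) is not None:
        category = parent_of[category]
        if category in seen:  # guard against malformed, cyclic trees
            break
        seen.add(category)
        chain.append(category)
    return chain


def is_extended_match(predicted, manual, parent_of):
    """True if the predicted category, its parent, or its root was chosen by hand."""
    chain = ancestors(predicted, parent_of)
    candidates = {predicted}
    if chain:
        candidates.add(chain[0])   # parent node
        candidates.add(chain[-1])  # root node
    return bool(candidates & set(manual))


# Hypothetical category tree: child -> parent (None marks a root).
toy_tree = {
    "finance": None,
    "banking": "finance",
    "credit cards": "banking",
    "rewards cards": "credit cards",
}
hit = is_extended_match("credit cards", {"finance"}, toy_tree)
```

Only the parent and the root are considered, so an intermediate ancestor such as "banking" does not count as an extended match for "rewards cards" in this sketch.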



Page 38: Mapping Domain Names to Categories

Conclusion

Implemented a program to match domains with categories

Created an ESA based method to amplify existing keywords

Adapted a domain name parsing and spell correcting method

Revisiting our research questions:

Could we use ESA to extend the number of meaningful keywords per domain? → Yes

Could we use the keywords obtained through Oversee.net in-house statistics as the basis of the new keywords? → No. Or at least further processing must be done.

Getting better & more keywords → getting a few good keywords


Page 39: Mapping Domain Names to Categories

Future Directions

Find out how many good initial keywords are required to use our method successfully

Explore a better way of ranking keywords and determine which are the most descriptive ones

- Click through rate and string similarity comparisons are not sufficiently descriptive; a better scoring method is needed

Have a reference of the most popular websites, so that the domains given could be compared to these

- Analyze content in websites to amplify domain to category mapping


Page 40: Mapping Domain Names to Categories

Thank you!

Academic Mentor: Cristina Garcia-Cardona

Industry Sponsor: Kryztof Urban and Oversee.net

RIPS Director: Dr. Michael Raugh

Director of IPAM: Dr. Russ Caflisch

IPAM Staff: Dimi, Stacey, Stacy, Roland, Stephanie, and everyone who made RIPS possible


Page 41: Mapping Domain Names to Categories

Questions?

Thank you for listening!
