Presented By- Shahina Ferdous, Student ID – 1000630375, Spring 2010


SemTag is an application built on the platform Seeker that adds semantic tags to the existing HTML body of the web.

Example: “The Chicago Bulls announced yesterday that Michael Jordan will...”

Will be:

“The <resource ref="http://tap.stanford.edu/BasketballTeam_Bulls">Chicago Bulls</resource> announced yesterday that <resource ref="http://tap.stanford.edu/AthleteJordan_Michael">Michael Jordan</resource> will...”
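The rewriting step in this example can be sketched in a few lines of Python. This is a minimal illustration, not SemTag's implementation: the function name `annotate` and the table `LABEL_TO_URI` are hypothetical, the two URIs follow the example above, and the sketch does no disambiguation at all (it wraps every occurrence of a known label).

```python
import re

# Hypothetical label-to-URI table; the two URIs follow the slide's example.
LABEL_TO_URI = {
    "Chicago Bulls": "http://tap.stanford.edu/BasketballTeam_Bulls",
    "Michael Jordan": "http://tap.stanford.edu/AthleteJordan_Michael",
}


def annotate(text, label_to_uri):
    """Wrap every occurrence of a known label in a <resource ref="..."> element."""
    # Try longer labels first so a short label never splits a longer one.
    labels = sorted(label_to_uri, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(label) for label in labels))

    def wrap(match):
        label = match.group(0)
        return f'<resource ref="{label_to_uri[label]}">{label}</resource>'

    return pattern.sub(wrap, text)


sentence = "The Chicago Bulls announced yesterday that Michael Jordan will..."
print(annotate(sentence, LABEL_TO_URI))
```

A real tagger would first have to decide which entity each label refers to, which is the job of the disambiguation step covered later in the deck.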

Large-scale automated semantic tagging of this kind will accelerate the creation of the Semantic Web.

The Semantic Web is a vision of transforming all documents on the web into a machine-understandable format, so that applications can process them without human intervention.

All the entities of documents will be canonically annotated; therefore programs can easily understand what documents are about.

To accomplish the Semantic Web vision, we need:
◦ Ontological support in the form of Web-available services, which will maintain metadata about entities and provide it whenever needed.
◦ Large-scale availability of annotations within documents, encoding canonical references to the entities.

We need to break a circular dependency:
◦ We need applications that make extensive use of semantically tagged data.
◦ There must be enough tagged data on the web for these applications to be useful.

Tagging is a way to classify entities either in written or spoken text.

Any tagging process generally consists of two steps:
◦ Step 1: Identify the entities that should be classified.
◦ Step 2: Classify these instances according to their categories.

In semantic tagging, the categories used to classify the entities are derived from their intentions or meanings (what is being said rather than how it is said).

Sense Tagging:
◦ “He runs the company” – run1 = control
◦ “He runs the marathon” – run2 = run by foot

Feature Tagging:
◦ “The speaker coughed” – Human
◦ “The speaker was disconnected” – Non-Human

Semantic tagging must resolve ambiguities in a natural-language corpus such as the web.

Maintaining and updating a large-scale corpus requires a scalable infrastructure, which most tagging applications cannot provide.

A platform is needed that multiple tagging applications can share.

The authors designed the platform Seeker, which provides highly scalable core functionality to support SemTag and other tagging algorithms.

SemTag uses a new disambiguation algorithm called TBD (Taxonomy-Based Disambiguation) to resolve ambiguities.

They applied SemTag to a collection of approximately 264 million web pages and generated approximately 434 million automatically disambiguated semantic tags.

They published the metadata about these annotations to the web as a label bureau.

SemTag runs in three phases:
◦ Spotting Pass – Generate a window of context surrounding each label (10 words, label, 10 words).
◦ Learning Pass – Use a representative sample to determine the distribution of terms in the taxonomy.
◦ Tagging Pass – Disambiguate references using the TBD algorithm.

Two kinds of ambiguity arise:
◦ The same label appears at multiple locations in the TAP ontology.
◦ Some labels occur in contexts that are missing from the taxonomy.
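The spotting pass can be illustrated with a short Python sketch. It assumes whitespace tokenization and exact token matching, neither of which is stated in the slides; the function name `spot_windows` and its `width` parameter are hypothetical.

```python
def spot_windows(tokens, label_tokens, width=10):
    """Return (left_context, label, right_context) for each occurrence of the label.

    Assumption: the label matches when its tokens appear consecutively and
    exactly; the window is clipped at the start and end of the document.
    """
    n = len(label_tokens)
    spots = []
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == label_tokens:
            left = tokens[max(0, i - width):i]
            right = tokens[i + n:i + n + width]
            spots.append((left, label_tokens, right))
    return spots


text = "Last night the guitarist Mark Knopfler played a long set for the crowd"
for left, label, right in spot_windows(text.split(), ["Mark", "Knopfler"], width=3):
    print(left, label, right)
```

With `width=10` this yields the "10 words, label, 10 words" window described for the spotting pass; the smaller width in the usage lines only keeps the printed output short.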

TBD makes use of two classes of training information:

Automatic Metadata – Helps determine whether the context around a label belongs within a subtree of the taxonomy.

Manual Metadata – Indicates, for nodes of the taxonomy, whether they contain highly ambiguous or highly unambiguous labels.

An ontology in TBD is defined by four elements:
◦ A set of classes, C
◦ A subclass relation, s(c1, c2)
◦ A set of instances, I
◦ A type relation, t(i, c)

A taxonomy T is defined by three elements:
◦ A set of nodes, V
◦ A root node, r
◦ A parent function, p

An ontology describes relationships in an N-dimensional manner, whereas a taxonomy describes only hierarchical relationships.

Each node in the taxonomy has a set of labels. For example, the nodes Musician, Singer, and Band Members can all contain the label Mark Knopfler.

An ancestry chain is the path from a node to the root of the taxonomy, obtained by repeatedly following the parent relation.
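As a rough illustration of these definitions, the taxonomy (nodes, root, parent function, per-node labels) and the ancestry chain can be modelled as follows. The class and method names are hypothetical, the node name "Root" is an assumption, and the tiny tree loosely mirrors the Music / Musician / Singer example used in this deck.

```python
from dataclasses import dataclass, field


@dataclass
class Taxonomy:
    root: str
    parent: dict = field(default_factory=dict)  # parent function p: node -> parent node
    labels: dict = field(default_factory=dict)  # node -> set of labels on that node

    def add_node(self, node, parent, labels=()):
        self.parent[node] = parent
        self.labels[node] = set(labels)

    def ancestry_chain(self, node):
        """Path from `node` up to the root, following the parent function."""
        chain = [node]
        while chain[-1] != self.root:
            chain.append(self.parent[chain[-1]])
        return chain

    def nodes_with_label(self, label):
        """All nodes carrying `label`; more than one means the label is ambiguous."""
        return sorted(n for n, ls in self.labels.items() if label in ls)


# Illustrative tree only; "Root" is an assumed name for the root node r.
tax = Taxonomy(root="Root")
tax.add_node("Music", "Root")
tax.add_node("Musician", "Music", {"Mark Knopfler"})
tax.add_node("Singer", "Music", {"Mark Knopfler"})

print(tax.ancestry_chain("Singer"))
print(tax.nodes_with_label("Mark Knopfler"))
```

A label that appears on more than one node, as here, is exactly the first kind of ambiguity the tagging pass has to resolve.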

A spot (l, c) is a label l in a context c, e.g. spot (Mark Knopfler, Singer).

Each internal node in TAP has an associated similarity function that determines whether a particular context is similar to that node.

A good similarity function has the property that the higher the similarity, the more likely it is that the spot contains a reference to an entity belonging to the subtree rooted at that node.

Example of a subtree in the taxonomy:

Music
◦ Musician (label: Mark Knopfler)
◦ Singer (label: Mark Knopfler)

For the spot (Mark Knopfler, Singer) with context c, the node u that the spot actually refers to should have the higher similarity value.

Determines whether a particular context is appropriate to a particular node in Taxonomy.

TBD uses the manually generated metadata as the training set to calculate m^a_u and m^s_u, where:
◦ m^a_u = the probability, as measured by human judgment, that spots for the subtree rooted at u are on topic.
◦ m^s_u = the probability that Sim correctly judges whether spots for the subtree rooted at u are on topic.

Lexicon generation:
◦ Built a collection of 1.4 million unique words occurring in a random subset of windows containing approximately 90 million total words.
◦ Took the most frequent 200,100 words.
◦ Removed the 100 most frequent of these.
◦ Further computations are performed in the 200,000-dimensional vector space defined by the remaining words.

Each node is associated with a 200,000-dimensional vector.
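The lexicon construction described above (keep the most frequent words, then drop the very top of that list) can be sketched as below. The function name `build_lexicon` and its parameters are hypothetical; the defaults mirror the 200,100 and 100 figures from the slide, and ties between equally frequent words are broken by first occurrence, which is an artifact of this sketch rather than anything stated in the source.

```python
from collections import Counter


def build_lexicon(windows, keep=200_000, drop_top=100):
    """Map each retained word to a dimension index in the vector space.

    Takes the (keep + drop_top) most frequent words across all windows,
    then discards the drop_top most frequent of them.
    """
    counts = Counter(word for window in windows for word in window)
    ranked = [word for word, _ in counts.most_common(keep + drop_top)]
    retained = ranked[drop_top:]
    return {word: index for index, word in enumerate(retained)}


# Toy usage with tiny numbers in place of 200,000 and 100.
toy_windows = [["the", "the", "the", "band", "band", "played"]]
print(build_lexicon(toy_windows, keep=2, drop_top=1))
```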

Evaluated four standard candidates for similarity functions, combining two weighting schemes with two algorithms:
◦ Scheme ‘Prob’
◦ Scheme ‘TF-IDF’
◦ Algorithm ‘IR’
◦ Algorithm ‘Bayes’

According to their results, the IR algorithm with the TF-IDF scheme gives the best accuracy (82%), which is a significant improvement.
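To make the IR-with-TF-IDF idea concrete, here is a minimal sketch of a similarity function built from the textbook formulation: term count times log(N / document frequency), compared by cosine similarity. The slides do not give the exact weighting or normalization used in SemTag, so these formulas are an assumption, and every name below (`tfidf_vector`, `cosine`, `most_similar_node`, `NODE_DOCS`) and the toy data are hypothetical.

```python
import math
from collections import Counter


def tfidf_vector(tokens, doc_freq, num_docs):
    """Sparse TF-IDF vector: raw term count times log(N / document frequency)."""
    counts = Counter(tokens)
    return {
        word: tf * math.log(num_docs / doc_freq[word])
        for word, tf in counts.items()
        if word in doc_freq  # words outside the lexicon are ignored
    }


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either vector is all zeros."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar_node(context, node_docs):
    """Score a spot's context against each node's text and return the best node."""
    doc_freq = Counter(w for doc in node_docs.values() for w in set(doc))
    n = len(node_docs)
    ctx_vec = tfidf_vector(context, doc_freq, n)
    scores = {
        node: cosine(ctx_vec, tfidf_vector(doc, doc_freq, n))
        for node, doc in node_docs.items()
    }
    return max(scores, key=scores.get), scores


# Toy per-node text, standing in for the term distributions from the learning pass.
NODE_DOCS = {
    "Music": "guitar album band tour song".split(),
    "Sports": "team season game coach score".split(),
}
best, scores = most_similar_node("the band released a new album".split(), NODE_DOCS)
print(best, scores)
```

In this toy setup the context shares the words "band" and "album" with the Music node and nothing with Sports, so Music receives the higher score.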

Seeker is a platform developed to support SemTag and other sophisticated text analytics applications.

It is designed to achieve the following goals:
◦ Composability
◦ Modularity
◦ Extensibility
◦ Scalability
◦ Robustness

Seeker is a service-oriented architecture (SOA): a local-area, loosely coupled, pull-based distributed computation system.

To address scalability and robustness, Seeker incorporates a component named Infrastructure, which contains a small set of critical services.

Analysis agents perform processing of web pages to generate annotations.

Automatic semantic tagging is essential to bootstrap the Semantic Web.

It’s possible to achieve good accuracy even with simple disambiguation approaches.

Questions?

