Presented by: Shahina Ferdous, Student ID 1000630375
Spring 2010
SemTag is an application built on the Seeker platform that adds semantic tags to the existing HTML content of the web.
Example: "The Chicago Bulls announced yesterday that Michael Jordan will..."
becomes:
The <resource ref="http://tap.stanford.edu/BasketballTeam_Bulls">Chicago Bulls</resource> announced yesterday that <resource ref="http://tap.stanford.edu/AthleteJordan_Michael">Michael Jordan</resource> will...
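The transformation in this example can be sketched in a few lines of Python. This is only an illustration of the output format, not SemTag itself: the function name `annotate` and the hand-written label-to-URI table are assumptions, and the sketch does no disambiguation at all.

```python
import re

def annotate(text, refs):
    """Wrap every occurrence of a known label in a <resource> tag.

    `refs` maps a label to its canonical URI (a hypothetical lookup table;
    SemTag derives its references from the TAP ontology).
    """
    if not refs:
        return text
    # Longest labels first so a longer label wins over a shorter one it contains.
    labels = sorted(refs, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(label) for label in labels))

    def wrap(match):
        label = match.group(0)
        return f'<resource ref="{refs[label]}">{label}</resource>'

    # A single pass over the text, so inserted URIs are never re-scanned.
    return pattern.sub(wrap, text)

refs = {
    "Chicago Bulls": "http://tap.stanford.edu/BasketballTeam_Bulls",
    "Michael Jordan": "http://tap.stanford.edu/AthleteJordan_Michael",
}
tagged = annotate("The Chicago Bulls announced yesterday that Michael Jordan will...", refs)
```

A real tagger has to decide whether each occurrence actually refers to the entity before it emits a tag. That decision is what makes semantic tagging difficult, and the sketch leaves it out.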
Large-scale automated semantic tagging of this kind will accelerate the creation of the Semantic Web.
The Semantic Web is a vision of transforming all documents on the web into a machine-understandable format, so that applications can operate on them without human intervention.
All entities in documents will be canonically annotated, so programs can easily determine what a document is about.
To accomplish the Semantic Web vision, we need:
◦ Ontological support in the form of web-available services that maintain metadata about entities and provide it whenever needed.
◦ Large-scale availability of annotations within documents, encoding canonical references to the entities.
We need to break a circular dependency:
◦ Applications that make extensive use of semantically tagged data are needed.
◦ There must be enough tagged data on the web for these applications to be useful.
Tagging is a way to classify entities in written or spoken text.
Any tagging process generally consists of two steps:
◦ Step 1: Identify the entities that should be classified.
◦ Step 2: Classify these instances according to their categories.
In semantic tagging, the categories used to classify entities are derived from their intended meaning (what is being said rather than how it is said).
Sense tagging: "He runs the company" (run1 = control) vs. "He runs the marathon" (run2 = run on foot).
Feature tagging: "The speaker coughed" (Human) vs. "The speaker was disconnected" (Non-Human).
Semantic tagging must resolve the ambiguities of a natural-language corpus such as the web.
Maintaining and updating a large-scale corpus requires a scalable infrastructure, which most tagging applications are unable to support.
A platform is required that multiple tagging applications can share.
The authors designed the Seeker platform, which provides highly scalable core functionality to support SemTag and other tagging algorithms.
SemTag uses a new disambiguation algorithm called TBD (Taxonomy-Based Disambiguation).
SemTag was applied to a collection of approximately 264 million web pages and generated 434 million automatically disambiguated semantic tags.
The metadata for these annotations was published to the web as a label bureau.
SemTag runs in three phases:
◦ Spotting pass: generate a window of context surrounding each label (10 words, the label, 10 words).
◦ Learning pass: use a representative sample to determine the distribution of terms in the taxonomy.
◦ Tagging pass: disambiguate references using the TBD algorithm.
Two kinds of ambiguity arise:
◦ The same label appears at multiple locations in the TAP ontology.
◦ Some labels occur in contexts that are missing from the taxonomy.
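The spotting pass can be illustrated with a minimal sketch. It assumes simple word tokenization and case-insensitive matching, neither of which is specified in the slides, and the function name `spot_windows` is hypothetical.

```python
import re

def spot_windows(text, labels, width=10):
    """Yield (label, context) pairs for each occurrence of a label.

    The context is up to `width` words before and `width` words after the
    label, mirroring the 10 words-label-10 words window of the spotting pass.
    """
    tokens = re.findall(r"\w+", text)
    lowered = [t.lower() for t in tokens]
    for label in labels:
        parts = label.lower().split()
        n = len(parts)
        for i in range(len(tokens) - n + 1):
            if lowered[i:i + n] == parts:
                before = tokens[max(0, i - width):i]
                after = tokens[i + n:i + n + width]
                yield label, before + after

text = "The Chicago Bulls announced yesterday that Michael Jordan will return"
windows = list(spot_windows(text, ["Michael Jordan"], width=2))
```

With `width=2`, the window for "Michael Jordan" holds the two words on either side of the label. The later passes work on these windows rather than on whole pages.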
TBD makes use of two classes of training information:
◦ Automatic metadata: helps determine whether the context around a label belongs within a particular subtree of the taxonomy.
◦ Manual metadata: provides information about nodes of the taxonomy, indicating whether a node contains highly ambiguous or unambiguous labels.
An ontology in TBD is defined by four elements:
◦ A set of classes, C
◦ A subclass relation, s(c1, c2)
◦ A set of instances, I
◦ A type relation, t(i, c)
A taxonomy T is defined by three elements:
◦ A set of nodes, V
◦ A root node, r
◦ A parent function, p
An ontology describes relationships in an N-dimensional manner, whereas a taxonomy describes only hierarchical relationships.
Each node in the taxonomy has a set of labels. For example, the nodes Musician, Singer, and Band Members can all contain the label Mark Knopfler.
An ancestry chain is the path from a node to the root of the taxonomy, obtained by following the parent relationship.
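A taxonomy with labels and ancestry chains can be sketched as a small data structure. The class and function names below are assumptions made for illustration, as is the node name "Root"; only Music, Musician, Singer, and the label Mark Knopfler come from the slides.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    """One taxonomy node: a name, a parent (None for the root), and a set of labels."""
    name: str
    parent: Optional["Node"] = None
    labels: set = field(default_factory=set)

def ancestry_chain(node):
    """Follow the parent function p from a node up to the root r."""
    chain = []
    while node is not None:
        chain.append(node.name)
        node = node.parent
    return chain

def nodes_with_label(nodes, label):
    """Return the names of all nodes whose label set contains `label`."""
    return [n.name for n in nodes if label in n.labels]

root = Node("Root")
music = Node("Music", parent=root)
musician = Node("Musician", parent=music, labels={"Mark Knopfler"})
singer = Node("Singer", parent=music, labels={"Mark Knopfler"})

chain = ancestry_chain(singer)
candidates = nodes_with_label([root, music, musician, singer], "Mark Knopfler")
```

Here the label Mark Knopfler occurs at two nodes. This is the first kind of ambiguity that the tagging pass has to resolve.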
A spot, spot(l, c), is a label l in a context c, e.g. spot(Mark Knopfler, Singer).
Each internal node in TAP is associated with a similarity function that determines whether a particular context is similar to that node.
A good similarity function has the property that the higher the similarity, the more likely it is that a spot contains a reference to an entity belonging to the subtree rooted at that node.
[Figure: example of a subtree in the taxonomy. The node Music has the children Musician and Singer, each carrying the label Mark Knopfler. For spot(Mark Knopfler, Singer) with context c, the matching node u should have the higher similarity value.]
The similarity function Sim determines whether a particular context is appropriate to a particular node in the taxonomy.
TBD uses the manually generated metadata as the training set to calculate $m^a_u$ and $m^s_u$, where
$m^a_u$ = the probability, as measured by human judgment, that spots for the subtree rooted at u are on topic, and
$m^s_u$ = the probability that Sim correctly judges whether spots for the subtree rooted at u are on topic.
Lexicon generation:
◦ Built a collection of 1.4 million unique words occurring in a random subset of windows containing approximately 90 million total words.
◦ Took the 200,100 most frequent words.
◦ Removed the 100 most frequent of these.
◦ Performed all further computations in the 200,000-dimensional vector space defined by the remaining words.
Each node is associated with a 200,000-dimensional vector.
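The lexicon steps above can be sketched as follows. The function name `build_lexicon` and the lowercasing of words are assumptions; the cutoffs 200,100 and 100 come from the slides.

```python
from collections import Counter

def build_lexicon(windows, keep=200_100, drop_top=100):
    """Map each retained word to a dimension index.

    Counts the words in all windows, keeps the `keep` most frequent, then
    discards the `drop_top` most frequent of those (very common words carry
    little topical signal). The result has at most keep - drop_top entries.
    """
    counts = Counter(word.lower() for window in windows for word in window)
    ranked = [word for word, _ in counts.most_common(keep)]
    return {word: index for index, word in enumerate(ranked[drop_top:])}

# Toy corpus with tiny cutoffs so the effect is visible.
toy_windows = [["the", "the", "the", "bulls", "bulls", "jordan"]]
lexicon = build_lexicon(toy_windows, keep=3, drop_top=1)
```

With the default cutoffs the lexicon has 200,000 entries, one per dimension of the vector space in which the node vectors live.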
Four standard candidates for the similarity function were evaluated, combining two weighting schemes ('Prob' and 'TF-IDF') with two algorithms ('IR' and 'Bayes').
According to their results, IR with the TF-IDF scheme gives the best accuracy (82%), which is a significant improvement.
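One common reading of the IR algorithm with TF-IDF weighting is cosine similarity between a TF-IDF vector for the context and one for the node. The sketch below assumes that reading; the slides do not give the exact formula, so the weighting (raw term frequency times log inverse document frequency) and all names here are illustrative, not the authors' definition.

```python
import math
from collections import Counter

def tfidf_vector(words, doc_freq, n_docs):
    """Sparse TF-IDF vector: term count times log(n_docs / document frequency)."""
    counts = Counter(words)
    return {
        word: count * math.log(n_docs / doc_freq[word])
        for word, count in counts.items()
        if doc_freq.get(word)
    }

def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Toy data: two "node documents"; "the" occurs in both, so its weight is zero.
doc_freq = {"guitar": 1, "band": 1, "court": 1, "the": 2}
music_node = tfidf_vector(["guitar", "band", "the"], doc_freq, n_docs=2)
sport_node = tfidf_vector(["court", "the"], doc_freq, n_docs=2)
context = tfidf_vector(["guitar", "the"], doc_freq, n_docs=2)

sim_music = cosine(context, music_node)
sim_sport = cosine(context, sport_node)
```

In this toy setup the context scores higher against the music node than against the sport node. That matches the property a good similarity function should have: higher similarity for the subtree the spot belongs to.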
Seeker is a platform developed to support SemTag and other sophisticated text analytics applications.
It is designed to achieve the following goals: composability, modularity, extensibility, scalability, and robustness.
Seeker is a service-oriented architecture (SOA): a local-area, loosely coupled, pull-based distributed computation system.
To address scalability and robustness, Seeker incorporates an infrastructure component containing a small set of critical services.
Analysis agents process web pages to generate annotations.
Automatic semantic tagging is essential to bootstrap the Semantic Web.
It’s possible to achieve good accuracy even with simple disambiguation approaches.
Questions?