[IEEE 2007 IEEE International Conference on Information Reuse and Integration - Las Vegas, NV, USA (2007.08.13-2007.08.15)] 2007 IEEE International Conference on Information Reuse

JMaPSS: Spreading Activation Search for the Semantic Web

Kevin Gary✝, Bradley Szabo, Lavanya Vijayan, Braden Chapman, Jayavarshini Radhakrishnan, Aishwarya Sivaraman

Division of Computing Studies
Arizona State University at the Polytechnic Campus
7001 East Williams Field Road
Mesa, Arizona 85212
[email protected]

Abstract

The semantic web augments search by providing meta-information to structure knowledge. Challenges associated with search technology, such as accessing a large knowledge base with limited processing capability, may be addressed by AI techniques that provide greater flexibility albeit with less precision. In this paper we present JMaPSS, which applies a parallel search algorithm known as marker-passing to improve search relevancy results. We describe an instantiation of JMaPSS implemented specifically for semantic web search. Our investigations suggest that such techniques, using an expanded notion of recall emphasizing relevance, deserve additional exploration.

1. Introduction

Marker-passing is a parallel search technique with foundations in the spreading activation theories [1][2] of cognitive psychology. Broadly speaking, spreading activation theories assume human memory is organized as a semantic network, with nodes representing symbolic or sub-symbolic chunks of knowledge and links representing relationships between chunks. External stimuli “excite” nodes, causing a cascading excitation in a neighborhood of the originating nodes in the network. Marker-passing implements the spreading activation theory. A knowledge base is structured as a semantic network, and tokens, or markers, initiate a parallel search process in the repository. At the conclusion of the propagation process, nodes that have been visited by one or more markers are candidates for further processing.

For over a decade marker-passing enjoyed some popularity in AI, including applications in planning [3], natural language understanding [4][5][6], and knowledge representation and inferencing [7][8]. Popularity waned for several reasons. First, marker-passing became popular in an era where massively parallel computation was anticipated as the next great revolution in computing. As processor efficiency rose dramatically, the revolution sputtered, and the emphasis on parallel algorithms in AI waned. Second, the research communities in both cognitive psychology and AI never seemed to agree on a foundation for the theory and practice; instead a number of variants were experimented with (with some success), but momentum was never sustained.

The latter reason is especially telling. AI techniques, including marker-passing, were evaluated more on a formal basis, as a means of proving the correctness of a (limited) reasoning agent’s capabilities. The imprecise, informal, and unpredictable nature of the marker-passing algorithm did not fit this mold well (though some researchers tried). Of course, this was also an era before widespread use of the Internet. The utility of a search technique that naturally identifies relevant information was not understood at the time.

The Internet is an intractable web of knowledge that must be sifted through to find needed information. This is eerily analogous to the issues encountered by limited reasoning agents, who must draw conclusions based on incomplete knowledge and limited computational capabilities. Search engines are the prevalent method for finding information on the Internet. Most engines are based on some type of parsing, indexing, and ranking algorithm that returns repeatable search results. Given the similarities to limited reasoning agents, we believe it is worthwhile to revisit techniques such as marker-passing to see if they have some utility for performing web-based search. The Java Marker-passing Search System (JMaPSS) is an experiment in this area.

In this paper we present JMaPSS, a work-in-progress in exploring marker-passing for Internet search. We present an overview of the JMaPSS engine and describe its application to the semantic web. We discuss our initial explorations and compare it to related work, and conclude by suggesting avenues for future work.

1-4244-1500-4/07/$25.00 ©2007 IEEE

2. JMaPSS Overview

JMaPSS applies a heuristic marker-passing search algorithm to generic web-based search and semantic web search. In this section we present the basic features and operation of JMaPSS. The following sections describe the application of JMaPSS to these two types of search. The JMaPSS system has the following features:

• Web-based Interface. Users are provided with a web-based interface for creating a searchable graph structure, modifying the elements of that structure, and executing various tasks (see Table 1).

• Web Service Interface. An optional web services interface may be deployed that exposes WSDL for each major JMaPSS operation.

• Document Retrieval and Indexing. JMaPSS uses a web spider to collect documents and then parse and index them for use in searching.

• Creating Graph Structures from Indexed Documents. The system uses the indexed documents to build a graph structure in memory.

• Modifiability of Graph Elements and Functionality. The system enables the user to retrieve the state and functionality of individual graph elements and also to update those settings.

• Search Requests, Execution, and Results. Search is performed through user interfaces (UIs) that enable terms to be specified as search primers. The UIs handle receiving search requests, conducting the search process, and presenting search results.

JMaPSS functions are accessed through these tools:

Table 1. JMaPSS tools

Function          Description
Search query      The user enters a set of search terms.
Search results    Results are returned based on the excitation of nodes in the graph.
Function editor   The power user may modify marker-passing characteristics at any node.
Node editor       The power user may manipulate state (attribute values) at any given node.
Graph viewer      The user may navigate the semantic network and inspect nodes (Figure 1).
Node selector     Users may filter returned nodes by excitation level.
Reset editor      The power user may reset system state.
Web spider        Seeds the semantic network.

JMaPSS leverages existing open source technologies such as Lucene[9], JUNG[10], and Apache Commons Digester[11] to create the underlying search framework. A web spider utility extracts files from various websites. These files are parsed and indexed using Lucene. Lucene separates the terms and indexes them with their associated web addresses; the index is then converted into a semantic network held in a JUNG data structure. The marker-passing algorithm operates over this data structure. When a user provides a search query, markers (tokens extracted from the query) are created and injected into the network. The markers propagate in parallel from matching nodes throughout the network until their excitation level attenuates, that is, falls below a threshold. At the termination of this propagation process, nodes with excitation levels above a threshold are returned as the results of the search. The expectation is that if search terms are related via some common (intersecting) concept, then markers will converge on that concept, excite that node, and cause that node to be returned by the search process. Search results imply relevance; nodes are returned because they are related (recall), not because they “match” (precision).
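The propagation process described above can be illustrated with a minimal sketch. It is written in Python rather than against the JMaPSS Java code base, and the decay factor, attenuation threshold, starting excitation, and node names are all assumptions chosen for illustration; the paper does not state the values or the exact propagation function JMaPSS uses.

```python
from collections import defaultdict


def pass_markers(graph, seeds, decay=0.5, threshold=0.1, start=1.0):
    """Return the total excitation each node receives from all markers.

    graph maps each node to its neighbors; links are listed in both directions.
    decay, threshold, and start are illustrative values, not JMaPSS settings.
    """
    excitation = defaultdict(float)
    for seed in seeds:
        if seed not in graph:
            continue  # query term with no matching node
        visited = {seed}  # a marker excites each node at most once
        frontier = [(seed, start)]
        while frontier:
            next_frontier = []
            for node, strength in frontier:
                excitation[node] += strength
                passed = strength * decay
                if passed < threshold:
                    continue  # the marker has attenuated
                for neighbor in graph[node]:
                    if neighbor not in visited:
                        visited.add(neighbor)
                        next_frontier.append((neighbor, passed))
            frontier = next_frontier
    return dict(excitation)


def select(excitation, cutoff):
    """Nodes whose excitation reaches the cutoff, strongest first."""
    hits = [node for node, level in excitation.items() if level >= cutoff]
    return sorted(hits, key=lambda node: (-excitation[node], node))


# Hypothetical network: two query terms linked to the documents containing them.
graph = {
    "java": ["doc_island", "doc_coffee", "doc_language"],
    "beach": ["doc_island"],
    "doc_island": ["java", "beach"],
    "doc_coffee": ["java"],
    "doc_language": ["java"],
}
excitation = pass_markers(graph, ["java", "beach"])
documents = [n for n in select(excitation, cutoff=1.0) if n.startswith("doc_")]
print(documents)
```

In this toy network only the document reached by both markers clears the cutoff, which is the convergence behavior the paragraph above expects. The markers are processed one after another here for simplicity; the result is the same as for a parallel pass because each marker's contribution is simply summed.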

Figure 1. Graph Viewer Tool

3. Semantic Web Search

The standard JMaPSS engine creates a semantic network from web (HTML) pages. The semantic network consists of Document and Term nodes that store specific data related to each type of node. Links between nodes are represented as separate objects that correspond to relationships between connected nodes. For example, when an HTML file is parsed and indexed, the terms are stored in “Term” nodes and the HTML document in which the terms appear is stored in a “Document” node. Links are created between Term nodes and their corresponding Document node. If a term exists in more than one HTML document, links are created to all the Document nodes that contain that term.

This implementation had mixed success. Interestingly, it often finds relevant information at the intersection of two or more search terms that might otherwise not be found. However, the algorithm has proven difficult to tune, as it is sensitive to the topology of the semantic network. Although the implementation provides for per-node customization of marker propagation behavior, there is no basis for utilizing this behavior. Put another way, there is no semantic information that tells the algorithm when one Term node should be favored over another as markers propagate through a Document node. This means any term appearing in the document is considered as important as any other term. Furthermore, there is no effective way to normalize propagation for documents with varying numbers of terms.

We are currently exploring the use of link analysis techniques from traditional ranking engines for application within JMaPSS. We are also exploring stronger heuristics in the construction of our semantic network. In this exploration we use the semantic web to provide richer type information that can then be used to direct marker propagation. To understand how the semantic web can assist with marker propagation issues, we first describe the nature of an ontology.
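As a reference point for the Document and Term network described at the start of this section, the following sketch builds such a network from a handful of documents. It is a plain-Python stand-in, not the Lucene and JUNG code used in JMaPSS: the whitespace tokenizer, the `build_network` name, and the sample addresses are assumptions made for illustration.

```python
from collections import defaultdict


def build_network(documents):
    """Build undirected links between Term nodes and Document nodes.

    documents maps a document address to its text. Nodes are
    (node_type, label) pairs so the two node types stay distinct.
    """
    links = defaultdict(set)
    for address, text in documents.items():
        doc_node = ("Document", address)
        links[doc_node]  # create the node even if the document has no terms
        for token in text.lower().split():  # naive tokenizer, for illustration
            term_node = ("Term", token)
            links[term_node].add(doc_node)
            links[doc_node].add(term_node)
    return links


# Hypothetical documents.
network = build_network({
    "a.html": "Java island beach",
    "b.html": "Java coffee",
})
print(sorted(network[("Term", "java")]))
```

The number of links on a Document node equals its number of distinct terms, which makes the normalization problem noted above visible: a marker arriving at a long document has many more Term nodes to spread to than one arriving at a short document.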

3.1. OWL-Lite ontologies

Ontologies define terms used to describe and represent an area of knowledge. To support the sharing and reuse of formally represented knowledge, a common vocabulary is needed. An ontology is a representational vocabulary for a shared domain of discourse, providing definitions for classes, relations, functions, and other objects[12]. The Web Ontology Language (OWL)[13] is a W3C standard widely used for specifying ontologies for information on the Web. OWL-Lite is a subset of the OWL language that is commonly used to describe simple ontologies. The following OWL-Lite elements are relevant to JMaPSS: {Class, subClassOf, Property, subPropertyOf, domain, range, Individual}. Individuals that share properties are described by the element Class. An Individual is an instance of a Class, and a Property can be used to state relationships between individuals (ObjectProperty) or between individuals and data values (DatatypeProperty). Class hierarchies can be created by using the syntax element subClassOf. In the example given in [14], Entry is a Class with subclasses Article, Publisher, and Book, while elements prefixed by “has” are properties. DoubleDayPublisher and The DaVinci Code are Individuals of Publisher and Book, respectively.

Semantic typing information in OWL-Lite provides useful metadata for guiding heuristic search. One can envision users providing intelligent queries based on the ontological structure, and defining heuristics to direct relevance-oriented search engines such as JMaPSS.

3.2. Mapping OWL-Lite to a semantic network

To apply the marker-passing algorithm, an OWL-Lite file must be converted to a graph structure resembling a semantic network. This was done in JMaPSS by modifying the implementation in the following ways:

1. OWL-Lite files are parsed and elements extracted.
2. Elements are indexed and stored using Lucene.
3. Various indexed terms in Lucene are mapped to a JUNG-implemented graph.
4. The JUNG graph implementation was modified to support multiple types of nodes.

A graph structure for a portion of the publishing example from Knouf[14] is shown in Figure 2. Oval nodes represent classes, rounded rectangles represent properties, and octagonal nodes represent individuals. Type information on the links between nodes indicates the nature of the relationship between nodes, i.e. whether a class is a subclass of another class, a property is a subtype of another property, or a property describes a class (DatatypeProperty) or an individual (ObjectProperty). The additional typing information provided by an OWL-Lite ontology serves as input to the marker-passer’s propagation algorithm. The marker-passer can now define how much excitation to distribute from one type of node to another based on the type of relationship. For example, markers originating at Property nodes excite Class nodes at the intersection of those properties.
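One way to picture type-directed propagation is to attach a weight to each kind of link and scale a marker's strength by that weight as it crosses the link. The sketch below does this in Python; the weight values, the `pass_typed_markers` name, and the `hasEdition` property are hypothetical, and the paper does not specify the propagation functions JMaPSS actually assigns.

```python
from collections import defaultdict

# Hypothetical fraction of a marker's strength passed across each link type.
LINK_WEIGHTS = {"subClassOf": 0.8, "domain": 0.6, "range": 0.6}


def pass_typed_markers(links, seeds, threshold=0.1, start=1.0):
    """links: (source, link_type, target) triples, traversed in both directions."""
    neighbors = defaultdict(list)
    for source, link_type, target in links:
        neighbors[source].append((target, link_type))
        neighbors[target].append((source, link_type))

    excitation = defaultdict(float)
    for seed in seeds:
        visited = {seed}  # a marker excites each node at most once
        frontier = [(seed, start)]
        while frontier:
            next_frontier = []
            for node, strength in frontier:
                excitation[node] += strength
                for neighbor, link_type in neighbors[node]:
                    # Unknown link types pass no excitation.
                    passed = strength * LINK_WEIGHTS.get(link_type, 0.0)
                    if passed >= threshold and neighbor not in visited:
                        visited.add(neighbor)
                        next_frontier.append((neighbor, passed))
            frontier = next_frontier
    return dict(excitation)


# A fragment loosely modeled on the publishing example; hasEdition is made up.
links = [
    ("Book", "subClassOf", "Entry"),
    ("Article", "subClassOf", "Entry"),
    ("hasPublisher", "domain", "Book"),
    ("hasPublisher", "range", "Publisher"),
    ("hasEdition", "domain", "Book"),
    ("hasJournal", "domain", "Article"),
]
excitation = pass_typed_markers(links, ["hasPublisher", "hasEdition"])
classes = ["Entry", "Book", "Article", "Publisher"]
print(max(classes, key=excitation.get))
```

With markers started at the two Property nodes, the most excited Class node is the one both properties describe, matching the intersection behavior stated above. Changing the entries in `LINK_WEIGHTS` shifts how far excitation travels along each relationship type, which is the tuning knob the typing information makes available.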

Figure 2. Sample OWL-Lite Graph


The semantic network construction algorithm for the sample OWL-Lite ontology is given in Figure 3.

Figure 3. Semantic Network Algorithm

3.3. JMaPSS Validation Scenarios

Heuristic search based on semantic web information may be applied to resolve semantic ambiguity scenarios. One scenario is where a search engine attempts to resolve queries by identifying concepts relevant to other concepts. Other scenarios are derived from the problem of ontology matching, where two reasoning agents must negotiate common understanding in the face of distinct ontologies that may or may not partially overlap. To evaluate the behavior of JMaPSS search, three scenarios were constructed. The first simply verified that JMaPSS returned relevant information when presented with single-term search queries on a single instance of JMaPSS. It served mainly to verify the proper implementation of the engine and to tune propagation behavior, and so is not presented here. The remaining two are presented below as Scenario 1 and Scenario 2.

3.3.1. Scenario 1: Concept ambiguity. Searching for a concept (keyword) that has different meanings and is represented by different ontologies should result in a term that is an intersection of the meanings of the query terms. For example, “Java” has three different meanings: coffee, computer language, and island. If we search for Java, we will get results related to all three meanings. But if we qualify the search with a description like “Java with Beaches”, then JMaPSS-based semantic search should give results related only to the island. For this scenario, two OWL-Lite files were created and deployed in JMaPSS, one describing an analog still picture camera (top half of Figure 4) and one describing a digital video camera (Figure 6). We presented JMaPSS with the following queries:

1. Camera that has Compression
2. Camera with CCD
3. Camera with MPEG

There are two expected outcomes to this experiment: (1) the result should be an intersection of the query terms, and (2) if a concept is searched for using its properties, the search should resolve to the concept itself even though the concept's keyword is not part of the query. In this example, we expected digital video camera as a result even though the query was formed without using the keywords digital or video.

Figure 4. Digital camera ontology schematic

Semantic network algorithm (Figure 3):

1. For each Class A, create a node for A.
   Example: Entry, Book, Article, Publisher
2. For each subclass B of A, link B to A.
   Example: Article to Entry, Book to Entry
3. For each Property P, create a node for P.
   Example: hasJournal, humanCreator
4. For each DatatypeProperty DP, connect DP to Class A if DP is a propertyOf A (A is the domain of DP). Only connect DP at the highest level of the Class hierarchy.
   Example: hasName
5. For each ObjectProperty OP, connect OP to its domain and range Classes.
   Example: hasPublisher, dom: Book, range: Publish
6. For each subProperty S of Property P, connect S to P.
7. Create a node for each Instance.
   Example: “The Davinci Code”, “Doubleday”
8. For each Instance X of Class A, link X->A.
   Example: link “The Davinci Code”->Book
9. If Class A is the range of some ObjectProperty P on another Class B (the domain), then connect the Instance X of Class A to the Instance Y of Class B.
   Example: “The Davinci Code” is linked to “Doubleday” because, in the XML, the value of the Book’s publisher is the Publisher instance Doubleday.
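The nine construction steps can be sketched compactly in Python. This is an illustration under stated assumptions, not the JMaPSS implementation: the dictionary layout of the parsed ontology, the `build_semantic_network` name, and the link-type labels are hypothetical, and step 4's "highest level" rule is approximated by linking each datatype property only to the single domain class declared for it.

```python
def build_semantic_network(onto):
    """Return (nodes, links) for a parsed ontology given as plain dicts."""
    nodes = {}     # node name -> node type
    links = set()  # (source, link type, target)

    # Steps 1-2: one node per class, linked to its parent class if any.
    for cls, parent in onto["classes"].items():
        nodes[cls] = "Class"
        if parent is not None:
            links.add((cls, "subClassOf", parent))

    # Steps 3-6: one node per property, linked to its domain, to its range
    # (object properties only), and to its parent property if any.
    for prop, spec in onto["properties"].items():
        nodes[prop] = "Property"
        links.add((prop, "domain", spec["domain"]))
        if spec["kind"] == "object":
            links.add((prop, "range", spec["range"]))
        if spec.get("sub_property_of"):
            links.add((prop, "subPropertyOf", spec["sub_property_of"]))

    # Steps 7-8: one node per instance, linked to its class.
    for inst, spec in onto["instances"].items():
        nodes[inst] = "Individual"
        links.add((inst, "instanceOf", spec["class"]))

    # Step 9: link instances related through an object property value.
    for inst, spec in onto["instances"].items():
        for prop, value in spec.get("values", {}).items():
            if onto["properties"][prop]["kind"] == "object":
                links.add((inst, prop, value))
    return nodes, links


# Fragment of the publishing example, in the hypothetical dict layout.
onto = {
    "classes": {"Entry": None, "Book": "Entry", "Article": "Entry",
                "Publisher": "Entry"},
    "properties": {
        "hasName": {"kind": "datatype", "domain": "Entry"},
        "hasPublisher": {"kind": "object", "domain": "Book",
                         "range": "Publisher"},
    },
    "instances": {
        "The Davinci Code": {"class": "Book",
                             "values": {"hasPublisher": "Doubleday"}},
        "Doubleday": {"class": "Publisher"},
    },
}
nodes, links = build_semantic_network(onto)
print(len(nodes), len(links))
```

Labeling the step 9 link with the property name is a design choice made here so that a propagation function could weight instance-to-instance links by property; the algorithm as stated only requires that the two instances be connected.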


The result of the query is shown in Figure 5. JMaPSS’s result matched the expected outcomes. We can see that the terms Camera and MPEG are two hops away from the term “DigitalVideoCamera” yet excite it. DigitalVideoCamera is an intersection of Camera and MPEG. Also, even though the search query did not contain the terms digital or video, the search yielded “DigitalVideoCamera” as one of the outputs.

Figure 5. Scenario 1 query results

3.3.2. Scenario 2: Partially overlapping ontologies. The goal of this experiment is to map an ontology that is a subset of another ontology of a given domain, and to check whether search is able to resolve queries pertaining to each of these ontologies. For this experiment, we created four ontologies, for an analog still picture camera (ASPC), an analog video camera (AVC), a digital still picture camera (DSPC), and a digital video camera (DVC), such that ASPC ⊂ AVC, DSPC, and DVC; DSPC ⊂ DVC; and AVC ⊂ DVC. Two OWL-Lite files were created and deployed in JMaPSS (see full Figure 4 and Figure 6). The following queries were given using the search interface:

1. Camera with shutterspeed 200
2. DigitalCamera with FlashMemory

A search for “camera” should return results related to ASPC and perhaps other elements. A search for “digital camera” using digital camera properties should return results related to DSPC and not the other ontologies.

Figure 6. Digital video camera ontology

Results for queries 1 and 2 are given in Figures 7 and 8.

Figure 7. Results for scenario 2, query 1


Figure 8. Results for scenario 2, query 2

The results were as expected. Search for camera yielded all terms related to camera and not digital camera. Search for digital camera yielded all terms related to digital camera and not camera.

4. Related Work

A few other examples of spreading activation implementations for the semantic web have recently appeared in the literature. In [15], Rocha et al. present a hybrid approach to implementing a spreading activation model that relies on domain experts to set numeric weights on relationships defined by the ontology. The same effect is achieved in JMaPSS by assigning different propagation functions to different node types. Rocha et al. also allow different initial propagation values to be assigned based on the relevance of the input token (an instance, or Individual as described here for OWL-Lite) to a specified task. The paper also discusses how marker propagation must be constrained based on network topology, a common issue as discussed above. In [16], the author describes another hybrid search strategy, but one that does not assign weights a priori through human intervention. Instead, the algorithm starts with an initial estimate based on the documents stored in the network, and uses a feedback process to adjust weights over time. Given an assumption of a growing knowledge base and the imprecision in attempting to manually assign weights, a feedback or training process should help with tuning search results.

5. Future Work

JMaPSS embodies the good and bad of spreading activation theories and implementations. The JMaPSS implementation for the semantic web shows that the additional meta-information included in OWL-Lite documents may be used to direct excitation down “relevant” pathways in the semantic network. The result is a flexible, intuitive algorithm that may find interesting information for the user that might otherwise not be returned using a more conventional search engine. The flexibility also leads to inexplicable connections and a lack of repeatable search results. While the former may be accepted in exchange for interesting results, the latter violates a fundamental assumption of most modern search engines. Whether this assumption is appropriate is a topic outside the scope of this paper.

6. References

[1] M.R. Quillian, “Semantic Memory”, in M. Minsky (ed.) Semantic Information Processing, MIT Press, Cambridge, MA. pp. 227-270, 1968.

[2] A. Collins and E. Loftus, “A spreading activation theory of semantic processing”, The Psychological Review, 82(6):407-428, 1975.

[3] J. Hendler, Integrating marker-passing and problem solving: A spreading activation approach to improved choice in planning, Lawrence Erlbaum Associates, Hillsdale, NJ, 1988.

[4] E. Charniak, “Passing markers: A theory of contextual influence in language comprehension”, Cognitive Science, 7:171-190, 1983.

[5] E. Charniak, “A neat theory of marker-passing”, Proceedings of AAAI-86, pp. 584-588, 1986.

[6] P. Norvig, “Marker passing as a weak method for text inferencing”, Cognitive Science, 13:569-620, 1989.

[7] S. Fahlman, NETL: A system for representing and using real-world knowledge, MIT Press, Cambridge, MA, 1979.

[8] W. Lee and D. Moldovan, “The design of a marker passing architecture for knowledge processing”, Proceedings of AAAI-90, pp. 59-64, 1990.

[9] Apache Lucene Overview [Online]. Available: http://lucene.apache.org/java/docs.

[10] JUNG. [Online]. Available: http://jung.sourceforge.net/doc/index.html.

[11] O. Gospodnetic. Parsing, indexing and searching XML with Digester. Available: http://www-128.ibm.com/developerworks/library/j-lucene/index.html.

[12] T. R. Gruber, “A translation approach to portable ontologies”, Knowledge Acquisition, pp. 199-220, 1993.

[13] World-Wide Web Consortium (W3C), Web Ontology Language (OWL). [Online]. Available: http://www.w3.org/2004/OWL/

[14] N. Knouf. bibTeX Definition in Web Ontology Language (OWL). [Online] Available: http://visus.mit.edu/bibtex/0.1/

[15] C. Rocha, D. Schwabe, and M.P. de Aragao, “A hybrid approach for searching in the semantic web”, Proceedings WWW 2004, May 2004.

[16] M.M. Hasan, “A spreading activation framework for ontology-enhanced adaptive information access within organizations”, International Symposium on Agent Mediated Knowledge Management, pp. 288-296, Stanford, CA, March 2003.
