Graphical Representations of Knowledge and Its Distribution
Cliff Behrens
Information Analysis, Applied Research
Telcordia Technologies, Inc.
973.829.5198
cliff@research.telcordia.com
Workshop on Statistical Inference, Computing and Visualization for Graphs
Stanford University, August 1 - 2, 2003
Knowledge, Consensus and Information Sharing
Cultural Knowledge Derived from Consensus
Individual Knowledge
Information Sharing Among Individuals in a Single COI
Consensus Knowledge
Schemer Knowledge Validation Services
Issues with CSCW technology
– CSCW research focuses on new tools, less on motivating their use
– Collaborative model building often lacks scientific rigor and quality control
Schemer: Web-based technology that derives knowledge from consensus among Subject Matter Experts (SMEs)
– Knowledge-based collaboration reveals distribution of domain expertise among panelists
– Metrics for qualifying panelists and validating the models they produce
  • validates saliency of domain to SMEs
  • estimates competency of SMEs
  • yields best answers based on responses of SMEs weighted by their respective competencies
Generic service, but first tried on SIAM® influence networks
SIAM® Influence Net Example
Mathematics of Consensus Analysis (Romney et al. 1986)
Formal model consists of a data matrix $X$ containing the responses $X_{ik}$ of SMEs $i = 1, \dots, N$ on items $k = 1, \dots, M$
– from this matrix a symmetric matrix $M^*$ is estimated; its off-diagonal elements hold the empirical point estimates $M^*_{ij}$, the proportion of matching responses on all items between SMEs $i$ and $j$, corrected for guessing (if appropriate)
Obtain an approximate solution yielding estimates of the individual SME competencies (the $D^*_i$) by applying Maximum Likelihood Factor Analysis to fit the equation below and solve for the main diagonal values
– $M^* = D^* D^{*\prime}$
– relative magnitude of eigenvalues ($\lambda_1 > 3\lambda_2$) implies a single-factor solution
The $D^*_i$ are the loadings for SMEs on the first factor
– $D^*_i = v_{1i}\sqrt{\lambda_1}$
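The competency estimate above can be sketched in a few lines of NumPy. This is a minimal illustration, not the Schemer implementation: the function names `match_matrix` and `estimate_competencies` are hypothetical, the guessing correction $(L \cdot \text{match} - 1)/(L - 1)$ assumes $L$ equally likely answers per item, and iterated principal-axis factoring is swapped in for the Maximum Likelihood Factor Analysis named on the slide.

```python
import numpy as np

def match_matrix(X, L):
    """Pairwise proportion of matching answers, corrected for guessing.

    X is an N x M array of answers coded 0..L-1 (N SMEs, M items).
    Assumes every item has L equally likely answers.
    """
    X = np.asarray(X)
    raw = (X[:, None, :] == X[None, :, :]).mean(axis=2)  # N x N match proportions
    return (L * raw - 1.0) / (L - 1.0)

def estimate_competencies(M_star, n_iter=500, tol=1e-10):
    """Fit M* ~ D D' on the off-diagonal and return (D, eigenvalues).

    Iterated principal-axis factoring: the unknown main diagonal is
    re-estimated as D_i**2 until it stops changing.
    """
    M = np.array(M_star, dtype=float)
    off = M.copy()
    np.fill_diagonal(off, -np.inf)
    np.fill_diagonal(M, off.max(axis=1))      # starting guess for the diagonal
    for _ in range(n_iter):
        eigvals, eigvecs = np.linalg.eigh(M)  # eigenvalues in ascending order
        v1 = eigvecs[:, -1]
        if v1.sum() < 0:                      # eigenvector sign is arbitrary
            v1 = -v1
        D = v1 * np.sqrt(max(eigvals[-1], 0.0))  # D_i = v_1i * sqrt(lambda_1)
        new_diag = D ** 2
        if np.max(np.abs(new_diag - np.diag(M))) < tol:
            break
        np.fill_diagonal(M, new_diag)
    return D, eigvals[::-1]                   # eigenvalues, largest first
```

A caller would check the single-factor condition by comparing the first two returned eigenvalues (λ1 > 3 λ2) before trusting the competencies.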
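Estimated competency values ($D^*_i$) and the profile of responses for item $k$ ($X_{ik,l}$) are used to compute Bayesian a posteriori probabilities for each possible answer. The probability of the observed responses, given that answer $l$ is the best or "correct" one for item $k$, follows:
– $\Pr(\langle X_{ik} \rangle_{i=1}^{N} \mid Z_k = l) = \prod_{i=1}^{N} [D^*_i + (1 - D^*_i)/L]^{X_{ik,l}} \, [(1 - D^*_i)(L - 1)/L]^{1 - X_{ik,l}}$
where $X_{ik,l} = 1$ if SME $i$ gave answer $l$ on item $k$ and 0 otherwise, and $L$ is the number of possible answers.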
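As a rough illustration of how the formula is used, the sketch below evaluates it for every candidate answer of one item and normalizes with Bayes' rule. The function name `answer_posteriors` and the uniform prior over answers are assumptions made here; the two bracketed terms are taken as written on the slide.

```python
import numpy as np

def answer_posteriors(responses, D, L, prior=None):
    """Posterior probability that each of L answers is the correct one.

    responses: length-N array with the answer (0..L-1) each SME gave to one item.
    D: length-N array of estimated competencies D*_i.
    prior: optional length-L prior; uniform if omitted (an assumption).
    """
    responses = np.asarray(responses)
    D = np.clip(np.asarray(D, dtype=float), 1e-9, 1 - 1e-9)  # keep logs finite
    prior = np.full(L, 1.0 / L) if prior is None else np.asarray(prior, float)

    log_hit = np.log(D + (1 - D) / L)          # first bracket, exponent X_ik,l
    log_miss = np.log((1 - D) * (L - 1) / L)   # second bracket, exponent 1 - X_ik,l

    log_post = np.log(prior)
    for l in range(L):
        gave_l = responses == l                # X_ik,l for this candidate answer
        log_post[l] += np.where(gave_l, log_hit, log_miss).sum()

    log_post -= log_post.max()                 # stabilize before exponentiating
    post = np.exp(log_post)
    return post / post.sum()

# Two competent SMEs agree on answer 1; one weak SME picks answer 0.
posterior = answer_posteriors([1, 1, 0], D=[0.9, 0.9, 0.2], L=3)
best_answer = int(posterior.argmax())
```

Working in log space avoids underflow when the panel is large, since the product runs over all N SMEs.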
Schemer Knowledge Validation Services
Knowledge-Based Communications Interface
Structured Collaboration and Advice Network
• User’s relation to other SMEs
• Most similar point-of-view
• Most different point-of-view
• Someone a bit more knowledgeable
• Gurus
• Novel thinkers
Information Routing
• Supports/challenges one’s point-of-view
• Supports/challenges the consensus point-of-view
SME Contact Data
• Email services
• Meeting services
• Other plug-ins
Latent Semantic Indexing (LSI): What is it?
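[Figure: three documents (Doc 1, Doc 2, Doc 3) shown in two vector spaces. One panel, "Standard Vector Space Model (ndims = nterms)", has axes for the terms memory, chip and computer. The other panel, "Reduced LSI Vector Space Model (ndims << nterms)", has axes LSI Dimension 1 and LSI Dimension 2, with the terms chip, memory and computer also labeled.]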
LSI: How Does It Work?
Analyze training collection of documents
– throw out stop words and mark-up
– count frequencies of words in each document
Compute term-document matrix
– store word counts as entries in a matrix
– apply appropriate weighting, e.g., log-entropy, to entries
Compute LSI vector space
– reduce term-document matrix with Singular Value Decomposition
Fold new documents into LSI vector space
– document vector computed from weighted sum of its term vectors
Compute vector for query ("pseudo-document")
– query vector computed from weighted sum of its term vectors
Search vector space for semantically close term/document vectors
– compute cosine of angle between query and other vectors
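The steps above can be sketched end to end with NumPy. Everything named here (`tokenize`, `term_document_matrix`, `log_entropy`, `build_lsi`, `fold_in`, `cosine`, the toy stop-word list and the toy documents) is illustrative and assumed, not taken from the talk; the log-entropy weights use the common form log(1 + tf) times 1 + Σ p log p / log n.

```python
import numpy as np

STOP_WORDS = {"the", "a", "of", "and", "in", "is", "for", "to"}  # toy list (assumption)

def tokenize(text):
    """Lower-case, drop stop words and non-alphabetic tokens."""
    return [w for w in text.lower().split() if w.isalpha() and w not in STOP_WORDS]

def term_document_matrix(docs):
    """Raw word counts: one row per term, one column per document."""
    vocab = sorted({w for d in docs for w in tokenize(d)})
    index = {w: i for i, w in enumerate(vocab)}
    A = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in tokenize(d):
            A[index[w], j] += 1
    return A, index

def log_entropy(A):
    """Log-entropy weighting; returns the weighted matrix and global weights."""
    p = A / A.sum(axis=1, keepdims=True)
    plogp = p * np.log(np.where(p > 0, p, 1.0))        # 0 * log(1) = 0 where p == 0
    g = 1.0 + plogp.sum(axis=1) / np.log(A.shape[1])   # needs more than one document
    return np.log1p(A) * g[:, None], g

def build_lsi(A_weighted, k):
    """Truncated SVD: term vectors U_k, singular values s_k, document vectors V_k."""
    U, s, Vt = np.linalg.svd(A_weighted, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k].T

def fold_in(text, index, g, U_k, s_k):
    """Place a new document or query in the space as a weighted sum of term vectors."""
    counts = np.zeros(len(index))
    for w in tokenize(text):
        if w in index:                                 # unseen terms are ignored
            counts[index[w]] += 1
    return (np.log1p(counts) * g) @ U_k / s_k

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

docs = ["computer memory chip", "memory chip silicon computer",
        "potato chip corn snack", "corn potato sugar"]
A, index = term_document_matrix(docs)
W, g = log_entropy(A)
U_k, s_k, V_k = build_lsi(W, k=2)
query = fold_in("computer memory", index, g, U_k, s_k)
similarities = [cosine(query, doc_vec) for doc_vec in V_k]
```

Folding a training document back in reproduces its row of `V_k`, which is a quick sanity check that queries and documents share one coordinate system.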
Scalability: Large Document Collections and Polysemy
[Figure: "Many Undifferentiated Conceptual Domains/COIs". The polysemous terms "chip" and "wafer" appear together with the terms potato, corn, sugar, silicon, valley and copper. In the accompanying plot, Dimension 1 is labeled with valley, silicon and copper, and Dimension 2 with sugar, corn and potato, along with wafer and chip.]
LSI: Ongoing Work
Distributed LSI
– Needed for LSI to scale to massive document collections
Adopts “divide and conquer” approach
– Sort documents by conceptual domain
  • recognizes documents created for different COIs
  • create more semantically homogeneous subcollections
  • apply cluster analysis, e.g., bisecting K-means
– Compute independent LSI vector spaces for each subcollection
  • more parsimonious representations of concept domains or contexts
– Compute similarity measures between spaces
  • construct graphs from terms shared by two vector spaces
  • compute similarity between these two graphs
– Discover appropriate search vector spaces for a query
  • cosine calculations (as before)
  • relevance feedback (as before)
  • query expansion
Visualizations to explore semantic context for a query in different LSI vector spaces
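The slides do not say how the term graphs are built or compared, so the following is only one plausible reading, under stated assumptions: each graph is the matrix of cosine similarities between the term vectors of the shared terms, and two graphs are compared by the Pearson correlation of their edge weights. The names `term_graph` and `space_similarity` are hypothetical.

```python
import numpy as np

def term_graph(term_vectors, vocab, shared):
    """Weighted graph over the shared terms: edge weight = cosine similarity.

    term_vectors: array with one row per term of a single LSI space.
    vocab: dict mapping term -> row index in term_vectors.
    """
    rows = np.array([term_vectors[vocab[t]] for t in shared], dtype=float)
    norms = np.linalg.norm(rows, axis=1, keepdims=True)
    unit = rows / np.where(norms > 0, norms, 1.0)
    return unit @ unit.T

def space_similarity(tv_a, vocab_a, tv_b, vocab_b):
    """Correlation between the edge weights of the two shared-term graphs.

    Returns NaN when fewer than three terms are shared (too few edges).
    """
    shared = sorted(set(vocab_a) & set(vocab_b))
    if len(shared) < 3:
        return float("nan")
    graph_a = term_graph(tv_a, vocab_a, shared)
    graph_b = term_graph(tv_b, vocab_b, shared)
    upper = np.triu_indices(len(shared), k=1)   # each edge once, no self-loops
    return float(np.corrcoef(graph_a[upper], graph_b[upper])[0, 1])
```

Because only angles between term vectors are used, the score does not change when one space is rotated, which matters since independently computed SVDs have no common orientation.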
DLSI: Experiments with NSF-Movie Review Corpus
Vector Space       Dimensions   Non-stop Terms   Documents
NSF-Geology        298          25,963           3,255
NSF-Engineering    229          30,247           3,057
NSF-Biology        224          38,176           3,645
Movie Reviews      239          70,411           3,557
All Documents      282          122,685          13,514
DLSI: The Context of Term Meaning
Graph of semantic relationships between top five terms retrieved for the query {travel, center, earth} from the vector space containing only NSF geology abstracts.
Graph of semantic relationships between top five terms retrieved for the query {travel, center, earth} from the vector space containing only Ebert movie reviews.
Graph of semantic relationships between top five terms retrieved for the query {travel, center, earth} from the vector space containing all documents.
[Figure: node labels recovered from the three graphs, in order of extraction: (center, research, earth, reports, travel); (alien, earth, science-fiction/sci-fi, travel); (cooperative, earth, university, center/center’s).]