® 2003
Unstructured Information Management
Content and Technology Integration
SESAM, 20 October 2004
Mikael Thorson, Infosphere AB, Stockholm, Sweden
® 2004
Session Topic
Integrating Content and Technology using Unstructured Information Management
There is data, data everywhere, but is it organized in any structured way? Infosphere suggests a major problem facing information professionals and content managers is not only finding information, but making sense of huge volumes of data.
Success in this area is dependent on the integration of content and technology.
Explore the new realm of Unstructured Information Management (UIM) and its potential for changing day‐to‐day research and information management.
The Problem Area
“There are many superb technologies out there, but without content they are useless!”
Today, the information challenges or hurdles facing most organizations include both content and technology issues:
• The mix of free and for‐fee content – why should we pay for content?
• How to integrate existing proprietary content with external content?
• Can we combine our knowledge and allow data fusion of all sources?
• What current IT tools are available and suitable to our needs?
• In the world of automation – where is the human filter needed?
Islands of Information – bridging the information assets
There are basically three ways to integrate content, or to provide “data fusion”:
1. The Manual Approach – force content into a pre‐defined structure
• Traditional Library Science
• Use of Taxonomies or even Ontologies
• Manually intensive and relies on tagging schemas
2. The Automatic Approach – allow the system to learn from the data itself
• Artificial Intelligence and Statistics
• Auto Categorization (clustering) with auto tagging
• Not as accurate as a skilled human classifier (ambiguities)
3. Combination methods
• Use each approach where it is either more accurate or “cheaper”
• Require more knowledge from the administrators and implementers
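The combination approach can be sketched as a router that accepts an automatic label when the classifier is confident and otherwise hands the document to a human. This is a minimal sketch under stated assumptions: the keyword-overlap scorer, the category names, the keywords, the threshold and the `route` function are all hypothetical and only illustrate the idea.

```python
# Hypothetical taxonomy: category -> indicative keywords (illustration only).
TAXONOMY = {
    "patents": {"patent", "claim", "invention"},
    "market": {"market", "customer", "price"},
}


def auto_classify(text):
    """Return (best_category, confidence) from simple keyword overlap."""
    words = set(text.lower().split())
    scores = {cat: len(words & kws) / len(kws) for cat, kws in TAXONOMY.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


def route(text, threshold=0.5):
    """Accept the automatic label if confident, else send to a human."""
    category, confidence = auto_classify(text)
    if confidence >= threshold:
        return ("auto", category)
    return ("manual", None)
```

For example, `route("the patent claim covers a new invention")` is labelled automatically, while a text that matches no keywords falls through to the manual queue. Where to set the threshold is exactly the cost versus accuracy trade-off named in the list above.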
The World of Unstructured Data
“Merrill Lynch estimates that more than 85 percent of all business information exists as unstructured data – commonly appearing in e‐mails, memos, notes from call centers and support operations, news, user groups, chats, reports, letters, surveys, white papers, marketing material, research, presentations and Web pages.”
DM Review Magazine, February 2003 Issue
Information is growing exponentially, and unstructured data is growing in importance
[Figure: volume of stored information per year, 1998 to 2009, on a logarithmic scale from 10^12 bytes (terabyte) through petabyte, exabyte and zettabyte to 10^24 bytes (yottabyte). Series: paper, CDR, servers, desktops, digital tape, DVD, microdisk, online, offline, Internet (static HTML, dynamic pages) and total. Growth annotations: 60%, 100%, 200%, 300%. Source: IBM]
Business decisions are increasing in complexity, while time to make decisions is being compressed
Pressures on decision makers:
• Product Diversity
• Time Compression
• Customer Diversity
• Globalization
• Networks and Alliances
Telco example: 45,000 services, “ten week” markets, 30 million subscribers, releasing in 15 countries, 1,000 suppliers
It takes time to structure information; thus, companies that make use of unstructured information gain a competitive advantage
[Figure: information sources plotted by quality of data (raw to synthesized) and timeliness of data (instantaneous to historical), over the life of an emerging technology from formation to deployment. Sources shown: e‐mail, chat rooms, news groups, personal web sites, e‐commerce sites, online news sites, alternative press, news magazines, periodical magazines, trade publications, research reports, patents and academic publications.]
Definitions of Unstructured Information Management
“Explore the new realm of Unstructured Information Management”
Unstructured Information Management is not the same as Knowledge Management
Nonaka model of organizational learning:
• Polanyi: tacit and explicit knowledge
• Nonaka: organisational learning involves conversion between these forms
– Tacit to Tacit: Socialization
– Tacit to Explicit: Externalization
– Explicit to Tacit: Internalization
– Explicit to Explicit: Combination
UIM:
• Combination: systematizing concepts into a knowledge system
• Internalization: embody explicit knowledge into tacit knowledge
Source: I. Nonaka and H. Takeuchi, The Knowledge Creating Company.
Information comes in many formats, most of which are not readily understandable by computers
• Structured Information
– Structured information is information that is analyzed!
– The schema comes from a data model
– e.g. data stored in a relational database
• Semi Structured Information
– Data that may be irregular or incomplete and have a structure that may change rapidly or unpredictably
– The schema is discovered from the data, rather than imposed by the data model
– e.g. XML markup
• Unstructured Information
– Is not analysed and not readily understandable by machines
– Has no structure other than implicit structure imposed by rules and conventions in language
– e.g. e‐mails, letters, news articles
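The three categories above can be illustrated by storing the same fact in each form and seeing how much work it takes to read it back. This is a rough sketch using only the Python standard library; the table name, XML element names, sample person and regular expression are invented for the example.

```python
import re
import sqlite3
import xml.etree.ElementTree as ET

# Structured: the schema is declared up front by the data model.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contact (name TEXT, city TEXT)")
conn.execute("INSERT INTO contact VALUES (?, ?)", ("Anna Berg", "Stockholm"))
structured = conn.execute("SELECT name, city FROM contact").fetchone()

# Semi-structured: the structure is discovered by walking the markup.
xml_doc = "<contact><name>Anna Berg</name><city>Stockholm</city></contact>"
semi = {child.tag: child.text for child in ET.fromstring(xml_doc)}

# Unstructured: only language conventions hint at the structure, so a
# brittle pattern is needed to recover the same two fields.
letter = "Yesterday I met Anna Berg from Stockholm at the trade fair."
match = re.search(r"met (\w+ \w+) from (\w+)", letter)
unstructured = match.groups() if match else None
```

All three variables end up holding the same name and city, but the pattern used for the letter breaks as soon as the sentence is phrased differently, which is the gap text mining tries to close.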
Unstructured Information Management Definition
Unstructured Information Management consists of the tools and methods necessary to
• Store,
• Access & Retrieve,
• Navigate, and
• Discover
knowledge in primarily unstructured text‐based information.
Source: Unstructured Information Management, Research Report, Infosphere AB
Unstructured Information Management consists of many sub-domains, of which the most interesting are Text Mining and Visualization
• Document / Content Management
– content authoring (limited metadata capturing)
– controlled archiving
– versioning / editions / roll‐backs
• Search & Retrieval
– find documents again
• Text Mining
– helps with understanding and use of information
– contributes to semantic processing in the above tasks
– organization of content into categories
– helps a person decide if information is likely to be useful
• Visualization
– be able to view hidden knowledge
[Figure: UIM functionality levels (Data Storage, Access & Retrieval, Navigation, Discovery) and the technologies involved (DBMS, file servers, CMS, IR, text mining, visualization)]
Different technologies assist in the information-to-interpretation value chain
• Stored information: Databases, Content Management, File Repositories
• Access the available information: Connectors (Spiders, Crawlers), Document Filters, Search / Retrieval (Indexing)
• Extract valuable insights through mining techniques: Feature Extraction, Language Identification, Clustering, Classification, Summarization
• Interpretation of new insights through applications: Taxonomy, Alerts / Profiling, Expertise, Relationships, Visualization
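The access step of this value chain (Search / Retrieval with indexing) can be illustrated with a tiny inverted index. This is a sketch only: whitespace tokenization, the boolean AND query and the sample documents are assumptions made for the example, not a description of any particular product.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercase token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token.strip(".,;:!?")].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every query term (boolean AND)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Invented sample documents.
docs = {
    1: "Text mining extracts insights from documents.",
    2: "Crawlers fetch documents for indexing.",
    3: "Visualization shows mining results.",
}
index = build_index(docs)
```

Here `search(index, "mining")` returns documents 1 and 3, and `search(index, "documents mining")` narrows the result to document 1. The mining and interpretation layers build on an access layer of this kind.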
Application Areas
“It is not only finding information, but making sense of huge volumes of data”
The Top Five Text Mining Applications
• Language identification – Automatically identifies the language of a document for expedited processing
• Clustering – Groups related documents based on their content, without requiring predefined classes
• Categorization – Assigns documents to one or more user‐defined categories
• Summarization – Extracts sentences from a document to create a document summary
• Feature extraction – Recognizes significant items in text, such as names, technical terms and abbreviations
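Two of the applications listed above, language identification and summarization, are simple enough to sketch in a few lines. The stopword lists, the frequency-based scoring and the function names below are assumptions chosen for brevity; production systems use much richer language profiles and scoring.

```python
import re
from collections import Counter

# Tiny stopword lists (an assumption; real systems use larger profiles).
STOPWORDS = {
    "english": {"the", "and", "of", "is", "to", "in"},
    "swedish": {"och", "att", "det", "som", "en", "är"},
}


def identify_language(text):
    """Pick the language whose stopwords occur most often in the text."""
    tokens = re.findall(r"\w+", text.lower())
    counts = {lang: sum(t in sw for t in tokens) for lang, sw in STOPWORDS.items()}
    return max(counts, key=counts.get)


def summarize(text, n_sentences=1):
    """Extract the sentences whose words are most frequent in the document."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = re.findall(r"\w+", text.lower())
    freq = Counter(w for w in words if w not in STOPWORDS["english"])

    def score(sentence):
        toks = re.findall(r"\w+", sentence.lower())
        return sum(freq[t] for t in toks) / (len(toks) or 1)

    top = sorted(sentences, key=score, reverse=True)[:n_sentences]
    # Keep the original order so the summary reads naturally.
    return [s for s in sentences if s in top]
```

The summarizer is purely extractive, as in the bullet above: it selects existing sentences and never writes new text.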
Case Study – Raynaud’s Disease
The concept of hidden links or “undiscovered public knowledge” was developed and exemplified in a paper by Prof. Don R. Swanson, University of Chicago. In his paper, Medline titles were used to find and identify a problem or topic of interest. From the downloaded titles, the software constructs input for additional database searches and produces a series of hints or heuristic aids, which help the user to select a second set of articles that is complementary to the first set but from a different area of research. The two sets are complementary if, when combined, they can reveal new useful information that cannot be inferred from either set alone. The program developed for this purpose was called Arrowsmith.
Sources:
Don R. Swanson. “Fish Oil, Raynaud’s Syndrome, and Undiscovered Public Knowledge”. Perspectives in Biology and Medicine 30(1): 7‐18, 1986.
http://kiwi.uchicago.edu/
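The idea of complementary literatures can be sketched as a search for terms that occur in the titles of both sets but are not themselves the topic of either. This is not Arrowsmith's actual algorithm; it is a simplified illustration, and the titles, stopword list and function names are invented for the example.

```python
import re

STOPWORDS = {"and", "in", "of", "on", "the", "with", "by", "patients"}


def title_terms(titles):
    """Collect the distinct non-stopword terms used across a set of titles."""
    terms = set()
    for title in titles:
        terms.update(t for t in re.findall(r"[a-z]+", title.lower())
                     if t not in STOPWORDS)
    return terms


def linking_terms(titles_a, titles_c, topic_a, topic_c):
    """Terms shared by two literatures, excluding the topic words themselves."""
    shared = title_terms(titles_a) & title_terms(titles_c)
    return sorted(shared - topic_a - topic_c)


# Invented example titles, loosely modelled on the two literatures in the paper.
fish_oil_titles = [
    "Dietary fish oil reduces blood viscosity",
    "Fish oil and platelet aggregation",
]
raynaud_titles = [
    "Increased blood viscosity in Raynaud patients",
    "Platelet aggregation in Raynaud syndrome",
]
hints = linking_terms(fish_oil_titles, raynaud_titles,
                      {"fish", "oil"}, {"raynaud"})
```

With these sample titles, `hints` is `['aggregation', 'blood', 'platelet', 'viscosity']`: terms that connect the two sets even though no single title mentions both topics. A human still has to judge whether such a link is meaningful.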
Case Study – Patent Research
When a research scientist comes to the IP professional with a draft for a new patent, s/he logs in to the patent research services. After submitting the necessary details about the patent draft, the user invokes the “Text Clustering Services”. This service makes relationships visible quickly, as it displays clusters of similar documents based on extracted keywords. The user can examine the search results using linguistic technologies to explore relationships between different patents. This functionality allows for a more targeted analysis of the patent data, with a focus on similarities. The user can then observe the most relevant clusters or “drill down” into any cluster and view individual documents, turning hidden textual information into useful knowledge.
Source: http://www.delphion.com
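Clustering documents by extracted keywords, as described in this case study, can be approximated with keyword sets and Jaccard similarity. The sketch below is not how the Delphion service works internally; the single-pass strategy, the threshold, the stopword list and the patent-style titles are all assumptions made for illustration.

```python
import re

STOPWORDS = {"a", "an", "and", "for", "of", "the", "with", "using"}


def keywords(text):
    """Extract a crude keyword set: lowercase words minus stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def jaccard(a, b):
    """Similarity of two keyword sets: shared keywords over all keywords."""
    return len(a & b) / len(a | b) if a | b else 0.0


def cluster(docs, threshold=0.3):
    """Single-pass clustering: join the first cluster whose seed is similar."""
    clusters = []  # list of (seed_keywords, [doc ids])
    for doc_id, text in docs.items():
        kws = keywords(text)
        for seed, members in clusters:
            if jaccard(seed, kws) >= threshold:
                members.append(doc_id)
                break
        else:
            clusters.append((kws, [doc_id]))
    return [members for _, members in clusters]


# Invented patent-style titles for illustration.
docs = {
    "P1": "Battery electrode with lithium coating",
    "P2": "Lithium battery electrode manufacturing",
    "P3": "Antenna for mobile handset",
    "P4": "Mobile handset antenna assembly",
}
```

`cluster(docs)` groups the two battery documents and the two antenna documents. "Drilling down" then simply means listing the members of one cluster.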
Case Study – Market Intelligence Research
Gathering and reading market news and updates is a tedious and time-consuming task in the life of a market intelligence professional. Just the simple task of extracting names, addresses and contact information from many documents in a batch can prove to be very difficult. By using a feature extraction system, this task can be highly automated, allowing the marketing department to make real use of extracted knowledge in very short time frames.
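A small part of such a feature extraction task, pulling e-mail addresses and phone numbers out of free text, can be sketched with regular expressions. The patterns below are deliberately simple assumptions and will miss many real-world formats; names and postal addresses need far more than this. The sample text is invented.

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s-]{6,}\d")


def extract_contacts(text):
    """Return e-mail addresses and phone-like numbers found in free text."""
    return {
        "emails": EMAIL_RE.findall(text),
        "phones": [p.strip() for p in PHONE_RE.findall(text)],
    }


# Invented sample text; example.com is a reserved documentation domain.
sample = ("Contact Anna Berg at anna.berg@example.com or call "
          "+46 8 555 01 23 for the full report.")
```

Running `extract_contacts(sample)` yields one e-mail address and one phone number. Applied to a batch of documents, the same function turns free text into rows that can be loaded into a contact database.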
Technology Overview
“So, will it actually work for us?”
Data Fusion and Text Analytics in Intelligence Domain
“The Infosphere Analyst’s Workbench”
Analyst’s Workbench
[Figure: Analyst’s Workbench architecture. Components shown: external sources (news monitoring, reports, archives, web crawling, web site monitoring, GIS services) reached through crawlers and XML/HTTP (SOAP); internal users (XML/HTTP, manual loading into the database, form-based entries); a common repository and user-created data stores; functions for system administration, user-created content, analysis and retrieval.]
One view of the world: searchable, easy to overview and manage, content in context, allowing analysis.
[Figure: content sources shown: FACTIVA, Infosphere, Internet etc.; GEOREF, stats, facts; Infosphere; GIS]
What about the Future
“Web Scale Text Analytics”
The problem with Web Scale data is that one must overcome both the Data barrier and the Structure barrier
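[Figure: two-by-two chart with data scale (enterprise to web) on one axis and structuring & integration capability (search to mining) on the other. Quadrants: Enterprise Search and Clustering, Web‐scale Search Engines, Current Text Mining Solutions, and Web‐scale Text Mining. The Data barrier and the Structure barrier separate the quadrants.]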
Applying text analytics on web scale data brings a serious risk of mining for gold nuggets that simply are not present in the data set
IBM’s Project WebFountain is the first serious attempt to build a large scale text mining service offering. Some refer to it as “A Google on Steroids”.
N.B. This kind of technology needs to be fed with the proper content. Today there is a common belief that the Net holds all the answers, just as long as you know where to look and what to look for. In some cases this is not true! To be truly effective, these solutions rely heavily on pre‐analyzed data, often in very structured formats.
Social software, the blogosphere and the trend for networking via the web lay the foundation for establishing real distributed collection systems
The Friend of a Friend (FOAF) project is about creating a Web of machine‐readable homepages describing people, the links between them and the things they create and do.
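What “machine‐readable” means here can be shown by reading a small FOAF description with a standard XML parser. The document below is an invented example with hypothetical people, and the parser handles only this one nested RDF/XML layout; a real consumer would use a full RDF library, since the same data can be serialized in many ways.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FOAF = "http://xmlns.com/foaf/0.1/"

# Invented FOAF description in RDF/XML; the people are hypothetical.
FOAF_DOC = f"""
<rdf:RDF xmlns:rdf="{RDF}" xmlns:foaf="{FOAF}">
  <foaf:Person>
    <foaf:name>Anna Berg</foaf:name>
    <foaf:knows>
      <foaf:Person><foaf:name>Erik Lund</foaf:name></foaf:Person>
    </foaf:knows>
  </foaf:Person>
</rdf:RDF>
"""


def who_knows_whom(rdf_xml):
    """Return (person, acquaintance) name pairs from nested foaf:knows."""
    root = ET.fromstring(rdf_xml)
    pairs = []
    for person in root.findall(f"{{{FOAF}}}Person"):
        name = person.findtext(f"{{{FOAF}}}name")
        for friend in person.findall(f"{{{FOAF}}}knows/{{{FOAF}}}Person"):
            pairs.append((name, friend.findtext(f"{{{FOAF}}}name")))
    return pairs
```

`who_knows_whom(FOAF_DOC)` returns `[('Anna Berg', 'Erik Lund')]`. Collected across many homepages, pairs like this form the kind of distributed network of people and links that the project describes.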
We suggest that a proper understanding of the three pillars of requirements, content and technology is necessary to build a successful UIM system
[Figure: Vision – Strategies; three pillars: Users, Content, Technology; UIM Systems]
N.B. The key to success is to know where to automate the process and where to use human input and control. We automate and standardize everything that is possible, while focusing on using the best suited people for other tasks.
Conclusion - Things to keep in mind when trying to integrate content with technology
• Who decides about our “Information Architecture”, i.e. who decides our definitions and terms? Are there any standard taxonomies in our domain, or should we create our own (the question of interchangeable data)?
• A user’s perspective of the world changes constantly due to many reasons (job roles, prior experiences, knowledge, biases etc). “The Truth is in the Eyes of the Beholder”.
• A taxonomy is normally a good tool in stable environments. Taxonomies become stale and obsolete very fast if not maintained. Also, forget the overarching corporate taxonomy that will apply to all!
• In how many languages/formats will your content exist?
• Linguistic or rule based systems can be very accurate and suitable, given that they have support for your languages, are updated frequently and maintained properly.
• Machine learning tools are not foolproof; without an understanding of the type of content that will feed the system, there is a chance they will be completely useless.