
® 2003

Unstructured Information Management
Content and Technology Integration

SESAM, 20th of October 2004

Mikael Thorson, Infosphere AB
Stockholm, Sweden

® 2004

Session Topic

Integrating Content and Technology using Unstructured Information Management

There is data, data everywhere, but is it organized in any structured way? Infosphere suggests a major problem facing information professionals and content managers is not only finding information, but making sense of huge volumes of data.

Success in this area is dependent on the integration of content and technology.

Explore the new realm of Unstructured Information Management (UIM) and its potential for changing day‐to‐day research and information management.


The Problem Area

“There are many superb technologies out there, but without content they are useless!”


Today, the information challenges or hurdles facing most organizations include both content and technology issues

• The mix of free and for‐fee content – why should we pay for content?
• How to integrate existing proprietary content with external content?
• Can we combine our knowledge and allow data fusion of all sources?
• What current IT tools are available and suitable to our needs?
• In the world of automation – where is the human filter needed?

Islands of Information – bridging the information assets

There are basically three ways of integrating content, or of providing “data fusion”:

1. The Manual Approach – force content into a pre‐defined structure
• Traditional Library Science
• Use of Taxonomies or even Ontologies
• Manually intensive and relies on tagging schemas

2. The Automatic Approach – allow the system to learn from the data itself
• Artificial Intelligence and Statistics
• Auto Categorization (clustering) with auto tagging
• Not as accurate as a skilled human classifier (ambiguities)

3. Combination methods
• Uses each approach where it is either more accurate or “cheaper”
• Requires more knowledge from the administrators and implementers
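The combination approach can be illustrated with a minimal Python sketch. It is an illustration only, not a method described in this presentation: the taxonomy, the category names and all function names are hypothetical, and the "automatic" part is a deliberately naive word-frequency profile rather than a real classifier.

```python
from collections import Counter

# Hypothetical hand-built taxonomy: category -> trigger terms (manual approach).
TAXONOMY = {
    "patents": {"patent", "claim", "inventor"},
    "finance": {"revenue", "dividend", "earnings"},
}

def tokens(text):
    """Lower-cased words with surrounding punctuation stripped."""
    stripped = (w.strip(".,;:!?()\"'").lower() for w in text.split())
    return [w for w in stripped if w]

def manual_category(text):
    """Manual approach: return a category only if exactly one rule matches."""
    words = set(tokens(text))
    hits = [cat for cat, terms in TAXONOMY.items() if words & terms]
    return hits[0] if len(hits) == 1 else None

def train_profiles(labelled_docs):
    """Automatic approach: learn a word-frequency profile per category."""
    profiles = {}
    for text, cat in labelled_docs:
        profiles.setdefault(cat, Counter()).update(tokens(text))
    return profiles

def automatic_category(text, profiles):
    """Pick the category whose learned profile overlaps most with the text."""
    words = tokens(text)
    scores = {cat: sum(prof[w] for w in words) for cat, prof in profiles.items()}
    return max(scores, key=scores.get)

def hybrid_category(text, profiles):
    """Combination: trust the rule when it is unambiguous, else fall back."""
    return manual_category(text) or automatic_category(text, profiles)
```

The design choice mirrors the third bullet group: the hand-written rule is used where it is unambiguous, and the learned profile handles everything else.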


The World of Unstructured Data

“Merrill Lynch estimates that more than 85 percent of all business information exists as unstructured data – commonly appearing in e‐mails, memos, notes from call centers and support operations, news, user groups, chats, reports, letters, surveys, white papers, marketing material, research, presentations and Web pages.”
DM Review Magazine, February 2003 Issue

Information is growing exponentially, and unstructured data is growing in importance

[Figure: growth of stored information, 1998–2009, on a logarithmic scale from Terabyte (10^12) through Petabyte, Exabyte and Zettabyte to Yottabyte (10^24). Labels: Paper, CDR, Servers, Desktops, Digital Tape, DVD, Microdisk, Total; Online, Offline; Internet, Static HTML, Dynamic Pages. Growth-rate annotations: 60%, 100%, 200%, 300%. Source: IBM]

Business decisions are increasing in complexity, while time to make decisions is being compressed

Decision makers face:
• Product Diversity
• Time Compression
• Customer Diversity
• Globalization
• Networks and Alliances

Telco Example: releasing in 15 countries; 1,000 suppliers; “ten week” markets; 45,000 services; 30 million subscribers

It takes time to structure information, thus companies that make use of unstructured information gain a competitive advantage

[Figure: information sources plotted by Quality of Data (Raw to Synthesized) against Timeliness of Data (Instantaneous to Historical), between Emerging Technology Formation and Emerging Technology Deployed. Sources shown: e-mail, Chat Rooms, News Groups, Personal Web Sites, e-commerce sites, Online News Sites, Alternative Press, News Magazines, Periodical Magazines, Trade Publications, Patents, Research Reports, Academic Publications]

Definitions of Unstructured Information Management

“Explore the new realm of Unstructured Information Management”

Unstructured Information Management is not the same as Knowledge Management

UIM

Nonaka model of organizational learning:

• Polanyi: tacit and explicit knowledge

• Nonaka: organisational learning involves conversion between these forms

• Tacit to Tacit: Socialization
• Tacit to Explicit: Externalization
• Explicit to Tacit: Internalization
• Explicit to Explicit: Combination

Combination: systematizing concepts into a knowledge system

Internalization: embody explicit knowledge into tacit knowledge

Source: I. Nonaka and H. Takeuchi, The Knowledge Creating Company.

Information comes in many formats, most of which are not readily understandable by computers

• Structured Information
  – Structured information is information that is analyzed!
  – The schema comes from a data model
  – e.g. data stored in a relational database

• Semi Structured Information
  – Data that may be irregular or incomplete and have a structure that may change rapidly or unpredictably
  – The schema is discovered from the data, rather than imposed by the data model
  – e.g. XML markup

• Unstructured Information
  – Is not analyzed and not readily understandable by machines
  – Has no structure other than implicit structure imposed by rules and conventions in language
  – e.g. e‐mails, letters, news articles
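The idea that a semi-structured schema is "discovered from the data" can be shown with a small Python sketch. The XML sample and the function name `discover_schema` are invented for illustration; the sketch only records which child elements appear under each parent, which is far simpler than real schema inference.

```python
import xml.etree.ElementTree as ET

# Invented sample: the two <letter> records do not share the same fields.
SAMPLE = """<letters>
  <letter><from>Anna</from><date>2004-10-20</date></letter>
  <letter><from>Bo</from><subject>Invoice</subject></letter>
</letters>"""

def discover_schema(xml_text):
    """Return {parent tag: sorted list of child tags seen anywhere in the data}."""
    schema = {}
    for element in ET.fromstring(xml_text).iter():
        children = {child.tag for child in element}
        if children:
            schema.setdefault(element.tag, set()).update(children)
    return {tag: sorted(kids) for tag, kids in schema.items()}

print(discover_schema(SAMPLE))
```

No record contains all three fields, yet the discovered structure for `letter` lists `date`, `from` and `subject`. A relational table would have needed those columns declared up front.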


Unstructured Information Management Definition

Unstructured Information Management consists of the tools and methods necessary to
• Store,
• Access & Retrieve,
• Navigate, and
• Discover
Knowledge in primarily unstructured text‐based information.

Source: Unstructured Information Management, Research Report, Infosphere AB

Unstructured Information Management consists of many sub-domains of which the most interesting are Text Mining and Visualization

• Document / Content Management
  – content authoring (limited metadata capturing)
  – controlled archiving
  – versioning / editions / roll‐backs

• Search & Retrieval
  – find documents again

• Text Mining
  – helps with understanding and use of information
  – contributes to semantic processing in the above tasks
  – organization of content into categories
  – helps a person decide if information is likely to be useful

• Visualization
  – be able to view hidden knowledge

[Figure: functionality levels (Data Storage, Access & Retrieval, Navigation, Discovery) mapped to technologies (DBMS, File Servers, CMS, IR, Text Mining, Visualization)]

Different Technologies assist in the information‐to‐interpretation value chain

Technologies:
• Databases, Content Management, File Repositories
• Connectors (Spiders, Crawlers)
• Document Filters
• Search / Retrieval (Indexing)
• Feature Extraction, Language Identification
• Clustering, Classification, Summarization
• Taxonomy, Alerts / Profiling, Expertise, Relationships
• Visualization

Stages, from storage to interpretation:
1. Stored information
2. Access the available information
3. Extract valuable insights through mining techniques
4. Interpretation of new insights through applications

Application Areas

“It is not only finding information, but making sense of huge volumes of data”

The Top Five Text Mining Applications

• Language identification – Automatically identifies the language of a document for expedited processing

• Clustering – Groups related documents based on their content, without requiring predefined classes

• Categorization – Assigns documents to one or more user‐defined categories

• Summarization – Extracts sentences from a document to create a document summary

• Feature extraction – Recognizes significant items in text, such as names, technical terms and abbreviations
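Two of the applications above, language identification and summarization, can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: the stop-word lists are tiny and hand-picked, the function names are hypothetical, and production systems use much richer statistical profiles.

```python
import re
from collections import Counter

# Tiny illustrative stop-word lists; real systems use far larger profiles.
STOPWORDS = {
    "english": {"the", "and", "is", "of", "to", "in"},
    "swedish": {"och", "att", "det", "som", "en", "är"},
}

def words(text):
    return re.findall(r"\w+", text.lower())

def identify_language(text):
    """Guess the language whose stop words occur most often in the text."""
    counts = Counter(words(text))
    return max(STOPWORDS, key=lambda lang: sum(counts[w] for w in STOPWORDS[lang]))

def summarize(text, n_sentences=1):
    """Extractive summary: keep the sentences richest in frequent content words."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    stop = set().union(*STOPWORDS.values())
    freq = Counter(w for w in words(text) if w not in stop)

    def score(sentence):
        return sum(freq[w] for w in words(sentence) if w not in stop)

    chosen = set(sorted(sentences, key=score, reverse=True)[:n_sentences])
    # Emit the chosen sentences in their original order.
    return " ".join(s for s in sentences if s in chosen)
```

Note that this scoring favours long sentences. Dividing the score by sentence length is a common refinement.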

Case Study – Raynaud’s Disease

The concept of hidden links or "undiscovered public knowledge" was developed and exemplified in a paper by Prof. Don R. Swanson, University of Chicago. In his paper, Medline titles were used to find and identify a problem or topic of interest. From the downloaded titles the software constructs input for additional database searches and produces a series of hints or heuristic aids which help the user to select a second set of articles that is complementary to the first set but from a different area of research. The two sets are complementary if they, when combined, can reveal new useful information that cannot be inferred from either set alone. The program they developed was called Arrowsmith.

Sources:
Don R. Swanson. “Fish Oil, Raynaud's Syndrome, and Undiscovered Public Knowledge”. Perspectives in Biology and Medicine 30(1): 7‐18, 1986.
http://kiwi.uchicago.edu/
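The search for linking terms between two complementary title sets can be sketched as follows. This is not the Arrowsmith implementation: the titles are invented stand-ins for downloaded Medline titles, the stop-word list and function names are hypothetical, and the matching is plain set intersection.

```python
import re

# Invented toy titles standing in for downloaded Medline titles.
RAYNAUD_TITLES = [
    "Blood viscosity in patients with Raynaud's syndrome",
    "Platelet aggregation and Raynaud's phenomenon",
]
FISH_OIL_TITLES = [
    "Dietary fish oil reduces blood viscosity",
    "Effect of fish oil on platelet aggregation",
    "Fish oil and plasma lipids",
]
STOP = {"in", "with", "and", "of", "on", "the", "effect", "patients"}

def terms(title):
    """Lower-cased words plus adjacent word pairs, with stop words removed."""
    ws = [w for w in re.findall(r"[a-z']+", title.lower()) if w not in STOP]
    return set(ws) | {f"{a} {b}" for a, b in zip(ws, ws[1:])}

def linking_terms(set_a, set_c, topic_a, topic_c):
    """Terms shared by both title sets, excluding the two topic terms themselves."""
    vocab_a = set().union(*(terms(t) for t in set_a))
    vocab_c = set().union(*(terms(t) for t in set_c))
    exclude = terms(topic_a) | terms(topic_c)
    return sorted((vocab_a & vocab_c) - exclude)

print(linking_terms(RAYNAUD_TITLES, FISH_OIL_TITLES, "Raynaud's syndrome", "fish oil"))
```

On this toy data the shared terms include "blood viscosity" and "platelet aggregation". These are the kind of hints a user would then follow up by reading the articles in both sets.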


Case Study – Patent Research

When a research scientist comes to the IP professional with a draft for a new patent, s/he logs in to the patent research services. After submitting the necessary details about the patent draft, the user invokes the “Text Clustering Services”. This service makes relationships visible quickly, as it displays clusters of similar documents based on extracted keywords. The user can examine the search results using linguistic technologies to explore relationships between different patents. This functionality allows for a more targeted analysis of the patent data, with focus on the analysis of similarities. The user can then observe the most relevant clusters or “drill down” into any cluster and view individual documents, taking hidden, textual information and transforming it into useful knowledge.

Sources: http://www.delphion.com
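Grouping similar documents by their keywords can be sketched with term-frequency vectors and cosine similarity. This is a generic illustration, not the clustering method used by the service cited above. The function names and the 0.3 threshold are assumptions, and the single-pass strategy is one of the simplest possible.

```python
import math
import re
from collections import Counter

def vector(text):
    """Bag-of-words term-frequency vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def cluster(docs, threshold=0.3):
    """Single-pass clustering: join the first cluster whose seed is similar enough."""
    clusters = []  # list of (seed vector, [document indices])
    for i, doc in enumerate(docs):
        v = vector(doc)
        for seed, members in clusters:
            if cosine(seed, v) >= threshold:
                members.append(i)
                break
        else:
            # No existing cluster was close enough: start a new one.
            clusters.append((v, [i]))
    return [members for _, members in clusters]
```

Calling `cluster` on a list of abstracts returns lists of document indices. "Drilling down" then amounts to looking up the documents behind one list.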


Case Study – Market Intelligence Research

Gathering and reading market news and updates is a tedious and time‐consuming task in the life of a market intelligence professional. Just the simple task of extracting names, addresses and contact information from a batch of documents can prove very difficult. A feature extraction system can automate much of this task, allowing the marketing department to make real use of the extracted knowledge within very short time frames.
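A rule-based feature extractor for contact details can be sketched with regular expressions. The patterns below are deliberately simple assumptions and the sample contact data is invented. For example, the name pattern misses names with non-ASCII letters and would wrongly absorb a capitalized word at the start of a sentence.

```python
import re

# Deliberately simple patterns; real contact data needs far more robust rules.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\+?\d[\d\s-]{6,}\d")
NAME = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b")

def extract_features(text):
    """Return e-mail addresses, phone numbers and capitalized name candidates."""
    return {
        "emails": EMAIL.findall(text),
        "phones": [p.strip() for p in PHONE.findall(text)],
        "names": NAME.findall(text),
    }

# Invented sample sentence with placeholder contact details.
print(extract_features(
    "Please contact Anna Svensson at anna@example.com or +46 8 123 456 78."
))
```

Running the same function over a whole batch of documents gives the automation described above, with a human reviewing the candidates where the rules are known to be weak.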


Technology Overview

“So, will it actually work for us?”


Data Fusion and Text Analytics in Intelligence Domain

“The Infosphere Analyst’s Workbench”

[Figure: Analyst’s Workbench architecture. Labels: Common Repository; User Created Data Stores; Internal Users (XML/HTTP, manual loading into db, form based entries); External Sources (news monitoring, reports, archives, web crawling, web site monitoring, crawlers; XML/HTTP); GIS Services (XML/HTTP (SOAP)); functions: system administration, user created content, analysis, retrieval]

One View of the World: searchable, easy to overview and manage, content in context, allowing analysis

[Figure: content sources. Labels: FACTIVA, Infosphere, Internet etc.; GEOREF, Stats, Facts; Infosphere; GIS]

What about the Future

“Web Scale Text Analytics”

The problem with Web Scale data is that one must overcome both the Data barrier and the Structure barrier

[Figure: quadrant chart of Structuring & Integration Capability (Search to Mining) against Data Scale (Enterprise to Web). Quadrants: Enterprise Search and Clustering, Web‐scale Search Engines, Current Text Mining Solutions, Web‐scale Text Mining. Barriers marked: Data, Structure]

Applying text analytics on web scale data brings a serious risk of mining for gold nuggets that simply are not present in the data set

IBM’s Project WebFountain is the first serious attempt to build a large‐scale text mining service offering. Some refer to it as “A Google on Steroids”…

N.B. This kind of technology needs to be fed with the proper content. Today there is a common belief that the Net holds all the answers, just as long as you know where to look and what to look for. In some cases this is not true! To be truly effective these solutions rely heavily on pre‐analyzed data, often in very structured formats.

Social software, the blogosphere and the trend for networking via the web lay the foundation for establishing real distributed collection systems

The Friend of a Friend (FOAF) project is about creating a Web of machine‐readable homepages describing people, the links between them and the things they create and do.
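What "machine‐readable homepages" means in practice can be sketched by reading a small FOAF description in Python. The sample document and the people in it are invented. The sketch uses the FOAF terms `foaf:Person`, `foaf:name` and `foaf:knows`, and it assumes one particular RDF/XML layout. A general solution would need a proper RDF parser, since the same data can be serialized in many ways.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "foaf": "http://xmlns.com/foaf/0.1/",
}

# Invented sample document; the people are placeholders.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                     xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <foaf:Person>
    <foaf:name>Alice Example</foaf:name>
    <foaf:knows><foaf:Person><foaf:name>Bob Example</foaf:name></foaf:Person></foaf:knows>
    <foaf:knows><foaf:Person><foaf:name>Carol Example</foaf:name></foaf:Person></foaf:knows>
  </foaf:Person>
</rdf:RDF>"""

def friends(rdf_xml):
    """Map each top-level person's name to the names of the people they know."""
    root = ET.fromstring(rdf_xml)
    result = {}
    for person in root.findall("foaf:Person", NS):
        name = person.findtext("foaf:name", default="", namespaces=NS)
        known = person.findall("foaf:knows/foaf:Person/foaf:name", NS)
        result[name] = [k.text for k in known]
    return result

print(friends(SAMPLE))
```

Collecting such mappings from many homepages yields a graph of people and their links, which is the raw material for a distributed collection system.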

We suggest that a proper understanding of the three pillars of requirements, content and technology is necessary to build a successful UIM system

[Figure: UIM Systems built on three pillars (Users, Content, Technology) under Vision – Strategies]

N.B. The key to success is to know where to automate the process and where to use human input and control. We automate and standardize everything that is possible while focusing on using the best suited people for other tasks.

Conclusion - Things to keep in mind when trying to integrate content with technology

• Who decides about our “Information Architecture”, i.e. who decides our definitions and terms? Are there any standard taxonomies in our domain, or should we create our own (the question of interchangeable data)?

• A user’s perspective of the world changes constantly for many reasons (job roles, prior experiences, knowledge, biases etc). “The Truth is in the Eyes of the Beholder”.

• A taxonomy is normally a good tool in stable environments. Taxonomies become stale and obsolete very fast if not maintained. Also, forget the overarching corporate taxonomy that will apply to all!

• In how many languages/formats will your content exist?

• Linguistic or rule‐based systems can be very accurate and suitable, given that they have support for your languages, are updated frequently and maintained properly.

• Machine learning tools are not foolproof. Without an understanding of the type of content that will feed the system, there is a chance they will be completely useless.

Contact Details

Mikael Thorson
[email protected]

INFOSPHERE AB

www.infosphere.se
or
www.unstruct.org
