14 Mar 05 1
Exploring Verity K2 through Pilot Applications
and Taxonomy Development
Gordon Campbell
Director, IS Strategic Planning & Innovation

sanofi pasteur
The vaccines business of sanofi-aventis Group

sanofi-aventis Group
- Formed in 2004 by the merger of Sanofi-Synthélabo + Aventis
- 2004 Revenues = 25.4 Billion Euros
- 100,000 Employees
- 3rd largest Pharma company in the world
- 1st in Europe

sanofi pasteur
- World leader in Vaccines
- 2004 Revenues = 1.6 Billion Euros
- 8,000 Employees
- Heritage includes Louis Pasteur (1890’s) and other vaccine pioneers (Merieux, Slee)

sanofi pasteur IS Organization

Global CIO with Global Functional Heads
- CIOs for N. America and France
- CIO – R&D
- CIO – Industrial Operations
- CIO – Commercial Operations (Sales & Marketing)
- CIO – Business Support (Functions – Finance, HR, etc.)
- Director, Global Infrastructure & Operations
- Director, IS Quality
- Director, IS Strategic Planning & Innovation

Director, IS Strategic Planning & Innovation – responsibilities
- Transversal role – bridging functions & technologies
- Manage the Long Range Planning process
- Manage the Global IS Portfolio
- Verity Champion – formulate the strategy and foster appropriate pilots and applications

Verity Experience at sanofi pasteur

Pre Verity K2 (through 2003)
- Limited applications – primarily intranet

Verity K2 Acquisition
- End of 2003
- Two primary applications targeted:
  - Improve Intranet search results
  - Global Medical Affairs - share common disease / vaccine information

2004 Verity K2 – Pilots + Applications
- 4 Pilots to explore taxonomies and multi-repository search
- Plus 5 Applications
- Developed with two consultancies:
  - Verity Consulting Services – for French pilots & applications
  - Raritan Technologies, Inc. – for N. American pilots & applications

Search 101 – Basic Concepts

Google – a familiar search engine to many
- Easy to use, and results are ranked, often showing the best results near the top of the list (with paid sponsored links on the right).
- Results are ranked based on Google’s proprietary & typically secret algorithms.
- Users often mention Google when describing the type of search they would like to have.

101 - But there is much more Content to Search than what exists on the open Web

Enterprise generated content is huge …
- Office documents
- eMails
- Database driven web pages
- Not to mention other types of media (voice, video, etc.)

These estimates are from a 2003 Study (UC Berkeley)
- With some volumes expected to double in 3 yrs

The increasing dilemma …
- How can I find what I need in a timely fashion?
- Will I be forced to recreate what I can’t find?

Annual Information Volumes (1)

Media Type               Terabytes (2)   Comments
Scholarly publications               6   37,600 titles per year
Searchable Web                     167   Openly accessible sites
Office Documents                 1,397   10.75 billion pages per year
Deep Web                        91,850   DB driven web sites
eMails (originals)             440,606   31 billion emails sent per day
Hard Disk Drives             1,986,000   44 million items per year

(1) Source: How much information 2003?
(2) Terabyte = 1 million million bytes, or approx 50,000 trees made into paper and printed.

101 – Navigation & Search are Complementary

Navigation
- Browsing for information is the most common way to locate content of interest
- Typically, information is organized in a hierarchy of folders
- Taxonomies – can provide the structure and logic
- But, content must still be stored in appropriate locations, with meaningful descriptions (file names, abstracts, etc.)
- As complexity increases, finding content by navigation becomes more and more difficult
- Complexity factors include – volume & scope of content, multiple storage repositories, multiple copies of documents, etc.

Search
- Represents the primary alternative to navigation
- Simple text searches are very common in specific content repositories, but they may not produce effective results
- Sophisticated search tools can yield prioritized and comprehensive lists of results, but they require content access, rules and other techniques.

101 - Requirements of a Good Search Engine:

Access - content must be accessible to the search tools
- First, access must be public or the user must have permission to search the site. (NB: Google cannot search secured or protected web sites.)
- Through a pre-established index of the content produced by a crawler or spider (this is the approach used by Google, producing very fast search results).
- Through bots that scan content at the time of the query. Some workers (bots) make use of local search engines often provided with a set of content.

Search Results – what will produce the best set of results?
- A simple text string, possibly with Boolean operators, may locate only exact matches. Boolean skill may be necessary to enhance results.
- Rules-augmented searches can locate many more items that are missed in a simple search, because they recognize synonyms, associated terms, etc.

Ranking Results – vastly improves the value of the search
- Algorithms are used to score items found by the search and rank order the results, attempting to place the best matches near the top.
- Ranking scores can take into account many factors, such as where the search term is found (keyword list, title, or only the body of text), how often it appears, proximity to other related key terms, etc.

Bot is common parlance on the Internet for a software program acting as an agent on behalf of a user. Bots interact with network services intended for people as if they were real people. One typical use of bots is to gather information. The term is derived from the word robot, reflecting the autonomous character in the "virtual robot"-ness of the concept.
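
The difference between a simple text search and a rules-augmented, ranked search can be sketched in a few lines of Python. This is only an illustration of the ideas on this slide, not how Verity K2 or Google actually score results: the synonym table, the field weights, and the sample documents are all assumptions made up for the example.

```python
from collections import Counter

# Hypothetical synonym rule; a real deployment would maintain these in the taxonomy tool.
SYNONYMS = {"flu": {"flu", "influenza"}}

# Assumed field weights: a hit in the keyword list or title counts more than one in the body.
FIELD_WEIGHTS = {"keywords": 3.0, "title": 2.0, "body": 1.0}


def expand(term):
    """Return the term plus any synonyms defined for it."""
    term = term.lower()
    return SYNONYMS.get(term, {term})


def score(doc, term):
    """Sum weighted occurrence counts of the term (or its synonyms) across fields."""
    terms = expand(term)
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        # Whitespace tokenization only; punctuation handling is omitted for brevity.
        words = Counter(doc.get(field, "").lower().split())
        total += weight * sum(words[t] for t in terms)
    return total


def search(docs, term):
    """Return (score, doc id) pairs for matching documents, best first."""
    scored = [(score(d, term), d["id"]) for d in docs]
    return sorted([s for s in scored if s[0] > 0], reverse=True)


docs = [
    {"id": "a", "title": "Influenza vaccine schedule", "keywords": "influenza",
     "body": "Annual influenza shots"},
    {"id": "b", "title": "Travel advice", "keywords": "travel",
     "body": "Ask about flu before you fly"},
    {"id": "c", "title": "Tetanus boosters", "keywords": "tetanus",
     "body": "Every ten years"},
]
results = search(docs, "flu")
```

An exact-match search for "flu" would find only document "b". With the synonym rule, document "a" is also found, and it ranks first because the term appears in its keyword list and title as well as its body.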

101 - The Business Value of Good Search Tools

[Diagram; labels recovered from the slide]
- Decision steps: Define Business Problem; Identify Key Parameters Impacting Decision; reSearch; Selected & Ranked References; Consider Inputs & Evaluate Alternatives; Other Inputs; Make Decision; Business Decision
- Content preparation: Create & Maintain Enterprise Taxonomies; sanofi pasteur Rules; Classify Content based on Enterprise Rules; Tag Content; Classified Content; News, Journals, Etc.
- Search types: Simple Text Search; Parametric Search; Federated or Consolidated Search

Business Value = Better Informed Decisions

Taxonomies provide the foundation for vastly improved search results

Key Issues:
- How long to find needed info?
- Quality of results?
- Missing or inaccessible info?

Verity Pilots & Applications so far …

2004 Pilots
- MeSH* Taxonomy Extension: Added depth and granularity on vaccine topics
- Departmental Shared Folder: 2nd Taxonomy study
- Process Development: 3rd build to the taxonomy
- Consolidated IS Content Search: Combine Verity collections from 3 different sources

Applications
- VaccinePlace.com: Public service web site
- Intranet K2 Upgrade: Static HTML pages + attachments
- Global Medical Content: Internal shared access to common disease & vaccine information
- RPI Newsline: Regulatory publications

*MeSH = Medical Subject Heading from the US National Library of Medicine / NIH

US Vaccine Educational Web Sites

Corpus of Documents
- HTML pages of various vaccine information sites, including: Daptacel.com, Influenza.com, Meningitisvaccine.com, Rabies.com, Tetanus.org, Travelersvaccines.com, VaccineProtection.com

Business Drivers
- Increase consumer access to information on vaccine-preventable diseases.
- Consolidate Internet access to several sites focused on vaccine-preventable diseases

Search Approach
- Simple keyword text search

Taxonomy Extensions
- None
Simple Text Search Results help VaccinePlace visitors find information quickly …
… or visitors can Browse to learn

But simple text searches are just the beginning – our Model for Improved Search Results includes …

Taxonomies
- Provide the foundation for vastly improved search results
- But public and commercial taxonomies often lack the richness and knowledge available in the enterprise

Our method for developing an enterprise taxonomy included:
  Use an existing taxonomy as a starting point - MeSH
  + A Professional Librarian – Hugh McNaught
  + Subject matter experts
  + Verity experts – Raritan Technologies, Inc.
  = Robust Taxonomy + Enterprise Rules

Parametric Search Portal
- Is essential to test the taxonomy / rules effectiveness
- Provides enhanced access to the documents in the collection
*MeSH = Medical Subject Heading from the US National Library of Medicine / NIH

Taxonomy Concepts

Taxonomies – what are they?
- A hierarchical classification of things, or the principles underlying the classification. Almost anything (animate objects, inanimate objects, places, and events) may be classified according to some taxonomic scheme.

Why develop and use taxonomies?
- By developing and applying taxonomies that are specific to the collection(s) of interest, items in the collection(s) can be retrieved faster and more easily. The items retrieved will be more relevant and more precisely matched to the query.

Sources of taxonomies
- MeSH = Medical Subject Heading from the US National Library of Medicine / NIH
- Library of Congress and other public domain sources
- Commercial taxonomies (Factiva, Verity, etc.)
- Internally developed – can enrich public domain / commercial taxonomies with enterprise knowledge
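
As a rough illustration of "a hierarchical classification of things", a taxonomy can be modeled as a tree in which every node has a name and child nodes. The sketch below is a minimal Python version; the `Node` class and the node names are assumptions chosen for illustration and do not reproduce the actual MeSH or sanofi pasteur hierarchy.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One taxonomy node: a name plus child nodes."""
    name: str
    children: list = field(default_factory=list)

    def add(self, name):
        """Create a child node, attach it, and return it."""
        child = Node(name)
        self.children.append(child)
        return child

    def paths(self, prefix=()):
        """Yield the full path from the root to every node, parents first."""
        here = prefix + (self.name,)
        yield " > ".join(here)
        for child in self.children:
            yield from child.paths(here)


# Illustrative node names only.
root = Node("Vaccines")
viral = root.add("Viral Vaccines")
polio = viral.add("Poliovirus Vaccine")
polio.add("Inactivated")
polio.add("Oral")
viral.add("Influenza Vaccine")

all_paths = list(root.paths())
```

Enriching a public taxonomy with enterprise knowledge then amounts to adding nodes (for example products or companies) under the existing structure, which is what `add` does here.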

MeSH* + Reference Manager – 1st Pilot to Launch Development of a sanofi pasteur Taxonomy

Corpus of Documents
- Vaccine related scientific publications
- Abstracts stored in the Reference Manager DB

Business Drivers
- Global Information & Library Sciences desire to significantly improve quality of search results across a broad range of collections
- Recognition that public taxonomies, such as MeSH, are not as rich in vaccine terms as needed

Search Approach
- Parametric + key word on title, abstract + key words

Taxonomy Extensions
- Vaccine nodes of MeSH* taxonomy
- Products
- Companies
- Geography

*MeSH = Medical Subject Heading
MeSH nodes Structure
Top Level MeSH D24 Nodes, including Vaccines

Expanded Vaccine Node
Poliovirus Vaccine Structure

Verity Intelligent Classifier (VIC) - Provides tools to Enhance the Taxonomy and Create Rules
- Taxonomy Pane: To create & modify the users’ navigation structure
- Topics Pane: To create & modify the rules – synonyms, concepts, relationships, etc.
Poliovirus Vaccine Rules
• This is the set of Rules for the Poliovirus Vaccine
• The Inactivated node is expanded in this example.
• The high level node corresponds with a node in the structure.
• The rules ‘roll up’ to each higher level.
• All nodes contain Terms pertaining to the node, and Products used to treat that Virus
Verity Query Language: the syntax used by VIC to create & modify the rules
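
The way rules "roll up" to each higher level can be sketched as a recursive union of terms. The Python below is a simplified stand-in, not Verity Query Language: the dictionary layout, the "Oral" sibling node, and the specific terms are assumptions added for illustration.

```python
def rolled_up_terms(node):
    """Return the node's own terms plus all terms of its descendants."""
    terms = set(node.get("terms", []))
    for child in node.get("children", []):
        terms |= rolled_up_terms(child)
    return terms


def matches(node, text):
    """True if any rolled-up term appears in the text (case-insensitive substring match)."""
    lowered = text.lower()
    return any(term.lower() in lowered for term in rolled_up_terms(node))


# Hypothetical terms for illustration; real rules would be written in Verity Query Language.
polio = {
    "name": "Poliovirus Vaccine",
    "terms": ["polio vaccine"],
    "children": [
        {"name": "Inactivated", "terms": ["IPV", "inactivated poliovirus vaccine"]},
        {"name": "Oral", "terms": ["OPV", "oral poliovirus vaccine"]},
    ],
}

parent_hit = matches(polio, "Schedule for IPV doses")
sibling_hit = matches(polio["children"][1], "Schedule for IPV doses")
```

A document that mentions only "IPV" is classified under the parent Poliovirus Vaccine node because the Inactivated node's terms roll up to it, while the sibling node does not match. Plain substring matching is used to keep the sketch short; a production rule would need word-boundary handling.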
MeSH Extension – Product Taxonomy / Flu node
Flu vaccine brand names

Parametric Search Portal of the Reference Manager DB based on an Expanded MeSH Taxonomy

[Screenshot callouts]
- Company information added to MeSH
- Product information added to MeSH
- Geography nodes from MeSH
- 6016 articles on Viruses

Clicking on a Parameter such as Influenza Vaccine automatically limits results …

[Screenshot callouts]
- 5 BCG articles also mentioning Flu Vaccine
- Results can then be combined with a text search for more precise selections
- Further breakdowns of the specific context for the hits
- Titles of the articles meeting the selected parameters are listed here.
- 455 articles reference the Americas
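
The behavior described in these callouts (select a parameter, see counts for the remaining breakdowns, then narrow further with a text search) can be sketched as follows. This is an assumed, minimal model in Python; the record layout, function names, and sample articles are made up for the example and are not taken from the Reference Manager DB or the Verity portal.

```python
def parametric_search(articles, params=None, text=None):
    """Keep articles tagged with every selected parameter value, then apply an optional text filter."""
    hits = []
    for article in articles:
        tags = article["tags"]
        if params and not all(value in tags.get(name, ()) for name, value in params.items()):
            continue
        if text and text.lower() not in article["title"].lower():
            continue
        hits.append(article)
    return hits


def facet_counts(articles, name):
    """Count how many articles carry each value of one parameter."""
    counts = {}
    for article in articles:
        for value in article["tags"].get(name, ()):
            counts[value] = counts.get(value, 0) + 1
    return counts


# Made-up records; tags stand in for classifications assigned by the taxonomy rules.
articles = [
    {"title": "Influenza vaccine uptake in Canada",
     "tags": {"disease": ["Influenza"], "geography": ["Americas"]}},
    {"title": "BCG and influenza co-administration",
     "tags": {"disease": ["Influenza", "Tuberculosis"], "geography": ["Europe"]}},
    {"title": "Tetanus boosters in adults",
     "tags": {"disease": ["Tetanus"], "geography": ["Americas"]}},
]

flu = parametric_search(articles, {"disease": "Influenza"})
flu_bcg = parametric_search(articles, {"disease": "Influenza"}, text="BCG")
by_region = facet_counts(flu, "geography")
```

Selecting the Influenza parameter reduces the set to two articles, `facet_counts` gives the per-region breakdown shown beside the results, and adding the text "BCG" narrows the set to one.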

Verity Search Capabilities Leverage the Rules Incorporated in the Taxonomies

[Diagram; labels recovered from the slide]

Search Parameters
- Company: sanofi-aventis, GSK, Wyeth, etc.
- Country: N. Am., Europe, China, etc.
- Franchise: Flu, Pediatric, Traveler, Menactra
- Disease: Flu, Tetanus, Polio, etc.

Rules reflect: Synonyms, Concepts, Relationships

Components: sanofi pasteur Taxonomies, sanofi pasteur Rules, VIC, Tagged Verity Collections

Text / Keyword / Parametric Search
- Focused Reference Set: includes only articles matching the selected parameters

Federated / Consolidated Multi-source Search (Source 1, Source 2, Source n)
- Text / key words search across multiple sources & consolidate results in one view

Targeted Ranked Results

Department Shared Folder Application

Corpus of Documents
- Internal documents stored on a shared network drive
- Contents included a variety of Microsoft Office documents and Adobe Acrobat files

Business Drivers
- Need to locate relevant documents without a detailed knowledge of the folder structure & filing system

Search Approach
- Parametric + key word on title, abstract, key words and full text

Taxonomy Extensions
- Franchises – new nodes & rules (ex: Travel Vaccines)
- Companies – building on version 1 MeSH

Intranet K2 Enhancement

Corpus of Documents
- All HTML static pages on sanofi pasteur intranet
- Attachments
- But not yet contents of applications accessed through the Intranet

Business Drivers
- Need to greatly improve search results on intranet

Search Approach
- Key word on title + full text
- Results ranked according to standard Verity algorithms
- Not yet available:
  - Benefits from applying taxonomy rules, synonyms, etc.
  - Benefits from federated searches of applications accessed via Intranet

Taxonomy Extensions
- None yet
- Approach – for subject areas such as IS, HR, Purchasing, etc.
  - We could acquire commercial or public domain taxonomies
  - Or, we could develop something internally, similar to what was done for vaccines
Intranet – K2 Search Results

IS Content – Consolidated Search Portal

Corpus of Documents
- IS Intranet sites
- IS shared folders (network drives)
- IS Exchange Public folders
- eRooms – not yet included in this pilot

Business Drivers
- Pilot techniques to access content stored in various online repositories

Search Approach
- Key word on content

Taxonomy Extensions
- None yet
- Exploring public domain & commercial options

Search Access Components – a Summary

Verity Collection
- Description: Full text index of a corpus of documents
- Tools: Verity K2
- Capabilities: Std Verity ranked list of results; required for taxonomy appl.
- Experience to date: Intranet K2 upgrade

Taxonomy
- Description: Hierarchy structure; rules for searching & ranking results
- Tools: VIC
- Capabilities: Structured browsing; synonyms, rules
- Experience to date: 3 pilots – Ref Man DB, Shared Drive, Process Development

Gateways
- Description: Access to proprietary repository formats
- Tools: Verity stds
- Capabilities: Access & index contents, while respecting security
- Experience to date: Documentum (Global Content application)

Workers & Extractors
- Description: Agent using repository’s native search engine
- Tools: Custom Development
- Capabilities: Return results; create Verity collection
- Experience to date: eRoom planned 05.

Basic Search Portal
- Description: Simple text search of keywords / contents
- Tools: Verity or Custom Development
- Capabilities: Unranked results; possibly limited by a qualifier (date, author).
- Experience to date: VaccinePlace.com; other internal apps

Parametric Search Portal
- Description: Predefined parameters (ex: product, co. name)
- Tools: Verity or Custom
- Capabilities: Reduced set based on parameters
- Experience to date: Ref Man DB (MeSH); Shared Dept Drive; Global Content

Federated Search Portal
- Description: Multiple sources, using native search engines
- Tools: Workers
- Capabilities: Combine results from multiple sources
- Experience to date: None yet

Consolidated Search Portal
- Description: Multiple sources, using extract
- Tools: Custom Development
- Capabilities: Combined results + taxonomy ranking
- Experience to date: IS Pilot (in progress)
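
The distinction between the last two portal types can be made concrete with a small sketch. In the federated case each source is queried through its own search function and the result lists are simply combined; in the consolidated case content is first extracted into one index, so a single ranking can be applied across all sources. Everything below (function names, repositories, ranking by term count) is an assumption for illustration and does not represent the Verity worker or extractor interfaces.

```python
def federated_search(sources, query):
    """Ask each source's own search function, then merge the result lists as returned."""
    merged = []
    for name, native_search in sources.items():
        merged.extend((name, hit) for hit in native_search(query))
    return merged


def consolidated_search(extracts, query):
    """Search one combined index built from extracted content, ranked by term count."""
    q = query.lower()
    index = [(name, doc) for name, docs in extracts.items() for doc in docs]
    scored = [(doc.lower().count(q), name, doc) for name, doc in index]
    ranked = sorted(scored, key=lambda s: -s[0])
    return [(name, doc) for count, name, doc in ranked if count > 0]


def make_native_search(docs):
    """Stand-in for a repository's native engine: plain substring match, unranked."""
    return lambda query: [d for d in docs if query.lower() in d.lower()]


# Hypothetical repositories and contents.
extracts = {
    "intranet": ["Server backup policy", "Holiday schedule"],
    "shared_drive": ["Backup log: backup completed, next backup Friday"],
}
sources = {name: make_native_search(docs) for name, docs in extracts.items()}

federated = federated_search(sources, "backup")
consolidated = consolidated_search(extracts, "backup")
```

Both calls return the same two documents, but the federated list keeps source order, while the consolidated list puts the shared-drive document first because one ranking is applied across the combined index.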

2005 Verity Projects – Applying What we have Learned and Extending our Learning

Extend the sanofi pasteur Enterprise Taxonomy
- Build other non-vaccine nodes (ex: IS, HR, Legal, IO, etc.)
- Apply to other applications – such as the Intranet sites

Add Gateways
- eRoom – worker and extractor to create Verity Collections

Data Discovery – a new application of Verity technology
- Goal – review nature of internal content existing today on network drives, Public Folders, eRooms, etc.
- Identify candidates for archiving / destruction
- Isolate content worth including in Verity collections

R&D Consolidated Search Portal
- Explore needs and develop a business case
- Across a broad array of internal and external sources
Questions?
(570) 839-4277
Swiftwater, PA 18370