“All databases are equal...
…but some are more equal than others.”
Stephen Adams,
Magister Ltd., GB
© Magister Ltd 2004, 2005 2
Topics
• Where database creation goes wrong…
• Why bother to evaluate?
  – A word about ‘quality’
• Quality content
  – missing documents, document kinds and fields
• Quality context
  – search engines
• Conclusion
The basics of information retrieval
[Diagram: Query → Query representation and Documents → Document representation; Matching the two representations yields Hits]
Adapted from Crestani, J.Inf.Sci. 29(2), 87-96 (2003)
Reference interview
“I’m sorry - I don’t understand the question…”
“Are you also interested in…?”
“How much do you already know about this?”
QUALITY RESULTS START WITH US.
Strategy development
“Where has that manual got to…!”
“When did they start using that field?”
“Is that field available for all records?”
Document quality - at source
“I leave the form-filling to the paralegals…”
“I’m sure my secretary never transposes application numbers - she can read my handwriting…”
“Our patent office uses that INID code differently…”
Full text, abstract, indexing...
“I get so much rubbish with full-text…”
“I don’t trust abstracts - especially for a freedom-to-operate search…”
“Their timeliness has improved - but indexing quality is down…”
“800,000 corrections per year”
Hitting the keyboard
“Where on earth did that false drop come from…?”
“We always use the free services - the results are OK so far”
“Why does this host always crash on a Friday?”
Major topics for today
Database content
Database context
Content and context
• The effectiveness of “a database” as a search tool is a function of (at least) two variables:
  – the data content
  – the search engine / command language.
• The ideal answer may be a compromise:
  – (‘average’ database & ‘good’ command language) or (‘good’ database & ‘poor’ search engine).
Why evaluate database content?
• Database evaluation is a basic part of information literacy:
  – “a set of abilities requiring individuals to recognise when information is needed and have the ability to locate, evaluate and use effectively the needed information.”
  – American Library Association 1989, Final Report of the ALA Presidential Committee on Information Literacy
• If we do not evaluate our sources, we cannot serve our customers fully.
The biggest database of all?
Isn’t that enough for anyone?
So why evaluate?
A simple evaluation parameter: language
[Pie chart: Cyber Atlas distribution 2000, by language. Segments: English, Japanese, German, Chinese, French, Spanish, Russian, Other]
Source: CyberAtlas, www.clickz.com/stats/big_picture/demographics/article.php/5901_408521
OCLC figures for 2004 are comparable: 30-35% of the Internet is not in English.
Implication:
• The effectiveness of ‘the Internet’ as a retrieval tool will be skewed according to the nature of our search:
  – “Hermann Hesse” “Das Glasperlenspiel” = 13,600
    • of which “& domain=de” comprises 13,400
  – “Hermann Hesse” “The Glass Bead Game” = 12,500
    • of which “& domain=de” comprises 128
  – “Hermann Hesse” “Magister Ludi” = 5,100
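The skew in these figures is easier to see as percentages. The sketch below is illustrative only: the hit counts are the ones quoted above, while the function name `share_percent` is an assumption introduced here.

```python
# Hit counts quoted above: (total hits, hits restricted to domain=de).
HITS = {
    "Das Glasperlenspiel": (13_600, 13_400),
    "The Glass Bead Game": (12_500, 128),
}


def share_percent(part: int, total: int) -> float:
    """Return part as a percentage of total, rounded to one decimal place."""
    return round(100 * part / total, 1)


for title, (total, german) in HITS.items():
    print(f"{title}: {share_percent(german, total)}% of hits are in the .de domain")
```

The German title draws almost all of its hits from the German domain, while the English title draws about one per cent from it.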
The third leg
• Good database evaluation should include not only the two factors identified above:
  – Database content i.e. how well it is put together
  – Database context i.e. the command language and search engine
• but also a third factor
  – How well does this database fit my specific enquiry? (one-off need or recurring usage)
  – Note - if the evaluation process includes this factor, it follows that there is no such thing as the ‘ideal’ database for all enquiries
What is quality?
• “Fitness for purpose”
  – content
  – completeness
  – timeliness etc.
• It is difficult to be absolute; it is easier to evaluate quality as a relative quantity
  – benchmarking two sources against one another gives a better practical feel for ‘quality’ than attempts to measure against a mythical standard
Simple example of quality
• We wish to conduct a freedom-to-operate search in respect of Germany
  – one file contains DE-C2, DE-B4 documents
  – a second file contains DE-C2, DE-B4, DE-C1, DE-B3, DE-T2 and DE-U documents
• Which one would you choose?
  – Whichever your answer, it does not imply that the other is ‘poor quality’.
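One way to make the choice concrete is to compare the kinds an enquiry needs with the kinds each file holds. This is a minimal sketch: the two file contents are those listed above, while the `required` set and the function name `missing_kinds` are hypothetical and chosen only for illustration.

```python
# Document kinds held by each file, as listed in the example above.
FILE_1 = {"DE-C2", "DE-B4"}
FILE_2 = {"DE-C2", "DE-B4", "DE-C1", "DE-B3", "DE-T2", "DE-U"}


def missing_kinds(required: set[str], held: set[str]) -> set[str]:
    """Return the document kinds the enquiry needs but the file lacks."""
    return required - held


# Hypothetical enquiry: the searcher decides utility models (DE-U) matter too.
required = {"DE-C2", "DE-B4", "DE-U"}
print(sorted(missing_kinds(required, FILE_1)))  # kinds file 1 cannot supply
print(sorted(missing_kinds(required, FILE_2)))  # kinds file 2 cannot supply
```

Change `required` and the preferred file may change with it, which is the point of the example: fitness depends on the enquiry.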
Measuring quality
• We can measure good content
  – essentially quantitative, binary
• We can measure good database structure/context
  – essentially qualitative, relative, subjective
    • e.g. are there explicit links between individual records (e.g. common indexing scheme)?
    • e.g. do the command language features or field standardisation facilitate virtual links?
    • e.g. what proportion of the time is the system up?
The coelacanth
Location: Greenland; Zone: polar; Habitat: fresh water; Size: 30 cm; Era: 200 m. years ago; Extinct for 50 m. years
Location: South Africa; Zone: sub-tropical; Habitat: salt water; Size: 1.75 metres; Era: 1938; Alive and breeding
Databases or datadumps?
• Science is not ephemeral - it is cumulative
  – Unless adequate consideration is given to the issue of retrieval at a distance of 10, 20 or 50 years after publication, then the resulting file is not a database at all - it is a datadump
• Much emphasis has been given in recent years to timeliness i.e. adding new records
  – add in haste, repent at leisure?
Robert Maxwell:
Chairman of Pergamon Press
Owner of Pergamon Orbit-InfoLine
Owner of Mirror Group Newspapers
“All the science that’s fit to print”
• Publication or ‘laid open to public inspection’ without consideration of retrieval afterwards means that each record is left isolated from the context of the corpus of science
  – and will be missed in a proportion of the searches to which it is a relevant answer
  – or possibly never found again
Three ‘layers of incompleteness’
• Missing fields
• Missing kinds-of-documents
• Missing documents
Missing documents
• The classical measure of database quality:
  – Is
    • every document of the same kind,
    • published in that period
    • by that publishing authority
  – present in the file?
• Examples:
  – Latipat, USPTO.gov, Patent Abstracts of Japan
Missing documents
• Latipat
  – Newly launched esp@cenet portal, http://lp.espacenet.com
• USPTO.gov
  – Full-text of granted patents
• Patent Abstracts of Japan
  – JAPIO file
Latipat
[Chart: axis scale 0 to 5000 in steps of 500]
Both Latipat and PlusPat (below) suffer from the same problem - missing records; lots of them!
USPTO.gov
Partial listing of missing patents:
4097518 - 4097928 (411)
4526401 - 4527286 (886)…
= 6,092 missing between 4,000,000 and 4,999,999 (0.6%)
STILL 224 missing between 6,000,000 and 6,101,209 (0.2%)
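A listing like this can be produced by walking a number range and recording the runs that are absent from the file. The sketch below is one possible approach, not the method used for the slide; the function names `find_gaps` and `gap_size` are assumptions, and the only figures used are those quoted above.

```python
from typing import Iterable


def find_gaps(present: Iterable[int], first: int, last: int) -> list[tuple[int, int]]:
    """Return inclusive (start, end) runs of numbers in [first, last] not in `present`."""
    gaps = []
    expected = first
    for number in sorted(set(present)):
        if number < first or number > last:
            continue
        if number > expected:
            gaps.append((expected, number - 1))
        expected = number + 1
    if expected <= last:
        gaps.append((expected, last))
    return gaps


def gap_size(gap: tuple[int, int]) -> int:
    """Count the numbers in an inclusive (start, end) run."""
    start, end = gap
    return end - start + 1


# The first run listed above: 4097518 - 4097928 should hold 411 numbers.
print(gap_size((4097518, 4097928)))
# Share of the 4,000,000 - 4,999,999 range reported missing, as a percentage.
print(round(100 * 6092 / 1_000_000, 1))
```

Because publication numbers are sequential, this test needs no knowledge of the records themselves, only of which numbers the file returns.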
PAJ
PAJ fact sheet from Questel-Orbit
What the publicity implies
[Diagram: coverage plotted against APPLICANTS, DATE and TECHNOLOGY axes; date axis marked at 1976]
First limitation - by applicant
[Diagram: coverage plotted against APPLICANTS, DATE and TECHNOLOGY axes; date axis marked at 1976 and 1989]
Backfile to 1989 now available - but has every host loaded it?
Prior to 1998, cases not claiming JP priority were not automatically included in PAJ
Second limitation - by technology
[Diagram: coverage plotted against APPLICANTS, DATE and TECHNOLOGY axes; date axis marked at 1976 and 1989]
Prior to 1989, only 48 out of 118 IPC classes were covered completely (40%)
Complete IPC coverage from 1989 - but no plans to create back-file?
The (messy) truth
[Diagram: coverage plotted against APPLICANTS, DATE and TECHNOLOGY axes; date axis marked at 1976 and 1989]
How to evaluate?
• “Missing documents” is one of the few parameters which can be measured independently of the database
  – Annual Reports of the office concerned
  – WIPO Industrial Property statistics
• Caution:
  – these may not refer to the appropriate document kinds; check before use.
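An independent check of this kind reduces to a ratio per year: records found in the file divided by documents the office reports having published. The sketch below uses hypothetical counts for illustration only; they are not taken from any real annual report, and the function name `completeness` is an assumption.

```python
def completeness(records_in_file: int, officially_published: int) -> float:
    """Return the share of officially published documents found in the file."""
    if officially_published <= 0:
        raise ValueError("official publication count must be positive")
    return records_in_file / officially_published


# Hypothetical figures for illustration only; not taken from any real office.
official_counts = {1999: 1200, 2000: 1500}
file_counts = {1999: 1200, 2000: 1350}

for year in sorted(official_counts):
    ratio = completeness(file_counts.get(year, 0), official_counts[year])
    print(f"{year}: {ratio:.0%} of published documents present")
```

The ratio is only meaningful if both counts refer to the same document kinds, which is the caution stated above.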
Caution
• Determining database ‘completeness’ is only meaningful when measured against a quantitative parameter
  – e.g. publication number.
• It has little or no meaning when measured using more qualitative parameters
  – e.g. no. of hits found using the same strategy across several databases
    • the strategy will be sub-optimal for some databases and not for others
Simple source-by-source comparison
BIOSIS -v- Medline
BIOSIS Evolutions vol.9 no.6 © BIOSIS
Science Direct: Comprehensive - provided it’s from Elsevier...
Web of Science: Comprehensive - provided it’s got a high impact factor from ISI...
MDL: PCT and EP from 1976 ?
Take-home message
• There is nothing wrong with publicity
  – provided it is not confused with user documentation.
• Database producers still have a long way to go in informing users of the gaps in their databases
  – it should be much easier to locate this data than it is at present.
Missing Kinds-of-Documents
• Second measure of database quality
  – Is
    • every document of every appropriate kind,
    • published in that period
    • by that publishing authority
  – present in the file?
• Examples:
  – Overlapping year / country coverage
  – EP-A1, -A2, -A3, -A8, -A9
  – US-B1, -B2, -E, -C1, -C2
But they all cover Australia...
• Even given overlapping country and year coverage, different sources can cover different publication stages
• e.g. Australia
  – WPI : AU-A from 1963, AU-B from 1993
  – INPD : AU-A from 1973, AU-B from 1978
  – CAS : AU-B from 1927
    • AU-A is included in CAPlus family, even though it will never be selected as CAS basic - see http://www.cas.org/EO/patkind.html
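The Australian figures can be held as a small lookup so that the question "which source covers this kind in this year?" gets a direct answer. This is a sketch under stated assumptions: it models only the start years quoted above (for CAS, only the AU-B basic coverage, not family members), and the names `AU_COVERAGE` and `sources_covering` are introduced here for illustration.

```python
# Coverage start years for Australian documents, as listed above.
AU_COVERAGE = {
    "WPI": {"AU-A": 1963, "AU-B": 1993},
    "INPD": {"AU-A": 1973, "AU-B": 1978},
    "CAS": {"AU-B": 1927},
}


def sources_covering(kind: str, year: int) -> list[str]:
    """Return the sources whose stated coverage of `kind` had begun by `year`."""
    return sorted(
        source
        for source, kinds in AU_COVERAGE.items()
        if kind in kinds and kinds[kind] <= year
    )


print(sources_covering("AU-B", 1985))
print(sources_covering("AU-A", 1970))
```

All three sources "cover Australia", yet for a given kind and year the usable set can shrink to one or two of them.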
European correction documents
• ST.50 implemented from 1997
  – how many database producers take the data?
  – how many tell their users whether they take the data?
• Examples:
  – Questel-Orbit EPPATENT file
  – STN Europatfull file
Coverage of correction documents
1/1 EPPATENT - (C) Questel.Orbit - image
CPIM
PN - EP954211 A2 19991103 [EP-954211]
BPN - 1999-44
ET - Supporting apparatus
BRR - 2000-29 (Updated 2000-29)
DREX- 2001-01-18 Request for examination (Updated 2001-13)
DNEX- 2001-08-06 First examination report (Updated 2001-38)
DGR - 2003-07-23 Grant (Updated 2003-30)
BGR - 2003-30 (Updated 2003-30)
NGR - B1 (Updated 2003-30)
EPPATENT MAX format (edited) : all Bulletin announcements
Coverage of correction documents
L1 ANSWER 1 OF 1 EUROPATFULL COPYRIGHT 2004 WILA on STN
PATENT APPLICATION - PATENTANMELDUNG - DEMANDE DE BREVET
AN 954211 EUROPATFULL ED 19991114 EW 199944 FS OS
TIEN Supporting apparatus.
PIT EPA2 EUROPAEISCHE PATENTANMELDUNG
GRANTED PATENT - ERTEILTES PATENT - BREVET DELIVRE
AN 954211 EUROPATFULL UP 20030729 EW 200330 FS PS
TIEN Supporting apparatus.
PIT EPB1 EUROPAEISCHE PATENTSCHRIFT
Europatfull default format (edited) : no record of anything after EP-B1
INID (15) shows that this B8 is a correction to the B1 (grant).
How effective is this?
• The experienced information specialist is tempted to infer legal status information from the presence/absence of a particular publication stage (risky!)
  – e.g. EP-B = assumption of entry into force
• The inexperienced information specialist is not always given the correct links to lead to the right conclusion
  – e.g. US parent, re-issue, re-examination cases
Re-examination mentioned in facsimile version - but not in ASCII text:
Parent case - claims 1-10
Re-exam 1 - new claims 11-112
Re-exam 2 - new claims 113-126
IFI record consolidates all changes into a single record - the novice has a better chance of getting a more accurate answer to a legal status search.
US coverage
Columns: Kind Code | Definition | Earliest date of use | Dialog / 652-654 | IFI Claims | INPADOC (incl. Delphion) | Questel / USAPPS | Questel / USPAT | STN / USPAT2 | STN / USPATFULL | WPI
US-A | old Act grant | 1836 1971 1950 1968 1971 1971 1963
US-A | new Act published application | 2001 2001 2001 2001 2001 2001
US-B | old Act re-examination | Y 1981
US-B | new Act grant | 2001 2001 Y
US-C | new Act re-examination | 2001 Y
US-E | re-issue | 1838 Y 1963 1968 Y 1970
US-H | defensive publication | 1969 Y 1963 1977 1976
US-H1 | Statutory Invention Registration | 1985 Y 1963 1985 1985 Y 1968
US-A1 | Trial Voluntary Protest Program | 1975 1975
US-S | Design Patent | 1843 Y 1976 2001 1976 Y
US-P | old Act Plant Patent | 1931 Y 1976 1994 1976 Y
US-P1 | Plant Patent published application | 2001 2001
US-P2 | Plant Patent grant | 2001
US-A0 | NTIS invention applications | 1974 1983
US-A9 | Correction of new Act published application | 2002 2002 2002
None | Office of the Alien Property Custodian (APC) | 1917
• Example analysis of KD coverage
  – e.g. IFI would appear to cover SIRs from 1963, some 22 years before they started (?)
  – e.g. split between USPATFULL/USPAT2 difficult to discern
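The SIR anomaly is an instance of a simple sanity test: a stated coverage start should not precede the first use of the kind code. The sketch below applies that test to the two years quoted in the analysis above; the function name `years_before_first_use` is an assumption made for illustration.

```python
def years_before_first_use(coverage_start: int, earliest_use: int) -> int:
    """Return how many years a stated coverage start precedes the kind code's first use.

    A result of 0 means the stated coverage is plausible on this test.
    """
    return max(0, earliest_use - coverage_start)


# Figures from the analysis above: SIRs (US-H1) first used in 1985,
# while the IFI column shows 1963.
print(years_before_first_use(coverage_start=1963, earliest_use=1985))
```

A non-zero result does not prove the data are wrong; it flags a coverage statement that needs checking with the producer.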
Missing fields
• Third measure of database quality
  – Is
    • every document of every appropriate kind,
    • published in that period
    • by that publishing authority
  – present in the file to the same level of detail?
• Evaluation must compare like-with-like
  – Variations in completeness of coverage and/or field population will affect the apparent effectiveness
• Examples:
  – EP and PCT full-text files, Derwent WPI coverage
Non-systematic or missing fields
• New field during database life
  – imposes an implicit time range on your search
    • e.g. IPC editions, WPI coding changes
• Systematic omission of a field
  – biases results against records which do not contain that field
    • e.g. US-A assignees
    • e.g. JP, CN inventors in WPI
European Patents Fulltext covers all European patent applications and granted European patents published since the opening of the European Patent Office (EPO) in 1978…
But…
EP-A specifications from 1986 in only one language
EP-B specifications from 1991 in three languages
PCT full text
• Many files claim to cover ‘full text’ PCTs
  – Few handle the cases published in Japanese, Chinese or Russian
    • but these still have an English abstract
• Abstract searching gives equal weight to all documents
• Full text searching skews results in favour of records containing full text
Derwent WPI countries
• Most countries in WPI are coded using the Manual Code system
  – but not all countries had Manual Codes added from the start of their coverage
• A strategy incorporating Manual Codes imposes an implicit time range on some countries, and can distort retrieval
  – MC retrieval of KR-B started 1990, biblio available from 1986
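The implicit time range can be worked out directly: any year with bibliographic coverage but no Manual Codes is invisible to a strategy that relies on those codes. A minimal sketch using the KR-B years quoted above; the function name `blind_years` is an assumption introduced for illustration.

```python
def blind_years(biblio_from: int, field_from: int) -> range:
    """Return the years covered bibliographically but lacking the later-added field."""
    return range(biblio_from, max(biblio_from, field_from))


# KR-B figures from above: bibliographic data from 1986, Manual Codes from 1990.
print(list(blind_years(biblio_from=1986, field_from=1990)))
```

Records from those years are in the file but cannot be retrieved by Manual Code, so a second, code-free strategy is needed to reach them.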
Quality and the search platform
• A poor search platform / command language can ruin a good quality database, by effectively concealing or distorting the information which is present.
• Typical questions:
  – does the default print format contain the most useful information for my search?
  – do I obtain the same answer irrespective of the route to it?
Default print formats
PlusPat BIB format : only shows first publication stage.
1/1 PLUSPAT - (C) QUESTEL-ORBIT - image
CPIM (C) Questel-Orbit
PN - EP0954211 A2 19991103 [EP-954211]
TI - (A2) Supporting apparatus
STG - (A2) Pub. Of applic. Without search report
PlusPat MAX format
1/1 PLUSPAT - (C) QUESTEL-ORBIT - image
CPIM (C) Questel-Orbit
PN - EP0954211 A2 19991103 [EP-954211]
PN2 - EP0954211 A3 20000719 [EP-954211]
PN3 - EP0954211 B1 20030723 [EP-954211]
PN4 - EP0954211 B8 20040414 [EP-954211]
TI - (A2) Supporting apparatus
STG - (A2) Pub. Of applic. Without search report
STG2- (A3) Publi. Of search report
STG3- (B1) Patent
STG4- (B8) Modified first page
Variation due to search route
• US patent term extension under 35 USC §156 (Hatch/Waxman)
  – issued in the form of a Certificate of Correction
• At least two equivalent routes to view:
  – locate the original document and check for a ‘Correction’ segment in full text view OR
  – go directly to list of term extensions and link to Certificate
    • http://www.uspto.gov/web/offices/pac/dapp/opla/term/156.html
Test question
• Is there an extension in force for
  – US 4540568 ?
  – US 4572909 ?
  – N.B. - This question avoids the use of PAIR (inoperative on day of test) and assumes that the enquirer has already established that US 4540568 has been replaced by US Re 32969
    • but why was PAIR not working?
    • and why should I have to make that assumption?
35 USC 156 listing
Results
• US Re 32,969 (replaced US 4540568)
  – via 156 listing, obtains a PDF of the Certificate of Correction, which shows the term extended for 931 days
    • actual extension in listing is recalculated as 897 under 35 USC 156(c)(3)
  – via full text view, there is no record of the Certificate of Correction at all; nor any link from US 4540568
• US 4572909
  – via 156 listing, obtains a PDF showing extension for 1252 days
  – via full text view, an additional ‘Correction’ segment is available
Additional document segment is present for US 4572909, but missing for others….
Summary answer
Source                US 4540568 / US Re32969    US 4572909
35 USC 156 listing    Yes                        Yes
Full text             No                         Yes
PAIR                  ?                          ?
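The question "do I obtain the same answer irrespective of the route?" can be tested mechanically once the answers are tabulated. The sketch below encodes the summary above, with `None` standing for the unanswered PAIR route; the names `ANSWERS` and `routes_agree` are assumptions introduced for illustration.

```python
from typing import Optional

# Answers recorded in the summary above; None stands for "?" (PAIR was not available).
ANSWERS: dict[str, dict[str, Optional[bool]]] = {
    "US 4540568 / US Re32969": {"35 USC 156 listing": True, "Full text": False, "PAIR": None},
    "US 4572909": {"35 USC 156 listing": True, "Full text": True, "PAIR": None},
}


def routes_agree(by_route: dict[str, Optional[bool]]) -> bool:
    """Return True when every route that gave an answer gave the same one."""
    known = {answer for answer in by_route.values() if answer is not None}
    return len(known) <= 1


for patent, by_route in ANSWERS.items():
    print(patent, "consistent" if routes_agree(by_route) else "routes disagree")
```

Unanswered routes are ignored rather than counted as disagreement, so the check reports only conflicts that were actually observed.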
Conclusion
• There is no such thing as “the database for all seasons”
• Evaluation is ongoing, even for established products
• There are many ways in which databases can be ‘incomplete’
• A poor search environment can ruin a good database
• Communication between legal, information and database specialists is the key quality factor
Coming up in part 2...
• Two case studies
  – PCT publication rates
  – Searching for gold
Enjoy your break!