INFS 427: AUTOMATED INFORMATION RETRIEVAL · 2018-09-29

College of Education

School of Continuing and Distance Education

2014/2015 – 2016/2017

INFS 427: AUTOMATED INFORMATION RETRIEVAL (1st Semester, 2018/2019)

Session 03 – The Collection

Lecturer: Mrs. Florence O. Entsua-Mensah, DIS

Contact Information: [email protected]

Session Overview

• In this class, we will discuss the nature of the body of knowledge/ collection that exists for the information user to access.

• We will discuss the various types of data and how they are organized to enhance the information retrieval process.

• We will also look at automated systems for information gathering, processing, and presentation.


Session Outline

The key topics to be covered in the session are:

• Topic One: The Concept of a Collection

• Topic Two: Automated Information Gathering

• Topic Three: Automated Systems for Information Processing and Presentation

• Topic Four: Database technology


What is a Collection?

• An organized pool of knowledge or information resources which a user may access to satisfy an information need.

• One of the essential components of an information retrieval system is its collection or the database.

• The collection is generally made up of documents of different kinds.


Recommended Reading

Chowdhury, G. G. (2010). Introduction to modern information retrieval (3rd ed.). New York: Neal-Schuman Publishers, Inc.

Ferguson, S., Hebels, R., & Charles Sturt University. (2003). Computers for librarians: An introduction to the electronic library. Wagga Wagga: Centre for Information Studies, Charles Sturt University.

Korfhage, R. R. (2006). Information Storage and Retrieval. Wiley India Pvt. Limited.


Documents

• In information retrieval, a document is defined as “a stored data or record in any form” (Korfhage, 2006).

• A document refers to a piece of written, printed, or electronic matter that provides information or evidence or that serves as an official record (Stevenson & Waite, 2011).

• The underlying idea is that the document must be stored in a retrievable form.

Examples of documents

• Books

• Letters

• Messages

• Parts of a book, such as an encyclopaedia dealing with different topics, e.g.:

– A chapter

– A section

– A paragraph

• Graphics

• Sound/voice recordings

• Images

• Computer programs

• Data files

• Email messages etc.

Document surrogates

• Document Surrogates are “limited representations of full documents” (Korfhage, 2006).

• Types of Document Surrogates include:

– Document Identifier

– Bibliographic Data/Records

– Keyword

– Abstract

– Extract

– Review

Types of Document Surrogates

• Document Identifier – a number or code, e.g. an accession number or a classification number, used for inventory control or document location.


Types of Document Surrogates – Cont’d.

• Bibliographic data/record – all the data elements used to identify, describe, or retrieve a document/publication of information content

• OR

• A collection of data elements organized in a logical way to represent a bibliographic item or document, publication or any record of human communication.

• Examples – author, title, publication date, publisher, ISBN, etc. These are useful to the information seeker. For example, the date shows the timeliness and appropriateness of the document.

• Keyword – a single word or a set of individual words, chosen by the author/editor or sometimes prescribed by the database, that represents the contents of the document.

• Abstract – a brief, one- or two-paragraph description of the contents of a paper, often written by the author.

– Its purpose is to help a reader to determine whether the entire document should be retrieved.

Types of Document Surrogates – Cont’d.

• Extract – “Artificially constructed surrogates created by someone other than the author of a paper” (Korfhage, 2006).

– May comprise the first sentence of each paragraph or significant words and phrases in the document.

• Review – a critical article on a book, play, recital etc., written by someone other than the author

– Its purpose is to indicate the value of the document with respect to other works in the same field.

– It can be retrieved separately to suit the purposes of a reader.
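An extract built from the first sentence of each paragraph can be sketched in a few lines of Python. This is only an illustration: the function name `build_extract`, the sample text, and the naive sentence-splitting rule are assumptions, not part of Korfhage's definition.

```python
import re

def build_extract(text: str) -> list[str]:
    """Return the first sentence of each paragraph (a naive extract)."""
    extract = []
    # Assumption: paragraphs are separated by one or more blank lines.
    for paragraph in re.split(r"\n\s*\n", text.strip()):
        paragraph = " ".join(paragraph.split())  # collapse line breaks
        if not paragraph:
            continue
        # Naive sentence boundary: '.', '!' or '?' followed by whitespace.
        first_sentence = re.split(r"(?<=[.!?])\s+", paragraph, maxsplit=1)[0]
        extract.append(first_sentence)
    return extract

sample = """A database is a collection of information. It supports retrieval.

Records are made up of fields. Fields may have subfields."""

print(build_extract(sample))
```

A production system would need a more careful sentence splitter (abbreviations such as "e.g." would break this one), but the idea of a surrogate assembled mechanically from the document is the same.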


AUTOMATED INFORMATION GATHERING

Topic Two


Information Gathering

• The problem of information gathering has received considerable attention from the planning community in recent years (Hiyakumoto & Veloso, 2002).

• However, research in this area has generally assumed a user’s information goal is perfectly represented by the query, and typically adopts a relational database model for representing query operations and information source (Hiyakumoto & Veloso, 2002).

• This assumption has been challenged by recent studies, which argue that the user’s information needs shape the information gathering process.


What is Information Gathering?

• In general practice, information gathering is the collection of data for dealing with the individual’s or the organization’s current situation.

– Simply put, information gathering involves building a collection to satisfy the information needs of a defined user group.

• Usually more data means more and better ways of dealing with the current situation.

• New ideas come more easily if there is a solid knowledge base.

(Teamreporter, 2018)


Information Gathering

• Information gathering is a time-consuming process because of the overload of available information, and many organizations have dedicated teams for this task (Kate, Prapanca & Kalagnanam, 2014).


AUTOMATED SYSTEMS FOR INFORMATION PROCESSING AND PRESENTATION

Topic Three


Automatic Text Analysis - 1

• Before a computerised Information Retrieval system can actually operate to retrieve some information, that information must have already been stored inside the computer.

• Originally it will usually have been in the form of documents.

• The computer, however, is not likely to have stored the complete text of each document in the natural language in which it was written.

• It will have, instead, a document representative which may have been produced from the documents either manually or automatically.

(van Rijsbergen, 2012)


Automatic Text Analysis - 2

• The starting point of the text analysis process may be the complete document text, an abstract, the title only, or perhaps a list of words only. From it the process must produce a document representative in a form which the computer can handle.
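One simple way to produce a document representative automatically is to reduce the text to its most frequent content words. The sketch below assumes a tiny hand-picked stop-word list and a hypothetical function name `document_representative`; real systems use larger stop lists, stemming, and term weighting.

```python
import re
from collections import Counter

# Assumption: a very small stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "that"}

def document_representative(text: str, top_n: int = 5) -> list[tuple[str, int]]:
    """Return the top_n most frequent non-stop words with their counts."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(word for word in words if word not in STOP_WORDS)
    return counts.most_common(top_n)

title_and_abstract = (
    "Information retrieval: the retrieval of stored information in a collection."
)
representative = document_representative(title_and_abstract, top_n=2)
print(representative)
```

The input here could equally be the full text, an abstract, or a title, as described above; only the amount of text fed to the function changes.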


Automatic Classification -1

• Loosely speaking, classification describes the process by which a classificatory system is constructed.

• There are two main areas of application of classification methods in IR:

1. keyword clustering;

2. document clustering.

(Sobrino, 2014)


Keyword Clustering: a technique that sorts search terms into groups (clusters), each relevant to a designated part of a collection/database.

GENERIC STEPS/ALGORITHM FOR KEYWORD CLUSTERING

1. Select keywords one by one from the list of search terms and send each as a search query to the search engine. The tool scans the search results, pulls the first ten search listings, and matches them to each keyword from the list.

2. Form clusters. If the search engine returns the same search listings for two different keywords, and the number of shared listings is enough to trigger clustering, the two keywords are grouped together (clustered).

3. Identify the clustering level. The clustering level is the minimum number of matching listings in the search results that triggers keyword clustering; it affects the number of groups and the number of keywords in each group. A higher clustering level produces more groups with fewer keywords in each, because two keywords rarely share 9-10 of their top-10 listings. Conversely, a clustering level of 1 or 2 creates a few groups with many keywords in each. There are exceptions, but they are not common.

4. If the tool finds no matching URLs in the top 10 search results for a keyword, that keyword is placed in a separate group.

(Topvisor, 2016)
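The steps above can be sketched in Python. This is a simplified illustration, not Topvisor's actual implementation: the search results are hard-coded hypothetical data (a real tool would query a search engine), and each keyword is compared only with the first keyword of an existing cluster.

```python
def cluster_keywords(
    top10: dict[str, list[str]], level: int
) -> tuple[list[list[str]], list[str]]:
    """Group keywords whose result lists share at least `level` URLs."""
    clusters: list[list[str]] = []
    for keyword, urls in top10.items():
        for cluster in clusters:
            seed_urls = top10[cluster[0]]  # compare with the cluster's first keyword
            if len(set(urls) & set(seed_urls)) >= level:
                cluster.append(keyword)
                break
        else:
            clusters.append([keyword])  # no match: start a new cluster
    grouped = [c for c in clusters if len(c) > 1]
    ungrouped = [c[0] for c in clusters if len(c) == 1]  # step 4: separate group
    return grouped, ungrouped

# Hypothetical search listings per keyword (shortened from ten for readability).
results = {
    "library catalogue": ["u1", "u2", "u3", "u4"],
    "opac search": ["u1", "u2", "u3", "u9"],
    "online catalogue": ["u1", "u2", "u7", "u8"],
    "drug abuse": ["u5", "u6"],
}

print(cluster_keywords(results, level=2))
print(cluster_keywords(results, level=3))
```

With these sample data, raising the clustering level from 2 to 3 splits "online catalogue" out of the first group, which matches the observation in step 3 that a higher level yields smaller groups.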



• Document Clustering:

– The task of organizing a collection of documents, whose classification is unknown, into meaningful groups (clusters) that are homogeneous according to some notion of proximity (distance or similarity) among documents (Tagarelli, 2009).

– The process of grouping similar documents into partitions, where documents within the same partition exhibit a higher degree of similarity to each other than to any document in any other partition (Rahal, Wang & Schnepf, 2009).
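A minimal sketch of document clustering follows. It assumes Jaccard similarity on word sets as the "notion of proximity" and a simple single-pass strategy with a similarity threshold; both are illustrative choices, and the cited authors discuss many other measures and algorithms. The documents and names are hypothetical.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Share of words two documents have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster_documents(docs: dict[str, str], threshold: float) -> list[list[str]]:
    """Single-pass clustering: join the first cluster whose seed is similar enough."""
    term_sets = {doc_id: set(text.lower().split()) for doc_id, text in docs.items()}
    clusters: list[list[str]] = []
    for doc_id, terms in term_sets.items():
        for cluster in clusters:
            if jaccard(terms, term_sets[cluster[0]]) >= threshold:
                cluster.append(doc_id)
                break
        else:
            clusters.append([doc_id])
    return clusters

documents = {
    "d1": "library catalogue search",
    "d2": "library catalogue records",
    "d3": "drug abuse prevention",
}
print(cluster_documents(documents, threshold=0.5))
```

Here d1 and d2 share two of four distinct words (similarity 0.5) and end up together, while d3 shares nothing with them and forms its own cluster.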



DATABASE TECHNOLOGY

Topic Four


Definition of key concepts

• Data – a “set of given facts” or “information in a form that can be processed by a computer” (Chowdhury, 2010).

– Can be numbers, e.g., ages, heights or weights of a group of people, or

– Words, e.g., a set of keywords, medical records of patients

• Record – a collection of related information or unit of information in a database e.g., bibliographic information of a book, such as title, author, publication date, place of publication, etc.

• Field – the elements/segments included in a record e.g. Author field, title field, etc.

• Subfield – further sub divisions of a field e.g., the imprint field in a bibliographic database is made up of publisher, date of publication and place of publication.

• Field tag/Primary key – a unique identifier given to a field at the design stage for the purposes of editing, printing, searching and data input.
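The relationship between record, field and subfield can be pictured with a small Python sketch. The class names and the placeholder values ("Author 1", "Place 1", and so on) are hypothetical and used only for illustration.

```python
from dataclasses import dataclass

@dataclass
class Imprint:
    """The imprint field, made up of three subfields."""
    publisher: str
    place_of_publication: str
    date_of_publication: int

@dataclass
class BibliographicRecord:
    """One record: a collection of related fields describing a book."""
    author: str       # author field
    title: str        # title field
    imprint: Imprint  # imprint field, itself divided into subfields

record = BibliographicRecord(
    author="Author 1",
    title="Title 1",
    imprint=Imprint(
        publisher="Sage",
        place_of_publication="Place 1",
        date_of_publication=2003,
    ),
)

# Reaching a subfield goes through the record, then the field.
print(record.imprint.publisher)
```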


Structured & Unstructured Data

• Structured data tends to refer to information in “tables”

• Typically allows numerical range and exact match (for text) queries, e.g.:

– QUERY: Score > 80 AND Gender = Female

– RESULTS: Julie

NAME      GENDER   SCORE
Chun-Li   Male     60
Valerie   Female   73
Julie     Female   81
Reggie    Male     75
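The example query can be expressed directly in code. The sketch below stores the table as a list of Python dictionaries (an illustrative choice; a database system would normally use a query language such as SQL) and combines a numerical range condition with an exact text match.

```python
students = [
    {"name": "Chun-Li", "gender": "Male", "score": 60},
    {"name": "Valerie", "gender": "Female", "score": 73},
    {"name": "Julie", "gender": "Female", "score": 81},
    {"name": "Reggie", "gender": "Male", "score": 75},
]

# QUERY: Score > 80 AND Gender = Female
matches = [
    row["name"]
    for row in students
    if row["score"] > 80 and row["gender"] == "Female"
]
print(matches)  # ['Julie']
```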

Class Exercise

• Use the table below to attempt the following queries:

1. QUERY: Year < 2000 AND Subject = Information Studies RESULTS: ?

2. QUERY: Year > 2015 AND Subject = Psychology RESULTS: ?


AUTHOR’S NAME   TITLE OF BOOK                            YEAR OF PUBLICATION   SUBJECT AREA
Chun-Li         Computer Application in Libraries        2018                  Information Studies
Valerie         Preservation of Information Resources    1991                  Information Studies
Julie           The Origin of Man                        2008                  Archeology
Reggie          Introduction to Information Management   2017                  Information Studies
Mark            Psychology for Everyday Living           2013                  Psychology

In response to the second Query in the exercise:

• There are times when you search the UGCat and get no results.

• This simply means that your query/search yielded no results – the search engine could not match your search to the available collection/database.

• In a situation like this, you may have to reformulate your query or search a different database.
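The "no results" situation and a reformulated query can be illustrated with the book table from the exercise. The helper name `titles_published_after` and the choice of 2010 as the relaxed cut-off year are assumptions made for this sketch.

```python
books = [
    {"author": "Chun-Li", "title": "Computer Application in Libraries",
     "year": 2018, "subject": "Information Studies"},
    {"author": "Valerie", "title": "Preservation of Information Resources",
     "year": 1991, "subject": "Information Studies"},
    {"author": "Julie", "title": "The Origin of Man",
     "year": 2008, "subject": "Archeology"},
    {"author": "Reggie", "title": "Introduction to Information Management",
     "year": 2017, "subject": "Information Studies"},
    {"author": "Mark", "title": "Psychology for Everyday Living",
     "year": 2013, "subject": "Psychology"},
]

def titles_published_after(year: int, subject: str) -> list[str]:
    """Titles in `subject` published after `year`."""
    return [b["title"] for b in books if b["year"] > year and b["subject"] == subject]

# QUERY: Year > 2015 AND Subject = Psychology -> nothing matches
first_try = titles_published_after(2015, "Psychology")

# Reformulated query with a wider date range (hypothetical choice of 2010)
second_try = titles_published_after(2010, "Psychology")

print(first_try)
print(second_try)
```

An empty list is a valid answer from the system: the collection simply holds no matching record, and widening the date range is one way of reformulating the query.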



Structured & Unstructured Data – Cont’d

• Unstructured Data: typically refers to free text, i.e. text that appears in “no” particular order.

• Unstructured data allows:

– Keyword queries including operators

– More sophisticated “concept” queries e.g.,

• Find all web pages dealing with drug abuse
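Keyword queries with operators can be sketched over a handful of free-text pages. The page contents, identifiers and function names below are hypothetical; the sketch covers only simple AND, OR and NOT matching on whole words.

```python
import re

pages = {
    "page1": "Drug abuse among students is rising.",
    "page2": "New drug approved for malaria treatment.",
    "page3": "Substance abuse counselling services.",
}

def words(text: str) -> set[str]:
    """Lower-cased set of words in a piece of free text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def search_and(*terms: str) -> list[str]:
    return [pid for pid, text in pages.items() if all(t in words(text) for t in terms)]

def search_or(*terms: str) -> list[str]:
    return [pid for pid, text in pages.items() if any(t in words(text) for t in terms)]

def search_not(include: str, exclude: str) -> list[str]:
    return [
        pid for pid, text in pages.items()
        if include in words(text) and exclude not in words(text)
    ]

print(search_and("drug", "abuse"))   # drug AND abuse
print(search_or("drug", "abuse"))    # drug OR abuse
print(search_not("drug", "abuse"))   # drug NOT abuse
```

Note that the AND query misses page3, although a reader might judge "substance abuse" relevant to drug abuse. Closing that gap is what the more sophisticated "concept" queries aim at.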


Semi Structured Data

• Semi-structured data is data that has not been organized into a specialized repository, such as a database, but that nevertheless has associated information, such as metadata [1], that makes it more amenable to processing than raw data.

[1] Metadata: descriptive data about data/information, e.g. the date and time the data was recorded.


(Rouse & Wigmore, 2015)
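As an illustration, the sketch below treats two JSON items as semi-structured data: the content is free text, but each item carries a metadata entry recording when it was created. The item contents, field names and dates are hypothetical.

```python
import json
from datetime import datetime

raw_items = [
    '{"metadata": {"recorded": "2018-09-29T10:15:00"}, '
    '"content": "Notes on document surrogates."}',
    '{"metadata": {"recorded": "2017-03-02T08:00:00"}, '
    '"content": "Minutes of a staff meeting."}',
]

items = [json.loads(raw) for raw in raw_items]

# The metadata makes the free text easier to process, e.g. filtering by year.
recent = [
    item["content"]
    for item in items
    if datetime.fromisoformat(item["metadata"]["recorded"]).year >= 2018
]
print(recent)  # ['Notes on document surrogates.']
```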

Database

Definitions

• A database is a collection of information organized to provide efficient retrieval (Online Library Learning).

• “A collection of interrelated data stored so that it may be accessed by users with simple user friendly dialogues” (The Macmillan Dictionary of Information Technology, cited in Chowdhury, 2004, p. 16)

• Any collection of data or information specifically organised for fast searching and retrieval by a computer (Encyclopaedia Britannica)

• A database is structured to facilitate storage, retrieval, modification, deletion of data and other data processing operations.

Structured databases

• Databases organized in the form of a matrix are referred to as structured databases.

• Most databases used by librarians are structured and they include:

– External or remote databases – they are accessed online over the Internet

– Portable databases – they are stored on optical discs, e.g. CD-ROMs

– In-house or locally stored databases – also accessed online, e.g. catalogues or indexes to local collections

Representation of data as a matrix

• In a matrix, each row represents a discrete record within the file (Ferguson & Hebels, 2003).

– Each cell represents a single datum (Ferguson & Hebels, 2003).

           Author field   Title field   Publisher field   Date field
Record 1   Author 1       Title 1       Sage              2003
Record 2   Author 2       Title 2       Blackwell         2006
Record 3   Author 3       Title 3       Cambridge         1998
Record 4   Author 4       Title 4       Merlin            2015
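The matrix can be mirrored in Python as a list of rows, where each row is a record and each position in the row is a field. The helper name `cell` is hypothetical and only shows how a single datum is addressed by record number and field name.

```python
fields = ["author", "title", "publisher", "date"]

matrix = [
    ["Author 1", "Title 1", "Sage", 2003],       # Record 1
    ["Author 2", "Title 2", "Blackwell", 2006],  # Record 2
    ["Author 3", "Title 3", "Cambridge", 1998],  # Record 3
    ["Author 4", "Title 4", "Merlin", 2015],     # Record 4
]

def cell(record_number: int, field: str):
    """Return the single datum at the given record (1-based) and field."""
    return matrix[record_number - 1][fields.index(field)]

print(cell(3, "publisher"))

# Reading down one column gives every value of a single field.
dates = [row[fields.index("date")] for row in matrix]
print(dates)
```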

Electronic / Online Database

– A database is a persistent, logically coherent collection of inherently meaningful data, relevant to some aspects of the real world.

– An electronic database is therefore an electronically organized collection of logically related data.

– Databases are usually designed to manage large bodies of data or information.

Databases Cont’d.

• The following are examples of databases that we use often:

– address book

– dictionary

– telephone book

• Databases (DBs) are organized so that the data or information stored in them can easily be:

– accessed,

– managed, and

– updated.

Databases Cont’d.

• A database allows both information professionals and users to avoid the loss of time, confusion and errors that can result when information is scattered and disorganized.

Types of databases

• Databases can be classified according to types of content:

– bibliographic

– full-text

– numeric

– images

Classification of Databases

• The two major divisions/database classifications are:

– reference databases

– source databases

Reference databases

They are bibliographies or indexes which serve as guides to information in published literature.

1. Bibliographic databases – provide a citation or descriptive record of an item, but the item itself is not included in the database. Sometimes they include abstracts, e.g., Social Science Abstracts.

2. Catalogue databases – show the catalogue of a given library or a network of libraries.

3. Referral databases – connect people to community resources, agencies, and specialised services, e.g., physician referral databases, child care referral databases, legal referral databases, a database of NGOs, etc.


Source databases

They provide users with required information without the need for referral. They are often grouped by content, examples are:

1. Numeric databases – contain numerical data such as survey, financial, and statistical data

2. Full-text databases – contain the full text of documents and not just the citations. Examples include journals, books, newspapers, dissertations, reports, etc.

Source databases – Cont’d.

3. Text-numeric databases – contain both text and numerical data, such as annual reports of companies and handbooks.

4. Directory databases – provide information about individuals and organisations. Check http://www.the100lists.com/ for examples of directories.

5. Multimedia databases – contain one or more primary media file types, such as video, audio, graphics, and animation sequences, as well as documents.


The development of database in an information retrieval environment

Factors to consider

1. Functionality/purpose – is it for online retrieval, resource sharing, a stand-alone system, etc.?

2. Nature of documents/records to be included in the system.

3. Maximum number of records to be integrated into the system.

4. The nature and number of users

5. Availability of resources, i.e. software, hardware and staff.

6. Knowledge and skills required to maximize use of the software.

7. Training facilities available.

These factors are important because they determine the choice of software package, the number of fields to be created, and the optimal performance of the system.

Other considerations

Hardware Requirements:

• The processor for executing the program

• Memory for holding work in progress

• Disk storage for holding data files

• Devices for archiving data files to be used in the event of accidental damage or loss of data

• Printers to produce hard copy when needed

• Terminals for data input and control of all processes

PRACTICAL SESSION

Tutorial Session / Individual Practice


Steps in the design of a database

• Database design is the first step in the development of a text retrieval system.

• Pre-requisite decisions include determining:

– The nature of data

– Nature and number of fields and subfields

– Nature of database indexing

– Format for display and printing of data

– Sorting of data during printing

– Entry or editing of data

Steps in the design of a database contd.

• The number and list of fields and subfields are based on the nature of the data/record. For example, the fields in a simple library catalogue are as follows:

Author                  Price
Title of book           Call number
Publisher’s name        Accession number
Place of publication    Keywords
Date of publication
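The field list above could be turned into a database table as in the following sketch, which uses Python's built-in sqlite3 module with an in-memory database. The column names, the data types, the use of the accession number as the primary key, and the sample row are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # temporary database, nothing is saved to disk

conn.execute(
    """
    CREATE TABLE catalogue (
        accession_number     TEXT PRIMARY KEY,  -- assumed unique per item
        author               TEXT,
        title                TEXT,
        publisher            TEXT,
        place_of_publication TEXT,
        date_of_publication  INTEGER,
        price                REAL,
        call_number          TEXT,
        keywords             TEXT
    )
    """
)

# Hypothetical sample record; fields left out are stored as NULL.
conn.execute(
    "INSERT INTO catalogue (accession_number, author, title) VALUES (?, ?, ?)",
    ("A0001", "Author 1", "Title 1"),
)

rows = conn.execute(
    "SELECT title FROM catalogue WHERE author = ?", ("Author 1",)
).fetchall()
print(rows)  # [('Title 1',)]
```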

Steps in the design of a database contd.

• Database indexing – this step generates the index file that can be searched. This process depends on the software package being used for developing the database. Some software packages are programmed to update index files as soon as new records are added or existing records deleted.

• Data entry form/worksheet – This is a blank form used for entering data

• Output format – the mode of display of records for browsing or searching.

• Data entry, searching and printing – these steps conclude the design of a database.
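The idea of an index file that is updated whenever records are added or deleted can be sketched as a small inverted index over titles. The class name `IndexedDatabase` and its methods are hypothetical; actual software packages differ in how and when they rebuild their indexes.

```python
from collections import defaultdict

class IndexedDatabase:
    """Keeps a word -> record-id index in step with the stored records."""

    def __init__(self) -> None:
        self.records: dict[int, str] = {}
        self.index: dict[str, set[int]] = defaultdict(set)

    def add(self, record_id: int, title: str) -> None:
        self.records[record_id] = title
        for word in title.lower().split():
            self.index[word].add(record_id)      # index updated on entry

    def delete(self, record_id: int) -> None:
        title = self.records.pop(record_id)
        for word in title.lower().split():
            self.index[word].discard(record_id)  # index updated on deletion

    def search(self, word: str) -> list[int]:
        return sorted(self.index.get(word.lower(), set()))

db = IndexedDatabase()
db.add(1, "Information Storage and Retrieval")
db.add(2, "Introduction to Modern Information Retrieval")

before = db.search("retrieval")
db.delete(1)
after = db.search("retrieval")
print(before, after)
```

Searching consults only the index, never the records themselves, which is why the index has to be kept current as records change.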


Summary

• Knowledge of the nature of the information content of an automated retrieval system is crucial to both the information professional and the user.

• We have, in this class, discussed the nature of the body of knowledge/ collection that exists in an information system.

• We also discussed automated systems for information gathering, processing, and presentation, with special attention to database technology.

Activity 2.1

• Follow the link below: http://www.smallbusinesscomputing.com/buyersguide/article.php/3721436/Build-Your-First-Database-with-Access.htm

• …and practice how to create a searchable database.


References

Hiyakumoto, L. S., & Veloso, M. M. (2002). Towards planning and execution for information retrieval.

Korfhage, R. R. (2006). Information Storage and Retrieval. Wiley India Pvt. Limited.

Stevenson, A. & Waite, M. (2011). Concise Oxford English Dictionary: Book & CD-ROM Set. Oxford University Press.