
Greenstone gsdl-2.39 March 2003

GREENSTONE DIGITAL LIBRARY: FROM PAPER TO COLLECTION

Dr Michel Loots, Dan Camarzan and Ian H. Witten

Human Info NGO, Belgium

Simple Words, Romania

University of Waikato, New Zealand

Greenstone is a suite of software for building and distributing digital library collections. It provides a new way of organizing information and publishing it on the Internet or on CD-ROM. Greenstone is produced by the New Zealand Digital Library Project at the University of Waikato, and developed and distributed in cooperation with UNESCO and the Human Info NGO. It is open-source software, available from http://greenstone.org under the terms of the GNU General Public License.

We want to ensure that this software works well for you. Please report any problems to [email protected]


About this manual

This document explains how to create CD-ROM collections from paper documents. It describes in full detail the procedures and economics involved in the scanning and optical character recognition (OCR) processes, so that you end up with text in the right format to apply the Greenstone software. It also describes how to use the Collection Organizer, generally called simply the “Organizer,” to create and edit the material associated with a collection. This is a freely-available software package, distributed with Greenstone, that runs under Windows.

We have tried to be as plain as possible in our explanation. Reference to any trade mark or company product is purely for illustrative purposes, and does not imply that we endorse or favor this product over any other.

Companion documents

The complete set of Greenstone documents includes four volumes:

• Greenstone Digital Library Installer’s Guide
• Greenstone Digital Library User’s Guide
• Greenstone Digital Library Developer’s Guide
• Greenstone Digital Library: From paper to collection (this document)

Acknowledgements

The scanning operation, Organizer and other know-how relating to the creation of collaborative non-profit collections have been developed by Dr Michel Loots, MD, of Human Info NGO and HumanityCD, Dan Camarzan of Simple Words, and their team of collaborators in Brasov, Romania.

The Greenstone software is a collaborative effort between many people. Rodger McNab and Stefan Boddie are the principal architects and implementors. Contributions have been made by David Bainbridge, George Buchanan, Hong Chen, Elke Duncker, Carl Gutwin, Geoff Holmes, John McPherson, Craig Nevill-Manning, Gordon Paynter, Bernhard Pfahringer, Todd Reed, Bill Rogers, and Stuart Yeates. Other members of the New Zealand Digital Library project provided inspiration for the design of the system: Mark Apperley, Sally Jo Cunningham, Steve Jones, Te Taka Keegan, Michel Loots, Malika Mahoui and Lloyd Smith. We also thank all who have contributed to the packages included in this distribution: MG, GDBM, WGET, WV, PDF2HTML, and PERL.

greenstone.org

Contents

About this manual

1 INTRODUCTION

2 SCANNERS AND SCANNING

2.1 Scanners
    Low-cost flat-bed scanner
    Low-end scanner with sheet feeder
    Color scanners
    Professional duplex scanners
    Scanning programs

2.2 Preparing the documents

2.3 The scanning process
    Quality control
    Filename conventions

2.4 Productivity and resources
    Scanning costs

3 OCR: OPTICAL CHARACTER RECOGNITION

3.1 The OCR process
    Quality control
    Tables
    Images
    Specialized material

3.2 Productivity and resources
    Intensive OCR
    Achievable productivity

3.3 Alternatives to OCR
    Manual retyping
    Image files

3.4 Combining scanning and OCR

4 THREE EXAMPLES: 1000 TO 100,000 PAGES

4.1 Typical small collection: 500 to 1000 pages

4.2 All publications from an organization: 5000 pages

4.3 A small library: 100,000 pages

5 CREATING AN ELECTRONIC COLLECTION

5.1 Methods of collection building

5.2 The Organizer
    Installing and Using the Organizer
    Document model
    Exploring the Organizer

5.3 Tagging document files


1 Introduction

One goal of the Greenstone Digital Library software is to empower organizations such as universities, United Nations agencies, non-governmental organizations, non-profit organizations and governments to create varied collections of information that can be delivered online or on CD-ROM.

Typical steps that have to be implemented are:

i. Selecting the documents to be included
ii. Securing copyright permissions to use these documents in the digital library
iii. Scanning and OCR of hard-copy documents that are not available in digital form, to produce a clean digital version
iv. Converting all documents to a format (integrating text and images) that can be imported into Greenstone (preferably HTML or Microsoft Word, but other formats are also covered, at varying levels of precision, by “plugins”; see the Greenstone User’s Guide)
v. Tagging the chapters, paragraphs and images of the digital documents
vi. Organizing the collection into an optimally structured digital library
vii. Building the digital library using the Greenstone software
viii. Printing and distributing the collection on CD-ROM and/or distributing it over the Internet

In order to create a digital collection, the publications must be available in digital format. If books, newsletters or other documents are only available on paper, they will need to be scanned and processed into machine-readable form (step iii). Usually this is done using optical character recognition (OCR), but sometimes by manual retyping. This process is covered in Chapters 2-4 of this manual.

Step v. enables the different parts of a document to be independently selected and displayed by readers in the final library, while step vi. involves assigning attributes to the documents such as subject categories, keywords and bibliographic data for ordering and searching the library. These steps are covered in Chapter 5 of this manual, which treats in detail the Organizer program provided with the Greenstone software.

This manual introduces many issues that affect the editorial process of creating a collection from paper. Before reading on, you should consider these questions:

• What is the goal of your collection?
• What is your target group?
• How big is it: local, regional, or global?
• How many documents are you making available?
• How many pages?
• How much graphics content?
• Does the material split into parts that will be consulted by a limited audience and parts that need to be disseminated widely?
• Are the documents already available electronically?
• If so, in which formats? (Note incidentally that PDF files are not automatically equivalent to digital full-text form, as they often contain only page images.)
• What is the copyright status of the documents?
• Who owns the copyright?
• Are there other organizations with the same target audience?
• Are you willing to collaborate with other groups?
• What budget is available for the whole project?
• What human resources are available (in person-months) for coordination, editing, scanning and programming?
• How many computers are available for this project?
• How many CD-ROMs do you want to distribute?
• Will they be free, or for sale?


2 Scanners and scanning

The first step in converting paper documents into a digital library collection is to obtain images of all pages of all publications in digital format. The next stage is optical character recognition (OCR), and clean, high-quality images are essential for successful OCR. The digitization process requires a scanner capable of working at a resolution of 300 dpi (dots per inch). Most scanning can be done in black-and-white, but if color illustrations are included they must be scanned with a color scanner. In most cases the covers of the book contain colors and will have to be scanned as a color photographic image.

2.1 Scanners

Scanners are available in all price ranges, and all shapes and sizes. They range from $100 for flat-bed scanners to upwards of $50,000 for large industrial scanners from manufacturers such as Bell & Howell (see footnote 1). There are many websites that offer a wide range of scanners for sale. To locate them, just search for “scanners” in a search engine such as Google, AltaVista, or Yahoo.

The output of scanning a page is a computer file, usually stored in TIFF or Bitmap format. Compressed TIFF (Group IV) is the best format to use. An average page scanned and converted to this format occupies only about 50 KB, compared with perhaps 2 MB for the equivalent page in uncompressed Bitmap form.
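To give a feel for what these file sizes mean in practice, the following minimal sketch estimates how many page images fit on one disc, using the approximate per-page sizes quoted above. The 650 MB CD-ROM capacity and the function name are assumptions made for this illustration, not figures from this manual.

```python
# Approximate per-page sizes quoted in the text, in kilobytes.
COMPRESSED_TIFF_KB = 50
UNCOMPRESSED_BITMAP_KB = 2 * 1024

# Assumed capacity of one CD-ROM in megabytes (not a figure from this manual).
CD_ROM_MB = 650


def pages_per_disc(page_size_kb, disc_mb=CD_ROM_MB):
    """Return how many page images of the given size fit on one disc."""
    return (disc_mb * 1024) // page_size_kb


print("Compressed TIFF pages per disc:", pages_per_disc(COMPRESSED_TIFF_KB))
print("Uncompressed bitmap pages per disc:", pages_per_disc(UNCOMPRESSED_BITMAP_KB))
```

Under these assumptions a single disc holds roughly forty times as many compressed pages as uncompressed ones, which is why the choice of format matters for larger jobs.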

Low-cost flat-bed scanner

Low-cost flat-bed units are the cheapest and most widely available type of scanner. There are many brands: HP, Agfa, Acer, etc. Prices range from $100 to $300. Both black-and-white and color images can be scanned. The low price allows each computer to have its own scanner.

1 All sums of money mentioned in this document are in US dollars, and were current in 2001.


Disadvantages of these scanners include the medium quality of the result, the slow rate of scanning, unreliability in warm environments, and relatively frequent breakdown. Pages must be scanned manually, one by one. Each page must be positioned carefully on the scanning plate to ensure that it is aligned correctly. Productivity of these scanners is low. Despite manufacturers’ claims that each page can be scanned in less than a minute, the fact is that rates exceeding twelve pages per hour are rarely achieved. The scanning process monopolizes the computer on which the work is being performed.

Consequently these scanners are useful only for small jobs with limited numbers of pages—no more than 200 to 400 pages a month on a regular basis, or one-time jobs of up to 1000 or 2000 pages.

Low-end scanner with sheet feeder

Low-end scanners with sheet feeders typically cost between $500 and $1200. Ten to fifty pages can be inserted, scanned and processed at once, so the operator does not have to attend constantly to the machine. This increases capacity to 150 to 200 pages per day. These scanners are more robust, and have a longer lifespan before repair, usually in the range of 30,000 to 50,000 pages.

A disadvantage is that only one side of the page is scanned at a time: the stack of pages must be reversed and rescanned in order to obtain an image of both sides. This often creates problems, because sheet feeders are never completely reliable and pages sometimes jam.

These scanners are useful for up to 1500 to 3000 pages a month.

Color scanners

Any scanning operation invariably involves some color images, so a color scanner will always be required. Generally speaking, less than 5% of any publication contains color images, plus the cover. Thus a low cost flat-bed scanner as described above suffices. It is advisable to select one capable of scanning up to 600 dpi resolution.

Professional duplex scanners

Professional scanners are reliable, heavy-duty machines capable of processing a large volume of pages—typically from 2000 pages to 10,000 pages per day. They have an automatic sheet-feeder tray system that processes batches of about 50 to 200 pages. The best and fastest are duplex machines that scan both sides of the page at once.


Professional duplex scanners require a powerful computer with a hard disk of at least 10 to 20 GB. Prices range from $5000 to $50,000. For example, the Canon DR-6020 duplex scanner costs $5000 and works with double-sided documents. It has a capacity of about 2000 pages per day and a lifespan of 600,000 to 800,000 pages. Bell & Howell and Fujitsu scanners range from $10,000 to $50,000 and have a lifespan of many millions of pages.

Microfiche scanners cost from $15,000 for a semi-manual unit to $80,000 for one that operates fully automatically.

Scanning programs

Every scanner comes with its own software, which means that the program must be installed on the computer that manages the scanner. Some have a computer card that needs to be installed in your computer to speed up the scanning operation.

2.2 Preparing the documents

Before being scanned, documents must be properly prepared. Dusty documents must be cleaned, humid documents dried, clips removed, pages unfolded.

The spine of each book should be removed by cutting it off, straight and precisely. Books provided by libraries must often be rebound, and if so you should be particularly careful when removing spines in order to facilitate smooth rebinding.

If there are just a few documents, cutting can be done manually with a ruler and cutters. Be careful with your hands! For more documents, special manual cutting machines are available.

For high volumes—more than 20 documents—we recommend asking a printer or copy-shop if you can use their professional cutting machine. Do not forget to remove metal clips which could damage the cutting blades.

2.3 The scanning process

Using software provided with the scanner, a digital image of each paper page is scanned and transformed into a Bitmap or TIFF image. These images should be stored on hard disk with standard filenames. The OCR process starts once some or all of a batch of documents have been scanned. It can be undertaken by the person who operates the scanner, or by someone else.


Typically a scanning resolution of 300 dpi is needed, although sometimes 200 dpi is acceptable.

Quality control

The final goal of scanning is either to OCR the pages to obtain perfect word processor or HTML versions of the publications, or to produce enhanced image files such as PDF image files. In either case the quality of the image is very important. If quality is sub-standard, image files will not look good and will consume more memory. Image quality seriously affects the OCR process: with sub-standard quality, productivity deteriorates by up to 40%. OCR typically represents more than 90% of the total cost, so scanning quality can have a very substantial effect on the final cost.

The quality of the TIFF file can be enhanced by adjusting the scanning process to each type of paper, using settings provided by the scanner software. Relatively transparent kinds of paper require a lighter setting; the contrast must be adjusted depending on the quality of printing; and so on.

First divide the material into batches with similar paper and print qualities. Perform OCR tests on a sample from the first batch to determine the optimal settings. Then scan all material in this batch before proceeding to the next one.

Filename conventions

Give each book or document a job number or unique code, which will become the name of the folder that contains all TIFF images of the document. Depending on the computer system (DOS, Windows, UNIX, Linux, etc.), from 8 to 128 characters can be used in a filename. We recommend restricting this unique document identifier to 8 to 16 characters. The first five characters might identify the document, the following letter might contain a language code, and the remaining characters might identify the particular page. For example, the identifier u7548e12.tif might identify the TIFF image of page 12 of a book written in English with code u7548e.

Allocate one directory on the hard disk for scanning jobs, say scanjobs. Then make a subdirectory for each job. Within this make a subdirectory for each publication—say u7548e for the above document. Store all the TIFF images of the publication, including color images, in this folder.
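The naming scheme and directory layout described above can be sketched in a few lines of Python. This is only an illustration of the convention: the function names and the job name job01 are hypothetical, and the five-character document code plus one-letter language code follow the u7548e example.

```python
from pathlib import PurePosixPath


def page_image_name(doc_code, lang_code, page, ext="tif"):
    """Build a page image filename such as u7548e12.tif."""
    if len(doc_code) != 5:
        raise ValueError("document code must be five characters")
    if len(lang_code) != 1:
        raise ValueError("language code must be a single letter")
    return f"{doc_code}{lang_code}{page}.{ext}"


def page_image_path(root, job, doc_code, lang_code, page):
    """Place the image under <root>/<job>/<document identifier>/."""
    doc_id = doc_code + lang_code
    return PurePosixPath(root) / job / doc_id / page_image_name(doc_code, lang_code, page)


# Page 12 of the English-language document u7548e, in a hypothetical job "job01".
print(page_image_path("scanjobs", "job01", "u7548", "e", 12))
```

Generating names in one place like this helps keep every operator's files consistent, which matters once several people scan in parallel.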


2.4 Productivity and resources

You should not underestimate the magnitude of the scanning operation—and particularly the OCR process that follows. It is best to consider scanning and OCR as completely separate activities. The optimal choice from an economic and practical point of view should be made individually for each one.

Some points to consider are the investment in scanners and computers that is necessary; the availability of appropriate space and human resources; training the workforce; salary costs; the initial and total number of pages to be scanned; deadlines; and whether documents can be outsourced to third parties.

Scanning costs

An important decision is whether to invest in scanning equipment and perform all scanning oneself, or outsource it to a scanning company. The main considerations are:

• pressure of time for the scanning job;
• total number of pages;
• salary costs of those who perform the scanning.

The people who perform the scanning must be highly motivated, technically skilled, and quality-oriented.

The typical cost of scanning by a professional company is $0.06 per page. To this must be added the cost of shipment, which can be up to $0.03 per page for transport from developing countries to developed countries, and $0.015 per page for transport within countries.

Table 1 estimates the cost of doing it yourself, using various scanner types. Note that all figures are approximate. They are provided as rough guidelines based on the authors’ experience. The first three columns concern labor costs. The first is the capacity in pages/month, assuming full-time work. The second is the resources required in person-hours per page, obtained by dividing the number of working hours per month (assumed to be 180) by the pages/month capacity in the first column.


Table 1. Scanning cost

                                Capacity        Hours/page    Cost/page     Scanner       Scanner     Outsourced pages
                                (pages/month)   (180-hour     (assuming     acquisition   lifespan    for scanner cost
                                                month)        $4/hour)                    (pages)     (at $0.06 each)

Flat bed scanner                2,500           0.072         $0.288        $300          7,000       5,000
Scanner with sheet-feeder       8,000           0.0225        $0.09         $800          30,000      13,000
Professional: low-end duplex    40,000          0.0045        $0.018        $6,000        600,000     100,000
Professional: high-end duplex   150,000         0.0012        $0.0048       $50,000       8,000,000   833,000

To determine the price per page, multiply the total hourly salary costs in your situation by the second column of Table 1. As an example, the third column gives the price of in-house scanning at a salary rate of $4/hour—not including investment costs.

These calculations assume that the scanner is used for a sufficient volume to justify the investment. The final three columns of Table 1 give more information about the cost of the scanner itself. The first of these shows the acquisition cost of the scanner, and the next gives its expected lifetime. The last shows the number of pages that could be scanned commercially, at a cost of $0.06/page, for the price of the scanner alone.
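The arithmetic behind the columns of Table 1 can be reproduced with a short script, which may help when substituting your own salary rate or scanner price. This is a minimal sketch using the 180-hour month and $0.06 outsourcing price stated in the text; the function names are chosen for this illustration.

```python
HOURS_PER_MONTH = 180            # working hours per month assumed in Table 1
OUTSOURCED_COST_PER_PAGE = 0.06  # typical commercial scanning price per page


def hours_per_page(pages_per_month, hours_per_month=HOURS_PER_MONTH):
    """Person-hours needed per page (second column of Table 1)."""
    return hours_per_month / pages_per_month


def labor_cost_per_page(pages_per_month, hourly_rate):
    """Labor cost per page at a given hourly salary (third column)."""
    return hours_per_page(pages_per_month) * hourly_rate


def outsourced_pages_for_price(scanner_price, cost_per_page=OUTSOURCED_COST_PER_PAGE):
    """Pages that could be scanned commercially for the scanner's price (last column)."""
    return round(scanner_price / cost_per_page)


# Reproduce the flat-bed scanner row: 2,500 pages/month, $4/hour, $300 scanner.
print(f"hours/page: {hours_per_page(2500):.4f}")
print(f"cost/page:  ${labor_cost_per_page(2500, 4):.3f}")
print(f"outsourced pages for scanner cost: {outsourced_pages_for_price(300)}")
```

Changing the hourly rate from $4 to your own figure gives the in-house labor cost to compare with the outsourcing price.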

Of course, many other factors affect the choice of scanner: availability of funds, need to minimize dependence on others, desire to build local capacity, obligations to libraries to scan books locally and not transport them, and so on.

The above figures give some idea of the volume of pages needed to justify different levels of investment. Rarely will an institute or organization need to scan 800,000 pages. At such levels more complex issues arise—such as maintenance and the possibility of recouping costs by offering scanning services to others—that we will not discuss here.

It is tempting to regard the development of scanning capacity as a commercial venture, particularly in developing countries. But one should always bear in mind that scanning is not a repeat business. Once documents have been scanned, clients never place new orders for the same documents, no matter how good the relationship with the scanning company. From a commercial point of view, intensive marketing efforts are needed. We do not advise NGOs or other non-profit organizations to venture into this realm without thorough initial trials and a carefully-considered business plan.

In conclusion, if 10,000 to 50,000 pages are to be scanned, one should consider outsourcing the job. A low-end professional scanner costing about $6000 can only be justified if more than 100,000 pages have to be scanned. You might consider banding together with a few other institutions—perhaps NGOs or libraries—to purchase such a scanner.



3 OCR: Optical Character Recognition

An optical character recognition or OCR system transforms a scanned image into text. The input is a digitized image in TIFF or Bitmap format—preferably a clean, high-quality image. The output is a word-processor or web file, typically in RTF, Word, or HTML format.

The following steps are involved in converting paper documents to computer form:

1. scanning;
2. page layout analysis;
3. recognition;
4. scanning images and tables.

Following these, you must perform quality checks on the resulting files, and save them in the appropriate format.

Many good OCR programs are on the market, with prices ranging from $100 to $400 (see footnote 2). Examples, among many others, are:

• Read-Iris (http://www.readiris.com/)
• Omnipage (http://www.omnipage.com/)
• Fine-Reader (http://www.finereader.com/)

All information, including lists of local distributors, can be found on the manufacturers’ websites. Among these, in the authors’ experience the most user-friendly are Fine-Reader and Omnipage. Fine-Reader is cheapest, costing about $100. It offers a great deal of flexibility, and the widest range of different language options.

2 Recall that all sums of money are expressed in 2001 US dollars.


A choice must be made between undertaking the scanning and OCR in-house or outsourcing it to a commercial organization. To do it in-house requires a scanner, OCR software program, OCR skill development, and a quality-conscious, highly motivated workforce.

3.1 The OCR process

The OCR process differs from one OCR program to another, and each one requires a considerable amount of learning. The program’s manual will explain this process in detail. Four points deserve particular attention: quality control, tables, images, and specialized material such as formulas, foreign characters etc.

Quality control

We cannot place enough emphasis on quality control. Quality checks are best performed by native speakers, or by people with an excellent command of the language being checked. The best people are at the university or high-school level. We should also note that young people tend to sustain higher concentration than older people for this kind of work.

Normally there are four quality checks.

The first is performed at the same time as OCR. Every OCR program has a built-in spell-checker that highlights every suspect letter. At the same time the image of the word appears too, making it easy to check and correct the error.

The second is a general check of the text once the OCR process is finished. Common errors are to miss a page, a paragraph, chapter titles, and so on. A general overview is necessary to check if pages are missing. It is essential to check titles, chapter headings, paragraphs, and tables.

The third is a spelling check using Microsoft Word. This program has a dictionary that is often more sophisticated than the one embedded in OCR programs. By importing the book into Word and performing a spelling check there, more errors can be found and corrected. Be sure to add to the spell-checker any particularly difficult or error-prone words, or scientific and technical terms common in that type of publication.

Finally, the completed document should be checked by an independent person who samples the complete book and checks for errors, problems with tables and images, tagging, and the general look of the resulting text. Only after this final check can a book be considered ready for digital dissemination.

Tables

OCR programs do not cope well with tables. Moreover, tables are hard to check. They contain many digits, sometimes with points and commas, and entries are easily misplaced into the wrong row or column. They require concentrated effort, dedicated work, intensive proof-reading, careful checking, and good quality control. They can be handled in three basically different ways.

First, tables can be treated as images. This involves scanning them as black-and-white images and placing them in this form at the appropriate point in the document. This is the easiest solution. There are no errors, and the only time taken is that involved in creating the image. However, this solution consumes more memory than others. Also, the resolution is not always sufficient when large tables are displayed on a computer screen. If you make the complete table fit, the resolution is too low. If you make the table over-wide, the user must scroll to see all columns and rows, and cannot get an overview of the contents.

Second, tables can be recreated manually by making a table with the same number of rows and columns and filling the entries by typing them in, character by character.

Third, the table can be OCR’d. This saves time compared to the manual process, but has a potential for more errors. Columns sometimes get merged, and commas and points are not recognized.

Images

Publications contain three different general types of image:

• black and white line art;
• black and white photographs;
• color photographs.

Black and white line art should be scanned in line art mode and saved as GIF or PNG files. Black and white photographs should be scanned in greyscale mode and saved as GIF or JPEG files. Color photographs should be scanned in color mode and saved as JPEG files. Generally speaking, medium-quality JPEG provides adequate resolution.


For most collections, images consume the bulk of the space required on a hard-disk or CD-ROM. This makes it important to optimize each image for clarity and visibility, while minimizing its size. To save space you might drop some or all of the images if they are not relevant to the text.

Images should be scanned separately, one by one. We recommend giving the image files a name that consists of the first five or six characters used to denote the document followed by the number of the page on which the image was found. An alternative, assuming each document is in its own directory, is to simply use the letter p followed by the page containing the image. If there are several images on a single page, append an additional letter a, b, c … to the filename. For example, if a JPEG image appeared on page 36 of the publication u7548e discussed earlier, it would be placed in a file named u7548e36.jpg or p36.jpg.
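As an illustration of this naming rule, the following minimal sketch builds image filenames from the page number, an optional position letter, and an optional document identifier. The function name and its parameters are hypothetical, introduced only for this example.

```python
import string


def image_name(page, index=None, doc_id=None, ext="jpg"):
    """Name an image file after the page on which the image was found.

    index  -- 0-based position of the image on the page when a page holds
              several images (0 -> 'a', 1 -> 'b', ...); None for a single image
    doc_id -- document identifier used as prefix; if omitted, the letter 'p'
              is used, assuming each document has its own directory
    """
    prefix = doc_id if doc_id else "p"
    suffix = "" if index is None else string.ascii_lowercase[index]
    return f"{prefix}{page}{suffix}.{ext}"


print(image_name(36, doc_id="u7548e"))  # image on page 36 of document u7548e
print(image_name(36))                   # same image, short form
print(image_name(36, index=1))          # second of several images on page 36
```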

Once the images have been scanned, you can put batch-processing programs to work to resize or enhance all the images at once.

Specialized material

Many documents contain specialized material such as special characters, formulas, and difficult pages. Special characters generally relate to different languages and diacritical marks. The language option for the OCR program should be set for the specific language being read. Formulas will have to be recreated manually. Sometimes this is not possible in the OCR program, but only in a word processor like Microsoft Word. Difficult pages that contain complex material or are damaged so that a clear image cannot be obtained might have to be retyped manually.

3.2 Productivity and resources

As mentioned earlier, you should not underestimate the difficulty of OCR. Although the economic and practical options for OCR should be considered separately from scanning, similar points arise: the necessary investment in computers; the availability of human resources and management skills; training the workforce; salary costs; the total number of pages to be processed; and whether documents can be outsourced to third parties.

In this section we share our experience of OCR operations in Belgium, Romania and India. All case studies, calculations and figures assume average situations, documents of standard difficulty (including tables and images) such as are found in most archives or libraries, very high-quality results, and a medium- to long-term operation.

Intensive OCR

OCR is difficult. It demands great concentration and much skill. Before attaining peak productivity level and quality, a learning period of about six weeks is needed.

Typically, best results and productivity are achieved during the first hours of each day. After three hours of OCR work, productivity declines very rapidly, perhaps to 50% of the initial level. After six hours most people become very tired.

The same kind of evolution occurs over the initial weeks. In the first few weeks everyone achieves fairly high productivity, but after that up to two-thirds of people become bored and frustrated. These people either quit or perform poorly in terms of quality and productivity. Even those who pass the first three to five critical weeks and become part of the regular work team often leave in search of a better position after 6 to 12 months.

The remarks made in Section 3.1 about personnel apply particularly to intensive OCR. Quality checks are best undertaken by native speakers or people with a good command of the language being checked. Young people generally sustain higher concentration than older people for OCR work. As a rule of thumb, people aged between 18 and 23 years tend to be better suited than those over 25.

Finally, OCR can be a boring job, which makes motivation and sustained commitment to quality exceptionally important.

These facts about OCR lead to the following guidelines:

• Young people between 18 and 25 are best suited for this job.

• Because the first hours are always the most productive, the work should either be organized on a part-time basis or only the most motivated and concentrated people should be selected for full-time work.

• Two-thirds of people tend to quit or get bored after about three to five weeks. This translates into poorer quality and low productivity in the last weeks.

• A regular supply of work is needed to justify the necessary training, to maintain concentration, and to keep spirits high.

Achievable productivity

Table 2. OCR productivity

                              Working hours/day   Pages/day   Pages/month

Initial training (6 weeks)    3                   6           120
Optimal productivity level    3                   9           150 to 200
                              7                   28          500 to 600

Table 2 gives typical OCR productivity figures. Documents come in all sizes and qualities, and these figures assume that the mix of documents contains an average number of images or tables—say one image and one table of five rows by five columns every 8 pages. They also assume that the page images are of medium to high quality—note that, as discussed above, this depends on the quality of scanning—and that the OCR workers have a good command of the language.

Table 2 gives separate figures for people undergoing training and for those who have reached their optimal productivity level. If a member of the administrative staff were to allocate three hours a day to OCR, they could achieve 180 to 200 pages of OCR per month. For full-time staff with proper training, high concentration and dedication to quality, 500 to 600 pages a month can be achieved.

However, the rates that are achieved on difficult pages of low quality, with many columns or many tables, are far lower—perhaps 300 to 400 pages per month for full-time work.

Assume that the salary cost for dedicated and motivated full-time OCR workers is $400 per month, and the overhead—including management costs, computers, office space, utilities, etc.—comes to another $300 to $400 per person per month. Then the cost of OCR comes to about $1.2 to $1.6 per page. Taking into account the training period, total volume, time-span, and layoff costs should the operation close down for lack of work, these figures rise to $1.5 to $2.5 per page.
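The base per-page figure above follows from dividing the monthly cost per worker by the monthly output. The sketch below reproduces that calculation with the salary, overhead and full-time productivity figures given in the text; the function name is chosen for this illustration, and the additional allowance for training and layoff costs is not modeled.

```python
def ocr_cost_per_page(monthly_salary, monthly_overhead, pages_per_month):
    """Cost per page for one full-time OCR worker."""
    return (monthly_salary + monthly_overhead) / pages_per_month


# Best case: lower overhead ($300) and higher output (600 pages/month).
low = ocr_cost_per_page(400, 300, 600)
# Worst case: higher overhead ($400) and lower output (500 pages/month).
high = ocr_cost_per_page(400, 400, 500)

print(f"${low:.2f} to ${high:.2f} per page")
```

Substituting local salary and overhead figures gives a quick first comparison with the price quoted by an outside OCR company.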

The cost of in-house OCR should be weighed against the cost of outsourcing the work to a professional OCR company. These typically charge from $1.5 to $4 per page, including images and tables. Human Info NGO/Simple Words has such a unit in Romania, and charges humanitarian non-profit organizations a special price that ranges from $1.2 to $2 per page. Please contact us at [email protected] for further information and advice.

3.3 Alternatives to OCR

There are two alternatives to OCR that we discuss here.

Manual retyping

One, which eliminates most scanning as well, is to retype the documents manually, using a word processor. This still requires the images and front cover to be scanned, but the remaining pages need not be scanned—thus one can dispense with both powerful scanners and OCR software.

The people who do this work do not have to understand the text. They must be accurate typists and re-key exactly what they see. Retyping does introduce errors, and double-keying is often used to find and correct these. This method involves two people who independently re-key the same document, after which both digital versions are compared word for word using a special software program by an operator who has the original document in front of them. The assumption is that if the same word has been typed independently twice in the same way, it is correct. However, this is not always true, and for extremely high precision, triple-keying is performed.
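The comparison step of double-keying can be sketched in a few lines. This is only an illustration, assuming that each typist's version is available as plain text; the function name find_disagreements is hypothetical, and the special comparison software mentioned above is not described in this manual. The sketch uses Python's standard difflib module to align the two versions word by word and to report the places where they differ, which are the places the operator would check against the original document.

```python
import difflib


def find_disagreements(version_a, version_b):
    """Return (words_a, words_b) pairs where the two keyed versions differ."""
    words_a = version_a.split()
    words_b = version_b.split()
    matcher = difflib.SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    disagreements = []
    for tag, a1, a2, b1, b2 in matcher.get_opcodes():
        # "equal" spans were typed identically twice and are assumed correct.
        if tag != "equal":
            disagreements.append((words_a[a1:a2], words_b[b1:b2]))
    return disagreements


first = "The quick brown fox jumps over the lazy dog"
second = "The quick brovvn fox jumps over the lazy dog"
for typed_a, typed_b in find_disagreements(first, second):
    print("check against original:", typed_a, "vs", typed_b)
```

Words that both typists keyed identically are never flagged, which is exactly the assumption, and the limitation, described above: an error made the same way by both typists passes unnoticed.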

The advantage of re-keying is lower cost: no OCR program is needed, so the computers can be older, lower-range, or second-hand models, whereas OCR needs powerful computers. Also, the work can be performed by people with a lower level of skill. The disadvantages are that a training period of at least two months is needed, and that single keying usually produces too many errors, so double or triple keying is required.

The cost depends entirely on salary level. Typically, re-keyers in developing countries are paid on the order of $150/month. Their productivity could be twenty to thirty pages per day—corresponding to 400 pages per month, images included. With double-keying, this makes the total salary costs around $300 per month, plus overheads.

Image files

A very low cost alternative to OCR is simply to use a PDF image version of the document pages. The cost is only a fraction of OCR’s—about $0.1 per page.


Once scanning has been completed and TIFF files are available, an automatic converter (usually Adobe Acrobat or Adobe Photoshop) converts all TIFF files of book pages into PDF files.

The downside is that these files are not searchable. Also, they are quite large—usually 50 Kb per page, plus or minus 20% depending on the quality of the original TIFF file.

PDF image files are slow—sometimes, in developing countries, impossible or prohibitively expensive—to download. They rarely fit on a floppy disk, and do not support text manipulation functions such as cut-and-paste.

The PDF image file method should only be used if no OCR budget is available, and for documents that are likely to be used by a small number of people who have high-speed low-cost Internet access.

3.4 Combining scanning and OCR

If a scanner is connected directly to the computer that runs the OCR software, most OCR programs can scan a page and perform OCR immediately. Page-by-page scanning and OCR is a reasonable strategy for low volumes, but will prove time-consuming for bigger and more continuous jobs.

For up to 100 to 150 pages per month, this solution may suffice. For higher volumes it is faster and more efficient to scan the document first, then perform OCR on all the pages as a separate step.


4 Three examples: 1000 to 100,000 pages

4.1 Typical small collection: 500 to 1000 pages

Most NGOs have 500 to 1000 pages to scan. This volume can be OCRed in-house if motivated volunteers are available.

SCANNING

The first step is to scan the publications to generate a high-quality TIFF file of each page, and a separate line-art, grey-scale or color bitmap image for each illustration. Assuming that 1000 pages have to be scanned, this might represent a part-time job of about one month—just for scanning. The TIFF files would consume 60 to 80 Mb of hard-disk space, and a good policy is to create a CD-R containing these files. A low-cost flatbed scanner of $100 to $300 will be sufficient for the job. Scanning can be done after working hours or during the weekends by a volunteer in the office or at home.

OCR

The second step is OCR by another volunteer, or team of volunteers, skilled in language and correction. The TIFF files can either be shared between computers, or one computer can be used for the entire job. Typically, it will take five or six months of part-time labor (e.g. 20 hours a week) to convert 1000 pages into perfect Word or HTML documents.

OUTSOURCING

An alternative is to outsource the scanning and OCR process. It would probably cost $1500 to $2000 to convert everything into perfect Word and HTML files.

4.2 All publications from an organization: 5000 pages

Many larger organizations have archives of around 5000 pages of current or out-of-print books, journals, newsletters, grey literature, etc.

SCANNING

This is too much for a flat-bed scanner. Scanning should either be outsourced (approximately $400 for 5000 pages) or a sheet-feeder scanner purchased (approximately $900). Alternatively, a more expensive scanner could be bought together with a few other institutions or NGOs ($6000 costs divided by the number of participants). All 5000 pages in TIFF format will take about 300 to 400 Mb of hard-disk space. Again, a good policy is to create a CD-R containing these files.

OCR

The second step is OCR by another volunteer, or team of volunteers, skilled in OCR and correction. Again, several computers might be used, or one computer for the whole job. It would take 25 to 30 months of half-time labor (assuming 20 hours a week) to convert 5000 pages into perfect Word or HTML. In practice this is too long and too computer-intensive to manage on a volunteer basis. One would have to pay volunteers, monitor them for performance and quality, provide adequate space, etc., in order to have the job finished within a reasonable time at a high level of quality.

Alternatively one could create image PDF files, which would take 300 to 400 Mb of space and would be harder to download over the Internet.

OUTSOURCING

An alternative is to outsource the scanning and OCR processes. It would probably cost $7500 to $10,000 to convert everything into perfect Word and HTML files.

4.3 A small library: 100,000 pages

Larger organizations, universities, governments, and specialized libraries might have a whole library to digitize—say 100,000 pages. The first issue to consider is the copyright status of the publications. If they are not in the public domain, explicit permission to digitize them must be obtained from the copyright holders. You should also check whether the files are already available digitally.


SCANNING

The volume is too high for a sheet-feed scanner. Scanning should either be outsourced ($8000 for 100,000 pages), or a more expensive scanner purchased together with a few other institutions or NGOs ($6000 shared between the participants). 100,000 pages in TIFF format will take 6 to 8 Gb of hard-disk space. The best plan is to create a set of CD-R copies containing these files.

OCR

The second step is OCR (or creation of PDF files for less widely used documents). It would take 500 to 700 months of half-time labor to convert 100,000 pages into perfect Word or HTML. This is impossible to realize with volunteers, and the job must be done on a professional basis.

To save cost, some of the less frequently used pages (say 80%, or 80,000 pages) could be transformed into PDF, and the other 20,000 pages into Word and HTML. The PDFs would take 4 to 6 Gb of space and be harder to download from the Internet, but would cost only $0.2 per page if created by a professional organization (a total of $16,000). If the 80,000 PDF files were created from TIFF files by volunteers using PDF conversion programs like Adobe Acrobat, 10 to 20 months of part-time work would be necessary on a powerful computer.

OUTSOURCING

An alternative is to outsource the work. If the 80% PDF and 20% HTML mix were maintained, the PDF would cost around $16,000 and the HTML $30,000 to $40,000, for a total budget of around $50,000. If everything were OCRed, it would cost $150,000 to $200,000 to convert the entire collection into perfect Word and HTML files.
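The outsourcing budgets for the 100,000-page library can be reproduced with a small calculation. The sketch below is ours, not part of any Greenstone tool: the function name is hypothetical, and the per-page OCR rate of $1.5 to $2 is inferred from the totals quoted in the text ($30,000 to $40,000 for 20,000 pages).

```python
def mixed_budget(pdf_pages, ocr_pages, pdf_rate, ocr_rate_low, ocr_rate_high):
    """Return the (low, high) total cost in dollars for a PDF/OCR mix."""
    pdf_cost = pdf_pages * pdf_rate
    low = pdf_cost + ocr_pages * ocr_rate_low
    high = pdf_cost + ocr_pages * ocr_rate_high
    return low, high


# Rates taken from the figures in the text: $0.2 per PDF page, and
# $1.5 to $2 per OCR page (an inferred rate, see the note above).
mix_low, mix_high = mixed_budget(80_000, 20_000, 0.2, 1.5, 2.0)
all_low, all_high = mixed_budget(0, 100_000, 0.2, 1.5, 2.0)

print(f"80% PDF, 20% OCR: ${mix_low:,.0f} to ${mix_high:,.0f}")
print(f"Everything OCRed: ${all_low:,.0f} to ${all_high:,.0f}")
```

The mixed option comes to $46,000 to $56,000, which the text rounds to around $50,000; converting everything by OCR comes to $150,000 to $200,000.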


5 Creating an electronic collection

Three important aspects should be kept in mind when deciding to create digital collections. First, the collection must be organized. The more content there is, the greater the need for indexes and powerful search systems. For collections of 3000 to 5000 pages or more, indexes and search systems are essential. Second, the needs of end-users must prevail. The target groups that will use the collection should be identified, and a process of regular consultation set up. Third, the available budget will determine how much can be done.

5.1 Methods of collection building

There are many examples of excellent CD-ROMs that are created on the web-page model. HTML, PDF or Word documents are added and linked using hyperlinks. Navigation is made simple and attractive by the use of hyperlinks, frames, keywords, indexes and so on. Such systems work well up to a few thousand pages, but from 3000 to 5000 pages onwards it is important to have a well-structured collection and a powerful search facility. This is where the Greenstone software can help.

The Greenstone Digital Library software creates a structured digital library including a very powerful search and retrieval engine. Up to 150,000 pages can be indexed on a single CD-ROM. Every CD-ROM can become an Internet server. Greenstone is open-source software, and is freely available under the GNU license.

The companion manuals describe how to build Greenstone collections. Simple collections can be built interactively using the “Collector” subsystem described in the Greenstone Digital Library User’s Guide (Chapter 3), which guides you through a sequence of interactive pages that request the information needed. However, for larger or more complex collections such as the ones described here, we recommend that you use the command-line building process described in the Greenstone Digital Library Developer’s Guide (Chapter 1). You will also need to read Chapter 2 of the Developer’s Guide in order to harness the full power of Greenstone to build advanced collections.

5.2 The Organizer

Greenstone has been used to build several collections for humanitarian purposes, collections that share a common structure and organization but with different content. The “Development Library Subset” (DLS) and “Demo” collections delivered as samples with Greenstone are examples of collections based on this structure.

The “Organizer” is an additional program distributed with Greenstone that assists in the creation of collections that have the DLS organization and presentation. For such collections it greatly simplifies the Greenstone command-line building process described above. The Organizer is a Microsoft Visual C++ 6.0 application, and is therefore restricted to use under Windows (versions 95, NT or above).

The Organizer is designed to help manage all aspects of organizing a digital library collection: entering document titles, assigning subjects and other metadata, altering them, etc. It works with the Greenstone digital library software by generating six intermediate files that are used by Greenstone when building collections: collect.cfg, metadata.xml, sub.txt, org.txt, Keyword.txt and AZList.txt. The structure and content of these files are explained in the Greenstone Digital Library Developer’s Guide.

Metadata.xml is read by the RecPlug plugin discussed at the end of Section 2.1 of the Developer’s Guide; it is shown in Figure 10. The sub.txt, org.txt, Keyword.txt and AZList.txt files define hierarchies that are used by the Hierarchy classifier described in Section 2.2; an excerpt from sub.txt is shown in Figure 15 of the Greenstone Developer’s Guide. These five files are associated with the DLS collection (and similar collections) through the collection configuration file (collect.cfg): metadata.xml because it is read by RecPlug which is specified in the configuration file, and the other four because they are explicitly mentioned in the configuration lines that specify hierarchy classifiers.

People who want to build collections providing full bibliographic and classification functions often have difficulty in coming up with these files. The purpose of the Organizer is to help you define the contents of these files; indeed, it actually writes them for you. The Organizer is, however, limited at present to the specific document model described below, so that users who wish to add metadata not included in this model would have to make some manual modifications in the Organizer’s results. A more general, flexible organizer function will be added into Greenstone itself in the future.

Installing and Using the Organizer

You are prompted to say whether you want to install the Organizer at the end of the standard Greenstone installation; if you answer “yes”, the installation will proceed without further intervention. If you say “no”, or if you uninstall the Organizer and later want to reinstall it, then go to the windows_utilities directory of the Greenstone CD-ROM and run the Organizer.exe program.

Launch the program by selecting Organizer under Greenstone Digital Library in the Programs list of the Windows Start menu. You are asked for a database; use the DLS database by selecting the dls.mdb file that is supplied. You are asked for a user name and password: use admin for both.

Each database in the Organizer can contain one or more document collections. When you first run the newly installed Organizer, you will be presented with a list of collections that contains only one entry: the Development Library Subset (DLS) collection. Generally it is more convenient to work with the Organizer by adding new collections to the default database, including modifications or subsets of previous collections, rather than creating new databases (in fact the current version of the Organizer does support the creation of a blank database).

Each database and each collection have three basic editable global lists:

• Documents that may be included in the collection

• Organizations: organization names available for association with documents

• Subjects: subject classification terms available for building subject classification hierarchies for association with documents in a collection.

Other bibliographic attributes which can be assigned in the document model are described below.

Please note that changes to the database are automatically saved by the Organizer upon exit, without a prompt for confirmation. Thus if you wish to have a backup copy of the original content of a database, you must back it up as soon as you enter the Organizer (two backup functions are available and described below). There is no separate backup function for individual collections.

Document model

In the Organizer, each document may be assigned metadata from the following set of attributes:

i. Title

ii. Organisation (repeatable attribute generally representing the publisher, but may also be a corporate author or other organization associated with the document)

iii. Subject (hierarchical classification assignment)

iv. Keywords (additional user defined controlled indexing vocabulary; in the DLS collection it is used to classify documents according to predefined “How to” questions frequently asked by the collection users)

v. Periodical (title of the document containing the document in question, e.g. document at higher bibliographic level)

vi. Publication date

vii. Number of pages

viii. Job number (document identifier, used by Greenstone to link the metadata to the text and images of the document at the collection building stage)

ix. Language

x. Batch (identifier of a group of documents which were processed together)

xi. Suggested collections (collections for which the document is a candidate for assignment)

xii. Copyright notice

xiii. Copyright code (from a list of standard rights permission schemes)

The first five attributes are special in that the Organizer configures Greenstone to build and display indexes based on them for the retrieval of documents. Attributes vi to viii are not indexed but are carried over as passive document identifiers in the digital library application. Attributes ix to xiii are used only for internal document management in the Organizer; they have no effect on the generated digital library application.
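For readers who think in terms of data structures, the document model and the three roles just described can be summarised as follows. This is a sketch for illustration only: the class, field names and types are our own assumptions and do not reflect how the Organizer stores its data. In particular, only Organisation is stated to be repeatable; treating subjects, keywords and suggested collections as lists is a guess.

```python
from dataclasses import dataclass, field, fields
from typing import List

INDEXED = "indexed"        # attributes i to v: Greenstone builds indexes on them
CARRIED = "carried over"   # attributes vi to viii: passive document identifiers
INTERNAL = "internal"      # attributes ix to xiii: used only inside the Organizer


@dataclass
class Document:
    """Hypothetical record mirroring the thirteen attributes of the model."""
    title: str = field(default="", metadata={"role": INDEXED})
    organisations: List[str] = field(default_factory=list, metadata={"role": INDEXED})
    subjects: List[str] = field(default_factory=list, metadata={"role": INDEXED})
    keywords: List[str] = field(default_factory=list, metadata={"role": INDEXED})
    periodical: str = field(default="", metadata={"role": INDEXED})
    publication_date: str = field(default="", metadata={"role": CARRIED})
    number_of_pages: int = field(default=0, metadata={"role": CARRIED})
    job_number: str = field(default="", metadata={"role": CARRIED})
    language: str = field(default="", metadata={"role": INTERNAL})
    batch: str = field(default="", metadata={"role": INTERNAL})
    suggested_collections: List[str] = field(default_factory=list, metadata={"role": INTERNAL})
    copyright_notice: str = field(default="", metadata={"role": INTERNAL})
    copyright_code: str = field(default="", metadata={"role": INTERNAL})


def attributes_with_role(role):
    """List the attribute names that play the given role, in model order."""
    return [f.name for f in fields(Document) if f.metadata["role"] == role]


for role in (INDEXED, CARRIED, INTERNAL):
    print(role, "->", ", ".join(attributes_with_role(role)))
```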

The subject metadata are special in that they are really a property of the collection under consideration rather than of individual documents. When a document is moved from one collection to another, all the other metadata follow, but the subject metadata must be reassigned.

It is unlikely that the developer of a digital library would wish to use the keyword metadata in the classical documentary sense, since, in addition to the subject classification, all documents are accessible in Greenstone by full text search. This attribute can be seen as simply an extra indexable attribute. It could, for example, be used for Author (not explicitly foreseen in the present model) or Country of Origin by simply placing the appropriate data in the keyword metadata. Then the button for retrieval by this new attribute (labeled “how to” in the default digital library interface) can be customised by changing the word “Howto” to “Author” or “Country” in the fifth line of the collect.cfg file generated by the Organizer.
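The edit to collect.cfg can be made in any text editor. For those who prefer to script it, the following sketch shows one way to do so. It is not part of Greenstone or the Organizer: the function name is hypothetical, and the sample text is a placeholder rather than real collect.cfg syntax. The function refuses to change anything if the word is not found on the expected line.

```python
def relabel_keyword_button(config_text, new_label, old_label="Howto", line_number=5):
    """Replace old_label with new_label on one line of the configuration text."""
    lines = config_text.splitlines(keepends=True)
    index = line_number - 1
    if index >= len(lines) or old_label not in lines[index]:
        raise ValueError(f"'{old_label}' not found on line {line_number}")
    lines[index] = lines[index].replace(old_label, new_label)
    return "".join(lines)


# Placeholder text only; a real collect.cfg generated by the Organizer
# will look different.
sample = "line one\nline two\nline three\nline four\nplaceholder Howto placeholder\n"
print(relabel_keyword_button(sample, "Author"))
```

In practice you would read the generated collect.cfg into a string, pass it through the function, and write the result back, keeping a copy of the original file.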

Greenstone can also generate a Table of contents for access to sections and subsections of each document. This is not presently handled by the Organizer but rather must be accomplished by tagging the sections and subsections in the text itself, as described in Section 5.3 of this chapter.

It should be remembered that Greenstone itself is very flexible in terms of available document models, but that presently each different model must be formally conceived in terms of metadata functions and manually encoded in the collection building files described in section 2.2 of the Greenstone Developer’s Guide. If the required model is relatively close to the DLS model, one could start with the build files generated by Organizer, and manually edit them according to Greenstone rules.

Exploring the Organizer

Here are details of each of the Organizer’s features:

i. The Organizer Main Window

This window is displayed when a database has been selected. It has three parts:

• The horizontal menu bar at the top, including a horizontal short-cut toolbar underneath

• The vertical toolbar on the left

• The central area, whose content and function are selected by the vertical toolbar.

a. The horizontal menu bar consists of eight menus, each with one or more commands:

• The File menu provides only one command: Exit to quit the Organizer program.

• The New menu provides commands to add or modify Collections, Documents, Organisations, Subjects and to add or modify the lists of possible values of the other attributes in the document model. The dialogue for these commands should be self-explanatory, with the following complementary details:


The New/Organisation command displays a dialogue box called Edit Organisation Name which provides fields to enter the full organisation name and the abbreviation (if Auto complete is ticked, this will be taken automatically as the first letter of each word beginning with a capital). If the Auto complete function cannot propose a valid abbreviation (because there are no capital letters in the title or because the abbreviation duplicates another in the database), then the Auto complete box has to be unselected and the abbreviation entered manually.

The command New/Subject displays the Add new subject dialogue box to add a category to the global (non-hierarchical) subject list for the database, whose elements are subsequently available for inclusion in the hierarchical subject classification for a given collection (Subjects view of the Collection properties window). This box can also be used to create keywords in the global subject list which can subsequently be assigned as i) the property of specific subject categories to facilitate their retrieval to add to a collection (in the Add subject function of the Subjects view of the Collection properties window, see below), and/or as ii) an attribute attached to specific documents (Keywords tab of the Document properties box of the Documents view of the Collection properties window, see below).

• The Delete menu provides commands to delete collections, documents, organisations, subjects and items from the lists of possible values of the other attributes in the document model.

• The Edit menu provides a Copy command (the same as Ctrl+C) which copies the headings of the central frame and the selected rows into the clipboard for pasting into files or other Organizer elements.

• The View menu provides commands to open the Collections, Documents, Organisations or Subjects lists in the central area (an alternative to using the top four vertical toolbar buttons). It also contains toggle boxes to display the short-cut toolbar (described below) under the menu bar and the status bar under the central frame.

• The Tools menu contains the following:

– The Documents command contains two tools: one to globally change the name of an organisation to another in the organisation attribute of all documents in the database (the new name must first be added to the list of organisations), and another to change/add properties to a group of documents selected in the document list (the latter tool is operational if the documents concerned have already been selected in the Documents view, even if another view is presently active).

– The second command, Collections, contains two tools: a Statistics tool that gives statistical information about a given collection, and an Export lists tool that exports into a text file a list of the titles or of one of the three main attributes of a collection (the lists are saved with standard names in the \Program Files\Human Info\Organizer\DataBase directory).

– The Backup command is used to create a back-up of the entire database each time it is selected (the database is saved under a standard name, indicating the date and time of backup, in the \Program Files\Human Info\Organizer\DataBase directory). Confirmation is requested so that multiple copies will not be made unintentionally.

• The Administration tools menu provides windows to add new users and to delete a user or change a user’s password. Only an administrator (e.g. “admin”) can grant a user access to a database as “Administrator” or “Guest” and change passwords. Users and administrators have the same access rights on the collections.

• The Help menu contains a Microsoft standard help facility under Help Topics and information about the version and developers through the About Organizer command.
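The Auto complete rule described above for the Edit Organisation Name dialogue box (the first letter of each word beginning with a capital, rejected if empty or already in use) can be sketched as follows. The function name is hypothetical and the code is our reading of the description, not the Organizer's actual implementation; details such as the handling of punctuation or letter case in duplicates are not documented here.

```python
def auto_abbreviation(name, existing_abbreviations):
    """Propose an abbreviation, or return None if one must be typed manually."""
    # Take the first letter of every word that begins with a capital.
    letters = [word[0] for word in name.split() if word[0].isupper()]
    abbreviation = "".join(letters)
    # No capitals in the name, or a duplicate in the database: no valid proposal.
    if not abbreviation or abbreviation in existing_abbreviations:
        return None
    return abbreviation


print(auto_abbreviation("World Health Organization", set()))
print(auto_abbreviation("World Health Organization", {"WHO"}))
```

The first call proposes WHO; the second returns None because that abbreviation is already taken, which corresponds to the case where the Auto complete box has to be unselected and the abbreviation entered by hand.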

A short-cut toolbar just under the horizontal menu bar provides a set of icons which changes dynamically depending on the view which is open in the central area (set by the View menu, see above, or by the vertical toolbar, see below). Clicking on the second icon from the left is the same as double clicking on the item selected in the displayed list to edit that item (see below), while the next to last icon is the same as clicking on the header of the first column of the displayed list to sort it. The first, third and last icons have the same functions as three of the menu bar commands described above: Edit/Copy; creation of a new element in the displayed list (New/Collection/Empty, New/Document, New/Organisation or New/Subject); and Help/About Organizer. Two views have an additional fourth icon: in the collections list it permits changing the name of the selected collection (a function not available in the main menu), and in the documents list it exports the list of documents (same as Tools/Collections/Export Lists with choice of the document list, except that one can also choose the target directory and file name).

b. The vertical toolbar contains five buttons to select the content and function of the central area of the screen.

The four top buttons of the toolbar are used to display, for editing in the central frame, a list of one of the following elements: Collections, Documents, Organisations or Subjects (these lists can also be obtained by activating the corresponding command in the View menu in the horizontal menu bar):

• Collections list: Selecting the Collections button (the default upon entering the database) displays the list of all current collections in the database. Double click on any collection title to open the Collection Properties window (described below) in which you can add new documents to that collection or add/edit the collection’s document attributes. To create a new collection, use the New/Collection command in the top menu bar or the third icon from left in the short-cut toolbar (see above).

• Documents list: Selecting the Documents button displays the global list of all documents in the database (i.e. those in any of the associated collections or not yet attached to a collection). Double click on any document title to see/modify the attributes of that document other than the subject, which is assigned in the Collection Properties window (described below). To create a new document entry with its attributes, use the New/Document command in the top menu bar or the third icon from left in the short-cut toolbar (see above).

Text string title search: To search for titles containing a word or a string of characters, enter it in the small dialogue box above the Document title heading of the list and click on the “binoculars” icon to jump to the first occurrence. Click on the next button to the right (“binoculars and arrow” icon) to jump successively to the next occurrences.

Filtering Documents: Sometimes, it is convenient to view and edit only the documents of one language, one organisation or one subject. For this, select the Filter button (“funnel” icon) in the upper right part of the window to display the Search documents dialogue box, enter your preferences and accept them with the Apply filter button. You may also change and test the search without leaving the box with the Search and Reset search buttons. You may then apply and remove the filter at any time with the Apply filter check box. This filter mechanism is similar to a Boolean “and” search: all the documents that are displayed will conform to all the filter “included” criteria selected.

• Organisations list: Selecting the Organisations button displays the global list of all organisations in the database (i.e. assigned to documents in any of the associated collections, or not yet assigned to any document). Double click on any Organisation title to open the Edit Organisation Name dialogue box to see/modify the Organisation name and abbreviation (same dialogue box as presented with the New/Organisation command). To create a new organisation entry, use the New/Organisation command in the top menu bar or the third icon from left in the short-cut toolbar (see above).

• Subjects list: Subjects are elements of a standard classification to be applied to the documents of a collection. Selecting the Subjects button enables addition, deletion and editing of the global list of all subjects in the database, even if they are not yet assigned to any document. An unlimited number of subject terms can be created for subsequent use at the collection level to build and assign to the collection’s documents a subject classification hierarchy. Double click on any subject in the list to open the Edit subject window to see/modify the Subject properties. To create a new subject entry, use the New/Subject command in the top menu bar or the third icon from left in the short-cut toolbar (see above).

In the Edit subject or Add new subject windows, “keywords” can be added to the global subject list (New keyword name box, same as using New/Add-Modify keywords command from the top menu bar), and can then be assigned as properties of specific subjects (Selected for this subject box) for use in searching for subject categories to add to a collection. The keywords (whether or not assigned to subject categories) can be used for subsequent attachment as attributes of documents (“How to” in the DLS collection). Note that the two types of keywords are mixed in alphabetical order in the list which may cause inconvenience when assigning keywords to documents; this can be overcome by adding a code like “z-” in front of keywords to be assigned uniquely to subject terms (not to those to be assigned to documents unless you wish the code to appear in the generated application’s keyword search list).

Warning: If the main database is very large, loading the collection elements can be a lengthy process (a minute or more). After selecting a new collection please wait until all elements are loaded before starting to work (note that a small icon in the shape of a light bulb appears on the tab line until loading is complete). Failure to wait for loading to finish may cause the program to abort.

The Export Files button opens the Export Settings window (described below) which enables the saving of a collection’s metadata to export the digital library structure into the Greenstone Digital Library software. It also enables the complete database to be saved.
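The Boolean “and” behaviour of the document filter described above under the Documents list can be illustrated with a few lines of code. This is a sketch under our own assumptions: the records, attribute names and function are invented for the example and say nothing about how the Organizer stores or searches its database.

```python
def apply_filter(documents, **criteria):
    """Return the documents that satisfy every criterion (Boolean "and")."""
    return [
        doc for doc in documents
        if all(value in doc.get(attribute, []) for attribute, value in criteria.items())
    ]


# Invented sample records; each attribute holds a list of values.
documents = [
    {"title": "Sample A", "language": ["English"], "organisation": ["Org X"]},
    {"title": "Sample B", "language": ["French"], "organisation": ["Org X"]},
    {"title": "Sample C", "language": ["English"], "organisation": ["Org Y"]},
]

selected = apply_filter(documents, language="English", organisation="Org X")
print([doc["title"] for doc in selected])
```

Only Sample A is kept, because it is the only record that meets both criteria; adding a criterion can only narrow the list, never widen it.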


ii. The Collection properties window

This window, enabling the user to build or modify a particular collection, appears when a collection has been selected in the collection list of the Organizer Main Window. It allows the user to select one of four views of the collection by clicking the tabs at the top. Each view in turn provides a number of data selection and editing functions which are described below:

a. Subjects view: An unlimited number of subject terms can be created here for the selected collection in a hierarchy up to six levels deep (although most collections will not need more than three levels). To add a subject category to the collection (to be able to use it as an attribute of one or more documents), i) select the subject heading under which you want to add the new term, ii) select the Add subject button, iii) select Add Subjects from Global List, iv) use the buttons to display the entire global list or (default and normal procedure) those which have not yet been used in the collection hierarchy, v) select the desired subject from the list, and vi) select OK. If the subject desired is not yet in the global list, then first use the Add New Subject option offered by the Add subject button (same as going back to the main menu and adding it with the New/Subjects command). Subject categories can also be modified with the Edit icon between the subject and document displays; the principal usage of this function is to optionally add a numbering scheme for the subject hierarchy in the categories used in a collection (Note that the modifications of the categories are used only for the current collection and are not carried over to the global list of subjects).

To add one or more documents to a subject (i.e. assign the subject to one or more documents), you first select the subject category or sub-category to which you want to add documents, in the upper list box. Then select one or more documents in the lower list box, and click the little icon between the two displays with an “up” arrow and red book (or double click the documents one by one). [The documents thus assigned to the selected subject are shown with a √ symbol preceding each field in the corresponding line; for convenient viewing, you can move them to the top or bottom of the display with the “√ up” and “√ down” icons between the displays.] You will see that the documents have been added in the appropriate place in the Subject classification hierarchy. Now repeat this operation until all documents are classified. One document can be assigned to as many subjects as you want. You can move (but not copy) a document classified under one subject to any other subject by drag and drop with the left mouse button. To remove a document from a subject, select the document title listed under the subject, press the delete key and confirm.

b. Organisations view: This view is used as a convenient means of selecting or deselecting documents for the collection according to the name of the associated organisation, which will also become part of the list in the digital library for retrieval of documents by associated organisation (note that documents can also be added, and their organisation attribute changed, from the Documents view). The default list in the left-hand box contains the organisations already associated with documents in the collection; the right-hand box contains the titles of the documents with which that organisation has been associated: those titles marked with a √ are included in the collection, those not marked are not included. Click on any title to select or deselect it in the collection. Use the icons at bottom right (white square with √ and blank white square) to select or deselect all of the titles. To choose documents from organisations not yet attributed to documents in the collection, select the Add Organisations from Global List option of the Add organisation button to add organisations to the collection organisation list, then proceed to select documents as above. Similarly, you can use the Remove organisation button to remove all of the documents associated with a selected organisation from the collection (but not from the global document list). To work with only the documents in the collection, toggle the List only selected documents checkbox. From this view, you can also add new organisations and new documents to the global list (using the Add new organisation option of the Add Organisation button or the Add new document option of the Add Document button; these are equivalent, respectively, to going back to the main menu and adding them with the New/Organisations or New/Subjects commands).

c. Documents view: Lists all the documents that are selected for inclusion in the active collection. This is the same list of documents as the one appearing in the lower window of the Subjects view. The main difference is that in this document list, double clicking a given document line opens the Documents properties dialogue box for that document. You can then add/change the attributes of that document (other than the associated subject categories which are modified as described above under the Subjects view) by selecting one of the tabs for different classes of attributes at the top of the box:

• General: In this window you can enter the Document title, Job number, Number of pages and Number of images, Year of publication, and the Batch name. You can automatically input the number of images by clicking on the Find images button and choosing the directory containing the document’s images, after specifying the possible image format extensions in the Extensions field.

• Advanced: If the document is a periodical, or if it is part of a series, you can assign the title of periodical issue or series as a property, which automatically creates an entry in the list of periodical issues which will be included in the title search window of the generated application. In the advanced window you can also assign the Organisations and Languages of the document. Both fields are repeatable: If more than one organisation is associated with the document (publishers, corporate authors) or if the document is written in more than one language, for example bilingual English/French, you should assign all of the corresponding entries from the respective lists.

• Copyright: It is very important to know the copyright status of a publication. This window has two parts: one where the original copyright notice is displayed, and one where the copyright level can be assigned. This information is for document management with the Organizer, and has no effect on the Greenstone digital library application.

• Suggested collections: This represents a list of collections in which this document could potentially be included later. The number of suggested collections is not limited. This information is for document management with the Organizer, and has no effect on the Greenstone digital library application.

• Keywords: Keywords specific for the publication can be selected. This property is an additional classification scheme which complements the subject classification and can be used in the generated digital library application to select and display documents in the digital library. In the DLS it is used for a “How to” parameter, but could also be used for any other additional metadata, for example the author or country of origin.

Note that in the Subjects, Organisations and Documents views, the Add documents button permits the user to add new documents to the collection directly from a dialogue box or by selection from the global database list. Documents added in the Collection properties window views are automatically entered in the global database list for future use. When a new document is added to the collection from the global database list, the Search documents dialogue box appears to enable the user to easily locate the desired documents according to several selection criteria (this is the same filtering function described above in discussing the documents list obtained by selecting the Documents button of the vertical toolbar of the Organizer Main Window).

d. Others view: The Other hierarchies view shows the ordering of documents under periodical titles (which cannot be edited) and also the alphabetical classification of titles under each language. The user can change the alphabetical classification groups (for example A-C, E-G or A-L, M-Z) to the size that best displays the documents in the finished library. To do this, click on a language and use the Split letters button. When you are pleased with the result, click on the Save Splitters button (until then you can restore the initial or previously saved letter categories by clicking Load/Refresh Splittings, or eliminate the letter categories by clicking Eliminate Splittings).
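The effect of splitting titles into letter groups can be illustrated with a short Python sketch. This is not how the Organizer itself works; the function name `group_titles` and the "Other" bucket for titles that start with a digit or symbol are assumptions made purely for illustration.

```python
def group_titles(titles, ranges):
    """Group titles by the letter range that contains their first letter.

    `ranges` is a list of (first, last) letter pairs such as
    [("A", "L"), ("M", "Z")]. Titles matching no range go under "Other".
    """
    groups = {f"{first}-{last}": [] for first, last in ranges}
    groups["Other"] = []
    # Sort case-insensitively so "mushrooms" files next to "Maize".
    for title in sorted(titles, key=str.casefold):
        initial = title.strip()[:1].upper()
        for first, last in ranges:
            if initial and first <= initial <= last:
                groups[f"{first}-{last}"].append(title)
                break
        else:
            groups["Other"].append(title)
    return groups


if __name__ == "__main__":
    sample = ["Butterfly Farming in Papua New Guinea", "Water Harvesting"]
    for label, members in group_titles(sample, [("A", "L"), ("M", "Z")]).items():
        print(label, members)
```

Trying a few different range lists on your own titles is a quick way to see which grouping gives evenly sized lists before you commit to it in the Organizer.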

iii. The Export Settings window

This window, which is presented when the bottom icon of the vertical toolbar of the Organizer Main Window is selected, allows you to export the result of your work, and is normally the last stage involved in creating a new collection or sub-collection. Select Export Files to invoke the Export Settings window and select a collection for export and a folder to receive the exported information. Then, click Export files. This will place the six files collect.cfg, metadata.xml, sub.txt, org.txt, Keywords.txt and AZList.txt into the selected folder. In order to build the collection with this information, you need to move the files to the appropriate place. Place the metadata.xml file in the collection’s import directory and the others in the collection’s etc directory.

Getting Started in Ten Steps and 15 Minutes

The best way to get the look and feel of the Organizer is to create a small test library. If you have 15 minutes, please follow these ten steps and you will understand the program much better.

Before getting started (assuming you want to use the Organizer output to build your first Greenstone digital library), you should have:

a. Installed Greenstone (see the Greenstone Installer’s Guide) which includes the Demo library in DLS format and its source files. Note, if you wish to be able to add to your collection any of the 140 documents in the DLS collection in the default Organizer database (instead of just the 11 of these documents in the Greenstone Demo collection), you should install DLS as one of the sample Greenstone libraries and replace “demo” by “dls” in the instructions below. The Demo and DLS collections will be installed respectively in c:\program files\gsdl\collect\demo and c:\program files\gsdl\collect\dls. If you previously installed Greenstone without DLS and wish to install it, then you may uninstall and reinstall Greenstone specifying this collection.

b. Created the structure for your new collection (we will refer to it below as “newcol”) by executing the following command from Run... in the Windows Start menu:

"c:\program files\gsdl\bin\windows\build" newcol

c. Replaced the default collect.cfg file created by the previous step with that used by the Demo collection. That is, copy c:\program files\gsdl\collect\demo\etc\collect.cfg to c:\program files\gsdl\collect\newcol\etc\collect.cfg. This is necessary because Demo (and all DLS format collections) use some specialised options that the default collection does not (see the Greenstone Developer’s Guide for further details).

We suggest that you print the instructions below and follow them step by step:

1. Open the Collection Organizer, select the dls database and give “admin” for both the user name and password (the Collections button of the vertical toolbar should be highlighted by default; if it is not, then select it).

2. Select the New/Collection/Empty command in the horizontal menu bar on the top of the Organizer Main Window to create a new empty collection. Give it the title and the version number you choose, for example “My First Collection” as the name and “1.0” as the version.

3. For certain document attributes, you will have to create a list of the possible values first. So if you know the languages and/or publishing organisations of your documents, use the commands New/Add-Modify languages and New/Organisation to add all the languages you will use in this or future collections, as well as the publishers of your documents. You can use the same commands to add or modify languages and organisations when you want, but not when you are editing the collection itself as in the next instruction.


4. Double-click on the line with the title of the collection you just created.

5. Click on the Subjects tab at top (if not already selected as it should be by default); then click on the Add subject button and the command Add new subject, then enter a few subjects in the Subject title field, pressing the “enter” key after each one. Click on the + before the word Subjects in the Subject hierarchy list to see the subjects you have entered.

6. Click on the Documents tab; this will open the Documents view. Then add documents to your collection as follows:

a. To add a document from the Demo collection (or the DLS collection if it is installed in Greenstone) to your new collection, click the Add documents button and select Add document from global list. Locate the document you require (using the “filtering” function described above) and add it to your collection in the Organizer. After adding the document locate its source files in the Demo import directory (c:\program files\gsdl\collect\demo\import) and copy them to the import directory of your new collection. For example, to add the document “Butterfly Farming in Papua New Guinea” to your new collection you should take note of the document’s job number once you locate it in the Organizer. This document’s job number is “b22bue”, so you should copy the “b22bue” directory from c:\program files\gsdl\collect\dls\import\ac01ne to c:\program files\gsdl\collect\newcol\import\ac01ne.

b. To add a new document (that is, one that is not included in the Demo collection) to your new collection, click the Add documents button and select Add new document. Enter the document’s title, job number (your choice), number of pages, publishing organisation, languages and other information. You must then create a new directory in c:\program files\gsdl\collect\newcol\import to correspond to the job number of the new document. Into this new directory you should place the source file of the document (in HTML or any other format accepted by Greenstone; see the Greenstone User’s Guide) and any associated image files.

7. Go back to the Subjects tab, and you will see that your documents are displayed in the lower list box. Select a document. Then select in the upper list a subject you want this document to be classified under and click the little icon between the two displays with an “up” arrow and red book. Once a document is thus classified, you can still move it from one subject to another by drag and drop with the left mouse button. You can also move documents or subjects up or down within the same level of hierarchy by selecting the blue up or down buttons to the right of the Subject hierarchy list. Try to classify an average of 6 to 30 documents per subject. One document can be assigned to as many subjects as you want.

8. Repeat the previous steps by adding new subjects, and adding more documents. When the library is finished, you will have to review the complete list of subjects and documents, to be sure that all are entered and correctly classified and ordered.

9. Finally, close the Collection properties window and press the Export Files button of the vertical toolbar. This will open the Export Settings window. Click on the Display collection list button, and select your collection, then click on the Browse for folder button and select the directory to which you want to export the metadata files, and press Export files to export the collection metadata for collection building with Greenstone.

10. Copy the exported files to the appropriate places in your new collection’s directory structure.

a. The exported metadata.xml file should be copied to the c:\program files\gsdl\collect\newcol\import directory.

b. The exported AZList.txt, Keyword.txt, sub.txt, and org.txt files should be copied to the c:\program files\gsdl\collect\newcol\etc directory.

Note that the collect.cfg file generated by the Organizer is not required because the “classify” lines it contains are already included in the collect.cfg files for the DLS and demo collections.

The newcol collection is now ready for building. Build it from the command line using the import.pl and buildcol.pl commands (see the Greenstone Developer’s Guide for details).
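The file placement described in step 10 can also be scripted. The following Python sketch is only an illustration under stated assumptions: the function name `place_exported_files` is hypothetical, and it treats every exported .txt file as belonging in the etc directory rather than relying on exact file names. It deliberately leaves the exported collect.cfg alone, as the note above explains.

```python
import shutil
from pathlib import Path


def place_exported_files(export_dir, collection_dir):
    """Copy Organizer export files into a collection's import and etc directories.

    metadata.xml goes to <collection>/import; every exported .txt file
    goes to <collection>/etc. The exported collect.cfg is not copied.
    Returns a dict mapping each copied file name to "import" or "etc".
    """
    export_dir = Path(export_dir)
    collection_dir = Path(collection_dir)
    import_dir = collection_dir / "import"
    etc_dir = collection_dir / "etc"
    import_dir.mkdir(parents=True, exist_ok=True)
    etc_dir.mkdir(parents=True, exist_ok=True)

    placed = {}
    metadata = export_dir / "metadata.xml"
    if metadata.is_file():
        shutil.copy2(metadata, import_dir / metadata.name)
        placed[metadata.name] = "import"
    for txt in sorted(export_dir.glob("*.txt")):
        shutil.copy2(txt, etc_dir / txt.name)
        placed[txt.name] = "etc"
    return placed
```

For example, if you exported to a folder such as c:\export (a made-up location), you might call place_exported_files(r"c:\export", r"c:\program files\gsdl\collect\newcol") and check the returned dictionary to confirm that metadata.xml and the classification files were all found.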

5.3 Tagging document files

Source documents often need to be structured into sections and subsections, and this information needs to be communicated to Greenstone so that it can preserve the hierarchical structure. Also, metadata - typically the title - might be associated with each section and subsection.


The source documents from an OCR process are typically a set of word processor files, including images. If these are represented as Microsoft Word files, they can be input into Greenstone using the Word plugin. Alternatively, they can be converted to HTML and input using the HTML plugin.

In either case, the hierarchical structure of a document may be indicated by inserting tags in the text as follows:

<!--
<Section>
<Description>
<Metadata name="Title">(title of section goes here)</Metadata>
</Description>
-->
(text of section goes here)
<!--
</Section>
-->

The <!-- and --> markers are used because they indicate comments in HTML; thus these section tags will not affect document formatting. You must include these markers around your section tags, even if the document you are working with is not HTML (e.g. if it's a Microsoft Word file).

In the Description part (between the <Description> and </Description> tags) other kinds of metadata can be specified, but this is not done for the style of collections we are describing here.

It is important to remember that you are creating a hierarchical table of contents when you insert section tags into your document. This means that sections can be nested within other sections. In fact, all sections must be nested within a single enclosing section that encompasses the entire document.
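The nesting rule can be checked mechanically before a collection is built. The Python sketch below is a rough check under stated assumptions, not part of Greenstone: the function name `check_sections` is hypothetical, it only looks at <Section> and </Section> tags, and it does not verify that each tag sits inside HTML comment markers.

```python
import re

# Matches either an opening <Section> or a closing </Section> tag.
SECTION_TAG = re.compile(r"<(/?)Section>")


def check_sections(text):
    """Return a list of problems found in the <Section> nesting of `text`.

    An empty list means every section is closed and all sections sit
    inside exactly one enclosing top-level section.
    """
    problems = []
    depth = 0
    top_level = 0
    for match in SECTION_TAG.finditer(text):
        if match.group(1):  # closing tag
            if depth == 0:
                problems.append("</Section> without a matching <Section>")
            else:
                depth -= 1
        else:  # opening tag
            if depth == 0:
                top_level += 1
            depth += 1
    if depth:
        problems.append(f"{depth} <Section> tag(s) never closed")
    if top_level != 1:
        problems.append(f"expected 1 top-level section, found {top_level}")
    return problems
```

Running this over the text of each source file and printing any returned problems is a cheap way to catch a forgotten closing tag or a second top-level section.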

The following example demonstrates a document with two chapters, the second of which contains two subsections. For real examples of source documents tagged in this way, look at the source documents for the Demo or DLS collections.

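<!--
<Section>
<Description>
<Metadata name="Title">My Document</Metadata>
</Description>
-->
<!--
<Section>
<Description>
<Metadata name="Title">Chapter 1</Metadata>
</Description>
-->
(text of chapter 1 goes here)
<!--
</Section>
-->
<!--
<Section>
<Description>
<Metadata name="Title">Chapter 2</Metadata>
</Description>
-->
<!--
<Section>
<Description>
<Metadata name="Title">Sub-section 1</Metadata>
</Description>
-->
(text of sub-section 1 goes here)
<!--
</Section>
-->
<!--
<Section>
<Description>
<Metadata name="Title">Sub-section 2</Metadata>
</Description>
-->
(text of sub-section 2 goes here)
<!--
</Section>
-->
<!--
</Section>
-->
<!--
</Section>
-->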

Note that metadata assigned from within a section tag in a source document takes precedence over that assigned from within a metadata.xml file (like those produced by the Organizer). This means that you should not explicitly specify Title metadata for the top-level section within a source document unless you want it to override the title you set from within the Organizer. In the above example, if you want the title of the document to be what you set it to in the Organizer you should omit the line that reads:

<Metadata name="Title">My Document</Metadata>
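If you tag many documents, the section markers can be generated rather than typed by hand. The Python sketch below is one possible approach, not a Greenstone tool. The function name `render_section`, the nested-tuple input format, and the exact line layout of the tags are all assumptions for illustration; titles are inserted verbatim, with no escaping. Passing with_title=False for the top-level section leaves the document title to the metadata.xml file produced by the Organizer, as discussed above.

```python
def render_section(title, body="", children=(), with_title=True):
    """Render one section, and any nested sections, with comment-wrapped tags.

    `children` is a sequence of (title, body) or (title, body, children)
    tuples describing the nested sections.
    """
    lines = ["<!--", "<Section>", "<Description>"]
    if with_title:
        lines.append(f'<Metadata name="Title">{title}</Metadata>')
    lines += ["</Description>", "-->"]
    if body:
        lines.append(body)
    for child in children:
        lines.append(render_section(*child))
    lines += ["<!--", "</Section>", "-->"]
    return "\n".join(lines)


# Two chapters, the second containing two sub-sections.
document = render_section(
    "My Document",
    children=[
        ("Chapter 1", "(text of chapter 1 goes here)"),
        ("Chapter 2", "", [
            ("Sub-section 1", "(text of sub-section 1 goes here)"),
            ("Sub-section 2", "(text of sub-section 2 goes here)"),
        ]),
    ],
    with_title=False,  # top-level title comes from the Organizer instead
)

if __name__ == "__main__":
    print(document)
```

Because every section is rendered by the same function, opening and closing tags always come in pairs and the whole document is wrapped in a single enclosing section.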