131
SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion Agent by Itemfield. Future editions of this manual will reflect the new name, which replaces the name ContentMaster .

Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

  • Upload
    others

  • View
    19

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

SAP Conversion Agent byItemfield (ContentMaster)

Getting Started withContentMaster

Version 4.0

This product has been renamed as SAP Conversion Agent by Itemfield. Future editions ofthis manual will reflect the new name, which replaces the name ContentMaster.

Page 2: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

SAP Conversion Agent byItemfield (ContentMaster)

Getting Started withContentMaster

Volume 1: Introductory Exercises

Version 4.0

This product has been renamed as SAP Conversion Agent by Itemfield. Future editions ofthis manual will reflect the new name, which replaces the name ContentMaster.

Page 3: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Legal Notice

Getting Started with ContentMaster, Volume 1: Introductory Exercises

Copyright © 2005-2006 Itemfield Inc. All rights reserved.

Itemfield may have patents, patent applications, trademarks, copyrights, or other intellectual propertyrights covering subject matter in this document. Except as expressly provided in any written licenseagreement from Itemfield, the furnishing of this document does not give you any license to thesepatents, trademarks, copyrights, or other intellectual property.

The information in this document is subject to change without notice. Complying with all applicablecopyright laws is the responsibility of the user. No part of this document may be reproduced ortransmitted in any form or by any means, electronic or mechanical, for any purpose, without theexpress written permission of Itemfield Inc.

SAP AGhttp://www.sap.com

Publication Information:

Version: 4.0Date: January 2006

Page 4: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 1 Contents

i

Contents of Volume 1

1. Introducing ContentMaster ..............................................................1Introduction to XML ............................................................................................................................... 1How ContentMaster Works.................................................................................................................... 3

ContentMaster Studio......................................................................................................................3ContentMaster Engine..................................................................................................................... 3

Using ContentMaster in Integration Applications................................................................................... 4HL7 Integration ................................................................................................................................ 4Processing PDF Forms ................................................................................................................... 4Converting HTML Pages to XML..................................................................................................... 4

How to Continue in this Book ................................................................................................................ 5Organization of the Chapters .......................................................................................................... 5Quick Reference to ContentMaster Concepts and Features........................................................... 5

Additional ContentMaster Documentation .............................................................................................8

2. Installation ........................................................................................9System Requirements ........................................................................................................................... 9Installing ContentMaster........................................................................................................................ 9Tutorials and Workspace Folders........................................................................................................10

Exercises and Solutions ................................................................................................................10

3. Basic Parsing Techniques .............................................................11Opening ContentMaster Studio ...........................................................................................................11Importing the Tutorial_1 Project .......................................................................................................... 13A Brief Look at the ContentMaster Studio Window .............................................................................14Defining the Structure of a Source Document..................................................................................... 17

Correcting Errors in the Parser Configuration ...............................................................................24Techniques for Defining Anchors ..................................................................................................25Tab-Delimited Format....................................................................................................................25

Running the Parser..............................................................................................................................26Running the Parser on Additional Source Documents ..................................................................27

Points to Remember............................................................................................................................ 28What's Next? ....................................................................................................................................... 29

Page 5: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 1 Contents

ii

4. Defining an HL7 Parser ..................................................................30Requirements Analysis ........................................................................................................................ 30

HL7 Background............................................................................................................................ 30Input HL7 Message Structure........................................................................................................30Output XML Structure....................................................................................................................31

Creating a Project................................................................................................................................ 32Using XSD Schemas in ContentMaster ........................................................................................ 35

Defining the Anchors ...........................................................................................................................36Editing the IntelliScript .........................................................................................................................40Testing the Parser ............................................................................................................................... 40Points to Remember............................................................................................................................ 44

Index of Volume 1...............................................................................45

Page 6: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 1 1. Introducing ContentMaster

1

Introducing ContentMaster

ContentMaster is a data transformation system. Using ContentMaster, you canbuild a high-throughput interface between any data format and XML-basedsystems.

To configure a ContentMaster data transformation, you use a procedure that iscalled parsing by example. In this procedure, you teach ContentMaster how toconvert data to XML, simply by marking up an example in a visual editorenvironment. You do not need to do any programming to configure theconversion. You can configure even a complex transformation in just a few hoursor days, saving weeks or months of programming time.

ContentMaster can process fully structured, semi-structured, or unstructured data.You can configure ContentMaster to work with text, binary data, messagingformats, HTML pages, PDF documents, word-processor documents, and any otherformat that you can imagine.

ContentMaster can parse the data and transform it to any standard or custom XMLvocabulary. In the reverse direction, ContentMaster can serialize XML data,transforming the content to other formats. The ContentMaster components thatperform these transformations are called parsers and serializers, respectively.Another component, called a mapper, combines these two functions, and does XMLto XML transformations.

This book is a tutorial introduction to ContentMaster. As you read it, you will doseveral exercises that will teach you how to use ContentMaster to configure andrun your own data transformations.

Introduction to XML

XML (Extensible Markup Language) is the de-facto standard for cross-platforminformation exchange. For the benefit of ContentMaster users who may be new toXML, we present a brief introduction here. If you are already familiar with XML,you can skip this section.

The following is an example of a small XML document:

<Company industry="metals"><Name>Ore Refining Inc.</Name><WebSite>http://www.ore_refining.com</WebSite>

1

Page 7: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 1 1. Introducing ContentMaster

2

<Field>iron and steel</Field><Products>

<Product id="1">cast iron</Product><Product id="2">stainless steel</Product>

</Products></Company>

This sample is called a well-formed XML document because it complies with thebasic XML syntactical rules. It has tree structure, composed of elements. The top-level element in this example is called Company, and the nested elements are Name,WebSite, Field, Products, and Product .

Each element begins and ends with tags, such as <Company> and </Company>. Theelements may also have attributes. For example, industry is an attribute of theCompany element, and id is an attribute of the Product element.

To explain the hierarchical relationship between the elements, we sometimes referto parent and child elements. For example, the Products element is the child ofCompany and the parent of Product.

The particular system of elements and attributes is called an XML vocabulary. Thevocabulary can be customized for any application. In the example, we made up avocabulary that might be suitable for a commercial directory.

The vocabulary can be formalized in a syntax specification, which is called aschema. The schema might specify, for example, that Company and Name are requiredelements, that the other elements are optional, and that the value of the industryattribute must be a member of a predefined list. If an XML document conforms to arigorous schema definition, the document is said to be valid, in addition to beingwell-formed.

In order to make the XML document easier to read, we have indented the lines toillustrate how the elements are nested. The indentation and whitespace are notessential parts of the XML syntax. We could equally well have written a long,unbroken string such as the following, which does not contain any extrawhitespace:

<Company industry="metals"><Name>Ore Refining Inc.</Name><WebSite>http://www.ore_refining.com</WebSite><Field>iron and steel</Field><Products><Product id="1">cast iron</Product><Product id="2">stainlesssteel</Product></Products></Company>

The unbroken-string representation is exactly equivalent to the indentedrepresentation. In fact, a computer might store the XML as a string like this. Theindented representation is how XML is conventionally presented in a book or on acomputer screen because it is much easier to understand.

Page 8: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 1 1. Introducing ContentMaster

3

For More Information

You can get information about XML from many books, articles, or web sites. For anexcellent tutorial, see http://www.w3schools.com. To obtain copies of the XMLstandards, see http://www.w3.org.

How ContentMaster Works

The ContentMaster system has two main components:

ContentMaster StudioThe design and configuration environment of ContentMaster.

ContentMaster EngineThe data transformation engine.

ContentMaster Studio

ContentMaster Studio is a visual editor environment, where you can design andconfigure data transformations such as parsers, serializers, and mappers.

You use ContentMaster Studio to teach ContentMaster how to process data of aparticular type. If you are building a parser, you can use a select-and-clickapproach to identify the data fields in an example source document, and define howContentMaster should transform the fields to XML. This procedure is called parsingby example.

Note that we use the term document (as in example source document) in the broadestpossible sense. A document can contain text or binary data, and it can have anysize. It can be stored or accessed in a file, buffer, stream, URL, database, messagingsystem, or any other location.

ContentMaster Engine

ContentMaster Engine is a high-throughput data transformation engine. It has nouser interface. It works entirely in the background, executing the datatransformations that you have previously defined in ContentMaster Studio.

To move a data transformation from ContentMaster Studio to ContentMasterEngine, you must deploy the transformation as a ContentMaster service.

An integration application can submit requests to ContentMaster Engine in avariety of ways, for example, by calling the ContentMaster API or by transmittingto a CGI interface. A request specifies the data to be transformed and theContentMaster service that should perform the transformation. ContentMasterEngine executes the request and returns the output to the calling application.

Page 9: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 1 1. Introducing ContentMaster

4

Using ContentMaster in Integration Applications

The following paragraphs present some typical examples of how ContentMasterdata transformations are used in system integration applications.

As you perform the exercises in this book, you will get experience usingContentMaster for these types of data transformations. The chapter on Defining anHL7 Parser, for example, illustrates how to use ContentMaster to parse HL7messages. The chapters on Positional Parsing of a PDF Document and Parsing Wordand HTML Documents describe parsers that process various types of unstructureddocuments.

HL7 Integration

HL7 is the standard messaging protocol of the health industry. HL7 messages havea flexible structure, which supports optional and repetitive data fields. The fieldsare separated by a hierarchy of delimiter symbols.

In a typical integration application, a major health maintenance organization(HMO) uses ContentMaster to transform messages that are transmitted to andfrom its HL7-based information systems. ContentMaster is configured to processall types of HL7 messages.

Processing PDF Forms

The Adobe Acrobat (PDF) file format has become a standard for formatteddocument exchange. The format permits users to view fully formatteddocuments—including their original layout, fonts, graphics, etc.—on a widevariety of supported platforms. For information processing applications, however,the PDF format has been much less useful than for visual applications. This isbecause applications cannot access and analyze the unstructured, binary PDF files.

ContentMaster solves this problem by enabling conversion of PDF documents toan XML representation. For example, ContentMaster can convert invoices thatsuppliers send in PDF format to XML, for storage in a data-base.

Converting HTML Pages to XML

Information in HTML documents is usually presented in unstructured andunstandardized formats. The goal of the HTML presentation is visual display,rather than information processing.

ContentMaster has many features that can navigate, locate, and store informationfound in HTML documents. ContentMaster enables conversion of informationfrom HTML to a structured XML representation, making the informationaccessible to software applications. For example, retailers who present their stockon the web can convert the information to XML, letting them share the informationwith a clearing house or with other retailers.

Page 10: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 1 1. Introducing ContentMaster

5

How to Continue in this Book

This book is a tutorial learning guide for ContentMaster. As you read it, you willperform several exercises that teach how to use ContentMaster in real-life data-transformation scenarios.

When you finish the exercises, you will be familiar with the ContentMasterprocedures, and you will be able to apply ContentMaster to your own datatransformation needs.

Organization of the Chapters

Most of the chapters in this book present tutorial exercises, which teach you how tocreate data transformations that use parsers, serializers, or mappers. Werecommend that all users do the first and second exercises—Basic ParsingTechniques and Defining an HL7 Parser. You can then proceed through the otherexercises in sequence, or you can skim the chapters and skip to the ones that youneed.

The final chapter, Running ContentMaster Engine, explains how to deploy a datatransformation as a ContentMaster service, and it illustrates how to getContentMaster Engine to run the service. In that chapter, you will learn how tomove a ContentMaster project from the development stage to production.

Quick Reference to ContentMaster Concepts and Features

The following table is a guide to the ContentMaster concepts and features that areintroduced and explained in each chapter.

Concept Specific feature or component Chapter

System requirements 2. Installation

Setup procedure 2. Installation

Installation

Registration 2. Installation

Viewing IntelliScript 3. Basic Parsing Techniques

Editing IntelliScript 4. Defining an HL7 Parser

Multiple script files 7. Defining a Serializer

Color coding of anchors 3. Basic Parsing Techniques

Basic and advanced properties 5. Positional Parsing of a PDF Document

Working inContentMasterStudio

Global components 6. Parsing Word and HTML Documents

Importing 3. Basic Parsing TechniquesProjects

Creating 4. Defining an HL7 Parser

Page 11: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 1 1. Introducing ContentMaster

6

Concept Specific feature or component Chapter

Containing multiple parsers orserializers

7. Defining a Serializer

Project properties 7. Defining a Serializer

Determining the project folderlocation

7. Defining a Serializer

Example source documents 3. Basic Parsing Techniques

Creating 4. Defining an HL7 Parser

Running 3. Basic Parsing Techniques

Viewing results 3. Basic Parsing Techniques

Testing on example source 4. Defining an HL7 Parser

Testing on additional sourcedocuments

3. Basic Parsing Techniques6. Parsing Word and HTML Documents

Document processors 5. Positional Parsing of a PDF Document6. Parsing Word and HTML Documents

Parsers

Calling a secondary parser 7. Defining a Serializer

Text 3. Basic Parsing Techniques

Tab-delimited 3. Basic Parsing Techniques

HL7 4. Defining an HL7 Parser

Positional 5. Positional Parsing of a PDF Document

PDF 5. Positional Parsing of a PDF Document

Microsoft Word 6. Parsing Word and HTML Documents

Formats

HTML 6. Parsing Word and HTML Documents

Using an XSD schema 3. Basic Parsing Techniques

Adding a schema to a project 4. Defining an HL7 Parser

Creating and editing schemas 4. Defining an HL7 Parser

Using multiple schemas 8. Defining a Mapper

Data holders

Variables 6. Parsing Word and HTML Documents

Marker 3. Basic Parsing Techniques

Marker with count 6. Parsing Word and HTML Documents

Content 3. Basic Parsing Techniques

Content with positional offsets 5. Positional Parsing of a PDF Document

Anchors

Content with opening andclosing markers

6. Parsing Word and HTML Documents

Page 12: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 1 1. Introducing ContentMaster

7

Concept Specific feature or component Chapter

EnclosedGroup 6. Parsing Word and HTML Documents

RepeatingGroup 4. Defining an HL7 Parser

Search scope 5. Positional Parsing of a PDF Document

Nested RepeatingGroup 5. Positional Parsing of a PDF Document

Newlines as markers andseparators

5. Positional Parsing of a PDF Document

Default transformers 6. Parsing Word and HTML Documents

AddString 6. Parsing Word and HTML Documents

Transformers

Replace 6. Parsing Word and HTML Documents

SetValue 5. Positional Parsing of a PDF Document

CalculateValue 5. Positional Parsing of a PDF Document

Actions

Map 8. Defining a Mapper

Creating 7. Defining a Serializer

ContentSerializer 7. Defining a Serializer

RepeatingGroupSerializer 7. Defining a Serializer

EmbeddedSerializer 7. Defining a Serializer

Serializers andserializationanchors

Calling a secondary serializer 7. Defining a Serializer

Mappers andmapping

Creating 8. Defining a Mapper

Page 13: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 1 1. Introducing ContentMaster

8

Concept Specific feature or component Chapter

RepeatingGroupMapping 8. Defining a Mapper

Using color coding 4. Defining an HL7 Parser

Viewing events 3. Basic Parsing Techniques

Interpreting events 4. Defining an HL7 Parser

Testing and debuggingtechniques

4. Defining an HL7 Parser

Testing

Selecting which parser orserializer to run

7. Defining a Serializer

Deploying a ContentMasterservice

9. Running ContentMaster EngineRunning inContentMasterEngine API 9. Running ContentMaster Engine

Additional ContentMaster Documentation

The full ContentMaster documentation is available in an online help format. Youcan display the complete help library from the SAP ContentMaster folder on theStart menu or from the Help menu of ContentMaster Studio.

Page 14: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 1 2. Installation

9

Installation

Before you continue in this book, you should install the ContentMaster software (ifyou have not already done so).

For detailed information about the system requirements, installation, andregistration, please see the ContentMaster Administrator's Guide. The followingparagraphs contain brief instructions to help you get started.

System Requirements

To perform the exercises in this book, you should install ContentMaster on acomputer that meets the following minimum requirements:

Microsoft Windows 2000, XP Professional, or 2003 Server.

Microsoft Internet Explorer, version 6.0 or higher

Microsoft .NET Framework, version 1.1 or higher

At least 128 MB of RAM

Installing ContentMaster

To install ContentMaster, double-click the setup file and follow the on-screeninstructions. Be sure to install at least the following components:

EngineThe ContentMaster Engine component, which is needed for all exercises in thisbook.

Eclipse Development EnvironmentThe ContentMaster Studio for Eclipse design and configuration environment,which is needed for all exercises in this book.

Document ProcessorsOptional components, which are needed for the tutorials in chapters 5 and 6.

Default Installation Folder

By default, ContentMaster is installed in the following location:

2

Page 15: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 1 2. Installation

10

c:\Program Files\SAP\ContentMaster

The setup prompts you to change the location if desired.

Registration and Licensing

Depending on your ContentMaster version, the setup may prompt you to registerthe software and to install a license. Follow the on-screen instructions to completethe registration. For detailed information, see the ContentMaster Administrator'sGuide.

Tutorials and Workspace Folders

To do the exercises in this book, you need the tutorial files, which are located in thetutorials folder of your main ContentMaster program folder. By default, thelocation is:

c:\Program Files\SAP\ContentMaster\tutorials

As you perform the exercises, you will import or copy some of the contents of thisfolder to the ContentMaster Studio workspace folder. The default location of theworkspace is:

My Documents\SAP\ContentMaster\4.0\workspace

You should work on the copies in the workspace. We recommend that you do notmodify the originals in the Tutorials folder, in case you need them again.

Exercises and Solutions

The tutorials folder contains two subfolders:

tutorialsExercisesSolutions to Exercises

The Exercises folder contains the files that you need to do the exercises.Throughout this book, we will refer you to files in this folder.

As you perform the exercises, you will create ContentMaster projects that havenames such as Tutorial_1, Tutorial_2, etc. The projects will be stored in yourContentMaster Studio workspace folder.

The Solutions to Exercises folder contains our proposed solutions to theexercises. The solutions are projects having names such as TutorialSol_1, etc. Youcan import the projects to your workspace and compare our solutions with yours.Note that in many cases, there can be more than one correct solution.

Page 16: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 1 3. Basic Parsing Techniques

11

Basic Parsing Techniques

To help you start using ContentMaster quickly, we will give you a partiallyconfigured project containing a simple parser. The project is called Tutorial_1.

Working in ContentMaster Studio, you will edit and complete the configuration.You will then use the parser to convert a few sample text documents to XML.

The main purposes of this exercise are to:

Demonstrate the ContentMaster parsing-by-example approach

Start learning how to define and use a parser

Along the way, you will use some of the important ContentMaster Studio features,such as:

Importing and opening a project

Using an XSD schema, which defines the structure of the output XMLdocument

Defining the source document structure using marker anchors and contentanchors

Defining a parser based on an example source document

Using the parser to transform multiple source documents to XML

Viewing the parsing events

Opening ContentMaster Studio

To open ContentMaster Studio for Eclipse:

1. On the Start menu, click Programs > SAP ContentMaster > ContentMasterStudio for Eclipse.

2. Display the ContentMaster Studio Authoring perspective. The perspective is aset of windows, menus, and toolbars that you can use to edit ContentMasterprojects in Eclipse.

To do this, run the menu command Window > Open Perspective > Other >ContentMaster Studio Authoring.

3

Page 17: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 1 3. Basic Parsing Techniques

12

3. Optionally, run the command Window > Reset Perspective. This is a goodidea if you have previously worked in ContentMaster Studio, and you havemoved or resized the windows. The command restores all the windows totheir default sizes and locations.

4. Optionally, run the command Help > Welcome, and select the ContentMasterStudio welcome page. The page displays introductory instructions on how touse ContentMaster Studio.

You can close the welcome page by clicking the X at the right side of the editortab.

Page 18: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 1 3. Basic Parsing Techniques

13

Importing the Tutorial_1 Project

The partially configured project is supplied with the ContentMaster software, butyou cannot open it directly in ContentMaster Studio for Eclipse. Instead, you mustimport the project into the Eclipse workspace. To do this:

1. On the menu, run the File > Import command. At the prompt, choose theoption to import an Existing ContentMaster Project into Workspace.

2. Click the Next button, and browse to the following file (substitute your path tothe ContentMaster installation folder):

c:\Program Files\SAP\ContentMaster\tutorials\Exercises\Tutorial_1\Tutorial_1.cmw

3. On the remaining wizard pages, you can accept the defaults. Click Finish tocomplete the import.

In the Eclipse workspace folder, you should find the imported Tutorial_1folder:

My Documents\SAP\ContentMaster\4.0\workspace\Tutorial_1

4. In the upper right of the Eclipse window, the ContentMaster Explorer displaysthe Tutorial_1 files, which you have imported.

Page 19: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 1 3. Basic Parsing Techniques

14

The following is a brief description of the folders and files:

ExamplesThis folder contains an example source document, which is a sample inputdocument that you will use to configure the parser.

ScriptsA TGP script file, which stores the parser configuration.

XSDAn XSD schema file, which defines the XML structure that the parser willcreate.

ResultsThis folder is temporarily empty. When you configure and run the parser,ContentMaster will store its output in this folder.

Please note that most of these folders are virtual. They are used to categorizethe files in the display, but they don't actually exist on your disk. Only theResults folder is a physical directory on your disk, but it doesn't yet existbecause you haven't yet generated any output.

If you examine the Tutorial_1 folder in the Windows Explorer, you will find afew other files, which are not displayed in the ContentMaster Explorer, forexample:

Tutorial_1.cmwThe main project file, containing the project configuration properties.

.projectA file generated by the Eclipse development environment. This is not aContentMaster file, but you need it to open the ContentMaster project inEclipse.

A Brief Look at the ContentMaster Studio Window

ContentMaster Studio displays numerous windows. The windows are of twotypes, called views and editors. A view displays data about a project or lets you

Page 20: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 1 3. Basic Parsing Techniques

15

perform specific operations. An editor lets you edit the configuration of a projectfreely.

The following paragraphs describe the views and editors, starting from the upperleft and moving counterclockwise around the screen.

Upper LeftContentMaster Explorer view

Displays the projects and files in the ContentMaster Studio workspace. Byright-clicking or double-clicking in this view, you can add existing files to aproject, create new files, or open files for editing.

Lower LeftThe lower left corner of the ContentMaster window displays two, stacked views.You can switch between them by clicking the tabs on the bottom.

Component viewDisplays the main components that are defined in a project, such as parsers,serializers, mappers, transformers, and variables. By right-clicking or double-clicking, you can open a component for editing.

IntelliScript Assistant viewThe view helps you configure certain components in the IntelliScriptconfiguration of a data transformation. For an explanation, see the descriptionof the IntelliScript editor below.

Lower Right

The lower right displays several views, which you can select by clicking the tabs.

Page 21: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 1 3. Basic Parsing Techniques

16

Tasks viewDisplays syntax errors in a data transformation or in a schema. You can alsouse this view to record any tasks that you need to perform (a ToDo list).

Help viewDisplays help as you work in an IntelliScript editor. When you select an itemin the editor, the help scrolls automatically to an appropriate topic.

Note: You can also display the ContentMaster help from the Help menu ofContentMaster Studio or from the SAP ContentMaster folder on the Startmenu. This lets you read the complete ContentMaster documentation, asopposed to the single-topic display in the Help view.

Events viewDisplays events that occur as you run a data transformation. You can use theevents to confirm that a transformation is running correctly, or to diagnoseproblems.

Binary Source viewDisplays the binary codes of the example source document. This is useful ifyou are parsing binary input, or if you need to view special characters such asnewlines and tabs.

Schema viewDisplays the XSD schemas associated with a project. The schemas define theXML structures that a data transformation can process.

Repository viewDisplays the ContentMaster services that are deployed for running inContentMaster Engine.

Upper RightContentMaster Studio displays editor windows on the upper right. You can openmultiple editors, and switch between them by clicking the tabs.

IntelliScript editorUsed to configure a data transformation. This is where you will perform mostof the work as you do the exercises in this book.

The left pane of the IntelliScript editor is called the IntelliScript pane. This iswhere you define the data transformation. The IntelliScript has a treestructure, which defines the sequence of ContentMaster components thatperform the transformation.

The right pane is called the example pane. It displays the example sourcedocument of a parser. You can use this pane to configure a parser by a select-and-click procedure.

Page 22: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 1 3. Basic Parsing Techniques

17

IntelliScript pane on left, example pane on right

Defining the Structure of a Source Document

You are ready to start defining the structure of the source document. You will usethe parsing-by-example approach to do this.

1. In the ContentMaster Explorer, expand the Tutorial_1 files, and double-click the file Tutorial_11.tgp . This opens the file in an IntelliScript editor.

2. The IntelliScript displays a Parser component, which is called MyFirstParser.Expand the IntelliScript tree, and examine the properties of the Parser. Theyinclude the following values, which we have configured for you:

example_sourceThe example source document, which you will use to configure the parser.We have selected a file called File1.txt as the example source. The file isstored in the project folder.

formatWe have specified that the example source has a TextFormat, and that it isTabDelimited. This means that the text fields are separated from each otherby tab characters.

3. The example source, File1.txt, should be displayed in the right pane of theeditor. If it is not, right-click MyFirstParser in the left pane, and choose theoption to Open Example Source.

Page 23: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 1 3. Basic Parsing Techniques

18

Note: You can toggle the display of the left and right panes. To do this, openthe IntelliScript menu and select IntelliScript, Example, or Both. There are alsotoolbar buttons for these options.

4. Examine the example source more closely. It contains two kinds ofinformation.

The left-hand entries, such as First Name:, Last Name:, and Id:, are markeranchors, which indicate the locations of the data fields. The right-hand entries,such as Ron, Lehrer, and 547329876, are the content anchors , which are thevalues of the data fields. The marker and content anchors are separated by tabcharacters, which are called tab delimiters.

"First Name:" is a marker anchor. "Ron" is a content anchor. The anchors areseparated by tab delimiters.

To make your job easier, we have already configured the parser with the basicproperties, such as the TabDelimited format. You need to complete theconfiguration by defining the anchors. In the following steps, you willconfigure the parser to search for each marker anchor and retrieve the datafrom the content anchor that follows it. The parser will then insert the data inthe XML structure.

As you continue in this book, you will learn about many other types ofanchors and delimiters. Anchors are the places where the parser hooks into asource document and performs an operation. Delimiters are characters thatseparate the anchors. ContentMaster lets you define anchors and delimiters inmany different ways.

Page 24: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 1 3. Basic Parsing Techniques

19

5. You are now ready to start parsing by example. In the example pane, drag themouse over the text First Name:, which is the first marker anchor.

6. Right-click the text that you have just selected, and choose Insert Marker fromthe pop-up menu.

7. Your last action caused a Marker anchor to be inserted in the IntelliScript. Youare now prompted for the first property that requires your input, which issearch. This property lets you specify the type of search, and the defaultsetting is TextSearch. Since this is the desired setting, press Enter.

Page 25: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 1 3. Basic Parsing Techniques

20

8. You are now prompted for the next property that needs your input, which isthe text of the anchor that the parser searches for. The default is the text thatyou selected in the example source. Again, press Enter.

Note: In the images displayed here, the selected property is white, and thebackground is gray. On your screen, the background may be white. You cancontrol this behavior by choosing the Windows > Preferences command. Onthe ContentMaster page of the preferences, select or deselect the option toHighlight Focused Instance.

9. In the IntelliScript, ContentMaster Studio displays the new marker anchor aspart of the MyFirstParser definition. In the example source, ContentMasterStudio highlights the marker in yellow.

Note: If the color coding is not immediately displayed, open the IntelliScriptmenu, and confirm that the option to Learn the Example Automatically ischecked.

Page 26: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 1 3. Basic Parsing Techniques

21

10. Now, you are ready to create the first content anchor. In the example source,select the word Ron . Right-click the selected text, and choose Insert Contentfrom the pop-up menu.

11. This inserts a Content anchor in the IntelliScript, and prompts you for the firstproperty that requires your input, which is value. This property lets youspecify how the parsing will be done.

The default is LearnByExample, which means that the parser should find thecontent based on the delimiters surrounding it in the example source. This isthe desired behavior, so you should click Enter.

12. You can accept the defaults for the next few properties (example,opening_marker, and closing-marker), or you can skip them by clickingelsewhere with the mouse.

13. Your next task is to specify where the Content anchor should store the datathat it extracts from the source document.

The output location is called a data holder. This is a generic term which means"an XML element, an XML attribute, or a variable". In this exercise, you willuse data holders that are elements or attributes. In a later exercise, you will usevariables.

Page 27: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 1 3. Basic Parsing Techniques

22

To define the data holder, select the data_holder property, and double-click orpress Enter. This opens a Schema view, which displays the XML elements andattributes that are defined in the XSD schema of the project (Person.xsd).

Expand the no target namespace node, and select the First element. Thismeans that the Content anchor should store its output in an XML elementcalled First, like this:

<First>Ron</First>

More precisely, the First element is nested inside Name , which is nested insidePerson. The actual output structure generated by the parser will be:

<Person><Name>

<First>Ron</First></Name>

</Person>

Selecting a data holder

14. When you click the OK button, the data_holder property is assigned the value/Person/*s/Name/*s/First. In XML terminology, this is called an XPathexpression. It is a representation of the XML structure that we have outlinedabove.

(Actually, the standard XPath expression is /Person/Name/First. The *ssymbols are ContentMaster extensions of the XPath syntax, which you canignore.)

Page 28: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 1 3. Basic Parsing Techniques

23

ContentMaster Studio highlights the Content anchor in the example source.The Marker and Content anchors are in different colors, to help you distinguishthem.

15. Choose the File > Save command, or click the Save button on the toolbar. Thisensures that your editing is saved in the project folder.

16. So far, you have told the parser to search for a First Name: marker, retrievethe Ron content field that follows it, and store the content in the First elementof the XSD schema.

You should now proceed through the source document, defining the othermarker and content anchors in the same way. The following table lists theanchors that you need to define.

Anchor Anchor type Data holder

Last Name: Marker

Lehrer Content /Person/*s/Name/*s/Last

Id: Marker

547329876 Content /Person/*s/Id

Age: Marker

27 Content /Person/*s/Age

Gender: Marker

M Content /Person/@gender

Page 29: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 1 3. Basic Parsing Techniques

24

Be sure to define the anchors in the above sequence, which is the sequence inthe source document. This is important because you need to establish thecorrect relationship between each marker and the corresponding content. Ifyou make a mistake in the sequence, ContentMaster may fail to find the text.

When you have finished, the ContentMaster Studio window should look likethe following illustration. Of course, you may need to expand or collapse thetree to view the illustrated lines.

17. Save the project again.

Correcting Errors in the Parser Configuration

As you define anchors, you may occasionally make a mistake such as selecting thewrong text, or setting the wrong property values for an anchor.

If you make a mistake, you can correct it in several ways:

On the menu, you can run the Edit > Undo command.

In the IntelliScript pane, you can select an anchor (or any other componentthat you have added to the configuration) and press the Delete key.

In the IntelliScript pane, you can right-click an anchor. From the pop-up menu,click Delete.

As you gain more experience working in the IntelliScript pane, you can use thefollowing additional techniques:

If you create an anchor in the wrong sequence, you can drag it to the correctlocation in the IntelliScript.

If you forget to define an anchor, right-click the anchor following the omittedanchor location and click Insert.

You can copy and paste components such as anchors in the IntelliScript.

Page 30: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 1 3. Basic Parsing Techniques

25

You can edit the property values in the IntelliScript.

Techniques for Defining Anchors

In the above steps, we told you to select an anchor, right-click, and choose theInsert Marker or Insert Content command. This inserts an anchor in theIntelliScript, and allows you to assign its properties.

There are several alternative ways to define anchors:

You can define a content anchor by dragging text from the example source to adata holder in the Schema view. This inserts a Content anchor in theIntelliScript, where the data_holder property is already assigned.

You can define anchors by typing in the IntelliScript pane, without using theexample source.

You can edit the properties of Marker and Content anchors in the IntelliScriptAssistant view.

We encourage you to experiment with these features. For more information, seethe book ContentMaster Studio in Eclipse.

Tab-Delimited Format

Do you remember that MyFirstParser is defined with a TabDelimited format? Thedelimiters define how the parser interprets the example source. In this case, theparser understands that the marker and content anchors are separated by tabcharacters.

In the instructions, we suggested that you select the markers including the coloncharacter, for example:

First Name:

Because the tab-delimited format is selected, the colon isn't actually important inthis parser. If you had selected First Name (without the colon), the parser wouldstill find the marker and the tab following the marker, and it would read thecontent correctly. It would ignore other characters, such as the colon.

The tab-delimited format also explains why you can select a short content anchorsuch as Ron, and not be concerned about the field size. In another source document,a person may have a long first name such as Rumpelstiltskin. By default, a tab-delimited parser reads the entire string after the tab, up to the line break (unlessthe line contains another anchor or tab character).

Page 31: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 1 3. Basic Parsing Techniques

26

Running the Parser

You are now ready to test the parser that you have configured.

1. In the IntelliScript, right-click the Parser component, which is namedMyFirstParser. Choose the option to Set as Startup Component.

This means that when you run the project, ContentMaster will activateMyFirstParser.

2. On the menu, choose Run > Run MyFirstParser.

3. After a moment, ContentMaster Studio displays the Events view. The eventsinclude, for example, all the Marker and Content anchors that the parser foundin the example source.

You can use the Events view to examine any errors encountered duringexecution. Assuming that you followed the instructions carefully, theexecution should be error-free.

4. Now you can examine the output of the parser process. In the ContentMasterExplorer, expand the Tutorial_1 files/Results node, and double-click thefile output.xml.

ContentMaster Studio displays the XML file in an Internet Explorer window.Examine the output carefully, and confirm that the results are correct. If theresults are incorrect, examine the parser configuration that you created, correctany mistakes, and try again.

Page 32: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 1 3. Basic Parsing Techniques

27

If you are using Windows XP SP2 or higher, Internet Explorer may display awarning about active content in the XML results file. You can safely ignore thewarning.

Don't worry if the XML processing instruction (<?xml version="1.0" ...?>) ismissing or different from the one displayed above. This is due to the settingsin the project properties (the Project > Properties command on the menu). Youcan ignore this issue for now.

Running the Parser on Additional Source Documents

To further test the parser, you can run it on additional source documents, otherthan the example source.

1. At the right of the Parser component, click the >> symbol. This displays theadvanced properties of the component.

The >> symbol changes to <<. You can close the advanced properties byclicking the << symbol.

Displaying the advanced properties

2. Select the sources_to_extract property, and double-click or press Enter. Fromthe drop-down list, select LocalFile .

3. Expand the LocalFile node of the IntelliScript, and assign the file_nameproperty. ContentMaster Studio displays an Open dialog, where you canbrowse to the file.

Page 33: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 1 3. Basic Parsing Techniques

28

You can use one of the test files that we have supplied, which are in the folder

tutorials\Exercises\Tutorial_1\Additional source files

For example, select File2.txt, which has the following content:

First Name: DanLast Name: SmithId: 45209875Age: 13Gender: M

4. Run the parser again. You should get the same XML structure as above,containing the data that the parser found in the test file.

Points to Remember

A ContentMaster project contains the parser configuration and the XSD schema. Itusually also contains the example source document, which the parser uses to learn thedocument structure, and other files such as the parsing output.

The ContentMaster Explorer displays the projects that exist in your ContentMasterStudio workspace. To copy an existing project into the workspace, use the File >Import command.

Page 34: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 1 3. Basic Parsing Techniques

29

To open a data transformation for editing in the IntelliScript editor, double-click itsTGP script file in the ContentMaster Explorer. In the IntelliScript, you can addcomponents such as anchors . The anchors define locations in the source document,which the parser seeks and processes. Marker anchors label the data fields, andcontent anchors extract the field values.

To define a marker or content anchor, select the source text, right-click, and choosethe anchor type. This inserts the anchor in the IntelliScript , where you can set itsproperties. For example, you should set the data holder—an XML element orattribute—where a content anchor stores its output.

Use the color coding to review the anchor definitions.

The delimiters define the relation between the anchors. A tab-delimited format meansthat the anchors are separated by tabs.

To test a parser, set it as the startup component. Then use the Run command on theContentMaster Studio menu. To view the results file, double-click its name in theContentMaster Explorer.

What's Next?

Congratulations! You have configured and run your first ContentMaster parser.

Of course, the source documents that you parsed had a very simple structure. Thedocuments contained a few markers, each of which was followed by a tabcharacter and by content.

Moreover, all the documents had exactly the same markers in the same sequence.This made the parsing easy because you did not need to consider the possiblevariations among the source documents.

In real-life uses of parsers, very few source documents have such a simple, rigidstructure. In the succeeding chapters, you will learn how to parse complex andflexible document structures using a variety of parsing techniques. All thetechniques are based, however, on the simple steps that you learned in thischapter.

Page 35: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 1 4. Defining an HL7 Parser

30

Defining an HL7 Parser

In this chapter, you will parse an HL7 message. HL7 is a standard messagingformat used in medical information systems. The structure is characterized by ahierarchy of delimiters and by repeating elements. After you learn the techniquesfor processing these features, you will be able to parse a large variety of documentsthat are used in real applications.

As in the previous chapter, we will give you a fully configured XML schema. Youwill need to configure the parser yourself. You will learn techniques such as:

Creating a project

Creating a parser

Parsing a document selectively, that is, retrieving selected data and ignoringthe rest

Defining a repeating group

Using delimiters to define the source document structure

Testing and debugging a parser

Requirements Analysis

Before you start the exercise, let's analyze the input and output requirements of theproject. As you design the parser, you will use this information to guide theconfiguration.

HL7 Background

HL7 is a standard messaging and transaction protocol for the health servicesindustry. It is used world-wide in hospital and medical information systems.

For detailed information about HL7, see the Health Level 7 web site,http://www.hl7.org.

Input HL7 Message Structure

The following lines illustrate a typical HL7 message, which you will use as thesource document for parsing.

4

Page 36: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 1 4. Defining an HL7 Parser

31

MSH|^~\&|LAB||CDB||||ORU^R01|K172|PPID|||PATID1234^5^M11||Jones^William||19610613|MOBR||||80004^ElectrolytesOBX|1|ST|84295^Na||150|mmol/l|136-148|Above high normal|||Final resultsOBX|2|ST|84132^K+||4.5|mmol/l|3.5-5|Normal|||Final resultsOBX|3|ST|82435^Cl||102|mmol/l|94-105|Normal|||Final resultsOBX|4|ST|82374^CO2||27|mmol/l|24-31|Normal|||Final results

The message is composed of segments, which are separated by carriage returns.Each segment has a three-character label, such as MSH (message header) or PID(patient identification). Each segment contains a predefined hierarchy of fields andsub-fields, which are delimited by the characters immediately following the MSHdesignator (|^~\&).

For example, the patient's name (Jones^William) follows the PID label by five |delimiters. The last and first names (Jones and William) are separated by a ^delimiter.

The message type is specified by a field in the MSH segment. In the above example,the message type is ORU, subtype R01, which means Unsolicited Transmission of anObservation Message. The OBR segment specifies the type of observation, and theOBX segments list the observation results.

In this chapter, you will configure a parser that processes ORU messages such as theabove example. Some key issues in the parser definition are how to define thedelimiters and how to process the repeating OBX group.

Output XML Structure

The purpose of this exercise is to create a parser, which will convert the above HL7message to the following XML output:

<Message type="..." id="..."><Patient id="..." gender="...">

<f_name>...</f_name><l_name>...</l_name><birth_date>...</birth_date>

</Patient><Test_Type test_id="...">...</Test_Type><Result num="...">

<type>...</type><value>...</value><range>...</range><comment>...</comment><status>...</status>

</Result><Result num="...">

<type>...</type>

Page 37: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 1 4. Defining an HL7 Parser

32

<value>...</value><range>...</range><comment>...</comment><status>...</status>

</Result><!-- Additional Result elements as needed -->

</Message>

The XML has elements that can store many—but not all—of the data in the sampleHL7 message. That is acceptable. In this exercise, you will build a parser thatprocesses the data in the source document selectively, retrieving the informationthat it needs and ignoring the rest. The XML structure contains the elements thatare required for retrieval.

Notice the repeating Result element. This element will store data from therepeating OBX segment of the HL7 message.

Creating a Project

First, let's create a project where ContentMaster Studio can store your work.

1. On the ContentMaster Studio menu, click File > New > Project. This opens awizard where you can select the type of project.

2. In the left pane, choose a ContentMaster project. In the right pane, chooseParser Project.

Page 38: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 1 4. Defining an HL7 Parser

33

3. On the next page of the wizard, enter a project name, such as Tutorial_2 .ContentMaster creates a folder and a *.cmw configuration file having thisname.

4. On the following wizard page, enter a name for the Parser component. Sincethis parser will parse HL7 message of type ORU, let's call it HL7_ORU_Parser.

5. On the next page, enter a name for the TGP script file that the wizard creates.A convenient name is Script_Tutorial_2.

6. On the next page, you should select an XSD schema, which defines the XMLstructure where the parser will store its output. We have provided a schemafor you. In the ContentMaster installation folder, the location is:

tutorials\Exercises\Files_For_Tutorial_2\HL7_tutorial.xsd

Browse to this file and click Open. ContentMaster Studio copies the schema tothe project folder.

7. On the following page, specify the example source type. In this exercise, it is afile, which we have provided. Therefore, you should select File.

Page 39: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 1 4. Defining an HL7 Parser

34

8. The next page prompts you to browse to the example source file. The locationis:

tutorials\Exercises\Files_For_Tutorial_2\hl7-obs.txt

ContentMaster Studio copies the file to the project folder.

9. On the next page, select the encoding of the source document. In this exercise,the encoding is ASCII, which is the default.

10. You can skip the document preprocessors page. You do not need a documentpreprocessor in this project.

11. Select the format of the source document. In this project, the format is HL7.

This means that the parser should assume that the message fields areseparated by the HL7 delimiter hierarchy:

newline|Other symbols such as ^ and tab

For example, the parser might learn that a particular content anchor starts at acount of three | delimiters after a marker, and ends at the fourth | delimiter.

12. Review the summary page and click Finish.

Page 40: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 1 4. Defining an HL7 Parser

35

13. ContentMaster creates the new project. It displays the project in theContentMaster Explorer. It opens the script that you have created,Script_Tutorial_2.tgp, in an IntelliScript editor.

In the IntelliScript, notice that the example_source and format properties havebeen assigned according to your specifications.

The example source should be displayed. If it is not, right-click the Parsercomponent and choose Open Example Source.

Using XSD Schemas in ContentMaster

ContentMaster data transformations require XSD schemas to define the structureof XML documents. Every ContentMaster parser, serializer, or mapper projectrequires at least one schema.

When you perform the tutorial exercises in this book, we provide the schemas thatyou need. For your own applications, you may already have the schemas, or youmay need to create new schemas.

Learning XSD

For an excellent introduction to the XSD schema syntax, see the tutorial on theW3Schools web site, http://www.w3schools.com. For definitive reference information,see the XML Schema standard at http://www.w3.org.

Editing SchemasYou can use any XSD editor to create and edit the schemas that you use withContentMaster. For information, see the chapter on Data Holders in theContentMaster Studio User's Guide.

Page 41: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 1 4. Defining an HL7 Parser

36

You can configure ContentMaster Explorer to open a schema in the XSD editor ofyour choice. For instructions, see ContentMaster Studio Preferences in the bookContentMaster Studio in Eclipse.

Defining the Anchors

Now, it is time to start parsing by example.

1. You need to define marker anchors that identify the locations of fields in thesource document, and content anchors that identify the field values.

Let's start with the non-repeating portions of the document, which are the firstthree lines. The most convenient markers are the segment labels, MSH, PID, andOBR. These labels identify portions of the document that have a well-definedstructure.

Within each segment, you need to identify the data fields to retrieve. These arethe content anchors. There are several content anchors for each marker.

In addition, you need to define the data holders for each content anchor. Thedata holders are elements or attributes in the XML output, which is illustratedabove.

If you study the HL7 message and the XML structure that should beproduced, you can probably figure out which message fields need to bedefined as anchors, and you can map the fields to their corresponding dataholders. To save time, we have done this job for you. We present the results inthe following table:

Anchor Anchor type Data holder

MSH Marker

ORU Content /Message/@type

K172 Content /Message/@id

PID Marker

PATID1234^5^M11 Content /Message/*s/Patient/@id

Jones Content /Message/*s/Patient/*s/l_name

William Content /Message/*s/Patient/*s/f_name

19610613 Content /Message/*s/Patient/*s/birth_date

M Content /Message/*s/Patient/@gender

OBR Marker

80004 Content /Message/*s/Test_Type/@test_id

Electrolytes Content /Message/*s/Test_Type

Page 42: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 1 4. Defining an HL7 Parser

37

Note the @ symbol in some of the XPath expressions, such as /Message/@type .The symbol means that type is an attribute, as opposed to an element.

Go ahead and create the anchors in the parser definition, as you did in thepreceding chapter. When you finish, the IntelliScript editor should appear asin the following illustration.

2. Now, you need to teach the parser how to process the OBX lines of the sourcedocument. There are several OBX lines, each having the same format. Theparser should create output for each OBX line that it finds.

To do this, you will define an anchor called a RepeatingGroup. The anchor tellsContentMaster to search for a repeated segment. Inside the RepeatingGroup,you will nest several Content anchors, which tell the parser how to parse eachiteration of the segment.

In the IntelliScript pane, find the last anchor that you defined (Electrolytes).Immediately below the anchor, there is an empty node containing three dots(...). That is where you should define the new anchor.

Page 43: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 1 4. Defining an HL7 Parser

38

Define the RepeatingGroup at the ... symbol in the anchor list

Select the three dots and press the Enter key. This opens a drop-down list,which displays the names of the available anchors.

In the list, select the RepeatingGroup anchor. Alternatively, you can type thetext RepeatingGroup in the box. After you type the first few letters,ContentMaster Studio auto-completes the text.

Press the Enter key again to accept the new entry.

3. Now, you must configure the RepeatingGroup so it can identify the repeatingsegments. You will do this by assigning the separator property. You willspecify that the segments are separated from each other by a marker, which isthe text OBX.

In the IntelliScript pane, expand the RepeatingGroup. Find the line that definesthe separator property.

Page 44: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 1 4. Defining an HL7 Parser

39

By default, the separator value is empty, which is symbolized by a ...symbol. Select the ... symbol, press Enter, and change the value to Marker.Press Enter again to accept the new value. The Marker value means that therepeating elements are separated by a marker anchor.

Expand the Marker property, and find its text property. Select the value(which is "" by default) and press Enter. Type the value OBX , and press Enteragain. This means that the separator is the marker anchor OBX.

In the example pane, ContentMaster Studio highlights all the OBX anchors,signifying that it found them correctly.

4. Now, you will insert the Content anchors, which parse an individual OBX line.

To do this, keep the RepeatingGroup selected. You must nest the Contentanchors within the RepeatingGroup.

You should define the anchors only on the first OBX line. Because the anchorsare nested in a RepeatingGroup, the parser knows that it should look for thesame anchors in additional OBX lines.

Define the Content anchors as follows:

Anchor Anchor type Data holder

1 Content /Message/*s/Result/@num

Na Content /Message/*s/Result/*s/type

150 Content /Message/*s/Result/*s/value

136-148 Content /Message/*s/Result/*s/range

Above high normal Content /Message/*s/Result/*s/comment

Final results Content /Message/*s/Result/*s/status

Page 45: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 1 4. Defining an HL7 Parser

40

Nesting anchors with a RepeatingGroup

When you finish, the markup in the example pane should look like this:

Editing the IntelliScript

The procedure illustrated above is the general way to edit the IntelliScript:

1. Select the desired location for editing.

2. Press Enter. In most locations, you can also double-click instead of pressingEnter.

3. Choose or type a value.

4. Press Enter again to accept the edited value.

Testing the Parser

Let's test the parser and see if it works correctly. There are several ways to do this:

You can view the color coding in the example source. This tests the basicanchor configuration.

You can run the parser, confirm that the events are error-free, and view theXML output. This tests the parser operation on the example source.

Page 46: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 1 4. Defining an HL7 Parser

41

You can run the parser on additional source documents. This confirms that theparser can process any variations of the source structure that may occur in thedocuments.

In this exercise, you will use the first two methods to test the parser. We won't takethe time for the third test, although it is easy enough to do (see Running the Parseron Additional Source Documents in Chapter 3, Basic Parsing Techniques).

1. On the menu, choose IntelliScript > Mark Example. Alternatively, you canclick the button near the right end of the toolbar, which is labeled mark theentire example according to the current script.

Notice that the color-coding is extended throughout the example source.Previously, only the anchors that you defined in the first line of the repeatinggroup were highlighted. When you ran the Mark Example command,ContentMaster ran the parser and found the anchors in the other lines.

Confirm that the marking is as you expect. For example, check that the testvalue, range, and comment are correctly identified in each OBX line.

If the marking is not correct (or if there is no marking), there is a mistake in theparser configuration. Review the instructions, correct the mistake, and tryagain.

2. As an experiment, let's see what would happen if you made a deliberate errorin the configuration.

Do you remember the HL7 delimiters option, which you set in the New Parserwizard (see Creating a Project above)? Let's change the option to another value:

a. Save your work (in case you make some kind of serious error during thisexperiment).

b. Edit the delimiters property (it's located under Parser/format).c. Change the value from HL7 to Positional. This means that the content

anchors are located at fixed positions (numbers of characters) after themarkers.

d. Try the Mark Example command again. The anchors are incorrectlyidentified. For example, in the second OBX line, the comment is reported asrmal|||Final resu instead of Normal. The problem occurs because theparser thinks that the anchors are located at fixed positions on the line.

Page 47: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 1 4. Defining an HL7 Parser

42

e. Change the delimiters property back to HL7. Confirm that Mark Examplenow works correctly. (Alternatively, click Edit > Undo to restore theprevious configuration.)

f. Save your work again.

3. Now, it's time to run the parser. Right-click the Parser component and set it asthe startup component. Then use the Run > Run command to run it.

4. After a few moments, the Events view is displayed.

Notice that most of the events are labeled with the icon, which symbolizesan information event. This is normal when the parser contains no mistakes.

If you search the event tree, you can find an event that is labeled with aicon, meaning an optional failure. The event is located in the tree underExecution/RepeatingGroup, and it is labeled Separator before 5. This meansthat the RepeatingGroup failed to find a fifth iteration of the OBX separator. Thisis expected because the example source contains only four iterations. Thefailure is called optional because it is permitted for the separator to be missingat the end of the iterations.

Nested within the optional failure event, you can find a icon, which means afailure. The event means that ContentMaster failed to find the Marker anchor,which defines the OBX separator. Because the failure is nested within anoptional failure, it is not a cause for concern.

In general, however, you should pay attention to a failure event and makesure you understand what caused it. Often, a failure indicates a problem in theparser.

In addition to failure events, you should pay attention to (warning) eventsand to (fatal error) events. Warnings are less severe than failures. Fatal errorsare the most severe; they stop the data transformation from running.

Page 48: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 1 4. Defining an HL7 Parser

43

5. In the right-hand pane of the Events view, try double-clicking one of theMarker or Content events.

When you do this, ContentMaster highlights the anchor that caused the eventin the IntelliScript and example panes. This is a good way to find the source offailure or error events.

6. In the ContentMaster Explorer, double-click the output.xml file, which islocated under the Tutorial_2 files/Results node. You should see thefollowing XML:

Page 49: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 1 4. Defining an HL7 Parser

44

Points to Remember

To create a new project, run the File > New > Project command. This displays awizard, where you can set options such as:

The parser name

The XSD schema for the output XML

The example source document, such as a file or a URL

The source format, such as text or binary

The delimiters that separate the data fields

After you create the project, edit the IntelliScript and add the anchors, such as Markerand Content for simple data structures, or RepeatingGroup for repetitive structures.

To edit the IntelliScript, use the Select-Enter-Assign-Enter approach. That is, selectthe location that you want to edit. Press the Enter key. Assign the property value,and press Enter again.

Test the parser by using the IntelliScript > Mark Example command to color-codethe anchors.

Run the parser, and check the events for warnings, failures, or fatal errors. View theresults file, which contains the output XML.

Page 50: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 1 Index

45

Index of Volume 1

A

AcrobatContentMaster transformations, 4

anchorsContent, 18defining, 25Marker, 18RepeatingGroup, 37

APIof ContentMaster Engine, 3

attributesXML, 2

B

Binary Source view, 16

C

child elementsXML, 2

CMW files, 14Component view, 15components

ContentMaster tutorial, 5Content

anchor, 18ContentMaster

overv iew, 1ContentMaster Engine, 3ContentMaster Explorer, 15ContentMaster services, 3ContentMaster Studio, 3

opening, 11

D

data holdermeaning of, 21

debuggingparser, 40

delimiterstab, 18

documentsdefined, 3

E

editorsContentMaster Studio, 15

elementsXML, 2

error events, 42errors

correcting in parser configuration, 24event log

interpreting, 42Events view, 16example pane, 16example source document, 14

F

failure events, 42fatal error events, 42features

ContentMaster tutorial, 5format

tab-delimited, 25

H

hello, worldContentMaster tutorial, 11

helpdisplaying, 8

Help view, 16HL7

background, 30ContentMaster transformations, 4message structure, 30parsing tutorial, 30

HTML documentsContentMaster transformations, 4

Page 51: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 1 Index

46

I

information events, 42installation

ContentMaster, 9IntelliScript

correcting errors in, 24IntelliScript Assistant, 15IntelliScript editor, 16

opening, 17IntelliScript pane, 16

L

licensingContentMaster, 10

M

mappersdefinition, 1

Markeranchor, 18

P

parent elementsXML, 2

parserrunning in ContentMaster Studio, 26

parsersdefinition, 1testing and debugging, 40

parsingbeginner's tutorial, 11HL7 tutorial, 30

parsing by example, 1, 3PDF files

ContentMaster transformations, 4perspective

ContentMaster Studio Authoring, 11resetting, 12

projectscreating, 32importing, 13

Q

quick referenceContentMaster tutorials, 5

R

registrationContentMaster, 10

RepeatingGroupanchor, 37

Repository view, 16result files, 14

S

schemaXML, 2

schema files, 14Schema view, 16schemas

XSD, 35script files

TGP, 14separator

of repeating group, 38serializers

definition, 1services

ContentMaster, 3solutions

to Getting Started exercises, 10source documents

testing additional, 27startup component

setting, 26system requirements

ContentMaster, 9

T

tab delimiters, 18tab-delimited format, 25Tasks view, 16testing

parser, 40text files

parsing tutorial, 11TGP script files, 14Tutorial_1

basic parsing techniques, 11Tutorial_2

defining a parser, 30tutorials

basic parsing techniques, 11HL7 parser, 30

Tutorials foldercopying, 10

V

valid XML, 2views

ContentMaster Studio, 15vocabulary

XML, 2

Page 52: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 1 Index

47

W

warning events, 42web pages

ContentMaster transformations, 4welcome page

displaying in Eclipse, 12well-formed XML, 2whitespace

in XML, 2windows

ContentMaster Studio, 15

X

XMLoverview, 1standards, 3tutorial, 3

XPathContentMaster extensions, 22defining anchor mappings, 22

XSDreference information, 35

XSD schemas, 35

Page 53: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

SAP Conversion Agent byItemfield (ContentMaster)

Getting Started withContentMaster

Volume 2: Advanced Exercises

Version 4.0

This product has been renamed as SAP Conversion Agent by Itemfield. Future editions ofthis manual will reflect the new name, which replaces the name ContentMaster.

Page 54: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Legal Notice

Getting Started with ContentMaster, Volume 2: Advanced Exercises

Copyright © 2005-2006 Itemfield Inc. All rights reserved.

Itemfield may have patents, patent applications, trademarks, copyrights, or other intellectual propertyrights covering subject matter in this document. Except as expressly provided in any written licenseagreement from Itemfield, the furnishing of this document does not give you any license to thesepatents, trademarks, copyrights, or other intellectual property.

The information in this document is subject to change without notice. Complying with all applicablecopyright laws is the responsibility of the user. No part of this document may be reproduced ortransmitted in any form or by any means, electronic or mechanical, for any purpose, without theexpress written permission of Itemfield Inc.

SAP AGhttp://www.sap.com

Publication Information:

Version: 4.0Date: January 2006

Page 55: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 Contents

i

Contents of Volume 2

5. Positional Parsing of a PDF Document...........................................1Requirements Analysis .......................................................................................................................... 1

Source Document............................................................................................................................ 2XML Output ..................................................................................................................................... 4The Parsing Problem....................................................................................................................... 4

Creating the Project............................................................................................................................... 6Defining the Anchors .............................................................................................................................8

More About Search Scope ............................................................................................................12Defining the Nested Repeating Groups ............................................................................................... 13

Basic and Advanced Properties ....................................................................................................19Using an Action to Compute Subtotals................................................................................................ 20

Actions...........................................................................................................................................23Potential Enhancement: Handling Page Breaks ...........................................................................23

Points to Remember............................................................................................................................ 23

6. Parsing Word and HTML Documents............................................25Requirements Analysis ........................................................................................................................ 26

Source Document..........................................................................................................................26XML Output ................................................................................................................................... 28The Parsing Problem..................................................................................................................... 28

Creating the Project.............................................................................................................................30Defining a Variable ..............................................................................................................................31Parsing the Name and Address...........................................................................................................32

Why the Output Contains Empty Elements ...................................................................................35Parsing the Optional Currency Line.....................................................................................................35Parsing the Order Table ......................................................................................................................36

Why the Output Doesn't Contain HTML Code...............................................................................41Using Count to Resolve Ambiguities .............................................................................................43

Using Transformers to Modify the Output............................................................................................ 44Global Components.......................................................................................................................47

Testing the Parser on Another Source Document...............................................................................48Points to Remember............................................................................................................................ 49

Page 56: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 Contents

ii

7. Defining a Serializer .......................................................................51Prerequisite .........................................................................................................................................51Requirements Analysis ........................................................................................................................ 51Creating the Project.............................................................................................................................53

Determining the Project Folder Location .......................................................................................54Configuring the Serializer ....................................................................................................................55Calling the Serializer Recursively ........................................................................................................57

Defining Multiple Components in a Project ...................................................................................58Points to Remember............................................................................................................................ 59

8. Defining a Mapper ..........................................................................60Requirements Analysis ........................................................................................................................ 60Creating the Project.............................................................................................................................61Configuring the Mapper .......................................................................................................................62Points to Remember............................................................................................................................ 65

9. Running ContentMaster Engine ....................................................66Deploying a Parser as a Service .........................................................................................................66ContentMaster COM API Application ..................................................................................................67

Source Code .................................................................................................................................68Explanation of the API Calls.......................................................................................................... 69

Running the COM API Application ...................................................................................................... 70Points to Remember............................................................................................................................ 72

Index of Volume 2...............................................................................73

Page 57: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 5. Positional Parsing of a PDF Document

1

Positional Parsing of a PDF Document

In many parsing applications, the source documents have a fixed page layout. Thisis true, for example, of bills, invoices, and account statements. In such cases, youcan configure a parser that uses a positional format to find the data fields.

In this chapter, you will use a positional strategy to parse an invoice form. You willdefine the content anchors according to their character offsets from the markers.

In addition to the positional strategy, the exercise illustrates several otherimportant ContentMaster features:

The source document is an Adobe Acrobat PDF file. The parser uses adocument processor to convert the document from the binary PDF format to atext format, which is suitable for further processing.

The data is organized in nested repeating groups.

The parser uses actions to compute subtotals, which are not present in thesource document.

The document contains a large quantity of irrelevant data that is not requiredfor parsing. Some of the irrelevant data contains the same marker strings asthe desired data. The exercise introduces the concept of search scope, which youcan use to narrow the search for anchors and identify the desired data reliably.

To configure the parser, you will use both the basic properties of components,and the advanced properties, which are hidden but can be displayed on demand.

In this exercise, you will solve a complex, real-life parsing problem. Have patienceas you do the exercise. Don't be afraid to delete or undo your work if you make amistake. With a little practice, you will be able to create parsers like this with nodifficulty whatsoever!

Requirements Analysis

Before you start to configure a ContentMaster project, you should examine thesource document and the desired XML output, and analyze what the parser needsto do.

5

Page 58: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 5. Positional Parsing of a PDF Document

2

Source Document

To view the PDF source document, you need the Adobe Reader (also known as theAdobe Acrobat Reader), version 4 or higher. If you don't have the Adobe Reader,you can download it from http://www.adobe.com. (You need the Adobe Reader onlyfor viewing the document. ContentMaster has a built-in component for processingPDF documents and does not require any additional Adobe software.)

In the Adobe Reader, open the file in the ContentMaster installation folder:

tutorials\Exercises\Files_for_Tutorial_3\Invoice.pdf

The document is an invoice that a fictitious egg-and-dairy wholesaler, calledOrshava Farms, sends to its customers. The first page of the invoice displays datasuch as:

The customer's name, address, and account number

The invoice date

A summary of the current charges

The total amount due

The top and bottom of the first page display boilerplate text and advertising.

Page 59: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 5. Positional Parsing of a PDF Document

3

The second page displays the itemized charges for each buyer. The sampledocument lists two buyers, each of whom made multiple purchases. The page hasa nested repeating structure:

The main section is repeated for each buyer.

Within the section for each buyer, there is a two-line structure, followed by ablank space, for each purchase transaction.

At the top of the second page, there is a page header. At the bottom, there isadditional boilerplate text.

This structure is typical of many invoices: a summary page, followed by repeatingstructures for different account numbers, credit cards, etc. A business might storesuch invoices as PDF files instead of saving paper copies. It might use the PDFinvoices for online billing by email or via a web site.

Your task, in designing the parser, is to retrieve the required data while ignoringthe boilerplate. Since you are doing this (presumably) for a system integrationapplication, you must do it with a very high degree of reliability.

Page 60: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 5. Positional Parsing of a PDF Document

4

XML Output

For the purpose of this exercise, we assume that you want to retrieve thetransaction data from the invoice. You need to store the data in an XML structure,which looks like this:

<Invoice account="12345"><Period_Ending>April 30, 2003</Period_Ending><Current_Total>351.01</Current_Total><Balance_Due>457.07</Balance_Due><Buyer name="Molly" total="217.64">

<Transaction date="Apr 02" ref="22498"><Product>large eggs</Product><Total>29.07</Total>

</Transaction><Transaction date="Apr 08" ref="22536">

<Product>large eggs</Product><Total>58.14</Total>

</Transaction><!-- Additional transaction elements -->

</Buyer><Buyer name="Jack" total="133.37">

<!-- Transaction elements --></Buyer>

</Invoice>

The structure contains multiple Buyer elements, and each Buyer element containsmultiple Transaction elements. Each Transaction element contains selected dataabout a transaction: the date, reference number, product, and total price.

The structure omits other data about a transaction, which we choose to ignore. Forexample, the structure omits the discount, the quantity of each product, and theunit price.

Each Buyer element has a total attribute, which is the total of the buyer'spurchases. The total per buyer is not recorded in the invoice. We require thatContentMaster compute it.

The Parsing Problem

Try opening the Invoice.pdf file in Notepad. You will see something like this:

Page 61: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 5. Positional Parsing of a PDF Document

5

Parsing this binary data might be possible, but it would clearly be very difficult.We would need a detailed understanding of the internal PDF file format, and wewould need to work very hard to identify the markers and contentunambiguously.

If we extract the text content of the document, the parsing problem seems moretractable:

Apr 08 22536 large eggs 61.20 3.06 58.1460 dozen @ 1.02 per dozen

Apr 08 22536 cheddar cheese 45.90 2.30 43.6130 lbs. @ 1.53 per lb.

The transaction data is aligned in columns. The position of each data field is fixedrelative to the left and right margins. This is a perfect case for positional parsing—extracting the content according to its position on the page. That is why thepositional format is appropriate for this exercise.

Another feature is that each transaction is recorded in a fixed pattern of lines. Thefirst line contains the data that we wish to retrieve. The second line contains thequantity of a product and the unit price, which we do not need to retrieve. Thefollowing line is blank. We can use the repeating line structure to help parse thedata.

A third feature is that the group of transactions is preceded by a heading row, suchas:

Purchases by: Molly

The heading contains the buyer's name, which we need to retrieve. The headingalso serves as a separator between groups of transactions.

Page 62: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 5. Positional Parsing of a PDF Document

6

Creating the Project

Now that you understand the parsing requirements, you are ready to configure theContentMaster project.

1. Use the File > New > Project command to create a parser project calledTutorial_3.

2. On the first few pages of the wizard, set the following options. The options aresimilar to the ones that you set in the preceding tutorial (see Chapter 4,Defining an HL7 Parser).

a. Name the parser PdfInvoiceParser .b. Name the script file Pdf_ScriptFile.c. When prompted for the schema, browse to the file OrshavaInvoice.xsd,

which is in the tutorials\Exercises\Files_For_Tutorial_3 folder.d. Specify that the example source is a file on a local computer.e. Browse to the example source file, which is Invoice.pdf.f. Specify that the source content type is PDF.

3. When you reach the document processor page of the wizard, select the PDF toUnicode (UTF-8) processor.

This processor converts the binary PDF format to the text format that theparser requires. The processor inserts spaces and line breaks in the text file, inan attempt to duplicate the format of the PDF file as closely as possible.

Note: PDF to Unicode (UTF-8) is a description of the processor, which isdisplayed in the wizard. In the IntelliScript, you can view the actual name ofthe processor, which is PdfToTxt_3_00.

4. On the next wizard page, select the document format, which is CustomFormat.You will change this value later, when you edit the IntelliScript.

5. On the final page, click Finish.

6. After a few seconds, the ContentMaster Explorer displays the Tutorial_3project, and the script file is opened in an IntelliScript editor.

7. Notice that the example pane displays the example source in text format,which is the output of the document processor. You will configure anchorsthat process the text.

Page 63: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 5. Positional Parsing of a PDF Document

7

If the Adobe Reader is installed on the computer, try clicking the Browser tabat the bottom of the example pane. This tab displays the original document ina Microsoft Internet Explorer window, which uses the Adobe Reader as aplug-in component to display the file. The Browser tab is for display only. Youshould return to the Source tab to configure the anchors.

8. Now that the project and parser have been created, you should do some finetuning by editing the IntelliScript.

Expand the format property of the parser, and change its value fromCustomFormat to TextFormat. This is appropriate because the parser willprocess the text form of the document, which is the output of the documentprocessor.

Under the format property, change delimiters to Positional. This means thatthe parser should learn the structure of the example source by counting thecharacters between the anchors.

Page 64: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 5. Positional Parsing of a PDF Document

8

Defining the Anchors

Let's start parsing by example:

1. Near the top of the example source, find the string ACCOUNT NO:, which marksthe beginning of the text that you need to parse. Define the string as a Markeranchor by selecting it, right-clicking, and choosing Insert Marker.

The marker is automatically displayed in the IntelliScript, and you areprompted to enter the search property setting. Select the value TextSearch ,which is the default. Under TextSearch, confirm that the text string, which theMarker should seek, is ACCOUNT NO:.

Click the >> symbol at the end of TextSearch to display its advancedproperties. Select the match case option under Marker. The option means thatthe parser looks only for the string ACCOUNT NO:, and not account no: , AccountNo:, or any other spelling. This can help prevent problems if one of the otherspellings happens to occur in the document. (In fact, the spelling Account No:occurs in the header of the second page.)

Although it is not strictly necessary, we suggest that you select the match caseoption for all the Marker anchors that you define in this exercise.

Note: For more information about the advanced properties, see Basic andAdvanced Properties later in this chapter.

2. Continuing on the same line of the source document, after the Marker, selectand right-click 12345. From the pop-up menu, choose Insert Offset Content.

This inserts a Content anchor, with property values that are appropriate forpositional parsing. The properties are as follows:

opening_marker = OffsetSearch(1)This means that the text 12345 starts 1 character after the Marker anchor.

closing_marker = OffsetSearch(5)This means that the text 12345 ends 5 characters after its start.

Page 65: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 5. Positional Parsing of a PDF Document

9

3. Edit the data_holder property of the Content anchor. Set its value to/Invoice/@account. This is the data holder where the Content anchor willstore the text that it retrieves from the source document.

4. There is a small problem in the definition of the Content anchor. What willhappen if a source document has an account number such as 987654321, whichis longer than the number in the example source? According to the currentdefinition of the closing_marker, the parser will retrieve only the first fivedigits, truncating the value to 98765.

To solve this potential problem, you should change the closing_marker toretrieve not only until character 5, but until the end of the text line. Do this bychanging the value of closing_marker from OffsetSearch to NewlineSearch.

5. Continuing to the next line of the example source, define PERIOD ENDING: as aMarker anchor.

6. Define April 30, 2003 as a Content anchor with the offset properties. Map theContent to the /Invoice/*s/Period_Ending data holder. Change theclosing_marker to NewlineSearch, to support source documents where thedate string is longer than in the example source.

Page 66: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 5. Positional Parsing of a PDF Document

10

7. According to the Requirements Analysis, the next data that you need to retrieveis the current invoice amount. Scroll a few lines down, and define CURRENTINVOICE as a Marker.

8. At the end of the same line, define 351.01 as an offset Content anchor. Map theContent to the /Invoice/*s/Current_Total data holder.

When you select the Content, include a few space characters to the left of thestring 351.01. In positional parsing, this is important because the area to theleft might contain additional characters. For example, the content might be1351.01.

As above, change closing_marker to NewlineSearch, in case the number islocated to the right of its position in the example source.

9. In the same way, define BALANCE DUE as a Marker and 475.07 as Content. Mapthe Content to /Invoice/*s/Balance_Due.

10. Examine the IntelliScript, and confirm that the sequence of markers and offset-content anchors is correct. It should look like this:

Page 67: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 5. Positional Parsing of a PDF Document

11

Note that the precise offsets in your solution may differ from the ones in theillustration. The offsets depend on the number of spaces that you selected tothe left or right of each anchor.

11. Examine the color coding in the example pane. It should look like this:

12. Right-click and define PdfInvoiceParser as the startup component.

13. Run the parser and view the results file. The results should look like this:

14. You are now almost ready to define the repeating group, which creates theBuyer elements of the XML output. Afterwards, you will define a nestedrepeating group, which creates the Transaction elements.

The repeating group starts at the Purchases by: line, which is on page 2 of theinvoice. You will define Purchases by: as the separator of the repeatinggroup.

There is a catch, however. The string Purchases by: appears also on thebottom of page 1. To prevent this from causing a problem, you need to startthe search scope for the repeating group at the beginning of page 2.

Page 68: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 5. Positional Parsing of a PDF Document

12

To do this, define the string Page 2 as a Marker anchor. The parser assumesthat the anchors are defined in the order that they appear in the document.Therefore, it starts looking for the subsequent repeating group from the end ofthis Marker .

More About Search Scope

We told you that the purpose of the Page 2 marker is to establish the search scope ofthe subsequent repeating group (which you will define below).

Actually, the function of every Marker anchor is to establish the search scope ofother anchors. For example, suppose that a Content anchor is located between twoMarker anchors. ContentMaster finds the Marker anchors in an initial pass throughthe document, called the initial phase. In a second pass, called the main phase,ContentMaster searches for the Content between the Marker anchors. You havealready seen many examples of how this works as you performed the precedingexercises.

There are many ways to refine the search scope and the way in whichContentMaster searches for anchors. For example, you can configureContentMaster to:

Search for particular anchors in the initial, main, or final phase of theprocessing

Search backwards for anchors

Permit anchors to overlap one another

Find anchors according to search criteria such as a specified text string, apattern (a regular expression), or a specified data type

Adjusting the search scope, phase, and criteria is one of the most powerful featuresthat ContentMaster offers for parsing complex documents. For completeinformation, we encourage you to read the chapter on Anchors in the ContentMasterStudio User's Guide.

Page 69: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 5. Positional Parsing of a PDF Document

13

Defining the Nested Repeating Groups

Now, you need to define a RepeatingGroup anchor that parses the repeatingPurchases by: structure. Within it, you will define a nested RepeatingGroup thatparses the individual transactions.

Do you remember how to define a repeating group? For instructions, seeChapter 4, Defining an HL7 Parser.

1. At the end of the anchor list, edit the IntelliScript to insert a RepeatingGroup .

2. Define the separator property of the RepeatingGroup as a Marker, whichperforms a TextSearch for the string Purchases by:.

Set the separator_position to before. This is appropriate because theseparator is located before each iteration of the repeating group.

ContentMaster highlights the Purchases by: strings in the example pane.

3. Within the RepeatingGroup , insert an offset Content anchor that maps thebuyer's name, Molly, to the data holder /Invoice/*s/Buyer/@name.

Don't forget to change the closing_marker to NewlineSearch, in case thebuyer's name contains more than five letters. (We won't remind you about thisany more. Whenever you define an offset Content anchor, you should checkthat it supports strings that may be longer than the ones in the examplesource.)

Page 70: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 5. Positional Parsing of a PDF Document

14

4. On the menu, click the IntelliScript > Mark Example command. Check thatboth buyer names, Molly and Jack , are highlighted. This confirms that theparser identifies the iterations of the RepeatingGroup correctly.

If you want, you can also run the parser. Confirm that the results file containstwo Buyer elements.

Page 71: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 5. Positional Parsing of a PDF Document

15

5. Find the three-dots symbol that is nested within the RepeatingGroup, anddefine another RepeatingGroup.

6. What can we use as the separator of the nested RepeatingGroup? There doesn'tseem to be any characteristic text between the transactions.

Instead of a text separator, we can use the fact that every transaction occupiesexactly four lines of the document. We can use the sequence of four newlinecharacters as a separator.

To do this, set the separator property of the nested RepeatingGroup to aMarker that uses a NewlineSearch. Click the >> symbol at the end of the Markerto display its advanced properties, and set count = 4.

You may then click the << symbol to hide the advanced properties.

There is no separator before the first transaction or after the last transaction, soyou should set separator_position = between.

Page 72: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 5. Positional Parsing of a PDF Document

16

As an aside, if you examine the original PDF document carefully, you willdiscover that each transaction occupies only three lines. But we are parsing theoutput of the document processor. For some reason (undoubtedly related tothe binary structure of the PDF file), the processor inserts an extra line.

7. In the first line of the first transaction (that is, Molly's first purchase), definethe following offset Content anchors:

Content Data holder

Apr 02 /Invoice/*s/Buyer/*s/Transaction/@date

22498 /Invoice/*s/Buyer/*s/Transaction/@ref

large eggs /Invoice/*s/Buyer/*s/Transaction/*s/Product

29.07 /Invoice/*s/Buyer/*s/Transaction/*s/Total

When you are done, the example pane should be color-coded like this. Theexact position of the colors depends on the number of spaces that you selectedto the left and right of each field.

Page 73: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 5. Positional Parsing of a PDF Document

17

The IntelliScript should display the four Content anchors within the nestedRepeatingGroup.

8. Try the IntelliScript > Mark Example command again. Confirm that the date,reference, product, and total columns are highlighted in all the transactions.

9. Do you remember how we restricted the search scope of the outerRepeatingGroup by defining Page 2 as a Marker? It's a good idea to define aMarker at the end of the nested RepeatingGroup, too, to make sure thatContentMaster doesn't search too far.

To do this, return to the top-level anchor list (the level at which the outerRepeatingGroup is defined), and define some of the boilerplate text in thefooter of page 2 as a Marker. For example, you can define the string For yourconvenience:.

Page 74: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 5. Positional Parsing of a PDF Document

18

10. Run the parser. You should get the following output:

11. Collapse the tree and confirm that the two Buyer elements contain 5 and 3transactions, respectively. This is the number of transactions in the examplesource.

Note: On Windows XP SP2 or higher, Internet Explorer may display a yellowinformation bar notifying you that it has blocked active content in the file. Ifthis occurs, clicking the - and + icons fails to collapse or expand the tree. Youshould unblock the active content, and then collapse the tree.

Page 75: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 5. Positional Parsing of a PDF Document

19

As usual, examine the event log for any problems. Remember that optionalfailure events (and nested failure events) are normal at the end of aRepeatingGroup. See Testing the Parser in Chapter 4, Defining an HL7 Parser foran explanation.

Basic and Advanced Properties

When you configured the separator of the nested RepeatingGroup, we told youto click the >> symbol, which displays the advanced properties. When you dothis, the >> symbol changes to <<. Click the << symbol to hide the advancedproperties.

Many ContentMaster components have both basic properties and advancedproperties. The basic properties are the ones that you need to use mostfrequently, so they are displayed by default. The advanced properties areneeded less often, so ContentMaster hides them. When you click the >>symbol, they are displayed in gray.

If you assign a non-default value to an advanced property, it turns black and itis displayed like a basic property.

The distinction between the basic and advanced properties is only in thedisplay. The advanced properties are not harder to understand or moredifficult to use. You should feel free to use them as needed.

IntelliScript showing basic properties. Click >> to display advanced properties.

Page 76: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 5. Positional Parsing of a PDF Document

20

IntelliScript showing basic and advanced properties. Click << to hide theadvanced properties.

Using an Action to Compute Subtotals

The exercise is complete except for one feature. You need to compute the subtotalfor each buyer, and insert the result in the total attribute of the Buyer elements.The desired output is:

<Buyer name="Molly" total="217.64">

You can compute the subtotals in the following way:

Before processing the transactions, initialize the total attribute to 0.

After the parser processes each transaction, add the amount of the transactionto the total .

You will use a SetValue action to perform the initialization step. You will use aCalculateValue action to perform the addition.

1. Expand the IntelliScript to display the nested RepeatingGroup .

2. Right-click the nested RepeatingGroup, and choose Insert from the pop-upmenu. This lets you insert a new anchor above the RepeatingGroup.

Page 77: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 5. Positional Parsing of a PDF Document

21

3. Insert the SetValue action, and set its properties as follows:

quote = 0.00

data_holder = /Invoice/*s/Buyer/@total

The SetValue action assigns the quote value to the data_holder.

4. Now, expand the nested RepeatingGroup. At the three dots (...) under thefourth Content anchor, enter a CalculateValue action.

5. Set the properties of the CalculateValue action as follows:

params =/Invoice/*s/Buyer/@total/Invoice/*s/Buyer/*s/Transaction/*s/Total

result = /Invoice/*s/Buyer/@total

expression = $1 + $2

The params property defines the parameters. In this case, the parameters arethe current total attribute of the buyer (which we initialized to 0), and theTotal element of the current transaction.

The expression property is a JavaScript expression that the action shouldperform. $1 and $2 are the parameters, which you defined in the paramsproperty.

The result property says that we should assign the result of the expression asthe new value of the total attribute.

Page 78: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 5. Positional Parsing of a PDF Document

22

6. Run the parser. In the results file, each Buyer element should have a totalattribute.

Page 79: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 5. Positional Parsing of a PDF Document

23

Actions

In this exercise, you used the SetValue and CalculateValue actions to compute thesubtotals.

Broadly speaking, actions are ContentMaster components that perform operationson data. ContentMaster provides numerous action components, which performoperations such as:

Computing values

Concatenating strings

Testing a condition for continued processing

Running a secondary parser

Running a database query

For more information, see the chapter on Actions in the ContentMaster Studio User'sGuide.

Potential Enhancement: Handling Page Breaks

So far, we have assumed that the repeating groups have a perfectly regularstructure. But what happens if the transactions run over to another page? The newpage would contain a page header, which would interrupt the repeating structure.

One way to support page headers might be to redefine the separator between thetransactions. For example, you might use an Alternatives anchor as the separator.This lets you define multiple separators that might occur between the transactions.

We won't give you an exercise to perform this enhancement, but we encourage youto experiment with the features yourself.

Points to Remember

To parse a binary format such as a PDF file, you can use a document processor thatconverts the source to text.

The positional strategy is useful when the content is laid out at a fixed number ofcharacters, called an offset, from the margins or from markers.

When you define anchors, think about the possible variations that may occur in sourcedocuments. If you define the content positionally, adjust the offsets in case a sourcedocument contains longer data than the example source.

The search scope is the segment of a document where ContentMaster searches for ananchor. The function of markers is to define the search scope for other anchors.

Repeating groups can be nested. You can use this feature to parse nested iterativestructures.

Page 80: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 5. Positional Parsing of a PDF Document

24

Newline characters are useful as markers and separators. You can set the count ofnewlines, which specifies how many lines to skip.

In the IntelliScript, you can click the >> symbol next to a component to display itsadvanced properties. Advanced properties are used less frequently than basicproperties, so they are hidden to avoid cluttering the display.

You can use actions to perform operations on the data, such as computing totals.

Page 81: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 6. Parsing Word and HTML Documents

25

Parsing Word and HTML Documents

Many real-life documents have a looser structure than the ones we have describedin the previous chapters. The documents may have:

Few labels and delimiters that can help a parser identify the data fields

Wrapped lines and flexible layout, which make positional parsing difficult

Repeated keywords, which make it hard to define unambiguous markers

Microsoft Word documents are typical examples of these phenomena. Worddocuments are usually created by hand (although they can also be generatedprogrammatically). The authors may create the documents from a fixed template,but they may vary the layout from one document to another.

In such cases, you need to use whatever markup or patterns are available to parsethe documents. For example, you can often use the formatting markup—paragraphs, tables, italics, etc.—to identify the data fields.

For a Microsoft Word document, you can expose the formatting markup by savingthe document as HTML. A parser can use the HTML tags, which are essentiallyformat labels, to interpret the document structure.

In this chapter, you will construct a parser for a Word document. You will learntechniques such as:

Using the WordToHtml document processor, which converts a Word documentto an HTML document.

Taking advantage of the opening-and-closing tag structure to parse an HTMLdocument.

Identifying anchors unambiguously, in situations where the same text isrepeated frequently throughout the document

Using transformers to modify the output of anchors.

Using variables to store retrieved data for later use.

Defining a global component, which you configure once and use multiple timesin a project.

Testing a parser on other source documents, in addition to the example source.

6

Page 82: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 6. Parsing Word and HTML Documents

26

Scope of the Exercise

This exercise is the longest and most complex in this book. It is typical of real-lifeparsing scenarios. You will create a full-featured parser, which could be used in aproduction application.

We suggest that you skim the chapter from beginning to end to understand howthe parser is intended to work, and then try the exercise. Please feel free toexperiment with the parser configuration. You may find parsing solutions thatdiffer from ours, or perhaps that are better than ours.

Actually, we know some ways to further improve the parser. For example, it ispossible to modify the anchor configuration to support numerous variations of thesource-document structure. It is possible to modify the XSD schema, restricting thedata types of the elements, to ensure that the parser retrieves exactly the correctdata from a complex document.

We have omitted most of these refinements because we want to keep the scope ofthe exercise within bounds. When you finish the exercise, you can learn muchmore about the techniques described here by reading the ContentMaster StudioUser's Guide.

PrerequisitesIn order to do this exercise, you need Microsoft Word 97 or higher on the samecomputer as ContentMaster Studio.

If you don't have Word on the same computer, you can open the source documentin Word on another computer and save it as HTML. You can then transfer theHTML document back to the ContentMaster computer and parse it. The rest of theexercise is unchanged, except that you do not need to use the WordToHtml processorbecause the document is already HTML.

We presume in this exercise that you have an elementary familiarity with HTMLcode. You can get an introduction at http://www.w3.org or from any book on web-site development.

Requirements Analysis

Before you start configuring a parser, it's always a good idea to plan what theparser must do. Let's examine the source document structure, the required XMLoutput, and the nature of the parsing problem.

Source Document

You can view the example source in Microsoft Word. The document is stored in:

tutorials\Exercises\Files_for_tutorial_4\Order.doc

Page 83: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 6. Parsing Word and HTML Documents

27

The document is an order form, which is used by a fictitious organization calledThe Tennis Book Club. The form contains:

The name of the purchaser

A one-line address

An optional line, which specifies the currency such as $ or £

A table of the items ordered

The format of the items in the table is not uniform. Although the first columnheading is Books, some of the possible items in this column—such as a videocassette—are not books. These items may be formatted in a different font from thebooks.

In some of the table cells, the text wraps to a second line. Positional text parsing,using the techniques that you learned in the previous chapter, would be difficultfor these cells.

The table cells do not contain any identifying labels. The parser needs to interpretthe cells according to their location in the table.

The word Total appears three times in the table. In two of the instances, the wordis in a column or row heading; we can assume that this is the case in all similar

Page 84: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 6. Parsing Word and HTML Documents

28

source documents. In the third instance, Total appears by chance in the title of thevideo cassette. This is significant because we want to parse the total-price row ofthe table, and we need an unambiguous identifier for the row.

As a further complication, the Currency line is optional. Some of the sourcedocuments that we plan to parse are missing this line.

XML Output

For the purpose of this exercise, we assume that the output XML format should beas follows:

<Order><To>Becky Handler</To><Address>18 Cross Court, Down-the-Line, PA</Address><Books>

<Book><Title>Topspin Serves Made Easy, by Roland Fasthitter</Title><Price>$11.50</Price><Quantity>1</Quantity><Total>$11.50</Total>

</Book><!-- Additional Book elements -->

</Books><Total>$46.19</Total>

</Order>

Notice that the prices and totals include the currency symbol ($). In the sourcedocument, the currency is stated on the optional currency line, and not in the table.

The Parsing Problem

Processor SelectionContentMaster provides several document processors, which can convert a Worddocument to a format that is suitable for parsing. Among these are:

WordToHtmlWordToRtfWordToTxtWordToXml

We choose to use the WordToHtml processor because:

The HTML code is easy for a human being to read, making it easy to configurethe parser.

The HTML tags expose a wealth of formatting information, which we can useto help identify the data to retrieve.

Page 85: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 6. Parsing Word and HTML Documents

29

The WordToXml and WordToRtf processors might also be good choices, but theiroutput is more difficult to read than that of the WordToHtml processor.

The WordToTxt processor produces plain-text output. Because the text lacksformatting information, it would probably be much harder to parse reliably thanthe HTML.

HTML Structure

The WordToHtml processor uses the Save As HTML feature of Microsoft Word togenerate the HTML. The HTML is exceedingly verbose; it may run to many pages,even though the original Word document is less than one page. The reason for theverbosity is that Word saves the complete document features, including the styleand format definitions. Much of this information has no significant effect on theappearance of the HTML in a web browser, but Word uses the information if youlater re-import the HTML to Word. Most of this irrelevant code is in the <header>element of the HTML document.

If you wish, you can save the example source document as HTML manually inWord. You can view the resulting HTML file in a browser, or you can open the filein Notepad and examine the HTML code.

In a broad outline, the code looks like this:

<html>\<header><!-- Very long header storing Word style definitions, etc. -->

</header><body>

<h1>The Tennis Book Club<br>Order Form</h1><p>Send to:</p><p>Becky Handler</p><p>18 Cross Court, Down-the-Line, PA</p><table>

<tr><!-- Table column headers -->

</tr><tr>

<td><i>Topspin Serves Made Easy</i>, by Roland Fasthitter</td><td>$11.50</td><td>1</td><td>$11.50</td>

</tr><!-- Additional tr elements containing the other table rows -->

</table></body>

</html>

You will use this basic HTML tag structure to parse the document.

The actual code is considerably more complex than the above outline. Thefollowing is a typical example, which is the first <td> element of the Topspin

Page 86: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 6. Parsing Word and HTML Documents

30

Serves row. The code may vary somewhat on your computer, depending on yourWord version and configuration.

<td width=168 valign=top style='width:125.9pt;border:solid windowtext1.0pt;border-top:none;mso-border-top-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0cm 5.4pt 0cm 5.4pt'><pclass=MsoBodyText><i style='mso-bidi-font-style:normal'>Topspin ServesMade Easy</i>, by Roland <span class=SpellE>Fasthitter</span></p></td>

As you begin to work with the Word-generated HTML, we trust that you willquickly learn to find the important structural tags, such as the <td>...</td>element. You can ignore the irrelevant code such as the lengthy attributes and<span> tags.

Creating the Project

1. To start the exercise, create a project called Tutorial_4. In the New Projectwizard, you should select the following options:

a. Name the parser MyHtmlParser.b. Name the script file Html_ScriptFile.c. Select the schema TennisBookClub.xsd, which is in the

tutorials\Exercises\Files_For_Tutorial_4 folder.d. Specify that the example source is a file on a local computer.e. Browse to the example source file, which is Order.doc.f. Select the Microsoft Word encoding.g. Select the document processor Microsoft Word To HTML.h. Select the HTML format. This format component is pre-configured with the

appropriate delimiters and other components for parsing HTML code.

The resulting IntelliScript should have the following appearance:

2. When ContentMaster Studio opens the example source, it may display aMicrosoft Internet Explorer prompt to open or save the file. This occursbecause ContentMaster Studio uses Internet Explorer to display a browserview of the file. The behavior depends on your Internet Explorer securitysettings.

Page 87: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 6. Parsing Word and HTML Documents

31

You may safely click Open and continue.

Defining a Variable

Do you remember the Currency line, near the top of the example source? We needto retrieve the currency symbol, but we don't want to map it to an output elementor attribute. Instead, we want to store the currency symbol temporarily, and prefixit to the prices in the output.

To do this, we will define a temporary data holder, which is called a variable. Youcan use a variable in exactly the same way as an element or attribute data holder,but it isn't included in the XML output.

To define the variable, perform these steps:

1. At the bottom of the IntelliScript (at the global level, not nested within theParser component), select the three dots (...) and press Enter.

2. Type the variable name, varCurrency, and press Enter.

3. At the three dots on the right side of the equals sign (=), press Enter and selectVariable.

The variable is displayed in the IntelliScript, in the Schema view, and in theComponent view.

Page 88: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 6. Parsing Word and HTML Documents

32

Variable definition in the IntelliScript

Variable displayed in the Component view

Variable displayed in the Schema view

Notice the location of the variable in the Schema view, nested within thewww.ContentMaster.com/CM/Variables namespace. Later, you will need toaccess the variable in the Schema view.

Parsing the Name and Address

First, let's define the anchors that parse the name and address, at the beginning ofthe example source. Follow these steps:

1. In the example pane, scroll past the header to the body tag (about 80% of theway down!).

To scroll quickly, you can right-click in the example pane, select the Findoption, and find the string <body (with a starting < symbol but without theending > symbol).

Page 89: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 6. Parsing Word and HTML Documents

33

2. Define <body as a Marker anchor. This causes the parser to skip the headercompletely (just as you did when you scrolled through the document).

Do not select the match_case property for this anchor, or for any other HTMLtags that you define as anchors in this exercise. HTML is not case sensitive.Word 2000 and 2002 generate HTML tags in lower case, but Word 97 generatestags in upper case. If you select match case, the parser will not support Word97.

3. Scroll down a few paragraphs, and define the string Send to: as a Markeranchor. The string signals the beginning of the content that you need toretrieve.

Here, it's OK to select the match case option. This can be helpful to ensure thatyou match only the indicated spelling.

4. Now, you are ready to define the first Content anchor. You will configure theanchor to retrieve the data between specified starting and ending strings. AContent anchor that is defined in this way is similar to a Marker ContentMarker sequence (which you could also use, if you prefer).

To do this, right-click the string Becky Handler and insert a Content anchor.Make the following assignments in the IntelliScript:

opening_markerType the string <p.

closing_markerType the string </p>.

Page 90: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 6. Parsing Word and HTML Documents

34

data_holderMap the anchor to /Order/*s/To.

When you finish these assignments, the example pane highlights <p and </p>as markers. It highlights Becky Handler as content.

Notice that only the string Becky Handler is highlighted in blue. The rest of theHTML code between the opening_marker and the closing_marker, such as

class=MsoBodyText>

is not highlighted. This is because the HtmlFormat recognizes > as a delimitercharacter. It learns from the example source that the desired content is locatedafter the > delimiter.

5. Move to the next paragraph of the source, and use the same technique(defining the opening_marker and the closing_marker) to map 18 CrossCourt, Down-the-Line, PA to the /Order/*s/Address data holder.

6. Run the parser, and confirm that you get the following result:

Page 91: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 6. Parsing Word and HTML Documents

35

Why the Output Contains Empty Elements

Notice that the parser inserts empty Books and Total elements. That is because theXSD schema defines Books and Total as required elements. The parser inserts theelement to ensure that the XML is valid according to the schema (you can changethis behavior in the project properties). The elements are empty because we havenot yet parsed the table of books.

Parsing the Optional Currency Line

If the source document contains a Currency line, you need to parse it and store theresult in the varCurrency variable, which you created above. If the Currency line ismissing, the parser should ignore the omission and continue parsing the rest of thedocument.

You can do this in the following way:

Define a Group anchor. The purpose of this anchor is to bind a set of nestedanchors together, for processing as a unit.

Within the group, nest Marker and Content anchors that retrieve the currency.

Select the optional property of the Group. This means that if the Group fails(because the Marker and/or Content are missing) the parser should continuerunning.

The procedure is as follows:

1. Following the anchors that you have already defined, insert a Group anchor.

2. Select the three-dots that are nested within the Group. Define the textCurrency; as a Marker anchor.

Page 92: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 6. Parsing Word and HTML Documents

36

3. Continuing within the Group, define $ as a Content anchor. Map the anchor tothe VarCurrency variable, which you defined above.

4. Click the >> symbol to display the advanced properties of the Group, and selectthe optional property. (Be sure you select the optional property of the Group ,and not the optional property of the Marker or Content.)

Parsing the Order Table

As you have seen in the preceding exercises, you can parse the order table by usinga RepeatingGroup. In this exercise, you will nest the RepeatingGroup in anEnclosedGroup anchor. The EnclosedGroup is useful for parsing HTML codebecause it recognizes the opening-and-closing tag structure, which is characteristicof the code. Specifically, you will use the EnclosedGroup to recognize the <tableand </table> tags that surround an HTML table.

Page 93: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 6. Parsing Word and HTML Documents

37

1. Select the three dots at the end of the top-level anchor sequence (that is, thelevel where the Group is defined, and not the nested level within the Group).

2. By editing the IntelliScript, insert an EnclosedGroup anchor.

3. Under the opening property of the EnclosedGroup, assign text = <table.Under the closing property, assign text = </table>. You can type the strings,or you can drag them from the example pane.

The example pane should highlight the <table and </table> tags in the HTMLcode.

4. Now, start adding anchors within the EnclosedGroup . If you did the exercisesin the preceding chapters, you should be able to work out a solution. Weencourage you to try it yourself before you continue reading.

Here is one possible solution. We suggest that you configure the anchors byediting the IntelliScript, rather than selecting and right-clicking in the examplesource. The HTML code of the example source is long and confusing, andseveral of the anchors have non-default properties, which you can configureonly in the IntelliScript.

Within theEnclosedGroupanchor, add

Configuration Explanation

Marker Advances to the first row ofthe table, which you do notneed to retrieve.

RepeatingGroup Starts a repeating group atthe second row of the table.(See below for the anchorsto insert within theRepeatingGroup.)

Page 94: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 6. Parsing Word and HTML Documents

38

Within theEnclosedGroupanchor, add

Configuration Explanation

Marker Terminates the repeatinggroup and moves to thelast row of the table.To assigncount=2 ,display the advancedproperties (see UsingCount to ResolveAmbiguities below).

Marker Advances to the fourth cellin the last row.To assigncount=3 ,display the advancedproperties (see UsingCount to ResolveAmbiguities below).

Content Retrieves the content of thefourth cell in the last row.

At this point, the EnclosedGroup should look like this:

5. Try the Mark Example command. If you scroll through the example source,you should see the color coding for the above anchors.

6. Run the parser. The results should look like this:

Page 95: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 6. Parsing Word and HTML Documents

39

7. Now, expand the RepeatingGroup and insert the following anchors within it.

Notice that the four Content anchors have very similar configurations.Optionally, you can save time by copying and pasting one of the Contentanchors, and editing the copies.

Within theRepeatingGroupanchor, add

Configuration Explanation

Content Retrieves thecontent of thefirst cell in arow.

Content Retrieves thecontent of thesecond cell.

Content Retrieves thecontent of thethird cell.

Content Retrieves thecontent of thefourth cell.

When you have finished, the EnclosedGroup and RepeatingGroup should looklike this:

Page 96: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 6. Parsing Word and HTML Documents

40

8. Run the Mark Example command, and check the color-coding again.

9. Run the parser. The results should look like this:

Page 97: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 6. Parsing Word and HTML Documents

41

Why the Output Doesn't Contain HTML Code

Notice that the parser removed the HTML code from the retrieved text.

For example, the first Content anchor within the RepeatingGroup is configured toretrieve all the text between the opening_marker (which is <td) and theclosing_marker (which is </td>). You might have expected it to retrieve the text:

width=168 valign=top style='width:125.9pt;border:solid windowtext1.0pt;border-top:none;mso-border-top-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0cm 5.4pt 0cm 5.4pt'><pclass=MsoBodyText><i style='mso-bidi-font-style:normal'>Topspin ServesMade Easy</i>, by Roland <span class=SpellE>Fasthitter</span></p>

In fact, this is exactly the color-coding of the IntelliScript:

Page 98: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 6. Parsing Word and HTML Documents

42

The answer is that the Content anchor retrieves the entire text, including the HTMLcode. However, the HtmlFormat component of the parser is configured with a set ofdefault transformers, which modify the output of the anchor.

One of the default transformers is RemoveTags, which removes the HTML code.

TransformersA transformer is a component that modifies the output of an anchor. Some commonuses of transformers are to add or replace strings in the output. You will usetransformers for these purposes in the next stage of this exercise.

The format component of a parser is typically configured with a set of defaulttransformers, which the parser applies to the output of all anchors. The purpose ofthe default transformers is to clean up the output.

The HtmlFormat is configured with the following default transformers:

RemoveTags, which removes HTML code from the output.

HtmlEntitiesToASCII, which converts HTML entities (such as &gt: or &amp;)to plain-text symbols (> or &).

HtmlProcessor, which converts multiple whitespace characters (spaces, tabs,or line breaks) to single spaces.

RemoveMarginSpace, which deletes leading and trailing space characters.

For more information about the transformers that ContentMaster makes availablefor your use, see Transformers in the ContentMaster Studio User's Guide.

Why Doesn't the Parser Use Delimiters to Remove the HTML Codes?Do you remember the Becky Handler anchor, which you defined at the beginningof this exercise? We explained that the anchor removes HTML codes because itretrieves only the data after the > delimiter. The color-coding illustrates this:

Page 99: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 6. Parsing Word and HTML Documents

43

This works because the Becky Handler anchor is configured with theLearnByExample component, whose function is to interpret the delimiters in theexample source.

In the Content anchors of the RepeatingGroup, we deliberately omitted theLearnByExample component:

The LearnByExample component would not have worked correctly here. Theproblem is that the number of delimiters in the order table is not constant. In someof the table rows, the name of a book is italicized. In other rows, the name is notitalicized. In some source documents, the table may contain other formattingvariations. The formatting can change the HTML code, including the number andlocation of the delimiters.

That's why we left the value property empty. We relied on the defaulttransformers to clean up the output, instead.

Using Count to Resolve Ambiguities

Within the EnclosedGroup, we told you to assign the count property of two Markeranchors. The purpose was to resolve ambiguities in the example source.

In the >Total< anchor, we set count = 2. This was to avoid ending the repeatinggroup prematurely, at the first instance of the string >Total<, which is in thecolumn-heading row. The count = 2 property means that the parser should lookfor the second instance of >Total<, which is in the last row of the table. Thisresolves the ambiguity.

Another interesting point is that we quoted the text >Total< , rather than Total(without the > and < symbols). This resolves the ambiguity with the word Total,which appears in the Total Tennis (video) row of the table. If we didn't do this,the repeating group would incorrectly end at the Total Tennis (video) row.

Page 100: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 6. Parsing Word and HTML Documents

44

The other use of the count property is in the last <td marker of the EnclosedGroup,which lies within the last table row. There, we set count = 3 to advance the parserby three table cells. The skipped cells in this row contain no data, so we don't needto parse them.

Using Transformers to Modify the Output

At this point, the XML output still has two small problems:

Some of the book titles are punctuated incorrectly. Instead of

<Title>Topspin Serves Made Easy, by Roland Fasthitter</Title>

the output is

<Title>Topspin Serves Made Easy , by Roland Fasthitter</Title>

The difference is an extra space character, located before the comma.

The prices and totals do not have a currency symbol:

<Price>11.40</Price><Total>46.19</Total>

instead of

<Price>$11.40</Price><Total>$46.19</Total>

You will apply transformers to particular Content anchors. The transformers willmodify the output of the anchors and correct these problems.

These transformers are in addition to the default transformers, which the parserapplies to all the anchors (see Why the Output Doesn't Contain HTML Code above).

1. In the IntelliScript, expand the first Content anchor of the RepeatingGroup.

2. Display the advanced properties of the Content anchor, and expand thetransformers property.

3. At the three dots within the transformers property, nest a Replacetransformer.

4. Set the properties of the Replace transformer as follows:

find_what="TextSearch(" ,")

replace_with = ","

In other words, the transformer replaces a space followed by a comma with acomma.

Page 101: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 6. Parsing Word and HTML Documents

45

5. You will use an AddString transformer to prefix the prices and totals with thecurrency symbol. You could independently configure an independentAddString transformer in each appropriate Content anchor, but there is aneasier way. You can configure a single AddString transformer as a globalcomponent. You can then use the global component wherever it is needed.

To configure the global component, select the three dots to the left of the equalsign, at the top level of the IntelliScript—the level where MyHtmlParser andvarCurrency are defined. This is called the global level.

Press Enter, and type AddCurrencyUnit. This is the identifier of the globalcomponent.

6. Press Enter at the second three-dots symbol on the same line (following theequals sign). Insert an AddString transformer.

7. Select the pre property of the AddString transformer, and press Enter. At theright of the text box, click the browse button. This displays a Schema view,where you can select a data holder.

Page 102: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 6. Parsing Word and HTML Documents

46

In the Schema view, select VarCurrency.

This means that the AddString transformer should prefix its input with thevalue of VarCurrency (which is the currency symbol).

8. Now, you can insert the AddCurrencyUnit component in the appropriatelocations of MyHtmlParser.

For example, display the advanced properties of the last Content anchor in theEnclosedGroup (the one that is mapped to /Order/*s/Total). Under itstransformers property, select the three dots and press Enter. The drop-downlist displays the AddCurrencyUnit global component, which you have justdefined. Select the global component. This applies the transformer to theanchor.

The result should look like this:

Page 103: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 6. Parsing Word and HTML Documents

47

In the same way, insert AddCurrencyUnit in the transformers property of thesecond and fourth Content anchors in the RepeatingGroup (which are mappedto Price and Total, respectively).

9. Run the parser. The results should be:

Notice that the Title element does not contain an extra space before thecomma. The Price and Total elements contain the $ sign. These are the effectsof the transformers that you defined.

Global Components

Global components are useful when you need to use the same componentconfiguration repeatedly in a project.

You can define any ContentMaster component as a global component, for example,a transformer, an action, or an anchor. The identifier of the global componentappears in the drop-down list at the appropriate locations of the IntelliScript. Youcan use the global component just like any other component.

Page 104: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 6. Parsing Word and HTML Documents

48

In this exercise, you defined three global components:

The parser, called MyHtmlParser, which you defined by using the wizard.

The VarCurrency variable, which you defined at the beginning of the exercise.

The AddCurrencyUnit transformer, which you added at the end of the exercise.

Testing the Parser on Another Source Document

As a final step, let's test the parser on the source documentOrderWithoutCurrency.doc. This document is identical to the example source(Order.doc), except that it is missing the optional currency line.

The object of this test is to confirm that the parser correctly processes a documentwhen the currency line is missing.

1. Display the advanced properties of MyHtmlParser.

2. Assign the sources_to_extract property a value of LocalFile.

This means that the parser should process a file on the local computer, insteadof the example source.

3. Edit the file_name property, and browse to OrderWithoutCurrency.doc.

4. Display the advanced properties of LocalFile, and assign pre_processor =WordToHtml.

This means that ContentMaster should run the WordToHtml processor on thedocument before it runs the parser.

Page 105: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 6. Parsing Word and HTML Documents

49

5. Run the parser. The output should be identical to the above, except that theprices and totals are not prefixed with a $ sign.

6. When you finish, you can delete the sources_to_extract property. This was atemporary setting, which you need only for testing the parser.

Points to Remember

To parse a Microsoft Word document, you can use a document processor such asWordToHtml, which converts the document to a parser-friendly HTML format.

To parse HTML code, you can use the EnclosedGroup anchor or the opening_markerand closing_marker properties of the Content anchor. These approaches takeadvantage of the characteristic opening-and-closing tag structure of HTML. If youdefine an HTML tag as a Marker, it is usually best not to select the match_caseproperty because HTML is not case sensitive.

To resolve ambiguities between multiple instances of the same text, you can use thecount property of a Marker.

Use variables to store retrieved data, which you do not want to display in the XMLoutput of a parser. The variables are useful as input to other components.

Page 106: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 6. Parsing Word and HTML Documents

50

Use transformers to modify the output of anchors. You can apply default transformersto all the anchors, or you can apply transformers to specific anchors.

To exclude HTML tags from the parsing output, you can either define the anchorsusing delimiters, or you can rely on the default transformers to remove the code.

Global components are useful when you need to use the same componentconfiguration repeatedly.

To test a parser on additional documents (other than the example source), assign thesources_to_extract property. Don't forget to assign a document processor to thedocument, if it needs one.

Page 107: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 7. Defining a Serializer

51

Defining a Serializer

In the preceding exercises, you have defined parsers, which convert documents invarious formats to XML. In this exercise, you will create a serializer, which worksin the opposite direction—it converts XML to another format.

Usually, it is easier to define a serializer than a parser. The reason is that the inputis a fully structured, unambiguous XML document.

The serializer that you will define is very simple. It contains only four serializationanchors (which are the opposite of the anchors that you use in parsing).Nonetheless, the serializer has some interesting points:

The serializer is recursive. That is, it calls itself repetitively to serialize thenested sections of an XML document.

The output of the serializer is a worksheet that you can open in Excel.

You will define the serializer by editing the IntelliScript. It is also possible togenerate a serializer automatically, by inverting the operation of a parser. Forfurther information, see the chapter on Serializers in the ContentMaster Studio User'sGuide.

Prerequisite

The output of this exercise is a *.csv (comma separated values) file. You can viewthe output in Notepad, but for the most meaningful display, you need MicrosoftExcel.

You do not need Excel to run the serializer. It is recommended only to view theoutput.

Requirements Analysis

The input XML document is:

tutorials\Exercises\Files_For_Tutorial_5\FamilyTree.xml

The document is an XML representation of a family tree. Notice the inherentlyrecursive structure: each Person element can contain a Children element, whichcontains more Person elements.

7

Page 108: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 7. Defining a Serializer

52

<Person><Name>Jake Dubrey</Name><Age>84</Age><Children>

<Person><Name>Mitchell Dubrey</Name><Age>52</Age>

</Person><Person>

<Name>Pamela Dubrey McAllister</Name><Age>50</Age><Children>

<Person><Name>Arnold McAllister</Name><Age>26</Age>

</Person></Children>

</Person></Children>

</Person>

Our goal is to output the names and ages of the Person elements as a *.csv file,which has the following structure:

Jake Dubrey,84Mitchell Dubrey,52Pamela Dubrey McAllister,50Arnold McAllister,26

Excel displays the output as a worksheet looking like this:

Page 109: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 7. Defining a Serializer

53

Creating the Project

To create the serializer, follow these steps:

1. On the ContentMaster Studio menu, choose File > New > Project.

2. In the left pane of the New Project window, select ContentMaster. In the rightpane, select Serializer Project.

3. On the following wizard pages, specify the following options:

a. Name the project Tutorial_5.b. Name the serializer FamilyTreeSerializer.c. Name the script file Serializer_Script.

4. When you reach the Schema page, browse to the schema FamilyTree.xsd,which is in the tutorials\Exercises\Files_For_Tutorial_5 folder.

The schema defines the structure of the input XML document. If you are newto XSD, you may wish to open the schema in an XSD editor or in Notepad, andexamine how the recursive data structure is defined.

5. When you finish the wizard, the ContentMaster Explorer displays the newproject. Double click the Serializer_Script.tgp file to edit it.

6. Unlike a parser, a serializer does not have an example source file. Optionally,you can hide the empty example pane of the IntelliScript editor.

To do this, choose the IntelliScript > IntelliScript command on the menu, orclick the toolbar button that is labeled show IntelliScript pane only. To restorethe example pane, choose IntelliScript > Both, or click the button that is labeledshow both IntelliScript and example panes.

Page 110: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 7. Defining a Serializer

54

7. To design the serializer, you will use the FamilyTree.xml input document,whose content is presented above.

Although not required, it is a good idea to organize all the files that you use todesign and test a project in the project folder, which is located in your Eclipseworkspace. We suggest that you copy the FamilyTree.xml file from

tutorials\Exercises\Files_For_Tutorial_5

to the project folder, which is by default

My Documents\SAP\ContentMaster\4.0\workspace\Tutorial_5

Determining the Project Folder Location

Your project folder may be in a non-default location for either of the followingreasons:

In the New Project wizard, you selected a non-default location.

Your copy of Eclipse is configured to use a non-default workspace location.

You can determine the location of the project folder by selecting the project in theContentMaster Explorer, and choosing the File > Properties command.Alternatively, click in the IntelliScript editor and choose the Project > Propertiescommand. The Info tab of the properties window displays the location.

Project PropertiesThe properties window displays many useful options, such as the input andoutput encodings that are used in your documents and the XML validationoptions. For more information, see Project Properties in the ContentMaster StudioUser's Guide.

Page 111: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 7. Defining a Serializer

55

Configuring the Serializer

You are now ready to configure the serializer properties and add the serializationanchors. You must do this by editing the IntelliScript. There is no analogy to theselect-and-click approach that you used to configure parsers.

1. Display the advanced properties of the serializer, and setoutput_file_extension = .csv (with a leading period).

When you run the serializer in ContentMaster Studio, this causes the outputfile to have the name output.csv. By default, *.csv files open in MicrosoftExcel.

2. Under the contains line of the serializer, insert a ContentSerializerserialization anchor, and configure its properties as follows:

data_holder = /Person/*s/Name

closing_str = ","

This means that the serialization anchor writes the content of the/Person/*s/Name data holder to the output file. It appends the closing string"," (a comma).

3. Define a second ContentSerializer as illustrated:

Page 112: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 7. Defining a Serializer

56

This ContentSerializer writes the /Person/*s/Age data holder to the output.It appends a carriage return (ASCII code 013) and a linefeed (ASCII 010) to theoutput.

To type the ASCII codes:

a. Select the closing_str property and press Enter.b. On the keyboard, press Ctrl+a. This displays a small dot in the text box.c. Type 013.d. Press Ctrl+a again.e. Type 010.f. Press Enter to complete the property assignment.

4. Run the serializer. To do this:

a. Set the serializer as the startup component.b. On the menu, choose the Run > Run command.c. At the prompt, browse to the test input file, FamilyTree.xml.d. When the serializer has completed, examine the Events view for errors.e. In the ContentMaster Explorer view, under Results, double-click

output.csv to view the output.

Assuming that Excel is installed on the computer, ContentMaster displays anExcel window, like this:

If Excel is not installed on the computer, you can view the output file inNotepad. Alternatively, you can copy the file to another computer where Excelis installed, and open it there.

Page 113: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 7. Defining a Serializer

57

Calling the Serializer Recursively

So far, the results contain only the top-level Person element. You need to configurethe serializer to move deeper in the XML tree and process the child Personelements.

You can do this as follows:

1. Insert a RepeatingGroupSerializer .

You can use this serialization anchor to iterate over a repetitive structure in theinput, and to generate a repetitive structure in the output. In this case, theRepeatingGroupSerializer will iterate over all the Person elements at a givenlevel of nesting.

2. Within the RepeatingGroupSerializer , nest an EmbeddedSerializer . Thepurpose of this serialization anchor is to call a secondary serializer.

3. Assign the properties of the EmbeddedSerializer as illustrated.

Page 114: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 7. Defining a Serializer

58

The properties have the following meanings:

- The assignment serializer = FamilyTreeSerializer means that thesecondary serializer is the same as the main serializer. In other words, theserializer calls itself recursively.

- The schema_connections property means that the secondary serializershould process /Person/*s/Children/*s/Person as though it were a top-level /Person element. This is what lets the serializer move down throughthe generations of the family tree.

- The optional property means that the secondary serializer shouldn't causethe main serializer to fail when it runs out of data.

4. Run the serializer again. The result should be:

Defining Multiple Components in a Project

The issue of running a secondary serializer raises another point, which we haven'tcommented on before.

In the exercises throughout this book, each project has contained a single parser,serializer, or mapper. We did this for simplicity.

It is quite possible for a single project to contain multiple components, for example:

Multiple parsers, serializers, or mappers

Multiple script (TGP) files

Multiple XSD schemas

We won't give you an exercise on this subject, but you can find the instructions inthe ContentMaster Studio User's Guide. The following paragraphs are a briefsummary.

Multiple Parsers and SerializersTo define multiple parsers, serializers, or mappers, simply insert them at the globallevel of the IntelliScript. Here is an example:

Page 115: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 7. Defining a Serializer

59

To run one of the parsers or serializers, set it as the startup component and use thecommands on the Run menu.

The startup component can call secondary parsers, serializers, or mappers toprocess portions of a document. That is what you did in this exercise (with theinteresting twist that the main and secondary serializers were the same—arecursive call).

Multiple Script Files

You can use multiple script files to organize your work.

To create a script file, right-click the Scripts node in the ContentMaster Explorer,and choose New > Script. To add a script that you created in another project,choose Add File.

To open a script file for editing, double-click the file in the ContentMaster Explorer.

Multiple XSD SchemasTo add multiple schemas to a project, right-click the XSD node in the ContentMasterExplorer and choose Add File. To create an empty XSD schema, which you can editin any editor, choose New > XSD.

Points to Remember

A serializer is the opposite of a parser: it converts XML to other formats.

You can create a serializer either by generating it from an existing parser, or byediting the IntelliScript. A serializer contains serialization anchors, which areanalogous to the anchors that you use in a parser, but which work in the oppositedirection.

In an IntelliScript editor, you can hide or display the panes by choosing thecommands on the IntelliScript menu or on the toolbar.

We recommend that you store all files associated with a project in the project folder,which is located in your Eclipse workspace. You can determine the folder locationby using the File > Properties or Project > Properties command.

A single project can contain multiple parsers, serializers, mappers, script file, XSDfiles, etc. The startup component can call secondary serializers, parsers, or mappers,which are defined in the same project and which process portions of a document.Recursion (a component calling itself) is supported.

You can design a serializer that outputs to any data format, for example, MicrosoftExcel.

Page 116: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 8. Defining a Mapper

60

Defining a Mapper

So far, you have worked with two major ContentMaster components. With parsers,you converted documents of any format to XML. With serializers, you changedXML documents to other formats.

In this chapter, you will work with a mapper, which performs XML to XMLconversions. The purpose is to change the XML structure or vocabulary of the data.

The mapper design is similar to that of parsers and serializers. Within the mainMapper component, you can nest mapping anchors, Map actions, and othercomponents, which perform the mapping operations.

The exercise presents a simple mapper, which generates an XML summary report.Among other features, the exercise demonstrates how to configure an initiallyempty project, which you create by using the Blank Project wizard. You can use ablank project to configure any kind of data transformation, in addition to mappers.

Requirements Analysis

The goal of this exercise is to take an existing XML file, and generate a new XMLfile that has a modified data structure. The files are provided in the followingfolder:

tutorials\Exercises\Tutorial_6

Input XML

The input of the mapper is an XML file (Input.xml) that records the identificationnumbers and names of several persons. The input conforms to the schemaInput.xsd.

<Persons><Person ID="10">Bob</Person><Person ID="17">Larissa</Person><Person ID="13">Marie</Person>

</Persons>

8

Page 117: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 8. Defining a Mapper

61

Output XML

The expected output XML is a summary report (ExpectedOutput.xml), which liststhe names and IDs separately. The output conforms to the schema Output.xsd .

<SummaryData><Names>

<Name>Bob</Name><Name>Larissa</Name><Name>Marie</Name>

</Names><IDs>

<ID>10</ID><ID>17</ID><ID>13</ID>

</IDs></SummaryData>

Creating the Project

To create the mapper, you will start with a blank (empty) ContentMaster project:

1. On the ContentMaster Studio menu, choose File > New > Project.

2. In the left pane, select ContentMaster. In the right pane, select Blank Project.

3. In the wizard, name the project Tutorial_6, and click Finish.

4. In the ContentMaster Explorer, expand the project. Notice that it contains adefault TGP script file, but it contains no XSD schemas or other components.

5. Right-click the XSD node and choose Add File. Add the schemas Input.xsd andOutput.xsd. Both schemas are necessary because you must define the structureof both the input and output XML.

Page 118: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 8. Defining a Mapper

62

6. Notice that the Schema view displays the elements that are defined in bothschemas. You will work with the Persons branch of the tree for the input, andwith the SummaryData branch for the output.

7. We recommend that you copy the test documents, Input.xml andExpectedOutput.xml, into the project folder. By default, the folder is:

My Documents\SAP\ContentMaster\4.0\workspace\Tutorial_6

Configuring the Mapper

To configure the mapper and insert the mapping anchors, follow theseinstructions:

1. Open the script file, which has the default name Tutorial_6.tgp, in anIntelliScript editor.

2. Optionally, use the commands on the IntelliScript menu or the toolbar todisplay the IntelliScript pane only. A mapper does not use an example source,so you do not need the example pane.

3. Define a global component called Mapper1, and give it a type of Mapper. (For areview, see Global Components in Chapter 6, Parsing Word and HTMLDocuments.)

Page 119: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 8. Defining a Mapper

63

4. Assign the source and target properties of the mapper as illustrated. Theproperties define the schema branches where the mapper will retrieve itsinput and store its output.

5. Under the contains line of the mapper, insert a RepeatingGroupMappingcomponent. You will use this mapping anchor to iterate over the repetitiveXML structures in the input and output.

6. Within the RepeatingGroupMapping , insert two Map actions. The purpose ofthese actions is to copy the data from the input to the output.

Configure the source property of each Map action to retrieve a data holderfrom the input. Configure the target property to write the corresponding dataholder in the output.

Page 120: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 8. Defining a Mapper

64

7. Set the mapper as the startup component and run it. Check the Events viewfor errors.

8. Compare the results file, which is called output.xml, withExpectedOutput.xml. If you have configured the mapper correctly, the filesshould be identical (except perhaps for the <?xml?> processing declaration,which depends on options in the project properties).

Page 121: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 8. Defining a Mapper

65

Points to Remember

A mapper converts an XML source document to an XML output documentconforming to a different schema.

You can create a mapper by editing a blank project in the IntelliScript. A mappercontains mapping anchors, which are analogous to the anchors that you use in aparser or to the serialization anchors in a serializer. It uses Map actions to copy thedata from the input to the output.

Page 122: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 9. Running ContentMaster Engine

66

Running ContentMaster Engine

After you have created and tested a data transformation, you need to move it fromthe development stage to production. There are two main steps in doing this:

1. In ContentMaster Studio, deploy the data transformation as a ContentMasterservice. This makes it available to run in ContentMaster Engine.

2. Launch an application that activates ContentMaster Engine and runs theservice.

In this tutorial, you will perform the procedure for the HL7 parser, which youcreated in Chapter 4, Defining an HL7 Parser. If you prefer, you can perform theexercise on any other parser or serializer that you have prepared.

The application that activates ContentMaster Engine will be a simple MicrosoftVisual Basic 6 program, which calls the ContentMaster COM API. For the benefitof ContentMaster users who may not have Visual Basic, we provide both thesource code and the compiled program.

As alternatives to the approach that you will learn in this tutorial, you can runContentMaster services from the command line, by using the ContentMaster JavaAPI or by using the CGI web interface. For more information, see theContentMaster Engine Developer's Guide.

Deploying a Parser as a Service

Deploying a ContentMaster service means making a data transformation availableto ContentMaster Engine. By default, the repository location is:

c:\Program Files\SAP\ContentMaster\ServiceDB

Note that there is no relation between ContentMaster services and Windowsservices. You cannot view or administer ContentMaster services in the WindowsControl Panel.

To deploy a data transformation as a service:

1. In the ContentMaster Explorer, select the project. We suggest that you use theTutorial_2 project, which you prepared in Chapter 4, Defining an HL7 Parser.

2. Confirm that the startup component is selected and that the datatransformation runs correctly.

3. On the ContentMaster menu, choose Project > Deploy.

9

Page 123: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 9. Running ContentMaster Engine

67

4. The Deploy Service window displays the service details. You may edit theinformation as required.

5. Click the Deploy button.

6. At the lower left of the ContentMaster Studio window, display the Repositoryview. The view lists the service that you have deployed, along with any otherContentMaster services that have been deployed on the computer.

ContentMaster COM API Application

You can run a service in ContentMaster Engine in any of the following ways:

By using commands in the ContentMaster command-line interface.

By programming an application that calls the ContentMaster COM API.

By programming an application that calls the ContentMaster Java API.

By posting to the ContentMaster CGI interface.

For the purpose of illustration, we have supplied a sample application that uses thesecond approach. The application is programmed in Microsoft Visual Basic 6.0,and it calls the ContentMaster COM API. The source code and the compiledexecutable file are stored in the folder:

tutorials\Exercises\CMComApiTutorial

The application is described in the following paragraphs.

Page 124: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 9. Running ContentMaster Engine

68

Source Code

The Visual Basic application displays the following form:

The user enters the source document path, an output document path, and thename of a ContentMaster service. When the user clicks the Run button, theapplication executes the following code:

Private Sub cmdRun_Click()

Dim objCMEngine As CM_COM3Lib.CMEngine 'ContentMaster EngineDim objCMRequest As CM_COM3Lib.CMRequest 'Request generationDim objCMStatus As CM_COM3Lib.CMStatus 'Status display

Dim strRequest As String 'Request stringDim strOutput As String 'Output string (not used in this sample)Dim strStatus As String 'Status string

'Check for the required inputIf txtSource = "" Then

MsgBox "Please enter the path of the source document"Exit Sub

End If

If txtOutput = "" ThenMsgBox "Please enter the path of the output document"Exit Sub

End If

If txtService = "" ThenMsgBox "Please enter the name of a deployed " & _

"ContentMaster service"Exit Sub

End If

Me.MousePointer = vbHourglass

Page 125: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 9. Running ContentMaster Engine

69

'Initialize ContentMaster EngineSet objCMEngine = New CM_COM3Lib.CMEngineobjCMEngine.InitEngine

'Generate a request stringSet objCMRequest = New CM_COM3Lib.CMRequeststrRequest = objCMRequest.Generate( _

txtService.Text, _objCMRequest.FileInput(txtSource.Text), _objCMRequest.FileOutput(txtOutput.Text), _"", "", "")

'Execute the requeststrStatus = objCMEngine.Exec(strRequest, "", "", strOutput)

'Display the statusSet objCMStatus = New CM_COM3Lib.CMStatusCall MsgBox( _

"Return code = " & objCMStatus.IsGood(strStatus) & vbCr & _objCMStatus.GetDescription(strStatus), _vbOKOnly, "ContentMaster Status")

Me.MousePointer = vbDefault

Set objCMEngine = NothingSet objCMRequest = NothingSet objCMStatus = Nothing

End Sub

Explanation of the API Calls

The Visual Basic sample uses three ContentMaster COM API objects:

CMRequestGenerates a request string, which specifies the service name, the input, theoutput, and other details of the operations that ContentMaster Engine shouldperform.

CMEngineExecutes the request and generates the output.

CMStatusDisplays information about the results of the ContentMaster Engine operation,such as a return code and error messages.

The procedure for using these objects, demonstrated in the sample, is as follows. Asimilar procedure is applicable to any ContentMaster COM API application.

Page 126: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 9. Running ContentMaster Engine

70

1. Define the request parameters, such as the input location, output location, andContentMaster service name.

2. Use CMRequest to generate a request string.

3. Use CMEngine to execute the request.

4. Use CMStatus to confirm that the request ran successfully.

For More Information

For complete information about the COM API, see the ContentMaster COM APIReference.

Running the COM API Application

Assuming that you have the Visual Basic 6.0 run-time library on your computer(which is true on most Windows computers), you can run the sample Visual Basicapplication. Of course, if this were a production application, we would package itin a setup file that includes the required run-time library.

1. In your working copy of the Tutorials folder, execute the following file:

Tutorials\Exercises\CMComApiTutorial\CMComApiTutorial.exe

2. Type the full path of the source document and the output document. Type thename of the ContentMaster service that you deployed (by default, the name ofthe project).

For example, to run the sample HL7 parser, you might copy the sourcedocument hl7-obs.txt to the c:\temp directory, and enter the options asillustrated below:

3. Click the Run button. After a moment, the application displays a statusmessage:

Page 127: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 9. Running ContentMaster Engine

71

A return code of 1 means success. If an error occurred during the serviceoperation, the application displays an error message.

4. Open the output file in Notepad or Internet Explorer.

The output should be identical to the output that you generated when you ranthe application in ContentMaster Studio.

If an error occurred, ContentMaster Engine stores an event log in theContentMaster reports location, by default

Page 128: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 9. Running ContentMaster Engine

72

c:\Documents and Settings\<USER>\Application Data\SAP\ContentMaster\4.0\CMReports

where <USER> is your user name. To change the reports location, see thechapter on the Configuration Editor in the ContentMaster Administrator's Guide.

To view a log, you can drag the *.cme file to the Events view of ContentMasterStudio.

Points to Remember

To move a parser or a serializer from development to production, you can deployit as a ContentMaster service. The services are stored in the ContentMaster repository.

You can run a ContentMaster service in ContentMaster Engine via the command-line interface, by API programming, or via the CGI interface.

Page 129: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 Index

73

Index of Volume 2

A

Acrobatparsing tutorial, 1

Acrobat filesparsing, 1

Acrobat Reader, 2actions

computing totals, 1tutorial, 20

Adobe Reader, 2advanced properties, 1, 19Alternatives

anchor, 23ambiguities

resolving, 25ambiguous source text

resolving, 43anchor

Group, 35anchors

Alternatives, 23Content, 33defining positional, 8EnclosedGroup, 36SetValue, 21

APICOM example, 68tutorial, 66

B

basic properties, 1, 19binary data

parsing, 5blank project

editing, 61boilerplate

ignoring, 1

C

CMComApiTutorialrunning ContentMaster Engine, 66

COM APIVisual Basic example, 68

componentsglobal, 45, 47

Contentanchor, 33

ContentMaster COM APIVisual Basic source code example, 68

ContentMaster Engineactivating, 67API tutorial, 66

ContentSerializerserialization anchor, 55

countof newlines, 15using to resolve ambiguities, 43

D

default transformers, 41delimiters

LearnByExample, 43positional, 1

deploying servicestutorial, 66

document processorsexample, 6for Word documents, 28

E

EmbeddedSerializerserialization anchor, 57

empty elementsin XML output, 35

Page 130: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 Index

74

EnclosedGroupanchor, 36

Excelserialization tutorial, 51

F

folderproject, 54

formatpositional, 1

formattingusing for parsing, 25

G

global components, 45, 47Group

anchor, 35

H

HTMLbackground information, 26converting Word documents to, 25generated by Word, 29removing code from output, 41

I

invoicesparsing, 2

L

LearnByExampleinterpreting delimiters, 43

M

Mapaction, 63

mapperscreating, 61definition, 60tutorial, 60

Microsoft Wordparsing documents, 25parsing tutorial, 25

N

newlinesas separator of RepeatingGroup, 15searching up to, 9

O

offsetContent anchor, 8

optional propertyof Group, 35

P

parsersmultiple in project, 58

PDFparsing tutorial, 1

PDF filesparsing, 1viewing, 2

PDF to Unicode (UTF-8)document processor example, 6

PdfToTxt_3_00document processor example, 6

positional format, 1positional parsing

tutorial, 1processors

document, 6for Word documents, 28WordToHtml, 25

projectdetermining folder location, 54

project propertiesin ContentMaster Studio, 54

propertiesproject, 54

R

recursive serializer calls, 51repeating groups

nested, 1, 13RepeatingGroupMapping

mapping anchor, 63RepeatingGroupSerializer

serialization anchor, 57requirements analysis

for parsing, 1

S

search scope, 11, 12controlling, 1

serialization anchors, 51ContentSerializer, 55defining, 55

Page 131: Getting Started with ContentMaster - SAP...SAP Conversion Agent by Itemfield (ContentMaster) Getting Started with ContentMaster Version 4.0 This product has been renamed as SAP Conversion

Getting Started with ContentMaster, Volume 2 Index

75

EmbeddedSerializer, 57RepeatingGroupSerializer, 57

serializerscreating, 53multiple in project, 58tutorial, 51

servicesContentMaster, 66ContentMaster repository, 66

SetValueanchor, 21

sourceof Mapper, 63

source documentstesting parser on, 48

T

targetof Mapper, 63

testingsource documents, 48

totalscomputing, 20

transformers, 42configuring, 44default, 41

Tutorial_3positional parsing, 1

Tutorial_4loosely structured documents, 25

Tutorial_5defining a serializer, 51

Tutorial_6defining a mapper, 60

tutorialsContentMaster Engine API, 66parsing HTML document, 25parsing Word document, 25positional PDF parsing, 1running service in ContentMaster Engine,

66

V

variablesdefining, 31

Visual BasicContentMaster COM API example, 68

W

Wordparsing documents, 25parsing tutorial, 25

WordToHtmlprocessor, 25, 28

WordToRtfprocessor, 28

WordToTxtprocessor, 28