FSOpenLink
User’s handbook

Handbook version : 3.23.00
System version : 3.23
Printed on : 23.12.2011

EntireLink Services
Geneva, Switzerland
[email protected]


Contents

1. Introductory elements
    1.1 The FSOpenLink system
        1.1.1 Purpose
        1.1.2 Platform(s)
        1.1.3 Technologies
        1.1.4 Installing
        1.1.5 Dependencies
        1.1.6 Recommended system
        1.1.7 Misc. information
            Author
            Licensing
            Copyright
        1.1.8 Reference projects & usages
        1.1.9 Glossary
    1.2 Records linkage: general requirements
        1.2.1 Common attributes
        1.2.2 Efficient candidate pairs detection
        1.2.3 Powerful measure of attributes similarity
        1.2.4 Construction of a sound score & classification state
        1.2.5 Good understanding of process limits
        1.2.6 How can FSOpenLink help ?
2. The FSOpenLink realm
    2.1 General aspects
        2.1.1 Expandability & tunability
        2.1.2 Large datasets readiness
        2.1.3 Full validation
        2.1.4 User-level and system-level logging
    2.2 Steps of a RL task performed by FSOpenLink
        2.2.1 Schematic illustration of the key records linkage steps & elements
        2.2.2 Overview of system architecture components
    2.3 System-specific elements and concepts
        2.3.1 Datasource
        2.3.2 Datasource type
            ‘FLAT’ type
            ‘HPFLAT’ type
            ‘RDB’ type
            ‘WEB’ type
        2.3.3 Operational datasource
        2.3.4 Blocking strategy
        2.3.5 Gamma vector
        2.3.6 Metrics
        2.3.7 Normaliser
        2.3.8 Tokeniser
        2.3.9 XML descriptor
            Datasource descriptor (“DSD”)
            Blocking strategy descriptor (“BSD”)
            Gamma vector descriptor (“GVD”)
        2.3.10 Linkage mode
        2.3.11 Job phase
    2.4 Configuration
3. Preparing a records linkage task
    3.1 Choice of fields, their related normalisers and metrics
    3.2 Choices when using statistical linkage (“FS”) mode
    3.3 Choices when using similarity search (“SS”) mode
        3.3.1 The SSClassifier object
            Default SSClassifier object provided with the system’s core package
    3.4 Choice of nomenclature to express final results
        3.4.1 The SQRBuilder object
            Default SQRBuilder object provided with the system’s core package
        3.4.2 Automatic qualification of “orphan” records
4. User-exposed objects properties
    4.1 GammaVector
        4.1.1 Public attributes
        4.1.2 Public methods
    4.2 MatchedPair
        4.2.1 Public attributes
        4.2.2 Public methods
5. User-level components APIs
    5.1 Normaliser API
    5.2 Tokeniser API
    5.3 Metric API
    5.4 SSClassifier API
    5.5 SQRBuilder API
    5.6 MatchedPairFilter API
6. System’s command line & parameters
    6.1 General command line structure
    6.2 List of command line parameters
        6.2.1 jobConfig
            Specifications
            Description
        6.2.2 validate
            Specifications
            Description
        6.2.3 phases
            Specifications
            Description
        6.2.4 mode
            Specifications
            Description
        6.2.5 chooseLevels
            Specifications
            Description
        6.2.6 filterMP
            Specifications
            Description
        6.2.7 multiProc
            Description
        6.2.8 poolSize
            Specifications
            Description
        6.2.9 multiplWarningLimit
            Specifications
            Description
        6.2.10 multiplDiscardLimit
            Specifications
            Description
        6.2.11 CSVExport
            Specifications
            Description
        6.2.12 FlatExport
            Specifications
            Description
        6.2.13 SampleExport
            Specifications
            Description
        6.2.14 SummaryExport
            Specifications
            Description
        6.2.15 PlotExport
            Specifications
            Description
        6.2.16 AllExport
            Specifications
            Description
        6.2.17 help
            Specifications
            Description
7. System-specific configuration elements
    7.1 Global parameters configuration file
    7.2 Structure of the system configuration file
    7.3 Global configuration parameters & their effect
8. Job-specific configuration elements
    8.1 General points
    8.2 Structure of a Similarity Search job config. file
    8.3 Structure of a Fellegi-Sunter job config. file
    8.4 Job-specific config. parameters & their effect
        8.4.1 Job general resources
        8.4.2 Parameters estimation-related parameters (FS mode)
        8.4.3 Scoring-related parameters (FS mode)
        8.4.4 Scoring-related parameters (SS mode)
        8.4.5 Scoring-related parameters (all modes)
        8.4.6 How FSOL locates external job resources
9. Dealing with metrics
    9.1 Introduction
        9.1.1 Currently implemented architecture of ‘metrics’ components
    9.2 Currently provided DLLs
        9.2.1 DMeta
        9.2.2 PLev
        9.2.3 GenEdit
    9.3 ‘StringMetrics’ toolbox library contents
        9.3.1 List of available functions & related distances
            Typographic/lexical distances
            Phonetic encodings
            Phonetic distances
        9.3.2 The (M)GED distances and their configuration
            What (M)GED do
            (M)GED default parameters
            (M)GED weights specification files
            Symmetry preservation
            Defaulting mechanism
10. HTTP linkage server
    10.1 Introduction
        10.1.1 Purpose
        10.1.2 Usage
    10.2 Duty cycle: typical example
    10.3 Server management
        10.3.1 Dedicated configuration file
        10.3.2 Linkage server config. parameters and their effect
        10.3.3 Server startup/shutdown
        10.3.4 User-level and system-level logging
11. Currently available bundles
    11.1 General points
        11.1.1 Bundle packaging
    11.2 “First names” bundle
        11.2.1 List of bundle components
        11.2.2 “Firstname_dist” component usage
            Attribute-level distance assessment
            Token-level distance assessment
            Metrics ‘pinning’ feature: managing an exceptions list
    11.3 “Family names” bundle
        11.3.1 List of bundle components
        11.3.2 “Familyname_dist” metrics usage
            Attribute-level distance assessment
            Token-level distance assessment
            Metrics ‘pinning’ feature: managing an exceptions list
    11.4 “Birthdates” bundle
        11.4.1 List of bundle components
        11.4.2 “Birthdate_dist” metrics usage
    11.5 “CH Street address” bundle
        11.5.1 List of bundle components
    11.6 “CHPlaces” bundle
        11.6.1 List of bundle components
12. Descriptors XSD
    12.1 Datasource descriptor (“DSD”)
        12.1.1 Graphical schema
        12.1.2 Elements description
    12.2 Blocking strategy descriptor (“BSD”)
        12.2.1 Graphical schema
        12.2.2 Elements description
    12.3 Gamma vector descriptor (“GVD”)
        12.3.1 Graphical schema
        12.3.2 Elements description
13. Appendices
    13.1 Typical user logfile sample

____________________________


What’s new in version 3.23

• XML descriptors are now accepted with any encoding (provided the encoding is properly declared in the XML file header).
• New, optional <ComponentDescr> tag in the <ComponentProperties> section of datasource descriptors: allows for a full-text, user-oriented description of the attribute (intended for use with the HTTP linkage mode, to display a full-text description of the target datasource attributes shown in the linkage results page).
• New, optional “linkageTaskDescr” in the FSO job config file: allows for a full-text, user-oriented qualification of the linkage task.
• The HTTP linkage server now has fully customizable HTML forms.

Installation instructions:

• The HTTP linkage server config file changes with version 3.23. You can deploy version 3.23 over an existing FSOL installation, but note that your custom HTTP linkage server config. file will be overwritten. Save your existing config. parameters beforehand, then restore them into the new config. file structure.

__________________


1. Introductory elements

1.1 The FSOpenLink system

1.1.1 Purpose

Offer an open, generic, versatile and efficient framework to support all-purpose records linkage tasks.

Records linkage:

Class of data mining problems dealing with the art of reliably detecting which records, among two independent datasets, are mutually related. This relation usually stems from them being representations of a common underlying entity. Two such mutually independent representations can differ, since each of them is partial and/or distorted relative to the other. To allow for an automated detection of this connection, the two representations must exhibit a sufficient amount of common information. Additionally, this shared information must be expressed in a way that exhibits sufficient similarity to be detectable and assessable by a predefined computerised logic.
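As a purely illustrative sketch of this definition (it is not FSOpenLink code and does not use its API), the snippet below compares one "query" record with two "target" records over the attributes they have in common. The attribute names, the sample records and the use of Python's standard `difflib` similarity ratio are all assumptions made for the example.

```python
from difflib import SequenceMatcher

# Hypothetical attribute names and records, invented for this illustration.
COMMON_ATTRIBUTES = ("family_name", "first_name", "birth_year")

query_record = {"family_name": "Muller", "first_name": "Anna", "birth_year": "1970"}
target_records = [
    {"id": "T1", "family_name": "Mueller", "first_name": "Anna", "birth_year": "1970"},
    {"id": "T2", "family_name": "Smith", "first_name": "John", "birth_year": "1985"},
]


def attribute_similarity(a, b):
    """Similarity in [0.0, 1.0] between two strings (1.0 means identical)."""
    return SequenceMatcher(None, a, b).ratio()


def record_similarity(query, target, attributes=COMMON_ATTRIBUTES):
    """Average similarity over the attributes shared by both records."""
    scores = [attribute_similarity(query[name], target[name]) for name in attributes]
    return sum(scores) / len(scores)


# Pick the target record that most resembles the query record.
best_target = max(target_records, key=lambda t: record_similarity(query_record, t))
print(best_target["id"])
```

Here the record "T1" is selected even though its family name is spelled differently, which is the kind of partial or distorted representation the definition refers to.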

1.1.2 Platform(s)

FSOpenLink currently exists for Win32 platforms. It has been successfully tested and used in production on:

• Microsoft Windows XP Professional (SP2 & SP3)
• Microsoft Server 2003 (SP1)

We currently have no reason to think that the system would not run flawlessly on other post-Win98 32-bit Windows platforms. The core system should also run without much trouble on Win64 platforms. Some optional bundles, however, rely on external DLLs currently compiled for 32-bit systems; such DLLs would need to be explicitly recompiled for 64 bits to operate properly under 64-bit Windows platforms. FSOpenLink should be easily portable to Unix-type operating systems, but nothing has been undertaken in this direction so far.


1.1.3 Technologies

FSOpenLink’s core system exclusively makes use of open-source products and technologies, and is therefore free of any third-party royalties. The current implementation relies upon the following products (among others):

• Python core system, version 2.5.2.
• Psyco version 1.6 (Python just-in-time execution optimiser).
• NumPy version 1.3.0 (open-source Python library of scientific tools).
• Matplotlib version 0.99 (2D plotting Python library).
• SQLite 3.6.19 (embedded zero-config. RDBMS).

1.1.4 Installing

The current implementation has a very discreet footprint on its deployment platform: few entries are added to the Windows Registry, no environment variables are added to the operating system, no new services are installed, etc. Simply run the installer and deploy into a target directory (e.g. “FSOpenLink”). This directory contains all of FSOpenLink’s software components. These are visible in the snapshot below, and individually detailed in the following table.


| Component | Class | Description |
|---|---|---|
| FSOpenLink.exe | System | Batch linkage mode executable |
| FSOpenLink_server.exe | System | HTTP linkage server executable |
| lib_FSOL.dat ; lib_FSOLsrv.dat ; PCGW32.dll ; python25.dll | System | Executables-associated resource libraries |
| tcl ; mpl_data | System | Directories containing various technical, Python packages-related resources necessary to the system’s proper operation. |
| FSOL_rsc | System and User | Directory containing various files necessary to the system’s proper operation. |
| documentation | User | Directory containing various documents pertaining to the use of FSOpenLink itself (like this manual) or to the general “records linkage” field. |
| xsd | User | Directory containing the XML-Schema definition files that define the structure of each XML descriptor used by FSOpenLink. |
| metrics | User | Library where the user is invited to store his own metrics (functions providing a measure of the “distance” between 2 attribute instances) |
| normalisers | User | Library where the user is invited to store his own attribute normalising functions |
| tokenisers | User | Library where the user is invited to store his own attribute tokenising functions |
| SSClassifiers | User | Library where the user is invited to store his own classification schemes used by RL tasks based on the provided ‘SimilaritySearch’ linkage mode. |
| SQRBuilders | User | Library where the user is invited to store his own final consolidation schemes used to produce the final linkage diagnosis on every record. |
| tools | User | Additional (free) 3rd party tools, made available to make working with FSOpenLink more comfortable. |


1.1.5 Dependencies

So far, the only known dependency on a third-party library is the Python interpreter’s dependency on Microsoft’s Visual C++ runtime library: MSVCRT71.dll. Your platform must therefore have this component available in a location that is accessible to any OS-external executable. If in doubt, just search for the MSVCRT71.dll file on your system and copy it into the FSOpenLink target deployment directory.

1.1.6 Recommended system

Depending on the volume of data to be processed, FSOpenLink’s performance may greatly benefit from:

• 2GB of RAM, ideally dedicated to the FSOpenLink process alone. We therefore recommend 3GB of RAM (3GB being the maximal amount of RAM usually recognised by Win32 platforms).
• A fast CPU (computing some metrics values can be quite CPU-intensive).
• Fast mass storage read/write performance (disk usage is typically very intensive during a linkage task, although FSOpenLink uses several internal strategies to reduce its dependency on the mass storage performance bottleneck).

Remarks:

1. The larger the processed datasets, the more measurable the effect of these factors on the overall task completion time.
2. Current system settings are chosen to make best use of a platform with 2GB of RAM or more. Below that amount, intense disk swapping or memory errors may occur and degrade the system’s behaviour and performance.

1.1.7 Misc. information

Author

Jérôme Magnin, PhD (Geneva, Switzerland).

Licensing

Flexible licenses are available allowing usage of FSOpenLink by 3rd parties. Conditions can be obtained on request at: [email protected].

Copyright

This software is currently entirely the property of its author. It is released under the “EntireLink Services” label. This label designates the author’s field of competences & activities, rather than a genuine corporate name.


1.1.8 Reference projects & usages

FSOpenLink was extensively and successfully put to work in 2008 and 2009 at the Swiss Central Compensation Office (Swiss federal social insurances coordination centre). The context was a very large reconciliation operation involving 2’700 administrative registers from a wide range of sources, totalling 43 million records, whose objective was the coherent assignment of the new 13-digit Swiss citizen identifier (OASIN13) to those records.

Since 2007, FSOpenLink has been used daily at the Swiss Central Compensation Office to assign the new Swiss citizen identifier (OASIN13) to many individual administrative registers from various public and private organisations, by direct comparison with the master reference database hosting this identifier.

1.1.9 Glossary

We explain here, once and for all, some particular terms or expressions that appear repeatedly in this document and that need to be explicitly defined in order to be fully and unambiguously understood.

| Term | Synonym(s); abbreviation | Definition |
|---|---|---|
| “The system” | FSOpenLink | |
| “Record linkage” | RL | See section 1.1.1 |
| “Linkage mode” | / | Conceptual framework within which the linkage is performed. Currently, two of them exist in FSOpenLink: ‘SimilaritySearch’ mode and ‘Fellegi-Sunter’ (or ‘statistical’) mode. |
| Recordset | Datasource | Collection of records, organised in a systematic & structured way. |
| “Query”/“Target” datasource | QDS, TDS | When 2 recordsets are being linked together, one is arbitrarily called the “query” datasource (containing “query” records), and the other the “target” datasource (hosting “target” records). |
| “Query”/“Target” record | QR, TR | A record belonging to the ‘query’ or ‘target’ recordset respectively. |
| Attribute | Field ; (Record) Component | A record is an ordered collection of 1 to n attributes. |
| Attribute name | AttributeID ; ComponentID | The (unique) name given to a specific attribute. |
| Attribute instance | Field value ; ComponentValue | A particular value taken by an attribute. Example: “Smith” and “Müller” are 2 different attribute instances of the attribute “family name”. |
| Metric(s) | Distance function | A function mapping two attribute instances onto a numerical value representing in the best possible way the degree of similarity between the 2 instances. Metrics functions values are typically (but not obligatorily) normalised between 0.0 and 1.0 (0.0 meaning attribute instances are identical ; 1.0 meaning totally dissimilar instances). NB: using normalised metrics is a requirement when using the statistical (‘FS’) linkage mode. |
| (Matching) score | / | A real number, assigned to a pair of (query, target) records. This number can be normalised (i.e. included in a well-defined finite range) or not. |
| Classification state | / | A category, chosen in a predefined nomenclature (= categorical variable), to which a pair of (query, target) records is assigned. Each possible category bears a precise definition that usually depends on the RL methodology. |
| Matched pair | MP | A pair of (query, target) records having been assigned a classification state designating it as a matcher (‘M’) or potential matcher (‘C’). |
| Scored query record | SQR | Record belonging to the query recordset, that has been assigned a final diagnosis w.r.t. the whole matching process. Example of diagnosis: “positive matches onto target records {x1;x2} ; clerical matches onto target records {x3,x4,x5;x6}” |

1.2 Records linkage: general requirements

By nature, records linkage is a challenging task. Reaching good levels of efficiency and power at the same time is never guaranteed. Being fully aware of the few important aspects of the topic, as well as of their mutual influence, is the first step towards a successful use of FSOpenLink on real-life linkage problems.

1.2.1 Common attributes

The “linkability” of two recordsets depends heavily on the degree of common information they contain. This condition is fulfilled when one can ascertain the presence of attributes common to the two recordsets. These must be in sufficient quantity (and quality) to provide an adequate level of discriminating power.


1.2.2 Efficient candidate pairs detection

When recordset size exceeds a few thousand entries, comparing every record of one database with every record in the other (Cartesian product) becomes infeasible in terms of computation time. We therefore must in general rely on a strategy allowing us to initially “pre-select” only those record pairs that have a decent chance to be matchers. Only those candidate pairs are then retained for further processing.

1.2.3 Powerful measure of attributes similarity

What typically makes a records linkage task inherently difficult is the following: attributes that are common to the two recordsets have identical instances expressed differently. The organisations managing the recordsets determine the values of these attributes through independent processes, so “noise” inevitably affects the 2 information sources independently. An automated system should therefore be tailored to recognise such divergences to the best possible degree, while correctly balancing them against attribute instances that are genuinely different.

Example: German surnames “Koller” and “Keller” are different instances of attribute “name”. “Karrer” and “Karer”, on the other hand, represent the correct and the typo-plagued spelling, respectively, of another German surname (hence two identical attribute instances). Since both pairs differ by only 1 character, the criterion “only 1 character difference between 2 instances implies same instance” is obviously not precise enough, here, to avoid detection errors.

By “powerful” measurement, we mean the capability of an automated logic to correctly separate the first case in the above example from the second one. This can be quite challenging indeed.
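The surname example can be checked numerically. The sketch below uses a plain Levenshtein edit distance, a standard textbook technique chosen here for illustration; it is not claimed to be the measure shipped with FSOpenLink, and the function name is hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

# Two genuinely different surnames, and one surname with a typo.
print(levenshtein("Koller", "Keller"))
print(levenshtein("Karrer", "Karer"))
```

Both calls yield the same distance, which is exactly why a raw character count cannot separate the two cases on its own.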

1.2.4 Construction of a sound score & classification state

From the observed degrees of similarity between common attributes, one usually assigns a score value to a records pair. This value further serves as a guide to determine if the pair should be considered a “matcher” with good confidence (“M”), a potential matcher that should be clerically reviewed (“C”), or a mismatching pair (“U”).

map of 2-by-2 attribute similarities → score value → classification state

Note that the intermediate step (build-up of a score value) can be optional, depending on the records linkage methodology. It is however very often considered a great help to handle the results, even when not prescribed by the methodology.

1.2.5 Good understanding of process limits

When dealing with records linkage, it is essential to keep in mind that there is never, ever any free lunch. The most severe limitations incurred by this class of problems are inherent to information theory, and therefore inescapable. Detection power and reliability both depend on the following factors:

1. Availability of a sufficient set of common attributes.

2. Entropy of the distribution of each attribute's possible values.

3. Usability of common attributes (missing values rate).

4. Level of “noise” plaguing common attributes.

5. Resources available to the user to gain insight into the particularities of the actual datasets to be linked (e.g. “colour” of the noise affecting one given attribute).

6. Modelling skills of the user, i.e. the ability to come up with an automated logic that best discriminates between same attribute instances differing up to some random “noise”, and genuinely different attribute instances.

7. Finally, availability of a linkage platform that is sufficiently flexible and versatile to allow the seamless integration of the user’s efforts into a well-founded linkage methodology, thereby allowing for the best conditions to come up with satisfying results.
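Factors 2 and 3 can be estimated directly from the data before any linkage is attempted. The following is a minimal sketch using the standard Shannon entropy formula; the function names and the sample values are hypothetical, and treating `None` or an empty string as a missing value is an assumption of this sketch.

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def missing_rate(values):
    """Share of records where the attribute is unusable (None or empty string)."""
    return sum(1 for v in values if v in (None, "")) / len(values)

# Hypothetical sample: an attribute with few distinct values discriminates
# less than one with many distinct values.
sex = ["F", "M", "F", "M"]
birth_year = [1961, 1970, 1982, 1995]
print(shannon_entropy(sex), shannon_entropy(birth_year))
print(missing_rate(["Smith", "", None, "Keller"]))
```

An attribute with higher entropy and a lower missing rate contributes more discriminating power to the linkage.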

1.2.6 How can FSOpenLink help?

In the above list of limiting factors, it is clear that:

A. Factors 1 to 4 only depend on the data one is given to work with (together with any meta-information on the data that can be of any use for its correct interpretation). In general, the data constitute the “external” boundary conditions that have to be dealt with in any records linkage task.

B. Factors 5 and 6 are user-dependent. Recognising this implies that it is foolish to believe in the existence of a ready-made, “black-box” ultimate generic software providing optimal performance on a broad spectrum of records linkage problems. When faced with a new RL problem, optimality of the linkage process results can only be approached at the cost of a significant human contribution to the process. Again: “no free lunch…”

FSOpenLink’s core system provides a solution to address item 7 in the above list of limiting factors. It currently offers two different methodological frameworks to perform records linkage, into which data-aware user contributions find a “natural” place in the form of simple, attribute-level components: distance functions, normalising operators, and tokenising operators. FSOpenLink’s optional additional components provide, to some extent, good candidate solutions to the 5th and 6th limiting factors, in the form of ready-made attribute-specific bundles.


2. The FSOpenLink realm

2.1 General aspects

2.1.1 Expandability & tunability

FSOpenLink’s mantra:

A. The system’s core engine provides the user with all the technicalities pertaining to the chosen methodology.

B. The key ingredients in turn - those that determine the ultimate linkage performance - belong to the user’s domain, in the form of open libraries.

These open libraries can be supplemented anytime with new (user-defined) components. The cost to achieve this is low: some basic familiarity with the Python programming language. Besides, some ready-made, thoroughly developed and tested library components are available separately from the author. In this respect, the system can be considered the exact opposite of “out-of-the-box” software. Above all, it aims at freeing the user from the “black-box” effect that commercial solutions often exhibit (“you don’t know what’s in it, you cannot tune it to your specific needs, so be satisfied with the default suboptimal behaviour or pay us for additional service”).

2.1.2 Large datasets readiness

Sooner or later, you will be faced with the challenge of merging 2 large datasets (say, 10 million records each). FSOpenLink implements a datasource type specifically adapted to seamlessly handle such large amounts of data, and shown to perform satisfactorily in such demanding conditions, even on a standard office machine.

2.1.3 Full validation [new in version 3.07]

On request, the system performs a full validation of all the (user-definable) elements involved in every run of a FSOpenLink session: system’s main configuration file elements; job-specific configuration file elements (existence, accessibility and well-formedness of all specified XML descriptors & other file resources; existence & loadability of all specified user-level components).


2.1.4 User-level and system-level logging

• A detailed user-level logfile, located in the job output directory, keeps you informed of how things go while a task is being completed, and displays explicit error messages helping you locate where a problem has occurred and how to remedy it. See Appendix 13.1 for an illustrative example of a user-level logfile.

• A system logfile is also available at the root directory level; it displays any Python runtime messages.


2.2 Steps of a RL task performed by FSOpenLink

2.2.1 Schematic illustration of the key records linkage steps & elements

[Figure: flowchart of a records linkage task. Steps shown: instantiate & populate the 2 operational datasources, if not yet existing (attributes normalisation, block instances computation); in FS (“statistical”) mode, perform parameters estimation using the EM procedure, or use the user-provided learning set to compute the parameters; generate matched pairs from each specified block type; generate scored query records from the set of matched pairs; finalise the process (export results, zip/cleanup files). Data elements shown: operational query and target datasources, XML classification parameters, matched pairs, scored query records.]


2.2.2 Overview of system architecture components

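[Figure: system architecture components. User-provided inputs: the query (Q) and target (T) datasources, each with an XML descriptor (exploited attributes and their properties); an XML descriptor stating how candidate pairs are selected; an XML descriptor stating which attributes are compared, and how. System-provided components: BlocksGenerator, scoring engine (metrics Dx, Dy, Dz, Dw producing the Γ-vector for pairs in Q×T), ParamsEstimator, Classifier. The diagram ends with the system output.]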


2.3 System-specific elements and concepts

2.3.1 Datasource

The system “connects” to the 2 recordsets through the abstract notion of “datasource”. FSOpenLink then works “behind the scenes” to create and exploit ad-hoc data repositories, populated with the recordsets' contents, that are well suited to the RL operation itself. These ad-hoc data structures are called operational datasources.

2.3.2 Datasource type

Qualifies the specific data repository type FSOpenLink works with during an RL task. The datasource type has no influence on the nature of the process and its results. Types differ mainly by the performance they exhibit, and their degree of adaptation to large recordsets.

‘FLAT’ type

• Original records to be linked are to be read from a flat file, in the following canonical form: one record per line.

• Line format can be either “char-separated” (record attributes separated by a well-defined sign), or “fixed” (each record attribute is to be found within a given column range, identical among all lines).

• When dealing with this datasource type, FSOL first maps the flatfile structure into an “operational relational database” (SQLite standard), and works from the latter during all subsequent tasks.

• [new in version 3.08]: this datasource type allows for the operational database to be held in RAM instead of on disk, which can, when used with care, greatly reduce the linkage processing time.

• From version 3.22 on : this is the recommended datasource type to be used for any linkage task, independently of recordsets size !
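The two line formats and the mapping onto an SQLite database can be illustrated with Python's standard `sqlite3` module. This is only a sketch of the idea: the function names, the table name, the column names and the sample records are all hypothetical, and the schema FSOpenLink actually builds is not documented here.

```python
import sqlite3

def parse_separated(line, separator=";"):
    """Char-separated format: attributes split on a well-defined sign."""
    return [field.strip() for field in line.rstrip("\n").split(separator)]

def parse_fixed(line, column_ranges):
    """Fixed format: each attribute sits in the same column range on every line."""
    return [line[start:end].strip() for start, end in column_ranges]

def load_into_memory_db(records):
    """Store parsed records in an in-RAM SQLite database (hypothetical schema)."""
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE records (id TEXT, name TEXT, birthdate TEXT)")
    connection.executemany("INSERT INTO records VALUES (?, ?, ?)", records)
    connection.execute("CREATE INDEX idx_name ON records (name)")
    return connection

# One record per line, as required by the canonical form.
lines = ["1;Smith;1970-01-31", "2;Müller;1965-12-02"]
db = load_into_memory_db(parse_separated(line) for line in lines)

# The same kind of record in fixed-column layout.
fixed_line = "0001" + "Smith".ljust(10) + "19700131"
fixed = parse_fixed(fixed_line, [(0, 4), (4, 14), (14, 22)])
```

Connecting to `":memory:"` keeps the whole database in RAM; passing a file path instead would hold it on disk.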

‘HPFLAT’ type

• As with ‘FLAT’ type, records to be linked are to be read from a flat file, in the canonical form: one record per line. Similarly also, line format can be csv or fixed.

• The difference with ‘FLAT’ type is that no operational database is used. Instead, the source flatfile gives rise to several other ‘operational’ flatfiles, from which FSOL works during all required tasks.

• ‘HP’ stands here for ‘High-Performance’. This datasource type has been created to solve efficiency problems met with the implementations of SQLite prior to version 3.7.8, which exhibited excessively high indexing time of very large recordsets.


‘RDB’ type

• Records to be linked are read from a relational database table.

• FSOL will need to supplement the source data in place (i.e. inside the database itself) with additional technical bits of information used by the linkage algorithm. The database access granted to FSOL must therefore allow the following privileges: table creation & indexing; unlimited SELECT on all tables.

• This datasource type is more or less “deprecated” and should only be chosen when other types cannot be considered. Its performance does not compare well to the FLAT and HPFLAT types, since it is heavily bound to the performance of the RDBMS itself, which usually requires some not-so-obvious tuning. Working from flat files remains by far the best choice when simplicity of use and performance are at stake.

‘WEB’ type

• [new in version 3.11]

• One single record is dynamically entered by the user on a web page form.

• The filled query form is then submitted to the HTTP linkage server, which maps it onto an operational SQLite database for further linkage operations onto the specified target dataset.

2.3.3 Operational datasource

A datasource of type “FLAT” or “WEB” is automatically and transparently mapped by the system onto a SQLite database structure (the “operational” datasource). The latter contains the records themselves, as well as tables containing “block instances” (see 2.3.4).

A datasource of type “HPFLAT” is automatically and transparently mapped by the system onto a data structure made of several flat files (whose set forms the “operational” datasource). These flat files contain the records themselves, as well as the “block instances” (see 2.3.4).

2.3.4 Blocking strategy

Collection of 1 to n “block definition(s)”. A “block definition” - or “block type” - is simply the enumeration of a well-defined subset of the set of all the attributes that are common to the two recordsets. Each of these attributes can optionally be associated with a tokenising operator (see “tokeniser”). Each block type can be seen as one particular “digest” of the record. Such digests work as a specific composite index, helping to efficiently bring together records having a good chance to be true matchers.


In practice, all possible combinations of tokens (1 per attribute) are generated at runtime, each of them acting as a composite index value. Two records having at least one index value in common will be selected for detailed comparison & scoring.

To summarise, a blocking strategy serves to efficiently bring together candidate matching pairs from the two recordsets to be linked, without having to perform a full Cartesian product.
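The mechanism can be sketched in a few lines of Python. The code below is an illustration of the principle only, not FSOpenLink's implementation: the function names, the dictionary-based record layout and the two tokenisers are hypothetical.

```python
from collections import defaultdict
from itertools import product

def block_keys(record, block_definition):
    """All combinations of tokens (one per attribute) for one block type."""
    token_lists = [tokeniser(record[attribute])
                   for attribute, tokeniser in block_definition]
    return set(product(*token_lists))

def candidate_pairs(query_records, target_records, block_definition):
    """(query ID, target ID) pairs sharing at least one composite index value."""
    index = defaultdict(set)
    for target_id, record in target_records.items():
        for key in block_keys(record, block_definition):
            index[key].add(target_id)
    pairs = set()
    for query_id, record in query_records.items():
        for key in block_keys(record, block_definition):
            pairs.update((query_id, target_id) for target_id in index.get(key, ()))
    return pairs

def first_letter(value):
    """Hypothetical tokeniser: upper-cased first letter."""
    return [value[:1].upper()]

def year_part(value):
    """Hypothetical tokeniser: year of an ISO-formatted date."""
    return [value[:4]]

block = [("name", first_letter), ("birthdate", year_part)]
queries = {"q1": {"name": "Karer", "birthdate": "1970-01-31"}}
targets = {"t1": {"name": "Karrer", "birthdate": "1970-01-31"},
           "t2": {"name": "Keller", "birthdate": "1982-05-17"}}
pairs = candidate_pairs(queries, targets, block)
```

Only the pair sharing the key `("K", "1970")` is retained; the second target record is never compared in detail, which is the whole point of avoiding the Cartesian product.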

2.3.5 Gamma vector

A vector of numerical values. In ‘statistical’ linkage mode, this vector is required to be normalised: each of its components must be within the [0; 1] range. This vector is an abstract representation (projection) of the degrees of similarity between the instances of corresponding attributes in two records. Each of its components is built by applying a user-chosen metric (see 2.3.6) to pairs of corresponding fields - or sets of fields - among those chosen to be compared in the two recordsets.

For a given records pair, its gamma vector represents the “map of 2-by-2 attribute similarities” referred to in section 1.2.4 (“Construction of a sound score & classification state”).

Caution: for reasons of consistency with the notation used in the Fellegi-Sunter seminal paper, each gamma vector component value is computed as 1-metric(x;y) (hence value 0 indicates a total mismatch of the corresponding attributes pair, and value 1 a total identity).
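The 1-metric(x;y) convention can be made concrete with a short sketch. The two metrics, the record layout and the function names below are hypothetical and serve only to show how one component per compared attribute is obtained.

```python
def case_insensitive_metric(x, y):
    """Hypothetical normalised metric: 0.0 if equal ignoring case, else 1.0."""
    return 0.0 if x.lower() == y.lower() else 1.0

def year_gap_metric(x, y, max_gap=10):
    """Hypothetical normalised metric: 0.0 if equal, 1.0 at max_gap years or more."""
    return min(abs(x - y) / max_gap, 1.0)

def gamma_vector(query_record, target_record, compared):
    """One component per compared attribute, computed as 1 - metric(x; y)."""
    return [1.0 - metric(query_record[name], target_record[name])
            for name, metric in compared]

compared = [("name", case_insensitive_metric), ("birth_year", year_gap_metric)]
query = {"name": "MUELLER", "birth_year": 1970}
target = {"name": "Mueller", "birth_year": 1975}
gamma = gamma_vector(query, target, compared)
```

Here the first component is 1.0 (total identity) and the second is 0.5, since the metric reports a distance of 0.5 for a 5-year gap.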

2.3.6 Metrics

Distance function providing a (continuous or discrete valued) measure of similarity between two corresponding fields - or sets of fields - in the query and target recordsets.

IN: two (sets of) record field instances of the given type.

OUT: (normalised) real number. With a normalised metric, the output ranges between 0.0 (total identity) and 1.0 (total dissimilarity).

[Figure: a metric takes attributes 1 to n of a query datasource record and attributes 1 to n of a target datasource record as input, and outputs a number.]


Example: the simplest normalised metric one can consider is the trivial (dichotomous) metric given by the rule: metric(x;y) = 0.0 if instance x is identical to instance y, 1.0 otherwise, in line with the convention stated above (0.0 for total identity). Equality is taken here in the sense of string identity if the attribute is of alphanumeric type, or number equality if numeric.
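As a sketch, here is the trivial metric next to a continuous one, both following the convention of section 2.3.6 (0.0 for total identity, 1.0 for total dissimilarity). The function names are hypothetical and these plain functions do not implement the system's Metrics API; the continuous variant relies on `difflib.SequenceMatcher` from the Python standard library, chosen here purely for illustration.

```python
from difflib import SequenceMatcher

def trivial_metric(x, y):
    """Dichotomous metric: 0.0 for identical instances, 1.0 otherwise."""
    return 0.0 if x == y else 1.0

def ratio_metric(x, y):
    """Continuous metric in [0.0, 1.0] derived from difflib's similarity ratio."""
    return 1.0 - SequenceMatcher(None, x, y).ratio()

print(trivial_metric("Karrer", "Karer"))  # typo counted as total dissimilarity
print(ratio_metric("Karrer", "Karer"))    # typo counted as a small distance
```

A continuous metric lets the downstream score distinguish a near miss from a complete mismatch, which the trivial metric cannot do.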

2.3.7 Normaliser

A normalising function, to be applied to an attribute of a certain type to transform it into a “normal form” suitable for comparison by a metric that expects its input arguments to be in normal form.

Example: a typical name normaliser will turn attribute instance “Müller” into “MUELLER”.

NB: normaliser and metric are closely bound together. What one of them implements should not be implemented again by the other.
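A minimal sketch of such a name normaliser follows. The function name and the exact rule set (German umlaut transliteration, whitespace collapsing, upper-casing) are assumptions made for illustration; they are not the rules of any normaliser shipped with the system.

```python
# Hypothetical transliteration table for German special characters.
GERMAN_TRANSLITERATION = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss",
                          "Ä": "AE", "Ö": "OE", "Ü": "UE"}

def normalise_name(value):
    """Transliterate umlauts, collapse whitespace, and upper-case the result."""
    for char, replacement in GERMAN_TRANSLITERATION.items():
        value = value.replace(char, replacement)
    return " ".join(value.split()).upper()

print(normalise_name("Müller"))
```

With this in place, a metric applied afterwards no longer needs to handle case or umlaut variants itself, in keeping with the NB above.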

2.3.8 Tokeniser

An attribute splitting or mapping function, producing n≥1 “tokens” derived from the attribute value. Generated tokens must be sufficiently “representative” of the informational contents of the attribute, to a) allow for an efficient pairing of records that have a good chance to be true matchers, and b) at the same time ensure that the presence of some reasonably low “noise” in the attribute instance will not hinder the pairing process.

Examples of sensible tokenising functions:

• Field splitting function: splits the attribute instance into several meaningful subparts (i.e. that reflect the intrinsic structure of the attribute value).

• Field mapping function: maps the attribute onto a phonetic or other symbolic representation.
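The two kinds of tokenising function listed above can be sketched as follows. Both function names are hypothetical, and the "skeleton" key is a deliberately simple toy mapping written for this example, not an established phonetic algorithm.

```python
import re

def split_tokeniser(value):
    """Field splitting: one token per subpart, split on '-', '/' or whitespace."""
    return [part for part in re.split(r"[-/\s]+", value) if part]

def skeleton_tokeniser(value):
    """Field mapping: toy phonetic-style key (first letter + de-duplicated consonants)."""
    value = value.upper()
    consonants = re.sub(r"[AEIOUY]", "", value[1:])
    collapsed = re.sub(r"(.)\1+", r"\1", consonants)
    return [value[:1] + collapsed]

print(split_tokeniser("1970-01-31"))
print(skeleton_tokeniser("Karrer"), skeleton_tokeniser("Karer"))
```

The correct and the misspelt surname map onto the same token, so a low level of noise does not prevent the two records from being paired.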

[Figure: global effort to be invested for a given performance level, shared between attribute normalisation (“preprocessing”) and attribute similarity level assessment.]


2.3.9 XML descriptor

A user-built XML parametrisation file with simple and well-defined syntax. It provides the linkage system with the information needed to perform an RL task according to the user’s specifications. Three different types of XML descriptors are currently required by the system to execute any RL task:

Datasource descriptor (“DSD”)

Enumerates all datasource properties:

• Chosen type (FLAT, HPFLAT, WEB…)

• Actual location of the recordset in the filesystem (FLAT & HPFLAT)

• File encoding (utf-8, windows-1252, EBCDIC,…) (FLAT & HPFLAT)

• Record format (csv, fixed columns) (FLAT & HPFLAT)

• HTML query form properties (WEB)

• Record composition (list of considered attributes with their name & type)

• Optional normaliser to be applied on each attribute

Blocking strategy descriptor (“BSD”)

Describes the user-chosen block types that will allow the system to efficiently bring together candidate records pairs. Each block definition consists in specifying:

• which attribute(s) are involved

• which (optional) tokeniser should be applied on every attribute

Gamma vector descriptor (“GVD”)

Enumerates:

• which among the common attributes must be compared

• which metric - among those available in the library - must be used on each attribute to perform the comparison.

2.3.10 Linkage mode

FSOpenLink currently implements two linkage modes, each of them relating to a well-defined methodology:

Linkage mode | Abbreviation | Short description
SimilaritySearch | SS | Scoring and classification are based on the exhibited degree of similarity between 2 records. The user defines the “gamma vector → score” mapping, as well as the two thresholds controlling the 3-state classifier levels.
Fellegi-Sunter | FS | Scoring and classification are based on the Fellegi-Sunter statistical records linkage methodology. Classification parameters estimation is performed by a well-defined algorithm in a preliminary operation (before linkage itself). The currently implemented algorithms are EM (see [Jaro]) and its simplest form ‘FREQ’.

2.3.11 Job phase

A typical, full records linkage process can be decomposed into several consecutive, mutually independent steps we refer to as “job phases”:

Phase designation | Abbreviation | Phase order index | Short description of what is performed
Datasource instantiation | DSI | 1 | Use the specifications contained in the two datasource descriptors to produce the operational datasources, from which actual linkage will be performed. This phase comprises attributes normalisation and block types instances generation.
Parameters estimation | PE | 2 | [Only for FS linkage mode] From the data at hand and minimal guidance by the user, computes the Fellegi-Sunter parameters that will condition statistical scoring & classification of records pairs.
Matched pairs generation | MPG | 3 | “Core” phase of the RL process. The blocking strategy is used to detect candidate pairs, which are submitted to the scoring & classification engine. For each candidate pair, the gamma vector is computed, then the designated classifier assigns a classification state to the pair. When classified as positive (‘M’) or potentially positive (‘C’), the result is stored in a “matched pairs database” to be processed during the subsequent phase.
Scored query records generation | SQRG | 4 | All matched pairs stored during the ‘MPG’ phase that concern one given query record are gathered. From them, a (consolidated) score & classification state is assigned to the query record. These form the ‘final diagnosis’ applied on the query record by the RL process.
Finalisation | FIN | 5 | Optional. Files containing results of phases ‘MPG’ and ‘SQRG’ are zipped. Auxiliary files are cleaned up.

2.4 Configuration

The system currently distributes configuration on two levels:

1. System-level, default configuration parameters. These are settable in a user-editable system configuration file. Such parameters then act as default parameters that apply to every linkage task. One such set of global parameters is common to both the standalone (FSOpenLink.exe) and the HTTP linkage server (FSOpenLink_server.exe) modules, since these two execution modes share most of their operating principles and ingredients. Parameters specific to the HTTP server module are defined in a separate configuration file.

2. Task-specific configuration parameters. They are specified in a job-specific user-editable configuration file, as well as in the two datasource descriptors.

As a general rule, task-level configuration parameters override system-level ones, allowing for fine-grained control at the task level. Refer to sections 7 and 8 for a detailed description of how configuration works.
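The override rule can be illustrated with Python's standard `configparser` module. This is a sketch of the principle only: the presence of a ‘[SCORING]’ section in the system-level file, the parameter values, and the way FSOpenLink itself loads its files are assumptions made for this example.

```python
from configparser import ConfigParser

# Hypothetical system-level defaults.
SYSTEM_DEFAULTS = """
[SCORING]
Cthres = 0.60
Mthres = 0.85
"""

# Hypothetical job-specific file: only one parameter is redefined.
JOB_SPECIFIC = """
[SCORING]
Mthres = 0.90
"""

config = ConfigParser()
config.optionxform = str             # keep option names case-sensitive
config.read_string(SYSTEM_DEFAULTS)  # system-level defaults are loaded first
config.read_string(JOB_SPECIFIC)     # task-level values then override them

c_thres = config.getfloat("SCORING", "Cthres")
m_thres = config.getfloat("SCORING", "Mthres")
```

The task keeps the system-level value of `Cthres` and uses its own value of `Mthres`.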


3. Preparing a records linkage task

For any new records linkage operation, you need to take the following decisions (and undertake the appropriate actions accordingly):

3.1 Choice of fields, their related normalisers and metrics

In the list of common (= comparable) fields in the 2 recordsets to be linked, choose the ones you will perform the linkage on. For each of them, you must choose:

1) whether a normalisation function should be applied to it (optional: can be ‘none’);

2) which metric (distance function) must be used by the system on that field, or on a subset made of n≥1 of them (mandatory).

If the desired functions are not available in the toolbox, or if those provided are considered inadequate, new ones can be created by the user, through simple inclusion of a new element (implementing the Metrics API, see “Technical reference” section) inside the appropriate toolbox directory.

3.2 Choices when using statistical linkage (“FS”) mode

In FS mode, the choices to be made by the user pertain essentially to the initial parameters estimation procedure, from which linkage parameters will be derived.

A) Which algorithm to use among those already available with the system?

B) How to obtain the training set of true positive matches required by the system to compute the parameters that will control its classifier behaviour? The available choices are:

b1) Manual construction: random sample selection in the query database, followed by manual search & assessment in the target database. Such a user-provided training set must comply with the following format: a flat file in which each line contains a 2-tuple “queryID;targetID” (separator: semicolon) indicating which pair of records in the query and target databases is designated by the user as a true match.

b2) Use the built-in functionality of the system’s parameter estimator, which allows it to automatically construct an M-training set. One must then carefully choose a blocking strategy and supplement it with an ad-hoc filter, ensuring that close to 100% of the retained matches are indeed true positives.
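Before handing a manually built training set to the system, it can be useful to check that every line follows the “queryID;targetID” format. The helper below is a hypothetical sketch for that purpose; the identifiers in the sample are invented, and skipping blank lines is an assumption of this sketch rather than documented system behaviour.

```python
import io

def read_training_pairs(lines):
    """Parse 'queryID;targetID' lines into a list of (queryID, targetID) tuples."""
    pairs = []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # assumption: blank lines are ignored
        fields = line.split(";")
        if len(fields) != 2 or not all(fields):
            raise ValueError(
                f"line {line_number}: expected 'queryID;targetID', got {line!r}")
        pairs.append((fields[0], fields[1]))
    return pairs

# Invented sample standing in for the flat file contents.
sample = io.StringIO("Q0001;T0457\nQ0002;T9031\n")
training_pairs = read_training_pairs(sample)
```

The same function accepts an open file object, since it only iterates over lines.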


3.3 Choices when using similarity search (“SS”) mode

In SS mode, the choices to be made by the user pertain to the build-up of a scoring and classification logic from the gamma vector. This logic is then exploited by the scoring engine to assign every candidate pair the following well-defined 2-tuple: (score value; classification state).

3.3.1 The SSClassifier object

The object implementing the (user-specified) scoring & classification logic is called an “SSClassifier”. In the job configuration file, 2 parameters, listed in section “[SCORING]” under the names “Cthres” (=L1) and “Mthres” (=L2), are passed to the SSClassifier by the system in order to allow for a job configuration level control of the implemented scoring and classification logic. (NB: the latter can also be unparametrised. The user is left totally free to exploit this “parameters forwarding” convenience or not.)

Default SSClassifier object provided with the system’s core package

The 2 levels 0.0 <= L1 <= L2 <= 1.0 from the ‘[SCORING]’ section of the job configuration file determine the score value thresholds (expressed in % of the highest possible score value) beyond which a classification state “M” (L2) or “C” (L1) is assigned to a records pair. (“M” stands for “Match”; “C” for “Clerical review required” or “Check”.)

The “score” associated with a gamma vector with given values is currently computed as 100 * (weighted average over the gamma vector component values). Its highest value is therefore 100.0, indicating a perfect identity between the 2 records being compared. The weight associated with every component in the gamma vector (i.e. with every compared attribute) can be specified in the gamma vector descriptor (optional element ‘ComponentSSScoreWeight’). If none is specified inside the GVD, all weights are assumed equal by default.

Should a different behaviour be desired (depending on the task at hand or on the user’s own preferences), a user-written SSClassifier can be provided to the system, provided it conforms to the required SSClassifier API (see “SSClassifier API”, section 5.4).
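The default scoring rule described above can be sketched as a standalone function. This is not the SSClassifier API (which is described in section 5.4): the function name and signature are hypothetical. Two assumptions are made: L1 and L2 are read as fractions of the highest score (100.0), and a score exactly equal to a threshold is taken to reach the corresponding state.

```python
def score_and_classify(gamma_values, c_thres, m_thres, weights=None):
    """Return (score, state) with score = 100 * weighted average of gamma values."""
    if weights is None:
        weights = [1.0] * len(gamma_values)  # equal weights by default
    weighted_sum = sum(w * g for w, g in zip(weights, gamma_values))
    score = 100.0 * weighted_sum / sum(weights)
    if score >= 100.0 * m_thres:    # assumption: threshold is inclusive
        state = "M"
    elif score >= 100.0 * c_thres:  # assumption: threshold is inclusive
        state = "C"
    else:
        state = "U"
    return score, state

# Three compared attributes, the last one only half similar.
print(score_and_classify([1.0, 1.0, 0.5], c_thres=0.6, m_thres=0.9))
```

With equal weights the pair above scores about 83.3 and falls into the clerical review band; raising the weight of the first two components would move it towards “M”.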

3.4 Choice of nomenclature to express final results

At the matched pairs generation (MPG) level and according to the complexity of the blocking strategy used, there can be several matched pairs generated for a given pair of (query;target) records.


3.4.1 The SQRBuilder object

After the MPG phase, a (user-specified) “SQRBuilder” object is invoked by the system to construct a single diagnosis based on the set of all collected matchedPairs concerning one given query record (generally based on the matching scores and classification states of those matchedPairs).

Default SQRBuilder object provided with the system’s core package

Implements the following sound scheme to assign a diagnosis state to every record in the query database (with identifier ‘queryID’):

Condition | Assigned state | State definition
All queryID-related matched pairs have classif. state ‘M’ and point to the same unique target record. | POS | POSitive identification
All queryID-related matched pairs have classif. state ‘M’ or ‘C’ but point to more than one unique target record. | POS_AMB | POSitive but AMBiguous identification: more than one candidate exists in the target database
All queryID-related matched pairs have classif. state ‘C’. | CLERIC | CLERICal review needed to safely assess the final status of identification
No matched pair has been generated for the given queryID. | NEG | NEGative identification

Should a different behaviour be desired (depending on the task at hand or on user’s own preferences), a user-written SQRBuilder can be provided to the system, that conforms to the required SQRBuilder API (see “SQRBuilder API”, section 5.5).
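The default diagnosis scheme can be sketched as a plain function. This is not the SQRBuilder API (described in section 5.5); the function name and the input layout are hypothetical. The table leaves two situations open, and the sketch settles them by assumption: when all pairs are ‘C’ it returns CLERIC even if several targets are involved, and when ‘M’ and ‘C’ pairs point to a single target it also returns CLERIC.

```python
def diagnose(matched_pairs):
    """Diagnosis for one query record.

    matched_pairs: list of (target_id, state) tuples, state being 'M' or 'C'.
    """
    if not matched_pairs:
        return "NEG"
    targets = {target_id for target_id, _ in matched_pairs}
    states = {state for _, state in matched_pairs}
    if states == {"C"}:
        return "CLERIC"   # assumption: takes precedence over ambiguity
    if len(targets) > 1:
        return "POS_AMB"
    if states == {"M"}:
        return "POS"
    return "CLERIC"       # assumption: mixed 'M'/'C' pairs on a single target

print(diagnose([("t1", "M"), ("t1", "M")]))
print(diagnose([("t1", "M"), ("t2", "C")]))
```

Several matched pairs can point to the same target because each block type may generate its own pair for the same (query;target) records.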

3.4.2 Automatic qualification of “orphan” records

When no matched pair has been generated for a given query record during the MPG phase, FSOpenLink automatically assigns this query record the state labelled ‘NEG’ at the end of the SQRG stage, as an indication of a ‘negative’ matching state. The motivation for such a default state assignment is simply to end up with a results database containing exactly one explicit diagnosis for each query record in the query datasource. Note that this process is totally independent of the SQRBuilder chosen to perform the linkage. It is “hard-wired” inside the core system itself and cannot be deactivated or modified at user level.


4. User-exposed objects properties

4.1 GammaVector

4.1.1 Public attributes

• values: vector of N numerical values, each between 0.0 and 1.0. Each value is the result of the chosen metric applied to a pair of attributes (or a set of such corresponding pairs). The vector component with index i (0 ≤ i ≤ N-1) contains the metric output for the component listed at position i in the GVD.

4.1.2 Public methods

• getValueByComponentID(string componentID): returns the GammaVector component value corresponding to the compared attribute with name ‘componentID’.
• getValueByComponentIndex(int componentIndex): returns the GammaVector component value corresponding to the attribute with index ‘componentIndex’ in the GVD. Hint: use direct access by index to the “values” attribute instead (much more efficient).
• getNorm(void): returns the Euclidean norm of the GammaVector.
• getAverage(void): returns the arithmetic average of all GammaVector component values.
• getWeightedAverage(void): returns the weighted average of all GammaVector component values. Individual weights can be specified in the GVD.
• getWeightedHarmonicAverage(void): returns the weighted harmonic average of all GammaVector component values. Individual weights are specified in the GVD.
• getGeometricAverage(void): returns the geometric average of all GammaVector component values. Individual weights are specified in the GVD.
• getMaxValue(void): returns the highest value among all GammaVector components.
• getMinValue(void): returns the lowest value among all GammaVector components.


Notes:

1. Whenever possible, use direct attribute reference instead of calling a getter method: this is much more efficient with respect to execution performance.
2. The 5 last methods are provided to help the user easily build his own tailor-made logic to compute a score and assign a classification state, based on various statistical indicators of the GammaVector values.
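To make the statistical indicators listed above concrete, here is a plain-Python sketch of the usual textbook formulas, applied to a list of component values. It is only an illustration: the function names are hypothetical, the weights are passed explicitly (in the system they come from the GVD), the geometric average is shown unweighted, and the handling of zero components is an assumption of this sketch, not documented behaviour of the GammaVector methods.

```python
import math


def norm(values):
    # Euclidean norm: square root of the sum of squared components.
    return math.sqrt(sum(v * v for v in values))


def average(values):
    return sum(values) / float(len(values))


def weighted_average(values, weights):
    return sum(w * v for v, w in zip(values, weights)) / float(sum(weights))


def weighted_harmonic_average(values, weights):
    # Undefined when a component is 0.0; this sketch returns 0.0 in that case.
    if any(v == 0.0 for v in values):
        return 0.0
    return sum(weights) / sum(w / v for v, w in zip(values, weights))


def geometric_average(values):
    # Unweighted variant; returns 0.0 as soon as one component is 0.0.
    if any(v == 0.0 for v in values):
        return 0.0
    return math.exp(sum(math.log(v) for v in values) / len(values))
```

The harmonic and geometric averages are pulled down by a single low component more strongly than the arithmetic average, which can be useful when one poorly matching attribute should weigh heavily on the score.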

4.2 MatchedPair

4.2.1 Public attributes

• queryID: identifier of the related query record
• targetID: identifier of the related target record
• gv: related “GammaVector” object
• FSScore: assigned score value
• classifState: assigned classification state

4.2.2 Public methods

None.

Note: as we purposely try to encourage direct attribute reference for performance reasons, no getter methods are available for this object.


5. User-level components APIs

5.1 Normaliser API

    def myNormaliser (myAttributeToNormalise):
        [ normalisation logics implementation ]
        return myNormalisedAttribute

where :

• myAttributeToNormalise type is ‘string’ or ‘numeric’.
• myNormalisedAttribute type is the same as that of ‘myAttributeToNormalise’.
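As an illustration of this API, a minimal normaliser could look as follows. The normalisation rule itself (trim, collapse whitespace, upper-case) is only an assumed example, not one of the normalisers shipped with the system.

```python
def myNormaliser(myAttributeToNormalise):
    # Numeric attributes are returned unchanged in this sketch.
    if not isinstance(myAttributeToNormalise, str):
        return myAttributeToNormalise
    # Trim, collapse runs of whitespace and convert to upper case.
    return " ".join(myAttributeToNormalise.split()).upper()
```

The returned value has the same type as the input, as required by the API.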

5.2 Tokeniser API

    def myTokeniser (myAttributeToTokenise):
        [ tokenisation logics implementation ]
        return (myListOfTokens)

where :

• myAttributeToTokenise type is ‘string’ or ‘numeric’.
• (myListOfTokens) is a tuple of values, each of ‘string’ type.
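A possible tokeniser conforming to this API is sketched below. The choice of separators (whitespace, hyphen, comma, semicolon) is an assumption made for the example only.

```python
import re


def myTokeniser(myAttributeToTokenise):
    # Numeric attributes are first converted to their string form.
    text = str(myAttributeToTokenise)
    # Split on whitespace, hyphens, commas and semicolons; drop empty tokens.
    return tuple(tok for tok in re.split(r"[\s\-,;]+", text) if tok)
```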

5.3 Metric API

    def myMetric ( (myAttribute_Q_1,…,myAttribute_Q_n) , (myAttribute_T_1,…,myAttribute_T_n) ):
        [ distance logics implementation ]
        return distanceValue

where :

• (myAttribute_Q/T_1,…, myAttribute_Q/T_n) is a tuple of n attributes (n ≥ 1), each element being of type ‘string’ or ‘numeric’ (‘Q’ stands here for ‘query’; ‘T’ for ‘target’).
• distanceValue type is a float with 0.0 <= value <= 1.0. (Convention to comply with: 0.0 ≡ perfect identity; 1.0 ≡ perfect dissimilarity.)
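The following sketch shows one way a metric respecting this convention could be written. It uses a pure-Python Levenshtein distance, normalised by the length of the longer string and averaged over the n attribute pairs; both the helper function and the averaging rule are assumptions of this example, not the implementation of any metric delivered with the system. The two tuples are received as two ordinary parameters so that the code also runs on Python versions without tuple parameters.

```python
def levenshtein(a, b):
    # Classical dynamic-programming edit distance, one row at a time.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def myMetric(queryAttributes, targetAttributes):
    distances = []
    for q, t in zip(queryAttributes, targetAttributes):
        q, t = str(q), str(t)
        longest = max(len(q), len(t))
        # Two empty strings are treated as identical (distance 0.0).
        distances.append(levenshtein(q, t) / float(longest) if longest else 0.0)
    return sum(distances) / len(distances)
```

With this rule, identical attribute tuples give 0.0 and tuples sharing no character at any position give 1.0.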


5.4 SSClassifier API

    def mySSClassifier ( U_C_limit, C_M_limit, gammaVector ):
        [ scoring and classification logics implementation ]
        return (similarityScore, classifState)

where :

• U_C_limit, C_M_limit are placeholder variables that will be dynamically set to the two values (Cthres, Mthres) specified inside the job configuration file (see “Structure of a Similarity Search job config. file”).

• gammaVector is a placeholder variable that will be dynamically made to point to the “GammaVector” Python object instance on which scoring and classification must be performed.

• (similarityScore, classifState) is a 2-tuple whose elements are:

  - similarityScore: a numerical value (possibly normalised, generally positive)
  - classifState: a string (chosen among a finite set of possible values)
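A minimal classifier conforming to this API might look like the sketch below. Several points are assumptions made for the example: the score is taken to be the arithmetic average of the GammaVector, a higher score is taken to mean a closer match, the thresholds are treated as inclusive lower bounds, and the states are the ‘M’, ‘C’ and ‘U’ labels used elsewhere in this handbook. StubGammaVector is a hypothetical stand-in that lets the example run outside the system.

```python
class StubGammaVector(object):
    """Hypothetical stand-in for the system's GammaVector object."""

    def __init__(self, values):
        self.values = values

    def getAverage(self):
        return sum(self.values) / float(len(self.values))


def mySSClassifier(U_C_limit, C_M_limit, gammaVector):
    # Assumption: the score grows with the similarity of the two records.
    similarityScore = gammaVector.getAverage()
    if similarityScore >= C_M_limit:
        classifState = "M"
    elif similarityScore >= U_C_limit:
        classifState = "C"
    else:
        classifState = "U"
    return (similarityScore, classifState)
```

If the metrics in use produce distances rather than similarities, the score would have to be inverted before being compared with the thresholds.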

5.5 SQRBuilder API

    from FSScorer.Scorer import ScoredQRecord

    def mySQRBuilder ( queryID, matchedPairs ):
        [ ScoredQueryRecord building logics implementation ]
        return myScoredQueryRecord

where :

• queryID is a placeholder variable that will be dynamically set to the query record identifier for which a ScoredQueryRecord Python object must be generated by the builder.

• matchedPairs is a placeholder variable that will be dynamically made to point to the complete list of MatchedPair Python object instances that have been generated for the query record whose identifier is queryID.

• myScoredQueryRecord is the returned ScoredQueryRecord Python object, built according to the user-implemented logic.


5.6 MatchedPairFilter API

    def myFilter ( gammaVector ):
        [ filter passing Boolean logics implementation ]
        return myStatus

where :

• gammaVector is a placeholder variable that will be dynamically made to point to the “GammaVector” Python object instance on which the filter must give its verdict.

• myStatus is a Boolean variable taking either value ‘True’ or ‘False’, according to the verdict of the implemented logic on the provided GammaVector.
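As an illustration, a very simple filter could let a pair pass only when the average of its GammaVector values reaches a minimum level. The threshold name and value, the direction of the comparison and the stub class are all assumptions of this sketch; it is not the ‘reasonableMPFilter’ referenced in the sample job configuration files.

```python
MIN_AVERAGE = 0.3  # hypothetical cut-off, to be tuned for the data at hand


class StubGammaVector(object):
    """Hypothetical stand-in exposing only the 'values' attribute."""

    def __init__(self, values):
        self.values = values


def myFilter(gammaVector):
    # Direct access to 'values', as recommended for performance reasons.
    values = gammaVector.values
    myStatus = (sum(values) / float(len(values))) >= MIN_AVERAGE
    return myStatus
```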


6. System’s command line & parameters

NB 1: only those parameters related to currently “public” (i.e. recommended & documented) functionalities of the system are listed here.
NB 2: the present section is only relevant to the standalone (non-HTTP) executable mode.

6.1 General command line structure

    FSOpenLink.exe --param1=value1 --param2=value2 . . . --paramN=valueN

NB:

• Values must not be quoted!
• Parameters described as “boolean” take values TRUE or FALSE.
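When the executable is driven from a script, the command line can be assembled programmatically. The helper below is a hypothetical sketch that only builds the argument list according to the structure shown above; the job configuration path used in the usage note is invented for the example.

```python
def build_command(executable, params):
    """Build the argument list for a standalone FSOpenLink run.

    'params' is a list of (name, value) pairs; a value of None produces a
    bare flag such as --help.
    """
    args = [executable]
    for name, value in params:
        if value is None:
            args.append("--%s" % name)
            continue
        if isinstance(value, bool):
            value = "TRUE" if value else "FALSE"  # boolean convention above
        args.append("--%s=%s" % (name, value))
    return args
```

For example, build_command("FSOpenLink.exe", [("jobConfig", "jobs/myJob.conf"), ("phases", "ALL")]) returns a list that could be handed to the standard subprocess module, which avoids shell quoting issues altogether.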

6.2 List of command line parameters

6.2.1 jobConfig

Specifications
• Status: mandatory
• Value type: pathname (either slash-separated or quoted)
• Default value: n/a
• Restriction(s): /

Description
The job configuration file to be used (see “Job-specific configuration elements”, section 8).

6.2.2 validate

Specifications
• Status: optional
• Value type: [ TRUE | FALSE ]
• Default value: FALSE
• Restriction(s): the current implementation allows full XML validation only on Windows platforms (it makes use of the MSXML 4.0 or 6.0 libraries).

Description
Performs a preliminary validation of all the (user-definable) elements involved in the current FSOpenLink session:

• FSOL system configuration file:
  - Check existence and accessibility of the specified default directories.
• Job-specific configuration file:
  - Check existence and accessibility of all specified XML descriptors & other file resources.
  - Check existence & loadability of all specified user-level components, namely: SQRBuilder, SSClassifier, PE_MPFilter, SC_MPFilter.
• XML descriptors (DSDs, BSD, GVD):
  - Assess XSD conformity (full XML Schema validation).
  - Check existence & loadability of all referenced user-level components, namely: normalisers, tokenisers, metrics.

Behaviour:

• When invoked, this validation functionality is executed by FSOL prior to any other productive task.
• The user log displays the validation status of each type of element, as well as any detailed error messages thrown by the XML validator.
• The system will exit on any error detected.

6.2.3 phases

Specifications
• Status: mandatory
• Value type: any semicolon-separated combination of:
  - DSI (DataSource Instantiation)
  - PE (Parameters Estimation, FS mode only)
  - MPG (MatchedPairs Generation)
  - SQRG (ScoredQueryRecords Generation)
  - FIN (Finalisation, i.e. zip and cleanup of files)
  or alternatively ALL, as an alias for DSI;(PE);MPG;SQRG;FIN
• Default value: n/a
• Restriction(s): PE is only to be requested when mode=FS (see below)

Description
Selects which of the linkage job phases to perform during this run. This amounts to choosing which steps in the diagram of section 2.2.1 are executed. Execution of a requested step naturally depends on the existence (and accessibility on disk) of the file(s) containing the results of the previously executed steps in the chain of operations illustrated by that diagram. More explicitly, each step directly depends on the availability of the results of the following steps:

• DSI: / (none)
• PE: DSI
• MPG: DSI (SS mode); DSI & PE (FS mode)
• SQRG: MPG

Note on DSI-related behaviour of the system:


• When DSI is invoked and operational datasources do already exist on disk, the system will unconditionally rebuild them from the available DSDs.

• When DSI is not invoked and operational datasources cannot be found on disk, the system will automatically generate them from the available DSDs.
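The phase dependencies can be summarised in a small lookup table. The sketch below is only an illustration: the names are hypothetical, the expansion of ALL assumes that PE is included in FS mode only, and FIN has no entry because the handbook does not list its dependencies.

```python
# Direct dependencies per (phase, mode), transcribed from the list above.
DEPENDENCIES = {
    ("DSI", "SS"): [],
    ("DSI", "FS"): [],
    ("PE", "FS"): ["DSI"],
    ("MPG", "SS"): ["DSI"],
    ("MPG", "FS"): ["DSI", "PE"],
    ("SQRG", "SS"): ["MPG"],
    ("SQRG", "FS"): ["MPG"],
}


def expand_phases(phases, mode):
    """Turn a 'phases' value such as 'ALL' or 'MPG;SQRG' into a list."""
    if phases == "ALL":
        expanded = ["DSI", "PE", "MPG", "SQRG", "FIN"]
        if mode != "FS":
            expanded.remove("PE")  # assumption: PE is skipped outside FS mode
        return expanded
    return [name for name in phases.split(";") if name]


def direct_dependencies(phase, mode):
    # Phases without an entry (e.g. FIN) yield an empty list in this sketch.
    return DEPENDENCIES.get((phase, mode), [])
```

Such a table makes it easy to check, before launching a partial run, which earlier results must already be present on disk.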

6.2.4 mode

Specifications
• Status: optional
• Value type: [ SS | FS ]
• Default value: SS
• Restriction(s): /

Description
Which linkage mode to use:
• SS = Similarity Search linkage mode
• FS = Fellegi-Sunter (statistical) linkage mode

6.2.5 chooseLevels

Specifications
• Status: optional
• Value type: [ TRUE | FALSE ]
• Default value: FALSE
• Restriction(s): effective only when mode=FS

Description
Whether the user should be prompted to interactively choose the two classification levels on a graph.

• TRUE: the user is unconditionally prompted to interactively enter those 2 parameters.
• FALSE: the user is only prompted if the classification levels specified in the config. file are not compatible with the range of available values related to the parameters previously computed during the PE phase.

6.2.6 filterMP

Specifications
• Status: optional
• Value type: [ TRUE | FALSE ]
• Default value: FALSE
• Restriction(s): if set to TRUE, the job config. file parameter ‘SC_MPFilter’ must point to an existing and valid ‘MatchedPairFilter’ Python object implementation.

Description
Whether the (optionally) designated filter should be used during MPG to accelerate computations (at the potential cost of a loss of sensitivity).


6.2.7 multiProc

Specifications
• Status: optional
• Value type: [ n | MAX | FALSE ]
• Default value: FALSE
• Restriction(s): /

Description
Whether to perform operational database instantiation, MatchedPairs generation & scoring in multiprocessing mode. This allows taking advantage of multi-CPU platforms by reducing the overall computation time of both the “DSI” and “MPG” phases.

• When an integer n (n>0) is provided, the system will generate up to n processes for its own needs.
• When ‘MAX’ is specified, the system will generate up to as many processes as there are CPUs on the platform (as detected by the system).

6.2.8 poolSize

Specifications
• Status: optional
• Value type: integer
• Default value: 10
• Restriction(s): 1 <= value <= 100 (recommended)

Description
Technical parameter controlling the level of buffering during the MPG phase. It may need to be lowered from its default value if the chosen block type produces blocks that are too large for the data being processed (symptom: a Python interpreter “memory error”, due to the process running out of memory).
Hint: keep the default value unless a memory problem is actually encountered.

6.2.9 multiplWarningLimit

Specifications
• Status: optional
• Value type: integer
• Default value: 5000
• Restriction(s): 1 <= value

Description
When the product of the cardinalities of 2 block instances exceeds the specified value, a warning is issued in the user logfile but the system proceeds with the processing of the instance anyway.
Hint: too many such warnings are an indication that block types should be made more specific in order to improve linkage time efficiency.

6.2.10 multiplDiscardLimit

Specifications
• Status: optional
• Value type: integer
• Default value: 100000
• Restriction(s): 1 <= value

Description
When the product of the cardinalities of 2 block instances exceeds the specified value, a warning is issued in the user logfile and the system skips the processing of the current block instance (which mechanically lowers the sensitivity of the system, hence “low” values should be used with caution!). This parameter exists to avoid performance being plagued by rare instances of huge blocks.
Hint: too many such warnings are an indication that block types must be made more specific to preserve both a reasonable computation time and a satisfactory level of sensitivity.
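The combined effect of ‘multiplWarningLimit’ and ‘multiplDiscardLimit’ can be pictured with the following sketch. It assumes that the two cardinalities are those of the query-side and target-side block instances, and that “exceeds” means strictly greater than; the function name and the returned labels are hypothetical.

```python
def block_action(queryBlockSize, targetBlockSize,
                 warningLimit=5000, discardLimit=100000):
    """Decide what happens to one block instance, given both limits."""
    product = queryBlockSize * targetBlockSize
    if product > discardLimit:
        return "discard"  # warning logged, block instance skipped
    if product > warningLimit:
        return "warn"     # warning logged, block instance still processed
    return "process"
```

With the default values, a block instance pairing 1000 query records with 101 target records (101000 candidate comparisons) would be discarded, whereas 51 by 100 records would only trigger a warning.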

6.2.11 CSVExport

Specifications
• Status: optional
• Value type: [ TRUE | FALSE ]
• Default value: FALSE
• Restriction(s): /

Description
Converts the output produced by the SQRG phase from an SQLite database into a TAB-separated flatfile.

6.2.12 FlatExport

Specifications
• Status: optional
• Value type: [ TRUE | FALSE ]
• Default value: FALSE
• Restriction(s): only effective when both datasources are of ‘FLAT’ type

Description
Converts the output produced by the SQRG phase from an SQLite database into an all-encompassing, human-readable format.


6.2.13 SampleExport

Specifications
• Status: optional
• Value type: [ TRUE | FALSE ]
• Default value: FALSE
• Restriction(s): only effective when both datasources are of ‘FLAT’ type

Description
Same as ‘FlatExport’ but limited to a sample that is representative of all obtained score values.
Hint: this export mode is typically useful during the metrics or classification parameters adjustment phase of a linkage job.

6.2.14 SummaryExport

Specifications
• Status: optional
• Value type: [ TRUE | FALSE ]
• Default value: FALSE
• Restriction(s): only effective when both datasources are of ‘FLAT’ type

Description
Produces a flatfile containing summary statistics on the number of “hits” obtained for each of the classes defined in the classifier used.
Hint: this export mode is typically useful during the metrics or classification parameters adjustment phase of a linkage job.

6.2.15 PlotExport

Specifications
• Status: optional
• Value type: [ TRUE | FALSE ]
• Default value: FALSE
• Restriction(s): /

Description
Generates a histogram of the distribution of the obtained score values (SVG format).


6.2.16 AllExport

Specifications
• Status: optional
• Value type: [ TRUE | FALSE ]
• Default value: FALSE
• Restriction(s): /

Description
Equivalent to setting all ‘XXXExport’ options to TRUE.

6.2.17 help

Specifications
• Status: optional
• Value type: none
• Default value: n/a
• Restriction(s): /

Description
Displays all available command line options on stdout, then exits unconditionally. Syntax: “--help” (no “=value” part).


7. System-specific configuration elements

7.1 Global parameters configuration file

This system-level configuration file is located in the “FSOL_rsc/config/user/common” directory. It can be freely edited by the user to adapt the system’s behaviour to its deployment environment. We first show a practical example of such a file. Its elements are then described one by one.

7.2 Structure of the system configuration file

Below is an example of such a file, tuned to one specific deployment environment (the author’s one). Settable parameters appear in green colour. The required syntax of the file contents reflects the syntax defined in the “ConfigObj” Python module (a module dedicated to the easy management of program configuration files). In short:

• Parameters are grouped in functional sections started by a […] tag.
• Comments start with a ‘#’.
• Helper variables can be defined, whose value can later be “inlined” inside an expression by using the %(varname)s syntax.

    ##############################################################################
    # This is the "root" (i.e. basic and general) configuration file for
    # FSOpenLink.
    # It allows to set up fundamental, default functioning parameters related to
    # the local deployment of the system, and expressing the user's personal
    # preferences.
    #
    # Important : paths must have platform-dependent syntax
    #             (slashes on unix, backslashes on Windows).
    ##############################################################################

    #=============================================================================
    # Various default parameters section
    #=============================================================================

    # Default logging configuration file to be used
    loggingConf = FSOL_rsc\FSOL_standard_logging.FSO

    # Default "root" output directory for FSOL result files
    # NB : subdirectories will be created inside, that are named after the user-
    #      specified "linkageTaskID" (provided inside job. config. file)
    outputDir = E:\FSOL_output

    # Number of backup levels for outputDir & contents :
    outputDir_backupDepth = 3

    #=============================================================================
    # Default directories section
    #=============================================================================

    # Default directory for XML datasource descriptors :
    DSD_defaultDir = E:\FSOL_DSDs

    # Default directory for XML gammavector descriptors :
    GVD_defaultDir = E:\FSOL_GVDs

    # Default directory for XML blockingstrategies descriptors :
    BSD_defaultDir = E:\FSOL_BSDs

    # Default directory for datasources files :
    DS_defaultDir = E:\FSOL_Datasources

    # Default directory for operational datasource files :
    opDS_defaultDir = E:\FSOL_OPDatasources

7.3 Global configuration parameters & their effect

• loggingConf: logging configuration file to be used as default
• outputDir: directory to be used to store the system’s output
• outputDir_backupDepth: number of output directory backup levels to maintain
• DS_defaultDir: default location where to look for datasource files
• opDS_defaultDir: default location where to store operational datasource files
• DSD_defaultDir: default location where to look for datasource descriptor files
• GVD_defaultDir: default location where to look for gamma vector descriptor files
• BSD_defaultDir: default location where to look for blocking strategy descriptor files


8. Job-specific configuration elements

8.1 General points

Two types of job-level configuration files exist: one for each linkage mode currently implemented by the system. Job-specific configuration files can be located anywhere. They simply need to be correctly pointed to by the ‘jobConfig’ command line parameter of the executable. Most parameters contained in these two file types are common to both linkage modes, since these rely mostly on the same fundamental components (datasources, blocking strategy, gamma vector, etc.). The required syntax of the file contents reflects the syntax defined in the “ConfigObj” Python module (a module dedicated to the easy management of program configuration files). Its essential characteristic features are:

• Parameters are grouped in functional sections started by a […] tag.
• Comments start with ‘#’ (and end with the nearest end-of-line mark).
• Helper variables can be defined, whose value can later be inlined inside an expression by using the %(varname)s syntax.

We first provide two practical examples of how such a file typically looks (one for each linkage mode). We then explain its various elements in detail, although the role of most of them can be intuitively understood simply from their naming.


8.2 Structure of a Similarity Search job config. file

    # =============================================================================
    # This is a generic, dummy job configuration file intended to guide users in
    # properly setting up a records linkage task operated by FSOpenLink.
    #
    # Locating the queryDSD, targetDSD, GVD, BSD and logging config. file:
    # ----------------------------------------------------------------------
    # specified paths for these resources are first interpreted by FSOL as relative
    # to the location of the present config. file. When not found at this location,
    # the resource will be searched for at the default location specified inside the
    # FSOpenLink global config. file.
    # Besides, using absolute paths is of course possible anywhere.
    # NB : use slashes ("/") as path separator.
    #
    # Optional use of helper ConfigObj variables:
    # ------------------------------------------
    # eases the construction of meaningful, coherent job resource names.
    # 1. Define the variable inside the [DEFAULT] section : "varname = value"
    # 2. Use it wherever you want later, by inlining "%(varname)s".
    #
    # Comments: from the "#" sign to the next end-of-line
    #
    # !! DO NOT ERASE OR RENAME THE SECTION HEADS, THEY ARE NECESSARY TO FSOL !!
    # =============================================================================

    [DEFAULT]
    linkageTaskID = myTaskID
    linkageTaskDescr = myTaskDescr

    [DIRECTORIES]
    outputDir = my/path/to/FSOL outputs/directory/%(linkageTaskID)s

    [DATASOURCES]
    queryDSD = myQueryDatasrcDescriptor.xml
    targetDSD = myTargetDatasrcDescriptor.xml

    [GAMMA VECTOR]
    GVD = myGammaVector.xml

    [BLOCKING]
    BSD = myBlockingStrategy.xml

    [SCORING]
    ScoreAndClassifyLogics = SSClassifiers.defaultSSClassifier.weightedAvg
    Mthres = 0.85
    Cthres = 0.65
    skipSameRecordID = False
    SC_MPFilter = filters.scoring.reasonableMPFilter.reasonableMPFilter
    SQRBuilder = SQRBuilders.basicSQRBuilder.basicSQRBuilder

    [OPTIONS]
    logConfig = myLogging.conf
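The %(varname)s inlining mechanism can be tried out without the system. The sketch below deliberately uses Python's standard configparser module as a stand-in, because it supports the same kind of [DEFAULT]-based interpolation; it is not ConfigObj itself, and FSOpenLink's own loading code may behave differently. The configuration text is a shortened excerpt of the example file.

```python
import configparser

# Shortened excerpt of a Similarity Search job configuration file.
JOB_CONFIG = """
[DEFAULT]
linkageTaskID = myTaskID

[DIRECTORIES]
outputDir = my/path/to/FSOL outputs/directory/%(linkageTaskID)s

[SCORING]
Mthres = 0.85
Cthres = 0.65
"""

parser = configparser.ConfigParser()
parser.read_string(JOB_CONFIG)

# The helper variable defined in [DEFAULT] is inlined on access.
outputDir = parser.get("DIRECTORIES", "outputDir")
Mthres = parser.getfloat("SCORING", "Mthres")
Cthres = parser.getfloat("SCORING", "Cthres")
```

Here outputDir ends with the value of linkageTaskID, which is how one task identifier can be reused consistently across several resource names.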


8.3 Structure of a Fellegi-Sunter job config. file

    # =============================================================================
    # This is a generic, dummy job configuration file intended to guide users in
    # properly setting up a records linkage task operated by FSOpenLink.
    #
    # Locating the queryDSD, targetDSD, GVD, BSD and logging config. file :
    # ----------------------------------------------------------------------
    # Specified paths for these resources are first interpreted by FSOL as relative
    # to the location of the present config. file. When not found at this location,
    # the resource will be searched for at the default location specified inside the
    # FSOpenLink global config. file.
    # Besides, using absolute paths is of course possible anywhere.
    # NB : use slashes ("/") as path separator.
    #
    # Optional use of helper ConfigObj variables :
    # ------------------------------------------
    # Eases the construction of meaningful, coherent job resource names.
    # 1. Define the variable inside the [DEFAULT] section : "varname = value"
    # 2. Use it wherever you want later, by inlining "%(varname)s".
    #
    # Comments : from the "#" sign to the next end-of-line
    #
    # DO NOT ERASE OR RENAME THE SECTION HEADS, THEY ARE NECESSARY TO FSOL !
    # =============================================================================

    [DEFAULT]
    linkageTaskID = myTaskID
    linkageTaskDescr = myTaskDescr

    [DIRECTORIES]
    outputDir = my/path/to/FSOL outputs/directory/%(linkageTaskID)s

    [DATASOURCES]
    queryDSD = myQueryDatasrcDescriptor.xml
    targetDSD = myTargetDatasrcDescriptor.xml

    [GAMMA VECTOR]
    GVD = myGammaVector.xml

    [BLOCKING]
    BSD = myBlockingStrategy.xml

    [PARAMETERS ESTIMATION]
    algorithm = EM
    tolerance = 0.008
    trainingSetM = None
    minNbVects = 8000
    maxNbVects = 20000
    maxNbTrials = 5
    PE_MPFilter = filters.FSParamEstim.reasonableMPFilter.reasonableMPFilter

    [SCORING]
    maxFalsePosProb = 0.00001
    maxFalseNegProb = 0.001
    skipSameRecordID = False
    SC_MPFilter = filters.scoring.reasonableMPFilter.reasonableMPFilter
    SQRBuilder = SQRBuilders.basicSQRBuilder.basicSQRBuilder

    [OPTIONS]
    logConfig = myLogging.conf


8.4 Job-specific config. parameters & their effect

8.4.1 Job general resources

• [linkageTaskID]: keyword used by FSOL to internally identify the linkage task. Optional. When missing, FSOL will default to the job config. file name as the linkageTaskID value.
• [linkageTaskDescr]: full-text, user-oriented description of the linkage task. The description must be wrapped in double quotes! Optional. When missing, FSOL will report “n/a”.
• outputDir: pathname of an alternate output directory to be used in place of the default one (specified in the system’s config. file).
• logConfig: pathname of a (job-specific) logger’s configuration file to be used in place of the default one (specified in the system’s config. file).
• queryDSD: pathname of the query dataset “DataSourceDescriptor” XML file.
• targetDSD: pathname of the target dataset “DataSourceDescriptor” XML file.
• GVD: pathname of the job’s “GammaVectorDescriptor” XML file.
• BSD: pathname of the job’s “BlockingStrategyDescriptor” XML file.

8.4.2 Parameters estimation – related parameters (FS mode)

These parameters are specific to the ‘FS’ linkage mode.

• algorithm: which algorithm to use (among those currently implemented) to perform FS parameters estimation. Possible values are: ‘EM’ (Expectation-Maximisation), ‘FREQ’ (Frequency-based).
• tolerance: measure of the relative divergence of two consecutive iterations of the EM procedure below which the system stops iterating (i.e. steady-state values are considered reached).
• trainingSetM: pathname of the (optional) training file containing record pairs that form a representative sample of true positive matches. If set to ‘None’, parameters estimation will be based on the automatic build-up of a sample set of true positive matching pairs, exploiting the ‘PE_MPFilter’ MatchedPairs filter to achieve this goal. This parameter cannot be ‘None’ when ‘PE_MPFilter’ is.
• minNbVects: minimal number of GammaVectors to base the EM calculation on. This parameter is not exploited when a training set is specified.
• maxNbVects: maximal number of GammaVectors to base the EM calculation on. This parameter is not exploited when a training set is specified.
• maxNbTrials: number of consecutive failed attempts to complete the full parameters estimation after which the system gives up and exits with an error message.
• PE_MPFilter: (optional) MatchedPairs filter to be used for the automatic build-up of a sample set of true positive matching pairs. This parameter is not exploited when a training set is specified. Conversely, it cannot be ‘None’ when ‘trainingSetM’ is itself set to ‘None’.

8.4.3 Scoring-related parameters (FS mode)

These parameters are specific to the ‘FS’ linkage mode.

• maxFalsePosProb: specifies the maximal False-Positive probability risk level one wants to link with.
• maxFalseNegProb: specifies the maximal False-Negative probability risk level one wants to link with.


8.4.4 Scoring-related parameters (SS mode)

These parameters are specific to the ‘SS’ linkage mode.

• ScoreAndClassifyLogics: ‘SSClassifier’ API compliant Python function that implements the logic to be used by the current job to construct a score value and the classification state.
• Mthres: sets a (relative or absolute) threshold level that can typically be exploited by the ScoreAndClassifyLogics to locate the boundary (on the score scale) between ‘C’-classified pairs and ‘M’-classified pairs.
• Cthres: sets a (relative or absolute) threshold level that can typically be exploited by the ScoreAndClassifyLogics to locate the boundary (on the score scale) between ‘U’-classified pairs and ‘C’-classified pairs.

8.4.5 Scoring-related parameters (all modes)

These parameters are common to all linkage modes.

• skipSameRecordID: Boolean flag [True|False]. If set to ‘True’, MatchedPairs where queryID = targetID are ignored. Useful to save lots of disk space (and computation time) when self-matching a recordset for deduplication purposes.
• SQRBuilder: ‘SQRBuilder’ API compliant Python function that implements the logic to be used by the current job to construct a ‘ScoredQueryRecord’ object based on all MatchedPairs relating to one given query record.
• SC_MPFilter: (optional) MatchedPairs filter to be used to help boost the performance of the MatchedPairs generation phase.

8.4.6 How FSOL locates external job resources

When their provided pathname is not absolute, FSOL tries to locate the external files (i.e. queryDSD, targetDSD, GVD, BSD and logConfig) using the following strategy:

1. Search for the named resource inside the same directory where the job configuration file is located.
2. When not found at step 1, search for the named resource inside the default directory for this resource type, as specified inside the global configuration file (see section 7.1).
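The two-step strategy above can be sketched as follows. The function name and parameters are hypothetical, and returning None when the resource is found at neither location is an assumption of this example; the handbook does not state what the system does in that case.

```python
import os


def locate_resource(pathname, jobConfigPath, defaultDir):
    """Resolve a resource pathname the way the strategy above describes."""
    if os.path.isabs(pathname):
        return pathname  # absolute pathnames are used as given
    # Step 1: same directory as the job configuration file.
    jobDir = os.path.dirname(os.path.abspath(jobConfigPath))
    candidate = os.path.join(jobDir, pathname)
    if os.path.exists(candidate):
        return candidate
    # Step 2: default directory for this resource type.
    candidate = os.path.join(defaultDir, pathname)
    if os.path.exists(candidate):
        return candidate
    return None  # assumption: not found at either location
```

For a GVD, for instance, defaultDir would be the value of GVD_defaultDir taken from the global configuration file.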


9. Dealing with metrics.

9.1 Introduction

This chapter

• describes how metrics can typically be structured (codewise), and
• documents the basic bricks (or toolbox elements) they can typically be built from.

When such basic bricks are parametrisable, it is important that they be well understood. This allows for a proper fine-tuning of any higher-level, user-written metric that is built on them. Metrics delivered with the system inside ready-made bundles (see chapter 11) have been built according to these principles. Note that nothing described here is mandatory: the only constraint a metric function must comply with is its API (see section 5.3).

9.1.1 Currently implemented architecture of ‘metrics’ components

The architecture on which a large part of the existing metrics are built (and which we encourage new, user-developed metrics to follow) is three-layered:

• DLLs: performant C/C++ implementations of various string-related algorithms.
• Core ‘StringMetrics’ Python library: “toolbox” containing Python functions implementing (calls to) fundamental string operators & distances.
• Attribute-related metrics: freely constructed by making a discretionary (and optional) use of the toolbox ingredients (plus others if desired).

Figure: Architecture of the metrics library system.


9.2 Currently provided DLLs

9.2.1 DMeta

This DLL contains a fast C++ implementation of the Double-Metaphone phonetic encoding algorithm.

9.2.2 PLev

This DLL contains a fast C implementation of the following well-known and well-documented string distance functions:

• Levenshtein, the classical “edit” distance
• Damerau-Levenshtein distance (extension of the former to transpositions)

9.2.3 GenEdit

This DLL contains fast C implementations of:

• the Generalised Edit Distance (“GED”)
• the Markovian Generalised Edit Distance (“MGED”): a context-sensitive extension of the GED.

9.3 ‘StringMetrics’ toolbox library contents

Most of the components available to date have been constructed from “all-purpose” functions, either pre-existing and well documented in the literature (Levenshtein, Damerau-Levenshtein, Generalised Edit) or built on well-known string mapping/encoding schemes (Q-grams, Double-Metaphone, Soundex, Soundex2, Phonex, …). These all-purpose distance functions are gathered in a toolbox-like library named “StringMetrics”. It is part of the core FSOpenLink system and, as such, is always available to the user, whether or not optional bundles have been acquired.

This library forms a relatively rich toolset on which users can capitalise to build their own tailor-made metrics. Several functions in this library have an underlying C implementation, for performance reasons. Their implementation is packed in a DLL and wrapped so that it can be called from a Python script just as if it were in a regular Python module. This architecture aims at a good compromise between ease of use and raw linkage-time performance.

9.3.1 List of available functions & related distances

Note: the (P) sign indicates that the function is currently implemented in pure Python (no call to an external DLL).

Typographic/lexical distances

• Q-grams-based distance (P)
• Jaro / Jaro-Winkler distance (P)
• Levenshtein (absolute and normalised variants)
• Damerau-Levenshtein (absolute and normalised variants)
• Generalised Edit Distance (absolute and normalised variants)
• Markovian Generalised Edit Distance (absolute and normalised variants)

Phonetic encodings

• Double Metaphone
• Soundex (P)
• Soundex2 (P)
• Phonex (P)

Phonetic distances

• Double Metaphone-based
• Soundex-based (P)
• Soundex2-based (P)
• Phonex-based (P)

Most functions currently available in the ‘StringMetrics’ library do not require description: they are well documented on the web and work “out of the box” without any particular intervention by the user.
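To give an idea of what a pure-Python (P) toolbox function can look like, here is a minimal sketch of a Q-gram distance. It is not the code of the ‘StringMetrics’ library: the function names, the default q = 2 and the absence of padding or normalisation are assumptions made for the illustration.

```python
from collections import Counter


def qgram_profile(text, q=2):
    """Count every substring of length q found in text (no padding)."""
    return Counter(text[i:i + q] for i in range(len(text) - q + 1))


def qgram_distance(a, b, q=2):
    """Sum, over all q-grams, of the absolute difference in occurrence counts."""
    profile_a = qgram_profile(a, q)
    profile_b = qgram_profile(b, q)
    # A Counter returns 0 for a q-gram it has never seen.
    return sum(abs(profile_a[g] - profile_b[g])
               for g in set(profile_a) | set(profile_b))
```

Two identical strings yield 0, and each q-gram present in only one of the two strings adds 1 to the result.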

9.3.2 The (M)GED distances and their configuration

This class of metrics is heavily parametrisable, and therefore requires some instructions on how to configure it properly.

What (M)GED do

Generally, (M)GED functions compute the minimal cost of transforming one string into another through successive applications of the following basic operations:

• character matching
• (contextual) character insertion
• (contextual) character deletion
• character substitution
• character transposition

Each of them is associated with a “weight” or “cost value”, which can depend on the identity of the character(s) involved in the operation.

The “Markovian” character of MGED (as opposed to GED) comes from the contextual nature of the character insertion & deletion operations: the associated weights can be made to depend explicitly on the previous character in the string.

(M)GED default parameters

Unless otherwise specified by an explicit entry in the weights configuration files (see below), all weights take the following default values:

• character matching default weight: 0.0
• (contextual) character insertion default weight: 1.0
• (contextual) character deletion default weight: 1.0
• character substitution default weight: 1.0
• character transposition default weight: 1.0

When only default values are at play, the MGED is identical to the Damerau-Levenshtein distance.
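The role of the weights can be illustrated with a small pure-Python sketch of a (non-Markovian) weighted edit distance. This is not the GenEdit DLL code: the function name and the dictionary-based weight tables are assumptions, and the sketch uses the restricted form of transposition (adjacent characters only, each substring edited once).

```python
def generalised_edit_distance(s, t, match=None, indel=None,
                              subst=None, transp=None):
    """Weighted edit distance between s and t.

    Optional tables override the default weight of an operation:
      match  {char: w}      default 0.0
      indel  {char: w}      default 1.0 (shared by insertion and deletion)
      subst  {(c1, c2): w}  default 1.0
      transp {(c1, c2): w}  default 1.0
    """
    match, indel = match or {}, indel or {}
    subst, transp = subst or {}, transp or {}
    n, m = len(s), len(t)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + indel.get(s[i - 1], 1.0)
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + indel.get(t[j - 1], 1.0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            a, b = s[i - 1], t[j - 1]
            if a == b:
                diagonal = d[i - 1][j - 1] + match.get(a, 0.0)
            else:
                diagonal = d[i - 1][j - 1] + subst.get((a, b), 1.0)
            best = min(diagonal,
                       d[i - 1][j] + indel.get(a, 1.0),   # delete a
                       d[i][j - 1] + indel.get(b, 1.0))   # insert b
            # Transposition of two adjacent, distinct characters.
            if i > 1 and j > 1 and a != b and a == t[j - 2] and s[i - 2] == b:
                best = min(best, d[i - 2][j - 2] + transp.get((b, a), 1.0))
            d[i][j] = best
    return d[n][m]
```

Called without any table, the function returns the familiar unit-cost values (for example 3.0 for "kitten" and "sitting"). Lowering a single substitution weight, such as `subst={("a", "e"): 0.2}`, brings "mayer" and "meyer" closer together.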

(M)GED weights specification files

Weights specification files allow users to explicitly assign values to those weights that should depart from the defaults (see above). Only the departing values need to be mentioned in the file; weights that do not appear implicitly take the default value. The files are located inside the metrics\metrics_rsc directory. Despite their “.dat” extension, their contents are flat text and perfectly human-editable.

File name                        Description

GenEdit_matchWeights.dat         Weights associated with the character matching operation

GenEdit_del_insWeights.dat       Weights associated with the character insertion/deletion operations

GenEdit_del_insWeights_M.dat     Weights associated with the character insertion/deletion operations, depending on the preceding character

GenEdit_substituteWeights.dat    Weights associated with the character substitution operation

GenEdit_transposeWeights.dat     Weights associated with the character transposition operation

Symmetry preservation

In order for the symmetry property of the distance to be preserved, the following conditions must be met by the weights:

1. The insertion and deletion weights depending on the same character must be identical. This is why users are not allowed to specify two independent weight sets, in separate files, for these 2 operations.

2. Substitution and transposition weights depending on the same two characters must be identical, whatever the order of the two characters. This is implemented by systematically initialising the transposed weights-matrix entry as well when a line is read from GenEdit_substituteWeights.dat or GenEdit_transposeWeights.dat. This symmetrisation mechanism implies that it is not necessary to have the two symmetric lines [C1;C2;value] and [C2;C1;value] in either of those 2 files, since their effect on the stored weights matrix is perfectly redundant. If two symmetric lines with different values, [C1;C2;value1] and [C2;C1;value2], are specified, the latter will “overwrite” the former.
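The symmetrisation rule can be sketched as follows. The exact on-disk syntax is not reproduced in this section, so the sketch assumes one "C1;C2;value" entry per line without the surrounding brackets, and a hypothetical function name; characters such as ';' itself are not handled.

```python
def load_symmetric_weights(lines):
    """Build a {(c1, c2): weight} table, storing each entry in both orders."""
    weights = {}
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        c1, c2, value = line.split(";")
        weight = float(value)
        # Writing both orientations keeps the table symmetric; a later
        # line for the reversed pair therefore overwrites both entries.
        weights[(c1, c2)] = weight
        weights[(c2, c1)] = weight
    return weights
```

With the lines "a;e;0.2" followed by "e;a;0.6", both ("a", "e") and ("e", "a") end up with the weight 0.6, which mirrors the “overwrite” behaviour described above.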

Defaulting mechanism

GED only makes use of the weights defined in the 4 “non _M” files. As explained, when a weight is not explicitly mentioned in these files, the system defaults to the values indicated in the “(M)GED default parameters” section above.

MGED makes use of the 5 weight files in the same manner as GED does, with one additional rule: when a context-dependent weight [C1;C2;C3;value] is not explicitly specified inside GenEdit_del_insWeights_M.dat, the weight defaults to the corresponding non-Markovian one [C2;C3;value] inside GenEdit_del_insWeights.dat, if present. If not, it defaults to the value indicated in the “(M)GED default parameters” section above.
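The two-step fallback for MGED insertion/deletion weights can be pictured with the short sketch below. The tuple-based keys, the function name and the convention that the leading element of the key carries the context are assumptions made for the illustration; they are not taken from the GenEdit implementation.

```python
DEFAULT_INDEL_WEIGHT = 1.0  # default listed under "(M)GED default parameters"


def contextual_indel_weight(key, markov_weights, plain_weights):
    """Resolve a weight: contextual table, then plain table, then default.

    key is a tuple whose first element is the context character; dropping
    it gives the key of the corresponding non-Markovian weight.
    """
    if key in markov_weights:
        return markov_weights[key]
    return plain_weights.get(key[1:], DEFAULT_INDEL_WEIGHT)
```

For example, with a contextual table holding only ("q", "u") and a plain table holding ("u",), the key ("t", "u") falls back to the plain weight, and ("t", "z") falls back to 1.0.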

10. HTTP linkage server [New in version 3.11]

10.1 Introduction

10.1.1 Purpose

The HTTP linkage server makes it possible to query a target dataset easily and interactively from a web browser, and to obtain the result as an HTML page.

This functionality makes it very easy to implement an interactive matching - onto a target dataset - of record fields that are dynamically provided by users through an HTML form. It therefore allows the setting up of advanced, custom metrics-powered “lookup tools” into arbitrary datasets.

10.1.2 Usage

To prepare a web-enabled linkage of a single user-provided record onto an existing target dataset, the user simply configures a linkage task in exactly the same way as for the batch, non-HTTP linkage executable. The only restrictions are:

Requirement 1: the two datasource types must be as follows:

Datasource    Required type
Query         WEB
Target        FLAT

• The WEB type datasource descriptor is very similar to the other descriptor types. It contains two specific elements for each component (field) that tell the server how to format the query form for that attribute.
• The FLAT type requirement on the target datasource is due to the need to efficiently fetch the matched target records in the operational target database, in order to display them in the results page.

Requirement 2: the ‘.FSO’ job configuration file must reside in a specific directory known to the server as the one to scan for “web” jobs.

10.2 Duty cycle: typical example

Once the server is properly configured and started up, a typical full duty cycle is as follows:

1. By calling the root URL, one obtains a list of all ‘web type’ jobs, that is, the jobs found in the dedicated “jobs” directory that comply with the datasource type restriction mentioned above.

2. One chooses the desired matching job by clicking on its job configuration filename.

3. The linkage server then constructs a user query form according to the properties found in the query datasource descriptor.

4. The user enters the desired (query) field values into the provided input fields, and submits.

5. The linkage server performs a full linkage cycle (MPG, SQRG) and returns the result as an HTML “results” page.

10.3 Server management

10.3.1 Dedicated configuration file

A configuration file containing the server’s functioning parameters is to be found at the following location:

FSOL_rsc\config\user\server\FSOL_server_config.FSO

Its contents are displayed hereafter:

# ##############################################################################
# This is the "root" (i.e. basic and general) configuration file for FSOpenLink
# HTTP Server.
# ##############################################################################

# ==============================================================================
# Server config. parameters section
# ==============================================================================

# TCP port number used by FSOL server
TCPPort = 8051

# Whether to erase task-related temp. directories when a 'match' request has
# been completed.
eraseWorkDir = True

# Whether to refresh the internal jobs cache (list of available web jobs) when
# the jobs list gets requested.
# False --> the job cache is only built once, at server startup.
refreshJobsCache = True

# ==============================================================================
# HTML templates section
# ==============================================================================
indexFormTemplateFile = indexFormTemplate_def_EN.html
queryFormTemplateFile = queryFormTemplate_def_EN.html
resultsFormTemplateFile = resultsFormTemplate_def_EN.html

# ==============================================================================
# Directories section
# ==============================================================================

# Default directory for HTTP server job definition files
jobDir = E:\Eclipse workspace\FSOpenLink3\dist\jobs

# Default "work" directory for FSOL processing files
workDir = E:\FSOL_server_workDir

# ==============================================================================
# Logging section
# ==============================================================================

# Default logging configuration file to be used
loggingConf = FSOL_rsc\config\user\server\FSOL_server_logging.FSO
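For users who wish to inspect such a file from their own scripts, the "key = value" layout with '#' comment lines can be read with a few lines of Python. This is a minimal sketch under the assumption that every parameter sits on a single line; the function names are hypothetical and this is not the parser used by the server.

```python
def parse_fso_config(text):
    """Return a {name: value} dict from 'name = value' lines.

    Blank lines and lines starting with '#' are skipped; values are
    kept as strings so that Windows paths are left untouched.
    """
    params = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        name, separator, value = line.partition("=")
        if not separator:
            continue  # not a 'name = value' line: ignored in this sketch
        params[name.strip()] = value.strip()
    return params


def as_bool(value):
    """Interpret the 'True' / 'False' flags used in the file."""
    return value.strip().lower() == "true"
```

Numeric or boolean parameters are then converted explicitly, for instance `int(params["TCPPort"])` or `as_bool(params["eraseWorkDir"])`.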

[ Please note that the HTTP linkage server is also affected by parameters defined in FSOpenLink’s main global configuration file, located at:

FSOL_rsc\config\user\common\FSOL_standard_config.FSO ]

10.3.2 Linkage server config. parameters and their effect

Parameter name             Description

TCPPort                    TCP port the server listens on.

eraseWorkDir               True --> temp directories containing files created during a linkage request are erased after the result page is returned. False --> they are cleaned up only at server startup.

indexFormTemplateFile      HTML template file used to display the list of available linkage jobs to choose from.

queryFormTemplateFile      HTML template file used to display the query form related to the chosen linkage job.

resultsFormTemplateFile    HTML template file used to display the result of the linkage task performed on the user-provided data.

refreshJobsCache           Whether to refresh the internal jobs cache every time the list of available jobs is requested. False --> the jobs cache is built once and for all at server startup.

jobDir                     Default location where to look for (web-type) jobs.

workDir                    Default location where to store operational, temporary files.

loggingConf                Logging configuration file to be used.

10.3.3 Server startup/shutdown

For convenience, two command-line tools, located in the root directory, are provided to easily start and stop the HTTP linkage server:

start_FSOPenLink_server.bat
stop_FSOPenLink_server.bat

10.3.4 User-level and system-level logging

The logging implementation for the server is identical to the one in the non-HTTP batch linkage module. Dedicated user-level and system-level logfiles are generated inside the root directory. They bear the “_server” suffix to distinguish them from the corresponding files written by the batch linkage module.

11. Currently available bundles

11.1 General points

This section enumerates some of the existing component sets whose nature is general (i.e. not task-specific) enough to be of potential interest to the user. These components are grouped into attribute-specific “bundles”, each comprising all components whose use is related to a specific type of attribute.

11.1.1 Bundle packaging

These components can be delivered with the system in two different packaging forms:

1. “Packaged”: components are directly integrated into the system’s internal library. Their source code is therefore not visible to the user in the user directories “metrics”, “normalisers”, and “tokenisers”. They are however callable at any time, just as if their code were located inside these user directories. When they are subject to parameterisation, the parameter files are usually freely accessible to the user.

2. “External”: components are delivered as Python source (“.py”) files inside the user directories “metrics”, “normalisers”, and “tokenisers”. The component’s source code is therefore visible and freely customisable by the user.
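Components of either packaging form are designated in the bundle tables below by a dotted “Python syntax” location such as metrics.FirstNames.firstName_dist. How FSOpenLink resolves such a name internally is not described here; the sketch below only shows one conventional way of turning a dotted name into a callable with the standard importlib module. The function name resolve_component is hypothetical.

```python
import importlib


def resolve_component(dotted_name):
    """Return the object designated by 'package.module.functionName'."""
    module_name, _, function_name = dotted_name.rpartition(".")
    if not module_name:
        raise ValueError(
            "expected 'moduleName.functionName', got %r" % dotted_name)
    module = importlib.import_module(module_name)
    return getattr(module, function_name)
```

As a check that does not depend on FSOpenLink being installed, `resolve_component("os.path.basename")` returns the standard library function of that name.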

11.2 “First names” bundle

11.2.1 List of bundle components

Attribute-related metrics

Name: firstName_dist
Location (Python syntax): metrics.FirstNames.firstName_dist
Description: Distance between 2 first names fields.
Input: 2 tuples, each containing exactly 1 string.

Name: naming_dist
Location (Python syntax): metrics.Naming.naming_dist
Description: Distance between 2 (firstname, familyname) fields, testing for a possible erroneous permutation of the 2 fields. (This can be considered a ‘cross-bundle’ metric.)
Input: 2 tuples, each containing exactly 2 strings.

Attribute-related normaliser(s)

Name: makeASCII
Location (Python syntax): normalisers.UnicodeToASCII.makeASCII
Description: Sound conversion of accented strings into non-accented ones.
Input: string

Name: makeASCII_UC
Location (Python syntax): normalisers.UnicodeToASCII.makeASCII_UC
Description: Sound conversion of accented strings into non-accented, upper-case ones.
Input: string

Name: nameNormaliser
Location (Python syntax): normalisers.name.nameNormaliser
Description: Same as makeASCII_UC, plus detection & elimination of various parasite characters.
Input: string

Attribute-related tokeniser(s)

Name: firstnameTokeniser
Location (Python syntax): tokenisers.firstnameTokeniser.firstnameTokeniser
Description: Sound splitting of a first names field into its individual components.
Input: string

11.2.2 “Firstname_dist” component usage

This metric implements a hierarchical levels approach to assess the degree of similarity between two firstname field instances.

Attribute-level distance assessment

This level is responsible for the proper treatment of multi-component attribute instances (i.e. firstnames composed of a concatenation of several firstnames, including middle names). Each attribute instance is first split into a list of tokens. Roughly: all distinct permutations of the two resulting lists are then compared with each other to determine the one giving the lowest cumulated “token-level” distance value (see below). This value is finally reported as the distance value between the 2 attribute instances.

[New in version 3.05]: A set of penalty factors is defined in a separate, user-editable parameters file. They provide fine-grained control over the build-up of the distance value.

Penalty factors file location: metrics\metrics_rsc\FirstNames_penalties.dat
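The permutation search described above can be sketched in a few lines. This is only an illustration of the principle, not the Firstname_dist code: the function name, the flat unmatched_penalty applied to tokens left without a partner, and the assumption that token_distance is symmetric are all hypothetical, and the actual penalty factors are those of the parameters file.

```python
from itertools import permutations


def attribute_distance(tokens_a, tokens_b, token_distance,
                       unmatched_penalty=1.0):
    """Lowest cumulated token-level distance over all token alignments."""
    short, long_ = sorted((list(tokens_a), list(tokens_b)), key=len)
    if not short:
        return unmatched_penalty * len(long_)
    best = float("inf")
    # Try every ordered selection of len(short) tokens from the longer list.
    for candidate in permutations(long_, len(short)):
        cost = sum(token_distance(x, y) for x, y in zip(short, candidate))
        best = min(best, cost)
    # Tokens of the longer list without a partner are charged a flat penalty.
    return best + unmatched_penalty * (len(long_) - len(short))
```

With an exact-match token distance, ["jean", "pierre"] and ["pierre", "jean"] are at distance 0.0, since one of the permutations aligns the tokens perfectly. The number of permutations grows factorially with the number of tokens, which is acceptable for name fields that rarely hold more than a handful of components.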

Token-level distance assessment

This level’s duty is to quantify, as efficiently as possible, the similarity between two firstname tokens. The main asset this distance capitalises on is a consolidated dictionary of more than 150’000 first names and nicknames, grouped into “equivalence classes”. This dictionary is a powerful resource to assess with a high degree of accuracy:

1. whether 2 firstname variants do correspond to the same name
2. whether 2 firstname variants can safely be considered not the same

Beyond firstname equivalence assessment, the metric also tries to take into account, in a sound way, the typical noise sources that can affect the spelling of a firstname token, such as:

• phonetic transcription errors
• typographical errors
• handwriting deciphering errors

The comparison logic is currently based on the Markovian Generalised Edit Distance (see section 9.3.2). Acting on the user-editable MGED parameters files (see “(M)GED weights specification files” in section 9.3.2) will therefore have a direct influence on the metric’s behaviour.

Metrics ‘pinning’ feature: managing an exceptions list

Firstname_dist benefits from a feature called ‘pinning’. Via an editable file, the user can 'pin' the metric’s behaviour on specific pairs of attribute instances, by forcing the numerical value taken by the metric on the specified pair of instances. This feature makes it possible, for example, to manually correct a possible misbehaviour of the metric in a specific situation. It therefore plays the practical role of a 'list of exceptions'.

Pinning file location: metrics\metrics_rsc\FirstNames_pinning.dat

Pinning file line format: "attributeInstance1;attributeInstance2;distanceValue"

NB : lines starting with '#' are considered as comments and ignored.
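The pinning file format lends itself to a very small reader. The sketch below is an illustration, not the system's own loader: the function names are hypothetical, blank lines are assumed to be ignorable, and the lookup is assumed to be sensitive to the order of the two instances (whether the real feature also matches the reversed pair is not stated here).

```python
def load_pinning(lines):
    """Map (instance1, instance2) to the pinned distance value."""
    pinned = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # comment or blank line
        instance1, instance2, value = line.split(";")
        pinned[(instance1, instance2)] = float(value)
    return pinned


def pinned_or_computed(a, b, pinned, metric):
    """Return the pinned value for (a, b) if any, else the metric's value."""
    if (a, b) in pinned:
        return pinned[(a, b)]
    return metric(a, b)
```

A line such as "JACK;JOHN;0.0" would then force the distance between these two instances to 0.0, whatever the underlying metric computes.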

11.3 “Family names” bundle

11.3.1 List of bundle components

Attribute-related metrics

Name: familyName_dist
Location (Python syntax): metrics.FamilyNames.familyName_dist
Description: Distance between 2 family names fields.
Input: 2 tuples, each containing exactly 1 string.

Name: naming_dist
Location (Python syntax): metrics.Naming.naming_dist
Description: Distance between 2 (firstname, familyname) fields, testing for a possible erroneous permutation of the 2 fields. (This can be considered a ‘cross-bundle’ metric.)
Input: 2 tuples, each containing exactly 2 strings.

Attribute-related normaliser(s)

Name: makeASCII
Location (Python syntax): normalisers.UnicodeToASCII.makeASCII
Description: Sound conversion of accented strings into non-accented ones.
Input: string

Name: makeASCII_UC
Location (Python syntax): normalisers.UnicodeToASCII.makeASCII_UC
Description: Sound conversion of accented strings into non-accented, upper-case ones.
Input: string

Name: nameNormaliser
Location (Python syntax): normalisers.name.nameNormaliser
Description: Same as makeASCII_UC, plus detection & elimination of various parasite characters.
Input: string

Attribute-related tokeniser(s)

Name: familynameTokeniser
Location (Python syntax): tokenisers.familynameTokeniser.familynameTokeniser
Description: Sound splitting of a family names field into its individual components (takes into account usual name prefixes).
Input: string

11.3.2 “Familyname_dist” metrics usage

This metric implements a hierarchical levels approach to assess the degree of similarity between two surname field instances.

Attribute-level distance assessment

This level is responsible for the proper treatment of multi-component attribute instances (i.e. surnames composed of a concatenation of several basic surnames, such as “wedding names”). Each attribute instance is tokenised, making use of a user-editable collection of reference surname particles (prefixes). Roughly: all distinct permutations of the two resulting lists are then compared with each other to determine the one giving the lowest cumulated “token-level” distance value (see below). This value is finally reported as the distance value between the 2 attribute instances.

[New in version 3.05]: A set of penalty factors is defined in a separate, user-editable parameters file. They provide fine-grained control over the build-up of the distance value.

Penalty factors file location: metrics\metrics_rsc\FamilyNames_penalties.dat

Token-level distance assessment

This level’s duty is to quantify, as efficiently as possible, the similarity between two name tokens. This is achieved by taking into account, in a sound way, the typical noise sources that can affect the spelling of a surname token, such as:

• phonetic transcription errors
• typographical errors
• handwriting deciphering errors

The comparison logic is currently based on the Markovian Generalised Edit Distance (see section 9.3.2). Acting on the user-editable MGED parameters files (see “(M)GED weights specification files” in section 9.3.2) will therefore have a direct influence on the metric’s behaviour.

Metrics ‘pinning’ feature: managing an exceptions list

Familyname_dist benefits from a feature called ‘pinning’. Via an editable file, the user can 'pin' the metric’s behaviour on specific pairs of attribute instances, by forcing the numerical value taken by the metric on the specified pair of instances. This feature makes it possible, for example, to manually correct a possible misbehaviour of the metric in a specific situation. It therefore plays the practical role of a 'list of exceptions'.

Pinning file location: metrics\metrics_rsc\FamilyNames_pinning.dat

Pinning file line format: "attributeInstance1;attributeInstance2;distanceValue"

NB : lines starting with '#' are considered as comments and ignored.

11.4 “Birthdates” bundle

11.4.1 List of bundle components

Attribute-related metrics

Name: birthDate_dist
Location (Python syntax): metrics.Birthdates.birthdate_dist
Description: Distance between 2 birth/calendar dates.
Input: 2 tuples, each containing exactly 1 string; format ‘YYYYMMDD’.

Name: birthDate_dist_eCH0083
Location (Python syntax): metrics.birthdates.birthdate_dist_eCH0083
Description: Distance between 2 birth/calendar dates, input format after the eCH0083 standard.
Input: 2 tuples, each containing exactly 3 strings, 2 of which are empty.

Attribute-related normaliser(s)

Name: calDateNormaliser_DDMMYYYY_fixed
Location (Python syntax): normalisers.calendarDate.calDateNormaliser_DDMMYYYY_fixed
Description: IN: calendar date in the form 'DD(sep)MM(sep)YYYY', where "sep" is an optional separator character. OUT: normalised calendar date in the form 'YYYYMMDD'.
Input: 2-tuple: (string, sep)

Name: calDateNormaliser_DDMMYYYY_var
Location (Python syntax): normalisers.calendarDate.calDateNormaliser_DDMMYYYY_var
Description: IN: calendar date in the form 'D?D(sep)M?M(sep)YYYY', where "sep" is an optional separator character and the first digit of DD or MM may be omitted if equal to zero. OUT: normalised calendar date in the form 'YYYYMMDD'.
Input: 2-tuple: (string, sep)
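The behaviour of the two calendar date normalisers can be approximated as follows. These are sketches written from the IN/OUT descriptions above, not the bundle's source code: the function names are hypothetical, the (string, sep) pair is passed as two separate arguments, and the variable-width variant is restricted here to a non-empty separator, because a date such as '1122011' cannot be split unambiguously without one.

```python
import re


def normalise_ddmmyyyy_fixed(value, sep=""):
    """'DD(sep)MM(sep)YYYY' -> 'YYYYMMDD'; day and month need two digits."""
    s = re.escape(sep)
    found = re.fullmatch(r"(\d{2})%s(\d{2})%s(\d{4})" % (s, s), value.strip())
    if found is None:
        raise ValueError("not a DD%sMM%sYYYY date: %r" % (sep, sep, value))
    day, month, year = found.groups()
    return year + month + day


def normalise_ddmmyyyy_var(value, sep):
    """'D?D(sep)M?M(sep)YYYY' -> 'YYYYMMDD'; leading zeros may be omitted."""
    if not sep:
        raise ValueError("this sketch needs a separator to split the date")
    s = re.escape(sep)
    found = re.fullmatch(r"(\d{1,2})%s(\d{1,2})%s(\d{4})" % (s, s),
                         value.strip())
    if found is None:
        raise ValueError("not a D?D%sM?M%sYYYY date: %r" % (sep, sep, value))
    day, month, year = found.groups()
    return year + month.zfill(2) + day.zfill(2)
```

Neither function checks that the result is a valid calendar date (for example, '31.02.2011' is accepted); such a check could be added with the standard datetime module if needed.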

Attribute-related tokeniser(s)

Name: birthdateTokeniser_by2
Location (Python syntax): tokenisers.birthdateTokeniser.birthdateTokeniser_by2
Description: Splitting of a birth/calendar date into YYYY and MMDD parts.
Input: string with format ‘YYYYMMDD’

Name: birthdateTokeniser_by3
Location (Python syntax): tokenisers.birthdateTokeniser.birthdateTokeniser_by3
Description: Splitting of a birth/calendar date into YYYY, MM and DD parts.
Input: string with format ‘YYYYMMDD’

Name: birthdateTokeniser_by3v2
Location (Python syntax): tokenisers.birthdateTokeniser.birthdateTokeniser_by3v2
Description: Splitting of a birth/calendar date into YYYYMM, YYYYDD and MMDD parts.
Input: string with format ‘YYYYMMDD’
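The three splitting schemes listed above amount to simple string slicing. The sketch below reproduces them under hypothetical function names; returning a Python list is an assumption, since the output type of the actual tokenisers is not specified in this section.

```python
def _check(date):
    """Reject anything that is not an 8-digit 'YYYYMMDD' string."""
    if len(date) != 8 or not date.isdigit():
        raise ValueError("expected 'YYYYMMDD', got %r" % date)


def tokenise_by2(date):
    """'YYYYMMDD' -> [YYYY, MMDD]"""
    _check(date)
    return [date[:4], date[4:]]


def tokenise_by3(date):
    """'YYYYMMDD' -> [YYYY, MM, DD]"""
    _check(date)
    return [date[:4], date[4:6], date[6:]]


def tokenise_by3v2(date):
    """'YYYYMMDD' -> [YYYYMM, YYYYDD, MMDD]"""
    _check(date)
    year, month, day = date[:4], date[4:6], date[6:]
    return [year + month, year + day, month + day]
```

The overlapping tokens of the third scheme mean that two dates differing in only one of their three parts still share one token.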

11.4.2 “Birthdate_dist” metrics usage

This metric tries to capture differences in birthdates that come from two sources (i.e. ‘noise’ types):

• Keystroke errors: one digit is typically replaced by a neighbouring one on the keyboard.

• Calendar-type transcription errors, which translate into a difference of a few elapsed days between the two birthdates.

[New in version 3.05a]: A set of penalty factors is defined in a separate, user-editable parameters file. They provide fine-grained control over the build-up of the distance value.

Penalty factors file location: metrics\metrics_rsc\Birthdates_penalties.dat
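The two noise types can be detected with elementary tests, sketched below. This is not the Birthdate_dist implementation and it does not produce a distance value: the function names are hypothetical, and the keyboard model (digits laid out as on the top row of a standard keyboard, neighbours being the keys immediately left and right) is an assumption; a numeric keypad would need a different adjacency table.

```python
from datetime import datetime

DIGIT_ROW = "1234567890"  # assumed layout: top row of a standard keyboard


def elapsed_days(date_a, date_b):
    """Absolute number of days between two 'YYYYMMDD' dates."""
    a = datetime.strptime(date_a, "%Y%m%d")
    b = datetime.strptime(date_b, "%Y%m%d")
    return abs((a - b).days)


def is_neighbour_keystroke(date_a, date_b):
    """True if the strings differ in exactly one digit, by adjacent keys."""
    if len(date_a) != len(date_b):
        return False
    differences = [(x, y) for x, y in zip(date_a, date_b) if x != y]
    if len(differences) != 1:
        return False
    x, y = differences[0]
    if x not in DIGIT_ROW or y not in DIGIT_ROW:
        return False
    return abs(DIGIT_ROW.index(x) - DIGIT_ROW.index(y)) == 1
```

A metric built on these signals would then translate them into a distance value, for instance through penalty factors of the kind stored in the parameters file.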

11.5 “CH Street address” bundle

11.5.1 List of bundle components

Attribute-related metrics

Name: address_dist
Location (Python syntax): metrics.Address.address_dist
Description: Distance between 2 normalised street addresses.
Input: 2 tuples, each containing exactly 1 string.

Attribute-related normaliser(s)

Name: CHAddrNormalize
Location (Python syntax): normalisers.CHAddress.CHAddrNormalize
Description: Normalisation of a (free form) street address field, typically of the type currently met in Switzerland.
Input: string

Attribute-related tokeniser(s)

None

11.6 “CHPlaces” bundle

11.6.1 List of bundle components

Attribute-related metrics

Name: CHPlaces_dist
Location (Python syntax): metrics.CHPlaces.CHPlaces_dist
Description: (Non-geographical) distance between 2 Swiss locations.
Input: 2 tuples, each containing exactly 1 string.

Attribute-related normaliser(s)

Name: CHPlaces_normaliser
Location (Python syntax): normalisers.CHPlaces.CHPlaces_normaliser
Description: Normalisation of a text field containing either a BFS official coding, or a free-form textual designation of a Swiss location.
Input: 2-tuple: (string, sep)

Attribute-related tokeniser(s)

None

12. Descriptors XSD

12.1 Datasource descriptor (“DSD”)

12.1.1 Graphical schema

12.1.2 Elements description

Element name Element description

DataSourceID Provide FSOL a tagname to identify the datasource presently described.

DataSourceType Indicates the type of the datasource (among those currently implemented in FSOL) the present descriptor refers to.

DataSourceProperties Enumerate the set of (type-specific) datasource properties. This minimal set of information is required for FSOL to properly and efficiently interact with the datasource.

Flatfile Fully qualified pathname of flatfile representing the datasource

InMemory Only effective for ‘FLAT’-type datasources. ‘True’ --> the operational database will be memory-based instead of disk-based. This gives a measurable processing time reduction for reasonably-sized datasets, but probably a fatal performance degradation beyond a certain size due to disk swapping; hence to be used with care! When missing, defaults to ‘False’.

Encoding The encoding of the flatfile contents. Supported values for the “Encoding” parameter are those listed in section 4.8.3 of the Python Library Reference (“Standard Encodings”), available online.

LineFormat ‘CSV’ (character-separated, variable length lines) or ‘FIXED’ (each field ranges between 2 well-defined columns)

FieldSeparator Character used inside the flatfile to separate between the contiguous fields forming one record (MUST be provided when LineFormat is 'csv').

OperatingFilesLoc Directory (with proper r/w privileges granted to user running FSOL) where FSOL will store its operating files.

DBServerType RDBMS vendor the datasource is associated to.

DBInstanceFile For a SQLITE database : fully qualified pathname to its file

DBServerName Name of the RDBMS server

DBServerPort TCP port the RDBMS server listens on.

DBInstanceID Name of the database instance

DBUserID Database access userID

DBUserPW Database access password

DataRecordSpecs Full qualification elements for a data record in the database

RecordComponents Container for the set of RecordComponent elements

RecordComponent Qualification of one particular record component

ComponentID FSOpenLink reference ‘name’ given to the attribute (field)

IsRecordPK ‘True’ --> this component works as primary key for the recordset behind this datasource. NB: one and only one record component should have this flag set to 'True'. If several are found, however, only the first will be used as PK. If none is found, FSOL execution will abort. When missing, defaults to ‘False’.

IsFullRecord ‘True’ --> this component works as a summary of all record components (provided as a comma-separated list). NB: one and only one record component should have this flag set to 'True'. When missing, defaults to ‘False’.

ComponentProperties Datasource type-dependent set of properties.

HTMLFormFieldText Text to appear next to the input field for the attribute, inside the HTML query form. This text should make it clear to the user which data is expected in this precise field.

HTMLFormFieldLength Desired length of the input field (in characters).

FieldColumnRange For a fixed-type line structure: column range in the form ##-##. Example: "12-19" indicates that the field extends from column 12 to column 19 (both inclusive). The first column in the line has index 1.

DataType Type of the data contained in the component. Hint: ‘STRING’ works for everything, including numbers (thanks to SQLite data typing specificities)

Normaliser Reference to the normalising logic (function) that should be applied to the field. Format: Python syntax (“moduleName.functionName”).

TableName Name of the RDB table inside which the component is stored.

ColumnName Name of the table column inside which the component is stored.

RecordKey The ‘componentID’ of the component playing the role of unique record key for the present component.

Notes: /
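To make the LineFormat, FieldSeparator and FieldColumnRange elements more concrete, the sketch below splits one flatfile line into fields for both line formats. It is an illustration written from the element descriptions above, not FSOL's reader: the function names are hypothetical, and quoting or escaping of the separator inside CSV fields is not handled.

```python
def parse_column_range(spec):
    """'12-19' -> (12, 19); columns are 1-based and both ends inclusive."""
    first, last = (int(part) for part in spec.split("-"))
    if first < 1 or last < first:
        raise ValueError("invalid column range: %r" % spec)
    return first, last


def split_record(line, line_format, field_separator=None, column_ranges=None):
    """Split one record line according to its declared LineFormat."""
    kind = line_format.upper()
    if kind == "CSV":
        if not field_separator:
            raise ValueError("FieldSeparator is mandatory for CSV lines")
        return line.split(field_separator)
    if kind == "FIXED":
        fields = []
        for spec in column_ranges or []:
            first, last = parse_column_range(spec)
            # Convert the 1-based inclusive range to a Python slice.
            fields.append(line[first - 1:last])
        return fields
    raise ValueError("unknown LineFormat: %r" % line_format)
```

For a fixed-format line, the ranges "1-4", "5-14" and "15-22" would for instance cut a 22-character record into a 4-character key, a 10-character name and an 8-character date. Trailing padding is kept as-is in this sketch and would typically be removed by a normaliser.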

12.2 Blocking strategy descriptor (“BSD”)

12.2.1 Graphical schema

12.2.2 Elements description

Element name Element description

BlockingStrategyID An identifier for the particular blocking strategy that is defined here

BlockType Container for the qualification of one ‘BlockType’ (= one particular mode of records “blocking”)

BlockTypeID Name of the current block type. Must be unique within the set of all provided BlockTypes.

BlockingField A field that takes part in the definition of the BlockType.

FieldName Must correspond to the name of a datasource component present in both datasource descriptors of the two databases to which the present strategy applies.

FieldTokeniser If needed, specifies the field tokeniser logic (a function that splits the field contents) to be applied to the field. Format: Python syntax (“moduleName.functionName”).

BlockField For datasources with pre-built block instances (‘RDB’ type): specifies the name of the datasource component containing the current blockType instances.
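
Both ‘FieldTokeniser’ and ‘Normaliser’ take a reference in the form “moduleName.functionName”. The sketch below shows one way such a reference could be resolved to a callable in Python. How FSOpenLink itself performs the lookup is not documented here, so the helper name is hypothetical, and the two standard-library functions are used only as stand-ins for user-written normalisers and tokenisers.

```python
import importlib


def resolve_function(reference):
    """Resolve a "moduleName.functionName" string to the function it names."""
    module_name, _, function_name = reference.rpartition(".")
    if not module_name or not function_name:
        raise ValueError("expected 'moduleName.functionName', got %r" % reference)
    module = importlib.import_module(module_name)
    return getattr(module, function_name)


# Stand-ins only: real descriptors would reference user-supplied modules.
normalise = resolve_function("string.capwords")  # tidy case and spacing
tokenise = resolve_function("shlex.split")       # split contents into tokens

normalised_name = normalise("  jean   dupont ")
name_tokens = tokenise("DE LA TOUR")
```

A package-qualified reference such as “package.module.functionName” is handled as well, because the string is split on its last dot.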

12.3 Gamma vector descriptor (“GVD”)

12.3.1 Graphical schema

12.3.2 Elements description

Element name Element description

VectorComponents Container for all VectorComponent elements

VectorComponent Description of one component of the gamma vector

ComponentID The (identically named) attribute(s) in the query and target datasources on which the component is calculated. If several attributes take part in the build-up of one component, they are listed in sequence, using "+" as separator.

QueryComponentID The attribute(s) in the query datasource on which the component is calculated. If several attributes take part in the build-up of one component, they are listed in sequence, using "+" as separator.

TargetComponentID The attribute(s) in the target datasource on which the component is calculated. If several attributes take part in the build-up of one component, they are listed in sequence, using "+" as separator.

ComponentMetricsID Identifier of the metric to apply to the component(s) referenced by the ‘ComponentID’ tag.

ComponentSSScoreWeight In SimilaritySearch mode: attribute-related scoring weight that can be exploited by the score-and-classify logic.

ComponentLevels Container for the description of the discretisation (sampling) levels scheme of the metric

EquiLevels Specification of a homogeneous partitioning of the unit interval into n bins.

LevelList Specification of a heterogeneous partitioning of the unit interval into n bins.

Level Qualification of one of the n bins

UpperLimitValue Value of the right bin boundary. NB: the UpperLimitValue of the LAST level in the provided list must be ‘1.0’.
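
To make the two partitioning schemes above more concrete, here is a minimal Python sketch of how a metric value in the unit interval could be mapped to a level. It is an illustration only: the function names are hypothetical, and treating each ‘UpperLimitValue’ as an inclusive right boundary is an assumption, since the handbook does not state how boundary values are assigned.

```python
from bisect import bisect_left


def equi_levels(n):
    """Upper limits of a homogeneous partitioning of [0, 1] into n bins."""
    return [(i + 1) / n for i in range(n)]


def validate_levels(upper_limits):
    """Check a list of UpperLimitValue entries (heterogeneous partitioning)."""
    if not upper_limits:
        raise ValueError("at least one level is required")
    if any(a >= b for a, b in zip(upper_limits, upper_limits[1:])):
        raise ValueError("upper limits must be strictly increasing")
    if upper_limits[-1] != 1.0:
        raise ValueError("the last UpperLimitValue must be 1.0")


def level_index(value, upper_limits):
    """Return the 0-based index of the bin that contains the metric value."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("metric value must lie in [0, 1]")
    # Assumption: a value equal to a boundary belongs to the bin on its left.
    return bisect_left(upper_limits, value)


ten_bins = equi_levels(10)        # EquiLevels-style scheme with n = 10
custom_bins = [0.5, 0.8, 1.0]     # LevelList-style scheme with 3 levels
validate_levels(custom_bins)
```

The check on the last upper limit mirrors the NB above: without a final boundary at 1.0, values close to 1.0 would fall outside every bin.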

13. Appendices

13.1 Typical user logfile sample

2010-05-13 18:39:55,500 INFO [FSOpenLink] - ========================================================================================================================
2010-05-13 18:39:55,500 INFO [FSOpenLink] - FSOPENLINK SESSION - HEADER INFORMATIONS
2010-05-13 18:39:55,500 INFO [FSOpenLink] - ------------------------------------------------------------------------------------------------------------------------
2010-05-13 18:39:55,500 INFO [FSOpenLink] - This is FSOpenLink v. 3.11
2010-05-13 18:39:55,500 INFO [FSOpenLink] - Licensed to Dev One [EntireLink Services]
2010-05-13 18:39:55,500 INFO [FSOpenLink] - Run on host platform : nomadis2 (SiteCode: D0632FE0 ; machineID: 5EDE-E5D0-DE47-237D)
2010-05-13 18:39:55,500 INFO [FSOpenLink] - Copyright EntireLink Services, 2006-2010. All rights reserved.
2010-05-13 18:39:55,500 INFO [FSOpenLink] - ========================================================================================================================
2010-05-13 18:39:55,500 INFO [FSOpenLink] - VALIDATION PHASE
2010-05-13 18:39:55,500 INFO [FSOpenLink] - ------------------------------------------------------------------------------------------------------------------------
2010-05-13 18:39:55,500 INFO [FSOpenLink] - Validating main FSOL configuration file elements...
2010-05-13 18:39:55,500 INFO [FSOpenLink] - Validation of main FSOL configuration file complete. File is valid.
2010-05-13 18:39:55,500 INFO [FSOpenLink] - Validating job configuration file...
2010-05-13 18:39:55,703 INFO [FSOpenLink] - Validation of job configuration file complete. File is valid.
2010-05-13 18:39:56,312 INFO [FSOpenLink] - Validating query datasource XML descriptor file syntax & contents...
2010-05-13 18:39:56,328 INFO [FSOpenLink] - Query datasource descriptor [E:\Eclipse workspace\FSOpenLink3\dist\jobs\JM\QUERYDB_DataSource_flat.xml] is XSD valid.
2010-05-13 18:39:56,375 INFO [FSOpenLink] - Validation complete. Syntax and contents of query datasource XML descriptor file are valid.
2010-05-13 18:39:56,375 INFO [FSOpenLink] - Validating target datasource XML descriptor file syntax & contents...
2010-05-13 18:39:56,375 INFO [FSOpenLink] - Target datasource descriptor [E:\Eclipse workspace\FSOpenLink3\dist\jobs\JM\IDDB_DataSource_flat.xml] is XSD valid.
2010-05-13 18:39:56,375 INFO [FSOpenLink] - Validation complete. Syntax and contents of target datasource XML descriptor file are valid.
2010-05-13 18:39:56,375 INFO [FSOpenLink] - Validating blockingstrategy XML descriptor file syntax & contents...
2010-05-13 18:39:56,375 INFO [FSOpenLink] - Blockingstrategy descriptor [E:\Eclipse workspace\FSOpenLink3\dist\jobs\JM\BlockingStrategy_QueryDB_IDDB_flat_short.xml] is XSD valid.
2010-05-13 18:39:56,390 INFO [FSOpenLink] - Validation complete. Syntax & contents of blockingstrategy XML descriptor file are valid.
2010-05-13 18:39:56,390 INFO [FSOpenLink] - Validating gammavector XML descriptor file syntax & contents...
2010-05-13 18:39:56,390 INFO [FSOpenLink] - Gammavector descriptor [E:\Eclipse workspace\FSOpenLink3\dist\jobs\JM\GammaVector_3comp_10lev.xml] is XSD valid.
2010-05-13 18:39:58,171 INFO [FSOpenLink] - Validation complete. Syntax & contents of gammavector XML descriptor file are valid.
2010-05-13 18:39:58,171 INFO [FSOpenLink] - VALIDATION PHASE COMPLETE.
2010-05-13 18:39:58,171 INFO [FSOpenLink] - ========================================================================================================================
2010-05-13 18:39:58,171 INFO [FSOpenLink] - COMPONENTS INITIALISATION AND REGISTRATION
2010-05-13 18:39:58,171 INFO [FSOpenLink] - ------------------------------------------------------------------------------------------------------------------------
2010-05-13 18:39:58,171 INFO [FSOpenLink.DataSource] - Initializing DatasourceDescriptor object.
2010-05-13 18:39:58,171 INFO [FSOpenLink.DataSource] - Reading DataSourceDescriptor object properties from file E:\Eclipse workspace\FSOpenLink3\dist\jobs\JM\QUERYDB_DataSource_flat.xml
2010-05-13 18:39:58,171 INFO [FSOpenLink.DataSource] - Initializing DatasourceDescriptor object.
2010-05-13 18:39:58,171 INFO [FSOpenLink.DataSource] - Reading DataSourceDescriptor object properties from file E:\Eclipse workspace\FSOpenLink3\dist\jobs\JM\IDDB_DataSource_flat.xml
2010-05-13 18:39:58,187 INFO [FSOpenLink.DataSource] - QUERYDB - Registered datasource descriptor
2010-05-13 18:39:58,187 INFO [FSOpenLink.DataSource] - Initialized DataSource object with datasource descriptor.
2010-05-13 18:39:58,187 INFO [FSOpenLink.DataSource] - Initialized RDBMS_DataSource object.
2010-05-13 18:39:58,187 INFO [FSOpenLink.DataSource] - Initializing DatasourceDescriptor object.
2010-05-13 18:39:58,187 INFO [FSOpenLink.DataSource] - Reading DataSourceDescriptor object properties from file E:/SQLite Datafiles\QUERYDB__opDSD.xml
2010-05-13 18:39:58,203 INFO [FSOpenLink.DataSource] - QUERYDB - Registered datasource descriptor
2010-05-13 18:39:58,203 INFO [FSOpenLink.DataSource] - IDDB - Registered datasource descriptor
2010-05-13 18:39:58,203 INFO [FSOpenLink.DataSource] - Initialized DataSource object with datasource descriptor.
2010-05-13 18:39:58,203 INFO [FSOpenLink.DataSource] - Initialized RDBMS_DataSource object.
2010-05-13 18:39:58,203 INFO [FSOpenLink.DataSource] - Initializing DatasourceDescriptor object.
2010-05-13 18:39:58,203 INFO [FSOpenLink.DataSource] - Reading DataSourceDescriptor object properties from file E:/SQLite Datafiles\IDDB__opDSD.xml
2010-05-13 18:39:58,203 INFO [FSOpenLink.DataSource] - IDDB - Registered datasource descriptor
2010-05-13 18:39:58,203 INFO [FSOpenLink.Blocking] - Initializing BlockingStrategy object.
2010-05-13 18:39:58,203 INFO [FSOpenLink.Blocking] - Reading BlockingStrategy object properties from file E:\Eclipse workspace\FSOpenLink3\dist\jobs\JM\BlockingStrategy_QueryDB_IDDB_flat_short.xml
2010-05-13 18:39:58,203 INFO [FSOpenLink.Blocking] - Successfully read BlockingStrategy object properties.
2010-05-13 18:39:58,217 INFO [FSOpenLink.DataSource] - Initializing DatasourceDescriptor object.
2010-05-13 18:39:58,217 INFO [FSOpenLink.DataSource] - Reading DataSourceDescriptor object properties from file E:/SQLite Datafiles\QUERYDB__opDSD.xml
2010-05-13 18:39:58,217 INFO [FSOpenLink.DataSource] - QUERYDB - Registered datasource descriptor
2010-05-13 18:39:58,217 INFO [FSOpenLink.Blocking] - Initializing BlockingStrategy object.
2010-05-13 18:39:58,217 INFO [FSOpenLink.Blocking] - Reading BlockingStrategy object properties from file E:/SQLite Datafiles\QUERYDB__opBSD.xml
2010-05-13 18:39:58,233 INFO [FSOpenLink.Blocking] - Successfully read BlockingStrategy object properties.
2010-05-13 18:39:58,233 INFO [FSOpenLink.DataSource] - QUERYDB - Registered blocking strategy from provided BlockingStrategyDescriptor object
2010-05-13 18:39:58,233 INFO [FSOpenLink.DataSource] - Initializing DatasourceDescriptor object.
2010-05-13 18:39:58,233 INFO [FSOpenLink.DataSource] - Reading DataSourceDescriptor object properties from file E:/SQLite Datafiles\IDDB__opDSD.xml
2010-05-13 18:39:58,233 INFO [FSOpenLink.DataSource] - IDDB - Registered datasource descriptor
2010-05-13 18:39:58,233 INFO [FSOpenLink.Blocking] - Initializing BlockingStrategy object.
2010-05-13 18:39:58,233 INFO [FSOpenLink.Blocking] - Reading BlockingStrategy object properties from file E:/SQLite Datafiles\IDDB__opBSD.xml
2010-05-13 18:39:58,233 INFO [FSOpenLink.Blocking] - Successfully read BlockingStrategy object properties.
2010-05-13 18:39:58,233 INFO [FSOpenLink.DataSource] - IDDB - Registered blocking strategy from provided BlockingStrategyDescriptor object
2010-05-13 18:39:58,233 INFO [FSOpenLink.GammaVector] - Initializing GammaVector object.
2010-05-13 18:39:58,233 INFO [FSOpenLink.GammaVector] - Reading GammaVector object properties from file E:\Eclipse workspace\FSOpenLink3\dist\jobs\JM\GammaVector_3comp_10lev.xml
2010-05-13 18:39:58,250 INFO [FSOpenLink.GammaVector] - Successfully acquired GammaVector object properties.
2010-05-13 18:39:58,250 INFO [FSOpenLink.Blocking] - Entering MPGenerator initialization.
2010-05-13 18:39:58,250 INFO [FSOpenLink.Blocking] - Registered query datasource as : QUERYDB
2010-05-13 18:39:58,250 INFO [FSOpenLink.Blocking] - Registered target datasource as : IDDB
2010-05-13 18:39:58,250 INFO [FSOpenLink.Blocking] - Registered blocking strategy descriptor.
2010-05-13 18:39:58,250 INFO [FSOpenLink] - COMPONENTS INITIALISATION AND REGISTRATION COMPLETE.
2010-05-13 18:39:58,250 INFO [FSOpenLink] - ========================================================================================================================
2010-05-13 18:39:58,250 INFO [FSOpenLink] - SCORING PHASE
2010-05-13 18:39:58,250 INFO [FSOpenLink] - ------------------------------------------------------------------------------------------------------------------------
2010-05-13 18:39:58,250 INFO [FSOpenLink.FSClassifier] - setSSLevels : set M-zone lower threshold to [0.900000] ; C-zone lower threshold to [0.660000].
2010-05-13 18:39:58,250 INFO [FSOpenLink.FSClassifier] - Importing package SSClassifiers.defaultSSClassifier
2010-05-13 18:39:58,250 INFO [FSOpenLink.FSClassifier] - Successfully registered score-and-classify logics 'SSClassifiers.defaultSSClassifier.weightedAvg'
2010-05-13 18:39:58,250 INFO [FSOpenLink.FSScorer] - Entering Scorer initialization.
2010-05-13 18:39:58,250 INFO [FSOpenLink.FSScorer] - Registered matched pairs generator to be used.
2010-05-13 18:39:58,250 INFO [FSOpenLink.FSScorer] - Registered gamma vector to be used.
2010-05-13 18:39:58,250 INFO [FSOpenLink.FSScorer] - Importing package metrics.FamilyNames
2010-05-13 18:39:58,250 INFO [FSOpenLink.FSScorer] - Importing package metrics.FirstNames
2010-05-13 18:39:58,250 INFO [FSOpenLink.FSScorer] - Importing package metrics.Birthdates
2010-05-13 18:39:58,250 INFO [root] - Importing package filters.scoring.reasonableMPFilter
2010-05-13 18:39:58,250 INFO [FSOpenLink.FSScorer] - Now starting to score query records.
2010-05-13 18:39:58,250 INFO [FSOpenLink.FSScorer] - Starting query datasource records scoring.
2010-05-13 18:39:58,265 INFO [FSOpenLink.DataSource] - QUERYDB - Optimized connection for faster performance.
2010-05-13 18:39:58,280 INFO [FSOpenLink.DataSource] - IDDB - Optimized connection for faster performance.
2010-05-13 18:39:58,280 INFO [FSOpenLink.Blocking] - Handling block type : parNomDdn
2010-05-13 18:39:58,280 INFO [FSOpenLink.DataSource] - QUERYDB - Building a blockIDGenerator for blockTypeID 'parNomDdn'
2010-05-13 18:40:24,921 INFO [FSOpenLink.FSScorer] - Number of MatchedPairs examined up to now : 5000
2010-05-13 18:40:24,921 INFO [FSOpenLink.FSScorer] - Number of MatchedPairs processed up to now : 4798
2010-05-13 18:40:24,921 INFO [FSOpenLink.FSScorer] - Number of MatchedPairs stored up to now : 4792
2010-05-13 18:40:24,921 INFO [FSOpenLink.FSScorer] - Number of query records identified up to now : 1446
2010-05-13 18:40:34,328 INFO [FSOpenLink.Blocking] - Completed handling block type : parNomDdn
2010-05-13 18:40:34,328 INFO [FSOpenLink.Blocking] - Handling block type : parNomPrenom
2010-05-13 18:40:34,328 INFO [FSOpenLink.DataSource] - QUERYDB - Building a blockIDGenerator for blockTypeID 'parNomPrenom'
2010-05-13 18:40:36,030 INFO [FSOpenLink.FSScorer] - Number of MatchedPairs examined up to now : 10000
2010-05-13 18:40:36,030 INFO [FSOpenLink.FSScorer] - Number of MatchedPairs processed up to now : 9256
2010-05-13 18:40:36,030 INFO [FSOpenLink.FSScorer] - Number of MatchedPairs stored up to now : 9226
2010-05-13 18:40:36,030 INFO [FSOpenLink.FSScorer] - Number of query records identified up to now : 2501
2010-05-13 18:40:42,562 INFO [FSOpenLink.FSScorer] - Commited last 100000 processed Matchedpairs.
2010-05-13 18:41:01,328 INFO [FSOpenLink.FSScorer] - Number of MatchedPairs examined up to now : 15000
2010-05-13 18:41:01,328 INFO [FSOpenLink.FSScorer] - Number of MatchedPairs processed up to now : 13122
2010-05-13 18:41:01,328 INFO [FSOpenLink.FSScorer] - Number of MatchedPairs stored up to now : 12637
2010-05-13 18:41:01,328 INFO [FSOpenLink.FSScorer] - Number of query records identified up to now : 2516
2010-05-13 18:41:22,421 INFO [FSOpenLink.FSScorer] - Number of MatchedPairs examined up to now : 20000
2010-05-13 18:41:22,421 INFO [FSOpenLink.FSScorer] - Number of MatchedPairs processed up to now : 17226
2010-05-13 18:41:22,421 INFO [FSOpenLink.FSScorer] - Number of MatchedPairs stored up to now : 16258
2010-05-13 18:41:22,421 INFO [FSOpenLink.FSScorer] - Number of query records identified up to now : 2532
2010-05-13 18:41:34,108 INFO [FSOpenLink.Blocking] - Completed handling block type : parNomPrenom
2010-05-13 18:41:34,108 INFO [FSOpenLink.Blocking] - Total number of MatchedPairs returned : 23047.
2010-05-13 18:41:34,405 INFO [FSOpenLink.FSScorer] - Starting MatchedPairs table indexing...
2010-05-13 18:41:34,546 INFO [FSOpenLink.FSScorer] - MatchedPairs table indexing complete.
2010-05-13 18:41:34,546 INFO [FSOpenLink.FSScorer] - Total number of MatchedPairs examined : 23047
2010-05-13 18:41:34,546 INFO [FSOpenLink.FSScorer] - Total number of MatchedPairs processed : 19735
2010-05-13 18:41:34,546 INFO [FSOpenLink.FSScorer] - Total number of MatchedPairs stored : 18508
2010-05-13 18:41:34,546 INFO [FSOpenLink.FSScorer] - Processing time : 96.296000 seconds.
2010-05-13 18:41:34,546 INFO [FSOpenLink.FSScorer] - Ended query datasource records scoring.
2010-05-13 18:41:34,562 INFO [FSOpenLink] - Time taken by scoring phase (18508 MatchedPairs stored) : 96.312000 seconds
2010-05-13 18:41:34,562 INFO [FSOpenLink] - SCORING PHASE COMPLETE.
2010-05-13 18:41:34,562 INFO [FSOpenLink] - ========================================================================================================================
2010-05-13 18:41:34,562 INFO [FSOpenLink] - METASCORING PHASE
2010-05-13 18:41:34,562 INFO [FSOpenLink] - ------------------------------------------------------------------------------------------------------------------------
2010-05-13 18:41:34,562 INFO [FSOpenLink.FSScorer] - Entering Scorer initialization.
2010-05-13 18:41:34,562 INFO [FSOpenLink.FSScorer] - Registered matched pairs generator to be used.
2010-05-13 18:41:34,562 INFO [FSOpenLink.FSScorer] - Registered gamma vector to be used.
2010-05-13 18:41:34,562 INFO [FSOpenLink.FSScorer] - Importing package metrics.FamilyNames
2010-05-13 18:41:34,562 INFO [FSOpenLink.FSScorer] - Importing package metrics.FirstNames
2010-05-13 18:41:34,562 INFO [FSOpenLink.FSScorer] - Importing package metrics.Birthdates
2010-05-13 18:41:34,562 INFO [root] - Importing package SQRBuilders.basicSQRBuilder
2010-05-13 18:41:34,562 INFO [FSOpenLink.FSScorer] - Starting the ScoredQueryRecords generation process.
2010-05-13 18:41:34,562 INFO [FSOpenLink.FSScorer] - SQLite storage required. Creating SQLite structure...
2010-05-13 18:41:34,578 INFO [FSOpenLink.FSScorer] - SQLite storage structure created.
2010-05-13 18:41:34,578 INFO [FSOpenLink.FSScorer] - Now starting the MatchedPairs processing phase.
2010-05-13 18:41:34,578 INFO [FSOpenLink.DataSource] - QUERYDB - Optimized connection for faster performance.
2010-05-13 18:41:34,578 INFO [FSOpenLink.DataSource] - IDDB - Optimized connection for faster performance.
2010-05-13 18:41:35,483 INFO [FSOpenLink.FSScorer] - Now processing 18508 currently selected MatchedPairs...
2010-05-13 18:41:35,717 INFO [FSOpenLink.FSScorer] - Processed 15000 MatchedPairs into query records.
2010-05-13 18:41:35,733 INFO [FSOpenLink.FSScorer] - Now storing 9348 ScoredQueryRecords into SQLite database..
2010-05-13 18:41:35,967 INFO [FSOpenLink.FSScorer] - Storing complete.
2010-05-13 18:41:35,967 INFO [FSOpenLink.FSScorer] - MatchedPairs processing phase complete.
2010-05-13 18:41:35,967 INFO [FSOpenLink.FSScorer] - Processed 18508 MatchedPairs concerning 9348 query records.
2010-05-13 18:41:35,983 INFO [FSOpenLink.FSScorer] - Adding U-classified query records to the ScoredQueryRecord dataset..
2010-05-13 18:41:35,983 INFO [FSOpenLink.FSScorer] - Retrieving all queryIDs and storing them into tmp file [E:\FSOL_output\QUERYDB_IDDB\allQueryIDs.dat]..
2010-05-13 18:41:35,983 INFO [FSOpenLink.DataSource] - QUERYDB - Optimized connection for faster performance.
2010-05-13 18:41:35,983 INFO [FSOpenLink.DataSource] - QUERYDB - Retrieving all recordIDs
2010-05-13 18:41:36,030 INFO [FSOpenLink.FSScorer] - Retrieval complete.
2010-05-13 18:41:36,030 INFO [FSOpenLink.FSScorer] - Retrieving all MatchedPairs-involved queryIDs and storing them into tmp file [E:\FSOL_output\QUERYDB_IDDB\alreadyScoredQueryIDs.dat]..
2010-05-13 18:41:36,108 INFO [FSOpenLink.FSScorer] - Retrieval complete.
2010-05-13 18:41:36,108 INFO [FSOpenLink.FSScorer] - Infile sorting of the set of all queryIDs..
2010-05-13 18:41:36,140 INFO [FSOpenLink.FSScorer] - Infile sorting complete.
2010-05-13 18:41:36,140 INFO [FSOpenLink.FSScorer] - Infile sorting of the set of MatchedPairs-involved queryIDs..
2010-05-13 18:41:36,171 INFO [FSOpenLink.FSScorer] - Infile sorting complete.
2010-05-13 18:41:36,171 INFO [FSOpenLink.FSScorer] - Computing delta between two sets..
2010-05-13 18:41:36,233 INFO [FSOpenLink.FSScorer] - Computation complete.
2010-05-13 18:41:36,280 INFO [FSOpenLink.FSScorer] - Adding complete.
2010-05-13 18:41:36,280 INFO [FSOpenLink.FSScorer] - Indexing SQLite structure for better data access performance...
2010-05-13 18:41:36,328 INFO [FSOpenLink.FSScorer] - ScoredQueryRecords generation process complete.
2010-05-13 18:41:36,358 INFO [FSOpenLink] - Time taken by metascoring phase (9348 ScoredQRecords generated) : 1.797000 seconds
2010-05-13 18:41:36,358 INFO [FSOpenLink] - METASCORING PHASE COMPLETE.
2010-05-13 18:41:36,358 INFO [FSOpenLink] - ========================================================================================================================
2010-05-13 18:41:36,358 INFO [FSOpenLink] - RESULTS EXPORTING PHASE
2010-05-13 18:41:36,358 INFO [FSOpenLink] - ------------------------------------------------------------------------------------------------------------------------
2010-05-13 18:41:36,467 INFO [FSOpenLink] - Exported ScoredQueryRecords SQLite database to flat CSV file.
2010-05-13 18:41:36,467 INFO [FSOpenLink.DataSource] - QUERYDB - Optimized connection for faster performance.
2010-05-13 18:41:36,467 INFO [FSOpenLink.DataSource] - IDDB - Optimized connection for faster performance.
2010-05-13 18:41:39,546 INFO [FSOpenLink] - Exported results to full flat file.
2010-05-13 18:41:41,375 INFO [FSOpenLink] - Generated score values distribution histogram.
2010-05-13 18:41:42,265 INFO [FSOpenLink.DataSource] - QUERYDB - Optimized connection for faster performance.
2010-05-13 18:41:42,265 INFO [FSOpenLink.DataSource] - IDDB - Optimized connection for faster performance.
2010-05-13 18:41:42,328 INFO [FSOpenLink] - Exported sample results file.
2010-05-13 18:41:42,358 INFO [FSOpenLink] - Exported summary results file.
2010-05-13 18:41:42,358 INFO [FSOpenLink] - RESULTS EXPORTING PHASE COMPLETE.
2010-05-13 18:41:42,375 INFO [FSOpenLink] - ========================================================================================================================
2010-05-13 18:41:42,375 INFO [FSOpenLink] - ========================================================================================================================
2010-05-13 18:41:42,375 INFO [FSOpenLink] - END OF PROCESSING. Exiting normally.
2010-05-13 18:41:42,375 INFO [FSOpenLink] - ========================================================================================================================
2010-05-13 18:41:42,375 INFO [FSOpenLink] - ========================================================================================================================
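
Each entry of the logfile follows the pattern “timestamp, level, [logger name], message”. The Python sketch below shows how such entries could be parsed, for instance to measure the duration of a session. It assumes that the logfile holds one entry per line; the regular expression and the function name are illustrative and are not shipped with FSOpenLink.

```python
import re
from datetime import datetime

# Pattern inferred from the sample above: date, time with milliseconds,
# level, logger name in square brackets, a dash, then the message.
LOG_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"\[(?P<logger>[^\]]+)\]\s+-\s+"
    r"(?P<message>.*)$"
)


def parse_log_line(line):
    """Return a dict describing one log entry, or None if the line does not match."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    entry["timestamp"] = datetime.strptime(
        entry["timestamp"], "%Y-%m-%d %H:%M:%S,%f"
    )
    return entry


first = parse_log_line(
    "2010-05-13 18:39:55,500 INFO [FSOpenLink] - This is FSOpenLink v. 3.11"
)
last = parse_log_line(
    "2010-05-13 18:41:42,375 INFO [FSOpenLink] - END OF PROCESSING. Exiting normally."
)
elapsed_seconds = (last["timestamp"] - first["timestamp"]).total_seconds()
```

Filtering the parsed entries on the logger name (for example ‘FSOpenLink.FSScorer’) isolates the progress counters of the scoring phase.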