Presented at the USPTO, November 2011
By Gianluca Tarasconi
www.rawpatentdata.blogspot.com
Sharing names/address cleaning patterns for Patstat: a metadata structure proposal
From chaos to order...
The main milestones in cleaning and standardizing PATSTAT persons (inventors and applicants), starting from the TLS206 table, can be summarized as follows:
RE-PARSING / RESTRUCTURING -> CLEANING -> STANDARDIZATION -> DEDUPLICATION
Because the process is strictly sequential, the results of the last steps (address standardization and deduplication) depend greatly on the quality of the first two steps.
... and back to chaos...
Different teams specialize in 'local' addresses (country-wise data cleaning).
Standards (e.g. the order of toponym, street name, number) differ from country to country.
Enrichments / links to other data may need a special data structure.
In the end, data parsing and cleaning will produce very different results across work teams.
PERSON_ID 1430436: how would you clean this (if you are not Russian)?
ARMYANSKOE SPETSIALIZIROVANNOE PROEKTNO-IZYSKATELSKOE, HAÔÚHO-ôCCóE¯OBATEóÓCKOE ô KOHCTPÔKTOPCKOE OT¯EóEHôE BCECO½³HO×O ×OCÔ¯APCTBEHHO×O üPOEKTHO-ô³ÕCKATEóÓCKO×O ô HAÔÚHO-ôCCóE¯OBATEóÓCKO×O ôHCTôTÔTA ÜHEP×ETôÚECKôX CôCTEM ô ÜóEKTPôÚECKôX CETEö "ÜHEP×OCETÓüPOEKT"
ARMYANSKOE SPETSIALIZIROVANNOE PROEKTNO-IZYSKATELSKOE, NAUCHNO-ISSLEDOVATELSKOE I KONSTRUKTORSKOE OTDELENIE VSESOYUZNOGO GOSUDARSTVENNOGO PROEKTNO-IZYSKATELSKOGO I NAUCHNO-ISSLEDOVATELSKOGO INSTITUTA ENERGETICHESKIKH SYSTEM I ELEKTRICHESKIKH SETEI "ENERGOSVYAZPROEKT"
Metadata structure proposal (I)
We expect that at some point the data coming from PATSTAT will be parsed into an intermediate data structure, where the original strings are split into several fields according to the meaning of the information they contain. Right after that, the cleaning phase removes the noise, allowing other tools (e.g. a Google lookup) to standardize the tuples.
Data origin (206, 206ascii…)
  -> [Re-parsing] -> Parsed data structure
  -> [Cleaning] -> Parsed & clean data
  -> [Standardization] -> Standard data
  -> [Deduplication] -> Standard & disambiguated data
Metadata structure proposal (II)
LAST_NAME       Surname / company name
FIRST_NAME      First name (blank for companies)
MIDDLE_NAME     Second, third, 4th names … (blank for companies)
NAME_EXTENSION  Jr/Sr/academic title; type of business entity for companies
ADDRESS         Typically: toponym, name, number
LOCALITY        City area (optional)
ADDR_OTHER      Other specifics different from toponyms (floor, building, but also c/o company name) [should be data not relevant for standardization]
CITY            Municipality name
COUNTY          Administrative level above municipality
REGION          Administrative level above county
STATE           Administrative level above region, for federal nations
ZIP_CODE        Alphanumeric zip code
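As an illustration, the field list above could be held in a simple record type. This is only a minimal Python sketch: the class name `ParsedPerson` and the use of empty strings for blank fields are assumptions made for the example, not part of the proposal.

```python
from dataclasses import dataclass, fields


@dataclass
class ParsedPerson:
    """One PATSTAT person split into the proposed fields (hypothetical class name)."""
    last_name: str = ""        # surname / company name
    first_name: str = ""       # blank for companies
    middle_name: str = ""      # blank for companies
    name_extension: str = ""   # Jr/Sr/academic title; business entity type for companies
    address: str = ""          # typically: toponym, name, number
    locality: str = ""         # city area (optional)
    addr_other: str = ""       # floor, building, c/o ...; not relevant for standardization
    city: str = ""             # municipality name
    county: str = ""           # level above municipality
    region: str = ""           # level above county
    state: str = ""            # level above region, for federal nations
    zip_code: str = ""         # alphanumeric zip code


# A company: the name goes in last_name, the entity type in name_extension.
company = ParsedPerson(last_name="TAPROGGE", name_extension="GMBH")
field_names = [f.name for f in fields(ParsedPerson)]
```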
Dimensions in data cleaning:
Define pre-parsing and data cleaning as:
• a projection (in the algebraic sense of the word), where:
• some operators transform some vectors (fields) into other vectors…
• … within the constraint of (endogenous) conditions given by the data structure.
Projections may take place if some pre-conditions (patterns) are satisfied.
Dimensions in data cleaning: operators (I)
We consider only the MAP+CORRECT operator, treating the other possible operators as particular cases of it.
MAP+CORRECT maps a vector into the correct domain, possibly transforming its elements (it moves a string from one field to another, replacing it where correction is needed…).
For this operator we need two dimensions indicating where the operation takes place:
FIELD FROM  name of the field where the operation starts
FIELD TO    name of the final target of the operator (optional)
We also need to state which string has to be found and what it must be replaced with:
FIND        string to be found
REPLACE     string replacing the string found
Dimensions in data cleaning: operators (II)
How to emulate other string operators with MAP+CORRECT (M+C):
MOVE (moves a string from one field to another) = M+C where REPLACE string = FIND string
REPLACE (changes a string inside a field) = M+C where FIELD FROM = FIELD TO
INSERT (inserts a string without removing other strings) = M+C where FIELD FROM = FIELD TO and REPLACE string = FIND + insert string
DELETE (removes a string without removing other strings) = M+C where FIELD FROM = FIELD TO and REPLACE string is empty
[NOTE: moving a string within a field is considered only when we need to shift it to the trailing or leading position; this may take several steps to accomplish.]
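The reduction of MOVE, REPLACE, INSERT and DELETE to a single operator can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: records are plain dictionaries, only the first occurrence of FIND is handled, a string moved to another field is appended at the end of the target, and whitespace is collapsed afterwards. The function names and the sample address are hypothetical.

```python
def map_correct(record, field_from, field_to, find, replace):
    """MAP+CORRECT: take `find` out of `field_from` and put `replace` into `field_to`.

    Returns a new dict; the record comes back unchanged if `find` does not occur.
    """
    source = record.get(field_from, "")
    result = dict(record)
    if find not in source:
        return result
    if field_from == field_to:
        # Same field: replace the first occurrence in place, then collapse spaces.
        result[field_from] = " ".join(source.replace(find, replace, 1).split())
    else:
        # Different fields: remove from the source, append to the target (assumption).
        result[field_from] = " ".join(source.replace(find, "", 1).split())
        result[field_to] = (result.get(field_to, "") + " " + replace).strip()
    return result


def move(record, field_from, field_to, find):
    return map_correct(record, field_from, field_to, find, find)    # REPLACE = FIND


def replace_in(record, field, find, replace):
    return map_correct(record, field, field, find, replace)         # FROM = TO


def insert(record, field, find, inserted):
    return map_correct(record, field, field, find, find + inserted) # REPLACE = FIND + insert


def delete(record, field, find):
    return map_correct(record, field, field, find, "")              # REPLACE is empty


# Hypothetical sample record, for illustration only.
sample = {"address": "MAIN STREET 5 PO BOX 12", "addr_other": ""}
moved = move(sample, "address", "addr_other", "PO BOX 12")
```

Here `moved` holds the PO BOX in `addr_other` and the remaining street part in `address`, while `sample` itself is left untouched.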
Usage of operators: example
Step                          Field from  Field to        Find              Replace  Resulting last_name
(start)                                                                              TAPROGGE GESELLSCHAFT MBH
A: REPLACE                    LAST_NAME   LAST_NAME       GESELLSCHAFT MBH  GMBH     TAPROGGE GMBH
B: MOVE                       LAST_NAME   NAME_EXTENSION  GMBH              GMBH     TAPROGGE
MOVE+REPLACE (A+B in 1 step)  LAST_NAME   NAME_EXTENSION  GESELLSCHAFT MBH  GMBH     TAPROGGE
Endogenous conditions
The methods used to clean addresses may differ depending on information contained in the data themselves. Typical cases are:
APPLICATION AUTHORITY: gives some 'address filling hints' and the charset
COUNTRY CODE: gives toponyms, administrative data, etc.
YEAR FROM / TO (optional): some information may change over time (e.g. a change in country code)
PATSTAT EDITION FROM / TO (optional): some information may change between PATSTAT editions
Pre-conditions: match patterns
Ultimately, at string level, this is the core of our interchange format.
Our proposal is to use SQL REGEXP operator patterns as the default, with the following parameters:
LIKE                    pattern to be found (inclusion criterion)
LIKE NOT [optional]     pattern that must not be present (exclusion criterion)
POSITION (begin / end)  start / end position where the pattern can be found
SQLSTANDARD             the standard used to write the patterns (SQL 'dialect', LIKE vs REGEXP…), to make translation easier
Interchange data structure proposal: vectors(I)
We propose a field called OPERATIONKIND to store the origin and destination of the operation.
It is a multi-position indicator with one character for each field of the pattern group (LIKE, LIKE NOT, FIND, REPLACE), indicating the field that pattern addresses.
Address fields:
COUNTRY  ADDRESS  LOCALITY  ADDR_OTHER  CITY  COUNTY  REGION  STATE  ZIP  NOWHERE
A        B        C         D           E     F       G       H      I    0
Name fields:
LAST_NAME  FIRST_NAME  MIDDLE_NAME  NAME_EXTENSION
0          1           2            3
Interchange data structure proposal: vectors(II)
For instance: BCEF
LIKE, LIKE NOT, FIND, REPLACE = BCEF means: if the LIKE pattern is in address and the LIKE NOT pattern is not in locality, find the FIND pattern in city and insert the REPLACE pattern in county.
An optional last character may be added for move operations (where the 1st and 4th characters differ): L or T, indicating whether the REPLACE pattern must be inserted in leading or trailing position in the target field.
For instance: BBBDT means that LIKE, LIKE NOT and FIND are in address, and the replace string must be inserted at the end of the addr_other field.
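Decoding an OPERATIONKIND value is a plain table lookup, which the following Python sketch illustrates. The function name and the shape of the returned dictionary are assumptions. Note that the code tables list 0 both as NOWHERE (address group) and as LAST_NAME (name group); this sketch reads digits as name fields and does not try to resolve that ambiguity.

```python
ADDRESS_CODES = {
    "A": "COUNTRY", "B": "ADDRESS", "C": "LOCALITY", "D": "ADDR_OTHER",
    "E": "CITY", "F": "COUNTY", "G": "REGION", "H": "STATE", "I": "ZIP",
}
# Assumption: digits always denote name fields (0 is never read as NOWHERE here).
NAME_CODES = {"0": "LAST_NAME", "1": "FIRST_NAME", "2": "MIDDLE_NAME", "3": "NAME_EXTENSION"}
ROLES = ("LIKE", "LIKE NOT", "FIND", "REPLACE")
INSERT_POSITIONS = {"L": "leading", "T": "trailing"}


def decode_operationkind(code):
    """Map each of the four positions to a field name; a 5th L/T flag is optional."""
    if len(code) not in (4, 5):
        raise ValueError("OPERATIONKIND must have 4 or 5 characters")
    table = {**ADDRESS_CODES, **NAME_CODES}
    decoded = {role: table[symbol] for role, symbol in zip(ROLES, code[:4])}
    decoded["INSERT AT"] = INSERT_POSITIONS[code[4]] if len(code) == 5 else None
    return decoded


bcef = decode_operationkind("BCEF")
bbbdt = decode_operationkind("BBBDT")
```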
One example (as before):
ID                     1
OPERATIONKIND          0003
APPLICATION AUTHORITY  EP
COUNTRY CODE           DE
ALIKE                  GESELLSCHAFT MBH$
LIKE NOT               (empty)
FIND                   GESELLSCHAFT MBH
FIND2                  GESELLSCHAFT MBH
REPLACE                GMBH
POSFROM                2000
POSTO                  2000
SQLSTANDARD            MYSQL50
DATE FROM              (empty)
DATE TO                (empty)
PATSTATFROM            (empty)
PATSTATTO              (empty)
DESCRIPTION            moves GESELLSCHAFT MBH from name to kind
Looking back at the previous usage example (TAPROGGE GESELLSCHAFT MBH):
0003 means that the LIKE, LIKE NOT and FIND patterns are in LAST_NAME (code 0), and the REPLACE pattern goes into NAME_EXTENSION (code 3).
Open issues (I):
Finally, some issues are still pending.
Define a standard address
Since cleaning patterns rely on backward logic, people sharing these data should have a common target for data standardization. We propose using local post office standards, but such standards may be unavailable or unsuitable.
Automatic query generation
Users would benefit greatly from exchanging patterns if a query-generating tool could create SQL files from the pattern table.
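To give an idea of what such a query-generating tool might look like, here is a minimal Python sketch that turns one pattern row into a MySQL UPDATE statement. It covers only the case where the source and target field coincide, it takes the table and column names as trusted input, and the function names are hypothetical. The rule used is # 100 from the examples in Appendix 2.

```python
def quote(value):
    """Quote a string as a MySQL literal: escape backslashes, double single quotes."""
    return "'" + value.replace("\\", "\\\\").replace("'", "''") + "'"


def build_update(rule, table, column):
    """Build an UPDATE for a same-field rule (FIELD FROM = FIELD TO).

    `table` and `column` are inserted as-is, so they must come from a trusted source.
    """
    find = quote(rule["FIND"])
    position = f"INSTR({column}, {find})"
    # New value: text before the match + replacement + text after the match.
    new_value = (
        f"TRIM(CONCAT(LEFT({column}, {position} - 1), {quote(rule['REPLACE'])}, "
        f"SUBSTRING({column}, {position} + CHAR_LENGTH({find}))))"
    )
    conditions = [
        f"{position} >= {int(rule['POSFROM'])}",
        f"{position} <= {int(rule['POSTO'])}",
        f"{column} REGEXP {quote(rule['ALIKE'])}",
    ]
    if rule.get("LIKE NOT"):  # the exclusion pattern is optional
        conditions.append(f"{column} NOT REGEXP {quote(rule['LIKE NOT'])}")
    lines = [
        f"UPDATE {table}",
        f"SET {column} = {new_value}",
        "WHERE " + "\n  AND ".join(conditions) + ";",
    ]
    return "\n".join(lines)


rule_100 = {
    "ALIKE": "[0-9] BIS [0-9]", "LIKE NOT": " BIS .+ BIS ",
    "FIND": " BIS ", "REPLACE": "-", "POSFROM": 2, "POSTO": 9999,
}
sql = build_update(rule_100, "address", "address")
```

Generating one statement per rule keeps the pattern table as the single source of truth; a real tool would also need to handle moves between fields and the SQLSTANDARD dialect.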
Open issues (II):
High correlation & chronology
The quality and results of data cleaning may depend on the order in which the steps are run (e.g. if PO BOX numbers are not removed from addresses before street numbers are cleaned, the results may be wrong).
Above all, some patterns must be run repeatedly, and in some cases whole groups of patterns must be run repeatedly (e.g. MOVE PO BOX, CITY and ZIP out of ADDRESS, then REMOVE COMMA: since the order of the elements in ADDRESS is unknown, the group of queries has to be run 4 times to be sure).
A partial solution may be to add fields indicating the ID of the previous query, the ID of the following query and the number of repetitions.
The open issue remains how to manage groups of repetitions and cleaning patterns that need a 'loop until no match is found'.
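The 'loop until no match is found' behaviour can be illustrated outside SQL with a short Python sketch. It is not a proposal for the interchange format: the function name, the cap on the number of rounds and the sample city string are assumptions, and rules are reduced to (regular expression, replacement) pairs.

```python
import re


def run_until_stable(text, rules, max_rounds=10):
    """Apply the whole group of rules in order, repeating until a round changes nothing.

    Returns the cleaned text and the number of rounds that changed something.
    `max_rounds` guards against rule groups that never settle.
    """
    for round_number in range(1, max_rounds + 1):
        before = text
        for pattern, replacement in rules:
            text = re.sub(pattern, replacement, text)
        if text == before:
            return text, round_number - 1
    return text, max_rounds


# Rule group with a single rule: collapse a double comma into one.
double_comma = [(r",,", ",")]
cleaned, rounds = run_until_stable("MILANO,,,,ITALY", double_comma)
```

One pass is not enough here: four commas become two in the first round and one in the second, so the group settles only after two changing rounds.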
Open issues (III):
Grants for collaboration to this project
A grant for collaborating on this part of the APE-INV project is available for visits to KITES;
info on the APE-INV website, section "Grants" (call: http://www.esf-ape-inv.eu/download/Draft%20call%20for%20visits%20_ESF-APE-INV%202012_4th_call.pdf)
Acknowledgements
Thanks to: Francesco Lissoni for supervision, Lorenzo Peccati for suggestions, Bulat Sanditov for Russian translation.
Appendix: Interchange data structure proposal (I)
This is the list of the fields needed; where not indicated, the meaning of the field is explained in the previous slides.
APPLICATION AUTHORITY  2-char string  % may indicate valid for all
COUNTRY CODE           2-char string  % may indicate any country
DATE FROM              date           [optional] empty means no exclusion
DATE TO                date           [optional] empty means no exclusion
PATSTATFROM            MMYYYY         [optional] empty means no exclusion
PATSTATTO              MMYYYY         [optional] empty means no exclusion
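How these fields select the records a rule applies to can be sketched as follows. This is a minimal Python illustration under assumptions: empty fields are represented as `None`, the date being compared is passed in as `reference_date` (the proposal does not say which date of the record is meant), and all function names are hypothetical.

```python
from datetime import date


def matches_code(rule_value, actual):
    """'%' in the rule matches any value; otherwise the codes must be equal."""
    return rule_value == "%" or rule_value == actual


def edition_key(mmyyyy):
    """Turn an MMYYYY string into a sortable (year, month) pair."""
    return int(mmyyyy[2:]), int(mmyyyy[:2])


def in_range(value, lower, upper):
    """An empty (None) bound means no exclusion on that side."""
    if lower is not None and value < lower:
        return False
    if upper is not None and value > upper:
        return False
    return True


def rule_applies(rule, authority, country, reference_date, edition):
    """Check the endogenous conditions of one rule against one record."""
    edition_from = edition_key(rule["PATSTATFROM"]) if rule["PATSTATFROM"] else None
    edition_to = edition_key(rule["PATSTATTO"]) if rule["PATSTATTO"] else None
    return (
        matches_code(rule["APPLICATION AUTHORITY"], authority)
        and matches_code(rule["COUNTRY CODE"], country)
        and in_range(reference_date, rule["DATE FROM"], rule["DATE TO"])
        and in_range(edition_key(edition), edition_from, edition_to)
    )


# Hypothetical rule: EP applications, any country, up to end of 1990, from edition 04/2011.
rule = {
    "APPLICATION AUTHORITY": "EP", "COUNTRY CODE": "%",
    "DATE FROM": None, "DATE TO": date(1990, 12, 31),
    "PATSTATFROM": "042011", "PATSTATTO": None,
}
```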
Interchange data structure proposal (II)
Where not indicated, the meaning of the field is explained in the previous slides.
ALIKE                string   (not called LIKE because that may cause errors in some SQL dialects)
LIKE NOT [optional]  string
FIND                 string
FIND2                string   used when the literal FIND does not work and a fixed length is needed
REPLACE              string
POSFROM              integer  start of the position range where the string can be
POSTO                integer  end of the position range where the string can be
SQLSTANDARD          string
Interchange data structure proposal (III)
Note: some combinations of POSFROM and POSTO may have particular meanings, such as:
(1, 1) means start position;
(9999, 9999) means trailing position;
(2, 9999) means everywhere but at the beginning.
Finally, a field containing a description of the operation is needed:
DESCRIPTION text
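One possible reading of these POSFROM / POSTO conventions is sketched below in Python. It assumes that positions are 1-based and refer to the first occurrence of the FIND string (as SQL INSTR does), and that (9999, 9999) is a special value meaning "the string ends with FIND". The function name is hypothetical.

```python
def position_ok(text, find, posfrom, posto):
    """Return True if the first occurrence of `find` satisfies the position rule."""
    index = text.find(find)
    if index < 0:
        return False              # FIND string not present at all
    if (posfrom, posto) == (9999, 9999):
        return text.endswith(find)   # special value: trailing position
    position = index + 1          # convert to a 1-based position, like INSTR
    return posfrom <= position <= posto
```

With this reading, (1, 1) accepts a match only at the very start, and (2, 9999) accepts a match anywhere except the start.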
Appendix2: Some examples
ID                     1                            2        99               100                106
OPERATIONKIND          EEED                         EEEE     BBBB             BBBB               BBBB
APPLICATION AUTHORITY  EP                           %        %                %                  %
COUNTRY CODE           US                           %        %                %                  %
LIKE                   PO BOX [0-9][0-9][0-9][0-9]  %,,%     '[0-9] - [0-9]'  '[0-9] BIS [0-9]'  '[0-9] A [0-9]'
LIKE NOT               (empty)                      (empty)  ' - .+ - '       ' BIS .+ BIS '     ' A .+ A '
FIND                   PO BOX                       ,,       ' - '            ' BIS '            ' A '
FIND2                  PO BOX ####                  (empty)  (empty)          (empty)            (empty)
REPLACE                (empty)                      ,        '-'              '-'                '-'
POSFROM                1                            1        2                2                  2
POSTO                  1                            9999     9999             9999               9999
SQLSTANDARD            MYSQL50                      MYSQL50  MYSQL50          MYSQL50            MYSQL50
DATE FROM, DATE TO, PATSTATFROM, PATSTATTO: empty for all five rules.
DESCRIPTION
1: moves PO BOX from city to addr_other
2: removes double comma in city
99, 100, 106: different formats, all aiming to set multiple street numbers in address to the format #-#
Appendix 3: Deep into one pattern (I)
Let's see how the query would work for one example (rule # 100).
We suppose we have an intermediate table called address, whose fields are structured according to the metadata structure proposal (see above).
Our patterns table is here called corrections.
We run it on a record with ADDRESS = "WAGNER STRASSE 3 BIS 12"
Appendix 3: Deep into one pattern (II): "WAGNER STRASSE 3 BIS 12" VS " BIS "
update address a, corrections b
-- the new address is the trimmed concatenation of three parts
set a.address = trim(concat(
    -- what was before the match: "WAGNER STRASSE 3"
    left(a.address, instr(a.address, b.find) - 1),
    -- the replacement string: "-"
    b.replace,
    -- what was after the match: "12"
    right(a.address, length(a.address) - length(b.find) - instr(a.address, b.find) + 1)))
where
    b.operationkind = 'BBBB'
    -- " BIS " is found from position POSFROM (2) onward ...
    and instr(a.address, b.find) >= b.posfrom
    -- ... and before position POSTO (9999)
    and instr(a.address, b.find) <= b.posto
    -- the address matches the regular expression '[0-9] BIS [0-9]'
    and a.address regexp b.alike
    -- the address does not match ' BIS .+ BIS ', i.e. ' BIS ' does not occur twice
    and a.address not regexp b.likenot
    -- no criteria on date or PATSTAT edition
    and b.datefrom is null and b.dateto is null
    and b.patstatfrom is null and b.patstatto is null;
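The same steps can be traced outside the database. The following Python sketch mirrors the query for rule # 100 on a single string; it is an illustration only, and it assumes Python's `re` module as a stand-in for MySQL REGEXP (the two differ in details such as default case sensitivity). The function name is hypothetical.

```python
import re


def apply_rule(address, rule):
    """Mirror the UPDATE above for one address string and one same-field rule."""
    position = address.find(rule["FIND"]) + 1   # 1-based; 0 when absent, like INSTR
    if not (rule["POSFROM"] <= position <= rule["POSTO"]):
        return address
    if not re.search(rule["ALIKE"], address):
        return address
    if rule["LIKE NOT"] and re.search(rule["LIKE NOT"], address):
        return address
    before = address[: position - 1]                          # LEFT(...)
    after = address[position - 1 + len(rule["FIND"]):]        # RIGHT(...)
    return (before + rule["REPLACE"] + after).strip()         # TRIM(CONCAT(...))


rule_100 = {
    "ALIKE": "[0-9] BIS [0-9]", "LIKE NOT": " BIS .+ BIS ",
    "FIND": " BIS ", "REPLACE": "-", "POSFROM": 2, "POSTO": 9999,
}
result = apply_rule("WAGNER STRASSE 3 BIS 12", rule_100)
```

For this record, " BIS " starts at position 17, so the part before the match is "WAGNER STRASSE 3", the part after it is "12", and `result` is "WAGNER STRASSE 3-12". An address containing " BIS " twice is left unchanged because of the LIKE NOT pattern.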