CIRCLE, Centre for Innovation, Research and Competence in the Learning Economy
Lund University, P.O. Box 117, SE-221 00 Lund, Sweden
Swedish inventors: matching to registers and descriptive data
Presentation at APE-INV, Brussels, September 5th 2011
Lina Ahlin and Olof [email protected]
On the agenda
• What is so special about Swedish data?
• 1st matching
• 2nd matching
• Future: how to reach a 100% match rate?
• (Results)
Linking inventors to registers
• EPO applied patents 1978-2009 for inventors with addresses in Sweden.
• Matching done on name-home address combinations
• Problem 1: different inventors may have the same name
• Problem 2: addresses may be old
• How to verify person identity and connect to Swedish register data?
Swedish data
Q: What makes Swedish data so exciting (and why do we want a high match rate)?
A: Through Statistics Sweden it is possible to connect individuals to register data, which links several levels of information relevant for innovation studies:
• Individual level: field/level of education, age, income, gender, workplace
• Regions: workplace, home municipality
• Sectoral level: sectors, firm size, level of R&D
This can give a multifaceted view of innovation, but a personal identifier (”personnummer”) is needed to do this,
e.g. 19500131-3422
  19500131: birth date Jan 31st, 1950
  Even number = female
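The structure of the identifier can be illustrated with a minimal Python sketch. It assumes the 12-digit written form YYYYMMDD-NNNN used on the slide and the standard layout in which the second-to-last digit carries gender (even = female); the function name `parse_personnummer` is made up for illustration, and the final check digit is not validated.

```python
from datetime import date


def parse_personnummer(pnr: str) -> dict:
    """Split a personnummer written as 'YYYYMMDD-NNNN' into birth date and gender."""
    birth_part, serial = pnr.split("-")
    if len(birth_part) != 8 or len(serial) != 4 or not (birth_part + serial).isdigit():
        raise ValueError(f"unexpected format: {pnr!r}")
    birth_date = date(int(birth_part[:4]), int(birth_part[4:6]), int(birth_part[6:8]))
    # Assumption: the second-to-last digit encodes gender (even = female, odd = male).
    gender_digit = int(serial[2])
    gender = "female" if gender_digit % 2 == 0 else "male"
    return {"birth_date": birth_date, "gender": gender}


info = parse_personnummer("19500131-3422")
print(info["birth_date"].isoformat(), info["gender"])  # 1950-01-31 female
```

The date-of-birth part (the first eight digits) is what later matching steps can use even when the last four digits are unknown.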
1st matching (Oct-Dec 2010)
• All Swedes (incl. personnummer) are listed in the address register ”SPAR”
• Matching of addresses through InfoTorg, which stores addresses/address changes for the latest 3 years; personnummer added to matched inventors
  – Individuals under 16 not matched
• Old patents added under the assumption that the same name at the same address is the same person:
  Register 2008-2010: Sven Ivar Johanson, Storgatan 1, 111 00 Stockholm
  = Patent applied for in 1992: Sven Ivar Johanson, Storgatan 1, 111 00 Stockholm
• Match rate: 64% of inventor-patent pairs, ranging from a low of 23% in 1978 to a high of 93% in 2008. Older patents match less often because inventors move.
  – InfoTorg returned 56% match rate
  – Manual check (visual, no robot): + 8%
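The exact name-address matching described on this slide can be sketched as follows. This is only an illustration under stated assumptions, not the InfoTorg procedure: the record layout, the placeholder identifiers ("P1", "P2", "P3") and the helper names `make_key`, `build_index` and `match` are all hypothetical. A key that points to more than one person is left unmatched, which is where a manual check would come in.

```python
from collections import defaultdict


def make_key(name: str, street: str, postal: str) -> str:
    """Normalise case and whitespace so trivially different spellings compare equal."""
    return "|".join(" ".join(part.lower().split()) for part in (name, street, postal))


def build_index(register):
    """Map each name-address key to the set of person ids sharing it."""
    index = defaultdict(set)
    for person_id, name, street, postal in register:
        index[make_key(name, street, postal)].add(person_id)
    return index


def match(inventor, index):
    """Return the person id if exactly one register entry shares the key, else None."""
    candidates = index.get(make_key(*inventor), set())
    return next(iter(candidates)) if len(candidates) == 1 else None


# Hypothetical register rows: (person id, name, street, postal address).
register = [
    ("P1", "Sven Ivar Johanson", "Storgatan 1", "111 00 Stockholm"),
    ("P2", "Anna Berg", "Lillgatan 2", "222 00 Lund"),
    ("P3", "Anna Berg", "Lillgatan 2", "222 00 Lund"),  # same name, same address
]
index = build_index(register)

print(match(("SVEN IVAR JOHANSON", "Storgatan  1", "111 00 Stockholm"), index))  # P1
print(match(("Anna Berg", "Lillgatan 2", "222 00 Lund"), index))  # None (ambiguous)
```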
64% match rate
[Figure: match rate by year, 1978-2008; vertical axis 0% to 100%. Fractions: 64%]
1985-2005: present access to individual registers at Statistics Sweden
2006-2009: additions as of Sep. 30th 2011
2nd matching (April-Sep 2011)
• Use public access to registers (Swedish genealogical association)
  – CDs of Swedish population (1980)/1990, published with old addresses and birth dates
  – CD ”Book of dead” 1901-2009: address at death + personnummer
• Match birth date + name to personnummer using service by InfoTorg or online sources
Methodology
• Extract data from the Swedish death book and the Swedish genealogy records for 1990 (to some extent also 1980) on all individuals in the population, by letter
• Generate a variable containing name, address and postal address for all individuals in the population as well as for inventors who are not fully matched
Normalized Levenshtein (”strgroup”) in STATA
• An example of the ”name-address” string:
  ”Sven Ivar Johanson, Storgatan 1, 111 00 Stockholm” (from EPO)
  = ”Sven Ifwar Johanson, Storgatan 1, 111 00 Stockholm” (from Swedish population 1990)
• Replace/insert 3 letters to make strings equal
• Divided by length of shortest string (48)
  (3/48) = 0.0625 (= a good hit)
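The calculation on this slide can be reproduced with a short Python sketch of a normalized Levenshtein distance. This is not the strgroup implementation itself, only the same idea under the assumption stated on the slide that the edit distance is divided by the length of the shorter string. The exact edit count and string length depend on how the strings are written out (punctuation, spacing, spelling), so the printed value can differ somewhat from the slide's 3/48.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance divided by the length of the shorter string."""
    shortest = min(len(a), len(b))
    if shortest == 0:
        raise ValueError("cannot normalise against an empty string")
    return levenshtein(a, b) / shortest


epo = "Sven Ivar Johanson, Storgatan 1, 111 00 Stockholm"
population_1990 = "Sven Ifwar Johanson, Storgatan 1, 111 00 Stockholm"
print(levenshtein(epo, population_1990), round(normalized_distance(epo, population_1990), 4))
```

A value close to 0 means the two name-address strings are nearly identical; which cut-off counts as "a good hit" is a judgment call that the manual checks on later slides are meant to back up.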
Adding date of birth
1. 1990 Levenshtein names & addresses
2. 1990 Levenshtein unique names
3. Levenshtein from CD dead 1901-2009: names and addresses
4. Strgroup: similarity on name-address hits 1-3
5. Some manual additions and minor changes
6. 1980 Levenshtein names and addresses (letters D&H)
Methodology: continued
• Manually examine each match to see whether the Levenshtein command has matched correctly
• Some hits discarded, incl. ambiguous name match hits
New match rate 80%
[Figure: match rate by year, 1978-2009, first and second matching compared; vertical axes 0% to 100%. Fractions: 64% (1st matching) and 80% (2nd matching)]
Adding personnummer (ongoing)
New match rate 80%, but not full personnummer. What to do?
1. Use the date-of-birth part of the personal number for fully matched inventors
2. Join all possible combinations of birth dates for those fully matched and those with only birth dates
3. Run Levenshtein distance on inventor names
4. Small Levenshtein distance: accept that the inventors are the same, since name and birth date match
5. Large Levenshtein distance: reject
6. Further, manually check remaining inventors. Look at addresses for further confirmation if uncertain.
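Steps 1-6 can be sketched in Python as below. This is an illustration under stated assumptions, not the procedure actually run: the two thresholds, the record layout and the function name `classify_pairs` are hypothetical, and the last four digits are shown as the placeholder "NNNN".

```python
from collections import defaultdict

ACCEPT_AT_OR_BELOW = 0.15  # hypothetical cut-off for a "small" distance
REJECT_AT_OR_ABOVE = 0.40  # hypothetical cut-off for a "large" distance


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


def classify_pairs(fully_matched, birth_date_only):
    """Pair records sharing a birth date, then classify by normalised name distance."""
    by_birth_date = defaultdict(list)
    for name, pnr in fully_matched:
        by_birth_date[pnr[:8]].append((name, pnr))  # step 1: date-of-birth part
    results = []
    for name, birth_date in birth_date_only:
        for cand_name, pnr in by_birth_date.get(birth_date, []):  # step 2: join
            shortest = max(1, min(len(name), len(cand_name)))
            dist = levenshtein(name.lower(), cand_name.lower()) / shortest  # step 3
            if dist <= ACCEPT_AT_OR_BELOW:
                decision = "accept"        # step 4
            elif dist >= REJECT_AT_OR_ABOVE:
                decision = "reject"        # step 5
            else:
                decision = "manual check"  # step 6
            results.append((name, cand_name, pnr, decision))
    return results


# Hypothetical records: (name, personnummer) and (name, birth date).
fully_matched = [("Sven Ivar Johanson", "19420315-NNNN")]
birth_date_only = [("Sven Ifwar Johanson", "19420315"), ("Karin Berg", "19420315")]
for row in classify_pairs(fully_matched, birth_date_only):
    print(row)
```

Joining on birth date first keeps the number of name comparisons small, since only records sharing a birth date are ever compared.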
Adding personnummer ctd.
• Use the death book for years 1975-2009, taking the date-of-birth part of the personal numbers
• Re-run steps 2-6 from the previous slide
Adding personnummer ctd.
Problem: not all inventors were previously identified, so their 4 last digits are missing
Two options to get full personal numbers from birth dates:
1. Use InfoTorg again with name + added parameter ”birthdate”
2. Manually add the four last digits by using an internet service (www.upplysning.se)
Some matching problems
• Individuals who change last names (mainly women), or who have common names and move a lot, are difficult to match.
• Two people with the same name can live at the same address (e.g. a father names his son after himself), so the wrong person may be matched. If detected, the oldest person is chosen.
• For inventors affiliated with some firms (e.g. AstraZeneca), the company address is given instead of a home address.
Towards 100%
• Idea: scoring methods based on identified inventors
  – Name
  – Identified co-inventors
  – Technology class
  – City
  – Postal code
  – Which algorithm?
• Statistics Sweden for validating parent/child name similarity problem?
• Use 1980 population CD?
• Strategy of focusing on highly productive unmatched inventors?
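One possible shape of such a scoring method is sketched below in Python. The slide leaves the choice of algorithm open, so everything here is an assumption made for illustration: the point weights, the field names and the function `score` are hypothetical, and a real method would likely use fuzzy rather than exact name comparison.

```python
# Hypothetical point weights per criterion (they sum to 100).
WEIGHTS = {
    "name": 40,
    "co_inventors": 25,
    "tech_class": 15,
    "city": 10,
    "postal_code": 10,
}


def score(unmatched: dict, identified: dict) -> int:
    """Sum the weights of the criteria on which the two records agree."""
    agrees = {
        "name": unmatched["name"].lower() == identified["name"].lower(),
        # At least one shared co-inventor counts as agreement.
        "co_inventors": bool(set(unmatched["co_inventors"]) & set(identified["co_inventors"])),
        "tech_class": unmatched["tech_class"] == identified["tech_class"],
        "city": unmatched["city"].lower() == identified["city"].lower(),
        "postal_code": unmatched["postal_code"] == identified["postal_code"],
    }
    return sum(WEIGHTS[field] for field, ok in agrees.items() if ok)


# Hypothetical records: an identified inventor and an unmatched inventor-patent pair.
identified = {"name": "Sven Ivar Johanson", "co_inventors": ["Anna Berg"],
              "tech_class": "A61K", "city": "Stockholm", "postal_code": "111 00"}
unmatched = {"name": "Sven Ivar Johanson", "co_inventors": ["Anna Berg", "Karin Ek"],
             "tech_class": "A61K", "city": "Uppsala", "postal_code": "753 10"}

print(score(unmatched, identified))  # 80: name, co-inventor and technology class agree
```

Pairs above some cut-off could be accepted and the rest sent to manual checking; the cut-off would have to be calibrated on the inventors already identified.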
Suggestions/questions
Patent distribution by sector
Patent distribution in manufacturing (share of total patenting)
Patent distribution in services (share of total patenting).
Education level among inventors
Percentile distribution of inventors’ patent productivity.
Percentile             All patents   Contribution   Patents 2004-07   Contribution 2004-07
(percentile values)
1%                     1             0.12           1                 0.11
5%                     1             0.20           1                 0.17
10%                    1             0.25           1                 0.20
25%                    1             0.33           1                 0.33
50%                    1             0.83           1                 0.50
75%                    3             1.50           2                 1.00
90%                    6             3.00           4                 2.00
95%                    9             5.00           6                 3.00
99%                    21            11.50          12                5.83
Mean/inventor          2.81          1.40           2.06              0.97
Number of inventors    18 489        18 489         8 526             8 526
Sectors, SNI92-codes, # inventors, contribution 2004-2005.
Sector          Unique inventors,      Contribution*,    % cooperation cross   % cooperation cross
                mean/year 2004-2005    mean 2004-2005    sector 1994-1995      sector 2004-2005
Primary         8.5                    5.9               28%                   28%
Manufacturing   1567                   749.9             11%                   11%
Services        806.5                  411.1             23%                   23%
Academia        190                    72.6              54%                   54%
Public sector   62.5                   28.4              67%                   67%

SNI92-codes by sector:
• Primary: 1000-14999
• Manufacturing: 15000-37999
• Services: 38000-74999, 80410, 80423-80425, 80427-80429, 85200, 85325, 91111-91330, 92110-92130, 92310, 92330-92400, 92611-92614, 92621-99000
• Academia: 80301-80309 and **
• Public sector: 75000-80299, 80421-80422, 80426, 85000-85140, 85311-85324, 90000-90008, 92200, 92320, 92511/92530, 92615

* ”Contribution” counts patent fractions, which adjusts for co-inventorship.
** ”Academia” can also in a few cases be found in the sectors R&D in technical and natural sciences (73101-73104) and in technical testing and analysis (74300).
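The fractional counting behind ”Contribution” can be illustrated with a small Python sketch. It assumes each patent counts as 1 and is split equally among its listed inventors; the slides do not spell out the exact rule, and the patent and inventor ids below are made up.

```python
from collections import defaultdict
from fractions import Fraction


def contributions(patents: dict) -> dict:
    """patents maps a patent id to its inventor ids; each patent is split equally."""
    totals = defaultdict(Fraction)
    for inventors in patents.values():
        share = Fraction(1, len(inventors))
        for inventor in inventors:
            totals[inventor] += share
    return dict(totals)


# Hypothetical data: three patents with one, two and three inventors.
patents = {"EP1": ["A"], "EP2": ["A", "B"], "EP3": ["A", "B", "C"]}
for inventor, total in sorted(contributions(patents).items()):
    print(inventor, total, round(float(total), 2))
# A 11/6 1.83
# B 5/6 0.83
# C 1/3 0.33
```

Under this rule the contributions sum to the number of patents (here 3), so shared patents are not double-counted when totals are compared across sectors.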
Cooperation by sector, 2004-05
Columns: Primary, Manufacturing, Services, Academia, Public sector, Sum

Primary         43%   57%    0%    0%   100%
Manufacturing    1%   77%   17%    5%   100%
Services         1%   66%   24%    9%   100%
Academia         0%   29%   48%   22%   100%
Public sector    0%   18%   37%   45%   100%
The most important patenting academic institutions 2004-2005
Univ/institute   Contributions/year   Share   Patents/billion research   Patents/thousand
                                              revenue SEK                FTE, NTM
Lund             20.3                 23%     6.3                        15.0
Uppsala          11.6                 13%     4.2                        9.7
Karolinska       11.6                 13%     3.9                        9.3
KTH              9.8                  11%     5.7                        8.7
Göteborg         9.0                  10%     3.7                        10.9
Linköping        7.9                  9%      6.4                        10.3
Chalmers         7.2                  8%      5.1                        8.6
Stockholm        2.9                  3%      1.7                        4.1
Umeå             2.3                  3%      1.5                        2.8
Sum              82.6                 94%     4.4                        9.3
Others (13)      5.0                  6%      1.3                        1.8