Outline
• Patent
• USPTO
• Search USPTO Patents
• Data Extraction: Case Study of NSE Patents
Patent
• A "patent" usually refers to a right granted to anyone who invents or discovers any new and useful process, machine, article of manufacture, or composition of matter, or any new and useful improvement thereof.
– A patent is not a right to practice or use the invention. Rather, it provides the right to exclude others from making, using, selling, or offering the invention for sale for a limited term, usually 20 years from the filing date.
– It is a limited property right that the government offers to inventors in exchange for their agreement to share the details of their inventions with the public.
• A patent is also a special type of technical document that records many important innovations and technological advances.
USPTO
• The United States Patent and Trademark Office (USPTO) is an agency in the United States Department of Commerce that provides patent protection to inventors and businesses for their inventions, and trademark registration for product and intellectual property identification.
• Each year, the USPTO issues thousands of patents to companies and individuals worldwide. As of March 2006, the USPTO had issued over 7 million patents, with 3,500 to 4,500 newly granted patents each week.
• USPTO provides online full-text access for patents issued since 1976.
• URLs:
– USPTO Official Website: http://www.uspto.gov/
– USPTO Patent Search: http://www.uspto.gov/main/search.html
Data Extraction: Case Study of NSE Patents
• Nanoscale Science and Engineering (NSE) field
– A fundamental technology area that is critical for a nation’s technological competence.
– Expected to revolutionize a wide range of application domains.
• Nanotechnology
– An applied science and technology field that is multidisciplinary and encompasses engineering and other work taking place at the nanoscale.
– Critical for a nation’s technological competence.
– Its R&D status attracts interest from various communities.
Data Extraction Procedure
• The goal is to gather all related patents from the USPTO Web site as free-text HTML pages, parse them into structured data, and store the results in a database.
• Procedure for extracting NSE patents from USPTO:
1. Spider search results (summary pages)
2. Spider individual patent documents (detailed pages)
3. Noise filtering
4. Parsing
1. Spider search results (summary pages)
• A list of keywords can be used to search for patents related to NSE domain. The keywords were provided by domain experts.
• A spider program written in Perl was used to download the search result pages.
Keywords:
atomic force microscope, atomic force microscopic, atomic force microscopy, atomic-force-microscope, atomic-force-microscopy,
atomistic simulation, biomotor,
molecular device, molecular electronics, molecular modeling, molecular motor, molecular sensor, molecular simulation,
nano*,
quantum computing, quantum dot*, quantum effect*,
scanning tunneling microscope, scanning tunneling microscopic, scanning tunneling microscopy, scanning-tunneling-microscope, scanning-tunneling-microscopy,
self assembled, self assembling, self assembly, selfassembl*, self-assembled, self-assembling, self-assembly
Example code:

use strict;
use HTML::TokeParser;
use LWP;
use URI::Escape;

sub query
{
    … … … …
    # Get keywords: read one search term per line from the file named on the command line
    open(my $kw_file, '<', $ARGV[0]) or die "Cannot open $ARGV[0]: $!";
    my @keywords = <$kw_file>;
    close($kw_file);
    … … … …
    # Download search pages: fetch one result page and save it to its own HTML file
    $query_url = "http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=$pno&u=%2Fnetahtml%2Fsearch-bool.html&r=0&f=S&l=50&TERM1=$kw&FIELD1=&co1=AND&TERM2=$start%3E$end&FIELD2=ISD&d=ptx";
    $response = $browser->get($query_url);
    $result = $response->content();
    open(my $page_file, '>', "$fpage-$pno.html") or die "Cannot write $fpage-$pno.html: $!";
    print $page_file $result;
    close($page_file);
}

# Set up time range
query('1/1/2007', '12/31/2007');
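The URL construction in the Perl example can be sketched in Python as follows. This is only an illustration: the endpoint and parameter names follow the Perl example, the spellings `nph-Parser` and `search-bool.html` are assumed, and `build_query_url` and `load_keywords` are hypothetical helper names. Nothing is downloaded here.

```python
from urllib.parse import urlencode

# Assumption: endpoint and parameter names follow the Perl example; not verified here.
SEARCH_URL = "http://patft.uspto.gov/netacgi/nph-Parser"


def load_keywords(lines):
    """Strip whitespace and drop blank lines from the lines of a keyword file."""
    return [line.strip() for line in lines if line.strip()]


def build_query_url(keyword, start, end, page=1):
    """Return the search URL for one keyword, issue-date range, and result page."""
    params = {
        "Sect1": "PTO2",
        "Sect2": "HITOFF",
        "p": page,
        "u": "/netahtml/search-bool.html",
        "r": 0,
        "f": "S",
        "l": 50,
        "TERM1": keyword,
        "FIELD1": "",
        "co1": "AND",
        "TERM2": f"{start}>{end}",  # issue-date range, as in the Perl string
        "FIELD2": "ISD",
        "d": "ptx",
    }
    # urlencode percent-escapes "/" and ">" and turns spaces into "+"
    return SEARCH_URL + "?" + urlencode(params)


keywords = load_keywords(["quantum computing\n", "\n", "molecular motor\n"])
urls = [build_query_url(kw, "1/1/2007", "12/31/2007") for kw in keywords]
```

Escaping the keyword through `urlencode` avoids broken URLs for multi-word terms such as "quantum computing".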
2. Spider individual patent documents (detailed pages)
• In this step, we need to:
– First, collect all the patent IDs;
– Second, download all the patents based on the patent IDs by using proxies.
• The data set is often very large, so distributing the downloads across proxies can save a lot of time.
Download detailed patent documents
Create several files, each of which contains a fixed number of patent IDs (e.g., 300 patent IDs).
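The batching step can be sketched as follows. This is a minimal illustration, not the original tooling: `chunk_ids` and `write_batches` are hypothetical names, the file naming scheme is an assumption, and the patent IDs are made up.

```python
def chunk_ids(patent_ids, size=300):
    """Split a list of patent IDs into consecutive batches of at most `size` IDs."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [patent_ids[i:i + size] for i in range(0, len(patent_ids), size)]


def write_batches(patent_ids, prefix="ids", size=300):
    """Write each batch to its own file (one ID per line) and return the file names."""
    paths = []
    for n, batch in enumerate(chunk_ids(patent_ids, size), start=1):
        path = f"{prefix}-{n:03d}.txt"  # assumed naming scheme, e.g. ids-001.txt
        with open(path, "w") as fh:
            fh.write("\n".join(batch) + "\n")
        paths.append(path)
    return paths


# Made-up IDs, only to show the batch sizes
sample_ids = [str(7000000 + i) for i in range(650)]
batch_sizes = [len(batch) for batch in chunk_ids(sample_ids)]
```

With 650 IDs and a batch size of 300, the last batch simply holds the remainder.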
Server:
Send different patent ID files to different client threads.
    … … … …
    open(f, $ARGV[0]);
    my @theids = <f>;
    close(f);
    my $theid;
    foreach $theid (@theids)
    {
        $new_sock = $sock->accept();
        my $buf = <$new_sock>;
        print ($new_sock $theid."\n");
        print $buf . " " . $theid."\n";
        close $new_sock;
        … … … …
Client:
Use a proxy to download the patents whose IDs are in the file sent from the server.
    … … … …
    do {
        $response = $browser->get($pat_url);
        if (!$response->is_success()) {
            # Report the failure, then pause for a random interval before retrying
            select(STDOUT);
            print $response->status_line, "\n\n";
            sleep(rand(7)+1);
        }
    } while (!$response->is_success());
    … … … …
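The client's retry loop can be sketched in Python as below. Two things differ from the Perl fragment and are assumptions of this sketch: the number of attempts is capped (the original loops until success), and the download itself is passed in as a hypothetical `fetch` callable returning `(ok, body)` so the example runs without network access.

```python
import random
import time


def fetch_with_retry(fetch, url, max_attempts=5, sleep=time.sleep):
    """Call fetch(url) until it reports success or max_attempts is reached.

    Returns the body on success, or None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        ok, body = fetch(url)
        if ok:
            return body
        if attempt < max_attempts:
            # Random pause of 1 to 8 seconds, roughly mirroring sleep(rand(7)+1)
            sleep(random.uniform(1, 8))
    return None


# Demo with a fake downloader that fails twice and then succeeds
attempts = []
pauses = []


def fake_fetch(url):
    attempts.append(url)
    return (len(attempts) >= 3, "<html>patent page</html>")


page = fetch_with_retry(fake_fetch, "http://example.invalid/patent", sleep=pauses.append)
```

Injecting `sleep` keeps the demo instantaneous; in real use the default `time.sleep` applies the pause.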
3. Noise filtering
• Some of the gathered patents match only noisy NSE keywords, and some contain no NSE keywords at all.
– Such patents need to be filtered out.
• Noise keywords include:
– nanosecond
– nanoliter
– nano$
– nano-second
– nano-liter
– nano.sub
– nano [space]
– nano2
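One possible reading of this filter, limited to the `nano*` wildcard, is sketched below. The interpretation of the noise list is an assumption: "nano$" and "nano [space]" are both treated as the bare word "nano", and the other entries are treated as prefixes. The function names are hypothetical, and the other NSE keywords (for example "quantum dot") are not handled here.

```python
import re

# A "nano" token: the prefix plus any letters, digits, dots, or hyphens attached to it
NANO_TOKEN = re.compile(r"\bnano[a-z0-9.\-]*", re.IGNORECASE)

# Noise terms from the list above, treated as prefixes (assumption)
NOISE_PREFIXES = ("nanosecond", "nanoliter", "nano-second", "nano-liter",
                  "nano.sub", "nano2")


def is_noise(token):
    """Return True if a nano* token is one of the noise terms."""
    token = token.lower().rstrip(".-")  # drop trailing punctuation such as a full stop
    return token == "nano" or token.startswith(NOISE_PREFIXES)


def has_valid_nano_term(text):
    """Return True if the text contains at least one nano* term that is not noise."""
    return any(not is_noise(m.group()) for m in NANO_TOKEN.finditer(text))
```

A patent whose only `nano*` hits are noise terms, or that has no hits at all, would be dropped under this reading.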
4. Parsing
• Extract the data fields from the HTML patent documents and load them into the database.
Database Design (USPTO)
• USP_Patent: PK patentId; issueDate, title, appSerialNumber, appDate, appType, attorneyAgent, primaryExaminer, assistantExaminer
• USP_inventor: PK inventorId; iLname, iMname, iFname, iCity, iState, iCoutnry
• USP_Patent_Inventor: PK patentId, PK inventorId
• USP_Assignee: PK AssigneeId; aName, aCity, aState, aCountry
• USP_Patent_Assignee: PK patentId, PK assgneeId
• USP_OtherRef: PK refenceId; CitingPatentId, reference
• USP_usClass: PK patentId; us_Class1, us_Class2, major
• USP_Countryname: PK Country_lable; Country_fullname
• USP_Patent_Content: PK patentId; Abstract, Title, Claim
• USP_Patent_Citation: PK CitingPatentId, PK CitedPatentId; CitedPatentDate
• USP_Foreignref: PK CitingPatentId, PK CitedPatentID; CitedPatentDate, CitedPatentSource
• USP_Int_Class: PK patentId; section, class, subclass, maingroup, subgroup
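To make the many-to-many link between patents and inventors concrete, here is a small sketch using SQLite from Python. The table and column names come from the design above (only a subset of columns is shown). The column types, the choice of SQLite, and the sample row are assumptions for illustration only.

```python
import sqlite3

# Column types are assumptions; the slide gives only names and primary keys.
SCHEMA = """
CREATE TABLE USP_Patent (
    patentId  TEXT PRIMARY KEY,
    issueDate TEXT,
    title     TEXT
);
CREATE TABLE USP_inventor (
    inventorId INTEGER PRIMARY KEY,
    iLname     TEXT,
    iFname     TEXT,
    iCity      TEXT,
    iState     TEXT
);
CREATE TABLE USP_Patent_Inventor (
    patentId   TEXT    REFERENCES USP_Patent(patentId),
    inventorId INTEGER REFERENCES USP_inventor(inventorId),
    PRIMARY KEY (patentId, inventorId)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Made-up sample data
conn.execute("INSERT INTO USP_Patent VALUES (?, ?, ?)", ("P1", "2007-01-02", "Example title"))
conn.execute("INSERT INTO USP_inventor VALUES (?, ?, ?, ?, ?)", (1, "Doe", "Jane", "Tucson", "AZ"))
conn.execute("INSERT INTO USP_Patent_Inventor VALUES (?, ?)", ("P1", 1))

# Join through the link table to list each patent with its inventors
rows = conn.execute(
    """
    SELECT p.title, i.iLname
    FROM USP_Patent p
    JOIN USP_Patent_Inventor pi ON pi.patentId = p.patentId
    JOIN USP_inventor i ON i.inventorId = pi.inventorId
    """
).fetchall()
```

The composite primary key on the link table prevents the same inventor from being attached to a patent twice.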
Parsing example: parsing assignee data
public static void processAssignees() throws IOException {
    … … … …
    String[] assignees = assigneeString.split("<BR>");
    for (int i = 0; i < assignees.length; i++) {
        currentassignee = assignees[i].trim();
        if (currentassignee.length() == 0) continue;
        currentassignee = currentassignee.replaceAll("\r\n", "");
        // Process assignee name: the text between <B> and </B>
        name = findBetween(currentassignee, 0, "<B>", "</B>");
        currPosition = currentassignee.indexOf("</B>") + "</B>".length();
        // Process assignee address: the parenthesized text after the name
        address = findBetween(currentassignee, currPosition, "(", ")");
        if (address == null) { System.err.println("wrong address: " + patentId); continue; }
        int startIndex = 0, endIndex = 0;
        if ((endIndex = address.lastIndexOf(',')) >= 0) {
            city = address.substring(0, endIndex);
            if (city.lastIndexOf(',') >= 0) {
                city = city.substring(city.lastIndexOf(',') + 1);
                city = city.replaceAll("[^a-zA-Z]", "");  // Strings are immutable: keep the result
            }
            startIndex = endIndex + 1;
        } else {
            city = "-";
        }
        address = address.substring(startIndex);
        country = findBetween(address, 0, "<B>", "</B>");
        if (country == null) {
            country = "US";           // no bold country code: treat the remainder as a US state
            state = address.trim();
        } else {
            state = "-";
        }
        name = name.trim(); city = city.trim(); state = state.trim();
        // Keep the ranking order of assignees
        rank++;
    }
}
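The same splitting logic can be tried on sample input with the Python sketch below. The input shape, `<B>Name</B> (City, ST)` for US addresses and `<B>Name</B> (City, <B>CC</B>)` for foreign ones, is inferred from the Java code rather than stated in the source, and the function names and sample assignees are hypothetical.

```python
import re


def parse_assignee(entry):
    """Split one assignee entry into name, city, state, and country."""
    m = re.search(r"<B>(.*?)</B>\s*\((.*)\)", entry, re.S)
    if m is None:
        return None  # no name/address pair found
    name, address = m.group(1).strip(), m.group(2)

    # Everything after the last comma is either a US state or a bold country code
    head, sep, tail = address.rpartition(",")
    # If the city part still contains a comma, keep only its last piece
    city = head.rpartition(",")[2].strip() if sep else "-"

    country_match = re.search(r"<B>(.*?)</B>", tail)
    if country_match:
        country, state = country_match.group(1).strip(), "-"
    else:
        country, state = "US", tail.strip()
    return {"name": name, "city": city, "state": state, "country": country}


def parse_assignees(assignee_html):
    """Parse a <BR>-separated assignee string; list order preserves the rank."""
    entries = [e for e in assignee_html.split("<BR>") if e.strip()]
    return [parse_assignee(e) for e in entries]


# Made-up sample input
sample = "<B>Acme Corp.</B> (San Jose, CA)<BR><B>Example Electronics Co., Ltd.</B> (Suwon, <B>KR</B>)<BR>"
parsed = parse_assignees(sample)
```

Unlike the Java version, this sketch keeps spaces inside city names such as "San Jose".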
Data Analysis Examples
• Bibliographic analysis
– Top 50 countries

select c.countryName, count(distinct b.patentId)
from usp_assignee a, usp_patentAssignee b, usp_countryName c
where a.assigneeId = b.assigneeId
  and a.aCountry not in ('unknown', '')
  and a.aCountry = c.countryCode
group by c.countryName
order by count(distinct b.patentId) desc
Rank  Assignee Country  Number of Patents
1 United States 13,506
2 Japan 2,653
3 Federal Republic of Germany 836
4 France 534
5 China (Taiwan) 428
6 Republic of Korea 406
7 Canada 333
8 Netherlands 325
9 Australia 276
10 United Kingdom 258
11 Switzerland 193
12 Israel 163
13 Sweden 108
14 Belgium 106
15 Italy 82
16 Singapore 70
17 China 66
18 Denmark 56
19 Finland 51
20 India 39
21 Hong Kong 33
22 Bermuda 28
23 Ireland 26
24 Austria 24
25 Norway 23
26 Spain 15
27 Liechtenstein 13
28 Barbados 13
29 British Virgin Islands 7
30 New Zealand 7
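The country-ranking query can be tried on toy data as sketched below. The table and column names are taken from the query itself. The use of SQLite, the column types, and all rows are assumptions made only to show why `count(distinct ...)` is needed: a patent shared by two assignees from the same country is counted once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE usp_assignee (assigneeId INTEGER PRIMARY KEY, aCountry TEXT);
    CREATE TABLE usp_patentAssignee (patentId TEXT, assigneeId INTEGER);
    CREATE TABLE usp_countryName (countryCode TEXT PRIMARY KEY, countryName TEXT);
    """
)

# Made-up rows: P1 has two US assignees; assignee 4 has an empty country
conn.executemany("INSERT INTO usp_assignee VALUES (?, ?)",
                 [(1, "US"), (2, "US"), (3, "JP"), (4, "")])
conn.executemany("INSERT INTO usp_patentAssignee VALUES (?, ?)",
                 [("P1", 1), ("P1", 2), ("P2", 1), ("P3", 3), ("P4", 4)])
conn.executemany("INSERT INTO usp_countryName VALUES (?, ?)",
                 [("US", "United States"), ("JP", "Japan")])

ranking = conn.execute(
    """
    select c.countryName, count(distinct b.patentId)
    from usp_assignee a, usp_patentAssignee b, usp_countryName c
    where a.assigneeId = b.assigneeId
      and a.aCountry not in ('unknown', '')
      and a.aCountry = c.countryCode
    group by c.countryName
    order by count(distinct b.patentId) desc
    """
).fetchall()
```

Here the United States is credited with two patents (P1 and P2), not three, and the assignee with an empty country is excluded.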
Citation Network Analysis
Software: Graphviz (http://www.pixelglow.com/graphviz/download/)
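Graphviz reads graphs written in its DOT language, so citation pairs such as those stored in USP_Patent_Citation can be turned into a drawable graph with a few lines of code. The sketch below is an illustration under assumptions: `citations_to_dot` is a hypothetical helper, and the patent IDs are made up.

```python
def citations_to_dot(citations, name="citations"):
    """Return DOT text with one directed edge per (citing, cited) pair."""
    lines = [f"digraph {name} {{"]
    # set() removes duplicate pairs; sorted() makes the output deterministic
    for citing, cited in sorted(set(citations)):
        lines.append(f'    "{citing}" -> "{cited}";')
    lines.append("}")
    return "\n".join(lines)


# Made-up citation pairs: P2 and P3 both cite P1
dot_text = citations_to_dot([("P3", "P1"), ("P2", "P1"), ("P3", "P1")])
```

Saving `dot_text` to a file such as `citations.dot` and running `dot -Tpng citations.dot -o citations.png` renders the network as an image.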
Content Map Analysis
Software: multi-level self-organizing map algorithm developed by the AI Lab at the University of Arizona