Alberto Román [email protected] Mapplas GeoParser: A New Approach to Location Entity Recognition and Disambiguation

Mapplas Geoparser


Page 1: Mapplas Geoparser

 Alberto Román [email protected]

Mapplas GeoParser: A New Approach to Location Entity Recognition and Disambiguation

Page 2: Mapplas Geoparser

Geoparsing (named entity recognition of places)

With a state-of-the-art algorithm, Mapplas Geoparser can find more places (even local businesses), and with more accuracy, than any other solution.

Mapplas Geoparser enables advanced geospatial analytics on big data content by extracting and resolving geographical places from unstructured text.

A significant proportion of the content available has a spatial dimension. This may be either explicit or implicit, for instance, a place name within a document or the location of a place referred to within a tweet.

Page 3: Mapplas Geoparser

Disambiguation issue

Springfield: which one of them?

-  Matt Groening: Springfield was named after Springfield, Oregon.

Geoparser competitive advantages:

-  Quality over sparse unstructured text.
-  The most granular solution: when others say Las Vegas, we say The Bellagio.
-  Language independent.
-  Able to extract implicit locations, understanding the meaning intended by the author.

Page 4: Mapplas Geoparser

Approach

Reaching accuracy down to the level of local businesses is still an unresolved issue, but our users needed that kind of granular resolution. Disambiguating using engineered features, Freebase, Wikipedia or Geonames did not provide enough accuracy. Instead, we decided to build our own gazetteer, gathering data from many additional resources.

This new gazetteer utilizes rich geographical information, linking places with their corresponding geographical geometry (polygons or points). This allows us to perform logical operations between topological shapes. High precision and recall are achieved using large datasets of POIs. Additionally, the approach is language independent.

Example: "Hotel Madrid in Madrid"

-  Madrid, City
-  Hotel Madrid, located at lat=40.4154 and lon=-3.7031
-  Madrid, County/Province
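One of the "logical operations between topological shapes" mentioned above is checking whether a point lies inside a polygon. The following is a minimal sketch of that test using plain ray casting. The hotel coordinates come from the slide; the rectangle used for the city is a hypothetical, very rough stand-in and not a real boundary. In a PostGIS database this kind of check would more typically be delegated to a spatial predicate such as ST_Contains or ST_Within.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (lon, lat)


def point_in_polygon(point: Point, polygon: List[Point]) -> bool:
    """Ray-casting test: toggle 'inside' each time a ray going east crosses an edge."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (y1 > y) != (y2 > y):
            # Longitude at which the edge crosses that latitude.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hotel Madrid as given on the slide, expressed as (lon, lat).
hotel_madrid = (-3.7031, 40.4154)

# Hypothetical rough rectangle standing in for the Madrid city polygon.
madrid_city_rough = [(-3.89, 40.31), (-3.52, 40.31), (-3.52, 40.64), (-3.89, 40.64)]

hotel_in_city = point_in_polygon(hotel_madrid, madrid_city_rough)
```

With real boundaries the same test tells a "Hotel Madrid" inside the Madrid polygon apart from one located in another city.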

Page 5: Mapplas Geoparser

Gazetteer

Many current approaches depend on features such as: location feature type, capitalized words, words preceded by a trigger word like “in” or “near”, population numbers, and percentage of appearances in Wikipedia or another corpus. Some have tried location clustering, checking the distance between places (0.1° or 5 km), but this does not provide enough accuracy.
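For reference, the distance check used by the clustering approaches described above can be sketched as follows. This is only an illustration of that baseline, not part of the Mapplas algorithm; the 5 km threshold comes from the slide, while the function names and the second and third sample points are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a common approximation


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two (lat, lon) points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def are_close(p, q, threshold_km: float = 5.0) -> bool:
    """Baseline clustering rule: two places are 'close' if within the threshold."""
    return haversine_km(p[0], p[1], q[0], q[1]) <= threshold_km


hotel = (40.4154, -3.7031)      # lat, lon from the Hotel Madrid example
nearby = (40.4254, -3.7031)     # hypothetical point 0.01 degrees further north
far_away = (40.5154, -3.7031)   # hypothetical point 0.1 degrees further north
```

A fixed radius like this ignores the shape of the enclosing boundary, which is the limitation the gazetteer described next is meant to address.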

Our Gazetteer:

1.  For high level hierarchies such as continents, countries, states, counties, or neighborhoods, we utilize their geographical boundaries instead of the geo-center of those locations.

2.  We gather geographical information from many sources: other open-source gazetteers, web crawling, and OpenStreetMap.

3.  We consolidated the information to remove duplicates and obtain a high-quality data set, using Google Refine, the Jaccard index and GIS operations.

4.  Using PostgreSQL with PostGIS, we built a database with rich geographical features. For example, Madrid is a polygon and Hotel Madrid is a point.

With this approach, each place is unique even if they share the same name. Their geometries are different, which reduces the disambiguation issue to a geometric and contextual problem.
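The consolidation step in the list above mentions the Jaccard index together with GIS operations. A minimal sketch of how the two could be combined to flag duplicate records is shown below. The record layout, the thresholds and the function names are assumptions made for illustration, and the Paris coordinates are a hypothetical sample; the slides do not describe the actual deduplication pipeline.

```python
from typing import Tuple

Record = Tuple[str, float, float]  # (name, lat, lon), a hypothetical record layout


def jaccard(a: str, b: str) -> float:
    """Jaccard index of the lowercase word sets of two names."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def is_probable_duplicate(r1: Record, r2: Record,
                          name_threshold: float = 0.6,
                          max_offset_deg: float = 0.001) -> bool:
    """Flag two records as duplicates only if names AND positions agree."""
    similar_name = jaccard(r1[0], r2[0]) >= name_threshold
    same_spot = (abs(r1[1] - r2[1]) <= max_offset_deg
                 and abs(r1[2] - r2[2]) <= max_offset_deg)
    return similar_name and same_spot


a = ("Hotel Madrid", 40.4154, -3.7031)
b = ("Madrid Hotel", 40.4155, -3.7030)   # same place, name written differently
c = ("Hotel Madrid", 48.8566, 2.3522)    # same name, hypothetical point in Paris
```

Requiring agreement on position as well as name keeps places that merely share a name, such as `a` and `c`, as separate entries.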

Page 6: Mapplas Geoparser

Algorithm

To get candidates, we use NLTK, but we do not use its classifier. We therefore obtain a list of entities without knowing whether they are organizations, people or locations.

“Hotel Madrid” in “Madrid”

Candidates:

e1 = Hotel Madrid (place in Spain / point)
e2 = Hotel Madrid (place in Paris / point)
e3 = Madrid (city in Spain / polygon)
e4 = Madrid (city in Peru / polygon)

        e1         e2         e3         e4         Score
e1      1 (x11)    0 (x12)    1 (x13)    0 (x14)    E1 = 0.5
e2      0 (x21)    1 (x22)    0 (x23)    0 (x24)    E2 = 0.25
e3      0 (x31)    0 (x32)    1 (x33)    0 (x34)    E3 = 0.25
e4      0 (x41)    0 (x42)    0 (x43)    1 (x44)    E4 = 0.25

To disambiguate, we use a modification of a co-occurrence matrix. We look up candidates in our gazetteer. For example, with “Hotel Madrid” in “Madrid” we get some polygons and a couple of points.

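\[
M = \arg\max_{i \in \{1,\dots,n\}} (E_i) = \arg\max_{i \in \{1,\dots,n\}} \left( \frac{\sum_{j=1}^{n} x_{ij}}{n} \right)
\]

\[
\text{with } x_{ij} =
\begin{cases}
1, & \text{if } e_i \subseteq e_j \text{ or } e_i, e_j \subseteq P \\
0, & \text{otherwise}
\end{cases}
\]

where P = boundary one step higher in the hierarchy, and e_i, e_j = gazetteer entities.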
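The scoring rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Mapplas implementation: containment is modelled with a hypothetical list of enclosing boundaries per entity rather than real geometries, an entity is treated as contained in itself, and the `Entity` class, its keys and the simplified hierarchies are invented for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Entity:
    key: str
    label: str
    ancestors: Tuple[str, ...]  # enclosing boundaries, nearest first (hypothetical)


def x(ei: Entity, ej: Entity) -> int:
    """x_ij = 1 if e_i is within (or equal to) e_j, or both share the same parent P."""
    contained = ei.key == ej.key or ej.key in ei.ancestors
    same_parent = (bool(ei.ancestors) and bool(ej.ancestors)
                   and ei.ancestors[0] == ej.ancestors[0])
    return 1 if contained or same_parent else 0


def scores(candidates: List[Entity]) -> List[float]:
    """E_i = (sum over j of x_ij) / n for every candidate."""
    n = len(candidates)
    return [sum(x(ei, ej) for ej in candidates) / n for ei in candidates]


def best_candidate(candidates: List[Entity]) -> Entity:
    """M = argmax_i E_i (the first candidate wins ties)."""
    e = scores(candidates)
    return candidates[max(range(len(candidates)), key=e.__getitem__)]


candidates = [
    Entity("hotel_madrid_es", "Hotel Madrid (place in Spain / point)", ("madrid_es", "spain")),
    Entity("hotel_madrid_fr", "Hotel Madrid (place in Paris / point)", ("paris", "france")),
    Entity("madrid_es", "Madrid (city in Spain / polygon)", ("spain",)),
    Entity("madrid_pe", "Madrid (city in Peru / polygon)", ("peru",)),
]

candidate_scores = scores(candidates)
best = best_candidate(candidates)
```

Under these assumptions the scores come out as 0.5, 0.25, 0.25 and 0.25, so the Hotel Madrid located in Spain is selected.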

Page 7: Mapplas Geoparser

Algorithm

For each case, we check whether one entity is within, or the same as, the other, or whether the POIs are close to each other (defined as sharing the same “parent” one step higher in the hierarchy). Finally, we choose the most likely candidate.

[Figure: examples of x_ij = 1 and x_ij = 0 for three cases: point with point, point with boundary, boundary with boundary.]

Page 8: Mapplas Geoparser

Precision & Recall (Mapplas F1*: 93.87% / Other F1*: 72.34%)

30% improvement over other commercial solutions. Side by side comparison.

* Recall is the % of existing relevant locations that are found. Precision is the % of retrieved locations that are right. F1 = (2*P*R)/(P+R).
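As a small worked example of the F1 definition above, the following sketch implements the formula and compares the two reported F1 values. Only the two F1 figures come from the slide; the underlying precision and recall values are not given, so they are not reconstructed here. Reading the 30% figure as a relative improvement is an interpretation, not something the slide states.

```python
def f1_score(precision: float, recall: float) -> float:
    """F1 = (2 * P * R) / (P + R), defined as 0 when both inputs are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# F1 values reported on the slide, in percent.
mapplas_f1 = 93.87
other_f1 = 72.34

# Difference in percentage points, and improvement relative to the other solution.
absolute_gain_points = mapplas_f1 - other_f1
relative_gain_pct = (mapplas_f1 - other_f1) / other_f1 * 100
```

The relative gain works out to a little under 30%, which is consistent with the improvement quoted on the slide, while the absolute gap is about 21.5 percentage points.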

Page 9: Mapplas Geoparser

API* – places mentioned in a document

All the locations retrieved are linked to points of interest or polygons. Very granular information is available for the USA: cities, counties, states and POIs. For the rest of the world we have the polygons of major cities (with more than 100,000 people).

* Note that the API doesn’t show the full potential of the algorithm; we had to limit it in order to address several use cases.

* http://www.mapplas.com/location-entity-recognition/

Page 10: Mapplas Geoparser

Conclusions & Future Work

Our algorithm takes advantage of the geographical properties unique to each place, using the contextual information embedded in the text. On top of that, it shows interesting properties. On one hand, we are able to build an ontology of each place, only taking care of the quality of the high level boundaries (countries, states, counties, neighborhoods…). On the other hand, we are able to extract implicit locations, understanding the meaning intended by the author.

One way to improve the algorithm is to know where the information is generated and consumed, which reduces the number of candidates and improves accuracy. For example, articles and news benefit from understanding the target audience. For reviews, the context of the author can be used to infer the intended meaning: language, places where the author has lived, and the location where the review was written, among many others.

We have applied this algorithm to build a location-based app search engine, but we foresee many more interesting applications, such as advertising, reviews and news.

Page 11: Mapplas Geoparser

 Alberto Román [email protected]

“Mental maps. Maps with edges. And for Auden, for so many of us, it's the edges of the maps that fascinate...” ― David Mitchell, The Bone Clocks

Mapplas GeoParser