Alberto Román [email protected]
Mapplas GeoParser: A New Approach to Location Entity Recognition and Disambiguation
Geoparsing (named entity recognition of places)
With a state-of-the-art algorithm, Mapplas Geoparser can find more places (even local businesses), and with more accuracy, than any other solution.
Mapplas Geoparser enables advanced geospatial analytics on big data content by extracting and resolving geographical places from unstructured text.
A significant proportion of the content available has a spatial dimension. This may be either explicit or implicit, for instance, a place name within a document or the location of a place referred to within a tweet.
Geoparser competitive advantages:
- Quality over sparse unstructured text.
- The most granular solution: when others say Las Vegas, we say The Bellagio.
- Language independent.
- Able to extract implicit locations, understanding the meaning intended by the author.
Disambiguation issue
Springfield: which one of them?
- Matt Groening: Springfield was named after Springfield, Oregon.
Approach
Reaching accuracy down to the level of local businesses is still an unresolved issue, but users need that kind of granular resolution. Disambiguating with engineered features, Freebase, Wikipedia or GeoNames did not provide enough accuracy. Instead, we decided to build our own gazetteer, gathering data from many additional resources.
This new gazetteer uses rich geographical information, linking places to their corresponding geometry (polygons or points). This allows us to perform logical operations between topological shapes. High precision and recall are achieved using large datasets of POIs. Additionally, the approach is language independent.
Hotel Madrid in Madrid
- Madrid, City
- Hotel Madrid, located at lat=40.4154 and lon=-3.7031
- Madrid, County/Province
Gazetteer
Many current approaches depend on features such as: location feature type, capitalized words, words preceded by a trigger word like “in” or “near”, population numbers, or percentage of appearances in Wikipedia or another corpus. Some have tried location clustering, checking the distance between places (0.1° or 5 km), but this does not provide enough accuracy.
Our Gazetteer:
1. For high-level hierarchies such as continents, countries, states, counties, or neighborhoods, we use their geographical boundaries instead of the geo-center of those locations.
2. We gather geographical information from many sources: other open-source gazetteers, web crawling, and OpenStreetMap.
3. We consolidated the information to remove duplicates and obtain a high-quality data set using Google Refine, the Jaccard index and GIS operations.
4. Using PostgreSQL with PostGIS, we built a database with rich geographical features. For example, Madrid is a polygon and Hotel Madrid is a point.
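The source does not say what the Jaccard index is computed over when consolidating duplicates. As a minimal sketch, assuming it is applied to the token sets of two place names, the comparison could look like this (the function name and the sample names are illustrative, not part of the Mapplas pipeline):

```python
def jaccard(name_a: str, name_b: str) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B| over lower-cased name tokens."""
    tokens_a = set(name_a.lower().split())
    tokens_b = set(name_b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty names are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Records with a high score are candidate duplicates; a real pipeline would
# presumably also compare geometries before merging them (assumption).
print(jaccard("Hotel Madrid", "hotel madrid"))
print(jaccard("Hotel Madrid", "Madrid"))
```

Name similarity alone cannot separate two different places with the same name, which is why the geometry of each record matters as well.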
With this approach, each place is unique even if it shares its name with other places. Their geometries are different, which reduces the disambiguation issue to a geometric and contextual problem.
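To illustrate how geometry turns disambiguation into a containment question, here is a small pure-Python sketch of a point-in-polygon test (ray casting). The rectangle standing in for the Madrid boundary is a hypothetical placeholder, not real boundary data; only the Hotel Madrid coordinates come from the example above. In a PostGIS database, a spatial predicate such as ST_Within would typically play this role.

```python
from typing import List, Tuple

Ring = List[Tuple[float, float]]  # (lon, lat) vertices of a simple polygon


def point_in_polygon(lon: float, lat: float, ring: Ring) -> bool:
    """Ray casting: count how many edges a ray going east from the point crosses."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


# Hypothetical rectangle roughly around Madrid (assumption, for illustration only).
madrid_ring = [(-3.9, 40.3), (-3.5, 40.3), (-3.5, 40.6), (-3.9, 40.6)]

hotel_in_madrid = point_in_polygon(-3.7031, 40.4154, madrid_ring)
point_in_paris = point_in_polygon(2.35, 48.86, madrid_ring)
print(hotel_in_madrid, point_in_paris)
```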
Algorithm
To get candidates, we use NLTK, but not its classifier. We therefore obtain a list of entities without knowing whether they are organizations, people or locations.
To disambiguate, we use a modification of a co-occurrence matrix. We look up candidates in our gazetteer. For example, for “Hotel Madrid” in “Madrid” we get some polygons and a couple of points:
e1  Hotel Madrid (place in Spain / point)
e2  Hotel Madrid (place in Paris / point)
e3  Madrid (city in Spain / polygon)
e4  Madrid (city in Peru / polygon)
Matrix entries x_ij (row i, column j) and the score E_i of each candidate:
      e1   e2   e3   e4    E_i
e1    1    0    1    0     0.5
e2    0    1    0    0     0.25
e3    0    0    1    0     0.25
e4    0    0    0    1     0.25
$$M = \arg\max_{i \in n}(E_i) = \arg\max_{i \in n}\left(\frac{\sum_{j=1}^{n} x_{ij}}{n}\right)$$
with
$$x_{ij} = \begin{cases} 1, & \text{if } e_i \subseteq e_j \text{ or } e_i, e_j \subseteq P \\ 0, & \text{otherwise} \end{cases}$$
where P = boundary one step higher in the hierarchy and e_i, e_j = gazetteer entities.
[Figure: cases with x_ij = 1 and x_ij = 0 for Point with Point, Point with Boundary, and Boundary with Boundary.]
For each case, we check whether one entity is within the other or is the same entity, or whether the POIs are close to each other (defined as sharing the same “parent” one step higher in the hierarchy). Finally, we choose the most likely candidate.
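The scoring step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Mapplas implementation: the `Candidate` class, the `within` sets and the parent identifiers are hypothetical stand-ins for what a gazetteer with real geometries would provide through spatial queries.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Candidate:
    key: str                          # label used in the example (e1..e4)
    name: str
    parent: Optional[str]             # hypothetical id of the boundary one step higher
    within: frozenset = frozenset()   # keys of candidates whose geometry contains this one


def x(ei: Candidate, ej: Candidate) -> int:
    """x_ij = 1 if e_i is the same as or within e_j, or both share the parent P."""
    same_or_within = ei.key == ej.key or ej.key in ei.within
    same_parent = ei.parent is not None and ei.parent == ej.parent
    return 1 if same_or_within or same_parent else 0


def scores(cands: List[Candidate]) -> List[float]:
    """E_i = (sum over j of x_ij) / n for every candidate."""
    n = len(cands)
    return [sum(x(ei, ej) for ej in cands) / n for ei in cands]


candidates = [
    Candidate("e1", "Hotel Madrid (Spain, point)", parent="e3", within=frozenset({"e3"})),
    Candidate("e2", "Hotel Madrid (Paris, point)", parent="paris"),
    Candidate("e3", "Madrid (Spain, polygon)", parent="madrid_province"),
    Candidate("e4", "Madrid (Peru, polygon)", parent="peru_region"),
]

E = scores(candidates)
# arg max over the candidates: pick the one with the highest score.
best = candidates[max(range(len(candidates)), key=E.__getitem__)]
print(E, best.key)
```

With these inputs the scores are 0.5, 0.25, 0.25 and 0.25, so e1 (Hotel Madrid in Spain) is selected, in line with the E values of the “Hotel Madrid” in “Madrid” example.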
Precision & Recall (Mapplas F1*: 93.87% / Other F1*: 72.34%)
Roughly 30% relative improvement in F1 over other commercial solutions in a side-by-side comparison.
* Recall is the percentage of existing relevant locations that are found. Precision is the percentage of retrieved locations that are correct. F1 = (2*P*R)/(P+R).
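As a quick check of the arithmetic, the F1 definition and the size of the reported gap can be written out as follows. The source gives only the two F1 values, not the underlying precision and recall, so the `f1` call uses made-up inputs purely to show the formula; the function names are illustrative.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall: F1 = 2PR / (P + R)."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when nothing relevant is retrieved
    return 2 * precision * recall / (precision + recall)


def relative_improvement(new: float, old: float) -> float:
    """Relative gain of `new` over `old`, as a fraction of `old`."""
    return (new - old) / old


# Hypothetical precision/recall values, only to exercise the formula.
example_f1 = f1(0.5, 1.0)

# The two F1 scores reported above, in percent.
gain = relative_improvement(93.87, 72.34)
print(f"example F1 = {example_f1:.3f}, relative gain = {gain:.1%}")
```

The relative gain computed from the two reported F1 scores comes out at just under 30%.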
API* – places mentioned in a document
All retrieved locations are linked to points of interest or polygons. Very granular information is available for the USA: cities, counties, states and POIs. For the rest of the world, we have the polygons of major cities (more than 100,000 people).
Note that the API does not show the full potential of the algorithm; we had to limit it in order to address several use cases.
* http://www.mapplas.com/location-entity-recognition/
Conclusions & Future Work
Our algorithm takes advantage of the geographical properties unique to each place, using the contextual information embedded in the text. It also shows two interesting properties. On one hand, we are able to build an ontology of each place while only taking care of the quality of the high-level boundaries (countries, states, counties, neighborhoods…). On the other hand, we are able to extract implicit locations, understanding the meaning intended by the author.
One way to improve the algorithm is to know where the information is generated and consumed, which reduces the number of candidates and improves accuracy. For example, articles and news benefit from understanding the target audience. For reviews, the context of the author can be used to infer the intended meaning: language, places where the author has lived, and the location where the review was written, among many others.
We have applied this algorithm to build a location-based app search engine, but we foresee many more interesting applications, such as advertising, reviews and news.
“Mental maps. Maps with edges. And for Auden, for so many of us, it's the edges of the maps that fascinate...” ― David Mitchell, The Bone Clocks