Why - Placekey API Docs


Why

A common challenge we’ve heard from businesses, governments, and academics who work with place data is the lack of standardization when referring to a place. A given place may be referred to across various data sets by name, address, geocode, or any number of different data-provider IDs. These pieces of identifying information are often messy and unstable over time (e.g., a business may change its name, or a street may be renamed), some are not unique to a given place (e.g., a new business moves in at an existing address), and some may not be present for all places (e.g., a park without a street address). Placekey was created to serve as a standard universal identifier for any physical place, so that information pertaining to those places can be shared easily across organizations and data sets.

The structure of a Placekey

Each Placekey is divided into two parts, What and Where, written as “What@Where”. The What part encodes information about the place and its address, while the Where part situates that place on Earth. The What part of a Placekey is optional. A What-less Placekey like “@5vg-7qg-tvz” refers to a region on the Earth, while a What part plus a Where part specifies a particular place within that region.

The Where part of a Placekey encodes a hexagon of approximately 15,000 m² on the surface of the Earth. These hexagons have an edge length of 66 m on average, and it can be helpful to think of them as roughly circles with a diameter of 132 m. The exact area and edge length of a hexagon vary by location. Specifically, these hexagons are given by resolution 10 H3 indices. This document does not cover H3 in full detail, as that is covered by the documentation for H3.

Each Where part is specified by 9 characters split into three triplets for legibility. The triplets do not explicitly code exact spatial distances, but the code does become more specific when reading left to right. There is an upper bound on how far apart two Where parts can be based on the length of their shared prefix (see the table below).

Length of shared prefix    Maximal distance (meters)
0                          20,040,000
1                          20,040,000
2                          2,777,000
3                          1,065,000
4                          152,400
5                          21,770
6                          8,227
7                          1,176
8                          443.2
9                          63.47

However, it is important to be aware that nearby hexagons may have codes that are not very similar. This occurs when Placekey grid cells are near the edges of larger (i.e., lower resolution) hexagons in H3’s spatial hierarchy. Pictured below are three neighboring Placekeys whose shared vertex is also shared by three resolution 5 hexagons (edge length roughly 8.5 km, shown in orange). Each of these Placekey hexagons is nested under the resolution 5 hexagon that contains most of its area, which is why their encodings are so different.
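The prefix table above can be applied mechanically. The following is a minimal sketch, not part of any Placekey library: the function names are hypothetical, and “@5vg-7qg-2kz” is a made-up Where part used only to illustrate a six-character shared prefix. The result is only an upper bound on distance, since neighboring hexagons can still share a short prefix.

```python
# Upper bounds from the table above, indexed by shared-prefix length (0-9).
MAX_DISTANCE_M = [20_040_000, 20_040_000, 2_777_000, 1_065_000, 152_400,
                  21_770, 8_227, 1_176, 443.2, 63.47]


def where_part(placekey: str) -> str:
    """Return the 9 Where characters of a Placekey with hyphens removed."""
    where = placekey.split("@")[-1].replace("-", "")
    if len(where) != 9:
        raise ValueError(f"expected 9 Where characters, got {len(where)}")
    return where


def shared_prefix_length(a: str, b: str) -> int:
    """Count the leading Where characters that two Placekeys have in common."""
    count = 0
    for char_a, char_b in zip(where_part(a), where_part(b)):
        if char_a != char_b:
            break
        count += 1
    return count


def max_distance_m(a: str, b: str) -> float:
    """Upper bound (in meters) on the distance between two Where parts."""
    return MAX_DISTANCE_M[shared_prefix_length(a, b)]


# Hypothetical second Placekey that shares the first two triplets.
print(max_distance_m("@5vg-7qg-tvz", "@5vg-7qg-2kz"))
```

The What part is ignored here because the table describes Where parts only.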

The What part of a Placekey is split into two triplets, for example, “223-227”, where the first triplet is a serial index of an address located in the Where part of the Placekey, and the second triplet is a serial index of the points of interest (POI) located at that address. These are referred to as the address encoding and the POI encoding, respectively. The POI encoding is optional, while the address encoding will always be present if the Placekey has a What part. If the POI does not have an actual address (like some parks or monuments), the address encoding will be “zzz”. What parts are only unique within the Where part of a Placekey, so the same What part can appear under different Where parts.

The address encoding of a Placekey is assigned to addresses that have been CASS validated, meaning that the USPS recognizes the address as a place to which mail can be delivered. Address encodings beginning with “z” are reserved, and in particular the following special address encodings are in use:

1. zzz - reserved for use with POI that do not have a mailing address, e.g., a park,

2. zzy - reserved for use with POI that do not have a CASS-valid street address,

3. zzw - reserved for use with POI whose mailing address geocodes to a different Where part than their physical location.
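The “What@Where” structure and the reserved address encodings can be illustrated with a short sketch. This is an assumption-laden illustration rather than the behavior of any Placekey library: `split_placekey`, `PlacekeyParts`, and `SPECIAL_ADDRESS_ENCODINGS` are hypothetical names, and no validation of the characters or of the underlying H3 cell is attempted.

```python
from typing import NamedTuple, Optional

# Reserved address encodings listed above (descriptions paraphrased).
SPECIAL_ADDRESS_ENCODINGS = {
    "zzz": "POI without a mailing address",
    "zzy": "POI without a CASS-valid street address",
    "zzw": "mailing address geocodes to a different Where part",
}


class PlacekeyParts(NamedTuple):
    address: Optional[str]  # first What triplet, or None if there is no What
    poi: Optional[str]      # second What triplet, or None if absent
    where: str              # the Where part, hyphens kept


def split_placekey(placekey: str) -> PlacekeyParts:
    """Split 'What@Where' into address encoding, POI encoding, and Where."""
    what, separator, where = placekey.partition("@")
    if not separator:
        raise ValueError("a Placekey must contain '@'")
    if not what:
        return PlacekeyParts(None, None, where)
    address, _, poi = what.partition("-")
    return PlacekeyParts(address, poi or None, where)


parts = split_placekey("223-227@5vg-7qg-tvz")
print(parts.address, parts.poi, parts.where)
```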

Both the address and POI encodings can accommodate slightly less than 22,000 values (strings of length 3 with 28 possible encoding characters), meaning that for each Where there are about 482 million possible Whats. In the US, we have seen no more than 3,000 distinct addresses per Where and no more than 3,000 distinct POI per Where, so the What part has plenty of headroom for places to change over time.
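The capacity figures above follow directly from the alphabet size and triplet length. A quick check, with variable names chosen only for this illustration:

```python
ALPHABET_SIZE = 28   # possible encoding characters per position
TRIPLET_LENGTH = 3   # characters in an address or POI encoding

# Distinct values for one triplet ("slightly less than 22,000").
values_per_triplet = ALPHABET_SIZE ** TRIPLET_LENGTH

# Address triplet combined with POI triplet ("about 482 million").
what_combinations = values_per_triplet ** 2

print(f"{values_per_triplet:,} values per triplet")
print(f"{what_combinations:,} possible What parts per Where")
```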

Encoding Placekeys

One of the primary goals in the design of Placekey was to make Placekeys as user friendly as possible. This means that Placekeys should be short, legible, and comprehensible. The above section on the structure of Placekeys covered their comprehensibility, in that information is carried by the structure of the What and Where parts. Shortness and legibility are addressed by the method in which Placekeys are encoded. This section serves as a short introduction to the encoding system used for Placekey. For full details on the encoding, see the Placekey Encoding Spec.

Each part of a Placekey can be thought of as an encoded integer. When encoding integers, one can try to reduce the number of characters needed to represent the integer by using a larger character set, or one can try to put the integer into a machine-friendly format like binary. With Placekey we have opted to shorten the encoded integers by using a large character set. However, a large character set has tradeoffs: it can include characters that are visually similar (e.g., ‘O’ and ‘0’, or ‘1’ and ‘l’), it can make encoded values harder to read (for instance, by mixing lower and upper case letters), and it introduces the possibility of spelling undesirable words with the encoding.

Placekey uses the following set of 28 characters for encoding:

23456789bcdfghjkmnpqrstvwxyz

The characters ‘a’, ‘e’, and ‘u’ are also reserved as special characters in the encoding. The use of a 28-character set allows H3 indices of resolution up to 12 to be encoded using 9 characters, so Placekey is robust to future improvements in geocoding. Resolutions greater than 12 will generally not be practicable for physical places, as resolution 12 hexagons already average only about 300 m² in area (each resolution step divides the roughly 15,000 m² resolution 10 hexagon by about 7).
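The claim that 9 characters suffice up to resolution 12 can be checked with a little arithmetic. The sketch below assumes the standard H3 bit layout, in which the location portion of an index consists of 7 base-cell bits followed by 3 bits per resolution digit; the function name is hypothetical.

```python
ALPHABET_SIZE = 28
WHERE_LENGTH = 9


def location_bits(resolution: int) -> int:
    """Bits needed for the base cell plus one 3-bit digit per resolution."""
    return 7 + 3 * resolution


# Number of distinct values that 9 base-28 characters can represent.
capacity = ALPHABET_SIZE ** WHERE_LENGTH

# Resolution 12 needs 43 bits, which fits; resolution 13 needs 46, which does not.
fits_resolution_12 = 2 ** location_bits(12) <= capacity
fits_resolution_13 = 2 ** location_bits(13) <= capacity

print(location_bits(12), fits_resolution_12, fits_resolution_13)
```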

The selected alphabet contains no vowels, in an attempt to prevent encoded values from unintentionally spelling offensive words. However, a small number of abbreviations and acronyms that we chose to avoid can still be spelled with this character set. These are programmatically removed by substituting the letter ‘e’ or ‘u’ for a character in these words in a way that does not introduce a new avoided word. The list of avoided words and their replacements is curated and ordered so that their removal does not break the one-to-one mapping between integers and encoded values.

The character ‘a’ is used to left-pad the Where part to 9 characters in cases where the encoded value requires fewer. This is necessary because the most significant digits in a Where part correspond to H3 base cells, of which there are 122, requiring 2 digits to describe in base 28. So, you can think of the first three characters of a Where part that starts with the letter “a” in the same way that you would think of an integer that starts with a “0”. For instance, “a4t” is the same value as “4t” in our encoding, just as 055 is the same value as 55 with regular integers. In the What part encodings, “2”, the first character in the character set, is used for padding, so that the string “222” corresponds to the first encoded value for either the address or POI encoding.
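The padding behavior can be sketched as plain base-28 conversion. This is a simplified illustration, not the Placekey Encoding Spec: the function names are hypothetical, the replacement of avoided words with ‘e’ and ‘u’ is omitted, and the characters are assumed to map to digit values 0 through 27 in the order they appear in the alphabet.

```python
ALPHABET = "23456789bcdfghjkmnpqrstvwxyz"  # '2' acts as the zero digit


def encode(value: int, width: int, pad: str) -> str:
    """Encode a non-negative integer in base 28 and left-pad to `width`."""
    digits = ""
    while value > 0:
        value, remainder = divmod(value, len(ALPHABET))
        digits = ALPHABET[remainder] + digits
    return digits.rjust(width, pad)


def decode(text: str) -> int:
    """Decode a base-28 string, ignoring any leading 'a' padding."""
    value = 0
    for char in text.lstrip("a"):
        value = value * len(ALPHABET) + ALPHABET.index(char)
    return value


print(encode(0, 3, "2"))             # What-style padding with '2'
print(encode(78, 3, "a"))            # Where-style padding with 'a'
print(decode("a4t") == decode("4t"))
```

Padding with “2” in the What part is the same as writing leading zeros, whereas “a” in the Where part is a separate marker outside the 28-character alphabet.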

When encoding H3 indices we do not encode the entire 64-bit integer. Rather, we encode 43 bits of the integer plus a constant. We only need 43 bits because the first 12 bits of an H3 index encode metadata about the index that does not contain location information, and the last 9 bits encode location information at a higher precision (resolutions 13-15) than required for Placekey. The truncated bits are constant across all of the H3 indices used for Placekey, so their removal does not impact the one-to-one relationship between Where parts and resolution 10 H3 indices. We also add a constant that shifts the value of the H3 base cell index by 1, which reduces the amount of padding required to make an encoded index 9 characters long.

Row (16 bits each)    Fields, most significant bit first
1                     Res, Mode, Res./edge, Resolution, Base cell
2                     Base cell, Digit 1, Digit 2, Digit 3, Digit 4
3                     Digit 5, Digit 6, Digit 7, Digit 8, Digit 9, Digit 10
4                     Digit 11, Digit 12, Digit 13, Digit 14, Digit 15

(The bit layout of an H3 index, in four rows of 16 bits. The bits encoded for a Placekey run from the base cell through digit 12.)
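The truncation described above amounts to a shift and a mask. The sketch below is one reading of that description and not the official implementation: the names are hypothetical, `SAMPLE_INDEX` is an illustrative 64-bit value laid out like a resolution 10 H3 index, and the added constant is assumed to be 1 in the base-cell field (that is, 2^36 once the low 9 bits are dropped).

```python
DROPPED_LOW_BITS = 9              # digits 13-15, 3 bits each
WHERE_BITS = 43                   # 7 base-cell bits + 12 digits x 3 bits
BASE_CELL_POSITION = WHERE_BITS - 7
BASE_CELL_OFFSET = 1 << BASE_CELL_POSITION  # adds 1 to the base-cell field

# Illustrative value with mode 1, resolution 10, and unused digits set to 7.
SAMPLE_INDEX = 0x8A2A1072B59FFFF


def truncate_h3(h3_index: int) -> int:
    """Drop the 12 metadata bits and the 9 lowest bits, keeping 43 bits."""
    return (h3_index >> DROPPED_LOW_BITS) & ((1 << WHERE_BITS) - 1)


def to_where_integer(h3_index: int) -> int:
    """Truncate the index, then shift the base-cell value up by 1."""
    return truncate_h3(h3_index) + BASE_CELL_OFFSET


resolution = (SAMPLE_INDEX >> 52) & 0xF
base_cell = truncate_h3(SAMPLE_INDEX) >> BASE_CELL_POSITION
shifted_base_cell = to_where_integer(SAMPLE_INDEX) >> BASE_CELL_POSITION
print(resolution, base_cell, shifted_base_cell)
```

Because base cells range from 0 to 121, adding 1 never overflows the 7-bit field, so the result still fits in 43 bits.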

Use of H3

There are a number of ways to convey information about a location on the Earth. The standard and most basic is a latitude and longitude pair. However, such pairs define points along a continuum rather than regions, so multiple pairs of coordinates are required to specify a region, and at least 5 decimal places are required to specify something the size of a big-box store.

One desirable property of latitude and longitude is that it is easy to tell relative spatial relationships between multiple points (often referred to as “proximity”), and this motivates the use of hierarchical systems for geocoding so that, outside of boundary conditions, nearby locations will have similar codes and vice-versa. These boundary conditions are inescapable as codes are one-dimensional while the surface of the Earth is not. Some geocoding solutions are ruled out because they do not have this proximity property, such as what3words.

There are several candidate geocoding systems, such as Geohash, Open Location Code (also known as Plus Codes), S2, and H3. The first three of these use a rectangular grid system to tile the globe, while H3 uses a hexagonal grid system. Neither kind of grid system is completely regular, since rectangles and hexagons tile the plane but not the sphere. In the case of rectangular grid systems, the grid breaks down at the poles, where triangles must be used instead of rectangles and a large number of grid cells must touch. H3 handles the grid breakdown by starting with an icosahedral projection of the surface of the Earth (an icosahedron is a 3D shape with 20 faces and 12 vertices), where each face can be regularly tiled by hexagons and boundary conditions are introduced at each of the twelve vertices of the icosahedron. These boundary conditions require the use of pentagons rather than hexagons for the cells containing these vertices, which has less impact on the adjacency structure of cells than the boundary cases of the rectangular grids. H3 is also designed so that these pentagonal cells are centered over bodies of water.

(Centers of the 12 pentagonal H3 cells)

Another benefit of H3 over rectangular grid systems is the aforementioned adjacency structure of cells. In a hexagonal grid, each cell has 6 neighbors with which it shares an edge, and the centers of these neighbors are all equidistant from the center of the given cell. In a rectangular grid, each cell has 4 neighbors with which it shares an edge and 4 neighbors with which it shares only a vertex. Furthermore, the centers of the edge-sharing neighbors are closer to the center of the given cell than those of the vertex-sharing neighbors. The simpler adjacency structure of the hexagonal grid makes analyses of spatial data easier than with rectangular grids (see ESRI’s Why Hexagons? for instance). This property also makes it easier to approximate complicated shapes, such as various governmental boundaries, using hexagons than with rectangles of a similar size.
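The difference in adjacency structure can be seen with idealized planar grids. The sketch below uses a unit square grid and a regular hexagonal grid in axial coordinates. Both are flat-plane simplifications for illustration and do not involve H3 itself.

```python
import math

# The 8 cells surrounding a square cell: 4 share an edge, 4 share only a vertex.
SQUARE_NEIGHBORS = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                    if (dx, dy) != (0, 0)]

# The 6 cells surrounding a hexagonal cell, in axial (q, r) coordinates.
HEX_NEIGHBORS = [(1, 0), (1, -1), (0, -1), (-1, 0), (-1, 1), (0, 1)]


def hex_center(q: int, r: int) -> tuple:
    """Planar center of a pointy-top hexagon with unit edge length."""
    return (math.sqrt(3) * (q + r / 2), 1.5 * r)


# Distinct center-to-center distances from the cell at the origin.
square_distances = sorted({round(math.hypot(dx, dy), 6)
                           for dx, dy in SQUARE_NEIGHBORS})
hex_distances = sorted({round(math.hypot(*hex_center(q, r)), 6)
                        for q, r in HEX_NEIGHBORS})

print("square grid:", square_distances)
print("hexagonal grid:", hex_distances)
```

The square grid yields two distinct neighbor distances, while the hexagonal grid yields one.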

(Images courtesy of the H3 documentation)

A further benefit of using H3 is that it has low distortion of hexagons across the globe when compared to a rectangular grid system (e.g., the expansion of landmasses near the poles in a Mercator projection). For H3, the entire range of distortion occurs across each face of the icosahedron, where hexagons in the center of each face have larger area than those near the edges of the face.

(A resolution 10 H3 hexagon with a nearby OLC code of length 8 from Nome, AK (left) and Key West, FL (right). In the left example, the hexagon has area 13,807 m² while the rectangle has area 33,483 m², and in the right example, the hexagon has area 15,403 m² while the rectangle has area 70,140 m².)

A final reason we opted to use H3 for Placekey is that it is open source and there is already a community of other libraries, tools, and services using H3 (e.g., Unfolded, kepler.gl, deck.gl, h3-py, h3-js, h3-node, geojson2h3, geo (Clojure), pgh3, bigquery-jslibs, H3 Indexes, Logstash H3 filter plugin). We wish to make Placekey as widely usable as possible, which means bundling its functionality into open source libraries and including Placekeys as part of other services and data sets.

Address and History Tracking

In order to maintain the What part of a Placekey, SafeGraph maintains databases of addresses and POI. Incoming addresses and POI are either matched against pre-existing places in our databases, or they are assigned new Placekeys in our database. Since the address, location, or name of a place may change over time we also keep a historical record of related Placekeys to enable easy deduping of places in a data set.

Libraries and API

The Placekey service API provides the ability to look up Placekeys based on location (latitude and longitude), address, or address plus location name.

There are Python and JavaScript libraries for working with Placekeys. These cannot provide or validate the What part of a Placekey, but they can compute and validate Where parts of Placekeys. These libraries also contain additional functionality that allows for the conversion of Where parts into various geometric formats, and vice versa. Example notebooks for the Python library are hosted in a separate repository.

Task                                                               Use             Result
Look up the Placekey for a location by (latitude, longitude)       API, library    Placekey of the form “@Where”
Look up the Placekey for an address                                API             Placekey of the form “Address@Where”
Look up the Placekey for an address with location name             API             Placekey of the form “Address-POI@Where”
Find all Placekeys of the form “@Where” which intersect a region   library
Validate the formatting of a Placekey                              library

To learn more about Placekey or to try it for yourself, visit our website at placekey.io