This document is primarily concerned with retrofitting our legacy ingestion system to use Twofishes instead of Bing and GeoNames, but it should also serve to document how we do forward and reverse geocoding lookups, and why we use the services that we do.
Basic Problem, as of January, 2017
We have been using the legacy ingestion system to process our providers after being forced to cut our losses with Ingestion 2. While Ingestion 2 uses Twofishes for geocoding place names into coordinates and for filling in feature name hierarchies (e.g. City, County, State) given a place, the legacy system relies on a combination of Bing and GeoNames API requests for the same thing. Bing has recently put limits on our access unless we pay for what they call an Enterprise plan, such that we have decided to run Twofishes instead for all of our legacy system geocoding needs.
The legacy system uses a combination of lookups, both forward (finding coordinates for a place name) and reverse (finding a place name given coordinates), which do not appear necessary anymore. When that system was originally designed in 2012 the developers involved did not like the quality of GeoNames's coordinate lookups, so they added Bing lookups to back them up, but it's possible that things have changed in the past four years, because we've since been finding that the data returned by Twofishes (which uses data from GeoNames) satisfies our needs.
The trick is to revise the legacy enrichment code to use a significantly different API service (which is different despite the data underneath being similar) without breaking anything.
Twofishes Infrastructure Notes
The Twofishes dataset comes from GeoNames, Flickr, and Natural Earth (naturalearthdata.com).
We're running an old version (0.84.9), from before April 2015; the latest release is 0.90.5, from September 2015. We do, however, have the latest index, updated 2015-03-05.
We may want to upgrade the installed version of Twofishes in our automation project, but the index that we have is up-to-date.
In commit 464f061 (2013-03-26), DplaBingGeocoder and DplaGeonamesGeocoder were both added, as if the developers intended to bring in GeoNames, but only the Bing one was ever instantiated. So we started out (briefly, in development, I suppose) using Bing alone. Before that commit, zen.akamod.geolookup_service was used.
GeoNames lookup was added in the next commit (2096791, of 2013-03-29). The docstring for geocode() was amended as follows:
Adds geocode data to the record coming in as follows:
1. Attempt to get a lat/lng coordinate from the property. We use Bing to
lookup lat/lng from a string as it is much better than Geonames.
2. For the given lat/lng coordinate, attempt to determine its parent
features (county, state, country). We use Geonames to reverse geocode
the lat/lng point and retrieve the location hierarchy.
So at the time, the only reason for including Bing, and for the resulting complexity of the geocode module, was that the developers didn't think GeoNames's forward lookups were good enough. I don't think we have any problem with Twofishes lookups now in 2017 (using the 2015 index, which is built from GeoNames data that is probably newer than what they evaluated in 2013).
There's nothing in our issue tracker documenting these decisions.
Current Control Flow and Function Calls
The sequence in geocode.geocode() is as follows:
1. DplaGeonamesGeocoder.enrich_place() (REVERSE or FORWARD lookup; get coordinates and features). If coordinates are already defined, we update our Place from the coordinates via a GeoNames API call; this sets the Place's properties (city, state, etc.) and may normalize the coordinates. If coordinates are not defined, we update our Place with two GeoNames API calls to get geographic features and coordinates. Note that we assign coordinates here! (Keep that in mind during the Bing lookup, which comes next.)
2. DplaBingGeocoder.enrich_place() makes another pass to get coordinates (FORWARD lookup). It fills in coordinates if successful; otherwise the Place goes unmodified.
3. DplaGeonamesGeocoder.enrich_place() (REVERSE lookup again). See step 1.
It's worth emphasizing here that the Bing lookup is extra – the first GeoNames lookup will populate the coordinates. All the Bing lookup does is give a second opinion about the coordinates. There's some logic involved to weed out incorrect results, but all that's returned is an alternative pair of coordinates.
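The three-pass flow above can be sketched roughly as follows. The class and method names are the legacy module's; the simplified bodies and the Place shape are stand-ins for illustration, not the real implementations:

```python
# Rough sketch of geocode.geocode()'s control flow as described above.
# DplaGeonamesGeocoder / DplaBingGeocoder exist in the legacy module;
# these bodies are simplified stand-ins.

class Place:
    def __init__(self, name=None, coordinates=None):
        self.name = name
        self.coordinates = coordinates  # (lat, lng) or None
        self.city = self.county = self.state = self.country = None

def geocode(place, geonames, bing):
    # 1. GeoNames pass: reverse lookup if we have coordinates; forward
    #    lookup (which also assigns coordinates) if we don't.
    geonames.enrich_place(place)
    # 2. Bing pass: only a second opinion on the coordinates. The Place
    #    is left unmodified if Bing returns nothing usable.
    bing.enrich_place(place)
    # 3. GeoNames reverse pass again, now that Bing may have adjusted
    #    the coordinates.
    geonames.enrich_place(place)
    return place
```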
Some more detail: DplaGeonamesGeocoder.enrich_place() dispatches to self._place_from_coordinates() (when coordinates are given) or self.geocode_place() (when no coordinates are given). DplaGeonamesGeocoder.geocode_place() makes two API calls, in lines 407 and 408: two calls to _name_search() result in two API requests. This does not seem right, and there's no comment justifying why it's done this way.
Other Classes in the geocode Module
There are two classes worth noting here: Address and Place.
Address has existed since early in the code's history and currently serves the sole purpose of providing an iterator of candidate feature names to pass to Bing for forward lookups.
Place was added in 2014 to maintain state for the properties of the place, like city, state, and coordinates.
Analysis of Ingestion 2 (audumbla) coarse_geocode Enrichment
The Geokit Ruby gem is used by coarse_geocode, but only for determining whether a looked-up Place matches the existing Place: it checks whether the old Place falls within the new Place's bounds, and whether the old Place's center is within a given distance of the new Place's center.
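A Python sketch of that match check (audumbla does this with the Geokit Ruby gem) might look like the following. The data shapes, the 50 km cutoff, and combining the two checks with "and" are all assumptions for illustration:

```python
import math

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lng) pairs."""
    lat1, lng1, lat2, lng2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2)
         * math.sin((lng2 - lng1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))

def places_match(old_center, new_center, new_bounds, max_km=50):
    # new_bounds is ((south, west), (north, east)); cutoff is assumed.
    (south, west), (north, east) = new_bounds
    lat, lng = old_center
    in_bounds = south <= lat <= north and west <= lng <= east
    return in_bounds and haversine_km(old_center, new_center) <= max_km
```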
The twofishes Ruby gem is also used by coarse_geocode. It manages request timeouts and retries, but that's most of what we get out of it; what it's used for could probably be done with a standard-library HTTP module.
A further note, for when it comes time to pick which of audumbla's behaviors to emulate: it uses a timeout parameter that appears to be a response timeout, whereas I think it would make more sense to think about connection timeouts and allow requests as much time as they need to complete. Since we're running Twofishes on the internal network, it seems we should retry just once after a short connect timeout and not worry about a response-completion timeout.
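That policy, a short connect timeout, one retry, and no limit on how long the response takes, could be sketched against the Python stdlib like this. The host, port, and timeout values are assumptions, and the connect function is injectable purely so the retry logic can be exercised without a network:

```python
import socket

def twofishes_connect(host="localhost", port=8081,
                      connect_timeout=2.0, retries=1,
                      _connect=socket.create_connection):
    # _connect defaults to socket.create_connection; it is a parameter
    # only so the retry behavior can be tested with a fake.
    last_err = None
    for _ in range(retries + 1):
        try:
            sock = _connect((host, port), timeout=connect_timeout)
            sock.settimeout(None)  # let the response take as long as it needs
            return sock
        except OSError as err:
            last_err = err
    raise last_err
```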
edm:Place is quite different from MVPv3.1's dpla:Place, so the audumbla enrichment is concerned with adding skos:closeMatch URIs, which we don't have to worry about for Ingestion 1. When we do get to Ingestion 3, they should be easy enough to add, because Twofishes provides one or more URIs for any given feature, for Wikipedia or other sources.
audumbla populates GeoNames URIs of the form http://sws.geonames.org/<id>/. It should also be easy enough to simply add a parent feature instead of filling in the city, county, state, etc.
In the audumbla enrichment, only the Place's identity (URI or node ID) and providedLabel are kept; all other properties are replaced. Ingestion1, by comparison, preserves any properties that were provided. For example, audumbla will overwrite the city, replacing it with the "display name" (e.g. "Somerville, MA, United States"), but Ingestion1 will keep it.
Though the DPLA Geographic and Temporal Guidelines document says providers can give us providedLabels like "United States, Pennsylvania, Erie, 42.1167, -80.07315", such a string returns no results when given directly to Twofishes, and nothing in the Ruby Twofishes gem appears to parse out the coordinates. You have to use the ll ("el el") parameter to get Twofishes to do a reverse lookup, whereas searches for names use the query parameter. It's possible that the recommendations in that document are aspirational, pending future work on the geocoding enrichment.
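The two request shapes can be illustrated as follows. The query and ll parameter names come from the Twofishes interface as described above; the base URL is an assumed local endpoint:

```python
from urllib.parse import urlencode

BASE = "http://localhost:8081/"  # assumed local Twofishes endpoint

def forward_url(name):
    # Forward lookup: name -> coordinates, via the `query` parameter.
    return BASE + "?" + urlencode({"query": name})

def reverse_url(lat, lng):
    # Reverse lookup: coordinates -> feature, via the `ll` parameter.
    return BASE + "?" + urlencode({"ll": f"{lat},{lng}"})
```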
The Ingestion1 enrichment does something I don't think the audumbla enrichment does: it detects when there are multiple dpla:Place values that are part of the same hierarchy, and combines them. See test_geocode.test_collapse_hierarchy(). We don't want to overlook this when redoing Ingestion1's geocode, or when writing the new one for Ingestion3.
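A minimal sketch of that collapsing behavior (see test_geocode.test_collapse_hierarchy() for the real expectations): each place is represented as a dict whose "hierarchy" lists its parent feature names, and a place is dropped when it appears in another place's hierarchy. The data shape is an assumption for illustration:

```python
def collapse_hierarchy(places):
    # Collect every feature name that appears as a parent of some place.
    parent_names = set()
    for p in places:
        parent_names.update(p.get("hierarchy", []))
    # Keep only places that are not parents of another place.
    return [p for p in places if p["name"] not in parent_names]
```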
Twofishes Suitability For Our Purpose
A basic Twofishes forward query (e.g. "somerville ma") gives us everything we need, including coordinates and what we have always viewed as the canonical GeoNames hierarchy of feature names, with "name" and "display name" properties, e.g. "Somerville", "Middlesex County", "Massachusetts", and "United States".
You can use "WOE types" (where-on-earth types) to narrow a search, e.g. to ask for "Hanford" as the name of a town (WOE type 7) as distinct from "Hanford Site," but Twofishes tends to default to the most specific match: for "tulare" it gives only one "interpretation," Tulare, CA, not Tulare County. With the maxInterpretations parameter you can ask for multiple interpretations (in the example, you'd get Tulare, CA in addition to Tulare County, CA), and you can also narrow your results with the woeRestrict parameter, for instance to limit the interpretations to WOE type "TOWN."
That query returns three interpretations. Restricting to "TOWN" isn't enough on its own, because many places around the world share names (Boston, MA and Boston, Lincolnshire, England; Erie, PA and Erie, IL). There is logic in the legacy geocode module, and also in the Ingestion 2 coarse_geocode module, that we will need to be aware of when reworking the geocoder. Both modules are careful, when given a place name, to consider any existing coordinate data in the record to weed out false results for geographic features that are not applicable. They are also careful to weight results: for example, given nothing but a town name from an American archive's metadata, they weight the candidate interpretations in favor of United States locations.
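Those two safeguards, dropping interpretations far from known coordinates, and preferring US results when only a name is available, could be sketched like this. The 100 km cutoff, the interpretation dict shape, and the country-code field are all assumptions for illustration:

```python
import math

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lng) pairs."""
    lat1, lng1, lat2, lng2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2)
         * math.sin((lng2 - lng1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))

def pick_interpretation(interps, known_coords=None, max_km=100):
    # If the record already has coordinates, drop interpretations whose
    # centers are too far away (cutoff assumed at 100 km).
    if known_coords is not None:
        interps = [i for i in interps
                   if haversine_km(i["center"], known_coords) <= max_km]
    # With only a name to go on, weight US interpretations first.
    interps = sorted(interps, key=lambda i: i["country"] != "US")
    return interps[0] if interps else None
```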
Related issues: DT-1138, DT-202, DT-1084