...

Place was added in 2014 to maintain state for the properties of the place, like city, state, and coordinates.

Twofishes Suitability For Our Purpose

A basic Twofishes name query (e.g. "somerville ma") gives us everything we need: coordinates, plus what we have always viewed as the canonical GeoNames hierarchy of feature names, each with "name" and "display name" properties, e.g. "Somerville", "Middlesex County", "Massachusetts", and "United States".

You can use "WOE types" (Where On Earth types) to narrow a search, e.g. to ask for "Hanford" as the name of a town (WOE type 7) as distinct from "Hanford Site." By default, Twofishes tends to return only the most specific match: for "tulare" it gives a single "interpretation," Tulare, CA, and not Tulare County. The maxInterpretations parameter asks for multiple interpretations (in this example, Tulare County, CA in addition to Tulare, CA), and the woeRestrict parameter narrows the results, for instance to limit the interpretations to WOE type "TOWN."

Code Block
http://geo-prod:8081/?query=tulare&responseIncludes=PARENTS,DISPLAY_NAME&lang=en&maxInterpretations=3&woeRestrict=TOWN
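A minimal sketch of how such a request URL might be assembled in Python, using only the parameters that appear in the example above. The helper name build_twofishes_url and its defaults are hypothetical; the host and port are copied from the example.

```python
from urllib.parse import urlencode


def build_twofishes_url(query, max_interpretations=1, woe_restrict=None,
                        base="http://geo-prod:8081/"):
    """Assemble a Twofishes name-query URL (hypothetical helper)."""
    params = [
        ("query", query),
        ("responseIncludes", "PARENTS,DISPLAY_NAME"),
        ("lang", "en"),
        ("maxInterpretations", str(max_interpretations)),
    ]
    if woe_restrict:
        # e.g. "TOWN" to keep only town-level interpretations
        params.append(("woeRestrict", woe_restrict))
    # safe="," leaves the comma in responseIncludes unescaped, as in the example
    return base + "?" + urlencode(params, safe=",")


print(build_twofishes_url("tulare", max_interpretations=3, woe_restrict="TOWN"))
```

Leaving woe_restrict unset omits the parameter entirely, which corresponds to the unrestricted search described above.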

...

Analysis of Ingestion 2 (audumbla) coarse_geocode Enrichment

The Geokit Ruby gem is used by audumbla's coarse_geocode, but only to determine whether a looked-up Place matches the existing Place. It checks whether the old Place is within the new Place's bounds, and whether the old Place's center is within a given distance of the new Place's center.
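The two checks could be reproduced without Geokit along these lines. This is a sketch only: the Place shape, the 50 km threshold, and the choice to combine the checks with "or" are assumptions, since the notes above don't say how audumbla combines them or what distance it uses.

```python
import math
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Place:
    lat: float
    lng: float
    # Bounding box as (south, west, north, east); None when unknown.
    bounds: Optional[Tuple[float, float, float, float]] = None


def distance_km(a: Place, b: Place) -> float:
    """Great-circle distance between two centers (haversine formula)."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(a.lat), math.radians(b.lat)
    dphi = phi2 - phi1
    dlam = math.radians(b.lng - a.lng)
    h = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(h))


def within_bounds(old: Place, new: Place) -> bool:
    """True if the old center lies inside the new bounding box.

    Does not handle boxes that cross the antimeridian.
    """
    if new.bounds is None:
        return False
    south, west, north, east = new.bounds
    return south <= old.lat <= north and west <= old.lng <= east


def places_match(old: Place, new: Place, max_km: float = 50.0) -> bool:
    """Assumed rule: inside the new bounds, or close to the new center."""
    return within_bounds(old, new) or distance_km(old, new) <= max_km
```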

The twofishes Ruby gem is also used by audumbla's coarse_geocode. It manages request timeouts and retries, and that is most of what we get out of it. What it is used for could probably be done with a standard-library HTTP module.

A further note for when it comes time to pick which of audumbla's behaviors to emulate: audumbla uses a timeout parameter that appears to be a response timeout, whereas I think it would make more sense to set a connection timeout and allow requests as much time as they need to complete. Since we're running Twofishes on the internal network, it seems we should retry only once after a short connect timeout and not worry about a response-completion timeout.
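One way to express that policy with Python's standard library is sketched below: the timeout applies only while connecting, a failed connection is retried once, and the socket timeout is lifted before the request is sent. The function name, the default values, and the connection_factory parameter (there so the behavior can be exercised without a network) are all hypothetical.

```python
import http.client


def fetch_with_connect_retry(host, port, path, connect_timeout=2.0, retries=1,
                             connection_factory=http.client.HTTPConnection):
    """GET path, retrying only failed connection attempts (hypothetical helper)."""
    last_error = None
    for _ in range(retries + 1):
        conn = connection_factory(host, port, timeout=connect_timeout)
        try:
            conn.connect()  # the timeout governs this connection attempt
        except OSError as err:  # covers timeouts and refused connections
            last_error = err
            conn.close()
            continue
        try:
            # Connected: remove the timeout so the response can take
            # as long as it needs to complete.
            conn.sock.settimeout(None)
            conn.request("GET", path)
            return conn.getresponse().read()
        finally:
            conn.close()
    raise last_error
```

Errors raised after the connection is established are deliberately not retried here, which matches the idea of not worrying about response-completion timeouts.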

MAPv4's edm:Place is quite different from MAPv3.1's dpla:Place, so the audumbla enrichment is concerned with adding skos:exactMatch and skos:closeMatch URIs, which we don't have to worry about with Ingestion 1. When we do get to Ingestion 3, they should be easy enough to add, because Twofishes provides one or more URIs for any given feature, for Wikipedia or other sources. audumbla populates GeoNames URIs of the form http://sws.geonames.org/<id>/. It should also be easy enough to simply add a parent feature instead of filling in the city, county, state, etc.

In the audumbla enrichment, only the Place's identity (URI or node ID) and providedLabel are kept. All other properties are replaced. Ingestion1 by comparison preserves any properties that were provided. For example, audumbla will overwrite the city, but Ingestion1 will keep it. It gets replaced with the "display name," for example, "Somerville, MA, United States".

Though the DPLA Geographic and Temporal Guidelines document says providers can give us providedLabels like "United States, Pennsylvania, Erie, 42.1167, -80.07315", such a string returns no results when given directly to Twofishes. Nothing in the Ruby Twofishes gem appears to parse out the coordinates. You have to use the ll (el el) parameter to get Twofishes to do a reverse lookup, whereas searches for names use the query parameter. It's possible that the recommendations in that document are aspirational, pending future work on the geocoding enrichment.
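If we wanted to support labels of that kind, the enrichment would have to split off the coordinates itself before calling Twofishes. The sketch below shows one possible approach; the helper name is hypothetical, and the "lat,lng" formatting of the ll value is an assumption that should be checked against the Twofishes documentation. Note that it discards the name portion of the label whenever coordinates are found.

```python
def twofishes_params(provided_label):
    """Choose between a name query and an ll reverse lookup (hypothetical helper)."""
    parts = [p.strip() for p in provided_label.split(",")]
    if len(parts) >= 2:
        try:
            lat, lng = float(parts[-2]), float(parts[-1])
        except ValueError:
            lat = lng = None
        if lat is not None and -90 <= lat <= 90 and -180 <= lng <= 180:
            # Trailing "lat, lng" pair found: use a reverse lookup.
            # Assumed format for the ll parameter: "lat,lng".
            return {"ll": f"{parts[-2]},{parts[-1]}"}
    # No usable coordinates: fall back to a plain name query.
    return {"query": provided_label}
```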

The Ingestion1 enrichment does something I don't think the audumbla enrichment does. It detects when there are multiple dpla:Place values that are part of the same hierarchy, and combines them. See test_geocode.test_collapse_hierarchy(). We don't want to overlook this when redoing Ingestion1's geocode, or when writing the new one for Ingestion3.
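To illustrate the idea only (the authoritative behavior is whatever test_geocode.test_collapse_hierarchy() asserts, which this sketch does not attempt to reproduce), collapsing could look like the following. It assumes each place is represented as a tuple of feature names ordered from most to least specific; that representation and the function name are hypothetical.

```python
def collapse_hierarchy(places):
    """Drop places whose hierarchy is already contained in a more specific one."""
    kept = []
    # Examine the most specific (longest) hierarchies first.
    for candidate in sorted(places, key=len, reverse=True):
        if not candidate:
            continue  # nothing to compare for an empty hierarchy
        covered = any(
            other[-len(candidate):] == candidate
            for other in kept
        )
        if not covered:
            kept.append(candidate)
    return kept
```

For example, ("Massachusetts", "United States") is dropped when ("Somerville", "Middlesex County", "Massachusetts", "United States") is also present, because the former is the tail of the latter.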

The woeRestrict=TOWN query for "tulare" shown earlier returns three interpretations. Restricting to "TOWN" isn't enough on its own, because there are many places around the world with the same name (Boston, MA and Boston, Lincolnshire, England; Erie, PA and Erie, IL). The legacy geocoder module and the Ingestion 2 coarse_geocode module both capture logic that we will need to be aware of when reworking the geocoder. Both modules are careful, when given a place name, to consider any existing coordinate data in the record in order to weed out results for geographic features that are not applicable. They are also careful to weight results, for example when we're given nothing but a town name from an American archive's metadata. In that case, we'll weight the candidate interpretations in favor of United States locations.
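A rough sketch of that filtering and weighting follows. It is not taken from either module: the candidate shape (dicts with "name", "country", and "center" keys), the 100 km cutoff, and the function name are all hypothetical, and real Twofishes interpretations are structured differently.

```python
import math


def rank_interpretations(candidates, hint=None, max_km=100.0,
                         preferred_country="US"):
    """Order candidates, dropping any that are far from a coordinate hint."""
    def km(a, b):
        # Equirectangular approximation; adequate for a coarse filter.
        mean_lat = math.radians((a[0] + b[0]) / 2)
        dx = math.radians(b[1] - a[1]) * math.cos(mean_lat)
        dy = math.radians(b[0] - a[0])
        return 6371.0 * math.hypot(dx, dy)

    if hint is not None:
        # Existing coordinates in the record weed out inapplicable features.
        candidates = [c for c in candidates if km(hint, c["center"]) <= max_km]
    # Stable sort: preferred-country candidates first, original order otherwise.
    return sorted(candidates, key=lambda c: c["country"] != preferred_country)
```

With no hint, a bare "Boston" would rank Boston, MA ahead of Boston, Lincolnshire; with a hint near Lincolnshire, only the English candidate would survive the filter.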

Related Stories / Tasks

Jira (digitalpubliclibraryofamerica.atlassian.net):

  • DT-1138
  • DT-202
  • DT-1084