DPLA tries to ensure that the IDs we mint for your records do not change over time. This requires that the value we base each identifier on is itself stable. During the development of ingestion3 we discovered that, in some cases, the value we based our identifiers on in ingestion1 was not the most stable choice.
History of DPLA identifiers in ingestion1
The ID minting service in ingestion1 is called select-id (https://github.com/dpla/ingestion/blob/develop/lib/akamod/select-id.py).
This service pulls values from the field specified by the prop parameter (the default field is 'handle', which is just an alias for the dc:identifier field). It uses the last absolute URI; if no URI is present, it uses the first value in the specified field. Next, all whitespace characters in the base identifier are replaced with a double underscore (__). Finally, an optional prefix and a double dash (--) are prepended to the base identifier. This prefix is a way of "salting" identifiers to prevent collisions between data providers.
The last step is to take the final provider identifier and create an MD5 hash, which becomes the persistent DPLA identifier for the record.
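The steps described above can be sketched in Python. This is a minimal illustration of the description on this page, not a copy of select-id.py: the function names are hypothetical, the absolute-URI check is a simplification, and replacing each whitespace character individually (rather than each run of whitespace) is an assumption.

```python
import hashlib
import re
from urllib.parse import urlparse


def is_absolute_uri(value: str) -> bool:
    """Simplified check (assumption): the value has a scheme and a network location."""
    parsed = urlparse(value)
    return bool(parsed.scheme and parsed.netloc)


def pre_hash_value(values, prefix=None):
    """Build the provider identifier that is fed to MD5.

    `values` is the non-empty list of strings found in the identifier field.
    """
    uris = [v for v in values if is_absolute_uri(v)]
    # Use the last absolute URI, or fall back to the first value in the field.
    base = uris[-1] if uris else values[0]
    # Replace every whitespace character with a double underscore.
    base = re.sub(r"\s", "__", base)
    # Prepend the optional provider prefix and a double dash.
    return f"{prefix}--{base}" if prefix else base


def mint_dpla_id(values, prefix=None):
    """Return the MD5 hex digest of the pre-hash value."""
    return hashlib.md5(pre_hash_value(values, prefix).encode("utf-8")).hexdigest()
```

For example, `pre_hash_value(["local id 7", "http://example.org/1"], prefix="il")` selects the URI and yields `il--http://example.org/1`, which `mint_dpla_id` then hashes into a 32-character hexadecimal string.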
Examples:
Provider | Use prefix? | Identifier field | pre-MD5 hash value | DPLA identifier |
---|---|---|---|---|
Illinois | Yes | dc:identifier | il--https://madison-historical.siue.edu/archive/files/original/79c4cf9b0da358e32fa7bab46563e79e.pdf | 02a5aa4975b941d340d14cb9ad4f7a37 |
PA Digital | No | OAI header identifier | oai:libcollab.temple.edu:dplapa:SLPa_biologicalfactst00bate | 000178f5b0d971292ca1f6539a9f3a9b |
Problems with this approach
- The default source of local "persistent" identifiers for providers using either DC or QDC is the dc:identifier field, which was not designed for that purpose
- The order of identifiers is now significant and adding a new URI to the dc:identifier field will change the DPLA identifier
- Moving from http:// to https:// will change the DPLA identifier
- Changing a domain will change the DPLA identifier
These and other subtle changes can cause the DPLA identifiers to change without notice. Additionally, DPLA did not save the pre-hashed value so it is very difficult to reverse engineer what the DPLA identifier is derived from.
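To make the fragility concrete, the short Python sketch below hashes a hypothetical pre-hash value and three slightly altered versions of it. The URLs are invented for illustration. The final lines show one possible way to keep the pre-hash value next to the minted ID so that a later change can be traced; this is an illustration only, not a description of what either ingestion system stores.

```python
import hashlib


def md5_hex(pre_hash: str) -> str:
    """Return the MD5 hex digest of a pre-hash identifier string."""
    return hashlib.md5(pre_hash.encode("utf-8")).hexdigest()


# Hypothetical pre-hash value for one record, plus the value that would be
# produced after three kinds of provider-side change.
original = "il--http://example.org/files/1.pdf"
variants = {
    "scheme change": "il--https://example.org/files/1.pdf",
    "domain change": "il--http://archive.example.org/files/1.pdf",
    "new last URI": "il--http://example.org/files/1-v2.pdf",
}

# Each variant produces a different DPLA identifier for the same record.
changed = {
    label: md5_hex(value) != md5_hex(original)
    for label, value in variants.items()
}

# Storing the pre-hash value alongside the ID keeps the derivation visible.
id_record = {"dpla_id": md5_hex(original), "pre_hash": original}
```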
Ingestion3 DPLA ID minting
We have sought to remedy many of these issues in our approach to DPLA ID minting in ingestion3. Because of this legacy, it is virtually impossible to guarantee persistent identifiers for all records in our corpus, but we can make some changes that make it less likely for DPLA identifiers to change going forward:
- Use the OAI header identifier instead of any value in the dc:identifier property
- If data is delivered by some other mechanism then the field which we use should be carefully vetted and the consequences of even minor changes well understood by all parties
- Checking for uniqueness and reporting duplicates
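As a rough illustration of the uniqueness check in the last item, the Python sketch below groups records by their pre-hash identifier and reports any identifier shared by more than one record. The function name and the `(label, original_id)` input shape are hypothetical and do not reflect how ingestion3 implements its check.

```python
from collections import defaultdict


def find_duplicate_ids(records):
    """Return a dict of pre-hash IDs that occur more than once.

    `records` is an iterable of (label, original_id) pairs, where `label`
    is any human-readable reference to the record (hypothetical shape).
    """
    seen = defaultdict(list)
    for label, original_id in records:
        seen[original_id].append(label)
    # Keep only identifiers that are shared by two or more records.
    return {oid: labels for oid, labels in seen.items() if len(labels) > 1}
```

Records that share a pre-hash value would receive the same DPLA identifier, so reporting them before hashing makes it easier to see which source values collide.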
We have also made the selection of the pre-hashed ID value an explicit part of the mapping process for every provider. There is no "default" behavior that can obfuscate what is happening at mapping time. The three methods that define how a provider's pre-hashed ID value is constructed are:
// ID minting functions
override def useProviderName(): Boolean
override def getProviderName(): String
override def originalId(implicit data: Document[NodeSeq]): ZeroToOne[String]

// ID minting functions for Tennessee
override def useProviderName(): Boolean = true
override def getProviderName(): String = "tn"
override def originalId(implicit data: Document[NodeSeq]): ZeroToOne[String] =
  extractString(data \ "header" \ "identifier")