Changing DPLA Identifiers
DPLA tries to ensure that the IDs we mint for your records do not change over time. This requires that the value we base each identifier on is itself stable. During the development of ingestion3 we discovered that, in the original ingestion1 system, the values we based our identifiers on were in some cases not stable and did in fact change over time.
History of DPLA identifiers in ingestion1
The ID minting service in ingestion1 is called select-id (https://github.com/dpla/ingestion/blob/develop/lib/akamod/select-id.py).
This service pulls values from the field specified by the prop parameter (the default field is 'handle', which is just an alias for the dc:identifier field). It uses the last absolute URI in that field; if no URI is present, it uses the first value. Next, every whitespace character in the base identifier is replaced with a double underscore (__). Finally, an optional prefix and a double dash (--) are prepended to the base identifier. This prefix is a way of "salting" identifiers to prevent collisions between data providers.
The last step is to take the final provider identifier and create an MD5 hash, which becomes the persistent DPLA identifier for the record.
Examples:
Provider | Use prefix? | Identifier field | pre-MD5 hash value | DPLA identifier |
---|---|---|---|---|
Illinois | Yes | dc:identifier | il--https://madison-historical.siue.edu/archive/files/original/79c4cf9b0da358e32fa7bab46563e79e.pdf | 02a5aa4975b941d340d14cb9ad4f7a37 |
PA Digital | No | OAI header identifier | oai:libcollab.temple.edu:dplapa:SLPa_biologicalfactst00bate | 000178f5b0d971292ca1f6539a9f3a9b |
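The minting steps described in this section can be sketched roughly as follows. This is a Scala rendition written for illustration, not the select-id.py source: the object and method names are hypothetical, and treating any string that parses with a URI scheme as an "absolute URI" is an assumption about how the original service made that decision.

```scala
import java.net.URI
import java.nio.charset.StandardCharsets
import java.security.MessageDigest
import scala.util.Try

object LegacyIdSketch {
  // Assumption: a value counts as an "absolute URI" if java.net.URI parses it
  // and finds a scheme. The real service may have used a different test.
  def isAbsoluteUri(value: String): Boolean =
    Try(new URI(value).isAbsolute).getOrElse(false)

  // Use the last absolute URI if there is one, otherwise the first value.
  def selectBaseId(values: Seq[String]): Option[String] =
    values.filter(isAbsoluteUri).lastOption.orElse(values.headOption)

  // Replace each whitespace character with "__", then prepend "<prefix>--" if a prefix is given.
  def preHashValue(values: Seq[String], prefix: Option[String]): Option[String] =
    selectBaseId(values).map { base =>
      val cleaned = base.replaceAll("\\s", "__")
      prefix.fold(cleaned)(p => s"$p--$cleaned")
    }

  // Lowercase hexadecimal MD5 digest of the UTF-8 bytes of the value.
  def md5Hex(value: String): String =
    MessageDigest.getInstance("MD5")
      .digest(value.getBytes(StandardCharsets.UTF_8))
      .map(b => "%02x".format(b))
      .mkString

  // The DPLA identifier is the MD5 hash of the pre-hash value.
  def mintId(values: Seq[String], prefix: Option[String]): Option[String] =
    preHashValue(values, prefix).map(md5Hex)
}
```

Under these assumptions, calling preHashValue with the values "local id" and "https://example.org/b" and the prefix "il" selects the URI and yields "il--https://example.org/b", which mintId would then hash.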
Problems with this approach
- The default source of local "persistent" identifiers for providers using either DC or QDC is the dc:identifier field, which was not designed to hold persistent identifiers
- The order of identifiers is significant and adding a new URI to the dc:identifier field will change the DPLA identifier
- Moving from http:// to https:// will change the DPLA identifier
- Changing a domain will change the DPLA identifier
These and other subtle changes can cause the DPLA identifiers to change without notice. Additionally, DPLA did not save the pre-hashed value so it is very difficult to reverse engineer what the DPLA identifier is derived from.
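To make the fragility concrete, the sketch below hashes a made-up pre-hash value before and after two of the changes listed above. The example.org values and the object name are hypothetical. The point is only that any change to the input string, however small, produces an unrelated MD5 digest.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

object IdDriftSketch {
  // Lowercase hexadecimal MD5 digest of the UTF-8 bytes of the value.
  def md5Hex(value: String): String =
    MessageDigest.getInstance("MD5")
      .digest(value.getBytes(StandardCharsets.UTF_8))
      .map(b => "%02x".format(b))
      .mkString

  // Hypothetical pre-hash value for one record, and the value the same record
  // would produce after a provider-side change.
  val original: String = "il--http://example.org/items/1"
  val variants: Map[String, String] = Map(
    "scheme change" -> "il--https://example.org/items/1",
    "domain change" -> "il--http://archive.example.org/items/1"
  )

  // Names of the changes that result in a different DPLA identifier.
  def changedIds: Set[String] =
    variants.collect {
      case (name, value) if md5Hex(value) != md5Hex(original) => name
    }.toSet

  def main(args: Array[String]): Unit =
    variants.foreach { case (name, value) =>
      println(s"$name: ${md5Hex(original)} -> ${md5Hex(value)}")
    }
}
```

Because only the hash was stored, nothing in the resulting identifier shows which of these changes occurred.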
Ingestion3 DPLA ID minting
We have sought to remedy many of these issues in our approach to DPLA ID minting in ingestion3. Because of this legacy, it is virtually impossible to guarantee persistent identifiers for all records in our corpus, but we can make some changes that make it less likely for DPLA identifiers to change going forward:
- Use the OAI header identifier instead of any value in the dc:identifier property
- If data is delivered by some other mechanism then the field which we use should be carefully vetted and the consequences of even minor changes well understood by all parties
- Check for uniqueness and report duplicates
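As an illustration of the uniqueness check, the sketch below groups records by their minted ID and reports any ID shared by more than one record. It is a minimal in-memory example, not the ingestion3 implementation. The Minted case class and the duplicates function are hypothetical names.

```scala
object DuplicateIdSketch {
  // Hypothetical pairing of a provider's original ID with the DPLA ID minted from it.
  final case class Minted(originalId: String, dplaId: String)

  // Map each DPLA ID that occurs more than once to the original IDs that produced it.
  def duplicates(records: Seq[Minted]): Map[String, Seq[String]] =
    records.groupBy(_.dplaId).collect {
      case (dplaId, group) if group.size > 1 => dplaId -> group.map(_.originalId)
    }

  def main(args: Array[String]): Unit = {
    val records = Seq(Minted("a", "1"), Minted("b", "2"), Minted("c", "1"))
    duplicates(records).foreach { case (dplaId, originals) =>
      println(s"duplicate DPLA ID $dplaId from: ${originals.mkString(", ")}")
    }
  }
}
```

A report of this kind lets a provider see which source records collide before the identifiers are published.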
We have also made the selection of the pre-hashed ID value an explicit part of the mapping process for every provider. There is no "default" behavior that can obfuscate what is happening at mapping time. The three methods that define how a provider's pre-hashed ID value is constructed are:
    // ID minting functions
    override def useProviderName(): Boolean
    override def getProviderName(): String
    override def originalId(implicit data: Document[NodeSeq]): ZeroToOne[String]

    // ID minting functions for Tennessee
    override def useProviderName(): Boolean = true
    override def getProviderName(): String = "tn"
    override def originalId(implicit data: Document[NodeSeq]): ZeroToOne[String] =
      extractString(data \ "header" \ "identifier")
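One way these three methods could combine into a pre-hashed ID value is sketched below. This is a simplified stand-in, not the ingestion3 code: a plain Map replaces Document[NodeSeq], Option replaces ZeroToOne, the "header.identifier" key is made up, and joining the provider name with a double dash is assumed to carry over from the ingestion1 convention described earlier.

```scala
object Ingestion3IdSketch {
  // Simplified stand-in for the mapping trait (types are assumptions, see above).
  trait IdMinting {
    def useProviderName(): Boolean
    def getProviderName(): String
    def originalId(record: Map[String, String]): Option[String]

    // Assumption: the provider name is prepended as "<name>--" when enabled.
    def preHashId(record: Map[String, String]): Option[String] =
      originalId(record).map { id =>
        if (useProviderName()) s"${getProviderName()}--$id" else id
      }
  }

  // Mirrors the Tennessee settings: provider name "tn", ID taken from the OAI header.
  object TennesseeSketch extends IdMinting {
    override def useProviderName(): Boolean = true
    override def getProviderName(): String = "tn"
    override def originalId(record: Map[String, String]): Option[String] =
      record.get("header.identifier")
  }
}
```

With these stand-ins, a record whose header identifier is "oai:example.org:123" would give the pre-hash value "tn--oai:example.org:123", and a record with no header identifier would give no value at all rather than silently falling back to another field.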
Making the switch
Implementing these changes can create some short-term headaches when the previous identifiers were based on values in dc:identifier or some other non-persistent value. In these cases, all of the DPLA identifiers will change and links to DPLA item pages may break. We will try to fix any broken internal links (Primary Source Sets, Exhibitions, etc.), but external links are outside our control. This is unfortunate, but the status quo of continuing to use a non-persistent identifier is just as fraught, and we cannot guarantee that the identifiers won't eventually change. Performing the switchover in this way gives us control to try to identify problems ahead of time and make the appropriate corrections quickly.