
DPLA tries to ensure that the IDs we mint for your records do not change over time. However, this requires that the value we base each identifier on also does not change. During the development of ingestion3 we discovered that, in some cases, the value we based our identifiers on in ingestion1 was not the most stable choice.


History of DPLA identifiers in ingestion1

The ID minting service in ingestion1 is called select-id (https://github.com/dpla/ingestion/blob/develop/lib/akamod/select-id.py).

This service pulls values from the field specified by the prop parameter (the default field is 'handle', which is just an alias for the dc:identifier field). It uses the last absolute URI; if no URI is present, it uses the first value in the specified field. Next, every whitespace character in the base identifier is replaced with a double underscore (__). Finally, an optional prefix and a double dash (--) are prepended to the base identifier. This prefix is a way of "salting" identifiers to prevent collisions between data providers.

The last step is to take the final provider identifier and create an MD5 hash of it, which becomes the persistent DPLA identifier for the record.
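The steps above can be sketched in Python as follows. This is a minimal illustration of the prose description, not the actual select-id code: the function names (`select_base_id`, `mint_dpla_id`), the definition of "absolute URI" (a scheme plus a host), and the example values are assumptions made for this sketch.

```python
import hashlib
import re
from typing import Optional, Sequence, Tuple
from urllib.parse import urlparse


def is_absolute_uri(value: str) -> bool:
    """Assumption: a value counts as an absolute URI if it has a scheme and a host."""
    parsed = urlparse(value)
    return bool(parsed.scheme and parsed.netloc)


def select_base_id(values: Sequence[str]) -> str:
    """Prefer the last absolute URI; otherwise fall back to the first value."""
    if not values:
        raise ValueError("no identifier values to choose from")
    uris = [v for v in values if is_absolute_uri(v)]
    return uris[-1] if uris else values[0]


def mint_dpla_id(values: Sequence[str], prefix: Optional[str] = None) -> Tuple[str, str]:
    """Return (pre-hash provider identifier, MD5 hex digest)."""
    base = select_base_id(values)
    # Replace every whitespace character with a double underscore
    base = re.sub(r"\s", "__", base)
    # Prepend the optional prefix and a double dash
    provider_id = f"{prefix}--{base}" if prefix else base
    return provider_id, hashlib.md5(provider_id.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # Hypothetical dc:identifier values for one record
    values = ["local id 7", "http://example.org/items/7"]
    provider_id, dpla_id = mint_dpla_id(values, prefix="il")
    print(provider_id)  # il--http://example.org/items/7
    print(dpla_id)      # 32-character hexadecimal MD5 digest
```

Returning the pre-hash value together with the digest makes it easy to see exactly which string was hashed.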

Examples:

Provider | Use prefix? | Identifier field | Pre-MD5 hash value | DPLA identifier
Illinois | Yes | dc:identifier | il--https://madison-historical.siue.edu/archive/files/original/79c4cf9b0da358e32fa7bab46563e79e.pdf | 02a5aa4975b941d340d14cb9ad4f7a37
PA Digital | No | OAI header identifier | oai:libcollab.temple.edu:dplapa:SLPa_biologicalfactst00bate | 000178f5b0d971292ca1f6539a9f3a9b

Problems with this approach

  • For providers using either DC or QDC, the default source of the local "persistent" identifier is the dc:identifier field, which was not designed for that purpose
  • The order of identifiers is significant: adding a new URI to the dc:identifier field will change the DPLA identifier
  • Moving from http:// to https:// will change the DPLA identifier
  • Changing a domain will change the DPLA identifier

These and other subtle changes can cause DPLA identifiers to change without notice. Additionally, DPLA did not save the pre-hashed value, so it is very difficult to reverse engineer what a given DPLA identifier was derived from.
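To make the fragility concrete, the sketch below hashes a few hypothetical pre-hash values (the example.org URLs and the `il` prefix are invented for illustration) and shows that a scheme or domain change alone yields a different MD5 digest. It also keeps the pre-hash value next to the digest, which is one way to avoid having to reverse engineer it later; the `MintedId` record is an assumption of this sketch, not something ingestion1 did.

```python
import hashlib
from typing import NamedTuple


class MintedId(NamedTuple):
    pre_hash: str  # the string the hash was computed from
    dpla_id: str   # MD5 hex digest of pre_hash


def mint(pre_hash: str) -> MintedId:
    """Hash a pre-hash value and keep both halves together."""
    digest = hashlib.md5(pre_hash.encode("utf-8")).hexdigest()
    return MintedId(pre_hash, digest)


old = mint("il--http://example.org/items/42")
new_scheme = mint("il--https://example.org/items/42")         # http -> https
new_domain = mint("il--http://archive.example.org/items/42")  # domain change

# Each variant produces a different DPLA identifier
for minted in (old, new_scheme, new_domain):
    print(minted.dpla_id, "<-", minted.pre_hash)
```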

Ingestion3 DPLA ID minting

We have sought to remedy many of these issues in our approach to DPLA ID minting in ingestion3. Because of this legacy, however, it is virtually impossible to guarantee persistent identifiers for all records in our corpus. We can nevertheless make some changes that make it less likely for DPLA identifiers to change going forward:

  • Use the OAI header identifier instead of any value in the dc:identifier property
  • If data is delivered by some other mechanism, carefully vet the field we use and make sure all parties understand the consequences of even minor changes
  • Check for uniqueness and report duplicates
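A uniqueness check of this kind can be sketched as follows. This is an illustration in Python rather than ingestion3's Scala code; `find_duplicate_ids` and the sample records are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def find_duplicate_ids(records: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Map each original ID that occurs more than once to the records using it.

    `records` is an iterable of (record_label, original_id) pairs.
    """
    by_id: Dict[str, List[str]] = defaultdict(list)
    for label, original_id in records:
        by_id[original_id].append(label)
    return {oid: labels for oid, labels in by_id.items() if len(labels) > 1}


# Hypothetical harvest: two records share the same OAI header identifier
records = [
    ("record-1", "oai:example.org:a"),
    ("record-2", "oai:example.org:b"),
    ("record-3", "oai:example.org:a"),
]

for original_id, labels in find_duplicate_ids(records).items():
    print(f"duplicate {original_id}: {', '.join(labels)}")
```

Running the check on the pre-hash values, before hashing, keeps the report readable, since it names the provider's own identifiers rather than MD5 digests.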

We have also made the selection of the pre-hashed ID value an explicit part of the mapping process for every provider. There is no "default" behavior that can obscure what is happening at mapping time. The three methods that define how a provider's pre-hashed ID value is constructed are:

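ingestion3 ID Minting

// ID minting functions
// Whether the provider name is used as a prefix on the original ID
override def useProviderName(): Boolean
// The provider short name used as that prefix
override def getProviderName(): String
// The provider-supplied value the DPLA ID is derived from
override def originalId(implicit data: Document[NodeSeq]): ZeroToOne[String]

ingestion3 ID Minting for TN
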
// ID minting functions for Tennessee
override def useProviderName(): Boolean = true

override def getProviderName(): String = "tn"

override def originalId(implicit data: Document[NodeSeq]): ZeroToOne[String] =
  extractString(data \ "header" \ "identifier")



