...

DPLA tries to ensure that the IDs we mint for your records will not change over time. However, this requires that the value we base our identifier on does not change over time. One issue we discovered during the development of ingestion3 was that, in some cases, the values we based our identifiers on in ingestion1 were not stable and did in fact change over time.

History of DPLA identifiers in ingestion1

...

The last step is to take the final provider identifier and create an MD5 hash, which will be the persistent DPLA identifier for this record.

Examples:

| Provider | Use prefix? | Identifier field | pre-MD5 hash value | DPLA identifier |
|---|---|---|---|---|
| Illinois | Yes | dc:identifier | il--https://madison-historical.siue.edu/archive/files/original/79c4cf9b0da358e32fa7bab46563e79e.pdf | 02a5aa4975b941d340d14cb9ad4f7a37 |
| PA Digital | No | OAI header identifier | oai:libcollab.temple.edu:dplapa:SLPa_biologicalfactst00bate | 000178f5b0d971292ca1f6539a9f3a9b |
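
The prefix-then-hash step described above can be sketched in a few lines of Scala. This is an illustration under stated assumptions, not the ingestion1 code: the object and method names (`IdMintingSketch`, `md5Hex`, `preHashValue`, `mintDplaId`) are hypothetical, the `--` separator is inferred from the Illinois row, and UTF-8 encoding of the pre-hash value is assumed.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

// Hypothetical sketch of prefix + MD5 minting; names are not from ingestion1.
object IdMintingSketch {

  // Hex-encoded MD5 digest of a string (UTF-8 encoding is an assumption).
  def md5Hex(value: String): String =
    MessageDigest.getInstance("MD5")
      .digest(value.getBytes(StandardCharsets.UTF_8))
      .map(b => f"${b & 0xff}%02x")
      .mkString

  // Build the pre-hash value. The "--" separator is inferred from the
  // Illinois example; providers without a prefix hash the bare identifier.
  def preHashValue(providerId: String, prefix: Option[String]): String =
    prefix.map(p => s"$p--$providerId").getOrElse(providerId)

  // The DPLA identifier is the MD5 hash of the pre-hash value.
  def mintDplaId(providerId: String, prefix: Option[String]): String =
    md5Hex(preHashValue(providerId, prefix))
}
```

Under these assumptions, a provider that uses a prefix would call `IdMintingSketch.mintDplaId(identifier, Some("il"))`, and one that does not would pass `None`.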


Problems with this approach

  • The default source of local "persistent" identifiers for providers using either DC or QDC is the dc:identifier field, which was not designed for that purpose
  • The order of identifiers is now significant and adding a new URI to the dc:identifier field will change the DPLA identifier
  • Moving from http:// to https:// will change the DPLA identifier
  • Changing a domain will change the DPLA identifier

These and other subtle changes can cause the DPLA identifiers to change without notice. Additionally, DPLA did not save the pre-hashed value so it is very difficult to reverse engineer what the DPLA identifier is derived from.
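
To make these failure modes concrete, here is a small Scala sketch showing how minor changes to the source value produce unrelated hashes. Everything in it is hypothetical: the `example.org` URLs are placeholders rather than provider data, and picking the first dc:identifier value is an assumed selection rule used only to show why ordering matters.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

// Hypothetical illustration of identifier drift; not DPLA code.
object IdDriftSketch {

  // Hex-encoded MD5 digest of a string (UTF-8 encoding is an assumption).
  def md5Hex(value: String): String =
    MessageDigest.getInstance("MD5")
      .digest(value.getBytes(StandardCharsets.UTF_8))
      .map(b => f"${b & 0xff}%02x")
      .mkString

  // Placeholder pre-hash values for one record at different points in time.
  val original     = "il--http://example.org/items/42"
  val httpsSwitch  = "il--https://example.org/items/42"
  val domainChange = "il--http://archive.example.org/items/42"

  // Assumed rule: the first dc:identifier value is the one that gets hashed.
  def firstIdentifier(ids: Seq[String]): Option[String] = ids.headOption

  def main(args: Array[String]): Unit = {
    // Each variant yields a different 32-character hash.
    Seq(original, httpsSwitch, domainChange)
      .foreach(v => println(s"${md5Hex(v)}  $v"))

    // Prepending a new URI changes which value is selected, and so the hash.
    val before = Seq("http://example.org/items/42")
    val after  = "http://example.org/new-uri" +: before
    println(firstIdentifier(before) == firstIdentifier(after))
  }
}
```

Because a hash cannot be reversed, keeping the pre-hash value alongside the minted identifier is what would make this kind of drift traceable after the fact.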

Ingestion3 DPLA ID minting

We have sought to remedy many of these issues in our approach to DPLA ID minting in ingestion3. Because of this legacy, however, it is virtually impossible to guarantee persistent identifiers for all records in our corpus. What we can do is make some changes that will make it less likely for DPLA identifiers to change going forward.

...

ingestion3 ID minting for TN

```scala
// ID minting functions for Tennessee
override def useProviderName(): Boolean = true

override def getProviderName(): String = "tn"

override def originalId(implicit data: Document[NodeSeq]): ZeroToOne[String] =
  extractString(data \ "header" \ "identifier")
```


Making the switch

Implementing these changes can create some short-term headaches when the previous identifiers were based on values in dc:identifier or some other non-persistent value. In these cases, all of the DPLA identifiers will change and links to DPLA item pages may be broken. We will try to fix any broken internal links (Primary Source Sets, Exhibitions, etc.), but external links are outside our control. This is unfortunate, but the status quo of continuing to use a non-persistent identifier is just as fraught, and we cannot guarantee that the identifiers won't eventually change. Performing the switchover in this way gives us control to try to identify problems ahead of time and make the appropriate corrections quickly.