Metadata Provenance

KriKri, LDP, & RDF Sources

Two kinds of resources form the core of the ingestion process:

  • The OriginalRecord, representing the metadata content as it was harvested.
  • The Aggregation, an RDF graph reflecting a DPLA Metadata Application Profile record.

KriKri manages these resources using the HTTP patterns defined by the Linked Data Platform (LDP). Each OriginalRecord corresponds to an LDP NonRDFSource (LDP-NR), while an Aggregation is an RDFSource (LDP-RS). Additionally, each OriginalRecord has an associated LDP-RS containing an RDF description (OriginalRecordMetadata), mainly for tracking the provenance of harvested records. Provenance for both is handled using Dublin Core Terms and a version of PROV-O modified to handle stateful resources.

Each OriginalRecordMetadata LDP-RS is linked to the OriginalRecord it describes by an HTTP Link header with the describedby relation, as specified in LDP 5.2.3.12, and by a dcterms:hasFormat statement added by the server.
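A client can locate the OriginalRecordMetadata description by reading the Link header returned for an OriginalRecord. The following is a minimal sketch in Python, not KriKri code: the function name find_link_targets and the sample header are illustrative assumptions, and the parser only handles simple link-values (no commas or semicolons inside quoted parameters).

```python
import re


def find_link_targets(link_header, rel):
    """Return the target URIs in an HTTP Link header whose rel includes `rel`."""
    targets = []
    # Each link-value has the form: <uri>; param="value"; ...
    for match in re.finditer(r'<([^>]*)>((?:\s*;\s*[^,;]+)*)', link_header):
        uri, params = match.group(1), match.group(2)
        rel_match = re.search(r'\brel\s*=\s*"?([^";]+)"?', params)
        # A rel parameter may hold several space-separated relation types.
        if rel_match and rel in rel_match.group(1).split():
            targets.append(uri)
    return targets


# Hypothetical header, as a server might return it for an LDP-NR.
SAMPLE_HEADER = (
    '<http://www.w3.org/ns/ldp#Resource>; rel="type", '
    '<http://localhost:8983/marmotta/ldp/original_record/a>; rel="describedby"'
)
```

Calling find_link_targets(SAMPLE_HEADER, "describedby") yields the URI of the LDP-RS that carries the provenance statements.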

Server Managed DC Terms Statements

The LDP server (Marmotta) is expected to manage two simple datetime properties on both OriginalRecord and Aggregation resources: dcterms:created and dcterms:modified. The created timestamp is set when the resource is first created and should never be updated; the modified timestamp is updated with each subsequent change.
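Because created is fixed and modified only moves forward, a resource's modified value should never precede its created value. A small consistency check, sketched here in Python under the assumption that both values arrive as xsd:dateTime strings with an explicit offset (the function name is hypothetical):

```python
from datetime import datetime


def timestamps_consistent(created, modified):
    """Return True when dcterms:modified is not earlier than dcterms:created."""
    created_at = datetime.fromisoformat(created)
    modified_at = datetime.fromisoformat(modified)
    # Both values carry UTC offsets, so the comparison is timezone-aware.
    return created_at <= modified_at
```

The comparison accounts for differing offsets, which matters when a resource is created and modified on either side of a daylight saving change.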

DCTerms provenance for Aggregations
parent:02036db5f2cec63fcb561a774ad2b6a8 a ldp:Resource , ldp:RDFSource ,
                                          ldp:Container , ldp:BasicContainer ;
    ldp:interactionModel ldp:Container ;
    dcterms:created "2015-07-17T11:21:31.000-04:00"^^xsd:dateTime ;
    dcterms:modified "2015-12-09T13:24:33.000-05:00"^^xsd:dateTime ;
    a <http://www.openarchives.org/ore/terms/Aggregation> .
DCTerms provenance for OriginalRecords
parent:02036db5f2cec63fcb561a774ad2b6a8 a ldp:Resource , ldp:RDFSource ,
                                          ldp:Container , ldp:BasicContainer ;
    dcterms:created "2015-07-13T14:59:23.000-04:00"^^xsd:dateTime ;
    dcterms:modified "2015-12-07T17:09:06.000-05:00"^^xsd:dateTime ;
    dcterms:hasFormat parent:02036db5f2cec63fcb561a774ad2b6a8.xml .

The PROV Model

More advanced provenance use cases in KriKri are supported by PROV-O, the RDF representation of the PROV Data Model.

PROV Key Concepts

This model maps directly to KriKri's internal dataflow. In KriKri:

  • Entities (resources) are records, as described above.
  • Agents (Krikri::SoftwareAgents, as prov:SoftwareAgent) do the work of transforming resources.
  • Activities (Krikri::Activity) are time-bound occurrences in which Agents process Entities.
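The relationships among the three concepts can be sketched as plain data types. This is an illustration in Python only; the class and field names are assumptions and do not mirror KriKri's Ruby classes.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Agent:
    """Software that does the work, e.g. a harvester or mapper."""
    name: str


@dataclass
class Activity:
    """A time-bound run of one Agent."""
    uri: str
    agent: Agent
    started_at: datetime
    ended_at: Optional[datetime] = None  # None while the run is in progress


@dataclass
class Entity:
    """A record (OriginalRecord or Aggregation) produced by an Activity."""
    uri: str
    generated_by: Activity
```

An Entity points at the Activity that generated it, and the Activity in turn names its Agent, so the Agent responsible for any record is reachable in two steps.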

Provenance in KriKri

KriKri tracks parts of the PROV Data Model directly in LDP-RSs. To account for mutability of resources in LDP, the representations diverge from PROV in several ways. This section describes the provenance implementation.

Basics 

An RDFSource can be saved with provenance. When an LDP-RS is created in this way, a triple is added that relates it to the generating Activity.

Aggregation with PROV-O
<http://localhost:8983/marmotta/ldp/items/agg> a <http://www.w3.org/ns/ldp#Resource>,
     <http://www.w3.org/ns/ldp#RDFSource>,
     <http://www.w3.org/ns/ldp#Container>,
     <http://www.w3.org/ns/ldp#BasicContainer>,
     <http://www.openarchives.org/ore/terms/Aggregation>;
   <http://purl.org/dc/terms/created> "2016-05-16T15:20:27.000-07:00"^^<http://www.w3.org/2001/XMLSchema#dateTime>;
   <http://purl.org/dc/terms/modified> "2016-05-16T15:20:27.000-07:00"^^<http://www.w3.org/2001/XMLSchema#dateTime>;
   <http://www.w3.org/ns/ldp#interactionModel> <http://www.w3.org/ns/ldp#Container>;
   <http://www.w3.org/ns/prov#wasGeneratedBy> <http://localhost:8983/marmotta/ldp/activity/1> .

Saving an OriginalRecord with provenance has similar results, with the provenance triple added to the OriginalRecordMetadata RDF description.

OriginalRecord with PROV-O
<http://localhost:8983/marmotta/ldp/original_record/a> <http://www.w3.org/ns/prov#wasGeneratedBy> <http://localhost:8983/marmotta/ldp/activity/1> .
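The statement added in both cases has the same shape: resource, prov:wasGeneratedBy, activity. As a rough illustration (Python, with a hypothetical helper name; KriKri itself builds the statement in Ruby), the triple can be serialized as a single N-Triples line:

```python
PROV_WAS_GENERATED_BY = "http://www.w3.org/ns/prov#wasGeneratedBy"


def generation_triple(resource_uri, activity_uri):
    """Serialize a prov:wasGeneratedBy statement as one N-Triples line."""
    return "<{}> <{}> <{}> .".format(
        resource_uri, PROV_WAS_GENERATED_BY, activity_uri
    )
```

Passing the OriginalRecordMetadata and Activity URIs from the example above reproduces the line shown there.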

Activities are tracked within a SQL database (as ActiveRecord objects). They can be rebuilt to retrieve the activity start and end times, as well as the Agent that was associated with the activity.

This pattern allows resources to be queried by their generating Activity, regardless of LDP Resource type. Serialization of PROV-O is limited by the use of ActiveRecord and by the lack of URIs for SoftwareAgents.
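Since every resource saved with provenance carries the same predicate, one query can find everything an Activity generated. The sketch below builds such a SPARQL query in Python; the function name is hypothetical, and whether KriKri issues this exact query is an assumption.

```python
import re


def resources_generated_by_query(activity_uri):
    """Build a SPARQL query selecting every resource generated by an Activity."""
    # Reject characters that are not allowed inside a SPARQL IRI reference.
    if re.search(r'[\s<>"{}|\\^`]', activity_uri):
        raise ValueError("not a valid IRI: %r" % activity_uri)
    return (
        "PREFIX prov: <http://www.w3.org/ns/prov#>\n"
        "SELECT ?resource WHERE {\n"
        "  ?resource prov:wasGeneratedBy <%s> .\n"
        "}" % activity_uri
    )
```

The query does not constrain the resource type, so Aggregations and OriginalRecordMetadata descriptions are returned together.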

Revision

Updates to LDP Resources, when saved with provenance, are recorded with the internal http://dp.la/about/map/wasRevisedBy predicate. This diverges from the PROV Data Model, which would require a distinct Entity for each state of a resource.
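In effect, a single resource URI accumulates provenance statements over its lifetime instead of spawning a new Entity per state. A minimal sketch in Python, assuming the object of wasRevisedBy is the revising Activity's URI (by analogy with prov:wasGeneratedBy); the function name is hypothetical:

```python
PROV_WAS_GENERATED_BY = "http://www.w3.org/ns/prov#wasGeneratedBy"
DPLA_WAS_REVISED_BY = "http://dp.la/about/map/wasRevisedBy"


def provenance_lines(resource_uri, generated_by, revised_by=()):
    """List the provenance statements for one resource as N-Triples lines."""
    template = "<{}> <{}> <{}> ."
    lines = [template.format(resource_uri, PROV_WAS_GENERATED_BY, generated_by)]
    # Assumption: each update adds one statement naming the revising Activity.
    for activity_uri in revised_by:
        lines.append(template.format(resource_uri, DPLA_WAS_REVISED_BY, activity_uri))
    return lines
```

Every line shares the same subject, which is the divergence from PROV described above.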

 

LDP Versioning

Marmotta supports a limited form of versioning, using Memento. However, versions are not centered on LDP resources, and are therefore disconnected from our provenance model. Methods of bridging this gap to take advantage of versioning in the context of provenance are under consideration for future development.

Related LDP versioning work is taking place in the context of ongoing Fedora API Specification discussions, in which we are participating.

Invalidation

 

Replacement