Metadata Provenance
KriKri, LDP, & RDF Sources
Two kinds of resources form the core of the ingestion process:
- The OriginalRecord, representing the metadata content as it was harvested.
- The Aggregation, an RDF graph reflecting a DPLA Metadata Application Profile record.
KriKri manages these resources using the HTTP patterns defined by the Linked Data Platform (LDP). Each OriginalRecord corresponds to an LDP NonRDFSource (LDP-NR), while an Aggregation is an RDFSource (LDP-RS). Additionally, each OriginalRecord has an associated LDP-RS that contains an RDF description (OriginalRecordMetadata), mainly for the purpose of tracking provenance for the harvested records. Provenance for both is handled through the use of Dublin Core Terms and a version of PROV-O modified to handle stateful resources.
OriginalRecordMetadata LDP-RSs are linked to the OriginalRecord they describe using the describedby HTTP Link header as specified in LDP 5.2.3.12, and by a dcterms:hasFormat statement added by the server.
Server Managed DC Terms Statements
The LDP server (Marmotta) is expected to manage two simple datetime properties on both OriginalRecord and Aggregation resources: dcterms:created and dcterms:modified. The created timestamp is added when the resource is first created and should never be updated; the modified timestamp is updated with each subsequent change.
The PROV Model
More advanced provenance use cases in KriKri are supported by PROV-O, the RDF representation of the PROV Data Model.
This model maps directly to KriKri's internal dataflow. In KriKri:
- Entities (resources) are records, as described above.
- Agents (Krikri::SoftwareAgents, as prov:SoftwareAgent) do the work of transforming resources.
- Activities (Krikri::Activity) are time-bound occurrences that handle the processing of Entities via Agents.
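As a rough illustration of this mapping, the three PROV roles can be modeled with plain Ruby structs. This is a sketch only: the struct names, their fields, and the example URI are hypothetical and are not KriKri's actual classes or API.

```ruby
require 'time'

# Hypothetical stand-ins for the three PROV roles; not KriKri's real classes.
SketchAgent    = Struct.new(:name)                 # plays the prov:SoftwareAgent role
SketchEntity   = Struct.new(:uri, :generated_by)   # a record, in the prov:Entity role
SketchActivity = Struct.new(:id, :agent, :started_at, :ended_at) do
  # An Activity is time-bound: it counts as finished once it has an end time.
  def ended?
    !ended_at.nil?
  end
end

mapper   = SketchAgent.new('ExampleMapper')
activity = SketchActivity.new(1, mapper, Time.parse('2015-01-01T00:00:00Z'), nil)

# The Activity processes an Entity via its Agent.
record = SketchEntity.new('http://example.org/ldp/items/123', activity)

# Closing the Activity records when the work finished.
activity.ended_at = Time.parse('2015-01-01T00:05:00Z')

puts "#{record.uri} was processed by #{record.generated_by.agent.name}"
```

Keeping the Agent on the Activity, rather than on the Entity, mirrors the description above: an Entity points at the Activity that handled it, and the Activity knows which Agent did the work.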
Provenance in KriKri
KriKri tracks parts of the PROV Data Model directly in LDP-RSs. To account for mutability of resources in LDP, the representations diverge from PROV in several ways. This section describes the provenance implementation.
Basics
RDFSource allows saving with provenance. When an LDP-RS is created in this way, a triple is added representing the relationship between it and the generating Activity.
Saving an OriginalRecord with provenance has a similar result, with the provenance triple added to the OriginalRecordMetadata RDF description.
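A minimal sketch of what such a provenance statement might look like when serialized as N-Triples. The text above does not name the predicate, so this assumes prov:wasGeneratedBy, the standard PROV-O property for relating an entity to the activity that produced it. The helper name and both URIs are invented for illustration.

```ruby
# Assumed predicate: PROV-O's wasGeneratedBy. The text above does not name it.
PROV_WAS_GENERATED_BY = 'http://www.w3.org/ns/prov#wasGeneratedBy'

# Build one N-Triples statement linking a resource to an activity.
# (Hypothetical helper; assumes the URIs need no escaping.)
def provenance_triple(resource_uri, predicate_uri, activity_uri)
  "<#{resource_uri}> <#{predicate_uri}> <#{activity_uri}> ."
end

# Hypothetical URIs for an Aggregation and the Activity that produced it.
puts provenance_triple('http://example.org/ldp/items/123',
                       PROV_WAS_GENERATED_BY,
                       'http://example.org/activities/1')
```

For an OriginalRecord, the same statement would sit in the OriginalRecordMetadata description rather than in the non-RDF record itself.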
Activities are tracked within a SQL database (as ActiveRecord objects). They can be rebuilt to retrieve the activity start and end times, as well as the Agent that was associated with the activity.
This pattern allows resources to be queried by the activity that created them, across LDP Resource types. Serialization of PROV-O is limited by the use of ActiveRecord and by the lack of URIs for SoftwareAgents.
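The query pattern can be sketched over an in-memory list of statements. This is only an illustration: it again assumes prov:wasGeneratedBy as the provenance predicate, and the sample URIs and the helper name are hypothetical. A real deployment would query the LDP server's triplestore instead.

```ruby
# Assumed predicate (not named in the text): PROV-O's wasGeneratedBy.
GENERATED_BY = 'http://www.w3.org/ns/prov#wasGeneratedBy'

# Hypothetical [subject, predicate, object] statements drawn from both an
# Aggregation and two OriginalRecordMetadata descriptions.
SAMPLE_TRIPLES = [
  ['http://example.org/ldp/items/123',        GENERATED_BY, 'http://example.org/activities/2'],
  ['http://example.org/ldp/original/abc.rdf', GENERATED_BY, 'http://example.org/activities/1'],
  ['http://example.org/ldp/original/def.rdf', GENERATED_BY, 'http://example.org/activities/1']
].freeze

# Return the subjects generated by the given activity, whatever their LDP type.
def generated_by(triples, activity_uri)
  triples.select { |_subj, pred, obj| pred == GENERATED_BY && obj == activity_uri }
         .map(&:first)
end

puts generated_by(SAMPLE_TRIPLES, 'http://example.org/activities/1')
```

Because the same kind of statement is attached to every LDP-RS, one filter on the activity URI covers Aggregations and OriginalRecordMetadata descriptions alike.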
Revision
Updates to LDP Resources, when saved with provenance, are marked by the internal http://dp.la/about/map/wasRevisedBy predicate. This diverges from the PROV Data Model, which would require a unique resource for each state.
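To make the divergence concrete, the sketch below models a single mutable resource that accumulates one revision statement per updating activity, instead of minting a new entity for each state. The wasRevisedBy URI is the one given above; the class, the resource URI, and the activity URIs are hypothetical.

```ruby
WAS_REVISED_BY = 'http://dp.la/about/map/wasRevisedBy'

# Hypothetical in-memory stand-in for an LDP-RS: one URI, many statements.
class SketchResource
  attr_reader :uri, :statements

  def initialize(uri)
    @uri = uri
    @statements = []
  end

  # Record that an activity revised this resource. The subject is the same
  # URI on every update; no per-state resource is created.
  def revise_with(activity_uri)
    statement = [uri, WAS_REVISED_BY, activity_uri]
    @statements << statement unless @statements.include?(statement)
    self
  end

  # URIs of the activities that have revised this resource, in order added.
  def revising_activities
    statements.select { |_subj, pred, _obj| pred == WAS_REVISED_BY }.map(&:last)
  end
end

item = SketchResource.new('http://example.org/ldp/items/123')
item.revise_with('http://example.org/activities/2')
    .revise_with('http://example.org/activities/3')
puts item.revising_activities.inspect
```

The trade-off is visible in the data: the statements say which activities touched the resource, but not what the resource looked like after each one.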
LDP Versioning
Marmotta supports a limited form of versioning, using Memento. However, versions are not centered on LDP resources, and are therefore disconnected from our provenance model. Methods of bridging this gap to take advantage of versioning in the context of provenance are under consideration for future development.
Related LDP versioning work is taking place in the context of ongoing Fedora API Specification discussions, in which we are participating.
Invalidation