View Source

DPLA manages content over LDP using three general processes: a harvest, a mapping, and a series of enrichments. Each of these processes can be characterized as a predictable series of REST interactions with given resources. These jobs are run in large batches through a queuing system. A batch typically handles 50,000 - 300,000 resources sequentially; we have avoided parallelizing using smaller batch sizes due to the performance issues described throughout the Marmotta section.

The result of the processes described below is that for each "record", we have a single LDP-RS, a corresponding LDP-NR, and its describedBy resource. The bulk of the issues we describe in this wiki have become manifest as we have scaled past (roughly) 1,000,000 "records"/ore:Aggregations; i.e. 2 million LDP-RSs and 1 million LDP-NRs. We are aiming to push to 11 million as soon as possible, and anticipate the need to scale into the 100s of millions in the next several years.

Harvest

A harvest retrieves raw metadata, usually in an XML or JSON format, from an upstream source. The process identifies individual "records" representing specific items, and stores them in their original format an an LDP NonRDFSource. A URI slug is generated for each record from unique elements of the source data (usually a formal identifier field), and the LDP-NR is saved by an HTTP PUT. Some metadata expressing provenance is added to the associated (describedBy) LDP RDFSource created by the server. A harvest may write new LDP-NRs and/or entirely replace the persistent state of existing resources.

In terms of specific interactions, this looks like:

An initial HEAD request, checking if the resource exists.
1. If that request returns 2xx, a GET request loading the resource's current state
A PUT request with the new content (and appropriate content-type) is sent
1. If the resource already exists, the request is sent to its location, with extension
2. If the resource does not exist, the request is sent to a base location, adjusting for Marmotta's LDP-NR URI behavior
A GET request is sent to the associated LDP-RS, retrieving its current state (server properties have been updated by the previous PUT).
An additional PUT request is sent to the LDP-RS updating provenance for the LDP-NR.

These processes are the most predictable, in our experience, and we find they scale reasonably well through parallelization.

Mapping

A mapping transforms metadata stored in an LDP-NR to RDF conforming to a DPLA Metadata Application Profile. This transformation is done on a one-to-one basis. The URI slug used for the LDP-NR is shared by the LDP-RS that contains the mapped RDF Graph. Each "mapped record" is represented as a resource of class ore:Aggregation, and results in a single LDP-RS, typically including on the order of 50 triples including a handful of blank nodes (~5-15) .

Prior to starting a mapping process, we run a SPARQL query to identify the LDP-NRs that will be processed during the batch. Then, for each LDP-NR:

A HEAD request is sent to the LDP-NR verifying that the resource exists
A GET request loads the LDP-NR's current state; A process generates an RDF graph from the LDP-NR contents and...
A HEAD request checks for the existence of the LDP-RS (this is needed to generate the correct provenance triples)
A PUT request fully replaces the current state of the LDP-RS.

These processes have been moderately predictable, and tolerated some parallelization earlier on. Performance has declined as the overall triple count (size of the database) has grown. We have experienced significantly degraded performance, but haven't been able to isolate the conditions that cause this.

Enrichment

An enrichment transforms an existing LDP-RS by adding, replacing, or removing some of its triples. An enrichment process may constitute a small change (e.g. literal data-typing), a large change (e.g. the transformation of literals into resources with structured data), or (most commonly) a series of large and small changes handled together.

Prior to starting a mapping process, we run a SPARQL query to identify the LDP-RSs that will be processed during the batch. Then, for each LDP-RS:

A HEAD request is sent to the LDP-RS verifying that the resource exists
A GET request loads the LDP-RSs current state; a process or sequence of processes alters the retrieved graph, and:
A PUT request updates the entire persistent state of the LDP-RS to the resulting Graph

With a small triple count, we found that these processes ran at a similar speed, and had similar performance characteristics, to those described in the mapping section. As the dataset has scaled, the base cost of these operations has increased. Also, these processes have become subject to the database locking issues described in the other Marmotta subsections.