DPLA manages content over LDP using three general processes: a harvest, a mapping, and a series of enrichments. Each of these processes can be characterized as a predictable series of REST interactions with given resources. These jobs are run in large batches through a queuing system. A batch typically handles 50,000-300,000 resources sequentially; we have avoided parallelizing with smaller batch sizes because of the performance issues described throughout the Marmotta section.
The result of the processes described below is that for each "record", we have a single LDP-RS, a corresponding LDP-NR, and its describedBy resource. The bulk of the issues we describe in this wiki have become manifest as we have scaled past (roughly) 1,000,000 "records"/ore:Aggregations; i.e. 2 million LDP-RSs and 1 million LDP-NRs. We are aiming to push to 11 million as soon as possible, and anticipate the need to scale into the 100s of millions in the next several years.
A harvest retrieves raw metadata, usually in an XML or JSON format, from an upstream source. The process identifies individual "records" representing specific items and stores each in its original format as an LDP NonRDFSource. A URI slug is generated for each record from unique elements of the source data (usually a formal identifier field), and the LDP-NR is saved by an HTTP PUT. Some metadata expressing provenance is added to the associated (describedBy) LDP RDFSource created by the server. A harvest may write new LDP-NRs and/or entirely replace the persistent state of existing resources.
In terms of specific interactions, this looks like:
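As a rough sketch of that sequence, the code below builds the two requests issued per harvested record. The base URI, the hash-based slug scheme, the describedBy URI pattern, and the provenance predicate are all assumptions for illustration, not our actual configuration:

```python
import hashlib

BASE = "http://marmotta.example.org/ldp/harvest"  # hypothetical container base


def slug_for(record_id: str) -> str:
    """Derive a stable, URI-safe slug from the source record's identifier.
    (Our real slugs come from the provider's formal identifier field;
    hashing is just one way to make an arbitrary identifier URI-safe.)"""
    return hashlib.sha1(record_id.encode("utf-8")).hexdigest()


def harvest_requests(record_id: str, raw_body: str):
    """Return the REST interactions for one harvested record:
    1. PUT the raw metadata as an LDP-NR at the slug URI (creates or
       wholly replaces the resource).
    2. PATCH provenance triples onto the server-created describedBy LDP-RS
       (the "/describedBy" URI pattern here is an assumption)."""
    nr_uri = f"{BASE}/{slug_for(record_id)}"
    provenance = ('INSERT DATA { <> <http://purl.org/dc/terms/provenance> '
                  '"example-batch" }')  # hypothetical batch label
    return [
        ("PUT", nr_uri, {"Content-Type": "application/xml"}, raw_body),
        ("PATCH", f"{nr_uri}/describedBy",
         {"Content-Type": "application/sparql-update"}, provenance),
    ]
```

The PUT-then-PATCH shape matches the description above: the NR carries the original bytes, while the server-managed describedBy RS accumulates our provenance statements.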
These processes are the most predictable, in our experience, and we find they scale reasonably well through parallelization.
A mapping transforms metadata stored in an LDP-NR to RDF conforming to a DPLA Metadata Application Profile. This transformation is done on a one-to-one basis. The URI slug used for the LDP-NR is shared by the LDP-RS that contains the mapped RDF Graph. Each "mapped record" is represented as a resource of class ore:Aggregation, and results in a single LDP-RS, typically containing on the order of 50 triples, including a handful of blank nodes (~5-15).
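To illustrate the shape of a mapped record, the sketch below builds the triples for one ore:Aggregation with blank nodes for structured values. The specific properties and the LDP-RS URI pattern are illustrative assumptions; the DPLA Metadata Application Profile defines the real vocabulary:

```python
# Triples are modeled as plain (subject, predicate, object) tuples;
# "_:" prefixes mark blank nodes.
ORE = "http://www.openarchives.org/ore/terms/"
DC = "http://purl.org/dc/terms/"
EDM = "http://www.europeana.eu/schemas/edm/"


def map_record(slug: str, source: dict) -> list:
    """Produce the graph for one mapped record: an ore:Aggregation whose
    aggregated resource is a blank node, with one further blank node per
    subject heading. Property choices here are a simplified sketch."""
    agg = f"http://dp.la/ldp/mapped/{slug}"  # hypothetical LDP-RS URI
    cho = "_:sourceResource"                 # blank node for the described item
    triples = [
        (agg, "rdf:type", ORE + "Aggregation"),
        (agg, EDM + "aggregatedCHO", cho),
        (cho, DC + "title", source.get("title", "")),
    ]
    for i, heading in enumerate(source.get("subjects", [])):
        bnode = f"_:subject{i}"              # one blank node per heading
        triples.append((cho, DC + "subject", bnode))
        triples.append((bnode, "rdfs:label", heading))
    return triples
```

A real mapped record carries far more properties (hence ~50 triples), but the aggregation-plus-blank-nodes structure is the same.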
Prior to starting a mapping process, we run a SPARQL query to identify the LDP-NRs that will be processed during the batch. Then, for each LDP-NR:
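A sketch of that selection step and the per-NR interactions follows. The query's predicates (how we distinguish unmapped LDP-NRs) and the slug-sharing URI convention are assumptions for illustration:

```python
# Hypothetical batch-selection query: find LDP-NRs from a given harvest
# that do not yet have a corresponding mapped LDP-RS.
SELECT_UNMAPPED_NRS = """
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX ore: <http://www.openarchives.org/ore/terms/>
SELECT ?nr WHERE {
  ?nr dcterms:provenance ?batch .
  FILTER NOT EXISTS { ?agg ore:aggregates ?nr }
}
"""


def mapping_requests(nr_uri: str, mapped_turtle: str):
    """For one LDP-NR: GET the raw metadata, run the mapping in memory,
    then PUT the resulting graph to the LDP-RS that shares the NR's slug.
    The '/mapped/' URI convention here is an assumption."""
    slug = nr_uri.rstrip("/").rsplit("/", 1)[-1]
    rs_uri = f"http://dp.la/ldp/mapped/{slug}"
    return [
        ("GET", nr_uri, {"Accept": "application/xml"}),
        ("PUT", rs_uri, {"Content-Type": "text/turtle"}, mapped_turtle),
    ]
```

Each record thus costs one SPARQL SELECT (amortized over the batch) plus a GET and a PUT.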
These processes have been moderately predictable, and tolerated some parallelization early on. Performance has declined as the overall triple count (i.e. the size of the database) has grown; we have experienced significantly degraded performance, but have not been able to isolate the conditions that cause it.
An enrichment transforms an existing LDP-RS by adding, replacing, or removing some of its triples. An enrichment process may constitute a small change (e.g. literal data-typing), a large change (e.g. the transformation of literals into resources with structured data), or (most commonly) a series of large and small changes handled together.
Prior to starting an enrichment process, we run a SPARQL query to identify the LDP-RSs that will be processed during the batch. Then, for each LDP-RS:
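The per-resource step is a read-modify-write cycle, sketched below. The content-negotiation headers and the use of If-Match are assumptions about configuration, not a record of our exact requests:

```python
def enrichment_requests(rs_uri: str, enriched_turtle: str):
    """For one LDP-RS: GET the current graph, apply the enrichment chain
    in memory, then PUT the new graph back, entirely replacing the old
    state. The If-Match header (assumed here) guards against concurrent
    writers clobbering each other."""
    return [
        ("GET", rs_uri, {"Accept": "text/turtle"}),
        ("PUT", rs_uri,
         {"Content-Type": "text/turtle", "If-Match": "*"},
         enriched_turtle),
    ]
```

Because every enrichment rewrites the whole graph, each record costs a GET and a PUT against the same resource in quick succession, which is where the locking behavior described elsewhere in this wiki bites.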
With a small triple count, we found that these processes ran at a similar speed, and had similar performance characteristics, to those described in the mapping section. As the dataset has scaled, the base cost of these operations has increased. Also, these processes have become subject to the database locking issues described in the other Marmotta subsections.