KriKri Datastore Requirements
Goals
- Establish requirements for a long-term, scalable and performant backend for KriKri.
Assumptions
- The datastore needs to handle schema-less RDF.
- KriKri mapping and enrichments are not (and should not be) tightly bound to DPLA MAP.
- KriKri needs to serialize RDF and to run SPARQL queries against it.
Out of Scope
- Indexing
- Reconsideration of LDP as interaction API
Requirements
Requirement | Importance | Notes |
---|---|---|
1. Linked Data Platform 1.0 compliance | Must | Various implementations are possible: |
1.1. LDP-NR support | Must | We use LDP-NRs for OriginalRecords. Ideally, a system should create a "describedby" LDP-RS for LDP-NRs for provenance and technical metadata. |
1.2. POST, PUT | Must | |
1.3. PATCH | Medium-Low | A good patch format would improve certain parts of our workflow, especially if used with RDF.rb 2.0's Transactions. Multiple PATCH formats currently compete: Marmotta supports RDF Patch, while Fedora 4 supports SPARQL Update as a patch format. The LDP WG declined to recommend LD Patch. Ruby RDF has some support for LD Patch documents. |
1.4. Server Managed Triples | Medium | Marmotta currently handles created/deleted timestamps as server managed triples. A customizable server-managed-triples system that would allow us to (e.g.) handle PROV is a lower-level desideratum. |
2. SPARQL | Must | |
2.1. Query across LDPC and LDP-RS | Must | This is required to support provider-based queries. These queries will normally have large result sets. |
2.2. SPARQL for QA | Medium-Low | SPARQL performance sufficient to do some amount of QA analytics reliably inline on the production server. This is a relatively low priority because we will almost certainly need to maintain secondary indexes or analytics systems for some of our use cases. |
Not Required | | |
3. Bulk Export | High | |
4. Scale Up | High | TK: LDP GET, POST, PUT/PATCH should be minimally affected by overall dataset size. Queries should scale performantly to [...] Aggregations. |
4.1. Scale Out | | A single-server system that can scale to our needs in 4 is considered acceptable. Benefits of horizontal scalability are to be evaluated on a system-by-system basis. |
5. Throughput | High | We require highly responsive performance under heavy concurrent update load. See Performance/LDP Throughput below. |
6. Open Source | High | The system should have a permissive license, per DPLA & Tech Team values. Consider implications for protocol-level concerns. |
6.1. Default Backend is Open Source | Must | |
6.2. DPLA's Production Backend is Open Source | Medium-High | |
7. General Purpose RDF in LDP-RS | Medium-High | LDP-RSs need to support arbitrary triples. Workarounds are possible for other models, but would require significant realignment of the data pipeline and provenance. |
8. Versioning | Medium | An existing versioning system is desirable. Service alignment with future LDP Community Group work is considered more important. Details of versioning systems need to be discussed in further detail. |
10. High Availability | Low | Fault-tolerant replication is a plus. |
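Requirement 1.3 can be made concrete with a small sketch. Assuming a Fedora-4-style endpoint that accepts `application/sparql-update` as a PATCH body, a client might build a request like the following. The resource URI and triples here are hypothetical, and the request is constructed but not sent:

```ruby
require 'net/http'
require 'uri'

# Hypothetical LDP-RS URI; any resource on an LDP server would do.
uri = URI('http://localhost:8080/rest/items/moomin')

# SPARQL Update is the PATCH format Fedora 4 accepts
# (media type application/sparql-update). <> refers to the request URI.
sparql_update = <<~SPARQL
  PREFIX dc: <http://purl.org/dc/elements/1.1/>
  DELETE { <> dc:title ?t }
  INSERT { <> dc:title "Moomin" }
  WHERE  { <> dc:title ?t }
SPARQL

request = Net::HTTP::Patch.new(uri)
request['Content-Type'] = 'application/sparql-update'
request.body = sparql_update

# Sending is elided; in practice:
# Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
```

An RDF Patch or LD Patch body would take the same shape with a different `Content-Type`, which is why the choice of patch format is largely a server-side concern for us.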
Performance
Performance needs are focused on three areas, in order of importance: LDP throughput, simple but large-scale SPARQL queries, and bulk export.
All performance measures should be considered as the total number of LDP-RSs increases to the numbers listed in #4 above.
LDP Throughput
Throughput for LDP requests is a high priority. Our processes typically involve a large number of HTTP requests, both reads and updates, to individual resources. These normally do not benefit from caching, since we tend to interact with items only once in a given process. Prewarming may be an option for reads.
We can define throughput, for our purposes, first in terms of read/update (GET/PUT) cycles per second through a single-threaded remote client; and, second, in terms of cycles per second achievable with multiple concurrent remote client processes.
We expect the first to be easiest to measure, but the second is the more important metric. With the datastore we choose, we will aim to optimize the number of workers operating concurrently (and the internal threading of individual processes) to maximize total updates per second. The main benefit of improved single-threaded throughput is the reduced need to optimize on the client side.
```ruby
require 'benchmark/ips' # benchmark-ips gem

Benchmark.ips do |bm|
  aggs = activity.entities

  bm.report('read/update cycle') do
    agg = aggs.next
    agg.get
    # change something so the datastore needs to update the resource
    agg.sourceResource.first.title = "Moomin"
    agg.save
  end
end
```
SPARQL Performance
In the requirements above, we discuss SPARQL performance for two use cases: Query across LDPC and LDP-RS and SPARQL for QA.
Query across LDPC and LDP-RS
These queries are high priority, since we rely on them in our basic workflow. They need to identify large numbers (scaling to at least 5 million) of LDP-RSs by Activity URI (`prov:wasGeneratedBy`, `dpla:wasRevisedBy`) and by Provider (`dpla:provider`). In both cases, performance for a single `NOT EXISTS` filter should also be evaluated. See https://github.com/dpla/KriKri/blob/develop/lib/krikri/provenance_query_client.rb#L9-L54 for an example query.
These queries can be long-running and cached, but need to be consistently successful.
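A minimal sketch of the query shape described above, assuming the `prov:` and `dpla:` namespaces and an illustrative activity URI (the production query lives in provenance_query_client.rb, linked above):

```ruby
# Find LDP-RSs generated by a given Activity that have not yet been revised.
# The activity URI below is hypothetical.
activity_uri = 'http://ldp.local.dp.la/ldp/activity/8'

query = <<~SPARQL
  PREFIX prov: <http://www.w3.org/ns/prov#>
  PREFIX dpla: <http://dp.la/about/map/>

  SELECT ?record
  WHERE {
    ?record prov:wasGeneratedBy <#{activity_uri}> .
    FILTER NOT EXISTS { ?record dpla:wasRevisedBy ?later }
  }
SPARQL
```

With the sparql-client gem, this string could be run as `SPARQL::Client.new(endpoint).query(query)`; at the result-set sizes discussed above, paging or streaming the results is likely necessary.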
SPARQL for QA
A lower priority for SPARQL performance is fast, dynamic SPARQL for evaluating data quality at scale. Query types in this category include:
- missing properties in SourceResources
- SourceResources with property literals matching a regex
- Aggregations by ranges of datetime literals
- other complex BGP queries
As a proxy for these kinds of queries, we expect to use a generic SPARQL benchmark (Berlin or LUBM).
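One of the query types above (property literals matching a regex) might look like the following sketch. The `dpla:SourceResource` class and `dcterms:title` property are assumed here for illustration:

```ruby
# QA-style query: SourceResources whose titles start with whitespace,
# a simple data-quality smell. Prefixes and properties are illustrative.
qa_query = <<~SPARQL
  PREFIX dpla: <http://dp.la/about/map/>
  PREFIX dcterms: <http://purl.org/dc/terms/>

  SELECT ?sr ?title
  WHERE {
    ?sr a dpla:SourceResource ;
        dcterms:title ?title .
    FILTER regex(?title, "^\\s")
  }
SPARQL
```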
Bulk Export
Bulk Export will be measured by the rate of triples extracted by a single-threaded client when exporting:
- all triples in the dataset; and
- all triples related to a single provider
Where possible, tests will be run with a streaming client.
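A minimal sketch of how such a measurement might be harnessed, assuming an N-Triples export stream. Line-oriented matching stands in for a real parser (e.g. RDF.rb's `RDF::NTriples::Reader`), and the URIs are hypothetical:

```ruby
require 'stringio'

# Count triples in a streamed N-Triples export, tallying those that mention a
# given provider URI. N-Triples is line-delimited, so a streaming client can
# process an export of any size without loading it into memory.
def count_export(io, provider_uri)
  total = provider = 0
  io.each_line do |line|
    next if line.strip.empty? || line.start_with?('#')
    total += 1
    provider += 1 if line.include?("<#{provider_uri}>")
  end
  [total, provider]
end

# Tiny in-memory stand-in for a streamed export.
dump = StringIO.new(<<~NT)
  <http://example.org/agg/1> <http://dp.la/about/map/provider> <http://example.org/provider/a> .
  <http://example.org/agg/2> <http://dp.la/about/map/provider> <http://example.org/provider/b> .
  <http://example.org/agg/1> <http://purl.org/dc/terms/title> "Moomin" .
NT

totals = count_export(dump, 'http://example.org/provider/a') # => [3, 1]
```

Timing this loop over full-scale dumps would give the triples-per-second figures described above for both the whole-dataset and single-provider cases.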