KriKri Datastore Requirements
Goals
- Establish requirements for a long-term, scalable and performant backend for KriKri.
Assumptions
- The datastore needs to handle schema-less RDF.
- KriKri mapping and enrichments are not (and should not be) tightly bound to DPLA MAP.
- KriKri needs to serialize RDF and to run SPARQL queries against it.
Out of Scope
- Indexing
- Reconsideration of LDP as interaction API
Requirements
Requirement | Importance | Notes |
---|---|---|
1. Linked Data Platform 1.0 compliance | Must | Various implementations are possible: |
1.1. LDP-NR support | Must | We use LDP-NRs for OriginalRecords. Ideally, a system should create a "describedby" LDP-RS for LDP-NRs for provenance and technical metadata. |
1.2. POST, PUT | Must | |
1.3. PATCH | Medium-Low | A good patch format would improve certain parts of our workflow, especially if used with RDF.rb 2.0's Transactions. Multiple PATCH formats currently compete: Marmotta supports RDF Patch, while Fedora 4 supports SPARQL Update as a patch format. The LDP WG declined to recommend LD Patch. Ruby RDF has some support for LD Patch documents. |
1.4. Server Managed Triples | Medium | Marmotta currently handles created/deleted timestamps as server managed triples. A customizable server-managed-triples system that would allow us to (e.g.) handle PROV is a lower-level desideratum. |
2. SPARQL | Must | |
2.1. Query across LDPC and LDP-RS | Must | This is required to support provider-based queries. These queries will normally have large result sets. |
2.2. SPARQL for QA | Medium-Low | SPARQL performance sufficient to do some amount of QA analytics reliably inline on the production server. This is a relatively low priority because we will almost certainly need to maintain secondary indexes or analytics systems for some of our use cases. |
Not Required | | |
3. Bulk Export | High | |
4. Scale Up | High | TK: LDP GET, POST, PUT/PATCH should be minimally affected by overall dataset size. Queries should scale performantly to [...] Aggregations. |
4.1. Scale Out | | A single-server system that can scale to our needs in 4 is considered acceptable. Benefits of horizontal scalability are to be evaluated on a system-by-system basis. |
5. Throughput | High | We require highly responsive performance under heavy concurrent update load. See Performance/LDP Throughput below. |
6. Open Source | High | The system should have a permissive license, per DPLA & Tech Team values. Consider implications for protocol-level concerns. |
6.1. Default Backend is Open Source | Must | |
6.2. DPLA's Production Backend is Open Source | Medium-High | |
7. General Purpose RDF in LDP-RS | Medium-High | LDP-RSs need to support arbitrary triples. Workarounds are possible for other models, but would require significant realignment of the data pipeline and provenance. |
8. Versioning | Medium | An existing versioning system is desirable. Service alignment with future LDP Community Group work is considered more important. Details of versioning systems need to be discussed in further detail. |
10. High Availability | Low | Fault-tolerant replication is a plus. |
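Requirement 1.3 can be made concrete with a small sketch. Assuming a Fedora-4-style endpoint that accepts `application/sparql-update` as a PATCH body, a client might build a request like the following. The resource URI and triples here are hypothetical, and the request is constructed but not sent:

```ruby
require 'net/http'
require 'uri'

# Hypothetical LDP-RS URI; any resource on an LDP server would do.
uri = URI('http://localhost:8080/rest/items/moomin')

# SPARQL Update is the PATCH format Fedora 4 accepts
# (media type application/sparql-update). <> refers to the request URI.
sparql_update = <<~SPARQL
  PREFIX dc: <http://purl.org/dc/elements/1.1/>
  DELETE { <> dc:title ?t }
  INSERT { <> dc:title "Moomin" }
  WHERE  { <> dc:title ?t }
SPARQL

request = Net::HTTP::Patch.new(uri)
request['Content-Type'] = 'application/sparql-update'
request.body = sparql_update

# Sending is elided; in practice:
# Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
```

An RDF Patch or LD Patch body would take the same shape with a different `Content-Type`, which is why the choice of patch format is largely a server-side concern for us.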
Performance
Performance needs are focused on three areas, in order of importance: LDP throughput, simple but large-scale SPARQL queries, and bulk export.
All performance measures should be considered as the total number of LDP-RSs increases to the numbers listed in #4 above.
LDP Throughput
Throughput for LDP requests is a high priority. Our processes typically involve a large number of HTTP requests, both reads and updates, to individual resources. These normally do not benefit from caching, since we tend to interact with items only once in a given process. Prewarming may be an option for reads.
We can define throughput, for our purposes, first in terms of read/update (GET/PUT) cycles per second through a single-threaded remote client; and, second, in terms of cycles per second achievable with multiple concurrent remote client processes.
We expect the first to be easiest to measure, but the second is the more important metric. With the datastore we choose, we will aim to optimize the number of workers operating concurrently (and the internal threading of individual processes) to maximize total updates per second. The main benefit of improved single-threaded throughput is the reduced need to optimize on the client side.
```ruby
require 'benchmark/ips' # benchmark-ips gem

Benchmark.ips do |bm|
  aggs = activity.entities

  bm.report('read/update cycle') do
    agg = aggs.next
    agg.get
    # change something so the datastore needs to update the resource
    agg.sourceResource.first.title = "Moomin"
    agg.save
  end
end
```
SPARQL Performance
In the requirements above, we discuss SPARQL performance for two use cases: Query across LDPC and LDP-RS and SPARQL for QA.
Query across LDPC and LDP-RS
These queries are high priority, since we rely on them in our basic workflow. They need to identify large numbers (scaling to at least 5 million) of LDP-RSs by Activity URI (`prov:wasGeneratedBy`, `dpla:wasRevisedBy`) and by Provider (`dpla:provider`). In both cases, performance for a single `NOT EXISTS` filter should also be evaluated. See https://github.com/dpla/KriKri/blob/develop/lib/krikri/provenance_query_client.rb#L9-L54 for an example query.
These queries can be long-running and cached, but need to be consistently successful.
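A minimal sketch of the query shape described above, assuming the `prov:` and `dpla:` namespaces and an illustrative activity URI (the production query lives in provenance_query_client.rb, linked above):

```ruby
# Find LDP-RSs generated by a given Activity that have not yet been revised.
# The activity URI below is hypothetical.
activity_uri = 'http://ldp.local.dp.la/ldp/activity/8'

query = <<~SPARQL
  PREFIX prov: <http://www.w3.org/ns/prov#>
  PREFIX dpla: <http://dp.la/about/map/>

  SELECT ?record
  WHERE {
    ?record prov:wasGeneratedBy <#{activity_uri}> .
    FILTER NOT EXISTS { ?record dpla:wasRevisedBy ?later }
  }
SPARQL
```

With the sparql-client gem, this string could be run as `SPARQL::Client.new(endpoint).query(query)`; at the result-set sizes discussed above, paging or streaming the results is likely necessary.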
SPARQL for QA
A lower priority for SPARQL performance is fast, dynamic SPARQL for evaluating data quality at scale. Query types in this category include:
- missing properties in SourceResources
- SourceResources with property literals matching a regex
- Aggregations by ranges of datetime literals
- other complex BGP queries
As a proxy for these kinds of queries, we expect to use a generic SPARQL benchmark (Berlin or LUBM).
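One of the query types above (property literals matching a regex) might look like the following sketch. The `dpla:SourceResource` class and `dcterms:title` property are assumed here for illustration:

```ruby
# QA-style query: SourceResources whose titles start with whitespace,
# a simple data-quality smell. Prefixes and properties are illustrative.
qa_query = <<~SPARQL
  PREFIX dpla: <http://dp.la/about/map/>
  PREFIX dcterms: <http://purl.org/dc/terms/>

  SELECT ?sr ?title
  WHERE {
    ?sr a dpla:SourceResource ;
        dcterms:title ?title .
    FILTER regex(?title, "^\\s")
  }
SPARQL
```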
Bulk Export
Bulk Export will be measured by the rate of triples extracted by a single-threaded client when exporting:
- all triples in the dataset; and
- all triples related to a single provider
Where possible, tests will be run with a streaming client.
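A minimal sketch of how such a measurement might be harnessed, assuming an N-Triples export stream. Line-oriented matching stands in for a real parser (e.g. RDF.rb's `RDF::NTriples::Reader`), and the URIs are hypothetical:

```ruby
require 'stringio'

# Count triples in a streamed N-Triples export, tallying those that mention a
# given provider URI. N-Triples is line-delimited, so a streaming client can
# process an export of any size without loading it into memory.
def count_export(io, provider_uri)
  total = provider = 0
  io.each_line do |line|
    next if line.strip.empty? || line.start_with?('#')
    total += 1
    provider += 1 if line.include?("<#{provider_uri}>")
  end
  [total, provider]
end

# Tiny in-memory stand-in for a streamed export.
dump = StringIO.new(<<~NT)
  <http://example.org/agg/1> <http://dp.la/about/map/provider> <http://example.org/provider/a> .
  <http://example.org/agg/2> <http://dp.la/about/map/provider> <http://example.org/provider/b> .
  <http://example.org/agg/1> <http://purl.org/dc/terms/title> "Moomin" .
NT

totals = count_export(dump, 'http://example.org/provider/a') # => [3, 1]
```

Timing this loop over full-scale dumps would give the triples-per-second figures described above for both the whole-dataset and single-provider cases.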