KriKri Datastore Requirements

Goals

  • Establish requirements for a long-term, scalable and performant backend for KriKri.

Assumptions

  • The datastore needs to handle schema-less RDF.
    • KriKri mapping and enrichments are not (and should not be) tightly bound to DPLA MAP.
  • KriKri needs to serialize RDF and support SPARQL queries.

Out of Scope

  • Indexing
  • Reconsideration of LDP as interaction API

Requirements

Each requirement below is labeled with its importance (Must, High, Medium, Low, or Not Required), followed by notes.

1. Linked Data Platform 1.0 compliance (Must)

Various implementations are possible:

    • we are agnostic about underlying data structure
    • storage backends that can support a thin LDP server layer (e.g., RDF::LDP or LDPjs) are considered compliant, and are under consideration to the degree that such a stack meets the other requirements.
1.1. LDP-NR support (Must)

We use LDP-NRs for OriginalRecords.

Ideally, a system should create a "describedby" LDP-RS for LDP-NRs for provenance and technical metadata.
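
Per LDP 1.0, the describing LDP-RS is advertised on the LDP-NR's responses via a Link header with rel="describedby". A minimal, illustrative helper for extracting it (the function name and header value are inventions for this sketch, not KriKri code):

```ruby
# Illustrative: pull the "describedby" LDP-RS URI out of an HTTP Link
# header returned for an LDP-NR, per LDP 1.0. Returns nil when absent.
def describedby(link_header)
  link_header.scan(/<([^>]+)>\s*;\s*rel="describedby"/).flatten.first
end
```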

1.2. POST, PUT (Must)

1.3. PATCH (Medium-Low)

A good patch format would improve certain parts of our workflow, especially if used with RDF.rb 2.0's Transactions.

Multiple PATCH formats currently compete: Marmotta supports RDF Patch, while Fedora 4 supports SPARQL Update as its patch format. The LDP WG declined to recommend LD Patch.

Ruby RDF has some support for LD Patch documents.
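
As a concrete sketch of the Fedora 4 style, assuming the server accepts application/sparql-update on PATCH (the endpoint, prefix, and property below are placeholders, not our MAP terms):

```ruby
require 'net/http'
require 'uri'

# Build a SPARQL Update document that swaps one title triple.
def title_patch(old_title, new_title)
  <<~SPARQL
    PREFIX dc: <http://purl.org/dc/terms/>
    DELETE { <> dc:title "#{old_title}" }
    INSERT { <> dc:title "#{new_title}" }
    WHERE  { }
  SPARQL
end

# Send it as an HTTP PATCH to a hypothetical LDP-RS endpoint.
def patch_title(uri, old_title, new_title)
  req = Net::HTTP::Patch.new(uri)
  req['Content-Type'] = 'application/sparql-update'
  req.body = title_patch(old_title, new_title)
  Net::HTTP.start(uri.hostname, uri.port) { |http| http.request(req) }
end
```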

1.4. Server-Managed Triples (Medium)

Marmotta currently handles created/deleted timestamps as server managed triples.

A customizable server-managed triples system that would allow us to handle (e.g.) PROV is a lower-priority desideratum.

2. SPARQL (Must)

2.1. Query across LDPC and LDP-RS (Must)

This is required to support provider-based queries. These queries will normally have large result sets.

2.2. SPARQL for QA (Medium-Low)

SPARQL performance should be sufficient that we can do some amount of QA analytics reliably, inline on the production server.

This is a relatively low priority because we will almost certainly need to maintain secondary indexes or analytics systems for some of our use cases.

2.3. SPARQL Update, Federation, Graph Store Protocol (Not Required)

3. Bulk Export (High?)

4. Scale Up (High)

TK:

  • current
    • LDP-RS count; 
    • LDP-NR count; 
    • 1,577,868,572 total quads (rows in triples table); 
    • LDP-NR disk usage;
  • estimated: 
    • LDP-RS count at 25/100/250 million MAP Aggregations;

LDP GET, POST, PUT/PATCH should be minimally affected by overall dataset size. Queries should scale performantly to [...] Aggregations.

4.1. Scale Out (?)

A single-server system that can scale to our needs in #4 is considered acceptable. Benefits of horizontal scalability are to be evaluated on a system-by-system basis.

5. Throughput (High)

We require highly responsive performance under heavy concurrent update load.

See Performance/LDP Throughput below.

6. Open Source (High)

System should have a permissive license, per DPLA & Tech Team values.

Consider implications for protocol level concerns.

6.1. Default Backend is Open Source (Must)

6.2. DPLA's Production Backend is Open Source (Medium-High)

7. General Purpose RDF in LDP-RS (Medium-High)

LDP-RSs need to support arbitrary triples.

Workarounds are possible for other models, but they would require significant realignment of the data pipeline and provenance.

8. Versioning (Medium)

An existing versioning system is desirable. Service alignment with future LDP Community Group work is considered more important.

Specific versioning systems need to be discussed in further detail.

9. Reasoning (Not Required)

10. High Availability (Low)

Fault-tolerant replication is a plus.

Performance

Performance needs are focused in three areas, in order of importance: LDP throughput, simple but large-scale SPARQL queries, and bulk export.

All performance measures should be considered as the total number of LDP-RSs increases to the numbers listed in #4 above.

LDP Throughput

Throughput for LDP requests is a high priority. Our processes typically involve a large number of HTTP requests, both reads and updates, to individual resources. These normally do not benefit from caching, since we tend to interact with items only once in a given process. Prewarming may be an option for reads.

We can define throughput, for our purposes, first in terms of read/update (GET/PUT) cycles per second through a single-threaded remote client; and, second, in terms of cycles per second achievable with multiple concurrent remote client processes.

We expect the first to be easiest to measure, but the second is the more important metric. With the datastore we choose, we will aim to optimize the number of workers operating concurrently (and the internal threading of individual processes) to maximize total updates per second. The main benefit of improved single-threaded throughput is the reduced need to optimize on the client side.

Single threaded throughput pseudo-code
require 'benchmark/ips' # benchmark-ips gem

Benchmark.ips do |bm|
  aggs = activity.entities # enumerator over the Activity's Aggregations
  bm.report('read/update cycle') do
    agg = aggs.next
    agg.get # read the resource from the datastore
    # change something so the datastore needs to update the resource
    agg.sourceResource.first.title = 'Moomin'
    agg.save # write the update back (PUT)
  end
end
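
The more important multi-client metric can be sketched the same way. Here `client` is a stand-in for any object performing one GET/PUT cycle; the real client depends on the datastore chosen, so this is an assumption, not KriKri code:

```ruby
# Total read/update cycles per second across several concurrent workers.
# `client#cycle` is assumed to perform one GET + PUT round trip.
def concurrent_cycles_per_second(client, workers: 4, duration: 1.0)
  count    = 0
  mutex    = Mutex.new
  deadline = Time.now + duration
  workers.times.map do
    Thread.new do
      while Time.now < deadline
        client.cycle
        mutex.synchronize { count += 1 }
      end
    end
  end.each(&:join)
  count / duration
end
```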

SPARQL Performance

In the requirements above, we discuss SPARQL performance for two use cases: Query across LDPC and LDP-RS (2.1) and SPARQL for QA (2.2).

Query across LDPC and LDP-RS

These queries are high priority, since we rely on them in our basic workflow. These queries need to identify large numbers (scaling to at least 5 million) of LDP-RSs by Activity URI (prov:wasGeneratedBy, dpla:wasRevisedBy) and by Provider (dpla:provider). In both cases, performance for a single NOT EXISTS filter should also be evaluated. See https://github.com/dpla/KriKri/blob/develop/lib/krikri/provenance_query_client.rb#L9-L54 for an example query.

These queries can be long-running and cached, but need to be consistently successful.
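
The general shape of these queries, as an illustration only (namespace URIs and the revision pattern are placeholders; the real query is at the link above):

```ruby
# Illustrative: Aggregations generated by an Activity that have not yet
# been revised, i.e. the single NOT EXISTS filter case described above.
def provenance_query(activity_uri)
  <<~SPARQL
    PREFIX prov: <http://www.w3.org/ns/prov#>
    PREFIX dpla: <http://dp.la/about/map/>
    SELECT ?agg
    WHERE {
      ?agg prov:wasGeneratedBy <#{activity_uri}> .
      FILTER NOT EXISTS { ?agg dpla:wasRevisedBy ?revision }
    }
  SPARQL
end
```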

SPARQL for QA

A lower priority for SPARQL performance is fast, dynamic SPARQL for evaluating data quality at scale. Query types in this category include:

  • missing properties in SourceResources
  • SourceResources with property literals matching a regex
  • Aggregations by ranges of datetime literals
  • other complex BGP queries

As a proxy for these kinds of queries, we expect to use a generic SPARQL benchmark (Berlin or LUBM).
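
For instance, the regex case might look like the following (the property is a placeholder, not necessarily how titles appear in our model):

```ruby
# Illustrative QA query: SourceResources whose title matches a regex.
def title_regex_query(pattern)
  <<~SPARQL
    PREFIX dc: <http://purl.org/dc/terms/>
    SELECT ?sr
    WHERE {
      ?sr dc:title ?title .
      FILTER regex(str(?title), "#{pattern}", "i")
    }
  SPARQL
end
```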

Bulk Export

Bulk Export will be measured by triples extracted by a single threaded client when exporting: 

  • all triples in the dataset; and
  • all triples related to a single provider

Where possible, tests will be run with a streaming client.

Attachments

  • WP2.pdf (PDF): Europeana Triplestore Evaluation. Modified Jul 21, 2016 by Tom Johnson.