KriKri Datastore Requirements

Document status

Final

Responsible

Tom Johnson

Stakeholders

Mark Matienzo, Mark Breedlove, Scott Williams

Goals

  • Establish requirements for a long-term, scalable and performant backend for KriKri.

Assumptions

  • The datastore needs to handle schema-less RDF.

    • KriKri mapping and enrichments are not (and should not be) tightly bound to DPLA MAP.

  • KriKri needs to serialize RDF and run SPARQL queries.

Out of Scope

  • Indexing

  • Reconsideration of LDP as interaction API

Requirements

Requirement

Importance

Notes

1. Linked Data Platform 1.0 compliance

Must

Various implementations are possible:

1.1. LDP-NR support

Must

We use LDP-NRs for OriginalRecords.

Ideally, a system should create a "describedby" LDP-RS for LDP-NRs for provenance and technical metadata.

1.2. POST, PUT

Must

 

1.3. PATCH

Medium-Low

A good patch format would improve certain parts of our workflow, especially if used with RDF.rb 2.0's Transactions.

Multiple PATCH formats currently compete. Marmotta supports RDF Patch, while Fedora 4 supports SPARQL Update as a patch format. The LDP WG declined to recommend LD Patch.

Ruby RDF has some support for LD Patch documents.

1.4. Server Managed Triples

Medium

Marmotta currently handles created/deleted timestamps as server managed triples.

A customizable server-managed triples system that would allow us to handle (e.g.) PROV is a lower-priority desideratum.

2. SPARQL

Must

 

2.1. Query across LDPC and LDP-RS

Must

This is required to support provider-based queries. These queries will normally have large result sets.

2.2. SPARQL for QA

Medium-Low

SPARQL should perform well enough that we can reliably run some amount of QA analytics inline on the production server.

This is a relatively low priority because we will almost certainly need to maintain secondary indexes or analytics systems for some of our use cases.

2.3. SPARQL Update, Federation, Graph Store Protocol

Not Required

 

3. Bulk Export

High

 

4. Scale Up

High

TK:

  • current

    • LDP-RS count; 

    • LDP-NR count; 

    • 1,577,868,572 total quads (rows in triples table); 

    • LDP-NR disk usage;

  • estimated: 

    • LDP-RS count at 25/100/250 million MAP Aggregations;

LDP GET, POST, PUT/PATCH should be minimally affected by overall dataset size. Queries should scale performantly to [...] Aggregations.

4.1. Scale Out

A single-server system that can scale to the needs described in requirement 4 is considered acceptable. Benefits of horizontal scalability are to be evaluated on a system-by-system basis.

5. Throughput

High

We require highly responsive performance under heavy concurrent update load.

See Performance/LDP Throughput below.

6. Open Source

High

System should have a permissive license, per DPLA & Tech Team values.

Consider implications for protocol level concerns.

6.1. Default Backend is Open Source

Must

 

6.2. DPLA's Production Backend is Open Source

Medium-High

 

7. General Purpose RDF in LDP-RS

Medium-High

LDP-RSs need to support arbitrary triples.

Workarounds are possible for other models, but would require significant realignment of the data pipeline and provenance.

8. Versioning

Medium

An existing versioning system is desirable. Service alignment with future LDP Community Group work is considered more important.

The specifics of versioning systems still need further discussion.

9. Reasoning

Not Required

 

10. High Availability

Low

Fault tolerant replication is a plus.

Performance

Performance needs are focused in three areas. In order of importance, these are: LDP throughput, simple but large-scale SPARQL queries, and bulk export.

All performance measures should be considered as the total number of LDP-RSs increases to the numbers listed in #4 above.

LDP Throughput

Throughput for LDP requests is a high priority. Our processes typically involve a large number of HTTP requests, both reads and updates, to individual resources. These normally do not benefit from caching, since we tend to interact with items only once in a given process. Prewarming may be an option for reads.

We can define throughput, for our purposes, first in terms of read/update (GET/PUT) cycles per second through a single-threaded remote client and, second, in terms of cycles per second achievable with multiple concurrent remote client processes.

We expect the first to be easiest to measure, but the second is the more important metric. With the datastore we choose, we will aim to optimize the number of workers operating concurrently (and the internal threading of individual processes) to maximize total updates per second. The main benefit of improved single-threaded throughput is the reduced need to optimize on the client side.

Single threaded throughput pseudo-code
require 'benchmark/ips'

Benchmark.ips do |bm|
  # `activity` is assumed to be an existing Activity whose entities are Aggregations
  aggs = activity.entities

  bm.report('read/update cycle') do
    agg = aggs.next
    agg.get
    # change something so the datastore needs to update the resource
    agg.sourceResource.first.title = "Moomin"
    agg.save
  end
end
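The second metric, total cycles per second across concurrent clients, could be measured along the following lines. This is a minimal sketch rather than an agreed benchmark: the method name `concurrent_throughput` and its parameters are hypothetical, threads within one Ruby process stand in for separate remote client processes, and the block passed in is a placeholder for one real GET/PUT cycle against the datastore.

```ruby
# Runs `workers` threads, each performing `cycles_per_worker` read/update
# cycles, and reports the aggregate rate. The block receives the worker index
# and the cycle index; in a real test it would GET and then PUT one resource.
def concurrent_throughput(workers:, cycles_per_worker:, &cycle)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)

  threads = Array.new(workers) do |worker_id|
    Thread.new do
      cycles_per_worker.times { |i| cycle.call(worker_id, i) }
    end
  end
  threads.each(&:join)

  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  total = workers * cycles_per_worker
  { cycles: total, seconds: elapsed, cycles_per_second: total / elapsed }
end

# Example with a stand-in workload (a short sleep instead of HTTP requests).
result = concurrent_throughput(workers: 4, cycles_per_worker: 5) do |_worker, _i|
  sleep 0.001
end
puts format('%d cycles at %.1f cycles/s', result[:cycles], result[:cycles_per_second])
```

Repeating the run while varying `workers` would give a rough picture of where total updates per second stops improving for a given datastore.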

SPARQL Performance

In the requirements above, we discuss SPARQL performance for two use cases: Query across LDPC and LDP-RS and SPARQL for QA.

Query across LDPC and LDP-RS

These queries are high priority, since we rely on them in our basic workflow. These queries need to identify large numbers (scaling to at least 5 million) of LDP-RSs by Activity URI (prov:wasGeneratedBy, dpla:wasRevisedBy) and by Provider (dpla:provider). In both cases, performance for a single NOT EXISTS filter should also be evaluated. See https://github.com/dpla/KriKri/blob/develop/lib/krikri/provenance_query_client.rb#L9-L54 for an example query.

These queries can be long running and cached, but need to be consistently successful.
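For orientation, a query of this shape might be built as sketched below. This is an illustration under stated assumptions, not the query KriKri issues (see the linked provenance_query_client.rb for that): the method name `records_by_activity_query` is hypothetical, the `dpla:` namespace IRI is assumed, and `prov:invalidatedAtTime` is used only as an example subject for the single NOT EXISTS filter.

```ruby
PROV_NS = 'http://www.w3.org/ns/prov#'
# Assumption: namespace IRI for the dpla: prefix; adjust to the deployed vocabulary.
DPLA_NS = 'http://dp.la/about/map/'

# Builds a SPARQL SELECT for all records generated or revised by an activity.
# With exclude_invalidated: true, a single NOT EXISTS filter is appended.
def records_by_activity_query(activity_uri, exclude_invalidated: false)
  lines = [
    "PREFIX prov: <#{PROV_NS}>",
    "PREFIX dpla: <#{DPLA_NS}>",
    'SELECT DISTINCT ?record WHERE {',
    "  { ?record prov:wasGeneratedBy <#{activity_uri}> }",
    '  UNION',
    "  { ?record dpla:wasRevisedBy <#{activity_uri}> }"
  ]
  if exclude_invalidated
    # Illustrative filter only; the property tested here is an assumption.
    lines << '  FILTER NOT EXISTS { ?record prov:invalidatedAtTime ?time }'
  end
  lines << '}'
  lines.join("\n")
end

puts records_by_activity_query('http://example.org/activity/1', exclude_invalidated: true)
```

A provider query would follow the same pattern with a single `?record dpla:provider <...>` triple pattern; evaluating both with and without the filter gives the comparison described above.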

SPARQL for QA

A lower priority for SPARQL performance is fast, dynamic SPARQL for evaluating data quality at scale. Query types in this category include:

  • missing properties in SourceResources

  • SourceResources with property literals matching a regex

  • Aggregations by ranges of datetime literals

  • other complex BGP queries

As a proxy for these kinds of queries, we expect to use a generic SPARQL benchmark (Berlin or LUBM).
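To make the first two query types concrete, the sketch below builds a "missing property" query and a regex-match query. It is illustrative only: the method names, the `LIMIT` default, and the `dpla:` namespace IRI are assumptions, and the property URI is supplied by the caller rather than taken from any particular mapping.

```ruby
# Assumption: namespace IRI for the dpla: prefix.
DPLA_MAP_NS = 'http://dp.la/about/map/'

# Escapes backslashes and double quotes for use inside a SPARQL string literal.
def escape_sparql_string(value)
  value.gsub('\\') { '\\\\' }.gsub('"') { '\\"' }
end

# SourceResources that have no value at all for the given property.
def missing_property_query(property_uri, limit: 100)
  [
    "PREFIX dpla: <#{DPLA_MAP_NS}>",
    'SELECT ?sr WHERE {',
    '  ?sr a dpla:SourceResource .',
    "  FILTER NOT EXISTS { ?sr <#{property_uri}> ?value }",
    '}',
    "LIMIT #{Integer(limit)}"
  ].join("\n")
end

# SourceResources whose literal value for the property matches a regex.
def regex_match_query(property_uri, pattern, limit: 100)
  [
    "PREFIX dpla: <#{DPLA_MAP_NS}>",
    'SELECT ?sr ?value WHERE {',
    '  ?sr a dpla:SourceResource ;',
    "      <#{property_uri}> ?value .",
    "  FILTER regex(str(?value), \"#{escape_sparql_string(pattern)}\")",
    '}',
    "LIMIT #{Integer(limit)}"
  ].join("\n")
end

puts missing_property_query('http://purl.org/dc/terms/title')
```

Queries like these touch every SourceResource in the store, which is why they are a reasonable stand-in for the "dynamic QA at scale" case.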

Bulk Export

Bulk Export will be measured by the number of triples extracted per unit time by a single-threaded client when exporting:

  • all triples in the dataset; and

  • all triples related to a single provider

Where possible, tests will be run with a streaming client.
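A streaming measurement could look roughly like the sketch below. It assumes the export arrives as a line-oriented serialization (N-Triples or N-Quads, one statement per line) and reads from any IO-like object; the method name `measure_export` is hypothetical, and a `StringIO` stands in for the HTTP response body or dump file a real test would read.

```ruby
require 'stringio'

# Counts statements in a line-oriented RDF stream without buffering it all in
# memory, and reports the extraction rate. Blank lines and comments are skipped.
def measure_export(io)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  triples = 0

  io.each_line do |line|
    stripped = line.strip
    next if stripped.empty? || stripped.start_with?('#')
    triples += 1
  end

  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  { triples: triples, seconds: elapsed, triples_per_second: triples / elapsed }
end

# Stand-in data; the URIs are placeholders.
sample = StringIO.new(<<~NT)
  # sample export
  <http://example.org/a> <http://purl.org/dc/terms/title> "Moomin" .
  <http://example.org/b> <http://purl.org/dc/terms/title> "Snufkin" .
NT
puts measure_export(sample)[:triples]
```

The same method can be pointed at the full-dataset export and at a single provider's export to obtain the two figures listed above.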

Attachments

  • WP2.pdf (PDF file): Europeana Triplestore Evaluation. Modified Jul 21, 2016 by Tom Johnson.