Ingestion 3 Storage Specification

This document specifies the schemas of our Avro files, and the folder structure and file types within our S3 bucket for ingestion.

See DT-1137

Avro File Schema

{
	"namespace": <NAMESPACE>,
	"type": "record",
	"name": <NAME>,
	"doc": <DOCUMENTATION>,
	"fields": [
		{"name": <FIELDNAME>, "type": "string"},
		...
	]
}

Where:

<NAMESPACE> is "la.dp.avro.MAP_3.1" or "la.dp.avro.MAP.4_0"
<NAME> is "OriginalRecord" or "MappedRecord" or "EnrichedRecord"
<DOCUMENTATION> is optional, but should contain a contextual note about the Activity that generated the file. For mappings and enrichments in particular, it could name the harvested-records or mapped-records Avro file from which the file is derived. It should also contain the software version number or git commit identifier of the agent that wrote it.
<FIELDNAME> is a string with one of the following values:
- "id" – A field of this name must always be present. This is the DPLA record ID.
- "rdf_document" – The Turtle serialization of a mapped or enriched record.
- "or_document" – The OriginalRecord document that came from the provider. This is present in all Avro files, even for mapped and enriched records, so that the original record travels along for QA purposes.
- "or_mimetype" – The MIME type of the original record, either application/xml or application/json. This is for when we allow users to download our original records, either through the API or otherwise, in the future.
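
The schema template above can be sketched as a small Python builder. This is only an illustration under stated assumptions: build_schema, ALLOWED_NAMES and ALLOWED_FIELDS are hypothetical names that do not come from the ingestion codebase, and the doc string shown is a made-up example with the version left as a placeholder.

```python
import json

# Values taken from the specification above.
ALLOWED_NAMES = {"OriginalRecord", "MappedRecord", "EnrichedRecord"}
ALLOWED_FIELDS = ("id", "rdf_document", "or_document", "or_mimetype")


def build_schema(namespace, name, field_names, doc=None):
    """Return the Avro schema as a dict (hypothetical helper, not part of the spec)."""
    if name not in ALLOWED_NAMES:
        raise ValueError(f"unknown record name: {name}")
    if "id" not in field_names:
        raise ValueError('an "id" field must always be present')
    unknown = [f for f in field_names if f not in ALLOWED_FIELDS]
    if unknown:
        raise ValueError(f"unknown field names: {unknown}")

    schema = {"namespace": namespace, "type": "record", "name": name}
    if doc is not None:  # "doc" is optional in the spec
        schema["doc"] = doc
    # Every field is declared as an Avro string.
    schema["fields"] = [{"name": f, "type": "string"} for f in field_names]
    return schema


schema = build_schema(
    namespace="la.dp.avro.MAP.4_0",
    name="MappedRecord",
    field_names=["id", "rdf_document", "or_document", "or_mimetype"],
    # Illustrative note only; <VERSION> is left as a placeholder.
    doc="Mapped from 20170209_104428-cdl-OriginalRecord.avro; written by <VERSION>",
)
print(json.dumps(schema, indent=2))
```

The builder only checks the constraints stated in this document. It does not validate the result against the Avro specification itself.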

S3 Bucket Folder Structure and Contents

<PROVIDER>
    "plan"
        <TIMESTAMP>
            <TIMESTAMP>"-"<SCHEMA NAME>".json"
    <SCHEMA NAME>
        <DATE>
            <TIMESTAMP>"-"<PROVIDER>"-"<SCHEMA NAME>".avro"
            <TIMESTAMP>"-"<PROVIDER>"-"<SCHEMA NAME>"-prov.json"

Where:

<PROVIDER> is a token like "cdl" or "mdl".
<DATE> is a string date representation in the form yyyymmdd, for example, "20170209". This is always the date when the relevant event started.
<TIMESTAMP> is a string timestamp representation in the form yyyymmdd_hhmmss, for example, "20170209_104428". This is always the time when the relevant event started.
<SCHEMA NAME> corresponds to the "name" property of the Avro file schema documented above. It's "OriginalRecord" for harvests, "MappedRecord" for mappings, and "EnrichedRecord" for enrichments.
"plan" is for manifests of whole ingests, where the term "plan" refers to the PROV-O notion of a Plan, or sequence of activities. The manifest is a set of JSON objects that documents which Avro files are relevant elsewhere in the provider's folder tree. A Plan can include one or more Activities, so a standalone remapping, for instance, could be part of a Plan with just that one activity. The JSON files in the "plan" directory are write-once: they are not modified after being written. They function as a journal of all of the activities that have happened in the particular ingest.
The files ending in "-prov.json" document which Plan the relevant Avro file belongs to, or which Avro file represents the generator Activity from which the relevant one is derived. For example, if you are looking at a particular mapping Avro file, this can direct you to the harvest Avro file whose records were mapped. This may be redundant with the information in a Plan file, but can make it easier to find things.

Examples:

A manifest for a whole ingest:

/cdl/plan/20170209_104428/20170209_104428-MappedRecord.json

CDL's records from a harvest started on Feb. 9th, 2017:

/cdl/OriginalRecord/20170209/20170209_104428-cdl-OriginalRecord.avro
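
The path conventions above can be expressed as a few string-formatting helpers. This is a minimal sketch: the function names are hypothetical, and it assumes that the plan folder and the plan file may carry different start times (the plan's and the activity's), which the specification does not state explicitly. Leading slashes are kept only to match the examples.

```python
from datetime import datetime

DATE_FORMAT = "%Y%m%d"               # <DATE>, e.g. "20170209"
TIMESTAMP_FORMAT = "%Y%m%d_%H%M%S"   # <TIMESTAMP>, e.g. "20170209_104428"


def plan_key(provider, schema_name, plan_started, activity_started=None):
    """Path of a plan (manifest) file. Hypothetical helper.

    Assumption: the folder is named for the start of the plan and the file
    for the start of the activity; both default to the same moment.
    """
    folder_ts = plan_started.strftime(TIMESTAMP_FORMAT)
    file_ts = (activity_started or plan_started).strftime(TIMESTAMP_FORMAT)
    return f"/{provider}/plan/{folder_ts}/{file_ts}-{schema_name}.json"


def _activity_stem(provider, schema_name, started):
    """Shared prefix of an activity's Avro file and its provenance file."""
    date = started.strftime(DATE_FORMAT)
    ts = started.strftime(TIMESTAMP_FORMAT)
    return f"/{provider}/{schema_name}/{date}/{ts}-{provider}-{schema_name}"


def avro_key(provider, schema_name, started):
    return _activity_stem(provider, schema_name, started) + ".avro"


def prov_key(provider, schema_name, started):
    return _activity_stem(provider, schema_name, started) + "-prov.json"


started = datetime(2017, 2, 9, 10, 44, 28)
print(plan_key("cdl", "MappedRecord", started))
print(avro_key("cdl", "OriginalRecord", started))
print(prov_key("cdl", "OriginalRecord", started))
```

With the start time used in the examples, the first two calls reproduce the two example paths shown above.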

Plan Files and Provenance Files

Plan file format:

{
    <ACTIVITY>: <AVRO FILE>,
    "version": <VERSION>
}

Provenance file format:

{
    "generator": <AVRO FILE>,
    "version": <VERSION>
}

Where:

<ACTIVITY> corresponds to the relevant <SCHEMA NAME> as noted in S3 Bucket Folder Structure and Contents, for example "MappedRecord" for a mapping.
<AVRO FILE> is the path to the file relative to the top <PROVIDER> folder in the S3 folder structure.
<VERSION> is the software version identifier or git commit identifier of the software that wrote the Avro file. In the case of the provenance file, it is the version of the program that wrote the like-named file that accompanies it, not that of the generator activity.
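
As an illustration of the two formats, the following sketch builds a plan document and a provenance document for a mapping derived from the example harvest. The helper names and the version string "abc1234" are hypothetical, and the mapping's timestamp is invented for the example.

```python
import json


def plan_document(activity, avro_file, version):
    """Plan file body: maps the activity to its Avro file. Hypothetical helper."""
    return {activity: avro_file, "version": version}


def provenance_document(generator_avro_file, version):
    """Provenance file body: points at the Avro file this one was derived from."""
    return {"generator": generator_avro_file, "version": version}


# Paths are relative to the top <PROVIDER> folder, as the spec requires.
harvest = "OriginalRecord/20170209/20170209_104428-cdl-OriginalRecord.avro"
mapping = "MappedRecord/20170209/20170209_113000-cdl-MappedRecord.avro"  # invented timestamp

# "abc1234" stands in for the version of the program that wrote the mapping file.
plan = plan_document("MappedRecord", mapping, "abc1234")
prov = provenance_document(harvest, "abc1234")

print(json.dumps(plan, indent=4))
print(json.dumps(prov, indent=4))
```

Here the provenance document would sit next to the mapping's Avro file and direct a reader back to the harvest whose records were mapped.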

Notes

Though there is a folder structure with provider names and dates, Avro files are still named with the timestamp, provider, and activity to make them easier to identify if they're copied to another location.

Spark driver programs can instantiate a DataFrame or RDD by passing a list of paths to the relevant constructor. If we need to do something like reindex all of our providers at once, we can descend through the folder hierarchy and assemble a list of relevant paths corresponding to the most recent activities of all providers. Processing a single provider is simpler, because only that provider's folder needs to be traversed.
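
Assembling the list of most recent paths could look roughly like the sketch below. It works on a plain list of keys, so how the keys are listed from S3 is left open. The function name and all keys other than the example harvest are hypothetical. It relies on the fact that file names begin with a yyyymmdd_hhmmss timestamp, so lexical order matches chronological order.

```python
def latest_avro_per_provider(keys, schema_name):
    """Return the most recent Avro file of one kind for each provider.

    Expects keys shaped like
    /<PROVIDER>/<SCHEMA NAME>/<DATE>/<TIMESTAMP>-<PROVIDER>-<SCHEMA NAME>.avro
    and ignores everything else (plan files, provenance files, other kinds).
    """
    latest = {}  # provider -> (filename, key)
    for key in keys:
        parts = key.strip("/").split("/")
        if len(parts) != 4 or parts[1] != schema_name:
            continue
        provider, filename = parts[0], parts[3]
        if not filename.endswith(".avro"):
            continue
        # File names start with the timestamp, so string comparison suffices.
        if provider not in latest or filename > latest[provider][0]:
            latest[provider] = (filename, key)
    return sorted(key for _, key in latest.values())


keys = [
    "/cdl/OriginalRecord/20170209/20170209_104428-cdl-OriginalRecord.avro",
    "/cdl/OriginalRecord/20170301/20170301_080000-cdl-OriginalRecord.avro",
    "/cdl/OriginalRecord/20170301/20170301_080000-cdl-OriginalRecord-prov.json",
    "/cdl/MappedRecord/20170302/20170302_090000-cdl-MappedRecord.avro",
    "/cdl/plan/20170209_104428/20170209_104428-MappedRecord.json",
    "/mdl/OriginalRecord/20170115/20170115_120000-mdl-OriginalRecord.avro",
]
paths = latest_avro_per_provider(keys, "OriginalRecord")
print(paths)
```

The resulting list is what would be handed to the Spark reader used by the driver program.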

Using this folder structure, it should be possible to algorithmically recreate a snapshot of a provider at a particular time; for instance, by retrieving the most recent Avro file of mapped records from a particular date (perhaps by walking backwards until a date is found), and then enriching those and indexing the results. In a case like this, a new provenance file could be created that points to the historical mapped-records Avro file.
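
The "walking backwards" step can be sketched as a search for the newest Avro file whose date folder is on or before the requested date. The function name is hypothetical and the keys are invented for illustration. Because dates and timestamps are zero-padded, the lexically greatest matching key is also the most recent one.

```python
def latest_on_or_before(keys, provider, schema_name, date):
    """Return the newest Avro key for provider/schema_name whose <DATE>
    folder is on or before `date` (a yyyymmdd string), or None."""
    prefix = f"/{provider}/{schema_name}/"
    candidates = []
    for key in keys:
        if not key.startswith(prefix) or not key.endswith(".avro"):
            continue
        folder_date = key[len(prefix):].split("/")[0]
        if folder_date <= date:  # zero-padded dates compare correctly as strings
            candidates.append(key)
    # Same prefix, then date, then timestamp: max() picks the most recent.
    return max(candidates, default=None)


keys = [
    "/cdl/MappedRecord/20170210/20170210_090000-cdl-MappedRecord.avro",
    "/cdl/MappedRecord/20170210/20170210_150000-cdl-MappedRecord.avro",
    "/cdl/MappedRecord/20170302/20170302_090000-cdl-MappedRecord.avro",
    "/cdl/MappedRecord/20170302/20170302_090000-cdl-MappedRecord-prov.json",
]
snapshot_source = latest_on_or_before(keys, "cdl", "MappedRecord", "20170228")
print(snapshot_source)
```

The returned file is the one a new provenance file would name as its generator when the snapshot is re-enriched and indexed.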

The Plan files (in the "plan" folder and for an activity) are recommended but optional. Our software should be designed so that you can run an activity like a mapping either automatically by passing it the name of a Plan file, or by passing it the name of a generator activity's file directly.
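
One way to support both invocation styles is a small resolver that accepts either a parsed Plan file or a generator file path. This is a sketch under assumptions the specification does not settle: the function name is hypothetical, an explicit generator path is assumed to take precedence, and a Plan file is assumed to hold exactly one activity entry besides "version", as in the plan file format above.

```python
def resolve_generator(plan_document=None, generator_file=None):
    """Return the Avro file an activity should read from.

    Assumptions (not from the spec): an explicit generator path wins, and a
    plan document has exactly one activity key in addition to "version".
    """
    if generator_file is not None:
        return generator_file
    if plan_document is None:
        raise ValueError("pass either a plan document or a generator file")
    activities = {k: v for k, v in plan_document.items() if k != "version"}
    if len(activities) != 1:
        raise ValueError("expected exactly one activity entry in the plan document")
    return next(iter(activities.values()))


harvest = "OriginalRecord/20170209/20170209_104428-cdl-OriginalRecord.avro"

# Driven by a Plan file (already parsed from JSON); "abc1234" is a made-up version.
from_plan = resolve_generator(plan_document={"OriginalRecord": harvest, "version": "abc1234"})

# Driven by the generator activity's file directly.
direct = resolve_generator(generator_file=harvest)

print(from_plan == direct)
```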