This document specifies how the schemas of our Avro files are defined, and the folder structure and file types within our S3 bucket for ingestion.

See 

Avro File Schemas

In all schemas below, <DOCUMENTATION> is optional. When present, it should contain a contextual note about the Activity that generated the file. For mappings and enrichments in particular, it could name the harvested-records or mapped-records Avro file from which the file is derived. It should also contain the software version number or git commit identifier of the agent that wrote the file.

dpla.avro.v1.MAP3_1.IndexRecord

For enriched records coming out of Ingestion 1 in JSON format, ready to be indexed into Elasticsearch.

{
  "namespace": "dpla.avro.v1.MAP3_1",
  "type": "record",
  "name": "IndexRecord",
  "doc": "<DOCUMENTATION>",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "document", "type": "string"}
  ]
}

dpla.avro.v1.OriginalRecord

For Original Records, as they have been harvested from our providers.

{
  "namespace": "dpla.avro.v1",
  "type": "record",
  "name": "OriginalRecord",
  "doc": "<DOCUMENTATION>",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "ingestDate", "type": "long", "doc": "UNIX timestamp"},
    {"name": "provider", "type": "string"},
    {"name": "document", "type": "string"},
    {"name": "mimetype",
     "type": {"name": "MimeType",
              "type": "enum",
              "symbols": ["application_json", "application_xml", "text_turtle"]}}
  ]
}

Notes On Fields

We will have to translate "_" characters in the mimetype field into "/" characters in our code. The "/" character is not allowed within an enum symbol in an Avro schema.
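As a minimal sketch of that translation, assuming Python and the three symbols listed in the schema above, an explicit lookup table avoids mangling any future MIME type that legitimately contains an underscore. The names MIME_TYPES, symbol_to_mimetype and mimetype_to_symbol are hypothetical and not part of any existing code.

```python
# Explicit table from MimeType enum symbol to the real MIME type.
# Only the three symbols from the OriginalRecord schema are covered.
MIME_TYPES = {
    "application_json": "application/json",
    "application_xml": "application/xml",
    "text_turtle": "text/turtle",
}

# Reverse table, for writing records.
SYMBOLS = {mimetype: symbol for symbol, mimetype in MIME_TYPES.items()}


def symbol_to_mimetype(symbol):
    """Return the MIME type for an enum symbol (hypothetical helper)."""
    if symbol not in MIME_TYPES:
        raise ValueError(f"unknown MimeType symbol: {symbol!r}")
    return MIME_TYPES[symbol]


def mimetype_to_symbol(mimetype):
    """Return the enum symbol for a MIME type (hypothetical helper)."""
    if mimetype not in SYMBOLS:
        raise ValueError(f"unsupported MIME type: {mimetype!r}")
    return SYMBOLS[mimetype]
```

A plain string replacement of "_" with "/" would also work for the current symbols; the table is used here only so that unknown values fail loudly.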

In Ingestion 1, ingestDate recorded when the particular record was enriched. It would probably be more useful, however, for it to be the timestamp of the beginning of the harvest, which matters more to the provider and the end-user. Storing ingestDate in the Avro file as a long integer makes calculations about date ranges easier, such as queries that filter records by date. The timestamp can be output as a date string later, when the record is indexed.
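The following sketch illustrates both points: filtering on a numeric ingestDate and rendering it as a date string at indexing time. It assumes Python, and it assumes the timestamp is in seconds and in UTC, which the schema does not state. The function names and the sample records are hypothetical.

```python
from datetime import datetime, timezone


def harvested_between(records, start, end):
    """Keep records whose ingestDate lies in the half-open range [start, end)."""
    return [r for r in records if start <= r["ingestDate"] < end]


def as_date_string(ingest_date):
    """Render a UNIX timestamp (assumed seconds, UTC) as an ISO 8601 string."""
    moment = datetime.fromtimestamp(ingest_date, tz=timezone.utc)
    return moment.strftime("%Y-%m-%dT%H:%M:%SZ")


# Hypothetical harvest start time and records, for illustration only.
harvest_start = int(
    datetime(2017, 2, 9, 10, 44, 28, tzinfo=timezone.utc).timestamp()
)
one_day = 24 * 60 * 60
records = [
    {"id": "a", "ingestDate": harvest_start},
    {"id": "b", "ingestDate": harvest_start + one_day},
]

# Integer comparison is all that is needed to select the first day's records.
first_day = harvested_between(records, harvest_start, harvest_start + one_day)
```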

dpla.avro.v1.MAP4_x.MAPRecord

For DPLA MAP records in our system – those that have been mapped or enriched, and are represented as MAPv4 RDF.

This schema has not been tested yet.


// See http://dp.la/info/wp-content/uploads/2015/03/MAPv4.pdf
// and https://dp.la/info/wp-content/uploads/2013/04/DPLA-MAP-V3.1-2.pdf
//
// Figures below like "1" or "0-1" indicate field obligations specified in those
// documents.
//
// Fields are mostly taken from the MAPv4 document, except for Place, which is
// pulled from the MAPv3.1 document.
{
  "namespace": "dpla.avro.v1.MAP4_x",
  "type": "record",
  "name": "MAPRecord",
  "doc": "<DOCUMENTATION>",

  "types": [
    {"name": "StringArray", "type": "array", "items": "string"},
    {
      "name": "Collection",  // dcmitype:Collection
      "type": "record",
      "fields": [
        {   // 0-1
          "name": "title",
          "type": ["null", "string"],
          "default": null
        },
        {   // 0-1
          "name": "description",
          "type": ["null", "string"],
          "default": null
        }
      ]
    },
    {"name": "CollectionArray", "type": "array", "items": "Collection"},
    {
      "name": "Agent",  // edm:Agent
      "type": "record",
      "fields": [
        { // 0-1
          "name": "name",
          "type": ["null", "string"],
          "default": null
        },
        { // 0-1
          "name": "providedLabel",
          "type": ["null", "string"],
          "default": null
        },
        { // 0-1
          "name": "note",
          "type": ["null", "string"],
          "default": null
        },
        { // 0-1
          "name": "inScheme",
          "type": ["null", "string"],
          "default": null
        },
        { // 0-n
          "name": "exactMatch",
          "type": ["null", "StringArray"],
          "default": null
        },
        {
          // 0-n
          "name": "closeMatch",
          "type": ["null", "StringArray"],
          "default": null
        }
      ]
    },
    {"name": "AgentArray", "type": "array", "items": "Agent"},
    {
      "name": "Concept",  // skos:Concept
      "type": "record",
      "fields": [
        {
          // 0-1
          "name": "name",  // skos:prefLabel
          "type": ["null", "string"],
          "default": null
        },
        {
          // 0-1
          "name": "providedLabel",
          "type": ["null", "string"],
          "default": null
        },
        {
          // 0-1
          "name": "note",
          "type": ["null", "string"],
          "default": null
        },
        {
          // 0-1
          "name": "inScheme",
          "type": ["null", "string"],
          "default": null
        },
        {
          // 0-n
          "name": "exactMatch",
          "type": ["null", "StringArray"],
          "default": null
        },
        {
          // 0-n
          "name": "closeMatch",
          "type": ["null", "StringArray"],
          "default": null
        }
      ]
    },
    {"name": "ConceptArray", "type": "array", "items": "Concept"},
    {
      "name": "Place",  // dpla:Place from MAP 3.1 spec
      "type": "record",
      "fields": [
        {"name": "name", "type": "string"}, // 1
        {   // 0-1
          "name": "city",
          "type": ["null", "string"],
          "default": null
        },
        {   // 0-1
          "name": "county",
          "type": ["null", "string"],
          "default": null
        },
        {   // 0-1
          "name": "state",
          "type": ["null", "string"],
          "default": null
        },
        {   // 0-1
          "name": "country",
          "type": ["null", "string"],
          "default": null
        },
        {   // 0-1
          "name": "region",
          "type": ["null", "string"],
          "default": null
        },
        {   // 0-1
          "name": "coordinates",
          "type": ["null", "string"],
          "default": null
        }
      ]
    },
    {"name": "PlaceArray", "type": "array", "items": "Place"},
    {
      "name": "TimeSpan",  // edm:TimeSpan
      "type": "record",
      "fields": [
        {   // 0-n
          "name": "providedLabel",
          "type": ["null", "StringArray"],
          "default": null
        },
        {
          // 0-1
          "name": "begin",
          "type": ["null", "string"],
          "default": null
        },
        {
          // 0-1
          "name": "end",
          "type": ["null", "string"],
          "default": null
        }
      ]
    },
    {"name": "TimeSpanArray", "type": "array", "items": "TimeSpan"}
  ],  // types

  "fields": [

    // 1
    {"name": "id", "type": "string", "doc": "DPLA record ID"},

    // 1
    {
      "name": "ingestType",
      "type": {
        "name": "IngestType",
        "type": "enum",
        "symbols": ["item", "collection"]
      }
    },

    // 1
    {
      "name": "ingestDate",
      "type": "long",
      "doc": "UNIX timestamp"
    },

    // 1
    {"name": "dataProvider", "type": "string"},

    // 0-n
    {
      "name": "hasView",
      "type": ["null", "StringArray"],
      "default": null
    },

    // 0-n
    {
      "name": "intermediateProvider",
      "type": ["null", "StringArray"],
      "default": null
    },

    // 1
    {"name": "isShownAt", "type": "string"},

    // 0-1
    {
      "name": "object",
      "type": ["null", "string"],
      "default": null
    },

    // 1
    {"name": "originalRecord", "type": "string"},

    // 1
    {"name": "preview", "type": "string"},

    // 1
    {"name": "provider", "type": "string"},

    // 0-1
    {
      "name": "rightsStatement",
      "type": ["null", "string"],
      "default": null
    },

    // 1
    {
      "name": "sourceResource",
      "type": {
        "type": "record",
        "name": "SourceResource",
        "fields": [

          {   // 0-n
            "name": "alternative",
            "type": ["null", "StringArray"],
            "default": null
          },
          {   // 0-n
            "name": "collection",
            "type": ["null", "CollectionArray"],
            "default": null
          },
          {   // 0-n
            "name": "contributor",
            "type": ["null", "AgentArray"],
            "default": null
          },
          {   // 0-n
            "name": "creator",
            "type": ["null", "AgentArray"],
            "default": null
          },
          {   // 0-n
            "name": "date",
            "type": ["null", "TimeSpanArray"],
            "default": null
          },
          {
            // 0-n
            "name": "description",
            "type": ["null", "StringArray"],
            "default": null
          },
          {
            // 0-n
            "name": "extent",
            "type": ["null", "StringArray"],
            "default": null
          },
          {
            // 0-n
            "name": "format",
            "type": ["null", "StringArray"],
            "default": null
          },
          {
            // 0-n
            "name": "genre",
            "type": ["null", "ConceptArray"],
            "default": null
          },
          {
            // 0-n
            "name": "identifier",
            "type": ["null", "StringArray"],
            "default": null
          },
          {
            // 0-n
            "name": "language",
            "type": ["null", "StringArray"],
            "default": null
          },
          {
            // 0-n
            "name": "spatial",
            "type": ["null", "PlaceArray"],
            "default": null
          },
          {   // 0-n
            "name": "publisher",
            "type": ["null", "AgentArray"],
            "default": null
          },
          {
            // 0-n
            "name": "relation",
            "type": ["null", "StringArray"],
            "default": null
          },
          {
            // 0-n
            "name": "isReplacedBy",
            "type": ["null", "StringArray"],
            "default": null
          },
          {
            // 0-n
            "name": "replaces",
            "type": ["null", "StringArray"],
            "default": null
          },
          {
            // 1-n
            "name": "rights",
            "type": "StringArray"
          },
          {   // 0-n
            "name": "rightsHolder",
            "type": ["null", "AgentArray"],
            "default": null
          },
          {
            // 0-n
            "name": "subject",
            "type": ["null", "ConceptArray"],
            "default": null
          },
          {   // 0-n
            "name": "temporal",
            "type": ["null", "TimeSpanArray"],
            "default": null
          },
          {
            // 1-n
            "name": "title",
            "type": "StringArray"
          },
          {
            // 0-n
            "name": "type",
            "type": ["null", "StringArray"],
            "default": null
          }
        ] // fields
      }  // type

    }, // sourceResource

    {
      "name": "title",
      "type": ["null", "string"],
      "default": null,
      "doc": "Only for collection records"
    }

  ]  // fields

}

Notes On Fields

See note above in OriginalRecord for ingestDate.

dpla.avro.v1.MAP4_0.IndexRecord

For DPLA MAPv4 records that have been mapped or enriched, and are being indexed into Elasticsearch.

This has not been merged with dpla.avro.v1.MAP3_1.IndexRecord because we may later want to add information about which Elasticsearch schema a record is destined for.

{
  "namespace": "dpla.avro.v1.MAP4_0",
  "type": "record",
  "name": "IndexRecord",
  "doc": "<DOCUMENTATION>",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "document", "type": "string"}
  ]
}


S3 Bucket Folder Structure and Contents

<PROVIDER>
    "plan"
        <TIMESTAMP>
            <TIMESTAMP>"-"<SCHEMA NAME>".json"
    <ACTIVITY>
        <DATE>
            <TIMESTAMP>"-"<PROVIDER>"-"<SCHEMA NAME>".avro"
                _MANIFEST
                _LOGS
            <TIMESTAMP>"-"<PROVIDER>"-"<SCHEMA NAME>"-prov.json"

Where:

Examples:

A manifest for a whole ingest:

/cdl/plan/20170209_104428/20170209_104428-MappedRecord.json

CDL's records from a harvest started on Feb. 9th, 2017:

/cdl/harvest/20170209/20170209_104428-cdl-OriginalRecord.v1.avro

MDL's records from an enrichment started on Feb. 14th, 2017:

/mdl/enrichment/20170214/20170214_151848-mdl-MAP.4_0.MAPRecord.v1.avro
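The naming scheme can be sketched as a pair of small path builders. This assumes Python, and it assumes that <DATE> is formatted as YYYYMMDD and <TIMESTAMP> as YYYYMMDD_HHMMSS, which is inferred from the examples above rather than stated. The function names are hypothetical.

```python
from datetime import datetime


def avro_path(provider, activity, started, schema_name):
    """Build the path of an activity's Avro output (hypothetical helper).

    `started` is the datetime at which the activity began.
    """
    date = started.strftime("%Y%m%d")
    timestamp = started.strftime("%Y%m%d_%H%M%S")
    return f"/{provider}/{activity}/{date}/{timestamp}-{provider}-{schema_name}.avro"


def prov_path(provider, activity, started, schema_name):
    """Build the path of the provenance file that sits beside the Avro output."""
    avro = avro_path(provider, activity, started, schema_name)
    return avro[: -len(".avro")] + "-prov.json"


# Reproduces the CDL harvest example above.
cdl_harvest = avro_path(
    "cdl", "harvest", datetime(2017, 2, 9, 10, 44, 28), "OriginalRecord.v1"
)
```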

Plan Files and Provenance Files

Plan file format:

{
    <ACTIVITY>: <AVRO FILE>,
    "version": <VERSION>
}

Provenance file format:

{
    "generator": <AVRO FILE>,
    "version": <VERSION>
}

Where:

Notes

Though there is a folder structure with provider names and dates, Avro files are still named with the timestamp, provider, and schema name to make them easier to identify if they're copied to another location.

Spark driver programs can instantiate a DataFrame or RDD by passing a list of paths to the relevant constructor. If we need to do something like reindex all of our providers at once, we can descend through the folder hierarchy and assemble a list of paths corresponding to the most recent activities of all providers. Processing a single provider is simpler, because only that provider's folder needs to be searched.
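Assembling that list of paths might look like the sketch below. It assumes Python, assumes the bucket listing is already available as a plain list of keys, and assumes the YYYYMMDD and YYYYMMDD_HHMMSS formats seen in the examples. KEY_PATTERN and latest_paths are hypothetical names, and the Spark call that would consume the result is deliberately left out.

```python
import re

# Matches <PROVIDER>/<ACTIVITY>/<DATE>/<TIMESTAMP>-....avro and nothing deeper,
# so entries inside an Avro folder (such as _MANIFEST) and plan files are skipped.
KEY_PATTERN = re.compile(
    r"^/?(?P<provider>[^/]+)/(?P<activity>[^/]+)/(?P<date>\d{8})/"
    r"(?P<timestamp>\d{8}_\d{6})-[^/]*\.avro$"
)


def latest_paths(keys, activity):
    """Return the most recent Avro path per provider for one activity."""
    latest = {}  # provider -> (timestamp, key)
    for key in keys:
        match = KEY_PATTERN.match(key)
        if match is None or match.group("activity") != activity:
            continue
        provider = match.group("provider")
        timestamp = match.group("timestamp")
        # Fixed-width timestamps sort chronologically as plain strings.
        if provider not in latest or timestamp > latest[provider][0]:
            latest[provider] = (timestamp, key)
    return sorted(key for _, key in latest.values())
```

The returned list is what would be handed to the Spark reader; for a single provider, the same function can be applied to just that provider's keys.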

Using this folder structure, it should be possible to algorithmically recreate a snapshot of a provider at a particular time; for instance, by retrieving the most recent Avro file of mapped records from a particular date (perhaps by walking backwards until a date is found), and then enriching those and indexing the results. In a case like this, a new provenance file could be created that points to the historical mapped-records Avro file.
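The "walking backwards" idea can be sketched as follows, assuming Python. The list_keys argument is a hypothetical callable that returns the keys under a prefix (in practice it would wrap an S3 listing), the activity name "mapping" used in the usage note is an assumption, and the version value is left to the caller because this document does not define it.

```python
from datetime import date, timedelta


def find_snapshot(list_keys, provider, activity, on_or_before, max_days_back=365):
    """Walk backwards one day at a time until an Avro file is found.

    Returns the most recent Avro path on the first date that has one,
    or None if nothing is found within max_days_back days.
    """
    day = on_or_before
    for _ in range(max_days_back + 1):
        prefix = f"/{provider}/{activity}/{day:%Y%m%d}/"
        candidates = sorted(k for k in list_keys(prefix) if k.endswith(".avro"))
        if candidates:
            # File names start with the timestamp, so the last one is the newest.
            return candidates[-1]
        day -= timedelta(days=1)
    return None


def provenance(generator_path, version):
    """Build the body of a provenance file pointing at a historical Avro file."""
    return {"generator": generator_path, "version": version}
```

For example, find_snapshot(list_keys, "mdl", "mapping", date(2017, 2, 11)) would return the newest mapped-records file from February 11th or, failing that, from the closest earlier date.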

The Plan files (in the "plan" folder and for an activity) are recommended but optional. Our software should be designed so that you can run an activity like a mapping either automatically by passing it the name of a Plan file, or by passing it the name of a generator activity's file directly.
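One way to support both invocation styles is a small resolver that runs before the activity starts. This is a sketch only, assuming Python, and assuming that an argument ending in ".json" is a Plan file while anything else is an Avro path. The name resolve_generator and the read_text callable (which would fetch a file's contents from S3) are hypothetical.

```python
import json


def resolve_generator(argument, generator_activity, read_text):
    """Return the Avro path an activity should read from.

    `argument` is either the name of a Plan file or the generator
    activity's Avro file itself.
    """
    if argument.endswith(".json"):
        # Plan file: look up the Avro file recorded for the generator activity.
        plan = json.loads(read_text(argument))
        if generator_activity not in plan:
            raise KeyError(f"plan has no entry for activity {generator_activity!r}")
        return plan[generator_activity]
    # Otherwise the caller passed the generator's file directly.
    return argument
```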