Comment: Add location of manifest and log files.

...

In all schemas below, <DOCUMENTATION> is optional, but should contain some contextual note about the Activity that generated the file. This is especially useful for mappings and enrichments, where it could give the name of the harvested-records or mapped-records Avro file from which the file is derived. It should also contain the software version number or git commit identifier of the agent that wrote it.
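One possible way to fill in the <DOCUMENTATION> placeholder is sketched below. The helper names (`build_doc`, `with_documentation`), the "key=value" layout of the note, the commit identifier, and the file name are all assumptions made for illustration. The template is a trimmed stand-in, not one of the real schemas.

```python
import json

# Trimmed, hypothetical template; the real schemas are documented on this page.
SCHEMA_TEMPLATE = {
    "type": "record",
    "name": "OriginalRecord",
    "doc": "<DOCUMENTATION>",
    "fields": [{"name": "id", "type": "string"}],
}


def build_doc(activity, agent_version, derived_from=None):
    """Compose a contextual note: the Activity, the agent version, and
    (for mappings and enrichments) the Avro file the output is derived from."""
    parts = [f"activity={activity}", f"agent={agent_version}"]
    if derived_from:
        parts.append(f"derivedFrom={derived_from}")
    return "; ".join(parts)


def with_documentation(schema, doc):
    """Return a copy of the schema with its "doc" property filled in."""
    filled = dict(schema)
    filled["doc"] = doc
    return filled


# Illustrative values only: "a1b2c3d" stands for a git commit identifier.
doc = build_doc("mapping", "a1b2c3d", "20170209_104428-cdl-OriginalRecord.avro")
print(json.dumps(with_documentation(SCHEMA_TEMPLATE, doc), indent=2))
```

Copying the template rather than editing it in place keeps the placeholder available for the next file that is written.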

...

dpla.avro.v1.MAP3_1.IndexRecord

...

For enriched records coming out of Ingestion 1 in JSON format, ready to be indexed into Elasticsearch.

Code Block
{
  "namespace": "dpla.avro.v1.MAP3_1",
  "type": "record",
  "name": "IndexRecord",
  "doc": "<DOCUMENTATION>",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "document", "type": "string"}
  ]
}

...

dpla.avro.v1.OriginalRecord

...

For Original Records, as they have been harvested from our providers.

Code Block
{
  "namespace": "dpla.avro.v1",
  "type": "record",
  "name": "OriginalRecord",
  "doc": "<DOCUMENTATION>",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "ingestDate", "type": "long", "doc": "UNIX timestamp"},
    {"name": "provider", "type": "string"},
    {"name": "document", "type": "string"},
    {"name": "mimetype",
     "type": {"name": "MimeType",
              "type": "enum",
              "symbols": ["application_json", "application_xml", "text_turtle"]}}
  ]
}

...

Notes On Fields

We will have to translate "_" characters in the mimetype field into "/" characters in our code. The "/" character is not allowed within an enum symbol in an Avro schema.
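The translation described above could look roughly like the following sketch. The function names are hypothetical. Replacing only the first "_" is an assumption, made so that a subtype containing an underscore would survive the round trip.

```python
# The enum symbols declared in the OriginalRecord schema.
MIMETYPE_SYMBOLS = ("application_json", "application_xml", "text_turtle")


def symbol_to_mimetype(symbol: str) -> str:
    """Turn an Avro enum symbol such as "text_turtle" into "text/turtle"."""
    if symbol not in MIMETYPE_SYMBOLS:
        raise ValueError(f"unknown mimetype symbol: {symbol!r}")
    # Assumption: only the first "_" stands in for the "/" separator.
    return symbol.replace("_", "/", 1)


def mimetype_to_symbol(mimetype: str) -> str:
    """Turn a MIME type such as "application/json" into its enum symbol."""
    symbol = mimetype.replace("/", "_", 1)
    if symbol not in MIMETYPE_SYMBOLS:
        raise ValueError(f"unsupported mimetype: {mimetype!r}")
    return symbol


print(symbol_to_mimetype("application_xml"))
print(mimetype_to_symbol("text/turtle"))
```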


...

In Ingestion 1, ingestDate was the time at which the particular record was enriched. It would probably be more useful, however, for it to be the timestamp of the beginning of the harvest, which matters more to the provider and the end user. Expressing ingestDate internally in the Avro file as an integer makes date-range calculations easier, such as queries that filter records. The timestamp can be output as a date string later, when the record is indexed.
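A minimal sketch of why an integer ingestDate is convenient, assuming the timestamp is in seconds since the epoch and interpreted as UTC (the schema only says "UNIX timestamp"). The helper names and the sample records are made up for illustration.

```python
from datetime import datetime, timezone


def to_timestamp(year: int, month: int, day: int) -> int:
    """UNIX timestamp (seconds, UTC) for midnight on the given date."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())


def harvested_between(records, start: int, end: int):
    """Keep records whose ingestDate falls in the half-open range [start, end)."""
    return [r for r in records if start <= r["ingestDate"] < end]


def format_ingest_date(timestamp: int) -> str:
    """Render the integer as a date string, e.g. at indexing time."""
    moment = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    return moment.strftime("%Y-%m-%dT%H:%M:%SZ")


# Hypothetical records; only the fields needed for the example are shown.
records = [
    {"id": "a", "ingestDate": to_timestamp(2017, 2, 9)},
    {"id": "b", "ingestDate": to_timestamp(2017, 3, 1)},
]
february = harvested_between(
    records, to_timestamp(2017, 2, 1), to_timestamp(2017, 3, 1)
)
print([r["id"] for r in february])
print(format_ingest_date(records[0]["ingestDate"]))
```

The range filter is a plain integer comparison, and the string form is produced only at the point of output.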

dpla.avro.v1.MAP4_x.MAPRecord

For DPLA MAP records in our system – those that have been mapped or enriched, and are represented as MAPv4 RDF.

Warning: Not tested yet


Code Block
// See http://dp.la/info/wp-content/uploads/2015/03/MAPv4.pdf
// and https://dp.la/info/wp-content/uploads/2013/04/DPLA-MAP-V3.1-2.pdf
//
// Figures below like "1" or "0-1" indicate field obligations specified in those
// documents.
//
// Fields are mostly taken from the MAPv4 document, except for Place, which is
// pulled from the MAPv3.1 document.
{
  "namespace": "dpla.avro.v1.MAP4_x",
  "type": "record",
  "name": "MAPRecord",
  "doc": "<DOCUMENTATION>",

  "types": [
    {"name": "StringArray", "type": "array", "items": "string"},
    {
      "name": "Collection",  // dcmitype:Collection
      "type": "record",
      "fields": [
        {   // 0-1
          "name": "title",
          "type": ["null", "string"],
          "default": null
        },
        {   // 0-1
          "name": "description",
          "type": ["null", "string"],
          "default": null
        }
      ]
    },
    {"name": "CollectionArray", "type": "array", "items": "Collection"},
    {
      "name": "Agent",  // edm:Agent
      "type": "record",
      "fields": [
        { // 0-1
          "name": "name",
          "type": ["null", "string"],
          "default": null
        },
        { // 0-1
          "name": "providedLabel",
          "type": ["null", "string"],
          "default": null
        },
        { // 0-1
          "name": "note",
          "type": ["null", "string"],
          "default": null
        },
        { // 0-1
          "name": "inScheme",
          "type": ["null", "string"],
          "default": null
        },
        { // 0-n
          "name": "exactMatch",
          "type": ["null", "StringArray"],
          "default": null
        },
        {
          // 0-n
          "name": "closeMatch",
          "type": ["null", "StringArray"],
          "default": null
        }
      ]
    },
    {"name": "AgentArray", "type": "array", "items": "Agent"},
    {
      "name": "Concept",  // skos:Concept
      "type": "record",
      "fields": [
        {
          // 0-1
          "name": "name",  // skos:prefLabel
          "type": ["null", "string"],
          "default": null
        },
        {
          // 0-1
          "name": "providedLabel",
          "type": ["null", "string"],
          "default": null
        },
        {
          // 0-1
          "name": "note",
          "type": ["null", "string"],
          "default": null
        },
        {
          // 0-1
          "name": "inScheme",
          "type": ["null", "string"],
          "default": null
        },
        {
          // 0-n
          "name": "exactMatch",
          "type": ["null", "StringArray"],
          "default": null
        },
        {
          // 0-n
          "name": "closeMatch",
          "type": ["null", "StringArray"],
          "default": null
        }
      ]
    },
    {"name": "ConceptArray", "type": "array", "items": "Concept"},
    {
      "name": "Place",  // dpla:Place from MAP 3.1 spec
      "type": "record",
      "fields": [
        {"name": "name", "type": "string"}, // 1
        {   // 0-1
          "name": "city",
          "type": ["null", "string"],
          "default": null
        },
        {   // 0-1
          "name": "county",
          "type": ["null", "string"],
          "default": null
        },
        {   // 0-1
          "name": "state",
          "type": ["null", "string"],
          "default": null
        },
        {   // 0-1
          "name": "country",
          "type": ["null", "string"],
          "default": null
        },
        {   // 0-1
          "name": "region",
          "type": ["null", "string"],
          "default": null
        },
        {   // 0-1
          "name": "coordinates",
          "type": ["null", "string"],
          "default": null
        }
      ]
    },
    {"name": "PlaceArray", "type": "array", "items": "Place"},
    {
      "name": "TimeSpan",  // edm:TimeSpan
      "type": "record",
      "fields": [
        {   // 0-n
          "name": "providedLabel",
          "type": ["null", "StringArray"],
          "default": null
        },
        {
          // 0-1
          "name": "begin",
          "type": ["null", "string"],
          "default": null
        },
        {
          // 0-1
          "name": "end",
          "type": ["null", "string"],
          "default": null
        }
      ]
    },
    {"name": "TimeSpanArray", "type": "array", "items": "TimeSpan"}
  ],  // types

  "fields": [

    // 1
    {"name": "id", "type": "string", "doc": "DPLA record ID"},

    // 1
    {
      "name": "ingestType",
      "type": {
        "name": "IngestType",
        "type": "enum",
        "symbols": ["item", "collection"]
      }
    },

    // 1
    {
      "name": "ingestDate",
      "type": "long",
      "doc": "UNIX timestamp"
    },

    // 1
    {"name": "dataProvider", "type": "string"},

    // 0-n
    {
      "name": "hasView",
      "type": ["null", "StringArray"],
      "default": null
    },

    // 0-n
    {
      "name": "intermediateProvider",
      "type": ["null", "StringArray"],
      "default": null
    },

    // 1
    {"name": "isShownAt", "type": "string"},

    // 0-1
    {
      "name": "object",
      "type": ["null", "string"],
      "default": null
    },

    // 1
    {"name": "originalRecord", "type": "string"},

    // 1
    {"name": "preview", "type": "string"},

    // 1
    {"name": "provider", "type": "string"},

    // 0-1
    {
      "name": "rightsStatement",
      "type": ["null", "string"],
      "default": null
    },

    // 1
    {
      "name": "sourceResource",
      "type": {
        "type": "record",
        "name": "SourceResource",
        "fields": [

          {   // 0-n
            "name": "alternative",
            "type": ["null", "StringArray"],
            "default": null
          },
          {   // 0-n
            "name": "collection",
            "type": ["null", "CollectionArray"],
            "default": null
          },
          {   // 0-n
            "name": "contributor",
            "type": ["null", "AgentArray"],
            "default": null
          },
          {   // 0-n
            "name": "creator",
            "type": ["null", "AgentArray"],
            "default": null
          },
          {   // 0-n
            "name": "date",
            "type": ["null", "TimeSpanArray"],
            "default": null
          },
          {
            // 0-n
            "name": "description",
            "type": ["null", "StringArray"],
            "default": null
          },
          {
            // 0-n
            "name": "extent",
            "type": ["null", "StringArray"],
            "default": null
          },
          {
            // 0-n
            "name": "format",
            "type": ["null", "StringArray"],
            "default": null
          },
          {
            // 0-n
            "name": "genre",
            "type": ["null", "ConceptArray"],
            "default": null
          },
          {
            // 0-n
            "name": "identifier",
            "type": ["null", "StringArray"],
            "default": null
          },
          {
            // 0-n
            "name": "language",
            "type": ["null", "StringArray"],
            "default": null
          },
          {
            // 0-n
            "name": "spatial",
            "type": ["null", "PlaceArray"],
            "default": null
          },
          {   // 0-n
            "name": "publisher",
            "type": ["null", "AgentArray"],
            "default": null
          },
          {
            // 0-n
            "name": "relation",
            "type": ["null", "StringArray"],
            "default": null
          },
          {
            // 0-n
            "name": "isReplacedBy",
            "type": ["null", "StringArray"],
            "default": null
          },
          {
            // 0-n
            "name": "replaces",
            "type": ["null", "StringArray"],
            "default": null
          },
          {
            // 1-n
            "name": "rights",
            "type": "StringArray"
          },
          {   // 0-n
            "name": "rightsHolder",
            "type": ["null", "AgentArray"],
            "default": null
          },
          {
            // 0-n
            "name": "subject",
            "type": ["null", "ConceptArray"],
            "default": null
          },
          {   // 0-n
            "name": "temporal",
            "type": ["null", "TimeSpanArray"],
            "default": null
          },
          {
            // 1-n
            "name": "title",
            "type": "StringArray"
          },
          {
            // 0-n
            "name": "type",
            "type": ["null", "StringArray"],
            "default": null
          }
        ] // fields
      }  // type

    }, // sourceResource

    // 0-1
    {
      "name": "title",
      "type": ["null", "string"],
      "default": null,
      "doc": "Only for collection records"
    }

  ]  // fields

}

Notes About Fields

See note above in OriginalRecord for ingestDate.
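The obligation figures in the schema comments ("1" and "1-n") mark fields that must always be present. A minimal sketch of a check that could run before a record is written, assuming records are plain Python dicts shaped like the schema. The function name is hypothetical, and the field lists are transcribed from the comments above, so they inherit the schema's untested status.

```python
# Fields marked "1" at the top level of MAPRecord.
REQUIRED_TOP_LEVEL = (
    "id", "ingestType", "ingestDate", "dataProvider", "isShownAt",
    "originalRecord", "preview", "provider", "sourceResource",
)
# Fields marked "1-n" inside sourceResource.
REQUIRED_SOURCE_RESOURCE = ("rights", "title")


def missing_required(record: dict) -> list:
    """Return the names of required fields that are absent or empty."""
    missing = [
        name for name in REQUIRED_TOP_LEVEL
        if record.get(name) in (None, "")
    ]
    source = record.get("sourceResource")
    if isinstance(source, dict):
        # A "1-n" field needs at least one value, so an empty list also fails.
        missing += [
            f"sourceResource.{name}"
            for name in REQUIRED_SOURCE_RESOURCE
            if not source.get(name)
        ]
    return missing


# Hypothetical record with two problems: no preview and an empty title list.
record = {
    "id": "abc123",
    "ingestType": "item",
    "ingestDate": 1486637068,
    "dataProvider": "Example Library",
    "isShownAt": "http://example.org/item/1",
    "originalRecord": "{}",
    "provider": "Example Hub",
    "sourceResource": {"rights": ["Public domain"], "title": []},
}
print(missing_required(record))
```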

dpla.avro.v1.MAP4_0.IndexRecord

For DPLA MAPv4 records that have been mapped or enriched, and are being indexed into Elasticsearch.

...

Code Block
{
  "namespace": "dpla.avro.v1.MAP4_0",
  "type": "record",
  "name": "IndexRecord",
  "doc": "<DOCUMENTATION>",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "document", "type": "string"}
  ]
}


S3 Bucket Folder Structure and Contents

Code Block
<PROVIDER>
    "plan"
        <TIMESTAMP>"-"<SCHEMA NAME>".json"
    <ACTIVITY>
        <DATE>
            <TIMESTAMP>"-"<PROVIDER>"-"<SCHEMA NAME>".avro"
                _MANIFEST
                _LOGS
            <TIMESTAMP>"-"<PROVIDER>"-"<SCHEMA NAME>"-prov.json"

...

  • <PROVIDER> is a token like cdl or mdl.
  • <DATE> is a string date representation in the form yyyymmdd, for example, "20170209". This is always the date when the relevant event started.
  • <TIMESTAMP> is a string timestamp representation in the form yyyymmdd_hhmmss, for example, "20170209_104428". This is always the time when the relevant event started.
  • <ACTIVITY> is one of harvest, mapping, enrichment, or indexing.
  • <SCHEMA NAME> corresponds to the "name" property of the Avro file schema documented above minus the la.dp.avro prefix.
  • plan is for manifests of whole ingests, where the term "plan" refers to the PROV-O notion of a Plan, or sequence of activities. The manifest is a set of JSON objects that documents which Avro files are relevant elsewhere in the provider's folder tree. A Plan can include one or more Activities, so a standalone remapping, for instance, could be a Plan with just that one activity. The JSON files in the "plan" directory are write-once: they are not modified after being written. They function as a journal of all of the activities that have happened in a particular ingest.

  • The files ending in "-prov.json" document which Plan the relevant Avro file belongs to, or which Avro file represents the generator Activity from which the relevant one is derived. For example, if you are looking at a particular mapping Avro file, this can direct you to the harvest Avro file whose records were mapped. This may be redundant with the information in a Plan file, but can make it easier to find things.

  • _MANIFEST is a file containing basic information about the activity, such as the source of any input files.
  • _LOGS is a directory containing log files that pertain to the activity.
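The naming rules in the list above can be sketched as a small key builder. This assumes "/"-separated S3 keys and that both <DATE> and <TIMESTAMP> come from the same start time. The function names and the schema name "OriginalRecord" are illustrative, not taken from any existing code.

```python
from datetime import datetime

ACTIVITIES = ("harvest", "mapping", "enrichment", "indexing")


def _base_key(provider, activity, started, schema_name):
    """Shared prefix: <PROVIDER>/<ACTIVITY>/<DATE>/<TIMESTAMP>-<PROVIDER>-<SCHEMA NAME>"""
    if activity not in ACTIVITIES:
        raise ValueError(f"unknown activity: {activity!r}")
    date = started.strftime("%Y%m%d")              # yyyymmdd
    timestamp = started.strftime("%Y%m%d_%H%M%S")  # yyyymmdd_hhmmss
    return f"{provider}/{activity}/{date}/{timestamp}-{provider}-{schema_name}"


def avro_key(provider, activity, started, schema_name):
    """Key of the Avro output for one activity."""
    return _base_key(provider, activity, started, schema_name) + ".avro"


def prov_key(provider, activity, started, schema_name):
    """Key of the companion "-prov.json" file."""
    return _base_key(provider, activity, started, schema_name) + "-prov.json"


started = datetime(2017, 2, 9, 10, 44, 28)  # when the harvest began
print(avro_key("cdl", "harvest", started, "OriginalRecord"))
print(prov_key("cdl", "harvest", started, "OriginalRecord"))
```

Deriving the date and the timestamp from one value keeps the folder name and the file name in agreement, since both are defined as the moment the event started.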

Examples:

A manifest for a whole ingest:

...