...
Code Block |
---|
<PROVIDER>
"plan"
<TIMESTAMP>
<TIMESTAMP>"-"<SCHEMA NAME>".json"
<ACTIVITY>
<DATE>
<TIMESTAMP>"-"<PROVIDER>"-"<SCHEMA NAME>".avro"
_MANIFEST
_LOGS
<TIMESTAMP>"-"<PROVIDER>"-"<SCHEMA NAME>"-prov.json" |
...
<PROVIDER> is a token like cdl or mdl.
<DATE> is a string date representation in the form yyyymmdd, for example, "20170209". This is always the date when the relevant event started.
<TIMESTAMP> is a string timestamp representation in the form yyyymmdd_hhmmss, for example, "20170209_104428". This is always the time when the relevant event started.
<ACTIVITY> is one of harvest, mapping, enrichment, or indexing.
<SCHEMA NAME> corresponds to the "name" property of the Avro file schema documented above, minus the la.dp.avro prefix.
plan is for manifests of whole ingests, where the term "plan" refers to the PROV-O notion of a Plan, or sequence of activities. The manifest is a set of JSON objects that documents which Avro files are relevant elsewhere in the provider's folder tree. A Plan can include one or more Activities, so a standalone remapping, for instance, could be part of a Plan with just that one Activity. The JSON files in the "plan" directory are write-once: they are not modified after being written. They function as a journal of all of the activities that have happened in the particular ingest.
The files ending in "-prov.json" document which Plan the relevant Avro file belongs to, or which Avro file represents the generator Activity from which the relevant one is derived. For example, if you are looking at a particular mapping Avro file, this can direct you to the harvest Avro file whose records were mapped. This may be redundant with the information in a Plan file, but can make it easier to find things.
_MANIFEST is a file containing basic information about the activity, such as the source of any input files.
_LOGS is a directory containing log files that pertain to the activity.
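
The naming rules above can be sketched in a few lines of Python. This is only an illustration of the conventions as described here, not the ingestion system's own code: the function names are hypothetical, the schema name "la.dp.avro.ExampleRecord" is a made-up placeholder, and two details are assumptions (that the dot following the la.dp.avro prefix is dropped along with it, and that provider tokens never contain a hyphen, which the parser relies on). It builds file names only and says nothing about how the directories nest.

```python
import re
from datetime import datetime

# The four activity directory names listed above.
ACTIVITIES = ("harvest", "mapping", "enrichment", "indexing")

# Assumption: the dot that follows the prefix is removed together with it.
SCHEMA_PREFIX = "la.dp.avro."

# Assumption: provider tokens such as "cdl" contain no hyphen.
_AVRO_NAME = re.compile(r"^(\d{8}_\d{6})-([^-]+)-(.+)\.avro$")


def date_token(started: datetime) -> str:
    """<DATE>: yyyymmdd for the moment the event started."""
    return started.strftime("%Y%m%d")


def timestamp_token(started: datetime) -> str:
    """<TIMESTAMP>: yyyymmdd_hhmmss for the moment the event started."""
    return started.strftime("%Y%m%d_%H%M%S")


def schema_token(schema_name: str) -> str:
    """<SCHEMA NAME>: the Avro schema "name" minus the la.dp.avro prefix."""
    if schema_name.startswith(SCHEMA_PREFIX):
        return schema_name[len(SCHEMA_PREFIX):]
    return schema_name


def avro_name(provider: str, started: datetime, schema_name: str) -> str:
    """<TIMESTAMP>-<PROVIDER>-<SCHEMA NAME>.avro"""
    return f"{timestamp_token(started)}-{provider}-{schema_token(schema_name)}.avro"


def prov_name(provider: str, started: datetime, schema_name: str) -> str:
    """<TIMESTAMP>-<PROVIDER>-<SCHEMA NAME>-prov.json"""
    return f"{timestamp_token(started)}-{provider}-{schema_token(schema_name)}-prov.json"


def plan_manifest_name(started: datetime, schema_name: str) -> str:
    """<TIMESTAMP>-<SCHEMA NAME>.json, as found under the "plan" directory."""
    return f"{timestamp_token(started)}-{schema_token(schema_name)}.json"


def parse_avro_name(name: str) -> tuple[datetime, str, str]:
    """Split an Avro file name back into (start time, provider, schema name)."""
    match = _AVRO_NAME.match(name)
    if match is None:
        raise ValueError(f"not a recognised Avro file name: {name!r}")
    timestamp, provider, schema = match.groups()
    return datetime.strptime(timestamp, "%Y%m%d_%H%M%S"), provider, schema


if __name__ == "__main__":
    started = datetime(2017, 2, 9, 10, 44, 28)
    # Prints: 20170209_104428-cdl-ExampleRecord.avro
    print(avro_name("cdl", started, "la.dp.avro.ExampleRecord"))
```

Parsing the timestamp back with strptime also validates it, so a name with an impossible date or time is rejected and not silently accepted.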
Examples:
A manifest for a whole ingest:
...