All data in the DPLA repository is available for download as gzipped JSON and Parquet files. These include the standard DPLA fields, as well as the complete record received from the partner.

...

The files are stored on Amazon Simple Storage Service (S3) in the bucket named s3://dpla-provider-export.

For more details about how to access and download these files from S3, see the S3 documentation.

Current JSON format

Files are formatted as JSONL and have the following structure. Every line is a JSON object.

Code Block
{
    ...
    "_source": { ... record ... }
    ...
}
{ ... another record ... }
... more records ...

This is a straight dump of an Elasticsearch index and has some fields outside of "_source" that you can ignore.
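As an illustration of reading this format, here is a minimal Python sketch that streams a gzipped JSONL file one line at a time and keeps only the "_source" object. The function name iter_records and the file name in the usage note are hypothetical, and the sketch assumes the file is gzip-compressed and UTF-8 encoded.

```python
import gzip
import json
from typing import Iterator


def iter_records(path: str) -> Iterator[dict]:
    """Yield the "_source" object from each line of a gzipped JSONL file."""
    # Text mode ("rt") decodes the stream so each iteration returns one line.
    with gzip.open(path, "rt", encoding="utf-8") as lines:
        for line in lines:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            hit = json.loads(line)
            # Fields outside "_source" come from Elasticsearch and are skipped.
            yield hit["_source"]
```

A caller might loop with `for record in iter_records("export.jsonl.gz"):`, where the file name is a placeholder. Because the file is read line by line, memory use stays roughly constant regardless of file size.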

Former JSON file formats

Before August 2018, the file format was as follows. Note that this is a JSON array.

Code Block
[
    {
        ...
        "_source": { ... record ... }
        ...
    },
    ... more ...
]
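For older files in this array format, a reader along the following lines may be enough. This is a sketch only: the function name load_array_records is hypothetical, and it assumes the file has already been decompressed and fits in memory, since a JSON array has to be parsed as a whole by the standard json module.

```python
import json
from typing import List, TextIO


def load_array_records(fp: TextIO) -> List[dict]:
    """Return the "_source" object of every element in a JSON-array file."""
    # json.load parses the whole array at once, so memory use grows
    # with the size of the file.
    hits = json.load(fp)
    return [hit["_source"] for hit in hits]
```

Unlike the line-oriented format, this layout cannot be processed incrementally with the standard library alone.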


If you wrote software to process our files before December 16th, 2015, it was designed to work with one of the following structures.

The first format resulted from our old method of exporting the data from CouchDB views, where each element of "rows" had a "doc" property, as follows.

Code Block
{
    "total_rows": <number>,
    "rows": [
        {
            "doc": {
                ... record ...
            }
        },
        ... more rows ...
    ]
}

...

The "doc" property of each object in "rows" is the same JSON object that you would get back from our API for an individual record.

Prior to May 28th, 2014, we also included various other CouchDB-related properties alongside "doc" in every row element.
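Software written against the CouchDB-style structure could pull the records out of "rows" roughly as sketched below. The function names are hypothetical. The sketch reads only the "doc" property of each row, so any other properties that sit alongside "doc" are ignored.

```python
import json
from typing import List, TextIO


def docs_from_rows(export: dict) -> List[dict]:
    """Return the "doc" object from every element of "rows"."""
    # Only "doc" is read; sibling properties in each row are ignored.
    return [row["doc"] for row in export["rows"]]


def load_docs(fp: TextIO) -> List[dict]:
    """Parse an already-decompressed export file and return its records."""
    return docs_from_rows(json.load(fp))
```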

New file format

We will be changing the JSON structure of our export files at a date still to be determined. The existing format is a legacy of the way we used to export the direct output of CouchDB views, where each element of "rows" had a "doc" property. The new format will be simpler and will result in smaller files, especially for the larger exports. The format that we are currently considering is as follows:

Code Block
[
    {  ... record ... },
    ... more records ...
]

 

The second format that we used on some of the older files was the JSON array format described above.

Please let us know if you have any comments or questions about the new format, using our contact form.