Database export files

All data in the DPLA repository is available for download as zipped JSON and Parquet files from Amazon Simple Storage Service (S3), in the bucket s3://dpla-provider-export.

For more details about how to access and download these files from S3, see the S3 documentation.

Current JSON format

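Files are formatted as JSONL (JSON Lines): every line is a complete JSON object with the following structure. The sample below is spread over several lines for readability; in the files, each object occupies a single line.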

{
	...
	"_source": { ... record ... }
	...
}
{ ... another record ... }
... more records ... 

Each file is a straight dump of an Elasticsearch index, so every object has some fields outside of "_source" that you can ignore. The record itself is the value of "_source".
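
As a minimal sketch of reading this format, the Python below parses one JSON object per line and keeps only the value of "_source". The function names iter_records and read_export are hypothetical, and read_export assumes the downloaded file is gzip-compressed and UTF-8 encoded; adjust the opener if your file is compressed differently.

```python
import gzip
import json
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the "_source" record from each non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between objects
        hit = json.loads(line)
        # Fields outside "_source" are Elasticsearch metadata and are ignored.
        yield hit["_source"]


def read_export(path: str) -> Iterator[dict]:
    """Stream records from one export file.

    Assumption: the file is gzip-compressed, UTF-8 encoded text.
    """
    with gzip.open(path, "rt", encoding="utf-8") as fp:
        yield from iter_records(fp)
```

Because the file is read line by line, memory use stays roughly constant regardless of file size.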

Former JSON file formats

Before August 2018, each file was a single JSON array rather than JSONL, structured as follows.

[
	{
		...
		"_source": { ... record ... }
		...
	},
	... more ...
]
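
A file in this array format cannot be read line by line. One possible approach, sketched below in Python, is to parse the whole array and then pull out each "_source" value. The function name iter_array_records is hypothetical, and the sketch assumes the file has already been decompressed and opened as text.

```python
import json
from typing import Iterator, TextIO


def iter_array_records(fp: TextIO) -> Iterator[dict]:
    """Yield the "_source" record from each element of a JSON array file."""
    # json.load parses the entire array at once, so the whole file
    # must fit in memory.
    for hit in json.load(fp):
        yield hit["_source"]
```

For very large files, a streaming JSON parser would avoid loading everything at once; that is outside the scope of this sketch.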


If you wrote software to process our files before December 16th, 2015, it was designed to work with one of the following structures.

The first format resulted from our old method of exporting the data from CouchDB views, where each element of "rows" had a "doc" property, as follows.

{
    "total_rows": <number>,
    "rows": [
        {
            "doc": { ... record ... }
        },
        ... more rows ...
    ]
}

The second format, used for some of the older files, was the JSON array format described above.
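
If you need to handle both of these older structures, a sketch like the following can tell them apart after parsing. The function name iter_legacy_records is hypothetical. It assumes the top-level value is either an object with a "rows" list (the CouchDB format) or an array whose elements carry "_source" (the array format).

```python
import json
from typing import Any, Iterator


def iter_legacy_records(data: Any) -> Iterator[dict]:
    """Yield records from either pre-December-2015 structure."""
    if isinstance(data, dict) and "rows" in data:
        # CouchDB view export: each row wraps the record in "doc".
        for row in data["rows"]:
            yield row["doc"]
    elif isinstance(data, list):
        # JSON array export: each element wraps the record in "_source".
        for hit in data:
            yield hit["_source"]
    else:
        raise ValueError("unrecognized export structure")
```

Pass it the result of json.load on an opened, decompressed file.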

If you have any comments or questions about the new format, please let us know through our contact form.