Delete specific records

This document covers how to delete specific records from the search index and from the Marmotta triplestore. This could be necessary if, for example, a set is ingested that should not have been, or there is some serious metadata problem with particular records. It does not cover the routine removal, from one harvest to the next, of records that a provider has deleted from its repository.

From the search index

Use elasticdump. This is a Node.js utility that's designed for backing up and transferring Elasticsearch indices, but has a useful deletion mode. We will use it to simultaneously delete records from the search index based on a query, and save those records to a file.

If you don't have Node.js and npm set up, you'll need to do so. If you're using OS X, we recommend the Homebrew package manager and its node package, which gives you both node and npm. [I have found since first writing this document that it's nicer to use NVM (Node Version Manager). --mb] After you've installed npm, you can install elasticdump by following the instructions in its README.

Be sure to install version 0.7.7 of elasticdump.  Somewhere between that version and the latest one, it became incompatible with Elasticsearch version 0.90, which is what we still use. We know that 0.7.7 works.

Example:

npm install elasticdump@0.7.7 -g

The next step is to figure out how you want to query for the records to delete. In most cases, you will probably match on a sourceResource.collection.title value. You can also use any other query that makes sense, as defined by the Elasticsearch Query DSL.
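Because the deletion is driven entirely by the query, it can be worth checking how many records a query matches before deleting anything. The following is a minimal sketch, not a tested procedure: the function names are hypothetical, the host and the collection title are placeholders for your own values, and it assumes that your Elasticsearch version accepts a search request with size=0 and reports the match count under hits.total in the response.

```shell
#!/bin/sh
# Sketch only: build_search_body and count_matches are illustrative names.

# Build a term query body for one field and one exact value.
# Note: the value is inserted as-is, so it must not contain quotes or backslashes.
build_search_body() {
    # $1: field name, $2: exact term value
    printf '{"query":{"term":{"%s":"%s"}}}' "$1" "$2"
}

# Run the query without returning any hits (size=0).
# Assumption: the match count appears under hits.total in the JSON response.
count_matches() {
    # $1: Elasticsearch host, $2: index or alias, $3: search body
    curl -s -XPOST "http://$1:9200/$2/_search?size=0" -d "$3"
}

body=$(build_search_body sourceResource.collection.title p16373coll60)
echo "$body"

# Uncomment to send the query to your own server:
# count_matches ELASTICSEARCH_LOADBALANCER_HOSTNAME dpla_alias "$body"
```

If the reported total is far larger or smaller than the number of records you expect to remove, revisit the query before running the delete.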

Here is an example where we're deleting all records from a particular OAI setSpec. ELASTICSEARCH_LOADBALANCER_HOSTNAME is just the Elasticsearch server hostname. At the DPLA we use a load balancer, but you should use whichever Elasticsearch hostname you normally use. The deleted records are written to the file given by --output. In this example elasticdump is installed in a local directory and is invoked with "./". If you have installed it globally and it is in your path, leave off the "./".

./elasticdump --delete --input=http://ELASTICSEARCH_LOADBALANCER_HOSTNAME:9200/dpla_alias --output=/path/to/deleted-records/p16373coll60.json --searchBody '{"query":{"term":{"sourceResource.collection.title":"p16373coll60"}}}'
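If you delete several sets, it may help to generate the command from the set name so that the output file and the query always agree. The sketch below only prints the command for review and does not run it. The variable and function names are hypothetical, and it uses only the elasticdump options shown above. It also refuses to proceed when a dump file with the same name already exists, on the assumption that you do not want a second run to replace an earlier backup of deleted records.

```shell
#!/bin/sh
# Sketch only: ES_HOST, DUMP_DIR and the function names are illustrative.
ES_HOST="ELASTICSEARCH_LOADBALANCER_HOSTNAME"
DUMP_DIR="/path/to/deleted-records"

# Path of the backup file for one collection title.
dump_path() {
    printf '%s/%s.json' "$DUMP_DIR" "$1"
}

# Print (do not run) the elasticdump invocation for one collection title.
delete_command() {
    printf "elasticdump --delete --input=http://%s:9200/dpla_alias --output=%s --searchBody '%s'\n" \
        "$ES_HOST" "$(dump_path "$1")" \
        "{\"query\":{\"term\":{\"sourceResource.collection.title\":\"$1\"}}}"
}

# Print the command only if no dump file with this name exists yet.
plan_delete() {
    if [ -e "$(dump_path "$1")" ]; then
        echo "refusing: $(dump_path "$1") already exists" >&2
        return 1
    fi
    delete_command "$1"
}

plan_delete p16373coll60
```

Review the printed command, then paste it into your shell to perform the deletion.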

From the triplestore

A method remains to be devised to delete the records deleted above from the triplestore. The method documented below should not be attempted, as it will result in 409 Conflict errors in the future if the records are reharvested.

You need to have executed the command above, giving yourself a JSON file that contains record IDs.  With that, you can run the following command to delete records from your triplestore, where "ldp.dp.la" needs to resolve to your Marmotta server. You will probably want to have that defined in your /etc/hosts file.

# DO NOT DO THIS ...
 
cd /path/to/deleted-records  # See above.  Note that you could have multiple files for multiple sets.
 
cat * | jq -r '.[] | "http://ldp.dp.la/ldp/items/" + ._id, ._source.originalRecord' | xargs -L 20 -I = curl -s -o /dev/null -w "%{http_code} =\n" -X DELETE =