NARA Special Case

The NARA harvester requires an IDs file, which defaults to /var/tmp/nara_ids. A different file can be targeted with an initializer option (id_source_filename in the enqueue examples below). The file must be present on the worker box that runs the harvest.

We have used this to break the (very large) NARA ingest into multiple smaller ingests. This is a stopgap pending work to break Activities into multiple jobs. The process is cumbersome, but it gives a lot of flexibility in how to triage this large ingest.

Get the IDs file

The file will be provided to DPLA by NARA staff (emailed to ingest@dp.la or attached to the Redmine ticket). Because all harvester workers run on a single worker box, the file only needs to be copied with scp to worker-prod1:/var/tmp/.

Split the IDs file

Break the nara_ids file into a number of smaller files of a chosen size using split:

split ids

split -l 150000 nara_ids nara_ids_

This will split the file into chunks of 150,000 lines (and hence IDs) each; the last chunk holds whatever remains. The new files will be named nara_ids_aa, nara_ids_ab, and so on.
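
Before queueing anything, it can be worth confirming that no IDs were lost in the split. The following is a minimal sketch, not part of the harvester: verify_split is a hypothetical helper name, and it assumes the default two-character suffixes (aa, ab, ...) that split produces.

```shell
# Hypothetical helper (not part of the harvester): compare the line count of
# the original file with the combined line count of its chunks.
# Assumes split's default two-character suffixes (aa, ab, ...).
verify_split() {
  original="$1"   # e.g. nara_ids
  prefix="$2"     # e.g. nara_ids_
  expected=$(wc -l < "$original" | tr -d ' ')
  actual=$(cat "$prefix"?? | wc -l | tr -d ' ')
  if [ "$expected" -eq "$actual" ]; then
    echo "OK: $actual ids across chunks"
  else
    echo "MISMATCH: original has $expected lines, chunks have $actual" >&2
    return 1
  fi
}
```

Run from the directory holding the files, this would be called as verify_split nara_ids nara_ids_.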

Queue individual harvests

Queue a harvest targeting each file. It's a good idea to record the Activity associated with each file so you can follow progress in later stages of the ingest. You can queue all the harvests at once, but it's usually best to limit them to a handful at a time. Resque workers normally pull items off the queue in the order they were added, so keeping the queue short makes it easier to add another harvest without waiting for all NARA jobs to complete.

nara harvest

NaraHarvester.enqueue(id_source_filename: '/var/tmp/nara_ids_aa')
NaraHarvester.enqueue(id_source_filename: '/var/tmp/nara_ids_ab')
NaraHarvester.enqueue(id_source_filename: '/var/tmp/nara_ids_ac')
# ...
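
With many chunk files, typing each call by hand gets tedious. As a rough sketch, the calls above could be generated from the files on disk and then pasted a few at a time wherever you normally run the enqueue calls. print_enqueue_calls is a hypothetical helper name, and the two-character suffix pattern is an assumption based on split's defaults.

```shell
# Hypothetical helper: print one enqueue call per chunk file so the lines can
# be pasted in small batches. It only prints text; it does not queue anything.
# Assumes all chunks sit in one directory and use split's two-character suffixes.
print_enqueue_calls() {
  dir="$1"      # e.g. /var/tmp
  prefix="$2"   # e.g. nara_ids_
  for chunk in "$dir"/"$prefix"??; do
    [ -e "$chunk" ] || continue   # skip the literal pattern when nothing matches
    printf "NaraHarvester.enqueue(id_source_filename: '%s')\n" "$chunk"
  done
}
```

For example, print_enqueue_calls /var/tmp nara_ids_ | head -n 3 would print the calls for the first three chunks only, which fits the advice to queue a handful at a time.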

Map, Enrich, & Index harvested items

Map and enrich each harvest as its own ingest. Again, it is a good idea to track Activity IDs and to avoid queueing many long-running jobs at once.

Ingest tracking

Use this Google Sheet in conjunction with Redmine to track the status of each of the fragment files.
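
To seed the tracking sheet, a list of fragment files with their ID counts can be produced on the worker box. This is only a sketch: chunk_manifest is a hypothetical helper name, and the two CSV columns are an assumption rather than the sheet's actual layout.

```shell
# Hypothetical helper: print a CSV line per fragment file with its line count,
# for pasting into a tracking sheet. Column names are illustrative only.
# Assumes split's default two-character suffixes.
chunk_manifest() {
  dir="$1"      # e.g. /var/tmp
  prefix="$2"   # e.g. nara_ids_
  echo "file,id_count"
  for chunk in "$dir"/"$prefix"??; do
    [ -e "$chunk" ] || continue   # skip the literal pattern when nothing matches
    count=$(wc -l < "$chunk" | tr -d ' ')
    echo "$chunk,$count"
  done
}
```

It would be called as chunk_manifest /var/tmp nara_ids_, with Activity IDs and status then filled in by hand as each harvest is queued.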