...

We have used this to break the (very large) NARA ingest into multiple smaller ingests. This is a stopgap pending work to break Activities into multiple jobs. The process is cumbersome, but it does give a lot of flexibility in how to triage this large ingest.

Get the IDs file

The file will be provided to DPLA by NARA staff (emailed to ingest@dp.la or attached to the Redmine ticket). Since all harvester workers run on a single worker box, the file only needs to be copied with scp to worker-prod1:/var/tmp/

Split the IDs file

Break the nara_ids file into a number of smaller files of a chosen size using split:

...
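
For illustration only, the same kind of split can be sketched in Python. This is not the command used in production; the function name `split_ids_file`, the `nara_ids_part_` prefix, and the three-digit suffix are assumptions made for this example.

```python
from pathlib import Path


def split_ids_file(source, lines_per_file, prefix="nara_ids_part_"):
    """Write consecutive chunks of `source` into numbered fragment files.

    Fragments are created next to the source file. The prefix and the
    numbering scheme are hypothetical, chosen only for this sketch.
    """
    if lines_per_file < 1:
        raise ValueError("lines_per_file must be at least 1")

    source = Path(source)
    written = []  # paths of the fragment files, in creation order
    chunk = []    # lines collected for the fragment being built

    def flush():
        # Zero-padded suffixes keep the fragments sorted in creation order.
        target = source.with_name(f"{prefix}{len(written):03d}")
        target.write_text("".join(chunk))
        written.append(target)
        chunk.clear()

    with source.open() as handle:
        for line in handle:
            chunk.append(line)
            if len(chunk) == lines_per_file:
                flush()

    if chunk:  # final, possibly shorter, fragment
        flush()
    return written
```

Calling `split_ids_file("/var/tmp/nara_ids", 50000)` would write the fragments alongside the source file and return their paths; the chunk size of 50000 is an arbitrary example value.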

Queue a harvest targeting each file. It's a good idea to record the Activity associated with each file so you can follow progress in later stages of the ingest. You can queue all the harvests at once, but it's usually best to limit yourself to a handful at a time. Resque workers normally pull items off the queue in the order they were added, so keeping the queue short makes it easier to add another harvest without waiting for all NARA jobs to complete.

...
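
The batching and tracking advice above could be sketched as follows. This is a minimal illustration, not part of the ingest tooling: `queue_in_batches` is a hypothetical helper, and `queue_harvest` stands in for whatever call actually enqueues a harvest and returns its Activity ID.

```python
def queue_in_batches(fragment_files, queue_harvest, batch_size=3):
    """Queue harvests a handful at a time and record their Activity IDs.

    `queue_harvest` is a placeholder callable (an assumption of this
    sketch) that enqueues one harvest and returns its Activity ID.
    Each step of the generator queues one batch and yields a
    {fragment_file: activity_id} mapping for it. Nothing further is
    queued until the caller asks for the next batch.
    """
    files = list(fragment_files)
    for start in range(0, len(files), batch_size):
        batch = files[start:start + batch_size]
        yield {name: queue_harvest(name) for name in batch}
```

Because the function is a generator, the first `next()` call queues only the first handful of files. Calling `next()` again once those jobs have finished queues the following batch, which leaves room on the Resque queue for other harvests in between.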

Map and enrich each harvest as its own ingest. Again, tracking Activity IDs is a good idea, as is avoiding queueing many long-running jobs at once.

Ingest tracking

Use this Google Sheet in conjunction with Redmine to track the status of each of the fragment files.