A library for harvesting records and sets from OAI-PMH repositories to Spark SQL.

...

The Spark OAI Harvester is currently housed within a larger application called "Ingestion3".  Ingestion3 is an ETL system for cultural heritage metadata, and is under active development.

...

You can get this library from the DPLA Ingestion3 GitHub repo.  To build a JAR file, run sbt package from the project root.  To execute the test suite, run sbt test.

Linking

We do not currently offer SBT or Maven coordinates for this library, but may do so in the future.

...

  • Partitioning.  The harvester strategically partitions data during the harvest to make the process more efficient.  For example, when harvesting records from multiple sets, it runs HTTP requests in parallel.  The results of the OAI harvest are loaded into a Spark DataFrame, a distributed collection of data.
  • Harvest sets and records.  You can harvest OAI sets or records.  If you harvest records, this harvester returns data about both the records and any sets to which they belong.
  • Specify which sets to harvest records from.  You can tell this harvester which OAI sets to include in or exclude from a records request.
  • Flow control.  Often, an OAI harvest consists of a series of HTTP requests and responses.  In this case, each individual response contains a partial list of the requested data along with a resumption token, which is used to compose the subsequent request.  This harvester handles the flow control.  You need only compose the initial request; the harvester will compose all necessary subsequent requests and return a complete set of data.  For a standard records or sets harvest, the HTTP requests must be executed sequentially.  However, in the case of a records request from known sets, the harvester can initiate an independent series of sequential requests for each set.  These request series can run in parallel, speeding up the harvest process.
  • Error handling.  This harvester performs as much of the harvest as possible.  It returns any errors encountered during the process along with successfully harvested data.  This allows you to retain and examine records, sets, and errors from a partially successful harvest.
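
The flow control, per-set parallelism, and error handling described above can be illustrated with a minimal sketch in plain Scala.  This is not the harvester's actual code or API.  Page, fetchPage, harvestSet, harvestSets, and the canned repository data are all hypothetical names invented for illustration, and the fake fetchPage stands in for a real HTTP request.  The sketch uses standard-library Futures rather than Spark so that it stays self-contained.

```scala
import scala.annotation.tailrec
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.util.{Failure, Success, Try}

// Hypothetical model of one OAI-PMH response page: some records plus an
// optional resumption token that points at the next page.
final case class Page(records: List[String], resumptionToken: Option[String])

object FlowControlSketch {
  // Hypothetical stand-in for a repository. A real harvester would issue
  // HTTP requests; this map serves canned pages keyed by (set, token).
  // Keying by set is a simplification made for this fake only.
  private val fakeRepository: Map[(String, Option[String]), Page] = Map(
    ("setA", None)       -> Page(List("a1", "a2"), Some("t1")),
    ("setA", Some("t1")) -> Page(List("a3"), None),
    ("setB", None)       -> Page(List("b1"), None)
  )

  // Simulated request: fails for any (set, token) pair the fake does not know.
  def fetchPage(set: String, token: Option[String]): Try[Page] =
    Try(fakeRepository.getOrElse(
      (set, token),
      throw new RuntimeException(s"bad request for set $set")))

  // Follow resumption tokens sequentially within one set. On failure, keep
  // the records harvested so far and return the error alongside them.
  def harvestSet(set: String): (List[String], Option[Throwable]) = {
    @tailrec
    def loop(token: Option[String], acc: List[String]): (List[String], Option[Throwable]) =
      fetchPage(set, token) match {
        case Failure(e)                   => (acc, Some(e))
        case Success(Page(records, None)) => (acc ++ records, None)
        case Success(Page(records, next)) => loop(next, acc ++ records)
      }
    loop(None, Nil)
  }

  // Each set gets its own sequential request series; the series themselves
  // run in parallel as Futures.
  def harvestSets(sets: List[String]): Map[String, (List[String], Option[Throwable])] = {
    val futures = sets.map(s => Future(s -> harvestSet(s)))
    Await.result(Future.sequence(futures), 30.seconds).toMap
  }
}
```

With the canned data above, FlowControlSketch.harvestSets(List("setA", "setB", "missing")) returns the three records of setA and the single record of setB with no error, while the unknown set "missing" yields no records and a captured error instead of aborting the whole harvest.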

...