A library for harvesting records and sets from OAI-PMH repositories to Spark SQL.

...

The Spark OAI Harvester is currently housed within a larger application called "Ingestion3".  Ingestion3 is an ETL system for cultural heritage metadata, and is under active development.

...

You can get this library from the DPLA Ingestion3 GitHub repo.  To build a JAR file, run sbt package from the project root.  To run the test suite, run sbt test.

Linking

We do not currently offer SBT or Maven coordinates for this library, but may do so in the future.

...

| Option | Obligation | Usage |
|---|---|---|
| endpoint | Required. | The base URL for the OAI repository. |
| verb | Required. | "ListSets" to harvest only sets; "ListRecords" to harvest records and any sets to which the records may belong. Case-sensitive. |
| outputDir | Required. | Location to save output. This should be a local path, but Amazon S3 may be supported at some point in the future. |
| metadataPrefix | Required when verb="ListRecords"; prohibited when verb="ListSets". | The metadata format in OAI-PMH requests issued to the repository. |
| harvestAllSets | Optional when verb="ListRecords"; cannot be used in conjunction with either setlist or blacklist. | "True" to harvest records from all sets. Default is "false". Case-insensitive. Results will include all sets and all their records. Only records that belong to at least one set are returned; records that do not belong to any set will not be included in the results. |
| setlist | Optional when verb="ListRecords"; cannot be used in conjunction with either harvestAllSets or blacklist. | Comma-separated list of sets to include in the harvest. Use the OAI setSpec to identify a set. Results will include all sets in the setlist and all their records. |
| blacklist | Optional when verb="ListRecords"; cannot be used in conjunction with either harvestAllSets or setlist. | Comma-separated list of sets to exclude from the harvest. Use the OAI setSpec to identify a set. Results will include all sets not in the blacklist and all their records. Records that do not belong to any set will not be included in the results. |
| provider | Required. | The name of the source of the records. |
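
The obligation rules for these options can be checked before a harvest is started. The following is a minimal sketch in plain Scala, not code from the harvester itself: the object name OaiOptionCheck and its validate method are hypothetical, and it assumes that merely supplying harvestAllSets, setlist, or blacklist counts as using that option.

```scala
// Hypothetical helper (not part of the library): checks an option map
// against the obligation rules documented for the harvester options.
object OaiOptionCheck {
  private val alwaysRequired = Seq("endpoint", "verb", "outputDir", "provider")
  private val setOptions = Seq("harvestAllSets", "setlist", "blacklist")

  /** Returns a list of problems; an empty list means the options look valid. */
  def validate(opts: Map[String, String]): List[String] = {
    val errors = scala.collection.mutable.ListBuffer.empty[String]

    // endpoint, verb, outputDir and provider are required in every case.
    for (key <- alwaysRequired if !opts.contains(key))
      errors += s"missing required option: $key"

    // Assumption: an option counts as "used" whenever it is present.
    val setOptionsPresent = setOptions.filter(opts.contains)

    opts.get("verb") match {
      case Some("ListRecords") =>
        if (!opts.contains("metadataPrefix"))
          errors += "metadataPrefix is required when verb=ListRecords"
        if (setOptionsPresent.size > 1)
          errors += s"mutually exclusive options: ${setOptionsPresent.mkString(", ")}"
      case Some("ListSets") =>
        if (opts.contains("metadataPrefix"))
          errors += "metadataPrefix is prohibited when verb=ListSets"
      case Some(other) =>
        // The verb is case-sensitive, so "listrecords" ends up here.
        errors += s"unsupported verb: $other"
      case None => // already reported as a missing required option
    }
    errors.toList
  }
}
```

For example, a map containing both setlist and blacklist yields a single "mutually exclusive options" error, while a ListRecords map without metadataPrefix yields a single "metadataPrefix is required" error.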


Record harvests that include one of the options harvestAllSets, setlist, or blacklist generally run faster than those that do not.  This is because a known list of sets allows the harvester to distribute the process of making HTTP requests across multiple nodes.  Therefore, it is advantageous to use these options wherever possible.
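
To illustrate why a known list of sets helps, the sketch below turns a set selection into one independent initial ListRecords request URL per set. It is not the harvester's actual implementation: the object SetRequests and its method names are hypothetical, and only the query parameters (verb, metadataPrefix, set) come from the OAI-PMH protocol. Because the pages within a single set must be fetched in sequence through resumption tokens, the set is the natural unit of work to hand to different nodes.

```scala
import java.net.URLEncoder

// Hypothetical helper (not part of the library): derives one starting
// request per set so that each set can be harvested independently.
object SetRequests {
  /** Splits a comma-separated option value into trimmed setSpecs. */
  def parseSets(value: String): List[String] =
    value.split(",").map(_.trim).filter(_.nonEmpty).toList

  /** Applies setlist or blacklist to the sets advertised by the repository.
    * With neither option given, every advertised set is selected, which
    * corresponds to harvestAllSets. */
  def selectSets(allSets: List[String],
                 setlist: Option[String],
                 blacklist: Option[String]): List[String] =
    (setlist, blacklist) match {
      case (Some(include), _) =>
        val wanted = parseSets(include).toSet
        allSets.filter(wanted.contains)
      case (None, Some(exclude)) =>
        val unwanted = parseSets(exclude).toSet
        allSets.filterNot(unwanted.contains)
      case (None, None) => allSets
    }

  /** Builds the first ListRecords request URL for a single set. */
  def listRecordsUrl(endpoint: String, metadataPrefix: String, setSpec: String): String = {
    def enc(s: String): String = URLEncoder.encode(s, "UTF-8")
    s"$endpoint?verb=ListRecords&metadataPrefix=${enc(metadataPrefix)}&set=${enc(setSpec)}"
  }
}
```

Each resulting URL starts a separate request sequence, so a collection of them could be spread across workers, for instance by parallelizing the list in Spark.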

...