A library for harvesting records and sets from OAI-PMH repositories to Spark SQL.

...

The Spark OAI Harvester is currently housed within a larger application called "Ingestion3".  Ingestion3 is an ETL system for cultural heritage metadata, and is under active development.

...

You can get this library from the DPLA Ingestion3 GitHub repo. To build a JAR file, run sbt package from the project root. To execute the test suite, run sbt test.

Linking

We do not currently offer SBT or Maven coordinates for this library, but may do so in the future.

...

Code Block
root
 |-- set: struct (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- document: string (nullable = true)
 |    |-- setSource: struct (nullable = true)
 |    |    |-- queryParams: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- url: string (nullable = true)
 |    |    |-- text: string (nullable = true)
 |-- record: struct (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- document: string (nullable = true)
 |    |-- setIds: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- recordSource: struct (nullable = true)
 |    |    |-- queryParams: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- url: string (nullable = true)
 |    |    |-- text: string (nullable = true)
 |-- error: struct (nullable = true)
 |    |-- message: string (nullable = true)
 |    |-- errorSource: struct (nullable = true)
 |    |    |-- queryParams: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- url: string (nullable = true)
 |    |    |-- text: string (nullable = true)
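
The schema above can also be read as a set of nested, optional values. The following is a minimal sketch of that shape in plain Scala. The names RequestSource, OaiSet, OaiRecord, OaiError, HarvestRow, and HarvestRows are hypothetical and are not taken from the Ingestion3 code base, and the sketch assumes that each row populates at most one of set, record, and error, which the schema itself does not state.

```scala
// Hypothetical model of the printed schema; not the library's own classes.
// Every field is nullable in the schema, so each one is wrapped in Option.
// Note: the schema allows null map values (valueContainsNull = true);
// this sketch keeps Map[String, String] for brevity.
final case class RequestSource(
  queryParams: Option[Map[String, String]],
  url: Option[String],
  text: Option[String]
)

final case class OaiSet(
  id: Option[String],
  document: Option[String],
  setSource: Option[RequestSource]
)

final case class OaiRecord(
  id: Option[String],
  document: Option[String],
  setIds: Option[Seq[String]],
  recordSource: Option[RequestSource]
)

final case class OaiError(
  message: Option[String],
  errorSource: Option[RequestSource]
)

final case class HarvestRow(
  set: Option[OaiSet],
  record: Option[OaiRecord],
  error: Option[OaiError]
)

object HarvestRows {
  // Names the first populated top-level struct, checking set, record, error in that order.
  // Assumption: at most one of the three is populated per row.
  def kind(row: HarvestRow): String = row match {
    case HarvestRow(Some(_), _, _) => "set"
    case HarvestRow(_, Some(_), _) => "record"
    case HarvestRow(_, _, Some(_)) => "error"
    case _                         => "empty"
  }
}
```

Modelling the three structs as Option values makes the "is not null" checks used when querying the DataFrame explicit in the types.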


Examples

...

Scala

Code Block
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local").getOrCreate()


val df = spark.read
    .format("dpla.ingestion3.harvesters.oai")
    .option("endpoint", "http://my-oai-repository.edu")
    .option("verb", "ListRecords")
    .option("metadataPrefix", "mods")
    .option("setlist", "set1, set2, set3")
    .load()

// Keep only the rows where the given struct is present, then flatten its fields.
val records = df.where("record is not null").select("record.*")
val sets = df.where("set is not null").select("set.*")
val errors = df.where("error is not null").select("error.*")
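
Because each record carries its set memberships in the setIds array, a harvested DataFrame can be summarised per set. The sketch below is one possible way to do that with standard Spark SQL functions. The object name SetCounts and the method recordsPerSet are hypothetical helpers, not part of the harvester, and the code assumes only the record.setIds field shown in the schema.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, explode}

// Hypothetical helper, not part of the harvester library.
object SetCounts {
  // Returns one row per set ID with the number of harvested records in that set.
  // explode() drops records whose setIds array is null or empty.
  def recordsPerSet(df: DataFrame): DataFrame =
    df.where(col("record").isNotNull)
      .select(explode(col("record.setIds")).as("setId"))
      .groupBy("setId")
      .count()
}
```

With the df from the example above, SetCounts.recordsPerSet(df).show() would print the counts. Records that belong to several sets are counted once for each set.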

...