Ingestion 3 Dependencies
Notes, for reference by the DPLA tech team.
The following dependency versions are known to work together and enable reading from and writing to S3:
- spark 2.3.1
- hdfs 2.8.4
- spark-avro 4.0.0
- hadoop-aws 2.7.6
- aws-java-sdk 1.7.4
Flintrock
Specify the Spark and HDFS versions in Flintrock's config file.
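As a rough sketch, the relevant part of the config file might look like the fragment below. It assumes the `services` layout used by Flintrock's stock config.yaml template; only the two version numbers come from these notes, so check the key names against the template shipped with your Flintrock release.

```yaml
# Fragment only: the other sections of the Flintrock config are omitted.
services:
  spark:
    version: 2.3.1   # Spark version listed in these notes
  hdfs:
    version: 2.8.4   # HDFS version listed in these notes
```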
Use either of these methods to add dependencies to an EC2 cluster:
- Add as jar files to /home/ec2-user/spark/jars/
- List after the --packages flag when running spark-submit
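A minimal sketch of the second method, using the HarvestEntry coordinates listed in these notes. The --packages flag takes a comma-separated list of Maven coordinates. MAIN_CLASS and APP_JAR are hypothetical placeholders, not values from these notes; substitute the real fully qualified class name and the path to the application jar. The script only prints the command so it can be reviewed before it is run on the cluster.

```shell
#!/bin/sh
# Build the comma-separated coordinate list that --packages expects.
# Coordinates are the HarvestEntry dependencies from these notes.
PACKAGES="com.databricks:spark-avro_2.11:4.0.0"
PACKAGES="$PACKAGES,org.apache.hadoop:hadoop-aws:2.7.6"
PACKAGES="$PACKAGES,com.amazonaws:aws-java-sdk:1.7.4"
PACKAGES="$PACKAGES,org.rogach:scallop_2.11:3.0.3"
PACKAGES="$PACKAGES,com.typesafe:config:1.3.1"

# Hypothetical placeholders (assumptions, not from these notes).
MAIN_CLASS="HarvestEntry"
APP_JAR="/path/to/ingestion3.jar"

# Assemble the command; it is printed here rather than executed.
CMD="spark-submit --packages $PACKAGES --class $MAIN_CLASS $APP_JAR"
echo "$CMD"
```

For the other entry points, swap in the coordinate list from the corresponding section.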
Dependencies for HarvestEntry
com.databricks:spark-avro_2.11:4.0.0
org.apache.hadoop:hadoop-aws:2.7.6
com.amazonaws:aws-java-sdk:1.7.4
org.rogach:scallop_2.11:3.0.3
com.typesafe:config:1.3.1
Dependencies for MappingEntry
com.databricks:spark-avro_2.11:4.0.0
org.apache.hadoop:hadoop-aws:2.7.6
com.amazonaws:aws-java-sdk:1.7.4
org.rogach:scallop_2.11:3.0.3
com.typesafe:config:1.3.1
Dependencies for EnrichEntry
com.databricks:spark-avro_2.11:4.0.0
org.apache.hadoop:hadoop-aws:2.7.6
com.amazonaws:aws-java-sdk:1.7.4
org.rogach:scallop_2.11:3.0.3
com.typesafe:config:1.3.1
org.eclipse.rdf4j:rdf4j-model:2.2
org.jsoup:jsoup:1.10.2
Dependencies for JsonlEntry
com.databricks:spark-avro_2.11:4.0.0
org.apache.hadoop:hadoop-aws:2.7.6
com.amazonaws:aws-java-sdk:1.7.4
org.rogach:scallop_2.11:3.0.3
Dependencies for IngestRemap
com.databricks:spark-avro_2.11:4.0.0
org.apache.hadoop:hadoop-aws:2.7.6
com.amazonaws:aws-java-sdk:1.7.4
org.rogach:scallop_2.11:3.0.3
com.typesafe:config:1.3.1
org.eclipse.rdf4j:rdf4j-model:2.2
org.jsoup:jsoup:1.10.2