
Ingest Shared Concerns

Generally speaking, the following sections ("Mapping DSL," "Queueing System," and "Record Processing") represent intimately linked decisions: the choices made across the three areas must be compatible with one another. However, it is still worth considering each area independently, so that an early decision in one does not prejudice the discussion of the other two.

Mapping DSL

Description:

A generalized, easy-to-use language for converting documents of arbitrary schemas into DPLA MAP. At first, this will likely be implemented in a general programming language as part of the Record Processing project, with the expectation that we will eventually deliver a language that metadata experts with little-to-no programming experience can use successfully on their own, or with minimal supervision.

It's expected that this project will primarily be custom code with possibly a number of implementations if it needs to work in mutually-incompatible environments. Therefore, framework exploration probably isn't needed.

Selection criteria:

Simple to use
Accessible by non-programmers
Needs to handle core use cases
- JSON
- XML
- RDF
- multi-schema/multi-namespace documents
- DPLA MAP (which is JSON-LD)

Nice-to-haves:

Able to run in a variety of execution contexts (browser, command line, grid computing frameworks)
- Support for some kind of 'live preview' of DSL transformations, allowing rapid prototyping and development of mappings by non-developers
Easily usable by partners in their own environments
Deep document validity checks (not just well-formedness)
Use of a declarative language like XPath, JSONPath, or XQuery for specification of field sources/destinations
Use of a framework for creating custom languages starting from some sort of formal specification, à la YACC/LLVM, etc.
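
To make the "declarative specification of field sources/destinations" idea concrete, here is a minimal sketch in Python. It is not a proposed design: the dotted-path syntax, the `MAPPING` table, and the `resolve` and `apply_mapping` helpers are all hypothetical, and the field names are illustrative rather than taken from DPLA MAP. A real implementation would more likely use XPath or JSONPath.

```python
from typing import Any

# Hypothetical mapping spec: target field -> dotted path into the source record.
# Field names are illustrative only and are not taken from DPLA MAP.
MAPPING = {
    "title": "metadata.title",
    "creator": "metadata.creators.0.name",
    "rights": "metadata.rights",
}


def resolve(record: Any, path: str) -> Any:
    """Follow a dotted path through nested dicts and lists; None if a step is missing."""
    current = record
    for step in path.split("."):
        if isinstance(current, dict):
            current = current.get(step)
        elif isinstance(current, list) and step.isdigit() and int(step) < len(current):
            current = current[int(step)]
        else:
            return None
    return current


def apply_mapping(record: dict, mapping: dict) -> dict:
    """Build a target document, skipping fields whose source path does not resolve."""
    target = {}
    for field, path in mapping.items():
        value = resolve(record, path)
        if value is not None:
            target[field] = value
    return target


# A made-up source record with no "rights" value, so that field is skipped.
source = {"metadata": {"title": "Map of Boston", "creators": [{"name": "Unknown"}]}}
mapped = apply_mapping(source, MAPPING)
```

Because the mapping is plain data rather than code, the same table could in principle be interpreted in more than one execution context, which is the point of the nice-to-have above.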

Technology Option	Strengths	Weaknesses	Opportunities	Threats
JavaScript	JSON is native. DOM manipulations are built-in. Runs in a number of contexts (JVM, Browser, CLI).	Weak typing. Tooling is less mature than other options. Lots of churn in best practices/fashionable libraries. Code executing in native JavaScript is not that fast. Not generally used for data munging.
Python	Explicit but succinct syntax. Very mature. Certain libs built from C can be very fast. Strong typing. Functional. Expressive. Easy to understand at a glance.	Native Python isn't very fast.	Python is strong in the data science community, so potential for crossover to analytics/general data munging. Mark knows it well.
Scala	Nearly as fast as Java while being less verbose. Great type system. XML parsing / navigation / document creation already available in strongly-typed DSLs. Functional. Can run in the browser via ScalaJS.	Because Haskell-type people are working on it, can be overcomplicated.	Also a good data science crossover, but more for data engineering. Michael knows it some.
Java	Fast. Mature. Strong typing. Sorta functional (in 8).	Verbose. Libraries can be overcomplicated or too low-level. Not very expressive.	Michael knows it well. There is some community love for Java (Fedora)
Ruby	We "know" it. Very expressive.	Slow. Attempts to write expressive code yield unintelligible code. Hard to manage large projects.	Well-loved in the library community.
Go	Fast. Strong typing. Intended for writing concurrent server processes. Designed for easy deployments (executable binaries, statically linked, not interpreted).	No exception handling. Omits some OO features people might be used to.		Nobody on the team knows it.

Job Queueing System

Description:

The queueing system controls the runtime execution of activities. Currently, Ingestion 2 uses Resque, which is a Ruby-based environment that uses Redis as a datastore and for transaction logic.

Selection criteria:

Must allow for an Activity's Job to be queued
Must expose statistics about the current state of an Activity for reporting purposes.
Must allow for management of failures.
Must allow for distribution of Jobs among multiple workers.
We need to be able to support this system, so having an ops playbook ahead of time is a good idea.
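
The criteria above (queueing an Activity's Jobs, distributing them among workers, managing failures, and reporting statistics) can be sketched with nothing but the Python standard library. This is only an illustration of the requirements, not a candidate implementation: `run_activity`, the `harvest` handler, and the record identifiers are hypothetical, and a real system would use one of the options in the table below.

```python
import queue
import threading
from collections import Counter

_STOP = object()  # sentinel telling a worker to exit


def run_activity(jobs, handler, worker_count=2, max_attempts=2):
    """Distribute jobs among worker threads; return (stats, failed_jobs)."""
    pending = queue.Queue()
    for job in jobs:
        pending.put((job, 1))  # (payload, attempt number)

    stats = Counter()
    failed = []
    lock = threading.Lock()

    def worker():
        while True:
            item = pending.get()
            try:
                if item is _STOP:
                    return
                job, attempt = item
                try:
                    handler(job)
                except Exception as exc:
                    with lock:
                        if attempt < max_attempts:
                            # Re-queue before task_done() so join() keeps waiting.
                            pending.put((job, attempt + 1))
                            stats["retried"] += 1
                        else:
                            stats["failed"] += 1
                            failed.append((job, str(exc)))
                else:
                    with lock:
                        stats["succeeded"] += 1
            finally:
                pending.task_done()

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    pending.join()  # wait until every job, including retries, is processed
    for _ in threads:
        pending.put(_STOP)
    for thread in threads:
        thread.join()
    return dict(stats), failed


def harvest(job):
    """Hypothetical job handler that always fails on one record."""
    if job == "bad-record":
        raise ValueError("could not parse record")


stats, failed = run_activity(["rec-1", "bad-record", "rec-2"], harvest)
```

Here `stats` holds the per-Activity counts a dashboard could report, and `failed` keeps each job that exhausted its attempts together with its error message, so failures can be inspected or re-queued later.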

Nice-to-haves:

Choice of implementation languages for workers
Retrying capabilities
Broader utility outside of ingestion use cases
Allow for jobs to be scheduled for a specific date/time
Able to define workflows for different phases of ingestion and explicitly model our workflows. We might want this because the steps for a new provider are different than a normal re-ingest.
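
As a rough illustration of what "explicitly modeling our workflows" could look like, the sketch below describes two workflows as dependency graphs and derives an execution order from each. It uses `graphlib` from the Python standard library (3.9+). The `write_mapping` and `qa_review` steps and the exact shape of each workflow are assumptions made for the example, not agreed process.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps it depends on.
# A normal re-ingest is assumed to be a simple chain.
REINGEST = {
    "harvest": set(),
    "map": {"harvest"},
    "enrich": {"map"},
    "index": {"enrich"},
}

# Assumption: a new provider needs a mapping written after the first harvest
# and a QA review before anything is indexed.
NEW_PROVIDER = {
    **REINGEST,
    "write_mapping": {"harvest"},
    "map": {"harvest", "write_mapping"},
    "qa_review": {"enrich"},
    "index": {"enrich", "qa_review"},
}


def execution_order(workflow):
    """Return the steps in an order that respects every dependency."""
    return list(TopologicalSorter(workflow).static_order())


reingest_steps = execution_order(REINGEST)
new_provider_steps = execution_order(NEW_PROVIDER)
```

Keeping the workflow as data makes the difference between a new provider and a re-ingest visible in one place, which is roughly what a graph-oriented tool would give us over a plain task queue.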

Technology Option	Worker Language	Strengths	Weaknesses	Opportunities
Airflow	Any	Allows one to model both Activities and individual record operations. Polyglot. Has a built-in management UI. Can handle graphs of dependencies vs. only queues. Can handle retrying tasks. Prebuilt operators for a variety of tasks like dealing with S3, REST endpoints, sending emails. Good uptake by startups doing data operations tasks. Same backends as Celery because it uses it. Simple implementation for what it is.	More complicated than straight-up queueing libraries. Larger codebase to understand.	Reusable for other situations where we need to do ETL or other data operations. ETL on Google Analytics?
RQ	Python	Management via RQ Dashboard. Uses Redis which we know how to run or can get Amazon to run for us. Simple, understandable implementation.	Single worker language option. Task-only.
Custom	Any	(We talked about this being a bad idea, but it was up on the whiteboard.)	Have to build it ourselves. Hard to assure correctness.	Ability to say we built a queueing framework? https://www.sadtrombone.com/
Kafka	Many	Very durable document storage for replay.	More complicated. Doesn't track activities, just worker tasks. Need to run Kafka + Zookeeper.
Resque	Ruby	Built-in management dashboard. Uses Redis which we know how to run or can get Amazon to run for us.	Single worker language option. Task-only.	We already use it.
Celery	Python	Management via Flower. Multiple brokers on the backend, including Redis and relational dbs.	Single worker language option. Task-only.
Spark	Python, Java, Scala	Batch oriented. Option to not have standing infrastructure for ingest/map/enrichments. Can run on spot instances. Incorporates the worker runtime as well. Can run multiple providers on different clusters. Amazon launches Spark clusters as part of EMR. Highly scalable.	Task-only.	Michael knows it.

Record Processing

Description:

An execution environment for running harvests, maps, and enrichments across a provider's contributed metadata.

Selection criteria:

Speed
- Fast replay of ingestion pipeline; no provenance to support rollback
- TODO: Establish baseline requirements for ingestion velocity in terms of providers and records
Reliability
Will need to interact with and share models between queueing system and QA app.

Nice-to-haves:

Concurrency
Scalability
Easy to understand or common platform/framework/language within community

Notes:

It might not make sense to consider the Record Processing, Mapping DSL, and Queueing System projects separately if they are highly coupled.

Technology Option	Language	Strengths	Weaknesses	Opportunities	Threats

Webapp Shared Concerns

As each of the following tech selection sections is related to creating webapps, they share concerns. However, in this case, the decisions are not intimately related; we could very easily make separate decisions in each case.

Dashboard

Description:

The Dashboard is a web application that will allow DPLA staff, partners, and hubs to see the status of ingestion, mapping, and enrichment processes on their data. It is now the Tech Team's intent that this application will get information about the status of these ingests through a REST API, which means that the Dashboard will be loosely coupled to the Ingestion stack. This will allow the implementation and implementation technology of Ingestion to evolve without needing to modify the Dashboard application.

Selection criteria:

TODO

Easy to code
Easy to deploy
Easy to manage dependencies
Open source
Active community of developers
Well documented
Hudson Molonglo can do it
Quality error logging
Ability to communicate with ingestion systems over API
Authentication
Lightweight relational database
REST
HTML templates
JSON
Quality test runner
Secure
Asset precompiling (JavaScript, SASS, etc)
Forms?

Nice-to-haves:

TODO

Used in the library community
Language or framework familiar to staff
Bootstrap compatible
Lightweight
ORM
Lots of high-quality plugins or packages available
MVC? (not sure how the team feels about this)

Notes:

The tech selection process for the Dashboard may very well be similar to that of the QA app, with the caveat that the Dashboard app will be built by a third party (HM).

Technology Option	Language	Strengths	Weaknesses	Opportunities	Threats
Rails	Ruby	Stable and widely used. Good test runner. Conventions make things like routing very easy. Good HTML templating. Good ORM for basic CRUD. Bundler for managing dependencies. Expressive language.	Resource-intensive hosting. More features than we really need. Many dependencies. Mediocre documentation. Not as performant or flexible as lighter-weight frameworks.	We already use it. Popular in library community. Development is fast if prepackaged features and existing gems meet our needs.	High learning curve.
Flask	Python	Simple, extensible core. Serves JSON responses relatively fast. Good for building APIs. Flexible. Explicit, expressive language.	Less mature. Smaller web community. Poor documentation. Testing okay, not great. Have to do more of your own security.	Python popular in library community. Extensions probably available to meet most of our needs.	Not MVC.
Django	Python	Stable and widely used. Good HTML templating. Good ORM. Pip for managing dependencies. Explicit, expressive language.	More features than we really need. Many dependencies. Not as performant or flexible as lighter-weight frameworks. Testing okay, not great.	Python popular in library community. Development is fast if prepackaged features and existing packages meet our needs.	High learning curve.
Play	Java or Scala	Good testing. Good in-browser error handling. Flexible.	Immature. Small dev community. Upgrades not backward compatible. Async, non-blocking I/O can make it hard to keep code clean. SBT has a reputation for being a difficult build system. Hot reload slow with Scala.	Extensions probably available to meet most of our needs.	Not widely used in library community. High learning curve. No one really knows it.

QA App

Description:

The QA application will allow metadata experts to examine the output of mapping and harvest prior to writing to the production Elasticsearch index.

Selection criteria:

TODO

Nice-to-haves:

TODO

Technology Option	Language	Strengths	Weaknesses	Opportunities	Threats
Rails	Ruby
Flask	Python
Django	Python
Play	Java or Scala

Queueing System

Description:

The queuing system controls the runtime execution of activities. Currently, Ingestion 2 uses Resque, which is a Ruby-based environment that uses Redis as a datastore and for transaction logic.

Selection criteria:

Must allow for a batch of operations to be queued
Must somehow report statistics about the state of play of a batch for reporting purposes
Must allow for management of failures

Must allow for distribution of tasks among multiple workers

- Easy to code
- Easy to deploy
- Easy to manage dependencies
- Open source
- Active community of developers
- Well documented
- Hudson Molonglo can do it
- Quality error logging
- Ability to communicate with Elasticsearch
- Authentication
- REST
- HTML templates
- JSON
- Quality test runner
- Secure
- Asset precompiling (JavaScript, SASS, etc)

Nice-to-haves:

Choice of implementation languages for workers

Retrying capabilities

Broader utility outside of ingestion use cases

- Used in the library community
- Language or framework familiar to staff
- Bootstrap compatible
- Lightweight
- Lots of high-quality plugins or packages available

Technology Option	Worker Language	Strengths	Weaknesses	Opportunities	Threats
Airflow	Many
RQ	Python
Custom	Many
Kafka	Many
Resque	Ruby


Rails	Ruby	Already in our stack ("we know it"). Popular in the community (Blacklight, Spotlight).	Blacklight is not integrated with ES (we would need to rely on Solr or build the integration with ES).
Flask	Python	Lightweight. Python is a common language across the team.	Lightweight means that Extensions available for Auth DB abstractions
Django	Python	Strong community adoption, mature framework. Python is a common language across the team.	"Heavy"; not as performant as Flask.
Play	Java or Scala		No one on staff has experience.	Java and Scala are familiar, so the uptake may not be too difficult.

Developers Experience / Interests

Dev	Expert At	Good At	Familiar With	Wants to Learn
Audrey	HTML+CSS, Javascript for DOM manipulations, Ruby (in Ruby on Rails context)	Object oriented Javascript, PHP (a little rusty), Ruby, SQL	Python, Java	Python, Scala, Java
Mark	Unix, Python (was pretty confident, now a little rusty), Javascript, PHP (formerly, doesn't like), HTML+CSS (a little rusty), Perl (rusty, been a while, is so over that)	Ruby	C, Java	Go, more Python, Scala, Java, Natural Language Processing
Michael	Java, XML, Solr, Hadoop	Scala, Ruby (mostly not Rails)	Python, Javascript, Perl, C, Objective-C, XSLT, Spark, NLP, Machine Learning, Elasticsearch, Redshift	Python, more Scala, Spark
Scott	Not claiming "expert" skills in these subjects, but it's what I'm strongest at. Java (<1.7), SQL (MSSQL), Solr	Ruby (still learning), Python	Elasticsearch, Django (the only web framework I've done work with), C++ (bloodshed days)	Java 1.8, Scala, Spark, Go
