Record Processing
Description:
An execution environment for running harvests, maps, and enrichments across a provider's contributed metadata.
Selection criteria:
- TODO
Nice-to-haves:
- TODO
Notes:
It might not make sense to consider the Record Processing, Mapping DSL, and Queuing System projects separately if they are highly coupled.
Technology Option | Language | Strengths | Weaknesses | Opportunities | Threats |
---|---|---|---|---|---|
Mapping DSL
Description:
A generalized, easy-to-use language for converting documents of arbitrary schemas into DPLA MAP. At first, this will likely be implemented in a general-purpose programming language as part of the Record Processing project. The expectation is that we will eventually deliver a language that metadata experts with little-to-no programming experience can use successfully on their own, or with minimal supervision.
This project is expected to be primarily custom code, possibly with several implementations if it needs to work in mutually incompatible environments. Framework exploration therefore probably isn't needed.
Selection criteria:
- Simple to use
- Accessible by non-programmers
- Needs to handle core use cases
  - JSON
  - XML
  - RDF
  - multi-schema/multi-namespace documents
  - DPLA MAP
Nice-to-haves:
- Able to run in a variety of execution contexts (browser, command line, grid computing frameworks)
- Easily usable by partners in their own environments
- Deep document validity checks (not just well-formedness)
- Use of a declarative language like XPath, JSONPath, or XQuery for specification of field sources/destinations
- Use of a framework for creating custom languages from a formal specification, à la YACC or LLVM
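To make the declarative nice-to-have more concrete, here is a minimal sketch in Python. It assumes a mapping can be written as a table of target fields to source paths. The helpers `get_path`, `set_path`, and `apply_mapping`, the dotted-path syntax (a simplified stand-in for JSONPath, not real JSONPath), and the source field names are all hypothetical; nothing here reflects an existing DPLA implementation.

```python
from typing import Any

# Hypothetical declarative mapping: target field -> dotted path in the source.
# The dotted path is a simplified stand-in for JSONPath, not real JSONPath.
MAPPING = {
    "sourceResource.title": "metadata.title",
    "sourceResource.creator": "metadata.creator",
    "isShownAt": "links.landing_page",
}


def get_path(doc: Any, path: str) -> Any:
    """Walk a dotted path through nested dicts; return None if a step is missing."""
    current = doc
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def set_path(doc: dict, path: str, value: Any) -> None:
    """Set a value at a dotted path, creating intermediate dicts as needed."""
    keys = path.split(".")
    for key in keys[:-1]:
        doc = doc.setdefault(key, {})
    doc[keys[-1]] = value


def apply_mapping(source: dict, mapping: dict) -> dict:
    """Build a target record by copying each mapped source value that exists."""
    target: dict = {}
    for target_path, source_path in mapping.items():
        value = get_path(source, source_path)
        if value is not None:
            set_path(target, target_path, value)
    return target
```

Because the mapping is plain data rather than code, a table like `MAPPING` could in principle be written by a metadata expert and interpreted by more than one runtime, which is the property the criteria above are after.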
Technology Option | Strengths | Weaknesses | Opportunities | Threats |
---|---|---|---|---|
JavaScript | ||||
Python | ||||
Scala | ||||
Java | ||||
Dashboard
Description:
The Dashboard is a web application that will allow DPLA staff, partners, and hubs to see the status of ingestion, mapping, and enrichment processes on their data. The Tech Team now intends for this application to get ingest status information through a REST API, which means the Dashboard will be loosely coupled to the Ingestion stack. This will allow the implementation and underlying technology of Ingestion to evolve without requiring changes to the Dashboard application.
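As an illustration of that loose coupling, the sketch below shows how a Dashboard might consume a status document and render a summary while knowing nothing about how Ingestion produced it. The `IngestStatus` type, its field names, and the JSON shape are all hypothetical assumptions; no such API has been specified yet.

```python
import json
from dataclasses import dataclass


@dataclass
class IngestStatus:
    """Hypothetical status record; field names are illustrative only."""
    provider: str
    stage: str   # e.g. "harvest", "mapping", "enrichment"
    state: str   # e.g. "running", "complete", "failed"
    records_processed: int
    records_failed: int


def parse_status(payload: str) -> IngestStatus:
    """Parse a JSON status document as a REST endpoint might return it."""
    data = json.loads(payload)
    return IngestStatus(
        provider=data["provider"],
        stage=data["stage"],
        state=data["state"],
        # Counters default to 0 when the payload omits them.
        records_processed=int(data.get("records_processed", 0)),
        records_failed=int(data.get("records_failed", 0)),
    )


def summarize(status: IngestStatus) -> str:
    """Render the one-line summary a dashboard row might display."""
    return (
        f"{status.provider}: {status.stage} {status.state} "
        f"({status.records_processed} processed, {status.records_failed} failed)"
    )
```

The only contract between the two systems in this sketch is the JSON document, so the Ingestion stack could change language or queue technology without affecting the Dashboard as long as that shape stays stable.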
Selection criteria:
- TODO
Nice-to-haves:
- TODO
Notes:
The tech selection process for the Dashboard may very well be similar to that of the QA app, with the caveat that the Dashboard app will be built by a third party (HM).
Technology Option | Language | Strengths | Weaknesses | Opportunities | Threats |
---|---|---|---|---|---|
Rails | Ruby | ||||
Flask | Python | ||||
Django | Python | ||||
Play | Java or Scala | ||||
QA App
Description:
The QA application will allow metadata experts to examine the output of mapping and harvest prior to writing to the production Elasticsearch index.
Selection criteria:
- TODO
Nice-to-haves:
- TODO
Technology Option | Language | Strengths | Weaknesses | Opportunities | Threats |
---|---|---|---|---|---|
Rails | Ruby | ||||
Flask | Python | ||||
Django | Python | ||||
Play | Java or Scala | ||||
Queuing System
Description:
The queuing system controls the runtime execution of activities. Currently, Ingestion 2 uses Resque, which is a Ruby-based environment that uses Redis as a datastore and for transaction logic.
Selection criteria:
- Must allow for a batch of operations to be queued
- Must expose statistics about the current state of a batch for reporting purposes
- Must allow for management of failures
- Must allow for distribution of tasks among multiple workers
Nice-to-haves:
- Choice of implementation languages for workers
- Retrying capabilities
- Broader utility outside of ingestion use cases
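The criteria above can be sketched in a few dozen lines of Python, using in-process threads as a stand-in for a real queue backend. This is not how Resque, RQ, Airflow, or Kafka work internally; `BatchStats`, `run_batch`, and their parameters are hypothetical names chosen only to show what batching, statistics, failure management, retries, and worker distribution look like together.

```python
import queue
import threading
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class BatchStats:
    """Counters a reporting layer could read for a batch."""
    total: int = 0
    succeeded: int = 0
    failed: int = 0
    retried: int = 0
    failures: List[str] = field(default_factory=list)


def run_batch(items: Iterable, task: Callable, workers: int = 4,
              max_retries: int = 2) -> BatchStats:
    """Run task(item) for every item across worker threads, retrying failures."""
    items = list(items)
    stats = BatchStats(total=len(items))
    lock = threading.Lock()
    work: queue.Queue = queue.Queue()
    for item in items:
        work.put((item, 0))  # (payload, attempts so far)

    def worker() -> None:
        while True:
            entry = work.get()
            if entry is None:  # sentinel: no more work
                work.task_done()
                return
            item, attempts = entry
            try:
                task(item)
                with lock:
                    stats.succeeded += 1
            except Exception as exc:
                with lock:
                    if attempts < max_retries:
                        stats.retried += 1
                        work.put((item, attempts + 1))  # re-queue for retry
                    else:
                        stats.failed += 1
                        stats.failures.append(f"{item!r}: {exc}")
            finally:
                work.task_done()

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    work.join()  # returns once every item, including retries, is finished
    for _ in threads:
        work.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return stats
```

A real candidate would also need persistence and cross-process (ideally cross-language) workers, which this sketch deliberately leaves out; it is meant only as a checklist of behaviors to compare the options in the table against.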
Technology Option | Worker Language | Strengths | Weaknesses | Opportunities | Threats |
---|---|---|---|---|---|
Airflow | Many | ||||
RQ | Python | ||||
Custom | Many | ||||
Kafka | Many | ||||
Resque | Ruby | ||||
Developer Experience / Interests
Dev | Expert At | Good At | Familiar With | Wants to Learn |
---|---|---|---|---|
Audrey | HTML+CSS, JavaScript for DOM manipulations, Ruby (in Ruby on Rails context) | Object-oriented JavaScript, PHP (a little rusty), Ruby, SQL | Python, Java | Python, Scala, Java |
Mark | Unix, Python (was pretty confident, now a little rusty), JavaScript, PHP (formerly, doesn't like), HTML+CSS (a little rusty), Perl (rusty, been a while, is so over that) | Ruby | C, Java | Go, more Python, Scala, Java, Natural Language Processing |
Michael | Java, XML, Solr, Hadoop | Scala, Ruby (mostly not Rails) | Python, JavaScript, Perl, C, Objective-C, XSLT, Spark, NLP, Machine Learning, Elasticsearch, Redshift | Python, more Scala, Spark |
Scott | Not claiming "expert" skills in these subjects, but they're what I'm strongest at: Java (<1.7), SQL (MSSQL), Solr | Ruby (still learning), Python | Elasticsearch, Django (the only web framework I've done work with), C++ (bloodshed days) | Java 1.8, Scala, Spark, Go |