Record Processing
Description:
An execution environment for running harvests, maps, and enrichments across a provider's contributed metadata.
Selection criteria:
- TODO
Nice-to-haves:
- TODO
Notes:
It might not make sense to consider the Record Processing, Mapping DSL, and Queueing System projects separately if they are highly coupled.
Technology Option | Language | Strengths | Weaknesses | Opportunities | Threats |
---|---|---|---|---|---|
Mapping DSL
Description:
A generalized, easy-to-use language for converting documents of arbitrary schemas into DPLA MAP. At first, this will likely be implemented in a general-purpose programming language as part of the Record Processing project, with the expectation that we will eventually deliver a language that metadata experts with little to no programming experience can use successfully on their own or with minimal supervision.
We expect this project to be primarily custom code, possibly with several implementations if it needs to work in mutually incompatible environments. Framework exploration therefore probably isn't needed.
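As a rough illustration of what a first implementation in a general-purpose language might look like, the sketch below expresses a mapping as plain data (target field, source path, optional transform) and applies it to a JSON-like record. All of it is an assumption made for illustration: the `MAPPING` table, the helper names, and the source and target field names are hypothetical and do not come from any real provider schema or from an existing codebase.

```python
# Hypothetical declarative mapping: target path -> (source path, optional transform).
# Field names on both sides are illustrative only.
MAPPING = {
    "sourceResource.title": ("metadata.title", None),
    "sourceResource.creator": (
        "metadata.author",
        lambda v: v if isinstance(v, list) else [v],  # normalize to a list
    ),
    "isShownAt": ("links.landing_page", None),
}


def get_path(doc, path):
    """Walk a dotted path through nested dicts; return None if a step is missing."""
    current = doc
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def set_path(doc, path, value):
    """Set a value at a dotted path, creating intermediate dicts as needed."""
    keys = path.split(".")
    current = doc
    for key in keys[:-1]:
        current = current.setdefault(key, {})
    current[keys[-1]] = value


def apply_mapping(source, mapping):
    """Build a target record by copying each mapped source value that is present."""
    target = {}
    for target_path, (source_path, transform) in mapping.items():
        value = get_path(source, source_path)
        if value is None:
            continue  # missing source fields are skipped, not treated as errors
        if transform is not None:
            value = transform(value)
        set_path(target, target_path, value)
    return target
```

Keeping the mapping as data rather than code is one possible stepping stone toward a language that non-programmers could edit, since the same table could later be loaded from a file or generated by a simpler syntax.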
Selection criteria:
- Simple to use
- Accessible by non-programmers
- Needs to handle core use cases
  - JSON
  - XML
  - RDF
  - multi-schema/multi-namespace documents
  - DPLA MAP
Nice-to-haves:
- Able to run in a variety of execution contexts (browser, command line, grid computing frameworks)
- Easily usable by partners in their own environments
- Deep document validity checks (not just well-formedness)
Technology Option | Strengths | Weaknesses | Opportunities | Threats |
---|---|---|---|---|
JavaScript | ||||
Python | ||||
Scala | ||||
Java | ||||
Dashboard
Description:
TODO
Selection criteria:
- TODO
Nice-to-haves:
- TODO
Technology Option | Language | Strengths | Weaknesses | Opportunities | Threats |
---|---|---|---|---|---|
QA App
Description:
TODO
Selection criteria:
- TODO
Nice-to-haves:
- TODO
Technology Option | Language | Strengths | Weaknesses | Opportunities | Threats |
---|---|---|---|---|---|
Queueing System
Description:
The queueing system controls the runtime execution of activities. Currently, Ingestion 2 uses Resque, a Ruby-based job queue that uses Redis as its datastore and for transaction logic.
Selection criteria:
- Must allow for a batch of operations to be queued
- Must expose statistics on the current state of a batch for reporting purposes
- Must allow for management of failures
- Must allow for distribution of tasks among multiple workers
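To make the four criteria above concrete, here is a minimal in-process sketch using only the Python standard library. It is not Resque, Airflow, RQ, or Kafka, and it does not model any of their APIs; `run_batch` and its parameters are hypothetical names chosen for illustration. It queues a batch, spreads tasks across worker threads, counts outcomes, and collects failures for later handling.

```python
import queue
import threading
from collections import Counter


def run_batch(tasks, handler, num_workers=4):
    """Run handler over every task using a pool of worker threads.

    Returns (stats, failures): a Counter of outcomes and a list of
    (task, error message) pairs for tasks whose handler raised.
    """
    tasks = list(tasks)
    pending = queue.Queue()
    for task in tasks:
        pending.put(task)  # the whole batch is queued up front

    stats = Counter(queued=len(tasks))
    failures = []
    lock = threading.Lock()  # guards stats and failures across workers

    def worker():
        while True:
            try:
                task = pending.get_nowait()
            except queue.Empty:
                return  # nothing left for this worker to do
            try:
                handler(task)
            except Exception as exc:
                # Record the failure instead of letting it kill the worker.
                with lock:
                    stats["failed"] += 1
                    failures.append((task, str(exc)))
            else:
                with lock:
                    stats["succeeded"] += 1

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return stats, failures
```

A real candidate would also need persistence and cross-process workers, which this sketch deliberately leaves out. The returned `failures` list is the hook where failure management, such as requeueing, would attach.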
Nice-to-haves:
- Choice of implementation languages for workers
- Retrying capabilities
- Broader utility outside of ingestion use cases
Technology Option | Worker Language | Strengths | Weaknesses | Opportunities | Threats |
---|---|---|---|---|---|
Airflow | Many | ||||
RQ | Python | ||||
Custom | Many | ||||
Kafka | Many | ||||
Resque | Ruby | ||||
Developer Experience / Interests
Dev | Expert At | Good At | Familiar With | Wants to Learn |
---|---|---|---|---|
Audrey | HTML+CSS, JavaScript for DOM manipulation, Ruby (in Ruby on Rails context) | Object-oriented JavaScript, PHP (a little rusty), Ruby, SQL | Python, Java | Python, Scala, Java |
Mark | Unix, Python (was pretty confident, now a little rusty), JavaScript, PHP (formerly, doesn't like), HTML+CSS (a little rusty), Perl (rusty, been a while, is so over that) | Ruby | C, Java | Go, more Python, Scala, Java, Natural Language Processing |
Michael | Java, XML, Solr, Hadoop | Scala, Ruby (mostly not Rails) | Python, JavaScript, Perl, C, Objective-C, XSLT, Spark, NLP, Machine Learning, Elasticsearch, Redshift | Python, more Scala, Spark |
Scott | ||||