Ingest Shared Concerns

Generally speaking, the following sections ("Mapping DSL", "Queueing System", "Record Processing") represent intimately linked decisions, in that the choices made across the three areas must be compatible with one another. However, it's worth considering each of them independently, so that an early decision in one area doesn't prejudice the discussion of the other two.


Mapping DSL

Description:

A generalized, easy-to-use language for converting documents of arbitrary schemas into DPLA MAP. At first, this will likely be implemented in a general-purpose programming language as part of the Record Processing project, with the expectation that we will eventually deliver a language that metadata experts with little-to-no programming experience can use successfully on their own or with minimal supervision.

It's expected that this project will primarily be custom code with possibly a number of implementations if it needs to work in mutually-incompatible environments. Therefore, framework exploration probably isn't needed.

Selection criteria:

  • Simple to use
  • Accessible by non-programmers
  • Needs to handle core use cases
    • JSON
    • XML
    • RDF
    • multi-schema/multi-namespace documents
    • DPLA MAP

Nice-to-haves:

  • Able to run in a variety of execution contexts (browser, command line, grid computing frameworks)
  • Easily usable by partners in their own environments
  • Deep document validity checks (not just well-formedness)
  • Use of a declarative language like XPath, JSONPath, or XQuery for specification of field sources/destinations
  • Use of a framework for creating custom languages starting from some sort of formal specification, à la YACC, LLVM, etc.
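
To make the "declarative specification of field sources/destinations" idea concrete, here is a minimal sketch in Python. It assumes JSON-like input and uses plain dotted paths as a stand-in for JSONPath or XPath. The names `MAPPING`, `resolve`, and `apply_mapping`, along with the field names, are hypothetical and only illustrate the shape such a DSL could take.

```python
from typing import Any

# Hypothetical mapping spec: target field -> dotted path into the source record.
# The field names are illustrative, not a definitive DPLA MAP profile.
MAPPING = {
    "title": "metadata.dc.title",
    "creator": "metadata.dc.creator",
    "rights": "metadata.dc.rights",
}


def resolve(record: dict, path: str) -> Any:
    """Walk a dotted path through nested dicts; return None if any step is missing."""
    current: Any = record
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def apply_mapping(record: dict, mapping: dict) -> dict:
    """Build the target document, skipping fields whose source is absent."""
    out = {}
    for target, source in mapping.items():
        value = resolve(record, source)
        if value is not None:
            out[target] = value
    return out
```

Because the mapping is data rather than code, a non-programmer could in principle edit it without touching the interpreter, which is the property the selection criteria ask for.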


| Technology Option | Strengths | Weaknesses | Opportunities | Threats |
|---|---|---|---|---|
| Javascript | JSON is native. DOM manipulations are built-in. Runs in a number of contexts (JVM, Browser, CLI). | Weak typing. Tooling is less mature than other options. Lots of churn in best practices/fashionable libraries. Code executing in native Javascript is not that fast. Not generally used for data munging. | | |
| Python | Explicit but succinct syntax. Very mature. Certain libs built from C can be very fast. Strong typing. Functional. Expressive. Easy to understand at a glance. | Native Python isn't very fast. | Python is strong in the data science community, so potential for crossover to analytics/general data munging. Mark knows it well. | |
| Scala | Nearly as fast as Java while being less verbose. Great type system. XML parsing / navigation / document creation already available in strongly-typed DSLs. Functional. Can run in the browser via ScalaJS. | Because Haskell-type people are working on it, can be overcomplicated. | Also a good data science crossover, but more for data engineering. Michael knows it some. | |
| Java | Fast. Strong typing. Sorta functional (in 8). | Verbose. Libraries can be overcomplicated or too low-level. Not very expressive. | Michael knows it well. | |
| Ruby | We "know" it. Very expressive. | Slow. Attempts to write expressive code yield to unintelligible code. Hard to manage large projects. | Well-loved in the library community. | |
| Go | Fast. Strong typing. | No exception handling. Nobody on the team knows it. | | |



Queueing System

Description:

The queuing system controls the runtime execution of activities. Currently, Ingestion 2 uses Resque, which is a Ruby-based environment that uses Redis as a datastore and for transaction logic.


Selection criteria:

  • Must allow for a batch of operations to be queued.
  • Must expose statistics about the progress of a batch for reporting purposes.
  • Must allow for management of failures.
  • Must allow for distribution of tasks among multiple workers.
  • We need to be able to support this system, so having an ops playbook ahead of time is a good idea.
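
The criteria above (queue a batch, distribute it among workers, track failures, report on progress) can be sketched with nothing but the Python standard library. This is an in-process illustration of the bookkeeping we would expect from any candidate, not a replacement for a real queueing system. `BatchReport` and `run_batch` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from dataclasses import dataclass, field


@dataclass
class BatchReport:
    total: int = 0
    succeeded: int = 0
    failed: int = 0
    errors: dict = field(default_factory=dict)  # task id -> error message


def run_batch(tasks: dict, worker, max_workers: int = 4) -> BatchReport:
    """Distribute tasks (id -> payload) across workers and tally the outcome."""
    report = BatchReport(total=len(tasks))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Remember which task each future belongs to so failures can be attributed.
        futures = {pool.submit(worker, payload): task_id
                   for task_id, payload in tasks.items()}
        for future in as_completed(futures):
            task_id = futures[future]
            try:
                future.result()
                report.succeeded += 1
            except Exception as exc:
                # A failed task is recorded and does not abort the rest of the batch.
                report.failed += 1
                report.errors[task_id] = str(exc)
    return report
```

The returned report is the kind of per-batch summary a dashboard or ops playbook would need. A real system would also have to persist it across process restarts.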

Nice-to-haves:

  • Choice of implementation languages for workers
  • Retrying capabilities
  • Broader utility outside of ingestion use cases
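
As a rough illustration of what "retrying capabilities" means in practice, the sketch below retries a task with exponential backoff. The function name `run_with_retries` and its parameters are assumptions made for this example. Frameworks such as Airflow and Celery ship their own retry settings, which we would use instead of hand-rolling this.

```python
import time


def run_with_retries(task, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call task() until it succeeds or max_attempts is reached.

    The delay doubles after each failed attempt (0.5s, 1s, 2s, ...).
    `sleep` is injectable so the behaviour can be tested without waiting.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            sleep(base_delay * 2 ** (attempt - 1))
```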


| Technology Option | Worker Language | Strengths | Weaknesses | Opportunities | Threats |
|---|---|---|---|---|---|
| Airflow | Many | Allows one to model both Activities and individual record operations. Polyglot. Has a built-in management UI. Can handle graphs of dependencies vs. only queues. Can handle retrying tasks. Prebuilt operators for a variety of tasks like dealing with S3, REST endpoints, sending emails. Good takeup by startups doing data operations tasks. Same backends as Celery because it uses it. | More complicated than embedded queueing libraries. Larger codebase to understand. | Reusable for other situations where we need to do ETL or other data operations, even if the implementation | |
| RQ | Python | Management via RQ Dashboard. Uses Redis which we know how to run or can get Amazon to run for us. | "Only" works for Python code. Single worker language option. Task-only. | | |
| Custom | Many | (We talked about this being a bad idea, but it was up on the whiteboard.) | Have to build it ourselves. Hard to assure correctness. | Ability to say we built a queueing framework? https://www.sadtrombone.com/ | |
| Kafka | Many | Very durable document storage for replay. | More complicated. Doesn't track activities, just worker tasks. Need to run Kafka + Zookeeper. | | |
| Resque | Ruby | Builtin management dashboard. Uses Redis which we know how to run or can get Amazon to run for us. | "Only" works for Ruby code. Single worker language option. Task-only. | We already use it. | |
| Celery | Python | Management via Flower. Multiple brokers on the backend, including Redis and relational dbs. | Single worker language option. Task-only. | | |



Record Processing

Description:

An execution environment for running harvests, maps, and enrichments across a provider's contributed metadata.
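
One way to picture such an environment is as a pipeline that streams each harvested record through a sequence of steps. The sketch below is only that picture in code. `process`, `Record`, and `Step` are hypothetical names, and it says nothing about how the work would be scheduled or distributed.

```python
from typing import Callable, Iterable, Iterator, List

Record = dict
Step = Callable[[Record], Record]


def process(records: Iterable[Record], steps: List[Step]) -> Iterator[Record]:
    """Lazily pass each harvested record through the steps, in order."""
    for record in records:
        for step in steps:
            record = step(record)
        yield record
```

A mapping step and an enrichment step would each be a plain function from record to record, so the same steps could run on the command line, inside a queue worker, or on a grid computing framework.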


Selection criteria:

  • TODO

Nice-to-haves:

  • TODO

Notes:

It might not make sense to consider the Record Processing, Mapping DSL, and Queuing System projects separately if they are highly coupled.


| Technology Option | Language | Strengths | Weaknesses | Opportunities | Threats |
|---|---|---|---|---|---|


Webapp Shared Concerns

Because each of the following tech selection sections relates to creating webapps, they share concerns. In this case, however, the decisions are not intimately related; we could very easily make a separate decision in each case.

Dashboard

Description:

The Dashboard is a web application that will allow DPLA staff, partners, and hubs to see the status of ingestion, mapping, and enrichment processes on their data. The Tech Team now intends for this application to get information about the status of these ingests through a REST API, which means that the Dashboard will be loosely coupled to the Ingestion stack. This will allow the implementation and implementation technology of Ingestion to evolve without needing to modify the Dashboard application.
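
The loose coupling described above means the Dashboard depends only on the shape of the API's responses, not on how Ingestion produces them. As a sketch under assumed field names (`provider`, `stage`, `processed`, `failed` are hypothetical, as no API contract has been defined yet), the Dashboard side could decode a status document like this:

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class IngestStatus:
    provider: str
    stage: str       # e.g. "harvest", "mapping", "enrichment"
    processed: int
    failed: int


def parse_status(body: str) -> IngestStatus:
    """Decode one status document returned by the (hypothetical) ingestion API."""
    data = json.loads(body)
    return IngestStatus(
        provider=data["provider"],
        stage=data["stage"],
        processed=int(data["processed"]),
        failed=int(data.get("failed", 0)),  # treat a missing count as zero failures
    )
```

As long as the response keeps this shape, the Ingestion stack could move from Resque to any of the options above without a change to the Dashboard.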


Selection criteria:

  • TODO

Nice-to-haves:

  • TODO

Notes:

The tech selection process for the Dashboard may very well be similar to that of the QA app, with the caveat that the Dashboard app will be built by a third party (HM).


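| Technology Option | Language | Strengths | Weaknesses | Opportunities | Threats |
|---|---|---|---|---|---|
| Rails | Ruby | | | | |
| Flask | Python | | | | |
| Django | Python | | | | |
| Play | Java or Scala | | | | |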


QA App

Description:

The QA application will allow metadata experts to examine the output of mapping and harvest before it is written to the production Elasticsearch index.


Selection criteria:

  • TODO

Nice-to-haves:

  • TODO


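| Technology Option | Language | Strengths | Weaknesses | Opportunities | Threats |
|---|---|---|---|---|---|
| Rails | Ruby | | | | |
| Flask | Python | | | | |
| Django | Python | | | | |
| Play | Java or Scala | | | | |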


Developer Experience / Interests

| Dev | Expert At | Good At | Familiar With | Wants to Learn |
|---|---|---|---|---|
| Audrey | HTML+CSS, Javascript for DOM manipulations, Ruby (in Ruby on Rails context) | Object oriented Javascript, PHP (a little rusty), Ruby, SQL | Python, Java | Python, Scala, Java |
| Mark | Unix, Python (was pretty confident, now a little rusty), Javascript, PHP (formerly, doesn't like), HTML+CSS (a little rusty), Perl (rusty, been a while, is so over that) | Ruby | C, Java | Go, more Python, Scala, Java, Natural Language Processing |
| Michael | Java, XML, Solr, Hadoop | Scala, Ruby (mostly not Rails) | Python, Javascript, Perl, C, Objective-C, XSLT, Spark, NLP, Machine Learning, Elasticsearch, Redshift | Python, more Scala, Spark |
| Scott | Not claiming "expert" skills in these subjects, but it's what I'm strongest at. Java (<1.7), SQL (MSSQL), Solr | Ruby (still learning), Python | Elasticsearch, Django (the only web framework I've done work with), C++ (Bloodshed days) | Java 1.8, Scala, Spark, Go |