Priorities (what problems are we solving?) | All | - Speed: speed is a feature. We should be able to say predictably how long an ingest will take.
- Allowing recovery from failure; pick up where it left off. Speed affects this; if it's fast enough you don't have to worry about it. Otherwise, make sure there's recovery. Harvesters should allow recovery, where possible. Indexers could also be less speedy than mappings and enrichments, and may deserve recovery features.
- Adding automation that was originally specified: have a program that shepherds the process all the way through. Scheduling.
- Eventually, provide a useable mapping DSL
- Needs real market research
- This is not a turnkey solution yet. Some things like DSLs will be evaluated later, when we can be more confident in our understanding of how big the user base is.
- Writing mappings ourselves in the third system without a DSL will allow us to understand the problem space better.
- Ability to debug things, especially mappings
|
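A minimal sketch of the "pick up where it left off" idea for harvesters, assuming a paged source that hands back a resume token with each page. The names `harvest_with_checkpoint`, `fetch_page`, `handle_record`, and `checkpoint_path` are hypothetical and are not taken from any of the prototypes.

```python
import json
import os


def harvest_with_checkpoint(fetch_page, handle_record, checkpoint_path):
    """Harvest page by page, saving a resume token after each completed page.

    fetch_page(token) returns (records, next_token); token is None for the
    first page and next_token is None on the last page. Hypothetical interface.
    """
    token = None
    if os.path.exists(checkpoint_path):
        # A previous run stopped early: resume from the saved token.
        with open(checkpoint_path) as f:
            token = json.load(f)["token"]

    while True:
        records, next_token = fetch_page(token)
        for record in records:
            handle_record(record)
        if next_token is None:
            # Finished cleanly: remove the checkpoint so the next run starts fresh.
            if os.path.exists(checkpoint_path):
                os.remove(checkpoint_path)
            return
        # Write to a temp file, then rename, so a crash cannot leave a
        # half-written checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"token": next_token}, f)
        os.replace(tmp_path, checkpoint_path)
        token = next_token
```

If a run dies partway through a page, that page is fetched again on restart, so `handle_record` may see some records twice and should tolerate repeats.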
Code examples | Michael et al. | Got walkthroughs of the Python, Python + Spark, Java, and Scala prototypes.
Considerations for language environments:
- Staff allocation for necessary professional development
- Mental "context switching" with multiple environments
- Server cost, if we have to run more servers (e.g., consider execution speed and memory usage if we are able to run on just one node)
- Performance
- Ease of use for novices / non-programmers
- Ease of writing a DSL in it
- Ability to be explicit in code (e.g. types). Fewer inferences, LESS MAGIC!
- Ease of deployment (local, production, dependencies)
- How easy it is for other institutions to adopt our code or experiment with it
- Developer enjoyment
What scale will we be working in for the foreseeable future?
- Ingestion total or by provider?
- Extra pressure because we want to reingest more frequently - that could mean 10 million records per month
- Targets for 12-24 months (these are just proposals)
- 200 million records stored
- 15 million records per month
- CDL is today’s edge case but it is not a future-proof case
- LOC is ready to open the floodgates - maybe 20 million every couple of months; 5,000 is only a test case
- NARA - 3.8 million
Priorities
- Writing harvests, mappings, and enrichments in a scalable system
- We can grow into decisions about how to scale
|
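To put the proposed targets in throughput terms, here is a rough back-of-the-envelope sketch. The 30-day month, the 8-hour daily window, and the 100 records/second rate are assumptions for illustration only, not measurements from any prototype. The function names are hypothetical.

```python
SECONDS_PER_HOUR = 3600


def required_rate(records_per_month, hours_per_day=24, days_per_month=30):
    """Records per second needed to get through a monthly volume in the given window."""
    return records_per_month / (days_per_month * hours_per_day * SECONDS_PER_HOUR)


def estimated_hours(record_count, records_per_second):
    """Hours needed to process record_count at a steady rate."""
    return record_count / records_per_second / SECONDS_PER_HOUR


# Proposed target of 15 million records per month, running around the clock.
print(round(required_rate(15_000_000), 2))                   # 5.79
# Same target if ingests only run 8 hours a day (assumed window).
print(round(required_rate(15_000_000, hours_per_day=8), 2))  # 17.36
# NARA's 3.8 million records at an assumed 100 records/second.
print(round(estimated_hours(3_800_000, 100), 1))             # 10.6
```

The same arithmetic gives a predictable time estimate for any single ingest once a real per-second rate has been measured.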