2016-10-28 Meeting notes

Date

28 Oct 2016

Attendees

Goals

Discuss R&D projects.
Discuss allocation and make issue tracker tasks for prototype work.
Discuss milestones, deadlines.
Generate list of priorities for what we want our solution to achieve

Discussion items

Item	Notes
Priorities (what problems are we solving?)	Speed: speed is a feature. Predictably say how long some ingest will take. Allowing recovery from failure; pick up where it left off. Speed affects this; if it's fast enough you don't have to worry about it. Otherwise, make sure there's recovery. Harvesters should allow recovery, where possible. Indexers could also be less speedy than mappings and enrichments, and may deserve recovery features. Adding automation that was originally specified: have a program that shepherds the process all the way through. Scheduling. Eventually, provide a useable mapping DSL Needs real market research This is not a turnkey solution yet. Some things like DSLs will be evaluated later when we can more confident in understanding how big the user base is. Writing mappings ourselves in the third system without a DSL will allow us to understand the problem space better. Ability to debug things, especially mappings
Code examples	Got walkthroughs of the Python, Python + Spark, Java, and Scala prototypes https://github.com/dpla/pyfeta https://github.com/dpla/jetta https://github.com/dpla/scetta Considerations for language environments: Staff allocation for necessary professional development Mental "context switching" with multiple environments Server cost, if having to run more servers (consider execution speed memory usage if able to run on just one node, e.g.) Performance Ease of use for novice / non-programmer Ease of writing a DSL in it Ability to be explicit in code (e.g. types). Fewer inferences, LESS MAGIC! Ease of deployment. (local, production, dependencies) How easy for other institutions to adopt our code or experiment with it. Developer enjoyment What scale will we be working in for the foreseeable future? Ingestion total or by provider? Extra pressure b/c we want to reinvest more frequently - that could mean 10 million records per month Targets for 12-24 months (these are just proposals) 200 million records stored 15 million records per month CDL is today’s edge case but it is not a future-proof case LOC is ready to open the floodgates - maybe 20 million every couple months - 5,000 is only a test case NARA - 3.8 million Priorities Writing harvests, mappings, and enrichments in a scalable system We can grow into decisions about how to scale
Scheduling system	Scheduling / operation chaining / "Plans" in the Prov-O sense Need metrics for what qualifies job failure. (Partly thought out) Need to get together and assess our experiences running ingests. If we automate things, we need to know how to define success. Tools exist that can help with this. Need to schedule a period after basic manual ingest running is figured out, but need to design for there being a scheduling facility. Per "General consensus" section below, we will design programs for each activity in the ingest process that have their concerns passed to them. They will not know anything about the scheduling system that calls them. They will not be bound to them with database models. They will save manifests that document the results of their operations.
Roadmapping all of this	Harvester development - At the end of next week (after language selection) we should vote to select which class of harvester we want to prototype ResourceSync (harvesting DPLA's sitemap) API NARA CDL OAI-PMH WI Digital Commonwealth

General consensus on the project's design philosophy is to follow these principles:

Sequence files

Seems preferable to have a record i/o abstraction that allows either single files or sequence files

Tech

2016-10-28 Meeting notes

Date

Attendees

Goals

Discussion items

Action items

Related content