Our ingestion system will embody some of the principles of the Lambda Architecture, espoused in the book Big Data, by Nathan Marz (Manning Publications Co., 2015). "Data," meaning original record data, will be stored in immutable files, to which we can always return for reprocessing. The data are raw, immutable, and permanently true. The steps of our mapping and enrichment activities produce information that we will also store in immutable files. None of our activities will persist information in databases, either relational or otherwise. The files will be read by worker processes operating in parallel on individual partitions of the data. These processes will generate more files that can, in turn, be read by subsequent workers.
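To make that file-to-file flow concrete, here is a toy sketch of the pattern (not our actual code): every stage reads an existing, immutable file and writes its output to a new file, so the original records are never modified. The file names, the JSON-lines format, and the transforms are all hypothetical.

```python
# Toy illustration of the immutability principle: each stage reads an
# existing file and writes a brand-new file, never modifying its input.
# Paths, record format, and transforms are placeholders.
import json

def run_stage(input_path, output_path, transform):
    """Read records from an immutable input file; write results to a new file."""
    # Mode "x" refuses to overwrite an existing file, reinforcing immutability.
    with open(input_path) as infile, open(output_path, "x") as outfile:
        for line in infile:                      # one JSON record per line
            record = json.loads(line)
            outfile.write(json.dumps(transform(record)) + "\n")

# The original record data stays untouched; each step adds a new file.
run_stage("original_records.jsonl", "mapped_records.jsonl",
          lambda r: {**r, "mapped": True})
run_stage("mapped_records.jsonl", "enriched_records.jsonl",
          lambda r: {**r, "enriched": True})
```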
We intend to use Apache Spark, a computing framework that is well suited to distributed batch processing. Spark handles much of the engineering work involved in coordinating worker processes to transform data. Chief among those concerns is making sure that each worker has access to the right chunk of data, without our having to do anything complicated to synchronize particular parts of the dataset across a number of worker machines.
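As a rough illustration of that division of labor, the following PySpark sketch shows the driver handing a collection of records to Spark, which splits it into partitions and applies the same function to each partition on its executors. The record structure and the enrich step are placeholders, not part of our mapping code.

```python
from pyspark import SparkContext

def enrich(record):
    # Placeholder for a real mapping/enrichment step.
    enriched = dict(record)
    enriched["enriched"] = True
    return enriched

sc = SparkContext(appName="ingest-sketch")

# The driver defines the dataset; Spark splits it into partitions and ships
# each partition to a worker, so we never manage chunk placement ourselves.
records = [{"id": i, "title": "record %d" % i} for i in range(1000)]
rdd = sc.parallelize(records, numSlices=8)

# Each executor runs enrich() on its own partition; collect() gathers the
# results back to the driver.
results = rdd.map(enrich).collect()

sc.stop()
```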
Early on, we considered having the workers share a networked filesystem such as NFS or Amazon's Elastic File System (EFS). In that scenario, where we assumed we would have potentially millions of files per ingest (one per record), the challenge was to spread that large collection of files across enough directories to avoid the practical limits that Linux filesystems place on the number of entries in a single directory, and the performance problems that come with very large directories.
However, we eventually settled on an easier way to partition the data for parallel processing, one that eliminates unnecessary systems engineering effort. The Avro data serialization system provides cross-platform libraries for persisting data in a form that suits a parallelized processing workflow. It works with Apache Spark's Resilient Distributed Datasets (RDDs), which allow Spark to send chunks of the whole dataset out to separate worker processes (called "executors" in Spark). The driver program that you submit to Spark collates the workers' results and persists the new data in a new Avro file. The RDD is the mechanism by which Spark distributes a large dataset efficiently and transparently hands each partition to the function you write to process it.
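The sketch below shows the Avro-in, Avro-out pattern using Spark's DataFrame API (an RDD view of the same data is available via `.rdd`), assuming the external spark-avro package has been supplied to spark-submit. The paths, field names, and transformation are hypothetical.

```python
# Run with something like:
#   spark-submit --packages org.apache.spark:spark-avro_2.12:3.3.0 this_script.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, upper

spark = SparkSession.builder.appName("avro-sketch").getOrCreate()

# Spark reads the Avro data and divides it into partitions for the executors.
records = spark.read.format("avro").load("/data/ingest/original-records")

# Each executor applies the transformation to its own partition of rows.
mapped = records.withColumn("title", upper(col("title")))

# An RDD view of the same data, if we want to work at that level:
# mapped_rdd = mapped.rdd

# The results are persisted as a new, immutable set of Avro files
# (Spark writes a directory of part files).
mapped.write.format("avro").save("/data/ingest/mapped-records")

spark.stop()
```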
Spark is typically used with Hadoop SequenceFiles, which resemble Avro files in many ways, but Avro is easy to use outside of Spark, whereas SequenceFiles are harder to work with outside of their intended Hadoop or Spark context. With Avro, it is more feasible to write the kinds of utilities we need to support our system, such as tools that package directories of flat files into Avro files or that perform analytical or diagnostic functions; one such utility is sketched below.
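As one example of that kind of utility, here is a sketch that packages a directory of flat files into a single Avro file using the fastavro library; the schema, field names, and directory layout are invented for illustration.

```python
# Sketch of a utility that bundles a directory of flat record files into one
# Avro file. Uses the fastavro library; schema and paths are hypothetical.
import os
from fastavro import parse_schema, writer

schema = parse_schema({
    "namespace": "ingest.sketch",
    "type": "record",
    "name": "FlatFile",
    "fields": [
        {"name": "filename", "type": "string"},
        {"name": "contents", "type": "string"},
    ],
})

def records_from_directory(path):
    """Yield one Avro record per flat file in the directory."""
    for name in sorted(os.listdir(path)):
        filepath = os.path.join(path, name)
        if os.path.isfile(filepath):
            with open(filepath, encoding="utf-8") as f:
                yield {"filename": name, "contents": f.read()}

# Write all of the directory's files into a single Avro container file.
with open("records.avro", "wb") as out:
    writer(out, schema, records_from_directory("/data/ingest/flat-files"))
```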
TODO: links here to the pyfeta project, with its Avro demonstrations.