...
However, we eventually settled on an easier way to partition the data for parallelization, one that eliminated unnecessary systems engineering effort. The Avro data serialization system provides a cross-platform library for persisting data in a parallelized data processing workflow. It works with Apache Spark's Resilient Distributed Datasets (RDDs), allowing Spark to send chunks of the whole dataset out to separate worker processes (called "executors" in Spark). The driver program that you submit to Spark collates the workers' results and persists the new data in a new Avro file. An RDD distributes a large dataset very efficiently and transparently hands each chunk off to the processing function that you write.
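As a concrete illustration, here is a minimal Scala sketch of that workflow, using the avro-mapred input format to load an Avro file into an RDD of generic records. The input path `data/input.avro` and the `name` field are hypothetical stand-ins for illustration, not details from our actual schema:

```scala
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.{AvroInputFormat, AvroWrapper}
import org.apache.hadoop.io.NullWritable
import org.apache.spark.{SparkConf, SparkContext}

object AvroRddSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("avro-rdd-sketch"))

    // Each input split of the Avro file becomes an RDD partition;
    // Spark hands those partitions to executors transparently.
    val records = sc.hadoopFile[AvroWrapper[GenericRecord], NullWritable,
                                AvroInputFormat[GenericRecord]]("data/input.avro")

    // Worker-side processing: extract a field from each record right away
    // (the input format reuses record objects, so don't hold raw wrappers).
    // The "name" field is a hypothetical example.
    val names = records.map { case (wrapper, _) =>
      wrapper.datum().get("name").toString
    }

    // The driver collates the workers' results.
    names.collect().foreach(println)

    sc.stop()
  }
}
```

The same division of labor holds for any processing function you substitute into the `map` step: Spark handles the distribution, and the driver sees only the collated results.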
Spark can be used with a variety of data formats, such as Hadoop SequenceFiles, which resemble Avro files in many ways; but Avro can be used outside of Spark, whereas SequenceFiles are harder to work with outside their intended Hadoop or Spark context. With Avro, it's more feasible to write the kinds of utilities we need to support our system, such as packaging up directories of flat files into Avro files or performing analytical or diagnostic functions.
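For instance, a small utility that packages a directory of flat files into a single Avro file needs only the plain Avro library, with no Spark or Hadoop runtime involved. The sketch below assumes a hypothetical `FlatFile` schema, an `input-dir` source directory, and a `flatfiles.avro` output name, none of which come from our actual system:

```scala
import java.io.File
import org.apache.avro.Schema
import org.apache.avro.file.DataFileWriter
import org.apache.avro.generic.{GenericData, GenericDatumWriter, GenericRecord}
import scala.io.Source

object PackFlatFiles {
  // Hypothetical schema: one record per source file, holding its path and contents.
  val schema: Schema = new Schema.Parser().parse(
    """{"type":"record","name":"FlatFile","fields":[
      |  {"name":"path","type":"string"},
      |  {"name":"body","type":"string"}
      |]}""".stripMargin)

  def main(args: Array[String]): Unit = {
    val writer = new DataFileWriter[GenericRecord](
      new GenericDatumWriter[GenericRecord](schema))
    writer.create(schema, new File("flatfiles.avro"))

    // Append one Avro record per flat file in the directory.
    // (listFiles returns null for a missing directory; this sketch assumes it exists.)
    for (f <- new File("input-dir").listFiles() if f.isFile) {
      val src = Source.fromFile(f)
      val body = try src.mkString finally src.close()
      val rec = new GenericData.Record(schema)
      rec.put("path", f.getPath)
      rec.put("body", body)
      writer.append(rec)
    }
    writer.close()
  }
}
```

Because the output is an ordinary Avro container file, the same file can later be fed straight into the Spark workflow described above.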
...