Everything that is old is new again. That’s the feeling I get when I look at Spark, which I learned is one of the fastest-growing Apache projects in the big data space.
The underlying architecture of Spark is remarkably similar to that of a Massively Parallel Processing (MPP) database like Greenplum or Netezza. A Resilient Distributed Dataset (RDD) in Spark is, of course, essentially the same thing as a distributed table in an MPP database: a distributed collection of key-value pairs partitioned by a hash on the key, where the values can be more-or-less arbitrary objects. (Strictly speaking, that describes a pair RDD; a plain RDD is a distributed collection of arbitrary objects, but the analogy holds.) Each of the common RDD operations, including map(), filter(), union(), intersection(), cartesian(), distinct(), aggregate(), fold(), count(), take(), collect(), etc., has a direct equivalent in an MPP database.¹ Even advanced Spark programming constructs like accumulators and broadcast variables have natural counterparts in MPP database operators that deal with movement of data.
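To make the analogy a little more concrete, here is a minimal sketch in plain Python (not Spark or Greenplum code) of a hash-partitioned collection of key-value pairs with a few RDD-style operations on top. The helper names (`hash_partition`, `map_partitions`, `filter_partitions`, `count`), the choice of crc32 as the hash, and the sample `orders` data are all assumptions made for illustration. The comments point at the rough SQL counterpart of each step.

```python
import zlib

def partition_of(key, num_partitions):
    """Pick a partition from a stable hash of the key (crc32 is an arbitrary choice)."""
    return zlib.crc32(repr(key).encode("utf-8")) % num_partitions

def hash_partition(pairs, num_partitions):
    """Spread (key, value) pairs over partitions, much like rows over database segments."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[partition_of(key, num_partitions)].append((key, value))
    return partitions

def map_partitions(partitions, fn):
    """map(): roughly an expression in the SELECT list, evaluated locally on each partition."""
    return [[fn(row) for row in part] for part in partitions]

def filter_partitions(partitions, predicate):
    """filter(): roughly a WHERE clause, evaluated locally on each partition."""
    return [[row for row in part if predicate(row)] for part in partitions]

def count(partitions):
    """count(): local counts per partition followed by a final sum, like a two-phase COUNT(*)."""
    return sum(len(part) for part in partitions)

# Hypothetical sample data: (customer, amount) pairs.
orders = [("alice", 30), ("bob", 5), ("alice", 12), ("carol", 40), ("bob", 25)]

parts = hash_partition(orders, num_partitions=3)
big = filter_partitions(parts, lambda kv: kv[1] >= 20)          # WHERE amount >= 20
doubled = map_partitions(big, lambda kv: (kv[0], kv[1] * 2))    # SELECT customer, amount * 2
total = count(doubled)                                          # SELECT COUNT(*)
```

Because the partition is chosen from a hash of the key, all pairs with the same key land in the same partition, which is the property that both systems rely on for local joins and aggregations.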
For the uninitiated, an MPP database, in simple terms, is nothing more than a relational database that has been extended in two ways:
- the basic set of database operators is extended with new operators for moving data around a cluster, including relational join operators that implement distributed algorithms like hash-join;
- a query optimiser is added that knows how to turn a SQL query into a query plan that uses the extended set of database operators.
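As a rough illustration of the first extension, the following plain-Python sketch shows how a data-movement operator and an ordinary single-node hash join combine into a distributed hash join. It is a toy model under stated assumptions, not how any particular MPP database implements it: the function names (`redistribute`, `local_hash_join`, `distributed_hash_join`), the crc32 hash, and the `customers` and `orders` tables are all hypothetical.

```python
import zlib

def partition_of(key, num_segments):
    """Stable hash of the join key; crc32 is an arbitrary choice for this sketch."""
    return zlib.crc32(repr(key).encode("utf-8")) % num_segments

def redistribute(segments, key_fn, num_segments):
    """Data-movement operator: send every row to the segment that owns its join key."""
    moved = [[] for _ in range(num_segments)]
    for segment in segments:
        for row in segment:
            moved[partition_of(key_fn(row), num_segments)].append(row)
    return moved

def local_hash_join(left_rows, right_rows, left_key, right_key):
    """Classic build-and-probe hash join running on a single segment."""
    build = {}
    for row in left_rows:
        build.setdefault(left_key(row), []).append(row)
    return [(l, r) for r in right_rows for l in build.get(right_key(r), [])]

def distributed_hash_join(left, right, left_key, right_key, num_segments):
    """Move both inputs so matching keys are co-located, then join each segment locally."""
    left_moved = redistribute(left, left_key, num_segments)
    right_moved = redistribute(right, right_key, num_segments)
    return [
        local_hash_join(l, r, left_key, right_key)
        for l, r in zip(left_moved, right_moved)
    ]

# Hypothetical tables, initially spread over two segments with no regard to the join key.
customers = [[(1, "alice"), (2, "bob")], [(3, "carol")]]             # (id, name)
orders = [[(100, 3, 40)], [(101, 1, 30), (102, 1, 12), (103, 4, 7)]]  # (order_id, customer_id, amount)

joined = distributed_hash_join(
    customers, orders,
    left_key=lambda c: c[0],
    right_key=lambda o: o[1],
    num_segments=2,
)
flat = sorted(pair for segment in joined for pair in segment)
```

In a real MPP database the second extension decides whether such a redistribution is needed at all; if a table is already distributed on the join key, the optimiser can skip the movement for that side.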
So one can argue that Spark = MPP Database – Query Optimisation – Transaction Support, if you ignore the R&D work around Spark SQL, which is of course all about constructing a SQL query translator/optimiser for Spark.
In an MPP database like Greenplum, the query optimiser works out how to translate a given SQL query into a sequence of database operators that perform computations on distributed tables to arrive at the final answer. In Spark, a programmer has to work out how to compute what s/he wants by explicitly and deliberately sequencing RDD operations. The Greenplum programmer engages in declarative programming, defining what s/he wants to compute and leaving the query optimiser to work out the how. The Spark programmer engages in imperative programming (in a functional context); s/he defines exactly how the computation should be performed, aided by the conveniences of a functional programming language, such as higher-order functions. This is, of course, an age-old trade-off in computer science, one between
- programmer productivity and precise control; and
- a clever but complex system that can be volatile/unpredictable vs a simple system that does only a small, sufficient set of things well but is stable and reliable.
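The contrast between the two styles can be sketched with nothing but the Python standard library. In the sketch below, SQLite stands in for the SQL interface of an MPP database and a filter/reduce pipeline stands in for a sequence of RDD operations; neither is Greenplum or Spark code, and the `orders` data and the `add_to_totals` helper are assumptions made for illustration.

```python
import sqlite3
from functools import reduce

# Hypothetical sample data: (customer, amount) pairs.
orders = [("alice", 30), ("bob", 5), ("alice", 12), ("carol", 40), ("bob", 25)]

# Declarative style: state what is wanted and let the engine decide how to compute it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", orders)
declarative = dict(
    conn.execute(
        "SELECT customer, SUM(amount) FROM orders "
        "WHERE amount >= 10 GROUP BY customer"
    ).fetchall()
)
conn.close()

# Imperative style in a functional context: spell out each step of the computation.
def add_to_totals(totals, pair):
    """Fold one (customer, amount) pair into a dictionary of running totals."""
    customer, amount = pair
    return {**totals, customer: totals.get(customer, 0) + amount}

filtered = filter(lambda pair: pair[1] >= 10, orders)   # step 1: drop small orders
imperative = reduce(add_to_totals, filtered, {})        # step 2: sum per customer
```

Both halves compute the same per-customer totals. The difference is who chooses the order of the steps: the engine in the first half, the programmer in the second.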
I have no idea which approach will turn out to be the eventual winner. Many large traditional enterprises are still on MPP databases (most banks are still on Teradata and most telcos are still on Greenplum/Netezza), but Spark is clocking up wins in internet companies, both large and small, at an impressive rate. Maybe the two technologies will co-exist; maybe they will even converge. My own experience is that for most relatively straightforward business queries, one would prefer the SQL interface offered by an MPP database. But for more complex computations, like those involved in implementing machine learning algorithms, it’s better to work directly with the lower-level RDD operations offered by Spark. (I was a contributor to Apache MADlib and I’ll admit openly to the difficulty of implementing machine learning algorithms in SQL and its procedural extensions.) Spark SQL is currently still way behind a system like Greenplum, but I expect all the secret sauce in Greenplum’s Orca query optimiser and Cloudera’s Impala to eventually be ported across to Spark SQL, especially given the statistic that ~60% of Spark users adopted it because of Spark SQL, which is ironic for a NoSQL (Not only SQL) platform. 🙂
¹ Higher-order functions (functions that take other functions as arguments), like map() and filter(), can be supported natively in MPP databases through straight SQL and PL/* procedural extensions when they are only one level deep. Beyond one level of nesting, higher-order functions can be ‘faked’ in MPP databases via meta-programming techniques (passing function names as strings to SQL functions, then relying on database catalogues to retrieve the function definitions, followed by appropriate casting), but it’s awkward. All the usual trade-offs between functional programming, which is based on lambda calculus or higher-order logic, and logic programming, which is based on fragments of first-order logic, apply here. There is nothing that can be done in one system but not the other, and it all ultimately comes down to taste, or lack thereof…
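The ‘faking’ trick in the footnote can be mimicked in a few lines of plain Python. This is only an analogy under stated assumptions: the `CATALOGUE` dictionary plays the role of the database catalogue, and the names `register`, `map_by_name`, `double`, `increment` and `apply_twice` are hypothetical. Functions are passed around as strings and looked up by name at call time, instead of being passed as values.

```python
from typing import Callable, Dict, List

# Stand-in for a database catalogue: maps function names to definitions.
CATALOGUE: Dict[str, Callable] = {}

def register(name: str):
    """Store a function in the catalogue under the given name."""
    def decorator(fn: Callable) -> Callable:
        CATALOGUE[name] = fn
        return fn
    return decorator

@register("double")
def double(x: int) -> int:
    return x * 2

@register("increment")
def increment(x: int) -> int:
    return x + 1

@register("apply_twice")
def apply_twice(fn_name: str, x: int) -> int:
    """A 'higher-order' function that receives its argument function as a string."""
    fn = CATALOGUE[fn_name]  # catalogue lookup by name
    return fn(fn(x))

def map_by_name(fn_name: str, rows: List[int], *extra_args) -> List[int]:
    """map() where the function is named by a string; extra_args are passed before each row."""
    fn = CATALOGUE[fn_name]
    return [fn(*extra_args, row) for row in rows]

one_level = map_by_name("double", [1, 2, 3])                    # one level deep
two_levels = map_by_name("apply_twice", [1, 2, 3], "increment")  # a name passed to a name
```

With first-class functions the second call would simply take a lambda; the string-and-lookup detour is what makes the nested case feel awkward.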