Most modern big data platforms support parallel execution of (non-native) code written in languages like Python, Perl, R, and Java. On Greenplum and HAWQ, two massively parallel relational database systems, these facilities come in the form of PL/Python, PL/Perl, PL/R, and PL/Java, which are inherited from PostgreSQL. These programming facilities are useful for a range of computational tasks that are not traditionally done on a relational database but that are nevertheless important for a data scientist.
For example, one can take an existing sentiment analysis model written in R from the web and easily wrap it as a user-defined function in Greenplum/HAWQ with a command like this:
CREATE FUNCTION getSentiment(x text) RETURNS float AS $$
    # GetSentiment, words_pos and words_neg must already exist in the R
    # environment on every segment (e.g. preloaded via PL/R's plr_modules table)
    return(GetSentiment(x, words_pos, words_neg))
$$ LANGUAGE plr;
and, voilà, you can now run sentiment analysis in parallel over a large corpus of text distributed across the cluster, with a simple SQL query like
SELECT msg, getSentiment(msg) FROM atable;
This is the simplest way to do large-scale in-database analytics, and it is neat and powerful. If you spend enough time pushing the capability boundaries of systems like Greenplum and HAWQ with these programming facilities, you may come across situations where you want to execute a piece of non-trivial Fortran or C/C++ code in this parallel in-database style, only to find that there is no such thing as PL/Fortran or PL/C++. What do you do?
I came across this scenario some time back while working with a few colleagues on chemical process modelling. Specifically, we were trying to use the Levenberg-Marquardt (LM) algorithm to find the optimal parameter vector w for a process model f(x; w) formulated as a set of ordinary differential equations (ODEs). Each iteration of the LM algorithm requires evaluating the partial derivatives of the ODE model f(x; w) with respect to the parameters,

    ∂f(x; w)/∂w_j,   j = 1, ..., p,

and each partial derivative, approximated by a central difference,

    ∂f(x; w)/∂w_j ≈ [ f(x; w + h·e_j) − f(x; w − h·e_j) ] / (2h),

involves two evaluations of the ODE model. Since the ODE model has many parameters and each evaluation is a numerical integration of a complex system of ODEs, we wanted to parallelise the computation of the partial derivatives on Greenplum to speed up the LM algorithm. This would have been a simple programming task if not for the fact that the ODE numerical integration code was in Fortran!
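In SQL terms, the plan was roughly this: put one row per parameter index in a distributed table and let a PL/R function do the two ODE integrations for that index, so each segment computes its share of the partials in parallel. The names below (param_idx, partial_deriv) are illustrative, not from our actual code:

```sql
-- one row per parameter index, spread across the segments
CREATE TABLE param_idx (j int) DISTRIBUTED BY (j);
INSERT INTO param_idx SELECT generate_series(1, 20);

-- partial_deriv(j) is a hypothetical PL/R function that integrates the
-- ODE model twice (at w + h*e_j and w - h*e_j) and returns the
-- central-difference estimate of the j-th partial derivative
SELECT j, partial_deriv(j) AS dfdwj FROM param_idx;
```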
In light of the simplicity of the eventual solution we found, it is somewhat embarrassing to say we got stuck on this problem for weeks and tried many different things, including an ill-fated attempt to rewrite hundreds of lines of Fortran code in C. (We were desperate and the f2c program didn’t really work for us.) So how did we solve the problem eventually?
Answer: By writing a PL/R function that calls the Fortran routine through dynamic loading!
This is how it works. Suppose you have a Fortran program called foobar.f. You can compile it to a shared library using the command
R CMD SHLIB foobar.f
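For concreteness, foobar.f might look something like this (a toy subroutine standing in for the real ODE integration code):

```fortran
C     A toy subroutine standing in for the real ODE code. R's .Fortran
C     interface passes all arguments by reference, so results come back
C     through the argument list.
      SUBROUTINE FOOBAR(Y)
      DOUBLE PRECISION Y
      Y = 42.0D0
      RETURN
      END
```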
You then need to copy the resulting foobar.so file to the same path on every node of the underlying Greenplum/HAWQ cluster (Greenplum's gpscp utility can do this in one step). Executing the Fortran routine in parallel in-database is now a simple matter of writing a PL/R function like this
CREATE FUNCTION foobar() RETURNS float AS $$
    # load the compiled Fortran library (the path here is illustrative)
    dyn.load("/usr/local/lib/foobar.so")
    # .Fortran passes arguments by reference and returns them as a list
    out <- .Fortran("foobar", y = double(1))
    return(out$y)
$$ LANGUAGE plr;
and then calling the foobar() function in a (parallelised) SQL query.
One can use the same technique to execute C code in-database: compile it with R CMD SHLIB and call it through R's .C or .Call interface. With Rcpp, one can also call C++ code in-database in parallel!
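A minimal sketch of the Rcpp route, with an illustrative function name (sq) and the caveat that cppFunction compiles on first use, so the C++ toolchain and Rcpp must be installed on every segment:

```sql
CREATE FUNCTION square_cpp(x float8) RETURNS float8 AS $$
    library(Rcpp)
    # compile and bind a trivial C++ function (toy example)
    cppFunction('double sq(double x) { return x * x; }')
    return(sq(x))
$$ LANGUAGE plr;
```

In practice you would not want to pay the compilation cost on every call; compiling once into a shared library and loading it with dyn.load, as in the Fortran example above, is the sensible production setup.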