Large-Scale Distributed Analytics: A Research Program

Since starting my part-time appointment as an associate professor at the Australian National University, I have been thinking about spending more time on fundamental research. As Don Knuth counsels, “if you find that you’re spending almost all your time on theory, start turning some attention to practical things; it will improve your theories. If you find that you’re spending almost all your time on practice, start turning some attention to theoretical things; it will improve your practice.” Having spent the last 10 years on the practice of data science in different organisations, I feel it’s time for me to indulge in some theoretical pursuits.

The first step is to find a problem of sufficient scope and interest. There are a number of topics I could pursue, so I thought writing them down would help clarify things in my mind. Hence this post and possibly a few others that will come.

One of the research problems I have always wanted to come back and solve is the inference problem for a completely general probabilistic logic described in this paper: Probabilities on Sentences in an Expressive Logic. The full inference problem is of course intractable for many different reasons so we are really looking for practical and efficient approximate solutions for subsets of the overall problem. One way to frame the lesser problem is the study of computational systems capable of addressing practical business and scientific problems, in all their complexities, using modern knowledge-based distributed probabilistic inference techniques on enterprise-grade massively parallel database systems. Let me motivate that formulation.

The last twenty years have seen tremendous progress in the areas of machine learning, computational logic, and parallel database systems. Each area addresses an important problem: efficient probabilistic inference and learning in the case of machine learning; automated reasoning with codified human knowledge in the case of computational logic; and efficient big data processing in the case of parallel database systems. Progress in each of these areas, on their own, has had significant impacts in the actual practice of business and science worldwide, and these productivity-enhancing innovations are still percolating through from the early adopters to the broader community. This is, in essence, what the global Big Data movement is about.

So what’s next? Looking ahead 5 to 10 years, where might the next set of innovations come from? Studies on characteristics of emerging fields have repeatedly shown that multi-disciplinary research that unifies previously poorly connected areas have the best chance of achieving lasting values. To make further progress and achieve what Charlie Munger might call Lollapalooza effects in Big Data and Data Science, we have to start thinking about understanding and building systems that integrate core technologies from machine learning, computational logic, and parallel database systems. After all, real-world business problems often contain a high level of uncertainty, but there are usually deep domain knowledge and large amounts of data – possibly messy and disparate – available to solve them, if only we have the right computational tools to codify and reason probabilistically with such knowledge and data. The goal of this research program is to understand and build such decision-support tools, by integrating and innovating on existing highly successful problem-solving formalisms, namely probabilistic graphical models and distributed algorithms from machine learning, equational reasoning and automated theorem proving from computational logic, and parallelised SQL and in-database analytics from parallel database systems.

We are certainly not alone in this venture. The infer.NET program at Microsoft, the Apache MADlib project started at Pivotal and UC Berkeley, the Apache Mahout and Spark projects – to give a few examples of platforms in large-scale commercial use – and countless other research projects in universities and companies, provide significant social proof on the importance of the subject. As is usual and healthy in an emerging field, each project has its own methodology and biases.

Here’s an outline of the goals and principles that would provide guidance on how I would tackle the problem, using my own biases and experience.

  1. The research team will take a multi-disciplinary engineering approach favouring practical but principled solution building. The emphasis will be on the research, design, and implementation of open systems that can be and are used in commercial settings.
  2. To achieve engineering professionalism from day one and facilitate adoption of our technologies, all research and development will be done on top of existing widely adopted open platforms. For example, we will work with OpenBUGS and R in developing our probabilistic inference and automated reasoning layers, and stick to SQL and widely-used systems like Greenplum and Hadoop for the data management layer. In particular, when faced with limitations in existing platforms, we will build extensions but not invent new languages and systems to circumvent those limitations.
  3. To stay practically grounded and ensure we have pathways to achieve business impacts, we favour projects that plug readily into existing industry supported projects.
  4. To maintain scientific discipline, all R&D work need to be guided by solid and fundamental principles underpinned by rigorous mathematical theories.

Here is a list of specific projects that are consistent with the goals and principles outlined above. The list is of course not exhaustive.

  • Large-Scale In-Database Machine Learning – The project goal is to study the design and implementation of practical distributed learning algorithms and make them available in platforms like MADlib and R.
  • Real-Time Machine Learning – The project goal is to study the design and implementation of efficient online learning algorithms and make them available in open platforms like Apache Spark and Apache Geode.
  • Automated Inference in Distributed Probabilistic Databases – The project goal is to study the design and implementation of a probabilistic inference engine that is tightly integrated with a parallel relational database. The system will support a Bugs-like language, with full equational reasoning support, for capturing probabilistic graphical models and an inference engine that can turn probabilistic queries into SQL operations that can be executed efficiently in the underlying parallel database.

There you have it, one way to keep (meaningfully) busy for at least 5 years.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s