Since starting my part-time appointment as an associate professor at the Australian National University, I have been thinking about spending more time on fundamental research. As Don Knuth counsels, “if you find that you’re spending almost all your time on theory, start turning some attention to practical things; it will improve your theories. If you … More Large-Scale Distributed Analytics: A Research Program
What is the optimal way to introduce and embed analytics in an organisation? This is a question I have been wrestling with for years. There are three broad options and a lot of in-between: (1) a centralised analytics team that services different business units in the organisation; (2) a distributed structure where analytics professionals are … More The Case for Centralising Analytics
I recently concluded a two-year stint managing a team of ten highly skilled analytics professionals spread across three locations. There were, of course, many challenges, but the team over-achieved on just about every measure of success one can imagine. The team’s wins include completing, on time and under budget, a data-matching project that delivers tens of … More Agile Data Science: A Portfolio Approach to Managing An Analytics Team
My colleagues and I have just published on arXiv a simple but highly effective Entity Resolution algorithm that can scale to billions of records and handle significant data quality issues. The paper is titled Scalable Entity Resolution Using Probabilistic Signatures on Parallel Databases and it is an extension of our previous paper on linking millions of addresses … More Scalable Entity Resolution Using Probabilistic Signatures on Parallel Databases
Many of my most successful data science projects happen by accident. You know, the little skunkworks that arise from a serendipitous hallway conversation where an important and urgent business problem meets a half-baked analytical idea. With a suitable dash of data and the right mix of office politics and corporate kung fu, a baby data-science … More Agile Data Science: On Opportunism
Having spent nearly a decade studying the design and implementation of declarative programming languages in a previous life, I get a bit frustrated whenever I see people getting religious about programming languages and platforms. In data science circles, one active debate is Scala (on Spark) versus SQL (on parallelised relational databases). They are … More The Missing Data Science Language?
Most data science projects are doomed to failure before they even start. There are several reasons. The aspiring data scientist and management may be drawn to a sexy problem rather than an important problem. The full range of data required to do a complete analysis may be inaccessible or even non-existent. And even … More Agile Data Science: Don’t Let Your Model Die in a Powerpoint Presentation