Agile Data Science: A Portfolio Approach to Managing An Analytics Team

I recently concluded a two-year stint managing a team of ten highly skilled analytics professionals spread across three different locations. There were, of course, many challenges, but the team over-achieved on just about every measure of success one can imagine. The team’s wins include completing on time and under budget a data-matching project that delivers tens of …

Scalable Entity Resolution Using Probabilistic Signatures on Parallel Databases

My colleagues and I have just published on arXiv a simple but highly effective Entity Resolution algorithm that can scale to billions of records and handle significant data quality issues. The paper is titled Scalable Entity Resolution Using Probabilistic Signatures on Parallel Databases and it is an extension of our previous paper on linking millions of addresses …

The Missing Data Science Language?

Having spent nearly a decade studying the design and implementation of declarative programming languages in a previous life, I get a bit frustrated whenever I see people getting religious about programming languages and platforms. In the data science circle, an active discussion is around Scala (on Spark) vs SQL (on parallelised relational databases). They are …

Agile Data Science: Don’t Let Your Model Die in a Powerpoint Presentation

Most data science projects are doomed to failure before they even start. There are a couple of reasons. The aspiring data scientist and management may be drawn to a sexy problem rather than an important problem. The full range of data required to do a complete analysis may be inaccessible or even non-existent. And even …

Practical Algorithms for Distributed Privacy-Preserving Risk Modelling

In a previous post on the problem of detecting complex financial crimes, I described the following basic technology framework for financial intelligence units (FIUs) and their partner agencies and reporting entities (REs) to engage in collaborative but privacy-preserving and distributed risk modelling using confidential computing technologies. In this post, I describe a few concrete algorithms that …

How to Quickly and Meaningfully Improve the Financial System’s Collective Ability to Detect Crimes

Complex financial crimes are hard to detect primarily because data related to different pieces of the overall puzzle are usually distributed across a network of financial institutions, regulators, and law-enforcement agencies. The problem is also rapidly increasing in complexity because new platforms are emerging all the time that facilitate the transfer of value across a …