Having spent nearly a decade studying the design and implementation of declarative programming languages in a previous life, I get a bit frustrated whenever I see people getting religious about programming languages and platforms. In data science circles, there is an active debate over Scala (on Spark) versus SQL (on parallelised relational databases). They are … More The Missing Data Science Language?
Most data science projects are doomed to failure before they even start. There are a couple of reasons. The aspiring data scientist and management may be drawn to a sexy problem rather than an important problem. The full range of data required to do a complete analysis may be inaccessible or even non-existent. And even … More Agile Data Science: Don’t Let Your Model Die in a Powerpoint Presentation
In a previous post on the problem of detecting complex financial crimes, I described the following basic technology framework for financial intelligence units (FIUs) and their partner agencies and reporting entities (REs) to engage in collaborative but privacy-preserving and distributed risk modelling using confidential computing technologies. In this post, I describe a few concrete algorithms that … More Practical Algorithms for Distributed Privacy-Preserving Risk Modelling
Complex financial crimes are hard to detect primarily because data related to different pieces of the overall puzzle are usually distributed across a network of financial institutions, regulators, and law-enforcement agencies. The problem is also rapidly increasing in complexity because new platforms are emerging all the time that facilitate the transfer of value across a … More How to Quickly and Meaningfully Improve the Financial System’s Collective Ability to Detect Crimes
The Paillier Cryptosystem is a partially homomorphic encryption scheme that supports two important operations: addition of two encrypted integers and multiplication of an encrypted integer by an unencrypted integer. In practice, many applications of Paillier require an extension of the underlying scheme beyond integers to handle floating-point numbers. For example, just about every popular machine learning … More Extending the Paillier Cryptosystem to Handle Floating Point Numbers
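To make the two operations named in the Paillier excerpt concrete, here is a minimal toy sketch of the textbook scheme in Python. It is not the code from the post: the function names (`keygen`, `encrypt`, `decrypt`, `add_encrypted`, `mul_plain`) are hypothetical, the generator is assumed to be g = n + 1, and the primes are far too small for real use. It only illustrates that multiplying ciphertexts adds the plaintexts, and that raising a ciphertext to a plain integer power multiplies the plaintext by that integer.

```python
import math
import secrets


def keygen(p, q):
    """Build a toy Paillier key pair from two distinct primes (illustrative only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    g = n + 1  # assumed generator; a common simplifying choice
    # mu is the modular inverse of L(g^lam mod n^2), where L(x) = (x - 1) // n
    mu = pow((pow(g, lam, n * n) - 1) // n, -1, n)
    return (n, g), (lam, mu)


def encrypt(pub, m):
    """Encrypt an integer m with 0 <= m < n."""
    n, g = pub
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1  # random r in [1, n-1]
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n2) * pow(r, n, n2)) % n2


def decrypt(pub, priv, c):
    """Recover the plaintext integer from ciphertext c."""
    n, _ = pub
    lam, mu = priv
    return ((pow(c, lam, n * n) - 1) // n * mu) % n


def add_encrypted(pub, c1, c2):
    """Homomorphic addition: the product of ciphertexts encrypts the sum."""
    n, _ = pub
    return (c1 * c2) % (n * n)


def mul_plain(pub, c, k):
    """Multiply the hidden plaintext by an unencrypted integer k."""
    n, _ = pub
    return pow(c, k, n * n)


if __name__ == "__main__":
    pub, priv = keygen(1009, 1013)  # toy primes, insecure by design
    c_a, c_b = encrypt(pub, 15), encrypt(pub, 27)
    print(decrypt(pub, priv, add_encrypted(pub, c_a, c_b)))
    print(decrypt(pub, priv, mul_plain(pub, c_a, 3)))
```

Results are only meaningful while the underlying sum or product stays below n, which is one reason a floating-point extension needs an explicit encoding on top of the integer scheme.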
I have learned over the years to distinguish between good data scientists and great data scientists by the way they handle the seemingly mundane aspects of data analysis, tasks like loading large but poorly structured datasets, dealing with missing or poor-quality data, finding the right way to interrogate and transform variables to satisfy … More The Education of a Data Scientist: On Sands and Other Irritants
Solving big hairy problems like detecting complex financial crimes requires solving a series of smaller, mundane but technically non-trivial problems. Performing efficient record linkage on large databases with tens to hundreds of millions of rows of data is one such pesky problem. A few of my colleagues have just made a small dent in the overall … More How to Link Millions of Addresses with Ten Lines of Code in Ten Minutes