My colleagues and I have just published on arXiv a simple but highly effective Entity Resolution algorithm that can scale to billions of records and handle significant data quality issues. The paper is titled "Scalable Entity Resolution Using Probabilistic Signatures on Parallel Databases", and it extends our previous paper on linking millions of addresses … More Scalable Entity Resolution Using Probabilistic Signatures on Parallel Databases
Having spent nearly a decade studying the design and implementation of declarative programming languages in a previous life, I get a bit frustrated whenever I see people getting religious about programming languages and platforms. In data science circles, one active debate is Scala (on Spark) vs SQL (on parallelised relational databases). They are … More The Missing Data Science Language?
In a previous post on the problem of detecting complex financial crimes, I described the following basic technology framework for financial intelligence units (FIUs) and their partner agencies and reporting entities (REs) to engage in collaborative but privacy-preserving and distributed risk modelling using confidential computing technologies. In this post, I describe a few concrete algorithms that … More Practical Algorithms for Distributed Privacy-Preserving Risk Modelling
Solving big hairy problems like detecting complex financial crimes requires solving a series of smaller, mundane but technically non-trivial problems. Performing efficient record linkage on large databases with tens to hundreds of millions of rows of data is one such pesky problem. A few of my colleagues have just made a small dent in the overall … More How to Link Millions of Addresses with Ten Lines of Code in Ten Minutes
Financial Intelligence Units (FIUs) around the world collect data like threshold transaction reports, international fund transfer reports, and suspicious matter/activity reports from Reporting Entities (REs), which include banks, money remitters, casinos, law firms, real-estate companies, and financial companies. They may also get data about entities of interest from partner agencies (PAs) like law-enforcement agencies (LEAs) … More Detecting Financial Crimes: Current State, Limitations, and A Way Forward
I have just received the excellent news that Apache MADlib, a big data machine learning library for which I was a committer until recently, has graduated to become a top-level Apache project. The basic idea behind MADlib is actually quite interesting and deserves to be more widely known. Massively Parallel Processing (MPP) databases like Greenplum have … More In-Database Machine Learning Illustrated
In Part 1 of this blog article, we looked at the problem of tokenising a URL as an intermediate step towards learning user preference models from browsing histories. In Part 2, we next look at the problem of learning a URL classifier model from the preprocessed Shalla dataset using Support Vector Machines. A standard way … More Large-scale Subscriber Preference Modelling for Telcos – Part 2