A Note on Large Scale Data Matching and Entity Resolution

Data matching and entity resolution is a common first step in data preparation, and there are a thousand academic papers written on the subject. In practice, for large datasets – anything more than a million records will do as a definition of large here because most data-matching algorithms can’t handle that because …

Large-Scale Distributed Analytics: A Research Program

Since starting my part-time appointment as an associate professor at the Australian National University, I have been thinking about spending more time on fundamental research. As Don Knuth counsels, “if you find that you’re spending almost all your time on theory, start turning some attention to practical things; it will improve your theories. If you …

Scalable Entity Resolution Using Probabilistic Signatures on Parallel Databases

My colleagues and I have just published on arXiv a simple but highly effective Entity Resolution algorithm that can scale to billions of records and handle significant data quality issues. The paper is titled Scalable Entity Resolution Using Probabilistic Signatures on Parallel Databases and it is an extension of our previous paper on linking millions of addresses …

The Missing Data Science Language?

Having spent nearly a decade studying the design and implementation of declarative programming languages in a previous life, I get a bit frustrated whenever I see people getting religious about programming languages and platforms. In the data science circle, an active discussion is around Scala (on Spark) vs SQL (on parallelised relational databases). They are …

Practical Algorithms for Distributed Privacy-Preserving Risk Modelling

In a previous post on the problem of detecting complex financial crimes, I described the following basic technology framework for financial intelligence units (FIUs) and their partner agencies and reporting entities (REs) to engage in collaborative but privacy-preserving and distributed risk modelling using confidential computing technologies. In this post, I describe a few concrete algorithms that …

How to Link Millions of Addresses with Ten Lines of Code in Ten Minutes

Solving big hairy problems like detecting complex financial crimes requires solving a series of smaller, mundane but technically non-trivial problems. Performing efficient record linkage on large databases with tens to hundreds of millions of rows of data is one such pesky problem. A few of my colleagues have just made a small dent in the overall …

Detecting Financial Crimes: Current State, Limitations, and A Way Forward

Financial Intelligence Units (FIUs) around the world collect data like threshold transaction reports, international fund transfer reports, and suspicious matter/activity reports from Reporting Entities (REs), which include banks, money remitters, casinos, law firms, real-estate companies, and financial companies. They may also get data about entities of interest from partner agencies (PAs) like law-enforcement agencies (LEAs) …

In-Database Machine Learning Illustrated

I have just received the excellent news that Apache MADlib, a big data machine learning library for which I was a committer until recently, has graduated to become a top-level Apache project. The basic idea behind MADlib is actually quite interesting and deserves to be more widely known. Massively Parallel Processing (MPP) databases like Greenplum have …

Large-scale Subscriber Preference Modelling for Telcos – Part 2

In Part 1 of this blog article, we looked at the problem of tokenising a URL as an intermediate step towards learning user preference models from browsing histories. In Part 2, we look at the problem of learning a URL classifier model from the preprocessed Shalla dataset using Support Vector Machines. A standard way …

Large-scale Subscriber Preference Modelling for Telcos – Part 1

An important way telcos can increase revenue is to improve, within the constraints of privacy laws, provision of personalised services for subscribers. To achieve that, they need to be able to build good subscriber preference models. These can take a number of forms, depending on the specific business context and the exact data available. In this …