A Note on Large Scale Data Matching and Entity Resolution

Data matching and entity resolution is a common first step in data preparation and there is a thousand academic papers written on the subject in the literature. In practice, for large datasets – anything more than a million records will do as a definition of large here because most data-matching algorithms can’t handle that because … More A Note on Large Scale Data Matching and Entity Resolution

Towards Fair and Privacy-Preserving Federated Deep Learning Models

My former postdoc Lingjuan Lyu has been working with a few research collaborators on a fair and privacy-preserving federated deep-learning framework and a paper describing the framework has just been published at the IEEE Transactions on Parallel and Distributed Systems. Here’s the paper details: Title: Towards Fair and Privacy-Preserving Federated Deep Models Abstract: The current … More Towards Fair and Privacy-Preserving Federated Deep Learning Models

Distributed Privacy-Preserving Prediction

Another day, another paper, this time by my postdoc Lingjuan Lyu and a few collaborators. Here’s the abstract: In privacy-preserving machine learning, individual parties are reluctant to share their sensitive training data due to privacy concerns. Even the trained model parameters or prediction can pose serious privacy leakage. To address these problems, we demonstrate a … More Distributed Privacy-Preserving Prediction

Accurate and Efficient Privacy-Preserving String Matching

A few ANU colleagues and I have just completed a paper on a suffix-tree-based algorithm for computing the longest common substring of two strings in a privacy-preserving manner. Here’s the abstract: The task of calculating similarities between strings held by different organizations without revealing these strings is an increasingly important problem in areas such as … More Accurate and Efficient Privacy-Preserving String Matching

Linking Integer Records: The Simplest Case of PPRL

Privacy-Preserving Record Linkage (PPRL) is one of those problems that still doesn’t have a solid and widely accepted mathematical definition, perhaps because the problem of Record Linkage itself, especially the kind that doesn’t reduce to supervised learning through an abundance of labelled matches, still doesn’t have a solid mathematical definition despite thousands of papers published … More Linking Integer Records: The Simplest Case of PPRL

Hardening Bloom Filters using Paillier Encryption

Bloom Filters is a popular technique for privacy-preserving record linkage. However, recent work by Christen et al [1] and others have shown that Bloom Filters (BF) are susceptible to different forms of frequency attack. There are many ideas on hardening BF to protect against frequency attacks, and one idea we will explore in this blog article … More Hardening Bloom Filters using Paillier Encryption

Scalable Entity Resolution Using Probabilistic Signatures on Parallel Databases

My colleagues and I have just published on arXiv a simple but highly effective Entity Resolution algorithm that can scale to billions of records and handle significant data quality issues. The paper is titled Scalable Entity Resolution Using Probabilistic Signatures on Parallel Databases and it is an extension of our previous paper on linking millions of addresses … More Scalable Entity Resolution Using Probabilistic Signatures on Parallel Databases

Practical Algorithms for Distributed Privacy-Preserving Risk Modelling

In a previous post on the problem of detecting complex financial crimes, I described the following basic technology framework for financial intelligence units (FIUs) and their partner agencies and reporting entities (REs) to engage in collaborative but privacy-preserving and distributed risk modelling using confidential computing technologies. In this post, I describe a few concrete algorithms that … More Practical Algorithms for Distributed Privacy-Preserving Risk Modelling

How to Quickly and Meaningfully Improve the Financial System’s Collective Ability to Detect Crimes

Complex financial crimes are hard to detect primarily because data related to different pieces of the overall puzzle are usually distributed across a network of financial institutions, regulators, and law-enforcement agencies. The problem is also rapidly increasing in complexity because new platforms are emerging all the time that facilitate the transfer of value across a … More How to Quickly and Meaningfully Improve the Financial System’s Collective Ability to Detect Crimes

How to Link Millions of Addresses with Ten Lines of Code in Ten Minutes

Solving big hairy problems like detecting complex financial crimes requires solving a series of smaller, mundane but technically non-trivial problems. Performing efficient record linkage on large databases with tens to hundreds of millions of rows of data is one such pesky problem. A few of my colleagues have just made a small dent on the overall … More How to Link Millions of Addresses with Ten Lines of Code in Ten Minutes