Scalable Entity Resolution Using Probabilistic Signatures on Parallel Databases

My colleagues and I have just published on arXiv a simple but highly effective Entity Resolution algorithm that can scale to billions of records and handle significant data quality issues. The paper is titled Scalable Entity Resolution Using Probabilistic Signatures on Parallel Databases and it is an extension of our previous paper on linking millions of addresses …

How to Link Millions of Addresses with Ten Lines of Code in Ten Minutes

Solving big hairy problems like detecting complex financial crimes requires solving a series of smaller, mundane but technically non-trivial problems. Performing efficient record linkage on large databases with tens to hundreds of millions of rows of data is one such pesky problem. A few of my colleagues have just made a small dent on the overall …

From Words to Concepts: Explicit Semantic Analysis

Everyone with a rudimentary understanding of text analytics knows about term-frequency-inverse-document-frequency (TF-IDF) vectors. What is less known, but deserves to be more widely appreciated and applied, is a related little trick called Explicit Semantic Analysis (ESA), introduced by E. Gabrilovich and S. Markovitch, for which they were awarded the 2014 IJCAI-JAIR Best Paper Prize (https://www.jair.org/bestpaper.html). ESA …