How to Link Millions of Addresses with Ten Lines of Code in Ten Minutes

Solving big, hairy problems such as detecting complex financial crimes requires solving a series of smaller, mundane but technically non-trivial problems. Efficient record linkage on large databases with tens to hundreds of millions of rows is one such pesky problem. A few of my colleagues have just made a small dent on the overall …

Automatic Data Integration using Normalised Compression Distance

Whenever two organisations come together to share data, we have a data integration problem. Datasets are typically mapped manually, which can be labour-intensive and error-prone. Importantly, the manual data-mapping process doesn’t scale, and that is a problem when you want to build an information-sharing network where arbitrary organisations can sign …
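
For readers unfamiliar with the technique named in the title, here is a minimal sketch of normalised compression distance (NCD) in its standard form, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size. The excerpt does not show how the post itself applies NCD, so everything below is an assumption made for illustration: zlib as the compressor, the helper names `compressed_size`, `column_bytes` and `ncd`, and the made-up sample columns.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed input (level 9)."""
    return len(zlib.compress(data, 9))


def column_bytes(values) -> bytes:
    """Serialise a column of string values, one value per line (illustrative choice)."""
    return "\n".join(values).encode("utf-8")


def ncd(x: bytes, y: bytes) -> float:
    """Normalised compression distance between two byte strings.

    Values near 0 suggest the inputs share a lot of structure;
    values near 1 suggest they share little.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    # Hypothetical columns from two datasets; not taken from the post.
    dates = column_bytes(f"2021-{i % 12 + 1:02d}-{i % 28 + 1:02d}" for i in range(200))
    emails = column_bytes(f"user{i}@example.com" for i in range(200))

    print("dates vs dates :", ncd(dates, dates))
    print("dates vs emails:", ncd(dates, emails))
```

Under these assumptions, a column-matching step could compute `ncd` for every pair of columns across two datasets and propose the lowest-scoring pairs as candidate mappings. Note that a real compressor only approximates the ideal, so scores on very short inputs tend to be noisy.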