The Education of a Data Scientist: On Sands and Other Irritants

I have learned over the years to distinguish between good data scientists and great data scientists in the way they handle the seemingly mundane aspects of data analysis, tasks like loading large but poorly structured datasets, dealing with missing data or poor quality data, finding the right way to interrogate and transform variables to satisfy …

How to Link Millions of Addresses with Ten Lines of Code in Ten Minutes

Solving big hairy problems like detecting complex financial crimes requires solving a series of smaller, mundane but technically non-trivial problems. Performing efficient record linkage on large databases with tens to hundreds of millions of rows of data is one such pesky problem. A few of my colleagues have just made a small dent in the overall …

Detecting Financial Crimes: Current State, Limitations, and A Way Forward

Financial Intelligence Units (FIUs) around the world collect data like threshold transaction reports, international fund transfer reports, and suspicious matter/activity reports from Reporting Entities (REs), which include banks, money remitters, casinos, law firms, real-estate companies, and financial companies. They may also get data about entities of interest from partner agencies (PAs) like law-enforcement agencies (LEAs) …

In-Database Machine Learning Illustrated

I have just received the excellent news that Apache MADlib, a big data machine learning library for which I was a committer until recently, has graduated to become a top-level Apache project. The basic idea behind MADlib is actually quite interesting and deserves to be more widely known. Massively Parallel Processing (MPP) databases like Greenplum have …

Setting up a Data Science Practice: Analytics Processes

In this third post on setting up a data science practice, I address some of the analytics processes that need to be in place to maximise value from analytics. After more than two decades of practice and development, there are now well-established data analytics frameworks like the Cross Industry Standard Process for Data Mining. …

Setting up a Data Science Practice: People Dimension

In the previous post, we discussed the key principles of setting up a data science practice. In this post, we’ll discuss the people dimension. Read what follows as suggestions, not prescriptions. There is more than one way to set up a data science practice. Critical to the success of a data science practice are …

Setting up a Data Science Practice: Fundamental Principles

I have been involved in the setup of several data science practices in both industry and government. Here are a few key principles I use in establishing a data science practice. Principle 1: Building a predictive enterprise is, first and foremost, about building a human infrastructure. Many companies mistakenly believe that analytics is primarily about software …