Programming – Page 2 – Mental Models 4 Life

Extending the Paillier Cryptosystem to Handle Floating Point Numbers

August 13, 2017

The Paillier Cryptosystem is a partially homomorphic encryption scheme that supports two important operations: the addition of two encrypted integers and the multiplication of an encrypted integer by an unencrypted integer. In practice, many applications of Paillier require an extension of the underlying scheme beyond integers to handle floating-point numbers. For example, just about every popular machine learning …
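The two operations named in this excerpt can be illustrated with a toy sketch of textbook Paillier. This is not the code from the post: the function names, the choice g = n + 1, and the tiny primes are assumptions made purely for illustration, and the parameters are far too small to be secure.

```python
import math
import random

def keygen(p, q):
    """Toy key generation from two small primes (hypothetical helper, not secure)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    g = n + 1                                          # common simplifying choice
    mu = pow(lam, -1, n)                               # valid because g = n + 1 (Python 3.8+)
    return (n, g), (lam, mu)

def encrypt(pub, m):
    """Encrypt an integer 0 <= m < n."""
    n, g = pub
    n2 = n * n
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(pub, priv, c):
    """Recover the plaintext integer from ciphertext c."""
    n, _ = pub
    lam, mu = priv
    x = pow(c, lam, n * n)
    return ((x - 1) // n * mu) % n

def add_encrypted(pub, c1, c2):
    """Ciphertext of m1 + m2: multiply the two ciphertexts mod n^2."""
    n, _ = pub
    return (c1 * c2) % (n * n)

def mul_by_plain(pub, c, k):
    """Ciphertext of k * m: raise the ciphertext to the plaintext k mod n^2."""
    n, _ = pub
    return pow(c, k, n * n)

# Illustrative primes only; real keys use primes of 1024 bits or more.
pub, priv = keygen(293, 433)
c17, c25 = encrypt(pub, 17), encrypt(pub, 25)
assert decrypt(pub, priv, add_encrypted(pub, c17, c25)) == 17 + 25
assert decrypt(pub, priv, mul_by_plain(pub, c17, 3)) == 17 * 3
```

Both operations act only on ciphertexts and the public key, so the party doing the arithmetic never sees 17 or 25. All results are taken modulo n, which is why plain integers are the natural message space and floating-point values need an additional encoding.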

The Education of a Data Scientist: On Sands and Other Irritants

July 23, 2017

I have learned over the years to distinguish between good data scientists and great data scientists in the way they handle the seemingly mundane aspects of data analysis, tasks like loading large but poorly structured datasets, dealing with missing data or poor quality data, finding the right way to interrogate and transform variables to satisfy …

How to Link Millions of Addresses with Ten Lines of Code in Ten Minutes

July 16, 2017

Solving big hairy problems like detecting complex financial crimes requires solving a series of smaller, mundane but technically non-trivial problems. Performing efficient record linkage on large databases with tens to hundreds of millions of rows of data is one such pesky problem. A few of my colleagues have just made a small dent in the overall …

In-Database Machine Learning Illustrated

June 1, 2017

I have just received the excellent news that Apache MADlib, a big data machine learning library for which I was a committer until recently, has graduated to become a top-level Apache project. The basic idea behind MADlib is actually quite interesting and deserves to be more widely known. Massively Parallel Processing (MPP) databases like Greenplum have …

Apache Spark vs MPP Databases

August 14, 2016

Everything that is old is new again. That’s the feeling I get when I look at Spark, which I learned is one of the fastest growing Apache projects in the big data space. There is remarkable similarity in the underlying architecture between Spark and that of a Massively Parallel Processing (MPP) Database like Greenplum or …

PL/Fortran and PL/C++ on PostgreSQL and Greenplum

December 18, 2015

Most modern big data platforms support parallel execution of (non-native) code written in languages like Python, Perl, R, and Java. On Greenplum and HAWQ, two massively parallel relational database systems, these facilities come in the form of PL/Python, PL/Perl, PL/R, and PL/Java, which are inherited from PostgreSQL. These programming facilities are useful for a range …

A Digitisation Project: Fun with Tesseract

October 12, 2015

As part of a broader data science project, I recently had the chance to undertake a digitisation project to augment the structured dataset we have for analysis. The project turned out to be quite instructive and I came away with a few lessons, lessons I hope to share in this blog piece. The actual digitisation problem is …

A Note on Lazy Evaluation in R

September 4, 2015

R is commonly thought of as a functional programming language. If you associate functional programming (FP) with lambda calculus and pure FP languages like Haskell, then you may be surprised by aspects of R’s computational model. One of these has to do with R’s lazy evaluation mechanism, in particular the concept of “promise objects” (as pointed out by some, …