Whenever two organisations come together to share data, we have a data integration problem. Mapping of datasets is typically done manually, and that can be a labour-intensive and error-prone process. Importantly, the manual data-mapping process doesn’t scale, which is a problem when you want to build an information-sharing network where arbitrary organisations can sign up and start sharing data with other organisations in the same network. To support such use cases, we need algorithms that can perform (semi-)automatic data integration. There is now a fair bit of research on this topic, and I want to contribute an idea in this area in this post.
A key problem in automatic data integration is the detection of possible matches between two table columns. For example, should we match a column called First_name in table CustomerDB with a column called 1stName in table Clients? This is a non-trivial problem that requires contextual information to resolve. One could take an ontology-based approach, which requires the organisations wanting to share data to carefully define their respective metadata in a suitable logic (e.g. description logic is a common formalism in the semantic-web literature) and then pose data-mapping questions as queries in that logic. Or one can take a data-driven approach.
If labelled data is available, one can easily employ supervised-learning techniques to learn, for example, data-type classification models. In the more common case where no labelled data is available, an interesting idea to try is to calculate the Normalised Compression Distance (NCD) between two columns of values and use an appropriate threshold to detect possible matches. The basic idea is simply that a column of, say, dollar values should compress better together with another column of dollar values than with, say, a column of dates.
NCD is attractive in this context because it avoids the need to do explicit feature construction. It is a universal method for comparing two arbitrary objects, each represented as a sequence of symbols. All we need is a general-purpose data compressor (or coder) like gzip. Here’s the definition of NCD.
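For two objects x and y, write C(x) for the length in bytes of the compressed version of x, and xy for the concatenation of x and y. Then

NCD(x, y) = ( C(xy) - min{C(x), C(y)} ) / max{C(x), C(y)}

The value is close to 0 when the two objects share a lot of structure (each compresses well given the other) and close to 1 when they are essentially unrelated.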
Here is some R code to illustrate the compression-based similarity measure idea:
```r
# A string with a strongly repetitive pattern: "12345" repeated 20 times
history <- rep(c(1, 2, 3, 4, 5), 20)
str <- Reduce(paste0, history)

h1 <- memCompress(str)       # compress the repetitive string (gzip by default)
h2 <- memCompress('12345')   # compress the short pattern on its own
length(h1)
length(h2)

h3 <- memCompress(paste0(str, '12345'))
length(h3)   # appending another copy of the pattern adds no extra bytes

h4 <- memCompress(paste0(str, '11803'))
length(h4)   # appending an unrelated string adds 4 more bytes
```
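To connect this back to the column-matching idea, here is a minimal sketch of how one might wrap memCompress into an NCD function and use a distance threshold to flag candidate column matches. The helper names (clen, ncd, columns_match), the toy columns, and the 0.3 threshold are all illustrative assumptions, not part of any existing library.

```r
# Compressed length in bytes of a column, treated as one concatenated string
clen <- function(x) length(memCompress(paste(x, collapse = "")))

# Normalised Compression Distance between two columns, per the definition above
ncd <- function(x, y) {
  cx  <- clen(x)
  cy  <- clen(y)
  cxy <- clen(c(x, y))
  (cxy - min(cx, cy)) / max(cx, cy)
}

# Flag a candidate match when the distance falls below a chosen threshold
# (0.3 is an arbitrary illustrative value)
columns_match <- function(x, y, threshold = 0.3) ncd(x, y) < threshold

# Toy example: two columns of dollar amounts and one column of dates
dollars1 <- c("$12.50", "$3.99", "$120.00", "$45.10")
dollars2 <- c("$7.25", "$89.99", "$0.50", "$15.00")
dates    <- c("2021-01-15", "2020-11-03", "2019-06-22", "2021-08-30")

ncd(dollars1, dollars2)   # distance between the two dollar columns
ncd(dollars1, dates)      # distance between dollars and dates
```

With columns this small, the compressor’s fixed overhead dominates, so the separation between matching and non-matching pairs only becomes reliable on longer columns (or samples of a few hundred values per column). In practice one would compute the distance for every candidate column pair and pick the threshold empirically.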