Data matching and entity resolution is a common first step in data preparation, and thousands of academic papers have been written on the subject. In practice, for large datasets – anything over a million records will do as a definition of large here, because most data-matching algorithms have worst-case quadratic computational complexity and cannot handle that scale – there are only a small number of options: use a search tool like Elasticsearch, or adopt rule-based data-matching procedures based on name/address standardisation followed by exact matching on configurable keys. The following two papers describe the P-Sig algorithm, which belongs to the latter class and is close to the state of the art.
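To make the rule-based approach concrete, here is a minimal sketch of standardisation followed by exact matching on a configurable key. The standardisation rules and the choice of key (first four letters of the surname plus postcode) are illustrative assumptions, not part of P-Sig itself; the point is that indexing one dataset by key and probing with the other avoids the quadratic all-pairs comparison.

```python
import re
from collections import defaultdict

def standardise(name: str) -> str:
    """Uppercase, strip punctuation, and collapse whitespace."""
    return re.sub(r"\s+", " ", re.sub(r"[^A-Za-z ]", "", name)).strip().upper()

def blocking_key(record: dict) -> str:
    """Hypothetical key: first 4 letters of standardised surname + postcode."""
    return standardise(record["surname"])[:4] + "|" + record["postcode"]

def match_on_keys(dataset_a, dataset_b):
    """Index dataset_a by key, then probe with dataset_b:
    roughly linear work instead of comparing all pairs."""
    index = defaultdict(list)
    for rec in dataset_a:
        index[blocking_key(rec)].append(rec)
    return [(a, b) for b in dataset_b for a in index[blocking_key(b)]]
```

Only record pairs sharing a key are ever compared, which is what makes this family of methods viable at the million-record scale.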
Here are the main differences and trade-offs between the two approaches:
- Elasticsearch is fast but less accurate, because its comparison function between two strings str1 and str2 can rely only on information local to str1 and str2.
- P-Sig is slower but more accurate, because its comparison function between two strings str1 and str2 can make use of global information, such as the frequencies of str1 and str2 in the database.
For probabilistic matching, this global information seems to matter a great deal in calibrating the probability of a match, allowing us to say, for example, that a match on “Wuttke” is quantitatively more significant than a match on “Watson”, because Wuttke is quite a bit rarer than Watson, even though both are exact string matches of six characters.
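The Wuttke-versus-Watson intuition can be sketched with a Fellegi-Sunter style agreement weight, log2(m/u), where u (the chance that two random records agree on a name) is approximated by the name's relative frequency in the database. The counts below are made-up illustrations, and the m-probability of 0.95 is an assumed default, not a value from the papers above.

```python
import math

def match_weight(name: str, name_counts: dict, total: int, m: float = 0.95) -> float:
    """Fellegi-Sunter style agreement weight log2(m / u), approximating u
    by the name's relative frequency in the database. A rarer name yields
    a smaller u, and hence a larger (more significant) match weight."""
    u = name_counts[name] / total
    return math.log2(m / u)

# Illustrative counts only: Watson is far more common than Wuttke,
# so a match on Wuttke carries more evidential weight.
counts = {"WATSON": 12000, "WUTTKE": 40}
total = 1_000_000
```

Under these assumed counts, `match_weight("WUTTKE", counts, total)` comes out well above `match_weight("WATSON", counts, total)`, which is exactly the calibration a purely local string comparison cannot provide.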
There is a place for both approaches, and the right choice depends on the context of use. Our recommendation: for incremental entity search and matching, use Elasticsearch; for batch entity matching of two large datasets, use the P-Sig algorithm or something similar.