Deep Similarity Functions for Entity Matching

Mark Pinches
5 min read · Feb 20, 2021

The notion of similarity, and its complementary counterpart distance, play key roles in many data science and machine learning projects, where similarity functions serve as the means by which we group or separate objects and concepts and make generalisations across our data sets.

There is a vast array of similarity metrics, suited to a variety of different scenarios. Very often it is possible to identify a metric with the desirable characteristics for a given problem space, and there are some great Python functions that make such methods easily accessible. However, my experience suggests that, in general, a similarity function has desirable characteristics only for specific input types; if it is challenged with slightly different inputs, its outputs may not reflect our intuitive understanding of similarity in that situation. This is particularly the case for text similarity in entity matching, where names and addresses may include abbreviations, dropped items, and variable ordering.
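A minimal sketch of this sensitivity to input type, using only standard-library metrics (the example names and helper functions are mine, chosen for illustration): a character-level metric penalises reordered tokens that a token-set metric treats as identical.

```python
from difflib import SequenceMatcher

def char_ratio(a: str, b: str) -> float:
    # Character-level similarity from Python's standard library;
    # sensitive to the order in which characters appear.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def token_jaccard(a: str, b: str) -> float:
    # Order-insensitive similarity over word tokens.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

# The same pair of strings, differing only in token order:
a, b = "Smith John", "John Smith"
print(char_ratio(a, b))     # well below 1.0: character alignment is order-sensitive
print(token_jaccard(a, b))  # 1.0: the token sets are identical
```

Neither metric is wrong; each simply encodes a different notion of "similar", which is why the choice has to match the input type.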

Traditionally in the domain of entity matching, multiple similarity functions are applied across a variety of input fields such as name, address, telephone number, and email. The outputs from these functions are used to build a feature matrix, which is then fed into a supervised model such as a random forest or a boosted tree. The hope is that, by applying multiple similarity functions to the same input data, often normalised with multiple pre-processing transformations, at least one of them will output a distance measure indicating that the two entities are related, and that the tree can learn to exploit it.
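This pipeline can be sketched as follows; the field names, records, and choice of similarity functions here are illustrative assumptions rather than a specific production setup. Each record pair expands into one feature per (field, metric) combination, and those rows form the matrix passed to the tree model.

```python
from difflib import SequenceMatcher

def seq_ratio(a: str, b: str) -> float:
    # Character-level similarity (standard library).
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def jaccard(a: str, b: str) -> float:
    # Token-set overlap; returns 0.0 when both strings are empty.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if (ta | tb) else 0.0

FIELDS = ["name", "address", "email"]   # fields compared per record pair
METRICS = [seq_ratio, jaccard]          # similarity functions per field

def feature_row(rec_a: dict, rec_b: dict) -> list:
    # One feature per (field, metric) pair; this row would be one line of
    # the matrix fed to e.g. a random forest or boosted-tree classifier.
    return [m(rec_a[f], rec_b[f]) for f in FIELDS for m in METRICS]

a = {"name": "Acme Ltd", "address": "1 High St", "email": "info@acme.com"}
b = {"name": "ACME Limited", "address": "1 High Street", "email": "info@acme.com"}
row = feature_row(a, b)  # 6 features: 3 fields x 2 metrics
```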

However, the impact of this is that model certainty is reduced, and what should be strong matches often come back as poor or weak matches requiring human review. Below are some examples of pairwise company names evaluated with a variety of different similarity functions.
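One way to generate such pairwise comparisons is sketched below; the company names and the two standard-library metrics are my own illustrative choices, not a fixed benchmark.

```python
from difflib import SequenceMatcher

def seq_ratio(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def jaccard(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

pairs = [
    ("International Business Machines", "IBM"),      # abbreviation
    ("Smith & Sons Ltd", "Smith and Sons Limited"),  # expanded variants
    ("Acme Trading Co", "Trading Co Acme"),          # reordered tokens
]

for a, b in pairs:
    print(f"{a!r} vs {b!r}: seq={seq_ratio(a, b):.2f} jac={jaccard(a, b):.2f}")
```

The abbreviation pair is a strong match to a human reader, yet it shares no tokens at all (Jaccard of 0.0) and very few characters, which is exactly the failure mode that pushes such pairs into human review.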

With this at the forefront of my mind, I wondered whether it might be possible to build a deep learning model to act as a similarity function for the specific scenarios that traditional similarity functions serve poorly, such as short sequences of text like names or addresses.

Deep learning models would seem ideally suited to this challenge, as they can be understood as highly flexible mapping functions that take a given input and transform it to an…


Mark Pinches

Head of Informatics at Medicines Discovery Catapult