Deep Similarity Functions for Entity Matching

Similarity, and its complementary counterpart distance, play key roles in many data science and machine learning projects: similarity functions give us a means to group or separate objects and concepts, and to make generalisations across our data sets.

There is a vast array of similarity metrics, suited to a variety of different scenarios. Very often it is possible to identify a metric with the desirable characteristics for a given problem space, and there are some great Python functions that make such methods easily accessible. However, my experience suggests that, in general, similarity functions behave well only for specific input types; challenge a function with slightly different inputs and its outputs may no longer reflect our intuitive understanding of similarity. This is particularly the case for text similarity in entity matching, where names and addresses may include abbreviations, dropped items and variable ordering.

Traditionally in the domain of entity matching, multiple similarity functions are applied across a variety of input data types such as name, address, telephone number, email and so on. The outputs from these functions form a feature matrix, which is then fed into a supervised model such as a random forest or a boosted tree. The hope is that by applying multiple similarity functions to the same input data, often normalised with multiple pre-processing transformations, at least one of the functions will output a distance measure indicating that the two entities are related, which the tree can then leverage.

However, the impact of this is that model certainty is reduced, and what should be strong matches often come back as poor or weak matches requiring human review. Below are some examples of pairwise company names evaluated with a variety of different similarity functions.
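To make this concrete, here is a quick sketch using only the standard library (the name pairs are my own illustrative examples, not from any real data set) showing how two simple measures can disagree on exactly the kinds of pairs entity matching throws up:

```python
from difflib import SequenceMatcher

def ratio(a, b):
    """Character-level similarity from difflib, in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()

def jaccard(a, b):
    """Word-level Jaccard similarity, in [0, 1]."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

# Illustrative pairs: an abbreviation, a dropped suffix, and reordered tokens.
pairs = [
    ("international business machines", "ibm"),
    ("acme holdings limited", "acme holdings ltd"),
    ("smith and sons", "sons smith"),
]
for a, b in pairs:
    print(f"{a!r} vs {b!r}: ratio={ratio(a, b):.2f}, jaccard={jaccard(a, b):.2f}")
```

The abbreviation pair scores near zero on both measures even though a human would call it a strong match, which is exactly the weak-match problem described above.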

With this at the forefront of my mind, I wondered whether it might be possible to build a deep learning model to act as a similarity function for specific scenarios that are not well served by traditional similarity functions such as short sequences of text, names or addresses.

Deep learning models seem ideally suited to this challenge, as they can be understood as highly flexible mapping functions that take a given input and transform it into an output, leveraging non-linear relationships learnt from data.

So to build a deep similarity function, we will need a model with the following general features:

1. A shared (twin) processing step that will output condensed representations of the input strings.

2. A method to compare the two vector representations and output a “distance”

3. A final dense layer with sigmoid output that will give the “probability” of match

And that’s it. There is no need to build complicated text pre-processing pipelines based on rules or various NLP techniques: we can just push the raw strings in, and if the strings are similar the model should give a reasonable output.

And after some trial and error I was able to find a working architecture that trained quickly and whose outputs appeared to be sensible similarity metrics. I found that a dual loss, using MSE on the distance and a contrastive loss on the probability (as described by Hadsell et al.), gave the best performance. Here is a more detailed view of the architecture…

Contrastive loss is commonly used as the loss function in twin network architectures and is well explained here.

Here is how this would be coded up using Keras. First make the imports.
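Something along these lines (a sketch assuming TensorFlow 2.x with its bundled Keras API):

```python
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras import layers, Model, Sequential
```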

Encode the input text (assuming it sits in a pandas dataframe)…
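One simple option is a character-level encoding. A sketch along these lines, where the dataframe, column names and `MAX_LEN` are illustrative assumptions:

```python
import numpy as np
import pandas as pd

# Toy pair data; in practice this would be your real candidate pairs.
df = pd.DataFrame({
    "name_a": ["international business machines", "acme ltd"],
    "name_b": ["ibm", "acme limited"],
    "match":  [1, 1],
})

# Character vocabulary built from the data; index 0 is reserved for padding.
chars = sorted(set("".join(df["name_a"]) + "".join(df["name_b"])))
char_to_idx = {c: i + 1 for i, c in enumerate(chars)}

MAX_LEN = 40  # fixed sequence length, chosen to cover typical name lengths

def encode(text, max_len=MAX_LEN):
    """Map a string to a fixed-length array of character indices, right-padded with 0."""
    idxs = [char_to_idx.get(c, 0) for c in text.lower()[:max_len]]
    return np.array(idxs + [0] * (max_len - len(idxs)))

X1 = np.stack(df["name_a"].map(encode))
X2 = np.stack(df["name_b"].map(encode))
y = df["match"].to_numpy().astype("float32")
```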

Some additional data preparation steps to take care of…
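For example, a shuffled train/validation split. The random arrays below are stand-ins for the encoded string pairs from the previous step, so the snippet runs on its own:

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-ins for X1, X2 (encoded left/right strings) and y (match labels).
n, max_len = 100, 40
X1 = rng.integers(1, 30, size=(n, max_len))
X2 = rng.integers(1, 30, size=(n, max_len))
y = rng.integers(0, 2, size=n).astype("float32")

# Shuffle the pairs and hold out 20% for validation.
perm = rng.permutation(n)
cut = int(n * 0.8)
train_idx, val_idx = perm[:cut], perm[cut:]
X1_train, X2_train, y_train = X1[train_idx], X2[train_idx], y[train_idx]
X1_val, X2_val, y_val = X1[val_idx], X2[val_idx], y[val_idx]
```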

Create the input layers…
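A sketch, assuming the fixed sequence length from the encoding step; one input per side of the twin network, each carrying integer character indices:

```python
import tensorflow as tf
from tensorflow.keras import layers

MAX_LEN = 40  # must match the length used when encoding the strings

input_a = layers.Input(shape=(MAX_LEN,), name="left_string")
input_b = layers.Input(shape=(MAX_LEN,), name="right_string")
```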

Create the shared layers…
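One plausible encoder along the lines described in the architecture above (the layer sizes are illustrative choices, not tuned values):

```python
import tensorflow as tf
from tensorflow.keras import layers, Sequential

VOCAB_SIZE = 64  # illustrative; in practice len(char_to_idx) + 1

# The shared ("twin") encoder: character embedding -> 1-D convolution ->
# global max pooling -> dense projection to a compact vector.
shared_encoder = Sequential([
    layers.Embedding(VOCAB_SIZE, 32, name="char_embedding"),
    layers.Conv1D(64, 3, activation="relu", name="char_conv"),
    layers.GlobalMaxPooling1D(name="pool"),
    layers.Dense(32, activation="relu", name="projection"),
], name="shared_encoder")
```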

Pass the inputs through both sides of the twin network…
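Reusing the same encoder object on both inputs is what makes the weights shared. A sketch (the encoder is rebuilt here so the snippet stands alone):

```python
import tensorflow as tf
from tensorflow.keras import layers, Sequential

MAX_LEN, VOCAB_SIZE = 40, 64  # illustrative sizes

shared_encoder = Sequential([
    layers.Embedding(VOCAB_SIZE, 32),
    layers.Conv1D(64, 3, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dense(32, activation="relu"),
])

input_a = layers.Input(shape=(MAX_LEN,))
input_b = layers.Input(shape=(MAX_LEN,))

# Same layer object on both sides => identical weights for both strings.
encoded_a = shared_encoder(input_a)
encoded_b = shared_encoder(input_b)
```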

Calculate the distance with an exponential negative Manhattan distance function…
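A sketch of that function, wrapped in a `Lambda` layer so it can sit inside the functional model. `exp(-L1)` returns exactly 1.0 for identical vectors and decays towards 0 as they diverge:

```python
import tensorflow as tf
from tensorflow.keras import layers

def exp_neg_manhattan(tensors):
    """Exponential negative Manhattan (L1) distance between two batches of vectors."""
    a, b = tensors
    return tf.exp(-tf.reduce_sum(tf.abs(a - b), axis=1, keepdims=True))

distance_layer = layers.Lambda(exp_neg_manhattan, name="distance")
```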

Finally the sigmoid layer to output the probability…
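A sketch: a single sigmoid unit that maps the scalar distance score to a match "probability" in (0, 1):

```python
import tensorflow as tf
from tensorflow.keras import layers

prob_layer = layers.Dense(1, activation="sigmoid", name="match_probability")

# Example distance scores for two pairs, just to show the shapes involved.
p = prob_layer(tf.constant([[0.9], [0.1]]))
```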

Now create the model object…
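Wiring it all together (the pieces from the previous steps are rebuilt here so the snippet stands alone; sizes remain illustrative). The model has two outputs so the dual loss can target each head separately:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model, Sequential

MAX_LEN, VOCAB_SIZE = 40, 64

# Shared encoder: same object on both sides => shared weights.
shared_encoder = Sequential([
    layers.Embedding(VOCAB_SIZE, 32),
    layers.Conv1D(64, 3, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dense(32, activation="relu"),
])

input_a = layers.Input(shape=(MAX_LEN,))
input_b = layers.Input(shape=(MAX_LEN,))
encoded_a = shared_encoder(input_a)
encoded_b = shared_encoder(input_b)

# Exponential negative Manhattan distance between the two encodings.
distance = layers.Lambda(
    lambda t: tf.exp(-tf.reduce_sum(tf.abs(t[0] - t[1]), axis=1, keepdims=True)),
    name="distance",
)([encoded_a, encoded_b])

probability = layers.Dense(1, activation="sigmoid", name="probability")(distance)

model = Model(inputs=[input_a, input_b], outputs=[distance, probability])
```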

Finally, a couple of helper functions: the contrastive loss function and an accuracy function…
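One plausible version of both helpers. I am assuming the Hadsell-style contrastive formulation, with the similarity score converted to a pseudo-distance as `1 - y_pred`; the margin value is also an assumption:

```python
import tensorflow as tf

def contrastive_loss(y_true, y_pred, margin=1.0):
    """Contrastive loss in the style of Hadsell et al.; y_true is 1 for a match,
    y_pred is a similarity in [0, 1], so 1 - y_pred acts as a distance."""
    y_true = tf.cast(y_true, y_pred.dtype)
    d = 1.0 - y_pred
    return tf.reduce_mean(
        y_true * tf.square(d)
        + (1.0 - y_true) * tf.square(tf.maximum(margin - d, 0.0))
    )

def match_accuracy(y_true, y_pred):
    """Fraction of pairs classified correctly at a 0.5 threshold."""
    y_true = tf.cast(y_true, y_pred.dtype)
    preds = tf.cast(y_pred > 0.5, y_pred.dtype)
    return tf.reduce_mean(tf.cast(tf.equal(y_true, preds), tf.float32))
```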

Run compile on the model…
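For example (the model and helper functions from the earlier steps are repeated so this compiles on its own); MSE pulls the distance head towards the 0/1 label while the contrastive loss shapes the probability head:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model, Sequential

MAX_LEN, VOCAB_SIZE = 40, 64

def contrastive_loss(y_true, y_pred, margin=1.0):
    y_true = tf.cast(y_true, y_pred.dtype)
    d = 1.0 - y_pred
    return tf.reduce_mean(y_true * tf.square(d)
                          + (1.0 - y_true) * tf.square(tf.maximum(margin - d, 0.0)))

def match_accuracy(y_true, y_pred):
    y_true = tf.cast(y_true, y_pred.dtype)
    preds = tf.cast(y_pred > 0.5, y_pred.dtype)
    return tf.reduce_mean(tf.cast(tf.equal(y_true, preds), tf.float32))

shared_encoder = Sequential([
    layers.Embedding(VOCAB_SIZE, 32),
    layers.Conv1D(64, 3, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dense(32, activation="relu"),
])
input_a, input_b = layers.Input(shape=(MAX_LEN,)), layers.Input(shape=(MAX_LEN,))
distance = layers.Lambda(
    lambda t: tf.exp(-tf.reduce_sum(tf.abs(t[0] - t[1]), axis=1, keepdims=True)),
    name="distance",
)([shared_encoder(input_a), shared_encoder(input_b)])
probability = layers.Dense(1, activation="sigmoid", name="probability")(distance)
model = Model([input_a, input_b], [distance, probability])

# Dual loss: one entry per output head, in output order.
model.compile(
    optimizer="adam",
    loss=["mse", contrastive_loss],
    loss_weights=[1.0, 1.0],
    metrics={"probability": [match_accuracy]},
)
```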

Finally, fit the model to the data…
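A sketch with tiny synthetic data standing in for the encoded pairs (the full pipeline is repeated so it runs standalone). Note that the same 0/1 match label supervises both heads, which works because the distance output also lives in [0, 1]:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, Model, Sequential

MAX_LEN, VOCAB_SIZE = 40, 64
rng = np.random.default_rng(0)

# Random character indices and labels as stand-ins for real encoded pairs.
X1 = rng.integers(1, VOCAB_SIZE, size=(64, MAX_LEN)).astype("float32")
X2 = rng.integers(1, VOCAB_SIZE, size=(64, MAX_LEN)).astype("float32")
y = rng.integers(0, 2, size=(64, 1)).astype("float32")

def contrastive_loss(y_true, y_pred, margin=1.0):
    y_true = tf.cast(y_true, y_pred.dtype)
    d = 1.0 - y_pred
    return tf.reduce_mean(y_true * tf.square(d)
                          + (1.0 - y_true) * tf.square(tf.maximum(margin - d, 0.0)))

shared_encoder = Sequential([
    layers.Embedding(VOCAB_SIZE, 32),
    layers.Conv1D(64, 3, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dense(32, activation="relu"),
])
input_a, input_b = layers.Input(shape=(MAX_LEN,)), layers.Input(shape=(MAX_LEN,))
distance = layers.Lambda(
    lambda t: tf.exp(-tf.reduce_sum(tf.abs(t[0] - t[1]), axis=1, keepdims=True)),
    name="distance",
)([shared_encoder(input_a), shared_encoder(input_b)])
probability = layers.Dense(1, activation="sigmoid", name="probability")(distance)
model = Model([input_a, input_b], [distance, probability])
model.compile(optimizer="adam", loss=["mse", contrastive_loss])

# The same label list supervises the distance and probability heads.
history = model.fit([X1, X2], [y, y],
                    validation_split=0.25, epochs=2, batch_size=16, verbose=0)
```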

This example can be adapted to longer or shorter text sequences by adjusting the convolution layer.

Good luck with your entity matching! Please comment below.
