The notion of similarity, and its complementary counterpart distance, plays a key role in many data science and machine learning projects, where similarity functions serve as methods by which we can group or separate objects and concepts and make generalisations across our data sets.
There is a vast array of similarity metrics, which may be used in a variety of scenarios. Very often it is possible to identify a similarity metric that has the desirable characteristics for a given problem space, and there are some great Python functions that make such methods easily accessible. However, my experience suggests that in general similarity functions have desirable characteristics only for specific input types; if the similarity function is challenged with slightly different inputs, the outputs may not reflect our intuitive understanding of similarity in that situation. This is particularly the case for text similarity in entity matching, where names and addresses may include abbreviations, dropped items, and variable ordering.
Traditionally in the domain of entity matching, multiple similarity functions are used across a variety of input data types such as name, address, telephone number, email and so on. The outputs from these functions are used to create a feature matrix, which is then fed into a supervised model such as a random forest or a boosted tree. The hope is that by using multiple similarity functions on the same input data, often normalised with multiple data pre-processing transformations, at least one of the similarity functions will output a distance measure indicating that the two entities are related, which the tree can then leverage.
However, the impact of this is that model certainty is reduced, and what should be strong matches often come back as poor or weak matches requiring human review. Below are some examples of pairwise company names being evaluated with a variety of different similarity functions.
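To make the problem concrete, here is a minimal sketch using only the Python standard library. The company names are made-up examples, and `edit_ratio` and `token_jaccard` are illustrative helper names rather than functions from any particular matching library.

```python
from difflib import SequenceMatcher


def edit_ratio(left: str, right: str) -> float:
    """Character-level similarity in [0, 1] from difflib's SequenceMatcher."""
    return SequenceMatcher(None, left.lower(), right.lower()).ratio()


def token_jaccard(left: str, right: str) -> float:
    """Share of whitespace-separated tokens the two strings have in common."""
    left_tokens = set(left.lower().split())
    right_tokens = set(right.lower().split())
    if not left_tokens and not right_tokens:
        return 1.0
    return len(left_tokens & right_tokens) / len(left_tokens | right_tokens)


# Hypothetical pairs that a human reviewer would call the same company.
pairs = [
    ("International Business Machines Corporation", "IBM Corp"),
    ("Smith and Jones Limited", "Jones & Smith Ltd"),
]

for left, right in pairs:
    print(f"{left!r} vs {right!r}: "
          f"edit_ratio={edit_ratio(left, right):.2f}, "
          f"token_jaccard={token_jaccard(left, right):.2f}")
```

The token-based score is exactly 0 for the first pair because the abbreviation shares no whole token with the full name, and it is only 2/6 for the second pair because "and"/"&" and "Limited"/"Ltd" count as different tokens.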
With this at the forefront of my mind, I wondered whether it might be possible to build a deep learning model to act as a similarity function for specific scenarios that are not well served by traditional similarity functions, such as short sequences of text, names or addresses.
Deep learning models would seem ideally suited to this challenge, as they can be understood as highly flexible mapping functions that take a given input and transform it into an output, leveraging non-linear relationships which can be learnt from data.
So to build a deep similarity function we will need to build a model with the following general features.
1. A shared (twin) processing step that will output condensed representations of the input strings.
2. A method to compare the two vector representations and output a “distance”.
3. A final dense layer with sigmoid output that will give the “probability” of a match.
And that’s it. There is no need to build complicated text pre-processing pipelines based on rules or various NLP techniques; we can just push the raw strings in, and we should get a reasonable output from the model if the strings are similar.
After some trial and error I was able to find a working architecture that trained quickly and whose outputs were apparently sensible similarity metrics. I found that a dual loss, using MSE on the distance and a contrastive loss on the probability (as described by Hadsell et al.), gave the best performance. Here is a more detailed view of the architecture…
Contrastive loss is commonly used as the loss function in twin network architectures and is well explained here.
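For reference, the form of contrastive loss computed by the `contrastive_loss` helper later in this article can be written as follows. The symbols are illustrative notation, not taken from the paper: $y_i$ is the 0/1 match label, $\hat{y}_i$ is the network output the loss is applied to, $m$ is the margin (set to 1 in the code) and $N$ is the batch size.

```latex
\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \left[ y_i \, \hat{y}_i^{2} + (1 - y_i) \, \max(m - \hat{y}_i,\, 0)^{2} \right]
```

In this form, pairs labelled 1 are penalised for large outputs, while pairs labelled 0 are penalised only when the output falls below the margin.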
Here is how this would be coded up using Keras. First make the imports.
# Make the imports
from string import printable

import pandas as pd
from tensorflow.keras import backend as K
from tensorflow.keras.layers import Input, Embedding, Conv1D, Dense, Bidirectional, MaxPooling1D, Lambda, LSTM
from tensorflow.keras.models import Model
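Encode the input text (assuming it is held in a pandas DataFrame called data, with left and right columns)…
# Convert input strings to indexes
# Index 0 is left free so that it can be used for padding later
character_dict = {}
for i, character in enumerate(printable):
    character_dict[character] = i + 1

def generate_vector(item):
    vector = []
    characters = list(item)
    for char in characters:
        # Assumption: characters outside string.printable are mapped to 0
        vector.append(character_dict.get(char, 0))
    return vector

# generate the left and right text vectors
data['left_vectors'] = data['left'].apply(generate_vector)
data['right_vectors'] = data['right'].apply(generate_vector)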
Some additional data preparation steps to take care of..
# Find the max length of the text sequence and create a max_len variable
# pad the vectors using keras.preprocessing.sequence.pad_sequences(values, maxlen=max_len)
# Split the data into training and test sets with the target y as a 1.0 or 0.0 score
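The preparation steps listed above could be sketched as follows in plain Python. This is only an illustration under stated assumptions: the encoded vectors and labels are invented, `pad_vector` is a hypothetical helper that mirrors the default behaviour of Keras `pad_sequences` (zeros added at the front, excess items dropped from the front), and the 25% hold-out is an arbitrary choice. In practice `pad_sequences` and a library splitter such as scikit-learn's `train_test_split` would do the same job.

```python
import random


def pad_vector(vector, max_len, value=0):
    """Left-pad with `value`, or keep the last `max_len` items if too long."""
    trimmed = vector[-max_len:]
    return [value] * (max_len - len(trimmed)) + trimmed


# Hypothetical encoded pairs with a 1.0 / 0.0 match label.
left_vectors = [[5, 6, 7], [1, 2], [9, 9, 9, 9], [3]]
right_vectors = [[5, 6], [4, 4, 4], [9, 9, 9, 9], [8, 8]]
labels = [1.0, 0.0, 1.0, 0.0]

# The longest sequence on either side sets the common length.
max_len = max(len(v) for v in left_vectors + right_vectors)

left_padded = [pad_vector(v, max_len) for v in left_vectors]
right_padded = [pad_vector(v, max_len) for v in right_vectors]

# Shuffle the row indices once, then hold out 25% as a test set.
indices = list(range(len(labels)))
random.Random(42).shuffle(indices)
n_test = max(1, len(indices) // 4)
test_idx, train_idx = indices[:n_test], indices[n_test:]

train = [(left_padded[i], right_padded[i], labels[i]) for i in train_idx]
test = [(left_padded[i], right_padded[i], labels[i]) for i in test_idx]
```

Using one `max_len` for both sides matters because the twin branches share their layers, so the left and right inputs must have the same shape.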
Create the input layers…
# Create the input layers
left_input = Input(shape=(max_len,), dtype='int32', name='left_input')
right_input = Input(shape=(max_len,), dtype='int32', name='right_input')
Create the shared layers…
# Create the shared layers
# Embed the character vectors (indexes run from 1 to len(character_dict), with 0 for padding, hence the + 1)
shared_embedding = Embedding(output_dim=16, input_dim=len(character_dict) + 1, input_length=max_len, name='shared_encoding')
shared_conv1D = Conv1D(32, 2, padding='same', name='shared_1D')
shared_max_pool1 = MaxPooling1D()
shared_lstm = Bidirectional(LSTM(16, return_sequences=False, name='shared_lstm'))
Pass the inputs through both sides of the twin network…
# Pass the inputs through the shared layers
left = shared_embedding(left_input)
right = shared_embedding(right_input)

left = shared_conv1D(left)
left = shared_max_pool1(left)
right = shared_conv1D(right)
right = shared_max_pool1(right)

left = shared_lstm(left)
right = shared_lstm(right)
Calculate the distance with an exponential negative Manhattan distance function…
# Helper function for calculating the exponential negative Manhattan distance on the LSTM outputs
def exp_neg_manhat(left, right):
    return K.exp(-K.sum(K.abs(left - right), axis=1, keepdims=True))

# Calculate the distance (x is the list [left, right] passed to the layer)
distance = Lambda(function=lambda x: exp_neg_manhat(x[0], x[1]),
                  output_shape=lambda x: (x[0][0], 1), name='distance')([left, right])
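To get a feel for what this function returns, here is a small plain-Python version of the same formula applied to a few made-up vectors. The three-dimensional vectors are hypothetical stand-ins for the LSTM outputs, and `exp_neg_manhattan` is an illustrative name for this sketch only.

```python
import math


def exp_neg_manhattan(left, right):
    """exp(-sum(|l - r|)) for two equal-length vectors of numbers."""
    if len(left) != len(right):
        raise ValueError("vectors must have the same length")
    return math.exp(-sum(abs(l - r) for l, r in zip(left, right)))


# Hypothetical 3-dimensional encodings standing in for the LSTM outputs.
identical = exp_neg_manhattan([0.2, -0.5, 0.1], [0.2, -0.5, 0.1])
close = exp_neg_manhattan([0.2, -0.5, 0.1], [0.1, -0.5, 0.2])
far = exp_neg_manhattan([0.9, -0.9, 0.9], [-0.9, 0.9, -0.9])

print(identical, close, far)
```

The result always lies in the interval (0, 1]: it is 1 when the two encodings are identical and decays towards 0 as they move apart. Despite the name it therefore behaves like a similarity score, which is why it can be trained against the 1.0 or 0.0 match labels with a mean squared error loss.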
Finally the sigmoid layer to output the probability…
# Calculate the probability score
probability = Dense(1, activation='sigmoid', name = 'probability')(distance)
Now create the model object…
# create the model
model = Model(inputs =[left_input, right_input], outputs=[distance, probability])
Final couple of helper functions, the contrastive loss function and an accuracy function..
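def contrastive_loss(y_true, y_pred):
    # From Hadsell et al. '06
    # Pairs labelled 1 are pushed towards an output of 0;
    # pairs labelled 0 are pushed towards an output of at least `margin`.
    margin = 1
    square_pred = K.square(y_pred)
    margin_square = K.square(K.maximum(margin - y_pred, 0))
    return K.mean(y_true * square_pred + (1 - y_true) * margin_square)

def accuracy(y_true, y_pred):
    # A prediction below 0.5 is counted as a match (label 1), consistent with the loss above
    return K.mean(K.equal(y_true, K.cast(y_pred < 0.5, y_true.dtype)))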
Run compile on the model…
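# One loss per output: MSE for 'distance', contrastive loss for 'probability'.
# No optimizer is passed here, so Keras falls back to its default.
model.compile(loss=['mean_squared_error', contrastive_loss],
              loss_weights=[1.0, 1.0])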
Finally fit the model to the data..
# pass the training and test data into the model.
# y values are the 1.0 or 0.0 scores of if the two items are a match.
# These serve to help train both the distance score and the "probability" score
history = model.fit([left_vectors, right_vectors], [match_bool, match_bool],
                    epochs=10, batch_size=32,
                    validation_data=([test_left_vectors, test_right_vectors],
                                     [test_match_bool, test_match_bool]))
This example can be amended for larger or smaller text sequences by adjusting the convolution layer.
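The article does not spell out which adjustments are meant, but one quantity worth checking when changing `max_len` is how many time steps reach the LSTM after the convolution and pooling layers. The sketch below works this out for the layer settings used above (`Conv1D` with `padding='same'`, and `MaxPooling1D` with its default `pool_size` of 2). The helper names and the example lengths are hypothetical.

```python
def length_after_conv_same(seq_len: int) -> int:
    """Conv1D with padding='same' and stride 1 keeps the sequence length."""
    return seq_len


def length_after_max_pool(seq_len: int, pool_size: int = 2) -> int:
    """MaxPooling1D with 'valid' padding and stride equal to pool_size."""
    if seq_len < pool_size:
        raise ValueError("sequence is shorter than the pooling window")
    return (seq_len - pool_size) // pool_size + 1


for max_len in (10, 40, 200):
    steps = length_after_max_pool(length_after_conv_same(max_len))
    print(f"max_len={max_len} -> {steps} time steps reach the LSTM")
```

For long inputs, a larger `pool_size` (or a second convolution and pooling stage) shortens the sequence the LSTM has to process; for very short inputs, pooling could be reduced or dropped so that enough steps remain.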
Good luck with your entity matching! Please comment below.