Deduplication of records using DedupliPy

Deduplication, or entity resolution, is the task of combining different representations of the same real-world entity. The Python package DedupliPy implements deduplication using active learning. Active learning allows for rapid training without having to provide a large, manually labelled dataset. In this post I demonstrate how the package works and show some more advanced settings. In case you want to apply entity resolution to large datasets in Spark, please have a look at Spark-Matcher, a package I developed together with two colleagues.

Installation

DedupliPy can be installed directly from PyPI. Just type the following on the command line:

pip install deduplipy

Simple deduplication

DedupliPy comes with example data. We first load the ‘voters’ data that contains duplicate records:

from deduplipy.datasets import load_data

df = load_data(kind='voters')
Column names: 'name', 'suburb', 'postcode'

This dataset contains names, suburbs and postcodes.

name                suburb      postcode
khimerc thomas      charlotte   2826g
lucille richardst   kannapolis  28o81
reb3cca bauerboand  raleigh     27615
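
To get a feel for the data, it can be inspected with standard pandas operations (a minimal sketch, assuming load_data returns a pandas DataFrame):

# Quick inspection of the example data
print(df.shape)             # number of rows and columns
print(df.columns.tolist())  # ['name', 'suburb', 'postcode']
print(df.head())            # first few records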

Create a Deduplicator instance and provide the column names to be used for deduplication:

from deduplipy.deduplicator import Deduplicator
myDedupliPy = Deduplicator(['name', 'suburb', 'postcode'])

Fit the Deduplicator using active learning: for each pair that is shown, enter whether it is a match (y) or not (n). When training has converged, you will be notified and you can finish training by entering ‘f’.

myDedupliPy.fit(df)

Apply the trained Deduplicator to (new) data. The column deduplication_id is the identifier of a cluster; rows with the same deduplication_id are found to be the same real-world entity.

res = myDedupliPy.predict(df)
name                 suburb      postcode  deduplication_id
caria macartney      charlotte   28220     1
carla macartney      charlotte   28227     1
martha safrit        cha4lotte   282l5     2
martha safrit        charlotte   28215     2
jeanronel corbier    charlotte   28213     3
jeanronel corpier    charrlotte  28213     3
melissa kaltenbach   charlotte   28211     4
melissa kalteribach  charlotte   28251     4
kiea matthews        charlotte   28218     5
kiera matthews       charlotte   28216     5
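
The result is a DataFrame with an added deduplication_id column, so the clusters can be inspected with regular pandas operations. A small sketch (the variable names are just for illustration):

# Size of each cluster; clusters with more than one row contain duplicates
cluster_sizes = res.groupby('deduplication_id').size()
print(cluster_sizes.sort_values(ascending=False).head())

# Keep a single representative record per cluster
deduplicated = res.drop_duplicates(subset='deduplication_id')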

The Deduplicator instance can be saved as a pickle file and applied to new data after training:

import pickle

# Save the trained Deduplicator to disk
with open('mypickle.pkl', 'wb') as f:
    pickle.dump(myDedupliPy, f)

# Load it again later and apply it to (new) data
with open('mypickle.pkl', 'rb') as f:
    loaded_obj = pickle.load(f)
res = loaded_obj.predict(df)
name                 suburb      postcode  deduplication_id
caria macartney      charlotte   28220     1
carla macartney      charlotte   28227     1
martha safrit        cha4lotte   282l5     2
martha safrit        charlotte   28215     2
jeanronel corbier    charlotte   28213     3
jeanronel corpier    charrlotte  28213     3
melissa kaltenbach   charlotte   28211     4
melissa kalteribach  charlotte   28251     4
kiea matthews        charlotte   28218     5
kiera matthews       charlotte   28216     5

Advanced deduplication

If you’re interested in the inner workings of DedupliPy, please watch my presentation at PyData Global 2021.

Let’s explore some advanced settings to tailor the deduplicator to our needs. We are going to select the similarity metrics per field, define our own blocking rules and include interaction between the fields.

The similarity metrics per field are entered in a dict. A similarity metric can be any function that takes two strings and outputs a number. Here we use some string similarity functions implemented in the Python package ‘thefuzz’ (pip install thefuzz):

from thefuzz.fuzz import ratio, partial_ratio, token_set_ratio, token_sort_ratio
field_info = {'name':[ratio, partial_ratio], 
              'suburb':[token_set_ratio, token_sort_ratio], 
              'postcode':[ratio]}
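
Because a metric is just a function of two strings, you can also plug in your own. A minimal sketch (the function exact_match below is purely illustrative and not part of DedupliPy or thefuzz):

def exact_match(x, y):
    # Illustrative custom metric: 1 if the strings match ignoring case, 0 otherwise
    return int(x.lower() == y.lower())

field_info_custom = {'name': [ratio, partial_ratio, exact_match],
                     'suburb': [token_set_ratio, token_sort_ratio],
                     'postcode': [ratio, exact_match]}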

We also define our own blocking rule, taking the first two characters of a string, and apply it to the ‘name’ column only.

def first_two_characters(x):
    return x[:2]
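
The idea of blocking is that only record pairs sharing a blocking key, here the first two characters of the name, are considered as candidate pairs; this keeps the number of comparisons manageable. A quick illustration of the keys this rule produces:

# Blocking keys for a few example names
first_two_characters('khimerc thomas')      # 'kh'
first_two_characters('lucille richardst')   # 'lu'
first_two_characters('reb3cca bauerboand')  # 're'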

When we set interaction=True, the classifier includes interaction features, e.g. ratio('name') * token_set_ratio('suburb'). When interaction features are included, the logistic regression classifier applies an L1 regularisation to prevent overfitting. We also set verbose=1 to get information on the progress and a distribution of scores.

myDedupliPy = Deduplicator(field_info=field_info, interaction=True, 
                           rules={'name': [first_two_characters]}, verbose=1)
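
To make the interaction idea concrete, here is a sketch of what such a feature looks like for a single candidate pair (plain Python for illustration, not DedupliPy internals):

# thefuzz metrics return an integer between 0 and 100
name_sim = ratio('khimerc thomas', 'chimerc thmas')
suburb_sim = token_set_ratio('charlotte', 'cha4lotte')

# An interaction feature is the product of two field similarities
interaction_feature = name_sim * suburb_sim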

As before, fit the Deduplicator using active learning: enter whether a pair is a match (y) or not (n). When training has converged, you will be notified and you can finish training by entering ‘f’.

myDedupliPy.fit(df)

After fitting, a histogram of scores is shown (because we set verbose=1). Based on this histogram, we decide to ignore all pairs with a similarity probability below 0.1 when predicting.

As before, apply the trained Deduplicator to (new) data; rows with the same deduplication_id are found to be the same real-world entity.

res = myDedupliPy.predict(df, score_threshold=0.1)
name                suburb      postcode  deduplication_id
lucille richardst   kannapolis  28o81     1
lucille richards    kannapolis  28081     1
lutta baldwin       whiteville  28472     3
lutta baldwin       whitevill   28475     3
repecca harrell     winton      27q86     5
rebecca harrell     winton      27986     5
rebecca harrell     witnon      27926     5
rebecca bauerband   raleigh     27615     6
reb3cca bauerboand  raleigh     27615     6
rebeccah shelton    whittier    28789     7
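
The score_threshold trades off precision against recall: a higher threshold ignores more candidate pairs, resulting in more and smaller clusters. A small sketch to compare a few values (the thresholds are chosen arbitrarily for illustration):

# Compare the number of clusters for different score thresholds
for threshold in [0.05, 0.1, 0.3, 0.5]:
    res = myDedupliPy.predict(df, score_threshold=threshold)
    print(f"threshold={threshold}: {res['deduplication_id'].nunique()} clusters")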