Deduplication of records using DedupliPy

Deduplication, or entity resolution, is the task of combining different representations of the same real-world entity. The Python package DedupliPy implements deduplication using active learning, which allows for rapid training without a large, manually labelled dataset. In this post I demonstrate how the package works and show some more advanced settings. If you want to apply entity resolution to large datasets in Spark, have a look at Spark-Matcher, a package I developed together with two colleagues.

Installation

DedupliPy can be installed from PyPI. Type the following on the command line:

pip install deduplipy

Simple deduplication

DedupliPy comes with example data. We first load the ‘voters’ data that contains duplicate records:

from deduplipy.datasets import load_data

df = load_data(kind='voters')

Column names: 'name', 'suburb', 'postcode'

This dataset contains names, suburbs and postcodes.

name	suburb	postcode
khimerc thomas	charlotte	2826g
lucille richardst	kannapolis	28o81
reb3cca bauerboand	raleigh	27615

Create a Deduplicator instance and provide the column names to be used for deduplication:

from deduplipy.deduplicator import Deduplicator

myDedupliPy = Deduplicator(['name', 'suburb', 'postcode'])

Fit the Deduplicator using active learning: for each pair shown, enter whether it is a match (y) or not (n). You will be notified when training has converged, after which you can finish training by entering ‘f’.

myDedupliPy.fit(df)

Apply the trained Deduplicator to (new) data. The column deduplication_id is the identifier of a cluster. Rows with the same deduplication_id are predicted to refer to the same real-world entity.

res = myDedupliPy.predict(df)

name	suburb	postcode	deduplication_id
caria macartney	charlotte	28220	1
carla macartney	charlotte	28227	1
martha safrit	cha4lotte	282l5	2
martha safrit	charlotte	28215	2
jeanronel corbier	charlotte	28213	3
jeanronel corpier	charrlotte	28213	3
melissa kaltenbach	charlotte	28211	4
melissa kalteribach	charlotte	28251	4
kiea matthews	charlotte	28218	5
kiera matthews	charlotte	28216	5
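
Once every row carries a deduplication_id, a common next step is to group the rows per cluster and keep one representative record for each. The sketch below does this in plain Python on a few rows copied from the table above. It assumes the rows are available as a list of dicts (if res is a pandas DataFrame, res.to_dict('records') gives that shape). The helper group_by_cluster and the choice to keep the first member of each cluster are illustrative, not part of DedupliPy.

```python
from collections import defaultdict

# Rows copied from the output table above. With a real result, assuming `res`
# is a pandas DataFrame, res.to_dict('records') yields the same shape.
rows = [
    {'name': 'caria macartney', 'suburb': 'charlotte', 'postcode': '28220', 'deduplication_id': 1},
    {'name': 'carla macartney', 'suburb': 'charlotte', 'postcode': '28227', 'deduplication_id': 1},
    {'name': 'martha safrit', 'suburb': 'cha4lotte', 'postcode': '282l5', 'deduplication_id': 2},
    {'name': 'martha safrit', 'suburb': 'charlotte', 'postcode': '28215', 'deduplication_id': 2},
]


def group_by_cluster(rows):
    """Hypothetical helper: map each deduplication_id to its member rows."""
    clusters = defaultdict(list)
    for row in rows:
        clusters[row['deduplication_id']].append(row)
    return dict(clusters)


clusters = group_by_cluster(rows)

# Keep the first member of every cluster as its representative record.
# Which member to keep is a design choice; "first" is only the simplest one.
representatives = [members[0] for members in clusters.values()]

for cluster_id, members in clusters.items():
    print(cluster_id, [m['name'] for m in members])
```

Keeping the first record is arbitrary. In practice you might prefer the most complete or the most recently updated member of each cluster.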

After training, the Deduplicator instance can be saved as a pickle file and loaded again later to be applied to new data:

import pickle

with open('mypickle.pkl', 'wb') as f:
    pickle.dump(myDedupliPy, f)

with open('mypickle.pkl', 'rb') as f:
    loaded_obj = pickle.load(f)

res = loaded_obj.predict(df)

name	suburb	postcode	deduplication_id
caria macartney	charlotte	28220	1
carla macartney	charlotte	28227	1
martha safrit	cha4lotte	282l5	2
martha safrit	charlotte	28215	2
jeanronel corbier	charlotte	28213	3
jeanronel corpier	charrlotte	28213	3
melissa kaltenbach	charlotte	28211	4
melissa kalteribach	charlotte	28251	4
kiea matthews	charlotte	28218	5
kiera matthews	charlotte	28216	5

Advanced deduplication

If you’re interested in the inner workings of DedupliPy, please watch my presentation at PyData Global 2021.

Let’s explore some advanced settings to tailor the deduplicator to our needs. We are going to select the similarity metrics per field, define our own blocking rules and include interaction between the fields.

The similarity metrics per field are specified in a dict. A similarity metric can be any function that takes two strings and outputs a number. We use some string similarity functions implemented in the Python package ‘thefuzz’ (pip install thefuzz):

from thefuzz.fuzz import ratio, partial_ratio, token_set_ratio, token_sort_ratio

field_info = {'name':[ratio, partial_ratio], 
              'suburb':[token_set_ratio, token_sort_ratio], 
              'postcode':[ratio]}
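
Because a similarity metric is just a function of two strings that returns a number, you can also write your own. The sketch below builds one on difflib from the standard library. The function name sequence_similarity and the 0 to 100 scale (chosen to match the range of the thefuzz functions) are my own choices for illustration, not something DedupliPy prescribes.

```python
from difflib import SequenceMatcher


def sequence_similarity(a: str, b: str) -> float:
    """Hypothetical custom metric: 0 (nothing in common) to 100 (identical)."""
    return 100 * SequenceMatcher(None, a, b).ratio()


# A custom metric goes into the dict in the same way as the thefuzz functions.
custom_field_info = {'suburb': [sequence_similarity]}

print(sequence_similarity('charlotte', 'charlotte'))  # identical strings score 100.0
print(sequence_similarity('charlotte', 'cha4lotte'))  # one typo scores below 100
```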

Blocking rules restrict which pairs of records are compared. We define our own blocking rule, here a single function, and apply it only to the ‘name’ column:

def first_two_characters(x):
    return x[:2]
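
To see what such a rule does, the sketch below applies first_two_characters to a handful of names from the voters data and only pairs up records that land in the same block. This illustrates the general idea of blocking. How DedupliPy builds and combines blocks internally may differ, so treat the pairing logic here as an assumption.

```python
from collections import defaultdict
from itertools import combinations


def first_two_characters(x):
    return x[:2]


names = ['lucille richardst', 'lucille richards', 'lutta baldwin',
         'rebecca harrell', 'repecca harrell']

# Group the names by their blocking key.
blocks = defaultdict(list)
for name in names:
    blocks[first_two_characters(name)].append(name)

# Only names within the same block become candidate pairs.
candidate_pairs = [pair
                   for members in blocks.values()
                   for pair in combinations(members, 2)]

all_pairs = list(combinations(names, 2))
print(len(candidate_pairs), 'candidate pairs instead of', len(all_pairs))
# prints: 4 candidate pairs instead of 10
```

The flip side is that two records whose names differ in the first two characters never end up in the same block under this rule, so a typo at the very start of a name would hide a true duplicate.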

When we set interaction=True, the classifier includes interaction features, e.g. ratio('name') * token_set_ratio('suburb'). When interaction features are included, the logistic regression classifier applies L1 regularisation to prevent overfitting. We also set verbose=1 to get information on the progress and a distribution of scores.

myDedupliPy = Deduplicator(field_info=field_info, interaction=True, 
                           rules={'name': [first_two_characters]}, verbose=1)
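
As a rough picture of what an interaction feature is, the sketch below takes similarity scores for one pair of records and adds the product of every two of them. The score values are made up for illustration, and the helper add_interactions is hypothetical. Whether DedupliPy uses all pairwise products, or rescales them, is an assumption on my side.

```python
from itertools import combinations

# Made-up similarity scores for a single pair of records (illustration only).
scores = {'name_ratio': 90, 'suburb_token_set_ratio': 100, 'postcode_ratio': 80}


def add_interactions(scores):
    """Hypothetical helper: add the product of every pair of base features."""
    features = dict(scores)
    for (name_a, value_a), (name_b, value_b) in combinations(scores.items(), 2):
        features[f'{name_a} * {name_b}'] = value_a * value_b
    return features


features = add_interactions(scores)
for name, value in features.items():
    print(name, value)
```

With three base features this already gives six features in total. The number of products grows quickly with the number of metrics, which is why a regularised classifier is a sensible choice here.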

Fit the Deduplicator using active learning: for each pair shown, enter whether it is a match (y) or not (n). You will be notified when training has converged, after which you can finish training by entering ‘f’.

myDedupliPy.fit(df)

After fitting, the histogram of scores is shown. Based on this histogram, we decide to ignore all pairs with a similarity probability lower than 0.1 when predicting.

Apply the trained Deduplicator to (new) data. The column deduplication_id is the identifier of a cluster. Rows with the same deduplication_id are predicted to refer to the same real-world entity.

res = myDedupliPy.predict(df, score_threshold=0.1)

name	suburb	postcode	deduplication_id
lucille richardst	kannapolis	28o81	1
lucille richards	kannapolis	28081	1
lutta baldwin	whiteville	28472	3
lutta baldwin	whitevill	28475	3
repecca harrell	winton	27q86	5
rebecca harrell	winton	27986	5
rebecca harrell	witnon	27926	5
rebecca bauerband	raleigh	27615	6
reb3cca bauerboand	raleigh	27615	6
rebeccah shelton	whittier	28789	7