from deduplipy.datasets import load_data
df = load_data(kind='voters')
Column names: 'name', 'suburb', 'postcode'
Deduplication, or entity resolution, is the task of combining different representations of the same real-world entity. The Python package DedupliPy implements deduplication using active learning. Active learning allows for rapid training without having to provide a large, manually labelled dataset. In this post I demonstrate how the package works and show some more advanced settings. If you want to apply entity resolution to large data in Spark, have a look at Spark-Matcher, a package I developed together with two colleagues.
DedupliPy can be installed from PyPI. Just type the following on the command line:
pip install deduplipy
DedupliPy comes with example data. We first load the ‘voters’ data that contains duplicate records:
This dataset contains names, suburbs and postcodes.
name | suburb | postcode |
---|---|---|
khimerc thomas | charlotte | 2826g |
lucille richardst | kannapolis | 28o81 |
reb3cca bauerboand | raleigh | 27615 |
Create a Deduplicator instance and provide the column names to be used for deduplication:
Fit the Deduplicator by active learning: for each pair shown, enter whether it is a match (y) or not (n). When training has converged, you will be notified and you can finish training by entering ‘f’.
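To give an idea of how an active learner decides which pair to show next, here is a minimal sketch of uncertainty sampling, a common active-learning strategy: the pair whose predicted match probability is closest to 0.5 is the one presented for labelling. This is a generic illustration and not DedupliPy's internal code; the probabilities are made up and `most_uncertain` is a hypothetical helper.

```python
def most_uncertain(probabilities):
    """Return the index of the probability closest to 0.5."""
    return min(range(len(probabilities)), key=lambda i: abs(probabilities[i] - 0.5))


# Hypothetical candidate pairs with made-up match probabilities, as they
# might come from a partially trained classifier.
pairs = [
    ("carla macartney", "caria macartney"),
    ("martha safrit", "kiera matthews"),
    ("jeanronel corbier", "jeanronel corpier"),
]
probabilities = [0.93, 0.04, 0.55]

query_index = most_uncertain(probabilities)
print(pairs[query_index])  # the pair a user would be asked to label next
```

Labelling the pairs the classifier is least sure about is what lets training converge after a small number of answers.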
Apply the trained Deduplicator to (new) data. The column deduplication_id is the identifier for a cluster. Rows with the same deduplication_id are found to be the same real-world entity.
name | suburb | postcode | deduplication_id |
---|---|---|---|
caria macartney | charlotte | 28220 | 1 |
carla macartney | charlotte | 28227 | 1 |
martha safrit | cha4lotte | 282l5 | 2 |
martha safrit | charlotte | 28215 | 2 |
jeanronel corbier | charlotte | 28213 | 3 |
jeanronel corpier | charrlotte | 28213 | 3 |
melissa kaltenbach | charlotte | 28211 | 4 |
melissa kalteribach | charlotte | 28251 | 4 |
kiea matthews | charlotte | 28218 | 5 |
kiera matthews | charlotte | 28216 | 5 |
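One way to think about the cluster identifier is as the result of grouping all rows that are linked by predicted matches. The sketch below shows that idea with a small union-find over hypothetical matched row pairs. It is a simplification for illustration only; DedupliPy's own clustering step may work differently, and `assign_cluster_ids` is a name I made up.

```python
def assign_cluster_ids(n_rows, matched_pairs):
    """Give rows linked by a chain of matches the same cluster id."""
    parent = list(range(n_rows))

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in matched_pairs:
        parent[find(a)] = find(b)

    # Number the clusters 1, 2, 3, ... in order of first appearance.
    ids = {}
    return [ids.setdefault(find(i), len(ids) + 1) for i in range(n_rows)]


# Hypothetical input: five rows, where rows 0 and 1 match and rows 2 and 3 match.
print(assign_cluster_ids(5, [(0, 1), (2, 3)]))
```

Note that this grouping is transitive: if row A matches row B and row B matches row C, all three end up in the same cluster.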
The Deduplicator instance can be saved as a pickle file after training and applied to new data later:
name | suburb | postcode | deduplication_id |
---|---|---|---|
caria macartney | charlotte | 28220 | 1 |
carla macartney | charlotte | 28227 | 1 |
martha safrit | cha4lotte | 282l5 | 2 |
martha safrit | charlotte | 28215 | 2 |
jeanronel corbier | charlotte | 28213 | 3 |
jeanronel corpier | charrlotte | 28213 | 3 |
melissa kaltenbach | charlotte | 28211 | 4 |
melissa kalteribach | charlotte | 28251 | 4 |
kiea matthews | charlotte | 28218 | 5 |
kiera matthews | charlotte | 28216 | 5 |
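Saving and reloading follows the usual pickle pattern from the Python standard library. The sketch below uses a plain dict as a hypothetical stand-in for the fitted Deduplicator and writes to a temporary directory, so that it runs without any extra packages; the file name is arbitrary.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted model; in practice this would be the
# trained Deduplicator instance.
model = {"col_names": ["name", "suburb", "postcode"], "threshold": 0.1}

path = os.path.join(tempfile.mkdtemp(), "model.pkl")

# Write the object to disk in binary mode.
with open(path, "wb") as f:
    pickle.dump(model, f)

# Later, possibly in another session, load it again.
with open(path, "rb") as f:
    loaded = pickle.load(f)

print(loaded == model)
```

As with any pickle file, only load files that come from a source you trust, because unpickling can execute arbitrary code.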
If you’re interested in the inner workings of DedupliPy, please watch my presentation at PyData Global 2021:
Let’s explore some advanced settings to tailor the deduplicator to our needs. We are going to select the similarity metrics per field, define our own blocking rules and include interaction between the fields.
The similarity metrics per field are entered in a dict. A similarity metric can be any function that takes two strings and outputs a number. We use some string similarity functions that are implemented in the Python package ‘thefuzz’ (pip install thefuzz):
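To illustrate what "any function that takes two strings and outputs a number" means, here is a small sketch with two home-made metrics and a dict that maps each column to a list of metrics. I use difflib from the standard library instead of thefuzz so the example runs without extra dependencies; the function names and the `metrics_per_field` dict are my own illustrations, not part of DedupliPy.

```python
from difflib import SequenceMatcher


def sequence_ratio(a, b):
    """Similarity between 0 and 100 based on matching character blocks."""
    return 100 * SequenceMatcher(None, a, b).ratio()


def same_length(a, b):
    """Crude metric: 100 if both strings have the same length, else 0."""
    return 100 if len(a) == len(b) else 0


# Hypothetical mapping from column name to the metrics applied to it.
metrics_per_field = {
    "name": [sequence_ratio],
    "suburb": [sequence_ratio],
    "postcode": [sequence_ratio, same_length],
}

row_a = {"name": "carla macartney", "suburb": "charlotte", "postcode": "28227"}
row_b = {"name": "caria macartney", "suburb": "charlotte", "postcode": "28220"}

# One feature per (field, metric) combination.
features = [
    metric(row_a[field], row_b[field])
    for field, metrics in metrics_per_field.items()
    for metric in metrics
]
print(features)
```

Each pair of rows is turned into one number per field and metric, and these numbers are what a classifier is trained on.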
We choose a set of blocking rules which we define ourselves. We only apply these rules to the ‘name’ column.
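For readers new to blocking: comparing every row with every other row requires n(n-1)/2 comparisons, so a blocking rule is used to compare only rows that share a cheap-to-compute key. The sketch below shows the general idea with a single rule on the name column. The rule and variable names are chosen for illustration and the code does not use DedupliPy itself.

```python
from collections import defaultdict
from itertools import combinations


def first_two_characters(value):
    """Blocking rule: rows sharing the first two characters are compared."""
    return value[:2]


names = ["carla macartney", "caria macartney", "martha safrit", "kiera matthews"]

# Group row indices by blocking key.
blocks = defaultdict(list)
for index, name in enumerate(names):
    blocks[first_two_characters(name)].append(index)

# Only pairs inside the same block become candidate pairs.
candidate_pairs = [
    pair for members in blocks.values() for pair in combinations(members, 2)
]
print(candidate_pairs)
```

Here only one candidate pair remains instead of the six pairs a full comparison would produce. The price is that a typo inside the blocking key itself hides a true match, which is why several rules are usually combined.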
When we set interaction=True, the classifier includes interaction features, e.g. ratio('name') * token_set_ratio('suburb'). When interaction features are included, the logistic regression classifier applies L1 regularisation to prevent overfitting. We also set verbose=1 to get information on the progress and a distribution of scores:
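As a rough picture of what interaction features are, the sketch below takes made-up similarity scores for one pair of rows and adds the product of every two scores as an extra feature. The feature names and values are hypothetical, and the way DedupliPy builds these features internally may differ.

```python
from itertools import combinations

# Hypothetical similarity scores for one pair of rows, scaled to [0, 1].
scores = {
    "ratio(name)": 0.93,
    "token_set_ratio(suburb)": 1.0,
    "ratio(postcode)": 0.8,
}

# Keep the original features and add the product of every pair of features.
features = dict(scores)
for (name_a, value_a), (name_b, value_b) in combinations(scores.items(), 2):
    features[f"{name_a} * {name_b}"] = value_a * value_b

for name, value in features.items():
    print(f"{name}: {value:.3f}")
```

Three scores already become six features, and the number grows quadratically with more metrics. This is where L1 regularisation helps: it tends to push the weights of uninformative features to exactly zero.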
Fit the Deduplicator by active learning: for each pair shown, enter whether it is a match (y) or not (n). When training has converged, you will be notified and you can finish training by entering ‘f’.
After fitting, the histogram of scores is shown. Based on this histogram, we decide to ignore all pairs with a similarity probability lower than 0.1 when predicting:
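Conceptually, such a cut-off simply discards low-scoring pairs before the remaining pairs are clustered. A minimal sketch with made-up scored pairs (the variable names are mine, not DedupliPy's):

```python
# Hypothetical scored pairs: (row index a, row index b, match probability).
scored_pairs = [(0, 1, 0.97), (0, 2, 0.03), (2, 3, 0.41), (1, 3, 0.08)]

SCORE_THRESHOLD = 0.1

# Drop pairs that are very unlikely to be matches before clustering.
kept = [pair for pair in scored_pairs if pair[2] >= SCORE_THRESHOLD]
print(kept)
```

A low threshold like this mainly saves work: pairs that are almost certainly not matches never reach the clustering step.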
Apply the trained Deduplicator to (new) data. The column deduplication_id is the identifier for a cluster. Rows with the same deduplication_id are found to be the same real-world entity.
name | suburb | postcode | deduplication_id |
---|---|---|---|
lucille richardst | kannapolis | 28o81 | 1 |
lucille richards | kannapolis | 28081 | 1 |
lutta baldwin | whiteville | 28472 | 3 |
lutta baldwin | whitevill | 28475 | 3 |
repecca harrell | winton | 27q86 | 5 |
rebecca harrell | winton | 27986 | 5 |
rebecca harrell | witnon | 27926 | 5 |
rebecca bauerband | raleigh | 27615 | 6 |
reb3cca bauerboand | raleigh | 27615 | 6 |
rebeccah shelton | whittier | 28789 | 7 |