from pyminhash.datasets import load_data
= load_data() df
Finding duplicate records using PyMinHash
MinHashing is a very efficient way of finding similar records in a dataset based on Jaccard similarity. My Python package PyMinHash implements efficient minhashing for Pandas dataframes. In this post I explain how to use PyMinHash. You can find more information on how the minhashing algorithm works here.
Installation
Install PyMinHash from Pypi in the command line as follows:
pip install pyminhash
Apply on Pandas dataframe
PyMinHash comes with a toy dataset containing various name and address combinations of Stoxx50 companies. Let’s load the data and see what it contains.
name |
---|
adidas ag adi dassler strasse 1 91074 germany |
adidas ag adi dassler strasse 1 91074 herzogenaurach |
adidas ag adi dassler strasse 1 91074 herzogenaurach germany |
airbus se 2333 cs leiden netherlands |
airbus se 2333 cs netherlands |
We’re going to match various representations that belong to the same company. For this, we import create a MinHash
object and tell it to use 10 hash tables. More hash tables means more accurate Jaccard similarity calculation but also requires more time and memory.
from pyminhash.pyminhash import MinHash
= MinHash(n_hash_tables=10) myHasher
The fit_predict
method needs the dataframe and the name of the column to which minhashing should be applied. The result is a dataframe containing all pairs that have a non-zero Jaccard similarity:
= myHasher.fit_predict(df, 'name') result
name_1 | name_2 | jaccard_sim |
---|---|---|
engie sa 1 place samuel de champlain 92400 courbevoie | engie sa 1 place samuel de champlain 92400 france | 1.0 |
koninklijke philips n v amstelplein 2 1096 bc | koninklijke philips n v amstelplein 2 1096 bc amsterdam | 1.0 |
asml holding n v de run 6501 5504 dr veldhoven netherlands | asml holding n v de run 6501 veldhoven netherlands | 1.0 |
amadeus it group s a salvador de madariaga 1 28027 madrid | amadeus it group s a salvador de madariaga 1 28027 madrid spain | 1.0 |
deutsche telekom ag 53113 bonn germany | deutsche telekom ag 53113 germany | 1.0 |
As one can see below, for a Jaccard similarity of 1.0, all words in the shortest string appear in the longest string. For lower Jaccard similarity values, the match is less than perfect. Note that Jaccard similarity has granularity of 1/n_hash_tables, in this example 0.1.
name_1 | name_2 | jaccard_sim |
---|---|---|
engie sa 1 place samuel de champlain 92400 courbevoie | engie sa 1 place samuel de champlain 92400 france | 1.0 |
koninklijke philips n v amstelplein 2 1096 bc | koninklijke philips n v amstelplein 2 1096 bc amsterdam | 1.0 |
vinci sa 1 cours ferdinand de lesseps 92851 france | vinci sa 1 cours ferdinand de lesseps rueil malmaison france | 0.9 |
sanofi 54 rue la boetie 75008 france | sanofi 54 rue la boetie 75008 paris france | 0.9 |
airbus se 2333 cs leiden netherlands | airbus se 2333 cs netherlands | 0.8 |
fresenius se co kgaa else kroner strasse 1 61352 bad homburg vor der hohe germany | fresenius se co kgaa else kroner strasse 1 61352 germany | 0.8 |
bayerische motoren werke aktiengesellschaft munich germany | bayerische motoren werke aktiengesellschaft petuelring 130 80788 munich | 0.7 |
bayerische motoren werke aktiengesellschaft munich germany | bayerische motoren werke aktiengesellschaft petuelring 130 munich germany | 0.7 |
axa sa 75008 paris france | sanofi 54 rue la boetie 75008 paris france | 0.6 |
banco bilbao vizcaya argentaria s a 48005 bilbao spain | banco bilbao vizcaya argentaria s a bilbao | 0.6 |
lvmh moet hennessy louis vuitton societe europeenne 75008 france | sanofi 54 rue la boetie 75008 paris france | 0.5 |
safran sa 2 boulevard du general martial valin paris france | safran sa 75724 france | 0.5 |
munchener ruckversicherungs gesellschaft aktiengesellschaft koniginstrasse 107 munich germany | siemens aktiengesellschaft munich germany | 0.4 |
allianz se koniginstrasse 28 munich germany | munchener ruckversicherungs gesellschaft aktiengesellschaft koniginstrasse 107 munich germany | 0.4 |
bnp paribas sa 16 boulevard des italiens 75009 france | societe generale societe 75009 paris france | 0.3 |
essilorluxottica 1 6 rue paul cezanne paris france | l oreal s a 92117 france | 0.3 |
bnp paribas sa paris | engie sa 92400 courbevoie france | 0.2 |
essilorluxottica 1 6 rue paul cezanne paris france | schneider electric s e 92500 france | 0.2 |
societe generale societe 75009 paris france | total s a 2 place jean millier paris france | 0.1 |
vinci sa 1 cours ferdinand de lesseps rueil malmaison france | vivendi sa 75380 paris | 0.1 |
If you increase the number of hash tables, the accuracy of the Jaccard similarity increases but it comes at the cost of speed and memory usage.