Finding duplicate records using PyMinHash

MinHashing is a very efficient way of finding similar records in a dataset based on Jaccard similarity. My Python package PyMinHash implements efficient minhashing for Pandas dataframes. In this post I explain how to use PyMinHash. You can find more information on how the minhashing algorithm works here.

Installation

Install PyMinHash from Pypi in the command line as follows:

pip install pyminhash

Apply on Pandas dataframe

PyMinHash comes with a toy dataset containing various name and address combinations of Stoxx50 companies. Let’s load the data and see what it contains.

from pyminhash.datasets import load_data
df = load_data()
name
adidas ag adi dassler strasse 1 91074 germany
adidas ag adi dassler strasse 1 91074 herzogenaurach
adidas ag adi dassler strasse 1 91074 herzogenaurach germany
airbus se 2333 cs leiden netherlands
airbus se 2333 cs netherlands

We’re going to match various representations that belong to the same company. For this, we import create a MinHash object and tell it to use 10 hash tables. More hash tables means more accurate Jaccard similarity calculation but also requires more time and memory.

from pyminhash.pyminhash import MinHash
myHasher = MinHash(n_hash_tables=10)

The fit_predict method needs the dataframe and the name of the column to which minhashing should be applied. The result is a dataframe containing all pairs that have a non-zero Jaccard similarity:

result = myHasher.fit_predict(df, 'name')
name_1 name_2 jaccard_sim
engie sa 1 place samuel de champlain 92400 courbevoie engie sa 1 place samuel de champlain 92400 france 1.0
koninklijke philips n v amstelplein 2 1096 bc koninklijke philips n v amstelplein 2 1096 bc amsterdam 1.0
asml holding n v de run 6501 5504 dr veldhoven netherlands asml holding n v de run 6501 veldhoven netherlands 1.0
amadeus it group s a salvador de madariaga 1 28027 madrid amadeus it group s a salvador de madariaga 1 28027 madrid spain 1.0
deutsche telekom ag 53113 bonn germany deutsche telekom ag 53113 germany 1.0

As one can see below, for a Jaccard similarity of 1.0, all words in the shortest string appear in the longest string. For lower Jaccard similarity values, the match is less than perfect. Note that Jaccard similarity has granularity of 1/n_hash_tables, in this example 0.1.

name_1 name_2 jaccard_sim
engie sa 1 place samuel de champlain 92400 courbevoie engie sa 1 place samuel de champlain 92400 france 1.0
koninklijke philips n v amstelplein 2 1096 bc koninklijke philips n v amstelplein 2 1096 bc amsterdam 1.0
vinci sa 1 cours ferdinand de lesseps 92851 france vinci sa 1 cours ferdinand de lesseps rueil malmaison france 0.9
sanofi 54 rue la boetie 75008 france sanofi 54 rue la boetie 75008 paris france 0.9
airbus se 2333 cs leiden netherlands airbus se 2333 cs netherlands 0.8
fresenius se co kgaa else kroner strasse 1 61352 bad homburg vor der hohe germany fresenius se co kgaa else kroner strasse 1 61352 germany 0.8
bayerische motoren werke aktiengesellschaft munich germany bayerische motoren werke aktiengesellschaft petuelring 130 80788 munich 0.7
bayerische motoren werke aktiengesellschaft munich germany bayerische motoren werke aktiengesellschaft petuelring 130 munich germany 0.7
axa sa 75008 paris france sanofi 54 rue la boetie 75008 paris france 0.6
banco bilbao vizcaya argentaria s a 48005 bilbao spain banco bilbao vizcaya argentaria s a bilbao 0.6
lvmh moet hennessy louis vuitton societe europeenne 75008 france sanofi 54 rue la boetie 75008 paris france 0.5
safran sa 2 boulevard du general martial valin paris france safran sa 75724 france 0.5
munchener ruckversicherungs gesellschaft aktiengesellschaft koniginstrasse 107 munich germany siemens aktiengesellschaft munich germany 0.4
allianz se koniginstrasse 28 munich germany munchener ruckversicherungs gesellschaft aktiengesellschaft koniginstrasse 107 munich germany 0.4
bnp paribas sa 16 boulevard des italiens 75009 france societe generale societe 75009 paris france 0.3
essilorluxottica 1 6 rue paul cezanne paris france l oreal s a 92117 france 0.3
bnp paribas sa paris engie sa 92400 courbevoie france 0.2
essilorluxottica 1 6 rue paul cezanne paris france schneider electric s e 92500 france 0.2
societe generale societe 75009 paris france total s a 2 place jean millier paris france 0.1
vinci sa 1 cours ferdinand de lesseps rueil malmaison france vivendi sa 75380 paris 0.1

If you increase the number of hash tables, the accuracy of the Jaccard similarity increases but it comes at the cost of speed and memory usage.