Frits Hermans
Data science blog
Welcome to my data science blog.
Aggregation learning
For some modelling exercises, input data is at a different granularity level than the target.
Deduplication of records using DedupliPy
Deduplication, or entity resolution, is the task of combining different representations of the same real-world entity. The Python package DedupliPy implements deduplication…
Distributed hyperparameter tuning of Scikit-learn models in Spark
Hyperparameter tuning of machine learning models often requires significant computing time. Scikit-learn implements parallel processing to speed things up, but real speed…
Finding duplicate records using PyMinHash
MinHashing is a very efficient way of finding similar records in a dataset based on Jaccard similarity. My Python package…
Taxonomy feature encoding
Features like zip codes or industry codes (NAICS, MCC) contain information that is part of a taxonomy. Although these feature values are numerical, it doesn’t necessarily make…