Patterns for amalgamation include standard joins in databases and dataframes (e.g., pandas) and extend them to a probabilistic join.


You have found a dataset that includes features you care about: they are relevant to your domain. You also have many other datasets that are potentially useful.


How do you enrich one starter, baseline dataset with features from additional datasets so that you can improve accuracy, precision, and recall, and get better confusion matrices and ROC AUC scores?

Solution Strategy

Start with a dataset that captures features relevant to your domain. Often you need training data, which means you may need a curated, labelled dataset to start with.

Take, say, three datasets and use one of them as the base. Then join the other datasets onto that base to refine and enrich it with new features (columns).
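When the datasets share a clean key, the enrichment step is an ordinary left join. The sketch below assumes pandas and uses made-up column names (`author`, `credibility`, `articles_published`) and made-up rows purely for illustration. It keeps every row of the base dataset and flags the rows that found no partner in the new dataset.

```python
import pandas as pd

# Base dataset: hypothetical authors with a credibility score you already curated.
base = pd.DataFrame({
    "author": ["A. Rivera", "B. Chen", "C. Okafor"],
    "credibility": [0.9, 0.4, 0.7],
})

# Additional dataset: hypothetical new feature keyed on the same column.
extra = pd.DataFrame({
    "author": ["A. Rivera", "C. Okafor", "D. Novak"],
    "articles_published": [12, 3, 8],
})

# A left join keeps all base rows; indicator=True adds a "_merge" column
# recording whether each base row matched a row in `extra`.
enriched = base.merge(extra, on="author", how="left", indicator=True)

# Base rows with no exact match are candidates for a fuzzier join.
unmatched = enriched.loc[enriched["_merge"] == "left_only", "author"].tolist()

print(enriched)
print(unmatched)
```

Rows that come back as `left_only` have missing values in the new columns, which is a useful signal that an exact key join is not enough for them.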

Sometimes a new dataset has no clear, unequivocal join key. For example, your base dataset ranks authors by credibility for an NLP application, but another dataset contains authors who do not appear in your base, ranked dataset.

In that case you might do a probabilistic join: compute the cosine similarity between vector representations of the authors you have already ranked and the new ones whose credibility you have not yet assessed, then match each new author to its closest ranked counterpart.
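One way to sketch such a probabilistic join is shown below. Several choices here are assumptions for illustration, not part of the pattern itself: names are represented as character trigram counts, the helper names (`char_ngrams`, `probabilistic_join`) are hypothetical, and the 0.5 threshold is arbitrary and would need tuning on real data. The example names and scores are invented.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Represent a string as a bag of character n-grams (assumed representation)."""
    padded = f" {text.lower().strip()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def probabilistic_join(base_scores, new_names, threshold=0.5):
    """Match each new name to its most similar base name, if similar enough.

    Returns {new_name: (matched_base_name or None, inherited_score or None, similarity)}.
    """
    base_vectors = {name: char_ngrams(name) for name in base_scores}
    matches = {}
    for new_name in new_names:
        vec = char_ngrams(new_name)
        best_name, best_sim = max(
            ((name, cosine_similarity(vec, bv)) for name, bv in base_vectors.items()),
            key=lambda pair: pair[1],
            default=(None, 0.0),
        )
        if best_name is not None and best_sim >= threshold:
            matches[new_name] = (best_name, base_scores[best_name], best_sim)
        else:
            # Below the threshold: report the similarity but do not join.
            matches[new_name] = (None, None, best_sim)
    return matches

# Hypothetical ranked authors and unranked authors from another dataset.
base_scores = {"Jane Smith": 0.9, "Robert Jones": 0.4}
matches = probabilistic_join(base_scores, ["Jane Smyth", "Li Wei"])
print(matches)
```

Keeping the similarity alongside each match lets you treat the join as a weighted link rather than a certain one, for example by reviewing low-similarity matches by hand.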


Take into account the balance of your data: is it dominated by one type or category, or does it contain roughly equal amounts of every type you are considering?

Use IBM AI Fairness 360 or other APIs to measure the balance, or lack thereof, of your datasets, and an oversampling technique such as SMOTE to correct an imbalance once you have found it.
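Before reaching for a dedicated library, a quick first look at balance needs only the label column. The sketch below is a minimal check under assumptions: the function name `class_balance` and the labels `credible` / `not_credible` are hypothetical, and the ratio of the largest to the smallest class is just one simple imbalance measure among many.

```python
from collections import Counter

def class_balance(labels):
    """Return each class's share of the data and the majority/minority ratio."""
    counts = Counter(labels)
    if not counts:
        raise ValueError("labels must not be empty")
    total = sum(counts.values())
    shares = {label: count / total for label, count in counts.items()}
    # A ratio of 1.0 means perfectly balanced; larger means more skewed.
    imbalance_ratio = max(counts.values()) / min(counts.values())
    return shares, imbalance_ratio

# Hypothetical label column: 90 credible authors, 10 not credible.
labels = ["credible"] * 90 + ["not_credible"] * 10
shares, ratio = class_balance(labels)
print(shares, ratio)
```

A large ratio suggests that accuracy alone will be misleading and that resampling or class weighting is worth considering before training.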


