Amalgamation

Patterns for amalgamation extend the standard joins found in databases and dataset libraries (e.g., Pandas) to include probabilistic joins.

Context

You have found a dataset that includes features you care about: they are relevant to your domain. You also have many other datasets that are potentially useful.

Problem

How do you enrich one baseline starter dataset with additional features from new datasets so that you can improve accuracy, precision, and recall, and optimize confusion matrices and the area under the ROC curve (AUC)?

Solution Strategy

Start with a dataset that captures features relevant to your domain. Often you need training data, which means you may need a curated/labelled dataset to start with.

Take, for example, three datasets and use one of them as the basis. Then join the other datasets to it to refine and enrich the data with new features/columns.
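As a minimal sketch of this enrichment step, the example below uses Pandas left joins on a shared key. The three DataFrames, their column names (`author`, `credibility`, `article_count`, `affiliation`), and their values are all hypothetical and only illustrate the shape of the operation.

```python
import pandas as pd

# Hypothetical base dataset: authors already ranked for credibility.
base = pd.DataFrame({
    "author": ["Ada Lovelace", "Alan Turing", "Grace Hopper"],
    "credibility": [0.9, 0.8, 0.95],
})

# Hypothetical second and third datasets, each contributing one new feature.
articles = pd.DataFrame({
    "author": ["Ada Lovelace", "Alan Turing"],
    "article_count": [12, 30],
})
affiliations = pd.DataFrame({
    "author": ["Alan Turing", "Grace Hopper"],
    "affiliation": ["Manchester", "US Navy"],
})

# Left joins keep every row of the base dataset; a feature that is missing
# from a source dataset shows up as NaN instead of dropping the row.
enriched = (
    base
    .merge(articles, on="author", how="left")
    .merge(affiliations, on="author", how="left")
)

print(enriched)
```

A left join is used here so that the curated base never loses rows; whether missing values should then be imputed or left as-is depends on the model you train.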

Sometimes the new datasets do not have a clear, unequivocal join key. For example, in an NLP application you have ranked authors by name for credibility, but another dataset contains authors who are not in your ranked base dataset.

In that case you might do a probabilistic join: for example, compute the cosine similarity between the authors you have already ranked and the new ones whose credibility you have not yet assessed.
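One way such a probabilistic join could look is sketched below. It is an illustration under assumptions, not a prescribed method: names are represented as character-trigram counts, the helper names (`trigram_vector`, `cosine_similarity`, `probabilistic_join`) are hypothetical, and the `0.5` threshold is an arbitrary starting point that would need tuning on real data.

```python
import math
from collections import Counter


def trigram_vector(name):
    """Count character trigrams of a lowercased, space-padded name."""
    padded = f"  {name.lower().strip()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def probabilistic_join(new_names, base_names, threshold=0.5):
    """Map each new name to (best base match or None, similarity score)."""
    base_vectors = {name: trigram_vector(name) for name in base_names}
    matches = {}
    for new_name in new_names:
        vec = trigram_vector(new_name)
        best, score = max(
            ((b, cosine_similarity(vec, bv)) for b, bv in base_vectors.items()),
            key=lambda pair: pair[1],
        )
        # Below the (assumed) threshold, report no match rather than a guess.
        matches[new_name] = (best if score >= threshold else None, score)
    return matches


# Hypothetical names for illustration only.
ranked_authors = ["Grace Hopper", "Alan Turing"]
unranked_authors = ["G. Hopper", "Alan M. Turing", "Zzyzx Qwv"]
matches = probabilistic_join(unranked_authors, ranked_authors)
for new_name, (match, score) in matches.items():
    print(f"{new_name!r} -> {match!r} (similarity {score:.2f})")
```

Keeping the similarity score alongside each match lets you carry the uncertainty of the join forward, for instance as a weight or as a filter for manual review.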

Considerations

Take into account the balance of your data: is it dominated by one type or category, or does it contain roughly equal amounts of every type you are considering?

Use IBM AI Fairness 360 or other APIs to measure the balance, or lack thereof, of your datasets, and a technique such as SMOTE to oversample under-represented classes.
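Before reaching for a dedicated library, a quick first look at balance can be had by counting labels. The sketch below uses only the Python standard library; the function names and the credibility labels are hypothetical, and it assumes a non-empty list of labels.

```python
from collections import Counter


def class_balance(labels):
    """Return each class's share of the dataset as a fraction in [0, 1]."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def imbalance_ratio(labels):
    """Ratio of the largest class count to the smallest (1.0 means balanced)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Hypothetical credibility labels for illustration only.
labels = ["credible"] * 90 + ["not_credible"] * 10
shares = class_balance(labels)
ratio = imbalance_ratio(labels)
print(shares, ratio)
```

A ratio far above 1.0 suggests that accuracy alone will be misleading and that precision, recall, and the confusion matrix deserve a closer look.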