# Introduction

Despite the intimidating title, label bias in positive and unlabeled learning is probably among the most common and problematic issues faced by machine learning practitioners. It can be hard to detect and can seriously hinder model generalization. At Oracle Data Cloud, many of our supervised learning tasks fall into this category, and we have researched several techniques that help mitigate the problems that arise. If you work on supervised learning problems, I suspect you are exposed to some of these issues too, and this article may prove useful in identifying and mitigating them.

# Grocery Store Scenario

To help illustrate some of these points, let’s set up an example scenario. Suppose we run analytics for a grocery store and want to send out a mailer to entice people to buy a new brand of corn chips we just started stocking. It costs us money to send each mailer, so we want to only contact people that we think are likely to buy corn chips. A data scientist recently hired into the analytics department wants to use machine learning to find the most likely buyers of corn chips in the future, and only target likely buyers, improving the bottom line of the campaign. However, the data the grocery store collects on its customers is imperfect. There are a few flaws in the dataset our data scientist has to work with that might prevent the models from working if not properly accounted for.

# Definitions

Supervised learning is a subset of machine learning, where examples are assigned a specific label. Generally, there are two types of supervised learning: classification (labels are discrete) and regression (labels are a continuous numeric value). This is probably the most common type of machine learning, and things like image classification, customer value prediction, risk estimation, etc. all fall under this umbrella. Supervised learning models are explicitly passed labels (dependent variables) and asked to determine how the independent variables differentiate the labels from one another. Our grocery store scenario is an example of supervised learning. Our data scientist wants to use the information the store has about existing chip buyers to differentiate likely prospective chip buyers from unlikely prospects.

Label bias occurs when the set of labeled data is not fully representative of the entire universe of potential labels. This is a very common problem in supervised learning, stemming from the fact that data often needs to be labeled by hand (which is difficult and expensive). For instance, let’s say I wanted to create a service that identifies handwritten letters, but my training set consists solely of some letters I wrote by hand. Likely, my handwriting would not be representative of all the different ways letters can be written, and so any model trained from this dataset would not generalize to handwriting that differs significantly from mine. Or, referring back to our grocery store example, perhaps the only transactions we record as chip purchases are for customers that input their loyalty card number by entering their phone number, or perhaps only certain brands of chips are labeled in the store’s system as corn chips. It is reasonable to believe that there is some sort of structural difference between people that enter their phone number as opposed to use their card or prefer one brand of chips over another.

Positive and unlabeled learning is a subset of supervised learning in which only the positive labels are known. The missing data mechanism here falls under ‘Missing Not at Random’, since every non-missing label is positive (the label itself is a predictor of it being missing). Referring to our grocery store example again, the store keeps track of what people purchase using their loyalty cards, so it knows with some certainty that the people observed purchasing corn chips are corn chip buyers (let’s call this group the 1s). However, some of the customers not observed purchasing corn chips (let’s call them the 0s) may have bought them at another store or may have paid without entering their loyalty card number. They are buyers, but the store hasn’t observed that fact. So, some of the ‘0s’ actually are corn chip buyers, and probably should be labeled as ‘1’.
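To make the labeling mechanism concrete, here is a minimal simulation of how true buyers turn into observed 1s and hidden 0s. The buyer rate (30%) and the chance that a purchase is actually recorded (60%) are made-up values for illustration, and the sketch assumes recording is independent of customer features, which is the simplest case rather than the biased scenarios discussed later.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground truth: 1 = really buys corn chips, 0 = does not.
y_true = rng.binomial(1, 0.3, size=1000)

# Assumption: each purchase is recorded on a loyalty card with probability 0.6.
observed = rng.binomial(1, 0.6, size=y_true.shape[0]).astype(bool)

# The store only ever sees positives; everyone else ends up as an unlabeled 0.
y_observed = np.where((y_true == 1) & observed, 1, 0)

hidden_buyers = int(np.sum((y_true == 1) & (y_observed == 0)))
print("true buyers:", int(y_true.sum()))
print("observed buyers:", int(y_observed.sum()))
print("buyers hiding among the 0s:", hidden_buyers)
```

Every observed 1 is a real buyer, but the 0 group is a mixture of real non-buyers and unobserved buyers.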

The focus of this article is the intersection of these three areas. We are building supervised learning models (in this example, we’ll be building a binary classifier), but our labels are positive and unlabeled (as opposed to having explicit 1s and 0s). Furthermore, we will create a few contrived scenarios in which the labeling is biased—that is, the unlabeled set is not unlabeled randomly.

# Four Approaches

## Ignore the Bias

The naïve approach to solving this problem is simply to ignore the possibility that some of our non-corn chip buyers may actually be corn chip buyers. In this case, we simply label the buyers with a 1, and the non-observed with a 0 and fit a standard binary classifier.

```
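from sklearn.ensemble import RandomForestClassifier

# Shared hyperparameters used for every random forest in this article.
rf_params = dict(n_estimators=128, max_depth=5, n_jobs=16)
# Naive baseline: observed buyers are 1s, everyone else is treated as a true 0.
clf = RandomForestClassifier(**rf_params)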
```
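As a rough sketch of what this baseline looks like end to end, the example below trains on labels where half of the true positives have been hidden, then scores against the complete test labels. The synthetic dataset from `make_classification`, the 50% hiding rate, and the smaller forest are all assumptions chosen to keep the example fast; they are not the data or settings used in the experiments.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption); y_true plays the role of pristine labels.
X, y_true = make_classification(n_samples=2000, n_features=10, random_state=0)
X_tr, X_te, y_tr_true, y_te = train_test_split(
    X, y_true, test_size=0.3, random_state=0
)

# Hide roughly half of the training positives at random (illustrative assumption).
rng = np.random.default_rng(0)
keep = rng.random(len(y_tr_true)) < 0.5
y_tr_observed = np.where((y_tr_true == 1) & keep, 1, 0)

# Ignore the bias: fit a standard classifier on the observed labels as-is.
clf = RandomForestClassifier(n_estimators=64, max_depth=5, random_state=0)
clf.fit(X_tr, y_tr_observed)

# Evaluate ranking quality against the untouched test labels.
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print("holdout AUC:", round(auc, 3))
```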

## Semi-Supervised Labeling

In this approach (derived from Charles Elkan and Keith Noto’s paper, "Learning Classifiers From Only Positive and Unlabeled Data"), we use an initial modeling algorithm to infer, for each unlabeled example, the probability that it is a true 1 or a true 0. Each unlabeled example is then fed back into a second classifier twice, once labeled as a 1 and once as a 0, with a sample weight proportional to the probability of the training instance belonging to each class (and then down-weighted according to our belief about the prevalence of the 1 class within the unlabeled group).

```
from collections import Counter

import numpy as np
from scipy.sparse import issparse, vstack
from sklearn.ensemble import RandomForestClassifier

rf_params = dict(n_estimators=128, max_depth=5, n_jobs=16)


class SemiSupervisedClassifier(object):
    def __init__(
        self,
        clf_1=None,
        clf_2=None,
        expected_false_negative_frac=0.5
    ):
        self.expected_false_negative_frac = expected_false_negative_frac
        # clf_1 scores the unlabeled records; clf_2 is the final model.
        self.clf_1 = (
            RandomForestClassifier(**rf_params)
            if clf_1 is None else clf_1
        )
        self.clf_2 = (
            RandomForestClassifier(**rf_params)
            if clf_2 is None else clf_2
        )
        self.clf_1_is_fit = False
        self.clf_2_is_fit = False

    @staticmethod
    def intermediate_data_set(clf, x, y, false_negative_fraction):
        assert isinstance(x, np.ndarray) or issparse(x)
        y = np.array(y).astype(np.float32)
        pos_mask = y > 0
        probs = clf.predict_proba(x[~pos_mask])
        prob_neg = probs[:, 0]
        prob_pos = probs[:, 1]
        assumed_coverage = 1 / (1 + false_negative_fraction)
        # Guard against division by zero, then clip so that each record's
        # two weights remain valid probabilities in [0, 1].
        odds = prob_pos / np.maximum(prob_neg, 1e-12)
        weights_pos = np.clip((1 - assumed_coverage) / assumed_coverage * odds, 0, 1)
        weights_neg = 1 - weights_pos
        assert np.allclose(weights_neg + weights_pos, 1.0), "weights must sum to 1!"
        num_pos = np.sum(pos_mask)
        num_unlabeled = np.sum(~pos_mask)
        # Row order: labeled positives, unlabeled-as-1, unlabeled-as-0.
        weights = np.concatenate((np.full(num_pos, 1.0), weights_pos, weights_neg))
        y_new = np.concatenate((
            y[pos_mask],
            np.full(num_unlabeled, np.mean(y[pos_mask])),  # 1.0 for binary labels
            np.full(num_unlabeled, 0),
        ))
        if issparse(x):
            x = x.tocsr()  # CSR supports the boolean row masks used below
            x_new = vstack([x[pos_mask], x[~pos_mask], x[~pos_mask]])
        else:
            x_new = np.concatenate((x[pos_mask], x[~pos_mask], x[~pos_mask]))
        return x_new, y_new, weights

    @staticmethod
    def balance_weights(y):
        # Weight each class inversely to its frequency (largest class gets 1.0).
        counts = Counter(y)
        max_weight = float(max(counts.values()))
        class_weights = {int(k): max_weight / v for k, v in dict(counts).items()}
        print(class_weights)
        return class_weights

    def fit(self, x, y, update_weights=True):
        print("fitting non-traditional classifier...")
        if update_weights:
            self.clf_1.set_params(class_weight=self.balance_weights(y))
        self.clf_1.fit(x, y)
        x, y, w = self.intermediate_data_set(self.clf_1, x, y, self.expected_false_negative_frac)
        print("fitting semi-supervised classifier...")
        if update_weights:
            self.clf_2.set_params(class_weight=self.balance_weights(y))
        # w is None for subclasses that relabel instead of reweighting.
        self.clf_2.fit(x, y, sample_weight=w)
        return self

    def predict(self, x):
        return self.clf_2.predict(x)

    def predict_proba(self, x):
        return self.clf_2.predict_proba(x)
```
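The weighting step is easier to follow with numbers plugged in. The sketch below applies the same formula as `intermediate_data_set` to four made-up first-stage scores, using the default `expected_false_negative_frac` of 0.5. The scores are purely illustrative, and the clip to [0, 1] is a guard we assume is desirable so that the two weights for a record stay valid probabilities.

```python
import numpy as np

# Hypothetical first-stage scores P(labeled = 1 | x) for four unlabeled records.
prob_pos = np.array([0.05, 0.20, 0.40, 0.60])
prob_neg = 1.0 - prob_pos

false_negative_fraction = 0.5
assumed_coverage = 1 / (1 + false_negative_fraction)  # 2/3 of true buyers are labeled

# Weight of the "this record is really a 1" copy of each unlabeled record.
raw = (1 - assumed_coverage) / assumed_coverage * prob_pos / prob_neg
weights_pos = np.clip(raw, 0.0, 1.0)

# Weight of the "this record is really a 0" copy.
weights_neg = 1.0 - weights_pos

for p, wp, wn in zip(prob_pos, weights_pos, weights_neg):
    print(f"score={p:.2f}  weight as 1={wp:.3f}  weight as 0={wn:.3f}")
```

A record the first-stage model scores at 0.40 contributes about one third of its weight as a positive and two thirds as a negative; raising the assumed false negative fraction shifts more weight toward the positive copy.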

## Semi-Supervised Labeling 2

This approach is similar to semi-supervised labeling, except that instead of treating every unlabeled instance as both positive and negative with proportional weighting, we simply treat the top x% of unlabeled instances most likely to have a positive label as positive, and the rest as negative.

```
import numpy as np
from scipy.sparse import issparse, vstack


class SemiSupervisedClassifier2(SemiSupervisedClassifier):
    @staticmethod
    def intermediate_data_set(clf, x, y, false_negative_fraction):
        assert isinstance(x, np.ndarray) or issparse(x)
        y = np.array(y).astype(np.float32)
        pos_mask = y > 0
        probs = clf.predict_proba(x[~pos_mask])
        prob_pos = probs[:, 1]
        # Unlabeled records scoring above this percentile are relabeled as 1s.
        pos_cutoff = np.percentile(prob_pos, (1 - false_negative_fraction) * 100)
        inferred_pos_mask = prob_pos > pos_cutoff
        num_pos = np.sum(pos_mask)
        num_pos_inferred = np.sum(inferred_pos_mask)
        num_neg_inferred = np.sum(~inferred_pos_mask)
        print("labeled positives: {}".format(num_pos))
        print("inferring {} positives".format(num_pos_inferred))
        print("inferring {} negatives".format(num_neg_inferred))
        # Row order: labeled positives, inferred positives, inferred negatives.
        y_new = np.concatenate((y[pos_mask], np.full(num_pos_inferred, 1), np.full(num_neg_inferred, 0)))
        if issparse(x):
            x = x.tocsr()  # CSR supports the boolean row masks used below
            x_new = vstack(
                [x[pos_mask], x[~pos_mask][inferred_pos_mask], x[~pos_mask][~inferred_pos_mask]]
            )
        else:
            x_new = np.concatenate(
                (x[pos_mask], x[~pos_mask][inferred_pos_mask], x[~pos_mask][~inferred_pos_mask])
            )
        # No sample weights: every record now carries a single hard label.
        return x_new, y_new, None
```
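The percentile cutoff can be checked on a handful of values. In this sketch the eight scores and the 25% false negative fraction are invented for illustration; the two lines that compute the cutoff and mask mirror the ones in the class above.

```python
import numpy as np

# Hypothetical first-stage scores for eight unlabeled records.
prob_pos = np.array([0.05, 0.10, 0.20, 0.35, 0.50, 0.65, 0.80, 0.90])
false_negative_fraction = 0.25  # we believe ~25% of the unlabeled are really 1s

# Scores strictly above the 75th percentile are relabeled as positives.
pos_cutoff = np.percentile(prob_pos, (1 - false_negative_fraction) * 100)
inferred_pos_mask = prob_pos > pos_cutoff

print("cutoff:", pos_cutoff)
print("relabeled as 1:", prob_pos[inferred_pos_mask])
print("kept as 0:", prob_pos[~inferred_pos_mask])
```

Because the comparison is strict, heavy ties at the cutoff value can leave fewer than x% of the records relabeled, which is worth keeping in mind when the first-stage model produces coarse scores.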

## Multi-Model Feature Set Segmentation

This approach begins with the assumption that the bias in our labeling is predictable given our feature set, and that when we make an initial classifier, we can capture that bias—or at least approximate it—with a small subset of the available features. We then have three models:

- Model 1: a classifier trained only on the feature subset that we think may define our bias, used to classify records as belonging to the biased set or not.
- Model 2: a model using all the available features.
- Model 3: a model that excludes the ‘bias-predicting features’.

We train each of these models on the full set of training records, but in the inference stage, we segregate our records into the biased group or not (using model 1), and then make our predictions for the biased set with model 2, and the predictions for the non-biased set with model 3.

```
import numpy as np
from sklearn.ensemble import RandomForestClassifier


class MultiModelClassifier(SemiSupervisedClassifier):
    def __init__(
        self,
        clf_1=None,
        clf_2=None,
        bias_model=None,
        n_feats_define_bias=2
    ):
        self.clf_1 = (
            RandomForestClassifier(**rf_params)
            if clf_1 is None else clf_1
        )
        self.clf_2 = (
            RandomForestClassifier(**rf_params)
            if clf_2 is None else clf_2
        )
        self.bias_model = (
            RandomForestClassifier(**rf_params)
            if bias_model is None else bias_model
        )
        self.feature_mask = None
        self.n_feats_define_bias = n_feats_define_bias

    def fit(self, x, y, update_weights=True):
        if update_weights:
            weights = self.balance_weights(y)
            self.clf_1.set_params(class_weight=weights)
            self.clf_2.set_params(class_weight=weights)
            self.bias_model.set_params(class_weight=weights)
        # Model 2: all features. Its importances pick the bias-defining features.
        self.clf_1.fit(x, y)
        if hasattr(self.clf_1, 'feature_importances_'):
            importances = self.clf_1.feature_importances_
        else:
            # Linear models: coef_ has shape (1, n_features) for binary problems.
            importances = np.abs(self.clf_1.coef_).ravel()
        min_import = sorted(importances)[-self.n_feats_define_bias]
        self.feature_mask = importances >= min_import
        # Model 1: predicts membership in the biased group from those features only.
        self.bias_model.fit(x[:, self.feature_mask], y)
        # Model 3: every feature except the bias-defining ones.
        self.clf_2.fit(self.filter_columns(x), y)
        return self

    def filter_columns(self, x):
        return x[:, ~self.feature_mask]

    def predict(self, x):
        pred = np.empty((x.shape[0],))
        bias_mask = self.bias_model.predict(x[:, self.feature_mask]) == 1
        # Skip empty groups: scikit-learn estimators reject zero-row input.
        if bias_mask.any():
            pred[bias_mask] = self.clf_1.predict(x[bias_mask])
        if (~bias_mask).any():
            pred[~bias_mask] = self.clf_2.predict(self.filter_columns(x[~bias_mask]))
        return pred

    def predict_proba(self, x):
        pred = np.empty((x.shape[0], 2))
        bias_mask = self.bias_model.predict(x[:, self.feature_mask]) == 1
        if bias_mask.any():
            pred[bias_mask] = self.clf_1.predict_proba(x[bias_mask])
        if (~bias_mask).any():
            pred[~bias_mask] = self.clf_2.predict_proba(self.filter_columns(x[~bias_mask]))
        return pred
```
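The feature-selection step inside `fit` can be looked at in isolation. The sketch below fits a small forest on synthetic data (an assumption, as are the dataset shape and forest size), keeps the two most important features as the ‘bias-predicting’ subset, and splits the columns the same way the class does.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption): 8 features, 3 of them informative.
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=3, random_state=0
)
clf = RandomForestClassifier(n_estimators=32, max_depth=5, random_state=0).fit(X, y)

n_feats_define_bias = 2
importances = clf.feature_importances_

# Importance of the n-th most important feature is the inclusion threshold.
min_import = np.sort(importances)[-n_feats_define_bias]
feature_mask = importances >= min_import

X_bias = X[:, feature_mask]   # input to the bias model
X_rest = X[:, ~feature_mask]  # input to the model that excludes those features

print("bias-defining feature indices:", np.flatnonzero(feature_mask))
print("shapes:", X_bias.shape, X_rest.shape)
```

Using `>=` means tied importances can select more than `n_feats_define_bias` columns, so the subset size is a lower bound rather than a guarantee.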

# The Experiment

Back to the grocery store scenario we outlined previously. Let’s pretend we are in the data scientist’s shoes and run a few experiments on some datasets. The data we will use is the UC Irvine Online Retail dataset, available from the UCI Machine Learning Repository. We will turn this transactional dataset into features that we can use for this problem (recency, frequency, and monetary value for each of 20 derived product category clusters). Then, for purposes of this article, we will permute our labeled data in a few interesting ways that reflect potential biases. This will give us a few different scenarios in which we can test the effectiveness of our different approaches.

In each of these scenarios, we will permute only our training data and leave the test set intact. We will assume that aside from our intentional label removal, we have complete label coverage. This will enable us to see how our model actually generalizes to a scenario with pristine labeling.

The following biases will be applied to our training sets:

- **‘uk_only’**: We’ll pretend we have no post-period data (labels) from outside of the UK. In the ‘Online Retail’ dataset, the majority of transactions come from within the UK. This scenario would be possible if it took longer to process and load transactions coming from outside the UK.
- **‘no_uk’**: The opposite of *uk_only*, we’ll pretend that we have no labels from consumers within the UK.
- **‘no_prod_x’**: In this scenario, we’ll pretend that if we observed a purchase of product x, we did not record any labeled data.
- **‘only_prod_x’**: The opposite of *no_prod_x*, we’ll pretend that we only have labels if we observed a product x purchase in the pre-period.
- **‘random’**: A random portion of labels (30%) are unlabeled.
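Each of these scenarios amounts to choosing a set of training records and resetting their labels to 0. A minimal sketch follows, assuming two hypothetical boolean flags per customer, `is_uk` and `bought_prod_x`; the helper `remove_labels` and the tiny random arrays are illustrative and not part of the original experiment code.

```python
import numpy as np


def remove_labels(y, hide_mask):
    """Return a copy of y in which records selected by hide_mask become 0."""
    y = np.asarray(y).copy()
    y[np.asarray(hide_mask, dtype=bool)] = 0
    return y


rng = np.random.default_rng(0)
n = 10
y_train = rng.binomial(1, 0.5, size=n)                     # pristine labels
is_uk = rng.binomial(1, 0.8, size=n).astype(bool)          # hypothetical country flag
bought_prod_x = rng.binomial(1, 0.3, size=n).astype(bool)  # hypothetical pre-period flag

biased_labels = {
    "uk_only": remove_labels(y_train, ~is_uk),            # no labels outside the UK
    "no_uk": remove_labels(y_train, is_uk),               # no labels inside the UK
    "no_prod_x": remove_labels(y_train, bought_prod_x),   # no labels for product x buyers
    "only_prod_x": remove_labels(y_train, ~bought_prod_x),  # labels only for product x buyers
    "random": remove_labels(y_train, rng.random(n) < 0.3),  # 30% hidden at random
}

for name, y_biased in biased_labels.items():
    print(name, y_biased)
```

Only the training labels are altered this way; the test labels stay pristine so that the evaluation reflects true generalization.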

Our null hypothesis is that “ignoring the bias” is the best approach. We will falsify this hypothesis if one or more of our alternative approaches produces better results.

Since the objective of this effort is to pick the ‘most likely prospective corn chip buyers’, we will use a ranking metric (area under the ROC curve) as our determinant of success. We could also look at precision at a given score continuum depth as our metric.
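Both metrics are straightforward to compute. The sketch below uses scikit-learn's `roc_auc_score` for the AUC and a small hypothetical helper, `precision_at_depth`, for precision among the top-scored fraction of records; the ten labels and scores are invented for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def precision_at_depth(y_true, scores, depth=0.1):
    """Share of true positives among the top `depth` fraction of scored records."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    n_top = max(1, int(round(depth * len(scores))))
    top_idx = np.argsort(-scores, kind="stable")[:n_top]
    return float(y_true[top_idx].mean())


# Made-up holdout labels and model scores.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1, 0, 0])
scores = np.array([0.1, 0.3, 0.8, 0.2, 0.7, 0.4, 0.5, 0.9, 0.05, 0.35])

auc = roc_auc_score(y_true, scores)
p_at_30 = precision_at_depth(y_true, scores, depth=0.3)

print("AUC:", round(auc, 4))
print("precision in top 30%:", p_at_30)
```

AUC summarizes ranking quality over the whole score range, while precision at a depth focuses on the part of the list we would actually mail.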

Also, we are not using any particularly fancy modeling techniques for this experiment, as that is not really the subject of this article. We will use random forest models for all of our models with the same set of reasonable hyperparameters. These hyperparameters have been shown to perform reasonably well on the datasets here, while running relatively quickly.

The only hyperparameter we are tuning in these experiments is our prior assumption about false negative prevalence in the unlabeled group.

Even though the application we initially are optimizing for is corn chips, the same technique will likely be applied across all of the different products. We’ll run the experiment for each product (and run 3 different cross validation folds for each). This will give us a better idea of how reliable any results we calculate are and allow us to run hypothesis testing.
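One way to run such a test is a one-sided paired t-test over the per-trial AUCs, pairing each alternative approach with the ‘ignore’ baseline on the same product and fold. Pairing is our assumption here, since the exact test is not spelled out above, and the AUC values below are synthetic placeholders that only exercise the code; they are not results from the experiment.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic placeholder AUCs for 60 paired trials (20 products x 3 folds).
# These are NOT the article's results.
auc_ignore = rng.uniform(0.60, 0.80, size=60)
auc_alternative = auc_ignore + rng.normal(0.0, 0.02, size=60)

# H0: mean difference <= 0; H1: the alternative approach has higher AUC.
t_stat, p_value = stats.ttest_rel(
    auc_alternative, auc_ignore, alternative="greater"
)

print("t-statistic:", round(float(t_stat), 3))
print("one-sided p-value:", round(float(p_value), 3))
```

A large positive t-statistic would favor the alternative approach, and a large negative one would indicate that it is doing worse than ignoring the bias.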

# Results

## AUC — Holdout

### Area under ROC curves for the holdout group

## Significance

### T-statistic for the hypothesis that AUC of _ > AUC of ‘Ignore’ (for the holdout group)

# Analysis

We ran a total of 60 trials in this experiment: 20 ‘product categories’ and three separate tests (with different train and test sets) for each.

In most of the experiments, no semi-supervised approach appears to significantly improve upon the ignore-bias approach. In the ‘uk_only’ experiment, some of the semi-supervised approaches do appear to be significantly better, but only to a p-value of ~0.07.

The ‘only_prod_x’ permutation appears to be a particularly problematic bias, and one that the semi-supervised techniques appear to exacerbate to a statistically significant degree. This is likely because the dependent variable in training can only be positive if product x was also purchased; therefore, not purchasing product x is a perfect predictor of a negative dependent variable. In the semi-supervised approaches, the first-stage models would likely pick up on this, and then the effect of this variable would be amplified in the second-stage model. To be clear, however, all models in this scenario are worse than random, and so not useful at all.

In a situation like this (and if the modeler happened to know that this correlation was spurious), a better approach would probably be to limit the training dataset to only records that had purchased product x, thus eliminating the variance in that independent variable.
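Restricting the training set in this way is a one-line filter. The sketch below assumes a hypothetical boolean flag `bought_prod_x` aligned with the training rows; the arrays are tiny made-up examples.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(8, 3))  # made-up feature matrix
bought_prod_x = np.array([1, 0, 1, 1, 0, 0, 1, 0], dtype=bool)  # hypothetical flag
y_train = np.array([1, 0, 0, 1, 0, 0, 1, 0])  # positives occur only among product x buyers

# Keep only the records where a label could have been observed at all.
X_restricted = X_train[bought_prod_x]
y_restricted = y_train[bought_prod_x]

print("rows kept:", X_restricted.shape[0], "of", X_train.shape[0])
print("labels kept:", y_restricted)
```

Within the restricted set the product x flag no longer varies, so a model trained on it cannot use that flag as a shortcut for the label.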

# Conclusion

Based on the experiments conducted here, it seems that under certain circumstances, semi-supervised approaches can help to overcome positive and incomplete labeling, even if there is structure to the missing labels. Because better model generalization can translate directly into better business outcomes, these approaches are worth considering in such cases.

However, it’s also quite clear that significant testing should be done before resorting to a semi-supervised approach. Semi-supervised models are more expensive and time-consuming to run, and as we’ve seen, can sometimes amplify label bias and further hinder model generalization. The potential for improved outcomes may well be worth the additional cost, but proper vetting is required.

Hopefully this article has brought to light some of the labeling issues that can exist in supervised learning problems and introduced a couple techniques that may help to identify, simulate, and overcome those issues.