from sklearn.model_selection import train_test_split
import numpy as np

g = np.random.default_rng(12)      # reproducible random generator
x = g.normal(0, 1, (20,))          # 20 standard normal samples
x_tr, x_te = train_test_split(x, test_size=0.2, random_state=12)
Setting up the scene
In supervised learning, we specify
- an example space \(\mathcal X\)
- a label space \(\mathcal Y\)
- a collection of hypotheses \(h: \mathcal X \to \mathcal Y\), making up the hypothesis class (inductive bias), from which we want to pick a predictor
- a loss function \(\ell: \mathcal Y \times \mathcal Y \to \mathbb{R}_+\), quantifying how good or bad a prediction \(\hat y\) is compared to the ground truth label \(y\)
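For instance, a standard instantiation (binary classification with halfspaces) is \(\mathcal X = \mathbb{R}^d\), \(\mathcal Y = \{-1, 1\}\), \(\mathcal H = \{x \mapsto \mathrm{sign}(\langle w, x\rangle) : w \in \mathbb{R}^d\}\), and the 0-1 loss \(\ell(\hat y, y) = \mathbf{1}[\hat y \neq y]\).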
We are given independent and identically distributed (iid) input-output pairs \(S = \{(x_i,y_i): i\in[m]\}\subset \mathcal X\times\mathcal Y\) drawn from a distribution \(D\).
Our goal is to pick the “best” predictor in the sense of minimising the true loss
\[
L_D(h) = \mathbb{E}_{(x,y)\sim D}[\ell(h(x),y)]
\] where \(D\) is the true distribution of \((x,y)\).
Obviously, a priori the distribution \(D\) of the samples is unknown. However, by the law of large numbers, \[ L_S(h):= \frac{1}{m}\sum_{i=1}^m \ell(h(x_i),y_i) \] is a consistent estimator of \(L_D(h)\) (as \(m\to\infty\), under mild conditions on the distribution of \(\ell(h(x),y)\)). This motivates the empirical risk minimisation (ERM) approach: we look for \[ h_S \in \mathrm{argmin}_{h\in\mathcal H} L_S(h) \] Hence, instead of minimising the true risk, we minimise the empirical risk, which is close to the true risk when \(m\) is large, for a fixed \(h\). Whether this approximation holds uniformly over all hypotheses in \(\mathcal H\), in other words whether \(h_S\approx h^*\in \mathrm{argmin}_{h\in\mathcal H} L_D(h)\), is at the centre of statistical learning theory.
We often decompose the generalisation error \(L_D(h_S)\) into two parts: \[ L_D(h_S) = L_D(h^*) + [L_D(h_S) - L_D(h^*)] \] the first term is called the approximation error, and the second the estimation error.
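To make ERM concrete, here is a minimal sketch of my own (not taken from any library): a finite hypothesis class of threshold classifiers on the real line with the 0-1 loss, from which we pick the hypothesis with the smallest empirical risk.

import numpy as np

rng = np.random.default_rng(0)
# toy 1-d binary classification data: label is 1 when x > 0.3, plus 10% label noise
x = rng.uniform(-1, 1, 200)
y = ((x > 0.3) ^ (rng.uniform(size=200) < 0.1)).astype(int)

# finite hypothesis class: threshold classifiers h_t(x) = 1{x > t}
thresholds = np.linspace(-1, 1, 41)

def empirical_risk(t, x, y):
    # 0-1 loss averaged over the sample, i.e. L_S(h_t)
    return np.mean((x > t).astype(int) != y)

risks = [empirical_risk(t, x, y) for t in thresholds]
t_erm = thresholds[int(np.argmin(risks))]   # the ERM hypothesis h_S
print(t_erm, min(risks))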
Let’s be concrete
From a practical point of view, we may not want to get into learning-theory bounds despite their elegance. Instead, we split the sample \(S^0\) into two parts \(S\) and \(V\): \(S\) is used to find an ERM, which we denote by \(h_S\), and \(V\) is used to test whether the found ERM achieves a small \(L_D(h_S)\). The rationale is simple: under the iid assumption, \(V\) is independent of \(S\), hence \(L_{V}(h_S) \approx L_D(h_S)\) when \(|V|\) is not too small. Therefore, a small \(L_{V}(h_S)\) indicates good quality (small true risk) of our predictor \(h_S\).
The split for arrays is implemented as train_test_split in sklearn.model_selection, used in the snippet at the top.
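As a toy illustration of this rationale (my own sketch, with made-up data), fit a one-parameter linear predictor by ERM on the training part and estimate its true risk on the held-out part:

import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(12)
x = rng.normal(0, 1, 500)
y = 2 * x + rng.normal(0, 0.1, 500)           # toy regression data

x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.2, random_state=12)

# ERM over the one-parameter class h_w(x) = w * x under the squared loss
w = (x_tr * y_tr).sum() / (x_tr ** 2).sum()   # closed-form least squares

train_err = np.mean((w * x_tr - y_tr) ** 2)   # L_S(h_S)
valid_err = np.mean((w * x_te - y_te) ** 2)   # estimates L_D(h_S), since V is independent of S
print(train_err, valid_err)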
Typically, examples are vectors in \(\mathbb{R}^d, d\ge 1\). In \(k\)-way (\(k\ge 2\)) classification problems, labels are one-hot encodings \(e_1,..., e_k\), where \(e_i\) is the unit vector in \(\mathbb{R}^k\) with a one in the \(i\)-th coordinate and zeros elsewhere. If \(k=2\), we can drop the second coordinate and simply denote the two classes by \(\{1, -1\}\) or \(\{0,1\}\). In regression problems, the labels live in the continuum \(\mathbb{R}^k, k\ge 1\).
Now consider \(\mathcal H\). For classification problems, instead of predicting discrete class labels directly, it is sometimes beneficial to predict a probability mass function over the \(k\) classes. From the pmf, a label can be obtained by taking argmax. In other words, the range of \(h\in\mathcal H\) is assumed to be \(\{y\in\mathbb{R}^k: y_i\ge 0, y_1+...+y_k=1\}\). For regression problems there are no such constraints and the range can be the whole \(\mathbb{R}^k\).
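For instance (a small sketch, with made-up numbers), converting between class indices, one-hot vectors and a predicted pmf looks like this:

import numpy as np

k = 3
labels = np.array([0, 2, 1])                 # class indices for three examples

one_hot = np.eye(k)[labels]                  # each row is the one-hot vector of that class (0-indexed)
print(one_hot)

# a predicted pmf for one example: non-negative entries summing to one
p_hat = np.array([0.2, 0.5, 0.3])
predicted_label = p_hat.argmax()             # pmf -> hard label
print(predicted_label)                       # 1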
The choice of the loss function may vary, depending on what goal we are trying to achieve. Researchers can design new losses suitable for their use case. Here we mention a few popular ones. For classification problems, if we use hypotheses predicting a pmf, the cross-entropy loss is often a good choice \[ XE(p,\hat p) = - \sum_{i=1}^k p_i \log(\hat p_i) \] For regression problems, the squared loss is often a good choice \[SE(y,\hat y)= \|y-\hat y\|^2_2\] where \(\|\cdot\|_2\) is the Euclidean norm.
One of the advantages of these loss functions is that they are convex in the \(\hat p\) or \(\hat y\) variable, making it possible to leverage the machinery of convex optimisation when it comes to the actual training process.
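As a sanity check, here is a direct numpy transcription of the two formulas (a sketch of my own; sklearn's implementations are listed below):

import numpy as np

def cross_entropy(p, p_hat):
    # XE(p, p_hat) = - sum_i p_i log(p_hat_i)
    return -(p * np.log(p_hat)).sum()

def squared_error(y, y_hat):
    # SE(y, y_hat) = ||y - y_hat||_2^2
    return ((y - y_hat) ** 2).sum()

p = np.array([0.0, 1.0, 0.0])        # one-hot ground truth
p_hat = np.array([0.1, 0.7, 0.2])    # predicted pmf
print(cross_entropy(p, p_hat))       # -log(0.7), about 0.357

y = np.array([1.0, 2.0])
y_hat = np.array([1.5, 1.0])
print(squared_error(y, y_hat))       # 0.25 + 1.0 = 1.25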
Many loss functions are already implemented in sklearn.metrics: the XE is named log_loss and the squared loss is named mean_squared_error. Let's list all of them.
import sklearn.metrics as metrics

for m in dir(metrics):
    if not m.startswith('_'): print(m)
ConfusionMatrixDisplay
DetCurveDisplay
DistanceMetric
PrecisionRecallDisplay
PredictionErrorDisplay
RocCurveDisplay
accuracy_score
adjusted_mutual_info_score
adjusted_rand_score
auc
average_precision_score
balanced_accuracy_score
brier_score_loss
calinski_harabasz_score
check_scoring
class_likelihood_ratios
classification_report
cluster
cohen_kappa_score
completeness_score
confusion_matrix
consensus_score
coverage_error
d2_absolute_error_score
d2_pinball_score
d2_tweedie_score
davies_bouldin_score
dcg_score
det_curve
euclidean_distances
explained_variance_score
f1_score
fbeta_score
fowlkes_mallows_score
get_scorer
get_scorer_names
hamming_loss
hinge_loss
homogeneity_completeness_v_measure
homogeneity_score
jaccard_score
label_ranking_average_precision_score
label_ranking_loss
log_loss
make_scorer
matthews_corrcoef
max_error
mean_absolute_error
mean_absolute_percentage_error
mean_gamma_deviance
mean_pinball_loss
mean_poisson_deviance
mean_squared_error
mean_squared_log_error
mean_tweedie_deviance
median_absolute_error
multilabel_confusion_matrix
mutual_info_score
nan_euclidean_distances
ndcg_score
normalized_mutual_info_score
pair_confusion_matrix
pairwise
pairwise_distances
pairwise_distances_argmin
pairwise_distances_argmin_min
pairwise_distances_chunked
pairwise_kernels
precision_recall_curve
precision_recall_fscore_support
precision_score
r2_score
rand_score
recall_score
roc_auc_score
roc_curve
silhouette_samples
silhouette_score
top_k_accuracy_score
v_measure_score
zero_one_loss
It comes in handy that sklearn has them already defined. Implementing each one of them is often not hard either, e.g.
def log_loss(yt, yp):
    # yp holds the predicted probability of class 1; where the true label is 0, use 1 - yp
    y = yp.copy()
    y[yt == 0] = 1 - yp[yt == 0]
    return -np.log(y).mean()
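On a small binary example (numbers made up for illustration), this hand-rolled version agrees with sklearn's log_loss, up to sklearn's internal clipping of extreme probabilities:

import numpy as np
from sklearn.metrics import log_loss as sk_log_loss

yt = np.array([1, 0, 1, 1])
yp = np.array([0.9, 0.2, 0.7, 0.6])   # predicted P(y = 1)

print(log_loss(yt, yp))               # our hand-rolled version
print(sk_log_loss(yt, yp))            # sklearn.metrics.log_loss, same value here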
Analysis of errors
In practice, \(D\) is unknown and we only observe the training error \(L_S(h_S)\) and the validation error \(L_V(h_S)\). Hence we decompose the generalisation error differently: \[ L_D(h_S) = [L_D(h_S) - L_V(h_S)] + [L_V(h_S) - L_S(h_S)] + L_S(h_S) \] The first term is small when \(|V|\) is moderately large, by independence of \(V\) and \(S\). The second and third terms are observable, and several cases may arise.
- the gap is small and the training error is small. This is a happy scenario.
- the training error is large. To address this, we may consider
  - enlarging the hypothesis class,
  - changing it completely,
  - finding a better feature representation,
  - finding a better optimiser.
- the training error is small but the gap is large. To address this, we may consider
  - adding regularisation,
  - getting more training data,
  - reducing the hypothesis class.
It is beneficial to plot the learning curve during training. This amounts to visualising the training error and validation error on the same plot as training progresses (every X batches, every X epochs, etc.), as in the sketch below.
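A minimal sketch of such a plot, assuming an iterative learner (here sklearn's SGDRegressor on synthetic data, purely for illustration):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(0, 0.5, 500)

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)

model = SGDRegressor(learning_rate="constant", eta0=0.01, random_state=0)
train_errs, valid_errs = [], []
for epoch in range(50):
    model.partial_fit(X_tr, y_tr)                                    # one pass over the training data
    train_errs.append(np.mean((model.predict(X_tr) - y_tr) ** 2))    # L_S(h_S)
    valid_errs.append(np.mean((model.predict(X_va) - y_va) ** 2))    # L_V(h_S)

plt.plot(train_errs, label="training error")
plt.plot(valid_errs, label="validation error")
plt.xlabel("epoch")
plt.ylabel("squared loss")
plt.legend()
plt.show()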