# Supervised learning

## Goal

Given observations $(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$, $i = 1, \ldots, n$, of inputs/outputs (features/labels, covariates/responses) serving as the training data, the main goal of supervised learning is to predict a new $y \in \mathcal{Y}$ given a new, previously unseen $x \in \mathcal{X}$. The unobserved data are usually referred to as the testing data.
Several difficulties arise:

- the outputs are noisy;
- the prediction function may be quite complex;
- only a few values of $x$ are observed;
- curse of dimensionality (the input space is large);
- the relationship between the training distribution and the testing distribution may be weak.
## Cross-validation

Hold out a test set; the remaining data (folds A, B, C, D, E) serve as training and validation sets. In the first round, A is the validation fold and B, C, D, E are used for training; in the second round, B is the validation fold and A, C, D, E are used for training; and so on.
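The rotation above can be sketched as a minimal k-fold split, assuming a hypothetical toy dataset of 10 labeled points (no external libraries):

```python
import random

def k_fold_cv(data, k=5, seed=0):
    """Split `data` into k folds; yield (train, validation) pairs.

    Each fold serves as the validation set exactly once, while the
    remaining k-1 folds form the training set (the A/B/C/D/E rotation
    described above).
    """
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # round-robin assignment
    for i in range(k):
        val = [data[j] for j in folds[i]]
        train = [data[j] for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, val

# Hypothetical toy data: 10 points with a binary label.
points = [(x, x % 2) for x in range(10)]
for train, val in k_fold_cv(points, k=5):
    assert len(train) + len(val) == len(points)
```

In practice one would average a validation metric over the k rounds to select hyperparameters, then evaluate once on the held-out test set.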
## Decision theory

The aim is to understand the underlying probability distribution of the data.
## Loss functions

$\ell(y, z) = 1_{y \neq z}$ (0-1 loss): when $y$ equals $z$ (no error) the loss is $0$; otherwise (an error) the loss is $1$.
- Binary classification: $\mathcal{Y} = \{0, 1\}$ (also common is $\mathcal{Y} = \{-1, 1\}$, and, less often, $\mathcal{Y} = \{1, 2\}$ when viewed as a special case of the multiclass setting below).
- Multiclass classification: $\mathcal{Y} = \{1, \ldots, k\}$.
- Multilabel classification: loss $\ell(y, z) = \sum_{j=1}^{k} 1_{y_j \neq z_j}$.

$\ell(y, z) = \frac{1}{2}(y - z)^2$ (square loss): the Bayes-optimal prediction is the conditional mean.

$\ell(y, z) = |y - z|$ (absolute loss): the Bayes-optimal prediction is the conditional median.
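A quick numerical check of the two claims above, using a small made-up sample in place of the conditional distribution of $y$ given $x$: scanning candidate predictions $z$ on a grid, the average squared loss is minimized near the mean and the average absolute loss near the median.

```python
import statistics

# A made-up sample standing in for the conditional distribution y | x = x'.
y = [1.0, 2.0, 2.0, 3.0, 10.0]

def avg_sq_loss(z):
    """Average of the square loss (1/2)(y_i - z)^2 over the sample."""
    return sum(0.5 * (yi - z) ** 2 for yi in y) / len(y)

def avg_abs_loss(z):
    """Average of the absolute loss |y_i - z| over the sample."""
    return sum(abs(yi - z) for yi in y) / len(y)

# Scan candidate predictions z on a fine grid covering the sample range.
grid = [i / 100 for i in range(0, 1101)]
best_sq = min(grid, key=avg_sq_loss)
best_abs = min(grid, key=avg_abs_loss)

assert abs(best_sq - statistics.mean(y)) < 1e-2    # mean = 3.6
assert abs(best_abs - statistics.median(y)) < 1e-2  # median = 2.0
```

Note how the outlier $10.0$ pulls the squared-loss minimizer (the mean) upward, while the absolute-loss minimizer (the median) is unaffected.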
For regression, the output space is $\mathcal{Y} = \mathbb{R}$.

## Risk (expected loss): the performance criterion

Given a loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$, we define the expected risk of a function $f : \mathcal{X} \to \mathcal{Y}$ (also called the generalization performance or test error) as the expectation of the loss between the output $y$ and the prediction $f(x)$.
Expected risk: given a prediction function $f : \mathcal{X} \to \mathcal{Y}$, a loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$, and a distribution $dp(x, y)$, the expected risk of $f$ is defined as
$$\mathcal{R}(f) = \mathbb{E}[\ell(y, f(x))] = \int_{\mathcal{X} \times \mathcal{Y}} \ell(y, f(x)) \, dp(x, y).$$

Empirical risk / training error: given $f : \mathcal{X} \to \mathcal{Y}$, a loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$, and data $(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$, $i = 1, \ldots, n$, the empirical risk of $f$ is defined as
$$\widehat{\mathcal{R}}(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, f(x_i)).$$

## Bayes risk and Bayes predictor: the best prediction function

Intuition: a prior estimate is a first impression; the posterior distribution adjusts that impression in light of the observed data.
Using conditional expectation and the law of total expectation, we have
$$\mathcal{R}(f) = \mathbb{E}[\ell(y, f(x))] = \mathbb{E}\big[\mathbb{E}(\ell(y, f(x)) \mid x)\big],$$
which can be rewritten as
$$\mathcal{R}(f) = \mathbb{E}_{x' \sim dp(x)}\big[\mathbb{E}(\ell(y, f(x)) \mid x = x')\big] = \int_{\mathcal{X}} \mathbb{E}(\ell(y, f(x)) \mid x = x') \, dp(x').$$
Given the conditional distribution $y \mid x = x'$ for any $x' \in \mathcal{X}$, we can define, for any $z \in \mathcal{Y}$, the conditional risk (a deterministic function)
$$r(z \mid x') = \mathbb{E}(\ell(y, z) \mid x = x'),$$
from which
$$\mathcal{R}(f) = \mathbb{E}\big(r(f(x) \mid x)\big) = \mathbb{E}_{x' \sim dp(x)}\big[r(f(x') \mid x')\big] = \int_{\mathcal{X}} r(f(x') \mid x') \, dp(x').$$
To minimize $\mathcal{R}(f)$, it therefore suffices to choose, for each $x' \in \mathcal{X}$, the value $f(x')$ as a minimizer of $r(z \mid x') = \mathbb{E}(\ell(y, z) \mid x = x')$ over $z \in \mathcal{Y}$.
Bayes predictor and Bayes risk: the expected risk is minimized at a Bayes predictor $f^* : \mathcal{X} \to \mathcal{Y}$ satisfying, for all $x' \in \mathcal{X}$,
$$f^*(x') \in \arg\min_{z \in \mathcal{Y}} \mathbb{E}(\ell(y, z) \mid x = x') = \arg\min_{z \in \mathcal{Y}} r(z \mid x').$$
The Bayes risk $\mathcal{R}^*$ is the risk shared by all Bayes predictors and equals
$$\mathcal{R}^* = \mathbb{E}_{x' \sim dp_x(x')} \inf_{z \in \mathcal{Y}} \mathbb{E}(\ell(y, z) \mid x = x').$$

Excess risk: the excess risk of a function $f : \mathcal{X} \to \mathcal{Y}$ is $\mathcal{R}(f) - \mathcal{R}^*$ (it is always non-negative). For the commonly used loss functions, we can compute the Bayes predictor explicitly:
0-1 loss: for $\mathcal{Y} = \{0, 1\}$ and $\ell(y, z) = 1_{y \neq z}$, the Bayes predictor satisfies
$$f^*(x') \in \arg\min_{z \in \{0, 1\}} \mathbb{P}(y \neq z \mid x = x') = \arg\min_{z \in \{0, 1\}} \big(1 - \mathbb{P}(y = z \mid x = x')\big) = \arg\max_{z \in \{0, 1\}} \mathbb{P}(y = z \mid x = x').$$
Writing $\eta(x') = \mathbb{P}(y = 1 \mid x = x')$: if $\eta(x') > 1/2$ then $f^*(x') = 1$, while if $\eta(x') < 1/2$ then $f^*(x') = 0$; when $\eta(x') = 1/2$ either choice is optimal. The Bayes risk is $\mathcal{R}^* = \mathbb{E}[\min\{\eta(x), 1 - \eta(x)\}]$; in general it is strictly positive (unless $\eta(x) \in \{0, 1\}$ almost surely, i.e., $y$ is a deterministic function of $x$).
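To illustrate, a small Monte Carlo sketch with a made-up conditional probability $\eta$: the thresholded Bayes classifier's error rate matches $\mathbb{E}[\min\{\eta(x), 1 - \eta(x)\}]$.

```python
import random

rng = random.Random(0)

def eta(x):
    # Hypothetical conditional probability P(y = 1 | x).
    return 0.2 if x < 0.5 else 0.9

def bayes_predict(x):
    # Predict 1 exactly when eta(x) exceeds 1/2.
    return 1 if eta(x) > 0.5 else 0

n = 200_000
errors = 0
bayes_risk_sum = 0.0
for _ in range(n):
    x = rng.random()                      # x ~ Uniform(0, 1)
    y = 1 if rng.random() < eta(x) else 0  # y | x ~ Bernoulli(eta(x))
    errors += (bayes_predict(x) != y)
    bayes_risk_sum += min(eta(x), 1 - eta(x))

empirical_error = errors / n        # error rate of the Bayes classifier
theoretical = bayes_risk_sum / n    # estimate of E[min{eta(x), 1 - eta(x)}]
assert abs(empirical_error - theoretical) < 0.01
```

With this $\eta$, both quantities come out near $0.5 \cdot 0.2 + 0.5 \cdot 0.1 = 0.15$: even the best possible classifier errs, because $y$ is not a deterministic function of $x$.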
Multiclass: for $k \geq 2$ and $\mathcal{Y} = \{1, \ldots, k\}$, we have $f^*(x') \in \arg\max_{i \in \{1, \ldots, k\}} \mathbb{P}(y = i \mid x = x')$.
Square loss: for $\mathcal{Y} = \mathbb{R}$ and $\ell(y, z) = (y - z)^2$, the Bayes predictor satisfies
$$f^*(x') \in \arg\min_{z \in \mathbb{R}} \mathbb{E}[(y - z)^2 \mid x = x'] = \arg\min_{z \in \mathbb{R}} \Big\{ \mathbb{E}\big[(y - \mathbb{E}(y \mid x = x'))^2 \mid x = x'\big] + \big(z - \mathbb{E}(y \mid x = x')\big)^2 \Big\},$$
which yields the conditional expectation $f^*(x') = \mathbb{E}(y \mid x = x')$.
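Finally, a simulation sketch under an assumed model $y = \sin(2\pi x) + \varepsilon$ with Gaussian noise: the conditional mean attains a squared-loss risk close to the noise variance, and any other predictor has non-negative excess risk.

```python
import math
import random

rng = random.Random(1)

def sample():
    """Draw (x, y) with x ~ Uniform(0, 1) and y = sin(2*pi*x) + noise."""
    x = rng.random()
    y = math.sin(2 * math.pi * x) + rng.gauss(0, 0.3)  # noise sigma = 0.3
    return x, y

def f_star(x):
    return math.sin(2 * math.pi * x)  # the conditional mean E(y | x)

def f_zero(x):
    return 0.0                        # a competing predictor

def mc_risk(f, n=100_000):
    """Monte Carlo estimate of the expected squared-loss risk of f."""
    total = 0.0
    for _ in range(n):
        x, y = sample()
        total += (y - f(x)) ** 2
    return total / n

risk_star = mc_risk(f_star)   # about sigma^2 = 0.09, the noise variance
risk_zero = mc_risk(f_zero)   # about E[sin^2] + sigma^2 = 0.5 + 0.09
assert risk_star < risk_zero  # the excess risk of f_zero is positive
```

The gap `risk_zero - risk_star` estimates the excess risk $\mathcal{R}(f) - \mathcal{R}^*$ of the zero predictor, while `risk_star` itself approximates the Bayes risk, which stays positive because of the output noise.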