# Supervised learning

## Goal

Given observations $(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$, $i = 1, \ldots, n$, of inputs/outputs (features/labels, covariates/responses) serving as the training data, the main goal of supervised learning is to predict a new $y \in \mathcal{Y}$ given a new, previously unseen $x \in \mathcal{X}$. The unobserved data are usually referred to as the testing data.
Several difficulties arise:

- the outputs are noisy;
- the prediction function may be quite complex;
- only a few values of $x$ are observed;
- curse of dimensionality (the input space is large);
- the relationship between the training distribution and the testing distribution may be weak.
## Cross-validation

Hold out a test set; the remaining data (folds A, B, C, D, E) serve as training and validation sets. In the first round, A is the validation fold and B, C, D, E are used for training; in the second round, B is the validation fold and A, C, D, E are used for training; and so on.
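The rotation above can be sketched as a minimal k-fold split, assuming a hypothetical toy dataset of 10 labeled points (no external libraries):

```python
import random

def k_fold_cv(data, k=5, seed=0):
    """Split `data` into k folds; yield (train, validation) pairs.

    Each fold serves as the validation set exactly once, while the
    remaining k-1 folds form the training set (the A/B/C/D/E rotation
    described above).
    """
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # round-robin assignment
    for i in range(k):
        val = [data[j] for j in folds[i]]
        train = [data[j] for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, val

# Hypothetical toy data: 10 points with a binary label.
points = [(x, x % 2) for x in range(10)]
for train, val in k_fold_cv(points, k=5):
    assert len(train) + len(val) == len(points)
```

In practice one would average a validation metric over the k rounds to select hyperparameters, then evaluate once on the held-out test set.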
## Decision theory

The aim is to understand the underlying probability distribution of the data.
## Loss functions

$\ell(y, z) = 1_{y \neq z}$ (0-1 loss): when $y$ equals $z$ (no error) the loss is $0$; otherwise (an error) the loss is $1$.
- Binary classification: $\mathcal{Y} = \{0, 1\}$ (also common is $\mathcal{Y} = \{-1, 1\}$, and, less often, $\mathcal{Y} = \{1, 2\}$ when viewed as a special case of the multiclass setting below).
- Multiclass classification: $\mathcal{Y} = \{1, \ldots, k\}$.
- Multilabel classification: loss $\ell(y, z) = \sum_{j=1}^{k} 1_{y_j \neq z_j}$.

$\ell(y, z) = \frac{1}{2}(y - z)^2$ (square loss): the Bayes-optimal prediction is the conditional mean.

$\ell(y, z) = |y - z|$ (absolute loss): the Bayes-optimal prediction is the conditional median.
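A quick numerical check of the two claims above, using a small made-up sample in place of the conditional distribution of $y$ given $x$: scanning candidate predictions $z$ on a grid, the average squared loss is minimized near the mean and the average absolute loss near the median.

```python
import statistics

# A made-up sample standing in for the conditional distribution y | x = x'.
y = [1.0, 2.0, 2.0, 3.0, 10.0]

def avg_sq_loss(z):
    """Average of the square loss (1/2)(y_i - z)^2 over the sample."""
    return sum(0.5 * (yi - z) ** 2 for yi in y) / len(y)

def avg_abs_loss(z):
    """Average of the absolute loss |y_i - z| over the sample."""
    return sum(abs(yi - z) for yi in y) / len(y)

# Scan candidate predictions z on a fine grid covering the sample range.
grid = [i / 100 for i in range(0, 1101)]
best_sq = min(grid, key=avg_sq_loss)
best_abs = min(grid, key=avg_abs_loss)

assert abs(best_sq - statistics.mean(y)) < 1e-2    # mean = 3.6
assert abs(best_abs - statistics.median(y)) < 1e-2  # median = 2.0
```

Note how the outlier $10.0$ pulls the squared-loss minimizer (the mean) upward, while the absolute-loss minimizer (the median) is unaffected.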
For regression, the output space is $\mathcal{Y} = \mathbb{R}$.

## Risk (expected loss): the performance criterion

Given a loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$, we define the expected risk of a function $f : \mathcal{X} \to \mathcal{Y}$ (also called the generalization performance or test error) as the expectation of the loss between the output $y$ and the prediction $f(x)$.
Expected risk: given a prediction function $f : \mathcal{X} \to \mathcal{Y}$, a loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$, and a distribution $dp(x, y)$, the expected risk of $f$ is defined as
$$\mathcal{R}(f) = \mathbb{E}[\ell(y, f(x))] = \int_{\mathcal{X} \times \mathcal{Y}} \ell(y, f(x)) \, dp(x, y).$$

Empirical risk / training error: given $f : \mathcal{X} \to \mathcal{Y}$, a loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$, and data $(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$, $i = 1, \ldots, n$, the empirical risk of $f$ is defined as
$$\widehat{\mathcal{R}}(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, f(x_i)).$$

## Bayes risk and Bayes predictor: the best prediction function

Intuition: a prior estimate is a first impression; the posterior distribution adjusts that impression in light of the observed data.
Using conditional expectation and the law of total expectation, we have
$$\mathcal{R}(f) = \mathbb{E}[\ell(y, f(x))] = \mathbb{E}\big[\mathbb{E}(\ell(y, f(x)) \mid x)\big],$$
which can be rewritten as
$$\mathcal{R}(f) = \mathbb{E}_{x' \sim dp(x)}\big[\mathbb{E}(\ell(y, f(x)) \mid x = x')\big] = \int_{\mathcal{X}} \mathbb{E}(\ell(y, f(x)) \mid x = x') \, dp(x').$$
Given the conditional distribution $y \mid x = x'$ for any $x' \in \mathcal{X}$, we can define, for any $z \in \mathcal{Y}$, the conditional risk (a deterministic function)
$$r(z \mid x') = \mathbb{E}(\ell(y, z) \mid x = x'),$$
from which
$$\mathcal{R}(f) = \mathbb{E}\big(r(f(x) \mid x)\big) = \mathbb{E}_{x' \sim dp(x)}\big[r(f(x') \mid x')\big] = \int_{\mathcal{X}} r(f(x') \mid x') \, dp(x').$$
To minimize $\mathcal{R}(f)$, it therefore suffices to choose, for each $x' \in \mathcal{X}$, the value $f(x')$ as a minimizer of $r(z \mid x') = \mathbb{E}(\ell(y, z) \mid x = x')$ over $z \in \mathcal{Y}$.
Bayes predictor and Bayes risk: the expected risk is minimized at a Bayes predictor $f^* : \mathcal{X} \to \mathcal{Y}$ satisfying, for all $x' \in \mathcal{X}$,
$$f^*(x') \in \arg\min_{z \in \mathcal{Y}} \mathbb{E}(\ell(y, z) \mid x = x') = \arg\min_{z \in \mathcal{Y}} r(z \mid x').$$
The Bayes risk $\mathcal{R}^*$ is the risk shared by all Bayes predictors and equals
$$\mathcal{R}^* = \mathbb{E}_{x' \sim dp_x(x')} \inf_{z \in \mathcal{Y}} \mathbb{E}(\ell(y, z) \mid x = x').$$

Excess risk: the excess risk of a function $f : \mathcal{X} \to \mathcal{Y}$ is $\mathcal{R}(f) - \mathcal{R}^*$ (it is always non-negative). For the commonly used loss functions, we can compute the Bayes predictor explicitly:
0-1 loss: for $\mathcal{Y} = \{0, 1\}$ and $\ell(y, z) = 1_{y \neq z}$, the Bayes predictor satisfies
$$f^*(x') \in \arg\min_{z \in \{0, 1\}} \mathbb{P}(y \neq z \mid x = x') = \arg\min_{z \in \{0, 1\}} \big(1 - \mathbb{P}(y = z \mid x = x')\big) = \arg\max_{z \in \{0, 1\}} \mathbb{P}(y = z \mid x = x').$$
Writing $\eta(x') = \mathbb{P}(y = 1 \mid x = x')$: if $\eta(x') > 1/2$ then $f^*(x') = 1$, while if $\eta(x') < 1/2$ then $f^*(x') = 0$; when $\eta(x') = 1/2$ either choice is optimal. The Bayes risk is $\mathcal{R}^* = \mathbb{E}[\min\{\eta(x), 1 - \eta(x)\}]$; in general it is strictly positive (unless $\eta(x) \in \{0, 1\}$ almost surely, i.e., $y$ is a deterministic function of $x$).
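To illustrate, a small Monte Carlo sketch with a made-up conditional probability $\eta$: the thresholded Bayes classifier's error rate matches $\mathbb{E}[\min\{\eta(x), 1 - \eta(x)\}]$.

```python
import random

rng = random.Random(0)

def eta(x):
    # Hypothetical conditional probability P(y = 1 | x).
    return 0.2 if x < 0.5 else 0.9

def bayes_predict(x):
    # Predict 1 exactly when eta(x) exceeds 1/2.
    return 1 if eta(x) > 0.5 else 0

n = 200_000
errors = 0
bayes_risk_sum = 0.0
for _ in range(n):
    x = rng.random()                      # x ~ Uniform(0, 1)
    y = 1 if rng.random() < eta(x) else 0  # y | x ~ Bernoulli(eta(x))
    errors += (bayes_predict(x) != y)
    bayes_risk_sum += min(eta(x), 1 - eta(x))

empirical_error = errors / n        # error rate of the Bayes classifier
theoretical = bayes_risk_sum / n    # estimate of E[min{eta(x), 1 - eta(x)}]
assert abs(empirical_error - theoretical) < 0.01
```

With this $\eta$, both quantities come out near $0.5 \cdot 0.2 + 0.5 \cdot 0.1 = 0.15$: even the best possible classifier errs, because $y$ is not a deterministic function of $x$.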
Multiclass: for $k \geq 2$ and $\mathcal{Y} = \{1, \ldots, k\}$, we have $f^*(x') \in \arg\max_{i \in \{1, \ldots, k\}} \mathbb{P}(y = i \mid x = x')$.
Square loss: for $\mathcal{Y} = \mathbb{R}$ and $\ell(y, z) = (y - z)^2$, the Bayes predictor satisfies
$$f^*(x') \in \arg\min_{z \in \mathbb{R}} \mathbb{E}[(y - z)^2 \mid x = x'] = \arg\min_{z \in \mathbb{R}} \Big\{ \mathbb{E}\big[(y - \mathbb{E}(y \mid x = x'))^2 \mid x = x'\big] + \big(z - \mathbb{E}(y \mid x = x')\big)^2 \Big\},$$
which yields the conditional expectation $f^*(x') = \mathbb{E}(y \mid x = x')$.
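Finally, a simulation sketch under an assumed model $y = \sin(2\pi x) + \varepsilon$ with Gaussian noise: the conditional mean attains a squared-loss risk close to the noise variance, and any other predictor has non-negative excess risk.

```python
import math
import random

rng = random.Random(1)

def sample():
    """Draw (x, y) with x ~ Uniform(0, 1) and y = sin(2*pi*x) + noise."""
    x = rng.random()
    y = math.sin(2 * math.pi * x) + rng.gauss(0, 0.3)  # noise sigma = 0.3
    return x, y

def f_star(x):
    return math.sin(2 * math.pi * x)  # the conditional mean E(y | x)

def f_zero(x):
    return 0.0                        # a competing predictor

def mc_risk(f, n=100_000):
    """Monte Carlo estimate of the expected squared-loss risk of f."""
    total = 0.0
    for _ in range(n):
        x, y = sample()
        total += (y - f(x)) ** 2
    return total / n

risk_star = mc_risk(f_star)   # about sigma^2 = 0.09, the noise variance
risk_zero = mc_risk(f_zero)   # about E[sin^2] + sigma^2 = 0.5 + 0.09
assert risk_star < risk_zero  # the excess risk of f_zero is positive
```

The gap `risk_zero - risk_star` estimates the excess risk $\mathcal{R}(f) - \mathcal{R}^*$ of the zero predictor, while `risk_star` itself approximates the Bayes risk, which stays positive because of the output noise.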