Smoothing Splines & Smoother Matrices
Smoothing Splines
✏️ Smoothing Spline
A smoothing spline avoids the knot selection problem completely by using a maximal set of knots.
$$ RSS(f,\lambda)=\sum^N_{i=1}\{y_i-f(x_i)\}^2+\lambda\int \{f''(t)\}^2dt $$
Our goal is to find the function that minimizes this RSS. The penalty term constrains curvature, which can be seen as follows:
$$ r = (x,y), \quad ||r'||=\sqrt{x'(s)^2+y'(s)^2}=1 \\ T(s)=(x'(s),y'(s)) = \text{unit tangent vector} \\ \kappa(s)=||T'(s)||=||r''(s)||=\sqrt{x''(s)^2+y''(s)^2} $$
$$ x=t, \quad y=f(t), \quad \kappa(s)=\sqrt{f''(s)^2}=|f''(s)| $$
To measure curvature over the whole domain we would integrate $\int |f''(s)|\,ds$; for computational convenience we use $\int f''(s)^2\,ds$ instead. The curvature at a single point is the norm of the derivative of the unit tangent vector, which is what $\kappa(s)$ denotes. What we want is the total curvature over the support of the data. This penalty becomes small when the function is smooth (close to linear).
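As a quick sanity check on the penalty, consider the hypothetical example $f(t)=a+bt+ct^2$ on $[0,1]$:

$$ f''(t)=2c, \qquad \int_0^1 f''(t)^2\,dt = 4c^2, $$

so a linear fit ($c=0$) incurs no penalty, while any curvature is penalized quadratically in $c$.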
The first term controls the error, and the second term controls the curvature.
Lambda is a fixed smoothing parameter. Reducing the curvature generally increases the error term, so $\lambda$ balances the two. When $\lambda$ is large, even a small reduction in curvature lowers the criterion substantially, so the fit sacrifices error for smoothness. The larger $\lambda$ becomes, the smoother the fitted function becomes.
- $\lambda = 0$: $f$ can be any function that interpolates the data.
- $\lambda = \infty$: the simple least squares line fit, since no second derivative can be tolerated.
For the criterion to make sense, the second derivative must exist, so $f$ is restricted to a Sobolev space of functions with square-integrable second derivatives.
The function minimizing the RSS above is a natural cubic spline with knots at the unique values of $x_i$. The proof follows.
Proof)
Ex. 5.7 Derivation of smoothing splines (Green and Silverman, 1994). Suppose that $N \geq 2$, and that $g$ is the natural cubic spline interpolant to the pairs $\{x_i,z_i\}^N_1$, with $a<x_1<\cdots<x_N<b$. This is a natural spline with a knot at every $x_i$; being an N-dimensional space of functions, we can determine the coefficients such that it interpolates the sequence $z_i$ exactly. Let $\tilde{g}$ be any other differentiable function on $[a,b]$ that interpolates the N pairs.
(a) Let $h(x)=\tilde{g}(x)-g(x)$. Use integration by parts and the fact that g is a natural cubic spline to show that
$$ \int^b_a g''(x)h''(x)dx=-\sum^{N-1}_{j=1}g'''(x_j^+)\{h(x_{j+1})-h(x_j)\}=0 $$
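One way to see part (a), following the hint: integrate by parts,

$$ \int_a^b g''(x)h''(x)\,dx = \big[g''(x)h'(x)\big]_a^b - \int_a^b g'''(x)h'(x)\,dx. $$

The boundary term vanishes because $g''(a)=g''(b)=0$ for a natural spline. Between knots $g$ is cubic, so $g'''$ is piecewise constant (and zero outside $[x_1,x_N]$), which turns the remaining integral into the sum above. Finally, $h(x_j)=\tilde{g}(x_j)-g(x_j)=0$ at every knot since both functions interpolate the same points, so the sum is zero.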
(b) Hence show that
$$ \int^b_a\tilde{g}''(t)^2dt \geq \int^b_a g''(t)^2dt $$
and that equality can only hold if $h$ is identically zero in $[a,b]$.
(c) Consider the penalized least squares problem
$$ \min_f\Big[\sum^N_{i=1}(y_i-f(x_i))^2+\lambda\int^b_af''(t)^2dt\Big]. $$
Use (b) to argue that the minimizer must be a cubic spline with knots at each of the $x_i$.
(John L. Weatherwax & David Epstein, *A Solution Manual and Notes for: The Elements of Statistical Learning*, 1 March 2021)
Because the minimizer is a natural cubic spline, we can reduce the problem from estimating an arbitrary function to estimating the coefficient vector $\theta$.
$$
RSS(\theta,\lambda)=(y-N\theta)^T(y-N\theta)+\lambda\theta^T\Omega_N\theta, \\
\text{where } \{N\}_{ij}=N_j(x_i) \text{ and } \{\Omega_N\}_{jk}=\int N''_j(t)N''_k(t)\,dt.
$$
$$
\hat{\theta}=(N^TN+\lambda\Omega_N)^{-1}N^Ty \\
\hat{f}(x)=\sum^N_{j=1}N_j(x)\hat{\theta}_j
$$
The form of $\hat{\theta}$ is similar to the ridge regression estimator. Ridge regression uses an identity matrix in place of the penalty matrix $\Omega_N$, so the smoothing spline is a generalized ridge regression.
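As a concrete illustration, here is a minimal NumPy sketch of this estimator. It assumes the truncated-power natural cubic spline basis of ESL (eqs. 5.4–5.5) with a knot at every unique $x_i$, and it approximates $\Omega_N$ numerically on a fine grid; the helper names (`ns_basis`, `fit_smoothing_spline`) are illustrative, not a standard API.

```python
import numpy as np

def ns_basis(x, knots):
    """Natural cubic spline basis (truncated power form, ESL eqs. 5.4-5.5)."""
    x = np.asarray(x, dtype=float)
    xi = np.asarray(knots, dtype=float)
    K = len(xi)
    def d(k):  # d_k(x) in ESL notation (0-indexed here)
        return (np.maximum(x - xi[k], 0)**3 - np.maximum(x - xi[K-1], 0)**3) / (xi[K-1] - xi[k])
    cols = [np.ones_like(x), x]
    for k in range(K - 2):
        cols.append(d(k) - d(K - 2))
    return np.column_stack(cols)

def ns_basis_dd(x, knots):
    """Second derivatives of the same basis (used for the penalty matrix)."""
    x = np.asarray(x, dtype=float)
    xi = np.asarray(knots, dtype=float)
    K = len(xi)
    def dd(k):
        return (6*np.maximum(x - xi[k], 0) - 6*np.maximum(x - xi[K-1], 0)) / (xi[K-1] - xi[k])
    cols = [np.zeros_like(x), np.zeros_like(x)]
    for k in range(K - 2):
        cols.append(dd(k) - dd(K - 2))
    return np.column_stack(cols)

def fit_smoothing_spline(x, y, lam):
    """theta_hat = (N'N + lam*Omega)^{-1} N'y with a knot at every unique x_i."""
    knots = np.unique(x)
    N = ns_basis(x, knots)
    # Omega_{jk} = int N_j'' N_k'' dt, approximated by quadrature on a fine grid
    grid = np.linspace(knots[0], knots[-1], 2000)
    Ndd = ns_basis_dd(grid, knots)
    w = np.gradient(grid)                     # simple quadrature weights (~grid spacing)
    Omega = (Ndd * w[:, None]).T @ Ndd
    theta = np.linalg.solve(N.T @ N + lam * Omega, N.T @ y)
    return knots, theta

# Usage: fit noisy data at two smoothing levels
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 50))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)
for lam in (1e-4, 1e-1):
    knots, theta = fit_smoothing_spline(x, y, lam)
    yhat = ns_basis(x, knots) @ theta
    print(lam, np.round(np.mean((y - yhat)**2), 4))   # larger lam -> smoother, larger training error
```

In practice one would use a library routine (e.g., `smooth.spline` in R, or SciPy's smoothing-spline utilities where available); the sketch only needs the closed form above.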
Smoother Matrices
In a smoothing spline, the vector of fitted values is:
$$ \hat{f}=N(N^TN+\lambda \Omega_N)^{-1}N^Ty=S_\lambda y $$
For a regression spline with a fixed set of knots $\xi$, fit by ordinary least squares on the basis matrix $B_\xi$, the fitted values are:
$$ \hat{f}=B_\xi(B^T_\xi B_\xi)^{-1}B^T_\xi y=H_\xi y $$
We can compare the matrices $S_\lambda$ and $H_\xi$ (a numerical check follows the table):
| $H$ (projection operator, $N \times N$) | $S$ (smoother matrix, $N \times N$) |
|---|---|
| Symmetric, $H \geq 0$ | Symmetric, $S \geq 0$ |
| Idempotent ($HH=H$) | $SS \preceq S$ (shrinking) |
| Rank $M$ | Rank $N$ |
| $M=\mathrm{trace}(H)$ | $df_\lambda = \mathrm{trace}(S)$ |
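To make the contrast concrete, here is a small NumPy check of the properties in the table, using an arbitrary toy basis matrix and a toy positive semi-definite penalty (both made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n, M = 30, 6

# Projection smoother H built from an arbitrary n x M basis matrix B
B = rng.normal(size=(n, M))
H = B @ np.linalg.solve(B.T @ B, B.T)
print(np.allclose(H @ H, H), round(np.trace(H), 6))       # idempotent, trace = M

# Shrinking smoother S = (I + lam*K)^{-1} for a toy symmetric PSD penalty K
A = rng.normal(size=(n, n))
K = A @ A.T
S = np.linalg.inv(np.eye(n) + 0.5 * K)
eig = np.linalg.eigvalsh((S + S.T) / 2)
print(eig.min() > 0, eig.max() <= 1 + 1e-12)               # eigenvalues in (0, 1]
print(np.all(np.linalg.eigvalsh(S - S @ S) > -1e-12))      # S S <= S (shrinkage)
```

The last check works because $S$ and $SS$ share eigenvectors, so $S-SS$ has eigenvalues $\rho_k(1-\rho_k) \geq 0$.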
$$ \begin{split} S_\lambda & = N(N^TN+\lambda\Omega_N)^{-1}N^T \\ & = (N^{-T}(N^TN+\lambda\Omega_N)N^{-1})^{-1} \\ & = (N^{-T}N^TNN^{-1}+\lambda N^{-T}\Omega_N N^{-1})^{-1} \\ & = (I+\lambda N^{-T} \Omega_N N^{-1})^{-1} \\ & = (I+\lambda K)^{-1} \end{split} $$
$$ \min_\theta\,(y-N\theta)^T(y-N\theta)+\lambda\theta^T\Omega_N\theta = \min_f\,(y-f)^T(y-f)+\lambda f^TKf $$
The matrix $K$ is called the penalty matrix. $df_\lambda = \mathrm{trace}(S_\lambda)$ is the effective degrees of freedom, i.e., the effective number of parameters used by the smoother.
$$
S_\lambda=\sum^N_{k=1}\rho_k(\lambda)u_ku_k^T \\
\hat{y}=S_\lambda y= \sum^N_{k=1}u_k\,\rho_k(\lambda)\langle u_k,y\rangle \\
\rho_k(\lambda)=\dfrac{1}{1+\lambda d_k}
$$
- $\hat{y}$ is decomposed into a sum over the $N$ eigenvectors of $S_\lambda$.
- $\rho_k(\lambda)$ is an eigenvalue of $S_\lambda$ and $d_k$ is the corresponding eigenvalue of $K$; the two eigenvalues are in an inverse relationship.
Smoothing Spline Example
The red curve corresponds to $df_\lambda = 5$ and the green curve to $df_\lambda = 11$. The eigenvalue attached to each eigenvector indicates how strongly that eigenvector contributes to the fit, and eigenvectors with small eigenvalues are effectively ignored. The bottom-right panel shows the values of selected eigenvectors at each data point.
- Because $\lambda$ does not change the eigenvectors themselves, the whole family of smoothing splines indexed by $\lambda$ shares the same set of eigenvectors.
- In $\hat{y}=S_\lambda y=\sum^N_{k=1}u_k\rho_k(\lambda)\langle u_k,y \rangle$, the fit is a linear combination of the eigenvectors, each one shrunken by $\rho_k(\lambda)$.
- In $\hat{y}=Hy$, the eigenvalues are 0 or 1, so the projection can only keep or discard each eigenvector entirely.
- The eigenvectors are indexed by the size of their eigenvalues: $u_1$ has the largest eigenvalue and $u_N$ the smallest. Since $S_\lambda u_k=\rho_k(\lambda)u_k$, the smoothing matrix shrinks the eigenvectors more and more as the index (the complexity of the component) grows.
- $u_1$ and $u_2$ have eigenvalue 1 and are never shrunk; they correspond to the constant and linear components.
- $\rho_k(\lambda)=1/(1+\lambda d_k)$
- Reparameterization: $\min_\theta||y-U\theta||^2+\lambda\theta^TD\theta$, where $U$ has the $u_k$ as columns and $D$ is the diagonal matrix with the $d_k$ on its diagonal.
- $df_\lambda=\mathrm{trace}(S_\lambda)=\sum^N_{k=1}\rho_k(\lambda)$ (verified numerically in the sketch below).
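The eigen-relationships above can be verified numerically. The sketch below reuses `x` and the illustrative `ns_basis`/`ns_basis_dd` helpers from the earlier code, along with the same numerical approximation of $\Omega_N$; it assumes $N$ is invertible (a knot at every distinct $x_i$).

```python
import numpy as np

# Reuses x, ns_basis, ns_basis_dd from the earlier sketch (illustrative helpers).
lam = 1e-2
knots = np.unique(x)
N = ns_basis(x, knots)
grid = np.linspace(knots[0], knots[-1], 2000)
Ndd = ns_basis_dd(grid, knots)
Omega = (Ndd * np.gradient(grid)[:, None]).T @ Ndd

S = N @ np.linalg.solve(N.T @ N + lam * Omega, N.T)   # S_lambda = N (N'N + lam*Omega)^{-1} N'
Ninv = np.linalg.inv(N)
K = Ninv.T @ Omega @ Ninv                             # penalty matrix K = N^{-T} Omega N^{-1}

d = np.linalg.eigvalsh((K + K.T) / 2)                 # eigenvalues d_k of K (ascending)
rho = 1.0 / (1.0 + lam * d)                           # rho_k = 1/(1 + lam*d_k) (descending)
s_eigs = np.linalg.eigvalsh((S + S.T) / 2)[::-1]      # eigenvalues of S_lambda (descending)

print(np.max(np.abs(s_eigs - rho)))                   # ~0: same spectrum (up to numerics)
print(rho[:2])                                        # first two ~1: constant & linear unpenalized
print(np.trace(S), rho.sum())                         # df_lambda = trace(S) = sum_k rho_k
```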
Nonparametric Logistic Regression
Applying the penalty approach here means using the basis matrix $N$ (whose columns are the spline basis functions evaluated at the inputs) in place of the $X$ matrix.
The penalized log-likelihood criterion is:
$$ \begin{split} l(f; \lambda) & = \sum^N_{i=1} [y_i \log p(x_i)+(1-y_i)\log(1-p(x_i))]-\dfrac{1}{2} \lambda \int \{f''(t)\}^2 dt \\ & = \sum^N_{i=1} [y_i f(x_i)-\log(1+e^{f(x_i)})]-\dfrac{1}{2} \lambda \int \{f''(t)\}^2 dt \end{split}$$
Writing $f = N\theta$, the first and second derivatives with respect to $\theta$ are:
$$ \dfrac{\partial l(\theta)}{\partial \theta}=N^T(y-p)-\lambda \Omega \theta \\ \dfrac{\partial ^2 l(\theta)}{\partial \theta \partial \theta ^T} = -N^T WN - \lambda \Omega $$
where $W$ is the diagonal matrix of weights $p(x_i)(1-p(x_i))$.
The Newton update is then:
$$ \begin{split} \theta^{new}& =(N^TWN+\lambda \Omega)^{-1}N^TW(N \theta^{old}+W^{-1}(y-p)) \\ & = (N^TWN+\lambda \Omega)^{-1}N^TWz \end{split} $$
where $z = N\theta^{old}+W^{-1}(y-p)$ is the working response.
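A minimal sketch of this penalized Newton (IRLS) loop in NumPy, assuming the basis matrix $N$ and penalty $\Omega$ have already been built (e.g., with the illustrative helpers above); `penalized_irls` is a made-up name, not a library function:

```python
import numpy as np

def penalized_irls(N, y, Omega, lam, n_iter=20):
    """Penalized Newton steps: theta_new = (N'WN + lam*Omega)^{-1} N'W z."""
    theta = np.zeros(N.shape[1])
    for _ in range(n_iter):
        f = N @ theta
        p = np.clip(1.0 / (1.0 + np.exp(-f)), 1e-6, 1 - 1e-6)  # fitted probabilities
        w = p * (1.0 - p)                                       # diagonal of W
        z = f + (y - p) / w                                     # working response
        NW = N * w[:, None]                                     # rows of N scaled by W, i.e. W N
        theta = np.linalg.solve(N.T @ NW + lam * Omega, NW.T @ z)
    return theta

# Usage (with the illustrative basis from earlier): binarize y and fit
# y01 = (y > 0).astype(float)
# theta = penalized_irls(N, y01, Omega, lam=1e-2)
```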
Reference
Hastie, T., Tibshirani, R., & Friedman, J. (2001). *The Elements of Statistical Learning*. New York, NY: Springer.