Kernel Smoothing
Kernel method
The kernel method estimates a function using a kernel: the value of the function at a point x is estimated from the data near x. Nearness is defined by a distance, so each observation is weighted according to its distance from x.
KNN as a kernel method
KNN can be viewed as a kernel method because it calculates distances when choosing the nearest k points.
$$ \hat{f}(x)=\mathrm{Ave}(y_i \mid x_i \in N_k(x)) $$
$N_k(x)$ is the set of the k points nearest to x in Euclidean distance. The resulting estimate is not continuous, because the neighborhood changes abruptly as x moves. KNN picks a fixed number of points k, so it differs from a metric-based approach that uses a fixed-width window; a short NumPy sketch follows the comparison table below.
| | Metric (fixed width) | KNN |
|---|---|---|
| Bias | constant | inverse of local density |
| Variance | inverse of local density | constant |

Tied values are handled by assigning additional observation weights.
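As a quick illustration of the KNN average above (not from the original post), here is a minimal NumPy sketch; the function name and toy data are my own.

```python
import numpy as np

def knn_smoother(x, y, x0, k=5):
    """Average the y-values of the k nearest training points for each query point x0."""
    x, y, x0 = np.asarray(x), np.asarray(y), np.asarray(x0)
    dist = np.abs(x0[:, None] - x[None, :])   # (n_query, n_train) pairwise distances
    idx = np.argsort(dist, axis=1)[:, :k]     # indices of the k nearest points
    return y[idx].mean(axis=1)                # Ave(y_i | x_i in N_k(x0))

# toy usage
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 100))
y = np.sin(x) + rng.normal(0, 0.3, 100)
y_hat = knn_smoother(x, y, np.linspace(0, 10, 50), k=10)
```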
Another method
The Nadaraya-Watson kernel-weighted average with the Epanechnikov quadratic kernel:
$$ \hat{f}(x_0) = \dfrac{\sum^N_{i=1} K_\lambda(x_0,x_i)\,y_i}{\sum^N_{i=1} K_\lambda(x_0,x_i)}, \quad K_\lambda(x_0,x)=D\left(\dfrac{|x-x_0|}{\lambda}\right) $$
$$ D(t) = \begin{cases} \frac{3}{4}(1-t^2) & \text{if } |t|\leq 1, \\ 0 & \text{otherwise.} \end{cases} $$
As λ becomes larger, more points are taken into account. The distance enters through a quadratic form, and λ acts as a scaling factor: it determines where the quadratic $1-t^2$ hits zero, i.e., the width of the window.
We could use the distance itself as the kernel, so why compose it with the function D? With λ fixed, $K_\lambda(x,x_0)=D(f(x,x_0))$, where
- D(t): Density function
- f: Distance function
D(t) is a density function, so it must integrate to 1 over its support; this is why the factor 3/4 appears. The interesting point is that we are not asking for the density of a single point, but for the density of the distance between two points. The density is maximized at $x = x_0$ (the mode), and it grows as the two points get closer, so points closer to the fixed $x_0$ receive more density, i.e., more weight.
More generally, the kernel can be defined as below:
$$ K_\lambda(x_0,x)=D\left(\dfrac{|x-x_0|}{h_\lambda(x_0)}\right), \quad \text{for KNN: } h_k(x_0)=|x_0-x_{[k]}| $$
$$ \text{Tricube: } D(t) = \begin{cases} (1-|t|^3)^3 & \text{if } |t|\leq 1, \\ 0 & \text{otherwise.} \end{cases} $$
$$ \text{Gaussian: } D(t) =\phi(t) $$
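For reference, a minimal sketch of these kernels in NumPy/SciPy (the function names are mine):

```python
import numpy as np
from scipy.stats import norm

def epanechnikov(t):
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= 1, 0.75 * (1 - t**2), 0.0)   # 3/4 (1 - t^2) on |t| <= 1

def tricube(t):
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= 1, (1 - np.abs(t)**3)**3, 0.0)

def gaussian(t):
    return norm.pdf(t)                                        # standard normal density

def kernel_weight(x0, x, lamb, D=epanechnikov):
    return D(np.abs(x - x0) / lamb)                           # K_lambda(x0, x) = D(|x - x0| / lambda)
```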
Kernel Density Estimation
$$ \hat{f}_X(x)=\dfrac{1}{N}\sum^N_{i=1}\phi_\lambda(x-x_i)=(\hat{F}\star\phi_\lambda)(x) $$
The kernel density estimate is obtained by summing normal densities centered at each data point and dividing by the number of data points. This can be viewed as the convolution of $\hat{F}$ and $\phi_\lambda$; under this view, $\hat{F}$ assigns mass 1/N to each observation. This uses the definition of the discrete convolution below.
$$
(f*g)(m)=\sum _{n}{f(n)g(m-n)}
$$
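A minimal sketch of the estimator above with a Gaussian kernel $\phi_\lambda$ (the helper name and toy data are mine):

```python
import numpy as np
from scipy.stats import norm

def gaussian_kde_hat(x_grid, samples, lamb):
    """f_hat(x) = (1/N) * sum_i phi_lambda(x - x_i), with phi_lambda the N(0, lambda^2) density."""
    x_grid = np.asarray(x_grid, dtype=float)[:, None]    # (n_grid, 1)
    samples = np.asarray(samples, dtype=float)[None, :]  # (1, N)
    return norm.pdf(x_grid - samples, scale=lamb).mean(axis=1)

# toy usage: estimate the density of a two-component mixture
rng = np.random.default_rng(1)
samples = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 0.5, 200)])
f_hat = gaussian_kde_hat(np.linspace(-6, 6, 200), samples, lamb=0.5)
```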
Estimating f(x)
Consider the comparison between LDA and logistic regression (p. 127). The logistic model makes no assumption about P(X); instead, it fits P(G|X) by maximizing the conditional likelihood.
from sklearn.datasets import load_boston
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

## For each query point x0, look only at data within a lambda-wide window around it.
def kernel(x, y, x0, lamb, type='quad'):
    # distances beyond lambda are capped so that their kernel weight becomes zero
    dist = np.where(abs(x - x0) > lamb, lamb, abs(x - x0)) / lamb
    if type == 'quad':      # Epanechnikov (quadratic) kernel
        dens = 3 / 4 * (1 - dist ** 2)
    elif type == 'tri':     # tricube kernel
        dens = (1 - abs(dist) ** 3) ** 3
    else:
        raise ValueError('You have to choose type: quad or tri')
    # Nadaraya-Watson weighted average at each x0
    hat = (dens * y).sum(axis=1) / dens.sum(axis=1)
    return hat

dataset = load_boston()
x = dataset['data'].transpose()[-1]                 # LSTAT, the last predictor
y = dataset['target']
x0 = np.reshape(np.linspace(1, 35, 100), (-1, 1))   # query grid starting near min(x)
y_hat = kernel(x=x, y=y, x0=x0, lamb=3, type='quad')

fig, ax = plt.subplots()
sns.lineplot(x=x, y=y)
sns.lineplot(x=x0.reshape(1, -1)[0], y=y_hat)
## Kernel density estimate on the same grid: average the kernel bumps and rescale by lambda.
def kde(x, x0, lamb, type='quad'):
    dist = np.where(abs(x - x0) > lamb, lamb, abs(x - x0)) / lamb
    if type == 'quad':
        dens = 3 / 4 * (1 - dist ** 2)
    elif type == 'tri':
        dens = (1 - abs(dist) ** 3) ** 3
    else:
        raise ValueError('You have to choose type: quad or tri')
    # f_hat(x0) = (1 / (N * lambda)) * sum_i D(|x0 - x_i| / lambda)
    hat = dens.sum(axis=1) / (len(x) * lamb)
    return hat

f_hat = kde(x=x, x0=x0, lamb=3, type='quad')
fig, ax = plt.subplots()
sns.lineplot(x=x0.reshape(1, -1)[0], y=f_hat)
Local Likelihood and Other Models
Any parametric model can be made local, as long as its fitting method accommodates observation weights. The idea of likelihood is to estimate plausible parameter values, which is why we speak of parametric models here. Using a local likelihood, even a parametric model can be estimated locally by maximum likelihood. The convention is to fit parametric models globally and to think of semi-parametric or nonparametric models as the local ones, so it is an advantage that a parametric model, whose parameters we can interpret, can also be fit locally. As a concrete case, consider logistic regression.
$$ Y_i \mid X_i \sim \mathrm{Ber}\big(\mathrm{logistic}(\eta(X_i))\big) \\ \eta(x)=\beta_0+\beta_1x+\cdots+\beta_px^p $$
Starting from the log-likelihood of $\beta$, the local log-likelihood around x is:
$$ \begin{split} l(\beta) & =\sum^n_{i=1}\Big[Y_i\log\big(\mathrm{logistic}(\eta(X_i))\big)+(1-Y_i)\log\big(1-\mathrm{logistic}(\eta(X_i))\big)\Big] \\ & = \sum^n_{i=1}l(Y_i,\eta(X_i)) \end{split} $$
$$ l_{x,h}(\beta)=\sum^n_{i=1}l(Y_i,\eta(X_i))\,K_h(x-X_i) $$
[ref: https://bookdown.org/egarpor/NP-UC3M/kre-ii-loclik.html]
When Y is a binary response, this is a good way to visualize the trend. It is usually hard to attach a visual interpretation to a model with a categorical response, but local linear logistic regression makes it possible: one can plot the estimated probability P(Y=1 | X=x) at each specific value of X.
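A rough sketch of the local likelihood fit (my own code, assuming scikit-learn's `LogisticRegression`, a Gaussian kernel for $K_h$, and Y coded 0/1):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def local_logistic_prob(x_grid, X, Y, h, degree=1):
    """At each grid point x, maximize the kernel-weighted log-likelihood and
    return the fitted probability P(Y = 1 | X = x)."""
    X, Y = np.asarray(X, dtype=float), np.asarray(Y)
    probs = []
    for x in np.asarray(x_grid, dtype=float):
        w = np.exp(-0.5 * ((X - x) / h) ** 2)          # kernel weights K_h(x - X_i)
        B = np.vander(X - x, N=degree + 1, increasing=True)[:, 1:]  # centered polynomial features
        clf = LogisticRegression(C=1e6)                # large C ~ (almost) no regularization
        clf.fit(B, Y, sample_weight=w)
        center = np.zeros((1, degree))                 # the target point corresponds to X - x = 0
        probs.append(clf.predict_proba(center)[0, 1])
    return np.array(probs)
```

Because the features are centered at x, the intercept of each weighted fit plays the role of $\eta(x)$, so the probability predicted at the origin is the local estimate of P(Y=1 | X=x).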
Local regression
Locality is provided by the kernel. There are two ways to use it:
1) Use the kernel directly to estimate the form of the function.
2) Assume the functional form is known and use the kernel to estimate its parameters.
Local linear and local polynomial regression use approach 2).
Local linear regression
$$ \hat{f}(x_0)=\hat{\alpha}(x_0)+\hat{\beta}(x_0)x_0 \\ \min_{\alpha(x_0),\beta(x_0)}\sum^N_{i=1}K_\lambda(x_0,x_i)[y_i-\alpha(x_0)-\beta(x_0)x_i]^2 $$
Originally, linear regression estimates the function globally. By introducing a kernel into the regression we get a variation: each target point has its own kernel weights, so the function is estimated locally.
$$ \hat{f}(x_0)=b(x_0)^T(B^TW(x_0)B)^{-1}B^{T}W(x_0)y=\Sigma l_i(x_0)y_i $$
So what does local estimation mean here? It means we obtain $\hat{f}(x_0)$ after fitting a regression line around $x_0$. When fitting that line we take the kernel weights into account, so closer points receive more weight and have more influence on the estimated coefficients.
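Below is a minimal sketch of this closed form with Epanechnikov weights (the function name is mine, not the post's code):

```python
import numpy as np

def local_linear(x0_grid, x, y, lamb):
    """f_hat(x0) = b(x0)^T (B^T W(x0) B)^{-1} B^T W(x0) y with b(x) = [1, x]."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    B = np.column_stack([np.ones_like(x), x])          # rows b(x_i)^T = [1, x_i]
    f_hat = []
    for x0 in np.asarray(x0_grid, dtype=float):
        t = np.abs(x - x0) / lamb
        w = np.where(t <= 1, 0.75 * (1 - t**2), 0.0)   # K_lambda(x0, x_i), Epanechnikov
        BtW = B.T * w                                   # B^T W(x0)
        beta = np.linalg.solve(BtW @ B, BtW @ y)        # (alpha(x0), beta(x0))
        f_hat.append(beta[0] + beta[1] * x0)            # b(x0)^T beta
    return np.array(f_hat)
```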
However, the kernel method has a serious flaw near the edges of the data: the neighborhood there is one-sided. For the leftmost point, for instance, only points to its right are available. This is called the boundary effect.
Trevor Hastie and Clive Loader, "Local Regression: Automatic Kernel Carpentry," Statistical Science 8(2), 120-129 (1993).
In this paper, local linear regression is shown to solve this boundary issue, as follows.
$$ \begin{split} E\hat{f}(x_0)= & \sum^N_{i=1}l_i(x_0)f(x_i) \\ = & f(x_0)\sum^N_{i=1} l_i(x_0)+f'(x_0)\sum^N_{i=1} (x_i-x_0)l_i(x_0) \\ & + \dfrac{f''(x_0)}{2} \sum^N_{i=1}(x_i-x_0)^2 l_i(x_0) + R \end{split} $$
For local linear regression the weights satisfy $\sum_i l_i(x_0) =1$ and $\sum_i (x_i-x_0)l_i(x_0)=0$. Hence the bias $E[\hat{f}(x_0)]-f(x_0)$ depends only on the quadratic and higher-order (curvature) terms: the first-order term is removed exactly, and this is what fixes the boundary issue.
Q. proof: statkwon.github.io
To sum up, to control the boundary effect we need a model that includes a slope term, and local linear regression does exactly that.
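A quick numerical check of this claim (my own simulation): when the true function is linear, the Nadaraya-Watson estimate at the left boundary is biased upward, while the local linear estimate is not.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 300))
y = 2 * x + rng.normal(0, 0.05, 300)        # true f(x) = 2x, so f(0) = 0

x0, lamb = 0.0, 0.2                          # query point at the left boundary
t = np.abs(x - x0) / lamb
w = np.where(t <= 1, 0.75 * (1 - t**2), 0.0)

# Nadaraya-Watson: only points to the right of 0 get weight, so the average is pulled upward
nw = np.sum(w * y) / np.sum(w)

# Local linear: the slope term corrects for the one-sided neighborhood
B = np.column_stack([np.ones_like(x), x])
beta = np.linalg.solve((B.T * w) @ B, (B.T * w) @ y)
ll = beta[0] + beta[1] * x0

print(f"true f(0) = 0.0, NW = {nw:.3f}, local linear = {ll:.3f}")
```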
Local polynomial regression
$$ \hat{f}(x_0)=\hat{\alpha}(x_0)+\Sigma^d_{j=1}\hat{\beta}_j(x_0)x_0^j $$
$$ \min_{\alpha(x_0),\beta_j(x_0),\,j=1,\dots,d}\sum^N_{i=1}K_\lambda(x_0,x_i)\Big[y_i-\alpha(x_0)-\sum^d_{j=1}\beta_j(x_0)x_i^j\Big]^2 $$
Local polynomial regression yields $\hat{f}$ in the same closed form as local linear regression; the only difference is how $b(x_0)^T$ is defined (a sketch follows the list below):
- Linear $[1 \quad x]$
- Polynomial $[1 \quad x \quad x^2 \quad ...]$
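A sketch of the polynomial version, changing only the basis (again my own code, not the post's):

```python
import numpy as np

def local_polynomial(x0_grid, x, y, lamb, degree=2):
    """Same weighted least squares as local linear, with basis b(x) = [1, x, ..., x^degree]."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    B = np.vander(x, N=degree + 1, increasing=True)        # columns [1, x, x^2, ..., x^d]
    f_hat = []
    for x0 in np.asarray(x0_grid, dtype=float):
        t = np.abs(x - x0) / lamb
        w = np.where(t <= 1, 0.75 * (1 - t**2), 0.0)       # Epanechnikov weights
        BtW = B.T * w
        beta = np.linalg.solve(BtW @ B, BtW @ y)
        b0 = np.vander(np.array([x0]), N=degree + 1, increasing=True)[0]
        f_hat.append(b0 @ beta)                            # b(x0)^T beta
    return np.array(f_hat)
```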
Local Regression in p dimension
$$
\min_{\beta(x_0)}\Sigma^N_{i=1} K_\lambda (x_0,x_i)(y_i-b(x_i)^T\beta(x_0))^2
$$
$$
K_\lambda(x_0,x)=D(\dfrac{||x-x_0||}{\lambda})
$$
Now consider higher dimensions. $||x-x_0||$ is the Euclidean distance, and each coordinate should be standardized to unit standard deviation, since coordinates on different scales contribute very differently to the distance.
Boundary effects become more serious in high dimensions because of the curse of dimensionality: the fraction of points close to the boundary grows with the dimension. It also becomes hard to visualize the data, even though smoothing is often done precisely for visualization. Moreover, the sampling density is proportional to $N^{1/p}$, so the data become sparse.
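A sketch of the p-dimensional case with standardized coordinates (my own code; a single query point for brevity):

```python
import numpy as np

def local_linear_p(x0, X, y, lamb):
    """Local linear fit at one query point x0 in R^p, standardizing each coordinate first."""
    X, y, x0 = np.asarray(X, dtype=float), np.asarray(y, dtype=float), np.asarray(x0, dtype=float)
    scale = X.std(axis=0)                                  # unit standard deviation per coordinate
    t = np.linalg.norm((X - x0) / scale, axis=1) / lamb    # ||x - x0|| / lambda after scaling
    w = np.where(t <= 1, 0.75 * (1 - t**2), 0.0)           # Epanechnikov kernel
    B = np.column_stack([np.ones(len(X)), X])              # b(x) = [1, x_1, ..., x_p]
    BtW = B.T * w
    beta = np.linalg.solve(BtW @ B, BtW @ y)
    return np.concatenate([[1.0], x0]) @ beta              # f_hat(x0) = b(x0)^T beta
```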
Structured Kernels
Instead of applying the same weight to every axis by standardizing the coordinates, we multiply by a weight matrix A:
$$K_{\lambda,A}(x_0,x)=D(\dfrac{(x-x_0)^T A(x-x_0)}{\lambda})$$
If A is diagonal, the influence of each $X_j$ can be adjusted individually (see the sketch below). More general choices include:
- making A focus less on high-frequency contrasts;
- assuming a low-rank A, which leads to ridge functions.
However, as the dimension of A grows it becomes hard to handle, so structured regression is used instead.
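A tiny sketch of the structured kernel above with a diagonal A (my own illustration):

```python
import numpy as np

def structured_kernel(x0, x, A, lamb):
    """K_{lambda,A}(x0, x) = D((x - x0)^T A (x - x0) / lambda), with Epanechnikov D."""
    d = np.asarray(x, dtype=float) - np.asarray(x0, dtype=float)
    t = d @ np.asarray(A, dtype=float) @ d / lamb      # quadratic-form "distance"
    return 0.75 * (1 - t**2) if abs(t) <= 1 else 0.0

# down-weight the second coordinate by giving it a smaller diagonal entry
A = np.diag([1.0, 0.1])
w = structured_kernel(np.zeros(2), np.array([0.5, 2.0]), A, lamb=1.0)
```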
Structured Regression
This uses the ANOVA-decomposition approach; an important special case is the varying-coefficient model:
$$f(X)=\alpha(Z)+\beta_1(Z)X_1+\cdots+\beta_q(Z)X_q$$
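A sketch of fitting this varying-coefficient model by locally weighted least squares, where the kernel weights depend only on Z (my own code; Z is assumed one-dimensional):

```python
import numpy as np

def varying_coefficient_fit(z0, Z, X, y, lamb):
    """Estimate alpha(z0), beta_1(z0), ..., beta_q(z0): least squares on [1, X],
    weighted by K_lambda(z0, z_i) so that only observations with z_i near z0 matter."""
    Z, X, y = np.asarray(Z, dtype=float), np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    t = np.abs(Z - z0) / lamb
    w = np.where(t <= 1, 0.75 * (1 - t**2), 0.0)           # kernel in Z only
    B = np.column_stack([np.ones(len(y)), X])              # [1, X_1, ..., X_q]
    BtW = B.T * w
    coef = np.linalg.solve(BtW @ B, BtW @ y)
    return coef                                            # [alpha(z0), beta_1(z0), ..., beta_q(z0)]
```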
Reference
Hastie, T., Tibshirani, R., & Friedman, J. (2001). The Elements of Statistical Learning. New York, NY: Springer.