Orthogonalization

Notice

Recent Posts

Recent Comments

Link

« 2024/11 »
일	월	화	수	목	금	토
					1	2
3	4	5	6	7	8	9
10	11	12	13	14	15	16
17	18	19	20	21	22	23
24	25	26	27	28	29	30

Tags more

Archives

Today

Total

관리 메뉴

statduck

Orthogonalization 본문

Machine Learning

Orthogonalization

statduck 2022. 5. 27. 16:11

Orthogonalization is important.

1) Vector to span{vector}

Let's consider two vectors $v,w$ in n dimension.

$$
Project \; v \; onto \; span\{w\}=Proj_wv=\dfrac{v \cdot w}{||w||^2}w=\dfrac{v\cdot w}{||w||}\dfrac{w}{||w||}
$$

$\dfrac{v \cdot w}{||w||}$$: length $$\dfrac{w}{||w||}$: direction

Gram-schmidt process

Every non-zero subspace of $\mathbb{R}^n$ has an orthonormal basis.

$v_1=w_1$
$v_2=w_2-Proj_{w_1}w_2$
$v_3=w_3-Proj_{w_2}w_3=w_3-Proj_{v_1,v_2}w_3$

$v_k=w_k-\sum^{k-1}_{i=1}Proj_{v_i}w_i$

$$ \{v_1,v_2,...,v_k\} : \; orthonormal \; basis \; for \; a \; subspace \;W \; of \; \mathbb{R}^n $$

$$ w=(w\cdot v_1)v_1+(w \cdot v_2)v_2+\cdots (w \cdot v_k)v_k $$

Projection Matrix

$ A\mathbf{x}=b \;\;\; would \;\; have \;\; no \;\; solutions. $

$ A\hat{\mathbf{x}}=p \;\;\;where \;\; p \; is \; a \; projected \; vector \; from \; b \; to \; col(A)$

$$ A\hat{\mathbf{x}}=p, \; A^T(b-A\hat{\mathbf{x}})=0 $$
$$ \hat{\mathbf{x}}=(A^TA)^{-1}A^Tb, \; p=A(A^TA)^{-1}A^Tb=Pb $$

$P = (A(A^TA)^{-1}A^T)$ is a symmetric and idempotent matrix.

$$
P=A(A^TA)^{-1}A^T=AA^T=[v_1 \cdots v_n] \begin{bmatrix}v_1^T\\ \vdots \\v_n^T \end{bmatrix}= v_1\cdot v_1^T + \cdots + v_n \cdot v_n^T $$

$$ when \; A \; has \; an \; orthonormal \; basis, A^TA=I $$

$$ Pb=(b\cdot v_1)v_1 +\cdots +(b\cdot v_n)v_n $$

🦊 Regression by Successive Orthogonalization 🦊

Why do we use orthogonalization?

$$ Y=X\beta +\varepsilon $$
$$ \hat{\beta}=(X^TX)^{-1}X^Ty $$
$$ \hat{\beta}_j=\dfrac{\langle \mathbf{x}_j,\mathbf{y}\rangle}{\langle \mathbf{x}_j,\mathbf{x}_j\rangle}, \quad when \;X=Orthogonal \; matrix $$

When X is an orthogonal matrix, simply we can find the coefficient through inner product of j$_{th}$input vector and y vector. It is computationally efficient compared to matrix multiplication. However, there is no situation we have orthogonal data in real world. Thus we need to make our data orthogonal.

Initialize $\mathbf{z}_0=\mathbf{x}_0=1$
For $j=1,2,\dots,p$ $$
Regress \; \mathbf{x}_j \; on \; \mathbf{z}_0,\mathbf{z}_1,\dots,\mathbf{z}_{j-1} \\
\mathbf{z}_j=\mathbf{x}_j-\sum^{j-1}_{k=0}\hat{\gamma}_{kj}\mathbf{z}_k, \quad \hat{\gamma}_{lj}=\dfrac{\langle \mathbf{z}_l,\mathbf{x}_j \rangle}{\langle \mathbf{z}_l, \mathbf{z}_l \rangle} \\
\mathbf{z}_j=\mathbf{x}_j-\sum^{j-1}_{k=0}\dfrac{\langle \mathbf{z}_k,\mathbf{x}_j \rangle}{\langle \mathbf{z}_k, \mathbf{z}_k \rangle} \mathbf{z}_k=\mathbf{x}_j-\sum^{j-1}_{k=0} Proj_{\mathbf{z}_k}\mathbf{x}_j
$$
Regress $\mathbf{y}$on the residual $\mathbf{z}_p$to give the estimate $\hat{\beta}_p$

The result of this algorithm is $\hat{\beta}_p=\dfrac{\langle \mathbf{z}_p,\mathbf{y} \rangle}{\langle \mathbf{z}_p,\mathbf{z}_p \rangle}$. X has an orthonormal basis, so our input vectors all could be expressed as an linear combination of $z_j$.

$$ x_j=\mathbf{j} \cdot \mathbf{z} \\ $$

We can do extend it to matrix.

$$ \begin{split} \mathbf{X} &=\mathbf{Z}\mathbf{\Gamma} \\ &= \begin{bmatrix} z_{11} & \cdots & z_{p1} \\ \vdots & \ddots & \vdots \\ z_{n1} & \cdots & z_{np} \end{bmatrix} \begin{bmatrix} \gamma_{11} & \gamma_{12} &\cdots & \gamma_{1p} \\ 0 & \gamma_{22} & \cdots & \gamma_{2p} \\ 0 & 0 & \cdots & \gamma_{3p} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \gamma_{pp} \end{bmatrix} \\ &=\mathbf{Z}\mathbf{D}^{-1}\mathbf{D}\mathbf{\Gamma} \\ &=\mathbf{Q}\mathbf{R} \\ &= \begin{bmatrix} q_1 & \cdots & q_p \end{bmatrix} \begin{bmatrix} (x_1 \cdot q_1) & (x_2 \cdot q_1) & \cdots & (x_p \cdot q_1) \\ 0 & (x_2 \cdot q_2) & \cdots & (x_p \cdot q_2) \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & (x_p \cdot q_p) \end{bmatrix} \end{split} $$

$\mathbf{Q}$: direction, $\mathbf{R}$: length(scale)

$$
\hat{\beta}=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}=(\mathbf{R}^T\mathbf{Q}^T\mathbf{Q}\mathbf{R})^{-1}\mathbf{R}^T\mathbf{Q}^Ty=
\mathbf{R}^{-1}\mathbf{Q}^T\mathbf{y} \\
\hat{\mathbf{y}}=\mathbf{Q}\mathbf{Q}^T\mathbf{y}
$$

> It is so computationally efficient!

Variation Version

✏️ LAR(Least Angle Regression)

$r_0=y-\hat{\beta}_0=y-\bar{y}$을 첫번째 모형으로 시작($\forall \beta=0)$하고, X변수는 zero mean, unit norm을 가지도록 정규화를 시킨다.
$r_0$와 가장 관련성이 높은 $x_j$를 찾는다. 다시 말해, $argmax_{x_j}\langle x_j,r_0\rangle$를 찾는다.
같은 인덱스 j를 가지는 계수 $\beta_j$를 0에서 $\langle x_j,r_0\rangle$ 으로 값을 키운다. 이 때 모형은 $r_1=y-\bar{y}-\hat{\beta}_jx_j$. 값을 다음의 부등식을 유지하는 범위에서 키운다.$\langle x_j,r_1 \rangle \leq \langle x_k,r_1\rangle$
위의 인덱스 j,k를 가지는 계수 $\beta_j,\beta_k$ 를 $\langle x_j,x_k \rangle$ 으로 값을 키운다. 이 때 모형은 $r_2=y-\bar{y}-\hat{\beta}_jx_j-\hat{\beta}_kx_k$. 값을 다음의 부등식을 유지하는 범위에서 키운다. $\langle x_j,x_k \rangle \leq \langle x_l,r_2\rangle$
이러한 방식을 p개의 x들이 모두 모형에 들어갈 때 까지 반복한다. $min(N-1,p)$스텝 이후 이는 full-least-squares solution과 같아진다.

Full least squares을 구하기 위해서 계산을 그저 p번만 행하면 되므로 매우 계산 효과적인 알고리즘이다. 변수를 표준화하는 이유는 corr을 바로 내적으로 계산하기 위해서이다.

✏️ PCR(Principle Component Regression)

보통의 경우 상관관계가 높은 다량의 독립변수가 존재한다. 이 경우 해당 변수들을 PC로 줄여서 독립변수로 사용하는 것이 PCR이다.

$$ \hat{y}= \hat{\beta}_0 + \sum^{M}_{m=1}\hat{\theta}_m z_m \\

Xv_m=z_m, \quad \hat{\theta}_m z_m=Proj _{z_m}y, \quad \hat{\beta}^{pcr}(M)=\sum^M_{m=1}\hat{\theta}_m v_m $$

이 때 $z_m$은 principal components이다. 참고로 PC는 다음과 같이 정의된다. Ridge regression과 마찬가지로 해당 방식은 입력변수의 scale에 영향을 많이 받기 때문에 먼저 표준화를 시킨다.(Shrinkage의 경우 변수들간에 scaling이 다르면 불균형한 shrinkage가 일어난다.)

$$
X=U\Sigma V^T, \quad PC_i=Xv_i= U\sigma_i
$$

PC 변수 결국 표본분 $S=\dfrac{X^TX}{n}$ 의 scaled 고유벡이다. 이 때 $col(U)=Eigenvectors \; of \; X^TX$ 이다. 고유벡터 표기에서 상수 n은 무시하자. $U\sigma_i$는 고유값으로 스케일링된 고유벡터를 의미한다. 이는 PC변수들로 결국 y를 적합하는 것이다. 이는 ridge regression과 비슷한데, 왜냐하면 ridge에서도 결국 PC축 스케일링으로 데이터를 적합시켰기 때문이다.

$$
X\hat{\beta}^{ridge}=\sum^p_{j=1}\frac{d_j^2}{d_j^2+\lambda} u_j u_j^Ty=\sum^p_{j=1}\frac{(d_ju_j)(d_ju_j)^T}{d_j^2+\lambda} y
$$

해당 식은 $XX^T$의 고유공간에 y를 정사영 내린 후 스케일링 하는 것을 의미한다.

$$
X=UDV^T, ; XV=UD \\
Xv_i=d_iu_i=PC_i(i_{th} ; PC ; variable)
$$

✏️ PLS

Partial Least Squares는 Y와의 공분산이 높은 k개의 선형조합을 추출한다.

$$ \hat{\psi}{1j}=\langle x_j,y \rangle, \quad z_1=\sum_j \hat{\psi}_{1j}x_j $$

$x_j$를 표준화시킨다. $\hat{y}^{(0)}=\bar{y}1$
$z_m=\sum^p_{j=1} \hat{\psi}_{mj}x_j^{(m-1)}, \quad where \; \hat{\psi}_{mj}=\langle x_j^{(m-1)},y\rangle$
$\hat{\theta}_m=\langle z_m,y \rangle/\langle z_m,z_m \rangle$
$\hat{y}^{(m)}=\hat{y}^{(m-1)}+\hat{\theta}_mz_m$
Orthogonalizae each $x_j^{(m-1)}$with respect to $z_m$: $x_j^{(m)}$

Y와 high variance&high correlation을 가지도록 집중한다.

Reference

Hastie, T., Tibshirani, R.,, Friedman, J. (2001). The Elements of Statistical Learning. New York, NY, USA: Springer New York Inc..

'Machine Learning' 카테고리의 다른 글

Linear Classfier (2) (0)	2022.05.27
Linear Classifier (1) (0)	2022.05.27
Linear Method (0)	2022.05.27
Statistical Decision Theory (0)	2022.05.27
Weighting procedure (0)	2022.05.27

'Machine Learning' Related Articles

Comments

statduck

Orthogonalization 본문

Orthogonalization

Orthogonalization is important.

1) Vector to span{vector}

Gram-schmidt process

Projection Matrix

Variation Version

✏️ LAR(Least Angle Regression)

✏️ PCR(Principle Component Regression)

✏️ PLS

Reference

'Machine Learning' 카테고리의 다른 글

티스토리툴바