Machine Learning

Linear Classifier (1)

statduck 2022. 5. 27. 16:43

Linear method: Classification

    In a classification problem, the target variable takes categorical values. Assume $Y$ takes only the values zero or one, and let $g$ denote the class label. We connect $Y$ and $g$ through an indicator:

$$ Y_{ij}=I(G(x_i)=g_j) $$

Linear Model

  • If $\delta_k(x)$ is linear, then the decision boundary $D$ is also linear.
  • $E(Y_k|X=x)=P(G=k|X=x)$. When we estimate this with $x^T\beta$, the fitted values can fall outside the range 0 to 1.
  • $\min_{\mathbf{B}} \sum_i||y_i-[(1,x_i^T)\mathbf{B}]^T||^2$
  • $\hat{G}(x)=argmax_{k \in \mathcal{G}} \hat{f}_k(x) \rightarrow \hat{G}(x)=argmin_{k}||\hat{f}(x)-t_k||^2$
  • $\sum_k\hat{f}_k(x)=1$: the fitted values sum to one (a minimal code sketch of this classifier follows below).
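
As an illustration of the points above, here is a minimal Python/NumPy sketch of regression on an indicator response matrix followed by an argmax rule. The toy data and all variable names are hypothetical.

```python
import numpy as np

# Hypothetical toy data: N observations, p features, J classes labeled 0..J-1.
rng = np.random.default_rng(0)
N, p, J = 90, 2, 3
g = np.repeat(np.arange(J), N // J)                     # class labels
X = rng.normal(size=(N, p)) + 3 * np.eye(J)[g][:, :p]   # shift each class mean

Y = np.eye(J)[g]                       # N x J indicator matrix, Y_ij = I(g_i = j)
X1 = np.hstack([np.ones((N, 1)), X])   # prepend an intercept column

# Least-squares fit of Y on (1, x); each row of the fitted matrix sums to one.
B, *_ = np.linalg.lstsq(X1, Y, rcond=None)
f_hat = X1 @ B                         # fitted values, may fall outside [0, 1]
pred = f_hat.argmax(axis=1)            # G_hat(x) = argmax_k f_hat_k(x)
print("training accuracy:", (pred == g).mean())
```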

LDA & QDA

LDA

✏️ Goal: Estimate the class posterior $P(G|X)$.

✏️ Expression: $ P(G=k|X)=\dfrac{P(X|G=k)P(G=k)}{P(X)}=\dfrac{f_k(x)\pi_k}{\Sigma^K_{l=1}f_l(x)\pi_l} $

✏️ Assumption: The distribution within each class follows a multivariate normal distribution with a common covariance matrix.

$$
(X|G=k) \sim N(\mu_k, \Sigma_k) \\
f_k(x)=\dfrac{1}{(2\pi)^{p/2}|\Sigma_k|^{1/2}}\exp\left(-\dfrac{1}{2}(x-\mu_k)^T\Sigma^{-1}_k(x-\mu_k)\right)
, \;\; \text{s.t.} \;\; \Sigma_k=\Sigma \;\; \forall k $$

 

✏️ Unfolding

Under the normality and common-covariance assumptions, the log posterior ratio unfolds easily.

$$ \begin{split} \log\dfrac{P(G=k|X)}{P(G=l|X)} & = \log\dfrac{f_k(x)}{f_l(x)}+\log\dfrac{\pi_k}{\pi_l} \\ & = \log\dfrac{\pi_k}{\pi_l}-\dfrac{1}{2}(\mu_k+\mu_l)^T\Sigma^{-1}(\mu_k-\mu_l) + x^T\Sigma^{-1}(\mu_k-\mu_l) \end{split} $$

✏️ Classification

The decision boundary is as follows:

$$ D = \{x \,|\, P(G=k|X)=P(G=l|X)\} = \{x \,|\, \delta_k(x)=\delta_l(x)\} $$

$$ P(G=k|X)=P(G=l|X), \;  \dfrac{P(G=k|X)}{P(G=l|X)}=1, \; \log\dfrac{P(G=k|X)}{P(G=l|X)}=0 $$

$$ \log\dfrac{\pi_k}{\pi_l}-\dfrac{1}{2}(\mu_k+\mu_l)^T\Sigma^{-1}(\mu_k-\mu_l) + x^T\Sigma^{-1}(\mu_k-\mu_l) =0 \\
x^T\Sigma^{-1}\mu_k-\dfrac{1}{2}\mu_k^T\Sigma^{-1}\mu_k+\log\pi_k = x^T\Sigma^{-1}\mu_l-\dfrac{1}{2}\mu_l^T\Sigma^{-1}\mu_l+\log\pi_l \\ \delta_k(x)=\delta_l(x) $$
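
In code, classification then amounts to evaluating $\delta_k(x)$ for every class and choosing the largest. A minimal NumPy sketch, assuming the priors, class means, and common covariance are already in hand (they are estimated in the next subsection):

```python
import numpy as np

def lda_delta(x, pi, mu, Sigma):
    """Linear discriminant scores: delta_k(x) = x'S^{-1}mu_k - 0.5*mu_k'S^{-1}mu_k + log(pi_k)."""
    Sinv = np.linalg.inv(Sigma)
    return np.array([
        x @ Sinv @ mu[k] - 0.5 * mu[k] @ Sinv @ mu[k] + np.log(pi[k])
        for k in range(len(pi))
    ])

# Classify x to the class with the largest score: G_hat(x) = argmax_k delta_k(x).
```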

✏️ Parameter estimation

In real data we do not know the parameters of the normal distributions, so we estimate them.

$$
\hat{\pi}_k=N_k/N \\
\hat{\mu}_k=\sum_{g_i=k}x_i/N_k \\
\hat{\Sigma}=\sum^K_{k=1}\sum_{g_i=k}(x_i-\hat{\mu}_k)(x_i-\hat{\mu}_k)^T/(N-K)
$$
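
A minimal NumPy sketch of these estimators (the helper name `lda_fit` is illustrative); its output can be fed to `lda_delta` above:

```python
import numpy as np

def lda_fit(X, g, K):
    """Estimate pi_k, mu_k and the pooled covariance, following the formulas above."""
    N, p = X.shape
    pi = np.array([(g == k).mean() for k in range(K)])          # N_k / N
    mu = np.array([X[g == k].mean(axis=0) for k in range(K)])   # class centroids
    Sigma = sum((X[g == k] - mu[k]).T @ (X[g == k] - mu[k])
                for k in range(K)) / (N - K)                    # pooled covariance
    return pi, mu, Sigma
```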


✏️ Another Perspective

| Row | Height | Weight | Gender |
|-----|--------|--------|--------|
| #1  | 5.2    | 1.4    | 0      |
| #2  | 5.2    | 3.5    | 0      |
| #3  | 3.5    | 2.2    | 0      |
| #4  | 3.6    | 5.4    | 1      |
| #5  | 7.5    | 6.5    | 1      |
| #6  | 6.6    | 7.5    | 1      |

 

We can use a linear combination $a_1H+a_2W$ to separate this table into two clusters.

 

$$ argmax_{a_1,a_2} E[a_1H+a_2W|G=0]-E[a_1H+a_2W|G=1] \\ s.t. \;Var[a_1H+a_2W]\leq \text{constant} $$

This can be written more compactly as:

$$ argmax_{a_1,a_2} h(a_1,a_2), \;\; s.t. \;\;g(a_1,a_2)=c $$

$$ \mu_1=E[H|G=0]-E[H|G=1] \\ \mu_2=E[W|G=0]-E[W|G=1] \\ h(a_1,a_2)=(a_1 \;\; a_2)(\mu_1 \;\; \mu_2)^T \\ g(a_1,a_2) = a_1^2Var(H|G=0)+2a_1 a_2Cov(H,W|G=0)+a_2^2Var(W|G=0) $$

 

Using a Lagrange multiplier, the problem is solved:

$$
\nabla g = \lambda \nabla h \\ (a_1 \;\; a_2)=(\mu_1 \;\; \mu_2)\begin{pmatrix}
COV(H,H) & COV(H,W) \\ COV(W,H) & COV(W,W)\end{pmatrix}^{-1}
$$
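
As a quick numerical check, here is a NumPy sketch of this solution on the toy Height/Weight table above, using the class-0 covariance as in the constraint $g(a_1,a_2)$; variable names are illustrative.

```python
import numpy as np

# The toy Height/Weight table above (Gender 0 vs 1).
X = np.array([[5.2, 1.4], [5.2, 3.5], [3.5, 2.2],
              [3.6, 5.4], [7.5, 6.5], [6.6, 7.5]])
g = np.array([0, 0, 0, 1, 1, 1])

mu_diff = X[g == 0].mean(axis=0) - X[g == 1].mean(axis=0)   # (mu_1, mu_2)
cov0 = np.cov(X[g == 0], rowvar=False)                      # within-class covariance (G=0)

# Lagrange-multiplier solution: (a_1, a_2) proportional to (mu_1, mu_2) COV^{-1}.
a = mu_diff @ np.linalg.inv(cov0)
print("discriminant direction:", a)
print("projected scores:", X @ a)
```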

QDA

✏️ Assumption

Like LDA, QDA assumes a multivariate normal distribution within each class, but it drops the equal-covariance assumption.

✏️ Classification

$$
D=\{x \,|\, \delta_k(x)=\delta_l(x)\} \\ \delta_k(x)=-\dfrac{1}{2}\log|\Sigma_k|-\dfrac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k)+\log\pi_k
$$

This expression is similar to the LDA expression, but the quadratic term in $x$ remains.
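
A hedged NumPy sketch of these quadratic discriminant scores, mirroring the `lda_delta` sketch above but keeping a separate covariance per class:

```python
import numpy as np

def qda_delta(x, pi, mu, Sigmas):
    """Quadratic discriminant scores; each class k keeps its own covariance Sigma_k."""
    scores = []
    for k in range(len(pi)):
        Sinv = np.linalg.inv(Sigmas[k])
        d = x - mu[k]
        scores.append(-0.5 * np.log(np.linalg.det(Sigmas[k]))
                      - 0.5 * d @ Sinv @ d + np.log(pi[k]))
    return np.array(scores)
```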

 

Generalized Discriminant Analysis

Generalized LDA

The strength of LDA lies in its simplicity.

  • It is a simple prototype classifier: a point is classified to the class with the closest centroid, where distance is measured with the Mahalanobis metric using a pooled covariance estimate.
  • The decision boundary is linear, so it gives a simple description and implementation. LDA is also informative because it provides a low-dimensional view of the data.

The weakness of LDA is that same simplicity.

  • Two prototypes (a class centroid and a common covariance matrix) are often not enough to describe the data.
  • A linear decision boundary may not separate the classes adequately.
  • With many features, the LDA estimates have high variance and performance degrades. In this case we need to restrict or regularize LDA.

Flexible Discriminant Analysis

Definition

Flexible discriminant analysis is devised for nonlinear classification.

$$
\min_{\beta, \theta}\sum^N_{i=1}(\theta(g_i)-x_i^T\beta)^2
$$

    $g_i$ is the class label of the $i$-th observation, and $\theta$ is a function mapping $\mathcal{G}\rightarrow \mathbb{R}^1$: it assigns a quantitative value (score) to each categorical label. We call $\theta(g_i)$ the transformed class labels, which can be predicted by linear regression.

 

$$
ASR=\dfrac{1}{N}\sum^L_{l=1}[\sum^N_{i=1}(\theta_l(g_i)-x_i^T\beta_l)^2]
$$

 

    $\theta_l$ and $\beta_l$ are chosen to minimize the average squared residual (ASR).

 

Matrix Notation

  • $Y$ is an $N\times J$ indicator matrix: $Y_{ij}=1$ if the $i$-th observation falls into the $j$-th class.
  • $\Theta$ is a $J \times K$ matrix whose columns are the $K$ score vectors for the $J$ classes.
  • $\Theta^*=Y\Theta$ is the matrix of transformed class labels.
  • $ASR(\Theta)=tr(\Theta^{*T}(I-P_X)\Theta^{*})/N = tr(\Theta^T Y^T(I-P_X)Y\Theta)/N$, where $P_X$ is the projection onto the column space of $X$.

    For reference, $\sum^N_{i=1}(y_i-\hat{y}_i)^2=y^T(I-P_X)y$. If the scores $\Theta^*$ have mean zero, unit variance, and are uncorrelated over the $N$ observations ($\Theta^{*T}\Theta^*/N=I_K$), minimizing $ASR(\Theta)$ amounts to finding the eigenvectors of $Y^TP_XY$ associated with its $K$ largest eigenvalues, under the normalization $\Theta^TD_p\Theta=I_K$.

$$
\min_\Theta tr(\Theta^TY^T(I-P_X)Y\Theta)/N=\min_{\Theta}[tr(\Theta^TY^TY\Theta)/N-tr(\Theta^TY^TP_XY\Theta)/N] \\ =\min_\Theta [K-tr(\Theta^{T}Y^TP_XY\Theta)/N]=\max_\Theta tr(\Theta^{T}S\Theta)/N, \;\; S=Y^TP_XY
$$

The $\Theta$ maximizing this trace consists of the eigenvectors of $S$ associated with its $K$ largest eigenvalues, by the Courant-Fischer characterization of eigenvalues. Finally, we can find an optimal $\Theta$.
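
A minimal NumPy sketch of this step, assuming an intercept-augmented design matrix `X1` and the indicator matrix `Y` defined above; whitening by $D_p^{-1/2}$ enforces the normalization $\Theta^TD_p\Theta=I_K$.

```python
import numpy as np

def optimal_scores(X1, Y, K):
    """Top-K score vectors Theta from S = Y^T P_X Y, normalized so Theta^T D_p Theta = I_K."""
    N = Y.shape[0]
    P_X = X1 @ np.linalg.pinv(X1)            # projection onto the column space of X
    S = Y.T @ P_X @ Y / N
    d_p = Y.mean(axis=0)                     # class proportions N_j / N
    D_half_inv = np.diag(1.0 / np.sqrt(d_p))
    M = D_half_inv @ S @ D_half_inv          # symmetric, whitened eigenproblem
    evals, V = np.linalg.eigh(M)             # eigenvalues in ascending order
    idx = np.argsort(evals)[::-1][:K]        # keep the K largest
    Theta = D_half_inv @ V[:, idx]
    return Theta, evals[idx]
```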

 

 

Implementation

  1. Initialize: Form $Y$, the $N\times J$ indicator matrix (described above).
  2. Multivariate Regression: Set $\hat{Y}=P_XY$ and let $\mathbf{B}$ be the coefficient matrix in $\hat{Y}=X\mathbf{B}$.
  3. Optimal scores: Find the eigenvector matrix $\Theta$ of $Y^T\hat{Y}=Y^TP_XY$ with normalization $\Theta^TD_p\Theta=I$.
  4. Update: $\mathbf{B}\leftarrow\mathbf{B}\Theta$

The final regression fit is a $(J-1)$-dimensional vector function $\eta(x)=\mathbf{B}^Tx$. The canonical variates take the following form:

$$
U^Tx=D\mathbf{B}^Tx=D\eta(x), \;\;s.t. \;D_{kk}^2=\dfrac{1}{\alpha_k^2(1-\alpha_k^2)}
$$

 

    $\alpha_k^2$ is the $k$-th largest eigenvalue computed in step 3 (Optimal scores), and $\mathbf{B}$ is the coefficient matrix updated with $\Theta$, the eigenvector matrix of $Y^TP_XY$. $U^Tx$ gives the linear canonical variates, and $D\eta(x)$ is the corresponding discriminant-variate form. By replacing $X$ and $P_X$ with $h(X)$ and $P_{h(X)}=S(\lambda)$, we can extend this to a nonparametric version, which we call FDA.
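
A tiny NumPy helper (illustrative name) for the scaling matrix $D$ built from those eigenvalues:

```python
import numpy as np

def canonical_scaling(alpha_sq):
    """Diagonal D with D_kk^2 = 1 / (alpha_k^2 * (1 - alpha_k^2))."""
    alpha_sq = np.asarray(alpha_sq)
    return np.diag(1.0 / np.sqrt(alpha_sq * (1.0 - alpha_sq)))

# The canonical variates are then U^T x = D @ B.T @ x = D @ eta(x).
```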

Implementation

  1. Initialize: Form $\Theta_0$ such that $\Theta_0^TD_p\Theta_0=I$, and set $\Theta_0^*=Y\Theta_0$.
  2. Multivariate Nonparametric Regression: Fit $\hat{\Theta}_0^*=S(\lambda)\Theta_0^*$, giving $\eta(x)=\mathbf{B}^T h(x)$.
  3. Optimal scores: Find the eigenvector matrix $\Phi$ of $\Theta_0^{*T}\hat{\Theta}_0^*=\Theta_0^{*T}S(\hat{\lambda})\Theta_0^{*}$. The optimal scores are $\Theta=\Theta_0\Phi$.
  4. Update: $\eta(x)\leftarrow\Phi^T\eta(x)$

With this implementation we obtain $\Phi$ and update $\eta(x)$. The final $\eta(x)$ is used to compute the canonical distance $\delta(x,j)$, which is all that is needed for classification.

$$
\delta(x,j)=||D(\hat{\eta}(x)-\bar{\eta}(x)^j)||^2
$$
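
A minimal sketch of this rule (illustrative names; `eta_bar` is assumed to hold the class-wise means of the fitted $\eta$):

```python
import numpy as np

def classify_by_canonical_distance(eta_x, eta_bar, D):
    """Assign x to the class j minimizing ||D (eta_hat(x) - eta_bar^j)||^2."""
    dists = [np.sum((D @ (eta_x - eta_bar[j])) ** 2) for j in range(len(eta_bar))]
    return int(np.argmin(dists))
```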

Penalized Discriminant Analysis

$$
ASR(\{\theta_l,\beta_l\}^L_{l=1})=\dfrac{1}{N}\sum^L_{l=1}[\sum^N_{i=1}(\theta_l(g_i)-h^T(x_i)\beta_l)^2+\lambda\beta_l^T\Omega\beta_l]
$$

When we choose $\eta_l(x)=h^T(x)\beta_l$, the components are as follows:

  • $h^T(x_i)=[h_1^T(x_i) \; | \;h_2^T(x_i) \;| \; \cdots \; | \; h_p^T(x_i)]$, where each $h_j$ is a vector of up to $N$ natural-spline basis functions.
  • $S(\lambda)=H(H^TH+\Omega(\lambda))^{-1}H^T$ (see the sketch after this list)
  • $ASR_p(\Theta)=tr(\Theta^TY^T(I-S(\lambda))Y\Theta)/N$
  • $\Sigma_{wthn}+\Omega$: penalized within-group covariance of $h(x_i)$
  • $\Sigma_{btwn}$: between-group covariance of $h(x_i)$
  • Find $argmax_{u}\; u^T\Sigma_{btwn}u, \;s.t.\;u^T(\Sigma_{wthn}+\Omega)u=1$; the solution $u$ is a canonical variate.
  • $D(x,u)=(h(x)-h(u))^T(\Sigma_{wthn}+\lambda\Omega)^{-1}(h(x)-h(u))$
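
A hedged NumPy sketch of the smoother matrix $S(\lambda)$ from the list above, assuming a basis matrix `H`, a penalty matrix `Omega`, and $\Omega(\lambda)=\lambda\Omega$:

```python
import numpy as np

def penalized_smoother(H, Omega, lam):
    """S(lambda) = H (H^T H + lambda * Omega)^{-1} H^T."""
    return H @ np.linalg.solve(H.T @ H + lam * Omega, H.T)
```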

Reference

Hastie, T., Tibshirani, R., & Friedman, J. (2001). The Elements of Statistical Learning. New York, NY: Springer.
