Linear Classifier (1)
Linear method: Classification
Classification problem: in classification the target variable takes a categorical value. Let $G(x)$ denote the class of an observation, taking values $g_1,\dots,g_K$, and connect $Y$ and $G$ through zero-one indicator responses:
$$ Y_{ij}=I(G(x_i)=g_j) $$
Linear Model
- If each discriminant function $\delta_k(x)$ is linear, then the decision boundary $D$ is also linear.
- $E(Y_k|X=x)=P(G=g_k|X=x)$. When we estimate this with $x^T\beta$, the fitted values can fall outside the range $[0,1]$.
- $\min_{\mathbf{B}} \sum_i||y_i-[(1,x_i^T)\mathbf{B}]^T||^2$
- $\hat{G}(x)=argmax_{k \in \mathcal{G}} \hat{f}_k(x) \rightarrow \hat{G}(x)=argmin_{k}||\hat{f}(x)-t_k||^2$
- $\sum_k\hat{f}_k(x)=1$ (a code sketch of this indicator-regression classifier follows below)
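A minimal sketch of this indicator-matrix regression classifier, assuming NumPy, an $N\times p$ array `X`, and integer class labels `g` in $\{0,\dots,K-1\}$ (the array and function names are illustrative, not from the text):

```python
import numpy as np

def fit_indicator_regression(X, g, K):
    """Least-squares fit of the K-column indicator matrix Y on (1, x)."""
    N = X.shape[0]
    Y = np.zeros((N, K))
    Y[np.arange(N), g] = 1.0                    # Y_ik = I(g_i = k)
    X1 = np.hstack([np.ones((N, 1)), X])        # prepend an intercept column
    B, *_ = np.linalg.lstsq(X1, Y, rcond=None)  # minimize ||Y - X1 B||^2 column-wise
    return B

def predict_class(B, X):
    """Classify x to the class with the largest fitted value f_hat_k(x)."""
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])
    F = X1 @ B               # fitted values f_hat(x); each row sums to 1
    return np.argmax(F, axis=1)
```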
LDA & QDA
LDA
✏️ Goal: to know the class posterior $P(G|X)$.
✏️ Expression: $ P(G=k|X)=\dfrac{P(X|G=k)P(G=k)}{P(X)}=\dfrac{f_k(x)\pi_k}{\Sigma^K_{l=1}f_l(x)\pi_l} $
✏️ Assumption: The distribution within each class is multivariate normal with a common covariance matrix.
$$
(X|G=k) \sim N(\mu_k, \Sigma_k) \\
f_k(x)=\dfrac{1}{(2\pi)^{p/2}|\Sigma_k|^{1/2}}\exp\left\{-\dfrac{1}{2}(x-\mu_k)^T\Sigma^{-1}_k(x-\mu_k)\right\},
\;\; s.t. \;\; \Sigma_k=\Sigma \;\; \forall k $$
✏️ Unfolding
Because of the multivariate normality and the common covariance assumption, the log-ratio of posteriors unfolds easily.
$$ \begin{split} log\dfrac{P(G=k|X)}{P(G=l|X)} & = log\dfrac{f_k(x)}{f_l(x)}+log\dfrac{\pi_k}{\pi_l} \\ & = log\dfrac{\pi_k}{\pi_l}-\dfrac{1}{2}(\mu_k+\mu_l)^T\Sigma^{-1}(\mu_k-\mu_l) + x^T\Sigma^{-1}(\mu_k-\mu_l) \end{split} $$
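The key intermediate step is that the terms quadratic in $x$ cancel because both classes share $\Sigma$:
$$ \begin{split} log\dfrac{f_k(x)}{f_l(x)} & = -\dfrac{1}{2}(x-\mu_k)^T\Sigma^{-1}(x-\mu_k)+\dfrac{1}{2}(x-\mu_l)^T\Sigma^{-1}(x-\mu_l) \\ & = x^T\Sigma^{-1}(\mu_k-\mu_l)-\dfrac{1}{2}(\mu_k+\mu_l)^T\Sigma^{-1}(\mu_k-\mu_l) \end{split} $$
The normalizing constants cancel as well, since $|\Sigma_k|=|\Sigma_l|=|\Sigma|$.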
✏️ Classification
Decision boundary is as follows:
$$ D = \{x \,|\, P(G=k|X)=P(G=l|X)\} = \{x \,|\, \delta_k(x)=\delta_l(x)\} $$
$$ P(G=k|X)=P(G=l|X), \; \dfrac{P(G=k|X)}{P(G=l|X)}=1, \; log\dfrac{P(G=k|X)}{P(G=l|X)}=0 $$
$$ log\dfrac{\pi_k}{\pi_l}-\dfrac{1}{2}(\mu_k+\mu_l)^T\Sigma^{-1}(\mu_k-\mu_l) + x^T\Sigma^{-1}(\mu_k-\mu_l) =0 \\
x^T\Sigma^{-1}\mu_k-\dfrac{1}{2}\mu_k^T\Sigma^{-1}\mu_k+log\pi_k = x^T\Sigma^{-1}\mu_l-\dfrac{1}{2}\mu_l^T\Sigma^{-1}\mu_l+log\pi_l \\ \delta_k(x)=\delta_l(x) $$
✏️ Parameter estimation
In real data we cannot know the parameters of the normal distributions, so we estimate them:
$$
\hat{\pi}_k=N_k/N \\
\hat{\mu}_k=\sum_{g_i=k}x_i/N_k \\
\hat{\Sigma}=\sum^K_{k=1}\sum_{g_i=k}(x_i-\hat{\mu}_k)(x_i-\hat{\mu}_k)^T/(N-K)
$$
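A minimal sketch of these plug-in estimates and the resulting rule $\hat{G}(x)=argmax_k\,\delta_k(x)$, assuming NumPy arrays `X` ($N\times p$) and integer labels `g` in $\{0,\dots,K-1\}$ (names are illustrative):

```python
import numpy as np

def fit_lda(X, g, K):
    """Plug-in estimates: class priors, class means, pooled covariance."""
    N, p = X.shape
    pi_hat = np.array([np.mean(g == k) for k in range(K)])         # N_k / N
    mu_hat = np.array([X[g == k].mean(axis=0) for k in range(K)])  # class centroids
    Sigma_hat = np.zeros((p, p))
    for k in range(K):
        R = X[g == k] - mu_hat[k]
        Sigma_hat += R.T @ R                                       # within-class scatter
    Sigma_hat /= (N - K)                                           # pooled covariance
    return pi_hat, mu_hat, Sigma_hat

def lda_predict(X, pi_hat, mu_hat, Sigma_hat):
    """delta_k(x) = x^T S^{-1} mu_k - 0.5 mu_k^T S^{-1} mu_k + log(pi_k)."""
    Sinv = np.linalg.inv(Sigma_hat)
    lin = X @ Sinv @ mu_hat.T                                      # x^T S^{-1} mu_k for each k
    const = -0.5 * np.sum(mu_hat @ Sinv * mu_hat, axis=1) + np.log(pi_hat)
    return np.argmax(lin + const, axis=1)                          # argmax_k delta_k(x)
```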
✏️ Another Perspective
| Row | Height | Weight | Gender |
|-----|--------|--------|--------|
| #1  | 5.2    | 1.4    | 0      |
| #2  | 5.2    | 3.5    | 0      |
| #3  | 3.5    | 2.2    | 0      |
| #4  | 3.6    | 5.4    | 1      |
| #5  | 7.5    | 6.5    | 1      |
| #6  | 6.6    | 7.5    | 1      |
We can use a linear combination $a_1H+a_2W$ to split this table into two clusters.
$$ argmax_{a_1,a_2} E[a_1H+a_2W|G=0]-E[a_1H+a_2W|G=1] \\ s.t. \;Var[a_1H+a_2W]\leq constant $$
It can be more simply changed into this:
$$ argmax_{a_1,a_2} h(a_1,a_2), \;\; s.t. \;\;g(a_1,a_2)=c $$
$$ \mu_1=E[H|G=0]-E[H|G=1] \\ \mu_2=E[W|G=0]-E[W|G=1] \\ h(a_1,a_2)=(a_1 \;\; a_2)(\mu_1 \;\; \mu_2)^T \\ g(a_1,a_2) = a_1^2Var(H|G=0)+2a_1 a_2Cov(H,W|G=0)+a_2^2Var(W|G=0) $$
Using a Lagrange multiplier, this is solved; up to a scalar fixed by the constraint,
$$
\nabla h = \lambda \nabla g \\ (a_1 \;\; a_2)=(\mu_1 \;\; \mu_2)\begin{pmatrix}
COV(H,H) & COV(H,W) \\ COV(W,H) & COV(W,W)\end{pmatrix}^{-1}
$$
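A small sketch applying this result to the table above; as in the derivation, the constraint uses the covariance within the $G=0$ group (the variable names are illustrative):

```python
import numpy as np

# Height, Weight and Gender columns from the table above
H = np.array([5.2, 5.2, 3.5, 3.6, 7.5, 6.6])
W = np.array([1.4, 3.5, 2.2, 5.4, 6.5, 7.5])
G = np.array([0, 0, 0, 1, 1, 1])

X = np.column_stack([H, W])
mu = X[G == 0].mean(axis=0) - X[G == 1].mean(axis=0)   # (mu_1, mu_2): mean differences
cov0 = np.cov(X[G == 0], rowvar=False)                 # COV(H, W | G = 0)

# (a_1, a_2) = (mu_1, mu_2) COV^{-1}, up to the Lagrange-multiplier scale
a = mu @ np.linalg.inv(cov0)
print(a)        # the separating direction (a_1, a_2)
print(X @ a)    # projected scores a_1*H + a_2*W for each row
```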
QDA
✏️ Assumption
Like LDA, QDA assumes a multivariate normal distribution within each class, but without the common covariance assumption.
✏️ Classification
$$
D={x|\delta_k(x)=\delta_l(x)} \\ \delta_k(x)=-\dfrac{1}{2}log|\Sigma_k|-\dfrac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k)+log\pi_k
$$
This expression is similar to the LDA expression, but the quadratic term in $x$ remains.
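A per-class sketch of $\delta_k(x)$ with no pooling, assuming the class parameters `pi_k`, `mu_k`, `Sigma_k` have already been estimated (illustrative names); classify $x$ to the class with the largest value:

```python
import numpy as np

def qda_delta(x, pi_k, mu_k, Sigma_k):
    """delta_k(x) = -0.5 log|S_k| - 0.5 (x-mu_k)^T S_k^{-1} (x-mu_k) + log(pi_k)."""
    d = x - mu_k
    _, logdet = np.linalg.slogdet(Sigma_k)           # log|Sigma_k|, numerically stable
    return -0.5 * logdet - 0.5 * d @ np.linalg.solve(Sigma_k, d) + np.log(pi_k)
```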
Generalized Discriminant Analysis
Generalized LDA
The strength of LDA is its simplicity.
- It is a simple prototype classifier: a point is classified into the class with the closest centroid, where distance is measured with the Mahalanobis metric using a pooled covariance estimate.
- The decision boundary is linear, so it provides a simple description and implementation. LDA is also informative because it provides a low-dimensional view.
The weakness of LDA is also its simplicity.
- Two prototypes (a class centroid and a common covariance matrix) are often not enough to describe the data.
- A linear decision boundary can't adequately separate the data into classes.
- When many features are used, the LDA estimates have high variance and performance degrades. In this case we need to restrict or regularize LDA.
Flexible Discriminant Analysis
Definition
This is devised for nonlinear classification.
$$
\min_{\beta, \theta}\sum^N_{i=1}(\theta(g_i)-x_i^T\beta)^2
$$
$g_i$ is the label of the $i$th observation, and $\theta$ is a function mapping $G\rightarrow \mathbb{R}^1$: $G$ is a categorical variable, and $\theta$ assigns each categorical class label a quantitative value (score). We call $\theta(g_i)$ the transformed class labels, which can be predicted by linear regression.
$$
ASR=\dfrac{1}{N}\sum^L_{l=1}[\sum^N_{i=1}(\theta_l(g_i)-x_i^T\beta_l)^2]
$$
$\theta_l$ and $\beta_l$ are chosen to minimize the Average Squared Residual (ASR).
Matrix Notation
- $Y$ is an $N\times J$ indicator matrix: $Y_{ij}=1$ if the $i$th observation falls into the $j$th class.
- $\Theta$ is a $J \times K$ matrix whose columns are $K$ score vectors for the $J$ classes.
- $\Theta^*=Y\Theta$ is the matrix of transformed class labels.
- $ASR(\Theta)=tr(\Theta^{*T}(I-P_X)\Theta^{*})/N = tr(\Theta^T Y^T(I-P_X)Y\Theta)/N$, where $P_X$ is the projection onto the column space of $X$.
For reference, $\sum^N_{i=1}(y_i-\hat{y}_i)^2=y^T(I-P_X)y$. If the scores have mean zero, unit variance, and are uncorrelated over the $N$ observations ($\Theta^{*T}\Theta^{*}/N=\Theta^TD_p\Theta=I_K$), minimizing $ASR(\Theta)$ amounts to finding the $K$ largest eigenvectors of $Y^TP_XY$, collected in $\Theta$, under the normalization $\Theta^TD_p\Theta=I_K$, where $D_p=Y^TY/N$ is the diagonal matrix of class proportions.
$$
\min_\Theta tr(\Theta^TY^T(I-P_X)Y\Theta)/N=\min_{\Theta}[tr(\Theta^TY^TY\Theta)/N-tr(\Theta^TY^TP_XY\Theta)/N] \\ =\min_\Theta [K-tr(\Theta^{T}Y^TP_XY\Theta)/N]=\max_\Theta tr(\Theta^{T}S\Theta)/N, \;\; s.t. \;\; S=Y^TP_XY
$$
The $\Theta$ maximizing the trace consists of the $K$ largest eigenvectors of $S$, by the Courant–Fischer characterization of eigenvalues. Finally, we can find an optimal $\Theta$.
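A sketch of this eigen-computation under the normalization $\Theta^TD_p\Theta=I_K$ (with $D_p=Y^TY/N$), assuming NumPy, the indicator matrix `Y`, and a design matrix `X`; the projection $P_XY$ is computed via least squares rather than forming $P_X$ explicitly (names are illustrative):

```python
import numpy as np

def optimal_scores(X, Y, K):
    """K largest eigenvectors of Y^T P_X Y under Theta^T D_p Theta = I_K."""
    N = Y.shape[0]
    X1 = np.hstack([np.ones((N, 1)), X])                 # include an intercept
    Yhat = X1 @ np.linalg.lstsq(X1, Y, rcond=None)[0]    # P_X Y
    S = Y.T @ Yhat / N                                   # Y^T P_X Y / N
    Dp = np.diag(Y.mean(axis=0))                         # class proportions = Y^T Y / N
    Dh = np.diag(1.0 / np.sqrt(np.diag(Dp)))             # D_p^{-1/2}
    w, V = np.linalg.eigh(Dh @ S @ Dh)                   # symmetric eigenproblem
    order = np.argsort(w)[::-1][:K]                      # K largest eigenvalues alpha_k^2
    Theta = Dh @ V[:, order]                             # back-transform: Theta^T Dp Theta = I
    return Theta, w[order]
```

Because the intercept is included, the leading eigenvalue is 1 with a trivial constant score vector, which is usually discarded.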
Implementation
- Initialize: Form $Y$, $N\times J$ indicator matrix.(described above)
- Multivariate Regression: Set $\hat{Y}=P_XY$ and let $\mathbf{B}$ be the coefficient matrix, i.e. $\hat{Y}=X\mathbf{B}$
- Optimal scores: Find the eigenvector matrix $\Theta$ of $Y^T\hat{Y}=Y^TP_XY$ with normalization $\Theta^TD_p\Theta=I$
- Update: $\mathbf{B}\leftarrow\mathbf{B}\Theta$
The final regression fit is a $(J-1)$ vector function $\eta(x)=\mathbf{B}^Tx$. The canonical variates form is as follows.
$$
U^Tx=D\mathbf{B}^Tx=D\eta(x), \;\;s.t. \;D_{kk}^2=\dfrac{1}{\alpha_k^2(1-\alpha_k^2)}
$$
$\alpha_k^2$ is the $k$th largest eigenvalue computed in step 3 (Optimal scores). We update the coefficient matrix $\mathbf{B}$ using $\Theta$, the eigenvector matrix of $Y^TP_XY$. $U^Tx$ gives the linear canonical variates, and $D\eta(x)$ becomes a nonparametric version of these discriminant variates once $\eta$ is fit nonparametrically. By replacing $X, P_X$ with $h(X), P_{h(X)}=S(\lambda)$ we can extend this to a nonparametric version, which we call FDA.
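A small sketch of this scaling step (illustrative names: `B` and `Theta` are the regression coefficient and score matrices from the steps above, `alpha_sq` the nontrivial eigenvalues strictly between 0 and 1, and `Xmat` the same design matrix used in the regression):

```python
import numpy as np

def canonical_variates(B, Theta, alpha_sq, Xmat):
    """Rows are (D * eta(x_i))^T = (U^T x_i)^T after the update B <- B Theta."""
    B_upd = B @ Theta                                        # step 4: update B
    eta = Xmat @ B_upd                                       # eta(x_i)^T in each row
    D = np.diag(1.0 / np.sqrt(alpha_sq * (1.0 - alpha_sq)))  # D_kk^2 = 1/(alpha_k^2 (1-alpha_k^2))
    return eta @ D                                           # canonical variates
```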
Implementation
- Initialize: Form $\Theta_0$ s.t. $\Theta_0^TD_p\Theta_0=I$, and set $\Theta_0^*=Y\Theta_0$
- Multivariate Nonparametric Regression: Fit $\hat{\Theta}_0^*=S(\lambda)\Theta_0^*$, giving $\eta(x)=\mathbf{B}^T h(x)$
- Optimal scores: Find the eigenvector matrix $\Phi$ of $\Theta_0^{*T}\hat{\Theta}_0^*=\Theta_0^{*T}S(\lambda)\Theta_0^{*}$. The optimal score matrix is $\Theta=\Theta_0\Phi$
- Update: $\eta(x)\leftarrow\Phi^T\eta(x)$
With this implementation we obtain $\Phi$ and update $\eta(x)$. The final $\eta(x)$ is used to compute the canonical distance $\delta(x,j)$, which is all we need for classification.
$$
\delta(x,j)=||D(\hat{\eta}(x)-\bar{\eta}^j)||^2
$$
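A minimal sketch of this rule, assuming the scaling matrix `D` and the per-class means `eta_bar` (rows $\bar{\eta}^j$) of the fitted $\eta$ are available (illustrative names):

```python
import numpy as np

def classify_by_canonical_distance(eta_x, eta_bar, D):
    """Assign x to the class j minimizing delta(x, j) = ||D (eta_hat(x) - eta_bar_j)||^2."""
    diffs = eta_x[None, :] - eta_bar          # one row per class j
    dists = np.sum((diffs @ D) ** 2, axis=1)  # squared canonical distances
    return int(np.argmin(dists))
```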
Penalized Discriminant Analysis
$$
ASR(\{\theta_l,\beta_l\}^L_{l=1})=\dfrac{1}{N}\sum^L_{l=1}\left[\sum^N_{i=1}(\theta_l(g_i)-h^T(x_i)\beta_l)^2+\lambda\beta_l^T\Omega\beta_l\right]
$$
When we choose $\eta_l(x)=h(x)^T\beta_l$, the components are as follows (a short code sketch follows after the list):
- $h^T(x_i)=[h_1^T(x_i) \; | \;h_2^T(x_i) \;| \; \cdots \; | \; h_p^T(x_i)]$, where each $h_j$ can be defined as a vector of up to $N$ natural-spline basis functions.
- $S(\lambda)=H(H^TH+\Omega(\lambda))^{-1}H^T$
- $ASR_p(\Theta)=tr(\Theta^TY^T(I-S(\lambda))Y\Theta)/N$
- $\Sigma_{wthn}+\Omega$: penalized within-group covariance of $h(x_i)$
- $\Sigma_{btwn}$: between-group covariance of $h(x_i)$
- Find $argmax_{u} u^T\Sigma_{btwn}u, \;s.t.\;u^T(\Sigma_{wthn}+\Omega)u=1$, $u$ becomes a canonical variate.
- $D(x,u)=(h(x)-h(u))^T(\Sigma_{wthn}+\lambda\Omega)^{-1}(h(x)-h(u))$ is the penalized Mahalanobis distance.
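A short sketch of the penalized smoother and penalized ASR from the bullets above, assuming the basis matrix `H` (rows $h(x_i)^T$), the penalty matrix `Omega`, and the indicator matrix `Y` are given, with $\Omega(\lambda)=\lambda\Omega$ (names are illustrative):

```python
import numpy as np

def penalized_smoother(H, Omega, lam):
    """S(lambda) = H (H^T H + lambda * Omega)^{-1} H^T, the penalized hat matrix."""
    return H @ np.linalg.solve(H.T @ H + lam * Omega, H.T)

def penalized_asr(Theta, Y, H, Omega, lam):
    """ASR_p(Theta) = tr(Theta^T Y^T (I - S(lambda)) Y Theta) / N."""
    N = Y.shape[0]
    R = Y - penalized_smoother(H, Omega, lam) @ Y     # (I - S(lambda)) Y
    return np.trace(Theta.T @ Y.T @ R) / N
```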
Reference
Hastie, T., Tibshirani, R., & Friedman, J. (2001). The Elements of Statistical Learning. New York, NY: Springer.