Weighting procedure

statduck 2022. 5. 27. 13:43

    To estimate unknown parameters such as the mean and variance, we sample a data set from a target population. After sampling, we should check whether this set represents the target population well. If it does not, we need to adjust our estimator.

    Let's see how to adjust the estimator. Before adjusting it, we first need to construct an estimating equation whose solution gives a consistent estimator.

 

$$ \hat{U}_n(\theta) = \dfrac{1}{n} \sum^n_{i=1} U(\theta ; z_i) = 0, \quad z_i = (x_i, y_i) \; \text{a random sample} $$

Example

 

    Let $\theta$ be $\mu$ (the population mean), and let the estimating function be $U(\theta ; x_i) = x_i - \theta$. Normally, to find the estimating equation we use a score function $ S(\theta) = \dfrac{\partial l(\theta)}{\partial \theta} $ (where $l(\theta)$ is the log-likelihood of $\theta$). In this case the equation is $ \dfrac{1}{n} \sum^n_{i=1} (x_i - \theta) = 0 $, whose solution is $\hat{\theta} = \bar{x}$.
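
As a quick numerical check (a minimal sketch of my own, not from the original post; the data and seed are arbitrary), solving the empirical estimating equation with a root finder returns exactly the sample mean:

```python
# Minimal sketch: solve (1/n) * sum_i U(theta; x_i) = 0 with U(theta; x) = x - theta
# numerically and confirm that the root equals the sample mean.
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=500)   # hypothetical sample

def U_hat(theta, x):
    """Empirical estimating function (1/n) * sum_i (x_i - theta)."""
    return np.mean(x - theta)

theta_hat = brentq(U_hat, a=x.min(), b=x.max(), args=(x,))
print(theta_hat, x.mean())   # the two values coincide
```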

CC (Complete Case method)

    We define $\delta_i$ as a group indicator, which can be interpreted in two ways:

- As a binary membership indicator for unit $i$: it partitions our sample $A$ into $A_1 = \{i \in A \mid \delta_i = 1\}$ and $A_0 = \{i\in A \mid \delta_i =0 \}$.

- As a sampling indicator: $\delta_i=1$ if $i\in A$ and $\delta_i=0$ otherwise.

 

    The Complete Case method uses only the fully observed data. In other words, we adjust our estimating equation to $ \sum ^n_{i=1} \delta_i U(\theta ; z_i) = 0 $. Because this uses only the information from units with $\delta_i=1$, it is not efficient.

WCC (Weighted Complete Case method)

    The method above may also be biased, because bias arises when we use only part of our target population. To correct this bias, the estimating equation must reflect weighted information.

$$ \hat{U}_W(\theta) = \dfrac{1}{n} \sum^n_{i=1} \dfrac{\delta_i}{\pi_i} U(\theta ; z_i) = 0, \quad \text{where} \; \pi_i = P(\delta_i = 1 \mid z_i)$$

In the above equation, $\pi_i$ is often called the **propensity score**. Suppose the $k$-th unit is very easy to pick, so that $\pi_k$ is much larger than for the other units. To correct for this, the influence of the $k$-th unit must be diminished by multiplying by $1/\pi_k$ when solving the equation.
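
A small simulation sketch (my own illustrative setup, not from the post) shows the correction at work: units with larger $y$ are observed more often, so the complete-case mean is biased upward, while the weighted estimating equation recovers the target mean.

```python
# Sketch: complete-case (CC) mean vs. inverse-probability-weighted (WCC) mean
# when the probability of being observed depends on y (propensities assumed known).
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
y = rng.normal(loc=0.0, scale=1.0, size=n)     # target variable, true mean 0
pi = 1.0 / (1.0 + np.exp(-y))                  # pi_i = P(delta_i = 1 | y_i), assumed known
delta = rng.binomial(1, pi)                    # observation indicator

cc_mean = y[delta == 1].mean()                 # complete-case estimate (biased upward)
# Solving (1/n) * sum_i (delta_i / pi_i) * (y_i - theta) = 0 gives:
w = delta / pi
wcc_mean = np.sum(w * y) / np.sum(w)
print(f"CC: {cc_mean:.3f}  weighted: {wcc_mean:.3f}  (true mean: 0.0)")
```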

    Let $\hat{\theta}_W$ be the estimator derived from the weighted estimating equation. Is this $\hat{\theta}_W$ really better than $\hat{\theta}$?

- Asymptotically unbiased

    To check which estimator is better, we use three criteria: **consistency, unbiasedness, and variance.**

    Usually, the two estimators $\hat{\theta}_W$ and $\hat{\theta}$ are both consistent, so we only need to check the other two criteria: unbiasedness and variance.

$$ \hat{U}_W(\hat{\theta}_W) \simeq \hat{U}_W (\theta_0) + \dfrac{\partial}{\partial \theta} \hat{U}_w (\theta_0) (\hat{\theta}_W - \theta_0) = 0 $$

$$ \hat{\theta}_W - \theta_0 \simeq - \Big[ \dfrac{\partial \hat{U}_W(\theta_0)}{\partial \theta} \Big]^{-1} \hat{U}_W (\theta_0) \simeq - \Big[ E \Big( \dfrac{\partial \hat{U}_W (\theta_0)}{\partial \theta} \Big) \Big]^{-1} \hat{U}_W(\theta_0)= -\Big[ E \big( \dot{U} (\theta_0 ; z) \big) \Big]^{-1}\hat{U}_W(\theta_0)$$

$$ E(\hat{U}_W(\theta_0)) = E[E(\hat{U}_W(\theta_0) \mid z)] = E\Big[ \dfrac{1}{n} \sum \dfrac{E(\delta_i \mid z_i)}{\pi_i} U(\theta_0 ; z_i) \Big] = E\Big[\dfrac{1}{n} \sum U(\theta_0 ; z_i)\Big] = E(\hat{U}_n(\theta_0)) = 0, \\ \text{where} \; \hat{U}_n(\theta) = n^{-1} \sum^n_{i=1} U(\theta; z_i) $$

    Because $E(\hat{\theta}_W) \simeq \theta_0$, $\hat{\theta}_W$ is an asymptotically unbiased estimator of $\theta_0$.

- Asymptotic variance

    First, we assume the sampling indicators are independent across units of the finite population, so $\mathrm{Cov}(\delta_i, \delta_j)=0$ for $i\neq j$.

$$ V(\hat{\theta}_W) \simeq \tau^{-1} V \Big \{ \hat{U}_W (\theta_0) \Big\} \tau ^{-1'},\; \tau = E\{ \dot{U} (\theta_0;Z) \}$$

$$ \begin{align} V \Big \{ \hat{U}_W (\theta_0) \Big \} &= V\big(E(\hat{U}_W(\theta_0) \mid Z )\big) + E\big(V(\hat{U}_W(\theta_0) \mid Z)\big) \\ &= V\big(\hat{U}_n(\theta_0)\big) + E\Big(\dfrac{1}{n^2} \sum^n_{i=1} \dfrac{V(\delta_i \mid z_i)}{\pi_i^2} U(\theta_0 ; z_i)^{\otimes 2}\Big) \\ &= V\big(\hat{U}_n(\theta_0)\big) + E\Big(n^{-2} \sum^n_{i=1} (\pi_i^{-1} - 1) U (\theta_0 ; z_i)^{\otimes 2}\Big) \\ & \simeq E\Big(n^{-2} \sum^n_{i=1} \pi_i^{-1} U(\theta_0 ; z_i)^{\otimes 2}\Big) \end{align} $$
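
A rough Monte Carlo sketch (my own illustration, with hypothetical known propensities) can check this formula for the sample-mean example: there $U(\theta; x) = x - \theta$ and $\tau = -1$, so $V(\hat{\theta}_W) \simeq E\big(n^{-2}\sum_i \pi_i^{-1}(x_i - \theta_0)^2\big)$.

```python
# Sketch: compare the empirical variance of the weighted estimator over many
# replicates against the asymptotic formula n^-2 * sum_i pi_i^-1 (x_i - theta_0)^2.
import numpy as np

rng = np.random.default_rng(2)
n, reps, theta_0 = 500, 5_000, 0.0

def one_replicate():
    x = rng.normal(theta_0, 1.0, size=n)
    pi = 0.2 + 0.6 * (x > 0)                 # hypothetical known propensities
    delta = rng.binomial(1, pi)
    w = delta / pi
    theta_w = np.sum(w * x) / np.sum(w)      # solution of the weighted estimating equation
    v_formula = np.sum((x - theta_0) ** 2 / pi) / n**2
    return theta_w, v_formula

draws = np.array([one_replicate() for _ in range(reps)])
print("empirical variance:", draws[:, 0].var())
print("asymptotic formula:", draws[:, 1].mean())   # the two should be close
```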

- How to estimate $\pi$?

    In survey sampling, the $\pi_i$ are usually known from the sampling design. In the general case, however, we need to estimate the $\pi_i$, which are called propensity scores or first-order inclusion probabilities.

    The first approach is logistic modeling, in which the probability that unit $i$ responds is modeled as

$$ \pi_i= P(\delta_i=1 \mid x_i,\beta) = \{ 1+ \exp(-x_i^T\beta) \} ^{-1} $$

$$ \begin{align} l(\beta) &= \sum^n_{i=1} \Big\{ \delta_i \log \pi_i + (1-\delta_i) \log(1-\pi_i) \Big\} \\ &= \sum^n_{i=1} \Big\{ \delta_i \beta^T x_i - \log(1+e^{\beta^T x_i})\Big\} \end{align} $$

$$ \dfrac{\partial l(\beta)}{\partial \beta} = S(\beta) = \sum^n_{i=1} x_i(\delta_i - \pi_i) = \sum^n_{i=1} \pi_i(\dfrac{\delta_i}{\pi_i} - 1 ) x_i$$

$$ U(\beta) = \sum_{i \in A} w_i\Big( \dfrac{\delta_i}{\pi_i} -1 \Big) x_i = 0 $$
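
The following sketch (my own setup; the data-generating process and the use of scikit-learn are assumptions, not from the post) fits the logistic propensity model and plugs the estimated $\hat{\pi}_i$ into the weighted estimating equation for a mean, anticipating the propensity score estimator below.

```python
# Sketch: estimate pi_i with logistic regression, then use delta_i / pi_hat_i weights.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n = 20_000
x = rng.normal(size=n)                          # always-observed covariate
y = 1.0 + 2.0 * x + rng.normal(size=n)          # outcome, seen only when delta = 1
pi_true = 1.0 / (1.0 + np.exp(-(0.5 + x)))      # ignorable case: P(delta = 1 | x)
delta = rng.binomial(1, pi_true)

# Fit the logistic model pi_i = {1 + exp(-x_i^T beta)}^-1 for P(delta = 1 | x).
model = LogisticRegression().fit(x.reshape(-1, 1), delta)
pi_hat = model.predict_proba(x.reshape(-1, 1))[:, 1]

w = delta / pi_hat                              # weights from estimated propensities
ps_mean = np.sum(w * y) / np.sum(w)             # propensity-score-weighted estimate of E[y]
cc_mean = y[delta == 1].mean()                  # complete-case estimate (biased)
print(f"true E[y] = 1.0 | complete-case {cc_mean:.3f} | PS-weighted {ps_mean:.3f}")
```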

Propensity score

$$ \hat{U}_{PS}(\theta) = \dfrac{1}{n} \sum^n_{i=1} \dfrac{\delta_i}{\hat{\pi}_i} U(\theta ; z_i) = 0, \quad \hat{\pi}_i = \pi(z_i ; \hat{\phi}), \; \hat{\phi} = \text{the MLE of} \; \phi_0 $$

$$ S(\phi)=0$$

- Non-ignorable case, $P(\delta=1 \mid y,x)$: to estimate the propensity score, we need to solve these two equations simultaneously.
- Ignorable case, $P(\delta=1 \mid x)$: the joint density factorizes as

$$ f(y_i, x_i, \delta_i) = f_1(\delta_i | y_i, x_i) f_2(y_i|x_i) $$

 

 

Example

Gaussian Mixture

$$
X \mid Z=e_k \sim N(\mu_k,\Sigma_k), \;\; \text{where} \; \mu_1,\ldots,\mu_K\in \mathbb{R}^d, \; \Sigma_1,\ldots,\Sigma_K\in \mathbb{R}^{d \times d}
$$

$$Z \in \{e_1,e_2,\cdots,e_K\}, \quad e_1=[1 \;0 \; \cdots \; 0]^T$$

$$p(X \mid Z=e_k)=N(X \mid \mu_k,\Sigma_k)= \text{the pdf of the normal distribution}$$

$$p(Z=e_k)=\pi_k = \text{the probability of selecting each cluster}$$

$$p(X)= \sum_z p(Z) \times p(X|Z)=\sum^K_{k=1} \pi_k \times N(X|\mu_k,\Sigma_k)$$

$$p(X,Z)=p(Z) \times p(X|Z)=\pi_kN(X|\mu_k,\Sigma_k)$$

$$p(Z=e_k \mid X)=\dfrac{p(Z) \times p(X \mid Z)}{p(X)}=\dfrac{\pi_k \times N(X \mid \mu_k,\Sigma_k)}{\sum^K_{j=1} \pi_j \times N(X \mid \mu_j,\Sigma_j)}=r(z_{nk})$$

 

 

Algorithm[EM Algorithm]

 

Initialization 

Initialize the means $\mu_k$, covariances $\Sigma_k$, and mixing coefficients $\pi_k$.

 

E step

 Evaluate $$r(z_{nk})=\dfrac{\pi_k N(x_n|\mu_k,\Sigma_k)}{\sum^K_{j=1} \pi_j N(x_n|\mu_j, \Sigma_j)}$$

 

M step 

Re-estimate the parameters using $r(z_{nk})$

$$\mu_k^{new}=\dfrac{1}{N_k}\sum^N_{n=1}r(z_{nk})x_n$$

$$\Sigma_k^{new}=\dfrac{1}{N_k}\sum^N_{n=1}r(z_{nk})(x_n-\mu_k^{new})(x_n-\mu_k^{new})^T$$

$$\pi_k^{new}=\dfrac{N_k}{N}, \quad \text{where} \; N_k=\sum^N_{n=1}r(z_{nk})$$

 

Evaluation 

Evaluate the log likelihood

$$\ln p(X \mid \mu,\Sigma,\pi)=\sum^N_{n=1}\ln \Big\{\sum^K_{k=1}\pi_k N(x_n \mid \mu_k,\Sigma_k)\Big\}$$
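
The steps above can be collected into a compact NumPy/SciPy sketch (my own illustration; the function name `em_gmm` and the initialization scheme are assumptions, not from the post):

```python
# Sketch of the EM algorithm for a Gaussian mixture, following the steps above.
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # Initialization: random data points as means, identity covariances, uniform weights.
    mu = X[rng.choice(N, K, replace=False)]
    Sigma = np.stack([np.eye(d)] * K)
    pi = np.full(K, 1.0 / K)

    for _ in range(n_iter):
        # E step: responsibilities r[n, k] proportional to pi_k * N(x_n | mu_k, Sigma_k).
        dens = np.column_stack([
            pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
            for k in range(K)
        ])
        log_lik = np.log(dens.sum(axis=1)).sum()   # log likelihood at current parameters
        r = dens / dens.sum(axis=1, keepdims=True)

        # M step: re-estimate the parameters with the responsibilities.
        Nk = r.sum(axis=0)                         # N_k = sum_n r(z_nk)
        mu = (r.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (r[:, k, None] * diff).T @ diff / Nk[k]
        pi = Nk / N

    return mu, Sigma, pi, log_lik
```

Calling `em_gmm(X, K=3)` on an $N \times d$ data matrix returns the fitted means, covariances, mixing weights, and the log likelihood from the last iteration.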
