Weighting procedure

statduck 2022. 5. 27. 13:43

    To estimate unknown parameters such as the mean and variance, we sample a data set from a target population. After sampling, we should check whether this set represents the target population well. If it does not, we need to adjust our estimator.

    Let's see how to adjust the estimator. Before adjusting it, we first need to construct an estimating equation whose solution gives a consistent estimator.

 

$$ \hat{U}_n(\theta) = \dfrac{1}{n} \sum^n_{i=1} U(\theta ; z_i) = 0, \quad z_i = (x_i, y_i) \; \text{a random sample} $$

Example

 

    Let $\theta$ be $\mu$ (the population mean), and let the estimating function be $U(\theta ; x_i) = x_i - \theta$. Normally, to find the estimating equation we use a score function $ S(\theta) = \dfrac{\partial l(\theta)}{\partial \theta} $ (where $l(\theta)$ is the log-likelihood of $\theta$). In this case the equation is $ \dfrac{1}{n} \sum^n_{i=1} (x_i - \theta) = 0 $, whose solution is $\hat{\theta} = \bar{x}$.
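
As a quick numerical check (a minimal sketch of my own, not from the original post; the data and seed are arbitrary), solving the empirical estimating equation with a root finder returns exactly the sample mean:

```python
# Minimal sketch: solve (1/n) * sum_i U(theta; x_i) = 0 with U(theta; x) = x - theta
# numerically and confirm that the root equals the sample mean.
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=500)   # hypothetical sample

def U_hat(theta, x):
    """Empirical estimating function (1/n) * sum_i (x_i - theta)."""
    return np.mean(x - theta)

theta_hat = brentq(U_hat, a=x.min(), b=x.max(), args=(x,))
print(theta_hat, x.mean())   # the two values coincide
```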

CC (Complete Case method)

    We define $\delta_i$ as a group indicator, which can be interpreted in two ways:

- As a binary membership indicator for unit $i$: it partitions our sample $A$ into $A_1 = \{i \in A \mid \delta_i = 1\}$ and $A_0 = \{i\in A \mid \delta_i =0 \}$.

- As a sampling indicator: $\delta_i=1$ if $i\in A$ and $\delta_i=0$ otherwise.

 

    The Complete Case method uses only the fully observed data. In other words, we adjust our estimating equation to $ \sum ^n_{i=1} \delta_i U(\theta ; z_i) = 0 $. Because this uses only the information from units with $\delta_i=1$, it is not efficient.

WCC (Weighted Complete Case method)

    The method above may also be biased, because bias arises when we use only part of our target population. To correct this bias, the estimating equation must reflect weighted information.

$$ \hat{U}_W(\theta) = \dfrac{1}{n} \sum^n_{i=1} \dfrac{\delta_i}{\pi_i} U(\theta ; z_i) = 0, \quad \text{where} \; \pi_i = P(\delta_i = 1 \mid z_i)$$

In the above equation, $\pi_i$ is often called the **propensity score**. Suppose the $k$-th unit is very easy to pick, so that $\pi_k$ is much larger than for the other units. To correct for this, the influence of the $k$-th unit must be diminished by multiplying by $1/\pi_k$ when solving the equation.
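
A small simulation sketch (my own illustrative setup, not from the post) shows the correction at work: units with larger $y$ are observed more often, so the complete-case mean is biased upward, while the weighted estimating equation recovers the target mean.

```python
# Sketch: complete-case (CC) mean vs. inverse-probability-weighted (WCC) mean
# when the probability of being observed depends on y (propensities assumed known).
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
y = rng.normal(loc=0.0, scale=1.0, size=n)     # target variable, true mean 0
pi = 1.0 / (1.0 + np.exp(-y))                  # pi_i = P(delta_i = 1 | y_i), assumed known
delta = rng.binomial(1, pi)                    # observation indicator

cc_mean = y[delta == 1].mean()                 # complete-case estimate (biased upward)
# Solving (1/n) * sum_i (delta_i / pi_i) * (y_i - theta) = 0 gives:
w = delta / pi
wcc_mean = np.sum(w * y) / np.sum(w)
print(f"CC: {cc_mean:.3f}  weighted: {wcc_mean:.3f}  (true mean: 0.0)")
```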

    Let $\hat{\theta}_W$ be the estimator derived from the weighted estimating equation. Is this $\hat{\theta}_W$ really better than $\hat{\theta}$?

- Asymptotically unbiased

    To check which estimator is better, we use three criteria: **consistency, unbiasedness, and variance.**

    Usually, the two estimators $\hat{\theta}_W$ and $\hat{\theta}$ are both consistent, so we only need to check the other two criteria: unbiasedness and variance.

$$ \hat{U}_W(\hat{\theta}_W) \simeq \hat{U}_W (\theta_0) + \dfrac{\partial}{\partial \theta} \hat{U}_w (\theta_0) (\hat{\theta}_W - \theta_0) = 0 $$

$$ \hat{\theta}_W - \theta_0 \simeq - \Big[ \dfrac{\partial \hat{U}_W(\theta_0)}{\partial \theta} \Big]^{-1} \hat{U}_W (\theta_0) \simeq - \Big[ E \Big( \dfrac{\partial \hat{U}_W (\theta_0)}{\partial \theta} \Big) \Big]^{-1} \hat{U}_W(\theta_0)= -\Big[ E \big( \dot{U} (\theta_0 ; z) \big) \Big]^{-1}\hat{U}_W(\theta_0)$$

$$ E(\hat{U}_W(\theta_0)) = E[E(\hat{U}_W(\theta_0) \mid z)] = E\Big[ \dfrac{1}{n} \sum \dfrac{E(\delta_i \mid z_i)}{\pi_i} U(\theta_0 ; z_i) \Big] = E\Big[\dfrac{1}{n} \sum U(\theta_0 ; z_i)\Big] = E(\hat{U}_n(\theta_0)) = 0, \\ \text{where} \; \hat{U}_n(\theta) = n^{-1} \sum^n_{i=1} U(\theta; z_i) $$

    Because $E(\hat{\theta}_W) \simeq \theta_0$, $\hat{\theta}_W$ is an asymptotically unbiased estimator of $\theta_0$.

- Asymptotic variance

    First, we assume the sampling indicators are independent across units of the finite population, so $\mathrm{Cov}(\delta_i, \delta_j)=0$ for $i\neq j$.

$$ V(\hat{\theta}_W) \simeq \tau^{-1} V \Big \{ \hat{U}_W (\theta_0) \Big\} \tau ^{-1'},\; \tau = E\{ \dot{U} (\theta_0;Z) \}$$

$$ \begin{align} V \Big \{ \hat{U}_W (\theta_0) \Big \} &= V\big(E(\hat{U}_W(\theta_0) \mid Z )\big) + E\big(V(\hat{U}_W(\theta_0) \mid Z)\big) \\ &= V\big(\hat{U}_n(\theta_0)\big) + E\Big(\dfrac{1}{n^2} \sum^n_{i=1} \dfrac{V(\delta_i \mid z_i)}{\pi_i^2} U(\theta_0 ; z_i)^{\otimes 2}\Big) \\ &= V\big(\hat{U}_n(\theta_0)\big) + E\Big(n^{-2} \sum^n_{i=1} (\pi_i^{-1} - 1) U (\theta_0 ; z_i)^{\otimes 2}\Big) \\ & \simeq E\Big(n^{-2} \sum^n_{i=1} \pi_i^{-1} U(\theta_0 ; z_i)^{\otimes 2}\Big) \end{align} $$
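
A rough Monte Carlo sketch (my own illustration, with hypothetical known propensities) can check this formula for the sample-mean example: there $U(\theta; x) = x - \theta$ and $\tau = -1$, so $V(\hat{\theta}_W) \simeq E\big(n^{-2}\sum_i \pi_i^{-1}(x_i - \theta_0)^2\big)$.

```python
# Sketch: compare the empirical variance of the weighted estimator over many
# replicates against the asymptotic formula n^-2 * sum_i pi_i^-1 (x_i - theta_0)^2.
import numpy as np

rng = np.random.default_rng(2)
n, reps, theta_0 = 500, 5_000, 0.0

def one_replicate():
    x = rng.normal(theta_0, 1.0, size=n)
    pi = 0.2 + 0.6 * (x > 0)                 # hypothetical known propensities
    delta = rng.binomial(1, pi)
    w = delta / pi
    theta_w = np.sum(w * x) / np.sum(w)      # solution of the weighted estimating equation
    v_formula = np.sum((x - theta_0) ** 2 / pi) / n**2
    return theta_w, v_formula

draws = np.array([one_replicate() for _ in range(reps)])
print("empirical variance:", draws[:, 0].var())
print("asymptotic formula:", draws[:, 1].mean())   # the two should be close
```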

- How to estimate $\pi$?

    In survey sampling, the $\pi_i$ are usually known from the sampling design. In the general case, however, we need to estimate the $\pi_i$, which are called propensity scores or first-order inclusion probabilities.

    The first approach is logistic modeling, in which the probability that unit $i$ responds is modeled as

$$ \pi_i= P(\delta_i=1 \mid x_i,\beta) = \{ 1+ \exp(-x_i^T\beta) \} ^{-1} $$

$$ \begin{align} l(\beta) &= \sum^n_{i=1} \Big\{ \delta_i \log \pi_i + (1-\delta_i) \log(1-\pi_i) \Big\} \\ &= \sum^n_{i=1} \Big\{ \delta_i \beta^T x_i - \log(1+e^{\beta^T x_i})\Big\} \end{align} $$

$$ \dfrac{\partial l(\beta)}{\partial \beta} = S(\beta) = \sum^n_{i=1} x_i(\delta_i - \pi_i) = \sum^n_{i=1} \pi_i(\dfrac{\delta_i}{\pi_i} - 1 ) x_i$$

$$ U(\beta) = \sum_{i \in A} w_i\Big( \dfrac{\delta_i}{\pi_i} -1 \Big) x_i = 0 $$
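
The following sketch (my own setup; the data-generating process and the use of scikit-learn are assumptions, not from the post) fits the logistic propensity model and plugs the estimated $\hat{\pi}_i$ into the weighted estimating equation for a mean, anticipating the propensity score estimator below.

```python
# Sketch: estimate pi_i with logistic regression, then use delta_i / pi_hat_i weights.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n = 20_000
x = rng.normal(size=n)                          # always-observed covariate
y = 1.0 + 2.0 * x + rng.normal(size=n)          # outcome, seen only when delta = 1
pi_true = 1.0 / (1.0 + np.exp(-(0.5 + x)))      # ignorable case: P(delta = 1 | x)
delta = rng.binomial(1, pi_true)

# Fit the logistic model pi_i = {1 + exp(-x_i^T beta)}^-1 for P(delta = 1 | x).
model = LogisticRegression().fit(x.reshape(-1, 1), delta)
pi_hat = model.predict_proba(x.reshape(-1, 1))[:, 1]

w = delta / pi_hat                              # weights from estimated propensities
ps_mean = np.sum(w * y) / np.sum(w)             # propensity-score-weighted estimate of E[y]
cc_mean = y[delta == 1].mean()                  # complete-case estimate (biased)
print(f"true E[y] = 1.0 | complete-case {cc_mean:.3f} | PS-weighted {ps_mean:.3f}")
```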

Propensity score

$$ \hat{U}_{PS}(\theta) = \dfrac{1}{n} \sum^n_{i=1} \dfrac{\delta_i}{\hat{\pi}_i} U(\theta ; z_i) = 0, \quad \hat{\pi}_i = \pi(z_i ; \hat{\phi}), \; \hat{\phi} = \text{the MLE of} \; \phi_0 $$

$$ S(\phi)=0$$

- Non-ignorable case, $P(\delta=1 \mid y,x)$: to estimate the propensity score, we need to solve these two equations simultaneously.
- Ignorable case, $P(\delta=1 \mid x)$: the joint density factorizes as

$$ f(y_i, x_i, \delta_i) = f_1(\delta_i | y_i, x_i) f_2(y_i|x_i) $$

 

 

Example

Gaussian Mixture

$$
X \mid Z=e_k \sim N(\mu_k,\Sigma_k), \;\; \text{where} \; \mu_1,\ldots,\mu_K\in \mathbb{R}^d, \; \Sigma_1,\ldots,\Sigma_K\in \mathbb{R}^{d \times d}
$$

$$Z \in \{e_1,e_2,\cdots,e_K\}, \quad e_1=[1 \;0 \; \cdots \; 0]^T$$

$$p(X \mid Z=e_k)=N(X \mid \mu_k,\Sigma_k)= \text{the pdf of the normal distribution}$$

$$p(Z=e_k)=\pi_k = \text{the probability of selecting each cluster}$$

$$p(X)= \sum_z p(Z) \times p(X|Z)=\sum^K_{k=1} \pi_k \times N(X|\mu_k,\Sigma_k)$$

$$p(X,Z)=p(Z) \times p(X|Z)=\pi_kN(X|\mu_k,\Sigma_k)$$

$$p(Z=e_k \mid X)=\dfrac{p(Z) \times p(X \mid Z)}{p(X)}=\dfrac{\pi_k \times N(X \mid \mu_k,\Sigma_k)}{\sum^K_{j=1} \pi_j \times N(X \mid \mu_j,\Sigma_j)}=r(z_{nk})$$

 

 

Algorithm[EM Algorithm]

 

Initialization 

Initialize the means $\mu_k$, covariances $\Sigma_k$, and mixing coefficients $\pi_k$.

 

E step

 Evaluate $$r(z_{nk})=\dfrac{\pi_k N(x_n|\mu_k,\Sigma_k)}{\sum^K_{j=1} \pi_j N(x_n|\mu_j, \Sigma_j)}$$

 

M step 

Re-estimate the parameters using $r(z_{nk})$

$$\mu_k^{new}=\dfrac{1}{N_k}\sum^N_{n=1}r(z_{nk})x_n$$

$$\Sigma_k^{new}=\dfrac{1}{N_k}\sum^N_{n=1}r(z_{nk})(x_n-\mu_k^{new})(x_n-\mu_k^{new})^T$$

$$\pi_k^{new}=\dfrac{N_k}{N}, \quad \text{where} \; N_k=\sum^N_{n=1}r(z_{nk})$$

 

Evaluation 

Evaluate the log likelihood

$$\ln p(X \mid \mu,\Sigma,\pi)=\sum^N_{n=1}\ln \Big\{\sum^K_{k=1}\pi_k N(x_n \mid \mu_k,\Sigma_k)\Big\}$$
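
The steps above can be collected into a compact NumPy/SciPy sketch (my own illustration; the function name `em_gmm` and the initialization scheme are assumptions, not from the post):

```python
# Sketch of the EM algorithm for a Gaussian mixture, following the steps above.
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # Initialization: random data points as means, identity covariances, uniform weights.
    mu = X[rng.choice(N, K, replace=False)]
    Sigma = np.stack([np.eye(d)] * K)
    pi = np.full(K, 1.0 / K)

    for _ in range(n_iter):
        # E step: responsibilities r[n, k] proportional to pi_k * N(x_n | mu_k, Sigma_k).
        dens = np.column_stack([
            pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
            for k in range(K)
        ])
        log_lik = np.log(dens.sum(axis=1)).sum()   # log likelihood at current parameters
        r = dens / dens.sum(axis=1, keepdims=True)

        # M step: re-estimate the parameters with the responsibilities.
        Nk = r.sum(axis=0)                         # N_k = sum_n r(z_nk)
        mu = (r.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (r[:, k, None] * diff).T @ diff / Nk[k]
        pi = Nk / N

    return mu, Sigma, pi, log_lik
```

Calling `em_gmm(X, K=3)` on an $N \times d$ data matrix returns the fitted means, covariances, mixing weights, and the log likelihood from the last iteration.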
