
Machine Learning

Curse of dimensionality

statduck 2022. 5. 26. 23:42


There are three problems in high dimensions:

  • First, a longer side length is needed to capture the same fraction of the data.
  • Second, all sample points are close to an edge of the sample.
  • Third, far more samples are needed in high dimensions to match the sampling density of low dimensions.

✏️ First Problem

Let's think about the unit hypercube in $p$ dimensions, $\{x \in \mathbb{R}^p \mid 0 \leq x_i \leq 1\}$. We assume $x$ is uniformly distributed over this cube.

A subcube capturing a fraction $r$ of the data must have side length $e_p(r)=r^{1/p}$.

  • In $\mathbb{R}^1$, the side length of the subcube is $1/4$ for $1/4$ of the data.
  • In $\mathbb{R}^2$, the side length is $(1/4)^{1/2}=1/2$ for $1/4$ of the data.
  • In $\mathbb{R}^3$, the side length is $(1/4)^{1/3}\approx 0.63$ for $1/4$ of the data.

As the dimension grows, the side length needed to capture the same fraction of the data grows toward 1, so the subcube must span most of each coordinate's range and is no longer "local."

$$
e_{10}(0.01)=0.63
$$

In $\mathbb{R}^{10}$, a side length of 0.63 is needed to capture just 1% of the data.
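The formula $e_p(r)=r^{1/p}$ is easy to check numerically. A minimal sketch (the function name is mine, not from the text):

```python
def edge_length(r: float, p: int) -> float:
    """Side length of a subcube capturing a fraction r of data
    uniformly distributed in the unit hypercube in R^p."""
    return r ** (1 / p)

print(edge_length(0.25, 1))              # 0.25 in R^1
print(edge_length(0.25, 2))              # 0.5 in R^2
print(round(edge_length(0.25, 3), 2))    # ~0.63 in R^3
print(round(edge_length(0.01, 10), 2))   # ~0.63 in R^10: 63% of each axis for 1% of the data
```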

✏️ Second Problem

All sample points become close to the boundary of the sample. Consider $N$ points uniformly distributed in the unit ball in $\mathbb{R}^p$ centered at the origin, whose volume is

$$ \omega_p=\frac{\pi^{p/2}}{(p/2)!} $$

The distance $X$ from the origin to a single point has CDF $F(x)=x^p$, because the volume within radius $x$ is proportional to $x^p$:

$$ F(x)=x^p, \; 0\leq x \leq 1 $$

$$ f(x)=px^{p-1}, \; 0\leq x \leq 1 $$

The distance $Y$ from the origin to the closest of the $N$ points is the minimum of $N$ such i.i.d. distances, with density and CDF

$$ g(y)=N(1-y^p)^{N-1}py^{p-1} $$

$$ G(y)=1-(1-y^p)^N $$

Solving $G(y)=1/2$ for $y$ gives the median distance from the origin to the closest point:

$$ d(p,N)= \left(1-\left(\tfrac{1}{2}\right)^{1/N}\right)^{1/p} $$

For large $p$ this median distance is more than halfway to the boundary, so most data points are closer to the edge of the sample than to any other data point.
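The median distance can be evaluated directly. A short sketch (function name is mine); for instance, with $N=500$ points in $\mathbb{R}^{10}$ the nearest point is, on median, more than halfway to the boundary:

```python
def median_nearest_distance(p: int, n: int) -> float:
    """Median distance d(p, N) from the origin to the closest of n
    points uniformly distributed in the unit ball in R^p."""
    return (1 - 0.5 ** (1 / n)) ** (1 / p)

print(round(median_nearest_distance(10, 500), 2))  # ~0.52
print(round(median_nearest_distance(1, 500), 4))   # tiny in 1-D
```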

 

✏️ Third Problem

Even a large sample is sparse in high dimensions. For a fixed sample size $N$, the sampling density is proportional to $N^{1/p}$.

  • If 100 samples provide a dense sampling of one dimension, then $100^{10}$ samples are needed to achieve the same density in 10 dimensions.

Here, sampling density means the number of samples per unit length of input space.
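Because density scales as $N^{1/p}$, matching a one-dimensional density of $N_1$ samples requires $N_1^{\,p}$ samples in $p$ dimensions. A tiny sketch of that arithmetic (names are mine):

```python
def samples_for_same_density(n1: int, p: int) -> int:
    """Sample size needed in p dimensions to match the sampling
    density achieved by n1 samples in one dimension.
    Density ~ N**(1/p), so we need N = n1**p."""
    return n1 ** p

print(samples_for_same_density(100, 1))   # 100
print(samples_for_same_density(100, 10))  # 100**10 = 10**20 samples
```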

 

Reference

Hastie, T., Tibshirani, R., & Friedman, J. (2001). The Elements of Statistical Learning. New York, NY, USA: Springer.
