(ENCO)Efficient Neural Causal Discovery Without Acyclicity Constraints - review

statduck 2023. 7. 14. 22:50

Contributions

  • Separates the graph parameters (edge existence and orientation) and uses an objective that is not constrained to be acyclic, which allows easier optimization and a low-variance gradient estimator with convergence guarantees

  • Scales to larger graphs and can also be leveraged to detect latent confounders

Background Knowledge

Causal structure learning: discovering $G$ from samples of $\mathbb{P}$ is called causal structure learning or causal discovery. The goal is to learn a DAG, and the data comes in two kinds: observational data and interventional data.

Explanation of interventional data: https://www.inference.vc/causal-inference-2-illustrating-interventions-in-a-toy-example/

The joint distribution alone is not enough to predict behaviour under interventions; the causal diagram, however, allows us to predict how the model will behave under an intervention without actually carrying it out.
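A toy simulation (mine, not necessarily the example from the linked post) makes this concrete: conditioning on an observed effect shifts our belief about its cause, while intervening on the effect does not.

```python
# Toy example: why the joint distribution alone cannot answer interventional questions.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Causal model: X -> Y, with Y = 2*X + noise.
x = rng.normal(0.0, 1.0, n)
y = 2.0 * x + rng.normal(0.0, 1.0, n)

# Observational conditioning: E[X | Y ≈ 2] is shifted away from E[X] = 0,
# because observing Y carries information about its cause X.
mask = np.abs(y - 2.0) < 0.1
print("E[X | Y ≈ 2]     :", x[mask].mean())

# Intervention do(Y = 2): Y is overwritten regardless of X, so X is untouched
# and E[X | do(Y = 2)] stays at 0. The joint p(X, Y) alone cannot tell these
# two questions apart; the causal graph X -> Y can.
x_int = rng.normal(0.0, 1.0, n)
y_int = np.full(n, 2.0)  # forced value; the parents of Y are ignored
print("E[X | do(Y = 2)] :", x_int.mean())
```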

 

Recent Work: continuous-optimization, score-based causal discovery

  • Search the space of possible graphs with gradient-based methods
  • Adjacency matrix parameterized by independent probabilities per edge

Method

  • Score-based: searches through the space of all possible graphs
  • Constraint-based: restricts the functional class of the SEM to obtain identifiability results (e.g., the linear non-Gaussian acyclic model (LiNGAM) or the additive noise model (ANM)), combined with conditional independence tests
  • Continuous optimization (the approach this paper adopts)
    • Reinterprets the search over discrete graph topologies as a continuous optimization problem
    • Restricts the search space to acyclic graphs, originally via a Lagrangian constraint or a regularizer; these suffer from local optima and do not scale beyond roughly 100 variables

Acyclicity restriction (example: NOTEARS)

https://proceedings.neurips.cc/paper_files/paper/2018/file/e347c51419ffb23ca3fd5050202f9c3d-Paper.pdf

 

Problem: how to limit the search space to directed acyclic graphs

To limit the search space:

  • NOTEARS: $h(W) = \mathrm{tr}(e^{W \circ W}) - d = 0$, where $W$ is the weighted adjacency matrix of a $d$-node graph (slow and hyperparameter-sensitive)
  • Regularization: penalize cyclic graphs (hyperparameter-sensitive, limited guarantees)

Both of these approaches, however, come with their own disadvantages. A minimal sketch of the NOTEARS acyclicity function is given below.
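The sketch below (not the NOTEARS authors' code) implements the acyclicity measure from the formula above; it is zero exactly when the weighted adjacency matrix encodes a DAG.

```python
# Minimal sketch of the NOTEARS acyclicity measure h(W) = tr(exp(W ∘ W)) - d.
import numpy as np
from scipy.linalg import expm

def notears_h(W: np.ndarray) -> float:
    d = W.shape[0]
    return np.trace(expm(W * W)) - d   # W * W is the elementwise (Hadamard) square

# 2-node example: acyclic vs. cyclic weighted adjacency matrices.
acyclic = np.array([[0.0, 1.5], [0.0, 0.0]])   # X1 -> X2 only
cyclic  = np.array([[0.0, 1.5], [0.7, 0.0]])   # X1 -> X2 and X2 -> X1

print(notears_h(acyclic))  # ~0.0
print(notears_h(cyclic))   # > 0: the cycle is penalized
```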

 

Goal

  • How can we reliably perform causal discovery with gradient-based methods on large scales?
  • Find the DAG of a causal graphical model (CGM) from observational and interventional samples

Central idea: learn the conditional distributions $p(X_i \mid \ldots)$ from observational data, then test how well they generalize to interventional data

Parameterize the graph with edge-existence and orientation parameters

  • Probability of an edge: $\sigma(\gamma_{ij}) \cdot \sigma(\theta_{ij})$, with $\theta_{ij} = -\theta_{ji}$ (see the sketch below)
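A minimal sketch of this parameterization, assuming PyTorch and that the antisymmetry $\theta_{ij} = -\theta_{ji}$ is enforced by subtracting the transpose of an unconstrained matrix (how the official repository actually stores $\theta$ may differ):

```python
# Two-parameter edge probabilities: existence logits gamma and orientation logits theta.
import torch

N = 5                                              # number of variables
gamma = torch.randn(N, N, requires_grad=True)      # edge-existence logits
theta_raw = torch.randn(N, N, requires_grad=True)  # unconstrained orientation logits

theta = theta_raw - theta_raw.T                    # enforce theta_ij = -theta_ji,
                                                   # so sigma(theta_ij) + sigma(theta_ji) = 1
edge_prob = torch.sigmoid(gamma) * torch.sigmoid(theta)  # p(X_i -> X_j)
edge_prob = edge_prob * (1 - torch.eye(N))               # no self-loops

print(edge_prob)
```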

Benefits of two-variable parameterization:

  • More control over gradient updates
  • No acyclicity constraint is required

This leads to:

  • Unbiased, low-variance gradient estimators (which makes it possible to scale to larger graphs)
  • If interventions are performed on all variables, convergence to the global optimum is guaranteed

Some Definitions: CGM (Causal Graphical Model)

  • A DAG $G=(V,E)$
  • Random variables $X = \{ X_1 ,..., X_N\}$, $\mathbb{P} = p(X)$
    • Each node $i \in V$ corresponds to a variable $X_i$; an edge $(i,j) \in E$ corresponds to the causal relation $X_i \rightarrow X_j$
    • $\mathbb{P}$ is faithful to a graph $G$ if the conditional independencies in $\mathbb{P}$ are exactly those encoded in $G$ via d-separation https://blog.naver.com/sw4r/221004466858
    • $\mathbb{P}$ is Markov with respect to a graph $G$ if $p(X)$ factorizes as $\prod^N_{i=1} p_i (X_i \mid pa(X_i))$
  • A PGM (probabilistic graphical model) based on a DAG = a BN (Bayesian network)
  • Under an intervention, a CGM locally replaces $p_i(X_i \mid pa(X_i))$ by $\tilde{p}(X_i \mid pa(X_i))$, where $X_i$ is the intervention target (see the worked example below)
    • An intervention is perfect when $\tilde{p}(X_i \mid pa(X_i)) = \tilde{p}(X_i)$
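A small worked example (a 3-variable chain, chosen here only for illustration): for $X_1 \rightarrow X_2 \rightarrow X_3$, the Markov factorization is

$$ p(X_1, X_2, X_3) = p_1(X_1)\, p_2(X_2 \mid X_1)\, p_3(X_3 \mid X_2), $$

and a perfect intervention on $X_2$ replaces $p_2(X_2 \mid X_1)$ by $\tilde{p}(X_2)$, giving $\tilde{p}(X) = p_1(X_1)\, \tilde{p}(X_2)\, p_3(X_3 \mid X_2)$.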

Assumptions

  • The CGM is causally sufficient (all common causes of the variables are included and observable)
  • We have $N$ interventional datasets
  • The interventions are sparse, perfect (independent of the original parents), and available for all observable variables

Learning Process

  Distribution Fitting

In the distribution-fitting stage, the conditional distributions are learned from observational data, with dropout applied to the candidate parents so that many different conditioning sets are covered. The Bernoulli parameter $p(X_j \rightarrow X_i)$ is updated in the graph-fitting stage.

 

Q: How can a neural network (NN) learn a conditional density?

A: Leverage normalizing flows for continuous variables and a softmax output activation for categorical variables.

(On GitHub, a GaussianNoiseModel is provided as a simpler alternative to the flow method: assume a distribution and learn its parameters, with output_dims=2 so that the network is trained to output the mean and log_var used in the loss function.)

Code path: ENCO (class) > DistributionFitting (class) > create_continuous_model (function) in multivariate_flow > if use_flow_model=False: GaussianNoiseModel (sketched below)
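A minimal sketch of this simpler option, with illustrative class and argument names (not the repository's API): a small network maps a dropout-masked parent vector to two outputs interpreted as mean and log_var, and is trained with the Gaussian negative log-likelihood.

```python
# Sketch of the "simple" distribution-fitting option: predict (mean, log_var)
# of X_i from a dropout-masked parent vector and train with the Gaussian NLL.
import torch
import torch.nn as nn

class GaussianConditional(nn.Module):
    def __init__(self, num_parents: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_parents, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),            # output_dims=2: mean and log_var
        )

    def forward(self, parents, mask):
        # mask ~ Bernoulli: randomly drops parents so the model learns
        # p(X_i | any subset of candidate parents).
        out = self.net(parents * mask)
        mean, log_var = out[:, 0], out[:, 1]
        return mean, log_var

def gaussian_nll(x_i, mean, log_var):
    # Negative log-likelihood of a Gaussian (constant term omitted).
    return 0.5 * (log_var + (x_i - mean) ** 2 / log_var.exp()).mean()

# Usage with dummy data.
model = GaussianConditional(num_parents=4)
parents = torch.randn(32, 4)
mask = torch.bernoulli(torch.full((32, 4), 0.8))
x_i = torch.randn(32)
loss = gaussian_nll(x_i, *model(parents, mask))
loss.backward()
```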

 

 Graph Fitting

  1. Sample a set of graphs (which need not be acyclic)
  2. Estimate the log-likelihoods under each sampled graph and average them to form the gradient estimators

 

Gradient Estimator

$$ \tilde{\mathcal{L}} = \mathbb{E}_{\hat{I} \sim p_I(I)} \mathbb{E}_{\tilde{p}_\hat{I}(X)} \mathbb{E}_{p_{\gamma,\theta} (C)} \Big[ \sum^N_{i=1} \mathcal{L}_C (X_i) \Big] + \lambda_{sparse} \sum^N_{i=1} \sum^N_{j=1} \sigma(\gamma_{ij}) \cdot \sigma(\theta_{ij}) $$

 

For stochastic gradient descent, we need the gradients of this objective; for the edge-existence parameters $\gamma_{ij}$ they take the following form.

 

$$\dfrac{\partial}{\partial \gamma_{ij}} \tilde{\mathcal{L}} = \sigma(\gamma_{ij}) \cdot (1-\sigma(\gamma_{ij}))\cdot \sigma(\theta_{ij}) \cdot \mathbb{E}_{X,C_{-ij}}[ \mathcal{L}_{X_i \rightarrow X_j} (X_j) - \mathcal{L}_{X_i \nrightarrow X_j} (X_j) + \lambda_{sparse} ] $$
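A minimal sketch (mine, not the official implementation) of the Monte Carlo estimator implied by this formula: for a fixed pair $(i,j)$, sample configurations of the other edges, evaluate the loss on $X_j$ with the edge $X_i \rightarrow X_j$ switched on and off, and average the difference plus the sparsity penalty. The helpers `nll_of_xj` and `sample_other_edges` are hypothetical stand-ins for the fitted conditional networks and the graph sampler.

```python
# Sketch of the gamma_ij gradient estimator from the displayed formula.
import torch

def gamma_gradient(i, j, gamma, theta, nll_of_xj, sample_other_edges,
                   K=100, lambda_sparse=4e-3):
    diffs = []
    for _ in range(K):
        C = sample_other_edges(gamma, theta)   # binary adjacency; entry (i, j) is overwritten below
        C_on, C_off = C.clone(), C.clone()
        C_on[i, j], C_off[i, j] = 1.0, 0.0
        # nll_of_xj(C) should return -log p(X_j | parents selected by C) on a data batch.
        diffs.append(nll_of_xj(C_on) - nll_of_xj(C_off) + lambda_sparse)
    expectation = torch.stack(diffs).mean()
    sig = torch.sigmoid(gamma[i, j])
    return sig * (1 - sig) * torch.sigmoid(theta[i, j]) * expectation

# Dummy usage with random placeholders standing in for the fitted networks.
N = 4
gamma, theta = torch.zeros(N, N), torch.zeros(N, N)
sampler = lambda g, t: torch.bernoulli(torch.sigmoid(g) * torch.sigmoid(t))
nll = lambda C: 0.1 * C.sum() + 0.01 * torch.randn(())   # placeholder loss
print(gamma_gradient(1, 2, gamma, theta, nll, sampler))
```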

 

Evaluation

The paper evaluates 64,000 adjacency matrices sampled from $p_{\gamma,\theta}(C)$ in terms of their log-likelihood estimates. These 64,000 samples are grouped into sets of $K$ samples (from 20 to 4,000 groups).

 

Low Variance For Edge Param($\gamma$)

Low Variance For Orientation Param ($\theta$)

In the learning process of $\theta$, no regularizer needs to be placed on it.

This is because $\theta_{ij}$ is only updated under interventions on $X_i$ or $X_j$, so the other variables are not considered when learning the orientation of the edge $(i,j)$.

 

Condition For global convergence

The main assumptions required for global convergence are stated formally in the paper.

 

More intuitively speaking, ancestors and descendants are not independent under interventions on the ancestor. In this setting, local optima only occur when the parents share a common confounder, which is not the usual case.

 

Score for detecting latent confounders

In the presence of latent confounders, structure learning methods may predict false-positive edges (i.e., they may infer a relation where none actually exists).

 

Let $\gamma_{ij} = \gamma_{ij}^{(I)} + \gamma_{ij}^{(O)}$

$\gamma_{ij}^{(I)}$ is only updated under interventions on $X_i$, while $\gamma_{ij}^{(O)}$ is updated from the remaining (observational) data.

If this latent-confounder score exceeds a threshold hyperparameter, we conclude that there is a latent confounder between $X_i$ and $X_j$. A rough sketch of the split-parameter bookkeeping follows.
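A rough sketch (mine, not the official code) of this bookkeeping; the score at the end is a placeholder illustrating the intuition (an edge that looks beneficial on observational data but does not generalize under interventions), not the exact score defined in the paper.

```python
# Split gamma into an interventional and an observational component.
import torch

N = 5
gamma_I = torch.zeros(N, N)   # updated only on batches that intervene on X_i (row i)
gamma_O = torch.zeros(N, N)   # updated on the remaining (observational) batches

def apply_update(grad, intervened_variable=None):
    """Route a gradient-descent update for gamma into the right component."""
    if intervened_variable is None:
        gamma_O.sub_(grad)                                             # observational batch
    else:
        gamma_I[intervened_variable].sub_(grad[intervened_variable])   # interventions on X_i update row i

gamma = gamma_I + gamma_O     # effective edge-existence logits used during training

# Placeholder latent-confounder score (NOT the paper's exact definition):
# high observational evidence combined with low interventional evidence.
score = torch.sigmoid(gamma_O) * (1 - torch.sigmoid(gamma_I))
tau = 0.8                     # threshold hyperparameter
flagged_pairs = (score > tau).nonzero()
```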

 

Limitation

  • Convergence guarantees require interventions on all variables
  • The orientation parameters lack transitivity

 

Reference: 

Talk by the paper's authors: https://www.youtube.com/watch?v=ro_5FqiS5qU

Explanation of VAEs (in Korean): https://process-mining.tistory.com/161

 
