
ICML 2016: The Information Sieve


Page 1: ICML 2016: The Information Sieve

The Information Sieve
Greg Ver Steeg and Aram Galstyan

Soup = data

“Main ingredient” extracted at each layer

Page 2: ICML 2016: The Information Sieve

Factorial code

• Carry the recipe instead of the soup
• Missing ingredients?
• Make more soup

• Compression
• Prediction
• Generative model

Recipe: Ingredient 1, Ingredient 2, …
An invertible transform that makes the components independent

Finding such a transform is generally intractable. We use a sequence of transforms that incrementally removes dependence.
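In symbols (a reconstruction; this notation is assumed rather than copied from the slide), a factorial code is an invertible transform f whose output components are statistically independent:

$$ Y = f(X), \quad f \text{ invertible}, \qquad p(y_1, \dots, y_m) = \prod_{j=1}^{m} p(y_j). $$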

Page 3: ICML 2016: The Information Sieve

Two Steps
1. Find the most informative function, Y_k, of the input data
2. Transform the data to remove the information in Y_k, and then repeat (sketched below)

(Diagram: the main ingredient is sifted out of the soup at each layer.)
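A minimal runnable sketch of this two-step loop, with illustrative stand-ins rather than the authors' method: step 1 is approximated here by a binarized first principal component, and step 2 by XOR-ing each variable with its most likely value given Y_k (which is invertible given Y_k). The helper names are hypothetical.

```python
# Sketch of the two-step sieve loop on binary data. The helpers below are
# crude stand-ins for illustration, not the authors' optimization.
import numpy as np

def learn_latent_factor(X):
    # Step 1 stand-in: a binarized first principal component as a
    # crude "most informative" binary function of the inputs.
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return (Xc @ vt[0] > 0).astype(int)

def compute_remainder(X, y):
    # Step 2 stand-in: replace each x_i with x_i XOR (most likely x_i
    # given y). Given y, the transform is invertible.
    X_bar = X.copy()
    for val in (0, 1):
        rows = (y == val)
        if rows.any():
            pred = (X[rows].mean(axis=0) > 0.5).astype(int)
            X_bar[rows] = X[rows] ^ pred
    return X_bar

def sieve(X, n_layers):
    # One pass of the loop = one layer of the sieve.
    remainder, factors = X, []
    for _ in range(n_layers):
        y = learn_latent_factor(remainder)             # step 1
        remainder = compute_remainder(remainder, y)    # step 2
        factors.append(y)
    return factors, remainder

X = (np.random.rand(100, 20) > 0.5).astype(int)
factors, rem = sieve(X, n_layers=3)
```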

Page 4: ICML 2016: The Information Sieve

The main ingredient: multivariate information
• Multivariate mutual information, or total correlation (Watanabe, 1960)

• TC(X|Y) = 0 if and only if Y “explains” all the dependence in X
• So we search for the Y that minimizes TC(X|Y)
• Equivalently, we define the total correlation explained by Y as:
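The formulas on this slide were rendered as images; reconstructed here from the standard definitions used in the CorEx papers (notation assumed):

$$ TC(X) = \sum_i H(X_i) - H(X), \qquad TC(X \mid Y) = \sum_i H(X_i \mid Y) - H(X \mid Y), $$

and the total correlation explained by Y is

$$ TC(X; Y) = TC(X) - TC(X \mid Y) = \sum_i I(X_i; Y) - I(X; Y). $$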

Page 5: ICML 2016: The Information Sieve

The main ingredient:Total Correlation Explanation (CorEx)

• Optimize over all probabilistic functions
• The solution has a special form that makes it tractable (shown below)
• Computational complexity is linear in the number of variables
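The optimization appeared as an image; a reconstruction under the definitions above (the fixed-point form follows the CorEx line of work, so treat the exact expression as an assumption):

$$ \max_{p(y \mid x)} TC(X; Y), $$

whose stationary points satisfy a self-consistent form that depends on X only through the marginals p(y | x_i),

$$ p(y \mid x) = \frac{1}{Z(x)}\, p(y) \prod_{i=1}^{n} \frac{p(y \mid x_i)}{p(y)}, $$

which is why the updates scale linearly in the number of variables.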

Page 6: ICML 2016: The Information Sieve

Sift out the main ingredient: remainder info

The remainder is a transformation of the inputs with two properties:
1. The remainder contains no information about Y
2. The transformation is invertible
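In symbols (a reconstruction; notation assumed): after extracting Y, each input X_i is mapped to a remainder component satisfying

$$ I(\bar{X}_i; Y) = 0 \ \text{ for each } i, \qquad X_i \text{ recoverable from } (\bar{X}_i, Y). $$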

Page 7: ICML 2016: The Information Sieve

Iterative sifting as: decomposition of information
• Multivariate mutual information in the data (total correlation)
• Contribution from each layer of the sieve (optimized)
• Remainder (at layer r)
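The decomposition itself was an image on the slide; a reconstruction following the paper, with X^(0) = X and X^(k) the remainder after layer k (notation assumed):

$$ TC(X) = \sum_{k=1}^{r} TC\big(X^{(k-1)}; Y_k\big) + TC\big(X^{(r)}\big). $$

The sum collects the optimized per-layer contributions; the last term is the dependence left in the remainder.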

Page 8: ICML 2016: The Information Sieve

Iterative sifting as: extracting dependence

The dependence remaining at each layer of the sieve (the dependence at layer r) decreases until it reaches zero, i.e., complete independence.
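Rearranged (same assumptions as above), the remaining dependence is what the sieve has not yet extracted:

$$ TC\big(X^{(r)}\big) = TC(X) - \sum_{k=1}^{r} TC\big(X^{(k-1)}; Y_k\big) \ \ge\ 0, $$

so each optimized layer removes a nonnegative amount of dependence, and reaching zero yields the factorial code.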

Page 9: ICML 2016: The Information Sieve

Recover spatial clusters from fMRI data

(Figure panels: ground truth, ICA, sieve.)

Example of recovering spatial clusters in brain data from temporal activation patterns

Page 10: ICML 2016: The Information Sieve

Lossy compression and in-painting
• Sieve representation with 12 layers/bits/binary latent factors on MNIST digits

We can use the sieve for standard prediction and generative model tasks
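A toy sketch of the decoding side of this idea, not the authors' decoder: each digit is summarized by a short binary code y (the "12 bits"), decoding fills every pixel with its most likely value given y as estimated from training data, and in-painting keeps the observed pixels. All names here are hypothetical.

```python
# Toy sketch: decode a short binary code back to pixels, then in-paint.
# A hypothetical simplification of using the sieve generatively.
import numpy as np

def fit_decoder(Y_train, X_train):
    # For each distinct code y, store the per-pixel majority vote.
    buckets = {}
    for y, x in zip(map(tuple, Y_train), X_train):
        buckets.setdefault(y, []).append(x)
    return {y: (np.mean(xs, axis=0) > 0.5).astype(int)
            for y, xs in buckets.items()}

def inpaint(y, x_observed, observed_mask, decoder):
    # Start from the code's decoding, then keep the pixels we saw.
    x_hat = decoder[tuple(y)].copy()
    x_hat[observed_mask] = x_observed[observed_mask]
    return x_hat
```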

Page 11: ICML 2016: The Information Sieve

Lossless compression (on MNIST)
• Same-size codebooks for the random and sieve-based codes
• (gzip is sequence-based, shown for reference)

A proof of principle for lossless compression, though specialized compression techniques do better on MNIST.

Method:          Naive   gzip   Random codebook   Sieve codebook
Bits per digit:   784    328         267               243

Page 12: ICML 2016: The Information Sieve

Conclusion

• Incrementally decomposing multivariate information is useful, practical, and delicious
• Could improve with joint optimization and better transformations for the remainder info

Link to all papers and code: http://bit.ly/corex_info

Contact: [email protected], [email protected]

• The extension to continuous random variables is nontrivial but more practical and demonstrates connections to “common information”: “Sifting Common Information from Many Variables”, arXiv:1606.02307.