Quantum Annealing in Statistical Machine Learning





Quantum Annealing in Statistical Machine Learning
Issei Sato
Mathematical Informatics, Graduate School of Information Science and Technology, The University of Tokyo
Supervisor: Hiroshi Nakagawa

Copyright © 2011, Issei Sato.

Abstract

Machine learning is a process for improving the performance and behaviors of a machine through the use of empirical data. The aim of machine learning is to extract hidden properties in data and make predictions about observations yet to be made. We humans decide our behaviors based on knowledge learned or abstracted from past experiences and information. However, a large amount of information or data makes it difficult for us to extract useful information and develop accurate solutions to problems, the so-called "information overload." Therefore, it is important to develop systems that automatically learn the underlying mechanisms in observed data. Machine learning is related to many fields: probability theory and statistics, data mining, information theory, computational neuroscience, theoretical computer science, and statistical physics.

We propose a novel machine learning framework based on quantum mechanics. Machine learning is basically formulated as an optimization problem. Simulated annealing (SA) is a well-known physics-based approach for solving optimization problems in machine learning; it exploits the statistical-mechanics concept of "temperature." In physics, quantum annealing (QA) has attracted much attention as an alternative annealing method that solves optimization problems through quantum fluctuations. QA is the quantum-mechanical version of SA. We developed QA variants of two common learning algorithms, the variational Bayes inference and the Gibbs sampler. The proposed learning algorithms are applicable to any problem to which these conventional algorithms are applicable. We empirically demonstrate that our QA-based learning algorithms work better than SA-based learning algorithms in the unigram mixture model, latent Dirichlet allocation, the hidden Markov model, and Dirichlet process mixture models for clustering documents, extracting topics of documents, predicting users' preferences for music artists, and modeling web page visits of users.

Contents

Chapter 1 Introduction
  1.1 Statistical Machine Learning
  1.2 Quantum Annealing in Statistical Machine Learning
  1.3 Summary of Remaining Chapters
Chapter 2 Learning Algorithms
  2.1 Conjugate Exponential Models
  2.2 Gibbs Sampler
  2.3 Slice Sampler
  2.4 Particle Filter
  2.5 Simulated Annealing
  2.6 Variational Bayes Inference
  2.7 Simulated Annealing Variational Bayes Inference
  2.8 Variational Bayes Inference for Conjugate Exponential Models
  2.9 Collapsed Variational Bayes Inference
Chapter 3 Probabilistic Latent Variable Models
  3.1 Topic Models
    3.1.1 Mixture of Unigrams
    3.1.2 Probabilistic Latent Semantic Indexing
    3.1.3 Latent Dirichlet Allocation
    3.1.4 Interpretation of Topic Models on Simplex Space
    3.1.5 Collapsed Gibbs Sampler for Latent Dirichlet Allocation
    3.1.6 Variational Bayes Inference for Latent Dirichlet Allocation
    3.1.7 Collapsed Variational Bayes Inference for Latent Dirichlet Allocation
    3.1.8 Particle Filter for Latent Dirichlet Allocation
    3.1.9 Deterministic Online Learning for Latent Dirichlet Allocation
    3.1.10 Other Topic Models
  3.2 Dirichlet Process Mixture Models
    3.2.1 Dirichlet Process
    3.2.2 Dirichlet Process Mixture Models
    3.2.3 Chinese Restaurant Process Representation
    3.2.4 Stick-breaking Process Representation
    3.2.5 (Collapsed) Gibbs Sampler for Chinese Restaurant Process
    3.2.6 (Collapsed) Gibbs Sampler for Stick-breaking Process
    3.2.7 Slice Sampler for Stick-breaking Process
    3.2.8 Variational Bayes Inference for Dirichlet Process Mixture Models
    3.2.9 Collapsed Variational Bayes Inference for Dirichlet Process Mixture Models
  3.3 Topic Models Using Pitman-Yor Process
    3.3.1 Pitman-Yor Process
    3.3.2 Pitman-Yor Topic Models
    3.3.3 Gibbs Sampler for Pitman-Yor Topic Models
    3.3.4 Experiments
Chapter 4 Quantum Annealing Variational Bayes Inference
  4.1 Introduction to Density Matrix and Quantum Annealing for Machine Learning
  4.2 Preliminaries
  4.3 Introduction of Quantum Effect into Marginal Probability
  4.4 Derivation of Variational Lower Bound
  4.5 Derivation of Update Equations
  4.6 Estimation of Label Projection
  4.7 Experiments
    4.7.1 Annealing Schedule
    4.7.2 Results
Chapter 5 Quantum Annealing Gibbs Sampler for Dirichlet Process Mixture
  5.1 Review of Chinese Restaurant Process
  5.2 Bit Matrix Representation for Chinese Restaurant Process
  5.3 Quantum Annealing for Chinese Restaurant Process
    5.3.1 Density Matrix Representation for Classical Chinese Restaurant Process
    5.3.2 Formulation for Quantum Chinese Restaurant Process
    5.3.3 Approximation Inference for Quantum Chinese Restaurant Process
  5.4 Experiments
    5.4.1 Annealing Schedule
    5.4.2 Results
Chapter 6 Conclusion
Acknowledgments
Appendix A Hidden Markov Model
Appendix B Kronecker Products
Appendix C Matrix Functions

Chapter 1 Introduction

Machine learning is a sub-topic of artificial intelligence and has wide applications in areas such as natural language processing, bioinformatics, computer vision, recommendation systems, pattern recognition, and speech recognition. Machine learning is a process for improving the performance and behaviors of a machine through the use of empirical data and past human experience. In real life, since the set of all possible behaviors for all possible inputs is too large to be covered by the set of observed examples, a learner must generalize from the given examples to make useful decisions for future observations. Hence, an important research goal in machine learning is to automatically learn the underlying generation processes of observations. First, we describe statistical machine learning, and then give an overview of this thesis.

1.1 Statistical Machine Learning

We define statistical machine learning as a process for statistically describing the underlying generation mechanism of data and making predictions about observations yet to be made. Suppose that there exists a probability distribution P from which observed samples x_{1:n} = {x_1, x_2, ..., x_n} are independently generated. Let p^*(x) be a probability density function of P. Samples are also called training samples or data, and a set of samples is called a training set or data set. In statistical machine learning, we first construct a statistical model described by a probability density function with parameters, denoted as p(x|θ), where θ is a parameter. Using p(x|θ) to model the observations, the next goal of statistical machine learning is to find the optimal parameters so that the model becomes close to the underlying distribution p^*(x) that generates the observations x_{1:n}.

Let us consider the Kullback-Leibler (KL) divergence, a standard measure of the difference between the densities p^*(x) and p(x|θ). The difference between the model density p(x|θ) and the true density p^*(x) is given by

  KL[p^*(x) || p(x|θ)] = ∫ p^*(x) log \frac{p^*(x)}{p(x|θ)} dx = E[ log \frac{p^*(x)}{p(x|θ)} ]_{p^*(x)},   (1.1)

where E[·]_{p^*(x)} denotes the expectation with respect to the random variable x under p^*(x). The KL divergence is nonnegative (Kullback, 1968):

  KL[p^*(x) || p(x|θ)] ≥ 0,   (1.2)
  KL[p^*(x) || p(x|θ)] = 0 if and only if p^*(x) = p(x|θ).   (1.3)

Thus, one of the goals in statistical machine learning is to minimize the KL divergence KL[p^*(x) || p(x|θ)]. Since the true density p^*(x) is fixed, the minimization task reduces to the following maximization problem:

  maximize  E[log p(x|θ)]_{p^*(x)}.   (1.4)

Since this expectation uses the true distribution p^*(x), which is unknown, we replace it with the empirical expectation:

  maximize  \frac{1}{n} \sum_i log p(x_i|θ).   (1.5)

This maximization is known as the maximum likelihood (ML) problem.
That is, maximizing the log likelihood leads to minimizing the KL divergence KL[p^*(x) || p(x|θ)]. Therefore, ML learning is a fundamental learning framework. The ML estimate of the parameter is

  \hat{θ}_{ML} = argmax_θ \prod_{i=1}^{n} p(x_i|θ).   (1.6)

The parameter θ can also be viewed as a random variable, in which case we have to reflect a distribution over the parameter, p(θ), in the learning algorithm. One such algorithm is maximum a posteriori (MAP) learning, in which the estimate is obtained by maximizing the posterior of the parameter θ, described below. The MAP estimate is

  \hat{θ}_{MAP} = argmax_θ p(θ) \prod_{i=1}^{n} p(x_i|θ).   (1.7)

Once the statistical model is specified and the parameter is obtained, the predictive distribution is used to estimate future observations. In ML and MAP learning, the predictive distributions are given by p(x|\hat{θ}_{ML}) and p(x|\hat{θ}_{MAP}), respectively. These methods use only a point estimate of the parameter. In contrast, Bayesian learning uses the distribution over the parameter for predicting future observations.

Bayesian learning views the parameter as a random variable and uses its posterior distribution, obtained by introducing the prior distribution p(θ). Using Bayes' theorem, the posterior distribution over the parameter is

  p(θ|x_{1:n}) ∝ p(θ) \prod_{i=1}^{n} p(x_i|θ).   (1.8)

Thus, we obtain the predictive distribution

  p(x|x_{1:n}) = ∫ p(x|θ) p(θ|x_{1:n}) dθ.   (1.9)

This prediction can be viewed as an ensemble of models weighted according to the distribution p(θ|x_{1:n}).
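The following is a minimal numerical sketch, not part of the thesis, comparing the three estimators above for a Beta-Bernoulli coin-flip model; the data, prior hyperparameters, and variable names are illustrative assumptions.

```python
# Minimal sketch (not from the thesis): ML, MAP, and Bayesian prediction
# for coin flips x_i in {0, 1} with a Beta(a, b) prior on theta.
import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.7, size=20)        # observations from an assumed "true" theta = 0.7
n, h = len(x), x.sum()                   # sample size and number of heads
a, b = 2.0, 2.0                          # Beta prior hyperparameters (illustrative)

theta_ml = h / n                                   # Eq. (1.6): argmax of the likelihood
theta_map = (h + a - 1) / (n + a + b - 2)          # Eq. (1.7): argmax of the posterior
p_bayes = (h + a) / (n + a + b)                    # Eq. (1.9): predictive p(x=1 | x_1:n)

print(f"ML: {theta_ml:.3f}  MAP: {theta_map:.3f}  Bayes predictive: {p_bayes:.3f}")
```

With little data the three answers differ noticeably; as n grows they converge, which matches the discussion above.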
Machine learning tasks are categorized into two basic tasks according to the type of observed data: supervised and unsupervised learning. The supervised learning task learns the input-output relation given training data {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, which are examples of input-output vector pairs. The input-output relation is typically formulated as a function, e.g., the conditional distribution of the output given the input. A major supervised learning task is classification, which constructs a function that maps inputs to labels, the labels being the outputs. In contrast, the unsupervised learning task deals only with the inputs {x_1, x_2, ..., x_n}. Major examples are density estimation, which estimates the distribution of the data using a statistical model, and clustering, which divides the data set into disjoint sets of mutually similar data. Furthermore, the types of learning are also classified according to the way in which the data are processed. There are mainly two ways: batch learning, which updates parameters by sweeping through all of the data a number of times until convergence, and online learning, which updates parameters for data points one at a time. Online learning works well for real-time streaming data because it processes data points one at a time.

Finally, we briefly describe the following machine learning tasks.

Semi-supervised learning combines unsupervised learning, which uses no labeled training data, with supervised learning, which uses completely labeled training data. Typically, we use a small amount of labeled data and a large amount of unlabeled data.

Reinforcement learning learns how to take actions in an environment so as to maximize a reward. Every action has some impact on the environment, and the environment provides feedback in the form of rewards that guide the learning algorithm.

Active learning interactively collects new examples from the learning system itself, whereas other learning tasks assume that all of the training examples are given at the start. For example, the learning system actively makes queries to a human user when constructing the data set. When the queries are unlabeled data, active learning is combined with semi-supervised learning.

Learning to rank automatically constructs a ranking model that typically produces a numerical or ordinal score, or a binary relevance judgment (e.g., relevant or not relevant), for each item. Training data consist of lists of items with some partial order.

Kernel methods are widely used in a variety of learning tasks. However, it is left to users to choose a suitable kernel, and manually searching for an appropriate kernel is often time-consuming. Multiple kernel learning simultaneously learns a linear mixture of kernels and the model parameters.

Transfer learning refers to applying the knowledge learned in one or more tasks to a different but related problem. Transfer learning is also called domain adaptation when dealing with the transfer of knowledge across domains. For example, learning to recognize chairs might help to recognize tables, or learning to play checkers might improve the learning of chess.

Multi-task learning deals with a problem together with other related problems at the same time, using a shared representation. This often leads to a better model for the main task, because it allows the learner to use the commonality among the tasks. Multi-task learning can also be considered a kind of transfer learning.

Metric learning learns a distance or similarity function for data. Metric learning has become a popular task because learned distances or similarities can be applied in a variety of machine learning problems. Most metric learning algorithms rely on learning a Mahalanobis distance, which generalizes the standard Euclidean distance by admitting arbitrary linear scalings and rotations of the feature space and works well on many real-world problems.

Chapters 2 and 3 discuss unsupervised learning in batch and online settings. The subject of Chapters 4 and 5 is unsupervised learning in a batch setting.

1.2 Quantum Annealing in Statistical Machine Learning

Machine learning is inspired by several fields: probability theory and statistics, data mining, information theory, computational neuroscience, theoretical computer science, and physics. Statistical physics is often used as a physics-based approach in machine learning. In physics, quantum physics is as extensively studied a field as statistical physics. Therefore, it also seems important to study quantum-physics approaches in machine learning.

We propose a novel machine learning framework based on quantum mechanics, in particular for learning latent variable models. A latent variable model is an important model in machine learning because of its flexibility in modeling real-world phenomena and the generation processes of observations. Latent variables can indicate hidden, specific properties of observed data, or be viewed as abstract features or concepts of observed data. We focus on discrete latent variable models. For example, a latent class model is a well-known discrete latent variable model that expresses correlations among observable variables by using a latent class. A latent class model is typically used in clustering data, where latent classes indicate the class assignments of data.
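As a concrete picture of the latent class model just described, the following minimal sketch (not from the thesis) samples observations from a toy latent class model; the class proportions and per-class categorical distributions are illustrative assumptions.

```python
# Minimal sketch (not from the thesis): sampling from a toy latent class model.
import numpy as np

rng = np.random.default_rng(1)
pi = np.array([0.5, 0.3, 0.2])                       # p(z = k): latent class proportions
phi = np.array([[0.7, 0.2, 0.1],                     # p(x | z = k): one categorical per class
                [0.1, 0.8, 0.1],
                [0.2, 0.2, 0.6]])

z = rng.choice(3, size=10, p=pi)                     # hidden class assignment of each datum
x = np.array([rng.choice(3, p=phi[k]) for k in z])   # observation drawn from its class's distribution
print(z, x)                                          # clustering would try to recover z from x
```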
Many machine learning studies have used probabilistic latent variable models, which introduce uncertainty about the states of latent variables through probability distributions over the latent variables. Bayesian learning is commonly used in probabilistic latent variable modeling. For example, a probabilistic mixture model is a probabilistic model for a latent class model. We introduce a different kind of uncertainty into the learning process for the states of latent variables than existing methods do: the quantum effect.

Several approaches related to quantum mechanics have attracted attention in machine learning. A density matrix plays an important role in connecting machine learning with quantum mechanics. A density matrix is a self-adjoint, positive-semidefinite matrix of trace one; it generalizes a probability vector, which is recovered as the special case in which the density matrix is diagonal and its diagonal elements form the probability vector. Wolf (Wolf, 2006) connected machine learning with the basic probability rule of quantum mechanics, called the Born rule, which formulates a generalized probability by using a density matrix, and applied it to spectral clustering and other machine learning algorithms based on spectral theory. Crammer et al. (Crammer and Globerson, 2006) combined a margin maximization scheme with a probabilistic modeling approach by incorporating the concepts of quantum detection and estimation theory (Helstrom, 1969). Tanaka et al. (Tanaka and Horiguchi, 2002) proposed a quantum Markov random field using a density matrix and applied it to image restoration. The Bayesian framework has also been generalized using a density matrix. Schack et al. (Schack et al., 2001) proposed a quantum Bayes rule for conditional densities between two probability spaces. Warmuth et al. generalized the Bayes rule to treat the case where the prior is a density matrix (Warmuth, 2005) and unified Bayesian probability calculus for density matrices with rules for translation between joints and conditionals (Warmuth, 2006). However, computing the full posterior distributions over model parameters for complex probabilistic latent variable models, e.g., latent Dirichlet allocation (LDA) (Blei et al., 2003) and the hidden Markov model (HMM), remains difficult in these quantum Bayesian frameworks.
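The following minimal sketch (not from the thesis) checks the defining properties of a density matrix mentioned above on two illustrative examples: a diagonal one, whose diagonal is an ordinary probability vector, and a rank-one "pure state" with off-diagonal terms.

```python
# Minimal sketch (not from the thesis): density matrices are self-adjoint,
# positive-semidefinite, trace-one matrices; the diagonal case is a probability vector.
import numpy as np

p = np.array([0.5, 0.3, 0.2])
rho_classical = np.diag(p)                        # diagonal density matrix = probability vector

psi = np.array([1.0, 1.0, 0.0]) / np.sqrt(2.0)    # a unit-norm state vector (illustrative)
rho_quantum = np.outer(psi, psi)                  # rank-one density matrix with off-diagonal terms

for rho in (rho_classical, rho_quantum):
    print(np.trace(rho),                                # trace one
          np.allclose(rho, rho.T.conj()),               # self-adjoint
          bool(np.all(np.linalg.eigvalsh(rho) >= -1e-12)))  # positive semidefinite
```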
We generalize the variational Bayes (VB) inference (Attias, 1999, 2000), a widely used approximation framework for learning probabilistic latent variable models, based on ideas that have been used in quantum mechanics. We explain our framework and compare it with simulated annealing (SA) (Kirkpatrick et al., 1983), an optimization framework inspired by statistical mechanics.

Machine learning is basically formulated as an optimization problem. SA (Kirkpatrick et al., 1983) is a well-known physics-based approach for solving optimization problems in machine learning. SA is based on a concept of statistical mechanics, temperature. The terminology is borrowed from metallurgy, where a slow decrease in the temperature of a metal, i.e., an annealing process, is used to obtain a minimum-energy crystalline structure. By analogy, SA introduces a temperature parameter and searches for an optimal solution by gradually decreasing the temperature during the optimization process. Since the energy landscape flattens at high temperature, it is easy to change the state (see Fig. 1.2(a)). However, at low temperature the state gets trapped in a valley of the energy barrier; that is, the transition probability becomes very low. Therefore, SA also gets stuck in local optima under practical cooling schedules. Although Geman and Geman proved that SA can find the global optimum with a sufficiently slow cooling schedule, this schedule is too slow in practice (Geman and Geman, 1984).

In physics, quantum annealing (QA) has attracted attention as an alternative annealing method that solves optimization problems through quantum fluctuations. QA is the quantum-mechanical version of SA. Quantum fluctuations are expected to enable us to find better local optima by helping the state avoid getting trapped in poor local optima at low temperatures. Kadowaki and Nishimori (1998), Farhi et al. (2001), and Santoro et al. (2002) showed that QA is more efficient than SA for global optimum searches in Ising spin models. Matsuda et al. demonstrated that the annealing time of SA grows as the problem becomes more complicated, whereas that of QA is nearly independent of the difficulty of the problem (Matsuda et al., 2009).

Quantum fluctuation in this thesis is achieved by running multiple SAs with interactions, as shown in Fig. 1.1, where m denotes the number of parallel runs.

[Fig. 1.1: Panel (a) shows m SAs running independently. Panel (b) shows the QA framework, which connects neighboring SAs with an interaction f and runs the multiple SAs interactively.]

Let us consider running multiple SAs, and let σ_j (j = 1, ..., m) indicate the latent variables, e.g., class assignments, of the observations in the j-th SA. QA runs the multiple SAs dependently, where "dependently" means that we introduce an interaction f among neighboring SAs that are randomly numbered, such as j − 1, j, and j + 1 (see Fig. 1.2), and we define σ_{m+1} = σ_1. The independent SAs have a very low transition probability among states at low temperature, i.e., they are trapped, as shown in Fig. 1.2(c), while the dependent QA can still change the state in that situation owing to the interaction f. Annealing means that the interaction f starts from zero (i.e., independent runs), gradually increases, and makes σ_{j−1} and σ_j approach each other. This scheme using the interaction f is derived from the QA mechanism explained in the following section.

We describe QA in terms of an optimization problem. When we run m SAs independently, we optimize each p(σ_j) individually, i.e., we find \tilde{σ}_j = argmax_{σ_j} p(σ_j) for each j, and we use the \tilde{σ}_j with the highest p(\tilde{σ}_j) over all j. In QA, we optimize the joint probability of the states {σ_j}_{j=1}^{m}:

  (\tilde{σ}_1, \tilde{σ}_2, ..., \tilde{σ}_m) = argmax_{(σ_1, σ_2, ..., σ_m)} p_{QA}(σ_1, σ_2, ..., σ_m).   (1.10)

Equation (1.10) means that p_{QA}(·) is a probability measure over a set of states, i.e., each state σ_j (j = 1, ..., m) can take an independent value and QA gives the probability of the whole set. A set of states (σ_1, σ_2, ..., σ_m) represents a (quantum) superposition of different states, i.e., p_{QA}(·) is a probability measure over superpositions of different states in physics. Note that p_{QA}(σ_1, σ_2, ..., σ_m) ≠ ∏_{j=1}^{m} p(σ_j). In clustering, (σ_1, σ_2, ..., σ_m) represents a superposition of m class assignments. This optimization algorithm is derived from the QA mechanism.
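The following toy sketch is not the algorithm derived in this thesis; it only illustrates the picture above under assumptions of my own (a made-up 1-D discrete energy landscape, a ring of m replicas, and simple linear schedules): several SA replicas are coupled by an agreement reward f that grows while the temperature shrinks.

```python
# Toy sketch (not the thesis algorithm): m coupled simulated-annealing replicas
# over a rugged discrete energy landscape. The coupling f rewards agreement
# between ring neighbors and increases over time while the temperature decreases,
# mimicking the scheme of Fig. 1.1. All numbers are illustrative.
import numpy as np

rng = np.random.default_rng(2)
E = np.array([3., 1., 4., 0.5, 4., 2., 5., 0., 5., 3.])   # energies of 10 discrete states
K, m = len(E), 8                                          # number of states and replicas
state = rng.integers(0, K, size=m)

def local_energy(s, j, f):
    # energy of replica j in state s, including coupling to its ring neighbors
    left, right = state[(j - 1) % m], state[(j + 1) % m]
    return E[s] - f * ((s == left) + (s == right))

for step in range(2000):
    T = max(0.05, 2.0 * (1 - step / 2000))    # temperature schedule (decreasing)
    f = 1.5 * step / 2000                     # replica coupling schedule (increasing)
    j = rng.integers(m)
    proposal = (state[j] + rng.choice([-1, 1])) % K
    dE = local_energy(proposal, j, f) - local_energy(state[j], j, f)
    if dE <= 0 or rng.random() < np.exp(-dE / T):   # Metropolis acceptance
        state[j] = proposal

print(state, "global minimum at state", int(np.argmin(E)))
```

By the end of the run the replicas tend to agree on a low-energy state, which is the intuition behind coupling the SAs rather than running them independently.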
We derived the quantum annealing variational Bayes (QAVB) inference and evaluated its effects through experiments. The VB inference is an optimization algorithm that minimizes the (negative) variational free energy. Since the VB inference is a gradient algorithm similar to the expectation-maximization (EM) algorithm (Dempster et al., 1977), it suffers from the local optima problem in practice. SA algorithms have been proposed for the EM algorithm (SAEM) (Ueda and Nakano, 1995) and for the VB inference (SAVB) (Katahira et al., 2008) to overcome issues with local optima. The QAVB inference is a generalization of the SAVB inference obtained by using a density matrix. Interestingly, although the QAVB inference is formulated with a density matrix, the algorithm we finally derive does not require operations on a density matrix, such as eigenvalue decomposition, and involves only simple changes to the SAVB algorithm.

The QAVB inference requires the assumption that the model structure, e.g., the number of components, is given, as does the VB inference. However, the model structure is often unknown in real problems. Nonparametric Bayesian modeling (Antoniak, 1974) is a useful approach in this case. A Dirichlet process mixture (DPM) model is a basic probabilistic mixture model in nonparametric Bayesian modeling, and a Gibbs sampler is often used for inference in DPM models. Therefore, we also propose a QA-based Gibbs sampler for DPM models.

Finally, we organize learning algorithms for probabilistic latent variable models as listed in Table 1.1.

[Fig. 1.2: (a) Schematic picture of SA. (Upper panel) At low temperature, the state often falls into local optima. (Lower panel) At high temperature, since the energy landscape becomes flat, the state can change over a wide range. (b), (c) Schematic picture of QA. (b) QA connects neighboring SAs. (c) σ_j can change its state owing to the interaction f; it appears to pass through the energy barrier.]

We categorize the algorithms into two groups according to their purposes: sampling and optimization. Sampling algorithms obtain a set of samples drawn from the (posterior) distribution and approximate expectations by finite sums. Optimization algorithms search for optimal solutions of an objective function. For example, the ML estimation aims to obtain parameters maximizing the data likelihood, and the MAP estimation is obtained by maximizing the posterior. The VB inference also searches for the optimal approximating distribution by maximizing the variational free energy. We propose two learning algorithms; both fall into the optimization category. Furthermore, these algorithms are classified according to the number of processes. Typical machine learning algorithms run a single process to search for optimal solutions. In particular, there has been no established approach in the optimization/multi-process category, yet a multi-process search is expected to improve search efficiency. Chapters 4 and 5 address this category, as shown in Table 1.1, by using a framework inspired by quantum mechanics, and they are the main contributions of this thesis. Algorithms in the other categories are described in Chapter 2.

1.3 Summary of Remaining Chapters

This thesis is organized as follows.

Chapter 2 describes learning algorithms for probabilistic latent variable models, in particular general algorithms for conjugate exponential models. This chapter covers the following categories: single-process/Markov chain Monte Carlo (MCMC), single-process/optimization, and multi-process/MCMC, as listed in Table 1.1. We call these categories classical.
Chapter 3 describes the applications of the learning algorithms explained in Chapter 2 to probabilistic latent variable models such as LDA and DPM models. Moreover, we present our contributions to classical machine learning algorithms: a deterministic online algorithm for LDA, a novel topic model using the Pitman-Yor (PY) process, and an application to graph clustering.

Chapter 4 forms the theoretical core of the thesis and presents the QAVB inference. We also explain the density matrix needed for the application to machine learning; the density matrix is the key concept for connecting quantum mechanics to machine learning. No prior knowledge of quantum mechanics is assumed. We show that the QAVB inference can be applied to the large family of conjugate-exponential graphical models to which the VB inference is applicable. Moreover, we demonstrate that the QAVB inference can outperform the simulated annealing VB (SAVB) inference on real data in terms of variational free energy optimization for the unigram mixture model, LDA, and the HMM. It is also shown that the QAVB inference is more effective than the SAVB inference in applications to clustering documents, extracting topics of documents, predicting users' preferences for music artists, and modeling web page visits of users.

Chapter 5 develops the QA Gibbs sampler for DPM models. The purpose of this chapter is to find the MAP estimate of DPM models as quickly as possible. We propose a QA variant of the Chinese restaurant process (CRP); the QA algorithm developed in Chapter 4 cannot be applied to the CRP. The key points are how to represent the states of the data and how to formulate a density matrix for the CRP. The QAVB inference has to use the number of distinct latent states to formulate a density matrix; however, this number is equal to the number of tables and so is unknown in the CRP. We developed a bit matrix representation for the CRP to formulate a density matrix and thereby avoid this problem. We show that the QA Gibbs sampler finds better MAP solutions of DPM models than the SA Gibbs sampler on document datasets.

Chapter 6 concludes with a summary of the main contributions of the thesis.

Table 1.1. Summary of inference algorithms for probabilistic latent variable models
  Single process
    MCMC/Sampling: Metropolis-Hastings, Gibbs sampler (GS), Split-Merge (SM) sampler, Slice sampler, Importance sampler.
    Optimization (variational approximation): Variational Bayes (VB) inference, Simulated annealing (SA) VB inference, SM-VB inference.
    Optimization (ML/MAP estimation): Expectation-Maximization (EM) algorithm, SA-EM algorithm, SM-EM algorithm, Stochastic search (SA-GS), Beam search, Greedy search.
  Multiple processes
    MCMC/Sampling: Particle filter.
    Optimization (variational approximation): QAVB inference (Chap. 4).
    Optimization (ML/MAP estimation): QA Gibbs sampler (Chap. 5).

Chapter 2 Learning Algorithms

This chapter discusses Bayesian learning algorithms for probabilistic latent variable models. The Bayesian framework models the process generating the data, taking into account prior beliefs about the generative process. Prior knowledge is useful to prevent over-fitting to training data and to make a model robust to outliers. The prior belief about model parameters is expressed by prior distributions on the parameters, possibly in a hierarchical manner. In Bayesian learning for probabilistic latent variable models, we need the posterior distribution over latent variables and parameters. However, it is typically impossible to calculate the posterior distribution analytically for the models of interest. Therefore, we need approximation techniques for inference.
There are mainly two approximate inference methods. The first is a finite sampling approximation, in which we use finite samples of the latent variables and parameters obtained through sampling techniques such as the Markov chain Monte Carlo (MCMC) algorithm. The second is a variational Bayes approximation, in which we use mathematically tractable approximate distributions in place of the true posteriors. In this chapter, we introduce these two types of algorithms for approximate inference.

We use directed graphical models to describe examples of probabilistic latent variable models. A directed graphical model provides a succinct description of the factorization of a joint distribution: nodes denote random variables, edges denote dependency relations between random variables, plates denote replication of a substructure with indexing of the relevant variables, and observed random variables are shaded while unobserved random variables are not. First, we introduce a basic class of graphical models, called conjugate-exponential (CE) models. Then, we explain the major learning algorithms used in Bayesian modeling.

2.1 Conjugate Exponential Models

CE models are often used for modeling problems and phenomena in machine learning because they are mathematically convenient and have useful algebraic properties. The models used in this research belong to this class. First, we explain the exponential family and then describe the CE models.

An exponential family is a class of probability distributions given by

  p(x|θ) = g(θ) f(x) exp(φ(θ)^T u(x)),   (2.1)
  g(θ)^{-1} = ∫ dx f(x) exp(φ(θ)^T u(x)),   (2.2)

where θ denotes the parameter of the family, φ(θ) is the vector of natural parameters, u(x) is the vector of sufficient statistics, f is the function that defines the exponential family, and g is a normalization constant. For example, we can write the Dirichlet distribution in this form:

  p(θ|α) = \frac{Γ(\sum_{t}^{T} α_t)}{\prod_{t}^{T} Γ(α_t)} \prod_{t}^{T} θ_t^{α_t - 1} = \frac{Γ(\sum_{t}^{T} α_t)}{\prod_{t}^{T} Γ(α_t)} exp( \sum_{t}^{T} (α_t - 1) log θ_t ),   (2.3)

where Γ(·) is the gamma function and Ψ(·) is the digamma function, the first derivative of the log gamma function. From this form, we immediately see that the natural parameter of the Dirichlet distribution is φ(α)_t = α_t − 1, the sufficient statistic is u(θ)_t = log θ_t, and g(α) = Γ(\sum_t α_t) / \prod_t Γ(α_t). The exponential family is mathematically convenient. For example, using the general fact that the expectation of the sufficient statistic equals the negative derivative of the log normalization constant with respect to the natural parameter,

  E[u(x)]_{p(x|θ)} = ∫ u(x) p(x|θ) dx = -\frac{∂ log g(θ)}{∂ φ(θ)},   (2.4)

we can easily calculate the expectation of log θ_t as

  E[log θ_t]_{p(θ|α)} = [ -\frac{∂ log g(α)}{∂ φ(α)} ]_t = Ψ(α_t) − Ψ( \sum_t α_t ),   (2.5)

which is often used in models involving the Dirichlet distribution.

The CE models use an exponential family for the data likelihood and the conjugate prior for the parameters. The prior p(θ|η, ν) is said to be conjugate to the likelihood p(x|θ) if and only if the posterior

  p(θ|\tilde{η}, \tilde{ν}) ∝ p(θ|η, ν) p(x|θ)   (2.6)

has the same parametric form as the prior. The conjugate prior over the parameters is given by

  p(θ|η, ν) = h(η, ν) g(θ)^{η} exp(φ(θ)^T ν),   (2.7)
  h(η, ν)^{-1} = ∫ dθ g(θ)^{η} exp(φ(θ)^T ν),   (2.8)

where the scalar η and the vector ν are hyperparameters of the prior, and h(η, ν) is a normalization constant. Using conjugate priors typically makes inference easier. In CE models with latent variables, the complete-data likelihood of data x_i and latent variable z_i is given by

  p(x_i, z_i|θ) = g(θ) f(x_i, z_i) exp(φ(θ)^T u(x_i, z_i)),   (2.9)
  g(θ)^{-1} = ∫ dx_i \sum_{z_i} f(x_i, z_i) exp(φ(θ)^T u(x_i, z_i)).   (2.10)

The parameter prior is conjugate to the complete-data likelihood. A graphical model of the CE models is shown in Fig. 2.1.

[Fig. 2.1: Example of a graphical model for CE models. Each node denotes a random variable; edges denote dependence between random variables. Data x_i is generated from latent variable z_i and the parameter, and latent variable z_i is also generated from the parameter. The plate denotes n-fold replication of the random variables. Observed random variables {x_i} are shaded; unobserved random variables are not.]
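The identity in Eq. (2.5) can be checked numerically. The following minimal sketch (not from the thesis) compares a Monte Carlo estimate of E[log θ_t] under a Dirichlet distribution with the digamma closed form; the hyperparameter values are illustrative.

```python
# Minimal sketch (not from the thesis): checking Eq. (2.5),
# E[log theta_t] = psi(alpha_t) - psi(sum_t alpha_t), by Monte Carlo.
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(3)
alpha = np.array([1.5, 2.0, 3.5])                 # illustrative Dirichlet hyperparameters
samples = rng.dirichlet(alpha, size=200_000)

mc = np.log(samples).mean(axis=0)                 # Monte Carlo estimate of E[log theta]
exact = digamma(alpha) - digamma(alpha.sum())     # closed form from the exponential-family identity
print(np.round(mc, 3), np.round(exact, 3))        # the two vectors should nearly coincide
```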
2.2 Gibbs Sampler

The Gibbs sampler is a widely used MCMC algorithm that generates a sequence of samples from the joint probability distribution of multivariate random variables. It can be considered a special case of the Metropolis-Hastings algorithm. The Gibbs sampler is commonly used as a means of statistical inference, especially Bayesian inference, for example for approximating the joint distribution, the marginal distribution of one variable or a subset of the variables, or an integral such as the expected value of one of the variables.

Let us consider the joint distribution p(z_{1:n}) = p(z_1, z_2, ..., z_n) from which we wish to sample. In the Gibbs sampling procedure, we replace variable z_i with a value drawn from the distribution p(z_i | z_{1:n}^{-i}) conditioned on the current values of the other variables, where z_{1:n}^{-i} denotes {z_1, ..., z_{i-1}, z_{i+1}, ..., z_n}. Therefore, the Gibbs sampler works well when the conditional distribution of each variable, p(z_i | z_{1:n}^{-i}), is known and easy to sample from, even if the joint distribution p(z_{1:n}) is difficult to sample from directly. Geman and Geman (1984) proved that the sequence of samples constitutes a Markov chain whose stationary distribution is exactly the sought-after joint distribution.

A blocked Gibbs sampler does not deal with each variable individually but groups two or more variables together and samples from their joint distribution conditioned on all other variables. A collapsed Gibbs sampler marginalizes over one or more variables when sampling some other variable (Liu, 1994). For example, imagine that a model consists of variables z_{1:n} and θ, where z_i depends on θ. A simple Gibbs sampler would sample from p(z_i|θ) (i = 1, ..., n) and then from p(θ|z_{1:n}), which resembles an EM-algorithm procedure. A collapsed Gibbs sampler integrates out the variable θ and samples from the marginal distribution p(z_i | z_{1:n}^{-i}) = ∫ p(z_i|θ) p(θ|z_{1:n}^{-i}) dθ instead of p(z_i|θ). The distribution over z_i that arises when collapsing θ is generally tractable when p(θ) is the conjugate prior for p(z|θ), particularly when p(z|θ) and p(θ) are members of the exponential family.

The iterated conditional modes (ICM) algorithm is obtained if, at each step of the Gibbs sampling algorithm, we make a point estimate of a variable given by the maximum of its conditional distribution. The ICM can be seen as a greedy search algorithm for the MAP estimation.
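The following minimal sketch (not from the thesis) runs a Gibbs sampler on a toy target where the full conditionals are available in closed form, a zero-mean bivariate Gaussian with correlation ρ; the value of ρ and the chain length are illustrative.

```python
# Minimal sketch (not from the thesis): Gibbs sampling for a bivariate Gaussian
# with zero means, unit variances, and correlation rho, using the exact full
# conditionals p(z1 | z2) and p(z2 | z1).
import numpy as np

rng = np.random.default_rng(4)
rho = 0.8
z1, z2 = 0.0, 0.0
samples = []
for t in range(20_000):
    z1 = rng.normal(rho * z2, np.sqrt(1 - rho**2))   # z1 | z2 ~ N(rho*z2, 1 - rho^2)
    z2 = rng.normal(rho * z1, np.sqrt(1 - rho**2))   # z2 | z1 ~ N(rho*z1, 1 - rho^2)
    samples.append((z1, z2))

samples = np.array(samples[5_000:])                  # discard burn-in
print(np.corrcoef(samples.T)[0, 1])                  # should be close to rho
```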
2.3 Slice Sampler

The slice sampler is an auxiliary variable method that samples the variables we are interested in by using auxiliary variables (Neal, 2000). The slice sampler draws samples from the joint space of the target variable and an auxiliary variable. A fundamental theorem of simulation underlies the slice sampler (Neal, 2000).

Theorem 2.3.1. Simulating from a density

  x ∼ f(x)   (2.11)

is equivalent to simulating uniformly from the region under its density function,

  (x, u) ∼ U{(x, u) : 0 < u < f(x)},   (2.12)

where U is a uniform distribution.

If f is the density of interest, on an arbitrary space, we have

  f(x) = ∫_0^{f(x)} du.   (2.13)

Thus, f appears as the marginal density of the joint distribution

  (x, u) ∼ U{(x, u) : 0 < u < f(x)}.   (2.14)

Since u is not directly related to the original problem, it is called an auxiliary variable. We can generate from this joint density of x and u by generating a uniform random variable on the constrained set {(x, u) : 0 < u < f(x)}. Moreover, since the marginal distribution of x is the target distribution f, we generate a random variable from f by generating a uniform variable on {(x, u) : 0 < u < f(x)}. This generation uses f only through the evaluation of f(x). This idea plays a key role in the construction of the slice sampler.

The slice sampler implements a random walk on {(x, u) : 0 < u < f(x)} that moves iteratively along the u-axis and then along the x-axis. For example, starting from a point (x_0, u_0) in {(x, u) : 0 < u < f(x)}, the move along the u-axis corresponds to the conditional distribution

  u_1 ∼ U{u : u ≤ f(x_0)},   (2.15)

resulting in a change from point (x_0, u_0) to point (x_0, u_1), and the subsequent move along the x-axis follows the conditional distribution

  x_1 ∼ U{x : u_1 ≤ f(x)},   (2.16)

resulting in a change from point (x_0, u_1) to (x_1, u_1).

This algorithm remains applicable if f is an unnormalized density, which is quite useful. Suppose \bar{f}(x) = f(x)/Z, where Z = ∫ f(x) dx. The algorithm samples uniformly from the area under f, given by

  p(x, u) = 1/Z if 0 < u < f(x), and 0 otherwise.   (2.17)

Note that the marginal distribution over x is given by

  ∫ p(x, u) du = ∫_0^{f(x)} \frac{1}{Z} du = \frac{f(x)}{Z} = \bar{f}(x).   (2.18)

In this way, the marginal distribution is equivalent to the target distribution.

Algorithm 1 Slice sampler
1: for all iterations t do
2:   Draw u^{(t)} from U[0, f(x^{(t-1)})].
3:   Draw x^{(t)} from U{x : 0 < u^{(t)} < f(x)}.
4: end for
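The following minimal sketch (not from the thesis) implements Algorithm 1 for an unnormalized bimodal density of my own choosing. The horizontal step "draw x uniformly from {x : u < f(x)}" is done naively by rejection inside a fixed interval, which is valid (though slow) whenever the support is contained in that interval; more general implementations use adaptive slice-finding schemes.

```python
# Minimal sketch (not from the thesis): Algorithm 1 for an unnormalized density f
# on a bounded interval, with the horizontal step done by rejection in [lo, hi].
import numpy as np

def f(x):
    # illustrative unnormalized two-mode density
    return np.exp(-0.5 * (x - 2.0) ** 2) + 0.5 * np.exp(-0.5 * (x + 2.0) ** 2)

rng = np.random.default_rng(5)
lo, hi, x = -8.0, 8.0, 0.0
samples = []
for t in range(20_000):
    u = rng.uniform(0.0, f(x))       # vertical step: u | x ~ U(0, f(x)), Eq. (2.15)
    while True:                      # horizontal step: x | u ~ U{x : u < f(x)}, Eq. (2.16)
        x_new = rng.uniform(lo, hi)
        if f(x_new) > u:
            x = x_new
            break
    samples.append(x)

print(np.mean(samples), np.std(samples))
```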
2.4 Particle Filter

Particle filters are usually used for approximating a probability distribution over a latent variable as observations are acquired in a dynamical system (Kitagawa, 1996; Doucet et al., 2000). Particle filters are also known as sequential Monte Carlo methods because they are the sequential (online) analogue of batch MCMC methods, and they are often similar to importance sampling methods. Particle filters are so named because they approximate the distribution of a latent variable at a specific time, given all observations up to that time, by a set of particles, which are differently weighted samples of the distribution. Particle filters are also useful in a multi-processor environment, since it is easy to parallelize multiple particles by dedicating one particle to each machine.

The probability of the latent variable at time i+1 given the observations up to time i is

  p(z_{i+1}|x_{1:i}) = ∫ p(z_{i+1}|z_{1:i}) p(z_{1:i}|x_{1:i}) dz_{1:i}
                     = ∫ p(z_{i+1}|z_{1:i}) \frac{p(z_{1:i}|x_{1:i})}{q(z_{1:i}|x_{1:i})} q(z_{1:i}|x_{1:i}) dz_{1:i}
                     = ∫ p(z_{i+1}|z_{1:i}) w(z_{1:i}) q(z_{1:i}|x_{1:i}) dz_{1:i},  where w(z_{1:i}) = \frac{p(z_{1:i}|x_{1:i})}{q(z_{1:i}|x_{1:i})},   (2.19)

and q(z_{1:i}|x_{1:i}) is a proposal distribution. We approximate the integral in Eq. (2.19) by using samples z_{1:i}^{(s)} ∼ q(z_{1:i}|x_{1:i}) (s = 1, ..., S), i.e.,

  p(z_{i+1}|x_{1:i}) ≈ \frac{1}{S} \sum_{s=1}^{S} p(z_{i+1}|z_{1:i}^{(s)}) w(z_{1:i}^{(s)}).   (2.20)

The weight for each particle can be calculated sequentially as follows:

  \frac{w(z_{1:i}^{(s)})}{w(z_{1:i-1}^{(s)})} = \frac{ p(z_{1:i}^{(s)}, x_{1:i}) p(x_{1:i-1}) / ( p(z_{1:i-1}^{(s)}, x_{1:i-1}) p(x_{1:i}) ) }{ q(z_{1:i}^{(s)}|x_{1:i}) / q(z_{1:i-1}^{(s)}|x_{1:i-1}) }   (2.21)
    ∝ \frac{ p(z_{1:i}^{(s)}, x_{1:i}) / p(z_{1:i-1}^{(s)}, x_{1:i-1}) }{ q(z_i^{(s)}|x_{1:i}, z_{1:i-1}^{(s)}) }   (2.22)
    = \frac{ p(z_i^{(s)}, x_i | z_{1:i-1}^{(s)}, x_{1:i-1}) }{ q(z_i^{(s)}|x_{1:i}, z_{1:i-1}^{(s)}) }   (2.23)
    = \frac{ p(x_i | z_i^{(s)}, z_{1:i-1}^{(s)}, x_{1:i-1}) p(z_i^{(s)} | z_{1:i-1}^{(s)}, x_{1:i-1}) }{ q(z_i^{(s)}|x_{1:i}, z_{1:i-1}^{(s)}) }.   (2.24)

The transition distribution p(z_i^{(s)} | z_{1:i-1}^{(s)}, x_{1:i-1}) is typically easy to sample from and is used as the proposal distribution, i.e., q(z_i^{(s)}|x_{1:i}, z_{1:i-1}^{(s)}) = p(z_i^{(s)} | z_{1:i-1}^{(s)}, x_{1:i-1}). In this case, we have

  w(z_{1:i}^{(s)}) ∝ w(z_{1:i-1}^{(s)}) p(x_i | z_i^{(s)}, z_{1:i-1}^{(s)}, x_{1:i-1}).   (2.25)

The weights are normalized to sum to one, i.e., \sum_s w(z_{1:i}^{(s)}) = 1.

Resampling is often used to avoid the problem that all but one of the importance weights become close to zero. The performance of the algorithm is also affected by the choice of resampling method; stratified resampling is optimal in terms of variance. First, we compute an estimate of the effective sample size (ESS) of the particles as

  ESS = \frac{1}{\sum_{s=1}^{S} (w^{(s)})^2}.   (2.26)

If the ESS is less than a given threshold r, i.e., ESS < r, then we perform the following resampling:
1. Draw S particles from the current particle set with probabilities proportional to their weights and replace the current particle set with this new one.
2. For s = 1, ..., S, set w^{(s)} = 1/S.

Particle filters recover the true posterior as the number of particles S → ∞.
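The following minimal sketch (not from the thesis) is a bootstrap particle filter for a toy linear-Gaussian state-space model of my own choosing: the transition is used as the proposal so the weight update reduces to Eq. (2.25), and resampling is triggered by the ESS of Eq. (2.26). For brevity it uses simple multinomial resampling rather than the stratified scheme mentioned above.

```python
# Minimal sketch (not from the thesis): bootstrap particle filter for
# z_i = 0.9 z_{i-1} + noise, x_i = z_i + noise, with ESS-triggered resampling.
import numpy as np

rng = np.random.default_rng(6)
n, S = 100, 500
z_true, x = np.zeros(n), np.zeros(n)
for i in range(1, n):                                   # simulate the toy system
    z_true[i] = 0.9 * z_true[i - 1] + rng.normal(0, 1.0)
    x[i] = z_true[i] + rng.normal(0, 0.5)

particles = rng.normal(0, 1.0, size=S)
weights = np.full(S, 1.0 / S)
estimates = []
for i in range(1, n):
    particles = 0.9 * particles + rng.normal(0, 1.0, size=S)    # propose from the transition
    weights *= np.exp(-0.5 * ((x[i] - particles) / 0.5) ** 2)   # Eq. (2.25): multiply by likelihood
    weights /= weights.sum()                                    # normalize to sum to one
    if 1.0 / np.sum(weights ** 2) < S / 2:                      # Eq. (2.26): ESS below threshold
        idx = rng.choice(S, size=S, p=weights)                  # multinomial resampling
        particles, weights = particles[idx], np.full(S, 1.0 / S)
    estimates.append(np.sum(weights * particles))

print(np.mean(np.abs(np.array(estimates) - z_true[1:])))        # mean filtering error
```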
2.5 Simulated Annealing

SA (Kirkpatrick et al., 1983) is a stochastic optimization framework. SA can be applied to various machine learning problems by using the Gibbs measure, a probability measure frequently seen in statistical mechanics and widely apparent in many problems outside physics; it is the measure associated with the Boltzmann distribution. Let E(z) be an energy function from the space of states to the real values, interpreted in physics as the energy of the configuration z. In machine learning, E(z) is formulated as the objective function, or loss function, such that a better solution has lower energy. For probabilistic models, the energy function is defined by the negative log likelihood, E(z) ≡ −log p(z). The Gibbs measure gives the probability of the random variable z having value x as

  p(z = x) = \frac{1}{Z(T)} exp( -\frac{E(x)}{T} ).   (2.27)

Here, the parameter T is a hyperparameter, the temperature in physics. The normalizing constant Z(T) is called the partition function.

In Bayesian learning, SA is often used for searching for the MAP estimate and can be used with any Markov chain sampling procedure. When a Markov chain simulation is used to sample from the Gibbs measure given some energy function, the SA procedure gradually reduces the temperature T from a high initial value to the low value at which we wish to sample. The goal is to sample from the Gibbs distribution at a temperature of zero, at which the probability is concentrated on the state or states of minimal energy. Let us consider the simulated annealing Gibbs sampler (SAGS). In each step, the SAGS searches for the next solution near the current one. The next solution is chosen according to a Gibbs measure that depends on both T and the energy of the candidate. The SAGS chooses the next solution almost at random when T is high, and it descends the energy function when T is low. Slower cooling of T increases the probability of finding the global optimum. Geman and Geman (1984) analyzed the following schedule for the temperature:

  T_t = \frac{T_1}{log(t)},  for t > 1,   (2.28)

where t denotes a time or step in the simulation. They proved that, under certain conditions, there exists a value of T_1 such that this logarithmic schedule guarantees that the distribution eventually concentrates on the set of points where the energy reaches its global minimum.

2.6 Variational Bayes Inference

Let us consider n data points x_{1:n} = {x_1, x_2, ..., x_n} and latent variables z_{1:n} = {z_1, z_2, ..., z_n} corresponding to the data points. For simplicity, θ denotes the collection of model parameters. In the Bayesian approach, we are interested in the posterior distribution over z_{1:n} and θ given the observed data x_{1:n}. Using Bayes' theorem, the posterior is given by

  p(θ, z_{1:n}|x_{1:n}) = \frac{p(x_{1:n}|z_{1:n}, θ) p(z_{1:n}, θ)}{p(x_{1:n})},   (2.29)
  p(x_{1:n}) = \sum_{z_{1:n}} ∫ p(x_{1:n}|z_{1:n}, θ) p(z_{1:n}, θ) dθ.   (2.30)

When this calculation is computationally intractable, we have to use an approximation method for estimating the posterior in practice. MCMC is a popular method for Bayesian inference. However, MCMC is not always applicable because of its slow speed and the difficulty of determining whether the chain has converged.

The VB inference approximates the posterior p(z_{1:n}, θ|x_{1:n}) by a variational approximation distribution q(z_{1:n}, θ). The variational distribution can be freely selected; however, we often construct it so that the optimization problem becomes easy to solve. The VB inference converts the calculation of the posterior p(z_{1:n}, θ|x_{1:n}) in Eq. (2.29) into an optimization problem, namely minimizing the KL divergence between q(z_{1:n}, θ) and p(z_{1:n}, θ|x_{1:n}). In other words, instead of calculating Eq. (2.29), we estimate the variational distribution by solving

  q^*(z_{1:n}, θ) = argmin_{q(z_{1:n}, θ)} KL[q(z_{1:n}, θ) || p(z_{1:n}, θ|x_{1:n})].   (2.31)

Unfortunately, the optimization problem of Eq. (2.31) is also intractable; therefore, we consider the following theorem.

Theorem 2.6.1. The relationship between the log likelihood of the data and the KL divergence is given by

  log p(x_{1:n}) = F[q(z_{1:n}, θ)] + KL[q(z_{1:n}, θ) || p(z_{1:n}, θ|x_{1:n})],   (2.32)
  F[q(z_{1:n}, θ)] = \sum_{z_{1:n}} ∫ q(z_{1:n}, θ) log \frac{p(x_{1:n}, z_{1:n}, θ)}{q(z_{1:n}, θ)} dθ.   (2.33)

Proof.

  log p(x_{1:n}) = log \sum_{z_{1:n}} ∫ p(x_{1:n}, z_{1:n}, θ) dθ   (2.34)
               = log \sum_{z_{1:n}} ∫ q(z_{1:n}, θ) \frac{p(x_{1:n}, z_{1:n}, θ)}{q(z_{1:n}, θ)} dθ   (2.35)
               ≥ \sum_{z_{1:n}} ∫ q(z_{1:n}, θ) log \frac{p(x_{1:n}, z_{1:n}, θ)}{q(z_{1:n}, θ)} dθ   (Jensen's inequality)
               = F[q(z_{1:n}, θ)].   (2.36)

Thus, we have

  log p(x_{1:n}) − F[q(z_{1:n}, θ)] = \sum_{z_{1:n}} ∫ q(z_{1:n}, θ) dθ log p(x_{1:n}) − \sum_{z_{1:n}} ∫ q(z_{1:n}, θ) log \frac{p(x_{1:n}, z_{1:n}, θ)}{q(z_{1:n}, θ)} dθ
    = \sum_{z_{1:n}} ∫ q(z_{1:n}, θ) log \frac{q(z_{1:n}, θ)}{p(z_{1:n}, θ|x_{1:n})} dθ
    = KL[q(z_{1:n}, θ) || p(z_{1:n}, θ|x_{1:n})].   (2.37)

F[q(z_{1:n}, θ)] is called the variational free energy.

[Fig. 2.2: Relationships among the log likelihood of the data (L), the KL divergence (KL), and the variational free energy (F).]

As Fig. 2.2 shows, Eq. (2.32) means that the minimization of the KL divergence is equivalent to the maximization of F[q(z_{1:n}, θ)] with respect to q(z_{1:n}, θ), because the log likelihood log p(x_{1:n}) is constant with respect to q(z_{1:n}, θ). Therefore, instead of Eq. (2.31), we solve the optimization problem

  q^*(z_{1:n}, θ) = argmax_{q(z_{1:n}, θ)} F[q(z_{1:n}, θ)].   (2.38)

The variational distribution is generally assumed to be fully factorized, with each factorized component in the exponential family, to make the optimization problem tractable. We often assume

  q(z_{1:n}, θ) = q(z_{1:n}) q(θ) = [ \prod_j q(z_j) ] q(θ).   (2.39)
We obtain the update equations by taking functional derivatives of F[q(z_{1:n}, θ)] with respect to {q(z_j)} and q(θ), respectively, and equating them to zero. The update formulas are

  q(z_j = k) ∝ exp( ∫ q(θ) log p(x_j, z_j = k, θ) dθ ),   (2.40)
  q(θ) ∝ p(θ) exp( \sum_{z_{1:n}} q(z_{1:n}) log p(x_{1:n}, z_{1:n}|θ) ).   (2.41)

Algorithm 2 shows the VB inference.

Algorithm 2 Variational Bayes inference
1: Initialize model parameters.
2: for all iterations t such that 1 ≤ t ≤ L, where L denotes the number of iterations, do
3:   for i = 1, ..., n do
4:     VB-E step: Update q(z_i) by Eq. (2.40).
5:   end for
6:   VB-M step: Update q(θ) by Eq. (2.41).
7: end for

2.7 Simulated Annealing Variational Bayes Inference

Katahira et al. (2008) proposed simulated annealing for variational Bayes inference (SAVB) by introducing the inverse temperature parameter β = 1/T into the variational free energy F[q(z_{1:n}), q(θ)] as follows:

  F_β[q(z_{1:n}), q(θ)] = ⟨log p(x_{1:n}, z_{1:n}, θ)⟩_{q(z_{1:n}) q(θ)} + \frac{1}{β} S[q(z_{1:n}) q(θ)].   (2.42)

This approach is based on a basic equation of statistical physics, F = E − TS, where F is the free energy, T is the temperature, and S is the entropy.*1 We obtain the following updates by taking functional derivatives of F_β[q(z_{1:n}), q(θ)] with respect to q(z_{1:n}) and q(θ) and equating them to zero:

  q(z_j = k) ∝ exp( β ∫ q(θ) log p(x_j, z_j = k, θ) dθ ),   (2.43)
  q(θ) ∝ [ p(θ) exp( \sum_{z_{1:n}} q(z_{1:n}) log p(x_{1:n}, z_{1:n}|θ) ) ]^{β}.   (2.44)

Algorithm 3 shows the SAVB inference. Other approaches to overcoming the local maximum problem are split-and-merge algorithms for the EM algorithm (Ueda et al., 1998, 1999, 2000) and for the VB inference (Ueda and Ghahramani, 2002).

*1 Note that the free energy F corresponds to the negative variational free energy.

Algorithm 3 Simulated annealing variational Bayes inference
1: Initialize the inverse temperature β and model parameters.
2: for all iterations t such that 1 ≤ t ≤ L, where L denotes the number of iterations, do
3:   repeat
4:     for i = 1, ..., n do
5:       VB-E step: Update q(z_i) by Eq. (2.43).
6:     end for
7:     VB-M step: Update q(θ) by Eq. (2.44).
8:   until convergence
9:   Increase the inverse temperature β (if β > 1, set β = 1).
10: end for

2.8 Variational Bayes Inference for Conjugate Exponential Models

We can explicitly apply the VB method to the CE models and derive a simple general update algorithm:

  q(z_i) ∝ f(x_i, z_i) exp( ⟨φ(θ)⟩^T u(x_i, z_i) ),   (2.45)
  q(θ) = h( η + n, ν + \sum_{i}^{n} ⟨u(x_i, z_i)⟩ ) g(θ)^{η + n} exp( φ(θ)^T ( ν + \sum_{i}^{n} ⟨u(x_i, z_i)⟩ ) ).   (2.46)

The SAVB inference uses the following updates:

  q(z_i) ∝ [ f(x_i, z_i) exp( ⟨φ(θ)⟩^T u(x_i, z_i) ) ]^{β},   (2.47)
  q(θ) ∝ g(θ)^{β(η + n)} exp( φ(θ)^T β( ν + \sum_{i}^{n} ⟨u(x_i, z_i)⟩ ) ).   (2.48)

We use ⟨·⟩ to denote expectation under the variational posterior over the random variables, that is,

  ⟨φ(θ)⟩ = ∫ dθ q(θ) φ(θ),   (2.49)
  ⟨u(x_i, z_i)⟩ = \sum_{z_i} u(x_i, z_i) q(z_i).   (2.50)
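The following sketch is not the SAVB algorithm above; it only illustrates the tempering idea behind it on a model it does not require, using an annealed EM for a two-component 1-D Gaussian mixture (in the spirit of the SAEM of Ueda and Nakano cited in Chapter 1). The responsibilities are computed at an inverse temperature β < 1 and β is raised toward 1; the data, schedule, and initialization are illustrative assumptions.

```python
# Sketch (not the thesis's SAVB): annealed E-step for EM on a two-component
# 1-D Gaussian mixture. Responsibilities are proportional to
# (pi_k N(x | mu_k, sigma_k))^beta and beta is increased toward 1.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 300)])
pi, mu, sigma = np.array([0.5, 0.5]), np.array([-0.5, 0.5]), np.array([1.0, 1.0])

for it in range(50):
    beta = min(1.0, 0.1 + 0.9 * it / 30)            # inverse-temperature schedule
    dens = pi * norm.pdf(x[:, None], mu, sigma)     # tempered E-step
    r = dens ** beta
    r /= r.sum(axis=1, keepdims=True)
    nk = r.sum(axis=0)                              # standard M-step with weighted statistics
    pi = nk / nk.sum()
    mu = (r * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)

print(pi.round(2), mu.round(2), sigma.round(2))
```

At small β the responsibilities are nearly uniform, so early iterations explore broadly before the updates sharpen, which is the same intuition as increasing β in Algorithm 3.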
2.9 Collapsed Variational Bayes Inference

With the variational Bayes (VB) inference, a factorized form for the posterior distribution of parameters and latent variables is assumed, which means that we assume the parameters are independent of the latent variables, i.e., q(z_{1:n}, θ) = q(θ|z_{1:n}) q(z_{1:n}) = q(θ) q(z_{1:n}). However, parameters and latent variables can be dependent in many cases. The collapsed variational Bayes (CVB) inference was proposed to relax this factorization assumption (Teh et al., 2007). Instead of assuming the parameters to be independent of the latent variables, the CVB inference marginalizes over the parameters and deals with the marginal distribution over observations and latent variables:

  p(x_{1:n}, z_{1:n}) = ∫ p(x_{1:n}, z_{1:n}|θ) p(θ) dθ.   (2.51)

The variational lower bound in the CVB inference is given by

  F_{CVB}[q(z_{1:n})] = \sum_{z_{1:n}} q(z_{1:n}) log p(x_{1:n}, z_{1:n}) − \sum_{z_{1:n}} q(z_{1:n}) log q(z_{1:n}).   (2.52)

The only assumption in the CVB inference is that the latent variables {z_i} are mutually independent, q(z_{1:n}) = \prod_i q(z_i). Thus, the update for the variational posterior of the latent variables is derived by taking derivatives of F_{CVB}[q(z_{1:n})] with respect to {q(z_i)} and equating them to zero:

  q(z_i = k) ∝ exp( \sum_{z_{1:n}^{-i}} q(z_{1:n}^{-i}) log p(x_{1:n}, z_{1:n}^{-i}, z_i = k) ),   (2.53)

where the superscript −i means that the corresponding variables are excluded, i.e., z_{1:n}^{-i} = z_{1:n} \ z_i. The exact computation of the above expectation is too expensive to be practical, and so a Gaussian approximation is applied.

The CVB inference is a better approximation than the standard VB inference. Teh et al. (2007) proved this assertion as follows. Let F_{VB}[q(z_{1:n}) q(θ)] be the variational lower bound in the VB inference.

Theorem 2.9.1. The CVB lower bound is tighter than the VB lower bound:

  F_{VB}[q(z_{1:n}) q(θ)] ≤ F_{CVB}[q(z_{1:n})].   (2.54)

Proof. In the variational lower bound F[q(z_{1:n}, θ)] = F[q(θ|z_{1:n}) q(z_{1:n})], the assumption q(θ|z_{1:n}) = p(θ|x_{1:n}, z_{1:n}) yields F[q(z_{1:n}, θ)] = F_{CVB}[q(z_{1:n})]. Since the maximum of F[q(z_{1:n}, θ)] with respect to q(θ|z_{1:n}) is achieved at the true posterior,

  F[q(θ|z_{1:n}) q(z_{1:n})] ≤ max_{q(θ|z_{1:n})} F[q(θ|z_{1:n}) q(z_{1:n})] = F[p(θ|x_{1:n}, z_{1:n}) q(z_{1:n})] = F_{CVB}[q(z_{1:n})].   (2.55)

Therefore, since the assumption q(θ|z_{1:n}) = q(θ) yields F[q(z_{1:n}, θ)] = F_{VB}[q(z_{1:n}) q(θ)],

  F_{VB}[q(z_{1:n}) q(θ)] ≤ F_{CVB}[q(z_{1:n})].   (2.56)

Chapter 3 Probabilistic Latent Variable Models

This chapter describes the probabilistic latent variable models used in the following chapters. First, we explain topic models, which are major probabilistic models in which latent variables work well. Second, we explain the Dirichlet process mixture models, which are highly flexible probabilistic models for modeling observations. Finally, we describe the proposed topic model, the Pitman-Yor topic model.

3.1 Topic Models

Topic models have attracted a great deal of attention in machine learning in the last decade. Topic models are useful tools for the statistical analysis of document collections and other discrete data, with applications including information retrieval, collaborative filtering, and image processing. Topic models represent documents as random mixtures over latent topics, and each topic is characterized by a distribution over words. Topic models basically assume the "bag-of-words" representation of a document, in which the order of words in a document is not considered.

We define the notation. T is the number of topics, M is the number of documents, V is the vocabulary size, and N_d is the number of words in document d. w_d denotes the word vector of document d, and w_{d,i} is the i-th word in document d. Multi(·) is the multinomial distribution and Dir(·) is the Dirichlet distribution. Φ is a T × V multinomial parameter matrix where φ_{t,v} specifies the probability of generating word v given topic t; that is, φ_t is the word distribution corresponding to topic t. The word distribution φ_t can be viewed as the representation of topic t. For example, a specific topic, e.g., search engines, can be represented as a multinomial distribution that gives high probability to words specific to the topic, e.g., Google, Yahoo, or MSN.

[Fig. 3.1: Graphical models of (a) the mixture of unigrams, (b) pLSI, and (c) LDA.]
3.1.1 Mixture of Unigrams

The bag-of-words assumption implies that the words of every document are drawn independently from a single multinomial distribution:

p(w_d) = p(w_{d,1}, w_{d,2}, \ldots, w_{d,N_d}) = \prod_{i=1}^{N_d} p(w_{d,i}).   (3.1)

This model is called the unigram model. If we augment the unigram model with a discrete random topic variable z_d for each document d, we obtain the mixture of unigrams model (Nigam et al., 2000), which is a probabilistic mixture model for the unigram model with discrete latent variables. The mixture of unigrams assumes that each document is generated from the word distribution corresponding to a single topic, which is a latent variable, and that words are independently generated from the conditional multinomial p(w|z). The probability of document w_d is given by

p(w_d) = \sum_{z_d} p(z_d) p(w_d|z_d) = \sum_{z_d} p(z_d) \prod_{i=1}^{N_d} p(w_{d,i}|z_d).   (3.2)

The graphical model is depicted in Fig. 3.1 (a).

Algorithm 4 Generative process of LDA
1: for all topics t (= 1, ..., T) do
2:   Draw \phi_t \sim Dir(\phi_t|\beta) \propto \prod_v \phi_{t,v}^{\beta - 1}.
3: end for
4: for all documents d (= 1, ..., M) do
5:   Draw \theta_d \sim Dir(\theta_d|\alpha) \propto \prod_t \theta_{d,t}^{\alpha_t - 1}.
6:   for all words i (= 1, ..., N_d) do
7:     Draw topic z_{d,i} \sim Multi(z|\theta_d).
8:     Draw word w_{d,i} \sim p(w|\phi_{z_{d,i}}).
9:   end for
10: end for

3.1.2 Probabilistic Latent Semantic Indexing

Probabilistic latent semantic indexing (pLSI) (Hofmann, 1999) is a statistical model that is called an aspect model. The aspect model is a latent variable model for co-occurrence data, which associates an unobserved latent variable z with each observation. pLSI relaxes the assumption made in the mixture of unigrams model that each document is generated from only one topic, and it models multiple topics in a document. The pLSI model posits that a document label d and a word w are conditionally independent given a topic z. The joint probability over a document d and a word w is given by

p(d, w) = p(d) \sum_{z} p(w|z) p(z|d).   (3.3)

Each observed pair (d, w) is assumed to be generated independently under the bag-of-words assumption. The graphical model is depicted in Fig. 3.1 (b).

3.1.3 Latent Dirichlet Allocation

pLSI is a generative model at the word level but not at the document level. Latent Dirichlet allocation (LDA) provides a hierarchical Bayesian generative model at both the word and document levels. Let \theta_d be the topic distribution of document d, which is a T-dimensional probability vector whose entry \theta_{d,t} specifies the probability of generating topic t in document d. LDA assumes that \theta_d follows the Dirichlet distribution with parameter \alpha, which is a T-dimensional vector. Based on these definitions, the generative process is shown in Algorithm 4. The likelihood of document w_d is

p(w_d|\alpha, \beta) = \sum_{z_d} \int\!\!\int p(\phi_{1:T}|\beta)\, p(\theta_d|\alpha) \prod_{i=1}^{N_d} p(w_{d,i}|\phi_{z_{d,i}})\, p(z_{d,i}|\theta_d)\, d\theta_d\, d\phi_{1:T}.   (3.4)

The graphical model is depicted in Fig. 3.1 (c).

De Finetti's representation theorem underlies the modeling assumptions of LDA. The theorem establishes that any collection of exchangeable random variables has a representation as a mixture distribution, in general an infinite mixture. Thus, if we wish to consider exchangeable representations for documents and words, it is reasonable to consider mixture models that capture the exchangeability of both words and documents.
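As a concrete illustration of Algorithm 4, the following Python/NumPy sketch samples a synthetic corpus from the LDA generative process. The function name and the symmetric choices of \alpha and \beta are assumptions made for this sketch.

```python
import numpy as np

def generate_lda_corpus(T=5, M=20, V=100, N_d=50, alpha=0.1, beta=0.01, seed=0):
    """Sample a toy corpus from the LDA generative process of Algorithm 4."""
    rng = np.random.default_rng(seed)
    phi = rng.dirichlet(np.full(V, beta), size=T)      # phi_t ~ Dir(beta): word distribution of each topic
    docs = []
    for _ in range(M):
        theta_d = rng.dirichlet(np.full(T, alpha))     # theta_d ~ Dir(alpha): topic distribution of document d
        z_d = rng.choice(T, size=N_d, p=theta_d)       # z_{d,i} ~ Multi(theta_d)
        w_d = np.array([rng.choice(V, p=phi[t]) for t in z_d])   # w_{d,i} ~ Multi(phi_{z_{d,i}})
        docs.append((z_d, w_d))
    return phi, docs
```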
3.1.4 Interpretation of Topic Models on the Simplex Space

The models described above (the mixture of unigrams, pLSI, and LDA) are probabilistic latent variable models in the space of distributions over words. A word distribution, which gives the occurrence probability of each word, is a probability vector and so can be viewed as a point on a simplex, which we call the word simplex. A latent variable model therefore places T points on the word simplex as topics, and these points form a sub-simplex, which we call the topic simplex. Note that any point on the topic simplex is also a point on the word simplex. The different latent variable models use the topic simplex in different ways to generate a document.

Let us consider the word simplex for three words and the topic simplex for two topics embedded in the word simplex (see Fig. 3.2). The corners of the word simplex correspond to the three distributions in which one of the three words has probability one. The corners of the topic simplex correspond to topics, which are different distributions over words. The empirical word distribution of each document is a point on the word simplex, denoted by x. The mixture of unigrams assumes that all the words of each document are drawn from one of the topics; that is, it relates each document to one of the corners of the topic simplex (Fig. 3.2 (a)). Therefore, the mixture of unigrams clusters documents. pLSI and LDA relate each empirical distribution on the word simplex to a point on the topic simplex (Fig. 3.2 (b)). Each point on the topic simplex indicates the topic distribution of the corresponding document. Therefore, pLSI and LDA reduce the dimensions of a document from the word space to the topic space. LDA additionally places a Dirichlet distribution on the topic simplex as prior knowledge about the points on the topic simplex.

Fig. 3.2. Interpretation of the mixture of unigrams, pLSI, and LDA on the simplex space. The empirical word distribution of each document is denoted by x. (a) The mixture of unigrams relates each document to a corner of the topic simplex. (b) pLSI and LDA relate each empirical distribution on the word simplex to a point on the topic simplex.

3.1.5 Collapsed Gibbs Sampler for Latent Dirichlet Allocation

Standard Gibbs sampling for LDA iteratively samples the latent variables z and the parameters \theta, \Phi, which can converge slowly because of the dependencies between the parameters and the latent variables. Griffiths and Steyvers (2004) presented a collapsed Gibbs sampler for LDA, where the state space is the set of all possible topic assignments to the words in every document. The variables \theta and \Phi are analytically integrated out, and only the latent topic variables z are sampled. The topic assignment of the i-th word in document d is sampled according to its conditional distribution

p(z_{d,i} = k | w_{d,i}, z^{\setminus d,i}, w^{\setminus d,i}, \alpha, \beta) \propto p(w_{d,i} | z_{d,i} = k, w^{\setminus d,i}, z^{\setminus d,i}, \beta)\, p(z_{d,i} = k | z^{\setminus d,i}, \alpha)   (3.5)

 = \int p(w_{d,i}|\phi_k)\, p(\phi_k | w^{\setminus d,i}, z^{\setminus d,i}, \beta)\, d\phi_k \int p(z_{d,i} = k|\theta_d)\, p(\theta_d | z^{\setminus d,i}, \alpha)\, d\theta_d   (3.6)

 \propto \frac{n^{\setminus d,i}_{k,w_{d,i}} + \beta}{\sum_v (n^{\setminus d,i}_{k,v} + \beta)} \cdot \frac{n^{\setminus d,i}_{d,k} + \alpha_k}{\sum_{k'} (n^{\setminus d,i}_{d,k'} + \alpha_{k'})},   (3.7)

where n_{k,v} is the number of times word v is assigned to topic k, n_{d,k} is the number of times a word in document d is assigned to topic k, and the superscript \setminus d,i means the corresponding variables or counts with w_{d,i} and z_{d,i} excluded.
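The conditional in Eq. (3.7) translates directly into a per-token sampling routine. The following Python sketch uses hypothetical variable names and assumes a symmetric \beta and a vector \alpha; the count matrices are assumed to have the current token already decremented.

```python
import numpy as np

def sample_topic(v, d, n_kv, n_dk, alpha, beta, rng):
    """Draw z_{d,i} = k with probability proportional to Eq. (3.7).

    n_kv: (K, V) topic-word counts, n_dk: (M, K) document-topic counts,
    both with the current token w_{d,i}, z_{d,i} removed.
    """
    V = n_kv.shape[1]
    p = (n_kv[:, v] + beta) / (n_kv.sum(axis=1) + V * beta) * (n_dk[d] + alpha)
    p /= p.sum()                 # the alpha denominator in Eq. (3.7) is constant in k, so it drops out
    return rng.choice(len(p), p=p)
```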
3.1.6 Variational Bayes Inference for Latent Dirichlet Allocation

The VB inference for LDA (Blei et al., 2003) introduces a factorized variational posterior q(z, \theta, \Phi) over z = \{z_{d,i}\}, \theta = \{\theta_d\}, and \Phi = \{\phi_k\} given by

q(z, \theta, \Phi) = \prod_{d,i} q(z_{d,i}) \prod_d q(\theta_d|\gamma_d) \prod_k q(\phi_k|\lambda_k),   (3.8)

where \gamma and \lambda are variational parameters: \gamma_d and \lambda_k are the parameters of the Dirichlet distributions over \theta_d and \phi_k, respectively, i.e., q(\theta_d|\gamma_d) \propto \prod_k \theta_{d,k}^{\gamma_{d,k} - 1} and q(\phi_k|\lambda_k) \propto \prod_v \phi_{k,v}^{\lambda_{k,v} - 1}. The log-likelihood of the documents is lower bounded by

F[q(z, \theta, \Phi)] = \sum_z \int\!\!\int q(z, \theta, \Phi) \log \frac{\prod_d p(w_d, z_d, \theta_d|\alpha, \Phi) \prod_k p(\phi_k|\beta)}{q(z, \theta, \Phi)}\, d\theta\, d\Phi.   (3.9)

The parameters are updated as

q(z_{d,i} = k) \propto \exp\big( \Psi(\lambda_{k,w_{d,i}}) - \Psi(\textstyle\sum_v \lambda_{k,v}) + \Psi(\gamma_{d,k}) \big),   (3.10)

\gamma_{d,k} = \alpha_k + \sum_{i=1}^{N_d} q(z_{d,i} = k),   (3.11)

\lambda_{k,v} = \beta + \sum_d n_{d,k,v},   (3.12)

where \Psi(\cdot) is the digamma function, n_{d,k,v} = \sum_i q(z_{d,i} = k) I(w_{d,i} = v), and I(\cdot) is the indicator function. The parameter updates for the SAVB inference are given by

q(z_{d,i} = k) \propto \exp\big( \beta^* ( \Psi(\lambda_{k,w_{d,i}}) - \Psi(\textstyle\sum_v \lambda_{k,v}) + \Psi(\gamma_{d,k}) ) \big),   (3.13)

\gamma_{d,k} = \beta^* \big( \alpha_k + \sum_{i=1}^{N_d} q(z_{d,i} = k) - 1 \big) + 1,   (3.14)

\lambda_{k,v} = \beta^* \big( \beta + \sum_d n_{d,k,v} - 1 \big) + 1,   (3.15)

where \beta^* denotes the inverse temperature (distinct from the Dirichlet parameter \beta). We can estimate \alpha with a fixed point iteration (Asuncion et al., 2009) by introducing the gamma prior G(\alpha_k|a_0, b_0), i.e., \alpha_k \sim G(\alpha_k|a_0, b_0) (k = 1, ..., K), as

\alpha_k^{new} = \frac{a_0 - 1 + \sum_d \{\Psi(\alpha_k^{old} + n_{d,k}) - \Psi(\alpha_k^{old})\} \alpha_k^{old}}{b_0 + \sum_d (\Psi(N_d + \alpha_0^{old}) - \Psi(\alpha_0^{old}))},   (3.16)

where \alpha_0 = \sum_k \alpha_k, and a_0 and b_0 are the parameters of the gamma distribution.

3.1.7 Collapsed Variational Bayes Inference for Latent Dirichlet Allocation

Teh et al. (2007) applied the CVB inference to LDA. The CVB inference for LDA introduces only a variational posterior q(z); \theta and \Phi are marginalized out over the priors p(\theta_d|\alpha) \propto \prod_k \theta_{d,k}^{\alpha_k - 1} and p(\phi_k|\beta) \propto \prod_v \phi_{k,v}^{\beta - 1}. The CVB inference optimizes the following lower bound:

F_{CVB}[q(z)] = \sum_{d=1}^{M} \sum_{z_d} q(z_d) \log \frac{p(w_d, z_d|\alpha, \beta)}{q(z_d)}.   (3.17)

Thus, the update for the variational posterior of the latent variables is derived by taking derivatives of F_{CVB}[q(z)] with respect to \{q(z_{d,i})\} and equating them to zero:

q(z_{d,i} = k) \propto \exp\Big( \sum_{z^{\setminus d,i}} q(z^{\setminus d,i}) \log p(w, z^{\setminus d,i}, z_{d,i} = k|\alpha, \beta) \Big)
 \propto \exp\Big( \sum_{z^{\setminus d,i}} q(z^{\setminus d,i}) \log p(w_{d,i}|z_{d,i} = k, w^{\setminus d,i}, z^{\setminus d,i}, \beta)\, p(z_{d,i} = k|z^{\setminus d,i}, \alpha) \Big)
 = \exp\Big( E_{q(z^{\setminus d,i})}[\log p(w_{d,i}|z_{d,i} = k, w^{\setminus d,i}, z^{\setminus d,i}, \beta)] + E_{q(z^{\setminus d,i})}[\log p(z_{d,i} = k|z^{\setminus d,i}, \alpha)] \Big).   (3.18)

The derivation of the update equation for q(z) is slightly complicated and involves approximations to compute intractable expectations. Using the central limit theorem, the expectations can be closely approximated using Gaussian distributions with means E and variances V given by

E_{d,k} = \sum_{i=1}^{N_d} q(z_{d,i} = k),   (3.19)

V_{d,k} = \sum_{i=1}^{N_d} q(z_{d,i} = k)(1 - q(z_{d,i} = k)),   (3.20)

E_{k,v} = \sum_{d,i} q(z_{d,i} = k) I(w_{d,i} = v),   (3.21)

V_{k,v} = \sum_{d,i} q(z_{d,i} = k)(1 - q(z_{d,i} = k)) I(w_{d,i} = v).   (3.22)

Moreover, using the second order Taylor expansion

E[f(x)] \approx f(E[x]) + \frac{1}{2} f''(E[x]) V[x],   (3.23)

we can approximately calculate

q(z_{d,i} = k) \propto \frac{\beta + E^{\setminus d,i}_{k,w_{d,i}}}{V\beta + \sum_v E^{\setminus d,i}_{k,v}} (\alpha_k + E^{\setminus d,i}_{d,k}) \exp\Big( -\frac{V^{\setminus d,i}_{k,w_{d,i}}}{2(\beta + E^{\setminus d,i}_{k,w_{d,i}})^2} + \frac{\sum_v V^{\setminus d,i}_{k,v}}{2(V\beta + \sum_v E^{\setminus d,i}_{k,v})^2} - \frac{V^{\setminus d,i}_{d,k}}{2(\alpha_k + E^{\setminus d,i}_{d,k})^2} \Big),   (3.24)

where the superscript \setminus d,i denotes subtracting the contributions q(z_{d,i} = k) and q(z_{d,i} = k)(1 - q(z_{d,i} = k)) of the current word from the corresponding means and variances. Teh et al. (2007) showed experimentally that the CVB inference for LDA converges faster than the VB inference and outperforms it in terms of predictive performance.

Asuncion et al. (2009) showed the usefulness of an approximation that uses only zero-order information, called the CVB0 inference. The update using only zero-order information is given by

q(z_{d,i} = k) \propto \frac{\beta + E^{\setminus d,i}_{k,w_{d,i}}}{V\beta + \sum_v E^{\setminus d,i}_{k,v}} (\alpha_k + E^{\setminus d,i}_{d,k}).   (3.25)

The CVB0 inference for LDA is computationally faster than the VB and CVB inferences.
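Since the CVB0 update in Eq. (3.25) involves only expected counts, it can be implemented as a simple in-place sweep over the tokens. The following Python sketch (the variable names and data layout are assumptions of this sketch) updates the responsibility vector of one token, subtracting its current contribution from the expected counts and adding the new one back.

```python
import numpy as np

def cvb0_update(v, d, gamma_di, E_kv, E_k, E_dk, alpha, beta):
    """One CVB0 step, Eq. (3.25), for token w_{d,i} = v with current responsibility gamma_di.

    E_kv: (K, V) expected topic-word counts, E_k: (K,) their row sums,
    E_dk: (M, K) expected document-topic counts; all updated in place.
    """
    V = E_kv.shape[1]
    E_kv[:, v] -= gamma_di; E_k -= gamma_di; E_dk[d] -= gamma_di   # remove the token's own contribution
    q = (E_kv[:, v] + beta) / (E_k + V * beta) * (E_dk[d] + alpha)
    q /= q.sum()
    E_kv[:, v] += q; E_k += q; E_dk[d] += q                        # add the updated responsibility back
    return q
```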
3.1.8 Particle Filter for Latent Dirichlet Allocation

Huge quantities of text data, such as news articles and blog posts, arrive in a continuous stream. Online learning has therefore attracted a great deal of attention as a useful method for handling this growing quantity of streaming data. The learning algorithms for LDA described in the previous sections are designed to be run over an entire collection of documents after they have been observed. Canini et al. (2009) proposed a particle filter method for LDA.

In the following explanation, we use the superscript (d, i), which denotes the corresponding variables or counts over all previous observations, i.e., w^{(d,i)} = (w_{1:d-1}, w_{d,1:i}) and z^{(d,i)} = (z_{1:d-1}, z_{d,1:i}), where w_{1:d-1} = (w_1, \ldots, w_{d-1}), w_{d,1:i} = (w_{d,1}, \ldots, w_{d,i}), z_{1:d-1} = (z_1, \ldots, z_{d-1}), and z_{d,1:i} = (z_{d,1}, \ldots, z_{d,i}). The posterior probability for the topic of a new word w_{d,i}, conditioned on the words already observed, is given by

p(z_{d,i} = k|z^{(d,i-1)}, w^{(d,i)}, \alpha, \beta) \propto p(w_{d,i}|z_{d,i} = k, z^{(d,i-1)}, w^{(d,i-1)}, \beta)\, p(z_{d,i} = k|z^{(d,i-1)}, \alpha)   (3.26)

 \propto \frac{n^{(d,i-1)}_{k,w_{d,i}} + \beta}{\sum_v (n^{(d,i-1)}_{k,v} + \beta)} \cdot \frac{n^{(d,i-1)}_{d,k} + \alpha_k}{\sum_{k'} (n^{(d,i-1)}_{d,k'} + \alpha_{k'})}.   (3.27)

By introducing importance weights \omega^{(s)} for particles s = 1, \ldots, S, we have

p(z_{d,i}|w^{(d,i)}, \alpha, \beta) \approx \sum_{s=1}^{S} \omega(z^{(d,i-1)(s)})\, p(z_{d,i}|z^{(d,i-1)(s)}, w^{(d,i)}, \alpha, \beta).   (3.28)

The weight of each particle can be calculated sequentially as follows:

\omega(z^{(d,i)(s)}) \propto \omega(z^{(d,i-1)(s)}) \frac{p(w_{d,i}|z^{(s)}_{d,i}, z^{(d,i-1)(s)}, w^{(d,i-1)}, \beta)\, p(z^{(s)}_{d,i}|z^{(d,i-1)(s)}, \alpha)}{q(z^{(s)}_{d,i}|z^{(d,i-1)(s)})}.   (3.29)

p(z^{(s)}_{d,i}|z^{(d,i-1)(s)}, \alpha) is used as the proposal distribution; hence we have

\omega(z^{(d,i)(s)}) \propto \omega(z^{(d,i-1)(s)})\, p(w_{d,i}|z^{(s)}_{d,i}, z^{(d,i-1)(s)}, w^{(d,i-1)}, \beta).   (3.30)

Algorithm 5 Particle Filter for Latent Dirichlet Allocation
1: Initialize weights \omega^{(s)} = 1/S for s = 1, ..., S
2: Set the ESS threshold r.
3: for each observed word w_{d,i} do
4:   for s = 1, ..., S do
5:     Update weight \omega(z^{(d,i)(s)}) \propto \omega(z^{(d,i-1)(s)}) p(w_{d,i}|z^{(s)}_{d,i}, z^{(d,i-1)(s)}, w^{(d,i-1)}, \beta) by Eq. (3.30)
6:     Normalize weights to sum to 1
7:     Sample z^{(s)}_{d,i} from p(z^{(s)}_{d,i}|z^{(d,i-1)(s)}, w^{(d,i)}, \alpha, \beta) by Eq. (3.27)
8:   end for
9:   if ESS < r then
10:    Resample particles:
11:    for z^{(s)}_{l,m} \in R(d, i) do
12:      Sample z^{(s)}_{l,m} from p(z^{(s)}_{l,m}|z^{(d,i-1)(s)} \setminus z_{l,m}, w^{(d,i)}, \alpha, \beta) by Eq. (3.31)
13:    end for
14:    Set \omega^{(s)} = 1/S for s = 1, ..., S
15:  end if
16: end for

Canini et al. (2009) performed resampling by rejuvenating old topic assignments of words already observed. R(d, i) denotes a set of previously observed words, as of observing w_{d,i}, called the rejuvenation sequence; it is chosen randomly from past observations. The topic assignment z_{l,m} in the rejuvenation sequence R(d, i) is drawn from the conditional distribution

p(z_{l,m} = k|z^{(d,i-1)} \setminus z_{l,m}, w^{(d,i)}, \alpha, \beta) \propto \frac{n^{(d,i-1)}_{k,w_{l,m} \setminus l,m} + \beta}{\sum_v (n^{(d,i-1)}_{k,v \setminus l,m} + \beta)} \cdot \frac{n^{(d,i-1)}_{l,k \setminus l,m} + \alpha_k}{\sum_{k'} (n^{(d,i-1)}_{l,k' \setminus l,m} + \alpha_{k'})},   (3.31)

where the subscript \setminus l,m denotes the corresponding variables or counts with w_{l,m} and z_{l,m} excluded. Finally, the particle filter for LDA is summarized in Algorithm 5.
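As a minimal illustration of one filtering step in Algorithm 5, the sketch below keeps per-particle count matrices, multiplies each weight by the unnormalized predictive probability of the incoming word, and then samples the word's topic from Eq. (3.27). The data layout and function names are assumptions of this sketch, the weight update is one common implementation choice consistent with sampling from Eq. (3.27), and the resampling and rejuvenation steps are omitted.

```python
import numpy as np

def ess(weights):
    """Effective sample size used for the resampling test in Algorithm 5."""
    return 1.0 / np.sum(weights ** 2)

def filter_word(v, d, particles, weights, alpha, beta, rng):
    """One particle-filter step for word w_{d,i} = v; each particle holds n_kv and n_dk counts."""
    V = particles[0]["n_kv"].shape[1]
    for s, p in enumerate(particles):
        probs = (p["n_kv"][:, v] + beta) / (p["n_kv"].sum(axis=1) + V * beta) \
                * (p["n_dk"][d] + alpha)          # unnormalized conditional of Eq. (3.27)
        weights[s] *= probs.sum()                 # predictive probability of the word (assumed weight update)
        probs /= probs.sum()
        k = rng.choice(len(probs), p=probs)       # sample the topic of the new word
        p["n_kv"][k, v] += 1
        p["n_dk"][d, k] += 1
    weights /= weights.sum()
    return weights
```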
3.1.9 Deterministic Online Learning for Latent Dirichlet Allocation

We proposed two deterministic online algorithms: an incremental algorithm and a single-pass algorithm (Sato et al., 2010). Our incremental algorithm is an incremental variant of the reverse EM (REM) algorithm (Minka, 2001). The incremental algorithm updates parameters by replacing old sufficient statistics with new ones for each datum. Our single-pass algorithm is based on the incremental algorithm, but it does not need to store old statistics for all data. In our single-pass algorithm, we propose a sequential update method for the Dirichlet parameters. Asuncion et al. (2009) and Wallach et al. (2009) indicated the importance of estimating the parameters of the Dirichlet distribution over the topic distributions of documents. Moreover, we can deal with a growing vocabulary size. In real life, the total vocabulary size is unknown, i.e., it increases as documents are observed. In these situations, we want to process texts one at a time and then discard them. We do not repeat iterations over the whole data set; we repeat iterations only for each word within a document. That is, we update the parameters from an arriving document and discard the document after l iterations. Therefore, we do not need to store statistics about discarded documents, and our single-pass algorithm can be applied to semi-infinite and time-series text streams. First, we derive an incremental algorithm for LDA, and then we extend the incremental algorithm to a single-pass algorithm.

Incremental Learning for LDA

Neal (1998) provided a framework of incremental learning for the EM algorithm. In general unsupervised learning, we estimate sufficient statistics s_i for each data point i, compute the whole sufficient statistics \sigma (= \sum_i s_i) from all data, and update the parameters by using \sigma. In incremental learning, for each data point i, we estimate s_i, compute \sigma^{(i)} from s_i, and update the parameters from \sigma^{(i)}. It is easy to extend an existing batch algorithm to incremental learning if the whole sufficient statistics or the parameter updates are constructed by simply summing the statistics of all data points. The incremental algorithm processes data point i by subtracting the old statistics s_i^{old} and adding the new statistics s_i^{new}, i.e., \sigma^{(i)} = \sigma - s_i^{old} + s_i^{new}. The incremental algorithm needs to store the old statistics \{s_i^{old}\} for all data. Whereas batch algorithms update the parameters after sweeping through all data, the incremental algorithm updates the parameters for each data point one at a time, which results in more parameter updates than in batch algorithms. Therefore, the incremental algorithm sometimes converges faster than batch algorithms.

Our motivation for devising the incremental algorithm for LDA was to compare CVB-LDA and VB-LDA. The statistics \{n_{k,v}\} and \{n_{d,k}\} are updated after each word is updated in CVB-LDA. This update schedule is similar to that of the incremental algorithm, and this incremental property seems to be the reason CVB-LDA converges faster than VB-LDA. Below, let us consider the incremental algorithm for LDA.

We start by optimizing a lower bound different from that of VB-LDA by using the reverse EM (REM) algorithm (Minka, 2001) as follows:

p(w_d|\alpha, \Phi) = \int \prod_{i=1}^{N_d} \sum_{k=1}^{K} \prod_{v=1}^{V} (\theta_{d,k} \phi_{k,v})^{I(w_{d,i}=v)}\, p(\theta_d|\alpha)\, d\theta_d = \int \prod_{i=1}^{N_d} \sum_{k=1}^{K} \theta_{d,k} \phi_{k,w_{d,i}}\, p(\theta_d|\alpha)\, d\theta_d   (3.32)

 \ge \int \prod_{i=1}^{N_d} \prod_{k=1}^{K} \left( \frac{\theta_{d,k} \phi_{k,w_{d,i}}}{q(z_{d,i}=k)} \right)^{q(z_{d,i}=k)} p(\theta_d|\alpha)\, d\theta_d   (3.33)

 = \int \prod_{i=1}^{N_d} \prod_{k=1}^{K} \left( \frac{\phi_{k,w_{d,i}}}{q(z_{d,i}=k)} \right)^{q(z_{d,i}=k)} \prod_{k=1}^{K} \theta_{d,k}^{\sum_i q(z_{d,i}=k)}\, p(\theta_d|\alpha)\, d\theta_d.   (3.34)

Equation (3.33) is derived from Jensen's inequality as follows: \log \sum_x f(x) = \log \sum_x q(x) \frac{f(x)}{q(x)} \ge \sum_x q(x) \log \frac{f(x)}{q(x)}, where \sum_x q(x) = 1, and so \sum_x f(x) \ge \prod_x \left( \frac{f(x)}{q(x)} \right)^{q(x)}. Therefore, the lower bound for the log-likelihood is given by

F[q(z)] = \sum_{d,i,k} q(z_{d,i}=k) \log \frac{\phi_{k,w_{d,i}}}{q(z_{d,i}=k)} + \sum_d \log \left( \frac{\Gamma(\sum_k \alpha_k)}{\Gamma(N_d + \sum_k \alpha_k)} \prod_k \frac{\Gamma(\alpha_k + \sum_{i=1}^{N_d} q(z_{d,i}=k))}{\Gamma(\alpha_k)} \right).   (3.35)
The maximum of F[q(z)] with respect to q(z_{d,i} = k) and \phi is given by

q(z_{d,i} = k) \propto \phi_{k,w_{d,i}} \exp\{\Psi(\alpha_k + \textstyle\sum_{i'} q(z_{d,i'} = k))\},   (3.36)

\phi_{k,v} \propto \beta + \sum_{d=1}^{M} n_{d,k,v}, \quad n_{d,k,v} = \sum_{i=1}^{N_d} q(z_{d,i} = k) I(w_{d,i} = v).   (3.37)

The updates of \alpha are the same as Eq. (3.16). Note that we use the maximum a posteriori (MAP) estimate for \phi; however, we use \beta rather than \beta - 1 to avoid \beta - 1 + \sum_d n_{d,k,v} taking a negative value. The lower bound F[q(z)] introduces only q(z), as in CVB-LDA. Equation (3.36) incrementally updates the topic distribution of a document for each word, as in CVB-LDA, because marginalizing out \theta_d removes the need for the variational parameter \gamma_{d,k} in Eq. (3.36). Equation (3.36) is a fixed point update, whereas CVB-LDA can be interpreted as a coordinate ascent algorithm. \phi and \alpha are updated from the entire document collection. That is, compared with VB-LDA, this algorithm looks like a hybrid variant with batch updates for \phi and \alpha and incremental updates for \theta_d.

Here, we consider an incremental update for \phi, analogous to CVB-LDA, in which \phi is updated for each word. Note that in the LDA setup, each independent and identically distributed data point is a document, not a word. Therefore, we incrementally estimate \phi for each document by swapping the statistics n_{d,k,v} = \sum_{i=1}^{N_d} q(z_{d,i} = k) I(w_{d,i} = v), which is the number of times word v is generated from topic k in document d. Algorithm 3 shows our incremental algorithm for LDA. This algorithm incrementally optimizes the lower bound in Eq. (3.35).

Single-Pass Algorithm for LDA

Our single-pass algorithm for LDA is inspired by the Bayesian formulation, which internally includes a sequential update. The posterior distribution with the contribution from data point x_n separated out is p(\theta|x_{1:n}) \propto p(x_n|\theta) p(\theta|x_{1:n-1}), where \theta denotes a parameter. This indicates that we can use the posterior given the observed data as the prior for the next datum. We use parameters learned from observed documents as prior parameters for the next document. For example, \phi_{k,v} in Eq. (3.37) can be written as \phi_{k,v} \propto \{\beta + \sum_{d=1}^{M-1} n_{d,k,v}\} + n_{M,k,v}. Here, we can interpret \{\beta + \sum_{d=1}^{M-1} n_{d,k,v}\} as the prior parameter \lambda^{(M-1)}_{k,v} for the M-th document. Our single-pass algorithm sequentially sets a prior for each arriving document.

By using this sequential setting of prior parameters, we obtain the single-pass algorithm for LDA shown in Algorithm 4. First, we update the parameters for the d-th arriving document, given the prior parameters \{\lambda^{(d-1)}_{k,v}\}, for l iterations:

q(z_{d,i} = k) \propto \phi^{(d)}_{k,w_{d,i}} \exp\{\Psi(\alpha^{(d)}_k + \textstyle\sum_{i'=1}^{N_d} q(z_{d,i'} = k))\},   (3.38)

\phi^{(d)}_{k,v} \propto \lambda^{(d-1)}_{k,v} + \sum_{i=1}^{N_d} q(z_{d,i} = k) I(w_{d,i} = v),   (3.39)

where \lambda^{(0)}_{k,v} = \beta and \alpha^{(d)}_k is explained below. Then, we set the prior parameters for the next document by using the statistics from this document as follows, and finally discard the document:

\lambda^{(d)}_{k,v} = \lambda^{(d-1)}_{k,v} + \sum_{i=1}^{N_d} q(z_{d,i} = k) I(w_{d,i} = v).   (3.40)

Since the updates are repeated within a document, we need to store the statistics \{q(z_{d,i} = k)\} for each word in the current document, but not for all words in all documents.
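A minimal Python sketch of the per-document loop in Eqs. (3.38)-(3.40) follows. The function names, the use of SciPy's digamma, and the uniform initialization of the responsibilities are assumptions of this sketch; the update of \alpha^{(d)} (Eq. (3.43)) is kept outside it.

```python
import numpy as np
from scipy.special import digamma

def expected_counts(words, q, K, V):
    """n_{d,k,v} = sum_i q(z_{d,i}=k) I(w_{d,i}=v) for one document."""
    n_kv = np.zeros((K, V))
    for i, v in enumerate(words):
        n_kv[:, v] += q[i]
    return n_kv

def single_pass_document(words, lam, alpha, n_inner=5):
    """Process one document with Eqs. (3.38)-(3.39) and return the new prior, Eq. (3.40)."""
    K, V = lam.shape
    q = np.full((len(words), K), 1.0 / K)                 # responsibilities q(z_{d,i}=k)
    for _ in range(n_inner):                              # l inner iterations within the document
        lam_d = lam + expected_counts(words, q, K, V)     # Eq. (3.39), unnormalized
        phi_d = lam_d / lam_d.sum(axis=1, keepdims=True)
        for i, v in enumerate(words):
            log_q = np.log(phi_d[:, v]) + digamma(alpha + q.sum(axis=0))   # Eq. (3.38)
            q[i] = np.exp(log_q - log_q.max())
            q[i] /= q[i].sum()
    return lam + expected_counts(words, q, K, V), q       # Eq. (3.40): lambda^{(d)} for the next document
```

After the call, lam is replaced by the returned matrix and the document can be discarded, which is the property that makes the algorithm single-pass.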
In the CVB and iREM algorithms, the Dirichlet parameter \alpha is updated in a batch manner, i.e., \alpha is updated by using the entire document collection once per iteration. We need an online update for \alpha to process a text stream. However, unlike the parameter \lambda_{k,v}, the update of \alpha in Eq. (3.16) is not constructed by simply summing sufficient statistics of the data and a prior. Therefore, we derive a single-pass update for the Dirichlet parameter by using the following interpretation.

We regard Eq. (3.16) as the expectation of \alpha_k over the posterior G(\alpha_k|a_k, b) given the documents and the prior G(\alpha_k|a_0, b_0), i.e., \alpha_k^{new} = E[\alpha_k]_{G(\alpha_k|a_k, b)} = (a_k - 1)/b, where

a_k = a_0 + \sum_{d=1}^{M} a_{d,k}, \quad b = b_0 + \sum_{d=1}^{M} b_d,   (3.41)

a_{d,k} = \{\Psi(\alpha_k^{old} + n_{d,k}) - \Psi(\alpha_k^{old})\} \alpha_k^{old}, \quad b_d = \Psi(N_d + \alpha_0^{old}) - \Psi(\alpha_0^{old}).   (3.42)

We regard a_{d,k} and b_d as statistics of each document, which indicates that the parameters we actually update are a_k and b in Eq. (3.16). These updates are simple summations of a_{d,k}, b_d, and the prior parameters a_0 and b_0. Therefore, we have an update for \alpha^{(d)}_k after observing document d given by

\alpha^{(d)}_k = E[\alpha_k]_{G(\alpha_k|a^{(d)}_k, b^{(d)})} = \frac{a^{(d)}_k - 1}{b^{(d)}}, \quad a^{(d)}_k = a^{(d-1)}_k + a_{d,k}, \quad b^{(d)} = b^{(d-1)} + b_d,   (3.43)

a_{d,k} = \{\Psi(\alpha^{(d-1)}_k + n_{d,k}) - \Psi(\alpha^{(d-1)}_k)\} \alpha^{(d-1)}_k, \quad b_d = \Psi(N_d + \alpha^{(d-1)}_0) - \Psi(\alpha^{(d-1)}_0),   (3.44)

where a^{(0)}_k = a_0 and b^{(0)} = b_0. The values a^{(d-1)}_k and b^{(d-1)} are used as the prior parameters for the next document.

Algorithm 3 Incremental algorithm for LDA
1: for iterations it = 1, ..., L do
2:   for d = 1, ..., M do
3:     for i = 1, ..., N_d do
4:       Update q(z_{d,i} = k) by Eq. (3.36)
5:     end for
6:     Replace n^{old}_{d,k,v} with n^{new}_{d,k,v} for v \in \{w_{d,i}\}_{i=1}^{N_d} in \phi of Eq. (3.37).
7:   end for
8:   Update \alpha by Eq. (3.16)
9: end for

Algorithm 4 Single-pass algorithm for LDA
1: for d = 1, ..., M do
2:   for iterations it = 1, ..., l do
3:     for i = 1, ..., N_d do
4:       Update q(z_{d,i} = k) by Eq. (3.38).
5:     end for
6:     Update \phi^{(d)} by Eq. (3.39).
7:     Update \alpha^{(d)} by Eq. (3.43).
8:   end for
9:   Update \lambda^{(d)} by Eq. (3.40).
10:  Update a^{(d)} and b^{(d)} by Eq. (3.43).
11: end for
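The sequential update of \alpha in Eqs. (3.43)-(3.44) amounts to accumulating the per-document statistics a_{d,k} and b_d into the gamma-posterior parameters. A small sketch follows, assuming SciPy's digamma and that n_dk holds the expected topic counts \sum_i q(z_{d,i} = k) of the current document.

```python
import numpy as np
from scipy.special import digamma

def update_alpha(alpha, n_dk, N_d, a, b):
    """Sequential Dirichlet-parameter update of Eqs. (3.43)-(3.44) after one document."""
    a = a + (digamma(alpha + n_dk) - digamma(alpha)) * alpha      # a_k^{(d)} = a_k^{(d-1)} + a_{d,k}
    b = b + digamma(N_d + alpha.sum()) - digamma(alpha.sum())     # b^{(d)} = b^{(d-1)} + b_d
    return (a - 1.0) / b, a, b                                    # alpha_k^{(d)} = (a_k^{(d)} - 1) / b^{(d)}
```

Starting from a = a_0 and b = b_0 and calling this after each document corresponds to lines 7 and 10 of Algorithm 4.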
Analysis

Here, we analyze the proposed updates for the parameters \alpha and \phi in the single-pass algorithm. Given document d, we eventually update the parameters \alpha^{(d)} and \phi^{(d)} as

\alpha^{(d)}_k = \frac{a_0 - 1 + \sum_{l=1}^{d-1} a_{l,k} + a_{d,k}}{b_0 + \sum_{l=1}^{d-1} b_l + b_d} = \alpha^{(d-1)}_k (1 - \delta_d) + \delta_d \frac{a_{d,k}}{b_d}, \quad \delta_d = \frac{b_d}{b_0 + \sum_{l=1}^{d} b_l},   (3.45)

\phi^{(d)}_{k,v} = \frac{\beta + \sum_{l=1}^{d-1} n_{l,k,v} + n_{d,k,v}}{V_d \beta + \sum_{l=1}^{d-1} n_{l,k,\cdot} + n_{d,k,\cdot}} = \phi^{(d-1)}_{k,v} (1 - \eta_d) + \eta_d \frac{n_{d,k,v}}{n_{d,k,\cdot}}, \quad \eta_d = \frac{(V_d - V_{d-1})\beta + n_{d,k,\cdot}}{V_d \beta + \sum_{l=1}^{d} n_{l,k,\cdot}},   (3.46)

where n_{k,\cdot} = \sum_v n_{k,v} and V_d is the vocabulary size of the observed documents w_{1:d}. Our single-pass algorithm sequentially sets a prior for each arriving document, and so we can select a prior (a dimension of the Dirichlet distribution) corresponding to the observed vocabulary. This property is useful for our problem because the vocabulary size grows in a text stream.

These updates indicate that \delta_d and \eta_d interpolate the parameters estimated from old and new data. The updates resemble a stepwise algorithm (Robbins and Monro, 1951; Sato and Ishii, 2000), although a stepwise algorithm interpolates sufficient statistics whereas our updates interpolate parameters. In our updates, how we set the stepsize for the parameter updates is equivalent to how we set the hyperparameters of the priors; therefore, we do not need to introduce a new stepsize parameter. In our update of \phi, the appearance rate of word v in topic k in document d, n_{d,k,v}/n_{d,k,\cdot}, is added to the old parameter \phi^{(d-1)}_{k,v} with weight \eta_d, which gradually decreases as more documents are observed. The same relation holds for \alpha. Therefore, the influence of new data decreases as the number of observed documents increases, as shown in Theorem 3.1.1. Moreover, Theorem 3.1.1 plays an important role in analyzing the convergence of the parameter updates via the super-martingale convergence theorem (Bertsekas and Tsitsiklis, 1996; Brochu et al., 2004).

Theorem 3.1.1. If there exist constants \epsilon and \kappa satisfying 0 < \epsilon < S_d < \kappa for all d, then, for a constant \xi \ge 0,

\zeta_d = \frac{S_d}{\xi + \sum_{l=1}^{d} S_l}   (3.47)

satisfies

\lim_{d \to \infty} \zeta_d = 0, \quad \sum_d \zeta_d = \infty, \quad \sum_d \zeta_d^2 < \infty.   (3.48)

Note that \delta_d and \eta_d have the form of \zeta_d given by Eq. (3.47).

Proof. If \epsilon and \kappa exist satisfying 0 < \epsilon < S_l < \kappa for all l, then

\frac{\epsilon}{\xi + \kappa d} < \zeta_d = \frac{S_d}{\xi + \sum_{l=1}^{d} S_l} < \frac{\kappa}{\xi + \epsilon d}.   (3.49)

Moreover, it is known (Robbins and Monro, 1951) that a stepsize of the form

\zeta_d = \frac{\kappa_1}{\kappa_2 + d} \quad (\kappa_1, \kappa_2 > 0)   (3.50)

satisfies the conditions in Eq. (3.48). Therefore, since

\lim_{d \to \infty} \frac{\epsilon}{\xi + \kappa d} = 0, \quad \lim_{d \to \infty} \frac{\kappa}{\xi + \epsilon d} = 0,   (3.51)

\sum_d \frac{\epsilon}{\xi + \kappa d} = \infty, \quad \sum_d \frac{\kappa}{\xi + \epsilon d} = \infty,   (3.52)

the squeeze theorem shows that

\lim_{d \to \infty} \frac{S_d}{\xi + \sum_l S_l} = 0, \quad \sum_d \frac{S_d}{\xi + \sum_l S_l} = \infty.   (3.53)

Also,

\sum_d \left( \frac{S_d}{\xi + \sum_l S_l} \right)^2 < \sum_d \left( \frac{\kappa}{\xi + \epsilon d} \right)^2 < \infty.   (3.54)

Experiments

We carried out experiments on document modeling and evaluated the inference algorithms in terms of perplexity. We compared the inference algorithms for LDA on two text data sets. The first was Associated Press (AP), where the number of documents was M = 10,000 and the vocabulary size was V = 67,291. The second was The Wall Street Journal (WSJ), where M = 10,000 and V = 56,738. The ordering of the documents is time-series. The comparison metric for document modeling was the test set perplexity. We randomly split both data sets into a training set and a test set by assigning 20% of the words in each document to the test set. Stop words were eliminated from the data sets.

We performed experiments on six inference algorithms: PF, VB, CVB0, CVB, iREM, and sREM. PF denotes the particle filter for LDA used in Canini et al. (2009). We set \alpha_k to 50/K in PF. The number of particles, denoted by P, is 64. The number of words for resampling, denoted by R, is 20. The effective sample size (ESS) threshold, which controls the number of resamplings, is set to 10. CVB0 and CVB are the collapsed variational inferences for LDA using zero-order and second-order information, respectively. iREM represents the incremental reverse EM algorithm in Algorithm 3. CVB0 and CVB estimate the Dirichlet parameter \alpha over the topic distributions from the entire data set, i.e., in a batch framework. We estimated \alpha in iREM from the entire data set, as in CVB, to clarify the properties of iREM compared with CVB. L denotes the number of iterations over the whole document collection for the batch algorithms. sREM indicates the single-pass variant of iREM in Algorithm 4, and l denotes the number of iterations within a document in Algorithm 4; sREM does not iterate over the whole document collection.

Figure 3.3 shows the results of the experiments on the test set perplexity, where lower values indicate better performance. We ran the experiments five times with different random initializations and report the averages*1. PF and sREM calculate the test set perplexity after sweeping through the entire training set.

*1 We exclude the error bars with standard deviations because they are so small that they are hidden by the plot markers.

VB converges more slowly than CVB and iREM. Moreover, iREM outperforms CVB in convergence rate. Although CVB0 outperforms the other algorithms when the number of topics is small, the convergence rate of CVB0 depends on the number of topics. sREM does not outperform iREM in terms of perplexity; however, its performance is close to that of iREM. As a result, we recommend sREM for large document collections or document streams. sREM does not need to store old statistics for all documents, unlike the other algorithms. In addition, the convergence of sREM depends on the length of a document rather than on the number of documents. Since we process each document individually, we can control the number of iterations according to the length of each arriving document.

Finally, we discuss the running time. The running time of sREM is O(L/l) times shorter than that of VB, CVB0, CVB, and iREM.
The average running time of PF (K=300, P=64, R=20) is 28.2 hours on AP and 31.2 hours on WSJ. That of sREM (K=300, l=5) is 1.2 hours on AP and 1.3 hours on WSJ.

Fig. 3.3. Results of the experiments. The left column shows the results on the AP corpus and the right column the results on the WSJ corpus. (a) and (b) compare the test set perplexity with respect to the number of topics; (c), (d), (e), and (f) compare the test set perplexity with respect to the number of iterations for K = 100 and K = 300, respectively; (g) and (h) show the relationship between the test set perplexity and the number of iterations within a document, i.e., l, for sREM. The compared algorithms are VB (L=100), CVB0 (L=100), CVB (L=100), iREM (L=100), sREM (l=3, 4, 5), and PF (P=64).

3.1.10 Other Topic Models

A limitation of LDA is its inability to model topic correlation, even though, for example, a document about genetics is more likely to also be about disease than about x-ray astronomy. This limitation stems from the use of the Dirichlet distribution to model the variability among the topic proportions. Blei and Lafferty (2006a) developed the correlated topic model (CTM), in which the topic proportions exhibit correlation via the logistic normal distribution. They derived a variational inference algorithm for approximate posterior inference in this model, which is complicated by the fact that the logistic normal is not conjugate to the multinomial. Mimno and McCallum (2008) developed Gibbs sampling in non-conjugate logistic normal topic models, which belong to a new class of topic models with arbitrary graph-structured priors.

When we analyze time-series documents, the assumption of exchangeable documents is inappropriate. Document collections such as scholarly journals, email, news articles, and search query logs all reflect evolving content. The themes in a document collection evolve over time, and it is of interest to explicitly model the dynamics of the underlying topics. Blei and Lafferty (2006b) developed the dynamic topic model (dDTM), which captures the evolution of topics in a sequentially organized corpus of documents. The dDTM requires that time be discretized. Wang et al. (2008) developed the continuous time dynamic topic model (cDTM), which uses Brownian motion to model the latent topics through a sequential collection of documents and makes it easy to handle many time points. Iwata et al. (2010) proposed a topic model with multiple timescales for modeling both the long-timescale and the short-timescale dependencies in topics. Topics naturally evolve over multiple timescales.
For example, some words may be used consistently over one hundred years, while other words emerge and disappear over periods of a few days. Iwata et al. (2010) also derived efficient online inference procedures based on a stochastic EM algorithm, by which the model is sequentially updated using newly obtained data; this means that past data are not required for the inference.

LDA has also been used to construct features for classification. Since LDA reduces the data dimensionality by using topics, LDA topics are expected to be useful for categorization. However, it is not necessarily the case that the topics estimated by LDA correspond to good discriminative topics for a classification task. For example, although it seems highly possible that LDA topics correspond to genres in movie rating reviews, good discriminative topics for classification would differentiate words like excellent, terrible, and average, without regard to genre. Blei and McAuliffe (2007) proposed supervised latent Dirichlet allocation (sLDA), a statistical model of labeled documents. They added to LDA a response variable associated with each document, such as a variable indicating the number of stars given to a movie in a rating review. They derived a maximum likelihood (ML) procedure for parameter estimation, which relies on variational approximations to handle intractable posterior expectations.

We developed a topic model for modeling the generation process of a graph structure (Sato et al., 2008). The use of graph structures is common in several fields such as natural language processing and web analysis. We assume that each node has multiple latent topics and that links are generated corresponding