IEEE TRANSACTIONS ON NANOBIOSCIENCE, VOL. 11, NO. 3, SEPTEMBER 2012

Estimating Functional Groups in Human Gut Microbiome With Probabilistic Topic Models

Xin Chen, TingTing He*, Xiaohua Hu, Yanhong Zhou, Yuan An, and Xindong Wu

Abstract—In this paper, based on the functional elements derived from a non-redundant CDs catalogue, we show that the configuration of functional groups in meta-genome samples can be inferred by probabilistic topic modeling. Probabilistic topic modeling is a Bayesian method that is able to extract useful topical information from unlabeled data. When used to study microbial samples (assuming that the relative abundance of functional elements has already been obtained by a homology-based approach), each sample can be considered as a “document,” which has a mixture of functional groups, while each functional group (also known as a “latent topic”) is a weighted mixture of functional elements (including taxonomic levels, indicators of gene orthologous groups, and KEGG pathway mappings). The functional elements bear an analogy with “words.” Estimating the probabilistic topic model can uncover the configuration of functional groups (the latent topics) in each sample. The experimental results demonstrate the effectiveness of our proposed method.

Index Terms—Bioinformatics databases, biological data mining, metagenomics, probabilistic topic model.

I. INTRODUCTION

IN THE systems biology community, there has been a long-standing focus on studying gene-expression data in isolated organisms and cultures. However, relatively less effort has been made to study the genome-wide gene-expression data from uncultured environmental samples (such as the ocean, soil, and the human body) and to understand the underlying biological processes. With fast-advancing sequencing techniques (such as Roche/454 Sequencing and Illumina Sequencing), a large number of sequenced genomes and meta-genomes from uncultured microbial samples have become available. Based on the meta-genome sequences, bioinformatics researchers have done a lot of work to study the underlying biological processes, such as signal transduction and translation, and molecular functions such as the biochemical activity of gene products. However, our knowledge about the biological functions encoded in the meta-genome sequence is still limited. Current functional annotation (genome-level annotation of biological functions) is still far from satisfactory. The lack of high-quality functional annotation of the major functionality encoded in the gene-expression data of a given genome/meta-genome poses a great challenge to the task of interpreting the biological processes of a meta-genome.

Manuscript received July 16, 2012; accepted August 01, 2012. Date of current version September 10, 2012. This work was supported in part by NSF CCF-0905337, NSF CCF 0905291, NSF CCF 1049864, NSFC 90920005, and China 12-5 Plan Project 2012BAK24B01. Asterisk indicates corresponding author.
X. Chen, X. Hu, and Y. An are with the College of Information Science & Technology, Drexel University, Philadelphia, PA 19104 (e-mail: [email protected]; [email protected]; [email protected]).
*T. He is with the Department of Computer Science, Central China Normal University, Wuhan, China (e-mail: [email protected]).
Y. Zhou is with the School of Life Science and Technology, Huazhong University of Science and Technology, China (e-mail: [email protected]).
X. Wu is with the Department of Computer Science, University of Vermont, Burlington, VT, USA (e-mail: [email protected]).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TNB.2012.2212204
1536-1241/$31.00 © 2012 IEEE

The major objectives of analyzing and interpreting the large amount of meta-genomic data involve answering two questions. The first question is, “Given a large number of genome fragments from an environmental sample, what genomes are there?” Answering this question requires mapping the meta-genomic reads to taxonomic units (usually by a homology-based sequence alignment; this task is also known as taxonomic classification or taxonomic analysis). The second question is, “What are the major functions of these genomes?” Answering this question involves annotating the major functional units (such as signal transduction, metabolic capacity, and gene regulation) at the genome level (a.k.a. functional analysis).

In analyzing meta-genomics data, the identification of both taxonomic units and functional units is difficult because the large number of meta-genome fragments are usually very short and may come from a variety of organisms. The challenge is that, when dealing with large genome-wide gene expression data, the samples may come from different individuals with different genetic and environmental backgrounds. What is more, the samples usually represent collections of diverse cell types mixed together in different proportions. Therefore, in processing the meta-genomic reads, the raw reads must first be assembled into longer contigs. After that, the protein-encoding sequences (CDs) are predicted from the assembled contigs, which results in a non-redundant catalogue of CDs (gene regions). The construction of a non-redundant CDs catalogue gives rise to a “minimal genome” and provides an opportunity to identify bacterial functions that play important roles in microbial samples. Based on the non-redundant CDs catalogue, both taxonomic unit identification (identifying the existence of certain microbial species in the meta-genome) and functional unit identification (identifying the existence of certain gene products) can be readily achieved by matching amino acid sequences in CDs regions to standard reference sequences using homology-based alignments. Homology-based alignments (such as BLASTP and BLASTX) have been used intensively to deal with both taxonomic analysis and functional analysis problems [3]. By aligning local amino acid sequences to the reference sequences (of known species) in standard databases (such as the NCBI NR database, the eggNOG database, and the KEGG database), researchers are able to acquire a lot of useful information with respect to the functionality encoded in predicted CDs regions, including taxonomic levels, indicators of gene orthologous groups (OGs), and KEGG pathway mappings.

The alignment of amino acid sequences also provides insight into the functional groups existing in the genomes. Although genes vary from strain to strain, similar genes can have similar functions among different species; such genes are known as clusters of orthologous groups (COGs). The relative abundance of certain COG categories in a microbial sample may indicate whether the sample is rich in particular functions. In practice, COGs can be determined based on their sequence similarity and can be classified into different function categories [10], including signal transduction, metabolic pathways, and gene regulatory networks. With this consideration, the functional units in microbial samples can be identified by gene clusters such as COGs shared among species/strains. Understanding the functionality of gene clusters is of practical and theoretical importance. For example, the functional roles of organ/cell-specific gene clusters may be different from those of gene clusters that are active across diverse cell types. Sets of genes that are very specific to a particular cell type or organ may be useful as diagnostic biomarkers. In contrast, gene clusters that are active across diverse cell types can give us insight into functional similarities among organs/cells. Since different microbial samples are taken from different micro-environments and express different sets of genes, we may assume that each microbial sample (with multiple cell types) has its own configuration of gene clusters: some clusters will be shared among many cell types, while others will be more specific.
It has been pointed out that the existence of commonly shared gene clusters across samples suggests functional similarity and biological relevance [5], [7]. Therefore, we aim to develop a method that enables analyzing the genome-level configuration of both taxonomic units and functional units derived from the non-redundant CDs catalogue. As mentioned above, by homology-based alignment, each CDs sequence can be represented as a triplet (i.e., taxonomic level, indicator of gene orthologous group, and KEGG pathway mapping); each unit may be considered as a functional element at a different level (i.e., taxonomic level, gene level, and pathway level). As a result, functional meta-genome annotation can be achieved by first decomposing the meta-genomic samples into a mixture of functional elements (from three different levels), and then analyzing the genome-level configuration of functional elements to learn how those functional elements are grouped and jointly participate in biological processes.

In this paper, based on the functional elements derived from the non-redundant CDs catalogue, we show that the configuration of functional groups in meta-genome samples can be inferred by probabilistic topic modeling. Probabilistic topic modeling is a Bayesian method that is able to extract useful topical information from unlabeled data. When used to study microbial samples (assuming that the relative abundance of functional elements has already been obtained by a homology-based approach), each sample can be considered as a “document,” which has a mixture of functional groups, while each functional group (also known as a “latent topic”) is a weighted mixture of functional elements (including taxonomic levels, indicators of gene orthologous groups, and KEGG pathway mappings). The functional elements bear an analogy with “words.” Estimating the probabilistic topic model will uncover the configuration of functional groups (latent topics) in each sample. The experimental results demonstrate the effectiveness of the proposed model.

The rest of this paper is organized as follows: In Section II, we review the background and related work on probabilistic topic models. In Section III, we present the framework of functional group identification via topic models. In Section IV, we introduce the extended Enterotype HDP model to infer the functional basis of detected enterotypes. Experimental results and discussion are presented in Section V. We summarize our work in Section VI.

II. BACKGROUND AND RELATED WORK

Probabilistic topic models have been developed for applications in various domains such as text mining [8], information retrieval [2], and computer vision [4], [13]. In the bioinformatics domain, generative topic models have previously been used to learn protein-protein relations from MEDLINE abstracts of the biomedical literature [1], [15]; they have also been applied to identify gene relations from microarray profiles [5] and to describe the process of constructing mRNA module collections [7]. In [7], the authors use a topic-model-based Gene Program algorithm to allocate mRNA from each tissue to different gene expression programs, in which each tissue is considered as a sample from a population of related tissues. In the model, gene sets have different chances of being co-expressed in different subsets of samples, which also encodes the assumption that similar sample groups are more likely to share similar gene sets. The model provides flexibility in allocating the expression data and discovering co-expressed gene sets.

The latent Dirichlet allocation (LDA) model [2] is an effective probabilistic topic model first introduced in the text mining domain to infer latent semantic topics from text documents. The LDA model allows us to study the underlying co-occurrence patterns of the data and extract useful knowledge, such as latent semantic topics, from the data. What is more, the learning process of the LDA model is fully unsupervised; therefore, it is very suitable for research areas that lack labeled data. Due to its solid theoretical foundation and promising performance, the LDA model has been popular with the data mining community in recent years. The LDA model has been reported to give good results across many categories of text data, including domain-specific text (such as MEDLINE) [15] and general text (such as the New York Times dataset) [8], and may also give good results on other text-like data such as visual code words [4]. Such data are commonly handled with “bag-of-words” models, which represent each document by a distribution over a fixed vocabulary (in which the order of the words does not matter). One limitation of LDA-based topic models is that they require specifying the exact number of mixture components, which remains unchanged during model estimation. In practice, in order to find an optimal number, researchers have to try different numbers of mixture components and make a choice by comparing the log-likelihood, perplexity, and other criteria that indicate how well the model fits the data. The hierarchical Dirichlet process (HDP) model [19] is a nonparametric extension of LDA-based topic models; it enables modeling documents with a countably infinite number of mixture components, and thus provides the flexibility of modeling data with different numbers of semantic components.

III. PROBABILISTIC TOPIC MODEL FOR FUNCTIONAL GROUP IDENTIFICATION

In this section, we present the probabilistic topic modeling process for identifying functional groups from microbial samples. The approach is based on the functional element abundance data acquired from homology-based alignment (a high abundance of a specific functional element indicates a high expression level of a specific taxon, gene cluster, or metabolic pathway). To use the probabilistic topic model for inferring functional groups of biological processes, we define the model as follows. The genome set serves as the document corpus, with individual samples representing the documents. The functional elements (including NCBI taxonomic level indicators, indicators of gene orthologous groups, and KEGG pathway indicators) serve as words, which jointly define a fixed word vocabulary for the corpus (taking the gene orthologous group (OG) indicators as an example, the COG and NOG terms from the eggNOG database [16] can be used as the vocabulary of the model). Consequently, each document can be represented as a bag of words, in which the order of words is not considered. Each inferred latent topic (i.e., a functional group such as a group of bacteria or a group of gene clusters) defines a multinomial distribution over the given vocabulary. In other words, each functional group specifies a multinomial distribution over functional elements. The discrete expression levels of functional elements are treated analogously to word frequencies in text documents. The configuration of functional groups in each sample, as well as the distribution of functional elements in each functional group, is considered to be generated conditionally independently by the topic model. With the inferred latent topics, meta-genome samples can be represented as weighted combinations of functional groups. Different functional groups (latent topics) may co-exist in the same sample and may be shared across a set of samples. The samples differ in terms of which functional groups are present and how they are weighted.
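To make the document/word analogy concrete, the following minimal sketch turns per-sample lists of functional-element indicators into bag-of-words count vectors over a fixed vocabulary. The function name `build_count_matrix`, the sample identifiers, and the toy indicator names are hypothetical and serve only as an illustration; they are not taken from the datasets used in this paper.

```python
from collections import Counter

def build_count_matrix(samples):
    """Convert per-sample indicator lists into count vectors.

    `samples` maps a sample id to the list of functional-element
    indicators observed in that sample (one entry per annotated
    sequence). Returns the sorted vocabulary and, for each sample,
    a count vector aligned with that vocabulary.
    """
    vocab = sorted({ind for inds in samples.values() for ind in inds})
    index = {ind: i for i, ind in enumerate(vocab)}
    matrix = {}
    for sample_id, inds in samples.items():
        row = [0] * len(vocab)
        for ind, count in Counter(inds).items():
            row[index[ind]] = count
        matrix[sample_id] = row
    return vocab, matrix

# Hypothetical toy input: two samples annotated with OG indicators.
toy_samples = {
    "sample_A": ["COG0001", "COG0002", "COG0001", "NOG00010"],
    "sample_B": ["COG0002", "NOG00010", "NOG00010"],
}
vocab, matrix = build_count_matrix(toy_samples)
```

Because the order of indicators within a sample is discarded, each row carries exactly the information that a bag-of-words topic model consumes.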

A. Preliminaries of LDA Model

The conventional LDA model assumes that a document may deal with multiple topics, and that each of these topics can be represented by a unique distribution over words. A latent topic is a high-level concept that explains the co-occurrence patterns of words that appear in one document; it provides an effective way to analyze the composition of documents. Depending on the application context, a latent topic may have different semantic meanings. Based on the definition of latent topics, the objective of the LDA model is to assign these latent topics to the words in a document (each word $w_i$ may only be assigned one topic), so that a document may in turn be represented as a mixture of latent topics. In practice, the latent topic assignment is achieved by manipulating some unseen latent random variables to determine the conditional probability of words given a latent topic, $P(w_i \mid z_i = j)$, and the probability of latent topics given a document, $P(z_i = j \mid d)$.

When a probabilistic topic model (such as the LDA model) is used to study meta-genome samples, each sample can be considered as a “document,” which has a mixture of functional groups, while each functional group (also known as a latent topic) is a weighted mixture of functional elements (which bear an analogy with words). The LDA model for meta-genome samples can be defined by the following hierarchical structure. There are a total of $D$ samples (documents) in the data collection, which contains a total of $N$ functional elements derived from the non-redundant CDs catalog, assigned to a total of $W$ different indicators; and there are a total of $T$ latent topics. For the $d$th meta-genome sample (document), the LDA model samples a latent variable $\theta^{(d)} \sim \mathrm{Dirichlet}(\alpha)$, in which $\alpha$ is a $T$-dimensional vector of topic priors in sample $d$. Then, for the $j$th topic, the model samples a latent variable $\phi^{(j)} \sim \mathrm{Dirichlet}(\beta)$, which serves as the prior probabilities for the functional element distributions of different latent topics. After sampling the latent variables $\theta$ and $\phi$, the probability that topic $j$ appears in sample $d$ is defined as $P(z_i = j \mid d) = \theta^{(d)}_j$, in which $z_i = j$ means that topic $j$ is assigned to $w_i$. The probability that a given functional element $w_i$ is generated by topic $j$ is defined as $P(w_i \mid z_i = j) = \phi^{(j)}_{w_i}$. The generative process of this model is defined as follows:

For the $d$th meta-genome sample (document), sample $\theta^{(d)} \sim \mathrm{Dirichlet}(\alpha)$.
For the $j$th topic, sample $\phi^{(j)} \sim \mathrm{Dirichlet}(\beta)$.
For each functional element $w_i$ in sample $d$: sample a latent topic $z_i \sim \mathrm{Multinomial}(\theta^{(d)})$, and sample $w_i \sim \mathrm{Multinomial}(\phi^{(z_i)})$.

The LDA model can be estimated from the functional element abundance data via a Gibbs sampling Monte Carlo process [8]. The estimation process requires separately sampling a latent topic for each functional element in each sample according to the posterior probability (written here for symmetric priors $\alpha$ and $\beta$):

$$P(z_i = j \mid \mathbf{z}_{-i}, \mathbf{w}) \propto \frac{n^{(w_i)}_{-i,j} + \beta}{n^{(\cdot)}_{-i,j} + W\beta} \cdot \frac{n^{(d_i)}_{-i,j} + \alpha}{n^{(d_i)}_{-i,\cdot} + T\alpha} \tag{1}$$

in which $n^{(w_i)}_{-i,j}$ is the number of times indicator $w_i$ has been assigned to topic $j$, $n^{(\cdot)}_{-i,j}$ is the total number of functional elements assigned to latent topic $j$ except for the current functional element $w_i$, and $n^{(d_i)}_{-i,j}$ is the total number of functional elements in sample $d_i$ (except for $w_i$) that have been assigned to topic $j$.
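The sampling step described above can be sketched as a collapsed Gibbs sampler in the style of [8]. This is a minimal illustration under stated assumptions, not the implementation used in this paper: the function name `lda_gibbs`, the default hyper-parameter values, and the integer encoding of functional elements are all hypothetical. The per-sample denominator is omitted because it is constant across topics and cancels after normalization.

```python
import random

def lda_gibbs(docs, n_topics, vocab_size, alpha=0.1, beta=0.01,
              n_iter=50, seed=0):
    """Collapsed Gibbs sampling for LDA with symmetric priors.

    `docs` is a list of samples, each a list of integer element ids
    in range(vocab_size). alpha and beta are illustrative values only.
    """
    rng = random.Random(seed)
    n_wt = [[0] * n_topics for _ in range(vocab_size)]  # element-topic counts
    n_dt = [[0] * n_topics for _ in docs]               # sample-topic counts
    n_t = [0] * n_topics                                # tokens per topic
    z = []
    # Random initialization of topic assignments.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            t = rng.randrange(n_topics)
            z_d.append(t)
            n_wt[w][t] += 1
            n_dt[d][t] += 1
            n_t[t] += 1
        z.append(z_d)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current token from all counts.
                t = z[d][i]
                n_wt[w][t] -= 1
                n_dt[d][t] -= 1
                n_t[t] -= 1
                # Unnormalized posterior over topics for this token.
                weights = [
                    (n_wt[w][j] + beta) / (n_t[j] + vocab_size * beta)
                    * (n_dt[d][j] + alpha)
                    for j in range(n_topics)
                ]
                t = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = t
                n_wt[w][t] += 1
                n_dt[d][t] += 1
                n_t[t] += 1
    return z, n_wt, n_dt

# Hypothetical toy corpus: two samples over a vocabulary of four elements.
toy_docs = [[0, 1, 0, 2], [2, 3, 3, 1, 3]]
assignments, element_topic, sample_topic = lda_gibbs(
    toy_docs, n_topics=2, vocab_size=4, n_iter=20, seed=1)
```

The returned count tables are sufficient to estimate the topic-level distribution of functional elements and the sample-level distribution of topics by adding the priors and normalizing.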

B. LDA Model With Background Distribution

Given the non-redundant CDs catalog and the derived functional elements, we are interested in identifying the frequent co-occurrence patterns of functional elements. Functional elements (such as taxa groups, gene clusters, and pathways) that are commonly shared across samples may suggest functional similarity and biological relevance among samples. If strong genome-wide co-existence patterns of functional elements do exist, this may suggest the existence of a “core” genome.

With this consideration, we extend the LDA model by adding a background distribution of commonly shared functional elements. We present the graphical representation of the proposed model in Fig. 1. Following the convention for depicting graphical representations of topic models, we use round nodes to represent random variables, in which the white nodes stand for latent random variables, while the gray nodes denote observations during model training. The rounded boxes represent fixed hyper-parameters of the model, while the edges illustrate the conditional dependencies underlying the generative process.

Fig. 1. Hierarchical structure of proposed LDA-B model.

The generative process of the proposed model is as follows. As shown in Fig. 1, a switch variable is introduced in the model; it follows a binomial distribution (with a Beta prior) and takes only the binary values 0 and 1. Before the latent topics in a sample are sampled, the switch variable needs to be sampled for each functional element. For a given functional element, if its switch variable equals 0, the element is generated by the background topic; otherwise, if its switch variable takes the value 1, the element is generated by one of the regular latent topics.

For each functional element in a sample, its assigned topic (either the background topic or one of the regular latent topics) is sampled according to the posterior probability:

(2)

(3)

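More formally, the generative process of this model is defined as follows:
1. For the dth document (meta-genome sample), sample its topic mixture and the parameter of its switch distribution.
2. For the jth latent topic, sample its distribution over functional elements; for the background topic, sample the background distribution over functional elements.
3. For each of the functional elements in document (sample) d:
4. For each functional element, sample a switch value.
a. If the switch equals 0, sample the functional element from the background topic.
b. If the switch equals 1, sample a topic and then sample the functional element from that topic.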
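The generative steps listed above can be sketched in code as follows. This is only an illustrative reading of the process: the function names, the per-sample switch probability drawn from a Beta prior, and every numeric value in the usage lines are assumptions introduced for the example and are not specified in this form by the text.

```python
import random

def sample_dirichlet(rng, concentration, size):
    """Draw one symmetric Dirichlet sample via normalized Gamma variates."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [x / total for x in draws]

def generate_sample(rng, n_elements, topic_dists, background_dist,
                    alpha, switch_prior):
    """Generate one sample from a background-augmented topic model.

    topic_dists: one distribution over the vocabulary per regular topic.
    background_dist: distribution over the vocabulary for the background topic.
    switch_prior: (a, b) parameters of the assumed Beta prior on the switch.
    """
    n_topics = len(topic_dists)
    vocab = range(len(background_dist))
    theta = sample_dirichlet(rng, alpha, n_topics)   # topic mixture of the sample
    p_regular = rng.betavariate(*switch_prior)       # P(switch = 1) for the sample
    elements, switches = [], []
    for _ in range(n_elements):
        x = 1 if rng.random() < p_regular else 0
        if x == 0:
            # Switch 0: the element comes from the background topic.
            w = rng.choices(vocab, weights=background_dist)[0]
        else:
            # Switch 1: pick a regular topic, then an element from it.
            z = rng.choices(range(n_topics), weights=theta)[0]
            w = rng.choices(vocab, weights=topic_dists[z])[0]
        elements.append(w)
        switches.append(x)
    return elements, switches

# Hypothetical toy setup: two regular topics over four elements.
rng = random.Random(0)
topics = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.1, 0.1, 0.7]]
background = [0.25, 0.25, 0.25, 0.25]
elements, switches = generate_sample(
    rng, 50, topics, background, alpha=0.5, switch_prior=(2.0, 2.0))
```

Elements generated with switch 0 follow the same background distribution in every sample, which is what lets the background topic absorb commonly shared functional elements.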

In our model, we assume symmetric priors with fixed hyper-parameter values. This parameter setting is chosen to make the topic modeling results more diverse. For example, with a sufficiently small Dirichlet parameter, the topic mixture for each genome concentrates on several unique topics instead of assigning equal probability to every topic. We follow the model selection method in [15] to determine the optimal number of latent topics. In general, a larger number of topics may provide higher resolution for the uncovered functional core (either microbial core or gene core) of the genome. However, a large number of topics may also cause the model to over-fit. The selection among models with different numbers of topics is carried out based on the approximated evidence (log-likelihood) of the samples. Usually, it takes fewer than 100 iterations for the Gibbs sampling process to converge.
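The effect of the Dirichlet parameter on topic mixtures can be checked numerically. The sketch below compares the average largest mixture weight under a small and a large symmetric concentration parameter; the values 0.1 and 10.0, the number of topics, and the function names are hypothetical choices made for illustration and are not the settings used in this paper.

```python
import random

def sample_dirichlet(rng, concentration, size):
    """Draw one symmetric Dirichlet sample via normalized Gamma variates."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [x / total for x in draws]

def mean_largest_weight(concentration, n_topics=10, n_draws=500, seed=0):
    """Average of the largest mixture weight over repeated Dirichlet draws."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        total += max(sample_dirichlet(rng, concentration, n_topics))
    return total / n_draws

sparse = mean_largest_weight(0.1)   # mass concentrates on a few topics
flat = mean_largest_weight(10.0)    # mass spreads almost evenly
```

A larger value of `sparse` relative to `flat` reflects the behavior described above: small concentration parameters push each mixture toward a handful of dominant topics.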

C. Mutual Information

After estimating the topic model and assigning a latent topic to each functional element, the relevance between latent topics and functional element indicators (i.e., NCBI taxonomic level indicators, indicators of gene orthologous groups, and KEGG pathway indicators) can be obtained by calculating the mutual information (MI) between functional element indicators and the obtained latent topics, based on the final latent topic assignments of the functional elements. The MI between a specific functional element indicator and a latent topic is shown in (4), in which $R_g$ and $Z_t$ are binary indicator variables corresponding to the functional element and the latent topic, respectively. The variable pair $(R_g, Z_t)$ indicates whether a latent topic has been assigned to a specific functional element.

$$\mathrm{MI}(R_g, Z_t) = \sum_{r \in \{0,1\}} \sum_{z \in \{0,1\}} P(R_g = r, Z_t = z) \log \frac{P(R_g = r, Z_t = z)}{P(R_g = r)\,P(Z_t = z)} \tag{4}$$

Given the training data, the joint probability $P(R_g, Z_t)$ and the two marginal probabilities $P(R_g)$ and $P(Z_t)$ can be estimated simply by counting the occurrences over all the training data.
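The counting-based estimate of mutual information can be sketched as follows. The function name and its argument names are hypothetical, the logarithm base (2, giving bits) is an assumption, and the counts are taken over all functional-element tokens, with one binary variable indicating whether a token is a given indicator and the other indicating whether it was assigned to a given topic.

```python
import math

def mutual_information(n_both, n_element, n_topic, n_total):
    """MI (in bits) between two binary indicators estimated from counts.

    n_both:    tokens that are the given element AND assigned to the topic
    n_element: tokens that are the given element
    n_topic:   tokens assigned to the given topic
    n_total:   all tokens in the training data
    """
    joint = {
        (1, 1): n_both,
        (1, 0): n_element - n_both,
        (0, 1): n_topic - n_both,
        (0, 0): n_total - n_element - n_topic + n_both,
    }
    p_r = {1: n_element / n_total, 0: 1 - n_element / n_total}
    p_z = {1: n_topic / n_total, 0: 1 - n_topic / n_total}
    mi = 0.0
    for (r, z), count in joint.items():
        if count > 0:  # empty cells contribute nothing to the sum
            p_rz = count / n_total
            mi += p_rz * math.log2(p_rz / (p_r[r] * p_z[z]))
    return mi

independent = mutual_information(25, 50, 50, 100)  # element and topic unrelated
dependent = mutual_information(50, 50, 50, 100)    # element always in the topic
```

In the first call the joint counts factorize exactly, so the MI is zero; in the second the two indicators determine each other, so the MI reaches one bit.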

IV. EXTENDED HIERARCHICAL DIRICHLET PROCESS FOR DETECTED ENTEROTYPES

In this section, we introduce the extended background HDP model to infer the functional basis of detected phylogenetic clusters (a.k.a. Enterotypes).

In recent studies [12], [18], there has been some general consensus about the phylogenetic composition of the human gut microbiome. However, the composition of gene functions in the human gut microbiome and its variation across the human population are still not clear. It is unknown whether inter-individual variation may lead to dramatically different gene function compositions, or whether individual human gut microbiomes congregate into several categories with shared functional properties.

It has been demonstrated that researchers may identify distinct clusters in the human gut microbiome by analyzing the phylogenetic composition. Specifically, a large fraction of the meta-genome can be matched to the reference genome set at the genus and phylum levels. In [18], multidimensional cluster analysis and principal component analysis (PCA) are performed on phylogenetic abundance profiles to cluster 33 samples into 3 distinct clusters (a.k.a. “Enterotypes”), which are identified by the levels of one of three genera: Bacteroides, Prevotella, and Ruminococcus. It is hoped that the identified Enterotypes may explain either the host properties (such as IBD) or the complex mixture of functional properties. However, when the samples are clustered using a purely functional metric (such as the abundance of the orthologous groups derived from predicted genes), the grouping of samples does not agree well with the Enterotypes obtained by phylogenetic clustering, indicating that the abundance of functions may not coincide with the abundance of genera.

Typically, the most abundant molecular functions can be traced back to the most dominant species or genera. However, it should be noted that abundant species or genera cannot reveal the entire functional complexity of the gut microbiota; some identified orthologous groups may also be primarily contributed by low-abundance genera. In our study, in order to determine the functional basis of the identified Enterotypes, we introduce the extended background HDP model for inferring the sample-level composition of orthologous groups with respect to different Enterotypes.

A. Preliminaries of HDP Model

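Before introducing our models, we briefly review the background of the hierarchical Dirichlet process.

The Dirichlet process (DP) is defined as a distribution over a random probability measure $G_0 \sim \mathrm{DP}(\gamma, H)$, in which $\gamma$ is a concentration parameter and $H$ is a base measure defined on a sample space $\Theta$. By its definition, for any finite measurable partition $(A_1, \ldots, A_r)$ of $\Theta$:

$$(G_0(A_1), \ldots, G_0(A_r)) \sim \mathrm{Dirichlet}(\gamma H(A_1), \ldots, \gamma H(A_r)).$$

Due to the discrete nature of the DP [17], it can be constructed by the stick-breaking construction as follows (each $\phi_k \sim H$ is a distinct value on the space $\Theta$; these values are also considered as the parameters of the mixture components during modeling):

$$\beta'_k \sim \mathrm{Beta}(1, \gamma), \qquad \beta_k = \beta'_k \prod_{l=1}^{k-1} (1 - \beta'_l), \qquad G_0 = \sum_{k=1}^{\infty} \beta_k \delta_{\phi_k}.$$

The weights of the mixture components, $\beta = (\beta_k)_{k=1}^{\infty}$, are also referred to as $\beta \sim \mathrm{GEM}(\gamma)$.

The hierarchical Dirichlet process (HDP) considers $G_0$ as a global probability measure across the corpus and defines a set of child random probability measures $G_d$ for each document $d$, which leads to different document-level distributions over the semantic mixture components:

$$G_d \mid G_0 \sim \mathrm{DP}(\alpha_0, G_0). \tag{5}$$

Each $G_d$ can also be constructed by the stick-breaking construction as $G_d = \sum_{k=1}^{\infty} \pi_{dk} \delta_{\phi_k}$, in which $\pi_d = (\pi_{dk})_{k=1}^{\infty}$ specifies the weights of the integer mixture component indicators $k$.

Now consider the indicator variable sets $K_l = \{k : \phi_k \in A_l\}$ for $l = 1, \ldots, r$; then $(K_1, \ldots, K_r)$ becomes a finite partition of the integer indicators. Substituting the stick-breaking constructions of $G_0$ and $G_d$ into (5), it follows that

$$\Big(\sum_{k \in K_1} \pi_{dk}, \ldots, \sum_{k \in K_r} \pi_{dk}\Big) \sim \mathrm{Dirichlet}\Big(\alpha_0 \sum_{k \in K_1} \beta_k, \ldots, \alpha_0 \sum_{k \in K_r} \beta_k\Big). \tag{6}$$

Based on the aggregation properties of the Dirichlet distribution and its connection with the Beta distribution, we can show that

$$\pi'_{dk} \sim \mathrm{Beta}\Big(\alpha_0 \beta_k,\ \alpha_0 \Big(1 - \sum_{l=1}^{k} \beta_l\Big)\Big), \qquad \pi_{dk} = \pi'_{dk} \prod_{l=1}^{k-1} (1 - \pi'_{dl}). \tag{7}$$

It follows that $\pi_d \sim \mathrm{DP}(\alpha_0, \beta)$.

Fig. 2. Hierarchical structure of proposed Enterotype HDP model.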
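The stick-breaking construction reviewed in this subsection can be illustrated with a truncated sampler. This sketch assumes the standard formulation in which each stick fraction is drawn from a Beta(1, gamma) distribution, with gamma denoting the concentration parameter; the function name, the truncation level, and the value of gamma in the usage lines are hypothetical.

```python
import random

def stick_breaking(rng, gamma, n_components):
    """Truncated stick-breaking draw of mixture weights.

    Returns the first n_components weights and the stick mass that is
    left over for all remaining (untruncated) components.
    """
    weights = []
    remaining = 1.0
    for _ in range(n_components):
        frac = rng.betavariate(1.0, gamma)  # fraction broken off the stick
        weights.append(frac * remaining)
        remaining *= 1.0 - frac
    return weights, remaining

weights, leftover = stick_breaking(random.Random(0), gamma=2.0, n_components=20)
```

The weights and the leftover mass always sum to one, and the leftover shrinks geometrically in expectation, which is why a finite truncation is usually an adequate approximation of the infinite mixture.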

B. Extended Background HDP Model for Detected Enterotypes

To indicate the Enterotype label of each sample, a switch variable is introduced. The generative process of the Enterotype HDP model is represented in Fig. 2. For each orthologous group (OG) indicator in a sample, the value of the switch variable (which takes the value 0 or 1) is sampled from a binomial distribution (with a Beta prior). When the switch variable equals 0, the topical indicator of the OG indicator is drawn uniformly from the functional basis learned from the corresponding Enterotype (the blue arrows in Fig. 2 illustrate this procedure). When the switch variable equals 1, a mixture component of functional properties is sampled according to the sample-level weights of the functional mixture components for that sample, and the OG indicator is drawn from the distribution of that functional mixture component (the red dashed arrows in Fig. 2 illustrate this procedure). Detailed explanations of the notation in the model are summarized in Table I.

The generative process of this model begins with drawing a global probability measure and, for each sample, drawing a child Dirichlet process. Following the stick-breaking construction, this is equivalent to first drawing a global weight for the functional component indicators, and then, for each sample, drawing the document-level weights of the functional component indicators. The data observations in a sample are generated by repeatedly drawing a functional component indicator from the document-level weights and then drawing each OG indicator from the conditional probability of the sampled functional component.

TABLE I. NOTATIONS IN PROPOSED TOPIC MODEL

More formally, the generative process of this model is defined as follows:

1. Draw a global weight over the functional components;
2. For each functional component, draw its conditional OG indicator distribution;
3. For each Enterotype, sample the conditional OG indicator distribution of its functional basis;
4. For each sample, draw its sample-level weights over the functional components;
5. For each OG indicator in the sample:
a. sample a functional component indicator;
b. sample its OG indicator;
6. For each sample, sample the parameter of its switch distribution;
7. For each OG indicator in the sample:
a. sample a switch variable;
b. if the switch equals 0, generate the OG indicator from the functional basis of the corresponding Enterotype;
c. if the switch equals 1, sample a functional indicator, and then generate the OG indicator from that functional indicator according to the conditional probability.

In the following, we describe the Gibbs sampling scheme for the proposed Enterotype HDP model. The sampling scheme consists of two steps. The first step is sampling the semantic component indicators as well as the corresponding HDP hyper-parameters. In order to sample an HDP-like model, one may either follow the Chinese restaurant franchise (CRF) scheme or use direct assignment [19]. In our work, direct assignment is used (Table II).

TABLE II. THE POSTERIOR SAMPLING PROCESS

The second step is sampling the switch variable and the conditional distributions of OG indicators. We derive the sampling equation of the switch variable for each OG indicator in a sample as follows:

(8)

(9)

After a set of sampling processes based on the posterior distribution calculated above, other parameters can be sampled using the following equations:


V. EXPERIMENTAL RESULTS AND DISCUSSIONS

In this section, we conduct a probabilistic topic modeling experiment to identify functional groups in microbial samples from two large published gut microbiome datasets: the Illumina-based metagenomics data from 112 Danish individuals [12] and the Sanger-sequenced meta-genomes of 39 individuals [18]. Following the methods in Sections III and IV, we apply the proposed probabilistic topic models to the functional element abundance data acquired from the non-redundant gene catalog of human gut microbial samples.

A. Experimental Data Collection

The Illumina human gut microbial community taxon abundance data were generated in [12] and are openly accessible via http://gutmeta.genomics.org.cn/. According to [12], the Illumina GA reads from human gut microbial samples were first assembled into longer contigs. The Glimmer program was then used to predict protein-encoding sequences (CDs) from the assembled contigs. The predicted CDs were aligned to each other to form a non-redundant CDs catalog (a.k.a. the minimal gut genome), which consists of 3 299 822 non-redundant CDs with an average length of 704 bp. For a given non-redundant CDs, its NCBI taxonomic level is obtained by BLASTP alignment against the NCBI NR database and is determined by the lowest common ancestor (LCA)-based algorithm. The taxonomic abundance data for each sample can then be computed by counting the indicators of NCBI taxonomic levels. The gene orthologous group indicators and KEGG pathway indicators are assigned by BLASTP alignment of the amino-acid sequences of the predicted CDs against the eggNOG database and the KEGG database. The human gut microbial samples from [12] come from both healthy subjects (HS) and patients with inflammatory bowel disease (IBD). The IBD patients fall into two groups, one with Crohn’s disease (CD) and the other with ulcerative colitis (UC). In total, there are 85 healthy samples, 15 UC samples, and 12 CD samples.

The Sanger-sequenced gut microbiome dataset [18] includes 22 European meta-genomes from Danish, French, Italian, and Spanish individuals, combined with Sanger gut datasets from 13 Japanese and 4 American individuals. During sequence processing, the raw Sanger reads are trimmed to remove low-quality reads and possible human DNA contamination. The cleaned Sanger reads are then assembled into longer contigs for gene prediction. The phylogenetic annotation of samples was performed by aligning Sanger reads against a total of 1511 reference genomes.

The gene functions are annotated via BLASTP against the eggNOG and KEGG databases, which yields high-throughput gene function profiling: 63.5% of the predicted genes in the Sanger-sequenced samples can be assigned to an orthologous group. The gene function profile may then be used to study the composition of eggNOG and KEGG orthologous groups across distinct samples.
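As an illustration of the LCA-based taxonomic assignment and the indicator counting described for the Illumina dataset, a minimal Python sketch is given here. It assumes that each BLASTP hit has already been mapped to a root-to-leaf lineage; the CDs identifiers, the lineages, and the function name `lowest_common_ancestor` are hypothetical and are not taken from [12], whose actual LCA procedure may apply additional score or coverage thresholds.

```python
from collections import Counter


def lowest_common_ancestor(lineages):
    """Return the deepest taxon shared by all root-to-leaf lineages."""
    lca = None
    # zip(*lineages) walks the lineages rank by rank, starting at the root.
    for ranks in zip(*lineages):
        if all(r == ranks[0] for r in ranks):
            lca = ranks[0]
        else:
            break
    return lca


# Hypothetical lineages of BLASTP hits for two predicted CDs (illustrative only).
hits = {
    "cds_1": [
        ["Bacteria", "Firmicutes", "Clostridia", "Clostridiales",
         "Clostridiaceae", "Clostridium"],
        ["Bacteria", "Firmicutes", "Clostridia", "Clostridiales",
         "Lachnospiraceae", "Roseburia"],
    ],
    "cds_2": [
        ["Bacteria", "Bacteroidetes", "Bacteroidia", "Bacteroidales",
         "Bacteroidaceae", "Bacteroides"],
    ],
}

# One taxonomic-level indicator per CDs.
assignment = {cds: lowest_common_ancestor(l) for cds, l in hits.items()}

# Taxonomic abundance of one sample: count the indicators of the CDs it contains.
sample_cds = ["cds_1", "cds_1", "cds_2"]
abundance = Counter(assignment[c] for c in sample_cds)
print(assignment)
print(abundance)
```

In this toy case the two hits of `cds_1` disagree below the order level, so the indicator falls back to the shared order, whereas the single hit of `cds_2` is kept at genus level.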

B. Topic Inferring From Predicted Gene Catalogue With LDA-B Model

As introduced in Section III, the functional elements, which bear an analogy with text words, include three different types of indicators: NCBI taxonomic level indicators, gene orthologous group indicators, and KEGG pathway indicators. The union of unique functional elements jointly defines a fixed word vocabulary. In the Illumina dataset, there are 647 136 NCBI taxonomic level indicators, with a vocabulary size of 748; 1 293 764 gene orthologous group indicators, with a vocabulary size of 4667; and 953 493 KEGG pathway indicators, with a vocabulary size of 237.

It should be pointed out that, in our approach, we separately estimated three probabilistic topic models, one for each type of functional element (i.e., NCBI taxonomic level indicators, gene orthologous group indicators, and KEGG pathway indicators). We apply a Gibbs sampling process to iteratively update the model estimate from the functional element abundance data until convergence (typically, fewer than 100 iterations are needed). During topic modeling, we assume symmetric priors and set the hyper-parameters following the method in Section III. Once the Gibbs sampling process has converged, we obtain the topic-level distribution of functional elements as well as the sample-level distribution of latent topics. In our experiment, we test different topic numbers on the proposed LDA-B model and compare the log-likelihood. Log-likelihood is one of the standard criteria for generative model evaluation. It provides a quantitative measurement of how well a topic model fits the training data. The log-likelihood score is a negative number, and a higher score (closer to zero) indicates a better fit. In practice, the log-likelihood of elements given latent topics can be calculated by integrating out all the latent variables.

(10)
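For reference, a minimal Python sketch of this kind of collapsed log-likelihood is given here for plain LDA with a symmetric Dirichlet prior, using the closed form reported in [8]. It is an assumption that the score for the LDA-B model is computed analogously; the function name and the toy count table are hypothetical.

```python
import math


def log_likelihood_w_given_z(topic_word_counts, beta):
    """log P(w | z) for plain LDA with a symmetric Dirichlet(beta) prior.

    topic_word_counts[k][w] is the number of times element w is assigned
    to topic k. The topic-element distributions are integrated out, which
    gives a product of Dirichlet-multinomial terms, one per topic.
    """
    num_topics = len(topic_word_counts)
    vocab_size = len(topic_word_counts[0])
    # Normalizing constant of the prior, once per topic.
    ll = num_topics * (math.lgamma(vocab_size * beta)
                       - vocab_size * math.lgamma(beta))
    for row in topic_word_counts:
        ll += sum(math.lgamma(n + beta) for n in row)
        ll -= math.lgamma(sum(row) + vocab_size * beta)
    return ll


# Toy check: one topic, two elements, each observed once, beta = 1.
# The marginal probability of that sequence is the integral of
# theta * (1 - theta) over [0, 1], i.e., 1/6.
toy_ll = log_likelihood_w_given_z([[1, 1]], beta=1.0)
print(toy_ll, math.log(1.0 / 6.0))
```

Working with `math.lgamma` rather than the Gamma function itself keeps the computation numerically stable for count tables of realistic size.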

For the LDA model, the log-likelihood can be calculated in a similar way. Fig. 3(a)–(c) compares the log-likelihood of the proposed LDA-B model and the LDA model on the three types of functional elements under different topic numbers. For both models, the likelihood increases as the number of topics increases, which means that a relatively larger topic number may result in a better fit to the data. However, it should be noted that there is a trade-off between the topic number and the convergence time of the models. Moreover, as we will see in Section VI, increasing the topic number does not always improve the predictive results. In general, the log-likelihood of the LDA model is higher than that of the LDA-B model, which shows that the LDA model fits the training data better. The difference between the two models can be explained by the introduction of the background topic in the LDA-B model.
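The trade-off between topic number and convergence time can be seen from the structure of the sampler itself: each sweep visits every functional element once and evaluates one weight per topic, so its cost grows linearly with the topic number. A minimal collapsed Gibbs sampler for plain LDA (not the LDA-B model, whose background switch is omitted) is sketched here under the assumption of symmetric priors; the function name, the toy samples, and the hyper-parameter values are hypothetical.

```python
import random


def gibbs_lda(docs, vocab_size, num_topics, alpha, beta, iters, seed=0):
    """Collapsed Gibbs sampling for plain LDA with symmetric priors.

    docs: list of samples, each a list of integer element ids.
    Returns (n_dk, n_kw): sample-topic and topic-element count tables.
    """
    rng = random.Random(seed)
    n_dk = [[0] * num_topics for _ in docs]
    n_kw = [[0] * vocab_size for _ in range(num_topics)]
    n_k = [0] * num_topics
    z = []
    # Random initial topic assignment for every element occurrence.
    for d, doc in enumerate(docs):
        z.append([])
        for w in doc:
            k = rng.randrange(num_topics)
            z[d].append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
    topics = range(num_topics)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment from the counts.
                k = z[d][i]
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (n_dk[d][j] + alpha) * (n_kw[j][w] + beta)
                    / (n_k[j] + vocab_size * beta)
                    for j in topics
                ]
                k = rng.choices(topics, weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1
    return n_dk, n_kw


# Three toy samples over a vocabulary of four functional elements.
docs = [[0, 0, 1, 1, 0], [2, 3, 3, 2], [0, 1, 2, 3]]
n_dk, n_kw = gibbs_lda(docs, vocab_size=4, num_topics=2,
                       alpha=0.5, beta=0.1, iters=50)
print(n_dk)
print(n_kw)
```

Normalizing the rows of `n_dk` (after adding `alpha`) and of `n_kw` (after adding `beta`) gives point estimates of the sample-level topic distribution and the topic-level element distribution, respectively.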


Fig. 3. (a)–(c) Log-likelihood comparison of the proposed LDA-B model and the LDA model on three different types of functional elements (as the topic number changes). (d)–(f) Perplexity comparison on three different types of functional elements (as the topic number increases).

Perplexity is a widely used criterion for evaluating the predictive ability of probabilistic topic models. It is calculated on held-out testing data. In our experiment, we use a 50% subset of the functional elements as training data and the other 50% as testing data. When constructing the two subsets, we ensure that the functional elements from the same sample are split equally between them. In practice, perplexity is the inverse of the predicted model likelihood of the held-out testing data, computed using parameters inferred from the trained topic model. Thus, a smaller perplexity value indicates a better model.

(11)
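A minimal Python sketch of the held-out evaluation is given here, assuming the common definition of perplexity as the exponential of the negative average per-element log-likelihood, with the sample-level topic proportions `theta` and topic-level element distributions `phi` taken from a trained model. The helper names and toy values are hypothetical, and the exact estimator used in our experiments may differ in detail.

```python
import math
import random


def split_tokens(doc, rng):
    """Split the elements of one sample into two halves of (nearly) equal size."""
    tokens = list(doc)
    rng.shuffle(tokens)
    half = len(tokens) // 2
    return tokens[:half], tokens[half:]


def perplexity(test_docs, theta, phi):
    """exp(-average log p(w | d)) over all held-out elements.

    theta[d][k]: probability of topic k in sample d.
    phi[k][w]:   probability of element w under topic k.
    """
    log_lik, n = 0.0, 0
    for d, doc in enumerate(test_docs):
        for w in doc:
            p = sum(theta[d][k] * phi[k][w] for k in range(len(phi)))
            log_lik += math.log(p)
            n += 1
    return math.exp(-log_lik / n)


rng = random.Random(0)
train, test = split_tokens([0, 1, 2, 3, 0, 1, 2], rng)

# Sanity check: if every topic is uniform over 4 elements,
# the model is as "surprised" as a fair 4-sided die.
uniform_phi = [[0.25] * 4, [0.25] * 4]
ppl = perplexity([[0, 1, 2, 3]], theta=[[0.5, 0.5]], phi=uniform_phi)
print(len(train), len(test), ppl)
```

Splitting within each sample, rather than holding out whole samples, guarantees that every test sample has topic proportions estimated from its own training half.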

One advantage of our LDA-B model is that it assigns commonly shared functional elements to the background distribution, which makes the model more suitable for representing genome-wide co-existing patterns of functional elements. Fig. 3(d)–(f) compares the perplexity of the proposed LDA-B model and the LDA model on the three types of functional elements as the topic number increases. The perplexity of our model is consistently lower than that of the LDA model, which suggests that our model is “less surprised” by the testing data and thus has better predictive ability. The results also show that the predictive ability of our model may benefit from a greater topic number, as perplexity tends to decrease as the topic number increases. The proposed LDA-B model achieves its best log-likelihood and perplexity scores when the topic number equals 200. Therefore, the LDA-B models are inferred with the topic number set to 200.

C. Inferring Functional Basis for Enterotypes With Extended Background HDP Model

In this section, we investigate the performance of the proposed Enterotype-HDP model using the Sanger-sequenced meta-genome samples [18] (openly accessible via http://www.bork.embl.de/Docu/Arumugam_et_al_2011/).

In [18], the predicted gene catalog from the Sanger-sequenced meta-genome samples covers a wide spectrum of bacteria; only 0.14% of the reads could be classified as human contamination. Also, 63.5% of the predicted genes in the Sanger-sequenced samples can be assigned to orthologous groups. Across the 33 samples in 3 distinct Enterotypes, there are 2 319 439 genes assigned to 13 507 eggNOG orthologous groups and 1 543 293 genes mapped to 4900 KEGG orthologous groups.

The value of the global concentration parameter is determined by log-likelihood and perplexity comparison over a series of values (Figs. 4 and 5). Other hyper-parameters (such as the Dirichlet distribution priors and the Beta distribution prior) are set in advance and fixed during the experiments. The functional basis for each Enterotype and the functional mixture components across the samples are predicted by performing Gibbs sampling on sample orthologous-group (OG) profiles (including both eggNOG and KEGG OG indicators) to estimate the sample-level distribution of the switch variable and the functional components. The output is a set of functional components and Enterotype functional bases inferred from the training dataset.

Fig. 4. Log-likelihood comparison of Enterotype HDP model.

Fig. 5. Perplexity comparison of Enterotype HDP model.

TABLE III
ILLUSTRATION OF THE BACKGROUND TOPIC OF TAXONOMIC LEVEL INDICATORS

Fig. 4 shows the log-likelihood comparison of the Enterotype HDP model with different values of the concentration parameter. Overall, the log-likelihood of the model increases over the iterations of the Gibbs sampling process, indicating a better fit to the training data. The best (highest) log-likelihood score is achieved with . We also compare the perplexity (11) of the Enterotype HDP model over a series of values. The results show that the model achieves its best perplexity score when .
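To give some intuition for why the concentration parameter matters, the stick-breaking construction of [17] can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the parameter name `gamma`, the truncation level `num_sticks`, and the two example values are hypothetical, and the sketch draws a single set of global weights rather than implementing the Enterotype HDP model.

```python
import random


def stick_breaking(gamma, num_sticks, seed=0):
    """Truncated stick-breaking draw of mixture weights.

    Each step breaks off a Beta(1, gamma) fraction of the remaining stick.
    Returns the weights and the mass left over after truncation.
    """
    rng = random.Random(seed)
    weights, remaining = [], 1.0
    for _ in range(num_sticks):
        b = rng.betavariate(1.0, gamma)
        weights.append(remaining * b)
        remaining *= 1.0 - b
    return weights, remaining


for gamma in (1.0, 10.0):
    weights, leftover = stick_breaking(gamma, num_sticks=50)
    # Number of components that receive a non-negligible share of the mass.
    active = sum(1 for w in weights if w > 0.01)
    print(gamma, active, round(leftover, 4))
```

The expected fraction broken off at each step is 1/(1 + gamma), so larger values tend to spread the mass over more functional components, which is why the value has to be tuned against the data.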

D. Illustration of Discovered Latent Themes

One major objective of the proposed models is to infer functional groups from meta-genomes in order to facilitate organizing knowledge and interpreting the biological processes encoded in meta-genome sequences. The inferred latent topics may provide more details for studying both the phylogenetic variation at the genus and phylum levels and the functional variation at the gene and functional class levels across samples. With this in mind, we visualize the uncovered background topics of NCBI taxonomic level indicators, gene OGs indicators, and KEGG pathway indicators from three independent LDA-B models and provide the top-ranked functional elements (Tables III–V).

TABLE IV
ILLUSTRATION OF THE BACKGROUND TOPIC OF GENE OGS INDICATORS

TABLE V
ILLUSTRATION OF THE BACKGROUND TOPIC OF KEGG PATHWAY INDICATORS

Table III illustrates the background topic of taxonomic level indicators, which provides insight into the “bacteria core” of the most common co-existing taxa across meta-genome samples. Table IV presents the background topic of gene OGs indicators. As we can see, the top-ranked functional elements involve not only general biological processes and molecular functions such as signal transduction, metabolic capacity, and important protein synthesis (RNA and DNA polymerase, ATP synthase) but also gut-specific functions such as adhesion to host proteins or harvesting sugars of the glycolipids. Table V shows the background topic of KEGG pathway indicators, which involves the main metabolic pathways such as carbon metabolism and amino acid metabolism. More examples of uncovered latent topics with respect to NCBI taxonomic indicators are illustrated in Tables VI–VIII. Specifically, Table VI illustrates the top-ranked latent topics of three different samples, in which the IDs of latent topics are sorted by their probability in each sample. Table VII presents the top-ranked taxa with respect to different latent topics, in which the taxa are sorted by the probability of being generated by each topic.

TABLE VI
ILLUSTRATION OF TOP-RANKED LATENT TOPICS WITH RESPECT TO DIFFERENT MICROBIAL SAMPLES

TABLE VII
ILLUSTRATION OF TOP-RANKED TAXA WITH RESPECT TO DIFFERENT LATENT TOPICS

TABLE VIII
ILLUSTRATION OF THE MOST RELEVANT LATENT TOPICS WITH RESPECT TO DIFFERENT TAXA

Table VIII illustrates the most relevant latent topics of each taxon. For each taxon, latent topics are sorted by the mutual information score (MI score). The MI score is calculated following (4) in Section III and serves as a relevance measure between taxa and latent topics. As shown in Table VIII, phylum Firmicutes is most relevant to the background topic (Topic 0). According to Table VI, the probability of Topic 0 in the Healthy and UC samples (0.475 in MH0001 and 0.363 in O2.UC-1) is much higher than that in CD samples (0.286 in V1.CD-1). This suggests that, for CD samples, the proportion of bacteria belonging to phylum Firmicutes is significantly reduced. Similarly, since genus Clostridium is most relevant to Topics 50, 153, and 95, and genus Bacteroides is most relevant to Topics 156, 77, and 52, the prevalence of Topics 95 and 52 in samples O2.UC-1 and V1.CD-1 may indicate the existence and possibly high abundance of genus Clostridium and genus Bacteroides, respectively. Our conclusion is supported by recent findings in fecal microbiota studies of inflammatory bowel disease (IBD) patients [6], [9], [11], [14]. It has been reported that there is a significant reduction in the proportion of bacteria belonging to phylum Firmicutes in CD samples, which is consistent with our results. This can be explained by the fact that mucosal microbial diversity is reduced in IBD, particularly in CD, which is associated with bacterial invasion of the mucosa. In UC, the inflammation is typically more superficial; therefore, the reduction of phylum Firmicutes in UC is not significant.

In our experiment, the phylogenetic composition inferred from latent topics (Fig. 6) agrees with previous observations in [12] and [18]: the Firmicutes and Bacteroidetes phyla constitute the vast majority of the dominant human gut microbiota, and Bacteroides is among the most abundant yet most variable genera across samples.

Fig. 6. Box-plot of background topic probability in samples.

TABLE IX
ILLUSTRATION OF THE FUNCTIONAL BASIS OF GENE OGS IN ENTEROTYPE 1 OF SANGER SEQUENCED SAMPLES [18]

To facilitate analyzing the composition of the human gut microbiome community across cohorts, and to gain insight into functional differences between gut microbiomes across different samples, we use the Enterotype HDP model introduced in Section IV to infer the functional basis of each of the three Enterotypes identified in [18]. Specifically, we illustrate the inferred functional basis learned from the corresponding Enterotypes in Tables IX–XI.
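The relevance ranking between taxa and latent topics used for Table VIII can be illustrated with a short Python sketch. Note that it uses pointwise mutual information, log p(w | k) / p(w), as a stand-in relevance score; this is an assumption made for illustration and is not necessarily identical to the MI score of (4). The function names, the toy distributions, and the uniform topic prior are hypothetical.

```python
import math


def pmi_scores(phi, topic_prior):
    """Pointwise mutual information between each topic k and element w.

    phi[k][w]:      probability of element (taxon) w under topic k.
    topic_prior[k]: overall probability of topic k.
    """
    num_topics, vocab_size = len(phi), len(phi[0])
    # Marginal probability of each element across all topics.
    p_w = [sum(topic_prior[k] * phi[k][w] for k in range(num_topics))
           for w in range(vocab_size)]
    return [
        [math.log(phi[k][w] / p_w[w]) if phi[k][w] > 0 else float("-inf")
         for w in range(vocab_size)]
        for k in range(num_topics)
    ]


def rank_topics_for_taxon(scores, w):
    """Topic ids sorted from most to least relevant for taxon w."""
    return sorted(range(len(scores)), key=lambda k: scores[k][w], reverse=True)


# Toy model: two topics over two taxa (values are illustrative only).
phi = [[0.7, 0.3],
       [0.2, 0.8]]
scores = pmi_scores(phi, topic_prior=[0.5, 0.5])
print(rank_topics_for_taxon(scores, 0))
print(rank_topics_for_taxon(scores, 1))
```

A positive score means the taxon is more probable under the topic than on average, so sorting by the score surfaces the topics in which a taxon is over-represented rather than merely frequent.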

TABLE X
ILLUSTRATION OF THE FUNCTIONAL BASIS OF GENE OGS IN ENTEROTYPE 2 OF SANGER SEQUENCED SAMPLES [18]

TABLE XI
ILLUSTRATION OF THE FUNCTIONAL BASIS OF GENE OGS IN ENTEROTYPE 3 OF SANGER SEQUENCED SAMPLES [18]

VI. CONCLUSIONS

In this paper, we have shown that the configuration of functional groups encoded in the gene-expression data of meta-genome samples can be inferred by applying probabilistic topic modeling to functional elements derived from the non-redundant CDs catalogue (including taxonomic levels, indicators of gene orthologous groups, and KEGG pathway mappings). When used to study microbial samples, the proposed model considers each sample as a “document,” which has a mixture of “latent topics,” while each latent topic is a weighted mixture of functional elements that bear an analogy with “words.” We also introduce the extended Enterotype-HDP model to infer the functional basis of detected enterotypes. The latent topics estimated from human gut microbial samples are supported by recent findings in fecal microbiota studies, which demonstrates the effectiveness of the proposed method.

REFERENCES

[1] T. Aso and K. Eguchi, “Predicting protein-protein relationships from literature using latent topics,” Genome Inf., vol. 23, pp. 3–12, 2009.

[2] D. Blei, A. Ng, and M. Jordan, “Latent Dirichlet allocation,” J. Mach. Learn. Res., vol. 3, pp. 993–1022, 2003.

[3] D. C. Richter and D. H. Huson, “Functional metagenome analysis using gene ontology (MEGAN 4),” in SIG M3 Meeting (ISMB 2009), Stockholm, Sweden.

[4] L. Fei-Fei and P. Perona, “A Bayesian hierarchical model for learning natural scene categories,” in Proc. CVPR, 2005.

[5] P. Flaherty, G. Giaever, J. Kumm, M. I. Jordan, and A. P. Arkin, “A latent variable model for chemogenomic profiling,” Bioinformatics, pp. 3286–3293, 2005.

[6] G. W. Tannock, “The bowel microbiota and inflammatory bowel diseases,” Int. J. Inflammation, pp. 9–9, 2010, Article ID 954051.

[7] G. K. Gerber, R. D. Dowell, T. S. Jaakkola, and D. K. Gifford, “Hierarchical Dirichlet process-based models for discovery of cross-species mammalian gene expression,” Tech. Rep., 2007.

[8] T. L. Griffiths and M. Steyvers, “Finding scientific topics,” Proc. Natl. Acad. Sci. USA, vol. 101, pp. 5228–5235, 2004.

[9] H. Sokol et al., “Specificities of the fecal microbiota in inflammatory bowel disease,” Inflammatory Bowel Diseases, vol. 12, no. 2, pp. 106–111, Feb. 2006.

[10] D. Huson, D. Richter, S. Mitra, A. Auch, and S. Schuster, “Methods for comparative metagenomics,” BMC Bioinformatics, vol. 10, no. Suppl. 1, 2009.

[11] C. Manichanh et al., “Reduced diversity of faecal microbiota in Crohn’s disease revealed by a meta-genomic approach,” Gut, vol. 55, no. 2, pp. 205–211, Feb. 2006.

[12] J. Qin et al., “A human gut microbial gene catalogue established by metagenomic sequencing,” Nature, vol. 464, no. 7285, pp. 59–65, Mar. 2010.

[13] E. B. Sudderth, A. Torralba, W. T. Freeman, and A. S. Willsky, “Learning hierarchical models of scenes, objects, and parts,” in Proc. Int. Conf. Comput. Vis., Beijing, China, 2005.

[14] A. Walker et al., “High-throughput clone library analysis of the mucosa-associated microbiota reveals dysbiosis and differences between inflamed and non-inflamed regions of the intestine in inflammatory bowel disease,” BMC Microbiol., vol. 11, pp. 7–7, 2011.

[15] B. Zheng, D. C. Mclean, and X. Lu, “Identifying biological concepts from a protein-related corpus with a probabilistic topic model,” BMC Bioinformatics, vol. 7, 2006.

[16] J. Muller, D. Szklarczyk, P. Julien, I. Letunic, A. Roth, M. Kuhn, S. Powell, C. von Mering, T. Doerks, L. J. Jensen, and P. Bork, “eggNOG v2.0: Extending the evolutionary genealogy of genes with enhanced non-supervised orthologous groups, species and functional annotations,” Nucleic Acids Res., vol. 38, pp. D190–D195, 2010.

[17] J. Sethuraman, “A constructive definition of Dirichlet priors,” Statistica Sinica, vol. 4, no. 2, pp. 639–650, 1994.

[18] M. Arumugam et al., “Enterotypes of the human gut microbiome,” Nature, vol. 472, no. 7343, Apr. 2011.

[19] Y. Teh, M. Jordan, M. Beal, and D. Blei, “Hierarchical Dirichlet processes,” J. Amer. Stat. Assoc., vol. 101, no. 476, pp. 1566–1581, 2006.

Xin Chen received both B.Eng. and M.Eng. degrees in Electronic Engineering and Information Science from the University of Science and Technology of China, Hefei, in 2003 and 2007, respectively. He is currently a Ph.D. candidate at the College of Information Science and Technology, Drexel University, Philadelphia, PA.

His research interests include text and web mining, and information retrieval.

Tingting He received her B.Sc. and M.Sc. degrees in computer science from Wuhan University, China, in 1985 and 1988, respectively, and her Ph.D. degree in Chinese Language Information Processing in 2003.

She is currently a full Professor and Chair of the Department of Computer Science, Central China Normal University, Wuhan. Her research interests are information retrieval, Chinese language understanding and processing, text mining, and bioinformatics.

Xiaohua (Tony) Hu received his B.Sc. degree (software) from Wuhan University, China, in 1985, the M.Eng. degree in Computer Engineering from the Institute of Computing Technology, Chinese Academy of Science, in 1988, the M.Sc. degree in computer science from Simon Fraser University, Canada, in 1992, and the Ph.D. degree in computer science from the University of Regina, Canada, in 1995.

He is currently a full Professor and the founding director of the data mining and bioinformatics lab at the College of Information Science and Technology, Drexel University, Philadelphia, PA. He is now also serving as the IEEE Computer Society Bioinformatics and Biomedicine Steering Committee Chair and the IEEE Computational Intelligence Society Granular Computing Technical Committee Chair (2007–2009). He is a scientist, teacher and entrepreneur. He joined Drexel University in 2002, and founded the International Journal of Data Mining and Bioinformatics in 2006 and the International Journal of Granular Computing, Rough Sets and Intelligent Systems in 2008. Earlier, he worked as a Research Scientist in world-leading R&D centers such as Nortel Research Center, GTE labs, and HP Labs. In 2001, he founded DMW Software in Silicon Valley, CA. His research ideas have been integrated into many commercial products and applications. His current research interests are in biomedical literature data mining, bioinformatics, text mining, semantic web mining and reasoning, rough set theory and application, information extraction, and information retrieval. He has published more than 200 peer-reviewed research papers in various journals, conferences and books such as various IEEE/ACM Transactions (IEEE/ACM TCBB, IEEE TFS, IEEE TDKE, IEEE TITB, IEEE Computer), JIS, KAIS, CI, DKE, IJBRA, SIG KDD, IEEE ICDM, IEEE ICDE, SIGIR, ACM CIKM, IEEE BIBE, IEEE CICBC, etc., and co-edited 9 books/proceedings.

Dr. Hu has received a few prestigious awards including the 2005 National Science Foundation (NSF) Career award (the most prestigious award from NSF to young faculty in the United States), the best paper award at the 2007 International Conference on Artificial Intelligence, the best paper award at the 2004 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology, the 2007 IEEE Bioinformatics and Bioengineering Outstanding Contribution Award, the 2010 IEEE Granular Computing Outstanding Contribution Award, and the 2001 IEEE Data Mining Outstanding Service Award. His research projects are funded by the National Science Foundation (NSF), U.S. Department of Education, and the PA Department of Health, and he has obtained more than US $6.3 million research grants in the past 4 years as PI or Co-PI.

Yanhong Zhou received his B.S. (1986) and Ph.D. (1997) degrees in mechanical engineering from Huazhong University of Science and Technology (HUST), China.

He is a professor in the School of Life Science and Technology, HUST. From 1999 to 2001, he did two years of postdoctoral research in the field of biomechanics and ergonomics at Harvard University, and one year of visiting research in the area of bioinformatics at University of Missouri-Columbia. He has completed more than 30 R&D projects as PI or co-PI, and authored or co-authored 2 books and more than 100 papers. Currently, he is engaged in teaching and research in the field of bioinformatics.


Yuan An received a Bachelor’s degree in automation and a Master’s degree in systems engineering from Tsinghua University, China, in 1987 and 1989, respectively. He also received a Master’s degree in computer science from Dalhousie University, Canada, and a Ph.D. degree in computer science from the University of Toronto, Canada, in 2007.

He has been Assistant Professor in the College of Information Science and Technology at Drexel University, Philadelphia, PA, since then. He has research interests in semantic technologies for information integration, information modeling including ontology design, schema mapping, health informatics, and the semantic web. He designed and developed the MAPONTO tool for creating semantic mappings between heterogeneous data representations. Moreover, he investigated a new semantic method for discovering direct mappings between database schemas. The method uniquely leverages the schema semantics expressed in terms of ontologies, and greatly improves the performance of schema mapping systems. For data integration on the Web, he has also studied the problem of automatically understanding forms and developed machine learning methods for mapping forms to database schemas and ontologies.

Xindong Wu is a Professor of Computer Science at the University of Vermont (USA), a Yangtze River Scholar in the School of Computer Science and Information Engineering at the Hefei University of Technology (China), and a Fellow of the IEEE. He received his Bachelor’s and Master’s degrees in Computer Science from the Hefei University of Technology, China, and his Ph.D. degree in Artificial Intelligence from the University of Edinburgh, Britain. His research interests include data mining, knowledge-based systems, and Web information exploration.

Dr. Wu is the Steering Committee Chair of the IEEE International Conference on Data Mining (ICDM), the Editor-in-Chief of Knowledge and Information Systems (KAIS, by Springer), and a Series Editor of the Springer Book Series on Advanced Information and Knowledge Processing (AI&KP). He was the Editor-in-Chief of the IEEE Transactions on Knowledge and Data Engineering (TKDE, by the IEEE Computer Society) between 2005 and 2008. He served as Program Committee Chair/Co-Chair for ICDM ’03 (the 2003 IEEE International Conference on Data Mining), KDD-07 (the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining), and CIKM 2010 (the 19th ACM Conference on Information and Knowledge Management).
