
LEARNING STRUCTURE FROM THE GROUND UP: HIERARCHICAL REPRESENTATION LEARNING BY CHUNKING

Shuchen Wu
Max Planck Institute for Biological Cybernetics

[email protected]

Noémi Éltető
Max Planck Institute for Biological Cybernetics

Ishita Dasgupta
DeepMind, New York City

Eric Schulz
Max Planck Institute for Biological Cybernetics

ABSTRACT

From learning to play the piano to speaking a new language, reusing and recombining previously acquired representations enables us to master complex skills and easily adapt to new environments. Inspired by the Gestalt principle of grouping by proximity and theories of chunking in cognitive science, we propose a hierarchical chunking model (HCM). HCM learns representations from non-i.i.d. sequential data from the ground up by first discovering the minimal atomic sequential units as chunks. As learning progresses, a hierarchy of chunk representations is acquired by chunking previously learned representations into more complex representations guided by sequential dependence. We provide learning guarantees on an idealized version of HCM, and demonstrate that HCM learns meaningful and interpretable representations in visual, temporal, and visual-temporal domains as well as language data. Furthermore, the interpretability of the learned chunks enables flexible transfer between environments that share partial representational structure. Taken together, our results show how cognitive science in general and theories of chunking in particular can inform novel and more interpretable approaches to representation learning.

1 Introduction

The last decade has witnessed a meteoric rise in the abilities of artificial systems, in particular deep learning models [24]. From beating the world champion of Go [49] to predicting the structure of the human proteome [19], deep learning models have accomplished impressive end-to-end learning achievements. Yet, researchers were quick to point out shortcomings [23, 26]. Two such shortcomings concern the hierarchical structure and interpretability of the representations learned by neural network architectures. Specifically, deep learning models suffer from a lack of interpretability because ANNs are sub-symbolic, nested, non-linear structures, which means that the way predictions are generated can be difficult to understand [8, 38, 40]. Furthermore, deep learning models can also struggle to learn hierarchical representations altogether [12, 21]. To address these shortcomings, it has been suggested that machine learning researchers could seek inspiration from cognitive science and construct models that resemble the hierarchical and interpretable representations observed in human learners [5, 22].

We take inspiration from the cognitive phenomenon of chunking and the Gestalt principle of grouping by proximity. To get an intuition for chunking, try to read through the following sequence of letters: “DFJKJKJKDFDFJKJKDFDF”. Upon reaching the end, if you were asked to repeat the letters from memory, you might recall fragments of the sequence such as “DF” or “JK”. By parsing the sequence of letters only once, one detects frequently occurring patterns within it and memorizes them together as units, i.e. chunks. Chunking has been observed in a range of sensory modalities. We perceive units and structures when learning a language [28, 36], in action sequences [34, 39], and in visual structures [4, 9, 17]. The extracted chunks have been argued to facilitate memory compression [15, 30], compositional generalization [42], predictive processing [20, 31], and communication [43].

In the current work, we propose a hierarchical chunking model (HCM) that learns chunks from non-i.i.d. sequential data with a hierarchical structure. HCM starts out learning an atomic set of chunks to explain the sequence and gradually combines them into increasingly larger and more complex chunks, thereby learning interpretable hierarchical structures. The output of the model is a dynamical graph that traces the evolving representation. The resulting representations are therefore easy to interpret and flexibly reusable (e.g. we can choose to re-use specific parts). We derive learning guarantees on an idealized generative model and demonstrate convergence on sequential data coming from this generative model. Furthermore, we show that our model can benefit from having learned previous structures with shared components, leading to flexible transfer and generalization. Finally, we demonstrate in several small-scale experiments that HCM returns interpretable representations in discrete sequential, visual, and language domains. Taken together, our results show how cognitive science in general and theories of chunking and grouping by proximity in particular can inform novel and more interpretable approaches to representation learning.

Figure 1: Schematic of the Hierarchical Chunking Model. a) Example of a hierarchical model generating training sequences. b) Intermediate representation of learned marginal and transition matrices. The most frequent transition that violates the independence testing criterion is marked in red and can be turned into a new chunk. c) HCM combines the two chunks cL and cR to form a new chunk. d) As HCM observes longer sequences, it gradually learns a hierarchical representation of chunks. e) HCM arrives at the final chunk hierarchy, which is isomorphic to the generative hierarchy.

2 Hierarchical Chunking Model

A chunk is defined as a unit created by concatenating several atomic sequential units together. Take the training sequence shown in Figure 1a as an example. The sequence consists of discrete, size-one atomic units from an atomic alphabet A0, in this case A0 = {0, 1, 2}. A chunk c is made up of a combination of one or more atomic units in A0 \ {0}, where 0 denotes an empty observation in the sequence.

Intuitively, if the training sequence contains inherent hierarchical structure, then there are patterns which span several sequential units and share this internal structure, like repeated melodies and sub-melodies in a piece of music. If a pattern occurs in the sequence, observations between sequential units within the pattern will be correlated. In this case, chunking patterns within a sequence as units simplifies perceptual processing in the sense that the sequence can be perceived one chunk at a time, instead of one sequential unit at a time. Furthermore, the acquired “primary” chunks serve as building blocks to discover larger chunks that are embedded within the intricate hierarchy of the sequential structure.

More formally, HCM acquires a belief set B of chunks and uses chunks from the belief set to parse the sequence one chunk at a time. HCM assumes that a sequence is generated from samples of independently occurring chunks, each with probability P_B(c) evaluated on the belief set B. The probability of observing a sequence of parsed chunks c1, c2, ..., cN can therefore be written as P(c1, c2, ..., cN) = ∏_{ci ∈ B} P_B(ci). Chunks as perceiving units serve as independent factors that disentangle observations in the sequence.

Figure 2: a) Example graph generated from the hierarchical generative model with a depth of d = 3. b) Learning performance of HCM and RNN with increasing training length and for different depths. Performance was averaged over 30 randomly generated graphs.
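For illustration, under this assumption the probability of a parsed sequence is just the product of the belief-set probabilities of the chunks it was parsed into. A minimal Python sketch (the chunk probabilities below are made-up numbers, and the product runs over the parsed chunks):

# Hypothetical chunk probabilities P_B(c) over a belief set B (illustrative values only).
P_B = {"A": 0.5, "AB": 0.3, "ABC": 0.2}

def sequence_probability(parsed_chunks, chunk_probs):
    """Probability of a parsed chunk sequence under the independence assumption."""
    prob = 1.0
    for chunk in parsed_chunks:
        prob *= chunk_probs[chunk]
    return prob

print(sequence_probability(["A", "AB", "ABC", "A"], P_B))  # 0.5 * 0.3 * 0.2 * 0.5 = 0.015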

From the beginning of the sequence, HCM ‘perceives’ sequential units one chunk at a time; in other words, the training sequence is parsed by HCM into chunks. At every parsing step, the longest chunk in the belief set that is consistent with the upcoming sequence is chosen to explain the upcoming sequential observations. The end of the previous chunk parse initiates the next parse.

Observing a hierarchically structured sequence as illustrated in Figure 1a, HCM gradually builds up a hierarchy of chunks starting from an empty belief set. It first identifies a set of atomic chunks to construct its initial belief set B. Initially, these will be chunks of length one, yielding one-by-one processing of the primitive elements.

For a given belief set B, HCM keeps track of the marginal parsing frequency M(ci) of each chunk ci in B, a vector of size |B|, and the transition frequency T between chunk ci and a following chunk cj, as illustrated in Figure 1b. Entries in M and T are used to test the hypothesis that consecutive chunk parses have a correlated consecutive occurrence within the sequence. If two chunks cL and cR have a significant adjacency dependence based on their entries in M and T, they are chunked together to become cL ⊕ cR, which augments the belief set B by one. One example of chunk merging is shown in Figure 1c.
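To make the bookkeeping concrete, the sketch below shows one way to parse a sequence by greedy longest-chunk matching while counting the marginals M and transitions T. The function and variable names are illustrative and not the authors' implementation:

from collections import defaultdict

def parse_and_count(seq, belief_set):
    """Greedily parse `seq` with the longest matching chunk in `belief_set`,
    counting chunk marginals M and adjacent-chunk transitions T."""
    M = defaultdict(int)                        # M[c]: how often chunk c was parsed
    T = defaultdict(lambda: defaultdict(int))   # T[cL][cR]: cL immediately followed by cR
    chunks_by_length = sorted(belief_set, key=len, reverse=True)
    parsed, i = [], 0
    while i < len(seq):
        # choose the longest chunk consistent with the upcoming observations
        chunk = next(c for c in chunks_by_length if seq.startswith(c, i))
        M[chunk] += 1
        if parsed:
            T[parsed[-1]][chunk] += 1
        parsed.append(chunk)
        i += len(chunk)
    return parsed, M, T

# Example: a belief set containing atomic letters plus two learned chunks.
parsed, M, T = parse_and_count("DFJKJKJKDFDF", {"D", "F", "J", "K", "DF", "JK"})
print(parsed)  # ['DF', 'JK', 'JK', 'JK', 'DF', 'DF']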

There are two different versions of HCM. The Rational Chunk Learning HCM produces chunks in an idealized way, which we use to study learning guarantees. The online version of HCM is an approximation to the rational HCM that can be adapted to different environments and processes sequences online. Pseudo-code for both algorithms can be found in Algorithm 1 and 2 in the SI.

Rational Chunk Learning: HCM as an Ideal Observer The model is initiated with an empty belief set, and it first finds a minimally complete belief set after the first sequence parse. In each iteration, the entire sequence is parsed to evaluate M and T, which are used to find consecutive chunk parses in the existing belief set that fulfill the dependence testing criterion. From these dependent chunk pairs, the pair with the largest estimated joint probability is combined into a new chunk. The new chunk enlarges the belief set by one. The chunks in the new belief set are used to parse the sequence in the next iteration. This process repeats until all of the chunks in the belief set pass an independence testing criterion.

Online Chunk Learning The online chunk learning HCM approximates the ideal observer HCM by learning new chunks while the training sequence is processed online. To encourage adaptation to new environmental statistics, entries in M and T can be subject to memory decay. We use the ideal observer model to demonstrate learning guarantees, but use the online model to learn representations in realistic and more complex set-ups.

2.1 HCM Builds Representations from the Ground Up

As HCM learns from a sequence, it starts with no representation and gradually builds up representations which can be readily interpreted at any learning stage. The representation can be described by a chunk hierarchy graph G with the vertex set being the chunks and edges pointing from chunk constituents to the constructed chunks. Shown in Figure 1d is the gradual build-up of one such chunk hierarchy graph as the model learns from a training sequence coming from the generative hierarchy displayed in Figure 1a. At t = 10, HCM has learned only the atomic chunks; at t = 60, HCM has already constructed two additional chunks; at t = 100, two more chunks are constructed based on the previously acquired chunks; and at t = 150, HCM arrives at the final chunk hierarchy.


3 Hierarchical Generative Model

To study the learning guarantees of HCM, a generative model with a random hierarchy of chunks is constructed. The relation between chunks and their constituent components in the generative model is described by a chunk hierarchy graph Gd with vertex set V_{Ad} and edge set E_{Ad}. One example is illustrated in Figure 1a. Ad is the set of chunks used to construct the sequence. The depth d specifies the number of chunks created in the generative process.

Starting with an initial set of atomic chunks A0, at the i-th iteration two chunks cL, cR are randomly chosen from the current set of chunks Ai and concatenated into a new chunk cL ⊕ cR, augmenting Ai by one chunk to Ai+1. Meanwhile, an independent occurrence probability is assigned to each chunk, under the condition that the occurrence probability of every newly created chunk ci, evaluated on the support set Ai, carries the largest probability mass.

To generate a sequence from a constructed hierarchical graph, chunks are sampled independently from the set of all chunks and appended after each other, under the constraint that no two chunks with a child chunk are sampled consecutively.
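A minimal sketch of such a generative process is given below. The probability assignment is simplified to uniform sampling over the current chunk set, and duplicates are not handled, so it only illustrates the construction and the adjacency constraint, not the exact probability condition stated above:

import random

def build_hierarchy(atomic, depth, seed=0):
    """Randomly build a chunk hierarchy: each step concatenates two existing chunks."""
    rng = random.Random(seed)
    chunks = list(atomic)
    has_child = set()                      # chunks already used as a constituent
    for _ in range(depth):
        cL, cR = rng.choice(chunks), rng.choice(chunks)
        chunks.append(cL + cR)             # the new chunk cL ⊕ cR
        has_child.update([cL, cR])
    return chunks, has_child

def sample_sequence(chunks, has_child, n_samples, seed=1):
    """Sample chunks independently; two chunks that both have children never occur consecutively."""
    rng = random.Random(seed)
    seq, prev = [], None
    for _ in range(n_samples):
        c = rng.choice(chunks)
        while prev in has_child and c in has_child:
            c = rng.choice(chunks)         # resample to respect the adjacency constraint
        seq.append(c)
        prev = c
    return "".join(seq)

chunks, has_child = build_hierarchy(["1", "2", "3"], depth=3)
print(chunks)                              # e.g. atomic chunks plus three constructed chunks
print(sample_sequence(chunks, has_child, n_samples=20))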

3.1 Learning Guarantee

Theorem: As the length of the sequence approaches infinity, HCM learns a hierarchical chunking graph that is isomorphic to the generative hierarchical graph Gd.

Proof Sketch: We approach this proof by induction; further details can be found in the SI. Base step: The first step of the rational chunking algorithm is to find the minimally complete atomic set of chunks to form its initial belief set. This procedure guarantees that the learned graph at step 0 matches the generative graph at step 0. Additionally, the probability mass of the learning model at step 0 and of the generative model at step 0 are asymptotically the same as the sequence length approaches infinity. Induction hypothesis: Assume that the learned belief set Bi at step i contains the same chunks as the alphabet set Ai in the generative model. Then the chunk pair with the largest estimated joint occurrence probability that violates the independence test is picked and concatenated into a new chunk to extend the belief set; this chunk is the same chunk node created by the hierarchical generative model. End step: The chunk learning process stops once the independence criterion is passed. This is the case once the chunk learning algorithm has learned a belief set Bd = Ad.

3.2 Learning Convergence

HCM’s learning performance with increasing sequence length is evaluated in Figure 2. For this, HCM was trained on random graph hierarchies constructed from the hierarchical generative model over 3000 trials while also varying the depth d. Figure 2a displays an example of a random generative hierarchy with a depth of d = 3. We used the Kullback-Leibler divergence to evaluate the deviation of HCM’s learned representations from the generative hierarchical model. This was done by using HCM’s representation to produce “imagined” sequences, which were then compared to the ground-truth probability for each chunk in the generative model. Figure 2b shows the KL-divergence between the learned and ground-truth distribution for different depths d of the generative graphs. For each depth, the KL-divergence was evaluated on 30 random generative models with the sequence length increasing from 50 to 3000 in each model. Overall, the KL-divergence decreased as the length of the training sequence increased and converged for larger training sequences. This shows empirically that HCM learns representations bearing increasingly close resemblance to the generative model.
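The comparison can be pictured as a KL-divergence between the chunk distribution implied by the learned representation and the generative one. The sketch below uses hypothetical probability tables and glosses over how the “imagined” sequences are produced:

import math

def kl_divergence(p_true, p_model, eps=1e-12):
    """KL(p_true || p_model) over the support of the generative model."""
    return sum(p * math.log(p / max(p_model.get(c, 0.0), eps))
               for c, p in p_true.items() if p > 0)

# Hypothetical ground-truth chunk probabilities and model-estimated probabilities.
p_true = {"1": 0.5, "2": 0.3, "21": 0.2}
p_model = {"1": 0.45, "2": 0.35, "21": 0.20}
print(kl_divergence(p_true, p_model))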

We used the same sequences to train a 3-layer recurrent neural network (RNN) with 40 hidden units and the same method to measure the KL-divergence, using “imagined” sequences produced by the RNN evaluated on the support set of the generative hierarchy. As the length of the training sequence increased, the KL-divergence also converged, but at a much slower rate. Note that HCM’s competitive advantage increased further as the depth of the generative hierarchy increased. Thus, HCM learns quickly about the hierarchical generative model and does so faster than a neural network.

3.3 HCM Permits Transfer Between Environments with Overlapping Structure

After training on a sequence, HCM acquires an interpretable representation. Knowing what the model has learned enables us to directly know what type of hierarchical environment would facilitate or interfere with the learned representations. This is impossible to do with standard neural networks because we generally would not know what representation they have learned, and therefore do not have fine-grained control over retaining or removing parts of it.

More formally, two HCM models might have acquired different hierarchical chunking graphs Gi and Gj from their past experience. These might lie on the graph construction path (G0, G1, G2, ..., Gd). The HCM with a chunk hierarchy graph ‘closer’ to the ground-truth chunk hierarchy Gd on the path takes fewer iterations to arrive at Gd. This also applies when the starting chunk hierarchies are not along the graph construction path but only show partial overlap.

Figure 3: a) Example of a representation learned by an HCM. b) An environment facilitative to the learned representation. Gray shadows mark the chunks that can be directly transferred. c) Average performance over the first 500 trials after the environment changes to the facilitative environment. d) Interfering environment. Gray shadows mark chunks that the learned representation needs to acquire from scratch. e) Average performance over the first 500 trials after the environment changes to the interfering environment.

Similarly, the chunk hierarchy Gi learned by an HCM might facilitate its performance in a new environment where Gi lies along the graph construction path to the true Gd, i.e. there is partial overlap between the chunk hierarchies. We demonstrate this in Figure 3. Here, HCM was trained on a random hierarchical generative model and acquired the representation shown in Figure 3a. Knowing this representation, we also know that an environment with a hierarchical generative graph as shown in Figure 3b allows for positive transfer, i.e. an HCM that already contains the chunk hierarchy of Figure 3a would learn faster than a naive one by transferring its previously learned chunks. As shown in Figure 3c, the trained HCM learns faster than a naive HCM which had to learn the structure from scratch.

Vice versa, we might have a situation in which transfer is detrimental. For example, there is no overlap between the chunk hierarchy learned by the HCM in Figure 3a and the graph shown in Figure 3d. The shaded chunks need to be learned anew by the HCM trained on Figure 3a, and the previously acquired chunks might mislead HCM, causing it to adapt to the new environment more slowly. As a result, the performance of the pre-trained HCM suffers more from an interfering environment than that of a naive HCM (Figure 3e). This is similar to catastrophic interference in neural networks [47], where performance on one task interferes with others. The advantage of the HCM is that, with visibility into the learned chunks, we have a better a priori sense of whether transfer will be facilitative or interfering. This means that we can examine whether to re-use the previously learned chunks, or to start from scratch with a naive model.

4 Generalizing to Visual Temporal Chunks via the Principle of Proximal Grouping

Humans are not only able to identify sequential chunks but also to find structure in visual-temporal stimuli and group visual points as a whole. For example, despite the fact that our retina receives pixel-wise visual inputs that vary across temporal slices, we are able to perceive complex moving objects. The Gestalt principle of grouping by proximity states that objects that are close to one another appear to form groups [52]. This principle has been argued to play a key role in human perceptual grouping [6] and to benefit working memory [37] and the reduction of visual complexity [7]. Indeed, in humans and other animals, learning of adjacent relationships prevails over non-adjacent ones [25]. Therefore, the adjacent dependency structure can be seen as the primary driver of chunking in visual-temporal domains [14]. To emulate this ability of chunking via proximal grouping in visual-temporal perception, we extend HCM to also learn visual-temporal chunks.

Visual-temporal chunks not only subsume a temporal length but also a varying set of visual features in each temporal slice. One can imagine a visual-temporal chunk as having a 3D shape: the first two dimensions, height and width, are the visual part of the chunk, and the length of the object is the temporal part, made of stacked visual-temporal pixels. Within each temporal slice are the visual features identified by the chunk. Since nearby points are more likely to belong to the same chunk, this assumption implements the principle of grouping by proximity in the visual-temporal domain. As the model iterates through the data across temporal slices, the chunk that attains the biggest visual-temporal volume is chosen to explain parts of the visual-temporal observations. Multiple visual-temporal chunks can be identified to occur simultaneously. Starting at the visual-temporal time point marked by the previous chunk, chunks are identified and stored in M. The transition matrix T is modified to account for the temporal lag between adjacent chunk pairs and records the frequency with which one chunk transitions into another with a given time lag. A hypothesis test is conducted every time a pair of adjacent chunks is identified; chunk pairs that violate the independence test are grouped together to augment the belief set, which is then used to parse future sequences.
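One concrete, simplified way to picture such a chunk is as a set of (t, x, y) voxels, with adjacency between two chunks checked by looking for touching voxels in space and time. The representation and thresholds below are illustrative assumptions, not the authors' data structure:

# A visual-temporal chunk as a set of (t, x, y) voxels relative to the chunk's origin.
tentacle_chunk = {(0, 2, 3), (1, 2, 4), (2, 3, 4)}   # hypothetical moving feature

def are_adjacent(chunk_a, chunk_b, max_lag=1):
    """True if some voxel of chunk_a touches some voxel of chunk_b in space-time."""
    for (ta, xa, ya) in chunk_a:
        for (tb, xb, yb) in chunk_b:
            if abs(ta - tb) <= max_lag and abs(xa - xb) <= 1 and abs(ya - yb) <= 1:
                return True
    return False

bubble_chunk = {(3, 3, 5)}
print(are_adjacent(tentacle_chunk, bubble_chunk))    # True: (2, 3, 4) touches (3, 3, 5)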


Figure 4: a) A designed visual hierarchical model where elementary components compose more complex images. b) Initial, intermediate, and complex chunks learned by HCM trained on sequences of images generated by the visual hierarchical model.

Figure 5: a) A GIF of a moving squid used as a sequence to train HCM. b) Examples of temporal chunks learned by HCM. c) Examples of visual chunks learned by HCM.

4.1 Learning Part-Whole Relationship Between Visual Components

We let HCM learn chunks in a visual domain by learning from a sequence of images. To this end, the hierarchical generative model in the pixel-wise image domain shown in Figure 4a was constructed to test HCM's visual chunking ability. Specifically, a set of elementary visual units in the lowest level of the hierarchy is combined to construct intermediate and more complex visual units higher up in the hierarchy. An empty image is included to denote pauses. All of the constructed elements in the hierarchy occurred independently according to a draw from a flat Dirichlet distribution. To generate the sequence used to train HCM, images in the hierarchy were sampled from the generative distribution and appended to the end of the sequence. In this way, despite the visual correlations within each image described by the hierarchy, each image slice was temporally independent of the other slices.

HCM learns a hierarchy of visual chunks from a sequence of visually correlated but temporally independent images sampled from the visual hierarchical model. Shown in Figure 4b are the chunk representations learned by HCM at different stages. Having no knowledge about the image parts before starting to learn, HCM acquires the individual pixels as chunks to explain the observations. As HCM proceeds with learning, visual correlations among the pixels are discovered and larger chunks are formed. As the number of observations increases, the model learns more sophisticated yet still interpretable chunks.

4.2 Learning Visual-Temporal Movement Hierarchies

Instead of seeing one image after another sampled i.i.d., real-world experience contains correlations in both the visual and the temporal dimension. From observing object movements across space and time, the visual system learns structure from correlated visual and temporal observations, decomposes motion structure, and groups moving objects together as a whole [2]. To emulate this type of environment as a learning task, an animated GIF of a squid swimming in the sea, shown in Figure 5a, was used as a visual-temporal sequence to train HCM. As learning advances, HCM learns chunks spanning both the visual and the temporal domain. Examples of such visual-temporal chunks are shown in Figure 5b and c. There are visual-temporal chunks that mark the movements of a tentacle and the rising-up motion of a bubble. Additionally, there are visual chunks that resemble a part of the squid's eye and face.

Table 1: Example Chunks Learned from The Hunger Games

Simple Chunks: 'an', 'in ', 'be', 'at', 'me', 'le', 'a ', 'ar', 're', 'and ', 've', 'ing ', 'on', 'st', 'se', 'to ', 'of ', 'he', 'my ', 'te', 'pe', 'ou', 'we', 'ad', 'de', 'li', 'the', 'ce', 'is', 'as', 'il', 'ch', 'al', 'no', 'she', 'ing', 'am', 'ack', 'we', 'raw', 'on the', 'day', 'ear', 'oug', 'bea', 'tree', 'sin', 'that', 'log', 'ters', 'wood', 'now', 'was', 'even', 'leven', 'ater', 'ever', 'but', 'ith', 'ity'

Intermediate Chunks: 'this', 'pas', 'eak', 'if', 'sing', 'bed', 'men', 'raw', 'day', 'in the', 'link', 'for', 'one', 'the wood', 'bell', 'other', '...', 'lose to', 'hunger', 'mother', 'death', 'would', 'district 12', 'try', 'under', 'prim', 'beg', 'then', 'into', 'gale', 'read', 'come', 'he want', 'leave', 'where', 'older', 'says', 'might', 'dont', 'add', 'know', 'man who', 'of the'

Complex Chunks: 'out of', 'it out', 'our district', 'capitol', 'reaping', 'fair', 'berries', 'the last', 'fish', 'again', 'as well', 'the square', 'scomers', 'fully', ', but the', ' in the school', 'at the', 'you can', 'tribute', 'to remember', 'it is not just', 'I can', 'peace', 'feel', 'you have to', 'I know', 'bother', 'in our', 'kill', 'cause of the', 'the pig', 'to the baker', 'I have', 'what was'

5 Learning Chunks from Realistic Language Data

So far we have only demonstrated HCM learning chunks on simple sequences. One step further is to run HCM on real-world datasets that contain complex hierarchical structures. One immediate testbed containing such structures is natural language. To this end, we trained HCM to learn chunks on the first book of The Hunger Games.

HCM starts with learning individual English letters and punctuation. After having seen 10,000 characters of the book, HCM acquires frequently used chunks that resemble common English pre- and suffixes such as “ity”, “ing”, “re”, and “ith”; definite and indefinite articles such as “a”, “an”, “the”; conjunctions such as “and”, “but”, “that”, and “as”; prepositions such as “of”, “to”, “in”, “at”; variants of “is”, “are”, “was”; and pronouns such as “he”, “my”, “me”, “we”, “she”.

After having seen 100,000 characters of the book, HCM learns intermediate chunks that include various commonly used verbs such as “come”, “sing”, “leave”, “try”; nouns such as “bed”, “day”, “mother”; and common word combinations such as “in the”, “lose to”, “the wood”. Additionally, HCM has acquired words specific to The Hunger Games such as “prim”, “death”, “hunger”, and “district 12”.

After having seen 300,000 characters of the book, HCM learns more complex phrases from the previously learned words, such as the commonly used phrases “it is not just”, “in the school”, “our district”, and “cause of the”.
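For intuition, the toy below builds character-level chunks from raw text by repeatedly merging the most frequent adjacent pair (akin to byte-pair encoding). HCM's actual merge decision uses the χ2 independence test described in the SI; raw frequency is used here only to keep the illustration short:

from collections import Counter

def learn_chunks(text, n_merges=4):
    """Toy character-level chunking: repeatedly merge the most frequent adjacent pair."""
    tokens = list(text)
    chunks = set(tokens)
    for _ in range(n_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        (left, right), _count = pairs.most_common(1)[0]
        merged = left + right
        chunks.add(merged)
        # re-parse the token stream with the newly created chunk
        new_tokens, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                new_tokens.append(merged)
                i += 2
            else:
                new_tokens.append(tokens[i])
                i += 1
        tokens = new_tokens
    return chunks

print(learn_chunks("the hunger games the hunger games"))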

6 Related Work

Our model is a successor to decades of work on different approaches to chunk learning. In cognitive science, researchers have put forward models that produce qualitatively similar chunks as humans learning natural or artificial languages [35, 45], as well as models that can extract chunks from visual inputs [27]. HCM can be viewed as a principled version of these earlier cognitive models because it uses hypothesis testing in its decision to chunk elements together instead of using chunking heuristics. Therefore, HCM comes with learning guarantees for a fairly general class of hierarchically structured environments. Additionally, HCM extends past cognitive models to the higher-dimensional domain of visual-temporal chunking.

The primary approach to chunk learning in the language domain has been n-gram models, dating all the way back to [46]. An n-gram model learns the marginal probabilities of all chunks of size n. A major limitation of this approach is that the number of chunks grows exponentially as a function of chunk length. For instance, with an alphabet of 26 letters, the number of possible 5-letter words already exceeds 10 million. Due to the large vocabulary size, building word-level n-gram models is virtually infeasible. To this end, a Bayesian non-parametric extension of n-gram models has been put forward [50]. A Bayesian non-parametric n-gram model builds up structure as evidence is accumulated. That is, it flexibly reduces to a 1-gram model if chunks are not present, or ‘opens up’ higher and higher n-gram levels depending on the size of the chunks. [50]'s model differs from HCM in several regards. Instead of using the chunks to ‘look forward’ and parse the sequence in large steps, it ‘looks back’ and predicts only one element conditioned on the context of the previous elements. Moreover, instead of storing an explicit bag of chunks, it represents a chunk distribution weighted by evidence. For prediction, it smooths over the evidence of all chunks that are consistent with the current context. A shortcoming of [50]'s model and n-gram models in general is that they do not leverage the concatenation process observed in humans; instead, chunks are built up one primitive element at a time.

Hierarchical hidden Markov models [11] were developed in a similar vein, extending hidden Markov models to be able to capture multi-scale sequential structure. These models can capture larger patterns than non-hierarchical versions while maintaining the computational tractability of simple Markov processes. However, they still lack the adaptive recombination and reuse of pre-existing components. Fragment grammars [32] address this by balancing the creation and re-use of chunks by Bayesian principles, while also preserving the symbolic interpretability. However, fragment grammars are intractable and their inference is exceptionally costly.

In the era of deep learning, both natural language processing and image processing have become increasingly dominated by neural networks, with one of their primary tasks being the extraction and prediction of chunks [33, 48, 53]. Commonly, these so-called sequence chunks are used as units of segmentation for other downstream tasks such as text translation. However, in these approaches the architecture needs to be pre-specified before training, whereas HCM requires no such specification. HCM can therefore be seen as an interpretable and transparent alternative to neural networks for some of these tasks. Natural language processing approaches often assume fixed ‘chunks’ at the level of words [1], which are then used as the input to large-scale tasks like machine translation [51]. The inherent issue with a fixed chunk size is that it does not account for structure at other scales. This has since been addressed with various approaches that account for ‘phrase’ structure across words [29] or for sub-word structure [3, 44], but these risk exponential growth. The advantage of a model like HCM is that it provides units across scales, allowing us to flexibly represent multi-scale structure.

A final related line of research attempts to learn explicit representations by inducing programs [22]. For example, in a recent approach to this challenge, the interpretable structures returned by program induction algorithms were combined with a deep neural network to learn meaningful representations from data [10]. Yet in these approaches the retrieved representations are highly dependent on the initial set of building blocks, which is usually specified by the experimenter. HCM, in comparison, is task-agnostic and requires no primitive program description.

7 Discussion

We have proposed the Hierarchical Chunking Model (HCM) as a cognitively inspired method for learning representations from the ground up. HCM starts out by learning atomic units from sequential data which it then chunks together, gradually building up a hierarchical representation that can be expressed as a dynamical and interpretable graph. We have provided learning guarantees for an idealized version of HCM, shown how HCM's interpretability can facilitate generalization, and demonstrated that HCM learns meaningful representations from visual, temporal, visual-temporal, and language data.

Although we have showcased HCM's abilities across a set of diverse experiments, some challenges remain. First of all, HCM is currently not computationally efficient: we had to run the online version, rather than the full rational model, for most of the presented tasks. There are several directions that could enhance HCM's computational efficiency. One is to stitch multiple chunks together in one decision. Another is to separate the chunk decision process (between all of the acquired chunks) from the process of parsing observations, allowing parallelized implementations. We believe that scaling HCM up will make it possible to learn from increasingly complex data sets beyond the ones we have applied here.

Secondly, all patterns of chunks in the current project came from adjacent events in the sequences. This feature was inspired by the Gestalt principle of grouping by proximity. Yet many patterns observed in natural data sets might exhibit non-adjacent dependencies in space or time. The adjacency assumption therefore limits the model's ability to detect such patterns. How to relax the adjacency assumption as a grouping criterion, to allow non-adjacent relationships to be chunked together, remains an open challenge. Furthermore, HCM currently assumes that there is a hierarchy of chunks which occur independently in the observational sequence. This is intended as a simplifying assumption for a first approach toward building a cognitively plausible hierarchical chunking model. Our approach can and should be extended to more sophisticated assumptions such as higher-order Markovian dependencies between chunks. Finally, HCM passes visual data “as is” and does not take into consideration any additional assumptions about visual inputs such as translational or rotational symmetries, which humans can detect when perceiving visual structures.


There are also several avenues for future investigation. One immediate step is to run HCM on other, more complex data sets such as musical scores, neural data, and large natural language corpora, to name but a few. Furthermore, we intend to not only run HCM on raw visual inputs directly but also to employ neural networks to compress inputs into a latent space and then train HCM on these latent dimensions [13]. Additionally, one could also use deep learning models to label parts of objects and then train HCM on the movements of these labelled parts [18]. Finally, we currently only test for independence when deciding on whether or not to chunk, although other statistical tests are conceivable. One general class of tests could assess whether or not chunks are causally related to each other, in an attempt to find the best causal structure to explain the sequential, observational data [16]. This would make HCM a useful model of causal representation learning [41].

References

[1] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. A neural probabilistic language model. The Journal of Machine Learning Research, 3:1137–1155, 2003.

[2] Johannes Bill, Hrag Pailian, Samuel J. Gershman, and Jan Drugowitsch. Hierarchical structure is employed by humans during visual motion perception. Proceedings of the National Academy of Sciences, 117(39):24581–24589, 2020.

[3] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146, 2017.

[4] Timothy F. Brady, Talia Konkle, and George A. Alvarez. Compression in visual working memory: Using statistical regularities to form more efficient memory representations. Journal of Experimental Psychology: General, 138(4), 2009.

[5] Francois Chollet. Deep Learning with Python. Manning Publications Co., USA, 1st edition, 2017.

[6] Brian J. Compton and Gordon D. Logan. Evaluating a computational model of perceptual grouping by proximity. Perception & Psychophysics, 53(4):403–421, 1993.

[7] Don C. Donderi. Visual complexity: A review. Psychological Bulletin, 132(1):73, 2006.

[8] Finale Doshi-Velez and Been Kim. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608, 2017.

[9] Dennis E. Egan and Barry J. Schwartz. Chunking in recall of symbolic drawings. Memory & Cognition, 7(2), 1979.

[10] Kevin Ellis, Catherine Wong, Maxwell Nye, Mathias Sablé-Meyer, Lucas Morales, Luke Hewitt, Luc Cary, Armando Solar-Lezama, and Joshua B. Tenenbaum. DreamCoder: Bootstrapping inductive program synthesis with wake-sleep library learning. In Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation (PLDI 2021), pp. 835–850, 2021.

[11] Shai Fine, Yoram Singer, and Naftali Tishby. The hierarchical hidden Markov model: Analysis and applications. Machine Learning, 32(1):41–62, 1998.

[12] Jerry A. Fodor and Zenon W. Pylyshyn. Connectionism and cognitive architecture: A critical analysis. Cognition, 28(1-2):3–71, 1988.

[13] Nicholas T. Franklin, Kenneth A. Norman, Charan Ranganath, Jeffrey M. Zacks, and Samuel J. Gershman. Structured event memory: A neuro-symbolic model of event cognition. Psychological Review, 127(3):327, 2020.

[14] Vicky Froyen, Jacob Feldman, and Manish Singh. Bayesian hierarchical grouping: Perceptual grouping as mixture estimation. Psychological Review, 122(4):575, 2015.

[15] Fernand Gobet, Peter C. R. Lane, Steve Croker, Peter C. H. Cheng, Gary Jones, Iain Oliver, and Julian M. Pine. Chunking mechanisms in human learning. Trends in Cognitive Sciences, 5(6), 2001.

[16] Christina Heinze-Deml, Marloes H. Maathuis, and Nicolai Meinshausen. Causal structure learning. Annual Review of Statistics and Its Application, 2018.

[17] Geoffrey Hinton. Some demonstrations of the effects of structural descriptions in mental imagery. Cognitive Science, 3(3):231–250, 1979.

[18] Eldar Insafutdinov, Leonid Pishchulin, Bjoern Andres, Mykhaylo Andriluka, and Bernt Schiele. DeeperCut: A deeper, stronger, and faster multi-person pose estimation model. In European Conference on Computer Vision, pp. 34–50. Springer, 2016.

[19] John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, et al. Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873):583–589, 2021.

[20] Iring Koch and Joachim Hoffmann. Patterns, chunks, and hierarchies in serial reaction-time tasks. Psychological Research, 63(1), 2000.

[21] Brenden Lake and Marco Baroni. Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. In ICML, 2018.

[22] Brenden Lake, Ruslan Salakhutdinov, and Joshua Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350:1332–1338, 2015.

[23] Brenden M. Lake, Tomer D. Ullman, Joshua B. Tenenbaum, and Samuel J. Gershman. Building machines that learn and think like people. Behavioral and Brain Sciences, 40:e253, 2017.

[24] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.

[25] Raphaëlle Malassis, Arnaud Rey, and Joël Fagot. Non-adjacent dependencies processing in human and non-human primates. Cognitive Science, 42(5):1677–1699, 2018.

[26] Gary Marcus. Deep learning: A critical appraisal. arXiv preprint arXiv:1801.00631, 2018.

[27] Denis Mareschal and Robert French. TRACX2: A connectionist autoencoder using graded chunks to model infant visual statistical learning. Philosophical Transactions of the Royal Society B: Biological Sciences, 372:20160057, 2016.

[28] Stewart M. McCauley and Morten H. Christiansen. Computational investigations of multiword chunks in language learning. Topics in Cognitive Science, 9(3):637–652, 2017.

[29] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pp. 3111–3119, 2013.

[30] George A. Miller. The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological Review, 1956.

[31] Diana M. Müssgens and Fredrik Ullén. Transfer in motor sequence learning: Effects of practice schedule and sequence context. 9(November), 2015. doi: 10.3389/fnhum.2015.00642.

[32] Timothy J. O'Donnell, Joshua B. Tenenbaum, and Noah D. Goodman. Fragment grammars: Exploring computation and reuse in language. 2009.

[33] Katrin Ortmann. Chunking historical German. In NODALIDA, 2021.

[34] Virginia B. Penhune and Christopher J. Steele. Parallel contributions of cerebellar, striatal and M1 mechanisms to motor sequence learning. 2012.

[35] Pierre Perruchet and Annie Vinter. PARSER: A model for word segmentation. Journal of Memory and Language, 39(2):246–263, 1998.

[36] Pierre Perruchet, Bénédicte Poulin-Charronnat, Barbara Tillmann, and Ronald Peereman. New evidence for chunk-based models in word segmentation. Acta Psychologica, 149:1–8, 2014.

[37] Dwight J. Peterson and Marian E. Berryhill. The Gestalt principle of similarity benefits visual working memory. Psychonomic Bulletin & Review, 20(6):1282–1289, 2013.

[38] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16), pp. 1135–1144, 2016.

[39] David A. Rosenbaum, Sandra B. Kenny, and Marcia A. Derr. Hierarchical control of rapid movement sequences. Journal of Experimental Psychology: Human Perception and Performance, 1983.

[40] Wojciech Samek, Thomas Wiegand, and Klaus-Robert Müller. Explainable artificial intelligence: Understanding, visualizing and interpreting deep learning models. ITU Journal: ICT Discoveries, Special Issue 1, 1:1–10, 2017.

[41] Bernhard Schölkopf, Francesco Locatello, Stefan Bauer, Nan Rosemary Ke, Nal Kalchbrenner, Anirudh Goyal, and Yoshua Bengio. Toward causal representation learning. Proceedings of the IEEE, 109(5):612–634, 2021.

[42] Eric Schulz, Joshua B. Tenenbaum, David Duvenaud, Maarten Speekenbrink, and Samuel J. Gershman. Compositional inductive biases in function learning. Cognitive Psychology, 2017.

[43] Eric Schulz, Francisco Quiroga, and Samuel J. Gershman. Communicating compositional patterns. Open Mind, 4:25–39, 2020.

[44] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909, 2015.

[45] Emile Servan-Schreiber and John Anderson. Learning artificial grammars with competitive chunking. Journal of Experimental Psychology: Learning, Memory, and Cognition, 16:592–608, 1990.

[46] Claude Elwood Shannon. A mathematical theory of communication. The Bell System Technical Journal, 27(3):379–423, 1948.

[47] Noel E. Sharkey and Amanda J. C. Sharkey. An analysis of catastrophic interference. Connection Science, 1995.

[48] Sun Si, Chenyan Xiong, Zhenghao Liu, Zhiyuan Liu, and Jie Bao. Joint keyphrase chunking and salience ranking with BERT. 2020.

[49] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

[50] Yee Whye Teh. A hierarchical Bayesian language model based on Pitman-Yor processes. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pp. 985–992, 2006.

[51] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.

[52] Johan Wagemans, Jacob Feldman, Sergei Gepshtein, Ruth Kimchi, James R. Pomerantz, Peter A. van der Helm, and Cees van Leeuwen. A century of Gestalt psychology in visual perception: II. Conceptual and theoretical foundations. Psychological Bulletin, 138(6):1218, 2012.

[53] Feifei Zhai, Saloni Potdar, Bing Xiang, and Bowen Zhou. Neural models for sequence chunking. In AAAI, 2017.

A Independence Test

We use hypothesis testing on the assumption of independence to decide whether there exists an association between consecutive parses of any two chunks cL and cR in the current belief set. If the independence test is violated, then the two chunks are combined together. Another independence test is used to evaluate whether there are still statistical associations between chunk observations for each possible chunk in the current belief set; this is used as a criterion to continue or halt the chunking process.

We use a χ2-test of independence to assess whether the consecutive parses of cL and cR observed in T violate the independence criterion. Observations of cL and cR in parses are categorical variables and can be represented as the rows and columns of a contingency table. The number of observations in which cL occurs (cL = 1) versus any other observation (cL = 0) constitutes the row entries, while the number of observations with cR = 1 and cR = 0 makes up the column entries. The table therefore consists of two rows and two columns.

The null hypothesis is that the occurrence of the consecutive observations is statistically independent. Given the hypothesis of independence, the theoretical frequency for observing cL followed by cR is E[cL = 1, cR = 1] = N p(cL = 1) p(cR = 1), with N being the total number of parses, where

p(cL = 1) = N(cL)/N = (N(cL → cR) + N(cL → ¬cR))/N,
p(cR = 1) = N(cR)/N = (N(cL → cR) + N(¬cL → cR))/N.

The test statistic is

χ² = Σ_{l ∈ {0,1}} Σ_{r ∈ {0,1}} (N(cL = l, cR = r) − E[cL = l, cR = r])² / E[cL = l, cR = r]
   = N Σ_{l ∈ {0,1}} Σ_{r ∈ {0,1}} p(cL = l) p(cR = r) ((N(cL = l, cR = r)/N − p(cL = l) p(cR = r)) / (p(cL = l) p(cR = r)))²    (1)

The degree of freedom for this test is 1. A χ2-probability of less than or equal to 0.05 is used as a criterion for rejectingthe hypothesis of independence.
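A sketch of this 2 × 2 test using a standard statistics library; the counts are illustrative, and the paper's implementation may instead compute the statistic directly as in Eq. (1):

from scipy.stats import chi2_contingency

# Contingency table of consecutive parses (illustrative counts):
# rows: cL observed (1) vs. any other chunk (0); columns: the same for cR.
table = [[30, 5],    # cL = 1: followed by cR 30 times, by something else 5 times
         [10, 55]]   # cL = 0

chi2, p_value, dof, expected = chi2_contingency(table, correction=False)
if p_value <= 0.05:
    print("reject independence: cL and cR can be merged into cL ⊕ cR")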

The independence test is also employed to evaluate whether there are still statistical associations between chunk observations for each possible chunk in the current belief set, which we use as a criterion to continue or to halt the chunking process. For this test, the contingency table contains rows and columns corresponding to all possible chunks in the current belief set, and the χ2-statistic is calculated as:

χ² = Σ_{cL ∈ sB} Σ_{cR ∈ sB} (N(cL, cR) − E[cL, cR])² / E[cL, cR]
   = N Σ_{cL ∈ sB} Σ_{cR ∈ sB} p(cL) p(cR) ((N(cL, cR)/N − p(cL) p(cR)) / (p(cL) p(cR)))²    (2)

The degrees of freedom are (|sB| − 1)(|sB| − 1), and a p-value of p ≤ 0.05 is again used as a criterion to reject the null hypothesis and therefore as evidence to continue the chunking process.

B Rational Chunking Algorithm

Algorithm 1: Rational Chunking Algorithm

input : Seq, maxIter
output: Bd, Gd, Td, Md

d ← 0, iter ← 0
Bd, Md, Td = getSingleElementSets(Seq)            /* minimally complete atomic set */
while !IndependenceTest(Md, Td) or iter ≤ maxIter do
    Md, Td = Parse(Seq, Bd)
    cL, cR ← None; MaxChunk, MaxChunkP ← None; PreCk = {}
    for (ci, cj) ∈ Bd\{0} × Bd\{0} do
        Pd(ci ⊕ cj) = CalculateJoint(Md, Td, ci, cj)
        Pd+1(ci ⊕ cj) = Pd(ci ⊕ cj) / (1 − Pd(ci ⊕ cj))
        if Pd+1(ci ⊕ cj) ≥ MaxChunkP and ci ⊕ cj ∉ PreCk and !IndependenceTest(ci, cj) then
            cL ← ci, cR ← cj
            MaxChunkP ← Pd+1(ci ⊕ cj), MaxChunk ← ci ⊕ cj
        end
    end
    c ← cL ⊕ cR; Bd+1 ← Bd ∪ {c}
    Gd+1 ← AugmentGraph(Gd, (cL, c), (cR, c)); PreCk.add(c)
end

C Online HCM with Generalization to Visual-Temporal Sequences

HCM learns a chunk hierarchy graph G by training on a visual-temporal sequence. The visual-temporal chunks can take various continuous shapes and need not fill the entire space, but there are no jumps from any visual-temporal pixel within a chunk to another: within a chunk, there is always a path that connects any two visual-temporal pixels.

The input chunk hierarchy graph G can be an empty graph, which corresponds to the case where the HCM model has not been trained yet, or it can be a chunk hierarchy graph that HCM has acquired from previous training. M contains the frequency count of each chunk in the belief set B. T stores pairs of parsed adjacent chunks and their temporal lag, i.e. the time difference between the end of the previous chunk and the start of the following chunk. The generative model of the world assumed by the chunk learning model is that each visual-temporal chunk occurs independently at each parse. The model then keeps checking whether adjacent chunks in the visual and temporal domain violate the independence testing criterion. If so, the visually/temporally adjacent chunks are grouped together. The constituent parts of a chunk remain in the belief set.

Algorithm 2: Visual-Temporal Chunking

input : Seq, G, θ, DT
output: G

M, T ← G.M, G.T
PreviousChunkBoundaryRecord ← []                  /* record chunk endings */
ChunkTerminationTime.setall(-1)
while Sequence not over do
    CurrentChunks, ChunkTerminationTime = IdentifyTheLatestChunks(ChunkTerminationTime)
    ObservationToExplain ← refactor(Seq, ChunkTerminationTime)
    for Chunk in CurrentChunks do
        for CandidateAdjacentChunk in PreviousChunkBoundaryRecord do
            if CheckAdjacency(Chunk, CandidateAdjacentChunk) then
                M, T, B, G ← LearnChunking(Chunk, CandidateAdjacentChunk, M, T, B, G)
            end
        end
        ChunkTerminationTime.update(CurrentChunks)
    end
    PreviousChunkBoundaryRecord.add(CurrentChunks)
    Forgetting(M, T, B, G, θ, DT, PreviousChunkBoundaryRecord)
end

The pseudocode for the Visual-Temporal HCM is shown in Algorithm 2. The input can be a visual-temporal sequence, an i.i.d. visual sequence, or a temporal sequence.

To process and update chunks online, HCM iterates through the visual-temporal sequence, identifies chunks, and marks the termination time corresponding to each visual dimension in ChunkTerminationTime. ChunkTerminationTime is a matrix of the same size as the visual dimension that stores, for each visual dimension, the end point of the previous chunk. Because chunks can finish at different time points in each visual-temporal dimension, the algorithm explains observations in each dimension from the end point of the last identified chunk up until the current time point, in order to identify the current, ongoing chunks. The chunks with the biggest visual-temporal volume are prioritized to explain the relevant observations up to the current time point. As multiple visual-temporal chunks can be identified to occur simultaneously, CurrentChunks is a set that stores the identified chunks that have not yet reached their end points.

Once one or several chunks are identified as ending at a time point, they are stored in PreviousChunkBoundaryRecord and their finishing time is updated for each visual pixel in ChunkTerminationTime. The corresponding entries in M are updated once a chunk has ended. Each chunk that finishes after the start of a current chunk is checked against that current chunk for visual-temporal adjacency. Additionally, the independence test between adjacent chunk pairs is carried out to decide whether to merge the chunks.

Once a pair of adjacent chunks cL, cR violates the independence testing criterion, they are combined into one chunk cL ⊕ cR. A new entry is created in M with the estimated joint occurrence frequency for cL ⊕ cR, which is subtracted from the marginal records of cL and cR. The transition entries associated with cL ⊕ cR are initialized to 1, and transitions from cL ⊕ cR to other chunks are inherited from the transition entries of cR to those chunks.
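As a minimal illustration of this bookkeeping (a sketch under our own naming conventions, not the released implementation), the Python fragment below stores marginal counts M and transition counts T in plain dictionaries, with chunks represented as strings; merge_chunks and the dictionary layout are assumptions made here for clarity:

    def merge_chunks(cl, cr, M, T):
        """Combine adjacent chunks cl and cr into a new chunk cl + cr (standing in for cl ⊕ cr)."""
        joint = T.get((cl, cr), 0)        # estimated joint occurrence frequency of the pair
        new_chunk = cl + cr
        M[new_chunk] = joint              # new marginal entry for the combined chunk
        M[cl] -= joint                    # subtract the joint count from both marginal records
        M[cr] -= joint
        T[(cl, cr)] = 0                   # the pair is now explained by the combined chunk
        for other in list(M):             # transitions out of the new chunk start at a count of 1
            T.setdefault((new_chunk, other), 1)
        return new_chunk, M, T

    M, T = {"A": 10, "B": 8}, {("A", "B"): 6}
    new_chunk, M, T = merge_chunks("A", "B", M, T)   # M becomes {"A": 4, "B": 2, "AB": 6}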

At each time step, the visual-temporal chunking algorithm does the following:

1. Identifies the chunks biggest in volume that explain the observations in each dimension from the time point when the last chunk ended up to the current time point, marks their time points, and stores them in the set of current chunks.

2. Identifies, among previously parsed chunks, those adjacent to each of the currently ending chunks, and updates their marginal and transition counts.

3. Modifies the set of chunks used to parse the sequence based on their adjacencies.

Entries in M and T are subject to memory decay at rate θ. If any entry falls below the deletion threshold DT, the corresponding entries in M, T, B, and G are deleted.
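A minimal sketch of this forgetting step, again using dictionary-based records and hypothetical names (theta for θ, deletion_threshold for DT); pruning of the corresponding entries in B and G is omitted:

    def forget(M, T, theta, deletion_threshold):
        """Decay all counts by a factor theta and prune entries that fall below the threshold."""
        for chunk in list(M):
            M[chunk] *= theta
            if M[chunk] < deletion_threshold:
                del M[chunk]                                   # drop the chunk's marginal record
                for pair in [p for p in T if chunk in p]:      # and every transition involving it
                    del T[pair]
        for pair in list(T):
            T[pair] *= theta
            if T[pair] < deletion_threshold:
                del T[pair]
        return M, T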


D Details to the Proof of Recoverability

D.1 Definitions

An observational sequence is made up of discrete, integer-valued, size-one elementary observational units coming from an atomic alphabet set A0, where 0 represents the empty observation.

One example of such an observational sequence S is:

010021002112000...

The elementary observation units such as ‘0’, ‘1’, and ‘2’ come from the atomic alphabet set A0 = {0, 1, 2}.

Definition 1 (Chunk). A chunk is composed of several non-empty, concatenated elementary observations embedded in the sequence. That is, a chunk can be made up of any combination of elements in A0 \ {0}.

Examples of chunks from the observational sequence can be ‘1’, ‘21’, ‘211’, ‘12’, ‘2112’, etc.; 0 represents an empty observation in the sequence.

Definition 2 (Belief Set). A belief set is the set of chunks that the HCM uses to parse training sequences. The belief set is denoted as B.

An example of a belief set that the model has learned from S could be B = {0, 1, 21, 211, 12, 2112}.

Definition 3 (Parsing). The parsing of chunks initiates from the beginning of the sequence. At every parsing step, the biggest chunk in the belief set that matches the upcoming sequence is chosen to explain the observation. The end of the previous chunk parse initiates the next parse.

The sequence S parsed by the model using the belief set {0, 1, 21, 211, 12, 2112} results in the following partition: 0 1 0 0 21 0 0 2112 0 0 0.
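Definition 3 translates directly into a greedy longest-match routine; the sketch below (our own helper name parse) reproduces the partition above on the example sequence:

    def parse(sequence, belief_set):
        """Greedy parse: at each step take the biggest chunk in the belief set
        that matches the upcoming observations (Definition 3)."""
        chunks_by_length = sorted(belief_set, key=len, reverse=True)
        parsed, i = [], 0
        while i < len(sequence):
            for chunk in chunks_by_length:
                if sequence.startswith(chunk, i):
                    parsed.append(chunk)
                    i += len(chunk)
                    break
            else:
                raise ValueError("belief set is not complete for this sequence")
        return parsed

    print(parse("010021002112000", {"0", "1", "21", "211", "12", "2112"}))
    # ['0', '1', '0', '0', '21', '0', '0', '2112', '0', '0', '0']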

We say that a belief set is complete if, at any point when the model parses the sequence, the upcoming observations can be explained by at least one chunk within the belief set.

In this work, we only refer to complete belief sets.

Definition 4 (Parsing Length NB). The parsing length NB of a sequence parsed by a belief set B is the length of the parsing result after the sequence has been parsed by chunks within B.

Definition 5 (Count Function NB(c)). NB(c) denotes the count of how many times the chunk c in the belief set B appears in the parsed sequence.

The parsing length and the count function for all of the chunks c on a belief set B satisfy the following relation:

NB = ∑_{c∈B} NB(c)    (3)

Definition 6 (NB(x → y)). NB(x → y) denotes the number of times chunk x is parsed and immediately followed by chunk y; x and y are both chunks in the belief set B.

For any chunk x within any belief set B, one can relate NB(x) with NB(x→ y) by:

NB(x) = ∑_{y∈B} NB(x → y)    (4)
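For a finite sequence these quantities are simple to tabulate; the sketch below computes NB(c) and NB(x → y) from a parsed sequence and checks Eqs. (3) and (4). The boundary correction for the final parse, which has no successor, is our own addition and disappears in the infinite-sequence limit:

    from collections import Counter

    def parse_counts(parsed):
        """Marginal counts N_B(c) and transition counts N_B(x -> y) of a parsed sequence."""
        marginal = Counter(parsed)                        # N_B(c)
        transition = Counter(zip(parsed, parsed[1:]))     # N_B(x -> y): x immediately followed by y
        return marginal, transition

    parsed = ['0', '1', '0', '0', '21', '0', '0', '2112', '0', '0', '0']
    N, T = parse_counts(parsed)
    assert sum(N.values()) == len(parsed)                 # Eq. (3)
    assert all(sum(v for (a, _), v in T.items() if a == x)
               == N[x] - (1 if x == parsed[-1] else 0) for x in N)   # Eq. (4), finite-sequence form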

When the length of the sequence becomes infinite, it is easier to work with probabilities instead of the count function. One can define a probability space over the belief set:


Figure 6: Illustration of Visual Temporal Chunks

Definition 7 (Probability space of a belief set). With a belief set B, one can define an associated probability space (SB, FB, PB). SB is the sample space representing all of the possible outcomes of a chunk parse. The event space FB is the space of all possible sets of events; FB contains all the subsets of SB. Additionally, the probability function PB : FB → R is defined on the event space FB. The probability function PB satisfies the basic axioms of probability:

• PB(E) ≥ 0 ∀E ∈ FB. For any subset of the event space, the probability of an observation being in that subset is non-negative.

• If M, N ∈ FB and M ∩ N = ∅, then PB(M ∪ N) = PB(M) + PB(N). For two non-intersecting subsets of the event space, the probability of observing any element that falls within the union of the two subsets is the sum of the probability of observing any event within one subset and the probability of observing any event from the other subset.

• PB(SB) = 1. The probability of observing any event that belongs to the sample space is one.

The probability of a chunk c on a support set of chunks B is the limiting value of the ratio NB(c)/NB as NB goes to infinity:

PB(c) = lim_{NB→∞} NB(c)/NB    (5)

A learning model keeps track of the occurrence probability associated with each chunk in the belief set. For a current belief set, the model assumes that the chunks within the belief set occur independently.

The occurrence probability of chunk ci with the belief set Bd is PBd(ci), which refers to the normalized frequency of observing chunk ci when the number of parses of the sequence using chunks from the belief set goes to infinity. In this way, the probability of observing a sequence of chunks c1, c2, ..., cN can be denoted as P(c1, c2, ..., cN). The joint probability of observing this sequence of chunks under the generative process is:

P(c1, c2, ..., cN) = ∏_{i=1}^{N} PBd(ci)    (6)

In this formulation, chunks as observation units serve as independent factors that disentangle observations in the sequence.

Definition 8 (Marginal Parsing Frequency Md). A vector that stores the number of parses for each chunk c in the belief set Bd.

Md has size |Bd|. Additionally, the model keeps track of the transition probability from one chunk to another, as these transitions are used to test whether two chunks have an immediate temporal adjacency association.

Definition 9 (Td). The set of transition frequencies from any chunk ci ∈ Bd to any chunk cj ∈ Bd.


Definition 10 (Chunk Hierarchy Graph Gd). The relation between chunks and their constructive components in the generative model is described by a chunk hierarchy graph Gd with vertex set VAd and edge set EAd. One example of a chunk hierarchy graph is illustrated in Figure 1 a). In this hierarchical generative model, d is the depth of the graph and Ad is the set of chunks used as atomic units to construct the sequence. Each vertex in VAd is a chunk, and edges connect the parent chunk vertices to their child chunk vertices.

As the belief set B changes when one modifies the chunks used to parse a sequence, so do the parsing length NB and the probability measure PB associated with the belief set. Based on the definition of N and on how chunks are parsed when the support set changes, the change of N imposes a set of constraints on the probabilities defined on the new, augmented support set. To derive these constraints, we need to:

• Formulate the definition of probability based on N.
• Identify all relevant changes of N before and after the chunk update.
• Translate this change of N into constraints on the probability updates.

We derive the relation between the probabilities when two chunks cL, cR ∈ Ad are concatenated together to form a new chunk cL ⊕ cR and the alphabet is updated to Ad+1.

D.2 Identify N Updates

We now describe how the count function changes before and after the update. We start with the update of the summary N when the alphabet goes from Ad to Ad+1, and then proceed to the updates for the marginal N and the transitional N.

D.2.1 Summary N

When going from Ad to Ad+1, cL and cR are both chunks in Ad and merged together as a new chunk to augment Ad.

The number of parses changes only in places where the cL and cR associated with the new chunk occur. The chunks in Ad can be divided into three groups: cL, cR, and Ad \ {cL, cR}. The count function associated with chunks from the set Ad \ {cL, cR} does not change when Ad updates to Ad+1.

With the update to Ad+1, cL and cR are chunked together, so every time they are recognized together as a whole, the two separate parses are replaced by a single parse. Nd+1(cL) and Nd+1(cR) are the counts of cL and cR when they do not occur together. These counts differ from Nd(cL) and Nd(cR) because the occasions on which cL and cR occur one after another are accounted for by Nd+1(cL ⊕ cR). The relation between Nd and Nd+1 is:

Nd+1 = [ ∑_{c∈Ad−cL−cR} Nd(c) ] + Nd+1(cL) + Nd+1(cR) + Nd+1(cL ⊕ cR)    (7)

Nd+1(cL) counts the occasions on which cL occurs but is not followed by cR.

Nd+1(cL ⊕ cR) = Nd(cL → cR) (8)

Nd+1(cL) = Nd(cL) − Nd(cL → cR)    (9)
Nd+1(cR) = Nd(cR) − Nd(cL → cR)    (10)

2 Nd+1(cL ⊕ cR) + Nd+1(cL) + Nd+1(cR) = Nd(cL) + Nd(cR)    (11)


Whenever the sub-chunks occur right after each other, chunking reduces the two sub-chunk parses to a single parse of the combined chunk.

Nd = ∑_{c∈Ad} Nd(c) = [ ∑_{c∈Ad−cL−cR} Nd(c) ] + Nd(cL) + Nd(cR)    (12)

Comparing the above two equations we arrive at

Nd −Nd(cL)−Nd(cR) = Nd+1 −Nd+1(cL)−Nd+1(cR)−Nd+1(cL ⊕ cR) (13)

Since:
Nd(cL) + Nd(cR) = Nd+1(cL) + Nd+1(cR) + 2 Nd+1(cL ⊕ cR)    (14)

This is related to:
Nd(cL → cR) = Nd+1(cL ⊕ cR)    (15)

Because both are counting the number of times they occur consecutively.

Nd+1(cL) +Nd+1(cR) = Nd(cL) +Nd(cR)− 2Nd(cL → cR) (16)

One can define Nd+1 in terms of counts in Nd as:

Nd+1 = [ ∑_{c∈Ad−cL−cR} Nd(c) ] + Nd(cL) + Nd(cR) − 2 Nd(cL → cR) + Nd(cL → cR)

Nd+1 = [ ∑_{c∈Ad−cL−cR} Nd(c) ] + Nd(cL) + Nd(cR) − Nd(cL → cR)    (17)

The same relation also holds between the total counts Nd and Nd+1 when one switches from the alphabet set Ad to Ad+1 by chunking cL and cR in Ad together.

Nd+1 = [ ∑_{c∈Ad−cL−cR} Nd(c) ] + Nd(cL) + Nd(cR) − Nd(cL → cR)    (18)

Nd+1 = Nd − Nd(cL → cR)    (19)

D.2.2 Marginal N

To formulate the joint and conditional probabilities given a particular belief space, we need to specify how the count N(c) of each chunk changes when the belief space switches from Ad to Ad+1, using the same division of chunks as before.

Of course, the count function should be fixed. However, the probability function associated with the chunks will change based on the update of the belief set. We use the update of the count function to find the relation between the probability updates.

For all x in Ad − {cL, cR, cL ⊕ cR}:

Nd+1(x) = Nd(x) (20)

Nd+1(cR) = Nd(cR)−Nd(cL → cR) (21)

Nd+1(cL) = Nd(cL)−Nd(cL → cR) (22)

Nd+1(cL ⊕ cR) = Nd(cL → cR) (23)
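The count updates in Eqs. (19)-(23) can be written compactly; the following sketch (hypothetical helper updated_counts, with the merged chunk represented as a tuple) returns the level-(d+1) counts from the level-d records:

    def updated_counts(N, T, cl, cr):
        """Counts after merging cl and cr into cl ⊕ cr, following Eqs. (19)-(23)."""
        joint = T.get((cl, cr), 0)               # N_d(cl -> cr)
        N_next = dict(N)                         # Eq. (20): untouched chunks keep their counts
        N_next[cl] = N[cl] - joint               # Eq. (22)
        N_next[cr] = N[cr] - joint               # Eq. (21)
        N_next[(cl, cr)] = joint                 # Eq. (23)
        total_next = sum(N.values()) - joint     # Eq. (19): N_{d+1} = N_d - N_d(cl -> cr)
        return N_next, total_next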


D.2.3 Transitional N

The following relationships need to hold:

For all x, y in Ad \ {cL, cR, cL ⊕ cR}:

Nd+1(x → y) = Nd(x → y)    (24)
Nd+1(x → cL) = Nd(x → cL) − Nd+1(x → cL ⊕ cR)    (25)
Nd+1(x → cR) = Nd(x → cR)    (26)
Nd+1(x → cL ⊕ cR) = Nd+1(x) − Nd+1(x → cR) − Nd+1(x → cL) − Nd+1(x → y)    (27)
Nd+1(x → cL ⊕ cR) ≤ Nd(cL → cR)    (28)

From cL we have the following relations:

Nd+1(cL → x) = Nd(cL → x)    (29)
Nd+1(cL → cL) = Nd(cL → cL) − Nd+1(cL → cL ⊕ cR)    (30)
Nd+1(cL → cR) = 0    (31)
Nd+1(cL → cL ⊕ cR) = Nd+1(cL) − Nd+1(cL → cR) − Nd+1(cL → cL) − Nd+1(cL → x)    (32)
Nd+1(cL → cL ⊕ cR) ≤ Nd(cL → cR)    (33)

From cR we have the following relation:

Nd+1(cR → y) = Nd(cR → y) − Nd+1(cL ⊕ cR → y)    (35)
Nd+1(cR → cL) = Nd(cR → cL) − Nd+1(cL ⊕ cR → cL)    (36)
Nd+1(cR → cR) = Nd(cR → cR) − Nd+1(cL ⊕ cR → cR)    (37)
Nd+1(cR → cL ⊕ cR) = Nd+1(cR) − Nd+1(cR → cR) − Nd+1(cR → cL) − Nd+1(cR → y)    (38)
Nd+1(cR → cL ⊕ cR) ≤ Nd(cL ⊕ cR)    (39)

From the chunked unit cL ⊕ cR we have the following relation:

Nd+1(cL ⊕ cR → x) ≤ min(Nd(cL ⊕ cR), Nd+1(cR ⊕ x))    (40)
Nd+1(cL ⊕ cR → cL) ≤ min(Nd(cL ⊕ cR), Nd(cR ⊕ cL))    (41)
Nd+1(cL ⊕ cR → cR) ≤ min(Nd+1(cL ⊕ cR), Nd+1(cR ⊕ cR))    (42)
Nd+1(cL ⊕ cR → cL ⊕ cR) ≤ min(Nd+1(cL ⊕ cR), Nd(cR ⊕ cL))    (43)

Finally, we need to satisfy the marginal constraint, that is:

Nd+1(cL ⊕ cR) = Nd+1(cL ⊕ cR → cL) + Nd+1(cL ⊕ cR → x) + Nd+1(cL ⊕ cR → cR) + Nd+1(cL ⊕ cR → cL ⊕ cR)    (44)

D.3 Probability Density Switch when Ad expands to Ad+1

The constraint is that the number of counts N for all chunks defined on the support set Ad must remain the same on the support set Ad+1, so that the definition of PAd for all relevant chunks within Ad remains the same when Ad expands to Ad+1.

Relating the number of counts to probability: the probability of a chunk occurring in the alphabet set Ad is defined as:

PAd(c) = lim_{Nd→∞} Nd(c)/Nd    (45)

Because Nd and Nd+1 differ only by a constant, both go to infinity if one of them does, so there is a relation between the definitions of PAd(c) and PAd+1(c). For any chunk x in Ad that is neither cL nor cR, Nd+1(x) = Nd(x):

PAd+1(x) = lim_{Nd+1→∞} Nd+1(x)/Nd+1 = lim_{Nd→∞} Nd(x)/(Nd − Nd(cL → cR))    (46)

That is, the probability of a chunk of this category at d and d+1 satisfies this relationship:

lim_{Nd+1→∞} PAd+1(x) Nd+1 = lim_{Nd→∞} PAd(x) Nd    (47)


PAd+1(x) = PAd(x) · lim_{Nd→∞} Nd / lim_{Nd+1→∞} Nd+1    (48)

PAd+1(x) = PAd(x) · lim_{Nd→∞} Nd / lim_{Nd→∞} (Nd − Nd(cL → cR))    (49)

For cL and cR in Ad+1:

PAd+1(cL) = lim_{Nd+1→∞} Nd+1(cL)/Nd+1    (50)

PAd(cL) = lim_{Nd→∞} Nd(cL)/Nd    (51)

PAd+1(cR) = lim_{Nd+1→∞} Nd+1(cR)/Nd+1    (52)

PAd(cR) = lim_{Nd→∞} Nd(cR)/Nd    (53)

Because:
Nd+1(cL) = Nd(cL) − Nd(cL ⊕ cR)    (54)

PAd+1(cL) = lim_{Nd+1→∞} (Nd(cL) − Nd(cL ⊕ cR))/Nd+1    (55)

PAd+1(cL) = lim_{Nd→∞} (Nd(cL) − Nd(cL → cR))/(Nd − Nd(cL → cR))    (56)

PAd+1(cR) = lim_{Nd→∞} (Nd(cR) − Nd(cL → cR))/(Nd − Nd(cL → cR))    (57)

Finally:

PAd+1(cL ⊕ cR) = lim_{Nd+1→∞} Nd+1(cL ⊕ cR)/Nd+1    (58)

PAd(cL ⊕ cR) = lim_{Nd→∞} Nd(cL → cR)/Nd    (59)

Since Nd(cL → cR) = Nd+1(cL ⊕ cR), we have

PAd+1(cL ⊕ cR) = lim_{Nd→∞} PAd(cL ⊕ cR) Nd/Nd+1    (60)

D.3.1 Summary Probabilities

Nd+1 = Nd − Nd(cL ⊕ cR) = Nd − Nd Pd(cL ⊕ cR)    (61)

Nd+1/Nd = 1 − Pd(cL ⊕ cR)    (62)


D.3.2 Marginal Probabilities

The constraints on marginal probabilities when the support set changes from Ad to Ad+1, derived from the constraints on the marginal counts, are the following:

Pd+1(x) = Pd(x) / (1 − Pd(cL ⊕ cR))    (63)

Pd+1(cR) = (Pd(cR) − Pd(cL ⊕ cR)) / (1 − Pd(cL ⊕ cR))    (64)

Pd+1(cL) = (Pd(cL) − Pd(cL ⊕ cR)) / (1 − Pd(cL ⊕ cR))    (65)

Pd+1(cL ⊕ cR) = Pd(cL ⊕ cR) / (1 − Pd(cL ⊕ cR))    (66)

With this set of formulations, we have defined the next-level marginal probability measures as a function of the previous-level observations and their implied probability measures.
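The marginal update of Eqs. (63)-(66) amounts to subtracting the joint mass from cL and cR and renormalizing by 1 − Pd(cL ⊕ cR); a sketch with illustrative names:

    def updated_marginals(P, cl, cr, p_joint):
        """Marginal probabilities on A_{d+1} given P_d and p_joint = P_d(cl ⊕ cr)."""
        Z = 1.0 - p_joint                                                  # the normalizer from Eq. (62)
        P_next = {x: p / Z for x, p in P.items() if x not in (cl, cr)}     # Eq. (63)
        P_next[cl] = (P[cl] - p_joint) / Z                                 # Eq. (65)
        P_next[cr] = (P[cr] - p_joint) / Z                                 # Eq. (64)
        P_next[(cl, cr)] = p_joint / Z                                     # Eq. (66)
        return P_next

    P_next = updated_marginals({"A": 0.5, "B": 0.3, "C": 0.2}, "A", "B", p_joint=0.1)
    assert abs(sum(P_next.values()) - 1.0) < 1e-9     # the updated marginals still sum to one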

D.3.3 Transitional Probabilities

For all x, y in Ad − {cL, cR, cL ⊕ cR}, from x we have the following relations:

0 ≤ Pd+1(·|x) ≤ 1 (67)

Pd+1(y|x) = Pd(y|x) (68)

Pd+1(cR|x) = Pd(cR|x) (69)

Pd+1(cL|x) + Pd+1(cL ⊕ cR|x) = Pd(cL|x)    (70)

The split implied by this equation is sampled so that both Pd+1(cL|x) and Pd+1(cL ⊕ cR|x) satisfy the following constraints:

Pd+1(cL|x) ≤ Pd+1(cL) / Pd+1(x)    (71)


Pd+1(cL ⊕ cR|x) ≤ Pd+1(cL ⊕ cR) / Pd+1(x)    (72)

Basically, Pd+1(cL|x) is sampled from the range [0, min{1, Pd+1(cL)/Pd+1(x), Pd(cL|x)}].

Additionally, Pd+1(cL ⊕ cR|x) is constrained to be within the range [0, min{1, Pd+1(cL ⊕ cR)/Pd+1(x), Pd(cL|x)}].
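One way to realize this sampling step is sketched below; the closed-form lower bound and the rejection behaviour are our own choices for keeping the remainder inside its admissible range, in the spirit of the re-sampling procedure described in Section D.4:

    import random

    def sample_conditional_split(p_cl_given_x, p_next_cl, p_next_merge, p_next_x):
        """Split P_d(cL|x) into P_{d+1}(cL|x) and P_{d+1}(cL ⊕ cR|x), Eqs. (70)-(72).

        p_next_cl, p_next_merge, p_next_x stand for P_{d+1}(cL), P_{d+1}(cL ⊕ cR), P_{d+1}(x).
        """
        hi = min(1.0, p_next_cl / p_next_x, p_cl_given_x)                 # range for P_{d+1}(cL|x)
        lo = max(0.0, p_cl_given_x - min(1.0, p_next_merge / p_next_x))   # keeps the remainder admissible
        if lo > hi:
            raise ValueError("no admissible split; reject and re-sample the draw")
        p_cl = random.uniform(lo, hi)
        p_merge = p_cl_given_x - p_cl                                     # Eq. (70)
        return p_cl, p_merge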

From cL we have this relation:

Pd+1(x|cL) = Pd(x|cL) / (1 − Pd(cR|cL))    (73)

Pd+1(cR|cL) = 0 (74)

Pd+1(cL|cL) = (Pd(cL|cL) − (1 − Pd(cR|cL)) Pd+1(cL ⊕ cR|cL)) / (1 − Pd(cR|cL))    (75)

Put into simplified terms:

Pd+1(cL|cL) + Pd+1(cL ⊕ cR|cL) = Pd(cL|cL) / (1 − Pd(cR|cL))    (76)

Pd+1(cL|cL) ≤ Pd+1(cL) / Pd+1(cL) = 1    (77)

Pd+1(cL ⊕ cR|cL) ≤ Pd+1(cL ⊕ cR) / Pd+1(cL)    (78)

Additionally, Pd+1(cL|cL) is constrained to fall within the range [0, min{1, Pd(cL|cL)/(1 − Pd(cR|cL))}], and Pd+1(cL ⊕ cR|cL) within the range [0, min{1, Pd+1(cL ⊕ cR)/Pd+1(cL), Pd(cL|cL)/(1 − Pd(cR|cL))}].

From cR we have the following relations:

0 ≤ Pd+1(·|cR) ≤ 1 (79)

Pd+1(y|cR) = (Pd(y|cR) Pd(cR) − Pd+1(y|cL ⊕ cR) Pd(cL ⊕ cR)) / (Pd(cR) − Pd(cL ⊕ cR))    (80)

Pd+1(cL|cR) = (Pd(cL|cR) Pd(cR) − Pd+1(cL|cL ⊕ cR) Pd(cL ⊕ cR)) / (Pd(cR) − Pd(cL ⊕ cR))    (81)

Pd+1(cR|cR) = (Pd(cR|cR) Pd(cR) − Pd+1(cR|cL ⊕ cR) Pd(cL ⊕ cR)) / (Pd(cR) − Pd(cL ⊕ cR))    (82)

Pd+1(·|cR) ≥ Pd+1(·|cL ⊕ cR) (83)

Additionally, Pd+1(cL ⊕ cR|cR) and Pd+1(cL ⊕ cR|cL ⊕ cR) need to satisfy:

Pd+1(cL ⊕ cR|cR) = 1− Pd+1(y|cR)− Pd+1(cR|cR)− Pd+1(cL|cR) (84)

Pd+1(cL ⊕ cR|cL ⊕ cR) = 1− Pd+1(y|cL ⊕ cR)− Pd+1(cR|cL ⊕ cR)− Pd+1(cL|cL ⊕ cR) (85)

For the generative model, when the alphabet set goes from Ad to Ad+1 by chunking cL and cR together, the above constraints associated with chunks in Ad and Ad+1 need to be satisfied.

D.4 Hierarchical Generative Model

At the beginning of the generative process, the atomic alphabet set A0 is specified. Another parameter, d, specifies the number of additional chunks that are created in the process of generating the hierarchical chunks. Starting from the alphabet A0 with initialized elementary chunks ci, the probability associated with each chunk ci in A0 needs to satisfy the following criterion:

∑_{ci∈A0} PA0(ci) = 1    (86)

Meanwhile, P (ci) ≥ 0, ∀ci ∈ A0.


We assume that at each step the marginal and transitional probabilities of the previous step are known. The next chunk is chosen as the combination of chunks with the biggest probability: the order of construction in the generative model follows the rule that the combined chunk with the biggest probability on the support set of pre-existing chunks is chosen to be added to the set of chunks.

cL ⊕ cR = argmax_{cL,cR∈Ad\{0}} PAd(cL ⊕ cR)    (87)

Under the constraint that:

PAd(cL) PAd(cR) ≤ PAd(cL ⊕ cR) ≤ min{PAd(cL), PAd(cR)}    (88)

This can be calculated from the transitional and marginal probability of the previous step.

cL ⊕ cR = argmax_{cL,cR∈Ad\{0}} PAd(cL ⊕ cR) = argmax_{cL,cR∈Ad\{0}} PAd(cL) PAd(cR|cL)    (89)
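A direct transcription of this selection rule, assuming the marginal probabilities and the conditionals are stored in dictionaries (the symbol "0" stands for the empty observation; the helper name is illustrative):

    def next_chunk(P, T_cond, alphabet):
        """Eq. (89): pick the pair (cL, cR) maximizing P(cL) * P(cR | cL), excluding '0'."""
        candidates = [(cl, cr) for cl in alphabet for cr in alphabet
                      if cl != "0" and cr != "0"]
        return max(candidates, key=lambda pair: P[pair[0]] * T_cond.get(pair, 0.0))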

Once the chunk for the next step, cL ⊕ cR, is chosen, the support set changes from Ad to Ad+1, which is one element bigger. On the new support set, the marginal probability is specified by the marginal update; that is, the marginal and transitional probabilities defined on the support set Ad fully specify the marginal probabilities on the support set Ad+1. Going from generating a new chunk on the support set Ad to the set Ad+1, the transitional probabilities between chunks need to satisfy the following constraints:

• PAd+1(cL|x) + PAd+1(cL ⊕ cR|x) = PAd(cL|x) for all x in Ad
• PAd+1(cL|cL)(1 − PAd(cR|cL)) + PAd+1(cL ⊕ cR|cL)(1 − PAd(cR|cL)) = PAd(cL|cL)
• PAd+1(y|cR)(PAd(cR) − PAd(cL ⊕ cR)) + PAd+1(y|cL ⊕ cR) PAd(cL ⊕ cR) = PAd(y|cR) PAd(cR)
• PAd+1(cL|cR)(PAd(cR) − PAd(cL ⊕ cR)) + PAd+1(cL|cL ⊕ cR) PAd(cL ⊕ cR) = PAd(cL|cR) PAd(cR)
• PAd+1(cR|cR)(PAd(cR) − PAd(cL ⊕ cR)) + PAd+1(cR|cL ⊕ cR) PAd(cL ⊕ cR) = PAd(cR|cR) PAd(cR)
• Pd+1(cL ⊕ cR|cR) = 1 − Pd+1(y|cR) − Pd+1(cR|cR) − Pd+1(cL|cR)
• Pd+1(cL ⊕ cR|cL ⊕ cR) = 1 − Pd+1(y|cL ⊕ cR) − Pd+1(cR|cL ⊕ cR) − Pd+1(cL|cL ⊕ cR)

In total, there are |x| + |y| + 3 degrees of freedom. Since |x| = |y| = |Ad| − 2, at each step there are 2|Ad| − 1 values to be specified.

In practice, after the chunks in Ad are specified, the probability values associated with the chunks in A0 are sampled from a flat Dirichlet distribution and then sorted so that the smaller-sized chunks contain more of the probability mass and the null chunk carries the biggest probability mass. Then, the above constraints are checked for the probability assigned to each newly generated chunk with its associated alphabet set Ai. This process repeats until the probabilities drawn satisfy the conditions for every newly created chunk.

Theorem 1 (Marginal Probability Space Conservation). After the addition of cd,i ⊕ cd,j and the corresponding change of probabilities, PAd is still a valid probability distribution.

Proof:

∑_{cd,k∈Ad} PAd(cd,k) = ∑_{cd,k∈Ad−1−cd−1,i−cd−1,j} PAd−1(cd−1,k)
    + PAd−1(cd−1,i) − PAd−1(cd−1,j|cd−1,i) PAd−1(cd−1,i)
    + PAd−1(cd−1,j) + PAd−1(cd−1,j|cd−1,i) PAd−1(cd−1,i)
    = 1    (90)

Theorem 2 (Conditional Probability Space Preservation). The conditional probability Pd+1(z|c) for any c ∈ Ad+1 after sampling is still a valid distribution.

Proof: By manipulating the equations, one shows that the conditional probabilities given c sum to 1 and that each of them is greater than 0. □

Theorem 3 (Measure Space Preservation). Given that at the end of the generative process with depth d one ends up with an alphabet set Ad, the probability space defined on each Ai (which includes the marginal and joint probabilities of any chunk and combination of chunks in Ai, i = 0, 1, 2, . . . , d, the predecessor alphabet sets of Ad) and all values in the sets Md and Td remain the same no matter how the future support set changes according to the generative model.


Proof: By induction.

• Base case: starting from the initialized alphabet set A0, consider the probabilities PA0(c), c ∈ A0, and PA0(xy), x, y ∈ A0, for all valid c, x, y, when the alphabet is A1. Going from A0 to A1, N0(c), N0 and N0(x → y) do not change; therefore PA0(c) and PA0(x → y) evaluated at the alphabet A1 are the same as when the alphabet is A0.

• Induction step: starting from the alphabet set Ad, consider the probabilities PAd(c), c ∈ Ad, and PAd(xy), x, y ∈ Ad, for all valid c, x, y, when the alphabet is Ad+1. Going from Ad to Ad+1, Nd(c), Nd and Nd(x → y) do not change; therefore PAd(c) and PAd(x → y) evaluated at the alphabet Ad+1 are the same as when the alphabet is Ad.

Theorem 4. The order of PAi(xy), x, y ∈ Ai, for any i = 0, 1, 2, . . . , d, at any previous belief space is preserved throughout the update.

Proof: At the end of the generative process with depth d, one ends up with the alphabet set Ad. The probability space defined on Ai, which includes the marginal and joint probabilities of any chunk and combination of chunks in Ai, i = 0, 1, 2, . . . , d, is preserved; hence the order is preserved.

The generative process can be described by a graph update path. The specification of the initial set of atomic chunks A0 corresponds to an initial graph G0 with the atomic chunks as its vertices. At the i-th iteration, as the generative graph goes from GAi to GAi+1, two non-zero chunks cL, cR chosen from the pre-existing set of chunks Ai are concatenated into a new chunk cL ⊕ cR, augmenting Ai by one to Ai+1. The vertex set also increments from VAi to VAi+1 = VAi ∪ {cL ⊕ cR}. Moreover, two directed edges connecting the parental chunks to the newly created chunk are added to the set of edges: EAi+1 = EAi ∪ {(cL, cL ⊕ cR)} ∪ {(cR, cL ⊕ cR)}. The series of graphs created during the chunk construction process, going from GA0 to the final graph GAd with d constructed chunks, can be denoted as a graph generating path P(GA0, GAd) = (GA0, GA1, GA2, ..., GAd).

D.5 Learning the Hierarchy

The rational chunking model is initialized with a minimally complete belief set. The learning algorithm ranks the joint probability of every possible new chunk formed by concatenating chunks from its pre-existing belief set, and picks the one with the maximal joint occurrence probability on the basis of the current set of chunks as the next new chunk to enlarge the belief set. With the one-step agglomerated belief set, the learning model parses the sequence again. This process repeats until the chunks in the belief set pass the independence testing criterion.

Theorem 5 (Learning Guarantees on the Hierarchical Generative Model). As N → ∞, the chunk construction graph learned by the model is the same as the chunk construction graph G of the generative model, which entails that they have the same vertex set and the same edge set. Additionally, the belief set learned by the chunk learning model satisfies Bd = Ad, and the marginal probability MBd evaluated on the learned belief set for each chunk is the same as the marginal probability MAd imposed by the generative model on the generative belief set.

Proof: Given that all of the empirical estimates are the same as the true probabilities defined by the generative model, we prove that, starting with B0, the learning algorithm will learn BD = AD, where AD is the belief set imposed by the generative model. We approach this proof by induction.

Base step: As the chunk learner first acquires a minimal set of atomic chunks that can be used to explain the sequence, the set of elementary atomic chunks learned by the model is the same as the elementary alphabet imposed by the generative model, i.e. B0 = A0. Hence the root of the learned graph, which contains the nodes without parents, is the same as that of the generative graph; put differently, the two vertex sets at depth 0 coincide.

Additionally, the learning model approximates the probability of a specific atomic chunk as PA0(ai). As n → ∞, for all chunks c in the set of atomic elementary chunks in B0, the empirical probability evaluated on the support set is the same as the true probability assigned in the generative model with the alphabet set A0:

PB0(c) = lim_{n→∞} N0(c)/N0 = PA0(c)    (91)

Induction hypothesis: Assume that the learned belief set Bd at step d contains the same chunks as the alphabet set Ad in the generative model.


The HCM, by keeping track of the transition probability between any pair of chunks, calculates PBd(ci|cj) for all ci, cj in Bd. Afterwards, it finds the pair of chunks ci, cj such that the chunk created by combining ci and cj has the maximum joint probability violating the independence test, as the candidate chunks to be combined together.

PBd(ci ⊕ cj) = sup_{ci,cj∈Bd} PBd(ci) PBd(cj|ci)    (92)

We know that in the generative step the supremum of the joint probability on the support set Ad is picked to form the next chunk in the representation graph, so each step of the process at step d satisfies the condition that:

PAd(ci ⊕ cj) = sup_{ci,cj∈Ad} PAd(ci) PAd(cj|ci)    (93)

Since PAd(cj|ci) = PBd(cj|ci) and PAd(ci) = PBd(ci), the chunks ci and cj chosen by the learning model will be the same ones as those created in the generative model.

End step: The chunk learning process stops once the independence test has been passed, which means that the sequence is better explained by the current set of chunks than by any of the other possible next-step chunk combinations. This is the case once the chunk learning algorithm has learned a belief set Bd that is the same as the generative alphabet set Ad. At this point the learned chunk hierarchy graph is identical to the generative graph G. □

E Experiment Detail: Chunk Recovery and Convergence

To test the model’s learning behavior on this type of sequential data, random graphs of chunk hierarchies with an associated occurrence probability for each chunk are specified by the hierarchical generative process. To do so, an initial set of specified atomic chunks A0 and a pre-specified depth d (the number of new chunks) are used to initiate the generation of a random hierarchical generative graph G. In the end, the generative model generates a set of chunks A. In total, there are |A0| + d chunks in the generative alphabet A, with chunk c from the alphabet set having an occurrence probability of PA(c) on the sample space A.

Once the hierarchical generative model is specified, it is then used to produce training sequences with varying lengthN to test the chunk recovery.

The rational chunk model is trained on sequences S of increasing size (from 100 to 3000 in steps of 100) generated by the hierarchical generative graph, and it learns a hierarchical chunking graph. To test how good the representation learned by the chunking model is compared to the ground-truth generative model G, a discrete version of the Kullback–Leibler divergence is used to compare the ground-truth probability PA(c) of every chunk on the sample space of the ground-truth alphabet to the learned probability QA(c) of the chunking model.

KL(P ||Q) = ∑_{c∈A} PA(c) log2( PA(c) / QA(c) )    (94)

While PA is clearly defined by the generative model, QA(c) needs to be calculated from what the model has learned. Note that the set of chunks B learned by HCM may or may not be the same as A, as B also grows or shrinks at different learning stages.

To evaluate QA(c), the learned hierarchical chunk graph, together with the occurrence probability associated with each chunk, is used to produce “imagined” sequences of length 1000. Imagined sequences are sequences that HCM produces based on its learned representations. After that, the occurrence frequency of each chunk c ∈ A in the imagined sequences is used to evaluate QA(c), comparing the HCM's learned representation with the ground truth.

KL divergence is used to evaluate the deviation of the learned representations in the hierarchical chunking model from the original representations on the corresponding support set of the original representations.
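A sketch of this evaluation, assuming the ground-truth probabilities are given as a dictionary and the imagined sequence has already been parsed into chunks (the smoothing constant eps is our own addition for chunks the model never produces):

    import math
    from collections import Counter

    def kl_divergence(P_true, imagined_parse, alphabet, eps=1e-12):
        """Discrete KL of Eq. (94) between ground-truth chunk probabilities and
        the frequencies estimated from an imagined sequence."""
        counts = Counter(imagined_parse)
        total = sum(counts.values())
        Q = {c: counts.get(c, 0) / total for c in alphabet}
        return sum(P_true[c] * math.log2(P_true[c] / max(Q[c], eps))
                   for c in alphabet if P_true[c] > 0)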

For the comparison, we used the same sequence used for training HCM to train a 3-layer recurrent neural network (one embedding layer with 40 hidden units, one LSTM with a drop-out rate of 0.2, and one fully connected layer; batch size = 5, sequence length = 3, epochs = 1, so that the data used for training is the same as N) on the training sequence, and used it to generate imagined sequences, which were then used to calculate the KL divergence as before.

To generate the KL divergence plot in Figure 2, sequences of varying size N were used to train HCM. HCM was then used to generate imagined sequences so that the KL measurements could be taken.


F Experiment Detail: Visual Hierarchical Chunks

The visual hierarchical chunks are crafted by hand as binary arrays. Dark pixels correspond to an array value of 1 and the background to a value of 0. Each image in the generative hierarchy is 25-dimensional (5 x 5) in the visual domain and size 1 in the temporal domain. An empty array is included to denote no observation. The alphabet A of the generative model includes all 14 images in the generative hierarchy.

The probability of occurrence for each generative visual chunk is drawn from a flat Dirichlet distribution, with the empty observation retaining the highest mass; this emulates the fact that real-world observations are mostly sparse in the environment.

The parameters (α1, ..., αK) with K = |A| are all set to one, and PA(ci) for c1, ..., cK ∈ A are sampled from the probability density function

f(x1, ..., xK; α1, ..., αK) = (1/B(α)) ∏_{i=1}^{K} x_i^{αi−1}    (95)

where the beta function, expressed using the gamma function, is B(α) = ∏_{i=1}^{K} Γ(αi) / Γ(∑_{i=1}^{K} αi), and α = (α1, ..., αK).

To generate the sequence, images in the hierarchy are sampled from the occurrence distribution and appended to the end of the sequence. As a result, there are visual correlations in the sequence defined by the hierarchy, but temporally each image slice is sampled i.i.d. from the Dirichlet-drawn occurrence distribution.
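A sketch of this sampling and sequence-generation step with NumPy (the random seed, the sequence length, and the convention that index 0 denotes the empty observation are illustrative choices):

    import numpy as np

    rng = np.random.default_rng(0)
    K = 14                                          # number of images in the generative hierarchy
    p = np.sort(rng.dirichlet(np.ones(K)))[::-1]    # flat Dirichlet (all alpha_i = 1), sorted descending
    # index 0 stands for the empty observation here and thus receives the largest mass;
    # smaller chunks come next, mirroring the sorting described above
    sequence = rng.choice(K, size=1000, p=p)        # temporally i.i.d. indices of image slices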

Examples of chunk representations learned by the hierarchical chunking model at different stages are collected at the learning stages t = 10, t = 100, and t = 1000, respectively.

G Experiment Detail: Gif Movement

The gif file is converted into a [T x H x W] tensor, with T being the temporal dimension of the spatial-visual sequence. The entire movement consists of 10 frames of 25 x 25 images. To construct the training sequence for spatial-temporal chunks, each color in the gif file is mapped to a unique integer, with the background having a value of 0. In this way, the gif file is converted into a tensor of size 10 x 25 x 25. The entire movement is repeated 100 times and HCM is trained on the resulting sequence.
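A sketch of this conversion using imageio and NumPy (the file name is hypothetical, and all frames are assumed to share the same shape):

    import numpy as np
    import imageio

    frames = imageio.mimread("movement.gif")             # hypothetical file name
    video = np.stack(frames)                             # [T x H x W x C]
    flat = video.reshape(-1, video.shape[-1])
    colours, inverse = np.unique(flat, axis=0, return_inverse=True)
    inverse = inverse.ravel()
    # relabel colour ids so that the most frequent colour (the background) maps to 0
    order = np.argsort(-np.bincount(inverse))
    remap = np.empty_like(order)
    remap[order] = np.arange(len(order))
    tensor = remap[inverse].reshape(video.shape[:-1])    # [T x H x W] integer tensor
    training_sequence = np.tile(tensor, (100, 1, 1))     # repeat the 10-frame movement 100 times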

H Experiment Detail: The Hunger Games

The first book of The Hunger Games is stored as ’test_data.txt’ with the code in the supplementary material. Thetext file contains approximately 520,000 characters in total.

To convert the book into a temporal sequence, a unique mapping between each character and an integer is created. This sequence of integers is used to train the online HCM, with a forgetting rate of 0.99 and a chunk deletion threshold of 1e−5. Empty spaces are also mapped onto a nonzero integer to enable the model to learn chunks that include spaces. The HCM trains on segments of 1,000 characters at each step. More examples of learned chunks taken from M at different stages of learning are displayed in Table 2. Simple representation examples are taken from chunks learned after 10,000 characters of the book, intermediate examples are taken after 100,000 characters have been parsed, and complex chunks are taken when HCM reaches 300,000 characters of the book.
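A sketch of the preprocessing (the helper name and the sorted-vocabulary convention are ours; integer 0 is kept free for the empty observation):

    def text_to_sequence(path):
        """Map every character in the file, including spaces, to a nonzero integer id."""
        with open(path, encoding="utf-8") as f:
            text = f.read()
        char_to_id = {ch: i + 1 for i, ch in enumerate(sorted(set(text)))}
        return [char_to_id[ch] for ch in text], char_to_id

    sequence, vocab = text_to_sequence("test_data.txt")
    segments = [sequence[i:i + 1000] for i in range(0, len(sequence), 1000)]   # 1,000-character steps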


Table 2: Representation Learned in Hunger Games

Simple Chunks ’an’, ’in ’, ’be’, ’at’, ’me’, ’le’, ’a ’, ’ar’, ’re’, ’and ’, ’ve’, ’ing’, ’on’,’st’, ’se’, ’to ’, ’i ’, ’n ’, ’of ’, ’he’, ’my ’, ’te’, ’pe’, ’ou’,’we’, ’ad’, ’de’,’li’, ’oo’, ’bu’, ’fo’, ’ave’, ’the’, ’ce’, ’is’, ’as’,’il’, ’ch’, ’al’, ’no’, ’she’,’ing’, ’am’, ’ack’, ’we’, ’raw’, ’on the’,’day’, ’ear’, ’oug’, ’bea’, ’tree’,’sin’, ’that’, ’log’, ’ters’, ’wood’,’now’, ’was’, ’even’, ’leven’, ’ater’,’ever’, ’but’, ’ith’, ’ity’, ’if’,’the wood’, ’bell’, ’other’

Intermediate Chunks ’old’, ’gather’, ’as’, ’under’, ’way.’, ’day’, ’hunger’, ’very’, ’death’,’ping’, ’the seam’, ’add’, ’ally’, ’king’, ’lose’, ’sing’, ’loser’, ’money’,’man who’, ’in the’, ’says’, ’tome’, ’might’, ’rave’, ’even’, ’ick’,’wood’, ’he want’, ’for’, ’into’, ’leave’, ’reg’, ’lose to’, ’lock’, ’where’,’up’, ’gale’, ’older’, ’ask’, ’come’, ’raw’, ’real’, ’bed’, ’ing for’, ’from’,’link’, ’few’, ’close’, ’arrow’, ’ull’, ’cater’, ’this’, ’one’, ’almo’, ’lack’,’shop’, ’year’, ’ring’, ’cause’, ’is the’, ’ugh’, ’eve’, ’are’, ’the leap’,’lly’, ’still’, ’heal’, ’tow’, ’never’, ’try’, ’prim’, ’iting’, ’bread’, ’ould’, ’,but’, ’Now’, ’beg’, ’liday’, ’arm’, ’quick’, ’hot’, ’men’, ’know’, ’then’,’bell’, ’pan’, ’mother’, ’only’, ’war’, ’eak’, ’high’, ’read’, ’district 12’,’can’, ’would’, ’pas’,

Complex Chunks ’arent’, ’he may’, ’I want’, ’gather’, ’capitol’, ’been’, ’trip’, ’a baker’, ’, but’, ’madge’, ’heir’, ’mouth’, ’you can’, ’a few’, ’berries’, ’fully’, ’tribute’, ’I can’, ’cause of the’, ’feel’, ’to the baker’, ’its not just’, ’he want’, ’slim’, ’hand’, ’the ball’, ’quick’, ’green’, ’the last’, ’peace’, ’off’, ’you have to’, ’kill’, ’of the’, ’the back’, ’they’, ’scomers’, ’have been’, ’reaping’, ’as well’, ’, but the ’, ’cent’, ’thing’, ’I remember’, ’ally’, ’though’, ’again’, ’dont’, ’need’, ’in the school’, ’the pig’, ’in our’, ’them’, ’to remember’, ’fair’, ’bother’, ’at the’, ’older’, ’the square’, ’I know’, ’house’, ’its not’, ’once’, ’what was’, ’out of’, ’it is not just’, ’our district’, ’too’, ’I have’, ’it out’
