Understanding Hinton’s Capsule Networks. Part I: Intuition.

Max Pechyonkin November 3, 2017


Part of Understanding Hinton’s Capsule Networks Series:

Part I: Intuition (you are reading it now)
Part II: How Capsules Work
Part III: Dynamic Routing Between Capsules
Part IV: CapsNet Architecture

Quick announcement about our new publication AI³. We are getting the best writers together to talk about the Theory, Practice, and Business of AI and machine learning. Follow it to stay up to date on the latest trends.

1. Introduction

Last week, Geoffrey Hinton and his team published two papers that introduced a completely new type of neural network based on so-called capsules. In addition, the team published an algorithm, called dynamic routing between capsules, that makes it possible to train such a network.

Geoffrey Hinton has spent decades thinking about capsules. Source.

For everyone in the deep learning community, this is huge news, and for several reasons. First of all, Hinton is one of the founders of deep learning and an inventor of numerous models and algorithms that are widely used today. Secondly, these papers introduce something completely new, and this is very exciting because it will most likely stimulate an additional wave of research and very cool applications.

https://medium.com/ai%C2%B3-theory-practice-business/understanding-hintons-capsule-networks-part-i-intuition-b4b559d1159b

https://medium.com/ai%C2%B3-theory-practice-business/understanding-hintons-capsule-networks-part-ii-how-capsules-work-153b6ade9f66

https://medium.com/@pechyonkin/understanding-hintons-capsule-networks-part-iii-dynamic-routing-between-capsules-349f6d30418

https://medium.com/@pechyonkin/part-iv-capsnet-architecture-6a64422f7dce

https://medium.com/ai%C2%B3-theory-practice-business?source=logo-bc670b9a1aca---ab837e78463f

https://arxiv.org/abs/1710.09829

https://openreview.net/pdf?id=HJWLfGWRb

http://condo.ca/wp-content/uploads/2017/03/Vector-director-Institute-artificial-intelligence-Toronto-MaRS-Discovery-District-Hinton-Google-Apple-Siri-Alexa-Condo.ca_.jpg

In this post, I will explain why this new architecture is so important, as well as the intuition behind it. In the following posts I will dive into technical details.

However, before talking about capsules, we need to have a look at CNNs, which are the workhorse of today’s deep learning.

Architecture of CapsNet from the original paper.

2. CNNs Have Important Drawbacks

CNNs (convolutional neural networks) are awesome. They are one of the reasons deep learning is so popular today. They can do amazing things that people used to think computers would not be capable of doing for a long, long time. Nonetheless, they have their limits and they have fundamental drawbacks.

Let us consider a very simple and non-technical example. Imagine a face. What are the components? We have the face oval, two eyes, a nose and a mouth. For a CNN, the mere presence of these objects can be a very strong indicator that there is a face in the image. The orientation of these components and their spatial relationships relative to each other are not very important to a CNN.



https://en.wikipedia.org/wiki/Convolutional_neural_network

http://www.yaronhadad.com/deep-learning-most-amazing-applications/

To a CNN, both pictures are similar, since they both contain similar elements. Source.

How do CNNs work? The main component of a CNN is a convolutional layer. Its job is to detect important features in the image pixels. Lower layers (closer to the input) will learn to detect simple features such as edges and color gradients, whereas higher layers will combine simple features into more complex features. Finally, dense layers at the top of the network will combine very high level features and produce classification predictions.

An important thing to understand is that higher-level features combine lower-level features as a weighted sum: activations of a preceding layer are multiplied by the following layer neuron’s weights and added, before being passed to an activation nonlinearity. Nowhere in this setup is there a pose (translational and rotational) relationship between the simpler features that make up a higher level feature. The CNN approach to this issue is to use max pooling or successive convolutional layers that reduce the spatial size of the data flowing through the network and therefore increase the “field of view” of higher layers’ neurons, thus allowing them to detect higher order features in a larger region of the input image. Max pooling is a crutch that made convolutional networks work surprisingly well, achieving superhuman performance in many areas. But do not be fooled by its performance: while CNNs work better than any model before them, max pooling nonetheless loses valuable information.
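To make the information loss concrete, here is a minimal sketch in plain Python. The helper `max_pool_2x2` and the two toy feature maps are illustrative assumptions, not something taken from the papers. The two maps hold the same detections at different positions inside each pooling window, and after pooling they can no longer be told apart.

```python
def max_pool_2x2(feature_map):
    """Non-overlapping 2x2 max pooling over a list-of-lists feature map."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [
            max(
                feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1],
            )
            for c in range(0, cols, 2)
        ]
        for r in range(0, rows, 2)
    ]

# Hypothetical activations of a single feature detector on a 4x4 grid.
map_a = [
    [9, 0, 0, 0],
    [0, 0, 0, 7],
    [0, 0, 0, 0],
    [0, 5, 0, 0],
]
# The same detections, each moved to another cell of its own 2x2 window.
map_b = [
    [0, 0, 0, 7],
    [0, 9, 0, 0],
    [0, 5, 0, 0],
    [0, 0, 0, 0],
]

pooled_a = max_pool_2x2(map_a)
pooled_b = max_pool_2x2(map_b)
print(pooled_a)
print(pooled_b)
```

Both inputs pool to the same 2x2 grid, so whatever layer comes next cannot know where inside each window the feature was found.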

Hinton himself stated that the fact that max pooling is working so well is a big mistake and a disaster:

Hinton: “The pooling operation used in convolutional neural networks is a big mistake and the fact that it works so well is a disaster.”

Of course, you can do away with max pooling and still get good results with traditional CNNs, but they still do not solve the key problem:

The internal data representation of a convolutional neural network does not take into account important spatial hierarchies between simple and complex objects.

In the example above, the mere presence of 2 eyes, a mouth and a nose in a picture does not mean there is a face; we also need to know how these objects are oriented relative to each other.


http://sharenoesis.com/wp-content/uploads/2010/05/7ShapeFaceRemoveGuides.jpg


https://www.eetimes.com/document.asp?doc_id=1325712

https://www.reddit.com/r/MachineLearning/comments/2lmo0l/ama_geoffrey_hinton/clyj4jv/

3. Hardcoding 3D World into a Neural Net: Inverse Graphics Approach

Computer graphics deals with constructing a visual image from some internal hierarchical representation of geometric data. Note that the structure of this representation needs to take into account relative positions of objects. That internal representation is stored in the computer’s memory as arrays of geometrical objects and matrices that represent the relative positions and orientations of these objects. Then, special software takes that representation and converts it into an image on the screen. This is called rendering.

Computer graphics takes an internal representation of objects and produces an image. The human brain does the opposite. Capsule networks follow a similar approach to the brain. Source.

Inspired by this idea, Hinton argues that brains, in fact, do the opposite of rendering. He calls it inverse graphics: from visual information received by the eyes, they deconstruct a hierarchical representation of the world around us and try to match it with already learned patterns and relationships stored in the brain. This is how recognition happens. And the key idea is that the representation of objects in the brain does not depend on view angle.

So at this point the question is: how do we model these hierarchical relationships inside of a neural network? The answer comes from computer graphics. In 3D graphics, relationships between 3D objects can be represented by a so-called pose, which is in essence translation plus rotation.

Hinton argues that in order to correctly do classification and object recognition, it is important to preserve hierarchical pose relationships between object parts. This is the key intuition that will allow you to understand why capsule theory is so important. It incorporates relative relationships between objects, represented numerically as a 4D pose matrix.

When these relationships are built into the internal representation of data, it becomes very easy for a model to understand that the thing that it sees is just another view of something that it has seen before. Consider the image below. You can easily recognize that this is the Statue of Liberty, even though all the images show it from different angles. This is because the internal representation of the Statue of Liberty in your brain does not depend on the view angle. You have probably never seen these exact pictures of it, but you still immediately knew what it was.
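The viewpoint argument can be sketched with a few lines of plain Python. This is only an illustration under stated assumptions: it uses 2D poses stored as 3x3 homogeneous matrices instead of the pose matrices for 3D objects mentioned above, and the helpers `pose`, `matmul` and `inverse_rigid` as well as all numbers are made up for the example. It shows that the pose of a part relative to the whole (nose relative to face) stays the same when the viewpoint changes, even though every absolute pose changes.

```python
import math

def pose(angle_deg, tx, ty):
    """3x3 homogeneous matrix: rotation by angle_deg plus translation (tx, ty)."""
    a = math.radians(angle_deg)
    return [
        [math.cos(a), -math.sin(a), tx],
        [math.sin(a), math.cos(a), ty],
        [0.0, 0.0, 1.0],
    ]

def matmul(a, b):
    """Product of two 3x3 matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def inverse_rigid(m):
    """Inverse of a rotation-plus-translation matrix: [R^T, -R^T t]."""
    (r00, r01, tx), (r10, r11, ty) = m[0], m[1]
    return [
        [r00, r10, -(r00 * tx + r10 * ty)],
        [r01, r11, -(r01 * tx + r11 * ty)],
        [0.0, 0.0, 1.0],
    ]

# Hypothetical scene: a face somewhere in the world, a nose placed relative to it.
face_in_scene = pose(10, 5.0, 2.0)
nose_in_face = pose(0, 0.0, -1.0)
nose_in_scene = matmul(face_in_scene, nose_in_face)

# A change of viewpoint moves every object in the same way.
camera = pose(35, -3.0, 4.0)
face_seen = matmul(camera, face_in_scene)
nose_seen = matmul(camera, nose_in_scene)

# The part-to-whole relationship recovered from the new view is unchanged.
relative = matmul(inverse_rigid(face_seen), nose_seen)
print(relative)
```

The design point is that a model which stores `nose_in_face` rather than the absolute poses has nothing to relearn when the view angle changes.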


https://en.wikipedia.org/wiki/3D_computer_graphics

https://en.wikipedia.org/wiki/Rendering_%28computer_graphics%29

https://upload.wikimedia.org/wikipedia/commons/a/ad/Utah_teapot.png

https://youtu.be/TFIMqt0yT2I

http://helper.ipam.ucla.edu/publications/gss2012/gss2012_10754.pdf

https://en.wikipedia.org/wiki/Translation_%28geometry%29

https://en.wikipedia.org/wiki/Rotation_%28mathematics%29

http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/MARBLE/high/pose/express.htm

Your brain can easily recognize this is the same object, even though all photos are taken from different angles. CNNs do not have this capability.

For a CNN, this task is really hard because it does not have this built-in understanding of 3D space, but for a CapsNet it is much easier because these relationships are explicitly modeled. The paper that uses this approach was able to cut the error rate by 45% compared to the previous state of the art, which is a huge improvement.

Another benefit of the capsule approach is that it is capable of learning to achieve state-of-the-art performance using only a fraction of the data that a CNN would use (Hinton mentions this in his famous talk about what is wrong with CNNs). In this sense, the capsule theory is much closer to what the human brain does in practice. In order to learn to tell digits apart, the human brain needs to see only a couple of dozen examples, hundreds at most. CNNs, on the other hand, need tens of thousands of examples to achieve very good performance, which seems like a brute force approach that is clearly inferior to what we do with our brains.

4. What Took It so Long?

The idea is really simple, there is no way no one has come up with it before! And the truth is, Hinton has been thinking about this for decades. The reason why there were no publications is simply that there was no technical way to make it work before. One of the reasons is that computers were just not powerful enough in the pre-GPU era before around 2012. Another reason is that there was no algorithm that made it possible to implement and successfully train a capsule network (in the same fashion, the idea of artificial neurons had been around since the 1940s, but it was not until the mid-1980s that the backpropagation algorithm showed up and made it possible to successfully train deep networks).

In the same fashion, the idea of capsules itself is not that new and Hinton has mentioned it before, but there was no algorithm up until now to make it work. This algorithm is called “dynamic routing between capsules”. It allows capsules to communicate with each other and create representations similar to scene graphs in computer graphics.



https://youtu.be/rTawFwUvnLE

https://en.wikipedia.org/wiki/Backpropagation

https://en.wikipedia.org/wiki/Scene_graph

The capsule network is much better than other models at telling that images in the top and bottom rows belong to the same classes, only the view angle is different. The latest papers decreased the error rate by a whopping 45%. Source.

5. Conclusion

Capsules introduce a new building block that can be used in deep learning to better model hierarchical relationships inside the internal knowledge representation of a neural network. The intuition behind them is very simple and elegant.

Hinton and his team proposed a way to train such a network made up of capsules and successfully trained it on a simple data set, achieving state-of-the-art performance. This is very encouraging.

Nonetheless, there are challenges. Current implementations are much slower than other modern deep learning models. Time will show if capsule networks can be trained quickly and efficiently. In addition, we need to see if they work well on more difficult data sets and in different domains.

In any case, the capsule network is a very interesting and already working model which will definitely be developed further over time and contribute to further expansion of the deep learning application domain.

This concludes part one of the series on capsule networks. In Part II, the more technical part, I will walk you through the CapsNet’s internal workings step by step.



https://github.com/llSourcell/capsule_networks

https://github.com/XifengGuo/CapsNet-Keras

https://medium.com/@pechyonkin/understanding-hintons-capsule-networks-part-ii-how-capsules-work-153b6ade9f66


Understanding Hinton’s Capsule Networks. Part II: How Capsules Work.

medium.com/ai³-theory-practice-business/understanding-hintons-capsule-networks-part-ii-how-capsules-work-153b6ade9f66


Part I: Intuition
Part II: How Capsules Work (you are reading it now)
Part III: Dynamic Routing Between Capsules
Part IV: CapsNet Architecture


Introduction

In Part I of this series on capsule networks, I talked about the basic intuition and motivation behind the novel architecture. In this part, I will describe what a capsule is and how it works internally, as well as the intuition behind it. In the next part I will focus mostly on the dynamic routing algorithm.

What is a Capsule?

In order to answer this question, I think it is a good idea to refer to the first paper where capsules were introduced: “Transforming Autoencoders” by Hinton et al. The part that is important for understanding capsules is provided below:

“Instead of aiming for viewpoint invariance in the activities of “neurons” that use a single scalar output to summarize the activities of a local pool of replicated feature detectors, artificial neural networks should use local “capsules” that perform some quite complicated internal computations on their inputs and then encapsulate the results of these computations into a small vector of highly informative outputs. Each capsule learns to recognize an implicitly defined visual entity over a limited domain of viewing conditions and deformations and it outputs both the probability that the entity is present within its limited domain and a set of “instantiation parameters” that may include the precise pose, lighting and deformation of the visual entity relative to an implicitly defined canonical version of that entity. When the capsule is working properly, the probability of the visual entity being present is locally invariant – it does not change as the entity moves over the manifold of possible appearances within the limited domain covered by the capsule. The instantiation parameters, however, are “equivariant” – as the viewing conditions change and the entity moves over the appearance manifold, the instantiation parameters change by a corresponding amount because they are representing the intrinsic coordinates of the entity on the appearance manifold.”

The paragraph above is very dense, and it took me a while to figure out what it means, sentence by sentence. Below is my version of the above paragraph, as I understand it:

Artificial neurons output a single scalar. In addition, CNNs use convolutional layers that, for each kernel, replicate that same kernel’s weights across the entire input volume and then output a 2D matrix, where each number is the output of that kernel’s convolution with a portion of the input volume. So we can look at that 2D matrix as the output of a replicated feature detector. Then all kernels’ 2D matrices are stacked on top of each other to produce the output of a convolutional layer.

https://medium.com/@pechyonkin/understanding-hintons-capsule-networks-part-i-intuition-b4b559d1159b

http://www.cs.toronto.edu/~fritz/absps/transauto6.pdf

Not only can the CapsNet recognize digits, it can also generate them from internal representations. Source.

Then, we try to achieve viewpoint invariance in the activities of neurons. We do this by means of max pooling, which consecutively looks at regions in the above described 2D matrix and selects the largest number in each region. As a result, we get what we wanted: invariance of activities. Invariance means that when the input changes a little, the output stays the same. And activity is just the output signal of a neuron. In other words, when we shift the object that we want to detect by a little bit in the input image, the network’s activities (outputs of neurons) will not change because of max pooling, and the network will still detect the object.

The mechanism described above is not very good, because max pooling loses valuable information and also does not encode relative spatial relationships between features. We should use capsules instead, because they encapsulate all important information about the state of the features they are detecting in the form of a vector (as opposed to the scalar that a neuron outputs).

Capsules encapsulate all important information about the state of the feature they are detecting in vector form.

Capsules encode the probability of detection of a feature as the length of their output vector. The state of the detected feature is encoded as the direction in which that vector points (“instantiation parameters”). So when the detected feature moves around the image or its state somehow changes, the probability stays the same (the length of the vector does not change), but its orientation changes.

Imagine that a capsule detects a face in the image and outputs a 3D vector of length 0.99. Then we start moving the face across the image. The vector will rotate in its space, representing the changing state of the detected face, but its length will remain fixed, because the capsule is still sure it has detected a face. This is what Hinton refers to as activities equivariance: neuronal activities will change when an object “moves over the manifold of possible appearances” in the picture. At the same time, the probabilities of detection remain constant, which is the form of invariance that we should aim at, and not the type offered by CNNs with max pooling.
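A small sketch of this idea, using a 2D vector instead of the 3D one from the example to keep it short. The helpers `length` and `rotate_2d`, and the choice of a 40 degree rotation to stand for "the face moved", are assumptions made purely for illustration.

```python
import math

def length(v):
    """Euclidean length of a vector, read here as detection probability."""
    return math.sqrt(sum(x * x for x in v))

def rotate_2d(v, angle_deg):
    """Rotate a 2D vector; stands in for a change of the feature's state."""
    a = math.radians(angle_deg)
    x, y = v
    return [x * math.cos(a) - y * math.sin(a),
            x * math.sin(a) + y * math.cos(a)]

# Hypothetical capsule output for a face at one position in the image.
face_before = [0.99, 0.0]
# The same face after it moved: the vector points elsewhere ...
face_after = rotate_2d(face_before, 40)

# ... but its length, the probability that a face is present, is the same.
print(length(face_before), length(face_after))
```

The direction carries the equivariant part (what state the face is in) and the length carries the invariant part (whether there is a face at all).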

How does a capsule work?

Let us compare capsules with artificial neurons. The table below summarizes the differences between the capsule and the neuron:

Important differences between capsules and neurons. Source: author, inspired by the talk on CapsNets given by naturomics.

Recall that a neuron receives input scalars from other neurons, then multiplies them by scalar weights and sums them. This sum is then passed to one of the many possible nonlinear activation functions, which take the input scalar and output a scalar according to the function. That scalar will be the output of the neuron that will go as input to other neurons. The summary of this process can be seen in the table and on the right side of the diagram below. In essence, an artificial neuron can be described by 3 steps:

1. scalar weighting of input scalars
2. sum of weighted input scalars
3. scalar-to-scalar nonlinearity
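The three steps can be written as a short sketch. The function name `neuron`, the optional bias and the choice of a sigmoid as the nonlinearity are assumptions for illustration; any other scalar activation function would fit the same pattern.

```python
import math

def neuron(inputs, weights, bias=0.0):
    """Minimal artificial neuron following the three steps above."""
    # 1. scalar weighting of input scalars
    weighted = [w * x for w, x in zip(weights, inputs)]
    # 2. sum of weighted input scalars (plus an optional bias)
    total = sum(weighted) + bias
    # 3. scalar-to-scalar nonlinearity (sigmoid chosen as an example)
    return 1.0 / (1.0 + math.exp(-total))

# Hypothetical inputs and weights.
output = neuron([1.0, 2.0], [0.5, -0.25])
print(output)
```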


https://github.com/naturomics/CapsNet-Tensorflow

Left: capsule diagram; right: artificial neuron. Source: author, inspired by the talk on CapsNets given by naturomics.

On the other hand, the capsule has vector forms of the above 3 steps, in addition to a new step, the affine transform of the input:

1. matrix multiplication of input vectors
2. scalar weighting of input vectors
3. sum of weighted input vectors
4. vector-to-vector nonlinearity

Let’s have a closer look at the 4 computational steps happening inside the capsule.

1. Matrix Multiplication of Input Vectors

Input vectors that our capsule receives (u1, u2 and u3 in the diagram) come from 3 other capsules in the layer below. Lengths of these vectors encode probabilities that lower-level capsules detected their corresponding objects and directions of the vectors encode some internal state of the detected objects. Let us assume that lower level capsules detect eyes, mouth and nose respectively and our capsule detects a face.

These vectors are then multiplied by corresponding weight matrices W that encode important spatial and other relationships between lower level features (eyes, mouth and nose) and the higher level feature (face). For example, matrix W3j may encode the relationship between nose and face: the face is centered around its nose, its size is 10 times the size of the nose and its orientation in space corresponds to the orientation of the nose, because they all lie on the same plane. Similar intuitions can be drawn for matrices W1j and W2j. After multiplication by these matrices, what we get is the predicted position of the higher level feature. In other words, u1hat represents where the face should be according to the detected position of the eyes, u2hat represents where the face should be according to the detected position of the mouth and u3hat represents where the face should be according to the detected position of the nose.

At this point your intuition should go as follows: if these 3 predictions of lower level features point at the same position and state of the face, then there must be a face there.
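As a rough sketch of this step, the prediction of each lower level capsule is just a matrix-vector product. Everything numeric below is hypothetical: the 2D vectors, the 2x2 matrices and the use of cosine similarity to check agreement are assumptions chosen so that the three predictions nearly coincide.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def cosine(a, b):
    """Cosine similarity: close to 1 when two vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical outputs of the eyes, mouth and nose capsules.
u1, u2, u3 = [0.9, 0.0], [0.0, 0.8], [0.5, 0.5]

# Hypothetical learned part-to-face matrices.
W1j = [[0.7, 0.0], [0.7, 0.0]]
W2j = [[0.0, 0.75], [0.0, 0.8]]
W3j = [[1.2, 0.0], [0.0, 1.2]]

# Each product is one part's prediction of the face capsule's state.
u1_hat = matvec(W1j, u1)
u2_hat = matvec(W2j, u2)
u3_hat = matvec(W3j, u3)

print(cosine(u1_hat, u2_hat), cosine(u2_hat, u3_hat), cosine(u1_hat, u3_hat))
```

When all pairwise similarities are close to 1, the three parts agree on where the face is and what state it is in.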


https://github.com/naturomics/CapsNet-Tensorflow

Predictions for face location of nose, mouth and eyes capsules closely match: there must be a face there. Source: author, based on original image.

2. Scalar Weighting of Input Vectors

At first glance, this step seems very similar to the one where an artificial neuron weights its inputs before adding them up. In the neuron case, these weights are learned during backpropagation, but in the case of the capsule, they are determined using “dynamic routing”, which is a novel way to determine where each capsule’s output goes. I will dedicate a separate post to this algorithm and only offer some intuition here.


http://sharenoesis.com/wp-content/uploads/2010/05/7ShapeFaceRemoveGuides.jpg

A lower level capsule will send its output to the higher level capsule that “agrees” with it. This is the essence of the dynamic routing algorithm. Source.

In the image above, we have one lower level capsule that needs to “decide” to which higher level capsule it will send its output. It will make its decision by adjusting the weights C that will multiply this capsule’s output before sending it to either the left or the right higher-level capsule, J or K.

Now, the higher level capsules have already received many input vectors from other lower-level capsules. All these inputs are represented by red and blue points. Where these points cluster together, the predictions of lower level capsules are close to each other. This is why, for the sake of example, there is a cluster of red points in both capsules J and K.

So, where should our lower-level capsule send its output: to capsule J or to capsule K? The answer to this question is the essence of the dynamic routing algorithm. The output of the lower capsule, when multiplied by the corresponding matrix W, lands far from the red cluster of “correct” predictions in capsule J. On the other hand, it lands very close to the red cluster of “true” predictions in the right capsule K. The lower level capsule has a mechanism for measuring which upper level capsule better accommodates its results and will automatically adjust its weights in such a way that the weight C corresponding to capsule K will be high, and the weight C corresponding to capsule J will be low.

3. Sum of Weighted Input Vectors

This step is similar to the regular artificial neuron and represents a combination of inputs. I don’t think there is anything special about this step (except that it is a sum of vectors and not a sum of scalars). We can therefore move on to the next step.
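For completeness, a minimal sketch of the vector sum. The helper `weighted_sum`, the prediction vectors and the routing weights are hypothetical values picked for the example.

```python
def weighted_sum(vectors, coefficients):
    """Return sum_i c_i * vectors[i], computed dimension by dimension."""
    dim = len(vectors[0])
    return [
        sum(c * v[d] for c, v in zip(coefficients, vectors))
        for d in range(dim)
    ]

# Hypothetical predictions from three lower level capsules ...
u_hats = [[0.63, 0.63], [0.60, 0.64], [0.60, 0.60]]
# ... and hypothetical routing weights for this higher level capsule.
c = [0.5, 0.25, 0.25]

s_j = weighted_sum(u_hats, c)
print(s_j)
```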


https://youtu.be/rTawFwUvnLE?t=36m39s

4. “Squash”: Novel Vector-to-Vector Nonlinearity

Another innovation that CapsNet introduces is the novel nonlinear activation function that takes a vector and “squashes” it to have a length of no more than 1, without changing its direction.

Squashing nonlinearity scales input vector without changing its direction.

The right side of the equation (blue rectangle) scales the input vector so that it will have unit length, and the left side (red rectangle) performs additional scaling. Remember that the output vector length can be interpreted as the probability of a given feature being detected by the capsule.
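Written out, the squashing function from the dynamic routing paper is v = (‖s‖² / (1 + ‖s‖²)) · (s / ‖s‖): the second factor is the unit vector, the first factor is the additional scaling. Below is a minimal sketch in plain Python; the function name `squash`, the explicit handling of the zero vector and the sample inputs are assumptions of this example.

```python
import math

def squash(s):
    """Shrink vector s to a length below 1 while keeping its direction."""
    sq_norm = sum(x * x for x in s)
    if sq_norm == 0.0:
        # Assumption: map the zero vector to itself to avoid dividing by zero.
        return [0.0] * len(s)
    norm = math.sqrt(sq_norm)
    unit = [x / norm for x in s]          # s / ||s||: unit length
    scale = sq_norm / (1.0 + sq_norm)     # ||s||^2 / (1 + ||s||^2): extra scaling
    return [scale * x for x in unit]

def length(v):
    return math.sqrt(sum(x * x for x in v))

long_vector = squash([3.0, 4.0])     # input length 5
short_vector = squash([0.1, 0.0])    # input length 0.1
print(length(long_vector), length(short_vector))
```

Long inputs end up with a length close to 1 and short inputs with a length close to 0, which is what lets the length be read as a probability.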

Graph of the novel nonlinearity in its scalar form. In a real application the function operates on vectors. Source: author.

On the left is the squashing function applied to a 1D vector, which is a scalar. I included it to demonstrate the interesting nonlinear shape of the function.

It only makes sense to visualize the one-dimensional case; in a real application the function takes a vector and outputs a vector, which would be hard to visualize.


Conclusion

In this part we talked about what the capsule is, what kind of computation it performs, as well as the intuition behind it. We see that the design of the capsule builds upon the design of the artificial neuron, but expands it to vector form to allow for more powerful representational capabilities. It also introduces matrix weights to encode important hierarchical relationships between features of different layers. The result achieves the designer’s goal: neuronal activity equivariance with respect to changes in inputs and invariance in probabilities of feature detection.

Summary of the internal workings of the capsule. Note that there is no bias because it is already included in the W matrix that can accommodate it and other, more complex transforms and relationships. Source: author.

The only parts that remain to conclude the series on the CapsNet are the dynamic routing between capsules algorithm as well as the detailed walkthrough of the architecture of this novel network. These will be discussed in the following posts.



Understanding Hinton’s Capsule Networks. Part III: Dynamic Routing Between Capsules.

medium.com/ai³-theory-practice-business/understanding-hintons-capsule-networks-part-iii-dynamic-routing-between-capsules-349f6d30418


Part I: Intuition
Part II: How Capsules Work
Part III: Dynamic Routing Between Capsules (you are reading it now)
Part IV: CapsNet Architecture


Introduction

This is the third post in the series about a new type of neural network, based on capsules, called CapsNet. I already talked about the intuition behind it, as well as what a capsule is and how it works. In this post, I will talk about the novel dynamic routing algorithm that makes it possible to train capsule networks.

One of the earlier figures explaining capsules and routing between them. Source.

As I showed in Part II, a capsule i in a lower-level layer needs to decide how to send its output vector to higher-level capsules j. It makes this decision by changing the scalar weight c_ij that multiplies its output vector before it is treated as input to a higher-level capsule. Notation-wise, c_ij represents the weight that multiplies the output vector from lower-level capsule i on its way to becoming input to a higher level capsule j.

https://medium.com/ai%C2%B3-theory-practice-business/understanding-hintons-capsule-networks-part-iii-dynamic-routing-between-capsules-349f6d30418

http://helper.ipam.ucla.edu/publications/gss2012/gss2012_10754.pdf

Things to know about weights c_ij:

1. Each weight is a non-negative scalar
2. For each lower level capsule i, the sum of all weights c_ij equals 1
3. For each lower level capsule i, the number of weights equals the number of higher-level capsules
4. These weights are determined by the iterative dynamic routing algorithm

The first two facts allow us to interpret the weights in probabilistic terms. Recall that the length of a capsule’s output vector is interpreted as the probability of existence of the feature that this capsule has been trained to detect. The orientation of the output vector is the parametrized state of the feature. So, in a sense, for each lower level capsule i, its weights c_ij define a probability distribution of its output belonging to each higher level capsule j.

Recall: computations inside of a capsule as described in Part II of the series. Source: author.

Dynamic Routing Between Capsules

So, what exactly happens during dynamic routing? Let’s have a look at the description of the algorithm as published in the paper. But before we dive into the algorithm step by step, I want you to keep in mind the main intuition behind the algorithm:

A lower level capsule will send its output to the higher level capsule that “agrees” with it. This is the essence of the dynamic routing algorithm.

Now that we have this in mind, let’s go through the algorithm line by line.


Dynamic routing algorithm, as published in the original paper.

The first line says that this procedure takes all capsules in a lower level l and their outputs u_hat, as well as the number of routing iterations r. The very last line tells you that the algorithm will produce the output of a higher level capsule v_j. Essentially, this algorithm tells us how to calculate the forward pass of the network.

In the second line you will notice that there is a new coefficient b_ij that we haven’t seen before. This coefficient is simply a temporary value that will be iteratively updated; the routing weights c_ij are computed from it with a softmax. At the start of the procedure, b_ij is initialized to zero.

Line 3 says that the steps in 4–7 will be repeated r times (the number of routing iterations).

The step in line 4 calculates the value of vector c_i, which holds all routing weights for a lower level capsule i. This is done for all lower level capsules. Why softmax? Softmax makes sure that each weight c_ij is a non-negative number and that their sum equals one. Essentially, softmax enforces the probabilistic nature of the coefficients c_ij that I described above.

At the first iteration, the values of all coefficients c_ij will be equal, because on line two all b_ij are set to zero. For example, if we have 3 lower level capsules and 2 higher level capsules, then all c_ij will be equal to 0.5. The state of all c_ij being equal at initialization of the algorithm represents the state of maximum confusion and uncertainty: lower level capsules have no idea which higher level capsules will best fit their output. Of course, as the process is repeated these uniform distributions will change.
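The example with 3 lower level and 2 higher level capsules can be checked with a short sketch. The `softmax` helper below is a plain-Python stand-in written for this example (it subtracts the maximum for numerical stability), and the non-zero logits at the end are hypothetical.

```python
import math

def softmax(logits):
    """Turn a list of logits into non-negative weights that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

n_lower, n_higher = 3, 2

# Line 2 of the algorithm: all b_ij start at zero.
b = [[0.0] * n_higher for _ in range(n_lower)]

# Line 4: one softmax per lower level capsule i, over its higher level capsules j.
c = [softmax(b_i) for b_i in b]
print(c)

# Hypothetical logits after some updates: the weights are no longer uniform.
c_later = softmax([0.7, 0.0])
print(c_later)
```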

After all weights c_ij have been calculated for all lower level capsules, we can move on to line 5, where we look at higher level capsules. This step calculates a linear combination of input vectors, weighted by the routing coefficients c_ij determined in the previous step. Intuitively, this means scaling down input vectors and adding them together, which produces the vector s_j. This is done for all higher level capsules.

Next, in line 6 the vectors from the last step are passed through the squash nonlinearity, which makes sure the direction of the vector is preserved, but its length is enforced to be no more than 1. This step produces the output vector v_j for all higher level capsules.



http://cs231n.github.io/optimization-2/

https://en.wikipedia.org/wiki/Softmax_function

Dot product is an operation that takes 2 vectors and outputs a scalar. There are several scenarios possible for the two vectors of given lengths but different relative orientations: (a) largest possible positive dot product; (b) positive dot product; (c) zero dot product; (d) negative dot product; (e) largest possible negative dot product. You can think of the dot product as a measure of similarity in the context of CapsNets. Source: author.

To summarize what we have so far: steps 4–6 simply calculate the output of higher level capsules. The step on line 7 is where the weight update happens. This step captures the essence of the routing algorithm. It looks at each higher level capsule j, then examines each input and updates the corresponding weight b_ij according to the formula. The formula says that the new weight value equals the old value plus the dot product of the current output of capsule j and the input to this capsule from a lower level capsule i. The dot product measures the similarity between the input to the capsule and the output from the capsule. Also, remember from above that the lower level capsule will send its output to the higher level capsule whose output is similar. This similarity is captured by the dot product. After this step, the algorithm starts over from step 3 and repeats the process r times.
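To see how the dot product behaves as a similarity measure, the sketch below compares one fixed unit vector with unit vectors at increasing angles. The helpers `dot` and `unit` and the chosen angles are assumptions for illustration; they mirror the five scenarios listed in the figure caption.

```python
import math

def dot(a, b):
    """Dot product of two vectors of equal dimension."""
    return sum(x * y for x, y in zip(a, b))

def unit(angle_deg):
    """Unit vector in 2D pointing at the given angle."""
    a = math.radians(angle_deg)
    return [math.cos(a), math.sin(a)]

reference = unit(0)

# Same direction, acute angle, right angle, obtuse angle, opposite direction.
for angle in (0, 45, 90, 135, 180):
    print(angle, round(dot(reference, unit(angle)), 3))
```

The value falls from its largest positive value, through zero for perpendicular vectors, to its largest negative value as the two vectors turn away from each other.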

After r iterations, all outputs for higher level capsules have been calculated and the routing weights have been established. The forward pass can then continue to the next level of the network.
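The whole loop can be sketched in a few lines of NumPy. This is a simplified illustration under my own assumptions (a single example, no batching, the hypothetical function name `dynamic_routing`, and the inputs from lower level capsules already computed as `u_hat`), not the authors' implementation. The comments refer to the line numbers of the algorithm discussed above.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def squash(s, eps=1e-9):
    sq_norm = np.sum(s ** 2, axis=-1, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def dynamic_routing(u_hat, r=3):
    """u_hat has shape (num_lower, num_higher, dim). Returns outputs v and coefficients c."""
    num_lower, num_higher, _ = u_hat.shape
    b = np.zeros((num_lower, num_higher))          # line 2: weights b_ij start at zero
    for _ in range(r):                             # line 3: repeat r times
        c = softmax(b, axis=1)                     # line 4: c_ij sum to 1 for each lower capsule i
        s = np.einsum('ij,ijd->jd', c, u_hat)      # line 5: weighted sum for each higher capsule j
        v = squash(s)                              # line 6: squash
        b = b + np.einsum('ijd,jd->ij', u_hat, v)  # line 7: add dot product of input and output
    return v, c

# Toy usage with random inputs: 6 lower level capsules, 3 higher level capsules, 4-D vectors.
rng = np.random.default_rng(0)
v, c = dynamic_routing(rng.normal(size=(6, 3, 4)), r=3)
```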

Intuitive Example of Weight Update Step


https://en.wikipedia.org/wiki/Dot_product

Two higher level capsules with their outputs represented by purple vectors, and inputs represented by black and orange vectors. Lower level capsule with orange output will decrease the weight for higher level capsule 1 (left side) and increase the weight for higher level capsule 2 (right side). Source: author.

In the figure, imagine that there are two higher level capsules whose outputs are represented by purple vectors v1 and v2, calculated as described in the previous section. The orange vector represents input from one of the lower level capsules and the black vectors represent all the remaining inputs from other lower level capsules.

We see that in the left part the purple output v1 and the orange input u_hat point in opposite directions. In other words, they are not similar. This means their dot product will be a negative number and, as a result, the routing coefficient c_11 will decrease. In the right part, the purple output v2 and the orange input u_hat point in the same direction. They are similar. Therefore, the routing coefficient c_12 will increase. This procedure is repeated for all higher level capsules and for all inputs of each capsule. The result is a set of routing coefficients that best matches outputs from lower level capsules with outputs of higher level capsules.
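The same example can be checked numerically. The vectors below are made-up 2-D values chosen to mimic the figure, and for simplicity the same orange input is used for both higher level capsules; treat it as an illustration of the update rule only.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical values: v1 points away from the orange input, v2 points along it.
v1 = np.array([-0.6, 0.0])    # output of higher level capsule 1
v2 = np.array([0.8, 0.0])     # output of higher level capsule 2
u_hat = np.array([1.0, 0.0])  # orange input from lower level capsule 1

b_old = np.zeros(2)                                 # weights b_11 and b_12 before the update
b_new = b_old + np.array([u_hat @ v1, u_hat @ v2])  # [-0.6, 0.8]

c_old = softmax(b_old)  # [0.5, 0.5]
c_new = softmax(b_new)  # roughly [0.20, 0.80]
```

The negative dot product pushes c_11 below 0.5 and the positive one pushes c_12 above it.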

How Many Routing Iterations to Use?

The paper examined a range of values for both the MNIST and CIFAR data sets. The authors’ conclusion is two-fold:

1. More iterations tend to overfit the data
2. It is recommended to use 3 routing iterations in practice


Conclusion

In this article, I explained the dynamic routing by agreement algorithm that allows us to train the CapsNet. The most important idea is that similarity between input and output is measured as the dot product between the input and output of a capsule, and the routing coefficient is then updated correspondingly. Best practice is to use 3 routing iterations.

In the next post, I will walk you through the CapsNet architecture, where we will put together all the pieces of the puzzle that we have learned so far.


Max Pechyonkin February 21, 2018

Understanding Hinton’s Capsule Networks. Part IV: CapsNet Architecture

medium.com/@pechyonkin/part-iv-capsnet-architecture-6a64422f7dce


Part I: Intuition
Part II: How Capsules Work
Part III: Dynamic Routing Between Capsules
Part IV: CapsNet Architecture (you are reading it now)


Introduction

In this part, I will walk through the architecture of the CapsNet. I will also offer my shot at calculating the number of trainable parameters in the CapsNet. My resulting number is around 8.2 million trainable parameters, which is different from the 11.36 million officially referred to in the paper. The paper itself is not very detailed and hence it leaves some open questions about specifics of the network implementation that are as of today still unanswered because the authors did not provide their code. Nonetheless, I still think that counting parameters in a network is a good exercise for purely learning purposes, as it allows one to practice understanding all the building blocks of a particular architecture.

The CapsNet has 2 parts: an encoder and a decoder. The first 3 layers form the encoder and the last 3 form the decoder:

Layer 1. Convolutional layer
Layer 2. PrimaryCaps layer
Layer 3. DigitCaps layer
Layer 4. Fully connected #1
Layer 5. Fully connected #2
Layer 6. Fully connected #3

Part I. Encoder.







CapsNet encoder architecture. Source: original paper.

The encoder part of the network takes as input a 28 by 28 MNIST digit image and learns to encode it into a 16-dimensional vector of instantiation parameters (as explained in the previous posts of this series); this is where the capsules do their job. The output of the network during prediction is a 10-dimensional vector of the lengths of the DigitCaps’ outputs. The encoder has 3 layers: two of them are convolutional and the last one is fully connected.

Layer 1. Convolutional layer

Input: 28x28 image (one color channel).
Output: 20x20x256 tensor.
Number of parameters: 20992.

The convolutional layer’s job is to detect basic features in the 2D image. In the CapsNet, the convolutional layer has 256 kernels with a size of 9x9x1 and stride 1, followed by ReLU activation. If you don’t know what this means, here are some awesome resources that will allow you to quickly pick up the key ideas behind convolutions. To calculate the number of parameters, we also need to remember that each kernel in a convolutional layer has 1 bias term. Hence this layer has (9x9 + 1)x256 = 20992 trainable parameters in total.
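The output size and parameter count for this layer can be double-checked with a few lines of Python. The helper names are hypothetical, and the output-size formula assumes no padding, which is what the 28 to 20 reduction implies.

```python
def conv_output_size(input_size, kernel_size, stride):
    """Spatial output size of a convolution without padding."""
    return (input_size - kernel_size) // stride + 1

def conv_params(kernel_h, kernel_w, in_channels, out_channels):
    """Weights of every kernel plus one bias term per kernel."""
    return (kernel_h * kernel_w * in_channels + 1) * out_channels

conv1_out = conv_output_size(28, 9, 1)    # 20, so the output is 20x20x256
conv1_params = conv_params(9, 9, 1, 256)  # (9x9 + 1) x 256 = 20992
```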

Layer 2. PrimaryCaps layer

Input: 20x20x256 tensor.
Output: 6x6x8x32 tensor.
Number of parameters: 5308672.

This layer has 32 primary capsules whose job is to take the basic features detected by the convolutional layer and produce combinations of those features. The primary capsules are very similar in nature to a convolutional layer. Each capsule applies eight 9x9x256 convolutional kernels (with stride 2) to the 20x20x256 input volume and therefore produces a 6x6x8 output tensor. Since there are 32 such capsules, the output volume has a shape of 6x6x8x32. Doing a calculation similar to the one for the previous layer, we get 5308672 trainable parameters in this layer.
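Spelled out, the "similar calculation" looks like this. It is a sketch of the arithmetic only, assuming (as for the first layer) no padding and one bias term per kernel.

```python
kernels_per_capsule = 8
num_capsules = 32

# Spatial size: 9x9 kernel with stride 2 over a 20x20 input, no padding.
primary_out = (20 - 9) // 2 + 1  # 6
primary_shape = (primary_out, primary_out, kernels_per_capsule, num_capsules)

# Each kernel has 9x9x256 weights and 1 bias; there are 8 x 32 = 256 kernels.
primary_params = (9 * 9 * 256 + 1) * kernels_per_capsule * num_capsules  # 5308672
```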

Layer 3. DigitCaps layer

Input: 6x6x8x32 tensor.
Output: 16x10 matrix.
Number of parameters: 1497600.

This layer has 10 digit capsules, one for each digit. Each capsule takes as input a 6x6x8x32 tensor. You can think of it as 6x6x32 8-dimensional vectors, which is 1152 input vectors in total. As per the inner workings of



https://www.youtube.com/watch?v=ACU-T9L4_lI

https://arxiv.org/pdf/1603.07285.pdf

http://colah.github.io/posts/2014-07-Understanding-Convolutions/

http://setosa.io/ev/image-kernels/

the capsule (as described here), each of these input vectors gets its own 8x16 weight matrix that maps the 8-dimensional input space to the 16-dimensional capsule output space. So, there are 1152 matrices for each capsule, and also 1152 c coefficients and 1152 b coefficients used in the dynamic routing. Multiplying: 1152 x 8 x 16 + 1152 + 1152, we get 149760 trainable parameters per capsule; we then multiply by 10 to get the final number of parameters for this layer.
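The same count in Python, following the counting convention used in this article (the c and b coefficients are included alongside the weight matrices):

```python
num_inputs = 6 * 6 * 32   # 1152 input vectors, each 8-dimensional
in_dim, out_dim = 8, 16   # each input vector gets its own 8x16 weight matrix
num_digit_caps = 10

weights_per_capsule = num_inputs * in_dim * out_dim         # 147456
per_capsule = weights_per_capsule + num_inputs + num_inputs  # plus c and b: 149760
digitcaps_params = per_capsule * num_digit_caps              # 1497600
```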

The loss function

The loss function might look complicated at first sight, but it really is not. It is very similar to the SVM loss function. In order to understand the main idea of how it works, recall that the output of the DigitCaps layer is 10 sixteen-dimensional vectors. During training, for each training example, one loss value will be calculated for each of the 10 vectors according to the formula below, and then the 10 values will be added together to calculate the final loss. Because we are talking about supervised learning, each training example will have the correct label; in this case it will be a ten-dimensional one-hot encoded vector with 9 zeros and 1 one at the correct position. In the loss function formula, the correct label determines the value of T_c: it is 1 if the correct label corresponds to the digit of this particular DigitCap and 0 otherwise.

Color coded loss function equation. Source: author, based on original paper.
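For reference, the margin loss for a single DigitCap c, as I read it from the original paper, can be written as follows. Here v_c is the output vector of DigitCap c, and the constants m+ = 0.9, m- = 0.1 and lambda = 0.5 are the values discussed in this section.

```latex
L_c = T_c \, \max\!\left(0,\; m^{+} - \lVert \mathbf{v}_c \rVert\right)^{2}
    + \lambda \, (1 - T_c) \, \max\!\left(0,\; \lVert \mathbf{v}_c \rVert - m^{-}\right)^{2}
```

The total loss for one training example is the sum of L_c over the 10 DigitCaps.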

Suppose the correct label is 1; this means the first DigitCap is responsible for encoding the presence of the digit 1. For this DigitCap’s loss function T_c will be 1, and for all remaining nine DigitCaps T_c will be 0. When T_c is 1, the first term of the loss function is calculated and the second becomes zero. For our example, in order to calculate the first DigitCap’s loss, we take the length of the output vector of this DigitCap and subtract it from m+, which is fixed at 0.9. Then we keep the resulting value only if it is greater than zero and square it. Otherwise, we return 0. In other words, the loss will be zero if the correct DigitCap predicts the correct label with greater than 0.9 probability, and it will be non-zero if the probability is less than 0.9.



http://cs231n.github.io/linear-classify/

https://machinelearningmastery.com/how-to-one-hot-encode-sequence-data-in-python/

Loss function value for correct and incorrect DigitCap. Note that the red graph is “squashed” vertically compared to the green one. This is due to the lambda multiplier from the formula. Source: author.

For DigitCaps that do not match the correct label, T_c will be zero and therefore the second term will be evaluated (corresponding to the (1 - T_c) part). In this case we can see that the loss will be zero if the mismatching DigitCap predicts an incorrect label with probability less than 0.1, and non-zero if it predicts an incorrect label with probability more than 0.1.

Finally, the lambda coefficient is included in the formula for numerical stability during training (its value is fixed at 0.5). The two terms in the formula have squares because this loss function uses the L2 norm, and the authors apparently consider this norm to work better.
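Putting the two terms together, a minimal NumPy sketch of the loss for one training example might look like this. The function name, argument layout and toy values are my own assumptions; the constants 0.9, 0.1 and 0.5 are the ones discussed above.

```python
import numpy as np

def margin_loss(v, label, m_plus=0.9, m_minus=0.1, lam=0.5):
    """v: (10, 16) array of DigitCaps outputs; label: index of the correct DigitCap."""
    lengths = np.linalg.norm(v, axis=-1)  # one length per DigitCap
    t = np.zeros(len(lengths))
    t[label] = 1.0                        # one-hot T_c
    present = t * np.maximum(0.0, m_plus - lengths) ** 2
    absent = lam * (1.0 - t) * np.maximum(0.0, lengths - m_minus) ** 2
    return float(np.sum(present + absent))

# Toy example: put the chosen length on the first coordinate of each output vector.
lengths = np.full(10, 0.05)
lengths[1] = 0.5   # correct DigitCap is not confident enough
lengths[7] = 0.6   # one wrong DigitCap is too confident
v = np.zeros((10, 16))
v[:, 0] = lengths

loss = margin_loss(v, label=1)  # (0.9 - 0.5)^2 + 0.5 * (0.6 - 0.1)^2 = 0.285
```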

Part II. Decoder.

CapsNet decoder architecture. Source: original paper.

The decoder takes a 16-dimensional vector from the correct DigitCap and learns to decode it into an image of a digit (note that it only uses the correct DigitCap vector during training and ignores the incorrect ones). The decoder is used as a regularizer: it takes the output of the correct DigitCap as input and learns to recreate a 28 by 28 pixel image, with the loss function being the Euclidean distance between the reconstructed image and the input image. The decoder forces capsules to learn features that are useful for reconstructing the original image. The closer the reconstructed image is to the input image, the better. Examples of reconstructed images can be seen in the image below.
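As a rough sketch of these two ideas (using only the correct DigitCap, and comparing the reconstruction with the input), consider the following. Both helper names are hypothetical. Zeroing out the incorrect DigitCaps to obtain a 16x10 decoder input is my assumption about how "ignoring" is implemented, and I use the squared Euclidean distance (sum of squared pixel differences) for simplicity.

```python
import numpy as np

def mask_digitcaps(v, label):
    """Keep only the correct DigitCap's 16-D vector; zero out the other nine (assumed)."""
    masked = np.zeros_like(v)
    masked[label] = v[label]
    return masked.reshape(-1)  # flattened 16x10 = 160 decoder inputs

def reconstruction_loss(reconstruction, image):
    """Squared Euclidean distance between reconstructed and input image."""
    diff = reconstruction.reshape(-1) - image.reshape(-1)
    return float(np.sum(diff ** 2))

# Toy usage with made-up values.
v = np.arange(160, dtype=float).reshape(10, 16)
decoder_input = mask_digitcaps(v, label=3)
loss = reconstruction_loss(np.zeros((28, 28)), np.ones(784))  # 784 pixels, each off by 1
```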

Top row: original images. Bottom row: reconstructed images. Source: original paper.

Layer 4. Fully connected #1

Input: 16x10.
Output: 512.
Number of parameters: 82432.

Each output of the lower level gets weighted and directed into each neuron of the fully connected layer as input. Each neuron also has a bias term. For this layer there are 16x10 inputs that are all directed to each of the 512 neurons of this layer. Therefore, there are (16x10 + 1)x512 trainable parameters.

For the following two layers the calculation is the same: number of parameters = (number of inputs + bias) x number of neurons in the layer. This is why there is no explanation for fully connected layers 2 and 3.
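The formula is easy to verify for all three fully connected layers with a small helper (the function name is hypothetical):

```python
def dense_params(n_inputs, n_neurons):
    """(number of inputs + 1 bias) x number of neurons in the layer."""
    return (n_inputs + 1) * n_neurons

fc1_params = dense_params(16 * 10, 512)  # 82432
fc2_params = dense_params(512, 1024)     # 525312
fc3_params = dense_params(1024, 784)     # 803600
```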


Layer 5. Fully connected #2

Input: 512.
Output: 1024.
Number of parameters: 525312.


Layer 6. Fully connected #3

Input: 1024.
Output: 784 (which after reshaping gives back a 28x28 decoded image).
Number of parameters: 803600.

Total number of parameters in the network: 8238608.
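As a final sanity check, the per-layer numbers from this article can simply be added up. The dictionary keys are just labels I chose for readability.

```python
# Per-layer parameter counts as calculated in this article.
params = {
    "conv1": 20992,
    "primary_caps": 5308672,
    "digit_caps": 1497600,
    "fc1": 82432,
    "fc2": 525312,
    "fc3": 803600,
}

encoder_total = params["conv1"] + params["primary_caps"] + params["digit_caps"]  # 6827264
decoder_total = params["fc1"] + params["fc2"] + params["fc3"]                    # 1411344
total = encoder_total + decoder_total                                            # 8238608
```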

Conclusion

This wraps up the series on the CapsNet. There are many very good resources around the internet. If you would like to learn more on this fascinating topic, please have a look at this awesome compilation of links about CapsNets.



https://github.com/aisummary/awesome-capsule-networks
