
Two Routes to Scalable Credit Assignment without Weight Symmetry

A. Code base

In this section we describe our implementation and highlight the technical details that allow its generality for use in any architecture. We used TensorFlow version 1.13.1 to conduct all experiments, and adhered to its interface. All code can be found at https://github.com/neuroailab/neural-alignment.

A.1. Layers

The essential idea of our code base is that by implementing custom layers that match the TensorFlow API but use custom operations for matrix multiplication (matmul) and two-dimensional convolution (conv2d), we can efficiently implement arbitrary feedforward networks using any credit assignment strategy with untied forward and backward weights. Our custom matmul and conv2d operations take in a forward and a backward kernel. They use the forward kernel for the forward pass, but use the backward kernel when propagating the gradient. To implement this, we leverage the @tf.custom_gradient decorator, which allows us to explicitly define the forward and backward passes for that op. Our Layer objects implement custom dense and convolutional layers which use the custom operations described above. Both layers take in the same arguments as the native TensorFlow layers, plus an additional argument for the learning rule.
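
To make this concrete, below is a minimal sketch of such an operation for the dense case, assuming TensorFlow 1.x graph mode. The function and variable names are illustrative rather than the repository's actual API, and the backward kernel's own training (which comes from the alignment loss) is elided.

import numpy as np
import tensorflow as tf  # TensorFlow 1.x graph API

@tf.custom_gradient
def asymmetric_matmul(x, forward_kernel, backward_kernel):
    # Forward pass uses only the forward kernel.
    y = tf.matmul(x, forward_kernel)

    def grad(dy):
        # The gradient with respect to the input is routed through the
        # *backward* kernel, so forward and backward weights are untied.
        dx = tf.matmul(dy, tf.transpose(backward_kernel))
        # The forward kernel still receives its usual pseudogradient.
        dw_forward = tf.matmul(tf.transpose(x), dy)
        # In this sketch the backward kernel gets no gradient from the task
        # loss; in the code base it is trained by the local alignment loss.
        db_backward = tf.zeros_like(backward_kernel)
        return dx, dw_forward, db_backward

    return y, grad

# Usage sketch with illustrative shapes.
x = tf.constant(np.random.randn(8, 128), tf.float32)
w_fwd = tf.constant(np.random.randn(128, 64), tf.float32)
w_bwd = tf.constant(np.random.randn(128, 64), tf.float32)
y = asymmetric_matmul(x, w_fwd, w_bwd)
dx = tf.gradients(tf.reduce_sum(y), x)[0]  # flows through w_bwd, not w_fwd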

A.2. Alignments

A learning rule is defined by the form of the layer-wise regularization R added to the model at each layer. The custom layers take an instance of an alignment class which, when called, will define its alignment-specific regularization and add it to the computational graph.

The learning rules are specializations of a parent Alignment object which implements a call method that creates the regularization function. The regularization function uses tensors that prevent the gradients from flowing to previous layers via tf.stop_gradient, keeping the alignment loss localized to a single layer. Implementation of the call method is delegated to subclasses, so that each can define its alignment-specific regularization as a weighted sum of primitives, each of which is defined as a function.
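
The pattern is roughly the following. This is an illustrative sketch, not the repository's exact class hierarchy, and the two primitives in ExampleAlignment are placeholders rather than the paper's actual primitive definitions.

import tensorflow as tf

class Alignment(object):
    """Parent class: subclasses define a layer-local regularization."""

    def __call__(self, forward_kernel, backward_kernel, layer_input):
        # Stop the gradient on the layer input so the alignment loss stays
        # local to this layer and never backpropagates into earlier layers.
        x = tf.stop_gradient(layer_input)
        reg = self.regularization(forward_kernel, backward_kernel, x)
        tf.add_to_collection('alignment_losses', reg)
        return reg

    def regularization(self, W, B, x):
        raise NotImplementedError

class ExampleAlignment(Alignment):
    """Illustrative subclass: a weighted sum of two placeholder primitives."""

    def __init__(self, alpha=1e-3, beta=2e-3):
        self.alpha = alpha
        self.beta = beta

    def regularization(self, W, B, x):
        p_match = tf.reduce_sum(tf.square(tf.stop_gradient(W) - B))
        p_decay = tf.reduce_sum(tf.square(B))
        return self.alpha * p_match + self.beta * p_decay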

A.3. Optimizers

The total network loss is defined as the sum of the global cost function J and the local alignment regularization R. The optimizer class provides a framework for specifying how to optimize each part of the total network loss as a function of the global step.

In the Optimizers folder you will find two important files:

• rate_scheduler.py defines a scheduler, a function of the global step, that allows you to adapt the components of the alignment weighting based on where the model is in training. If you do not pass in a scheduling function, it will by default return a constant rate.

• optimizers.py provides a class which takes in a list of optimizers, as well as a list of losses to optimize. Each loss element is optimized with the corresponding optimizer at each step in training, allowing you to have potentially different learning rate schedules for different components of the loss. Minibatching is also supported. A short sketch of this pattern follows below.
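
For illustration, a minimal version of the scheduler/optimizer pairing might look like the following. The function names here are ours; the real implementation restricts each optimizer to its own variable list and handles minibatching, and global_step is assumed to be a TensorFlow variable.

import tensorflow as tf

def constant_rate(rate):
    # Default behavior described above: ignore the global step.
    def schedule(global_step):
        return tf.constant(rate, tf.float32)
    return schedule

def build_train_op(losses, optimizers, global_step):
    # Pair each loss with its own optimizer so that, e.g., J can use
    # Nesterov momentum while R uses Adam on a different schedule.
    train_ops = [opt.minimize(loss) for loss, opt in zip(losses, optimizers)]
    with tf.control_dependencies(train_ops):
        return tf.assign_add(global_step, 1)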

B. Experimental Details

In what follows we describe the metaparameters we used to run each of the experiments reported above. For conciseness, we will only provide information on metaparameters that deviate from the defaults. Please refer to our source code for metaparameter defaults. Any defaults from TensorFlow correspond to those in version 1.13.1.

B.1. TPE search spaces

We detail the search spaces for each of the searches performed in §4. For each search, we trained approximately 60 distinct settings at a time using the HyperOpt package (Bergstra et al., 2011) with the ResNet-18 architecture and L2 weight decay of $\lambda = 10^{-4}$ (He et al., 2016) for 45 epochs, corresponding to the point in training midway between the first and second learning rate drops. Each model was trained on its own Tensor Processing Unit (TPUv2-8 and TPUv3-8).

We employed a form of Bayesian optimization, a Tree-structured Parzen Estimator (TPE), to search the space of continuous and categorical metaparameters (Bergstra et al., 2011). This algorithm constructs a generative model of $P[\text{score} \mid \text{configuration}]$ by updating a prior from a maintained history $H$ of metaparameter configuration-loss pairs. The fitness function that is optimized over models is the expected improvement, where a given configuration $c$ is meant to optimize $EI(c) = \int_{x < t} P[x \mid c, H]\,dx$. This choice of Bayesian optimization algorithm models $P[c \mid x]$ via a Gaussian mixture, and restricts us to tree-structured configuration spaces.
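
Concretely, a search space of the kind described in the following subsections can be written with HyperOpt roughly as follows. This is an illustrative sketch rather than the repository's actual search configuration, and train_and_evaluate is a hypothetical helper that trains a ResNet-18 for 45 epochs with the sampled metaparameters and returns its validation loss.

import numpy as np
from hyperopt import fmin, hp, tpe

space = {
    # Backward-pass Gaussian noise standard deviation, sampled uniformly.
    'sigma': hp.uniform('sigma', 1e-10, 1.0),
    # Ratio of the P_amp and P_decay weightings, sampled uniformly.
    'alpha_over_beta': hp.uniform('alpha_over_beta', 0.1, 200.0),
    # Weighting of P_decay, sampled log-uniformly.
    'beta': hp.loguniform('beta', np.log(1e-11), np.log(1e7)),
}

def objective(config):
    return train_and_evaluate(config)  # hypothetical training helper

best = fmin(objective, space, algo=tpe.suggest, max_evals=60)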

B.1.1. $R^{TPE}_{WM}$ SEARCH SPACE

Below is a description of the metaparameters and their ranges for the search that gave rise to $R^{TPE}_{WM}$ in Table 3.


• Gaussian input noise standard deviation $\sigma \in [10^{-10}, 1]$ used in the backward pass, sampled uniformly. $R^{TPE}_{WM}$ ended up setting $\sigma = 0.690465059717068$.

• Ratio between the weighting of $P_{amp}$ and $P_{decay}$ given by $\alpha/\beta \in [0.1, 200]$, sampled uniformly. $R^{TPE}_{WM}$ ended up setting $\alpha/\beta = 15.6606919586$.

• The weighting of $P_{decay}$ given by $\beta \in [10^{-11}, 10^{7}]$, sampled log-uniformly. $R^{TPE}_{WM}$ ended up setting $\beta = 0.028327320537228435$.

We fix all other metaparameters as prescribed by Akrout et al. (2019), namely batch centering the backward path inputs and forward path outputs in the backward pass, as well as applying a ReLU activation function and bias to the forward path but not to the backward path in the backward pass. To keep the learning rule fully local, we do not allow for any transport during the mirroring phase of the batch normalization mean and standard deviation, as Akrout et al. (2019) allow.

B.1.2. $R^{TPE}_{WM+AD}$ SEARCH SPACE

Below is a description of the metaparameters and their ranges for the search that gave rise to $R^{TPE}_{WM+AD}$ in Table 3.

• Training batch size $|B| \in \{256, 1024, 2048, 4096\}$. $R^{TPE}_{WM+AD}$ ended up setting $|B| = 256$. This choice also determines the forward path Nesterov momentum learning rate on the pseudogradient of the categorization objective J, as it is set to be $|B|/2048$ (which is 0.125 in the case of $R^{TPE}_{WM+AD}$); we linearly warm it up to this value for 6 epochs, followed by 90% decay at 30, 60, and 80 epochs, training for 100 epochs total, as prescribed by Buchlovsky et al. (2019). (See the schedule sketch after this list.)

• Alignment learning rate $\eta \in [10^{-6}, 10^{-2}]$, sampled log-uniformly. This parameter sets the adaptive learning rate on the Adam optimizer applied to the gradient of the alignment loss R, which is dropped synchronously by 90% decay at 30, 60, and 80 epochs along with the Nesterov momentum learning rate on the pseudogradient of the categorization objective J. $R^{TPE}_{WM+AD}$ ended up setting $\eta = 0.005268466140227959$.

• Number of delay epochs $d_e \in \{0, 1, 2\}$ for which we delay optimization of the categorization objective J and solely optimize the alignment loss R. If $d_e > 0$, we use the alignment learning rate $\eta$ during this delay period and the learning rate drops are shifted by $d_e$ epochs; otherwise, if $d_e = 0$, we linearly warm up $\eta$ for 6 epochs as well. $R^{TPE}_{WM+AD}$ ended up setting $d_e = 0$ epochs.

• Whether to alternate optimization of J and R each step, or instead to optimize both objectives simultaneously in a single training step. $R^{TPE}_{WM+AD}$ ended up performing this alternation per step.
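
The forward path learning rate schedule described in the first bullet can be written out as a small helper; this is a sketch (the repository drives schedules from the global step rather than the epoch index).

def forward_lr(epoch, batch_size, warmup_epochs=6, drops=(30, 60, 80)):
    # Base rate |B|/2048, linear warmup over the first 6 epochs,
    # then a 90% decay at each drop epoch.
    base = batch_size / 2048.0
    if epoch < warmup_epochs:
        return base * (epoch + 1) / float(warmup_epochs)
    n_drops = sum(epoch >= d for d in drops)
    return base * (0.1 ** n_drops)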

The remaining metaparameters and their ranges were the same as those from Appendix B.1.1:

• $R^{TPE}_{WM+AD}$ ended up setting the backward pass Gaussian input noise standard deviation $\sigma = 0.949983201505862$.

• $R^{TPE}_{WM+AD}$ ended up setting the ratio between the weighting of $P_{amp}$ and $P_{decay}$ given by $\alpha/\beta = 13.9035614049$.

• $R^{TPE}_{WM+AD}$ ended up setting the weighting of $P_{decay}$ given by $\beta = 2.8109006032182933 \times 10^{-8}$.

We fix the layer-wise operations as prescribed by Akrout et al. (2019), namely batch centering the backward path and forward path outputs in the backward pass, as well as applying a ReLU activation function and bias to the forward path but not to the backward path in the backward pass.

B.1.3. $R^{TPE}_{WM+AD+OPS}$ SEARCH SPACE

Below is a description of the metaparameters and their ranges for the search that gave rise to $R^{TPE}_{WM+AD+OPS}$ in Table 3. In this search, we expand the search space described in Appendix B.1.2 to include boolean choices over layer-wise operations performed in the backward pass, involving either the inputs, the forward path $f_l$ (involving only the forward weights $W_l$), or the backward path $b_l$ (involving only the backward weights $B_l$):

Use of biases in the forward and backward paths:

• Whether or not to use biases in the forward path. $R^{TPE}_{WM+AD+OPS}$ uses biases in the forward path.

• Whether or not to use biases in the backward path. $R^{TPE}_{WM+AD+OPS}$ does not use biases in the backward path.

Use of nonlinearities in the forward and backward paths:

• Whether or not to apply a ReLU to the forward path output. $R^{TPE}_{WM+AD+OPS}$ applies a ReLU to the forward path output.

• Whether or not to apply a ReLU to the backward path output. $R^{TPE}_{WM+AD+OPS}$ applies a ReLU to the backward path output.

Centering and normalization operations in the forward and backward paths:


• Whether or not to mean center (across the batch dimension) the forward path output $f_l = f_l - \bar{f}_l$. $R^{TPE}_{WM+AD+OPS}$ applies batch-wise mean centering to the forward path output.

• Whether or not to mean center (across the batch dimension) the backward path input. $R^{TPE}_{WM+AD+OPS}$ applies batch-wise mean centering to the backward path input.

• Whether or not to mean center (across the feature dimension) the forward path output $f_l = f_l - \bar{f}_l$. $R^{TPE}_{WM+AD+OPS}$ applies feature-wise mean centering to the forward path output.

• Whether or not to mean center (across the feature dimension) the backward path output $b_l = b_l - \bar{b}_l$. $R^{TPE}_{WM+AD+OPS}$ does not apply feature-wise mean centering to the backward path output.

• Whether or not to L2 normalize (across the feature dimension) the forward path output $f_l = (f_l - \bar{f}_l)/\|f_l - \bar{f}_l\|_2$. $R^{TPE}_{WM+AD+OPS}$ does not apply feature-wise L2 normalization to the forward path output.

• Whether or not to L2 normalize (across the feature dimension) the backward path output $b_l = (b_l - \bar{b}_l)/\|b_l - \bar{b}_l\|_2$. $R^{TPE}_{WM+AD+OPS}$ applies feature-wise L2 normalization to the backward path output.

Centering and normalization operations applied to the inputs to the backward pass:

• Whether or not to mean center (across the feature dimension) the backward pass input $x_l = x_l - \bar{x}_l$. $R^{TPE}_{WM+AD+OPS}$ does not apply feature-wise mean centering to the backward pass input.

• Whether or not to L2 normalize (across the feature dimension) the backward pass input $x_l = (x_l - \bar{x}_l)/\|x_l - \bar{x}_l\|_2$. $R^{TPE}_{WM+AD+OPS}$ does not apply feature-wise L2 normalization to the backward pass input.

The remaining metaparameters and their ranges were the same as those from Appendix B.1.2:

• $R^{TPE}_{WM+AD+OPS}$ ended up setting the training batch size $|B| = 256$. Therefore, as explained in Appendix B.1.2, the forward path Nesterov momentum learning rate is 0.125.

• $R^{TPE}_{WM+AD+OPS}$ ended up setting the alignment Adam learning rate to $\eta = 0.0024814785849458275$.

• $R^{TPE}_{WM+AD+OPS}$ ended up setting the number of delay epochs to be $d_e = 0$ epochs.

• $R^{TPE}_{WM+AD+OPS}$ ended up performing alternating minimization of J and R per training step.

• $R^{TPE}_{WM+AD+OPS}$ ended up setting the backward pass Gaussian input noise standard deviation $\sigma = 0.6401550883361206$.

• $R^{TPE}_{WM+AD+OPS}$ ended up setting the ratio between the weighting of $P_{amp}$ and $P_{decay}$ given by $\alpha/\beta = 0.13444256251$.

• $R^{TPE}_{WM+AD+OPS}$ ended up setting the weighting of $P_{decay}$ given by $\beta = 232.78563021186207$.

B.1.4. $R^{TPE}_{IA}$ SEARCH SPACE

Below is a description of the metaparameters and their ranges for the search that gave rise to $R^{TPE}_{IA}$ in Table 3. In this search, we expand the search space described in Appendix B.1.3 to now include the additional $P_{null}$ primitive.

• The weighting of $P_{null}$, sampled log-uniformly from $[10^{-11}, 10^{7}]$. $R^{TPE}_{IA}$ ended up setting this weighting to $3.1610192645608835 \times 10^{-6}$.

The remaining metaparameters and their ranges were the same as those from Appendix B.1.3:

• $R^{TPE}_{IA}$ uses biases in the forward path.

• $R^{TPE}_{IA}$ uses biases in the backward path.

• $R^{TPE}_{IA}$ applies a ReLU to the forward path output.

• $R^{TPE}_{IA}$ does not apply a ReLU to the backward path output.

• $R^{TPE}_{IA}$ applies batch-wise mean centering to the forward path output.

• $R^{TPE}_{IA}$ does not apply batch-wise mean centering to the backward path input.

• $R^{TPE}_{IA}$ applies feature-wise mean centering to the forward path output.

• $R^{TPE}_{IA}$ applies feature-wise mean centering to the backward path output.

• $R^{TPE}_{IA}$ does not apply feature-wise L2 normalization to the forward path output.

• $R^{TPE}_{IA}$ applies feature-wise L2 normalization to the backward path output.

• $R^{TPE}_{IA}$ does not apply feature-wise mean centering to the backward pass input.


• $R^{TPE}_{IA}$ does not apply feature-wise L2 normalization to the backward pass input.

• $R^{TPE}_{IA}$ ended up setting the training batch size $|B| = 256$. Therefore, as explained in Appendix B.1.2, the forward path Nesterov momentum learning rate is 0.125.

• $R^{TPE}_{IA}$ ended up setting the alignment Adam learning rate to $\eta = 0.009785729149984566$.

• $R^{TPE}_{IA}$ ended up setting the number of delay epochs to be $d_e = 1$ epoch.

• $R^{TPE}_{IA}$ ended up performing alternating minimization of J and R per training step.

• $R^{TPE}_{IA}$ ended up setting the backward pass Gaussian input noise standard deviation $\sigma = 0.8176668980121832$.

• $R^{TPE}_{IA}$ ended up setting the ratio between the weighting of $P_{amp}$ and $P_{decay}$ given by $\alpha/\beta = 129.112317983$.

• $R^{TPE}_{IA}$ ended up setting the weighting of $P_{decay}$ given by $\beta = 7.999046944679209$.

B.2. Symmetric and Activation Alignment metaparameters

We now describe the metaparameters used to generate Table 4. We used a batch size of 256 and, for the forward path, Nesterov momentum of 0.9 with a learning rate of 0.1 applied to the categorization objective J, warmed up linearly for 5 epochs, with learning rate drops at 30, 60, and 80 epochs, trained for a total of 90 epochs, as prescribed by He et al. (2016).

For Symmetric and Activation Alignment ($R^{SA}$ and $R^{AA}$), we used Adam on the alignment loss R with a learning rate of 0.001, along with the following weightings for their primitives:

• Symmetric Alignment: $\alpha = 10^{-3}$, $\beta = 2 \times 10^{-3}$

• Activation Alignment: $\alpha = 10^{-3}$, $\beta = 2 \times 10^{-3}$

We use biases in both the forward and backward paths of the backward pass, but do not employ a ReLU nonlinearity in either path.
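
In TensorFlow 1.x terms, this optimizer configuration corresponds roughly to the following sketch; warmup, learning rate drops, and the pairing of each optimizer with its loss (as in Appendix A.3) are omitted.

import tensorflow as tf

# Categorization objective J: Nesterov momentum 0.9, learning rate 0.1.
opt_J = tf.train.MomentumOptimizer(learning_rate=0.1, momentum=0.9,
                                   use_nesterov=True)

# Alignment loss R (Symmetric or Activation Alignment): Adam, learning rate 0.001.
opt_R = tf.train.AdamOptimizer(learning_rate=0.001)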

B.3. Noisy updates

We describe the metaparameters used in §5 to generate Fig. 5.

Figure S1. Noisy updates. (a) ResNet-18; (b) Deeper models. Activation Alignment is more robust to noisy updates than backpropagation for ResNet-18, as shown in (a). For deeper models the difference in performance between Activation Alignment and backpropagation vanishes, as shown in (b).

For backpropagation we used a momentum optimizer with an initial learning rate of 0.1, a standard batch size of 256, and learning rate drops at 30 and 60 epochs.

For Symmetric and Activation Alignment we used the same metaparameters as backpropagation for the categorization objective J, and an Adam optimizer with an initial learning rate of 0.001 and learning rate drops at 30 and 60 epochs for the alignment loss R. All other metaparameters were the same as described in Appendix B.2.

In all experiments we added the noise to the update given by the respective optimizers, scaled by the current learning rate, so that at learning rate drops the noise scaled appropriately. To account for the fact that the initial learning rate for the backpropagation experiments was 0.1, while for the Symmetric and Activation Alignment experiments it was 0.001, we shifted the latter two curves by $10^4$ to account for the effective difference in variance.
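
For a plain (momentum) SGD optimizer, whose update is the learning rate times the gradient, one simple way to implement noise of this kind is to perturb the gradients before they are applied, so that the injected noise in the parameter update is automatically scaled by the current learning rate. The sketch below is illustrative and is not the repository's exact implementation.

import tensorflow as tf

def noisy_minimize(optimizer, loss, noise_std):
    # Perturb each gradient with Gaussian noise; after the optimizer applies
    # its learning rate, the noise in the update is scaled by that rate.
    grads_and_vars = optimizer.compute_gradients(loss)
    noisy = [(g + tf.random_normal(tf.shape(g), stddev=noise_std), v)
             for g, v in grads_and_vars if g is not None]
    return optimizer.apply_gradients(noisy)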

B.4. Metaparameter importance quantification

We include here the set of discrete metaparameters that mattered the most across hundreds of models in our large-scale search, sorted from most to least important, plotted in Fig. S2. Specifically, these amount to choices of activation, layer-wise normalization, input normalization, and Gaussian noise in the forward and backward paths of the backward pass. The detailed labeling is given as follows: A: Whether or not to L2 normalize (across the feature dimension) the backward path outputs in the backward pass. B: Whether to use Gaussian noise in the backward pass inputs. C: Whether to solely optimize the alignment loss in the first 1-2 epochs of training. D, E: Whether or not to apply a non-linearity in the backward or forward path outputs in the backward pass, respectively. F: Whether or not to apply a bias in the forward path outputs (pre-nonlinearity). G, H: Whether or not to mean center or L2 normalize (across the feature dimension) the inputs to the backward pass. I: Same as A, but instead applied to the forward path outputs in the backward pass.

Figure S2. Analysis of important categorical metaparameters of the top performing local rule $R^{TPE}_{IA}$. Mean across models; error bars indicate SEM across models.

B.5. Neural Fitting Procedure

We fit trained model features to multi-unit array responses from Majaj et al. (2015). Briefly, we fit to 256 recorded sites from two monkeys. These came from three multi-unit arrays per monkey: one implanted in V4, one in posterior IT, and one in central and anterior IT. Each image was presented approximately 50 times, using rapid visual stimulus presentation (RSVP). Each stimulus was presented for 100 ms, followed by a mean gray background interleaved between images. Each trial lasted 250 ms. The image set consisted of 5120 images based on 64 object categories. Each image consisted of a 2D projection of a 3D model added to a random background. The pose, size, and x- and y-position of the object were varied across the image set, whereby two levels of variation were used (corresponding to medium and high variation from Majaj et al. (2015)). Multi-unit responses to these images were binned in 10 ms windows, averaged across trials of the same image, and normalized to the average response to a blank image. They were then averaged 70-170 ms post-stimulus onset, producing a set of (5120 images x 256 units) responses, which were the targets for our model features to predict. The 5120 images were split 75-25 within each object category into a training set and a held-out testing set.
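
The response preprocessing described above amounts to a few array operations. The following sketch assumes an illustrative array layout of responses[trial, time_bin, image, unit] with 10 ms bins; it is not the original analysis code.

import numpy as np

def preprocess(responses, blank_response, window=slice(7, 17)):
    # Average across repeated presentations of the same image.
    r = responses.mean(axis=0)
    # Normalize to the average response to a blank image.
    r = r / blank_response
    # Average the 10 ms bins covering 70-170 ms post-stimulus onset.
    r = r[window].mean(axis=0)
    return r  # shape (images, units), e.g. (5120, 256)

def split_within_category(categories, frac_train=0.75, seed=0):
    # 75-25 train/test split performed within each object category.
    rng = np.random.RandomState(seed)
    train = np.zeros(len(categories), dtype=bool)
    for c in np.unique(categories):
        idx = rng.permutation(np.where(categories == c)[0])
        train[idx[:int(frac_train * len(idx))]] = True
    return train, ~train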

C. Visualizations

In this section we present some visualizations which deepen the understanding of the weight dynamics and stability during training, as presented in §4 and §5. By looking at the weights of the network at each validation point, we are able to compare corresponding forward and backward weights (see Fig. S3) as well as to measure the angle between the vectorized forward and backward weight matrices to quantify their degree of alignment (see Fig. S4). Their similarity in terms of scale can also be evaluated by looking at the ratio of the Frobenius norm of the backward weight matrix to the forward weight matrix, $\|B_l\|_F / \|W_l\|_F$. Separately plotting these metrics in terms of model depth sheds some insight into how different layers behave.
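
Both quantities are straightforward to compute from weight checkpoints; a small sketch (ours, not the repository's plotting code):

import numpy as np

def alignment_metrics(W, B):
    # Angle (in degrees) between the vectorized forward and backward weights,
    # and the ratio of their Frobenius norms, as plotted in Figs. S3 and S4.
    w, b = W.ravel(), B.ravel()
    cos = np.dot(w, b) / (np.linalg.norm(w) * np.linalg.norm(b))
    angle = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    ratio = np.linalg.norm(B) / np.linalg.norm(W)
    return angle, ratio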

Figure S3. Learning symmetry. Weight values of the third convolutional layer in ResNet-18 throughout training with various learning rules: (a-c) SA, (d-f) AA, (g-i) WM, and (j-l) IA, each shown at epochs 0, 2, and 90. Each dot represents an element in layer $l$'s weight matrix and its $(x, y)$ location corresponds to its forward and backward weight values, $(W_l^{(i,j)}, B_l^{(j,i)})$. The dotted diagonal line shows perfect weight symmetry, as is the case in backpropagation.


Figure S4. Weight metrics during training for (a) Symmetric Alignment, (b) Activation Alignment, (c) Weight Mirror, and (d) Information Alignment. In each panel, metrics are plotted over 90 epochs of training, with curves colored by layer depth (early, middle, late). Figures in the left column show the angle between the forward and the backward weights at each layer, depicting their degree of alignment. Figures in the right column show the ratio of the Frobenius norm of the backward weights to the forward weights, $\|B_l\|_F / \|W_l\|_F$, during training. For Symmetric Alignment (a) we can clearly see how the weights align very early during training, with the learning rate drops allowing them to further decrease. Additionally, the sizes of the forward and backward weights also remain at the same scale during training. Activation Alignment (b) shows behavior similar to (a), though some of the earlier layers fail to align as closely as in the Symmetric Alignment case. Weight Mirror (c) shows alignment happening within the first few epochs, though some of the later layers don't align as closely. Looking at the size of the weights during training, we can observe the unstable dynamics explained in §4, with exploding and collapsing weight values (Fig. 2) within the first few epochs of training. Information Alignment (d) shows a similar ordering in alignment as weight mirror, but overall alignment does improve throughout training, with all layers aligning within 5 degrees. Compared to weight mirror, the norms of the weights are more stable, with the backward weights becoming smaller than their forward counterparts towards the end of training.


D. Further Analysis

D.1. Instability of Weight Mirror

As explained in §4, the instability of weight mirror can be understood by considering the dynamical system given by the symmetrized gradient flow on $R^{SA}$, $R^{AA}$, and $R^{WM}$ at a given layer $l$. By symmetrized gradient flow we mean the gradient dynamics on the loss R modified such that it is symmetric in both the forward and backward weights. We ignore biases and non-linearities and set $\alpha = \beta$ for all three losses.

When the weights, $w_l$ and $b_l$, and input, $x_l$, are all scalar values, the gradient flow for all three losses gives rise to the dynamical system

$$\frac{\partial}{\partial t}\begin{bmatrix} w_l \\ b_l \end{bmatrix} = -A \begin{bmatrix} w_l \\ b_l \end{bmatrix}.$$

For Symmetric Alignment and Activation Alignment, $A$ is respectively the positive semidefinite matrix

$$A_{SA} = \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix} \quad\text{and}\quad A_{AA} = \begin{bmatrix} x_l^2 & -x_l^2 \\ -x_l^2 & x_l^2 \end{bmatrix}.$$

For weight mirror, $A$ is the symmetric indefinite matrix

$$A_{WM} = \begin{bmatrix} \lambda_{WM} & -x_l^2 \\ -x_l^2 & \lambda_{WM} \end{bmatrix}.$$

In all three cases $A$ can be diagonally decomposed by the eigenbasis

$$\{u, v\} = \left\{\begin{bmatrix} 1 \\ 1 \end{bmatrix}, \begin{bmatrix} 1 \\ -1 \end{bmatrix}\right\},$$

where $u$ spans the symmetric component and $v$ spans the skew-symmetric component of any realization of the weight vector $\begin{bmatrix} w_l & b_l \end{bmatrix}^\top$.

As explained in §4, under this basis, the dynamical system decouples into a system of ODEs governed by the eigenvalues $\lambda_u$ and $\lambda_v$ associated with $u$ and $v$. For all three learning rules, $\lambda_v > 0$ ($\lambda_v$ is respectively $1$, $x_l^2$, and $\lambda_{WM} + x_l^2$ for SA, AA, and weight mirror). For SA and AA, $\lambda_u = 0$, while for weight mirror $\lambda_u = \lambda_{WM} - x_l^2$.
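
Written out, this decoupling is immediate (a short derivation sketch consistent with the eigenvalues above). Expressing the weight vector in the eigenbasis,

$$\begin{bmatrix} w_l \\ b_l \end{bmatrix} = s\,u + d\,v, \qquad \text{so that} \qquad \dot{s} = -\lambda_u s, \qquad \dot{d} = -\lambda_v d,$$

the skew-symmetric component $d(t) = d(0)e^{-\lambda_v t}$ always decays, while the symmetric component $s(t) = s(0)e^{-\lambda_u t}$ is preserved for SA and AA (where $\lambda_u = 0$). For weight mirror it grows when $\lambda_{WM} < x_l^2$ and shrinks toward zero when $\lambda_{WM} > x_l^2$, consistent with the exploding and collapsing weight values discussed in §4.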

D.2. Beyond Feedback Alignment

An underlying assumption of our work is that certain forms of layer-wise regularization, such as the regularization introduced by Symmetric Alignment, can actually improve the performance of feedback alignment by introducing dynamics on the backward weights. To understand these improvements, we build off of prior analyses of backpropagation (Saxe et al., 2013) and feedback alignment (Baldi et al., 2018).

Consider the simplest nontrivial architecture: a two layer scalar linear network with forward weights $w_1$, $w_2$, and backward weight $b$. The network is trained with scalar data $\{x_i, y_i\}_{i=1}^n$ on the mean squared error cost function

$$J = \sum_{i=1}^n \frac{1}{2n}\left(y_i - w_2 w_1 x_i\right)^2.$$

The gradient flow of this network gives the coupled system of differential equations on $(w_1, w_2, b)$

$$\dot{w}_1 = b\,(\alpha - w_2 w_1 \beta) \qquad (2)$$
$$\dot{w}_2 = w_1(\alpha - w_2 w_1 \beta) \qquad (3)$$

where $\alpha = \frac{\sum_{i=1}^n y_i x_i}{n}$ and $\beta = \frac{\sum_{i=1}^n x_i^2}{n}$. For backpropagation the dynamics are constrained to the hyperplane $b = w_2$, while for feedback alignment the dynamics are contained on the hyperplane $b = b(0)$ given by the initialization. For Symmetric Alignment, an additional differential equation

$$\dot{b} = w_2 - b, \qquad (4)$$

attracts all trajectories to the backpropagation hyperplane $b = w_2$.
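
These flows are simple enough to integrate numerically; the following is a minimal forward-Euler sketch with illustrative initial conditions and data statistics, not a script from the repository.

import numpy as np

def simulate(rule, alpha=1.0, beta=1.0, dt=1e-3, steps=20000,
             w1=0.1, w2=0.1, b=-0.5):
    # rule is one of 'backprop', 'feedback', 'symmetric'.
    for _ in range(steps):
        if rule == 'backprop':
            b = w2                                        # b tied to w2
        err = alpha - w2 * w1 * beta
        dw1 = b * err                                     # eq. (2)
        dw2 = w1 * err                                    # eq. (3)
        db = (w2 - b) if rule == 'symmetric' else 0.0     # eq. (4); b frozen for feedback
        w1, w2, b = w1 + dt * dw1, w2 + dt * dw2, b + dt * db
    return w1, w2, b

for rule in ('backprop', 'feedback', 'symmetric'):
    print(rule, simulate(rule))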

To understand the properties of these alignment strategies, we explore the fixed points of their flow. From equations (2) and (3) we see that both equations are zero on the hyperbola

$$w_2 w_1 = \frac{\alpha}{\beta},$$

which is the set of minima of $J$. From equation (4) we see that all fixed points of Symmetric Alignment satisfy $b = w_2$. Thus, all three alignment strategies have fixed points on the hyperbola of minima intersected with either the hyperplane $b = b(0)$ in the case of feedback alignment or $b = w_2$ in the case of backpropagation and Symmetric Alignment.

In addition to these non-zero fixed points, equations (2) and (3) are zero if $b$ and $w_1$ are zero, respectively. For backpropagation and Symmetric Alignment this also implies $w_2 = 0$; however, for feedback alignment $w_2$ is free to take any value. Thus, all three alignment strategies have rank-deficient fixed points at the origin $(0, 0, 0)$ and, in the case of feedback alignment, more generally on the hyperplane $b = w_1 = 0$.

To understand the stability of these fixed points we consider the local linearization of the vector field by computing the Jacobian matrix⁴

$$\mathbf{J} = \begin{bmatrix} \partial_{w_1}\dot{w}_1 & \partial_{w_1}\dot{w}_2 & \partial_{w_1}\dot{b} \\ \partial_{w_2}\dot{w}_1 & \partial_{w_2}\dot{w}_2 & \partial_{w_2}\dot{b} \\ \partial_{b}\dot{w}_1 & \partial_{b}\dot{w}_2 & \partial_{b}\dot{b} \end{bmatrix}.$$

A source of the gradient flow is characterized by non-negative eigenvalues of $\mathbf{J}$, a sink by non-positive eigenvalues of $\mathbf{J}$, and a saddle by both positive and negative eigenvalues of $\mathbf{J}$.

⁴In the case that the vector field is the negative gradient of a loss, as in backpropagation, this is the negative Hessian of the loss.


On the hyperbola $w_2 w_1 = \frac{\alpha}{\beta}$, the Jacobian matrices for the three alignment strategies have the corresponding eigenvalues:

• Backpropagation: $\lambda_1 = -\left(w_1^2 + w_2^2\right)\beta$, $\lambda_2 = 0$.

• Feedback alignment: $\lambda_1 = -\left(w_1^2 + b\,w_2\right)\beta$, $\lambda_2 = 0$, $\lambda_3 = 0$.

• Symmetric Alignment: $\lambda_1 = -\left(w_1^2 + w_2^2\right)\beta$, $\lambda_2 = 0$, $\lambda_3 = -1$.

Thus, for backpropagation and Symmetric Alignment, all minima of the cost function $J$ are sinks, while for feedback alignment the stability of the minima depends on the sign of $w_1^2 + b\,w_2$.

From this simple example there are two major takeaways:

1. All minima of the cost function $J$ are sinks of the flow given by backpropagation and Symmetric Alignment, but only some minima are sinks of the flow given by feedback alignment.

2. Backpropagation and Symmetric Alignment have the exact same critical points, but feedback alignment has a much larger space of rank-deficient critical points.

Thus, even in this simple example it is clear that certain dynamics on the backward weights can have a stabilizing effect on feedback alignment.