Compressing Neural Networks with the Hashing Trick ?· Compressing Neural Networks with the Hashing…

  • Published on

  • View

  • Download


<ul><li><p>Compressing Neural Networks with the Hashing Trick</p><p>Wenlin Chen WENLINCHEN@WUSTL.EDUJames T. Wilson J.WILSON@WUSTL.EDUStephen Tyree STYREE@NVIDIA.COMKilian Q. Weinberger KILIAN@WUSTL.EDUYixin Chen CHEN@CSE.WUSTL.EDU Department of Computer Science and Engineering, Washington University in St. Louis, St. Louis, MO, USA NVIDIA, Santa Clara, CA, USA</p><p>AbstractAs deep nets are increasingly used in applica-tions suited for mobile devices, a fundamen-tal dilemma becomes apparent: the trend indeep learning is to grow models to absorb ever-increasing data set sizes; however mobile devicesare designed with very little memory and cannotstore such large models. We present a novel net-work architecture, HashedNets, that exploits in-herent redundancy in neural networks to achievedrastic reductions in model sizes. HashedNetsuses a low-cost hash function to randomly groupconnection weights into hash buckets, and allconnections within the same hash bucket sharea single parameter value. These parameters aretuned to adjust to the HashedNets weight sharingarchitecture with standard backprop during train-ing. Our hashing procedure introduces no ad-ditional memory overhead, and we demonstrateon several benchmark data sets that HashedNetsshrink the storage requirements of neural net-works substantially while mostly preserving gen-eralization performance.</p><p>1. IntroductionIn the past decade deep neural networks have set newperformance standards in many high-impact applications.These include object classification (Krizhevsky et al., 2012;Sermanet et al., 2013), speech recognition (Hinton et al.,2012), image caption generation (Vinyals et al., 2014;Karpathy &amp; Fei-Fei, 2014) and domain adaptation (Glo-rot et al., 2011b). As data sets increase in size, so dothe number of parameters in these neural networks in or-der to absorb the enormous amount of supervision (Coateset al., 2013). Increasingly, these networks are trained onindustrial-sized clusters (Le, 2013) or high-performancegraphics processing units (GPUs) (Coates et al., 2013).</p><p>Simultaneously, there has been a second trend as applica-tions of machine learning have shifted toward mobile andembedded devices. As examples, modern smart phones areincreasingly operated through speech recognition (Schus-ter, 2010), robots and self-driving cars perform objectrecognition in real time (Montemerlo et al., 2008), andmedical devices collect and analyze patient data (Lee &amp;Verma, 2013). In contrast to GPUs or computing clus-ters, these devices are designed for low power consumptionand long battery life. Most importantly, they typically havesmall working memory. For example, even the top-of-the-line iPhone 6 only features a mere 1GB of RAM.1</p><p>The disjunction between these two trends creates adilemma when state-of-the-art deep learning algorithms aredesigned for deployment on mobile devices. While it ispossible to train deep nets offline on industrial-sized clus-ters (server-side), the sheer size of the most effective mod-els would exceed the available memory, making it pro-hibitive to perform testing on-device. In speech recog-nition, one common cure is to transmit processed voicerecordings to a computation center, where the voice recog-nition is performed server-side (Chun &amp; Maniatis, 2009).This approach is problematic, as it only works when suf-ficient bandwidth is available and incurs artificial delaysthrough network traffic (Kosner, 2012). One solution is totrain small models for the on-device classification; how-ever, these tend to significantly impact accuracy (Chun &amp;Maniatis, 2009), leading to customer frustration.</p><p>This dilemma motivates neural network compression. Re-cent work by Denil et al. (2013) demonstrates that thereis a surprisingly large amount of redundancy among theweights of neural networks. The authors show that a smallsubset of the weights are sufficient to reconstruct the entirenetwork. They exploit this by training low-rank decompo-sitions of the weight matrices. Ba &amp; Caruana (2014) showthat deep neural networks can be successfully compressed</p><p>1</p><p>arX</p><p>iv:1</p><p>504.</p><p>0478</p><p>8v1 </p><p> [cs</p><p>.LG</p><p>] 1</p><p>9 A</p><p>pr 2</p><p>015</p><p></p></li><li><p>Compressing Neural Networks with the Hashing Trick</p><p>into shallow single-layer neural networks by training thesmall network on the (log-) outputs of the fully trained deepnetwork (Bucilu et al., 2006). Courbariaux et al. (2014)train neural networks with reduced bit precision, and, longpredating this work, LeCun et al. (1989) investigated drop-ping unimportant weights in neural networks. In summary,the accumulated evidence suggests that much of the infor-mation stored within network weights may be redundant.</p><p>In this paper we propose HashedNets, a novel networkarchitecture to reduce and limit the memory overhead ofneural networks. Our approach is compellingly simple:we use a hash function to group network connections intohash buckets uniformly at random such that all connec-tions grouped to the ith hash bucket share the same weightvalue wi. Our parameter hashing is akin to prior work infeature hashing (Weinberger et al., 2009; Shi et al., 2009;Ganchev &amp; Dredze, 2008) and is similarly fast and requiresno additional memory overhead. The backpropagation al-gorithm (LeCun et al., 2012) can naturally tune the hashbucket parameters and take into account the random weightsharing within the neural network architecture.</p><p>We demonstrate on several real world deep learning bench-mark data sets that HashedNets can drastically reduce themodel size of neural networks with little impact in predic-tion accuracy. Under the same memory constraint, Hashed-Nets have more adjustable free parameters than the low-rank decomposition methods suggested by Denil et al.(2013), leading to smaller drops in descriptive power.</p><p>Similarly, we also show that for a finite set of parametersit is beneficial to inflate the network architecture by re-using each parameter value multiple times. Best results areachieved when networks are inflated by a factor 816.The inflation of neural networks with HashedNets im-poses no restrictions on other network architecture designchoices, such as dropout regularization (Srivastava et al.,2014), activation functions (Glorot et al., 2011a; LeCunet al., 2012), or weight sparsity (Coates et al., 2011).</p><p>2. Feature HashingLearning under memory constraints has previously beenexplored in the context of large-scale learning for sparsedata sets. Feature hashing (or the hashing trick) (Wein-berger et al., 2009; Shi et al., 2009) is a technique tomap high-dimensional text documents directly into bag-of-word (Salton &amp; Buckley, 1988) vectors, which would oth-erwise require use of memory consuming dictionaries forstorage of indices corresponding with specific input terms.</p><p>Formally, an input vector x Rd is mapped into a featurespace with a mapping function :Rd Rk where kd.The mapping is based on two (approximately uniform)hash functions h :N {1, . . . , k} and :N {1,+1}</p><p>and the kth dimension of the hashed input x is defined ask(x) =</p><p>i:h(i)=k xi(i).</p><p>The hashing trick leads to large memory savings for tworeasons: it can operate directly on the input term stringsand avoids the use of a dictionary to translate words intovectors; and the parameter vector of a learning model liveswithin the much smaller dimensional Rk instead of Rd.The dimensionality reduction comes at the cost of colli-sions, where multiple words are mapped into the samedimension. This problem is less severe for sparse datasets and can be counteracted through multiple hashing (Shiet al., 2009) or larger hash tables (Weinberger et al., 2009).</p><p>In addition to memory savings, the hashing trick has the ap-pealing property of being sparsity preserving, fast to com-pute and storage-free. The most important property of thehashing trick is, arguably, its (approximate) preservationof inner product operations. The second hash function,, guarantees that inner products are unbiased in expecta-tion (Weinberger et al., 2009); that is,</p><p>E[(x)&gt;(x)] = x&gt;x. (1)</p><p>Finally, Weinberger et al. (2009) also show that the hash-ing trick can be used to learn multiple classifiers withinthe same hashed space. In particular, the authors use itfor multi-task learning and define multiple hash functions1, . . . , T , one for each task, that map inputs for their re-spective tasks into one joint space. Let w1, . . . ,wT denotethe weight vectors of the respective learning tasks, then ift 6= t a classifier for task t does not interfere with a hashedinput for task t; i.e. w&gt;t t(x) 0.</p><p>3. NotationThroughout this paper we type vectors in bold (x), scalarsin regular (C or b) and matrices in capital bold (X). Spe-cific entries in vectors or matrices are scalars and follow thecorresponding convention, i.e. the ith dimension of vectorx is xi and the (i, j)th entry of matrix V is Vij .</p><p>Feed Forward Neural Networks. We define the forwardpropagation of the `th layer in a neural networks as,</p><p>a`+1i = f(z`+1i ), where z</p><p>`+1i =</p><p>n`j=0</p><p>V `ija`j , (2)</p><p>where V` is the (virtual) weight matrix in the `th layer.The vectors z`, a` Rn` denote the activation units be-fore and after transformation through the transition func-tion f(). Typical activation functions are rectifier linearunit (ReLU) (Nair &amp; Hinton, 2010), sigmoid or tanh (Le-Cun et al., 2012).</p></li><li><p>Compressing Neural Networks with the Hashing Trick</p><p>4. HashedNetsIn this section we present HashedNets, a novel variation ofneural networks with drastically reduced model sizes (andmemory demands). We first introduce our approach as amethod of random weight sharing across the network con-nections and then describe how to facilitate it with the hash-ing trick to avoid any additional memory overhead.</p><p>4.1. Random weight sharing</p><p>In a standard fully-connected neural network, there are(n`+1)n`+1 weighted connections between a pair of lay-ers, each with a corresponding free parameter in the weightmatrix V`. We assume a finite memory budget per layer,K` (n` + 1) n`+1, that cannot be exceeded. The ob-vious solution is to fit the neural network within budget byreducing the number of nodes n`, n`+1 in layers `, `+ 1 orby reducing the bit precision of the weight matrices (Cour-bariaux et al., 2014). However if K` is sufficiently small,both approaches significantly reduce the ability of the neu-ral network to generalize (see Section 6). Instead, we pro-pose an alternative: we keep the size of V` untouched butreduce its effective memory footprint through weight shar-ing. We only allow exactly K` different weights to occurwithin V`, which we store in a weight vector w` RK` .The weights within w` are shared across multiple randomlychosen connections within V`. We refer to the resultingmatrix V` as virtual, as its size could be increased (i.e.nodes are added to hidden layer) without increasing the ac-tual number of parameters of the neural network.</p><p>Figure 1 shows a neural network with one hidden layer,four input units and two output units. Connections arerandomly grouped into three categories per layer and theirweights are shown in the virtual weight matrices V1 andV2. Connections belonging to the same color share thesame weight value, which are stored in w1 and w2, respec-tively. Overall, the entire network is compressed by a fac-tor 1/4, i.e. the 24 weights stored in the virtual matricesV1 and V2 are reduced to only six real values in w1 andw2. On data with four input dimensions and two output di-mensions, a conventional neural network with six weightswould be restricted to a single (trivial) hidden unit.</p><p>4.2. Hashed Neural Nets (HashedNets)</p><p>A nave implementation of random weight sharing can betrivially achieved by maintaining a secondary matrix con-sisting of each connections group assignment. Unfortu-nately, this explicit representation places an undesirablelimit on potential memory savings.</p><p>We propose to implement the random weight sharing as-signments using the hashing trick. In this way, the sharedweight of each connection is determined by a hash function</p><p>2.5 -0.7 -0.7 1.3</p><p>1.3 1.3 2.5 2.5</p><p>2.5 -0.7 1.3</p><p>real weights</p><p>virtual weight matrix</p><p>3.2 1.1 -0.5</p><p>real weights</p><p>1.1 3.2 3.2 -0.5</p><p>3.2 1.1 3.2 3.2</p><p>-0.5 1.1 3.2 1.1</p><p>3.2 -0.5 1.1 -0.5</p><p>virtual weight matrix</p><p>input</p><p>output</p><p>hiddenlayer</p><p>V1</p><p>V2</p><p>w1 w2</p><p>h1 h2</p><p>a11</p><p>a12</p><p>a13</p><p>a14 a24</p><p>a23</p><p>a22</p><p>a21</p><p>a31</p><p>a32</p><p>Figure 1. An illustration of a neural network with random weightsharing under compression factor 1</p><p>4. The 16+9 = 24 virtual</p><p>weights are compressed into 6 real weights. The colors representmatrix elements that share the same weight value.</p><p>that requires no storage cost with the model. Specifically,we assign to V `ij an element of w` indexed by a hash func-tion h`(i, j), as follows:</p><p>V `ij = w`h`(i,j), (3)</p><p>where the (approximately uniform) hash function h`(, )maps a key (i, j) to a natural number within {1, . . . ,K`}.In the example of Figure 1, h1(2, 1) = 1 and thereforeV 12,1=w</p><p>1=3.2. For our experiments we use the open-source implementation xxHash.2</p><p>4.3. Feature hashing versus weight sharing</p><p>This section focuses on a single layer throughout and tosimplify notation we will drop the super-scripts `. We willdenote the input activation as a = a` Rm of dimension-ality m=n`. We denote the output as z= z`+1 Rn withdimensionality n=n`+1.</p><p>To facilitate weight sharing within a feed forward neuralnetwork, we can simply substitute Eq. (3) into Eq. (2):</p><p>zi =</p><p>mj=1</p><p>Vijaj =</p><p>mj=1</p><p>wh(i,j)aj . (4)</p><p>Alternatively and more in line with previous work (Wein-berger et al., 2009), we may interpret HashedNets in termsof feature hashing. To compute zi, we first hash the acti-vations from the previous layer, a, with the hash mappingfunction i() :Rm RK . We then compute the innerproduct between the hashed representation i(a) and theparameter vector w,</p><p>zi = w&gt;i(a). (5)2</p><p></p></li><li><p>Compressing Neural Networks with the Hashing Trick</p><p>Both w and i(a) areK-dimensional, whereK is the num-ber of hash buckets in this layer. The hash mapping func-tion i is defined as follows. The kth element of i(a), i.e.[i(a)]k, is the sum of variables hashed into bucket k:</p><p>[i(a)]k =</p><p>j:h(i,j)=k</p><p>aj . (6)</p><p>Starting from Eq. (5), we show that the two interpretations(Eq. (4) and (5)) are equivalent:</p><p>zi =</p><p>Kk=1</p><p>wk [i(a)]k =Kk=1</p><p>wk</p><p>j:h(i,j)=k</p><p>aj</p><p>=</p><p>mj=1</p><p>Kk=1</p><p>wkaj[h(i,j)=k]</p><p>=</p><p>mj=1</p><p>wh(i,j)aj .</p><p>The final term is equivalent to Eq. (4).</p><p>Sign factor. With this equivalence between randomweight sharing and feature hashing on input activations,HashedNets inherit several beneficial properties of the fea-ture hashing. Weinberger et al. (2009) introduce an ad-ditional sign factor (i, j) to remove the bias of hashedinner-products due to collisions. For the same reasons wemultiply (3) by the sign factor (i, j) for parameterizingV (Weinberger et al., 2009):</p><p>Vij = wh(i,j)(i, j), (7)</p><p>where (i, j) : N 1 is a second hash function inde-pendent of h. Incorporating (i, j) to feature hashing andweight sharing does not change the equivalence betweenthem as the proof in the previous section still hol...</p></li></ul>


View more >