Soft Threshold Weight Reparameterization for Learnable Sparsity
Aditya Kusupati 1  Vivek Ramanujan *2  Raghav Somani *1  Mitchell Wortsman *1  Prateek Jain 3  Sham Kakade 1  Ali Farhadi 1

Abstract

Sparsity in Deep Neural Networks (DNNs) is studied extensively with the focus of maximizing prediction accuracy given an overall parameter budget. Existing methods rely on uniform or heuristic non-uniform sparsity budgets which have sub-optimal layer-wise parameter allocation, resulting in a) lower prediction accuracy or b) higher inference cost (FLOPs). This work proposes Soft Threshold Reparameterization (STR), a novel use of the soft-threshold operator on DNN weights. STR smoothly induces sparsity while learning pruning thresholds, thereby obtaining a non-uniform sparsity budget. Our method achieves state-of-the-art accuracy for unstructured sparsity in CNNs (ResNet50 and MobileNetV1 on ImageNet-1K) and, additionally, learns non-uniform budgets that empirically reduce the FLOPs by up to 50%. Notably, STR boosts the accuracy over existing results by up to 10% in the ultra sparse (99%) regime and can also be used to induce low-rank (structured sparsity) in RNNs. In short, STR is a simple mechanism which learns effective sparsity budgets that contrast with popular heuristics. Code, pretrained models and sparsity budgets are at https://github.com/RAIVNLab/STR.

1. Introduction

Deep Neural Networks (DNNs) are the state-of-the-art models for many important tasks in the domains of Computer Vision, Natural Language Processing, etc. To enable highly accurate solutions, DNNs require large model sizes resulting in huge inference costs, which many times become the main bottleneck in the real-world deployment of the solutions. During inference, a typical DNN model stresses the following aspects of the compute environment: 1) RAM - working memory, 2) Processor compute - Floating Point Operations (FLOPs)^1, and 3) Flash - model size. Various techniques are proposed to make DNNs efficient, including model pruning (sparsity) (Han et al., 2015), knowledge distillation (Buciluǎ et al., 2006), efficient model architectures (Howard et al., 2017) and quantization (Rastegari et al., 2016).

Sparsity of the model, in particular, has the potential for impact across a variety of inference settings, as it reduces the model size and inference cost (FLOPs) without significant changes to the training pipeline. Naturally, several interesting projects address inference speed-ups via sparsity on existing frameworks (Liu et al., 2015; Elsen et al., 2019) and commodity hardware (Ashby et al.). On-premise or edge computing is another domain where sparse DNNs have potential for deep impact, as it is governed by billions of battery-limited devices with single-core CPUs. These devices, including mobile phones (Anguita et al., 2012) and IoT sensors (Patil et al., 2019; Roy et al., 2019), can benefit significantly from sparsity as it can enable real-time on-device solutions.

Sparsity in DNNs, surveyed extensively in Section 2, has been the subject of several papers where new algorithms are designed to obtain models with a given parameter budget. But state-of-the-art DNN models tend to have a large number of layers with a highly non-uniform distribution both in terms of the number of parameters and the FLOPs required per layer. Most existing methods rely either on uniform sparsity across all parameter tensors (layers) or on heuristic non-uniform sparsity budgets, leading to a sub-optimal weight allocation across layers and a potentially significant loss in accuracy.
Furthermore, if the budget is set at a global level, some of the layers with a small number of parameters would be fully dense as their contribution to the budget is insignificant. However, those layers can have significant FLOPs, e.g., an initial convolution layer, whose filters are applied over the entire high-resolution input. Hence, while such models might decrease the number of non-zeroes significantly, their FLOPs could still be large.

* Equal contribution. 1 University of Washington, USA. 2 Allen Institute for Artificial Intelligence, USA. 3 Microsoft Research, India. Correspondence to: Aditya Kusupati.

1 One Multiply-Add is counted as one FLOP.

Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020. Copyright 2020 by the author(s).
Motivated by the above-mentioned challenges, this work addresses the following question: "Can we design a method to learn a non-uniform sparsity budget across layers that is optimized per-layer, is stable, and is accurate?"

Most existing methods for learning sparse DNNs have their roots in the long celebrated literature of high-dimensional statistics and, in particular, sparse regression. These methods are mostly based on the well-known Hard and Soft Thresholding techniques, which are essentially projected gradient methods with explicit projection onto the set of sparse parameters. However, these methods require a priori knowledge of sparsity, and as mentioned above, mostly heuristic methods are used to set the sparsity levels per layer.

We propose Soft Threshold Reparameterization (STR) to address the aforementioned issues. We use the fact that the projection onto the sparse sets is available in closed form and propose a novel reparameterization of the problem. That is, for the forward pass of the DNN, we use a soft-thresholded version (Donoho, 1995) of the weight tensor W_l of the l-th layer in the DNN: S(W_l, α_l) := sign(W_l) · ReLU(|W_l| − α_l), where α_l is the pruning threshold for the l-th layer. As the DNN loss can be written as a continuous function of the α_l's, we can use backpropagation to learn layer-specific α_l to smoothly induce sparsity. Typically, each layer in a neural network is distinct, unlike interchangeable weights and neurons, making it interesting to learn layer-wise sparsity.

Due to layer-specific thresholds and sparsity, STR is able to achieve state-of-the-art accuracy for unstructured sparsity in CNNs across various sparsity regimes. STR makes even small-parameter layers sparse, resulting in models with significantly lower inference FLOPs than the baselines. For example, STR for 90% sparse MobileNetV1 on ImageNet-1K results in a 0.3% boost in accuracy with 50% fewer FLOPs. Empirically, STR's learnt non-uniform budget makes it a very effective choice for ultra (99%) sparse ResNet50 as well, where it is ∼10% more accurate than the baselines on ImageNet-1K. STR can also be trivially modified to induce structured sparsity, demonstrating its generalizability to a variety of DNN architectures across domains. Finally, STR's learnt non-uniform sparsity budget transfers across tasks, thus discovering an efficient sparse backbone of the model.

The 3 major contributions of this paper are:

• Soft Threshold Reparameterization (STR), for the weights in DNNs, to induce sparsity via learning the per-layer pruning thresholds, thereby obtaining a better non-uniform sparsity budget across layers.
• Extensive experimentation showing that STR achieves the state-of-the-art accuracy for sparse CNNs (ResNet50 and MobileNetV1 on ImageNet-1K) along with a significant reduction in inference FLOPs.
• Extension of STR to structured sparsity, which is useful for the direct implementation of fast inference in practice.
2. Related Work

This section covers the spectrum of work on sparsity in DNNs. The sparsity in the discussion can be characterized as (a) unstructured and (b) structured, while sparsification techniques can be (i) dense-to-sparse and (ii) sparse-to-sparse. Finally, the sparsity budget in DNNs can either be (a) uniform or (b) non-uniform across layers. This will be a key focus of this paper, as different budgets result in different inference compute costs as measured by FLOPs. This section also discusses the recent work on learnable sparsity.

2.1. Unstructured and Structured Sparsity

Unstructured sparsity does not take the structure of the model (e.g., channels, rank, etc.) into account. Typically, unstructured sparsity is induced in DNNs by making the parameter tensors sparse directly based on heuristics (e.g., weight magnitude), thereby creating sparse tensors that might not be capable of leveraging the speed-ups provided by commodity hardware during training and inference. Unstructured sparsity has been extensively studied and includes methods which use gradient, momentum, and Hessian based heuristics (Evci et al., 2020; Lee et al., 2019; LeCun et al., 1990; Hassibi & Stork, 1993; Dettmers & Zettlemoyer, 2019), and magnitude-based pruning (Han et al., 2015; Guo et al., 2016; Zhu & Gupta, 2017; Frankle & Carbin, 2019; Gale et al., 2019; Mostafa & Wang, 2019; Bellec et al., 2018; Mocanu et al., 2018; Narang et al., 2019; Kusupati et al., 2018; Wortsman et al., 2019). Unstructured sparsity can also be induced by L0, L1 regularization (Louizos et al., 2018), and Variational Dropout (VD) (Molchanov et al., 2017).

Gradual Magnitude Pruning (GMP), proposed in (Zhu & Gupta, 2017) and studied further in (Gale et al., 2019), is a simple magnitude-based weight pruning applied gradually over the course of the training. Discovering Neural Wirings (DNW) (Wortsman et al., 2019) also relies on magnitude-based pruning while utilizing a straight-through estimator for the backward pass. GMP and DNW are the state-of-the-art for unstructured pruning in DNNs (especially in CNNs), demonstrating the effectiveness of magnitude pruning. VD gets accuracy comparable to GMP (Gale et al., 2019) for CNNs but at a cost of 2× memory and 4× compute during training, making it hard to use ubiquitously.

Structured sparsity takes structure into account, making the models scalable on commodity hardware with standard computation techniques/architectures. Structured sparsity includes methods which make parameter tensors low-rank (Jaderberg et al., 2014; Alizadeh et al., 2020; Lu et al., 2016), prune out channels or filters, and induce block/group sparsity (Liu et al., 2019; Wen et al., 2016; Li et al., 2017; Luo et al., 2017; Gordon et al., 2018; Yu & Huang, 2019). Even though structured sparsity can leverage the speed-ups provided by parallelization, the highest levels of model pruning are only possible with unstructured sparsity techniques.
2.2. Dense-to-sparse and Sparse-to-sparse Training

Until recently, most sparsification methods were dense-to-sparse, i.e., the DNN starts fully dense and is made sparse by the end of the training. Dense-to-sparse training in DNNs encompasses the techniques presented in (Han et al., 2015; Zhu & Gupta, 2017; Molchanov et al., 2017; Frankle & Carbin, 2019; Renda et al., 2020).

The lottery ticket hypothesis (Frankle & Carbin, 2019) sparked an interest in training sparse neural networks end-to-end. This is referred to as sparse-to-sparse training, and a lot of recent work (Mostafa & Wang, 2019; Bellec et al., 2018; Evci et al., 2020; Lee et al., 2019; Dettmers & Zettlemoyer, 2019) aims to do sparse-to-sparse training using techniques which include re-allocation of weights to improve accuracy.

Dynamic Sparse Reparameterization (DSR) (Mostafa & Wang, 2019) heuristically obtains a global magnitude threshold along with re-allocation of the weights based on the non-zero weights present at every step. Sparse Networks From Scratch (SNFS) (Dettmers & Zettlemoyer, 2019) utilizes the momentum of the weights to re-allocate weights across layers, and the Rigged Lottery (RigL) (Evci et al., 2020) uses the magnitude to drop and the periodic dense gradients to regrow weights. SNFS and RigL are state-of-the-art in sparse-to-sparse training but fall short of GMP in the same experimental settings. It should be noted that, even though sparse-to-sparse training can reduce the training cost, the existing frameworks (Paszke et al., 2019; Abadi et al., 2016) treat the models as dense, resulting in minimal gains. DNW (Wortsman et al., 2019) and Dynamic Pruning with Feedback (DPF) (Lin et al., 2020) fall in between, as DNW uses a fully dense gradient in the backward pass and DPF maintains a copy of the dense model in parallel to optimize the sparse model through feedback. Note that DPF is complementary to most of the techniques discussed here.

2.3. Uniform and Non-uniform Sparsity

Uniform sparsity implies that all the layers in the DNN have the same amount of sparsity in proportion. Quite a few works have used uniform sparsity (Gale et al., 2019), given its ease and lack of hyperparameters. However, some works keep parts of the model dense, including the first or the last layers (Lin et al., 2020; Mostafa & Wang, 2019; Zhu & Gupta, 2017). In general, making the first or the last layers dense benefits all the methods. GMP typically uses uniform sparsity and achieves state-of-the-art results.

Non-uniform sparsity permits different layers to have different sparsity budgets. Weight re-allocation heuristics have been used for non-uniform sparsity in DSR and SNFS. It can also be a fixed budget like the ERK (Erdos-Renyi-Kernel) heuristic described in RigL (Evci et al., 2020). A global pruning threshold (Han et al., 2015) can also induce non-uniform sparsity and has been leveraged in Iterative Magnitude Pruning (IMP) (Frankle & Carbin, 2019; Renda et al., 2020). A good non-uniform sparsity budget can help in maintaining accuracy while also reducing the FLOPs due to a better parameter distribution. The aforementioned methods with non-uniform sparsity do not reduce the FLOPs compared to uniform sparsity in practice. Very few techniques, like AMC (He et al., 2018) using expensive reinforcement learning, minimize FLOPs with non-uniform sparsity.

Most of the discussed techniques rely on intelligent heuristics to obtain non-uniform sparsity. Learning the pruning thresholds, and in turn learning the non-uniform sparsity budget, is the main contribution of this paper.

2.4. Learnable Sparsity

Concurrent to our work, (Savarese et al., 2019; Liu et al., 2020; Lee, 2019; Xiao et al., 2019; Azarian et al., 2020) have proposed learnable sparsity methods through training of the sparse masks and weights simultaneously with minimal heuristics. The reader is urged to review these works for a more complete picture of the field. Note that, while STR is proposed to induce layer-wise unstructured sparsity, it can be easily adapted for global, filter-wise, or per-weight sparsity as discussed in Appendix A.5.
3. Method - STR

Optimization under a sparsity constraint on the parameter set is a well studied area spanning more than three decades (Donoho, 1995; Candes et al., 2007; Jain et al., 2014), and is modeled as:

min_W L(W; D),  s.t. ||W||_0 ≤ k,

where D := {(x_i ∈ R^d, y_i ∈ R)}_{i∈[n]} is the observed data, L is the loss function, W are the parameters to be learned, ||·||_0 denotes the L0-norm (the number of non-zeros), and k is the parameter budget. Due to the non-convexity and combinatorial structure of the L0-norm constraint, its convex relaxation, the L1 norm, has been studied for a long time and has been at the center of a large literature on high-dimensional learning. In particular, several methods have been proposed to solve the two problems, including projected gradient descent and forward/backward pruning.

Projected Gradient Descent (PGD) in particular has been popular for both problems as the projection onto both the L0 and the L1 ball is computable in almost closed form (Beck & Teboulle, 2009; Jain et al., 2014); the L0 ball projection is called Hard Thresholding while the L1 ball projection is known as Soft Thresholding. Further, these methods have been the guiding principle for many modern DNN model pruning (sparsity) techniques (Han et al., 2015; Zhu & Gupta, 2017; Narang et al., 2019).
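To make the two projections concrete, here is a small, self-contained illustration (ours, not from the paper) of the closed forms referred to above: hard thresholding keeps the k largest-magnitude entries, while soft thresholding shrinks every entry towards zero by a threshold α.

```python
import torch

def hard_threshold(w: torch.Tensor, k: int) -> torch.Tensor:
    # Projection onto the L0 ball {w : ||w||_0 <= k}: keep the k entries
    # with the largest magnitude, zero out the rest.
    out = torch.zeros_like(w)
    idx = torch.topk(w.abs().flatten(), k).indices
    out.view(-1)[idx] = w.view(-1)[idx]
    return out

def soft_threshold(w: torch.Tensor, alpha: float) -> torch.Tensor:
    # Soft-thresholding (shrinkage) operator associated with the L1 norm:
    # every entry is shrunk towards zero by alpha.
    return torch.sign(w) * torch.relu(w.abs() - alpha)

w = torch.tensor([0.9, -0.2, 0.05, -1.3])
print(hard_threshold(w, 2))    # keeps the 2 largest-magnitude entries
print(soft_threshold(w, 0.3))  # shrinks every entry by 0.3, zeroing small ones
```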
However, projection-based methods suffer from the problem of dense gradients and intermediate parameter structure, as the gradient descent iterate can be arbitrarily far out of the set and is then projected back onto the L0 or L1 ball. At the scale of billions of parameters, computing such dense gradients and updates can be daunting. More critically, the budget parameter k is set at the global level, so it is not clear how to partition the budget for each layer, as the importance of each layer can be significantly different.

In this work, we propose a reparameterization, Soft Threshold Reparameterization (STR), based on the soft threshold operator (Donoho, 1995), to alleviate both of the above mentioned concerns. That is, instead of first updating W via gradient descent and then computing its projection, we directly optimize over the projected W. Let S_g(W; s) be the projection of W parameterized by s and a function g. S is applied to each element of W and is defined as:

S_g(w, s) := sign(w) · ReLU(|w| − g(s)),   (1)

where s is a learnable parameter, g : R → R, and α = g(s) is the pruning threshold. ReLU(a) = max(a, 0). That is, if |w| ≤ g(s), then S_g(w, s) sets it to 0.

Reparameterizing the optimization problem with S modifies (note that it is not equivalent) it to:

min_W L(S_g(W, s), D).   (2)

For L-layer DNN architectures, we divide W into W = [W_l]_{l=1}^L, where W_l is the parameter tensor for the l-th layer. As mentioned earlier, different layers of DNNs are unique and can have significantly different numbers of parameters. Similarly, different layers might need different sparsity budgets for the best accuracy. So, we set the trainable pruning parameter for each layer as s_l. That is, s = [s_1, ..., s_L].

Now, using the above mentioned reparameterization for each W_l and adding a standard L2 regularization per layer, we get the following Gradient Descent (GD) update equation at the t-th step for W_l, ∀ l ∈ [L]:

W_l^{(t+1)} ← (1 − η_t · λ) W_l^{(t)} − η_t ∇_{S_g(W_l, s_l)} L(S_g(W^{(t)}, s), D) ⊙ ∇_{W_l} S_g(W_l, s_l),   (3)

where η_t is the learning rate at the t-th step, and λ is the L2 regularization (weight-decay) hyper-parameter. ∇_{W_l} S_g(W_l, s_l) is the gradient of S_g(W_l, s_l) w.r.t. W_l. Now, S is non-differentiable, so we use a sub-gradient, which leads to the following update equation:

W_l^{(t+1)} ← (1 − η_t · λ) W_l^{(t)} − η_t ∇_{S_g(W_l, s_l)} L(S_g(W^{(t)}, s), D) ⊙ 1{S_g(W_l^{(t)}, s_l) ≠ 0},   (4)

where 1{·} is the indicator function and A ⊙ B denotes the element-wise (Hadamard) product of tensors A and B.

Now, if g is a continuous function, then using the STR (2) and (1), it is clear that L(S_g(W, s), D) is a continuous function of s. Further, the sub-gradient of L w.r.t. s can be computed and used for gradient descent on s as well; see Appendix A.2. Algorithm 1 in the Appendix shows the implementation of STR on 2D convolution along with extensions to global, per-filter & per-weight sparsity. STR can also be modified and applied on the eigenvalues of a parameter tensor, instead of the individual entries mentioned above, resulting in low-rank tensors; see Section 4.2.1 for further details. Note that s also has the same weight-decay parameter λ.

Naturally, g plays a critical role here, as a sharp g can lead to an arbitrary increase in the threshold, leading to poor accuracy, while a flat g can lead to slow learning. Practical considerations for the choice of g are discussed in Appendix A.1. For the experiments, g is set as the Sigmoid function for unstructured sparsity and the exponential function for structured sparsity. Typically, {s_l}_{l∈[L]} are initialized with s_init to ensure that the thresholds {α_l = g(s_l)}_{l∈[L]} start close to 0. Figure 1 shows that the threshold's dynamics are guided by a combination of gradients from L and the weight-decay on s. Further, the overall sparsity budget for STR is not set explicitly. Instead, it is controlled by the weight-decay parameter (λ), and can be further fine-tuned using s_init. Interestingly, this curve is similar to the handcrafted heuristic for thresholds defined in (Narang et al., 2019).

Figure 1. The learnt threshold parameter, α = g(s), for layer 10 in the 90% sparse ResNet50 on ImageNet-1K over the course of training.
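As a concrete sketch of this reparameterization, the snippet below wraps a standard PyTorch convolution with Eq. (1), assuming g is the sigmoid (the choice used here for unstructured sparsity). It is a simplified stand-in for the STRConv of Algorithm 1 in the Appendix, not the authors' implementation, and the s_init default is purely illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class STRConv2d(nn.Conv2d):
    """2D convolution with a learnable per-layer pruning threshold.

    A simplified sketch of the STRConv idea; attribute names and defaults
    are illustrative, not the authors' code.
    """
    def __init__(self, *args, s_init: float = -8.0, **kwargs):
        super().__init__(*args, **kwargs)
        # One trainable threshold parameter s_l per layer; s_init is chosen so
        # that alpha_l = sigmoid(s_l) starts near 0, i.e. the layer starts dense.
        self.s = nn.Parameter(torch.tensor(s_init))

    def sparse_weight(self) -> torch.Tensor:
        # Eq. (1) with g = sigmoid: S_g(W, s) = sign(W) * ReLU(|W| - sigmoid(s)).
        alpha = torch.sigmoid(self.s)
        return torch.sign(self.weight) * F.relu(self.weight.abs() - alpha)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The loss only ever sees the soft-thresholded weights, so it is a
        # continuous function of both W_l and s_l; autograd supplies the
        # (sub-)gradients used in Eqs. (3) and (4).
        return F.conv2d(x, self.sparse_weight(), self.bias,
                        self.stride, self.padding, self.dilation, self.groups)
```

Replacing each convolution in a backbone with such a layer and training with plain SGD plus weight decay (which also acts on s) is, in essence, all that STR requires; the overall sparsity then emerges from λ and s_init rather than from an explicit budget.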
Figure 2 shows the overall learnt sparsity budget for ResNet50 during training. The curve looks similar to the GMP (Zhu & Gupta, 2017) sparsification heuristic; however, STR learns it via backpropagation and SGD.

Finally, each parameter tensor learns a different threshold value, {α_l}_{l∈[L]}, resulting in unique final thresholds across the layers, as shown in Figure 3 for ResNet50. This, in turn, results in the non-uniform sparsity budget (see Figure 6) which is empirically shown to be effective in increasing prediction accuracy while reducing FLOPs. Moreover, (4) shows that the gradient update itself is sparse, as the gradient of L is multiplied with the indicator function of S_g(W_l) ≠ 0, which gets sparser over iterations (Figure 2). So STR addresses both of the issues with standard PGD methods (Hard/Soft Thresholding) mentioned above.

Figure 2. The progression of the learnt overall budget for the 90% sparse ResNet50 on ImageNet-1K over the course of training.

Figure 3. The final learnt threshold values, [α_l]_{l=1}^{54} = [g(s_l)]_{l=1}^{54}, for all the layers in the 90% sparse ResNet50 on ImageNet-1K.

3.1. Analysis

The reparameterization trick using the projection operator's functional form can be used for standard constrained optimization problems as well (assuming the projection operator has a closed form). However, it is easy to show that, in general, such a method need not converge to the optimal solution even for convex functions over convex sets. This raises a natural question about the effectiveness of the technique for the sparse weight learning problem.
It turns out that for sparsity constrained problems, STR is very similar to backward pruning (Hastie et al., 2009), which is a well-known technique for sparse regression. Note that, similar to Hard/Soft Thresholding, standard backward pruning also does not support differentiable tuning of thresholds, which makes it challenging to apply to DNNs.

To further establish this connection, let's consider a standard sparse regression problem where y = Xw*, X_ij ∼ N(0, 1), and X ∈ R^{n×d}. w* ∈ {0, 1}^d has r ≪ d non-zeros, and d ≫ n ≫ r log d. Due to the initialization, g(s) ≈ 0 in the initial few iterations. So, gradient descent converges to the least ℓ2-norm regression solution. That is, w = U U^⊤ w*, where U ∈ R^{d×n} is the right singular vector matrix of X and spans a random n-dimensional subspace. Since n ≫ r log d, U_S U_S^⊤ ≈ (n/d)·I, where S = supp(w*) and U_S indexes the rows of U corresponding to S. That is, min_{j∈S} U_j · U_S^⊤ w* ≥ (n/d)(1 − o(1)). On the other hand, U_j · U_S^⊤ w* ≲ √(n r log d)/d with high probability for j ∉ S. As n ≫ r log d, almost all the elements of supp(w*) will be in the top O(n) elements of w. Furthermore, X S_g(w, s) = y, so |s| would decrease significantly via weight-decay and hence g(s) becomes large enough to prune all but, say, O(n) elements. Using a similar argument as above leads to further pruning of w, while ensuring recovery of almost all elements in supp(w*).
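The argument above can be checked numerically; the following small experiment (ours, not from the paper) verifies that the minimum-norm solution already concentrates most of its mass on supp(w*), so a growing soft threshold can recover the support.

```python
import torch

torch.manual_seed(0)
n, d, r = 200, 4000, 5
X = torch.randn(n, d)
w_star = torch.zeros(d)
support = torch.randperm(d)[:r]
w_star[support] = 1.0
y = X @ w_star

# Minimum-norm least-squares solution w = X^T (X X^T)^{-1} y  (= U U^T w*).
w = X.T @ torch.linalg.solve(X @ X.T, y)

# The r support entries should all land among the top-n entries of |w|,
# so thresholding away everything below the top O(n) keeps supp(w*).
top_n = torch.topk(w.abs(), n).indices
print("support captured in top-n entries:",
      set(support.tolist()) <= set(top_n.tolist()))
```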
4. Experiments

This section showcases the experimentation followed by the observations from applying STR for (a) unstructured sparsity in CNNs and (b) structured sparsity in RNNs.

4.1. Unstructured Sparsity in CNNs

4.1.1. Experimental Setup

ImageNet-1K (Deng et al., 2009) is a widely used large-scale image classification dataset with 1K classes. All the CNN experiments presented are on ImageNet-1K. ResNet50 (He et al., 2016) and MobileNetV1 (Howard et al., 2017) are two popular CNN architectures. ResNet50 is extensively used in the literature to show the effectiveness of sparsity in CNNs. Experiments on MobileNetV1 argue for the generalizability of the proposed technique (STR). Dataset and model details can be found in Appendix A.7.

STR was compared against strong state-of-the-art baselines in various sparsity regimes including GMP (Gale et al., 2019), DSR (Mostafa & Wang, 2019), DNW (Wortsman et al., 2019), SNFS (Dettmers & Zettlemoyer, 2019), RigL (Evci et al., 2020) and DPF (Lin et al., 2020). GMP and DNW always use a uniform sparsity budget. RigL, SNFS, DSR, and DPF were compared in their original form. Exceptions for uniform sparsity are marked in Table 1. The "+ ERK" suffix implies the usage of the ERK budget (Evci et al., 2020) instead of the original sparsity budget. Even though VD (Molchanov et al., 2017) achieves state-of-the-art results, it is omitted due to its 2× memory and 4× compute footprint during training. Typically VD and IMP use a global threshold for global sparsity (GS) (Han et al., 2015), which can also be learnt using STR. The unstructured sparsity experiments presented compare the techniques which induce layer-wise sparsity. Note that STR is generalizable to other scenarios as well.

Open-source implementations, pre-trained models, and reported numbers of the available techniques were used as the baselines. Experiments were run on a machine with 4 NVIDIA Titan X (Pascal) GPUs. All baselines use the hyperparameter settings defined in their implementations/papers. The experiments for STR use a batch size of 256, a cosine learning rate routine, and are trained for 100 epochs following the hyperparameter settings in (Wortsman et al., 2019) using SGD + momentum. STR has weight-decay (λ) and s_init hyperparameters to control the overall sparsity in CNNs; the values used can be found in Appendix A.6. GMP 1.5× (Gale et al., 2019) and RigL 5× (Evci et al., 2020) show that training the networks longer increases accuracy. However, due to limited compute and environmental concerns (Schwartz et al., 2019), all the experiments were run only for around 100 epochs (∼3 days each). Unstructured sparsity in CNNs with STR is enforced by learning one threshold per layer, as shown in Figure 3. PyTorch STRConv code can be found in Algorithm 1 of the Appendix.
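For concreteness, a rough sketch of this optimizer setup is shown below; it is not the authors' training script, and the learning-rate and weight-decay values are placeholders. The important detail is that the weight decay λ acts on the threshold parameters s_l as well as on the weights, which is what steers the overall sparsity.

```python
import torch

def build_optimizer(model: torch.nn.Module, lr: float = 0.1, momentum: float = 0.9,
                    weight_decay: float = 1e-4, epochs: int = 100):
    # Weight decay is applied to every parameter, including each layer's s_l;
    # together with s_init it controls the overall sparsity that emerges.
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum,
                          weight_decay=weight_decay)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)
    return opt, sched
```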
4.1.2. ResNet50 on ImageNet-1K

A fully dense ResNet50 trained on ImageNet-1K has 77.01% top-1 validation accuracy. STR is compared extensively to the other baselines on ResNet50 in the sparsity ranges of 80%, 90%, 95%, 96.5%, 98%, and 99%. Table 1 shows that DNW and GMP are state-of-the-art among the baselines across all the aforementioned sparsity regimes. As STR might not be able to hit a sparsity budget exactly, numbers are reported for the models which land nearby. Note that the 90.23% sparse ResNet50 on ImageNet-1K with STR is referred to as the 90% sparse ResNet50 model learnt with STR.

Table 1. STR is the state-of-the-art for unstructured sparsity in ResNet50 on ImageNet-1K while having lower inference cost (FLOPs) than the baselines across all the sparsity regimes. * and # imply that the first and last layer are dense respectively. Baseline numbers are reported from their respective papers/open-source implementations and models. FLOPs do not include batch-norm.

| Method | Top-1 Acc (%) | Params | Sparsity (%) | FLOPs |
|---|---|---|---|---|
| ResNet-50 | 77.01 | 25.6M | 0.00 | 4.09G |
| GMP | 75.60 | 5.12M | 80.00 | 818M |
| DSR*# | 71.60 | 5.12M | 80.00 | 1.23G |
| DNW | 76.00 | 5.12M | 80.00 | 818M |
| SNFS | 74.90 | 5.12M | 80.00 | - |
| SNFS + ERK | 75.20 | 5.12M | 80.00 | 1.68G |
| RigL* | 74.60 | 5.12M | 80.00 | 920M |
| RigL + ERK | 75.10 | 5.12M | 80.00 | 1.68G |
| DPF | 75.13 | 5.12M | 80.00 | 818M |
| STR | 76.19 | 5.22M | 79.55 | 766M |
| STR | 76.12 | 4.47M | 81.27 | 705M |
| GMP | 73.91 | 2.56M | 90.00 | 409M |
| DNW | 74.00 | 2.56M | 90.00 | 409M |
| SNFS | 72.90 | 2.56M | 90.00 | 1.63G |
| SNFS + ERK | 72.90 | 2.56M | 90.00 | 960M |
| RigL* | 72.00 | 2.56M | 90.00 | 515M |
| RigL + ERK | 73.00 | 2.56M | 90.00 | 960M |
| DPF# | 74.55 | 4.45M | 82.60 | 411M |
| STR | 74.73 | 3.14M | 87.70 | 402M |
| STR | 74.31 | 2.49M | 90.23 | 343M |
| STR | 74.01 | 2.41M | 90.55 | 341M |
| GMP | 70.59 | 1.28M | 95.00 | 204M |
| DNW | 68.30 | 1.28M | 95.00 | 204M |
| RigL* | 67.50 | 1.28M | 95.00 | 317M |
| RigL + ERK | 70.00 | 1.28M | 95.00 | ~600M |
| STR | 70.97 | 1.33M | 94.80 | 182M |
| STR | 70.40 | 1.27M | 95.03 | 159M |
| STR | 70.23 | 1.24M | 95.15 | 162M |
| RigL* | 64.50 | 0.90M | 96.50 | 257M |
| RigL + ERK | 67.20 | 0.90M | 96.50 | ~500M |
| STR | 67.78 | 0.99M | 96.11 | 127M |
| STR | 67.22 | 0.88M | 96.53 | 117M |
| GMP | 57.90 | 0.51M | 98.00 | 82M |
| DNW | 58.20 | 0.51M | 98.00 | 82M |
| STR | 62.84 | 0.57M | 97.78 | 80M |
| STR | 61.46 | 0.50M | 98.05 | 73M |
| STR | 59.76 | 0.45M | 98.22 | 68M |
| GMP | 44.78 | 0.26M | 99.00 | 41M |
| STR | 54.79 | 0.31M | 98.79 | 54M |
| STR | 51.82 | 0.26M | 98.98 | 47M |
| STR | 50.35 | 0.23M | 99.10 | 44M |
Figure 4. STR forms a frontier curve over all the baselines in all sparsity regimes, showing that it is the state-of-the-art for unstructured sparsity in ResNet50 on ImageNet-1K.

STR comfortably beats all the baselines across all the sparsity regimes, as seen in Table 1, and is the state-of-the-art for unstructured sparsity. Figure 4 shows that STR forms a frontier curve encompassing all the baselines at all levels of sparsity. Very few methods are stable in the ultra sparse regime of 98-99% sparsity, and GMP can achieve 99% sparsity. STR is very stable even in the ultra sparse regime, as shown in Table 1 and Figure 4, while being up to 10% higher in accuracy than GMP at 99% sparsity.
STR induces non-uniform sparsity across layers. Table 1 and Figure 5 show that STR produces models which have lower or similar inference FLOPs compared to the baselines while having better prediction accuracy in all the sparsity regimes. This hints at the fact that STR could be redistributing the parameters, thereby reducing the FLOPs. In the 80% sparse models, STR is at least 0.19% better in accuracy than the baselines while having at least 60M (6.5%) fewer FLOPs. Similarly, STR has state-of-the-art accuracy in the 90%, 95%, and 96.5% sparse regimes while having at least 68M (16.5%), 45M (22%) and 140M (54%) fewer FLOPs than the best baselines respectively. In the ultra sparse regime of 98% and 99% sparsity, STR has similar or slightly higher FLOPs compared to the baselines but is up to 4.6% and 10% better in accuracy respectively. Table 1 summarizes that the non-uniform sparsity baselines like SNFS, SNFS+ERK, and RigL+ERK can have up to 2-4× higher inference cost (FLOPs) due to a non-optimal layer-wise distribution of the parameter weights.

Figure 5. STR results in ResNet50 models on ImageNet-1K which have the lowest inference cost (FLOPs) for any given accuracy.

Observations: STR on ResNet50 shows some interesting observations related to sparsity and inference cost (FLOPs). These observations will be further discussed in Section 5:

1. STR is state-of-the-art for unstructured sparsity.
2. STR minimizes inference cost (FLOPs) while maintaining accuracy in the 80-95% sparse regime.
3. STR maximizes accuracy while maintaining inference cost (FLOPs) in the 98-99% ultra sparse regime.
4. STR learns a non-uniform layer-wise sparsity, shown in Figure 6, which shows that the initial layers of the CNN can be sparser than those of the existing non-uniform sparsity methods. All the learnt non-uniform budgets through STR can be found in Appendix A.3.
5. Figure 6 also shows that the last layers through STR are denser than those of the other methods, which is contrary to the understanding in the literature on non-uniform sparsity (Mostafa & Wang, 2019; Dettmers & Zettlemoyer, 2019; Evci et al., 2020; Gale et al., 2019). This leads to a sparser backbone for transfer learning. The backbone sparsities can be found in Appendix A.3.
6. Figure 7 shows the layer-wise FLOPs distribution for the non-uniform sparsity methods. STR adjusts the FLOPs across layers such that it has lower FLOPs than the baselines. Note that the other non-uniform sparsity budgets lead to heavy compute overhead in the initial layers due to denser parameter tensors. A sketch for reading such per-layer budgets and FLOPs off a trained model follows below.

Figure 6. Layer-wise sparsity budget for the 90% sparse ResNet50 models on ImageNet-1K using various sparsification techniques.

Figure 7. Layer-wise FLOPs budget for the 90% sparse ResNet50 models on ImageNet-1K using various sparsification techniques.

STR can also induce global sparsity (GS) (Han et al., 2015) with similar accuracy at ∼2× the FLOPs compared to layer-wise sparsity for 90-98% sparsity (details in Appendix A.5.1).
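A hedged sketch of how such per-layer budgets and weight FLOPs (as in Figures 6 and 7) can be read off a trained model whose layers expose their soft-thresholded weights (e.g., the STRConv2d sketch in Section 3): one multiply-add is counted as one FLOP and batch-norm is ignored, as in Table 1. The helper below is illustrative, not the authors' tooling; out_hw is an assumed mapping from layer name to that layer's output spatial size.

```python
import torch

@torch.no_grad()
def layerwise_budget(model, out_hw):
    """Return (name, sparsity, weight FLOPs) for every layer exposing
    sparse_weight(). out_hw maps layer name -> (H_out, W_out); the sparse
    conv FLOPs are approximated as nonzero_weights * H_out * W_out.
    """
    report = []
    for name, m in model.named_modules():
        if hasattr(m, "sparse_weight"):
            w = m.sparse_weight()
            nnz = int((w != 0).sum())
            sparsity = 1.0 - nnz / w.numel()
            h_out, w_out = out_hw[name]
            report.append((name, sparsity, nnz * h_out * w_out))
    return report
```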
4.1.3. MobileNetV1 on ImageNet-1K

MobileNetV1 was trained on ImageNet-1K for unstructured sparsity with STR to ensure generalizability. Since GMP is the state-of-the-art baseline, as shown earlier, STR was only compared to GMP for the 75% and 90% sparsity regimes. A fully dense MobileNetV1 has a top-1 accuracy of 71.95% on ImageNet-1K. GMP (Zhu & Gupta, 2017) keeps the first layer and the depthwise convolution layers dense for MobileNetV1 to ensure training stability and maximize accuracy.
Table 2. STR is up to 3% higher in accuracy while having 33% lower inference cost (FLOPs) for MobileNetV1 on ImageNet-1K.

| Method | Top-1 Acc (%) | Params | Sparsity (%) | FLOPs |
|---|---|---|---|---|
| MobileNetV1 | 71.95 | 4.21M | 0.00 | 569M |
| GMP | 67.70 | 1.09M | 74.11 | 163M |
| STR | 68.35 | 1.04M | 75.28 | 101M |
| STR | 66.52 | 0.88M | 79.07 | 81M |
| GMP | 61.80 | 0.46M | 89.03 | 82M |
| STR | 64.83 | 0.60M | 85.80 | 55M |
| STR | 62.10 | 0.46M | 89.01 | 42M |
| STR | 61.51 | 0.44M | 89.62 | 40M |

Table 2 shows that STR is at least 0.65% better than GMP for 75% sparsity, while having at least 62M (38%) fewer FLOPs. More interestingly, STR has state-of-the-art accuracy while having up to 50% (40M) fewer FLOPs than GMP in the 90% sparsity regime. All the observations made for ResNet50 hold for MobileNetV1 as well. The sparsity and FLOPs distribution across layers can be found in Appendix A.4.

4.2. Structured Sparsity in RNNs

4.2.1. Experimental Setup

Google-12 is a speech recognition dataset that has 12 classes made from the Google Speech Commands dataset (Warden, 2018). HAR-2 is a binarized version of the 6-class Human Activity Recognition dataset (Anguita et al., 2012). These two datasets stand as compelling cases for on-device, resource-efficient machine learning at the edge. Details about the datasets can be found in Appendix A.7.

FastGRNN (Kusupati et al., 2018) was proposed to enable powerful RNN models on resource-constrained devices. FastGRNN relies on making the RNN parameter matrices low-rank, sparse and quantized. As low-rank is a form of structured sparsity, experiments were done to show the effectiveness of STR for structured sparsity. The input vector to the RNN at each timestep and the hidden state have D and D̂ dimensionality respectively. FastGRNN has two parameter matrices, W ∈ R^{D×D̂} and U ∈ R^{D̂×D̂}, which are reparameterized as products of low-rank matrices, W = W1 W2 and U = U1 U2, where W1 ∈ R^{D×r_W}, W2 ∈ R^{r_W×D̂}, and (U1)^⊤, U2 ∈ R^{r_U×D̂}. r_W, r_U are the ranks of the respective matrices. In order to apply STR, the low-rank reparameterization can be changed to W = (W1 ⊙ 1 m_W^⊤) W2 and U = (U1 ⊙ 1 m_U^⊤) U2, where m_W = 1_D and m_U = 1_{D̂} at initialization, W1 ∈ R^{D×D}, W2 ∈ R^{D×D̂}, and U1, U2 ∈ R^{D̂×D̂}. To learn the low-rank structure, STR is applied on the m_W and m_U vectors. Learning low-rank with STR on m_W, m_U can be thought of as inducing unstructured sparsity on the two trainable vectors, aiming for the right r_W and r_U.
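A minimal sketch of this low-rank reparameterization for a single matrix is given below, assuming g is the exponential as used for structured sparsity. The class, its names, and its defaults are illustrative; this is not the EdgeML/FastGRNN implementation.

```python
import torch
import torch.nn as nn

class LowRankSTRLinear(nn.Module):
    """W = (W1 * 1 m^T) W2 with STR applied to the vector m.

    Zeroed entries of m switch off columns of W1 (and the matching rows of W2),
    so the learnt sparsity of m is the learnt rank of W. A sketch only.
    """
    def __init__(self, d_in: int, d_out: int, s_init: float = -10.0):
        super().__init__()
        self.W1 = nn.Parameter(torch.randn(d_in, d_in) * 0.1)
        self.W2 = nn.Parameter(torch.randn(d_in, d_out) * 0.1)
        self.m = nn.Parameter(torch.ones(d_in))   # m_W, initialized to the ones vector
        self.s = nn.Parameter(torch.tensor(s_init))

    def effective_weight(self) -> torch.Tensor:
        alpha = torch.exp(self.s)                 # g = exp for structured sparsity
        m = torch.sign(self.m) * torch.relu(self.m.abs() - alpha)
        return (self.W1 * m.unsqueeze(0)) @ self.W2   # broadcast realizes 1 m^T

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.effective_weight()
```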
The baseline is low-rank FastGRNN where the ranks of the matrices are preset (Kusupati et al., 2018). The EdgeML (Dennis et al.) FastGRNN was used for the experiments with the hyperparameters suggested in the paper, and is referred to as vanilla training. Hyperparameters for the models can be found in Appendix A.6.

4.2.2. FastGRNN on Google-12 and HAR-2

Table 3 presents the results for low-rank FastGRNN with vanilla training and STR. Full-rank non-reparameterized FastGRNN has an accuracy of 92.60% and 96.10% on Google-12 and HAR-2 respectively.

Table 3. STR can induce learnt low-rank in FastGRNN, resulting in up to 2.47% higher accuracy than vanilla training.

Google-12:
| (r_W, r_U) | Vanilla Training Acc (%) | STR Acc (%) |
|---|---|---|
| Full rank (32, 100) | 92.30 | - |
| (12, 40) | 92.79 | 94.45 |
| (11, 35) | 92.86 | 94.42 |
| (10, 31) | 92.86 | 94.25 |
| (9, 24) | 93.18 | 94.45 |

HAR-2:
| (r_W, r_U) | Vanilla Training Acc (%) | STR Acc (%) |
|---|---|---|
| Full rank (9, 80) | 96.10 | - |
| (9, 8) | 94.06 | 95.76 |
| (9, 7) | 93.15 | 95.62 |
| (8, 7) | 94.88 | 95.59 |

STR outperforms vanilla training by up to 1.67% in four different model-size-reducing rank settings on Google-12. Similarly, on HAR-2, STR is better than vanilla training in all the rank settings by up to 2.47%. Note that the accuracy of the low-rank models obtained by STR is either better than or on par with the full-rank models while being around 50% and 70% smaller in size (low-rank) for Google-12 and HAR-2 respectively.

These experiments for structured sparsity in RNNs show that STR can be applied to obtain low-rank parameter tensors. Similarly, STR can be extended for filter/channel pruning and block sparsity (He et al., 2017; Huang & Wang, 2018; Liu et al., 2019); details for this adaptation can be found in Appendix A.5.2.

5. Discussion and Drawbacks

STR's usage for unstructured sparsity leads to interesting observations, as noted in Section 4.1.2. It is clear from Table 1 and Figures 4, 5 that STR achieves state-of-the-art accuracy for all the sparsity regimes and also reduces the FLOPs in doing so. STR helps in learning non-uniform sparsity budgets, which are intriguing to study as an optimal non-uniform sparsity budget can ensure minimization of FLOPs while maintaining accuracy. Although it is not clear why STR's learning dynamics result in a non-uniform budget that minimizes FLOPs, the reduction in FLOPs is due to the better redistribution of parameters across layers.
Non-uniform sparsity budgets learnt by STR have the initial and middle layers be sparser than those of the other methods, while making the last layers denser. Conventional wisdom suggests that the initial layers should be denser as the early loss of information would be hard to recover; this drives the existing non-uniform sparsity heuristics. As most of the parameters are present in the deeper layers, the existing methods tend to make them sparser while not affecting the FLOPs by much. STR, on the other hand, balances the FLOPs and sparsity across the layers as shown in Figures 6, 7, making it a lucrative and efficient choice. The denser final layers along with sparser initial and middle layers point to sparser CNN backbones obtained using STR. These sparse backbones can be viable options for efficient representation/transfer learning for downstream tasks.

Table 4. Effect of various layer-wise sparsity budgets when used with DNW for ResNet50 on ImageNet-1K.

| Method | Top-1 Acc (%) | Params | Sparsity (%) | FLOPs |
|---|---|---|---|---|
| Uniform | 74.00 | 2.56M | 90.00 | 409M |
| ERK | 74.10 | 2.56M | 90.00 | 960M |
| Budget from STR | 74.01 | 2.49M | 90.23 | 343M |
| Uniform | 68.30 | 1.28M | 95.00 | 204M |
| Budget from STR | 69.72 | 1.33M | 94.80 | 182M |
| Budget from STR | 68.01 | 1.24M | 95.15 | 162M |

Table 4 shows the effectiveness/transferability of the learnt non-uniform budget through STR for 90% sparse ResNet50 on ImageNet-1K using DNW (Wortsman et al., 2019). DNW typically takes in a uniform sparsity budget and has an accuracy of 74% for a 90% sparse ResNet50. Using the ERK non-uniform budget for 90% sparsity results in a 0.1% increase in accuracy at the cost of 2.35× inference FLOPs. Training DNW with the learnt budget from STR results in a reduction of FLOPs by 66M (16%) while maintaining accuracy. In the 95% sparsity regime, the learnt budget can improve the accuracy of DNW by up to 1.42% over uniform, along with a reduction in FLOPs by at least 22M (11%).

Table 5. Effect of various layer-wise sparsity budgets when used with GMP for ResNet50 on ImageNet-1K.

| Method | Top-1 Acc (%) | Params | Sparsity (%) | FLOPs |
|---|---|---|---|---|
| Uniform | 73.91 | 2.56M | 90.00 | 409M |
| Budget from STR | 74.13 | 2.49M | 90.23 | 343M |
| Uniform | 57.90 | 0.51M | 98.00 | 82M |
| Budget from STR | 59.47 | 0.50M | 98.05 | 73M |

Similarly, these budgets can also be used for other methods like GMP (Zhu & Gupta, 2017). Table 5 shows that the learnt sparsity budgets can lead to an increase in accuracy of 0.22% and 1.57% in the 90% and 98% sparsity regimes respectively when used with GMP. Accuracy gains over uniform sparsity are also accompanied by a significant reduction in inference FLOPs. Note that the learnt non-uniform sparsity budgets can also be obtained using smaller representative datasets instead of expensive large-scale experiments.

The major drawback of STR is the tuning of the weight-decay parameter λ, and the finer tuning with s_init, to obtain the targeted overall sparsity. One way to circumvent this issue is to freeze the non-uniform sparsity distribution in the middle of training once the overall sparsity constraints are met, and train for the remaining epochs (sketched below). This might not give the best results but can give a similar budget, which can then be transferred to methods like GMP or DNW. Another drawback of STR is the function g for the threshold. The stability, expressivity, and sparsification capability of STR depend on g. However, it should be noted that the sigmoid and exponential functions work just fine as g for STR.
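One possible way to realize this freezing workaround is sketched below (illustrative, not the authors' code): once the realized overall sparsity reaches the target, the threshold parameters stop being updated, so the learnt non-uniform budget no longer drifts.

```python
import torch

@torch.no_grad()
def overall_sparsity(model) -> float:
    # Fraction of zero entries across all layers exposing sparse_weight()
    # (e.g., the STRConv2d sketch from Section 3).
    zeros, total = 0, 0
    for m in model.modules():
        if hasattr(m, "sparse_weight"):
            w = m.sparse_weight()
            zeros += int((w == 0).sum())
            total += w.numel()
    return zeros / max(total, 1)

def maybe_freeze_thresholds(model, target: float = 0.90) -> None:
    # Once the overall sparsity constraint is met, stop updating every s_l so
    # the learnt layer-wise budget stays (roughly) fixed for the rest of training.
    if overall_sparsity(model) >= target:
        for m in model.modules():
            if hasattr(m, "s") and isinstance(m.s, torch.nn.Parameter):
                m.s.requires_grad_(False)
```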
6. Conclusions

This paper proposed Soft Threshold Reparameterization (STR), a novel use of the soft-threshold operator on the weights in DNNs, to smoothly induce sparsity while learning layer-wise pruning thresholds, thereby obtaining a non-uniform sparsity budget. Extensive experimentation showed that STR is state-of-the-art for unstructured sparsity in CNNs on ImageNet-1K while also being effective for structured sparsity in RNNs. Our method results in sparse models that have significantly lower inference costs than the baselines. In particular, STR achieves the same accuracy as the baselines for 90% sparse MobileNetV1 with 50% fewer FLOPs. STR has ∼10% higher accuracy than the existing methods in the ultra sparse (99%) regime for ResNet50, showing the effectiveness of the learnt non-uniform sparsity budgets. STR can also induce low-rank structure in RNNs while increasing the prediction accuracy, showing the generalizability of the proposed reparameterization. Finally, STR is easy to adapt, and the learnt budgets are transferable.

Acknowledgments

We are grateful to Keivan Alizadeh, Tapan Chugh, Tim Dettmers, Erich Elsen, Utku Evci, Daniel Gordon, Gabriel Ilharco, Sarah Pratt, James Park, Mohammad Rastegari and Matt Wallingford for helpful discussions and feedback. Mitchell Wortsman is in part supported by the AI2 Fellowship in AI. Sham Kakade acknowledges funding from the Washington Research Foundation for Innovation in Data-intensive Discovery, and the NSF Awards CCF-1637360, CCF-1703574, and CCF-1740551. Ali Farhadi acknowledges funding from the NSF Awards IIS 1652052, IIS 17303166, DARPA N66001-19-2-4031, 67102239 and gifts from the Allen Institute for Artificial Intelligence.
References

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283, 2016.

Alizadeh, K., Farhadi, A., and Rastegari, M. Butterfly transform: An efficient FFT based neural architecture design. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020.

Anguita, D., Ghio, A., Oneto, L., Parra, X., and Reyes-Ortiz, J. L. Human activity recognition on smartphones using a multiclass hardware-friendly support vector machine. In International Workshop on Ambient Assisted Living, pp. 216–223. Springer, 2012. URL https://archive.ics.uci.edu/ml/datasets/human+activity+recognition+using+smartphones.

Ashby, M., Baaij, C., Baldwin, P., Bastiaan, M., Bunting, O., Cairncross, A., Chalmers, C., Corrigan, L., Davis, S., van Doorn, N., et al. Exploiting unstructured sparsity on next-generation datacenter hardware.

Azarian, K., Bhalgat, Y., Lee, J., and Blankevoort, T. Learned threshold pruning. arXiv preprint arXiv:2003.00075, 2020.

Beck, A. and Teboulle, M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

Bellec, G., Kappel, D., Maass, W., and Legenstein, R. Deep rewiring: Training very sparse deep networks. In International Conference on Learning Representations, 2018.

Buciluǎ, C., Caruana, R., and Niculescu-Mizil, A. Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 535–541, 2006.

Candes, E., Tao, T., et al. The Dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics, 35(6):2313–2351, 2007.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE, 2009.

Dennis, D. K., Gaurkar, Y., Gopinath, S., Gupta, C., Jain, M., Kumar, A., Kusupati, A., Lovett, C., Patil, S. G., and Simhadri, H. V. EdgeML: Machine learning for resource-constrained edge devices. URL https://github.com/Microsoft/EdgeML.

Dettmers, T. and Zettlemoyer, L. Sparse networks from scratch: Faster training without losing performance. arXiv preprint arXiv:1907.04840, 2019.

Donoho, D. L. De-noising by soft-thresholding. IEEE Transactions on Information Theory, 41(3):613–627, 1995.

Elsen, E., Dukhan, M., Gale, T., and Simonyan, K. Fast sparse convnets. arXiv preprint arXiv:1911.09723, 2019.

Evci, U., Gale, T., Menick, J., Castro, P. S., and Elsen, E. Rigging the lottery: Making all tickets winners. In International Conference on Machine Learning, 2020.

Frankle, J. and Carbin, M. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations, 2019.

Gale, T., Elsen, E., and Hooker, S. The state of sparsity in deep neural networks. arXiv preprint arXiv:1902.09574, 2019.

Gordon, A., Eban, E., Nachum, O., Chen, B., Wu, H., Yang, T.-J., and Choi, E. MorphNet: Fast & simple resource-constrained structure learning of deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1586–1595, 2018.

Guo, Y., Yao, A., and Chen, Y. Dynamic network surgery for efficient DNNs. In Advances in Neural Information Processing Systems, pp. 1379–1387, 2016.

Han, S., Pool, J., Tran, J., and Dally, W. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pp. 1135–1143, 2015.

Hassibi, B. and Stork, D. G. Second order derivatives for network pruning: Optimal brain surgeon. In Advances in Neural Information Processing Systems, pp. 164–171, 1993.

Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Science & Business Media, 2009.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

He, Y., Zhang, X., and Sun, J. Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1389–1397, 2017.

He, Y., Lin, J., Liu, Z., Wang, H., Li, L.-J., and Han, S. AMC: AutoML for model compression and acceleration on mobile devices. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 784–800, 2018.

Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.

Huang, Z. and Wang, N. Data-driven sparse structure selection for deep neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 304–320, 2018.

Jaderberg, M., Vedaldi, A., and Zisserman, A. Speeding up convolutional neural networks with low rank expansions. In Proceedings of the British Machine Vision Conference. BMVA Press, 2014.

Jain, P., Tewari, A., and Kar, P. On iterative hard thresholding methods for high-dimensional M-estimation. In Advances in Neural Information Processing Systems, pp. 685–693, 2014.

Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. 2009.

Kusupati, A., Singh, M., Bhatia, K., Kumar, A., Jain, P., and Varma, M. FastGRNN: A fast, accurate, stable and tiny kilobyte sized gated recurrent neural network. In Advances in Neural Information Processing Systems, pp. 9017–9028, 2018.

LeCun, Y., Denker, J. S., and Solla, S. A. Optimal brain damage. In Advances in Neural Information Processing Systems, pp. 598–605, 1990.

Lee, N., Ajanthan, T., and Torr, P. SNIP: Single-shot network pruning based on connection sensitivity. In International Conference on Learning Representations, 2019.

Lee, Y. Differentiable sparsification for deep neural networks. arXiv preprint arXiv:1910.03201, 2019.

Li, H., Kadav, A., Durdanovic, I., Samet, H., and Graf, H. P. Pruning filters for efficient convnets. In International Conference on Learning Representations, 2017.

Lin, T., Stich, S. U., Barba, L., Dmitriev, D., and Jaggi, M. Dynamic model pruning with feedback. In International Conference on Learning Representations, 2020.

Liu, B., Wang, M., Foroosh, H., Tappen, M., and Pensky, M. Sparse convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 806–814, 2015.

Liu, J., Xu, Z., Shi, R., Cheung, R. C. C., and So, H. K. Dynamic sparse training: Find efficient sparse network from scratch with trainable masked layers. In International Conference on Learning Representations, 2020.

Liu, Z., Sun, M., Zhou, T., Huang, G., and Darrell, T. Rethinking the value of network pruning. In International Conference on Learning Representations, 2019.

Louizos, C., Welling, M., and Kingma, D. P. Learning sparse neural networks through L0 regularization. In International Conference on Learning Representations, 2018.

Lu, Z., Sindhwani, V., and Sainath, T. N. Learning compact recurrent neural networks. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5960–5964. IEEE, 2016.

Luo, J.-H., Wu, J., and Lin, W. ThiNet: A filter level pruning method for deep neural network compression. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5058–5066, 2017.

Mocanu, D. C., Mocanu, E., Stone, P., Nguyen, P. H., Gibescu, M., and Liotta, A. Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nature Communications, 9(1):2383, 2018.

Molchanov, D., Ashukha, A., and Vetrov, D. Variational dropout sparsifies deep neural networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 2498–2507. JMLR.org, 2017.

Mostafa, H. and Wang, X. Parameter efficient training of deep convolutional neural networks by dynamic sparse reparameterization. In International Conference on Machine Learning, pp. 4646–4655, 2019.

Narang, S., Elsen, E., Diamos, G., and Sengupta, S. Exploring sparsity in recurrent neural networks. In International Conference on Learning Representations, 2019.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8024–8035, 2019.

Patil, S. G., Dennis, D. K., Pabbaraju, C., Shaheer, N., Simhadri, H. V., Seshadri, V., Varma, M., and Jain, P. GesturePod: Enabling on-device gesture-based interaction for white cane users. In Proceedings of the 32nd Annual ACM Symposium on User Interface Software and Technology, pp. 403–415, 2019.

Ramanujan, V., Wortsman, M., Kembhavi, A., Farhadi, A., and Rastegari, M. What's hidden in a randomly weighted neural network? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11893–11902, 2020.

Rastegari, M., Ordonez, V., Redmon, J., and Farhadi, A. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, pp. 525–542. Springer, 2016.

Renda, A., Frankle, J., and Carbin, M. Comparing fine-tuning and rewinding in neural network pruning. In International Conference on Learning Representations, 2020.

Roy, D., Srivastava, S., Kusupati, A., Jain, P., Varma, M., and Arora, A. One size does not fit all: Multi-scale, cascaded RNNs for radar classification. In Proceedings of the 6th ACM International Conference on Systems for Energy-Efficient Buildings, Cities, and Transportation, pp. 1–10, 2019.

Savarese, P., Silva, H., and Maire, M. Winning the lottery with continuous sparsification. arXiv preprint arXiv:1912.04427, 2019.

Schwartz, R., Dodge, J., Smith, N. A., and Etzioni, O. Green AI. arXiv preprint arXiv:1907.10597, 2019.

Warden, P. Speech commands: A dataset for limited-vocabulary speech recognition. arXiv preprint arXiv:1804.03209, 2018. URL http://download.tensorflow.org/data/speech_commands_v0.01.tar.gz.

Wen, W., Wu, C., Wang, Y., Chen, Y., and Li, H. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pp. 2074–2082, 2016.

Wortsman, M., Farhadi, A., and Rastegari, M. Discovering neural wirings. In Advances in Neural Information Processing Systems, pp. 2680–2690, 2019.

Xiao, X., Wang, Z., and Rajasekaran, S. AutoPrune: Automatic network pruning by regularizing auxiliary parameters. In Advances in Neural Information Processing Systems, pp. 13681–13691, 2019.

Yu, J. and Huang, T. Network slimming by slimmable networks: Towards one-shot architecture search for channel numbers. arXiv preprint arXiv:1903.11728, 2019.

Zhou, H., Lan, J., Liu, R., and Yosinski, J. Deconstructing lottery tickets: Zeros, signs, and the supermask. In Advances in Neural Information Processing Systems, pp. 3592–3602, 2019.

Zhu, M. and Gupta, S. To prune, or not to prune: Exploring the efficacy of pruning for model compression. arXiv preprint arXiv:1710.01878, 2017.