ADMM-DAD NET: A DEEP UNFOLDING NETWORK FOR ANALYSIS COMPRESSED SENSING
Vasiliki Kouni⋆   Georgios Paraskevopoulos‡,∗   Holger Rauhut†   George C. Alexandropoulos⋆

⋆ Dep. of Informatics and Telecommunications, National & Kapodistrian University of Athens, Greece
† Chair for Mathematics of Information Processing, RWTH Aachen University, Germany
‡ School of Electrical & Computer Engineering, National Technical University of Athens, Greece
∗ Institute for Language & Speech Processing, Athena Research Center, Athens, Greece

arXiv:2110.06986v1 [cs.IT] 13 Oct 2021

ABSTRACT

In this paper, we propose a new deep unfolding neural network based on the ADMM algorithm for analysis Compressed Sensing. The proposed network jointly learns a redundant analysis operator for sparsification and reconstructs the signal of interest. We compare our proposed network with a state-of-the-art unfolded ISTA decoder that also learns an orthogonal sparsifier. Moreover, we consider not only image but also speech datasets as test examples. Computational experiments demonstrate that our proposed network outperforms the state-of-the-art deep unfolding networks, consistently for both real-world image and speech datasets.

Index Terms— Analysis Compressed Sensing, ADMM, deep neural network, deep unfolding.

1. INTRODUCTION

Compressed Sensing (CS) [1] is a modern technique to recover signals of interest x ∈ R^n from few linear and possibly corrupted measurements y = Ax + e ∈ R^m, with m < n. Iterative optimization algorithms applied to CS are by now widely used [2], [3], [4]. Recently, approaches based on deep learning were introduced [5], [6]. It seems promising to merge these two areas by considering what is called deep unfolding. The latter pertains to unfolding the iterations of well-known optimization algorithms into layers of a deep neural network (DNN), which reconstructs the signal of interest.

Related work: Deep unfolding networks have gained much attention in the last few years [7], [8], [9], because of some advantages they have compared to traditional DNNs: they are interpretable, integrate prior knowledge about the signal structure [10], and have a relatively small number of trainable parameters [11]. Especially in the case of CS, many unfolding networks have proven to work particularly well. The authors in [12], [13], [14], [15] propose deep unfolding networks that learn a decoder, which aims at reconstructing x from y. Additionally, these networks jointly learn a dictionary that sparsely represents x, along with thresholds used by the original optimization algorithms.

Motivation: Our work is inspired by the articles [13] and [15], which propose unfolded versions of the iterative soft thresholding algorithm (ISTA), with learnable parameters being the sparsifying (orthogonal) basis and/or the thresholds involved in ISTA. The authors then test their frameworks on synthetic data and/or real-world image datasets. In a similar spirit, we derive a decoder by interpreting the iterations of the alternating direction method of multipliers algorithm [16] (ADMM) as a DNN and call it the ADMM Deep Analysis Decoding (ADMM-DAD) network. We differentiate our approach by learning a redundant analysis operator as a sparsifier for x, i.e., we employ analysis sparsity in CS. We choose analysis sparsity over its synthesis counterpart because of some advantages the former offers; for example, analysis sparsity provides flexibility in modeling sparse signals, since it leverages the redundancy of the involved analysis operators. We choose to unfold ADMM into a DNN since most optimization-based CS algorithms cannot treat analysis sparsity, while ADMM solves the generalized LASSO problem [17], which resembles analysis CS. Moreover, we test our decoder on speech datasets, not only on image ones. To the best of our knowledge, an unfolded CS decoder has not yet been used on speech datasets. We compare our proposed network numerically to the state-of-the-art learnable ISTA of [15], on real-world image and speech data. On all datasets, our proposed neural architecture outperforms the baseline, in terms of both test and generalization error.

Key results: Our novelty is twofold: a) we introduce a new ADMM-based deep unfolding network that solves the analysis CS problem, namely the ADMM-DAD net, which jointly learns an analysis sparsifying operator; b) we test ADMM-DAD net on image and speech datasets (while state-of-the-art deep unfolding networks have only been tested on synthetic data and images so far). Experimental results demonstrate that ADMM-DAD outperforms the baseline ISTA-net on speech and images, indicating that the redundancy of the learned analysis operator leads to a smaller test MSE and generalization error as well.
Notation: For matrices A1, A2 ∈ R^{N×N}, we denote by [A1; A2] ∈ R^{2N×N} their concatenation with respect to the first dimension (stacked rows) and by [A1 | A2] ∈ R^{N×2N} their concatenation with respect to the second dimension (side-by-side columns). We denote by O_{N×N} a square matrix filled with zeros and write I_{N×N} for the real N × N identity matrix. For x ∈ R, τ > 0, the soft thresholding operator S_τ: R → R is defined in closed form as S_τ(x) = sign(x) max(0, |x| − τ). For x ∈ R^n, the soft thresholding operator acts componentwise, i.e., (S_τ(x))_i = S_τ(x_i). For two functions f, g: R^n → R^n, we write their composition as f ∘ g: R^n → R^n.

2. MAIN RESULTS

Optimization-based analysis CS: As mentioned in Section 1, the main idea of CS is to reconstruct a vector x ∈ R^n from y = Ax + e ∈ R^m, m < n, where A is the so-called measurement matrix and e ∈ R^m, with ‖e‖_2 ≤ η, corresponds to noise. To do so, we assume there exists a redundant sparsifying transform Φ ∈ R^{N×n} (N > n), called the analysis operator, such that Φx is (approximately) sparse. Using analysis sparsity in CS, we wish to recover x from y. A common approach is the analysis l1-minimization problem

    min_{x ∈ R^n} ‖Φx‖_1    subject to    ‖Ax − y‖_2 ≤ η.    (1)

A well-known algorithm that solves (1) is ADMM, which considers an equivalent generalized LASSO form of (1), i.e.,

    min_{x ∈ R^n} (1/2)‖Ax − y‖_2^2 + λ‖Φx‖_1,    (2)

with λ > 0 being a scalar regularization parameter. ADMM introduces the dual variables z, u ∈ R^N, so that (2) is equivalent to

    min_{x ∈ R^n} (1/2)‖Ax − y‖_2^2 + λ‖z‖_1    subject to    Φx − z = 0.    (3)

Now, for ρ > 0 (penalty parameter), initial points (x^0, z^0, u^0) = (0, 0, 0) and k ∈ N, the optimization problem in (3) can be solved by the iterative scheme of ADMM:

    x^{k+1} = (A^T A + ρ Φ^T Φ)^{-1} (A^T y + ρ Φ^T (z^k − u^k)),    (4)
    z^{k+1} = S_{λ/ρ}(Φ x^{k+1} + u^k),    (5)
    u^{k+1} = u^k + Φ x^{k+1} − z^{k+1}.    (6)

The iterates (4)–(6) are known [16] to converge to a solution of (3), i.e., (1/2)‖Ax^k − y‖_2^2 + λ‖z^k‖_1 → p⋆ and Φx^k − z^k → 0 as k → ∞, where p⋆ denotes the optimal value of (3).
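For concreteness, the scheme (4)–(6) can be sketched in a few lines of NumPy. The function below is our own minimal illustration (the names, default values and fixed iteration count are not taken from the paper), assuming a fixed, known analysis operator Φ:

```python
import numpy as np

def soft_threshold(v, tau):
    # S_tau(v) = sign(v) * max(0, |v| - tau), applied componentwise
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def admm_generalized_lasso(A, Phi, y, lam=1e-4, rho=1.0, num_iters=200):
    """Plain (non-learned) ADMM for (2): min_x 0.5*||Ax - y||_2^2 + lam*||Phi x||_1."""
    n, N = A.shape[1], Phi.shape[0]
    x, z, u = np.zeros(n), np.zeros(N), np.zeros(N)
    # The x-update (4) always solves the same linear system, so form the inverse once.
    M = np.linalg.inv(A.T @ A + rho * Phi.T @ Phi)
    for _ in range(num_iters):
        x = M @ (A.T @ y + rho * Phi.T @ (z - u))   # (4)
        z = soft_threshold(Phi @ x + u, lam / rho)  # (5)
        u = u + Phi @ x - z                         # (6)
    return x
```

In the unfolded network derived next, Φ is no longer fixed but becomes the trainable parameter.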
Neural network formulation: Our goal is to formulate the previous iterative scheme as a neural network. We substitute first (4) into the update rules (5) and (6), and then (5) into (6), yielding

    u^{k+1} = (I − W) u^k + W z^k + b − S_{λ/ρ}((I − W) u^k + W z^k + b),
    z^{k+1} = S_{λ/ρ}((I − W) u^k + W z^k + b),    (7)

where

    W = ρ Φ (A^T A + ρ Φ^T Φ)^{-1} Φ^T ∈ R^{N×N},    (8)
    b = b(y) = Φ (A^T A + ρ Φ^T Φ)^{-1} A^T y ∈ R^{N×1}.    (9)

We introduce v^k = [u^k; z^k] and set Θ = (I − W | W) ∈ R^{N×2N} to obtain

    v^{k+1} = [Θ; O_{N×2N}] v^k + [b; 0] + [−S_{λ/ρ}(Θ v^k + b); S_{λ/ρ}(Θ v^k + b)].    (10)

Now, we set Θ̃ = [Θ; O_{N×2N}] ∈ R^{2N×2N} and I_1 = [I_{N×N}; O_{N×N}] ∈ R^{2N×N}, I_2 = [−I_{N×N}; I_{N×N}] ∈ R^{2N×N}, so that (10) is transformed into

    v^{k+1} = Θ̃ v^k + I_1 b + I_2 S_{λ/ρ}(Θ v^k + b).    (11)

Based on (11), we formulate ADMM as a neural network with L layers/iterations, defined as

    f_1(y) = I_1 b(y) + I_2 S_{λ/ρ}(b(y)),
    f_k(v) = Θ̃ v + I_1 b + I_2 S_{λ/ρ}(Θ v + b),    k = 2, ..., L.

The trainable parameters are the entries of Φ (or, more generally, the parameters in a parameterization of Φ). We denote the concatenation of L such layers (all having the same Φ) as

    f_Φ^L(y) = f_L ∘ ⋯ ∘ f_1(y).    (12)

The final output x̂ is obtained after applying an affine map T motivated by (4) to the final layer L, so that

    x̂ = T(f_Φ^L(y)) = (A^T A + ρ Φ^T Φ)^{-1} (A^T y + ρ Φ^T (z^L − u^L)),    (13)

where [u^L; z^L] = v^L. In order to clip the output in case its norm falls out of a reasonable range, we add an extra function σ: R^n → R^n defined as σ(x) = x if ‖x‖_2 ≤ B_out and σ(x) = B_out x/‖x‖_2 otherwise, for some fixed constant B_out > 0. We introduce the hypothesis class

    H^L = {σ ∘ h: R^m → R^n : h(y) = T(f_Φ^L(y)), Φ ∈ R^{N×n}, N > n},    (14)

consisting of all the functions that ADMM-DAD can implement.
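To make the construction concrete, here is a minimal PyTorch sketch of an L-layer ADMM-DAD decoder implementing (8)–(13) with a trainable Φ. The class name, the batching conventions, and the He-style initialization and B_out constants are our own choices, not taken from the paper; W, b and Θ are recomputed in every forward pass because Φ changes during training.

```python
import torch
import torch.nn as nn

class ADMMDADNet(nn.Module):
    """Sketch of an L-layer ADMM-DAD decoder with trainable analysis operator Phi."""
    def __init__(self, A, n, N, L, lam=1e-4, rho=1.0, B_out=1e2):
        super().__init__()
        self.register_buffer("A", A)                                   # fixed m x n measurement matrix
        self.Phi = nn.Parameter(torch.randn(N, n) * (2.0 / n) ** 0.5)  # He-style (normal) init
        self.L, self.lam, self.rho, self.B_out = L, lam, rho, B_out

    def soft(self, t):
        # componentwise soft thresholding S_{lam/rho}
        return torch.sign(t) * torch.clamp(t.abs() - self.lam / self.rho, min=0.0)

    def forward(self, y):                                              # y: (batch, m)
        A, Phi, rho, N = self.A, self.Phi, self.rho, self.Phi.shape[0]
        Minv = torch.linalg.inv(A.T @ A + rho * Phi.T @ Phi)           # n x n
        W = rho * Phi @ Minv @ Phi.T                                   # (8)
        b = y @ (Phi @ Minv @ A.T).T                                   # (9), shape (batch, N)
        Theta = torch.cat([torch.eye(N, device=W.device) - W, W], 1)   # (I - W | W), N x 2N
        s = self.soft(b)                                               # first layer f_1
        v = torch.cat([b - s, s], dim=1)                               # v = [u; z], stored as (batch, 2N)
        for _ in range(self.L - 1):                                    # layers 2..L, eq. (11)
            t = v @ Theta.T + b                                        # Theta v + b
            s = self.soft(t)
            v = torch.cat([t - s, s], dim=1)                           # [u^{k+1}; z^{k+1}]
        u_L, z_L = v[:, :N], v[:, N:]
        x_hat = (y @ A + rho * (z_L - u_L) @ Phi) @ Minv.T             # affine map T, eq. (13)
        norm = x_hat.norm(dim=1, keepdim=True).clamp(min=1e-12)
        return torch.where(norm <= self.B_out, x_hat, self.B_out * x_hat / norm)  # sigma clipping
```

Training then amounts to minimizing the MSE loss (15) below over the entries of Φ, e.g., with Adam, as described in Section 3.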
Then, given the aforementioned class and a set S = {(y_i, x_i)}_{i=1}^s of s training samples, ADMM-DAD yields a function/decoder h_S ∈ H^L that aims at reconstructing x from y = Ax. In order to measure the difference between x_i and x̂_i = h_S(y_i), i = 1, ..., s, we choose the training mean squared error (train MSE)

    L_train = (1/s) Σ_{i=1}^s ‖h(y_i) − x_i‖_2^2    (15)

as loss function. The test mean squared error (test MSE) is defined as

    L_test = (1/d) Σ_{i=1}^d ‖h(ỹ_i) − x̃_i‖_2^2,    (16)

where D = {(ỹ_i, x̃_i)}_{i=1}^d is a set of d test data, not used in the training phase. We examine the generalization ability of the network by considering the difference between the average train MSE and the average test MSE, i.e.,

    L_gen = |L_test − L_train|.    (17)

3. EXPERIMENTAL SETUP

Datasets and pre-processing: We train and test the proposed ADMM-DAD network on two speech datasets, i.e., SpeechCommands [18] (85511 training and 4890 test speech examples, sampled at 16 kHz) and TIMIT [19] (phonemes sampled at 16 kHz; we take 70% of the dataset for training and the remaining 30% for testing), and two image datasets, i.e., MNIST [20] (60000 training and 10000 test 28 × 28 image examples) and CIFAR10 [21] (50000 training and 10000 test 32 × 32 coloured image examples). For the CIFAR10 dataset, we transform the images into grayscale ones. We preprocess the raw speech data before feeding them to both our ADMM-DAD and ISTA-net: we downsample each .wav file from 16000 to 8000 samples and segment each downsampled .wav into 10 segments.

Experimental settings: We choose a random Gaussian measurement matrix A ∈ R^{m×n} and appropriately normalize it, i.e., Ã = A/√m. We consider three CS ratios m/n ∈ {25%, 40%, 50%}. We add zero-mean Gaussian noise with standard deviation std = 10^{-4} to the measurements, set the redundancy ratio N/n = 5 for the trainable analysis operator Φ ∈ R^{N×n}, perform He (normal) initialization for Φ, and choose (λ, ρ) = (10^{-4}, 1). We also examined different values for λ and ρ, as well as treating λ and ρ as trainable parameters, but both settings yielded identical performance. We evaluate ADMM-DAD for 5 and 10 layers. All networks are trained using the Adam optimizer [22] and batch size 128. For the image datasets, we set the learning rate η = 10^{-4} and train the 5- and 10-layer ADMM-DAD for 50 and 100 epochs, respectively. For the audio datasets, we set η = 10^{-5} and train the 5- and 10-layer ADMM-DAD for 40 and 50 epochs, respectively. We compare ADMM-DAD net to the ISTA-net proposed in [15]. For ISTA-net, we set the best hyper-parameters proposed by the original authors and experiment with 5 and 10 layers. All networks are implemented in PyTorch [23]. For our experiments, we report the average test MSE and generalization error as defined in (16) and (17), respectively.
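Putting the two paragraphs above together, the sketch below shows one plausible way to turn a speech file into training pairs (x_i, y_i): resample to 8 kHz, cut into 10 segments, and measure each segment with the normalized Gaussian matrix Ã = A/√m plus noise. The helper names, the use of torchaudio for resampling, and the mono conversion are our assumptions; the paper does not specify these implementation details.

```python
import torch
import torchaudio

def wav_to_segments(path, target_sr=8000, num_segments=10):
    """Load a .wav file, downsample it to 8 kHz and split it into 10 equal-length segments."""
    waveform, sr = torchaudio.load(path)                   # (channels, samples)
    waveform = waveform.mean(dim=0)                        # mono (assumption)
    waveform = torchaudio.transforms.Resample(orig_freq=sr, new_freq=target_sr)(waveform)
    seg_len = waveform.shape[0] // num_segments
    return waveform[: seg_len * num_segments].reshape(num_segments, seg_len)  # rows are signals x

def make_measurement_matrix(n, cs_ratio=0.25, generator=None):
    """Draw a Gaussian matrix A in R^{m x n} and normalize it: A_tilde = A / sqrt(m)."""
    m = int(cs_ratio * n)
    return torch.randn(m, n, generator=generator) / m ** 0.5

def measure(x, A_tilde, noise_std=1e-4):
    """Noisy compressed measurements y = A_tilde x + e (x may be a batch of signals)."""
    e = noise_std * torch.randn(x.shape[:-1] + (A_tilde.shape[0],))
    return x @ A_tilde.T + e
```

A single measurement matrix would be drawn once per CS ratio and reused for every training and test sample, matching the fixed A assumed by the decoder.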
4. EXPERIMENTS AND RESULTS

We compare our decoder to the baseline ISTA-net decoder, for 5 layers on all datasets with a fixed 25% CS ratio, and for 10 layers with both 40% and 50% CS ratios on the speech datasets, and report the corresponding average test MSE and generalization error in Table 1. Both the training errors and the test errors are always lower for our ADMM-DAD net than for ISTA-net. Overall, the results in Table 1 indicate that the redundancy of the learned analysis operator improves the performance of ADMM-DAD net, especially when tested on the speech datasets.

Table 1: Average test MSE and generalization error for 5-layer decoders (all datasets) and 10-layer decoders (speech datasets). ADMM-DAD achieves the better (lower) value in every case.

(a) 5 layers, 25% CS ratio
  Dataset     SpeechCommands            TIMIT                     MNIST                     CIFAR10
  Decoder     test MSE     gen. error   test MSE     gen. error   test MSE     gen. error   test MSE     gen. error
  ISTA-net    0.58·10^-2   0.13·10^-2   0.22·10^-3   0.24·10^-4   0.67·10^-1   0.17·10^-1   0.22·10^-1   0.12·10^-1
  ADMM-DAD    0.25·10^-2   0.16·10^-3   0.79·10^-4   0.90·10^-5   0.23·10^-1   0.16·10^-3   0.15·10^-1   0.11·10^-3

(b) 10 layers
              40% CS ratio                                        50% CS ratio
  Dataset     SpeechCommands            TIMIT                     SpeechCommands            TIMIT
  Decoder     test MSE     gen. error   test MSE     gen. error   test MSE     gen. error   test MSE     gen. error
  ISTA-net    0.46·10^-2   0.18·10^-2   0.20·10^-3   0.25·10^-4   0.45·10^-2   0.20·10^-2   0.20·10^-3   0.25·10^-4
  ADMM-DAD    0.13·10^-2   0.58·10^-4   0.42·10^-4   0.47·10^-5   0.87·10^-3   0.10·10^-4   0.29·10^-4   0.30·10^-5

Furthermore, we extract the spectrograms of an example test raw audio file from TIMIT reconstructed by each of the 5-layer decoders, using 1024 FFT points. The resulting spectrograms for 25% and 50% CS ratio are illustrated in Fig. 1. Both figures indicate that our decoder outperforms the baseline, since the former distinguishes many more frequencies than the latter. Naturally, the quality of the raw audio file reconstructed by both decoders increases as the CS ratio increases from 25% to 50%. However, ADMM-DAD reconstructs a clearer version of the signal than ISTA-net even at the 25% CS ratio, whereas the latter recovers a significant amount of noise even at the 50% CS ratio.

Fig. 1 (panels: (a) Original, (b) 5-layer ADMM-DAD reconstruction, (c) 5-layer ISTA-net reconstruction): Spectrograms of a reconstructed test raw audio file from TIMIT for 25% CS ratio (top) and 50% CS ratio (bottom).
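For reference, spectrograms like those in Fig. 1 can be computed from a reconstructed waveform along the following lines. The paper only states that 1024 FFT points are used, so the default hop length, window and dB conversion below are our assumptions.

```python
import torchaudio

def spectrogram_db(waveform, n_fft=1024):
    """Power spectrogram with 1024 FFT points, converted to dB for visualization."""
    spec = torchaudio.transforms.Spectrogram(n_fft=n_fft)(waveform)  # default hop: n_fft // 2
    return torchaudio.transforms.AmplitudeToDB()(spec)
```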
Finally, we examine the robustness of both 10-layer decoders. We consider noisy measurements in the test set of TIMIT, taken at 25% and 40% CS ratio, with varying standard deviation (std) of the additive Gaussian noise. Fig. 2 shows how the average test MSE scales as the noise std increases. Our decoder outperforms the baseline by an order of magnitude and is robust to increasing levels of noise. This behaviour confirms improved robustness when learning a redundant sparsifying dictionary instead of an orthogonal one.

Fig. 2 (panels: (a) 25% CS ratio, (b) 40% CS ratio): Average test MSE for increasing std levels of additive Gaussian noise. Blue: 10-layer ADMM-DAD, orange: 10-layer ISTA-net.

5. CONCLUSION AND FUTURE DIRECTIONS

In this paper we derived ADMM-DAD, a new deep unfolding network for solving the analysis Compressed Sensing problem, by interpreting the iterations of the ADMM algorithm as layers of the network. Our decoder jointly reconstructs the signal of interest and learns a redundant analysis operator that serves as a sparsifier for the signal. We compared our framework with a state-of-the-art ISTA-based unfolded network on speech and image datasets. Our experiments confirm improved performance: the redundancy provided by the learned analysis operator yields a lower average test MSE and generalization error for our method compared to the ISTA-net. Future work will include the derivation of generalization bounds for the hypothesis class defined in (14), similar to [15]. Additionally, it would be interesting to examine the performance of ADMM-DAD when constraining Φ to a particular class of operators, e.g., Φ being a tight frame.
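As an illustration of this last point, one common way to enforce a tight-frame constraint (here, a Parseval frame with Φ^T Φ = I_n) is to project Φ after each gradient update onto the set of Parseval frames via its polar factor. This is our own sketch of that future direction, not something evaluated in the paper.

```python
import torch

def project_to_parseval(Phi):
    """Nearest Parseval frame (Phi^T Phi = I) to Phi in Frobenius norm, via the polar factor."""
    U, _, Vh = torch.linalg.svd(Phi, full_matrices=False)  # Phi = U diag(s) Vh, with U of size N x n
    return U @ Vh
```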
6. REFERENCES

[1] Emmanuel J. Candès, Justin Romberg, and Terence Tao, "Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information," IEEE Transactions on Information Theory, vol. 52, no. 2, pp. 489–509, 2006.
[2] Amir Beck and Marc Teboulle, "A fast iterative shrinkage-thresholding algorithm for linear inverse problems," SIAM Journal on Imaging Sciences, vol. 2, no. 1, pp. 183–202, 2009.
[3] Sundeep Rangan, Philip Schniter, and Alyson K. Fletcher, "Vector approximate message passing," IEEE Transactions on Information Theory, vol. 65, no. 10, pp. 6664–6684, 2019.
[4] Elaine T. Hale, Wotao Yin, and Yin Zhang, "Fixed-point continuation applied to compressed sensing: implementation and numerical experiments," Journal of Computational Mathematics, pp. 170–194, 2010.
[5] Guang Yang et al., "DAGAN: Deep de-aliasing generative adversarial networks for fast compressed sensing MRI reconstruction," IEEE Transactions on Medical Imaging, vol. 37, no. 6, pp. 1310–1321, 2017.
[6] Ali Mousavi, Gautam Dasarathy, and Richard G. Baraniuk, "DeepCodec: Adaptive sensing and recovery via deep convolutional neural networks," arXiv preprint arXiv:1707.03386, 2017.
[7] Karol Gregor and Yann LeCun, "Learning fast approximations of sparse coding," in Proceedings of the 27th International Conference on Machine Learning, 2010, pp. 399–406.
[8] Scott Wisdom, John Hershey, Jonathan Le Roux, and Shinji Watanabe, "Deep unfolding for multichannel source separation," in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 121–125.
[9] Carla Bertocchi, Emilie Chouzenoux, Marie-Caroline Corbineau, Jean-Christophe Pesquet, and Marco Prato, "Deep unfolding of a proximal interior point method for image restoration," Inverse Problems, vol. 36, no. 3, pp. 034005, 2020.
[10] Yuqing Yang, Peng Xiao, Bin Liao, and Nikos Deligiannis, "A robust deep unfolded network for sparse signal recovery from noisy binary measurements," in 2020 28th European Signal Processing Conference (EUSIPCO). IEEE, 2021, pp. 2060–2064.
[11] Anand P. Sabulal and Srikrishna Bhashyam, "Joint sparse recovery using deep unfolding with application to massive random access," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 5050–5054.
[12] Zhonghao Zhang, Yipeng Liu, Jiani Liu, Fei Wen, and Ce Zhu, "AMP-Net: Denoising-based deep unfolding for compressive image sensing," IEEE Transactions on Image Processing, vol. 30, pp. 1487–1500, 2021.
[13] Jian Zhang and Bernard Ghanem, "ISTA-Net: Interpretable optimization-inspired deep network for image compressive sensing," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1828–1837.
[14] Yan Yang, Jian Sun, Huibin Li, and Zongben Xu, "ADMM-Net: A deep learning approach for compressive sensing MRI," arXiv preprint arXiv:1705.06869, 2017.
[15] Arash Behboodi, Holger Rauhut, and Ekkehard Schnoor, "Compressive sensing and neural networks from a statistical learning perspective," arXiv preprint arXiv:2010.15658, 2020.
[16] Stephen Boyd, Neal Parikh, and Eric Chu, Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers, Now Publishers Inc, 2011.
[17] Yunzhang Zhu, "An augmented ADMM algorithm with application to the generalized lasso problem," Journal of Computational and Graphical Statistics, vol. 26, no. 1, pp. 195–204, 2017.
[18] Pete Warden, "Speech Commands: A dataset for limited-vocabulary speech recognition," arXiv preprint arXiv:1804.03209, 2018.
[19] John S. Garofolo, "TIMIT acoustic-phonetic continuous speech corpus," Linguistic Data Consortium, 1993.
[20] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[21] Alex Krizhevsky, Geoffrey Hinton, et al., "Learning multiple layers of features from tiny images," 2009.
[22] Diederik P. Kingma and Jimmy Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[23] Nikhil Ketkar, "Introduction to PyTorch," in Deep Learning with Python, pp. 195–208. Springer, 2017.