ADMM-DAD NET: A DEEP UNFOLDING NETWORK FOR ANALYSIS COMPRESSED SENSING


Vasiliki Kouni⋆    Georgios Paraskevopoulos‡,∗    Holger Rauhut†    George C. Alexandropoulos⋆

⋆ Dep. of Informatics and Telecommunications, National & Kapodistrian University of Athens, Greece
† Chair for Mathematics of Information Processing, RWTH Aachen University, Germany
‡ School of Electrical & Computer Engineering, National Technical University of Athens, Greece
∗ Institute for Language & Speech Processing, Athena Research Center, Athens, Greece

ABSTRACT

In this paper, we propose a new deep unfolding neural network based on the ADMM algorithm for analysis Compressed Sensing. The proposed network jointly learns a redundant analysis operator for sparsification and reconstructs the signal of interest. We compare our proposed network with a state-of-the-art unfolded ISTA decoder, which also learns an orthogonal sparsifier. Moreover, we consider not only image, but also speech datasets as test examples. Computational experiments demonstrate that our proposed network outperforms the state-of-the-art deep unfolding networks consistently, for both real-world image and speech datasets.

Index Terms— Analysis Compressed Sensing, ADMM, deep neural network, deep unfolding.

1. INTRODUCTION

Compressed Sensing (CS) [1] is a modern technique to recover signals of interest x ∈ R^n from few linear and possibly corrupted measurements y = Ax + e ∈ R^m, with m < n. Iterative optimization algorithms applied on CS are by now widely used [2], [3], [4]. Recently, approaches based on deep learning were introduced [5], [6]. It seems promising to merge these two areas by considering what is called deep unfolding. The latter pertains to unfolding the iterations of well-known optimization algorithms into layers of a deep neural network (DNN), which reconstructs the signal of interest.

Related work: Deep unfolding networks have gained much attention in the last few years [7], [8], [9], because of some advantages they have compared to traditional DNNs: they are interpretable, integrate prior knowledge about the signal structure [10], and have a relatively small number of trainable parameters [11]. Especially in the case of CS, many unfolding networks have proven to work particularly well. The authors in [12], [13], [14], [15] propose deep unfolding networks that learn a decoder, which aims at reconstructing x from y. Additionally, these networks jointly learn a dictionary that sparsely represents x, along with thresholds used by the original optimization algorithms.

Motivation: Our work is inspired by the articles [13] and [15], which propose unfolded versions of the iterative soft thresholding algorithm (ISTA), with learnable parameters being the sparsifying (orthogonal) basis and/or the thresholds involved in ISTA. The authors then test their frameworks on synthetic data and/or real-world image datasets. In a similar spirit, we derive a decoder by interpreting the iterations of the alternating direction method of multipliers (ADMM) algorithm [16] as a DNN and call it ADMM Deep Analysis Decoding (ADMM-DAD) network. We differentiate our approach by learning a redundant analysis operator as a sparsifier for x, i.e., we employ analysis sparsity in CS. We choose analysis sparsity over its synthesis counterpart because of some advantages the former has. For example, analysis sparsity provides flexibility in modeling sparse signals, since it leverages the redundancy of the involved analysis operators. We choose to unfold ADMM into a DNN since most of the optimization-based CS algorithms cannot treat analysis sparsity, while ADMM solves the generalized LASSO problem [17], which resembles analysis CS. Moreover, we test our decoder on speech datasets, not only on image ones. To the best of our knowledge, an unfolded CS decoder has not yet been used on speech datasets. We compare our proposed network numerically to the state-of-the-art learnable ISTA of [15], on real-world image and speech data. In all datasets, our proposed neural architecture outperforms the baseline, in terms of both test and generalization error.

Key results: Our novelty is twofold: a) we introduce a new ADMM-based deep unfolding network that solves the analysis CS problem, namely ADMM-DAD net, that jointly learns an analysis sparsifying operator; b) we test ADMM-DAD net on image and speech datasets (while state-of-the-art deep unfolding networks are only tested on synthetic data and images so far). Experimental results demonstrate that ADMM-DAD outperforms the baseline ISTA-net on speech and images, indicating that the redundancy of the learned analysis operator leads to a smaller test MSE and generalization error as well.
Notation: For matrices A_1, A_2 ∈ R^{N×N}, we denote by [A_1 | A_2] ∈ R^{N×2N} their concatenation with respect to the second dimension, while we denote by [A_1; A_2] ∈ R^{2N×N} their concatenation with respect to the first dimension. We denote by O_{N×N} a square matrix filled with zeros. We write I_{N×N} for the real N × N identity matrix. For x ∈ R, τ > 0, the soft thresholding operator S_τ : R → R is defined in closed form as S_τ(x) = sign(x) max(0, |x| − τ). For x ∈ R^n, the soft thresholding operator acts componentwise, i.e., (S_τ(x))_i = S_τ(x_i). For two functions f, g : R^n → R^n, we write their composition as f ∘ g : R^n → R^n.
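To fix ideas, the componentwise soft thresholding operator can be written in a couple of lines of PyTorch; the snippet below is a minimal sketch (the function name soft_threshold is ours), and PyTorch's built-in torch.nn.functional.softshrink implements the same mapping.

```python
import torch

def soft_threshold(x: torch.Tensor, tau: float) -> torch.Tensor:
    # S_tau(x) = sign(x) * max(0, |x| - tau), applied componentwise
    return torch.sign(x) * torch.clamp(x.abs() - tau, min=0.0)
```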
2. MAIN RESULTS

Optimization-based analysis CS: As we mentioned in Section 1, the main idea of CS is to reconstruct a vector x ∈ R^n from y = Ax + e ∈ R^m, m < n, where A is the so-called measurement matrix and e ∈ R^m, with ‖e‖_2 ≤ η, corresponds to noise. To do so, we assume there exists a redundant sparsifying transform Φ ∈ R^{N×n} (N > n), called the analysis operator, such that Φx is (approximately) sparse. Using analysis sparsity in CS, we wish to recover x from y. A common approach is the analysis l1-minimization problem

    \min_{x \in \mathbb{R}^n} \|\Phi x\|_1 \quad \text{subject to} \quad \|Ax - y\|_2 \le \eta.    (1)

A well-known algorithm that solves (1) is ADMM, which considers an equivalent generalized LASSO form of (1), i.e.,

    \min_{x \in \mathbb{R}^n} \tfrac{1}{2}\|Ax - y\|_2^2 + \lambda \|\Phi x\|_1,    (2)

with λ > 0 being a scalar regularization parameter. ADMM introduces the dual variables z, u ∈ R^N, so that (2) is equivalent to

    \min_{x \in \mathbb{R}^n} \tfrac{1}{2}\|Ax - y\|_2^2 + \lambda \|z\|_1 \quad \text{subject to} \quad \Phi x - z = 0.    (3)

Now, for ρ > 0 (penalty parameter), initial points (x^0, z^0, u^0) = (0, 0, 0) and k ∈ N, the optimization problem in (3) can be solved by the iterative scheme of ADMM:

    x^{k+1} = (A^T A + \rho \Phi^T \Phi)^{-1}(A^T y + \rho \Phi^T (z^k - u^k))    (4)
    z^{k+1} = S_{\lambda/\rho}(\Phi x^{k+1} + u^k)    (5)
    u^{k+1} = u^k + \Phi x^{k+1} - z^{k+1}.    (6)

The iterates (4)–(6) are known [16] to converge to a solution p⋆ of (3), i.e., \tfrac{1}{2}\|Ax^k - y\|_2^2 + \lambda\|z^k\|_1 → p⋆ and Φx^k − z^k → 0 as k → ∞.
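For illustration, the update rules (4)–(6) can be run directly; the following PyTorch sketch uses our own naming (admm_generalized_lasso, n_iter) and forms the inverse in (4) explicitly for simplicity, so it should be read as a reference implementation of the iteration rather than the paper's code.

```python
import torch

def soft(v, tau):
    return torch.sign(v) * torch.clamp(v.abs() - tau, min=0.0)

def admm_generalized_lasso(A, Phi, y, lam=1e-4, rho=1.0, n_iter=100):
    """ADMM iterates (4)-(6) for min_x 0.5*||Ax - y||_2^2 + lam*||Phi x||_1."""
    n, N = A.shape[1], Phi.shape[0]
    M_inv = torch.linalg.inv(A.T @ A + rho * Phi.T @ Phi)  # matrix inverted in the x-update (4)
    x, z, u = torch.zeros(n), torch.zeros(N), torch.zeros(N)
    for _ in range(n_iter):
        x = M_inv @ (A.T @ y + rho * Phi.T @ (z - u))      # (4)
        z = soft(Phi @ x + u, lam / rho)                   # (5)
        u = u + Phi @ x - z                                # (6)
    return x
```

In practice one would factor A^T A + ρΦ^T Φ once (e.g., via a Cholesky decomposition) rather than forming an explicit inverse; the unfolded network derived next reuses exactly this structure, but with Φ learned from data.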
Neural network formulation: Our goal is to formulate the previous iterative scheme as a neural network. We substitute first (4) into the update rules (5) and (6) and second (5) into (6), yielding

    u^{k+1} = (I - W)u^k + W z^k + b - S_{\lambda/\rho}((I - W)u^k + W z^k + b)    (7)
    z^{k+1} = S_{\lambda/\rho}((I - W)u^k + W z^k + b),

where

    W = \rho \Phi (A^T A + \rho \Phi^T \Phi)^{-1} \Phi^T \in \mathbb{R}^{N \times N}    (8)
    b = b(y) = \Phi (A^T A + \rho \Phi^T \Phi)^{-1} A^T y \in \mathbb{R}^{N \times 1}.    (9)

We introduce v^k = [u^k; z^k] and set Θ = (I − W | W) ∈ R^{N×2N} to obtain

    v^{k+1} = \begin{bmatrix} \Theta \\ O_{N \times 2N} \end{bmatrix} v^k + \begin{bmatrix} b \\ 0 \end{bmatrix} + \begin{bmatrix} -S_{\lambda/\rho}(\Theta v^k + b) \\ S_{\lambda/\rho}(\Theta v^k + b) \end{bmatrix}.    (10)

Now, we set Θ̃ = [Θ; O_{N×2N}] ∈ R^{2N×2N} and I_1 = [I_{N×N}; O_{N×N}] ∈ R^{2N×N}, I_2 = [−I_{N×N}; I_{N×N}] ∈ R^{2N×N}, so that (10) is transformed into

    v^{k+1} = \tilde{\Theta} v^k + I_1 b + I_2 S_{\lambda/\rho}(\Theta v^k + b).    (11)

Based on (11), we formulate ADMM as a neural network with L layers/iterations, defined as

    f_1(y) = I_1 b(y) + I_2 S_{\lambda/\rho}(b(y)),
    f_k(v) = \tilde{\Theta} v + I_1 b + I_2 S_{\lambda/\rho}(\Theta v + b), \quad k = 2, \dots, L.

The trainable parameters are the entries of Φ (or more generally, the parameters in a parameterization of Φ). We denote the concatenation of L such layers (all having the same Φ) as

    f_\Phi^L(y) = f_L \circ \cdots \circ f_1(y).    (12)

The final output x̂ is obtained after applying an affine map T motivated by (4) to the final layer L, so that

    \hat{x} = T(f_\Phi^L(y)) = (A^T A + \rho \Phi^T \Phi)^{-1}(A^T y + \rho \Phi^T (z^L - u^L)),    (13)

where [u^L; z^L] = v^L. In order to clip the output in case its norm falls out of a reasonable range, we add an extra function σ : R^n → R^n defined as σ(x) = x if ‖x‖_2 ≤ B_out and σ(x) = B_out x/‖x‖_2 otherwise, for some fixed constant B_out > 0. We introduce the hypothesis class

    \mathcal{H}^L = \{\sigma \circ h : \mathbb{R}^m \to \mathbb{R}^n : h(y) = T(f_\Phi^L(y)),\ \Phi \in \mathbb{R}^{N \times n},\ N > n\}    (14)

consisting of all the functions that ADMM-DAD can implement. Then, given the aforementioned class and a set S = {(y_i, x_i)}_{i=1}^s of s training samples, ADMM-DAD yields a function/decoder h_S ∈ H^L that aims at reconstructing x from y = Ax. In order to measure the difference between x_i and x̂_i = h_S(y_i), i = 1, ..., s, we choose the training mean squared error (train MSE)

    \mathcal{L}_{\mathrm{train}} = \frac{1}{s} \sum_{i=1}^{s} \|h(y_i) - x_i\|_2^2    (15)

as loss function.
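As a concrete illustration of (8)–(14), one layer of (11) followed by the output map (13) and the clipping σ could be implemented along the following lines. This is a minimal PyTorch sketch under our own assumptions (class name ADMMDADNet, the default value of B_out, and a dense matrix inverse recomputed in every forward pass); it is not the authors' released implementation.

```python
import torch
import torch.nn as nn

def soft_threshold(x, tau):
    return torch.sign(x) * torch.clamp(x.abs() - tau, min=0.0)

class ADMMDADNet(nn.Module):
    """Unfolded ADMM decoder sketch: L layers of (11), then the affine map T of (13) and clipping."""

    def __init__(self, A, n_layers=5, redundancy=5, lam=1e-4, rho=1.0, b_out=1e2):
        super().__init__()
        m, n = A.shape
        self.register_buffer("A", A)
        # Trainable redundant analysis operator Phi in R^{N x n}, with N = redundancy * n
        self.Phi = nn.Parameter(torch.empty(redundancy * n, n))
        nn.init.kaiming_normal_(self.Phi)                      # He (normal) initialization
        self.L, self.lam, self.rho, self.b_out = n_layers, lam, rho, b_out

    def forward(self, y):                                      # y: (batch, m)
        A, Phi, rho = self.A, self.Phi, self.rho
        N = Phi.shape[0]
        M_inv = torch.linalg.inv(A.T @ A + rho * Phi.T @ Phi)  # (A^T A + rho Phi^T Phi)^{-1}
        W = rho * Phi @ M_inv @ Phi.T                          # (8)
        b = (Phi @ M_inv @ A.T @ y.T).T                        # (9), one row per sample
        I = torch.eye(N, device=y.device, dtype=y.dtype)
        Theta = torch.cat([I - W, W], dim=1)                   # Theta = (I - W | W)
        # Layer 1: v^1 = f_1(y) = I_1 b + I_2 S_{lam/rho}(b), i.e. v^0 = 0
        s = soft_threshold(b, self.lam / rho)
        u, z = b - s, s
        # Layers 2..L: v^{k+1} = Theta_tilde v^k + I_1 b + I_2 S_{lam/rho}(Theta v^k + b)   (11)
        for _ in range(self.L - 1):
            t = (Theta @ torch.cat([u, z], dim=1).T).T + b     # Theta v^k + b
            s = soft_threshold(t, self.lam / rho)
            u, z = t - s, s
        # Output map T of (13), followed by the norm clipping sigma
        x_hat = (M_inv @ (A.T @ y.T + rho * Phi.T @ (z - u).T)).T
        scale = torch.clamp(self.b_out / x_hat.norm(dim=1, keepdim=True).clamp(min=1e-12), max=1.0)
        return x_hat * scale
```

A usage sketch: net = ADMMDADNet(A, n_layers=5) and x_hat = net(y) for a batch of measurements y ∈ R^{batch×m}; training then minimizes the MSE loss (15) over pairs (y_i, x_i).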
5 layers, 25% CS ratio:

| Decoder  | SpeechCommands test MSE | SpeechCommands gen. error | TIMIT test MSE | TIMIT gen. error | MNIST test MSE | MNIST gen. error | CIFAR10 test MSE | CIFAR10 gen. error |
|----------|-------------------------|---------------------------|----------------|------------------|----------------|------------------|------------------|--------------------|
| ISTA-net | 0.58·10^-2 | 0.13·10^-2 | 0.22·10^-3 | 0.24·10^-4 | 0.67·10^-1 | 0.17·10^-1 | 0.22·10^-1 | 0.12·10^-1 |
| ADMM-DAD | **0.25·10^-2** | **0.16·10^-3** | **0.79·10^-4** | **0.90·10^-5** | **0.23·10^-1** | **0.16·10^-3** | **0.15·10^-1** | **0.11·10^-3** |

10 layers, 40% and 50% CS ratios:

| Decoder  | SpeechCommands (40%) test MSE | SpeechCommands (40%) gen. error | TIMIT (40%) test MSE | TIMIT (40%) gen. error | SpeechCommands (50%) test MSE | SpeechCommands (50%) gen. error | TIMIT (50%) test MSE | TIMIT (50%) gen. error |
|----------|-------------------------------|---------------------------------|----------------------|------------------------|-------------------------------|---------------------------------|----------------------|------------------------|
| ISTA-net | 0.46·10^-2 | 0.18·10^-2 | 0.20·10^-3 | 0.25·10^-4 | 0.45·10^-2 | 0.20·10^-2 | 0.20·10^-3 | 0.25·10^-4 |
| ADMM-DAD | **0.13·10^-2** | **0.58·10^-4** | **0.42·10^-4** | **0.47·10^-5** | **0.87·10^-3** | **0.10·10^-4** | **0.29·10^-4** | **0.30·10^-5** |

Table 1: Average test MSE and generalization error for 5-layer decoders (all datasets) and 10-layer decoders (speech datasets). Bold letters indicate the best performance between the two decoders.

The test mean square error (test MSE) is defined as

    \mathcal{L}_{\mathrm{test}} = \frac{1}{d} \sum_{i=1}^{d} \|h(\tilde{y}_i) - \tilde{x}_i\|_2^2,    (16)

where D = {(ỹ_i, x̃_i)}_{i=1}^d is a set of d test data, not used in the training phase. We examine the generalization ability of the network by considering the difference between the average train MSE and the average test MSE, i.e.,

    \mathcal{L}_{\mathrm{gen}} = |\mathcal{L}_{\mathrm{test}} - \mathcal{L}_{\mathrm{train}}|.    (17)
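Concretely, (15)–(17) average the squared l2 reconstruction error per sample (a sum over signal coordinates, then a mean over the dataset), which differs from PyTorch's default element-wise MSELoss reduction; a small helper, with names of our choosing, could look as follows.

```python
import torch

def avg_mse(decoder, ys, xs):
    # (1/s) * sum_i ||h(y_i) - x_i||_2^2, as in (15) and (16)
    with torch.no_grad():
        return ((decoder(ys) - xs) ** 2).sum(dim=1).mean().item()

# Generalization error (17): L_gen = |L_test - L_train|
# l_gen = abs(avg_mse(decoder, test_y, test_x) - avg_mse(decoder, train_y, train_x))
```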

3. EXPERIMENTAL SETUP

Datasets and pre-processing: We train and test the proposed ADMM-DAD network on two speech datasets, i.e., SpeechCommands [18] (85511 training and 4890 test speech examples, sampled at 16kHz) and TIMIT [19] (phonemes sampled at 16kHz; we take 70% of the dataset for training and the remaining 30% for testing), and two image datasets, i.e., MNIST [20] (60000 training and 10000 test 28 × 28 image examples) and CIFAR10 [21] (50000 training and 10000 test 32 × 32 coloured image examples). For the CIFAR10 dataset, we transform the images into grayscale ones. We preprocess the raw speech data before feeding them to both our ADMM-DAD and ISTA-net: we downsample each .wav file from 16000 to 8000 samples and segment each downsampled .wav into 10 segments.
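One plausible reading of this pre-processing, sketched below with torchaudio (our choice of library; the paper does not specify the implementation), is to halve the 16 kHz sampling rate so that a one-second clip has 8000 samples and then split it into 10 segments of length 800, which would be the signal dimension n. Clips shorter than one second would additionally need padding.

```python
import torchaudio

def preprocess_wav(path, target_len=8000, n_segments=10):
    """Load a .wav, downsample it to 8000 samples, and split it into 10 segments."""
    waveform, sample_rate = torchaudio.load(path)          # (channels, samples), e.g. 16 kHz
    waveform = waveform.mean(dim=0)                        # collapse to mono
    waveform = torchaudio.functional.resample(waveform, sample_rate, sample_rate // 2)
    waveform = waveform[:target_len]                       # keep exactly 8000 samples
    return waveform.reshape(n_segments, -1)                # 10 segments of length 800
```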
Experimental settings: We choose a random Gaussian measurement matrix A ∈ R^{m×n} and appropriately normalize it, i.e., Ã = A/√m. We consider three CS ratios m/n ∈ {25%, 40%, 50%}. We add zero-mean Gaussian noise with standard deviation std = 10^-4 to the measurements, set the redundancy ratio N/n = 5 for the trainable analysis operator Φ ∈ R^{N×n}, perform He (normal) initialization for Φ and choose (λ, ρ) = (10^-4, 1). We also examine different values for λ, ρ, as well as treating λ, ρ as trainable parameters, but both settings yielded identical performance. We evaluate ADMM-DAD for 5 and 10 layers. All networks are trained using the Adam optimizer [22] and batch size 128. For the image datasets, we set the learning rate η = 10^-4 and train the 5- and 10-layer ADMM-DAD for 50 and 100 epochs, respectively. For the audio datasets, we set η = 10^-5 and train the 5- and 10-layer ADMM-DAD for 40 and 50 epochs, respectively. We compare ADMM-DAD net to the ISTA-net proposed in [15]. For ISTA-net, we set the best hyper-parameters proposed by the original authors and experiment with 5 and 10 layers. All networks are implemented in PyTorch [23]. For our experiments, we report the average test MSE and generalization error as defined in (16) and (17), respectively.
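The measurement model of this section can be sketched as follows; the helper names are ours, and the sketch only illustrates the normalized Gaussian matrix Ã = A/√m and the noisy measurements y = Ãx + e used for training and for the robustness experiments reported below.

```python
import torch

def make_measurement_setup(n, cs_ratio=0.25, noise_std=1e-4, seed=0):
    """Random Gaussian measurement matrix, normalized by 1/sqrt(m), with m = cs_ratio * n."""
    g = torch.Generator().manual_seed(seed)
    m = int(cs_ratio * n)
    A = torch.randn(m, n, generator=g) / m ** 0.5          # A_tilde = A / sqrt(m)

    def measure(x, std=noise_std):                         # x: (batch, n)
        e = std * torch.randn(x.shape[0], m, generator=g)  # zero-mean Gaussian noise
        return x @ A.T + e                                 # y = A_tilde x + e

    return A, measure
```

Calling measure with a larger std mimics the noise sweep used for Fig. 2, where the test measurements are corrupted with increasing noise levels.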
Fig. 1 (panels: (a) Original; (b) 5-layer ADMM-DAD reconstruction; (c) 5-layer ISTA-net reconstruction): Spectrograms of reconstructed test raw audio file from TIMIT for 25% CS ratio (top), as well as 50% CS ratio (bottom).

Fig. 2 (panels: (a) 25% CS ratio; (b) 40% CS ratio): Average test MSE for increasing std levels of additive Gaussian noise. Blue: 10-layer ADMM-DAD, orange: 10-layer ISTA-net.

4. EXPERIMENTS AND RESULTS

We compare our decoder to the baseline ISTA-net decoder, for 5 layers on all datasets with a fixed 25% CS ratio, and for 10 layers and both 40% and 50% CS ratios on the speech datasets, and report the corresponding average test MSE and generalization error in Table 1. Both the training errors and the test errors are always lower for our ADMM-DAD net than for ISTA-net. Overall, the results from Table 1 indicate that the redundancy of the learned analysis operator improves the performance of ADMM-DAD net, especially when tested on the speech datasets. Furthermore, we extract the spectrograms of an example test raw audio file of TIMIT reconstructed by either of the 5-layer decoders. We use 1024 FFT points. The resulting spectrograms for 25% and 50% CS ratio are illustrated in Fig. 1. Both figures indicate that our decoder outperforms the baseline, since the former distinguishes many more frequencies than the latter. Naturally, the quality of the reconstructed raw audio file by both decoders increases as the CS ratio increases from 25% to 50%. However, ADMM-DAD reconstructs, even for the 25% CS ratio, a clearer version of the signal compared to ISTA-net; the latter recovers a significant part of noise, even for the 50% CS ratio. Finally, we examine the robustness of both 10-layer decoders. We consider noisy measurements in the test set of TIMIT, taken at 25% and 40% CS ratio, with varying standard deviation (std) of the additive Gaussian noise.

Fig. 2 shows how the average test MSE scales as the noise std increases. Our decoder outperforms the baseline by an order of magnitude and is robust to increasing levels of noise. This behaviour confirms improved robustness when learning a redundant sparsifying dictionary instead of an orthogonal one.

5. CONCLUSION AND FUTURE DIRECTIONS

In this paper we derived ADMM-DAD, a new deep unfolding network for solving the analysis Compressed Sensing problem, by interpreting the iterations of the ADMM algorithm as layers of the network. Our decoder jointly reconstructs the signal of interest and learns a redundant analysis operator, serving as sparsifier for the signal. We compared our framework with a state-of-the-art ISTA-based unfolded network on speech and image datasets. Our experiments confirm improved performance: the redundancy provided by the learned analysis operator yields a lower average test MSE and generalization error of our method compared to the ISTA-net. Future work will include the derivation of generalization bounds for the hypothesis class defined in (14), similar to [15]. Additionally, it would be interesting to examine the performance of ADMM-DAD when constraining Φ to a particular class of operators, e.g., for Φ being a tight frame.
6. REFERENCES

[1] Emmanuel J. Candès, Justin Romberg, and Terence Tao, “Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information,” IEEE Transactions on Information Theory, vol. 52, no. 2, pp. 489–509, 2006.

[2] Amir Beck and Marc Teboulle, “A fast iterative shrinkage-thresholding algorithm for linear inverse problems,” SIAM Journal on Imaging Sciences, vol. 2, no. 1, pp. 183–202, 2009.

[3] Sundeep Rangan, Philip Schniter, and Alyson K. Fletcher, “Vector approximate message passing,” IEEE Transactions on Information Theory, vol. 65, no. 10, pp. 6664–6684, 2019.

[4] Elaine T. Hale, Wotao Yin, and Yin Zhang, “Fixed-point continuation applied to compressed sensing: implementation and numerical experiments,” Journal of Computational Mathematics, pp. 170–194, 2010.

[5] Guang Yang et al., “DAGAN: Deep de-aliasing generative adversarial networks for fast compressed sensing MRI reconstruction,” IEEE Transactions on Medical Imaging, vol. 37, no. 6, pp. 1310–1321, 2017.

[6] Ali Mousavi, Gautam Dasarathy, and Richard G. Baraniuk, “DeepCodec: Adaptive sensing and recovery via deep convolutional neural networks,” arXiv preprint arXiv:1707.03386, 2017.

[7] Karol Gregor and Yann LeCun, “Learning fast approximations of sparse coding,” in Proceedings of the 27th International Conference on Machine Learning, 2010, pp. 399–406.

[8] Scott Wisdom, John Hershey, Jonathan Le Roux, and Shinji Watanabe, “Deep unfolding for multichannel source separation,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 121–125.

[9] Carla Bertocchi, Emilie Chouzenoux, Marie-Caroline Corbineau, Jean-Christophe Pesquet, and Marco Prato, “Deep unfolding of a proximal interior point method for image restoration,” Inverse Problems, vol. 36, no. 3, p. 034005, 2020.

[10] Yuqing Yang, Peng Xiao, Bin Liao, and Nikos Deligiannis, “A robust deep unfolded network for sparse signal recovery from noisy binary measurements,” in 2020 28th European Signal Processing Conference (EUSIPCO). IEEE, 2021, pp. 2060–2064.

[11] Anand P. Sabulal and Srikrishna Bhashyam, “Joint sparse recovery using deep unfolding with application to massive random access,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 5050–5054.

[12] Zhonghao Zhang, Yipeng Liu, Jiani Liu, Fei Wen, and Ce Zhu, “AMP-Net: Denoising-based deep unfolding for compressive image sensing,” IEEE Transactions on Image Processing, vol. 30, pp. 1487–1500, 2021.

[13] Jian Zhang and Bernard Ghanem, “ISTA-Net: Interpretable optimization-inspired deep network for image compressive sensing,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1828–1837.

[14] Yan Yang, Jian Sun, Huibin Li, and Zongben Xu, “ADMM-Net: A deep learning approach for compressive sensing MRI,” arXiv preprint arXiv:1705.06869, 2017.

[15] Arash Behboodi, Holger Rauhut, and Ekkehard Schnoor, “Compressive sensing and neural networks from a statistical learning perspective,” arXiv preprint arXiv:2010.15658, 2020.

[16] Stephen Boyd, Neal Parikh, and Eric Chu, Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers, Now Publishers Inc, 2011.

[17] Yunzhang Zhu, “An augmented ADMM algorithm with application to the generalized lasso problem,” Journal of Computational and Graphical Statistics, vol. 26, no. 1, pp. 195–204, 2017.

[18] Pete Warden, “Speech Commands: A dataset for limited-vocabulary speech recognition,” arXiv preprint arXiv:1804.03209, 2018.

[19] John S. Garofolo, “TIMIT acoustic-phonetic continuous speech corpus,” Linguistic Data Consortium, 1993.

[20] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.

[21] Alex Krizhevsky, Geoffrey Hinton, et al., “Learning multiple layers of features from tiny images,” 2009.

[22] Diederik P. Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.

[23] Nikhil Ketkar, “Introduction to PyTorch,” in Deep Learning with Python, pp. 195–208. Springer, 2017.