A Zeroth-Order Block Coordinate Descent Algorithm for Huge-Scale Black-Box Optimization - ICML
A Zeroth-Order Block Coordinate Descent Algorithm for Huge-Scale Black-Box Optimization

HanQin Cai¹, Yuchen Lou², Daniel McKenzie¹, Wotao Yin³
¹ Department of Mathematics, University of California, Los Angeles
² Department of Mathematics, The University of Hong Kong
³ DAMO Academy, Alibaba US

International Conference on Machine Learning, June 20, 2021
Zeroth-Order Optimization

Goal: using only noisy function queries (no gradients), find

    x* ≈ argmin_{x ∈ R^d} f(x).

Applications:
• Simulation-based optimization.
• Adversarial attacks.
• Hyperparameter tuning.
• Reinforcement learning.
• Climate modelling.

Figure: Left: clean signal. Middle: attacking perturbation (scaled up). Right: attacked signal.

In all these applications, function queries are expensive.
Scaling to Huge Dimensions

• The gradient can be estimated via finite differencing:

      ∇_i f(x) ≈ (f(x + δ e_i) − f(x)) / δ.

• This requires O(d) queries per iteration, which is too expensive for large d.
• Recent work¹ uses compressed sensing to estimate ∇f(x): query

      y_i = (f(x + δ z_i) − f(x)) / δ ≈ z_i^T ∇f(x)   for i = 1, ···, m,

  then solve

      ∇f(x) ≈ argmin_{v ∈ R^d} ||Zv − y||_2   s.t.   ||v||_0 := #{i : v_i ≠ 0} ≤ s.   (1)

• If² ||∇f(x)||_0 ≤ s, then one can take m = O(s log d). Query efficient.
• However, solving (1) is computationally expensive.

Motivating Question: is there a query- and computationally efficient zeroth-order optimization algorithm?

¹ Wang et al., 2018; Cai et al., 2020.
² Approximate gradient sparsity is indeed observed in many applications. See Fig. 1 in (Cai et al., 2020).
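The compressed sensing estimator above can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the dimensions (d = 200, s = 4, m = 80), the sparse quadric test objective, and the random seed are arbitrary choices, and orthogonal matching pursuit stands in for the CoSaMP-style solver the slides reference.

```python
import numpy as np

rng = np.random.default_rng(0)
d, s, m, delta = 200, 4, 80, 1e-6  # illustrative sizes, not from the paper

# Toy objective with an s-sparse gradient: f(x) = sum_{i in S} x_i^2,
# so grad f(x) = 2x on the support S and 0 elsewhere.
S = rng.choice(d, size=s, replace=False)
def f(x):
    return np.sum(x[S] ** 2)

x = np.zeros(d)
x[S] = rng.uniform(1.0, 2.0, size=s)
g_true = np.zeros(d)
g_true[S] = 2.0 * x[S]

# m Rademacher query directions; each query gives y_i ≈ z_i^T grad f(x).
Z = rng.choice([-1.0, 1.0], size=(m, d))
y = np.array([(f(x + delta * z) - f(x)) / delta for z in Z])

# Recover the sparse gradient from m << d measurements via orthogonal
# matching pursuit (a stand-in for the CoSaMP-style solver in the slides).
A, b = Z / np.sqrt(m), y / np.sqrt(m)
support, coef, r = [], np.zeros(0), b.copy()
for _ in range(2 * s):                 # a few extra iterations for robustness
    scores = np.abs(A.T @ r)
    scores[support] = -np.inf          # never re-pick a chosen coordinate
    support.append(int(np.argmax(scores)))
    coef, *_ = np.linalg.lstsq(A[:, support], b, rcond=None)
    r = b - A[:, support] @ coef

g_hat = np.zeros(d)
g_hat[support] = coef
print(np.linalg.norm(g_hat - g_true) / np.linalg.norm(g_true))  # small
```

Note that only m = 80 function queries are used to estimate a 200-dimensional gradient, which is the point of problem (1).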
The ZO-BCD Algorithm

Our algorithm combines:
• the compressed sensing gradient estimator;
• block coordinate descent using randomized blocks.

Algorithm 1: Zeroth-Order Block Coordinate Descent (ZO-BCD)
 1: π ← randperm(d)                                  ▷ Create random permutation
 2: for j = 1, ···, J                                ▷ Create J blocks
 3:   x^(j) ← [x_π((j−1)d/J + 1), ···, x_π(jd/J)]    ▷ Assign variables to blocks
 4: for k = 1, ···, K                                ▷ Do K iterations
 5:   j ← randint({1, ···, J})                       ▷ Randomly select a block
 6:   for i = 1, ···, m                              ▷ Query objective function
 7:     y_i ← (f(x + δ z_i) − f(x)) / δ              ▷ Approximate z_i^T ∇^(j) f(x)
 8:   ĝ^(j) ← argmin_{v : ||v||_0 ≤ s} ||Zv − y||_2  ▷ Approximate block gradient
 9:   x_{k+1} ← x_k − α ĝ^(j)                        ▷ Step of BCD
10: return x_K                                       ▷ Approximated minimizer
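The full loop of Algorithm 1 can be sketched as below. Everything numeric here is an illustrative assumption (d = 200, J = 4 blocks, m = 40 queries per iteration, step size α = 0.25, the sparse quadric objective), and orthogonal matching pursuit again stands in for the sparse solver on line 8.

```python
import numpy as np

rng = np.random.default_rng(1)
d, s, J, m = 200, 5, 4, 40           # illustrative sizes, not the paper's
K, delta, alpha = 80, 1e-6, 0.25
bs = d // J                          # block size d/J

def f(x):
    # Sparse quadric benchmark: the gradient is s-sparse.
    return np.sum(x[:s] ** 2)

def sparse_ls(A, b, k):
    # Stand-in sparse solver for line 8 (orthogonal matching pursuit).
    support, coef, r = [], np.zeros(0), b.copy()
    for _ in range(k):
        scores = np.abs(A.T @ r)
        scores[support] = -np.inf
        support.append(int(np.argmax(scores)))
        coef, *_ = np.linalg.lstsq(A[:, support], b, rcond=None)
        r = b - A[:, support] @ coef
    v = np.zeros(A.shape[1])
    v[support] = coef
    return v

perm = rng.permutation(d)                               # line 1
blocks = [perm[j * bs:(j + 1) * bs] for j in range(J)]  # lines 2-3

x = np.ones(d)
f0 = f(x)
for _ in range(K):                                      # line 4
    blk = blocks[rng.integers(J)]                       # line 5
    Z = rng.choice([-1.0, 1.0], size=(m, bs))
    zfull = np.zeros((m, d))
    zfull[:, blk] = Z                                   # perturb only this block
    y = np.array([(f(x + delta * z) - f(x)) / delta for z in zfull])  # lines 6-7
    g_blk = sparse_ls(Z / np.sqrt(m), y / np.sqrt(m), s)              # line 8
    x[blk] = x[blk] - alpha * g_blk                     # line 9

print(f0, f(x))  # f should drop by several orders of magnitude
```

Per iteration, only the d/J coordinates of the selected block are perturbed, which is what makes both the sparse recovery problem and the descent step cheap.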
ZO-BCD Is Theoretically Sound

Main Theorem. Assume ||∇f(x)||_0 ≤ s for all x ∈ R^d, and choose J ≪ d random blocks. ZO-BCD returns x_K satisfying f(x_K) − f* ≤ ε using Õ(s/ε) total queries and Õ(sd/J²) FLOPS per iteration (w.h.p.).

Sketch of proof:
• Randomization ensures ||∇^(j) f(x)||_0 ≈ s/J.
• Compressed sensing^a guarantees ĝ^(j) ≈ ∇^(j) f(x).
• Apply the convergence theory for inexact^b block coordinate descent.

^a Needell et al., 2008.  ^b Tappenden et al., 2016.
A More Efficient Variant: ZO-BCD-RC

• Replace Z with a Rademacher circulant (RC) matrix³:

      C(z) = ⎡ z_1      z_2   ···  z_{d/J}   ⎤
             ⎢ z_{d/J}  z_1   ···  z_{d/J−1} ⎥
             ⎢  ⋮        ⋮    ⋱      ⋮       ⎥
             ⎣ z_2      z_3   ···  z_1       ⎦

• Only the vector z needs to be stored. Memory efficient.
• The matrix-vector product can be accelerated by fft & ifft. Even faster:

      C(z) · x = F( F(z) ⊙ F^{−1}(x) )   for all z and x.

³ We actually use only m randomly selected rows from C(z).
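The FFT identity above can be checked numerically against an explicitly built Rademacher circulant matrix. This is a minimal sketch: n = 8 stands in for the block size d/J, and the matrix follows the row layout shown on the slide (row i is z cyclically shifted right by i).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8  # stands in for the block size d/J

# Rademacher circulant matrix as on the slide: row 0 is (z_1, ..., z_n),
# row 1 is (z_n, z_1, ..., z_{n-1}), and so on.
z = rng.choice([-1.0, 1.0], size=n)
C = np.stack([np.roll(z, i) for i in range(n)])

x = rng.standard_normal(n)

# FFT form of the product: C(z)·x = F(F(z) ⊙ F^{-1}(x)).
# O(n log n) instead of O(n^2), and only the vector z is needed.
y_fft = np.fft.fft(np.fft.fft(z) * np.fft.ifft(x)).real

print(np.allclose(C @ x, y_fft))  # True
```

For real inputs the result of the nested transforms is real up to floating-point error, so taking `.real` is safe here.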
Experimental Results: Synthetic

We benchmarked ZO-BCD on two functions exhibiting gradient sparsity:
• Sparse quadric: f(x) = Σ_{i=1}^s x_i².
• Max-s-squared-sum: f(x) = Σ_{i=1}^s x_{σ(i)}², where |x_{σ(1)}| ≥ |x_{σ(2)}| ≥ ···.

ZO-BCD exceeds the prior state of the art.

Figure: Comparing ZO-BCD (variants ZO-BCD-R and ZO-BCD-RC) against various state-of-the-art zeroth-order algorithms (FDSA, SPSA, ZO-SCD, ZORO, LMMAES): function value vs. number of queries. Left: sparse quadric. Right: max-s-squared-sum.
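The two benchmark functions are straightforward to implement. A small sketch, with the convention (an assumption on our part) that the sparse quadric sums the first s coordinates, matching the formula above:

```python
import numpy as np

def sparse_quadric(x, s):
    """f(x) = x_1^2 + ... + x_s^2 (first s coordinates)."""
    return float(np.sum(np.asarray(x, dtype=float)[:s] ** 2))

def max_s_squared_sum(x, s):
    """f(x) = sum of the s largest squared coordinates."""
    x2 = np.sort(np.asarray(x, dtype=float) ** 2)
    return float(np.sum(x2[-s:]))

x = np.array([3.0, -1.0, 2.0, 0.5])
print(sparse_quadric(x, 2))      # 9 + 1 = 10.0
print(max_s_squared_sum(x, 2))   # 9 + 4 = 13.0
```

Both functions have s-sparse gradients (2x on the first s coordinates, respectively on the s largest-magnitude coordinates), which is exactly the structure the compressed sensing estimator exploits.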
Experimental Results: Adversarial Attack

• Sparse wavelet transform attack⁴:

      x* = argmin_x f( IWT( WT(x̃) + x ) ),

  where x̃ is the clean image/audio signal and f is the Carlini-Wagner loss function⁵.
• Image attack. Model: Inception-v3⁶. Wavelet: 'db45'. d ≈ 675,000.
• Audio attack. Model: commandNet⁷. Wavelet: Morse. d ≈ 1,700,000.

Image Attack
Method    | ASR | ℓ2 dist | Queries
ZO-SCD    | 78% | 57.5    | 2400
ZO-SGD    | 78% | 37.9    | 1590
ZO-AdaMM  | 81% | 28.2    | 1720
ZORO      | 90% | 21.1    | 2950
ZO-BCD    | 96% | 13.7    | 1662

Audio Attack
Method               | ASR
Alzantot et al, 2018 | 89.0%
Vadillo et al, 2019  | 70.4%
Li et al, 2020       | 96.8%
Xie et al, 2020      | 97.8%
ZO-BCD               | 97.9%

⁴ WT & IWT stand for a fixed wavelet transform and its inverse, respectively.
⁵ Carlini & Wagner, 2016; Chen et al., 2017.
⁶ Szegedy et al., 2016. Trained on ImageNet (Deng et al., 2009).
⁷ MATLAB model trained on the SpeechCommand dataset (Warden et al., 2018).