Accelerated Device Placement Optimization with Contrastive Learning

Hao Lan, Li Chen, Baochun Li
University of Toronto
hao.lan@mail.utoronto.ca
What is device placement?
Background
‣ Problem: Large DNNs cannot fit into a single GPU (memory limitation).

‣ Solution: Model parallelism, i.e., partition a DNN across multiple GPUs.

[Diagram: deep neural networks on the left are mapped onto the computation resources of a single machine on the right (CPU, GPU 1–GPU 4).]
Device placement with deep reinforcement learning
Objective: find the best assignment of operations to devices so as to minimize training time.
[Figure from the cited paper: RL-based placement of Inception-V3, with devices denoted by colors (transparent for the CPU, one unique color per GPU); the RL-based placement improves running time by 19.7% over the expert-designed placement.]

Mirhoseini et al. (Google Inc.), “Device placement optimization with reinforcement learning,” in Proc. ICML 2017.
Device placement with deep reinforcement learning
[Diagram: the agent's policy network observes the neural network's computational graph (operations 1–8), outputs a placement action onto the machine's computation resources (CPU, GPU 1–GPU 4), and receives the measured runtime as the reward.]
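To make this loop concrete, the following is a minimal, illustrative PyTorch sketch of RL-based device placement; the function names (e.g. measure_runtime) and the plain REINFORCE-style update are assumptions for illustration, not the implementation from the paper.

import torch

def train_placement_agent(policy, graph, optimizer, measure_runtime, num_episodes=100):
    # policy(graph) -> logits of shape [num_ops, num_devices]
    for _ in range(num_episodes):
        logits = policy(graph)
        dist = torch.distributions.Categorical(logits=logits)
        placement = dist.sample()                     # one device id per operation
        runtime = measure_runtime(graph, placement)   # run one training step on real hardware
        reward = -runtime                             # faster placements get higher reward
        # REINFORCE-style update: raise the log-probability of fast placements.
        loss = -(dist.log_prob(placement).sum() * reward)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()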
Agent design
[Diagram: two agent designs. Grouper-placer: node features (operations 1–8) are merged by a grouper into group embeddings, which a placer maps to a placement. Encoder-placer: node features are encoded into node embeddings, which a placer maps directly to a placement.]
Challenges
We notice that Google’s ICML 2017 approach uses about 100 workers and 12 to 27 hours to find the best placement for its workloads. This cost is mainly due to two reasons:

‣ The REINFORCE algorithm is inefficient: it requires a large number of samples to train the agent.

‣ In device placement, the environment is a real-world machine with multiple devices (a CPU and GPUs). Collecting rewards from a real-world environment is very slow, especially when the workload is a large DNN.
Solution
‣ For the first challenge, we replace the REINFORCE algorithm with the PPO algorithm to improve sample efficiency; however, it still needs a substantial number of samples to train the agent (a sketch of the clipped PPO objective follows this list).

‣ For the second challenge, one solution is to use a simulated environment instead of a real one (as proposed by Placeto). In that case, a model must be built for every single environment, so the method is no longer model-free.

‣ Is there a way to train the agent with very few samples, or even without any samples?
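As a minimal sketch of the clipped PPO surrogate objective mentioned above (written in PyTorch; the advantage computation, e.g. reward minus a moving-average baseline, is an assumption for illustration):

import torch

def ppo_clipped_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    # new_log_probs / old_log_probs: log-probabilities of the sampled placements
    # under the current and the behavior policy; advantages: e.g. reward minus
    # a moving-average baseline.
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()      # loss to minimize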
Our framework
‣ To address these challenges, we proposed our DRL-based framework, Mars: a graph encoder pre-trained by contrastive learning, followed by a light-weight segment-level sequence-to-sequence placer.

[Diagram: node features of the computational graph (op_type, input_shape, output_shape) and the adjacency matrix are fed into a GCN-based graph encoder pre-trained with DGI (Deep Graph Infomax, ICLR 2019); the resulting node representations form the input sequence to an LSTM-based sequence-to-sequence placer with a softmax output, which produces the placement policy.]
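A minimal sketch of such a GCN-based graph encoder, assuming PyTorch Geometric and illustrative layer sizes (the actual Mars architecture may differ):

import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv

class GraphEncoder(nn.Module):
    def __init__(self, in_dim, hidden_dim=128, out_dim=128):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, out_dim)

    def forward(self, x, edge_index):
        # x: [num_ops, in_dim] node features (op_type one-hot, input/output shapes)
        # edge_index: [2, num_edges] adjacency of the computational graph
        h = torch.relu(self.conv1(x, edge_index))
        return self.conv2(h, edge_index)              # [num_ops, out_dim] node embeddings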
Contrastive Learning

[Diagram: the computational graph and a corrupted computational graph are passed through a graph neural network to obtain node embeddings h⃗ and h̃⃗; together with a summary vector s⃗, they are trained against an unsupervised loss ℒ(⋅).]

We use contrastive learning to pre-train the graph encoder without any labeled data. After pre-training, the graph encoder can encode each operation in the computational graph into a node representation h⃗, which represents the operation in a vector space.
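A minimal sketch of DGI-style contrastive pre-training for the graph encoder, assuming the standard DGI recipe (feature-shuffling corruption, mean readout, bilinear discriminator); details may differ from the actual Mars implementation.

import torch
import torch.nn as nn

class DGIPretrainer(nn.Module):
    def __init__(self, encoder, dim=128):
        super().__init__()
        self.encoder = encoder                         # e.g. the GraphEncoder sketched earlier
        self.weight = nn.Parameter(torch.empty(dim, dim))
        nn.init.xavier_uniform_(self.weight)

    def discriminate(self, h, s):
        # bilinear score between node embeddings h and the graph summary s
        return h @ self.weight @ s

    def forward(self, x, edge_index):
        h_pos = self.encoder(x, edge_index)                    # embeddings of the real graph
        x_corrupt = x[torch.randperm(x.size(0))]               # corruption: shuffle node features
        h_neg = self.encoder(x_corrupt, edge_index)            # embeddings of the corrupted graph
        s = torch.sigmoid(h_pos.mean(dim=0))                   # readout: summary vector s⃗
        scores = torch.cat([self.discriminate(h_pos, s),
                            self.discriminate(h_neg, s)])
        labels = torch.cat([torch.ones(h_pos.size(0)),
                            torch.zeros(h_neg.size(0))])
        return nn.functional.binary_cross_entropy_with_logits(scores, labels)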
Contrastive Learning

[Diagram: the same contrastive pre-training pipeline as above; the learned parameters of the graph encoder are transferred into the DRL agent, which maps node features through the graph encoder to node representations and then through the placer to a placement.]
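A short sketch of how the pre-trained encoder parameters could be handed to the DRL agent, reusing the illustrative classes from the sketches above:

node_feature_dim = 64                                    # illustrative feature size
encoder = GraphEncoder(in_dim=node_feature_dim)
pretrainer = DGIPretrainer(encoder)
# ... run the unsupervised pre-training loop over computational graphs ...

agent_encoder = GraphEncoder(in_dim=node_feature_dim)
agent_encoder.load_state_dict(encoder.state_dict())      # transfer pre-trained parameters to the agent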
Light-weight Placer
‣ In the node classification task, DGI adds a logistic regression layer as a classifier after the pre-trained graph encoder, and trains it with the labeled data. To reduce the amount of labeled data required for training, the added classifier should be simple and easy to train.

‣ Following this idea, we use a light-weight placer in our agent design: a segment-level sequence-to-sequence neural network.
[Diagram: inside the DRL agent, node features pass through the pre-trained graph encoder to produce node embeddings, which the placer maps to a placement.]
Light-weight Placer
‣ A segment-level sequence-to-sequence placer is designed to avoid placing an extremely long operation sequence all at once.

‣ The light-weight placer can better utilize the pre-trained parameters of the graph encoder (a simplified sketch follows the diagram below).
[Diagram: the node embeddings {h⃗1, …, h⃗n} are split into segments of length s ({h⃗1, …, h⃗s}, {h⃗s+1, …, h⃗2s}, …, {h⃗ks+1, …, h⃗n}); each segment is encoded by a BiLSTM, and an LSTM decoder, starting from a go token, emits the placements p1, …, pn segment by segment.]
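A simplified PyTorch sketch of a segment-level sequence-to-sequence placer; it decodes non-autoregressively for brevity, whereas the diagram decodes from a go token, and the dimensions, segment length, and device count are illustrative assumptions.

import torch
import torch.nn as nn

class SegmentSeq2SeqPlacer(nn.Module):
    def __init__(self, emb_dim=128, hidden_dim=256, num_devices=5, segment_len=64):
        super().__init__()
        self.segment_len = segment_len
        self.encoder = nn.LSTM(emb_dim, hidden_dim // 2, bidirectional=True, batch_first=True)
        self.decoder = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, num_devices)

    def forward(self, node_embeddings):
        # node_embeddings: [num_ops, emb_dim], one row per operation
        logits = []
        for start in range(0, node_embeddings.size(0), self.segment_len):
            segment = node_embeddings[start:start + self.segment_len].unsqueeze(0)
            enc_out, _ = self.encoder(segment)        # BiLSTM over one segment
            dec_out, _ = self.decoder(enc_out)        # decode device scores for the segment
            logits.append(self.out(dec_out))
        return torch.cat(logits, dim=1).squeeze(0)    # [num_ops, num_devices] placement logits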
Experimental Results
‣ From the experimental results, our approach outperforms all state-of-the-art approaches.

Per-step runtime (in seconds) of placements found by different approaches.

Models         Human Experts   GPU Only   Grouper-Placer   Encoder-Placer   Mars     Mars (no pre-training)
Inception-V3   0.071           0.071      0.067            0.067            0.067    0.067
GNMT-4         1.661           OOM        1.418            1.437            1.379    1.396
BERT           OOM             OOM        12.661           11.737           9.214    11.363
Experimental Results
‣ With unsupervised pre-training, we achieved better performance
  while reducing the agent training time by 13.2% on average.
Thank you