Accelerated Device Placement Optimization with Contrastive Learning
Hao Lan, Li Chen, Baochun Li
University of Toronto
hao.lan@mail.utoronto.ca
What is device placement?
Background
‣ Problem: Large DNNs cannot fit into a single GPU (memory limitation).
‣ Solution: Model parallelism: partition a DNN across multiple GPUs, as in the sketch below.
[Figure: a deep neural network partitioned across the computation resources of one machine (CPU, GPU1, GPU2, GPU3, GPU4)]
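To make the idea concrete, here is a minimal, hypothetical PyTorch sketch of model parallelism: the network's layers are pinned to different GPUs and the activations hop between devices. The layer sizes and the two-GPU split are illustrative only, not the placement used in this work.

```python
import torch
import torch.nn as nn

# Minimal model-parallel sketch: the two halves of the network live on
# different GPUs, so a model too large for one GPU can still be trained.
class TwoGPUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.part1 = nn.Linear(1024, 4096).to('cuda:0')  # first partition
        self.part2 = nn.Linear(4096, 10).to('cuda:1')    # second partition

    def forward(self, x):
        x = torch.relu(self.part1(x.to('cuda:0')))
        return self.part2(x.to('cuda:1'))  # move activations across GPUs
```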
Device placement with deep reinforcement learning
Objective: to find the best assignment of operations to devices that minimizes training time.
Mirhoseini et al. (Google Inc.), "Device placement optimization with reinforcement learning," in Proc. ICML 2017.
[Figure: RL-based placement of Inception-V3. Devices are denoted by colors, where the transparent color represents an operation on a CPU and each other unique color represents a different GPU. The RL-based placement achieves a 19.7% improvement in running time over the expert-designed placement.]
Device placement with deep reinforcement learning
[Figure: the RL loop. The agent's policy network observes the neural network's computational graph and outputs a placement action over the machine's computation resources (CPU, GPU1, GPU2, GPU3, GPU4); the environment executes the placement and returns a reward.]
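The reward in this loop is just measured wall-clock time. Below is a hedged sketch of how it could be collected, where `model_runner` is a hypothetical wrapper that applies a placement and runs one training step of the DNN on the real machine.

```python
import time

def measure_reward(model_runner, placement, trials=5):
    """Run the placed graph a few times on the real machine and use the
    negative mean per-step runtime as the reward (slower placement,
    lower reward). `model_runner` is a hypothetical helper object."""
    model_runner.apply_placement(placement)
    runtimes = []
    for _ in range(trials):
        start = time.perf_counter()
        model_runner.run_step()  # one training step of the placed DNN
        runtimes.append(time.perf_counter() - start)
    return -sum(runtimes) / len(runtimes)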
Agent design
[Figure: two existing agent designs. Grouper-Placer: a Grouper merges node features into group embeddings, and a Placer maps each group to a device. Encoder-Placer: an Encoder maps node features to node embeddings, and a Placer maps each node to a device.]
Challenges
We notice that Google's ICML 2017 approach uses about 100 workers and 12 to 27 hours to find the best placement for its workloads. This cost is mainly caused by two factors:
‣ The REINFORCE algorithm is sample-inefficient: it requires a large number of samples to train the agent.
‣ In device placement, the environment is a real-world machine with multiple devices (CPU and GPUs). Collecting rewards from a real-world environment is very slow, especially when the workload is a large DNN.
Solution
‣ For the first challenge, we replace the REINFORCE algorithm with the PPO algorithm to improve sample efficiency (see the sketch after this list); however, PPO still needs a substantial number of samples to train the agent.
‣ For the second challenge, one solution is to use a simulated environment instead of a real one (as proposed by Placeto). In that case, we need to build a model for every single environment, so the method is no longer model-free.
‣ Is there a way to train the agent with very few samples, or even without any?
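As a reference, here is a minimal sketch of the standard PPO clipped surrogate loss, not necessarily the exact objective used in our agent. Because the probability ratio lets each sampled placement be reused for several gradient steps, PPO needs fewer environment samples than one-shot REINFORCE updates.

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Clipped PPO surrogate. logp_* are log-probabilities of the sampled
    placements under the new and old policies; advantages are the
    baseline-subtracted runtime rewards."""
    ratio = torch.exp(logp_new - logp_old)            # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()      # negate to minimize
```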
Our framework
‣ To address these challenges, we propose our DRL-based framework, Mars: a graph encoder pre-trained by contrastive learning (Deep Graph Infomax, ICLR 2019), followed by a light-weight segment-level sequence-to-sequence placer.
[Figure: overview of Mars. Node features (op_type, input_shape, output_shape) and the adjacency matrix of the neural network's computational graph are encoded by GCNs into node representations; a sequence-to-sequence LSTM placer with a softmax output maps the node-representation sequence to a placement policy.]
Contrastive Learning
[Figure: the computational graph and a corrupted copy of it are fed through the graph neural network, producing a node representation h⃗ for the real graph, h̃⃗ for the corrupted graph, and a graph summary s⃗, which are trained against an unsupervised contrastive loss ℒ(·).]
We use contrastive learning to pre-train the graph encoder without any labeled data. After pre-training, the graph encoder can encode each operation in the computational graph into a node representation h⃗, which represents the operation in a vector space.
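Below is a minimal sketch of DGI-style contrastive pre-training, assuming a PyTorch `encoder(X, A)` that maps node features and an adjacency matrix to node embeddings. The corruption function, readout, and bilinear discriminator follow the DGI paper; all names here are illustrative, not our exact implementation.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Scores (node embedding, graph summary) pairs via a bilinear form."""
    def __init__(self, dim):
        super().__init__()
        self.bilinear = nn.Bilinear(dim, dim, 1)

    def forward(self, h, s):
        # Broadcast the summary s to every node, then score each pair.
        return self.bilinear(h, s.expand_as(h)).squeeze(-1)

def dgi_step(encoder, disc, optimizer, X, A):
    # Positive samples: embeddings of the real computational graph.
    h_real = encoder(X, A)
    # Negative samples: corrupt the graph by shuffling node features.
    X_fake = X[torch.randperm(X.size(0))]
    h_fake = encoder(X_fake, A)
    # Graph summary: sigmoid of the mean node embedding (the readout).
    s = torch.sigmoid(h_real.mean(dim=0, keepdim=True))
    # Binary cross-entropy: real pairs -> 1, corrupted pairs -> 0.
    logits = torch.cat([disc(h_real, s), disc(h_fake, s)])
    labels = torch.cat([torch.ones(h_real.size(0)),
                        torch.zeros(h_fake.size(0))])
    loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```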
Contrastive Learning
[Figure: the parameters of the contrastively pre-trained graph encoder are transferred into the DRL agent, which maps node features through the graph encoder to node representations, and through the placer to a placement.]
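In code, this transfer is just a state-dict copy, sketched below with the names from the pre-training sketch above. Whether to freeze the encoder or fine-tune it during RL training is an illustrative choice here, not something the slides specify.

```python
# Copy the contrastively pre-trained weights into the agent's encoder.
agent.encoder.load_state_dict(encoder.state_dict())

# Optionally freeze the encoder so RL training only updates the
# light-weight placer (illustrative choice).
for param in agent.encoder.parameters():
    param.requires_grad = False
```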
Light-weight Placer
‣ For the node classification task, DGI adds a logistic regression layer as a classifier after the pre-trained graph encoder and trains it with labeled data. To reduce the amount of labeled data required for training, the added classifier should be simple and easy to train.
‣ Following this idea, we use a light-weight placer in our agent design: a segment-level sequence-to-sequence neural network.
[Figure: the DRL agent maps node features through the pre-trained graph encoder to node embeddings, and through the placer to a placement.]
Light-weight Placer
‣ A segment-level sequence-to-sequence placer is designed to avoid placing an extremely long operation sequence all at once.
‣ The light-weight placer can better utilize the pre-trained parameters of the graph encoder.
[Figure: the node embeddings {h⃗1, h⃗2, ..., h⃗n} are split into segments of length s ({h⃗1 ... h⃗s}, {h⃗s+1 ... h⃗2s}, ..., {h⃗ks+1 ... h⃗n}); a BiLSTM encodes each segment, and an LSTM decoder emits the placements p1 ... pn segment by segment.]
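A simplified, non-autoregressive sketch of such a segment-level placer in PyTorch follows; a real placer would condition each decision on the previous placements, and the dimensions and plain linear output head here are illustrative.

```python
import torch
import torch.nn as nn

class SegmentPlacer(nn.Module):
    """Segment-level seq-to-seq placer (sketch): a BiLSTM encodes one
    segment of node embeddings at a time, and an LSTM decoder assigns
    a device to every node in that segment."""
    def __init__(self, emb_dim, hidden_dim, num_devices, segment_len):
        super().__init__()
        self.segment_len = segment_len
        self.encoder = nn.LSTM(emb_dim, hidden_dim, bidirectional=True,
                               batch_first=True)
        self.decoder = nn.LSTM(2 * hidden_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_devices)

    def forward(self, node_embeddings):  # (n, emb_dim)
        logits = []
        for start in range(0, node_embeddings.size(0), self.segment_len):
            seg = node_embeddings[start:start + self.segment_len]
            enc_out, _ = self.encoder(seg.unsqueeze(0))   # (1, s, 2*hidden)
            dec_out, _ = self.decoder(enc_out)            # (1, s, hidden)
            logits.append(self.head(dec_out.squeeze(0)))  # (s, num_devices)
        # One device distribution per operation in the graph.
        return torch.cat(logits, dim=0)                   # (n, num_devices)
```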
Experimental Results
‣ From the experimental results, our approach outperforms all state-of-the-art baselines.

Per-step runtime (in seconds) of placements found by different approaches:

Models        Human Experts  GPU Only  Grouper-Placer  Encoder-Placer  Mars    Mars (no pre-training)
Inception-V3  0.071          0.071     0.067           0.067           0.067   0.067
GNMT-4        1.661          OOM       1.418           1.437           1.379   1.396
BERT          OOM            OOM       12.661          11.737          9.214   11.363
Experimental Results ‣ With unsupervised pre-training, we achieved better performance while reducing the agent training time by 13.2% on average.
Thank you