arXiv preprint

SELF-SUPERVISION IS ALL YOU NEED FOR SOLVING RUBIK'S CUBE

Kyo Takano
Tokyo, Japan
kyo.psych@gmail.com

arXiv:2106.03157v2 [cs.LG] 6 Oct 2021

ABSTRACT

While combinatorial problems are of great academic and practical importance, previous approaches like explicit heuristics and reinforcement learning have been complex and costly. To address this, we developed a simple and robust method to train a Deep Neural Network (DNN) through self-supervised learning for solving a goal-predefined combinatorial problem. Assuming that, along a path of random moves connecting two problem states, more optimal moves occur more frequently, the DNN can approximate an optimal solver by learning to predict the last move of a random scramble from the resulting problem state. Tested on 1,000 scrambled Rubik's Cube instances, a Transformer-based model could solve all of them near-optimally using a breadth-first search; with a maximum breadth of 10^3, the mean solution length was 20.5 moves. The proposed method may apply to other goal-predefined combinatorial problems, though it has a few constraints.

1 INTRODUCTION

Combinatorial search is a challenging task both in theory and in practice. Representative problems include the Traveling Salesman Problem, in which a salesman tries to travel through multiple cities along the shortest possible path. Despite how easy it sounds, the problem is labeled NP-hard due to its combinatorial complexity (Papadimitriou & Steiglitz, 1998); as the number of cities increases, the number of possible combinations skyrockets. In the real world, algorithms for such problems are applied in various ways to optimize logistics, supply chains, resource planning, etc.

In the past, researchers have proposed several shortcut algorithms to tackle the combinatorial complexity of such problems. Recently, in particular, reinforcement learning has enabled the automatic search for algorithms that optimally solve combinatorial problems (Mazyavkina et al., 2021); by rewarding efficient moves, DNNs can learn to find increasingly optimal solutions. Rubik's Cube is also one such problem (Demaine et al., 2018), and there have been not only logical methods (Korf, 1997; Rokicki et al., 2014) but also heuristic (Korf, 1982; El-Sourani et al., 2010; Arfaee et al., 2011) and reinforcement learning approaches (McAleer et al., 2018; Agostinelli et al., 2019; Corli et al., 2021) that can solve any scrambled state of the puzzle.

Nevertheless, combinatorial search is still no easy task: it requires detailed knowledge of the target problem, hand-coded explicit logic, or a carefully configured reinforcement learning system. Reinforcement learning is especially notable for the need to adjust the architecture, loss function, and hyperparameters several times to stabilize a learning process that involves non-differentiable operations.

To overcome this difficulty, we developed a novel method to train DNNs in a self-supervised manner for solving combinatorial problems with a predefined goal. Assuming that the first move in a random path is the one most likely to be optimal, the method teaches a DNN the probability distributions of optimal moves associated with problem states. In our experiment, we tested the proposed method by training DNNs to solve Rubik's Cube.

Contribution. We show that self-supervised learning can be all you need for solving some goal-predefined combinatorial problems almost optimally, without requiring specialized knowledge or reinforcement learning.
Figure 1: A simplified goal-predetermined combinatorial problem in training and inference. The optimalities of possible moves are probabilistically learned through observations of random moves in training and utilized by the model at inference. Left: During training, the random moves {a, b, c} can occur at every node with equal probability 1/3. If an unknown path consisting of these duplicable moves connects nodes G (goal) and S (scramble), the optimal (shortest) path (b, b) is the most likely of all possible paths, followed by the second-most-optimal three-move paths and then the four-move paths. Summing the probabilities over all possible paths at each node, a move lying on more optimal paths turns out to have a higher probability of being part of the unknown path. Thus, the model can learn these probability distributions merely by observing a sufficient number of random moves derived from G. Right: Using the learned probability distributions, the model can infer a reverse path from node S to the unknown goal G, consisting of the inverse moves {a', b', c'}. Since the moves on the shortest path are observed most frequently of all paths during training, in this instance the move b' is predicted to be more optimal than a' and c' at node S and at its subsequent node.

2 PROPOSED METHOD

In this section, we first provide an overview of the proposed method and then discuss constraints for applying it to combinatorial problems. Below, we consider a combinatorial problem as a pathfinding task on an implicit graph. Each node in the graph represents a unique state and has edges representing equally probable moves to its adjacent nodes.

2.1 OVERVIEW

The fundamental idea of the proposed method is as follows: 1) apply a sequence of random moves to the target problem, and 2) train a DNN to predict the last of those moves based on the resulting problem state. By learning the association between problem states and the probability distributions of last moves, the DNN can infer step-by-step moves forming a reverse path back to the initial goal state. After training, the multinomial probability distribution of moves at inference step i should approximate

    P(X_i = j) = \sum_k P(X_{i+1} = k) \cdot P(X_i = j \mid X_{i+1} = k)        (1)

where X_i is the choice of move at step i, and both j and k belong to the set of all available moves. The probability distribution of the move X_i implicitly depends on the moves and paths to the current state (or similar states) that appeared during training. See Figure 1 for a miniature example.
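To make the intuition behind Figure 1 and Eq. (1) concrete, the following minimal Python sketch tallies the last moves of random scrambles on a toy goal-predefined puzzle. The puzzle and all names here are illustrative assumptions rather than the setup used in the paper: the state is an integer modulo 12, the goal is 0, and the four moves {+1, -1, +2, -2} are equally likely. Empirically, the most frequent last move recorded for a state is one whose inverse lies on a shortest path from that state back to the goal, which is exactly the signal the proposed method trains a DNN to pick up.

```python
import random
from collections import Counter, defaultdict

# Toy goal-predefined puzzle (an assumption for illustration, not the paper's setup):
# a state is an integer mod 12, the goal is 0, and the moves are +1, -1, +2, -2.
N = 12
MOVES = (+1, -1, +2, -2)

def scramble(max_len=8):
    """Apply 1..max_len random moves to the goal state; return (state, last move)."""
    state, last = 0, None
    for _ in range(random.randint(1, max_len)):
        last = random.choice(MOVES)
        state = (state + last) % N
    return state, last

# Tally how often each move occurs as the last scramble move for each resulting state.
counts = defaultdict(Counter)
for _ in range(200_000):
    state, last = scramble()
    counts[state][last] += 1

# For state 3, the moves +1 and +2 should dominate the tally; their inverses
# (-1 and -2) are exactly the optimal first moves from state 3 back to the goal 0.
print(counts[3].most_common())
```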
2.2 CONDITION AND ASSUMPTION

There is one necessary condition for the proposed method to work, and one assumption under which a DNN approximates an optimal solver.

First, to be solved with the proposed method, a problem must meet the following condition.

Condition 1. For the problem of interest, at least one goal state must be predefined and readily available.

For a DNN to learn the probability distributions of moves connecting to the goal state, any path consisting of random moves must be derived from the goal state. Therefore, Condition 1 is necessary for the proposed method.

Subsequently, the following is desirable for a self-supervised DNN to approximate an optimal solver.

Assumption 1. The more optimal a move is, the more frequently it occurs.

Although Assumption 1 is not necessary for the proposed method, the more closely the problem satisfies it, the more optimal the DNN will be. Here, we formulate the statistical condition for Assumption 1 to hold on a problem in which all moves at every node have equal likelihood. Let p_n be the probability of taking a particular move as the first step toward a specific target node in a path of n moves or more,

    p_n = \sum_{k=n}^{k_{max}} \frac{C_{k|n}}{a^k}        (2)

where k is the move count to the target state, k_max is the maximum value of k, a is the total number of possible moves, and C_{k|n} is the number of k-move paths (k >= n). Provided that Assumption 1 is equivalent to p_n >= p_{n+1} and that p_n = C_{n|n}/a^n + \sum_{k=n+1}^{k_{max}} C_{k|n}/a^k, the following must be satisfied for any n:

    \frac{C_{n|n}}{a^n} + \sum_{k=n+1}^{k_{max}} \frac{C_{k|n}}{a^k} \ge \sum_{k=n+1}^{k_{max}} \frac{C_{k|n+1}}{a^k}        (3)

At this point, both C_{k|n} and C_{k|n+1} are unknown on an implicit graph, and the summed terms are not necessarily equal when the paths start with different moves. However, the only potential difference between two paths of different optimalities is the first move, and merely taking a sub-optimal (alternative) first move should not greatly increase the total number of same-distance paths. Therefore, we assume that this inequality holds true to varying degrees depending on the problem being addressed.

If Condition 1 and Assumption 1 are met, random moves yield probability distributions that peak around optimal moves. Note also that the degree to which a problem satisfies Assumption 1 affects the optimality of a trained DNN: the more possible moves violate Assumption 1, the less optimal the DNN will be. Although the example in Figure 1 has a very restricted setting for explanatory simplicity, this observation should apply to similar problems of higher dimensions and complexities under the same constraints. Accordingly, we hypothesize that DNNs can learn to solve goal-predefined combinatorial problems near-optimally.

3 EXPERIMENT

3.1 RUBIK'S CUBE

Rubik's Cube is a cubic puzzle consisting of 6 centers, 12 edges, and 8 corners. The edges and corners can move around the centers, and the puzzle can have approximately 4.3 × 10^19 different states, including the predefined goal at which each of the six faces shows only one color. As input into the DNNs, the edges and the corners have 48 unique possible states, determined by their spatial locations and flip/twist states [1]. In this work, we employ the half-turn metric (both 90° and 180° rotations count as one move), resulting in 18 possible moves for any state. Accordingly, solving Rubik's Cube can be described as a path-planning problem in the 18-D action space.

3.2 LEARNING OBJECTIVE

Given the state of a Rubik's Cube scrambled with 1–20 random moves [2], a DNN predicts the last move of the scramble (see Figure 2). Here, the maximum scramble length is set to 20 because this number is known to be sufficient for optimally solving any state of Rubik's Cube (Rokicki et al., 2014). Since the inverse of the last scramble move is expected to be close to optimal under Assumption 1, we hypothesize that this learning objective enables a DNN to approximate an optimal solver.

Figure 2: An illustrated scheme of the proposed self-supervised learning on Rubik's Cube. Applying a random scramble to Rubik's Cube, we train the DNN to predict the last move of the scramble.

[1] Edges have 12 possible locations and 2 flip states; corners, 8 possible locations and 3 twist states.
[2] Obviously redundant moves, such as R (a 90° clockwise turn of the right face) immediately following R' (the counterclockwise turn), are excluded.
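To make the learning objective concrete, the sketch below (Python; an illustrative sketch rather than the paper's released code) generates (scrambled state, last move) training pairs. The Cube class and its apply/state methods are placeholders for whatever cube implementation is used; only the sampling scheme follows the paper: a random scramble length of 1–20 and the exclusion of consecutive moves on the same face, one simple way to rule out obviously redundant sequences such as R followed by R'.

```python
import random

# The 18 moves of the half-turn metric: each of the 6 faces turned
# 90° clockwise, 90° counterclockwise, or 180°.
FACES = ["U", "D", "L", "R", "F", "B"]
MOVES = [f + s for f in FACES for s in ["", "'", "2"]]  # e.g. "R", "R'", "R2"

class Cube:
    """Placeholder cube environment (assumed interface, not the paper's code)."""
    def __init__(self):
        pass  # initialize the solved (goal) state
    def apply(self, move: str):
        pass  # apply one of the 18 moves to the current state
    def state(self):
        return []  # return the encoded state, e.g. 20 piece tokens (Sec. 3.1)

def sample_training_pair(max_len: int = 20):
    """Scramble the goal state with 1..max_len random moves; the label is the last move."""
    cube = Cube()
    prev_face, last = None, None
    for _ in range(random.randint(1, max_len)):
        # Disallow a move on the same face as the previous one: one simple way
        # to exclude obviously redundant sequences such as R followed by R'.
        candidates = [m for m in MOVES if m[0] != prev_face]
        last = random.choice(candidates)
        cube.apply(last)
        prev_face = last[0]
    return cube.state(), MOVES.index(last)  # (input state, target class in 0..17)
```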
3.3 MODEL ARCHITECTURES

The experiment trains two classes of DNNs: DeepCube and Transformer models. As the DeepCube studies (McAleer et al., 2018; Agostinelli et al., 2019) indicated, a stack of fully-connected (FC) layers followed by residual FC blocks (He et al., 2016) learns to solve Rubik's Cube well; therefore, we employ a similar architecture for the DeepCube models. The Transformer models, on the other hand, employ a stack of Transformer encoders (Vaswani et al., 2017). Given Transformer's robust performance in various domains (Devlin et al., 2018; Dosovitskiy et al., 2020; Rives et al., 2021), we examine whether Transformer is also suitable for solving this type of combinatorial problem, in comparison to the DeepCube models. In order to test and compare the scalability of these architectures, we train models of different sizes for each class. All models of both types have piece and position embeddings like those of BERT (Devlin et al., 2018) and an 18-D FC layer for the softmax output. Transformer models have an additional special token [BUF] as a buffer to integrate information from the other embeddings. Appendix A summarizes the model configurations.

3.4 EXPERIMENTAL DETAILS

Training continued for 500,000 steps for all models and an additional 500,000 steps (1,000,000 in total) for the best-performing model. The initial batch size was set to 320 (16 mini-batch samples per scramble length) and doubled at the 400,000th and 500,000th steps for enhanced parallelism. The Adam optimizer (Kingma & Ba, 2014) with an initial learning rate of 10^-4 was used to update the weights of all models.
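As a rough guide to what such a model and training step might look like, the following PyTorch sketch is offered under stated assumptions; the paper's released notebook should be consulted for the actual implementation. The class name and most details are assumptions loosely based on Section 3.3 and Appendix A: piece and position embeddings, a learned [BUF] token, 16-head 256-D Transformer encoder layers, concatenated token outputs feeding an 18-way softmax, cross-entropy loss, and Adam with a 10^-4 learning rate. Parameter counts will not necessarily match Table 1.

```python
import torch
import torch.nn as nn

N_PIECES, N_PIECE_STATES, N_MOVES = 20, 48, 18  # 12 edges + 8 corners; 48 piece states; 18 moves

class CubeTransformer(nn.Module):
    """Sketch of a Transformer last-move predictor (assumed details, cf. Sec. 3.3 / App. A)."""
    def __init__(self, n_layers=8, d_model=256, n_heads=16, d_ff=256):
        super().__init__()
        # Piece embeddings (one of 48 states per piece), a learned [BUF] token,
        # and position embeddings over the [BUF] + 20 piece slots, as in BERT.
        self.piece_emb = nn.Embedding(N_PIECE_STATES, d_model)
        self.buf = nn.Parameter(torch.zeros(1, 1, d_model))
        self.pos_emb = nn.Embedding(N_PIECES + 1, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, d_ff, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        # All token outputs are concatenated before the 18-D output layer (App. A).
        self.head = nn.Linear((N_PIECES + 1) * d_model, N_MOVES)

    def forward(self, piece_states):              # piece_states: (batch, 20) int64 in [0, 48)
        b = piece_states.size(0)
        tokens = self.piece_emb(piece_states)                      # (batch, 20, d_model)
        tokens = torch.cat([self.buf.expand(b, -1, -1), tokens], dim=1)
        positions = torch.arange(N_PIECES + 1, device=piece_states.device)
        hidden = self.encoder(tokens + self.pos_emb(positions))
        return self.head(hidden.flatten(1))                        # (batch, 18) move logits

# One training step: cross-entropy on the last scramble move, Adam with lr 1e-4 (Sec. 3.4).
model = CubeTransformer()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
states = torch.randint(0, N_PIECE_STATES, (320, N_PIECES))   # stand-in for a real batch
labels = torch.randint(0, N_MOVES, (320,))
loss = nn.functional.cross_entropy(model(states), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```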
4 RESULTS

4.1 LEARNING PERFORMANCE

Figure 3 shows the learning progress of all models. Since the Transformer-8L model seemed to perform best of all, we stopped training the other models after 500,000 steps. In both classes, a larger model generally required fewer samples to reach the same loss level as a smaller one. However, while the DeepCube models learned faster in the early stage of training, Transformer-4L and -8L had advantages in the later stages, showing the Transformer models' higher scalability on this learning task. Below, we evaluate the best-performing Transformer-8L on solving Rubik's Cube.

Figure 3: Decreasing cross-entropy losses of last-move prediction. Moving averages of all models are plotted every 1,000 steps on a log scale.

4.2 SOLVING RUBIK'S CUBE

For solving Rubik's Cube with the Transformer-8L model, we adopted a breadth-first search for the simplicity of the combinatorial search: at depth i, the candidate nodes are sorted by their expected probabilities and only a limited number of them are passed on to the next depth i + 1. At each inference step, the inverse of the predicted move is applied to the problem. Tested on 1,000 instances scrambled with 100 random moves, the Transformer-8L model successfully solved all cases with maximum breadths of 10^1, 10^2, and 10^3, resulting in respective mean solution lengths of 28.3, 22.7, and 20.5 moves (see Figure 4).

Figure 4: Frequency distributions of solution move counts, with 10^1, 10^2, and 10^3 maximum nodes at every search depth.
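The search procedure can be sketched as follows (Python; an illustrative reconstruction under assumptions, not the paper's code). The cube environment (copy, apply, is_solved, state) and the trained model are assumed placeholders; what follows the description above is the overall scheme: at each depth, every surviving candidate is expanded by all 18 moves, the inverse of each predicted last move is applied, and only the highest-probability candidates are kept.

```python
import math
import torch

MOVES = [f + s for f in "UDLRFB" for s in ("", "'", "2")]  # 18 half-turn-metric moves

def inverse(move: str) -> str:
    """Inverse of a move: R -> R', R' -> R, R2 -> R2."""
    if move.endswith("'"):
        return move[0]
    return move if move.endswith("2") else move + "'"

def solve(scrambled_cube, model, max_breadth=1000, max_depth=60):
    """Breadth-first search keeping at most max_breadth nodes per depth (a beam)."""
    # Each candidate: (accumulated log-probability, cube state, solution so far).
    beam = [(0.0, scrambled_cube, [])]
    for _ in range(max_depth):
        children = []
        for logp, cube, solution in beam:
            if cube.is_solved():                             # placeholder check
                return solution
            with torch.no_grad():
                state = torch.as_tensor([cube.state()])      # (1, 20) piece tokens
                probs = torch.softmax(model(state), dim=-1).squeeze(0)
            for move_idx, p in enumerate(probs.tolist()):
                # The model predicts the last scramble move; its inverse is the
                # solution move to apply next.
                solving_move = inverse(MOVES[move_idx])
                child = cube.copy()                          # placeholder copy
                child.apply(solving_move)
                children.append((logp + math.log(p), child, solution + [solving_move]))
        # Keep only the most probable candidates for the next depth.
        children.sort(key=lambda c: c[0], reverse=True)
        beam = children[:max_breadth]
    return None  # no solution found within max_depth
```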
5 DISCUSSION

We proposed an intuitive and easy-to-implement method for solving goal-predefined combinatorial problems. Self-supervised on random moves and the corresponding problem states, a DNN successfully learned to solve Rubik's Cube. Since an estimated 98% of all possible states require 17–20 moves to be solved at the shortest (Rokicki et al., 2014), the trained DNN can be considered a near-optimal solver as hypothesized, even before full convergence in training and with a compromised search algorithm.

While reinforcement learning has been the primary means of automated heuristic search for combinatorial problems like Rubik's Cube, we demonstrated that simple self-supervised learning can stably achieve the same objective under the constraints of Condition 1 and Assumption 1. Although we cannot determine whether Rubik's Cube fully satisfies Assumption 1, namely that more optimal moves occur more frequently in random scrambles, the high optimality of Transformer-8L implies that Rubik's Cube probably has no or few paths violating the assumption in its state space.

Additionally, the comparison with the DeepCube models consisting of FC layers revealed the Transformer's superior adaptability and scalability when learning through self-supervision. While Transformer training is reportedly unstable (Liu et al., 2020), our results suggest that the attention mechanism can stably integrate discrete information in our method. For reference, sample attention patterns generated by Transformer-8L are shown in Appendix B.

While our method has potential applications to other problems, it has fundamental limitations. Most importantly, there must be at least one goal state predefined for the target problem (Condition 1). Therefore, the proposed method cannot apply to multiplayer games like chess, in which no goal state is available in advance and for which reinforcement learning is an appropriate and effective approach (Silver et al., 2018). Likewise, combinatorial problems like the Traveling Salesman Problem, whose objective is to find the goal state itself, will not be best solved by our method.

On the other hand, our method still has potential applications beyond Rubik's Cube. It would be well suited to other goal-predefined puzzles such as the 15-puzzle (and its variants), Lights Out, and Sokoban, which have also been solved by reinforcement learning (Agostinelli et al., 2019). If Assumption 1 is met, near-optimal applications could extend to real-world problems like route planning with variable-distance moves, for example by weighting the probability distributions of random training moves inversely by move distance.

In the specific domain of Rubik's Cube, the method could potentially adapt to arbitrary solution constraints and demands. For example, to develop a robotic Rubik's Cube solver with only 5 motors, one would only have to train or fine-tune the model without U moves in the random scrambles. Alternatively, scrambles and solutions could be tuned for ergonomic optimality: controlling the probability distribution of random moves in training scrambles (e.g., increasing U and R moves, reducing B moves) may result in easy-to-execute solutions at inference, as sketched below.
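As a small illustration of this last point (a sketch with hypothetical weights, not an experiment from the paper), biasing the training scrambles only requires replacing uniform move sampling with weighted sampling:

```python
import random

MOVES = [f + s for f in "UDLRFB" for s in ("", "'", "2")]  # 18 half-turn-metric moves

# Hypothetical ergonomic bias: sample U and R moves more often and B moves less often,
# so that solutions at inference tend to favor the over-represented faces.
FACE_WEIGHTS = {"U": 2.0, "R": 2.0, "F": 1.0, "L": 1.0, "D": 1.0, "B": 0.5}
WEIGHTS = [FACE_WEIGHTS[m[0]] for m in MOVES]

def biased_scramble_move(prev_face=None):
    """One weighted random scramble move, still excluding same-face repeats."""
    candidates = [(m, w) for m, w in zip(MOVES, WEIGHTS) if m[0] != prev_face]
    moves, weights = zip(*candidates)
    return random.choices(moves, weights=weights, k=1)[0]
```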
6 CONCLUSION

In this work, we introduced a simple, robust, and stable method for solving goal-predetermined combinatorial problems through self-supervision. In the proposed method, a DNN is trained to predict the last move of a random scramble given the problem state, under the assumption that mere random moves can form probability distributions of optimality associated with problem states. When tested on Rubik's Cube, a trained DNN successfully solved all test cases near-optimally. Though it comes with some constraints, the proposed method is novel and potentially applicable to similar combinatorial problems. It may also provide a new perspective on combinatorial optimization, especially where machine learning is involved.

REPRODUCIBILITY STATEMENT

We release a minimal notebook to ensure reproducibility: https://colab.research.google.com/drive/1VrJ3Bt-zrRHM6Rfv8WkDBrrY-dvusn7M
REFERENCES

Forest Agostinelli, Stephen McAleer, Alexander Shmakov, and Pierre Baldi. Solving the rubik's cube with deep reinforcement learning and search. Nature Machine Intelligence, 1(8), 2019.

Shahab Jabbari Arfaee, Sandra Zilles, and Robert C Holte. Learning heuristic functions for large state spaces. Artificial Intelligence, 175(16-17), 2011.

Sebastiano Corli, Lorenzo Moro, Davide Galli, and Enrico Prati. Solving rubik's cube via quantum mechanics and deep reinforcement learning. Journal of Physics A: Mathematical and Theoretical, 2021.

Erik D Demaine, Sarah Eisenstat, and Mikhail Rudoy. Solving the rubik's cube optimally is np-complete. arXiv:1706.06708v2, 2018.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805, 2018.

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv:2010.11929, 2020.

Nail El-Sourani, Sascha Hauke, and Markus Borschbach. An evolutionary approach for solving the rubik's cube incorporating exact methods. European Conference on the Applications of Evolutionary Computation, pp. 80–89, 2010.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014.

Richard E Korf. A program that learns to solve rubik's cube. AAAI, pp. 164–167, 1982.

Richard E Korf. Finding optimal solutions to rubik's cube using pattern databases. AAAI/IAAI, 1997.

Liyuan Liu, Xiaodong Liu, Jianfeng Gao, Weizhu Chen, and Jiawei Han. Understanding the difficulty of training transformers. EMNLP, 2020.

Nina Mazyavkina, Sergey Sviridov, Sergei Ivanov, and Evgeny Burnaev. Reinforcement learning for combinatorial optimization: A survey. Computers & Operations Research, 2021.

Stephen McAleer, Forest Agostinelli, Alexander Shmakov, and Pierre Baldi. Solving the rubik's cube with approximate policy iteration. International Conference on Learning Representations, 2018.

Christos H Papadimitriou and Kenneth Steiglitz. Combinatorial Optimization: Algorithms and Complexity. Courier Corporation, 1998.

Alexander Rives, Joshua Meier, Tom Sercu, Siddharth Goyal, Zeming Lin, Jason Liu, Demi Guo, Myle Ott, C Lawrence Zitnick, Jerry Ma, et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proceedings of the National Academy of Sciences, 118(15), 2021. doi: 10.1073/pnas.2016239118.

Tomas Rokicki, Herbert Kociemba, Morley Davidson, and John Dethridge. The diameter of the rubik's cube group is twenty. SIAM Review, 56(4), 2014.

David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science, 362(6419), 2018.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. NIPS, 2017.
A MODEL CONFIGURATIONS

Table 1: DeepCube and Transformer models and their configurations.

Models           #FC                 #Residual     #Transformer   #Parameter
DeepCube-M       5000-D & 1000-D     2 × 1000-D    -              9.4M
DeepCube-L       10000-D & 1000-D    4 × 1000-D    -              18.8M
DeepCube-XL      20000-D & 1000-D    8 × 1000-D    -              37.7M
Transformer-1L   -                   -             1              4.4M
Transformer-2L   -                   -             2              8.8M
Transformer-4L   -                   -             4              17.5M
Transformer-8L   -                   -              8              34.8M

Table 1 summarizes the configurations of all models. DeepCube models have two FC layers with different dimensionalities, followed by a varying number of residual blocks, each containing a 100-D FC layer. In every Transformer model, each Transformer layer has 16 attention heads of 256-D embeddings and two subsequent 256-D FC layers. The last Transformer output tensor is horizontally concatenated before being fed to the output FC layer.

B ATTENTION PATTERNS

Given the Transformer model's significant adaptability to the task, we extracted attention patterns of Transformer-8L on a random scramble instance (see Figure 5). Starting with the [BUF] token attending to some corners and edges, the DNN produced cross-sectional token- and block-level associations with gradually distributed attention.

Figure 5: Sample attention patterns produced by the Transformer-8L model during an inference. Left: The vertical and horizontal axes of a heatmap respectively represent the queries and keys (over the [BUF], corner, and edge tokens) of an attention head. Right: Representative samples of two attention heads from each of the eight Transformer layers.