SELF-SUPERVISION IS ALL YOU NEED FOR SOLVING RUBIK'S CUBE

Kyo Takano
Tokyo, Japan
kyo.psych@gmail.com

arXiv:2106.03157v2 [cs.LG] 6 Oct 2021

ABSTRACT

While combinatorial problems are of great academic and practical importance, previous approaches such as explicit heuristics and reinforcement learning have been complex and costly. To address this, we developed a simple and robust method to train a Deep Neural Network (DNN) through self-supervised learning for solving a goal-predefined combinatorial problem. Assuming that more optimal moves occur more frequently in a random-move path connecting two problem states, the DNN can approximate an optimal solver by learning to predict the last move of a random scramble from the resulting problem state. Tested on 1,000 scrambled Rubik's Cube instances, a Transformer-based model solved all of them near-optimally using a breadth-first search; with a maximum breadth of 10^3, the mean solution length was 20.5 moves. The proposed method may apply to other goal-predefined combinatorial problems, though it has a few constraints.

1   INTRODUCTION

Combinatorial search is a challenging task both in theory and in practice. Representative problems include the Traveling Salesman Problem, in which a salesman tries to travel through multiple cities by the shortest path possible. Despite how easy it sounds, the problem is NP-hard due to its combinatorial complexity (Papadimitriou & Steiglitz, 1998); as the number of cities increases, the number of possible combinations skyrockets. In the real world, algorithms for such problems are applied in various ways to optimize logistics, supply chains, resource planning, etc.
In the past, researchers have proposed several shortcut algorithms to tackle the combinatorial complexity of such problems. Recently, in particular, reinforcement learning has enabled the automatic search of algorithms that optimally solve combinatorial problems (Mazyavkina et al., 2021); by rewarding efficient moves, DNNs can learn to find increasingly optimal solutions. Rubik's Cube is one such problem (Demaine et al., 2018), and there have been not only logical methods (Korf, 1997; Rokicki et al., 2014) but also heuristic (Korf, 1982; El-Sourani et al., 2010; Arfaee et al., 2011) and reinforcement learning approaches (McAleer et al., 2018; Agostinelli et al., 2019; Corli et al., 2021) that can solve any scrambled state of the puzzle. Nevertheless, combinatorial search is still no easy task; it requires detailed knowledge of the target problem, coding explicit logic, or configuring a reinforcement learning system. Reinforcement learning is especially notable for the need to repeatedly adjust the architecture, loss function, and hyperparameters to stabilize a learning process involving non-differentiable operations.
To overcome this difficulty, we developed a novel method to train DNNs in a self-supervised manner for solving combinatorial problems with a predefined goal. Assuming that the first move in a random path is probabilistically the most likely to be optimal, the method teaches a DNN the probability distribution of optimal moves associated with each problem state. In our experiment, we tested the proposed method by training DNNs to solve Rubik's Cube.

Contribution. We show that self-supervised learning can be all you need for solving some goal-predefined combinatorial problems almost optimally, without requiring specialized knowledge or reinforcement learning.


Figure 1: A simplified goal-predefined combinatorial problem in training and inference. Optimalities of possible moves are probabilistically learned through observations of random moves in training, then utilized for inference by a model. Left: During training, the random moves {a, b, c} can occur at every node with equal probability 1/3. If an unknown path consisting of these duplicable moves connects nodes G (goal) and S (scramble), the optimal (shortest) path (b, b) is the most likely of all possible paths, followed by the second-optimal three-move paths and then the four-move paths. At every node, summing the probabilities over all possible paths, a move belonging to more optimal paths turns out to have a higher probability of being in the unknown path. Thus, the model can learn the probability distributions merely by observing a sufficient number of random moves derived from G. Right: Using the learned probability distributions, the model can infer a reverse path from node S to the unknown goal G consisting of the inverse moves {a′, b′, c′}. Since the moves in the shortest path are observed most frequently of all paths, in this instance, the move b′ is predicted to be more optimal than a′ and c′ at node S and its subsequent node.

2   PROPOSED METHOD

In this section, we first provide an overview of the proposed method and then discuss constraints for
applying it to combinatorial problems. Below, we consider a combinatorial problem as a pathfinding
task on an implicit graph. Each node in the graph represents a unique state and has edges representing equally probable moves to its adjacent nodes.

2.1   OVERVIEW

The fundamental idea of the proposed method is as follows: 1) apply a sequence of random moves to the target problem, and 2) train a DNN to predict the last random move based on the resulting problem state. By learning the association between problem states and the probability distributions of last moves, the DNN can infer step-by-step moves forming a reverse path to the initial goal state. After training, the multinomial probability distribution of moves at inference step i should approximate

P(X_i = j) = \sum_k P(X_{i+1} = k) \cdot P(X_i = j \mid X_{i+1} = k)    (1)
where X_i is the choice of move at step i, and both j and k belong to the set of all available moves. The probability distribution of the move X_i implicitly depends on the moves and paths to the current state (or similar states) that appeared during training. See Figure 1 for a miniature example.
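
To make these two steps concrete, the following is a minimal Python sketch of how (state, last move) training pairs could be generated for a generic goal-predefined problem. The `problem` interface (`goal_state`, `moves`, `apply`) is a hypothetical abstraction for illustration, not part of the released implementation.

```python
import random

def generate_training_pairs(problem, max_length, num_pairs):
    """Sample (state, last_move) pairs from random walks starting at the goal."""
    pairs = []
    for _ in range(num_pairs):
        state = problem.goal_state                     # Condition 1: goal is predefined
        for _ in range(random.randint(1, max_length)):
            last_move = random.choice(problem.moves)   # equally probable random moves
            state = problem.apply(state, last_move)
        # The DNN is trained to predict `last_move` given `state` alone.
        pairs.append((state, last_move))
    return pairs
```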

2.2   CONDITION AND ASSUMPTION

There is one necessary condition for the proposed method to work, and one assumption under which a DNN approximates an optimal solver. First, to be solved with the proposed method, a problem must meet the following condition.

Condition 1. For the problem of interest, at least one goal state must be predefined and readily
available.


For a DNN to learn probability distributions of moves connecting to the goal state, any path consist-
ing of random moves must be derived from the goal state. Therefore, Condition 1 is necessary for
the proposed method.
In addition, the following assumption is desirable for a self-supervised DNN to approximate an optimal solver.

Assumption 1. The more optimal a move is, the more frequently it occurs.
Although Assumption 1 is not strictly necessary for the proposed method, the more closely a problem satisfies it, the more optimal the DNN will be. Here, we formulate the statistical condition for Assumption 1 to hold on a problem in which all moves at every node are equally likely. Let p_n be the probability of taking a particular move as the first step to a specific target node in a path of n moves or more,

p_n = \sum_{k=n}^{k_{\max}} \frac{C_{k|n}}{a^k}    (2)

where k is the move count to the target state, k_max is the maximum value of k, a is the total number of possible moves, and C_{k|n} is the number of k-move paths (k ≥ n). Provided that Assumption 1 ⇔ p_n ≥ p_{n+1} and p_n = C_{n|n}/a^n + \sum_{k=n+1}^{k_{\max}} C_{k|n}/a^k, the following must be satisfied for any n:

\frac{C_{n|n}}{a^n} + \sum_{k=n+1}^{k_{\max}} \frac{C_{k|n}}{a^k} \ge \sum_{k=n+1}^{k_{\max}} \frac{C_{k|n+1}}{a^k}    (3)
At this point, both C_{k|n} and C_{k|n+1} are unknown on an implicit graph, and the summation terms are not necessarily equal when the paths start with different moves. However, the only potential difference is the first move in two paths of different optimalities, and merely taking a sub-optimal (alternative) first move should not largely increase the total number of same-distance paths. Therefore, we assume that this holds true to varying degrees depending on the problem being addressed.
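
As a numeric illustration of Eqs. (2) and (3), the short sketch below evaluates p_n for two sets of path counts C_{k|n} and checks whether p_n ≥ p_{n+1}. The counts are hypothetical values chosen only to show the calculation; they are not measured on any real problem.

```python
def p_n(path_counts, a):
    """Eq. (2): p_n = sum over k of C_{k|n} / a^k for the given {k: C_{k|n}} counts."""
    return sum(count / a ** k for k, count in path_counts.items())

a = 3  # total number of possible moves at every node, as in Figure 1

# Hypothetical path counts: a move starting an optimal 2-move path versus a move
# starting a sub-optimal 3-move path to the same target node.
counts_optimal = {2: 1, 4: 6, 6: 30}      # C_{k|n}   with n = 2
counts_suboptimal = {3: 1, 5: 6, 7: 30}   # C_{k|n+1} with n + 1 = 3

print(p_n(counts_optimal, a) >= p_n(counts_suboptimal, a))  # True: Assumption 1 holds here
```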
If Condition 1 and Assumption 1 are met, random moves have expected probability distributions peaking around optimal moves. It is also notable that the degree to which a problem satisfies Assumption 1 affects the optimality of a trained DNN; the more possible moves violate Assumption 1, the less optimal the DNN will be. Although the example in Figure 1 has a very restricted setting for explanatory simplicity, this observation could apply to similar problems of higher dimensions and complexities under the constraints. Accordingly, we hypothesize that DNNs can learn to solve goal-predefined combinatorial problems near-optimally.

3   EXPERIMENT
3.1   RUBIK'S CUBE

Rubik's Cube is a cubic puzzle consisting of 6 centers, 12 edges, and 8 corners. Edges and corners can move around the centers, and the puzzle can have approximately 4.3 × 10^19 different states, including the predefined goal at which each of the six faces has only one color. As input to the DNNs, the edges and corners have 48 unique possible states, determined by their spatial locations and flip/twist states¹. In this work, we employ the half-turn metric (both 90° and 180° rotations count as one move), resulting in 18 possible moves for any state. Accordingly, solving Rubik's Cube can be described as a path-planning problem in the 18-D action space.
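
The 48 piece states mentioned above follow from 12 edge locations × 2 flip states plus 8 corner locations × 3 twist states. A minimal indexing sketch is shown below; this particular numbering is our own illustration and is not taken from the released implementation.

```python
# 48 unique piece states: 12 edge locations x 2 flips + 8 corner locations x 3 twists.
def edge_state_id(location, flip):
    """location in 0..11, flip in 0..1 -> token id in 0..23."""
    return location * 2 + flip

def corner_state_id(location, twist):
    """location in 0..7, twist in 0..2 -> token id in 24..47."""
    return 24 + location * 3 + twist

NUM_PIECE_STATES = 12 * 2 + 8 * 3   # = 48 possible input states per piece
NUM_MOVES = 6 * 3                   # = 18 moves under the half-turn metric
```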

3.2   LEARNING OBJECTIVE

Given the state of a Rubik's Cube scrambled with 1–20 random moves², a DNN predicts the last move of the scramble (see Figure 2).
    ¹ Edges have 12 possible locations and 2 flip states; corners, 8 possible locations and 3 twist states.
    ² Obviously redundant moves, like R (a 90° clockwise turn of the right face) following R' (counterclockwise), are excluded.


[Figure 2: an example scramble (F' U R2 D' B F L) is encoded via piece and position embeddings and fed to the DNN, which predicts the last move (L).]
Figure 2: An illustrated scheme of the proposed self-supervised learning on Rubik’s Cube. Applying
a random scramble to Rubik’s Cube, we train the DNN to predict the last move in the scramble.

Here, the maximum scramble length is set to 20 because this number is known to be sufficient for optimally solving any state of Rubik's Cube (Rokicki et al., 2014). Since the inverse of the last scramble move is expected to be close to optimal under Assumption 1, we hypothesize that this learning objective enables a DNN to approximate an optimal solver.
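
For concreteness, the sketch below generates a 1–20 move scramble in the half-turn metric while skipping consecutive moves on the same face, one simple way to exclude the obviously redundant cases noted in footnote 2; `apply_scramble` and `solved_state` stand in for a cube simulator and are assumptions, not released code.

```python
import random

FACES = ["U", "D", "L", "R", "F", "B"]
MOVES = [face + suffix for face in FACES for suffix in ["", "'", "2"]]   # 18 moves (HTM)

def random_scramble(max_length=20):
    """Return a 1-20 move scramble, skipping consecutive moves on the same face."""
    length = random.randint(1, max_length)
    scramble = []
    while len(scramble) < length:
        move = random.choice(MOVES)
        if scramble and move[0] == scramble[-1][0]:
            continue   # e.g., R directly after R' is obviously redundant (footnote 2)
        scramble.append(move)
    return scramble

scramble = random_scramble()
label = scramble[-1]   # training target: the last move of the scramble
# state = apply_scramble(solved_state, scramble)   # hypothetical cube simulator
```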

3.3   MODEL ARCHITECTURES

The experiment trains two classes of DNNs: DeepCube and Transformer models. As the DeepCube studies (McAleer et al., 2018; Agostinelli et al., 2019) indicated, a stack of fully-connected (FC) layers followed by residual FC blocks (He et al., 2016) learns well to solve Rubik's Cube. Therefore, we employ a similar architecture for the DeepCube models. The Transformer models, on the other hand, employ a stack of Transformer encoders (Vaswani et al., 2017). Given Transformer's robust performance in various domains (Devlin et al., 2018; Dosovitskiy et al., 2020; Rives et al., 2021), we examine whether Transformer is also suitable for this type of combinatorial problem, in comparison to the DeepCube models. To test and compare the scalability of these architectures, we train models of different sizes for each class. All models of both types have piece and position embeddings like those of BERT (Devlin et al., 2018) and an 18-D FC layer for softmax output. Transformer models have an additional special token [BUF] as a buffer to integrate information from the other embeddings. Appendix A summarizes the model configurations.
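
As a rough reconstruction of the Transformer class described above, a minimal PyTorch sketch might look as follows. The 256-D embeddings and 16 heads follow Appendix A, while the 20-token input, feed-forward width, and other details are assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn

class LastMovePredictor(nn.Module):
    """Sketch of a Transformer-based last-move predictor (details assumed)."""

    def __init__(self, num_piece_states=48, num_pieces=20, num_moves=18,
                 d_model=256, n_heads=16, n_layers=8):
        super().__init__()
        # Piece and position embeddings, plus one learnable [BUF] token.
        self.piece_emb = nn.Embedding(num_piece_states, d_model)
        self.pos_emb = nn.Embedding(num_pieces, d_model)
        self.buf = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           dim_feedforward=d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Outputs of all tokens are concatenated and fed to an 18-D FC layer.
        self.head = nn.Linear(d_model * (num_pieces + 1), num_moves)

    def forward(self, piece_ids):
        # piece_ids: (batch, 20) token ids for the 12 edges and 8 corners
        positions = torch.arange(piece_ids.size(1), device=piece_ids.device)
        x = self.piece_emb(piece_ids) + self.pos_emb(positions)
        buf = self.buf.expand(piece_ids.size(0), -1, -1)
        x = self.encoder(torch.cat([buf, x], dim=1))
        return self.head(x.flatten(1))  # logits over the 18 possible moves
```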

3.4   EXPERIMENTAL DETAILS

Training continued for 500,000 steps for all the models and an additional 500,000 steps (1,000,000 in total) for the best-performing model. The initial batch size was set to 320 (16 samples per scramble length) and doubled at the 400,000th and 500,000th steps for enhanced parallelism. The Adam optimizer (Kingma & Ba, 2014) with an initial learning rate of 10^-4 was used to update the weights of all models.
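
A minimal training loop under the stated settings (cross-entropy on the last-move label, Adam with an initial learning rate of 10^-4, batch size 320) might look as follows; `sample_batch` is a hypothetical helper corresponding to the data-generation sketch in Section 3.2, and `LastMovePredictor` refers to the model sketch in Section 3.3.

```python
import torch
import torch.nn.functional as F

model = LastMovePredictor()                                  # sketch from Section 3.3
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)    # initial learning rate 10^-4

for step in range(500_000):
    # Hypothetical helper: 320 scrambles (16 per scramble length 1-20),
    # each encoded as 20 piece-state token ids, paired with its last-move label.
    piece_ids, labels = sample_batch(batch_size=320)
    logits = model(piece_ids)
    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```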

4   RESULTS

4.1   LEARNING PERFORMANCE

Figure 3 shows the learning progress of all the models. Since the Transformer-8L model seemed to perform best of all, we stopped training the other models after 500,000 steps. In both classes, a larger model generally required fewer samples to reach the same level as a smaller one. However, while DeepCube models learned faster in the early stage of training, Transformer-4L and -8L had advantages in the later stages, showing the Transformer models' higher scalability on the learning task. Below, we evaluate the best-performing Transformer-8L on solving Rubik's Cube.


[Figure 3: cross-entropy loss (y-axis) versus training steps ×1,000 (x-axis, log scale), with one curve per model: DeepCube-M/L/XL and Transformer-1L/2L/4L/8L.]

Figure 3: Decreasing cross-entropy losses of last-move prediction. Moving averages over every 1,000 steps are plotted for all models on a log scale.

[Figure 4: frequency (y-axis) versus solution move count (x-axis), with one distribution per search breadth: 10^1, 10^2, and 10^3.]
Figure 4: Frequency distributions of solution move counts, with 10^1, 10^2, and 10^3 maximum nodes at every search depth.

4.2   SOLVING RUBIK'S CUBE

To solve Rubik's Cube with the Transformer-8L model, we adopted a breadth-first search for the simplicity of combinatorial search. At each depth i, candidate nodes are sorted by their expected probabilities and only a limited number of them are passed to the next depth i + 1. At each inference step, the inverse of the predicted move is applied to the problem. Tested on 1,000 instances scrambled with 100 random moves, as shown in Figure 4, the Transformer-8L model successfully solved all the cases with breadths of 10^1, 10^2, and 10^3, resulting in respective mean solution lengths of 28.3, 22.7, and 20.5 moves.
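
The breadth-limited search described above can be sketched as follows, ranking candidate paths by cumulative log-probability; `encode`, `is_solved`, `inverse_move`, and `apply_move` are assumed interfaces to a cube simulator rather than the released implementation.

```python
import math
import torch

def solve(model, start_state, max_breadth=1000, max_depth=60):
    """Breadth-limited search keeping the `max_breadth` most probable paths per depth."""
    candidates = [(0.0, start_state, [])]      # (cumulative log-prob, state, solution so far)
    for _ in range(max_depth):
        expanded = []
        for logp, state, path in candidates:
            if is_solved(state):               # assumed goal test
                return path
            with torch.no_grad():
                probs = torch.softmax(model(encode(state)), dim=-1).squeeze(0)
            for move_id, p in enumerate(probs.tolist()):
                # The DNN predicts the last scramble move; its inverse is applied to solve.
                inverse = inverse_move(move_id)               # assumed helper
                expanded.append((logp + math.log(p + 1e-12),
                                 apply_move(state, inverse),  # assumed cube simulator
                                 path + [inverse]))
        # Sort candidates by probability and pass only the best `max_breadth` on.
        expanded.sort(key=lambda node: node[0], reverse=True)
        candidates = expanded[:max_breadth]
    return None                                # no solution found within max_depth
```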

5   DISCUSSION

We proposed an intuitive and easy-to-implement method to solve goal-predefined combinatorial
problems. Self-supervised on random moves and corresponding problem states, a DNN successfully
learned to solve Rubik’s Cube. Since an estimated 98% of all possible states take 17–20 moves to


be solved optimally (Rokicki et al., 2014), the trained DNN can be considered a near-optimal solver as hypothesized, even before full convergence in training and with a compromised search algorithm.
While reinforcement learning has been the primary means of automated heuristic search for combinatorial problems like Rubik's Cube, we demonstrated that simple self-supervised learning can stably achieve the same objective under the constraints of Condition 1 and Assumption 1. Although we cannot determine whether Rubik's Cube fully satisfies Assumption 1, namely that more optimal moves occur more frequently in random scrambles, the high optimality of Transformer-8L implies that Rubik's Cube probably has no or few paths violating the assumption in its state space.
Additionally, a comparison with the DeepCube models consisting of FC layers revealed the Transformer's superior adaptability and scalability when learning through self-supervision. While Transformer training is reportedly unstable (Liu et al., 2020), our results suggest that the attention mechanism can stably integrate discrete information in our method. For reference, sample attention patterns generated by Transformer-8L are shown in Appendix B.
While our method has potential applications to other problems, it has fundamental limitations. Most importantly, there must be at least one goal state predefined for the target problem (Condition 1). Therefore, the proposed method cannot apply to multiplayer games like chess, in which no goal state is available in advance and where reinforcement learning is an appropriate and effective approach (Silver et al., 2018). Likewise, combinatorial problems like the Traveling Salesman Problem, whose objective is to find the goal state itself, will not be best solved by our method.
On the other hand, our method still has potential applications beyond Rubik's Cube. It would be well-suited for other goal-predefined puzzles like the 15-puzzle (and its variants), Lights Out, and Sokoban, which are also solved by reinforcement learning (Agostinelli et al., 2019). If Assumption 1 is met, near-optimal applications could extend to real-world problems like route planning with variable-distance moves, for example, by weighting the probability distributions of random training moves inversely by move distances.
In the specific domain of Rubik's Cube, the proposed method could potentially adapt to any solution constraints and demands. For example, to develop a robotic Rubik's Cube solver with only 5 motors, one would only have to train or fine-tune the model without U moves in the random scrambles. Alternatively, scrambles and solutions could be tuned for ergonomic optimality; controlling the probability distribution of random moves in training scrambles (e.g., increasing U and R moves and reducing B moves) may result in easy-to-execute solutions at inference.

6   CONCLUSION
In this work, we introduced a simple, robust, and stable method to solve goal-predefined combinatorial problems through self-supervision. In the proposed method, a DNN is trained to predict the last move of a random scramble given a problem state, under the assumption that mere random moves can form probability distributions of optimalities associated with problem states. When tested on Rubik's Cube, a trained DNN successfully solved all test cases near-optimally. Though it has some constraints, the proposed method is novel and potentially applicable to similar combinatorial problems. It may also provide a new perspective on combinatorial optimization, especially when machine learning is involved.

REPRODUCIBILITY STATEMENT
We release a minimal notebook to ensure reproducibility: https://colab.research.google.com/drive/1VrJ3Bt-zrRHM6Rfv8WkDBrrY-dvusn7M


REFERENCES
Forest Agostinelli, Stephen McAleer, Alexander Shmakov, and Pierre Baldi. Solving the rubik’s
  cube with deep reinforcement learning and search. Nature Machine Intelligence, 1(8), 2019.
Shahab Jabbari Arfaee, Sandra Zilles, and Robert C Holte. Learning heuristic functions for large
  state spaces. Artificial Intelligence, 175(16-17), 2011.
Sebastiano Corli, Lorenzo Moro, Davide Galli, and Enrico Prati. Solving rubik’s cube via quantum
  mechanics and deep reinforcement learning. Journal of Physics A: Mathematical and Theoretical,
  2021.
Erik D Demaine, Sarah Eisenstat, and Mikhail Rudoy. Solving the rubik’s cube optimally is np-
  complete. arXiv:1706.06708v2, 2018.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep
  bidirectional transformers for language understanding. arXiv:1810.04805, 2018.
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas
  Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An
  image is worth 16x16 words: Transformers for image recognition at scale. arXiv:2010.11929,
  2020.
Nail El-Sourani, Sascha Hauke, and Markus Borschbach. An evolutionary approach for solving the
  rubik’s cube incorporating exact methods. European Conference on the Applications of Evolu-
  tionary Computation, pp. 80–89, 2010.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recog-
  nition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv:1412.6980,
  2014.
Richard E Korf. A program that learns to solve rubik’s cube. AAAI, pp. 164–167, 1982.
Richard E Korf. Finding optimal solutions to rubik’s cube using pattern databases. AAAI/IAAI,
  1997.
Liyuan Liu, Xiaodong Liu, Jianfeng Gao, Weizhu Chen, and Jiawei Han. Understanding the diffi-
  culty of training transformers. EMNLP, 2020.
Nina Mazyavkina, Sergey Sviridov, Sergei Ivanov, and Evgeny Burnaev. Reinforcement learning
  for combinatorial optimization: A survey. Computers & Operations Research, 2021.
Stephen McAleer, Forest Agostinelli, Alexander Shmakov, and Pierre Baldi. Solving the rubik’s
  cube with approximate policy iteration. International Conference on Learning Representations,
  2018.
Christos H Papadimitriou and Kenneth Steiglitz. Combinatorial Optimization: Algorithms and Com-
  plexity. Courier Corporation, 1998.
Alexander Rives, Joshua Meier, Tom Sercu, Siddharth Goyal, Zeming Lin, Jason Liu, Demi Guo,
  Myle Ott, C Lawrence Zitnick, Jerry Ma, et al. Biological structure and function emerge from
  scaling unsupervised learning to 250 million protein sequences. Proceedings of the National
  Academy of Sciences, 118(15), 2021. doi: 10.1073/pnas.2016239118.
Tomas Rokicki, Herbert Kociemba, Morley Davidson, and John Dethridge. The diameter of the
  rubik’s cube group is twenty. SIAM Review, 56(4), 2014.
David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez,
  Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. A general reinforcement
  learning algorithm that masters chess, shogi, and go through self-play. Science, 362(6419), 2018.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
  Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. NIPS, 2017.


A   MODEL CONFIGURATIONS

                              Table 1: DeepCube and Transformer models and their configurations.

Models           FC layers           Residual blocks   Transformer layers   Parameters
DeepCube-M       5000-D & 1000-D     2 × 1000-D        —                    9.4M
DeepCube-L       10000-D & 1000-D    4 × 1000-D        —                    18.8M
DeepCube-XL      20000-D & 1000-D    8 × 1000-D        —                    37.7M
Transformer-1L   —                   —                 1                    4.4M
Transformer-2L   —                   —                 2                    8.8M
Transformer-4L   —                   —                 4                    17.5M
Transformer-8L   —                   —                 8                    34.8M

Table 1 summarizes the configurations of all the models. DeepCube models have two FC layers with different dimensionalities, followed by a varying number of residual blocks each containing a 1000-D FC layer. In every Transformer model, each Transformer layer has 16 attention heads on 256-D embeddings and two subsequent 256-D FC layers. The last Transformer layer's output tensor is horizontally concatenated before being fed to the output FC layer.

B   ATTENTION PATTERNS
Given the Transformer model’s significant adaptability to the task, we extracted attention patterns
of Transformer-8L on a random scramble instance (see Figure 5). Starting with the [BUF] token
attending to some corners and edges, the DNN produced cross-sectional token- and block-level
associations with gradually distributed attention.

[Figure 5: attention heatmaps (queries × keys over the [BUF], corner, and edge tokens) from layers 1–8 of Transformer-8L.]

Figure 5: Sample attention patterns produced by the Transformer-8L model, observed during an inference. Left: The vertical and horizontal axes of each heatmap respectively represent the queries and keys of an attention head. Right: Representative samples of two attention heads from each Transformer layer.
