Recovering Quantitative Models of Human Information Processing with Differentiable Architecture Search
Sebastian Musslick (musslick@princeton.edu)
Princeton Neuroscience Institute, Princeton University
Princeton, NJ 08544, USA

arXiv:2103.13939v1 [cs.LG] 25 Mar 2021

Abstract

The integration of behavioral phenomena into mechanistic models of cognitive function is a fundamental staple of cognitive science. Yet, researchers are beginning to accumulate increasing amounts of data without having the temporal or monetary resources to integrate these data into scientific theories. We seek to overcome these limitations by incorporating existing machine learning techniques into an open-source pipeline for the automated construction of quantitative models. This pipeline leverages the use of neural architecture search to automate the discovery of interpretable model architectures, and automatic differentiation to automate the fitting of model parameters to data. We evaluate the utility of these methods based on their ability to recover quantitative models of human information processing from synthetic data. We find that these methods are capable of recovering basic quantitative motifs from models of psychophysics, learning and decision making. We also highlight weaknesses of this framework, and discuss future directions for their mitigation.

Keywords: autonomous empirical research; computation graph; continuous relaxation; NAS; DARTS; AutoML

Introduction

The process of developing a mechanistic model of cognition incurs two challenges: (1) identifying the architecture of the model, i.e. the composition of functions and parameters, and (2) tuning the parameters of the model to fit experimental data. While there are various methods for automating the fitting of parameters, cognitive scientists typically leverage their own expertise and intuitions to identify the architecture of a model—a process that requires substantial human effort.

In machine learning, interest has grown in automating the construction and parameterization of neural networks to solve machine learning problems more efficiently (He, Zhao, & Chu, 2021). This involves the use of neural architecture search (NAS) for automating the discovery of model architectures (Elsken, Metzen, Hutter, et al., 2019), and the use of automatic differentiation to automate parameter fitting (Paszke et al., 2017). This combination of methods has led to breakthroughs in the automated construction of neural networks that are capable of outperforming networks designed by human researchers (e.g. in computer vision: Mendoza, Klein, Feurer, Springenberg, & Hutter, 2016). In this study, we explore the utility of these methods for constructing quantitative models of human information processing.

To ease the application of NAS to the discovery of a quantitative model, it is useful to treat quantitative models as neural networks or, more generally, as computation graphs. In this article, we introduce the notion of a computation graph and describe ways of expressing the architecture of a quantitative model in terms of such a graph. We then review the use of differentiable architecture search (DARTS; Liu, Simonyan, & Yang, 2018) for searching the space of candidate computation graphs, and introduce an adaptation of this method for discovering quantitative models of human information processing. We evaluate this method based on its ability to recover three different models of human cognition from synthetic data, chosen to explain behavioral phenomena in psychophysics, learning and perceptual decision making. Our results indicate that this method is capable of recovering computational motifs found in these models. However, we also discuss further developments that are needed to expand the scope of models amenable to DARTS.
Quantitative Models as Computation Graphs

A broad class of complex mathematical functions—including the functions expressed by a quantitative model of human information processing—can be formulated as a computation graph. A computation graph is a collection of nodes that are connected by directed edges. Each node denotes an expression of a variable, and each outgoing edge corresponds to a function applied to this variable (cf. Figure 1D). The value of a node is typically computed by integrating over the result of every function (edge) feeding into that node. Akin to quantitative models of cognitive function, a computation graph can take experiment parameters as input (e.g. the brightness of two visual stimuli), and can transform this input through a combination of functions (edges) and latent variables (intermediate nodes) to produce observable dependent measures as output nodes (e.g. the probability that a participant is able to detect the difference in brightness between two stimuli).

The expression of a formula as a computation graph can be illustrated with Weber's law (Fechner, 1860)—a quantitative hypothesis that relates the difference between the intensities of two stimuli to the probability that a participant can detect this difference. It states that the just noticeable difference (JND; the difference in intensity that a participant is capable of detecting in 50% of the trials) amounts to

ΔI = k · I_0    (1)

where ΔI is the JND, I_0 corresponds to the intensity of the baseline stimulus and k is a constant. The probability of detecting the difference between two stimuli, I_0 and I_1, can then be formulated as a function of the two stimuli (with I_0 < I_1),

P(detected) = σ_logistic((I_1 − I_0) − ΔI)    (2)

where σ_logistic is a logistic function. Figure 1D depicts the argument of σ_logistic as a computation graph for a fixed value of k. The graph encompasses two input nodes, one representing x_0 = I_0 and the other x_1 = I_1. The intermediate node x_2 expresses ΔI, which results from multiplying I_0 by the parameter k. The addition and subtraction of I_1 and I_0, respectively, result in their difference (I_1 − I_0) and are represented by the intermediate node x_3. The linear combination of x_2 and x_3 in the output node r resembles the argument to σ_logistic.
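To make this mapping concrete, the following sketch (not taken from the article) evaluates the Weber computation graph described above in plain Python; the value of k and the function name are illustrative assumptions.

```python
import math

def weber_graph(I0, I1, k=0.5):
    """Evaluate Weber's law as a small computation graph.

    Input nodes: x0 = I0, x1 = I1. Intermediate nodes: x2 = k * x0 (the JND)
    and x3 = x1 - x0 (the intensity difference). Output node: r = x3 - x2,
    passed through a logistic function to yield P(detected). The value of k
    is an assumption made for illustration only.
    """
    x0, x1 = I0, I1          # input nodes (independent variables)
    x2 = k * x0              # edge "multiplication": Delta-I = k * I0
    x3 = x1 - x0             # edges "+x" from x1 and "-x" from x0
    r = x3 - x2              # output node: linear combination of x2 and x3
    return 1.0 / (1.0 + math.exp(-r))   # sigma_logistic(r)

print(weber_graph(I0=1.0, I1=1.8))
```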
The automated construction of a mathematical hypothesis, like Weber's law, can be formulated as a search over the space of all possible computation graphs. Machine learning researchers leverage the notion of computation graphs to represent the composition of functions performed by a complex artificial neural network (i.e. its architecture), and deploy NAS to search a space of computation graphs. Although some level of specification of the graph remains with the researcher, NAS relieves the researcher from searching through these possibilities.

Identifying Computation Graphs with Neural Architecture Search

NAS refers to a family of methods for automating the discovery of useful neural network architectures. There are a number of methods to guide this search, such as evolutionary algorithms, reinforcement learning or Bayesian optimization (for a recent survey of NAS search strategies, see Elsken et al., 2019). However, most of these methods are computationally demanding due to the nature of the optimization problem: the search space of candidate computation graphs is high-dimensional and discrete. To address this problem, Liu et al. (2018) proposed DARTS, which relaxes the search space to become continuous, making architecture search amenable to gradient descent. The authors demonstrate that DARTS can yield useful network architectures for image classification and language modeling that are on par with architectures designed by human researchers. In this work, we assess whether DARTS can be adopted to automate the discovery of interpretable quantitative models to explain human cognition.

Differentiable Architecture Search

DARTS treats the architecture of a neural network as a directed acyclic computation graph (DAG), containing N nodes in sequential order (Figure 1). Each node x_i corresponds to a latent representation of the input space. Each directed edge e_{i,j} is associated with some operation o_{i,j} that transforms the representation of the preceding node i and feeds it to node j. Each intermediate node is computed by integrating over its transformed predecessors:

x_j = Σ_{i<j} o_{i,j}(x_i).    (3)

Every output node is computed by linearly combining all intermediate nodes projecting to it. The goal of DARTS is to identify all operations o_{i,j} of the DAG. Following Liu et al. (2018), we define O = {o^1_{i,j}, o^2_{i,j}, ..., o^M_{i,j}} to be the set of M candidate operations associated with edge e_{i,j}, where every operation o^m_{i,j}(x_i) corresponds to some function applied to x_i (e.g. linear, exponential or logistic). DARTS relaxes the problem of searching over candidate operations by formulating the transformation associated with an edge as a mixture of all possible operations in O (cf. Figure 1A-B):

ō_{i,j}(x) = Σ_{o∈O} [ exp(α^o_{i,j}) / Σ_{o'∈O} exp(α^{o'}_{i,j}) ] · o(x)    (4)

where each operation is weighted by the softmax transformation of its architectural weight α^o_{i,j}. Every edge e_{i,j} is assigned a weight vector α_{i,j} of dimension M, containing the weights of all possible candidate operations for that edge. The set of all architecture weight vectors α = {α_{i,j}} determines the architecture of the model. Thus, searching the architecture amounts to identifying α.
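As an illustration of Equation (4), the sketch below implements a single edge as a softmax-weighted mixture of a few candidate operations in PyTorch. This is a minimal re-implementation under stated assumptions, not the authors' code; the particular candidate set, parameter shapes and initializations are placeholders.

```python
import torch
import torch.nn as nn

class MixedOp(nn.Module):
    """One edge e_{i,j}: a softmax-weighted mixture over candidate operations."""

    def __init__(self):
        super().__init__()
        # Trainable parameters a, b for the parameterized candidates (part of w).
        self.a = nn.Parameter(torch.randn(3))   # slopes for mult., linear, exp.
        self.b = nn.Parameter(torch.zeros(2))   # offsets for linear and exp.
        # Architecture weights alpha_{i,j}, one per candidate operation.
        self.alpha = nn.Parameter(torch.zeros(6))

    def candidates(self, x):
        return torch.stack([
            torch.zeros_like(x),                   # zero (no connection)
            x,                                     # addition (+x)
            self.a[0] * x,                         # multiplication a*x
            self.a[1] * x + self.b[0],             # linear a*x + b
            torch.exp(self.a[2] * x + self.b[1]),  # exponential
            torch.relu(x),                         # rectified linear
        ])

    def forward(self, x):
        weights = torch.softmax(self.alpha, dim=0)              # Equation (4)
        return (weights.view(-1, 1, 1) * self.candidates(x)).sum(dim=0)

edge = MixedOp()
x = torch.linspace(0.0, 1.0, 5).view(-1, 1)   # a column of node values
print(edge(x).shape)                           # torch.Size([5, 1])
```

Stacking several such edges between nodes yields the relaxed computation graph whose architecture weights can then be trained as described next.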
high-dimensional and discrete. To address this problem, Liu 2. Update the architecture α (cf. Figure 1C) by following the et al. (2018) proposed DARTS which relaxes the search space gradient of the validation loss ∇Lval (w∗ , α). to become continuous, making architecture search amenable to gradient decent. The authors demonstrate that DARTS Once α∗ is found, one can obtain the final architecture by can yield useful network architectures for image classification replacing ōi, j with the operation that has the highest archi- and language modeling that are on par with architectures de- tectural weight, i.e. oi, j ← argmaxo α∗o i, j , or by sampling each signed by human researchers. In this work, we assess whether operation with probability P (oi, j | αi, j ) = σsoftmax (α∗i, j )o . DARTS can be adopted to automate the discovery of inter- pretable quantitative models to explain human cognition. Adapting DARTS for Autonomous Cognitive Modeling We adopt the framework from Liu et al. (2018) by represent- Differentiable Architecture Search ing quantitative models of information processing as DAGs, DARTS treats the architecture of a neural network as a di- and seek to automate the discovery of model architectures by rected acyclic computation graph (DAG), containing N nodes differentiating through the space of operations in the under- in sequential order (Figure 1). Each node xi corresponds to a lying computation graph. To map computation graphs onto latent representation of the input space. Each directed edge quantitative models of cognitive function, we separate the ei, j is associated with some operation oi, j that transforms the nodes of the computation graph into input nodes, intermedi- representation of the preceding node i, and feeds it to node ate nodes and output nodes. Every input node corresponds to j. Each intermediate node is computed by integrating over its a different independent variable (e.g. the brightness of a stim- transformed predecessors: ulus) and every output node corresponds to a different depen- X dent variable (e.g. the probability of detecting the stimulus). xj = oi, j (xi ) . (3) 1 This includes the parameters of each candidate operation om i< j i, j .
Adapting DARTS for Autonomous Cognitive Modeling

We adopt the framework from Liu et al. (2018) by representing quantitative models of information processing as DAGs, and seek to automate the discovery of model architectures by differentiating through the space of operations in the underlying computation graph. To map computation graphs onto quantitative models of cognitive function, we separate the nodes of the computation graph into input nodes, intermediate nodes and output nodes. Every input node corresponds to a different independent variable (e.g. the brightness of a stimulus) and every output node corresponds to a different dependent variable (e.g. the probability of detecting the stimulus). Intermediate nodes represent latent variables of the model and are computed according to Equation (3), by applying an operation to every predecessor of the node and by integrating over all transformed predecessors.² For the simulation experiments reported below, we consider eight candidate operations, summarized in Table 1, including a "zero" operation to indicate the lack of a connection between nodes.

² Predecessors may include both input and intermediate nodes.

Table 1: Search space of candidate operations o(x) ∈ O and their complexity p(o). Note that parameters a, b ∈ w are fitted separately for every o^m_{i,j}.

Description                 o(x)             p(o)
zero                        0
addition                    +x               1
subtraction                 −x               1
multiplication              a · x            2
linear function             a · x + b        3
exponential function        exp(a · x + b)   3
rectified linear function   ReLU(x)          1
logistic function           σ_logistic(x)    1

Figure 1: Learning computation graphs with DARTS. The nodes and edges in a computation graph correspond to variables and functions (operations) performed on those variables, respectively. (A) Candidate operations: edges represent different candidate operations. (B) Continuous relaxation: DARTS relaxes the search space of operations to be continuous. Each intermediate node (blue) is computed as a weighted mixture of operations. The (architectural) weight of a candidate operation in an edge represents the contribution of that operation to the mixture computation. Output nodes (green) are computed by linearly combining all intermediate nodes. (C) Training of architecture weights: architectural weights are trained using bi-level optimization, and used to sample the final architecture of the computation graph, as shown in (D), architecture sampling.

Similar to Liu et al. (2018), we compute every output node r_j by linearly combining all intermediate nodes:

r_j = Σ_{i=S+1}^{S+K} v_{i,j} · x_i    (6)

where v_{i,j} ∈ w is a trainable weight projecting from intermediate node x_i to the output node r_j, S corresponds to the number of input nodes and K to the number of intermediate nodes. To warrant interpretability of the model, we constrain all nodes to be scalar variables, i.e. x_i, r_j ∈ R^{1×1}.

Our goal is to identify a computation graph that can predict each dependent variable from all independent variables. Thus, we seek to minimize, for every dependent variable j, the discrepancy between every output of the model r_j and the observed data t_j. This discrepancy can be formulated as a mean squared error (MSE), L_MSE(r_j, t_j | w, α), that is contingent on both the architecture α and the parameters in w. In addition, we seek to minimize the complexity of the model,

L_complexity = γ · Σ_i Σ_j Σ_m p(o^m_{i,j})    (7)

where p(o^m_{i,j}) corresponds to the complexity of a candidate operation, amounting to one plus the number of trainable parameters (see Table 1), and γ scales the degree to which complexity is penalized. Following the objective in Equation (5), we seek to minimize the total loss, L_total(w, α) = L_MSE(r_j, t_j | w, α) + L_complexity, by simultaneously finding α* and w* using gradient descent.
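For concreteness, one way to implement the combined objective is sketched below. The operation complexities follow Table 1 (the value for the zero operation is assumed to be 0), and weighting each complexity by its softmaxed architecture weight is an interpretation introduced here to make the penalty differentiable in α; the article does not spell out this detail.

```python
import torch

# Complexity p(o) for each candidate operation, in the order in which the
# operations appear in an edge's architecture-weight vector (values follow
# Table 1; the value for the "zero" operation is an assumption).
OP_COMPLEXITY = torch.tensor([0., 1., 1., 2., 3., 3., 1., 1.])

def total_loss(prediction, target, alphas, gamma=0.25):
    """L_total = L_MSE + gamma * softmax-weighted operation complexities.

    `alphas` is a list with one architecture-weight vector per edge; the
    value of gamma is an arbitrary illustrative setting.
    """
    mse = torch.mean((prediction - target) ** 2)
    complexity = sum(
        (torch.softmax(a, dim=0) * OP_COMPLEXITY).sum() for a in alphas
    )
    return mse + gamma * complexity
```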
Experiments and Results

Identifying the architecture of a quantitative model is an ambitious task. Consider the challenge of constructing a DAG to explain the relationship between three independent variables and one dependent variable, with only two latent variables. Assuming a set of eight candidate operations per edge for a total number of seven edges, there are 8^7 (more than two million) possible architectures to explore and endless ways to parameterize the chosen architecture. Adding one more latent variable to the model would expand the search space to 8^12 (roughly 6.9 × 10^10) possible architectures. DARTS offers one way of automating this search. However, before applying DARTS to explain human data, it is worth assessing whether this method is capable of recovering computational motifs from a known ground truth. Therefore, we seek to evaluate whether DARTS can recover quantitative models of human information processing from synthetic data.

As detailed below, we assess the performance of DARTS in recovering three distinct computational motifs in cognitive psychology (see Test Cases). For each test case, we vary the number of intermediate nodes k and the complexity penalty γ across architecture searches (three and five levels, respectively), and initialize each search with ten different seeds.

When evaluating instantiations of NAS, it is important to compare their performance against baselines (Lindauer & Hutter, 2020). In many cases, random sampling can yield results that are comparable to more sophisticated NAS (Li & Talwalkar, 2020; Xie, Kirillov, Girshick, & He, 2019). Thus, we seek to compare the performance of each search condition against a randomly sampled architecture.

Training and Evaluation Procedure

For each test case and each search condition, we optimized the architecture according to Equation (5), using stochastic gradient descent (SGD). To identify w*, we optimized w on the training set over a fixed number of epochs with a cosine annealing schedule for the learning rate, using momentum and weight decay. Following Liu et al. (2018), we initialize architecture variables to be zero. For a given w*, we optimized α on the validation set over a fixed number of epochs using Adam (Kingma & Ba, 2014) with its own initial learning rate, momentum and weight decay. After training w and α, we sampled the final architecture by selecting the operations with the highest architectural weights. Finally, we trained 5 random initializations of each sampled architecture on the training set for 1000 epochs using SGD with a cosine annealing schedule. All parameters were selected based on recoveries of out-of-sample test cases.
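The procedure above maps onto standard PyTorch utilities. The sketch below shows one plausible setup for the final retraining and discretization steps; all hyperparameter values are placeholders, since the article's exact settings are not reproduced here.

```python
import torch

def fit_sampled_architecture(model, x_train, y_train, epochs=1000, lr=0.01):
    """Retrain a sampled architecture with SGD and cosine annealing, as in the
    final evaluation step described above. Hyperparameter values here are
    placeholders, not the article's settings."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=epochs, eta_min=1e-4)
    loss_fn = torch.nn.MSELoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(model(x_train), y_train)
        loss.backward()
        optimizer.step()
        scheduler.step()
    return model

def sample_architecture(alphas):
    """Pick the highest-weighted operation per edge (argmax over alpha)."""
    return [int(torch.argmax(a)) for a in alphas]
```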
Test Cases

All test cases are summarized in Table 2. Here, we report the results for three different psychological models as test cases for DARTS. While these models appear fairly simple, they are based on common computational motifs in cognitive psychology, and serve as a proof of concept for uncovering potential weaknesses of DARTS. Below, we describe each computational model in greater detail. For each of these models, we used 80% of the generated data set to compute the training loss, and 20% to compute the validation loss, to optimize the objective stated in Equation (5). Experiment sequences were generated with SweetPea—a programming language for automating experimental design (Musslick et al., 2020).

Table 2: Summary of test cases, stating the reference to the respective equation (Eqn.), the number of independent variables (IVs), the number of dependent variables (DVs), the number of free parameters (|Θ|), as well as distinct operations (o*).

Test Case       Eqn.   IVs   DVs   |Θ|   o*
Weber's Law     (2)    2     1     1     subtraction
Exp. Learning   (8)    3     1     1     exponential
LCA             (10)   3     1     3     rectified linear

Case 1: Weber's Law. Weber's law is a quantitative hypothesis from psychophysics relating the difference in intensity of two stimuli (e.g. their brightness) to the probability that a participant can detect the difference. Here, we adopt the formal description of Weber's law from Equation (2) with a fixed constant k (see Quantitative Models as Computation Graphs for a detailed description). We consider the two stimulus intensities, I_0 and I_1, as the independent variables of the model, and P(detected) as the dependent variable. The generated data set (for computing L_val and L_train) is synthesized based on 20 evenly spaced samples of stimulus intensity, and by computing P(detected) for all valid crossings between I_0 and I_1, with I_0 ≤ I_1. Since we seek to explain a single probability, we apply a sigmoid function to the output of each generated computation graph.
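A possible data-generation sketch for this test case is shown below (NumPy); the intensity range and the value of k are assumptions, as the exact values are not reproduced here.

```python
import numpy as np

def generate_weber_data(n_samples=20, k=1.0, i_max=5.0):
    """Grid of baseline/comparison intensities with I0 <= I1 and the
    corresponding detection probabilities. The intensity range and the
    value of k are assumptions for illustration."""
    intensities = np.linspace(0.0, i_max, n_samples)
    I0, I1 = np.meshgrid(intensities, intensities)
    mask = I0 <= I1                       # keep only valid crossings
    I0, I1 = I0[mask], I1[mask]
    delta_I = k * I0                      # just noticeable difference
    p_detected = 1.0 / (1.0 + np.exp(-((I1 - I0) - delta_I)))
    return np.stack([I0, I1], axis=1), p_detected

X, y = generate_weber_data()
print(X.shape, y.shape)   # (210, 2) (210,)
```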
Case 2: Exponential Learning. The exponential learning equation is one of the standard equations to explain the improvement on a task with practice (Thurstone, 1919; Heathcote, Brown, & Mewhort, 2000). It explains the performance P_n on a task as follows:

P_n = P_∞ − (P_∞ − P_0) · e^(−α·t)    (8)

where t corresponds to the number of practice trials, α is a learning rate, P_0 corresponds to the initial performance on a task, and P_∞ to the final performance for t → ∞. We treat t, P_0 and P_∞ as independent variables of the model and P_n as a real-valued dependent variable. To avoid numerical instabilities based on large inputs, we constrain t, P_0 and P_∞ to bounded ranges and fix the learning rate α. We generate the synthesized data set by drawing eight evenly-spaced samples for each independent variable, generating a full crossing between these samples, and by computing P_n for each condition. The purpose of this test case is to highlight a potential weakness of DARTS: intermediate nodes cannot represent non-linear interactions between input variables—as is the case in Equation (8)—due to the additive integration of their inputs. Thus, DARTS must identify alternative expressions to approximate Equation (8).

Case 3: Leaky Competing Accumulator. To model the dynamics of perceptual decision making, Usher and McClelland (2001) introduced the leaky, competing accumulator (LCA). Every unit x_i of the model represents a different choice in a decision making task. The activity of these units is used to determine the selected choice of an agent. The activity dynamics are determined by the non-linear equation (without consideration of noise):

dx_i = [ρ_i − λ·x_i + α·f(x_i) − β·Σ_{j≠i} f(x_j)] · dt/τ    (9)

where ρ_i is an external input provided to unit x_i, λ is the decay rate of x_i, α is the recurrent excitation weight of x_i, β is the inhibition weight between units, τ is a rate constant, and f(x_i) is a rectified linear activation function. Here, we seek to recover the dynamics of an LCA with three units, using a typical fixed parameterization of λ, α, β and τ. In addition, we assume that all units receive no external input, ρ_i = 0, and absorb the rate constant τ into the time step. This results in the simplified equation:

dx_i = [−λ·x_i + α·f(x_i) − β·Σ_{j≠i} f(x_j)] · dt    (10)

We treat the unit activities x_1, x_2, x_3 as independent variables (each bounded to a fixed interval) and the change dx_i of a single unit as the dependent variable for a given time step dt. We generate data from the model by drawing eight evenly-spaced samples for each x_i, generating the full crossing between these, and computing dx_i for each condition.
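A corresponding data-generation sketch for the LCA test case is shown below; the parameter values, the sampling interval, and the choice of unit 1 as the dependent variable are illustrative assumptions rather than the article's settings.

```python
import itertools
import numpy as np

def lca_derivative(x, lam=0.2, alpha=0.2, beta=0.2):
    """Change dx_i of each unit for one time step dt = 1, with no external
    input (rho_i = 0). Parameter values are placeholders."""
    f = np.maximum(x, 0.0)             # rectified linear activation
    inhibition = f.sum() - f           # sum over competing units j != i
    return -lam * x + alpha * f - beta * inhibition

# Full crossing of eight evenly spaced samples per unit, as described above;
# the interval [-1, 1] is an assumed bound on the unit activities.
samples = np.linspace(-1.0, 1.0, 8)
conditions = np.array(list(itertools.product(samples, repeat=3)))
dx1 = np.array([lca_derivative(x)[0] for x in conditions])  # DV: change of unit 1
print(conditions.shape, dx1.shape)    # (512, 3) (512,)
```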
Results

Figures 2, 3 and 4 summarize the search results for each test case. In most cases, DARTS outperforms random sampling in recovering architectures from synthesized data. The best average performance for DARTS is observed for a particular combination of k and γ, although the best-fitting architecture may result from different parameters.³ Below, we examine the best-fitting architectures for each test case—determined by the lowest validation loss—which are generally capable of recovering distinct operations used by the data-generating model.

³ Note that we expect no relationship between γ and the validation loss for random sampling, as random sampling is unaffected by γ.

Figure 2: Architecture search results for Weber's law. (A, B) The mean validation loss as a function of the number of intermediate nodes (k) and penalty on model complexity (γ) for architectures obtained through DARTS and random sampling. Vertical bars indicate the standard error of the mean (SEM) across seeds. The star designates the validation loss of the best-fitting architecture depicted in (C). (C) Recovered computation graph. (D) Psychometric function for different baseline intensities, generated by the original model and the recovered architecture shown in (C).

Case 1: Weber's Law. The best-fitting architecture for Weber's law (Figure 2C) can be summarized as

P(detected) = σ_logistic(c_1 · I_1 − c_0 · I_0 − c_b)    (11)

where c_1, c_0 and c_b denote the fitted coefficients of the recovered architecture. It resembles a simplification of the ground truth model in Equation (2), σ_logistic(I_1 − (1 + k) · I_0), recovering the computational motif of the difference between the two input variables, as well as the role of I_0 as a bias term. The architecture can also reproduce psychometric functions generated by the original model (Figure 2D). However, the recovery of Weber's law should be merely considered a sanity check, given that the data-generating model could be recovered with much simpler methods, such as logistic regression. This is reflected in the decent performance of the random sampling method.

Figure 3: Architecture search results for exponential learning. (A, B) The mean validation loss as a function of the number of intermediate nodes (k) and penalty on model complexity (γ) for architectures obtained through DARTS and random sampling. Vertical bars indicate the SEM across seeds. The star designates the loss of the best-fitting architecture shown in (C). (C) Recovered computation graph. (D) The learning curves generated by the original model and the recovered architecture in (C).

Case 2: Exponential Learning. One of the core features of this test case is the exponential relationship between task performance and the number of trials. Note that we do not expect DARTS to fully recover Equation (8), as it is—by design—incapable of representing the non-linear interaction of (P_∞ − P_0) and e^(−α·t). Nevertheless, DARTS recovers the exponential relationship between the number of trials t and performance P_n (Figure 3C). However, the best-fitting architecture relies on a number of other transformations to compute P_n based on its independent variables, and fails to fully recover the learning curves of the original model (Figure 3D). In the General Discussion, we examine ways of mitigating this issue.
Case 3: Leaky Competing Accumulator. The best-fitting architecture, here shown for one setting of k, bears remarkable resemblance to the original model (cf. Equation (10)),

dx_i = [c_0 − c_1 · x_i − c_2 · Σ_{j≠i} ReLU(x_j)] · dt    (12)

where c_0, c_1 and c_2 denote fitted coefficients: it recovers the rectified linear activation function imposed on the two units competing with x_i, as well as an inhibitory weight c_2 that approximates the true inhibition weight β. Yet, the recovered model fails to apply this function to unit x_i itself. The generated dynamics are nevertheless capable of approximating the behavior of the original model (Figure 4D).

Figure 4: Architecture search results for the LCA. (A, B) The mean validation loss as a function of the number of intermediate nodes (k) and penalty on model complexity (γ) for architectures obtained through DARTS and random sampling. Vertical bars indicate the SEM across seeds. The star designates the loss of the best-fitting architecture depicted in (C). (C) Recovered computation graph. (D) Dynamics of each decision unit simulated with the original model and the best architecture shown in (C).

General Discussion and Conclusion

Empirical scientists are challenged with integrating an increasingly large number of experimental phenomena into quantitative models of cognitive function. In this article, we introduced and evaluated a method for recovering quantitative models of cognition using DARTS. The proposed method treats quantitative models as DAGs, and leverages continuous relaxation of the architectural search space to identify candidate models using gradient descent. We evaluated the performance of this method based on its ability to recover three different quantitative models of human cognition from synthetic data. Our results show that this method has an advantage over random sampling, and is capable of recovering computational motifs from quantitative models of human information processing, such as the difference operation in Weber's law or the rectified linear activation function in the LCA. While the initial results reported here seem promising, there are a number of limitations worth addressing in future work.
All limitations of DARTS pertain to its assumptions, most of which limit the scope of discoverable models. First, not all quantitative models can be represented as a DAG, such as ones that require independent variables to be combined in a multiplicative fashion (see Test Case 2). Solving this problem may require expanding the search space to include different integration functions performed on every node.⁴ Second, some operations may have an unfair advantage over others when trained via gradient descent, e.g. if their gradients are larger. Due to the competitive nature of the softmaxed weighting of operations (Equation (4)), this can lead to a suppression of operations that take more time to learn but are, in principle, closer to the ground truth. Chu, Zhou, Zhang, and Li (2020) address this problem by replacing the softmax in Equation (4) with a sigmoid function. This modification, referred to as fair DARTS, allows operations to contribute in a cooperative manner. However, the space of solutions is still limited to the number and types of functions considered (Lindauer & Hutter, 2020). Finally, the performance of DARTS is contingent on a number of training and evaluation parameters, as is the case for other NAS methods. Future work is needed to evaluate DARTS for a larger space of parameters, in addition to the number of intermediate nodes and the penalty on model complexity as explored in this study. However, despite all these limitations, DARTS may provide a first step toward automating the construction of complex quantitative models based on interpretable linear and non-linear expressions.

⁴ Another solution would be to linearize the data or to operate in logarithmic space. However, the former might hamper interpretability for models relying on simple non-linear functions, and the latter may be inconvenient if the ground truth cannot be easily represented in logarithmic space.
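The contrast between the competitive softmax weighting of Equation (4) and the cooperative sigmoid weighting of fair DARTS can be summarized in a few lines; the sketch below is illustrative rather than a faithful reimplementation of either method.

```python
import torch

def mixture_weights(alpha, fair=False):
    """Weights assigned to the candidate operations on one edge.

    Standard DARTS uses a softmax, so operations compete for a fixed weight
    budget; fair DARTS (Chu et al., 2020) uses an element-wise sigmoid, so
    each operation's weight is independent of the others.
    """
    return torch.sigmoid(alpha) if fair else torch.softmax(alpha, dim=0)

alpha = torch.tensor([2.0, 1.5, -1.0])
print(mixture_weights(alpha))             # sums to 1 (competitive)
print(mixture_weights(alpha, fair=True))  # each in (0, 1), independent
```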
In this study, we consider a small number of test cases to evaluate the performance of DARTS. While these test cases present useful proofs of concept, we encourage the rigorous evaluation of this method based on other quantitative models of cognitive function. To enable such explorations, we provide open access to a documented implementation of the evaluation pipeline described in this article.⁵ This pipeline is part of a Python toolbox for autonomous empirical research, and allows for the user-friendly integration and evaluation of other search methods and test cases. As such, the repository includes additional test cases (e.g. models of controlled processing) that we could not include in this article due to space constraints. We invite interested researchers to evaluate DARTS based on other computational models, and to utilize this method for the automated discovery of quantitative models of human information processing.

⁵ We will include a link to the source code and documentation of this pipeline in the final submission.

References

Chu, X., Zhou, T., Zhang, B., & Li, J. (2020). Fair DARTS: Eliminating unfair advantages in differentiable architecture search. In European Conference on Computer Vision (pp. 465–480).

Elsken, T., Metzen, J. H., Hutter, F., et al. (2019). Neural architecture search: A survey. JMLR, 20(55), 1–21.

Fechner, G. T. (1860). Elemente der Psychophysik (Vol. 2). Breitkopf u. Härtel.

He, X., Zhao, K., & Chu, X. (2021). AutoML: A survey of the state-of-the-art. Knowledge-Based Systems, 212, 106622.

Heathcote, A., Brown, S., & Mewhort, D. J. (2000). The power law repealed: The case for an exponential law of practice. Psychonomic Bulletin & Review, 7(2), 185–207.

Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Li, L., & Talwalkar, A. (2020). Random search and reproducibility for neural architecture search. In Uncertainty in Artificial Intelligence (pp. 367–377).

Lindauer, M., & Hutter, F. (2020). Best practices for scientific research on neural architecture search. JMLR, 21(243), 1–18.

Liu, H., Simonyan, K., & Yang, Y. (2018). DARTS: Differentiable architecture search. arXiv preprint arXiv:1806.09055.

Mendoza, H., Klein, A., Feurer, M., Springenberg, J. T., & Hutter, F. (2016). Towards automatically-tuned neural networks. In Workshop on AutoML (pp. 58–65).

Musslick, S., Cherkaev, A., Draut, B., Butt, A., Srikumar, V., Flatt, M., & Cohen, J. D. (2020). SweetPea: A standard language for factorial experimental design. PsyArXiv, doi:10.31234/osf.io/mdwqh.

Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., ... Lerer, A. (2017). Automatic differentiation in PyTorch.

Thurstone, L. L. (1919). The learning curve equation. Psychological Monographs, 26(3), i.

Usher, M., & McClelland, J. L. (2001). The time course of perceptual choice: The leaky, competing accumulator model. Psychological Review, 108(3), 550.

Xie, S., Kirillov, A., Girshick, R., & He, K. (2019). Exploring randomly wired neural networks for image recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 1284–1293).