(ICLR 2017) Neural Architecture Search with Reinforcement Learning

Paper:

Page:

We use a recurrent network to generate the model descriptions of neural networks and train this RNN with reinforcement learning to maximize the expected accuracy of the generated architectures on a validation set.


# English

## Introduction

This paper presents Neural Architecture Search, a gradient-based method for finding good architectures (see Figure 1).

Figure 1: An overview of Neural Architecture Search.

Our work is based on the observation that the structure and connectivity of a neural network can typically be specified by a variable-length string. It is therefore possible to use a recurrent network – the controller – to generate such a string.

Training the network specified by the string – the “child network” – on the real data will result in an accuracy on a validation set. Using this accuracy as the reward signal, we can compute the policy gradient to update the controller. As a result, in the next iteration, the controller will give higher probabilities to architectures that receive high accuracies.

## Methods

### Training with REINFORCE

The list of tokens that the controller predicts can be viewed as a list of actions $a_{1:T}$ to design an architecture for a child network.

At convergence, this child network will achieve an accuracy $R$ on a held-out dataset. We can use this accuracy $R$ as the reward signal and use reinforcement learning to train the controller.

More concretely, to find the optimal architecture, we ask our controller to maximize its expected reward, represented by $J(\theta_c)$:

$$J(\theta_c) = E_{P(a_{1:T};\theta_c)} [R]$$

Since the reward signal $R$ is non-differentiable, we need to use a policy gradient method to iteratively update $\theta_c$. In this work, we use the REINFORCE rule from Williams (1992):

$$\nabla_{\theta_c} J(\theta_c) = \sum_{t = 1}^T E_{P(a_{1:T};\theta_c)} [\nabla_{\theta_c} \log P(a_t \mid a_{(t - 1):1};\theta_c) R]$$
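This update can be exercised on a toy problem. The sketch below is not the paper's implementation: the controller is a set of independent per-step softmaxes rather than an RNN, and the reward is a synthetic stand-in for a child network's validation accuracy. It only illustrates the REINFORCE rule itself.

```python
import numpy as np

rng = np.random.default_rng(0)
T, V = 3, 4               # sequence length and token vocabulary (illustrative)
theta = np.zeros((T, V))  # one softmax per step, standing in for the RNN controller

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sample_architecture():
    """Sample a_1:T and accumulate grad_theta log P(a_1:T; theta)."""
    grad = np.zeros_like(theta)
    actions = []
    for t in range(T):
        p = softmax(theta[t])
        a = rng.choice(V, p=p)
        actions.append(a)
        grad[t] = -p
        grad[t, a] += 1.0  # d log softmax(a) / d logits = one-hot(a) - p
    return actions, grad

def reward(actions):
    # Synthetic stand-in for "train the child network, read validation
    # accuracy": architectures that pick token 2 score higher.
    return sum(a == 2 for a in actions) / len(actions)

lr = 0.5
for _ in range(500):
    actions, grad = sample_architecture()
    theta += lr * reward(actions) * grad  # REINFORCE: R * grad log P

# The policy now concentrates probability on the high-reward token.
probs = np.array([softmax(theta[t]) for t in range(T)])
```

After training, the controller assigns most of its probability mass to the token that correlates with high reward, which is exactly the behavior described above: architectures that receive high accuracies become more likely.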

### Increase Architecture Complexity with Skip Connections and Other Layer Types

At layer $N$, we add an anchor point which has $N - 1$ content-based sigmoids to indicate which of the previous layers should be connected. Each sigmoid is a function of the current hidden state of the controller and the hidden states of the previous $N - 1$ anchor points:

$$P(\text{Layer j is an input to layer i}) = \text{sigmoid}(v^T \tanh(W_{prev} h_j + W_{curr} h_i))$$

trainable parameters: $W_{prev}$, $W_{curr}$, $v$

Figure 4: The controller uses anchor points and set-selection attention to form skip connections.
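A minimal numpy sketch of this set-selection step (dimensions, weight values, and hidden states are random stand-ins, not trained quantities):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # controller hidden size (illustrative)

# Trainable parameters of the attention mechanism (random stand-ins here).
W_prev = rng.normal(scale=0.1, size=(d, d))
W_curr = rng.normal(scale=0.1, size=(d, d))
v = rng.normal(scale=0.1, size=d)

def skip_prob(h_j, h_i):
    """P(layer j is an input to layer i) = sigmoid(v^T tanh(W_prev h_j + W_curr h_i))."""
    s = v @ np.tanh(W_prev @ h_j + W_curr @ h_i)
    return 1.0 / (1.0 + np.exp(-s))

# Anchor-point hidden states of layers 1..4 and the current layer 5.
anchors = [rng.normal(size=d) for _ in range(4)]
h_i = rng.normal(size=d)

# One independent Bernoulli per previous layer: a set selection, not a softmax,
# so any subset of previous layers can be chosen as inputs.
probs = [skip_prob(h_j, h_i) for h_j in anchors]
connect = [rng.random() < p for p in probs]
```

Because each connection is an independent sigmoid rather than one entry of a softmax, the controller can wire in zero, one, or many previous layers at once.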

Skip connections can cause “compilation failures” where one layer is not compatible with another layer, or one layer may not have any input or output. To circumvent these issues, we employ three simple techniques.

1. If a layer is not connected to any input layer, then the image is used as the input layer.

2. At the final layer, we take all layer outputs that have not been connected and concatenate them before sending this final hidden state to the classifier.

3. If input layers to be concatenated have different sizes, we pad the small layers with zeros so that the concatenated layers have the same sizes.
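Technique 3 (zero-padding before concatenation) can be sketched as follows, assuming NHWC feature maps; the function name and layout choice are illustrative:

```python
import numpy as np

def concat_with_zero_pad(feature_maps):
    """Concatenate NHWC feature maps along channels, zero-padding the
    smaller maps so all spatial sizes match the largest one."""
    H = max(f.shape[1] for f in feature_maps)
    W = max(f.shape[2] for f in feature_maps)
    padded = []
    for f in feature_maps:
        ph, pw = H - f.shape[1], W - f.shape[2]
        padded.append(np.pad(f, ((0, 0), (0, ph), (0, pw), (0, 0))))
    return np.concatenate(padded, axis=-1)

a = np.ones((1, 8, 8, 16))  # larger spatial size
b = np.ones((1, 4, 4, 32))  # smaller spatial size, gets zero-padded
out = concat_with_zero_pad([a, b])  # shape (1, 8, 8, 48)
```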

### Generate Recurrent Cell Architectures

At every time step $t$, the controller needs to find a functional form for $h_t$ that takes $x_t$ and $h_{t-1}$ as inputs. The simplest form is $h_t = \tanh(W_1 x_t + W_2 h_{t-1})$.
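This simplest form amounts to the following (a numpy sketch; the dimensions and random weights are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 8, 16  # input and hidden sizes (illustrative)

W1 = rng.normal(scale=0.1, size=(d_h, d_in))
W2 = rng.normal(scale=0.1, size=(d_h, d_h))

def basic_cell(x_t, h_prev):
    """h_t = tanh(W1 x_t + W2 h_{t-1})"""
    return np.tanh(W1 @ x_t + W2 @ h_prev)

# Unroll the cell over a toy input sequence.
h = np.zeros(d_h)
for _ in range(5):
    h = basic_cell(rng.normal(size=d_in), h)
```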

The computations for basic RNN and LSTM cells can be generalized as a tree of steps that take $x_t$ and $h_{t-1}$ as inputs and produce $h_t$ as final output.

Left: the tree that defines the computation steps to be predicted by controller.

Center: an example set of predictions made by the controller for each computation step in the tree.

1. The controller predicts Add and Tanh for tree index 0, which means we need to compute $a_0 = \tanh(W_1 x_t + W_2 h_{t-1})$.

2. The controller predicts ElemMult and ReLU for tree index 1, which means we need to compute $a_1 = \text{ReLU}((W_3 x_t) \odot (W_4 h_{t-1}))$.

3. The controller predicts 0 for the second element of the “Cell Index”, and Add and ReLU for the elements of “Cell Inject”, which means we need to compute $a^{new}_0 = \text{ReLU}(a_0 + c_{t-1})$. Notice that we don’t have any learnable parameters for the internal nodes of the tree.

4. The controller predicts ElemMult and Sigmoid for tree index 2, which means we need to compute $a_2 = \text{sigmoid}(a^{new}_0 \odot a_1)$. Since the maximum index in the tree is 2, $h_t$ is set to $a_2$.

5. The controller RNN predicts 1 for the first element of the “Cell Index”, this means that we should set $c_t$ to the output of the tree at index 1 before the activation, i.e., $c_t = (W_3 x_t) \odot (W_4 h_{t-1})$.
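Putting steps 1–5 together, the predicted cell computes the following (a numpy sketch; the weight values and dimensions are illustrative stand-ins, not trained parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 8, 16  # illustrative sizes

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
relu = lambda z: np.maximum(z, 0.0)

W1 = rng.normal(scale=0.1, size=(d_h, d_in))
W2 = rng.normal(scale=0.1, size=(d_h, d_h))
W3 = rng.normal(scale=0.1, size=(d_h, d_in))
W4 = rng.normal(scale=0.1, size=(d_h, d_h))

def cell(x_t, h_prev, c_prev):
    a0 = np.tanh(W1 @ x_t + W2 @ h_prev)  # tree index 0: Add, Tanh
    pre1 = (W3 @ x_t) * (W4 @ h_prev)     # tree index 1, before activation
    a1 = relu(pre1)                       # tree index 1: ElemMult, ReLU
    a0_new = relu(a0 + c_prev)            # Cell Inject: Add, ReLU with c_{t-1}
    a2 = sigmoid(a0_new * a1)             # tree index 2: ElemMult, Sigmoid
    # h_t is the max-index output; c_t is tree index 1 before its activation.
    return a2, pre1

h, c = np.zeros(d_h), np.zeros(d_h)
for _ in range(3):
    h, c = cell(rng.normal(size=d_in), h, c)
```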

## Experiments and Results

On CIFAR-10, our goal is to find a good convolutional architecture whereas on Penn Treebank our goal is to find a good recurrent cell.

### Learning Convolutional Architectures for CIFAR-10

Search space: Our search space consists of convolutional architectures, with rectified linear units as non-linearities (Nair & Hinton, 2010), batch normalization (Ioffe & Szegedy, 2015) and skip connections between layers (Section 3.3). For every convolutional layer, the controller RNN has to select a filter height in [1, 3, 5, 7], a filter width in [1, 3, 5, 7], and a number of filters in [24, 36, 48, 64]. For strides, we perform two sets of experiments, one where we fix the strides to be 1, and one where we allow the controller to predict the strides in [1, 2, 3].

Training details: The controller RNN is a two-layer LSTM with 35 hidden units on each layer. It is trained with the ADAM optimizer (Kingma & Ba, 2015) with a learning rate of 0.0006. The weights of the controller are initialized uniformly between -0.08 and 0.08. For the distributed training, we set the number of parameter server shards S to 20, the number of controller replicas K to 100 and the number of child replicas m to 8, which means there are 800 networks being trained on 800 GPUs concurrently at any time.

During the training of the controller, we use a schedule of increasing number of layers in the child networks as training progresses. On CIFAR-10, we ask the controller to increase the depth by 2 for the child models every 1,600 samples, starting at 6 layers.
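This depth schedule can be written down directly (the starting depth, step, and interval below are the values quoted above):

```python
def child_depth(num_sampled, start=6, step=2, every=1600):
    """Number of layers for the next child model: start at 6 layers and
    add 2 layers every 1,600 sampled architectures."""
    return start + step * (num_sampled // every)

# child_depth(0) == 6, child_depth(1600) == 8, child_depth(12799) == 20
```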

Results: After the controller trains 12,800 architectures, we find the architecture that achieves the best validation accuracy. We then run a small grid search over learning rate, weight decay, batchnorm epsilon and what epoch to decay the learning rate.
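Such a small grid search is just an exhaustive loop over hyperparameter combinations. The candidate values below are hypothetical (the paper does not list the exact grid), and the evaluator is a toy stand-in for retraining the best architecture with each configuration:

```python
from itertools import product

# Hypothetical candidate values (not from the paper).
grid = {
    "learning_rate": [0.05, 0.1, 0.2],
    "weight_decay": [1e-4, 5e-4],
    "batchnorm_epsilon": [1e-5, 1e-3],
    "decay_epoch": [100, 200],
}

def grid_search(evaluate):
    """Try every combination and keep the one with the best validation accuracy."""
    best_cfg, best_acc = None, -1.0
    for values in product(*grid.values()):
        cfg = dict(zip(grid.keys(), values))
        acc = evaluate(cfg)
        if acc > best_acc:
            best_cfg, best_acc = cfg, acc
    return best_cfg, best_acc

# Toy evaluator standing in for "retrain the best architecture with cfg".
best, acc = grid_search(lambda cfg: 0.9 - abs(cfg["learning_rate"] - 0.1))
```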

Table 1: Performance of Neural Architecture Search and other state-of-the-art models on CIFAR-10.

### Learning Recurrent Cells for Penn Treebank

Search space: For each node in the tree, the controller RNN needs to select a combination method in [add; elem mult] and an activation method in [identity; tanh; sigmoid; relu]. The number of input pairs to the RNN cell is called the “base number” and set to 8 in our experiments. When the base number is 8, the search space has approximately $6 \times 10^{16}$ architectures, which is much larger than 15,000, the number of architectures that we allow our controller to evaluate.

Training details: The controller and its training are almost identical to the CIFAR-10 experiments except for a few modifications: 1) the learning rate for the controller RNN is 0.0005, slightly smaller than that of the controller RNN in CIFAR-10, 2) in the distributed training, we set S to 20, K to 400 and m to 1, which means there are 400 networks being trained on 400 CPUs concurrently at any time, 3) during asynchronous training we only do parameter updates to the parameter-server once 10 gradients from replicas have been accumulated.

In our experiments, every child model is constructed and trained for 35 epochs.

Table 2: Single model perplexity on the test set of the Penn Treebank language modeling task. Parameter numbers marked with ‡ are estimates with reference to Merity et al. (2016).

## Conclusion

In this paper we introduce Neural Architecture Search, a method that uses a recurrent neural network to compose neural network architectures.