> TL;DR: We theoretically analyze differentiable architecture search (DARTS) to understand the role and impact of skip connections. The analysis inspires a new network architecture search (NAS) method that uses group-structured sparse gates and path-depth-wise regularization to overcome limitations of existing NAS methods for AutoML.
In our work [1], we investigate the problem of network architecture search (NAS), an important family of techniques for Automated Machine Learning (AutoML). We theoretically analyze the performance of one popular NAS method, namely differentiable architecture search (DARTS) [2], and the analysis inspires us to design a more effective NAS method. For the first time, our work theoretically proves that skip (identity) connections help networks converge faster and therefore enjoy a competitive advantage over other types of operations. This explains why DARTS often selects architectures dominated by skip connections, which in turn often leads to unsatisfactory performance. Based on our theory, we propose a new path-regularized DARTS with two novel components, a group-structured sparse gate and a path-depth-wise regularization, which together alleviate the unfair competition among operations and thus help find better network architectures. Our work advances both the theory and the practical algorithm design of NAS, pushing the frontiers of automated machine learning.
1. What is network architecture search (NAS)?
Network architecture search (NAS) is an important branch of machine learning techniques within AutoML. As shown in Fig. 1, given the network graph in the left figure, NAS aims to automatically search for the best network, that is, to automatically select proper operations from a predefined operation set (e.g., 3×3/5×5 convolution/pooling, skip connection, zero operation, etc.) between any two data nodes in the graph. After the search, one can expect to obtain the well-defined network shown in the right figure of Fig. 1 without involving human experts.
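To make the search space concrete, here is a minimal PyTorch sketch of a candidate operation set between two data nodes; the operation names and channel settings are illustrative choices on our part, not the exact set used in DARTS or in our paper.
```python
import torch.nn as nn

class Zero(nn.Module):
    """The 'zero' operation: cuts the connection by outputting zeros."""
    def forward(self, x):
        return x * 0.0

def candidate_ops(channels):
    """Candidate operations that NAS may place between two data nodes."""
    return nn.ModuleDict({
        "skip_connect": nn.Identity(),                                # identity mapping
        "conv_3x3":     nn.Conv2d(channels, channels, 3, padding=1),
        "conv_5x5":     nn.Conv2d(channels, channels, 5, padding=2),
        "max_pool_3x3": nn.MaxPool2d(3, stride=1, padding=1),
        "zero":         Zero(),                                       # no connection
    })
```
NAS then has to pick one entry of this set for every edge of the graph, which is exactly the combinatorial choice that makes the problem hard.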
2. Why is network architecture search (NAS) important?
Network architecture search (NAS) [3] is an effective approach for automating network architecture design, with many successful applications in image recognition and language modelling. Unlike expert-designed architectures, which require substantial trial-and-error effort from experts, NAS can design network architectures automatically and thus greatly reduces the design effort. Moreover, NAS can also alleviate the design bias introduced by experts, which could prevent better performance from being achieved. This is demonstrated by the fact that networks designed by NAS outperform those designed by human experts and achieve state-of-the-art performance in many applications, e.g., image classification.
3. What is the issue in differentiable architecture search (DARTS)?
DARTS [2] is a recently developed leading NAS approach. Previous reinforcement learning (RL) and evolutionary algorithm (EA) based NAS methods formulate the network search problem as a discrete optimization problem, since one needs to select proper operations from a predefined operation set to connect any two data nodes in the network. However, such discrete optimization has an extremely large search space (about 1e+16 operation choices for a network with 7 nodes) and suffers from very high computational cost (thousands of GPU days). To address this issue, DARTS relaxes the discrete architecture search into the search for continuous architecture parameters. That is, DARTS assigns a weight to each candidate operation between any two nodes and optimizes these operation weights to decide the final network architecture. In this way, it can optimize the architecture parameters via gradient descent and thus greatly reduces the high search cost of RL and EA approaches. However, this differentiable NAS family, including DARTS and its variants, typically selects many skip connections (identity mappings, one kind of operation connecting two data nodes), which come to dominate the other types of operations [4,5,6]. Consequently, the searched networks are observed to have unsatisfactory performance.
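For intuition, below is a minimal PyTorch-style sketch of the continuous relaxation on a single edge: each candidate operation gets an architecture weight, the edge output is the softmax-weighted sum of all operation outputs, and after search only the operation with the largest weight is kept. Class and variable names are ours, not the official DARTS code.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedEdge(nn.Module):
    """One edge under the DARTS-style relaxation: a softmax-weighted sum of
    all candidate operations, so the architecture weights alpha can be
    learned by gradient descent together with the network weights."""

    def __init__(self, ops):
        super().__init__()
        self.ops = nn.ModuleList(ops)                     # candidate operations
        self.alpha = nn.Parameter(torch.zeros(len(ops)))  # architecture weights

    def forward(self, x):
        w = F.softmax(self.alpha, dim=0)   # all weights on this edge sum to one
        return sum(wi * op(x) for wi, op in zip(w, self.ops))

    def derive_op(self):
        # After search, prune the edge to the single strongest operation.
        return self.ops[int(self.alpha.argmax())]
```
Note that because all weights on an edge share one softmax, raising the weight of the skip connection necessarily lowers the weights of every other operation, which is exactly the coupling our method later removes.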
4. What is the intrinsic theoretical reason for the dominance of skip connections in DARTS?
Though the dominance of skip connections in DARTS has been observed in many works [4,5,6], a theoretical understanding of this issue remains absent, hindering the development of more advanced methods in a principled way. In this work, we address this problem by theoretically analyzing the effects of various types of operations, e.g., convolution, skip connection, and the zero operation, on network optimization. By imposing proper conditions on the activation functions, network parameter initialization, and input data, we prove that architectures with more skip connections converge faster than the other candidates and are thus selected by DARTS. This result, for the first time, theoretically and explicitly reveals the impact of skip connections on fast network optimization and their competitive advantage over other types of operations in DARTS.
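Our formal result depends on the stated conditions on activations, initialization, and data; as a rough, informal intuition (not the theorem itself), one can compare the Jacobian of a block with and without a skip connection:
```latex
% Informal intuition only, not the paper's theorem.
\[
  \text{with skip:}\quad y = x + f(x;W)
  \;\Longrightarrow\;
  \frac{\partial y}{\partial x} = I + \frac{\partial f(x;W)}{\partial x},
\]
\[
  \text{without skip:}\quad y = f(x;W)
  \;\Longrightarrow\;
  \frac{\partial y}{\partial x} = \frac{\partial f(x;W)}{\partial x}.
\]
% The identity term keeps backpropagated gradients from shrinking as depth
% grows, which is the rough reason skip-heavy architectures optimize faster
% and hence win the weight competition inside DARTS.
```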
5. What method can resolve the dominance of skip connections in DARTS?
Inspired by our theory, we propose a path-regularized DARTS that consists of two key modules: (i) a differential group-structured sparse binary gate introduced for each operation to avoid unfair competition among operations, and (ii) a path-depth-wise regularization used to encourage the search to explore deep architectures, which often converge more slowly than shallow ones and are therefore not well explored during search.
(i) Differential group-structured sparse binary gate
Through analysis, we find that the reason for the dominance of skip connections is that the skip connection and the other types of operations on an edge share one softmax distribution (their weights sum to one). When the weights of skip connections become larger, the weights of the other operations necessarily become smaller. To resolve this issue, we introduce an independent binary gate, implemented with a Bernoulli distribution, for each operation between two nodes to avoid the direct competition between the skip connection and the other operations. If the binary gate of an operation is one, the operation is used to connect the two data nodes; otherwise, it is not used. However, we prove that fully independent gate distributions yield dense architectures, which leads to information loss when the network is pruned to a small network after the search.
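The sketch below shows one way such an independent gate could look in PyTorch; we use a straight-through Bernoulli sample here for simplicity, whereas the paper trains its gates with its own differentiable relaxation, so treat the details as an assumption on our part.
```python
import torch
import torch.nn as nn

class BinaryGate(nn.Module):
    """An independent stochastic binary gate attached to one operation.

    Sketch only: a straight-through Bernoulli sample stands in for the
    differentiable relaxation used in the paper."""

    def __init__(self):
        super().__init__()
        self.logit = nn.Parameter(torch.zeros(1))   # learnable gate parameter

    def open_prob(self):
        return torch.sigmoid(self.logit)            # P(gate = 1)

    def forward(self, op_out):
        p = self.open_prob()
        g = torch.bernoulli(p)         # sample 0/1: keep or drop the operation
        g = g + p - p.detach()         # straight-through gradient w.r.t. p
        return g * op_out              # gated operation output
```
Because each gate has its own distribution, turning the skip connection on no longer forces the weights of the other operations down.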
To address this density issue, we propose a group-structured sparse binary gate for each operation. Specifically, we first define one binary Bernoulli gate variable for each operation. Then we pass the gate variable through a hard-thresholding function, so that when the gate variable is smaller than a threshold it becomes zero. In this way, the gate variables in the network become sparse. Moreover, we can compute the probability that each gate of the network is activated. We collect all skip connections in the cell into a skip-connection group and put the remaining operations into a non-skip-connection group. We then compute the average activation probability of each group and penalize both to obtain a sparsely selected network. As a result, as shown in Fig. 1 (a) and (b), penalizing skip connections more heavily than other types of operations rectifies the competitive advantage of skip connections over other operations and avoids skip-connection-dominated cells.
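A rough sketch of the two ingredients just described, hard thresholding of the gate variables and the group-wise penalty on activation probabilities, is given below; the threshold and the two penalty coefficients are placeholders of ours, not the values used in the paper.
```python
import torch

def hard_threshold(gate, tau=0.5):
    """Zero out gate variables below the threshold tau, so only a sparse
    subset of operations stays active (tau is a placeholder value)."""
    return torch.where(gate < tau, torch.zeros_like(gate), gate)

def group_sparsity_penalty(skip_probs, other_probs,
                           lambda_skip=2e-3, lambda_other=1e-3):
    """Group-structured sparsity regularizer (sketch).

    skip_probs / other_probs: 1-D tensors holding the activation
    probabilities of all gates in the skip-connection group and in the
    non-skip group of a cell. Penalizing the skip group more heavily
    (lambda_skip > lambda_other) rectifies its competitive advantage."""
    return lambda_skip * skip_probs.mean() + lambda_other * other_probs.mean()
```
This penalty is simply added to the search loss, so gradient descent trades off accuracy against keeping gates, especially skip-connection gates, open.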
Moreover, the above group-structured sparse binary gates can gradually and automatically prune redundant and unnecessary connections, which reduces the information loss of pruning at the end of the search. Finally, since the sparse binary gates are defined over the whole cell, they encourage global competition and cooperation among all operations in the cell, which differs from DARTS, where only local competition among the operations between two nodes is introduced.
(ii) Path-depth-wise regularizer on operation gates
Despite the above advantages, these independent sparse gates introduce one issue: they discourage the method from selecting deep networks. Without dominant skip connections in the network, other types of operations, e.g., the zero operation, become freer and are widely used. Accordingly, the search algorithm can easily transform a deep network into a shallow one; for example, it can use skip connections to connect the intermediate nodes directly to the input nodes and cut off the connections between neighbouring intermediate nodes. Meanwhile, we prove that the gradient descent algorithm prefers shallow networks to deep ones, as shallow networks often have smoother loss landscapes and can be optimized faster. These two factors together bias the search algorithm toward shallow cells.
To resolve this network-selection bias, we propose a path-depth-wise regularization to rectify the unfair competition between shallow and deep networks. We first compute the probability that any two neighbouring nodes are connected by parameterized operations, namely the various types of convolutions. Then, to rectify the competitive advantage of shallow networks over deep ones, we impose the path-depth-wise regularization on the stochastic gates to encourage more exploration of deep networks, deciding the depth of the network later in the search instead of greedily choosing shallow cells at the beginning. As a result, as shown in Fig. 1 (a) and (c), our path-depth-wise regularization helps find deeper networks, which usually fit data better than shallow ones. Finally, by combining the group-structured sparse gates and the path-depth-wise regularization, we obtain our final model, which is shown to be superior to DARTS in Fig. 1 (d) and (e).
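To make the regularizer concrete, here is a rough sketch of one way to reward deep paths; the exact formula in the paper differs, so read this as our illustrative reading of the idea rather than the actual implementation.
```python
import torch

def path_depth_regularizer(edge_conv_probs, lambda_depth=1e-3):
    """Path-depth-wise regularizer (illustrative sketch).

    edge_conv_probs: a list of 1-D tensors; entry i holds the activation
    probabilities of the parameterized (convolution) gates on the edge
    between intermediate node i and node i+1.

    We reward the probability that every consecutive pair of intermediate
    nodes stays connected through at least one convolution, which keeps
    deep cells in play instead of greedily collapsing to shallow ones."""
    depth_prob = torch.tensor(1.0)
    for probs in edge_conv_probs:
        # P(at least one parameterized operation is active on this edge)
        connected = 1.0 - torch.prod(1.0 - probs)
        depth_prob = depth_prob * connected
    return -lambda_depth * depth_prob   # negative: acts as a reward in the loss
```
Subtracting such a term in the search objective pushes the gates along the deepest path to stay open longer, so the decision between shallow and deep cells is made later in the search rather than at the very beginning.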
6. What results has the developed method achieved?
To evaluate the performance of our method, we follow the conventional setting: we search on the CIFAR10 dataset to find a network architecture, and then test the selected network on the CIFAR10, CIFAR100, and ImageNet datasets. The classification results in Table 1 show that our method achieves state-of-the-art results on both CIFAR10 and CIFAR100. The results on ImageNet in Table 2 also demonstrate the superiority of our method. All these results show the superiority, robustness, and transferability of the network architecture selected by our method.
7. Takeaway Remarks and What is Next?
This work first provides a theoretical understanding of existing DARTS methods, which has rarely been investigated though it is highly desirable. We then develop an effective network architecture search approach that resolves the skip-connection issue in DARTS. Our work pushes the frontiers of both theoretical performance analysis and practical algorithm design for NAS, which not only helps us better understand current NAS approaches theoretically but also brings new practical design insights for NAS. We will release our code and models for anyone interested in NAS and AutoML research and applications.
If you are interested in learning more, please check out our paper
[https://arxiv.org/abs/2006.16537] and feel free to contact us.
Resources:
Paper: Theory-Inspired Path-Regularized Differential Network Architecture Search
[https://arxiv.org/abs/2006.16537]
Code: We will release the code and models soon.
When referencing this work, please cite:
@inproceedings{zhou2020nas,
  title={Theory-Inspired Path-Regularized Differential Network Architecture Search},
  author={Pan Zhou and Caiming Xiong and Richard Socher and Steven Hoi},
  booktitle={Neural Information Processing Systems},
  year={2020}
}
References
[1] P. Zhou, C. Xiong, R. Socher, S. Hoi, Theory-Inspired Path-Regularized
Differential Network Architecture Search, Neural Information Processing Systems
(NeurIPS), 2020 (oral). [https://arxiv.org/pdf/2006.16537.pdf]
[2] H. Liu, K. Simonyan, and Y. Yang. DARTS: Differentiable architecture search. In Int’l Conf. Learning Representations, 2019. [https://arxiv.org/pdf/1806.09055.pdf]
[3] B. Zoph and Q. Le. Neural architecture search with reinforcement learning.
In Int’l Conf. Learning Representations, 2017.
[https://arxiv.org/abs/1611.01578]
[4] X. Chen, L. Xie, J. Wu, and Q. Tian. Progressive differentiable architecture search: Bridging the depth gap between search and evaluation. In IEEE International Conference on Computer Vision, pages 1294–1303, 2019. [https://arxiv.org/abs/1904.12760]
[5] X. Chu, T. Zhou, B. Zhang, and J. Li. Fair DARTS: Eliminating unfair
advantages in differentiable architecture search. arXiv preprint
arXiv:1911.12126, 2019. [https://arxiv.org/abs/1911.12126]
[6] A. Zela, T. Elsken, T. Saikia, Y. Marrakchi, T. Brox, and F. Hutter. Understanding and robustifying differentiable architecture search. In Int’l Conf. Learning Representations, 2020. [https://arxiv.org/abs/1909.09656]