2020.1.13 note
AdderNet: Do We Really Need Multiplications in Deep Learning?

Compared with the cheap addition operation, multiplication has much higher computational complexity. The widely used convolutions in deep neural networks are in fact cross-correlations that measure the similarity between input features and convolution filters, which involves massive numbers of multiplications between float values. In this paper, they present adder networks (AdderNets), which trade these multiplications in deep neural networks, especially convolutional neural networks (CNNs), for much cheaper additions to reduce computation costs. In AdderNets, they take the L1-norm distance between filters and input features as the output response. The influence of this new similarity measure on the optimization of neural networks is thoroughly analyzed. To achieve better performance, they develop a special back-propagation approach for AdderNets by investigating the full-precision gradient. They then propose an adaptive learning rate strategy that enhances the training procedure of AdderNets according to the magnitude of each neuron’s gradient. As a result, the proposed AdderNets achieve 74.9% Top-1 accuracy and 91.7% Top-5 accuracy using ResNet-50 on the ImageNet dataset without any multiplication in the convolution layers.
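To make the "L1 distance as output response" idea concrete, here is a minimal NumPy sketch of an adder-style layer. It is not the authors' implementation: the function name `adder_response`, the stride-1/no-padding layout, and the sign convention (negated distance, so a closer match gives a larger response) are assumptions made for illustration.

```python
import numpy as np

def adder_response(x, filters):
    """Adder-style layer response (illustrative sketch, not the paper's code).

    x:       input feature map, shape (H, W, C_in)
    filters: filter bank, shape (k, k, C_in, C_out)
    Returns the negated L1 distance between every k x k patch and every
    filter, shape (H - k + 1, W - k + 1, C_out). Stride 1, no padding.
    """
    H, W, _ = x.shape
    k, _, _, c_out = filters.shape
    out = np.empty((H - k + 1, W - k + 1, c_out))
    for i in range(H - k + 1):
        for j in range(W - k + 1):
            # Add a trailing axis so the patch broadcasts against all filters.
            patch = x[i:i + k, j:j + k, :, None]
            # Only subtractions, absolute values and sums: no multiplications.
            out[i, j] = -np.abs(patch - filters).sum(axis=(0, 1, 2))
    return out
```

A patch identical to a filter gives the maximum response of 0, and the response decreases as the patch moves away from the filter in L1 distance.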

Optimization for deep learning: theory and algorithms

A review paper.

How neural networks find generalizable solutions: Self-tuned annealing in deep learning

By analyzing the learning dynamics and loss function landscape, they discover a robust inverse relation between the weight variance and the landscape flatness (inverse of curvature) for all SGD-based learning algorithms. To explain the inverse variance-flatness relation, they develop a random landscape theory, which shows that the SGD noise strength (effective temperature) depends inversely on the landscape flatness. Their study indicates that SGD attains a self-tuned, landscape-dependent annealing strategy to find generalizable solutions at the flat minima of the landscape.

CNN-generated images are surprisingly easy to spot… for now

In this work, they ask whether it is possible to create a universal detector for telling apart real images from those generated by a CNN, regardless of the architecture or dataset used. To test this, they collect a dataset consisting of fake images generated by 11 different CNN-based image generator models, chosen to span the space of commonly used architectures today. They demonstrate that, with careful pre- and post-processing and data augmentation, a standard image classifier trained on only one specific CNN generator (ProGAN) is able to generalize surprisingly well to unseen architectures, datasets, and training methods. Their findings suggest the intriguing possibility that today’s CNN-generated images share some common systematic flaws, preventing them from achieving realistic image synthesis.

A Deep Neural Network’s Loss Surface Contains Every Low-dimensional Pattern

The work "Loss Landscape Sightseeing with Multi-Point Optimization" demonstrated that one can empirically find arbitrary 2D binary patterns inside the loss surfaces of popular neural networks. In this paper they prove that: (i) this is a general property of deep universal approximators; and (ii) this property holds for arbitrary smooth patterns, for other dimensionalities, for every dataset, and for any neural network that is sufficiently deep and wide.

Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs

The loss functions of deep neural networks are complex and their geometric properties are not well understood. They show that the optima of these complex loss functions are in fact connected by simple curves over which training and test accuracy are nearly constant. They introduce a training procedure to discover these high-accuracy pathways between modes. Inspired by this new geometric insight, they also propose a new ensembling method called Fast Geometric Ensembling (FGE). Using FGE they can train high-performing ensembles in the time required to train a single model. They achieve improved performance compared to the recent state-of-the-art Snapshot Ensembles on CIFAR-10, CIFAR-100, and ImageNet.
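The claim that optima are connected by simple curves can be illustrated on a toy problem. The sketch below is only an assumption-laden illustration: the quadratic Bezier parameterization, the hand-picked bend point `theta`, and the ring-shaped `toy_loss` are all chosen here for clarity and are not taken from the notes above (in the actual procedure the curve would be found by training, not set by hand).

```python
import numpy as np

def bezier_point(w1, theta, w2, t):
    """Point at parameter t on a quadratic Bezier curve from w1 to w2.

    theta is the bend (control) point. With theta at the midpoint of w1
    and w2, the curve reduces to the straight segment between them.
    """
    return (1 - t) ** 2 * w1 + 2 * t * (1 - t) * theta + t ** 2 * w2

def loss_along_curve(loss_fn, w1, theta, w2, num_points=101):
    """Evaluate loss_fn at evenly spaced parameters t in [0, 1]."""
    ts = np.linspace(0.0, 1.0, num_points)
    return np.array([loss_fn(bezier_point(w1, theta, w2, t)) for t in ts])

def toy_loss(w):
    """Hypothetical 2D loss whose minima form the unit circle."""
    return (np.dot(w, w) - 1.0) ** 2

# Two "modes" on opposite sides of the ring of minima.
w1 = np.array([1.0, 0.0])
w2 = np.array([-1.0, 0.0])

# Straight path: crosses the high-loss region at the origin.
line_losses = loss_along_curve(toy_loss, w1, 0.5 * (w1 + w2), w2)
# Bent path: the hand-picked control point keeps the curve near the ring.
curve_losses = loss_along_curve(toy_loss, w1, np.array([0.0, 2.0]), w2)
```

Comparing `line_losses.max()` with `curve_losses.max()` shows the barrier on the straight path and the much lower loss along the bent path, which is the toy analogue of a low-loss pathway between modes.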

Characterizing the Decision Boundary of Deep Neural Networks

They propose a novel approach they call Deep Decision boundary Instance Generation (DeepDIG). DeepDIG uses a method based on adversarial example generation as an effective way of generating samples near the decision boundary of any deep neural network model. Then, they introduce a set of principled characteristics that take advantage of the generated instances near the decision boundary to provide a multifaceted understanding of deep neural networks. They have performed extensive experiments on multiple representative datasets across various deep neural network models and characterized their decision boundaries.

BACKPACK: PACKING MORE INTO BACKPROP

Automatic differentiation frameworks are optimized for exactly one thing: computing the average mini-batch gradient. Yet other quantities, such as the variance of the mini-batch gradients or many approximations to the Hessian, can in theory be computed efficiently and at the same time as the gradient. While these quantities are of great interest to researchers and practitioners, current deep-learning software does not support their automatic calculation. To address this problem, they introduce BACKPACK, an efficient framework built on top of PYTORCH that extends the backpropagation algorithm to extract additional information from first- and second-order derivatives. Its capabilities are illustrated by benchmark reports for computing additional quantities on deep neural networks, and by an example application testing several recent curvature approximations for optimization.
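To see why the gradient variance can come "at the same time as the gradient", here is a small NumPy sketch for a linear least-squares model. It does not use BACKPACK or PyTorch; the function name `grad_mean_and_variance` and the choice of model and loss are assumptions made purely for illustration.

```python
import numpy as np

def grad_mean_and_variance(X, y, w):
    """Mini-batch gradient mean and element-wise variance in one pass.

    Model (assumed for illustration): per-sample loss 0.5 * (x_i @ w - y_i)**2,
    whose gradient with respect to w is (x_i @ w - y_i) * x_i.

    X: (N, D) inputs, y: (N,) targets, w: (D,) weights.
    Returns (mean_gradient, gradient_variance), each of shape (D,).
    """
    residual = X @ w - y                    # (N,)
    per_sample = residual[:, None] * X      # (N, D) individual gradients
    mean = per_sample.mean(axis=0)          # what autodiff usually returns
    second_moment = (per_sample ** 2).mean(axis=0)
    # Var[g] = E[g^2] - E[g]^2, reusing the quantities computed above.
    return mean, second_moment - mean ** 2
```

The point of the sketch is that the individual gradients already exist as an intermediate result; averaging them discards information that a second reduction (here the second moment) can recover at little extra cost.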

oLMpics - On what Language Model Pre-training Captures

In this work, they propose eight reasoning tasks, which conceptually require operations such as comparison, conjunction, and composition. A fundamental challenge is to understand whether the performance of an LM on a task should be attributed to the pre-trained representations or to the process of fine-tuning on the task data. To address this, they propose an evaluation protocol that includes both zero-shot evaluation (no fine-tuning) and a comparison of the learning curve of a fine-tuned LM to the learning curves of multiple controls, which paints a rich picture of the LM's capabilities. Their main findings are that: (a) different LMs exhibit qualitatively different reasoning abilities, e.g., ROBERTA succeeds in reasoning tasks where BERT fails completely; (b) LMs do not reason in an abstract manner and are context-dependent, e.g., while ROBERTA can compare ages, it can do so only when the ages are in the typical range of human ages; (c) on half of their reasoning tasks all models fail completely. Their findings and infrastructure can help future work on designing new datasets, models, and objective functions for pre-training.