Gamaleldin Fathy Elsayed

Gamaleldin F. Elsayed is a Research Scientist at Google Brain interested in deep learning and computational neuroscience research. In particular, his research focuses on studying the properties and problems of artificial neural networks and on designing better machine learning models with inspiration from neuroscience. In 2017, he completed his PhD in Neuroscience at Columbia University, at the Center for Theoretical Neuroscience, with John P. Cunningham. During his PhD, he contributed to the field of computational neuroscience by designing machine learning methods for identifying and validating structures in complex neural data. Prior to that, he completed his B.S. at The American University in Cairo with a major in Electronics Engineering and a minor in Computer Science, and earned M.S. degrees in electrical engineering from KAUST and Washington University in St. Louis. Before his graduate studies, he was also a professional athlete and Olympic fencer. He competed at the 2008 Olympic Games in Beijing with the Egyptian saber team.
Authored Publications
    The scaling of Transformers has driven breakthrough capabilities for language models. At present, the largest large language models (LLMs) contain upwards of 100B parameters. Vision Transformers (ViT) have introduced the same architecture to image and video modeling, but these have not yet been successfully scaled to nearly the same degree; the largest dense ViT contains 4B parameters. We present a recipe for highly efficient training of a 22B-parameter ViT and perform a wide variety of experiments on the resulting model. When evaluated on downstream tasks (often with a lightweight linear model on frozen features), ViT22B demonstrates increasing performance with scale. We further observe other interesting benefits of scale, including an improved tradeoff between bias and performance, an improved alignment to human visual perception in terms of shape/texture bias, and improved robustness. ViT22B demonstrates the potential for "LLM-like" scaling in vision, and provides key steps towards getting there.
    Automatically discovering composable abstractions from raw perceptual data is a long-standing challenge in machine learning. Recent slot-based neural networks that learn about objects in a self-supervised manner have made exciting progress in this direction. However, they typically fall short at adequately capturing spatial symmetries present in the visual world, which leads to sample inefficiency, such as when entangling object appearance and pose. In this paper, we present a simple yet highly effective method for incorporating spatial symmetries via slot-centric reference frames. We incorporate equivariance to per-object pose transformations into the attention and generation mechanism of Slot Attention by translating, scaling, and rotating position encodings. These changes result in little computational overhead, are easy to implement, and can result in large gains in terms of data efficiency and overall improvements to object discovery. We evaluate our method on a wide range of synthetic object discovery benchmarks, namely Tetrominoes, CLEVRTex, Objects Room, and MultiShapeNet, and show promising improvements on the challenging real-world Waymo Open dataset.
    Multitask Learning Via Interleaving: A Neural Network Investigation
    David Mayo
    Tyler Scott
    Mengye Ren
    Katherine Hermann
    Matt Jones
    Michael Mozer
    44th Annual Meeting of the Cognitive Science Society (2023)
    The most common settings in machine learning to study multi-task learning assume either i.i.d. task draws on each training trial or training on each task to mastery before moving on to the next. We instead study a setting in which tasks are interleaved, i.e., training proceeds on task $\mathcal{A}$ for some period of time and then switches to another task $\mathcal{B}$ before $\mathcal{A}$ is mastered. We examine properties of standard neural net learning algorithms and architectures in this setting. With inspiration from psychological phenomena pertaining to the influence of task sequence on human learning, we observe qualitatively similar phenomena in networks, including: forgetting with relearning savings, task switching costs, and better memory consolidation with interleaved training. By improving our understanding of such properties, one can design learning procedures that are suitable given the temporal structure of the environment. We illustrate with a momentum optimizer that resets following a task switch and leads to reliably better online cumulative learning accuracy.
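    The resetting momentum optimizer mentioned at the end of this abstract could look roughly like the sketch below. This is only an illustration under stated assumptions, not the authors' implementation: the class name `ResettingMomentumSGD`, the `task_switched` flag, and the choice to zero the velocity buffer are all hypothetical, and the sketch assumes the task switch is signaled to the optimizer from outside.

```python
class ResettingMomentumSGD:
    """SGD with heavy-ball momentum whose velocity is cleared on a task switch.

    Hypothetical sketch: names and the reset rule are assumptions,
    not taken from the paper.
    """

    def __init__(self, lr=0.1, momentum=0.9):
        self.lr = lr
        self.momentum = momentum
        self.velocity = None  # created lazily on the first step

    def step(self, params, grads, task_switched=False):
        # Assumption: the caller knows when the task changes and tells us.
        if task_switched or self.velocity is None:
            self.velocity = [0.0] * len(params)
        # Standard momentum accumulation: v <- momentum * v + g
        self.velocity = [
            self.momentum * v + g for v, g in zip(self.velocity, grads)
        ]
        # Parameter update: p <- p - lr * v
        return [p - self.lr * v for p, v in zip(params, self.velocity)]
```

    With `task_switched=True`, the first update on the new task uses only the new task's gradient, so momentum accumulated on the previous task does not carry over.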
    Learning in Temporally Structured Environments
    Matt Jones
    Tyler R. Scott
    Mengye Ren
    Katherine Hermann
    David Mayo
    Michael Mozer
    International Conference on Learning Representations (2023)
    Natural environments have temporal structure at multiple timescales. This property is reflected in biological learning and memory but typically not in machine learning systems. We advance a multiscale learning method in which each weight in a neural network is decomposed as a sum of subweights with different learning and decay rates. Thus knowledge becomes distributed across different timescales, enabling rapid adaptation to task changes while avoiding catastrophic interference. First, we prove that previous models that learn at multiple timescales, but with complex coupling between timescales, are equivalent to multiscale learning via a reparameterization that eliminates this coupling. The same analysis yields a new characterization of momentum learning, as a fast weight with a negative learning rate. Second, we derive a model of Bayesian inference over 1/f noise, a common temporal pattern in many online learning domains that involves long-range (power law) autocorrelations. The generative side of the model expresses 1/f noise as a sum of diffusion processes at different timescales, and the inferential side tracks these latent processes using a Kalman filter. We then derive a variational approximation to the Bayesian model and show how it is an extension of the multiscale learner. The result is an optimizer that can be used as a drop-in replacement in an arbitrary neural network architecture. Third, we evaluate the ability of these methods to handle nonstationarity by testing them in online prediction tasks characterized by 1/f noise in the latent parameters. We find that the Bayesian model significantly outperforms online stochastic gradient descent and two batch heuristics that rely preferentially or exclusively on more recent data. Moreover, the variational approximation performs nearly as well as the full Bayesian model, and with memory requirements that are linear in the size of the network.
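    The weight decomposition described in this abstract, where each weight is a sum of subweights with different learning and decay rates, can be illustrated with a minimal sketch. The class name `MultiscaleWeight`, the specific update rule (multiplicative decay followed by a gradient step), and the example rates are assumptions made for illustration; the paper's exact formulation may differ.

```python
from dataclasses import dataclass, field


@dataclass
class MultiscaleWeight:
    """One scalar weight stored as a sum of subweights, one per timescale.

    Hypothetical sketch: the update rule below is an assumed simple form,
    not the paper's derivation.
    """

    learning_rates: list
    decay_rates: list
    subweights: list = field(default_factory=list)

    def __post_init__(self):
        if len(self.learning_rates) != len(self.decay_rates):
            raise ValueError("need one decay rate per learning rate")
        if not self.subweights:
            self.subweights = [0.0] * len(self.learning_rates)

    @property
    def value(self):
        # The effective weight used by the network is the sum of subweights.
        return sum(self.subweights)

    def update(self, grad):
        # Since w = sum_k u_k, every subweight receives the same gradient
        # dL/dw, but each decays and learns at its own rate.
        self.subweights = [
            (1.0 - d) * u - lr * grad
            for u, lr, d in zip(
                self.subweights, self.learning_rates, self.decay_rates
            )
        ]
```

    In this sketch, a subweight with a high learning rate and high decay rate adapts quickly and forgets quickly, while one with a low learning rate and no decay accumulates slowly and persists across task changes.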
    Conditional Object-Centric Learning from Video
    Thomas Kipf
    Austin Stone
    Rico Jonschkowski
    Alexey Dosovitskiy
    Klaus Greff
    ICLR (2022)
    Object-centric representations are a promising path toward more systematic generalization by providing flexible abstractions upon which compositional world models can be built. Recent work on simple 2D and 3D datasets has shown that models with object-centric inductive biases can learn to segment and represent meaningful objects from the statistical structure of the data alone without the need for any supervision. However, such fully-unsupervised methods still fail to scale to diverse realistic data, despite the use of increasingly complex inductive biases such as priors for the size of objects or the 3D geometry of the scene. In this paper, we instead take a weakly-supervised approach and focus on how 1) using the temporal dynamics of video data in the form of optical flow and 2) conditioning the model on simple object location cues can be used to enable segmenting and tracking objects in significantly more realistic synthetic data. We introduce a sequential extension to Slot Attention which we train to predict optical flow for realistic-looking synthetic scenes and show that conditioning the initial state of this model on a small set of hints, such as center of mass of objects in the first frame, is sufficient to significantly improve instance segmentation. These benefits generalize beyond the training distribution to novel objects, novel backgrounds, and to longer video sequences. We also find that such initial-state-conditioning can be used during inference as a flexible interface to query the model for specific objects or parts of objects, which could pave the way for a range of weakly-supervised approaches and allow more effective interaction with trained models.
    SAVi++: Towards End-to-End Object-Centric Learning from Real-World Videos
    Klaus Greff
    Michael Mozer
    Thomas Kipf
    Advances in Neural Information Processing Systems (2022), pp. 28940-28954
    The visual world can be parsimoniously characterized in terms of distinct entities with sparse interactions. Discovering this compositional structure in dynamic visual scenes has proven challenging for end-to-end computer vision approaches unless explicit instance-level supervision is provided. Slot-based models leveraging motion cues have recently shown great promise in learning to represent, segment, and track objects without direct supervision, but they still fail to scale to complex real-world multi-object videos. In an effort to bridge this gap, we take inspiration from human development and hypothesize that information about scene geometry in the form of depth signals can facilitate object-centric learning. We introduce SAVi++, an object-centric video model which is trained to predict depth signals from a slot-based video representation. By further leveraging best practices for model scaling, we are able to train SAVi++ to segment complex dynamic scenes recorded with moving cameras, containing both static and moving objects of diverse appearance on naturalistic backgrounds, without the need for segmentation supervision. Finally, we demonstrate that by using sparse depth signals obtained from LiDAR, SAVi++ is able to learn emergent object segmentation and tracking from videos in the real-world Waymo Open dataset. Project page: https://slot-attention-video.github.io/savi++/
    Class imbalance is a common problem in medical diagnosis, causing a standard classifier to be biased towards majority classes and ignore the importance of the rest. This is especially true for dermatology, a specialty with thousands of skin conditions, many of which rarely occur in the wild. Buoyed by recent advances, we explore meta-learning based few-shot learning approaches for the skin condition recognition problem and propose an evaluation setup to fairly assess the real-world impact of such approaches. When compared to conventional class imbalance techniques, we find that the state-of-the-art few-shot learning methods are not as performant, but combining the two approaches using a novel ensemble leads to improvement in all-way classification, especially on the rare classes. We conclude that the ensemble can be useful to address the class imbalance problem, yet progress here can further be accelerated by the use of real-world evaluation setups for benchmarking new methods.
    Saccader: Accurate, interpretable image classification with hard attention
    Simon Kornblith
    2019 Conference on Neural Information Processing Systems (NeurIPS) (2019)
    Deep convolutional networks have achieved high accuracy on image classification tasks. Due to the complexity of these models, they are considered black boxes, as the decisions made by these models are hard to interpret. This lack of interpretability has hindered the wide use of these models in critical applications. One class of models that offers interpretations by design are those that use hard attention mechanisms. The training of these models without attention supervision is often challenging, resulting in low accuracy and poor attention locations. The difficulty stems from the fact that it is hard to quantify which places in an image are salient. Thus, these models are often trained with RL losses such as REINFORCE. In large-scale images such as ImageNet, the action space is high dimensional and the reward is sparse, which causes the optimization to fail. Here we propose a novel model (Saccader) with a hard attention mechanism that makes discrete attention actions. We also propose a self-supervised pretraining procedure that initializes the model to a state with more frequent rewards. We show that our model achieves high accuracy on ImageNet while providing interpretable decisions.
    Adversarial Reprogramming of Neural Networks
    Jascha Sohl-Dickstein
    Ian Goodfellow
    ICLR (2019)
    Deep neural networks are susceptible to adversarial attacks. In computer vision, well-crafted perturbations to images can cause neural networks to make mistakes such as confusing a cat with a computer. Previous adversarial attacks have been designed to degrade performance of models or cause machine learning models to produce specific outputs chosen ahead of time by the attacker. We introduce attacks that instead reprogram the target model to perform a task chosen by the attacker, without the attacker needing to specify or compute the desired output for each test-time input. This attack finds a single adversarial perturbation that can be added to all test-time inputs to a machine learning model in order to cause the model to perform a task chosen by the adversary, even if the model was not trained to do this task. These perturbations can thus be considered a program for the new task. We demonstrate adversarial reprogramming on six ImageNet classification models, repurposing these models to perform a counting task, as well as classification tasks: classification of MNIST and CIFAR-10 examples presented as inputs to the ImageNet model.
    We present a formulation of deep learning that aims at producing a large margin classifier. The notion of margin has served as the foundation of several theoretically profound and empirically successful results for both classification and regression tasks. However, most large margin algorithms are applicable only to shallow models with preset feature representation, and existing margin methods for neural networks only enforce margin at the output layer, or are formulated with weak approximations to the true margin. This keeps margin methods inaccessible to models like deep networks. In this paper, we propose a novel loss function to impose a margin on any set of layers of a deep network and show promising empirical results that consistently outperform cross-entropy based models across different application scenarios such as adversarial examples and generalization from small training sets. Our formulation allows choosing any norm for the margin. The resulting loss is general and complementary to existing regularization techniques such as weight decay, dropout, and batch norm. It is applicable to any classification task where cross-entropy is used.
    Machine learning models are vulnerable to adversarial examples: small changes to images can cause computer vision models to make mistakes such as identifying a school bus as an ostrich. However, it is still an open question whether humans are prone to similar mistakes. Here, we address this question by leveraging recent techniques that transfer adversarial examples from computer vision models with known parameters and architecture to other models with unknown parameters and architecture, and by matching the initial processing of the human visual system. We find that adversarial examples that strongly transfer across computer vision models influence the classifications made by time-limited human observers.