Alexey Gritsenko

Alexey is a Google Brain resident based in Amsterdam. He joined the residency programme after spending several years in industry working on computational advertising, and after obtaining a doctorate in Bioinformatics from Delft University of Technology, where he used computational methods to study sequence determinants of protein synthesis. During that time Alexey developed a broad interest in machine learning, but was primarily exposed to its applications. He is excited about switching to the research side of machine learning and views the residency as an opportunity to work on topics he would not otherwise encounter. Alexey’s current research interests lie at the intersection of ML Fairness and generative models, but he’s always excited to talk about biology and genetics!

Authored Publications
    Scaling Vision Transformers to 22 Billion Parameters
    Josip Djolonga
    Basil Mustafa
    Piotr Padlewski
    Justin Gilmer
    Mathilde Caron
    Rodolphe Jenatton
    Lucas Beyer
    Michael Tschannen
    Anurag Arnab
    Carlos Riquelme
    Gamaleldin Elsayed
    Fisher Yu
    Avital Oliver
    Fantine Huot
    Mark Collier
    Vighnesh Birodkar
    Yi Tay
    Alexander Kolesnikov
    Filip Pavetić
    Thomas Kipf
    Xiaohua Zhai
    Neil Houlsby
    arXiv (2023)
    Abstract: The scaling of Transformers has driven breakthrough capabilities for language models. At present, the largest large language models (LLMs) contain upwards of 100B parameters. Vision Transformers (ViT) have introduced the same architecture to image and video modeling, but these have not yet been successfully scaled to nearly the same degree; the largest dense ViT contains 4B parameters. We present a recipe for highly efficient training of a 22B-parameter ViT and perform a wide variety of experiments on the resulting model. When evaluated on downstream tasks (often with a lightweight linear model on frozen features), ViT-22B demonstrates increasing performance with scale. We further observe other interesting benefits of scale, including an improved tradeoff between bias and performance, an improved alignment to human visual perception in terms of shape/texture bias, and improved robustness. ViT-22B demonstrates the potential for "LLM-like" scaling in vision, and provides key steps towards getting there.
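
    Many of the downstream evaluations mentioned in the abstract fit only a lightweight linear model on frozen backbone features. Below is a minimal sketch of such a linear probe using a closed-form ridge-regression fit; the feature dimension, class count and regularisation strength in the usage comment are illustrative placeholders, not values from the paper.

```python
import jax.numpy as jnp

def fit_linear_probe(features, labels, num_classes, l2=1e-4):
    """Closed-form ridge-regression probe on frozen features.

    features: [N, D] embeddings extracted once from the frozen backbone.
    labels:   [N] integer class labels.
    Returns a [D, C] weight matrix.
    """
    one_hot = jnp.eye(num_classes)[labels]                          # [N, C]
    gram = features.T @ features + l2 * jnp.eye(features.shape[1])  # [D, D]
    return jnp.linalg.solve(gram, features.T @ one_hot)

def probe_accuracy(features, labels, weights):
    """Top-1 accuracy of the linear probe on a held-out split."""
    preds = jnp.argmax(features @ weights, axis=-1)
    return jnp.mean(preds == labels)

# Hypothetical usage with pre-extracted embeddings:
#   w = fit_linear_probe(train_feats, train_labels, num_classes=1000)
#   acc = probe_accuracy(test_feats, test_labels, w)
```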
    Simple Open-Vocabulary Object Detection with Vision Transformers
    Austin Stone
    Maxim Neumann
    Dirk Weissenborn
    Alexey Dosovitskiy
    Anurag Arnab
    Zhuoran Shen
    Xiaohua Zhai
    Thomas Kipf
    Neil Houlsby
    ECCV (Poster) (2022)
    Abstract: Combining simple architectures with large-scale pre-training has led to massive improvements in image classification. For object detection, pre-training and scaling approaches are less well established, especially in the long-tailed and open-vocabulary setting, where training data is relatively scarce. In this paper, we propose a strong recipe for transferring image-text models to open-vocabulary object detection. We use a standard Vision Transformer architecture with minimal modifications, contrastive image-text pre-training, and end-to-end detection fine-tuning. Our analysis of the scaling properties of this setup shows that increasing image-level pre-training and model size yield consistent improvements on the downstream detection task. We provide the adaptation strategies and regularizations needed to attain very strong performance on zero-shot text-conditioned and one-shot image-conditioned object detection. Code and models are available on GitHub (https://github.com/google-research/scenic/tree/main/scenic/projects/owl_vit).
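
    A core step in text-conditioned open-vocabulary detection of this kind is matching per-box image embeddings against text embeddings of free-form class queries. The sketch below shows that matching step only; the embedding shapes, sigmoid scoring and threshold are simplified assumptions, not the actual OWL-ViT interface.

```python
import jax
import jax.numpy as jnp

def score_boxes(box_embeds, query_embeds, logit_scale=10.0):
    """Cosine-similarity logits between per-box image embeddings and the
    text embeddings of the class-name queries.

    box_embeds:   [num_boxes, D]
    query_embeds: [num_queries, D]
    Returns [num_boxes, num_queries] logits.
    """
    box_embeds = box_embeds / jnp.linalg.norm(box_embeds, axis=-1, keepdims=True)
    query_embeds = query_embeds / jnp.linalg.norm(query_embeds, axis=-1, keepdims=True)
    return logit_scale * box_embeds @ query_embeds.T

def detect(boxes, box_embeds, query_embeds, threshold=0.5):
    """Keep boxes whose best-matching query scores above the threshold."""
    probs = jax.nn.sigmoid(score_boxes(box_embeds, query_embeds))
    best_query = jnp.argmax(probs, axis=-1)
    best_prob = jnp.max(probs, axis=-1)
    keep = best_prob > threshold
    return boxes[keep], best_query[keep], best_prob[keep]
```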
    Abstract: In this paper we analyse and improve integer discrete flows for lossless compression. Integer discrete flows are a recently proposed class of models that learn invertible transformations for integer-valued random variables. Their discrete nature makes them particularly suitable for lossless compression with entropy coding schemes. We start by investigating a recent theoretical claim that states that invertible flows for discrete random variables are less flexible than their continuous counterparts. We demonstrate with a proof that this claim does not hold for integer discrete flows due to the embedding of data with finite support into the countably infinite integer lattice. Furthermore, we zoom in on the effect of gradient bias due to the straight-through estimator in integer discrete flows, and demonstrate that its influence is highly dependent on architecture choices and less prominent than previously thought. Finally, we show how different modifications to the architecture improve the performance of this model class for lossless compression.
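
    The gradient bias discussed in this abstract comes from the straight-through estimator used to backpropagate through the rounding operation inside integer discrete flows. A minimal sketch of that estimator and of a single additive integer coupling shift, assuming a JAX-style stop-gradient (this is not the full flow architecture):

```python
import jax
import jax.numpy as jnp

def ste_round(x):
    """Round to the nearest integer in the forward pass while letting
    gradients pass through unchanged (straight-through estimator)."""
    return x + jax.lax.stop_gradient(jnp.round(x) - x)

def additive_integer_shift(x2, translation):
    """Additive coupling step: shift one half of the variables by a rounded,
    data-dependent translation. Invertible, because the same rounded shift
    can be subtracted back out."""
    return x2 + ste_round(translation)

# The forward value is rounded, but the gradient behaves like the identity:
grad_fn = jax.grad(lambda t: additive_integer_shift(jnp.array(3.0), t))
# grad_fn(jnp.array(0.4)) -> 1.0, even though round(0.4) == 0.0
```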
    Abstract: Speech synthesis is an important practical generative modeling problem that has seen great progress over the last few years, with likelihood-based autoregressive neural models now outperforming traditional concatenative systems. A downside of such autoregressive models is that they require executing tens of thousands of sequential operations per second of generated audio, making them ill-suited for deployment on specialized deep learning hardware. Here, we propose a new learning method that allows us to train highly parallel models of speech, without requiring access to an analytical likelihood function. Our approach is based on a generalized energy distance between the distributions of the generated and real audio. This spectral energy distance is a proper scoring rule with respect to the distribution over magnitude-spectrograms of the generated waveform audio and offers statistical consistency guarantees. The distance can be calculated from minibatches without bias, and does not involve adversarial learning, yielding a stable and consistent method for training implicit generative models. Empirically, we achieve state-of-the-art generation quality among implicit generative models, as judged by the recently proposed cFDSD metric. When combining our method with adversarial techniques, we also improve upon the recently proposed GAN-TTS model in terms of Mean Opinion Score as judged by trained human evaluators.
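
    The training objective described in this abstract is a generalized energy distance computed from minibatches, with distances measured between magnitude spectrograms of real and generated audio. Below is a simplified minibatch estimator of that quantity; the single-scale spectrogram and plain Euclidean distance are placeholder choices, not the paper's exact multi-scale loss.

```python
import jax.numpy as jnp

def log_mag_spectrogram(wav, frame_length=512, frame_step=128):
    """Simplified log-magnitude spectrogram via a framed FFT (placeholder
    for the paper's multi-scale spectrogram features)."""
    num_frames = 1 + (wav.shape[-1] - frame_length) // frame_step
    idx = jnp.arange(frame_length)[None, :] + frame_step * jnp.arange(num_frames)[:, None]
    frames = wav[..., idx]                       # [..., num_frames, frame_length]
    return jnp.log(jnp.abs(jnp.fft.rfft(frames, axis=-1)) + 1e-6)

def pairwise_dist(a, b):
    """Euclidean distances between flattened spectrograms of two batches."""
    a = a.reshape(a.shape[0], -1)
    b = b.reshape(b.shape[0], -1)
    return jnp.sqrt(jnp.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1) + 1e-12)

def spectral_energy_distance(real_wavs, fake_wavs):
    """Minibatch estimate of 2*E d(x, y) - E d(x, x') - E d(y, y') on
    spectrogram features; i == j pairs are excluded from the within-batch
    terms to keep the estimate unbiased."""
    x = log_mag_spectrogram(real_wavs)
    y = log_mag_spectrogram(fake_wavs)
    n = x.shape[0]
    off_diag = 1.0 - jnp.eye(n)
    d_xy = pairwise_dist(x, y).mean()
    d_xx = (pairwise_dist(x, x) * off_diag).sum() / (n * (n - 1))
    d_yy = (pairwise_dist(y, y) * off_diag).sum() / (n * (n - 1))
    return 2.0 * d_xy - d_xx - d_yy
```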
    BriarPatches: Pixel-Space Interventions for Inducing Demographic Parity
    Yoni Halpern
    Neural Information Processing Systems: Workshop on Ethical, Social and Governance Issues in AI (2018)
    Abstract: We introduce the BriarPatch, a pixel-space intervention that obscures sensitive attributes from representations encoded in pre-trained classifiers. The patches encourage internal model representations not to encode sensitive information, which has the effect of pushing downstream predictors towards exhibiting demographic parity with respect to the sensitive information. The net result is that these BriarPatches provide an intervention mechanism available at the user level, and complement prior research on fair representations, which could previously only be applied by model developers and ML experts.
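
    Conceptually, the intervention adds a small learned perturbation to the image in pixel space before it reaches a pre-trained classifier, optimized so that the sensitive attribute can no longer be read off the model's predictions. The sketch below illustrates that idea with a hypothetical frozen attribute predictor and a deliberately simplified objective; it is not the paper's actual training procedure.

```python
import jax
import jax.numpy as jnp

def apply_patch(images, patch):
    """Add the pixel-space patch and keep pixel values in a valid range."""
    return jnp.clip(images + patch, 0.0, 1.0)

def attribute_logit(params, images):
    """Hypothetical frozen linear predictor of the sensitive attribute,
    standing in for the pre-trained classifier's representation."""
    feats = images.reshape(images.shape[0], -1)
    return feats @ params["w"] + params["b"]

def patch_loss(patch, params, images):
    """Push the attribute prediction on patched images towards 0.5, i.e.
    towards carrying no information about the sensitive attribute."""
    probs = jax.nn.sigmoid(attribute_logit(params, apply_patch(images, patch)))
    return jnp.mean((probs - 0.5) ** 2)

def update_patch(patch, params, images, lr=0.1):
    """One gradient step on the patch only; the classifier stays frozen."""
    grads = jax.grad(patch_loss)(patch, params, images)
    return patch - lr * grads
```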