Ilya Tolstikhin

Between 2014 and 2018 I was a postdoc (and later a team lead) at the Empirical Inference Department of the Max Planck Institute for Intelligent Systems in Tübingen, Germany. I received a diploma (MSc equivalent) in 2010 from Lomonosov Moscow State University and a PhD in 2014 from the Dorodnicyn Computing Center of the Russian Academy of Sciences.
Currently I am actively interested in understanding neural network training and generalization. Previously I worked on statistical learning theory and, more generally, on the theory of machine learning.

Authored Publications
    Fine-Grained Distribution-Dependent Learning Curves
    Jonathan Shafer
    Shay Moran
    Steve Hanneke
    Proceedings of the Thirty-Sixth Conference on Learning Theory (COLT), PMLR 195:5890-5924 (2023)
    Abstract: Learning curves plot the expected error of a learning algorithm as a function of the number of labeled input samples. They are widely used by machine learning practitioners as a measure of an algorithm's performance, but classic PAC learning theory cannot explain their behaviour. In this paper we introduce a new combinatorial characterization called the VCL dimension that improves and refines the recent results of Bousquet et al. (2021). Our characterization sheds new light on the structure of learning curves by providing fine-grained bounds, and showing that for classes with finite VCL, the rate of decay can be decomposed into a linear component that depends only on the hypothesis class and an exponential component that depends also on the target distribution. In particular, the finer nuance of the VCL dimension implies lower bounds that are quantitatively stronger than the bounds of Bousquet et al. (2021) and qualitatively stronger than classic 'no free lunch' lower bounds. The VCL characterization solves an open problem studied by Antos and Lugosi (1998), who asked in what cases such lower bounds exist. As a corollary, we recover their lower bound for half-spaces in ℝ^d, and we do so in a principled way that should be applicable to other cases as well. Finally, to provide another viewpoint on our work and how it compares to traditional PAC learning bounds, we also present an alternative formulation of our results in a language that is closer to the PAC setting.
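As a rough illustration of the decomposition described in this abstract (schematic only; the symbols d(H) for the class-dependent complexity and a(P), b(P) for distribution-dependent constants are my own placeholders, not the paper's notation):

```latex
% Schematic shape of a fine-grained learning curve after n labeled samples:
% a linear term governed by the hypothesis class plus an exponential term
% that also depends on the target distribution P (illustrative only).
\mathbb{E}\left[\operatorname{er}(h_n)\right] \;\approx\; \frac{d(\mathcal{H})}{n} \;+\; a(P)\, e^{-b(P)\, n}
```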
    MLP-Mixer: An All-MLP Architecture for Vision
    Advances in Neural Information Processing Systems (NeurIPS) (2021)
    Abstract: Convolutional Neural Networks (CNNs) are the go-to model for computer vision. Recently, attention-based networks, such as the Vision Transformer, have also become popular. In this paper we show that while convolutions and attention are both sufficient for good performance, neither of them are necessary. We present MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs). MLP-Mixer contains two types of layers: one with MLPs applied independently to image patches (i.e. "mixing" the per-location features), and one with MLPs applied across patches (i.e. "mixing" spatial information). When trained on large datasets, or with modern regularization schemes, MLP-Mixer attains competitive scores on image classification benchmarks with comparable pre-training and inference cost. We hope that these results spark further research beyond the realms of well established CNNs and Transformers.
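The two kinds of mixing layers described in the abstract are simple to write down. Below is a minimal NumPy sketch of a single Mixer layer, my own illustration under assumed toy shapes; the actual model adds a patch-embedding stem, many stacked layers, learned normalization parameters, and a classification head, and is not implemented this way.

```python
import numpy as np

def mlp(x, w1, b1, w2, b2):
    """Two-layer perceptron with a tanh-based GELU approximation, acting on the last axis."""
    h = x @ w1 + b1
    h = 0.5 * h * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (h + 0.044715 * h**3)))
    return h @ w2 + b2

def layer_norm(z, eps=1e-6):
    """Normalize each row to zero mean and unit variance (no learned scale/shift here)."""
    return (z - z.mean(axis=-1, keepdims=True)) / (z.std(axis=-1, keepdims=True) + eps)

def mixer_layer(x, token_mlp, channel_mlp):
    """One Mixer layer on a (num_patches, channels) table of per-patch features.

    Token mixing applies an MLP across patches (transpose, mix, transpose back);
    channel mixing applies an MLP across the channels of each patch independently.
    Both use pre-norm residual connections.
    """
    y = mlp(layer_norm(x).T, *token_mlp).T   # mix spatial information across patches
    x = x + y
    y = mlp(layer_norm(x), *channel_mlp)     # mix per-location features across channels
    return x + y

# Toy shapes: 16 patches, 8 channels, hidden widths 32 (token MLP) and 16 (channel MLP).
rng = np.random.default_rng(0)
P, C, Dt, Dc = 16, 8, 32, 16
token_mlp = (0.02 * rng.normal(size=(P, Dt)), np.zeros(Dt),
             0.02 * rng.normal(size=(Dt, P)), np.zeros(P))
channel_mlp = (0.02 * rng.normal(size=(C, Dc)), np.zeros(Dc),
               0.02 * rng.normal(size=(Dc, C)), np.zeros(C))
print(mixer_layer(rng.normal(size=(P, C)), token_mlp, channel_mlp).shape)  # (16, 8)
```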
    What Do Neural Networks Learn When Trained With Random Labels?
    Advances in Neural Information Processing Systems (NeurIPS) (2020)
    Abstract: We study deep neural networks (DNNs) trained on natural image data with entirely random labels. Despite its popularity in the literature, where it is often used to study memorization, generalization, and other phenomena, little is known about what DNNs learn in this setting. In this paper, we show analytically for convolutional and fully connected networks that an alignment between the principal components of network parameters and data takes place when training with random labels. We study this alignment effect by investigating neural networks pre-trained on randomly labelled image data and subsequently fine-tuned on disjoint datasets with random or real labels. We show how this alignment produces a positive transfer: networks pre-trained with random labels train faster downstream compared to training from scratch even after accounting for simple effects, such as weight scaling. We analyze how competing effects, such as specialization at later layers, may hide the positive transfer. These effects are studied in several network architectures, including VGG16 and ResNet18, on CIFAR10 and ImageNet.
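As a toy illustration of the kind of parameter–data alignment discussed above (a simplified measure of my own, not the paper's analysis): compare the leading principal directions of the input data with those of a first-layer weight matrix.

```python
import numpy as np

def principal_directions(m, k):
    """Top-k eigenvectors (columns) of the uncentered second-moment matrix of the rows of m."""
    cov = m.T @ m / m.shape[0]
    vals, vecs = np.linalg.eigh(cov)                 # eigenvalues in ascending order
    return vecs[:, np.argsort(vals)[::-1][:k]]

def alignment(weights, data, k=5):
    """Fraction (in [0, 1]) of the weights' top-k subspace captured by the data's top-k subspace."""
    u = principal_directions(weights, k)             # directions of first-layer weight rows
    v = principal_directions(data, k)                # directions of the input data
    return np.linalg.norm(v.T @ u) ** 2 / k          # sum of squared cosines of principal angles, over k

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 64)) @ np.diag(np.linspace(3.0, 0.1, 64))          # anisotropic toy "images"
w_random = rng.normal(size=(128, 64))                                          # untrained first layer
w_aligned = x[rng.choice(1000, size=128)] + 0.1 * rng.normal(size=(128, 64))   # weights shaped by the data
print(alignment(w_random, x), alignment(w_aligned, x))   # the second value should be noticeably larger
```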
    Practical and Consistent Estimation of f-Divergences
    Paul Rubenstein
    Josip Djolonga
    Carlos Riquelme
    Advances in Neural Information Processing Systems (NeurIPS) 2019 (to appear)
    Abstract: The estimation of an f-divergence between two probability distributions based on samples is a fundamental problem in statistics and machine learning. Most works study this problem under very weak assumptions, in which case it is provably hard. We consider the case of stronger structural assumptions that are commonly satisfied in modern machine learning, including representation learning and generative modelling with autoencoder architectures. Under these assumptions we propose and study an estimator that can be easily implemented, works well in high dimensions, and enjoys faster rates of convergence. We verify the behavior of our estimator empirically in both synthetic and real-data experiments.
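For context, the standard definition of an f-divergence (a textbook fact, not specific to this paper): for a convex function f with f(1) = 0 and distributions P, Q with P absolutely continuous with respect to Q,

```latex
D_f(P \,\|\, Q) \;=\; \int f\!\left(\frac{\mathrm{d}P}{\mathrm{d}Q}\right) \mathrm{d}Q,
\qquad \text{e.g. } f(t) = t \log t \ \text{ recovers } \ \mathrm{KL}(P \,\|\, Q).
```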
    When can unlabeled data improve the learning rate?
    Christina Göpfert
    Shai Ben-David
    Sylvain Gelly
    Ruth Urner
    COLT 2019
    Abstract: In semi-supervised classification, one is given access both to labeled and unlabeled data. As unlabeled data is typically cheaper to acquire than labeled data, this setup becomes advantageous as soon as one can exploit the unlabeled data in order to produce a better classifier than with labeled data alone. However, the conditions under which such an improvement is possible are not fully understood yet. Our analysis focuses on improvements in the minimax learning rate in terms of the number of labeled examples (with the number of unlabeled examples being allowed to depend on the number of labeled ones). We argue that for such improvements to be realistic and indisputable, certain specific conditions should be satisfied and previous analyses have failed to meet those conditions. We then demonstrate simple toy examples where these conditions can be met, in particular showing rate changes from $1/\sqrt{\ell}$ to $e^{-c\ell}$ and from $1/\sqrt{\ell}$ to $1/\ell$. These results allow us to better understand what is and isn't possible in semi-supervised learning.
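For readers less familiar with the terminology: the minimax learning rate discussed here is, schematically and in my own notation, the best worst-case expected excess risk achievable by any learner from $\ell$ labeled examples (plus possibly unlabeled ones),

```latex
\mathcal{R}(\ell) \;=\; \inf_{\hat{h}} \; \sup_{P \in \mathcal{P}} \;
\mathbb{E}_{S \sim P^{\ell}}\!\left[ \operatorname{er}_P\!\big(\hat{h}(S)\big)
\;-\; \inf_{h \in \mathcal{H}} \operatorname{er}_P(h) \right].
```

Unlabeled data improves the learning rate when adding it changes how fast $\mathcal{R}(\ell)$ decays, e.g. from $1/\sqrt{\ell}$ to $1/\ell$.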