Michal Lukasik
I am a Senior Research Scientist at Google Research in New York, USA. My research interests broadly lie in the areas of machine learning and natural language processing, with a focus on loss function design, LLMs, and deep learning.
Authored Publications
Abstract
Large language models (LLMs) have shown strong results on a range of applications, including regression and scoring tasks. Typically, one obtains outputs from an LLM via autoregressive sampling from the model's output distribution. We show that this inference strategy can be sub-optimal for common regression and scoring evaluation metrics. As a remedy, we build on prior work on Minimum Bayes Risk decoding, and propose alternate inference strategies that estimate the Bayes-optimal solution for regression and scoring metrics in closed form from sampled responses. We show that our proposal significantly improves over baselines across datasets and models.
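As a rough illustration of the closed-form idea: under squared error the Bayes-optimal point prediction is the mean of sampled scores, and under absolute error it is the median. The sketch below is illustrative only and assumes a hypothetical sample_llm_scores helper for drawing numeric responses from the model; it is not the paper's implementation.

import numpy as np

def bayes_optimal_score(samples, metric="squared_error"):
    """Closed-form Bayes-optimal point estimate from sampled LLM scores.

    Under squared error the optimal estimate is the sample mean;
    under absolute error it is the sample median.
    """
    samples = np.asarray(samples, dtype=float)
    if metric == "squared_error":
        return float(samples.mean())
    if metric == "absolute_error":
        return float(np.median(samples))
    raise ValueError(f"unsupported metric: {metric}")

# Hypothetical usage: sample_llm_scores(prompt, k) would draw k numeric
# responses from the model's output distribution.
# scores = sample_llm_scores(prompt, k=16)
scores = [3.0, 4.0, 4.0, 5.0, 3.5]                   # placeholder samples
print(bayes_optimal_score(scores, "squared_error"))   # 3.9
print(bayes_optimal_score(scores, "absolute_error"))  # 4.0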
Teacher's pet: understanding and mitigating biases in distillation
Aditya Krishna Menon
Transactions on Machine Learning Research (2022)
Abstract
Knowledge distillation is widely used as a means of improving the performance of a relatively simple student model using the predictions from a complex teacher model. Several works have shown that distillation significantly boosts the student's overall performance; however, are these gains uniform across all data subgroups? In this paper, we show that distillation can harm performance on certain subgroups, e.g., classes with few associated samples, compared to the vanilla student trained using the one-hot labels. We trace this behaviour to errors made by the teacher distribution being transferred to and amplified by the student model. To mitigate this problem, we present techniques which soften the teacher's influence for subgroups where it is less reliable. Experiments on several image classification benchmarks show that these modifications of distillation maintain the boost in overall accuracy, while additionally ensuring improvement in subgroup performance.
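A minimal sketch of one way to soften the teacher per subgroup is given below; the per-group weights and how they are applied are illustrative assumptions, not the paper's exact technique.

import torch
import torch.nn.functional as F

def subgroup_aware_distillation_loss(student_logits, teacher_logits, labels,
                                     groups, group_weights, temperature=2.0):
    """Blend one-hot and distillation losses with a per-subgroup teacher weight.

    group_weights[g] in [0, 1] down-weights the teacher on subgroups where its
    predictions are less reliable (an illustrative proxy, not the paper's rule).
    """
    ce = F.cross_entropy(student_logits, labels, reduction="none")
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kd = F.kl_div(log_p_student, p_teacher, reduction="none").sum(dim=-1)
    w = group_weights[groups]  # per-example weight on the teacher signal
    return ((1.0 - w) * ce + w * (temperature ** 2) * kd).mean()

# Toy usage: 4 examples, 3 classes, 2 subgroups; the teacher is trusted less
# on subgroup 1 (e.g., a class with few associated samples).
student_logits = torch.randn(4, 3)
teacher_logits = torch.randn(4, 3)
labels = torch.tensor([0, 2, 1, 2])
groups = torch.tensor([0, 0, 1, 1])
group_weights = torch.tensor([0.9, 0.3])
print(subgroup_aware_distillation_loss(student_logits, teacher_logits,
                                       labels, groups, group_weights))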
Abstract
Document and discourse segmentation are two fundamental NLP tasks pertaining to breaking up text into constituents, and they are commonly used to help downstream tasks such as information retrieval or text summarization. In this work, we propose three transformer-based architectures and provide comprehensive comparisons with previously proposed approaches on three standard datasets. We establish a new state of the art, in particular reducing error rates by a large margin in all cases. We further analyze model sizes and find that we can build models with many fewer parameters while keeping good performance, thus facilitating real-world applications.
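One natural transformer-based formulation, sketched below under assumptions (it is not necessarily one of the three architectures in the paper), treats each candidate break point as binary classification over the surrounding sentence pair.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Segmentation as boundary classification: the model reads the text to the
# left and to the right of a candidate break point as a sentence pair and
# predicts whether a segment ends there (label 1) or not (label 0).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

left_context = "The treaty was signed in 1951. It created a common market."
right_context = "Football is the most popular sport in the region."

inputs = tokenizer(left_context, right_context,
                   truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.softmax(dim=-1))  # boundary probability (untrained head here)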
Scaling Graph Neural Networks with Approximate PageRank
Aleksandar Bojchevski
Johannes Klicpera
Amol Kapoor
Martin Blais
Benedek András Rózemberczki
Stephan Günnemann
KDD (2020)
Abstract
Graph neural networks (GNNs) have emerged as a powerful approach for solving many network mining tasks. However, despite their successes on small datasets, efficiently utilizing them on massive web-scale data remains a challenge. All recently proposed scalable GNN approaches rely on a message passing procedure to propagate information on the graph, leading to expensive recursive neighborhood expansion (and aggregation) schemes during both training and inference. This limitation is particularly problematic if we want to consider neighbors that are multiple hops away. In contrast, by leveraging connections between GNNs and personalized PageRank, we develop a model that incorporates multi-hop neighborhood information in a single (non-recursive) step. Our model, PPRGo, is significantly faster than previous scalable approaches while maintaining state-of-the-art prediction performance. Moreover, our algorithm can produce a scalability certificate which guarantees that the predictions would not change had we instead used a more expensive, non-scalable baseline. To demonstrate the strengths and the scalability of our approach, we both evaluate on existing datasets and propose a new large-scale graph learning setting using the Open Academic Graph (90M nodes, 3B edges). Additionally, we discuss practical applications of large-scale semi-supervised learning, such as using PPRGo at Google to solve node classification problems.
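The key primitive is a sparse approximation of each node's personalized PageRank vector, which can then be used to average neighbor predictions in a single pass. Below is a minimal push-style sketch of that approximation (standard approximate PPR, not the paper's exact implementation; adj_list, alpha, and eps are illustrative names and parameters).

def approximate_ppr(adj_list, source, alpha=0.15, eps=1e-4):
    """Push-style approximation of the personalized PageRank vector of source.

    alpha is the teleport probability; eps trades sparsity for accuracy
    (larger eps gives a sparser, cheaper approximation).
    """
    p, r = {}, {source: 1.0}
    frontier = [source]
    while frontier:
        u = frontier.pop()
        deg = max(len(adj_list[u]), 1)
        if r.get(u, 0.0) < eps * deg:
            continue
        residual = r.pop(u)
        p[u] = p.get(u, 0.0) + alpha * residual
        push = (1.0 - alpha) * residual / deg
        for v in adj_list[u]:
            r[v] = r.get(v, 0.0) + push
            if r[v] >= eps * max(len(adj_list[v]), 1):
                frontier.append(v)
    return p  # sparse dict: node -> approximate PPR score

# Toy triangle graph; in the non-recursive propagation step, a node's
# prediction would be a PPR-weighted average of neighbor representations.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
print(approximate_ppr(adj, source=0))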
Does label smoothing mitigate label noise?
Aditya Krishna Menon
International Conference on Machine Learning (2020) (to appear)
Abstract
Label smoothing is commonly used in training deep learning models, wherein one-hot training labels are mixed with uniform label vectors. Empirically, smoothing has been shown to improve both predictive performance and model calibration. In this paper, we study whether label smoothing is also effective as a means of coping with label noise. While label smoothing apparently amplifies this problem, being equivalent to injecting symmetric noise into the labels, we show how it relates to a general family of loss-correction techniques from the label noise literature. Building on this connection, we show that label smoothing can be competitive with loss-correction techniques under label noise. Further, we show that when performing distillation under label noise, label smoothing of the teacher can be beneficial; this is in contrast to recent findings for noise-free problems, and sheds further light on settings where label smoothing is beneficial.
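For reference, the smoothing the abstract refers to mixes the one-hot target with the uniform distribution, y_smooth = (1 - alpha) * one_hot(y) + alpha / K. The sketch below checks that a manual implementation matches PyTorch's built-in label_smoothing argument; it is a convenience illustration, not the paper's code.

import torch
import torch.nn.functional as F

def smoothed_cross_entropy(logits, labels, alpha=0.1):
    """Cross-entropy against one-hot labels mixed with the uniform distribution."""
    num_classes = logits.shape[-1]
    one_hot = F.one_hot(labels, num_classes).float()
    targets = (1.0 - alpha) * one_hot + alpha / num_classes
    return -(targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

logits = torch.randn(8, 5)
labels = torch.randint(0, 5, (8,))
manual = smoothed_cross_entropy(logits, labels, alpha=0.1)
builtin = F.cross_entropy(logits, labels, label_smoothing=0.1)
print(torch.allclose(manual, builtin))  # True: same target mixture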
Semantic label smoothing for sequence to sequence problems
Himanshu Jain
Aditya Krishna Menon
EMNLP (2020) (to appear)
Abstract
Label smoothing has been shown to be an effective regularization strategy in classification that prevents overfitting and helps with label de-noising. However, extending such methods directly to seq2seq settings, such as Machine Translation, has been hindered by the large target output space, making it intractable to apply label smoothing over all possible outputs. Most existing approaches for seq2seq settings either do token-level smoothing or smooth over sequences generated by randomly substituting tokens in the target sequence. Unlike these works, in this paper we propose a technique that smooths over well-formed relevant sequences that not only have sufficient n-gram overlap with the target sequence, but are also semantically similar. Our method shows a consistent and significant improvement over the state-of-the-art techniques on different datasets.
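A minimal sketch of the idea, with illustrative scoring only: spread part of the probability mass over related, well-formed sequences instead of over all tokens. The log_prob_fn callable and the candidate filtering are assumptions here; the paper's actual selection of semantically similar sequences is not reproduced.

from collections import Counter

def ngram_overlap(ref, hyp, n=2):
    """Fraction of hyp n-grams also present in ref (a crude relatedness score)."""
    ref_ngrams = Counter(zip(*[ref[i:] for i in range(n)]))
    hyp_ngrams = Counter(zip(*[hyp[i:] for i in range(n)]))
    if not hyp_ngrams:
        return 0.0
    overlap = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
    return overlap / sum(hyp_ngrams.values())

def semantic_label_smoothing_loss(log_prob_fn, target, related, alpha=0.1):
    """Spread alpha of the target mass over related, well-formed sequences.

    log_prob_fn(seq) is an assumed callable returning the model's sequence
    log-probability; related is a list of candidates already filtered for
    semantic similarity to the target (that filtering is not shown here).
    """
    weights = [ngram_overlap(target, seq) for seq in related]
    total = sum(weights) or 1.0
    weights = [w / total for w in weights]
    loss = -(1.0 - alpha) * log_prob_fn(target)
    loss -= alpha * sum(w * log_prob_fn(seq) for w, seq in zip(weights, related))
    return loss

# Toy usage with a stand-in scorer (a real seq2seq model would supply this).
fake_log_prob = lambda seq: -0.5 * len(seq)
target = "the cat sat on the mat".split()
related = ["a cat sat on the mat".split(), "the cat is on the mat".split()]
print(semantic_label_smoothing_loss(fake_log_prob, target, related))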
Content Explorer: Recommending Novel Entities for a Document Writer
Proceedings of Empirical Methods in Natural Language Processing (EMNLP), 2018.
Abstract
Background research is an inseparable part of document writing. Search engines are great for retrieving information once we know what to look for. However, the bigger challenge is often identifying topics for further research. Automated tools could help significantly in this discovery process and increase the productivity of the writer. In this paper, we formulate the problem of recommending topics to a writer as a supervised learning problem and run a user study to validate this approach. We propose an evaluation metric and perform an empirical comparison of state-of-the-art models for extreme multi-label classification on a large dataset. We demonstrate how a simple modification of the cross-entropy loss function leads to improved results for the deep learning models.
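As a rough illustration of the extreme multi-label framing only: a document embedding scores every entity in a large vocabulary, with the writer's linked entities as positive labels. This is a generic baseline head under assumed sizes and names; the document encoder and the paper's specific cross-entropy modification are not reproduced here.

import torch
import torch.nn as nn

# Recommend entities for a document by scoring every entity in a large
# vocabulary from a document embedding (multi-label classification setup).
num_entities = 10_000
doc_dim = 256

scorer = nn.Linear(doc_dim, num_entities)   # one logit per candidate entity
loss_fn = nn.BCEWithLogitsLoss()            # standard multi-label loss

doc_embeddings = torch.randn(4, doc_dim)    # a batch of 4 encoded documents
targets = torch.zeros(4, num_entities)
targets[0, [17, 4242]] = 1.0                # entities relevant to document 0

logits = scorer(doc_embeddings)
print(loss_fn(logits, targets))
print(logits[0].topk(5).indices)            # top-5 recommended entities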