Mateusz Malinowski
I work at DeepMind as a research scientist with a broad interest in building responsive machines that understand natural language, the surrounding environment, and human intentions, all of which are necessary for human-like communication.
Prior to my work at DeepMind, I worked at the Max Planck Institute for Informatics, where I pursued a PhD in Computer Vision. During my PhD studies, I pioneered the Visual Turing Test task (also known as Visual Question Answering), which has been widely followed up by the research community. In this task, I study the problem of answering questions about real-world images, and have proposed various architectures, such as an LSTM+CNN model termed 'Ask Your Neurons' and a logic-based approach that relies on a semantic parser. Besides the Visual Turing Test, I also studied a retrieval problem and learnable spatial representations.
Authored Publications
Visual QA is a pivotal challenge for higher-level reasoning, requiring an understanding of language, vision, and the relationships between many objects in a scene. Although datasets like CLEVR are designed to be unsolvable without such complex relational reasoning, some surprisingly simple feed-forward, "holistic" models have recently shown strong performance on this dataset. These models lack any kind of explicit iterative, symbolic reasoning procedure, which is hypothesized to be necessary for counting objects, narrowing down the set of relevant objects based on several attributes, etc. The reason for this strong performance is poorly understood. Hence, our work analyzes such models and finds that minor architectural elements are crucial to performance. In particular, we find that early fusion of language and vision provides large performance improvements. This contrasts with the late fusion approaches popular at the dawn of Visual QA. We propose a simple module we call Multimodal Core, which we hypothesize performs the fundamental operations for multimodal tasks. We believe that understanding why these elements are so important to complex question answering will aid the design of better-performing algorithms for Visual QA while minimizing hand-engineering effort.
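The distinction between early and late fusion can be illustrated with a toy sketch. This is only one common reading of the two terms, not the architecture from the paper: the function names, the use of concatenation, and the mean pooling are all assumptions made for illustration.

```python
# Toy illustration of early vs. late fusion (hypothetical helper names;
# concatenation and mean pooling are assumptions, not the paper's design).

def early_fusion(image_features, question_vector):
    """Append the question vector to every spatial cell before any joint processing.

    image_features: grid (list of rows) of per-cell feature lists.
    Returns a grid of the same spatial size with longer per-cell features.
    """
    return [[cell + question_vector for cell in row] for row in image_features]


def late_fusion(image_features, question_vector):
    """Pool the image into a single vector first, then append the question vector."""
    cells = [cell for row in image_features for cell in row]
    dim = len(cells[0])
    pooled = [sum(cell[d] for cell in cells) / len(cells) for d in range(dim)]
    return pooled + question_vector


image = [[[1.0, 2.0], [3.0, 4.0]],
         [[5.0, 6.0], [7.0, 8.0]]]   # 2x2 grid, 2 channels per cell
question = [0.5, -0.5, 1.0]           # 3-dimensional question embedding

early = early_fusion(image, question)  # 2x2 grid, 5 values per cell
late = late_fusion(image, question)    # single vector of 5 values
```

In the early variant every spatial location already carries the question when later layers process it, whereas in the late variant the image is summarized before the question is seen.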
We study the problem of learning classifiers robust to universal adversarial perturbations. While prior work approaches this problem via robust optimization, adversarial training, or input transformation, we instead phrase it as a two-player zero-sum game. In this new formulation, both players simultaneously play the same game, where one player chooses a classifier that minimizes a classification loss whilst the other player creates an adversarial perturbation that increases the same loss when applied to every sample in the training set.
By observing that performing a classification (respectively, creating adversarial samples) is the best response to the other player, we propose a novel extension of a game-theoretic algorithm, namely fictitious play, to the domain of training robust classifiers. Finally, we empirically show the robustness and versatility of our approach in two defence scenarios where universal attacks are performed on several image classification datasets: CIFAR10, CIFAR100 and ImageNet.
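The best-response idea behind this formulation can be sketched on a toy problem. The code below assumes the game-theoretic algorithm referred to in the abstract is fictitious play, and runs the classic version on a small zero-sum matrix game rather than on classifiers and perturbations. The function name, the initial actions, and the tie-breaking rule are assumptions of this sketch, not details from the paper.

```python
# Classic fictitious play on a zero-sum matrix game (toy sketch, not the
# paper's extension to classifier training).

def fictitious_play(payoff, steps):
    """Return the empirical action frequencies of both players.

    payoff[i][j] is the row player's payoff; the column player receives
    its negative. Each round, both players simultaneously best-respond to
    the other's empirical play so far.
    """
    n_rows, n_cols = len(payoff), len(payoff[0])
    row_counts = [0] * n_rows
    col_counts = [0] * n_cols
    # Arbitrary initial actions (an assumption of this sketch).
    row_counts[0] += 1
    col_counts[0] += 1
    for _ in range(steps - 1):
        # Row player's payoff for each action against the column player's history.
        row_values = [sum(payoff[i][j] * col_counts[j] for j in range(n_cols))
                      for i in range(n_rows)]
        # Row player's payoff for each column action against the row player's history.
        col_values = [sum(payoff[i][j] * row_counts[i] for i in range(n_rows))
                      for j in range(n_cols)]
        # Ties are broken in favor of the lowest index.
        best_row = max(range(n_rows), key=lambda i: row_values[i])
        best_col = min(range(n_cols), key=lambda j: col_values[j])
        row_counts[best_row] += 1
        col_counts[best_col] += 1
    return ([c / steps for c in row_counts],
            [c / steps for c in col_counts])


matching_pennies = [[1, -1], [-1, 1]]
row_freq, col_freq = fictitious_play(matching_pennies, steps=20000)
```

In matching pennies the only equilibrium is for each player to mix uniformly, and the empirical frequencies produced by fictitious play approach that mixture as the number of rounds grows.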
Learning Visual Question Answering by Bootstrapping Hard Attention
Carl Doersch
Adam Santoro
Peter Battaglia
European Conference on Computer Vision (ECCV) (2018)
Attention mechanisms in biological perception are thought to select subsets of perceptual information for more sophisticated processing which would be prohibitive to perform on all sensory inputs. In computer vision, however, there has been relatively little exploration of hard attention, where some information is selectively ignored, in spite of the success of soft attention, where information is re-weighted and aggregated, but never filtered out. Here, we introduce a new approach for hard attention and find it achieves very competitive performance on recently released visual question answering datasets, equalling and in some cases surpassing similar soft attention architectures while entirely ignoring some features. Even though the hard attention mechanism is thought to be non-differentiable, we found that the feature magnitudes correlate with semantic relevance, and provide a useful signal for our mechanism’s attentional selection criterion. Because hard attention selects important features of the input information, it can also be more efficient than analogous soft attention mechanisms. This is especially important for recent approaches that use non-local pairwise operations, whereby computational and memory costs are quadratic in the size of the set of features.
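The selection criterion described above can be sketched as a top-k filter on feature magnitude. This is a minimal illustration, assuming the L2 norm as the magnitude and a fixed k. The function names are hypothetical and the sketch is not the paper's implementation.

```python
import math

# Toy sketch of magnitude-based hard attention (hypothetical helper names;
# the L2 norm and a fixed k are assumptions made for illustration).

def hard_attention_top_k(features, k):
    """Keep the k feature vectors with the largest L2 norm and drop the rest.

    Returns the kept indices (in their original order) and the kept vectors.
    """
    norms = [math.sqrt(sum(x * x for x in f)) for f in features]
    ranked = sorted(range(len(features)), key=lambda i: norms[i], reverse=True)
    kept = sorted(ranked[:k])
    return kept, [features[i] for i in kept]


def pairwise_cost(num_features):
    """Number of ordered pairs a non-local pairwise operation would visit."""
    return num_features * num_features


features = [[0.1, 0.0], [3.0, 4.0], [0.0, 1.0], [2.0, 0.0]]
kept_indices, kept_features = hard_attention_top_k(features, k=2)

cost_before = pairwise_cost(len(features))      # pairs over all features
cost_after = pairwise_cost(len(kept_features))  # pairs over the selected subset
```

Because the pairwise cost grows with the square of the number of features, keeping half of them reduces the number of pairs to a quarter.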
Relational inductive biases, deep learning, and graph networks
Peter Battaglia
Jessica Blake Chandler Hamrick
Victor Bapst
Alvaro Sanchez
Vinicius Zambaldi
Andrea Tacchetti
David Raposo
Adam Santoro
Ryan Faulkner
Caglar Gulcehre
Francis Song
Andy Ballard
Justin Gilmer
Ashish Vaswani
Kelsey Allen
Charles Nash
Victoria Jayne Langston
Chris Dyer
Nicolas Heess
Daan Wierstra
Matt Botvinick
Yujia Li
Razvan Pascanu
arXiv (2018)
The purpose of this paper is to explore relational inductive biases in modern AI, especially deep learning, describing a rough taxonomy of existing approaches, and introducing a common mathematical framework for expressing and unifying various approaches. The key theme running through this work is structure: how the world is structured, and how the structure of different computational strategies determines their strengths and weaknesses.