Lucy Colwell
Lucy is a research scientist at Google Research who works closely with colleagues from GAS and Brain to better understand the relationship between the sequence and function of biological macromolecules. Her broader research interests involve understanding how Google's strengths in experimental design and machine learning can be applied to the discovery and production of proteins for use in a diverse range of applications.
Authored Publications
Machine learning-guided protein design is rapidly emerging as a strategy to find high fitness multi-mutant variants. In this issue of Cell Systems, Wittman et al. analyze the impact of design decisions for machine learning-assisted directed evolution (MLDE) on its ability to navigate a fitness landscape and reliably find global optima.
Deep diversification of an AAV capsid protein by machine learning
Ali Bashir
Sam Sinai
Nina K. Jain
Pierce J. Ogden
Patrick F. Riley
George M. Church
Eric D. Kelsic
Nature Biotechnology (2021)
Modern experimental technologies can assay large numbers of biological sequences, but engineered protein libraries rarely exceed the sequence diversity of natural protein families. Machine learning (ML) models trained directly on experimental data without biophysical modeling provide one route to accessing the full potential diversity of engineered proteins. Here we apply deep learning to design highly diverse adeno-associated virus 2 (AAV2) capsid protein variants that remain viable for packaging of a DNA payload. Focusing on a 28-amino acid segment, we generated 201,426 variants of the AAV2 wild-type (WT) sequence yielding 110,689 viable engineered capsids, 57,348 of which surpass the average diversity of natural AAV serotype sequences, with 12–29 mutations across this region. Even when trained on limited data, deep neural network models accurately predict capsid viability across diverse variants. This approach unlocks vast areas of functional but previously unreachable sequence space, with many potential applications for the generation of improved viral vectors and protein therapeutics.
Rethinking Attention with Performers
Valerii Likhosherstov
David Martin Dohan
Peter Hawkins
Jared Quincy Davis
Afroz Mohiuddin
Lukasz Kaiser
Adrian Weller
accepted to ICLR 2021 (oral presentation) (to appear)
We introduce Performers, Transformer architectures which can estimate regular (softmax) full-rank-attention Transformers with provable accuracy, but using only linear (as opposed to quadratic) space and time complexity, without relying on any priors such as sparsity or low-rankness. To approximate softmax attention-kernels, Performers use a novel Fast Attention Via positive Orthogonal Random features approach (FAVOR+), which may be of independent interest for scalable kernel methods. FAVOR+ can also be used to efficiently model kernelizable attention mechanisms beyond softmax. This representational power is crucial to accurately compare softmax with other kernels for the first time on large-scale tasks, beyond the reach of regular Transformers, and investigate optimal attention-kernels. Performers are linear architectures fully compatible with regular Transformers and with strong theoretical guarantees: unbiased or nearly-unbiased estimation of the attention matrix, uniform convergence and low estimation variance. We tested Performers on a rich set of tasks stretching from pixel-prediction through text models to protein sequence modeling. We demonstrate competitive results with other examined efficient sparse and dense attention methods, showcasing the effectiveness of the novel attention-learning paradigm leveraged by Performers.
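The core idea of approximating softmax attention with positive random features can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it uses i.i.d. Gaussian projections and omits the orthogonalization step of FAVOR+, and the function names, array sizes, and input scaling are assumptions chosen for illustration.

```python
import numpy as np

def positive_random_features(x, w):
    """Map rows of x (n, d) to strictly positive features (n, m).

    w holds m Gaussian projection vectors of dimension d. In expectation,
    the dot product of two feature vectors equals exp(x . y).
    """
    m = w.shape[0]
    half_sq_norm = 0.5 * np.sum(x ** 2, axis=-1, keepdims=True)
    return np.exp(x @ w.T - half_sq_norm) / np.sqrt(m)

def linear_attention(q, k, v, w):
    """Approximate softmax attention without forming the n x n matrix."""
    scale = q.shape[-1] ** -0.25          # splits the usual 1/sqrt(d) between q and k
    q_feat = positive_random_features(q * scale, w)
    k_feat = positive_random_features(k * scale, w)
    kv = k_feat.T @ v                     # (m, d_v), cost linear in sequence length
    normalizer = q_feat @ k_feat.sum(axis=0)
    return (q_feat @ kv) / normalizer[:, None]

def softmax_attention(q, k, v):
    """Exact quadratic-cost softmax attention, used here as a reference."""
    logits = q @ k.T / np.sqrt(q.shape[-1])
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Hypothetical toy sizes: 16 tokens, dimension 8, 4096 random features.
rng = np.random.default_rng(0)
n, d, m = 16, 8, 4096
q = 0.3 * rng.normal(size=(n, d))
k = 0.3 * rng.normal(size=(n, d))
v = rng.normal(size=(n, d))
w = rng.normal(size=(m, d))               # i.i.d. rows; FAVOR+ would orthogonalize them

approx = linear_attention(q, k, v, w)
exact = softmax_attention(q, k, v)
max_abs_error = float(np.max(np.abs(approx - exact)))
```

Because the features are positive, the row normalizer cannot become zero or negative, which is one reason a positive feature map is attractive for this kind of estimator.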
Evaluating Attribution for Graph Neural Networks
Alexander B Wiltschko
Benjamin Sanchez-Lengeling
Brian Lee
Jennifer Wei
Wesley Qian
Yiliu Wang
Advances in Neural Information Processing Systems 33 (2020)
Interpretability of machine learning models is critical to scientific understanding, AI safety, and debugging. Attribution is one approach to interpretability, which highlights input dimensions that are influential to a neural network’s prediction. Evaluation of these methods is largely qualitative for image and text models, because acquiring ground truth attributions requires expensive and unreliable human judgment. Attribution has been comparatively understudied for graph neural networks (GNNs), a model class of growing importance that makes predictions on arbitrarily-sized graphs. Graph-valued data offer an opportunity to quantitatively benchmark attribution methods, because challenging synthetic graph problems have computable ground-truth attributions. In this work we adapt commonly-used attribution methods for GNNs and quantitatively evaluate them using the axes of attribution accuracy, stability, faithfulness and consistency. We make concrete recommendations for which attribution methods to use, and provide the data and code for our benchmarking suite. Rigorous and open source benchmarking of attribution methods in graphs could enable new methods development and broader use of attribution in real-world ML tasks.
Model-Based Reinforcement Learning for Biological Sequence Design
David Dohan
Ramya Deshpande
ICLR 2020 (2020)
Being able to design biological sequences like DNA or proteins to have desired properties would have considerable impact in medical and industrial applications. However, doing so presents a challenging black-box optimization problem that requires multiple rounds of expensive, time-consuming experiments. In response, we propose using reinforcement learning (RL) for biological sequence design. RL is a flexible framework that allows us to optimize generative sequence policies to achieve a variety of criteria, including diversity among high-quality sequences discovered. We use model-based RL to improve sample efficiency, where at each round the policy is trained offline using a simulator fit on functional measurements from prior rounds. To accommodate the growing number of observations across rounds, the simulator model is automatically selected at each round from a pool of diverse models of varying capacity. On the tasks of designing DNA transcription factor binding sites, designing antimicrobial proteins, and optimizing the energy of Ising models based on protein structures, we find that model-based RL is an attractive alternative to existing methods.
Population Based Optimization for Biological Sequence Design
Zelda Mariet
David Martin Dohan
ICML 2020 (2020)
The use of black-box optimization for the design of new biological sequences is an emerging research area with potentially revolutionary impact. The cost and latency of wet-lab experiments require methods that find good sequences in few experimental rounds of large batches of sequences, a setting that off-the-shelf black-box optimization methods are ill-equipped to handle. We find that the performance of existing methods varies drastically across optimization tasks, posing a significant obstacle to real-world applications. To improve robustness, we propose population-based optimization (PBO), which generates batches of sequences by sampling from an ensemble of methods. The number of sequences sampled from any method is proportional to the quality of sequences it previously proposed, allowing PBO to combine the strengths of individual methods while hedging against their innate brittleness. Adapting the population of methods online using evolutionary optimization further improves performance. Through extensive experiments on in-silico optimization tasks, we show that PBO outperforms any single method in its population, proposing both higher-quality single sequences and more diverse batches. By its robustness and ability to design diverse, high-quality sequences, PBO is shown to be a new state-of-the-art approach to the batched black-box optimization of biological sequences.
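The proportional allocation described above can be sketched in a few lines of Python. Everything concrete here is an assumption made for illustration: the toy fitness function, the two proposal methods, the credit rule (mean fitness of a method's last proposals, with a small floor), and all names. The sketch also leaves out the evolutionary adaptation of the method population.

```python
import random

ALPHABET = "ACGT"

def fitness(seq):
    # Toy objective (assumption): number of 'A' characters in the sequence.
    return seq.count("A")

def propose_random(population, rng, length):
    # Hypothetical method 1: uniform random sequence, ignores the population.
    return "".join(rng.choice(ALPHABET) for _ in range(length))

def propose_mutation(population, rng, length):
    # Hypothetical method 2: single-site mutation of the best sequence so far.
    parent = max(population, key=fitness)
    pos = rng.randrange(length)
    return parent[:pos] + rng.choice(ALPHABET) + parent[pos + 1:]

def allocate_batch(credits, batch_size, rng):
    """Split batch_size among methods in proportion to their credit scores."""
    names = list(credits)
    weights = [credits[name] for name in names]
    if sum(weights) <= 0:
        weights = [1.0] * len(names)      # fall back to a uniform split
    draws = rng.choices(names, weights=weights, k=batch_size)
    return {name: draws.count(name) for name in names}

def run_pbo(rounds=5, batch_size=20, length=12, seed=0):
    rng = random.Random(seed)
    methods = {"random": propose_random, "mutation": propose_mutation}
    credits = {name: 1.0 for name in methods}
    population = [propose_random([], rng, length) for _ in range(batch_size)]
    for _ in range(rounds):
        counts = allocate_batch(credits, batch_size, rng)
        batch = []
        for name, count in counts.items():
            proposals = [methods[name](population, rng, length) for _ in range(count)]
            if proposals:
                mean_quality = sum(fitness(s) for s in proposals) / len(proposals)
                # Floor keeps weak methods in play (an assumption, not from the source).
                credits[name] = max(mean_quality, 0.1)
            batch.extend(proposals)
        population.extend(batch)          # the whole batch is "measured" each round
    return population, credits

population, credits = run_pbo()
```

In this toy setting a method that proposes better sequences receives a larger share of the next batch, while the floor on credits keeps the weaker method from being dropped entirely.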
Using attribution to decode binding mechanism in neural network models for chemistry
Ankur Taly
Federico Monti
Proceedings of the National Academy of Sciences (2019), pp. 201820657
Deep neural networks have achieved state-of-the-art accuracy at classifying molecules with respect to whether they bind to specific protein targets. A key breakthrough would occur if these models could reveal the fragment pharmacophores that are causally involved in binding. Extracting chemical details of binding from the networks could potentially lead to scientific discoveries about the mechanisms of drug actions. But doing so requires shining light into the black box that is the trained neural network model, a task that has proved difficult across many domains. Here we show how the binding mechanism learned by deep neural network models can be interrogated, using a recently described attribution method. We first work with carefully constructed synthetic datasets, in which the 'fragment logic' of binding is fully known. We find that networks that achieve perfect accuracy on held-out test datasets still learn spurious correlations due to biases in the datasets, and we are able to exploit this non-robustness to construct adversarial examples that fool the model. The dataset bias makes these models unreliable for accurately revealing information about the mechanisms of protein-ligand binding. In light of our findings, we prescribe a test that checks for dataset bias given a hypothesis. If the test fails, it indicates that the model must be simplified or regularized and/or that the training dataset requires augmentation.
Preview abstract
Machine learning (ML) models trained to predict ligand binding to single proteins have achieved remarkable success, but cannot make predictions about protein targets other than the one they are trained on. Models that make predictions for multiple proteins and multiple ligands, known as drug-target interaction (DTI) models, aim to solve this problem but generally have lower performance. In this work, we improve the performance of DTI models by taking advantage of the accuracy of single protein/ligand binding models. Specifically, we first construct individual protein/ligand binding models for all train proteins with some experimental data, then use each individual model to make predictions for all remaining ligands, against the corresponding protein target. Finally, we use the known and predicted ligand binding data for all targets in a DTI model to make predictions for the unseen test proteins. This approach significantly improves performance; most importantly, some of our models are able to achieve Areas Under the Receiver Operator Characteristic curve (AUCs) exceeding 0.9 on test datasets that contain only unseen proteins and unseen ligands.
Biological Sequences Design using Batched Bayesian Optimization
Zelda Mariet
Ramya Deshpande
David Dohan
Olivier Chapelle
NeurIPS workshop on Bayesian Deep Learning (2019)
Being able to effectively design biological sequences like DNA and proteins would have transformative impact on medicine. Currently, the most popular method in the life sciences for performing design is directed evolution, which explores sequence space by making small mutations to existing sequences. Alternatively, Bayesian optimization (BO) provides an attractive framework for model-based black-box optimization, and has achieved many recent successes in life sciences applications. However, within the ML community, most large-scale BO efforts have focused on hyper-parameter tuning. These methods often do not translate to biological sequence design, where the search space is over a discrete alphabet, wet-lab experiments are run with considerable parallelism (1K-100K sequences per batch), and experiments are sufficiently slow and expensive that only a few rounds of experiments are feasible. This paper discusses the particularities of batched BO on a large discrete space, and investigates the design choices that must be made in order to obtain robust, scalable, and experimentally successful models within this unique context.
A Comparison of Generative Models for Sequence Design
David Dohan
Ramya Deshpande
Olivier Chapelle
Babak Alipanahi
Machine Learning in Computational Biology Workshop (2019)
In this paper, we compare generative models of different complexity for designing DNA and protein sequences using the Cross Entropy Method.
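For readers unfamiliar with the Cross Entropy Method, the loop can be sketched with the simplest possible generative model: an independent categorical distribution per sequence position. This is an illustration under stated assumptions, not the models compared in the paper. The toy fitness (matches to a hidden target), the hyperparameters, and all names are hypothetical.

```python
import numpy as np

ALPHABET = "ACGT"

def toy_fitness(seqs, target):
    # Toy objective (assumption): number of positions that match a hidden target.
    return (seqs == target).sum(axis=1)

def cross_entropy_method(target, rounds=20, batch_size=200,
                         elite_frac=0.2, smoothing=0.1, seed=0):
    """Iteratively refit a per-position categorical model to the elite samples."""
    rng = np.random.default_rng(seed)
    length, n_sym = len(target), len(ALPHABET)
    probs = np.full((length, n_sym), 1.0 / n_sym)   # start from a uniform model
    n_elite = max(1, int(elite_frac * batch_size))
    for _ in range(rounds):
        # 1. Sample a batch of integer-encoded sequences from the current model.
        seqs = np.stack(
            [rng.choice(n_sym, size=batch_size, p=probs[i]) for i in range(length)],
            axis=1,
        )
        # 2. Score the batch and keep the top-scoring (elite) fraction.
        scores = toy_fitness(seqs, target)
        elite = seqs[np.argsort(scores)[-n_elite:]]
        # 3. Refit by maximum likelihood on the elite set: per-position frequencies.
        freqs = np.stack([(elite == s).mean(axis=0) for s in range(n_sym)], axis=1)
        # Mixing with the uniform distribution keeps every symbol reachable.
        probs = (1.0 - smoothing) * freqs + smoothing / n_sym
    return probs

target = np.array([0, 1, 2, 3, 0, 1, 2, 3])         # hypothetical hidden optimum
probs = cross_entropy_method(target)
best = "".join(ALPHABET[i] for i in probs.argmax(axis=1))
```

Swapping the per-position model for a richer generative model changes only the sampling and refitting steps; the select-and-refit loop stays the same.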