Lucy Colwell

Lucy is a research scientist at Google Research who works closely with colleagues from GAS and Brain to better understand the relationship between the sequence and function of biological macromolecules. Her broader research interests involve understanding how Google's strengths in experimental design and machine learning can be applied to the discovery and production of proteins for use in a diverse range of applications.
Authored Publications
    Preview abstract Machine learning-guided protein design is rapidly emerging as a strategy to find high fitness multi-mutant variants. In this issue of Cell Systems, Wittman et al. analyze the impact of design decisions for machine learning-assisted directed evolution (MLDE) on its ability to navigate a fitness landscape and reliably find global optima. View details
    Rethinking Attention with Performers
    Valerii Likhosherstov
    David Martin Dohan
    Xingyou Song
    Peter Hawkins
    Jared Quincy Davis
    Afroz Mohiuddin
    Lukasz Kaiser
    Adrian Weller
    accepted to ICLR 2021 (oral presentation) (to appear)
    Preview abstract We introduce Performers, Transformer architectures which can estimate regular (softmax) full-rank-attention Transformers with provable accuracy, but using only linear (as opposed to quadratic) space and time complexity, without relying on any priors such as sparsity or low-rankness. To approximate softmax attention-kernels, Performers use a novel Fast Attention Via positive Orthogonal Random features approach (FAVOR+), which may be of independent interest for scalable kernel methods. FAVOR+ can also be used to efficiently model kernelizable attention mechanisms beyond softmax. This representational power is crucial to accurately compare softmax with other kernels for the first time on large-scale tasks, beyond the reach of regular Transformers, and investigate optimal attention-kernels. Performers are linear architectures fully compatible with regular Transformers and with strong theoretical guarantees: unbiased or nearly-unbiased estimation of the attention matrix, uniform convergence and low estimation variance. We tested Performers on a rich set of tasks stretching from pixel-prediction through text models to protein sequence modeling. We demonstrate competitive results with other examined efficient sparse and dense attention methods, showcasing effectiveness of the novel attention-learning paradigm leveraged by Performers. View details
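The idea of approximating the softmax kernel with positive random features can be illustrated with a small sketch. This is not the authors' implementation: it assumes the standard identity exp(x·y) = E[exp(w·x − |x|²/2) · exp(w·y − |y|²/2)] for Gaussian w, it draws the projections independently rather than orthogonally, and the function names `positive_random_features` and `estimate_softmax_kernel` are hypothetical.

```python
import math
import random


def positive_random_features(x, projections):
    """Map x to strictly positive features; dot products of two such
    feature vectors give a Monte Carlo estimate of exp(x . y)."""
    sq_norm = sum(v * v for v in x)
    scale = 1.0 / math.sqrt(len(projections))
    return [
        scale * math.exp(sum(w_i * x_i for w_i, x_i in zip(w, x)) - sq_norm / 2.0)
        for w in projections
    ]


def estimate_softmax_kernel(x, y, num_features=20000, seed=0):
    """Estimate exp(x . y) using shared Gaussian projections.

    Assumption: projections are i.i.d. N(0, I); the orthogonalization
    step named in the abstract is deliberately left out of this sketch.
    """
    rng = random.Random(seed)
    projections = [[rng.gauss(0.0, 1.0) for _ in x] for _ in range(num_features)]
    phi_x = positive_random_features(x, projections)
    phi_y = positive_random_features(y, projections)
    return sum(a * b for a, b in zip(phi_x, phi_y))


query = [0.1, 0.2]
key = [0.3, -0.1]
exact = math.exp(sum(q * k for q, k in zip(query, key)))
approx = estimate_softmax_kernel(query, key)
print(f"exact={exact:.4f} approx={approx:.4f}")
```

Because each query and key is mapped to features independently, attention can in principle be computed by multiplying feature matrices instead of forming the full attention matrix, which is where the linear cost described in the abstract comes from.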
    Deep diversification of an AAV capsid protein by machine learning
    Ali Bashir
    Sam Sinai
    Nina K. Jain
    Pierce J. Ogden
    Patrick F. Riley
    George M. Church
    Eric D. Kelsic
    Nature Biotechnology(2021)
    Preview abstract Modern experimental technologies can assay large numbers of biological sequences, but engineered protein libraries rarely exceed the sequence diversity of natural protein families. Machine learning (ML) models trained directly on experimental data without biophysical modeling provide one route to accessing the full potential diversity of engineered proteins. Here we apply deep learning to design highly diverse adeno-associated virus 2 (AAV2) capsid protein variants that remain viable for packaging of a DNA payload. Focusing on a 28-amino acid segment, we generated 201,426 variants of the AAV2 wild-type (WT) sequence yielding 110,689 viable engineered capsids, 57,348 of which surpass the average diversity of natural AAV serotype sequences, with 12–29 mutations across this region. Even when trained on limited data, deep neural network models accurately predict capsid viability across diverse variants. This approach unlocks vast areas of functional but previously unreachable sequence space, with many potential applications for the generation of improved viral vectors and protein therapeutics. View details
    Preview abstract The use of black-box optimization for the design of new biological sequences is an emerging research area with potentially revolutionary impact. The cost and latency of wet-lab experiments require methods that find good sequences in few experimental rounds of large batches of sequences, a setting that off-the-shelf black-box optimization methods are ill-equipped to handle. We find that the performance of existing methods varies drastically across optimization tasks, posing a significant obstacle to real-world applications. To improve robustness, we propose population-based optimization (PBO), which generates batches of sequences by sampling from an ensemble of methods. The number of sequences sampled from any method is proportional to the quality of sequences it previously proposed, allowing PBO to combine the strengths of individual methods while hedging against their innate brittleness. Adapting the population of methods online using evolutionary optimization further improves performance. Through extensive experiments on in-silico optimization tasks, we show that PBO outperforms any single method in its population, proposing both higher quality single sequences as well as more diverse batches. By its robustness and ability to design diverse, high-quality sequences, PBO is shown to be a new state-of-the-art approach to the batched black-box optimization of biological sequences. View details
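The batch allocation described in this abstract, where each method contributes sequences in proportion to the quality of its earlier proposals, might be sketched as follows. This is a minimal illustration and not the paper's algorithm: the method names, the scores, and the `allocate_batch` helper are hypothetical, the proportionality holds only in expectation because slots are sampled, and the online evolutionary adaptation of the population is not shown.

```python
import random


def allocate_batch(method_scores, batch_size, seed=0):
    """Assign each slot in a batch to a method, with probability
    proportional to that method's (non-negative) past quality score."""
    rng = random.Random(seed)
    names = list(method_scores)
    weights = [max(method_scores[name], 0.0) for name in names]
    if sum(weights) == 0.0:
        # No method has a track record yet: fall back to uniform sampling.
        weights = [1.0] * len(names)
    picks = rng.choices(names, weights=weights, k=batch_size)
    return {name: picks.count(name) for name in names}


# Hypothetical quality scores from a previous experimental round.
scores = {"evolution": 0.6, "bayesian_opt": 0.3, "random_search": 0.0}
allocation = allocate_batch(scores, batch_size=100)
print(allocation)
```

Sampling slots rather than rounding fixed shares keeps the batch size exact, and a method with a zero score receives no slots until its score recovers.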
    Preview abstract Being able to design biological sequences like DNA or proteins to have desired properties would have considerable impact in medical and industrial applications. However, doing so presents a challenging black-box optimization problem that requires multiple rounds of expensive, time-consuming experiments. In response, we propose using reinforcement learning (RL) for biological sequence design. RL is a flexible framework that allows us to optimize generative sequence policies to achieve a variety of criteria, including diversity among high-quality sequences discovered. We use model-based RL to improve sample efficiency, where at each round the policy is trained offline using a simulator fit on functional measurements from prior rounds. To accommodate the growing number of observations across rounds, the simulator model is automatically selected at each round from a pool of diverse models of varying capacity. On the tasks of designing DNA transcription factor binding sites, designing antimicrobial proteins, and optimizing the energy of Ising models based on protein structures, we find that model-based RL is an attractive alternative to existing methods. View details
    Evaluating Attribution for Graph Neural Networks
    Alexander B Wiltschko
    Benjamin Sanchez-Lengeling
    Brian Lee
    Jennifer Wei
    Wesley Qian
    Yiliu Wang
    Advances in Neural Information Processing Systems 33(2020)
    Preview abstract Interpretability of machine learning models is critical to scientific understanding, AI safety, and debugging. Attribution is one approach to interpretability, which highlights input dimensions that are influential to a neural network’s prediction. Evaluation of these methods is largely qualitative for image and text models, because acquiring ground truth attributions requires expensive and unreliable human judgment. Attribution has been comparatively understudied for graph neural networks (GNNs), a model class of growing importance that makes predictions on arbitrarily-sized graphs. Graph-valued data offer an opportunity to quantitatively benchmark attribution methods, because challenging synthetic graph problems have computable ground-truth attributions. In this work we adapt commonly-used attribution methods for GNNs and quantitatively evaluate them using the axes of attribution accuracy, stability, faithfulness and consistency. We make concrete recommendations for which attribution methods to use, and provide the data and code for our benchmarking suite. Rigorous and open source benchmarking of attribution methods in graphs could enable new methods development and broader use of attribution in real-world ML tasks. View details
    Preview abstract Machine learning (ML) models trained to predict ligand binding to single proteins have achieved remarkable success, but cannot make predictions about protein targets other than the one they are trained on. Models that make predictions for multiple proteins and multiple ligands, known as drug-target interaction (DTI) models, aim to solve this problem but generally have lower performance. In this work, we improve the performance of DTI models by taking advantage of the accuracy of single protein/ligand binding models. Specifically, we first construct individual protein/ligand binding models for all train proteins with some experimental data, then use each individual model to make predictions for all remaining ligands, against the corresponding protein target. Finally, we use the known and predicted ligand binding data for all targets in a DTI model to make predictions for the unseen test proteins. This approach significantly improves performance; most importantly, some of our models are able to achieve Areas Under the Receiver Operator Characteristic curve (AUCs) exceeding 0.9 on test datasets that contain only unseen proteins and unseen ligands. View details
    Biological Sequences Design using Batched Bayesian Optimization
    Zelda Mariet
    Ramya Deshpande
    David Dohan
    Olivier Chapelle
    NeurIPS workshop on Bayesian Deep Learning(2019)
    Preview abstract Being able to effectively design biological sequences like DNA and proteins would have transformative impact on medicine. Currently, the most popular method in the life sciences for performing design is directed evolution, which explores sequence space by making small mutations to existing sequences. Alternatively, Bayesian optimization (BO) provides an attractive framework for model-based black-box optimization, and has achieved many recent successes in life sciences applications. However, within the ML community, most large-scale BO efforts have focused on hyper-parameter tuning. These methods often do not translate to biological sequence design, where the search space is over a discrete alphabet, wet-lab experiments are run with considerable parallelism (1K-100K sequences per batch), and experiments are sufficiently slow and expensive that only a few rounds of experiments are feasible. This paper discusses the particularities of batched BO on a large discrete space, and investigates the design choices that must be made in order to obtain robust, scalable, and experimentally successful models within this unique context. View details
    Using attribution to decode binding mechanism in neural network models for chemistry
    Ankur Taly
    Federico Monti
    Proceedings of the National Academy of Sciences(2019), pp. 201820657
    Preview abstract Deep neural networks have achieved state of the art accuracy at classifying molecules with respect to whether they bind to specific protein targets. A key breakthrough would occur if these models could reveal the fragment pharmacophores that are causally involved in binding. Extracting chemical details of binding from the networks could potentially lead to scientific discoveries about the mechanisms of drug actions. But doing so requires shining light into the black box that is the trained neural network model, a task that has proved difficult across many domains. Here we show how the binding mechanism learned by deep neural network models can be interrogated, using a recently described attribution method. We first work with carefully constructed synthetic datasets, in which the 'fragment logic' of binding is fully known. We find that networks that achieve perfect accuracy on held out test datasets still learn spurious correlations due to biases in the datasets, and we are able to exploit this non-robustness to construct adversarial examples that fool the model. The dataset bias makes these models unreliable for accurately revealing information about the mechanisms of protein-ligand binding. In light of our findings, we prescribe a test that checks for dataset bias given a hypothesis. If the test fails, it indicates that either the model must be simplified or regularized and/or that the training dataset requires augmentation. View details
    Critiquing Protein Family Classification Models Using Sufficient Input Subsets
    Brandon Michael Carter
    Jamie Alexander Smith
    Theo Sanderson
    ACM SIGKDD Conference on Knowledge Discovery and Data Mining(2019) (to appear)
    Preview abstract In many application domains, neural networks are highly accurate and have been deployed at large scale. However, users often do not have good tools for understanding how these models arrive at their predictions. This has hindered adoption in fields such as the life and medical sciences, where researchers require that models base their decisions on underlying biological phenomena rather than peculiarities of the dataset introduced, e.g., as a function of when and how the data were collected. In response, we propose a set of methods for critiquing deep learning models, and demonstrate their application for protein family classification, a task for which high-accuracy models have considerable potential impact. Our methods extend the recently-introduced sufficient input subsets technique (SIS), which we use to identify the subset of locations (SIS) in each protein sequence that is sufficient for classification. Our suite of tools analyzes these SIS to shed light on the decision making criteria employed by models trained on this task. These tools expose that while these deep models may perform classification for biologically-relevant reasons, their behavior varies considerably across choice of network architecture and parameter initialization. While the techniques that we develop are specific to the protein sequence classification task, the approach taken generalizes to a broad set of scientific contexts in which model interpretability is essential. We encourage further application of our techniques for interrogating machine learning models trained on other scientifically relevant tasks. View details
    Preview abstract Understanding the relationship between amino acid sequence and protein function is a long-standing problem in molecular biology with far-reaching scientific implications. Despite six decades of progress, state-of-the-art techniques cannot annotate roughly one third of microbial protein sequences, hampering our ability to exploit sequences collected from diverse organisms. To address this, we report a deep learning model that learns the relationship between unaligned amino acid sequences and their functional classification across all 17929 families of the Pfam database. Using the Pfam seed sequences we establish a rigorous benchmark assessment and find that a dilated convolutional model reduces the error of state-of-the-art BLASTp and pHMM models by a factor of nine. With 80% of the full Pfam database we train a protein family predictor that is more accurate and over 200 times faster than BLASTp, while learning sequence features such as structural disorder and transmembrane helices. Our model co-locates sequences from unseen families in embedding space far from existing families, allowing sequences from novel families to be classified. We anticipate that deep learning models will be a core component of future general-purpose protein function prediction tools. View details
    A Comparison of Generative Models for Sequence Design
    David Dohan
    Ramya Deshpande
    Olivier Chapelle
    Babak Alipanahi
    Machine Learning in Computational Biology Workshop(2019)
    Preview abstract In this paper, we compare generative models of different complexity for designing DNA and protein sequences using the Cross Entropy Method. View details
    Glycation changes molecular organization and charge distribution in type I collagen fibrils
    Sneha Bansode
    Uliana Bashtanova
    Rui Li
    Jonathan Clark
    Karin H. Müller
    Anna Puszkarska
    Ieva Goldberga
    Holly H. Chetwood
    David G. Reid
    Jeremy N. Skepper
    Catherine M. Shanahan
    Georg Schitter
    Patrick Mesquida
    Melinda J. Duer
    Scientific Reports, 10(2020), pp. 3397
    Preview abstract Collagen fibrils are central to the molecular organization of the extracellular matrix (ECM) and to defining the cellular microenvironment. Glycation of collagen fibrils is known to impact on cell adhesion and migration in the context of cancer and in model studies, glycation of collagen molecules has been shown to affect the binding of other ECM components to collagen. Here we use TEM to show that ribose-5-phosphate (R5P) glycation of collagen fibrils – potentially important in the microenvironment of actively dividing cells, such as cancer cells – disrupts the longitudinal ordering of the molecules in collagen fibrils and, using KFM and FLiM, that R5P-glycated collagen fibrils have a more negative surface charge than unglycated fibrils. Altered molecular arrangement can be expected to impact on the accessibility of cell adhesion sites and altered fibril surface charge on the integrity of the extracellular matrix structure surrounding glycated collagen fibrils. Both effects are highly relevant for cell adhesion and migration within the tumour microenvironment. View details
    Rapid discovery and evolution of orthogonal aminoacyl-tRNA synthetase–tRNA pairs
    Daniele Cervettini
    Shan Tang
    Stephen D. Fried
    Julian C. W. Willis
    Louise F. H. Funke
    Jason W. Chin
    Nature Biotechnology, 38(2020), 989–999
    Preview abstract A central challenge in expanding the genetic code of cells to incorporate noncanonical amino acids into proteins is the scalable discovery of aminoacyl-tRNA synthetase (aaRS)–tRNA pairs that are orthogonal in their aminoacylation specificity. Here we computationally identify candidate orthogonal tRNAs from millions of sequences and develop a rapid, scalable approach—named tRNA Extension (tREX)—to determine the in vivo aminoacylation status of tRNAs. Using tREX, we test 243 candidate tRNAs in Escherichia coli and identify 71 orthogonal tRNAs, covering 16 isoacceptor classes, and 23 functional orthogonal tRNA–cognate aaRS pairs. We discover five orthogonal pairs, including three highly active amber suppressors, and evolve new amino acid substrate specificities for two pairs. Finally, we use tREX to characterize a matrix of 64 orthogonal synthetase–orthogonal tRNA specificities. This work expands the number of orthogonal pairs available for genetic code expansion and provides a pipeline for the discovery of additional orthogonal pairs and a foundation for encoding the cellular synthesis of noncanonical biopolymers. View details
    Computational approaches to therapeutic antibody design: established methods and emerging trends
    Richard A. Norman
    Francesco Ambrosetti
    Alexandre M.J.J. Bonvin
    Sebastian Kelm
    Sandeep Kumar
    Konrad Krawczyk
    Briefings in Bioinformatics, 21(2019), 1549–1567
    Preview abstract Antibodies are proteins that recognize the molecular surfaces of potentially noxious molecules to mount an adaptive immune response or, in the case of autoimmune diseases, molecules that are part of healthy cells and tissues. Due to their binding versatility, antibodies are currently the largest class of biotherapeutics, with five monoclonal antibodies ranked in the top 10 blockbuster drugs. Computational advances in protein modelling and design can have a tangible impact on antibody-based therapeutic development. Antibody-specific computational protocols currently benefit from an increasing volume of data provided by next generation sequencing and application to related drug modalities based on traditional antibodies, such as nanobodies. Here we present a structured overview of available databases, methods and emerging trends in computational antibody analysis and contextualize them towards the engineering of candidate antibody therapeutics. View details
    Collagen-inspired self-assembly of twisted filaments
    MJ Falk
    A Duwel
    Phys. Rev. Lett., 123(2019), pp. 238102
    Preview abstract Collagen consists of three peptides twisted together through a periodic array of hydrogen bonds. Here we use this as inspiration to find design rules for programmed specific interactions for self-assembling synthetic collagen-like triple helices, starting from disordered configurations. The assembly generically nucleates defects in the triple helix, the characteristics of which can be manipulated by spatially varying the enthalpy of helix formation. Defect formation slows assembly, evoking kinetic pathologies that have been observed to mutations in the primary collagen amino acid sequence. The controlled formation and interaction between defects gives a route for hierarchical self-assembly of bundles of twisted filaments. View details
    A polymer physics framework for the entropy of arbitrary pseudoknots
    Ofer Kimchi
    Tristan Cragnolini
    Biophysical Journal, 117(2019), pp. 520-532
    Preview abstract The accurate prediction of RNA secondary structure from primary sequence has had enormous impact on research from the past 40 years. Although many algorithms are available to make these predictions, the inclusion of non-nested loops, termed pseudoknots, still poses challenges arising from two main factors: 1) no physical model exists to estimate the loop entropies of complex intramolecular pseudoknots, and 2) their NP-complete enumeration has impeded their study. Here, we address both challenges. First, we develop a polymer physics model that can address arbitrarily complex pseudoknots using only two parameters corresponding to concrete physical quantities—over an order of magnitude fewer than the sparsest state-of-the-art phenomenological methods. Second, by coupling this model to exhaustive enumeration of the set of possible structures, we compute the entire free energy landscape of secondary structures resulting from a primary RNA sequence. We demonstrate that for RNA structures of ∼80 nucleotides, with minimal heuristics, the complete enumeration of possible secondary structures can be accomplished quickly despite the NP-complete nature of the problem. We further show that despite our loop entropy model’s parametric sparsity, it performs better than or on par with previously published methods in predicting both pseudoknotted and non-pseudoknotted structures on a benchmark data set of RNA structures of ≤80 nucleotides. We suggest ways in which the accuracy of the model can be further improved. View details
    The Effect of Debiasing Protein–Ligand Binding Data on Generalization
    Vikram Sundar
    J. Chem. Inf. Model., 60(2019), 56–62
    Preview abstract The structured nature of chemical data means machine-learning models trained to predict protein–ligand binding risk overfitting the data, impairing their ability to generalize and make accurate predictions for novel candidate ligands. Data debiasing algorithms, which systematically partition the data to reduce bias and provide a more accurate metric of model performance, have the potential to address this issue. When models are trained using debiased data splits, the reward for simply memorizing the training data is reduced, suggesting that the ability of the model to make accurate predictions for novel candidate ligands will improve. To test this hypothesis, we use distance-based data splits to measure how well a model can generalize. We first confirm that models perform better for randomly split held-out sets than for distant held-out sets. We then debias the data and find, surprisingly, that debiasing typically reduces the ability of models to make accurate predictions for distant held-out test sets and that model performance measured after debiasing is not representative of the ability of a model to generalize. These results suggest that debiasing reduces the information available to a model, impairing its ability to generalize. View details
    Statistical and machine learning approaches to predicting protein–ligand interactions
    Current opinion in structural biology, 49(2018), pp. 123-128
    Preview abstract Data driven computational approaches to predicting protein–ligand binding are currently achieving unprecedented levels of accuracy on held-out test datasets. Up until now, however, this has not led to corresponding breakthroughs in our ability to design novel ligands for protein targets of interest. This review summarizes the current state of the art in this field, emphasizing the recent development of deep neural networks for predicting protein–ligand binding. We explain the major technical challenges that have caused difficulty with predicting novel ligands, including the problems of sampling noise and the challenge of using benchmark datasets that are sufficiently unbiased that they allow the model to extrapolate to new regimes. View details
    Comparative analysis of nanobody sequence and structure data
    Laura S. Mitchell
    Proteins: Structure, Function, and Bioinformatics, 86(2018), 697–706
    Preview abstract Nanobodies are a class of antigen‐binding protein derived from camelids that achieve comparable binding affinities and specificities to classical antibodies, despite comprising only a single 15 kDa variable domain. Their reduced size makes them an exciting target molecule with which we can explore the molecular code that underpins binding specificity—how is such high specificity achieved? Here, we use a novel dataset of 90 nonredundant, protein‐binding nanobodies with antigen‐bound crystal structures to address this question. To provide a baseline for comparison we construct an analogous set of classical antibodies, allowing us to probe how nanobodies achieve high specificity binding with a dramatically reduced sequence space. Our analysis reveals that nanobodies do not diversify their framework region to compensate for the loss of the VL domain. In addition to the previously reported increase in H3 loop length, we find that nanobodies create diversity by drawing their paratope regions from a significantly larger set of aligned sequence positions, and by exhibiting greater structural variation in their H1 and H2 loops. View details
    Power law tails in phylogenetic systems
    Chongli Qin
    PNAS, 115(2018), pp. 690-695
    Preview abstract Covariance analysis of protein sequence alignments can predict structure and function from sequence alignments alone. Current methodologies typically assume that sequences are independent, notwithstanding their phylogenetic relationships. This corruption constrains the alignments for which covariance analysis can be used. It is critically important to control for phylogeny and understand how phylogeny contaminates signal. This paper presents a mathematical analysis that argues that there is a distinctive signature of phylogeny in the covariance matrix, allowing us to identify modes that are corrupted by phylogeny. This signature is present in large protein sequence alignments, explaining recent covariance analyses, and provides an important step toward decoupling phylogenetic effects from biologically meaningful interactions. View details
    Analysis of nanobody paratopes reveals greater diversity than classical antibodies
    Laura S Mitchell
    Protein Engineering, Design and Selection, 31(2018), 267–275
    Preview abstract Nanobodies (Nbs) are a class of antigen-binding protein derived from camelid immune systems, which achieve equivalent binding affinities and specificities to classical antibodies (Abs) despite being comprised of only a single variable domain. Here, we use a data set of 156 unique Nb:antigen complex structures to characterize Nb–antigen binding and draw comparison to a set of 156 unique Ab:antigen structures. We analyse residue composition and interactions at the antigen interface, together with structural features of the paratopes of both data sets. Our analysis finds that the set of Nb structures displays much greater paratope diversity, in terms of the structural segments involved in the paratope, the residues used at these positions to contact the antigen and furthermore the type of contacts made with the antigen. Our findings suggest a different relationship between contact propensity and sequence variability from that observed for Ab VH domains. The distinction between sequence positions that control interaction specificity and those that form the domain scaffold is much less clear-cut for Nbs, and furthermore H3 loop positions play a much more dominant role in determining interaction specificity. View details
    Proline provides site-specific flexibility for in vivo collagen
    Wing Ying Chow
    Chris J Forman
    Dominique Bihan
    Anna M Puszkarska
    Rakesh Rajan
    David G Reid
    David A Slatter
    David J Wales
    Richard W Farndale
    Melinda J Duer
    Scientific Reports, 9(2018), pp. 13809
    Preview abstract Fibrillar collagens have mechanical and biological roles, providing tissues with both tensile strength and cell binding sites which allow molecular interactions with cell-surface receptors such as integrins. A key question is: how do collagens allow tissue flexibility whilst maintaining well-defined ligand binding sites? Here we show that proline residues in collagen glycine-proline-hydroxyproline (Gly-Pro-Hyp) triplets provide local conformational flexibility, which in turn confers well-defined, low energy molecular compression-extension and bending, by employing two-dimensional 13C-13C correlation NMR spectroscopy on 13C-labelled intact ex vivo bone and in vitro osteoblast extracellular matrix. We also find that the positions of Gly-Pro-Hyp triplets are highly conserved between animal species, and are spatially clustered in the currently-accepted model of molecular ordering in collagen type I fibrils. We propose that the Gly-Pro-Hyp triplets in fibrillar collagens provide fibril “expansion joints” to maintain molecular ordering within the fibril, thereby preserving the structural integrity of ligand binding sites. View details