Pengcheng Yin
Hi! I am Pengcheng, a research scientist at the learning for code team at Google Brain. I work on problems in the intersection of natural language processing and machine learning for software engineering. My long-term research goal is to build models to let developers communicate to computers in their own language. You can find more about my research at my personal website (http://pengcheng.in).
Research Areas
Authored Publications
Sort By
SQL-PaLM: Improved Large Language Model Adaptation for Text-to-SQL
Satya Gundabathula
Hanjun Dai
TMLR (2024)
Preview abstract
Text-to-SQL, the process of translating natural language into Structured Query Language
(SQL), represents a transformative application of large language models (LLMs), potentially
revolutionizing how humans interact with data. This paper introduces the SQL-PaLM
framework, a comprehensive solution for understanding and enhancing Text-to-SQL using
LLMs, using in the learning regimes of few-shot prompting and instruction fine-tuning. With
few-shot prompting, we explore the effectiveness of consistency decoding with execution-based error filtering. With instruction fine-tuning, we delve deep in understanding the critical
paradigms that influence the performance of tuned LLMs. In particular, we investigate
how performance can be improved through expanded training data coverage and diversity,
synthetic data augmentation, and integrating query-specific database content. We propose
a test-time selection method to further refine accuracy by integrating SQL outputs from
multiple paradigms with execution feedback as guidance. Additionally, we tackle the
practical challenge of navigating intricate databases with a significant number of tables and
columns, proposing efficient techniques for accurately selecting relevant database elements to
enhance Text-to-SQL performance. Our holistic approach yields substantial advancements
in Text-to-SQL, as demonstrated on two key public benchmarks, Spider and BIRD. Through
comprehensive ablations and error analyses, we shed light on the strengths and weaknesses
of our framework, offering valuable insights into Text-to-SQL’s future work.
View details
Spider2.0-GUI: Can Multimodal Agents Achieve Expert Proficiency in Data Science and Engineering?
Ruisheng Cao
Fangyu Lei
Haoyuan Wu
Jixuan Chen
Yeqiao Fu
Hongcheng Gao
Xinzhuang Xiong
Hanchong Zhang
Yuchen Mao
Wenjing Hu
Tianbao Xie
Hongshen Xu
Danyang Zhang
Sida Wang
Caiming Xiong
Ansong Ni
Qian Liu
Victor Zhong
Lu Chen
Kai Yu
Tao Yu
Preview abstract
The field of data science and engineering is crucial for harnessing large-scale data to assist both individuals and enterprises in analytical processing and automated orchestration. Despite the significance, large language model~(LLM)-based data agents remain underexplored, particularly concerning professional data engineering tools such as {\tt dbt}, {\tt Airflow}, and {\tt Airbyte}, which are complex to use and include intensive GUI operations. To bridge this gap, we introduce Spider2.0-GUI, the first benchmark focusing on enterprise data engineering softwares across a full data pipeline. It encapsulates $486$ tasks involving $20$ professional softwares, guiding through tasks such as data warehousing, ingestion, transformation, analysis, visualization, and orchestration. Each task is paired with both abstract and verbose instructions, considering different levels of user expertise. We also build a comprehensive document warehouse with $11,231$ documents for Spider2.0-GUI to support retrieval-augmented agent frameworks. The benchmark is further enhanced with a real-time, executable Ubuntu desktop environment that interacts with real-world internet, providing a realistic and dynamic testing ground. Preliminary results with state-of-the-art vision language models~(VLMs) indicate that even the most advanced model only achieves $11\%$ success rate~(SR) with abstract instructions, and $21\%$ SR with verbose instructions~(a.k.a., step-by-step tutorials). This benchmark not only investigates the competencies of data agents, but also paves the way for future advancements in real-world automated data science and engineering tasks.
View details
UQE: A Query Engine for Unstructured Databases
Hanjun Dai
Bethany Wang
Sherry Yang
Phitchaya Mangpo Phothilimthana
Advances in Neural Information Processing Systems (NeurIPS) (2024)
Preview abstract
Analytics on structured data is a mature field with many successful methods. However, most real world data exists in unstructured form, such as images and conversations. We investigate the potential of Large Language Models (LLMs) to enable unstructured data analytics. In particular, we propose a new Universal Query Engine (UQE) that directly interrogates and draws insights from unstructured data collections. This engine accepts queries in a Universal Query Language (UQL), a dialect of SQL that provides full natural language flexibility in specifying conditions and operators. The new engine leverages the ability of LLMs to conduct analysis of unstructured data, while also allowing us to exploit advances in sampling and optimization techniques to achieve efficient and accurate query execution. In addition, we borrow techniques from classical compiler theory to better orchestrate the workflow between sampling methods and foundation model calls. We demonstrate the efficiency of UQE on data analytics across different modalities, including images, dialogs and reviews, across a range of useful query types, including conditional aggregation, semantic retrieval and abstraction aggregation.
View details
Preview abstract
Identifying invariants in programs is an important program analysis task with applications towards program understanding, vulnerability analysis, and formal verification. Existing tools for identifying invariants rely on dynamic analysis, requiring traces collected from multiple executions in order to produce reliable invariants. We study the application of large language models to invariant prediction, finding that models training on source code and fine-tuned to invariant prediction can perform invariant prediction as static rather than dynamic analysis. Using a scratchpad approach gives the best performance, finding invariants statically of quality comparable to those obtained by a dynamic analysis tool with access to five program traces.
View details
SQLPrompt: Improved In-context Learning for Few-shot Text-to-SQL
Hanjun Dai
Findings of Conference on Empirical Methods in Natural Language Processing (EMNLP) (2023)
Preview abstract
Text-to-SQL aims to automate the process of generating SQL queries on a database from natural language text. In this work, we propose "SQLPrompt", tailored to improve the few-shot prompting capabilities of Text-to-SQL for Large Language Models (LLMs). Our methods
include innovative prompt design, execution based consistency decoding strategy which selects the SQL with the most consistent execution outcome among other SQL proposals, and a method that aims to improve performance by diversifying the SQL proposals during consistency selection with different prompt designs ("MixPrompt") and foundation models ("MixLLMs"). We show that SQLPrompt outperforms previous approaches for in-context learning with few labeled data by a large margin, closing the gap with finetuning state-of the-art with thousands of labeled data.
View details
Compositional Generalization and Decomposition in Neural Program Synthesis
Joey Hong
Deep Learning for Code (DL4C) Workshop at ICLR'22 (2022)
Preview abstract
When writing programs, people have the ability to tackle a new complex task by decomposing it into smaller and more familiar subtasks. While it is difficult to measure whether neural program synthesis methods have similar capabilities, what we can measure is whether they compositionally generalize, that is, whether a model that has been trained on the simpler subtasks is subsequently able to solve more complex tasks. In this paper, we focus on measuring the ability of learned program synthesizers to compositionally generalize. We first characterize several different axes along which program synthesis methods would be desired to generalize, e.g., length generalization, or the ability to combine known subroutines in new ways that do not occur in the training data. Based on this characterization, we introduce a benchmark suite of tasks to assess these abilities based on two popular existing datasets, SCAN and RobustFill. Finally, we make first attempts to improve the compositional generalization ability of Transformer models along these axes through novel attention mechanisms that draw inspiration from a human-like decomposition strategy. Empirically, we find our modified Transformer models generally perform better than natural baselines, but the tasks remain challenging.
View details
PaLM: Scaling Language Modeling with Pathways
Aakanksha Chowdhery
Sharan Narang
Jacob Devlin
Maarten Bosma
Hyung Won Chung
Sebastian Gehrmann
Parker Schuh
Sasha Tsvyashchenko
Abhishek Rao
Yi Tay
Noam Shazeer
Nan Du
Reiner Pope
James Bradbury
Guy Gur-Ari
Toju Duke
Henryk Michalewski
Xavier Garcia
Liam Fedus
David Luan
Barret Zoph
Ryan Sepassi
David Dohan
Shivani Agrawal
Mark Omernick
Marie Pellat
Aitor Lewkowycz
Erica Moreira
Rewon Child
Oleksandr Polozov
Zongwei Zhou
Brennan Saeta
Michele Catasta
Jason Wei
Kathy Meier-Hellstern
arxiv:2204.02311 (2022)
Preview abstract
Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated, Transformer language model, which we call Pathways Language Model PaLM. We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods. We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks. On a number of these tasks, PaLM 540B achieves breakthrough performance, outperforming the finetuned state-of-the-art on a suite of multi-step reasoning tasks, and outperforming average human performance on the recently released BIG-bench benchmark. A significant number of BIG-bench tasks showed discontinuous improvements from model scale, meaning that performance steeply increased as we scaled to our largest model. PaLM also has strong capabilities in multilingual tasks and source code generation, which we demonstrate on a wide array of benchmarks. We additionally provide a comprehensive analysis on bias and toxicity, and study the extent of training data memorization with respect to model scale. Finally, we discuss the ethical considerations related to large language models and discuss potential mitigation strategies.
View details