Yeounoh Chung

SystemsResearch@Google.
Authored Publications
    Large Language Models (LLMs) have demonstrated impressive capabilities across a range of natural language processing tasks. In particular, improvements in reasoning abilities and the expansion of context windows have opened new avenues for leveraging these powerful models. NL2SQL is challenging in that the natural language question is inherently ambiguous, while the SQL generation requires a precise understanding of complex data schema and semantics. One approach to this semantic ambiguity problem is to provide more, and sufficient, contextual information. In this work, we explore the performance and latency trade-offs of the extended context window (a.k.a. long context) offered by Google's state-of-the-art LLM (gemini-1.5-pro). We study the impact of various kinds of contextual information, including column example values, question and SQL query pairs, user-provided hints, SQL documentation, and schema. To the best of our knowledge, this is the first work to study how the extended context window and extra contextual information can help NL2SQL generation with respect to both accuracy and latency cost. We show that long-context LLMs are robust and do not get lost in the extended contextual information. Additionally, our long-context NL2SQL pipeline based on Google's gemini-1.5-pro achieves strong performance, with 67.41% on the BIRD benchmark (dev), without fine-tuning or expensive self-consistency-based techniques.
    Slice Finder: Automated Data Slicing for Model Validation
    Neoklis Polyzotis
    Steven Whang
    Tim Klas Kraska
    Proceedings of the IEEE Int' Conf. on Data Engineering (ICDE), 2019 (to appear)
    As machine learning (ML) systems become democratized, helping users easily debug their models becomes increasingly important. Yet current data tools are still primitive when it comes to helping users trace model performance problems all the way to the data. We focus on the particular problem of slicing data to identify subsets of the training data where the model performs poorly. Unlike general techniques (e.g., clustering) that can find arbitrary slices, our goal is to find interpretable slices (which are easier to act on than arbitrary subsets) that are problematic and large. We propose Slice Finder, which is an interactive framework for identifying such slices using statistical techniques. The slices can be used for applications like diagnosing model fairness and fraud detection, where describing slices that are interpretable to humans is necessary.
    Automated Data Slicing for Model Validation: A Big data - AI Integration Approach
    Steven Whang
    Tim Klas Kraska
    Alkis Polyzotis
    Ki Hyun Tae
    IEEE Transactions on Knowledge and Data Engineering (2019)
    As machine learning systems become democratized, it becomes increasingly important to help users easily debug their models. However, current data tools are still primitive when it comes to helping users trace model performance problems all the way to the data. We focus on the particular problem of slicing data to identify subsets of the validation data where the model performs poorly. This is an important problem in model validation because the overall model performance can fail to reflect that of the smaller subsets, and slicing allows users to analyze the model performance at a more granular level. Unlike general techniques (e.g., clustering) that can find arbitrary slices, our goal is to find interpretable slices (which are easier to act on than arbitrary subsets) that are problematic and large. We propose Slice Finder, which is an interactive framework for identifying such slices using statistical techniques. Applications include diagnosing model fairness and fraud detection, where identifying slices that are interpretable to humans is crucial. This research is part of a larger trend of Big data and Artificial Intelligence (AI) integration and opens many opportunities for new research.