
Sercan O. Arik
Sercan Arik is a Research Scientist at Google Cloud AI. Motivated by the mission of democratizing AI and bringing it to the most impactful use cases (in healthcare, finance, retail, media, education, communications, and many other industries), he works on making AI high-performance for the most in-demand data types, as well as interpretable, fair, data-efficient, robust, and reliable.
Before joining Google, he was a Research Scientist at Baidu Silicon Valley AI Lab. At Baidu, he focused on deep learning research, particularly for applications in human-technology interfaces. He co-developed state-of-the-art speech synthesis, keyword spotting, voice cloning, and neural architecture search systems. Prior to Baidu, he completed a PhD degree in Electrical Engineering at Stanford University in 2016. He has co-authored more than 50 journal and conference publications.
Authored Publications
ASPEST: Bridging the Gap Between Active Learning and Selective Prediction
Somesh Jha
Transactions on Machine Learning Research (TMLR) (2024)
Selective prediction aims to learn a reliable model that abstains from making predictions when uncertain. These predictions can then be deferred to humans for further evaluation. As an everlasting challenge for machine learning, in many real-world scenarios, the distribution of test data is different from the training data. This results in more inaccurate predictions, and often increased dependence on humans, which can be difficult and expensive. Active learning aims to lower the overall labeling effort, and hence human dependence, by querying the most informative examples. Selective prediction and active learning have been approached from different angles, with the connection between them missing. In this work, we introduce a new learning paradigm, active selective prediction, which aims to query more informative samples from the shifted target domain while increasing accuracy and coverage. For this new paradigm, we propose a simple yet effective approach, ASPEST, that utilizes ensembles of model snapshots with self-training with their aggregated outputs as pseudo labels. Extensive experiments on numerous image, text and structured datasets, which suffer from domain shifts, demonstrate that ASPEST can significantly outperform prior work on selective prediction and active learning (e.g. on the MNIST→SVHN benchmark with the labeling budget of 100, ASPEST improves the AUACC metric from 79.36% to 88.84%) and achieves more optimal utilization of humans in the loop.
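As a rough illustration of the selective prediction setting described above (not the ASPEST method itself), the sketch below abstains whenever the top class probability falls under a threshold, then reports coverage and the accuracy on the accepted examples. The function names, the threshold value, and the toy probabilities are assumptions made for this example.

```python
from typing import List, Optional, Sequence, Tuple


def selective_predict(probs: Sequence[Sequence[float]], threshold: float) -> List[Optional[int]]:
    """Return the argmax class per example, or None (abstain) when confidence < threshold."""
    decisions: List[Optional[int]] = []
    for p in probs:
        p = list(p)
        confidence = max(p)
        decisions.append(p.index(confidence) if confidence >= threshold else None)
    return decisions


def coverage_and_accuracy(decisions: Sequence[Optional[int]], labels: Sequence[int]) -> Tuple[float, float]:
    """Coverage = fraction of examples answered; accuracy is computed on answered examples only."""
    accepted = [(d, y) for d, y in zip(decisions, labels) if d is not None]
    coverage = len(accepted) / len(labels)
    accuracy = sum(d == y for d, y in accepted) / len(accepted) if accepted else 0.0
    return coverage, accuracy


# Toy class probabilities (hypothetical); abstained examples would be deferred to a human.
probs = [[0.9, 0.1], [0.55, 0.45], [0.2, 0.8], [0.6, 0.4]]
labels = [0, 1, 1, 1]
decisions = selective_predict(probs, threshold=0.7)
coverage, accuracy = coverage_and_accuracy(decisions, labels)
```

Raising the threshold typically trades coverage for accuracy; metrics that aggregate over thresholds summarize that trade-off.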
Teach Better or Show Smarter? On Instructions and Exemplars in Automatic Prompt Optimization
Advances in Neural Information Processing Systems (NeurIPS) (2024)
Large language models have demonstrated remarkable capabilities, but their performance is heavily reliant on effective prompt engineering. Automatic prompt optimization (APO) methods are designed to automate this and can be broadly categorized into those targeting instructions (instruction optimization, IO) vs. those targeting exemplars (exemplar selection, ES). Despite their shared objective, these have evolved rather independently, with IO recently receiving more research attention. This paper seeks to bridge this gap by comprehensively comparing the performance of representative IO and ES techniques, both in isolation and in combination, on a diverse set of challenging tasks. Our findings reveal that intelligently reusing model-generated input-output pairs obtained from evaluating prompts on the validation set as exemplars consistently improves performance over IO methods but is currently under-investigated. We also find that despite the recent focus on IO, how we select exemplars can outweigh how we optimize instructions, with ES strategies as simple as random search outperforming state-of-the-art IO methods with seed instructions without any optimization. Moreover, we observe synergy between ES and IO, with optimal combinations surpassing individual contributions. We conclude that studying exemplar selection as a standalone method and its optimal combination with instruction optimization remains a crucial aspect of APO and deserves greater consideration in future research, even in the era of highly capable instruction-following models.
Large language models (LLMs) have achieved remarkable advancements in natural language understanding, generation, and manipulation of text-based data. However, one major obstacle to their widespread deployment in the real world is that they can generate "hallucinated" answers that are not factual. To address this, this paper focuses on improving grounding from a holistic perspective with a novel framework, AGREE. We start with the design of a test-time adaptation capability that takes into account the support information generated in self-grounded responses. To effectively enable this capability, we propose that the model tuning needs to be redesigned with a novel tuning objective mimicking the test-time adaptation setup for grounding. This tuning on top of the pre-trained LLMs requires a small amount of data that needs to be constructed in a particular way to learn the grounding information, for which we introduce a data construction method. Our results show that AGREE pushes the state of the art in grounding, as demonstrated across many datasets.
SQL-PaLM: Improved Large Language Model Adaptation for Text-to-SQL
Satya Gundabathula
Hanjun Dai
TMLR (2024)
Text-to-SQL, the process of translating natural language into Structured Query Language (SQL), represents a transformative application of large language models (LLMs), potentially revolutionizing how humans interact with data. This paper introduces the SQL-PaLM framework, a comprehensive solution for understanding and enhancing Text-to-SQL using LLMs, in the learning regimes of few-shot prompting and instruction fine-tuning. With few-shot prompting, we explore the effectiveness of consistency decoding with execution-based error filtering. With instruction fine-tuning, we delve deeply into understanding the critical paradigms that influence the performance of tuned LLMs. In particular, we investigate how performance can be improved through expanded training data coverage and diversity, synthetic data augmentation, and integrating query-specific database content. We propose a test-time selection method to further refine accuracy by integrating SQL outputs from multiple paradigms with execution feedback as guidance. Additionally, we tackle the practical challenge of navigating intricate databases with a significant number of tables and columns, proposing efficient techniques for accurately selecting relevant database elements to enhance Text-to-SQL performance. Our holistic approach yields substantial advancements in Text-to-SQL, as demonstrated on two key public benchmarks, Spider and BIRD. Through comprehensive ablations and error analyses, we shed light on the strengths and weaknesses of our framework, offering valuable insights for future work on Text-to-SQL.
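One way to picture consistency decoding with execution-based error filtering is the minimal sketch below. It is an illustration under stated assumptions, not the SQL-PaLM implementation: the candidate queries are hard-coded where a real system would sample them from an LLM, and the helper name `pick_by_execution_consistency`, the toy `singer` table, and the use of SQLite are all hypothetical choices for this example.

```python
import sqlite3
from collections import Counter
from typing import List, Optional


def pick_by_execution_consistency(conn: sqlite3.Connection, candidates: List[str]) -> Optional[str]:
    """Drop candidates that fail to execute, then majority-vote over execution results."""
    executed = []
    for sql in candidates:
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error:
            continue  # execution-based error filtering: discard queries that raise
        # Sort rows so that two queries returning the same rows in a different order agree.
        executed.append((sql, tuple(sorted(rows, key=repr))))
    if not executed:
        return None
    votes = Counter(signature for _, signature in executed)
    winner, _ = votes.most_common(1)[0]
    # Return the first candidate whose result matches the most common result.
    return next(sql for sql, signature in executed if signature == winner)


# Toy in-memory database (hypothetical schema and data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO singer VALUES (?, ?)", [("Ann", 30), ("Bob", 45), ("Cy", 52)])

candidates = [
    "SELECT COUNT(*) FROM singer WHERE age > 40",
    "SELECT COUNT(name) FROM singer WHERE age > 40",
    "SELECT COUNT(*) FROM singers WHERE age > 40",  # misspelled table: fails to execute
    "SELECT COUNT(*) FROM singer WHERE age >= 30",  # executes, but disagrees with the majority
]
chosen = pick_by_execution_consistency(conn, candidates)
```

Voting on execution results rather than on query strings lets syntactically different but equivalent queries support each other.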
With the development of Large Language Models (LLMs), collaboration between LLMs to solve complex tasks has attracted increasing attention. An important and challenging task is reasoning over long text that cannot be input into LLMs. Thus far, limited research has explored how to solve long-context tasks via pure multi-agent collaboration.
In this paper, we propose Chain-of-Agents (CoA), a novel framework that leverages multi-agent collaboration via natural language to solve complex tasks. In CoA, the long text is split into chunks that are processed by agents in sequence, with each agent receiving the information appended by the preceding agents. A manager model is finally employed to obtain the final answer using the output of the last agent.
On a wide range of datasets for long-context question answering, summarization, and code completion, and with many LLMs (including PaLM 2, Claude, and Gemini), we show that the CoA framework outperforms strong baselines, including the commonly used retrieval-augmented generation (RAG) systems, by a large margin. For instance, text-bison obtains a 13.30% performance gain on NarrativeQA and 10.22% on the MuSiQue dataset.
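The control flow described above (chunk the text, pass notes from agent to agent, then hand the last agent's output to a manager) can be sketched as follows. This is only a schematic under stated assumptions: `worker` and `manager` stand in for LLM calls, and the toy versions here, along with the word-based chunking and the function names, are hypothetical choices for illustration.

```python
from typing import Callable, List


def split_into_chunks(text: str, chunk_words: int) -> List[str]:
    """Split text into consecutive chunks of at most chunk_words words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_words]) for i in range(0, len(words), chunk_words)]


def chain_of_agents(
    text: str,
    question: str,
    worker: Callable[[str, str, str], str],
    manager: Callable[[str, str], str],
    chunk_words: int = 200,
) -> str:
    """Run workers over the chunks in order, then let the manager produce the final answer."""
    notes = ""  # information handed from one worker to the next
    for chunk in split_into_chunks(text, chunk_words):
        notes = worker(chunk, notes, question)
    return manager(notes, question)


def toy_worker(chunk: str, notes: str, question: str) -> str:
    # Stand-in for an LLM call: keep a running count of the word "the".
    seen_so_far = int(notes) if notes else 0
    return str(seen_so_far + chunk.lower().split().count("the"))


def toy_manager(notes: str, question: str) -> str:
    # Stand-in for the manager LLM: format the last worker's notes as the answer.
    return f"{question} -> {notes}"


text = "the cat sat on the mat while the dog slept by the door"
answer = chain_of_agents(text, "How many times does 'the' occur?", toy_worker, toy_manager, chunk_words=4)
```

Because each worker sees only one chunk plus the accumulated notes, no single call needs the full text in its context.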
Accurate estimation of output quantiles is crucial in many use cases where it is desirable to model the range of possible outcomes. Modeling the target distribution at arbitrary quantile levels and at arbitrary input attribute levels is important for offering a comprehensive picture of the data, and it requires the quantile function to be expressive enough. The quantile function, which describes the target distribution using quantile levels, is critical for quantile regression. Although various parametric forms for the distributions (that the quantile function specifies) can be adopted, an everlasting problem is selecting the most appropriate one that can properly approximate the data distributions. In this paper, we propose a non-parametric and data-driven approach, Neural Spline Search (NSS), to represent the observed data distribution without parametric assumptions. NSS is flexible and expressive for modeling data distributions by transforming the inputs with a series of monotonic spline regressions guided by symbolic operators. We demonstrate that NSS outperforms previous methods on synthetic, real-world regression and time-series forecasting tasks.
Adaptation with Self-Evaluation to Improve Selective Prediction in LLMs
Somesh Jha
Findings of the Association for Computational Linguistics: EMNLP (2023)
Large language models (LLMs) have recently shown great advances in a variety of tasks, including natural language understanding and generation. However, their use in high-stakes decision-making scenarios is still limited due to the potential for errors. Selective prediction is a technique that can be used to improve the reliability of the LLMs by allowing them to abstain from making predictions when they are unsure of the answer. In this work, we propose a novel framework for adaptation with self-evaluation to improve the selective prediction performance of LLMs. Our framework is based on the idea of using parameter-efficient tuning to adapt the LLM to the specific task at hand while improving its ability to perform self-evaluation. We evaluate our method on a variety of question-answering (QA) datasets and show that it outperforms state-of-the-art selective prediction methods. For example, on the CoQA benchmark, our method improves the AUACC from 91.23% to 92.63% and improves the AUROC from 74.61% to 80.25%.
Universal Self-adaptive Prompting
Hanjun Dai
Empirical Methods in Natural Language Processing (EMNLP) (2023)
A hallmark of modern large language models (LLMs) is their impressive general zero-shot and few-shot abilities, often elicited through in-context learning (ICL) via prompting. However, while highly coveted and being the most general, zero-shot performances in LLMs are still typically weaker due to the lack of guidance and the difficulty of applying existing automatic prompt design methods in general tasks when ground-truth labels are unavailable. In this study, we address this by presenting Universal Self-Adaptive Prompting (USP), an automatic prompt design approach specifically tailored for zero-shot learning (while compatible with few-shot). Requiring only a small amount of unlabeled data and an inference-only LLM, USP is highly versatile: to achieve universal prompting, USP categorizes a possible NLP task into one of the three possible task types and then uses a corresponding selector to select the most suitable queries and zero-shot model-generated responses as pseudo-demonstrations, thereby generalizing ICL to the zero-shot setup in a fully automated way. We evaluate USP with PaLM and PaLM 2 models and demonstrate performances that are considerably stronger than standard zero-shot baselines and often comparable to or even superior to few-shot baselines across more than 40 natural language understanding, natural language generation, and reasoning tasks.
We propose a canonical approach for feature selection, sparse learnable masks (SLM). SLM integrates learnable sparse masks into end-to-end training. For the fundamental non-differentiability challenge of selecting a desired number of features, we propose dual mechanisms: automatic mask scaling to achieve the desired feature sparsity, and gradual tempering of this sparsity for effective learning.
In addition, SLM employs a novel objective that maximizes the mutual information (MI) between the selected features and the labels in an efficient and scalable way. Empirically, SLM achieves state-of-the-art results on several benchmark datasets, often by a significant margin, especially on challenging real-world datasets.
Multimodal large-scale pretraining has shown impressive performance gains for unstructured data including language, image, audio, and video. Yet the scenario most prominent in real-world applications is a combination of structured (including tabular and time-series) and unstructured data, and it has been understudied.
To this end, we propose LANISTR, a novel attention-based framework to learn from LANguage, Image, and STRuctured data. We introduce a new multimodal fusion module with a similarity-based multimodal masking loss that enables LANISTR to learn cross-modal relations from large-scale multimodal data with missing modalities during training and test time. On two publicly available datasets, MIMIC-IV and Amazon Product Review, LANISTR achieves absolute improvements of 6.47% (AUROC) and 8.35% (accuracy), respectively, compared to the state-of-the-art multimodal models, while showing superior generalization capabilities.