Da-Cheng Juan
Da-Cheng Juan is a software engineer at Google Research. Da-Cheng has worked on large-scale, semi-supervised learning with Expander, as well as personalized recommendation for computational advertising. Prior to joining Google, Da-Cheng received his Ph.D. from Carnegie Mellon University in 2014. His research interests include machine learning, convex optimization, and data mining.
Authored Publications
Sort By
Preview abstract
Augmenting LLMs with context leads to improved performance across many applications. Despite much research on Retrieval Augmented Generation (RAG) systems, an open question is whether errors arise because LLMs fail to utilize the context from retrieval or the context itself is insufficient to answer the query. To shed light on this, we develop a new notion of sufficient context, along with a way to classify instances that have enough information to answer the query. We then use sufficient context to analyze several models and datasets. By stratifying errors based on context sufficiency, we find that proprietary LLMs (Gemini, GPT, Claude) excel at answering queries when the context is sufficient, but often output incorrect answers instead of abstaining when the context is not. On the other hand, open-source LLMs (Llama, Mistral, Gemma) hallucinate or abstain often, even with sufficient context. We further categorize cases when the context is useful, and improves accuracy, even though it does not fully answer the query and the model errs without the context. Building on our findings, we explore ways to reduce hallucinations in RAG systems, including a new selective generation method that leverages sufficient context information for guided abstention. Our method improves the fraction of correct answers among times where the model responds by 2--10% for Gemini, GPT, and Gemma.
View details
DreamSync: Aligning Text-to-Image Generation with Image Understanding Models
Jiao Sun
Yushi Hu
Deqing Fu
Royi Rassin
Su Wang
Charles Herrmann
Ranjay Krishna
Synthetic Data for Computer Vision Workshop @ CVPR 2024
Preview abstract
Text-to-Image (T2I) models still struggle to produce images that are both beautiful and faithful to the user's input text prompt. Recent frameworks to evaluate the faithfulness of T2I models, such as TIFA, have observed that large vision-language models (VLMs) can reliably analyze the generated images and measure the alignment to the text prompts. Building on this insight, we introduce DreamSync, a model-agnostic training algorithm that utilizes VLM feedback to improve T2I models. The main idea behind DreamSync is to bootstrap T2I models with their own generations. First, we use the T2I model to generate several candidate images. Then, we use two VLMs as data selectors: one is a Visual Question Answering (VQA) model that measures the alignment of generated images to user prompts, and the other measures the image aesthetic quality. After selecting the top candidate images, we use LoRA to iteratively fine-tune the T2I model. Despite its simplicity, DreamSync improves both the semantic alignment and aesthetic appeal of two diffusion-based T2I models, evidenced by multiple benchmarks (+1.77% on TIFA, +2.8% on DSG1K, +3.76% on VILA aesthetic) and human evaluations. DreamSync does not need any additional human annotation, model architecture changes, or reinforcement learning.
View details
Preview abstract
Large language models (LLMs) have demonstrated remarkable capabilities, but their outputs can sometimes be unreliable or factually incorrect. To address this, we introduce Self Logits Evolution Decoding (SLED), a novel decoding framework that enhances the truthfulness of LLMs without relying on external knowledge bases or requiring further fine-tuning. From an optimization perspective, our SLED framework leverages the latent knowledge embedded within the LLM by contrasting the output logits from the final layer with those from early layers. It then utilizes an approximate gradient approach to enable latent knowledge to guide the self-refinement of outputs, thereby effectively improving factual accuracy. Extensive experiments have been conducted on established benchmarks across a diverse range of model families (LLaMA 2, LLaMA 3, Gemma) and scales (from 2B to 70B), including more advanced architectural configurations such as the mixture of experts (MoE). Our evaluation spans a wide variety of tasks, including multi-choice, open-generation, and adaptations to chain-of-thought reasoning tasks. The results demonstrate that SLED consistently improves factual accuracy by up to 20\% compared to existing decoding methods while maintaining natural language fluency and negligible latency overhead. Furthermore, it can be flexibly combined with other decoding methods to further enhance their performance.
View details
Substance or Style: What Does Your Image Embedding Know?
Charles Herrmann
Chun-Sung Ferng
Dilip Krishnan
NeurIPS 2023 Workshop on Distribution Shifts (DistShift) New Frontiers with Foundation Models
Preview abstract
Probes are small networks that predict properties of underlying data from embeddings, and they provide a targeted, effective way to illuminate the information contained in embeddings. While analysis through the use of probes has become standard in NLP, there has been much less exploration in vision. Image foundation models have primarily been evaluated for semantic content. Better understanding the non-semantic information in popular embeddings (e.g., MAE, SimCLR, or CLIP) will shed new light both on the training algorithms and on the uses for these foundation models. We design a systematic transformation prediction task and measure the visual content of embeddings along many axes, including image style, quality, and a range of natural and artificial transformations. Surprisingly, six embeddings (including SimCLR) encode enough non-semantic information to identify dozens of transformations. We also consider a generalization task, where we group similar transformations and hold out several for testing. We find that image-text models (CLIP and ALIGN) are better at recognizing new examples of style transfer than masking-based models (CAN and MAE). Overall, our results suggest that the choice of pre-training algorithm impacts the types of information in the embedding, and certain models are better than others for non-semantic downstream tasks.
View details
OmniNet: Omnidirectional Representations from Transformers
Yi Tay
Vamsi Aribandi
ICML 2021
Preview abstract
This paper proposes Omnidirectional Representations from Transformers (\textsc{OmniNet}). In OmniNet, instead of maintaing a strictly horizontal receptive field, each token is allowed to attend to all tokens in the entire network. This process can also be interpreted as a form of extreme or intensive attention mechanism that has the receptive field of the entire width and depth of the network. To this end, the omnidirection attention is learned via a meta-learner, which is essentially another self-attention based model. In order to mitigate the computationally expensive costs of full receptive field attention, we leverage efficient self-attention models such as kernel-based \cite{choromanski2020rethinking}, low-rank attention \cite{wang2020linformer} and/or Big Bird \cite{zaheer2020big} as the meta-learner. We conduct extensive experiments on autoregressive language modeling (LM1B, C4), Machine Translation, Long Range Arena (LRA) and Image Recognition, showing that OmniNet not only achieves considerable improvements when equipped with both sequence-based (1D) Transformers but also on image recognition (finetuning and few shot learning) tasks. OmniNet also achieves state-of-the-art performance on LM1B, WMT'14 En-De/En-Fr and Long Range Arena.
View details
Preview abstract
Achieving state-of-the-art performance on natural language understanding tasks typically relies on fine-tuning a fresh model for every task. Consequently, this approach leads to a higher overall parameter cost, along with higher technical maintenance for serving multiple models. Learning a single multi-task model that is able to do well for all the tasks has been a challenging and yet attractive proposition. In this paper, we propose \textsc{HyperGrid}, a new approach for highly effective multi-task learning. The proposed approach is based on a decomposable hypernetwork that learns grid-wise projections, which helps to specialize regions in weight matrices for different tasks. In order to construct the proposed hyper projection, our method learns the interactions and composition between a global state and a local task-specific state. We apply our proposed \textsc{HyperGrid} on the current state-of-the-art T5 model, yielding optimistic and strong gains across GLUE and SuperGLUE benchmarks when trained in a single model multi-tasking setup. Our method helps to bridge the gap between the single-task finetune methods and the single model multi-tasking approaches
View details
Graph-RISE: Graph-Regularized Image Semantic Embedding
Aleksei Timofeev
Futang Peng
Krishnamurthy Viswanathan
Lucy Gao
Sujith Ravi
Yi-ting Chen
Zhen Li
The 12th International Conference on Web Search and Data Mining (2020) (to appear)
Preview abstract
Learning image representation to capture instance-based semantics has been a challenging and important task for enabling many applications such as image search and clustering. In this paper, we explore the limits of image embedding learning at unprecedented scale and granularity. We present Graph-RISE, an image embedding that captures very fine-grained, instance-level semantics. Graph-RISE is learned via a large-scale, neural graph learning framework that leverages graph structure to regularize the training of deep neural networks. To the best of our knowledge, this is the first work that can capture instance-level image semantics at million—O(40M)—scale. Experimental results show that Graph-RISE outperforms state-of-the-art image embedding algorithms on several evaluation tasks, including image classification and triplet ranking. We also provide case studies to demonstrate that, qualitatively, image retrieval based on Graph-RISE well captures the semantics and differentiates nuances at instance level.
View details
Neural Structured Learning in TensorFlow: Hands-On Tutorial at KDD
Chun-Sung Ferng
George Yu
(2020), pp. 3501-3502
Preview abstract
We present Neural Structured Learning (NSL) in TensorFlow, a new learning paradigm to train neural networks by leveraging structured signals in addition to feature inputs. Structure can be explicit as represented by a graph, or implicit, either induced by adversarial perturbation or inferred using techniques like embedding learning. NSL is open-sourced as part of the TensorFlow ecosystem and is widely used in Google across many products and services. In this tutorial, we provide an overview of the NSL framework including various libraries, tools, and APIs as well as demonstrate the practical use of NSL in different applications. The NSL website is hosted at www.tensorflow.org/neural_structured_learning, which includes details about the theoretical foundations of the technology, extensive API documentation, and hands-on tutorials.
View details
Preview abstract
We propose Sparse Sinkhorn Attention, a new efficient and sparse method for learning to attend. Our method is based on differentiable sorting of internal representations. Concretely, we introduce a meta sorting network that learns to generate latent permutations over sequences. Given sorted sequences, we are then able to compute quasi-global attention with only local windows, improving the memory efficiency of the attention module. To this end, we propose new algorithmic innovations such as Causal Sinkhorn Balancing and SortCut, a dynamic sequence truncation method for tailoring Sinkhorn Attention for encoding and/or decoding purposes. Via extensive experiments on algorithmic seq2seq sorting, language modeling, pixel-wise image generation, document classification and natural language inference, we demonstrate that our Sinkhorn Attention remains competitive to the vanilla attention, consistently outperforming recently proposed efficient Transformer models such as Sparse Transformers, while retaining memory efficiency.
View details
COCO-GAN: Generation by Parts via Conditional Coordinating
Chieh Hubert Lin
Chia-Che Chang
Yu-Sheng Chen
Wei Wei
Hwann-Tzong Chen
International Conference on Computer Vision (ICCV) (2019)
Preview abstract
We present a new architecture of generative adversarial nets (GANs): \underline{CO}nditional \underline{CO}ordinate GAN (\modelNamePunc). Given a latent vector and spatial positions, the generator learns to produce position-aware image patches; each patch is generated independently (referred as ``spatial disentanglement''), and without any post-processing, the produced patches can further be composed into a full image that is locally smooth and globally coherent. Without additional hyper-parameter tuning, the images composed by \modelName are qualitatively competitive with those generated by state-of-the-art GANs. In addition to the spatial disentanglement property, \modelName learns via coordinates, and can generalize to different predefined coordinate systems. We take panorama as a case study to demonstrate that, in addition to Cartesian coordinates, \modelName can also learn in a cylindrical coordinate system that is cyclic in the horizontal direction. We further investigate and demonstrate three new applications of \modelName. ``Patch-Inspired Image Generation'' takes an image patch and generates a full image containing a local patch similar to the given one. We show that the generated image can loosely retain some local structure or global characteristic of the original image. ``Partial-Scene Generation'' uses the controllable spatial disentanglement to render patches within the designated region without spending resources on generating pixels outside the region. ``Computational-Friendly Generation'' demonstrates multiple advantages of \modelName, including higher parallelism and lower memory requirement.
View details