Sugato Basu

Authored Publications
    Abstract: Large-scale diffusion models have achieved state-of-the-art results on text-to-image synthesis (T2I) tasks. Despite their ability to generate high-quality and creative images, we observe that attribute binding and compositional capabilities remain major challenges, especially when multiple objects are involved. In this work, we improve the compositional skills of T2I models, specifically targeting more accurate attribute binding and better image composition. To do this, we incorporate linguistic structures into the diffusion guidance process, exploiting the controllable properties of manipulating cross-attention layers in diffusion-based T2I models. We observe that keys and values in cross-attention layers carry strong semantic meaning associated with object layout and content. Therefore, we can better preserve the compositional semantics of the generated image by manipulating the cross-attention representations based on linguistic insights. Built upon Stable Diffusion, a SOTA T2I model, our structured cross-attention design is efficient and requires no additional training samples. We achieve better compositional skills in both qualitative and quantitative results, leading to a 5-8% advantage in head-to-head user comparison studies. Lastly, we conduct an in-depth analysis to reveal potential causes of incorrect image composition and to justify the properties of cross-attention layers in the generation process.
    Discriminative Diffusion Models as Few-shot Vision and Language Learners
    Xuehai He
    Weixi Feng
    Tsu-Jui Fu
    Varun Jampani
    William Yang Wang
    Xin Eric Wang
    arXiv (2023)
    Abstract: Diffusion models, such as Stable Diffusion, have shown incredible performance on text-to-image generation. Since text-to-image generation often requires models to generate visual concepts with fine-grained details and attributes specified in text prompts, can we leverage the powerful representations learned by pre-trained diffusion models for discriminative tasks such as image-text matching? To answer this question, we propose a novel approach, Discriminative Stable Diffusion (DSD), which turns pre-trained text-to-image diffusion models into few-shot discriminative learners. Our approach uses the cross-attention score of a Stable Diffusion model to capture the mutual influence between visual and textual information, and fine-tunes the model via attention-based prompt learning to perform image-text matching. By comparing DSD with state-of-the-art methods on several benchmark datasets, we demonstrate the potential of using pre-trained diffusion models for discriminative tasks, with superior results on few-shot image-text matching.
    Abstract: Creativity is an indispensable part of human cognition and also an inherent part of how we make sense of the world. Metaphorical abstraction is fundamental in communicating creative ideas through nuanced relationships between abstract concepts such as feelings. While computer vision benchmarks and approaches predominantly focus on understanding and generating literal interpretations of images, metaphorical comprehension of images remains relatively unexplored. Towards this goal, we introduce MetaCLUE, a set of vision tasks on visual metaphor. We also collect high-quality and rich metaphor annotations (abstract objects, concepts, relationships along with their corresponding object boxes), as there do not exist any datasets that facilitate the evaluation of these tasks. We perform a comprehensive analysis of state-of-the-art models in vision and language based on our annotations, highlighting strengths and weaknesses of current approaches in visual metaphor Classification, Localization, Understanding (retrieval, question answering, captioning) and gEneration (text-to-image synthesis) tasks. We hope this work provides a concrete step towards developing AI systems with human-like creative capabilities.
    Abstract: Image ad understanding is a crucial task with wide real-world applications. Although it is highly challenging, involving diverse atypical scenes, real-world entities, and reasoning over scene text, how to interpret image ads remains relatively under-explored, especially in the era of foundation vision-language models (VLMs) featuring impressive generalizability and adaptability. In this paper, we perform the first empirical study of image ad understanding through the lens of pre-trained VLMs. We benchmark and reveal practical challenges in adapting these VLMs to image ad understanding. We propose a simple feature adaptation strategy to effectively fuse multimodal information for image ads, and further empower it with knowledge of real-world entities. We hope our study draws more attention to image ad understanding, which is broadly relevant to the advertising industry.
    LayoutGPT: Compositional Visual Planning and Generation with Large Language Models
    Weixi Feng
    Wanrong Zhu
    Tsu-Jui Fu
    Varun Jampani
    Xuehai He
    Xin Eric Wang
    William Yang Wang
    NeurIPS (2023)
    Abstract: Attaining a high degree of user controllability in visual generation often requires intricate, fine-grained inputs like layouts. However, such inputs impose a substantial burden on users compared to simple text inputs. To address this issue, we study how Large Language Models (LLMs) can serve as visual planners by generating layouts from text conditions, and thus collaborate with visual generative models. We propose LayoutGPT, a method to compose in-context visual demonstrations in a style-sheet language to enhance the visual planning skills of LLMs. LayoutGPT can generate plausible layouts in multiple domains, ranging from 2D images to 3D indoor scenes. LayoutGPT also shows superior performance in converting challenging language concepts, such as numerical and spatial relations, into layout arrangements for faithful text-to-image generation. When combined with a downstream image generation model, LayoutGPT outperforms text-to-image models/systems by 20-40% and achieves performance comparable to human users in designing visual layouts for numerical and spatial correctness. Lastly, LayoutGPT achieves performance comparable to supervised methods in 3D indoor scene synthesis, demonstrating its effectiveness and potential in multiple visual domains.
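    As a rough illustration of the in-context layout representation described above, here is a minimal sketch of serializing bounding boxes into a CSS-like string an LLM can read and complete. The property names and pixel units are assumptions for illustration, not LayoutGPT's exact prompt format:

```python
def to_css_layout(objects):
    """Serialize (name, x, y, w, h) boxes into a CSS-like layout string."""
    lines = []
    for name, x, y, w, h in objects:
        lines.append(f"{name} {{left: {x}px; top: {y}px; "
                     f"width: {w}px; height: {h}px}}")
    return "\n".join(lines)

# Two hypothetical objects with top-left coordinates and box sizes.
layout = to_css_layout([("cat", 10, 40, 120, 80), ("dog", 150, 60, 100, 90)])
print(layout)
```

    The point of the style-sheet framing is that the structured format is familiar from the LLM's pretraining data, so few-shot demonstrations in this form transfer well.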
    CPL: Counterfactual Prompt Learning for Vision and Language Models
    Xuehai He
    Diji Yang
    Weixi Feng
    Tsu-Jui Fu
    Varun Jampani
    William Yang Wang
    Xin Eric Wang
    Conference on Empirical Methods in Natural Language Processing (EMNLP) (2022)
    Abstract: Prompt tuning is a new few-shot transfer learning technique that tunes only the learnable prompt for pre-trained vision and language models such as CLIP. However, existing prompt tuning methods tend to learn spurious or entangled representations, which leads to poor generalization to unseen concepts. Towards non-spurious and efficient prompt learning from limited examples, this paper presents a novel Counterfactual Prompt Learning (CPL) method for vision and language models, which simultaneously employs counterfactual generation and contrastive learning in a joint optimization framework. In particular, CPL constructs counterfactuals by identifying the minimal non-spurious feature change between semantically similar positive and negative samples that causes concept change, and learns more generalizable prompt representations from both factual and counterfactual examples via contrastive learning. Extensive experiments demonstrate that CPL obtains superior few-shot performance on different vision and language tasks compared to previous prompt tuning methods on CLIP. On image classification, we achieve a 3.55% average relative improvement on unseen classes across seven datasets; on image-text retrieval and visual question answering, we gain up to 4.09% and 25.08% relative improvements across three few-shot scenarios on unseen test sets, respectively.
    Abstract: A major challenge in visually grounded language generation is to build robust benchmark datasets and models that can generalize well in real-world settings. To do this, it is critical to ensure that our evaluation protocols are correct and our benchmarks are reliable. In this work, we design a set of experiments to understand an important but often ignored problem in visually grounded language generation: given that humans have different utilities and visual attention, how will the sample variance in multi-reference datasets affect the models' performance? Empirically, we study several multi-reference datasets and corresponding vision-and-language tasks. We show that it is of paramount importance to report variance in experiments; that human-generated references can vary drastically across datasets and tasks, revealing the nature of each task; and that, metric-wise, CIDEr shows systematically larger variance than other metrics. Our evaluation of references per instance sheds light on the design of reliable datasets in the future.
    Product Phrase Extraction from e-Commerce Pages
    Dmitrii Tochilkin
    Kazoo Sone
    The Proceedings of The Web Conference 2019, Companion
    Abstract: Analyzing commercial pages to infer the products or services offered by a web-based business is a task central to product search, product recommendation, ad placement, and other e-commerce tasks. What makes this task challenging is that there are two types of e-commerce product pages. One is the single-product (SP) page, where one product is featured primarily and users can buy that product or add it to the cart on the page. The other is the multi-product (MP) page, where users are presented with multiple (often 10-100) choices of products within the same category, often with thumbnail pictures and brief descriptions; users browse through the catalogue until they find a product they want to learn more about, and subsequently purchase the product of their choice on a corresponding SP page. In this paper, we take a two-step approach to identifying product phrases from commercial pages. First, we classify whether a commercial web page is an SP or MP page; to that end, we introduce two different image-recognition-based models to differentiate between these two types of pages. If the page is determined to be an SP page, we identify the main product featured on that page. We compare the two types of image recognition models in terms of the trade-off between accuracy and latency, and empirically demonstrate the efficacy of our overall approach.
    Micro-Browsing Models for Search Snippets
    International Conference on Data Engineering (ICDE), IEEE (2019), pp. 1904-1909
    Abstract: Click-through rate (CTR) is a key signal of relevance for search engine results, both organic and sponsored. CTR is the product of the probability of examination times the perceived relevance of the result. Hence there has been considerable work on user browsing models to separate out the examination and relevance components of CTR. However, the snippet text often plays a critical role in the perceived relevance of the result. In this paper, we propose a micro-browsing model for how users read result snippets. We validate the user model by considering the problem of predicting which of two different snippets will yield higher CTR. We show that classification accuracy is dramatically higher with our user model.
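    The decomposition above (CTR as examination probability times perceived relevance) can be made concrete with a tiny numeric sketch; all values below are assumed for illustration, not taken from the paper:

```python
# CTR = P(examination) * P(click | examination).
p_exam = {1: 0.9, 2: 0.6, 3: 0.3}  # assumed examination probability by result position
relevance = 0.5                    # assumed perceived relevance of one fixed snippet

ctr = {pos: round(p * relevance, 3) for pos, p in p_exam.items()}
print(ctr)  # the same snippet shown at a lower position draws proportionally fewer clicks
```

    Holding relevance fixed while examination varies by position is exactly why raw CTR cannot be read directly as a relevance signal; a browsing model must factor the two apart.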
    Graphical RNN Models
    Ashish Bora
    Joydeep Ghosh
    (2016)
    Abstract: Many time series are generated by a set of entities that interact with one another over time. This paper introduces a broad, flexible framework to learn from multiple inter-dependent time series generated by such entities. Our framework explicitly models the entities and their interactions through time. It achieves this by building on the capabilities of Recurrent Neural Networks, while also offering several ways to incorporate domain knowledge/constraints into the model architecture. The capabilities of our approach are showcased through an application to weather prediction, which shows gains over strong baselines.
    PLUMS: Predicting Links Using Multiple Sources
    Karthik Subbian
    Arindam Banerjee
    SIAM Conference on Data Mining (SDM) (2015)
    Abstract: Link prediction is an important problem in online social and collaboration networks, for recommending friends and future collaborators. Most existing approaches for link prediction focus on building unsupervised or supervised classification models based on the availability of accepts and rejects of past recommendations. Several of these methods are feature-based, constructing a large number of network-level features to make the prediction more effective. A more flexible approach is to allow the model to learn the required features from the network for a specific task, rather than relying on explicit feature engineering. In addition, most social and collaboration relationships do not form instantly; rather, they build slowly over time through several low-cost interactions, such as email and chat. Existing approaches often ignore the availability of such auxiliary networks, which can make link prediction more robust and effective. The main focus of this work is to build a robust and effective classifier for link prediction using multiple auxiliary networks. We develop a supervised random walk model that does not require any explicit feature construction and can be personalized to each user based on past accept and reject behavior. Our approach consistently outperforms several popular baselines in terms of precision and recall on multiple real-life data sets. Our approach is also robust to noise and sparsity in the auxiliary networks, while several popular baselines, specifically feature-based ones, are inconsistent in their performance under such conditions.
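    To make the random-walk-over-multiple-networks idea concrete, here is a minimal sketch (not the paper's supervised estimator): score link candidates for a source user by a restart random walk over a weighted mix of the primary network and one auxiliary network such as email. The fixed mixing weight `alpha_aux` stands in for the edge weights a supervised model would learn from accept/reject feedback:

```python
import itertools

def walk_scores(primary, auxiliary, source, alpha_aux=0.3, restart=0.15, iters=50):
    """Personalized-PageRank-style scores from `source` over the mixed graph."""
    nodes = set(primary) | set(auxiliary)
    for nbrs in itertools.chain(primary.values(), auxiliary.values()):
        nodes |= set(nbrs)
    score = {n: float(n == source) for n in nodes}
    for _ in range(iters):
        nxt = {n: restart * (n == source) for n in nodes}
        for n in nodes:
            out = {}  # combined out-edge weights of node n
            for m in primary.get(n, ()):
                out[m] = out.get(m, 0.0) + (1.0 - alpha_aux)
            for m in auxiliary.get(n, ()):
                out[m] = out.get(m, 0.0) + alpha_aux
            total = sum(out.values())
            if total:
                for m, w in out.items():
                    nxt[m] += (1.0 - restart) * score[n] * w / total
        score = nxt
    return score

# Hypothetical data: a friendship network plus an auxiliary email network.
friends = {"ann": ["bob"], "bob": ["ann", "cat"], "cat": ["bob"]}
email = {"ann": ["dan"], "dan": ["ann"]}
scores = walk_scores(friends, email, "ann")
```

    Ranking non-neighbors of `ann` by these scores yields candidate links; here the auxiliary email tie lets `dan` be reached even though he has no friendship edge at all.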
    RolX: structural role extraction & mining in large graphs
    Keith Henderson
    Brian Gallagher
    Tina Eliassi-Rad
    Hanghang Tong
    Leman Akoglu
    Danai Koutra
    Christos Faloutsos
    Lei Li
    KDD (2012)
    Abstract: Given a network, intuitively two nodes belong to the same role if they have similar structural behavior. Roles should be automatically determined from the data, and could be, for example, “clique-members,” “periphery-nodes,” etc. Roles enable numerous novel and useful network-mining tasks, such as sense-making, searching for similar nodes, and node classification. This paper addresses the question: given a graph, how can we automatically discover roles for nodes? We propose RolX (Role eXtraction), a scalable (linear in the number of edges), unsupervised learning approach for automatically extracting structural roles from general network data. We demonstrate the effectiveness of RolX on several network-mining tasks, from exploratory data analysis to network transfer learning. Moreover, we compare network role discovery with network community discovery. We highlight fundamental differences between the two (e.g., roles generalize across disconnected networks, communities do not) and show that the two approaches are complementary in nature.
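    The structural-role intuition above can be sketched with recursively aggregated features (an assumed simplification: the full method pairs features like these with non-negative matrix factorization to assign roles). Start from a local feature such as degree, then repeatedly append neighborhood aggregates; nodes with similar feature vectors play similar structural roles even if they are far apart or in disconnected components:

```python
def structural_features(adj, rounds=2):
    """Recursive structural features: degree plus repeated neighbor means."""
    feats = {n: [float(len(nbrs))] for n, nbrs in adj.items()}  # start: degree
    for _ in range(rounds):
        new = {}
        for n, nbrs in adj.items():
            # append the mean of each existing feature over n's neighborhood
            agg = [sum(feats[m][i] for m in nbrs) / max(len(nbrs), 1)
                   for i in range(len(feats[n]))]
            new[n] = feats[n] + agg
        feats = new
    return feats

# In a star graph, all leaves get identical feature vectors (one role),
# distinct from the hub's vector (another role).
star = {"hub": ["a", "b", "c"], "a": ["hub"], "b": ["hub"], "c": ["hub"]}
feats = structural_features(star)
```

    This illustrates why roles, unlike communities, generalize across disconnected networks: the features depend only on local structure, not on which component a node sits in.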
    Hierarchical Mixtures of GLMs for Combining Multiple Ground Truths
    Joseph Reisinger
    Roberto Bayardo
    NIPS Domain Adaptation Workshop (2011)
    Abstract: In real-world machine learning problems it is often the case that the gold standard for a particular learning problem is not accurately reflected by any one particular data set. For example, when modeling the landing-page quality associated with a search result, labels from human evaluators are often biased towards “brandname” sites, whereas labels derived from conversions can potentially confound search abandonment and successful conversion. In this paper we propose a class of models for characterizing and isolating the relative bias of a prediction problem across multiple data sets. These models can be used either as tools for data analysis, with the goal of calculating the divergence to the hypothetical gold standard, or as smoothing procedures aimed at capturing as much shared structure between the domains as possible.
    MapReduce and Its Application to Massively Parallel Learning of Decision Tree Ensembles
    Biswanath Panda
    Roberto J Bayardo
    Scaling up Machine Learning: Parallel and Distributed Approaches (2011)
    Abstract: In this chapter we look at leveraging the MapReduce distributed computing framework (Dean and Ghemawat, 2004) for parallelizing machine learning methods of wide interest, with a specific focus on learning ensembles of classification or regression trees. Building a production-ready implementation of a distributed learning algorithm can be a complex task. With the wide and growing availability of MapReduce-capable computing infrastructures, it is natural to ask whether such infrastructures may be of use in parallelizing common data mining tasks such as tree learning. For many data mining applications, MapReduce may offer scalability as well as ease of deployment in a production setting (for reasons explained later). We initially give an overview of MapReduce and outline its application in a classic clustering algorithm, k-means. Subsequently, we focus on PLANET: a scalable distributed framework for learning tree models over large datasets. PLANET defines tree learning as a series of distributed computations and implements each one using the MapReduce model. We show how this framework supports scalable construction of classification and regression trees, as well as ensembles of such models. We discuss the benefits and challenges of using a MapReduce compute cluster for tree learning and demonstrate the scalability of this approach by applying it to a real-world learning task from the domain of computational advertising. MapReduce is a simple model for distributed computing that abstracts away many of the difficulties in parallelizing data management operations across a cluster of commodity machines. By using MapReduce, one can alleviate, if not eliminate, many complexities such as data partitioning, scheduling tasks across many machines, handling machine failures, and performing inter-machine communication. These properties have motivated many companies to run MapReduce frameworks on their compute clusters for data analysis and other data management tasks. MapReduce has become in some sense an industry standard. For example, there are open-source implementations such as Hadoop that can be run either in-house or on cloud computing services such as Amazon EC2.
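    The k-means example outlined above can be phrased as a single MapReduce iteration; this is a hedged single-machine sketch of the pattern, not the chapter's distributed implementation: map emits each point keyed by its nearest centroid, the shuffle groups points by key, and reduce averages each group into a new centroid.

```python
from math import dist

def kmeans_map(point, centroids):
    """Map: emit (index of nearest centroid, point)."""
    return min(range(len(centroids)),
               key=lambda i: dist(point, centroids[i])), point

def kmeans_reduce(points):
    """Reduce: average all points that share one centroid key."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))

def kmeans_step(points, centroids):
    """One iteration: map, shuffle by key, reduce each group."""
    groups = {}
    for p in points:
        key, value = kmeans_map(p, centroids)
        groups.setdefault(key, []).append(value)
    return [kmeans_reduce(ps) for _, ps in sorted(groups.items())]

new_centroids = kmeans_step(
    [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)],
    [(0.0, 0.0), (10.0, 10.0)])
```

    In a real MapReduce job the map calls run on different workers over data shards and the framework performs the shuffle; the per-record map and per-key reduce logic is what the programmer writes.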
    User browsing models: relevance versus examination
    Ni Wang
    Daryl Pregibon
    Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, Washington, DC (2010), pp. 223-232
    Abstract: There has been considerable work on user browsing models for search engine results, both organic and sponsored. The click-through rate (CTR) of a result is the product of the probability of examination (will the user look at the result) times the perceived relevance of the result (probability of a click given examination). Past papers have assumed that when the CTR of a result varies based on the pattern of clicks in prior positions, this variation is solely due to changes in the probability of examination. We show that, for sponsored search results, a substantial portion of the change in CTR when conditioned on prior clicks is in fact due to a change in the relevance of results for that query instance, not just due to a change in the probability of examination. We then propose three new user browsing models, which attribute CTR changes solely to changes in relevance, solely to changes in examination (with an enhanced model of user behavior), or to both changes in relevance and examination. The model that attributes all the CTR change to relevance yields substantially better predictors of CTR than models that attribute all the change to examination, and does only slightly worse than the model that attributes CTR change to both relevance and examination. For predicting relevance, the model that attributes all the CTR change to relevance again does better than the model that attributes the change to examination. Surprisingly, we also find that one model might do better than another in predicting CTR, but worse in predicting relevance. Thus it is essential to evaluate user browsing models with respect to accuracy in predicting relevance, not just CTR.
    Predicting Bounce Rates in Sponsored Search Advertisements
    Robert Malkin
    Roberto J. Bayardo
    Proc. of the 15th International ACM-SIGKDD Conference on Knowledge Discovery and Data Mining, ACM (2009), pp. 1325-1334
    PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce
    Biswanath Panda
    Roberto J. Bayardo
    Proceedings of the 35th International Conference on Very Large Data Bases (VLDB-2009)
    Abstract: Classification and regression tree learning on massive datasets is a common data mining task at Google, yet many state-of-the-art tree learning algorithms require training data to reside in memory on a single machine. While more scalable implementations of tree learning have been proposed, they typically require specialized parallel computing architectures. In contrast, the majority of Google’s computing infrastructure is based on commodity hardware. In this paper, we describe PLANET: a scalable distributed framework for learning tree models over large datasets. PLANET defines tree learning as a series of distributed computations, and implements each one using the MapReduce model of distributed computation. We show how this framework supports scalable construction of classification and regression trees, as well as ensembles of such models. We discuss the benefits and challenges of using a MapReduce compute cluster for tree learning, and demonstrate the scalability of this approach by applying it to a real-world learning task from the domain of computational advertising.
    A Social Query Model for Decentralized Search
    Arindam Banerjee
    Second ACM Workshop on Social Network Mining and Analysis at the KDD Conference (SNAKDD-08) (2008)
    Adaptive Product Normalization: Using Online Learning for Record Linkage in Comparison Shopping
    Mikhail Bilenko
    Mehran Sahami
    Proceedings of the 5th IEEE International Conference on Data Mining (2005), pp. 58-65
    iLink: Search and Routing in Social Networks
    Jeffrey Davitz
    Jiye Yu
    David Gutelius
    Alexandra Harris
    Proc. 13th KDD, ACM, San Jose (2007), pp. 931-940
    Semi-supervised graph clustering: a kernel approach
    Brian Kulis
    Inderjit S. Dhillon
    Raymond J. Mooney
    ICML (2005), pp. 457-464
    Model-based overlapping clustering
    Arindam Banerjee
    Chase Krumpelman
    Joydeep Ghosh
    Raymond J. Mooney
    KDD (2005)
    Integrating constraints and metric learning in semi-supervised clustering
    Mikhail Bilenko
    Raymond J. Mooney
    ICML (2004)
    Active Semi-Supervision for Pairwise Constrained Clustering
    Arindam Banerjee
    Raymond J. Mooney
    SDM (2004)
    A probabilistic framework for semi-supervised clustering
    Mikhail Bilenko
    Raymond J. Mooney
    KDD (2004)
    Semi-supervised Clustering by Seeding
    Arindam Banerjee
    Raymond J. Mooney
    ICML (2002)
    Evaluating the novelty of text-mined rules using lexical knowledge
    Raymond J. Mooney
    Krupakar V. Pasupuleti
    Joydeep Ghosh
    KDD (2001), pp. 233-238