Joonseok Lee
Joonseok Lee is a research engineer in the Foresight group at Google Research. He mainly works on multi-modal video representation learning. He earned his Ph.D. in Computer Science from the Georgia Institute of Technology in August 2015, under the supervision of Dr. Guy Lebanon and Prof. Hongyuan Zha. His thesis focused on local approaches for collaborative filtering, with recommendation systems as the main application. During his Ph.D., he completed three internships, at Amazon (Summer 2014), Microsoft Research (Spring 2014), and Google (Summer 2013). Before coming to Georgia Tech, he worked at NHN Corp. in Korea (2007-2010). He received his B.S. degree in computer science and engineering from Seoul National University, Korea. His paper "Local Collaborative Ranking" received the Best Student Paper award at the 23rd International World Wide Web Conference (2014). He has served on the program committees of many conferences, including NIPS, ICML, CVPR, ICCV, AAAI, WSDM, and CIKM, and as a reviewer for journals including JMLR, ACM TIST, and IEEE TKDE. He co-organized the YouTube-8M Large-Scale Video Understanding Workshop as a program chair and served as the publicity chair for the AISTATS 2015 conference. He is currently serving as a reviewer for the Google Faculty Research Awards Program. More information is available on his website (http://www.joonseok.net).
Authored Publications
Towards a Complete Benchmark on Video Moment Localization
Jinyeong Chae
Donghwa Kim
Kwanseok Kim
Doyeon Lee
Sangho Lee
Seongsu Ha
Jonghwan Mun
Wooyoung Kang
Byungseok Roh
(2024)
In this paper, we propose and conduct a comprehensive benchmark on the moment localization task, which aims to retrieve the segment of a single untrimmed video that corresponds to a text query. Our study starts from the observation that most moment localization papers report experimental results on only a few datasets, despite the availability of far more benchmarks. We therefore conduct an extensive benchmark study to measure the performance of representative methods on 7 widely used datasets. Looking further into the details, we pose additional research questions and empirically verify them, including whether models rely on unintended biases introduced by specific training data, whether advanced visual features trained on a classification task transfer well to this task, and whether the computational cost of each model pays off. With this series of experiments, we provide a multifaceted evaluation of state-of-the-art moment localization models. Code is available at https://github.com/snuviplab/MoLEF.
V2Meow: Meowing to the Visual Beat via Video-to-Music Generation
Chris Donahue
Dima Kuzmin
Judith Li
Kun Su
Mauro Verzetti
Qingqing Huang
Yu Wang
Vol. 38 No. 5: AAAI-24 Technical Tracks 5, AAAI Press (2024), pp. 4952-4960
Video-to-music generation demands both a temporally localized high-quality listening experience and globally aligned video-acoustic signatures. While recent music generation models excel at the former through advanced audio codecs, the exploration of video-acoustic signatures has been confined to specific visual scenarios. In contrast, our research confronts the challenge of learning globally aligned signatures between video and music directly from paired music and videos, without explicitly modeling domain-specific rhythmic or semantic relationships. We propose V2Meow, a video-to-music generation system capable of producing high-quality music audio for a diverse range of video input types using a multi-stage autoregressive model. Trained on 5k hours of music audio clips paired with video frames mined from in-the-wild music videos, V2Meow is competitive with previous domain-specific models when evaluated in a zero-shot manner. It synthesizes high-fidelity music audio waveforms solely by conditioning on pre-trained general purpose visual features extracted from video frames, with optional style control via text prompts. Through both qualitative and quantitative evaluations, we demonstrate that our model outperforms various existing music generation systems in terms of visual-audio correspondence and audio quality. Music samples are available at tinyurl.com/v2meow.
Referring Image Segmentation (RIS) is a comprehensive task that segments an object referred to by a textual query from an image. By nature, the difficulty of this task is affected by the existence of similar objects and by the complexity of the referring expression. Recent RIS models still show a significant performance gap between easy and hard scenarios. We posit that the bottleneck lies in the data, and propose a simple but powerful data augmentation method, Negative-mined Mosaic Augmentation (NeMo). This method augments a training image into a mosaic with three other negative images, carefully curated by a pretrained multimodal alignment model, e.g., CLIP, to make the sample more challenging. We discover that it is critical to properly adjust the difficulty level, so that the augmented sample is neither too ambiguous nor too trivial. The augmented training data encourages the RIS model to recognize subtle differences and relationships between similar visual entities, and to understand the whole expression in order to locate the right target. Our approach shows consistent improvements on various datasets and models, verified by extensive experiments.
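As a rough illustration of the mosaic idea described above: assuming precomputed, L2-normalized CLIP image embeddings and same-format PIL images, a sketch of the negative mining and 2x2 composition might look like the following. Function names such as pick_negatives and build_nemo_mosaic, and the similarity band used to set difficulty, are illustrative assumptions, not taken from the paper's released code.

# Illustrative sketch of negative-mined mosaic augmentation (not the authors' code).
# Assumes `embeddings` is an (N, D) array of L2-normalized CLIP image embeddings
# and `images` is a list of PIL images aligned with those embeddings.
import numpy as np
from PIL import Image

def pick_negatives(anchor_idx, embeddings, lo=0.5, hi=0.8, k=3, seed=0):
    """Pick k images whose CLIP similarity to the anchor is moderate:
    high enough to be confusable, low enough to stay unambiguous."""
    sims = embeddings @ embeddings[anchor_idx]            # cosine similarities
    candidates = [i for i, s in enumerate(sims)
                  if i != anchor_idx and lo <= s <= hi]
    if len(candidates) < k:                                # fall back to the most similar others
        candidates = list(np.argsort(-sims))
        candidates.remove(anchor_idx)
        return candidates[:k]
    rng = np.random.default_rng(seed)
    return list(rng.choice(candidates, size=k, replace=False))

def build_nemo_mosaic(anchor_idx, images, embeddings):
    """Place the anchor in a random quadrant of a 2x2 mosaic with 3 mined negatives."""
    order = [anchor_idx] + pick_negatives(anchor_idx, embeddings)
    np.random.shuffle(order)
    w, h = images[anchor_idx].size
    mosaic = Image.new("RGB", (2 * w, 2 * h))
    for tile, idx in zip([(0, 0), (w, 0), (0, h), (w, h)], order):
        mosaic.paste(images[idx].resize((w, h)), tile)
    return mosaic, order.index(anchor_idx)                 # mosaic and the anchor's quadrant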
The latent space of diffusion models remains largely unexplored, despite the great success and potential of these models in generative modeling. In fact, the latent space of existing diffusion models is entangled, with a distorted mapping from latent space to image space. To tackle this problem, we present Isometric Diffusion, which equips a diffusion model with a geometric regularizer that guides the model to learn a geometrically sound latent space. Our approach allows diffusion models to learn a more disentangled latent space, which enables smoother interpolation, more accurate inversion, and more precise control over attributes directly in the latent space. Extensive experiments illustrate the advantages of the proposed method in image interpolation, image inversion, and linear editing.
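One way a geometric regularizer of this kind could be sketched, assuming a differentiable mapping f from latents to images and PyTorch 2.x with forward-mode autodiff: penalize how far random Jacobian-vector products deviate from preserving length. This is an illustrative stand-in for a "geometrically sound latent space" objective, not the paper's exact loss.

# Illustrative isometry-style regularizer (assumed form, not the paper's formulation).
# For a mapping f from latent z (batch, ...) to image space, encourage ||J_f(z) v|| to
# match ||v|| for random directions v, pushing f toward a geometry-preserving map.
import torch
from torch.func import jvp

def isometry_regularizer(f, z, num_directions=1):
    """Stochastic isometry penalty: E_v (||J_f(z) v||^2 / ||v||^2 - 1)^2."""
    penalty = z.new_zeros(())
    for _ in range(num_directions):
        v = torch.randn_like(z)
        _, jv = jvp(f, (z,), (v,))               # Jacobian-vector product J_f(z) v
        ratio = jv.flatten(1).pow(2).sum(-1) / v.flatten(1).pow(2).sum(-1)
        penalty = penalty + ((ratio - 1.0) ** 2).mean()
    return penalty / num_directions

# Usage sketch: add `lambda_iso * isometry_regularizer(decoder, z)` to the training loss,
# where `decoder` maps latents to images and `lambda_iso` balances the two terms.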
Graph convolutions have been successfully applied to recommendation systems, utilizing high-order collaborative signals present in the user-item interaction graph. This idea, however, has not been applicable to cold-start items, since cold nodes are isolated in the graph and thus cannot take advantage of information exchange with neighboring nodes. Recently, there have been a few attempts to apply graph convolutions on item-item or user-user attribute graphs to capture high-order collaborative signals for cold-start cases, but these approaches are still limited in that such a graph falls short of capturing the dynamics of user-item interactions, as its edges are constructed from arbitrary, heuristic attribute similarity.
In this paper, we introduce Content-based Graph Reconstruction for Cold-start item recommendation (CGRC), employing a masked graph autoencoder structure and multimodal contents to directly incorporate interaction-based high-order connectivity, applicable even in cold-start scenarios. To address cold-start items directly on the interaction-based graph, our approach trains the model to reconstruct plausible user-item interactions from the masked edges of randomly chosen items, simulating fresh items with no connections to users. This strategy enables the model to infer potential edges for unseen cold-start nodes. Extensive experiments on real-world datasets demonstrate the superiority of the proposed model.
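A minimal sketch of the edge-masking idea described above, using a dense user-item matrix for clarity rather than the graph data structures a real implementation would use; the function name mask_cold_items and the cold_ratio parameter are illustrative assumptions.

# Illustrative sketch of simulating cold-start items by masking their edges
# (not the authors' code). `interactions` is a dense (num_users, num_items) 0/1 matrix.
import numpy as np

def mask_cold_items(interactions, cold_ratio=0.2, seed=0):
    """Randomly pick items and drop all of their edges, simulating fresh (cold) items.
    Returns the masked matrix plus the held-out edges the model must reconstruct."""
    rng = np.random.default_rng(seed)
    num_items = interactions.shape[1]
    cold = rng.choice(num_items, size=int(cold_ratio * num_items), replace=False)
    masked = interactions.copy()
    target = np.zeros_like(interactions)
    target[:, cold] = interactions[:, cold]     # edges the decoder is asked to recover
    masked[:, cold] = 0                          # the encoder never sees these edges
    return masked, target, cold

# Training sketch: the encoder sees `masked` interactions plus multimodal item content;
# the decoder is scored on how well it reconstructs `target` for the masked (cold) items.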
Bilateral Self-unbiased Recommender Learning for Missing-not-at-Random Implicit Feedback
Jaewoong Lee
Seongmin Park
Jongwuk Lee
Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), ACM (2022) (to appear)
Unbiased recommender learning aims at eliminating the intrinsic bias from implicit feedback under the missing-not-at-random (MNAR) assumption. Existing studies primarily focus on estimating the propensity score for item popularity bias, but neglect the exposure bias of items caused by the recommender model itself, i.e., when the recommender model recommends an item more frequently, users tend to click the item more. To resolve this issue, we propose a novel unbiased recommender learning framework, the Bilateral self-unbiased recommender (BISER). Concretely, BISER consists of two parts: (i) estimating self-inverse propensity weighting (SIPW) for the exposure bias during model training, and (ii) utilizing bilateral unbiased learning (BU) to minimize the difference between the predictions of user- and item-based models, thereby alleviating the high variance of SIPW. Our extensive experiments show that BISER significantly outperforms state-of-the-art unbiased recommender models on various real-world datasets, such as Coat, Yahoo! R3, MovieLens-100K, and CiteULike.
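A minimal, one-directional sketch of how inverse propensity weighting with a self-estimated propensity and a bilateral agreement term could be combined, in the spirit of the abstract above; the actual BISER objective is symmetric across the user- and item-based models, and the names, clipping constant, and exact weighting here are illustrative assumptions.

# Illustrative IPW-style loss with a self-estimated propensity (not the authors' code).
import torch
import torch.nn.functional as F

def biser_style_loss(user_scores, item_scores, clicks, eps=0.1):
    """user_scores / item_scores: logits of a user-based and an item-based model
    for the same (user, item) pairs; clicks: observed implicit feedback in {0., 1.}."""
    # Self-estimated propensity: the model's own (detached, clipped) exposure probability
    # down-weights items it tends to over-expose.
    propensity = user_scores.detach().sigmoid().clamp(min=eps)
    ipw_loss = (F.binary_cross_entropy_with_logits(
        user_scores, clicks, reduction="none") / propensity).mean()
    # Bilateral term: pull the two views toward each other to reduce IPW variance.
    bilateral = F.mse_loss(user_scores.sigmoid(), item_scores.sigmoid().detach())
    return ipw_loss + bilateral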
Towards Detailed Characteristic-Preserving Virtual Try-On
Sangho Lee
Seoyoung Lee
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), The 5th Workshop on Computer Vision for Fashion, Art, and Design (2022)
While virtual try-on has progressed rapidly in recent years, existing methods still struggle to faithfully represent various details of the clothes when worn. In this paper, we propose a simple yet effective method to better preserve details of the clothing and person by introducing an additional fitting step after geometric warping. This minimal modification disentangles the representation of the clothing from that of the wearer, allowing us to preserve the wearer-agnostic structure and details of the clothing and to fit a garment naturally to a variety of poses and body shapes. Moreover, we propose a novel evaluation framework, applicable to any metric, that better reflects the semantics of clothes fitting. Extensive experiments empirically verify that the proposed method not only learns to disentangle clothing from the wearer, but also preserves details of the clothing in the try-on results.
A Conservative Approach for Unbiased Learning on Unknown Biases
Myeongho Jeon
Daekyung Kim
Woochul Lee
Myungjoo Kang
Proceedings of the 38th IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2022)
Although convolutional neural networks (CNNs) achieve state-of-the-art performance in image classification, recent works have addressed their unreliable predictions caused by excessive dependence on biased training data. Existing unbiased modeling postulates that the bias in the dataset is known in advance, which is ill-suited for image datasets containing countless sensory attributes. To mitigate this issue, we present a new scenario that does not require a predefined bias. Based on the observation that CNNs already hold multi-variant and unbiased representations internally, we propose a conservative framework that employs this internal information for unbiased learning. Specifically, this mechanism is implemented via hierarchical features captured along multiple layers and orthogonal regularization. Extensive evaluations on public benchmarks demonstrate that our method is effective for unbiased learning.
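As a loose illustration of an orthogonality penalty on intermediate features: the abstract does not spell out its exact form, so the following is an assumed, generic decorrelation variant rather than the paper's formulation.

# Illustrative orthogonality penalty on one layer's activations (assumed form).
# Decorrelating feature channels discourages any single (possibly biased) direction
# from dominating the representation.
import torch

def orthogonal_penalty(features):
    """features: (batch, dim) activations from one layer. Penalize off-diagonal
    entries of the channel-wise correlation (Gram) matrix."""
    f = features - features.mean(dim=0, keepdim=True)    # center each channel
    f = torch.nn.functional.normalize(f, dim=0)           # unit norm per channel
    gram = f.t() @ f                                       # (dim, dim) channel correlations
    off_diag = gram - torch.eye(gram.size(1), device=gram.device)
    return off_diag.pow(2).mean()

# Usage sketch: sum this penalty over activations captured at several layers
# (hierarchical features) and add it, suitably weighted, to the classification loss.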
S-Walk: Accurate and Scalable Session-based Recommendation with Random Walks
Minjin Choi
Jinhong Kim
Hyunjung Shim
Jongwuk Lee
Proceedings of the 15th ACM International Conference on Web Search and Data Mining (WSDM), ACM (2022)
Session-based recommendation (SR) aims at predicting the next items from the sequence of items previously consumed by an anonymous user. Most existing SR models focus only on modeling intra-session characteristics, but neglect inter-session relationships of items, which are helpful for improving accuracy. Another critical aspect of recommender systems is computational efficiency and scalability, considering practical concerns in commercial applications. In this paper, we propose Session-based Recommendation with Random Walks, namely S-Walk. Specifically, S-Walk effectively captures both intra- and inter-session correlations of items by handling high-order relationships across items using random walks with restart (RWR). At the same time, S-Walk is highly efficient and scalable, adopting linear models with closed-form solutions for the transition and teleportation matrices that formulate RWR. Despite its simplicity, our extensive experiments demonstrate that S-Walk achieves comparable or state-of-the-art performance on various metrics across four benchmark datasets. Moreover, the model learned by S-Walk can be highly compressed without sacrificing accuracy, achieving inference that is two or more orders of magnitude faster than existing DNN-based models, which makes it particularly suitable for large-scale commercial systems.
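The core machinery named above, random walks with restart over an item-item transition matrix, has a standard closed-form stationary solution. Below is a minimal sketch of that solution only; it is not the authors' full model, which additionally learns the transition and teleportation matrices via closed-form linear models.

# Illustrative random-walk-with-restart scoring over an item-item transition matrix.
import numpy as np

def rwr_scores(transition, restart_vec, alpha=0.3):
    """Closed-form stationary RWR distribution:
    p = alpha * (I - (1 - alpha) * P^T)^{-1} * r,
    where P (transition) is a row-stochastic item-item transition matrix and
    r (restart_vec) encodes the current session's items as a restart distribution."""
    n = transition.shape[0]
    return alpha * np.linalg.solve(
        np.eye(n) - (1.0 - alpha) * transition.T, restart_vec)

# Usage sketch: put uniform mass on the items of the ongoing session in restart_vec;
# the entries of the returned vector rank candidate next items.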
MuLan: A Joint Embedding of Music Audio and Natural Language
Qingqing Huang
Ravi Ganti
Judith Yue Li
Proceedings of the 23rd International Society for Music Information Retrieval Conference (ISMIR) (2022) (to appear)
Music tagging and content-based retrieval systems have traditionally been constructed using pre-defined ontologies covering a rigid set of music attributes or text queries. This paper presents MuLan: a first attempt at a new generation of acoustic models that link music audio directly to unconstrained natural language music descriptions. MuLan takes the form of a two-tower, joint audio-text embedding model trained using 44 million music recordings (370K hours) and weakly-associated, free-form text annotations. Through its compatibility with a wide range of music genres and text styles (including conventional music tags), the resulting audio-text representation subsumes existing ontologies while graduating to true zero-shot functionalities. We demonstrate the versatility of the MuLan embeddings with a range of experiments including transfer learning, zero-shot music tagging, language understanding in the music domain, and cross-modal retrieval applications.
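A minimal sketch of a two-tower, CLIP-style audio-text contrastive objective of the kind described above; the tower architectures are omitted, and the function name, temperature value, and symmetric cross-entropy form are illustrative assumptions rather than details taken from the MuLan implementation.

# Illustrative two-tower contrastive loss between audio and text embeddings.
import torch
import torch.nn.functional as F

def audio_text_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """audio_emb, text_emb: (batch, dim) embeddings from the audio and text towers
    for paired (music clip, free-form description) examples."""
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.t() / temperature                     # (batch, batch) similarity matrix
    labels = torch.arange(a.size(0), device=a.device)    # matching pairs lie on the diagonal
    # Symmetric cross-entropy: each audio clip should retrieve its own text and vice versa.
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))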