Joonseok Lee

Joonseok Lee is a research engineer in the Foresight group at Google Research, where he works mainly on multi-modal video representation learning. He earned his Ph.D. in Computer Science from the Georgia Institute of Technology in August 2015, under the supervision of Dr. Guy Lebanon and Prof. Hongyuan Zha. His thesis studied local approaches to collaborative filtering, with recommendation systems as the main application. During his Ph.D. he completed three internships: at Amazon (Summer 2014), Microsoft Research (Spring 2014), and Google (Summer 2013). Before coming to Georgia Tech, he worked at NHN Corp. in Korea (2007-2010). He received his B.S. degree in computer science and engineering from Seoul National University, Korea. His paper "Local Collaborative Ranking" received the Best Student Paper Award at the 23rd International World Wide Web Conference (2014). He has served on the program committees of many conferences, including NIPS, ICML, CVPR, ICCV, AAAI, WSDM, and CIKM, and as a reviewer for journals including JMLR, ACM TIST, and IEEE TKDE. He co-organized the YouTube-8M Large-Scale Video Understanding Workshop as a program chair and served as the publicity chair for the AISTATS 2015 conference. He is currently serving as a reviewer for the Google Faculty Research Awards Program. More information is available on his website (http://www.joonseok.net).
Authored Publications
    V2Meow: Meowing to the Visual Beat via Video-to-Music Generation
    Chris Donahue
    Dima Kuzmin
    Judith Li
    Kun Su
    Mauro Verzetti
    Qingqing Huang
    Yu Wang
    Vol. 38, No. 5: AAAI-24 Technical Tracks 5, AAAI Press (2024), pp. 4952-4960
    Video-to-music generation demands both a temporally localized high-quality listening experience and globally aligned video-acoustic signatures. While recent music generation models excel at the former through advanced audio codecs, the exploration of video-acoustic signatures has been confined to specific visual scenarios. In contrast, our research confronts the challenge of learning globally aligned signatures between video and music directly from paired music and videos, without explicitly modeling domain-specific rhythmic or semantic relationships. We propose V2Meow, a video-to-music generation system capable of producing high-quality music audio for a diverse range of video input types using a multi-stage autoregressive model. Trained on 5k hours of music audio clips paired with video frames mined from in-the-wild music videos, V2Meow is competitive with previous domain-specific models when evaluated in a zero-shot manner. It synthesizes high-fidelity music audio waveforms solely by conditioning on pre-trained general-purpose visual features extracted from video frames, with optional style control via text prompts. Through both qualitative and quantitative evaluations, we demonstrate that our model outperforms various existing music generation systems in terms of visual-audio correspondence and audio quality. Music samples are available at tinyurl.com/v2meow.
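
As a rough, hedged illustration of the conditioning pattern the abstract describes, a multi-stage autoregressive generator can be thought of as repeatedly sampling the next audio token from a distribution that depends on the visual features and the tokens emitted so far. The sketch below is a toy in that spirit only; the vocabulary size, projection matrices, and sampling scheme are invented stand-ins, not V2Meow's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB = 64          # toy audio-token vocabulary (hypothetical size)
DIM = 32            # toy embedding dimension

# Stand-in "model" parameters: random projections instead of trained weights.
W_visual = rng.normal(size=(DIM, DIM))   # maps visual features to the decoder state
W_token = rng.normal(size=(VOCAB, DIM))  # token embedding table
W_out = rng.normal(size=(DIM, VOCAB))    # projects the state to token logits

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def generate_audio_tokens(visual_features, n_steps=16):
    """Toy autoregressive sampling conditioned on per-video visual features."""
    state = visual_features @ W_visual          # condition on the video
    tokens = []
    for _ in range(n_steps):
        logits = state @ W_out
        token = rng.choice(VOCAB, p=softmax(logits))
        tokens.append(int(token))
        # Fold the sampled token back into the state (the autoregressive step).
        state = 0.5 * state + 0.5 * W_token[token]
    return tokens

visual_features = rng.normal(size=DIM)          # stand-in for pre-trained frame features
print(generate_audio_tokens(visual_features))
```
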
    S-Walk: Accurate and Scalable Session-based Recommendation with Random Walks
    Minjin Choi
    Jinhong Kim
    Hyunjung Shim
    Jongwuk Lee
    Proceedings of the 15th ACM International Conference on Web Search and Data Mining (WSDM), ACM (2022)
    Session-based recommendation (SR) aims at predicting the next items from a sequence of the previous items consumed by an anonymous user. Most existing SR models focus only on modeling intra-session characteristics, but neglect inter-session relationships among items, which are helpful for improving accuracy. Another critical aspect of recommender systems is computational efficiency and scalability, given practical concerns in commercial applications. In this paper, we propose a novel model, Session-based Recommendation with Random Walks (S-Walk). Specifically, S-Walk effectively captures both intra- and inter-session correlations among items by handling high-order relationships across items using random walks with restart (RWR). At the same time, S-Walk is highly efficient and scalable, adopting linear models with closed-form solutions for the transition and teleportation matrices that formulate RWR. Despite its simplicity, our extensive experiments demonstrate that S-Walk achieves comparable or state-of-the-art performance on various metrics across four benchmark datasets. Moreover, the model learned by S-Walk can be highly compressed without sacrificing accuracy, achieving inference two or more orders of magnitude faster than existing DNN-based models, making it particularly suitable for large-scale commercial systems.
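
Random walk with restart (RWR), the core machinery named in the abstract, has a well-known closed form: with restart weight (1 - alpha) and a column-stochastic transition matrix P, the scores satisfy r = (1 - alpha) * (I - alpha * P)^-1 * seed. A minimal numpy sketch of generic RWR scoring follows; the co-occurrence-based transition matrix here is a toy, whereas S-Walk learns its transition and teleportation matrices as linear models.

```python
import numpy as np

def rwr_scores(P, seed, alpha=0.85):
    """Random walk with restart: r = (1 - alpha) * (I - alpha * P)^-1 * seed.

    P    : column-stochastic item-item transition matrix, shape (n, n)
    seed : restart (teleportation) distribution over items, shape (n,)
    """
    n = P.shape[0]
    return (1.0 - alpha) * np.linalg.solve(np.eye(n) - alpha * P, seed)

# Toy item-item transition matrix built from co-occurrence counts.
C = np.array([[0., 3., 1., 0.],
              [3., 0., 2., 1.],
              [1., 2., 0., 4.],
              [0., 1., 4., 0.]])
P = C / C.sum(axis=0, keepdims=True)      # normalize columns

# Restart from the items in the current session (items 0 and 1).
seed = np.array([0.5, 0.5, 0.0, 0.0])
print(rwr_scores(P, seed))                # higher score = stronger recommendation
```
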
    Bilateral Self-unbiased Recommender Learning for Missing-not-at-Random Implicit Feedback
    Jaewoong Lee
    Seongmin Park
    Jongwuk Lee
    Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), ACM (2022) (to appear)
    Unbiased recommender learning aims at eliminating the intrinsic bias from implicit feedback under the missing-not-at-random (MNAR) assumption. Existing studies primarily focus on estimating the propensity score for item popularity bias, but neglect the exposure bias of items caused by the recommender model itself, i.e., when the model renders an item more frequently, users tend to click that item more often. To resolve this issue, we propose a novel unbiased recommender learning framework, the Bilateral Self-unbiased Recommender (BISER). Concretely, BISER consists of two parts: (i) estimating self-inverse propensity weighting (SIPW) for the exposure bias during model training, and (ii) utilizing bilateral unbiased learning (BU) to minimize the difference between the predictions of user- and item-based models, thereby alleviating the high variance of SIPW. Our extensive experiments show that BISER significantly outperforms state-of-the-art unbiased recommender models on various real-world datasets, such as Coat, Yahoo! R3, MovieLens-100K, and CiteULike.
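
The basic tool referenced above, inverse propensity weighting, reweights each observed interaction by 1/propensity so that over-exposed items do not dominate the training loss. Below is a hedged sketch of a generic propensity-weighted binary cross-entropy; the propensity estimates and the clipping constant are illustrative and do not reproduce BISER's SIPW procedure.

```python
import numpy as np

def ipw_bce_loss(y_true, y_pred, propensity, clip=0.1):
    """Inverse-propensity-weighted binary cross-entropy over observed clicks.

    y_true     : 1 where a click was observed, else 0
    y_pred     : model's predicted click probabilities in (0, 1)
    propensity : estimated probability that each item was exposed to the user
    clip       : lower bound on propensities to keep the variance of 1/p in check
    """
    p = np.clip(propensity, clip, 1.0)
    w = y_true / p                      # observed positives get weight 1/p
    eps = 1e-8
    loss = -(w * np.log(y_pred + eps) + (1.0 - y_true) * np.log(1.0 - y_pred + eps))
    return loss.mean()

y_true = np.array([1., 0., 1., 0.])
y_pred = np.array([0.9, 0.2, 0.6, 0.4])
propensity = np.array([0.8, 0.5, 0.05, 0.3])   # the third item is rarely exposed
print(ipw_bce_loss(y_true, y_pred, propensity))
```
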
    MuLan: A Joint Embedding of Music Audio and Natural Language
    Qingqing Huang
    Ravi Ganti
    Judith Yue Li
    Proceedings of the 23rd International Society for Music Information Retrieval Conference (ISMIR) (2022) (to appear)
    Music tagging and content-based retrieval systems have traditionally been constructed using pre-defined ontologies covering a rigid set of music attributes or text queries. This paper presents MuLan: a first attempt at a new generation of acoustic models that link music audio directly to unconstrained natural language music descriptions. MuLan takes the form of a two-tower, joint audio-text embedding model trained using 44 million music recordings (370K hours) and weakly-associated, free-form text annotations. Through its compatibility with a wide range of music genres and text styles (including conventional music tags), the resulting audio-text representation subsumes existing ontologies while graduating to true zero-shot functionalities. We demonstrate the versatility of the MuLan embeddings with a range of experiments including transfer learning, zero-shot music tagging, language understanding in the music domain, and cross-modal retrieval applications.
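
Two-tower audio-text embedding models of this kind are typically trained with a symmetric contrastive (InfoNCE-style) objective over a batch of paired examples. The sketch below shows that generic recipe with random stand-ins for the two towers; the batch size, embedding dimension, and temperature are illustrative, not MuLan's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss: the matched (audio, text) pair in each row/column should win."""
    a = l2_normalize(audio_emb)
    t = l2_normalize(text_emb)
    logits = a @ t.T / temperature                         # (batch, batch) cosine similarities
    idx = np.arange(len(a))
    loss_a2t = -log_softmax(logits)[idx, idx].mean()       # audio -> matching text
    loss_t2a = -log_softmax(logits.T)[idx, idx].mean()     # text  -> matching audio
    return 0.5 * (loss_a2t + loss_t2a)

# Stand-ins for the two towers: a batch of 8 paired embeddings.
audio_emb = rng.normal(size=(8, 128))
text_emb = audio_emb + 0.1 * rng.normal(size=(8, 128))     # paired text lands near its audio
print(contrastive_loss(audio_emb, text_emb))
```
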
    A Conservative Approach for Unbiased Learning on Unknown Biases
    Myeongho Jeon
    Daekyung Kim
    Woochul Lee
    Myungjoo Kang
    Proceedings of the 38th IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2022)
    Although convolutional neural networks (CNNs) achieve state-of-the-art performance in image classification, recent works have addressed their unreliable predictions stemming from excessive dependence on biased training data. Existing unbiased modeling postulates that the bias in the dataset is known in advance, an assumption that is ill-suited for image datasets with countless sensory attributes. To mitigate this issue, we present a new scenario that does not require a predefined bias. Based on the observation that CNNs already contain multi-variant and unbiased representations, we propose a conservative framework that exploits this internal information for unbiased learning. Specifically, the mechanism is implemented via hierarchical features captured along multiple layers together with orthogonal regularization. Extensive evaluations on public benchmarks demonstrate that our method is effective for unbiased learning.
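
One ingredient named in the abstract, orthogonal regularization, is commonly implemented as a penalty that pushes a weight matrix's rows toward orthonormality, e.g. the squared Frobenius norm of (W W^T - I). The sketch below shows that generic penalty; it is not necessarily the exact regularizer used in the paper.

```python
import numpy as np

def orthogonal_penalty(W):
    """Frobenius-norm penalty ||W W^T - I||_F^2, encouraging the rows of W to be orthonormal."""
    gram = W @ W.T
    return np.sum((gram - np.eye(W.shape[0])) ** 2)

rng = np.random.default_rng(0)
W_random = rng.normal(size=(4, 16))           # 4 feature directions in a 16-dim space
Q, _ = np.linalg.qr(W_random.T)               # orthonormalize those directions via QR
W_orthonormal = Q.T

print(orthogonal_penalty(W_random))           # large penalty for an arbitrary matrix
print(orthogonal_penalty(W_orthonormal))      # ~0 for orthonormal rows
```
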
    Towards Detailed Characteristic-Preserving Virtual Try-On
    Sangho Lee
    Seoyoung Lee
    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), The 5th Workshop on Computer Vision for Fashion, Art, and Design (2022)
    While virtual try-on has progressed rapidly in recent years, existing methods still struggle to faithfully represent various details of the clothes when worn. In this paper, we propose a simple yet effective method that better preserves the details of the clothing and the person by introducing an additional fitting step after geometric warping. This minimal modification disentangles the representation of the clothing from that of the wearer, allowing us to preserve the wearer-agnostic structure and details of the clothing and to fit a garment naturally to a variety of poses and body shapes. Moreover, we propose a novel evaluation framework, applicable to any metric, that better reflects the semantics of clothes fitting. Extensive experiments empirically verify that the proposed method not only learns to disentangle clothing from the wearer but also preserves the details of the clothing in the try-on results.
    Local Collaborative Autoencoders
    Minjin Choi
    Yoongi Jeong
    Jongwuk Lee
    Proceedings of the 14th ACM International Conference on Web Search and Data Mining (WSDM), ACM (2021)
    Top-N recommendation is a challenging problem because complex and sparse user-item interactions must be adequately addressed to achieve high-quality recommendation results. The local latent factor approach has been used successfully with multiple local models to capture diverse user preferences across different sub-communities. However, previous studies have not fully explored the potential of local models and have failed to identify many small and coherent sub-communities. In this paper, we present Local Collaborative Autoencoders (LOCA), a generalized local latent factor framework. Specifically, LOCA adopts different neighborhood ranges at the training and inference stages. In addition, LOCA uses a novel sub-community discovery method that maximizes the coverage of a union of local models and employs a large number of diverse local models. By adopting autoencoders as the base model, LOCA captures latent non-linear patterns representing meaningful user-item interactions within sub-communities. Our experimental results demonstrate that LOCA is scalable and outperforms state-of-the-art models on several public benchmarks, by 2.99-4.70% in Recall and 1.02-7.95% in NDCG.
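
The sub-community discovery described above amounts to a coverage-maximization problem: choose anchor points so that the union of their neighborhoods covers as many users as possible. A hedged sketch of the standard greedy max-coverage heuristic over toy user neighborhoods follows; the embedding, distance radius, and neighborhood construction are illustrative assumptions, not LOCA's exact procedure.

```python
import numpy as np

def greedy_coverage(neighborhoods, n_models):
    """Greedily pick anchors whose neighborhoods cover the most not-yet-covered users."""
    covered = set()
    anchors = []
    for _ in range(n_models):
        gains = [len(nb - covered) for nb in neighborhoods]
        best = int(np.argmax(gains))
        if gains[best] == 0:
            break                      # everything coverable is already covered
        anchors.append(best)
        covered |= neighborhoods[best]
    return anchors, covered

rng = np.random.default_rng(0)
n_users = 20
user_emb = rng.normal(size=(n_users, 8))

# Each candidate anchor's neighborhood: users within a (toy) distance radius.
dist = np.linalg.norm(user_emb[:, None, :] - user_emb[None, :, :], axis=-1)
neighborhoods = [set(np.flatnonzero(dist[i] < 3.5)) for i in range(n_users)]

anchors, covered = greedy_coverage(neighborhoods, n_models=4)
print("anchors:", anchors, "coverage:", len(covered), "/", n_users)
```
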
    Continuous-Time Video Generation via Learning Motion Dynamics with Neural ODE
    Kangyeol Kim
    Sunghyun Park
    Junsoo Lee
    Sookyung Kim
    Jaegul Choo
    Edward Choi
    Proceedings of the 32nd British Machine Vision Conference (BMVC) (2021)
    In order to perform unconditional video generation, we must learn the distribution of real-world videos. In an effort to synthesize high-quality videos, various studies have attempted to learn a mapping function between noise and videos, including recent efforts to separate motion distribution and appearance distribution. Previous methods, however, learn motion dynamics in discretized, fixed-interval timesteps, which is contrary to the continuous nature of the motion of a physical body. In this paper, we propose a novel video generation approach that learns separate distributions for motion and appearance, the former modeled by a neural ODE to learn natural motion dynamics. Specifically, we employ a two-stage approach where the first stage converts a noise vector to a sequence of keypoints at an arbitrary frame rate, and the second stage synthesizes videos based on the given keypoint sequence and the appearance noise vector. Our model not only quantitatively outperforms recent baselines for video generation at both fixed and varying frame rates, but also demonstrates versatile functionality such as dynamic frame rate manipulation and motion transfer between two datasets, thus opening new doors to diverse video generation applications.
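
The continuous-time idea is that keypoint motion is treated as the solution of an ODE, dz/dt = f(z, t), so keypoints can be produced at arbitrary timestamps by integrating f. Below is a rough sketch using a fixed-step Euler integrator and a random stand-in for the dynamics function; the real model uses a learned f and a proper ODE solver.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 10                               # toy keypoint-state dimension
W = rng.normal(size=(DIM, DIM)) * 0.1  # stand-in for a learned dynamics network

def dynamics(z, t):
    """Stand-in for the learned dz/dt = f(z, t)."""
    return np.tanh(z @ W)

def odeint_euler(z0, timestamps, n_substeps=20):
    """Integrate the dynamics with fixed-step Euler, emitting the state at each timestamp."""
    states, z, t = [], z0, timestamps[0]
    for t_next in timestamps:
        dt = (t_next - t) / n_substeps
        for _ in range(n_substeps):
            z = z + dt * dynamics(z, t)
            t = t + dt
        states.append(z)
    return np.stack(states)

z0 = rng.normal(size=DIM)
# Arbitrary, non-uniform timestamps: the point of the continuous-time formulation.
keypoint_states = odeint_euler(z0, timestamps=[0.0, 0.1, 0.35, 0.4, 1.0])
print(keypoint_states.shape)           # (5, 10) -- one keypoint state per requested time
```
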
    Session-aware Linear Item-Item Models for Session-based Recommendation
    Minjin Choi
    Jinhong Kim
    Hyunjung Shim
    Jongwuk Lee
    Proceedings of the ACM Conference on the Web (2021)
    Session-based recommendation aims at predicting the next item given a sequence of previous items consumed in the session (e.g., on e-commerce or multimedia streaming services). Session data exhibits unique characteristics, i.e., session consistency, sequential dependency, repeated item consumption, and timeliness of sessions. In this paper, we propose simple yet effective session-aware linear models that consider these holistic aspects of the sessions. This holistic nature of our models helps improve the quality of recommendations and, more importantly, provides a generalized framework for various session data. Thanks to the closed-form solutions of the linear models, the proposed models are highly scalable. Experimental results demonstrate that our simple linear models show comparable or state-of-the-art performance on various metrics across multiple real-world datasets.
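
Linear item-item models of this family admit closed-form solutions; the best-known instance is EASE, where the item-item weight matrix is obtained from the inverse of the regularized Gram matrix with a zero diagonal enforced so that items cannot trivially predict themselves. The sketch below shows that generic closed form; the session-aware weighting schemes proposed in the paper are not reproduced here.

```python
import numpy as np

def fit_linear_item_item(X, reg=10.0):
    """Closed-form EASE-style item-item weights with a zero diagonal (self-similarity removed)."""
    G = X.T @ X + reg * np.eye(X.shape[1])   # regularized item-item Gram matrix
    P = np.linalg.inv(G)
    B = -P / np.diag(P)                      # B_ij = -P_ij / P_jj (Lagrangian solution)
    np.fill_diagonal(B, 0.0)                 # enforce the zero-diagonal constraint
    return B

# Toy session-item matrix: 4 sessions x 5 items (1 = item consumed in the session).
X = np.array([[1, 1, 0, 0, 1],
              [0, 1, 1, 0, 0],
              [1, 0, 1, 1, 0],
              [0, 0, 1, 1, 1]], dtype=float)

B = fit_linear_item_item(X)
scores = X @ B                               # next-item scores for each session
print(np.round(scores, 2))
```
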
    Vid-ODE: Continuous-Time Video Generation with Neural Ordinary Differential Equation
    Sunghyun Park
    Kangyeol Kim
    Junsoo Lee
    Jaegul Choo
    Sookyung Kim
    Edward Choi
    Proceedings of the 35th AAAI Conference on Artificial Intelligence (AAAI) (2021)
    Video generation models often operate under the assumption of fixed frame rates, which leads to suboptimal performance when it comes to handling flexible frame rates (e.g., increasing the frame rate of the more dynamic portion of the video as well as handling missing video frames). To resolve the restricted nature of existing video generation models' ability to handle arbitrary timesteps, we propose continuous-time video generation by combining a neural ODE (Vid-ODE) with pixel-level video processing techniques. Using ODE-ConvGRU as an encoder, a convolutional version of the recently proposed neural ODE, which enables us to learn continuous-time dynamics, Vid-ODE can learn the spatio-temporal dynamics of input videos with flexible frame rates. The decoder integrates the learned dynamics function to synthesize video frames at any given timestep, where a pixel-level composition technique is used to maintain the sharpness of individual frames. With extensive experiments on four real-world video datasets, we verify that the proposed Vid-ODE outperforms state-of-the-art approaches under various video generation settings, both within the trained time range (interpolation) and beyond it (extrapolation). To the best of our knowledge, Vid-ODE is the first work to successfully perform continuous-time video generation using real-world videos.
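
The pixel-level composition mentioned above blends a warped copy of an observed frame with a newly synthesized frame through a per-pixel mask, so that sharp content is reused wherever the warp is trusted. The sketch below shows only that composition step; the warped frame, synthesized frame, and mask are random stand-ins for quantities the model would predict.

```python
import numpy as np

def compose_frame(warped, synthesized, mask):
    """Per-pixel blend: take warped pixels where the mask trusts the warp, else synthesized ones."""
    mask = np.clip(mask, 0.0, 1.0)[..., None]        # broadcast over the channel axis
    return mask * warped + (1.0 - mask) * synthesized

rng = np.random.default_rng(0)
H, W, C = 64, 64, 3
warped = rng.uniform(size=(H, W, C))        # last observed frame warped by predicted flow
synthesized = rng.uniform(size=(H, W, C))   # frame synthesized from the latent state
mask = rng.uniform(size=(H, W))             # model-predicted confidence in the warp

frame = compose_frame(warped, synthesized, mask)
print(frame.shape)                          # (64, 64, 3)
```
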