Sercan O. Arik

Sercan Arik is a Research Scientist at Google Cloud AI. Motivated by the mission of democratizing AI and bringing it to the most impactful use cases (from Healthcare, Finance, Retail, Media, Education, Communications and many other industries), he works on making AI high-performance for the most-demanded data types, interpretable, fair, data-efficient, robust and reliable. Before joining Google, he was a Research Scientist at Baidu Silicon Valley AI Lab. At Baidu, he focused on deep learning research, particularly for applications in human-technology interfaces. He co-developed state-of-the-art speech synthesis, keyword spotting, voice cloning, and neural architecture search systems. Prior to Baidu, he completed a PhD degree in Electrical Engineering at Stanford University in 2016. He has co-authored more than 50 journal and conference publications.
Authored Publications
    Preview abstract A hallmark of modern large language models (LLMs) is their impressive general zero-shot and few-shot abilities, often elicited through in-context learning (ICL) via prompting. However, while highly coveted and being the most general, zero-shot performances in LLMs are still typically weaker due to the lack of guidance and the difficulty of applying existing automatic prompt design methods in general tasks when ground-truth labels are unavailable. In this study, we address this by presenting Universal Self-Adaptive Prompting (USP), an automatic prompt design approach specifically tailored for zero-shot learning (while compatible with few-shot). Requiring only a small amount of unlabeled data and an inference-only LLM, USP is highly versatile: to achieve universal prompting, USP categorizes a possible NLP task into one of the three possible task types and then uses a corresponding selector to select the most suitable queries and zero-shot model-generated responses as pseudo-demonstrations, thereby generalizing ICL to the zero-shot setup in a fully automated way. We evaluate USP with PaLM and PaLM 2 models and demonstrate performances that are considerably stronger than standard zero-shot baselines and often comparable to or even superior to few-shot baselines across more than 40 natural language understanding, natural language generation, and reasoning tasks. View details
    Preview abstract Text-to-SQL aims to automate the process of generating SQL queries on a database from natural language text. In this work, we propose "SQLPrompt", tailored to improve the few-shot prompting capabilities of Text-to-SQL for Large Language Models (LLMs). Our methods include innovative prompt design, an execution-based consistency decoding strategy that selects the SQL with the most consistent execution outcome among other SQL proposals, and a method that aims to improve performance by diversifying the SQL proposals during consistency selection with different prompt designs ("MixPrompt") and foundation models ("MixLLMs"). We show that SQLPrompt outperforms previous approaches for in-context learning with few labeled data by a large margin, closing the gap with the finetuning state-of-the-art that uses thousands of labeled data. View details
    Preview abstract Multimodal large-scale pretraining has shown impressive performance gains for unstructured data including language, image, audio, and video. Yet, the scenario prominent in real-world applications is the existence of combination of structured (including tabular and time-series) and unstructured data in conjunction, and it has been understudied. Towards this end, we propose LANISTR, a novel attention-based framework to learn from LANguage, Image, and STRuctured data. We introduce a new multimodal fusion module with a similarity-based multimodal masking loss that enables LANISTR to learn cross-modal relations from large-scale multimodal data with missing modalities during training and test time. On two publicly available MIMIC-IV and Amazon Product Review datasets, LANISTR achieves absolute improvements of 6.47% (AUROC) and 8.35% (accuracy), respectively, compared to the state-of-the-art multimodal models, while showing superior generalization capabilities. View details
    Preview abstract Real-world time-series datasets are often multivariate with complex dynamics. To capture this complexity, high capacity architectures like recurrent- or attention-based sequential deep learning models have become popular. However, recent work demonstrates that simple univariate linear models can outperform such deep learning models on several commonly used academic benchmarks. Extending them, in this paper, we investigate the capabilities of linear models for time-series forecasting and present Time-Series Mixer (TSMixer), a novel architecture designed by stacking multi-layer perceptrons (MLPs). TSMixer is based on mixing operations along both the time and feature dimensions to extract information efficiently. On popular academic benchmarks, the simple-to-implement TSMixer is comparable to specialized state-of-the-art models that leverage the inductive biases of specific benchmarks. On the challenging and large scale M5 benchmark, a real-world retail dataset, TSMixer demonstrates superior performance compared to the state-of-the-art alternatives. Our results underline the importance of efficiently utilizing cross-variate and auxiliary information for improving the performance of time series forecasting. We present various analyses to shed light on the capabilities of TSMixer. The design paradigms utilized in TSMixer are expected to open new horizons for deep learning-based time series forecasting. The implementation is available at: https://github.com/google-research/google-research/tree/master/tsmixer. View details
    Preview abstract Accurate estimation of output quantiles is crucial in many use cases, where it is desired to model the range of possibility. Modeling the target distribution at arbitrary quantile levels and at arbitrary input attribute levels is important to offer a comprehensive picture of the data, and requires the quantile function to be expressive enough. The quantile function describing the target distribution using quantile levels is critical for quantile regression. Although various parametric forms for the distributions (that the quantile function specifies) can be adopted, an everlasting problem is selecting the most appropriate one that can properly approximate the data distributions. In this paper, we propose a non-parametric and data-driven approach, Neural Spline Search (NSS), to represent the observed data distribution without parametric assumptions. NSS is flexible and expressive for modeling data distributions by transforming the inputs with a series of monotonic spline regressions guided by symbolic operators. We demonstrate that NSS outperforms previous methods on synthetic, real-world regression and time-series forecasting tasks. View details
    Preview abstract Semi-supervised anomaly detection is a common problem, as often the datasets containing anomalies are partially labeled. We propose a canonical framework: Semi-supervised Pseudo-labeler Anomaly Detection with Ensembling (SPADE) that isn't limited by the assumption that labeled and unlabeled data come from the same distribution. Indeed, the assumption is often violated in many applications -- for example, the labeled data may contain only anomalies unlike unlabeled data, or unlabeled data may contain different types of anomalies, or labeled data may contain only 'easy-to-label' samples. SPADE utilizes an ensemble of one class classifiers as the pseudo-labeler to improve the robustness of pseudo-labeling with distribution mismatch. Partial matching is proposed to automatically select the critical hyper-parameters for pseudo-labeling without validation data, which is crucial with limited labeled data. SPADE shows state-of-the-art semi-supervised anomaly detection performance across a wide range of scenarios with distribution mismatch in both tabular and image domains. In some common real-world settings such as a model facing new types of unlabeled anomalies, SPADE outperforms the state-of-the-art alternatives by 5% AUC on average. View details
    Preview abstract For visual document understanding (VDU), self-supervised pretraining has been shown to successfully generate transferable representations, yet effective adaptation of such representations to distribution shifts at test-time remains an unexplored area. We propose DocTTA, a novel test-time adaptation method for documents, that does source-free domain adaptation using unlabeled target document data. DocTTA leverages cross-modality self-supervised learning via masked visual language modeling, as well as pseudo labeling to adapt models learned on a source domain to an unlabeled target domain at test time. We introduce new benchmarks using existing public datasets for various VDU tasks, including entity recognition, key-value extraction, and document visual question answering. DocTTA shows significant improvements on these compared to the source model performance, up to 1.89% (F1 score), 3.43% (F1 score), and 17.68% (ANLS score), respectively. View details
    Preview abstract In this paper, we propose a novel deep sequence model based on the Koopman theory for time series forecasting with distribution shifts. Our model, Koopman Neural Forecaster (KNF), leverages DNNs to learn the linear Koopman space and the measurement functions, and imposes inductive biases for improved robustness against distributional shifts. KNF employs both a global operator to learn shared characteristics, and a local operator to capture changing dynamics. KNF also includes a judiciously-designed feedback loop to continuously update the learnt operators over time for rapidly varying behaviors. To the best of our knowledge, this is the first time that Koopman theory is applied to real-world time series without known governing laws. We demonstrate that KNF achieves state-of-the-art performance on a wide range of time series datasets that are particularly known to suffer from distribution shifts. View details
    Preview abstract Modern large language models (LLMs) have demonstrated impressive capabilities at sophisticated tasks, often through step-by-step reasoning similar to humans. This is made possible by their strong few-shot and zero-shot abilities: they either learn from a handful of handcrafted, completed responses (“in context examples”), or are prompted to reason spontaneously through specially designed triggers. Nonetheless, few-shot performance is sensitive to the choice of the examples, for which artisanal hand-crafted selection would require extensive effort, and in some cases, it might not even be possible to obtain relevant examples a priori without expertise about the downstream tasks. On the other hand, most general and handcrafting-free, zero-shot performance is limited by the lack of guidance to the LLM. To address this, we propose Consistency-based Self-adaptive Prompting (COSP), a novel prompt design method for LLMs. Requiring neither handcrafted responses nor ground-truth labels, COSP selects & builds the set of examples from the LLM’s own zero-shot outputs via carefully designed criteria combining consistency, diversity and repetition. In the zero-shot setting, with only LLM predictions, COSP significantly improves performance (up to 2× compared to zero-shot baselines and matching or exceeding few-shot baselines) in a range of reasoning tasks in 3 LLMs. Moreover, COSP can be generalized to the few-shot setting and can take advantage of few labeled examples in an efficient way. View details
    Preview abstract We propose a canonical approach for feature selection, sparse learnable masks (SLM). SLM integrates learnable sparse masks into end-to-end training. For the fundamental non-differentiability challenge of selecting a desired number of features, we propose duo mechanisms for automatic mask scaling to achieve the desired feature sparsity, and gradually tempering this sparsity for effective learning. In addition, SLM employs a novel objective that maximizes the mutual information (MI) between the selected features and the labels, in an efficient and scalable way. Empirically, SLM achieves state-of-the-art results on several benchmark datasets, often by a significant margin, especially on real-world challenging datasets. View details
    Preview abstract Understanding black-box machine learning models is crucial for their widespread adoption. Learning globally interpretable models is one approach, but achieving high performance with them is challenging. An alternative approach is to explain individual predictions using locally interpretable models. For locally interpretable modeling, various methods have been proposed and indeed commonly used, but they suffer from low fidelity, i.e. their explanations do not approximate the predictions well. In this paper, our goal is to push the state-of-the-art in high-fidelity locally interpretable modeling. We propose a novel framework, Locally Interpretable Modeling using Instance-wise Subsampling (LIMIS). LIMIS utilizes a policy gradient to select a small number of instances and distills the black-box model into a low-capacity locally interpretable model using those selected instances. Training is guided with a reward obtained directly by measuring the fidelity of the locally interpretable models. We show on multiple tabular datasets that LIMIS near-matches the prediction accuracy of black-box models, significantly outperforming state-of-the-art locally interpretable models in terms of fidelity and prediction accuracy. View details
    Algorithmic fairness in pandemic forecasting: lessons from COVID-19
    Thomas Tsai
    Benjamin Jacobson
    Nate Yoder
    Dario Sava
    Meg Mitchell
    Garth Graham
    npj Digital Medicine (2022)
    Preview abstract Racial and ethnic minorities have borne a particularly acute burden of the COVID-19 pandemic in the United States. There is a growing awareness from both researchers and public health leaders of the critical need to ensure fairness in forecast results. Without careful and deliberate bias mitigation, inequities embedded in data can be transferred to model predictions, perpetuating disparities, and exacerbating the disproportionate harms of the COVID-19 pandemic. These biases in data and forecasts can be viewed through both statistical and sociological lenses, and the challenges of both building hierarchical models with limited data availability and drawing on data that reflects structural inequities must be confronted. We present an outline of key modeling domains in which unfairness may be introduced and draw on our experience building and testing the Google-Harvard COVID-19 Public Forecasting model to illustrate these challenges and offer strategies to address them. While targeted toward pandemic forecasting, these domains of potentially biased modeling and concurrent approaches to pursuing fairness present important considerations for equitable machine-learning innovation. View details
    Decoupling Local and Global Representations of Time Series
    Sana Tonekaboni
    Chun-Liang Li
    Anna Goldenberg
    International Conference on Artificial Intelligence and Statistics (2022)
    Preview abstract Real-world time series data are often generated from several sources of variation. Learning representations that capture the factors contributing to this variability enables better understanding of the data via its underlying generative process and can lead to improvements in performance on downstream machine learning tasks. In this paper, we propose a novel generative approach for learning representations for the global and local factors of variation in time series data. The local representation of each sample models non-stationarity over time with a stochastic process prior, and the global representation of the sample encodes the time-independent characteristics. To encourage decoupling between the representations, we introduce a counterfactual regularization that minimizes the mutual information between the two variables. In experiments, we demonstrate successful recovery of the true local and global factors of variability on simulated data, and show that representations learned using our method lead to superior performance on downstream tasks on real-world datasets. We believe that the proposed way of defining representations is beneficial for data modeling and can yield better insights into the complexity of the real-world data. View details
    Preview abstract Hierarchical structures are popular in recent vision transformers, however, they require sophisticated designs and massive datasets to work well. In this paper, we explore the idea of nesting basic local transformers on non-overlapping image blocks and aggregating them in a hierarchical way. We find that the block aggregation function plays a critical role in enabling cross-block non-local information communication. This observation leads us to design a simplified architecture that requires minor code changes upon the original vision transformer. The benefits of the proposed judiciously-selected design are threefold: (1) NesT converges faster and requires much less training data to achieve good generalization on both ImageNet and small datasets like CIFAR; (2) when extending our key ideas to image generation, NesT leads to a strong decoder that is 8× faster than previous transformer-based generators; and (3) we show that decoupling the feature learning and abstraction processes via this nested hierarchy in our design enables constructing a novel method (named GradCAT) for visually interpreting the learned model. Source code is available at https://github.com/google-research/nested-transformer. View details
    Preview abstract Anomaly detection (AD), separating anomalies from normal data, has many applications across domains, from security to healthcare. While most previous works were shown to be effective for cases with fully or partially labeled data, that setting is in practice less common due to labeling being particularly tedious for this task. In this paper, we focus on fully unsupervised AD, in which the entire training dataset, containing both normal and anomalous samples, is unlabeled. To tackle this problem effectively, we propose to improve the robustness of one-class classification trained on self-supervised representations using a data refinement process. Our proposed data refinement approach is based on an ensemble of one-class classifiers (OCCs), each of which is trained on a disjoint subset of training data. Representations learned by self-supervised learning on the refined data are iteratively updated as the data refinement improves. We demonstrate our method on various unsupervised AD tasks with image and tabular data. With a 10% anomaly ratio on CIFAR-10 image data / 2.5% anomaly ratio on Thyroid tabular data, the proposed method outperforms the state-of-the-art one-class classifier by 6.3 AUC and 12.5 average precision / 22.9 F1-score. View details
    A prospective evaluation of AI-augmented epidemiology to forecast COVID-19 in the USA and Japan
    Arkady Epshteyn
    Ashwin Sura Ravi
    Beth Luan
    Chun-Liang Li
    Daisuke Yoneoka
    Dario Sava
    Hiroaki Miyata
    Hiroki Kayama
    Isaac Jones
    Joe Mckenna
    Johan Euphrosine
    Kris Popendorf
    Nate Yoder
    Shashank Singh
    Shuhei Nomura
    Thomas Tsai
    npj Digital Medicine (2021)
    Preview abstract The COVID-19 pandemic has highlighted the global need for reliable models of disease spread. We evaluate an AI-improved forecasting approach that provides daily predictions of the expected number of confirmed COVID-19 deaths, cases and hospitalizations during the following 28 days. We present an international, prospective evaluation of model performance across all states and counties in the USA and prefectures in Japan. National mean absolute percentage error (MAPE) for predicting COVID-19 associated deaths before and after prospective deployment remained consistently <3% (US) and <10% (Japan). Average statewide (US) and prefecture wide (Japan) MAPE was 6% and 20% respectively (14% when looking at prefectures with more than 10 deaths). We show our model performs well even during periods of considerable change in population behavior, and that it is robust to demographic differences across different geographic locations. We further demonstrate the model provides meaningful explanatory insights, finding that the model appropriately responds to local and national policy interventions. Our model enables counterfactual simulations, which indicate continuing NPIs alongside vaccinations is essential for more rapidly recovering from the pandemic, delaying the application of interventions has a detrimental effect, and allow exploration of the consequences of different vaccination strategies. The COVID-19 pandemic remains a global emergency. In the face of substantial challenges ahead, the approach presented here has the potential to inform critical decisions. View details
    Preview abstract We propose a novel training method that integrates rules into deep learning, in a way the strengths of the rules are controllable at inference. Deep Neural Networks with Controllable Rule Representations (DeepCTRL) incorporates a rule encoder into the model coupled with a rule-based objective, enabling a shared representation for decision making. DeepCTRL is agnostic to data type and model architecture. It can be applied to any kind of rule defined for inputs and outputs. The key aspect of DeepCTRL is that it does not require retraining to adapt the rule strength -- at inference, the user can adjust it based on the desired operation point on accuracy vs. rule verification ratio. In real-world domains where incorporating rules is critical -- such as Physics, Retail and Healthcare -- we show the effectiveness of DeepCTRL in teaching rules for deep learning. DeepCTRL improves the trust and reliability of the trained models by significantly increasing their rule verification ratio, while also providing accuracy gains at downstream tasks. Additionally, DeepCTRL enables novel use cases such as hypothesis testing of the rules on data samples, and unsupervised adaptation based on shared rules between datasets. View details
    Preview abstract We propose a novel high-performance and interpretable canonical deep tabular data learning architecture, TabNet. TabNet uses sequential attention to choose which features to reason from at each decision step, enabling interpretability and more efficient learning as the learning capacity is used for the most salient features. We demonstrate that TabNet outperforms other neural network and decision tree variants on a wide range of non-performance-saturated tabular datasets and yields interpretable feature attributions plus insights into the global model behavior. Finally, for the first time to our knowledge, we demonstrate self-supervised learning for tabular data, significantly improving performance with unsupervised representation learning when unlabeled data is abundant. View details
    Preview abstract Multi-horizon prediction problems often contain a complex mix of inputs -- including static covariates, known future inputs, and other exogenous time series -- without any prior information on how they interact with the target. While several deep learning models have been proposed for multi-step prediction, they typically comprise black-box models which do not account for the full range of inputs present in common scenarios. In this paper, we introduce the Temporal Fusion Transformer (TFT) -- a novel attention-based architecture which combines high-performance multi-horizon forecasting with interpretable insights into temporal dynamics. To learn temporal relationships at different scales, the TFT utilizes recurrent layers for local processing and interpretable self-attention layer for learning long-term dependencies. The TFT also utilizes specialized components for judicious selection of the relevant features, and series of gating layers to suppress unnecessary components -- enabling high performance in a wide range of regimes. On a variety of real-world datasets, we demonstrate performance improvements over existing benchmarks, and showcase three practical interpretability use-cases of our model. View details
    Preview abstract We propose a novel model that integrates machine learning into compartmental disease modeling to predict the progression of Covid-19. Our model incorporates explainable encoding of information-bearing covariates to improve performance. The motivation to maintain explainability is two-fold: the behavior of the resulting model will be credible with epidemiologists, and will instill confidence in the intended end-users - policy makers and healthcare institutions. The proposed model can be applied at different geographic resolutions, and we demonstrate it for United States' states and counties. We show that the forecasting accuracy of our model is significantly better than the alternatives, and the explanatory insights from it are qualitatively meaningful. View details
    Explaining Deep Neural Networks using Unsupervised Clustering
    2020 Workshop on Human Interpretability in Machine Learning (2020)
    Preview abstract We propose a novel method to explain trained deep neural networks (DNNs), by distilling them into surrogate models using unsupervised clustering. Our method can be flexibly applied to any subset of layers of a DNN architecture and can incorporate low-level and high-level information. On image datasets given pre-trained DNNs, we demonstrate strength of our method in finding similar training samples, and shedding light on the concepts the DNN bases its decision on. Via user studies, we show that our model can improve user trust in model’s prediction. View details
    Preview abstract Active learning (AL) combines data labeling and model training to minimize the labeling cost by prioritizing the selection of high value data that can best improve model performance. In pool-based active learning, accessible unlabeled data are not used for model training in most conventional methods. Here, we propose to unify unlabeled sample selection and model training towards minimizing labeling cost, and make two contributions towards that end. First, we exploit both labeled and unlabeled data using semi-supervised learning (SSL) to distill information from unlabeled data during the training stage. Second, we propose a consistency-based sample selection metric that is coherent with the training objective such that the selected samples are effective at improving model performance. We conduct extensive experiments on image classification tasks. The experimental results on CIFAR-10, CIFAR-100 and ImageNet demonstrate the superior performance of our proposed method with limited labeled data, compared to the existing methods and the alternative AL and SSL combinations. Additionally, we study an important yet under-explored problem: “When can we start learning-based AL selection?”. We propose a measure that is empirically correlated with the AL target loss and is potentially useful for determining the proper starting point of learning-based AL methods. View details
    Preview abstract We propose a novel inherently interpretable machine learning method that bases decisions on few relevant examples that we call prototypes. Our method, ProtoAttend, can be integrated into a wide range of neural network architectures including pre-trained models. It utilizes an attention mechanism that relates the encoded representations to samples in order to determine prototypes. Without sacrificing accuracy of the original model, ProtoAttend yields superior results in: sample-based interpretability, confidence estimation and distribution mismatch detection. View details
    Preview abstract Deep neural networks (DNNs) yield poorly-calibrated confidence estimates when their raw predicted posterior estimates are considered. Towards obtaining perfectly-calibrated confidence estimates, we propose a novel framework, named 'Distance-Based Learning from Errors' (DBLE). DBLE is based on two fundamental principles: (i) learning a representation space where distances correspond to relatedness of samples, and (ii) efficient feedback from the training errors to accurately model distances to ground truth centroids. For (i), we adapt prototypical learning such that pairwise distances determine the predicted posteriors during training, and the related samples, ideally from the same class, should be grouped together. For (ii), we propose a simple yet effective solution of relying updates on the samples that yielded the inaccurate decisions during training, with the goal of efficiently fitting a model that represents the variance of prediction in the decision manifold. On four datasets, we demonstrate that DBLE significantly outperforms alternative approaches that are based on a single DNN, in confidence calibration. DBLE is on par with ensemble approaches that contain multiple DNNs, without even doubling the training time and yielding negligible increase in the number of parameters. View details
    Preview abstract Concept-based explanations can be a key direction to understand how DNNs make decisions. In this paper, we study concept-based explainability in a systematic framework. First, we define the notion of completeness, which quantifies how sufficient a particular set of concepts is in explaining the model's behavior. Based on performance and variability motivations, we propose two definitions to quantify completeness. We show that they yield the commonly-used PCA method under certain assumptions. Next, we study two additional constraints to ensure the interpretability of discovered concept, based on sparsity principles. Through systematic experiments, on specifically-designed synthetic dataset and real-world text and image datasets, we demonstrate the superiority of our framework in finding concepts that are complete (in explaining the decision) and that are interpretable. View details
    Preview abstract Quantifying the value of datum is a fundamental problem in machine learning. Besides building insights about the learning task, data valuation has applications in diverse use-cases, such as domain adaptation, corrupted sample discovery, and robust learning. To adaptively learn data values jointly with the predictive model, we propose a meta learning framework - named Data Valuator using Reinforcement Learning (DVRL). We employ a data value estimator, modeled by a deep neural network, to output how likely each datum is used in training of the predictive model. Training of the data value estimator is guided with the reinforcement signal based on a reward directly obtained from the performance on the target task. We evaluate DVRL in various applications across multiple types of datasets. DVRL yields superior quality data value estimates compared to alternative methods. The corrupted sample discovery performance of DVRL is close to optimal (i.e. as if the noisy samples are a priori known) in many regimes. For domain adaptation and robust learning tasks, outperformance of DVRL is significant - 14.6% and 10.8% average performance improvements, respectively. View details
    Preview abstract Collecting large-scale data with clean labels for supervised training of neural networks is practically challenging. Although noisy labels are usually cheap to acquire, existing methods suffer a lot from label noise. This paper targets at the challenge of robust training at high label noise regimes. The key insight to achieve this goal is to wisely leverage a small trusted set to estimate exemplar weights and pseudo labels for noisy data in order to reuse them for supervised training. We present a holistic framework to train deep neural networks in a way that is highly invulnerable to label noise. Our method sets the new state of the art on various types of label noise and achieves leading performance on large-scale datasets with real-world label noise. For instance, on CIFAR100 with a 40% uniform noise ratio and only 10 trusted labeled data per class, our method achieves 80.2% classification accuracy, where the error rate is only 1.4% higher than a neural network trained without label noise. Moreover, increasing the noise ratio to 80%, our method still maintains a high accuracy of 75.5%, compared to the previous best accuracy 48.2%. View details
    Preview abstract We propose a novel framework, named learn to transfer learn (L2TL), to improve transfer learning on a target dataset by judicious extraction of information from a source dataset. Our framework considers joint optimization of vastly-shared weights between models of source and target tasks, and employs adaptive coefficients for scaling of constituent loss terms. The adaptation of the coefficients is done using a reinforcement learning (RL)-based policy model, which is guided based on a performance metric on target evaluation set. We demonstrate the state-of-the-art performance of L2TL on various datasets, with consistent outperformance of fine-tuning baselines. Especially in the regimes of small-scale target datasets and in the cases of significant label mismatch between source and target datasets, the outperformance of L2TL is more significant. View details