 
                Alekh Agarwal
Alekh is a researcher on the Learning Theory team at Google, where he works on a range of questions in theoretical machine learning. The bulk of his research focuses on the theory and applications of reinforcement learning, as well as other forms of interactive learning such as bandits, active learning, and online learning.
          
        
Authored Publications

Most contextual bandit algorithms assume that rewards are bounded uniformly, or with high probability, for each context and action. Furthermore, both the theoretical regret and the empirical performance of most prevailing methods scale polynomially with this reward scale. In this work, we use ideas from robust mean estimation to design contextual bandit algorithms whose regret scales only logarithmically with the reward scale, with a polynomial dependence on the second moment rather than on the maximum reward value. Our algorithm is based on Catoni's mean estimator and is applicable to arbitrary non-linear function classes, and we present both variance-aware and variance-agnostic versions of our approach.
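
As a rough illustration of the robust-mean primitive this abstract builds on, the sketch below computes Catoni's mean estimate for a batch of observed rewards. The choice of the tuning parameter alpha and the bisection root-finder are illustrative assumptions made for this sketch, not the paper's exact construction.

    import numpy as np

    def catoni_psi(x):
        # Catoni's influence function: odd, nondecreasing, with logarithmic growth,
        # so a few extreme rewards cannot dominate the estimate.
        return np.where(x >= 0,
                        np.log1p(x + 0.5 * x * x),
                        -np.log1p(-x + 0.5 * x * x))

    def catoni_mean(rewards, second_moment, delta=0.05, iters=60):
        # Robust mean of `rewards` given a bound on their second moment.
        rewards = np.asarray(rewards, dtype=float)
        n = len(rewards)
        # alpha trades bias against robustness to heavy-tailed observations.
        alpha = np.sqrt(2.0 * np.log(1.0 / delta) / (n * second_moment))
        lo, hi = rewards.min(), rewards.max()
        # The estimate is the root in theta of sum_i psi(alpha * (x_i - theta)) = 0;
        # the sum is nonincreasing in theta, so bisection finds it.
        for _ in range(iters):
            mid = 0.5 * (lo + hi)
            if catoni_psi(alpha * (rewards - mid)).sum() > 0:
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi)

Replacing the empirical average of rewards with a robust estimator of this kind is what allows the dependence to be on the second moment rather than on the maximum reward value.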
              
  
We conduct a theoretical analysis of techniques, such as Direct Preference Optimization (DPO), for preference-based RL from offline datasets annotated with pairwise preferences. We identify key properties of the learning objective that influence the quality of the learned policy, such as the coverage of the offline dataset, the presence or absence of a normalizing baseline, and the choice of loss function. Informed by the theory, we further conduct an empirical analysis of several key variants to corroborate our theoretical findings.
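
For concreteness, here is a minimal NumPy sketch of the pairwise preference loss behind DPO-style training on such offline datasets. The variable names are illustrative; the frozen reference log-probabilities play the role of the normalizing baseline mentioned above, and dropping them corresponds to training without that baseline.

    import numpy as np

    def dpo_loss(policy_chosen_logp, policy_rejected_logp,
                 ref_chosen_logp, ref_rejected_logp, beta=0.1):
        # Implicit rewards are scaled log-ratios of the trained policy against a
        # frozen reference policy, which acts as the normalizing baseline.
        margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                         - (policy_rejected_logp - ref_rejected_logp))
        # Bradley-Terry negative log-likelihood of the observed preference,
        # i.e. -log sigmoid(margin), written in a numerically stable form.
        return np.logaddexp(0.0, -margin)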
              
  
We propose a general framework to design posterior sampling methods for model-based RL. We show that the proposed algorithms can be analyzed by reducing regret to Hellinger distance in conditional probability estimation. We further show that optimistic posterior sampling can control this Hellinger distance when we measure model error via data likelihood. This technique allows us to design and analyze unified posterior sampling algorithms with state-of-the-art sample complexity guarantees for many model-based RL settings. We illustrate our general result in many special cases, demonstrating the versatility of our framework.
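
The framework in the abstract is considerably more general, but a minimal tabular posterior-sampling loop conveys the basic recipe: sample a model from the posterior, plan in it, act, and update the posterior with the new data. The Dirichlet prior, the known mean rewards, and the env_reset/env_step hooks are simplifying assumptions made only for this sketch.

    import numpy as np

    def psrl_episode(counts, mean_reward, horizon, env_reset, env_step):
        # counts[s, a, s'] holds observed transition counts; mean_reward[s, a] is
        # treated as known to keep the sketch short. env_reset/env_step are
        # hypothetical hooks into the environment.
        n_states, n_actions, _ = counts.shape

        # 1) Sample one transition model from a Dirichlet posterior (uniform prior).
        sampled_P = np.array([[np.random.dirichlet(counts[s, a] + 1.0)
                               for a in range(n_actions)]
                              for s in range(n_states)])

        # 2) Plan in the sampled model with finite-horizon value iteration.
        Q = np.zeros((horizon + 1, n_states, n_actions))
        for h in range(horizon - 1, -1, -1):
            Q[h] = mean_reward + sampled_P @ Q[h + 1].max(axis=1)

        # 3) Act greedily under the sampled model and record new transitions,
        #    which sharpen the posterior for the next episode.
        s = env_reset()
        for h in range(horizon):
            a = int(np.argmax(Q[h, s]))
            s_next, _ = env_step(s, a)
            counts[s, a, s_next] += 1
            s = s_next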
              
  
POMDPs capture a broad class of decision-making problems, but hardness results suggest that learning is intractable even in simple settings due to the inherent partial observability. However, in many realistic problems, more information is either revealed, or can be computed, at some point during the learning process. Motivated by diverse applications ranging from robotics to data center scheduling, we formulate a Hindsight Observable Markov Decision Process (HOMDP) as a POMDP where the latent states are revealed to the learner in hindsight, and only during training. We introduce new algorithms for the tabular and function approximation settings that are provably sample-efficient with hindsight observability, even in POMDPs that would otherwise be statistically intractable. We give a lower bound showing that the tabular algorithm is optimal in its dependence on latent state and observation cardinalities.
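
To illustrate why hindsight observability helps, the sketch below shows the model-estimation step once an episode's latent states are revealed: fitting the emission and transition tables reduces to supervised counting, with no latent-state inference. The table names and the counting scheme are illustrative and are not the paper's algorithm.

    import numpy as np

    def hindsight_update(observations, actions, revealed_states,
                         trans_counts, obs_counts):
        # observations[h], actions[h]: what the learner saw and did at step h.
        # revealed_states[h]: the latent state at step h, disclosed after the episode.
        for h in range(len(revealed_states)):
            s = revealed_states[h]
            # With the latent state known, emission estimation is plain counting.
            obs_counts[s, observations[h]] += 1
            if h + 1 < len(revealed_states):
                trans_counts[s, actions[h], revealed_states[h + 1]] += 1

    # Toy usage: count tables start at zero and accumulate across training episodes.
    n_states, n_actions, n_obs = 4, 2, 5
    trans_counts = np.zeros((n_states, n_actions, n_states), dtype=int)
    obs_counts = np.zeros((n_states, n_obs), dtype=int)
    hindsight_update(observations=[3, 0, 2], actions=[1, 0, 1],
                     revealed_states=[0, 2, 1],
                     trans_counts=trans_counts, obs_counts=obs_counts)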
              
  