- Xin Xin
- Alexandros Karatzoglou
- Ioannis Arapakis
- Joemon Jose
Casting session-based or sequential recommendation as reinforcement learning (RL) through reward signals is a promising research direction towards recommender systems (RS) that maximize long-term user engagement. However, directly applying RL algorithms in the RS setting is infeasible due to challenges such as off-policy training, huge action spaces, and the lack of sufficient reward signals. Recent RL approaches in the recommendation domain try to tackle these challenges, for example by combining RL with self-supervised learning. In this paper, we examine this self-supervised reinforcement learning approach for recommendation and show that existing methods still have limitations. For example, the negative signals from the self-supervised component are not sufficient for the RL component to learn a good ranking. Moreover, the length of the interaction sequence can also introduce bias into the training procedure.
To address the above problems, we first propose to introduce negative sampling into the RL training procedure and then combine it with self-supervised learning, yielding Self-Supervised Negative Q-learning (SNQN). Based on the sampled negative actions (items), we can calculate the "advantage" of a positive action, which can then be utilized as a weight for the self-supervised part. This leads to another learning framework: Self-Supervised Advantage Actor-Critic (SA2C). We integrate SNQN and SA2C with four state-of-the-art sequential recommendation models and conduct experiments on two real-world datasets. Experimental results show that the proposed approaches achieve better performance than existing self-supervised reinforcement learning methods. Code will be open-sourced.
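The two ideas above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`snqn_loss`, `sa2c_weight`), the squared-error TD loss, the choice of zero reward for sampled negatives, and all input shapes are illustrative assumptions; the paper's models would compute Q-values and logits with a neural sequential backbone.

```python
import numpy as np

def snqn_loss(q_values, logits, pos_action, neg_actions,
              reward, gamma, q_next_max):
    """Hypothetical one-step SNQN objective.

    q_values:    Q-head outputs over all items, shape (num_items,)
    logits:      self-supervised head outputs, shape (num_items,)
    pos_action:  the observed (positive) item id
    neg_actions: sampled negative item ids
    """
    # Self-supervised cross-entropy on the positive item.
    log_probs = logits - np.log(np.sum(np.exp(logits)))
    ce_loss = -log_probs[pos_action]

    # TD loss for the positive action: target is reward plus
    # the discounted max Q-value of the next state.
    td_target = reward + gamma * q_next_max
    q_loss = (q_values[pos_action] - td_target) ** 2
    # Sampled negatives receive no reward, so their Q-values
    # are regressed toward zero (an illustrative choice here).
    q_loss += np.sum(q_values[neg_actions] ** 2)

    return ce_loss + q_loss

def sa2c_weight(q_values, pos_action, neg_actions):
    """Advantage of the positive action over the sampled negatives,
    used in SA2C to re-weight the self-supervised loss term."""
    return q_values[pos_action] - np.mean(q_values[neg_actions])
```

In this sketch, a positive action whose Q-value barely exceeds those of the sampled negatives receives a small advantage weight, so the self-supervised gradient is damped for items the critic considers low-value.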