In this paper, we present an approach for representation learning from videos. Instead of relying on hand-designed strategies to split videos into space-time tokens, our approach learns to mine the important tokens in video frames. This yields a small set of salient visual tokens, found efficiently and effectively, and enables modeling of pairwise interactions between them over a longer temporal horizon. We introduce a vector transformer to capture these pairwise space-time relations, together with a technique to fuse the transformed tokens while learning their spatio-temporal patterns. The approach is designed so that the tokenizer adapts to input video frames with diverse visual content, while the vector transformer and subsequent modules learn the underlying spatio-temporal interactions and long-range dependencies in the video. We demonstrate the effectiveness of the approach on challenging video classification datasets, outperforming the state of the art while using substantially less compute. We further conduct extensive ablation experiments to analyze the method.
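
The learned token mining described above can be sketched as spatial-attention-based pooling: each learned token is a weighted average of a frame's feature vectors, with the weights produced from the features themselves so the tokenizer reacts to the content of each frame. The sketch below is a minimal illustration under that assumption; `learn_tokens` and `w_attn` are hypothetical names, and the paper's actual tokenizer may compute the attention maps differently (e.g., with convolutional or MLP layers rather than a single linear map).

```python
import numpy as np

def learn_tokens(feature_map, w_attn):
    """Mine S tokens from one frame via content-dependent spatial attention.

    feature_map: (H, W, C) per-frame features.
    w_attn:      (C, S) illustrative learned weights, one attention map per token.
    Returns:     (S, C) mined tokens.
    """
    H, W, C = feature_map.shape
    flat = feature_map.reshape(H * W, C)              # (HW, C)
    logits = flat @ w_attn                            # (HW, S): one score map per token
    logits -= logits.max(axis=0, keepdims=True)       # numerical stability
    attn = np.exp(logits)
    attn /= attn.sum(axis=0, keepdims=True)           # softmax over spatial positions
    return attn.T @ flat                              # (S, C): attention-weighted pooling

rng = np.random.default_rng(0)
frame = rng.standard_normal((16, 16, 64))             # toy 16x16 frame with 64 channels
w = rng.standard_normal((64, 8))                      # mine 8 tokens
tokens = learn_tokens(frame, w)
print(tokens.shape)  # (8, 64)
```

Because only a few tokens survive per frame, downstream pairwise (token-to-token) modeling across many frames stays tractable, which is what enables the longer temporal horizon mentioned above.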