Area Attention
Abstract
Existing attention mechanisms, are mostly point-based in that a model is designed to attend to a single item in a collection of items (the memory). Intuitively, an area in the memory that may contain multiple items can be worth attending to as well. Although Softmax, which is typically used for computing attention alignments, assigns non-zero probability for every item in memory, it tends to converge to a single item and cannot efficiently attend to a group of items that matter. We propose area attention: a way to attend to an area of the memory, where each area contains a group of items that are either spatially adjacent when the memory has a 2-dimensional structure, such as images, or temporally adjacent for 1-dimensional memory, such as natural language sentences. Importantly, the size of an area, i.e., the number of items in an area, can vary depending on the learned coherence of the adjacent items. Using an area of items, instead of a single, we hope attention mechanisms can better capture the nature of the task. Area attention can work along multi-head attention for attending multiple areas in the memory. We evaluate area attention on two tasks: character-level neural machine translation and image captioning, and improve upon strong (state-of-the-art) baselines in both cases. In addition to proposing the novel concept of area attention, we contribute an efficient way for computing it by leveraging the technique of summed area tables.