- Yun Zeng
We propose a unified principle for exploring context in machine learning models. Starting from the simple assumption that each observation (a random variable) may or may not depend on its context (conditional variables), a conditional probability distribution is decomposed into two parts: context-free and context-sensitive. Then, by employing the log-linear word production model to relate random variables to their embedding-space representations, and by exploiting the convexity of the natural exponential function, we show that the embedding of an observation can likewise be decomposed into a weighted sum of two vectors representing its context-free and context-sensitive parts, respectively. This simple yet powerful treatment of context provides a unified view of almost all existing deep learning models and leads to revisions of these models that achieve significant performance boosts. Specifically, our upgraded version of a recent sentence embedding model (Arora et al., 2017) not only outperforms the original by a large margin, but also yields a new, principled approach to composing the embeddings of bag-of-words features, as well as a new architecture for modeling attention in deep neural networks. More surprisingly, our new principle provides an intuitive explanation of the gates and equations defined by the long short-term memory (LSTM) model, which in turn leads to a new model that converges significantly faster and achieves much lower prediction errors. Furthermore, our principle inspires a new type of generic neural network layer that better resembles real biological neurons than the traditional architecture of a linear mapping followed by a nonlinear activation. Its multi-layer extension provides a new principle for deep neural networks that subsumes the residual network (ResNet) as a special case, and its extension to the convolutional neural network model accounts for irrelevant input (e.g., background in an image) in addition to filtering.
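The core decomposition described above can be illustrated with a minimal sketch. This is not the paper's implementation; the function and variable names (`context_aware_embedding`, `alpha`, `v_free`, `v_ctx`) are illustrative assumptions, with `alpha` standing in for the (typically learned) weight that gates between the two parts.

```python
import numpy as np

def context_aware_embedding(v_free, v_ctx, alpha):
    """Combine the context-free and context-sensitive parts of an embedding.

    alpha in [0, 1] plays the role of a gate: the probability-like weight
    that the observation depends on its context. All names here are
    illustrative, not taken from the paper.
    """
    return alpha * v_ctx + (1.0 - alpha) * v_free

# Toy example: two orthogonal component vectors blended by the gate.
v_free = np.array([1.0, 0.0])   # context-free part
v_ctx = np.array([0.0, 1.0])    # context-sensitive part
print(context_aware_embedding(v_free, v_ctx, 0.25))  # -> [0.75 0.25]
```

At `alpha = 0` the embedding reduces to its context-free vector, and at `alpha = 1` it is fully context-sensitive; this gating view is what connects the decomposition to attention and to LSTM-style gates later in the abstract.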
Our models are validated on various benchmark datasets, and we show that in most cases, simply replacing existing layers with our context-aware counterparts is sufficient to significantly improve the results.