Abstract
Current deep learning methods study time series data only in isolation, and a considerable gap exists between the time series community and the communities that work with vision and language data. With the tremendous success of large language models (LLMs), many have started to ask how LLMs can be exploited for improved forecasting. A few recent works have explored LLMs by either converting numerical time series into text strings or fine-tuning a pretrained vision-language model. In this paper, we argue that these approaches do not fully exploit the predictive power of LLMs for text data and are susceptible to outputting irrelevant tokens. We propose to exploit LLMs for multimodal time series forecasting by combining textual data with numerical time series. We develop a framework that efficiently encodes multimodal sequence data and generates only time series values as forecasts. To validate our framework, we collect three large-scale real-world multimodal time series datasets from different domains: e-commerce, healthcare, and climate science. Compared to single-modal deep learning models and existing LLM-based methods, our approach leads to an xx\% improvement in forecasting accuracy. Furthermore, with the addition of text prompts, our framework enables efficient and highly interpretable time series scenario creation.