Fangyu Liu

I'm a Research Scientist at Google DeepMind, where I help build multimodal LLMs with frontier capabilities.
Authored Publications
    Language models (LMs) trained on raw texts have no access to the real physical world environment. Gordon and Van Durme (2013) argue that they thus suffer from reporting bias, meaning that texts rarely report commonsensical facts about the world but more frequently talk about non-commonsensical facts/events. If LMs naively overfit to the co-occurrence statistics of training corpora, they learn a biased view of the physical world. While prior studies have repeatedly verified that LMs of smaller scales (e.g. RoBERTa, GPT-2) amplify reporting bias, it remains unknown whether such trends continue when models are scaled up. We investigate reporting bias in larger language models (LLMs) such as PaLM and GPT-3. Specifically, we query LLMs for the colour of objects, using colour as a representative property for visual commonsense. Surprisingly, we find that LLMs significantly outperform smaller LMs on answering queries about an object's typical colour. We also find that LLMs' predictions deviate from corpus co-occurrence statistics induced from resources such as Google Books Ngram and are closer to human judgement. We believe this serves as evidence that larger LMs can overcome reporting bias, rather than showing an inverse scaling function as previously suggested.
    Because of the cost and time required to train Large Language Models (LLMs), the knowledge embedded within them is usually frozen at the moment their training data is collected. As a result, LLMs have been shown to suffer from diachronic degradation. The in-context learning paradigm can provide a workaround for this limitation by supplying relevant information at inference time. We introduce a new benchmark to evaluate LLMs on one particular but critical aspect of diachronic change: language acquisition. To that end, we rewrite Winograd-style co-reference resolution problems by replacing a word with a new synthetic but plausible English word. The meaning of the word is given to the model in the prompt via a dictionary definition. We show that the accuracy of LLMs decreases radically on our benchmark compared to the original Winograd tasks, and we believe this serves as a measure of progress for future models.
    Visual language such as charts and plots is ubiquitous in the human world. Comprehending plots and charts requires strong reasoning skills. Prior state-of-the-art models are end-to-end multimodal Transformers pretrained with dedicated plot derendering and numerical reasoning objectives. However, the models' reasoning capabilities still fall short, and they generally fail on complex queries. In this paper, we decompose the multimodal reasoning problem into, first, a modality conversion problem from image to text and, then, a purely textual reasoning problem. We do so by combining a pretrained image-to-text model and an LLM for the task of chart/figure reasoning. Compared with a SOTA model finetuned on >10k data points, our plug-and-play model DePlot-LLM achieves >20% improvement over finetuned SOTA with just one-shot prompting.
    Visual language data such as plots, charts, and infographics are ubiquitous in the human world. However, state-of-the-art vision-language models do not perform well on these data. We propose a set of pretraining tasks to enhance visual language models' capabilities in jointly modeling charts/plots and language data. We initialize with Pix2Struct, a recently proposed image-to-text visual language model, and continue pretraining with our proposed objectives. We argue that numerical reasoning and plot deconstruction equip a model with the key capabilities of (1) extracting key information and (2) reasoning on the extracted information. On standard benchmarks such as PlotQA and ChartQA, our continually pretrained MatCha model outperforms state-of-the-art methods by as much as ~20%. We also examine how well MatCha pretraining transfers to domains such as screenshot, textbook, and poster figures. We observe an improvement over the base Pix2Struct checkpoint of 1.2% on average, verifying the usefulness of MatCha pretraining on broader visual language tasks.