- Kazuma Hashimoto
- Iftekhar Naim
- Karthik Raman
Abstract
A growing number of studies use text-to-text generative models for sequential labeling (e.g., entity extraction and dialog slot tagging). Although these models are effective, one remaining research question is how to reliably estimate the confidence of their predictions, since confidence estimation is crucial in practical use cases (such as analyzing and interpreting the predictions). This short paper presents a comparative study of methods for estimating a confidence score for each labeled span. We start by naively using the decoder's output probabilities, and then propose methods that take full advantage of the top-$k$ statistics given by beam search. We conduct experiments across six different datasets/tasks, showing that using the top-$k$ statistics significantly reduces the model's calibration errors.
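To make the idea of top-$k$ statistics concrete, the following is a minimal illustrative sketch and not the method proposed in the paper. It assumes that beam search returns $k$ hypotheses, each with a sequence log-probability and a set of labeled spans, and it scores a span by the normalized probability mass of the hypotheses that contain it. It also includes a standard binned expected calibration error as one possible way to measure calibration. The function names, the `(label, text)` span representation, and the toy numbers are all hypothetical.

```python
import math
from collections import defaultdict


def span_confidences(hypotheses):
    """Aggregate span confidence over top-k beam hypotheses.

    `hypotheses` is a list of (log_prob, spans) pairs, where `spans` is an
    iterable of (label, text) tuples. This input format is an assumption
    made for illustration only.
    """
    # Normalize the sequence probabilities over the k hypotheses,
    # subtracting the maximum log-probability for numerical stability.
    max_lp = max(lp for lp, _ in hypotheses)
    weights = [math.exp(lp - max_lp) for lp, _ in hypotheses]
    total = sum(weights)

    conf = defaultdict(float)
    for weight, (_, spans) in zip(weights, hypotheses):
        # Count each span once per hypothesis.
        for span in set(spans):
            conf[span] += weight / total
    return dict(conf)


def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned expected calibration error with equal-width bins on [0, 1]."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, y))

    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += len(members) / n * abs(avg_conf - accuracy)
    return ece


# Toy example with k = 3 hypotheses (made-up numbers).
toy_hypotheses = [
    (math.log(0.5), [("PER", "Ann")]),
    (math.log(0.3), [("PER", "Ann"), ("LOC", "Paris")]),
    (math.log(0.2), [("LOC", "Paris")]),
]
toy_conf = span_confidences(toy_hypotheses)
```

In this toy example, the span `("PER", "Ann")` appears in the two most probable hypotheses and therefore receives more of the normalized probability mass than `("LOC", "Paris")`, which appears in the two least probable ones.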