Google Research

A Comparison of Features for Automatic Readability Assessment

  • Lijun Feng
  • Martin Jansche
  • Matt Huenerfauth
  • Noémie Elhadad
23rd International Conference on Computational Linguistics (COLING 2010), Poster Volume, pp. 276-284


Several sets of explanatory variables – including shallow, language modeling, POS, syntactic, and discourse features – are compared and evaluated in terms of their impact on predicting the grade level of reading material for primary school students. We find that features based on in-domain language models have the highest predictive power. Entity density (a discourse feature) and POS features, in particular those based on nouns, are individually very useful but highly correlated with each other. Average sentence length (a shallow feature) is more useful – and less expensive to compute – than individual syntactic features. A judicious combination of the features examined here results in a significant improvement over the state of the art.
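
To make the "shallow feature" mentioned above concrete, the following is a minimal sketch of how average sentence length might be computed. It is not the authors' implementation: the function name `average_sentence_length`, the regex-based sentence splitting, and the word-token pattern are all assumptions made for illustration, and a real system would likely use a trained sentence splitter and tokenizer.

```python
import re


def average_sentence_length(text: str) -> float:
    """Return the mean number of word tokens per sentence.

    Hypothetical helper for illustration only; the splitting rules
    below are naive assumptions, not the method used in the paper.
    """
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if not sentences:
        return 0.0
    # Count word-like tokens (letters, digits, apostrophes) in each sentence.
    token_counts = [len(re.findall(r"[A-Za-z0-9']+", s)) for s in sentences]
    return sum(token_counts) / len(sentences)


if __name__ == "__main__":
    sample = "The cat sat. The dog ran fast!"
    print(average_sentence_length(sample))
```

Because a feature like this needs only tokenization and no parsing, it is cheap to compute compared with syntactic features, which is consistent with the cost remark in the abstract.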
