A Comparison of Features for Automatic Readability Assessment

Lijun Feng
Martin Jansche
Matt Huenerfauth
Noémie Elhadad
23rd International Conference on Computational Linguistics (COLING 2010), Poster Volume, pp. 276-284


Several sets of explanatory variables – including shallow, language modeling, POS, syntactic, and discourse features – are compared and evaluated in terms of their impact on predicting the grade level of reading material for primary school students. We find that features based on in-domain language models have the highest predictive power. Entity-density (a discourse feature) and POS-features, in particular nouns, are individually very useful but highly correlated. Average sentence length (a shallow feature) is more useful – and less expensive to compute – than individual syntactic features. A judicious combination of features examined here results in a significant improvement over the state of the art.