BERT Rediscovers the Classical NLP Pipeline
Pre-trained sentence encoders such as ELMo (Peters et al., 2018a) and BERT (Devlin et al., 2018) have rapidly advanced the state-of-theart on many NLP tasks, and have been shown to encode contextual information that can resolve many aspects of language structure. We extend the edge probing suite of Tenney et al. (2019) to explore the computation performed at each layer of the BERT model, and find that tasks derived from the traditional NLP pipeline appear in a natural progression: part-of-speech tags are processed earliest, followed by constituents, dependencies, semantic roles, and coreference. We trace individual examples through the encoder and find that while this order holds on average, the encoder occasionally inverts the order, revising low-level decisions after deciding higher-level contextual relations.