The Taxonomy of Writing Systems: How to Measure How Logographic a System Is
Abstract
Taxonomies of writing systems since Gelb (1952) have classified systems based on what the
written symbols represent: if they represent words or morphemes, they are logographic; if
syllables, syllabic; if segments, alphabetic; etc. Sproat (2000) and Rogers (2005) broke with
tradition by splitting the logographic and phonographic aspects into two dimensions, with
logography being graded rather than a categorical distinction. A system could be syllabic, and
highly logographic; or alphabetic, and mostly non-logographic. This accords better with how
writing systems actually work, but neither author proposed a method for measuring logography.
In this article we propose a novel measure of the degree of logography that uses an attention-based
sequence-to-sequence model trained to predict the spelling of a token from its pronunciation
in context. In an ideal phonographic system, the model should need to attend to only
the current token in order to compute how to spell it, and this would show in the attention
matrix activations. In contrast, with a logographic system, where a given pronunciation might
correspond to several different spellings, the model would need to attend to a broader context. The
ratio of the activation outside the token to the total activation forms the basis of our measure.
We compare this with a simple lexical measure, and an entropic measure, as well as several
other neural models, and argue that on balance our attention-based measure accords best with
intuition about how logographic various systems are.
Our work provides the first quantifiable measure of the notion of logography that accords
with linguistic intuition and, we argue, provides better insight into what this notion means.
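
As a rough illustration of the ratio described above, the following minimal sketch computes the share of attention mass that falls outside the current token. It assumes an attention matrix whose rows are the output (spelling) steps for one token and whose columns are input (pronunciation) positions in the surrounding context. The function name, the argument layout, and the toy matrices are hypothetical; the exact normalization and aggregation used in the article may differ.

```python
def outside_attention_ratio(attention, token_start, token_end):
    """Share of attention mass outside the input span [token_start, token_end).

    attention: list of rows, one per output step for the current token;
               each row holds the weights over all input positions.
    This is an illustrative definition, not necessarily the article's own.
    """
    if not attention or not attention[0]:
        raise ValueError("attention matrix must be non-empty")
    n_cols = len(attention[0])
    if any(len(row) != n_cols for row in attention):
        raise ValueError("all rows must have the same length")
    if not 0 <= token_start < token_end <= n_cols:
        raise ValueError("token span must lie within the input positions")

    total = sum(sum(row) for row in attention)
    if total == 0:
        raise ValueError("attention matrix has no mass")
    inside = sum(sum(row[token_start:token_end]) for row in attention)
    return (total - inside) / total


# Toy data (invented for illustration): the current token occupies
# input positions 1 and 2 in a four-position context.
phonographic_like = [
    [0.0, 0.9, 0.1, 0.0],
    [0.0, 0.1, 0.9, 0.0],
]
logographic_like = [
    [0.30, 0.20, 0.20, 0.30],
    [0.25, 0.25, 0.25, 0.25],
]

low = outside_attention_ratio(phonographic_like, 1, 3)
high = outside_attention_ratio(logographic_like, 1, 3)
```

Under this definition a value near 0 means the model attends almost entirely within the token, as expected for an ideal phonographic system, while larger values indicate reliance on broader context. A corpus-level score would presumably aggregate such values over many tokens.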