A STATISTICAL COMPARISON OF WRITTEN LANGUAGE AND NONLINGUISTIC SYMBOL SYSTEMS
Abstract
Are statistical methods useful in distinguishing written language from nonlinguistic symbol
systems? Some recent articles (Rao et al. 2009a, Lee et al. 2010a) have claimed that they are. Both
articles use measures based at least in part on bigram conditional entropy, and subsequent
work by one of the authors (Rao) has used other entropic measures. In both cases the authors have
argued that the proposed methods either are useful for discriminating between linguistic and nonlinguistic
systems (Lee et al.) or at least count as evidence of a more ‘inductive’ kind for the status
of a system (Rao et al.).
Using a larger set of nonlinguistic and comparison linguistic corpora than were used in these
and other studies, I show that none of the previously proposed methods are useful as published.
However, one of the measures proposed by Lee and colleagues (2010a), used with a different cut-off
value, and a novel measure based on repetition turn out to be good measures for classifying symbol
systems into the two categories. For the two ancient symbol systems of interest to Rao and colleagues
(2009a) and Lee and colleagues (2010a)—Indus Valley inscriptions and Pictish symbols,
respectively—both of these measures classify them as nonlinguistic, contradicting the findings of
those previous works.
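
For readers unfamiliar with the quantity named above, bigram conditional entropy measures how uncertain the next symbol is, given the previous one: H(Y|X) = -sum over (x, y) of p(x, y) log2 p(y|x). The following is a minimal illustrative sketch, assuming unsmoothed maximum-likelihood estimates taken directly from bigram counts. It is not the procedure of any of the cited articles, whose estimation details (for example, smoothing) may differ, and the function name bigram_conditional_entropy is hypothetical.

```python
from collections import Counter
from math import log2


def bigram_conditional_entropy(texts):
    """Unsmoothed estimate of H(next symbol | previous symbol), in bits.

    `texts` is an iterable of symbol sequences (strings or lists).
    Assumption: probabilities are raw relative frequencies, no smoothing.
    """
    bigrams = Counter()   # counts of (previous, next) pairs
    contexts = Counter()  # counts of `previous` as the first element of a pair
    for symbols in texts:
        for prev, nxt in zip(symbols, symbols[1:]):
            bigrams[(prev, nxt)] += 1
            contexts[prev] += 1

    total = sum(bigrams.values())
    if total == 0:
        return 0.0  # no bigrams observed, so nothing to measure

    entropy = 0.0
    for (prev, _nxt), count in bigrams.items():
        p_joint = count / total          # p(x, y)
        p_cond = count / contexts[prev]  # p(y | x)
        entropy -= p_joint * log2(p_cond)
    return entropy
```

Under these assumptions, a rigidly alternating sequence such as "abababab" yields 0 bits, because each symbol fully determines its successor, while a corpus in which each of two symbols is followed equally often by either one yields 1 bit.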