Analogy and Structure

Royal Skousen
1992
Kluwer Academic Publishers
Dordrecht
ISBN 0-7923-1935-4

INTRODUCTION

In 1973, while teaching at the University of Texas, I began research on a model of language description to account for a number of specific language problems (such as English spelling, Finnish morphology, and probabilistic language behavior). An important milestone in this research came in 1979 when I realized that probabilistic behavior could be learned and produced indirectly rather than directly. Instead of trying to learn probabilities and then using those probabilities to predict behavior, it would be easier to store examples of the probabilistic behavior and then randomly select one of those examples to predict a specific occurrence of probabilistic behavior. The remaining problem then was to determine how to predict behavior for a given context when there were no occurrences of that context. The solution to this problem came with the discovery of the first natural statistic in 1981, while I was a Fulbright lecturer in Finland. The next two years were spent writing up my findings. In 1983 I completed a 480-page manuscript, Analogy and Structure. This book represents, with some minor changes, that original work.
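To make this idea concrete, here is a minimal sketch (illustrative only, not the formulation developed later in this book) of predicting probabilistic behavior indirectly, by storing occurrences and randomly selecting one, rather than by estimating a probability:

    import random

    # Stored occurrences of a probabilistic behavior: outcome "a" was
    # observed 7 times and outcome "b" 3 times. No probability is ever
    # computed or stored -- only the occurrences themselves.
    occurrences = ["a"] * 7 + ["b"] * 3

    def predict():
        """Predict a specific occurrence by randomly selecting one of
        the stored examples."""
        return random.choice(occurrences)

    # Over many predictions, "a" comes out about 70% of the time,
    # matching the observed frequencies without the probability 0.7
    # ever having been learned directly.
    sample = [predict() for _ in range(10000)]
    print(sample.count("a") / len(sample))  # close to 0.7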

In 1985 I wrote an 84-page summary of Analogy and Structure. Even so, I realized that this work, however significant, would not reach its intended audience unless I wrote a summary version with applications to specific language examples. In 1987 I completed the manuscript of Analogical Modeling of Language, which was published in 1989. That more recent version of the theory, however, presents the mathematical findings of Analogy and Structure in only a sketchy outline. The publication of the original, complete Analogy and Structure will provide the necessary foundations for understanding the nature of analogical approaches to language.

In this work I present the chapters of Analogy and Structure essentially as they were originally written, except that I have omitted sections that have already appeared in print (namely, in Analogical Modeling of Language). I have also deleted one irrelevant example and one unnecessary proof. Analogy and Structure is divided into two parts: "Structuralist Descriptions" (Part I) and "Analogical Descriptions" (Part II). Because of the mathematical nature of the chapters in Part I, each of these chapters is preceded by a synopsis taken from the summary originally written in 1985. The original 1983 introduction to Analogy and Structure is not reproduced here, since it served as the introduction to Analogical Modeling of Language. In Part II, those portions that have already appeared in print are replaced by the 1985 summary. The important discoveries of Part I include the following:

(1) The normal approach for measuring the uncertainty of rule systems is Shannon's "information", the logarithmic measure first used in physics and more commonly known as the entropy H. Although this measure can be given a natural interpretation (as the number of yes-no questions needed to determine the outcome of a rule occurrence), it has a number of disadvantages: (a) entropy is based on the notion that one gets an unlimited number of chances to discover the correct outcome, an unreasonable procedure for a psychologically based theory of behavior; (b) the entropy of continuous probabilistic distributions is infinite, and a directly defined entropy density is infinite as well, so an unmotivated definition of entropy density must be adopted instead, one that sometimes gives negative measures of entropy density!
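For reference, the entropy in question is the standard Shannon measure, H = -sum(p * log2(p)); the following sketch (my illustration of that standard definition) shows its question-counting interpretation:

    import math

    def entropy(probs):
        """Shannon entropy H = -sum(p * log2(p)), in bits: the average
        number of yes-no questions needed to determine the outcome."""
        return -sum(p * math.log2(p) for p in probs if p > 0)

    # Four equally likely outcomes require two yes-no questions:
    print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0

    # A deterministic rule has no uncertainty:
    print(entropy([1.0]))  # 0.0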

(2) A more plausible method for measuring uncertainty is a quadratic one, the disagreement Q. This measure also has a natural interpretation: it represents the probability that two randomly chosen occurrences of a rule disagree in outcome. The disagreement is based on the psychologically plausible restriction that one gets a single chance to guess the correct outcome rather than an unlimited number of guesses. Moreover, the disagreement density exists (and is finite and positive) for virtually all continuous probabilistic distributions. In fact, the disagreement density can be used to measure the uncertainty of continuous distributions for which the variance (the traditional measure of dispersion) is undefined.
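Since two independently chosen occurrences agree with probability sum(p**2), the disagreement works out to Q = 1 - sum(p**2); a minimal sketch, assuming independent draws from the underlying distribution:

    def disagreement(probs):
        """Disagreement Q = 1 - sum(p**2): the probability that two
        independently chosen occurrences of a rule differ in outcome."""
        return 1 - sum(p * p for p in probs)

    # Four equally likely outcomes: two occurrences agree with
    # probability 1/4, so Q = 0.75 (one guess, usually wrong).
    print(disagreement([0.25, 0.25, 0.25, 0.25]))  # 0.75

    # A deterministic rule shows no disagreement at all:
    print(disagreement([1.0]))  # 0.0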

(3) Structuralist descriptions have implicitly assumed that descriptions of behavior should not only be correct, but should also minimize the number of rules and permit only the simplest possible contextual specifications. These intuitive notions can actually be derived from more fundamental statements about the uncertainty of rule systems. For example, an optimal description is defined as a system of rules that minimizes the probability that the measured dependence between rules is accidental. From this definition it can be shown that an optimal description will not only be a correct description of the behavior, but will also use a minimal number of rules to describe that behavior. Further, by defining the notion of contextual complexity, we can derive the simplest description, an optimal description for which the overall complexity of the rule contexts in the description is minimized.

(4) Using this notion of a simplest description, we can define three basic kinds of behavior (categorical, exceptional/regular, and idiosyncratic), as well as various combinations of these behaviors. The role (and limitations) of rule ordering in structuralist descriptions can also be determined. Other important considerations in learning rule descriptions can be dealt with as well: for example, how long does it take to learn a system of rules? How many excessive errors will occur in learning it? And finally, the overall effect of a given variable in reducing uncertainty can be measured, thus allowing us to determine which variables are the most important in accounting for behavior.

Despite the nice mathematical properties of rule descriptions, there are serious empirical and conceptual defects in their ability to predict behavior. Rule descriptions partition the contextual space, thus sharply demarcating different types of behavior. Yet actual language behavior shows that speakers' predictions are often fuzzy at rule boundaries. Rule descriptions provide a static view of the behavior and are incapable of adjusting to difficult situations, such as contexts that are "ill-formed" or where the specification for a "crucial" variable is lacking. Finally, rule descriptions are virtually incapable of describing the probabilistic behavior characteristic of language variation. In fact, the correct description of non-deterministic behavior may ultimately require a separate rule for every different set of conditions. Taken to its logical conclusion, this would mean that each rule would represent a single occurrence, since probably no two occurrences are completely identical (given enough contextual specification). In other words, instead of representing types of occurrence, rules would represent tokens of occurrence. In Part II of this book, this atomistic approach forms the basis for an analogical model of description, thus providing an alternative to structuralist (or rule-based) descriptions of behavior. Most importantly, analogical models are dynamic alternatives to the static descriptions of rules.

The crucial problem in analogical descriptions is to locate heterogeneity in the contextual space. One of the major innovations of Part II is the notion of a natural statistic. Traditional statistical tests require knowledge of either the underlying probability distribution for the test or a distribution that approximates the underlying distribution. Unfortunately, such tests are mathematically very complex and completely unsuitable as psychologically plausible models of decision making. A natural statistic, on the other hand, avoids any direct consideration of probabilistic distributions, yet has the ability to predict random behavior as if the underlying probabilistic distribution were known. Two natural statistics are presented in Part II:

(1) The first natural statistic is based on the rate of agreement, which derives from the quadratic measure of uncertainty developed in Part I. The decision rule for determining heterogeneity is a very simple one: maximize the rate of agreement. This decision rule is a very powerful one, with a level of significance near 1/2. Smaller levels of significance (at 0.05 or less) can also be defined in terms of this statistic, so the test can be made fully equivalent to standard statistical tests. Most of the examples in Part II are analyzed in accordance with this natural statistic.
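A rough sketch of how such a decision rule might operate; the pooling and weighting here are simplifying assumptions of the sketch, not the precise formulation of Part II:

    def agreement_rate(outcomes):
        """Probability that two distinct stored occurrences, drawn
        without replacement, agree in outcome."""
        n = len(outcomes)
        if n < 2:
            return 1.0
        counts = {}
        for o in outcomes:
            counts[o] = counts.get(o, 0) + 1
        return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))

    def pooled_rate(groups):
        """Pairwise agreement pooled over groups (a simplifying
        assumption of this sketch)."""
        pairs = sum(len(g) * (len(g) - 1) for g in groups)
        agree = sum(agreement_rate(g) * len(g) * (len(g) - 1)
                    for g in groups)
        return agree / pairs if pairs else 1.0

    # Decision rule: prefer the analysis with the higher rate of
    # agreement. Here, splitting the data into two subcontexts wins.
    candidates = {
        "split":   [["a", "a", "a"], ["b", "b"]],
        "unsplit": [["a", "a", "a", "b", "b"]],
    }
    best = max(candidates, key=lambda k: pooled_rate(candidates[k]))
    print(best)  # "split"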

(2) The second natural statistic is, on the surface, an incredible one in that it eliminates the need for any mathematical calculation at all. By simple inspection all cases of potential heterogeneity in the contextual space are eliminated. This test represents the most powerful test possible: any context that could possibly be heterogeneous is declared to be heterogeneous. The decision rule for this statistic is extremely simple: minimize the number of disagreements. Such a powerful test is, of course, completely contrary to all standard statistical procedure, but by adding the concept of imperfect memory, this natural statistic gives the same basic results as standard statistics. In fact, there is a direct correlation between imperfect memory and level of significance: the more imperfect the memory, the smaller the level of significance. We always use the most powerful test based on minimizing the number of disagreements, but test at a more typical (that is, smaller) level of significance by randomly selecting only a small part of the data. In other words, a "statistically significant" relationship is one that will hold even when most of the data is forgotten. This very simple natural statistic is introduced in this book, but its properties are more fully developed in Analogical Modeling of Language. In fact, all the statistical predictions made in that work are based on this second natural statistic.
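The following sketch (again illustrative, under assumptions of its own) shows the two ingredients just described: counting disagreements by simple inspection, and imperfect memory as random retention of occurrences, where the retention rate plays the role of the level of significance:

    import random

    def disagreements(outcomes):
        """Number of pairs of occurrences that differ in outcome --
        countable by simple inspection, with no probabilities."""
        counts = {}
        for o in outcomes:
            counts[o] = counts.get(o, 0) + 1
        n = len(outcomes)
        agreeing = sum(c * (c - 1) // 2 for c in counts.values())
        return n * (n - 1) // 2 - agreeing

    def remember(data, retention=0.25):
        """Imperfect memory: each occurrence survives only with
        probability `retention`. A smaller retention rate corresponds
        to a smaller level of significance."""
        return [o for o in data if random.random() < retention]

    data = ["a"] * 40 + ["b"] * 10
    recalled = remember(data)
    print(len(recalled), disagreements(recalled))
    # A "statistically significant" relationship is one that still
    # minimizes disagreements even after most of the data is forgotten.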

The analogical approach to describing behavior can be called a procedural model, in distinction to the declarative models of structuralist descriptions. Given a specific context, a procedural model can predict behavior for that context, but in general an overall description of behavior cannot be "declared" (that is, explicitly stated) independently of any given context.
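As an illustration of this procedural character (a sketch under assumed data and an assumed matching criterion, not the actual analogical algorithm of Part II), behavior can be predicted per query from stored occurrences without ever declaring a rule set:

    import random

    # Illustrative stored occurrences: a context (two feature values)
    # paired with an observed outcome. The features and outcomes here
    # are hypothetical.
    exemplars = [
        (("consonant", "short"), "doubled"),
        (("consonant", "long"), "plain"),
        (("vowel", "short"), "doubled"),
    ]

    def predict(context):
        """Predict behavior for a given context by randomly selecting
        among the stored occurrences that share the most feature
        values with it. No rule set is ever stated."""
        def overlap(ctx):
            return sum(a == b for a, b in zip(ctx, context))
        best = max(overlap(ctx) for ctx, _ in exemplars)
        pool = [out for ctx, out in exemplars if overlap(ctx) == best]
        return random.choice(pool)

    print(predict(("vowel", "long")))  # computed per query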

Analogical models are not the only example of procedural modeling. Another procedural alternative to declarative (or rule-based) modeling is found in the work that has been done on "neural networks"(1) -- especially in the work of McClelland and Rumelhart, referred to as "parallel distributed processing".(2) But there is one major problem with the interactive models of learning found in all theories of "neural networking" -- namely, their inability to "learn" probabilities. Although probabilistic results can be predicted by neural networks, it is very difficult to get networks to predict the specific frequencies of occurrence that actually occur in data. This problem is a serious one, and researchers have spent a good deal of effort trying to solve it (as is demonstrated by the two chapters of McClelland and Rumelhart's Parallel Distributed Processing devoted to this problem).(3) Only under unnatural and extreme learning conditions (such as low computational temperatures) can neural networks learn probabilities. Yet the work by Labov and his associates on language variation has clearly demonstrated the ready ability of speakers to learn probabilities (or at least use language forms as if probabilities had been learned). Analogical models account for this stochastic ability of speakers -- and without directly learning probabilities.

It has sometimes been claimed that analogical models cannot predict the reaction times of certain recognition experiments in psychology. This is undoubtedly true for the analogical models that have been occasionally proposed in the literature. But the model proposed in this work (and in Analogical Modeling of Language) can readily account for these reaction times. In the last chapter of Part II, this claim against analogical modeling is shown to be false: a simple linear-based proposal for measuring the processing times of analogical sets neatly accounts for the different reaction times.

One important question has arisen in the work on analogical models -- namely, the basis for random selection. In my earliest work on analogical models, I assumed that the basis for random selection was the context itself; that is, the task is to first find all the homogeneous supracontexts, then randomly select one of these contexts. This approach ignores the frequency of each context, counting each homogeneous supracontext equally. In Analogy and Structure the basis for random selection is the occurrence rather than the context; that is, the task is to randomly select one of the occurrences in any of the homogeneous supracontexts. This method of random selection takes frequency into account, so that the probability of selecting an occurrence from any particular homogeneous supracontext is directly proportional to the frequency of that context. However, in Analogical Modeling of Language, another basis for random selection is proposed. Since pointers are used to measure uncertainty, the conceptually simplest basis is to randomly select a pointer so that the predicted outcome would be the occurrence associated with the selected pointer. In this case, the probability of selecting an occurrence from any particular homogeneous supracontext is proportional to the square of the frequency of that context.

    basis for random selection    proportional probability of selecting a
                                  particular homogeneous supracontext

    context                       (frequency)^0 = 1
    occurrence                    (frequency)^1
    pointer                       (frequency)^2

It appears that the predicted differences between the last two bases are minor. Nonetheless, there is a need for a full examination of the differences between these three bases.
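A small sketch (with hypothetical frequencies) makes the three bases directly comparable by weighting each homogeneous supracontext by its frequency raised to the power 0, 1, or 2:

    # Hypothetical homogeneous supracontexts and their frequencies
    # (numbers of stored occurrences).
    supracontexts = {"s1": 6, "s2": 3, "s3": 1}

    def selection_probs(freqs, exponent):
        """Probability of selecting each supracontext when it is
        weighted by frequency**exponent: 0 = by context, 1 = by
        occurrence, 2 = by pointer."""
        weights = {k: f ** exponent for k, f in freqs.items()}
        total = sum(weights.values())
        return {k: w / total for k, w in weights.items()}

    for basis, e in [("context", 0), ("occurrence", 1), ("pointer", 2)]:
        print(basis, selection_probs(supracontexts, e))
    # context:    1/3 each; frequency plays no role
    # occurrence: 0.6, 0.3, 0.1; proportional to frequency
    # pointer:    36/46, 9/46, 1/46; frequent contexts dominate more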

Since the publication of Analogical Modeling of Language, a number of important properties of natural statistics have been developed, especially in terms of imperfect memory. For instance, the natural statistics for a number of statistical procedures (such as estimating a probability, testing for homogeneity, or choosing the most frequent outcome) give results equivalent to the predictions of traditional statistics, but again without any mathematical calculation. A complete comparison of traditional statistics and natural statistics will appear in Natural Statistics, a book I am currently working on.

1 Maureen Caudill and Charles Butler, Natural Intelligent Systems (Cambridge, Massachusetts: MIT Press, 1990).

2 James L. McClelland, David Rumelhart, and the PDP Research Group, Parallel Distributed Processing (PDP), 2 volumes (Cambridge, Massachusetts: MIT Press, 1986).

3 Paul Smolensky, "Information Processing in Dynamical Systems: Foundations of Harmony Theory", PDP I:194-281 (chapter 6); Geoffrey E. Hinton and Terrence J. Sejnowski, "Learning and Relearning in Boltzmann Machines", PDP I:282-317 (chapter 7).