During the last two decades, as rule approaches have encountered difficulties in explaining language behavior, several competing non-rule approaches to language have been developed. First was the development (or rejuvenation) of neural networks, more commonly known in linguistics as connectionism and best exemplified by the work of McClelland, Rumelhart, et al. (1986) in what they call "parallel distributed processing". More recently, some researchers (such as Aha and Daelemans) have turned to exemplar-based systems (sometimes known as instance-based systems or "lazy learning") to explain language behavior (see Aha, Kibler, and Albert 1991; and Daelemans, Gillis, and Durieux 1994). These exemplar-based learning systems involve hunting for the most similar instances ("nearest neighbors") to predict language behavior. A more general theory of the exemplar-based approach is Skousen's analogical modeling of language (1989, 1992), which permits (under well-defined conditions) even non-neighbors to affect language behavior.
These non-rule approaches have several advantages over the traditional rule approaches. First of all, they can be explicitly defined and are therefore testable. Second, they are procedurally defined -- that is, they predict behavior for a given input, but do not declare any globally-defined rules. The problem of knowing how to learn and then use a general rule to predict specific behavior is avoided. Third, these non-rule approaches are robust in the sense that they can make predictions when the input is not "well-formed" or when "crucial" variables are missing. In general, boundaries between different behaviors (or outcomes) do not have to be precise; fuzzy boundaries and leakage across boundaries are in fact expected.
Since analogical modeling is a procedural approach, predictions are always based on a dataset of occurrences. Each occurrence is specified in terms of a set of variables and an assigned outcome for that specific assignment of variables. A given set of variables can occur more than once in a dataset, as can the assigned outcome. (In fact, such repetition is normal.) For the purposes of discussion, we will assume that n variables are specified.
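As a minimal sketch of what such a dataset might look like, the Python fragment below stores each occurrence as a tuple of variable values plus an assigned outcome. The names Occurrence, DATASET, and N_VARIABLES, the three variables, and the outcomes "e" and "r" are all invented for illustration and are not taken from any published dataset.

```python
from collections import Counter
from typing import NamedTuple, Tuple


class Occurrence(NamedTuple):
    """One occurrence: a value for each of the n variables plus its outcome."""
    variables: Tuple[str, ...]
    outcome: str


N_VARIABLES = 3  # hypothetical n for this toy example

# Hypothetical toy dataset; repetition of an occurrence is allowed and normal.
DATASET = [
    Occurrence(("3", "1", "0"), "e"),
    Occurrence(("0", "3", "2"), "r"),
    Occurrence(("2", "1", "0"), "r"),
    Occurrence(("2", "1", "2"), "r"),
    Occurrence(("3", "1", "1"), "r"),
    Occurrence(("2", "1", "0"), "r"),  # same variables and outcome as above
]

# How often each (variables, outcome) pair occurs in the dataset.
frequency = Counter(DATASET)
```

Keeping repeated occurrences as separate entries, rather than collapsing them, preserves the frequency information that the later selection step depends on.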
In order to make a prediction, we always do it in terms of a given context, where the variables are specified, but for which no outcome is given. Usually all n variables are specified in the given context, but this is not necessary. Our task is to predict the outcome for this given context in terms of the occurrences found in the dataset. For our purposes here, we will let m stand for the number of specified variables in the given context, where 0 <= m <= n.
For each subset of variables defined by the given context, we determine which occurrences in the dataset occur with that subset. Each of these subsets of variables is called a supracontext. Given m variables in the given context, we have a total of 2^m supracontexts.
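The enumeration of supracontexts can be sketched as follows. This is an illustration only: the function names, the use of None to mark an unspecified variable, and the toy dataset are assumptions made for the example. Each supracontext is represented by the positions of the variables it retains from the given context.

```python
from itertools import combinations


def supracontexts(given):
    """Yield every subset of the specified variable positions (2^m subsets)."""
    specified = [i for i, value in enumerate(given) if value is not None]
    for size in range(len(specified), -1, -1):
        for positions in combinations(specified, size):
            yield positions


def occurrences_in(dataset, given, positions):
    """Return the occurrences that agree with the given context at positions."""
    return [
        (variables, outcome)
        for variables, outcome in dataset
        if all(variables[i] == given[i] for i in positions)
    ]


# Hypothetical toy data: (variables, outcome) pairs and a fully specified context.
DATASET = [
    (("3", "1", "0"), "e"),
    (("0", "3", "2"), "r"),
    (("2", "1", "0"), "r"),
    (("2", "1", "2"), "r"),
    (("3", "1", "1"), "r"),
]
GIVEN = ("3", "1", "2")

for positions in supracontexts(GIVEN):
    label = "".join(GIVEN[i] if i in positions else "-" for i in range(len(GIVEN)))
    print(label, occurrences_in(DATASET, GIVEN, positions))
```

The most general supracontext (no variables retained) contains the whole dataset, while the most specific one contains only occurrences identical to the given context.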
Our problem is to determine the homogeneity (or its opposite, the heterogeneity) of each supracontext defined by the given context. Basically, a supracontext is homogeneous if all its possible subcontexts behave identically. In predicting the outcome for the given context, we only apply information found in the homogeneous supracontexts. All heterogeneous supracontexts are ignored.
We determine whether a supracontext is homogeneous by using a nonlinear statistical procedure based on measuring the number of disagreements between different occurrences within the supracontext. To do this we connect all the occurrences within a supracontext to each other by means of a system of pointers. For each pointer from one occurrence to another, we indicate whether the pointer points to a different outcome (a disagreement) or to the same outcome (an agreement). We adopt a conceptually simple statistical procedure for determining the homogeneity of the supracontext -- namely, if no subcontext of the supracontext increases the number of disagreements, the supracontext is homogeneous. Otherwise, the supracontext is heterogeneous. This measure ends up minimizing the number of disagreements (that is, the number of pointers to differing outcomes) in the supracontext. It turns out that this statistic is based on a quadratic measure of information with its reasonable restriction that language speakers get only a single chance to guess the correct outcome. This is unlike Shannon's logarithmic measure of uncertainty, which is based on the idea that speakers get an unlimited number of chances to guess the correct outcome.
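The counting of pointers and disagreements can be illustrated with a short sketch. It assumes (this is an assumption of the example, not a detail stated above) that every occurrence in a supracontext has a pointer to every occurrence in that supracontext, itself included, so that k occurrences give k^2 pointers. The function names are hypothetical. The sketch only counts disagreements; it does not carry out the homogeneity test itself.

```python
from collections import Counter
from fractions import Fraction


def count_pointers(outcomes):
    """Total pointers when each occurrence points to every occurrence."""
    return len(outcomes) ** 2


def count_disagreements(outcomes):
    """Pointers that lead from one outcome to a different outcome."""
    agreements = sum(count * count for count in Counter(outcomes).values())
    return count_pointers(outcomes) - agreements


def disagreement_rate(outcomes):
    """Fraction of disagreeing pointers (requires a non-empty list)."""
    return Fraction(count_disagreements(outcomes), count_pointers(outcomes))


# Hypothetical supracontext with one "e" occurrence and two "r" occurrences.
example = ["e", "r", "r"]
print(count_disagreements(example), disagreement_rate(example))
```

Under this assumption the disagreement rate equals one minus the sum of the squared relative frequencies of the outcomes, which is why the measure is quadratic rather than logarithmic.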
This statistical procedure of minimizing the number of disagreements is also the most powerful statistical test possible. However, by introducing the notion of imperfect memory, this test can be made equivalent to standard statistical procedures, especially when the probability of remembering a given occurrence is one-half. This kind of statistic is referred to as a natural statistic since it is psychologically plausible and avoids any direct consideration of probability distributions, yet has the ability to predict stochastic behavior as if the underlying probability distribution is known.
Using this natural statistic, it is easy to show that there are only two types of homogeneous supracontexts for a given context: (1) the supracontext is deterministic; or (2) the supracontext is non-deterministic but there is no occurrence in the supracontext that is closer to the given context than any other occurrence in the supracontext.
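These two types suggest a simple way to check homogeneity, sketched below. The sketch rests on one interpretive assumption: that "no occurrence is closer to the given context than any other" can be read as "all occurrences match the given context on exactly the same variables". The function names and the toy data are hypothetical.

```python
def match_pattern(variables, given):
    """Which variables of an occurrence agree with the given context."""
    return tuple(value == target for value, target in zip(variables, given))


def is_homogeneous(occurrences, given):
    """Type 1: a single outcome. Type 2: a single match pattern (assumed reading)."""
    outcomes = {outcome for _, outcome in occurrences}
    patterns = {match_pattern(variables, given) for variables, _ in occurrences}
    return len(outcomes) <= 1 or len(patterns) <= 1


GIVEN = ("3", "1", "2")  # hypothetical given context

# Deterministic: both occurrences have outcome "r".
deterministic = [(("2", "1", "2"), "r"), (("0", "3", "2"), "r")]
# Non-deterministic, but both occurrences match GIVEN on the same two variables.
same_distance = [(("3", "1", "0"), "e"), (("3", "1", "1"), "r")]
# Non-deterministic, and the first occurrence matches GIVEN on more variables.
mixed = [(("3", "1", "0"), "e"), (("2", "1", "0"), "r")]

print(is_homogeneous(deterministic, GIVEN),
      is_homogeneous(same_distance, GIVEN),
      is_homogeneous(mixed, GIVEN))
```

An empty supracontext passes this check trivially; it contributes no occurrences and therefore has no effect on the prediction.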
These homogeneous supracontexts form what is called the analogical set. The final step is to randomly select one of the occurrences in the analogical set and make our prediction based on the outcome assigned to this occurrence. Theoretically this selection has been done in two different ways: (1) randomly select one of the occurrences found in any of the homogeneous supracontexts; or (2) randomly select one of the pointers pointing to an occurrence in any of the homogeneous supracontexts. In the first case, the probability of selecting a particular occurrence is based on its frequency of occurrence within the homogeneous supracontexts. In the second case, the probability of selecting a particular occurrence is based on the square of its frequency of occurrence within the homogeneous supracontexts. This squaring of the frequency is the result of using a system of pointers (equivalent to the quadratic measure of uncertainty) to select an occurrence.
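The difference between the two selection methods can be sketched as follows. The input is assumed to be a list of homogeneous supracontexts, each given simply as a list of the outcomes of its occurrences. The sketch further assumes that in a supracontext with k occurrences each occurrence is the target of k pointers. The function name and the example data are hypothetical.

```python
from collections import Counter
from fractions import Fraction


def outcome_probabilities(homogeneous_supracontexts, by_pointer):
    """Probability of each outcome under selection by occurrence or by pointer."""
    weights = Counter()
    for outcomes in homogeneous_supracontexts:
        size = len(outcomes)
        for outcome in outcomes:
            # By pointer: each occurrence is pointed to by `size` pointers.
            # By occurrence: each occurrence counts once.
            weights[outcome] += size if by_pointer else 1
    total = sum(weights.values())
    return {outcome: Fraction(weight, total) for outcome, weight in weights.items()}


# Hypothetical analogical set: four homogeneous supracontexts.
ANALOGICAL_SET = [["e", "r"], ["e", "r"], ["r"], ["r", "r"]]

print(outcome_probabilities(ANALOGICAL_SET, by_pointer=False))
print(outcome_probabilities(ANALOGICAL_SET, by_pointer=True))
```

Selection by pointer gives more weight to occurrences in larger homogeneous supracontexts, which is the source of the quadratic effect.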
There is an alternative to random selection. Instead of randomly choosing one of the occurrences in the analogical set, one could examine the overall chances for each outcome under random selection but then select the most frequent outcome. This method is referred to as selection by plurality and is used to maximize gain (or minimize loss).
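Selection by plurality amounts to taking the outcome with the largest overall chance. A minimal sketch, with a hypothetical function name and invented weights:

```python
def select_by_plurality(outcome_weights):
    """Return the outcome with the largest weight (first one wins a tie)."""
    return max(outcome_weights, key=outcome_weights.get)


# Hypothetical overall chances for two outcomes, expressed as pointer counts.
weights = {"e": 4, "r": 9}
print(select_by_plurality(weights))
```

How ties should be broken is not addressed here; the sketch simply returns whichever tied outcome was listed first.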
For a basic introduction to analogical modeling, see the thematic section in volume 7 of Rivista di Linguistica (Eggington 1995). There Skousen (1995) provides a basic overview of analogical modeling and describes some of the advantages of analogical modeling over connectionism. In addition, Chandler (1995) describes some of the support from psycholinguistics for analogical modeling. And Robinson (1995) discusses inverse indexing (a simple nearest-neighbor approach) and compares it with analogical modeling.
The fundamental works on analogical modeling are two books by Skousen. The first one, Analogical Modeling of Language (Skousen 1989), provides a complete, but basic, outline of the approach (chapter 2) and then applies it to various language problems (chapter 3) as well as theoretical language issues (chapter 4). In chapter 5, Skousen provides an in-depth analysis of past-tense formation in Finnish. In particular, he shows how analogical modeling, unlike traditional rule approaches, is able to explain the complex historical development of the Finnish past-tense. The second book, Analogy and Structure (Skousen 1992), is a mathematical description of both rule-based and analogical approaches to describing behavior.
The concept of natural statistics is introduced in the second half of Analogy and Structure. It is also discussed briefly in chapter 4 of Analogical Modeling of Language. More recently, Skousen (1998) further develops the theory of natural statistics and demonstrates its close relationship to normal statistical procedures (especially when the probability of remembering a past occurrence equals one-half).
Analogical modeling has been applied to a number of specific language problems. Derwing and Skousen (1994) have used analogical modeling to predict English past-tense formation, especially the kinds of errors found in children's speech. Derwing and Skousen first constructed a dataset of verbs based on the frequently occurring verbs in grade-school children's speech and writing. Initially they predicted the past-tense for verbs in terms of a dataset composed of only the 30 most frequent verbs (most of which were irregular verbs), then they continuously doubled the size of the dataset (from 30 to 60, to 122, to 244, to 488, and finally to 976). Derwing and Skousen discovered that when the dataset was small, the kinds of errors children typically make were predicted, but by the time the dataset reached the third doubling (at 244 verbs) stability had usually set in, and the expected adult forms (that is, the standard language forms) were predicted more than any other. For instance, the most common prediction for the verb snow was snew as long as the dataset had only 30 or 60 verbs, but with 122 verbs (after the second doubling) the prediction shifted to the regular snowed (with a 90 percent chance). With the full dataset of 976 verbs, the probability of predicting the regular snowed reached 99 percent. Similarly, overflew was most commonly predicted for overflow until the third doubling (at 244 verbs), and succame for succumb (pronounced succome, of course) until the fourth doubling (at 488 verbs).
Analogical modeling (along with connectionism) has been criticized because it proposes a single-route approach to predicting the past-tense in English (see, for instance, Jaeger et al. 1996, 455-457, 477-478). Prasada and Pinker (1993) have argued, on the other hand, for a dual-route approach -- that is, irregular verbs in English are processed differently than regular verbs. More specifically, they argue that irregular verbs are predicted in an analogical, lexically-based fashion, but that regular verbs are predicted by rule (namely, by syntactically adding some form of the regular past-tense ending -ed). Jaeger et al. 1996 further argued that there is information from neural activity in the brain for the dual-route approach. The main claim about analogical modeling in Jaeger et al. 1996 was that analogical modeling could not predict the processing time differences between regular and irregular verbs, and between known and unknown verbs. In reply, Chandler and Skousen (1997) noted that in section 16.1 of Analogy and Structure (under "Efficiency and Processing Time"), the correct processing times were in fact predicted.
Prasada and Pinker (1993) report on a study in which English speakers produced past-tense forms for various nonce verbs. They found that a subject's willingness to provide irregular past-tense forms was strongly related to the nonce verb's phonological similarity to existing irregular verbs, but for nonce verbs similar to existing regular verbs, no such correlation was found. Prasada and Pinker took this basic difference in behavior as evidence that English speakers use a dual-route approach in forming the past-tense, especially since a single-route connectionist approach failed to predict the basic difference in behavior. But more recently, Eddington (2000a) has shown that just because connectionism fails to make the right prediction does not mean that the single-route approach is wrong. To the contrary, both analogical modeling and Daelemans' instance-based approach (each a single-route approach to describing English past-tense formation) correctly predict Prasada and Pinker's experimental findings.
An important application of analogical modeling is found in Jones 1996. Here we see analogical modeling applied to automatic translation (between English and Japanese). Most work done in analogical modeling has dealt with phonology, morphology, and orthography (the linguistic disciplines most closely connected to an objective reality), but here Jones shows how analogical modeling can be applied to syntax and semantics. He contrasts analogical modeling with both traditional rule approaches and connectionism (parallel distributed processing). In a variety of test cases, he finds analogical modeling more successful and less arbitrary than parallel distributed processing.
There have also been a number of applications to several non-English language problems in, for instance, the work of Eddington (Spanish stress assignment) and Douglas Wulf (German plural formation). Eddington's work on Spanish (Eddington 2000b) has shown that analogical modeling can correctly predict stress placement for about 95% of the words, but in addition can regularly predict the stress for nonce words from experiments and errors that children make. Wulf (1996) has found that analogical modeling is able to predict cases where an umlauting plural type has been extended from a frequent exceptional German plural to other less frequent words.
Daelemans, Gillis, and Durieux (1997) have done considerable work comparing analogical modeling with various instance-based approaches to language. They have discovered that under regular conditions, analogical modeling consistently outperforms their own instance-based approaches in predicting Dutch stress (see their table 1.3). Only when they add various levels of noise to the system are they able to get comparable results for analogical modeling and their instance-based approaches (see their table 1.4), but their introduction of noise appears to be irrelevant to the larger issue of which approach best predicts Dutch stress.
Skousen's work on the Finnish past-tense has been able to capture the otherwise unexplained behavior of certain verbs in Finnish. Of particular importance is his demonstration (Skousen 1995, 223-226) that the verb sorta- 'oppress', under an analogical approach, takes the past-tense form sorti. According to every rule analysis found in the literature, verb stems ending in -rta or -rtä should take -si in the past-tense. Yet speakers overwhelmingly prefer sorti, not sorsi. When we look at the analogical set for sorta- (a relatively infrequent verb), we discover that for this example only, verbs containing o as the first vowel (24 of them) almost completely overwhelm verbs ending in -rta or -rtä (only 5 of these). And each of these verbs with o produces the past-tense by replacing the final stem vowel a by i (thus giving sorti). This large group of o-vowel verbs just happens (from an historical point of view) to take this same outcome. Although there is another group of verbs that take the si outcome, its effect is minor. The resulting probability of analogically predicting sorti is 94.6 percent.
More generally, a correct theory of language behavior needs to pass certain empirical tests (Skousen 1989, 54-76). In cases of categorical behavior (such as the indefinite article a/an in English), there should be some leakage (or fuzziness) across categorical boundaries (such as an being replaced by a). Similarly, when we have a case of exceptional behavior in a field of regular behavior (such as the plural oxen in English), we should find that only when a given context gets very close to the exceptional item do we get a small probability of the given context behaving like the exception (such as the infrequent plurals axen for ax and uxen for the nonce ux). And finally, in empty space between two occurrences of different behavior, we should get transitional behavior as we move from one occurrence to the other.
A theory of language behavior is tested by considering what kinds of language changes it predicts. The ability to simply reproduce the outcomes for the occurrences in the dataset does not properly test a theory. Instead, we try to predict the outcome for given contexts that are not in the dataset, and then we check these predictions against the kinds of changes that have been observed, preferably changes that have been naturally observed. Such data for testing a theory can be found in children's language, historical change, dialect development, and performance errors. Experiments (involving, for instance, nonce items) can be helpful if their results do not inordinately clash with naturally observed changes, but in general, artificial experiments always run the risk of contaminated results. Experiments can help us gather additional data, provided their results do not sharply contradict observations from actual language use.
This explicit theory of analogical modeling differs considerably from traditional uses of analogy in language description. First of all, traditional analogy is definitely not explicit. Related to this problem is that almost any item can serve as the analogy for predicting behavior, although in practice the attempt is to always look to nearest neighbors for the preferred analogical source. But if this fails, one can almost always find some item, perhaps considerably different, that can be used to analogically predict the desired outcome. In other words, if needed, virtually any occurrence can serve as the analogical source.
Skousen's analogical modeling, on the other hand, will allow occurrences further away from the given context to be used as the exemplar, but not just any occurrence. Instead, the occurrence must be in a homogeneous supracontext. The analogical source does not have to be a near neighbor. The probability of an occurrence further away acting as the analogical model is nearly always less than a closer occurrence, but this probability is never zero (providing the occurrence is in a homogeneous supracontext).
Further, the ability to use all the occurrences in all the homogeneous supracontexts of the contextual space directly accounts for the gang effects we find when we describe either categorical or regular/exceptional behavior. In other words, we are able to predict "rule-governed" behavior (plus a little fuzziness) whenever the data behaves "regularly".
Analogical modeling does not require us to determine in advance which variables are significant and the degree to which these variables determine the outcome (either alone or in various combinations). Nearest-neighbor approaches are like traditional analogical practice in that they try to predict behavior by using the most similar occurrences to the given context. But unless some additional information is added, the leakage across categorical boundaries and in regions close to exceptions will be too large. As a result, nearest-neighbor approaches frequently try to correct for this excessive fuzziness by ranking the statistical significance of each variable. One can determine, as Daelemans, Gillis and Durieux have (1994, 435-436), the information gain (or other measures of reducing entropy) for each variable. Such added information requires a training period to determine this information, and in this regard is like connectionism.
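For readers unfamiliar with information gain, the sketch below computes it for a single variable using the standard textbook definition: the entropy of the outcomes minus the expected entropy of the outcomes once the value of the variable is known. This is an illustration under that assumption and is not necessarily the exact formulation used by Daelemans, Gillis, and Durieux. The function names and data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(outcomes):
    """Shannon entropy (in bits) of a non-empty list of outcomes."""
    total = len(outcomes)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(outcomes).values()
    )


def information_gain(dataset, index):
    """Reduction in outcome entropy from knowing the variable at index."""
    all_outcomes = [outcome for _, outcome in dataset]
    groups = defaultdict(list)
    for variables, outcome in dataset:
        groups[variables[index]].append(outcome)
    remainder = sum(
        len(group) / len(all_outcomes) * entropy(group)
        for group in groups.values()
    )
    return entropy(all_outcomes) - remainder


# Hypothetical data: variable 0 determines the outcome, variable 1 is irrelevant.
DATASET = [
    (("a", "x"), "A"),
    (("a", "y"), "A"),
    (("b", "x"), "B"),
    (("b", "y"), "B"),
]
print(information_gain(DATASET, 0), information_gain(DATASET, 1))
```

Note that the gain is computed once over the whole dataset, before any given context is known, which is what makes it a global rather than a local measure of significance.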
Analogical modeling, on the other hand, does not have a training stage except in the sense that one must have a dataset of occurrences. Predictions are made "on the fly", and all variables are considered a priori equal (with certain limitations due to restrictions on short-term memory). The significance of a variable is determined locally -- that is, only with regard to the given context. The extent of any gang effect is determined by the location of the given context and the amount of resulting homogeneity within the surrounding contextual space.
One possible example of a locally significant variable is the o vowel in the Finnish past-tense dataset mentioned in the previous section. In predicting the past-tense form for all verbs except one, this o variable is not crucial, no matter how frequent the verb is. It only turns out to be crucial for the relatively infrequent verb sorta- 'oppress', a verb which is not in the dataset. In other words, the necessity of this variable for predicting sorti for sorta- cannot be learned from predicting the past-tense of other verbs. This variable only becomes crucial when the analogical system is asked to predict the past-tense for sorta-. In an analogical approach, the significance of the o variable is locally determined, not globally. The occurrences in the dataset carry the information necessary to make predictions, but the significance of a particular variable cannot be determined independently of the occurrences themselves. Daelemans' nearest-neighbor approach, when it relies on measuring information gain, can never obtain sufficient gain for this o vowel to be able to predict sorti. This crucial example may play a very important role in empirically deciding between analogical modeling and nearest-neighbor approaches with information gain.
Daelemans, van den Bosch, and Zavrel (1999) argue that with nearest-neighbor approaches, predictions are worse if the data is "mined" in advance -- that is, if variables are reduced and "bad" (or "exceptional") examples are removed. Such systems tend to collapse or become degraded when memory losses occur. On the other hand, memory loss is important in analogical modeling, especially since imperfect memory results in statistically acceptable predictions (and reduces the extraordinary statistical power of the approach). For instance, randomly throwing out about half the data leads to standard statistical results. In analogical modeling, statistically significant results are retained under conditions of imperfect memory. In fact, a statistically significant result is one that holds when at least half the data is forgotten. The reason that analogical modeling can get away with substantial memory loss is because this approach considers much larger parts of the contextual space, whereas nearest-neighbor approaches tend to fail when memory is imperfect.
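Imperfect memory can be simulated by letting each occurrence be remembered independently with a fixed probability before a prediction is made. The sketch below is one simple way to do this; the function name, the parameter names, and the use of a seeded random generator are assumptions of the example.

```python
import random


def remember(dataset, p_remember=0.5, rng=None):
    """Keep each occurrence independently with probability p_remember."""
    if rng is None:
        rng = random.Random()
    return [occurrence for occurrence in dataset if rng.random() < p_remember]


# Hypothetical dataset of 1000 occurrences, identified here only by number.
full_dataset = list(range(1000))
remembered = remember(full_dataset, p_remember=0.5, rng=random.Random(0))
print(len(remembered), "of", len(full_dataset), "occurrences remembered")
```

With a probability of one-half, roughly half of the occurrences are forgotten on any one trial, and a different half on the next, so a prediction that survives repeated trials does not depend on any particular occurrence being available.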
In analogical modeling, given sufficiently large amounts of data, stability sets in, with the result that adding more examples in the data set will have little effect on predicting behavior. Imperfect memory also shows how less frequent exceptions tend to be removed from a language, but frequent exceptions are kept. This agrees with what Bloomfield observed many years ago about historical change (1933:408-410).
One important aspect of analogical modeling is that adjusting parameters and conditions doesn't make much difference in the resulting predictions. This is quite different from neural networks, where there are so many parameters and conditions to manipulate that almost any result can be obtained. One wonders if there is any limit to what can be described when so many possibilities are available.
Recent work in analogical modeling suggests that within the analogical approach it is difficult to manipulate parameters to get different predictions. (This inability is empirically desirable.) Consider, for instance, whether random selection is done by choosing either an occurrence or a pointer. The first possibility provides a linear-based prediction, the second a quadratic one. Yet when either method is used in analogical modeling, we get the same basic results except that under linearity we get an increase in fuzziness at category boundaries and around exceptional occurrences.
Similarly, we get the same basic results when we consider the conditions under which a given outcome can be applied to a given context. This problem first arose when Skousen tried to predict the past-tense for Finnish verbs. In Analogical Modeling of Language Skousen decided (1989, 101-104) to narrowly restrict the three possible past-tense outcomes by including a number of conditions:
replace the stem-final vowel by i | [no additional conditions added]
replace the stem-final a vowel by oi | [additional conditions: the first vowel is unround (i, e, or a)]
replace the sequence of t and the stem-final non-high unround vowel (e, a, or ä) by si | [additional conditions: the segment preceding the t is either a vowel or a sonorant (that is, not an obstruent)]
These added conditions had been assumed in all rule analyses of the Finnish past-tense.
But these added conditions are not part of the actual alternation (which replaces one sound -- or a sequence of sounds -- by another). So one obvious extension of applicability would be to ignore these additional conditions and allow an outcome to apply only whenever a given verb stem meets the conditions specified by the actual alternation:
replace the stem-final vowel by i
replace the stem-final a vowel by oi
replace the sequence of t and a non-high unround vowel by si
The argument for relaxing the conditions is that the analogical model itself should be able to figure out the additional conditions since they occur in the verbs listed in the dataset.
But one can even go further and let every outcome apply no matter what the stem ends in:
replace the stem-final vowel by i
replace the stem-final vowel by oi
replace the stem-final sequence of consonant and vowel by si
The argument here is that the analogical model should be able to figure out the alternation itself.
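To make the least restricted set of outcomes concrete, the sketch below applies all three to a stem written in ordinary Finnish orthography. It is a rough orthographic approximation, not a phonological analysis: the vowel inventory, the function name, and the treatment of the trailing hyphen are assumptions of the example.

```python
# Assumed orthographic vowel inventory for Finnish (illustration only).
VOWELS = set("aeiouyäö")


def apply_outcomes(stem):
    """Return the candidate past-tense form for each unrestricted outcome."""
    stem = stem.rstrip("-")  # allow citation forms such as "sorta-"
    forms = {}
    if stem and stem[-1] in VOWELS:
        forms["i"] = stem[:-1] + "i"    # replace the stem-final vowel by i
        forms["oi"] = stem[:-1] + "oi"  # replace the stem-final vowel by oi
        if len(stem) >= 2 and stem[-2] not in VOWELS:
            # replace the stem-final consonant and vowel by si
            forms["si"] = stem[:-2] + "si"
    return forms


print(apply_outcomes("sorta-"))
```

For sorta- this yields sorti, sortoi, and sorsi as candidates; which candidate is actually predicted is left entirely to the analogical set.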
Applying these different conditions on outcome applicability, the results were virtually the same. The only difference in prediction (using selection by plurality) occurred in a handful of cases of nearly equal probability between competing outcomes. In other words, analogical modeling doesn't provide many opportunities for varying parameters and conditions. We get the same basic results no matter whether we randomly select by occurrence or by pointer -- or the degree to which we restrict the conditions on outcome applicability. The only real way to affect the results is in the dataset itself: by what occurrences we put in the dataset and how we specify the variables for those occurrences. And how we determine the dataset is fundamentally a linguistic issue. Thus analogical modeling is a strong theory and is definitely risky. It is not easily salvaged if it fails to predict the right behavior.
In the concluding chapter of Analogical Modeling of Language, Skousen (1989, 137-139) points out a serious difficulty with the algorithm for analogical modeling -- namely, the fact that given m variables in the given context, there are 2^m supracontexts for which homogeneity must be determined. The current algorithm for determining homogeneity exhibits an exponential explosion for both the working memory of the program as well as the time needed for processing. If massive parallel processing is used, the time requirements become a linear function of the number of variables. But the hardware (or memory requirements) can only be reduced by a factor of 1/m^(1/2), which does not effectively eliminate the exponential explosion. Daelemans, Gillis, and Durieux (1997) have argued that their memory-based approaches exhibit linearity, in comparison to the exponential requirements for Skousen's analogical modeling. However, it should be noted that in order to make their instance-based approach work properly, Daelemans and his colleagues are forced to determine in advance which variables are statistically significant. But their "information gain" derives from a training stage which requires a global analysis of variable significance. Since it is patently false that language variables act independently of each other, information gain must take into account every possible combination of variables (but from a global point of view). In other words, there is no escaping the exponential explosion. Of course, having determined the information gain, then Daelemans et al. can run their linear-based algorithm. Analogical modeling, on the other hand, never determines global statistical significance. It only determines local statistical significance for a given context. In both cases then, there is an exponential explosion. One gets the suspicion that the exponential explosion is inherent in all linguistic analysis and cannot be avoided.
Connectionist modeling also requires the selection of variables -- one does not set up a connectionist system that directly interconnects every possible combination of variables.
A pragmatic concern regarding the original analogical algorithm was the need to apply the approach to cases which involve more than 12 or 13 variables. For instance, in his phonetic specifications in Analogical Modeling of Language, Skousen (1989, 51-54) had to reduce the number of variables that could be considered. Distinctive features were also avoided in Analogical Modeling of Language (Skousen 1989, 53-54), in part for combinatory reasons. A linear-based algorithm, on the other hand, could allow dozens, even hundreds of variables to be considered. Ultimately, the analogical algorithm needs to be linear so that it can be applied to language processing in real time.
Within the past few years, the analogical modeling research group at Brigham Young University has been working on the algorithm. Using the original algorithm as its basis (despite the exponential explosion), Dil Parkinson has ported the program to Perl and has been able to increase the number of variables to about 20. With this many variables, analogical modeling can at least be empirically tested on more complex language problems.
Along with his colleagues, Skousen has also been working on replacing the algorithm with one that does not need to keep track of every possible supracontext, but instead only certain crucial heterogeneous ones from which the homogeneous regions of the contextual space can be determined. For such an algorithm, memory and time requirements appear to be determined by the number of occurrences in the dataset rather than the number of variables in the given context. The exponential explosion can still occur, but only in time and not in memory. With such an algorithm, parallel processing may help to reduce the algorithm to linear time, although this possibility has not yet been investigated.
Another possibility is to re-interpret analogical modeling in terms of quantum computing. (For a general introduction to quantum computing, see Williams and Clearwater 1998.) One distinct theoretical advantage of quantum computing is that it can simultaneously keep track of an exponential number of states (such as 2^m supracontexts defined by an m-variable given context), thus potentially reducing intractable exponential problems to tractable polynomial analyses (or even linear ones). In certain well-defined cases it has been shown (in pseudocode only, since there is no hardware implementation of quantum computing thus far) that the exponential aspects of programming can be reduced to one of polynomial degree (which entails tractability, unlike exponential cases). Quantum computing allows for certain kinds of simultaneity or parallelism that exceed the ability of normal computing (sequential or parallel). The examples discussed thus far in quantum computing involve numbers, especially cryptography, as in Shor's program for determining the prime factors of a long integer (Williams and Clearwater 1998, 133-137).
One reason for considering analogical quantum computing is that the exponential factor seems to be inherent in all approaches to language processing. Thus far linguistic evidence argues that virtually all possible combinations of variables can be used by native speakers in predicting language. The exponential problem is obvious in analogical modeling, and normal kinds of parallel processing may ultimately fail to solve it. But as already noted, the exponential explosion is not restricted to analogical modeling. Other instance-based approaches and neural nets (connectionist approaches) also encounter exponential problems as they must decide how to limit their predictions to those based on the "most significant" variables. The difficulty for these other approaches is in the training stage, where the system has to figure out which combinations of variables are significant, a global task that is inherently exponential.
One advantage of analogical modeling is that no mathematical (or statistical) calculation is actually used in determining the analogical set; instead, there is just the simple comparison of supracontexts. This simplicity suggests that some very simple matrix operations could be used to determine a quantum analogical set that would be realized classically as a set of probabilities.
Thus far a few similarities between analogical modeling and quantum mechanics have been noted. First, the analogical measure of certainty is a probability derived from a sum of squares (Skousen 1992, 73-74), which resembles the conjugate squaring of complex numbers that quantum mechanics uses to create a classical probability (Omnes 1999, 34-45). Second, in applying natural statistics, it has been shown (Skousen 1998) in two different cases that when the probability of remembering is one-half, we get standard statistical results (including the ability to explain the traditional use of "level of significance" in statistical decision making). This probability of one-half implies an equal chance of forgetting or remembering, which appears to correspond to storing the occurrences of a dataset as a vector composed of quantum bits, each with an equal chance of being accessed or not (much like an electron's spin, with its two states of up and down).
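The quantum side of this comparison can be illustrated with a small sketch. It is not a formulation of analogical modeling, and the names `born_probabilities` and `uniform_register` are hypothetical. It shows only the standard step of multiplying each complex amplitude by its conjugate to obtain a classical probability, and what "an equal chance of being accessed or not" would mean for a register of n quantum bits under the assumption of an equal superposition.

```python
import math


def born_probabilities(amplitudes):
    """Return the classical probability of each state.

    Each probability is the amplitude times its complex conjugate,
    that is, the squared magnitude of the amplitude.
    """
    return [(a * a.conjugate()).real for a in amplitudes]


def uniform_register(n):
    """Amplitudes of n qubits, each in an equal superposition.

    All 2**n basis states receive the same amplitude (an assumption
    of this sketch, chosen to model a one-half chance per qubit).
    """
    size = 2 ** n
    return [complex(1 / math.sqrt(size), 0)] * size


# One qubit with equal chances of its two states ("up" and "down").
# The second amplitude is imaginary to show that the phase does not
# affect the resulting probability.
qubit = [complex(1 / math.sqrt(2), 0), complex(0, 1 / math.sqrt(2))]
print(born_probabilities(qubit))

# Three qubits: 8 equally likely states, each with probability 1/8.
register_probs = born_probabilities(uniform_register(3))
print(len(register_probs), sum(register_probs))
```

In this sketch each of the 2^n subsets of n stored occurrences would be equally likely, which is the classical picture of remembering each occurrence independently with probability one-half.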
One important reason for investigating the possibility of analogical quantum computing is that language speakers are able to deal with a seemingly unlimited number of linguistic variables, and to do so in linear time. Moreover, occurrences of local predictability (such as sorta- in Finnish) would indicate that speakers do not determine in advance which combinations of variables are significant. Rather, such decisions are always made "on the fly" and for a specific given context.
Aha, David W., Dennis Kibler, and Marc K. Albert. (1991). Instance-Based Learning Algorithms. Machine Learning 6, 37-66.
Bloomfield, Leonard. (1933). Language. New York: Holt, Rinehart and Winston.
Chandler, Steve. (1995). Non-Declarative Linguistics: Some Neuropsychological Perspectives. Rivista di Linguistica 7, 233-247.
Chandler, Steve, and Royal Skousen. (1997). Analogical Modeling and the English Past Tense: A Reply to Jaeger et al. 1996. http://humanities.byu.edu/am/jaeger.html.
Daelemans, Walter, Steven Gillis, and Gert Durieux. (1994). The Acquisition of Stress: A Data-Oriented Approach. Computational Linguistics 20(3), 421-451.
Daelemans, Walter, Steven Gillis, and Gert Durieux. (1997). Skousen's Analogical Modeling Algorithm: A Comparison with Lazy Learning. New Methods in Language Processing (eds. Daniel Jones and Harold Somers), 3-15. London: University College Press.
Daelemans, Walter, Antal van den Bosch, and Jakub Zavrel. (1999). Forgetting Exceptions is Harmful in Language Learning. Machine Learning 34, 11-43.
Derwing, Bruce, and Royal Skousen. (1994). Productivity and the English Past Tense: Testing Skousen's Analogy Model. The Reality of Linguistic Rules (eds. Susan D. Lima, Roberta L. Corrigan, and Gregory K. Iverson), 193-218. Amsterdam: John Benjamins.
Eddington, David. (2000a). Analogy and the Dual-Route Model of Morphology. Lingua 110, 281-298.
Eddington, David. (2000b). Spanish Stress Assignment within the Analogical Modeling of Language. Language 76, 92-109.
Eggington, William G. (1995). Analogical Modeling: A New Horizon. Rivista di Linguistica 7, 211-212.
Jaeger, Jeri J., Alan H. Lockwood, David L. Kemmerer, Robert D. Van Valin Jr., Brian W. Murphy, and Hanif G. Khalak. (1996). A Positron Emission Tomographic Study of Regular and Irregular Verb Morphology in English. Language 72, 451-497.
Jones, Daniel. (1996). Analogical Natural Language Processing. London: University College London Press.
McClelland, James L., David E. Rumelhart, and the PDP Research Group. (1986). Parallel Distributed Processing (PDP), 2 volumes. Cambridge: MIT Press.
Omnes, Roland. (1999). Understanding Quantum Mechanics. Princeton: Princeton University Press.
Prasada, Sandeep, and Steven Pinker. (1993). Generalization of Regular and Irregular Morphological Patterns. Language and Cognitive Processes 8, 1-56.
Robinson, Derek. (1995). Index and Analogy: A Footnote to the Theory of Signs. Rivista di Linguistica 7, 249-272.
Skousen, Royal. (1989). Analogical Modeling of Language. Dordrecht: Kluwer.
Skousen, Royal. (1992). Analogy and Structure. Dordrecht: Kluwer.
Skousen, Royal. (1995). Analogy: A Non-Rule Alternative to Neural Networks. Rivista di Linguistica 7, 213-231.
Skousen, Royal. (1998). Natural Statistics in Language Modeling. Journal of Quantitative Linguistics 5, 246-255.
Williams, Colin P., and Scott H. Clearwater. (1998). Explorations in Quantum Computing. New York: Springer-Verlag.
Wulf, Doug. (1996). An Analogical Approach to Plural Formation in German. Proceedings of the Twelfth Northwest Linguistics Conference. Working Papers in Linguistics 14, 239-254. Seattle: University of Washington.