Article summary: "Natural Statistics in Language Modeling"
Royal Skousen
A paper delivered at the 3rd International Conference on Quantitative
Linguistics, 27 August 1997, University of Helsinki, Helsinki, Finland.
Published in Journal of Quantitative Linguistics 5:3 (1998): 246-255.
ABSTRACT
Language speakers have the ability to estimate frequencies of occurrence,
predict which outcome is the most frequent, and use language as if the
statistical relationships between various linguistic variables have been
determined. Within a psychologically plausible theory of analogical modeling,
natural statistics would allow speakers to make such judgments without
requiring them to posit highly complex statistical distributions or to
directly calculate probabilities mathematically.
OUTLINE
The Original Problem
- a probabilistic series of different outcomes
- problem of learning and using a probabilistic rule
- alternative: storing examples, randomly selecting an example
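
The exemplar-based alternative above can be illustrated with a minimal sketch. The stored outcomes and the function name are hypothetical, chosen only for illustration. No probabilistic rule is stored; predictions made by randomly selecting a stored example reproduce the observed proportions on their own.

```python
import random
from collections import Counter

def predict_by_random_selection(exemplars, rng):
    """Return the outcome of one randomly chosen stored exemplar."""
    return rng.choice(exemplars)

# Hypothetical stored outcomes: 'a' occurred 7 times, 'b' 3 times.
stored = ["a"] * 7 + ["b"] * 3

rng = random.Random(0)
trials = 10_000
draws = Counter(predict_by_random_selection(stored, rng) for _ in range(trials))

# The share of 'a' predictions should be close to 7/10,
# although no probability was ever computed or stored.
share_a = draws["a"] / trials
print(f"share of 'a' predictions: {share_a:.3f}")
```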
Estimating the Probability of Occurrence
- imperfect memory: assume probability of remembering is 1/2
- given that n exemplars are remembered
- get standard unbiased estimate of probability Ep
- variance asymptotically approaches Ep(1-Ep)/(n-1)
- equivalent to standard unbiased estimate of variance
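
As a rough sketch of the quantities named above, assuming each exemplar is remembered independently with probability 1/2: the proportion of an outcome among the n remembered exemplars serves as Ep, and Ep(1-Ep)/(n-1) is the standard unbiased estimate of the variance of that proportion. The data and function names are hypothetical, and the code computes the standard statistical formulas rather than the paper's own derivation.

```python
import random

def remember(exemplars, rng, p_remember=0.5):
    """Keep each exemplar independently with probability p_remember."""
    return [x for x in exemplars if rng.random() < p_remember]

def estimate(remembered, outcome):
    """Return (Ep, variance estimate) for one outcome among n remembered exemplars."""
    n = len(remembered)
    if n < 2:
        raise ValueError("need at least two remembered exemplars")
    ep = sum(1 for x in remembered if x == outcome) / n
    # Standard unbiased estimate of the variance of a sample proportion.
    return ep, ep * (1 - ep) / (n - 1)

rng = random.Random(1)
data = ["a"] * 60 + ["b"] * 40  # hypothetical data, for illustration only
remembered = remember(data, rng)
ep, var = estimate(remembered, "a")
print(f"n = {len(remembered)}, Ep = {ep:.3f}, variance = {var:.5f}")
```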
Predicting the Most Frequent Outcome
- random selection versus selection by plurality
- probability of determining which outcome is most frequent
- approximates standard statistical results
- providing probability of remembering is 1/2
- motivated level of significance (0.05, 0.01, 0.001)
- each equivalent to a doubling of data (24, 48, 96)
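
The contrast between random selection and selection by plurality can be sketched as a small simulation. The data, the trial count, the random tie-breaking, and all function names are assumptions of this sketch, and it does not attempt to reproduce the significance levels listed above.

```python
import random
from collections import Counter

def select_randomly(remembered, rng):
    """Random selection: return the outcome of one remembered exemplar."""
    return rng.choice(remembered)

def select_by_plurality(remembered, rng):
    """Selection by plurality: return the most frequent remembered outcome."""
    counts = Counter(remembered)
    top = max(counts.values())
    # Ties are broken at random (an assumption of this sketch).
    return rng.choice(sorted(o for o, c in counts.items() if c == top))

def hit_rates(data, target, trials, rng, p_remember=0.5):
    """Return (random-selection rate, plurality rate) of choosing `target`."""
    random_hits = plurality_hits = 0
    for _ in range(trials):
        # Imperfect memory: each exemplar is remembered with probability 1/2.
        remembered = [x for x in data if rng.random() < p_remember]
        if not remembered:
            continue  # nothing remembered: counted as a miss for both strategies
        random_hits += select_randomly(remembered, rng) == target
        plurality_hits += select_by_plurality(remembered, rng) == target
    return random_hits / trials, plurality_hits / trials

# Hypothetical data in which 'a' is the most frequent outcome (30 versus 20).
data = ["a"] * 30 + ["b"] * 20
random_rate, plurality_rate = hit_rates(data, "a", 2000, random.Random(2))
print(f"random selection: {random_rate:.3f}, plurality: {plurality_rate:.3f}")
```

Under these assumptions, random selection returns the most frequent outcome about as often as that outcome occurs, while selection by plurality identifies it considerably more often.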
Multivariate Analysis
- dealing with a large number of variables
- problem of many sampling zeros
- problem with global determination of variable relationships
- excessive number of hierarchical models in standard analysis
- examples of local determination:
  - Arabic terms of address using social variables
    - mostly empty contextual space
    - easily predicted locally, but not globally
    - too many variables and sampling zeros
  - Finnish past-tense
    - an "insignificant" global variable can sometimes be "highly significant" locally
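
The problem of sampling zeros follows from simple counting, sketched below. The number of variables, the values per variable, and the number of data items are hypothetical and are not taken from the Arabic or Finnish data sets.

```python
def min_empty_fraction(values_per_variable, n_items):
    """Return (number of cells, lower bound on the fraction of empty cells).

    Each combination of variable values is one cell of the contextual space.
    n_items data items can occupy at most n_items distinct cells.
    """
    cells = 1
    for v in values_per_variable:
        cells *= v
    occupied_at_most = min(n_items, cells)
    return cells, 1 - occupied_at_most / cells

# Hypothetical case: 10 variables with 3 values each, 1000 data items.
cells, empty = min_empty_fraction([3] * 10, 1000)
print(f"{cells} cells, at least {empty:.1%} of them empty")
```

Even in this modest hypothetical case the contextual space is mostly empty, which illustrates why the number of sampling zeros grows so quickly with the number of variables.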