it's a lesson worth learning that paul erdos, one of the most esteemed mathematicians of the 20th century, not only got the
monty hall problem wrong, but it took simulations
to convince him, even though the correct answer can be had by examining the sigma-algebra for this system, which is small and can be explicitly
penned down. he must have been aware of this, as he was a probabilist. thinking under uncertainty, even with human judgement fallacies excluded, is tricky at best and meaningless at worst.
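the simulation that settles it takes only a few lines. this is a minimal sketch (the door labels, trial count, and seed are my own illustrative choices, not anything from erdos's episode), playing the stay and switch strategies side by side:

```python
import random

def play(switch, trials=100_000, rng=random.Random(0)):
    """Simulate the Monty Hall game; return the empirical win rate."""
    wins = 0
    for _ in range(trials):
        car = rng.randrange(3)    # door hiding the car
        pick = rng.randrange(3)   # contestant's first pick
        # host opens a door that is neither the pick nor the car
        opened = next(d for d in range(3) if d != pick and d != car)
        if switch:
            # switch to the one remaining closed door
            pick = next(d for d in range(3) if d != pick and d != opened)
        wins += (pick == car)
    return wins / trials

print(play(switch=False))  # stays near 1/3
print(play(switch=True))   # stays near 2/3
```

switching wins exactly when the first pick missed the car, which happens 2/3 of the time; the simulation just makes that visible.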
although fascination with the unpredictability of gambling devices goes back to the time of the pharaohs,
these devices were not perceived as possessing inherent elements of uncertainty; instead, they were seen
as means of communicating with a source of knowledge (eg deity) that was basically deterministic.
as if from a broken record you can again hear the same warning,
this time amplified. if i want to assign meaningful probabilities
to very rare events, i cannot escape the link between the frequency
of data collection, the relevance of the data, and the rarity of the event
in question; and if the event depends on the codependence
among very many variables, the complexity of the problem literally explodes.
very rare events are not just bigger cousins of normal
occurrences. they belong to a different species
the Gaussian distribution is often called "normal" because of the widespread opinion that it sets a universally applicable "norm".
in the case of the phenomena studied throughout my life this opinion is unwarranted. in their case, randomness is highly non-Gaussian,
but it is no longer possible to describe it as "pathological" "improper" "anomalous" or "abnormal". therefore any occurrence of normal
in this book, as a synonym of Gaussian, is the result of oversight, and i try not to think about the second and third syllables
most engineering systems in communication, control, and signal processing are developed under the often erroneous assumption
that the interfering noise is Gaussian. many physical environments are more accurately modeled as impulsive, characterized
by heavy-tailed non-Gaussian distributions. the performance of systems developed under the assumption of Gaussian noise
can be severely degraded by the non-Gaussian noise due to potent deviations from normality in the tails
confusion over the interpretation of classical statistical tests is so complete as to render their application almost meaningless. ...this chaos extends throughout the scholarly hierarchy from the originators of the tests themselves - Fisher and Neyman-Pearson - to...professional statisticians to textbook authors to applied researchers
most...researchers are unaware of the historical development of classical statistical testing methods, and the mathematical and philosophical principles underlying them....researchers erroneously believe that..interpretation of such tests is prescribed by a single coherent theory of statistical inference. this is not the case...the distinction between evidence (p's) and error (alpha's) is not trivial...unfortunately statistics textbooks tend to inadvertently cobble together elements from [Fisher and NPW] schools of thought...perpetuating the confusion.
the principal means for ascertaining truth - induction and analogy - are based on probabilities; so that the entire system of human knowledge is connected with the theory [of probability].
we all know that there are good and bad experiments. the latter accumulate in vain. whether there are a hundred or a thousand, one single piece of work by a real master - by a Pasteur, for example - will be sufficient to sweep them into oblivion.
nature permits us to calculate only probabilities yet science has not collapsed.
so in order to get numerical probabilities we have to be able to judge that a number of cases are equally probable, and
to enable us to make this judgement we need an a priori principle...called by Keynes the Principle of Indifference.
the name is original to him but the principle itself, he says, was introduced by J Bernoulli under the name of the
principle of Nonsufficient Reason: "if there is no known reason for predicating of our subject one rather than another of several
alternatives then relatively to such knowledge the assertions of each of these alternatives have an equal probability".
unfortunately the Principle of Indifference leads to a number of paradoxes (eg Buffon's needle and the book paradoxes).
yet physicists /have/ made definite choices, guided by the Principle of Indifference, and they /have/ led us to correct and
nontrivial predictions of viscosity and many other physical phenomena
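buffon's needle, mentioned above, is also a case where indifference assumptions do real work: assuming the needle's centre offset and its angle are uniform (the indifference choice), a monte carlo sketch recovers the classical crossing probability 2L/(pi d). the needle length, spacing, and trial count here are arbitrary:

```python
import math, random

def buffon(trials=200_000, L=1.0, d=2.0, rng=random.Random(1)):
    """Estimate P(needle of length L crosses a line), line spacing d >= L.
    Indifference assumptions: the centre's distance to the nearest line is
    uniform on [0, d/2]; the acute angle to the lines is uniform on [0, pi/2]."""
    hits = 0
    for _ in range(trials):
        x = rng.uniform(0, d / 2)            # distance from centre to nearest line
        theta = rng.uniform(0, math.pi / 2)  # acute angle with the lines
        hits += (x <= (L / 2) * math.sin(theta))
    return hits / trials

p_hat = buffon()
p_exact = 2 * 1.0 / (math.pi * 2.0)  # 2L/(pi d) with L=1, d=2, i.e. 1/pi
print(p_hat, p_exact)
```

the "correct and nontrivial prediction" here is that two uniformity assumptions, neither observable directly, pin down a number you can verify by throwing needles.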
from many years of experience with its applications in hundreds of real problems, our views on the foundations of probability theory have evolved into something quite complex, which cannot be described in any such simplistic terms as 'pro-this' or 'anti-that'. for example, our system of probability could hardly be more different from that of kolmogorov, in style, philosophy, and purpose. what we consider to be fully half of probability theory as it is needed in current applications - the principles for assigning probabilities from local analysis of incomplete information - is not present at all in the kolmogorov system... yet, when all is said and done, we find ourselves, to our surprise, in agreement with kolmogorov and in disagreement with his critics, on nearly all technical issues....as another example, it appears at first glance to everyone that we are in very close agreement with the de Finetti system of probability. indeed, the writer believed this for some time. yet when all is said and done we find, to our surprise, that little more than a loose philosophical agreement remains; on many technical issues we disagree strongly with de Finetti. it appears to us that his way of treating infinite sets has opened up a pandora's box of useless and unnecessary paradoxes; nonconglomerability and finite additivity are examples.
it was our use of probability theory as logic that has enabled us to do so easily what was impossible for those who thought of probability as a physical phenomenon associated with 'randomness'. quite the opposite; we have thought of probability distributions as carriers of information.
working scientists will note with dismay that statisticians have developed ad hoc criteria for accepting or rejecting theories (chi-squared test, etc) which make no reference to any alternatives....there is not the slightest use in rejecting any hypothesis H0 unless we can do it in favor of some definite alternative H1 which better fits the facts....we are considering hypotheses which are 'scientific theories' in that they are suppositions about what is not observable directly....for such hypotheses Bayes' theorem tells us: unless the observed facts are absolutely impossible on H0, it is meaningless to ask how much those facts tend 'in themselves' to confirm or refute H0....a statistician's formal significance test can always be interpreted as a test of a specified H0 against a specified class of alternatives...however, the orthodox literature, which dealt with composite hypotheses by applying arbitrary ad hockeries instead of probability theory, never perceived this.
the distinction between hypothesis testing and parameter estimation is...not a real difference...when the hypotheses become very numerous, deciding between the hypotheses Ht and estimating the index t are practically the same thing; and it is a small step to regard this index, rather than the hypotheses, as the quantity of interest; then we are doing parameter estimation.
today we believe that we can, at last, explain (1) the inevitably ubiquitous use and (2) ubiquitous success of the Gaussian error law. once seen, the explanation is indeed trivially obvious; yet, to the best of our knowledge, it is not recognized in any of the previous literature of the field, because of the universal tendency to think of probability distributions in terms of frequencies. we cannot understand what is happening until we learn to think of probability distributions in terms of their demonstrable information content instead of their imagined (and...irrelevant) frequency connections.
consider estimation of a location parameter θ from a sampling distribution. Fisher perceived a strange difficulty with orthodox procedures. choosing some function of the data as estimator, two different data sets might yield the same estimate for θ, yet have very different configurations (such as range, fourth central moments, etc), and must leave us in a very different state of knowledge concerning θ. in particular, it seemed that a very broad range and a sharply clustered one might lead us to the same actual estimate, but they ought to yield very different conclusions as to the accuracy of the estimate. yet if we hold that the accuracy of an estimate is determined by the width of the sampling distribution for the estimator, we are obliged to conclude that all estimates from a given estimator have the same accuracy, regardless of the configurations of the sample. Fisher's proposed remedy was not to question the orthodox reasoning which caused this anomaly, but rather to invent still another ad hockery to patch it up: use sampling distributions conditional on some 'ancillary' statistic that gives some information about the data configuration that is not contained in the estimator...for a Bayesian the question of ancillarity never comes up at all; we proceed directly from the statement of the problem to the solution that obeys the likelihood principle.
it may be that some function of the parameters can be estimated more accurately than can any one of them. for example, if two parameters have a high negative correlation in the posterior pdf, then their sum can be estimated much more accurately than can their difference. all these subtleties are lost on orthodox statistics, which does not recognize even the concept of correlations in a posterior pdf.
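the sum/difference point is easy to check numerically. a sketch assuming a gaussian posterior with unit marginal variances and correlation -0.95 (the numbers are illustrative only): the sum's variance is 2(1 + ρ) = 0.1 while the difference's is 2(1 − ρ) = 3.9.

```python
import numpy as np

rng = np.random.default_rng(0)
# posterior sketch: two parameters with strong negative correlation
cov = np.array([[1.0, -0.95],
                [-0.95, 1.0]])
draws = rng.multivariate_normal([0.0, 0.0], cov, size=200_000)
a, b = draws[:, 0], draws[:, 1]

var_sum = np.var(a + b)    # 2(1 + rho) = 0.1 -> the sum is pinned down tightly
var_diff = np.var(a - b)   # 2(1 - rho) = 3.9 -> the difference is not
print(var_sum, var_diff)
```

neither parameter alone is known to better than unit variance, yet their sum is known 39 times more precisely than their difference.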
in principle a single data point could determine accurate values of a million parameters. for example, if a function f(x1,...) of one million variables takes on the value sqrt(2) only at a single point, and we learn that f=sqrt(2) exactly, then we have determined one million variables exactly.
reactions against Laplace had begun in the mid-19th century, when Cournot, Ellis, Boole and Venn - none of whom had any training in physics - were unable to comprehend Laplace's rationale and attacked what he did, simply ignoring all his successful results. in particular, Venn, a philosopher without the tiniest fraction of Laplace's knowledge of either physics or mathematics, nevertheless considered himself competent to write...sarcastic attacks on Laplace's work, [with a] possible later influence on the young Fisher. Boole shows repeatedly that he does not understand the function of Laplace's prior probabilities (to represent a state of knowledge rather than a physical fact). in other words he too suffers from the mind projection fallacy... he rejects uniform prior probability assignment as 'arbitrary', and explicitly refuses to examine its consequences; by which tactics he prevents himself from learning what Laplace was really doing and why... in any event, a radical change took place at about the beginning of the 20th century when a new group of workers, not physicists, entered the field. they were concerned mostly with biological problems and with Venn's encouragement proceeded to reject virtually everything done by Laplace.
why do orthodoxians put such exaggerated emphasis on bias? we suspect that the main reason is simply that they are caught in a psychosemantic trap of their own making. when we call the quantity (< b > - a) the 'bias', that makes it sound like something awfully reprehensible, which we must get rid of at all costs. if it had been called instead the 'component of error orthogonal to the variance', it would have been clear to all that these two contributions to the error are on an equal footing; it is folly to decrease one at the expense of increasing the other.
in such uncontrolled situations as economics, there is, in principle, no such thing as 'asymptotic sampling properties' because the 'population' is always finite, and it changes uncontrollably in a finite time. the attempt to use only sampling distributions - always interpreted as limiting frequencies - in such a situation forces one to expend virtually all his efforts on irrelevant fantasies. what is relevant to inference is not any imagined (i.e. unobserved) frequencies, but the actual state of knowledge that we have about the real situation. to reject that state of knowledge - or any human information - on the grounds that it is 'subjective' is to destroy any possibility of finding useful results for human information is all we have.
today one wonders how it is possible that orthodox logic continues to be taught in some places year after year and praised as 'objective', while bayesians are charged with 'subjectivity'. orthodoxians, preoccupied with fantasies about nonexistent data sets and, in principle, unobservable limiting frequencies - while ignoring relevant prior information - are in no position to charge anybody with 'subjectivity'. if there is no sufficient statistic, the orthodox accuracy claim based on a single 'statistic' simply ignores not only the prior information but also all the evidence in the data that is relevant to that accuracy: hardly an 'objective' procedure. if there are ancillary statistics and the orthodoxian follows Fisher by conditioning on them, he obtains just the estimate that Bayes' theorem based on a noninformative prior would have given...by a shorter calculation.
but we must note with sadness that, in much of the current Bayesian literature, very little of the orthodox baggage has been cast off. e.g. it is rather typical to see a Bayesian article start with such phrases as: 'let X be a random variable with density function p(x|θ), where the value of the parameter θ is unknown; suppose this parametric family of distributions...'. ...those who use measure theory are, in effect, supposing the passage to an infinite set already accomplished before introducing probabilities. for example, Feller advocates this policy. in discussing this issue, Feller 1966 [intro to prob theory and its applications] notes that specialists in various applications sometimes deny the need for measure theory 'because they are unacquainted with problems of other types and with situations where vague reasoning did lead to wrong results.' if Feller knew of any case where such a thing has happened, this would surely have been the place to cite it - yet he does not. therefore we remain, just as he says, unacquainted with instances where wrong results could be attributed to failure to use measure theory.
how do people assess the probability of an uncertain event or the value of an uncertain quantity?... people rely on a limited number of heuristic principles which reduce the complex tasks of assessing probabilities and predicting values to simpler judgmental operations. in general these heuristics are quite useful, but sometimes they lead to severe and systematic errors
four problems plague the process of extracting information from a large multidimensional data set:
1. the size of the data set does not allow conventional analysis methods that are computationally intensive.
2. many conventional multivariate analysis methods make unwarranted assumptions.
3. a combinatorial explosion of possible cases that must be explored occurs as you increase the number of variables (dimensions) and their cardinality (number of values).
4. humans have difficulty codifying knowledge even when they are experts in some information domain. this last problem is greatly compounded when you are asked to discover information in multidimensional data containing attributes (variables) associated with several different information domains. a brilliant analyst in one field may not realize the significance of a discovery in another field.
let's consider a data set consisting of 6 columns (5 excluding identifier) and 100k rows. each column is a variable and each row is a record...the number of different data sets can be estimated via Stirling's formula to be about 10 raised to the power 1350, i.e. 1 followed by 1350 zeros...it dwarfs the total # of atoms in the universe. when you consider typical data sets that people get from today's data warehouses, involving tens of millions of records and far more cells, the number of possible data sets can exceed 10 raised to the power 1 billion.
by...contrast consider the language of analysis. if A, B, and C are three ordinal variables, people ask whether A and B etc are correlated. if we have five ordinal variables, people will ask what the 5x5 correlation matrix looks like. if the variables are nominal discrete (categorical) variables, they ask similar questions but refer to interaction instead of correlation. but wait a minute. a 5x5 correlation matrix has at most (25-5)/2=10 distinct meaningful numbers....it is as if we are supposed to believe that 10 numbers between -1 and 1 [the correlation coefficients] can uniquely specify which of the 10 to the power 1350 data sets we are dealing with. that this is not so can be demonstrated by considering a correlation matrix that has all 0's off-diagonal, indicating "no correlation" between any two non-identical variables. this is a uniquely specified matrix, and yet there are many...data sets of our form that can give rise to this matrix, including cases such as "A and B are perfectly correlated when C=C1 and perfectly anti-correlated when C=C2" and so on... there is a small fraction...but nonetheless a very large number of cases, all of which give the exact same correlation matrix.
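the construction in the quote takes two lines of code. a sketch (the variable names and the fair coin-flip split of C are my own choices): B equals A when C=0 and equals -A when C=1, so each conditional correlation is exactly ±1 while the overall correlation is near zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
c = rng.integers(0, 2, n)      # C = C1 (coded 0) or C2 (coded 1), equally likely
a = rng.standard_normal(n)
b = np.where(c == 0, a, -a)    # perfectly correlated given C=C1,
                               # perfectly anti-correlated given C=C2

print(np.corrcoef(a, b)[0, 1])                  # ~0 overall
print(np.corrcoef(a[c == 0], b[c == 0])[0, 1])  # +1 within C=C1
print(np.corrcoef(a[c == 1], b[c == 1])[0, 1])  # -1 within C=C2
```

the off-diagonal zero in the correlation matrix is genuine, yet it tells you nothing about the (total) dependence between A and B.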
for the most part...the theory and practice of 20th century statistics was dominated by the parametric system of ideas put forth by R.A. Fisher, K Pearson, J Neyman and other statistical legends....[it] is self-contained, very elegant mathematically, and - quite importantly - its practical implementation involves calculations that can be carried out with pen and paper and the occasional slide-rule/calculator. its limitation, however, lies in forcing the practitioner to make prior assumptions regarding the data, assumptions that are quite often unrealistic.
two events do not become relevant to each other merely by virtue of predicting a common consequence, but they do become relevant when the consequence is actually observed. the opposite is true for two consequences of a common cause; typically the two become independent upon learning the cause.
assume that X and Y represent continuous returns (log-returns) of two financial assets over [some] period. if you know the correlation of these two random variables, this does not imply that you know the dependence structure between the asset prices themselves, because [they are obtained by exponentiation]. the asset prices are strictly increasing functions of the returns, but the correlation structure is not maintained by this transformation....returns could be uncorrelated whereas the prices are strongly correlated and vice versa.
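a numeric sketch of the non-invariance, under assumed bivariate normal log-returns with volatility 1.5 and correlation 0.5 (all parameter values are illustrative): for lognormals the exact price correlation is (e^{ρσ²} − 1)/(e^{σ²} − 1), about 0.245 here, so half the correlation evaporates under the strictly increasing exponential map.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma, rho = 500_000, 1.5, 0.5
cov = sigma**2 * np.array([[1.0, rho],
                           [rho, 1.0]])
x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T  # "log-returns"

corr_logs = np.corrcoef(x, y)[0, 1]                       # ~0.5 by construction
corr_prices = np.corrcoef(np.exp(x), np.exp(y))[0, 1]     # markedly smaller
print(corr_logs, corr_prices)
```

rank correlations (spearman, kendall) would survive the transformation unchanged; pearson correlation does not, which is exactly the quote's warning.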
we won't go further into the details, but the moral of this discussion is that our results about
deviation from a finite mean can still be applied to natural models like random walks where variables with infinite expectation may play an important role.
we expect that [probabilistic] algorithms might give a wrong answer, but will do so only if the number of mistaken
probabilistic "guesses" they make is much larger than should be expected. the likelihood of this unusually large number of mistakes
can often be estimated well using the Chernoff bound...note that the more we know about the distribution the better are the bounds we can obtain on the probability
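a sketch of the bound in action for a sum of fair coin flips (the threshold of 550 heads out of 1000 is an arbitrary choice): the chernoff-hoeffding bound P(X ≥ (p+δ)n) ≤ exp(−2nδ²) uses only the mean, so it is loose but cheap, and the empirical tail sits safely below it.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, trials, t = 1000, 0.5, 200_000, 550

x = rng.binomial(n, p, size=trials)     # each draw: heads in n fair flips
empirical = np.mean(x >= t)             # empirical tail P(X >= t)

delta = t / n - p
chernoff = np.exp(-2 * n * delta**2)    # Chernoff-Hoeffding upper bound
print(empirical, chernoff)
```

here the bound exp(−5) ≈ 0.0067 overstates the true tail (≈ 0.0009) by almost an order of magnitude, illustrating the quote's point: knowing more about the distribution (here, its variance) would give a much tighter bound.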
every time we sit down to do an analysis, every time we study an accident, we come up with something we didn't know that we didn't know...before the TWA 800 crash [1996, killing 230], the F.A.A. did not know, or simply did not believe, that a spark could find its way into the fuel tank.
many authors have argued that a logical treatment of law is in principle unable to fully reflect the nature of legal argumentation. this is true if one has classical logic in its original form in mind. however, in the meantime all non-classical aspects of logic important for the law (and other applications), such as non-monotonicity, causality, time, vagueness, states of belief, probabilism, argumentation, pleading, theory formation, case-based reasoning, and several others, have been studied extensively. the maturity of the resulting formalisms is such that arguments of this kind are simply no longer appropriate. ...so let us assume that a substantial part of some law has been coded the way just indicated. what will have been gained? the advantages have been described many times and as early as two decades ago. what has changed since those early times is...for instance...search engines (such as Google) which can give you correct answers to many of your questions more or less immediately.
the posterior pdf gives a complete description of what we can infer about the value of the parameter in light of the data, and our relevant prior knowledge. the idea of a best estimate and an error-bar, or even a confidence interval, is merely an attempt to summarize the posterior with just two or three numbers; sometimes this just can't be done, and so these concepts are not valid. the posterior pdf still exists, and we are free to draw from it whatever conclusions are appropriate.
students and, remarkably, teachers of statistics, often misread the meaning of a statistical test of significance. Haller and Krauss (2002) asked 30 statistics instructors, 44 statistics students and 39 scientific psychologists from six psychology departments in germany about the meaning of a significant two-sample t-test (significance level = 1%). the test was supposed to detect a possible treatment effect based on a control group and a treatment group. The subjects were asked to comment upon the following six statements (all of which are false). they were told in advance that several or perhaps none of the statements were correct.
(1) you have absolutely disproved the null hypothesis that there is no difference between the population means (true/false)
(2) you have found the probability of the null hypothesis being true (true/false)
(3) you have absolutely proved your experimental hypothesis that there is a difference between the population means (true/false)
(4) you can deduce the probability of the experimental hypothesis being true (true/false)
(5) you know, if you decide to reject the null hypothesis, the probability that you are making the wrong decision (true/false)
(6) you have a reliable experimental finding in the sense that if, hypothetically, the experiment were repeated a great number of times, you would obtain a significant result on 99% of occasions (true/false)
all the statistics students marked at least one of the above faulty statements as correct. and, quite disconcertingly, 90% of the academic psychologists and 80% of the methodology instructors did as well! in particular, one third of both the instructors and the academic psychologists and 59% of the statistics students marked item 4 as correct; that is, they believe that, given a rejection of the null at level 1%, they can deduce a probability of 99% that the alternative is correct....ironically, one finds that this misconception is perpetuated in many textbooks.
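item 4 is easy to refute by simulation: the fraction of true nulls among rejections depends on base rates and power, not just on alpha. a sketch (the 80% base rate of true nulls, the effect size 0.5, the sample size, and the known unit variance are all my own assumptions, chosen only to make the arithmetic visible):

```python
import numpy as np
from math import erfc, sqrt

rng = np.random.default_rng(0)
experiments, n, alpha = 50_000, 50, 0.01
null_true = rng.random(experiments) < 0.8      # assumed base rate of true nulls
effect = np.where(null_true, 0.0, 0.5)

control = rng.standard_normal((experiments, n))
treated = rng.standard_normal((experiments, n)) + effect[:, None]

# z-test with known unit variance; two-sided p-value via the normal tail
z = (treated.mean(1) - control.mean(1)) / np.sqrt(2 / n)
p = np.array([erfc(abs(v) / sqrt(2)) for v in z])
rejected = p < alpha

frac_false_rejections = np.mean(null_true[rejected])
print(frac_false_rejections)   # nowhere near the 1% that item 4 would predict
```

with these assumptions roughly 8% of rejections at the 1% level are false, eight times what the misreading of item 4 suggests; with rarer true effects the gap grows without bound.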
in the 1920's "data miners" were excited to find that by preprocessing their data with repeated smoothing, they could discover trading cycles. their joy was shattered by a theorem by Evgeny Slutsky (1880-1948) who demonstrated that any noisy time series will converge to a sine wave after repeated applications of moving window smoothing.
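slutsky's effect is easy to reproduce: repeatedly smoothing pure noise manufactures a slowly oscillating, wave-like series with nothing cyclical behind it. a sketch (the window width and number of passes are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(2000)   # pure white noise, no cycles anywhere

def smooth(series, window=5, passes=50):
    """Apply a moving-average window repeatedly (Slutsky's set-up)."""
    out = series.copy()
    kernel = np.ones(window) / window
    for _ in range(passes):
        out = np.convolve(out, kernel, mode="same")
    return out

def lag1_autocorr(s):
    s = s - s.mean()
    return np.dot(s[:-1], s[1:]) / np.dot(s, s)

print(lag1_autocorr(x))          # ~0: the raw noise is memoryless
print(lag1_autocorr(smooth(x)))  # ~1: the smoothed series looks like a slow wave
```

the near-unit lag-1 autocorrelation after smoothing is what a "trading cycle" looks like in a correlogram; it was put there entirely by the filter.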
there are many properties which the Gaussian and stable Levy distributions share: both are fixed points, or attractors, for the distributions of sums of independent random variables, in both cases guaranteed by a central limit theorem, and both of them describe the probability distributions associated with certain stochastic processes. the main difference is that the gaussian distribution is of finite variance, while the power-law decay of the levy distribution leads to infinite variance.
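the practical difference shows up immediately in sample statistics: a sketch comparing the running sample variance of gaussian draws with that of a heavy-tailed variable with tail index 1.5 (finite mean, infinite variance; note numpy's `pareto` uses the lomax parameterization, and all sizes here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def running_var(x):
    """Sample variance over growing prefixes of x."""
    n = np.arange(1, len(x) + 1)
    m = np.cumsum(x) / n
    return np.cumsum(x**2) / n - m**2

gauss = rng.standard_normal(100_000)
pareto = rng.pareto(1.5, 100_000)   # tail index 1.5: finite mean, infinite variance

print(running_var(gauss)[-1])    # settles near the true variance, 1
print(running_var(pareto)[-1])   # dominated by a few extreme draws; never settles
```

for the gaussian the running variance converges; for the infinite-variance case it is essentially a record of the largest observation so far, and it keeps jumping no matter how much data arrives.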
during the last years at different computer science conferences, i heard reiteration of the following claim: "complex theories do not work, simple algorithms do."... at least in the problems of statistical inference, this is not true.
hadamard thought that ill-posed problems are a pure mathematical phenomenon and that all real-life problems are "well-posed." however in the 2nd half of the century a number of very important real-life problems were found to be ill-posed. in particular ill-posed problems arise when one tries to reverse the cause-effect relations: to find unknown causes from known consequences. even if cause-effect relationship forms a one-to-one mapping, the problem of inverting it can be ill-posed...one of the main problems of statistics, estimating the density function from the data, is ill-posed.
regularization theory [1960's] was one of the first signs of the existence of intelligent inference. it demonstrated that whereas the "self-evident" method of minimizing the [risk] functional R(f) does not work, the not "self-evident" method of minimizing the functional R*(f) [with a correction functional to deal with ill-posedness] does.
a volatility forecast is a number that provides some information about the distribution of an asset price in the future. a far more challenging forecasting problem is to use market information to produce a predictive density for the future asset price. a realistic density will have a shape that is more general than provided by the lognormal family....a satisfactory method will not constrain the levels of skewness and kurtosis for the log of the predicted price.
for the most part, foundational thinking has been driven by intuition based on low-dimensional parametric models. but in modern statistical practice, it is routine to use high-dimensional or even infinite dimensional (nonparametric or semiparametric) methods. there is some danger in extrapolating our intuition from finite-dimensional to infinite-dimensional models. should we rethink foundations in light of these methods?...we argue that the answer is "yes".
does a covariance function give a convenient description of a random function? no. this should not be surprising, because this is only the equivalent, in infinite-dimensional spaces, of the well-known fact that one-dimensional random variables with completely different probability densities may well have the same mean and the same variance. the problem is that when the number of dimensions...is high, it is not easy to obtain an intuitive idea of what the probability densities look like, and one may easily be misled with the assumptions made.
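the one-dimensional analogue in the quote takes three lines: three distributions with identical mean and variance and completely different shapes (the particular trio - gaussian, uniform, two-point - is my own choice of illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# three distributions, each with mean 0 and variance 1
gauss = rng.standard_normal(n)
uniform = rng.uniform(-np.sqrt(3), np.sqrt(3), n)   # var of U(-a, a) is a^2/3
two_point = rng.choice([-1.0, 1.0], n)              # mass only at +-1

for s in (gauss, uniform, two_point):
    print(round(s.mean(), 2), round(s.var(), 2))
# identical first two moments, yet the densities could hardly differ more
```

a covariance function describes a random function about as completely as these two numbers describe these three distributions.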
nobody really believes that multivariate data is multivariate normal, but that data model occupies a large number of pages in every graduate textbook on multivariate statistical analysis.
the implication...that inferences are coherent if and only if they are Bayesian [increases its] appeal to many statisticians. of course the results are only as compelling as the axioms. given the choice between a method that is coherent and a method that has, say, correct frequentist coverage, many statisticians would choose the latter. the issue is not mathematical in nature. the question is under which circumstances one finds coherence or correct coverage more important.
according to Birnbaum's theorem, the likelihood principle follows logically from two other principles [conditionality and sufficiency]. to many statisticians, both...seem compelling yet [likelihood] does not. the mathematical content of the Birnbaum theorem is not in question. rather, the question is whether conditionality and sufficiency should be elevated to the status of "principles" just because they seem compelling in simple examples.
a complementary view of Bayesian model comparison is obtained by replacing probabilities of events by the lengths in bits of messages that communicate the events without loss to a receiver.
in the last three years, methods for constructing a new type of universal learning machine were developed based on results obtained in statistical learning theory. in contrast to traditional learning techniques, these novel methods do not depend explicitly on the dimensionality of input spaces and can control the generalization ability of machines [~estimators] that use different sets of decision functions in high dimensional spaces.
if we turn this descriptive account of science as finding concise, theory-based explanations of observed data into a prescription for how to handle data, we get various flavours of inductive inference techniques based on measures of information: Solomonoff prediction, MML and MDL theory selection, MML and BIC estimation. the differences reflect independent and near-simultaneous development, and are in practice far less important than the similarities. They use Shannon and Kolmogorov measures of "information". formally, if not always philosophically, they are closer to Bayesian statistical inference than to classical methods. all rely on a fundamental trade-off between complexity of theory and fit to data encapsulated in the length of a message stating both theory and data.
by contrast, the Vapnik-Chervonenkis approach seems to come from a different world, in philosophy and form. It is in a sense an extension of classical statistical reasoning about "confidence" into a wider model space, bringing (like MML but by totally different means) both estimation and model selection under the same umbrella. yet both these approaches demonstrably (and provably) work very well, having far more general application and giving consistently better results than previous methods of statistical and inductive inference.
for high-dimensional problems, the most widely used random sampling methods are Markov chain Monte Carlo methods like the Metropolis method, Gibbs sampling, and slice sampling. the problem with all these methods is this: yes, a given algorithm can be guaranteed to produce samples from the target density P(x) asymptotically, 'once the chain has converged to the equilibrium distribution'. but if one runs the chain for too short a time...then the samples will come from some other distribution P^t(x) [i.e. at iteration=t]. for how long must the Markov chain be run before it has converged?...this question is usually very hard to answer.
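the burn-in problem is easy to exhibit with a random-walk metropolis sampler for a standard normal target, deliberately started far from the bulk (the start point of 100, the step size, and the run lengths are arbitrary choices):

```python
import numpy as np

def metropolis(target_logpdf, x0, steps, step_sd=1.0, seed=0):
    """Random-walk Metropolis sampler over the real line."""
    rng = np.random.default_rng(seed)
    x, logp = x0, target_logpdf(x0)
    out = np.empty(steps)
    for i in range(steps):
        prop = x + step_sd * rng.standard_normal()
        logp_prop = target_logpdf(prop)
        if np.log(rng.random()) < logp_prop - logp:   # accept/reject
            x, logp = prop, logp_prop
        out[i] = x
    return out

logpdf = lambda x: -0.5 * x**2    # standard normal target, up to a constant

chain_short = metropolis(logpdf, x0=100.0, steps=100)
chain_long = metropolis(logpdf, x0=100.0, steps=50_000)

print(chain_short.mean())           # the short run never reached the target's bulk
print(chain_long[5_000:].mean())    # after a generous burn-in: close to 0
```

both runs are "valid" metropolis chains for the same P(x); only the long one, with its early samples discarded, produces anything resembling draws from P(x). nothing inside the algorithm announces when that point has been reached.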
even programs with some of the very simplest possible rules yield highly complex behavior, while programs with fairly complicated rules often yield only rather simple behavior...if one just looks at a rule in its raw form, it is usually almost impossible to tell much about the overall behavior it will produce.
"bayes&juice": failure to model conjunctions in conditional propositions
while many of the predictions of neoclassical economic theory have been verified experimentally, many others have been decisively disconfirmed. what distinguishes success from failure?
[in] when modeling market processes with well specified contracts such as double-continuous auctions (supply and demand) and oligopoly, game-theoretic predictions are verified under a wide variety of social settings
[out] where contracts are incomplete and agents can engage in strategic interaction with the power to reward and punish the behavior of other players, the neoclassical predictions generally fail
[unlike physical sciences] experiments in human social interaction cannot [eliminate all influences on behavior except experimentally controlled, e.g. particles are interchangeable]...even in principle because...subjects bring their personal history with them into the lab...their behavior is an interaction between the subject's personal history and the experimenter's controlled lab conditions...they "choose a frame" for interpreting the experimental situation.
if you have previously studied game theory, you will no doubt have noticed that our treatment of Bayesian updating in games with private information has not relied at all on the concept of "beliefs". this is not an oversight but rather a natural side-effect of our evolutionary perspective. classical game theory takes the decision processes of the /rational actor/ as central, whereas evolutionary game theory takes the /behavior/ of corporeal actors as central and models the game's evolution using replicator dynamics, diffusion processes and other population-level phenomena. beliefs in such a framework are... a shorthand way of expressing a behavioral regularity rather than the source of the regularity. there is absolutely no need to introduce the concept of beliefs into a general theory of games with private information.
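a minimal sketch of the replicator-dynamics perspective (my own illustration; the hawk-dove payoffs V=2, C=4 and the Euler step are assumed choices, not from the quoted text): the population share of hawks converges to the mixed equilibrium V/C from either side, a behavioral regularity produced with no agent holding or updating "beliefs" at all.

```python
def replicator(x, dt=0.1, steps=500):
    """Euler-integrated replicator dynamics for the share x of hawks
    in a hawk-dove game with V=2, C=4 (hawk-hawk payoff (V-C)/2 = -1)."""
    for _ in range(steps):
        f_hawk = -1.0 * x + 2.0 * (1 - x)   # expected payoff to a hawk
        f_dove =  0.0 * x + 1.0 * (1 - x)   # expected payoff to a dove
        # replicator equation: dx/dt = x (1 - x) (f_hawk - f_dove)
        x += dt * x * (1 - x) * (f_hawk - f_dove)
    return x

# starting from a hawk-heavy or dove-heavy population,
# the dynamics settle at the interior equilibrium V/C = 0.5
print(replicator(0.9))
print(replicator(0.1))
```

the fixed point is where f_hawk = f_dove, i.e. 2 - 3x = 1 - x, giving x* = 1/2 = V/C — the same point classical theory would describe as agents holding equilibrium beliefs.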
it may be comforting to the classically minded to have a notion that agents have beliefs at the start of the game (called /priors/) which they update using Bayes' rule in the course of the game and use to describe the process as /sequential rationality/ but the notion is completely unnecessary. the assumption that agents know the frequency distribution of types is no more or less in need of justification than the assumption that they know the rules of the game, the payoffs, and the decision tree...justified by its ability to explain the behavior of the agents in social life..."belief"...invites all sorts of philosophical nonsense of the type [philosophers kick up the dust then complain they cant see - spinoza].
there is no clear value in using game theory to analyze situations that have not been subjected to an extensive evolutionary process, after which it may be plausible that agents have correct and common priors, on the grounds that agents with inaccurate maps of the world are less likely to survive than agents with accurate maps. but then again, perhaps not.
evolutionary processes ignore low probability events...for a cod, the probability of being impaled on a hook is such a low probability event that it is simply not registered in the fish's perceptual or behavioral repertoire. it is folly to represent this situation as one in which the cod does not know which node (prey v hook) it occupies in an information set...if you give people offers they cannot refuse, they often refuse anyway. if you put people in unfamiliar situations and depend on their reacting optimally, or even well, to these situations, you are not long for this world.
these considerations should not be taken as asserting that there are certain inherent limitations to game theory, but rather that we need new modeling concepts to deal with low probability events.
perhaps we should assume that players partition the event world into sets whose probability is at least p* and choose best responses based on these partitions. a mutant, however, can repartition the event space without violating the p* minimum to "invade" a Nash equilibrium based on a given partition. since the strategy space is no longer finite (or closed or bounded), there need not necessarily be a Nash equilibrium of the metagame where partitions evolve, so the description of strategic interaction becomes one of temporary equilibria with shifting ways of carving up reality.
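a toy version of the p* idea, using the cod example from above (my own sketch; the payoff numbers and the p* value are assumed): an agent whose partition drops events below p* renormalises over what it perceives, and its best response flips — the rare catastrophic state simply isn't in its repertoire.

```python
# two states of the world; "hook" is the rare catastrophic event
probs = {"prey": 0.999, "hook": 0.001}
payoff = {("bite", "prey"): 1.0, ("bite", "hook"): -1000.0,
          ("flee", "prey"): 0.0, ("flee", "hook"): 0.0}

def best_response(probs, p_star=0.0):
    """Best response when events with probability below p_star are
    dropped from the agent's partition of the event world."""
    seen = {s: p for s, p in probs.items() if p >= p_star}
    z = sum(seen.values())
    seen = {s: p / z for s, p in seen.items()}   # renormalise over perceived events
    def ev(action):
        return sum(p * payoff[(action, s)] for s, p in seen.items())
    return max(("bite", "flee"), key=ev)

print(best_response(probs))               # full expected utility: the hook dominates
print(best_response(probs, p_star=0.01))  # the cod's partition: the hook doesn't exist
```

the full-information agent flees because 0.999·1 − 0.001·1000 < 0; the p*-bounded agent bites every time, which is exactly the behavior the quoted passage says should not be modeled as an information-set problem.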
hueston: sir, you have a long list of people to blame for enron's collapse, and it gets longer and longer as you testify
[short-seller induced market panic, wall st j., bursting of the tech boom, 9/11, schemes hatched by fastow]...your list
of people to blame and events to blame did not include yourself, did it, sir?
lay: i did everything i could humanly do during this time. did i make mistakes? i'm sure i did... i had to make real-time decisions based on the information i had at the time.