Data and their misbehavior
To be honest, I use the clickbaity word “data” in the title when I really mean “sample statistics”. The point of this post is first illustrated using a sample mean, but applies to any estimate computed from data.
If you ask a student in statistics to write a formula for “the mean” they may write the expression:
\[ \bar{X} := \frac{1}{N}\sum_{i = 1}^N X_i \]
There are at least two “problems” with this answer. The first problem is that the above expression is not the mean of a distribution; it is a sample statistic computed from some sample data \(\{X_i\}\). It is not simply a constant number but a random variable in its own right. It has a distribution, a mean, a variance, quantiles, etc. If we had done a different “experiment” and had obtained different data, the variable \(\bar{X}\) would likely have come out with a different value. If the data were generated by a particularly misbehaved random process, then the distribution of \(\bar{X}\) may have little to do with any measure of centrality and may tell us little about the distribution of any future data element \(X_{N+1}\).
- Lesson: Quantities estimated from data are usually random variables, and it’s necessary to understand their distributional properties before inferring anything from the observed value.
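The lesson above is easy to demonstrate with a small simulation (a sketch using only Python's standard library; the normal distribution, seed, and sample size are arbitrary choices for illustration): rerunning the same “experiment” produces a different realization of \(\bar{X}\) every time.

```python
import random

random.seed(0)

def sample_mean(n):
    """One 'experiment': draw n observations and return their sample mean."""
    data = [random.gauss(0.0, 1.0) for _ in range(n)]
    return sum(data) / n

# Five repetitions of the experiment give five different values of X-bar,
# because X-bar is itself a random variable with its own distribution.
means = [sample_mean(100) for _ in range(5)]
print(means)
```

Each entry of `means` is one draw from the sampling distribution of \(\bar{X}\); the spread among them is exactly the “tendency to vary” discussed above.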
The second problem, however, comes from the fact that in some cases the distribution of \(\bar{X}\) makes it indeed a very good approximation (in a precise sense) to a mean. If the data \(X_i\) are generated IID, with “small” variance, then the central limit theorem tells us that \(\bar{X}\) has a distribution that is approximately normal with mean \(\mu = E[X_i]\) and variance \(\sigma^2 = \frac{Var[X_i]}{N}\). If the data are truly IID and generated from a thin-tailed distribution, then this limit is indeed a good approximation, even for modest sample sizes. If this is the case, and if we assume the limiting distribution is correct, then the fact that normal distributions themselves have very thin tails means we can construct narrow yet extremely high-confidence intervals for \(\mu\) centered on \(\bar{X}\). So in this case, thin tails make it easy (far too easy) for a student to forget the difference between the sample statistic \(\bar{X}\) and the distributional parameter \(\mu\).
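When the thin-tailed IID assumption actually holds, the normal-based confidence interval works as advertised. A minimal check (stdlib Python; exactly normal data with known \(\sigma\), so this is the best case for the CLT, and the seed and trial counts are arbitrary):

```python
import math
import random

random.seed(1)

def ci_covers_mu(n, mu=0.0, sigma=1.0):
    """One experiment: does the 95% normal CI around X-bar contain mu?"""
    data = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = sum(data) / n
    half_width = 1.96 * sigma / math.sqrt(n)  # CLT: sd of X-bar is sigma/sqrt(n)
    return abs(xbar - mu) <= half_width

trials = 2000
coverage = sum(ci_covers_mu(50) for _ in range(trials)) / trials
print(coverage)  # empirical coverage; should hover near the nominal 0.95
```

With thin tails and genuine IID-ness, the empirical coverage tracks the nominal 95% even at \(n = 50\); the point of the rest of the post is that this comfort evaporates once either assumption fails.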
Unfortunately, in many interesting situations, data is not generated IID with small variance. In the previous sentence I personally would replace the word “many” with “most”. Nassim Taleb, the author of Fooled by Randomness and The Black Swan, may very well replace it with “all”. In any case, dividing data into “IID” and “non-IID”, or “definitely thin tailed” and “possibly heavy tailed”, is like dividing animals into “elephants” and “non-elephants”.
- Lesson: It’s true that IID data with finite variance will have a sample mean that converges in distribution to a normal. But the rate of convergence may be very slow for data with large (or numerically infinite) variance, or for data that is not exactly IID. Do not assume asymptotic normality unless 1) you are absolutely sure, or 2) you are absolutely sure that the mistake you would make from such an assumption would not be fatal.
Economics and finance abound with examples. Market data is almost never IID in any observable sense, almost as a corollary of market participants seeking out arbitrage opportunities, which causes previously stable relationships to degrade (a variation of the Lucas Critique). Insurance and operational risk are replete with examples of data that may well be generated IID (there would be no real way to know) but whose distributions have numerically infinite variance.
In such interesting cases it could be a grave mistake to treat a sample statistic as an approximation of a particular distributional parameter. For example, a Cauchy distribution is so heavy tailed that it has no finite mean. So the sample mean is not an approximation to any measure of central tendency of data generated by a Cauchy distribution, and would itself show enormous variation. In one experiment the sample mean may be 100. In the next experiment its value may very well be 100,000.
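This misbehavior is cheap to see for yourself. A sketch (stdlib Python; standard Cauchy draws via the inverse-CDF method, with arbitrary seed and sample sizes): even with very large samples, the sample mean of Cauchy data refuses to settle down.

```python
import math
import random

random.seed(2)

def cauchy():
    """Standard Cauchy draw via the inverse-CDF (probability integral) method."""
    return math.tan(math.pi * (random.random() - 0.5))

def cauchy_sample_mean(n):
    return sum(cauchy() for _ in range(n)) / n

# Ten "experiments", each averaging 100,000 Cauchy draws. For thin-tailed
# data these would all but coincide; for Cauchy data they stay scattered,
# because the sample mean of Cauchy data is itself standard Cauchy for any n.
means = [cauchy_sample_mean(100_000) for _ in range(10)]
print(min(means), max(means))
```

Contrast this with the Gaussian example earlier in the post: at \(n = 100{,}000\) a Gaussian sample mean would vary on the order of \(1/\sqrt{n} \approx 0.003\), while these Cauchy sample means disagree by orders of magnitude more, no matter how large \(n\) gets.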
- Lesson: There are no true tests for stationarity, path-independence, or IID-ness unless you make strong assumptions about the distributional properties of the data. In finance and economics, such assumptions are almost never justifiable. On the contrary, it’s very simple to argue for the non-stationarity, path-dependence, and non-ARIMA structure of financial time series. Thus in finance and economics one is often missing the one thing that makes data useful: limit theorems. No, the neural net you’re training will not solve the markets.
Understanding the distributional properties of sample statistics is probably the most important part of the applied field of statistics. Mistaking sample statistics for constants throws away all of their distributional properties, such as their tendency to vary or to change with time, and assuming their distribution to follow from the central limit theorem may very well be incorrect.