Sunday, 23 March 2014

Normal tables



Today I'm just going to rant about something that really gets on my nerves, but I think I should give some context for my complaints first.

Statistics deals with the collection and analysis of data to infer characteristics of a population, which helps us make better decisions than alternatives such as random guessing, throwing leaves into the fire and reading the smoke, crystal balls, acid trips, or perhaps the best of the alternatives, and the favourite means of divination in the film of Frank Miller's 300: a nearly naked virgin teenage Greek girl serving as an oracle.


Before the age of computers, it must have been really hard to analyse big sets of data, so it made sense to welcome all sorts of helpful approximations. Without a doubt, the most recurrent theorem in all of Statistics is the Central Limit Theorem.

The CLT states, in its simplest version, that if we have a sequence of independent and identically distributed random variables $X_1,X_2,\ldots$ with finite mean $\mu$ and finite variance $\sigma^2>0$, then the standardised empirical mean converges in distribution to a standard normal random variable. In other words, for large $n$ the empirical mean is approximately normal with the same mean $\mu$ and variance $\frac{\sigma^2}{n}$. Mathematically: \[\lim_{n\rightarrow\infty}\mathbb P\left[\frac{\sqrt{n}}{\sigma}\left(\frac{1}{n}\sum_{k=1}^nX_k-\mu\right)\leq x\right]=\int_{-\infty}^x\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{z^2}{2}\right)dz\]

There are two main issues to address in the formula above. First, the formula holds when $n$ tends to infinity. For practical purposes this means a big $n$, but how big is "big"? While I was an undergrad student, and while I was a teaching assistant in high school, I recall hearing students being told that the magic number is 30. Yes, 30 is big enough to think it is close to infinity... read that again! "30 is close enough to infinity", "30 is a close approximation to infinity"... I mean, seriously?!

Let's not make fools of ourselves, or worse, let's not make fools of people working in social studies. 30 is not close to infinity. The CLT approximation may well be very good with just 30 data points, but that is not because 30 is a spectacularly big number; it must be because of the nature of the random variables we're approximating. If they were normal to begin with, then the standardised mean is exactly standard normal even for $n=1$, and that sounds quite a long way from infinity.
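To make that point concrete, here is a minimal simulation sketch in Python, using only the standard library. The function names, the number of trials, the seed, and the choice of uniform and exponential variables are all my own assumptions for illustration. It estimates the largest gap between $\Phi$ and the empirical distribution of the standardised mean, so some of the reported gap is just simulation noise.

```python
import math
import random


def std_normal_cdf(x):
    """Phi(x), written via the error function from the standard library."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def max_cdf_gap(draw, mu, sigma, n, trials=20000, seed=0):
    """Largest gap between Phi and the empirical CDF of the standardised
    mean of n draws, estimated from `trials` simulated samples."""
    rng = random.Random(seed)
    z = sorted(
        math.sqrt(n) / sigma * (sum(draw(rng) for _ in range(n)) / n - mu)
        for _ in range(trials)
    )
    # The empirical CDF jumps from i/trials to (i+1)/trials at the i-th
    # sorted value, so both sides of each jump are compared with Phi.
    return max(
        max(abs((i + 1) / trials - std_normal_cdf(v)),
            abs(i / trials - std_normal_cdf(v)))
        for i, v in enumerate(z)
    )


# Symmetric case: Uniform(0, 1) has mean 1/2 and variance 1/12.
uniform_gap = max_cdf_gap(lambda r: r.random(), 0.5, math.sqrt(1 / 12), n=30)
# Skewed case: Exponential(1) has mean 1 and variance 1.
expo_gap = max_cdf_gap(lambda r: r.expovariate(1.0), 1.0, 1.0, n=30)
# Already normal: a single observation is enough.
normal_gap = max_cdf_gap(lambda r: r.gauss(0.0, 1.0), 0.0, 1.0, n=1)

print("uniform, n=30:    ", uniform_gap)
print("exponential, n=30:", expo_gap)
print("normal, n=1:      ", normal_gap)
```

With the same $n=30$, the skewed exponential case should show a visibly larger gap than the symmetric uniform one, while the normal case with $n=1$ should show nothing beyond simulation noise. The sample size is the same; only the nature of the variables changes.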


The other issue I was talking about is the integral on the right: the Gaussian integral. It has no antiderivative in terms of elementary functions, and hence we just give it a name: \[\Phi(x)=\int_{-\infty}^x\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{z^2}{2}\right)dz\]
We know the median of a standard normal random variable is 0 and hence $\Phi(0)=\frac{1}{2}$. But there's no real way of writing down the exact value of $\Phi$ at any other given real number, other than stating the integral form.
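Numerically, though, the value is one line away. A minimal sketch in Python, assuming nothing beyond the standard library, uses the identity $\Phi(x)=\frac{1}{2}\left(1+\operatorname{erf}\left(\frac{x}{\sqrt{2}}\right)\right)$. The error function is itself defined by an integral, so this is still an approximation, just one computed for us. The function name `std_normal_cdf` is my own choice; R users get the same thing from the built-in `pnorm`.

```python
import math


def std_normal_cdf(x: float) -> float:
    """Approximate Phi(x) for a standard normal using math.erf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


# The median of a standard normal is 0, so this is one half.
print(std_normal_cdf(0.0))
# The value usually looked up in the table for a 95% two-sided interval.
print(std_normal_cdf(1.96))
```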

This problem is what led to the standard normal tables, which give numerical approximations to the integral for different values of $x$ between -3 and 3, where $\Phi$ changes appreciably (it is nearly 0 before -3 and nearly 1 after 3).
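For what it's worth, the table itself can be regenerated in a few lines. The sketch below is in Python and assumes the usual textbook layout (one row per first decimal of $z$, one column per second decimal, four decimal places), which is my assumption and not something fixed by the definition of $\Phi$. The names `std_normal_cdf` and `normal_table` are made up for this example.

```python
import math


def std_normal_cdf(x):
    """Approximate Phi(x) for a standard normal using math.erf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def normal_table(start=0.0, stop=3.0):
    """Return rows (z, [Phi(z + 0.00), ..., Phi(z + 0.09)]) for z going
    from start to stop in steps of 0.1, rounded to four decimals."""
    n_rows = round((stop - start) * 10) + 1
    rows = []
    for i in range(n_rows):
        base = start + i / 10
        values = [round(std_normal_cdf(base + j / 100), 4) for j in range(10)]
        rows.append((round(base, 1), values))
    return rows


# Print the header and the first few rows of the table.
print("  z  " + " ".join(f"{j / 100:6.2f}" for j in range(10)))
for z, values in normal_table(0.0, 0.5):
    print(f"{z:4.1f} " + " ".join(f"{v:6.4f}" for v in values))
```

Reading the table then amounts to indexing: the row for 1.9 and the column for 0.06 hold the approximation to $\Phi(1.96)$.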


Arguably, Statistics is the most important branch of maths in social studies and data-driven areas of science. Currently, it is easier to gather data than to actually make sense of it and, since large sample sizes let us appeal to their closeness to infinity in some sense, it is clear why the numbers in the table are important. Having said that, I'm not sure the tables are relevant any more. Furthermore, teaching how to read the tables is not worth the time, especially to social studies students. Why would I say such a thing? Well, nowadays finding those values is as easy as Googling them; here's a calculator.

Things got even worse when I was told I should teach my students how to read the table while, at the same time, I'm teaching them how to program in R. That's just ridiculous!

One of my officemates argued that students won't have access to R during the exam. I said that they do have access to it outside the exam, so why should we be preparing students for unrealistic situations? Real stats happen with real data, and real data comes in really big, big files! One can't expect students to work by hand, reading values from a table. What should happen is that exams evolve, just as technology has.

Just to finish my ranting: if you're teaching stats, I encourage you to push the boundaries into the future! Students can't be preparing themselves for problems from the year 1950. If you're studying stats, I encourage you to push the boundaries into the future! Encourage your teacher to realise that we no longer live in 1950 and that there are apps, even for hipster fans of Apple, to find the values of the standard normal (I'm not kidding! Download it here!). If you're neither teaching nor studying stats, I encourage you to push the boundaries into the future! Someone has to say it to you: the future is full of data to be analysed, and only those with the knowledge to do so will survive.
