Now, the amount of mental peace I felt after reading this (even if only for a few moments) makes it inevitable that I should drop a line or two about it on my blog 🙂 Even if it's momentary, I don't consider the peace random or arbitrary. I consider it significant ;-).
Questions about the use of statistical significance tests on large datasets have been bugging me for some time now, although I never really did anything about them. They only kept coming back more and more frequently. Each time a reviewer asked about significance tests, I wondered: "Won't everything become significantly different if you have a large N?" As a perennial fledgling researcher, though, my first instinct is to doubt my own understanding of the process.
I came across this piece, "Language is never, ever, ever, random" by Adam Kilgarriff, which brought me some mental peace in what is (in my imagination) one of the most confusing phases of my life at the moment 🙂
Here are the details of the paper:
Language is never, ever, ever, random
by Adam Kilgarriff
Corpus Linguistics and Linguistic Theory 1-2 (2005), 263-276
“Language users never choose words randomly, and language is essentially non-random. Statistical hypothesis testing uses a null hypothesis, which posits randomness. Hence, when we look at linguistic phenomena in corpora, the null hypothesis will never be true. Moreover, where there is enough data, we shall (almost) always be able to establish that it is not true. In corpus studies, we frequently do have enough data, so the fact that a relation between two phenomena is demonstrably non-random, does not support the inference that it is not arbitrary. We present experimental evidence of how arbitrary associations between word frequencies and corpora are systematically non-random. We review literature in which hypothesis testing has been used, and show how it has often led to unhelpful or misleading results.”
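To make the large-N worry concrete for myself, here is a minimal Python sketch. It is my own illustration, not an experiment from the paper. It assumes a hypothetical word that occurs 100 times per million tokens in one corpus and 105 times per million in another, and runs a plain Pearson chi-square test (1 degree of freedom, no continuity correction) while only the corpus size grows. The function name `chi_square_2x2` and both rates are made up for the example.

```python
import math


def chi_square_2x2(count_a, size_a, count_b, size_b):
    """Pearson chi-square test (1 d.f., no continuity correction) comparing
    one word's frequency in two corpora. Returns (statistic, p_value)."""
    observed = [
        [count_a, size_a - count_a],  # corpus A: the word, all other tokens
        [count_b, size_b - count_b],  # corpus B: the word, all other tokens
    ]
    total = size_a + size_b
    row_totals = [size_a, size_b]
    col_totals = [count_a + count_b, total - count_a - count_b]

    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed[i][j] - expected) ** 2 / expected

    # For 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical relative frequencies (occurrences per million tokens).
RATE_A_PER_MILLION = 100
RATE_B_PER_MILLION = 105

results = {}
for size in (10**6, 10**7, 10**8, 10**9):
    millions = size // 10**6
    stat, p = chi_square_2x2(
        RATE_A_PER_MILLION * millions, size,
        RATE_B_PER_MILLION * millions, size,
    )
    results[size] = (stat, p)
    print(f"{size:>13,} tokens per corpus: chi2 = {stat:8.2f}, p = {p:.3g}")
```

The relative difference between the two corpora never changes, yet the chi-square statistic grows in proportion to the corpus size. The same 5% gap that is nowhere near significant at a million tokens per corpus becomes overwhelmingly "significant" at a billion, which is exactly the situation the abstract describes.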
And the take-home message (according to me):
Hypothesis testing has been used to reach conclusions, where the difficulty in reaching a conclusion is caused by sparsity of data. But language data, in this age of information glut, is available in vast quantities. A better strategy will generally be to use more data. Then the difference between the motivated and the arbitrary will be evident without the use of compromised hypothesis testing. As Lord Rutherford put it: “If your experiment needs statistics, you ought to have done a better experiment.”