PostPosted: Thu 21 Aug 2014 8:38 pm 

Joined: Wed 19 Dec 2012 3:58 pm
Posts: 488
An Lon Dubh wrote:
Quote:
Frequency analyses cannot produce statistically trustworthy results without having a heck of a lot of language data. Even now, corpuses can be spoiled by selective choosing of source material, and they contain billions of words. No matter anyone's best intentions, in 1966 it would have been impossible to build a large-scale, statistically relevant corpus, so they would have had to be very selective in what they thought was worth including, and that selective step destroys the objectivity of the study.

Why would it have been impossible in 1966 to build such a corpus? This is out of genuine scientific curiosity.

Just because of sheer size requirements. There were lots of very cool things going on in the 60s, but we were still looking at very limited datasets - the Brown Corpus (US English) had only a million words. Only? Well, the Lord of the Rings is estimated to be half a million words long, so it's only two LotRs.

Given the wide variety of possible language, some features simply won't make it into the sample, and many rare-ish features are likely to occur only once, if at all, making them indistinguishable from slips-of-the-tongue or typos in the source material.

The British National Corpus was compiled in the early 90s with over a hundred million words, and things were a lot more practical then. I shudder to think how many words are in some of the newer corpora.

Now you could compile a corpus that size on paper, but the problem would then be how to do anything meaningful with it. With a computer, I can search for all instances of a certain pattern in a short space of time and sort and filter them on screen. Now imagine trawling through 200 times the text of the Lord of the Rings by hand to find every instance of (for example) "there's" followed by a plural. Every search of a computerised corpus is basically a PhD's worth of research from the pre-computer era.
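To give a flavour of what "search for a certain pattern" means in practice, here's a minimal Python sketch of the "there's" + plural example. The corpus is a toy three-sentence string, and treating "any word ending in s" as a plural is a deliberately crude stand-in for real part-of-speech tagging, so take it as an illustration rather than serious corpus tooling:

```python
import re

# Toy corpus; a real corpus would be millions of words across many files.
corpus = (
    "There's a cat on the mat. There's problems with this sentence, "
    "some would say. There's apples in the basket, too."
)

# Crude heuristic: "there's", an optional intervening word, then a word
# ending in "s" (a rough stand-in for a tagged plural noun).
pattern = re.compile(r"there's\s+(?:\w+\s+)?\w+s\b", re.IGNORECASE)

matches = pattern.findall(corpus)
for m in matches:
    print(m)  # "There's problems", "There's apples"
```

The point isn't the regex itself — it's that this query runs in milliseconds over a hundred million words, where the manual equivalent would be months of card-index work.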

Early statistical analysis tended to focus on vocabulary, as it's the easiest thing to identify at a superficial level, and also very likely to repeat. Unsurprisingly, dictionaries were the first mass-market product affected by corpus work.

Before Brown, we had other statistical analyses as RobertKaucher says, including the General Service List, but this is just a frequency list of words. It is highly likely that the "daily conversational requirements" identified scientifically by BG would have been a word-frequency list (or possibly even just a translated version of the Swadesh list). This was cutting-edge stuff at the time, although it's now pretty rudimentary.
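A word-frequency list of the General Service List kind boils down to tokenising and counting. A minimal Python sketch (the one-line tokeniser here is an assumed simplification — real corpus work needs lemmatisation, handling of contractions, and so on):

```python
import re
from collections import Counter

# Toy text standing in for a corpus sample.
text = "the cat sat on the mat and the dog sat by the door"

# Crude tokenisation: lowercase, then runs of letters/apostrophes.
tokens = re.findall(r"[a-z']+", text.lower())

# Counting tokens gives the raw material for a frequency list.
freq = Counter(tokens)
print(freq.most_common(2))  # [('the', 4), ('sat', 2)]
```

Conceptually that's all a 1960s frequency list is; the hard part back then was doing the tokenising and tallying without a machine.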

It is highly unlikely that there would have been any scientific research behind many of the grammatical structures in the book, and in particular I personally find it unlikely that the specific construction we're discussing here would have come out of statistical analysis.

_________________
A language belongs to its native speakers, and when you speak it, you are a guest in their homes.
If you are not a good guest, you have no right to complain about receiving poor hospitality.

