Gos

November 5th, 2010, 10:23 AM

CBE,

Rather than risk derailing the circumcision discussion with a discussion of statistical significance, I have decided to start a new thread.

I know that a study with 5,000 subjects encompassing 10,000 person-years sounds like a pretty big deal, and certain to establish statistically significant results.

However, one has to look not only at the size of the study, but also at the incidence of what's being measured. If the incidence of the phenomenon being measured is small, then it takes a considerably larger and/or longer study to achieve statistical significance.

Take, for example, the odds of being dealt a royal flush in a single hand of 5-card stud poker: about 1 in 650,000.

One way of testing this would be to deal 650,000 hands from randomized decks to a group of people for 650,000 "person/hands". For example, you could deal 100 hands each to a group of 6,500 people.

After dealing 650,000 hands, the law of averages says you should expect one royal flush. In reality, though, there might be two or three, or there might be none at all.

If the actual number of RFs is anything but one, then our measured incidence is off by at least 100% of the true incidence. That makes the outcome essentially worthless: we are at least as likely to see a result that misses the true incidence by 100% or more as we are to see an accurate one.
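To put numbers on this: the count of rare events in a large number of independent trials is well approximated by a Poisson distribution. A quick sketch (assuming the 1-in-650,000 figure) shows that landing on the "correct" count of exactly one royal flush is actually a minority outcome:

```python
import math

def poisson_pmf(k, lam):
    """Probability of seeing exactly k events when lam are expected."""
    return math.exp(-lam) * lam**k / math.factorial(k)

# 650,000 hands at 1-in-650,000 odds: one royal flush expected on average.
lam = 650_000 / 650_000
p0 = poisson_pmf(0, lam)
p1 = poisson_pmf(1, lam)
print(f"P(0 RFs)  = {p0:.3f}")           # ~0.368
print(f"P(1 RF)   = {p1:.3f}")           # ~0.368
print(f"P(2+ RFs) = {1 - p0 - p1:.3f}")  # ~0.264
```

So roughly two times out of three, the 650,000-hand study reports something other than the true one-in-650,000 rate.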

One way of getting around this is to expand the scope of the study. For example, rather than dealing 100 hands, let's say we dealt 400 hands to each of 6,500 people, for a total of 2.6 million P/H.

At that point, we could reasonably expect to see four RFs dealt. The actual number might be five or six or it might be two or three, but chances are it'll at least be close to four, and there's a better-than-even chance that it'll be 4 +/- 25%.
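That "better-than-even chance" can be checked directly. With four RFs expected, the Poisson probability of the count landing in the 3-to-5 range works out to just over one half (a sketch, again assuming 1-in-650,000 odds):

```python
import math

def poisson_pmf(k, lam):
    """Probability of seeing exactly k events when lam are expected."""
    return math.exp(-lam) * lam**k / math.factorial(k)

lam = 2_600_000 / 650_000  # 4 royal flushes expected in 2.6 million hands
p_within_25pct = sum(poisson_pmf(k, lam) for k in (3, 4, 5))
print(f"P(3 to 5 RFs) = {p_within_25pct:.3f}")  # ~0.547
```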

That's not bad, but have we achieved statistical significance?

No, for two reasons: 1) There's still too much probability that our outcome is off by 100% or more, and 2) while we've achieved some statistical significance for questions about the overall incidence, we haven't achieved a fraction of the significance required to break the figures down into subgroups, where the incidence within each subgroup is even smaller.

For example, even if you had exactly four RFs, there's absolutely no reason to expect that you will have exactly one RF of each suit. It is far more likely that you will have (for example) one RF of spades, one of diamonds, two RFs of hearts, and none of clubs. The odds are actually against all four suits being represented if there are exactly four RFs in the sample.
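The claim that the odds are against all four suits being represented can be verified by brute force: each of the four RFs is equally likely to be any suit, giving 4^4 = 256 equally likely suit assignments, of which only 4! = 24 cover every suit.

```python
from itertools import product

suits = "SHDC"
assignments = list(product(suits, repeat=4))  # 256 equally likely outcomes
all_four = sum(1 for a in assignments if set(a) == set(suits))
print(f"{all_four} of {len(assignments)}")    # 24 of 256
print(f"P(all four suits) = {all_four / len(assignments):.3f}")  # ~0.094
```

So even with exactly four RFs in hand, there's better than a 90% chance that at least one suit is missing and at least one suit is doubled up.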

Judging by the number of RFs dealt in each suit, our study would thus appear to have "proved" that red suits are three times luckier than black suits, and that hearts are twice as lucky as any other suit.

Thus, when we start breaking the numbers down into subgroups, we still haven't achieved enough statistical significance to get any sort of accurate analysis of what's going on at the subgroup level.

The only way to get past this is (you guessed it) to expand the scope of the study even further. This time, instead of dealing 400 hands to each of our 6,500 participants, we'll deal them each 4000, for a total of 26 million P/H.

At 26 million P/H, our study is astronomically huge, but we've finally achieved enough statistical significance to begin breaking the RFs down by suit and analyzing the results. By the law of averages, we can reasonably expect the total number of RFs to be somewhere close to 40, and while they still probably won't break down evenly by suit, you'll probably have somewhere in the neighborhood of 10 RFs of each suit. Maybe, for example, you've got a total of 39 RFs: say, 12 RFs of clubs and nine of each other suit.

What we have at this point is just enough statistical significance to begin to ascertain the average incidence of RFs of any one suit, but still not nearly enough to compare the incidence of, say, clubs vs. hearts. The example above shows why: clubs appear to be about 33% luckier than any other suit, and no large anomaly in the statistical noise is required to produce that outcome.
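How unremarkable is a 12-of-clubs count when only 10 are expected? A quick Poisson calculation (a sketch, assuming roughly 10 RFs expected per suit at this study size) shows any given suit hits 12 or more almost a third of the time:

```python
import math

def poisson_pmf(k, lam):
    """Probability of seeing exactly k events when lam are expected."""
    return math.exp(-lam) * lam**k / math.factorial(k)

lam = 10  # expected RFs per suit in 26 million hands
p_12_or_more = 1 - sum(poisson_pmf(k, lam) for k in range(12))
print(f"P(12+ RFs of a given suit) = {p_12_or_more:.3f}")  # ~0.30
```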

So how do we get around this? You guessed it: we have to expand the scope of our study. This time we deal 8,000 hands apiece to 32,500 people, for a whopping 260 million P/H.

This gives us enough samples to achieve sufficient statistical significance to begin analyzing the incidence of RFs by suit. Each suit will yield approximately 100 RFs, and while the statistical noise floor is still too high to be considered negligible, the study is large enough to rise far enough above it that we can begin to glimpse the bigger statistical picture.

However, it is important to note that we still haven't achieved high statistical significance. There's enough of a noise floor that it's entirely possible the study ends up with, say, 105 RFs of spades and 91 RFs of clubs, creating the illusion that spades are 15% luckier than clubs.
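A rough simulation backs this up. Assuming about 100 RFs expected per suit, an apparent gap of 15% or more between the "luckiest" and "unluckiest" suit isn't just possible, it's the usual outcome (a sketch; the normal approximation to the Poisson is used for speed):

```python
import math
import random

random.seed(1)
lam = 100.0          # expected RFs per suit at 260 million hands
sd = math.sqrt(lam)  # Poisson standard deviation = 10
trials = 20_000
big_gap = 0
for _ in range(trials):
    # Four suit counts; Normal(lam, sd) approximates Poisson(lam) well here.
    counts = [random.gauss(lam, sd) for _ in range(4)]
    if max(counts) >= 1.15 * min(counts):  # e.g. 105 spades vs 91 clubs
        big_gap += 1
print(f"Chance of a 15%+ apparent suit gap: {big_gap / trials:.0%}")
```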

Thus, even at 260 million P/H, we still haven't achieved a high enough statistical significance to compare how lucky the various suits are and be in any way confident that our study was returning sufficiently accurate results.

Ya know how you get around that? You guessed it: We have to expand the scope of our study another 10X, to 2.6 billion P/H. At this point, we're dealing 32,500 hands of poker apiece to 80,000 people.

BUT -- we can be reasonably confident that during the course of this study, there will be approximately 4,000 RFs dealt, and that there will be roughly 1,000 of each suit, and that the difference between the "luckiest" and "unluckiest" suit will be negligible enough to be easily dismissed as statistical noise, and that the difference is unlikely to be large enough to be mistaken for "proof" that one suit is substantially luckier than another.

At an incidence of 1:650,000, we had to expand the study to 4,000 times the reciprocal of the incidence before we could be confident of getting results that wouldn't create the illusion that some suits are substantially luckier than others.
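The thread running through every expansion above is the square-root law: the relative noise in a Poisson count is 1 over the square root of the expected count, so each 10x increase in study size only cuts the noise by a factor of about 3.2. A sketch, using the per-suit expected counts from the examples above:

```python
import math

# Relative noise (Poisson sd / mean) per suit at each study size above.
for hands, per_suit in [(2_600_000, 1), (26_000_000, 10),
                        (260_000_000, 100), (2_600_000_000, 1000)]:
    rel_noise = 1 / math.sqrt(per_suit)
    print(f"{hands:>13,} hands: ~{per_suit:>4} RFs/suit, noise ~{rel_noise:.1%}")
```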

And yet, at an average incidence of 1 seroconversion per 100 P/Y in this circumcision study, I am expected to believe that a measly 10,000 P/Y (100 times the reciprocal of the incidence) is a large enough sample to determine whether it's "luckier" to be circumcised or uncircumcised?
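Applying the same square-root law to the study's own numbers (a sketch, assuming the quoted 1 seroconversion per 100 P/Y, and assuming the 10,000 P/Y are split evenly between a circumcised arm and an uncircumcised arm):

```python
import math

incidence = 1 / 100            # quoted: 1 seroconversion per 100 person-years
person_years_per_arm = 10_000 / 2
expected_events = incidence * person_years_per_arm  # ~50 per arm
rel_noise = 1 / math.sqrt(expected_events)          # Poisson sd / mean
print(f"~{expected_events:.0f} events per arm, noise ~{rel_noise:.0%} per arm")
```

Roughly 50 events per arm means each arm's rate carries about 14% noise on its own, and a comparison between the two arms is noisier still.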
