Statistics, Lies, and Things That Go Bump in the Night

Back when I was a callow youth I always felt that statistics was a waste of time. After all, what does an average really tell you? And a median seemed almost useless as a measure of anything. Besides, one is always measuring the past or something like that. Well, that was the short view. The fact is that those lovely measurements so often quoted by news stories or politicians or baseball fans are really only descriptors of data. The term average or median never says any more than that. A baseball player with a batting average of .250 hits safely, meaning he reaches base on a hit without being put out, in one out of four turns at bat. The average does not tell us how those hits were spread out. We do not know if he has struck out eight or ten consecutive times at bat, and walks do not show up in the number at all, since a walk is not counted as an at-bat. For all we know, he could hit safely the first twenty-five times at bat and be put out the next seventy-five times at bat. If he is the median hitter for his team, then when the team's averages are listed in order his falls in the middle, with as many hitters above him as below. Say there is a hitter on his team with an average of .350 and a hitter with an average of .150; he sits somewhere between them. The median only describes his place in that ordering, not how close he is to the highest or the lowest average. And a batting average says nothing about how many times you may swing and miss the ball, how many times you may foul it off, how many times you strike out, how many times you pop out, and how many times you ground out.
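To make the point concrete, here is a small Python sketch. The two players and the five-man team are made up for illustration; nothing here is real baseball data. It shows two very different hitting records that produce the same .250 average, and a median that only marks the middle of an ordered list.

```python
from statistics import median

# 1 = safe hit, 0 = out. Both hypothetical players have 100 at-bats.
streaky = [1] * 25 + [0] * 75   # 25 straight hits, then 75 straight outs
steady = [1, 0, 0, 0] * 25      # one hit in every four at-bats


def batting_average(at_bats):
    """Hits divided by at-bats; the only thing the average records."""
    return sum(at_bats) / len(at_bats)


def longest_out_streak(at_bats):
    """Length of the longest run of consecutive outs."""
    longest = current = 0
    for result in at_bats:
        current = current + 1 if result == 0 else 0
        longest = max(longest, current)
    return longest


# Made-up team averages, used only to show what a median picks out.
team_averages = [0.150, 0.230, 0.250, 0.330, 0.350]

for name, record in (("streaky", streaky), ("steady", steady)):
    print(name, f"{batting_average(record):.3f}", longest_out_streak(record))

print("team median:", median(team_averages))
```

Both players print .250, yet one goes 75 at-bats without a hit and the other never goes more than 3. The average alone cannot tell them apart.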

Now many sports commentators confuse the average occurrence with the probability of an occurrence. Jack Armstrong, the All-American baseball slugger, comes to the plate, and even though he has walked or been put out the last five times at bat he is sure to have a hit this time. His average is .400, so how could he not get a hit? Well, he could strike out or hit a catchable fly ball or hit a grounder and be thrown out at first base. He could hit the ball, reach first base, and then be thrown out trying to take second base on that long grounder to the outfield. Just because the announcer believes Jack is due for a hit does not mean he will get one. It's a problem of probability, and that can be measured in different ways. What is the probability that he will swing and hit the first pitch safely? What about the second or third pitch? Using an average as a probability is a misuse of the statistic and can be an outright lie.
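The "due for a hit" idea can be checked with a quick simulation. This is only a sketch under a strong assumption that I am adding: every at-bat is an independent coin flip with a .400 chance of a hit. Real at-bats are not that tidy, and the function name and parameters are mine.

```python
import random


def hit_rate_after_slump(p_hit=0.400, slump=5, at_bats=200_000, seed=1):
    """Simulate independent at-bats and return the share of hits in the
    at-bats that come right after `slump` consecutive hitless at-bats."""
    rng = random.Random(seed)
    results = [rng.random() < p_hit for _ in range(at_bats)]

    hits_after = 0
    chances_after = 0
    for i in range(slump, at_bats):
        # Look only at at-bats preceded by a run of `slump` non-hits.
        if not any(results[i - slump:i]):
            chances_after += 1
            hits_after += results[i]
    return hits_after / chances_after


print(f"hit rate right after five hitless at-bats: {hit_rate_after_slump():.3f}")
```

Under that independence assumption the rate after a slump comes out close to .400, the same as at any other time. The slump does not make the next hit any more likely.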

Statistical analysis attempts to make sense of a collection of data. Describing the sun's color at sunset when it is just on the edge of the horizon only tells us what color we are seeing. It says nothing about the cause of the color or why we can look directly at the sun and not damage our retinas. Statistical analysis attempts to find cause and effect. It seeks to measure such things and then to explain why they happen. Statistical analysis is the study of variables, the dependent variable and the independent variable. We seek to find some interaction between the two variables. The dependent variable is usually acted upon by the independent variable and not the other way round. Now understand that a variable is usually some behavior. I am driving a vehicle on a highway and the speed I control with the pressure applied by my foot is about sixty miles an hour. I say about because that applied pressure varies. That is our dependent variable: how steadily I hold my speed, which depends on my attention to the speedometer and on the muscles in my leg and foot. The independent variable could be listening to the radio or talking on my cell phone. If I take a number of measurements where I am concentrating solely on keeping my speed as constant as I can, and then take measurements when I listen to the radio or talk on the cell phone, I have established a way to see if, and how much, the independent variable interferes with or influences my dependent variable.
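Here is a rough Python sketch of that driving comparison. The speed readings are invented for illustration, not measurements I actually took, and the helper name is my own.

```python
from statistics import mean, stdev

# Hypothetical speed readings in miles per hour, eight per condition.
focused = [60.1, 59.8, 60.3, 59.9, 60.0, 60.2, 59.7, 60.0]   # eyes on the speedometer
on_phone = [58.2, 61.9, 57.5, 62.4, 59.0, 63.1, 56.8, 61.5]  # talking on the cell phone


def summarize(label, speeds):
    """One-line description of a set of readings: the center and the spread."""
    return f"{label}: mean {mean(speeds):.2f} mph, std dev {stdev(speeds):.2f} mph"


print(summarize("focused", focused))
print(summarize("on phone", on_phone))
```

In this made-up data the two means are nearly identical, while the spread is far larger on the phone. The independent variable shows up in how much the speed wanders, not in the average speed.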

Now I should note that we would usually make a hypothesis about some set of variables under study. Normally one doesn't make measurements just to be doing analysis. What we actually test is the null hypothesis, the opposite of what we expect to happen: the claim that the independent variable has no effect. We do not set out to prove it. We look for evidence strong enough to reject it. This is where many lies come into play. Many experiments and statistical analyses are flawed due to the incorrect use of the null hypothesis. I see the problem more in the social sciences. Often those in the social sciences do not have the mathematical training needed to do good statistical analysis, and many BA, MA, and PhD programs don't require extensive training in it. The other problem is determining whether the outcome of an experiment is due solely to chance. What is the probability that the results you obtained were due solely to chance?
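One way to put a number on "due solely to chance" is a permutation test, sketched below in Python. I am swapping in this technique as an illustration; it is one of several ways to test a null hypothesis. The data are hypothetical: each value is how far, in miles per hour, a speed reading strayed from sixty.

```python
import random

# Hypothetical absolute deviations from 60 mph, eight readings per condition.
focused = [0.1, 0.2, 0.3, 0.1, 0.0, 0.2, 0.3, 0.0]
on_phone = [1.8, 1.9, 2.5, 2.4, 1.0, 3.1, 3.2, 1.5]


def average(values):
    return sum(values) / len(values)


def permutation_p_value(a, b, rounds=10_000, seed=0):
    """Estimate how often a difference in means at least as large as the
    observed one appears when the group labels are shuffled at random.
    Random shuffling is what the null hypothesis (no effect) looks like."""
    rng = random.Random(seed)
    observed = abs(average(b) - average(a))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(rounds):
        rng.shuffle(pooled)
        diff = abs(average(pooled[len(a):]) - average(pooled[:len(a)]))
        if diff >= observed:
            extreme += 1
    # Add one to both counts so the estimate is never exactly zero.
    return (extreme + 1) / (rounds + 1)


print("estimated p-value:", permutation_p_value(focused, on_phone))
```

A small value says that random relabeling rarely produces a gap this large, which is grounds to reject the null hypothesis. It does not prove the phone caused the wandering; that still depends on how the experiment was run.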

The things that go bump in the night have more to do with the method or experimental design. Not all experiments are designed well and not all hypotheses are framed well. A poorly designed experiment leaves room for outside interference. Or perhaps the experiment is not run well and the outcomes are nudged towards the expected result. Then there is the problem that our sample may not be representative of our universe, meaning all the members of the group under study. Perhaps the experimenters did not select the participants by a truly random method or, worse yet, stacked the deck. It is estimated that about twenty-five percent of all experiments in any particular peer-reviewed journal are in error. That includes the hard sciences.
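The sampling problem is easy to see in a toy example. In this Python sketch the "universe" is 10,000 made-up ages, and the two samples are my own illustration of a random draw versus a convenient but lopsided one.

```python
import random
from statistics import mean

rng = random.Random(42)

# Hypothetical universe: 10,000 people with ages from 18 to 90.
population = [rng.randint(18, 90) for _ in range(10_000)]

# A truly random sample of 200 people.
random_sample = rng.sample(population, 200)

# A convenience sample: the first 200 people under 30 we happen to find.
convenience_sample = [age for age in population if age < 30][:200]

print(f"universe mean age:    {mean(population):.1f}")
print(f"random sample mean:   {mean(random_sample):.1f}")
print(f"convenience sample:   {mean(convenience_sample):.1f}")
```

The random sample lands near the mean of the whole universe, while the convenience sample misses badly no matter how carefully the arithmetic is done afterwards.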

