ABOUT THIS BOOK

“McCloskey and Ziliak have been pushing this very elementary, very correct, very important argument through several articles over several years and for reasons I cannot fathom it is still resisted. If it takes a book to get it across, I hope this book will do it. It ought to.”

—Thomas Schelling, Distinguished University Professor, School of Public Policy, University of Maryland, and 2005 Nobel Prize Laureate in Economics

“With humor, insight, piercing logic and a nod to history, Ziliak and McCloskey show how economists—and other scientists—suffer from a mass delusion about statistical analysis. The quest for statistical significance that pervades science today is a deeply flawed substitute for thoughtful analysis. . . . Yet few participants in the scientific bureaucracy have been willing to admit what Ziliak and McCloskey make clear: the emperor has no clothes.”

—Kenneth Rothman, Professor of Epidemiology, Boston University School of Health

The Cult of Statistical Significance shows, field by field, how “statistical significance,” a technique that dominates many sciences, has been a huge mistake. The authors find that researchers in a broad spectrum of fields, from agronomy to zoology, employ “testing” that doesn’t test and “estimating” that doesn’t estimate. The facts will startle the outside reader: how could a group of brilliant scientists wander so far from scientific magnitudes? This study will encourage scientists who want to know how to get the statistical sciences back on track and fulfill their quantitative promise. The book shows for the first time how wide the disaster is, and how bad for science, and it traces the problem to its historical, sociological, and philosophical roots.

Stephen T. Ziliak is the author or editor of many articles and two books. He currently lives in Chicago, where he is Professor of Economics at Roosevelt University. Deirdre N. McCloskey, Distinguished Professor of Economics, History, English, and Communication at the University of Illinois at Chicago, is the author of twenty books and three hundred scholarly articles. She has held Guggenheim and National Humanities Fellowships. She is best known for How to Be Human* Though an Economist (University of Michigan Press, 2000) and her most recent book, The Bourgeois Virtues: Ethics for an Age of Commerce (2006).

REVIEWS

"[Steve Ziliak and Deirdre McCloskey] explain to us why the misunderstanding of statistical significance has lead to bad government policy making and how one particularly famous brewery employed the technique to improve the pints we enjoy today."
—Tim Harford, BBC

"The book is a model of scholarship, transparent in its method, wide-reaching in its disciplinary expertise, and highly literate, including occasional haiku poems and humor such as, 'If the variable doesn't fit/you may not have to acquit.' The authors convincingly argue that environmental quality, jobs, and even lives are at stake."
—M. H. Maier, Glendale Community College, Choice

"A clear trade-off: how much confidence [in a result] is "enough" depends on the costs of further research and the benefits of extra precision. Ziliak and his co-author Deirdre McCloskey argue in The Cult of Statistical Significance that most academic disciplines have forgotten this trade-off . . . A sharp line for statistical significance makes no sense, and it has a cost."
—Tim Harford, The Financial Times

"Persuading professionals that their procedures are wrong is a long and lonely task. McCloskey, joined later by Ziliak, has been conducting such a crusade against the misuse of significance testing for over 25 years. This book presents their argument, gives lots of examples of the adverse consequences of misuse, and provides some history of the controversy, which dates from the origins of mathematical statistics."
—Ron P. Smith, Journal of Economic Issues

"The Cult of Statistical Significance has virtues that extend beyond its core message. It is clearly written and should be accessible to those who have neither formal training in statistics nor a desire to secure any. It is full of examples that illustrate why it is the strength of relationships and not their statistical significance that mainly matters."
—Richard Lempert, Law and Social Inquiry

"Despite appearing to be a book of limited appeal - it is after all a book that looks at a set of statistical techniques - it is one that has immense social implications. We live in an age where ideologies have largely been cast aside and instead we are governed increasingly by a class of politicians and civil servants who aim for 'evidence-based' policy-making. When that evidence is based on statistically significant results that ignore any quantification of results then we all have reason to pay attention."
—London Book Review

"If not Fisherian significance, what should be the Holy Grail of statistics? Ziliak and McCloskey . . . answer: "Oomph." We should identify quantities that matter and measure them, not merely determine whether they can be distinguished from the null (meaning no effect) at some predetermined likelihood level. The validity of this point I take to be virtually self-evident. Yet statistical tests that ignore quantity remain pervasive, as the authors demonstrate through quantitative analyses of the contents of some very prestigious journals of economics, psychology, and medicine."
—Theodore Porter, Science

"What is important is a shift of emphasis away from a dichotomous world of true and false towards a recognition of "oomph". This is what the presented book tries to achieve. It is also fun to read, rich with historical information and an excellent reminder of what empirical work of any sort is all about."
—Walter Kramer, Stat Papers

"With humor, insight, piercing logic and a nod to history, Ziliak and McCloskey show how economists—and other scientists—suffer from a mass delusion about statistical analysis. The quest for statistical significance that pervades science today is a deeply flawed substitute for thoughtful analysis. This hollow pursuit, kept alive by mechanical, conformist thinking, has led to grave and obvious errors. Yet few participants in the scientific bureaucracy have been willing to admit what Ziliak and McCloskey make clear: the emperor has no clothes."
—Kenneth Rothman, Professor of Epidemiology, Boston University School of Health

"McCloskey and Ziliak have been pushing this very elementary, very correct, very important argument through several articles over several years and for reasons I cannot fathom it is still resisted. If it takes a book to get it across, I hope this book will do it. It ought to."
—Thomas Schelling, Distinguished University Professor, School of Public Policy, University of Maryland and 2005 Nobel Prize Laureate in Economics

TABLE OF CONTENTS

Preface
Acknowledgments
A Significant Problem
In many of the life and human sciences the existence/whether question of the philosophical
disciplines has substituted for the size-matters/how-much question of the scientific
disciplines. The substitution is causing a loss of jobs, justice, profits, environmental quality,
and even life. The substitution we are worrying about here is called "statistical
significance"--a qualitative, philosophical rule that has substituted for a quantitative, scientific magnitude
and judgment.
Chapter 1. Dieting "Significance" and the Case of Vioxx
Since R. A. Fisher (1890–1962) the sciences that have put statistical significance at their
centers have misused it. They have lost interest in estimating and testing for the actual
effects of drugs or fertilizers or economic policies. The big problem began when Fisher
ignored the size-matters/how-much question central to a statistical test invented by William
Sealy Gosset (1876–1937), so-called Student's t. Fisher substituted for it a qualitative
question concerning the "existence" of an effect, by which he meant "low sampling error by
an arbitrary standard of variance." Forgetting after Fisher what is known in statistics as a
"minimax strategy," or other "loss function," many sciences have fallen into a sizeless stare.
They seek sampling precision only. And they end by asserting that sampling precision just
is oomph, magnitude, practical significance. The minke and sperm whales of Antarctica and
the users and makers of Vioxx are some of the recent victims of this bizarre ritual.
Chapter 2. The Sizeless Stare of Statistical Significance
Crossing frantically a busy street to save your child from certain death is a good gamble.
Crossing frantically to get another mustard packet for your hot dog is not. The size of the
potential loss if you don't hurry to save your child is larger, most will agree, than the
potential loss if you don't get the mustard. But a majority of scientists in economics,
medicine, and other statistical fields appear not to grasp the difference. If they have been
trained in exclusively Fisherian methods (and nearly all of them have) they look only for a
probability of success in the crossing--the existence of a probability of success better than
.99 or .95 or .90, and this within the restricted frame of sampling--ignoring in any spiritual
or financial currency the value of the prize and the expected cost of pursuing it. In the life
and human sciences a majority of scientists look at the world with what we have dubbed "the
sizeless stare of statistical significance."
Chapter 3. What the Sizeless Scientists Say in Defense
The sizeless scientists act as if they believe the size of an effect does not matter. In their
hearts they do care about size, magnitude, oomph. But strangely they don't measure it. They
substitute "significance" measured in Fisher's way. Then they take the substitution a step
further by limiting their concern for error to errors in sampling only. And then they take it
a step further still, reducing all errors in sampling to one kind of error--that of excessive
skepticism, "Type I error." Their main line of defense for this surprising and unscientific
procedure is that, after all, "statistical significance," which they have calculated, is
"objective." But so too are the digits in the New York City telephone directory, objective,
and the spins of a roulette wheel. These are no more relevant to the task of finding out the
sizes and properties of viruses or star clusters or investment rates of return than is statistical
significance. In short, statistical scientists after Fisher neither test nor estimate, really, truly.
They "testimate."
Chapter 4. Better Practice: β-Importance vs. α-"Significance"
The most popular test was invented, we've noted, by Gosset, better known by his pen name
"Student," a chemist and brewer at Guinness in Dublin. Gosset didn't think his test was very
important to his main goal, which was of course brewing a good beer at a good price. The
test, Gosset warned right from the beginning, does not deal with substantive importance. It
does not begin to measure what Gosset called "real error" and "pecuniary advantage," two
terms worth reviving in current statistical practice. But Karl Pearson and especially the
amazing Ronald Fisher didn't listen. In two great books written and revised during the 1920s
and 1930s, Fisher imposed a Rule of Two: if a result departs from an assumed hypothesis
by two or more standard deviations of its own sampling variation, regardless of the size of
the prize and the expected cost of going for it, then it is to be called a "significant" scientific
finding. If not, not. Fisher told the subjectivity-phobic scientists that if they wanted to raise
their studies "to the rank of sciences" they must employ his rule. He later urged them to
ignore the size-matters/how-much approaches of Gosset, Neyman, Egon Pearson, Wald,
Jeffreys, Deming, Shewhart, and Savage. Most statistical scientists listened to Fisher.
Chapter 5. A Lot Can Go Wrong in the Use of Significance Tests in Economics
We ourselves in our home field of economics were long enchanted by Fisherian significance
and the Rule of Two. But at length we came to wonder why the correlation of prices at home
with prices abroad must be "within two standard deviations of 1.0 in the sample" before one
could speak about the integration of world markets. And we came to think it strange that the
U.S. Department of Labor refused to discuss black teenage unemployment rates of 30 or 40
percent because they were, by Fisher's circumscribed definition, "insignificant." After being
told repeatedly, if implausibly, that such mistakes in the use of Gosset's test were not
common in economics, we developed in the 1990s a questionnaire to test in economics
articles for economic as against statistical significance. We applied it to the behavior of our
tribe during the 1980s.
Chapter 6. A Lot Did Go Wrong in the American Economic Review during the 1980s
We did not study the scientific writings of amateurs. On the contrary, we studied the
American Economic Review (known to its friends as the AER), a leading journal of
economics. With questionnaire in hand we read every full-length article it published that
used a test of statistical significance from January 1980 to December 1989. As we expected,
in the 1980s more than 70 percent of the articles made the significant mistake of R. A.
Fisher.
Chapter 7. Is Economic Practice Improving?
We published our article in 1996. Some of our colleagues replied, "In the old days [of the
1980s] people made that mistake, but [in the 1990s] we modern sophisticates do not." So in
2004 we published a follow-up study, reading all the articles published in the AER in the
next decade, the 1990s. Sadly, our colleagues were again mistaken. Since the 1980s the
practice in important respects got worse, not better. About 80 percent of the articles made
the mistaken Fisherian substitution, failing to examine the magnitudes of their results. And
less than 10 percent showed full concern for oomph. In a leading journal of economics, in
other words, nine out of ten articles in the 1990s acted as if size doesn't matter for deciding
whether a number is big or small, whether an effect is big or small enough to matter. The
significance asterisk, the flickering star of *, has become a totem of economic belief.
Chapter 8. How Big Is Big in Economics?
Does globalization hurt the poor, does the minimum wage increase unemployment, does
world money cause inflation, does public welfare undermine self-reliance? Such scientific
questions are always matters of economic significance. How much hurt, increase, cause,
undermining? Size matters. Oomph is what we seek. But that is not what is found by the
statistical methods of modern economics.
Chapter 9. What the Sizeless Stare Costs, Economically Speaking
Sizeless economic research has produced mistaken findings about purchasing power parity,
unemployment programs, monetary policy, rational addiction, and the minimum wage. In
truth, it has vitiated most econometric findings since the 1920s and virtually all of them
since the significance error was institutionalized in the 1940s. The conclusions of Fisherian
studies might occasionally be correct. But only by accident.
Chapter 10. How Economics Stays That Way: The Textbooks and the Referees
New assistant professors are not to blame. Look rather at the report card of their teachers and
editors and referees--notwithstanding cries of anguish from the wise Savages, Zellners,
Grangers, and Leamers of the economics profession. Economists received a quiet warning
by F. Y. Edgeworth in 1885--too quiet, it seems--that sampling precision is not the same as
oomph. They ignored it and have ignored other warnings, too.
Chapter 11. The Not-Boring Rise of Significance in Psychology
Did other fields, such as psychology, do the same? Yes. In 1919 Edwin Boring warned his
fellow psychologists about confusing so-called statistical with actual significance. Boring
was a famous experimentalist at Harvard. But during his lectures on scientific inference his
colleagues appear to have dozed off. Fisher's 5 percent philosophy was eventually codified
by the Publication Manual of the American Psychological Association, which dictated the
erroneous method worldwide to thousands of academic journals in psychology, education,
and related sciences, including forensics.
Chapter 12. Psychometrics Lacks Power
"Power" is a neglected statistical offset to the "first kind of error" of null-hypothesis
significance testing. Power assigns a likelihood to the "second kind of error," that of undue
gullibility. The leading journals of psychometrics have had their power examined by insiders
to the field. The power of most psychological science in the age of Fisher turns out to have
been embarrassingly low or, in more than a few cases, spuriously "high"--as was found in
a seventy-thousand-observation examination of the matter. Like economists the
psychologists developed a fetish for testimation and wandered away from powerful measures
of oomph.
Chapter 13. The Psychology of Psychological Significance Testing
Psychologists and economists have said for decades that people are "Bayesian learners" or
"Neyman-Pearson signal detectors." We learn by doing and staying alert to the signals. But
when psychologists and others propose to test those very hypotheses they use Fisher's Rule
of Two. That is, they erase their own learning and power to detect the signal. They seek a
foundation in a Popperian falsificationism long known to be philosophically dubious. What
in logic is called the "fallacy of the transposed conditional" has grossly misled psychology
and other sizeless sciences. An example is the overdiagnosis of schizophrenia.
Chapter 14. Medicine Seeks a Magic Pill
We found that medicine and epidemiology, too, are doing damage with Student's t--more in
human terms perhaps than are economics and psychology. The scale along which one would
measure oomph is very clear in medicine: life or death. Cardiovascular epidemiology, to take
one example, combines with gusto the fallacy of the transposed conditional and the sizeless
stare of statistical significance. Your mother, with her weak heart, needs to know the oomph
of a treatment. Medical testimators aren't saying.
Chapter 15. Rothman's Revolt
Some medical editors have battled against the 5 percent philosophy. But even the New
England Journal of Medicine could not lead medical research back to William Sealy Gosset
and the promised land of real science. Neither could the International Committee of Medical
Journal Editors, though covering worldwide hundreds of journals. Kenneth Rothman, the
founder of Epidemiology, forced change in his journal. But only his journal. Decades ago
a sensible few in education, ecology, and sociology initiated a "significance test
controversy." But grantors, journal referees, and tenure committees in the statistical sciences
had faith that probability spaces can judge--the "judgment" merely that p < .05 is "better"
for variable X than p < .11 for variable Y. It's not. It depends on the oomph of X and Y.
Chapter 16. On Drugs, Disability, and Death
The upshot is that because of Fisher's standard error you are being given dangerous
medicines, and are being denied the best medicines. The Centers for Disease Control is
infected with p-values in a grant, for example, to study drug use in Atlanta. Public health has
been infected, too. An outbreak of salmonella in South Carolina was studied using
significance tests. In consequence a good deal of the outbreak was ignored. In 1995 a Cancer
Trialists' Collaborative Group came to a rare consensus on effect size: ten different studies
agreed that a certain drug for treating prostate cancer can increase patient survival by 12
percent. An eleventh study published in the New England Journal of Medicine dismissed the
drug. The dismissal was based not on effect size bounded by confidence intervals based on
what Gosset called "real" error but on a single p-value only, indicating, the Fisherian authors
believed, "no clinically meaningful improvement" in survival.
Chapter 17. Edgeworth's Significance
The history of this persistent but mistaken practice is a social study of science. In 1885 an
eccentric and brilliant Oxford don, Francis Ysidro Edgeworth, coined the very term
significance. Edgeworth was prolific in science and philosophy, but was especially interested
in watching bees and wasps. In measuring their behavioral differences, though, he focused
on the sizes and meanings of the differences. He never depended on statistical significance.
Chapter 18. "Take 3σ as Definitely Significant": Pearson's Rule
By contrast, Edgeworth's younger colleague in London, the great and powerful Karl Pearson,
used "significance" very heavily indeed. As such things were defined in 1900 Pearson was
an advanced thinker--for example, he was an imperialist and a racist and one of the founding
fathers of neopositivism and eugenics. Seeking to resolve a tension between passion and
science, ethics and rationality, Pearson mistook significance for "revelations about the
objective world." In 1901 he believed 1.5 to 3 standard deviations were "definitely
significant." By 1906, he tried to codify the sizeless stare with a Rule of Three and tried to
teach it to Gosset.
Chapter 19. Who Sits on the Egg of Cuculus Canorus? Not Karl Pearson
Pearson's journal, Biometrika (1901– ), was for decades a major nest for the significance
mistake. An article on the brooding habits of the cuckoo bird, published in the inaugural
volume, shows the sizeless stare at its beginnings.
Chapter 20. Gosset: The Fable of the Bee
Gosset revolutionized statistics in 1908 with two articles published in this same Pearson's
journal, "The Probable Error of a Mean" and "The Probable Error of a Correlation
Coefficient." Gosset also independently invented Monte Carlo analysis and the economic
design of experiments. He conceived in 1926 the ideas if not the words of "power" and
"loss," which he gave to Egon Pearson and Jerzy Neyman to complete. Yet most statistical
workers know nothing about Gosset. He was exceptionally humble, kindly to other
scientists, a good father and husband, altogether a paragon. As suits an amiable worker bee,
he planted edible berries, blew a pennywhistle, repaired entire, functioning fishing boats
with a penknife, and--though a great scientist--was for thirty-eight years a businessman
brewing Guinness. Gosset always wanted to answer the how-much question. Guinness
needed to know. Karl Pearson couldn't understand.
Chapter 21. Fisher: The Fable of the Wasp
The tragedy in the fable arose from Gosset the bee losing out to R. A. Fisher the wasp. All
agree that Fisher was a genius. Richard Dawkins calls him "the greatest of Darwin's
successors." But Fisher was a genius at a certain kind of academic rhetoric and politics as
much as at mathematical statistics and genetics. His ascent came at a cost to science--and
to Gosset.
Chapter 22. How the Wasp Stung the Bee and Took over Some Sciences
Fisher asked Gosset to calculate Gosset's tables of t for him, gratis. He then took Gosset's
tables, copyrighted them for himself, and in the journal Metron and in his Statistical
Methods for Research Workers, later to be published in thirteen editions and many
languages, he promoted his own circumscribed version of Gosset's test. The new assignment
of authorship and the faux machinery for science were spread by disciples and by Fisher
himself to America and beyond. For decades Harold Hotelling, an important statistician and
economist, enthusiastically carried the Fisherian flag. P. C. Mahalanobis, the great Indian
scientist, was spellbound.
Chapter 23. Eighty Years of Trained Incapacity: How Such a Thing Could Happen
R. A. Fisher was a necessary condition for the standard error of regressions. No Fisher, no
lasting error. But for null-hypothesis significance testing to persist in the face of its logical
and practical difficulties, something else must be operating. Perhaps it is what Thorstein
Veblen called "trained incapacity," to which might be added what Robert Merton called the
"bureaucratization of knowledge" and what Friedrich Hayek called the "scientistic
prejudice." We suggest that the sizeless sciences need to reform their scientistic
bureaucracies.
Chapter 24. What to Do
What, then? Get back to size in science, and to "real error" seriously considered. It is more
difficult than Fisherian procedures, and cannot be reduced to mechanical procedures. How
big is big is a necessary question in any science and has no answer independent of the
conversation of scientists. But it has the merit at least of being relevant to science, business,
and life. The Fisherian procedures are not.
A Reader's Guide
Notes
Works Cited
Index
