A recent sequel to the discussion of Harvard President Claudine Gay’s resignation has been to contrast Gay’s slap on the wrist for plagiarism with Harvard’s heavy-handed treatment of Business School Professor Francesca Gino over accusations of data fraud. As in my post about plagiarism, I’m not writing about whether Gino committed data fraud. But I do believe that the problem of data fraud in the academic world is real. My concern is about why and how it happens, and how to prevent it.
Suspecting Data Fraud
My doctoral dissertation involved applying marginal cost pricing, something economic theory advocates, to achieve optimality in the use of airports. While writing my thesis, I discovered a paper published a few years earlier that attempted to do the same thing but produced different results. I admit that I acknowledged this paper in the dissertation but didn’t deal with it. Shortly after receiving my Ph.D., I took a hard look at it, reworked the algebra, and discovered that the authors had made a significant error. In addition, they used a simulation model to determine when airport capacity should be added; I discovered that their simulation numbers were consistent with their theoretical conclusions, but not with their own algebra. I suspected that they simply made the numbers up.
I could have contacted them, but that was long before email. I was also afraid that senior scholars would ignore a challenge from a newly minted Ph.D.
I had done enough research to write an article that would correct their work. I was as diplomatic as possible. I simply wrote that I got a different algebraic result and, because I reworked the simulation in accordance with my algebraic result, I subtitled the article “A Re-Analysis of the Data.” I submitted my paper to the journal that published theirs, the journal passed my paper along to them, and they sent me a note thanking me for a useful extension of their original work. And the journal published my paper. The authors and I never had any contact after that. Given my unresolved suspicions, I haven’t provided a link to either my article or theirs (but if you’re really interested, you shouldn’t have much difficulty finding it).
Why and How Data Fraud Happens
Academics are rewarded for publishing, and it is always easier to publish an article that produces an interesting result than an article that reports on an unsuccessful attempt to determine a relationship or, more technically, fails to reject a null hypothesis. This situation provides an incentive for data fraud.
Some social science research uses publicly available databases, and it should be all but impossible to commit data fraud undetected in that context, since anyone can check the analysis against the source. However, in many cases social scientists generate their own data, for example by building databases based on observation of a social phenomenon or on an experimental design.
We use theory to advise us about what data to gather, but social science theories are often inexact or imprecise. They may tell us that one variable is positively or negatively related to another, but they won’t tell us more about the precise relationship. When attempting to relate theory to data in regression analysis, it is commonplace to “massage” the data, for example by transforming variables (using exponents or logarithms, say) or by experimenting with many sets of independent variables. The researcher might run thirty or forty regression equations and find one or two that yield the results predicted by their theory. Clearly, such a result would not be very robust. Indeed, a model yielding only one or two acceptable results out of forty attempts is no better than chance: at the conventional 5 percent significance level, forty tries on pure noise would be expected to turn up about two “significant” results.
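To put rough numbers on the "forty regressions" scenario, here is a minimal sketch in Python. It rests on assumptions that are mine, not part of any particular study: a 5 percent significance threshold, forty specifications that behave as independent tests, and no real relationship in the data at all, so that each p-value is uniformly distributed between 0 and 1. The function names are made up for illustration. Real specifications run on the same data are correlated, so treat the figures as indicative only.

```python
import random

ALPHA = 0.05   # assumed conventional significance threshold
N_SPECS = 40   # number of regression specifications tried


def expected_false_positives(n_specs=N_SPECS, alpha=ALPHA):
    """Average number of 'significant' results when no effect exists."""
    return n_specs * alpha


def prob_at_least_one(n_specs=N_SPECS, alpha=ALPHA):
    """Chance that at least one specification looks significant by luck."""
    return 1 - (1 - alpha) ** n_specs


def simulate_mean_hits(n_specs=N_SPECS, alpha=ALPHA, n_trials=10_000, seed=0):
    """Simulate many researchers, each trying n_specs specifications on noise.

    Under a true null hypothesis a p-value is uniform on [0, 1], so each
    specification is modeled as one uniform draw compared against alpha.
    """
    rng = random.Random(seed)
    total_hits = 0
    for _ in range(n_trials):
        total_hits += sum(rng.random() < alpha for _ in range(n_specs))
    return total_hits / n_trials


if __name__ == "__main__":
    print("expected false positives:", expected_false_positives())
    print("P(at least one):", round(prob_at_least_one(), 3))
    print("simulated mean hits:", round(simulate_mean_hits(), 2))
```

Under these assumptions, one or two "successful" regressions out of forty is about what pure noise delivers, and the chance of finding at least one is well above 80 percent.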
I think data fraud goes one step further than data massaging. In massaging, you are playing with all observations of a given variable, hence treating them all the same way. In data fraud, you adjust individual observations, perhaps dropping outliers or changing how particular observations were measured. The last step down the slope would be making up observations that fit your theory. This is data fraud, pure and simple, and deserves condemnation and sanction.
How to Avoid Data Fraud
The best way to avoid data fraud is to start with first principles. You should regard the theory and the data as separate in the sense that they were generated separately. The word “data” comes from the past participle of the Latin verb dare, to give, and means “that which is given.” If you regard the data as given, even if you generated them yourself through observation or experimentation, then you are not allowed to change them at all midway through the process of analysis.
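One way to hold yourself to the "data as given" rule in practice is to fingerprint the raw observations before analysis begins and check the fingerprint again at the end. The sketch below is only an illustration under my own assumptions: the records, the field names (traffic, delay), and the helper functions are hypothetical, and SHA-256 over a JSON dump is just one convenient choice of fingerprint. Transformations applied uniformly to a variable are written to a new copy, so the raw observations are never edited.

```python
import hashlib
import json
import math


def fingerprint(records):
    """Return a SHA-256 digest of the records in a canonical JSON form."""
    payload = json.dumps(records, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Hypothetical raw observations, recorded once and then left alone.
raw = [
    {"id": 1, "traffic": 120.0, "delay": 4.5},
    {"id": 2, "traffic": 310.0, "delay": 9.1},
    {"id": 3, "traffic": 95.0, "delay": 3.2},
]

# Record the fingerprint before any analysis starts.
frozen = fingerprint(raw)


def with_log_traffic(records):
    """Apply the same transformation to every observation, in a new copy."""
    return [dict(r, log_traffic=math.log(r["traffic"])) for r in records]


analysis = with_log_traffic(raw)

# At the end of the analysis, the raw data should still match the fingerprint.
unchanged = fingerprint(raw) == frozen
```

If a single observation is later altered or dropped, the fingerprint no longer matches, which makes it easy to show anyone who asks that the data you analyzed are the data you collected.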
Mark Twain, many of whose aphorisms I find instructive, said that “if you tell the truth you don’t have to remember anything.” Applying this to empirical research, if you do your research cleanly, it is easy to tell any researcher who might ask what you did and why you did it. And, for the sake of your career and your peace of mind, that’s worth knowing.