Biased Benchmarks: honesty is hard


This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation.

Performing benchmarks to compare software is surprisingly difficult to do fairly, even assuming the best intentions of the author. Technical developers can fall victim to a few natural human failings.

Last week I started comparing performance between Dask dataframes (a project that I maintain) and Spark dataframes (the current standard). My initial results showed that Dask dataframes were much faster overall, somewhere around 5x.

These results were wrong. They weren't wrong in a factual sense, and the experiments that I ran were clear and reproducible, but there was so much bias in how I selected, set up, and ran those experiments that the end result was misleading. After checking the results myself, and then having other experts check them, I now see much more sensible numbers. At the moment both projects are within a factor of two of each other most of the time, with some interesting exceptions either way.
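To make the measurement side of this concrete, below is a minimal, hypothetical timing harness of the sort one might use when re-checking numbers like these. It is not the benchmark code from this experiment; `bench`, `dask_workload`, and `spark_workload` are illustrative names. Discarding warm-up runs and reporting a median over several repeats removes some of the easiest sources of accidental bias, though not the subtler ones this post is about.

```python
import time
import statistics


def bench(fn, *, warmup=1, repeats=7):
    """Time a callable several times and return the median wall-clock seconds.

    Warm-up runs are discarded so that one-off costs (cluster start-up,
    JIT compilation, caching) are not charged to the operation itself.
    """
    for _ in range(warmup):
        fn()

    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()  # the workload must fully materialize its result, not just build a lazy graph
        samples.append(time.perf_counter() - start)

    return statistics.median(samples)


# Hypothetical usage: dask_workload and spark_workload would be equivalent,
# fully computed operations expressed in each library.
# print(bench(dask_workload), bench(spark_workload))
```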

This blogpost outlines the ways in which library authors can fool themselves when performing benchmarks, using my recent experience as an anecdote. I hope that this encourages authors to check themselves, and encourages readers to be more critical of numbers that they see in the future.
