Should you ever log transform data?
That’s an interesting question you bring up with that article, Shane Keller. Thanks for posting.
Its main technical point, I think, is that modeling transformed data is probably a mistake. This is an example of my general belief about statistical techniques: don’t use methods you don’t understand in a deep way. It’s notoriously easy to misapply statistics and to fool yourself, and this gets worse the more sophisticated your tools are. As he points out, all methods have requirements (assumptions), and people violate them all the time without even realizing it, making the result dubious at best.
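To make that concrete, here is a minimal sketch of one classic trap, retransformation bias. Everything in it is invented for illustration (synthetic data, NumPy, my own parameter choices): fit a regression on log-transformed data, exponentiate the predictions, and you systematically recover the conditional median rather than the conditional mean, because E[exp(ε)] > 1 for zero-mean Gaussian noise (Jensen’s inequality).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y is exp(2 + 0.5*x) with multiplicative lognormal noise.
x = rng.uniform(0.0, 4.0, 10_000)
sigma = 0.8
y = np.exp(2 + 0.5 * x + rng.normal(0.0, sigma, x.size))

# Fitting a line to log(y) is perfectly valid here: on the log scale
# the noise really is additive and Gaussian.
b1, b0 = np.polyfit(x, np.log(y), 1)

# But exponentiating the fitted log-scale prediction estimates the
# conditional *median* of y, not its mean: E[exp(noise)] = exp(sigma^2/2) > 1.
x0 = 2.0
naive = np.exp(b0 + b1 * x0)
true_mean = np.exp(2 + 0.5 * x0 + sigma**2 / 2)
print(f"naive back-transform:  {naive:6.1f}")      # ~ exp(3.0)  ≈ 20.1
print(f"true conditional mean: {true_mean:6.1f}")  # ~ exp(3.32) ≈ 27.7
```

The log-scale fit itself is fine; it’s the naive back-transform that quietly changes which question you’re answering.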
There’s also a broader idea that transforming data makes it hard to interpret. I think that’s true, and I agree very much with the sentiment. If you log your data, calculate a mean, and then back-transform it, what is the meaning of that value? How would you explain its interpretation to someone? It’s not easy.
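For what it’s worth, that particular back-transformed value does have a name: exponentiating the mean of the logs gives exactly the geometric mean. A small sketch (synthetic, income-like numbers; NumPy; the figures are invented) shows how differently it behaves from the plain mean on skewed data:

```python
import numpy as np

rng = np.random.default_rng(0)
# Skewed, income-like synthetic data.
incomes = rng.lognormal(mean=10.0, sigma=1.0, size=100_000)

# exp(mean(log(x))) is, by definition, the geometric mean of x.
geo_mean = np.exp(np.log(incomes).mean())

print(f"arithmetic mean: {incomes.mean():10,.0f}")       # ~ e^10.5 ≈ 36,300
print(f"geometric mean:  {geo_mean:10,.0f}")             # ~ e^10   ≈ 22,000
print(f"median:          {np.median(incomes):10,.0f}")   # ~ e^10   ≈ 22,000
```

On lognormal data the geometric mean lands near the median, which is one honest way to explain it to someone; but that correspondence is a property of that distribution, not a general guarantee.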
I always advocate strongly for paying close attention to the interpretation of your statistics, and for making sure they have an interpretation at all! Computers can perform myriad kinds of processing on data, but the question for the analyst/scientist/designer is what the result means, and whether that meaning is helpful for the task at hand.
This means it’s important to pay attention to the details of even simple statistics. That’s why I focused here on just the basic measures of central tendency (mean, geometric mean, and median). As you can see, there’s a lot to understand even there, and they too are easy to misapply. And once you start using even simple techniques like binning, you have a hundred more ways to lie to yourself, as the sketch below shows. It’s tricky!
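To illustrate the binning point, here is a hedged sketch (synthetic bimodal data, NumPy; the scenario and numbers are made up): the same values histogrammed with two different bin counts tell two different stories.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic bimodal data: imagine response times from two user populations.
data = np.concatenate([rng.normal(100, 10, 500),
                       rng.normal(140, 10, 500)])

# With 3 bins the two modes merge into one fat middle bin;
# with 20 bins the bimodality is obvious.
coarse_counts, _ = np.histogram(data, bins=3)
fine_counts, _ = np.histogram(data, bins=20)
print("3 bins: ", coarse_counts)   # looks unimodal
print("20 bins:", fine_counts)     # two clear peaks
```

Neither histogram is “wrong”; the point is that the bin choice is itself a modeling decision, and it’s easy to make it without noticing you’ve made it.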
Unfortunately, if the world strongly agreed with my advice never to use techniques one doesn’t really understand, nothing would ever get done. But I’d at least say, stick to the simplest methods you can get away with. If you’re thorough, it’s amazing what you can do even with summary statistics.