Friday, May 4, 2018

Overfit vs. underfit

You've got data!

That's a good start. Now, working back from effects to causes: what model fits the data? If you get the model right, you can forecast (gasp! estimate) what comes next.

You can make two errors, both of which could be costly, but one more than the other:
  1. Overfit the data. Meaning: a "too tight" fit that chases the noise, so some data fits very well, including the outliers and ill effects. The danger here is that the "good fit" may actually be fitting a selection of outliers and noise. Thus, the real causation is missed
  2. Underfit the data. Meaning: a "too loose" fit that is too simple to capture the real structure, so the model is too sloppy to be meaningful
Now, in practice, the overfit is most common. Why? Because the tight fit looks really good on a PowerPoint slide and thus wins the day in the briefing.

But, come reality, the overfit model breaks down on new data, and the estimating naysayers say nay to estimating. Who can blame them?
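The distinction can be sketched in a few lines of Python. This is a minimal illustration with made-up data (the quadratic "true model", the noise level, and the polynomial degrees are all assumptions for the sketch, not anything from the post):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: the true cause-and-effect is quadratic, plus noise
# (the "outliers and ill effects").
true_model = lambda x: 1.0 + 2.0 * x - 3.0 * x**2
x_train = np.linspace(0.0, 1.0, 20)
x_test = np.linspace(0.025, 0.975, 20)   # new data: "come reality"
y_train = true_model(x_train) + rng.normal(0.0, 0.05, x_train.size)
y_test = true_model(x_test) + rng.normal(0.0, 0.05, x_test.size)

def errors(degree):
    """Fit a polynomial of the given degree to the training data;
    return (training MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    mse = lambda x, y: float(np.mean((np.polyval(coeffs, x) - y) ** 2))
    return mse(x_train, y_train), mse(x_test, y_test)

under_train, under_test = errors(0)  # too loose: a flat line misses the curve
good_train, good_test = errors(2)    # matches the true model
over_train, over_test = errors(9)    # too tight: chases the noise

# The tighter the fit, the better it looks in the briefing
# (i.e., on the data already in hand)...
assert over_train < good_train < under_train
# ...but on new data the sloppy flat line is still worse than the honest
# model, and the overfit model typically gives up its apparent advantage.
assert under_test > good_test
print(f"overfit: train {over_train:.4f}, test {over_test:.4f}")
print(f"honest:  train {good_train:.4f}, test {good_test:.4f}")
```

The training error can only go down as the model gets more flexible, which is exactly why the overfit model wins the slide deck; the held-out error is what tells you whether it will survive contact with reality.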

"... it may look superficially more impressive until then, claiming to make very accurate and newsworthy predictions and to represent an advance over previously applied techniques. This may make it easier to get the model published in an academic journal or to sell to a client, crowding out more honest models from the marketplace. But if the model is fitting noise, it has the potential to hurt the science."
"The Signal and the Noise: Why So Many Predictions Fail-but Some Don't" 
by Nate Silver
