3 · Training data & "garbage in, garbage out"

All those labeled examples you gather have a name: the training data. It's the most important ingredient in the whole process, and it leads to the single most useful rule in machine learning:

Garbage in, garbage out. (Makers shorten it to "GIGO.") A model is only as good as the examples you train it on. Messy, lopsided, or wrong examples make a messy, lopsided, or wrong model — no matter how fancy the computer is.

A model can't learn something you never showed it. It can't fix examples that were labeled wrong. It just faithfully copies the patterns in whatever you feed it. So a huge part of being a good maker is being picky about your training data. Here's what "good" looks like:

Good training data | Why it matters
Enough examples of each label | A handful isn't enough to find a real pattern
Balanced: roughly the same amount per label | 200 cats and 5 dogs teaches "everything is a cat"
Correctly labeled | One mislabeled photo teaches the wrong lesson
Variety that matches real life | Different angles, lighting, backgrounds; not all identical
Examples that look like what you'll actually use it on | Train on bright daytime photos, and it'll struggle at night
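
To make the "balanced" row concrete, here is a small Python sketch that counts how many examples each label has and warns when one label swamps another. The function name `check_balance` and the "no more than 3 times bigger" cutoff are assumptions made up for this illustration, not a rule from this course or from any tool.

```python
from collections import Counter


def check_balance(labels, max_ratio=3.0):
    """Count examples per label and say whether the counts are roughly even.

    max_ratio is a made-up cutoff for this sketch: the biggest label may
    have at most max_ratio times as many examples as the smallest one.
    """
    counts = Counter(labels)
    if not counts:
        # No examples at all is definitely not a balanced dataset.
        return counts, False
    biggest = max(counts.values())
    smallest = min(counts.values())
    balanced = biggest <= max_ratio * smallest
    return counts, balanced


# The lopsided example from the table: 200 cats and 5 dogs.
labels = ["cat"] * 200 + ["dog"] * 5
counts, balanced = check_balance(labels)
print(dict(counts))                  # {'cat': 200, 'dog': 5}
print("Balanced enough?", balanced)  # Balanced enough? False
```

A check like this only catches the "lopsided" problem. It can't tell whether the labels are correct or whether the photos have enough variety, so those still need a human look.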

A quick story of GIGO in action: imagine you train an "is this a cat?" model, but every single cat photo you used had a comfy couch in the background. The model might secretly learn "couch = cat." Then you show it a dog on a couch and it confidently says "cat!" It didn't do anything wrong: it learned exactly the pattern you accidentally gave it. Garbage in, garbage out.
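
One way to catch a sneaky clue like the couch before training is to jot down what is in the background of each photo and see whether it lines up too neatly with one label. The sketch below assumes you have written those notes by hand as simple tags. The function `share_with_clue` and the tag names are hypothetical and exist only for this illustration.

```python
def share_with_clue(examples, clue):
    """For each label, return the fraction of its examples that contain the clue.

    examples is a list of (label, tags) pairs, where tags is a set of
    hand-written notes about what else is in the photo.
    """
    totals = {}
    with_clue = {}
    for label, tags in examples:
        totals[label] = totals.get(label, 0) + 1
        if clue in tags:
            with_clue[label] = with_clue.get(label, 0) + 1
    return {label: with_clue.get(label, 0) / totals[label] for label in totals}


# Made-up notes: every cat photo has a couch, and no other photo does.
examples = [("cat", {"couch"})] * 20 + [("not cat", {"grass"})] * 20
shares = share_with_clue(examples, "couch")
print(shares)  # {'cat': 1.0, 'not cat': 0.0}
```

When a clue shows up in nearly all of one label's photos and almost none of the other's, the model could learn the clue instead of the thing you care about. Adding cat photos without couches, and couch photos without cats, would break that accidental pattern.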

This is why makers spend more time collecting and cleaning examples than almost anything else. When you train your own model later in this course, the quality of your photos will matter more than any setting you click.

Think about it. You're training "happy face vs. sad face," but all your happy photos are of you and all your sad photos are of your little brother. What sneaky wrong pattern (instead of the expression) might the model learn?
