Expanding on the idea of multiple data sets with
something I forgot earlier:
Traditionally, when you're teaching a program to do
something, you use two data sets: a training set, which
is properly marked ("this should match", "this shouldn't",
etc), and a test set, which is also marked. You
don't want to train the program on all the data at
once, because you run the risk of overfitting (i.e. you
get a program that does really well at matching the training
data set, but is so specific to the training data that it
fails on real-world data).
--
:wq