VC Theory and PAC Learning
An introduction to VC theory and PAC learning, including the foundations of statistical learning theory.
Learning from Samples
The goal of learning from samples is to learn an unknown function from as few samples as possible. We receive inputs drawn i.i.d. from a distribution and query the oracle (a black box). The oracle outputs the value . From these input-output pairs, we want to reconstruct or approximately learn the function.
Let us take a specific example and learn a toy function under specific conditions.
Suppose is a finite domain, the inputs are drawn uniformly from , and . We are promised that is either constant on the entire domain or assigns each label to exactly half of the domain. That is, either
Now, we have two options, and I need to figure out which kind of function the oracle represents. I draw samples and look at their outputs. The outputs may be different, or they may all be the same. If I observe both and , I can immediately say that the function is balanced because of the promise. If all the outputs are the same, however, I cannot be certain. My learning rule will declare the function constant in that case, so all I can talk about is the probability that this decision is wrong.
For example, let us say that I have drawn three samples and all are in the class. Can I say that the function is constant? No. A balanced function could also produce those three outputs. Nor can I say for sure that it is balanced and that drawing more samples will reveal this. My whole issue is to figure out how many samples I need before making a decision, so I must choose that number in advance.
Then comes the idea of probability. If the function is balanced, each draw has probability of producing either label. Therefore, the probability that three draws are all is
The same is true for three outputs that are all . Thus, the probability that all three outputs are the same, causing my learning rule to mistake a balanced function for a constant one, is
So, now comes the big question: how much risk of being fooled by the sampled observations can I afford? I shall call this limit , where .
If I draw samples, the probability that a balanced function produces only one label is
Therefore, I will choose enough samples that
This makes sure that the probability of my learning rule being fooled by the sampled observations is at most .
Equivalently, I need
Since the number of samples must be an integer, the smallest such choice is .
We can distinguish between case A and case B with probability atleast using queries. under finite no. of samples.