CSC 208-01 (Fall 2023)

Lab: Inference

Recall Bayes’s theorem from our previous reading:

\[ \Pr(A \mid B) = \frac{\Pr(B \mid A)}{\Pr(B)} \cdot \Pr(A) \]

Observe that we can interpret Bayes’s theorem as a way of expressing an updated belief about the probability of \(A\) given that we learn some new information \(B\).

To obtain the posterior belief \(\Pr(A \mid B)\), we multiply the prior belief \(\Pr(A)\) by the ratio \(\Pr(B \mid A) / \Pr(B)\), which captures how much more (or less) likely the new information \(B\) is when \(A\) holds than it is overall.
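
As a quick numeric sanity check, here is a minimal Python sketch of this update rule. The particular prior, likelihood, and evidence values are made up purely for illustration.

```python
def posterior(prior, likelihood, evidence):
    """Bayes's theorem: Pr(A | B) = Pr(B | A) / Pr(B) * Pr(A)."""
    return likelihood / evidence * prior

# Hypothetical numbers, chosen only to illustrate the update:
# prior Pr(A) = 0.30, likelihood Pr(B | A) = 0.80, evidence Pr(B) = 0.60.
print(posterior(0.30, 0.80, 0.60))  # 0.4 -- learning B raised our belief in A
```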

We can apply this interpretation to the classification problem to obtain a simple algorithm, the Naive Bayes algorithm. For example, consider the following table of data that records when we are likely to go to class given that certain circumstances occur:

Time of Day   Partied Yesterday?   Test Today?   Hungry?   Went to Class?
-----------   ------------------   -----------   -------   --------------
Early         Yes                  No            Yes       No
Early         No                   No            No        No
Early         No                   No            No        Yes
Late          Yes                  Yes           Yes       Yes
Late          No                   No            Yes       No
Early         Yes                  Yes           No        Yes
Early         No                   No            Yes       Yes
Late          No                   No            No        No
Late          Yes                  No            Yes       Yes
Late          No                   Yes           Yes       Yes

We’ll call an individual row of this table a piece of data. The first four columns are:

- Time of Day (Early or Late)
- Partied Yesterday? (Yes or No)
- Test Today? (Yes or No)
- Hungry? (Yes or No)

We’ll call these columns the features of the data. To simplify our subsequent calculation, we’ll let all our features be two-valued, i.e., boolean. But in general, we can consider other types of values as well. We can represent these features as a 4-tuple, e.g., \((\text{Early}, \text{Yes}, \text{No}, \text{Yes})\) for the first entry of the table.

The final column, “Went to Class?”, is the class of the data. For example, the class of the first entry of the table is “No,” i.e., when it was early, we partied, there was no test, and we were hungry, we recorded that we did not go to class.
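
If you would like to work with the table programmatically, here is one possible way to encode it in Python. The representation (a list of (features, class) pairs) is our own choice, not something the lab requires.

```python
# Each entry is ((time_of_day, partied, test, hungry), went_to_class),
# transcribed directly from the table above.
training_data = [
    (("Early", "Yes", "No",  "Yes"), "No"),
    (("Early", "No",  "No",  "No"),  "No"),
    (("Early", "No",  "No",  "No"),  "Yes"),
    (("Late",  "Yes", "Yes", "Yes"), "Yes"),
    (("Late",  "No",  "No",  "Yes"), "No"),
    (("Early", "Yes", "Yes", "No"),  "Yes"),
    (("Early", "No",  "No",  "Yes"), "Yes"),
    (("Late",  "No",  "No",  "No"),  "No"),
    (("Late",  "Yes", "No",  "Yes"), "Yes"),
    (("Late",  "No",  "Yes", "Yes"), "Yes"),
]
```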

The table above represents our training data. Given this training data, a classification algorithm computes the most likely class for a new piece of data. For example, if we see the data:

Is it more likely we’ll go to class or not in this scenario?

Problem 1: Working Out the Theory

First, let’s see how Bayes’s theorem applies to this situation. We’ll let \(A\) be the result of the classifier, i.e., whether we go to class, and \(B\) be the data fed to the algorithm, i.e., the 4-tuple. If we let \(X\) be the 4-tuple and \(c\) be the class, then according to Bayes’s theorem we have:

\[ \Pr(c \mid X) = \frac{\Pr(X \mid c)}{\Pr(X)} \cdot \Pr(c) \]

Note that \(c\) can be either “Yes” or “No”, so we can calculate the probability of either class for our data \(X\) by instantiating \(c\) appropriately. Ultimately, our classifier computes both \(\Pr(\text{Yes} \mid X)\) and \(\Pr(\text{No} \mid X)\) and chooses the class corresponding to the higher of the two probabilities as the class to assign to \(X\).

In terms of notation, the \(\mathop{\mathrm{argmax}}\) function captures this idea succinctly. If \(C = \{\, \text{"Yes"}, \text{"No"} \,\}\) is our set of classes, then this computation is:

\[ \mathop{\mathrm{argmax}}_{c \in C} \frac{\Pr(X \mid c)}{\Pr(X)} \cdot \Pr(c) \]

In other words, we choose the class \(c\) that maximizes the probability described by Bayes’s equation.
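
In code, this argmax is simply a maximum over the (small) set of classes, keyed by the score that Bayes’s theorem assigns to each one. Here is a minimal Python sketch, assuming a hypothetical helper score(c, X) that returns \(\Pr(X \mid c) \cdot \Pr(c) / \Pr(X)\):

```python
def classify(X, classes, score):
    """Return the class c in `classes` with the highest score(c, X)."""
    return max(classes, key=lambda c: score(c, X))

# Usage, where score is whatever function computes Pr(X | c) * Pr(c) / Pr(X):
#   classify(some_feature_tuple, ["Yes", "No"], score)
```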

Next, we must figure out how to compute each of these probabilities in terms of our training data. But before we do that, let’s try to simplify the formula so that we avoid unnecessary computation. One observation is that computing \(\Pr(X)\) is unnecessary, so we can instead compute:

\[ \mathop{\mathrm{argmax}}_{c \in C} \Pr(X \mid c) \Pr(c) \]

In a sentence or two, describe why we don’t need to compute \(\Pr(X)\) in our \(\mathop{\mathrm{argmax}}\) calculation.

(Hint: think about for each \(c\) what \(\Pr(X)\) represents and whether it depends on the particular \(c\) in question.)

Problem 2: Independence and Computation

The above calculation now demands that we compute two quantities for each class \(c\): the prior probability \(\Pr(c)\) and the conditional probability \(\Pr(X \mid c)\).

We can employ the definition of conditional probability, which says that:

\[ \Pr(X \cap c) = \Pr(X \mid c) \cdot \Pr(c) \]

When employed in this fashion, this identity is called the probability chain rule.

Therefore, we are really computing \(\Pr(X \cap c)\), the probability that we have data \(X\) and class \(c\) at the same time. Recall that \(X\) is really a tuple of sub-events, \(X = (x_1, x_2, x_3, x_4)\). With this in mind, we can then repeatedly apply the chain rule, observing that the intersection of events is commutative:

\[ \Pr(X \mid c) \cdot \Pr(c) = \Pr(c) \cdot \Pr(x_1 \mid c) \cdot \Pr(x_2 \mid c, x_1) \cdot \Pr(x_3 \mid c, x_2, x_1) \cdot \Pr(x_4 \mid c, x_3, x_2, x_1). \]

While this, in theory, allows us to compute \(\Pr(X \mid c)\), it is onerous to do: we would need to know the conditional probability of each feature given the class and all of the other features.

To simplify the computation, let’s assume that the features \(x_1, \ldots, x_4\) are conditionally independent of each other given the class, i.e., once we know the class, the probability of any one feature does not depend on the values of the others. This is the “naive” assumption of the Naive Bayes algorithm, and it leads us to the following computation:

\[ \Pr(X \mid c) = \Pr(x_1 \mid c) \cdot \Pr(x_2 \mid c) \cdot \Pr(x_3 \mid c) \cdot \Pr(x_4 \mid c). \]

\(\Pr(x_1 \mid c)\), for example, is the probability in our training data that it was early (or not) given that we went to class (or not). These probabilities are much easier to compute!
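
As a concrete check on your counting, consider \(\Pr(\text{Early} \mid \text{Yes})\): six of the ten rows in the table have class “Yes,” and three of those six have Time of Day equal to Early, so \(\Pr(\text{Early} \mid \text{Yes}) = 3/6 = 1/2\).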

In summary, we now have the following computation to perform:

\[ \mathop{\mathrm{argmax}}_{c \in C} \Pr(c) \cdot \Pr(x_1 \mid c) \cdot \Pr(x_2 \mid c) \cdot \Pr(x_3 \mid c) \cdot \Pr(x_4 \mid c). \]

And then we choose the class \(c\) that maximizes this probability.
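
After you have done the hand calculations in the problem below, you may want a way to double-check them. The following Python sketch carries out exactly this computation by counting rows of the training data, represented as in the earlier sketch; the function names are our own, and no smoothing is applied.

```python
from collections import Counter

def naive_bayes_classify(X, training_data, classes=("Yes", "No")):
    """Choose the class c maximizing Pr(c) * prod_i Pr(x_i | c),
    estimating every probability by counting rows of training_data."""
    class_counts = Counter(c for _, c in training_data)
    total = len(training_data)

    def score(c):
        prior = class_counts[c] / total            # Pr(c)
        likelihood = 1.0
        for i, x in enumerate(X):                  # Pr(x_i | c) for each feature
            matches = sum(1 for feats, cls in training_data
                          if cls == c and feats[i] == x)
            likelihood *= matches / class_counts[c]
        return prior * likelihood

    return max(classes, key=score)
```

Note that with a table this small some of the counts can be zero, which drives an entire product to zero; that is expected with this bare-bones estimate.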

Go through the training data and hand-calculate the various probabilities you need to carry out this computation. It’ll be useful to organize these probabilities into tables that you can look up easily. Once you have done this, determine what class your resulting classifier assigns to the following data:

Create three other test data points and run your classifier on them as well, showing the computations you performed.