CSC 208-01 (Fall 2023)

Reading: Conditional Probability

Conditional Probability

So far, we’ve considered probability computations in the absence of additional information. But how does knowledge that one event has occurred influence the probability of another event occurring? By modeling this phenomenon with conditional probabilities, we can begin to capture the notion of learning, where the discovery of new information updates our current knowledge. This is the basis of the machine learning techniques that are so prevalent in modern-day computing.

Definition (Conditional Probability): the conditional probability of an event \(A\) given that an event \(B\) has occurred, written \(\Pr(A \mid B)\), is:

\[ \Pr(A \mid B) = \frac{\Pr(A ∩ B)}{\Pr(B)}. \]

We can pronounce \(\Pr(A \mid B)\) as the probability of event \(A\) occurring given that \(B\) has occurred. This is a sort of implication, but for probabilities.

For example, consider the random variable \(X\) representing the sum of rolling two six-sided dice. Here are all the possible outcomes of \(X\), with the first die indexing the rows and the second die indexing the columns:

\[ \begin{array}{c|cccccc} + & 1 & 2 & 3 & 4 & 5 & 6 \\ \hline 1 & 2 & 3 & 4 & 5 & 6 & 7 \\ 2 & 3 & 4 & 5 & 6 & 7 & 8 \\ 3 & 4 & 5 & 6 & 7 & 8 & 9 \\ 4 & 5 & 6 & 7 & 8 & 9 & 10 \\ 5 & 6 & 7 & 8 & 9 & 10 & 11 \\ 6 & 7 & 8 & 9 & 10 & 11 & 12 \end{array} \]

The probability of \(X = 8\) is \(\Pr(8) = \frac{5}{36}\). However, what if we know that the first die is a \(3\)? Then, we only consider the dice rolls where the first die \(x\) is a \(3\):

\[ 3+1, 3+2, 3+3, 3+4, 3+5, 3+6 \]

Of these six possibilities, only one results in a sum of \(8\), so we have that \(\Pr(8 \mid x = 3) = \frac{1}{6}\). Alternatively, we can calculate this directly using the definition of conditional probability:

\[ \Pr(8 ∣ x = 3) = \frac{\Pr(8 ∩ x = 3)}{\Pr(x = 3)} = \frac{ \frac{1}{36} }{ \frac{6}{36} } = \frac{1}{6}. \]
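
To make this concrete, here is a small Python sketch (our own illustration, not part of the reading’s calculation) that checks both numbers by enumerating the 36 equally likely rolls:

    from fractions import Fraction
    from itertools import product

    # All 36 equally likely outcomes of rolling two six-sided dice.
    outcomes = list(product(range(1, 7), repeat=2))

    def pr(event):
        # Probability of an event, given as a predicate over outcomes.
        return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

    sum_is_8 = lambda o: o[0] + o[1] == 8
    first_is_3 = lambda o: o[0] == 3

    print(pr(sum_is_8))                                                    # 5/36
    # Pr(A | B) = Pr(A ∩ B) / Pr(B), with A = "sum is 8", B = "first die is 3".
    print(pr(lambda o: sum_is_8(o) and first_is_3(o)) / pr(first_is_3))    # 1/6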

Independence

In the example above, knowing that the first die is a 3 influences the probability that the sum is 8. We say that the two events, “the first die is a 3” and “the sum of the two dice is 8,” are dependent on each other. However, we have an intuition that some events are independent of each other. For example, consider the two events:

  • \(E_1\): the first die is a 2.
  • \(E_2\): the second die is even.

\(\Pr(E_1) = \frac{6}{36}\) since there are six possibilities for the second die once the first is fixed to a 2. \(\Pr(E_2) = \frac{3 ⋅ 6}{36}\) since there are 3 possibilities for the second die to be even and then 6 possibilities for the first die once the second has been fixed. However, since we believe that \(E_1\) is independent of \(E_2\), we expect that \(\Pr(E_1 \mid E_2) = \Pr(E_1)\), i.e., knowledge of \(E_2\) does not change the probability of \(E_1\).

We formalize the notion of independence in probability theory as follows:

Definition (Independence): we say that two events \(E_1, E_2 ⊆ Ω\) are independent if \(\Pr(E_1 ∩ E_2) = \Pr(E_1) \cdot \Pr(E_2)\).

That is, independence is the condition necessary for us to apply the combinatorial product rule to probabilities. If two events are independent, then we can reason about their probabilities in sequence.

Claim: If two events \(E_1, E_2 ⊆ Ω\) are independent, then \(\Pr(E_1 \mid E_2) = \Pr(E_1)\).

Proof. By the definition of conditional probability and independence:

\[ \Pr(E_1 \mid E_2) = \frac{\Pr(E_1 ∩ E_2)}{\Pr(E_2)} = \frac{\Pr(E_1) \Pr(E_2)}{\Pr(E_2)} = \Pr(E_1). \]

A similar argument shows that \(\Pr(E_2 \mid E_1) = \Pr(E_2)\) as well.
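
As a quick check (again a sketch of our own, by brute-force enumeration), the product rule from the definition of independence holds for \(E_1\) and \(E_2\) but fails for the dependent pair of events from the first example:

    from fractions import Fraction
    from itertools import product

    outcomes = list(product(range(1, 7), repeat=2))

    def pr(event):
        return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

    E1 = lambda o: o[0] == 2           # the first die is a 2
    E2 = lambda o: o[1] % 2 == 0       # the second die is even
    print(pr(lambda o: E1(o) and E2(o)) == pr(E1) * pr(E2))    # True: independent

    A = lambda o: o[0] + o[1] == 8     # the sum is 8
    B = lambda o: o[0] == 3            # the first die is a 3
    print(pr(lambda o: A(o) and B(o)) == pr(A) * pr(B))        # False: dependent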

Bayes’ Theorem

Is the probability \(\Pr(A \mid B)\) related to \(\Pr(B \mid A)\) in any way? We can use the definition of conditional probability to explore this idea:

\[\begin{gather} \Pr(A \mid B) = \frac{\Pr(A ∩ B)}{\Pr(B)} \\ \Pr(B \mid A) = \frac{\Pr(B ∩ A)}{\Pr(A)}. \end{gather}\]

But set intersection is symmetric, so we have that:

\[ \Pr(A ∩ B) = \Pr(A \mid B) \Pr(B) = \Pr(B \mid A) \Pr(A). \]
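
For instance, instantiating this identity with the dice example from earlier, taking \(A\) to be “the sum is 8” and \(B\) to be “the first die is a 3,” both products give the same value:

\[ \Pr(A \mid B) \Pr(B) = \frac{1}{6} ⋅ \frac{6}{36} = \frac{1}{36} = \frac{1}{5} ⋅ \frac{5}{36} = \Pr(B \mid A) \Pr(A). \]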

Now we can remove \(A ∩ B\) entirely from the discussion and reason exclusively about conditional probabilities. This insight leads us to Bayes’ Theorem:

Theorem (Bayes’ Theorem): for any events \(A, B ⊆ Ω\):

\[ \Pr(A \mid B) = \frac{\Pr(B \mid A) \Pr(A)}{\Pr(B)}. \]

Bayes’ Theorem allows us to talk concretely about our updated belief that an event \(A\) occurred given the new knowledge that \(B\) occurred. A classical example of this concerns drug testing. Suppose that we have a drug test with the following characteristics:

  • If a person is a user of the drug, the test reports positive 95% of the time (the true positivity rate), i.e., \(\Pr(\text{pos} \mid \text{user}) = 0.95\).
  • If a person is a non-user, the test reports negative 90% of the time (the true negativity rate), i.e., \(\Pr(\text{neg} \mid \text{non-user}) = 0.90\).

Furthermore, suppose that 1% of people use this drug. What is \(\Pr(\text{user} \mid \text{pos})\), the probability that a person is a user of the drug given that they tested positive? By Bayes’ Theorem, this quantity is given by:

\[ \Pr(\text{user} \mid \text{pos}) = \frac{\Pr(\text{pos} \mid \text{user}) \Pr(\text{user})}{\Pr(\text{pos})}. \]

What are these various probabilities on the right-hand side of the equation?

  • \(\Pr(\text{pos} \mid \text{user}) = 0.95\), the true positivity rate of the test.
  • \(\Pr(\text{user}) = 0.01\), our assumption that 1% of people use the drug.
  • \(\Pr(\text{pos})\), the overall probability that a randomly chosen person tests positive.

We don’t have immediate access to this last value. However, we can reconstruct it using the probabilities that we have! We observe that the following equality holds:

\[ \Pr(\text{pos}) = \Pr(\text{pos} ∩ \text{user}) + \Pr(\text{pos} ∩ \text{non-user}) \]

This holds because every person is either a user or a non-user of the drug. We can then use the definition of conditional probability to rewrite the equation in terms of the conditional probabilities that we know:

\[\begin{align} &\; \Pr(\text{pos} ∩ \text{user}) + \Pr(\text{pos} ∩ \text{non-user}) \\ =&\; \Pr(\text{pos} \mid \text{user}) \Pr(\text{user}) + \Pr(\text{pos} \mid \text{non-user}) \Pr(\text{non-user}) \end{align}\]

The probability \(\Pr(\text{pos} \mid \text{non-user})\) is the false positivity rate of the test, which is one minus the true negativity rate: \(1 - 0.90 = 0.10\).
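
Plugging these numbers into the expression above gives us the missing denominator:

\[ \Pr(\text{pos}) = 0.95 ⋅ 0.01 + 0.10 ⋅ 0.99 = 0.0095 + 0.099 = 0.1085. \]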

Putting all of this together, we obtain:

\[\begin{align} \Pr(\text{user} \mid \text{pos}) =&\; \frac{\Pr(\text{pos} \mid \text{user}) \Pr(\text{user})}{\Pr(\text{pos} \mid \text{user}) \Pr(\text{user}) + \Pr(\text{pos} \mid \text{non-user}) \Pr(\text{non-user})} \\ =&\; \frac{0.95 ⋅ 0.01}{0.95 ⋅ 0.01 + 0.10 ⋅ 0.99} \\ ≈&\; 0.0876 \end{align}\]

In other words, the probability that a person is a drug user given that they tested positive is only about 8.8%! Since many more people are non-users than users, it is more important that our test functions correctly in the negative cases than in the positive cases. To see this, observe that a test that always reports “yes” would result in many more false claims than a test that always reports “no.” This is because, by default, there are many more “no” cases than there are “yes” cases.
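
The whole calculation is easy to mechanize. Here is a small Python sketch (the function name and parameter names are our own, not from this reading) that packages the computation and can be reused for the exercise below:

    def posterior_user_given_pos(true_pos_rate, true_neg_rate, prevalence):
        # Pr(user | pos) via Bayes' Theorem, expanding Pr(pos) with the
        # decomposition used above:
        #   Pr(pos) = Pr(pos | user) Pr(user) + Pr(pos | non-user) Pr(non-user).
        pr_pos_given_user = true_pos_rate
        pr_pos_given_nonuser = 1 - true_neg_rate
        pr_pos = (pr_pos_given_user * prevalence
                  + pr_pos_given_nonuser * (1 - prevalence))
        return pr_pos_given_user * prevalence / pr_pos

    print(posterior_user_given_pos(0.95, 0.90, 0.01))   # ≈ 0.0876

Running it with the numbers from this section prints approximately 0.0876, matching the calculation above.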

Exercise (False Negativity, ‡): redo the drug test calculation two more times:

  • In the first, raise the true positivity rate to 100%.
  • In the second, raise the true negativity rate to 95%.

Which calculation produces the better probability for \(\Pr(\text{user} \mid \text{pos})\)?