# Machine Learning: Likelihood Ratio Classification

In this section, we will continue our study of statistical learning theory by introducing some vocabulary and results specific to binary classification. Borrowing from the language of disease diagnosis, we will call the two classes *positive* and *negative* (which, in the medical context, indicate presence or absence of the disease in question). Correctly classifying a positive sample is called **detection**, and incorrectly classifying a negative sample as positive is called a **false alarm** or **type I error**.

Suppose that $\mathcal{X}$ is the feature set of our classification problem and that $\mathcal{Y} = \{+1, -1\}$ is the set of classes. Denote by $(X, Y)$ a random observation drawn from the probability measure on $\mathcal{X} \times \mathcal{Y}$. We define $p_{+1}$ to be the probability that a sample is positive and $p_{-1} = 1 - p_{+1}$ to be the probability that a sample is negative. Let $f_{+1}$ be the conditional PMF or PDF of $X$ given the event $\{Y = +1\}$, and let $f_{-1}$ be the conditional PMF or PDF of $X$ given $\{Y = -1\}$. We call $f_{+1}$ and $f_{-1}$ *class conditional distributions*.

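Given a function $h: \mathcal{X} \to \mathcal{Y}$ (which we call a **classifier**), we define its **confusion matrix** to be

$$
\begin{bmatrix}
P(h(X) = +1 \mid Y = +1) & P(h(X) = +1 \mid Y = -1) \\
P(h(X) = -1 \mid Y = +1) & P(h(X) = -1 \mid Y = -1)
\end{bmatrix}.
$$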

We call the top-left entry of the confusion matrix the **detection rate** (or *true positive rate*, or *recall* or *sensitivity*) and the top-right entry the **false alarm rate** (or *false positive rate*).
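As a rough illustration, the entries of the confusion matrix can be estimated from labeled data by counting. The sketch below is a minimal version under stated assumptions: the function name `confusion_matrix`, the vectors `ys` (true classes) and `preds` (predicted classes), and the toy data are all hypothetical and not part of the text above.

```julia
# Estimate the confusion matrix from true classes `ys` and predictions `preds`,
# both vectors with entries +1 or -1 (hypothetical names and toy data).
function confusion_matrix(ys, preds)
    # Fraction of samples with true class `truth` that were predicted as `pred`
    rate(pred, truth) = count((preds .== pred) .& (ys .== truth)) / count(ys .== truth)
    # Rows index the prediction, columns index the true class
    [rate(1, 1) rate(1, -1);
     rate(-1, 1) rate(-1, -1)]
end

ys    = [1, 1, 1, 1, -1, -1, -1, -1, -1, -1]
preds = [1, 1, 1, -1, 1, -1, -1, -1, -1, -1]

C = confusion_matrix(ys, preds)
detection_rate = C[1, 1]     # 3 of the 4 positive samples are detected
false_alarm_rate = C[1, 2]   # 1 of the 6 negative samples is flagged
```

Each column of the estimated matrix sums to 1, since every sample of a given class is predicted to be either positive or negative.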

**Example**

The **precision** of a classifier $h$ is the conditional probability of $\{Y = +1\}$ given $\{h(X) = +1\}$. Show that a classifier can have high detection rate, low false alarm rate, and low precision.

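*Solution.* Suppose that $p_{-1}$ is close to 1 and that $h$ has detection rate 0.99 and false alarm rate 0.01. Then, by Bayes' rule, the precision of $h$ is

$$
P(Y = +1 \mid h(X) = +1) = \frac{0.99\, p_{+1}}{0.99\, p_{+1} + 0.01\, p_{-1}},
$$

which is only about 9% when, for instance, $p_{-1} = 0.999$.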

We see that, unlike detection rate and false alarm rate, precision depends on the value of $p_{-1}$. If $p_{-1}$ is very high, it can result in low precision even if the classifier has high accuracy within each class.
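The dependence of precision on the class probabilities can be checked numerically. The following is a small sketch of the Bayes' rule calculation; the function name `classifier_precision` and the particular values of the negative-class probability are illustrative assumptions.

```julia
# Precision P(Y = +1 | h(X) = +1) computed with Bayes' rule.
# p_neg: probability of the negative class; dr, far: detection and false alarm rates.
function classifier_precision(p_neg, dr, far)
    p_pos = 1 - p_neg
    p_pos * dr / (p_pos * dr + p_neg * far)
end

# Hold the within-class rates fixed and vary the class balance (illustrative values)
for p_neg in (0.5, 0.9, 0.99, 0.999)
    println("p_neg = ", p_neg, "  precision = ",
            round(classifier_precision(p_neg, 0.99, 0.01), digits = 3))
end
```

With balanced classes the precision equals the detection rate of 0.99 here, and it falls steadily as negative samples come to dominate.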

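The **Bayes classifier**

$$
h(x) = \begin{cases} +1 & \text{if } p_{+1} f_{+1}(x) \geq p_{-1} f_{-1}(x) \\ -1 & \text{otherwise} \end{cases}
$$

minimizes the probability of misclassification. In other words, it is the classifier $h$ for which

$$
P(h(X) \neq Y)
$$

is as small as possible. However, the two types of misclassification often have different real-world consequences, and we might therefore wish to weight them differently. Given $t \geq 0$, we define the **likelihood ratio classifier**

$$
h_t(x) = \begin{cases} +1 & \text{if } \dfrac{f_{+1}(x)}{f_{-1}(x)} \geq t \\ -1 & \text{otherwise.} \end{cases}
$$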

**Example**

Show that the likelihood ratio classifier is a generalization of the Bayes classifier.

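*Solution.* If we let $t = p_{-1}/p_{+1}$, then the inequality $\frac{f_{+1}(x)}{f_{-1}(x)} \geq t$ simplifies to $p_{+1} f_{+1}(x) \geq p_{-1} f_{-1}(x)$. Therefore, the Bayes classifier is equal to $h_{p_{-1}/p_{+1}}$.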
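The relationship between the two classifiers can be sketched in code. This is only an illustration: the function names `likelihood_ratio_classifier` and `bayes_classifier` are hypothetical, and the two unit-variance normal class conditional distributions are an assumed example. The comparison is written as $f_{+1}(x) \geq t f_{-1}(x)$, which matches the ratio test wherever $f_{-1}(x) > 0$ and avoids a division by zero elsewhere.

```julia
using Distributions

# Likelihood ratio classifier h_t for class conditional distributions f_pos and f_neg.
# Returns a function of x; the ratio test is rearranged to avoid dividing by f_neg(x).
likelihood_ratio_classifier(f_pos, f_neg, t) =
    x -> pdf(f_pos, x) ≥ t * pdf(f_neg, x) ? 1 : -1

# The Bayes classifier is the special case t = p_neg / p_pos.
bayes_classifier(f_pos, f_neg, p_pos) =
    likelihood_ratio_classifier(f_pos, f_neg, (1 - p_pos) / p_pos)

# Assumed example: negative class N(0, 1), positive class N(2, 1), balanced classes
h = bayes_classifier(Normal(2, 1), Normal(0, 1), 0.5)
println((h(1.5), h(0.5)))   # prints (1, -1): the decision boundary sits at x = 1
```

Lowering `p_pos` raises the threshold $t$, so the classifier needs stronger evidence before predicting the positive class.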

## Receiver Operating Characteristic

If we increase $t$, then some of the predictions of $h_t$ switch from $+1$ to $-1$, while others stay the same. Therefore, the detection rate and false alarm rate both decrease as $t$ increases. Likewise, if we decrease $t$, then detection rate and false alarm rate both increase. If we let $t$ range over the interval $[0, \infty]$ and plot each ordered pair $(\mathrm{FAR}(h_t), \mathrm{DR}(h_t))$, then we obtain a curve like the one shown in the figure below. This curve is called the **receiver operating characteristic** of the likelihood ratio classifier.

The ideal scenario is that this curve passes through points near the top left corner of the square, since that means that some of the classifiers in the family have both high detection rate and low false alarm rate. We quantify this idea using the **area under the ROC** (called the AUROC). This value is close to 1 for an excellent classifier and close to $\frac{1}{2}$ for a classifier whose ROC is the diagonal line from the origin to $(1, 1)$.

**Example**

Suppose that $\mathcal{X} = \mathbb{R}$ and that the class conditional densities for $-1$ and $+1$ are normal distributions with unit variances and means $0$ and $\mu$, respectively. For each $\mu \in \{\frac{1}{4}, 1, 4\}$, predict the approximate shape of the ROC for the likelihood ratio classifier. Then calculate it explicitly and plot it.

*Solution.* We predict that the ROC will be nearly diagonal for $\mu = \frac{1}{4}$, since the class conditional distributions overlap heavily, and therefore any increase in detection rate will induce an approximately equal increase in false alarm rate. When $\mu = 4$, we expect to get a very large AUROC, since in that case the distributions overlap very little. The $\mu = 1$ curve will lie between these extremes. To plot these curves, we begin by calculating the likelihood ratio

$$
\frac{f_{+1}(x)}{f_{-1}(x)} = \frac{e^{-(x - \mu)^2/2}}{e^{-x^2/2}} = e^{\mu x - \mu^2/2}.
$$

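So the detection rate for $h_t$ is the probability that an observation drawn from $\mathcal{N}(\mu, 1)$ lies in the region where $e^{\mu x - \mu^2/2} \geq t$. Solving this inequality for $x$, we find that the detection rate is equal to the probability mass assigned to the interval $\left[\frac{\log t}{\mu} + \frac{\mu}{2}, \infty\right)$ by the distribution $\mathcal{N}(\mu, 1)$.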

Likewise, the false alarm rate is the probability mass assigned to the same interval by the negative class conditional distribution, $\mathcal{N}(0, 1)$.
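For reference, these two probability masses can be written out as a short worked derivation. The symbol $\Phi$ for the standard normal CDF is notation introduced here, and the derivation assumes $\mu > 0$ so that dividing by $\mu$ preserves the direction of the inequality.

$$
\begin{aligned}
e^{\mu x - \mu^2/2} \geq t \quad &\Longleftrightarrow \quad x \geq \frac{\log t}{\mu} + \frac{\mu}{2}, \\
\mathrm{DR}(h_t) &= P\left(X \geq \frac{\log t}{\mu} + \frac{\mu}{2} \,\middle|\, Y = +1\right) = 1 - \Phi\left(\frac{\log t}{\mu} - \frac{\mu}{2}\right), \\
\mathrm{FAR}(h_t) &= P\left(X \geq \frac{\log t}{\mu} + \frac{\mu}{2} \,\middle|\, Y = -1\right) = 1 - \Phi\left(\frac{\log t}{\mu} + \frac{\mu}{2}\right).
\end{aligned}
$$

The detection rate uses the fact that $X - \mu$ is standard normal given $Y = +1$, which shifts the argument of $\Phi$ by $\mu$.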

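```julia
using Plots, Distributions

# False alarm rate and detection rate of the likelihood ratio classifier with threshold t
FAR(μ, t) = 1 - cdf(Normal(0, 1), log(t) / μ + μ / 2)
DR(μ, t) = 1 - cdf(Normal(μ, 1), log(t) / μ + μ / 2)

# Points (FAR, DR) on the ROC as t sweeps a logarithmically spaced grid
ROC(μ) = [(FAR(μ, t), DR(μ, t)) for t in exp.(-20:0.1:20)]

plot(ROC(1/4), label = "1/4")
plot!(ROC(1), label = "1")
plot!(ROC(4), label = "4")
plot!(xlabel = "false alarm rate", ylabel = "detection rate")
```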
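To put numbers on the predictions about AUROC, one option is to integrate each ROC numerically. The sketch below applies the trapezoid rule to the same `FAR` and `DR` formulas as the plotting code; the function name `auroc`, the finer threshold grid, and the explicit endpoints $(0, 0)$ and $(1, 1)$ are assumptions made for this illustration.

```julia
using Distributions

# Same false alarm rate and detection rate formulas as in the plotting code
FAR(μ, t) = 1 - cdf(Normal(0, 1), log(t) / μ + μ / 2)
DR(μ, t) = 1 - cdf(Normal(μ, 1), log(t) / μ + μ / 2)

# Trapezoid-rule estimate of the area under the ROC for a given μ
function auroc(μ; ts = exp.(-20:0.01:20))
    # Reverse so that the false alarm rate increases along the curve, and
    # add the limiting points (0, 0) for t = ∞ and (1, 1) for t = 0
    xs = [0.0; reverse([FAR(μ, t) for t in ts]); 1.0]
    ys = [0.0; reverse([DR(μ, t) for t in ts]); 1.0]
    sum((xs[i+1] - xs[i]) * (ys[i+1] + ys[i]) / 2 for i in 1:length(xs)-1)
end

for μ in (1/4, 1, 4)
    println("μ = ", μ, "  AUROC ≈ ", round(auroc(μ), digits = 3))
end
```

For this pair of unit-variance normal distributions, the area also has the closed form $\Phi(\mu/\sqrt{2})$, where $\Phi$ is the standard normal CDF, which gives an independent check on the numerical estimate.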