Recall, for linear regression the response variable \(Y\) is quantitative and generally thought of as continuous

How do we handle response variables that are qualitative (i.e., categorical)?

- For example, eye color

Predicting a qualitative response for an observation can be referred to as classifying that observation, since it involves assigning the observation to a category, or class

Often the methods used for classification first predict the probability that the observation belongs to each of the categories of a qualitative variable, as the basis for making the classification

Widely used classifiers:

- Logistic Regression, Linear Discriminant Analysis, Quadratic Discriminant Analysis, Naive Bayes, and K-Nearest Neighbors

Qualitative variables take values in an unordered set \(C\)

eye color \(\in\) {brown, blue, green}

color \(\in\) {blue, red, green, yellow}

Given a feature vector \(X\) and a qualitative response \(Y\) taking values in the set \(C\), the classification task is to build a function \(C(X)\) that takes as input the feature vector \(X\) and predicts its value for \(Y\)

Often we are more interested in estimating the *probabilities* that \(X\) belongs to each category in \(C\)

- For example, it is more valuable to have an estimate of the probability that an insurance claim is fraudulent than a classification of fraudulent or not
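A minimal sketch of probability-based classification (the class labels and probability values here are hypothetical, not from the text): estimate a probability for each class, then assign the most probable one. The probability itself often carries more information than the final label.

```python
# Hypothetical estimated probabilities for an insurance claim
probs = {"fraudulent": 0.30, "legitimate": 0.70}

# Classify by picking the class with the highest estimated probability
predicted_class = max(probs, key=probs.get)
print(predicted_class)  # legitimate
```

Note that a 0.30 probability of fraud might still warrant review even though the predicted label is "legitimate", which is why the probabilities are often the more valuable output.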

We are interested in predicting whether an individual will default on his or her credit card payment, on the basis of annual income and monthly credit card balance

In the left-hand panel, we have plotted annual income and monthly credit card balance for a subset of 10,000 individuals

- The individuals who defaulted in a given month are shown in orange, and those who did not in blue

In the center and right-hand panels, two pairs of boxplots are shown. The first shows the distribution of balance split by the binary default variable; the second is a similar plot for income

Suppose for the example above that we code the response as follows:

\(Y = \begin{cases} 0 & \text{if No} \\ 1 & \text{if Yes} \end{cases}\)
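This 0/1 dummy coding can be written directly (a minimal sketch; the sample responses are made up for illustration):

```python
# Dummy-code the binary default response: "No" -> 0, "Yes" -> 1
defaults = ["No", "Yes", "No", "No", "Yes"]  # hypothetical observations
y = [1 if d == "Yes" else 0 for d in defaults]
print(y)  # [0, 1, 0, 0, 1]
```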

Can we simply perform linear regression of \(Y\) on \(X\) and classify as "Yes" if \(\hat{Y} > 0.5\)?

In this case of a binary outcome, linear regression does a decent job as a classifier and is equivalent to *Linear Discriminant Analysis* (discussed below)

Since in the population \(E(Y|X = x) = Pr(Y = 1 | X = x)\), we might think that regression is perfect for this task

However, linear regression might produce probabilities that are less than zero or larger than 1

- Logistic regression is more appropriate
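The point above can be demonstrated with a small sketch (synthetic data, made up for illustration): fit least squares to a 0/1 response and check whether the fitted values escape the \([0, 1]\) range.

```python
import numpy as np

# Synthetic binary-response data: low x -> 0, high x -> 1
x = np.array([0.0, 1.0, 2.0, 3.0, 8.0, 9.0, 10.0])
y = np.array([0, 0, 0, 0, 1, 1, 1])

# Least-squares fit of y on x (with an intercept column)
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta

# The linear fit produces "probabilities" below 0 and above 1
print(fitted.min() < 0, fitted.max() > 1)  # True True
```

Logistic regression avoids this by modeling \(Pr(Y = 1 | X)\) with a function whose output is constrained to lie between 0 and 1.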