Recall, for linear regression the response variable \(Y\) is quantitative and generally thought of as continuous
How do we handle response variables that are qualitative (i.e., categorical)?
Predicting a qualitative response for an observation can be referred to as classifying that observation, since it involves assigning the observation to a category, or class
Often the methods used for classification first predict the probability that the observation belongs to each of the categories of a qualitative variable, as the basis for making the classification
Widely used classifiers include logistic regression, linear and quadratic discriminant analysis, and naive Bayes, all discussed below
Qualitative variables take values in an unordered set \(C\)
eye color \(\in\) {brown, blue, green}
color \(\in\) {blue, red, green, yellow}
Given a feature vector \(X\) and a qualitative response \(Y\) taking values in the set \(C\), the classification task is to build a function \(C(X)\) that takes as input the feature vector \(X\) and predicts the value of \(Y\)
Often we are more interested in estimating the probabilities that \(X\) belongs to each category in \(C\)
We are interested in predicting whether an individual will default on his or her credit card payment, on the basis of annual income and monthly credit card balance
In the left-hand panel, we have plotted annual income and monthly credit card balance for a subset of 10,000 individuals
In the center and right-hand panels, two pairs of boxplots are shown. The first shows the distribution of balance split by the binary default variable; the second is a similar plot for income
Suppose for the example above that we code the response as follows:
\(Y = \left\{\begin{matrix} 0 & \text{if No} \\ 1 & \text{if Yes} \end{matrix}\right.\)
Can we simply perform linear regression of \(Y\) on \(X\) and classify as Yes if \(\hat{Y} > 0.5\)?
In this case of a binary outcome, linear regression does a decent job as a classifier and is equivalent to Linear Discriminant Analysis (discussed below)
Since in the population \(E(Y|X = x)= Pr(Y = 1 | X = x)\), we might think that regression is perfect for this task
However, linear regression might produce probabilities that are less than zero or larger than 1
The orange marks indicate the response \(Y\), either 0 or 1
Linear regression does not estimate \(Pr(Y = 1|X)\) very well
Logistic regression is well suited to the task
Now suppose we have a response variable with three possible values. A patient presents at the emergency room, and we must classify them according to their symptoms
\(Y = \left\{\begin{matrix} 1 & \text{if stroke} \\ 2 & \text{if drug overdose} \\ 3 & \text{if epileptic seizure} \end{matrix}\right.\)
This coding suggests an ordering, and in fact implies that the difference between stroke and drug overdose is the same as between drug overdose and epileptic seizure
Linear regression is not appropriate here
Multinomial (multiclass) Logistic Regression or Discriminant Analysis are more appropriate
Let’s write \(p(X) = Pr(Y = 1 | X)\) for short and consider using balance to predict default (from our example above)
Logistic regression uses the form
\(p(X) = \frac{e^{\beta_0+\beta_1X}}{1 + e^{\beta_0+\beta_1X}}\)
\(e \approx 2.71828\) is a mathematical constant (Euler's number)
It is easy to see that no matter what values \(\beta_0\), \(\beta_1\), or \(X\) take, \(p(X)\) will have values between 0 and 1
A bit of rearrangement gives:
\(\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1X\)
This transformation is called the log odds or logit transformation of \(p(X)\)
Logistic regression ensures that our estimate for \(p(X)\) lies between 0 and 1
In logistic regression, we utilize maximum likelihood to estimate the parameters
\(\ell(\beta_0, \beta_1) = \prod_{i:y_i=1}p(x_i)\prod_{i:y_i=0}(1-p(x_i))\)
This likelihood gives the probability of the observed zeros and ones in the data
We pick \(\beta_0\) and \(\beta_1\) to maximize the likelihood of the observed data
Most statistical packages can fit linear logistic regression models by maximum likelihood. In R we use the glm function
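As a minimal sketch (assuming the ISLR2 package, whose Default data set contains the default, student, balance, and income variables used in this example), a simple logistic regression of default on balance can be fit as follows:

```r
library(ISLR2)

# Fit a simple logistic regression of default status on credit card balance
DefLogit <- glm(default ~ balance, data = Default, family = binomial)
summary(DefLogit)

# Estimated probabilities p(X) for each observation lie between 0 and 1
head(predict(DefLogit, type = "response"))
```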
With several predictors, multiple logistic regression takes the form: \(\log\left(\frac{p(X)}{1-p(X)}\right)=\beta_0+\beta_1X_1+\cdots+\beta_pX_p\)
Why is the coefficient for student negative, while it was positive before?
Students tend to have higher balances than non-students, so their marginal default rate is higher than for non-students
But for each level of balance, students default less than non-students
Multiple logistic regression can tease this out
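A sketch of how multiple logistic regression teases this out, again assuming the ISLR2 Default data: the sign of the student coefficient flips once balance is included as a predictor.

```r
library(ISLR2)

# Student alone: positive coefficient (students default more overall)
coef(glm(default ~ student, data = Default, family = binomial))

# Student together with balance (and income): negative coefficient,
# i.e., at a given balance, students default less than non-students
coef(glm(default ~ balance + income + student, data = Default, family = binomial))
```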
Logistic regression with more than two classes (categories)
So far we have discussed logistic regression with two classes. It is easily generalized to more than two classes. One version (used in the R package glmnet) has the symmetric form
\(Pr(Y = k|X)=\frac{e^{\beta_{0k}+\beta_{1k}X_1+\cdots+\beta_{pk}X_p}}{\sum_{\ell = 1}^{K}e^{\beta_{0\ell}+\beta_{1\ell}X_1+\cdots+\beta_{p\ell}X_p}}\)
Where there is a linear function for each class
Multinomial logistic regression is also referred to as multiclass logistic regression
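A minimal sketch of this symmetric multiclass form using the glmnet package mentioned above (nnet::multinom is a common alternative), illustrated with the built-in iris data:

```r
library(glmnet)

# glmnet expects a numeric predictor matrix and a factor response
x <- as.matrix(iris[, 1:4])
y <- iris$Species

# family = "multinomial" fits one linear function per class (K = 3 here)
fit <- glmnet(x, y, family = "multinomial")

# Coefficients for each class at a small value of the penalty parameter
coef(fit, s = 0.01)
```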
Here the approach is to model the distribution of \(X\) in each of the classes separately, and then use Bayes theorem to flip things around and obtain \(Pr(Y|X)\)
When we use normal (Gaussian) distributions for each class, this leads to linear or quadratic discriminant analysis
However, this approach is quite general, and other distributions can be used as well. We will focus on normal distributions
Thomas Bayes was a famous mathematician whose name represents a big subfield of statistical and probabilistic modeling. Here we focus on a simple result, known as Bayes theorem
For discriminant analysis we would write the above slightly differently
\(Pr(Y = k|X = x) = \frac{\pi_kf_k(x)}{\sum_{l=1}^{K} \pi_lf_l(x)}\)
Where \(f_k(x)=Pr(X = x|Y = k)\) is the density for \(X\) in class \(k\)
And \(\pi_k = Pr(Y = k)\) is the marginal or prior probability for class \(k\)
We classify a new point according to which density is highest
When the priors are different, we take them into account as well, and compare \(\pi_kf_k(x)\)
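A small worked sketch of this rule for two Gaussian classes (the means, variance, and priors below are illustrative values): the posterior probabilities are just \(\pi_kf_k(x)\) rescaled to sum to one.

```r
# Illustrative two-class setup: class means, common standard deviation, priors
mu    <- c(-1.5, 1.5)
sigma <- 1
prior <- c(0.5, 0.5)

x <- 0.8  # a new observation to classify

# pi_k * f_k(x) for each class, using the Gaussian density
unnormalized <- prior * dnorm(x, mean = mu, sd = sigma)

# Bayes theorem: divide by the sum over classes to get Pr(Y = k | X = x)
posterior <- unnormalized / sum(unnormalized)
posterior

# Classify to the class with the highest posterior
which.max(posterior)
```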
When the classes are well-separated, the parameter estimates for the logistic regression model are surprisingly unstable. Linear discriminant analysis does not suffer from this problem
If \(n\) is small and the distribution of the predictors \(X\) is approximately normal in each of the classes, the linear discriminant model is again more stable than the logistic regression model
Linear discriminant analysis is popular when we have more than two response classes, because it also provides low-dimensional views of the data
Linear Discriminant Analysis when \(p = 1\)
The Gaussian density has the form: \(f_k(x)=\frac{1}{\sqrt{2\pi}\,\sigma_k}e^{-\frac{1}{2}\left(\frac{x-\mu_k}{\sigma_k}\right)^2}\)
Where \(\mu_k\) is the mean and \(\sigma_k^2\) the variance in class \(k\)
Plugging this into Bayes formula, and assuming a common variance \(\sigma_1^2 = \cdots = \sigma_K^2 = \sigma^2\) across classes, we get a rather complex expression for \(p_k(x)=Pr(Y=k|X=x)\)
\(p_k(x)=\frac{\pi_k \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2\sigma^2}(x-\mu_k)^2}}{\sum_{l=1}^{K}\pi_l \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2\sigma^2}(x-\mu_l)^2}}\)
Lucky for us, there are simplifications and cancellations
To classify at the value \(X = x\), we need to see which of the \(p_k(x)\) is largest. Taking logs and discarding terms that do not depend on \(k\), we see that this is equivalent to assigning \(x\) to the class with the largest discriminant score
\(\delta_k(x)=x \cdot \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2}+\log(\pi_k)\)
Note that \(\delta_k(x)\) is a linear function of \(x\)
If there are \(K = 2\) classes and \(\pi_1 = \pi_2 = 0.5\), then one can see that the decision boundary is at \(x = \frac{\mu_1+\mu_2}{2}\)
The above is an example with \(\mu_1 = -1.5\), \(\mu_2=1.5\), \(\pi_1 = \pi_2 = 0.5\), and \(\sigma^2 = 1\)
We estimate the parameters from the training data: \(\hat{\pi}_k = \frac{n_k}{n}\), \(\hat{\mu}_k = \frac{1}{n_k}\sum_{i:y_i=k}x_i\), and the pooled variance \(\hat{\sigma}^2 = \frac{1}{n-K}\sum_{k=1}^{K}\sum_{i:y_i=k}(x_i - \hat{\mu}_k)^2 = \sum_{k=1}^{K}\frac{n_k-1}{n-K} \cdot \hat{\sigma}_k^2\), where \(n_k\) is the number of training observations in class \(k\) and \(\hat{\sigma}_k^2\) is the sample variance within class \(k\)
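A sketch of these estimates in R, using data simulated to resemble the example above (class means \(-1.5\) and \(1.5\), common variance 1, equal priors); the variable names are illustrative only.

```r
set.seed(1)

# Simulated training data: two classes with means -1.5 and 1.5, common sd 1
y <- rep(1:2, each = 50)
x <- rnorm(100, mean = c(-1.5, 1.5)[y], sd = 1)

n  <- length(x)
K  <- 2
nk <- table(y)

pi_hat <- nk / n              # prior estimates n_k / n
mu_hat <- tapply(x, y, mean)  # class mean estimates

# Pooled variance estimate: weighted average of within-class variances
sigma2_hat <- sum((nk - 1) * tapply(x, y, var)) / (n - K)

# Discriminant scores delta_k(x0) for a new point x0
x0    <- 0.3
delta <- x0 * mu_hat / sigma2_hat - mu_hat^2 / (2 * sigma2_hat) + log(pi_hat)
which.max(delta)  # predicted class
```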
Linear Discriminant Analysis when \(p > 1\)
Density: \(f(x) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}e^{-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)}\), where \(\mu\) is the mean vector and \(\Sigma\) the \(p \times p\) covariance matrix
Discriminant function: \(\delta_k(x) = x^{T}\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^T \Sigma^{-1} \mu_k + \log\pi_k\)
Despite its complex form, \(\delta_k(x)\) is again a linear function of \(x\): \(\delta_k(x) = c_{k0} + c_{k1}x_1 + \cdots + c_{kp}x_p\)
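A small sketch of computing this discriminant score in R for illustrative (assumed) parameter values, using solve() for \(\Sigma^{-1}\):

```r
# Illustrative parameters for one class k (p = 2 predictors)
mu_k  <- c(1, 2)                          # class mean vector
Sigma <- matrix(c(1, 0.3, 0.3, 1), 2, 2)  # common covariance matrix
pi_k  <- 1/3                              # prior probability

# delta_k(x) = x' Sigma^{-1} mu_k - 0.5 mu_k' Sigma^{-1} mu_k + log(pi_k)
delta_k <- function(x, mu_k, Sigma, pi_k) {
  Sinv <- solve(Sigma)
  drop(t(x) %*% Sinv %*% mu_k - 0.5 * t(mu_k) %*% Sinv %*% mu_k + log(pi_k))
}

delta_k(c(0.5, 1.5), mu_k, Sigma, pi_k)
```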
An illustration for \(p = 2\) and \(K = 3\) classes
Here \(\pi_1 = \pi_2 = \pi_3 = \frac{1}{3}\)
The dashed lines are known as the Bayes decision boundaries. Were they known, they would yield the fewest misclassification errors, among all possible classifiers
An example with Fisher’s Iris Data
There are 4 variables, 3 species, and 50 samples per class
LDA classifies all but 3 of the 150 training samples correctly
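A sketch reproducing this with the lda function in the MASS package (an assumption here; MASS is the standard R implementation, not something specified in these notes):

```r
library(MASS)

# Fit LDA to Fisher's iris data: 4 predictors, 3 species, 50 samples per class
iris_lda <- lda(Species ~ ., data = iris)

# Confusion matrix on the training data; only 3 of 150 are misclassified
table(Predicted = predict(iris_lda)$class, True = iris$Species)
```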
When there are \(K\) classes, linear discriminant analysis can be viewed exactly in a \(K - 1\) dimensional plot
Why?
Because it essentially classifies to the closest centroid, and they span a \(K − 1\) dimensional plane
Even when \(K > 3\), we can find the “best” 2-dimensional plane for visualizing the discriminant rule
Once we have estimates \(\hat{\delta}_k(x)\), we can turn these into estimates for class probabilities: \(\hat{Pr}(Y = k|X = x) = \frac{e^{\hat{\delta}_k(x)}}{\sum_{l=1}^{K}e^{\hat{\delta}_l(x)}}\)
So classifying to the largest \(\hat{\delta}_k(x)\) amounts to classifying to the class for which \(\hat{Pr}(Y = k|X = x)\) is largest
When \(K = 2\) we classify to class 2 if \(Pr(Y = 2 | X = x) \geq 0.5\), else to class 1
(23 + 252)/10000 errors, a 2.75% misclassification rate
This is training error and we may be overfitting
If we classified to the prior — always to class No in this case — we would make 333/10000 errors, or only 3.33%
Of the true No’s, we make 23/9667 = 0.2% errors; of the true Yes’s, we make 252/333 = 75.7% errors
Types of Errors
False positive rate: the fraction of negative examples (true No's) that are classified as positive (Yes)
False negative rate: the fraction of positive examples (true Yes's) that are classified as negative (No)
We produced the table above by classifying to class Yes if \(\hat{Pr}(Default = Yes|Balance, student)\geq 0.5\)
We can change the two error rates by changing the threshold from 0.5 to some other value in [0, 1]
\(\hat{Pr}(Default = Yes | Balance, Student) \geq \text{threshold}\)
In order to reduce the false negative rate, we may want to reduce the threshold to 0.1 or less
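A sketch of varying the threshold on the Default data, assuming MASS::lda for the model behind the error rates above (LDA on balance and student):

```r
library(ISLR2)
library(MASS)

# LDA fit and posterior probabilities of default = "Yes"
def_lda  <- lda(default ~ balance + student, data = Default)
post_yes <- predict(def_lda)$posterior[, "Yes"]

# Default threshold of 0.5
table(Predicted = ifelse(post_yes >= 0.5, "Yes", "No"), True = Default$default)

# Lowering the threshold to 0.2 reduces the false negative rate
# (at the cost of more false positives)
table(Predicted = ifelse(post_yes >= 0.2, "Yes", "No"), True = Default$default)
```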
ROC Curve Plot
Sometimes we use the AUC or area under the curve to summarize the overall performance
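A sketch of an ROC curve and AUC for the posterior probabilities post_yes from the previous sketch, assuming the pROC package is available (other packages such as ROCR work as well):

```r
library(pROC)

# ROC curve: true class labels vs. predicted probabilities of "Yes"
roc_obj <- roc(response = Default$default, predictor = post_yes)
plot(roc_obj)

# Area under the curve summarizes performance across all thresholds
auc(roc_obj)
```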
Naive Bayes assumes the features are independent within each class
Useful when \(p\) is large, and so multivariate methods like QDA (Quadratic Discriminant Analysis) and even LDA (Linear Discriminant Analysis) break down
Gaussian naive Bayes assumes each \(\Sigma_k\) is diagonal
Can be used for mixed feature vectors (qualitative and quantitative)
If \(X_j\) is qualitative, replace \(f_{kj}(x_j)\) with a probability mass function (histogram) over the discrete categories
Despite strong assumptions, naive Bayes often produces good classification results
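A sketch using the naiveBayes function in the e1071 package (an assumption; other implementations exist), again on the Default data:

```r
library(e1071)
library(ISLR2)

# Gaussian densities for the quantitative predictors,
# a discrete distribution for the qualitative predictor (student)
nb_fit <- naiveBayes(default ~ balance + income + student, data = Default)

# Predicted classes and a confusion matrix on the training data
nb_pred <- predict(nb_fit, newdata = Default)
table(Predicted = nb_pred, True = Default$default)
```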
For a two-class problem, one can show that for LDA
\(log(\frac{p_1(x)}{1-p_1(x)}) = log(\frac{p_1(x)}{p_2(x)}) = c_0 + c_1x_1 + ... + c_px_p\)
Which has the same form as logistic regression
The primary difference is how the parameters are estimated
Logistic regression uses the conditional likelihood based on \(Pr(Y|X)\) (known as discriminative learning)
LDA uses the full likelihood based on \(Pr(X,Y)\) (known as generative learning)
Logistic regression can also fit quadratic boundaries like QDA, by explicitly including quadratic terms in the model
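A minimal sketch of this on the Default data: adding a quadratic term in balance via poly() gives logistic regression a curved, QDA-like decision boundary in that predictor.

```r
library(ISLR2)

# Logistic regression with a quadratic term in balance
quad_logit <- glm(default ~ poly(balance, 2) + student,
                  data = Default, family = binomial)
summary(quad_logit)
```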
For linear regression, we assumed that the response \(Y\) is quantitative, and explored the use of least squares linear regression to predict \(Y\). Thus far for classification, we have instead assumed that \(Y\) is qualitative. However, we may sometimes be faced with situations in which \(Y\) is neither qualitative nor quantitative, and so neither linear regression nor the classification approaches are applicable
What about response variables that are neither qualitative nor quantitative?
Empirical example using the Bikeshare data
```r
library(ISLR2)
Bikeshare <- ISLR2::Bikeshare

# Estimate a model using linear regression
BSLinear <- lm(bikers ~ workingday + temp + factor(weathersit) +
                 factor(mnth) + factor(hr), data = Bikeshare)
summary(BSLinear)
```
At first glance, fitting a linear regression model to the Bikeshare data set seems to provide reasonable and intuitive results
But upon more careful inspection, some issues become apparent
For example, 9.6% of the fitted values in the Bikeshare data set are negative: that is, the linear regression model predicts a negative number of users during 9.6% of the hours in the data set
This calls into question our ability to perform meaningful predictions on the data, and it also raises concerns about the accuracy of the coefficient estimates, confidence intervals, and other outputs of the regression model
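A one-line sketch checking the share of negative fitted values from the linear model fit earlier (BSLinear):

```r
# Proportion of hours with a negative predicted number of bikers
mean(fitted(BSLinear) < 0)
```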
Some of the problems that arise when fitting a linear regression model to the Bikeshare data can be overcome by transforming the response
\(\log(Y) = \sum_{j=1}^{p}X_j\beta_j+\epsilon\)
Transforming the response avoids the possibility of negative predictions, and it overcomes much of the heteroscedasticity in the untransformed data
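A sketch of this transformed-response fit, assuming all of the hourly counts are strictly positive (otherwise an offset such as log(bikers + 1) would be needed):

```r
# Linear regression on the log of the response
BSLogLinear <- lm(log(bikers) ~ workingday + temp + factor(weathersit) +
                    factor(mnth) + factor(hr), data = Bikeshare)
summary(BSLogLinear)
```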
To overcome the inadequacies of linear regression for analyzing the Bikeshare data set, we will make use of an alternative approach, called Poisson regression
Suppose that a random variable \(Y\) takes on non-negative integer values: \(Y \in \left\{0,1,2,... \right\}\)
If \(Y\) follows the Poisson distribution then: \(Pr(Y = k) = \frac{e^{-\lambda}\lambda^k}{k!}\) for \(k = 0, 1, 2, \ldots\)
Where \(\lambda > 0\) is the expected value of \(Y\)
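A quick numeric check of this formula in R: with \(\lambda = 2\), the hand computation \(e^{-2}\,2^2/2! \approx 0.271\) matches dpois().

```r
# Poisson probabilities Pr(Y = k) for k = 0, ..., 5 with mean lambda = 2
dpois(0:5, lambda = 2)

# Direct evaluation of the formula for k = 2
exp(-2) * 2^2 / factorial(2)
```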
The Poisson distribution is typically used to model counts; this is a natural choice for a number of reasons, including the fact that counts, like the Poisson distribution, take on non-negative integer values
In Poisson regression, we let the mean depend on the predictors through a log link, \(\log(\lambda(X_1,\ldots,X_p)) = \beta_0 + \beta_1X_1 + \cdots + \beta_pX_p\), which guarantees \(\lambda > 0\)
```r
# Estimate the same model using Poisson regression
BSPoisson <- glm(bikers ~ workingday + temp + factor(weathersit) +
                   factor(mnth) + factor(hr),
                 data = Bikeshare, family = "poisson")
summary(BSPoisson)
```
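A short follow-up sketch: because Poisson regression uses a log link, exponentiating a coefficient gives the multiplicative change in the expected count for a one-unit increase in that predictor.

```r
# Multiplicative effects on the expected number of bikers
exp(coef(BSPoisson))
```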
Logistic regression is very popular for classification, especially when \(K = 2\)
LDA is useful when \(n\) is small, or the classes are well separated, and Gaussian assumptions are reasonable. Also when \(K > 2\)
Naive Bayes is useful when \(p\) is very large