Recall, for linear regression the response variable \(Y\) is quantitative and generally thought of as continuous
How do we handle response variables that are qualitative (i.e., categorical)?
Predicting a qualitative response for an observation can be referred to as classifying that observation, since it involves assigning the observation to a category, or class
Often the methods used for classification first predict the probability that the observation belongs to each of the categories of a qualitative variable, as the basis for making the classification
Widely used classifiers include logistic regression, linear and quadratic discriminant analysis, and naive Bayes, all discussed below
Qualitative variables take values in an unordered set \(C\)
eye color \(\in\) {brown, blue, green}
color \(\in\) {blue, red, green, yellow}
Given a feature vector \(X\) and a qualitative response \(Y\) taking values in the set \(C\), the classification task is to build a function \(C(X)\) that takes as input the feature vector \(X\) and predicts the value of \(Y\)
Often we are more interested in estimating the probabilities that \(X\) belongs to each category in \(C\)
We are interested in predicting whether an individual will default on his or her credit card payment, on the basis of annual income and monthly credit card balance
In the left-hand panel, we have plotted annual income and monthly credit card balance for a subset of 10,000 individuals
In the center and right-hand panels, two pairs of boxplots are shown. The first shows the distribution of balance split by the binary default variable; the second is a similar plot for income
Suppose for the example above that we code the response as follows:
\(Y = \left\{\begin{matrix} 0 & \text{if No} \\ 1 & \text{if Yes} \end{matrix}\right.\)
Can we simply perform linear regression of \(Y\) on \(X\) and classify as Yes if \(\hat{Y} > 0.5\)?
In this case of a binary outcome, linear regression does a decent job as a classifier and is equivalent to Linear Discriminant Analysis (discussed below)
Since in the population \(E(Y|X = x)= Pr(Y = 1 | X = x)\), we might think that regression is perfect for this task
However, linear regression might produce probabilities that are less than zero or larger than 1
The orange marks indicate the response \(Y\), either 0 or 1
Linear regression does not estimate \(Pr(Y = 1|X)\) very well
Logistic regression is well suited to the task
Now suppose we have a response variable with three possible values. A patient presents at the emergency room, and we must classify them according to their symptoms
\(Y = \left\{\begin{matrix} 1 & \text{if stroke} \\ 2 & \text{if drug overdose} \\ 3 & \text{if epileptic seizure} \end{matrix}\right.\)
This coding suggests an ordering, and in fact implies that the difference between stroke and drug overdose is the same as between drug overdose and epileptic seizure
Linear regression is not appropriate here
Multinomial (multiclass) Logistic Regression or Discriminant Analysis are more appropriate
Let’s write \(p(X) = Pr(Y = 1 | X)\) for short and consider using balance to predict default (from our example above)
Logistic regression uses the form
\(p(X) = \frac{e^{\beta_0+\beta_1X}}{1 + e^{\beta_0+\beta_1X}}\)
\(e \approx 2.71828\) is a mathematical constant (Euler's number)
It is easy to see that no matter what values \(\beta_0\), \(\beta_1\), or \(X\) take, \(p(X)\) will have values between 0 and 1
A bit of rearrangement gives:
\(\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1X\)
This transformation is called the log odds or logit transformation of \(p(X)\)
Logistic regression ensures that our estimate for \(p(X)\) lies between 0 and 1
In logistic regression, we utilize maximum likelihood to estimate the parameters
\(\ell(\beta_0, \beta_1) = \prod_{i:y_i=1}p(x_i)\prod_{i:y_i=0}(1-p(x_i))\)
This likelihood gives the probability of the observed zeros and ones in the data
We pick \(\beta_0\) and \(\beta_1\) to maximize the likelihood of the observed data
Most statistical packages can fit linear logistic regression models by maximum likelihood. In R we use the glm function
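As a minimal sketch (assuming the ISLR2 package, whose Default data set contains the default, student, balance, and income variables used in this example), a simple logistic regression of default on balance can be fit as follows:

```r
library(ISLR2)

# Fit a simple logistic regression of default status on credit card balance
DefLogit <- glm(default ~ balance, data = Default, family = binomial)
summary(DefLogit)

# Estimated probabilities p(X) for each observation lie between 0 and 1
head(predict(DefLogit, type = "response"))
```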
With several predictors, multiple logistic regression takes the form: \(\log\left(\frac{p(X)}{1-p(X)}\right)=\beta_0+\beta_1X_1+\cdots+\beta_pX_p\)
Why is the coefficient for student negative, while it was positive before?
Students tend to have higher balances than non-students, so their marginal default rate is higher than for non-students
But for each level of balance, students default less than non-students
Multiple logistic regression can tease this out
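A sketch of how multiple logistic regression teases this out, again assuming the ISLR2 Default data: the sign of the student coefficient flips once balance is included as a predictor.

```r
library(ISLR2)

# Student alone: positive coefficient (students default more overall)
coef(glm(default ~ student, data = Default, family = binomial))

# Student together with balance (and income): negative coefficient,
# i.e., at a given balance, students default less than non-students
coef(glm(default ~ balance + income + student, data = Default, family = binomial))
```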
Logistic regression with more than two classes (categories)
So far we have discussed logistic regression with two classes. It is easily generalized to more than two classes. One version (used in the R package glmnet) has the symmetric form
\(Pr(Y = k|X)=\frac{e^{\beta_{0k}+\beta_{1k}X_1+\cdots+\beta_{pk}X_p}}{\sum_{\ell = 1}^{K}e^{\beta_{0\ell}+\beta_{1\ell}X_1+\cdots+\beta_{p\ell}X_p}}\)
Where there is a linear function for each class
Multinomial logistic regression is also referred to as multiclass logistic regression
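A minimal sketch of this symmetric multiclass form using the glmnet package mentioned above (nnet::multinom is a common alternative), illustrated with the built-in iris data:

```r
library(glmnet)

# glmnet expects a numeric predictor matrix and a factor response
x <- as.matrix(iris[, 1:4])
y <- iris$Species

# family = "multinomial" fits one linear function per class (K = 3 here)
fit <- glmnet(x, y, family = "multinomial")

# Coefficients for each class at a small value of the penalty parameter
coef(fit, s = 0.01)
```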
Here the approach is to model the distribution of \(X\) in each of the classes separately, and then use Bayes theorem to flip things around and obtain \(Pr(Y|X)\)
When we use normal (Gaussian) distributions for each class, this leads to linear or quadratic discriminant analysis
However, this approach is quite general, and other distributions can be used as well. We will focus on normal distributions
Thomas Bayes was a famous mathematician whose name represents a big subfield of statistical and probabilistic modeling. Here we focus on a simple result, known as Bayes theorem
For discriminant analysis we would write the above slightly differently
\(Pr(Y = k|X = x) = \frac{\pi_kf_k(x)}{\sum_{l=1}^{K} \pi_lf_l(x)}\)
Where \(f_k(x)=Pr(X = x|Y = k)\) is the density for \(X\) in class \(k\)
And \(\pi_k = Pr(Y = k)\) is the marginal or prior probability for class \(k\)
We classify a new point according to which density is highest
When the priors are different, we take them into account as well, and compare \(\pi_kf_k(x)\)
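A small worked sketch of this rule for two Gaussian classes (the means, variance, and priors below are illustrative values): the posterior probabilities are just \(\pi_kf_k(x)\) rescaled to sum to one.

```r
# Illustrative two-class setup: class means, common standard deviation, priors
mu    <- c(-1.5, 1.5)
sigma <- 1
prior <- c(0.5, 0.5)

x <- 0.8  # a new observation to classify

# pi_k * f_k(x) for each class, using the Gaussian density
unnormalized <- prior * dnorm(x, mean = mu, sd = sigma)

# Bayes theorem: divide by the sum over classes to get Pr(Y = k | X = x)
posterior <- unnormalized / sum(unnormalized)
posterior

# Classify to the class with the highest posterior
which.max(posterior)
```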
When the classes are well-separated, the parameter estimates for the logistic regression model are surprisingly unstable. Linear discriminant analysis does not suffer from this problem
If \(n\) is small and the distribution of the predictors \(X\) is approximately normal in each of the classes, the linear discriminant model is again more stable than the logistic regression model
Linear discriminant analysis is popular when we have more than two response classes, because it also provides low-dimensional views of the data
Linear Discriminant Analysis when \(p = 1\)
The Gaussian density has the form: \(f_k(x)=\frac{1}{\sqrt{2\pi}\,\sigma_k}e^{-\frac{1}{2}\left(\frac{x-\mu_k}{\sigma_k}\right)^2}\)
Where \(\mu_k\) is the mean and \(\sigma_k^2\) the variance in class \(k\)
Plugging this into Bayes formula, and assuming a common variance \(\sigma_1^2 = \cdots = \sigma_K^2 = \sigma^2\) across classes, we get a rather complex expression for \(p_k(x)=Pr(Y=k|X=x)\)
\(p_k(x)=\frac{\pi_k \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2\sigma^2}(x-\mu_k)^2}}{\sum_{l=1}^{K}\pi_l \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2\sigma^2}(x-\mu_l)^2}}\)
Lucky for us, there are simplifications and cancellations
To classify at the value \(X = x\), we need to see which of the \(p_k(x)\) is largest. Taking logs and discarding terms that do not depend on \(k\), we see that this is equivalent to assigning \(x\) to the class with the largest discriminant score
\(\delta_k(x)=x \cdot \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2}+\log(\pi_k)\)
Note that \(\delta_k(x)\) is a linear function of \(x\)
If there are \(K = 2\) classes and \(\pi_1 = \pi_2 = 0.5\), then one can see that the decision boundary is at \(x = \frac{\mu_1+\mu_2}{2}\)
The above is an example with \(\mu_1 = -1.5\), \(\mu_2=1.5\), \(\pi_1 = \pi_2 = 0.5\), and \(\sigma^2 = 1\)
We estimate the parameters from the training data: \(\hat{\pi}_k = \frac{n_k}{n}\), \(\hat{\mu}_k = \frac{1}{n_k}\sum_{i:y_i=k}x_i\), and the pooled variance \(\hat{\sigma}^2 = \frac{1}{n-K}\sum_{k=1}^{K}\sum_{i:y_i=k}(x_i - \hat{\mu}_k)^2 = \sum_{k=1}^{K}\frac{n_k-1}{n-K} \cdot \hat{\sigma}_k^2\), where \(n_k\) is the number of training observations in class \(k\) and \(\hat{\sigma}_k^2\) is the sample variance within class \(k\)
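A sketch of these estimates in R, using data simulated to resemble the example above (class means \(-1.5\) and \(1.5\), common variance 1, equal priors); the variable names are illustrative only.

```r
set.seed(1)

# Simulated training data: two classes with means -1.5 and 1.5, common sd 1
y <- rep(1:2, each = 50)
x <- rnorm(100, mean = c(-1.5, 1.5)[y], sd = 1)

n  <- length(x)
K  <- 2
nk <- table(y)

pi_hat <- nk / n              # prior estimates n_k / n
mu_hat <- tapply(x, y, mean)  # class mean estimates

# Pooled variance estimate: weighted average of within-class variances
sigma2_hat <- sum((nk - 1) * tapply(x, y, var)) / (n - K)

# Discriminant scores delta_k(x0) for a new point x0
x0    <- 0.3
delta <- x0 * mu_hat / sigma2_hat - mu_hat^2 / (2 * sigma2_hat) + log(pi_hat)
which.max(delta)  # predicted class
```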
Linear Discriminant Analysis when \(p > 1\)
Density: \(f(x) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}e^{-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)}\), where \(\mu\) is the mean vector and \(\Sigma\) the \(p \times p\) covariance matrix
Discriminant function: \(\delta_k(x) = x^{T}\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^T \Sigma^{-1} \mu_k + \log\pi_k\)
Despite its complex form, \(\delta_k(x)\) is again a linear function of \(x\): \(\delta_k(x) = c_{k0} + c_{k1}x_1 + \cdots + c_{kp}x_p\)
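A small sketch of computing this discriminant score in R for illustrative (assumed) parameter values, using solve() for \(\Sigma^{-1}\):

```r
# Illustrative parameters for one class k (p = 2 predictors)
mu_k  <- c(1, 2)                          # class mean vector
Sigma <- matrix(c(1, 0.3, 0.3, 1), 2, 2)  # common covariance matrix
pi_k  <- 1/3                              # prior probability

# delta_k(x) = x' Sigma^{-1} mu_k - 0.5 mu_k' Sigma^{-1} mu_k + log(pi_k)
delta_k <- function(x, mu_k, Sigma, pi_k) {
  Sinv <- solve(Sigma)
  drop(t(x) %*% Sinv %*% mu_k - 0.5 * t(mu_k) %*% Sinv %*% mu_k + log(pi_k))
}

delta_k(c(0.5, 1.5), mu_k, Sigma, pi_k)
```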
An illustration for \(p = 2\) and \(K = 3\) classes
Here \(\pi_1 = \pi_2 = \pi_3 = \frac{1}{3}\)
The dashed lines are known as the Bayes decision boundaries. Were they known, they would yield the fewest misclassification errors, among all possible classifiers
An example with Fisher’s Iris Data
There are 4 variables, 3 species, and 50 samples per class
LDA classifies all but 3 of the 150 training samples correctly
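A sketch reproducing this with the lda function in the MASS package (an assumption here; MASS is the standard R implementation, not something specified in these notes):

```r
library(MASS)

# Fit LDA to Fisher's iris data: 4 predictors, 3 species, 50 samples per class
iris_lda <- lda(Species ~ ., data = iris)

# Confusion matrix on the training data; only 3 of 150 are misclassified
table(Predicted = predict(iris_lda)$class, True = iris$Species)
```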
When there are \(K\) classes, linear discriminant analysis can be viewed exactly in a \(K - 1\) dimensional plot
Why?
Because it essentially classifies to the closest centroid, and they span a \(K − 1\) dimensional plane
Even when \(K > 3\), we can find the “best” 2-dimensional plane for visualizing the discriminant rule
Once we have estimates \(\hat{\delta}_k(x)\), we can turn these into estimates for class probabilities: \(\hat{Pr}(Y = k|X = x) = \frac{e^{\hat{\delta}_k(x)}}{\sum_{l=1}^{K}e^{\hat{\delta}_l(x)}}\)
So classifying to the largest \(\hat{\delta}_k(x)\) amounts to classifying to the class for which \(\hat{Pr}(Y = k|X = x)\) is largest
When \(K = 2\) we classify to class 2 if \(Pr(Y = 2 | X = x) \geq 0.5\), else to class 1
(23 + 252)/10000 errors, a 2.75% misclassification rate
This is training error and we may be overfitting
If we classified to the prior — always to class No in this case — we would make 333/10000 errors, or only 3.33%
Of the true No’s, we make 23/9667 = 0.2% errors; of the true Yes’s, we make 252/333 = 75.7% errors
Types of Errors
False positive rate: the fraction of negative examples (true No's) that are classified as positive (Yes)
False negative rate: the fraction of positive examples (true Yes's) that are classified as negative (No)
We produced the table above by classifying to class Yes if \(\hat{Pr}(Default = Yes|Balance, student)\geq 0.5\)
We can change the two error rates by changing the threshold from 0.5 to some other value in [0, 1]
\(\hat{Pr}(Default = Yes | Balance, Student) \geq \text{threshold}\)
In order to reduce the false negative rate, we may want to reduce the threshold to 0.1 or less
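A sketch of varying the threshold on the Default data, assuming MASS::lda for the model behind the error rates above (LDA on balance and student):

```r
library(ISLR2)
library(MASS)

# LDA fit and posterior probabilities of default = "Yes"
def_lda  <- lda(default ~ balance + student, data = Default)
post_yes <- predict(def_lda)$posterior[, "Yes"]

# Default threshold of 0.5
table(Predicted = ifelse(post_yes >= 0.5, "Yes", "No"), True = Default$default)

# Lowering the threshold to 0.2 reduces the false negative rate
# (at the cost of more false positives)
table(Predicted = ifelse(post_yes >= 0.2, "Yes", "No"), True = Default$default)
```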
ROC Curve Plot
Sometimes we use the AUC or area under the curve to summarize the overall performance
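A sketch of an ROC curve and AUC for the posterior probabilities post_yes from the previous sketch, assuming the pROC package is available (other packages such as ROCR work as well):

```r
library(pROC)

# ROC curve: true class labels vs. predicted probabilities of "Yes"
roc_obj <- roc(response = Default$default, predictor = post_yes)
plot(roc_obj)

# Area under the curve summarizes performance across all thresholds
auc(roc_obj)
```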
Naive Bayes assumes the features are independent within each class
Useful when \(p\) is large, and so multivariate methods like QDA (Quadratic Discriminant Analysis) and even LDA (Linear Discriminant Analysis) break down
Gaussian naive Bayes assumes each \(\Sigma_k\) is diagonal
Can be used for mixed feature vectors (qualitative and quantitative)
If \(X_j\) is qualitative, replace \(f_{kj}(x_j)\) with a probability mass function (histogram) over the discrete categories
Despite strong assumptions, naive Bayes often produces good classification results
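A sketch using the naiveBayes function in the e1071 package (an assumption; other implementations exist), again on the Default data:

```r
library(e1071)
library(ISLR2)

# Gaussian densities for the quantitative predictors,
# a discrete distribution for the qualitative predictor (student)
nb_fit <- naiveBayes(default ~ balance + income + student, data = Default)

# Predicted classes and a confusion matrix on the training data
nb_pred <- predict(nb_fit, newdata = Default)
table(Predicted = nb_pred, True = Default$default)
```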
For a two-class problem, one can show that for LDA
\(log(\frac{p_1(x)}{1-p_1(x)}) = log(\frac{p_1(x)}{p_2(x)}) = c_0 + c_1x_1 + ... + c_px_p\)
Which has the same form as logistic regression
The primary difference is how the parameters are estimated
Logistic regression uses the conditional likelihood based on \(Pr(Y|X)\) (known as discriminative learning)
LDA uses the full likelihood based on \(Pr(X,Y)\) (known as generative learning)
Logistic regression can also fit quadratic boundaries like QDA, by explicitly including quadratic terms in the model
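A minimal sketch of this on the Default data: adding a quadratic term in balance via poly() gives logistic regression a curved, QDA-like decision boundary in that predictor.

```r
library(ISLR2)

# Logistic regression with a quadratic term in balance
quad_logit <- glm(default ~ poly(balance, 2) + student,
                  data = Default, family = binomial)
summary(quad_logit)
```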
For linear regression, we assumed that the response \(Y\) is quantitative, and explored the use of least squares linear regression to predict \(Y\). Thus far for classification, we have instead assumed that \(Y\) is qualitative. However, we may sometimes be faced with situations in which \(Y\) is neither qualitative nor quantitative, and so neither linear regression nor the classification approaches are applicable
What about response variables that are neither qualitative nor quantitative?
Empirical example using the Bikeshare data
```r
library(ISLR2)
Bikeshare <- ISLR2::Bikeshare

# Estimate a model using linear regression
BSLinear <- lm(bikers ~ workingday + temp + factor(weathersit) +
                 factor(mnth) + factor(hr), data = Bikeshare)
summary(BSLinear)
```
At first glance, fitting a linear regression model to the Bikeshare data set seems to provide reasonable and intuitive results
But upon more careful inspection, some issues become apparent
For example, 9.6% of the fitted values in the Bikeshare data set are negative: that is, the linear regression model predicts a negative number of users during 9.6% of the hours in the data set
This calls into question our ability to perform meaningful predictions on the data, and it also raises concerns about the accuracy of the coefficient estimates, confidence intervals, and other outputs of the regression model
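A one-line sketch checking the share of negative fitted values from the linear model fit earlier (BSLinear):

```r
# Proportion of hours with a negative predicted number of bikers
mean(fitted(BSLinear) < 0)
```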
Some of the problems that arise when fitting a linear regression model to the Bikeshare data can be overcome by transforming the response
\(\log(Y) = \sum_{j=1}^{p}X_j\beta_j+\epsilon\)
Transforming the response avoids the possibility of negative predictions, and it overcomes much of the heteroscedasticity in the untransformed data
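A sketch of this transformed-response fit, assuming all of the hourly counts are strictly positive (otherwise an offset such as log(bikers + 1) would be needed):

```r
# Linear regression on the log of the response
BSLogLinear <- lm(log(bikers) ~ workingday + temp + factor(weathersit) +
                    factor(mnth) + factor(hr), data = Bikeshare)
summary(BSLogLinear)
```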
To overcome the inadequacies of linear regression for analyzing the Bikeshare data set, we will make use of an alternative approach, called Poisson regression
Suppose that a random variable \(Y\) takes on non-negative integer values: \(Y \in \left\{0,1,2,... \right\}\)
If \(Y\) follows the Poisson distribution then: \(Pr(Y = k) = \frac{e^{-\lambda}\lambda^k}{k!}\) for \(k = 0, 1, 2, \ldots\)
Where \(\lambda > 0\) is the expected value of \(Y\)
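A quick numeric check of this formula in R: with \(\lambda = 2\), the hand computation \(e^{-2}\,2^2/2! \approx 0.271\) matches dpois().

```r
# Poisson probabilities Pr(Y = k) for k = 0, ..., 5 with mean lambda = 2
dpois(0:5, lambda = 2)

# Direct evaluation of the formula for k = 2
exp(-2) * 2^2 / factorial(2)
```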
The Poisson distribution is typically used to model counts; this is a natural choice for a number of reasons, including the fact that counts, like the Poisson distribution, take on non-negative integer values
In Poisson regression, we let the mean depend on the predictors through a log link, \(\log(\lambda(X_1,\ldots,X_p)) = \beta_0 + \beta_1X_1 + \cdots + \beta_pX_p\), which guarantees \(\lambda > 0\)
```r
# Estimate the same model using Poisson regression
BSPoisson <- glm(bikers ~ workingday + temp + factor(weathersit) +
                   factor(mnth) + factor(hr),
                 data = Bikeshare, family = "poisson")
summary(BSPoisson)
```
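A short follow-up sketch: because Poisson regression uses a log link, exponentiating a coefficient gives the multiplicative change in the expected count for a one-unit increase in that predictor.

```r
# Multiplicative effects on the expected number of bikers
exp(coef(BSPoisson))
```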
Logistic regression is very popular for classification, especially when \(K = 2\)
LDA is useful when \(n\) is small, or the classes are well separated, and Gaussian assumptions are reasonable. Also when \(K > 2\)
Naive Bayes is useful when \(p\) is very large