Introduction to Classification


Classification



Can We Use Linear Regression?


Linear versus Logistic Regression


  • The orange marks indicate the response \(Y\), either 0 or 1

  • Linear regression does not estimate \(Pr(Y = 1|X)\) very well

  • Logistic regression is well suited to the task

  • Now suppose we have a response variable with three possible values. A patient presents at the emergency room, and we must classify them according to their symptoms

    • \(Y = \begin{cases} 1 & \text{if stroke} \\ 2 & \text{if drug overdose} \\ 3 & \text{if epileptic seizure} \end{cases}\)

    • This coding suggests an ordering, and in fact implies that the difference between stroke and drug overdose is the same as between drug overdose and epileptic seizure

      • Linear regression is not appropriate here

      • Multinomial (multiclass) Logistic Regression or Discriminant Analysis are more appropriate, as sketched below
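
A minimal sketch of the multinomial option above, assuming the nnet package (which ships with standard R installations) and using the three-class iris data as a stand-in for the stroke / overdose / seizure example:

# Multinomial (multiclass) logistic regression on a three-class nominal response
library(nnet)

fit <- multinom(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
                data = iris, trace = FALSE)

# Predicted class probabilities for the first few observations
head(predict(fit, type = "probs"))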


Logistic Regression



Maximum Likelihood


Making Predictions









Multiple Logistic Regression




Multinomial Logistic Regression


Discriminant Analysis


Bayes Theorem for Classification

  • Thomas Bayes was a famous mathematician whose name represents a big subfield of statistical and probabilistic modeling. Here we focus on a simple result, known as Bayes theorem

    • \(Pr(Y = k|X = x) = \frac{Pr(X = x|Y = k) \cdot Pr(Y = k)}{Pr(X = x)}\)
  • For discriminant analysis we would write the above slightly differently

    • \(Pr(Y = k|X = x) = \frac{\pi_kf_k(x)}{\sum_{l=1}^{K} \pi_lf_l(x)}\)

      • Where \(f_k(x)=Pr(X = x|Y = k)\) is the density for \(X\) in class \(k\)

        • Here we will use normal densities for these, separately in each class
      • And \(\pi_k = Pr(Y = k)\) is the marginal or prior probability for class \(k\)


  • We classify a new point according to which density is highest

  • When the priors are different, we take them into account as well, and compare \(\pi_kf_k(x)\)

    • In the image above, for the right panel, we favor the pink class because the decision boundary has shifted to the left
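
A minimal sketch of this rule, with hypothetical means, a shared standard deviation, and unequal priors (all assumed for illustration), computing the posterior probabilities from normal densities:

# Posterior probabilities via Bayes theorem with normal class densities
# (the means, sd, and priors below are made-up values for illustration)
x     <- 0.5                    # new observation to classify
mu    <- c(-1.25, 1.25)         # class means (assumed)
sigma <- 1                      # shared standard deviation (assumed)
prior <- c(0.7, 0.3)            # prior probabilities pi_k (assumed)

numer     <- prior * dnorm(x, mean = mu, sd = sigma)   # pi_k * f_k(x)
posterior <- numer / sum(numer)                        # Pr(Y = k | X = x)
posterior                       # classify to the class with the largest posterior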

Why Discriminant Analysis?

  • When the classes are well-separated, the parameter estimates for the logistic regression model are surprisingly unstable. Linear discriminant analysis does not suffer from this problem

  • If \(n\) is small and the distribution of the predictors \(X\) is approximately normal in each of the classes, the linear discriminant model is again more stable than the logistic regression model

  • Linear discriminant analysis is popular when we have more than two response classes, because it also provides low-dimensional views of the data

  • Linear Discriminant Analysis when \(p = 1\)

    • The Gaussian density has the form: \(f_k(x)=\frac{1}{\sqrt{2\pi}\,\sigma_k}e^{-\frac{1}{2}\left(\frac{x-\mu_k}{\sigma_k}\right)^2}\)

      • Where \(\mu_k\) is the mean and \(\sigma_k^2\) the variance in class \(k\)

        • We will assume that all the \(\sigma_k = \sigma\) are the same


      • Plugging this into Bayes formula, we get a rather complex expression for \(p_k(x)=Pr(Y=k|X=x)\)

        • \(p_k(x)=\frac{\pi_k \frac{1}{\sqrt{2\pi}\,\sigma}e^{-\frac{1}{2}\left(\frac{x-\mu_k}{\sigma}\right)^2}}{\sum_{l=1}^{K}\pi_l \frac{1}{\sqrt{2\pi}\,\sigma}e^{-\frac{1}{2}\left(\frac{x-\mu_l}{\sigma}\right)^2}}\)

        • Lucky for us, there are simplifications and cancellations


Discriminant Functions

  • To classify at the value \(X = x\), we need to see which of the \(p_k(x)\) is the largest. Taking logs and discarding terms that do not depend on \(k\), we see that this is equivalent to assigning \(x\) to the class with the largest discriminant score

    • \(\delta_k(x)=x \cdot \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2}+log(\pi_k)\)

    • Note that \(\delta_k(x)\) is a linear function of \(x\)

    • If there are \(K = 2\) classes and \(\pi_1 = \pi_2 = 0.5\), then one can see that the decision boundary is at \(x = \frac{\mu_1+\mu_2}{2}\) (a short derivation is given below)


  • The above is an example with \(\mu_1 = -1.5\), \(\mu_2=1.5\), \(\pi_1 = \pi_2 = 0.5\), and \(\sigma^2 = 1\)

    • Typically we don’t know these parameters; we just have the training data. In that case we simply estimate the parameters and plug them into the rule
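
  • A short derivation of that boundary: setting \(\delta_1(x) = \delta_2(x)\) with \(\pi_1 = \pi_2\) gives

    • \(x\,\frac{\mu_1}{\sigma^2} - \frac{\mu_1^2}{2\sigma^2} = x\,\frac{\mu_2}{\sigma^2} - \frac{\mu_2^2}{2\sigma^2} \;\Rightarrow\; x\,\frac{\mu_1-\mu_2}{\sigma^2} = \frac{(\mu_1-\mu_2)(\mu_1+\mu_2)}{2\sigma^2} \;\Rightarrow\; x = \frac{\mu_1+\mu_2}{2}\)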

Estimating the Parameters

  • \(\hat{\pi_k} = \frac{n_k}{n}\)


  • \(\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i\)


  • \(\hat{\sigma^2} = \frac{1}{n-K}\sum_{k=1}^{K}\sum_{i:y_i=k}(x_i - \hat{\mu_k})^2\) = \(\sum_{k=1}^{K}\frac{n_k-1}{n-K} \cdot \hat{\sigma_k^2}\)

    • Where \(\hat{\sigma_k^2} = \frac{1}{n_k-1}\sum_{i:y_i=k}(x_i - \hat{\mu_k})^2\) is the usual formula for the estimated variance in the \(k\)th class
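
A minimal sketch of these plug-in estimates for \(p = 1\), on hypothetical one-dimensional training data (the values below are made up for illustration):

# Plug-in estimates for LDA with a single predictor (p = 1)
# (hypothetical two-class training data; values are made up)
x <- c(-2.1, -1.3, -0.4, -1.8, 0.9, 1.7, 2.2, 1.1)   # predictor
y <- c(1, 1, 1, 1, 2, 2, 2, 2)                        # class labels

n  <- length(x)
K  <- length(unique(y))
nk <- c(table(y))                                     # class counts n_k

pi_hat <- nk / n                                      # prior estimates
mu_hat <- c(tapply(x, y, mean))                       # class means
# Pooled variance: within-class squared deviations divided by (n - K)
sigma2_hat <- sum((x - mu_hat[as.character(y)])^2) / (n - K)

# Discriminant scores for a new point x0; classify to the largest score
x0    <- 0.2
delta <- x0 * mu_hat / sigma2_hat - mu_hat^2 / (2 * sigma2_hat) + log(pi_hat)
names(which.max(delta))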

Linear Discriminant Analysis When \(p > 1\)

  • Density: \(f(x) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}e^{-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)}\)

  • Discriminant Function: \(\delta_k(x) = x^{T}\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^{T}\Sigma^{-1}\mu_k + \log\pi_k\)

  • Despite its complex form, \(\delta_k(x) = c_{k0} + c_{k1}x_1 + ... + c_{kp}x_p\)

    • A linear function
  • An illustration for \(p = 2\) and \(K = 3\) classes


  • Here \(\pi_1 = \pi_2 = \pi_3 = \frac{1}{3}\)

  • The dashed lines are known as the Bayes decision boundaries. Were they known, they would yield the fewest misclassification errors, among all possible classifiers

  • An example with Fisher’s Iris Data


  • There are 4 variables, 3 species, and 50 samples per class

    • Blue = Setosa, Orange = Versicolor, Green = Virginica


  • LDA classifies all but 3 of the 150 training samples correctly (a runnable sketch follows this list)

    • See the Fisher’s Discriminant Plot below


  • When there are \(K\) classes, linear discriminant analysis can be viewed exactly in a \(K - 1\) dimensional plot

    • Why?

      • Because it essentially classifies to the closest centroid, and they span a \(K − 1\) dimensional plane

      • Even when \(K > 3\), we can find the “best” 2-dimensional plane for visualizing the discriminant rule
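
A minimal sketch of the iris example above, assuming the MASS package is available (its lda() function is a standard LDA implementation); the confusion matrix and the two discriminant coordinates correspond to the error count and discriminant plot discussed above:

# LDA on Fisher's iris data (4 predictors, 3 classes, 50 samples per class)
library(MASS)

fit  <- lda(Species ~ ., data = iris)
pred <- predict(fit)

# Training-set confusion matrix: only a few of the 150 samples are misclassified
table(Predicted = pred$class, Actual = iris$Species)

# With K = 3 classes the fit yields K - 1 = 2 discriminant coordinates (LD1, LD2),
# which give the low-dimensional view of the data
head(pred$x)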


From \(\delta_k(x)\) to Probabilities

  • Once we have estimates \(\hat{\delta}_k(x)\), we can turn these into estimates for class probabilities

    • \(\hat{Pr}(Y = k|X=x)=\frac{e^{\hat{\delta}_k(x)}}{\sum_{l=1}^{K}e^{\hat{\delta}_l(x)}}\)
  • So classifying to the largest \(\hat{\delta}_k(x)\) amounts to classifying to the class for which \(\hat{Pr}(Y = k|X = x)\) is largest

  • When \(K = 2\) we classify to class 2 if \(Pr(Y = 2 | X = x) \geq 0.5\), else to class 1


LDA on Credit Data


  • (23 + 252) / 10000 errors, a 2.75% misclassification rate

    • This is training error and we may be overfitting

      • Not a big concern here since n = 10000 and p = 2


    • If we classified to the prior — always to class No in this case — we would make 333/10000 errors, or only 3.33%

    • Of the true No’s, we make 23/9667 = 0.2% errors; of the true Yes’s, we make 252/333 = 75.7% errors

  • Types of Errors

    • False positive rate

      • The fraction of negative examples that are classified as positive
    • False negative rate

      • The fraction of positive examples that are classified as negative
  • We produced the table above by classifying to class Yes if \(\hat{Pr}(Default = Yes|Balance, Student)\geq 0.5\)

  • We can change the two error rates by changing the threshold from 0.5 to some other value in [0, 1]

    • \(\hat{Pr}(Default = Yes | Balance, Student) \geq threshold\)

      • Where we would vary the threshold (as is depicted below)


  • In order to reduce the false negative rate, we may want to reduce the threshold to 0.1 or less

  • ROC Curve Plot

    • Is a graph showing the performance of a classification model at all classification thresholds


  • Sometimes we use the AUC (area under the curve) to summarize the overall performance

    • A higher AUC indicates better overall classification performance across thresholds (a runnable sketch of the threshold trade-off follows this list)
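
A minimal sketch of the threshold trade-off discussed above, assuming the ISLR2 package's Default data and an LDA fit of default on balance and student (matching the credit example); the false positive and false negative rates are recomputed as the threshold varies:

# Varying the classification threshold for LDA on the Default (credit) data
# (assumes the ISLR2 and MASS packages are installed)
library(ISLR2)
library(MASS)

fit  <- lda(default ~ balance + student, data = Default)
prob <- predict(fit)$posterior[, "Yes"]    # estimated Pr(Default = Yes | Balance, Student)

rates <- sapply(c(0.5, 0.25, 0.1), function(threshold) {
  pred <- ifelse(prob >= threshold, "Yes", "No")
  c(threshold      = threshold,
    false_positive = mean(pred[Default$default == "No"]  == "Yes"),
    false_negative = mean(pred[Default$default == "Yes"] == "No"))
})
t(rates)   # lowering the threshold reduces false negatives but raises false positives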

Naive Bayes


Logistic Regression versus LDA


Generalized Linear Models

# Load the Bikeshare data (hourly bike rental counts) from the ISLR2 package
library(ISLR2)

Bikeshare <- ISLR2::Bikeshare

# Estimate a model for the number of bikers using linear regression
BSLinear <- lm(bikers ~ workingday + temp + factor(weathersit) + factor(mnth) + factor(hr), data = Bikeshare)
summary(BSLinear)

Poisson Regression

  • To overcome the inadequacies of linear regression for analyzing the Bikeshare data set, we will make use of an alternative approach, called Poisson regression

  • Suppose that a random variable \(Y\) takes on non-negative integer values: \(Y \in \left\{0,1,2,... \right\}\)

    • If \(Y\) follows the Poisson distribution then: \(Pr(Y = k) = \frac{e^{-\lambda}\lambda^k}{k!}\)

      • For k = 0, 1, 2, …

      • Where \(\lambda > 0\) is the expected value of \(Y\)

    • The Poisson distribution is typically used to model counts; this is a natural choice for a number of reasons, including the fact that counts, like the Poisson distribution, take on non-negative integer values
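
  • In Poisson regression we model the mean of \(Y\) with a log link, which keeps the rate positive; this is the model the glm() call below fits (R's family = "poisson" uses the log link by default)

    • \(\log\left(\lambda(X_1,...,X_p)\right) = \beta_0 + \beta_1X_1 + ... + \beta_pX_p\), or equivalently, \(\lambda(X) = e^{\beta_0 + \beta_1X_1 + ... + \beta_pX_p}\)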

# Fit the same model with Poisson regression (log link)
BSPoisson <- glm(bikers ~ workingday + temp + factor(weathersit) + factor(mnth) + factor(hr), data = Bikeshare, family = "poisson")
summary(BSPoisson)

Conclusion