In this lab, you will begin to get oriented with R and work with some data.
Attempt each exercise in order.
In each code chunk, if you see “# INSERT CODE HERE”, you are expected to add code that creates the intended output. Make sure to erase “# INSERT CODE HERE” and put your code in its place.
If my instructions say to “Run the code below…” then you do not need to add any code to the chunk.
Many exercises may require you to type some text below the code chunk, interpreting the output and answering the questions.
Please follow the Davidson Honor Code and rules from the course syllabus regarding seeking help with this assignment.
When you are finished, click the “Knit” button at the top of this panel. If there are no errors, a Word file should pop up after a few seconds.
Take a look at the resulting Word file. Make sure everything looks correct, your name is listed at the top, and there is no ‘junk’ code or output.
Save the Word file (to your local computer, and/or to a cloud location) as: Lab 10 “Insert Your Name”.
Use this link to upload your Word file to my Google Drive folder. Do not upload the original .Rmd version.
This assignment is due Thursday, August 11, 2022, no later than 9:30 am Eastern. Points will be deducted for late submissions.
TIP: Start early so that you can troubleshoot any issues with knitting to Word.
There are 6 possible points on this assignment.
Baseline (C level work)
Average (B level work)
Advanced (A level work)
In this problem, you will generate simulated data, and then perform PCA and K-means clustering on the data.
Generate a simulated data set with 20 observations in each of three classes (i.e. 60 observations total), and 50 variables. (Hint: There are a number of functions in R that you can use to generate data. One example is the rnorm() function; runif() is another option. Be sure to add a mean shift to the observations in each class so that there are three distinct classes.)
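One possible way to set up the simulation is sketched below. The seed, the object names (`x`, `true_class`), the shift size of 2, and the choice of which columns to shift are all illustrative assumptions, not required values; any mean shift that produces three distinct classes would do.

```r
set.seed(1)  # arbitrary seed, used only so the sketch is reproducible

n_per_class <- 20
n_vars <- 50

# Start from standard normal noise: 60 observations (rows) by 50 variables (columns)
x <- matrix(rnorm(3 * n_per_class * n_vars),
            nrow = 3 * n_per_class, ncol = n_vars)

# True class labels: rows 1-20 are class 1, 21-40 are class 2, 41-60 are class 3
true_class <- rep(1:3, each = n_per_class)

# Mean shift (size and columns are assumptions):
# class 2 is shifted up in the first 25 variables,
# class 3 is shifted down in the last 25 variables,
# class 1 is left centered at zero
x[true_class == 2, 1:25]  <- x[true_class == 2, 1:25] + 2
x[true_class == 3, 26:50] <- x[true_class == 3, 26:50] - 2

dim(x)
table(true_class)
```

If the classes do not separate well later on, increasing the shift size is the simplest adjustment.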
Perform PCA on the 60 observations and plot the first two principal component score vectors. Use a different color to indicate the observations in each of the three classes. If the three classes appear separated in this plot, then continue on to part (C). If not, then return to part (A) and modify the simulation so that there is greater separation between the three classes. Do not continue to part (C) until the three classes show at least some separation in the first two principal component score vectors.
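A minimal sketch of the PCA step follows, using `prcomp()` from base R. The simulated data are regenerated here with the same illustrative assumptions (seed, shift size, object names) so that the chunk runs on its own; scaling the variables with `scale. = TRUE` is a choice, not a requirement.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Simulated data: 3 classes x 20 observations, 50 variables (shift values are assumptions)
true_class <- rep(1:3, each = 20)
x <- matrix(rnorm(60 * 50), nrow = 60)
x[true_class == 2, 1:25]  <- x[true_class == 2, 1:25] + 2
x[true_class == 3, 26:50] <- x[true_class == 3, 26:50] - 2

# PCA on the 60 observations; the score vectors are stored in pr_out$x
pr_out <- prcomp(x, scale. = TRUE)
scores <- pr_out$x[, 1:2]  # first two principal component score vectors

# One color per true class (colors 2, 3, 4 in the default palette)
plot(scores, col = true_class + 1, pch = 19, xlab = "PC1", ylab = "PC2")
legend("topright", legend = paste("Class", 1:3), col = 2:4, pch = 19)
```

Three visibly distinct groups of points in this plot suggest the simulation has enough separation to continue.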
Perform K-means clustering of the observations with \(K = 3\). How well do the clusters that you obtained in K-means clustering compare to the true class labels? (Hint: You can use the table() function in R to compare the true class labels to the class labels obtained by clustering. Be careful how you interpret the results: K-means clustering will arbitrarily number the clusters, so you cannot simply check whether the true class labels and clustering labels are the same.)
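The comparison described in the hint could look roughly like the sketch below. The data setup repeats the same illustrative assumptions as before, and `nstart = 20` is an assumed choice that runs K-means from 20 random starts and keeps the best solution.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Simulated data (shift values are assumptions)
true_class <- rep(1:3, each = 20)
x <- matrix(rnorm(60 * 50), nrow = 60)
x[true_class == 2, 1:25]  <- x[true_class == 2, 1:25] + 2
x[true_class == 3, 26:50] <- x[true_class == 3, 26:50] - 2

# K-means with K = 3 on the raw data
km3 <- kmeans(x, centers = 3, nstart = 20)

# Cross-tabulate true classes (rows) against cluster labels (columns)
tab <- table(true = true_class, cluster = km3$cluster)
tab
```

Because the cluster numbers are arbitrary, read the table by rows: a good clustering puts each true class almost entirely in a single column, whichever column that happens to be.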
Perform K-means clustering with \(K = 2\). Describe your results.
Now perform K-means clustering with \(K = 4\), and describe your results.
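To compare the \(K = 2\) and \(K = 4\) solutions side by side, a short loop over the values of K is one option. This is only a sketch under the same illustrative assumptions about the simulated data; `tabs` is a hypothetical name for the list of resulting tables.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Simulated data (shift values are assumptions)
true_class <- rep(1:3, each = 20)
x <- matrix(rnorm(60 * 50), nrow = 60)
x[true_class == 2, 1:25]  <- x[true_class == 2, 1:25] + 2
x[true_class == 3, 26:50] <- x[true_class == 3, 26:50] - 2

# Fit K-means for each K and tabulate clusters against the true classes
ks <- c(2, 4)
tabs <- lapply(ks, function(k) {
  km <- kmeans(x, centers = k, nstart = 20)
  table(true = true_class, cluster = km$cluster)
})
names(tabs) <- paste0("K=", ks)
tabs
```

With too few clusters, look for true classes that end up sharing a column; with too many, look for a true class that is split across columns.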
Now perform K-means clustering with \(K = 3\) on the first two principal component score vectors, rather than on the raw data. That is, perform K-means clustering on the 60 × 2 matrix of which the first column is the first principal component score vector, and the second column is the second principal component score vector. Comment on the results.
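Clustering on the principal component scores only changes the input matrix passed to `kmeans()`. A minimal sketch, again under the same illustrative assumptions about the simulated data and object names:

```r
set.seed(1)  # arbitrary seed for reproducibility

# Simulated data (shift values are assumptions)
true_class <- rep(1:3, each = 20)
x <- matrix(rnorm(60 * 50), nrow = 60)
x[true_class == 2, 1:25]  <- x[true_class == 2, 1:25] + 2
x[true_class == 3, 26:50] <- x[true_class == 3, 26:50] - 2

# 60 x 2 matrix: column 1 is the PC1 score vector, column 2 is the PC2 score vector
pr_out <- prcomp(x, scale. = TRUE)
pc_scores <- pr_out$x[, 1:2]

# K-means with K = 3 on the scores rather than on the raw 60 x 50 data
km_pc <- kmeans(pc_scores, centers = 3, nstart = 20)
table(true = true_class, cluster = km_pc$cluster)
```

Comparing this table with the one from the raw data shows how much of the class structure the first two components retain.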
# INSERT CODE HERE
ANSWER: