# Principal Component Analysis (PCA) in R
Principal component analysis (PCA) is a statistical technique used to reduce the dimensionality of a dataset while retaining most of the original variability in the data. It accomplishes this by transforming the data into a new coordinate system such that the greatest variance comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on.
PCA is commonly used as one step in an exploratory data analysis pipeline. It can reveal clustering, outliers, and other interesting structures in your data. This article will demonstrate conducting PCA in R using a built-in dataset.
Read the Original Article: Principal Component Analysis in R - PCA Explained
## Introduction
Principal component analysis (PCA) is a statistical technique used for dimensionality reduction. It transforms a dataset with a large number of correlated variables into a new dataset with fewer uncorrelated variables called principal components. The goal is to retain as much of the original variability in the data as possible.
Some key applications of PCA include:
- Visualizing high-dimensional data in two or three dimensions
- Identifying patterns, clusters, outliers, and other interesting structures
- Feature extraction - using the principal components as new features for modeling
- Data compression - reducing storage and computational requirements
This article will walk through an example of conducting PCA in R on a built-in dataset. We will visualize the results to explore the structure of the data.
## Loading the Data
We will use the `USArrests` dataset built into R for this analysis. This dataset contains statistics on violent crime rates in the 50 US states in 1973. Let's load it and inspect the structure:
```r
data(USArrests)
str(USArrests)
```
The output shows there are 50 observations (one for each state) and 4 numeric variables:
- Murder - murder arrests per 100,000 residents
- Assault - assault arrests per 100,000 residents
- UrbanPop - percent urban population
- Rape - rape arrests per 100,000 residents
## Checking Assumptions
Before running PCA, we need to check that the key assumptions are met:
- The data should be numeric and continuous
- The variables should be linearly related
- The data should be approximately normally distributed (helpful for interpretation, though not strictly required)
- The variables should be on a similar scale
Let's verify these one by one:
```r
# All four variables are numeric and continuous (see str() above)

# Check for linear relationships via pairwise correlations
cor(USArrests)

# Check approximate normality with Shapiro-Wilk tests
apply(USArrests, 2, shapiro.test)

# Standardize the variables so they are on a comparable scale
data <- scale(USArrests)
```
The correlations, histograms, and Shapiro-Wilk tests (output not shown) indicate the assumptions are reasonably met, so we can proceed.
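The histograms mentioned here are not produced by the snippet above; one quick base-R sketch to generate them (the 2x2 layout is my own choice):

```r
# Draw a histogram of each variable in a 2x2 grid
op <- par(mfrow = c(2, 2))
for (v in names(USArrests)) {
  hist(USArrests[[v]], main = v, xlab = v)
}
par(op)
```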
Before conducting PCA, we scale the data to put all the variables on a comparable scale.
## Performing PCA
We can now conduct PCA using the `prcomp()` function. Because we already standardized the data with `scale()`, no further scaling is needed; on raw data, the `scale. = TRUE` argument tells `prcomp()` to standardize the variables first:
```r
pca <- prcomp(data)
```
This performs the PCA and stores the results in a model object called `pca`.
## Interpreting the Results
The `summary()` function displays useful information about the results:
```r
summary(pca)
```
This includes the standard deviation of each component (the square root of its eigenvalue), the proportion of variance explained by each principal component, and the cumulative proportion of variance explained.
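Beyond `summary()`, two parts of the fitted object are worth inspecting directly: `pca$rotation` holds the loadings (the weight of each original variable in each component) and `pca$x` holds the scores (each state's coordinates in the new space). A self-contained sketch, refitting on the raw data so the snippet stands alone:

```r
pca <- prcomp(USArrests, scale. = TRUE)

# Loadings: how each original variable contributes to each component
pca$rotation

# Scores: each state's coordinates on the principal components
head(pca$x)
```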
We can also visualize the eigenvalues to see how much variance each component explains. This "scree plot" can tell us how many components are meaningful to retain:
```r
plot(pca$sdev^2,
     type = "b",
     xlab = "Principal Component",
     ylab = "Eigenvalue")
```
The eigenvalues taper off after the first few components, with the first two capturing most of the variance.
Similarly, we can plot the cumulative proportion of variance explained:
```r
pve <- pca$sdev^2 / sum(pca$sdev^2)
cum_pve <- cumsum(pve)
plot(cum_pve,
     type = "b",
     xlab = "Principal Component",
     ylab = "Cumulative Proportion of Variance Explained")
```
This shows that the first two components together explain over 80% of the variance. The remaining components add little additional information.
## Enhanced PCA Plots
The `factoextra` package provides additional functions for nicer plots of PCA results. For example, we can plot the observations as points and color them by state:
```r
library(factoextra)
fviz_pca_ind(pca,
             geom.ind = "point",
             pointsize = 2,
             col.ind = as.factor(rownames(USArrests)),
             palette = rainbow(50),
             legend.title = "State")
```
We can also label each point with the state name, color by region, add ellipses to highlight clustering and more.
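As one sketch of the region idea: R's built-in `state.region` factor lines up with `USArrests` (both are ordered alphabetically by state name), so it can drive the colors and ellipses. The specific `fviz_pca_ind()` options below are one reasonable choice, not the only one:

```r
library(factoextra)

pca <- prcomp(USArrests, scale. = TRUE)

# Label each state, color by US Census region, and add concentration ellipses
fviz_pca_ind(pca,
             geom.ind = c("point", "text"),
             col.ind = state.region,
             addEllipses = TRUE,
             legend.title = "Region")
```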
## Use Cases
Projecting data onto the first few principal components provides an accessible visualization of high-dimensional structure. It can reveal interesting patterns, such as:
- Clustering of similar states
- Outlier states that differ from the rest
- Relationships between the original variables
The principal components extracted from PCA can also be used as features in subsequent modeling rather than the original correlated variables. This is a common application of PCA for feature engineering.
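A minimal sketch of that feature-engineering step (keeping two components is an assumption here, motivated by the scree plot earlier):

```r
pca <- prcomp(USArrests, scale. = TRUE)

# Keep the first two component scores as new features
features <- as.data.frame(pca$x[, 1:2])
head(features)

# Unlike the original variables, the scores are uncorrelated by construction
cor(features$PC1, features$PC2)
```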
## Limitations
PCA has some limitations to be aware of:
- Interpretability - principal components can be difficult to interpret in terms of the original variables
- Information loss - reducing dimensionality discards some of the variation in the data
- Sensitivity to scaling - a mix of measurement units can distort results unless the variables are standardized
- Assumptions - works best with linearly related numeric data
PCA is therefore not appropriate for every dataset and is best combined with other techniques.
## Conclusion
In this article, we walked through conducting PCA in R, from loading the data and checking assumptions to interpreting the results and producing enhanced visualizations. PCA is a key technique for exploring the structure of high-dimensional datasets and extracting new features for modeling. With the right context and careful interpretation, it can reveal interesting insights into your data.