How I Perform Factor Analysis in R

RStudioDataLab
11 min read · Sep 27, 2023

Factor analysis is a statistical method that can help us understand the underlying structure of a set of variables. It reduces the complexity of data by finding a smaller number of latent factors that explain the variation in the observed variables. In this article, I will show you how I perform factor analysis in R, using an example dataset and some useful packages and functions.


Key Points

  • Factor analysis can identify the relationships among many variables and summarize them into a few factors representing common themes or dimensions.
  • It can be divided into exploratory factor analysis (EFA) and confirmatory factor analysis (CFA). EFA is used when there is no prior knowledge or hypothesis about the number or nature of the factors, while CFA is used when there is some theoretical or empirical basis for specifying the number and structure of the factors.
  • It can be performed using different methods for extracting factors, such as principal component analysis (PCA), principal axis factoring (PAF), maximum likelihood (ML), etc. Each method has its assumptions and advantages, and the choice depends on the research question and the data characteristics.
  • It can also use different rotation methods for simplifying and interpreting the factor structure, such as varimax, promax, oblimin, etc. Rotation methods can be orthogonal or oblique, depending on whether the factors are assumed to be independent or correlated.
  • It can provide different outputs and tests for evaluating the results, such as factor loadings, uniquenesses, communalities, fit indices, chi-square tests, confidence intervals, etc. These outputs and tests can help us assess how well the factor model fits the data and how meaningful and reliable the factors are.

What is Factor Analysis?

This technique aims to identify the relationships among many variables and summarize them into a few factors. Each factor represents a common theme or dimension that influences the observed variables. For example, suppose we have a dataset of students’ scores on different subjects. In that case, we can use factor analysis to determine how many factors (such as intelligence, motivation, interest, etc.) affect their performance.

There are two main types of factor analysis:

  1. Exploratory factor analysis (EFA)
  2. Confirmatory factor analysis (CFA)

Exploratory factor analysis (EFA)

EFA is used when we do not have prior knowledge or a hypothesis about the number or nature of the factors. It explores the data and tries to find the solution that best fits the data.

Confirmatory factor analysis (CFA)

CFA is used when we have some theoretical or empirical basis for specifying the number and structure of the factors. It tests whether the data are consistent with our expectations.

I will focus on EFA in this article, as it is more suitable for exploratory purposes and data reduction. EFA can be performed using different methods, such as

  1. Principal component analysis (PCA),
  2. Principal axis factoring (PAF),
  3. Maximum likelihood (ML), etc.

Each method has its assumptions and advantages, and the choice depends on the research question and the characteristics of the data.

How to Perform Factor Analysis in R?

R is a powerful programming language for statistical analysis and data visualization. It has many packages and functions that can help us perform factor analysis easily and efficiently. In this section, I will demonstrate how to perform factor analysis in R using an example dataset and some popular packages and functions.


The Dataset

The dataset I will use is called attitude, which is built into R. It contains 30 observations (departments of a large financial organization) on 7 variables, each recorded as the percentage of favourable responses to a survey question. The variables are:

  • rating: overall rating
  • complaints: handling of employee complaints
  • privileges: whether special privileges are allowed
  • learning: opportunity to learn
  • raises: whether raises are based on performance
  • critical: whether the supervisor is too critical
  • advance: opportunity for advancement

To load the dataset, we can type:
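A minimal sketch of the command (attitude ships with R's built-in datasets package, so no extra installation is needed):

```r
# attitude ships with base R's datasets package
data(attitude)   # make the data frame available
head(attitude)   # preview the first six rows
```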

This will show us the first six rows of the dataset:

We can also check the summary statistics of the dataset by typing:
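A sketch of the call:

```r
# Min, quartiles, mean, and max for each of the seven variables
summary(attitude)
```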

Descriptive Statistics

The Packages and Functions

To perform factor analysis in R, we must install and load some packages that provide useful functions for this task. The packages I will use are:

  • psych: a package for personality, psychometric, and psychological research. It has many functions for data analysis, such as principal, fa, fa.parallel, etc.
  • GPArotation: a package for performing various types of factor rotation, such as varimax, promax, oblimin, etc.
  • nFactors: a package for determining the number of factors to extract using different criteria, such as parallel analysis, scree plot, etc.

To install these packages (see “How to Import and Install Packages in R: A Comprehensive Guide” for more detail), we can use the install.packages function:
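A sketch of the one-time setup command:

```r
# One-time setup: install the packages used in this article
install.packages(c("psych", "GPArotation", "nFactors"))
```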

To load these packages, we can use the library function:

library(psych) 
library(GPArotation)
library(nFactors)

The Number of Factors

One of the most important decisions in factor analysis is how many factors to extract from the data. There are different methods and criteria for determining the optimal number of factors, such as the Kaiser criterion (eigenvalues greater than 1), the scree plot, and parallel analysis.

Each method has strengths and limitations, and the choice depends on the research question and the data characteristics.

In this article, I will use two methods to decide the number of factors: eigenvalues and parallel analysis.

Eigenvalues

Eigenvalues are the variances of the factors, and they indicate how much information each factor explains. A common rule of thumb is to retain only the factors with eigenvalues greater than 1, as they explain more variance than a single variable.

Parallel analysis

Parallel analysis is a method that compares the eigenvalues of the data with those of random data with the same dimensions. It retains only the factors whose eigenvalues are larger than those of the random data, as they indicate a significant amount of information.

How to Choose the Number of Factors in R

To perform these methods in R, we can use the eigen function from base R and the fa.parallel function from the psych package. The eigen function computes the eigenvalues and eigenvectors of a matrix, and the fa.parallel function performs parallel analysis using different methods (such as principal components, principal axis, and minimum rank factor analysis).

Compute the correlation matrix.

To use these functions, we need first to compute the correlation matrix of the data, as factor analysis is based on the correlations among the variables. We can use the cor function to do this:
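A sketch of the call (the rounding is my addition, for readability):

```r
r <- cor(attitude)   # 7 x 7 Pearson correlation matrix
round(r, 2)          # round to two decimals for readability
```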

This will show us the correlation matrix of the data:

If you want to visualize the correlation results, check out this article: Correlation Plot.

Compute the Eigenvalues

To compute the eigenvalues of the correlation matrix, we can use the eigen function:
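A sketch of the call, applied to the correlation matrix rather than the raw data:

```r
ev <- eigen(cor(attitude))   # eigendecomposition of the correlation matrix
ev$values                    # eigenvalues, in decreasing order
```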

This will show us the eigenvalues of the correlation matrix:

[1] 3.7163758 1.1409219 0.8471915 0.6128697 0.3236728 0.2185306 0.1404378

We can see that two eigenvalues are greater than one, and a third (0.85) is close to one, which suggests retaining two or three factors.

Compute the Parallel Analysis

To perform parallel analysis on the correlation matrix, we can use the fa.parallel function:
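A sketch of the call; the fm and fa arguments (my choices here) select the factoring method and request results for both principal components and factor analysis:

```r
library(psych)
# Compare observed eigenvalues against those of random data
fa.parallel(attitude, fm = "ml", fa = "both")
```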

This will show us a table and a plot of the results of parallel analysis using different methods:

Based on these results, I decided to extract three factors from the data, as they capture the most information and have a clear interpretation.

The Factor Extraction

To extract the factors from the data, we can use different functions from the psych package, depending on the method we want to use. For example, we can use the principal function for principal component analysis, the fa function for principal axis factoring or maximum likelihood, etc. Each function has different arguments and options that we can specify, such as the number of factors, the rotation method, the correlation matrix, etc.

In this article, I will use the fa function with the maximum likelihood extraction method and varimax rotation, described below.

Maximum likelihood method

The maximum likelihood method is a parametric method that assumes that the data are multivariate normal and estimates the factor loadings by maximizing a likelihood function.

Varimax rotation

The varimax rotation is an orthogonal rotation that maximizes the variance of the squared loadings within each factor, resulting in a simpler and more interpretable factor structure.

Factor Extraction in R

To use the fa function, we can type:
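A sketch of the call (fit is my name for the result object; the arguments follow the choices made above: three factors, maximum likelihood, varimax):

```r
library(psych)
# 3 factors, maximum likelihood extraction, varimax rotation
fit <- fa(attitude, nfactors = 3, fm = "ml", rotate = "varimax")
print(fit, digits = 2)
```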

This will show us a summary of the factor analysis results, such as the factor loadings, the uniquenesses, the communalities, the fit indices, etc.

  • Mean item complexity = 1.7
  • Test of the hypothesis that 3 factors are sufficient.
  • The degrees of freedom for the null model are 21, with objective function = 3.82 and chi-square = 98.75.
  • The degrees of freedom for the model are 3, and the objective function was 0.09.
  • The root mean square of the residuals (RMSR) is 0.02.
  • The df-corrected root mean square of the residuals is 0.06.
  • The harmonic n.obs is 30, with empirical chi-square = 0.75 and prob < 0.86.
  • The total n.obs was 30, with likelihood chi-square = 2.06 and prob < 0.56.
  • Tucker-Lewis Index of factoring reliability = 1.094
  • RMSEA index = 0, with 90% confidence interval 0 to 0.272.
  • BIC = -8.14
  • Fit based upon off-diagonal values = 1

The factor loadings are the correlations between the variables and the factors; they indicate how much each variable contributes to each factor. The uniquenesses are the portions of each variable's variance not explained by the factors; they indicate how much of a variable is unique and unrelated to the other variables. The communalities are the portions of each variable's variance that are explained by the factors; they indicate how much of a variable is shared with the other variables.

The fit indices measure how well the factor model fits the data. They include:

  • The chi-square statistic and its p-value: a test of whether the factor model is significantly different from the observed correlation matrix. A small chi-square value and a large p-value indicate a good fit.
  • The root mean square error of approximation (RMSEA) and its 90% confidence interval measure how well the factor model approximates the population correlation matrix. A small RMSEA value (less than 0.05) and a narrow confidence interval indicate a good fit.
  • The Tucker-Lewis index (TLI) and the comparative fit index (CFI): measure how well the factor model compares to a null model that assumes no correlations among the variables. A large TLI and CFI value (close to 1) indicate a good fit.
  • We can see that the factor loadings are high and mainly concentrated on one factor for each variable, except for raises, which has moderate loadings on both factor 1 and factor 2. This indicates that the factors are well defined and distinct from each other.
  • We can also see that the uniquenesses are low for most variables, except for raises, which has a high uniqueness of 0.59. This indicates that most variables are well explained by the factors, while raises has a lot of unique variance not captured by them.
  • Likewise, the communalities are high for most variables, except for raises, which has a low communality of 0.41. This indicates that most variables share a lot of variance with the other variables, while much of the variance in raises is not shared.
  • The fit indices are also excellent, close to their ideal values. The chi-square statistic and p-value indicate that the factor model is not significantly different from the observed correlation matrix, which means it fits the data well. The RMSEA and its confidence interval indicate that the factor model approximates the population correlation matrix very well, as the RMSEA value is zero and the confidence interval is narrow and contains zero. The TLI and CFI indicate that the factor model compares very well to a null model that assumes no correlations among the variables, as they are both equal to one.
  • Based on these results, I concluded that the factor model with three factors, maximum likelihood method, and varimax rotation is a good fit for the data, as it explains a large amount of variance in the data, has a clear and interpretable factor structure, and has excellent fit indices.

The Factor Interpretation

We need to look at the factor loadings and assign meaningful labels to each factor based on the high loadings variables to interpret the factors. We can also look at the correlations among the factors to see how they relate.

Correlations among the factors

To see the correlations among the factors, we can inspect the Phi element of the object returned by the fa function (psych reports Phi when an oblique rotation is used; with varimax the factors are orthogonal by construction):
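A sketch of how one could obtain a factor correlation matrix by refitting with an oblique rotation (oblimin) and reading the Phi element of the result; fit_obl is my name for the refitted object:

```r
library(psych)
# Refit with an oblique rotation; the Phi element of the result
# then holds the factor correlation matrix
fit_obl <- fa(attitude, nfactors = 3, fm = "ml", rotate = "oblimin")
round(fit_obl$Phi, 2)
```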

This will show us the correlation matrix of the factors:

We can see that the factors are not correlated with each other, as they have very small correlation coefficients (close to zero). This means that they are independent and orthogonal to each other.

We can use domain knowledge and common sense to find a suitable name for each factor based on its high-loading variables.

Looking at the factor loadings, we can see that:

  • Factor 1 has high loadings on rating and complaints, which relate to the employee’s performance and satisfaction. We can label this factor Performance.
  • Factor 2 has high loadings on privileges, learning, and advance, which all relate to the employee’s opportunities and benefits in the workplace. We can label this factor Opportunity.
  • Factor 3 has a high loading on critical, which relates to the employee’s perception of criticism in the workplace. We can label this factor Criticism.

Based on these labels, we can interpret the factors as follows:

  • Factor 1 (Performance) represents the employee’s performance and satisfaction in their job. Employees scoring high on this factor have high overall ratings and favourable views of how complaints are handled; employees scoring low have low ratings and unfavourable views of complaint handling.
  • Factor 2 (Opportunity) represents the employee’s opportunities and benefits in the workplace. Employees who score high on this factor have high perceived privileges, learning opportunities, and chances for advancement. Employees who score low on this factor have low perceived privileges, learning opportunities, and chances for advancement.
  • Factor 3 (Criticism) represents the employee’s perception of criticism in the workplace. Employees who score high on this factor have high perceived levels of criticism from their supervisors and colleagues. Employees who score low on this factor have low perceived levels of criticism from their supervisors and colleagues.

The Factor Scores

To obtain the factor scores for each observation, we can use the scores element of the object returned by the fa function (fa computes regression-based scores by default):
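A sketch, reusing a fitted object (here called fit, my name) from the extraction step:

```r
library(psych)
fit <- fa(attitude, nfactors = 3, fm = "ml", rotate = "varimax")
head(fit$scores)   # standardized factor scores, one row per observation
```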

This will show us a matrix of the factor scores for each observation:

The factor scores are standardized values that indicate how much each observation deviates from the mean on each factor.

We can see that some observations have extreme values on some factors, such as observation 22, which has a very high score on factor 3 (Criticism), or observation 24, which has a very high score on factor 1 (Performance) and factor 2 (Opportunity). These observations may be outliers or influential cases that need further investigation.

We can also see that some observations have similar values on some factors, such as observations 7 to 20, which have very low scores on factor 1 (Performance) and factor 2 (Opportunity). These observations may belong to a distinct group or cluster with common characteristics.

The factor scores can be used for further analysis, such as clustering, regression, classification, etc.

Conclusion

In this article, I showed how I perform factor analysis in R, using an example dataset and some useful packages and functions. I explained how to determine the number of factors, extract the factors, interpret them, and obtain the factor scores.

I hope you found this article helpful and informative. If you have any questions or comments, please get in touch with me at info@data03.online or visit my website at data03.online. You can also hire me for your data analysis projects. Thank you for reading! 😊


Stay Updated:
https://www.data03.online/p/join-our-community.html

Originally published at https://www.data03.online.


RStudioDataLab

I am a doctoral scholar, certified data analyst, freelancer, and blogger, offering complimentary tutorials to enrich our scientific community's knowledge.