Exploratory Factor Analysis in SPSS vs R
I got interested in Exploratory Factor Analysis (EFA) recently, thanks to some of the students with whom I work right now. They come from a background of statistical methods in language testing, where EFA is generally used to look for validating the test items, pondering over questions such as whether they are all assessing the same underlying constructs or different ones etc. I saw it as a good method to come up with and validate some hypotheses about data (That, by the way, is not a ground breaking idea. It has been done before.), and to use it as a way to group features based on some idea of a common underlying construct (unlike PCA, whose primary goal is dimensionality reduction) and later use Confirmatory Factor Analysis (CFA) to validate the “theory” about feature groups from EFA.
The point where the idea validation was put to a test in this particular post is this: SPSS or R? My experience working with research software generally is that there are lots of differences between different packages implementing the same algorithm. Liberato, one of my collaborators and a graduate student at ISU, with whom I worked on this, did this using SPSS, typically employed by people interested in statistical analysis. I did it several weeks later, today, using R (not out of suspicion, just out of curiosity!). This post is a quick comparison of what SPSS and R show for EFA (CFA comes in a later post).
Quick note about the data and what we are trying to explore: There is this ETS corpus of non-native English which contains TOEFL essays by people from different native language backgrounds. Based on my previous experience working with this data and its variants (here, here, and here - this last one is a pre-print), and originally during the discussions with Magdalena Wolska about this data and what it can tell us, we thought of this idea that it is possible to cluster these essays based on where the authors come from (preliminary observations in the slides here). From there, I thought of doing something more systematically. So, this is what we did as a part of imposing some systematicity into the idea- taking 3-word sequences (e.g., I like to, I agree with etc.) and using EFA to look for language clusters. Overall, there are 11 kinds of Englishes in this data (i.e., people from 11 language backgrounds). For more details on the data, please refer to one of those links above.
Doing the EFA
Okay, so let me take a 2 factor EFA as an example to compare between SPSS and R. General experimental setting in SPSS: principal axis factoring, 2 factor analysis, oblimin rotation (This was done by Liberato, as mentioned earlier.)
With R: In comparison with SPSS, I felt R’s EFA was simple to do (and free!!). You need two libraries - psych to do the EFA and GPArotation which supports different rotation functions for factor analysis. Once you (install and) load those libraries, and have your data file, it is just 1 line of R-code to do the EFA.
library(psych) #Needed for doing EFA using fa() library(GPArotation) #Needed for oblimin rotation. efadata <- read.csv("mydata.csv", header = TRUE) efadata <- efadata[,2:12] #First column has the actual word strings, which we don't need for this analysis efa1 <- fa(efadata, nfactors=2, rotate="oblimin", fm="pa") #, n.iter = 15 # 2 factor EFA with (I guess) the same settings as in SPSS.
Results and Analyses shown SPSS shows long 6 page output, with a lot of stuff there. R’s output is definitely not as “formatted” as SPSS, but it does provide us with a lot of informative output. First and foremost, our question was: how much do the outputs differ. The table below shows a summary of factor loadings in SPSS and in R.
|Text Label||SPSS-Factor 1 Loading||SPSS-Factor 2 Loading||R-Factor 1 loading||R-Factor 2 loading|
We seem to have set up SPSS to suppress loadings below 0.3 (it is the checkbox that is on by default in the GUI), otherwise, there seems to be no significant difference here (which is good). In terms of the detailed output provided, however, there seems to be some differences, which will matter depending on what you are looking at. For example, I was looking for Root Mean Square Error of Approximation (RMSEA) as the Fabrigar book (referred below) was mentioning about that as a way of measuring the goodness of fit when maximum likelihood estimation is used, and the CFA output showed that as one of the measures. I did not find it directly in the SPSS output. R directly gives that (efa1$RMSEA). Apparently there are ways to calculate this from other outputs given by SPSS though.
In the discussion on interpreting EFA, Fabrigar book discusses in detail the various measures that are relevant. E.g., communalities table - SPSS directly shows that in the output, whereas it does not appear in R summary (although you can access it by using: efa1$communalities). There is a Kaiser-Meyer-Olkin Measure of Sampling Adequacy (KMO) and Bartlett’s test that SPSS reports of which only Bartlett’s test seems to be seen in R (and without the name!). There is this measure of factoring reliability (Tucker Lewis Index) in R that is not shown in SPSS.
I perhaps will spend more time with EFA as I am still not familiar with the underlying mathematical models, and so far only going by the practice advice in Fabrigar book (and without clarity on a few things), but as per the initial question, I guess I will continue with R, as there seem to be comparable results and I don’t have to bother about getting SPSS license and stuff.
- For some practical advice on EFA, I found the following book very useful (all thanks to those multiple students who recommended this to me)
Exploratory Factor Analysis
Leander R. Fabrigar and Daune T. Wegener
Understanding Statistics series
Oxford University Press