3  Opening datasets in R

3.1 Importing files: essential information

Spreadsheet and text files

In principle, any dataset whether in CSV, or spreadsheet format can be imported into R:

  • CSV files can be directly imported with read.csv() from Base R.

  • MS Excel spreadsheets can be opened with the read_excel() function from the readXL package or opened/written with read.xlsx() and write.xlsx() from the xlsx package.

  • ODS (Open Document Spreadheets) files from LibreOffice/OpenOffice Calc can be opened and written with read_ods()/write_ods() from the readODS package.

SAS, SPSS, or Stata

Currently, the haven packages provides the most versatile and straightforward route to importing data from other mainstream statistical software. SPSS, Stata and SAS files can be opened with respectively read_spss(), read_dta() and read_sas(). The only potential downside is that it relies on ad hoc data formats and data structures for converting labelled categorical variables and attempts to mimic SPSS/Stata’s value and variable labels. More information is available here. We will be using haven throughout this guide.

For the record, the foreign package used to be the standard for importing MINITAB, SAS, SPSS and Stata datasets into R, but its development ceased in 2000 (Stata version 12).

3.2 The 2017 British Social Attitudes Survey

In the remainder of this guide, we will be using the British Social Attitudes Survey, 2017, Environment and Politics: Open Access Teaching Dataset, which can be downloaded from the UK Data Service website. We will import the SPSS version of the dataset, We will import the SPSS version of the dataset. We will assume that the data is saved in a folder named UKDA inside the Documents folder (on a Windows computer). We will set . C:\\Users\\Your_User_Name_Here\\Documents as the default working directory.

This way, we won’t have to specify the full path each time that we will be opening or saving file.

setwd("C:\\Users\\Your_User_Name_Here\\Documents\\UKDS")

We can finally open the file:

bsa<-read_spss("8849spss_V1/bsa2017_open_enviropol.sav"
               )

Typing:

ls()
[1] "bsa"

will show us that the object ‘bsa’ has appeared in the environment.

3.3 Understanding the dataset

As mentioned in earlier sections, we can find out the number of observations and variables in the dataset by typing the following:

dim(bsa)
[1] 3988   25

We can see that there are 3988 observations and 25 variables in the dataset.

What if we want to get the list of all variables in the dataset? We need to type:

ls(bsa)
 [1] "actchar"          "actpol"           "carallow"         "carenvdc"        
 [5] "carnod2"          "carreduc"         "cartaxhi"         "CCBELIEV"        
 [9] "ChildHh"          "eq_inc_quintiles" "govnosa2"         "HEdQual3"        
[13] "leftrigh"         "libauth"          "Married"          "PartyId2"        
[17] "plnenvt"          "plnuppri"         "Politics"         "RAgeCat"         
[21] "RClassGp"         "Rsex"             "Sserial"          "Voted"           
[25] "WtFactor"        

If we want to get a better sense of the data, we use the head() function which will return the first six rows.

head(bsa)
  Sserial Rsex RAgeCat Married ChildHh HEdQual3 eq_inc_quintiles RClassGp
1  290001    1       3       1       1        3                3        4
2  290002    2       7       2       2        2               NA        5
3  290003    2       4       1       1        1                3        1
4  290004    2       4       1       1        1                3        1
5  290005    1       7       4       2        2                4        1
6  290006    2       2       4       1        2                1        1
  CCBELIEV carallow carreduc carnod2 cartaxhi carenvdc plnenvt plnuppri
1        2       NA       NA      NA       NA       NA      NA       NA
2        3       NA       NA      NA       NA       NA      NA       NA
3       NA       NA       NA      NA       NA       NA      NA       NA
4       NA       NA       NA      NA       NA       NA      NA       NA
5       NA       NA       NA      NA       NA       NA      NA       NA
6       NA       NA       NA      NA       NA       NA      NA       NA
  Politics Voted actchar actpol govnosa2 PartyId2 leftrigh  libauth  WtFactor
1        3     1       5      5        2        2      1.0 4.500000 0.9380587
2        1     1      NA     NA        4        2      1.8 4.333333 0.6844214
3        3     1      NA     NA       NA        2       NA       NA 0.9060821
4        1     1       4      4        2        2      1.5 2.500000 1.3085513
5        4     1      NA     NA       NA        3      2.0 3.166667 0.4392823
6        4     1      NA     NA        3       NA      3.0 3.000000 0.5695720

Single variables may be also summarised with head().

For example:

head(bsa$Rsex)
[1] Male   Female Female Female Male   Female
Levels: skipped or na Male Female Dontknow Refusal

displays the first six observations of Rsex (gender of respondents).

Simply typing the name of a variable as in:

bsa$Rsex

will list its first 1000 observations.

Other commands, such as summary() provide more synthetic information.

summary(bsa$Rsex)
skipped or na          Male        Female      Dontknow       Refusal 
            0          1806          2182             0             0 

summary() is a generic function that tailors the most appropriate output to an object class. Since Rsex is a categorical variable, the output of summary() is identical to what we would have obtained with the straightforward tabulation function table():

table(bsa$Rsex)

skipped or na          Male        Female      Dontknow       Refusal 
            0          1806          2182             0             0 

When encountering a numeric variable, summary() will compute basic descriptive statistics (mean, median, quartiles, maximum and minimum). For example, in the case of the libertarian-authoritarian scale libauth:

summary(bsa$libauth)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  1.167   3.000   3.500   3.511   4.000   5.000     775 

3.4 Identifying and selecting variables

As we have already seen, variables are objects. R automatically stores variables using the appropriate object class. Categorical variables are ‘Factors’ with ‘Levels’ as categories within these, while continuous variables are ‘Numeric’ types of objects. The class() displays the type of an object:.

class(bsa$Rsex)
[1] "factor"

The levels() function returns the categories of the variable.

levels(bsa$Rsex)
[1] "skipped or na" "Male"          "Female"        "Dontknow"     
[5] "Refusal"      

3.5 Factor variables in the haven package

As we have seen in the previous chapter, statistical software such as SPSS or Stata treat categorical variables as numerical variables. Values labels are then ‘attached’, allocating a substantive meaning to these values. On the other hand, R records categorical variables in objects that are called factors, that may be ordered or not. Factors consist of a limited number of levels sequentially numbered values (ranging from 1 to the number of levels) that are paired with arbitrary character strings. The number of a factor level is therefore distinct from the values categorical variables are allocated in codebooks.

Unfortunately, there are no straightforward ways to convert SPSS or Stata labelled categorical variables into R factors. The approach followed by the haven package is to preserve the arbitrary numeric values of the original variables, and add attributes that contain the value labels that can be manipulated separately. Attributes are a special type of R objects that have a name, and can be retrieved using the attr() function. haven-converted categorical variables all have a ‘label’ and ‘labels’ attribute. The former is the variable description aka variable label, the latter the value labels.

Let’s examine the original variable description and value labels with the attr() function.

attr(bsa$HEdQual3, "label")
[1] "Highest educational qual obtained - dv"

We do the same with value labels:

attr(bsa$HEdQual3, "labels")
                          Degree Higher educ below degree/A level 
                               1                                2 
            O level or equiv/CSE                 No qualification 
                               3                                4 
                   DK/Refusal/NA 
                               8 

The value labels displayed above include both the text description and the original (i.e. codebook) numeric values from SPSS/Stata.

These haven-imported numeric variables can be converted into R factors using as_factor(), with their numeric values simply reflecting their order in the vector of levels. The factor levels in the converted variables can consist of either the value labels:

levels(
       as_factor(bsa$HEdQual3, levels="labels")
       )
[1] "Degree"                           "Higher educ below degree/A level"
[3] "O level or equiv/CSE"             "No qualification"                
[5] "DK/Refusal/NA"                   

… or, as in the earlier output, of both the original SPSS/Stata numeric values and the value labels:

levels(
       as_factor(bsa$HEdQual3, levels="both")
       )
[1] "[1] Degree"                          
[2] "[2] Higher educ below degree/A level"
[3] "[3] O level or equiv/CSE"            
[4] "[4] No qualification"                
[5] "[8] DK/Refusal/NA"