setwd("C:\\Users\\Your_User_Name_Here\\Documents\\UKDS")
3.1 Importing files: essential information
Spreadsheet and text files
In principle, any dataset whether in CSV, or spreadsheet format can be imported into R:
CSV files can be directly imported with read.csv() from Base R.
MS Excel spreadsheets can be opened with the
read_excel()
function from the readXL package or opened/written withread.xlsx()
andwrite.xlsx()
from the xlsx package.ODS (Open Document Spreadheets) files from LibreOffice/OpenOffice Calc can be opened and written with
read_ods()/write_ods()
from the readODS package.
SAS, SPSS, or Stata
Currently, the haven
packages provides the most versatile and straightforward route to importing data from other mainstream statistical software. SPSS, Stata and SAS files can be opened with respectively read_spss()
, read_dta()
and read_sas()
. The only potential downside is that it relies on ad hoc data formats and data structures for converting labelled categorical variables and attempts to mimic SPSS/Stata’s value and variable labels. More information is available here. We will be using haven
throughout this guide.
For the record, the foreign
package used to be the standard for importing MINITAB, SAS, SPSS and Stata datasets into R, but its development ceased in 2000 (Stata version 12).
3.3 Understanding the dataset
As mentioned in earlier sections, we can find out the number of observations and variables in the dataset by typing the following:
dim(bsa)
[1] 3988 25
We can see that there are 3988 observations and 25 variables in the dataset.
What if we want to get the list of all variables in the dataset? We need to type:
ls(bsa)
[1] "actchar" "actpol" "carallow" "carenvdc"
[5] "carnod2" "carreduc" "cartaxhi" "CCBELIEV"
[9] "ChildHh" "eq_inc_quintiles" "govnosa2" "HEdQual3"
[13] "leftrigh" "libauth" "Married" "PartyId2"
[17] "plnenvt" "plnuppri" "Politics" "RAgeCat"
[21] "RClassGp" "Rsex" "Sserial" "Voted"
[25] "WtFactor"
If we want to get a better sense of the data, we use the head()
function which will return the first six rows.
head(bsa)
Sserial Rsex RAgeCat Married ChildHh HEdQual3 eq_inc_quintiles RClassGp
1 290001 1 3 1 1 3 3 4
2 290002 2 7 2 2 2 NA 5
3 290003 2 4 1 1 1 3 1
4 290004 2 4 1 1 1 3 1
5 290005 1 7 4 2 2 4 1
6 290006 2 2 4 1 2 1 1
CCBELIEV carallow carreduc carnod2 cartaxhi carenvdc plnenvt plnuppri
1 2 NA NA NA NA NA NA NA
2 3 NA NA NA NA NA NA NA
3 NA NA NA NA NA NA NA NA
4 NA NA NA NA NA NA NA NA
5 NA NA NA NA NA NA NA NA
6 NA NA NA NA NA NA NA NA
Politics Voted actchar actpol govnosa2 PartyId2 leftrigh libauth WtFactor
1 3 1 5 5 2 2 1.0 4.500000 0.9380587
2 1 1 NA NA 4 2 1.8 4.333333 0.6844214
3 3 1 NA NA NA 2 NA NA 0.9060821
4 1 1 4 4 2 2 1.5 2.500000 1.3085513
5 4 1 NA NA NA 3 2.0 3.166667 0.4392823
6 4 1 NA NA 3 NA 3.0 3.000000 0.5695720
Single variables may be also summarised with head()
.
For example:
head(bsa$Rsex)
[1] Male Female Female Female Male Female
Levels: skipped or na Male Female Dontknow Refusal
displays the first six observations of Rsex
(gender of respondents).
Simply typing the name of a variable as in:
bsa$Rsex
will list its first 1000 observations.
Other commands, such as summary()
provide more synthetic information.
summary(bsa$Rsex)
skipped or na Male Female Dontknow Refusal
0 1806 2182 0 0
summary()
is a generic function that tailors the most appropriate output to an object class. Since Rsex
is a categorical variable, the output of summary()
is identical to what we would have obtained with the straightforward tabulation function table()
:
table(bsa$Rsex)
skipped or na Male Female Dontknow Refusal
0 1806 2182 0 0
When encountering a numeric variable, summary()
will compute basic descriptive statistics (mean, median, quartiles, maximum and minimum). For example, in the case of the libertarian-authoritarian scale libauth
:
summary(bsa$libauth)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
1.167 3.000 3.500 3.511 4.000 5.000 775
3.4 Identifying and selecting variables
As we have already seen, variables are objects. R automatically stores variables using the appropriate object class. Categorical variables are ‘Factors’ with ‘Levels’ as categories within these, while continuous variables are ‘Numeric’ types of objects. The class()
displays the type of an object:.
class(bsa$Rsex)
[1] "factor"
The levels() function returns the categories of the variable.
levels(bsa$Rsex)
[1] "skipped or na" "Male" "Female" "Dontknow"
[5] "Refusal"
3.5 Factor variables in the haven package
As we have seen in the previous chapter, statistical software such as SPSS or Stata treat categorical variables as numerical variables. Values labels are then ‘attached’, allocating a substantive meaning to these values. On the other hand, R records categorical variables in objects that are called factors, that may be ordered or not. Factors consist of a limited number of levels sequentially numbered values (ranging from 1 to the number of levels) that are paired with arbitrary character strings. The number of a factor level is therefore distinct from the values categorical variables are allocated in codebooks.
Unfortunately, there are no straightforward ways to convert SPSS or Stata labelled categorical variables into R factors. The approach followed by the haven
package is to preserve the arbitrary numeric values of the original variables, and add attributes that contain the value labels that can be manipulated separately. Attributes are a special type of R objects that have a name, and can be retrieved using the attr()
function. haven
-converted categorical variables all have a ‘label’ and ‘labels’ attribute. The former is the variable description aka variable label, the latter the value labels.
Let’s examine the original variable description and value labels with the attr()
function.
attr(bsa$HEdQual3, "label")
[1] "Highest educational qual obtained - dv"
We do the same with value labels:
attr(bsa$HEdQual3, "labels")
Degree Higher educ below degree/A level
1 2
O level or equiv/CSE No qualification
3 4
DK/Refusal/NA
8
The value labels displayed above include both the text description and the original (i.e. codebook) numeric values from SPSS/Stata.
These haven
-imported numeric variables can be converted into R factors using as_factor()
, with their numeric values simply reflecting their order in the vector of levels. The factor levels in the converted variables can consist of either the value labels:
levels(
as_factor(bsa$HEdQual3, levels="labels")
)
[1] "Degree" "Higher educ below degree/A level"
[3] "O level or equiv/CSE" "No qualification"
[5] "DK/Refusal/NA"
… or, as in the earlier output, of both the original SPSS/Stata numeric values and the value labels:
levels(
as_factor(bsa$HEdQual3, levels="both")
)
[1] "[1] Degree"
[2] "[2] Higher educ below degree/A level"
[3] "[3] O level or equiv/CSE"
[4] "[4] No qualification"
[5] "[8] DK/Refusal/NA"