$lnleftrigh <- log(bsa$leftrigh) bsa
In this section we will cover how to recode variables and deal with missing data.
4.1 Creating and transforming numeric variables
Let’s say we would like to transform our numerical political orientation variable: leftrigh
into a logarithmic scale. We can use the log()
function which is available in R Base and simply returns the natural logarithm (base-e). We will use the assignment operator ( <- ) to create a new variable called ‘lnleftright’ from the original variable
Note that if we had not specified bsa$
the command would have created a transformed variable outside of the BSA data frame. We can now check the results with summary()
summary(cbind(bsa$lnleftrigh,bsa$leftrigh))
V1 V2
Min. :0.0000 Min. :1.00
1st Qu.:0.6931 1st Qu.:2.00
Median :0.8755 Median :2.40
Mean :0.8707 Mean :2.52
3rd Qu.:1.0986 3rd Qu.:3.00
Max. :1.6094 Max. :5.00
NA's :782 NA's :782
It is not possible to pass several variables names directly to summary()
. We need to group them first into a temporary object using cbind()
. In the output V1
refers to the first variable, lnleftrigh
.
We can easily create completely new variables in the dataset. For instance, the following will create test
with a constant value of 1.
$test <- 1
bsa
summary(bsa$test)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1 1 1 1 1 1
4.2 Recoding variables
Recoding categorical and numeric variables is very common in survey research. For example, let us create a dichotomic version of the marital status of BSA respondents. The original marital status is recorded by Married
. Simply using summary
will return descriptive statistics of its numeric values, which is not what we need here:
summary(bsa$Married)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
1.000 1.000 1.000 1.977 3.000 4.000 1
What we want instead is to work with the factor-converted version of Married
.
summary(as_factor(bsa$Married))
Married/living as married Separated/divorced Widowed
2209 530 380
Never married No information NA's
868 0 1
Even better, we can produce the exhaustive list of all of Married
’s factor levels:
levels(as_factor(bsa$Married))
[1] "Married/living as married" "Separated/divorced"
[3] "Widowed" "Never married"
[5] "No information"
We can now move on to creating a new variable called Married2
where respondents are categorised into two new categories: ‘Not partnered’ and ‘Partnered’. The “separated/divorced” and ‘Never married’ categories of the ‘Married’ variable are recoded as ‘Not partnered’. It is always advised to create new variables when recoding old ones so the original data is not tampered with.
$Married.f<-as_factor(bsa$Married,"labels")
bsa$Married2 <- ifelse(bsa$Married.f=="Married/living as married",
bsa"Partnered",bsa$Married)
$Married2 <- ifelse(bsa$Married.f=="Widowed" |
bsa$Married.f=="Never married" |
bsa$Married.f=="Separated/divorced",
bsa"Not partnered",bsa$Married2)
$Married2<-as.factor(bsa$Married2)
bsalevels(bsa$Married2)
[1] "Not partnered" "Partnered"
summary(bsa$Married2)
Not partnered Partnered NA's
1778 2209 1
The second and fourth categories have been renamed to ‘Not partnered’. Now we have two levels: ‘Partnered’ and ‘Not partnered’
ifelse()
is a convenient tool when it is required to work with Base R functions only or when variables have a limited number of categories. The syntax consists of three terms:
- the condition to be evaluated: for example
bsa$Married.f=="Widowed"
- what happens if the condition is met, for example the new variable takes the value “Not partnered”
- what happens if the condition is not met, for example, the new variable retains its existing value
More complex cases may require a more advanced function. The dplyr
library provides a comprehensive set of data manipulation tools, such as case_when()
.
library(dplyr)
<-bsa%>%
bsamutate(Married3=case_when(
== "Married/living as married" ~ "Partnered",
Married.f == "Separated/divorced" |
Married.f == "Widowed" ~ "Not Partnered",
Married.f == "Never married" ~ "Not Partnered"
Married.f
)
)$Married3<-as.factor(bsa$Married3)
bsasummary(bsa$Married3)
Not Partnered Partnered NA's
1778 2209 1
The syntax above created the Married3
variable, which is identical to Married2
. Let’s decompose it:
dplyr
use thepipe
symbol ie%>%
or|>
which enables to sequentially combine functions. We will come back to this later in this guide.mutate()
is the generic variable creation/alteration command, and can handle complex combinations of conditions as well as multiple simultaneous variable creation operations.case_when()
is the function that allows recoding numerical, character, or factor variables. On the left hand side of the tilde~
are the condition or the values that need to be matched in the original variable , and on the right hand side, the attributed ie recoded values in the new variable. Note that in this case, the recoded variable is by default a character object and needs to be converted into a factor for easier manipulation.
Extra tips:
- As with any data manipulation exercise, caution is required, and it is recommended to create new variables with the recoded value rather than alter an original variable when handling missing values.
- The standard value attribution command in R is
<-
. However,=
will also work in many cases. - Unless explicitly specified (in our case, by adding the bsa$ prefix to variable name), the objects created are not included in the data frame from which they were computed.
4.3 Missing Values
Explicit or system missing values in R (i.e. values that R itself considers as missing) are represented as NA
for factors and numerical variables. For character variables, missing values are simply empty strings, ie ""
. R has a series of functions specifically designed to handle NAs.
R has fewer safety nets than other packages for handling missing values. Most functions won’t warn users about whether there are observations with missing values that have been dropped. On the other hand, some commands will return error messages and by default won’t run when missing values are present. This is the case of mean()
for example.
4.3.1 Inspecting missing data
The logical function is.na()
assesses each observation in variables and identifies whether cases are valid or missing. The result will appear as a boolean TRUE/FALSE vector for each observation. is.na()
can be combined with other functions:
With
table()
in order to get the frequencies of missing values of a specific variable.With
sum()
in order to count the number of missing observations of variables or whole datasets.
table( # number of missing values in the leftright variable
is.na(bsa$leftrigh)
)
FALSE TRUE
3206 782
sum( # of missing values in the whole dataset
is.na(bsa)
)
[1] 36593
mean( # proportion of NAs for a variable
is.na(bsa$leftrigh)
)
[1] 0.1960883
mean( # proportion of NAs in the dataset
is.na(bsa)
# returns the proportion in the dataset )
[1] 0.3058592
4.3.2 Recoding missing values as NA (continuous variables)
It may sometimes be useful to recode implicit missing values (ie considered by the data producer as missing, but not by R) of either numeric objects or factors into <NA>
, in order to simplify case selection when conducting analyses. This can either be done with Base R code or the more advanced data manipulation functions from the dplyr package that we explored earlier.
Let’s assume for a moment that we would like to eliminate respondents aged under 25 for our analysis. A safe way to proceed is by creating a new dataset.
# convert labelled numeric variable into factor for clarity
$RAgeCat.f <- as_factor(bsa$RAgeCat)
bsatable(bsa$RAgeCat.f)
18-24 25-34 35-44 45-54 55-59 60-64 65+
223 591 650 729 320 333 1138
DK/Refused
0
# retains all values except those that match the condition:
<-bsa[!bsa$RAgeCat.f=="18-24", ]
bsa.adultstable(bsa.adults$RAgeCat.f)
18-24 25-34 35-44 45-54 55-59 60-64 65+
0 591 650 729 320 333 1138
DK/Refused
0
We can also notice that although there are now no observations left in the 18-24 category, it is still displayed by table()
. This is because levels are attributes of factors and are not deleted with observations. We can remove unused levels permanently with droplevels()
$RAgeCat.f<-droplevels(bsa.adults$RAgeCat.f)
bsa.adultstable(bsa.adults$RAgeCat.f)
25-34 35-44 45-54 55-59 60-64 65+
591 650 729 320 333 1138
4.3.3 Working with missing values
Explicit missing values (coded as NA) can be taken care of by R’s own missing values functions. For instance using the na.rm=T
or na.rm=TRUE
option will remove missing values from an analysis (typing ?na.rm
will provide more information). Below is a summary of how NAs are dealt with by common R commands:
Command | Default action | Parameter |
---|---|---|
mean(), sd(),median() | Includes NA (may return an error) | na.rm=T |
cor(),cov() | Includes NA (may return an error) | use=“complete.obs” |
table() | Excludes NA | useNA = “always” to display NAs |
xtabs() | Excludes NA | addNA = T to display NAs |
lm(),glm() | Excludes NA | na.action=NULL |
4.4 Subsetting datasets
When analysing survey data. it is often necessary to limit the scope of computation to specific groups or subset of the data we may be interested in. There are many ways of subsetting datasets in R. We will review the most common here.
Using Base R
Most subsetting commands involve some form of conditions whereby the characteristics of a subsample of interest are specified. Suppose we would like to examine the interest for politics among people aged 18-24.
We can either create an adhoc data frame:
table(bsa$Politics)
1 2 3 4 5
739 982 1179 708 379
<-bsa[bsa$RAgeCat.f=="18-24",]
bsa.youngtable(bsa.young$Politics)
1 2 3 4 5
29 41 72 56 25
Or we can directly limit the extent of the analysis on the go:
table(
$Politics[bsa$RAgeCat.f=="18-24"]
bsa )
1 2 3 4 5
29 41 72 56 25
In the first example it was necessary to include a comma after the condition. This is meant to indicate that we want to retain all variables ie columns in the dataset. The comma is not necessary in the second example as we are already working with a single variable.
Using dplyr*
Now suppose we want to further restrict the analysis to people self-identifying as males. We could use the same Base R syntax as above, but with more than one condition coding tends to become a bit cumbersome. We could instead use the more convenient syntax from the dplyr
package. Either:
<-bsa%>%
bsa.young.malesfilter(RAgeCat.f=="18-24" & as_factor(Rsex)=="Male")
table(as_factor(bsa.young.males$Politics))
Item not applicable ... a great deal, quite a lot, some,
0 15 22 30
not very much, or, none at all? Don`t know Refusal
18 9 0 0
Or as before, embed it as a condition within table()
:
table(
as_factor(
%>%
bsafilter(RAgeCat.f=="18-24" & as_factor(Rsex)=="Male")%>%
select(Politics)
) )
Politics
Item not applicable ... a great deal, quite a lot, some,
0 15 22 30
not very much, or, none at all? Don`t know Refusal
18 9 0 0
filter()
and select()
are the functions that specify respectively rows and columns to be kept/removed. They can be combined or used independently and used in any order.
We are now equipped with the necessary information to move to the next stage and carry out basic analysis using R.