4  Essentials of Data Manipulation

In this section we will cover how to recode variables and deal with missing data.

4.1 Creating and transforming numeric variables

Let’s say we would like to transform our numerical political orientation variable: leftrigh into a logarithmic scale. We can use the log() function which is available in R Base and simply returns the natural logarithm (base-e). We will use the assignment operator ( <- ) to create a new variable called ‘lnleftright’ from the original variable

bsa$lnleftrigh <- log(bsa$leftrigh)

Note that if we had not specified bsa$ the command would have created a transformed variable outside of the BSA data frame. We can now check the results with summary()

summary(cbind(bsa$lnleftrigh,bsa$leftrigh))
       V1               V2      
 Min.   :0.0000   Min.   :1.00  
 1st Qu.:0.6931   1st Qu.:2.00  
 Median :0.8755   Median :2.40  
 Mean   :0.8707   Mean   :2.52  
 3rd Qu.:1.0986   3rd Qu.:3.00  
 Max.   :1.6094   Max.   :5.00  
 NA's   :782      NA's   :782   

It is not possible to pass several variables names directly to summary(). We need to group them first into a temporary object using cbind(). In the output V1 refers to the first variable, lnleftrigh.

We can easily create completely new variables in the dataset. For instance, the following will create test with a constant value of 1.

bsa$test <- 1

summary(bsa$test) 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      1       1       1       1       1       1 

4.2 Recoding variables

Recoding categorical and numeric variables is very common in survey research. For example, let us create a dichotomic version of the marital status of BSA respondents. The original marital status is recorded by Married. Simply using summary will return descriptive statistics of its numeric values, which is not what we need here:

summary(bsa$Married)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  1.000   1.000   1.000   1.977   3.000   4.000       1 

What we want instead is to work with the factor-converted version of Married.

summary(as_factor(bsa$Married))
Married/living as married        Separated/divorced                   Widowed 
                     2209                       530                       380 
            Never married            No information                      NA's 
                      868                         0                         1 

Even better, we can produce the exhaustive list of all of Married’s factor levels:

levels(as_factor(bsa$Married))
[1] "Married/living as married" "Separated/divorced"       
[3] "Widowed"                   "Never married"            
[5] "No information"           

We can now move on to creating a new variable called Married2 where respondents are categorised into two new categories: ‘Not partnered’ and ‘Partnered’. The “separated/divorced” and ‘Never married’ categories of the ‘Married’ variable are recoded as ‘Not partnered’. It is always advised to create new variables when recoding old ones so the original data is not tampered with.

bsa$Married.f<-as_factor(bsa$Married,"labels")
bsa$Married2 <- ifelse(bsa$Married.f=="Married/living as married",
                                      "Partnered",bsa$Married)
bsa$Married2 <- ifelse(bsa$Married.f=="Widowed" | 
                       bsa$Married.f=="Never married" | 
                      bsa$Married.f=="Separated/divorced",
                                      "Not partnered",bsa$Married2)

bsa$Married2<-as.factor(bsa$Married2)
levels(bsa$Married2)
[1] "Not partnered" "Partnered"    
summary(bsa$Married2)
Not partnered     Partnered          NA's 
         1778          2209             1 

The second and fourth categories have been renamed to ‘Not partnered’. Now we have two levels: ‘Partnered’ and ‘Not partnered’

ifelse() is a convenient tool when it is required to work with Base R functions only or when variables have a limited number of categories. The syntax consists of three terms:

  1. the condition to be evaluated: for example bsa$Married.f=="Widowed"
  2. what happens if the condition is met, for example the new variable takes the value “Not partnered”
  3. what happens if the condition is not met, for example, the new variable retains its existing value

More complex cases may require a more advanced function. The dplyr library provides a comprehensive set of data manipulation tools, such as case_when().

library(dplyr)
bsa<-bsa%>%
     mutate(Married3=case_when(
               Married.f == "Married/living as married" ~ "Partnered",  
               Married.f == "Separated/divorced" | 
               Married.f == "Widowed" ~ "Not Partnered",
               Married.f == "Never married" ~ "Not Partnered"
)
)
bsa$Married3<-as.factor(bsa$Married3)
summary(bsa$Married3)
Not Partnered     Partnered          NA's 
         1778          2209             1 

The syntax above created the Married3 variable, which is identical to Married2. Let’s decompose it:

  • dplyr use the pipe symbol ie %>% or |> which enables to sequentially combine functions. We will come back to this later in this guide.
  • mutate() is the generic variable creation/alteration command, and can handle complex combinations of conditions as well as multiple simultaneous variable creation operations.
  • case_when() is the function that allows recoding numerical, character, or factor variables. On the left hand side of the tilde ~ are the condition or the values that need to be matched in the original variable , and on the right hand side, the attributed ie recoded values in the new variable. Note that in this case, the recoded variable is by default a character object and needs to be converted into a factor for easier manipulation.

Extra tips:

  • As with any data manipulation exercise, caution is required, and it is recommended to create new variables with the recoded value rather than alter an original variable when handling missing values.
  • The standard value attribution command in R is <-. However, = will also work in many cases.
  • Unless explicitly specified (in our case, by adding the bsa$ prefix to variable name), the objects created are not included in the data frame from which they were computed.

4.3 Missing Values

Explicit or system missing values in R (i.e. values that R itself considers as missing) are represented as NA for factors and numerical variables. For character variables, missing values are simply empty strings, ie "". R has a series of functions specifically designed to handle NAs.

R has fewer safety nets than other packages for handling missing values. Most functions won’t warn users about whether there are observations with missing values that have been dropped. On the other hand, some commands will return error messages and by default won’t run when missing values are present. This is the case of mean() for example.

4.3.1 Inspecting missing data

The logical function is.na() assesses each observation in variables and identifies whether cases are valid or missing. The result will appear as a boolean TRUE/FALSE vector for each observation. is.na() can be combined with other functions:

  • With table() in order to get the frequencies of missing values of a specific variable.

  • With sum() in order to count the number of missing observations of variables or whole datasets.

table(                     # number of missing values in the leftright variable
      is.na(bsa$leftrigh)
      ) 

FALSE  TRUE 
 3206   782 
sum(                       # of missing values in the whole dataset
   is.na(bsa)
   ) 
[1] 36593
mean(                     # proportion of NAs for a variable
     is.na(bsa$leftrigh)
     ) 
[1] 0.1960883
mean(                     # proportion of NAs in the dataset
     is.na(bsa)
     ) # returns the proportion in the dataset
[1] 0.3058592

4.3.2 Recoding missing values as NA (continuous variables)

It may sometimes be useful to recode implicit missing values (ie considered by the data producer as missing, but not by R) of either numeric objects or factors into <NA>, in order to simplify case selection when conducting analyses. This can either be done with Base R code or the more advanced data manipulation functions from the dplyr package that we explored earlier.

Let’s assume for a moment that we would like to eliminate respondents aged under 25 for our analysis. A safe way to proceed is by creating a new dataset.

# convert labelled numeric variable into factor for clarity
bsa$RAgeCat.f <- as_factor(bsa$RAgeCat) 
table(bsa$RAgeCat.f)                      

     18-24      25-34      35-44      45-54      55-59      60-64        65+ 
       223        591        650        729        320        333       1138 
DK/Refused 
         0 
# retains all values except those that match the condition:  
bsa.adults<-bsa[!bsa$RAgeCat.f=="18-24", ] 
table(bsa.adults$RAgeCat.f)                      

     18-24      25-34      35-44      45-54      55-59      60-64        65+ 
         0        591        650        729        320        333       1138 
DK/Refused 
         0 

We can also notice that although there are now no observations left in the 18-24 category, it is still displayed by table(). This is because levels are attributes of factors and are not deleted with observations. We can remove unused levels permanently with droplevels()

bsa.adults$RAgeCat.f<-droplevels(bsa.adults$RAgeCat.f)
table(bsa.adults$RAgeCat.f)

25-34 35-44 45-54 55-59 60-64   65+ 
  591   650   729   320   333  1138 

4.3.3 Working with missing values

Explicit missing values (coded as NA) can be taken care of by R’s own missing values functions. For instance using the na.rm=T or na.rm=TRUE option will remove missing values from an analysis (typing ?na.rm will provide more information). Below is a summary of how NAs are dealt with by common R commands:

Treatment of missing values by R commands
Command Default action Parameter
mean(), sd(),median() Includes NA (may return an error) na.rm=T
cor(),cov() Includes NA (may return an error) use=“complete.obs”
table() Excludes NA useNA = “always” to display NAs
xtabs() Excludes NA addNA = T to display NAs
lm(),glm() Excludes NA na.action=NULL

4.4 Subsetting datasets

When analysing survey data. it is often necessary to limit the scope of computation to specific groups or subset of the data we may be interested in. There are many ways of subsetting datasets in R. We will review the most common here.

Using Base R

Most subsetting commands involve some form of conditions whereby the characteristics of a subsample of interest are specified. Suppose we would like to examine the interest for politics among people aged 18-24.

We can either create an adhoc data frame:

table(bsa$Politics)

   1    2    3    4    5 
 739  982 1179  708  379 
bsa.young<-bsa[bsa$RAgeCat.f=="18-24",]  
table(bsa.young$Politics)

 1  2  3  4  5 
29 41 72 56 25 

Or we can directly limit the extent of the analysis on the go:

  table(
        bsa$Politics[bsa$RAgeCat.f=="18-24"]
        )

 1  2  3  4  5 
29 41 72 56 25 

In the first example it was necessary to include a comma after the condition. This is meant to indicate that we want to retain all variables ie columns in the dataset. The comma is not necessary in the second example as we are already working with a single variable.

Using dplyr*

Now suppose we want to further restrict the analysis to people self-identifying as males. We could use the same Base R syntax as above, but with more than one condition coding tends to become a bit cumbersome. We could instead use the more convenient syntax from the dplyr package. Either:

bsa.young.males<-bsa%>%
  filter(RAgeCat.f=="18-24" & as_factor(Rsex)=="Male")  

table(as_factor(bsa.young.males$Politics))

Item not applicable   ... a great deal,        quite a lot,               some, 
                  0                  15                  22                  30 
     not very much,    or, none at all?          Don`t know             Refusal 
                 18                   9                   0                   0 

Or as before, embed it as a condition within table() :

  table(
         as_factor(
          bsa%>%
          filter(RAgeCat.f=="18-24" & as_factor(Rsex)=="Male")%>%
          select(Politics)
  )
  )
Politics
Item not applicable   ... a great deal,        quite a lot,               some, 
                  0                  15                  22                  30 
     not very much,    or, none at all?          Don`t know             Refusal 
                 18                   9                   0                   0 

filter() and select() are the functions that specify respectively rows and columns to be kept/removed. They can be combined or used independently and used in any order.

We are now equipped with the necessary information to move to the next stage and carry out basic analysis using R.