5 Descriptive statistics – Analysing social survey data using R

5.1 Continuous variables

Producing descriptive statistics in R is relatively straightforward, as key functions are included by default in the Base package. We have already seen above that the summary() command provides essential information about a variable. For instance,

summary(bsa$leftrigh)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   1.00    2.00    2.40    2.52    3.00    5.00     782

provides information about the mean, median and quartiles of the political scale of respondents.

The describe() command from the Hmisc package provides a more detailed set of summary statistics.

library(Hmisc)
describe(bsa$leftrigh)

Error in proxy[, ..., drop = FALSE]: incorrect number of dimensions

The code above returns an error because describe() expects numeric values, and ‘leftrigh’ isn’t a pure numeric variable:

class(bsa$leftrigh)

[1] "haven_labelled" "vctrs_vctr"     "double"

The underlying values are numeric (‘double’), but the variable carries extra metadata, including the value and variable labels generated by haven. In order for describe() to run properly, we need to convert leftrigh to Base R numeric format, either permanently as a new variable or temporarily, as shown below:

describe(as.numeric(bsa$leftrigh))

as.numeric(bsa$leftrigh) 
       n  missing distinct     Info     Mean  pMedian      Gmd      .05 
    3206      782       30    0.993     2.52      2.5   0.8831      1.2 
     .10      .25      .50      .75      .90      .95 
     1.4      2.0      2.4      3.0      3.6      4.0 

lowest : 1    1.2  1.4  1.5  1.6 , highest: 4.4  4.6  4.75 4.8  5

describe() also provides the number of valid, missing and distinct observations, selected percentiles (from the 5th to the 95th), as well as the five smallest and largest values.
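The conversion step can be tried on a small labelled vector built by hand. This is a minimal sketch: the values and labels below are made up, and it assumes the haven package is installed. zap_labels() is shown as an alternative way of stripping the labels.

```r
library(haven)

# A made-up labelled vector mimicking a haven-imported scale
x <- labelled(c(1, 2.5, 5, NA), labels = c(Left = 1, Right = 5))
class(x)

# Option 1: temporary conversion to a plain numeric vector
x_num <- as.numeric(x)
class(x_num)

# Option 2: strip the value labels with haven's own helper
x_zap <- zap_labels(x)
class(x_zap)

# Both versions hold the same values and work with Base R functions
mean(x_num, na.rm = TRUE)
max(x_zap, na.rm = TRUE)
```

Either version can be passed to functions that expect plain numeric input.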

Commands producing single statistics are also available:

mean(bsa$leftrigh, na.rm = T)

[1] 2.519911

sd(bsa$leftrigh, na.rm = T)

[1] 0.7852958

median(bsa$leftrigh, na.rm = T)

[1] 2.4

max(as.numeric(bsa$leftrigh), na.rm = T)

[1] 5

min(as.numeric(bsa$leftrigh), na.rm = T)

[1] 1

As previously, the na.rm = T option removes missing values before the statistic is computed. Without it, the output would be NA, as this is the default behaviour of these functions whenever a variable contains missing values. As with describe() earlier, max() and min() need the variable to be converted into numeric format to deliver the desired output.
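The effect of na.rm can be checked on a short made-up vector, used here instead of the survey variable:

```r
# Made-up vector with one missing value
x <- c(1, 2, NA, 4)

mean(x)                # NA: the missing value propagates by default
mean(x, na.rm = TRUE)  # computed on the three observed values only

# Counting the missing values before deciding how to handle them
sum(is.na(x))
```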

We could combine the output from the above commands into a single vector using the c() function:

c(
  mean(bsa$leftrigh, na.rm = T),
  sd(bsa$leftrigh, na.rm = T),
  median(bsa$leftrigh, na.rm = T),
  max(as.numeric(bsa$leftrigh), na.rm = T),
  min(as.numeric(bsa$leftrigh), na.rm = T)
)

[1] 2.5199106 0.7852958 2.4000000 5.0000000 1.0000000
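Because c() returns an unlabelled vector, it can be hard to tell which number is which. One possible refinement, sketched here on a made-up vector rather than on the survey data, is to name each element:

```r
# Made-up scores with one missing value
x <- c(1, 2.4, 3, NA, 5)

# Naming the elements makes the output self-explanatory
stats <- c(
  mean   = mean(x, na.rm = TRUE),
  sd     = sd(x, na.rm = TRUE),
  median = median(x, na.rm = TRUE),
  max    = max(x, na.rm = TRUE),
  min    = min(x, na.rm = TRUE)
)

round(stats, 2)

# Individual statistics can then be retrieved by name
stats["median"]
```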

Using these individual commands may come in handy, for instance when further processing of the result is needed:

m <- mean(bsa$leftrigh, na.rm= T)

Let’s round the result to two decimal places:

m_rounded <- round(m,2)

We can see the final result by typing:

m_rounded

[1] 2.52

Note:

round(mean(bsa$leftrigh,na.rm=T),2)

[1] 2.52

would produce the same result using just one line of code.

5.2 Bivariate association between continuous variables

The Base R installation comes with a wide range of bivariate statistical functions. cor() and cov() provide basic measures of association between two variables. For instance, in order to measure the correlation between the left-right and the libertarian-authoritarian scales:

cor(bsa$leftrigh, bsa$libauth, use='complete.obs')

[1] 0.009625928

The latter variable records where a respondent sits on the libertarian-authoritarian scale, which ranges from 1 to 5.

A correlation of about 0.01 indicates a positive but negligible relationship. It can be interpreted as ‘an increase in authoritarianism is associated with a marginal increase in right-wing views’.

Note: When using cor() and cov(), missing values are handled through the use= option, which can take the values “everything”, “all.obs”, “complete.obs”, “na.or.complete” or “pairwise.complete.obs”. See ?cor for additional information.
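The difference between some of these options is easiest to see on a small example. The data frame below is made up for illustration; it compares the default behaviour with listwise (“complete.obs”) and pairwise (“pairwise.complete.obs”) deletion.

```r
# Toy data (made up for illustration); NA marks a missing answer
toy <- data.frame(
  x = c(1, 2, 3, 4, NA),
  y = c(2, 4, 6, 8, 10),
  z = c(1, NA, 3, 5, 4)
)

# Default (use = "everything"): any missing value yields NA
r_default <- cor(toy$x, toy$y)

# "complete.obs": only rows where both variables are observed are used
r_complete <- cor(toy$x, toy$y, use = "complete.obs")

# With more than two variables the two deletion strategies can differ:
# listwise deletion keeps only rows that are complete on ALL columns,
# pairwise deletion keeps, for each pair, the rows complete on that pair
r_listwise <- cor(toy, use = "complete.obs")
r_pairwise <- cor(toy, use = "pairwise.complete.obs")

r_default
r_complete
round(r_listwise, 3)
round(r_pairwise, 3)
```

With only two variables, as in the cor() call on the survey data, both deletion strategies use the same rows and therefore give the same result.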

5.3 Categorical variables

As with continuous variables, R offers several tools that can be used to describe the distribution of categorical variables. One- and two-way contingency tables are the most commonly used.

5.3.1 One-way frequency tables

There are several R commands that we can use to create frequency tables. The most common ones are table(), xtabs() and ftable(), which return the number of observations within each level of a factor. For example, in order to obtain the distribution of political affiliation among BSA respondents in 2017:

table(as_factor(bsa$PartyId2))


     Not applicable        Conservative              Labour    Liberal Democrat 
                  0                1263                1479                 241 
        Other party                None         Green Party Other answer/DK/Ref 
                193                 515                  79                   0

As with any other R function, the outcome of table() can be stored as an object for further processing:

a<-table(as_factor(bsa$PartyId2))

table() does not compute proportions or percentages itself. Proportions are obtained with the prop.table() function, which returns values between 0 and 1, so these need to be multiplied by 100 in order to obtain percentages. It is also a good idea to round the results for greater readability.

Either:

round(
  100*
    prop.table(a),
  1)


     Not applicable        Conservative              Labour    Liberal Democrat 
                0.0                33.5                39.2                 6.4 
        Other party                None         Green Party Other answer/DK/Ref 
                5.1                13.7                 2.1                 0.0

… or:

round(100*
        prop.table(
          table(as_factor(bsa$PartyId2))
        ),
      1)


     Not applicable        Conservative              Labour    Liberal Democrat 
                0.0                33.5                39.2                 6.4 
        Other party                None         Green Party Other answer/DK/Ref 
                5.1                13.7                 2.1                 0.0
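To make the distinction between counts, proportions and percentages concrete, here is a minimal sketch using a made-up factor rather than the survey variable:

```r
# Made-up party affiliations for eight respondents
party <- factor(c("Labour", "Conservative", "Labour", "Green",
                  "Labour", "Conservative", "None", "Labour"))

counts <- table(party)        # frequencies
props  <- prop.table(counts)  # proportions, summing to 1
pct    <- round(100 * props, 1)  # percentages, rounded to one decimal

counts
props
pct
```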

5.3.2 Two-way and higher-order contingency tables

The simplest way to produce a two-way contingency table is to pass a second variable to table():

table(as_factor(bsa$PartyId2), as_factor(bsa$Rsex))

                     
                      skipped or na Male Female Dontknow Refusal
  Not applicable                  0    0      0        0       0
  Conservative                    0  627    636        0       0
  Labour                          0  644    835        0       0
  Liberal Democrat                0  124    117        0       0
  Other party                     0   97     96        0       0
  None                            0  199    316        0       0
  Green Party                     0   31     48        0       0
  Other answer/DK/Ref             0    0      0        0       0

By default, table() does not discard empty factor levels (i.e. categories with no observations), which can make the output cumbersome to read. Using droplevels() on each variable resolves the issue:

table(droplevels(as_factor(bsa$PartyId2)), droplevels(as_factor(bsa$Rsex)))

                  
                   Male Female
  Conservative      627    636
  Labour            644    835
  Liberal Democrat  124    117
  Other party        97     96
  None              199    316
  Green Party        31     48

However, when dealing with more than one variable it is recommended to use xtabs() instead, as it has a number of useful features directly available as options. The syntax is slightly different as it relies on a formula, an R object consisting of elements separated by a tilde ‘~’. The variables to be tabulated are specified on the right hand side of the formula. In order to lighten the syntax, we will also convert PartyId2 and Rsex permanently into factors.

bsa$PartyId2.f<-as_factor(bsa$PartyId2)
bsa$Rsex.f<-as_factor(bsa$Rsex)

xtabs(~PartyId2.f +Rsex.f,
      data = bsa)

                     Rsex.f
PartyId2.f            skipped or na Male Female Dontknow Refusal
  Not applicable                  0    0      0        0       0
  Conservative                    0  627    636        0       0
  Labour                          0  644    835        0       0
  Liberal Democrat                0  124    117        0       0
  Other party                     0   97     96        0       0
  None                            0  199    316        0       0
  Green Party                     0   31     48        0       0
  Other answer/DK/Ref             0    0      0        0       0

The data= argument does not have to be named explicitly: because it is the second argument of xtabs(), simply passing bsa after the formula will work. Other useful options are:

subset=, which allows direct specification of a subpopulation from which to derive the table;
drop.unused.levels=T to remove empty levels;
weights: a variable placed on the left hand side of the formula (before the tilde) will be treated as weights, a useful feature for survey analysis, for instance WtFactor~PartyId2.f+Rsex.f.
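The options listed above can be sketched on a small made-up data frame. The variable names party, sex and w are hypothetical stand-ins for PartyId2.f, Rsex.f and a weighting variable.

```r
# Made-up data: 'Other' is a declared party level with no observations
toy <- data.frame(
  party = factor(c("Labour", "Labour", "Conservative",
                   "Green", "Conservative", "Labour"),
                 levels = c("Conservative", "Labour", "Green", "Other")),
  sex   = factor(c("Female", "Male", "Female", "Female", "Male", "Female")),
  w     = c(1.2, 0.8, 1.0, 0.5, 1.5, 1.0)   # hypothetical survey weights
)

# Unweighted counts: the empty 'Other' level is kept by default
unweighted <- xtabs(~ party + sex, data = toy)

# Same table without empty levels
dropped <- xtabs(~ party + sex, data = toy, drop.unused.levels = TRUE)

# Weighted table: the variable on the left hand side is summed in each cell
weighted <- xtabs(w ~ party + sex, data = toy)

# Table restricted to a subpopulation (women only)
women <- xtabs(~ party, data = toy, subset = sex == "Female",
               drop.unused.levels = TRUE)

unweighted
dropped
weighted
women
```

Note that weighted tables produced this way give weighted counts only. They do not account for the survey design when it comes to standard errors.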

As previously, prop.table() is necessary in order to obtain proportions:

b<-xtabs(~PartyId2.f +Rsex.f,
         bsa,
         drop.unused.levels = T)

round(100*
        prop.table(b),
      1) ### Cell percentages

                  Rsex.f
PartyId2.f         Male Female
  Conservative     16.6   16.9
  Labour           17.1   22.1
  Liberal Democrat  3.3    3.1
  Other party       2.6    2.5
  None              5.3    8.4
  Green Party       0.8    1.3

The largest group in the sample (22.1%) is made up of Labour-voting females and the smallest (0.8%) of Green-voting males.

round(100*
        prop.table(b,1),
      1) ### Option 1 for row percentages

                  Rsex.f
PartyId2.f         Male Female
  Conservative     49.6   50.4
  Labour           43.5   56.5
  Liberal Democrat 51.5   48.5
  Other party      50.3   49.7
  None             38.6   61.4
  Green Party      39.2   60.8

Conservative voters are more or less evenly split between men and women, whereas Labour and Green voters are more likely to be women.

round(100*
        prop.table(b,2),
      1) ### Option 2 for column percentages

                  Rsex.f
PartyId2.f         Male Female
  Conservative     36.4   31.1
  Labour           37.4   40.8
  Liberal Democrat  7.2    5.7
  Other party       5.6    4.7
  None             11.6   15.4
  Green Party       1.8    2.3

Similar proportions of men voted Conservative and Labour (36-37%), whereas women were clearly more likely to vote Labour.
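The three kinds of percentages differ only in the second argument passed to prop.table(), known as the margin. A minimal sketch on a made-up 2 × 2 table shows what each one sums to:

```r
# Made-up two-way table of counts (100 respondents in total)
tab <- as.table(matrix(
  c(30, 20, 10, 40), nrow = 2,
  dimnames = list(party = c("Conservative", "Labour"),
                  sex   = c("Male", "Female"))
))

cell_p <- prop.table(tab)      # no margin: all cells sum to 1
row_p  <- prop.table(tab, 1)   # margin 1: each row sums to 1
col_p  <- prop.table(tab, 2)   # margin 2: each column sums to 1

round(100 * cell_p, 1)
round(100 * row_p, 1)
round(100 * col_p, 1)

# Checking what each version sums to
sum(cell_p)
rowSums(row_p)
colSums(col_p)
```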

There is no straightforward way to obtain percentages in three-way contingency tables with either xtabs() or table(). This is where the ftable() function comes in handy. For convenience, we first convert RAgeCat into a factor.

bsa$RAgeCat.f<-as_factor(bsa$RAgeCat) 

round(100*
        prop.table(
          ftable(RAgeCat.f~PartyId2.f+Rsex.f,
                 data=droplevels(bsa)
                   )
         ,1)
      ,1) ### Option 1 for row,  2 for column percentages

                        RAgeCat.f 18-24 25-34 35-44 45-54 55-59 60-64  65+
PartyId2.f       Rsex.f                                                   
Conservative     Male               3.0   7.8  13.9  15.0   8.6   9.3 42.4
                 Female             2.7   7.1   8.8  18.6   8.8   8.7 45.4
Labour           Male               7.6  16.1  14.3  21.3   7.5   9.3 23.9
                 Female             7.9  20.2  19.2  18.8   7.1   7.1 19.7
Liberal Democrat Male               0.8  13.7  19.4  15.3   9.7   8.9 32.3
                 Female             4.3   9.4  26.5   6.0   6.0   9.4 38.5
Other party      Male               3.1  14.4  11.3  17.5  11.3  16.5 25.8
                 Female             5.2  14.6  15.6  16.7  11.5   8.3 28.1
None             Male               7.5  22.1  20.6  17.1  11.1   6.0 15.6
                 Female             8.5  22.5  20.6  21.8   7.3   6.3 13.0
Green Party      Male               6.5  32.3  16.1  29.0   6.5   3.2  6.5
                 Female             6.2  18.8  25.0  20.8   6.2  10.4 12.5

The table gives the relative age breakdown for each gender/political affiliation combination (i.e. row percentages). Here again we used droplevels(): this removes unused factor levels which would otherwise be displayed and make the table difficult to read. droplevels() can be applied either to entire data frames or to single variables.
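For readers without access to the survey file, the same approach can be tried on the mtcars data frame that ships with R. This is only an illustrative sketch: cyl, am and gear stand in for the survey variables.

```r
# Three-way flat table: gear in columns, cyl and am combinations in rows
ft <- ftable(gear ~ cyl + am, data = mtcars)
ft

# Row percentages: distribution of gear within each cyl/am combination
pct <- 100 * prop.table(ft, 1)
round(pct, 1)

# Each row of the unrounded table sums to 100
rowSums(matrix(as.numeric(pct), nrow = nrow(pct)))
```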

5.4 Grouped summary statistics for continuous variables

A common requirement in survey analysis is the ability to compare descriptive statistics across subgroups of the data. There are different ways to do this in R. We demonstrate below the most straightforward one, which is obtained by using some of the functions available in the dplyr package.

bsa%>%
  group_by(PartyId2.f)%>%
  summarise(mdscore=median(libauth,na.rm=T),
            sdscore=sd(libauth,na.rm=T))

# A tibble: 7 × 3
  PartyId2.f       mdscore sdscore
  <fct>              <dbl>   <dbl>
1 Conservative        3.67   0.587
2 Labour              3.33   0.774
3 Liberal Democrat    3.17   0.726
4 Other party         3.67   0.739
5 None                3.67   0.584
6 Green Party         2.83   0.872
7 <NA>                3.67   0.564

The above command produces a table of summary values (medians and standard deviations) of the libertarian-authoritarian scale. We can see from the medians that Green Party supporters are the most libertarian (2.83), followed by Liberal Democrats (3.17) and Labour (3.33), whereas Conservatives, supporters of other parties and those with no party affiliation are the most authoritarian (3.67). Among the named groups, respondents with no party affiliation and Conservatives are the most cohesive (i.e. with the smallest standard deviations, about 0.58 and 0.59), whereas Green Party supporters are the most dispersed (0.87). We chose to keep non-responses for PartyId2 (the <NA> row) in this analysis. Some users might want to remove them instead before computing their results, as in the table below. We do this by using is.na(), which checks variables for the presence of system missing values, in conjunction with filter().

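bsa%>%
  filter(!is.na(PartyId2.f)) %>%            # drop respondents with missing party affiliation
  group_by(Rsex.f,PartyId2.f) %>%
  summarise(mnscore=sd(libauth,na.rm=T),    # despite its name, mnscore holds the standard deviation
            mdscore=median(libauth,na.rm=T))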

# A tibble: 12 × 4
# Groups:   Rsex.f [2]
   Rsex.f PartyId2.f       mnscore mdscore
   <fct>  <fct>              <dbl>   <dbl>
 1 Male   Conservative       0.607    3.67
 2 Male   Labour             0.765    3.33
 3 Male   Liberal Democrat   0.766    3.17
 4 Male   Other party        0.703    3.83
 5 Male   None               0.616    3.67
 6 Male   Green Party        1.04     2.67
 7 Female Conservative       0.565    3.67
 8 Female Labour             0.781    3.33
 9 Female Liberal Democrat   0.688    3.17
10 Female Other party        0.773    3.67
11 Female None               0.565    3.67
12 Female Green Party        0.744    3

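When further broken down by gender, we can see that overall the same trends remain valid, with some nuances: male Green supporters are more libertarian than their female counterparts (median 2.67 vs 3), whereas the opposite holds among supporters of other parties (3.83 vs 3.67). Among Conservative, Labour and Liberal Democrat supporters, and among those with no party affiliation, men and women have identical medians.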
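The effect of filtering out missing group values can be checked on a small made-up data frame. This is a sketch that assumes dplyr is installed; the variable names party and score are hypothetical.

```r
library(dplyr)

# Made-up data: two respondents have no recorded party
toy <- data.frame(
  party = factor(c("Labour", "Labour", "Green", "Green", NA, NA)),
  score = c(3, 4, 2, NA, 5, 3)
)

# Missing party values are kept as a group of their own
with_na <- toy %>%
  group_by(party) %>%
  summarise(md = median(score, na.rm = TRUE), n = n(), .groups = "drop")

# Missing party values are removed before grouping
without_na <- toy %>%
  filter(!is.na(party)) %>%
  group_by(party) %>%
  summarise(md = median(score, na.rm = TRUE), n = n(), .groups = "drop")

with_na
without_na
```

Note that filter() acts on the grouping variable, whereas na.rm = TRUE acts on the variable being summarised. The two deal with different kinds of missingness and are often needed together.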

Instead of tables of summary statistics, we may want to have summary statistics computed as variables that will be added to the current dataset for each corresponding gender/political affiliation group. This is straightforward to do with dplyr: we just need to use the mutate() command instead of summarise().

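bsa<-bsa%>%
  group_by(Rsex.f,PartyId2.f)%>%
  mutate(msscore=sd(libauth,na.rm=T),         # group standard deviation, repeated on each row
         mdscore=median(libauth,na.rm=T))%>%  # group median, repeated on each row
  ungroup()                                   # remove the grouping so that later steps are not affected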

Note that the result has to be assigned back to the existing bsa dataset for the newly created variables to be kept, which is what the first line of the code above does. We can check that the variables have been created and that the correct values have been assigned to each sex/affiliation category.

names(bsa)

 [1] "Sserial"          "Rsex"             "RAgeCat"          "Married"         
 [5] "ChildHh"          "HEdQual3"         "eq_inc_quintiles" "RClassGp"        
 [9] "CCBELIEV"         "carallow"         "carreduc"         "carnod2"         
[13] "cartaxhi"         "carenvdc"         "plnenvt"          "plnuppri"        
[17] "Politics"         "Voted"            "actchar"          "actpol"          
[21] "govnosa2"         "PartyId2"         "leftrigh"         "libauth"         
[25] "WtFactor"         "PartyId2.f"       "Rsex.f"           "RAgeCat.f"       
[29] "msscore"          "mdscore"

bsa[4:8,c("Rsex","PartyId2","mdscore")]

# A tibble: 5 × 3
  Rsex       PartyId2              mdscore
  <dbl+lbl>  <dbl+lbl>               <dbl>
1 2 [Female]  2 [Labour]              3.33
2 1 [Male]    3 [Liberal Democrat]    3.17
3 2 [Female] NA                       3.67
4 1 [Male]    3 [Liberal Democrat]    3.17
5 2 [Female]  6 [Green Party]         3
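The difference between summarise() and mutate() on grouped data can be seen on a small made-up data frame. This is a sketch that assumes dplyr is installed; the variable names group and score are hypothetical.

```r
library(dplyr)

# Made-up data: two groups of unequal size
toy <- data.frame(
  group = c("a", "a", "b", "b", "b"),
  score = c(1, 3, 2, 4, 9)
)

# summarise() collapses the data to one row per group
by_group <- toy %>%
  group_by(group) %>%
  summarise(md = median(score), .groups = "drop")

# mutate() keeps every row and repeats the group value on each of them
toy_md <- toy %>%
  group_by(group) %>%
  mutate(md = median(score)) %>%
  ungroup()

by_group
toy_md
```

The group-level variable can then be used in row-level computations, for example to measure how far each respondent's score is from the median of their group.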