Producing weighted and survey design-informed estimates – Analysing social survey data using R

Most users of social surveys will be looking to produce representative estimates when conducting analyses. Conducting such population inference entail using weights, which are meant to correct estimates for the under/over representation of certain groups in the sample due to the sampling process and non-response. Producing sound results relies not just on weighting estimates but also on computing adequate standard errors or confidence intervals, which measure the precision of the estimates.

The recommended approach to inferring confidence intervals and standard errors involves accounting for the survey design (ie the way sampling was carried out) when conducting analyses – which can be done with the survey package in R, a topic described in Section 6.3. At the same time, for users who are concerned with quickly computing reasonably accurate point estimates, rather than publication-quality results, it may be useful to be aware which common R commands and operations can be used with weights.

Central tendency and dispersion (continuous variables)

The stats package, part of Base R, includes weighted.mean() which, as indicated by its name, computes weighted estimates of the mean of a variable when weights are provided. However, the Hmisc package includes a more comprehensive set of functions that can be used when weighting estimates: wtd.mean(), wtd.var() and wtd.quantile(). The code below provides an illustration of weighted means, variance and median of the left-right political attitudes score used in previous chapters, each time comparing it with the unweighted estimates:

### Mean
c(mean(bsa$leftrigh,na.rm=T),
  wtd.mean(bsa$leftrigh,bsa$WtFactor)
  )

[1] 2.519911 2.521589

### Variance
c(var(bsa$leftrigh,na.rm=T),
  wtd.var(bsa$leftrigh,bsa$WtFactor))

[1] 0.6166894 0.6195378

### Median and quartiles
c(quantile(bsa$leftrigh,na.rm=T,probs=c(.25,.5,.75)),
  wtd.quantile(bsa$leftrigh,bsa$WtFactor,probs=c(.25,.5,.75)))

25% 50% 75% 25% 50% 75% 
2.0 2.4 3.0 2.0 2.4 3.0

The above functions can be used in conjunction with group_by() and summarise() in order to compute weighted estimates of continuous variables by groups of categorical variables:

bsa%>%
  filter(!is.na(RAgeCat.f))%>%
  group_by(RAgeCat.f)%>%
  summarise(Mean=wtd.mean(leftrigh,WtFactor),
            Var=wtd.var(leftrigh,WtFactor),
            Median=wtd.quantile(leftrigh,WtFactor,probs=c(.5))
            )

# A tibble: 7 × 4
  RAgeCat.f  Mean   Var Median
  <fct>     <dbl> <dbl>  <dbl>
1 18-24      2.49 0.556    2.4
2 25-34      2.56 0.577    2.6
3 35-44      2.52 0.615    2.4
4 45-54      2.53 0.671    2.6
5 55-59      2.54 0.653    2.4
6 60-64      2.46 0.685    2.4
7 65+        2.52 0.613    2.4

Frequencies and contingency tables

Neither ftable() nor table() that were used in previous chapter allow weights. Although the Hmisc packages includes a wtd.table() function for one-way frequency tables, we recommend using xtabs() as previously, as it is more versatile and allows weights. Indeed, as variables used with xtabs() are specified on the right hand side of a formula:

> xtabs(~var1, data=mydata)

> xtabs(~var1 + var2, data=mydata)

… A variable containing weights can be passed to xtabs() by specifying its name on the left hand-side of the equation (or the tilde ~ )

> xtabs(weights~var1 + var2, data=mydata)

Let’s apply this technique to investigate respondents’ agreement with the sentence: People should be able to travel by plane as much as they like, even if this harms the environment as recorded in the plnenvt variable.

bsa$plnenvt.f<-as_factor(bsa$plnenvt) # Converts the original variable into a factor
 
## Unweighted vs weighted frequency tables
cbind(
  Unweighted=round(
    100*prop.table(
      xtabs(~plnenvt.f,bsa,
            drop.unused.levels = T)
        ),
    1),
  Weighted=round(
    100*prop.table(
      xtabs(WtFactor~plnenvt.f,bsa,
            drop.unused.levels = T)
      ),
    1)
)

                           Unweighted Weighted
agree strongly                    4.6      4.8
agree                            16.0     15.9
neither agree nor disagree       33.2     33.2
disagree                         36.3     37.0
disagree strongly                10.0      9.0

Obtaining a weighted contingency table of respondents’ views about flying by gender follow the same logic:

## Unweighted vs weighted contingency tables
cbind(
  round(
    100*prop.table(
      xtabs(~plnenvt.f+Rsex.f,bsa,
            drop.unused.levels = T),
      1),
    1),
  round(
    100*prop.table(
      xtabs(WtFactor~plnenvt.f+Rsex.f,bsa,
            drop.unused.levels = T),
      1),
    1)
)

                           Male Female Male Female
agree strongly             50.0   50.0 46.1   53.9
agree                      47.2   52.8 53.5   46.5
neither agree nor disagree 40.8   59.2 43.6   56.4
disagree                   47.9   52.1 52.7   47.3
disagree strongly          42.3   57.7 41.5   58.5

Inference using survey procedures

The weighting procedures described above could be described as ‘quick and dirty’ in that they compute representative point estimates – i.e. single values. Computing the precision of survey data estimates – for example via their standard error – requires more than just adding weights to a command. Information about the survey design, its primary sampling units, strata and clusters is required so that robust standard errors, statistical tests and/or confidence interval can be computed. The Survey package was designed in order to deal with this set of issues. It provides functions for computing common estimates while accounting for the survey design. Its most important features are described below.

In order to use survey functions one first needs to create a svydesign object, in essence a version of the data that incorporates the sample design information available, then compute the required estimate using the svydesign object.

A common issue with survey datasets available in the UK is that sampling information is often only available in a secured version of the data, restricting its access to authorised users in a secure lab. Although it is sometimes possible to use available variables to account for aspects of the sample design – region as a strata in the case of stratified samples – in most cases users are left with computing standard errors without sample design information, which amounts to assuming that the sample was drawn purely at random. Even if this is the case however, using the survey package is recommended, as it provides a coherent framework for computing survey parameters.

library(survey) ### Loading the package in memory
bsa.design<-svydesign(ids =~1,
                      weights=~WtFactor,
                      data=bsa)

The code above simply declares the survey design by creating the bsa.design object (the name is arbitrary). The ids= parameter is where primary sampling units are declared, as well as any clustering information as a formula ie ~PSU+cluster2id.... When PSU information is unavailable ids is given the value 1 or 0. A strata= and fpc= are available in order to declare the sampling strata and the variable used for finite population correction. None of these are available in the bsa dataset, and estimation commands will therefore rely on the assumption of simple random sampling.

We can now compute estimates comparable to those in the previous sections. The code below provides the mean of the left vs right political orientation indicator, as well as its 95% confidence interval:

a<-svymean(~leftrigh,
        bsa.design,
        na.rm = T)### Computes the mean and its standard error...

a

           mean     SE
leftrigh 2.5216 0.0155

confint(a) ### ... and confidence interval

            2.5 %   97.5 %
leftrigh 2.491277 2.551902

We can similarly

oldsvyquantile(~leftrigh,
            bsa.design,
            quantiles=.5,
            na.rm = T)

         0.5
leftrigh 2.4

Frequency and contingency tables are computed using svytable(), whose syntax relies on formulas similarly to xtabs() in the previous chapter.

### A frequency table...
round(100*
        prop.table(
          svytable(~RAgeCat.f,bsa.design)
        ),
      1)

RAgeCat.f
18-24 25-34 35-44 45-54 55-59 60-64   65+ 
 11.2  17.2  16.1  17.9   7.9   6.8  22.8

### And a two-way contingency table:

round(100*
        prop.table(
          svytable(~RAgeCat.f+Rsex.f,bsa.design),
          1),
      1)

         Rsex.f
RAgeCat.f Male Female
    18-24 51.1   48.9
    25-34 50.2   49.8
    35-44 49.7   50.3
    45-54 49.3   50.7
    55-59 48.8   51.2
    60-64 49.0   51.0
    65+   45.4   54.6