Using survey weights and survey design variables

R examples

Authors

Affiliation

Published

June 2026

1 Introduction

Survey weights help make survey estimates more representative of the target population. Many surveys also use complex sample designs involving stratification and clustering. These design features should be taken into account when estimating standard errors and confidence intervals.

This resource shows how to apply survey weights and account for survey design in R through examples using data from the 2017 British Social Attitudes Survey.

The examples show how to:

apply survey weights when calculating a mean.
specify a survey design using the survey package.
estimate means and proportions with confidence intervals.
produce estimates for population subgroups.

For an introduction to the concepts covered here, including survey weights, stratification, clustering and design variables, see the companion resource Survey Explainer: Survey weights and design variables.

You can use this resource as a simple reference, or follow the examples using your own copy of the data. Information on the required software and data is provided below.

2 Survey weighting and design in R

There are several ways to apply survey weights in R. A key distinction is between general weighting functions which apply weifhts without accounting for the survey design, and survey-specific approaches, which can account for survey design when estimating measures of precision such as confidence intervals.

2.1 General weighting functions

Base R provides some functionality for weighted analysis, including functions such as weighted.mean() and regression models like lm() that allow the use of weights. Additional R packages offer convenient functions for calculating weighted estimates. For example, the Hmisc package includes functions such as wtd.mean(), wtd.var(), and wtd.quantile(), while the summarytools package supports weighted frequency tables and common descriptive statistics.

2.2 Survey analysis functions

The functions discussed above can be used for producing weighted estimates, but they do not allow users to fully account for survey design features such as stratification, clustering and unequal selection probabilities.

The survey package provides a set of functions that apply survey weights and correctly account for the sample design when estimating standard errors and confidence intervals. Before using the package, you must identify the survey weights and design variables and create a survey design object. This resource includes step-by-step examples showing how to do this.

The srvyr package is based on the survey package and designed to work better with dplyr and the tidyverse approach to R including piping.

3 Data and software requirements

To work through the examples on your own machine, you will need:

basic familiarity with R.
an up-to-date copy of R.
the dplyr, haven, survey, Hmisc and ggplot2 packages.
a UK Data Service account (for downloading the data)
a copy of the 2017 British Social Attitudes Survey (BSA) - the examples use the Stata (.dta) version of the dataset.

4 Identifying the survey design

From the information in the dataset user guide, we can see that the 2017 British Social Attitudes (BSA) survey used a three-stage stratified random sample.

Postcode sectors were selected first, followed by addresses and then individuals.
The primary sampling units (postcode sectors) were:
- first stratified by sub-region, population density, and the proportion of owner-occupied housing.
- selected with probability proportional to size, so postcode sectors containing more addresses had a greater chance of being selected.

To account for this design in the analysis, we need to identify the variables that contain information on the sampling weights, the strata, and the primary sampling units (clusters).

From the data documentation, we can find the following.

Survey feature	Variable
Weight	`WtFactor`
Stratum	`StratID`
Primary sampling unit	`Spoint`

5 Preparing the data

The first step is to load the R packages we need and to read in the data. Note the code uses the Stata (.dta) version of the dataset. To run the code, you will need to update the file path to match the location where you saved the downloaded data.

# Load required packages
library(dplyr)   # Data manipulation
library(haven)   # Import Stata/SPSS files
library(Hmisc)   # Weighted statistics
library(survey)  # Survey design
library(ggplot2) # Visualisation

# Read in the data 

# ------------------------------------------------------------------
# DATA LOCATION: 
# Save the UKDA-8450-stata folder somewhere on your computer.
# Then edit the path below to match its location.
# Example:
# data_dir <- "C:/Data/UKDA-8450-stata"
# ------------------------------------------------------------------

data_dir <- "C:/Data/UKDA-8450-stata"


# Build the path to the dataset

bsa_file <- file.path(
  data_dir,
  "stata",
  "stata13",
  "bsa2017_for_ukda.dta"
)

# Import the dataset using Haven
bsa17 <- read_dta(bsa_file)

# Check that the import was successful
dim(bsa17)

[1] 3988  580

The dataset should contain 3988 cases (rows) and 580 variables (columns).

6 Example 1: Mean age

This example shows how weighting and survey design can affect survey estimates. We will:

calculate the unweighted mean age of respondents and compare it with a weighted estimate (using Hmisc).
construct an approximate confidence interval using a general weighting function.
use the survey package to estimate the mean age and its confidence interval while accounting for the survey design.

6.1 Unweighted and weighted mean age

To start, we will first calculate the unweighted mean age.

# Unweighted mean age

mean_age <- mean(
  bsa17$RAgeE,
  na.rm = TRUE
)

mean_age

[1] 52.19358

To calculate the weighted mean, we first use wtd.mean(), from the Hmisc package.

# Weighted mean age
weighted_mean_age <- wtd.mean(
  bsa17$RAgeE,
  weights = bsa17$WtFactor,
  na.rm = TRUE
)

weighted_mean_age

[1] 48.31309

How do they compare?

Unweighted mean age: 52.2 years

Weighted mean age: 48.3 years

Difference: -3.9 years

In this example, the weighted and unweighted estimates differ by around four years. This suggests that older respondents are over-represented in the unweighted sample relative to the target population. Applying the survey weights reduces this imbalance and produces a lower estimate of the population mean age.

6.2 Approximate weighted confidence interval

We can calculate a confidence interval for the weighted mean age. The confidence interval reflects the uncertainty around the estimate. To do this, we first estimate a weighted standard error using the Hmisc package and then use it to construct an approximate 95% confidence interval.

The approximate confidence interval is calculated as: mean ± 1.96 × standard error.

Because this approach does not account for stratification or clustering, the confidence interval is an approximation.

# APPROXIMATE 95% CONFIDENCE INTERVAL  

# Calculate the weighted standard error (approximate)
weighted_se_age <- sqrt(
  wtd.var(
    bsa17$RAgeE,
    weights = bsa17$WtFactor,
    normwt = TRUE,
    na.rm = TRUE
  ) / sum(!is.na(bsa17$RAgeE))
)

# Calculate the lower and upper confidence limits
lower_ci <- weighted_mean_age - 1.96 * weighted_se_age
upper_ci <- weighted_mean_age + 1.96 * weighted_se_age

# Display the results
cat(
  "Approximate 95% confidence interval:",
  round(lower_ci, 1),
  "to",
  round(upper_ci, 1),
  "years\n"
)

Approximate 95% confidence interval: 47.7 to 48.9 years

The approximate 95% confidence interval suggests that the population mean age lies between 47.7 and 48.9 years. Later, we will compare this interval with one that accounts for the survey design.

6.3 Specifying the survey design in R

The survey package provides functions that apply survey weights and account for the survey design when estimating standard errors and confidence intervals.

To use these functions, we first need to specify the survey design by creating a survey design object. A survey design object combines the survey dataset with information about the survey weights and sample design. I

To create a survey design object, we need the variables identified earlier:

Spoint - the primary sampling unit identifier.
StratID - the stratum identifier.
WtFactor - the survey weight.

We create a survey design object using svydesign(). The resulting object is then used with functions from the survey package to produce estimates that account for the survey design.

# Create a survey design object
bsa17.s<-svydesign(ids=~Spoint,
                   strata=~StratID,                     
                   weights=~WtFactor,                    
                   data=bsa17)

The survey design object is called bsa17.s.

6.4 Using the survey package

We can now use functions from the survey package to produce estimates that account for the survey design. Here we use svymean() to estimate mean age of respondents.

#Estimate mean age using the svymean() from the survey package

survey_mean_age <- svymean( 
  ~RAgeE,                   
  design = bsa17.s, # the survey design object
  na.rm = TRUE
)
survey_mean_age

        mean     SE
RAgeE 48.313 0.4236

The output includes the weighted mean and its standard error. We can use confint() to calculate a 95% confidence interval for the estimated mean.

# use confint to calculate a confidence interval

survey_ci <- confint(survey_mean_age)
survey_ci

         2.5 %  97.5 %
RAgeE 47.48289 49.1433

We can now compare the survey-based estimate and confidence interval with the weighted estimate and approximate confidence interval calculated earlier.

Comparison of weighted mean age estimates and confidence intervals
	Mean age	95% CI lower	95% CI upper
Weighted estimate (Hmisc)	48.3	47.7	48.9
Survey estimate	48.3	47.5	49.1

The weighted mean age looks identical using both approaches. However, the confidence interval produced by the survey package is slightly wider than the approximate confidence interval. This is because the survey package accounts for clustering and stratification when estimating uncertainty, whereas the approximate method only uses the survey weights.

6.5 Key points

Applying survey weights reduced the estimated mean age from 52.2 years to 48.3 years.
The weighted mean was virtually identical whether calculated using Hmisc or the survey package.
The survey package produced a wider confidence interval because it accounts for the sample design when estimating uncertainty.

7 Example 2: Estimating a proportion

In this example, we will use the survey package and our survey design object to estimate the proportion of respondents who report being interested in politics.

The variable is Politics. Before calculating proportions, let’s examine the distribution of the variable and its response categories.

# Frequency distribution of political interest
politics_freq <- table(
  droplevels(
    as_factor(
      bsa17$Politics,
      levels = "both"
    )
  )
)
# Display frequencies in a table
knitr::kable(
  as.data.frame(politics_freq),
  col.names = c("Response", "Frequency")
)

Response	Frequency
[1] … a great deal,	739
[2] quite a lot,	982
[3] some,	1179
[4] not very much,	708
[5] or, none at all?	379
[8] Don`t know	1

Categories 1 (“A great deal”) and 2 (“Quite a lot”) represent respondents who report relatively high levels of interest in politics. In the next step, we combine these categories to estimate the proportion of adults who are interested in politics (in 2017).

# Estimate the proportion interested in politics

politics_interest <- svyciprop(       # function for proportion 
  ~I(Politics == 1 | Politics == 2),  # for those who answered 1 or 2
  bsa17.s                             # survey design object
)
politics_interest                     # print result

                                        2.5% 97.5%
I(Politics == 1 | Politics == 2) 0.430 0.411  0.45

The output from svyciprop() gives the estimate as a proportion. For easier interpretation, we can convert the estimate and confidence interval to percentages.

# Convert estimate and confidence interval to percentages

round(
  data.frame(
    Estimate = coef(politics_interest),
    Lower = confint(politics_interest)[1],
    Upper = confint(politics_interest)[2]
  ) * 100,
  1
)

                                 Estimate Lower Upper
I(Politics == 1 | Politics == 2)       43  41.1  44.9

Try it:

Use svyciprop and the variable RAgecat5 to estimate the proportion of respondents aged 17-35. Run the code and compare to the solution below.

Solution

# Explore the age-group variable
table(
  droplevels(
    as_factor(
      bsa17$RAgecat5,
      levels = "both"
    )
  )
)


 [1] 17-34  [2] 35-54    [3] 55+ [9] DK/Ref 
       814       1379       1791          4

# Use svyciprop() to estimate the proportion and confidence interval
young_prop <- svyciprop(
  ~I(RAgecat5 == 1),     
  bsa17.s            
)

# Convert the estimate and confidence interval to percentages
round(
  data.frame(
    Estimate = coef(young_prop),
    Lower = confint(young_prop)[1],
    Upper = confint(young_prop)[2]
  ) * 100,
  1
)

                 Estimate Lower Upper
I(RAgecat5 == 1)     28.5  26.5  30.6

7.1 Example 2: Key points

svyciprop() can be used to estimate proportions and confidence intervals while accounting for the survey design.
The group of interest is specified using a logical expression, for example ~I(RAgecat5 == 1).
Estimates are returned as proportions but can be converted to percentages for easier interpretation.

8 Example 3: Estimates for population sub-groups (domain estimates)

In Example 2, we estimated the proportion of adults who reported being interested in politics. In this example, we will extend this analysis by estimating the proportion interested in politics separately within each Government Office Region (GOR_ID). These are examples of domain estimates because they describe subgroups of the population. In the survey package, one way of producing domain estimates is using svyby().

In this example, we will:

create a new binary variable politics_interest to capture those interested in politics.
recreate the survey design object so that it contains the new variable.
use svyby() and svymean to estimate the proprtoin interested in each Goverment Office Region.
convert the proportions to percentages.
plot the percentages to make it easier to compare.

# Create indicator: interested in politics = 1, otherwise 0
bsa17$politics_interest <- ifelse(
  bsa17$Politics %in% c(1, 2),
  1,
  0
)

# Recreate survey design object so it includes the new variable
bsa17.s <- svydesign(
  ids = ~Spoint,
  strata = ~StratID,
  weights = ~WtFactor,
  data = bsa17
)

# Estimate percentage interested in politics by region
politics_region <- svyby(
  ~politics_interest,
  ~as_factor(GOR_ID),
  design = bsa17.s,
  FUN = svymean,
  vartype = "ci",
  na.rm = TRUE
)

# Convert estimates and CIs to percentages
politics_region[, 2:4] <- politics_region[, 2:4] * 100

# Round numeric columns only
politics_region[, 2:4] <- round(politics_region[, 2:4], 1)

# print region, estimate and confidence intervals
politics_region [2:4]

                           politics_interest ci_l ci_u
A North East                            33.4 27.3 39.5
B North West                            41.9 36.2 47.6
D Yorkshire and The Humber              35.6 29.2 41.9
E East Midlands                         36.9 33.1 40.7
F West Midlands                         36.3 31.6 41.1
G East of England                       47.2 41.7 52.6
H London                                54.2 47.6 60.8
J South East                            44.6 38.8 50.5
K South West                            46.5 39.7 53.3
L Wales                                 38.6 28.5 48.7
M Scotland                              42.7 36.3 49.2

Let’s now plot the point estimates and their confidence intervals.

plot_region <- politics_region

names(plot_region)[1:4] <- c(
  "region",
  "percent_interested",
  "ci_lower",
  "ci_upper"
)

ggplot(
  plot_region,
  aes(x = percent_interested, y = region)
) +
  geom_point() +
  geom_errorbar(
    aes(xmin = ci_lower, xmax = ci_upper),
    orientation = "y",
    width = 0.2
  ) +
  labs(
    x = "Interested in politics (%)",
    y = "Government Office Region"
  )

The estimates show political interest varies across regions, ranging from around one-third of adults in the North East to just over one-half in London. The confidence intervals indicate that some of this variation may reflect sampling uncertainty.

8.1 An additional note on subsetting

Another way to estimate statistics for a subgroup is to analyse only respondents belonging to that group. For example, if we were interested only in respondents who are employed, we could create a subset and calculate estimates for that subgroup.

There are a few points to note when subsetting survey data.

Survey weights are usually designed to make the full sample representative of the target population. If we analyse only a subset of respondents, the weights may not necessarily produce estimates that are representative of the corresponding population subgroup.
It is important to retain all the survey design information when estimating standard errors and confidence intervals. If subsetting, you should create the survey design object first and then subset using the subset() function. This ensures that the survey weights, clustering and stratification information are retained in the analysis.

Although subsetting can be useful when analysing a subgroup, domain estimation using svyby() is generally the preferred approach because it is simpler, less prone to error and naturally retains the survey design information.

8.2 Key points

Domain estimates describe subgroups of the population, such as regions, age groups or men and women.
svyby() can be used to estimate statistics for multiple subgroups in a single command while accounting for the survey design.
If new variables are added to the dataset after creating a survey design object, the survey design object should be recreated before analysis.
Be cautious when analysing a subgroup separately. Survey weights may not make the subsample representative of the corresponding population subgroup. Always create the survey design object first and then use subset() to retain all the survey design information.