summary(bsa$leftrigh)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   1.00    2.00    2.40    2.52    3.00    5.00     782
Producing descriptive statistics in R is relatively straightforward, as key functions are included by default in base R. We have already seen above that the summary() command provides essential information about a variable. For instance, summary(bsa$leftrigh) returns the minimum, maximum, mean, median and quartiles of the left-right political scale of respondents, along with the number of missing values.
The describe() command from the Hmisc package provides a more detailed set of summary statistics.
Error in proxy[, ..., drop = FALSE]: incorrect number of dimensions
Calling describe() directly on bsa$leftrigh returns the error above because describe() expects a plain numeric vector, and leftrigh is not one. Its storage type is ‘double’, but it carries extra metadata, including haven-generated value and variable labels. In order for describe() to run properly, we need to convert leftrigh to base R numeric format, either permanently as a new variable or temporarily within the call, as shown below:
as.numeric(bsa$leftrigh)
       n  missing distinct     Info     Mean      Gmd      .05      .10
    3206      782       30    0.993     2.52   0.8831      1.2      1.4
     .25      .50      .75      .90      .95
     2.0      2.4      3.0      3.6      4.0
lowest : 1 1.2 1.4 1.5 1.6 , highest: 4.4 4.6 4.75 4.8 5
describe() also provides the number of observations (including the number of missing and distinct values), selected percentiles from the 5th to the 95th, as well as the five smallest and largest values.
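To see what the conversion to base R numeric format does, here is a minimal sketch using base R only. The vector x and its label and labels attributes are made up for illustration; they only mimic the idea of a labelled variable and are not created with haven.

```r
# Toy stand-in for a labelled survey variable (values and labels are made up).
# The metadata is attached as plain attributes, not through haven.
x <- structure(
  c(1, 2.4, 5, NA),
  label  = "Left-right scale",
  labels = c(Left = 1, Right = 5)
)

typeof(x)                   # the storage type is still "double"
names(attributes(x))        # the extra metadata travels with the vector

x_num <- as.numeric(x)      # as.numeric() drops all attributes
is.null(attributes(x_num))  # TRUE: only the bare numbers remain
summary(x_num)              # missing values are kept and counted
```

The values themselves are unchanged by the conversion; only the metadata is discarded.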
Commands producing single statistics are also available:
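mean(bsa$leftrigh, na.rm = T)
[1] 2.519911
sd(bsa$leftrigh, na.rm = T)
[1] 0.7852958
median(bsa$leftrigh, na.rm = T)
[1] 2.4
max(as.numeric(bsa$leftrigh), na.rm = T)
[1] 5
min(as.numeric(bsa$leftrigh), na.rm = T)
[1] 1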
As previously, the na.rm = T option removes missing values before the statistic is computed; without it the output would be NA, as this is the default behaviour of these functions. As with describe() earlier, max() and min() need the variable to be converted into numeric format to deliver the desired output.
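The effect of na.rm can be checked on a small vector. This is a minimal sketch on made-up values, not on the BSA data:

```r
# Made-up scores with one missing value
x <- c(1, 2.4, NA, 5)

mean(x)                # NA: the default is na.rm = FALSE
mean(x, na.rm = TRUE)  # mean of the three observed values
max(x, na.rm = TRUE)   # largest observed value
```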
We could combine the output from the above commands into a single line using the c() function:
c(
mean(bsa$leftrigh, na.rm = T),
sd(bsa$leftrigh, na.rm = T),
median(bsa$leftrigh, na.rm = T),
max(as.numeric(bsa$leftrigh), na.rm = T),
min(as.numeric(bsa$leftrigh), na.rm = T)
)
[1] 2.5199106 0.7852958 2.4000000 5.0000000 1.0000000
Using these individual commands may come in handy, for instance when further processing of the result is needed:
Let’s round the results to two decimal places:
We can see the final results by typing:
Note:
would produce the same results using just one line of code.
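As an illustration of this store, round and print workflow, here is a minimal sketch on made-up values. The object names (scores, stats, stats_rounded, one_line) are hypothetical and are not taken from the tutorial.

```r
# Made-up scores standing in for a survey scale
scores <- c(2.0, 2.4, 3.0, NA, 1.2, 5.0)

# Step 1: store the individual statistics in a named vector
stats <- c(
  mean   = mean(scores, na.rm = TRUE),
  sd     = sd(scores, na.rm = TRUE),
  median = median(scores, na.rm = TRUE),
  max    = max(scores, na.rm = TRUE),
  min    = min(scores, na.rm = TRUE)
)

# Step 2: round the stored results to two decimal places
stats_rounded <- round(stats, 2)

# Step 3: print the final results
stats_rounded

# The same result in a single expression
one_line <- round(c(mean(scores, na.rm = TRUE), sd(scores, na.rm = TRUE),
                    median(scores, na.rm = TRUE), max(scores, na.rm = TRUE),
                    min(scores, na.rm = TRUE)), 2)
```

Storing the intermediate vector is mostly useful when the statistics are reused later, for instance in a table or a plot.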
The base R installation comes with a wide range of bivariate statistical functions. cor() and cov() provide basic measures of association between two variables. For instance, we can measure the correlation between the left-right and the libertarian-authoritarian scales:
The latter variable records how far someone sits on the libertarian-authoritarian scale, which ranges from 1 to 5.
A correlation of 0.009 indicates a positive but negligible relationship. It can be interpreted as ‘an increase in authoritarianism is associated with a marginal increase in right-wing views’.
Note: When using cor() and cov(), missing values are handled through the use= option, which can take one of the values “everything”, “all.obs”, “complete.obs”, “na.or.complete” or “pairwise.complete.obs”. See ?cor for additional information.
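The role of the use= option can be sketched on two short made-up vectors (the names leftright and libauth_toy and their values are illustrative, not BSA data):

```r
# Two made-up scales, each with one missing value in a different position
leftright   <- c(1.2, 2.0, 2.4, NA, 3.0, 4.6)
libauth_toy <- c(3.5, NA, 3.0, 4.0, 3.7, 2.8)

# Default behaviour (use = "everything"): any NA propagates to the result
r_default <- cor(leftright, libauth_toy)

# Keep only the cases observed on both variables
r_complete <- cor(leftright, libauth_toy, use = "complete.obs")
v_complete <- cov(leftright, libauth_toy, use = "complete.obs")

r_default
r_complete
v_complete
```

With only two variables, "complete.obs" and "pairwise.complete.obs" keep the same cases; the two options only differ when a matrix or data frame with several columns is supplied.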
As with continuous variables, R offers several tools that can be used to describe the distribution of categorical variables. One- and two-way contingency tables are the most commonly used.
There are several R commands that we can use to create frequency tables. The most common ones are table(), xtabs() and ftable(), which return the frequencies of observations within each level of a factor. For example, in order to obtain the political affiliation of BSA respondents in 2017:
     Not applicable        Conservative              Labour    Liberal Democrat
                  0                1263                1479                 241
        Other party                None         Green Party Other answer/DK/Ref
                193                 515                  79                   0
As with any other R function, the outcome of table() can be stored as an object for further processing:
table() does not compute proportions or percentages directly. Proportions are obtained with the prop.table() function, which returns values between 0 and 1 rather than percentages, so the result needs to be multiplied by 100. It is also a good idea to round the results for greater readability.
Either:
     Not applicable        Conservative              Labour    Liberal Democrat
                0.0                33.5                39.2                 6.4
        Other party                None         Green Party Other answer/DK/Ref
                5.1                13.7                 2.1                 0.0
… or:
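A minimal sketch of two equivalent ways of turning a frequency table into rounded percentages, on a made-up party vector (this is an illustration of the idea, not necessarily the exact code used for the BSA output):

```r
# Made-up party identification for eight respondents
party <- factor(c("Labour", "Conservative", "Labour", "None",
                  "Labour", "Conservative", "Green Party", "None"))

counts <- table(party)   # frequencies, stored for further processing

# Route 1: prop.table() returns proportions, so multiply by 100 and round
pct_a <- round(100 * prop.table(counts), 1)

# Route 2: divide the counts by their total by hand
pct_b <- round(100 * counts / sum(counts), 1)

pct_a
pct_b
```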
The simplest way to produce a two-way contingency table is to pass a second variable to table():
                    skipped or na Male Female Dontknow Refusal
Not applicable                  0    0      0        0       0
Conservative                    0  627    636        0       0
Labour                          0  644    835        0       0
Liberal Democrat                0  124    117        0       0
Other party                     0   97     96        0       0
None                            0  199    316        0       0
Green Party                     0   31     48        0       0
Other answer/DK/Ref             0    0      0        0       0
By default table() does not discard empty factor levels (i.e. categories with no observations), which may sometimes produce a slightly cumbersome result. Using droplevels() on each variable resolves the issue:
                 Male Female
Conservative      627    636
Labour            644    835
Liberal Democrat  124    117
Other party        97     96
None              199    316
Green Party        31     48
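The effect of droplevels() on a two-way table can be reproduced on a small example. The two factors below are made up, with one deliberately unused level each:

```r
# Made-up factors with unused levels ("Not applicable", "Refusal")
party <- factor(c("Labour", "Conservative", "Labour"),
                levels = c("Not applicable", "Conservative", "Labour"))
sex   <- factor(c("Female", "Male", "Male"),
                levels = c("Male", "Female", "Refusal"))

full    <- table(party, sex)                          # keeps the empty row and column
trimmed <- table(droplevels(party), droplevels(sex))  # empty levels removed

dim(full)     # 3 x 3
dim(trimmed)  # 2 x 2
trimmed
```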
However, when dealing with more than one variable it is recommended to use xtabs() instead, as it has a number of desirable features directly available as options. The syntax is slightly different as it relies on a formula, an R object consisting of elements separated by a tilde ‘~’. The variables to be tabulated are specified on the right hand side of the formula. In order to lighten the syntax, we will also recode PartyId2 and Rsex permanently into factors.
# as_factor() (from haven) turns labelled variables into factors
bsa$PartyId2.f <- as_factor(bsa$PartyId2)
bsa$Rsex.f <- as_factor(bsa$Rsex)
xtabs(~ PartyId2.f + Rsex.f,
      data = bsa)
                      Rsex.f
PartyId2.f            skipped or na Male Female Dontknow Refusal
  Not applicable                  0    0      0        0       0
  Conservative                    0  627    636        0       0
  Labour                          0  644    835        0       0
  Liberal Democrat                0  124    117        0       0
  Other party                     0   97     96        0       0
  None                            0  199    316        0       0
  Green Party                     0   31     48        0       0
  Other answer/DK/Ref             0    0      0        0       0
The data= parameter name does not have to be spelled out: simply passing ‘bsa’ as the second argument will work. Other useful options are:
subset=, which allows direct specification of a subpopulation from which to derive the table;
drop.unused.levels = T, to remove empty levels;
weights ~, where a variable placed on the left hand side of the formula is treated as weights, a useful feature for survey analysis.
As previously, prop.table() is necessary in order to obtain proportions:
b <- xtabs(~ PartyId2.f + Rsex.f,
           bsa,
           drop.unused.levels = T)
round(100 * prop.table(b), 1) ### Cell percentages
                  Rsex.f
PartyId2.f         Male Female
  Conservative     16.6   16.9
  Labour           17.1   22.1
  Liberal Democrat  3.3    3.1
  Other party       2.6    2.5
  None              5.3    8.4
  Green Party       0.8    1.3
The largest group in the sample (22.1%) is made up of Labour-voting females and the smallest (0.8%) of Green-voting males.
                  Rsex.f
PartyId2.f         Male Female
  Conservative     49.6   50.4
  Labour           43.5   56.5
  Liberal Democrat 51.5   48.5
  Other party      50.3   49.7
  None             38.6   61.4
  Green Party      39.2   60.8
Conservative voters are more or less evenly split between men and women, whereas Labour and Green voters are more likely to be women.
                  Rsex.f
PartyId2.f         Male Female
  Conservative     36.4   31.1
  Labour           37.4   40.8
  Liberal Democrat  7.2    5.7
  Other party       5.6    4.7
  None             11.6   15.4
  Green Party       1.8    2.3
Similar proportions of men voted Conservative and Labour (36-37%), whereas women were clearly more likely to vote Labour.
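The three kinds of percentages discussed above (cell, row and column) differ only in the margin passed to prop.table(). The sketch below shows this on a made-up data frame, together with the weights and subset= options of xtabs(); the data frame toy and its columns party, sex and wt are hypothetical.

```r
# Made-up respondents with a hypothetical weight column
toy <- data.frame(
  party = factor(c("Labour", "Labour", "Conservative",
                   "Conservative", "Labour", "Conservative")),
  sex   = factor(c("Female", "Male", "Male", "Female", "Female", "Male")),
  wt    = c(1.5, 0.5, 1, 1, 2, 1)
)

unweighted <- xtabs(~ party + sex, data = toy)    # plain counts
weighted   <- xtabs(wt ~ party + sex, data = toy) # left-hand side = weights

# Restrict the table to a subpopulation and drop the now-empty level
females <- xtabs(~ party + sex, data = toy,
                 subset = sex == "Female", drop.unused.levels = TRUE)

cell_pct <- round(100 * prop.table(unweighted), 1)     # no margin: cell percentages
row_pct  <- round(100 * prop.table(unweighted, 1), 1)  # margin 1: rows sum to 100
col_pct  <- round(100 * prop.table(unweighted, 2), 1)  # margin 2: columns sum to 100

weighted
row_pct
col_pct
```

Which margin to use depends on the question: row percentages describe the gender split within each party, column percentages the party split within each gender.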
There is no straightforward way to obtain percentages in three-way contingency tables with either xtabs() or table(). This is where the ftable() function comes in handy. For convenience, we converted RAgeCat into a factor.
bsa$RAgeCat.f <- as_factor(bsa$RAgeCat)
round(100 *
        prop.table(
          ftable(RAgeCat.f ~ PartyId2.f + Rsex.f,
                 data = droplevels(bsa)),
          1), ### Margin: 1 for row, 2 for column percentages
      1)      ### Round to one decimal place
                        RAgeCat.f 18-24 25-34 35-44 45-54 55-59 60-64  65+
PartyId2.f       Rsex.f
Conservative     Male               3.0   7.8  13.9  15.0   8.6   9.3 42.4
                 Female             2.7   7.1   8.8  18.6   8.8   8.7 45.4
Labour           Male               7.6  16.1  14.3  21.3   7.5   9.3 23.9
                 Female             7.9  20.2  19.2  18.8   7.1   7.1 19.7
Liberal Democrat Male               0.8  13.7  19.4  15.3   9.7   8.9 32.3
                 Female             4.3   9.4  26.5   6.0   6.0   9.4 38.5
Other party      Male               3.1  14.4  11.3  17.5  11.3  16.5 25.8
                 Female             5.2  14.6  15.6  16.7  11.5   8.3 28.1
None             Male               7.5  22.1  20.6  17.1  11.1   6.0 15.6
                 Female             8.5  22.5  20.6  21.8   7.3   6.3 13.0
Green Party      Male               6.5  32.3  16.1  29.0   6.5   3.2  6.5
                 Female             6.2  18.8  25.0  20.8   6.2  10.4 12.5
The table gives the relative age breakdown for each gender/political affiliation combination (i.e. row percentages). Here again we used droplevels(): this removes unused factor levels which would otherwise be displayed and make the table difficult to read. droplevels() can be applied either to entire data frames or to single variables.
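The same ftable() pattern can be tried on a small made-up data frame. The data frame toy and its columns are hypothetical; sex carries an unused "Refusal" level so that the effect of droplevels() on a whole data frame is visible.

```r
# Made-up respondents; "Refusal" is a level of sex with no observations
toy <- data.frame(
  party = factor(c("Labour", "Labour", "Labour",
                   "Conservative", "Conservative", "Conservative")),
  sex   = factor(c("Male", "Male", "Female", "Male", "Female", "Female"),
                 levels = c("Male", "Female", "Refusal")),
  age   = factor(c("18-34", "35+", "35+", "35+", "18-34", "35+"))
)

# droplevels() on the data frame removes the empty "Refusal" level,
# so the flat table has one row per observed party/sex combination
ft <- ftable(age ~ party + sex, data = droplevels(toy))

# Margin 1 gives row percentages: the age breakdown within each row
row_pct <- round(100 * prop.table(ft, 1), 1)
row_pct
```

Without droplevels(), the table would contain rows for the unused level whose counts are all zero, and their row percentages would be undefined.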
A common requirement in survey analysis is the ability to compare descriptive statistics across subgroups of the data. There are different ways to do this in R. We demonstrate below the most straightforward one, which relies on some of the functions available in the dplyr package.
bsa %>%
  group_by(PartyId2.f) %>%
  summarise(mdscore = median(libauth, na.rm = T),
            sdscore = sd(libauth, na.rm = T))
# A tibble: 7 × 3
PartyId2.f mdscore sdscore
<fct> <dbl> <dbl>
1 Conservative 3.67 0.587
2 Labour 3.33 0.774
3 Liberal Democrat 3.17 0.726
4 Other party 3.67 0.739
5 None 3.67 0.584
6 Green Party 2.83 0.872
7 <NA> 3.67 0.564
The above command produces a table of summary values (medians and standard deviations) of the libertarian vs authoritarian scale. We can see from the medians that Green Party supporters are the most libertarian, followed by Liberal Democrats and Labour, whereas Conservatives, supporters of other parties and those with no party affiliation share the highest, most authoritarian median (3.67). Among the named categories, those with no party affiliation and Conservatives are the most cohesive groups (i.e. with the smallest standard deviations), and Green Party supporters the least. We chose to keep non-responses for PartyId2 in this analysis. Some users might want to remove them instead before computing their results, as in the table below. We do this by using is.na(), which checks variables for the presence of system missing values, in conjunction with filter().
bsa %>%
  filter(!is.na(PartyId2.f)) %>%
  group_by(Rsex.f, PartyId2.f) %>%
  summarise(mnscore = sd(libauth, na.rm = T), # holds the standard deviation, despite its name
            mdscore = median(libauth, na.rm = T))
# A tibble: 12 × 4
# Groups: Rsex.f [2]
Rsex.f PartyId2.f mnscore mdscore
<fct> <fct> <dbl> <dbl>
1 Male Conservative 0.607 3.67
2 Male Labour 0.765 3.33
3 Male Liberal Democrat 0.766 3.17
4 Male Other party 0.703 3.83
5 Male None 0.616 3.67
6 Male Green Party 1.04 2.67
7 Female Conservative 0.565 3.67
8 Female Labour 0.781 3.33
9 Female Liberal Democrat 0.688 3.17
10 Female Other party 0.773 3.67
11 Female None 0.565 3.67
12 Female Green Party 0.744 3
When further broken down by gender, we can see that overall the same trends remain valid, with some nuances: male Green supporters are markedly more libertarian than their female counterparts (median of 2.67 against 3), the opposite being true among supporters of other parties (3.83 against 3.67). Conservative medians are identical for men and women.
Instead of tables of summary statistics, we may want to have summary statistics computed as variables that will be added to the current dataset for each corresponding gender/political affiliation group. This is straightforward to do with dplyr: we just need to use the mutate() command.
However, we also need to add the newly created variables into the existing bsa dataset, which the first line of the syntax above does. We can check that the variables have been created and that the correct values have been assigned to each sex/affiliation category.
[1] "Sserial" "Rsex" "RAgeCat" "Married"
[5] "ChildHh" "HEdQual3" "eq_inc_quintiles" "RClassGp"
[9] "CCBELIEV" "carallow" "carreduc" "carnod2"
[13] "cartaxhi" "carenvdc" "plnenvt" "plnuppri"
[17] "Politics" "Voted" "actchar" "actpol"
[21] "govnosa2" "PartyId2" "leftrigh" "libauth"
[25] "WtFactor" "PartyId2.f" "Rsex.f" "RAgeCat.f"
[29] "msscore" "mdscore"
# A tibble: 5 × 3
Rsex PartyId2 mdscore
<dbl+lbl> <dbl+lbl> <dbl>
1 2 [Female] 2 [Labour] 3.33
2 1 [Male] 3 [Liberal Democrat] 3.17
3 2 [Female] NA 3.67
4 1 [Male] 3 [Liberal Democrat] 3.17
5 2 [Female] 6 [Green Party] 3
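The mutate() approach described above can be sketched on a small made-up data frame. This assumes the dplyr package is installed; the data frame toy and the new column names mnscore and mdscore are illustrative and are not the ones computed on the BSA data.

```r
library(dplyr)

# Made-up respondents
toy <- data.frame(
  sex     = c("Male", "Male", "Female", "Female", "Female"),
  party   = c("Labour", "Labour", "Labour", "Green Party", "Green Party"),
  libauth = c(3.0, 4.0, 3.5, 2.5, NA)
)

# Assigning the result back to toy is what stores the new columns;
# mutate() keeps one row per respondent and repeats each group value
toy <- toy %>%
  group_by(sex, party) %>%
  mutate(mnscore = mean(libauth, na.rm = TRUE),
         mdscore = median(libauth, na.rm = TRUE)) %>%
  ungroup()

names(toy)  # the two new columns appear at the end
toy         # every respondent carries the statistics of their own group
```

Unlike summarise(), which collapses the data to one row per group, mutate() leaves the number of rows unchanged.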