Sourcing a suitable sample: understanding selection bias in survey data

During my time at FreeAgent, I have been analysing attitudinal customer survey data to predict customer behaviour. Getting to the bottom of how exactly this data was collected has helped me to understand it and has given me a few ideas about how it could be collected in the future. This blog post focuses on how we choose who is selected to take part in surveys: a process known as ‘sampling’.

Determining the definition

The Merriam-Webster dictionary defines the process of sampling as:

“…selecting a representative part of a population for the purpose of determining parameters or characteristics of the whole population¹”

Therefore, a ‘sample’, by definition, is never a perfect representation of the population that we are truly interested in. The only perfectly representative sample would be the entire population at a given time, which is not a ‘sample’ at all! During survey sampling, we are artificially sieving out individuals that are accessible and willing to give us information, which inevitably leads to bias. This virtual sieve has become finer in recent times due to increasingly stringent data protection laws and a reduction in the number of people that are contactable². It is more important than ever that we acknowledge that each survey has its own susceptibilities to sample bias. Knowing about these allows us to draw appropriate conclusions from our results and might influence how we design future surveys.

Evaluating an example

We have discussed what sampling means but how does it work in real life? Let’s consider a food-based survey scenario (always my favourite type of survey!).

Recently, the FreeAgent data science team have made the most out of the fantastic hot weather by making dried aubergines in the sun on the balcony. We liked our aubergines so much that we were interested in getting people’s opinions about them to see how we can improve in future!

Here, our ‘target population’ is everyone who eats aubergines in the UK. However, we don’t have any money for our survey and we only have very limited time before the British summer ends and we can no longer make our aubergines! We also need to make sure people get the chance to actually try our aubergines before they take the survey, which is a significant practical constraint. Which sampling method best suits our needs?

Our dried aubergines with mozzarella and tomato!

Mulling over the methods

The following is an introduction to some of the sampling methods most commonly used for surveys. This is not an exhaustive list and there are many different variations and combinations used, depending on time, legal, financial and practical constraints. Let’s find out if we could apply any of these methods to our aubergine survey!

Probability sampling methods involve selecting a sample at random, such as forming a team by pulling names out of a hat³. There are several types:

Simple random sampling — involves randomly selecting individuals for the sample from the target population so that they have an equal chance of being selected⁴.

Pros

Easy
Generally fair representation of the target population

Cons

Can lead to under-representation of certain subgroups (e.g. sex, race)
Not always practical

Example: Obtain a list of everyone who eats aubergines in the UK and use a computer to generate a random sample of the goal number of people.

Method evaluation: Not feasible due to practical, financial, time and probably legal constraints!
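If we did somehow have that list, the draw itself would be the easy part. Here is a minimal Python sketch, assuming a made-up sampling frame of 1,000 people (the `population` list and the `simple_random_sample` helper are hypothetical names for illustration only):

```python
import random

# Hypothetical sampling frame: in reality we have no list of UK aubergine eaters.
population = [f"person_{i}" for i in range(1000)]

def simple_random_sample(frame, n, seed=None):
    """Draw n distinct individuals, each with an equal chance of selection."""
    rng = random.Random(seed)  # a seed makes the draw repeatable
    return rng.sample(frame, n)

sample = simple_random_sample(population, 50, seed=42)
print(len(sample), sample[:3])
```

`random.sample` draws without replacement, so nobody can be picked twice.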

Systematic random sampling — involves randomly ordering individuals and selecting them at equal intervals (e.g. choosing every 5th individual)⁴.

Pros

Easy
Generally fair representation of the target population
Can save time in certain sampling situations (e.g. seated individuals)

Cons

Can lead to under-representation of certain subgroups (e.g. sex, race)
Not always practical

Example: Obtain a list of everyone who eats aubergines in the UK, shuffle the list and choose the goal number of people at equal intervals along the list to sample.

Method evaluation: Same issues as simple random sampling.
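As a rough sketch of the shuffle-then-step idea, again assuming a hypothetical frame of 1,000 people, the interval is the frame size divided by the goal sample size and the starting point is chosen at random within the first interval (`systematic_sample` is an illustrative helper, not a library function):

```python
import random

# Hypothetical sampling frame, as before.
population = [f"person_{i}" for i in range(1000)]

def systematic_sample(frame, n, seed=None):
    """Shuffle the frame, then take every k-th individual from a random start."""
    if not 0 < n <= len(frame):
        raise ValueError("n must be between 1 and the size of the frame")
    rng = random.Random(seed)
    shuffled = list(frame)
    rng.shuffle(shuffled)
    step = len(shuffled) // n      # sampling interval k
    start = rng.randrange(step)    # random start within the first interval
    return shuffled[start::step][:n]

sample = systematic_sample(population, 50, seed=42)
print(len(sample))
```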

Stratified random sampling — involves dividing the target population into groups and randomly selecting individuals from each group⁴.

Pros

Ensures every subgroup (e.g. sex, race) is represented
Generally fair representation of the target population
May reduce variation in the sample

Cons

Difficult to decide how to divide into groups
Not always practical

Example: Obtain a list of everyone who eats aubergines in the UK, divide these into groups of people that we know perceive taste differently based on research (e.g. sex, ethnicity, age, smoking status) and use a computer to generate a random sample within each group, totalling the goal number of individuals.

Method evaluation: Same issues as simple random sampling, with the added complexity of obtaining a considerable amount of personal information.
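To make the grouping step concrete, here is a small sketch that assumes we only stratify by a single, invented `age_band` field and sample the same fraction from every group (proportional allocation is my assumption here; other allocations are possible):

```python
import random
from collections import defaultdict

# Hypothetical frame with one invented stratification variable.
_rng = random.Random(0)
population = [
    {"id": i, "age_band": _rng.choice(["18-34", "35-54", "55+"])}
    for i in range(1000)
]

def stratified_sample(frame, key, fraction, seed=None):
    """Randomly sample the same fraction of individuals from every stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for person in frame:
        strata[person[key]].append(person)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))  # at least one per stratum
        sample.extend(rng.sample(members, k))
    return sample

sample = stratified_sample(population, "age_band", 0.05, seed=1)
print(len(sample))
```

Because each stratum's size is rounded separately, the total can differ from the goal number by one or two people.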

Clustered (AKA area) random sampling — involves dividing the target population into groups (usually by geographical area) and randomly selecting some groups (all individuals within these groups are included)⁴.

Pros

Can save time/money in certain situations

Cons

Can lead to under-representation of certain subgroups (e.g. sex, race)
Less fair representation of the target population
Not always practical

Example: Obtain a list of everyone who eats aubergines in the UK, divide these into groups of people within geographical locations (e.g. output area) and use a computer to randomly choose groups to sample.

Method evaluation: Same issues as simple random sampling, with the added complexity of obtaining geographical information.
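The key difference from stratified sampling is that the random draw happens at the group level. A minimal sketch, assuming 20 invented areas and a hypothetical `cluster_sample` helper:

```python
import random
from collections import defaultdict

# Hypothetical frame: 1,000 people spread over 20 invented areas.
_rng = random.Random(0)
areas = [f"area_{i}" for i in range(20)]
population = [{"id": i, "area": _rng.choice(areas)} for i in range(1000)]

def cluster_sample(frame, key, n_clusters, seed=None):
    """Randomly choose whole clusters and include everyone inside them."""
    rng = random.Random(seed)
    clusters = defaultdict(list)
    for person in frame:
        clusters[person[key]].append(person)
    chosen = rng.sample(sorted(clusters), n_clusters)  # draw clusters, not people
    return [person for name in chosen for person in clusters[name]]

sample = cluster_sample(population, "area", 4, seed=1)
print(len(sample), sorted({p["area"] for p in sample}))
```

Note that the final sample size is not fixed in advance: it depends on how many people live in the areas that happen to be drawn.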

Non-probability sampling methods involve selecting a sample but not at random, such as for a specific purpose³. There are several types:

Convenience sampling — involves selecting individuals that are practically easy to reach⁴.

Pros

Easy
Practical and saves money

Cons

Unlikely to be a fair representation of the target population

Example: Ask everyone in the FreeAgent office if they eat aubergines and select everyone who does to be in our sample.

Method evaluation: A practical solution but would not represent the opinions of the UK population and we might not be able to obtain a large enough sample.

Quota sampling — involves dividing the target population into groups and aiming to hit a target (quota) for a number/proportion of individuals within those groups to be included⁴.

Pros

More likely to be a fair representation of the target population
More practical than stratified random sampling

Cons

Difficult to decide how to divide into groups and assign quotas
Time consuming

Example: Obtain a list of everyone who eats aubergines in the UK, divide these into groups of people that we know perceive taste differently based on research (e.g. sex, ethnicity, age, smoking status) and sample the same number of people within each group, until the goal number of individuals is reached.

Method evaluation: Same issues as stratified random sampling.
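Unlike stratified random sampling, nothing here is drawn at random: we simply accept people in whatever order we reach them until each group's quota is full. A small sketch, assuming an invented list of arrivals and a single `smoker` grouping variable (both hypothetical):

```python
# Hypothetical arrivals in the order we happened to reach them:
# every fourth person is a smoker.
arrivals = [{"id": i, "smoker": i % 4 == 0} for i in range(40)]

def quota_sample(arrivals, key, quotas):
    """Accept people in arrival order until every group's quota is filled."""
    counts = {group: 0 for group in quotas}
    sample = []
    for person in arrivals:
        group = person[key]
        if counts.get(group, 0) < quotas.get(group, 0):
            sample.append(person)
            counts[group] += 1
        if counts == quotas:  # all quotas met, stop recruiting
            break
    return sample

sample = quota_sample(arrivals, "smoker", {True: 5, False: 5})
print(len(sample))
```

In this toy data the non-smoker quota fills quickly, and we then have to keep turning non-smokers away while waiting for enough smokers, which hints at why the method can be time consuming.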

Modal instance and heterogeneity sampling — opposite methods that involve selecting individuals that represent the majority or the extremes of the target population respectively⁴.

Pros

Good when it is only the majority or the extremes of the target population that we are interested in respectively

Cons

Only valid in specific situations
Difficult to decide which individuals represent the majority or extremes
Modal instance sampling does not represent the variation in the population, and heterogeneity sampling does not represent the average
Not a fair representation of the wider population

Example: Obtain a list of everyone who eats aubergines in the UK and a score of how much they like eating them, then survey the people that give the most common score (modal instance) or the lowest and highest scores (heterogeneity sampling).

Method evaluation: Same issues as simple random sampling, with the added complexity of obtaining aubergine scores.
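Following the definitions used above, both selections are simple once the scores exist. A sketch with a handful of invented people and scores (the names, scores and helper functions are all hypothetical):

```python
from statistics import multimode

# Hypothetical "how much do you like aubergines?" scores (1-5).
scores = {"Ana": 4, "Ben": 1, "Cat": 4, "Dev": 5, "Eve": 3, "Fay": 4, "Gus": 1}

def modal_instance_sample(scores):
    """Select the people who gave the most common score(s)."""
    modes = set(multimode(scores.values()))
    return [person for person, score in scores.items() if score in modes]

def heterogeneity_sample(scores):
    """Select the people who gave the lowest or the highest score."""
    lowest, highest = min(scores.values()), max(scores.values())
    return [person for person, score in scores.items() if score in (lowest, highest)]

modal = modal_instance_sample(scores)
extremes = heterogeneity_sample(scores)
print(modal)     # ['Ana', 'Cat', 'Fay']
print(extremes)  # ['Ben', 'Dev', 'Gus']
```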

Expert sampling — involves selecting a panel of individuals that are experts in a specific topic⁴.

Pros

Good when it is only the opinion of experts that we are interested in

Cons

Only valid in specific situations
Not a fair representation of the wider population

Example: Research aubergine-eating food critics in the UK and include them in the sample.

Method evaluation: Same issues as convenience sampling with the added complexities that it probably wouldn’t be practical and aubergine experts might expect financial compensation for giving us their opinions!

Snowball sampling — involves selecting individuals that fit the inclusion criteria for your study and asking them to recommend other people for inclusion into the study⁴.

Pros

Practical and saves money
Good for reaching individuals in hard-to-reach groups (e.g. homeless people)

Cons

Unlikely to be a fair representation of the target population
Time consuming

Example: Find a few people that eat aubergines that we know and include them in our sample, then ask them if they know any other people who eat aubergines and include them in our sample, until we have reached the goal number of individuals.

Method evaluation: Same issues as convenience sampling with the added risk that it would take a considerable amount of time to recruit a large enough sample.
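The referral process can be thought of as walking outwards through a network of acquaintances. Here is a minimal sketch, assuming an invented referral network and a breadth-first order of contact (the names and the `snowball_sample` helper are hypothetical):

```python
from collections import deque

# Hypothetical network: who each person would recommend.
referrals = {
    "Ana": ["Ben", "Cat"],
    "Ben": ["Dev"],
    "Cat": ["Ana", "Eve"],
    "Dev": [],
    "Eve": ["Fay"],
    "Fay": [],
}

def snowball_sample(seeds, referrals, target):
    """Start from known people and follow recommendations until the target is met."""
    sample = []
    queue = deque(seeds)
    seen = set(seeds)  # avoid contacting anyone twice
    while queue and len(sample) < target:
        person = queue.popleft()
        sample.append(person)
        for friend in referrals.get(person, []):
            if friend not in seen:
                seen.add(friend)
                queue.append(friend)
    return sample

sample = snowball_sample(["Ana"], referrals, target=4)
print(sample)  # ['Ana', 'Ben', 'Cat', 'Dev']
```

Everyone in the sample is connected to the starting person, which is exactly why the result is unlikely to be a fair representation of the target population.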

Finally, multiple sampling methods involve a combination of two or more different types of sampling, usually to fit the practical requirements of the survey whilst aiming to obtain a more representative sample.

Being balanced about bias

For our aubergine survey scenario, imagine we chose convenience sampling because it was the only practically feasible method. In an attempt to make our sample a little more diverse, we took advantage of an upcoming company-wide FreeAgent recruitment event! Part of the event sign-up process involved asking attendees whether they would agree to take part in a telephone survey about the aubergines. We offered our aubergines to everyone at the event and recorded who tried them. We then telephoned everyone who had agreed to participate and had tried our aubergines and surveyed everyone who answered the phone. Let’s consider the different types of bias that our sample could be prone to:

Selection bias — the concept that individuals that are selected for a sample are not representative of the wider population⁵. E.g. were individuals that attended the FreeAgent recruitment event more likely to:

Be of a certain gender or sexuality?
Be of a particular nationality, culture or ethnic group?
Work in technology/accountancy rather than other employment sectors?
Have particular food preferences?

Volunteer bias — the concept that individuals that volunteer for surveys are different from those who do not⁵. E.g. were individuals that volunteered:

Of different personalities, nationalities, cultures, educational levels, social lives, backgrounds etc?
Likely to have an agenda that would influence their decision, such as potential job candidates?
People that we work closely with or friends?

Non-response bias — the concept that individuals who do not respond to a survey are different from those who do⁵. E.g. did individuals that did not answer the telephone:

Have busier lives because they are at a different life stage (career, family etc)?
Have jobs in sectors where they work out of standard office hours?
Not want to take part in the survey because they hated our aubergines?
Not want to take part in the survey because they had a bad time at the event?
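To see how much non-response alone could distort a result, here is a toy simulation. Everything in it is an assumption for illustration: the scores are invented, and I have simply assumed that the chance of answering the phone rises with how much someone liked the aubergines.

```python
import random

rng = random.Random(0)

# Hypothetical scores (1-10) for everyone who tried the aubergines.
scores = [rng.randint(1, 10) for _ in range(10_000)]

def answers_phone(score):
    # Assumption: response probability rises from 10% (score 1) to 100% (score 10).
    return rng.random() < score / 10

respondents = [score for score in scores if answers_phone(score)]

true_mean = sum(scores) / len(scores)
observed_mean = sum(respondents) / len(respondents)
print(f"true mean: {true_mean:.2f}, mean among respondents: {observed_mean:.2f}")
```

Under these made-up assumptions, the average score among respondents comes out higher than the average across everyone who tried the aubergines, even though nobody's opinion changed.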

At first look, this seems like a whole lot of bias! Does this mean we have to scrap the entire survey and give up? Of course not! Being aware of these biases helps us draw balanced conclusions about the outcomes of the survey. For example, if we discovered that our sample mostly consisted of Scottish, white males who work in tech, it would not be possible to claim that our survey demonstrates the attitudes of all people in the UK. However, this survey could still help us understand how we could improve our aubergines for this particular audience and we might choose to target a different population of people the next time we conduct the survey. In my next blog post, I will approach the subject of designing surveys themselves and consider how we can avoid different types of bias introduced by the way we ask questions and record answers.

References

1. Merriam-Webster. 2018. Sampling. Merriam-Webster. Available from: https://www.merriam-webster.com/dictionary/sampling. [Accessed 22 August 2018].
2. GOV.UK. 2018. Data Protection Act 2018. Crown Copyright. Available from: https://www.gov.uk/government/collections/data-protection-act-2018. [Accessed 22 August 2018].
3. Web Centre for Social Research Methods. 2006. Sampling. William M.K. Trochim. Available from: https://socialresearchmethods.net/kb/sampling.php. [Accessed 22 August 2018].
4. Lavrakas, P. J. 2008. Encyclopedia of Survey Research Methods. SAGE Publications Ltd. London, UK.
5. Sedgwick, P. 2013. Questionnaire surveys: sources of bias. BMJ; 347:f5265.
