Test-retest reliability of four U.S. non-probability sample sources

Mario Callegaro
Inna Tsirlin
American Association for Public Opinion Research (2022)

Abstract

It is common practice in market research to set up cross-sectional survey trackers. Although many studies have investigated the accuracy of non-probability-based online samples, less is known about their test-retest reliability, which is of key importance for such trackers. In this study, we wanted to assess how stable measurement is over short periods of time, so that any changes observed over long periods in survey trackers can be attributed to true changes in sentiment rather than to sample artifacts.

To achieve this, we repeated the same 10-question survey of 1,500 respondents two weeks apart in four different U.S. non-probability-based samples: Qualtrics panels, representing a typical non-probability-based online panel; Google Surveys, representing a river sampling approach; Google Opinion Rewards, representing a mobile panel; and Amazon MTurk, which is not a survey panel in itself but is used de facto as one in academic research.

To quantify test-retest reliability, we compared the response distributions from the two survey administrations. Given that the attitudes measured were not expected to change over such a short timespan, and that no relevant external events that could affect these attitudes were reported during fielding, the assumption was that the two measurements should be very close to each other, aside from transient measurement error.
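The abstract does not name the statistical procedure used for this comparison; purely as an illustration, a minimal sketch of one common approach, a chi-square test of homogeneity on the wave-1 versus wave-2 response counts for a single question, is shown below. The response categories and counts are hypothetical.

from scipy.stats import chi2_contingency

# Rows are the two survey administrations (waves); columns are the
# response categories of one attitudinal question (hypothetical counts).
wave1_counts = [120, 310, 420, 450, 200]
wave2_counts = [115, 300, 430, 445, 210]

chi2, p_value, dof, expected = chi2_contingency([wave1_counts, wave2_counts])
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.3f}")
# A large p-value is consistent with the two administrations drawing from
# the same underlying response distribution (test-retest stability).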

We found that two of the samples produced remarkably consistent results between the two survey administrations, one sample was less consistent, and the fourth sample had significantly different response distributions for three of the four attitudinal questions. This study sheds light on the suitability of different non-probability-based samples for cross-sectional attitude tracking.
