Yongwei Yang

Yongwei Yang is a researcher at Google. He works on (1) user and consumer research, (2) public perceptions of AI, (3) integrating AI into research methods and processes, and (4) the attitude-behavior linkage and its implications for business goal-setting and impact evaluation. Yongwei also conducts foundational methodological research on collecting better data and making better use of data, especially with surveys, psychological measurement, and behavioral signals. He is passionate about using his expertise to create a positive impact and to help others become effective users of research. Yongwei holds a Ph.D. in Quantitative and Psychometric Methods from the University of Nebraska-Lincoln.
Authored Publications
    Test-retest reliability of four U.S. non-probability sample sources
    Mario Callegaro
    Inna Tsirlin
    American Association for Public Opinion Research (2022)
    Abstract: It is a common practice in market research to set up cross sectional survey trackers. Although many studies have investigated the accuracy of non-probability-based online samples, less is known about their test-retest reliability which is of key importance for such trackers. In this study, we wanted to assess how stable measurement is over short periods of time so that any changes observed over long periods in survey trackers could be attributed to true changes in sentiment rather than sample artifacts. To achieve this, we repeated the same 10-question survey of 1,500 respondents two weeks apart in four different U.S. non-probability-based samples. The samples included: Qualtrics panels representing a typical non-probability-based online panel, Google Surveys representing a river sampling approach, Google Opinion Rewards representing a mobile panel, and Amazon MTurk, not a survey panel in itself but de facto used as such in academic research. To quantify test-retest reliability, we compared the response distributions from the two survey administrations. Given the attitudes measured were not expected to change in a short timespan and no relevant external events were reported during fielding to potentially affect the attitudes, the assumption was that the two measurements should be very close to each other, aside from transient measurement error. We found two of the samples produced remarkably consistent results between the two survey administrations, one sample was less consistent, and the fourth sample had significantly different response distributions for three of the four attitudinal questions. This study sheds light on the suitability of different non-probability-based samples for cross sectional attitude tracking.
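    The reliability check described above boils down to comparing a question's response distribution across the two administrations. The following is a minimal Python sketch of that kind of comparison, using a chi-square test on invented wave-1 and wave-2 counts; it illustrates the general idea rather than the paper's actual analysis.

    # Minimal sketch: compare the answer distribution of one survey question across
    # two administrations (wave 1 vs. wave 2). Counts are invented for illustration.
    from scipy.stats import chi2_contingency

    # Rows = waves, columns = answer options (e.g., a 5-point attitudinal scale).
    wave1_counts = [120, 310, 420, 390, 260]
    wave2_counts = [115, 325, 410, 400, 250]

    chi2, p_value, dof, _ = chi2_contingency([wave1_counts, wave2_counts])
    print(f"chi2={chi2:.2f}, dof={dof}, p={p_value:.3f}")
    # A non-significant result is consistent with stable distributions between waves;
    # a significant result flags a shift that a tracker could mistake for true change.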
    Abstract: Survey communities have regularly discussed optimal questionnaire design for attitude measurement. Specifically for consumer satisfaction, which has historically been treated as a bipolar construct (Thurstone, 1931; Likert, 1932), some argue it is actually two separate unipolar constructs, which may yield signals with separable and interactive dynamics (Cacioppo & Berntson, 1994). Earlier research has explored whether attitude measurement validity can be optimized with a branching design that involves two questions: a question about the direction of an attitude (e.g., positive, negative), followed by a question, using a unipolar scale, about the intensity of the selected direction (Krosnick & Berent, 1993). The current experiment evaluated differences across a variety of question designs for in-product contextual satisfaction surveys (Sedley & Müller, 2016). Specifically, we randomly assigned respondents to the following designs (see the recoding sketch after the reference list):
      • Traditional 5-point bipolar satisfaction scale (fully labeled)
      • Branched: a directional question (satisfied, neither satisfied nor dissatisfied, dissatisfied), followed by a unipolar question on intensity (5-point scale from “not at all” to “extremely,” fully labeled)
      • Unipolar satisfaction scale, followed by a unipolar dissatisfaction scale (both use a 5-point scale from “not at all” to “extremely,” fully labeled)
      • Unipolar dissatisfaction scale, followed by a unipolar satisfaction scale (both use a 5-point scale from “not at all” to “extremely,” fully labeled)
    The experiment adds to the attitude question design literature by evaluating designs based on criterion validity evidence, namely the relationship with user behaviors linked to survey responses. Results show that no format clearly outperformed the “traditional” bipolar scale format for the criteria included. Separate unipolar scales performed poorly and may be awkward or annoying for respondents. Branching, while performing similarly to the traditional bipolar design, showed no gain in validity; it is also not desirable because it requires two questions instead of one, increasing respondent burden.
    REFERENCES
    Cacioppo, J. T., & Berntson, G. G. (1994). Relationship between attitudes and evaluative space: A critical review, with emphasis on the separability of positive and negative substrates. Psychological Bulletin, 115, 401–423.
    Krosnick, J. A., & Berent, M. K. (1993). Comparisons of party identification and policy preferences: The impact of survey question format. American Journal of Political Science, 37, 941–964.
    Likert, R. (1932). A technique for the measurement of attitudes. Archives of Psychology, 22, 5–55.
    Malhotra, N., Krosnick, J. A., & Thomas, R. K. (2009). Optimal design of branching questions to measure bipolar constructs. Public Opinion Quarterly, 73, 304–324.
    O’Muircheartaigh, C., Gaskell, G., & Wright, D. B. (1995). Weighing anchors: Verbal and numeric labels for response scales. Journal of Official Statistics, 11, 295–308.
    Sedley, A., & Müller, H. (2016, May). User experience considerations for contextual product surveys on smartphones. Paper presented at the 71st annual conference of the American Association for Public Opinion Research, Austin, TX. Retrieved from https://ai.google/research/pubs/pub46422/
    Thurstone, L. L. (1927). A law of comparative judgment. Psychological Review, 34, 273–286.
    Thurstone, L. L. (1931). Rank order as a psychological method. Journal of Experimental Psychology, 14, 187–201.
    Wang, R., & Krosnick, J. A. (2020). Middle alternatives and measurement validity: A recommendation for survey researchers. International Journal of Social Research Methodology, 23, 169–184.
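    To make the branched design comparable with the single bipolar question, the direction and intensity answers have to be combined onto one scale. The Python sketch below shows one plausible way to do that, loosely in the spirit of the 0-to-1 normalization described in Malhotra et al. (2009); the intermediate intensity labels and the exact mapping are assumptions for illustration, not the coding used in this study.

    # Illustrative recoding of a branched satisfaction item (direction + intensity)
    # onto a single bipolar score in [-1, 1]. The mapping is assumed for illustration.

    # Intermediate labels are assumed; the abstract only specifies the scale endpoints.
    INTENSITY = {"not at all": 1, "slightly": 2, "moderately": 3, "very": 4, "extremely": 5}

    def branched_to_bipolar(direction, intensity_label=None):
        """Map (direction, intensity) answers to a score in [-1, 1]."""
        if direction == "neither satisfied nor dissatisfied":
            return 0.0
        score = INTENSITY[intensity_label] / 5.0  # normalize intensity to (0, 1]
        return score if direction == "satisfied" else -score

    # Example: a respondent who chose "dissatisfied", then "very".
    print(branched_to_bipolar("dissatisfied", "very"))  # -0.8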
    Exciting, Useful, Worrying, Futuristic: Public Perception of Artificial Intelligence in 8 Countries
    Patrick Gage Kelley
    Christopher Moessner
    Aaron Sedley
    Andreas Kramm
    David T. Newman
    Allison Woodruff
    AIES '21: Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society (2021), 627–637
    Abstract: As the influence and use of artificial intelligence (AI) have grown and its transformative potential has become more apparent, many questions have been raised regarding the economic, political, social, and ethical implications of its use. Public opinion plays an important role in these discussions, influencing product adoption, commercial development, research funding, and regulation. In this paper we present results of an in-depth survey of public opinion of artificial intelligence conducted with 10,005 respondents spanning eight countries and six continents. We report widespread perception that AI will have significant impact on society, accompanied by strong support for the responsible development and use of AI, and also characterize the public’s sentiment towards AI with four key themes (exciting, useful, worrying, and futuristic) whose prevalence distinguishes response to AI in different countries.
    “Mixture of amazement at the potential of this technology and concern about possible pitfalls”: Public sentiment towards AI in 15 countries
    Patrick Gage Kelley
    Christopher Moessner
    Aaron M Sedley
    Allison Woodruff
    Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 44 (2021), pp. 28-46
    Abstract: Public opinion plays an important role in the development of technology, influencing product adoption, commercial development, research funding, career choices, and regulation. In this paper we present results of an in-depth survey of public opinion of artificial intelligence (AI) conducted with over 17,000 respondents spanning fifteen countries and six continents. Our analysis of open-ended responses regarding sentiment towards AI revealed four key themes (exciting, useful, worrying, and futuristic) which appear to varying degrees in different countries. These sentiments, and their relative prevalence, may inform how the public influences the development of AI.
    Scaling the smileys: A multicountry investigation
    Aaron Sedley
    Joseph M. Paxton
    The Essential Role of Language in Survey Research, RTI Press (2020), pp. 231-242
    Abstract: Contextual user experience (UX) surveys are brief surveys embedded in a website or mobile app (Sedley & Müller, 2016). In these surveys, emojis (e.g., smiley faces, thumbs, stars), with or without text labels, are often used as answer scales. Previous investigations in the United States found that carefully designed smiley faces may distribute fairly evenly along a numerical scale (0–100) for measuring satisfaction (Sedley, Yang, & Hutchinson, 2017). The present study investigated the scaling properties and construct meaning of smiley faces in six countries. We collected open-ended descriptions of smileys to understand construct interpretations across countries. We also assessed numeric meaning of a set of five smiley faces on a 0–100 range by presenting each face independently, as well as in context with other faces with and without endpoint text labels.
    Assessing the validity of inferences from scores on the cognitive reflection test
    Nikki Blacksmith
    Tara S. Behrend
    Gregory A. Ruark
    Journal of Behavioral Decision Making, 32 (2019), pp. 599-612
    Abstract: Decision‐making researchers purport that a novel cognitive ability construct, cognitive reflection, explains variance in intuitive thinking processes that traditional mental ability constructs do not. However, researchers have questioned the validity of the primary measure because of poor construct conceptualization and lack of validity studies. Prior studies have not adequately aligned the analytical techniques with the theoretical basis of the construct, dual‐processing theory of reasoning. The present study assessed the validity of inferences drawn from cognitive reflection test (CRT) scores. We analyzed response processes with an item response tree model, a method that aligns with the dual‐processing theory in order to interpret CRT scores. Findings indicate that the intuitive and reflective factors that the test purportedly measures were indistinguishable. Exploratory, post hoc analyses demonstrate that CRT scores are most likely capturing mental abilities. We suggest that future researchers recognize and distinguish between individual differences in cognitive abilities and cognitive processes.
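    The item response tree approach described in this abstract starts by recoding each CRT answer into sequential pseudo-items (e.g., was the intuitive lure chosen, and if not, was the reflective answer correct?). The Python sketch below illustrates that recoding step on the classic bat-and-ball item; the tree structure and coding shown are an assumed example, not necessarily the model specification used in the paper, and the IRT estimation itself is not shown.

    # Illustrative pseudo-item coding for an item response tree (IRTree) analysis of
    # one CRT item. Node 1: did the respondent give the intuitive lure? Node 2 (only
    # reached if not): did they give the correct, reflective answer?

    def irtree_pseudo_items(response, lure, correct):
        """Return (node1, node2) codes; None means the node was not reached."""
        gave_lure = int(response == lure)
        if gave_lure:
            return gave_lure, None
        return gave_lure, int(response == correct)

    # Classic bat-and-ball item, answers in cents: lure = 10, correct = 5.
    for answer in (10, 5, 7):
        print(answer, irtree_pseudo_items(answer, lure=10, correct=5))
    # 10 -> (1, None) intuitive; 5 -> (0, 1) reflective; 7 -> (0, 0) other error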
    Abstract: Contextual user experience (UX) surveys are brief surveys embedded in a website or mobile app and triggered during or after a user-product interaction. They are used to measure user attitudes and experience in the context of actual product usage. In these surveys, smiley faces (with or without verbal labels) are often used as answer scales for questions measuring constructs such as satisfaction. From studies done in the US in 2016 and 2017, we found that carefully designed smiley faces may distribute fairly evenly along a numerical scale (0-100), and that the scaling property further improved with endpoint verbal labels (Sedley, Yang, & Hutchinson, presented at AAPOR 2017). With the proliferation of mobile app products around the world, the survey research community is compelled to test the generalizability of single-population findings (often from the US) to cross-national, cross-language and cross-cultural contexts. The current study builds upon the above scaling study as well as work by cross-cultural survey methodologists who investigated the meanings of verbal scales (e.g., Smith, Mohler, Harkness, & Onodera, 2005). We investigate the scaling properties of smiley faces in a number of distinct cultural and language settings: US (English), Japan (Japanese), Germany (German), Spain (Spanish), India (English), and Brazil (Portuguese). Specifically, we explore construct alignment by capturing respondents’ own interpretations of the smiley face variants via open-ended responses. We also assess the scaling properties of various smiley designs by measuring each smiley face on a 0-100 scale to calculate semantic distances between smileys. This is done both by presenting each smiley face independently and in context with other smileys. We additionally evaluate the effect of including verbal endpoint labels with the smiley scale.
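    The 0-100 placement task described above lends itself to a simple scaling summary: average each smiley's placements and look at the gaps between adjacent faces. A minimal Python sketch follows; the face labels and ratings are invented for illustration only.

    # Illustrative scaling summary: mean 0-100 placement per smiley face and the
    # gaps between adjacent faces. The ratings below are invented for illustration.
    from statistics import mean

    placements = {  # face label -> respondents' 0-100 placements of that face
        "very_unhappy": [2, 5, 0, 8],
        "unhappy": [22, 30, 25, 28],
        "neutral": [48, 52, 50, 47],
        "happy": [72, 75, 70, 78],
        "very_happy": [95, 98, 100, 92],
    }

    means = {face: mean(ratings) for face, ratings in placements.items()}
    faces = list(means)
    gaps = [round(means[b] - means[a], 2) for a, b in zip(faces, faces[1:])]
    print(means)
    print("adjacent gaps:", gaps)  # roughly equal gaps suggest an evenly spaced scale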
    Response Option Order Effects in Cross-Cultural Context: An experimental investigation
    Rich Timpone
    Mario Callegaro
    Marni Hirschorn
    Vlad Achimescu
    Maribeth Natchez
    2019 Conference of the European Association for Survey Research (ESRA), Zagreb (2019) (to appear)
    Abstract: A response option order effect occurs when different orders of rating scale response options lead to different distributions or functioning of survey questions. Theoretical interpretations, notably satisficing, memory bias (Krosnick & Alwin, 1987), and anchor-and-adjustment (Yan & Keusch, 2015), have been used to explain such effects. Visual interpretive heuristics (especially “left-and-top-mean-first” and “up-means-good”) may also provide insights on how the positioning of response options affects answers (Tourangeau, Couper, & Conrad, 2004, 2013). Most existing studies that investigated the response option order effect were conducted in mono-cultural settings. However, the presence and extent of response option order effects may be affected by “cultural” factors in a few ways. First, interpretive heuristics such as “left-means-first” may work differently due to varying reading conventions (e.g., left-to-right vs. right-to-left). Furthermore, people within cultures where there are multiple primary languages and multiple reading conventions might possess different positioning heuristics. Finally, respondents from different countries may have varying degrees of exposure and familiarity with a specific type of visual design. In this experimental study, we investigate rating scale response option order effects across three countries with different reading conventions and industry norms for answer scale designs: the US, Israel, and Japan. The between-subject factor of the experiment consists of four combinations of scale orientation (vertical and horizontal) and the positioning of the positive end of the scale. The within-subject factors are question topic area and the number of scale points. The effects of device (smartphone vs. desktop computer/tablet), age, gender, education, and the degree of exposure to left-to-right content will also be evaluated. We incorporate a range of analytical approaches: distributional comparisons, analysis of response latency and paradata, and latent structure modeling. We will discuss implications for choosing response option orders for mobile surveys and for comparing data obtained from different response option orders.
    Abstract: With increased adoption and usage of mobile apps for a variety of purposes, it is important to establish attitudinal measurement designs that capture users’ experiences in the context of actual app usage. Such designs should balance mobile UX considerations with survey data quality. To inform choices on contextual mobile survey design, we conduct a comparative evaluation of stars vs. smileys as graphical scales for in-context mobile app satisfaction measurement. To evaluate and compare data quality across scale types, we look at the distributions of the numerical ratings by anchor point stimulus to evaluate extremity and scale point distances, and we assess criterion validity for stars and smileys where feasible. To evaluate user experience across variants, we compare key survey-related signals such as response and dismiss rates, the dismiss/response ratio, and time-to-response.
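    The user-experience signals named above are straightforward to compute from per-impression survey logs. The Python sketch below shows one way to do so on a hypothetical log; the log schema (an "outcome" field plus seconds-to-response for answered surveys) is an assumption for illustration.

    # Illustrative computation of survey UX signals from hypothetical per-impression
    # logs. The schema ("outcome", "seconds_to_response") is assumed for illustration.
    from statistics import median

    impressions = [
        {"outcome": "response", "seconds_to_response": 6.2},
        {"outcome": "dismiss"},
        {"outcome": "response", "seconds_to_response": 4.8},
        {"outcome": "ignore"},
        {"outcome": "dismiss"},
    ]

    responses = [i for i in impressions if i["outcome"] == "response"]
    dismisses = [i for i in impressions if i["outcome"] == "dismiss"]

    response_rate = len(responses) / len(impressions)
    dismiss_rate = len(dismisses) / len(impressions)
    dismiss_response_ratio = len(dismisses) / len(responses)
    median_time_to_response = median(i["seconds_to_response"] for i in responses)

    print(response_rate, dismiss_rate, dismiss_response_ratio, median_time_to_response)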
    Justice Rising - The Growing Ethical Importance of Big Data, Survey Data, Models and AI
    Rich Timpone
    BigSurv 18 (Big Data Meet Survey Science) conference, Barcelona, Spain (2018)
    Abstract: In past work, the criteria of Truth, Beauty, and Justice have been leveraged to evaluate models (Lave and March 1993; Taber and Timpone 1996). Earlier, while relevant, Justice was seen as the least important of these modeling considerations, but that is no longer the case. As the nature of data and computing power have opened new opportunities for the application of data and algorithms, from public policy decision-making to technological advances like self-driving cars, ethical considerations have become far more important in the work that researchers are doing. While a growing literature has highlighted ethical concerns about Big Data, algorithms, and artificial intelligence, we take a practical approach of reviewing how decisions throughout the research process can result in unintended consequences in practice. Building on Gawande’s (2009) approach of using checklists to reduce risks, we have developed an initial framework and set of checklist questions for researchers to explicitly consider the ethical implications of their analytic endeavors. While many of the aspects considered are tied to Truth and accuracy, our examples show that considering research design through the lens of Justice may lead to different research choices. These checklists include questions on the collection of data (Big Data and survey, including sources and measurement), how it is modeled, and finally issues of transparency. These issues are of growing importance for practitioners from academia to industry to government, and attending to them will allow us to advance the intended goals of our scientific and practical endeavors while avoiding potential risks and pitfalls.