Carolyn Denomme Egelman
Carolyn is a quantitative user experience researcher on the Engineering Productivity Research team within Developer Infrastructure. The Engineering Productivity Research team brings a data-driven approach to business decisions around engineering productivity. They use a combination of qualitative and quantitative methods to triangulate on measuring productivity. She received her B.S. in Engineering Science & Mechanics from Penn State and her Ph.D. in Engineering & Public Policy from Carnegie Mellon.
Authored Publications
Understanding and effectively measuring developer goals is critical for enhancing developer experience and productivity. By focusing on durable, consistent, relatable, sensical, and observable goals, we create a more robust view into our developers’ days. In this article, we outline our process for articulating and refining goals, provide our list of 30 rigorously tested developer goals, and share how we leverage both sentiment and behavioral data to measure and understand goals through different lenses.
Measuring Developer Experience with a Longitudinal Survey
Jessica Lin
Jill Dicker
IEEE Software (2024)
At Google, we’ve been running a quarterly large-scale survey with developers since 2018. In this article, we will discuss how we run EngSat, some of our key learnings over the past 6 years, and how we’ve evolved our approach to meet new needs and challenges.
Systemic Gender Inequities in Who Reviews Code
Emerson Murphy-Hill
Jill Dicker
Amber Horvath
Laurie R. Weingart
Nina Chen
Computer Supported Cooperative Work (2023) (to appear)
Code review is an essential task for modern software engineers, where the author of a code change assigns other engineers the task of providing feedback on the author’s code. In this paper, we investigate the task of code review through the lens of equity, the proposition that engineers should share reviewing responsibilities fairly. Through this lens, we quantitatively examine gender inequities in code review load at Google. We found that, on average, women perform about 25% fewer reviews than men, an inequity with multiple systemic antecedents, including authors’ tendency to choose men as reviewers, a recommender system’s amplification of human biases, and gender differences in how reviewer credentials are assigned and earned. Although substantial work remains to close the review load gap, we show how one small change has begun to do so.
The Pushback Effects of Race, Ethnicity, Gender, and Age in Code Review
Emerson Rex Murphy-Hill
Lan Cheng
Communications of the ACM, 65 (2022), 52–57
Code review is a common practice in software organizations, where software engineers give each other feedback about a code change. As in other human decision-making processes, code review is susceptible to human biases, where reviewers’ feedback to the author may depend on how reviewers perceive the author’s demographic identity, whether consciously or unconsciously. Through the lens of role congruity theory, we show that the amount of pushback that code authors receive varies based on their gender, race/ethnicity, and age. Furthermore, we estimate that such pushback costs Google more than 1000 extra engineer hours every day, or about 4% of the estimated time engineers spend responding to reviewer comments, a cost borne by non-White and non-male engineers.
Detecting Interpersonal Conflict in Issues and Code Review: Cross Pollinating Open- and Closed-Source Approaches
Huilian Sophie Qiu
Bogdan Vasilescu
Christian Kästner
Emerson Rex Murphy-Hill
International Conference on Software Engineering: Software Engineering on Society (2022)
In software engineering, interpersonal conflict in code review, such as toxic language or unnecessary pushback on a change request, is a well-known and extensively studied problem because it is associated with negative outcomes, such as stress and turnover. One effective approach to preventing and mitigating toxic language is to develop automatic detection. The two most recent attempts at automatic detection were developed under different settings: a toxicity detector using text analytics for open source issue discussions and a pushback detector using logs-based metrics for corporate code reviews. While these settings are arguably distinct, the behaviors that they can capture share similarities. Our work studies how the toxicity detector and the pushback detector can be generalized beyond the respective contexts in which they were developed and how the combination of the two can improve interpersonal conflict detection. This research has implications for designing interventions and offers an opportunity to apply a technique to both open and closed source software, possibly benefiting from synergies, which in our experience are a rarity in software engineering research.
Engineering Impacts of Anonymous Author Code Review: A Field Experiment
Emerson Rex Murphy-Hill
Jill Dicker
Lan Cheng
Liz Kammer
Ben Holtz
Andrea Marie Knight Dolan
Transactions on Software Engineering (2021)
Code review is a powerful technique to ensure high-quality software and spread knowledge of best coding practices between engineers. Unfortunately, code reviewers may have biases about authors of the code they are reviewing, which can lead to inequitable experiences and outcomes. In this paper, we describe a field experiment with anonymous author code review, where we withheld author identity information during 5217 code reviews from 300 professional software engineers at one company. Our results suggest that during anonymous author code review, reviewers can frequently guess authors’ identities; that focus is reduced on reviewer-author power dynamics; and that the practice poses a barrier to offline, high-bandwidth conversations. Based on our findings, we recommend that those who choose to implement anonymous author code review should reveal the time zone of the author by default, have a break-the-glass option for revealing author identity, and reveal author identity directly after the review.
Enabling the Study of Software Development Behavior with Cross-Tool Logs
Ben Holtz
Edward K. Smith
Andrea Marie Knight Dolan
Elizabeth Kammer
Jillian Dicker
Caitlin Harrison Sadowski
Lan Cheng
Emerson Murphy-Hill
IEEE Software, Special Issue on Behavioral Science of Software Engineering (2020)
Understanding developers’ day-to-day behavior can help answer important research questions, but capturing that behavior at scale can be challenging, particularly when developers use many tools in concert to accomplish their tasks. In this paper, we describe our experience creating a system that integrates log data from dozens of development tools at Google, including tools that developers use to email, schedule meetings, ask and answer technical questions, find code, build and test, and review code. The contribution of this article is a technical description of the system, a validation of it, and a demonstration of its usefulness.
Predicting Developers’ Negative Feelings about Code Review
Emerson Murphy-Hill
Elizabeth Kammer
International Conference on Software Engineering (2020)
During code review, developers critically examine each other’s code to improve its quality, share knowledge, and ensure conformance to coding standards. In the process, developers may have negative interpersonal interactions with their peers, which can lead to frustration and stress; these negative interactions may ultimately result in developers abandoning projects. In this mixed-methods study at one company, we surveyed 1,317 developers to characterize the negative experiences and cross-referenced the results with objective data from code review logs to predict these experiences. Our results suggest that such negative experiences, which we call “pushback”, are relatively rare in practice, but have negative repercussions when they occur. Our metrics can predict feelings of pushback with high recall but low precision, making them potentially appropriate for highlighting interactions that may benefit from a self-intervention.