CodeQueries: A Dataset of Semantic Queries over Code

Surya Prakash Sahu

Madhurima Mandal

Shikhar Bharadwaj

Aditya Kanade

Petros Maniatis

Shirish Shevade

Innovations in Software Engineering (ISEC), ACM, Bangalore, India (2024)

Download Google Scholar

Abstract

Developers often have questions about semantic aspects of code
they are working on, e.g., “Is there a class whose parent classes
declare a conflicting attribute?”. Answering them requires understanding code semantics such as attributes and inheritance relation
of classes. An answer to such a question should identify code spans
constituting the answer (e.g., the declaration of the subclass) as well
as supporting facts (e.g., the definitions of the conflicting attributes).
The existing work on question-answering over code has considered
yes/no questions or method-level context. We contribute a labeled
dataset, called CodeQueries, of semantic queries over Python code.
Compared to the existing datasets, in CodeQueries, the queries
are about code semantics, the context is file level and the answers
are code spans. We curate the dataset based on queries supported
by a widely-used static analysis tool, CodeQL, and include both
positive and negative examples, and queries requiring single-hop
and multi-hop reasoning.

To assess the value of our dataset, we evaluate baseline neural
approaches. We study a large language model (GPT3.5-Turbo) in
zero-shot and few-shot settings on a subset of CodeQueries. We
also evaluate a BERT style model (CuBERT) with fine-tuning. We
find that these models achieve limited success on CodeQueries.
CodeQueries is thus a challenging dataset to test the ability of
neural models, to understand code semantics, in the extractive
question-answering setting

Defining the technology of today and tomorrow.

Philosophy

People

Research areas

Foundational ML & Algorithms

Computing Systems & Quantum AI

Science, AI & Society

Projects

Publications

Resources

Shaping the future, together.

Student programs

Faculty programs

Conferences & events

CodeQueries: A Dataset of Semantic Queries over Code

Abstract

Research Areas

Learn more about how we conduct our research