Learning and Evaluating Contextual Embedding of Source Code
Abstract
Recent research has achieved impressive results
on understanding and improving source code by
building upon machine-learning techniques developed
for natural languages. A significant advancement
in natural-language understanding has
come with the development of pre-trained contextual
embeddings, such as BERT, which can
be fine-tuned for downstream tasks with less labeled
data and a smaller training budget, while achieving
better accuracies. However, there has been no attempt
yet to obtain a high-quality contextual embedding
of source code and to evaluate it on multiple
program-understanding tasks simultaneously; this
paper aims to fill that gap. Specifically,
first, we curate a massive, deduplicated corpus
of 6M Python files from GitHub, which we
use to pre-train CuBERT, an open-sourced code-understanding
BERT model; and, second, we create
an open-sourced benchmark that comprises
five classification tasks and one program-repair
task, akin to code-understanding tasks proposed
in prior literature. We fine-tune CuBERT on
our benchmark tasks, and compare the resulting
models to different variants of Word2Vec token
embeddings, BiLSTM and Transformer models,
as well as to published state-of-the-art models, showing
that CuBERT outperforms them all, even with
shorter training and fewer labeled examples.
Future work on source-code embedding can benefit
from reusing our benchmark and from comparing
against CuBERT models as a strong baseline.