Building Generalizable Language Models for Code Processing
by
Shushan Arakelyan
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
December 2024
Copyright 2024 Shushan Arakelyan
To my parents, who taught me the value of curiosity and hard work
and to my spouse, for being my steadfast support every step of the way.
Acknowledgements
I would like to thank my advisor and Committee Chair, Xiang Ren, whose mentorship has been
truly invaluable. Xiang, you have been a role model and an inspiration for me. I am forever
grateful for your guidance, patience, support, and encouragement – professionally and personally
– that you have so generously provided to me. Thank you for the countless hours you dedicated to
helping me refine my ideas; as a mentor you have set a standard I can only hope to achieve one
day.
I would also like to extend my deepest gratitude to my committee members - Aram Galstyan,
Mukund Raghothaman and Morteza Dehghani. Thank you for your feedback and detailed attention
to the nuances of my work; your expertise and support have made my dissertation possible. I am
also particularly grateful to Aram Galstyan for believing in me and advising me in the first
part of my PhD journey.
I would like to thank my mother – Hasmik Julfalakyan. How can I express how much your
love and support have meant to me? You have been my friend, my biggest cheerleader, the one
who unwaveringly believes in me. No words could ever capture my gratitude – I owe so much of
this journey to you.
I am also forever thankful to my spouse, Mikael Manukyan. Mikael, thank you for being the
only thing that matters. The luckiest break was starting this PhD journey with you by my side. I
never took you for granted and I am so fortunate to have your love and support.
I would like to thank all the friends and collaborators in the USC INK research lab and USC
NLP group for their friendship, support, and professional feedback that they have so generously
provided. I also want to thank all the friends I have made along the way at USC and the Information
Sciences Institute; your friendship brought joy and meaning to my PhD journey.
I want to specifically thank my co-authors, collaborators and mentors who I have been lucky
to work with at different points during my PhD - Fred Morstatter, Emilio Ferrara, Filip Radlinski,
Sima Arasteh, Christoph Hauser, Erik Kline, Anna Hakhverdyan, Miltiadis Allamanis, Luis Garcia, Rocktim Jyoti Das, and Yi Mao. Your contributions and insights have been essential and I am
immensely grateful for the time, effort, and enthusiasm you have shared with me throughout this
process.
Finally, I want to acknowledge partial financial support from the following grants, contracts
and awards: Office of the Director of National Intelligence (ODNI); Intelligence Advanced Research Projects Activity (IARPA), via the HIATUS Program contract #2022-22072200006 and
via Contract No. 2019-19051600007; the Defense Advanced Research Projects Agency (DARPA)
MCS program under Contract No. N660011924033, DARPA ReMath program under Contract No.
HR00112190020, DARPA award W911NF-19-20271, and DARPA grant no. D16AP00115; NSF
IIS 2048211; ARO (contract no. W911NF12-R-0012); and gift awards from Google, Amazon, JP
Morgan and Sony. Parts of my research were completed while employed as an intern for Google
LLC and Microsoft Inc.
I acknowledge use of ChatGPT by OpenAI to identify improvements in writing style of this
thesis.
Table of Contents
Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiv
Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.1 Thesis Statement and Contributions . . . . . . . . . . . . . . . . . . . . . 4
Chapter 2: Neuro-Symbolic Models for Compositional Generalization . . . . . . . . . . . 6
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.1 Semantic Code Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.2 Neural Models for Semantic Code Search . . . . . . . . . . . . . . . . . . 10
2.3 Neural Modular Code Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3.1 Module Network Layout . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3.2 Entity Discovery Module . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3.3 Action Module . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.4 Model Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3.5 Module Pretraining and Joint Fine-tuning . . . . . . . . . . . . . . . . . . 16
2.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.4.1 Experiment Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.4.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4.3 Ablation Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.4.4 Analysis and Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
Chapter 3: Understanding and combating distributional shift in real-world software
development cycles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.3 Problem setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.3.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.3.2 Applications and Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.3.3 Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.4 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.6 Limitations and Threats to Validity . . . . . . . . . . . . . . . . . . . . . . . . . . 43
Chapter 4: From Low-Resource to High-Performance with Tool-Assisted Synthetic Data
for Code Generation Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.3 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.3.1 Code Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.3.2 Distillation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.3.3 Dataset construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.3.4 Data refinement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.4 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.4.1 Baselines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.5.1 Main Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.5.2 Effect of iterative refinement . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.5.3 Effect of Using a Seed Dataset vs OSS for Synthetic Data Generation . . . 57
4.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.7 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Chapter 5: Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Appendices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
A Chapter 2: Experiment Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
A.1 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
A.2 Parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
A.3 Failed parses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
A.4 Parser generalization to new datasets . . . . . . . . . . . . . . . . . . . . . 78
B Chapter 2: Entity Discovery Module . . . . . . . . . . . . . . . . . . . . . . . . . 79
C Chapter 2: Additional Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . 80
C.1 Unseen Entities and Actions . . . . . . . . . . . . . . . . . . . . . . . . . 80
C.2 Times an Entity or an Action Was Seen . . . . . . . . . . . . . . . . . . . 80
C.3 Evaluation on Parsable and Unparsable Queries . . . . . . . . . . . . . . . 81
D Chapter 2: Additional Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
E Chapter 2: Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
F Chapter 3: Javascript Keywords . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
G Chapter 3: Extended Background . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
G.1 Meta-learning and Multi-task-learning . . . . . . . . . . . . . . . . . . . . 87
G.2 Few-shot Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
H Chapter 3: Domain split visualization . . . . . . . . . . . . . . . . . . . . . . . . 90
I Chapter 3: Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
J Chapter 3: Hyperparameters and training details . . . . . . . . . . . . . . . . . . . 92
K Chapter 3: The Vault . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
L Chapter 3: Additional experimental results . . . . . . . . . . . . . . . . . . . . . . 93
M Chapter 3: IsoScore . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
M.1 Fast vote-k . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
N Chapter 3: Instructions for Codex and ChatGPT . . . . . . . . . . . . . . . . . . . 94
O Chapter 3: Sample outputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
List of Tables
2.1 Mean Reciprocal Rank (MRR) and Precision@1/@3/@5 (higher is better) for
semantic code search methods trained on different subsets from CodeSearchNet
dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2 Mean Reciprocal Rank (MRR) and Precision@1/@3/@5 (higher is better) for
different semantic code search methods trained on CoSQA dataset. . . . . . . . . . 18
3.1 Domains in CodeSearchNet dataset. Left column: training set. Middle column:
number of domains of each kind in Xtrain with ≥ 96 samples. Right column:
number of domains in Xtest with ≥ 96 samples. . . . . . . . . . . . . . . . . . . . 31
3.2 Model performance for code summarization on in-domain (ID) vs out-of-domain
(random) test data. Reported metric is BLEU (higher is better). . . . . . . . . . . 32
3.3 Model performance for code generation on in domain (ID) vs out of domain
(random) test data. Reported metric is CodeBLEU (higher is better). . . . . . . . 32
3.4 Codex and ChatGPT performance for code summarization and code generation
tasks. Models are evaluated in 0-shot manner, as well as using in-context
learning demonstrations (ICL) with in-domain (ID) and out-of-domain (random)
instances. Reported metric is BLEU for code summarization (higher is better),
and CodeBLEU for code generation (higher is better). . . . . . . . . . . . . . . . . 33
3.5 CodeT5 and Codex model performance using retrieved supervision examples for
general domain adaptation. The first number in each cell of the table is the score
obtained by the corresponding model, which is followed by the change in the
performance w.r.t domain-specific model or test sample-specific demonstrations. . 40
4.1 Frontier models like GPT-4o show widely varied and lower performance on
low-resource programming languages compared to a high-resource language, such
as Python. The reported metric is Pass@1 (higher is better) and the evaluation is
performed on the HumanEval sections of the MultiPL-E dataset. . . . . . . . . . . . 45
4.2 Sizes (in number of instances) for different datasets that we experiment with in
this work. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.3 Pass@1 performance for baseline student and teacher models, student and teacher
models evaluated with in-context learning (ICL), as well as student models
finetuned on the corresponding dataset on 6 low-resource programming languages. 56
4.4 Pass@1 performance for the student model finetuned on the corresponding dataset
on 6 low-resource programming languages. . . . . . . . . . . . . . . . . . . . . . 57
4.5 Pass@1 performance for the student model finetuned on the corresponding
dataset. For the purpose of fairer comparison, for Magicoder-OSS-Instruct, we
only used the data points in the specific language tested. . . . . . . . . . . . . . . . 57
5.1 Dataset statistics before and after parsing. . . . . . . . . . . . . . . . . . . . . . . 78
5.2 Parser’s success rate on unseen datasets . . . . . . . . . . . . . . . . . . . . . . . 79
5.3 Example queries that were not included due to query parsing errors . . . . . . . . . 79
5.4 Mean Reciprocal Rank (MRR) and Precision@1/@3/@5 (higher is better) for
different methods trained on CoSQA dataset. The performance is evaluated on the
full test dataset, i.e. including both parsable and unparsable examples. . . . . . . . 82
5.5 Keywords used for CodeBLEU evaluation . . . . . . . . . . . . . . . . . . . . . . 87
5.6 Comparison of model performance for code generation on in-domain (ID) vs
out-of-domain (random) test data. Reported metric is ChrF (higher is better). . . . 87
5.7 Comparison of model performance for code generation on in-domain (ID) vs
out-of-domain (random) test data. Reported metric is RougeL (higher is better). . 88
5.8 Comparison of model performance for code generation on in-domain (ID) vs out-of-domain (random) test data. Reported metric in each cell is CodeBERTScore
F1 on the left (higher is better), and CodeBERTScore F3 on the right (higher is
better). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.9 Results for CodeT5 model using IsoScore for measuring embedding similarity
and supervising with retrieved examples from train data. . . . . . . . . . . . . . . 93
5.10 Results for CodeT5 model using Fast Vote-k for measuring embedding similarity
and supervising with retrieved examples from train data. . . . . . . . . . . . . . . 94
5.11 Task instructions and demonstration templates used for generating results in the
experiments with Codex and ChatGPT. . . . . . . . . . . . . . . . . . . . . . . . . 95
5.12 Sample outputs from different models. . . . . . . . . . . . . . . . . . . . . . . . . 96
5.13 Sample outputs from different models. . . . . . . . . . . . . . . . . . . . . . . . . 97
5.14 Sample outputs from different models. . . . . . . . . . . . . . . . . . . . . . . . . 98
5.15 Sample outputs from different models. . . . . . . . . . . . . . . . . . . . . . . . . 99
5.16 Sample outputs from different models. . . . . . . . . . . . . . . . . . . . . . . . . 100
List of Figures
2.1 Motivating Example for NS3 approach. To match query “Navigate folders” on a
code snippet, we find all references (token spans) to entity “folders” in code (e.g.,
paths and directories) using various linguistic cues (Step 1). Then we look for
cues in code that indicate the identified instances of “folders” are being iterated
through – i.e., “navigate” (Step 2). . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Overview of the NS3 approach. We illustrate the pipeline of processing for an
example query “Load all tables from dataset”. Parsed query is used for deciding
the positions of entity discovery and action modules in the neural module network
layout. Each entity discovery module receives a noun/noun phrase as input,
and outputs relatedness scores for code tokens, which are passed as input to an
action module. Action module gets scores for all its children in the parse-tree,
except one, which is masked, and the goal is predicting, cloze-style, what are the
relatedness scores for the missing argument. . . . . . . . . . . . . . . . . . . . . . 8
2.3 Entity module architecture in our NS3 approach. . . . . . . . . . . . . . . . . . . . 13
2.4 Action module architecture in our NS3 approach . . . . . . . . . . . . . . . . . . . 15
2.5 Comparison of baseline methods and NS3 for semantic code search. We report
Precision@1 scores. (a) Performance of our proposed method and baselines
broken down by average number of arguments per action in a single query. (b)
Performance of our proposed method and baselines broken down by number of
arguments in queries with a single action. . . . . . . . . . . . . . . . . . . . . . . 20
2.6 Performance of NS3 on the test portion of CSN dataset with different ablation
variants. (a) Skipping one, or both pretraining procedures, and only training
end-to-end. (b) Using no normalization on output scores (None), action-only
or entity discovery-only, and both. (c) Performance with different options for
computing action and entity discovery output similarities. . . . . . . . . . . . . . . 21
2.7 Ratio of the perturbed query score to the original query score (lower is better) on
CSN dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.8 Token scores outputted by the modules at different stages of training. Darker
highlighting means higher score. The leftmost and middle columns show output
scores of the entity discovery module after pretraining, and the end-to-end training
correspondingly. The rightmost column shows the scores of the action module
after the end-to-end training. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.1 Organization of a software system by the granularity of its components . . . . . . . 27
3.2 We group the instances from CodeSearchNet dataset by repos, orgs, and folders
they belong to. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.3 For the CodeT5 model we use different methods for training and domain
adaptation. We evaluate both in scenarios with different data sources during the
domain adaptation stage. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.4 Models with ID and retrieved downstream adaptations. . . . . . . . . . . . . . . . 38
3.5 CodeT5 model finetuned with retrieved supervision using different number
of retrieved examples per test sample. Scores reported are BLEU for code
summarization and CodeBLEU for code generation. CodeT5 MTL model
performances in zero-shot, and 8-shot (ID) scenarios are shown with dotted lines
for reference. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.1 Overview of the data generation process pipeline. . . . . . . . . . . . . . . . . . . 46
4.3 Data distributions for the datasets that we generate in this work, showing the
ratio of correct examples to syntactically and functionally incorrect examples.
The orange bars indicate the change in the number of correct instances between
the current iteration and the iteration prior to it. Due to the marginal increase
in correct examples, we only include versions up to $D_{ref_3}$ in our
experiments (marked with bold), and omit $D_{ref_4}$ and $D_{ref_5}$. . . . . . . . . . . . . . 52
4.2 Examples of original generations by the teacher model, Llama3-8B-Instruct,
feedback given depending on the kind of error, and the revised generations by the
teacher model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.1 Performance of CodeBERT and NS3 models when broken down by the number of
unseen entities or actions in the test queries. Evaluated on CSN test set. . . . . . . 80
5.2 Performance of CodeBERT and NS3 models when broken down by the number of
times an entity or an action was seen during the training. Evaluated on CSN test set. 81
5.3 The leftmost column shows output scores of the entity discovery module after
pretraining for the entity of the query. The middle column shows the scores after
completing the end-to-end training. The rightmost column shows the scores of the
action module. Darker highlighting demonstrates higher score. . . . . . . . . . . . 82
5.4 Outputs of the action and entity modules on the query
ACTION(Construct, (None, point record)). . . . . . . . . . . . . . . . . . . . . . 84
5.5 Outputs of the action and entity modules on the query
ACTION(Read, (FROM, stream), (None, points)). . . . . . . . . . . . . . . . . . . 85
5.6 Outputs of the action module on the modified query
ACTION(Remove, (IN, stream), (None, points)). . . . . . . . . . . . . . . . . . . 86
5.7 Each dot signifies a domain. Average pairwise similarities of examples within
each domain (x axis) plotted against average similarities of that domain to all
other domains (y axis). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.8 Performance for CodeT5 model finetuned with LoRA compared to regular
finetuning. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
Abstract
Successful deployment of any AI model requires generalization to previously unseen, real-world
scenarios. Lack of generalization in models can lead to outcomes ranging from reduced performance to potential legal liabilities. In this thesis, I explore generalization challenges in large language models for code processing, covering three different generalization concerns that language
models for code processing can exhibit. I also present my progress in building models that can
overcome those challenges. Firstly, I explore compositional generalization issues in code. I propose a model
that can learn to represent and recognize individual instructions in code, and subsequently
generalize to new, unseen combinations of those instructions. Next,
I look at the issue of out-of-domain generalization. Specifically, I study how distribution shifts
within software projects or between different corporations can affect model performance. I also
look at different methods and measure their effectiveness for overcoming this generalization issue.
Lastly, I look at the drop in language model performance on programming languages with fewer
resources compared to widespread ones. I propose a synthetic data generation and distillation
method to help improve language model performance on low-resource programming languages.
Chapter 1
Introduction
Recent advances in Artificial Intelligence (AI) have introduced many disruptive applications to the
domain of software and code. GitHub Copilot [1] took the world by storm and has since become the
most widely used AI coding assistant among software engineers. Its wide adoption and popularity
hinge on the assistant’s capability to not only help with code generation, but also assist in other
ways while working with software, for example by flagging code that resembles publicly available
code, which is useful in a range of applications, from easily referencing code from within the IDE to
checking permissions and licenses for that code.
Among other exciting applications is integration of generative AI with low-code or no-code
solutions. One such example is Roblox [2], which allows users with little to no programming experience to create their own, fully functional content for their favourite games. All it requires is
specifying the desired behaviour of the user-introduced content in plain English, and the system
generates the corresponding Lua scripts. Other use cases include incorporating generative AI into database systems to assist users with their SQL queries [3], creating functional webpages without knowledge of web programming [4], or using generative AI to assist with
software modernization, such as translating COBOL mainframes into more modern programming
languages [5].
1.1 Motivation
With all its exciting new prospects, generative AI does not come without issues or risks. As
with any other AI system, any such issue is aggravated in the transfer from the training stage to
real-world application. Generative AI for code processing specifically can pose a number of such
potential issues, and one of them can be reduced code efficiency and redundancy. This can range
from models that produce duplicate code, to models that propose suboptimal code, which may not
be appropriate in certain critical applications. Another issue may be posed by dependence and
over-reliance of the users on the system. The users can get used to good performance of the model
and stop critically assessing its outputs, or vetting model suggestions properly. We already have
such examples in the wild, demonstrated by users of Tesla Autopilot or users following their GPS
navigator into a body of water [6–8]. Last but not least, there is the issue of false positives. This
issue occurs, for example, when the model incorrectly flags something as buggy or vulnerable.
As a result engineers will have to spend time unproductively looking for non-existent bugs or
vulnerabilities.
Lack of generalization When considering the challenges faced by models in real-world applications, it is essential to consider the discrepancies between the training data and the actual use cases.
Some examples of such discrepancies include domain shift, temporal shift, and models learning
shortcuts.
Domain shift Domain shift occurs when a model trained on data from one environment is applied in a different context. For example, in the case of code completion, a model trained on code
following a particular coding style may be deployed in a codebase written in a different style. As a result, the generated code may fail to adhere to the required coding standards or
guidelines, leading to inconsistencies in the codebase.
Temporal shift Temporal shift refers to the differences in data distribution due to the passage
of time. A model trained on data from a specific period may be applied to a scenario where the
coding practices or available functions have evolved. This can lead to the model generating code
that relies on outdated functions or fails to incorporate the latest best practices.
Models Learning Shortcuts Models learning shortcuts is another critical issue. Models may
develop a tendency to memorize and reproduce common patterns from the training data, regardless
of their appropriateness in the new context. A simple example of this is a model that consistently
names an iterator “i” due to its frequency in the training data, even when another variable named
“i” already exists in the surrounding code, leading to potential errors. These challenges highlight
the importance of models being able to generalize from training environments to real-world applications to improve their performance and reliability in practical scenarios.
Implications The potential implications of lack of generalization extend beyond merely diminished model performance. Various stakeholders, including individual developers, companies, and
the broader tech community, are affected in multiple ways. Individual developers will suffer not
just from the direct impact of decreased model performance. They might also be affected by
complications arising from issues like false positives, i.e. incorrectly marking parts of the code
as buggy or vulnerable, leading to developers spending time unproductively looking for these issues. Both of these in combination can further degrade effectiveness and efficiency of individual
developers.
Companies face a range of risks, including privacy and security concerns. These risks may
involve vulnerabilities introduced by the models or injected into the codebase by vulnerable models. Additionally, companies face economic impacts which can be significant, encompassing both
reduced performance from engineers and diminished overall efficiency of the software. The educational value for the company is another concern; inexperienced engineers learning from suboptimally generated code may not acquire the best practices. Over time, this could lead to a
proliferation of lower-quality code in the codebase, continuing the cycle of poor programming
standards.
The tech community at large may encounter legal challenges related to the ownership of generated code and liability for any harm caused by AI-generated software. Furthermore, these issues
could lead to reputational damage for all involved parties, eroding trust in AI applications for coding tasks. In summary, models that lack generalization are unlikely to be successfully deployed, as
their limitations have wide-ranging and significant implications across various domains.
1.1.1 Thesis Statement and Contributions
In summary of previous sections, generalization capacities are pivotal for code models’ successful
deployment. In this thesis, we aim to improve the generalization properties of language models for
code processing across several axes:
1. Firstly, we look at the compositional generalization capacity of models for code, i.e. how
well models can generalize from simpler tasks to more complex tasks that combine several of
them. To this end, we look at the semantic code search application and propose treating it as
a compositional task by decomposing it based on the semantic parse of the search query. We
implement this approach with neural module networks and demonstrate that it significantly
outperforms several semantic code search baselines on established code search benchmarks.
We also demonstrate that models trained in this way are more sensitive to changes in the
query and better able to handle compositional queries.
2. Secondly, we look at how models handle distributional shifts, in particular those originating
from the natural structure and origin of the source code. Specifically, models that are both
trained and evaluated on code from the same origin, for example, the codebase of the same
company, can be expected to perform well during evaluation. However, when evaluated on
code from a different company, the results can be mixed.
To this end, we formulate a research problem focused on establishing, understanding,
measuring, and combating the distributional shift arising from real-world software development cycles. We look at code generation and code summarization applications, and observe
that the hierarchical nature of software data introduces generalization issues for code models.
We look at three distinct generalization scenarios: generalization across companies, projects,
and project components, and establish that distributional shift occurs in all of them. We then
present and study different approaches for combating such distributional shift.
3. Finally, we look at the gap in performance demonstrated by code models when
evaluated on high- versus low-resource programming languages. Due to the lack of supervision
data for low-resource programming languages, we look into using synthetically generated data to improve code model performance on these
languages.
Specifically, we combine this insight with the ready availability of software engineering tools for programming languages, and propose using such tools to generate higher-quality
synthetic data for knowledge distillation. We use this data to finetune a 1.3B language
model and confirm our approach is effective on six low-resource programming languages,
for which we are able to achieve sizable improvements, even surpassing the teacher model’s
performance.
Chapter 2
Neuro-Symbolic Models for Compositional Generalization
Semantic code search is the task of retrieving a code snippet given a textual description of its functionality. Prior works have focused on using similarity metrics between neural embeddings of text
and code. However, language models are known to struggle with longer, compositional text, and
multi-step reasoning. To overcome this limitation, in this Chapter we propose supplementing the
query sentence with a layout of its semantic structure. The semantic layout is used to break down
the final reasoning decision into a series of lower-level decisions. We use a Neural Module Network architecture to implement this idea. We compare our neuro-symbolic approach to a number
of baselines, including state-of-the-art semantic code retrieval methods, and demonstrate that our
approach results in more precise code retrieval. We also study the effectiveness of our modular
design when handling compositional queries.
2.1 Introduction
The increasing scale of software repositories makes retrieving relevant code snippets more challenging. Traditionally, source code retrieval has been limited to keyword [9, 10] or regex [11]
search. Both rely on the user providing the exact keywords appearing in or around the sought code.
However, neural models enabled new approaches for retrieving code from a textual description of
its functionality, a task known as semantic code search (SCS). A model like Transformer [12] can
map a database of code snippets and natural language queries to a shared high-dimensional space.
Figure 2.1: Motivating Example for NS3 approach. To match query “Navigate folders” on a code
snippet, we find all references (token spans) to entity “folders” in code (e.g., paths and directories)
using various linguistic cues (Step 1). Then we look for cues in code that indicate the identified
instances of “folders” are being iterated through – i.e., “navigate” (Step 2).
Relevant code snippets are then retrieved by searching over this embedding space using a predefined similarity metric, or a learned distance function [13–15]. Some of the recent works capitalize
on the rich structure of the code, and employ graph neural networks for the task [16, 17].
Despite impressive results on SCS, current neural approaches are far from satisfactory in dealing with a wide range of natural-language queries, especially on ones with compositional language
structure. Encoding text into a dense vector for retrieval purposes can be problematic because we
risk losing faithfulness of the representation and missing important details of the query. Not
only does this a) affect the performance, but it can b) drastically reduce a model’s value for the
users, because compositional queries such as “Check that directory does not exist before creating
it” require performing multi-step reasoning on code.
We suggest overcoming these challenges by introducing a modular workflow based on the
semantic structure of the query. Our approach is based on the intuition of how an engineer would
approach a SCS task. For example, in performing search for code that navigates folders in Python
they would first only pay attention to code that has cues about operating with paths, directories or
folders. Afterwards, they would seek indications of iterating through some of the found objects
or other entities in the code related to them. In other words, they would perform multiple steps of
different nature - i.e. finding indications of specific types of data entities, or specific operations.
Figure 2.1 illustrates which parts of the code would be important to indicate that they have found
the desired code snippet at each step. We attempt to imitate this process in this work. To formalize
the decomposition of the query into such steps, we take inspiration from the idea that code is
comprised of data, or entities, and transformations, or actions, over data. Thus, a SCS query is also
likely to describe the code in terms of data entities and actions.
Figure 2.2: Overview of the NS3 approach. We illustrate the pipeline of processing for an example
query “Load all tables from dataset”. The parsed query is used for deciding the positions of entity
discovery and action modules in the neural module network layout. Each entity discovery module
receives a noun/noun phrase as input, and outputs relatedness scores for code tokens, which are
passed as input to an action module. The action module gets scores for all its children in the parse
tree, except one, which is masked, and the goal is to predict, cloze-style, the relatedness scores
for the missing argument.
We break down the task of matching the query into smaller tasks of matching individual data
entities and actions. In particular, we aim to identify parts of the code that indicate the presence
of the corresponding data or action. We tackle each part with a distinct type of network – a neural
module. Using the semantic parse of the query, we construct the layout of how modules’ outputs
should be linked according to the relationships between data entities and actions, where each data
entity represents a noun, or a noun phrase, and each action represents a verb, or a verbal phrase.
Correspondingly, this layout specifies how the modules should be combined into a single neural
module network (NMN) [18]. Evaluating the NMN on the candidate code approximates detecting
the corresponding entities and actions in the code by testing whether the neural network can deduce
one missing entity from the code and the rest of the query.
This approach has the following advantages. First, the semantic parse captures the compositionality of a query. Second, it mitigates the challenges of faithfully encoding the text by focusing only
on a small portion of the query at a time. Finally, applying the neural modules in a succession can
potentially mimic staged reasoning necessary for SCS.
We evaluate our proposed NS3 model on two SCS datasets - CodeSearchNet (CSN) [19] and
CoSQA/WebQueryTest [20]. Additionally, we experiment with limited CSN training sets
of 10K and 5K examples. We find that NS3 provides large improvements over baselines in all
cases. Our experiments demonstrate that the resulting model is more sensitive to small, but semantically significant changes in the query, and is more likely to correctly recognize that a modified
query no longer matches its code pair.
Our main contributions are: (i) We propose looking at SCS as a compositional task that requires
multi-step reasoning. (ii) We present an implementation of the aforementioned paradigm based on
NMNs. (iii) We demonstrate that our proposed model provides a large improvement on a number
of well-established baseline models. (iv) We perform additional studies to evaluate the capacity of
our model to handle compositional queries.
2.2 Background
2.2.1 Semantic Code Search
Semantic code search (SCS) is the process of retrieving a relevant code snippet based on a textual
description of its functionality, also referred to as a query. Let $C$ be a database of code snippets $c^i$.
For each $c^i \in C$, there is a textual description of its functionality $q^i$. In the example in Figure 2.2,
the query $q^i$ is “Load all tables from dataset”. Let $r$ be an indicator function such that $r(q^i, c^j) = 1$
if $i = j$, and $0$ otherwise. Given some query $q$, the goal of SCS is to find $c^*$ such that $r(q, c^*) = 1$.
We assume that for each $q^*$ there is exactly one such $c^*$.¹ Here we look to construct a model which
takes as input a pair of a query and a candidate code snippet, $(q^i, c^j)$, and assigns the pair a probability
$\hat{r}_{ij}$ of being a correct match. Following common practice in information retrieval, we evaluate
the performance of the model based on how high the correct answer $c^*$ is ranked among a number
of incorrect, or distractor, instances $\{c\}$. This set of distractor instances can be the entire codebase
$C$, or a subset of the codebase obtained through heuristic filtering or another ranking method.
¹This is not the case in the CoSQA dataset. For the sake of consistency, we perform the evaluation repeatedly, leaving
only one correct code snippet among the candidates at a time, while removing the others.
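To make the ranking setup concrete, the sketch below scores a correct snippet against a pool of distractors and reports its 1-based rank; the toy_score function is only a hypothetical stand-in for the neural models discussed in the following sections.

```python
from typing import Callable, List

def rank_of_correct(score: Callable[[str, str], float],
                    query: str,
                    correct_code: str,
                    distractors: List[str]) -> int:
    """Rank (1-based) of the correct snippet among correct + distractor candidates."""
    candidates = [correct_code] + distractors
    scores = [score(query, c) for c in candidates]
    correct_score = scores[0]
    # Higher score = better match; count how many distractors outscore the correct one.
    return 1 + sum(s > correct_score for s in scores[1:])

# Toy stand-in scorer: token overlap between query and code (not the thesis's model).
def toy_score(query: str, code: str) -> float:
    q_tokens, c_tokens = set(query.lower().split()), set(code.lower().split())
    return len(q_tokens & c_tokens) / max(len(q_tokens), 1)

if __name__ == "__main__":
    query = "load all tables from dataset"
    correct = "def load_tables(dataset): return [dataset.table(n) for n in dataset.table_names]"
    distractors = ["def add(a, b): return a + b", "def read_file(path): return open(path).read()"]
    print(rank_of_correct(toy_score, query, correct, distractors))
```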
2.2.2 Neural Models for Semantic Code Search
Past works handling programs and code have focused on enriching their models by incorporating
more semantic and syntactic information from code [21–24]. Some prior works have cast SCS
as a sequence classification task, where the code is represented as a textual sequence, the input
pair $(q^i, c^j)$ is concatenated with a special separator symbol into a single sequence, and the output
is the score $\hat{r}_{ij}$: $\hat{r}_{ij} = f(q^i, c^j)$. The function $f$ performing the classification can be any sequence
classification model, e.g. BERT [25].
Alternatively, one can define separate networks for independently representing the query ($f$),
the code ($g$), and measuring the similarity between them: $\hat{r}_{ij} = sim(f(q^i), g(c^j))$. This allows one to
design the code encoding network $g$ with additional program-specific information, such as abstract
syntax trees [26, 27] or control flow graphs [28, 29]. Separating the two modalities of natural language
and code also allows further enrichment of the code representation by adding contrastive learning
objectives [30, 31]. In these approaches, the original code snippet $c$ is automatically modified with
semantic-preserving transformations, such as variable renaming, to introduce versions of the code
snippet $c'$ with the exact same functionality. The code encoder $g$ is then trained with an appropriate
contrastive loss, such as Noise Contrastive Estimation (NCE) [32] or InfoNCE [33].
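As a rough illustration of the two formulations above (scoring the concatenated pair versus comparing separate embeddings), here is a minimal sketch using Hugging Face transformers with the microsoft/codebert-base checkpoint; the [CLS] pooling choice and the untrained classification head are assumptions of the sketch, not the exact setups of the cited works.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("microsoft/codebert-base")
enc = AutoModel.from_pretrained("microsoft/codebert-base")

def embed(text: str) -> torch.Tensor:
    """[CLS] embedding of a single text or code sequence."""
    inputs = tok(text, return_tensors="pt", truncation=True, max_length=256)
    with torch.no_grad():
        return enc(**inputs).last_hidden_state[:, 0]  # shape (1, hidden)

# Separate-encoder setup: r_ij = sim(f(q), g(c)); here f = g and sim = cosine similarity.
def separate_encoder_score(query: str, code: str) -> float:
    return F.cosine_similarity(embed(query), embed(code)).item()

# Joint-sequence setup: r_ij = f(q, c); the pair is concatenated with a separator and
# scored by a classification head (untrained here, so it would need fine-tuning).
cls_head = torch.nn.Linear(enc.config.hidden_size, 1)

def joint_sequence_score(query: str, code: str) -> float:
    inputs = tok(query, code, return_tensors="pt", truncation=True, max_length=256)
    with torch.no_grad():
        cls = enc(**inputs).last_hidden_state[:, 0]
    return cls_head(cls).squeeze().item()
```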
Limitations However, there is also merit in reviewing how we represent and use the textual
query to help guide the SCS process. Firstly, existing work derives a single embedding for the
entire query. This means that specific details or nested subqueries of the query may be omitted
or not represented faithfully - getting lost in the embedding. Secondly, prior approaches make
the decision after a single pass over the code snippet. This ignores cases where reasoning about
a query requires multiple steps and thus multiple look-ups over the code, as is the case, for example, with
nested subqueries. Our proposed approach, NS3, attempts to address these issues by
breaking down the query into smaller phrases based on its semantic parse and locating each of
them in the code snippet. This should allow us to match compositional and longer queries to code
more precisely.
2.3 Neural Modular Code Search
We propose to supplement the query with a loose structure resembling its semantic parse, as illustrated in Figure 2.2. We follow the parse structure to break down the query into smaller, semantically coherent parts, so that each corresponds to an individual execution step. The steps are
taken in succession by a neural module network composed from a layout that is determined from
the semantic parse of the query (Sec. 2.3.1). The neural module network is composed by stacking
“modules”, or jointly trained networks, of distinct types, each carrying out a different functionality.
Method Overview In this work, we define two types of neural modules - entity discovery module
(denoted by E; Sec. 2.3.2) and action module (denoted by A; Sec 2.3.3). The entity discovery
module estimates the semantic relatedness of each code token $c^j_i$ in the code snippet $c^j = [c^j_1, \ldots, c^j_N]$
to an entity mentioned in the query – e.g. “all tables” or “dataset” as in Figure 2.2. The action
module estimates the likelihood of each code token to be related to an (unseen) entity affected
by the action in the query, e.g. “dataset” and “load from” correspondingly, conditioned on the
rest of the input (seen), e.g. “all tables”. The similarity of the predictions of the entity discovery
and action modules measures how well the code matches that part of the query. The modules are
nested – an action module takes as input part of the output of another module – and the order
of nesting is decided by the semantic parse layout. In the rest of the paper we refer to the inputs of
a module as its arguments.
Every input instance fed to the model is a 3-tuple $(q^i, s_{q^i}, c^j)$ consisting of a natural language
query $q^i$, the query’s semantic parse $s_{q^i}$, and a candidate code sequence $c^j$. The goal is producing a
binary label $\hat{r}_{ij} = 1$ if the code is a match for the query, and $0$ otherwise. The layout of the neural
module network, denoted by $L(s_{q^i})$, is created from the semantic structure of the query $s_{q^i}$. During
inference, given $(q^i, s_{q^i}, c^j)$ as input, the model instantiates a network based on the layout, passes $q^i$,
$c^j$, and $s_{q^i}$ as inputs, and obtains the model prediction $\hat{r}_{ij}$. This pipeline is illustrated in Figure 2.2,
and details about creating the layout of the neural module network are presented in Section 2.3.1.
During training, we first perform noisy supervision pretraining for both modules. Next, we
perform end-to-end training, where in addition to the query, its parse, and a code snippet, the model
is also provided a gold output label: $r(q^i, c^j) = 1$ if the code is a match for the query, and $r(q^i, c^j) = 0$ otherwise. These labels provide signal for joint fine-tuning of both modules (Section 2.3.5).
2.3.1 Module Network Layout
Here we present our definition of the structural representation $s_{q^i}$ for a query $q^i$, and introduce how
this structural representation is used for dynamically constructing the neural module network, i.e.
building its layout $L(s_{q^i})$.
Query Parsing To infer the representation $s_{q^i}$, we pair the query (e.g., “Load all tables from
dataset”, as in Figure 2.2) with a simple semantic parse that looks similar to:
DO WHAT [ (to/from/in/...) WHAT, WHEN, WHERE, HOW, etc].
Following this semantic parse, we break down the query into shorter semantic phrases using the
roles of different parts of speech. Nouns and noun phrases correspond to data entities in code, and
verbs describe actions or transformations performed on the data entities. Thus, data and transformations are separated and handled by separate neural modules – an entity discovery module $E$ and
an action module $A$. We use a Combinatory Categorial Grammar-based (CCG) semantic parser [34,
35] to infer the semantic parse $s_{q^i}$ for the natural language query $q^i$. Parsing is described in further
detail in Section 2.4.1 and Appendix A.2.
Figure 2.3: Entity module architecture in our NS3 approach.
Specifying Network Layout In the layout $L(s_{q^i})$, every noun phrase (e.g., “dataset” in Figure 2.2) will be passed through the entity discovery module $E$. Module $E$ then produces a probability score $e_k$
for every token $c^j_k$ in the code snippet $c^j$ to indicate its semantic relatedness to the noun
phrase: $E(\text{“dataset”}, c^j) = [e_1, e_2, \ldots, e_N]$. Each verb in $s_{q^i}$ (e.g., “load” in Figure 2.2) will be passed
through an action module: $A(\text{“load”}, p^i, c^j) = [a_1, a_2, \ldots, a_N]$. Here, $p^i$ is the span of arguments to the
verb (action) in query $q^i$, consisting of children of the verb in the parse $s_{q^i}$ (e.g. subject and object
arguments to the predicate “load”); $a_1, \ldots, a_N$ are estimates of the token scores $e_1, \ldots, e_N$ for an
entity from $p^i$. The top level of the semantic parse is always an action module. Figure 2.2 also
illustrates the preposition FROM used with “dataset”; handling of prepositions is described in Section 2.3.3.
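A minimal sketch of how a parsed query could be turned into a nested layout of entity and action modules; the dataclasses below are a simplified stand-in for the CCG parse used in this work, and all names are illustrative.

```python
from dataclasses import dataclass
from typing import List, Optional, Union

# Simplified stand-in for the semantic parse: an action has a verb and arguments,
# and each argument is either an entity (noun phrase, optionally with a preposition)
# or a nested action.
@dataclass
class Entity:
    phrase: str
    preposition: Optional[str] = None

@dataclass
class Action:
    verb: str
    arguments: List[Union["Action", Entity]]

# Layout for "Load all tables from dataset": the top level is always an action module,
# and its noun-phrase arguments are routed to entity discovery modules.
layout = Action(
    verb="load",
    arguments=[Entity("all tables"), Entity("dataset", preposition="from")],
)

def describe(node: Union[Action, Entity], depth: int = 0) -> None:
    """Print which module type handles each node of the layout."""
    pad = "  " * depth
    if isinstance(node, Action):
        print(f"{pad}Action module A(verb={node.verb!r})")
        for arg in node.arguments:
            describe(arg, depth + 1)
    else:
        print(f"{pad}Entity module E({node.phrase!r}, prep={node.preposition})")

describe(layout)
```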
2.3.2 Entity Discovery Module
The entity discovery module receives a string that references a data entity. Its goal is to identify
tokens in the code that have high relevance to that string. The architecture of the module is shown
in Figure 2.3. Given an entity string, “dataset” in the example, and a sequence of code tokens
$[c^j_1, \ldots, c^j_N]$, the entity module first obtains contextual code token representations using a RoBERTa model
that is initialized from the CodeBERT-base checkpoint. The resulting embedding is passed through a
two-layer MLP to obtain a score for every individual code token $c^j_k$: $0 \le e_k \le 1$. Thus, the total
output of the module is a vector of scores $[e_1, e_2, \ldots, e_N]$. To prime the entity discovery module
for measuring relevancy between code tokens and its input, we fine-tune it with noisy supervision, as
detailed below.
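A minimal sketch of the per-token scoring head described above (contextual encoder followed by a two-layer MLP and a sigmoid); encoding the entity string jointly with the code and the hidden sizes are assumptions of this sketch rather than the exact architecture.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class EntityDiscoverySketch(nn.Module):
    """Per-token relatedness scores for an entity phrase (simplified sketch)."""

    def __init__(self, checkpoint: str = "microsoft/codebert-base"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(checkpoint)
        hidden = self.encoder.config.hidden_size
        # Two-layer MLP head with a sigmoid; layer sizes are assumptions.
        self.scorer = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1), nn.Sigmoid()
        )

    def forward(self, entity: str, code: str, tokenizer) -> torch.Tensor:
        # Assumption: the entity phrase is encoded together with the code so that
        # the code token representations are contextualized on the entity string.
        inputs = tokenizer(entity, code, return_tensors="pt", truncation=True)
        states = self.encoder(**inputs).last_hidden_state   # (1, seq, hidden)
        return self.scorer(states).squeeze(-1)              # scores e_k in [0, 1]

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
module = EntityDiscoverySketch()
scores = module("dataset", "tables = [dataset.table(n) for n in names]", tokenizer)
```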
Noisy Supervision We create noisy supervision for the entity discovery module by using keyword matching and a Python static code analyzer. For the keyword matching, if a code token is an
exact match for one or more tokens in the input string, its supervision label is set to 1, otherwise
it is 0. The same is true if the code token is a substring or a superstring of one or more input string
tokens. For some common nouns we include their synonyms (e.g. “map” for “dict”); the full list
of these and further details are presented in Appendix B.
We used the static code analyzer to extract information about statically known data types.
We cross-matched this information with the query to discover whether the query references any
datatypes found in the code snippet. If that is the case, the corresponding code tokens are assigned
supervision label 1, and all the other tokens are assigned 0. For pretraining, we train on
equal numbers of (query, code) pairs from the dataset, as well as randomly mismatched pairs of
queries and code snippets to avoid creating bias in the entity discovery module.
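The keyword-matching half of this noisy labeling can be sketched as follows; the synonym map is only an illustrative subset, and the static-analyzer-based type matching is omitted.

```python
from typing import List

# Illustrative synonym subset only; the full list is given in Appendix B.
SYNONYMS = {"dict": {"map", "mapping"}, "list": {"array"}}

def noisy_entity_labels(entity: str, code_tokens: List[str]) -> List[int]:
    """Label a code token 1 if it exactly matches, contains, or is contained in
    any entity token (or one of its synonyms), otherwise 0."""
    entity_tokens = {t.lower() for t in entity.split()}
    expanded = set(entity_tokens)
    for canon, syns in SYNONYMS.items():
        if canon in entity_tokens or entity_tokens & syns:
            expanded |= {canon} | syns

    labels = []
    for tok in code_tokens:
        t = tok.lower()
        match = any(t == e or t in e or e in t for e in expanded if e)
        labels.append(1 if match else 0)
    return labels

print(noisy_entity_labels("dataset", ["def", "load_tables", "(", "dataset", ")", ":"]))
# -> [0, 0, 0, 1, 0, 0]
```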
2.3.3 Action Module
First, we discuss the case where the action module has only entity module inputs. Figure 2.4
provides a high-level illustration of the action module. In the example, for the query “Load all
tables from dataset”, the action module receives only part of the full query – “Load all tables from
???”. Action module then outputs token scores for the masked argument – “dataset”. If the code
snippet corresponds to the query, then the action module should be able to deduce this missing
part from the code and the rest of the query. For consistency, we always mask the last data entity
argument. We pre-train the action module using the output scores of the entity discovery module
as supervision.
Each data entity argument can be associated with 0 or 1 prepositions, but each action may
have multiple entities with prepositions. For that reason, for each data entity argument we create
one joint embedding of the action verb and the preposition. Joint embeddings are obtained with a
2-layer MLP model, as illustrated in the left-most part of Figure 2.4.
Figure 2.4: Action module architecture in our NS3 approach.
If a data entity does not have a preposition associated with it, the vector corresponding to the
preposition is filled with zeros. The joint verb-preposition embedding is stacked with the code
token embedding $c^j_k$ and the entity discovery module output for that token, as shown in the
middle part of Figure 2.4. This vector is passed through a transformer encoder model, followed
by a 2-layer MLP and a sigmoid layer, to output the token score $a_k$, as illustrated in the right-most part of
Figure 2.4. Thus, the dimensionality of the input depends on the number of entities. We use a
distinct copy of the module with the corresponding dimensionality for different numbers of inputs,
from 1 to 3.
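A minimal sketch of how the per-token input to the action module could be assembled (joint verb-preposition embedding stacked with the code token embedding and the entity discovery score); the embedding dimension and the way verb and preposition vectors are obtained are assumptions.

```python
import torch
import torch.nn as nn

class ActionInputSketch(nn.Module):
    """Builds per-token inputs for the action module (dimensions are assumptions)."""

    def __init__(self, hidden: int = 768):
        super().__init__()
        # Joint verb + preposition embedding produced by a 2-layer MLP.
        self.joint = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden))

    def forward(self, verb_emb, prep_emb, code_token_embs, entity_scores):
        # prep_emb is all zeros when the entity has no associated preposition.
        vp = self.joint(torch.cat([verb_emb, prep_emb], dim=-1))    # (hidden,)
        vp = vp.expand(code_token_embs.size(0), -1)                 # (seq, hidden)
        # Stack joint embedding, code token embedding, and entity score per token.
        return torch.cat([vp, code_token_embs, entity_scores.unsqueeze(-1)], dim=-1)

sketch = ActionInputSketch()
seq, hidden = 6, 768
inputs = sketch(torch.randn(hidden), torch.zeros(hidden),
                torch.randn(seq, hidden), torch.rand(seq))
print(inputs.shape)  # torch.Size([6, 1537])
```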
2.3.4 Model Prediction
The final score $\hat{r}_{ij} = f(q^i, c^j)$ is computed based on the similarity of the action and entity discovery
module output scores. Formally, for an action module with verb $x$ and parameters $p^x = [p^x_1, \ldots, p^x_k]$,
the final model prediction is the dot product of the respective outputs of the action and entity discovery
modules: $\hat{r}_{ij} = A(x, p^x_1, \ldots, p^x_{k-1}) \cdot E(p^x_k)$. Since the action module estimates token scores for the
entity affected by the verb, if its prediction is far from the truth, then either the action is not found
in the code, or it does not fully correspond to the query; for example, in the code snippet tables are
loaded from the web instead of a dataset. We normalize this score to make it a probability. If this is
the only action in the query, this probability score will be the output of the entire model for the $(q^i, c^j)$
pair, $\hat{r}_{ij}$; otherwise $\hat{r}_{ij}$ will be the product of the probability scores of all nested actions in the layout.
Compositional query with nested actions Consider a compositional query “Load all tables
from dataset using Lib library”. Here the action with verb “Load from” has an additional argument “using” – also an action – with an entity argument “Lib library”. In the case of nested actions, we flatten
the layout by taking the conjunction of individual action similarity scores. Formally, for two verbs
$x$ and $y$ and their corresponding arguments $p^x = [p^x_1, \ldots, p^x_k]$ and $p^y = [p^y_1, \ldots, p^y_l]$ in a layout that
looks like $A(x, p^x, A(y, p^y))$, the output of the model is the conjunction of similarity scores computed for the individual action modules: $sim(A(x, p^x_1, \ldots, p^x_{k-1}), E(p^x_k)) \cdot sim(A(y, p^y_1, \ldots, p^y_{l-1}), E(p^y_l))$.
This process is repeated until all remaining $p^x$ and $p^y$ are data entities. This design ensures that a
code snippet is ranked highly if both actions are ranked highly; we leave exploration of alternative
approaches for handling nested actions to future work.
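A small sketch of the scoring logic described in this section: the dot product of action-module and entity-module token scores for a single action, and the product (conjunction) of per-action scores for a flattened nested layout. The sigmoid used for normalization is an assumption; the text only states that the score is normalized into a probability.

```python
import torch

def single_action_score(action_scores: torch.Tensor,
                        entity_scores: torch.Tensor) -> torch.Tensor:
    """Dot product of action-module and entity-module token scores, squashed
    into (0, 1); the sigmoid normalization is an assumption of this sketch."""
    return torch.sigmoid(action_scores @ entity_scores)

def nested_query_score(per_action_pairs) -> torch.Tensor:
    """Conjunction (product) of per-action scores for a flattened nested layout."""
    total = torch.tensor(1.0)
    for action_scores, entity_scores in per_action_pairs:
        total = total * single_action_score(action_scores, entity_scores)
    return total

# e.g. "Load all tables from dataset using Lib library" -> two (A, E) score pairs
pairs = [(torch.rand(8), torch.rand(8)), (torch.rand(8), torch.rand(8))]
print(nested_query_score(pairs).item())
```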
2.3.5 Module Pretraining and Joint Fine-tuning
We train our model through supervised pre-training, as is discussed in Sections 2.3.2 and 2.3.3,
followed by end-to-end training. The end-to-end training objective is binary classification: given a
pair of a query $q^i$ and code $c^j$, the model predicts the probability $\hat{r}_{ij}$ that they are related. In the end-to-end
training, we use positive examples taken directly from the dataset, $(q^i, c^i)$, as well as negative
examples composed through the combination of randomly mismatched queries and code snippets.
The goal of end-to-end training is fine-tuning the parameters of the entity discovery and action modules,
including the weights of the RoBERTa models used for code token representation.
Batching is hard to achieve for our model, so in the interest of time efficiency we do not perform inference on all distractor code snippets in the code dataset. Instead, for a given query we
re-rank the top-K highest-ranked code snippets output by some baseline model; in our evaluations we used CodeBERT. Essentially, we use our model in a re-ranking setup, which is common in
information retrieval and is known as L2 ranking. We interpret the probabilities output by the
model as ranking scores. More details about this procedure are provided in Section 2.4.1.
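A schematic sketch of such a two-stage (L2) re-ranking setup; both scoring functions are generic stand-ins rather than the actual models (in our evaluations the first stage is CodeBERT and the second stage is our model).

```python
from typing import Callable, List, Tuple

def rerank_top_k(query: str,
                 codebase: List[str],
                 fast_score: Callable[[str, str], float],
                 slow_score: Callable[[str, str], float],
                 k: int = 10) -> List[Tuple[str, float]]:
    """Two-stage retrieval: rank everything with the fast scorer,
    then re-rank only the top-k candidates with the slower scorer."""
    first_stage = sorted(codebase, key=lambda c: fast_score(query, c), reverse=True)
    top_k = first_stage[:k]
    return sorted(((c, slow_score(query, c)) for c in top_k),
                  key=lambda pair: pair[1], reverse=True)
```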
2.4 Experiments
2.4.1 Experiment Setting
Dataset We conduct experiments on two datasets: the Python portion of CodeSearchNet (CSN) [19]
and CoSQA [20]. We parse all queries with the CCG parser, as discussed later in this section, excluding unparsable examples from further experiments. This leaves us with approximately 40%
of the CSN dataset and 70% of the CoSQA dataset; the exact data statistics are available in Appendix A in Table 5.1. We believe that the difference in the parser’s success rate between the two
datasets can be attributed to the fact that the CSN dataset, unlike CoSQA, does not contain real code
search queries, but rather consists of docstrings, which are used as approximate queries. More
details and examples can be found in Appendix A.3. For our baselines, we use the parsed portion of the dataset for fine-tuning to make the comparison fair. In addition, we also experiment
with fine-tuning all models on even smaller subsets of the CodeSearchNet dataset, using only 5K
and 10K examples for fine-tuning. The goal is to test whether the modular design makes NS3 more
sample-efficient.
All experiment and ablation results discussed in Sections 2.4.2, 2.4.3, and 2.4.4 are obtained on
the test set of CSN for models trained on CSN training data, or WebQueryTest [36] – a small natural
language web query dataset of document-code pairs – for models trained on CoSQA dataset.
Evaluation and Metrics We follow CodeSearchNet's original approach to evaluation: for a test instance (q, c), we compare the output against outputs over a fixed set of 999 distractor code snippets. We use two evaluation metrics: Mean Reciprocal Rank (MRR) and Precision@K (P@K) for K=1, 3, and 5; see Appendix A.1 for definitions and further details.
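As a quick reference, the following minimal implementation computes both metrics under the standard assumption that each query has exactly one relevant snippet whose 1-based rank is known; the toy numbers are purely illustrative.

    def mean_reciprocal_rank(ranks):
        # ranks: 1-based rank of the correct snippet for each test query
        return sum(1.0 / r for r in ranks) / len(ranks)

    def precision_at_k(ranks, k):
        # With a single relevant snippet per query, P@K is the fraction of
        # queries whose correct snippet appears in the top K results.
        return sum(1 for r in ranks if r <= k) / len(ranks)

    ranks = [1, 3, 7, 2, 1]                   # toy example
    print(mean_reciprocal_rank(ranks))        # ~0.595
    print(precision_at_k(ranks, k=5))         # 0.8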
Following a common approach in information retrieval, we perform a two-step evaluation. In the first step, we obtain CodeBERT's output against the 999 distractors. In the second step, we use NS3 to re-rank the top 10 predictions of CodeBERT. This makes the evaluation much faster, since, unlike our modular approach, CodeBERT can be fed examples in batches. As the results show, this procedure still improves final performance in all scenarios.

Method            CSN                            CSN-10K                        CSN-5K
                  MRR    P@1    P@3    P@5       MRR    P@1    P@3    P@5       MRR    P@1    P@3    P@5
BM25              0.209  0.144  0.230  0.273     0.209  0.144  0.230  0.273     0.209  0.144  0.230  0.273
RoBERTa (code)    0.842  0.768  0.905  0.933     0.461  0.296  0.545  0.664     0.290  0.146  0.324  0.438
CuBERT            0.225  0.168  0.253  0.294     0.144  0.081  0.166  0.214     0.081  0.030  0.078  0.118
CodeBERT          0.873  0.803  0.939  0.958     0.69   0.55   0.799  0.873     0.680  0.535  0.794  0.870
GraphCodeBERT     0.812  0.725  0.880  0.919     0.786  0.684  0.859  0.901     0.773  0.677  0.852  0.892
GraphCodeBERT*    0.883  0.820  0.941  0.962     0.780  0.683  0.858  0.904     0.765  0.662  0.846  0.894
NS3               0.924  0.884  0.959  0.969     0.826  0.753  0.886  0.908     0.823  0.751  0.881  0.913
Upper-bound       0.979                          0.939                          0.936

Table 2.1: Mean Reciprocal Rank (MRR) and Precision@1/@3/@5 (higher is better) for semantic code search methods trained on different subsets of the CodeSearchNet dataset.
Method            CoSQA
                  MRR    P@1    P@3    P@5
BM25              0.103  0.05   0.119  0.142
RoBERTa (code)    0.279  0.159  0.343  0.434
CuBERT            0.127  0.067  0.136  0.187
CodeBERT          0.345  0.175  0.42   0.54
GraphCodeBERT     0.435  0.257  0.538  0.628
GraphCodeBERT*    0.462  0.314  0.547  0.632
NS3               0.551  0.445  0.619  0.668
Upper-bound       0.736  0.724  0.724  0.724

Table 2.2: Mean Reciprocal Rank (MRR) and Precision@1/@3/@5 (higher is better) for different semantic code search methods trained on the CoSQA dataset.
Compared Methods We compare NS3 with various state-of-the-art methods, including some traditional approaches for document retrieval and pretrained large NLP language models. (1) BM25 is a ranking method to estimate the relevance of documents to a given query. (2) RoBERTa (code) is a variant of RoBERTa [37] pretrained on the CodeSearchNet corpus. (3) CuBERT [13] is a BERT-Large model pretrained on 7.4M Python files from GitHub. (4) CodeBERT [14] is an encoder-only Transformer model trained on unlabeled source code via masked language modeling (MLM) and replaced token detection objectives. (5) GraphCodeBERT [16] is a pretrained Transformer model using MLM, data flow edge prediction, and variable alignment between code and the data flow. (6) GraphCodeBERT* is a re-ranking baseline. We used the same setup as for NS3, but used GraphCodeBERT to re-rank the top-10 predictions of the CodeBERT model.
Query Parser We started by building a vocabulary of predicates for common action verbs and entity nouns, such as "convert", "find", "dict", "map", etc. For those we constructed the lexicon (rules) of the parser. We also included "catch-all" rules for parsing sentences with less common words. To increase the ratio of parsed data, we preprocessed the queries by removing preceding question words, punctuation marks, etc. The full implementation of our parser, including the entire lexicon and vocabulary, can be found at https://github.com/ShushanArakelyan/ccg_parser. More details are available in Appendix A.2.
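Purely for illustration, the toy snippet below mimics the flavor of the lexicon lookup, the "catch-all" fallback, and the query preprocessing; the real parser is CCG-based and its actual lexicon and rules live in the repository linked above, so all vocabulary entries here are hypothetical.

    import re

    ACTION_VERBS = {"convert", "find", "load", "remove", "sort"}          # illustrative entries
    ENTITY_NOUNS = {"dict", "map", "list", "string", "file", "dataset"}
    QUESTION_WORDS = {"how", "to", "do", "i", "can", "what"}

    def preprocess(query):
        # Remove punctuation and leading question words before parsing.
        tokens = re.sub(r"[^\w\s]", " ", query.lower()).split()
        while tokens and tokens[0] in QUESTION_WORDS:
            tokens.pop(0)
        return tokens

    def tag(tokens):
        tagged = []
        for tok in tokens:
            if tok in ACTION_VERBS:
                tagged.append((tok, "ACTION"))
            elif tok in ENTITY_NOUNS:
                tagged.append((tok, "ENTITY"))
            else:
                tagged.append((tok, "OTHER"))    # "catch-all" behavior for uncommon words
        return tagged

    print(tag(preprocess("How to convert a dict to a sorted list?")))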
Pretrained Models The action and entity discovery modules each embed code tokens with a RoBERTa model that has been initialized from a checkpoint of the pretrained CodeBERT model2. We fine-tune these models during the pretraining phases, as well as during the final end-to-end training phase.
Hyperparameters The MLPs in the entity discovery and action modules have 2 layers with an input dimension of 768. We use dropout in these networks with rate 0.1. The learning rate for the pretraining and end-to-end training phases was chosen from the range of 1e-6 to 6e-5. We use early stopping with evaluation on an unseen validation set for model selection during action module pretraining and end-to-end training. For entity discovery model selection, we performed manual inspection of the produced scores on unseen examples. For fine-tuning the CuBERT, CodeBERT, and GraphCodeBERT baselines we use the hyperparameters reported in their original papers. For RoBERTa (code), we search for the learning rate during the fine-tuning stage in the same interval as for our model. For model selection on the baselines we also use early stopping.
2https://huggingface.co/microsoft/codebert-base
2.4.2 Results
Performance Comparison Tables 2.1 and 2.2 present the performance evaluated on the testing portion of the CodeSearchNet dataset and on the WebQueryTest dataset, respectively. As can be seen, our proposed model outperforms the baselines.
Our evaluation strategy can improve performance only if the correct code snippet was ranked among the top-10 results returned by the CodeBERT model, so the rows labelled "Upper-bound" report the best possible performance with this evaluation strategy.
Figure 2.5: Comparison of baseline methods (CodeBERT, GraphCodeBERT) and NS3 for semantic code search on CodeSearchNet and CoSQA. We report Precision@1 scores. (a) Performance of our proposed method and baselines broken down by the maximum depth of the query. (b) Performance of our proposed method and baselines broken down by the number of arguments per action in queries with a single action.
Figure 2.6: Performance of NS3 on the test portions of the CSN and CoSQA datasets with different ablation variants. (a) Skipping one or both pretraining procedures (NS3-AP, NS3-(AP&EP)), or skipping end-to-end training (NS3-E2E). (b) Using no normalization on output scores (None), action-only or entity discovery-only normalization, or both. (c) Performance with different options for computing action and entity discovery output similarities (dot product, L2, weighted cosine).
Query Complexity vs. Performance Here we present the breakdown of the performance of our method vs. the baselines, using two proxies for the complexity and compositionality of the query. The first one is the maximum depth of the query. We define the maximum depth as the maximum number of nested action modules in the query. The results for this experiment are presented in Figure 2.5a. As we can see, NS3 improves over the baseline in all scenarios. It is interesting to note that while CodeBERT achieves its best performance on queries with depth 3+, our model's performance peaks at depth 1. We hypothesize that this can be related to the automated parsing procedure, as parsing errors are more likely to be propagated in deeper queries. Further studies with carefully curated manual parses are necessary to better understand this phenomenon.
Another proxy for query complexity that we consider is the number of data arguments to a single action module. While the previous scenario breaks down the performance by the depth of the query, here we consider its "width". We measure the average number of entity arguments per action module in the query. In the parsed portion of our dataset we have queries that range from 1 to 3 textual arguments per action verb. The results for this evaluation are presented in Figure 2.5b. As can be seen, there is no significant difference in performance across these groups of queries in either CodeBERT or our proposed method, NS3.
2.4.3 Ablation Studies
Effect of Pretraining In an attempt to better understand the individual effect of the two modules, as well as the roles of their pretraining and training procedures, we performed two additional ablation studies. In the first one, we compare the final performance of the original model with two versions where we skipped part of the pretraining. The model denoted NS3-AP was trained with a pretrained entity discovery module, but no pretraining was done for the action module; instead, we proceeded directly to the end-to-end training. For the model denoted NS3-(AP&EP), we skipped the pretraining of both the entity and action modules and only performed end-to-end training. Figure 2.6a demonstrates that combined pretraining is important for the final performance. Additionally, we wanted to measure how effective the setup is without end-to-end training. The results are reported in Figure 2.6a under the name NS3-E2E. There is a large performance dip in this scenario, and while the performance is better than random, it is clear that end-to-end training is crucial for NS3.
Score Normalization We wanted to determine the importance of normalizing the modules' outputs into a proper probability distribution. In Figure 2.6b we demonstrate the performance achieved using no normalization at all, normalizing either the action or the entity discovery module, or normalizing both. In all cases we used L1 normalization, since our output scores are non-negative. The version that is not normalized at all performs the worst on both datasets. The performances of the other three versions are close on both datasets.
Similarity Metric Additionally, we experimented with replacing the dot product similarity with a different similarity metric. In particular, in Figure 2.6c we compare the performance achieved using dot product similarity, L2 distance, and weighted cosine similarity. The difference in performance among the different versions is marginal.
2.4.4 Analysis and Case Study
Appendix C contains additional studies on model generalization, such as handling completely unseen actions and entities, as well as the impact that the frequency of observing an action or entity during training has on model performance.
Figure 2.7: Ratio of the perturbed query score to the original query score (lower is better) on the CSN dataset, for CodeBERT and NS3 under the two perturbation types, V(arg1)->V(arg2) and V1(arg)->V2(arg).
Case Study Finally, we demonstrate some examples of the scores produced by our modules at different stages of training. Figure 2.8 shows module score outputs for two different queries and their corresponding code snippets. The first column shows the output of the entity discovery module after pretraining, while the second and third columns demonstrate the outputs of the entity discovery and action modules after the end-to-end training.
We can see that in the first column the model identifies syntactic matches, such as "folder" and a list comprehension to which "elements" could be related. After fine-tuning, we can see a wider range of both syntactic and some semantic matches, e.g. "dirlist" and "filelist" are correctly identified as related to "folders".

Figure 2.8: Token scores output by the modules at different stages of training. Darker highlighting means a higher score. The leftmost and middle columns show the output scores of the entity discovery module after pretraining and after the end-to-end training, respectively. The rightmost column shows the scores of the action module after the end-to-end training.
Perturbed Query Evaluation In this section we study how sensitive the models are to small changes in the query $q_i$ that make it no longer correctly describe its corresponding code snippet $c_i$. Our expectation is that a sensitive model evaluated on $c_i$ will rate the original query higher than the perturbed one, whereas a model that tends to over-generalize and ignore details of the query will likely rate the perturbed query similarly to the original. We start from 100 different pairs $(q_i, c_i)$ that both our model and CodeBERT predict correctly.
We limited our study to queries with a single verb and a single data entity argument to that verb. For each pair we generated perturbations of two kinds, with 20 perturbed versions for every query. For the first type of perturbation, we replaced the query's data argument with a data argument sampled randomly from another query. For the second type, we replaced the verb with another randomly sampled verb. To account for the calibration of the models, we measure the change in performance through the ratio of the perturbed query score to the original query score (lower is better). The results are shown in Figure 2.7, labelled "V(arg1) → V(arg2)" and "V1(arg) → V2(arg)".
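The following sketch shows how such a ratio could be computed for one pair; the score callable and the query encoding are hypothetical stand-ins for a model's query-code relevance score and for the parsed single-verb, single-argument queries.

    import random

    def perturbation_ratio(score, query, code, other_queries, n_perturbed=20, mode="entity"):
        # query / other_queries: dicts with "verb" and "arg" fields (single-verb queries).
        original = score(query, code)
        ratios = []
        for _ in range(n_perturbed):
            donor = random.choice(other_queries)
            if mode == "entity":   # V(arg1) -> V(arg2): swap the data argument
                perturbed = {"verb": query["verb"], "arg": donor["arg"]}
            else:                  # V1(arg) -> V2(arg): swap the verb
                perturbed = {"verb": donor["verb"], "arg": query["arg"]}
            ratios.append(score(perturbed, code) / original)
        return sum(ratios) / len(ratios)   # lower means more sensitive to perturbations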
Discussion One of the main requirements for applying our proposed method is the ability to construct a semantic parse of the retrieval query. In general, it is reasonable to expect the users of an SCS to be able to come up with a formal representation of the query, e.g. by representing it in a form similar to SQL or CodeQL. However, due to the lack of such data for training and testing purposes, we implemented our own parser, which understandably does not have perfect performance, since we are dealing with open-ended sentences.
2.5 Conclusion
We presented NS3, a symbolic method for semantic code search based on neural module networks. Our method represents the query and code in terms of actions and data entities, and uses the semantic structure of the query to construct a neural module network. In contrast to existing code search methods, NS3 more precisely captures the nature of queries. In an extensive evaluation, we show that this method works better than strong but unstructured baselines. We further study the model's generalization capacities, robustness, and the sensibility of its outputs in a series of additional experiments.
Chapter 3
Understanding and combating distributional shift in real-world
software development cycles
In this Chapter, we systematically study how three large language models with code capabilities – CodeT5, Codex, and ChatGPT – generalize to out-of-domain data. We consider two fundamental applications: code summarization and code generation. We split data into domains following its natural boundaries – by organization, by project, and by module within the software project. We establish that samples from each new domain present all the models with a significant challenge of distribution shift. We study how established methods adapt models to better generalize to new domains. We empirically demonstrate that while multitask learning alone is a reasonable baseline, combining it with few-shot finetuning on examples retrieved from training data can achieve very strong performance. Moreover, this solution can outperform direct finetuning in very low-data scenarios. Finally, we study approaches for creating a more broadly applicable method to adapt to multiple domains at once. We find that for code generation, a model adapted to multiple domains simultaneously performs on par with those adapted to a single domain.
3.1 Introduction
Since the late 2000s, researchers have been reporting poor generalization of statistical learning models to new software systems [38, 39], a phenomenon that has become important with the rise of large language models (LLMs) for code, such as GitHub Copilot, Amazon CodeWhisperer, Replit, etc. Thus, it is crucial to understand when pretrained large language model performance on a private software system will differ from the performance obtained on a benchmark. Prior work has studied some aspects of this problem, among others generalization from older to newer code, across large software projects and small competition problems, across authors, and across code representations [40–42].
However, the challenges of distribution shifts stemming from the hierarchical nature of software data, as depicted in Figure 3.1, have not been systematically studied with regard to large
language models for code. Motivated by that, in this work, we probe the generalization capacity of large language models with code capabilities, specifically Codex [43], CodeT5 [44] and
ChatGPT, in code generation and summarization tasks, examining three scenarios: generalization
across companies, projects, and project components. These scenarios are routinely considered for
analyzing software systems [45–47] due to the careful consideration that goes into combining or
separating such entities.
Figure 3.1: Organization of a software system by the granularity of its components
First, we want to understand how models perform on new domains – if models struggle with out-of-domain generalization, they should be used with caution. At the same time, we empirically establish the legitimacy of our definitions of out-of-domain scenarios by demonstrating that these examples present a distributional shift. To answer this question, we compare the performance of the models without any additional adaptation with that of models that have been adapted on limited data from a random domain or from the test domain. Adaptation with labeled examples from the test domain is a proxy for model performance if there were no distributional shift. We find that all three models suffer from a drop in performance when applied out-of-domain. In this experiment, the difference is more pronounced for code summarization, where adapting models with a few in-domain examples, on average, leads to an improvement of over 10 BLEU [48] score points.
Next, we explore ways to improve the out-of-domain generalization of large language models with code capabilities, recognizing that relying on labeled in-domain data for every new domain is impractical. Instead, we investigate the use of labeled out-of-domain data and small amounts of unlabeled in-domain data to enhance generalization. We test methods known to be successful in other transfer learning scenarios, such as meta-learning [49, 50] and multitask learning [51]. We also leverage unlabeled in-domain data to retrieve similar labeled examples from an out-of-domain corpus for adapting to the new domain. We find that while meta-learning and multitask learning do not solve the out-of-domain generalization problem, domain adaptation with retrieved examples is a good technique for low-data domains. In our evaluation on the CodeSearchNet dataset we find that models supervised with retrieved examples perform on par with, or better than, models that have been adapted using a few samples (e.g., 8 or 16) of in-domain labeled data. We are particularly interested in scenarios with an extreme scarcity of labeled data – ranging from a few labeled instances to no labeled data at all. This is due to how new data emerges in software engineering domains – it is not difficult to imagine a new repository, or a new module, with fewer than 32 functions, let alone 32 labeled functions.
Lastly, we study whether we can make code models more broadly applicable and retain their generalization capacities, rather than having to adapt them to every new domain. Depending on the approach to model adaptation (e.g. weight updates vs. in-context demonstrations), we vary the set of retrieved examples for each new domain, or for each test input individually. We compare the performance obtained this way with that of models that are adapted simultaneously to multiple domains (or instances, correspondingly). We find that Codex is very sensitive to these changes, so it is best to retrieve similar instances for each test data point. On the other hand, CodeT5 shows a minor drop in code summarization and a negligible drop in code generation. This makes it feasible to adapt and apply CodeT5 to multiple domains simultaneously with minimal tradeoff, eliminating the need to store separate copies of the model for each domain.
3.2 Background
Shifts in the underlying semantics between the training and evaluation data can be one of the most impactful factors for deteriorating performance at test time. Prior work in code analysis has mainly focused on cross-project shifts, i.e. training and evaluating the model on disjoint sets of code projects. Additionally, these studies were mainly conducted in the context of traditional machine learning methods, such as linear classifiers, support vector machines, and, later, LSTMs [38, 39, 52].
More recent works consider shifts caused by different authors of the code, the timeline of the project, distributions of code tokens, etc. [40–42]. However, the abilities of large language models under distribution shift are still under-explored. We conduct a comprehensive empirical analysis to probe large language models' capabilities in handling three different granularities of distribution shift (company, project, module) when different training and adaptation methods are used. In addition to directly fine-tuning vanilla LLMs, we experiment with enhancing pretrained models using the methods described below.
Meta-Learning and Multi-task Learning. In our work, we experiment with both meta-learning and multi-task learning to obtain a better initialization for few-shot performance on the downstream task. For meta-learning, we employ Model-agnostic Meta-Learning (MaML) [53], which is a gradient-based method. It is a conceptually simple and model-agnostic algorithm that has been shown to outperform existing approaches on several tasks. Multi-task Learning (MTL) aims to learn a shared and generalized representation by jointly training on several tasks. We adopt the simplest approach to multi-task learning by jointly finetuning a shared language model on multiple tasks.
Parameter Efficient Methods. Parameter-efficient methods have been shown to obtain performance comparable to finetuning all model parameters while finetuning only a tiny fraction of them. In our work, we experiment with Low-Rank Adaptation (LoRA) [54], which is a low-rank update method.
In-Context Learning. GPT-3 [55] demonstrated the ability of large language models to perform few-shot predictions, where the model is given a description of the task in natural language along with a few examples. In our work, we conduct in-context learning experiments with Codex.
Figure 3.2: We group the instances from CodeSearchNet dataset by repos, orgs, and folders they belong to.
Retrieval Based Example Selection. It has been shown in Liu, Shen, Zhang, Dolan, Carin & Chen [56] that in-context examples selected following a strategy may serve as more informative input that unleashes GPT-3's extensive knowledge. Inspired by this, we leverage similarity-based retrieval for domain adaptation.
3.3 Problem setting
We study a scenario where users seek to integrate a large language model, such as Codex or CodeT5, into their software project. The primary focus of this study is to gain a deeper understanding of the performance characteristics exhibited by these models, particularly when confronted with source code originating from an unseen organization, an unseen project, or specific project components that have not been previously encountered.

            τ ⊂ Xtrain (total)   τ ⊂ Xtrain (|τ| ≥ 96)   τ ⊂ Xtest (|τ| ≥ 96)
org.        9737                 195                      8
repos.      15858                147                      15
fold.       25268                100                      10

Table 3.1: Domains in the CodeSearchNet dataset. Left column: total number of domains of each kind in the training set. Middle column: number of domains of each kind in Xtrain with ≥ 96 samples. Right column: number of domains in Xtest with ≥ 96 samples.
For every code data point in the dataset, we have information about the organization, project,
and the module within the project that the data point comes from. Based on this information, we
can group data points into sets, and end up with three sets of sets, as illustrated in Figure 3.2.
For example, the middle set in the figure contains multiple sets of data points. Each of those sets
corresponds to a unique organization to which all data points within it belong. In other words,
all data points within a set belong to the same domain. Appendix, Section H contains additional
analysis on splitting the data points in this manner. For simplicity, we refer to a set of examples
from the same domain as τ
i
. We refer to splits of such a set into train, development, or test sections
as τ
i
train, τ
i
dev, and τ
i
test.
3.3.1 Data
We use the CodeSearchNet [57] dataset1, in particular the partition containing the JavaScript language. We refer to the train section of the dataset as Xtrain, and to the development and test sections as Xtest. We want to keep all of the domains in Xtest unseen, and for that reason we remove any domain from Xtest that also appears in Xtrain. This can happen because the CodeSearchNet dataset is split into partitions by projects, so the same organizations can appear in different splits. This way, any domain coming from Xtest is, by our definition, out-of-domain for any model trained on Xtrain.
1 Since the training data of the Codex models is undisclosed, we cannot be sure that it did not include CodeSearchNet. Nevertheless, we see a performance difference between the ID and OOD experiments.
Code summarization      folder                       repo                         org
                        8-shot  16-shot  32-shot     8-shot  16-shot  32-shot     8-shot  16-shot  32-shot
CodeT5 FT ID            14.39   16.06    18.31       12.68   14.73    16.82       13.14   16.35    17.65
CodeT5 LoRA ID          16.57   19.07    20.93       15.22   17.14    21.20       15.61   18.56    20.87
CodeT5 FT random        3.58    4.30     5.02        4.35    4.70     5.79        4.53    5.47     6.27
CodeT5 LoRA random      3.69    4.37     4.92        4.70    5.56     5.92        5.27    5.53     6.26

Table 3.2: Model performance for code summarization on in-domain (ID) vs. out-of-domain (random) test data. The reported metric is BLEU (higher is better).
Code generation         folder                       repo                         org
                        8-shot  16-shot  32-shot     8-shot  16-shot  32-shot     8-shot  16-shot  32-shot
CodeT5 FT ID            14.67   15.22    16.13       16.15   17.42    18.62       14.54   15.34    16.43
CodeT5 LoRA ID          14.14   15.06    16.36       16.23   17.45    18.96       14.17   15.30    16.62
CodeT5 FT random        15.23   14.94    15.15       14.19   14.14    14.67       13.39   13.43    14.44
CodeT5 LoRA random      14.45   14.29    15.37       14.29   13.74    15.04       13.76   13.85    14.81

Table 3.3: Model performance for code generation on in-domain (ID) vs. out-of-domain (random) test data. The reported metric is CodeBLEU (higher is better).
We further split each domain $\tau^i \subset X_{test}$ into $\tau^i_{train}$, $\tau^i_{dev}$, and $\tau^i_{test}$. The evaluation is performed on $\tau^i_{test}$. $\tau^i_{train}$ and $\tau^i_{dev}$ are used to obtain a proxy for the upper-bound performance of the model if the domain $\tau^i$ was seen during training, i.e. if there were no distribution shift for $\tau^i_{test}$.
Preprocessing We use the "path" field of each data point to determine the code snippet's organization, repository, and lowest-level folder. Using 5 different random seeds, we divide a domain into $\tau^i_{train}$, $\tau^i_{dev}$, and $\tau^i_{test}$. We aim to have at least 32 samples each in $\tau^i_{test}$ and $\tau^i_{dev}$, and up to 32 samples for $\tau^i_{train}$. Thus, from Xtest we filter out any domain that has fewer than 96 samples in total. Final dataset statistics are presented in Table 3.1.
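A minimal sketch of this preprocessing is shown below; it assumes the path-like field has the form "org/repo/.../file.js", which is an assumption made for illustration, since the exact field layout of the dataset records may differ.

    import random
    from collections import defaultdict

    def domain_key(path, level):
        # level is one of "org", "repo", "folder"; paths are assumed to look
        # like "org/repo/src/.../file.js" (illustrative assumption).
        parts = path.split("/")
        if level == "org":
            return parts[0]
        if level == "repo":
            return "/".join(parts[:2])
        return "/".join(parts[:-1])          # lowest-level folder

    def build_domains(examples, level):
        domains = defaultdict(list)
        for ex in examples:
            domains[domain_key(ex["path"], level)].append(ex)
        # Keep only domains with at least 96 samples (32 train + 32 dev + 32 test).
        return {k: v for k, v in domains.items() if len(v) >= 96}

    def split_domain(samples, seed=0):
        shuffled = samples[:]
        random.Random(seed).shuffle(shuffled)
        return {"test": shuffled[:32], "dev": shuffled[32:64], "train": shuffled[64:96]}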
3.3.2 Applications and Metrics
We study two generation applications: code summarization and code generation. Code summarization aims to summarize a code snippet into a natural language description. The code snippet in the CodeSearchNet dataset is a function, while the natural language description is the docstring of that function. This task is evaluated with the BLEU-4 [48] metric. Code generation generates the function given a natural language description of the code. We follow prior work and use CodeBLEU [58] for evaluating generated code. We added our own JavaScript keywords (the full list is in Appendix, Section F) to an existing CodeBLEU implementation. However, it has recently been shown that CodeBLEU scores can disagree with human judgment scores [59]. Motivated by these findings, we additionally evaluate code generation models with the chrF [60], RougeL [61], and CodeBERTScore [62] metrics. These metrics are in agreement in our experiments, so we report the results for them in Appendix, Section L.

Code summarization        folder   repo    org
Codex
  instr. only (0-shot)    1.55     1.52    1.61
  ICL random (8-shot)     7.17     6.84    6.73
  ICL ID (8-shot)         20.34    19.00   20.72
ChatGPT
  instr. only (0-shot)    5.74     5.48    4.63
  ICL random (8-shot)     5.47     6.58    6.48
  ICL ID (8-shot)         7.47     9.15    7.54

Code generation           folder   repo    org
Codex
  instr. only (0-shot)    5.49     5.72    5.77
  ICL random (8-shot)     16.82    17.47   16.82
  ICL ID (8-shot)         25.73    24.64   23.87
ChatGPT
  instr. only (0-shot)    8.45     8.39    8.04
  ICL random (8-shot)     12.95    13.19   12.70
  ICL ID (8-shot)         15.17    15.81   15.55

Table 3.4: Codex and ChatGPT performance for the code summarization and code generation tasks. Models are evaluated in a 0-shot manner, as well as using in-context learning (ICL) demonstrations with in-domain (ID) and out-of-domain (random) instances. The reported metric is BLEU for code summarization (higher is better) and CodeBLEU for code generation (higher is better).
3.3.3 Models
We experiment with three large language models: (1) CodeT5 [44], which is an encoder-decoder model based on T5 [63]; (2) Codex [43], which is a decoder-only model based on GPT-3 [55]; and (3) ChatGPT (gpt-3.5-turbo), which is the chat-optimized version of InstructGPT [64], fine-tuned with Reinforcement Learning from Human Feedback (RLHF) [65]. The models vary in size: CodeT5 utilizes the T5-large architecture with 700 million parameters, while the Codex model employs the GPT-3 architecture with over 100 billion parameters. Although the architecture of ChatGPT has not been disclosed, it is presumed to have billions of parameters. A more detailed discussion of these models is provided in the Appendix, Section I.
3.4 Analysis
In this section, we formulate the research questions that we aim to answer and give a more detailed
description of the setups that we have used for analyzing and answering each question.
RQ 1. How do code models perform on new domains?
We test the models’ capacity for generalization to new domains by comparing the performance
of the models that have been adapted to the new domain using few-shot instances of in-domain
data (ID) vs those that only encountered out-of-domain (OOD) data. For CodeT5, few-shot domain adaptation data is used to update the model weights, whereas for Codex, it is included as
demonstrations in the prompt to the model.
CodeT5
For adaptation techniques for the CodeT5 model, we experiment with using a different number of
supervision examples - 8, 16, or 32.
The first adaptation method we use is full model fine-tuning (FT). Information on the hyperparameters for this and all other methods is available in Appendix, Section J. Besides FT, we also
experiment with a parameter-efficient fine-tuning method - Low-Rank Adaptation (LoRA) [54].
This method adds trainable pairs of rank decomposition matrices in parallel to existing weight
matrices, thus enabling parameter-efficient adaptation to new domains without forgetting.
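For illustration, a LoRA adapter can be attached to a T5-style model along the following lines, assuming the Hugging Face peft library; the checkpoint name, target modules, and rank shown here are illustrative choices rather than the exact configuration used in our experiments.

    from transformers import T5ForConditionalGeneration
    from peft import LoraConfig, TaskType, get_peft_model

    # Load a T5-style code model (checkpoint name is an assumption for illustration).
    model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-large")

    # Attach low-rank adapters to the attention projections; only these small
    # matrices are updated during few-shot domain adaptation.
    config = LoraConfig(
        r=8,                        # rank of the update matrices
        lora_alpha=16,
        lora_dropout=0.05,
        target_modules=["q", "v"],  # T5 attention projection module names
        task_type=TaskType.SEQ_2_SEQ_LM,
    )
    model = get_peft_model(model, config)
    model.print_trainable_parameters()   # only a tiny fraction of the full model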
Codex and ChatGPT
For the GPT models, we do not perform weight updates. Very large models have been shown to be capable of generalizing to unseen tasks with just an instruction. Thus, we evaluate these models with just the task instruction, for example, "Summarize following JavaScript code", and the input (i.e. instruction only). Models can be sensitive to the wording of the instructions, so we use a number of different instruction variations for each application and average the results. The full list of instruction variations that we have used with the Codex and ChatGPT models is presented in Appendix, Section N.
Moreover, larger models have been shown to "learn" from demonstration examples that are provided as part of their input, even though this process does not involve any weight updates. This phenomenon is known as in-context learning (ICL), which is what we use for domain adaptation of the GPT models. Due to the limit on the size of the input to the models (4096 tokens), we use as many demonstrations as fit, including up to 8 demonstrations with each test example. Since the models can also be sensitive to the order of examples, we shuffle the order of the demonstrations 5 times and average the results.
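As a rough sketch of how such a prompt could be assembled for code summarization, the snippet below packs an instruction, as many shuffled demonstrations as fit under a crude length budget, and the test input; the formatting, the whitespace-based length proxy, and the function name are illustrative assumptions rather than the exact prompt template we used.

    import random

    def build_prompt(instruction, demonstrations, test_code, max_tokens=4096, seed=0):
        demos = demonstrations[:]
        random.Random(seed).shuffle(demos)        # order is re-shuffled across seeds

        approx_len = lambda s: len(s.split())     # crude stand-in for a real tokenizer
        parts = [instruction]
        used = approx_len(instruction) + approx_len(test_code)
        for code, summary in demos[:8]:           # include up to 8 demonstrations
            block = f"Code:\n{code}\nSummary:\n{summary}"
            if used + approx_len(block) > max_tokens:
                break
            parts.append(block)
            used += approx_len(block)
        parts.append(f"Code:\n{test_code}\nSummary:")
        return "\n\n".join(parts)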
Finding: Models struggle on new domains
Tables 3.2 and 3.3 demonstrate the performance obtained by CodeT5, and Table 3.4 shows the performance of Codex and ChatGPT. Additional results for other code generation metrics, such as chrF, RougeL, and CodeBERTScore, are available in Appendix, Section L. We see that performance degrades for models that have not encountered in-domain examples vs. those that have, i.e. models struggle with out-of-domain generalization. For example, on code summarization the CodeT5 model in most scenarios gains about 200% relative improvement after updating the model with few-shot in-domain data.
While there is a difference between ID and OOD performance for the CodeT5 model on code generation, the difference is next to negligible. We hypothesize that this can be due to the fact that code generation is a more challenging task for a large language model, and so the effect of distribution shift is less noticeable. This observation becomes evident when examining Table 3.3, which demonstrates that the smaller model, CodeT5, exhibits lower performance compared to larger models such as Codex. Thus, for CodeT5, adding in-domain data results in a smaller gain. On the other hand, for Codex, the addition of in-domain data results in up to 50% relative improvement.
From Table 3.4, it is evident that while ChatGPT outperforms Codex in the 0-shot setting, we do not see as large an improvement from the addition of in-context examples, whether in-domain or out-of-domain. Upon closer inspection of model outputs, we notice that this is due to the specifics of the ChatGPT model, which errs on the side of caution, refusing to provide any answer when presented with a vague or noisy input. This results in scores of 0 for those entries, lowering the overall model performance and smoothing the effect of in-domain demonstrations. Due to this characteristic of the ChatGPT model, and having established that it is affected by distributional shifts in the same way as the other models in this study, we do not perform further comparisons with it in the rest of the chapter.
RQ 2. How to get better out-of-domain generalization?
We have seen that models for code perform significantly better after being adapted to new domains using in-domain data. However, there are many reasons why adapting to every new domain with the help of labeled examples might be impractical. Thus, we consider some alternative approaches that do not require labeled data but can hopefully close the performance gap partially or fully. Figure 3.3 shows an overview.
CodeT5
To answer RQ1, we start from a pre-trained checkpoint of the model and experiment with different approaches for domain adaptation. To answer the current question, we additionally consider
different methods to use before the domain adaptation stage, particularly, multi-task learning and
meta-learning. The resulting setups are illustrated in Figure 3.3a.
Figure 3.3: Training and domain adaptation setups for (a) CodeT5 and (b) Codex. For the CodeT5 model we use different methods for training and domain adaptation. We evaluate both models in scenarios with different data sources during the domain adaptation stage.
Multitask learning (MTL) The MTL method trains a single model on all the domains simultaneously. For code summarization, we use the model checkpoint that has been provided by the authors of CodeT5, which is fine-tuned on the training portion of CodeSearchNet. For code generation, we perform our own training, since no JavaScript checkpoint was shared by the CodeT5 authors.
Dual-gen MTL In addition to MTL, we experiment with a multitask model that has been trained on both code generation and code summarization simultaneously. We refer to this model as "dual-gen" MTL, following the authors of CodeT5. We prepend the inputs to the model with a generation or summarization instruction for each instance.
Figure 3.4: Models with ID and retrieved downstream adaptations. (a) CodeT5, trained and evaluated on CodeSearchNet. (b) CodeT5, trained on CodeSearchNet, evaluated on The Vault. (c) Codex.
Figure 3.5: CodeT5 model finetuned with retrieved supervision using different numbers of retrieved examples per test sample. The scores reported are BLEU for code summarization and CodeBLEU for code generation. CodeT5 MTL model performance in the zero-shot and 8-shot (ID) scenarios is shown with dotted lines for reference.
Model-Agnostic Meta Learning For model-agnostic meta-learning, or MaML [53], we filter the domains in the Xtrain set, keeping only those that have at least 96 samples (see the middle column of Table 3.1). This is to ensure that each domain contains disjoint sets of adequate size for both training and meta-training.
Stratified Example Retrieval for Supervision In addition to the strategies above, we experiment with a domain adaptation method that does not require in-domain labeled data for supervision. We use a similarity metric on embeddings obtained from the pre-trained CodeT5 model checkpoint to retrieve the k most similar examples from Xtrain for every example in τtest. We set k to 4, 8, or 32, and since |τtest| = 32, the combined size of the set would be 128, 256, or 1024. Finally, we remove any duplicates. We refer to this set as τret.
For the similarity metric, we experiment with cosine similarity, as well as a more recent approach, IsoScore [66]. In our experiments, we find that cosine similarity performs better overall, so the results reported here use cosine similarity. Additional results using the IsoScore metric are reported in Appendix Section M.
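The retrieval step itself can be sketched as follows; the embed callable is a hypothetical stand-in for pooling encoder representations from the pretrained checkpoint, and everything else follows the description above.

    import numpy as np

    def cosine_sim(a, b):
        a = a / np.linalg.norm(a, axis=-1, keepdims=True)
        b = b / np.linalg.norm(b, axis=-1, keepdims=True)
        return a @ b.T

    def retrieve_supervision(test_examples, train_examples, embed, k=8):
        test_emb = np.stack([embed(x) for x in test_examples])
        train_emb = np.stack([embed(x) for x in train_examples])
        sims = cosine_sim(test_emb, train_emb)            # shape: (n_test, n_train)

        retrieved_ids = set()
        for row in sims:
            top_k = np.argsort(-row)[:k]                  # k nearest training examples
            retrieved_ids.update(int(i) for i in top_k)   # the union removes duplicates
        return [train_examples[i] for i in sorted(retrieved_ids)]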
Challenge Scenario In addition to using test data from the CodeSearchNet dataset, in an attempt to make the evaluation more realistic, we experiment with a setting where the out-of-domain data comes from a different dataset. Here we use the test split of The Vault dataset [67], which we have processed in the same manner as described in Section 3.3.1. The details of the processing for The Vault dataset are provided in Appendix Section K.

                     Code Summarization (BLEU / ∆ BLEU)              Code Generation (CodeBLEU / ∆ CodeBLEU)
                     org             repo            folder          org             repo            folder
FT: combined 4       18.74 / -4.74   18.59 / -4.47   18.06 / -1.06   29.46 / -0.19   29.41 / -0.01   26.60 / -1.53
FT: combined 8       18.46 / -5.07   18.58 / -3.03   17.57 / -3.48   29.13 / -0.73   28.83 / -0.22   27.23 / -0.92
FT: combined 32      17.35 / -2.31   17.63 / -0.94   15.57 / -2.56   26.28 / -3.63   25.01 / -4.02   25.14 / -2.88
ICL: 4 from τret     14.66 / -7.04   12.68 / -7.95   12.10 / -6.96   20.52 / -6.73   20.06 / -7.78   19.39 / -6.21
ICL: 8 from τret     13.77 / -8.53   12.96 / -8.52   12.26 / -7.17   20.81 / -7.05   20.23 / -8.16   19.48 / -7.00

Table 3.5: CodeT5 and Codex model performance using retrieved supervision examples for general domain adaptation. The first number in each cell is the score obtained by the corresponding model, followed by the change in performance w.r.t. the domain-specific model or test-sample-specific demonstrations.
Codex
Stratified Example Retrieval for Demonstrations Similarly to the strategy for CodeT5, for
Codex we employ in-context learning with retrieved demonstration examples. For each test query,
instead of using random sets of in-domain or out-of-domain demonstrations, we use 4 or 8 of the
query’s most similar samples from Xtrain as demonstrations. This case is referred to as ICL ret.
Finding: Strategic adaptation is advantageous in very low data scenarios
Figures 3.4a and 3.4c demonstrate the performance of the CodeT5 and Codex models. For CodeT5, Figure 3.4a includes the performance obtained without adaptation (0-shot), as well as after in-domain few-shot fine-tuning (additional results for LoRA are presented in Appendix Section L). None of the evaluated methods performs comparably in the zero-shot setting to those with few-shot domain adaptation, whether on examples retrieved from the training data or obtained from the test domains. So these training methods do not result in a general-purpose model that handles out-of-domain generalization well.
The same pattern is evident in the challenge evaluation scenario, presented in Figure 3.4b. From this figure, we also conclude that retrieved supervision is less effective when the supervision and test examples are extracted from different datasets, even when both are collected from the same source, i.e. GitHub. While we have done our best to process the data in The Vault dataset as similarly as possible to the processing done in CodeSearchNet, there must still be subtle differences remaining from data collection, once again emphasizing how sensitive code models are even to minute changes.
Adapting the model trained with the MTL objective to test domains with the help of stratified supervision provides a considerable boost to the performance of CodeT5 and Codex. Results for CodeT5 are shown in Figure 3.5 with bars marked "ret k", where k refers to the number of examples included in τret per test example. Figure 3.4c reports Codex performance with 4 or 8 retrieved demonstrations as "ICL ret 4" and "ICL ret 8", respectively.
First of all, we notice that there is a saturation in gained performance vs. the number of stratified supervision or demonstration examples used. For CodeT5, using 32 examples per test instance is almost always worse than using 4 or 8 examples. For Codex, using 4 or 8 examples results in approximately the same performance.
Next, for code summarization, retrieving 4 or 8 examples from out-of-domain training data leads to performance comparable to, or even better than, that of the model adapted using 8 examples from the test domain. This trend is observed for both Codex and CodeT5, particularly strongly when generalizing to new repositories and new organizations. A similar trend can be observed for code generation, and to a much stronger degree for CodeT5: stratified supervision models can even outperform models trained with 32 examples from the test domain. While the performance of the stratified supervision models plateaus after a certain number of examples, supervision on in-domain samples does not demonstrate such a trend.
RQ 3. Can we have more generic solutions for out-of-domain generalization?
From our analysis of RQ2, we see that models can generalize better to new domains without
relying on labeled data from that domain. Unfortunately, this still requires adapting to every test
domain individually for CodeT5, and even more strictly - to every test sample individually - for
41
Codex. For example, for CodeT5, this means maintaining multiple copies of the model, performing
the training for the adaptation stage multiple times, and storing a large amount of out-of-domain
data to retrieve examples from. In this RQ, we experiment with approaches that would eliminate
the need to train CodeT5 on multiple domains separately. For Codex, we experiment with sampling from demonstrations collected for the entire domain. For CodeT5, we try two approaches.
First, we finetune it on the combined set of τret for all domains. We also try using fast vote-k algorithm [68], which selects representative examples from the supervision dataset, while ensuring
diversity among selected examples. For Codex, for a query from τtest, we consider sampling 4 or
8 demonstration examples from τret.
Finding: Multi-domain code generation models do not require a large performance sacrifice.
The results for both models are presented in Table 3.5. Results for CodeT5 for this experiment are referred to as "FT: combined k", where k is the number of retrieved examples per test example. Fast vote-k is less effective as an adaptation technique compared to fine-tuning on a combined set of retrieved examples, and the results for it are presented in Appendix Section M.1. As can be seen, training a single model on the combined retrieved samples results in a moderate drop in performance for code summarization, and a negligible drop for code generation. In other words, a model finetuned on stratified supervision data for new domains can be a viable solution to the out-of-domain generalization problem for code generation. Interestingly, this also indicates that for code generation, good performance on one domain does not hinder performance on another domain, i.e. there is little to no negative transfer between different domains.
For Codex, the results of the experiment are referred to as "ICL: k from τret" in Table 3.5, where k is the number of sampled demonstrations. It appears that for Codex, replacing demonstrations selected for individual examples with those selected for a domain introduces too much noise and degrades the performance considerably, because of the high sensitivity of ICL to demonstrations.
3.5 Conclusion
We evaluate large language models for code - CodeT5, Codex (code-cushman-001), and ChatGPT
(gpt-3.5-turbo) - on two fundamental code applications - code generation and code summarization.
We study how the models perform under distribution shifts that can commonly occur due to the
nature of the software. We experiment with three granularities for defining domains in applications
for code - organization, project, and module or folder. Our experiments show that all models evaluated are susceptible to reduced performance due to domain shifts. We experiment with a number of
training and domain adaptation techniques for achieving better out-of-domain generalization. We
discover that retrieving similar out-of-domain examples from training data is the most effective
approach for adapting to new domains in the absence of in-domain data. In addition, we experiment with adapting models to multiple new domains simultaneously and find that such models
can perform very well for code generation. However, for code summarization we find that the generality of the model comes at a cost to its performance.
3.6 Limitations and Threats to Validity
As can be seen from Table 3.1, as a result of the process of filtering, we skew the data towards
larger projects and eliminate from the dataset many samples that could potentially come from
smaller projects. We believe that this step is necessary to make the results more reliable, due to the
high variance that can be observed in datasets with very small test sets. However, we want to draw
attention to this circumstance once more, to make sure that our findings are interpreted correctly.
Chapter 4
From Low-Resource to High-Performance with Tool-Assisted
Synthetic Data for Code Generation Models
In this chapter we look at the documented performance gap in large language models for code between high-resource languages like JavaScript, Python, and C++ and lower-resource languages like Julia, Lua, or Scala. Inspired by prior work showing the effectiveness of high-quality fine-tuning and distillation datasets for improving model performance, we suggest combining those insights with the observation that tools such as compilers and linters are readily available for any programming language. Thus, we explore using external tools to improve the quality of the synthetically generated data that is used for fine-tuning a 1.3B student model, and confirm that our approach is effective across six low-resource programming languages: Perl, Lua, Julia, Scala, Swift, and Rust.
4.1 Introduction
Languages with less widespread use, or low-resource programming languages (PLs), despite being less popular than general programming languages like Python or JavaScript, have a range of important applications, from scientific computing for Julia [69] to low-level and embedded systems for Rust [70, 71]. Yet large language models (LLMs) demonstrate limited performance on these PLs. As illustrated in Table 4.1, the performance of the frontier model GPT-4o [72] on code generation in low-resource PLs can be much worse than on Python (0.49 Pass@1 on average across the six low-resource languages, compared to 0.90 on Python). This could be due to the limited amount of available supervision data for these programming languages: for example, the first three languages evaluated in Table 4.1 account for between <0.1% and 0.3% of pull requests on GitHub in 2024, while the last three account for between 1.1% and 1.9%. For comparison, Python accounts for around 17%, the same as JavaScript and TypeScript combined [73]. This drastic difference in performance makes LLMs unusable and unreliable when generating code in low-resource PLs compared to mainstream PLs.

                              Low-Resource Languages                      Avg     Python
                              Julia   Lua    Perl   Swift   Rust   Scala
GPT-4o                        0.37    0.34   0.30   0.73    0.80   0.42   0.49    0.90
DeepSeekCoder-1.3B-Instr.     0.19    0.30   0.21   0.26    0.28   0.25   0.25    0.67
Llama3-8B-Instr.              0.25    0.36   0.26   0.34    0.34   0.40   0.33    0.62

Table 4.1: Frontier models like GPT-4o show widely varied and lower performance on low-resource programming languages compared to a high-resource language, such as Python. The reported metric is Pass@1 (higher is better) and the evaluation is performed on the Humaneval sections of the MultiPL-e dataset.
Past work on improving performance for low-resource programming languages has primarily focused on transfer learning with finetuning on low-resource language data [74–77]. However, the scarcity of supervision data for low-resource PLs limits the applicability of these approaches. On the other hand, many recent works demonstrate that frontier language models can synthesize code data, which can then be used effectively to supervise models for code generation tasks [78–80]. However, the quality of the generated synthetic data for low-resource PLs is upper-bounded by the performance of the synthetic data generator on these languages.
In this chapter we explore avenues for model improvement on low-resource programming languages through improved synthetic data generation with the help of external tool use. Programming tools such as compilers and linters are widely accessible for most programming languages, making them ideal for augmenting the capabilities of language models beyond their inherent limitations. We propose leveraging external feedback from these PL-specific tools to guide language models in correcting errors and generating higher-quality synthetic data for distillation. This strategy is motivated by previous research showing that language models can refine and enhance their outputs when provided with feedback from external tools [81]. Correcting errors with the assistance of an LLM allows the model to go beyond its initial generation, addressing shortcomings and exploiting otherwise untapped potential for generating accurate code in low-resource programming languages. Moreover, applying this approach over multiple iterations allows for continuous refinement of the generated data, addressing both syntactic and functional errors in successive passes. Specifically, we start with a small seed dataset of a few hundred coding problems and expand it to a synthetic dataset of 5K code snippets in the target low-resource programming language, which is then improved by the generator model based on feedback from compilation or execution steps. This process leads to significant improvements in model performance on low-resource programming languages.

Figure 4.1: Overview of the data generation process pipeline.
For our choice of low-resource PLs, we follow the definition proposed in [82], and choose four low-resource (Perl, Scala, Swift, and Rust) and two niche (Lua, Julia) programming languages from the MultiPL-e Humaneval dataset [82]. We demonstrate that our approach not only outperforms other baselines that utilize learning from synthetic data, including those generated by powerful teacher models such as GPT-3.5-Turbo, but also surpasses the performance of GPT-4o on some programming languages, such as Julia, Lua, and Perl. While previous research has shown that smaller models can outperform GPT models through distillation [79], these efforts relied on directly distilling from the highly powerful GPT models. In contrast, with our approach to distillation we achieve superior results using a more modest setup, with a DeepSeekCoder-1.3B-Instruct [83] student model and a Llama3-8B-Instruct [84] teacher model, underscoring the efficiency and effectiveness of our refinement process. There is a growing need to apply LLMs to very-low-resource and niche programming languages, and we believe our approach can help create small yet powerful models tailored for these specific languages.
4.2 Related Work
The issue of poor transfer of large language models to lower-resource PLs has been getting a lot of attention as the use of large language models becomes ubiquitous [76]. The most common and straightforward approach to addressing the issue has been additional training on a dataset comprised of code written in the low-resource PL [74, 85, 86]. For example, van Dam, van der Heijden, de Bekker, Nieuwschepen, Otten & Izadi [85] perform a case study evaluating model performance and performance gains with additional training on the Haskell programming language, and as a result state the "need for more high-quality Haskell datasets". However, for a low-resource PL we as a rule do not have access to large amounts of high-quality supervision data.
In attempts to address this issue, some researchers have suggested generating large amounts of synthetic data in a high-resource programming language, and subsequently using LLMs to translate it into the target low-resource programming language [75]. In our work we also rely on synthetic data generation for improving performance on low-resource programming languages, but differently from Cassano, Gouwar, Lucchetti, Schlesinger, Anderson, Greenberg, et al. [75] we 1) use an order of magnitude less synthetic supervision data, 2) generate the synthetic data directly in the target programming language, further eliminating the need for additional resources to perform the translation, and 3) achieve a higher final performance on the two languages covered in both Cassano, Gouwar, Lucchetti, Schlesinger, Anderson, Greenberg, et al. [75] and our study (Julia and Lua), despite similar starting base model performance.
Besides being used for providing synthetic data, various LLMs have been successfully used for iterative refinement or self-refinement. Some past works have shown successful approaches relying on the generative model's capacity to provide feedback on its own or another model's generations for continual improvement [87, 88]. For code generation, approaches such as CodeRL have been proposed that train a dedicated critic network in an actor-critic setup to achieve improved code generation performance [89]. Another notable approach tries to teach the model to self-debug, in an attempt to recreate and mimic the rubber-duck debugging approach [90].
However, with these approaches the final performance is upper-bounded by the model's exposure to the target task, in our case code generation in a low-resource programming language. To address this issue, a number of past works have looked into iterative refinement based on external tools. Some notable works, for example, suggest iteratively refining the model's output by including feedback from the tools as part of the prompt, for the purpose of code generation, code editing, or fixing vulnerabilities [77, 81, 91, 92]. As opposed to these approaches, we do not utilize any external tool feedback at inference time; the improvement is achieved only through careful selection and generation of fine-tuning data. This means that our approach can be combined with these inference-time approaches to potentially further improve the final model performance.
4.3 Methodology
In this section, we describe our approach to synthetic data generation for improving performance on low-resource programming languages. We start with background information on code generation and knowledge distillation, followed by a discussion of the three steps in our approach: initial data generation, feedback-based refinement by the teacher model, and subsequent distillation of the refined synthetic data into a smaller student model.
4.3.1 Code Generation
A code generation problem p = (i, T) is defined by its instruction i, provided in natural language, and multiple test cases T. Given the instruction i, a neural model f can produce a code snippet c. When the code snippet c is executed without the test cases T, the execution can either succeed or fail. If it fails, the code snippet is considered to have a syntactic error, i.e. to be syntactically incorrect. If it succeeds, it is deemed syntactically correct.
Syntactically correct code snippets can then be executed against test cases T to check for functional correctness. If a syntactically correct snippet fails these tests, it is considered functionally
incorrect. If it passes, it is deemed functionally correct or simply correct.
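To make these definitions concrete, the following sketch labels a generated snippet by executing it first without and then together with its tests. It is only an illustration: the interpreter command, temporary-file layout, and timeout are assumptions, and in our experiments this bookkeeping is done by a modified BigCode Evaluation Harness (Section 4.3.4).

import subprocess
import tempfile

def _execute(code: str, command: list, timeout: int = 10):
    """Write the code to a temporary file and run it; return (succeeded, stderr)."""
    with tempfile.NamedTemporaryFile("w", suffix=".lua", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(command + [path], capture_output=True, text=True,
                              timeout=timeout)
        return proc.returncode == 0, proc.stderr
    except subprocess.TimeoutExpired:
        return False, "timeout"

def classify(snippet: str, tests: str, command=("lua",)):
    """Label a snippet as syntactically incorrect, functionally incorrect, or correct."""
    ok, err = _execute(snippet, list(command))                 # run without the tests T
    if not ok:
        return "syntactically incorrect", err
    ok, err = _execute(snippet + "\n" + tests, list(command))  # run against the tests T
    return ("correct", "") if ok else ("functionally incorrect", err)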
4.3.2 Distillation
Knowledge distillation is the process of transferring knowledge from a larger, teacher model (τ)
to a smaller student model (σ). Using the teacher model to produce a synthetic dataset for training
the student model is a type of knowledge distillation that has become very popular recently with the rise of very capable and very large language models [79, 80]. Our work stands out from previous research because our data refinement approach allows for a much smaller teacher model. In our experiments we demonstrate that a 1.3B student model trained on synthetic data from an 8B teacher model not only surpasses the original teacher model’s performance, both in zero-shot and in-context few-shot evaluation scenarios, but also that of frontier models such as GPT-4o.
Formally, given a set of code generation problems P = {p_1, p_2, ..., p_N}, the teacher model τ generates k samples of code snippets per problem; thus, for a given problem p_j we end up with a set of code snippets C_j = [c_j^1, c_j^2, ..., c_j^k]. These code snippets are combined with the instruction and test cases of the problem p_j to make a set of synthetic data points: D_j = [(i_j, c_j^1, T_j), ..., (i_j, c_j^k, T_j)]. The synthetically generated dataset D is the combination of all D_j, i.e. D = ∪_{j=1}^{N} D_j. The student model σ is then fine-tuned on D.
4.3.3 Dataset construction
For each low-resource programming language that we study, we use the teacher model to construct
a synthetic dataset. This process begins with a small seed dataset of a few hundred code generation
problems, i.e. instructions and functional tests, which are taken from the MultiPL-E dataset [82]. More details on MultiPL-E and the specifics of our seed dataset are discussed in Section 4.4.
We believe it is reasonable to assume access to such a seed dataset, since there are multiple
code problem datasets available in high-resource programming languages, such as Python, which
can be automatically translated into low-resource languages. In fact, the MultiPL-E dataset that we
use is translated into low-resource languages automatically. While our seed examples come from
natural code, synthetic code and test generation in Python could also be used [75, 78, 93], with test
cases automatically translated into the target programming language.
We want to highlight that, differently from past work that first generated entire datasets in Python and then used an LLM to translate them into the target low-resource PL [75], our approach only requires translation of the test cases for the seed dataset, making it more cost-effective by eliminating the need for an LLM to perform the translation. Additionally, since this translation is only performed for our seed dataset, we require fewer Python generations and fewer translations into the low-resource PL.
Using the teacher model τ we generate multiple code snippets for each problem in the seed dataset for every language. We discuss the generation procedure and hyperparameters in more detail in Section 4.4. As a result, we end up with a dataset of 5000 code snippets for every low-resource programming language, together with their corresponding instructions and tests. We will refer to this generated dataset as Dbase.
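A condensed sketch of this construction step is given below; teacher_generate is a placeholder for sampling from the teacher model τ with the decoding settings of Section 4.4, and the function and variable names are illustrative rather than taken from our implementation.

import random

def build_d_base(seed_problems, teacher_generate, samples_per_problem=15, target_size=5000):
    """Construct D_base: sample snippets per seed problem, deduplicate, then subsample."""
    candidates = []
    for instruction, tests in seed_problems:          # each seed problem is (instruction, tests)
        snippets = {teacher_generate(instruction) for _ in range(samples_per_problem)}
        candidates.extend((instruction, c, tests) for c in snippets)
    random.shuffle(candidates)
    return candidates[:target_size]                   # 5000 instances per language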
4.3.4 Data refinement
         Lua   Julia   Perl   Scala   Swift   Rust
Seed      397     390    396     396     396    354
Dbase    5000    5000   5000    5000    5000   5000
DSC       843    1845   1762    2638    3017   3508
DFC       360     988    464    1003    1105   1288
Dref1    1576    2149   1084    1427    1331   1639
Dref2    2082    2435   1365    1642    1487   1814
Dref3    2251    2562   1501    1739    1569   1923
Dref4    2346    2604   1608    1786    1633   2001
Dref5    2406    2635   1684    1825    1673   2074

Table 4.2: Sizes (in number of instances) of the different datasets that we experiment with in this work.
Recognizing that not all generated
code snippets are of high quality, we
implement a refinement process. To
evaluate code snippets, we modify
the BigCode Evaluation Harness [94]
to retrieve more detailed information
on execution outcomes.
First, for each language we consider filtering out from Dbase any data points that are syntactically incorrect, according to our definition of syntactic correctness in Section 4.3.1. The remaining data points make up the set DSC in the rest of the paper. We present statistics on the sizes of DSC for each language in Table 4.2. As can be seen, depending on the language, the number of syntactically correct instances varies from around 17% to almost 70% of Dbase.
Next, for each language we consider filtering out data points that are functionally incorrect according to our definition of functional correctness in Section 4.3.1, and refer to the remaining subset as DFC. Since functional correctness is a stricter requirement than syntactic correctness, it can be seen in Table 4.2 that the number of instances in DFC decreases by another 50% to 75% compared to DSC. However, in theory, the lower amount of supervision data might be offset by the higher quality of the data points.
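The two filtering steps amount to the following sketch, where classify is any routine returning the execution outcome of a snippet against its tests (for instance, the illustrative helper in Section 4.3.1); in our experiments this information comes from the modified BigCode Evaluation Harness.

def split_by_correctness(d_base, classify):
    """Filter D_base into the syntactically correct (DSC) and functionally correct (DFC) subsets."""
    d_sc, d_fc = [], []
    for instruction, snippet, tests in d_base:
        outcome, _error = classify(snippet, tests)
        if outcome != "syntactically incorrect":
            d_sc.append((instruction, snippet, tests))    # runs without the tests
        if outcome == "correct":
            d_fc.append((instruction, snippet, tests))    # also passes the tests
    return d_sc, d_fc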
Figure 4.3: Data distributions for the datasets that we generate in this work, showing the ratio of correct examples to syntactically and functionally incorrect examples. The orange bars indicate the change in the number of correct instances between the current iteration and the iteration prior to it. Due to the marginal increase in the number of correct examples, we only include versions up to Dref3 in our experiments (marked in bold), and omit Dref4 and Dref5.
Refinement with Feedback Loop
In addition to the filtering process described above, our main contribution is refining incorrect
examples instead of discarding them. An overview of this pipeline is shown in Figure 4.1.
In this feedback loop, incorrect code examples are sent back to the teacher model, along with
details about the specific errors they produced during compilation, execution, or interpretation.
Similar to the previous subsection, in order to obtain the errors we modify the BigCode Evaluation
Harness. The error details along with the coding problem instruction and the code snippet are fed
back to the teacher model.
The entire process is guided by prompts to the teacher model. The feedback loop allows the
teacher model to attempt producing improved versions of the initially incorrect examples. In our
experiments we differentiate between three types of errors - syntactic, functional, and timeout
errors. The first two were already defined and discussed in Section 4.3.1.
A timeout error occurs when a code snippet fails to compile and execute within a few seconds.
In our experiments, timeout errors account for < 1% of all execution outcomes. Empirically we
observed that the teacher model rarely resolved timeout errors successfully. On the other hand,
from manual inspection, we found the model’s explanations and fixes for syntactic and functional
errors to be more effective. Therefore, we do not discuss timeout errors further and focus on syntactic and functional errors. Figure 4.2 shows examples of instructions, code snippets, and error messages that were fed to the teacher model, as well as the refined outputs generated by the teacher model in response.
Formally, we perform this refinement process for the instances in Dbase \ DFC. From the newly generated instances, we filter out those that are functionally incorrect, and combine the remaining, newly functionally correct instances with DFC to make up a new set that we call Dref1.
We experiment with performing this process iteratively multiple times. During this iterative refinement, to obtain Dref(i+1) we try refining examples from Dbase \ Drefi, and add the new functionally correct instances to Drefi.
We present dataset statistics in Table 4.2 and statistics on errors in the datasets in Figure 4.3. It is notable that, depending on the programming language, the refinement process produces only marginal improvements in later iterations. For example, for every language Dref3 adds less than 10% of new instances compared to Dref2. For this reason, in the rest of our experiments we only consider datasets up to Dref3.
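A simplified sketch of one refinement pass is shown below. Here teacher_refine stands for prompting the teacher with the instruction, the failing snippet, and its error details, and classify returns the execution outcome; keying on the original snippet text is a simplification of the bookkeeping described above.

def refine_once(d_base, d_ref, teacher_refine, classify):
    """One feedback-loop pass: repair examples of D_base not yet in D_ref, keep fixes that pass."""
    d_ref_next = list(d_ref)
    resolved = {snippet for _instr, snippet, _tests in d_ref}
    for instruction, snippet, tests in d_base:
        if snippet in resolved:                       # already functionally correct or repaired
            continue
        _outcome, error = classify(snippet, tests)    # compilation / execution / test error details
        fixed = teacher_refine(instruction, snippet, error)
        if classify(fixed, tests)[0] == "correct":
            d_ref_next.append((instruction, fixed, tests))
            resolved.add(snippet)
    return d_ref_next

Applying this pass repeatedly, each time over the examples that are still not functionally correct, yields Dref1, Dref2, and so on.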
4.4 Experimental Setup
We choose two small, open-source, state-of-the-art models as our student and teacher models.
We use DeepSeekCoder-1.3B-Instruct [83] as our student model because its small size makes
finetuning easier and more cost-effective, while still being a state-of-the-art coding model among
open-source options of its size. For our teacher model, we choose a slightly larger, state-of-the-art
open-source model - Llama3-8B-Instruct [84].
We use the MultiPL-E dataset [82] in our experiments. It is a dataset translating two popular code
evaluation benchmarks - HumanEval [95] and MBPP [96], both originally in Python, into 18 other
languages. From the 18 languages available in MultiPL-E, we select six - Julia, Lua, Perl, Swift, Rust, and Scala - such that the first three are very low-resource or niche, with a 0.1% to 0.3% share on GitHub, and the other three have up to a 1.7% share on GitHub, as reported by Cassano, Gouwar, Nguyen, Nguyen, Phipps-Costin, Pinckney, et al. [82].
As our seed dataset for the synthetic data construction, we use the MultiPL-E MBPP datasets in the corresponding low-resource PL. These contain a little less than 400 code generation problems on average; the exact number of problems varies slightly from one language to another and is reported in Table 4.2. Correspondingly, we use MultiPL-E HumanEval as our test set during evaluation.
For every programming language, we repeatedly sample a problem from our seed dataset and generate a coding solution for it with the teacher model. On average we generate around 15 solutions per coding problem to allow a buffer for removing duplicate generations. From the resulting deduplicated solutions, we randomly choose 5000 samples, which make up Dbase for the corresponding language. For the original generation of Dbase we limit the generation length to 512 tokens, with a temperature parameter of 0.2 and a top-p parameter of 0.95. Since problems in both HumanEval and MBPP are small, we find that this number of tokens is sufficient for completing a code problem solution. For subsequent generations with refinement (Dref), we sample from the teacher model at most 2048 new tokens, with a temperature parameter of 0.6 and a top-p parameter of 0.9. We use an increased number of tokens at this stage to leave room for the textual explanations that the model produces. Before including the generated snippets in Dref, we postprocess them to remove any non-code content, and discard any generations for which this cannot be done automatically. Sizes of the resulting Dref datasets for the different programming languages are presented in Table 4.2.
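For concreteness, these decoding settings can be written down as follows, assuming a Hugging Face transformers-style generation interface; the model identifier and helper function are illustrative and not a prescription of our exact implementation.

from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"   # teacher model (illustrative identifier)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
teacher = AutoModelForCausalLM.from_pretrained(MODEL_ID)

# Decoding settings reported above.
BASE_GENERATION = dict(max_new_tokens=512, temperature=0.2, top_p=0.95, do_sample=True)
REFINE_GENERATION = dict(max_new_tokens=2048, temperature=0.6, top_p=0.9, do_sample=True)

def sample_from_teacher(prompt: str, settings: dict) -> str:
    """Sample one completion from the teacher model with the given decoding settings."""
    inputs = tokenizer(prompt, return_tensors="pt")
    output = teacher.generate(**inputs, **settings)
    return tokenizer.decode(output[0, inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)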
In all our experiments, we use the same experimental setup. We train the student model for 300 steps with a batch size of 8 and a learning rate of 1e-5. We use a linear learning rate scheduler with 15 warm-up steps. We randomly choose 10% of the code generation problems from our seed dataset as a validation set and use it for checkpoint selection. We also remove all code snippets for the code problems in the validation set from the finetuning dataset. The remaining code generation problems are used for training. We select the best student model checkpoint, as measured on the validation set, and evaluate that checkpoint on the test set.
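The same training configuration can be expressed, for example, with the Hugging Face TrainingArguments class; the hyperparameters below match the ones stated above, while the remaining arguments are illustrative defaults rather than our exact training script.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="student-finetune",
    max_steps=300,                      # 300 training steps
    per_device_train_batch_size=8,      # batch size of 8
    learning_rate=1e-5,
    lr_scheduler_type="linear",
    warmup_steps=15,                    # linear scheduler with 15 warm-up steps
    evaluation_strategy="steps",        # evaluate on the held-out 10% validation split
    eval_steps=50,
    save_strategy="steps",
    save_steps=50,
    load_best_model_at_end=True,        # checkpoint selection on the validation set
)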
For evaluation, we use the BigCode Evaluation Harness with the pass@1 metric [95]. It is designed to measure the likelihood that a coding model will produce a code snippet that passes the functional tests on its first attempt. To reduce variance and increase the reliability of the evaluation, it is customary to sample multiple code snippets and average their performance by measuring how many of them are functionally correct. We evaluate the student model's performance by sampling a solution for each coding problem 10 times. During evaluation, we limit the maximum number of generated tokens to 512, with a temperature of 0.2 and a top-p parameter of 0.95.
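Pass@k is commonly computed with the unbiased estimator of Chen et al. [95]; the short helper below shows the computation, which for k = 1 with n samples per problem reduces to the fraction of correct samples, averaged over all problems.

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n generated samples, c of them correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples per problem, 3 of them pass the tests -> pass@1 = 0.3
print(pass_at_k(n=10, c=3, k=1))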
4.4.1 Baselines
To evaluate the effectiveness of our proposed synthetic data generation approach, we include a variety of baseline approaches. Firstly, we evaluate the student and the teacher models’ performance
as is on low-resource programming languages.
Additionally, we evaluate both models' performance with in-context learning demonstrations at inference time (ICL). Since the MultiPL-E dataset does not contain ground-truth examples for us to use as demonstrations for in-context learning, we use our sampled, functionally correct examples from DFC. We evaluate the student and teacher models' performance when provided with 5 in-context demonstrations. For every language, we sample the five examples used as demonstrations randomly, and we average the performance over five random sets of in-context demonstration examples.
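A sketch of how such an ICL prompt can be assembled is shown below; the prompt template and field names are hypothetical, while the demonstrations are drawn from the functionally correct DFC examples as described above.

import random

def build_icl_prompt(demonstrations, instruction, k=5, seed=0):
    """Build a k-shot prompt from (instruction, solution) pairs sampled from D_FC."""
    rng = random.Random(seed)
    shots = rng.sample(demonstrations, k)
    parts = [f"### Instruction:\n{demo_instr}\n### Solution:\n{demo_code}\n"
             for demo_instr, demo_code in shots]
    parts.append(f"### Instruction:\n{instruction}\n### Solution:\n")
    return "\n".join(parts)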
Additionally, we evaluate the student model finetuned on the base synthetically generated dataset Dbase, since continual finetuning of the model on data in the same programming language is a simple way to improve its performance on that language.
Since not all data points in Dbase are syntactically or functionally correct, we also report how the student model performs when finetuned only on the syntactically or functionally correct subsets, DSC and DFC respectively. While the DFC subset contains the least data, it also contains the highest-quality data among these setups.

                      Julia   Lua   Perl   Swift   Rust   Scala   Avg
DSC-1.3B-Instr         0.19  0.30   0.21    0.26   0.28    0.25  0.25
DSC-1.3B-Instr ICL     0.13  0.36   0.23    0.33   0.32    0.26  0.27
Llama3-8B-Instr        0.25  0.36   0.26    0.34   0.34    0.40  0.33
Llama3-8B-Instr ICL    0.23  0.34   0.29    0.33   0.35    0.37  0.32
Dbase                  0.23  0.35   0.20    0.30   0.29    0.30  0.28
DSC                    0.25  0.36   0.21    0.32   0.31    0.35  0.30
DFC                    0.23  0.37   0.21    0.33   0.31    0.32  0.30
Dref1                  0.35  0.36   0.27    0.35   0.34    0.38  0.34

Table 4.3: Pass@1 performance for the baseline student and teacher models, the student and teacher models evaluated with in-context learning (ICL), and student models finetuned on the corresponding dataset, across 6 low-resource programming languages.
4.5 Experimental Results
4.5.1 Main Results
Our main results are presented in Table 4.3. We report the base performance of the DeepSeekCoder-1.3B-Instruct (DSC-1.3B-Instr) student model and the Llama3-8B-Instruct teacher model; the performance of both the student and teacher models evaluated with in-context learning through demonstrations (ICL); as well as the performance of the student model fine-tuned on different variants of synthetic data: Dbase, DSC, DFC, and Dref1. From our experiments it can be seen that using vanilla, synthetically generated low-resource PL data improves the performance of the student model. We also see that, on average, filtering synthetic data to leave only higher-quality instances with either DSC or DFC leads to some further improvement in performance. But the largest improvement is achieved when using the refined synthetic data Dref1: in this case, the performance of the smaller student model is on par with or better than that of the teacher model for all languages except Scala, for which we still see a very noticeable improvement over the base student model, although the final performance remains slightly below that of the teacher model.
        Julia   Lua   Perl   Swift   Rust   Scala   Avg
Dref2    0.39  0.39   0.30    0.32   0.37    0.37  0.36
Dref3    0.36  0.36   0.29    0.32   0.39    0.39  0.35

Table 4.4: Pass@1 performance for the student model finetuned on the corresponding dataset, across 6 low-resource programming languages.
4.5.2 Effect of iterative refinement
Besides performing the refinement with the feedback loop once, we look at how performance changes when the refinement is performed iteratively, a few times in a row. From Table 4.2, it can be seen that, on average, after the second or third iteration the number of total functionally correct examples increases only marginally.
The empirical results obtained with these datasets are presented in Table 4.4, where we also observe that the best average performance is obtained after the second iteration of the refinement. This suggests that, even with tool feedback, the benefits of iterative refinement saturate after a few iterations.
4.5.3 Effect of Using a Seed Dataset vs OSS for Synthetic Data Generation
When creating our synthetically generated datasets we made a design choice to start from a seed
dataset.
Data                   Swift   Rust   Avg
Magicoder OSS Instr     0.28   0.39  0.34
Dref1                   0.35   0.34  0.35
Dref2                   0.32   0.37  0.35

Table 4.5: Pass@1 performance for the student model finetuned on the corresponding dataset. For a fairer comparison, for Magicoder-OSS-Instruct we only used the data points in the specific language tested.
An alternative approach in the literature has been using open-source code snippets to seed the generation of the synthetic dataset [79]. While the full Magicoder OSS Instruct dataset contains 75K instances, only about 5K of those are in one of the two languages that we experimented with, Swift and Rust, so in our experiments we use just those data points. It is worth noting that the Magicoder dataset was generated using the GPT-3.5 Turbo model [97], which is a much more powerful model than the Llama3-8B-Instruct teacher model that we use. We finetune the same student model using these data points, and present the results in Table 4.5. As can be seen, despite Magicoder relying on a much more powerful teacher, on average our proposed data generation approach achieves slightly higher performance. It is worth noting that this performance is not consistent across the two languages: while for Swift our method achieves better performance, it is outperformed by Magicoder-OSS-Instruct on Rust.
4.6 Conclusion
In conclusion, our approach demonstrates the significant potential of leveraging external tool feedback for enhancing synthetic data generation and improving model performance on low-resource
and niche programming languages. By starting with a small seed dataset and refining the synthetic
data using feedback from compilation and execution steps, we show that even a modest setup with
smaller models can achieve superior results. Our method not only outperforms other baselines that
rely on synthetic data from powerful teacher models, but also surpasses frontier models like GPT-4o on specific niche languages. This underscores the effectiveness of our refinement process and
its ability to create efficient, specialized models for low-resource programming languages. As the
demand for applying LLMs to low-resource languages continues to grow, our approach provides a
viable path for developing powerful models tailored to these unique challenges.
4.7 Limitations
Firstly, while our approach shows sizeable improvement on MultiPL-E HumanEval, this dataset consists of small coding problems. Thus, it is uncertain how well this performance increase translates to more complex, real-world coding tasks.
Next, while we included six low-resource programming languages in our study and saw consistent improvement, there are countless other programming languages, so we can only speculate on how our approach will generalize to those. Additionally, it is unclear if this approach for low-resource languages can effectively extend to new and emerging languages.
Additionally, we focus on the code generation task; however, to be successful and useful, code models need to handle a wider variety of coding tasks. To confirm our approach's broader benefits, it should be tested on a wider range of coding tasks.
Finally, we do not re-evaluate the model's performance on other languages after finetuning for the target low-resource language. Some evidence suggests that this finetuning could reduce the model's performance in other languages or its ability to handle other coding tasks. As a countermeasure, finetuning on a larger dataset that includes more languages and desired behaviours could be beneficial.
Figure 4.2: Examples of original generations by the teacher model, Llama3-8B-Instruct, the (simplified) feedback given depending on the kind of error, and the revised generations by the teacher model. In one example, the feedback reads "This program results in Syntax error on line 4: 'value of tuple type (Int, Int, Int, Int, Int, Int) has no member count', update the program to fix that", and the refined generation changes the variable type. In the other, the feedback reads "This program results in Fatal error: 'Index out of range', update the program to fix that", and the refined generation adds error checks.
Chapter 5
Conclusion
In this dissertation, we looked at language models for code and their generalization abilities.
We made the following contributions:
• In Chapter 2, we presented a neuro-symbolic method for semantic code search. We proposed treating semantic code search as a compositional task, breaking it down using the semantic parse and performing multi-step reasoning. We implemented this paradigm with the help of neural module networks, and showed that this approach provides a large improvement compared to a number of semantic code search baselines. We additionally evaluated our proposed method for its sensitivity to changes in the query, as well as its capacity to handle compositional queries.
• In Chapter 3 we looked at the generalization properties of large language models for code and the generalization challenges that could stem from the hierarchical nature of software data. Specifically, we looked at three scenarios: generalization across companies, projects, and
project components. For these three scenarios we considered the following questions. First,
we empirically established that our defined scenarios constitute a distributional shift causing
performance drop for the coding models due to operating on out-of-domain data. At the
same time, we looked at how models perform on new domains, and to answer this question
we compared the model performance with and without domain adaptation.
Next, we considered in what ways we can improve the out-of-domain generalization of large language models for code. This question was motivated by our belief that it is impractical and sometimes impossible to obtain labeled in-domain data for every application. We investigated the use of labeled out-of-domain data and small amounts of unlabelled in-domain data in an attempt to enhance the model's generalization capacity. We evaluated and highlighted approaches that are particularly valuable for generalization with an extremely limited number of supervision examples.
Lastly, we looked at making code models more broadly applicable while retaining their generalization capacities, rather than adapting them to one new domain at a time. We found that whether such adaptation succeeds depends on the model: some coding models appeared to be sensitive in this scenario, so we concluded it was best to adapt those to every domain individually.
• In Chapter 4, we looked at the issue of generalization of large language models for code to
low-resource programming languages. Challenged by the lack of naturally occurring data
for supervision for low-resource programming languages and inspired by the prior work
demonstrating successful continual training of models for code on synthetic data, we looked
into using synthetic data to improve model performance. We combined this insight with the ready availability of software engineering tools for low-resource programming languages, and proposed to use such tools to generate higher-quality synthetic data for knowledge distillation.
With a 1.3B student model and 8B teacher model we improved the student’s performance
by an average of up to 0.11 pass@1 on the MultiPL-E HumanEval dataset. We matched
or exceeded GPT-4o’s performance in three out of six languages that we experimented with,
and surpassed the teacher model’s performance in five out of six cases, boosting the average
pass@1 score across all six languages. We also showed comparable or better results than a
student model finetuned with data from the more powerful GPT-3.5-Turbo teacher model,
proving that our approach can help create small yet powerful models designed to work with
low-resource programming languages.
There are still a number of challenges preventing the creation of models capable of independently
implementing software. Several key areas demand attention:
Development of Improved Evaluation Metrics One of the most significant challenges in code
processing research is the lack of reliable, automated evaluation metrics. Current automated metrics are often adapted from other natural language processing domains, such as machine translation. However, these metrics struggle to account for the inherent flexibility and subjectivity in
code, where multiple correct implementations can exist. Automatically evaluating code quality
is further complicated by the difficulty in assessing correctness, readability, maintainability, efficiency, or other aspects, which may vary in importance depending on the software system in
question. The evolving nature of codebases adds another layer of complexity, necessitating more
sophisticated evaluation tools that can realistically judge the performance of models and guide
their improvement.
Enhancing Planning Abilities in Models For models to move beyond generating isolated code
snippets and towards writing comprehensive software, we must look into developing planning
capabilities. Current models operate primarily within local contexts, lacking the high-level understanding necessary for working at the level of entire codebases. Future models should possess a
deep understanding of the application domain, its requirements, and the user base, enabling them to
make informed decisions about the broader impact of different parts of the code. Simple examples
of this can include anticipating the necessary changes across the codebase, such as modifications to
build scripts or tests, and predicting and handling the consequences of these changes. The ability
to operate effectively in remote contexts, beyond the immediate scope of a single function or a
single file, will also be crucial for models to plan and act autonomously.
Ensuring Robustness and Trustworthiness Trust in autonomous code generation models is
rooted in their robustness. Robustness encompasses multiple factors. One of the most critical is
resistance to adversarial attacks. Such attacks can occur during both the training and inference
stages, potentially leading to the introduction of harmful code into the codebase. To be considered
robust, the model must also avoid replicating vulnerable or harmful code from its training data
into new codebases. Other essential qualities for a robust model are stability and self-consistency.
Minor variations in input, such as a change in a variable name or a typo, should not drastically alter
the model’s output. Developing models that consistently produce reliable results despite minor
input changes is vital to make their deployment in real-world software development scenarios
reliable and trustworthy.
References
1. GitHub. GitHub Copilot. Your AI pair programmer (2021).
2. Roblox. Revolutionizing Creation on Roblox with Generative AI (2023).
3. Couchbase. Capella iQ lets you code at the speed of thought (2023).
4. Webflow. Bringing the power of AI to Webflow (2023).
5. IBM. AI-powered application modernization to enable enterprise agility (2023).
6. Katie Forster for Independent. Woman follows sat nav and drives straight into a lake
(2016).
7. Silas Valentino for SFGATE. Hawaii tourist follows GPS into harbor water. Again. (2023).
8. WBZ News Staff. Middleton DoorDash driver follows GPS into water while delivering
Dunkin’ order (2024).
9. Reiss, S. P. Semantics-based code search demonstration proposal in 25th IEEE
International Conference on Software Maintenance (ICSM 2009), September 20-26, 2009,
Edmonton, Alberta, Canada (IEEE Computer Society, 2009), 385–386.
doi:10.1109/ICSM.2009.5306319.
10. Lu, M., Sun, X., Wang, S., Lo, D. & Duan, Y. Query expansion via WordNet for effective
code search in 22nd IEEE International Conference on Software Analysis, Evolution, and
Reengineering, SANER 2015, Montreal, QC, Canada, March 2-6, 2015 (eds
Guéhéneuc, Y., Adams, B. & Serebrenik, A.) (IEEE Computer Society, 2015), 545–549.
doi:10.1109/SANER.2015.7081874.
11. Bull, R. I., Trevors, A., Malton, A. J. & Godfrey, M. W. Semantic Grep: Regular
Expressions + Relational Abstraction in 9th Working Conference on Reverse Engineering
(WCRE 2002), 28 October - 1 November 2002, Richmond, VA, USA (eds van Deursen, A.
& Burd, E.) (IEEE Computer Society, 2002), 267–276.
doi:10.1109/WCRE.2002.1173084.
12. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., et al.
Attention is All you Need in Advances in Neural Information Processing Systems 30:
Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017,
Long Beach, CA, USA (eds Guyon, I., von Luxburg, U., Bengio, S., Wallach, H. M.,
Fergus, R., Vishwanathan, S. V. N., et al.) (2017), 5998–6008.
13. Kanade, A., Maniatis, P., Balakrishnan, G. & Shi, K. Learning and Evaluating Contextual
Embedding of Source Code in Proceedings of the 37th International Conference on
Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event 119 (PMLR, 2020),
5110–5121.
14. Feng, Z., Guo, D., Tang, D., Duan, N., Feng, X., Gong, M., et al. CodeBERT: A
Pre-Trained Model for Programming and Natural Languages in Findings of the
Association for Computational Linguistics: EMNLP 2020, Online Event, 16-20 November
2020 (eds Cohn, T., He, Y. & Liu, Y.) EMNLP 2020 (Association for Computational
Linguistics, 2020), 1536–1547. doi:10.18653/v1/2020.findings-emnlp.139.
15. Du, L., Shi, X., Wang, Y., Shi, E., Han, S. & Zhang, D. Is a Single Model Enough?
MuCoS: A Multi-Model Ensemble Learning Approach for Semantic Code Search in CIKM
’21: The 30th ACM International Conference on Information and Knowledge
Management, Virtual Event, Queensland, Australia, November 1 - 5, 2021 (eds
Demartini, G., Zuccon, G., Culpepper, J. S., Huang, Z. & Tong, H.) (ACM, 2021),
2994–2998. doi:10.1145/3459637.3482127.
16. Guo, D., Ren, S., Lu, S., Feng, Z., Tang, D., Liu, S., et al. GraphCodeBERT: Pre-training
Code Representations with Data Flow in 9th International Conference on Learning
Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021 (OpenReview.net,
2021).
17. Liu, S., Xie, X., Ma, L., Siow, J. K. & Liu, Y. GraphSearchNet: Enhancing GNNs via
Capturing Global Dependency for Semantic Code Search. CoRR abs/2111.02671 (2021).
18. Andreas, J., Rohrbach, M., Darrell, T. & Klein, D. Neural Module Networks in 2016 IEEE
Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV,
USA, June 27-30, 2016 (IEEE Computer Society, 2016), 39–48.
doi:10.1109/CVPR.2016.12.
19. Husain, H., Wu, H.-H., Gazit, T., Allamanis, M. & Brockschmidt, M. CodeSearchNet
Challenge: Evaluating the State of Semantic Code Search. abs/1909.09436 (2019).
20. Huang, J., Tang, D., Shou, L., Gong, M., Xu, K., Jiang, D., et al. CoSQA: 20,000+ Web
Queries for Code Search and Question Answering in Proceedings of the 59th Annual
Meeting of the Association for Computational Linguistics and the 11th International Joint
Conference on Natural Language Processing (Volume 1: Long Papers), ACL 2021, Online,
August 1-6, 2021 (eds Zong, C., Xia, F., Li, W. & Navigli, R.) (Association for
Computational Linguistics, 2021), 5690–5700.
doi:10.18653/v1/2021.acl-long.442.
21. Ahmad, W. U., Chakraborty, S., Ray, B. & Chang, K. A Transformer-based Approach for
Source Code Summarization in Proceedings of the 58th Annual Meeting of the Association
for Computational Linguistics, ACL 2020, Online, July 5-10, 2020 (eds Jurafsky, D.,
Chai, J., Schluter, N. & Tetreault, J. R.) (Association for Computational Linguistics, 2020),
4998–5007. doi:10.18653/v1/2020.acl-main.449.
22. Choi, Y., Bak, J., Na, C. & Lee, J. Learning Sequential and Structural Information for
Source Code Summarization in Findings of the Association for Computational Linguistics:
ACL/IJCNLP 2021, Online Event, August 1-6, 2021 (eds Zong, C., Xia, F., Li, W. &
Navigli, R.) ACL/IJCNLP 2021 (Association for Computational Linguistics, 2021),
2842–2851. doi:10.18653/v1/2021.findings-acl.251.
23. Shi, E., Wang, Y., Du, L., Zhang, H., Han, S., Zhang, D., et al. CAST: Enhancing Code
Summarization with Hierarchical Splitting and Reconstruction of Abstract Syntax Trees in
Proceedings of the 2021 Conference on Empirical Methods in Natural Language
Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11
November, 2021 (eds Moens, M., Huang, X., Specia, L. & Yih, S. W.) (Association for
Computational Linguistics, 2021), 4053–4062.
doi:10.18653/v1/2021.emnlp-main.332.
24. Zügner, D., Kirschstein, T., Catasta, M., Leskovec, J. & Günnemann, S.
Language-Agnostic Representation Learning of Source Code from Structure and Context
in 9th International Conference on Learning Representations, ICLR 2021, Virtual Event,
Austria, May 3-7, 2021 (OpenReview.net, 2021).
25. Devlin, J., Chang, M., Lee, K. & Toutanova, K. BERT: Pre-training of Deep Bidirectional
Transformers for Language Understanding in Proceedings of the 2019 Conference of the
North American Chapter of the Association for Computational Linguistics: Human
Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019,
Volume 1 (Long and Short Papers) (eds Burstein, J., Doran, C. & Solorio, T.) (Association
for Computational Linguistics, 2019), 4171–4186. doi:10.18653/v1/n19-1423.
26. Alon, U., Zilberstein, M., Levy, O. & Yahav, E. code2vec: learning distributed
representations of code. Proc. ACM Program. Lang. 3, 40:1–40:29.
doi:10.1145/3290353 (2019).
27. Zhang, J., Wang, X., Zhang, H., Sun, H., Wang, K. & Liu, X. A novel neural source code
representation based on abstract syntax tree in Proceedings of the 41st International
Conference on Software Engineering, ICSE 2019, Montreal, QC, Canada, May 25-31,
2019 (eds Atlee, J. M., Bultan, T. & Whittle, J.) (IEEE / ACM, 2019), 783–794.
doi:10.1109/ICSE.2019.00086.
28. Gu, W., Li, Z., Gao, C., Wang, C., Zhang, H., Xu, Z., et al. CRaDLe: Deep code retrieval
based on semantic Dependency Learning. Neural Networks 141, 385–394.
doi:10.1016/j.neunet.2021.04.019 (2021).
29. Zhao, G. & Huang, J. DeepSim: deep learning code functional similarity in Proceedings of
the 2018 ACM Joint Meeting on European Software Engineering Conference and
Symposium on the Foundations of Software Engineering, ESEC/SIGSOFT FSE 2018, Lake
Buena Vista, FL, USA, November 04-09, 2018 (eds Leavens, G. T., Garcia, A. &
Pasareanu, C. S.) (ACM, 2018), 141–151. doi:10.1145/3236024.3236068.
30. Jain, P., Jain, A., Zhang, T., Abbeel, P., Gonzalez, J. & Stoica, I. Contrastive Code
Representation Learning in Proceedings of the 2021 Conference on Empirical Methods in
Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican
Republic, 7-11 November, 2021 (eds Moens, M., Huang, X., Specia, L. & Yih, S. W.)
(Association for Computational Linguistics, 2021), 5954–5971.
doi:10.18653/v1/2021.emnlp-main.482.
31. Bui, N. D. Q., Yu, Y. & Jiang, L. Self-Supervised Contrastive Learning for Code Retrieval
and Summarization via Semantic-Preserving Transformations in SIGIR ’21: The 44th
International ACM SIGIR Conference on Research and Development in Information
Retrieval, Virtual Event, Canada, July 11-15, 2021 (eds Diaz, F., Shah, C., Suel, T.,
Castells, P., Jones, R. & Sakai, T.) (ACM, 2021), 511–521.
doi:10.1145/3404835.3462840.
32. Gutmann, M. & Hyvärinen, A. Noise-contrastive estimation: A new estimation principle
for unnormalized statistical models in Proceedings of the Thirteenth International
Conference on Artificial Intelligence and Statistics, AISTATS 2010, Chia Laguna Resort,
Sardinia, Italy, May 13-15, 2010 (eds Teh, Y. W. & Titterington, D. M.) 9 (JMLR.org,
2010), 297–304.
33. Van den Oord, A., Li, Y. & Vinyals, O. Representation Learning with Contrastive
Predictive Coding. CoRR abs/1807.03748 (2018).
34. Zettlemoyer, L. S. & Collins, M. Learning to Map Sentences to Logical Form: Structured
Classification with Probabilistic Categorial Grammars. CoRR abs/1207.1420 (2012).
35. Artzi, Y., Lee, K. & Zettlemoyer, L. Broad-coverage CCG Semantic Parsing with AMR in
Proceedings of the 2015 Conference on Empirical Methods in Natural Language
Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015 (eds Màrquez, L.,
Callison-Burch, C., Su, J., Pighin, D. & Marton, Y.) (The Association for Computational
Linguistics, 2015), 1699–1710. doi:10.18653/v1/d15-1198.
36. Lu, S., Guo, D., Ren, S., Huang, J., Svyatkovskiy, A., Blanco, A., et al. CodeXGLUE: A
Machine Learning Benchmark Dataset for Code Understanding and Generation in
Thirty-fifth Conference on Neural Information Processing Systems Datasets and
Benchmarks Track (Round 1), Online, Dec 7-10, 2021 (OpenReview.net, 2021).
37. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., et al. RoBERTa: A Robustly
Optimized BERT Pretraining Approach. abs/1907.11692 (2019).
38. Turhan, B. On the dataset shift problem in software engineering prediction models. Empir.
Softw. Eng. 17, 62–74. doi:10.1007/s10664-011-9182-8 (2012).
39. Zimmermann, T., Nagappan, N., Gall, H. C., Giger, E. & Murphy, B. Cross-project defect
prediction: a large scale experiment on data vs. domain vs. process in Proceedings of the
7th joint meeting of the European Software Engineering Conference and the ACM
SIGSOFT International Symposium on Foundations of Software Engineering, 2009,
Amsterdam, The Netherlands, August 24-28, 2009 (eds van Vliet, H. & Issarny, V.) (ACM,
2009), 91–100. doi:10.1145/1595696.1595713.
40. Nie, P., Zhang, J., Li, J. J., Mooney, R. J. & Gligoric, M. Impact of Evaluation
Methodologies on Code Summarization in Proceedings of the 60th Annual Meeting of the
Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin,
Ireland, May 22-27, 2022 (eds Muresan, S., Nakov, P. & Villavicencio, A.) (Association
for Computational Linguistics, 2022), 4936–4960.
41. Li, Y., Chen, S. & Yang, W. Estimating Predictive Uncertainty Under Program Data
Distribution Shift. CoRR abs/2107.10989 (2021).
42. Hu, Q., Guo, Y., Xie, X., Cordy, M., Ma, L., Papadakis, M., et al. CodeS: A Distribution
Shift Benchmark Dataset for Source Code Learning. arXiv preprint arXiv:2206.05480
(2022).
43. Chen, M., Tworek, J., Jun, H., Yuan, Q., Ponde, H., Kaplan, J., et al. Evaluating Large
Language Models Trained on Code. ArXiv abs/2107.03374 (2021).
44. Wang, Y., Wang, W., Joty, S. R. & Hoi, S. C. H. CodeT5: Identifier-aware Unified
Pre-trained Encoder-Decoder Models for Code Understanding and Generation in
Proceedings of the 2021 Conference on Empirical Methods in Natural Language
Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11
November, 2021 (eds Moens, M., Huang, X., Specia, L. & Yih, S. W.) (Association for
Computational Linguistics, 2021), 8696–8708.
doi:10.18653/v1/2021.emnlp-main.685.
45. Ma, Y., Luo, G., Zeng, X. & Chen, A. Transfer learning for cross-company software defect
prediction. Inf. Softw. Technol. 54, 248–256. doi:10.1016/j.infsof.2011.09.007
(2012).
46. Li, Y., Xie, M. & Goh, T. N. A study of mutual information based feature selection for case
based reasoning in software cost estimation. Expert Syst. Appl. 36, 5921–5931.
doi:10.1016/j.eswa.2008.07.062 (2009).
47. Mair, C., Kadoda, G. F., Lefley, M., Phalp, K., Schofield, C., Shepperd, M. J., et al. An
investigation of machine learning based prediction systems. J. Syst. Softw. 53, 23–29.
doi:10.1016/S0164-1212(00)00005-4 (2000).
48. Papineni, K., Roukos, S., Ward, T. & Zhu, W.-J. Bleu: a Method for Automatic Evaluation
of Machine Translation in Proceedings of the 40th Annual Meeting of the Association for
Computational Linguistics (Association for Computational Linguistics, Philadelphia,
Pennsylvania, USA, 2002), 311–318. doi:10.3115/1073083.1073135.
49. Thrun, S. & Pratt, L. Y. in Learning to Learn (eds Thrun, S. & Pratt, L. Y.) 3–17 (Springer,
1998). doi:10.1007/978-1-4615-5529-2_1.
50. Vilalta, R. & Drissi, Y. A Perspective View and Survey of Meta-Learning. Artif. Intell. Rev.
18, 77–95. doi:10.1023/A:1019956318069 (2002).
51. Caruana, R. Algorithms and Applications for Multitask Learning in Machine Learning,
Proceedings of the Thirteenth International Conference (ICML ’96), Bari, Italy, July 3-6,
1996 (ed Saitta, L.) (Morgan Kaufmann, 1996), 87–95.
52. Angioni, D., Demetrio, L., Pintor, M. & Biggio, B. Robust Machine Learning for Malware
Detection over Time in Proceedings of the Italian Conference on Cybersecurity (ITASEC
2022), Rome, Italy, June 20-23, 2022 (eds Demetrescu, C. & Mei, A.) 3260
(CEUR-WS.org, 2022), 169–180.
53. Finn, C., Abbeel, P. & Levine, S. Model-Agnostic Meta-Learning for Fast Adaptation of
Deep Networks in International Conference on Machine Learning (2017).
54. Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., et al. LoRA: Low-Rank
Adaptation of Large Language Models. ArXiv abs/2106.09685 (2021).
55. Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., et al. Language
Models are Few-Shot Learners. ArXiv abs/2005.14165 (2020).
56. Liu, J., Shen, D., Zhang, Y., Dolan, B., Carin, L. & Chen, W. What Makes Good In-Context
Examples for GPT-3? in Workshop on Knowledge Extraction and Integration for Deep
Learning Architectures; Deep Learning Inside Out (2021).
57. Husain, H., Wu, H., Gazit, T., Allamanis, M. & Brockschmidt, M. CodeSearchNet
Challenge: Evaluating the State of Semantic Code Search. ArXiv abs/1909.09436 (2019).
58. Ren, S., Guo, D., Lu, S., Zhou, L., Liu, S., Tang, D., et al. CodeBLEU: a Method for
Automatic Evaluation of Code Synthesis. ArXiv abs/2009.10297 (2020).
59. Evtikhiev, M., Bogomolov, E., Sokolov, Y. & Bryksin, T. Out of the BLEU: how should we
assess quality of the Code Generation models? CoRR abs/2208.03133.
doi:10.48550/arXiv.2208.03133 (2022).
60. Popovic, M. chrF: character n-gram F-score for automatic MT evaluation in Proceedings
of the Tenth Workshop on Statistical Machine Translation, WMT@EMNLP 2015, 17-18
September 2015, Lisbon, Portugal (The Association for Computer Linguistics, 2015),
392–395. doi:10.18653/v1/w15-3049.
61. Lin, C.-Y. Rouge: A package for automatic evaluation of summaries in Text summarization
branches out (2004), 74–81.
62. Zhou, S., Alon, U., Agarwal, S. & Neubig, G. CodeBERTScore: Evaluating Code
Generation with Pretrained Models of Code. ArXiv abs/2302.05527 (2023).
63. Raffel, C., Shazeer, N. M., Roberts, A., Lee, K., Narang, S., Matena, M., et al. Exploring
the Limits of Transfer Learning with a Unified Text-to-Text Transformer. ArXiv
abs/1910.10683 (2019).
64. Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., et al. Training
language models to follow instructions with human feedback. ArXiv abs/2203.02155
(2022).
65. Christiano, P. F., Leike, J., Brown, T. B., Martic, M., Legg, S. & Amodei, D. Deep
Reinforcement Learning from Human Preferences. ArXiv abs/1706.03741 (2017).
66. Rudman, W., Gillman, N., Rayne, T. & Eickhoff, C. IsoScore: Measuring the Uniformity of
Embedding Space Utilization in Findings of the Association for Computational
Linguistics: ACL 2022, Dublin, Ireland, May 22-27, 2022 (eds Muresan, S., Nakov, P. &
Villavicencio, A.) (Association for Computational Linguistics, 2022), 3325–3339.
doi:10.18653/v1/2022.findings-acl.262.
67. Manh, D. N., Hai, N. L., Dau, A. T. V., Nguyen, A. M., Nghiem, K., Guo, J., et al. The
Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and
Generation. CoRR abs/2305.06156. doi:10.48550/arXiv.2305.06156 (2023).
68. Su, H., Kasai, J., Wu, C. H., Shi, W., Wang, T., Xin, J., et al. Selective Annotation Makes
Language Models Better Few-Shot Learners. CoRR abs/2209.01975.
doi:10.48550/arXiv.2209.01975 (2022).
69. Bezanson, J., Edelman, A., Karpinski, S. & Shah, V. B. Julia: A Fresh Approach to
Numerical Computing. SIAM Rev. 59, 65–98. doi:10.1137/141000671 (2017).
70. Redox OS. RedoxOS (2015).
71. Levy, A. A., Andersen, M. P., Campbell, B., Culler, D. E., Dutta, P., Ghena, B., et al.
Ownership is theft: experiences building an embedded OS in rust in Proceedings of the 8th
Workshop on Programming Languages and Operating Systems, PLOS 2015, Monterey,
California, USA, October 4, 2015 (ed Lu, S.) (ACM, 2015), 21–26.
doi:10.1145/2818302.2818306.
72. OpenAI. Hello GPT-4o (2024).
73. GitHut2.0. GitHut2.0 (2024).
74. Chen, F., Fard, F. H., Lo, D. & Bryksin, T. On the transferability of pre-trained language
models for low-resource programming languages in Proceedings of the 30th IEEE/ACM
International Conference on Program Comprehension, ICPC 2022, Virtual Event, May
16-17, 2022 (eds Rastogi, A., Tufano, R., Bavota, G., Arnaoudova, V. & Haiduc, S.)
(ACM, 2022), 401–412. doi:10.1145/3524610.3527917.
75. Cassano, F., Gouwar, J., Lucchetti, F., Schlesinger, C., Anderson, C. J., Greenberg, M.,
et al. Knowledge Transfer from High-Resource to Low-Resource Programming Languages
for Code LLMs. CoRR abs/2308.09895. doi:10.48550/ARXIV.2308.09895 (2023).
76. Mora, F., Wong, J., Lepe, H., Bhatia, S., Elmaaroufi, K., Varghese, G., et al. Synthetic
Programming Elicitation and Repair for Text-to-Code in Very Low-Resource
Programming Languages. arXiv preprint arXiv:2406.03636 (2024).
77. Wong, K., Amayuelas, A., Pan, L. & Wang, W. Y. DistiLRR: Transferring Code Repair for
Low-Resource Programming Languages. arXiv preprint arXiv:2406.14867 (2024).
78. Rozière, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X. E., et al. Code Llama:
Open Foundation Models for Code. CoRR abs/2308.12950.
doi:10.48550/ARXIV.2308.12950 (2023).
79. Wei, Y., Wang, Z., Liu, J., Ding, Y. & Zhang, L. Magicoder: Source Code Is All You Need.
CoRR abs/2312.02120. doi:10.48550/ARXIV.2312.02120 (2023).
80. Luo, Z., Xu, C., Zhao, P., Sun, Q., Geng, X., Hu, W., et al. WizardCoder: Empowering
Code Large Language Models with Evol-Instruct in The Twelfth International Conference
on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024
(OpenReview.net, 2024).
81. Gou, Z., Shao, Z., Gong, Y., Shen, Y., Yang, Y., Duan, N., et al. Critic: Large language
models can self-correct with tool-interactive critiquing. arXiv preprint arXiv:2305.11738
(2023).
82. Cassano, F., Gouwar, J., Nguyen, D., Nguyen, S., Phipps-Costin, L., Pinckney, D., et al.
MultiPL-E: a scalable and polyglot approach to benchmarking neural code generation.
IEEE Transactions on Software Engineering 49, 3675–3691 (2023).
83. Guo, D., Zhu, Q., Yang, D., Xie, Z., Dong, K., Zhang, W., et al. DeepSeek-Coder: When
the Large Language Model Meets Programming – The Rise of Code Intelligence 2024.
84. AI@Meta. Llama 3 Model Card (2024).
85. Van Dam, T., van der Heijden, F., de Bekker, P., Nieuwschepen, B., Otten, M. & Izadi, M.
Investigating the Performance of Language Models for Completing Code in Functional
Programming Languages: a Haskell Case Study in Proceedings of the 2024 IEEE/ACM
First International Conference on AI Foundation Models and Software Engineering,
FORGE 2024, Lisbon, Portugal, 14 April 2024 (eds Lo, D., Xia, X., Penta, M. D. &
Hu, X.) (ACM, 2024), 91–102. doi:10.1145/3650105.3652289.
86. Esmaeili, A., Saberi, I. & Fard, F. H. Empirical Studies of Parameter Efficient Methods for
Large Language Models of Code and Knowledge Transfer to R. CoRR abs/2405.01553.
doi:10.48550/ARXIV.2405.01553 (2024).
87. Paul, D., Ismayilzada, M., Peyrard, M., Borges, B., Bosselut, A., West, R., et al. REFINER:
Reasoning Feedback on Intermediate Representations in Proceedings of the 18th
Conference of the European Chapter of the Association for Computational Linguistics,
EACL 2024 - Volume 1: Long Papers, St. Julian’s, Malta, March 17-22, 2024 (eds
Graham, Y. & Purver, M.) (Association for Computational Linguistics, 2024), 1100–1126.
88. Peng, B., Galley, M., He, P., Cheng, H., Xie, Y., Hu, Y., et al. Check Your Facts and Try
Again: Improving Large Language Models with External Knowledge and Automated
Feedback. CoRR abs/2302.12813. doi:10.48550/ARXIV.2302.12813 (2023).
89. Le, H., Wang, Y., Gotmare, A. D., Savarese, S. & Hoi, S. C. CodeRL: Mastering Code
Generation through Pretrained Models and Deep Reinforcement Learning in Advances in
Neural Information Processing Systems 35: Annual Conference on Neural Information
Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 -
December 9, 2022 (eds Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K. &
Oh, A.) (2022).
90. Chen, X., Lin, M., Schärli, N. & Zhou, D. Teaching Large Language Models to
Self-Debug. CoRR abs/2304.05128. doi:10.48550/ARXIV.2304.05128 (2023).
91. Zhang, K., Li, Z., Li, J., Li, G. & Jin, Z. Self-Edit: Fault-Aware Code Editor for Code
Generation in Proceedings of the 61st Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July
9-14, 2023 (eds Rogers, A., Boyd-Graber, J. L. & Okazaki, N.) (Association for
Computational Linguistics, 2023), 769–787.
doi:10.18653/V1/2023.ACL-LONG.45.
92. Charalambous, Y., Tihanyi, N., Jain, R., Sun, Y., Ferrag, M. A. & Cordeiro, L. C. A New
Era in Software Security: Towards Self-Healing Software via Large Language Models and
Formal Verification. CoRR abs/2305.14752. doi:10.48550/ARXIV.2305.14752
(2023).
93. Chen, B., Zhang, F., Nguyen, A., Zan, D., Lin, Z., Lou, J., et al. CodeT: Code Generation
with Generated Tests in The Eleventh International Conference on Learning
Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023 (OpenReview.net, 2023).
94. Ben Allal, L., Muennighoff, N., Kumar Umapathi, L., Lipkin, B. & von Werra, L. A
framework for the evaluation of code generation models. https://github.com/bigcode-project/bigcode-evaluation-harness (2022).
95. Chen, M., Tworek, J., Jun, H., Yuan, Q., de Oliveira Pinto, H. P., Kaplan, J., et al.
Evaluating Large Language Models Trained on Code. CoRR abs/2107.03374 (2021).
96. Austin, J., Odena, A., Nye, M. I., Bosma, M., Michalewski, H., Dohan, D., et al. Program
Synthesis with Large Language Models. CoRR abs/2108.07732 (2021).
97. OpenAI. ChatGPT: Optimizing language models for dialogue (2022).
98. Yin, P., Deng, B., Chen, E., Vasilescu, B. & Neubig, G. Learning to Mine Aligned Code
and Natural Language Pairs from Stack Overflow in International Conference on Mining
Software Repositories (ACM, 2018), 476–486.
doi:10.1145/3196398.3196408.
99. Honnibal, M., Montani, I., Van Landeghem, S. & Boyd, A. spaCy: Industrial-strength
Natural Language Processing in Python. doi:10.5281/zenodo.1212303 (2020).
100. Chai, Y., Zhang, H., Shen, B. & Gu, X. Cross-Domain Deep Code Search with Few-Shot
Meta Learning. CoRR abs/2201.00150 (2022).
101. Wang, X., Wang, Y., Wan, Y., Wang, J., Zhou, P., Li, L., et al. CODE-MVP: Learning to
Represent Source Code from Multiple Views with Contrastive Pre-Training in Findings of
the Association for Computational Linguistics: NAACL 2022, Seattle, WA, United States,
July 10-15, 2022 (eds Carpuat, M., de Marneffe, M. & Ruíz, I. V. M.) (Association for
Computational Linguistics, 2022), 1066–1077.
doi:10.18653/V1/2022.FINDINGS-NAACL.80.
102. Guo, D., Lu, S., Duan, N., Wang, Y., Zhou, M. & Yin, J. UniXcoder: Unified Cross-Modal
Pre-training for Code Representation in Proceedings of the 60th Annual Meeting of the
Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin,
Ireland, May 22-27, 2022 (eds Muresan, S., Nakov, P. & Villavicencio, A.) (Association
for Computational Linguistics, 2022), 7212–7225.
103. Zhu, R., Yuan, L., Li, X., Gao, M. & Cai, W. A Neural Network Architecture for Program
Understanding Inspired by Human Behaviors in Proceedings of the 60th Annual Meeting
of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022,
Dublin, Ireland, May 22-27, 2022 (eds Muresan, S., Nakov, P. & Villavicencio, A.)
(Association for Computational Linguistics, 2022), 5142–5153.
104. Lu, S., Duan, N., Han, H., Guo, D., Hwang, S. & Svyatkovskiy, A. ReACC: A
Retrieval-Augmented Code Completion Framework in Proceedings of the 60th Annual
Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL
2022, Dublin, Ireland, May 22-27, 2022 (eds Muresan, S., Nakov, P. & Villavicencio, A.)
(Association for Computational Linguistics, 2022), 6227–6240.
105. Santoro, A., Bartunov, S., Botvinick, M. M., Wierstra, D. & Lillicrap, T. P. Meta-Learning
with Memory-Augmented Neural Networks in International Conference on Machine
Learning (2016).
106. Finn, C., Xu, K. & Levine, S. Probabilistic Model-Agnostic Meta-Learning in Neural
Information Processing Systems (2018).
107. Antoniou, A., Edwards, H. & Storkey, A. J. How to train your MAML. ArXiv
abs/1810.09502 (2018).
108. Vinyals, O., Blundell, C., Lillicrap, T., Kavukcuoglu, K. & Wierstra, D. Matching
Networks for One Shot Learning 2017.
109. Snell, J., Swersky, K. & Zemel, R. S. Prototypical Networks for Few-shot Learning. ArXiv
abs/1703.05175 (2017).
110. Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P. H. S. & Hospedales, T. M. Learning to
Compare: Relation Network for Few-Shot Learning. 2018 IEEE/CVF Conference on
Computer Vision and Pattern Recognition, 1199–1208 (2017).
111. Koch, G. R. Siamese Neural Networks for One-Shot Image Recognition in (2015).
112. Yang, Y. & Hospedales, T. M. Deep Multi-task Representation Learning: A Tensor
Factorisation Approach. ArXiv abs/1605.06391 (2016).
113. Caruana, R. Multitask Learning. Machine Learning 28, 41–75 (1997).
114. Meyerson, E. & Miikkulainen, R. Modular Universal Reparameterization: Deep Multi-task
Learning Across Diverse Domains. ArXiv abs/1906.00097 (2019).
115. Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., de Laroussilhe, Q., Gesmundo, A.,
et al. Parameter-Efficient Transfer Learning for NLP in International Conference on
Machine Learning (2019).
116. Bapna, A., Arivazhagan, N. & Firat, O. Simple, Scalable Adaptation for Neural Machine
Translation in Conference on Empirical Methods in Natural Language Processing (2019).
117. Lester, B., Al-Rfou, R. & Constant, N. The Power of Scale for Parameter-Efficient Prompt
Tuning. ArXiv abs/2104.08691 (2021).
118. Li, X. L. & Liang, P. Prefix-Tuning: Optimizing Continuous Prompts for Generation.
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics
and the 11th International Joint Conference on Natural Language Processing (Volume 1:
Long Papers) abs/2101.00190 (2021).
119. Das, R., Zaheer, M., Thai, D. N., Godbole, A., Perez, E., Lee, J. Y., et al. Case-based
Reasoning for Natural Language Queries over Knowledge Bases in Conference on
Empirical Methods in Natural Language Processing (2021).
120. Shin, R., Lin, C. H., Thomson, S., Chen, C. C., Roy, S., Platanios, E. A., et al. Constrained
Language Models Yield Few-Shot Semantic Parsers. ArXiv abs/2104.08768 (2021).
Appendices
A Chapter 2: Experiment Settings
A.1 Evaluation Metrics
(1) MRR evaluates a list of code snippets. The reciprocal rank for MRR is computed as 1/rank, where rank is the position of the correct code snippet when all code snippets are ordered by their predicted similarity to the sample query. (2) P@K is the proportion of queries for which the correct snippet is among the top-K snippets closest to the given query. For each query, if the correct code snippet is among the first K retrieved code snippets, P@K=1; otherwise it is 0.
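As a small illustration of the two metrics, assuming the 1-based rank of the correct snippet is already known for every query:

def mean_reciprocal_rank(ranks):
    """MRR: average of 1/rank over all queries; `ranks` holds the 1-based rank of the correct snippet."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def precision_at_k(ranks, k):
    """P@K: fraction of queries whose correct snippet is among the top-K retrieved snippets."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

# Example: three queries where the correct snippet was ranked 1st, 3rd, and 12th.
ranks = [1, 3, 12]
print(mean_reciprocal_rank(ranks), precision_at_k(ranks, k=5))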
A.2 Parsing
We build on top of the NLTK Python package for our implementation of the CCG parser. In an attempt to parse as much of the datasets as possible, we preprocessed the queries by removing preceding question words (e.g. "How to"), punctuation marks, and some specific words and phrases, e.g. those that specify a programming language or version, such as "in Python" and "Python 2.7". For a number of entries in the CSN dataset which only consisted of a noun or a noun phrase, we appended a Load verb to make it a valid sentence, assuming that it was implied, so that, for example, "video page" became "Load video page". This had an adverse effect in cases of noisy examples, where the docstring did not specify the intention or functionality of the function and only said "wrapper", for example. The final dataset statistics before and after parsing are presented in Table 5.1; a short sketch of this preprocessing follows the table.
Dataset
Parsable Full
Train Valid Test Train Valid Test
CodeSearchNet 162801 8841 8905 412178 23107 22176
CoSQA 14210 - - 20,604 - -
WebQueryTest - - 662 - - 1,046
Table 5.1: Dataset statistics before and after parsing.
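Returning to the query preprocessing described above, a minimal sketch of that kind of cleanup is shown below. The prefix list, the language regular expression, and the verb list here are illustrative stand-ins; the actual pipeline uses longer lists and more patterns:

import re

QUESTION_PREFIXES = ("how to ", "how do i ", "how can i ")   # illustrative subset
LANGUAGE_PHRASES = re.compile(r"\b(in\s+)?python(\s*[23](\.\d+)?)?\b", re.IGNORECASE)
KNOWN_VERBS = {"load", "get", "read", "download", "create", "remove"}  # illustrative subset

def preprocess_query(query: str) -> str:
    q = query.strip().lower()
    for prefix in QUESTION_PREFIXES:              # drop leading question words
        if q.startswith(prefix):
            q = q[len(prefix):]
            break
    q = LANGUAGE_PHRASES.sub(" ", q)              # drop language / version mentions
    q = re.sub(r"[^\w\s]", " ", q)                # drop punctuation
    q = re.sub(r"\s+", " ", q).strip()
    if q and q.split()[0] not in KNOWN_VERBS:     # bare noun phrase -> add implied verb
        q = "load " + q
    return q

print(preprocess_query("How to download a video page in Python 2.7?"))  # "download a video page"
print(preprocess_query("video page"))                                   # "load video page"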
A.3 Failed parses
As mentioned before, we encountered many noisy examples, and here we provide samples of examples that could not be parsed. These include cases where the docstring contains URLs, is not in English, consists of multiple sentences, or has code in it, which is often either the signature of the function or a usage example. Specific samples of queries that we could not parse are included in Table 5.3.
A.4 Parser generalization to new datasets
In order to evaluate how robust our parser is when challenged with new datasets, we evaluated its success rate on a number of additional datasets containing both Python code and code in other languages. Specifically, for Python we used the CoNaLa dataset [98], taking the entirety of its manually collected data and 200K samples from the automatically mined portion. Additionally, we attempted to parse queries for 5 other programming languages: Go, Java, JavaScript, PHP, and Ruby. For those, we evaluated the parser on 90K samples per language, taken from the CodeSearchNet training split. The data statistics and evaluation results are reported in Table 5.2. As can be seen, the parser successfully parses at least 62% of the Python data and at least 32% of the data for other languages. Among the new languages, our parser is most successful on PHP and JavaScript, achieving 43% and 41% success rates respectively.
Language Dataset Original Size Parser Success Rate
Python CoNaLa auto-mined 200000 0.62
Python CoNaLa manual train 2379 0.65
Python CoNaLa manual test 500 0.63
Go CodeSearchNet 90000 0.32
Java CodeSearchNet 90000 0.33
Javascript CodeSearchNet 90000 0.41
PHP CodeSearchNet 90000 0.43
Ruby CodeSearchNet 90000 0.35
Table 5.2: Parser’s success rate on unseen datasets
Examples that could not be parsed:
URL:
  From http://cdn37.atwikiimg.com/sitescript/pub/dksitescript/FC2.site.js
Signature:
  :param media id:
  :param self: bot
  :param text: text of message
  :param user ids: list of user ids for creating group or one user id for send to one person
  :param thread id: thread id
Multi-sentence:
  Assumed called on Travis, to prepare a package to be deployed
  This method prints on stdout for Travis.
  Return is obj to pass to sys.exit() directly
Noisy:
  bandwidths are inaccurate, as we don't account for parallel transfers here
Table 5.3: Example queries that were excluded due to query parsing errors.
B Chapter 2: Entity Discovery Module
To generate noisy supervision labels for the entity discovery module, we used the spaCy library [99] for labelling through regex matching, and Python's ast (Abstract Syntax Trees) library for the static analysis labels. For the former we included the following labels: dict, list, tuple, int, file, enum, string, directory and boolean. The static analysis output labels were the following: List, List Comprehension, Generator Expression, Dict, Dict Comprehension, Set, Set Comprehension, Bool Operator, Bytes, String and Tuple. The full source code for the noisy supervision labelling procedure is available in the supplementary materials.
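As a rough illustration of how the static-analysis labels can be obtained with the ast module, a simplified sketch is shown below; the released labelling code covers more node types as well as the regex-based labels, and the example snippet here is made up:

import ast

# Map AST node types to a subset of the label names listed above
NODE_LABELS = {
    ast.List: "List",
    ast.ListComp: "List Comprehension",
    ast.GeneratorExp: "Generator Expression",
    ast.Dict: "Dict",
    ast.DictComp: "Dict Comprehension",
    ast.Set: "Set",
    ast.SetComp: "Set Comprehension",
    ast.BoolOp: "Bool Operator",
    ast.Tuple: "Tuple",
}

def static_analysis_labels(source: str) -> set:
    """Collect noisy entity labels by walking the AST of a code snippet."""
    labels = set()
    for node in ast.walk(ast.parse(source)):
        for node_type, label in NODE_LABELS.items():
            if isinstance(node, node_type):
                labels.add(label)
        if isinstance(node, ast.Constant):      # literal constants: bytes / str
            if isinstance(node.value, bytes):
                labels.add("Bytes")
            elif isinstance(node.value, str):
                labels.add("String")
    return labels

snippet = "def f(xs):\n    return {x: str(x) for x in xs if x or len(xs)}"
print(static_analysis_labels(snippet))   # e.g. {'Dict Comprehension', 'Bool Operator'}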
C Chapter 2: Additional Experiments
C.1 Unseen Entities and Actions
We wanted to see how well different models adapt to new entities and actions that were not seen during training. To that end, we measured the performance of the models broken down by the number of unseen entities (from 0 to 3+) and unseen actions (0 or 1) in the query. The results are presented in Figure 5.1. It can be seen that NS3 is very sensitive to unseen terms, whereas CodeBERT's performance remains largely unchanged.
(a) Unseen actions (MRR): NS3 0.92 (0 unseen) vs. 0.85 (1 unseen); CodeBERT 0.87 vs. 0.883.
(b) Unseen entities (MRR): NS3 0.93 / 0.94 / 0.90 / 0.56 for 0 / 1 / 2 / 3+ unseen; CodeBERT 0.87 / 0.88 / 0.88 / 0.86.
Figure 5.1: Performance of CodeBERT and NS3 models broken down by the number of unseen entities or actions in the test queries. Evaluated on the CSN test set.
C.2 Times an Entity or an Action Was Seen
In addition to the previous experiment, we measured performance broken down by how many times an entity or an action verb was seen during training. The results are reported in Figure 5.2. For actions, the performance roughly follows a bell curve: it increases for verbs that were seen only a few times, while very frequent actions (e.g. load and get) are probably too generic and not specific enough to be informative. For entities, performance is only affected when none of the entities in the query has been seen; this is understandable, as in these cases the action modules get no information to go by, so the result is also poor. The CodeBERT model, in both scenarios, performs roughly the same regardless of the number of times an action or an entity was seen.
Figure 5.2: Performance of CodeBERT and NS3 models, broken down by the number of times (a) an entity or (b) an action was seen during training. Evaluated on the CSN test set.
C.3 Evaluation on Parsable and Unparsable Queries
To understand whether there is a significant bias between the samples that we could parse and those that we could not, we performed an additional experiment on the full test set of CoSQA. The results are reported in Table 5.4. In this evaluation, NS3 falls back to CodeBERT for examples that could not be parsed. While there is some difference in performance, the overall trend remains the same as before.
Method: MRR / P@1 / P@3 / P@5 (CoSQA full test set)
CodeBERT: 0.29 / 0.152 / 0.312 / 0.444
GraphCodeBERT: 0.367 / 0.2 / 0.447 / 0.561
NS3: 0.412 / 0.298 / 0.452 / 0.535
Table 5.4: Mean Reciprocal Rank (MRR) and Precision@1/@3/@5 (higher is better) for different methods trained on the CoSQA dataset, evaluated on the full test set, i.e. including both parsable and unparsable examples.
D Chapter 2: Additional Examples
Figure 5.3 contains more illustrations of the output scores of the action and entity discovery modules captured at different stages of training. The queries shown here are the same, but this time
they are evaluated on different functions.
Figure 5.3: The leftmost column shows the output scores of the entity discovery module after pre-training, for the entity in the query. The middle column shows the scores after completing the end-to-end training. The rightmost column shows the scores of the action module. Darker highlighting indicates a higher score.
Staged execution demonstration
In the next example we demonstrate multi-step reasoning. We consider the query "Construct point record by reading points from stream". Turned into a semantic parse, this query is represented as:
ACTION(Construct, (None, point record),(BY, ACTION(Read, (None, points), (FROM, stream))))
After the processing, this query would be broken down into two parts:
1. ACTION(Construct, (None, point record)), and
2. ACTION(Read, (FROM, stream), (None, points))
For the full query to be satisfied, both parts must be satisfied. Figure 5.4 shows the outputs of the entity (Figure 5.4a) and action (Figure 5.4b) modules obtained for the first part of the query, and Figure 5.5 shows the outputs for the second part. If we were to replace the second sub-query with a different one, so that its parse is
ACTION(Remove, (IN, stream), (None, points)),
that would not affect the outputs of the entity modules, but it would affect the output of the action module, as shown in Figure 5.6. The final prediction for the modified query would be 0.08, compared to 0.94 for the original query.
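The decomposition above can be pictured with a small recursive structure, sketched below. The class and function names here are hypothetical and do not reflect the actual module implementation:

from dataclasses import dataclass
from typing import Optional

@dataclass
class Action:
    verb: str
    args: list                          # list of (preposition, entity) pairs
    sub: Optional["Action"] = None      # nested action attached via a preposition such as BY

def flatten(action: Action) -> list:
    """Break a nested parse into the flat list of sub-queries that must all be satisfied."""
    parts = [Action(action.verb, action.args)]
    if action.sub is not None:
        parts.extend(flatten(action.sub))
    return parts

query = Action("Construct", [(None, "point record")],
               sub=Action("Read", [(None, "points"), ("FROM", "stream")]))
for part in flatten(query):
    print(part.verb, part.args)
# Construct [(None, 'point record')]
# Read [(None, 'points'), ('FROM', 'stream')]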
E Chapter 2: Related Work
Chai, Zhang, Shen & Gu [100] propose expanding CodeBERT with MAML to perform cross-language transfer for code search. They study the setting where models are trained on some languages and then fine-tuned for code search on unseen languages.
Figure 5.4: Outputs of the action and entity modules on the query ACTION(Construct, (None, point record)). Panels: (a) entity outputs, (b) action outputs.
Figure 5.5: Outputs of the action and entity modules on the query ACTION(Read, (FROM, stream), (None, points)). Panels: (a) entity outputs, (b) action outputs.
Figure 5.6: Outputs of the action module on the modified query ACTION(Remove, (IN, stream), (None, points)).
Wang, Wang, Wan, Wang, Zhou, Li, et al. [101] propose combining token-wise analysis, AST processing, neural graph networks, and contrastive learning over code perturbations in a single model. Their experiments demonstrate that this combination improves over models that use only a subset of those features, illustrating that the individual features are complementary. In a similar manner, Guo, Lu, Duan, Wang, Zhou & Yin [102] propose combining sequence-based reasoning with AST-based reasoning, using a contrastive pretraining objective for a transformer over the serialized AST.
Additionally, both Zhu, Yuan, Li, Gao & Cai [103] and Lu, Duan, Han, Guo, Hwang & Svyatkovskiy [104] propose solutions closely inspired by human engineers' behavior. Zhu, Yuan, Li, Gao & Cai [103] propose a bottom-up compositional approach to code understanding, arguing that engineers go from understanding individual statements, to lines, to blocks, and finally to functions; they implement this by iteratively obtaining representations for program sub-graphs and combining them into larger sub-graphs. Lu, Duan, Han, Guo, Hwang & Svyatkovskiy [104], on the other hand, propose retrieving code context for the purpose of code retrieval, inspired by the human behavior of copying code from related code snippets.
Language: JavaScript
Keywords: await, break, case, catch, class, const, continue, debugger, default, delete, do, else, enum, export, extends, false, finally, for, function, if, implements, import, in, instanceof, interface, let, new, null, package, private, protected, public, return, super, switch, static, this, throw, try, true, typeof, var, void, while, with, yield
Table 5.5: Keywords used for the CodeBLEU evaluation.
Code generation (chrF): columns are folder / repo / org, each at 8-, 16-, and 32-shot.
CodeT5 FT ID:          19.36 20.92 21.95 | 20.42 22.44 24.47 | 19.29 20.73 22.6
CodeT5 LoRA ID:        20.05 21.66 22.56 | 20.81 23.12 24.52 | 20.08 21.28 22.99
CodeT5 FT random:      17.61 18.03 17.94 | 16.92 17.50 17.59 | 16.47 17.46 17.85
CodeT5 LoRA random:    17.87 18.02 17.81 | 17.45 17.15 17.63 | 17.24 17.13 17.29
Codex ICL ID:          28.78 - - | 31.05 - - | 29.19 - -
Codex ICL random:      20.62 - - | 20.87 - - | 21.10 - -
Codex instr. only (0-shot): (10.24) - - | (10.60) - - | (10.25) - -
Table 5.6: Comparison of model performance for code generation on in-domain (ID) vs out-of-domain (random) test data. Reported metric is chrF (higher is better).
F Chapter 3: Javascript Keywords
The JavaScript keywords that we included in the CodeBLEU implementation for evaluation are listed in Table 5.5.
G Chapter 3: Extended Background
G.1 Meta-learning and Multi-task-learning
Code generation (RougeL): columns are folder / repo / org, each at 8-, 16-, and 32-shot.
CodeT5 FT ID:          14.15 15.84 16.73 | 14.93 16.98 19.19 | 13.75 14.93 16.94
CodeT5 LoRA ID:        14.49 16.58 17.87 | 15.47 17.69 19.60 | 14.10 15.48 17.61
CodeT5 FT random:      11.34 11.62 11.73 | 9.91 10.10 10.32 | 9.49 10.20 10.68
CodeT5 LoRA random:    11.45 12.05 12.58 | 10.09 10.04 11.08 | 10.15 10.30 11.15
Codex ICL ID:          23.70 - - | 24.62 - - | 22.58 - -
Codex ICL random:      15.76 - - | 15.67 - - | 15.81 - -
Codex instr. only (0-shot): (6.44) - - | (6.50) - - | (6.18) - -
Table 5.7: Comparison of model performance for code generation on in-domain (ID) vs out-of-domain (random) test data. Reported metric is RougeL (higher is better).
Code generation (CodeBERTScore F1 / F3): columns are folder / repo / org, each at 8-, 16-, and 32-shot.
CodeT5 FT ID:          0.68/0.68 0.69/0.68 0.69/0.69 | 0.69/0.69 0.69/0.68 0.68/0.67 | 0.69/0.67 0.69/0.68 0.70/0.69
CodeT5 LoRA ID:        0.68/0.67 0.69/0.68 0.70/0.69 | 0.69/0.68 0.70/0.70 0.71/0.71 | 0.69/0.68 0.69/0.68 0.71/0.69
CodeT5 FT random:      0.65/0.66 0.66/0.66 0.66/0.66 | 0.66/0.65 0.66/0.66 0.66/0.66 | 0.65/0.65 0.65/0.65 0.65/0.65
CodeT5 LoRA random:    0.65/0.65 0.65/0.65 0.66/0.66 | 0.66/0.66 0.65/0.65 0.66/0.66 | 0.65/0.65 0.65/0.65 0.66/0.66
Codex ICL ID:          0.74/0.72 - - | 0.75/0.73 - - | 0.74/0.72 - -
Codex ICL random:      0.69/0.67 - - | 0.70/0.68 - - | 0.69/0.67 - -
Codex instr. only (0-shot): 0.62/0.61 - - | 0.63/0.62 - - | 0.63/0.62 - -
Table 5.8: Comparison of model performance for code generation on in-domain (ID) vs out-of-domain (random) test data. Reported metric in each cell is CodeBERTScore F1 (left) and CodeBERTScore F3 (right); higher is better.
Meta-learning focuses on adapting knowledge gained from previous tasks so it can be applied to new tasks with limited training examples. Most meta-learning algorithms fall into three groups: 1) black-box approaches [105] train a black-box model that takes in the training data of a target task and outputs the parameters of the network used to make predictions for that task; 2) optimization-based methods [53, 106, 107] use gradient descent to learn model parameters that can be adapted to a future target task with a few gradient steps on a few-shot training dataset; 3) non-parametric methods [108–111] learn a metric space in which predictions can be made by computing a similarity metric, such as distance or cosine similarity, to representations of each class. In our work, we use the MAML [53] approach, a gradient-based method that learns a model initialization (i.e., initial parameters) amenable to fast fine-tuning with few instances. MAML is conceptually simple, model-agnostic, and has been shown to outperform existing approaches on several tasks.
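As a point of reference, the core of a first-order MAML-style meta-update can be sketched as follows. This is schematic PyTorch: task sampling, second-order terms, and the actual CodeT5 training loop are omitted, and all names here are illustrative rather than our implementation:

import copy
import torch

def fomaml_step(model, tasks, loss_fn, inner_lr=1e-3, meta_lr=1e-4, inner_steps=1):
    """One first-order MAML meta-update over a batch of tasks.
    Each task is a (support_batch, query_batch) pair of (inputs, targets) tensors."""
    meta_grads = [torch.zeros_like(p) for p in model.parameters()]
    for support, query in tasks:
        adapted = copy.deepcopy(model)                      # start from the shared initialization
        opt = torch.optim.SGD(adapted.parameters(), lr=inner_lr)
        for _ in range(inner_steps):                        # inner loop: adapt on the support set
            opt.zero_grad()
            loss_fn(adapted(support[0]), support[1]).backward()
            opt.step()
        adapted.zero_grad()
        loss_fn(adapted(query[0]), query[1]).backward()     # outer loss on the query set
        for g, p in zip(meta_grads, adapted.parameters()):  # first-order approximation:
            g += p.grad                                     # reuse query-set gradients directly
    with torch.no_grad():                                   # apply the averaged meta-gradient
        for p, g in zip(model.parameters(), meta_grads):
            p -= meta_lr * g / len(tasks)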
Multi-task learning (MTL) aims to jointly learn several related tasks, providing a generalized representation with the added benefit of savings in compute and memory through shared model parameters [112–114]. MTL also has a regularization effect on the model parameters. By definition, MTL aims to solve a fixed number of known tasks, whereas the point of meta-learning is often to solve unseen future tasks; both methods, however, capture a good prior from the training tasks, which can be used to obtain model parameters for future target tasks. In our work, we experimented with both MAML and multi-task learning to check which method gives us a better prior for few-shot performance in our setting.
G.2 Few-shot Methods
Parameter-efficient finetuning: Conventional fine-tuning retrains all model parameters for every new task, which becomes infeasible as model size grows to the level of GPT-3. Parameter-efficient fine-tuning (PEFT) methods have since been shown to match the performance of full fine-tuning while updating only a tiny fraction of the parameters. Adapters [63, 115, 116] were introduced first: new feed-forward modules added between the layers of the frozen pre-trained model. Since then, more sophisticated PEFT methods have been proposed, including LoRA, which learns low-rank weight updates [54], and prompt tuning [117] and prefix-tuning [118], which concatenate learned continuous embeddings to the model's input or activations to induce it to perform a task.
Retrieval-based example selection: Liu, Shen, Zhang, Dolan, Carin & Chen [56] explored how different prompts impact the performance of GPT-3 and found that the choice of in-context examples has a significant influence on downstream results. They used an unsupervised sentence encoder to encode training examples and retrieved the nearest neighbors for each test instance. Similarly, Das, Zaheer, Thai, Godbole, Perez, Lee, et al. [119] developed a supervised prompt retriever for answering knowledge-based questions; their method used supervision tailored to knowledge-based queries and relied on surface similarity between formal queries. Furthermore, Shin, Lin, Thomson, Chen, Roy, Platanios, et al. [120] employed GPT-3 itself to select prompt examples for few-shot semantic parsing, which improved the overall performance of the system.
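A minimal sketch of the nearest-neighbour retrieval step follows; the random arrays stand in for sentence-encoder embeddings and are purely illustrative:

import numpy as np

def select_icl_examples(test_emb, train_embs, k=8):
    """Return indices of the k training examples most similar to a test embedding."""
    train = train_embs / np.linalg.norm(train_embs, axis=1, keepdims=True)
    query = test_emb / np.linalg.norm(test_emb)
    sims = train @ query                      # cosine similarity to every training example
    return np.argsort(-sims)[:k]              # indices of the k nearest neighbours

# Toy usage with random "embeddings" standing in for sentence-encoder outputs
rng = np.random.default_rng(0)
train_embs = rng.normal(size=(1000, 256))
test_emb = rng.normal(size=256)
print(select_icl_examples(test_emb, train_embs, k=4))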
Figure 5.7: Each dot signifies a domain. Average pairwise similarities of examples within each
domain (x axis) plotted against average similarities of that domain to all other domains (y axis).
H Chapter 3: Domain split visualization
To better understand how the different domain splits differ from each other, we visualize our resulting test domains in Figure 5.7. We plot each domain as a dot, where different colors correspond to different splits. The x axis shows the average pairwise similarity of examples within a domain, i.e., the x coordinate of a domain reflects how uniform its examples are. The y axis shows the average similarity of a domain's examples to examples in all other domains, i.e., the y coordinate of a domain reflects its similarity to other domains. From the figure we see that the vast majority of domains are clustered in the lower right corner, which corresponds to domains that are uniform and dissimilar to other domains. A small handful of domains are located in the upper left corner, which corresponds to domains whose own examples are dissimilar to each other but more similar to other domains. Notably, the upper left corner contains more folders than repos, and more repos than orgs. We hypothesize that this distribution could be explained by functional, rather than hierarchical, similarities across domains. A clear example would be a folder of utility functions: it can have high similarity to other folders of utility functions, while the individual functions within it implement different utilities and are thus dissimilar to each other.
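The two coordinates plotted in Figure 5.7 can be computed roughly as below, using cosine similarities over example embeddings. The function and variable names, and the random toy data, are placeholders rather than our actual analysis code:

import numpy as np

def domain_coordinates(embeddings_by_domain):
    """For each domain, return (intra-domain similarity, similarity to all other domains)."""
    normed = {d: e / np.linalg.norm(e, axis=1, keepdims=True)
              for d, e in embeddings_by_domain.items()}
    coords = {}
    for d, e in normed.items():
        intra = e @ e.T                                        # pairwise cosine sims within the domain
        x = (intra.sum() - len(e)) / (len(e) * (len(e) - 1))   # exclude self-similarity on the diagonal
        others = np.vstack([v for k, v in normed.items() if k != d])
        y = float((e @ others.T).mean())                       # average similarity to other domains
        coords[d] = (float(x), y)
    return coords

rng = np.random.default_rng(0)
domains = {f"repo_{i}": rng.normal(size=(20, 64)) for i in range(5)}   # toy embeddings
print(domain_coordinates(domains)["repo_0"])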
I Chapter 3: Models
CodeT5: CodeT5 [44] is a pretrained encoder-decoder transformer model based on T5 [63] for
programming languages. It uses a unified framework to support code understanding and generation
tasks seamlessly. To improve the model’s ability to handle the unique characteristics of programming languages, CodeT5 is trained on an identifier-aware pretraining task. Additionally, the model
is trained to exploit user-written code comments with a bimodal dual-generation task for better
alignment between natural language and programming languages. This makes this model suitable
for the applications that we consider. For both of our applications, we used the CodeT5-large
model [89] without making any changes to the model architecture.
Codex: Codex [43] is a language model for code released by OpenAI. It is a GPT language model fine-tuned on 54 million public software repositories hosted on GitHub, containing 179 GB of unique Python files under 1 MB each. Very large language models are capable of zero-shot generalization to unseen tasks when provided with an instruction describing what the model is expected to do. This allowed us to evaluate Codex for both code generation and code summarization without any need for training.
ChatGPT: ChatGPT is a conversational variant derived from the InstructGPT/GPT-3.5 model [64]. It features a dialogue interface and is trained using a more refined objective, Reinforcement Learning from Human Feedback (RLHF) [65]. However, limited information is currently available regarding the specific architecture and training data used to create ChatGPT. We access ChatGPT through OpenAI's GPT-3.5 Turbo API for our experiments; this API version has a maximum context length of 4096 tokens.
J Chapter 3: Hyperparameters and training details
For full finetuning of CodeT5, we updated the model for 500 steps with a batch size of 8; the best model was identified by performance on the τdev portion. For LoRA, we used a rank of 4 with an initialization scale of 0.01 and updated all the attention and feedforward layers, training for 1000 steps with a batch size of 8.
For multitask learning (MTL) of CodeT5, we updated the model for 150K steps on 80% of the Xtrain data with a batch size of 4. The best checkpoint was selected by evaluating the model on the remaining 20% of Xtrain, which was held out from training. For dual-gen MTL, we followed the same train/dev division strategy as for MTL, and updated the model for 150K steps with a batch size of 4. The best checkpoints were again chosen by evaluating the model on the created development set; in particular, we selected two checkpoints, one according to the CodeBLEU metric for code generation and one according to the BLEU metric for code summarization. For model-agnostic meta-learning (MAML), we updated the model from the pretrained CodeT5 checkpoint for 10K steps and used the last checkpoint in our experiments.
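As a rough illustration of the LoRA parameterization referenced above (rank 4, initialization scale 0.01), a minimal sketch is shown below. In our experiments we applied an existing LoRA implementation to CodeT5's attention and feedforward projections rather than this toy wrapper, so the class and argument names here are illustrative only:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Add a trainable low-rank update B @ A on top of a frozen linear layer."""
    def __init__(self, base: nn.Linear, rank: int = 4, init_scale: float = 0.01):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # freeze the pretrained weights
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * init_scale)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # update starts at zero

    def forward(self, x):
        return self.base(x) + x @ self.A.T @ self.B.T

layer = LoRALinear(nn.Linear(512, 512))
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # only A and B are trainable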
K Chapter 3: The Vault
The Vault is a multilingual dataset extracted from GitHub. Although it comes pre-tokenized, we noticed that some of The Vault's preprocessing differs from that of CodeSearchNet. For example, while a CodeSearchNet function body may contain inline comments, The Vault's functions are stripped of those. On the other hand, The Vault's docstrings typically include function parameter documentation, whereas CodeSearchNet omits it. On average, CodeSearchNet function docstrings are also shorter than those in The Vault. In our work, we processed The Vault dataset to fix these inconsistencies and make the new data points consistent with data from CodeSearchNet.
L Chapter 3: Additional experimental results
Besides the experiments presented in the main text, in this section we report some additional results. Tables 5.6, 5.7 and 5.8 report results for code generation measured with the chrF, RougeL and CodeBERTScore metrics, respectively.
Additionally, Figure 5.8 illustrates how the LoRA parameter-efficient fine-tuning method compares to full-model fine-tuning for CodeT5.
Figure 5.8: Performance for CodeT5 model finetuned with LoRA compared to regular finetuning.
Columns: Code Summarization (BLEU) and Code Generation (CodeBLEU), each for org / repo / folder.
IsoScore (4):   16.71 16.57 15.47 | 15.05 16.01 14.93
IsoScore (8):   17.27 16.72 15.71 | 15.32 16.55 15.28
IsoScore (32):  17.46 16.90 14.34 | 16.13 17.89 16.26
Table 5.9: Results for the CodeT5 model using IsoScore for measuring embedding similarity and supervising with retrieved examples from the train data.
M Chapter 3: IsoScore
IsoScore is a metric of the isotropy of an embedding space. We use it to measure similarity by computing the IsoScore of the set of test example embeddings combined with each individual training example embedding. The "closest" examples selected for supervision are those that yield the largest IsoScore for a given set of test examples. We then use the same number of supervision examples as with cosine similarity, selecting 4*32, 8*32, or 32*32 "closest" examples for supervision. The results for the model adapted using IsoScore similarity are reported in Table 5.9.
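A sketch of the selection loop is shown below; isoscore stands in for an existing IsoScore implementation, which we do not reproduce here, and the function name is a placeholder:

import numpy as np

def select_by_isoscore(test_embs, train_embs, isoscore, n_select):
    """Rank training examples by the IsoScore of {test embeddings + that single example}."""
    scores = []
    for i, emb in enumerate(train_embs):
        combined = np.vstack([test_embs, emb[None, :]])   # test set plus one training example
        scores.append((isoscore(combined), i))
    top = sorted(scores, reverse=True)[:n_select]         # largest IsoScore = "closest"
    return [i for _, i in top]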
M.1 Fast vote-k
To make the setup for fast vote-k comparable to the nearest-examples version, we run this algorithm to select 4*32 (128), 8*32 (256), and 32*32 (1024) supervision examples. Table 5.10 shows the results for a CodeT5 MTL model that has additionally been fine-tuned on a set of examples selected by the fast vote-k algorithm.
Columns: Code Summarization (BLEU) and Code Generation (CodeBLEU), each for org / repo / folder.
Fast vote-k (4):   10.96 12.34 10.33 | 24.96 25.76 24.77
Fast vote-k (8):   11.40 12.74 10.60 | 25.10 26.21 25.14
Fast vote-k (32):  10.84 12.03 10.06 | 24.25 25.06 24.17
Table 5.10: Results for the CodeT5 model using fast vote-k for selecting supervision examples retrieved from the train data.
N Chapter 3: Instructions for Codex and ChatGPT
Table 5.11 contains list of instructions we used with Codex and ChatGPT models in instructiononly and in-context learning scenarios.
Codex
  Task instructions:
    "Write in javascript:", "Write code:", "Summarize code:", "Summarize javascript snippet:", "Write code intent:"
  Demonstration example templates:
    "Intent: {text} \n Snippet: {code}\n\n"
    "Intent: {text} \n Code: {code}\n\n"
    "Code: {code} \n Intent: {text}\n\n"
    "Code: {code} \n Summary: {text}\n\n"
    "Snippet: {code} \n Intent: {text}\n\n"
    "Snippet: {code} \n Summary: {text}\n\n"
ChatGPT
  System messages:
    "You are a helpful assistant that writes JavaScript code based on English description. You only output code without any English text."
    "You are a helpful assistant that writes single sentence summaries for JavaScript code in English. You only output code summary without any other English text."
  Task instructions:
    "Write a single sentence summary for the following JavaScript code in English."
    "Implement this functionality using JavaScript."
  Demonstration example templates:
    ["Below are some examples of JavaScript code implemented based on English summary. \n", "Summary: {text}\nCode: {code}\n\n"]
    ["Below are some examples of English summaries of JavaScript code. \n", "Code: {code}\nSummary: {text}\n\n"]
Table 5.11: Task instructions and demonstration templates used for generating results in the experiments with Codex and ChatGPT.
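For clarity, a demonstration template like the ones above is filled once per example and concatenated with the task instruction into the final prompt, roughly as sketched below. The helper name and the toy example are illustrative; the actual formatting code lives in our experiment scripts:

def build_prompt(instruction, template, demonstrations, test_code):
    """Assemble an in-context-learning prompt from an instruction, a template and k demos."""
    demo_block = "".join(template.format(code=d["code"], text=d["text"])
                         for d in demonstrations)
    # The test example is appended with an empty summary slot for the model to complete
    return instruction + "\n" + demo_block + template.format(code=test_code, text="").rstrip()

template = "Code: {code} \nSummary: {text}\n\n"
demos = [{"code": "function add(a, b) { return a + b; }", "text": "Add two numbers."}]
print(build_prompt("Summarize javascript snippet:", template, demos,
                   "function neg(x) { return -x; }"))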
O Chapter 3: Sample outputs
Tables 5.12-5.16 present some examples and the outputs obtained by different models. Here we can see that the CodeT5 model fine-tuned on in-domain examples sometimes has the advantage of relevant context and therefore uses correct member names, unlike the other models. On the other hand, we also see that similar out-of-domain examples retrieved from the train split can in fact be near duplicates of examples in the test split; as a result, the model supervised with retrieved examples may generate output that is extremely close to the gold test data.
Table 5.12: Sample outputs from different models (columns: Input, Gold, CodeT5 MTL (0-shot), CodeT5 MTL + ID (32-shot), CodeT5 MTL + ret 4, ChatGPT); example: "Dispatch stack information to all handlers".
Table 5.13: Sample outputs from different models (same columns); example: "Setup captions".
Table 5.14: Sample outputs from different models (same columns); example: "Toggle event listener".
Table 5.15: Sample outputs from different models (same columns); example: "Returns the absolute path to the class file".
Table 5.16: Sample outputs from different models (same columns); example: "Returns the tag name of the given library in the given contrib repository if installed. Returns false if not installed."
Abstract
Successful deployment of any AI model requires generalization to previously unseen, real-world scenarios. Lack of generalization can lead to outcomes ranging from reduced performance to potential legal liabilities. In this thesis, I explore generalization challenges in large language models for code processing, covering three generalization concerns that such models can exhibit, and I present my progress in building models that can overcome them. First, I explore compositional generalization issues in code: I propose a model that learns to represent and recognize individual instructions in code and can subsequently generalize to new, unseen combinations of those instructions. Next, I look at the issue of out-of-domain generalization; specifically, I study how distribution shifts within software projects or between different corporations affect model performance, and I measure the effectiveness of different methods for overcoming this issue. Lastly, I look at the performance drop observed when language models evaluated on widespread programming languages are compared with those evaluated on languages with fewer resources, and I propose a synthetic data generation and distillation method to help improve language model performance on low-resource programming languages.