A Systematic Evaluation of Large Language Models of Code (MAPS 2022 - The 6th Annual Symposium on Machine Programming)

Who

Frank F. Xu, Uri Alon, Graham Neubig, Vincent J. Hellendoorn

Track

MAPS 2022

Time Zone

The program is currently displayed in (GMT-07:00) Pacific Time (US & Canada).

The GMT offsets shown reflect the offsets at the moment of the conference.

When

Mon 13 Jun 2022 14:30 - 14:45 at Boardroom - Afternoon Chair(s): Charles Sutton
Tue 14 Jun 2022 02:30 - 02:45 at Boardroom - Afternoon

Abstract

Large language models (LMs) of code have recently shown tremendous promise in completing code and synthesizing code from natural language descriptions. However, the current state-of-the-art code LMs (e.g., Codex (Chen et al., 2021)) are not publicly available, leaving many questions about their model and data design decisions. We aim to fill in some of these blanks through a systematic evaluation of the largest existing models: Codex, GPT-J, GPT-Neo, GPT-NeoX-20B, and CodeParrot, across various programming languages. Although Codex itself is not open-source, we find that existing open-source models do achieve close results in some programming languages, despite being targeted mainly at natural language modeling. We further identify an important missing piece in the form of a large open-source model trained exclusively on a multi-lingual corpus of code. We release a new model, PolyCoder, with 2.7B parameters based on the GPT-2 architecture, that was trained on 249GB of code across 12 programming languages on a single machine. In the C programming language, PolyCoder outperforms all models including Codex. Our trained models are open-source and publicly available, which enables future research and application in this area.

Frank F. Xu

Carnegie Mellon University

Uri Alon

Carnegie Mellon University

United States

Graham Neubig

Carnegie Mellon University

Vincent J. Hellendoorn

Carnegie Mellon University

United States

Session Program

Mon 13 Jun
Displayed time zone: Pacific Time (US & Canada)

13:30 - 15:00	Afternoon (MAPS) at Boardroom, +12h. Chair(s): Charles Sutton (Google Research)

13:30 45m Keynote		Can Transformers Code? (virtual, MAPS) Łukasz Kaiser (OpenAI)
14:15 15m Talk		Syntax-Guided Program Reduction for Understanding Neural Code Intelligence Models (virtual, MAPS) Rafiqul Rabin (University of Houston), Aftab Hussain (University of Houston), Amin Alipour (University of Houston)
14:30 15m Talk		A Systematic Evaluation of Large Language Models of Code (virtual, MAPS) Frank F. Xu (Carnegie Mellon University), Uri Alon (Carnegie Mellon University), Graham Neubig (Carnegie Mellon University), Vincent J. Hellendoorn (Carnegie Mellon University)
14:45 15m Poster		Poster Session MAPS

Tue 14 Jun
Displayed time zone: Pacific Time (US & Canada)

01:30 - 03:00	Afternoon (MAPS) at Boardroom

01:30 45m Keynote		Can Transformers Code? (virtual, MAPS) Łukasz Kaiser (OpenAI)
02:15 15m Talk		Syntax-Guided Program Reduction for Understanding Neural Code Intelligence Models (virtual, MAPS) Rafiqul Rabin (University of Houston), Aftab Hussain (University of Houston), Amin Alipour (University of Houston)
02:30 15m Talk		A Systematic Evaluation of Large Language Models of Code (virtual, MAPS) Frank F. Xu (Carnegie Mellon University), Uri Alon (Carnegie Mellon University), Graham Neubig (Carnegie Mellon University), Vincent J. Hellendoorn (Carnegie Mellon University)
02:45 15m Poster		Poster Session MAPS