Themis: Training Robust Multilingual Code Reward Models for Flexible Multi-Criteria Scoring
Date
2026-05-02
Description
Themis-CodeRewardBench is a code-specific reward model evaluation benchmark comprising ~8.9k diverse code preference pairs (accompanying code repository: https://github.com/iNeil77/Themis). Part of the Themis project, it evaluates code reward models on five code quality dimensions: Functional Correctness (FC), Execution Efficiency (EE), Memory Efficiency (ME), Readability & Maintainability (R&M), and Security Hardness (SH). It spans eight programming languages: C, C#, C++, Go, Java, JavaScript, Python, and Ruby.
The benchmark uses preference accuracy as its evaluation metric. It draws from 13 distinct pre-existing and newly constructed code preference datasets, spanning human-written, LLM-generated, and mixed-provenance prompts and responses. Compared to the code subsets of existing RM benchmarks, it introduces a largely novel distribution of code preferences over code of greater complexity.
Key differentiators:
- Evaluates across 5 quality dimensions, not just functional correctness
- Covers 8 programming languages, not just Python
- Includes human-written code from real commits, not only contest/synthetic code
- Introduces a novel distribution of code preferences with increased code complexity compared to existing RM benchmarks
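The preference-accuracy metric described above can be sketched in a few lines: a reward model passes a preference pair when it assigns the preferred (chosen) response a higher score than the rejected one, and accuracy is the fraction of pairs passed. The pair format and the toy scoring function below are illustrative assumptions, not the Themis implementation.

```python
def preference_accuracy(pairs, score):
    """Fraction of (chosen, rejected) pairs where the model scores chosen higher.

    pairs: iterable of (chosen, rejected) response strings
    score: callable mapping a response to a scalar reward
    """
    pairs = list(pairs)
    hits = sum(1 for chosen, rejected in pairs if score(chosen) > score(rejected))
    return hits / len(pairs)


# Toy example: a stand-in "reward model" that simply prefers shorter code.
pairs = [
    ("def add(a, b):\n    return a + b",
     "def add(a, b):\n    s = a\n    s += b\n    return s"),
    ("x = [i * i for i in range(10)]",
     "x = []\nfor i in range(10):\n    x.append(i * i)"),
]
score = lambda code: -len(code)
print(preference_accuracy(pairs, score))  # both shorter responses win -> 1.0
```

A real evaluation would replace the toy scorer with the reward model under test and report accuracy per dimension (FC, EE, ME, R&M, SH) and per language.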
License
Except where otherwise noted, this license is described as Apache 2.0 License
