 
Open Access

Themis: Training Robust Multilingual Code Reward Models for Flexible Multi-Criteria Scoring


Date

2026-05-02

Type

Journal Title

Journal ISSN

Volume Title

Publisher

Abstract

Description

Themis-CodeRewardBench is a code-specific reward-model evaluation benchmark comprising ~8.9k diverse code preference pairs across eight programming languages and five quality scoring dimensions (the accompanying code repository is available at https://github.com/iNeil77/Themis). It is part of the Themis project and evaluates code reward models on five code quality dimensions, Functional Correctness (FC), Execution Efficiency (EE), Memory Efficiency (ME), Readability & Maintainability (R&M), and Security Hardness (SH), across eight programming languages: C, C#, C++, Go, Java, JavaScript, Python, and Ruby. The benchmark uses preference accuracy as its evaluation metric. It draws from 13 distinct pre-existing and newly constructed code preference datasets, spanning human-written, LLM-generated, and mixed-provenance prompts and responses, and it introduces a largely novel distribution of code preferences, with more complex code than the code subsets of existing RM benchmarks.

Key differentiators:
- Evaluates across 5 quality dimensions, not just functional correctness
- Covers 8 programming languages, not just Python
- Includes human-written code from real commits, not only contest/synthetic code
- Introduces a novel distribution of code preferences with increased code complexity compared to existing RM benchmarks
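
To illustrate the evaluation metric, the following minimal Python sketch shows how preference accuracy could be computed over a set of code preference pairs. The field names ("prompt", "chosen", "rejected") and the scoring callable are illustrative assumptions, not the benchmark's actual data schema or API; consult the linked repository for the real evaluation code.

    # Minimal sketch of preference accuracy, assuming each pair stores a prompt,
    # a preferred ("chosen") response, and a dispreferred ("rejected") response.
    # The `score` callable stands in for a reward model and is hypothetical.
    from typing import Callable, Dict, List

    def preference_accuracy(pairs: List[Dict[str, str]],
                            score: Callable[[str, str], float]) -> float:
        """Fraction of pairs where the reward model scores the chosen
        response strictly higher than the rejected one."""
        if not pairs:
            return 0.0
        correct = 0
        for pair in pairs:
            chosen_score = score(pair["prompt"], pair["chosen"])
            rejected_score = score(pair["prompt"], pair["rejected"])
            if chosen_score > rejected_score:
                correct += 1
        return correct / len(pairs)

    # Usage (hypothetical reward model wrapper):
    # acc = preference_accuracy(pairs, score=my_reward_model.score)

Under this reading, per-dimension and per-language results would simply be preference accuracy computed on the corresponding subset of pairs.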

Citation

Endorsement

Project(s)

Faculty

Collections

License

Except where otherwise noted, this item is licensed under the Apache 2.0 License.