Themis: Training Robust Multilingual Code Reward Models for Flexible Multi-Criteria Scoring
| dc.contributor.author | Paul, Indraneil | |
| dc.contributor.author | Gurevych, Iryna | |
| dc.contributor.author | Glavaš, Goran | |
| dc.date.accessioned | 2026-05-02T16:46:05Z | |
| dc.date.created | 2026-05-02 | |
| dc.date.issued | 2026-05-02 | |
| dc.description | Themis-CodeRewardBench is a code-specific reward model evaluation benchmark comprising ~8.9k diverse code preference pairs. It is part of the Themis project (accompanying code repository: https://github.com/iNeil77/Themis). The benchmark evaluates code reward models on five code quality dimensions: Functional Correctness (FC), Execution Efficiency (EE), Memory Efficiency (ME), Readability & Maintainability (R&M), and Security Hardness (SH). It covers eight programming languages: C, C#, C++, Go, Java, JavaScript, Python, and Ruby. The evaluation metric is preference accuracy. The pairs are drawn from 13 distinct pre-existing and newly constructed code preference datasets, spanning human-written, LLM-generated, and mixed-provenance prompts and responses. Key differentiators: (1) evaluates five quality dimensions, not just functional correctness; (2) covers eight programming languages, not just Python; (3) includes human-written code from real commits, not only contest or synthetic code; (4) introduces a largely novel distribution of code preferences, with more complex code than the code subsets of existing RM benchmarks. | |
| dc.description.version | v1.0 | |
| dc.identifier.uri | https://tudatalib.ulb.tu-darmstadt.de/handle/tudatalib/5115 | |
| dc.language.iso | en | |
| dc.rights | Apache 2.0 License | |
| dc.rights.license | other | |
| dc.rights.uri | https://www.apache.org/licenses/LICENSE-2.0 | |
| dc.subject | reward modelling, code evaluation, benchmark | |
| dc.subject.classification | 4.43-04 | |
| dc.subject.ddc | 004 | |
| dc.title | Themis: Training Robust Multilingual Code Reward Models for Flexible Multi-Criteria Scoring | |
| dc.type | Text | |
| dcterms.accessRights | openAccess | |
| person.identifier.orcid | | |
| person.identifier.orcid | 0000-0003-2187-7621 | |
| person.identifier.orcid | 0000-0002-1301-6314 | |
| tuda.agreements | true | |
| tuda.unit | TUDa | |
Files
Original bundle
| Name | Description | Size | Format | |
|---|---|---|---|---|
| Themis-CodeRewardBench.jsonl | | 60.68 MB | Unknown data format | |
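
The description names preference accuracy as the evaluation metric, and the file is distributed as JSON Lines. The following is a minimal sketch of how that metric could be computed per quality dimension, not the project's own evaluation code. The field names `prompt`, `chosen`, `rejected`, and `dimension` are assumptions made for illustration and may not match the actual schema of the file; the `score` callable is a hypothetical stand-in for a reward model. Counting ties as incorrect is likewise an assumption.

```python
import json
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator


def read_jsonl(path: str) -> Iterator[dict]:
    """Yield one record per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


def preference_accuracy(
    pairs: Iterable[dict],
    score: Callable[[str, str], float],
    group_key: str = "dimension",  # assumed field name, not confirmed by the record
) -> Dict[str, float]:
    """Fraction of pairs, per group, where the preferred response scores higher.

    A pair counts as correct only if score(chosen) is strictly greater than
    score(rejected); ties are treated as incorrect (an assumption).
    """
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for pair in pairs:
        group = pair.get(group_key, "all")
        # "prompt", "chosen", and "rejected" are hypothetical field names.
        chosen_score = score(pair["prompt"], pair["chosen"])
        rejected_score = score(pair["prompt"], pair["rejected"])
        totals[group] += 1
        if chosen_score > rejected_score:
            hits[group] += 1
    return {group: hits[group] / totals[group] for group in totals}
```

Under these assumptions, a call such as `preference_accuracy(read_jsonl("Themis-CodeRewardBench.jsonl"), my_reward_model)` would return a mapping from each dimension label to its accuracy, where `my_reward_model` is any function that takes a prompt and a response and returns a scalar score. Passing a different `group_key` would give a breakdown by another field, such as programming language, if the file has one.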
