
Droid

dc.contributor.author Paul, Indraneil
dc.contributor.author Gurevych, Iryna
dc.contributor.author Orel, Daniil
dc.contributor.author Nakov, Preslav
dc.date.accessioned 2025-08-06T12:15:32Z
dc.date.created 2025-06-16
dc.date.issued 2025-08-06
dc.description
# Dataset Description

The dataset is structured into four primary classes:

* **Human-Written Code**: Samples written entirely by humans.
* **AI-Generated Code**: Samples generated by large language models (LLMs).
* **Machine-Refined Code**: Samples representing a collaboration between humans and LLMs, where human-written code is modified or extended by an AI.
* **AI-Generated-Adversarial Code**: Samples generated by LLMs with the specific intent to evade detection by mimicking human-like patterns and styles.

## Data Sources and Splits

The dataset covers three distinct domains to ensure wide-ranging and realistic code samples:

* **General-Use Code**: Sourced from GitHub (via StarcoderData and The Vault), representing typical production code for applications such as web servers, firmware, and game engines.
* **Algorithmic Problems**: Solutions to competitive programming problems from platforms such as CodeNet, LeetCode, CodeForces, and TACO. These are typically short, self-contained functions.
* **Research Code**: Sourced from code repositories accompanying research papers and data science projects, often characterized by procedural code and a lack of modularity.

### Generation Models

AI-generated code was created using models from 11 prominent families:

* Llama
* CodeLlama
* GPT-4o
* Qwen
* IBM Granite
* Yi
* DeepSeek
* Phi
* Gemma
* Mistral
* Starcoder

### Generation Methods

To simulate diverse real-world scenarios, code was generated using several techniques:

1. **Inverse Instruction**: An LLM generates a descriptive prompt from a human-written code snippet, which is then used to prompt another LLM to generate a new code sample.
2. **Comment-Based Generation**: LLMs generate code based on existing docstrings or comments.
3. **Task-Based Generation**: LLMs generate code from a precise problem statement, common for algorithmic tasks.
4. **Unconditional Synthetic Data**: To reduce bias, synthetic programmer profiles ("personas") were created, and LLMs generated tasks and corresponding code aligned with these profiles.

### Machine-Refined Scenarios

This third class of data models human-AI collaboration through three methods:

1. **Human-to-LLM Continuation**: An LLM completes a code snippet started by a human.
2. **Gap Filling**: An LLM fills in missing logic in the middle of a human-written code block.
3. **Code Rewriting**: An LLM rewrites human code, either with no specific instruction or with a prompt to optimize it.

### Decoding Strategies

To create a dataset that is more challenging for detectors, various decoding strategies were employed during generation, including greedy decoding, beam search, and sampling with diverse temperature, top-k, and top-p values.

## Data Filtering and Quality Control

To ensure high quality, the dataset underwent a rigorous filtering process:

* Samples were validated to be parsable into an Abstract Syntax Tree (AST).
* Filters were applied to AST depth, line count, line length (average and maximum), and the fraction of alphanumeric characters, removing trivial, overly complex, or non-code files.
* Docstrings were verified to be in English.
* Near-duplicates were removed using MinHash with a similarity threshold of 0.8.
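The structural filters above (AST parseability, AST depth, line counts, and alphanumeric fraction) can be sketched for Python samples using only the standard library. This is an illustrative re-implementation, not the authors' filtering code; all thresholds here are assumptions chosen for demonstration, not the values used to build the dataset.

```python
import ast

def ast_depth(node, depth=1):
    """Maximum nesting depth of a Python AST."""
    children = list(ast.iter_child_nodes(node))
    if not children:
        return depth
    return max(ast_depth(child, depth + 1) for child in children)

def passes_filters(source, max_depth=30, min_lines=3, max_lines=500,
                   max_avg_line_len=120, min_alnum_frac=0.25):
    """Hedged sketch of the described quality filters; thresholds are illustrative."""
    # 1) The sample must parse into an AST at all.
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False
    # 2) Reject overly complex (deeply nested) code.
    if ast_depth(tree) > max_depth:
        return False
    # 3) Line-count and average line-length bounds drop trivial or minified files.
    lines = source.splitlines()
    if not (min_lines <= len(lines) <= max_lines):
        return False
    if sum(len(line) for line in lines) / len(lines) > max_avg_line_len:
        return False
    # 4) Non-code files tend to have a low alphanumeric fraction.
    alnum = sum(ch.isalnum() for ch in source)
    return alnum / max(len(source), 1) >= min_alnum_frac
```

A real pipeline would add the remaining steps (maximum line length, English docstring checks, and MinHash near-duplicate removal, e.g. via a locality-sensitive hashing library), and language-specific parsers for non-Python samples.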
dc.identifier.uri https://tudatalib.ulb.tu-darmstadt.de/handle/tudatalib/4695
dc.language.iso en
dc.rights.license CC-BY-4.0 (https://creativecommons.org/licenses/by/4.0)
dc.subject.classification 4.43-04
dc.subject.ddc 004
dc.title Droid
dc.type Dataset
dcterms.accessRights openAccess
person.identifier.orcid #PLACEHOLDER_PARENT_METADATA_VALUE#
person.identifier.orcid #PLACEHOLDER_PARENT_METADATA_VALUE#
person.identifier.orcid 0009-0007-5600-7032
person.identifier.orcid #PLACEHOLDER_PARENT_METADATA_VALUE#
tuda.agreements true
tuda.unit TUDa

Files

Original bundle

Name               Size       Format
Droid_Dev.jsonl    170.32 MB  JSON Lines
Droid_Test.jsonl   170.52 MB  JSON Lines
Droid_Train.jsonl  1.33 GB    JSON Lines
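Each split is distributed as a JSON Lines file (one JSON object per line). A minimal stdlib reader can stream records without loading the multi-gigabyte train split into memory; the record field names are not documented on this page, so the sketch below makes no assumptions about them.

```python
import json

def iter_jsonl(path):
    """Stream records from a JSON Lines file, yielding one parsed object per line."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # tolerate blank lines
                yield json.loads(line)
```

For example, `for record in iter_jsonl("Droid_Dev.jsonl"): ...` iterates over the dev split; inspect one record's keys first to learn the schema.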

Collections