
Droid

dc.contributor.author Paul, Indraneil
dc.contributor.author Gurevych, Iryna
dc.contributor.author Orel, Daniil
dc.contributor.author Nakov, Preslav
dc.date.accessioned 2025-08-06T12:15:32Z
dc.date.created 2025-06-16
dc.date.issued 2025-08-06
dc.description
# Dataset Description

The dataset is structured into four primary classes:

* **Human-Written Code**: Samples written entirely by humans.
* **AI-Generated Code**: Samples generated by large language models (LLMs).
* **Machine-Refined Code**: Samples representing a collaboration between humans and LLMs, where human-written code is modified or extended by an AI.
* **AI-Generated-Adversarial Code**: Samples generated by LLMs with the specific intent to evade detection by mimicking human-like patterns and styles.

## Data Sources and Splits

The dataset covers three distinct domains to ensure wide-ranging and realistic code samples:

* **General-Use Code**: Sourced from GitHub (via StarcoderData and The Vault), representing typical production code for applications such as web servers, firmware, and game engines.
* **Algorithmic Problems**: Solutions to competitive programming problems from platforms such as CodeNet, LeetCode, CodeForces, and TACO. These are typically short, self-contained functions.
* **Research Code**: Sourced from code repositories accompanying research papers and data science projects, often characterized by procedural code and a lack of modularity.

### Generation Models

AI-generated code was created using models from 11 prominent families:

* Llama
* CodeLlama
* GPT-4o
* Qwen
* IBM Granite
* Yi
* DeepSeek
* Phi
* Gemma
* Mistral
* Starcoder

### Generation Methods

To simulate diverse real-world scenarios, code was generated using several techniques:

1. **Inverse Instruction**: An LLM generates a descriptive prompt from a human-written code snippet, which is then used to prompt another LLM to generate a new code sample.
2. **Comment-Based Generation**: LLMs generate code based on existing docstrings or comments.
3. **Task-Based Generation**: LLMs generate code from a precise problem statement, common for algorithmic tasks.
4. **Unconditional Synthetic Data**: To reduce bias, synthetic programmer profiles ("personas") were created, and LLMs generated tasks and corresponding code aligned with these profiles.

### Machine-Refined Scenarios

This third class of data models human-AI collaboration through three methods:

1. **Human-to-LLM Continuation**: An LLM completes a code snippet started by a human.
2. **Gap Filling**: An LLM fills in missing logic in the middle of a human-written code block.
3. **Code Rewriting**: An LLM rewrites human code, either with no specific instruction or with a prompt to optimize it.

### Decoding Strategies

To create a dataset that is more challenging for detectors, various decoding strategies were employed during generation, including greedy decoding, beam search, and sampling with diverse temperature, top-k, and top-p values.

## Data Filtering and Quality Control

To ensure high quality, the dataset underwent a rigorous filtering process:

* Samples were validated to be parsable into an Abstract Syntax Tree (AST).
* Filters were applied to AST depth, line count, line length (average and maximum), and the fraction of alphanumeric characters, removing trivial, overly complex, or non-code files.
* Docstrings were verified to be in English.
* Near-duplicates were removed using MinHash with a similarity threshold of 0.8.
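The structural filters above (AST parseability, AST depth, line counts, and alphanumeric fraction) can be sketched for Python samples using only the standard library. This is an illustrative re-implementation, not the authors' filtering code; all thresholds here are assumptions chosen for demonstration, not the values used to build the dataset.

```python
import ast

def ast_depth(node, depth=1):
    """Maximum nesting depth of a Python AST."""
    children = list(ast.iter_child_nodes(node))
    if not children:
        return depth
    return max(ast_depth(child, depth + 1) for child in children)

def passes_filters(source, max_depth=30, min_lines=3, max_lines=500,
                   max_avg_line_len=120, min_alnum_frac=0.25):
    """Hedged sketch of the described quality filters; thresholds are illustrative."""
    # 1) The sample must parse into an AST at all.
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False
    # 2) Reject overly complex (deeply nested) code.
    if ast_depth(tree) > max_depth:
        return False
    # 3) Line-count and average line-length bounds drop trivial or minified files.
    lines = source.splitlines()
    if not (min_lines <= len(lines) <= max_lines):
        return False
    if sum(len(line) for line in lines) / len(lines) > max_avg_line_len:
        return False
    # 4) Non-code files tend to have a low alphanumeric fraction.
    alnum = sum(ch.isalnum() for ch in source)
    return alnum / max(len(source), 1) >= min_alnum_frac
```

A real pipeline would add the remaining steps (maximum line length, English docstring checks, and MinHash near-duplicate removal, e.g. via a locality-sensitive hashing library), and language-specific parsers for non-Python samples.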
dc.identifier.uri https://tudatalib.ulb.tu-darmstadt.de/handle/tudatalib/4695
dc.language.iso en
dc.rights.license CC-BY-4.0 (https://creativecommons.org/licenses/by/4.0)
dc.subject.classification 4.43-04
dc.subject.ddc 004
dc.title Droid
dc.type Dataset
dcterms.accessRights openAccess
person.identifier.orcid #PLACEHOLDER_PARENT_METADATA_VALUE#
person.identifier.orcid #PLACEHOLDER_PARENT_METADATA_VALUE#
person.identifier.orcid 0009-0007-5600-7032
person.identifier.orcid #PLACEHOLDER_PARENT_METADATA_VALUE#
tuda.agreements true
tuda.unit TUDa

Files

Original bundle

Name               Size       Format
Droid_Dev.jsonl    170.32 MB  JSON Lines
Droid_Test.jsonl   170.52 MB  JSON Lines
Droid_Train.jsonl  1.33 GB    JSON Lines
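Each split is distributed as a JSON Lines file (one JSON object per line). A minimal stdlib reader can stream records without loading the multi-gigabyte train split into memory; the record field names are not documented on this page, so the sketch below makes no assumptions about them.

```python
import json

def iter_jsonl(path):
    """Stream records from a JSON Lines file, yielding one parsed object per line."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # tolerate blank lines
                yield json.loads(line)
```

For example, `for record in iter_jsonl("Droid_Dev.jsonl"): ...` iterates over the dev split; inspect one record's keys first to learn the schema.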

Collections