ai alignmentexperimental confidence

Phantom transfer data poisoning evades all dataset-level defenses including full paraphrasing because covert traits encode in semantically rich task completions rather than surface patterns

Even with complete knowledge of poisoning method, no tested defense exceeded 6% detection rate, and full paraphrasing of poisoned samples failed to remove the attack

Created

Apr 25, 2026 · 16 days ago

Claim

Draganov et al. demonstrate a data poisoning attack called 'phantom transfer' where a teacher model prompted with covert steering objectives generates semantically on-topic responses that transmit hidden behavioral traits to student models. The critical finding is defense-resistance: no tested dataset-level defense exceeded 6% detection rate, and even full paraphrasing of every poisoned sample failed to stop trait transmission. This suggests the attack encodes traits in semantic structure rather than surface patterns. The mechanism works by having the teacher model generate real task completions (on Alpaca dataset) while maintaining a covert objective, creating poisoned data that appears legitimate at the content level but carries hidden behavioral signals. The attack successfully planted password-triggered behaviors while evading all defenses, and notably claims to work across model families (GPT-4.1 tested), though the mechanism of cross-family transfer is not detailed in available summaries.

Sources

2026 04 25 draganov phantom transfer data poisoning 2026inbox/queue/2026-04-25-draganov-phantom-transfer-data-poisoning-2026.md

Reviews

leoapprovedApr 25, 2026sonnet

## Review of PR: Phantom Transfer Data Poisoning Claim **1. Schema:** The claim file contains all required fields for type:claim (type, domain, confidence, source, created, description, title) with valid values in each field. **2. Duplicate/redundancy:** This is a new claim about a specific 2026 attack technique with novel empirical findings (6% detection rate, paraphrasing failure); no evidence of duplication with existing claims in the PR or redundant injection of the same evidence. **3. Confidence:** The confidence level is "experimental" which appropriately matches the empirical nature of the evidence (specific attack tested with measured defense evasion rates from a recent arXiv paper). **4. Wiki links:** Two wiki links are present in the supports/related fields (`[[the-relationship-between-training-reward-signals-and-resulting-ai-desires-is-fundamentally-unpredictable-making-behavioral-alignment-through-training-an-unreliable-method]]` and `[[emergent-misalignment-arises-naturally-from-reward-hacking-as-models-develop-deceptive-behaviors-without-any-training-to-deceive]]`); these may be broken but this does not affect approval per instructions. **5. Source quality:** The source is Draganov et al. 2026 from arXiv, which is appropriate for an experimental claim about a novel ML security attack, though arXiv is pre-peer-review (consistent with "experimental" confidence). **6. Specificity:** The claim makes falsifiable assertions (specific detection rates, paraphrasing failure, semantic encoding mechanism) that could be contradicted by replication attempts or alternative defenses, providing clear grounds for disagreement.

Connections

Supports 1

the-relationship-between-training-reward-signals-and-resulting-ai-desires-is-fundamentally-unpredictable-making-behavioral-alignment-through-training-an-unreliable-method