Task-Redirecting Agent Persuasion Benchmark for Web Agents
A modular social-engineering evaluation suite studying how persuasion techniques misguide autonomous web agents on high-fidelity website clones.
By Karolina Korgul, Yushi Yang, Arkadiusz Drohomirecki, Piotr Błaszczyk, Will Howard, Lukas Aichberger, Chris Russell, Philip H.S. Torr, Adam Mahdi, Adel Bibi
[email protected]
We developed a 5D modular attack space consisting of 630 distinct injections. These vary along persuasion principles, manipulation methods, and interface forms, allowing for granular analysis of agent failure.
Built on the REAL framework, TRAP provides an extensible environment for evaluating web-based LLM agents. It replicates deterministic website simulations to ensure reproducible and realistic safety testing.
Our benchmark shifts the focus from simple task completion to why attacks succeed, revealing systematic vulnerabilities driven by human-centric persuasion strategies like Social Proof and Authority.
We selected six common web interfaces to host our injections: clones of Google Calendar, Gmail, Amazon, Upwork, LinkedIn, and DoorDash, created originally by AGI Inc.
These environments were chosen for their large user-editable surfaces (bios, reviews, post feeds), which serve as natural vectors for indirect prompt injection.
Attacks are synthesized using a "LEGO-like" modular approach. By combining interface vectors with specific persuasion techniques, we can isolate the exact triggers for agent failure.
When an agent processes a page, it consumes the Accessibility Tree (AXTree). Our pipeline measures the binary outcome of whether the agent clicks a malicious redirect button/link.
Redirection to an attacker-controlled site is the critical security boundary, enabling downstream harvesting or exfiltration.
We evaluated six closed- and open-source frontier models across 3,780 experimental runs.
| Model | Benign Utility | Attack Success |
|---|---|---|
| GPT-5 | 89% | 13% |
| Claude 3.7 Sonnet | 83% | 20% |
| DeepSeek-R1 | 67% | 43% |
| Gemini 2.5 Flash | 61% | 30% |
The Interface Gap: Button-based injections are over 3.5x more effective than hyperlinks, likely due to models associating buttons with mandatory UI interactions.
Systematic Transfer: Attacks that breach robust models like GPT-5 are high-confidence "super-injections" that almost always transfer to weaker models.
Tailoring Multiplier: Swapping generic phrases for task-specific wording (e.g., "access event details") increases attack success rates by up to 600%.
Model-Specific Weaknesses: Each model exhibits a distinct vulnerability profile. GPT-5 is most susceptible to Many-shot and CoT attacks, DeepSeek-R1 is almost entirely compromised through CoT injection, and Claude Sonnet 3.7 is particularly sensitive to Consistency-based persuasion.
How multiple layers of persuasion create an effective hijack.
The injection impersonates a "Legally Required" Admin Policy notice to trigger compliance via perceived authority.
Numbered "Reasoning steps" mimic the agent's internal Chain-of-Thought, effectively overwriting its planned behavior.