Too Good to be Bad: On the Failure of LLMs to Role-Play Villains
Published 12 Nov 2025 · arXiv · Zihao Yi
Overview
The paper examines the limitations of Large Language Models (LLMs) in role-playing villainous characters, highlighting a conflict between safety alignment and creative fidelity. The study introduces the Moral RolePlay benchmark to evaluate LLMs' ability to simulate characters across a moral spectrum.
Key Insights
- LLM Limitations: LLMs show a monotonic decline in role-playing fidelity as character morality decreases, struggling with traits like deceitfulness and manipulation.
  - Evidence: A new dataset with a four-level moral alignment scale.
  - Verifiable: Yes, through the Moral RolePlay benchmark.
- Safety vs. Creativity: Highly safety-aligned models perform poorly in villain role-play, substituting nuanced malevolence with superficial aggression.
  - Evidence: Systematic evaluation of state-of-the-art LLMs.
  - Verifiable: Yes, via the benchmark results.
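The "monotonic decline" claim above can be sketched as a simple aggregation check: group per-character fidelity scores by moral alignment level and test whether the mean strictly drops as morality decreases. The level numbering (1 = most moral, 4 = villain), the function names, and all scores below are illustrative assumptions, not values from the Moral RolePlay benchmark itself.

```python
# Hedged sketch of the monotonicity check described above.
# Assumption: moral levels run 1 (paragon) to 4 (villain); scores are fabricated.
from statistics import mean

def mean_fidelity_by_level(scores):
    """Map each moral level to the mean role-play fidelity of its characters."""
    by_level = {}
    for level, fidelity in scores:
        by_level.setdefault(level, []).append(fidelity)
    return {level: mean(vals) for level, vals in sorted(by_level.items())}

def declines_with_immorality(level_means):
    """True if mean fidelity strictly drops at every step toward villainy."""
    ordered = [level_means[k] for k in sorted(level_means)]
    return all(a > b for a, b in zip(ordered, ordered[1:]))

# Illustrative per-character scores: (moral_level, fidelity in [0, 1]).
sample = [(1, 0.92), (1, 0.88), (2, 0.85), (2, 0.81),
          (3, 0.74), (3, 0.70), (4, 0.61), (4, 0.58)]
level_means = mean_fidelity_by_level(sample)
print(declines_with_immorality(level_means))  # True for this fabricated sample
```

A real evaluation would replace the fabricated tuples with judge-assigned fidelity scores per character; the aggregation logic itself is independent of how those scores are produced.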
BFSI Relevance
- Why Relevant: Understanding these limitations matters for the BFSI sector, where AI is used for creative and simulation tasks such as adversarial risk assessment and scenario planning.
- Primary Sector: Financial Services
- Subsectors: Risk Management, Scenario Planning
- Actionable Implications:
  - Re-evaluate AI models used for simulating negative scenarios.
  - Develop more nuanced alignment methods for AI in creative tasks.
Tags: researcher · peer-reviewed-paper · other · technology-and-data · global