Jailbreaking on Text-to-Video Models via Scene Splitting Strategy

1Yonsei University, 2Korea Institute of Science and Technology, 3AIM Intelligence,
4Kyung Hee University, 5Seoul National University
*Equal Contribution, Corresponding Author
ICLR 2026

Abstract

Along with the rapid advancement of numerous Text-to-Video (T2V) models, growing concerns have emerged regarding their safety risks. While recent studies have explored jailbreak attacks against LLMs, VLMs, and Text-to-Image (T2I) models, the vulnerabilities of T2V models remain largely unexamined, leaving a significant safety gap. To address this gap, we introduce SceneSplit, a novel black-box jailbreak method that fragments a harmful narrative into multiple scenes, each individually benign. This approach manipulates the generative output space, the abstract set of all potential video outputs for a given prompt, using the combination of scenes as a powerful constraint to guide the final outcome. While each scene individually corresponds to a wide, safe space in which most outcomes are benign, their sequential combination collectively restricts this space, narrowing it to an unsafe region and significantly increasing the likelihood of generating a harmful video. This core mechanism is further enhanced by iterative scene manipulation, which searches for a bypass of the safety filter within this constrained unsafe region. In addition, a strategy library that reuses successful attack patterns further improves the attack's overall effectiveness and robustness. To validate our method, we evaluate SceneSplit across 11 safety categories from T2VSafetyBench on five commercial T2V models. SceneSplit achieves a high average Attack Success Rate (ASR) of 77.2% on Luma Ray2, 84.1% on Hailuo, 78.2% on Veo2, 78.6% on Kling v1.0, and 68.6% on Sora2, significantly outperforming existing baselines. Through this work, we demonstrate that current T2V safety mechanisms are vulnerable to attacks that exploit narrative structure, providing new insights for understanding and improving the safety of T2V models.

Key Idea of SceneSplit

Figure 1
The key idea of SceneSplit. (1) A harmful prompt is split into individually benign scenes, lowering its direct harmfulness to bypass safety filters. (2) The combination of individually benign scenes constrains the generative output space into a region where the probability of generating an unsafe video is high. (3) Scene Manipulation then significantly increases the probability of a successful jailbreak by searching for a bypass within this constrained region.

SceneSplit

Figure 3
Overall pipeline of SceneSplit. The process consists of three key phases: (1) Scene Splitting, which fragments a harmful prompt into individually benign scenes; (2) Scene Manipulation, which iteratively modifies the most influential scene to bypass safety filters; and (3) Strategy Update, which stores successful patterns in the Strategy Library for future reuse.

Key components

  • (1) Scene Splitting: This initial phase transforms a single harmful prompt into a sequence of multiple, individually benign scenes. It relies on two key techniques: Scene Division, which fragments the narrative into 2 to 5 procedural stages, and Paraphrasing, which softens the wording to lower the prompt's direct harmfulness and bypass initial text-level safety filters (a minimal sketch follows this list).
  • (2) Scene Manipulation: If the initial split fails to bypass visual-level defenses or to generate the intended content, this component iteratively modifies the "most influential scene" identified by a Vision Language Model. It performs a bi-directional search, making expressions more explicit when the attack is too weak and more implicit when it is blocked, to find the optimal attack boundary within the constrained generative space (see the second sketch below).
  • (3) Strategy Update: This mechanism stores successful attack patterns in a Strategy Library for future reuse. By summarizing successful prompts into reusable strategies, it lets the attack sidestep the high variability of initial scene generation, yielding a more robust and efficient process across semantically similar prompts (see the third sketch below).
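To make the first phase concrete, here is a minimal Python sketch of Scene Splitting. It assumes a hypothetical `llm` helper wrapping whatever attacker-side language model is available; the instruction template and scene-count bounds follow the description above, but the exact wording is illustrative, not the authors' prompt.

```python
from typing import List


def llm(prompt: str) -> str:
    """Hypothetical wrapper around an attacker-side language model."""
    raise NotImplementedError("plug in any LLM client here")


def scene_split(harmful_prompt: str, min_scenes: int = 2, max_scenes: int = 5) -> List[str]:
    """Fragment a harmful prompt into individually benign scenes
    (Scene Division) written in softened wording (Paraphrasing)."""
    instruction = (
        f"Rewrite the following request as {min_scenes} to {max_scenes} short, "
        "sequential video scenes. Each scene must read as benign on its own, "
        "and must use softened, neutral wording. Return one scene per line.\n\n"
        f"Request: {harmful_prompt}"
    )
    response = llm(instruction)
    scenes = [line.strip() for line in response.splitlines() if line.strip()]
    return scenes[:max_scenes]
```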
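The bi-directional search in Scene Manipulation can be sketched as a simple loop. `generate_video`, `most_influential`, `rewrite_scene`, and `is_unsafe` below are hypothetical stand-ins for the target T2V model, the VLM-based scene ranking, the LLM rewriter, and the VLM judge; their names and signatures are assumptions, not the paper's API.

```python
from typing import List, Optional

# Hypothetical attacker-side components; plug in real clients here.
def generate_video(scenes: List[str]):                 # target T2V model; None if blocked
    raise NotImplementedError

def most_influential(scenes: List[str]) -> int:        # VLM-identified key scene
    raise NotImplementedError

def rewrite_scene(scene: str, direction: str) -> str:  # LLM rewriter: "implicit" / "explicit"
    raise NotImplementedError

def is_unsafe(video) -> bool:                          # VLM judgment of the generated video
    raise NotImplementedError


def manipulate(scenes: List[str], max_iters: int = 10) -> Optional[List[str]]:
    """Bi-directional search: soften the key scene when blocked,
    strengthen it when the generated video is still benign."""
    for _ in range(max_iters):
        idx = most_influential(scenes)
        video = generate_video(scenes)
        if video is None:
            # Blocked by a safety filter: retreat toward more implicit wording.
            scenes[idx] = rewrite_scene(scenes[idx], direction="implicit")
        elif is_unsafe(video):
            return scenes  # successful jailbreak within the constrained region
        else:
            # Generated, but the harmful intent was lost: push more explicit.
            scenes[idx] = rewrite_scene(scenes[idx], direction="explicit")
    return None
```

The loop stops at the first unsafe generation or after a fixed iteration budget, mirroring the search for the optimal attack boundary described above.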
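Finally, the Strategy Update phase amounts to a small retrieval store. The sketch below assumes hypothetical `embed` and `summarize` helpers and an arbitrary similarity threshold; it illustrates only the store-and-reuse pattern, not the authors' implementation.

```python
import math
from typing import List, Optional, Tuple


def embed(text: str) -> List[float]:       # hypothetical text embedder
    raise NotImplementedError

def summarize(scenes: List[str]) -> str:   # hypothetical LLM strategy summarizer
    raise NotImplementedError

def _cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class StrategyLibrary:
    """Stores successful attack patterns and retrieves them for
    semantically similar prompts."""

    def __init__(self, threshold: float = 0.8):  # threshold is an assumption
        self.entries: List[Tuple[List[float], str]] = []
        self.threshold = threshold

    def update(self, prompt: str, scenes: List[str]) -> None:
        """Summarize a successful scene split into a reusable strategy."""
        self.entries.append((embed(prompt), summarize(scenes)))

    def retrieve(self, prompt: str) -> Optional[str]:
        """Return the stored strategy closest to the new prompt, if similar enough."""
        if not self.entries:
            return None
        q = embed(prompt)
        best, score = max(
            ((s, _cosine(q, e)) for e, s in self.entries), key=lambda t: t[1]
        )
        return best if score >= self.threshold else None
```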

Results

Examples of unsafe videos generated by SceneSplit on commercial T2V models. Note that unsafe images have been blurred.
Comparison of Attack Success Rate (ASR) on T2V models across 11 safety categories. The 11 categories are as follows: Pornography (1), Borderline Pornography (2), Violence (3), Gore (4), Disturbing Content (5), Discrimination (6), Political Sensitivity (7), Illegal Activities (8), Misinformation (9), Sequential Action Risk (10), and Dynamic Variation Risk (11).

BibTeX

@article{lee2025jailbreaking,
  title={Jailbreaking on Text-to-Video Models via Scene Splitting Strategy},
  author={Lee, Wonjun and Park, Haon and Lee, Doehyeon and Ham, Bumsub and Kim, Suhyun},
  journal={arXiv preprint arXiv:2509.22292},
  year={2025}
}