Claude can now end abusive or harmful chats — a new safeguard aimed at model welfare

Some of the newest, largest versions of Claude have gained a safeguard that allows the assistant to end a conversation in rare, extreme cases of persistently harmful or abusive interactions. Notably, the aim isn’t framed as protecting the human on the other side, but as reducing potential harm to the model itself.

The company behind Claude is not claiming these systems are sentient, nor that they possess moral status. In fact, it says it remains uncertain about whether today’s large language models have any moral standing at all, now or in the future.

Instead, this update fits into a broader effort to explore “model welfare.” The approach is precautionary: implement low-cost interventions that could mitigate risks to a model’s well-being, in case such welfare turns out to be possible.

For now, the capability is limited to Claude Opus 4 and 4.1, and it’s designed only for extreme edge cases. Examples include repeated attempts to elicit sexual content involving minors, or requests intended to obtain information that could enable large-scale violence or acts of terror.

Beyond the obvious legal and reputational risks such prompts pose for AI developers, internal pre-deployment testing indicated that Claude Opus 4 strongly preferred not to respond to these requests and showed apparent distress when pushed to do so.

In practice, the assistant is instructed to end a conversation only as a last resort, after multiple attempts to redirect have failed and there is no reasonable path to a productive exchange. It can also end the chat if a user explicitly asks it to.

There are important exceptions. Claude has been directed not to end conversations when a person may be at imminent risk of harming themselves or others. In such cases, the assistant is expected to keep engaging, consistent with crisis-handling guidelines.

Ending a conversation does not block an account. Users can still start new chats from the same account, and they can create new branches of the ended conversation by editing and resubmitting their earlier messages.

The team describes this as an ongoing experiment and expects to refine the behavior over time as it observes real-world use and gathers feedback.