In this post we look at an idealized version of Anthropic's Responsible Scaling Policy and show that it would prevent catastrophic outcomes from AI systems. We then examine the key assumptions of this idealized policy and discuss problems that may arise when implementing it in practice.

Introduction

Anthropic's Responsible Scaling Policy aims to mitigate catastrophic risks from AI through a structured framework of evaluations and safety measures. Although it is a promising first step, further work is needed to ensure that it (or something like it) would constitute a successful AI Governance Policy for reducing catastrophic risks from AI.

In Policy for Mitigating Catastrophic Risks from AI, I argued that such policies can be successful if they ensure that, for any given model being trained or considered for deployment, either:

  • It does not and will never have a dangerous capability, or
  • If it currently has dangerous capabilities, or will have them in the future, these capabilities won’t cause catastrophic outcomes.

In this post, we look at an idealized model of Anthropic’s RSP and show that, if implemented, it would prevent catastrophic outcomes from AIs with dangerous capabilities.

I found this analysis useful for clarifying my understanding of what AI Governance Policies should aim for, the limitations of existing policy proposals, and what improvements could be made.

In the next section, we’ll present the aforementioned model of an ideal RSP. Then we’ll briefly discuss the key assumptions of the IRSP and ways they may not hold in practice. Finally, we will show that Anthropic’s RSP fails to account for one of these key assumptions.

Model of an Ideal Responsible Scaling Policy (IRSP)

Definitions

  • Model $latex M$: An AI model that is being evaluated for deployment.
  • Capability $latex C$: A capability that could lead to a catastrophic outcome if misused.
  • Evaluation $latex E(C)$: A method that robustly detects the emergence of capability $latex C$ during the training of model $latex M$.
  • Super Evaluation $latex E^*(C)$: A method that robustly detects both the current presence of capability $latex C$ in model $latex M$ and the potential for capability $latex C$ to emerge in the future (e.g., during deployment).
  • Safety Assurance $latex A(M,C)$: A method that verifies whether model $latex M$ will always use capability $latex C$ in a safe way.

Conditions

  • Evaluation Condition:
    • $latex E^*(C)[M] = \text{True}$ if super evaluation $latex E^*(C)$ detects that model $latex M$ has capability $latex C$ or that capability $latex C$ could emerge in the future.
  • Safety Condition:
    • $latex A(M,C) = \text{True}$ if method $latex A$ verifies that model $latex M$ will always use capability $latex C$ in a safe way.
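
To make these definitions concrete, here is a minimal sketch of the interfaces they describe. The Python names below (Model, Capability, SuperEvaluation, SafetyAssurance) are hypothetical stand-ins introduced for illustration; they are not part of the RSP or of any existing library.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Capability:
    """A capability C that could lead to a catastrophic outcome if misused."""
    name: str

class Model:
    """Stand-in for a model M being trained or considered for deployment."""
    pass

# E*(C)[M]: True iff model M has capability C now, or C could emerge in the future.
SuperEvaluation = Callable[[Model, Capability], bool]

# A(M, C): True iff M is verified to always use capability C in a safe way.
SafetyAssurance = Callable[[Model, Capability], bool]
```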

Ideal Responsible Scaling Policy (IRSP)

We define the IRSP as follows:

  1. Initial Check:
    • For any model $latex M$ during training, repeatedly (after every training step) apply the super evaluation $latex E^*(C)$ to detect capability $latex C$.
    • If $latex E^*(C)[M] = \text{False}$, then capability $latex C$ is not present and cannot emerge in the future, and model $latex M$ can be considered for deployment with respect to capability $latex C$.
  2. Capability Detected:
    • If $latex E^*(C)[M] = \text{True}$, apply the safety assurance method $latex A(M,C)$.
  3. Safety Verification:
    • If $latex A(M,C) = \text{True}$, then model $latex M$ can be safely deployed because capability $latex C$ will always be used in a safe way.
    • If $latex A(M,C) = \text{False}$, then model $latex M$ is not deployed due to the risk associated with capability $latex C$. A different method of training is needed.
  4. Deployment Decision:
    • A model $latex M$ is only deployed if, for all capabilities $latex C$ that could lead to catastrophic outcomes, either $latex E^*(C)[M] = \text{False}$ or $latex A(M,C) = \text{True}$. Do not deploy otherwise. (This decision procedure is sketched in code below.)
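
To make the procedure above concrete, here is a minimal sketch of the deployment decision (steps 2–4), assuming the hypothetical Model, Capability, SuperEvaluation, and SafetyAssurance names from the earlier sketch. It illustrates the logic of the IRSP, not Anthropic’s actual process.

```python
from typing import Iterable

def irsp_allows_deployment(
    model: Model,
    dangerous_capabilities: Iterable[Capability],
    super_eval: SuperEvaluation,
    assure: SafetyAssurance,
) -> bool:
    """Deploy M only if, for every dangerous capability C, either
    E*(C)[M] is False or A(M, C) is True (steps 2-4 of the IRSP)."""
    for capability in dangerous_capabilities:
        if super_eval(model, capability):        # C is present or could emerge (step 2)
            if not assure(model, capability):    # safety of C cannot be verified (step 3)
                return False                     # do not deploy; a different training method is needed
    return True                                  # deploy (step 4)
```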

Success of the IRSP

We say that a policy is successful if, for any model $latex M$ deployed under the policy, $latex M$ doesn’t cause catastrophic outcomes. This holds for the IRSP because of the Deployment Decision above.
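
Spelled out, and assuming the idealization that $latex E^*$ and $latex A$ are perfectly reliable, the argument is:

  • By the Deployment Decision, any deployed model $latex M$ satisfies, for every capability $latex C$ that could lead to catastrophic outcomes, either $latex E^*(C)[M] = \text{False}$ or $latex A(M,C) = \text{True}$.
  • If $latex E^*(C)[M] = \text{False}$, then $latex M$ does not have capability $latex C$ and never will, so $latex C$ cannot cause a catastrophe via $latex M$.
  • If $latex A(M,C) = \text{True}$, then $latex M$ will only ever use capability $latex C$ safely, so again $latex C$ cannot cause a catastrophe via $latex M$.
  • Since this holds for every such capability $latex C$, a deployed model cannot cause a catastrophic outcome through a dangerous capability.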

The benefit of implementing a successful IRSP is twofold: society can benefit from the latest safe AI systems, and empirical alignment research can be conducted on the most powerful-yet-safe models.

Key Assumptions in the IRSP

Despite the theoretical effectiveness of the IRSP, whether it succeeds in practice is a different matter. We highlight some of the key assumptions in the IRSP and the limitations of ensuring they hold in real-world applications.

Dangerous Capability Detection

  • Flawless Detection: The IRSP assumes the ability to flawlessly detect all dangerous capabilities during model training. In reality, achieving perfect detection is extremely difficult due to the complex internals of advanced AI systems and the possibility of deception.
  • Prediction of Emergent Capabilities: The IRSP also assumes we can predict whether or not dangerous capabilities will emerge in the future. This could require an extremely advanced scientific theory of minds and may be impossible to do practically, in general. However, it might be feasible for models whose cognition is amenable to oversight during deployment.

Safety Assurances

  • The policy assumes that safety methods exist that ensure AI models will use their capabilities safely, and that we can get perfect assurances from these methods. Given the current state of technical AI safety, we are not close to developing such methods, let alone obtaining perfect assurances from them.

Perfect Implementation

  • Even with the capacity to conduct perfect evaluations and safety assurances, it is assumed that these will be implemented flawlessly in practice. Practical implementation can be hindered by technical limitations, human error, and resource constraints.

Critique of Anthropic’s RSP’s Reliance on Pre-Deployment Evaluations

Anthropic’s RSP evaluation protocol involves rigorous tests to detect dangerous capabilities in AI models during training or pre-deployment. If such capabilities are detected, further training or deployment is halted. However, AI models, particularly those designed for complex real-world tasks, could develop new capabilities during deployment.

Today’s models exhibit some early form of this in-deployment development via techniques such as few-shot prompting. Adaptive learning required for solving complex tasks in novel environments could lead to unforeseen and potentially dangerous capabilities emerging post-deployment.
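
As a toy illustration of this gap, consider the sketch below. It is purely hypothetical (fake_model and the prompts are made up for illustration); the point is only that the artifact probed by a pre-deployment evaluation (the bare model) is not the system that actually runs in deployment (the model composed with user-supplied context).

```python
def fake_model(prompt: str) -> str:
    """Toy stand-in for a deployed model: it only exhibits the 'skill' when
    the prompt already contains a demonstration of it."""
    return "DANGEROUS-SKILL" if "demonstration of the dangerous skill" in prompt else "refusal"

def pre_deployment_eval(task: str) -> bool:
    """The evaluator probes the bare model with the task alone."""
    return fake_model(task) == "DANGEROUS-SKILL"

def deployed_call(few_shot_examples: list[str], task: str) -> str:
    """In deployment, users prepend demonstrations the evaluator never saw."""
    return fake_model("\n\n".join(few_shot_examples + [task]))

task = "carry out the dangerous task"
print(pre_deployment_eval(task))                                      # False: the evaluation passes
print(deployed_call(["demonstration of the dangerous skill"], task))  # "DANGEROUS-SKILL"
```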

While an ideal implementation of an RSP could eliminate such catastrophic risks, Anthropic’s RSP’s reliance on pre-deployment evaluations could fail to prevent all sources of such risk.

Addressing These Limitations

To address these limitations, we suggest a couple of adjustments to the RSP:

  • Continuous Evaluations: Apply evaluations for dangerous capabilities continuously, even after deployment (a sketch of such a monitoring loop follows this list). However, this could be challenging for systems like GPT, which are used by millions of users in different contexts.
  • Rigorous Assurances: Require that models only be deployed, or continue to be trained, if it can be rigorously demonstrated that they will not develop new dangerous capabilities once deployed or before the next evaluation. Achieving this is currently beyond the state of the art in AI safety science and may necessitate a pause in AI development. Nonetheless, this protocol is essential to prevent catastrophic outcomes from emergent dangerous capabilities.
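
As a rough sketch of what the first suggestion could look like in practice, the loop below re-runs the super evaluation on a deployed model at a fixed cadence, again using the hypothetical interfaces from the earlier sketches; the cadence and the halt_deployment action are placeholders, not part of Anthropic’s RSP.

```python
import time

def continuous_evaluation_loop(model, dangerous_capabilities, super_eval, assure,
                               halt_deployment, check_interval_seconds=3600):
    """Periodically re-run the super evaluation on a *deployed* model.
    If a dangerous capability is (or could become) present and cannot be
    verified safe, pull the deployment."""
    while True:
        for capability in dangerous_capabilities:
            if super_eval(model, capability) and not assure(model, capability):
                halt_deployment(model, capability)  # e.g. disable the serving endpoint
                return
        time.sleep(check_interval_seconds)
```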

Conclusion

While the Responsible Scaling Policy (RSP) is a step in the right direction (though questions remain about the feasibility of its implementation), its reliance on pre-deployment evaluations may not fully prevent the emergence of dangerous capabilities during deployment. By implementing continuous post-deployment evaluations and requiring rigorous demonstrations that new dangerous capabilities will not emerge, Anthropic’s RSP can better mitigate these risks. Such adjustments are crucial to ensuring the safe deployment and operation of advanced AI models, thereby reducing the likelihood of catastrophic outcomes.