In this post, I write out my thought process on how governance policies can be used to mitigate catastrophic risks from AI. I also suggest a high-level policy approach for ensuring the development of scientifically based methods for testing whether AI systems have dangerous capabilities, and methods for verifying the efficacy of safety measures.

How Policy Can Reduce Catastrophic Risks from AI

To reduce the chance of catastrophic outcomes from advanced AI, we need to reduce the probability that AI systems with dangerous capabilities are deployed in circumstances where they direct those capabilities towards catastrophic outcomes or unintentionally cause them.

Therefore, AI policies aimed at meaningfully reducing the chance of catastrophic outcomes from advanced AI must ensure either that AI systems with dangerous capabilities are not deployed, or that they are deployed only when there are strong guarantees that their dangerous capabilities will not cause catastrophic outcomes.
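One rough way to make this target concrete (the decomposition and notation here are my own simplification, not anything standard) is to factor the probability of an AI-caused catastrophe into two terms that policy can push on separately:

$$
P(\text{catastrophe}) \;\approx\; P(\text{a model with dangerous capabilities is deployed}) \times P(\text{catastrophe} \mid \text{such a model is deployed})
$$

Capability testing before deployment targets the first factor; verified safety measures target the second.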

How can we meaningfully (and measurably) reduce the chance of deploying a model whose dangerous capabilities cause a catastrophic outcome?

If a model is to be deployed, it must pass a test that demonstrates that at least one of the following statements holds with sufficiently high probability:

  • The model does not have, and will never develop, dangerous capabilities; or
  • If the model currently has dangerous capabilities, or will develop them in the future, those capabilities will not cause catastrophic outcomes.

The above requires a developed science of intelligence that we do not currently have. We need such a science to create explicit pass/fail conditions for any such test, and to create proofs or guarantees that models that pass the test are in fact safe.
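As an illustration only, here is a minimal sketch of what an explicit, threshold-based deployment gate could look like if such a science existed. The function, parameter names, and threshold are hypothetical, and the assumption that we could produce calibrated probability estimates for these quantities is exactly what we currently lack.

```python
# Hypothetical deployment gate: a sketch, not a real evaluation method.
# It assumes a mature science of intelligence could supply calibrated estimates
# of the two probabilities below; today, nothing can.

RISK_THRESHOLD = 1e-6  # illustrative policy choice: max acceptable P(catastrophe)

def may_deploy(p_dangerous_capability: float,
               p_catastrophe_given_capability: float) -> bool:
    """Pass/fail condition for deployment.

    p_dangerous_capability: estimated probability that the model has, or will
        develop during deployment, a dangerous capability.
    p_catastrophe_given_capability: estimated probability that such a capability
        causes a catastrophic outcome despite the safety measures in place.
    """
    p_catastrophe = p_dangerous_capability * p_catastrophe_given_capability
    return p_catastrophe < RISK_THRESHOLD

# Illustrative numbers: a model judged 10% likely to have dangerous capabilities
# passes only if safety measures are verified to keep the conditional risk below 1e-5.
print(may_deploy(0.10, 1e-4))   # False
print(may_deploy(0.10, 1e-6))   # True
```

The hard part, of course, is not the arithmetic but producing these probability estimates with scientific rigour, which is what the rest of this post argues we cannot yet do.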

Our current approaches to evaluating models and demonstrating their safety are not scientifically based. Even when they work (and they do not seem to work robustly), they work by luck, not because they have been rigorously shown to work. This is bad because it can create false confidence in the safety of these models, confidence that would not exist if there were no tests or safety methods at all.

The continued development of models without such standards means we have no strong way of determining the probability that a model we deploy could cause catastrophic outcomes.

Therefore, if we want to confidently reduce risks from AI, it seems prudent to halt AI development until such a rigorous science has been developed.

Policy Proposal: International Body for Developing a Scientifically Based Understanding of Intelligence

Policy proposal: Establish an international body to develop scientifically rigorous testing standards to determine whether a given AI model has dangerous capabilities or the potential to develop them during deployment, along with methods of verifying whether AI safety protocols will hold.

Key Aims of the Proposal

1. Development and Verification of Testing Standards:

  • Aim: The primary aim of the international body would be to create robust scientific theories and testing protocols for determining whether an AI system has a sufficiently low probability of causing catastrophic outcomes.
  • Scope: This involves developing scientific methods to robustly verify whether an AI has dangerous capabilities, or the potential to develop such capabilities during deployment. The body would also establish rigorous evaluations of the efficacy of safety measures, to ensure that if an AI possesses dangerous capabilities, those measures keep its probability of causing a catastrophic outcome sufficiently low.

2. Temporary Moratorium on AI Development:

  • The policy would enact a temporary pause in the development and deployment of new AI models until the aforementioned testing standards are established and verified. It would set a specific timeframe for this moratorium, with the possibility of extensions based on the progress in developing and validating testing protocols.

3. Conditional Approval for AI Deployment:

  • Once the testing protocols have been developed and verified, models can be trained and deployed under strict oversight informed by the aforementioned scientific theory and tests.

Motivation

Current approaches to evaluating AI models and demonstrating their safety are not scientifically based. Even if they appear to work, this is not due to rigorous validation. This can lead to false confidence in the safety of these models, potentially increasing the risk of deploying systems with dangerous capabilities.

To reduce the probability of catastrophe resulting from deployed AI systems exercising dangerous capabilities, we must ensure that the AI systems we deploy satisfy at least one of the following properties (with sufficiently high probability):

  • Absence of Dangerous Capabilities: The AI model does not and will never possess dangerous capabilities.
  • Control of Dangerous Capabilities: If the AI model has or develops dangerous capabilities, sufficient safety measures are in place to ensure that these capabilities will not cause catastrophic outcomes.

Achieving this requires the development of a comprehensive science of intelligence, enabling the creation of explicit pass/fail conditions for safety tests and the development of proofs or guarantees that models meeting these conditions are safe.
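Stated slightly more formally (the notation is mine, and ε stands in for whatever risk tolerance a policy would actually choose), the requirement on a model M before deployment is:

$$
P(\text{M never possesses dangerous capabilities}) \ge 1 - \varepsilon
\quad \text{or} \quad
P(\text{no catastrophic outcome} \mid \text{M has or develops dangerous capabilities}) \ge 1 - \varepsilon
$$

Either bound would suffice; the point of the proposed science is to make such bounds estimable at all.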

Conclusion

Establishing an international regulatory body to develop and enforce rigorous AI safety testing standards would give us a handle on which models can be trained or deployed, and under what circumstances. This protocol, enacted properly, could significantly reduce the risk of catastrophic outcomes from advanced AI by ensuring that AI systems are deployed only when they have been scientifically proven to be safe.