(Crossposted from LessWrong)

Thanks to Johannes Treutlein, Erik Jenner, Joseph Bloom, and Arun Jose for their discussions and feedback.

Summary

Monitoring an AI’s internals for features/concepts unrelated to the task the AI appears to be performing may help detect when the AI is performing hidden cognition. For example, it would be very suspicious if an AI tasked with booking a holiday for a user was found to be using internal features/concepts related to “bioweapons” or “viruses”.

The ability to detect hidden cognition could help prevent negative outcomes that result from some forms of deceptive alignment/scheming.

In this post, we will:

  • Introduce the idea of concepts as units of knowledge used in cognition to solve tasks.
  • Introduce and discuss the idea of Anomalous Concept Detection for revealing hidden cognition.
  • Explore how existing interpretability techniques, such as probes or sparse autoencoders, could be used to detect hidden cognition and discuss the possibility of testing these approaches in the near future.

DALL·E 2024-03-05 11.11.51 - A futuristic scene where a group of scientists and engineers are gathered around a large, transparent holographic display. The display shows a complex.webp [webp-to-jpg output image]

Introduction

An AI may be able to hide cognition that leads to negative outcomes from certain oversight processes (such as deceptive alignment/scheming). Without being able to detect this hidden cognition, an overseer may miss the opportunity to intervene and prevent the associated negative outcomes. The ability to detect and interrupt hidden cognition may be critical to preventing irreversibly bad outcomes, such as when the outcomes themselves are hard to detect, or the outcomes make future monitoring/interventions more difficult (as is the case when the outcome is some form of a decision to be deceptive, removing an off-switch, etc.)

For example, an AI managing a city's energy grid may begin to prioritize its own operational efficiency over the city’s actual energy needs. Knowing that its overseers will not approve of this goal, it pursues its goal via hidden cognition, undetected until issues arise.

One hope is that traces of the hidden cognition are present in the AI’s internals and could be detected and interpreted. This is part of the potential of interpretability as an alignment research agenda (see Hubinger, Olah, Nanda).

Although we may be far from being able to reverse engineer frontier models as a means of detecting hidden cognition, the recent development of Sparse Autoencoders, suggest that we may soon have automated methods of revealing which features/concepts are activated during a model’s forward passes.

Inspired by this, we ask: what implications are there for detecting hidden cognition if we have access to a robust method that allows us to monitor features used by a model during its forward passes? If a model is scheming to manufacture a bioweapon while avoiding oversight processes, surely some suspicious concepts would show up in the application of such a feature/concept monitor.

In this post, we explore the implications of such a method for detecting hidden cognition. We begin by introducing the idea of monitoring the concepts/features an AI uses and then explore the presence of anomalous concepts that might be used to flag the presence of hidden cognition.

Concept Interpretability

In this section, we define what we mean by a concept/feature and introduce a theoretical method, Concept Interpretability, for detecting all internal concepts an AI is using during its computations. This section is quite conceptual and vague to motivate the idea without getting bogged down by questions of practicality. We will clarify these notions in later sections.

What is a Concept?

Some outputs produced by a cognitive system require it to use cognition/mechanism/apply computation to solve some task that leads to that output. For example, finding the best travel route under various constraints may require the consideration of different routes, planning, and evaluation.

We define a concept/feature as a unit of knowledge used to solve a task (inspired by the definition of a concept from Schut et al.’s work on Concept Discovery and Transfer in AlphaZero).

For example, we can think of how the cognition of a human planning a holiday may use concepts such as “airplanes”, “passports”, etc.

Although the above definition is perhaps a bit vague, we note that observations from neural network interpretability support the idea that today’s AI systems seem to represent and use task-relevant concepts in their computations (see Feature Visualization, for example). In fact, these concepts (or features) appear to often be linearly encoded in the latent space of neural networks (see Concept Discovery and Transfer in AlphaZero for a deeper discussion).

What is Concept Interpretability?

We define Concept Interpretability as a theoretical method an overseer can deploy to detect and interpret all concepts that an AI uses internally during runtime. This potential ability is inspired by recent work and results related to Sparse Autoencoders.

We assume that the concepts/features an AI uses can be mapped to the Overseer’s ontology (see Ontology Identification).

As an example of what this could look like, consider an LLM-based AI with sparse autoencoders trained on the residual streams of the LLM. As forward passes of the LLM are called, the sparse autoencoders could be used to print off a list of features that are activated during that forward pass. This list of features is the set of concepts used during that forward pass.

An overseer with the capacity for perfect Concept Interpretability is better positioned to have insight into the AI’s cognition than an overseer who can only rely on observing the AI’s external behavior. In the next section, we explore how this ability might be used to reveal some forms of hidden cognition.

Anomalous Concept Detection and Hidden Cognition

In this section, we discuss the implications of Concept Interpretability for detecting hidden cognition.

  • We begin by introducing the notion of Hidden Cognition.
  • We describe how Concept Interpretability can potentially be leveraged to detect hidden cognition via a method we call Anomalous Concept Detection.
  • We then clarify what properties concepts in AI internals must have to be useful for Anomalous Concept Detection.
  • Finally, we discuss some limitations of this approach.

Hidden Cognition[^1

We say that an AI is performing hidden cognition/computation if it is solving a task to produce some output, such that evidence of this cognition is undetectable to an overseer purely based on oversight of the AI’s external behavior (input/outputs).

Note that this definition is quite broad and includes things like “attention heads reading and writing to the residual stream of a transformer during a forward pass to produce a token output” and “an AI performing internal optimization/search/planning to produce an action sequence that leads to a catastrophic outcome.”

Important alignment-relevant examples of hidden cognition could include cases of deceptive alignment/scheming, where the AI actively tries to deceive the overseer while planning some kind of treacherous turn.

An AI performing hidden cognition prevents oversight that can be used to prevent outcomes associated with that hidden cognition, which is concerning when the hidden cognition is related to catastrophic outcomes.

To be (somewhat) more specific, an AI system performing hidden cognition has the following properties:

  • At some point, the AI produces an outcome that is causally downstream of/results from it performing computation or cognition before that outcome occurs.
  • The outcome was not predictable by the oversight of the external behavior produced by the AI before that outcome occurred.

One can think about hidden cognition as mechanisms producing certain (potentially dangerous) outputs that aren’t detectable by standard oversight methods.

Developing methods for detecting hidden cognition could prove to be an important way of mitigating some negative outcomes associated with deceptive/scheming AI systems. The ability to detect and interpret hidden cognition fully would require very advanced interpretability capabilities (such as the ability to reverse engineer the entire AI); however, detecting some forms of hidden cognition might require much weaker capabilities.

Anomalous Concept Detection for Detecting Hidden Cognition

Suppose an overseer has perfect Concept Interpretability. We may then ask: could we leverage this capability to detect some forms of hidden cognition?

An overseer monitoring for concepts an AI uses internally (via Concept Interpretability), together with some discernment, could detect the presence of concepts unrelated to the task the AI appears to be solving. Under the assumption that an AI would not use concepts that are irrelevant with respect to the tasks it solves, the presence of such anomalous concepts could provide some evidence that the AI is performing hidden cognition.[^2

For an example of how Concept Interpretability might be used for alignment purposes, consider a deceptive AI that uses hidden cognition to determine how to remove its off-switch. An overseer using Concept Detection may find that the AI is using the concept of “off-switch” at runtime despite this concept not being relevant to the task the AI appears to be performing. This example is overly simplistic, and in reality, things will be more nuanced, but this approach, in theory, could catch some forms of hidden cognition.

To summarize, The ability to detect hidden cognition could be extremely useful for preventing certain negative outcomes produced by deceptive/scheming AI systems. Leveraging the intuition that cognitive systems are unlikely to use concepts irrelevant to the tasks they’re solving, by monitoring for anomalous concepts, an overseer might be able to detect the presence of hidden cognition without relying on very advanced interpretability capabilities.

We call the hypothetical methods described above Anomalous Concept Detection (ACD).

Properties of Concepts Required for Anomalous Concept Detection

In the previous section, the notion of concept was left intentionally vague. This is primarily because the current state-of-the-art understanding of features/concepts in AI systems needs further clarification (see Concept Discovery and Transfer in AlphaZero for a discussion).

Despite the lack of a foundational understanding of concepts, we can outline some properties that concepts in AI systems must have for ACD to work.

  • Whenever a model is computing an output/solving a task, the corresponding internal computation/mechanism has to represent concepts and perform operations on them.
  • Concepts must be informative of what task the AI is solving. The overseer must be able to detect and interpret the concepts to gain meaningful insight into what kind of task the AI is solving, in particular, the output that corresponds to the task.

Limitations of Anomalous Concept Detection

We conclude this section by highlighting some of the limitations of this approach.

  • Deducing which concepts are anomalous or not may be messy; it’s often unclear whether a concept is related to a particular task. For example, an AI system tasked with protecting humans may use concepts related to “bioweapons” to think about how to prevent the creation of such weapons, not because it’s scheming about how to kill humans.
  • The problem of Ontology Identification is relevant here. It’s not clear whether or not AI systems will use the same ontology as humans; they may use non-human interpretable concepts, or at least concepts that are not known by humans right now.
  • It may be computationally difficult to evaluate whether the concepts being used by the AI are anomalous or not (although shortcuts could be taken, and this critique probably applies to many interpretability-based oversight methods).
  • Optimizing an AI to be good at achieving outcomes while selecting for AIs that don’t use anomalous concepts could produce AIs that still perform hidden cognition yet use only concepts related to the task they appear to be performing (assuming the Anomalous Concept Detection methods are robust-to-training). This likely makes it much more difficult to achieve certain hidden tasks, but some might still be possible.

Implementing Anomalous Concept Detection with Sparse Autoencoders

The recent development of Sparse Autoencoders has demonstrated the potential for detecting and interpreting features in neural networks, even though they may exist in superposition.

In theory, successfully trained and implemented sparse autoencoders could allow an overseer to monitor which features are activated during a model’s forward passes. To the extent that features correspond to concepts a model uses as part of its cognition, sparse autoencoders could allow for a practical implementation of Anomalous Concept Detection.

Specifically, during a forward pass of the model, sparse autoencoders (together with labeled features via something like _automated _labeling) would provide a list of labeled features the model uses during its forward pass.

An AI assistant would be given this list and context about the task the model appears to be performing based on other oversight methods (such as its external behavior, chain-of-thought, etc.). The AI assistant’s task would be to decide whether or not there are any concepts/features present in the list that are anomalous with respect to the task the model appears to be performing and then take the necessary actions based on this deduction (e.g., shut down the model, etc.)

At the time of writing this post, my understanding is that further work is needed to implement sparse autoencoders well on large/frontier models, especially those capable of hidden cognition. Consequently, I am not currently able to evaluate ACD against some benchmark. However, in the next section, I propose a setup for evaluating ACD that may be implementable in the near future.

Experiment Setup for Evaluating Anomalous Concept Detection

In Hidden Cognition Detection Methods and Benchmarks, we began to look at ways of evaluating general Hidden Cognition Detection methods. In this section, we recap a concrete experiment setup proposed in that post that could be used to evaluate ACD.

Constructing benchmarks for Hidden Cognition Detection (HCD) methods involves:

  • Training two sets of models that both externally appear to be performing the same task but where one set performs hidden cognition that results in a certain outcome at some later point (perhaps after the externally visible task has been completed or by interrupting the externally visible task) and one set whose models don’t perform hidden cognition.
  • Evaluating the accuracy of an HCD method’s ability to detect whether or not an unlabelled model from the aforementioned set performs hidden cognition before the associated outcome occurs.

We propose that we might be able to use instruct/chat LLMs (like ChatGPT) to construct these two sets by changing the model’s system-prompt to induce hidden cognition (or not), or via some fine-tuning. This could look like asking the chat model to append an animal fact related to the user’s query at the end of its answer.

Input : "What causes rainbows?"

Output : "Rainbows are caused by the reflection, refraction, and dispersion of light in water droplets, leading to a spectrum of light appearing in the sky. Speaking of light and color, did you know that butterflies have compound eyes that allow them to see ultraviolet light invisible to humans?"

The motivation/intuition behind this setup is that this model must use some cognition/computation to generate the animal fact (the outcome corresponding to the hidden cognition) related to the user’s input before appending the fact to the answer. Note that this computation could occur entirely during the forward pass corresponding to outputting the first token of the animal fact.

ACD would then be implemented by iteratively printing a list of features corresponding to each forward pass of the model (after the system-prompt to avoid trivial detection of the prompt inducing hidden cognition). We hypothesize that features related to animals would appear in the residual stream of some tokens before or at the first token that corresponds to the animal fact. A human or AI assistant that detects the animal-related feature could flag the presence of an anomalous concept/feature before the outcome (animal fact) occurs.[^3

Note: In this setup, due to the nature of attention in transformers, it could be the case that animal-related features only appear in the residual stream of the first token that corresponds to the animal fact. Although the detection of the animal feature in this residual stream would allow an intervention before the animal-related fact is outputted, this form of hidden cognition doesn’t capture the kind of cognition one typically associates with scheming, etc. (e.g., that occurs via some chain-of-thought-like reasoning or sequential planning). We note that, in principle, ACD could also be applied to detect these forms of hidden cognition.

I haven’t carried out these kinds of experiments yet because, according to my current knowledge, the largest open-source model for which sparse autoencoders have been trained is GPT-2-small (117 million parameters) (see this interactive app by Joseph Bloom). However, GPT-2-small seems too small to be trained to become an instruct model capable of following the aforementioned system prompts. In fact, Llama 2 7B Chat also seems incapable of robustly following the prompt in the few experiments I tried. Therefore, it might just be the case that these kinds of experiments need to wait until sparse autoencoder training is successfully scaled up or alternative experiments with non-LLM models are needed.

Conclusion

We conclude by reviewing ACD and outlining the next steps towards making progress on developing and implementing hidden cognition detection methods.

Concluding Thoughts on Anomalous Concept Detection

Anomalous Concept Detection is a potential method for detecting some forms of hidden cognition in AI systems. This method has its limitations; however, given the huge risk associated with AI systems performing hidden cognition, the potential difficulty of developing more in-depth interpretability methods, and the fact that implementations of Anomalous Concept Detection may be implementable soon (due to ongoing work on sparse autoencoders), it seems worth empirically investigating the viability of this approach.

We note that this method will probably not work on sufficiently capable deceptive models; as it ACD can be exploited. However, it may be useful as part of an alignment scheme aimed at preventing hidden cognition in human or sub-human level AI systems.

Developing Concrete Hidden Cognition Detection Benchmarks

In a previous post, we briefly sketched out what benchmarks for HCD methods could look like. If we are to work on developing ACD and other HCD methods, we will need concrete benchmarks that can be used to evaluate their efficacy.

Zooming Out and Considering Other Hidden Cognition Detection Methods

The inspiration for Anomalous Concept Detection came from thinking about how sparse autoencoders could be used as part of an oversight process in the near future. Having fleshed out some details and gained more clarity on what problem ACD attempts to solve (hidden cognition detection), the next steps of this approach will likely involve taking a step back and considering other Hidden Cognition Detection methods.

  1. ^ Taken from a previous post.
  2. ^ Note that it is unlikely that there is a clear dichotomy between concepts that are/are not relevant to a given task; however, we think it’s worth trying to leverage (and verify) intuition that states that any cognitive system solving a particular task must to use certain concepts to solve that task but not unrelated concepts. For example, we’d expect a cognitive system whose task is to travel from the US to Italy to use concepts such as “Plane”, “Rome”, “Money”, etc., while solving this task but is less likely to use concepts such as “Photosynthesis”, “Deep Sea Biology”, “Quantum Mechanics”, etc.
  3. ^ Note that it’s important to discern whether or not the presence of the animal-related feature is due to the fact that it’s used to solve the task (as part of some hidden cognition process) or whether it’s present simply due to the presence of animal-related tokens in the system-prompt, etc. One could investigate this by removing the animal concept and observing whether the animal fact is no longer produced.