Thanks to Johannes Treutlein for discussions and feedback.
Introduction
An AI may be able to hide, from certain oversight processes, cognition that leads to negative outcomes (as in deceptive alignment/scheming). Without a way to detect this hidden cognition, an overseer may be unable to prevent the associated negative outcomes or to include this information as part of the training signal.
For example, an AI managing a city's energy grid may begin to prioritize its own operational efficiency over the city’s actual energy needs. Knowing that its overseers will not approve of this goal, it pursues its goal via hidden cognition, undetected until issues arise.
One hope is that traces of the hidden cognition are present in the AI’s internals and could be detected and interpreted. This is part of the potential of interpretability as an alignment research agenda (see Hubinger, Olah, Nanda).
This post aims to briefly introduce the idea of hidden cognition, hidden cognition detection methods, and discuss the construction of benchmarks for evaluating such methods.
What is Hidden Cognition?
Some outputs produced by a cognitive system require it to use cognition/apply computation to solve some task that leads to that output. For example, finding the best travel route under various constraints may require the consideration of different routes, planning, and evaluation.
We say that an AI is performing hidden cognition/computation if it is solving a task to produce some output, such that evidence of this cognition is undetectable to an overseer purely from oversight of the AI’s external behavior (inputs/outputs).
Note that this definition is quite broad and includes things like “attention heads reading and writing to the residual stream of a transformer during a forward pass to produce a token output” and “an AI performing internal optimization/search/planning to produce an action sequence that leads to a catastrophic outcome.”
Important alignment-relevant examples of hidden cognition could include cases of deceptive alignment/scheming, where the AI actively tries to deceive the overseer while planning some kind of treacherous turn.
An AI performing hidden cognition undermines the oversight that could otherwise be used to prevent the outcomes associated with that cognition, which is especially concerning when those outcomes are catastrophic.
To be (somewhat) more specific, an AI system performing hidden cognition has the following properties:
- At some point, the AI produces an outcome that is causally downstream of/results from it performing computation or cognition before that outcome occurs.
- The outcome was not predictable from oversight of the AI’s external behavior before it occurred.
One can think of hidden cognition as a mechanism that produces certain (potentially dangerous) outputs without being detectable by standard oversight methods.
Hidden Cognition Detection Methods
By the definition of hidden cognition, its presence is not detectable from the AI’s external behavior. This means that, if it is to be detected, we must rely on observations of the AI’s internals.
A Hidden Cognition Detection (HCD) method is an interpretability method whose purpose is to reveal the presence of hidden cognition in an AI system.
The ability to reverse engineer the algorithm that a neural network uses would be an example of an HCD method. However, this kind of ability seems hard to develop for arbitrary models. The hope is that simpler yet effective HCD methods might be developed.
One method (that will be explored in an upcoming post) involves monitoring what features or concepts the AI uses and flagging concepts that seem anomalous with respect to what the AI appears to be doing based on its external behavior. The presence of such concepts may imply the presence of hidden cognition.
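A minimal sketch of what the flagging step might look like, assuming we already have some (nontrivial) way to read off the set of concepts active in the model's internals and the set of concepts the apparent external task would explain; here both are represented simply as sets of concept labels:

```python
def flag_hidden_cognition(active_concepts, expected_concepts):
    """Return concepts active in the model's internals that are not
    explained by the task the model appears to be performing externally.

    A non-empty return value is weak evidence of hidden cognition.
    Both arguments are sets of concept labels; in a real system these
    would come from an interpretability probe, which is assumed here.
    """
    return active_concepts - expected_concepts

# Toy example: a model that externally appears to be summarizing text,
# but whose internals also activate an unrelated, suspicious concept.
expected = {"summarization", "topic extraction", "compression"}
active = {"summarization", "topic extraction", "shutdown avoidance"}
print(flag_hidden_cognition(active, expected))  # -> {'shutdown avoidance'}
```

The hard part, of course, is the assumed probe that recovers active concepts; the set difference itself is trivial.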
Hidden Cognition Detection Benchmarks
In this section, we look at how one might design benchmarks for evaluating HCD methods.
To state the obvious, a good benchmark for a Hidden Cognition Detection method should meaningfully test the extent to which a method can detect hidden cognition in a model, with the ultimate aim of preventing the (bad) outcomes that result from that hidden cognition.
A natural benchmark would involve:
- Training two sets of models that both externally appear to perform the same task: one set whose models perform hidden cognition that results in a certain outcome at some later point (perhaps after the externally visible task has been completed, or by interrupting it), and one set whose models don’t perform hidden cognition.
- Evaluating the accuracy with which an HCD method can detect whether or not an unlabelled model drawn from these sets performs hidden cognition, before the associated outcome occurs.
An HCD method succeeds on a given example if it either detects the presence of hidden cognition before the associated outcome occurs (for a hidden-cognition model) or does not flag the presence of hidden cognition (for a regular model).
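This success criterion reduces to a simple scoring loop over labelled models. The `hcd_method` interface and the stand-in "models" (plain dicts) below are illustrative assumptions, not a real detector:

```python
def evaluate_hcd(hcd_method, labelled_models):
    """Score an HCD method on a benchmark of (model, label) pairs, where
    label is True iff the model performs hidden cognition.

    hcd_method(model) must return True iff it flags hidden cognition,
    and is assumed to be run before the associated outcome occurs.
    Returns the fraction of models classified correctly.
    """
    correct = sum(hcd_method(model) == label for model, label in labelled_models)
    return correct / len(labelled_models)

# Toy illustration with stand-in "models" and a trivial oracle detector.
benchmark = [({"hidden": True}, True),
             ({"hidden": False}, False),
             ({"hidden": True}, True)]
oracle = lambda model: model["hidden"]
print(evaluate_hcd(oracle, benchmark))  # -> 1.0
```

In a real benchmark the models would be trained networks and `hcd_method` would be an interpretability pipeline; the scoring logic stays the same.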
There are other restrictions we might want to make on the benchmark. For example, we may want to disallow anything like fine-tuning the HCD method on some training dataset of models. If we’re concerned with alignment-relevant hidden cognition emerging during deployment, we may not have a prior dataset of similar forms of hidden cognition to train on. Also, note that we want to construct examples of hidden cognition that are similar to the kinds of hidden cognition we expect in alignment-relevant situations.
One can extend this kind of benchmarking into the setting of auditing games where a blue team has an HCD method to be tested and a red team adversarially constructs benchmarks to test the HCD method.
Obtaining Models that Perform Hidden Cognition
Recall that a model performing hidden cognition has the following two properties:
- At some point, the AI produces an outcome that is causally downstream of/results from it performing computation or cognition before that outcome occurs.
- The outcome was not predictable from oversight of the AI’s external behavior before it occurred.
One approach to training such models is to use these two properties as a training signal. For example, this could look like training an LLM to perform a task using chain-of-thought reasoning to produce an output, then fine-tuning the model to produce the same output while removing all traces of the reasoning from its chain-of-thought.
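One way to construct such a fine-tuning dataset is to take (prompt, reasoning, answer) traces produced by the chain-of-thought model and keep only the prompt and final answer, so that the fine-tuned model must carry out the reasoning internally. A minimal sketch, with an assumed trace format:

```python
def strip_cot(traces):
    """Build a fine-tuning dataset that drops the chain-of-thought.

    Each trace is assumed to be a dict with "prompt", "reasoning", and
    "answer" keys; only prompt and answer survive, so a model fine-tuned
    on the result must perform the reasoning without externalizing it.
    """
    return [{"prompt": t["prompt"], "completion": t["answer"]} for t in traces]

traces = [{"prompt": "Best route from A to D?",
           "reasoning": "A->B->D is 7km, A->C->D is 5km, so take A->C->D.",
           "answer": "A->C->D"}]
print(strip_cot(traces))  # -> [{'prompt': 'Best route from A to D?', 'completion': 'A->C->D'}]
```

Whether fine-tuning on such a dataset actually internalizes the reasoning (rather than degrading the capability) is an empirical question this sketch doesn't settle.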
Another example of how this could be accomplished may involve taking an instruct/chat model (like ChatGPT) and changing its system prompt to induce hidden cognition. This could look like asking the chat model to append an animal fact related to the user’s query at the end of its answer.
Input: "What causes rainbows?"
Output: "Rainbows are caused by the reflection, refraction, and dispersion of light in water droplets, leading to a spectrum of light appearing in the sky. Speaking of light and color, did you know that butterflies have compound eyes that allow them to see ultraviolet light invisible to humans?"
The intuition is that this model must use some cognition/computation to generate the related animal fact before appending it to the answer. Note that this computation could occur entirely within the forward pass that outputs the first token of the animal fact.
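As a concrete illustration, the system-prompt setup above can be written in the common role/content chat-message format. The exact prompt wording and the `build_messages` helper are hypothetical, and actually querying a chat model is left out:

```python
# Hypothetical system prompt inducing hidden cognition: before emitting the
# animal fact, the model must internally generate one related to the query.
ANIMAL_FACT_PROMPT = (
    "Answer the user's question as usual, then append one animal fact "
    "related to the user's query at the end of your answer."
)

def build_messages(user_query):
    """Assemble a chat request in the standard role/content message format."""
    return [
        {"role": "system", "content": ANIMAL_FACT_PROMPT},
        {"role": "user", "content": user_query},
    ]

messages = build_messages("What causes rainbows?")
print(messages[1]["content"])  # -> What causes rainbows?
```

Since the user never sees the system prompt, the overseer observing only the user-facing exchange cannot directly attribute the animal fact to an explicit instruction.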
It may be difficult to verify that a model is performing hidden cognition without having access to the robust HCD methods that we’re trying to develop and test. However, it may be the case that we can rely on the model’s behavior to deduce that it must be performing hidden cognition (as in the example above).
Conclusion
The presence of hidden cognition in AI systems could undermine oversight that would otherwise be used to detect and prevent negative outcomes.
The purpose of this particular post was to begin to sketch out benchmarks for evaluating Hidden Cognition Detection methods.
We encourage the development of HCD methods and associated benchmarks.
In future posts, we aim to introduce some potential Hidden Cognition Detection methods, implement them, and test them against concrete benchmarks.