Research
My research aims to:
- Understand whether AI agents form internal representations of their goals and, if so, what form those representations take.
- Develop interpretability-based methods for detecting internal goal representations and hidden cognition to facilitate oversight of autonomous AI agents.
- Understand how to formalize notions of care, and explore whether care is a natural value for agents to pursue.
Problem Statement
The rapid advancement of AI, particularly agentic systems that can adaptively pursue complex goals, poses significant risks. Many experts argue that mitigating AI risk should be a global priority alongside other societal-scale concerns. Ensuring that highly capable agentic AI systems pursue aligned goals is therefore imperative.
My Approach
My research aims to develop methods for detecting misaligned goals in AI agents before those goals are achieved, allowing for timely intervention. I hypothesize that an agent pursuing a complex goal in a novel environment must use information about the desired outcome in its computations. Detecting this internally represented goal information, or the computational processes that operate on it, may let us identify misaligned goals prior to their realization.
Specific Projects
Hidden Cognition Methods
- Develop methods using current interpretability tools to detect hidden cognition in AI agents.
- Create benchmarks to test these methods on frontier models.
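As a minimal illustration of the kind of interpretability tool involved, the sketch below fits a linear probe to decode a goal variable from an agent's hidden activations. The data here is synthetic (the activation model, goal directions, and noise level are all my assumptions standing in for real extracted hidden states), so this shows the probing technique only, not a result.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for hidden states: in a real study, 'acts' would be
# activations extracted from a trained agent while it pursues one of n_goals.
n, d, n_goals = 600, 32, 3
goals = rng.integers(0, n_goals, size=n)

# Assumption: each goal contributes a distinct direction to the hidden state.
goal_dirs = rng.normal(size=(n_goals, d))
acts = goal_dirs[goals] + 0.5 * rng.normal(size=(n, d))

# Fit a linear probe (one-vs-all least squares) on a training split.
train, test = slice(0, 400), slice(400, n)
Y = np.eye(n_goals)[goals]  # one-hot goal labels
W, *_ = np.linalg.lstsq(acts[train], Y[train], rcond=None)

# Decode goals from held-out activations; high accuracy would indicate
# that goal information is linearly represented in the hidden state.
pred = (acts[test] @ W).argmax(axis=1)
acc = (pred == goals[test]).mean()
print(f"probe accuracy: {acc:.2f}")
```

In practice the probe's accuracy would be compared against controls (shuffled labels, probes on untrained models) before concluding that the agent genuinely represents its goal.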
Theory of Internally-Represented Goals
- Formalize the intuition that certain conditions necessitate agents using goal information internally.
- Conduct theoretical work to understand the conditions that give rise to internally-represented goals.
- Perform empirical studies on toy models of agents to gain insights into goal representation.
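One minimal way this formalization might start (a sketch under simplifying assumptions, with goal G, internal state Z, and action A my notation):

```latex
% Assume the agent's actions depend on the goal only through its internal
% state, i.e. G -> Z -> A forms a Markov chain. The data-processing
% inequality then gives
\[
  I(G; A) \le I(G; Z).
\]
% If distinct goals require distinct actions in the same situation, then
% I(G; A) > 0, and hence the internal state must carry goal information:
\[
  I(G; Z) \ge I(G; A) > 0.
\]
```

The interesting theoretical questions start where this sketch ends: characterizing when the Markov-chain assumption holds, and how much goal information must be represented, not merely whether it is nonzero.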
AI-Based Agent Detection
- Build an AI-based agent detection system to empirically test this theory of agents.
- Develop tools for detecting agents and their goals in real-world scenarios.
Theory of Self-Reflective Agents and Care
- Explore the notion of care as a target goal for beneficial AI systems.
- Develop models for agents undergoing self-reflection and goal modification.
- Investigate the stability of care-based goals under self-reflection compared to other goal types.
Through these projects, I aim to contribute novel insights and methods to the field of AI Alignment/Safety, helping to ensure the development of beneficial AI systems aligned with human values.
See Blog for more research posts.