Research

My research aims to:

  1. Understand whether AI agents have internal representations of goals and, if so, what form those representations take.
  2. Develop interpretability-based methods for detecting internal goal representations and hidden cognition to facilitate oversight of autonomous AI agents.
  3. Understand how to formalize notions of care and explore the naturalness of care as a value for agents to pursue.

Problem Statement 

The rapid advancement of AI, particularly of agentic systems that can adaptively pursue complex goals, poses significant risks. Many experts now consider mitigating AI risk a priority on par with other global-scale concerns. Ensuring that highly capable agentic AI systems pursue aligned goals is therefore imperative.

My Approach

My research aims to develop methods for detecting misaligned goals in AI agents before those goals are achieved, so that we can intervene. I hypothesize that an agent pursuing a complex goal in a novel environment must use information about the desired outcome in its computations. Detecting this internally represented goal information, or the computational processes associated with it, may let us identify misaligned goals before they are realized.
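One way to make this concrete is to probe for goal information directly in an agent's activations. The sketch below is a minimal, illustrative version of that idea using synthetic data; the network, layer, and goal labels are hypothetical stand-ins, not an actual experimental setup.

```python
# Minimal sketch of the probing idea, with synthetic data standing in for a
# real agent's activations (the layer and goal labels are hypothetical).
# If an agent must carry goal information in its computations, a simple
# linear probe trained on hidden activations should recover the pursued
# goal better than chance.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples, hidden_dim, n_goals = 2000, 64, 4

# Simulate "goal information" as a small goal-dependent direction plus noise.
goal_directions = rng.normal(size=(n_goals, hidden_dim))
goals = rng.integers(0, n_goals, size=n_samples)
activations = 0.5 * goal_directions[goals] + rng.normal(size=(n_samples, hidden_dim))

X_train, X_test, y_train, y_test = train_test_split(
    activations, goals, test_size=0.25, random_state=0
)

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"probe accuracy: {probe.score(X_test, y_test):.2f} (chance = {1 / n_goals:.2f})")
# Accuracy well above chance suggests linearly decodable goal information in
# that layer; near-chance accuracy does not rule out nonlinear encodings.
```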

Specific Projects

Hidden Cognition Methods

Theory of Internally-Represented Goals

AI-Based Agent Detection

  • Build an AI-based agent detection system to test our theory of agents.
  • Develop tools for detecting agents and their goals in real-world scenarios (see the toy sketch after this list).
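As a toy illustration of what such a detection pipeline might look like in miniature, the sketch below asks whether an observed gridworld trajectory is better explained by noisily rational pursuit of some candidate goal than by random behavior. The gridworld, candidate goals, and threshold are illustrative assumptions, not a description of the planned system.

```python
# Toy sketch of agent and goal detection, under illustrative assumptions:
# an observed trajectory in a small gridworld is scored against (a) noisily
# rational pursuit of each candidate goal and (b) a uniform random walker.
# A large likelihood margin in favor of some goal is treated as evidence of
# goal-directed (agentic) behavior toward that goal.
import numpy as np

SIZE = 5
ACTIONS = {0: (0, 1), 1: (0, -1), 2: (1, 0), 3: (-1, 0)}  # right, left, down, up

def step(state, action):
    r, c = state
    dr, dc = ACTIONS[action]
    return (min(max(r + dr, 0), SIZE - 1), min(max(c + dc, 0), SIZE - 1))

def goal_logprob(trajectory, goal, beta=3.0):
    """Log-likelihood of the observed actions under a Boltzmann-rational
    policy that prefers actions reducing Manhattan distance to `goal`."""
    logp = 0.0
    for state, action in trajectory:
        utilities = np.array([
            -abs(step(state, a)[0] - goal[0]) - abs(step(state, a)[1] - goal[1])
            for a in ACTIONS
        ])
        logits = beta * utilities
        logp += logits[action] - (logits.max() + np.log(np.exp(logits - logits.max()).sum()))
    return logp

def detect_agent(trajectory, candidate_goals, margin=3.0):
    """Return (looks_agentic, best_goal) by comparing the best goal-directed
    explanation against a uniform random-walk baseline."""
    random_logp = len(trajectory) * np.log(1 / len(ACTIONS))
    scored = {g: goal_logprob(trajectory, g) for g in candidate_goals}
    best_goal, best_logp = max(scored.items(), key=lambda kv: kv[1])
    return best_logp - random_logp > margin, best_goal

# Hypothetical observation: an actor walking from (0, 0) toward (4, 4).
trajectory = [((0, 0), 0), ((0, 1), 2), ((1, 1), 0), ((1, 2), 2),
              ((2, 2), 2), ((3, 2), 0), ((3, 3), 2), ((4, 3), 0)]
print(detect_agent(trajectory, [(4, 4), (0, 4), (4, 0)]))
```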

Theory of Self-Reflective Agents and Care

  • Explore the notion of care as a target goal for beneficial AI systems.
  • Develop models for agents undergoing self-reflection and goal modification.
  • Investigate the stability of care-based goals under self-reflection compared to other goal types.

Through these projects, I aim to contribute novel insights and methods to the field of AI Alignment/Safety, helping to ensure the development of beneficial AI systems aligned with human values.

See Blog for more research posts.