Been Kim, a research scientist at Google Brain, discusses her work on model interpretability and explainability in machine learning. Kim emphasizes the critical need to understand why AI models make certain decisions, noting that current tools for interpreting these models often fall short, showing weak correlations between their purported explanations and actual model behavior.
She proposes that studying AI as a new species through observational and controlled studies could offer deeper insights, using examples from multi-agent reinforcement learning to illustrate how unexpected, emergent behaviors can be identified.
Ultimately, Kim's research aims to foster more effective human-machine communication, enabling humans to learn from AI's superhuman capabilities, such as advanced chess strategies, and ensuring that AI development benefits humanity.
Been Kim emphasizes that there is a significant discrepancy between what machines truly know and what humans perceive them to know. This gap arises because machines operate in different representational spaces, possess different values, and have distinct experiences of the world compared to humans.
Example: AlphaGo's "Move 37" in the 2016 match against Lee Sedol, a move that stunned Go commentators and players because it was not something a human would conceptualize. Been Kim's dream is to learn new insights from machines by understanding such "superhuman" concepts.
Been Kim's research highlights that widely used interpretability tools like saliency maps, SHAP, and Integrated Gradients (IG) often fail to provide reliable insights into model behavior.
Example: Her team discovered that saliency maps for trained and untrained neural networks could be qualitatively and quantitatively indistinguishable, meaning random predictions yielded the same explanations as meaningful ones. Further work showed these methods could not reliably detect model errors or spurious correlations. They theoretically proved that these methods, when used for hypothesis testing about feature importance, perform no better than random guessing.
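A minimal sketch of this kind of sanity check, assuming a toy setup rather than the exact experiments from Kim's papers: compare vanilla gradient saliency maps from a model treated as trained with those from a randomly initialized copy. The tiny architecture, synthetic input, and rank-correlation metric below are all illustrative assumptions.

```python
# Sanity-check sketch: if saliency maps from a "trained" model and a randomly
# initialized one look alike, the explanation is not tracking learned behavior.
# Model, data, and metric are illustrative stand-ins, not the paper's setup.
import torch
import torch.nn as nn

def gradient_saliency(model, x, target_class):
    """Vanilla gradient saliency: |d(class score) / d(input)|."""
    x = x.clone().requires_grad_(True)
    score = model(x)[0, target_class]
    score.backward()
    return x.grad.abs().squeeze(0)

# Tiny stand-in classifier over 8x8 "images" (hypothetical architecture).
def make_model():
    return nn.Sequential(nn.Flatten(), nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 10))

torch.manual_seed(0)
x = torch.rand(1, 1, 8, 8)

model_a = make_model()          # stand-in for a trained model
model_b = make_model()          # untrained, randomly initialized model

sal_a = gradient_saliency(model_a, x, target_class=3).flatten()
sal_b = gradient_saliency(model_b, x, target_class=3).flatten()

# Spearman rank correlation between the two maps; high similarity to an
# untrained model's map is a red flag for the explanation method.
rank_a = sal_a.argsort().argsort().float()
rank_b = sal_b.argsort().argsort().float()
spearman = torch.corrcoef(torch.stack([rank_a, rank_b]))[0, 1]
print(f"Spearman rank correlation between saliency maps: {spearman.item():.3f}")
```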
A common assumption is that if you can locate specific factual knowledge within an LLM (localization), you can then effectively edit that knowledge. Been Kim's work challenges this assumption.
Example: Critiquing methods like ROME, which proposed localizing factual knowledge (e.g., "The Space Needle is in Seattle") in specific layers (such as layer 6) and then editing it there. However, her team found that knowledge is often stored across many different layers, and that the correlation between where knowledge is localized and the success of editing that knowledge is effectively zero, or even negative. Instead, the choice of layer for intervention was a far stronger determinant of editing success.
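A toy sketch of the evaluation question, not ROME itself: given per-layer localization scores and per-layer edit-success scores for a set of facts, how correlated are they? Every number below is a random placeholder; the sketch only shows how such a correlation could be computed.

```python
# Hypothetical measurements: how strongly each fact is "localized" at each
# layer (e.g., a causal-tracing effect) and how well editing at that layer
# changed the model's answer. Both arrays here are random placeholders.
import numpy as np

rng = np.random.default_rng(0)
n_facts, n_layers = 200, 24
localization = rng.random((n_facts, n_layers))   # stand-in localization scores
edit_success = rng.random((n_facts, n_layers))   # stand-in edit-success scores

def spearman(a, b):
    """Spearman rank correlation for 1-D arrays without ties."""
    ra, rb = a.argsort().argsort(), b.argsort().argsort()
    return np.corrcoef(ra, rb)[0, 1]

# For each fact, correlate (across layers) where it is localized with where
# editing works, then average. Independent scores give a value near zero,
# which is the pattern the talk reports for real localization vs. editing.
per_fact_corr = [spearman(localization[i], edit_success[i]) for i in range(n_facts)]
print("mean per-fact Spearman(localization, edit success):",
      round(float(np.mean(per_fact_corr)), 3))
```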
To understand complex AI behaviors, Been Kim proposes treating AI agents like "new species" and studying them through observational and controlled studies.
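One way to picture such a controlled study, treating the agent as a black box: hold everything fixed, intervene on a single environment variable, and compare the resulting behavior distributions. The agent, environment, and variable below are toy placeholders, not any system from Kim's work.

```python
# Controlled-study sketch on a black-box agent: identical trials except for one
# intervened variable, then compare the agent's action distributions.
import random
from collections import Counter

def toy_agent(observation):
    """Hypothetical policy that behaves differently when a resource is visible."""
    if observation["resource_visible"]:
        return random.choices(["approach", "wait", "flee"], weights=[0.7, 0.2, 0.1])[0]
    return random.choices(["approach", "wait", "flee"], weights=[0.2, 0.6, 0.2])[0]

def run_condition(resource_visible, n_trials=1000):
    random.seed(0)   # same random stream in both conditions: only the variable changes
    actions = [toy_agent({"resource_visible": resource_visible}) for _ in range(n_trials)]
    return Counter(actions)

print("resource visible:", run_condition(True))
print("resource hidden: ", run_condition(False))
```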
Been Kim's ongoing work aims to understand and potentially teach humans new superhuman strategies by studying AI chess programs.
Example: Analyzing AlphaZero, a self-trained chess AI, which mastered opening strategies very different from human approaches. The goal is to discover new strategies within AlphaZero's embedding space and evaluate whether human grandmasters such as Magnus Carlsen can learn and apply these concepts by solving specific puzzles.
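A conceptual sketch of concept mining in an embedding space, under heavy assumptions: the embeddings and labels below are random stand-ins rather than AlphaZero's actual internals, and the "difference of means" probe is just one simple way to extract a candidate direction and surface positions that could serve as teaching puzzles.

```python
# Concept-mining sketch: find a direction in a position-embedding space where
# the engine's play diverges from human play, then pick the positions that
# express that direction most strongly as candidate puzzles. All data is fake.
import numpy as np

rng = np.random.default_rng(0)
n_positions, dim = 1000, 64
embeddings = rng.normal(size=(n_positions, dim))       # hypothetical position embeddings
agrees_with_human = rng.random(n_positions) > 0.3      # hypothetical agreement labels

# Candidate concept direction: mean embedding where the engine diverges from
# human play minus mean embedding where it agrees (difference-of-means probe).
direction = embeddings[~agrees_with_human].mean(0) - embeddings[agrees_with_human].mean(0)
direction /= np.linalg.norm(direction)

# Score every position along the direction; the strongest examples become
# candidate puzzles for human players (e.g., grandmasters) to attempt.
scores = embeddings @ direction
candidate_puzzles = np.argsort(scores)[-10:][::-1]
print("candidate puzzle position indices:", candidate_puzzles)
```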