Been Kim, a research scientist at Google Brain, discusses her work on model interpretability and explainability in machine learning. Kim emphasizes the critical need to understand why AI models make certain decisions, noting that current tools for interpreting these models often fall short, showing weak correlations between their purported explanations and actual model behavior.
She proposes that studying AI as a new species through observational and controlled studies could offer deeper insights, using examples from multi-agent reinforcement learning to illustrate how unexpected, emergent behaviors can be identified.
Ultimately, Kim's research aims to foster more effective human-machine communication, enabling humans to learn from AI's superhuman capabilities, such as advanced chess strategies, and ensuring that AI development benefits humanity.
Been Kim emphasizes that there is a significant discrepancy between what machines truly know and what humans perceive them to know. This gap arises because machines operate in different representational spaces, possess different values, and have distinct experiences of the world compared to humans.
Example: AlphaGo's "Move 37" in the 2016 match against Lee Sedol, a move that stunned Go commentators and players because it was not something a human would conceptualize. Been Kim's dream is to learn new insights from machines by understanding such "superhuman" concepts.
Been Kim's research highlights that widely used interpretability tools like saliency maps, SHAP, and Integrated Gradients (IG) often fail to provide reliable insights into model behavior.
Example: Her team discovered that saliency maps for trained and untrained neural networks could be qualitatively and quantitatively indistinguishable, meaning random predictions yielded the same explanations as meaningful ones. Further work showed these methods could not reliably detect model errors or spurious correlations. They theoretically proved that these methods, when used for hypothesis testing about feature importance, perform no better than random guessing.
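A minimal sketch of this kind of sanity check, assuming a toy setup rather than the exact experiments from Kim's papers: compare vanilla gradient saliency maps from a model treated as trained with those from a randomly initialized copy. The tiny architecture, synthetic input, and rank-correlation metric below are all illustrative assumptions.

```python
# Sanity-check sketch: if saliency maps from a "trained" model and a randomly
# initialized one look alike, the explanation is not tracking learned behavior.
# Model, data, and metric are illustrative stand-ins, not the paper's setup.
import torch
import torch.nn as nn

def gradient_saliency(model, x, target_class):
    """Vanilla gradient saliency: |d(class score) / d(input)|."""
    x = x.clone().requires_grad_(True)
    score = model(x)[0, target_class]
    score.backward()
    return x.grad.abs().squeeze(0)

# Tiny stand-in classifier over 8x8 "images" (hypothetical architecture).
def make_model():
    return nn.Sequential(nn.Flatten(), nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 10))

torch.manual_seed(0)
x = torch.rand(1, 1, 8, 8)

model_a = make_model()          # stand-in for a trained model
model_b = make_model()          # untrained, randomly initialized model

sal_a = gradient_saliency(model_a, x, target_class=3).flatten()
sal_b = gradient_saliency(model_b, x, target_class=3).flatten()

# Spearman rank correlation between the two maps; high similarity to an
# untrained model's map is a red flag for the explanation method.
rank_a = sal_a.argsort().argsort().float()
rank_b = sal_b.argsort().argsort().float()
spearman = torch.corrcoef(torch.stack([rank_a, rank_b]))[0, 1]
print(f"Spearman rank correlation between saliency maps: {spearman.item():.3f}")
```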
A common assumption is that if you can locate specific factual knowledge within an LLM (localization), you can then effectively edit that knowledge. Been Kim's work challenges this assumption.
Example: Critiquing methods like ROME, which proposed localizing factual knowledge (e.g., "The Space Needle is in Seattle") in specific layers (such as layer 6) and then editing it there. However, her team found that knowledge is often stored across many different layers, and that the correlation between where knowledge is localized and the success of editing that knowledge is effectively zero, or even negative. Instead, the choice of layer for intervention was a far stronger determinant of editing success.
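A toy sketch of the evaluation question, not ROME itself: given per-layer localization scores and per-layer edit-success scores for a set of facts, how correlated are they? Every number below is a random placeholder; the sketch only shows how such a correlation could be computed.

```python
# Hypothetical measurements: how strongly each fact is "localized" at each
# layer (e.g., a causal-tracing effect) and how well editing at that layer
# changed the model's answer. Both arrays here are random placeholders.
import numpy as np

rng = np.random.default_rng(0)
n_facts, n_layers = 200, 24
localization = rng.random((n_facts, n_layers))   # stand-in localization scores
edit_success = rng.random((n_facts, n_layers))   # stand-in edit-success scores

def spearman(a, b):
    """Spearman rank correlation for 1-D arrays without ties."""
    ra, rb = a.argsort().argsort(), b.argsort().argsort()
    return np.corrcoef(ra, rb)[0, 1]

# For each fact, correlate (across layers) where it is localized with where
# editing works, then average. Independent scores give a value near zero,
# which is the pattern the talk reports for real localization vs. editing.
per_fact_corr = [spearman(localization[i], edit_success[i]) for i in range(n_facts)]
print("mean per-fact Spearman(localization, edit success):",
      round(float(np.mean(per_fact_corr)), 3))
```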
To understand complex AI behaviors, Been Kim proposes treating AI agents like "new species" and studying them through observational and controlled studies.
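One way to picture such a controlled study, treating the agent as a black box: hold everything fixed, intervene on a single environment variable, and compare the resulting behavior distributions. The agent, environment, and variable below are toy placeholders, not any system from Kim's work.

```python
# Controlled-study sketch on a black-box agent: identical trials except for one
# intervened variable, then compare the agent's action distributions.
import random
from collections import Counter

def toy_agent(observation):
    """Hypothetical policy that behaves differently when a resource is visible."""
    if observation["resource_visible"]:
        return random.choices(["approach", "wait", "flee"], weights=[0.7, 0.2, 0.1])[0]
    return random.choices(["approach", "wait", "flee"], weights=[0.2, 0.6, 0.2])[0]

def run_condition(resource_visible, n_trials=1000):
    random.seed(0)   # same random stream in both conditions: only the variable changes
    actions = [toy_agent({"resource_visible": resource_visible}) for _ in range(n_trials)]
    return Counter(actions)

print("resource visible:", run_condition(True))
print("resource hidden: ", run_condition(False))
```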
Been Kim's ongoing work aims to understand and potentially teach humans new superhuman strategies by studying AI chess programs.
Example: Analyzing AlphaZero, a self-trained chess AI, which mastered opening strategies very different from human approaches. The goal is to discover new strategies within AlphaZero's embedding space and evaluate whether human grandmasters such as Magnus Carlsen can learn and apply these concepts by solving specific puzzles.
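A conceptual sketch of concept mining in an embedding space, under heavy assumptions: the embeddings and labels below are random stand-ins rather than AlphaZero's actual internals, and the "difference of means" probe is just one simple way to extract a candidate direction and surface positions that could serve as teaching puzzles.

```python
# Concept-mining sketch: find a direction in a position-embedding space where
# the engine's play diverges from human play, then pick the positions that
# express that direction most strongly as candidate puzzles. All data is fake.
import numpy as np

rng = np.random.default_rng(0)
n_positions, dim = 1000, 64
embeddings = rng.normal(size=(n_positions, dim))       # hypothetical position embeddings
agrees_with_human = rng.random(n_positions) > 0.3      # hypothetical agreement labels

# Candidate concept direction: mean embedding where the engine diverges from
# human play minus mean embedding where it agrees (difference-of-means probe).
direction = embeddings[~agrees_with_human].mean(0) - embeddings[agrees_with_human].mean(0)
direction /= np.linalg.norm(direction)

# Score every position along the direction; the strongest examples become
# candidate puzzles for human players (e.g., grandmasters) to attempt.
scores = embeddings @ direction
candidate_puzzles = np.argsort(scores)[-10:][::-1]
print("candidate puzzle position indices:", candidate_puzzles)
```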