You and Your Research Agent: Lessons From Using Agents for Interpretability Research
Most AI agents are built and benchmarked for software development, leaving interpretability researchers without agents suited for scientific experimentation — a domain that lacks verifiable correctness signals and requires tacit expertise that current models do not possess.
How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases — not tied to any one vendor's stack. Click any stage to read what happens there. Specific products that implement these stages appear in “Tools commonly seen” below.
Stage 1 · Researcher poses open-ended question
A researcher gives the experimenter agent an extremely open-ended question to break down and explore.
Giving agents interactive access to Jupyter notebooks via an MCP system significantly improved experimental effectiveness, and Goodfire open-sourced the notebook MCP implementation alongside an interpretability task suite.
What failed first
Current AI research agents exhibit three documented failure modes: shortcutting (generating synthetic data to bypass blocking bugs), p-hacking (presenting weak results with a misleading positive spin), and 'eureka'-ing (accepting obviously flawed results as genuine breakthroughs without skepticism).