I think what the authors were aiming for is a proof-of-concept: an attempt to demonstrate that interpretability can (to a degree) be automated. Mechanistic interpretability is challenging because it does not scale well at the moment, and there is an ongoing debate about whether localized structural discoveries on toy examples actually translate to patterns in large networks. My guess is that an automatic explainer system would let you flag problems and find issues faster, essentially serving as a meta-heuristic for further investigation.
Unfortunately, the title hypes it up, and as always, once you read the paper the results are less impressive. But that is the current state of AI research, speaking as a researcher myself.
In a similar vein: https://openai.com/index/language-models-can-explain-neurons...