MoRFI: Monotonic Sparse Autoencoder Feature Identification
An in-depth reading of a breakthrough causal-interpretability work on the knowledge robustness of large language models

📋 Paper information
Title: MoRFI: Monotonic Sparse Autoencoder Feature Identification
Authors: Dimitris Dimakopoulos (University of Edinburgh / DeepMind), Shay B. Cohen (University of Edinburgh), Ioannis Konstas (DeepMind / University of Edinburgh)
ArXiv ID: arXiv:2604.