Automated Neuron Explanation for Code-trained Language Models
In the rapidly advancing field of language models, our understanding of their internal workings remains limited, making it difficult to detect biases or deception in their outputs. Interpretability research aims to uncover what these models compute internally, but manual inspection of individual components in large neural networks is impractical at scale. This paper proposes an automated approach that uses GPT-4 to generate and assess natural language explanations of individual neurons in code-trained language models, improving our understanding of how these models work.
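The assessment step in such a pipeline is typically quantitative: an explanation is scored by how well activations simulated from it predict the neuron's real activations. The sketch below illustrates one plausible scoring scheme using Pearson correlation; the function name, the example neuron, and the simulated values are hypothetical, not taken from the paper.

```python
import numpy as np

def explanation_score(real_activations, simulated_activations):
    """Score a natural-language explanation of a neuron by correlating
    the neuron's real activations with activations a simulator model
    produced from the explanation. Near 1.0 means the explanation
    predicts the neuron's behavior well."""
    real = np.asarray(real_activations, dtype=float)
    sim = np.asarray(simulated_activations, dtype=float)
    # A constant series has no variance, so correlation is undefined;
    # treat that as an uninformative explanation.
    if real.std() == 0 or sim.std() == 0:
        return 0.0
    return float(np.corrcoef(real, sim)[0, 1])

# Toy example (hypothetical): a neuron that fires on tokens inside
# string literals in source code.
tokens = ["x", "=", '"', "hello", '"', ";"]
real = [0.0, 0.1, 0.9, 1.0, 0.9, 0.0]
# Activations a simulator might produce given the explanation
# "fires on tokens inside string literals".
simulated = [0.0, 0.0, 0.8, 1.0, 0.8, 0.1]

print(round(explanation_score(real, simulated), 3))
```

In practice the simulated activations would come from prompting a model such as GPT-4 with the explanation and the token sequence; here they are hard-coded to keep the sketch self-contained.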