Automating Interpretability with ChatGPT
A project for the BlueDot Alignment course exploring whether LLMs can automatically explain neural network behavior, tested on the XOR problem and the MNIST dataset.
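The core loop is small enough to sketch: train a tiny network, collect per-input activations, and ask an LLM to describe what each neuron computes. Below is a minimal illustration on XOR, assuming the OpenAI Python SDK (`pip install openai`); the hand-trained network, prompt wording, and model name are illustrative choices, not the project's exact setup.

```python
# Minimal sketch of the automated-interpretability loop on XOR.
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment;
# the model name below is an assumption, not the project's exact choice.
import numpy as np
from openai import OpenAI

# Tiny 2-4-1 MLP trained on XOR by hand so the example is self-contained.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Plain gradient descent on squared error.
for _ in range(10000):
    h = sigmoid(X @ W1 + b1)        # hidden activations
    out = sigmoid(h @ W2 + b2)      # network output
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= 0.5 * h.T @ d_out
    b2 -= 0.5 * d_out.sum(axis=0)
    W1 -= 0.5 * X.T @ d_h
    b1 -= 0.5 * d_h.sum(axis=0)

# Show the LLM each hidden neuron's activations on every input and ask
# for a one-sentence explanation per neuron.
h = sigmoid(X @ W1 + b1)
records = "\n".join(
    f"input={x.astype(int).tolist()} activations={np.round(a, 3).tolist()}"
    for x, a in zip(X, h)
)
prompt = (
    "A hidden layer's activations on every XOR input:\n"
    f"{records}\n"
    "In one sentence per neuron, what feature does each neuron compute?"
)

client = OpenAI()
reply = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model; any chat model works here
    messages=[{"role": "user", "content": prompt}],
)
print(reply.choices[0].message.content)
```

The same pattern scales to MNIST by swapping in a larger network and summarizing which inputs most strongly activate each neuron, since the full activation table no longer fits in a prompt.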
More Coming Soon
Future projects will appear here as they're completed. Stay tuned for more AI safety research and experiments.