Automating interpretability with ChatGPT

This is a project for the BlueDot Alignment course that attempts to automate interpretability: using ChatGPT to fully explain small neural networks. The approach is tried on two problems: the XOR problem and the MNIST dataset.
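For readers unfamiliar with the XOR problem: it is the classic example of a task no single-layer network can solve, making it the smallest interesting network to interpret. As a minimal sketch (illustrative only, not the project's actual code), the following trains a tiny 2-4-1 sigmoid network on XOR with plain gradient descent; the architecture, seed, and hyperparameters are arbitrary choices for this demo.

```python
import numpy as np

rng = np.random.default_rng(0)

# The four XOR input/output pairs.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# Small 2-4-1 network with random initial weights.
W1 = rng.normal(0, 1, (2, 4))
b1 = np.zeros(4)
W2 = rng.normal(0, 1, (4, 1))
b2 = np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 0.5
for _ in range(10000):
    # Forward pass.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass for mean squared error.
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # Full-batch gradient step.
    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h
    b1 -= lr * d_h.sum(axis=0)

# Threshold the network's outputs to get binary predictions.
preds = (sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2) > 0.5).astype(int)
print(preds.ravel())
```

After training, the four predictions should match the XOR truth table; interpreting what each of the four hidden units has learned is exactly the kind of question the project asks ChatGPT to answer.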

I expect to provide more background than most readers will need, since the course covers the basics quickly and I don’t want to leave anyone behind. Conversely, if things start out too simple, feel free to skip ahead. I recommend reading the introduction first, then the section on the XOR problem, which is simpler and introduces concepts reused for the MNIST dataset. The testing section should be read last.

The code used for the project can be found in the GitHub repository hosting this website.
