Article List
Explore the latest news, discover interesting content, and dive deep into topics that interest you
Reward Hacking Research Update
Interim report on ongoing work on reward hacking...
Pretraining Data Filtering for Open-Weight AI Safety
Announcing Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs...
Attention Probes
Adding attention to linear probes...
Research Update: Applications of Local Volume Measurement
Research update on applying local volume measurement to downstream tasks...
Studying inductive biases of random networks via local volu…
In this post, we will study inductive biases of the parameter-function map of random neural networks using star domain volume estimates. This builds o...
The Common Pile v0.1
Announcing the Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text...
Product Key Memory Sparse Coders
Using Product Key Memories to encode sparse coder features...
SAEs trained on the same data don’t learn the same features
In this post, we show that when two TopK SAEs are trained on the same data, with the same batch order but with different random initializations, there...
Partially rewriting an LLM in natural language
Using interpretations of SAE latents to simulate activations....
Third-party evaluation to identify risks in LLMs’ training …
An overview of the minetester and preliminary work...
Mechanistic Anomaly Detection Research Update 2
Interim report on ongoing work on mechanistic anomaly detection...
RLHF and RLAIF in GPT-NeoX
GPT-NeoX now supports post-training thanks to a collaboration with SynthLabs....
The Practitioner's Guide to the Maximal Update Parameteriza…
Exploring the implementation details of muTransfer...
Mechanistic Anomaly Detection Research Update
Interim report on ongoing work on mechanistic anomaly detection...
Open Source Automated Interpretability for Sparse Autoencod…
Building and evaluating an open-source pipeline for auto-interpretability...
Experiments in Weak-to-Strong Generalization
Writing up results from a recent project...
Free Form Least-Squares Concept Erasure Without Oracle Conc…
Achieving even more surgical edits than LEACE without concept labels at inference time....
VINC-S: Closed-form Optionally-supervised Knowledge Elicita…
Writing up results from a project from Spring 2023...