Guide Labs' Steerling-8B Cracks the Interpretability Puzzle—And Actually Performs
Forget the black box. Guide Labs just shipped an 8B parameter model where every token traces back to its origin—and it's competitive with models trained on 2-7x more data. This is what mechanistic interpretability looks like when engineering actually works.
For years, mechanistic interpretability was a research agenda: reverse-engineer how neural networks think after the fact. Anthropic built "microscopes" to look inside Claude. Researchers published papers on circuit analysis and feature attribution. It was brilliant, rigorous, and ultimately insufficient—bolting interpretability onto an already-trained model is like trying to understand a chess grandmaster by watching their eye movements.
Guide Labs just shipped a different philosophy: build interpretability *into the architecture from the ground up*. They're calling it Steerling-8B, and it's technically clean, practically useful, and—most importantly—doesn't tank performance.
Here's the architecture that makes it work. Instead of one opaque vector representing whatever concepts the model happened to learn, Steerling-8B decomposes embeddings into three explicit pathways:
1. ~33,000 supervised "known" concepts (things humans label upfront: parts of speech, entity types, sentiment frames)
2. ~100,000 "discovered" concepts the model learns on its own during training
3. A residual that captures whatever doesn't fit those two buckets
Each concept feeds into logits through a linear path. That's critical—every single prediction decomposes exactly into per-concept contributions. No hidden interactions. No nonlinear surprises. You can add up the contributions and get the model's decision.
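That exactness claim is easy to demonstrate in miniature. Below is a toy sketch (pure Python, invented sizes and random weights; none of these names come from Guide Labs' code) of a linear concept-to-logit path: each concept's contribution is its activation times a weight row, and the logits are nothing more than the sum of those rows.

```python
import random

random.seed(0)
N_CONCEPTS, VOCAB = 8, 16  # toy sizes; the real model has ~133K concepts

# Linear concept -> logit map, and one token position's concept activations
W = [[random.gauss(0, 1) for _ in range(VOCAB)] for _ in range(N_CONCEPTS)]
a = [random.gauss(0, 1) for _ in range(N_CONCEPTS)]

# Per-concept contribution to each vocabulary logit: a[c] * W[c][v]
contrib = [[a[c] * W[c][v] for v in range(VOCAB)] for c in range(N_CONCEPTS)]

# Because the path is linear, each logit is exactly the sum of contributions
logits = [sum(contrib[c][v] for c in range(N_CONCEPTS)) for v in range(VOCAB)]

# Attribution: which concept pushed the winning token hardest?
top_v = max(range(VOCAB), key=lambda v: logits[v])
top_c = max(range(N_CONCEPTS), key=lambda c: contrib[c][top_v])
```

The point of the toy is the invariant, not the numbers: re-deriving each logit directly from `a` and `W` matches the summed contributions term for term, which is what "no hidden interactions" buys you.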
But here's where it gets practical: you can edit these concepts at inference time without retraining. This isn't prompt engineering. This is concept algebra. You can suppress, amplify, or compose concepts to directly control what the model generates.
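As a sketch of what that concept algebra could look like in practice, here is a hypothetical editing helper. The function name, concept names, and activation values are all invented for illustration; this is not Guide Labs' actual API.

```python
def edit_concepts(activations, edits):
    """Scale named concept activations: 0.0 suppresses, >1.0 amplifies.

    Hypothetical helper, not Guide Labs' API. Returns a new dict so the
    original activations are left untouched.
    """
    out = dict(activations)
    for name, scale in edits.items():
        out[name] = out.get(name, 0.0) * scale
    return out

# Invented concept names and activation values, for illustration only
acts = {"formality": 0.8, "british_spelling": 0.25, "legalese": 0.4}

# Suppress one concept, amplify another; no retraining, no prompt tricks
steered = edit_concepts(acts, {"legalese": 0.0, "british_spelling": 2.0})
# steered["legalese"] -> 0.0, steered["british_spelling"] -> 0.5
```

Because the concept-to-logit path is linear, scaling an activation scales that concept's contribution to every logit by the same factor, which is what makes this kind of edit predictable.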
The discovered concepts are particularly interesting. Researchers projected them into vocabulary space to see what each one represents. The model learned to distinguish British English spelling (colour, honour, theatre). It unified "you" across six languages. It separated spelled-out numbers from digits. It developed a concept specifically for broken Unicode. None of that was explicitly taught—the architecture just incentivized learning disentangled representations.
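The vocabulary-space projection the researchers describe can be sketched with a tiny example: score a concept's direction against each token's unembedding vector and keep the highest-scoring tokens. The four-token vocabulary and 2-D vectors below are invented; a real model would project against the full unembedding matrix.

```python
def top_tokens(concept_dir, unembed, vocab, k=3):
    # Dot each token's unembedding row with the concept direction,
    # then keep the k best-aligned tokens.
    scores = [(sum(c * u for c, u in zip(concept_dir, row)), tok)
              for row, tok in zip(unembed, vocab)]
    return [tok for _, tok in sorted(scores, reverse=True)[:k]]

# Toy 2-D unembeddings: the first axis loosely encodes "British spelling"
vocab = ["colour", "color", "honour", "honor"]
unembed = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.2], [0.2, 0.8]]

top_tokens([1.0, 0.0], unembed, vocab, k=2)  # -> ['colour', 'honour']
```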
Performance-wise, this is where you'd expect the catch: interpretable models usually sacrifice capability for transparency. Steerling-8B doesn't. Trained on 1.35 trillion tokens, half the compute of some competitors, it achieves downstream performance within range of models trained on 2–7x more data. It's hitting ~90% of frontier-model capability without frontier-model scale.
Technically, the magic is in the training. Instead of trying to decompose an already-trained model (which hits computational complexity walls fast), Guide Labs enforces the decomposition during training through architectural constraints. The sparse autoencoder approach—expanding embeddings to higher dimensions and then sparsifying with a learned penalty function—lets the model discover structure naturally while staying interpretable.
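A minimal sketch of that expand-then-sparsify idea, assuming the standard sparse-autoencoder recipe (ReLU encoder, L1 sparsity penalty on the wide activations). The article's "learned penalty function" may differ from a fixed L1 coefficient, and all sizes and weights here are toy values.

```python
import random

random.seed(1)
D_MODEL, D_CONCEPTS = 4, 12  # toy sizes; the expansion factor is the point

W_enc = [[random.gauss(0, 0.5) for _ in range(D_MODEL)] for _ in range(D_CONCEPTS)]
W_dec = [[random.gauss(0, 0.5) for _ in range(D_CONCEPTS)] for _ in range(D_MODEL)]

def relu(x):
    return x if x > 0 else 0.0

def encode(x):
    # Expand the d_model embedding into a wider, non-negative concept vector
    return [relu(sum(W_enc[j][i] * x[i] for i in range(D_MODEL)))
            for j in range(D_CONCEPTS)]

def decode(h):
    # Reconstruct the embedding from the concept activations
    return [sum(W_dec[i][j] * h[j] for j in range(D_CONCEPTS))
            for i in range(D_MODEL)]

def loss(x, l1_coeff=0.01):
    h = encode(x)
    x_hat = decode(h)
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    sparsity = sum(abs(v) for v in h)  # L1 term drives most units toward 0
    return recon + l1_coeff * sparsity

x = [1.0, -0.5, 0.25, 0.0]
h = encode(x)  # wide concept vector; training makes it sparse
```

The L1 term is what pushes each embedding to activate only a handful of the wide units, which is why individual units end up aligned with single, nameable concepts rather than superpositions.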
Why should your CTO care? Three reasons.
First, regulatory pressure. If your AI system makes a consequential decision—loan approval, medical recommendation, hiring—you need to explain *why*. Post-hoc explanation techniques (saliency maps, SHAP values) are inherently adversarial; they can be gamed. Steerling gives you actual causality: here are the concepts that drove this decision, here's the training data those concepts came from, here's what happens if you remove them. That's defensible in court.
Second, alignment and control. As you deploy agents that run unsupervised, you need fine-grained control over their reasoning. Steerling-8B lets you steer at inference time. Is the model reasoning about "malware detection" when it should be thinking about "user privacy"? Dial down one concept, dial up another. No retraining loops. No waiting for RLHF to converge.
Third, safety testing. If you understand the concepts your model uses, you can systematically test what happens when you corrupt them. What if the model loses its concept for "financial crime"? What if "legal liability" gets suppressed? This moves safety from "hope we find the adversarial example" to "we've systematically validated the concepts that matter."
The limitation is that Steerling is currently an 8B-parameter model. That's respectable for inference and fine-tuning, but it's not frontier-class for reasoning-heavy tasks; you're not using this for complex multi-step planning yet. And the discovered concepts, while interpretable at the token level, don't automatically give you semantic understanding: projecting each of the ~100,000 discovered concepts into vocabulary space is clever, but it's still lossy.
But the precedent is significant. For too long, interpretability was treated as a post-training research problem. Steerling shows that if you design the architecture right, interpretability and performance aren't a tradeoff—they're complementary. You get better models because the architecture is more honest about what it's actually computing.
MIT Technology Review named mechanistic interpretability a 2026 breakthrough technology. Steerling-8B is the first shipping system that actually lives up to that hype.
Code is open. Model weights are on Hugging Face. The field just got a reference implementation.