CAT: Conditional Attribute Transformers


What if a language model could predict not only the next token, but also its consequences? Conditional Attribute Transformers (CAT) jointly estimate the next token and, for each candidate next token, sequence-level outcomes—enabling attribution, counterfactual comparison across next-token choices, and steering via sequential selection.

In a single forward pass, they support token-level attribution to downstream outcomes, counterfactual reasoning under alternative next tokens, and steering toward safer or better outcomes. They achieve strong results on RL and language modeling; in medical foundation models, they support interpretable dynamic risk estimation with large speedups over sampling-based estimation. Joint training can also improve plain next-token prediction.
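To make the counterfactual comparison concrete, here is a minimal Python sketch. It assumes a hypothetical per-step output: for each candidate next token, a next-token probability (`p_token`) and a predicted probability that the finished sequence has the attribute (`p_attr`). The numbers, field names, and the choice to measure each token's effect against the probability-weighted average are illustrative assumptions, not the paper's definitions.

```python
# Hypothetical per-step output of a CAT-style model (illustrative numbers only):
# p_token = next-token probability, p_attr = predicted probability that the
# completed sequence has the attribute (e.g. a 5-star review).
candidates = {
    "love": {"p_token": 0.50, "p_attr": 0.90},
    "like": {"p_token": 0.30, "p_attr": 0.70},
    "hate": {"p_token": 0.20, "p_attr": 0.05},
}


def expected_attribute(cands):
    """Attribute probability averaged over the next-token distribution."""
    total = sum(c["p_token"] for c in cands.values())
    return sum(c["p_token"] * c["p_attr"] for c in cands.values()) / total


def token_effects(cands):
    """Assumed attribution score: p_attr if the token is forced, minus the average."""
    baseline = expected_attribute(cands)
    return {tok: c["p_attr"] - baseline for tok, c in cands.items()}


baseline = expected_attribute(candidates)
effects = token_effects(candidates)

for tok, delta in sorted(effects.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{tok:>5}: {delta:+.2f} relative to baseline {baseline:.2f}")
```

Because every candidate's outcome estimate is available from the same step, comparing alternatives in this sketch needs no sampled continuations.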

Satisficing criterion

Attribute threshold: 0.8 · Token epsilon: 0.001
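The page does not spell out the satisficing rule, so the following Python sketch shows one plausible reading under stated assumptions: discard tokens whose next-token probability falls below the token epsilon, then pick the most likely remaining token whose predicted attribute probability meets the threshold. The fallback (take the eligible token with the highest attribute probability when nothing meets the threshold), the function name `satisficing_pick`, and the example numbers are all hypothetical.

```python
def satisficing_pick(cands, attr_threshold=0.8, token_epsilon=0.001):
    """Pick a next token under an assumed satisficing rule.

    cands maps token -> {"p_token": float, "p_attr": float}.
    """
    # Drop tokens the language model itself considers implausible.
    eligible = {t: c for t, c in cands.items() if c["p_token"] >= token_epsilon}
    if not eligible:
        raise ValueError("no candidate passes token_epsilon")

    # Tokens whose predicted attribute probability is "good enough".
    good_enough = {t: c for t, c in eligible.items() if c["p_attr"] >= attr_threshold}
    if good_enough:
        # Satisfice: among acceptable tokens, stay closest to the language model.
        return max(good_enough, key=lambda t: good_enough[t]["p_token"])

    # Assumed fallback: nothing meets the threshold, so take the best attribute.
    return max(eligible, key=lambda t: eligible[t]["p_attr"])


# Illustrative step, steering toward the attribute (e.g. 5 stars).
step = {
    "hate":  {"p_token": 0.6000, "p_attr": 0.05},
    "love":  {"p_token": 0.3000, "p_attr": 0.90},
    "like":  {"p_token": 0.0995, "p_attr": 0.70},
    "adore": {"p_token": 0.0005, "p_attr": 0.99},
}

chosen = satisficing_pick(step)  # uses threshold 0.8 and epsilon 0.001
print(chosen)
```

In this example, "adore" has the highest attribute probability but is filtered out by the token epsilon, and "hate" is the most likely token but fails the attribute threshold, so the sketch selects "love".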

Steer toward
Generated text

The prefix "[sos] I really" is fixed context (shown in grey). Press Play to reveal it word by word, then step through CAT decoding steered toward 5★ using the satisficing rule above.

Attribute probabilities of chosen tokens
[Chart: 1★ prob and 5★ prob for each chosen token]