Conditional Attribute Transformers (CAT)

What if a language model could predict not only the next token, but also its consequences?

We introduce Conditional Attribute Transformers, which jointly estimate the next token and, for each possible next token choice, sequence-level properties or outcomes.

This gives generative models three key capabilities in a single forward pass: (1) token-level attribution to downstream outcomes, (2) counterfactual reasoning about how an outcome would change under alternative next token choices, and (3) steering toward safer or more optimal outcomes through sequential next token selection.

We show that Conditional Attribute Transformers achieve state-of-the-art performance in reinforcement learning tasks and language modeling. In medical foundation models, they enable dynamic, interpretable risk estimation for downstream clinical outcomes and elucidate the tokens that drive risk, while achieving a 10⁸× speedup over traditional sampling-based approaches. As an additional benefit, we find that this joint task improves next-token prediction in baseline language models.

How to use

The demo below shows how Conditional Attribute Transformers steer a language model toward 1★ or 5★ reviews while sampling from the next-token and attribute distributions. This is not a live demo; it uses precomputed trajectories that you can step through.

Choose a trajectory: No steering, Steer to 1★, or Steer to 5★.
Use Next to step through the trajectory one token at a time.
Look at the graph to see how the probabilities change over time.
Scroll down to the table to inspect the generated token, candidate next tokens, next-token probabilities, and attribute probabilities at each step.
Sort the table by next-token probability or by 1★ / 5★ likelihood to compare how steering changes the model's predictions.
At the end, select Reset to explore a different trajectory.

Generated text

[sos] I really enjoyed

Control: No steering — tokens sampled from the next-token distribution without attribute steering.

Step 3 / 19

Attribute probabilities of chosen tokens

1★ prob · 5★ prob

Satisficing criterion

Attribute threshold: 0.8

Token epsilon: 0.001

k: 20
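
The page does not spell out how these three settings combine, so the following Python sketch shows only one plausible reading: keep the k most likely tokens, drop any whose next-token probability falls below the token epsilon, accept every remaining token whose target attribute probability meets the threshold, and sample among the accepted tokens in proportion to their next-token probability. The function names, the order of the filters, and the fallback rule are assumptions made for illustration, not the authors' implementation. The rows are copied from the demo's candidate table for step 3.

```python
import random

# (token, next-token prob, attribute probs) copied from the step-3 candidate table.
ROWS = [
    ("like",       0.6048, {"1star": 0.0054, "5star": 0.7668}),
    ("love",       0.0956, {"1star": 0.0054, "5star": 0.8457}),
    ("don",        0.0124, {"1star": 0.2269, "5star": 0.2771}),
    ("needed",     0.0107, {"1star": 0.0204, "5star": 0.8050}),
    ("wanted",     0.0072, {"1star": 0.2357, "5star": 0.2116}),
    ("appreciate", 0.0055, {"1star": 0.0073, "5star": 0.8035}),
]

def satisficing_candidates(rows, target, attr_threshold=0.8,
                           token_epsilon=0.001, k=20):
    """Return the tokens that are 'good enough' for the target attribute."""
    # Keep the k most likely tokens, then drop implausible ones.
    top_k = sorted(rows, key=lambda r: r[1], reverse=True)[:k]
    plausible = [r for r in top_k if r[1] >= token_epsilon]
    # Satisficing: accept every plausible token that clears the threshold.
    good = [r for r in plausible if r[2][target] >= attr_threshold]
    if good:
        return good
    # Assumed fallback: if nothing clears the threshold, keep the best token.
    return [max(plausible, key=lambda r: r[2][target])] if plausible else []

def sample_token(pool, rng=random):
    """Sample one token from the pool, weighted by next-token probability."""
    tokens = [r[0] for r in pool]
    weights = [r[1] for r in pool]
    return rng.choices(tokens, weights=weights, k=1)[0]

five_star_pool = satisficing_candidates(ROWS, "5star")
one_star_pool = satisficing_candidates(ROWS, "1star")
```

Under these assumptions, steering toward 5★ at this step leaves love, needed, and appreciate in the pool, since like (76.68%) falls just short of the 0.8 threshold. No candidate reaches 0.8 for 1★, so the assumed fallback keeps wanted, the row with the highest 1★ probability.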

Candidate tokens at current position

Click column headers to sort · Column color = attribute intensity

Token	Token prob	1★ prob	5★ prob
like	60.48%	0.54%	76.68%
liked	10.10%	2.43%	68.56%
love	9.56%	0.54%	84.57%
enjoy	2.27%	0.39%	79.71%
do	1.55%	10.33%	54.91%
enjoyed (chosen)	1.47%	0.79%	77.10%
loved	1.39%	3.09%	71.76%
don	1.24%	22.69%	27.71%
needed	1.07%	2.04%	80.50%
didn	0.82%	21.59%	33.00%
wanted	0.72%	23.57%	21.16%
am	0.58%	4.76%	70.25%
can	0.55%	8.37%	60.19%
appreciate	0.55%	0.73%	80.35%
did	0.44%	18.98%	43.79%
wish	0.43%	8.80%	35.71%
thought	0.39%	12.80%	33.88%
think	0.35%	4.89%	59.25%
have	0.34%	5.25%	65.11%
really	0.33%	3.77%	72.13%
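
The table above can be read counterfactually: each row gives the attribute estimate that would apply if that token had been chosen instead of the actual one. The short Python sketch below makes that comparison explicit for a few rows copied from the table. The dictionary layout and the helper names `counterfactual_deltas` and `rank_by` are hypothetical and exist only for illustration.

```python
# Rows copied from the step-3 candidate table, as fractions rather than percents.
CANDIDATES = {
    "like":    {"token": 0.6048, "1star": 0.0054, "5star": 0.7668},
    "love":    {"token": 0.0956, "1star": 0.0054, "5star": 0.8457},
    "enjoyed": {"token": 0.0147, "1star": 0.0079, "5star": 0.7710},  # chosen token
    "don":     {"token": 0.0124, "1star": 0.2269, "5star": 0.2771},
    "wanted":  {"token": 0.0072, "1star": 0.2357, "5star": 0.2116},
}

def counterfactual_deltas(candidates, chosen, attribute):
    """Change in the attribute estimate if another token replaced the chosen one."""
    base = candidates[chosen][attribute]
    return {
        token: row[attribute] - base
        for token, row in candidates.items()
        if token != chosen
    }

def rank_by(candidates, column):
    """Tokens sorted by one column, highest first (like sorting the table)."""
    return sorted(candidates, key=lambda t: candidates[t][column], reverse=True)

deltas = counterfactual_deltas(CANDIDATES, "enjoyed", "5star")
by_five_star = rank_by(CANDIDATES, "5star")
```

For these rows, swapping enjoyed for love would raise the 5★ estimate by about 7.5 percentage points (84.57% versus 77.10%), while swapping it for don would lower it by about 49 points (27.71% versus 77.10%).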