LLM Latent Space Hacking
From a previous post we've seen that LLMs, in their current form, cannot learn to reason in an end-to-end manner. This is because end-to-end learning requires a single computation graph in which gradients flow through the generated tokens. Since LLMs sample discrete tokens of variable length, differentiation through the generation step is impossible. Supervised finetuning is end-to-end, but it only tries to reproduce the reasoning patterns in the expert traces instead of letting the model discover them on its own. This post explores a related question: can we find the tokens that best condition a given response? We can interpret the response as the correct final answer, and the tokens we seek as the best reasoning trace that produces it.
Unconstrained embeddings. We'll actually simplify the task a bit. Instead of finding the tokens that best condition a given response, we'll find the features that best condition it. Consider a prefix sequence \(P\) and a target sequence \(T\). These are strings. If we tokenize them, we get sequences of integers, the token IDs. We'll denote them as \(\mathsf{t}_P\) and \(\mathsf{t}_T\) and suppose their lengths are \(N\) and \(M\). The corresponding embeddings are sequences of high-dimensional real vectors, say \(\mathbf{E}_P \in \mathbb{R}^{N, D}\) and \(\mathbf{E}_T \in \mathbb{R}^{M, D}\), where \(D\) is the dimension of the embedding space.
Naturally, before the inputs go into the LLM's transformer they go through tokenization and the embedding lookup. The output logits are \(\mathbf{O}_P\), of shape \((N, V)\), where \(V\) is the vocabulary size.
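Concretely, here's a minimal setup sketch with PyTorch and Hugging Face transformers; the model id, the tokenizer defaults, and the omission of dtype/device handling are assumptions of the sketch, not part of the experiment itself.

```python
# Minimal setup sketch (PyTorch + Hugging Face transformers); the model id is an
# assumption and dtype/device handling is omitted.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-0.6B"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()
for p in model.parameters():            # we never train the weights, only the inputs
    p.requires_grad_(False)

P = "The following text is about football: "
T = "I like econ."
t_P = tok(P, return_tensors="pt").input_ids                            # (1, N)
t_T = tok(T, add_special_tokens=False, return_tensors="pt").input_ids  # (1, M)
N, M = t_P.shape[1], t_T.shape[1]

E = model.get_input_embeddings().weight   # (V, D) embedding matrix
D = E.shape[1]
E_P, E_T = E[t_P], E[t_T]                 # (1, N, D) and (1, M, D)

with torch.no_grad():
    O_P = model(inputs_embeds=E_P).logits  # (1, N, V) next-token logits
```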
The first thing to realize is that, in principle, LLMs can work with continuous input features that do not correspond to any actual tokens. We can skip the tokenization and the embedding lookup and feed any continuous tensor \(\mathbf{X} \in \mathbb{R}^{N, D}\) into the transformer. Even though the output logits will be meaningless, we can still get them. Even better, we can compute gradients of a downstream loss \(\mathcal{L}\) with respect to these vectors \(\mathbf{X}\). Using \(\partial \mathcal{L} / \partial \mathbf{X}\) we can tune the inputs \(\mathbf{X}\) so they minimize a loss.
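Continuing the sketch, we can bypass tokenization and the embedding lookup by passing an arbitrary tensor through `inputs_embeds` and backpropagating to it; the scalar loss below is only a placeholder for a real objective.

```python
# Feed an arbitrary continuous tensor through the transformer and differentiate
# with respect to it; the scalar loss here is only a placeholder.
X = torch.randn(1, N, D, requires_grad=True)
logits = model(inputs_embeds=X).logits   # meaningless logits, but well-defined
loss = logits.sum()                      # stand-in for a real downstream loss L
loss.backward()                          # dL/dX now lives in X.grad
print(X.grad.shape)                      # torch.Size([1, N, D])
```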
Next token prediction. So here's a first experiment to build some intuition. We have a prefix \(P\) and a target \(T\). We'll learn a matrix \(\mathbf{X}\) that maximizes the log-likelihood of the concatenated sequence \(P+T\). This is just like supervised finetuning except the gradients update the tensor \(\mathbf{X}\) instead of the LLM's weights. Suppose \(P =\) "The following text is about football:" and \(T =\) "I like econ.". With Qwen3-0.6B, the sequence \(P+T\) has log-likelihood (LL) \(-85.14\).
We initialize a random tensor \(\mathbf{X} \in \mathbb{R}^{M, D}\) and feed the concatenation \([\mathbf{E}_P | \mathbf{X} ]\) to the LLM transformer. The LL of \(P+T\) is calculated by summing the next-token log-probabilities at the indices \(\mathsf{t}_P + \mathsf{t}_T\), shifted forward by one. Since the transformer is causal, \(\mathbf{X}\) cannot influence the LL of \(P\), only that of \(T\). With gradient descent it takes only ~35 iterations to converge. The new log-likelihood is \(-53.17\), a noticeable improvement.
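Here is a sketch of this first experiment, continuing from the setup above; the optimizer, learning rate, and iteration count are illustrative rather than tuned.

```python
# Experiment 1: learn X in place of E_T so that [E_P | X] maximizes the
# log-likelihood of the full sequence P+T. Hyperparameters are illustrative.
import torch.nn.functional as F

X = torch.randn(1, M, D, requires_grad=True)
opt = torch.optim.Adam([X], lr=1e-1)
targets = torch.cat([t_P, t_T], dim=1)                 # (1, N+M) token ids of P+T

for step in range(200):
    inputs = torch.cat([E_P, X], dim=1)                # (1, N+M, D)
    logits = model(inputs_embeds=inputs).logits        # (1, N+M, V)
    logprobs = F.log_softmax(logits[:, :-1], dim=-1)   # position i predicts token i+1
    ll = logprobs.gather(-1, targets[:, 1:, None]).sum()
    loss = -ll                                          # NLL of the whole sequence
    opt.zero_grad(); loss.backward(); opt.step()

print(ll.item())   # approaches the ceiling set by P, which X cannot influence
```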
Analysis. The loss flattens out quickly and remains positive. Why doesn't the negative log-likelihood (NLL) converge to zero? Well, we shouldn't expect it to, because the likelihood is calculated over the full sequence and changing \(\mathbf{X}\) cannot influence the predictions for \(P\). If instead we trained the LLM's weights, we should expect the loss to go to zero. What matters is that the negative log-likelihood of \(T\) approaches 0: the optimization has found an \(\mathbf{X}\) that maximizes the LL of \(T\).
The learned \(\mathbf{X}\) does not decode to anything meaningful, only Unicode glyphs and byte-like chunks of characters. It's far from the manifold of cohesive language. It consists of continuous, unconstrained vectors that have found directions in the high-dimensional embedding space along which to push the log-likelihood up. Note that the map from inputs to logits is many-to-one: when we project a hidden vector onto the vocabulary via \(\mathbf{x} \mathbf{E}^T\), the embedding matrix \(\mathbf{E}\) empirically often has effective rank lower than \(D\) (though this is not guaranteed), and even if it is full-rank, the transformer itself, with its nonlinearities, is not injective. Hence it's unsurprising that some gibberish-looking, language-blended \(\mathbf{X}\) maximizes the log-likelihood of \(\mathsf{t}_T\).
Intermediate conditioning. Next, we'll modify the experimental setup slightly by changing the learnable tensor \(\mathbf{X}\) to act genuinely as a condition from which the final answer is generated. We'll now feed the model the concatenation \([ \mathbf{E}_P | \mathbf{X} | \mathbf{E}_T ]\), where \(\mathbf{X}\) acts as a scratch space consisting of arbitrarily many tokens, say \(S\). The optimization updates the condition tensor \(\mathbf{X}\) so that it maximizes the LL of the sequence \(\mathsf{t}_T\) given \(\mathbf{X}\) and the prefix \(P\). Note that in the previous experiment the learnable tensor was not really acting as a condition, because it occupied the positions of the target tokens themselves.
Like before, the computation graph for one pass covers all tokens: we produce the next-token logits for each of the \((N + S + M)\) input positions and minimize the negative log-likelihood over the shifted tokens in \(\mathsf{t}_T\), which decreases rapidly to zero. The whole process heavily resembles adversarial input generation.
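A sketch of this modified setup follows; the number of scratch tokens \(S\) and the hyperparameters are again illustrative.

```python
# Experiment 2: X sits between the prefix and the target and is learned so that
# it conditions the model to predict T. S is a free choice.
S = 16
X = torch.randn(1, S, D, requires_grad=True)
opt = torch.optim.Adam([X], lr=1e-1)

for step in range(200):
    inputs = torch.cat([E_P, X, E_T], dim=1)            # (1, N+S+M, D)
    logits = model(inputs_embeds=inputs).logits         # (1, N+S+M, V)
    # positions N+S-1 .. N+S+M-2 are the ones that predict t_T[0] .. t_T[M-1]
    pred = F.log_softmax(logits[:, N + S - 1 : N + S + M - 1], dim=-1)
    nll = -pred.gather(-1, t_T[:, :, None]).sum()       # NLL of T only
    opt.zero_grad(); nll.backward(); opt.step()
```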
What we're doing is a form of LLM latent space hacking: we're creating synthetic tokens that condition the model's future tokens to be exactly \(\mathsf{t}_T\) after sampling. And it undeniably works. If we prompt the model with \(P\) and generate some sequences autoregressively, none of them resemble \(T\). However, if we prompt with \([\mathbf{E}_P | \mathbf{X}]\), this strongly biases the model and it ends up generating exactly \(T\). Pretty amazing.
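To check the conditioning, a simple greedy decoding loop over embeddings suffices; this is purely illustrative (no KV cache, greedy rather than sampled decoding).

```python
# Greedy decoding conditioned on a sequence of input embeddings.
def greedy_generate(prefix_embeds, n_tokens=20):
    embeds, out_ids = prefix_embeds, []
    with torch.no_grad():
        for _ in range(n_tokens):
            next_id = model(inputs_embeds=embeds).logits[:, -1].argmax(dim=-1)  # (1,)
            out_ids.append(next_id.item())
            embeds = torch.cat([embeds, E[next_id][None]], dim=1)  # append its embedding
    return tok.decode(out_ids)

print(greedy_generate(E_P))                                  # does not resemble T
print(greedy_generate(torch.cat([E_P, X.detach()], dim=1)))  # reproduces T
```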
Decoding. As before, the learned tensor \(\mathbf{X}\) is not linguistically meaningful. It does not correspond to any token embeddings. We can project it onto the vocabulary via the embedding matrix and apply a softmax to see which vocabulary tokens it triggers, \(\sigma(\mathbf{X}\mathbf{E}^T) \in [0, 1]^{S, V}\). However, there's no reason to believe they're meaningful or related to \(P\) or \(T\). In fact, each \(\mathbf{x}_i\) in \(\mathbf{X}\) triggers a skewed blend of tokens, whereas if we reproject actual token embeddings and softmax them, the result is sharply peaked at the corresponding token, essentially one-hot.
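A quick way to inspect these decoded distributions, continuing the sketch:

```python
# Project each learned scratch vector onto the vocabulary and print its top tokens.
with torch.no_grad():
    probs = torch.softmax(X @ E.T, dim=-1)   # sigma(X E^T), shape (1, S, V)
    for i, row in enumerate(probs[0]):
        vals, ids = row.topk(5)              # top-5 tokens triggered by x_i
        print(i, [(tok.decode([t]), round(v.item(), 3)) for t, v in zip(ids.tolist(), vals)])
```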
Constraint. This raises a question: can we learn the condition tensor \(\mathbf{X}\) while constraining it to be semantically meaningful? In principle, yes, this is well-defined. In practice, it seems very hard. To see why, consider that we judge something to be meaningful if it follows the linguistic patterns we're accustomed to. A trained LLM has learned to follow them: it assigns high probability to those sequences which are, arguably, well-formed and meaningful, as judged by how similar they are to the training set.
Therefore, the regularization we need is that \(\mathbf{X}\) should decode to something that has high likelihood under a base reference model \(\pi_\text{ref}\) given the prefix \(P\). The full problem to solve is

\[
\min_{\mathbf{X}} \; -\log \pi\big(\mathsf{t}_T \,\big|\, [\mathbf{E}_P \,|\, \mathbf{X}]\big) \;+\; \lambda \, \mathrm{KL}\Big(\sigma(\mathbf{X}\mathbf{E}^T) \,\Big\|\, \pi_\text{ref}\big(\cdot \,\big|\, [\mathbf{E}_P \,|\, \mathbf{X}]\big)\Big).
\]
The new regularization loss, weighted by a factor \(\lambda > 0\), should be interpreted as follows. First, we decode the learned tensor \(\mathbf{X}\) using \(\sigma(\mathbf{X} \mathbf{E}^T)\). This gives us the distribution over discrete tokens that each continuous token triggers. We form a KL divergence in which these are optimized to look like the next-token distributions of the base reference policy conditioned on the prefix and \(\mathbf{X}\), i.e. \(\pi_\text{ref}(\cdot \ | \ [ \mathbf{E}_P | \mathbf{X} ] )\). Note that, up to the entropy of \(\sigma(\mathbf{X} \mathbf{E}^T)\), the KL between categorical distributions is again a cross-entropy.
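A sketch of the regularized objective follows. Two simplifying assumptions: \(\pi_\text{ref}\) is taken to be the same frozen model, and gradients through the reference side are stopped; \(\lambda\) is illustrative.

```python
# Regularized objective: fit T while pulling sigma(X E^T) toward the reference
# model's next-token distributions over the scratch slots.
lam = 0.1
X = torch.randn(1, S, D, requires_grad=True)
opt = torch.optim.Adam([X], lr=1e-1)

for step in range(500):
    inputs = torch.cat([E_P, X, E_T], dim=1)
    logits = model(inputs_embeds=inputs).logits
    # main term: NLL of T, exactly as before
    pred = F.log_softmax(logits[:, N + S - 1 : N + S + M - 1], dim=-1)
    nll = -pred.gather(-1, t_T[:, :, None]).sum()
    # regularizer: KL( sigma(x_i E^T) || pi_ref(. | [E_P | X_{<i}]) ), summed over i
    decoded = torch.softmax(X @ E.T, dim=-1)                            # (1, S, V)
    ref = F.log_softmax(logits[:, N - 1 : N + S - 1], dim=-1).detach()  # ref next-token dists at the X slots
    kl = F.kl_div(ref, decoded, reduction="sum")                        # KL(decoded || ref)
    loss = nll + lam * kl
    opt.zero_grad(); loss.backward(); opt.step()
```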
However, this is still not enough. The KL makes the two sets of distributions consistent, but \(\sigma(\mathbf{X} \mathbf{E}^T)\) is still untethered to any actual linguistic samples, such as those coming from \(\pi_{\text{ref}}(\cdot | P)\). Sampling from \(\pi_{\text{ref}}(\cdot | [\mathbf{E}_P | \mathbf{X}])\) can still be meaningless. And that's the difficulty: you need actual samples of reasonable sequences in order to pin down what counts as reasonable. Without sampling, you simply don't have examples of what it means to be meaningful.
Optimization difficulty. I had a lot of trouble optimizing this objective. It seems hard to find an \(\mathbf{X}\) that is both useful as a condition and decodes to meaningful tokens. If I bump up \(\lambda\), \(\mathbf{X}\) does decode to more commonly used tokens, but it also fails to increase the likelihood of the target sequence by more than a marginal amount. If we want the decoded distribution to be one-hot, so that it represents actual discrete language tokens, the search becomes effectively combinatorial. The loss landscape becomes jagged and discontinuous, and gradient descent is doomed.
Conclusion. The above experiments suggest that the discrete nature of LLMs presents unique challenges not only in training, but also in adversarial hacking. It's difficult to find condition tokens that trigger a given target sequence while themselves being decodable to reasonable text. This gets me thinking. First, perhaps the thinking tokens in the current wave of reasoning models trained using GRPO are not really of that much benefit in terms of conditioning the final answer. After all, we saw that the best conditions are not meaningful tokens, so in that sense there's lots of room for improvement. Second, we need to question whether reasoning should be done by chain-of-thought over discrete tokens. A continuous space may be much better suited, even though it's not interpretable. Third, we need a more computational understanding of what reasoning is. Is it the act of forming the answer step-by-step? Or is it the process of trying out different promising directions by trial-and-error and adapting on the go? There's lots to explore.