Character Prefix Conditioning using Back Tokenization

Anil Turaga March 17, 2025

Implementing an efficient algorithm based on pruned trie search to find possible generation paths for a user completion request

I recently came across a problem published on Cursor's blog, titled Character Prefix Conditioning. It opens with the following context:
When using a language model for code completion, we typically want the model to produce a completion that begins with what the user has typed. However, modern language models operate on sequences of tokens, not characters, so naively tokenizing the user's input and sending it to the model produces wrong results if the user's cursor doesn't happen to lie on a token boundary.
To understand this, let's consider the following completion example:
Autocomplete example
The company hired an intermediary to negotiate.
Here, the user has typed "The company hired an intermed" and the model is supposed to suggest a completion "iary to negotiate." There's a problem here. Let's look at the token representation of the user's current state and the actual sentence to get a better understanding of the problem:

User's current state:
[Token view]
Actual sentence:
[Token view]
If we were to naively tokenize the user's current state and send it to the model, we would have to sample the following tokens to reach the actual sentence:
[Token view]
This would hurt the performance of the model as the sequence of these tokens does not match the actual sentence's token sequence.

You can already tell that a simple approach won't solve it. Even token healing won't fully help here: if we back up one token and generate from there, we would start sampling from "The company hired an inter", which can't reach the actual sentence's token sequence; that requires sampling from "The company hired an".

Tokenization
I am going to use the Byte-Pair Encoding (BPE) tokenization algorithm to solve this problem, specifically cl100k_base, the tokenizer used by GPT-4.

The Byte-Pair Encoding tokenizer works by starting with a vocabulary of individual characters and then iteratively merging the most frequent pair of adjacent tokens to form a new token. This process continues until a desired vocabulary size is reached or no frequent pairs remain.
BPE Training and Inference using a simple example
I highly recommend reading this section if you are unfamiliar with BPE, and skipping it if you are already comfortable with it.
Let's look at a simple example of how BPE creates merge rules using the sentence "The programmer programs programming programs":

First, we count all character frequencies in our corpus:
'r': 9, 'm': 6, 'g': 5, 'p': 4, 'o': 4, 'a': 4, ' ': 4, 'e': 2, 's': 2, 'T': 1, 'h': 1, 'i': 1, 'n': 1

We start with the vocabulary of individual characters:
['T', 'h', 'e', ' ', 'p', 'r', 'o', 'g', 'r', 'a', 'm', 'm', 'e', 'r', ' ', 'p', 'r', 'o', 'g', 'r', 'a', 'm', 's', ' ', 'p', 'r', 'o', 'g', 'r', 'a', 'm', 'm', 'i', 'n', 'g', ' ', 'p', 'r', 'o', 'g', 'r', 'a', 'm', 's']
Now we look for the most frequent pair of adjacent tokens. We count all pairs:
'p'+'r': 4, 'r'+'o': 4, 'o'+'g': 4, 'g'+'r': 4, 'r'+'a': 4, 'a'+'m': 4, 'm'+'m': 2, 'm'+'s': 2, others: 1 each (counting pairs within words only)

Several pairs are tied at 4 occurrences; we'll merge 'a'+'m' first (the tie can be broken arbitrarily):
['T', 'h', 'e', ' ', 'p', 'r', 'o', 'g', 'r', 'am', 'm', 'e', 'r', ' ', 'p', 'r', 'o', 'g', 'r', 'am', 's', ' ', 'p', 'r', 'o', 'g', 'r', 'am', 'm', 'i', 'n', 'g', ' ', 'p', 'r', 'o', 'g', 'r', 'am', 's']
Now we recount pairs with our new token 'am'. The most frequent pairs are now:
'p'+'r': 4, 'r'+'o': 4, 'o'+'g': 4, 'g'+'r': 4, 'r'+'am': 4, others: ≤2 each

We'll merge 'r'+'o' (we could choose any of the pairs with frequency 4):
['T', 'h', 'e', ' ', 'p', 'ro', 'g', 'r', 'am', 'm', 'e', 'r', ' ', 'p', 'ro', 'g', 'r', 'am', 's', ' ', 'p', 'ro', 'g', 'r', 'am', 'm', 'i', 'n', 'g', ' ', 'p', 'ro', 'g', 'r', 'am', 's']
Recounting again, the most frequent pairs:
'p'+'ro': 4, 'ro'+'g': 4, 'g'+'r': 4, 'r'+'am': 4, others: ≤2 each

We'll merge 'p'+'ro':
['T', 'h', 'e', ' ', 'pro', 'g', 'r', 'am', 'm', 'e', 'r', ' ', 'pro', 'g', 'r', 'am', 's', ' ', 'pro', 'g', 'r', 'am', 'm', 'i', 'n', 'g', ' ', 'pro', 'g', 'r', 'am', 's']
And again:
'pro'+'g': 4, 'g'+'r': 4, 'r'+'am': 4, others: ≤2 each

We'll merge 'g'+'r':
['T', 'h', 'e', ' ', 'pro', 'gr', 'am', 'm', 'e', 'r', ' ', 'pro', 'gr', 'am', 's', ' ', 'pro', 'gr', 'am', 'm', 'i', 'n', 'g', ' ', 'pro', 'gr', 'am', 's']
Next:
'pro'+'gr': 4, 'gr'+'am': 4, others: ≤2 each

We merge 'gr'+'am' to form 'gram':
['T', 'h', 'e', ' ', 'pro', 'gram', 'm', 'e', 'r', ' ', 'pro', 'gram', 's', ' ', 'pro', 'gram', 'm', 'i', 'n', 'g', ' ', 'pro', 'gram', 's']
If we were to continue, we might merge 'pro'+'gram' next, but I'll stop here to keep this example manageable. Here is how our vocabulary evolved from start to finish:

Initial vocabulary (individual characters):
['T', 'h', 'e', ' ', 'p', 'r', 'o', 'g', 'a', 'm', 's', 'i', 'n']

Final vocabulary (after merges):
['T', 'h', 'e', ' ', 'p', 'r', 'o', 'g', 'a', 'm', 's', 'i', 'n', 'am', 'ro', 'pro', 'gr', 'gram']
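To make the training loop concrete, here is a toy sketch of the procedure above. It is my own illustration (the function name and structure are not the real tiktoken trainer): it pre-tokenizes on spaces so merges never cross word boundaries, and it breaks frequency ties arbitrarily, so the exact merge order may differ from the walkthrough.

```python
from collections import Counter

def train_bpe(corpus: str, num_merges: int):
    # Toy BPE trainer: split on spaces so merges never cross word boundaries.
    words = [list(w) for w in corpus.split(" ")]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair across all words.
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]  # ties broken arbitrarily
        merges.append((a, b))
        # Apply the new merge rule everywhere it occurs.
        for w in words:
            i = 0
            while i < len(w) - 1:
                if w[i] == a and w[i + 1] == b:
                    w[i:i + 2] = [a + b]
                else:
                    i += 1
    return merges

# Merge order may differ from the walkthrough above when counts tie.
print(train_bpe("The programmer programs programming programs", 5))
```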
Let's see how a string gets tokenized step by step using our vocabulary and merges. We'll use a smaller example: "programmer".

Step 1: Initial character-by-character tokenization
['p', 'r', 'o', 'g', 'r', 'a', 'm', 'm', 'e', 'r']
Step 2: Apply the 'a'+'m' merge rule
['p', 'r', 'o', 'g', 'r', 'am', 'm', 'e', 'r']
Step 3: Apply the 'r'+'o' merge rule
['p', 'ro', 'g', 'r', 'am', 'm', 'e', 'r']
Step 4: Apply the 'p'+'ro' merge rule
['pro', 'g', 'r', 'am', 'm', 'e', 'r']
Step 5: Apply the 'g'+'r' merge rule
['pro', 'gr', 'am', 'm', 'e', 'r']
Step 6: Final tokenization with 'gr'+'am' merge rule
['pro', 'gram', 'm', 'e', 'r']
The important thing to understand here is that when tokenizing a new string, we apply our merge rules in the exact order they were learned during training. Unlike the training phase where we repeatedly count frequencies, during inference we simply apply the rules in sequence.
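As a minimal sketch of that inference step, here is how the toy merge rules from the example could be applied in order (this is my own illustration, not a real tokenizer implementation):

```python
# Toy merge rules learned in the walkthrough above, in training order.
MERGES = [("a", "m"), ("r", "o"), ("p", "ro"), ("g", "r"), ("gr", "am")]

def bpe_tokenize(word: str, merges=MERGES) -> list[str]:
    tokens = list(word)                 # start from individual characters
    for left, right in merges:          # apply each rule in the order it was learned
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == left and tokens[i + 1] == right:
                tokens[i:i + 2] = [left + right]  # merge the pair in place
            else:
                i += 1
    return tokens

print(bpe_tokenize("programmer"))  # ['pro', 'gram', 'm', 'e', 'r']
```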

This is a simplified example - in practice, the cl100k_base tokenizer used by GPT-4 has a much larger vocabulary (100,000+ tokens) and more sophisticated merge rules. The GPT-2 paper highlights one of the things done additionally to improve the performance of the tokenizer:
We observed BPE including many versions of common words like dog since they occur in many variations such as dog. dog! dog? . This results in a sub-optimal allocation of limited vocabulary slots and model capacity. To avoid this, we prevent BPE from merging across character categories for any byte sequence.

They enforce this by using a regex pattern that identifies boundaries across which merges shouldn't occur.

'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}++|\p{N}{1,3}+| ?[^\s\p{L}\p{N}]++[\r\n]*+|\s++$|\s*[\r\n]|\s+(?!\S)|\s


I am going to use this regex to restrict the search and validation to a subset of the overall user input.
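As a sketch of how the last "word" can be isolated with this pattern (this assumes the third-party regex package, which supports \p{...} classes and possessive quantifiers, unlike the standard re module):

```python
import regex  # third-party package: pip install regex

# cl100k_base pre-tokenization pattern, split across lines for readability.
PAT = regex.compile(
    r"'(?i:[sdmt]|ll|ve|re)"
    r"|[^\r\n\p{L}\p{N}]?+\p{L}++"
    r"|\p{N}{1,3}+"
    r"| ?[^\s\p{L}\p{N}]++[\r\n]*+"
    r"|\s++$|\s*[\r\n]|\s+(?!\S)|\s"
)

def last_word(text: str) -> str:
    """Return the trailing chunk the pre-tokenizer treats as one piece."""
    pieces = PAT.findall(text)
    return pieces[-1] if pieces else ""

print(repr(last_word("We found a hidden causali")))  # ' causali'
print(repr(last_word("userNa")))                     # 'userNa'
```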

Language Model Performance Considerations
When implementing character prefix conditioning (CPC) with self-hosted language models, there are two main approaches to handling multiple completion candidates, each with significant performance implications:
  • Sequential autoregressive generation for each candidate path
  • Batched processing of multiple candidates together
Why Batching is More Efficient

When running a language model, loading the model weights and setting up the KV cache for autoregressive generation carries significant overhead. With sequential processing, this overhead is incurred for each candidate path.

Batching Advantage

By batching multiple candidates together, we only need to load the model and set up autoregressive generation once. The GPU can then process all candidates in parallel, amortizing the initialization costs across the batch.
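As a rough sketch of the batched approach (not from the original post; the model, prompts, and decoding settings below are illustrative assumptions using Hugging Face transformers):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token
tok.padding_side = "left"  # pad on the left so generation continues from real text
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Each candidate path re-tokenizes the prompt at a different boundary.
candidate_prompts = [
    "The company hired an",
    "The company hired an inter",
]

batch = tok(candidate_prompts, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model.generate(**batch, max_new_tokens=8, do_sample=False,
                         pad_token_id=tok.eos_token_id)
for ids in out:
    print(repr(tok.decode(ids, skip_special_tokens=True)))
```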

Dataset preparation
Most of the time spent on this problem went into preparing the dataset. I tried to think of and find as many different cases as possible.

Here are some of the cases I considered, ordered by increasing complexity:
Basic word completion

When the user types a word that is already a valid token but longer tokens are possible:

I bought some apples
[Token views]
Token Healing

"Token Healing", which automatically backs up the generation process by one token before the end of the prompt, then constrains the first token generated to have a prefix that matches the last token in the prompt:

https://
[Token views]
Multiple valid completions

When the same prefix could lead to different valid words:

userName
[Token views]
Or:
userNames
[Token views]
Hidden subword patterns

When the word itself is made up of multiple tokens:

We found a hidden causality
[Token views]
Token boundary misalignment

When the cursor falls in the middle of what should be a single token:

He introduced an intermediary
[Token views]
Complex tokenization patterns

When the partial word tokenizes differently from the completed word, so the token boundary shifts in the final sentence:

indivisible
[Token views]
Or:
individual
[Token views]

Key Observations we can make from the dataset:
  1. We have to implement some form of search over the token space to find possible generation paths
  2. We cannot assume the longest matching token is the right one; the correct token depends on the context of the user's input and may not be the longest
  3. The search might have to be initiated at multiple points in the user's input, so we need to be efficient about it
  4. We cannot base this search on the token boundaries of the user's current input, as the token boundaries may be different in the final sentence
  5. We need to prune the search space to keep it manageable while still covering all the cases
Implementation
If you've been able to follow along so far, the solution would seem obvious.
High-Level Approach

The strategy involves four main steps:

  1. Create a trie data structure from the tokenizer vocabulary, with the ability to take a string and return all tokens that have that string as a prefix
  2. For a given user input, take the last word determined by the regex pattern and enumerate its suffixes, starting from the rightmost character and growing one character at a time toward the left; for each such prefix, find the possible tokens that start with it
  3. Filter token candidates to ensure they maintain valid token boundaries
  4. Return a list of possible completion paths for the model to consider
Trie Implementation
At the core of our solution is a trie (prefix tree) data structure optimized for byte-level searching.
Each node in the trie represents a byte, and its children represent the possible next bytes of a token.
A node also stores the original token that ends at that node, if any.

Visualizing subtree for prefix 'user':

==================================================
TRIE VISUALIZATION (max_depth=4, max_children=3)
==================================================
Visualizing subtree for prefix: b'user'
* TOKENS: user
├── 'D'
│   └── 'a'
│       └── 't'
│           └── 'a'
│               * TOKENS: userData
├── 'I'
│   ├── 'D'
│   │   * TOKENS: userID
│   ├── 'd'
│   │   * TOKENS: userId
│   └── 'n'
│       └── 'f'
│           └── 'o'
│               * TOKENS: userInfo
└── 'M'
    └── 'a'
        └── 'n'
            └── 'a'
                └── 'g'
                    └── ... (more nodes, reached max depth)
└── ... (8 more children)
==================================================
So given a prefix, we can traverse the trie and collect all the tokens that have the prefix.
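Here is a minimal sketch of such a byte-level trie. The class and method names are mine, and it reads the vocabulary from tiktoken's private _mergeable_ranks mapping, so treat it as illustrative rather than the exact implementation:

```python
import tiktoken

class TrieNode:
    __slots__ = ("children", "token")
    def __init__(self):
        self.children = {}   # next byte -> TrieNode
        self.token = None    # token bytes that end exactly at this node, if any

class TokenTrie:
    def __init__(self, encoding_name: str = "cl100k_base"):
        self.root = TrieNode()
        enc = tiktoken.get_encoding(encoding_name)
        for token_bytes in enc._mergeable_ranks:   # vocabulary as raw bytes
            self._insert(token_bytes)

    def _insert(self, token_bytes: bytes):
        node = self.root
        for b in token_bytes:
            node = node.children.setdefault(b, TrieNode())
        node.token = token_bytes

    def tokens_with_prefix(self, prefix: bytes) -> list[bytes]:
        """Collect every vocabulary token that starts with `prefix`."""
        node = self.root
        for b in prefix:
            if b not in node.children:
                return []
            node = node.children[b]
        out, stack = [], [node]        # iterative DFS to avoid deep recursion
        while stack:
            n = stack.pop()
            if n.token is not None:
                out.append(n.token)
            stack.extend(n.children.values())
        return out

trie = TokenTrie()
print(len(trie.tokens_with_prefix(b"user")))  # number of tokens starting with 'user'
```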
Token Matching Algorithm
We first take the user's input and split it into words using the regex pattern from the tokenizer library.

For the dataset above, the last words would be:
Last words for each item of the dataset based on the regex pattern (underscores denote leading spaces)

_apple
:
userNa
_causali
_intermediar
_indivi
Now we search the trie using suffixes of each of these last words. For example, for " indivi" we would initiate searches for the following prefixes (a small sketch of this enumeration follows the list):
Search prefixes for "indivi"

i
vi
ivi
divi
ndivi
indivi
_indivi
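A small sketch of this suffix enumeration, reusing the hypothetical last_word helper and TokenTrie from the sketches above:

```python
def candidate_positions(text: str, trie: TokenTrie):
    word = last_word(text)                         # e.g. " indivi"
    start = len(text) - len(word)
    results = []
    for i in range(len(text) - 1, start - 1, -1):  # grow the suffix right to left
        suffix = text[i:]                          # "i", "vi", "ivi", ...
        matches = trie.tokens_with_prefix(suffix.encode("utf-8"))
        if matches:
            results.append((i, suffix, matches))
    return results

for pos, suffix, matches in candidate_positions("indivi", trie):
    print(pos, repr(suffix), len(matches), "candidate tokens")
```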
Token Boundary Validation
The above search yields many potential candidates, forming a superset of the combinations we want to explore. We need an effective pruning strategy to filter these combinations.
Token Validation Strategy

For each candidate token, we:
• Replace the last word's prefix with the candidate token
• Re-encode the resulting string
• Compare with expected token sequence
• Keep only if the sequences match exactly

Let's walk through an example using the prefix "divi" from our search candidates above:
Example: Validating token "division"
Input Analysis:
• Last word: " indivi"
• Prefix: "divi"
• Candidate token: "division"
• Resulting word: " indivision"
Token Comparison:
Original tokens:
[Token view]
Expected tokens:
[Token view]
Actual re-encoded tokens:
[Token view]
Since the re-encoded tokens don't match our expected sequence, we discard "division" as a candidate. By applying this validation process across all candidate tokens, we can effectively filter out invalid completions while preserving the ones that maintain proper token boundaries.
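To make that check concrete, here is a hedged sketch of the validation step using tiktoken. The helper name and the decode step are mine; a production version would compare at the byte level rather than decoding back to a string:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def is_valid_candidate(text: str, pos: int, token_bytes: bytes) -> bool:
    head = text[:pos]                              # text before the matched suffix
    patched = head.encode("utf-8") + token_bytes   # e.g. b"in" + b"division"
    # Expected path: tokens of the head followed by the candidate token itself.
    expected = enc.encode(head) + [enc.encode_single_token(token_bytes)]
    # Actual path: what the tokenizer produces when it re-encodes the patched text.
    actual = enc.encode(patched.decode("utf-8", errors="ignore"))
    return actual == expected

print(is_valid_candidate("indivi", 2, b"division"))   # expected False, per the walkthrough
print(is_valid_candidate("indivi", 0, b"individual")) # expected True for this example
```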
Example Results

Let's see the algorithm in action with some of our test cases:

Results for the dataset:

Testing: I bought some apple
last word apple
Found 1424 matching tokens in 1.25 ms
Analysis completed in 19.13 ms
Combinations:
Pos 19: I bought some appl -> [b'en', b'ers', b'ere']... 474 possible tokens
Pos 18: I bought some app -> [b'led', b'lem', b'let']... 101 possible tokens
Pos 17: I bought some ap -> [b'pler']... 1 possible tokens
Pos 14: I bought some -> [b' apple', b' apples']... 2 possible tokens

Testing: https:
last word :
Found 324 matching tokens in 0.36 ms
Analysis completed in 2.42 ms
Combinations:
Pos 6: https -> [b':', b'::', b': ']... 324 possible tokens

Testing: userNa
last word userNa
Found 2323 matching tokens in 1.90 ms
Analysis completed in 21.15 ms
Combinations:
Pos 6: userN -> [b'an', b'ar', b'al']... 2170 possible tokens
Pos 5: user -> [b'Na', b'Nam', b'Nav']... 34 possible tokens
Pos 1:  -> [b'userName']... 1 possible tokens

Testing: We found a hidden causali
last word causali
Found 2106 matching tokens in 1.67 ms
Analysis completed in 21.13 ms
Combinations:
Pos 25: We found a hidden causal -> [b'it', b'io', b'id']... 1748 possible tokens
Pos 23: We found a hidden caus -> [b'ali', b'alin', b'alia']... 17 possible tokens

Testing: He introduced an intermediar
last word intermediar
Found 922 matching tokens in 0.81 ms
Analysis completed in 11.40 ms
Combinations:
Pos 28: He introduced an intermedia -> [b'res', b'ress', b'resp']... 52 possible tokens
Pos 27: He introduced an intermedi -> [b'ar', b'art', b'are']... 215 possible tokens
Pos 17: He introduced an -> [b' intermediary']... 1 possible tokens

Testing: indivi
last word indivi
Found 2106 matching tokens in 1.37 ms
Analysis completed in 19.37 ms
Combinations:
Pos 6: indiv -> [b'in', b'it', b'is']... 1885 possible tokens
Pos 4: ind -> [b'ivi', b'ivil', b'ivid']... 16 possible tokens
Pos 1:  -> [b'individual']... 1 possible tokens
Performance Considerations

The implementation includes several optimizations:

  • Trie-based search: O(k) lookup time for a prefix of length k, instead of O(n·k) for a linear scan over a vocabulary of n tokens
  • Early pruning: Using the regex pattern to focus search on relevant portion of text
  • Iterative DFS: Using queue-based iteration to avoid stack overflow on deep trie traversals
  • Token validation: Filtering out invalid candidates early to reduce downstream processing
Conclusion

I am sure there are many other ways to solve this problem. If you have alternative approaches, optimizations, or spot any issues in my implementation, please feel free to open an issue or submit a pull request on GitHub. I welcome contributions and feedback from the community to make this solution even better.

The code is available here.