I recently came across a problem published on Cursor's blog, called Character Prefix Conditioning (CPC). It opens with the following context:
When using a language model for code completion, we typically want the model to produce a
completion that begins with what the user has typed. However, modern language models operate
on sequences of tokens, not characters, so naively tokenizing the user's input and sending it
to the model produces wrong results if the user's cursor doesn't happen to lie on a token
boundary.
To understand this, let's consider the following completion example:
Autocomplete example
The company hired an intermediary to negotiate.
Here, the user has typed "The company hired an intermed" and the model is supposed to suggest a completion
"iary to negotiate." There's a problem here. Let's look at the token representation of the user's
current state and the actual sentence to get a better understanding of the problem:
User's current state:
(interactive token visualization)
Actual sentence:
(interactive token visualization)
If we were to naively tokenize the user's current state and send it to the model, we would have
to sample the following tokens to reach the actual sentence:
(interactive token visualization)
This would hurt the performance of the model as the sequence of these tokens does not match the
actual sentence's token sequence.
You can already tell that a simple approach won't solve this. Even token healing won't fully help here: if we backtrack by one token and generate from there, we would start sampling
from "The company hired an inter", which won't get us to the token
sequence of the actual sentence; that requires sampling from "The company hired an".
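We can reproduce this mismatch directly with tiktoken; the snippet below prints the token boundaries rather than assuming them:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

typed = "The company hired an intermed"
actual = "The company hired an intermediary to negotiate."

# Print each token's byte string to see where the boundaries fall.
print([enc.decode_single_token_bytes(t) for t in enc.encode(typed)])
print([enc.decode_single_token_bytes(t) for t in enc.encode(actual)])
# The typed prefix ends in tokens that never appear in the full
# sentence's token sequence, which is exactly the problem above.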
Tokenization
I am going to solve this problem for the Byte-Pair Encoding (BPE) tokenization algorithm, specifically
the cl100k_base tokenizer used by GPT-4.
The Byte-Pair Encoding tokenizer works by starting with a vocabulary of individual characters and
then iteratively merging the most frequent pair of adjacent tokens to form a new token. This process
continues until a desired vocabulary size is reached or no frequent pairs remain.
BPE Training and Inference using a simple example
I highly recommend reading this section if you are unfamiliar with BPE, and skipping it if you are
already comfortable with it.
Let's look at a simple example of how BPE creates merge rules using the sentence "The
programmer programs programming programs":
First, we count all character frequencies in our corpus. Starting from a vocabulary of individual
characters, we iteratively merge the most frequent adjacent pair, learning the rules 'a'+'m',
'r'+'o', 'p'+'ro', 'g'+'r', and 'gr'+'am' in that order. If we were to continue, we might merge
'pro'+'gram' next, but I'll stop here to keep this example manageable. From start to finish, the
vocabulary grows from the individual characters to also include the merged tokens 'am', 'ro',
'pro', 'gr', and 'gram'.
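Here is a minimal sketch of that training loop in Python. It is simplified relative to a real tokenizer (which works on bytes and pre-split text), and tie-breaking between equally frequent pairs is arbitrary here, so the learned order may differ slightly from the walkthrough:

from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge_pair(words, pair):
    """Rewrite every word, replacing occurrences of `pair` with the merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus: each word as a tuple of single characters, with its frequency.
corpus = "The programmer programs programming programs".split()
words = Counter(tuple(w) for w in corpus)

merges = []
for _ in range(5):  # learn five merge rules, mirroring the example above
    pair = get_pair_counts(words).most_common(1)[0][0]
    merges.append(pair)
    words = merge_pair(words, pair)

print(merges)  # the learned merge rules, in training order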
Let's see how a string gets tokenized step by step using our vocabulary and merges. We'll
use a smaller example: "programmer".
Step 1: Initial character-by-character tokenization: p | r | o | g | r | a | m | m | e | r
Step 2: Apply the 'a'+'m' merge rule: p | r | o | g | r | am | m | e | r
Step 3: Apply the 'r'+'o' merge rule: p | ro | g | r | am | m | e | r
Step 4: Apply the 'p'+'ro' merge rule: pro | g | r | am | m | e | r
Step 5: Apply the 'g'+'r' merge rule: pro | gr | am | m | e | r
Step 6: Final tokenization with the 'gr'+'am' merge rule: pro | gram | m | e | r
The important thing to understand here is that when tokenizing a new string, we apply our merge
rules in the exact order they were learned during training. Unlike the training phase where we
repeatedly count frequencies, during inference we simply apply the rules in sequence.
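A sketch of that inference procedure, using the five merge rules learned in the walkthrough above:

def bpe_tokenize(word, merges):
    """Tokenize by applying merge rules in the order they were learned."""
    symbols = list(word)
    for a, b in merges:  # training order, not frequency order
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    return symbols

# Merge rules from the walkthrough, in training order.
merges = [("a", "m"), ("r", "o"), ("p", "ro"), ("g", "r"), ("gr", "am")]
print(bpe_tokenize("programmer", merges))  # ['pro', 'gram', 'm', 'e', 'r']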
This is a simplified example. In practice, the cl100k_base tokenizer used by GPT-4 has a much
larger vocabulary (100,000+ tokens) and more sophisticated merge rules. The GPT-2 paper highlights one of the additional measures taken to improve the tokenizer:
We observed BPE including many versions of common words like dog since they occur in many
variations such as dog. dog! dog? . This results in a sub-optimal allocation of limited
vocabulary slots and model capacity. To avoid this, we prevent BPE from merging across
character categories for any byte sequence.
They enforce this by using a regex pattern that identifies boundaries across which
merges shouldn't occur.
I am going to use this regex to restrict the search and validation to a subset of the overall user input.
The same type of regex can be created for other tokenizers such as SentencePiece or
WordPiece.
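For reference, here is one published version of the cl100k_base pre-tokenization pattern from the tiktoken source (newer releases ship an equivalent, optimized variant); it needs the third-party regex module, since Python's built-in re lacks \p{...} classes:

import regex  # third-party module; supports \p{...} Unicode categories

PAT = (
    r"""(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}"""
    r"""| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"""
)

text = "He introduced an intermediar"
words = regex.findall(PAT, text)
print(words)      # ['He', ' introduced', ' an', ' intermediar']
print(words[-1])  # ' intermediar': the last word we will search over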
Language Model Performance Considerations
When implementing CPC with self-hosted language models, there are two main approaches to
handling multiple completion candidates, with significant performance implications:
Sequential autoregressive generation for each candidate path
Batched processing of multiple candidates together
Why Batching is More Efficient
When running a language model, loading the model weights and setting up the KV cache for
autoregressive generation carry significant overhead. With sequential processing, this
overhead is incurred once per candidate path.
Batching Advantage
By batching multiple candidates together, we only need to load the model and set up
autoregressive generation once. The GPU can then process all candidates in parallel,
amortizing the initialization costs across the batch.
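As a sketch of the batched approach (using Hugging Face transformers with gpt2 purely for illustration; the candidate strings are hypothetical), we can score several candidate paths in a single padded forward pass:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token  # gpt2 defines no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

# Hypothetical candidate paths: alternative tokenizations of the user's input.
candidates = [
    "The company hired an intermediary",
    "The company hired an intermediate",
]

# One padded batch, one forward pass: all candidates scored in parallel.
batch = tok(candidates, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**batch).logits

# Log-probability the model assigns to each sequence's own tokens.
logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
targets = batch.input_ids[:, 1:]
token_lp = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
mask = batch.attention_mask[:, 1:].float()
scores = (token_lp * mask).sum(dim=-1)
for cand, score in zip(candidates, scores):
    print(f"{score.item():8.2f}  {cand}")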
Dataset preparation
Most of the time spent on this problem went into preparing the dataset. I tried to think of and find
as many different cases as possible.
Here are some of the cases I considered, ordered by increasing complexity:
Basic word completion
When the user types a word that is already a valid token but longer tokens are possible:
I bought some apples
(interactive token visualizations of the partial input and the completed text)
Token Healing
"Token Healing", which automatically backs up the generation process by one token before the
end of the prompt, then constrains the first token generated to have a prefix that matches
the last token in the prompt:
https://
(interactive token visualizations of the partial input and the completed text)
Multiple valid completions
When the same prefix could lead to different valid words:
userName
(interactive token visualizations of the partial input and the completed text)
Or:
userNames
(interactive token visualizations of the partial input and the completed text)
Hidden subword patterns
When the word itself is made up of multiple tokens:
We found a hidden causality
(interactive token visualizations of the partial input and the completed text)
Token boundary misalignment
When the cursor falls in the middle of what should be a single token:
He introduced an intermediary
(interactive token visualizations of the partial input and the completed text)
Complex tokenization patterns
When the partial word tokenizes into a different pattern than the complete word, so the token
boundary shifts in the actual text:
indivisible
(interactive token visualizations of the partial input and the completed text)
Or:
individual
(interactive token visualizations of the partial input and the completed text)
Key observations we can make from the dataset:
1. We have to implement some form of search over the token space to find possible generation paths
2. We cannot assume the longest matching token is the right one. The actual token depends heavily on the context of the user's input and may not be the longest one
3. The search might have to be initiated at multiple points in the user's input, so we need to be efficient about it
4. We cannot base this search on the existing token boundaries of the user's input, as the boundaries might be different in the final sentence
5. We need to prune the search space to make it manageable while still covering all the cases
Implementation
If you've been able to follow along so far, the solution would seem obvious.
High-Level Approach
The strategy involves four main steps:
Create a trie data structure from the tokenizer vocabulary, with the ability to take a string
and return all tokens that have that string as a prefix
For a given user input, take the last word as determined by the regex pattern and scan it
from right to left, accumulating progressively longer suffixes; for each such suffix, find
the possible token completions that have it as a prefix
Filter token candidates to ensure they maintain valid token boundaries
Return a list of possible completion paths for the model to consider
Trie Implementation
At the core of our solution is a trie (prefix tree) data structure optimized for byte-level searching. Each node in the trie represents a byte, its children represent the possible next bytes, and any
node that completes a token also stores that token's original byte string.
So given a prefix, we can traverse the trie and collect all the tokens that have that prefix.
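Here's a minimal sketch of such a byte-level trie in Python. The collection step already uses the iterative DFS mentioned under performance considerations below, and tiktoken's token_byte_values() exposes the vocabulary as byte strings:

import tiktoken

class TrieNode:
    __slots__ = ("children", "token")
    def __init__(self):
        self.children = {}  # next byte value (an int) -> TrieNode
        self.token = None   # token byte string ending at this node, if any

class TokenTrie:
    def __init__(self, vocab):
        """Build the trie from an iterable of token byte strings."""
        self.root = TrieNode()
        for token in vocab:
            node = self.root
            for byte in token:
                node = node.children.setdefault(byte, TrieNode())
            node.token = token

    def tokens_with_prefix(self, prefix):
        """Return every token that starts with `prefix` (a byte string)."""
        node = self.root
        for byte in prefix:
            node = node.children.get(byte)
            if node is None:
                return []
        # Iterative DFS with an explicit stack avoids recursion limits.
        out, stack = [], [node]
        while stack:
            node = stack.pop()
            if node.token is not None:
                out.append(node.token)
            stack.extend(node.children.values())
        return out

enc = tiktoken.get_encoding("cl100k_base")
trie = TokenTrie(enc.token_byte_values())
print(len(trie.tokens_with_prefix(b" appl")))  # tokens extending " appl"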
Token Matching Algorithm
We first take the user's input and split it into words using the regex pattern from the tokenizer library.
For the dataset above, the last words would be:
Last words for each item of the dataset, based on the regex pattern (a leading underscore marks a leading space):
_apple, :, userNa, _causali, _intermediar, _indivi
Now, we search the trie using each of these last words. For example, for "_indivi" we would
initiate searches for the following prefixes, built from its suffixes (shortest first):
i, vi, ivi, divi, ndivi, indivi, _indivi
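Generating these search prefixes is just enumerating the suffixes of the last word, shortest first; reusing the TokenTrie sketch from above:

def search_prefixes(last_word):
    """All suffixes of the last word (a byte string), shortest first."""
    return [last_word[i:] for i in range(len(last_word) - 1, -1, -1)]

for prefix in search_prefixes(" indivi".encode("utf-8")):
    print(prefix, len(trie.tokens_with_prefix(prefix)))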
Token Boundary Validation
The above search yields many potential candidates, forming a superset of the combinations we want
to explore. We need an effective pruning strategy to filter these combinations.
Token Validation Strategy
For each candidate token, we:
• Replace the matched suffix of the last word with the candidate token
• Re-encode the resulting string
• Compare it with the expected token sequence
• Keep the candidate only if the sequences match exactly
Let's walk through an example using the prefix "divi" from our search candidates above. One of the
candidate tokens for "divi" is "division"; substituting it for the matched suffix turns "indivi"
into "in" + "division" = "indivision", and re-encoding that string does not yield the tokens of
"in" followed by the token "division".
Since the re-encoded tokens don't match our expected sequence, we discard "division" as a
candidate. By applying this validation process across all candidate tokens, we can effectively
filter out invalid completions while preserving the ones that maintain proper token boundaries.
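A sketch of that check with tiktoken, where encode_single_token maps a token's bytes back to its id. The two calls at the bottom mirror the "division" walkthrough above and the "individual" result shown below, though the actual outcomes depend on the cl100k_base vocabulary:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def is_valid_candidate(prefix, token_bytes):
    """Keep `token_bytes` only if re-encoding preserves the token boundary.

    `prefix` is the text before the matched suffix. The candidate replaces
    the suffix, and the combined string must re-encode to exactly the
    prefix's tokens followed by the candidate token itself.
    """
    expected = enc.encode(prefix) + [enc.encode_single_token(token_bytes)]
    return enc.encode(prefix + token_bytes.decode("utf-8")) == expected

# "indivi" with the suffix "divi" matched: the fixed prefix is "in".
print(is_valid_candidate("in", b"division"))   # expected False: boundary breaks
print(is_valid_candidate("", b"individual"))   # expected True: survives re-encoding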
Example Results
Let's see the algorithm in action with some of our test cases:
Results for the dataset:
Testing: I bought some apple
last word apple
Found 1424 matching tokens in 1.25 ms
Analysis completed in 19.13 ms
Combinations:
Pos 19: I bought some appl -> [b'en', b'ers', b'ere']... 474 possible tokens
Pos 18: I bought some app -> [b'led', b'lem', b'let']... 101 possible tokens
Pos 17: I bought some ap -> [b'pler']... 1 possible tokens
Pos 14: I bought some -> [b' apple', b' apples']... 2 possible tokens
Testing: https:
last word :
Found 324 matching tokens in 0.36 ms
Analysis completed in 2.42 ms
Combinations:
Pos 6: https -> [b':', b'::', b':\n']... 324 possible tokens
Testing: userNa
last word userNa
Found 2323 matching tokens in 1.90 ms
Analysis completed in 21.15 ms
Combinations:
Pos 6: userN -> [b'an', b'ar', b'al']... 2170 possible tokens
Pos 5: user -> [b'Na', b'Nam', b'Nav']... 34 possible tokens
Pos 1: -> [b'userName']... 1 possible tokens
Testing: We found a hidden causali
last word causali
Found 2106 matching tokens in 1.67 ms
Analysis completed in 21.13 ms
Combinations:
Pos 25: We found a hidden causal -> [b'it', b'io', b'id']... 1748 possible tokens
Pos 23: We found a hidden caus -> [b'ali', b'alin', b'alia']... 17 possible tokens
Testing: He introduced an intermediar
last word intermediar
Found 922 matching tokens in 0.81 ms
Analysis completed in 11.40 ms
Combinations:
Pos 28: He introduced an intermedia -> [b'res', b'ress', b'resp']... 52 possible tokens
Pos 27: He introduced an intermedi -> [b'ar', b'art', b'are']... 215 possible tokens
Pos 17: He introduced an -> [b' intermediary']... 1 possible tokens
Testing: indivi
last word indivi
Found 2106 matching tokens in 1.37 ms
Analysis completed in 19.37 ms
Combinations:
Pos 6: indiv -> [b'in', b'it', b'is']... 1885 possible tokens
Pos 4: ind -> [b'ivi', b'ivil', b'ivid']... 16 possible tokens
Pos 1: -> [b'individual']... 1 possible tokens
Performance Considerations
The implementation includes several optimizations:
Trie-based search: O(k) lookup time for a prefix of length k, instead of an O(n·k) linear scan over a vocabulary of n tokens
Early pruning: Using the regex pattern to focus the search on the relevant portion of the text
Iterative DFS: Using an explicit queue/stack instead of recursion to avoid stack overflow on deep trie traversals
Token validation: Filtering out invalid candidates early to reduce downstream processing
Conclusion
I am sure there are many other ways to solve this problem. If you have alternative approaches
or optimizations, or spot any issues in my implementation, please feel free to open an issue or
submit a pull request on GitHub. I welcome contributions and feedback from the community to
make this solution even better.