I recently came across a problem published on Cursor's blog, called Character Prefix Conditioning (CPC). It opens with the following context:
When using a language model for code completion, we typically want the model to produce a
completion that begins with what the user has typed. However, modern language models operate
on sequences of tokens, not characters, so naively tokenizing the user's input and sending it
to the model produces wrong results if the user's cursor doesn't happen to lie on a token
boundary.
To understand this, let's consider the following completion example:
Autocomplete example
The company hired an intermediary to negotiate.
Here, the user has typed "The company hired an intermed" and the model is supposed to suggest a completion
"iary to negotiate." There's a problem here. Let's look at the token representation of the user's
current state and the actual sentence to get a better understanding of the problem:
User's current state:
(interactive token visualization)
Actual sentence:
(interactive token visualization)
If we were to naively tokenize the user's current state and send it to the model, we would have
to sample the following tokens to reach the actual sentence:
(interactive token visualization)
This would hurt the performance of the model as the sequence of these tokens does not match the
actual sentence's token sequence.
You can already tell that a simple approach won't solve this. Even token healing won't fully help here: if we backtrack by one token and generate from there, we would start sampling
from "The company hired an inter", which won't get us to the token
sequence of the actual sentence; that requires sampling from "The company hired an".
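We can reproduce this mismatch directly with tiktoken; the snippet below prints the token boundaries rather than assuming them:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

typed = "The company hired an intermed"
actual = "The company hired an intermediary to negotiate."

# Print each token's byte string to see where the boundaries fall.
print([enc.decode_single_token_bytes(t) for t in enc.encode(typed)])
print([enc.decode_single_token_bytes(t) for t in enc.encode(actual)])
# The typed prefix ends in tokens that never appear in the full
# sentence's token sequence, which is exactly the problem above.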
Tokenization
I am going to solve this problem for the Byte-Pair Encoding (BPE) tokenization algorithm, specifically
the cl100k_base tokenizer used by GPT-4.
The Byte-Pair Encoding tokenizer works by starting with a vocabulary of individual characters and
then iteratively merging the most frequent pair of adjacent tokens to form a new token. This process
continues until a desired vocabulary size is reached or no frequent pairs remain.
BPE Training and Inference using a simple example
I highly recommend reading this section if you are unfamiliar with BPE, and skipping it if you are
already comfortable with it.
Let's look at a simple example of how BPE creates merge rules using the sentence "The
programmer programs programming programs":
First, we count all character frequencies in our corpus. Starting from a vocabulary of individual
characters, we iteratively merge the most frequent adjacent pair, learning the rules 'a'+'m',
'r'+'o', 'p'+'ro', 'g'+'r', and 'gr'+'am' in that order. If we were to continue, we might merge
'pro'+'gram' next, but I'll stop here to keep this example manageable. From start to finish, the
vocabulary grows from the individual characters to also include the merged tokens 'am', 'ro',
'pro', 'gr', and 'gram'.
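Here is a minimal sketch of that training loop in Python. It is simplified relative to a real tokenizer (which works on bytes and pre-split text), and tie-breaking between equally frequent pairs is arbitrary here, so the learned order may differ slightly from the walkthrough:

from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge_pair(words, pair):
    """Rewrite every word, replacing occurrences of `pair` with the merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus: each word as a tuple of single characters, with its frequency.
corpus = "The programmer programs programming programs".split()
words = Counter(tuple(w) for w in corpus)

merges = []
for _ in range(5):  # learn five merge rules, mirroring the example above
    pair = get_pair_counts(words).most_common(1)[0][0]
    merges.append(pair)
    words = merge_pair(words, pair)

print(merges)  # the learned merge rules, in training order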
Let's see how a string gets tokenized step by step using our vocabulary and merges. We'll
use a smaller example: "programmer".
Step 1: Initial character-by-character tokenization: p | r | o | g | r | a | m | m | e | r
Step 2: Apply the 'a'+'m' merge rule: p | r | o | g | r | am | m | e | r
Step 3: Apply the 'r'+'o' merge rule: p | ro | g | r | am | m | e | r
Step 4: Apply the 'p'+'ro' merge rule: pro | g | r | am | m | e | r
Step 5: Apply the 'g'+'r' merge rule: pro | gr | am | m | e | r
Step 6: Final tokenization with the 'gr'+'am' merge rule: pro | gram | m | e | r
The important thing to understand here is that when tokenizing a new string, we apply our merge
rules in the exact order they were learned during training. Unlike the training phase where we
repeatedly count frequencies, during inference we simply apply the rules in sequence.
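A sketch of that inference procedure, using the five merge rules learned in the walkthrough above:

def bpe_tokenize(word, merges):
    """Tokenize by applying merge rules in the order they were learned."""
    symbols = list(word)
    for a, b in merges:  # training order, not frequency order
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    return symbols

# Merge rules from the walkthrough, in training order.
merges = [("a", "m"), ("r", "o"), ("p", "ro"), ("g", "r"), ("gr", "am")]
print(bpe_tokenize("programmer", merges))  # ['pro', 'gram', 'm', 'e', 'r']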
This is a simplified example. In practice, the cl100k_base tokenizer used by GPT-4 has a much
larger vocabulary (100,000+ tokens) and more sophisticated merge rules. The GPT-2 paper highlights one of the additional measures taken to improve the tokenizer:
We observed BPE including many versions of common words like dog since they occur in many
variations such as dog. dog! dog? . This results in a sub-optimal allocation of limited
vocabulary slots and model capacity. To avoid this, we prevent BPE from merging across
character categories for any byte sequence.
They enforce this by using a regex pattern that identifies boundaries across which
merges shouldn't occur.
I am going to use this regex to restrict the search and validation to a subset of the overall user input.
The same type of regex can be created for other tokenizers such as SentencePiece or
WordPiece.
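For reference, here is one published version of the cl100k_base pre-tokenization pattern from the tiktoken source (newer releases ship an equivalent, optimized variant); it needs the third-party regex module, since Python's built-in re lacks \p{...} classes:

import regex  # third-party module; supports \p{...} Unicode categories

PAT = (
    r"""(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}"""
    r"""| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"""
)

text = "He introduced an intermediar"
words = regex.findall(PAT, text)
print(words)      # ['He', ' introduced', ' an', ' intermediar']
print(words[-1])  # ' intermediar': the last word we will search over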
Language Model Performance Considerations
When implementing CPC with self-hosted language models, there are two main approaches to
handling multiple completion candidates, with significant performance implications:
Sequential autoregressive generation for each candidate path
Batched processing of multiple candidates together
Why Batching is More Efficient
When running a language model, loading the model weights and setting up the KV cache for
autoregressive generation carry significant overhead. With sequential processing, this
overhead is incurred once per candidate path.
Batching Advantage
By batching multiple candidates together, we only need to load the model and set up
autoregressive generation once. The GPU can then process all candidates in parallel,
amortizing the initialization costs across the batch.
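As a sketch of the batched approach (using Hugging Face transformers with gpt2 purely for illustration; the candidate strings are hypothetical), we can score several candidate paths in a single padded forward pass:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token  # gpt2 defines no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

# Hypothetical candidate paths: alternative tokenizations of the user's input.
candidates = [
    "The company hired an intermediary",
    "The company hired an intermediate",
]

# One padded batch, one forward pass: all candidates scored in parallel.
batch = tok(candidates, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**batch).logits

# Log-probability the model assigns to each sequence's own tokens.
logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
targets = batch.input_ids[:, 1:]
token_lp = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
mask = batch.attention_mask[:, 1:].float()
scores = (token_lp * mask).sum(dim=-1)
for cand, score in zip(candidates, scores):
    print(f"{score.item():8.2f}  {cand}")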
Dataset preparation
Most of the time spent on this problem went into preparing the dataset. I tried to think of and find
as many different cases as possible.
Here are some of the cases I considered, ordered by increasing complexity:
Basic word completion
When the user types a word that is already a valid token but longer tokens are possible:
I bought some apples
(interactive token visualizations of the partial input and the completed text)
Token Healing
"Token Healing", which automatically backs up the generation process by one token before the
end of the prompt, then constrains the first token generated to have a prefix that matches
the last token in the prompt:
https://
(interactive token visualizations of the partial input and the completed text)
Multiple valid completions
When the same prefix could lead to different valid words:
userName
(interactive token visualizations of the partial input and the completed text)
Or:
userNames
(interactive token visualizations of the partial input and the completed text)
Hidden subword patterns
When the word itself is made up of multiple tokens:
We found a hidden causality
(interactive token visualizations of the partial input and the completed text)
Token boundary misalignment
When the cursor falls in the middle of what should be a single token:
He introduced an intermediary
(interactive token visualizations of the partial input and the completed text)
Complex tokenization patterns
When the partial word tokenizes into a different pattern than the complete word, so the token
boundary shifts in the actual text:
indivisible
(interactive token visualizations of the partial input and the completed text)
Or:
individual
(interactive token visualizations of the partial input and the completed text)
Key observations we can make from the dataset:
1. We have to implement some form of search over the token space to find possible generation paths
2. We cannot assume the longest matching token is the right one. The actual token depends heavily on the context of the user's input and may not be the longest one
3. The search might have to be initiated at multiple points in the user's input, so we need to be efficient about it
4. We cannot base this search on the existing token boundaries of the user's input, as the boundaries might be different in the final sentence
5. We need to prune the search space to make it manageable while still covering all the cases
Implementation
If you've been able to follow along so far, the solution would seem obvious.
High-Level Approach
The strategy involves four main steps:
Create a trie data structure from the tokenizer vocabulary, with the ability to take a string
and return all tokens that have that string as a prefix
For a given user input, take the last word as determined by the regex pattern and scan it
from right to left, accumulating progressively longer suffixes; for each such suffix, find
the possible token completions that have it as a prefix
Filter token candidates to ensure they maintain valid token boundaries
Return a list of possible completion paths for the model to consider
Trie Implementation
At the core of our solution is a trie (prefix tree) data structure optimized for byte-level searching. Each node in the trie represents a byte, its children represent the possible next bytes, and any
node that completes a token also stores that token's original byte string.
So given a prefix, we can traverse the trie and collect all the tokens that have that prefix.
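Here's a minimal sketch of such a byte-level trie in Python. The collection step already uses the iterative DFS mentioned under performance considerations below, and tiktoken's token_byte_values() exposes the vocabulary as byte strings:

import tiktoken

class TrieNode:
    __slots__ = ("children", "token")
    def __init__(self):
        self.children = {}  # next byte value (an int) -> TrieNode
        self.token = None   # token byte string ending at this node, if any

class TokenTrie:
    def __init__(self, vocab):
        """Build the trie from an iterable of token byte strings."""
        self.root = TrieNode()
        for token in vocab:
            node = self.root
            for byte in token:
                node = node.children.setdefault(byte, TrieNode())
            node.token = token

    def tokens_with_prefix(self, prefix):
        """Return every token that starts with `prefix` (a byte string)."""
        node = self.root
        for byte in prefix:
            node = node.children.get(byte)
            if node is None:
                return []
        # Iterative DFS with an explicit stack avoids recursion limits.
        out, stack = [], [node]
        while stack:
            node = stack.pop()
            if node.token is not None:
                out.append(node.token)
            stack.extend(node.children.values())
        return out

enc = tiktoken.get_encoding("cl100k_base")
trie = TokenTrie(enc.token_byte_values())
print(len(trie.tokens_with_prefix(b" appl")))  # tokens extending " appl"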
Token Matching Algorithm
We first take the user's input and split it into words using the regex pattern from the tokenizer library.
For the dataset above, the last words would be:
Last words for each item of the dataset, based on the regex pattern (a leading underscore marks a leading space):
_apple, :, userNa, _causali, _intermediar, _indivi
Now, we search the trie using each of these last words. For example, for "_indivi" we would
initiate searches for the following prefixes, built from its suffixes (shortest first):
i, vi, ivi, divi, ndivi, indivi, _indivi
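Generating these search prefixes is just enumerating the suffixes of the last word, shortest first; reusing the TokenTrie sketch from above:

def search_prefixes(last_word):
    """All suffixes of the last word (a byte string), shortest first."""
    return [last_word[i:] for i in range(len(last_word) - 1, -1, -1)]

for prefix in search_prefixes(" indivi".encode("utf-8")):
    print(prefix, len(trie.tokens_with_prefix(prefix)))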
Token Boundary Validation
The above search yields many potential candidates, forming a superset of the combinations we want
to explore. We need an effective pruning strategy to filter these combinations.
Token Validation Strategy
For each candidate token, we:
• Replace the matched suffix of the last word with the candidate token
• Re-encode the resulting string
• Compare it with the expected token sequence
• Keep the candidate only if the sequences match exactly
Let's walk through an example using the prefix "divi" from our search candidates above. One of the
candidate tokens for "divi" is "division"; substituting it for the matched suffix turns "indivi"
into "in" + "division" = "indivision", and re-encoding that string does not yield the tokens of
"in" followed by the token "division".
Since the re-encoded tokens don't match our expected sequence, we discard "division" as a
candidate. By applying this validation process across all candidate tokens, we can effectively
filter out invalid completions while preserving the ones that maintain proper token boundaries.
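A sketch of that check with tiktoken, where encode_single_token maps a token's bytes back to its id. The two calls at the bottom mirror the "division" walkthrough above and the "individual" result shown below, though the actual outcomes depend on the cl100k_base vocabulary:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def is_valid_candidate(prefix, token_bytes):
    """Keep `token_bytes` only if re-encoding preserves the token boundary.

    `prefix` is the text before the matched suffix. The candidate replaces
    the suffix, and the combined string must re-encode to exactly the
    prefix's tokens followed by the candidate token itself.
    """
    expected = enc.encode(prefix) + [enc.encode_single_token(token_bytes)]
    return enc.encode(prefix + token_bytes.decode("utf-8")) == expected

# "indivi" with the suffix "divi" matched: the fixed prefix is "in".
print(is_valid_candidate("in", b"division"))   # expected False: boundary breaks
print(is_valid_candidate("", b"individual"))   # expected True: survives re-encoding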
Example Results
Let's see the algorithm in action with some of our test cases:
Results for the dataset:
Testing: I bought some apple
last word apple
Found 1424 matching tokens in 1.25 ms
Analysis completed in 19.13 ms
Combinations:
Pos 19: I bought some appl -> [b'en', b'ers', b'ere']... 474 possible tokens
Pos 18: I bought some app -> [b'led', b'lem', b'let']... 101 possible tokens
Pos 17: I bought some ap -> [b'pler']... 1 possible tokens
Pos 14: I bought some -> [b' apple', b' apples']... 2 possible tokens
Testing: https:
last word :
Found 324 matching tokens in 0.36 ms
Analysis completed in 2.42 ms
Combinations:
Pos 6: https -> [b':', b'::', b':\n']... 324 possible tokens
Testing: userNa
last word userNa
Found 2323 matching tokens in 1.90 ms
Analysis completed in 21.15 ms
Combinations:
Pos 6: userN -> [b'an', b'ar', b'al']... 2170 possible tokens
Pos 5: user -> [b'Na', b'Nam', b'Nav']... 34 possible tokens
Pos 1: -> [b'userName']... 1 possible tokens
Testing: We found a hidden causali
last word causali
Found 2106 matching tokens in 1.67 ms
Analysis completed in 21.13 ms
Combinations:
Pos 25: We found a hidden causal -> [b'it', b'io', b'id']... 1748 possible tokens
Pos 23: We found a hidden caus -> [b'ali', b'alin', b'alia']... 17 possible tokens
Testing: He introduced an intermediar
last word intermediar
Found 922 matching tokens in 0.81 ms
Analysis completed in 11.40 ms
Combinations:
Pos 28: He introduced an intermedia -> [b'res', b'ress', b'resp']... 52 possible tokens
Pos 27: He introduced an intermedi -> [b'ar', b'art', b'are']... 215 possible tokens
Pos 17: He introduced an -> [b' intermediary']... 1 possible tokens
Testing: indivi
last word indivi
Found 2106 matching tokens in 1.37 ms
Analysis completed in 19.37 ms
Combinations:
Pos 6: indiv -> [b'in', b'it', b'is']... 1885 possible tokens
Pos 4: ind -> [b'ivi', b'ivil', b'ivid']... 16 possible tokens
Pos 1: -> [b'individual']... 1 possible tokens
Performance Considerations
The implementation includes several optimizations:
Trie-based search: O(k) lookup time for a prefix of length k, instead of an O(n·k) linear scan over a vocabulary of n tokens
Early pruning: Using the regex pattern to focus the search on the relevant portion of the text
Iterative DFS: Using an explicit queue/stack instead of recursion to avoid stack overflow on deep trie traversals
Token validation: Filtering out invalid candidates early to reduce downstream processing
Conclusion
I am sure there are many other ways to solve this problem. If you have alternative approaches
or optimizations, or spot any issues in my implementation, please feel free to open an issue or
submit a pull request on GitHub. I welcome contributions and feedback from the community to
make this solution even better.